PLDA in the i-supervector space for text-independent speaker verification
- Ye Jiang^{1},
- Kong Aik Lee^{2} and
- Longbiao Wang^{1}
https://doi.org/10.1186/s13636-014-0029-2
© Jiang et al.; licensee Springer 2014
Received: 26 November 2013
Accepted: 21 June 2014
Published: 15 July 2014
Abstract
In this paper, we advocate the use of the uncompressed form of the i-vector and rely on subspace modeling using probabilistic linear discriminant analysis (PLDA) to handle speaker and session (or channel) variability. An i-vector is a low-dimensional vector containing both speaker and channel information acquired from a speech segment. When PLDA is used on an i-vector, dimension reduction is performed twice: first in the i-vector extraction process and second in the PLDA model. Keeping the full dimensionality of the i-vector in the i-supervector space for PLDA modeling and scoring avoids unnecessary loss of information. We refer to the uncompressed i-vector as the i-supervector. The drawback of using the i-supervector with PLDA is the inversion of large matrices in the estimation of the full posterior distribution, which we show can be solved rather efficiently by partitioning large matrices into smaller blocks. We also introduce the Gaussianized rank-norm, as an alternative to whitening, for feature normalization prior to PLDA modeling. We found that the i-supervector performs better than the i-vector when no normalization is applied. A better performance is obtained by combining the i-supervector and i-vector at the score level. Furthermore, we analyze the computational complexity of the i-supervector system, compared with that of the i-vector, at four different stages: loading matrix estimation, posterior extraction, PLDA modeling, and PLDA scoring.
1 Introduction
Recent research in text-independent speaker verification has focused on compensating for the mismatch between training and test speech segments. Such mismatch is, for the most part, due to variations induced by the transmission channel. There are two fundamental approaches to tackling this problem. The first approach operates at the front-end via the exploration of discriminative information in speech in the form of features (e.g., voice source, spectro-temporal, prosodic, high-level) [1]-[6]. The second approach relies on the effective modeling of speaker characteristics in the classifier design (e.g., GMM-UBM, GMM-SVM, JFA, i-vector, PLDA) [4],[7]-[15]. In this paper, we focus on speaker modeling.
Over the past few years, many approaches based on the use of Gaussian mixture models (GMM) in a GMM universal background model (GMM-UBM) framework [7] have been proposed to improve the performance of speaker verification systems. The GMM-UBM is a generative model in which a speaker model is trained only on data from the same speaker. New criteria have since been developed that allow discriminative learning of generative models. The support vector machine (SVM) is acknowledged as one of the pre-eminent discriminative approaches [16]-[18], and it has been successfully combined with GMM, as in the GMM-SVM [8],[9],[19]-[21]. Nevertheless, approaches based on GMM-SVM are unable to cope well with channel effects [22],[23]. To compensate for the channel effects, it was shown using the joint factor analysis (JFA) technique that the speaker and channel variability can be confined to two disjoint subspaces in the parameter space of the GMM [12],[24]. The word ‘joint’ refers to the fact that not only the speaker but also the channel variability is treated in a single JFA model. However, it has been reported that the channel space obtained by the JFA does contain some residual speaker information [25].
Inspired by the JFA approach, it was shown in [13] that speaker and session variability can be represented by a single subspace referred to as the total variability space. The major motivation for defining such a subspace is to extract a low-dimensional identity vector (i.e., the so-called i-vector) from the feature sequence of a speech segment. The advantage of the i-vector is that it represents a speech segment as a fixed-length vector instead of a variable-length sequence of acoustic features. This greatly simplifies the modeling and scoring processes in speaker verification. For instance, we can assume that the i-vector is generated from a single Gaussian density [13] instead of the mixture of Gaussian densities usually required for acoustic features [7]. In this regard, linear discriminant analysis (LDA) [13],[26],[27], nuisance attribute projection (NAP) [8],[13],[28], within-class covariance normalization (WCCN) [13],[29],[30], probabilistic LDA (PLDA) [10],[31], and the heavy-tailed PLDA [32] have been shown to be effective for such fixed-length data. In this paper, we focus on PLDA with a Gaussian prior instead of a heavy-tailed prior. It was recently shown in [33] that the advantage of the heavy-tailed assumption diminishes when a simple length normalization is applied to the i-vector before PLDA modeling.
Because the total variability matrix is always a low-rank rectangular matrix, a dimension reduction process is also imposed by the i-vector extractor [12]. In this study, we advocate the use of the uncompressed form of the i-vector. Similar to that in [13], our extractor converts a speech sequence into a fixed-length vector but retains its dimensionality in the full supervector space. Modeling of speaker and session variability is then carried out using PLDA, which has been shown to be effective in handling high-dimensional data. By doing so, we avoid reducing the dimensionality of the i-vector twice: first in the extraction process and second in the PLDA model. Any dimension reduction procedure will unavoidably discard information. Our intention is therefore to keep the full dimensionality until the scoring stage with PLDA and to investigate the performance of PLDA in the i-supervector space. We refer to the uncompressed form of the i-vector as the i-supervector, or the identity supervector, following the nomenclature in [13],[29]. As in i-vector extraction, the i-supervector is computed as the posterior mean of a latent variable, but with a much higher dimensionality.
The downside of using the i-supervector with PLDA is that we have to deal with the inversion of large matrices. The size of the matrices becomes enormous when more sessions are available for each speaker in the development data^{a}. One option is to estimate the subspaces in a decoupled manner, which might lead to a suboptimal solution [12],[24]. In [34], we showed that the joint estimation of subspaces can be accomplished by partitioning large matrices into smaller blocks, thereby making the inversion and the joint estimation feasible. In this study, we present the same approach in more detail and with further refinement. We also look into various normalization methods and introduce the use of the Gaussianized rank-norm for the PLDA. In the experiments, we compare the performance of the i-vector and the i-supervector both without normalization and under various normalization conditions. A fusion system that combines the i-vector and i-supervector is presented as well. In addition, we provide an analysis of the computational complexity associated with the i-vector and i-supervector at four different stages: loading matrix estimation, i-vector and i-supervector extraction, PLDA model training, and verification score calculation.
The paper is organized as follows. In Section 2, we introduce the i-vector paradigm, which includes the formulation of the i-vector and i-supervector and its relationship to the classical maximum a posteriori (MAP). Section 3 introduces the probabilistic LDA, where we show that the inversion of a large matrix in PLDA can be solved by exploiting some inherent structure of the precision matrix. Section 4 deals with PLDA scoring and introduces the Gaussianized rank-norm. We present some experimental results in Section 5 and conclude the paper in Section 6.
2 I-vector paradigm
2.1 I-vector extraction
2.2 I-supervector extraction
One could deduce (9) from (7) and (8) by setting D^{T}D = τ^{−1}Σ and using the results in (5). The parameter τ is referred to as the relevance factor, which is set empirically in the range between 8 and 16 [7]. The i-supervector formulation differs from relevance MAP in two respects. First, the matrix D in (7) is trained from a dataset using the EM algorithm, in a manner similar to the matrix T for the i-vector, rather than being fixed by τ. Second, the i-supervector is taken as the posterior mean of the latent variable z, which is absent in the relevance MAP formulation.
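The reduction to relevance MAP can be checked numerically. The sketch below is illustrative only: it assumes the standard factor-analysis posterior mean for a diagonal loading matrix, E{z} = (I + D^{T}Σ^{−1}ND)^{−1}D^{T}Σ^{−1}f, where N holds the occupancy counts and f the centered first-order statistics. This form may differ in notation from (7) and (8), and the function names and toy statistics are hypothetical.

```python
import numpy as np

def relevance_map_offset(n, f_centered, tau):
    """Classical relevance MAP mean offset: alpha * (E - m), alpha = n / (n + tau)."""
    alpha = n / (n + tau)
    return alpha * (f_centered / n)

def diagonal_fa_offset(n, f_centered, sigma, d):
    """Offset D * E{z} for a diagonal loading matrix D (assumed posterior form)."""
    posterior_precision = 1.0 + d * d * n / sigma
    z_mean = (d * f_centered / sigma) / posterior_precision
    return d * z_mean

# Toy per-dimension statistics (hypothetical values, not from the paper).
rng = np.random.default_rng(0)
n = rng.uniform(1.0, 50.0, size=6)        # occupancy counts
sigma = rng.uniform(0.5, 2.0, size=6)     # diagonal UBM variances
f_centered = rng.normal(size=6) * n       # centered first-order statistics
tau = 12.0                                # relevance factor, within 8..16

d = np.sqrt(sigma / tau)                  # enforces D^T D = Sigma / tau
offset_fa = diagonal_fa_offset(n, f_centered, sigma, d)
offset_map = relevance_map_offset(n, f_centered, tau)
print(np.allclose(offset_fa, offset_map))
```

With D fixed in this way, the adapted mean depends on the data only through the counts and τ, which is why a trained D gives the i-supervector extractor more freedom than relevance MAP.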
2.3 From i-vector to i-supervector
The i-vector extraction is formulated in probabilistic terms based on a latent variable model as in (2), and similarly for the i-supervector in (6). One obvious benefit is that, in addition to obtaining the i-vector as the posterior mean ϕ_{x} of the latent variable x, we could also compute the posterior covariance (4), which quantifies the uncertainty of the estimate, and fold this information into subsequent modeling [37]. Nevertheless, any form of dimension reduction would unavoidably discard information. Following the same latent variable modeling paradigm, we propose the i-supervector as an uncompressed form of the i-vector representation.
Figure 1 compares the i-vector and i-supervector approaches from the extraction process to the subsequent PLDA modeling. Recall that C denotes the number of mixtures in the UBM, F is the size of the acoustic feature vectors, and D is the length of the i-vector, while the i-supervector has a much higher dimensionality of C·F. The biggest difference is that two rounds of dimension reduction occur in the i-vector PLDA system, whereas there is only one round of reduction in the i-supervector PLDA system. In this paper, our motivation is to keep the full dimensionality of the supervector as the input to the PLDA model, which has been shown to be an efficient model for high-dimensional data [10]. We envisage that more information would be preserved via the use of the i-supervector, which could be exploited with the use of PLDA.
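As a worked example of the dimensionality involved, taking the configuration reported in Section 5.1 (a UBM with C = 512 Gaussians and F = 54-dimensional feature vectors), the i-supervector has

$$C \cdot F = 512 \times 54 = 27{,}648$$

dimensions, all of which are passed to the PLDA model.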
3 PLDA modeling in i-supervector space
3.1 Probabilistic LDA
Comparing (1) and (13), we see that both the i-vector extraction process and the PLDA model involve dimension reduction via a similar form of subspace modeling. This observation motivates us to explore the use of PLDA on the i-supervector. The extraction process serves as the front-end which converts a variable-length sequence $\mathcal{O}$ to a fixed-length vector without reducing the dimension. Speaker modeling and channel compensation are then carried out in the original supervector space.
The downside of using i-supervector with PLDA is that we have to deal with large matrices as illustrated in the lower panel of Figure 1. The size of the matrices becomes enormous when more sessions are available for each speaker in the development data. This is typically the case for speaker recognition where the number of utterances per speaker is usually in the range from ten to over a hundred [38],[39]. In the following, we estimate the parameters $\{\boldsymbol{\mu}, \mathbf{F}, \mathbf{G}, \boldsymbol{\Sigma}\}$ of the PLDA model using the expectation maximization (EM) algorithm. We show how large matrices could be partitioned into sub-matrices, thereby making the matrix inversion and EM steps feasible.
3.2 E-step: joint estimation of posterior means
The matrix is large as we consider the joint inference of latent variables $\{\mathbf{h}_{i}, \mathbf{w}_{i1}, \dots, \mathbf{w}_{iJ}\}$ representing a speaker and all sessions from the same speaker. The size of the matrix increases with the number of sessions J, while more sessions are always desirable for more robust parameter estimation. Direct inversion of the matrix becomes intractable.
One interesting point to note from (23) is that the i-supervector ϕ_{ij} is first centered by subtracting the global mean μ and the speaker mean F⋅E{h_{i}} before being projected onto the session variability space.
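The effect of partitioning can be illustrated with a small numerical sketch. It assumes the standard Gaussian PLDA model ϕ_{ij} = μ + F h_{i} + G w_{ij} + ε_{ij} with standard normal priors on the latent variables and a diagonal Σ, and it uses the Woodbury identity so that only N_{F} × N_{F} and N_{G} × N_{G} systems are solved regardless of the number of sessions. This is not necessarily the exact partitioning derived in the paper; function and variable names are hypothetical.

```python
import numpy as np

def posterior_means_blockwise(phi, mu, F, G, sigma_diag):
    """E{h} and E{w_j} for one speaker; phi is (J, dim), sigma_diag is (dim,)."""
    n_sessions = phi.shape[0]
    n_f, n_g = F.shape[1], G.shape[1]
    centered = phi - mu                              # subtract the global mean
    Gt_P = G.T / sigma_diag                          # G^T Sigma^{-1}
    Ft_P = F.T / sigma_diag                          # F^T Sigma^{-1}
    Q = np.linalg.inv(np.eye(n_g) + Gt_P @ G)        # (I + G^T Sigma^{-1} G)^{-1}
    # F^T (G G^T + Sigma)^{-1} via Woodbury, without forming dim x dim matrices
    Ft_J = Ft_P - (Ft_P @ G) @ Q @ Gt_P
    lhs = np.eye(n_f) + n_sessions * (Ft_J @ F)
    h_mean = np.linalg.solve(lhs, Ft_J @ centered.sum(axis=0))
    # Session factors: remove the speaker mean F E{h}, then project
    w_means = (centered - F @ h_mean) @ Gt_P.T @ Q
    return h_mean, w_means

def posterior_means_direct(phi, mu, F, G, sigma_diag):
    """Reference: solve with the full (N_F + J*N_G)-dimensional joint precision."""
    n_sessions, dim = phi.shape
    n_f, n_g = F.shape[1], G.shape[1]
    A = np.zeros((n_sessions * dim, n_f + n_sessions * n_g))
    for j in range(n_sessions):
        rows = slice(j * dim, (j + 1) * dim)
        A[rows, :n_f] = F
        A[rows, n_f + j * n_g:n_f + (j + 1) * n_g] = G
    At_P = A.T / np.tile(sigma_diag, n_sessions)
    joint_precision = np.eye(A.shape[1]) + At_P @ A
    joint_mean = np.linalg.solve(joint_precision, At_P @ (phi - mu).ravel())
    return joint_mean[:n_f], joint_mean[n_f:].reshape(n_sessions, n_g)

# Toy problem (sizes are arbitrary, far smaller than a real i-supervector).
rng = np.random.default_rng(0)
dim, n_f, n_g, n_sessions = 20, 3, 4, 6
F = rng.normal(size=(dim, n_f))
G = rng.normal(size=(dim, n_g))
mu = rng.normal(size=dim)
sigma_diag = rng.uniform(0.5, 2.0, size=dim)
phi = rng.normal(size=(n_sessions, dim))

h_fast, w_fast = posterior_means_blockwise(phi, mu, F, G, sigma_diag)
h_ref, w_ref = posterior_means_direct(phi, mu, F, G, sigma_diag)
print(np.allclose(h_fast, h_ref), np.allclose(w_fast, w_ref))
```

The direct route solves a system whose size grows with J, whereas the blockwise route keeps the system sizes fixed; the session factors are obtained after subtracting both μ and F⋅E{h_{i}}, matching the observation above.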
3.3 M-step: model estimation
4 Likelihood ratio computation
4.1 Model comparison
4.2 PLDA verification score
One way to look at (39) is that it centers the vector ϕ_{l} and projects it onto the subspace F where speaker information co-varies the most (i.e., dimension reduction) while de-emphasizing the subspace pertaining to channel variability. In (38), K = log|M_{2}|/2 − log|M_{1}| is constant for the given set of parameters $\{\mathbf{F}, \mathbf{G}, \boldsymbol{\Sigma}\}$. Though K diminishes when score normalization is applied, we could calculate the two log-determinant terms easily by using the property of eigenvalue decomposition. In particular, we compute log|M_{2}| as $-\sum_{n=1}^{N_{F}} \log\left(2\lambda_{n}+1\right)$ and log|M_{1}| as $-\sum_{n=1}^{N_{F}} \log\left(\lambda_{n}+1\right)$, where {λ_{n}: n = 1, 2,…, N_{F}} are the eigenvalues of the matrix F^{T}JF (cf. (24)).
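The eigenvalue shortcut for the two log-determinants can be verified on a small example. The sketch assumes M_{1} = (F^{T}JF + I)^{−1} and M_{2} = (2F^{T}JF + I)^{−1}, which is consistent with the two sums above but is an assumption here, and it uses a random symmetric positive semi-definite matrix as a stand-in for F^{T}JF.

```python
import numpy as np

def logdets_from_eigenvalues(ftjf):
    """Return (log|M1|, log|M2|) from the eigenvalues of the symmetric matrix F^T J F."""
    lam = np.linalg.eigvalsh(ftjf)
    log_det_m1 = -np.sum(np.log(lam + 1.0))
    log_det_m2 = -np.sum(np.log(2.0 * lam + 1.0))
    return log_det_m1, log_det_m2

# Stand-in for F^T J F: any symmetric positive semi-definite matrix will do.
rng = np.random.default_rng(1)
B = rng.normal(size=(8, 5))
ftjf = B.T @ B
identity = np.eye(5)

log_det_m1, log_det_m2 = logdets_from_eigenvalues(ftjf)

# Reference values computed from the (assumed) definitions of M1 and M2.
ref_m1 = np.linalg.slogdet(np.linalg.inv(ftjf + identity))[1]
ref_m2 = np.linalg.slogdet(np.linalg.inv(2.0 * ftjf + identity))[1]

# Constant term K as written in the text.
K = log_det_m2 / 2.0 - log_det_m1
print(np.isclose(log_det_m1, ref_m1), np.isclose(log_det_m2, ref_m2))
```

Only one eigenvalue decomposition of an N_{F} × N_{F} matrix is needed, and the same eigenvalues serve both terms.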
4.3 I-supervector pre-conditioning
5 Experiment
5.1 Experimental setup
Experiments were carried out on the core task (short2-short3) of NIST SRE08 [42]. We use two well-known metrics in evaluating the performance, namely, equal error rate (EER) and minimum detection cost (MinDCF). Two gender-dependent UBMs consisting of 512 Gaussians were trained using data drawn from the SRE04. Speech parameters were represented by a 54-dimensional vector of mel frequency cepstral coefficients (MFCC) with first and second derivatives appended.
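For readers unfamiliar with the two metrics, the following is a minimal sketch of how EER and the minimum detection cost can be computed from lists of target and non-target scores. The cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) are assumed here to follow the SRE08 evaluation plan [42]; the function names are hypothetical, the cost is returned unnormalized, and the scaling used in the tables of this paper is not specified in this sketch.

```python
import numpy as np

def error_rates(target_scores, nontarget_scores):
    """Miss and false-alarm rates at every candidate threshold (accept if score >= t)."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    # Every observed score is a candidate threshold; +inf covers "reject everything".
    thresholds = np.append(np.sort(np.concatenate([tar, non])), np.inf)
    p_miss = np.array([(tar < t).mean() for t in thresholds])
    p_fa = np.array([(non >= t).mean() for t in thresholds])
    return p_miss, p_fa

def equal_error_rate(target_scores, nontarget_scores):
    """Error rate at the threshold where miss and false-alarm rates are closest."""
    p_miss, p_fa = error_rates(target_scores, nontarget_scores)
    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])

def min_detection_cost(target_scores, nontarget_scores,
                       p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Minimum over thresholds of the (unnormalized) detection cost function."""
    p_miss, p_fa = error_rates(target_scores, nontarget_scores)
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    return dcf.min()

# Tiny hypothetical score lists, for illustration only.
toy_targets = [0.0, 2.0]
toy_nontargets = [1.0, -1.0]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)
toy_dcf = min_detection_cost(toy_targets, toy_nontargets)
print(toy_eer, toy_dcf)
```

The threshold sweep is quadratic in the number of trials, which is fine for illustration; a production implementation would sort the scores once and accumulate counts.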
Table 1. Corpora used for training various components of the system
Switchboard | NIST SRE04 | NIST SRE05 | NIST SRE06 | |||
---|---|---|---|---|---|---|
Tel | Tel | Tel | Mic | Tel | Mic | |
UBM | X | |||||
T | X | X | X | |||
D | X | X | X | |||
PLDA model | X | X | X | X | ||
Whitening | X | X | X | X | X | X |
G-rank-norm | X | X | X | X | X | X |
s-norm | X | X | X | X |
5.2 Feature and score normalization
Table 2. Performance comparison of i-vector and i-supervector on NIST SRE08 core task with no normalization applied
Condition | Male EER (%) | Male MinDCF | Female EER (%) | Female MinDCF |
---|---|---|---|---|
i-vector | ||||
Det1 (raw) | 9.6696 | 4.3332 | 13.9834 | 5.4534 |
Det4 (raw) | 5.9883 | 2.7667 | 14.5646 | 5.8298 |
Det5 (raw) | 5.9601 | 2.4829 | 11.4183 | 3.9617 |
Det6 (raw) | 6.1785 | 3.1206 | 8.1486 | 3.7028 |
i-supervector | ||||
Det1 (raw) | 8.7329 | 3.9681 | 12.5471 | 5.3874 |
Det4 (raw) | 4.7950 | 2.4082 | 12.3123 | 5.7750 |
Det5 (raw) | 5.6250 | 2.3139 | 8.1731 | 3.8216 |
Det6 (raw) | 5.2632 | 2.6605 | 6.7976 | 3.3368 |
Table 3. Performance comparison of normalization methods on i-vector and i-supervector
Normalization | Male EER (%) | Male MinDCF | Female EER (%) | Female MinDCF |
---|---|---|---|---|
i-vector | ||||
raw | 6.1785 | 3.1206 | 8.1486 | 3.7028 |
len | 4.9411 | 2.6286 | 6.4409 | 3.0581 |
white + len | 4.5458 | 2.4546 | 6.3193 | 3.0065 |
white + len + snorm | 4.3478 | 2.2155 | 6.1530 | 3.0034 |
i-supervector | ||||
raw | 5.2632 | 2.6605 | 6.7976 | 3.3368 |
len | 4.9199 | 2.6271 | 6.3667 | 3.3624 |
grank + len | 4.8982 | 2.6676 | 6.0976 | 3.2588 |
grank + len + snorm | 4.5888 | 2.3737 | 6.2639 | 3.1132 |
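The "grank" and "len" steps in the table above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes the rank-norm of [41] (each dimension is replaced by its rank within a background set) followed by mapping the normalized rank through the inverse standard normal CDF, and then the length normalization of [33]. The function names and the rank-to-probability mapping are assumptions.

```python
import numpy as np
from statistics import NormalDist

def fit_rank_norm(background):
    """Sort each dimension of the background set (rows are vectors)."""
    return np.sort(np.asarray(background, dtype=float), axis=0)

def gaussianized_rank_norm(x, sorted_background):
    """Map each dimension of x to a standard-normal quantile of its background rank."""
    x = np.asarray(x, dtype=float)
    n_background = sorted_background.shape[0]
    # Rank = number of background values strictly below x in each dimension.
    ranks = np.array([np.searchsorted(sorted_background[:, d], x[d])
                      for d in range(x.size)])
    # Shift into the open interval (0, 1) so the inverse CDF stays finite.
    probabilities = (ranks + 0.5) / (n_background + 1.0)
    inverse_cdf = NormalDist().inv_cdf
    return np.array([inverse_cdf(p) for p in probabilities])

def length_norm(x):
    """Scale a vector to unit Euclidean length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

# Hypothetical background set with non-Gaussian (exponential) dimensions.
rng = np.random.default_rng(2)
background = rng.exponential(size=(1000, 4))
sorted_background = fit_rank_norm(background)

x = rng.exponential(size=4)
normalized = length_norm(gaussianized_rank_norm(x, sorted_background))
print(normalized)
```

Because the mapping depends only on ranks, it is insensitive to the scale and skew of each dimension, which is the property that makes it a candidate alternative to whitening.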
5.3 Channel factors in i-supervector space
Table 4. Performance of i-supervector PLDA system with different numbers of channel factors, N_{G}
Number of channel factors, N_{G} | Male EER (%) | Male MinDCF | Female EER (%) | Female MinDCF |
---|---|---|---|---|
0 | 15.9148 | 5.6324 | 19.1796 | 6.8120 |
10 | 8.3524 | 3.9297 | 10.2550 | 4.5087 |
20 | 6.1785 | 3.2124 | 8.9246 | 3.9208 |
30 | 5.6114 | 2.8483 | 7.6497 | 3.5515 |
40 | 5.1487 | 2.7037 | 7.6497 | 3.4161 |
50 | 4.8552 | 2.5573 | 6.9290 | 3.3179 |
100 | 4.6911 | 2.4875 | 6.5196 | 3.2814 |
150 | 4.5974 | 2.3983 | 6.4856 | 3.1239 |
200 | 4.5888 | 2.3737 | 6.2639 | 3.1132 |
250 | 5.0099 | 2.5092 | 6.4302 | 3.1420 |
300 | 4.7693 | 2.5843 | 6.2084 | 3.1114 |
5.3.1 Performance comparison
In this section, we compared the performance of i-supervector and i-vector under different train-test channel conditions. The PLDA models used for i-vector and i-supervector were the same as described in Section 5.2. In addition, we included microphone data (drawn from SRE05 and SRE06) for the whitening transform, Gaussianized rank-norm, and s-norm to better handle the interview (int) and microphone (mic) channel conditions.
Table 5. Performance comparison under various train-test channel conditions of NIST SRE08 short2-short3 core task
Condition | System | EER (%) | MinDCF |
---|---|---|---|
det1 (int-int) | i-vector | 7.2964 | 3.5189 |
det1 (int-int) | i-supervector | 7.8769 | 3.6724 |
det1 (int-int) | Fusion | 7.0711 | 3.4100 |
det4 (int-tel) | i-vector | 5.7919 | 2.7576 |
det4 (int-tel) | i-supervector | 5.9421 | 3.0541 |
det4 (int-tel) | Fusion | 5.2489 | 2.5892 |
det5 (tel-mic) | i-vector | 6.0462 | 2.0975 |
det5 (tel-mic) | i-supervector | 4.7554 | 2.2475 |
det5 (tel-mic) | Fusion | 4.4158 | 2.0260 |
det6 (tel-tel) | i-vector | 5.5602 | 2.7556 |
det6 (tel-tel) | i-supervector | 5.7740 | 2.8949 |
det6 (tel-tel) | Fusion | 5.3398 | 2.7537 |
5.4 Computation complexity comparison
Table 6. Comparison of computational complexity (total time/real time factor) at various stages of implementation
System | Loading matrix | Posterior extraction | PLDA modeling | PLDA scoring |
---|---|---|---|---|
i-vector | 60,300 s/4.8e−2 | 470 s/2.83e−3 | 47 s/3.74e−7 | 142 s/4.30e−4 |
i-supervector | 380 s/3.03e−6 | 24 s/1.45e−4 | 950 s/7.57e−6 | 425 s/1.29e−3 |
After training the total variability space, we extracted the i-vector and i-supervector for all utterances. Table 6 shows the time required for extracting the i-vectors and i-supervectors from the entire SRE04 dataset. The result shows that i-vector extraction consumes much more time than i-supervector extraction. PLDA models were then trained for i-vectors and i-supervectors drawn from Switchboard, SRE04, and SRE05. We can see that training a PLDA model on the i-vector takes much less time than on the i-supervector. Finally, we compared the computation requirement for PLDA scoring on the NIST SRE08 short2-short3 core task with 98,776 trials. It can be seen that i-supervector scoring took more time than i-vector scoring, mainly due to its comparatively high dimensionality. In summary, the i-supervector system requires less computation at the front-end while the i-vector system is faster at the back-end PLDA.
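The relative cost at each stage can be read off the total times in Table 6 with a few lines of arithmetic. The sketch below only restates the reported total times; the interpretation of the real-time factor as processing time divided by the duration of the processed audio is an assumption, and the variable and function names are hypothetical.

```python
# Total times in seconds from Table 6: (i-vector, i-supervector).
stage_seconds = {
    "loading matrix": (60300.0, 380.0),
    "posterior extraction": (470.0, 24.0),
    "PLDA modeling": (47.0, 950.0),
    "PLDA scoring": (142.0, 425.0),
}

def time_ratio(ivector_seconds, isupervector_seconds):
    """Ratio above 1 means the i-supervector system is faster at that stage."""
    return ivector_seconds / isupervector_seconds

def implied_audio_hours(total_seconds, real_time_factor):
    """Audio duration implied by a total time and a real-time factor (assumed definition)."""
    return total_seconds / real_time_factor / 3600.0

ratios = {stage: time_ratio(*times) for stage, times in stage_seconds.items()}
for stage, ratio in ratios.items():
    faster = "i-supervector" if ratio > 1.0 else "i-vector"
    print(f"{stage}: {faster} is faster by a factor of {max(ratio, 1.0 / ratio):.1f}")

# Example: audio duration implied by the i-vector posterior extraction entry.
extraction_hours = implied_audio_hours(470.0, 2.83e-3)
print(f"implied audio for posterior extraction: {extraction_hours:.1f} h")
```

The ratios make the trade-off explicit: the front-end stages favor the i-supervector, while the two PLDA stages favor the i-vector.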
6 Conclusions
We have introduced the use of the uncompressed form of the i-vector (i.e., the i-supervector) for PLDA-based speaker verification. Similar to the i-vector, an i-supervector represents a variable-length speech utterance as a fixed-length vector. Unlike the i-vector, however, the total variability space is kept at the same dimensionality as the original supervector space. To this end, we showed how manipulation of high-dimensional matrices can be done efficiently in training and scoring with the PLDA model. We also introduced the use of the Gaussianized rank-norm for feature normalization prior to PLDA modeling.
Compared to the i-vector, we found that the i-supervector performs better when no normalization (on both feature and score) is applied. This suggests that the Gaussian assumption imposed by PLDA becomes less stringent and easier to fulfill in the higher dimensional i-supervector space. However, the performance improvement given by the high dimensionality diminishes when full normalization is applied. As such, the current normalization strategy, though effective, has to be improved for better performance. This is a point for future work. We also showed that the fusion system can give competitive performance compared to either the i-vector or the i-supervector alone. Furthermore, we analyzed the computational complexity of the i-supervector system, compared to that of the i-vector, at four different stages, namely, loading matrix estimation, posterior extraction, PLDA modeling, and PLDA scoring. The results showed that the i-supervector system took much less time than the i-vector system for loading matrix estimation and posterior extraction.
Endnote
^{a}The number of sessions is usually limited in face recognition for which PLDA was originally proposed in [10].
Declarations
Acknowledgements
This work was partially supported by a research grant from the Tateisi Science and Technology Foundation.
References
- G Doddington, Speaker recognition based on idiolectal differences between speakers, in Proc. 7th European Conference on Speech Communication and Technology (Eurospeech) (Scandinavia, 2001), pp. 2521–2524
- CE Wilson, S Manocha, S Vishnubhotla, A new set of features for text-independent speaker identification, in Proc. Interspeech (Pittsburgh, PA, USA, 2006), pp. 1475–1478
- T Kinnunen, KA Lee, H Li, Dimension reduction of the modulation spectrogram for speaker verification, in The Speaker and Language Recognition Workshop (Stellenbosch, South Africa, 2008)
- Kinnunen T, Li HZ: An overview of text-independent speaker recognition: from features to supervectors. Speech Comm. 2010, 52(1):12-40. 10.1016/j.specom.2009.08.009
- L Wang, K Minami, K Yamamoto, S Nakagawa, Speaker identification by combining MFCC and phase information in noisy environments, in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Dallas, TX, USA, 2010), pp. 4502–4505
- Nakagawa S, Wang L, Ohtsuka S: Speaker identification and verification by combining MFCC and phase information. IEEE Trans. Audio Speech Lang. Process. 2012, 20(4):1085-1095. 10.1109/TASL.2011.2172422
- Reynolds DA, Quatieri TF, Dunn RB: Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 2000, 10(1):19-41. 10.1006/dspr.1999.0361
- WM Campbell, DE Sturim, DA Reynolds, A Solomonoff, SVM based speaker verification using a GMM supervector kernel and NAP variability compensation, in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (Philadelphia, USA, 2005), pp. 97–100
- Campbell WM, Sturim DE, Reynolds DA: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 2006, 13(5):308-311. 10.1109/LSP.2006.870086
- SJD Prince, JH Elder, Probabilistic linear discriminant analysis for inferences about identity, in Proc. International Conference on Computer Vision (Rio De Janeiro, Brazil, 2007), pp. 1–8
- Wang L, Kitaoka N, Nakagawa S: Robust distant speaker recognition based on position-dependent CMN by combining speaker-specific GMM with speaker-adapted HMM. Speech Comm. 2007, 9(6):501-513. 10.1016/j.specom.2007.04.004
- Kenny P, Ouellet P, Dehak N, Gupta V, Dumouchel P: A study of inter-speaker variability in speaker verification. IEEE Trans. Audio. Speech Lang. Process. 2008, 16(5):980-988. 10.1109/TASL.2008.925147
- Dehak N, Kenny P, Dehak R, Dumouchel P, Ouellet P: Front-end factor analysis for speaker verification. IEEE Trans. Audio. Speech Lang. Process. 2011, 19(4):788-798. 10.1109/TASL.2010.2064307
- Kua JMK, Epps J, Ambikairajah E: i-Vector with sparse representation classification for speaker verification. Speech Comm. 2013, 55(5):707-720. 10.1016/j.specom.2013.01.005
- Kelly F, Drygajlo A, Harte N: Speaker verification in score-ageing-quality classification space. Comput. Speech Lang. 2013, 27(5):1068-1084. 10.1016/j.csl.2012.12.005
- Wan V, Campbell WM: Support vector machines for speaker verification and identification. IEEE Workshop Neural Netw. Signal Process. 2000, 2: 77-784.
- Campbell WM, Campbell JP, Reynolds DA: Support vector machines for speaker and language recognition. Comp. Speech Lang. 2006, 20: 210-229. 10.1016/j.csl.2005.06.003
- Bishop C: Pattern Recognition and Machine Learning. Springer Science & Business Media, New York; 2006.
- KA Lee, C You, H Li, T Kinnunen, A GMM-based probabilistic sequence kernel for speaker recognition, in Proc. Interspeech (Antwerp, Belgium, 2007), pp. 294–297
- You CH, Lee KA, Li H: GMM-SVM kernel with a Bhattacharyya-based distance for speaker recognition. IEEE Trans. Audio Speech Lang. Process. 2010, 18(6):1300-1312. 10.1109/TASL.2009.2032950
- Dong X, Zhaohui W: Speaker recognition using continuous density support vector machines. Electron. Lett. 2001, 37(17):1099-1101. 10.1049/el:20010741
- Wan V, Renals S: Speaker verification using sequence discriminant support vector machines. IEEE Trans. Speech Audio Process. 2005, 13(2):203-210. 10.1109/TSA.2004.841042
- N Dehak, G Chollet, Support vector GMMs for speaker verification, in Proc. IEEE Odyssey: The Speaker and Language Recognition Workshop (San Juan, Puerto Rico, 2006)
- Kenny P, Boulianne G, Ouellet P, Dumouchel P: Speaker and session variability in GMM-Based speaker verification. IEEE Trans. Audio Speech Lang. Process. 2007, 15(4):1448-1460. 10.1109/TASL.2007.894527
- N Dehak, Discriminative and generative approaches for long- and short-term speaker characteristics modeling: application to speaker verification, in Ph.D. thesis (École de Technologie Supérieure, Université du Québec, 2009)
- A Kanagasundaram, D Dean, R Vogt, M McLaren, S Sridharan, M Mason, Weighted LDA techniques for i-vector based speaker verification, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (Kyoto, Japan, 2012), pp. 4781–4794
- Kanagasundaram A, Dean D, Sridharan S, McLaren M, Vogt R: I-vector based speaker recognition using advanced channel compensation techniques. Comput. Speech Lang. 2014, 28(1):121-140. 10.1016/j.csl.2013.04.002
- BGB Fauve, D Matrouf, N Scheffer, J-F Bonastre, JSD Mason, State-of-the-art performance in text-independent speaker verification through open-source software, in IEEE International Conference on Acoustics, Speech, and Signal Processing (Honolulu, USA, 2007), pp. 1960–1968
- M Senoussaoui, P Kenny, N Dehak, P Dumouchel, An i-vector extractor suitable for speaker recognition with both microphone and telephone speech, in Proc. Odyssey: The Speaker and Language Recognition Workshop (Brno, Czech, 2010)
- A Kanagasundaram, R Vogt, D Dean, S Sridharan, M Mason, I-vector based speaker recognition on short utterances, in Proc. Interspeech (Florence, 2011), pp. 2341–2344
- L Machlica, Z Zajic, An efficient implementation of probabilistic linear discriminant analysis, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (Vancouver, Canada, 2013), pp. 7678–7682
- P Kenny, Bayesian speaker verification with heavy-tailed priors, in Proc. Odyssey: Speaker and Language Recognition Workshop (Brno, Czech, 2010)
- D Garcia-Romero, CY Espy-Wilson, Analysis of i-vector length normalization in speaker recognition systems, in Proc. Interspeech (Florence, Italy, 2011), pp. 249–252
- Y Jiang, KA Lee, Z Tang, B Ma, A Larcher, H Li, PLDA modeling in i-vector and supervector space for speaker verification, in Proc. Interspeech (Portland, USA, 2012)
- Kenny P, Boulianne G, Dumouchel P: Eigenvoice modeling with sparse training data. IEEE Trans. Speech Audio Process. 2005, 13(3):345-354. 10.1109/TSA.2004.840940
- Gauvain J, Lee C-H: Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chain. IEEE Trans. Speech Audio Process. 1994, 2(2):291-298. 10.1109/89.279278
- P Kenny, T Stafylakis, P Ouellet, MJ Alam, P Dumouchel, PLDA for speaker verification with utterance of arbitrary duration, in Proc. IEEE ICASSP (Vancouver, Canada, 2013), pp. 7649–7653
- H Li, B Ma, KA Lee, CH You, H Sun, A Larcher, IIR system description for the NIST 2012 speaker recognition evaluation, in NIST SRE'12 Workshop (Orlando, 2012)
- R Saeidi, KA Lee, T Kinnunen, T Hasan, B Fauve, P-M Bousque, E Khoury, PL Sordo Martinez, JMK Kua, CH You, H Sun, A Larcher, P Rajan, V Hautamaki, C Hanilci, B Braithwaite, R Gonzales-Hautamki, SO Sadjadi, G Liu, H Boril, N Shokouhi, D Matrouf, L El Shafey, P Mowlaee, J Epps, T Thiruvaran, DA van Leeuwen, B Ma, H Li, JHL Hansen et al., I4U submission to NIST SRE2012: a large-scale collaborative effort for noise-robust speaker verification, in Proc. Interspeech (Lyon, France, 2013), pp. 1986–1990
- Murphy KP: Machine Learning-A Probabilistic Perspective. MIT Press, Massachusetts; 2012.
- A Stolcke, S Kajarekar, L Ferrer, Nonparametric feature normalization for SVM-based speaker verification, in Proc. ICASSP (Ohio, USA, 2008), pp. 1577–1580
- NIST, The NIST year 2008 speaker recognition evaluation plan, [http://www.itl.nist.gov/iad/mig/tests/sre/2008/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.