An improved i-vector extraction algorithm for speaker verification
- Wei Li^{1}Email author,
- Tianfan Fu^{2} and
- Jie Zhu^{1}
https://doi.org/10.1186/s13636-015-0061-x
© Li et al. 2015
Received: 9 July 2014
Accepted: 9 June 2015
Published: 27 June 2015
Abstract
Over recent years, i-vector-based framework has been proven to provide state-of-the-art performance in speaker verification. Each utterance is projected onto a total factor space and is represented by a low-dimensional feature vector. Channel compensation techniques are carried out in this low-dimensional feature space. Most of the compensation techniques take the sets of extracted i-vectors as input. By constructing between-class covariance and within-class covariance, we attempt to minimize the between-class variance mainly caused by channel effect and to maximize the variance between speakers. In the real-world application, enrollment and test data from each user (or speaker) are always scarce. Although it is widely thought that session variability is mostly caused by channel effects, phonetic variability, as a factor that causes session variability, is still a matter to be considered. We propose in this paper a new i-vector extraction algorithm from the total factor matrix which we term component reduction analysis (CRA). This new algorithm contributes to better modelling of session variability in the total factor space.
We reported results on the male English trials of the core condition of the NIST 2008 Speaker Recognition Evaluation (SREs) dataset. As measured both by equal error rate and the minimum values of the NIST detection cost function, 10–15 % relative improvement is achieved compared to the baseline of traditional i-vector-based system.
Keywords
1 Introduction
In the last decade, Gaussian mixture model based on universal background model (GMM-UBM) framework has demonstrated strong performance in speaker verification. It is commonly believed that the mean vectors of GMMs represent the most characteristics of speakers [1], which are obtained by using the maximum a posteriori (MAP) adaptation. Traditional MAP (or relevance MAP) treats each Gaussian component as a statistically independent distribution, which leaves many drawbacks in the real-world application: Only those components which are observed in the speaker frames are adapted; if training and testing sessions are too short or suffer from significant phonetic variability, the performance of verification may encounter an obvious degradation; on the other hand, traditional MAP does not have solution to the effects of channel distortion, especially when the enrollment and test sessions are from different channels.
Extended from GMM-UBM framework, factor analysis (FA) technique [2, 3] attempts to model the speaker components jointly. Each speaker is represented by the mean supervector which is a linear combination of the set of eigenvoices. Only a few hundreds of free parameters need to be estimated, which ensures that the speaker mean supervector converges quickly by a short duration of utterance. Based on FA technique, joint factor analysis (JFA) [4, 5] decomposes GMM supervector into speaker component S and channel component C, JFA makes the assumption that speaker component and channel component are statistically independent, although it is known by now that in this modelling, channel effects are not speaker-independent (for example, it has been proven that gender-dependent eigenchannel modelling is more effective than gender-independent modeling [6]), JFA has demonstrated superior performance for text-independent speaker detection tasks in the past NIST Speaker Recognition Evaluation (SREs).
Inspired by JFA approach, Dehak et al. [7] propose a combination of speaker space and channel space. A new low-dimensional space named total factor space is defined. In this new space, each utterance is represented by a low-dimensional feature vector termed i-vector. The idea of i-vector opens a new era to the analysis of speaker and session variability. Many compensation techniques and scoring methods have been proposed [6–9], which have shown better results than JFA approach. Recently, i-vector extraction with a preliminary length normalization and probabilistic linear discriminant analysis (PLDA) has become the state-of-the-art configuration for speaker verification [6, 9].
However, despite the success of the i-vector paradigm, its applicability in text-dependent speaker verification still remains questionable. In the case that the phonetic content of enrollment and test sessions is same, we may assert that only those components which are observed in the speaker frames need to be adapted, but FA-based techniques provide a global adaptation to all the Gaussian components including those unobserved Gaussian components, whose performance should be questionable compared with traditional MAP adaptation. Currently, related papers have also shown that the traditional MAP approach is superior to the i-vector-based ones [10–12].
Inspired by the drawback of i-vector in text-dependent speaker verification, we attempt to give an analysis that although FA-based i-vector approach offers better performance in modeling phonetic variability, a global adaptation to all the Gaussian components is not an optimal adaptation method. In this paper, we still focus on text-independent speaker verification, we propose a new i-vector extraction algorithm termed component reduction analysis (CRA). Compared with the traditional MAP adaptation that only adapts those observed Gaussian components, CRA approach abandons those Gaussian components which give the least contribution in modeling speaker frames, length of each basis of total factor matrix is truncated while the dimensions of the total factor space remain unchanged, extracted i-vectors also remain unchanged. No modification is needed for subsequent channel compensation and scoring methods. Experiments were carried out on the core condition of NIST 2008 SREs. Experimental results show that by applying CRA algorithm in the phase of i-vector extraction, a 10–15 % relative improvement is obtained compared with the baseline system adopting traditional i-vector algorithm and related channel compensation techniques.
This paper is organized as follow. Section 2 describes the construction of total factor matrix and the paradigm of i-vector extraction. Section 3 analyzes and verifies that phonetic variability and phonetic imbalance widely exist in speaker frames, and following this inference, we demonstrate that conventional i-vector extraction paradigm corresponding to a global adaptation of all Gaussian components is not an optimal adaptation method. Hence, an improved mechanism is introduced to compensate phonetic variability and phonetic imbalance. In Section 4, we propose our CRA algorithm applied in the i-vector extraction phase. We also propose a zero-order Baum-Welch statistics normalization approach to compensate the effects of CRA algorithm. Experiments and results are given in Section 5. Section 6 concludes the paper.
2 Total factor space and i-vector extraction
Proposed by N. Dehak, in the i-vector framework, no separate modeling of speaker and channel space is made, a single space named total factor space is constructed to model speaker and channel variability jointly. Each utterance is projected onto total factor space and is represented by a low-dimensional feature vector. The channel compensation techniques and scoring methods are carried out in this low-dimensional space, as opposed to the high-dimensional GMM supervector space for MAP adaptation and JFA.
where m is a speaker- and channel-independent supervector, whose value is often taken from UBM supervector. T is the total factor matrix with low rank, which expands a subspace containing speaker- and channel-dependent information. w is a vector with a prior of standard Gaussian distribution. Speaker frames of an utterance are represented by posterior estimation of w, the new feature vector w is named total factor, often referred to as identity vector or i-vector. The process of training the total factor matrix is detailedly explained in [2], which is similar as learning the eigenvoice matrix. In order to sufficiently estimate T-Matrix, large quantity of development corpus is necessary.
3 Phonetic variability analysis
where c=1,…,C is the Gaussian index and P(c|y _{ t },Ω) corresponds to the posterior probability of mixture component c generating the frame vector y _{ t }. (2) and (3) are named zero-order and first-order Baum-Welch statistics.
where M _{ c } denotes the cth component of M, m _{ c } denotes the cth component of UBM m, (1/L)F _{ c } is the normalized first-order Baum-Welch statistics, r is termed relevance factor which is an empirical value and has to be manually tuned. From Eq. (4), we can see that the posterior mean vectors of the cth Gaussian component M _{ c } is an interpolation between the mean of UBM Gaussian component m _{ c } and the normalized first-order Baum-Welch statistics (1/L)F _{ c }. N _{ c } can be regarded as a conditioning factor, as the amount of speaker frames observed by the component c increases, M _{ c } will get closer to the real statistical mean vectors (1/L)F _{ c }.
where T _{ c } denotes the cth component of T. T _{ c } can be regarded as the total basis for the cth Gaussian component. From Eq. (5), we can see that although the zero-order Baum-Welch statistics of Gaussian components vary from each other, i-vector approach adapts the mean vector of each Gaussian component in a same degree. In other words, we might say that the conditioning for each Gaussian component is identical, controlled by w.
Although FA-based approaches have been proven to be effective in compensating the phonetic variability, as we have mentioned above, its performance is not so good as the traditional MAP adaptation in text-dependent speaker verification. An intuitive explanation could be that not all the Gaussian components should be adapted if the phonetic contents of enrollment and test are the same, only those components which are observed in speaker frames need to be adapted. Extended from text-dependent verification to text-independent verification, a rational inference could be that considering the sparsity in the real-world application and imbalanced distribution of speaker frames, it is better not to adapt those Gaussian components which contribute least in modeling speaker frames. Like we used to do in the traditional MAP adaptation, as it is uncertain to perform a full adaptation to those Gaussian components without enough observation speaker frames, those Gaussian components with minimum zero-order Baum-Welch statistics are not chosen to join in the adaptation process.
In order to verify our thought, a statistical experiment is designed based on the corpus from the core condition of NIST 2008 SREs. The core condition of the 2008 SREs is named short2-short3, each file is a 5-min telephone conversation recording containing roughly 2 min of speech.
We randomly pick out 20 male recordings which are all from different speakers, a male UBM model containing 1024 Gaussian components is used to extract the zero-order Baum-Welch statistics. The configurations of feature extraction and UBM are same as experiments section.
Proportion of zero-order Baum-Welch statistics of top N Gaussian components
Number of N | Percentage |
---|---|
100 | 0.2988 |
200 | 0.4791 |
300 | 0.6183 |
400 | 0.7288 |
500 | 0.8166 |
600 | 0.8848 |
700 | 0.9362 |
800 | 0.9732 |
900 | 0.9946 |
Extended from traditional MAP adaptation approach to the i-vector adaptation approach, Gaussian components are no longer assumed to be independently distributed, total variability can be captured by a low-dimensional factor space, it means that sparse training data can give a global adaptation of total Gaussian components. A graphical explanation will be given to show that although total factor space approach is effective to compensate phonetic variability, it is still not an optimal method.
4 Component reduction analysis
4.1 Implementation of component reduction analysis
where the process of calculating w can be derived from [2], N(u) is the diagonal matrix of dimension C F×C F whose diagonal blocks are N _{ c }(u)I,(c=1,…,C), I is the identity matrix of dimension F×F, \(\tilde {\mathbf {F}}_{c}(u)\) is a supervector of dimension C F×1 obtained by concatenating all the centralized first-order Baum-Welch statistics \(\tilde {\mathbf {F}}_{c}(u)\). Here Σ is a diagonal covariance matrix of dimension C F×C F that is estimated during the training of T. It models the residual variabilities not captured by the total variability matrix T.
in which \(\mathbf {T}_{c_{n}^{'}}\) is the sub-matrix of the \(c_{n}^{'}\)th block of T, \(\mathbf {N}_{c_{n}^{'}}(u)\mathbf {\Sigma }_{c_{n}^{'}}^{-1}\) is the sub-matrix of the diagonal part of the \(c_{n}^{'}\) block of Σ ^{−1} N(u), and \(\tilde {\mathbf {F}}_{c_{n}^{'}}(u)\) is the sub-matrix of the \(c_{n}^{'}\)th row block of \(\tilde {\mathbf {F}}(u)\).
4.2 Zero-order Baum-Welch statistics normalization
This kind of normalization ensures that after applying CRA, the summation of zero-order Baum-Welch statistics of top R Gaussian components is identical to the summation of zero-order Baum-Welch statistics not applying CRA, this summation also equals to the number of total speaker frames. Although this normalization seems to be a slightly rough approximation, experimental result in the next section shows that slight improvement is obtained after applying zero-order Baum-Welch statistics normalization.
5 Experiments
5.1 Databases
All experiments were carried out on the core condition of NIST 2008 SREs. NIST 2005 and NIST 2006 were used as development datasets. Our experiments are based on male telephone data (det6) and English-only male telephone data (det7) for both training and testing. The core condition of the 2008 SREs contains 648 males.
5.2 Experimental setup
Our experiments operated on the Mel frequency cepstral coefficients (MFCCs), and speech/silence segmentation was performed according to the index of transcriptions provided by the NIST with its automatic speech recognition (ASR) tool. The MFCC frames are extracted using a 25-ms Hamming window, every 10-ms step, 19 order coefficients together with log energy, 20 first-order delta, and 10 second-order delta were appended, equal to a total dimension of F=50, where we follow the configuration of [13]. All frames were subjected to cepstral mean normalization (CMN) to obey a (0,1) distribution.
We used a male UBM containing 1024 Gaussians and the order of total factor matrix T is 400, the corpus from the 2005 1conv4w transcription index and 2006 1conv4w transcription index was used to train the UBM with a total length of 17 h of speech from about 550 speakers. The corpus from the 2005 8conv4w and 2006 8conv4w transcription index was used as the development datasets to train the total factor matrix because sessions for each speaker consists of recordings from eight different microphones which is capable of modeling the speaker and channel variability. The development set of i-vectors was also extracted from the same corpus set which was used to train the T-Matrix. All the decision scores were given without normalization.
In our experiment, total sets of i-vectors were extracted with our CRA algorithm, including development set, enrollment set, and test set. We also performed experiments that only enrollment and test sets were extracted using CRA algorithm, and results show that both approaches gave better results than the baseline of traditional i-vector extraction approach, but the results of whole extraction (development set, enrollment set and test set) are best. The threshold R is tuned manually.
5.3 Results and analysis
Comparison of cosine scoring with different compensation and normalization techniques
Male | ||||
---|---|---|---|---|
English trials | All trials | |||
EER(%) | DCF | EER(%) | DCF | |
LDA(210) | 2.50 | 0.0246 | 5.57 | 0.0538 |
LDA(210), R = 1000 | 2.48 | 0.0245 | 5.48 | 0.0533 |
LDA(210), R = 975 | 2.31 | 0.0226 | 5.37 | 0.0521 |
LDA(210), R = 950 | 2.27 | 0.0220 | 5.24 | 0.0514 |
LDA(210), R = 925 | 2.23 | 0.0211 | 5.18 | 0.0510 |
LDA(210), R = 900 | 2.15 | 0.0207 | 5.10 | 0.0503 |
LDA(210), R = 875 | 2.30 | 0.0224 | 5.12 | 0.054 |
LDA(210), R = 850 | 2.53 | 0.0250 | 5.17 | 0.510 |
EFRnorm, LDA(210) | 2.35 | 0.0233 | 5.40 | 0.0525 |
EFRnorm, LDA(210), R = 900 | 2.29 | 0.0225 | 5.19 | 0.0512 |
For our system, best LDA dimension is 210 (in the case that dimension of total factor matrix is 400), standardization before LDA compensation (EFR norm four iterations) enhances the performance and gives a baseline of 2.35 % for EER and 0.0233 for DCF in the English trials (det7), 5.40 % for EER and 0.0525 for DCF in the multi-language trials (det6, all trials). After applying CRA, as we reduce the number of components, EER and DCF decrease slightly, the best performance is obtained when the threshold R is 900, which gives a minimum EER of 2.15 and minimum DCF of 0.0207 in the English trials and a minimum EER of 5.10 and minimum DCF of 0.0503 in the multi-language trials. Here what should be highlighted is that the optimal dimension of LDA after applying CRA algorithm is still unchanged, the reason that CRA has no effect on the optimal choice of dimension of LDA is that the CRA aims at compensating the phonetic variability and phonetic imbalance, whereas LDA aims at compensating the channel variability. However, the best performance is obtained without EFR normalization. Even though CRA still works in the case of applying EFR normalization, its performance is not so good as in the case of without EFR normalization both in English trials and in multi-language trials. One possible reason for this result is that the performance of EFR normalization is highly dependent on the scale of development dataset, and in our experiment, we did not offer that much corpus to construct full-scale between- and within-speaker covariance matrices.
Comparison of results with and without zero-order Baum-Welch statistics normalization
Male | ||||
---|---|---|---|---|
English trials | All trials | |||
EER(%) | DCF | EER(%) | DCF | |
R = 1000, no zero-order norm | 2.48 | 0.0245 | 5.48 | 0.0533 |
R = 1000, with zero-order norm | 2.48 | 0.0245 | 5.48 | 0.0533 |
R = 950, no zero-order norm | 2.29 | 0.0221 | 5.25 | 0.0516 |
R = 950, with zero-order norm | 2.27 | 0.0220 | 5.24 | 0.0514 |
R = 900, no zero-order norm | 2.19 | 0.0211 | 5.12 | 0.0506 |
R = 900, with zero-order norm | 2.15 | 0.0207 | 5.10 | 0.0503 |
6 Conclusions
This paper proposes an improved i-vector extraction algorithm. We analyze the intrinsic sparsity and imbalanced distribution of speaker frames, those redundant Gaussian components, are discarded in the i-vector extraction phase so as to compensate the phonetic variability. We attempt to make a preliminary beginning to combine the advantages of traditional MAP adaptation in text-dependent speaker verification and i-vector-based framework in text-independent speaker verification. Besides speaker variability and channel variability, phonetic variability, as another impact factor, is taken into consideration. We carried out experiments on the core condition of male portion of NIST 2008 SREs. Experimental results show that a 10–15 % relative improvement is obtained.
Despite the effectiveness of our CRA algorithm, there exists many questions to be solved and be explained. First is the conflict of our CRA algorithm and the EFR normalization. More experiments have to be designed to explore the relationship and compatibility of this two techniques. Second issue is that the threshold R of the CRA is a parameter that has to be tuned manually, which is highly dependent on the length of speech frames, R=900 is an optimal value in the core condition of NIST 2008 SREs, but we cannot give a similar configuration in the female portion, so do in the conditions like 10sec-10sec and short2-10sec of NIST 2008 SREs dataset, an adapted adjustment of the threshold R has to be proposed to make speaker verification more practical. On the other hand, we expect our CRA algorithm based on i-vector framework may improve the performance of i-vector-based system in text-dependent speaker verification.
Declarations
Acknowledgements
We would like to thank the superior open-source ALIZE/LIA_RAL platform, its powerful functionality enables us to build a state-of-the-art speaker verification system and focus on the ideas we are interested in.
This article is supported by the National Natural Science Foundation of China (61371147): A study of sparse compression and precise reconstruction of high definition audio in transform domain and its application in the mobile terminal.
This article is also supported by the National Natural Science Foundation of China (61271349): Study on music similarity model based on nonlinear dynamical characteristics analysis of acoustic signal and its application.
Authors’ Affiliations
References
- DA Reynolds, TF Quatieri, RB Dunn, Speaker verification using adapted gaussian mixture models. Digit. Signal Process. 10(1), 19–41 (2000).View ArticleGoogle Scholar
- P Kenny, G Boulianne, P Dumouchel, Eigenvoice modeling with sparse training data. IEEE Trans. Speech and Audio Process. 13(3), 345–354 (2005).View ArticleGoogle Scholar
- P Kenny, P Ouellet, N Dehak, V Gupta, P Dumouchel, A study of interspeaker variability in speaker verification. IEEE Trans. Audio, Speech, and Lang. Process. 16(5), 980–988 (2008).View ArticleGoogle Scholar
- P Kenny, Joint factor analysis of speaker and session variability: Theory and algorithms. CRIM, Montreal, (Report) CRIM-06/08-13 (2005).Google Scholar
- P Kenny, G Boulianne, P Ouellet, P Dumouchel, Joint factor analysis versus eigenchannels in speaker recognition. IEEE Trans. Audio, Speech, and Lang. Process. 15(4), 1435–1447 (2007).View ArticleGoogle Scholar
- P Kenny, in Odyssey. Bayesian speaker verification with heavy-tailed priors, (2010), p. 14.Google Scholar
- N Dehak, P Kenny, R Dehak, P Dumouchel, P Ouellet, Front-end factor analysis for speaker verification. IEEE Trans. Audio, Speech, and Lang. Process. 19(4), 788–798 (2011).View ArticleGoogle Scholar
- P-M Bousquet, D Matrouf, J-F Bonastre, in INTERSPEECH. Intersession compensation and scoring methods in the i-vectors space for speaker recognition, (2011), pp. 485–488.Google Scholar
- P-M Bousquet, A Larcher, D Matrouf, J-F Bonastre, O Plchot, in Odyssey: The Speaker and Language Recognition Workshop. Variance-spectra based normalization for i-vector standard and probabilistic linear discriminant analysis (Singapore, Singapore, 2012), pp. 157–164.Google Scholar
- H Aronowitz, in Odyssey 2012-The Speaker and Language Recognition Workshop. Text dependent speaker verification using a small development set, (2012).Google Scholar
- A Larcher, P Bousquet, KA Lee, D Matrouf, H Li, J-F Bonastre, in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference On. I-vectors in the context of phonetically-constrained short utterances for speaker verification (IEEE, 2012), pp. 4773–4776.Google Scholar
- T Stafylakis, P Kenny, P Ouellet, J Perez, M Kockmann, P Dumouchel, Text-dependent speaker recognition using plda with uncertainty propagation. Matrix. 500, 1 (2013).Google Scholar
- J-F Bonastre, N Scheffer, D Matrouf, C Fredouille, A Larcher, A Preti, G Pouchoulin, NW Evans, BG Fauve, JS Mason, in Odyssey. Alize/spkdet: a state-of-the-art open source software for speaker recognition, (2008), p. 20.Google Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.