A sub-band-based feature reconstruction approach for robust speaker recognition
© Yan et al.; licensee Springer. 2014
Received: 26 April 2014
Accepted: 2 October 2014
Published: 21 October 2014
Although automatic speaker and speech recognition have been studied extensively over the past decades, the lack of robustness remains a major challenge. The missing data technique (MDT) is a promising approach, but its performance depends on the correlation across frequency bands. This paper presents a new reconstruction method for feature enhancement that exploits this dependence. The degree of concentration across frequency bands is measured with principal component analysis (PCA). Theoretical analysis and experimental results show that the correlation within a feature vector extracted from a sub-band (SB) is much stronger than within one extracted from the full-band (FB). Thus, rather than treating the spectral features as a whole, this paper splits the full-band into sub-bands and reconstructs the spectral features extracted from each SB individually, based on MDT. The reconstructed features from all sub-bands are then recombined to yield the conventional mel-frequency cepstral coefficients (MFCC) for recognition experiments. The 2-sub-band reconstruction approach is evaluated in a speaker recognition system. The results show that the proposed approach outperforms full-band reconstruction in terms of recognition performance in all noise conditions. Finally, we discuss the selection of frequency division schemes for the recognition task. When the FB is divided into too many sub-bands, some of the correlations across frequency channels are lost. Consequently, efficient division schemes need to be investigated to further improve recognition performance.
Keywords: Robustness; Missing data technique (MDT); Reconstruction; Sub-band (SB); Full-band (FB); Principal component analysis (PCA)
The performance of speaker or speech recognition systems degrades rapidly when they operate under conditions that differ from those used for training. Noise robustness is therefore a key requirement for deploying these systems in real-world conditions. Proposed solutions include feature-based, score-based, and model-based methods, i-vectors, and the missing data technique (MDT).
MDT can compensate for disturbances of arbitrary type, so this method, which is based on the time-frequency representation, is well suited to the problem of noise mismatch.
In MDT, two methods have been considered for performing speech or speaker recognition with incomplete data: marginalization and reconstruction. In marginalization, the unreliable components are discarded or integrated up to the observed values, whereas reconstruction estimates the corrupted features using statistical methods such as minimum mean square error (MMSE), maximum a posteriori (MAP), and maximum likelihood (ML) estimation. Both marginalization and reconstruction have been applied in speaker recognition systems. However, marginalization suffers from two main drawbacks. First, utterance-level processing, such as mean and variance normalization, is known to improve recognition performance, but it cannot be performed with an incomplete spectrum. Second, recognition has to be carried out with spectral features, although it is well known that cepstral features outperform spectral ones. Moreover, of all the methods, marginalization is assumed to have the most overhead. In contrast, if the complete reconstructed spectrogram is available, the recognizer is no longer constrained to spectral features, and a better set of parameters can be derived from the reconstructed spectrum.
In this paper, the MAP reconstruction method is used. Its effectiveness depends significantly on the correlation between the spectral features. The conventional MAP reconstruction method is applied to the full-band. According to our analysis, the components of spectral vectors extracted from a sub-band are more strongly correlated than those extracted from the full-band; this is illustrated in Section 2. Based on this analysis and the sub-band idea, a multi-sub-band reconstruction approach is proposed to improve recognition performance. The principle is to divide the full-band into multiple sub-bands and then reconstruct the missing features extracted from each sub-band independently. The features from all sub-bands are then recombined to yield the typical mel-frequency cepstral coefficient (MFCC) vector.
As a feature enhancement method, the proposed reconstruction approach can be used in both speaker and speech recognition systems. To evaluate its validity, this paper combines the new reconstruction method with a speaker recognition system.
This paper is organized as follows. Section 2 analyzes the theory behind the proposed reconstruction approach. Section 3 describes the proposed reconstruction approach. Section 4 describes the baseline experiment system and the experimental framework adopted to evaluate the proposed technique. Finally, Section 5 concludes this paper and discusses some future directions.
2 The analysis of concentration
The more concentrated a feature vector is, the higher its redundancy, that is, the greater the correlation among its components. The degree of concentration is measured with principal component analysis (PCA).
To analyze the degree of concentration of the feature vector of the i-th sub-band, the eigenvalues of the associated covariance matrix Θ_i are calculated and arranged in descending order.
R_i(m) is the accumulative contribution rate of the first m principal components, that is, the sum of the m largest eigenvalues divided by the sum of all eigenvalues. The concentration level is the minimum m for which R_i(m) > r, where r is a predefined concentration coefficient.
For a given r, a smaller concentration level implies that the i-th sub-band feature vector is confined along a smaller number of principal directions, and therefore, by the above definition, its components are more closely related to each other.
In the same manner, the degree of concentration of the full-band feature vector could be analyzed.
Considering the positions of the 2-dimensional feature vectors recorded in Figure 2 and the corresponding contribution rates, together with our analysis, the following conclusion is obtained: the higher the redundancy of the data, that is, the greater its correlation, the smaller the corresponding concentration level. Because the MAP reconstruction method relies on the correlation between the components of the feature vector, the smaller the concentration level, the more effective the reconstruction.
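The concentration measure described in this section can be illustrated with a short sketch. It assumes the features are available as a NumPy array with one frame per row; the function names and the value r = 0.9 are illustrative choices, not taken from the paper.

```python
import numpy as np

def contribution_rates(features):
    """Cumulative contribution rate R(m), m = 1..D, for a (frames x D) matrix."""
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues in descending order
    eigvals = np.clip(eigvals, 0.0, None)        # guard against tiny negative values
    return np.cumsum(eigvals) / np.sum(eigvals)

def concentration_level(features, r=0.9):
    """Smallest m such that R(m) > r (r is an assumed concentration coefficient)."""
    rates = contribution_rates(features)
    above = np.nonzero(rates > r)[0]
    return int(above[0]) + 1 if above.size else len(rates)

# Synthetic 2-dimensional example: strongly correlated vs. independent components.
rng = np.random.default_rng(0)
t = rng.normal(size=2000)
correlated = np.column_stack([t, t + 0.05 * rng.normal(size=2000)])
independent = rng.normal(size=(2000, 2))
print(concentration_level(correlated), concentration_level(independent))
```

On this synthetic data, the correlated features need fewer principal components to exceed r than the independent ones, which is the behaviour the analysis relies on.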
3 Multi-sub-band reconstruction for speaker recognition system
As a feature enhancement method, the multi-sub-band reconstruction method in MDT can be applied in Gaussian mixture model (GMM), SVM-GMM, and universal background model (UBM)-GMM recognition systems. Given the previously demonstrated validity of the UBM-GMM system, the proposed reconstruction method is evaluated in a UBM-GMM speaker recognition system. This section describes the MDT-based speaker recognition system.
3.1 UBM-GMM model
In this paper, a speaker-independent UBM is used. A speaker-dependent model is derived from the UBM by adapting the UBM parameters to the speech material of the corresponding speaker using MAP estimation.
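As a rough illustration of MAP adaptation, the sketch below adapts only the means of a diagonal-covariance UBM, following the widely used relevance-factor formulation. Mean-only adaptation, the relevance factor of 16, and all function and parameter names are assumptions made for this example; the paper does not state which parameters are adapted.

```python
import numpy as np

def adapt_means(ubm_weights, ubm_means, ubm_vars, frames, relevance=16.0):
    """MAP-adapt the UBM means to the frames of one speaker.

    ubm_weights: (M,), ubm_means and ubm_vars: (M, D), frames: (T, D).
    """
    diff = frames[:, None, :] - ubm_means[None, :, :]              # (T, M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / ubm_vars, axis=2)
                        + np.sum(np.log(2.0 * np.pi * ubm_vars), axis=1))
    log_post = np.log(ubm_weights) + log_gauss                     # (T, M)
    log_post -= log_post.max(axis=1, keepdims=True)                # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                        # posteriors per frame
    n = post.sum(axis=0)                                           # soft counts, (M,)
    expected = (post.T @ frames) / np.maximum(n, 1e-10)[:, None]   # data means, (M, D)
    alpha = (n / (n + relevance))[:, None]                         # adaptation weights
    return alpha * expected + (1.0 - alpha) * ubm_means

# Toy 1-dimensional UBM with two components and 16 frames near the second one.
weights = np.array([0.5, 0.5])
means = np.array([[-5.0], [5.0]])
variances = np.ones((2, 1))
speaker_frames = np.full((16, 1), 6.0)
adapted = adapt_means(weights, means, variances, speaker_frames)
```

Components that see little speaker data keep their UBM means, while well-observed components move toward the speaker's data.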
3.2 Feature vector
The mel log-spectral vector and the MFCC are used in the reconstruction and recognition stages, respectively. The unreliable components are reconstructed based on the statistical relationship among the components of the log-spectral vector.
3.3 Mask estimation
The mask is estimated from the k-th frequency bands of the power spectra of speech and noise in individual time-frequency (T-F) units. Note that the estimation of the speech and noise components is carried out in the spectral domain, before the mel filter is applied.
The estimate of the noise spectrum is derived from the noisy signal spectrum using a previously published estimation method. The estimate of the speech spectrum can then be derived by subtracting the estimated noise spectrum from the corrupted signal spectrum. In this paper, this is accomplished by performing spectral subtraction with an SNR-dependent gain function, MMSE log-STSA, in the frequency domain.
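A minimal sketch of an SNR-based binary mask is given below. It substitutes plain power-spectral subtraction for the MMSE log-STSA gain used in the paper, and the 0 dB threshold, the flooring constant, and the function name are assumptions made for illustration only.

```python
import numpy as np

def estimate_mask(noisy_power, noise_power, threshold_db=0.0, floor=1e-10):
    """Label each T-F unit as reliable (True) or unreliable (False).

    noisy_power and noise_power are power spectra of equal shape.
    The speech power is approximated by plain subtraction, which is a
    simplification of the gain-based estimate described in the text.
    """
    speech_power = np.maximum(noisy_power - noise_power, floor)
    local_snr_db = 10.0 * np.log10(speech_power / np.maximum(noise_power, floor))
    return local_snr_db > threshold_db

noisy = np.array([10.0, 1.5, 4.0])
noise = np.array([1.0, 1.0, 1.0])
mask = estimate_mask(noisy, noise)
```

Units whose estimated local SNR exceeds the threshold are treated as reliable; all others are passed to the reconstruction stage.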
3.4 MAP estimation for unreliable components
Here, Θ_ru is the cross covariance between the reliable and unreliable components of the vector.
The unreliable components of the vector are then reconstructed using the MAP estimation method.
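The idea of estimating unreliable components from reliable ones through their covariance can be sketched as follows. The example assumes a single Gaussian model of the log-spectral vector, for which the MAP estimate equals the conditional mean; the optional upper bound by the observed value, and all names, are assumptions of this sketch rather than details confirmed by the paper.

```python
import numpy as np

def reconstruct_unreliable(x, reliable, mean, cov, upper_bound=False):
    """Replace unreliable entries of x by their Gaussian conditional mean.

    x: observed vector (D,), reliable: boolean mask (D,),
    mean: (D,), cov: (D, D) statistics trained on clean speech.
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    r = np.asarray(reliable, dtype=bool)
    u = ~r
    out = x.copy()
    if not u.any():
        return out
    estimate = mean[u].copy()
    if r.any():
        theta_rr = cov[np.ix_(r, r)]          # covariance of reliable components
        theta_ur = cov[np.ix_(u, r)]          # cross covariance, unreliable vs. reliable
        estimate += theta_ur @ np.linalg.solve(theta_rr, x[r] - mean[r])
    if upper_bound:
        # Optional assumption: clean log-energy cannot exceed the noisy observation.
        estimate = np.minimum(estimate, x[u])
    out[u] = estimate
    return out

mean = np.zeros(2)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
restored = reconstruct_unreliable([2.0, 5.0], [True, False], mean, cov)
bounded = reconstruct_unreliable([2.0, 1.0], [True, False], mean, cov, upper_bound=True)
```

The stronger the correlation between reliable and unreliable components, the more the estimate is driven by the observed reliable values rather than by the prior mean.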
3.5 The proposed multi-sub-band reconstruction approach
(a) The estimation of speech and noise components is carried out in the spectral domain.
(b) A mask is obtained that classifies the T-F representation into reliable and unreliable components corresponding to the frequency range of the P mel filters. The above two steps are carried out before applying the mel filter.
(c) P mel filters are used to smooth the power spectrum, and then its logarithm is taken.
(d) The mel log-spectral vector is multiplied by the mask estimated in step (b).
(e) The feature vector corresponding to the full-band is divided into vectors corresponding to 2 sub-bands.
(f) Based on the SP trained in the first part, the feature vectors corresponding to each sub-band are reconstructed individually.
(g) The reconstructed vectors of the 2 sub-bands are recombined to yield the typical MFCC vector.
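The final steps of this procedure (masking, splitting into sub-bands, per-sub-band reconstruction, recombination, and cepstral transform) can be sketched as below. The single-Gaussian conditional-mean estimator, the unnormalized DCT-II, and the names `conditional_fill`, `dct_ii`, and `sub_band_features` are assumptions made for illustration; the trained statistics are passed in as one (mean, covariance) pair per sub-band.

```python
import numpy as np

def conditional_fill(x, reliable, mean, cov):
    """Gaussian conditional-mean estimate of the unreliable entries of x."""
    out = np.array(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    r = np.asarray(reliable, dtype=bool)
    u = ~r
    if u.any():
        est = mean[u].copy()
        if r.any():
            est += cov[np.ix_(u, r)] @ np.linalg.solve(cov[np.ix_(r, r)], out[r] - mean[r])
        out[u] = est
    return out

def dct_ii(v, n_coeffs):
    """Unnormalized DCT-II of v, keeping the first n_coeffs coefficients."""
    n = len(v)
    k = np.arange(n_coeffs)[:, None]
    j = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * j + 1) / (2 * n)) @ v

def sub_band_features(log_mel, mask, band_stats, edges, n_ceps):
    """Mask, split, reconstruct each sub-band independently, recombine, take DCT."""
    mask = np.asarray(mask, dtype=bool)
    masked = np.asarray(log_mel, dtype=float) * mask          # apply the mask
    parts = []
    for lo, hi, (mean, cov) in zip(edges[:-1], edges[1:], band_stats):
        parts.append(conditional_fill(masked[lo:hi], mask[lo:hi], mean, cov))
    full = np.concatenate(parts)                              # recombined log-mel vector
    return full, dct_ii(full, n_ceps)

# Toy example with P = 4 channels split into 2 sub-bands of 2 channels each.
stats = [(np.zeros(2), np.array([[1.0, 0.5], [0.5, 1.0]]))] * 2
full, ceps = sub_band_features([2.0, 7.0, 9.0, 4.0], [1, 0, 0, 1], stats, [0, 2, 4], 3)
```

Because each sub-band uses only its own statistics, a reliable component in one sub-band never influences the estimate of an unreliable component in another, which is the trade-off discussed in the experiments on division schemes.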
3.6 Baseline system
The baseline system follows a previously described system, which assumes that the unreliable components are bounded between zero and the observed mel log-spectrum and that the components of the mel log-spectrum are independent; marginalization is applied to process the corrupted vector. The feature vector used in recognition is a P-dimensional mel log-spectrum. We compare the performance of the proposed system with this baseline system.
The new reconstruction method is evaluated on a closed set of 30 speakers with 140 utterances per speaker. The sampling frequency is 16 kHz. For each speaker, 70% of the available speech material is randomly selected to train the corresponding speaker model, 7% is used to train the SP for the reconstruction stage, and the remaining 23% is used for testing.
In the training stage, a power-based voice activity detector (VAD) is used to ensure that silence frames do not affect model training.
Speaker recognition performance is evaluated on a subset of ten randomly selected speakers involving a total of 30 sentences per speaker (20 sentences for training the speaker-dependent GMM and 10 sentences for testing). In the test phase, utterances are mixed at various SNRs with noise signals drawn from the NOISEX database.
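Mixing an utterance with noise at a target SNR can be sketched as follows. The sketch assumes the SNR is defined over the whole utterance and that the noise recording is repeated or truncated to match the utterance length; the paper does not specify these details, and the function name is illustrative.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the utterance-level SNR equals snr_db."""
    noise = np.resize(noise, speech.shape)        # repeat or truncate the noise
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Synthetic stand-ins for a 1 s utterance at 16 kHz and a shorter noise clip.
rng = np.random.default_rng(1)
speech = np.sin(np.linspace(0.0, 100.0, 16000))
noise = rng.normal(size=4000)
mixed = mix_at_snr(speech, noise, 5.0)
achieved_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean((mixed - speech) ** 2))
```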
4.1 Experiment 1: performance comparison between marginalization and reconstruction including full-band and 2-sub-band reconstruction
In the first experiment, we compare the performance of two systems that use marginalization and reconstruction, respectively, to process the corrupted features, and then evaluate the validity of the proposed reconstruction method. Note that recognition has to be carried out with spectral features in the former system, whereas MFCC are extracted for recognition in the latter.
Table 1 Recognition performance of FB reconstruction, 2-SB reconstruction, and marginalization in the presence of different types of noise (unit: %)
It can be observed in Table 1 that both reconstruction methods clearly outperform the baseline system.
The results show that the 2-sub-band reconstruction method performs better than full-band reconstruction for all noise types. The recognition performance is higher at larger SNRs.
For both reconstruction methods, the recognition performance in babble noise is higher than in the other four noise types in most cases.
The corresponding relative improvements over full-band reconstruction are 2.55%, 1.49%, 1.10%, 1.63%, and 1.03% at an SNR of 0, 5, 10, 15, and 20 dB, respectively. Recognition performance improves the most at an SNR of 0 dB.
The improvement in recognition performance is 6.04%, 6.97%, 8.99%, 5.63%, and 11.37% in babble, factory1, pink, white, and destroyer-engine noise, respectively. Recognition performance improves the most in destroyer-engine noise.
The contribution rate (%) of every principal component
4.2 Experiment 2: influence of different division schemes of the full-band
Different division schemes of the full-band and the corresponding recognition performance
12 sub-bands: 1-2, 3-4, 5-6, 7-8, 9-10, 11-12, 13-14, 15-16, 17-18, 19-20, 21-22, 23-24
8 sub-bands: 1-3, 4-6, 7-9, 10-12, 13-15, 16-18, 19-21, 22-24
6 sub-bands: 1-4, 5-8, 9-12, 13-16, 17-20, 21-24
4 sub-bands: 1-6, 7-12, 13-18, 19-24
3 sub-bands: 1-8, 9-16, 17-24
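The equal-width divisions listed above can be generated with a small helper. The sketch assumes 24 frequency channels numbered from 1, as implied by the listed ranges; the function name is illustrative.

```python
def equal_division(n_channels, n_sub_bands):
    """Return inclusive (first, last) channel ranges of equal width, numbered from 1."""
    if n_channels % n_sub_bands != 0:
        raise ValueError("n_channels must be divisible by n_sub_bands")
    width = n_channels // n_sub_bands
    return [(start, start + width - 1) for start in range(1, n_channels + 1, width)]

for k in (12, 8, 6, 4, 3):
    print(k, equal_division(24, k))
```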
When the full-band is divided into 12 sub-bands, the recognition performance is inferior. This observation indicates that correlations between the components of the feature vector are lost when the number of sub-bands becomes too large.
This paper presents a new feature enhancement method, which is evaluated in a UBM-GMM speaker recognition system. In the proposed method, the reconstruction is performed on each sub-band independently, and the reconstructed spectrum is then recombined into a complete spectrum to yield the conventional MFCC for recognition. Compared with the full-band reconstruction method, the proposed reconstruction approach has been shown to achieve higher recognition performance in five noise types. The experiments have also shown that the recognition performance depends on the frequency division scheme, so optimal division schemes need to be developed.
The first experiment has revealed the following results. First, MFCC features outperform spectral ones for speaker recognition. Second, the recognition performance obtained by reconstruction is higher than that obtained by marginalization. Third, the recognition performance obtained by the 2-sub-band reconstruction method is superior to that of full-band reconstruction in five noise types and at all SNRs. The second experiment has shown that different frequency division schemes influence the recognition performance.
To achieve further improvements in recognition performance, an optimal frequency division scheme will be very important. In addition, analyzing the distribution properties of various noise types and then accurately identifying the corrupted components is an active research topic. Finally, research on mask estimation algorithms is required to separate reliable from unreliable components precisely.
- J Pelecanos, S Sridharan, Feature warping for robust speaker verification. ISCA Workshop Speaker Recognition, June 213–218 (2001).
- Reynolds DA: Channel robust speaker verification via feature mapping. ICASSP 2003, 2: 53-56.
- Chandran V, Ning D, Sridharan S: Speaker identification using higher order spectral phase features and their effectiveness vis-a-vis mel-cepstral features, vol. 3072. In Biometric Authentication. Springer Verlag, Berlin; 2004:1-20.
- Reynolds DA, Quatieri TF, Dunn RB: Speaker verification using adapted Gaussian mixture models. Digital Signal Process 2000, 10(1–3):19-41. 10.1006/dspr.1999.0361
- Auckenthaler R, Carey M, Lloyd-Thomas H: Score normalization for text-independent speaker verification systems. Digital Signal Process 2000, 10(1–3):42-54. 10.1006/dspr.1999.0360
- P Kenny, P Dumouchel, in Proc. ODYSSEY 2004-The Speaker and Language Recognition Workshop. Experiments in speaker verification using factor analysis likelihood ratios (Toledo, Spain, May 31–June 3 2004), pp. 219–226.
- Kenny P, Boulianne G, Ouellet P, Dumouchel P: Factor analysis simplified. ICASSP 2005, 1: 637-640.
- Jančovič P, Köküer M: Estimation of voicing-character of speech spectra based on spectral shape. IEEE Signal Process. Lett 2007, 14(1):66-69. 10.1109/LSP.2006.881517
- M McLaren, D van Leeuwen, in ICASSP. Improved speaker recognition when using i-vectors from multiple speech sources (Prague, 2011), pp. 5460–5463.
- González JA, Peinado AM, Ma N, Gómez AM, Barker J: MMSE-based missing-feature reconstruction with temporal modeling for robust speech recognition. IEEE Trans. Audio, Speech, Lang. Process 2013, 21(3):624-635. 10.1109/TASL.2012.2229982
- May T, van de Par S, Kohlrausch A: Noise-robust speaker recognition combining missing data techniques and universal background modeling. IEEE Trans. Audio, Speech, Lang. Process 2012, 20(1):108-121. 10.1109/TASL.2011.2158309
- Togneri R, Pullella D: An overview of speaker identification: accuracy and robustness issues. IEEE Circuits Syst. Mag. 2011, Second Quarter: 23-61. 10.1109/MCAS.2011.941079
- Cooke M, Green P, Josifovski L, Vizinho A: Robust automatic speech recognition with missing and unreliable acoustic data. Speech Commun. 2001, 34: 267-285. 10.1016/S0167-6393(00)00034-0
- Zhao X, Shao Y, Wang D: CASA-Based Robust Speaker Identification. IEEE Trans. Audio, Speech Lang. Process 2012, 20: 1608-1616. 10.1109/TASL.2012.2186803
- Ma N, Barker J, Christensen H, Green P: Combining speech fragment decoding and adaptive noise floor modelling. IEEE Trans. Audio Speech Lang. Process 2012, 20(3):818-827. 10.1109/TASL.2011.2165945
- Gemmeke JF, Hamme VanH, Cranen B, Boves L: Compressive sensing for missing data imputation in noise robust speech recognition. IEEE J. Sel. Topics Signal Process 2010, 4(2):272-287. 10.1109/JSTSP.2009.2039171
- JA González, AM Peinado, AM Gómez, N Ma, J Barker, in IEEE Trans. Audio Speech Lang. Process. Combining missing-data reconstruction and uncertainty decoding for robust speech recognition (Kyoto, 2012), pp. 4693–4696.
- Raj B, Seltzer ML, Stern RM: Reconstruction of missing features for robust speech recognition. Speech Commun. 1997, 43: 195-202.
- B Raj, Reconstruction of incomplete spectrograms for robust speech recognition. PhD dissertation, Pittsburgh, PA, Carnegie Mellon Univ, 2000.
- Raj L, Bonastre JF: Subband approach for automatic speaker recognition: optimal division of the frequency domain. In Proc. Audio and Video based Biometric Person Authentication. LNCS. Edited by: Bigün J, Chollet G, Borgefors G. Springer, Heidelberg; 1997:195-202.
- Besacier L, Bonastre JF, Fredouille C: Localization and selection of speaker-specific information with statistical modeling. Speech Comm. 2000, 31: 89-106. 10.1016/S0167-6393(99)00070-9
- Bourlard H, Dupont S: A new ASR approach based on independent processing and recombination of partial frequency bands. ICSLP 1996, 1: 426-429.
- J Shlens, A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, version 2, 1–13 (2005).
- Reynolds D, Rose R: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process 1995, 3(1):72-83. 10.1109/89.365379
- Campbell W, Sturim D, Reynolds D: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett 2006, 13(5):308-311. 10.1109/LSP.2006.870086
- Reynolds DA, Quatieri TF, Dunn RB: Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 2000, 10: 19-41. 10.1006/dspr.1999.0361
- Raj B, Stern RM: Missing-feature approaches in speech recognition. IEEE Signal Process. Mag 2005, 22(5):101-116. 10.1109/MSP.2005.1511828
- Brown GJ, Wang D: Separation of Speech by Computational Auditory Scene Analysis. Springer Verlag, New York; 2005.
- Seltzer ML, Raj B, Stern RM: A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition. Speech Commun. 2004, 43(4):379-393. 10.1016/j.specom.2004.03.006
- X Zhao, Y Wang, D Wang, Robust speaker identification in noisy and reverberant conditions. IEEE Trans. Audio, Speech Lang. Process. 22, 836–845 (2014, in press).
- Rainer Martin: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process 2001, 9: 504-512. 10.1109/89.928915
- M Brookes, VOICEBOX: Speech Processing Toolbox for MATLAB (2009). [Online] Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
- Papoulis A: Probability, Random Variables, and Stochastic Processes. Academic Press, New York; 1991.
- A Varga, H Steeneken, M Tomlinson, D Jones, in Tech. Rep., Speech Res. Unit, Defense Res. Agency. The NOISEX-92 study on the effect of additive noise on automatic speech recognition (Malvern, U.K., 1992). (Available from NOISEX-92 CD-ROMS).
- A Martin, G Doddington, T Kamm, M Ordowski, in Proceedings of the European Conference on Speech communication and Technology. The DET curve in assessment of detection task performance, (1997), pp. 1895–1898.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.