Auditory Sparse Representation for Robust Speaker Recognition Based on Tensor Structure
- Qiang Wu^{1} and
- Liqing Zhang^{1}
https://doi.org/10.1155/2008/578612
© Q. Wu and L. Zhang. 2008
Received: 31 December 2007
Accepted: 29 September 2008
Published: 2 November 2008
Abstract
This paper investigates the problem of speaker recognition in noisy conditions. A new approach called nonnegative tensor principal component analysis (NTPCA) with sparse constraint is proposed for speech feature extraction. We encode speech as a general higher-order tensor in order to extract discriminative features in the spectrotemporal domain. First, speech signals are represented by cochlear features based on the frequency selectivity characteristics of the basilar membrane and inner hair cells; then, low-dimensional sparse features are extracted by NTPCA for robust speaker modeling. The useful information of each subspace in the higher-order tensor can be preserved. An alternating projection algorithm is used to obtain a stable solution. Experimental results demonstrate that our method can increase the recognition accuracy, especially in noisy environments.
1. Introduction
Automatic speaker recognition has developed into an important technology for various speech-based applications. A traditional recognition system usually comprises two processes: feature extraction and speaker modeling. Conventional speaker modeling methods such as Gaussian mixture models (GMMs) [1] achieve very high performance for speaker identification and verification tasks on high-quality data when training and testing conditions are well controlled. However, in many practical applications, such systems generally cannot achieve satisfactory performance for the large variety of speech signals corrupted by adverse conditions such as environmental noise and channel distortions.
As is well known, the performance of a traditional GMM-based speaker recognition system degrades significantly under adverse noisy conditions, which makes it unsuitable for most real-world problems. Therefore, capturing robust and discriminative features from acoustic data becomes important. Commonly used speaker features include short-term cepstral coefficients [2, 3] such as linear predictive cepstral coefficients (LPCCs), mel-frequency cepstral coefficients (MFCCs), and perceptual linear predictive (PLP) coefficients. Recently, the main efforts have focused on reducing the effect of noise and distortion. Feature compensation techniques [4–7] such as cepstral mean normalization (CMN) and RASTA have been developed for robust speech recognition. Spectral subtraction [8, 9] and subspace-based filtering [10, 11] techniques, which assume a priori knowledge of the noise spectrum, have been widely used because of their simplicity.
Currently, computational auditory nerve models and sparse coding attract much attention from both the neuroscience and speech signal processing communities. Lewicki [12] demonstrated that efficient coding of natural sounds could provide an explanation for both the form of auditory nerve filtering properties and their organization as a population. Smith and Lewicki [13, 14] proposed an algorithm for learning efficient auditory codes using a theoretical model for coding sound in terms of spikes. Sparse coding of sound and speech [15–18] has also proved useful for auditory modeling and speech separation, providing a potential approach to robust speech feature extraction.
As a powerful data modeling tool for pattern recognition, the multilinear algebra of higher-order tensors has been proposed as a potent mathematical framework to manipulate the multiple factors underlying the observations. In order to preserve the intrinsic structure of data, higher-order tensor analysis has been applied to feature extraction. De Lathauwer et al. [19] proposed the higher-order singular value decomposition for tensor decomposition, which is a multilinear generalization of the matrix SVD. Vasilescu and Terzopoulos [20] introduced a nonlinear, multifactor model called Multilinear ICA to learn the statistically independent components of multiple factors. Tao et al. [21] applied general tensor discriminant analysis to gait recognition, which reduced the undersample problem.
In this paper, we propose a new feature extraction method for robust speaker recognition based on an auditory periphery model and tensor structure. A novel tensor analysis approach called NTPCA is derived by maximizing the covariance of data samples on the tensor structure. The benefits of our feature extraction method include the following. (1) A preprocessing step motivated by the human auditory perception mechanism provides a higher frequency resolution at low frequencies and helps to obtain robust spectrotemporal features. (2) A supervised learning procedure via NTPCA finds the projection matrices of multirelated feature subspaces, which preserve the individual, spectrotemporal information in the tensor structure. Furthermore, the maximum-variance criterion ensures that the noise component can be removed as useless information in the minor subspace. (3) The sparse constraint on NTPCA enhances the energy concentration of the speech signal, which preserves the useful features during noise reduction. The sparse tensor feature extracted by NTPCA can be further processed into a representation called auditory-based nonnegative tensor cepstral coefficients (ANTCCs), which can be used as features for speaker recognition. Finally, Gaussian mixture models [1] are employed to estimate the feature distributions and the speaker model.
The remainder of this paper is organized as follows. In Section 2, an alternating projection learning algorithm, NTPCA, is developed for feature extraction. Section 3 describes the auditory model and the sparse tensor feature extraction framework. Section 4 presents the experimental results for speaker identification on three speech datasets in noise-free and noisy environments. Finally, Section 5 gives a summary of this paper.
2. Nonnegative Tensor PCA
2.1. Principle of Multilinear Algebra
In this section, we briefly introduce multilinear algebra; details can be found in [19, 21, 22]. Multilinear algebra is the algebra of higher-order tensors. A tensor is a higher-order generalization of a matrix. Let $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$ denote a tensor. The order of $\mathcal{X}$ is $M$. An element of $\mathcal{X}$ is denoted by $\mathcal{X}_{n_1, n_2, \ldots, n_M}$, where $1 \le n_i \le N_i$ and $1 \le i \le M$. The mode-$i$ vectors of $\mathcal{X}$ are $N_i$-dimensional vectors obtained from $\mathcal{X}$ by varying index $n_i$ and keeping the other indices fixed. We introduce the following definitions relevant to this paper.
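Definition 1.1 (mode-$d$ matricizing).
The mode-$d$ matricizing of an $M$th-order tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$ is a matrix $X_{(d)} \in \mathbb{R}^{N_d \times \bar{N}_d}$, where $\bar{N}_d = \prod_{i \neq d} N_i$ and $1 \le d \le M$. The mode-$d$ matricizing of $\mathcal{X}$ is denoted as $\mathrm{mat}_d(\mathcal{X})$ or $X_{(d)}$.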
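As an illustration of the matricizing in Definition 1.1, the following minimal NumPy sketch unfolds a tensor along one mode so that every column of the result is a mode vector. The function name `matricize` and the 0-based mode index are assumptions made for this example, and the ordering of the columns is one of several conventions found in the literature.

```python
import numpy as np

def matricize(tensor, mode):
    """Unfold `tensor` along axis `mode` (0-based).

    The result has shape (N_mode, product of the other dimensions);
    each column is one mode vector of the tensor.
    """
    # Bring the chosen axis to the front, then flatten the remaining axes.
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# A small third-order tensor of size 2 x 3 x 4.
X = np.arange(24).reshape(2, 3, 4)
X1 = matricize(X, 1)  # matrix of size 3 x 8
```

Here each column of `X1` is obtained by varying the second index of `X` while the other two indices stay fixed.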
Definition 2.2 (tensor contraction).
2.2. Principal Component Analysis with Nonnegative and Sparse Constraint
where $\|\cdot\|_F^2$ is the squared Frobenius norm, the second term relaxes the orthogonality constraint of traditional PCA, the third term is the sparsity constraint, $\alpha$ is a balancing parameter between reconstruction and orthogonality, and $\beta$ controls the amount of additional sparseness required.
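To make the three terms concrete, the sketch below evaluates an objective of this shape in the form we recall from the nonnegative sparse PCA formulation of Zass and Shashua [23]: a variance term, minus an orthogonality penalty, minus a sparsity penalty. The equation itself is not reproduced in this text, so the scaling constants (1/2 and 1/4) and the names `X`, `U`, `alpha`, and `beta` are assumptions.

```python
import numpy as np

def nspca_objective(X, U, alpha, beta):
    """Evaluate a nonnegative sparse PCA style objective (assumed form).

    X     : (d, n) data matrix whose columns are zero-mean samples.
    U     : (d, k) nonnegative matrix of principal vectors.
    alpha : weight of the orthogonality penalty.
    beta  : weight of the sparsity penalty.
    """
    k = U.shape[1]
    # First term: variance captured by the projection (squared Frobenius norm).
    variance = 0.5 * np.linalg.norm(U.T @ X, "fro") ** 2
    # Second term: soft replacement for the constraint U^T U = I.
    orth_penalty = 0.25 * alpha * np.linalg.norm(np.eye(k) - U.T @ U, "fro") ** 2
    # Third term: sum of entries, which equals the L1 norm when U >= 0.
    sparsity_penalty = beta * U.sum()
    return variance - orth_penalty - sparsity_penalty
```

A larger `beta` favors vectors with fewer nonzero entries, while a larger `alpha` pushes the columns of `U` towards orthonormality.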
2.3. Nonnegative Tensor Principal Component Analysis
We evaluate the objective at the nonnegative roots of (14) and at zero, and take the best of these candidates as the nonnegative global maximum. Algorithm 1 lists the alternating projection optimization procedure for Nonnegative Tensor PCA.
Algorithm 1: Alternating projection optimization procedure for NTPCA.
Input: Training tensor , the dimensionality of the output tensors
, , , maximum number of training iterations , error threshold .
Output: The projection matrix , the output tensors .
Initialization: Set randomly, iteration index .
Step 1. Repeat until convergence
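Step 4. Iterate over every entry of the projection matrix until convergence
– Set the value of the entry to the global nonnegative maximizer of
(12) by evaluating it over all nonnegative roots of
(14) and zero;
Step 5. Check convergence: the training stage of NTPCA converges when the maximum number of training iterations is reached or the update error falls below the error threshold.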
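The per-entry update in Step 4 can be sketched as follows. Equations (12) and (14) are not reproduced in this text, so the sketch assumes, as in nonnegative sparse PCA [23], that the objective restricted to a single entry is a quartic polynomial with a negative leading coefficient, whose stationary points are the roots of a cubic. The function name and the coefficients `a4`, `a2`, and `a1` are hypothetical placeholders.

```python
import numpy as np

def nonneg_argmax_quartic(a4, a2, a1):
    """Maximize f(u) = a4*u**4 + a2*u**2 + a1*u over u >= 0.

    Assumes a4 < 0, so f tends to minus infinity as u grows and the
    maximum over u >= 0 is attained at zero or at a stationary point.
    """
    if a4 >= 0:
        raise ValueError("a4 must be negative for a finite maximum")

    def f(u):
        return a4 * u ** 4 + a2 * u ** 2 + a1 * u

    # Stationary points solve the cubic f'(u) = 4*a4*u^3 + 2*a2*u + a1 = 0.
    roots = np.roots([4.0 * a4, 0.0, 2.0 * a2, a1])
    # Keep zero plus every real, strictly positive root as a candidate.
    candidates = [0.0] + [r.real for r in roots
                          if abs(r.imag) < 1e-9 and r.real > 0.0]
    return max(candidates, key=f)
```

Sweeping this update over all entries of a projection matrix, and then over all modes, gives one pass of an alternating scheme of the kind listed in Algorithm 1.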
3. Auditory Feature Extraction Based on Tensor Structure
3.1. Feature Extraction Based on Auditory Model
We extract the features by imitating the processes that occur in the auditory periphery and pathway, such as the outer ear, middle ear, basilar membrane, inner hair cells, auditory nerves, and cochlear nucleus.
Because the outer ear and the middle ear together generate a bandpass function, we implement traditional pre-emphasis to model the combined outer and middle ear functions: the discrete-time speech signal is passed through the pre-emphasis filter to give the filtered output signal. Its purpose is to raise the energy of the frequency components located in the high-frequency domain so that the formants in that region can be extracted.
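A minimal sketch of a conventional first-order pre-emphasis filter is given below. The coefficient value is not recoverable from this text, so the default of 0.97 is an assumption (a common choice in speech front ends), as is the function name.

```python
import numpy as np

def pre_emphasis(x, coeff=0.97):
    """Apply y[t] = x[t] - coeff * x[t-1] to a 1-D signal.

    `coeff` is an assumed value; the first sample is passed through
    unchanged because it has no predecessor.
    """
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        return x
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - coeff * x[:-1]
    return y
```

Because the filter subtracts a scaled copy of the previous sample, slowly varying (low-frequency) content is attenuated relative to high-frequency content.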
Here the parameters are the order of the filter and the number of filterbanks. For each filter bank, the parameters are the center frequency, the equivalent rectangular bandwidth (ERB) of the auditory filter at that frequency, and the phase, together with two constants, one of which determines the rate of decay of the impulse response and is therefore related to bandwidth. The output of each gammatone filter is obtained by filtering the pre-emphasized signal with the corresponding impulse response.
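The gammatone filtering described here can be sketched with the standard gammatone impulse response, g(t) = a t^(n-1) exp(-2π b ERB(f_c) t) cos(2π f_c t + φ), together with the Glasberg and Moore ERB formula. The filter order 4, the constant b = 1.019, the 25 ms impulse response length, and all function names are assumptions chosen for illustration; they are common defaults, not values confirmed by this text.

```python
import numpy as np

def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone_ir(fc_hz, fs_hz, duration_s=0.025, order=4,
                 a=1.0, b=1.019, phase=0.0):
    """Sampled gammatone impulse response for one center frequency."""
    n_samples = int(round(duration_s * fs_hz))
    t = np.arange(n_samples) / fs_hz
    # Gamma-shaped envelope: b scales the decay rate through the ERB.
    envelope = a * t ** (order - 1) * np.exp(-2.0 * np.pi * b * erb(fc_hz) * t)
    return envelope * np.cos(2.0 * np.pi * fc_hz * t + phase)

def filterbank_outputs(x, center_freqs_hz, fs_hz):
    """Filter signal x with one gammatone filter per center frequency.

    Returns an array of shape (number of filters, len(x)).
    """
    x = np.asarray(x, dtype=float)
    return np.stack([np.convolve(x, gammatone_ir(fc, fs_hz))[: len(x)]
                     for fc in center_freqs_hz])
```

Direct convolution is used for clarity; practical implementations usually rely on recursive filters, such as those in the Auditory Toolbox [24].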
3.2. Sparse Representation Based on Tensor Structure
For the final feature set, we apply the discrete cosine transform (DCT) to the feature vector to reduce the dimensionality and decorrelate the feature components. A vector of cepstral coefficients is obtained by multiplying the sparse feature representation by the discrete cosine transform matrix.
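The DCT step can be sketched as a matrix product. The example below builds an orthonormal DCT-II matrix and keeps the first 16 coefficients of a 32-dimensional feature vector, matching the dimensions quoted in the experiments; the choice of the DCT-II variant, its normalization, and the function names are assumptions.

```python
import numpy as np

def dct_matrix(n_coeffs, n_inputs):
    """First `n_coeffs` rows of the orthonormal DCT-II matrix."""
    k = np.arange(n_coeffs)[:, None]   # coefficient index
    n = np.arange(n_inputs)[None, :]   # input sample index
    C = np.sqrt(2.0 / n_inputs) * np.cos(np.pi * (2 * n + 1) * k / (2 * n_inputs))
    C[0, :] /= np.sqrt(2.0)            # rescale the constant row to unit norm
    return C

def cepstral_coefficients(features, n_coeffs=16):
    """Project a feature vector onto the first `n_coeffs` DCT basis vectors."""
    features = np.asarray(features, dtype=float)
    return dct_matrix(n_coeffs, features.shape[0]) @ features
```

Keeping only the leading coefficients discards the fastest-varying components of the feature vector, which is how the DCT reduces dimensionality here.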
4. Experiments and Discussion
In this section, we describe the evaluation results of a closed-set speaker identification system using the ANTCC feature. Comparisons with MFCC, LPCC, and RASTA-PLP features are also provided.
4.1. Clean Data Evaluation
The first stage is to evaluate the performance of different speaker identification methods on two clean speech datasets: Grid and TIMIT.
For the Grid dataset, there are 17 000 sentences spoken by 34 speakers (18 males and 16 females). In our experiment, the sampling rate of the speech signals was 8 kHz. For the given speech signals, we used windows of length 8000 samples (1 second) with a time duration of 20 samples (2.5 milliseconds), and 36 gammatone filters were selected. We calculated the projection matrix in the spectrotemporal domain using NTPCA after the calculation of the average firing rates in the inner hair cells. 170 sentences (5 sentences per person) were selected randomly as the training data for learning the projection matrices in different subspaces. For speaker modeling, 1700 sentences (50 sentences per person) were used as training data and 2040 sentences (60 sentences per person) were used as testing data.
TIMIT is a noise-free speech database recorded with a high-quality microphone and sampled at 16 kHz. In this paper, 70 randomly selected speakers from the train folder of TIMIT were used in the experiment. In TIMIT, each speaker produces 10 sentences; the first 7 sentences were used for training and the last 3 sentences were used for testing, which amounts to about 24 s of speech for training and 6 s for testing. For the projection matrix learning, we selected 350 sentences (5 sentences per person) as training data, and the dimension of the sparse tensor representation is 32.
Identification accuracy with different mixture numbers for clean data of Grid and TIMIT datasets.
Features | Grid, 16 mixtures (%) | Grid, 32 mixtures (%) | Grid, 64 mixtures (%) | Grid, 128 mixtures (%) | TIMIT, 16 mixtures (%) | TIMIT, 32 mixtures (%) | TIMIT, 64 mixtures (%) | TIMIT, 128 mixtures (%)
---|---|---|---|---|---|---|---|---
ANTCC | 99.9 | 100 | 100 | 100 | 96.5 | 97.62 | 98.57 | 98.7
LPCC | 100 | 100 | 100 | 100 | 97.6 | 98.1 | 98.1 | 98.1
MFCC | 100 | 100 | 100 | 100 | 98.1 | 98.1 | 98.57 | 99
PLP | 100 | 100 | 100 | 100 | 89.1 | 92.38 | 90 | 93.1
From the simulation results, we can see that all the methods perform well on the Grid dataset for every Gaussian mixture number. For the TIMIT dataset, MFCC also performs well under the testing conditions, and the ANTCC feature approaches the performance of MFCC as the Gaussian mixture number increases (98.57% for both at 64 mixtures). This may indicate that the distribution of the ANTCC feature is sparse and not smooth, which causes the performance to degrade when the Gaussian mixture number is too small. We therefore have to increase the Gaussian mixture number to fit its actual distribution.
4.2. Performance Evaluation under Different Noisy Environments
With practical applications of robust speaker identification in mind, different noise classes were considered to evaluate the performance of ANTCC against the other commonly used features, and identification accuracy was assessed again. Noise samples for the experiments were obtained from the NOISEX-92 database. The noise clips were added to clean speech obtained from the Grid and TIMIT datasets to generate testing data.
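Generating noisy test data at a prescribed SNR can be sketched as follows. The text does not state how the noise was scaled, so this is one common convention (SNR defined on average power over the whole utterance); the function name and the tiling of short noise clips are assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so that the mixture has the requested SNR in dB.

    The noise is repeated or cropped to the length of the speech, then
    scaled so that 10*log10(P_speech / P_noise) equals `snr_db`.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    if p_noise == 0.0:
        raise ValueError("noise clip has zero power")
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

For example, `mix_at_snr(clean, noise_clip, 10.0)` would produce a 10 dB test utterance from hypothetical arrays `clean` and `noise_clip`.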
4.2.1. Grid Dataset in Noisy Environments
Identification accuracy in four noisy conditions (white, pink, factory, and f16) for Grid dataset.
Noise | SNR | ANTCC (%) | GMM-UBM (%) | MFCC (%) | LPCC (%) | RASTA-PLP (%)
---|---|---|---|---|---|---
White | 0 dB | 10.29 | 3.54 | 2.94 | 2.45 | 9.8
White | 5 dB | 38.24 | 13.08 | 9.8 | 3.43 | 12.25
White | 10 dB | 69.61 | 26.5 | 24.02 | 8.82 | 24.51
White | 15 dB | 95.59 | 55.29 | 42.65 | 25 | 56.37
Pink | 0 dB | 9.31 | 10.67 | 16.67 | 7.35 | 10.29
Pink | 5 dB | 45.1 | 21.92 | 28.92 | 15.69 | 24.51
Pink | 10 dB | 87.75 | 54.51 | 49.51 | 37.25 | 49.02
Pink | 15 dB | 95.59 | 88.09 | 86.27 | 72.55 | 91.18
Factory | 0 dB | 8.82 | 11.58 | 14.71 | 9.31 | 11.27
Factory | 5 dB | 44.61 | 41.92 | 35.29 | 25 | 29.9
Factory | 10 dB | 87.75 | 60.04 | 66.18 | 52.94 | 63.24
Factory | 15 dB | 97.55 | 88.2 | 92.65 | 87.75 | 96.57
F16 | 0 dB | 9.8 | 8.89 | 7.35 | 7.84 | 12.25
F16 | 5 dB | 27.49 | 15.6 | 12.75 | 15.2 | 26.47
F16 | 10 dB | 69.12 | 45.63 | 52.94 | 36.76 | 50
F16 | 15 dB | 95.1 | 82.4 | 76.47 | 63.73 | 83.33
From the identification comparison, the performance under additive white Gaussian noise indicates that ANTCC is the best-performing feature, reaching 95.59% at an SNR of 15 dB. However, it is not recommended for SNRs of 5 dB or lower, where the identification rate falls below 40%. RASTA-PLP is the second-best feature, yet it reaches only 56.37% at 15 dB SNR, about 39 percentage points below ANTCC.
4.2.2. TIMIT Dataset in Noisy Environments
For the speaker identification experiments conducted on the TIMIT dataset with different additive noise, the general setting was almost the same as that used with the clean TIMIT dataset.
Identification accuracy in four noisy conditions (white, pink, factory, and f16) for TIMIT dataset.
Noise | SNR | ANTCC (%) | MFCC (%) | LPCC (%) | RASTA-PLP (%)
---|---|---|---|---|---
White | 0 dB | 2.9 | 1.43 | 2.38 | 2.38
White | 5 dB | 3.81 | 2.38 | 2.86 | 5.24
White | 10 dB | 29.52 | 3.33 | 6.19 | 15.71
White | 15 dB | 64.29 | 11.43 | 12.86 | 39.52
Pink | 0 dB | 2.43 | 1.43 | 3.33 | 1.43
Pink | 5 dB | 13.81 | 1.9 | 3.81 | 5.24
Pink | 10 dB | 50.95 | 8.57 | 8.1 | 27.14
Pink | 15 dB | 78.57 | 30 | 32.86 | 60.95
Factory | 0 dB | 2.43 | 1.43 | 2.76 | 1.43
Factory | 5 dB | 12.86 | 3.33 | 10.48 | 10
Factory | 10 dB | 49.52 | 21.9 | 34.29 | 46.67
Factory | 15 dB | 78.1 | 70 | 73.81 | 74.76
F16 | 0 dB | 2.9 | 2.86 | 2.33 | 1.43
F16 | 5 dB | 15.24 | 7.14 | 14.76 | 8.1
F16 | 10 dB | 47.14 | 24.76 | 28.57 | 34.76
F16 | 15 dB | 77.62 | 57.14 | 67.62 | 60.48
4.2.3. Aurora2 Dataset Evaluation Result
The Aurora2 dataset is designed to evaluate the performance of speech recognition algorithms in noisy conditions. In the training set, there are 110 speakers (55 males and 55 females) with clean and noisy speech data. In our experiments, the sampling rate of the speech signals was 8 kHz. For the given speech signals, we used a time window of length 8000 samples (1 second) with a time duration of 20 samples (2.5 milliseconds) and 36 cochlear filterbanks. As described above, we calculated the projection matrix using NTPCA after the calculation of the cochlear power feature. 550 sentences (5 sentences per person) were selected randomly as the training data for learning the projection matrices in different subspaces, and a 32-dimensional sparse tensor representation was extracted.
In order to estimate the speaker model and test the efficiency of our method, we used 5500 sentences (50 sentences per person) as training data, and 1320 sentences (12 sentences per person) mixed with different kinds of noise were used as testing data. The testing data was mixed with subway, babble, car, and exhibition hall noise at SNRs of 20 dB, 15 dB, 10 dB, and 5 dB. For the final feature set, 16 cepstral coefficients were extracted and used for speaker modeling.
Identification accuracy in four noisy conditions (subway, car noise, babble, and exhibition hall) for Aurora2 noise testing dataset.
Noise | SNR | ANTCC (%) | MFCC (%) | LPCC (%) | RASTA-PLP (%)
---|---|---|---|---|---
Subway | 5 dB | 26.36 | 2.73 | 5.45 | 14.55
Subway | 10 dB | 63.64 | 16.36 | 11.82 | 39.09
Subway | 15 dB | 75.45 | 44.55 | 34.55 | 57.27
Subway | 20 dB | 89.09 | 76.36 | 60.0 | 76.36
Babble | 5 dB | 43.27 | 16.36 | 15.45 | 22.73
Babble | 10 dB | 62.73 | 51.82 | 33.64 | 57.27
Babble | 15 dB | 78.18 | 79.09 | 66.36 | 86.36
Babble | 20 dB | 87.27 | 93.64 | 86.36 | 92.73
Car noise | 5 dB | 19.09 | 5.45 | 3.64 | 8.18
Car noise | 10 dB | 30.91 | 17.27 | 10.91 | 35.45
Car noise | 15 dB | 60.91 | 44.55 | 33.64 | 60.91
Car noise | 20 dB | 78.18 | 78.18 | 59.09 | 79.45
Exhibition hall | 5 dB | 24.55 | 1.82 | 2.73 | 13.64
Exhibition hall | 10 dB | 62.73 | 20.0 | 19.09 | 31.82
Exhibition hall | 15 dB | 85.45 | 50.0 | 44.55 | 59.09
Exhibition hall | 20 dB | 95.45 | 76.36 | 74.55 | 82.73
4.3. Discussion
In our feature extraction framework, the preprocessing method is motivated by the human auditory perception mechanism and simulates a cochlear-like peripheral auditory stage. The cochlear-like filtering uses the ERB, which compresses the information in the high-frequency region. Such a feature therefore provides a much higher frequency resolution at low frequencies, as shown in Figure 1(b).
NTPCA is applied to extract the robust feature by calculating projection matrices in multirelated feature subspaces. This method is a supervised learning procedure which preserves the individual, spectrotemporal information in the tensor structure.
Our feature extraction model is a noiseless model, and here we add sparse constraints to NTPCA. This is based on the fact that in sparse coding the energy of the signal is concentrated on a few components only, while the energy of additive noise remains uniformly spread over all the components. As in a soft-threshold operation, the absolute values of the sparse coding components are compressed towards zero. The noise is reduced while the signal is not strongly affected. We also employ the maximum-variance criterion to extract the features in the principal component subspace that are helpful for identification. The noise component is removed as useless information in the minor components subspace.
From Section 4.1, we know that the performance of ANTCC on clean speech is not better than that of the conventional features MFCC and LPCC when the speaker model is estimated with few Gaussian mixtures. The main reason is that the sparse feature does not have the smoothness property of MFCC and LPCC. We have to increase the Gaussian mixture number to fit its actual distribution.
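The soft-threshold operation mentioned above can be sketched with the standard shrinkage operator. The text does not give a formula or a threshold value, so the function name and the parameter `lam` are assumptions used only to illustrate how small components are set to zero while large ones are shrunk slightly.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink every entry of x towards zero by lam; entries with |x| <= lam become 0."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Large components survive (reduced by lam); small, noise-like ones vanish.
shrunk = soft_threshold([-3.0, -0.5, 0.0, 0.5, 3.0], 1.0)
```

If the signal energy sits in a few large components and the noise is spread thinly over all of them, this operation removes most of the noise at a small cost to the signal.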
5. Conclusions
In this paper, we presented a novel speech feature extraction framework which is robust to noise at different SNR levels. This approach is primarily data driven and is able to extract a robust speech feature called ANTCC, which is robust to different noise types and to interference of different intensities. We derived a new feature extraction method called NTPCA for robust speaker identification. The study is mainly focused on the encoding of speech based on a general higher-order tensor structure to extract the robust auditory-based feature from interrelated feature subspaces. The frequency selectivity features at the basilar membrane and inner hair cells were used to represent the speech signals in the spectrotemporal domain, and then the NTPCA algorithm was employed to extract the sparse tensor representation for robust speaker modeling. The discriminative and robust information of different speakers may be preserved after the multirelated subspace projection. Experimental results on three datasets showed that the new method improved the robustness of the feature in comparison to baseline systems trained on the same speech datasets.
Declarations
Acknowledgments
The work was supported by the National High-Tech Research Program of China (Grant no. 2006AA01Z125) and the National Science Foundation of China (Grant no. 60775007).
References
- Reynolds DA, Rose RC: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 1995, 3(1):72-83. doi:10.1109/89.365379
- Hermansky H: Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America 1990, 87(4):1738-1752. doi:10.1121/1.399423
- Rabiner LR, Juang B: Fundamentals of Speech Recognition. Prentice Hall, Upper Saddle River, NJ, USA; 1996.
- Hermansky H, Morgan N: RASTA processing of speech. IEEE Transactions on Speech and Audio Processing 1994, 2(4):578-589. doi:10.1109/89.326616
- Reynolds DA: Experimental evaluation of features for robust speaker identification. IEEE Transactions on Speech and Audio Processing 1994, 2(4):639-643. doi:10.1109/89.326623
- Mammone RJ, Zhang X, Ramachandran RP: Robust speaker recognition: a feature-based approach. IEEE Signal Processing Magazine 1996, 13(5):58-71.
- van Vuuren S: Comparison of text-independent speaker recognition methods on telephone speech with acoustic mismatch. Proceedings of the 4th International Conference on Spoken Language (ICSLP '96), October 1996, Philadelphia, Pa, USA 1788-1791.
- Berouti M, Schwartz R, Makhoul J: Enhancement of speech corrupted by acoustic noise. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '79), April 1979, Washington, DC, USA 4: 208-211.
- Wu MY, Wang DL: A two-stage algorithm for one-microphone reverberant speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 2006, 14(3):774-784.
- Hu Y, Loizou PC: A perceptually motivated subspace approach for speech enhancement. Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP '02), September 2002, Denver, Colo, USA 1797-1800.
- Hermus K, Wambacq P, Van hamme H: A review of signal subspace speech enhancement and its application to noise robust speech recognition. EURASIP Journal on Advances in Signal Processing 2007, 2007(1):195-209.
- Lewicki MS: Efficient coding of natural sounds. Nature Neuroscience 2002, 5(4):356-363. doi:10.1038/nn831
- Smith EC, Lewicki MS: Efficient coding of time-relative structure using spikes. Neural Computation 2005, 17(1):19-45. doi:10.1162/0899766052530839
- Smith EC, Lewicki MS: Efficient auditory coding. Nature 2006, 439(7079):978-982. doi:10.1038/nature04485
- Klein DJ, König P, Körding KP: Sparse spectrotemporal coding of sounds. EURASIP Journal on Applied Signal Processing 2003, 2003(7):659-667. doi:10.1155/S1110865703303051
- Kim T, Lee SY: Learning self-organized topology-preserving complex speech features at primary auditory cortex. Neurocomputing 2005, 65-66: 793-800.
- Asari H, Pearlmutter BA, Zador AM: Sparse representations for the cocktail party problem. The Journal of Neuroscience 2006, 26(28):7477-7490. doi:10.1523/JNEUROSCI.1563-06.2006
- Plumbley MD, Abdallah SA, Blumensath T, Davies ME: Sparse representations of polyphonic music. Signal Processing 2006, 86(3):417-431. doi:10.1016/j.sigpro.2005.06.007
- De Lathauwer L, De Moor B, Vandewalle J: A multilinear singular value decomposition. SIAM Journal on Matrix Analysis & Applications 2000, 21(4):1253-1278. doi:10.1137/S0895479896305696
- Vasilescu MAO, Terzopoulos D: Multilinear independent components analysis. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), June 2005, San Diego, Calif, USA 1: 547-553.
- Tao D, Li X, Wu X, Maybank SJ: General tensor discriminant analysis and Gabor features for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2007, 29(10):1700-1715.
- De Lathauwer L: Signal processing based on multilinear algebra, Ph.D. thesis. Katholieke Universiteit Leuven, Leuven, Belgium; 1997.
- Zass R, Shashua A: Nonnegative sparse PCA. In Advances in Neural Information Processing Systems. Volume 19. MIT Press, Cambridge, Mass, USA; 2007:1561-1568.
- Slaney M: Auditory toolbox: Version 2. Interval Research Corporation, 1998-010, 1998.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.