A subband-based feature reconstruction approach for robust speaker recognition
EURASIP Journal on Audio, Speech, and Music Processing volume 2014, Article number: 40 (2014)
Abstract
Although automatic speaker and speech recognition have been studied extensively over the past decades, the lack of robustness remains a major challenge. The missing data technique (MDT) is a promising approach, but its performance depends on the correlation across frequency bands. This paper presents a new reconstruction method for feature enhancement that exploits this dependence. The degree of concentration across frequency bands is measured with principal component analysis (PCA). Theoretical analysis and experimental results show that the correlation of the feature vector extracted from a subband (SB) is much stronger than that of the vector extracted from the fullband (FB). Thus, rather than treating the spectral features as a whole, this paper splits the fullband into subbands and reconstructs the spectral features extracted from each SB individually based on MDT. The reconstructed features from all subbands are then recombined to yield the conventional mel-frequency cepstral coefficients (MFCC) for recognition experiments. The 2-subband reconstruction approach is evaluated in a speaker recognition system. The results show that the proposed approach outperforms fullband reconstruction in terms of recognition performance in all noise conditions. Finally, we discuss the selection of frequency division schemes for the recognition task. When the FB is divided into too many subbands, some of the correlations across frequency channels are lost. Consequently, efficient division schemes need to be investigated to improve recognition performance further.
1 Introduction
The performance of speaker or speech recognition systems degrades rapidly when they operate under conditions that differ from those used for training. Achieving noise robustness is therefore key to making these systems deployable in real-world conditions. Proposed solutions include feature-based [1]–[3], score-based [4],[5], and model-based [6]–[8] methods, i-vectors [9], and the missing data technique (MDT) [10]–[12].
MDT can compensate for disturbances of arbitrary type; because it is based on the time-frequency representation, it is well suited to the problem of noise mismatch [12].
In MDT, two different methods have been considered to perform speech or speaker recognition with incomplete data: marginalization [13]–[15] and reconstruction [16],[17]. In marginalization, the unreliable components are discarded or integrated up to the observed values. The reconstruction method instead estimates the corrupted features using statistical methods, such as minimum mean square error (MMSE) [10], maximum a posteriori (MAP), and maximum likelihood (ML). Both marginalization [11],[14] and reconstruction [10] have been applied in speaker recognition systems. However, marginalization suffers from two main drawbacks [17],[18]. First, utterance-level processing, such as mean and variance normalization, is known to improve recognition performance, but it cannot be performed with an incomplete spectrum [18]. Second, recognition must be carried out with spectral features, although it is well known that cepstral features outperform spectral ones. Moreover, marginalization is assumed to incur the most overhead of all these methods. In contrast, once the complete reconstructed spectrogram is available, the recognizer is no longer constrained to perform recognition using spectral features, and a better-suited set of parameters can be derived from the reconstructed spectrum.
In this paper, the MAP reconstruction method [10] is used. Its efficiency depends significantly on the correlation between the spectral features. The conventional MAP reconstruction method is conducted on the fullband [18],[19]. According to our analysis, the spectral vectors extracted from a subband are more strongly correlated than those extracted from the fullband. This conclusion is illustrated in Section 2. Based on this analysis and the subband idea [20]–[22], a multi-subband reconstruction approach is proposed to improve recognition performance. The principle is to divide the fullband into multiple subbands and then independently reconstruct the missing features extracted from every subband. After that, the features from all subbands are recombined to yield the typical mel-frequency cepstral coefficient (MFCC) vector.
As one of many feature enhancement methods, the proposed reconstruction approach can be used in both speaker and speech recognition systems. To evaluate its validity, this paper applies the new reconstruction method in a speaker recognition system.
This paper is organized as follows. In the next section, the theory of the proposed reconstruction approach is analyzed. Section 3 is devoted to describing the proposed reconstruction approach. Section 4 describes the baseline experiment system and the experimental framework adopted to evaluate the proposed technique. Finally, Section 5 concludes this paper and discusses some future directions.
2 The analysis of concentration
As we know, the more concentrated the feature vector is, the higher its redundancy is, that is, the greater its correlation is [23]. The degree of concentration is measured with principal component analysis (PCA).
In this paper, the $P$-dimensional mel log-spectral vector is used for reconstruction. Mel filters are used to represent the spectrum of a frame as a $P$-dimensional log-spectral vector (termed the fullband feature vector). The frequency region $(0, f_s/2)$ is divided into $C$ subbands. Let $P_i$ denote the number of mel filters corresponding to the $i$th subband. Apparently,

$$\sum_{i=1}^{C} P_i = P.$$
Corresponding to the $t$th frame and $i$th subband, the output of the mel filters (termed the $i$th subband feature vector) is denoted by the $P_i$-dimensional vector $\vec{Y}_i^t$.
In order to analyze the degree of concentration of the feature vector $\vec{Y}_i^t$, the eigenvalues of the associated covariance matrix $\Theta_i$ are calculated and arranged in descending order, $\left[\lambda_{i,1}, \lambda_{i,2}, \cdots, \lambda_{i,P_i}\right]$.
To quantify how tightly the $i$th subband feature vector $\vec{Y}_i^t$ is concentrated in its $P_i$-dimensional space, the so-called concentration level $M_R^i(r)$ is introduced and computed as follows:

$$R_i(m) = \frac{\sum_{j=1}^{m} \lambda_{i,j}}{\sum_{j=1}^{P_i} \lambda_{i,j}}, \qquad M_R^i(r) = \min\left\{ m : R_i(m) > r \right\}.$$

That is, $R_i(m)$ is the accumulative contribution rate of the first $m$ principal components, and the concentration level $M_R^i(r)$ is the minimum $m$ for which $R_i(m) > r$, where $r$ is a predefined concentration coefficient.
For a given $r$, a smaller $M_R^i(r)$ implies that the $i$th subband feature vector is confined to a smaller number of principal directions; by the above definition, its components are therefore more strongly correlated with each other.
In the same manner, the degree of concentration of the fullband feature vector could be analyzed.
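The concentration level described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the eigenvalues are assumed to have been computed beforehand from the covariance matrix of the feature vectors, and the example eigenvalues are made up to mimic a high-redundancy and a low-redundancy 2-dimensional data set.

```python
def cumulative_contribution(eigenvalues):
    """Return R(m) for m = 1..P: share of total variance in the first m components."""
    ordered = sorted(eigenvalues, reverse=True)  # descending order, as in the text
    total = sum(ordered)
    rates, running = [], 0.0
    for lam in ordered:
        running += lam
        rates.append(running / total)
    return rates


def concentration_level(eigenvalues, r):
    """Smallest m such that R(m) > r, where r is the concentration coefficient."""
    for m, rate in enumerate(cumulative_contribution(eigenvalues), start=1):
        if rate > r:
            return m
    return len(eigenvalues)  # only reached when r >= 1


# Hypothetical eigenvalues for two 2-dimensional data sets
high_redundancy = [9.5, 0.5]  # variance concentrated on one direction
low_redundancy = [6.0, 4.0]   # variance spread over both directions
print(concentration_level(high_redundancy, 0.9))
print(concentration_level(low_redundancy, 0.9))
```

With these made-up eigenvalues and r = 0.9, one component suffices for the first set while the second needs both, which is the qualitative behavior the section attributes to high and low redundancy.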
The accumulative contribution rate of the first $m$ principal components for the 4-subband and fullband cases is shown in Figure 1. The concentration level of each subband in the 4-subband case is smaller than that of the fullband.
The relationship between redundancy and prediction accuracy is best visualized using the 2-dimensional examples shown in Figure 2. The examples involve feature vectors extracted from clean and noisy utterances, together with the MAP reconstruction obtained for the noisy utterance. Babble noise at 0 dB signal-to-noise ratio (SNR) has been added to obtain the noisy utterance. Panels (a) and (b) show 2-dimensional feature vectors with different redundancies; the redundancy of the data in panel (b) is lower than that in panel (a). The reconstructed data lie along the first principal direction for the high-redundancy data and are scattered for the low-redundancy data. In short, the fatter the cloud is, the lower the prediction accuracy is in the 2-dimensional case.
Figure 3 shows the contribution rate of the two principal components obtained from the covariance matrix of the 2-dimensional feature vector. When the predefined concentration coefficient $r$ is 0.9, the concentration levels corresponding to the data shown in Figure 2a,b are $M_R^{(\text{high})}(r) = 1$ and $M_R^{(\text{low})}(r) = 2$, respectively.
Considering the positions of the 2-dimensional feature vectors in Figure 2 and the corresponding contribution rates, together with our analysis, the following conclusion is obtained: the higher the redundancy of the data is, that is, the greater its correlation is, the smaller the corresponding concentration level is. As the MAP reconstruction method is based on the correlation between the components of the feature vector, the smaller the concentration level is, the more effective the reconstruction is.
3 Multi-subband reconstruction for speaker recognition system
As one of many feature enhancement methods, the multi-subband reconstruction method in MDT can be applied in the Gaussian mixture model (GMM) [24], the SVM-GMM [25], and the universal background model (UBM)-GMM recognition systems. Based on the validity of the UBM-GMM system shown in [11], the proposed reconstruction method is evaluated in a UBM-GMM speaker recognition system. In this section, the MDT-based speaker recognition system is described.
3.1 UBM-GMM model
In this paper, a speaker-independent UBM is used. A speaker-dependent model can be derived from the UBM by adapting the UBM parameters to the speech material of the corresponding speaker using MAP estimation [11],[26].
3.2 Feature vector
Mel log-spectral vectors and MFCC are used in the reconstruction and recognition stages, respectively. The unreliable components are reconstructed based on the statistical relationships among the components of the log-spectral vector.
3.3 Mask estimation
In order to perform MDT, a mask is required which classifies the time-frequency (TF) units into reliable and unreliable components. Various strategies have been proposed to estimate a mask, such as SNR-based estimation [27], auditory and perceptual estimation [14],[28], classifier-based estimation [29], and DNN-based estimation [30]. It is, however, outside the scope of this paper to analyze and compare all existing approaches. Because the focus of this paper is to robustly identify speakers in the presence of noise, the mask $m(t,k)$ is determined by estimating the local SNR in individual TF units. An SNR-based mask estimation method is applied to decide whether a TF unit is reliable.
Here, $\left|\hat{\vec{S}}(t,k)\right|^2$ and $\left|\hat{\vec{N}}(t,k)\right|^2$ represent the $k$th frequency band of the power spectrum of speech and noise, respectively, in individual TF units. Note that the estimation of the speech and noise components is carried out in the spectral domain, before the mel filters are applied.
The estimate of the noise spectrum is derived from the noisy signal spectrum using the method described in [31]. The estimate of the speech spectrum $\left|\hat{\vec{S}}(t,k)\right|^2$ can be derived by subtracting the estimated noise spectrum $\left|\hat{\vec{N}}(t,k)\right|^2$ from the corrupted signal spectrum. In this paper, this is accomplished by performing spectral subtraction with an SNR-dependent gain function, MMSE log-STSA [32], in the frequency domain.
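As an illustration of SNR-based mask estimation, the following sketch labels a TF unit reliable when the ratio of the estimated speech power to the estimated noise power, expressed in dB, exceeds a threshold. The function name, the 0 dB default threshold, and the toy power values are assumptions made for this example; the threshold actually used in the paper is not stated in this section, and the speech and noise estimates are taken as given.

```python
import math


def snr_mask(speech_power, noise_power, threshold_db=0.0):
    """Binary mask: 1 where the local SNR estimate exceeds threshold_db, else 0.

    speech_power and noise_power are lists of frames, each frame a list of
    per-band power estimates. threshold_db is a hypothetical setting.
    """
    eps = 1e-12  # avoids log(0) and division by zero
    mask = []
    for s_frame, n_frame in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_frame, n_frame):
            local_snr_db = 10.0 * math.log10((s + eps) / (n + eps))
            row.append(1 if local_snr_db > threshold_db else 0)
        mask.append(row)
    return mask


# Toy estimates for two frames and two frequency bands
speech = [[4.0, 0.5], [1.0, 2.0]]
noise = [[1.0, 1.0], [1.0, 0.5]]
print(snr_mask(speech, noise))
```

Units where speech and noise power are equal sit exactly at 0 dB and are treated as unreliable here because the comparison is strict; a different convention would only change the comparison operator.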
3.4 MAP estimation for unreliable components
In MAP estimation, the unreliable components are estimated by maximizing their likelihood conditioned on the reliable components [18].
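A feature vector $\vec{x} \in \Re^{P_j \times 1}$ is divided into reliable components $\vec{x}_r$ and unreliable components $\vec{x}_u$ based on the SNR-based mask estimation method:

$$\vec{x} = \begin{bmatrix} \vec{x}_r \\ \vec{x}_u \end{bmatrix}.$$

Assume that $p\left(\vec{x}; \vec{\mu}, \Theta\right)$ is the probability density function (pdf) of a Gaussian distribution with mean vector $\vec{\mu}$ and covariance matrix $\Theta$. By the properties of the Gaussian distribution, $p\left(\vec{x}_r; \vec{\mu}, \Theta\right)$ and $p\left(\vec{x}_u; \vec{\mu}, \Theta\right)$ are then also Gaussian [33]. Consequently, the mean and covariance can be partitioned as

$$\vec{\mu} = \begin{bmatrix} \vec{\mu}_r \\ \vec{\mu}_u \end{bmatrix}, \qquad \Theta = \begin{bmatrix} \Theta_{rr} & \Theta_{ru} \\ \Theta_{ur} & \Theta_{uu} \end{bmatrix},$$

where $\Theta_{ru}$ is the cross covariance between $\vec{x}_r$ and $\vec{x}_u$ and $\Theta_{ru} = \Theta_{ur}^{T}$.
It can now be shown that $p\left(\vec{x}_u \mid \vec{x}_r, \vec{\mu}, \Theta\right)$ is given by

$$p\left(\vec{x}_u \mid \vec{x}_r, \vec{\mu}, \Theta\right) = C \exp\left( -\tfrac{1}{2} \left(\vec{x}_u - \vec{\mu}_{u|r}\right)^{T} \Theta_{u|r}^{-1} \left(\vec{x}_u - \vec{\mu}_{u|r}\right) \right),$$

with $\vec{\mu}_{u|r} = \vec{\mu}_u + \Theta_{ur} \Theta_{rr}^{-1} \left(\vec{x}_r - \vec{\mu}_r\right)$ and $\Theta_{u|r} = \Theta_{uu} - \Theta_{ur} \Theta_{rr}^{-1} \Theta_{ru}$,
where $C$ is a normalizing constant. Maximizing this density with respect to $\vec{x}_u$ yields the MAP estimate of the unreliable components:

$$\hat{\vec{x}}_u = \vec{\mu}_u + \Theta_{ur} \Theta_{rr}^{-1} \left(\vec{x}_r - \vec{\mu}_r\right).$$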
Figure 4 shows the process of reconstruction. The values of the statistical parameters, such as $\vec{\mu}_r$, $\vec{\mu}_u$, $\Theta_{ur}$, and $\Theta_{rr}$, must be learned from the training corpus. A vector is said to belong to the cluster that is most likely to have generated it. As the distribution of the vectors is assumed to be Gaussian, the cluster membership $\hat{m}_{\vec{x}(t)}$ of a vector $\vec{x}(t)$ is the index of the Gaussian cluster under which the vector has the highest likelihood,
and then the unreliable components of the vector are reconstructed using the MAP estimation method.
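To make the MAP step concrete, the sketch below reconstructs the unreliable components of one vector under a single Gaussian, using the conditional-mean form x_u = mu_u + Theta_ur Theta_rr^{-1} (x_r - mu_r), which is the standard result for a jointly Gaussian vector. It is a simplified illustration under stated assumptions: it handles only one cluster (cluster selection is omitted), applies no bound on the estimate, and the function names, the fallback to the mean when no component is reliable, and the toy parameters are all hypothetical.

```python
def solve_linear(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    z = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * z[k] for k in range(row + 1, n))
        z[row] = (aug[row][n] - tail) / aug[row][row]
    return z


def map_reconstruct(x, mask, mu, theta):
    """Replace unreliable entries of x (mask == 0) by their conditional mean."""
    rel = [i for i, m in enumerate(mask) if m == 1]
    unrel = [i for i, m in enumerate(mask) if m == 0]
    if not unrel:
        return list(x)
    if not rel:
        return list(mu)  # assumption: fall back to the mean if nothing is reliable
    theta_rr = [[theta[i][j] for j in rel] for i in rel]
    diff = [x[i] - mu[i] for i in rel]
    z = solve_linear(theta_rr, diff)  # z = Theta_rr^{-1} (x_r - mu_r)
    out = list(x)
    for u in unrel:
        out[u] = mu[u] + sum(theta[u][r] * zr for r, zr in zip(rel, z))
    return out


# Toy example: two strongly correlated channels, the second one unreliable
mu = [0.0, 0.0]
theta = [[1.0, 0.8], [0.8, 1.0]]
print(map_reconstruct([2.0, -5.0], [1, 0], mu, theta))
```

The stronger the correlation between the reliable and unreliable channels (here 0.8), the more the estimate follows the observed reliable value instead of reverting to the mean, which is why the method benefits from highly correlated feature vectors.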
3.5 The proposed multi-subband reconstruction approach
Assume that $P$ mel filters are used to smooth the $N$ FFT magnitude coefficients. The reconstruction is conducted individually on 2 subbands, each consisting of $P/2$ consecutive channels, with no band overlap (subband 1: channels 1 to $P/2$; subband 2: channels $P/2+1$ to $P$). The reconstruction method falls into two parts, as shown in Figure 5. In the first part, the statistical parameters (SP) used in reconstruction are trained individually for the different subbands. The steps of the second part are as follows:

(a) The estimation of speech and noise components is carried out in the spectral domain.
(b) A mask is obtained which classifies the TF representation into reliable and unreliable components corresponding to the frequency range of the P mel filters. The above two steps are carried out before applying the mel filters.
(c) P mel filters are used to smooth the power spectrum, and then its logarithm is taken.
(d) The mel log-spectral vector is multiplied by the mask estimated in step (b).
(e) The feature vector corresponding to the fullband is divided into vectors corresponding to the 2 subbands.
(f) Based on the SP trained in the first part, the feature vectors corresponding to each subband are reconstructed individually.
(g) The reconstructed vectors of the 2 subbands are recombined to yield the typical MFCC vector.
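Steps (e) to (g) above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: the function names are hypothetical, the per-subband reconstructor is passed in as a callable, the `mean_fill` stand-in simply substitutes a stored mean where a MAP reconstructor with trained statistical parameters would be used, and the final transform is assumed to be an unnormalized DCT-II, which is the transform commonly used to obtain MFCC from log-mel energies.

```python
import math


def split_subbands(vector, num_subbands):
    """Split a vector into equal, non-overlapping runs of consecutive channels."""
    size = len(vector) // num_subbands
    if size * num_subbands != len(vector):
        raise ValueError("channel count must be divisible by num_subbands")
    return [vector[i * size:(i + 1) * size] for i in range(num_subbands)]


def dct_ii(vector, num_coeffs):
    """Unnormalized DCT-II of a log-mel vector, keeping the first num_coeffs terms."""
    n = len(vector)
    return [
        sum(v * math.cos(math.pi * k * (j + 0.5) / n) for j, v in enumerate(vector))
        for k in range(num_coeffs)
    ]


def subband_reconstruct(log_mel, mask, reconstructors, num_coeffs):
    """Steps (e)-(g): split, reconstruct each subband independently, recombine."""
    parts = split_subbands(log_mel, len(reconstructors))
    masks = split_subbands(mask, len(reconstructors))
    rebuilt = []
    for part, part_mask, reconstruct in zip(parts, masks, reconstructors):
        rebuilt.extend(reconstruct(part, part_mask))
    return dct_ii(rebuilt, num_coeffs)


def mean_fill(mean):
    """Hypothetical stand-in for a per-subband MAP reconstructor."""
    def reconstruct(part, part_mask):
        return [p if m == 1 else mu for p, m, mu in zip(part, part_mask, mean)]
    return reconstruct


# Toy 4-channel example split into 2 subbands of 2 channels each
log_mel = [1.0, 2.0, 3.0, 4.0]
mask = [1, 0, 1, 1]
reconstructors = [mean_fill([0.0, 0.5]), mean_fill([0.0, 0.0])]
mfcc = subband_reconstruct(log_mel, mask, reconstructors, num_coeffs=2)
print(mfcc)
```

Passing the reconstructors as a list keeps the statistics of each subband separate, mirroring the first part of the method in which the statistical parameters are trained per subband.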
3.6 Baseline system
The system described in [11] assumes that the unreliable components are bounded between zero and the observed mel log-spectrum and that the components of the mel log-spectrum are independent; marginalization is applied to process the corrupted vector. The feature vector used in recognition is a P-dimensional mel log-spectrum. We compare the performance of the proposed system with this baseline system.
4 Experiments
The new reconstruction method is evaluated on a closed set of 30 speakers with 140 utterances per speaker. The sampling frequency is 16 kHz. For each speaker, 70% of the available speech material is randomly selected to train the corresponding speaker model, 7% is used for training the SP for the reconstruction stage, and the remaining 23% is used for testing.
In the training stage, a power-based voice activity detector (VAD) is used to ensure that silence frames do not affect model training.
Speaker recognition performance is evaluated on a subset of ten randomly selected speakers involving a total of 30 sentences per speaker (20 sentences for training the speaker-dependent GMM and 10 sentences for testing). In the test phase, utterances are mixed at various SNRs with noise signals drawn from the NOISEX database [34].
Figure 6 describes the evaluation system, in which 24 mel filters are used to smooth the spectrum, the fullband is divided into 2 subbands (SB1: channels 1 to 12; SB2: channels 13 to 24), and a 34-dimensional MFCC vector consisting of 16 static MFCC coefficients including the 0th-order coefficient and first-order temporal derivatives is used for recognition. Finally, cepstral mean normalization (CMN) is applied to improve robustness.
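The last two feature-processing steps mentioned above, appending first-order temporal derivatives and applying CMN, can be sketched as follows. The delta formula used here (a two-point central difference with the edge frames repeated) is an assumption, since the paper does not specify one, and the function names and toy values are hypothetical. CMN itself is the usual subtraction of the utterance-level mean from each coefficient.

```python
def append_deltas(frames):
    """Append first-order temporal derivatives to each frame.

    Uses a two-point central difference with edge frames repeated; this
    particular delta formula is an assumption, not taken from the paper.
    """
    last = len(frames) - 1
    out = []
    for t, frame in enumerate(frames):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        deltas = [(n - p) / 2.0 for n, p in zip(next_frame, prev_frame)]
        out.append(list(frame) + deltas)
    return out


def cepstral_mean_normalization(frames):
    """Subtract the utterance-level mean of each coefficient from every frame."""
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# Toy utterance: three frames with two static coefficients each
static = [[1.0, 10.0], [3.0, 14.0], [7.0, 12.0]]
features = cepstral_mean_normalization(append_deltas(static))
print(len(features[0]))
```

After normalization every coefficient has zero mean over the utterance, which removes a constant offset in the cepstral domain.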
4.1 Experiment 1: performance comparison between marginalization and reconstruction including fullband and 2-subband reconstruction
In the first experiment, we compare the performance of two systems that use the marginalization and reconstruction methods, respectively, to process the corrupted features, and then evaluate the validity of the proposed reconstruction method. The key difference is that recognition has to be carried out with spectral features in the former system, whereas MFCC are extracted for recognition in the latter system.
The DET curves visualize the trade-off between missed detections and false alarms [35]. Figure 7 gives the recognition performance of the two systems in destroyer-engine noise at an SNR of 0 dB. The results in the figure show that cepstral features outperform spectral ones for speaker recognition.
Figure 8 shows that the recognition performance of the latter system improves when reconstruction is applied to process the corrupted features, and that the recognition accuracy obtained with the 2-subband reconstruction method is 5.65% higher than that obtained with the fullband reconstruction method.
In order to evaluate the validity of the proposed reconstruction method in various noise types, this paper conducts recognition experiments in babble, factory1, pink, white, and destroyer-engine noise. The SNR-dependent recognition accuracy of the recognition system is presented in Table 1. The last table depicts the average performance over all noise conditions.
Based on the experimental results reported in Table 1, the corresponding SNR-dependent curves are shown for all noise types in Figures 9, 10, and 11.
The following observations can be made:

(a) It can be observed in Table 1 that both reconstruction methods clearly outperform the baseline system.
(b) The results show that the 2-subband reconstruction method performs better than the fullband method for all noise types. The recognition performance is higher at larger SNRs.
(c) The recognition performance in babble noise is higher than in the other four noise types in most cases for both reconstruction methods.
(d) The corresponding relative improvements over fullband reconstruction are 2.55%, 1.49%, 1.10%, 1.63%, and 1.03% at an SNR of 0, 5, 10, 15, and 20 dB, respectively. Recognition performance improves the most at an SNR of 0 dB.
(e) The improvement in recognition performance is 6.04%, 6.97%, 8.99%, 5.63%, and 11.37% in babble, factory1, pink, white, and destroyer-engine noise, respectively. Recognition performance improves the most in destroyer-engine noise.
We analyze the relationship between reconstruction performance and the correlation of the feature vector with PCA. Table 2 shows the contribution rate of every principal component. When the concentration coefficient $r$ is 0.95, the corresponding concentration levels are $M_R^{\text{FB}}(r) = 10$, $M_R^{1}(r) = 6$, and $M_R^{2}(r) = 5$. Based on the conclusion drawn in Section 2, a smaller $M_R^{i}(r)$ implies a stronger concentration of the feature vector. Consequently, since the correlation within every subband is stronger than within the fullband, the performance of the 2-subband reconstruction approach is better.
The PCA result is obtained from the eigenvalue decomposition of the covariance matrix, which is also the matrix used in the reconstruction. The accumulative contribution rate of the principal components is shown in Figure 12.
4.2 Experiment 2: influence of different divisions of the fullband
Experiment 1 showed that the recognition performance obtained by the proposed reconstruction method is superior to that of fullband reconstruction. The choice of the division of the fullband appears to be crucial for the subband reconstruction method. In order to find the optimal division, this paper conducts a series of recognition experiments. The divisions and the corresponding recognition performance are shown in Table 3.
These experiments are conducted in babble noise, which is highly nonstationary, at an SNR of 0 dB. The results are shown in Figure 13.
Ranked from the highest recognition performance to the lowest, the divisions are 4 subbands, 2 subbands, 3 subbands, 8 subbands, 6 subbands, and 12 subbands. The relationship between the number of subbands and recognition performance is not obvious. In order to explain the relationship between the recognition performance and the division, we analyze the cases of 4 subbands and 2 subbands. The results are shown in Figure 14. Assume that the amount of information presented by the original data is 100%. When the fullband is divided into 4 subbands, the amount of information presented by the first four principal components derived from subband 1, subband 2, subband 3, and subband 4 is 94.67%, 98.79%, 98.58%, and 98.73%, respectively. However, if the fullband is divided into 2 subbands, the amount of information presented by the first four principal components derived from the two subbands is 90.61% and 93.73%. That is, the redundancy of the feature vector extracted from each subband is higher when the fullband is divided into 4 subbands.
When the fullband is divided into 12 subbands, the recognition performance is inferior. This observation shows that the correlations between the components of the feature vector are lost when the number of subbands becomes too large.
5 Conclusions
This paper presents a new feature enhancement method, which is evaluated in a UBM-GMM speaker recognition system. In the proposed method, the reconstruction is carried out on each subband independently, and the reconstructed spectra are then recombined into a complete spectrum to yield the conventional MFCC for recognition. Compared to the fullband reconstruction method, the recognition performance obtained by the proposed reconstruction approach has been shown to be higher in five noise types. The experiments have also shown that the recognition performance depends on how the frequency range is divided; thus, optimal division schemes need to be developed.
The first experiment has revealed the following results. First, MFCC features outperform spectral ones for speaker recognition. Second, the recognition performance obtained by reconstruction is higher than that obtained by marginalization. Third, the recognition performance obtained by the 2-subband reconstruction method is superior to that of fullband reconstruction in five noise types and at all SNRs. The second experiment has shown that different frequency divisions influence the recognition performance.
In order to achieve further improvements in recognition performance, an optimal frequency division will be very important. In addition, analyzing the distribution properties of various noise types and then accurately identifying the corrupted components is an active research topic. Finally, further research on mask estimation algorithms is required to precisely separate reliable from unreliable components.
References
J Pelecanos, S Sridharan, Feature warping for robust speaker verification. ISCA Workshop Speaker Recognition, June 213–218 (2001).
Reynolds DA: Channel robust speaker verification via feature mapping. ICASSP 2003, 2: 53–56.
Chandran V, Ning D, Sridharan S: Speaker identification using higher order spectral phase features and their effectiveness vis-a-vis mel-cepstral features, vol. 3072. In Biometric Authentication. Springer Verlag, Berlin; 2004:120.
Reynolds DA, Quatieri TF, Dunn RB: Speaker verification using adapted Gaussian mixture models. Digital Signal Process 2000, 10(1–3):19–41. 10.1006/dspr.1999.0361
Auckenthaler R, Carey M, Lloyd-Thomas H: Score normalization for text-independent speaker verification systems. Digital Signal Process 2000, 10(1–3):42–54. 10.1006/dspr.1999.0360
P Kenny, P Dumouchel, in Proc. ODYSSEY 2004 - The Speaker and Language Recognition Workshop. Experiments in speaker verification using factor analysis likelihood ratios (Toledo, Spain, May 31–June 3 2004), pp. 219–226.
Kenny P, Boulianne G, Ouellet P, Dumouchel P: Factor analysis simplified. ICASSP 2005, 1: 637–640.
Jančovič P, Köküer M: Estimation of voicing-character of speech spectra based on spectral shape. IEEE Signal Process. Lett 2007, 14(1):66–69. 10.1109/LSP.2006.881517
M McLaren, D van Leeuwen, in ICASSP. Improved speaker recognition when using i-vectors from multiple speech sources (Prague, 2011), pp. 5460–5463.
González JA, Peinado AM, Ma N, Gómez AM, Barker J: MMSE-based missing-feature reconstruction with temporal modeling for robust speech recognition. IEEE Trans. Audio, Speech, Lang. Process 2013, 21(3):624–635. 10.1109/TASL.2012.2229982
May T, van de Par S, Kohlrausch A: Noise-robust speaker recognition combining missing data techniques and universal background modeling. IEEE Trans. Audio, Speech, Lang. Process 2012, 20(1):108–121. 10.1109/TASL.2011.2158309
Togneri R, Pullella D: An overview of speaker identification: accuracy and robustness issues. IEEE Circuits Syst. Mag. 2011, Second Quarter: 23–61. 10.1109/MCAS.2011.941079
Cooke M, Green P, Josifovski L, Vizinho A: Robust automatic speech recognition with missing and unreliable acoustic data. Speech Commun. 2001, 34: 267–285. 10.1016/S0167-6393(00)00034-0
Zhao X, Shao Y, Wang D: CASA-based robust speaker identification. IEEE Trans. Audio, Speech Lang. Process 2012, 20: 1608–1616. 10.1109/TASL.2012.2186803
Ma N, Barker J, Christensen H, Green P: Combining speech fragment decoding and adaptive noise floor modelling. IEEE Trans. Audio Speech Lang. Process 2012, 20(3):818–827. 10.1109/TASL.2011.2165945
Gemmeke JF, Van hamme H, Cranen B, Boves L: Compressive sensing for missing data imputation in noise robust speech recognition. IEEE J. Sel. Topics Signal Process 2010, 4(2):272–287. 10.1109/JSTSP.2009.2039171
JA González, AM Peinado, AM Gómez, N Ma, J Barker, in IEEE Trans. Audio Speech Lang. Process. Combining missing-data reconstruction and uncertainty decoding for robust speech recognition (Kyoto, 2012), pp. 4693–4696.
Raj B, Seltzer ML, Stern RM: Reconstruction of missing features for robust speech recognition. Speech Commun. 1997, 43: 195–202.
B Raj, Reconstruction of incomplete spectrograms for robust speech recognition. PhD dissertation, Pittsburgh, PA, Carnegie Mellon Univ, 2000.
Besacier L, Bonastre JF: Subband approach for automatic speaker recognition: optimal division of the frequency domain. In Proc. Audio and Video based Biometric Person Authentication. LNCS. Edited by: Bigün J, Chollet G, Borgefors G. Springer, Heidelberg; 1997:195–202.
Besacier L, Bonastre JF, Fredouille C: Localization and selection of speaker-specific information with statistical modeling. Speech Comm. 2000, 31: 89–106. 10.1016/S0167-6393(99)00070-9
Bourlard H, Dupont S: A new ASR approach based on independent processing and recombination of partial frequency bands. ICSLP 1996, 1: 426–429.
J Shlens, A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, version 2, 1–13 (2005).
Reynolds D, Rose R: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process 1995, 3(1):72–83. 10.1109/89.365379
Campbell W, Sturim D, Reynolds D: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett 2006, 13(5):308–311. 10.1109/LSP.2006.870086
Reynolds DA, Quatieri TF, Dunn RB: Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 2000, 10: 19–41. 10.1006/dspr.1999.0361
Raj B, Stern RM: Missing-feature approaches in speech recognition. IEEE Signal Process. Mag 2005, 22(5):101–116. 10.1109/MSP.2005.1511828
Brown GJ, Wang D: Separation of Speech by Computational Auditory Scene Analysis. Springer Verlag, New York; 2005.
Seltzer ML, Raj B, Stern RM: A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition. Speech Commun. 2004, 43(4):379–393. 10.1016/j.specom.2004.03.006
X Zhao, Y Wang, D Wang, Robust speaker identification in noisy and reverberant conditions. IEEE Trans. Audio, Speech Lang. Process. 22, 836–845 (2014, in press).
Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process 2001, 9: 504–512. 10.1109/89.928915
M Brookes, VOICEBOX: Speech Processing Toolbox for MATLAB (2009). [Online] Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
Papoulis A: Probability, Random Variables, and Stochastic Processes. Academic Press, New York; 1991.
A Varga, H Steeneken, M Tomlinson, D Jones, in Tech. Rep., Speech Res. Unit, Defense Res. Agency. The NOISEX-92 study on the effect of additive noise on automatic speech recognition (Malvern, U.K., 1992). (Available from NOISEX-92 CD-ROMs).
A Martin, G Doddington, T Kamm, M Ordowski, in Proceedings of the European Conference on Speech Communication and Technology. The DET curve in assessment of detection task performance, (1997), pp. 1895–1898.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Yan, F., Zhang, Y. & Yan, J. A subband-based feature reconstruction approach for robust speaker recognition. J AUDIO SPEECH MUSIC PROC. 2014, 40 (2014). https://doi.org/10.1186/s13636-014-0040-7
DOI: https://doi.org/10.1186/s13636-014-0040-7