A sub-band-based feature reconstruction approach for robust speaker recognition

Yan, Furong; Zhang, Yanbin; Yan, Jiachang

doi:10.1186/s13636-014-0040-7

Research
Open access
Published: 21 October 2014

EURASIP Journal on Audio, Speech, and Music Processing volume 2014, Article number: 40 (2014)

Abstract

Although automatic speaker and speech recognition have been studied extensively over the past decades, lack of robustness remains a major challenge. The missing data technique (MDT) is a promising approach, but its performance depends on the correlation across frequency bands. This paper presents a new reconstruction method for feature enhancement that exploits this dependence. The degree of concentration across frequency bands is measured with principal component analysis (PCA). Theoretical analysis and experimental results show that the correlation of the feature vector extracted from a sub-band (SB) is much stronger than that of the vector extracted from the full-band (FB). Thus, rather than treating the spectral features as a whole, this paper splits the full-band into sub-bands and reconstructs the spectral features extracted from each SB individually, based on MDT. The reconstructed features from all sub-bands are then recombined to yield the conventional mel-frequency cepstral coefficients (MFCC) for recognition experiments. The 2-sub-band reconstruction approach is evaluated in a speaker recognition system. The results show that the proposed approach outperforms full-band reconstruction in terms of recognition performance in all noise conditions. Finally, we discuss the selection of the frequency division for the recognition task. When the FB is divided into too many sub-bands, some of the correlations across frequency channels are lost. Consequently, efficient divisions need to be investigated to improve recognition performance further.

1 Introduction

The performance of speaker or speech recognition systems degrades rapidly when they operate under conditions that differ from those used for training. Therefore, accomplishing noise robustness is a key issue to make these systems deployable in real world conditions. Solutions have been presented to solve this issue, such as feature-based [1]–[3], score-based [4],[5], model-based [6]–[8], i-vectors [9], and the missing data technique (MDT) [10]–[12].

MDT can compensate for disturbances of arbitrary type. Because it operates on the time-frequency representation, it is well suited to the problem of noise mismatch [12].

In MDT, two methods have been considered for performing speech or speaker recognition with incomplete data: marginalization [13]–[15] and reconstruction [16],[17]. In marginalization, the unreliable components are discarded or integrated up to the observed values. The reconstruction method, in contrast, estimates the corrupted features using statistical methods such as minimum mean square error (MMSE) [10], maximum a posteriori (MAP), and maximum likelihood (ML). Both marginalization [11],[14] and reconstruction [10] have been applied in speaker recognition systems. However, marginalization suffers from two main drawbacks [17],[18]. First, utterance-level processing, such as mean and variance normalization, is known to improve recognition performance, but it cannot be performed with an incomplete spectrum [18]. Second, recognition has to be carried out with spectral features, although it is well known that cepstral features outperform spectral ones. Moreover, of all the methods, marginalization is assumed to have the most overhead. In contrast, if the complete reconstructed spectrogram is available, the recognizer is no longer constrained to perform recognition using spectral features, and a better-suited set of parameters can be derived from the reconstructed spectrum.

In this paper, the MAP reconstruction method [10] is used. Its efficiency depends significantly on the correlation between the spectral features. Conventional MAP reconstruction is conducted on the full-band [18],[19]. According to our analysis, the spectral vectors extracted from a sub-band are more strongly correlated than those extracted from the full-band; this is illustrated in Section 2. Based on this observation and on the sub-band idea [20]–[22], a multi-sub-band reconstruction approach is proposed to improve recognition performance. The principle is to divide the full-band into multiple sub-bands and then independently reconstruct the missing features extracted from each sub-band. After that, the features from all sub-bands are recombined to yield the typical mel-frequency cepstral coefficient (MFCC) vector.

As one of many feature enhancement methods, the proposed reconstruction approach can be used in both speaker and speech recognition systems. To evaluate its validity, this paper combines the new reconstruction method with a speaker recognition system.

This paper is organized as follows. In the next section, the theory behind the proposed reconstruction approach is analyzed. Section 3 is devoted to describing the proposed reconstruction approach. Section 4 describes the baseline experiment system and the experimental framework adopted to evaluate the proposed technique. Finally, Section 5 concludes this paper and discusses some future directions.

2 The analysis of concentration

The more concentrated the feature vector is, the higher its redundancy, that is, the greater the correlation among its components [23]. In this paper, the degree of concentration is measured with principal component analysis (PCA).

In this paper, the P-dimensional mel log-spectral vector is used for reconstruction. Mel filters are used to represent the spectrum of a frame as a P-dimensional log-spectral vector (termed the full-band feature vector). The frequency region (0, f_s/2), where f_s is the sampling frequency, is divided into C sub-bands. Let P_i denote the number of mel filters corresponding to the i th sub-band. Then,

\sum_{i = 1}^{C} P_{i} = P

(1)

Corresponding to the t th frame and i th sub-band, the output of mel filters (termed as the i th sub-band feature vector) is represented as follows:

{\vec{Y}}_{i}^{t} = {(y (1, t), \cdots, y (P_{i}, t))}^{T}

(2)

In order to analyze the degree of concentration of the feature vector ${\vec{Y}}_{i}^{t}$ , the eigenvalues of the associated covariance matrix Θ_i are calculated and arranged in descending order, denoted $[λ_{i, 1}, λ_{i, 2}, \cdots, λ_{i, P_{i}}]$ .

To quantify how tightly the i th sub-band feature vector ${\vec{Y}}_{i}^{t}$ is concentrated in the P_i-dimensional space, the so-called concentration level $M_{R}^{i} (r)$ is introduced and computed as follows:

R_{i} (m) = \frac{\sum_{l = 1}^{m} λ_{i, l}}{\sum_{l = 1}^{P_{i}} λ_{i, l}}, \quad m = 1, 2, \cdots, P_{i}

(3)

M_{R}^{i} (r) = \min \{ m : R_{i} (m) > r \}

(4)

That is, R_i(m) is the cumulative contribution rate of the first m principal components. The concentration level $M_{R}^{i} (r)$ is the minimum m for which R_i(m)>r, where r is a predefined concentration coefficient.

For a given r, a smaller $M_{R}^{i} (r)$ implies that the i th sub-band feature vector is confined to a smaller number of principal directions, and therefore its components are more strongly correlated with each other.

The degree of concentration of the full-band feature vector can be analyzed in the same manner.
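The computation in Equations 3 and 4 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names and the input layout (one feature vector per row) are assumptions made here, and the covariance matrix is simply the sample covariance of the supplied frames.

```python
import numpy as np

def contribution_rate(features):
    """Cumulative contribution rate R(m) of the principal components.

    features: array of shape (num_frames, num_channels), one mel
    log-spectral vector per row (assumed layout).
    """
    cov = np.atleast_2d(np.cov(features, rowvar=False))  # covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]              # descending order
    eigvals = np.clip(eigvals, 0.0, None)                # guard tiny negatives
    return np.cumsum(eigvals) / np.sum(eigvals)          # R(m), m = 1..P_i

def concentration_level(features, r):
    """Smallest m with R(m) > r, for a concentration coefficient 0 < r < 1."""
    rates = contribution_rate(features)
    above = np.nonzero(rates > r)[0]
    if above.size == 0:          # only possible through rounding when r is near 1
        return len(rates)
    return int(above[0]) + 1     # convert 0-based index to component count
```

Calling `concentration_level` once on the columns of a sub-band and once on all columns gives the two quantities compared in this section.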

The cumulative contribution rate of the first m principal components for the 4-sub-band and full-band cases is shown in Figure 1. The concentration level of each sub-band in the 4-sub-band case is smaller than that of the full-band.

The relationship between redundancy and prediction accuracy is best visualized using 2-dimensional examples, as shown in Figure 2. The examples show feature vectors extracted from clean and noisy utterances, together with the MAP reconstruction obtained for the noisy utterance. Babble noise at 0 dB signal-to-noise ratio (SNR) has been added to obtain the noisy utterance. Panels (a) and (b) show 2-dimensional feature vectors with different redundancies; the redundancy of the data in panel (b) is lower than that in panel (a). For the data with high redundancy, the reconstructed data lie along the first principal direction, whereas for the data with low redundancy they are scattered. In short, in the 2-dimensional case, the fatter the cloud is, the lower the prediction accuracy is.

Figure 3 shows the contribution rates of the two principal components obtained from the covariance matrix of the 2-dimensional feature vector. When the predefined concentration coefficient r is 0.9, the concentration levels corresponding to the data shown in Figure 2a,b are $M_{R}^{(high)} (r) = 1$ and $M_{R}^{(low)} (r) = 2$ , respectively.

Considering the positions of the 2-dimensional feature vectors in Figure 2 and the corresponding contribution rates, the following conclusion is obtained: the higher the redundancy of the data, that is, the greater its correlation, the smaller the corresponding concentration level. As the MAP reconstruction method relies on the correlation between the components of the feature vector, the smaller the concentration level is, the more effective the reconstruction is.
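The link between redundancy and prediction accuracy can be checked numerically on synthetic data. The sketch below assumes a zero-mean 2-dimensional Gaussian with unit variances and correlation rho; these values are purely illustrative and are not taken from the utterances behind Figures 2 and 3. For this model the conditional mean of the second component given the first is rho times the first, and the expected squared prediction error is 1 - rho².

```python
import numpy as np

def prediction_error(rho, num_samples=20000, seed=0):
    """Mean squared error when predicting x2 from x1 for a 2-D Gaussian
    with zero mean, unit variances and correlation rho (synthetic data)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal(np.zeros(2), cov, size=num_samples)
    x2_hat = rho * x[:, 0]   # conditional mean of x2 given x1
    return float(np.mean((x[:, 1] - x2_hat) ** 2))

def first_component_share(rho):
    """Contribution rate of the first principal component; the eigenvalues
    of [[1, rho], [rho, 1]] are 1 + |rho| and 1 - |rho|."""
    return (1.0 + abs(rho)) / 2.0
```

With r = 0.9, a strongly correlated pair such as rho = 0.95 has a first-component share above r (concentration level 1) and a small prediction error, while a weakly correlated pair such as rho = 0.3 needs both components (concentration level 2) and is predicted much less accurately.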

3 Multi-sub-band reconstruction for speaker recognition system

As one of many feature enhancement methods, the multi-sub-band reconstruction method in MDT can be applied in the Gaussian mixture model (GMM) [24], the SVM-GMM [25], and the universal background model (UBM)-GMM recognition system. Based on the validity of the UBM-GMM system shown in [11], the proposed reconstruction method is evaluated in a UBM-GMM speaker recognition system. In this section, the MDT-based speaker recognition system is described.

3.1 UBM-GMM model

In this paper, a speaker-independent UBM is used. A speaker-dependent model is derived from the UBM by adapting the UBM parameters to the speech material of the corresponding speaker using MAP estimation [11],[26].

3.2 Feature vector

The mel log-spectral vector and the MFCC are used in the reconstruction and recognition stages, respectively. The unreliable components are reconstructed based on the statistical relationship between the components of the log-spectral vector.

3.3 Mask estimation

In order to perform MDT, a mask is required that classifies the time-frequency (T-F) units into reliable and unreliable components. Various strategies have been proposed to estimate a mask, such as SNR-based estimation [27], auditory and perceptual estimation [14],[28], classifier-based estimation [29], and deep neural network (DNN)-based estimation [30]. It is, however, outside the scope of this paper to analyze and compare all existing approaches. Because the focus of this paper is on robustly identifying speakers in the presence of noise, the mask m(t,k) is determined by estimating the local SNR in individual T-F units; that is, an SNR-based mask estimation method decides whether a T-F unit is reliable:

m (t, k) = \begin{cases} 0, & \text{if } {|\hat{\vec{S}} (t, k)|}^{2} \leq {|\hat{\vec{N}} (t, k)|}^{2} \\ 1, & \text{otherwise} \end{cases}

(5)

where ${|\hat{\vec{S}} (t, k)|}^{2}$ and ${|\hat{\vec{N}} (t, k)|}^{2}$ represent the k th frequency band of the estimated power spectra of speech and noise, respectively, in individual T-F units. Note that the estimation of the speech and noise components is carried out in the spectral domain, before the mel filters are applied.

The estimate of the noise spectrum is derived from the noisy signal spectrum using the method in [31]. The estimate of the speech spectrum ${|\hat{\vec{S}} (t, k)|}^{2}$ is derived by subtracting the estimated noise spectrum ${|\hat{\vec{N}} (t, k)|}^{2}$ from the corrupted signal spectrum. In this paper, this is accomplished by spectral subtraction, applying the SNR-dependent gain function MMSE log-STSA [32] in the frequency domain.
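A minimal sketch of the mask in Equation 5 follows. It deliberately swaps in plain power-spectral subtraction for the MMSE log-STSA gain function used in the paper, and it takes the noise estimate as a given input rather than implementing the estimator of [31]; the function name and array layout are assumptions for illustration.

```python
import numpy as np

def snr_mask(noisy_power, noise_power):
    """Binary reliability mask following Equation 5.

    noisy_power, noise_power: arrays of shape (num_frames, num_bins)
    holding the power spectrum of the noisy signal and the estimated
    noise power spectrum (assumed layout).
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    # Stand-in for the MMSE log-STSA gain: plain power subtraction,
    # floored at zero so the speech estimate is never negative.
    speech_power = np.maximum(noisy_power - noise_power, 0.0)
    # m(t, k) = 0 where estimated speech power <= noise power, else 1.
    return (speech_power > noise_power).astype(int)
```

For example, a frame with noisy power `[4, 1, 10]` and noise estimate `[1, 1, 6]` yields speech estimates `[3, 0, 4]`, so only the first bin is marked reliable.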

3.4 MAP estimation for unreliable components

In MAP estimation, the unreliable components are estimated by maximizing their likelihood conditioned on the reliable components [18]:

{\hat{\vec{x}}}_{u} = arg max_{{\vec{x}}_{u}} p ({\vec{x}}_{u} | {\vec{x}}_{r}, \vec{μ}, Θ)

(6)

A feature vector $\vec{x} \in ℜ^{P_{j} * 1}$ is divided into reliable and unreliable components based on SNR-based mask estimation method.

{\vec{x}}_{r} \in ℜ^{D_{1} * 1}, \quad {\vec{x}}_{u} \in ℜ^{D_{2} * 1}, \quad D_{1} + D_{2} = P_{j}

(7)

\vec{x} = [{\vec{x}}_{r}, {\vec{x}}_{u}]

(8)

Assume that $p (\vec{x}; \vec{μ}, Θ)$ is the probability density function (pdf) of a Gaussian distribution with mean vector μ and covariance matrix Θ. By the properties of the Gaussian distribution, $p ({\vec{x}}_{r}; \vec{μ}, Θ)$ and $p ({\vec{x}}_{u}; \vec{μ}, Θ)$ are then also Gaussian [33]. Consequently,

\vec{μ} = [{\vec{μ}}_{r}, {\vec{μ}}_{u}]

(9)

Θ = [\begin{matrix} Θ_{rr} & Θ_{ru} \\ Θ_{ur} & Θ_{uu} \end{matrix}]

(10)

\begin{matrix} p ({\vec{x}}_{r}, {\vec{μ}}_{r}, Θ_{rr}) = C_{1} exp [- 0.5 {({\vec{x}}_{r} - {\vec{μ}}_{r})}^{T} Θ_{rr}^{- 1} ({\vec{x}}_{r} - {\vec{μ}}_{r})] \end{matrix}

(11)

\begin{matrix} p ({\vec{x}}_{u}, {\vec{x}}_{r}, \vec{μ}, Θ) = C_{2} exp [- 0.5 {(\vec{x} - \vec{μ})}^{T} Θ^{- 1} (\vec{x} - \vec{μ})] \end{matrix}

(12)

where C_1 and C_2 are normalizing constants, Θ_ru is the cross-covariance between ${\vec{x}}_{r}$ and ${\vec{x}}_{u}$ , and $Θ_{ru} = Θ_{ur}^{T}$ .

It can now be shown that $p ({\vec{x}}_{u} | {\vec{x}}_{r}, \vec{μ}, Θ)$ is given by

\begin{array}{l} p ({\vec{x}}_{u} | {\vec{x}}_{r}, \vec{μ}, Θ) = \frac{p ({\vec{x}}_{u}, {\vec{x}}_{r}, \vec{μ}, Θ)}{p ({\vec{x}}_{r}, {\vec{μ}}_{r}, Θ_{rr})} \\ = C exp [- 0.5 {({\vec{x}}_{u} - {\vec{μ}}_{u} - Θ_{ur} Θ_{rr}^{- 1} ({\vec{x}}_{r} - {\vec{μ}}_{r}))}^{T} {(Θ_{uu} - Θ_{ur} Θ_{rr}^{- 1} Θ_{ru})}^{- 1} ({\vec{x}}_{u} - {\vec{μ}}_{u} - Θ_{ur} Θ_{rr}^{- 1} ({\vec{x}}_{r} - {\vec{μ}}_{r}))] \end{array}

(13)

where C is a normalizing constant. Since this conditional density is Gaussian, it is maximized at its mean, so the following estimate is obtained from Equations 11, 12, and 13:

\begin{matrix} {\hat{\vec{x}}}_{u} = arg max_{{\vec{x}}_{u}} [p ({\vec{x}}_{u} | {\vec{x}}_{r}, \vec{μ}, Θ)] = {\vec{μ}}_{u} + Θ_{ur} Θ_{rr}^{- 1} ({\vec{x}}_{r} - {\vec{μ}}_{r}) \end{matrix}

(14)
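Equation 14 translates directly into a few lines of NumPy. The sketch below is an illustration under stated assumptions, not the authors' code: the function name and boolean-mask interface are invented here, a single Gaussian (mean, cov) is assumed to be given, and the fallback to the mean when no component is reliable is a choice made for this example.

```python
import numpy as np

def map_reconstruct(x, reliable, mean, cov):
    """Fill the unreliable entries of x with the estimate of Equation 14.

    x: feature vector; reliable: boolean array, True where the component
    is reliable; mean, cov: parameters of the Gaussian model.
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    r = np.asarray(reliable, dtype=bool)
    u = ~r
    out = x.copy()
    if not u.any():
        return out                      # nothing to reconstruct
    if not r.any():
        out[u] = mean[u]                # assumption: fall back to the mean
        return out
    cov_rr = cov[np.ix_(r, r)]          # Theta_rr
    cov_ur = cov[np.ix_(u, r)]          # Theta_ur
    # mu_u + Theta_ur Theta_rr^{-1} (x_r - mu_r), computed via a linear solve
    out[u] = mean[u] + cov_ur @ np.linalg.solve(cov_rr, x[r] - mean[r])
    return out
```

With zero mean, unit variances and a covariance of 0.8 between two components, a reliable first component of 2.0 gives a reconstructed second component of 0.8 × 2.0 = 1.6, whatever corrupted value was observed there.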

Figure 4 shows the process of reconstruction. The values of the statistical parameters such as ${\vec{μ}}_{r}$ , ${\vec{μ}}_{u}$ , $Θ_{ur}$ , and Θ_rr must be learned from the training corpus. A vector is said to belong to the cluster that is most likely to have generated it. As the distribution of the vectors is assumed to be Gaussian, the cluster membership ${\hat{m}}_{\vec{x} (t)}$ of a vector $\vec{x} (t)$ is defined as

\begin{matrix} {\hat{m}}_{\vec{x} (t)} = arg max_{m} [p (m | \vec{x} (t))] = arg max_{m} [p (\vec{x} (t) | m) p (m)] \end{matrix}

(15)

and then the unreliable components of the vector are reconstructed using MAP estimation method.
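The cluster selection of Equation 15 can be sketched as follows, working in the log domain for numerical stability. The sketch assumes that per-cluster means, covariances and prior probabilities are already available and that the vector passed in is complete; how unreliable components enter the likelihood is not specified in this section, so one option (an assumption here) is to pass only the reliable sub-vector together with the matching sub-blocks of the parameters.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log density of a multivariate Gaussian evaluated at x."""
    d = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    _, logdet = np.linalg.slogdet(cov)
    quad = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(d) * np.log(2.0 * np.pi) + logdet + quad)

def cluster_membership(x, means, covs, priors):
    """Index m maximizing p(x | m) p(m), evaluated as a sum of logs."""
    scores = [log_gaussian(x, mu, th) + np.log(pm)
              for mu, th, pm in zip(means, covs, priors)]
    return int(np.argmax(scores))
```

The selected index then determines which cluster's mean and covariance are used in the MAP estimate.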

3.5 The proposed multi-sub-band reconstruction approach

Assume that P mel filters are used to smooth the N FFT magnitude coefficients. The reconstruction is conducted individually on 2 sub-bands, each consisting of P/2 consecutive channels with no band overlap (sub-band 1: channels 1 to P/2; sub-band 2: channels P/2+1 to P). The reconstruction method falls into two parts, as shown in Figure 5. In the first part, the statistical parameters (SP) used in reconstruction are trained individually for each sub-band. The steps of the second part are as follows:

(a) The speech and noise components are estimated in the spectral domain.
(b) A mask is obtained that classifies the T-F representation into reliable and unreliable components corresponding to the frequency range of the P mel filters. The above two steps are carried out before applying the mel filters.
(c) P mel filters are used to smooth the power spectrum, and then its logarithm is taken.
(d) The mel log-spectral vector is multiplied by the mask estimated in step (b).
(e) The full-band feature vector is divided into the feature vectors of the 2 sub-bands.
(f) Based on the SP trained in the first part, the feature vectors of each sub-band are reconstructed individually.
(g) The reconstructed vectors of the 2 sub-bands are recombined to yield the typical MFCC vector.
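Steps (e) to (g) can be sketched as below. This is a simplified illustration, not the system of Figure 5: it assumes one Gaussian (mean, covariance) per sub-band instead of cluster-dependent parameters, takes the mel-domain reliability mask as given, and uses an unnormalized DCT-II to obtain cepstral coefficients. All function and parameter names are hypothetical.

```python
import numpy as np

def reconstruct_band(x, reliable, mean, cov):
    """MAP estimate of the unreliable entries of one sub-band vector."""
    out = x.copy()
    u = ~reliable
    if u.any() and reliable.any():
        rr = cov[np.ix_(reliable, reliable)]
        ur = cov[np.ix_(u, reliable)]
        out[u] = mean[u] + ur @ np.linalg.solve(rr, x[reliable] - mean[reliable])
    elif u.any():
        out[u] = mean[u]   # assumption: no reliable channel, use the mean
    return out

def dct_ii(v, num_coeffs):
    """First num_coeffs terms of the unnormalized DCT-II of v."""
    p = len(v)
    k = np.arange(num_coeffs)[:, None]
    n = np.arange(p)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / p) @ v

def two_band_mfcc(log_mel, reliable, band_stats, num_coeffs):
    """Split into 2 sub-bands, reconstruct each, recombine, take the DCT.

    band_stats: [(mean_1, cov_1), (mean_2, cov_2)], trained per sub-band.
    Returns the cepstral vector and the reconstructed log-mel vector.
    """
    log_mel = np.asarray(log_mel, dtype=float)
    reliable = np.asarray(reliable, dtype=bool)
    half = len(log_mel) // 2
    bands = [slice(0, half), slice(half, len(log_mel))]
    full = np.empty_like(log_mel)
    for band, (mean, cov) in zip(bands, band_stats):
        full[band] = reconstruct_band(log_mel[band], reliable[band], mean, cov)
    return dct_ii(full, num_coeffs), full
```

Because each sub-band only conditions on its own reliable channels, a sub-band with no reliable channel cannot borrow information from the other one; this is the correlation that is given up by splitting.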

3.6 Baseline system

The system described in [11] assumes that the unreliable components are bounded between zero and the observed mel log-spectrum and that the components of the mel log-spectrum are independent; marginalization is applied to process the corrupted vector. The feature vector used in recognition is a P-dimensional mel log-spectrum. We compare the performance of the proposed system with that of this baseline system.

4 Experiments

The new reconstruction method is evaluated on a closed set of 30 speakers with 140 utterances per speaker. The sampling frequency is 16 kHz. For each speaker, 70% of the available speech material is randomly selected to train the corresponding speaker model, 7% is used for training the SP for the reconstruction stage, and the remaining 23% is used for testing.

In the training stage, we use a power-based voice activity detector (VAD) to ensure that silence frames do not affect the trained models.

Speaker recognition performance is evaluated on a subset of ten randomly selected speakers involving a total of 30 sentences per speaker (20 sentences for training speaker-dependent GMM and 10 sentences for testing). In the test phase, utterances are mixed at various SNRs with noise signals drawn from the NOISEX database [34].

Figure 6 describes the evaluation system, in which 24 mel filters are used to smooth the spectrum and the full-band is divided into 2 sub-bands (SB1: channels 1 to 12; SB2: channels 13 to 24). A 34-dimensional MFCC vector, consisting of 16 static MFCC coefficients including the 0th order coefficient and first order temporal derivatives, is used for recognition. Finally, cepstral mean normalization (CMN) is applied to improve robustness.
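The last two processing steps, appending first order temporal derivatives and applying CMN, can be sketched as follows. The paper does not state the delta window, so a simple central difference is assumed here, and the function name is hypothetical.

```python
import numpy as np

def add_deltas_and_cmn(static):
    """Append first-order deltas and apply cepstral mean normalization.

    static: array of shape (num_frames, num_static) with the static MFCCs
    of one utterance (at least two frames).
    """
    static = np.asarray(static, dtype=float)
    # Assumption: delta approximated by a central difference along time
    # (one-sided at the first and last frame).
    delta = np.gradient(static, axis=0)
    feats = np.hstack([static, delta])
    # CMN: subtract the per-utterance mean of every coefficient.
    return feats - feats.mean(axis=0, keepdims=True)
```

The output has twice as many columns as the input, and every column has zero mean over the utterance.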

4.1 Experiment 1: performance comparison between marginalization and reconstruction including full-band and 2-sub-band reconstruction

In the first experiment, we compare the performance of two systems, which use marginalization and reconstruction, respectively, to process the corrupted features, and then evaluate the validity of the proposed reconstruction method. Note that recognition has to be carried out with spectral features in the former system, whereas in the latter system MFCCs are extracted for recognition.

The detection error trade-off (DET) curves visualize the trade-off between missed detections and false alarms [35]. Figure 7 gives the recognition performance of the two systems in destroyer-engine noise at an SNR of 0 dB. The results in the figure show that cepstral features outperform spectral ones for speaker recognition.

Figure 8 shows that the recognition performance of the latter system improves when reconstruction is applied to process the corrupted features, and that the recognition accuracy obtained with the 2-sub-band reconstruction method is 5.65% higher than that obtained with the full-band reconstruction method.
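The two error rates plotted on a DET curve can be computed from trial scores as sketched below. The score lists and the acceptance rule (accept when the score is at or above the threshold) are assumptions for illustration; a DET plot additionally maps both rates onto a normal-deviate scale, which is omitted here.

```python
def det_points(target_scores, impostor_scores):
    """Return (threshold, miss_rate, false_alarm_rate) triples.

    target_scores: scores of trials where the claimed speaker is correct.
    impostor_scores: scores of trials where the claimed speaker is wrong.
    A trial is accepted when its score is >= threshold.
    """
    points = []
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        fa = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        points.append((thr, miss, fa))
    return points
```

Raising the threshold can only increase the miss rate and decrease the false alarm rate, which is the trade-off the curve makes visible.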

In order to evaluate the validity of the proposed reconstruction method in various noise types, recognition experiments are conducted in babble, factory1, pink, white, and destroyer-engine noise. The SNR-dependent recognition accuracy of the recognition system is presented in Table 1. The last table depicts the average performance over all noise conditions.

Table 1 Recognition performance of FB, 2-SB reconstruction, and marginalization in the presence of different types of noise (unit: %)

Based on the experimental results reported in Table 1, the corresponding SNR-dependent curves are shown for all noise types in Figures 9, 10, and 11.

The following observations can be made:

(a) Table 1 shows that both reconstruction methods clearly outperform the baseline system.
(b) The 2-sub-band reconstruction method performs better than the full-band method for all noise types. Recognition performance is higher at larger SNRs.
(c) For both reconstruction methods, the recognition performance in babble noise is higher than in the other four noise types in most cases.
(d) The relative improvements over full-band reconstruction are 2.55%, 1.49%, 1.10%, 1.63%, and 1.03% at SNRs of 0, 5, 10, 15, and 20 dB, respectively. The improvement is largest at an SNR of 0 dB.
(e) The improvements in recognition performance are 6.04%, 6.97%, 8.99%, 5.63%, and 11.37% in babble, factory1, pink, white, and destroyer-engine noise, respectively. The improvement is largest in destroyer-engine noise.

We analyze the relationship between reconstruction performance and the correlation of the feature vector with PCA. Table 2 shows the contribution rate of every principal component. When the concentration coefficient r is 0.95, the corresponding concentration levels are $M_{R}^{FB} (r) = 10$ , $M_{R}^{1} (r) = 6$ , and $M_{R}^{2} (r) = 5$ . Based on the conclusion of Section 2, a smaller $M_{R}^{i} (r)$ implies a stronger concentration of the feature vector. Consequently, since the correlation within each sub-band is stronger than within the full-band, the 2-sub-band reconstruction approach performs better.

Table 2 The contribution rate (%) of every principal component

The PCA results are obtained by eigenvalue decomposition of the covariance matrix, which is the matrix relevant to the reconstruction. The cumulative contribution rate of the principal components is shown in Figure 12.

4.2 Experiment 2: influence of different division ways of full-band

Experiment 1 showed that the recognition performance obtained by the proposed reconstruction method is superior to that of full-band reconstruction. The choice of the division of the full-band appears to be crucial for the sub-band reconstruction method. In order to find the best division, a series of recognition experiments is conducted. The divisions and the corresponding recognition performance are shown in Table 3.

Table 3 Different division ways of full-band and the corresponding recognition performance

These experiments are conducted in babble noise, which is highly non-stationary, at an SNR of 0 dB. The results are shown in Figure 13.

Ranked from the highest to the lowest recognition performance, the divisions are: 4 sub-bands, 2 sub-bands, 3 sub-bands, 8 sub-bands, 6 sub-bands, and 12 sub-bands. The relationship between the number of channels per sub-band and recognition performance is not obvious. In order to explain the relationship between the recognition performance and the division, we analyze the cases of 4 sub-bands and 2 sub-bands. The results are shown in Figure 14. Assume that the amount of information presented by the original data is 100%. When the full-band is divided into 4 sub-bands, the amount of information presented by the first four principal components derived from sub-band 1, sub-band 2, sub-band 3, and sub-band 4 is 94.67%, 98.79%, 98.58%, and 98.73%, respectively. However, if the full-band is divided into 2 sub-bands, the amount of information presented by the first four principal components derived from the two sub-bands is 90.61% and 93.73%, respectively. That is, the redundancy of the feature vector extracted from each sub-band is higher when the full-band is divided into 4 sub-bands.

When the full-band is divided into 12 sub-bands, the recognition performance is the lowest. This observation suggests that the correlations between the components of the feature vector are lost when the number of sub-bands becomes too large.
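The per-sub-band information figures quoted above can be computed for any equal-width division with a short routine. The sketch below is an assumption-laden illustration: it takes a matrix of mel log-spectral vectors as input, uses equal-width non-overlapping sub-bands, and does not reproduce the data behind Figure 14.

```python
import numpy as np

def band_information(features, num_bands, num_components=4):
    """Share of variance captured by the first num_components principal
    components in each of num_bands equal-width, non-overlapping sub-bands.

    features: array of shape (num_frames, num_channels).
    """
    features = np.asarray(features, dtype=float)
    num_channels = features.shape[1]
    if num_channels % num_bands:
        raise ValueError("channels must divide evenly into sub-bands")
    width = num_channels // num_bands
    shares = []
    for b in range(num_bands):
        band = features[:, b * width:(b + 1) * width]
        cov = np.atleast_2d(np.cov(band, rowvar=False))
        eigvals = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
        shares.append(float(eigvals[:num_components].sum() / eigvals.sum()))
    return shares
```

Note that for a fixed number of components the share rises mechanically as sub-bands get narrower (with 24 channels and 12 sub-bands, each sub-band has only 2 channels, so the share is 100%), so a high share in a narrow sub-band does not by itself indicate that useful cross-channel correlation has been retained.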

5 Conclusions

This paper presents a new feature enhancement method, which is evaluated in a UBM-GMM speaker recognition system. In the proposed method, reconstruction is carried out on each sub-band independently, and the reconstructed spectra are then recombined into a complete spectrum to yield the conventional MFCC for recognition. Compared with the full-band reconstruction method, the proposed reconstruction approach has been shown to achieve higher recognition performance in five noise types. The experiments also show that the recognition performance depends on how the frequency range is divided; thus, better divisions need to be developed.

The first experiment has revealed the following results. First, MFCC features outperform spectral ones for speaker recognition. Second, the recognition performance obtained by reconstruction is higher than that obtained by marginalization. Third, the recognition performance obtained by the 2-sub-band reconstruction method is superior to that of full-band reconstruction in five noise types and at all SNRs. The second experiment has shown that the frequency division influences the recognition performance.

To improve recognition performance further, an appropriate frequency division is important on the one hand. On the other hand, analyzing the distribution properties of various noise types and then accurately identifying the corrupted components is also an active research topic. Finally, research on mask estimation algorithms is required to separate reliable from unreliable components precisely.

References

[1] J Pelecanos, S Sridharan, Feature warping for robust speaker verification. ISCA Workshop Speaker Recognition, June 213–218 (2001).
[2] Reynolds DA: Channel robust speaker verification via feature mapping. ICASSP 2003, 2: 53-56.
[3] Chandran V, Ning D, Sridharan S: Speaker identification using higher order spectral phase features and their effectiveness vis-a-vis mel-cepstral features, vol. 3072. In Biometric Authentication. Springer Verlag, Berlin; 2004:1-20.
[4] Reynolds DA, Quatieri TF, Dunn RB: Speaker verification using adapted Gaussian mixture models. Digital Signal Process 2000, 10(1–3):19-41. 10.1006/dspr.1999.0361
[5] Auckenthaler R, Carey M, Lloyd-Thomas H: Score normalization for text-independent speaker verification systems. Digital Signal Process 2000, 10(1–3):42-54. 10.1006/dspr.1999.0360
[6] P Kenny, P Dumouchel, in Proc. ODYSSEY 2004-The Speaker and Language Recognition Workshop. Experiments in speaker verification using factor analysis likelihood ratios (Toledo, Spain, May 31–June 3 2004), pp. 219–226.
[7] Kenny P, Boulianne G, Ouellet P, Dumouchel P: Factor analysis simplified. ICASSP 2005, 1: 637-640.
[8] Jančovič P, Köküer M: Estimation of voicing-character of speech spectra based on spectral shape. IEEE Signal Process. Lett 2007, 14(1):66-69. 10.1109/LSP.2006.881517
[9] M McLaren, D van Leeuwen, in ICASSP. Improved speaker recognition when using i-vectors from multiple speech sources (Prague, 2011), pp. 5460–5463.
[10] González JA, Peinado AM, Ma N, Gómez AM, Barker J: MMSE-based missing-feature reconstruction with temporal modeling for robust speech recognition. IEEE Trans. Audio, Speech, Lang. Process 2013, 21(3):624-635. 10.1109/TASL.2012.2229982
[11] May T, van de Par S, Kohlrausch A: Noise-robust speaker recognition combining missing data techniques and universal background modeling. IEEE Trans. Audio, Speech, Lang. Process 2012, 20(1):108-121. 10.1109/TASL.2011.2158309
[12] Togneri R, Pullella D: An overview of speaker identification: accuracy and robustness issues. IEEE Circuits Syst. Mag. 2011, Second Quarter: 23-61. 10.1109/MCAS.2011.941079
[13] Cooke M, Green P, Josifovski L, Vizinho A: Robust automatic speech recognition with missing and unreliable acoustic data. Speech Commun. 2001, 34: 267-285. 10.1016/S0167-6393(00)00034-0
[14] Zhao X, Shao Y, Wang D: CASA-Based Robust Speaker Identification. IEEE Trans. Audio, Speech Lang. Process 2012, 20: 1608-1616. 10.1109/TASL.2012.2186803
[15] Ma N, Barker J, Christensen H, Green P: Combining speech fragment decoding and adaptive noise floor modelling. IEEE Trans. Audio Speech Lang. Process 2012, 20(3):818-827. 10.1109/TASL.2011.2165945
[16] Gemmeke JF, Hamme VanH, Cranen B, Boves L: Compressive sensing for missing data imputation in noise robust speech recognition. IEEE J. Sel. Topics Signal Process 2010, 4(2):272-287. 10.1109/JSTSP.2009.2039171
[17] JA González, AM Peinado, AM Gómez, N Ma, J Barker, in IEEE Trans. Audio Speech Lang. Process. Combining missing-data reconstruction and uncertainty decoding for robust speech recognition (Kyoto, 2012), pp. 4693–4696.
[18] Raj B, Seltzer ML, Stern RM: Reconstruction of missing features for robust speech recognition. Speech Commun. 1997, 43: 195-202.
[19] B Raj, Reconstruction of incomplete spectrograms for robust speech recognition. PhD dissertation, Pittsburgh, PA, Carnegie Mellon Univ, 2000.
[20] Raj L, Bonastre JF: Subband approach for automatic speaker recognition: optimal division of the frequency domain. In Proc. Audio and Video based Biometric Person Authentication. LNCS. Edited by: Bigün J, Chollet G, Borgefors G. Springer, Heidelberg; 1997:195-202.
[21] Besacier L, Bonastre JF, Fredouille C: Localization and selection of speaker-specific information with statistical modeling. Speech Comm. 2000, 31: 89-106. 10.1016/S0167-6393(99)00070-9
[22] Bourlard H, Dupont S: A new ASR approach based on independent processing and recombination of partial frequency bands. ICSLP 1996, 1: 426-429.
[23] J Shlens, A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, version 2, 1–13 (2005).
[24] Reynolds D, Rose R: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process 1995, 3(1):72-83. 10.1109/89.365379
[25] Campbell W, Sturim D, Reynolds D: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett 2006, 13(5):308-311. 10.1109/LSP.2006.870086
[26] Reynolds DA, Quatieri TF, Dunn RB: Speaker verification using adapted Gaussian mixture models. Digital Signal Process. 2000, 10: 19-41. 10.1006/dspr.1999.0361
[27] Raj B, Stern RM: Missing-feature approaches in speech recognition. IEEE Signal Process. Mag 2005, 22(5):101-116. 10.1109/MSP.2005.1511828
[28] Brown GJ, Wang D: Separation of Speech by Computational Auditory Scene Analysis. Springer Verlag, New York; 2005.
[29] Seltzer ML, Raj B, Stern RM: A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition. Speech Commun. 2004, 43(4):379-393. 10.1016/j.specom.2004.03.006
[30] X Zhao, Y Wang, D Wang, Robust speaker identification in noisy and reverberant conditions. IEEE Trans. Audio, Speech Lang. Process. 22, 836–845 (2014, in press).
[31] Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process 2001, 9: 504-512. 10.1109/89.928915
[32] M Brookes, VOICEBOX: Speech Processing Toolbox for MATLAB (2009). [Online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
[33] Papoulis A: Probability, Random Variables, and Stochastic Processes. Academic Press, New York; 1991.
[34] A Varga, H Steeneken, M Tomlinson, D Jones, in Tech. Rep., Speech Res. Unit, Defense Res. Agency. The NOISEX-92 study on the effect of additive noise on automatic speech recognition (Malvern, U.K., 1992). (Available from NOISEX-92 CD-ROMS).
[35] A Martin, G Doddington, T Kamm, M Ordowski, in Proceedings of the European Conference on Speech communication and Technology. The DET curve in assessment of detection task performance, (1997), pp. 1895–1898.

Author information

Authors and Affiliations

Beijing University of Posts and Telecommunications, No.10 Xitucheng Road, Beijing, 100876, China
Furong Yan, Yanbin Zhang & Jiachang Yan

Corresponding author

Correspondence to Furong Yan.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Yan, F., Zhang, Y. & Yan, J. A sub-band-based feature reconstruction approach for robust speaker recognition. J AUDIO SPEECH MUSIC PROC. 2014, 40 (2014). https://doi.org/10.1186/s13636-014-0040-7

Received: 26 April 2014
Accepted: 02 October 2014
Published: 21 October 2014
DOI: https://doi.org/10.1186/s13636-014-0040-7
