Biomimetic multi-resolution analysis for robust speaker recognition

Nemala, Sridhar Krishna; Zotkin, Dmitry N; Duraiswami, Ramani; Elhilali, Mounya

doi:10.1186/1687-4722-2012-22

Research
Open access
Published: 07 September 2012

EURASIP Journal on Audio, Speech, and Music Processing volume 2012, Article number: 22 (2012)

Abstract

Humans exhibit a remarkable ability to reliably classify sound sources in the environment even in the presence of high levels of noise. In contrast, most engineering systems suffer a drastic drop in performance when speech signals are corrupted with channel or background distortions. Our brains are equipped with elaborate machinery for speech analysis and feature extraction, which holds great lessons for improving the performance of automatic speech processing systems under adverse conditions. The work presented here explores a biologically motivated multi-resolution speaker information representation obtained by performing an intricate yet computationally efficient analysis of the information-rich spectro-temporal attributes of the speech signal. We evaluate the proposed features in a speaker verification task performed on NIST SRE 2010 data. The biomimetic approach yields significant robustness in the presence of non-stationary noise and reverberation, offering a new framework for deriving reliable features for speaker recognition and speech processing.

Introduction

In addition to the intended message, the human voice carries the unique imprint of a speaker. Just like fingerprints and faces, voice prints are biometric markers with tremendous potential for forensic, military, and commercial applications [1]. However, despite enormous advances in computing technology over the last few decades, automatic speaker verification (ASV) systems still rely heavily on training data collected in controlled environments, and most systems face a rapid degradation in performance when operating under previously unseen conditions (e.g. channel mismatch, environmental noise, or reverberation). In contrast, human perception of speech and the human ability to identify sound sources (including voices) remain quite remarkable even at relatively high distortion levels [2]. Consequently, the pursuit of human-like recognition capabilities has spurred great interest in understanding how humans perceive and process speech signals.

One of the intriguing processes taking place in the central auditory system involves ensembles of neurons with variable tuning to spectral profiles of acoustic signals. In addition to the frequency (tonotopic) organization emerging as early as the cochlea, neurons in the central auditory system (specifically in the midbrain and more prominently in the auditory cortex) exhibit tuning to a variety of filter bandwidths and shapes [3]. This elegant neural architecture provides a detailed multi-resolution analysis of the spectral sound profile, which is presumably relevant to speech and speaker recognition. Only a few studies so far have attempted to use this cortical representation in speech processing, yielding some improvements for automatic speech recognition at the expense of substantial computational complexity [4, 5]. To the best of our knowledge, no similar work has been done in ASV.

In the present report, we explore the use of a multi-resolution analysis for robust speaker verification. Our representation is simple, effective, and computationally efficient. The proposed scheme is carefully optimized to be particularly sensitive to the information-rich spectro-temporal attributes of the signal while maintaining robustness to unseen noise distortions. The choice of model parameters builds on our current knowledge of psychophysical principles of speech perception in noise [6, 7], complemented with a statistical analysis of the dependencies between spectral details of the message and speaker information. We evaluate the proposed features in an ASV system and compare them against one of the best performing systems in the NIST 2010 SRE [8] under detrimental conditions such as white noise, non-stationary additive noise, and reverberation.

The following section describes details of the proposed multi-resolution spectro-temporal model. It is followed by an analysis that motivates the choice of model parameters to maximize speaker information retention. Next, we describe the experimental setup and results. We finish with a discussion of these results and comment on potential extensions towards achieving further noise robustness.

The biomimetic multi-resolution analysis

An overview of the processing chain described in this section is presented in Figure 1.

Peripheral analysis

The speech signal is processed through a pre-emphasis stage (implemented as a first-order high-pass filter with pre-emphasis coefficient 0.97), and a time-frequency auditory spectrogram is generated using a biomimetic sound processing model described in detail in [9] and briefly summarized here (Equation 1). First, the signal s(t) undergoes a cochlear frequency analysis modeled by a bank of 128 constant-Q (Q=4) highly asymmetric bandpass filters h(t;f) equally spaced over a span of 5 1/3 octaves on a logarithmic frequency axis. The filterbank output is a spatiotemporal pattern of cochlear basilar membrane displacements y_coch(t, f) over 128 channels. Next, a lateral inhibitory network detects discontinuities in the responses across the tonotopic (frequency) axis, resulting in further enhancement of the filterbank frequency selectivity. This step is modeled as a first-order differentiation operation across the channel array followed by a half-wave rectifier and a short-term integrator. The temporal integration window is given by μ(t;τ)=e^(−t/τ)u(t) with time constant τ=10 ms, mimicking the further loss of phase-locking observed in the midbrain. This time constant controls the frame rate of the spectral vectors. Finally, a nonlinear cubic root compression of the spectrum is performed, resulting in an auditory spectrogram y(t, f):

\begin{matrix} y_{coch} (t, f) = s (t) \otimes_{t} h (t; f), \\ y_{lin} (t, f) = \max (\partial_{f} y_{coch} (t, f), 0), \\ y (t, f) = {[y_{lin} (t, f) \otimes_{t} \mu (t; \tau)]}^{1 / 3}, \end{matrix}

(1)

where ⊗_t represents convolution with respect to time. The choice of the auditory spectrogram is motivated by its neurophysiological foundation as well as its proven self-normalization and robustness properties (see [10] for full details).
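The stages of Equation 1 that follow the cochlear filterbank can be sketched as below. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the filterbank output `y_coch` is taken as given (the filter shapes h(t;f) are specified in [9] and are not reproduced here), the derivative is taken across the channel axis as the text describes, and the integrator μ(t;τ) is realized as a one-pole recursive filter read out once per 10 ms frame. The function names and the sampling rate argument `fs` are hypothetical.

```python
import numpy as np

def pre_emphasis(s, coeff=0.97):
    """First-order high-pass filter: s[n] - coeff * s[n-1]."""
    s = np.asarray(s, dtype=float)
    return np.concatenate(([s[0]], s[1:] - coeff * s[:-1]))

def auditory_spectrogram(y_coch, fs, tau=0.010, frame_len=0.010):
    """y_coch: (num_samples, num_channels) filterbank output.
    Returns an array of shape (num_frames, num_channels)."""
    y_coch = np.asarray(y_coch, dtype=float)
    num_channels = y_coch.shape[1]
    # Lateral inhibition: first-order difference across the channel axis
    # (the first channel has no lower neighbour, so its difference is zero)
    diff = np.diff(y_coch, axis=1, prepend=y_coch[:, :1])
    # Half-wave rectification
    y_lin = np.maximum(diff, 0.0)
    # Leaky integration with impulse response exp(-t / tau), sampled per frame
    decay = np.exp(-1.0 / (fs * tau))
    hop = int(round(fs * frame_len))
    state = np.zeros(num_channels)
    frames = []
    for n, row in enumerate(y_lin):
        state = decay * state + row
        if (n + 1) % hop == 0:
            frames.append(state.copy())
    # Cubic root compression
    return np.cbrt(np.array(frames).reshape(-1, num_channels))
```

The explicit loop keeps the sketch dependency-free; a production version would more likely use a vectorized recursive filter.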

Spectral cortical analysis

The auditory spectrogram is processed further in order to capture the spectral details present in each spectral slice. The processing is based on neurophysiological findings that neurons in the central auditory pathway are tuned not only to frequencies but also to spectral shapes, in particular to peaks of various widths on the log-frequency axis [3, 11, 12]. The spectral width is characterized by a parameter called scale and is measured in cycles per octave (CPO). Physiological data indicate that auditory cortex neurons are highly scale-selective, thus expanding the cochlear one-dimensional tonotopic axis onto a two-dimensional sheet that explicitly encodes tonotopy as well as spectral shape details (see Figures 1 and 2).

The cortical analysis is implemented using a bank of modulation filters operating in the Fourier domain. The algorithm processes each data frame individually. The Fourier transform of each spectral slice y(t₀, f) is multiplied by a modulation filter H_S(Ω;Ω_c) that is tuned to spectral features of scale Ω_c. The filtering operates on the magnitude of the signal. After filtering, the inverse Fourier transform is performed and the real part is taken as the new filtered slice. This process is then repeated with a number of different Ω_c, yielding a number of filtered spectrograms y(t, f;Ω_c), each with features of scale Ω_c emphasized (see Figure 1). This set of spectrograms constitutes the spectral cortical representation of the sound.

The filter H_S(Ω;Ω_c) is defined as

H_{S} (\Omega; \Omega_{c}) = {(\Omega / \Omega_{c})}^{2} e^{[1 - {(\Omega / \Omega_{c})}^{2}]}, \quad 0 \leq \Omega \leq \Omega_{max},

(2)

where Ω_max is the highest spectral modulation frequency (set at 12 CPO given our spectrogram resolution of 24 channels per octave).
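The scale filtering of Equation 2 can be sketched in a few lines of NumPy. The sketch assumes 24 channels per octave (as stated above) so that FFT bins map to cycles per octave, and it applies the filter symmetrically to |Ω| so that the inverse transform is real; the text only defines the filter for 0 ≤ Ω ≤ Ω_max, so the handling of negative modulation frequencies is an assumption. All function names are hypothetical.

```python
import numpy as np

def scale_filter(omega, omega_c):
    """Equation 2: H_S(Omega; Omega_c), which peaks at 1 when Omega == Omega_c."""
    r2 = (np.asarray(omega, dtype=float) / omega_c) ** 2
    return r2 * np.exp(1.0 - r2)

def cortical_slice(y_slice, omega_c, channels_per_octave=24):
    """Filter one spectral slice y(t0, f) at scale omega_c (cycles per octave)."""
    y_slice = np.asarray(y_slice, dtype=float)
    # Spectral modulation frequency of each FFT bin, in cycles per octave
    omega = np.fft.fftfreq(y_slice.size, d=1.0 / channels_per_octave)
    gain = scale_filter(np.abs(omega), omega_c)
    return np.real(np.fft.ifft(np.fft.fft(y_slice) * gain))

def spectral_cortical_representation(spectrogram, scales=(0.5, 1.0, 2.0, 4.0)):
    """spectrogram: (frames, channels). Returns (num_scales, frames, channels)."""
    return np.stack([
        np.array([cortical_slice(frame, oc) for frame in spectrogram])
        for oc in scales
    ])
```

With 128 channels the highest representable modulation frequency is 12 CPO, matching Ω_max in the text. A spectral ripple whose density equals Ω_c passes through with unit gain, while the mean level of the slice (Ω = 0) is removed.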

Choice of spectral parameters

The set of scales Ω_c is chosen by dividing the spectral modulation axis into equal-energy regions using a training corpus (TIMIT database [13]) as described below. Define the average spectral modulation profile $\bar{Y}(\Omega) = \langle \langle |Y(\Omega; t_{0})| \rangle_{T} \rangle_{\Psi}$ as the ensemble mean of the magnitude Fourier transform of the spectral slice y(t₀, f), averaged over all times T and over the entire speech corpus Ψ. The resulting ensemble profile (shown in Figure 3a) is then divided into M equal-energy regions Γ_k:

\begin{matrix} \Gamma_{k} = \int_{\Omega_{k}}^{\Omega_{k + 1}} \bar{Y} (\Omega) \, d\Omega, \quad \Gamma_{k} = \Gamma_{k + 1}, \quad k = 1, \dots, M - 1, \end{matrix}

(3)

where Ω_k and Ω_{k + 1} denote the lower and upper cutoffs for the k-th band, Ω₁=0, and Ω_M=4.^a This sampling scheme ensures that the high-energy regions are sampled more densely, which has the dual advantage of sampling the given modulation space with a relatively small set of scales and emphasizing high-energy signal components, which are presumably noise-robust. Setting M=5 results in cutoffs at {0.18,0.59,1.34,2.36,4}, which are approximated to the nearest values on a logarithmic scale as Ω_c={0.25,0.5,1.0,2.0,4.0}. Finally, in order to put less emphasis on message-dominant regions of the spectrum, we drop the 0.25 CPO filter, which carries mostly articulatory and formant-specific information relevant to the speech message (analysis presented in the next section). The remaining set of Ω_c={0.5,1.0,2.0,4.0} is found to be a good tradeoff between computational complexity and system performance.
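The equal-energy partition of Equation 3 and the subsequent rounding to a log-spaced grid can be sketched as follows. The sketch assumes the average modulation profile is available as samples `profile` on a grid `omega` (both hypothetical names), integrates it with the trapezoidal rule, and reads "nearest log-scale" as rounding to the nearest power of two on a log2 axis. Only the reported cutoffs, not the TIMIT profile itself, are taken from the text.

```python
import numpy as np

def equal_energy_cutoffs(omega, profile, num_bands):
    """Upper cutoffs that split the area under `profile` into equal parts."""
    omega = np.asarray(omega, dtype=float)
    profile = np.asarray(profile, dtype=float)
    # Cumulative trapezoidal integral of the profile along omega
    steps = 0.5 * (profile[1:] + profile[:-1]) * np.diff(omega)
    cumulative = np.concatenate(([0.0], np.cumsum(steps)))
    targets = cumulative[-1] * np.arange(1, num_bands + 1) / num_bands
    # Invert the cumulative energy curve at the equal-energy targets
    return np.interp(targets, cumulative, omega)

def snap_to_octave_grid(cutoffs):
    """Round each cutoff to the nearest power of two on a log2 axis."""
    return 2.0 ** np.round(np.log2(np.asarray(cutoffs, dtype=float)))

# Cutoffs reported in the text for M = 5
reported = np.array([0.18, 0.59, 1.34, 2.36, 4.0])
scales = snap_to_octave_grid(reported)
```

Applied to the reported cutoffs, the rounding step reproduces the set {0.25, 0.5, 1.0, 2.0, 4.0} given in the text.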

Temporal filtration

In this stage, the spectral cortical features are processed through a bandpass temporal modulation filter to remove information that is believed to be mostly irrelevant. It was shown in [14] that the neurons in the auditory cortex are mostly sensitive to the modulation rates between 0.5 and 12 Hz and that the same modulation range represents the information crucial for speech comprehension [7]. Accordingly, the filtering is performed by multiplying the Fourier transform of the time sequence of each spectral feature by a bandpass filter H_T(w; w_l, w_h):

\begin{matrix} H_{T} (w; w_{l}, w_{h}) = {(\alpha w)}^{2} e^{[1 - {(\alpha w)}^{2}]}, \\ \alpha = \left\{ \begin{matrix} 1 / w_{l}, & 0 \leq w < w_{l}, \\ 1 / w, & w_{l} \leq w \leq w_{h}, \\ 1 / w_{h}, & w_{h} < w \leq w_{max}, \end{matrix} \right. \end{matrix}

(4)

where w_l=0.5 Hz, w_h=12.0 Hz, w_max=1/(2t_f), and t_f=10 ms (the frame length). After filtering in the Fourier domain, the inverse Fourier transform is performed and the real part of the output forms the temporally filtered spectral cortical representation of the sound y_w(t, f;Ω_c). This operation is performed on an utterance-by-utterance basis.
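A minimal sketch of the temporal filter of Equation 4 follows. Inside the passband αw = 1, so the gain is exactly one; outside it the gain rolls off smoothly. As with the scale filter, applying the gain symmetrically to |w| is an assumption made so that the filtered trajectories stay real, and the function names are hypothetical.

```python
import numpy as np

def temporal_filter(w, w_l=0.5, w_h=12.0):
    """Equation 4: unit gain in [w_l, w_h], smooth roll-off outside."""
    w = np.abs(np.asarray(w, dtype=float))
    # alpha * w equals w / w_l below the passband, 1 inside it, w / w_h above it
    aw = np.where(w < w_l, w / w_l, np.where(w > w_h, w / w_h, 1.0))
    return aw ** 2 * np.exp(1.0 - aw ** 2)

def filter_trajectories(features, frame_len=0.010):
    """features: (frames, dims) for one utterance; filters each column over time."""
    features = np.asarray(features, dtype=float)
    # Temporal modulation rate of each FFT bin, in Hz (maximum is 1 / (2 * frame_len))
    w = np.fft.fftfreq(features.shape[0], d=frame_len)
    gain = temporal_filter(w)[:, None]
    return np.real(np.fft.ifft(np.fft.fft(features, axis=0) * gain, axis=0))
```

Because the gain at w = 0 is zero, the filter also removes the per-utterance mean of every feature trajectory.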

Cortical features

To reduce computational complexity and to allow use of state-of-the-art speaker verification machinery (which generally expects a relatively low-dimensional input), the spectral cortical representation is downsampled in frequency by a factor of 4 (Figure 1). The resulting feature representation has a dimensionality of 128 (32 auditory frequency channels multiplied by four scales used for analysis). The features are then normalized to zero mean and unit variance for each utterance, yielding the reduced set of spectrograms $\hat{y}_{w} (t, f; \Omega_{c})$. Principal component analysis is used to further reduce the feature dimensionality to 19. This number is chosen for consistency with the dimensionality of the standard Mel-Frequency Cepstral Coefficients (MFCC) feature set used for speaker recognition. The reduced features, along with their first- and second-order derivatives, form the final 57-dimensional cortical feature vector used for the speaker verification task.
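The dimensionality bookkeeping above (4 scales × 32 channels = 128, reduced to 19, then tripled to 57) can be traced with the sketch below. Several choices are assumptions not specified in the text: downsampling is done by plain decimation, the PCA basis is fitted on the utterance itself (in practice it would presumably be estimated on training data), and the derivatives are simple central differences via `np.gradient`. The function name is hypothetical.

```python
import numpy as np

def cortical_feature_vector(y_w, num_components=19, decimation=4):
    """y_w: (num_scales, frames, channels), e.g. (4, T, 128). Returns (T, 57)."""
    y_w = np.asarray(y_w, dtype=float)
    # Downsample in frequency: 128 -> 32 channels per scale
    reduced = y_w[:, :, ::decimation]
    # Stack the scales side by side: (T, 4 * 32) = (T, 128)
    feats = np.concatenate(list(reduced), axis=1)
    # Per-utterance mean and variance normalization
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-12)
    # PCA through the SVD of the zero-mean data; keep the leading components
    _, _, vt = np.linalg.svd(feats, full_matrices=False)
    static = feats @ vt[:num_components].T
    # First- and second-order time derivatives
    delta = np.gradient(static, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([static, delta, delta2])
```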

Speech information versus speaker information

The speech signal carries both speech message and speaker identity information in distinct yet overlapping components. Separation of these elements is a non-trivial task in general. In the multi-resolution framework presented above, the broadest filters (0.25 and 0.5 CPO) capture primarily the overall spectral profile and formant peaks, while the others (1, 2, and 4 CPO) reflect narrower spectral details such as harmonics and subharmonic structure. In order to select a set of scales (Ω_c) that are most relevant for the speaker recognition task, we analyze the mutual information (MI) between the feature vector (X), the speech message (Y₁), and the speaker identity (Y₂). The MI is a measure of the statistical dependence between random variables [15] and is defined for two discrete random variables X and Y_i as

I (X; Y_{i}) = \sum_{x \in X, y \in Y_{i}} p (x, y) \log_{2} \frac{p (x, y)}{p (x) p (y)} .

(5)

To estimate the MI, the continuous feature vector is quantized by dividing its support into cells of equal volume. To characterize the speech message, phoneme labels from the TIMIT corpus are first divided into four broad phoneme classes. The variable Y₁ thus takes four discrete values representing the phoneme categories: vowels, stops, fricatives, and nasals. The average MI (taken as the mean MI across all the frequency bands for a given scale) between the feature vector and the speech message is shown in Figure 3b (top) as a function of scale. For the speaker identity MI test, the TIMIT “sa1” speech utterance (“She had your dark suit in greasy wash water all year”) spoken by 100 different subjects is used; thus, Y₂ takes 100 discrete values representing the speaker. The average MI between the feature vector and the speaker identity is shown in Figure 3b (bottom), again as a function of scale.^b
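The MI estimate of Equation 5 from quantized features can be sketched as below. The sketch treats one scalar feature at a time, so "cells of equal volume" reduces to equal-width bins over the observed range; the number of cells and the function names are assumptions, since the text does not state them.

```python
import numpy as np

def quantize(feature, num_cells=8):
    """Map a scalar feature to equal-width cells over its observed range."""
    feature = np.asarray(feature, dtype=float).ravel()
    edges = np.linspace(feature.min(), feature.max(), num_cells + 1)
    return np.digitize(feature, edges[1:-1])

def mutual_information_bits(x, y):
    """Equation 5: MI in bits between two sequences of discrete labels."""
    _, xi = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, yi = np.unique(np.asarray(y).ravel(), return_inverse=True)
    # Joint histogram, normalized to an empirical joint distribution
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # empty cells contribute nothing to the sum
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))
```

As a sanity check on the scale of the values, a feature that perfectly predicts one of four equally likely phoneme classes carries 2 bits, and a feature independent of the label carries 0 bits.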

Notice that while the lower scale (0.25 CPO) clearly provides significantly more information about the underlying linguistic message, the MI peak in Figure 3b (bottom) is centered at 1 CPO, highlighting the significance of pitch and harmonically related frequency channels in representing speaker-specific information. In order to put less emphasis on message-carrying features of the speech signal, we drop the 0.25 CPO filter at the feature encoding stage for our ASV system and choose Ω_c={0.5,1.0,2.0,4.0} CPO.^c

Experiments and results

Recognition setup

Text-independent speaker verification experiments are conducted on the NIST 2010 speaker recognition evaluation (SRE) data set [8]. The extended core task of the evaluation involves 6.9 million trials broken down into nine common conditions reflecting a variety of channel mismatch scenarios [8] (see Table 1).

Table 1 List of conditions for NIST 2010 extended core task

The front end of the implemented ASV system uses either the 57-dimensional MFCC feature vector or the 57-dimensional cortical feature vector. The MFCC feature vector is computed by invoking the RASTAMAT “melfcc” function with the ‘numcep’ parameter set to 20, dropping the first (energy) component of the output, and appending first- and second-order derivatives of the resultant feature vector. The cortical feature vector is obtained as described in the previous sections. For a fair comparison between MFCC and cortical features, the MFCC features were supplemented with mean subtraction, variance normalization, and RASTA filtering [16] applied at the utterance level. Such processing parallels the temporal filtering and normalization performed on cortical features. A combination of ASR output provided by NIST and an in-house energy-based VAD system is used to drop all non-speech frames from the input data.

The back-end is a robust state-of-the-art UBM-GMM system [17, 18]. In a UBM-GMM system, each speaker’s distribution of feature vectors is modeled as a mixture of Gaussians, forming a Gaussian mixture speaker model (GMSM). In addition, a universal background model (UBM) defines a “generic” speaker. The UBM typically has hundreds of thousands of parameters and is trained on a very large amount of data (hundreds of hours of speech), which should include speech produced by a large number of individual speakers (in our case, the 2048-center diagonal-covariance UBM is trained on NIST SRE 2004, 2005, 2006, and 2008; Fisher; Switchboard-2; and Switchboard-Cellular databases). As the amount of speech available per individual speaker is typically much less than required to train the speaker model from scratch, the GMSM is produced by adapting UBM means so that the resulting model best describes the available speaker data. Finally, given the UBM, the candidate GMSM, and the audio file, the system extracts the feature vectors from the audio file and computes the log-likelihoods of these feature vectors belonging to the GMSM and to the UBM. The difference between these log-likelihoods constitutes the output score for this particular trial.
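The scoring step described above can be illustrated with a small diagonal-covariance GMM log-likelihood ratio. This is a toy sketch, not the system of [17, 18]: it assumes the speaker model shares the UBM weights and variances (consistent with adapting only the means), averages the ratio over frames (a normalization the text does not specify), and uses hypothetical function and argument names.

```python
import numpy as np

def gmm_frame_loglik(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    X: (T, D); weights: (K,); means and variances: (K, D)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_mix = np.log(weights) + log_gauss                         # (T, K)
    # Numerically stable log-sum-exp over the mixture components
    peak = log_mix.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(log_mix - peak).sum(axis=1))

def llr_score(X, speaker_means, ubm_weights, ubm_means, ubm_variances):
    """Frame-averaged log-likelihood ratio between the speaker model and the UBM."""
    spk = gmm_frame_loglik(X, ubm_weights, speaker_means, ubm_variances)
    ubm = gmm_frame_loglik(X, ubm_weights, ubm_means, ubm_variances)
    return float(np.mean(spk - ubm))
```

A positive score means the frames are better explained by the adapted speaker model than by the generic background model; a model identical to the UBM scores exactly zero.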

Our ASV system additionally employs the technique known as joint factor analysis (JFA) [19, 20]. JFA enables channel variability compensation by offsetting the channel effects, as well as more robust speaker model estimation by using a more informative prior on the speaker model distribution. To use JFA in the described framework, an alternative representation of the speaker model, a single vector Z (“supervector”), is formed by concatenating all GMSM means. JFA is trained in advance on a large annotated collection of audio files to learn the channel subspace (the basis over which Z preferentially varies when the same speaker’s voice is presented over different channels) and the speaker subspace (the basis over which Z preferentially varies when different speakers are presented over the same channel). In our system, the dimensionalities of the speaker subspace and of the channel subspace are 300 and 150, respectively. Then, when processing previously unseen data, the components of inter-speaker differences attributable to the speaker are emphasized and those attributable to the channel are canceled. This is done by projecting the corresponding supervectors into the speaker and channel subspaces, using the speaker subspace projection of Z to modify the GMSM, using the channel subspace projection of Z to modify the UBM, and performing scoring with these modified GMSM and UBM. Also, as the log-likelihood calculation is expensive, our system uses an approximation to it based on an inner product [20].

Finally, the obtained scores are subject to ZT-normalization [21], and the decision threshold corresponding to the equal error rate (EER) is chosen (separately for each condition).
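For readers unfamiliar with the EER, the sketch below shows one simple way to compute it from a set of target and non-target trial scores. It is an illustration only: the threshold sweep is quadratic in the number of trials and would be far too slow for millions of trials, the acceptance rule (score ≥ threshold) is an assumption, and the function name is hypothetical.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: operating point where miss and false-alarm rates are (nearly) equal."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    # Candidate thresholds: every observed score
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # A trial is accepted when its score is >= threshold
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest
    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])
```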

Noise conditions

Every trial in NIST SRE 2010 consists of computing the matching score between a speaker model and an audio file. To evaluate the noise robustness of the proposed cortical features, several distorted versions of these audio files are created by adding different types of noise reflecting a variety of real world scenarios:

White noise at signal-to-noise ratio (SNR) levels from 24 to 0 dB in 6 dB steps;
Babble noise (from Aurora database [22]), same SNR levels;
Subway noise (from Aurora database [22]), same SNR levels;
Simulated reverberation with RT₆₀ from 200 to 1,200 ms in steps of 200 ms.

It is important to mention that all training (UBM, JFA, and speaker model training) is done exclusively on clean data, and only the test audio files are corrupted. Note also that the train-test mismatch created by addition of noise/reverberation is superimposed on the train-test mismatch inherent to the SRE 2010 data.
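Mixing noise into a test file at a prescribed SNR can be sketched as follows. The sketch assumes the SNR is defined over the whole file from mean signal power (the text does not say whether an active-speech level measurement was used) and loops the noise recording if it is shorter than the speech. The function name is hypothetical, and reverberation is not covered.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then add it."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Loop the noise if it is shorter than the speech, then trim to length
    reps = int(np.ceil(speech.size / noise.size))
    noise = np.tile(noise, reps)[:speech.size]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# SNR grid used for the additive noise conditions: 24 to 0 dB in 6 dB steps
snr_levels_db = list(range(24, -1, -6))
```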

Results

Figure 4 shows the speaker verification performance in terms of EER for the cortical features and for the MFCC features as a function of noise type/strength and trial condition. The results clearly demonstrate that the proposed cortical features provide substantially lower EER than the MFCC as the noise level increases, indicating their robustness. The average performance for each noise type and trial condition is shown in Table 2. On average (across all conditions and all noise types), the cortical-features-based system yields a 15.9% relative EER improvement over the robust state-of-the-art MFCC system. It is worth noting that the proposed approach is outperformed by the MFCC-based approach in only 4 out of the 36 cases. Because the proposed approach incorporates both a biomimetic auditory spectrogram previously shown to exhibit some noise-robustness characteristics [10] and a multiresolution decomposition, we further investigated the contribution of each component to the reported improvements. We tested the system using the auditory spectrogram alone or an adaptation of the auditory spectrogram described here, coupled with a cepstral transformation. Neither system performed as well as the proposed multiresolution decomposition, strengthening the claim that our proposed multiresolution analysis is indeed responsible for the performance improvements shown in Table 2.

Table 2 Average ASV performance (EER, %) as a function of noise type and condition

In some ASV applications, metrics other than EER may be more relevant. For example, in certain biometric speaker verification systems the key requirement is a low false alarm rate. We present our results here in terms of two additional metrics more suitable in such cases, namely the Miss-10 and quadratic DCF (decision cost function) metrics. These two metrics were used in the NIST 2011 IARPA BEST program SRE [23]. The Miss-10 metric is defined as the false alarm rate P_FA obtained when the decision threshold is set such that the miss rate P_Miss=10%, and the quadratic DCF is defined as

DCF = C_{Miss} \times {P_{Miss}}^{2} \times P_{target} + C_{FA} \times P_{FA} \times (1 - P_{target})

(6)

with the parameter values C_Miss=100, C_FA=10, and P_target=0.01.
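Both metrics are simple to compute once the error rates are known. The sketch below evaluates Equation 6 with the parameter values given above and obtains Miss-10 by placing the threshold at the 10th percentile of the target scores; the quantile-based threshold and the function names are assumptions for illustration.

```python
import numpy as np

def quadratic_dcf(p_miss, p_fa, c_miss=100.0, c_fa=10.0, p_target=0.01):
    """Equation 6 with the parameter values given in the text."""
    return c_miss * p_miss ** 2 * p_target + c_fa * p_fa * (1.0 - p_target)

def miss10(target_scores, nontarget_scores, miss_rate=0.10):
    """False-alarm rate at the threshold where the miss rate equals 10%."""
    # Targets scoring below the threshold are misses, so use their 10th percentile
    threshold = np.quantile(np.asarray(target_scores, dtype=float), miss_rate)
    return float(np.mean(np.asarray(nontarget_scores, dtype=float) >= threshold))
```

For example, at P_Miss = 0.1 and P_FA = 0.05 the cost is 100 × 0.01 × 0.01 + 10 × 0.05 × 0.99 = 0.505, so the false-alarm term dominates, which is why this metric emphasizes the low false alarm region.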

The average verification performance for each noise type using the Miss-10 and quadratic DCF metrics is shown in Tables 3 and 4, respectively. As seen from the data, in the low false alarm region the proposed cortical features outperform the robust state-of-the-art MFCC system by an even larger margin: 28.8% relative using the Miss-10 metric and 22.6% relative using the quadratic DCF metric.

Table 3 Average ASV performance (Miss-10 metric, %) as a function of noise type and condition

Table 4 Average ASV performance (quadratic DCF metric) as a function of noise type and condition

Discussion and conclusions

In this report, we explore the applicability of a multi-resolution analysis of speech signals to ASV. This framework maps the speech signal onto a rich feature space, highlighting and separating information about the glottal excitation signal, glottal shape, vocal tract geometry, and articulatory configuration (as each of these elements is an underlying factor for features of different width located in different areas on the log-frequency axis; see e.g. [24]). The cortical representation can be viewed as a “local” variant (with respect to the log-frequency axis) of the analysis provided by MFCC. This analogy stems from the fact that MFCC roughly correspond to spectral features of different widths integrated over the whole frequency range. In this work, both the “global-integration” MFCC approach and the “local” cortical approach are tested in a state-of-the-art ASV system on the NIST SRE 2010 dataset. While both perform comparably in clean conditions, the cortical features are substantially more robust on noisy data, including non-stationary distortions as well as reverberation.

One of the intuitions behind the robustness observed in the proposed features is the fact that speech and noise generally exhibit different spectral shapes while occupying an overlapping spectral range. The expansion of the spectral axis with the multi-resolution analysis allows the extrication of some speech components from the masking noise, suppressing the noise components and providing for increased robustness. Furthermore, by highlighting the range between 0.5 and 4 CPO, the model stresses the most speaker-informative regions in the speech spectrum, which in turn map onto a modulation space to which humans are highly sensitive [7]. Such a range is also commensurate with the neurophysiological tuning observed in mammalian auditory cortex, with most neurons concentrated around a spectral tuning of the order of a few CPO [3, 14]. A similar emphasis is put on the temporal dynamics of the signal by underscoring the region between 0.5 and 12 Hz, which defines natural boundaries for speech perception in noise by human listeners [7, 25–28] and mostly coincides with the temporal tuning of mammalian cortical neurons [14]. Higher temporal modulation frequencies represent mostly the syllabic and segmental rate of speech [2].

Unlike comparable multi-resolution schemes recently developed [4, 5], the proposed approach does not involve dimension-expanded representations (close to 30,000 dimensions, which inherently require computationally expensive schemes and therefore have limited applicability). Instead, our model is constrained to lie in a perceptually relevant spectral modulation space and further uses a careful sampling scheme to encode the information with only four spectral analysis filters. This has the dual advantage of producing a feature space that is both low-dimensional and highly robust. The careful optimization of model parameters is necessary to strike a balance between simple and efficient computation and noise robustness.

Importantly, in our approach no model components have been customized in any way to deal with a specific noise condition, making it suitable for a wide range of acoustic environments. In addition, the model has been minimally customized for the speaker recognition task and can in fact provide a general framework for a variety of speech processing tasks. Our preliminary results do indeed show great robustness of a similar scheme for automatic speech recognition. It is therefore essential to emphasize that the performance obtained with the cortical features is solely a property of the features themselves and is achieved without any noise compensation techniques. Our ongoing efforts are aimed at achieving further improvements by applying the described multi-resolution cortical analysis on enhanced spectral profiles obtained using speech enhancement techniques, which involve estimation of noise characteristics in various forms [29].

Endnotes

^a We constrain the range of spectral modulations to 4 CPO, which covers more than 90% of the entire spectral modulation energy in speech and is most important for speech comprehension [7].

^bThe difference in MI levels between the speech message and speaker identity may be attributed to the observation that the speech signal encodes more information about the underlying linguistic message than about the speaker.

^cIn addition to the MI analysis, we performed an empirical test regarding use of 0.25 CPO filter. An experiment was run on clean data with Ω_c={0.25,0.5,1.0,2.0} CPO and yielded a 3.4% EER—a decrease of performance compared with 2.7% EER for the system that used Ω_c={0.5,1.0,2.0,4.0} CPO.

References

[1] Beigi H: Fundamentals of Speaker Recognition. Springer, Berlin; 2011.
[2] Greenberg S, Popper A, Ainsworth W: Speech Processing in the Auditory System. Springer, Berlin; 2004.
[3] O’Connor K, Yin P, Petkov C, Sutter M: Complex spectral interactions encoded by auditory cortical neurons: relationship between bandwidth and pattern. Front Syst. Neurosci 2010, 4: 4-145.
[4] Woojay J, Juang B: Speech analysis in a model of the central auditory system. IEEE Trans. Speech Audio Process 2007, 15: 1802-1817.
[5] Wu Q, Zhang L, Shi G: Robust speech feature extraction based on Gabor filtering and tensor factorization. Proc. IEEE Intl. Conf. Acoust. Speech Signal Proc., Taipei, Taiwan 2009, 4649-4652.
[6] Elhilali M, Chi T, Shamma SA: A spectro-temporal modulation index (STMI) for assessment of speech intelligibility. Speech Commun 2003, 41: 331-348. 10.1016/S0167-6393(02)00134-6
[7] Elliott T, Theunissen F: The modulation transfer function for speech intelligibility. PLoS Comput. Biol 2009, 5: e1000302. 10.1371/journal.pcbi.1000302
[8] NIST 2010 speaker recognition evaluation http://www.nist.gov/speech/tests/sre/2010
[9] Yang X, Wang K, Shamma SA: Auditory representations of acoustic signals. IEEE Trans. Inf. Theory 1992, 38: 824-839. 10.1109/18.119739
[10] Wang K, Shamma SA: Self-normalization noise-robustness in early auditory representations. IEEE Trans. Speech Audio Process 1994, 2: 421-435. 10.1109/89.294356
[11] Schreiner C, Calhoun B: Spectral envelope coding in cat primary auditory cortex: properties of ripple transfer functions. J. Aud. Neurosc 1995, 1: 39-61.
[12] Versnel H, Kowalski N, Shamma SA: Ripple analysis in ferret primary auditory cortex. iii. topographic distribution of ripple response parameters. J. Aud. Neurosc 1995, 1: 271-286.
[13] Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren NL: DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus. vol LDC93S1 Linguistic Data Consortium, Philadelphia; 1993.
[14] Miller L, Escabi M, Read H, Schreiner C: Spectrotemporal receptive fields in the lemniscal auditory thalamus and cortex. J. Neurophysiol 2002, 87(1):516-527.
[15] Cover T, Thomas J: Elements of Information Theory. 2nd edition. Wiley-Interscience, New York; 2006.
[16] Hermansky H, Morgan N: RASTA processing of speech. IEEE Trans. Speech Audio Process 1994, 2(4):382-395.
[17] Kinnunen T, Li H: An overview of text-independent speaker recognition: from features to supervectors. Speech Commun 2010, 52: 12-40. 10.1016/j.specom.2009.08.009
[18] Garcia-Romero D, et al.: The UMD-JHU 2011 speaker recognition system. In Proc. IEEE Intl. Conf. Acoust. Speech Signal Proc. Kyoto, Japan; 2012:4229-4232.
[19] Kenny P, Boulianne G, Ouellet P, Dumouchel P: Speaker and session variability in gmm-based speaker verification. IEEE Trans. Audio Speech Lang. Process 2007, 15: 1448-1460.
[20] Garcia-Romero D, Espy-Wilson C: Joint factor analysis for speaker recognition reinterpreted as signal coding using overcomplete dictionaries. In Proc. Odyssey Speaker and Language Recognition Workshop. Brno, Czech Republic; 2010:117-124.
[21] Auckenthaler R, Carey M, Lloyd-Thomas H: Score normalization for text-independent speaker verification system. Digit. Signal Proc 2000, 1(10):42-54.
[22] Hirsch H, Pearce D: The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In ISCA ITRW ASR2000. vol. 4 Beijing, China; 2000:29-32.
[23] NIST 2011 speaker recognition evaluation http://www.nist.gov/itl/iad/mig/best.cfm
[24] Zotkin D, Chi T, Shamma SA, Duraiswami R: Neuromimetic sound representation for percept detection and manipulation. EURASIP J. App. Sig. Process 2005, 2005: 1350-1364. 10.1155/ASP.2005.1350
[25] Steeneken H, Houtgast T: A physical method for measuring speech-transmission quality. J. Acoust. Soc. Am 1979, 67: 318-326.
[26] Drullman R, Festen J, Plomp R: Effect of temporal envelope smearing on speech reception. J. Acoust. Soc. Am 1994, 95: 1053-1064. 10.1121/1.408467
[27] Arai T, Pavel M, Hermansky H, Avendano C: Syllable intelligibility for temporally filtered lpc cepstral trajectories. J. Acoust. Soc. Am 1999, 105: 2783-2791. 10.1121/1.426895
[28] Greenberg S, Arai T, Grant K: The Role of Temporal Dynamics in Understanding Spoken Language. In NATO Science Series: Life and Behavioural Sciences. IOS Press, Amsterdam; 2006:171-190.
[29] Loizou P: Speech Enhancement: Theory and Practice. CRC Press, Boca Raton; 2007.

Acknowledgements

This research is partly supported by the IIS-0846112 (NSF), FA9550-09-1-0234 (AFOSR), 1R01AG036424-01 (NIH), N000141010278 (ONR), and by the Office of the Director of National Intelligence (ODNI), the Intelligence Advanced Research Projects Activity (IARPA), through the Army Research Laboratory (ARL). All statements of fact, opinion, or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of IARPA, the ODNI, or the U.S. Government.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA
Sridhar Krishna Nemala & Mounya Elhilali
Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA
Dmitry N Zotkin & Ramani Duraiswami

Corresponding author

Correspondence to Mounya Elhilali.

Additional information

Competing Interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Nemala, S.K., Zotkin, D.N., Duraiswami, R. et al. Biomimetic multi-resolution analysis for robust speaker recognition. J AUDIO SPEECH MUSIC PROC. 2012, 22 (2012). https://doi.org/10.1186/1687-4722-2012-22

Received: 26 July 2011
Accepted: 17 August 2012
Published: 07 September 2012
DOI: https://doi.org/10.1186/1687-4722-2012-22
