Biomimetic multi-resolution analysis for robust speaker recognition

Humans exhibit a remarkable ability to reliably classify sound sources in the environment even in the presence of high levels of noise. In contrast, most engineering systems suffer a drastic drop in performance when speech signals are corrupted with channel or background distortions. Our brains are equipped with elaborate machinery for speech analysis and feature extraction, which holds great lessons for improving the performance of automatic speech processing systems under adverse conditions. The work presented here explores a biologically-motivated multi-resolution speaker information representation obtained by performing an intricate yet computationally-efficient analysis of the information-rich spectro-temporal attributes of the speech signal. We evaluate the proposed features in a speaker verification task performed on NIST SRE 2010 data. The biomimetic approach yields significant robustness in the presence of non-stationary noise and reverberation, offering a new framework for deriving reliable features for speaker recognition and speech processing.


Introduction
In addition to the intended message, the human voice carries the unique imprint of a speaker. Just like fingerprints and faces, voice prints are biometric markers with tremendous potential for forensic, military, and commercial applications [1]. However, despite enormous advances in computing technology over the last few decades, automatic speaker verification (ASV) systems still rely heavily on training data collected in controlled environments, and most systems face a rapid degradation in performance when operating under previously unseen conditions (e.g., channel mismatch, environmental noise, or reverberation). In contrast, human perception of speech and ability to identify sound sources (including voices) is quite remarkable even at relatively high distortion levels [2]. Consequently, the pursuit of human-like recognition capabilities has spurred great interest in understanding how humans perceive and process speech signals.
One of the intriguing processes taking place in the central auditory system involves ensembles of neurons with variable tuning to spectral profiles of acoustic signals. In addition to the frequency (tonotopic) organization emerging as early as the cochlea, neurons in the central auditory system (specifically in the midbrain and more prominently in the auditory cortex) exhibit tuning to a variety of filter bandwidths and shapes [3]. This elegant neural architecture provides a detailed multi-resolution analysis of the spectral sound profile, which is presumably relevant to speech and speaker recognition. Only a few studies so far have attempted to use this cortical representation in speech processing, yielding some improvements for automatic speech recognition at the expense of substantial computational complexity [4,5]. To the best of our knowledge, no similar work has been done for ASV.
In the present report, we explore the use of a multi-resolution analysis for robust speaker verification. Our representation is simple, effective, and computationally-efficient. The proposed scheme is carefully optimized to be particularly sensitive to the information-rich spectro-temporal attributes of the signal while maintaining robustness to unseen noise distortions. The choice of model parameters builds on our current knowledge of psychophysical principles of speech perception in noise [6,7] complemented with a statistical analysis of the dependencies between spectral details of the message and speaker information. We evaluate the proposed features in an ASV system and compare it against one of the best performing systems in NIST 2010 SRE evaluation [8] under detrimental conditions such as white noise, non-stationary additive noise, and reverberation.
The following section describes details of the proposed multi-resolution spectro-temporal model. It is followed by an analysis that motivates the choice of model parameters to maximize speaker information retention. Next, we describe the experimental setup and results. We finish with a discussion of these results and comment on potential extensions towards achieving further noise robustness.

The biomimetic multi-resolution analysis
An overview of the processing chain described in this section is presented in Figure 1.

Peripheral analysis
The speech signal is processed through a pre-emphasis stage (implemented as a first-order high-pass filter with pre-emphasis coefficient 0.97), and a time-frequency auditory spectrogram is generated using a biomimetic sound processing model described in detail in [9] and briefly summarized here (Equation 1). First, the signal s(t) undergoes a cochlear frequency analysis modeled by a bank of 128 constant-Q (Q = 4) highly asymmetric bandpass filters h(t; f) equally spaced over a span of 5 1/3 octaves on a logarithmic frequency axis. The filterbank output is a spatiotemporal pattern of cochlear basilar membrane displacements y_coch(t, f) over 128 channels. Next, a lateral inhibitory network detects discontinuities in the responses across the tonotopic (frequency) axis, resulting in further enhancement of the filterbank frequency selectivity. This step is modeled as a first-order differentiation operation across the channel array followed by a half-wave rectifier and a short-term integrator. The temporal integration window is given by μ(t; τ) = e^{−t/τ} u(t) with time constant τ = 10 ms, mimicking the further loss of phase-locking observed in the midbrain. This time constant controls the frame rate of the spectral vectors. Finally, a nonlinear cubic-root compression of the spectrum is performed, resulting in an auditory spectrogram y(t, f):

y(t, f) = [ max(∂_f (s(t) ⊗_t h(t; f)), 0) ⊗_t μ(t; τ) ]^{1/3},   (1)

where ⊗_t represents convolution with respect to time. The choice of the auditory spectrogram is motivated by its neurophysiological foundation as well as its proven self-normalization and robustness properties (see [10] for full details).
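As a rough illustration, the peripheral stages above might be sketched in Python as follows. This is a hedged approximation, not the model of [9]: the Gaussian frequency-domain filterbank, the lowest center frequency f0 = 180 Hz, and the framing details are illustrative stand-ins for the highly asymmetric cochlear filters of the actual model.

```python
import numpy as np

def auditory_spectrogram(s, sr, n_chan=128, tau=0.010, frame_len=0.010):
    """Sketch of the peripheral model: pre-emphasis, constant-Q filterbank,
    lateral inhibition, leaky integration, cubic-root compression."""
    # 1. Pre-emphasis (first-order high-pass, coefficient 0.97).
    s = np.append(s[0], s[1:] - 0.97 * s[:-1])
    # 2. Cochlear analysis: 128 channels over 5 1/3 octaves.
    #    Approximated here with Gaussian bandpass filters in the frequency
    #    domain (the real model uses highly asymmetric filters).
    f0 = 180.0                                   # assumed lowest CF (Hz)
    cfs = f0 * 2.0 ** (np.arange(n_chan) * (16.0 / 3.0) / n_chan)
    S = np.fft.rfft(s)
    freqs = np.fft.rfftfreq(len(s), 1.0 / sr)
    y_coch = np.empty((n_chan, len(s)))
    for i, cf in enumerate(cfs):
        bw = cf / 4.0                            # constant Q = 4
        H = np.exp(-0.5 * ((freqs - cf) / (bw / 2.355)) ** 2)
        y_coch[i] = np.fft.irfft(S * H, n=len(s))
    # 3. Lateral inhibition: derivative across channels + half-wave rectifier.
    y_lin = np.maximum(np.diff(y_coch, axis=0, prepend=y_coch[:1]), 0.0)
    # 4. Short-term integration mu(t) = exp(-t/tau) u(t), then framing.
    t = np.arange(int(5 * tau * sr)) / sr
    mu = np.exp(-t / tau)
    hop = int(frame_len * sr)
    y = np.array([np.convolve(ch, mu)[: len(s)][hop - 1 :: hop] for ch in y_lin])
    # 5. Cubic-root compression yields the auditory spectrogram y(t, f).
    return np.cbrt(y)
```

With a 10 ms frame length the output has one 128-channel spectral slice per 10 ms of signal, matching the frame rate used by the later cortical stages.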

Spectral cortical analysis
The auditory spectrogram is processed further in order to capture the spectral details present in each spectral slice. The processing is based on neurophysiological findings that neurons in the central auditory pathway are tuned not only to frequencies but also to spectral shapes, in particular to peaks of various widths on the log-frequency axis [3,11,12]. The spectral width is characterized by a parameter called scale and is measured in cycles per octave, or CPO. Physiological data indicates that auditory cortex neurons are highly scale-selective, thus expanding the cochlear one-dimensional tonotopic axis onto a two-dimensional sheet that explicitly encodes tonotopy as well as spectral shape details (see Figures 1 and 2).
The cortical analysis is implemented using a bank of modulation filters operating in the Fourier domain. The algorithm processes each data frame individually. The Fourier transform of each spectral slice y(t_0, f) is multiplied by a modulation filter H_S(Ω; Ω_c) that is tuned to spectral features of scale Ω_c. The filtering operates on the magnitude of the signal. After filtering, the inverse Fourier transform is performed and the real part is taken as the new filtered slice. This process is then repeated with a number of different Ω_c, yielding a number of filtered spectrograms y(t, f; Ω_c), each with features of scale Ω_c emphasized (see Figure 1). This set of spectrograms constitutes the spectral cortical representation of the sound.
The filter H_S(Ω; Ω_c) is a bandpass filter on the spectral modulation axis centered at scale Ω_c, where Ω_max is the highest spectral modulation frequency (set at 12 CPO given our spectrogram resolution of 24 channels per octave).
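The per-slice scale filtering can be sketched as below. Since the closed form of H_S is not reproduced in this section, a Gaussian bump on the modulation axis is used as an illustrative stand-in; the magnitude-domain filtering and real-part reconstruction follow the text.

```python
import numpy as np

def scale_filter_slice(y_slice, omega_c, chan_per_oct=24):
    """Filter one spectral slice y(t0, f) around scale omega_c (in CPO).
    The Gaussian filter shape is an assumption, not the paper's H_S."""
    n = len(y_slice)
    Y = np.fft.rfft(y_slice)
    # Spectral-modulation axis in cycles per octave; with 24 channels per
    # octave the highest modulation frequency Omega_max is 12 CPO.
    omega = np.fft.rfftfreq(n, d=1.0 / chan_per_oct)
    H = np.exp(-0.5 * ((omega - omega_c) / (0.5 * omega_c)) ** 2)
    # Filter the magnitude, keep the phase, and return the real part.
    filtered = np.abs(Y) * H * np.exp(1j * np.angle(Y))
    return np.real(np.fft.irfft(filtered, n=n))
```

Applying this with each Ω_c in the chosen scale set to every frame of the auditory spectrogram produces the stack of filtered spectrograms y(t, f; Ω_c).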

Choice of spectral parameters
The set of scales Ω_c is chosen by dividing the spectral modulation axis into equal-energy regions using a training corpus (the TIMIT database [13]) as described below. Define the average spectral modulation profile Ȳ(Ω) = 〈〈|Y(Ω; t_0)|〉_T〉_Ψ as the ensemble mean of the magnitude Fourier transform of the spectral slice y(t_0, f), averaged over all times T and over the entire speech corpus Ψ. The resulting ensemble profile (shown in Figure 3a) is then divided into M equal-energy regions Γ_k:

∫_{Ω_k}^{Ω_{k+1}} Ȳ(Ω) dΩ = (1/M) ∫_{Ω_1}^{Ω_M} Ȳ(Ω) dΩ,   k = 1, …, M,   (3)

where Ω_k and Ω_{k+1} denote the lower and upper cutoffs of the kth band, Ω_1 = 0, and Ω_M = 4. This sampling scheme ensures that the high-energy regions are sampled more densely, which has the dual advantage of sampling the given modulation space with a relatively small set of scales and of emphasizing high-energy signal components, which are presumably noise-robust. The set Ω_c = {0.25, 0.5, 1.0, 2.0, 4.0} CPO is found to be a good tradeoff between computational complexity and system performance.
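The equal-energy partition can be computed by inverting the cumulative energy of the average modulation profile. A minimal sketch (function name and trapezoidal integration are illustrative choices):

```python
import numpy as np

def equal_energy_cutoffs(Y_bar, omega, M=5, omega_max=4.0):
    """Split the average spectral-modulation profile Y_bar(omega) into
    M equal-energy bands over [0, omega_max], returning the M+1 edges."""
    mask = omega <= omega_max
    w, p = omega[mask], Y_bar[mask]
    # Cumulative energy via trapezoidal integration.
    cum = np.concatenate(([0.0], np.cumsum(0.5 * (p[1:] + p[:-1]) * np.diff(w))))
    # Band edges are where the cumulative energy hits k/M of the total.
    targets = np.linspace(0.0, cum[-1], M + 1)
    return np.interp(targets, cum, w)
```

Because speech modulation energy is concentrated at low scales, the resulting band edges cluster near the origin, which is exactly the dense low-scale sampling the text describes.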

Temporal filtration
In this stage, the spectral cortical features are processed through a bandpass temporal modulation filter to remove information that is believed to be mostly irrelevant. It was shown in [14] that neurons in the auditory cortex are mostly sensitive to modulation rates between 0.5 and 12 Hz, and the same modulation range carries the information crucial for speech comprehension [7]. Accordingly, the filtering is performed by multiplying the Fourier transform of the time sequence of each spectral feature by a bandpass filter H_T(w; w_l, w_h) with passband [w_l, w_h], where w_l = 0.5 Hz, w_h = 12.0 Hz, w_max = 1/(2 t_f), and t_f = 10 ms (the frame length). After filtering in the Fourier domain, the inverse Fourier transform is performed and the real part of the output forms the temporally filtered spectral cortical representation of the sound y_w(t, f; Ω_c). This operation is performed on an utterance-by-utterance basis.
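The temporal filtration of one feature's time course might look as follows. An ideal (brick-wall) passband is used here as a stand-in, since the exact shape of H_T is not given in this section:

```python
import numpy as np

def temporal_bandpass(feature_track, w_l=0.5, w_h=12.0, frame_len=0.010):
    """Bandpass the time course of one cortical feature to the 0.5-12 Hz
    modulation range.  The brick-wall passband is an assumption."""
    n = len(feature_track)
    W = np.fft.rfft(feature_track)
    # Modulation-rate axis in Hz; with 10 ms frames, w_max = 1/(2 t_f) = 50 Hz.
    w = np.fft.rfftfreq(n, d=frame_len)
    H = ((w >= w_l) & (w <= w_h)).astype(float)
    # Real part of the inverse transform gives the filtered track.
    return np.real(np.fft.irfft(W * H, n=n))
```

Applied to every (frequency, scale) feature over a whole utterance, this removes the DC offset and fast fluctuations while preserving the speech-relevant modulation rates.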

Cortical features
To reduce computational complexity and to allow use of state-of-the-art speaker verification machinery (which generally expects a relatively low-dimensional input), the spectral cortical representation is downsampled in frequency by a factor of 4 (Figure 1). The resulting feature representation has a dimensionality of 128 (32 auditory frequency channels multiplied by the four scales used for analysis). The features are then normalized to zero mean and unit variance for each utterance, yielding the reduced set of spectrograms ŷ_w(t, f; Ω_c). Principal component analysis is used to further reduce the feature dimensionality to 19. This number is chosen for consistency with the dimensionality of the standard Mel-Frequency Cepstral Coefficients (MFCC) feature set used for speaker recognition. The reduced features, along with their first- and second-order derivatives, form the final 57-dimensional cortical feature vector used for the speaker verification task.
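The final reduction step can be sketched as below. Two caveats: in practice the PCA basis would be estimated on training data rather than per utterance, and the use of np.gradient for the delta features is an illustrative choice, not necessarily the delta scheme used in the paper.

```python
import numpy as np

def cortical_feature_vectors(y_w, n_pc=19):
    """y_w: (n_frames, 128) array of downsampled, normalized cortical
    features.  PCA to n_pc dims, then append first- and second-order
    time derivatives, giving 3 * n_pc = 57 dimensions per frame."""
    # PCA: project onto the top n_pc right singular vectors.
    x = y_w - y_w.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    z = x @ vt[:n_pc].T
    # Delta and delta-delta features along the time axis.
    dz = np.gradient(z, axis=0)
    ddz = np.gradient(dz, axis=0)
    return np.concatenate([z, dz, ddz], axis=1)
```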

Speech information versus speaker information
The speech signal carries both the speech message and speaker identity information in distinct yet overlapping components. Separating these elements is in general a non-trivial task. In the multi-resolution framework presented above, the broadest filters (0.25 and 0.5 CPO) capture primarily the overall spectral profile and formant peaks, while the others (1, 2, and 4 CPO) reflect narrower spectral details such as harmonics and subharmonic structure. In order to select a set of scales Ω_c that are most relevant for the speaker recognition task, we analyze the mutual information (MI) between the feature vector (X), the speech message (Y_1), and the speaker identity (Y_2). The MI is a measure of the statistical dependence between random variables [15] and is defined for two discrete random variables X and Y as

I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ].   (5)

To estimate the MI, the continuous feature vector is quantized by dividing its support into cells of equal volume. To characterize the speech message, phoneme labels from the TIMIT corpus are first divided into four broad phoneme classes. The variable Y_1 thus takes four discrete values representing the phoneme categories: vowels, stops, fricatives, and nasals. The average MI (taken as the mean MI across all frequency bands for a given scale) between the feature vector and the speech message is shown in Figure 3b (top) as a function of scale. For the speaker identity MI test, the TIMIT "sa1" speech utterance ("She had your dark suit in greasy wash water all year") spoken by 100 different subjects is used; thus, Y_2 takes 100 discrete values representing the speaker. The average MI between the feature vector and the speaker identity is shown in Figure 3b (bottom), again as a function of scale.
Notice that while the lower scale (0.25 CPO) clearly provides significantly more information about the underlying linguistic message, the MI peak in Figure 3b (bottom) is centered at 1 CPO, highlighting the significance of pitch and harmonically-related frequency channels in representing speaker-specific information. In order to put less emphasis on the message-carrying features of the speech signal, we drop the 0.25 CPO filter at the feature encoding stage of our ASV system and choose Ω_c = {0.5, 1.0, 2.0, 4.0} CPO.
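A minimal histogram-based MI estimator in the spirit of the quantization described above might look like this (equal-width cells stand in for the equal-volume cells of the text; bin count is an illustrative parameter):

```python
import numpy as np

def mutual_information(x, y, n_bins=8):
    """Estimate I(X; Y) in bits between a continuous scalar feature x and
    a discrete label y, by quantizing x into n_bins equal-width cells."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]
    xq = np.digitize(x, edges)                  # cell index in 0..n_bins-1
    labels = {v: i for i, v in enumerate(np.unique(y))}
    joint = np.zeros((n_bins, len(labels)))
    for xi, yi in zip(xq, y):
        joint[xi, labels[yi]] += 1
    p = joint / joint.sum()                     # joint distribution p(x, y)
    px = p.sum(axis=1, keepdims=True)           # marginal p(x)
    py = p.sum(axis=0, keepdims=True)           # marginal p(y)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())
```

Averaging such estimates across frequency bands for each scale, with y set to the phoneme class or the speaker label, reproduces the kind of per-scale MI curves discussed above.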

Recognition setup
Text independent speaker verification experiments are conducted on the NIST 2010 speaker recognition evaluation (SRE) data set [8]. The extended core task of the evaluation involves 6.9 million trials broken down into nine common conditions reflecting a variety of channel mismatch scenarios [8] (see Table 1).
The front end of the implemented ASV system uses either the 57-dimensional MFCC feature vector or the 57-dimensional cortical feature vector. The MFCC feature vector is computed by invoking the RASTAMAT "melfcc" function with the 'numcep' parameter set to 20, dropping the first (energy) component of the output, and appending first- and second-order derivatives of the resulting feature vector. The cortical feature vector is obtained as described in the previous sections. For a fair comparison between MFCC and cortical features, the MFCC features were supplemented with mean subtraction, variance normalization, and RASTA filtering [16] applied at the utterance level. Such processing parallels the temporal filtering and normalization performed on the cortical features. A combination of the ASR output provided by NIST and an in-house energy-based VAD system is used to drop all non-speech frames from the input data.
The back-end is a robust state-of-the-art UBM-GMM system [17,18]. In a UBM-GMM system, each speaker's distribution of feature vectors is modeled as a mixture of Gaussians, forming a Gaussian mixture speaker model (GMSM). Because the amount of data available per individual speaker is typically much less than required to train the speaker model from scratch, the GMSM is produced by adapting the means of a universal background model (UBM) so that the resulting model best describes the available speaker data. Finally, given the UBM, the candidate GMSM, and the audio file, the system extracts the feature vectors from the audio file and computes the log-likelihoods of these feature vectors under the GMSM and under the UBM. The difference between these log-likelihoods constitutes the output score for the trial. Our ASV system additionally employs joint factor analysis (JFA) [19,20]. JFA enables channel-variability compensation by offsetting channel effects, and more robust speaker model estimation by using a more informative prior on the speaker model distribution. To use JFA in the described framework, an alternative representation of the speaker model, a single vector Z (the "supervector"), is formed by concatenating all GMSM means. JFA is trained in advance on a large annotated collection of audio files to learn the channel subspace (the basis over which Z preferentially varies when the same speaker's voice is presented over different channels) and the speaker subspace (the basis over which Z preferentially varies when different speakers are presented over the same channel). In our system, the dimensionalities of the speaker subspace and of the channel subspace are 300 and 150, respectively. Then, when processing previously unseen data, the components of inter-speaker differences attributable to the speaker are emphasized and those attributable to the channel are canceled. This is done by projecting the corresponding supervectors into the speaker and channel subspaces, using the speaker-subspace projection of Z to modify the GMSM, using the channel-subspace projection of Z to modify the UBM, and performing scoring with these modified GMSM and UBM. Also, as the log-likelihood calculation is expensive, our system computes an inner-product-based approximation to it [20].
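The core UBM-GMM scoring step, stripped of MAP adaptation and JFA, reduces to a per-frame log-likelihood ratio between the two models. A toy sketch with diagonal-covariance GMMs (all parameter names are illustrative):

```python
import numpy as np

def gmm_logpdf(X, weights, means, var):
    """Per-frame log-density of frames X (n, d) under a
    diagonal-covariance GMM with K components."""
    # (n, K) matrix of per-component log densities, including weights.
    lp = (-0.5 * (((X[:, None, :] - means[None]) ** 2) / var[None]).sum(-1)
          - 0.5 * np.log(2 * np.pi * var).sum(-1)[None]
          + np.log(weights)[None])
    # Numerically stable log-sum-exp across components.
    m = lp.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(lp - m).sum(axis=1))

def verification_score(X, ubm, gmsm):
    """Trial score: average per-frame log-likelihood ratio between the
    speaker model (GMSM) and the UBM.  Each model is a (weights, means,
    variances) tuple.  Real systems add MAP adaptation and JFA on top."""
    return float(np.mean(gmm_logpdf(X, *gmsm) - gmm_logpdf(X, *ubm)))
```

A positive score indicates the frames are better explained by the claimed speaker's model than by the background model; the decision threshold is then set on these scores.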
Finally, the obtained scores are subject to ZT-normalization [21], and the decision threshold minimizing equal error rate (EER) is chosen (separately for each condition).

Noise conditions
Every trial in NIST SRE 2010 consists of computing the matching score between a speaker model and an audio file. To evaluate the noise robustness of the proposed cortical features, several distorted versions of these audio files are created by adding different types of noise reflecting a variety of real-world scenarios:
• White noise at signal-to-noise ratio (SNR) levels from 24 to 0 dB in 6 dB steps;
• Babble noise (from the Aurora database [22]) at the same SNR levels;
• Subway noise (from the Aurora database [22]) at the same SNR levels;
• Simulated reverberation with RT60 from 200 to 1,200 ms in steps of 200 ms.
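Creating a corrupted test file at a prescribed SNR amounts to scaling the noise against the speech power. A simple sketch (the tiling of short noise segments is an illustrative detail):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale a noise segment so that mixing it with the speech
    yields the requested signal-to-noise ratio in dB."""
    # Tile/trim the noise to the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    # Choose gain so that p_s / (gain^2 * p_n) = 10^(snr_db / 10).
    gain = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Sweeping snr_db over 24, 18, 12, 6, and 0 dB for each noise type reproduces the additive-noise conditions listed above.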
It is important to mention that all training (UBM, JFA, and speaker model training) is done exclusively on clean data, and only the test audio files are corrupted. Note also that the train-test mismatch created by the addition of noise/reverberation is superimposed on the train-test mismatch inherent to the SRE 2010 data. Figure 4 shows the speaker verification performance in terms of EER for the cortical features and for the MFCC features as a function of noise type/strength and trial condition.

Results
The results clearly demonstrate that the proposed cortical features provide substantially lower EER than the MFCC features as the noise level increases, indicating their robustness. The average performance for each noise type and trial condition is shown in Table 2. On average (across all conditions and all noise types), the cortical-features-based system yields a 15.9% relative EER improvement over the robust state-of-the-art MFCC system. It is worth noting that the proposed approach is outperformed by the MFCC-based approach in only 4 out of the 36 cases. Because the proposed representation incorporates both a biomimetic auditory spectrogram previously shown to exhibit noise-robustness characteristics [10] and a multiresolution decomposition, we investigated further the contribution of each component to the reported improvements. We tested the system using the auditory spectrogram alone, as well as an adaptation of the auditory spectrogram described here coupled with a cepstral transformation. Neither system performed as well as the proposed multiresolution decomposition, strengthening the claim that the multiresolution analysis is indeed responsible for the performance improvements shown in Table 2.
In some ASV applications, metrics other than EER may be more relevant. For example, in certain biometric speaker verification systems the key requirement is a low false alarm rate. We therefore also present our results in terms of two additional metrics more suitable in such cases, namely the Miss-10 and quadratic DCF (decision cost function) metrics. These two metrics were used in the NIST 2011 IARPA BEST program SRE [23]. The Miss-10 metric is defined as the false alarm rate P_FA obtained when the decision threshold is set such that the miss rate P_Miss = 10%, and the quadratic DCF is defined as

DCF = C_Miss × P_Miss² × P_target + C_FA × P_FA × (1 − P_target)   (6)

with the parameter values C_Miss = 100, C_FA = 10, and P_target = 0.01.
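Both metrics, along with the EER, can be computed by sweeping a decision threshold over the trial scores. A sketch (the threshold grid and the "P_Miss ≤ 10%" reading of the Miss-10 operating point are practical choices, not taken from the evaluation plan):

```python
import numpy as np

def det_metrics(target_scores, nontarget_scores):
    """Compute EER, Miss-10, and the quadratic DCF from raw trial scores
    by sweeping the decision threshold over all observed scores."""
    thr = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thr])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thr])
    # EER: operating point where miss and false-alarm rates cross.
    i = np.argmin(np.abs(p_miss - p_fa))
    eer = 0.5 * (p_miss[i] + p_fa[i])
    # Miss-10: lowest false-alarm rate subject to P_Miss <= 10%.
    miss10 = p_fa[p_miss <= 0.10].min()
    # Quadratic DCF with C_Miss = 100, C_FA = 10, P_target = 0.01.
    c_miss, c_fa, p_tgt = 100.0, 10.0, 0.01
    dcf = (c_miss * p_miss ** 2 * p_tgt + c_fa * p_fa * (1.0 - p_tgt)).min()
    return eer, miss10, dcf
```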
The average verification performance for each noise type using the Miss-10 and quadratic DCF metrics is shown in Tables 3 and 4, respectively. As seen from the data, in the low false alarm region the proposed cortical features outperform the robust state-of-the-art MFCC system by an even larger margin: 28.8% relative using the Miss-10 metric and 22.6% relative using the quadratic DCF metric.

Discussion and conclusions
In this report, we explore the applicability of a multi-resolution analysis of speech signals to ASV. This framework maps the speech signal onto a rich feature space, highlighting and separating information about the glottal excitation signal, glottal shape, vocal tract geometry, and articulatory configuration (as each of these elements is an underlying factor for features of different widths located in different areas of the log-frequency axis; see e.g. [24]). The cortical representation can be viewed as a "local" variant (with respect to the log-frequency axis) of the analysis provided by MFCC. This analogy stems from the fact that MFCC roughly correspond to spectral features of different widths integrated over the whole frequency range. In this work, both the "global-integration" MFCC approach and the "local" cortical approach are tested in a state-of-the-art ASV system on the NIST SRE 2010 dataset. While both perform comparably in the clean condition, the cortical features are substantially more robust on noisy data, including non-stationary distortions as well as reverberation.
One of the intuitions behind the robustness observed in the proposed features is the fact that speech and noise generally exhibit different spectral shapes while occupying an overlapping spectral range. The expansion of the spectral axis by the multi-resolution analysis allows some speech components to be extricated from the masking noise, suppressing the noise components and providing increased robustness. Furthermore, by highlighting the range between 0.5 and 4 CPO, the model stresses the most speaker-informative regions of the speech spectrum, which in turn map onto a modulation space to which humans are highly sensitive [7]. This range is also commensurate with neurophysiological tuning observed in the mammalian auditory cortex, where most neurons are concentrated around a spectral tuning of the order of a few CPO [3,14]. A similar emphasis is put on the temporal dynamics of the signal by underscoring the region between 0.5 and 12 Hz, which defines natural boundaries for speech perception in noise by human listeners [7,25-28] and largely coincides with the temporal tuning of mammalian cortical neurons [14]. Temporal modulation frequencies in this range represent mostly the syllabic and segmental rates of speech [2].
Unlike comparable recently developed multi-resolution schemes [4,5], the proposed approach does not involve dimension-expanded representations (close to 30,000 dimensions), which inherently require computationally expensive processing and therefore have limited applicability. Instead, our model is constrained to lie in a perceptually-relevant spectral modulation space and further uses a careful sampling scheme to encode the information with only four spectral analysis filters. This has the dual advantage of producing a feature space that is both low-dimensional and highly robust. Careful optimization of the model parameters is necessary to strike a balance between simple, efficient computation and noise robustness.
Importantly, in our approach no model components have been customized in any way to deal with a specific noise condition, making it suitable for a wide range of acoustic environments.
In addition, the model has been minimally customized for the speaker recognition task and can in fact provide a general framework for a variety of speech processing tasks. Our preliminary results do indeed show great robustness of a similar scheme for automatic speech recognition. It is therefore essential to emphasize that the performance obtained with the cortical features is solely a property of the features themselves and is achieved without any noise compensation techniques. Our ongoing efforts are aimed at achieving further improvements by applying the described multi-resolution cortical analysis on enhanced spectral profiles obtained using speech enhancement techniques, which involve estimation of noise characteristics in various forms [29].

Figure 4. Evaluation results
Performance of the proposed cortical features (red filled squares) and enhanced MFCC features (black open circles) on the NIST SRE 2010 "extended core" database as a function of noise level, noise type, and condition. In each subplot, the noise level is shown on the X axis and the EER (in percent) on the Y axis. Columns and rows of subplots correspond to the same noise type and to the same condition, respectively. Note that the Y-axis ranges are not the same across subplots.

Table 2. Average ASV performance (EER, %) as a function of noise type and condition
Table 3. Average ASV performance (Miss-10 metric, %) as a function of noise type and condition
Table 4. Average ASV performance (quadratic DCF metric) as a function of noise type and condition