Auditory processing-based features for improving speech recognition in adverse acoustic conditions
© Maganti and Matassoni; licensee Springer. 2014
Received: 17 January 2012
Accepted: 16 April 2014
Published: 6 May 2014
The paper describes an auditory processing-based feature extraction strategy for robust speech recognition in environments where conventional automatic speech recognition (ASR) approaches are not successful. It combines gammatone filtering, modulation spectrum and non-linearity for feature extraction in the recognition chain to improve the robustness of ASR in adverse acoustic conditions. The experimental results on the standard Aurora-4 large vocabulary evaluation task revealed that the proposed features provide reliable and considerable improvement in terms of robustness in different noise conditions and are comparable to those of standard feature extraction techniques.
Present technological advances in speech processing systems aim at providing robust and reliable interfaces for practical deployment. Achieving robust performance of these systems in adverse and noisy environments is one of the major challenges in applications such as dictation, voice-controlled devices, human-computer dialog systems and navigation systems. Speech acquisition, processing and recognition in non-ideal acoustic environments are complex tasks due to the presence of unknown additive noise and reverberation. Additive noise from interfering noise sources and convolutive noise arising from the acoustic environment and transmission channel characteristics are the main contributors to the degradation of speech intelligibility as well as of the performance of speech recognition systems. This article addresses the problem of achieving robustness in large vocabulary automatic speech recognition (ASR) systems by incorporating principles inspired by cochlear processing in the human auditory system.
The degradation of recognition accuracy for ASR systems in noisy environments is mostly due to the discrepancy between training and testing conditions. When the training data are recorded in clean conditions, accuracy degrades as soon as the system is tested on data acquired in noisy conditions. Various speech signal enhancement, feature normalization and model parameterization techniques are applied at different phases of processing to reduce the mismatch between training and testing conditions[2, 3]. Spectral subtraction-, Wiener filtering-, statistical model- and subspace-based speech enhancement techniques aim at improving the quality of the speech signal captured through a single microphone or a microphone array[4, 5]. Feature normalization modifies the extracted features so that the resulting parameters are less sensitive to noise. Common techniques include cepstral mean normalization (CMN), which forces the mean of each element of the cepstral feature vector to be zero for all utterances. Other variants include mean-variance normalization (MVN), cepstral mean subtraction and variance normalization (CMSVN) and relative spectral (RASTA) filtering[2, 6]. Model adaptation approaches modify the acoustic model parameters to match the observed speech features[4, 7].
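The normalization step described above can be illustrated with a minimal sketch. It assumes the cepstral features of one utterance are stored as a NumPy array of shape (frames, coefficients); the function names and the small `eps` guard are illustrative choices, not taken from the paper.

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the per-utterance mean of every cepstral coefficient (CMN)."""
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def mean_variance_normalization(cepstra, eps=1e-10):
    """Zero-mean, unit-variance scaling of every coefficient (MVN).

    eps is a hypothetical guard against division by zero.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    centered = cepstra - cepstra.mean(axis=0, keepdims=True)
    return centered / (cepstra.std(axis=0, keepdims=True) + eps)

# Synthetic stand-in for one utterance: 200 frames of 13 coefficients.
rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(200, 13))
cmn = cepstral_mean_normalization(features)
mvn = mean_variance_normalization(features)
```

After CMN every coefficient has zero mean over the utterance; MVN additionally scales each coefficient to unit variance.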
The auditory system-based techniques have been used in speech recognition to improve the robustness[8–15]. Examples of non-uniform frequency resolution in popular speech analysis techniques include Mel frequency-based features and perceptual linear prediction which attempt to emulate human auditory perception. The gammatone filter bank with non-uniform bandwidths and non-uniform spacing of center frequencies provided better robustness in adverse noise conditions for speech recognition tasks[12–15].
Another important characteristic, the modulation spectrum of speech, represents slow temporal modulation components and is important for speech intelligibility[16, 17]. As in the human auditory system, the relative prominence of slow temporal modulations differs across frequencies. Gammatone filter bank-derived modulation spectral features have been shown to improve robustness for far-field speaker identification. Our previously described auditory-based modulation spectral feature combines gammatone filtering and modulation spectral features for robust speech recognition.
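As a rough illustration of what a modulation spectrum measures, the sketch below takes the temporal envelope of one frequency band and computes the magnitude spectrum of that envelope. It is a generic textbook-style computation, not the authors' feature pipeline; the envelope sampling rate, the Hann window and the synthetic 4 Hz envelope are assumptions made for the example.

```python
import numpy as np

def modulation_spectrum(envelope, frame_rate):
    """Magnitude spectrum of a sub-band envelope trajectory.

    envelope:   1-D array, envelope of one frequency band over time
    frame_rate: sampling rate of the envelope in Hz
    """
    envelope = np.asarray(envelope, dtype=float)
    envelope = envelope - envelope.mean()          # remove the DC component
    window = np.hanning(len(envelope))             # reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(envelope * window))
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / frame_rate)
    return freqs, spectrum

# Synthetic envelope modulated at 4 Hz, sampled at an assumed 100 Hz for 2 s.
frame_rate = 100.0
t = np.arange(200) / frame_rate
envelope = 1.0 + 0.5 * np.sin(2.0 * np.pi * 4.0 * t)
freqs, spec = modulation_spectrum(envelope, frame_rate)
peak_hz = freqs[np.argmax(spec)]  # frequency of the strongest modulation
```

For this synthetic input the strongest modulation component falls at the 4 Hz modulation frequency used to build the envelope.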
The present paper describes an alternative approach in which gammatone filtering, non-linearity and modulation spectrum are combined for feature extraction. The enhanced speech signal improved the accuracy of the system by reducing the sensitivity. The features derived from the combination are used to provide robustness, particularly in the context of mismatch between training and testing in noisy environments. The studied features are shown to be reliable and robust to various noises for a large vocabulary task. For comparison purposes, recognition results obtained with conventional features are also reported, and the proposed features are shown to be effective.
The paper is organized as follows: Section Related work gives an overview of the auditory-inspired features including gammatone filter bank processing, modulation spectrum and non-linearity processing. Section Auditory processing-based features describes the methodology for feature extraction. Section Database description and experiments presents database description and experiments. Section Recognition results discusses the results, and finally, Section Conclusions concludes the paper.
Most state-of-the-art ASR systems perform far below the human auditory system in the presence of noise. Auditory modeling, which simulates some properties of the human auditory system, has been applied to speech recognition systems to enhance robustness. The information coded in auditory spike trains and the information transfer processing principles found in the auditory pathway have been exploited for this purpose. Neural synchrony has been used for creating noise-robust representations of speech: the model parameters are fine-tuned to conform to the population discharge patterns in the auditory nerve, which are then used to derive estimates of the spectrum on a frame-by-frame basis. This approach was reported to be effective in noise and to improve ASR performance substantially. Various auditory processing-based approaches were proposed to improve robustness, and in particular, the works described in[13, 20] focused on the additive noise problem. Further, a model of auditory perception (PEMO) developed by Dau et al. has been used as a front end for ASR and performed better than the standard Mel-frequency cepstral coefficients (MFCC) on an isolated word recognition task. Auditory processing principles that attempt to model human hearing to some extent have also been applied to speech recognition[6, 17]. The modulation spectrum is an important psychoacoustic property that represents slow temporal modulations, which are significant for determining speech intelligibility. Normalized modulation spectra have been proposed for improving robustness. Similar work in the context of large vocabulary speech recognition, such as the noisy Wall Street Journal (New York, NY, USA) and GALE tasks, was reported in[24, 25].
Feature extraction at different stages of the auditory model output has been studied to determine which component has the highest impact on recognition accuracy. The study also evaluated the contribution of each stage of auditory processing to robustness on the Resource Management database by using the SPHINX-III speech recognition system (Carnegie Mellon University, Pittsburgh, PA, USA). In particular, rectification, non-linearities, short-term adaptation and low-pass filtering were shown to contribute the most to robustness at low SNRs.
In another study, techniques motivated by human auditory processing were shown to improve the accuracy of automatic speech recognition systems. It was shown that non-linearities in the representation, especially the non-linear threshold effect, played an important role in improving robustness. Another important aspect was the impact of time-frequency resolution: the best estimates of noise attributes are obtained by using relatively long observation windows, and frequency smoothing provides significant improvements to robust recognition.
In the context of speaker identification, auditory-based features have been extensively studied. MFCC and gammatone frequency cepstral coefficients (GFCC) have been compared, and the noise robustness improvements of GFCC have been explained.
In our earlier studies, several auditory processing-motivated features have shown considerably improved robustness for both additive noise and reverberation. However, all of the above studies are confined to small and medium vocabulary tasks. The present work is an attempt to apply these techniques to a large and complex vocabulary task, namely Aurora-4, which is based on the Wall Street Journal database. The artificially added noises, at SNRs ranging from 5 to 15 dB, include babble, car, street and restaurant. The effects at different stages of processing are analyzed to study the contribution of each stage to improving robustness. A preliminary version of our work was presented earlier.
Auditory processing-based features
In this section, a general overview of gammatone filter bank-, non-linearity- and modulation spectrum-based auditory features is presented.
Gammatone filter bank
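The impulse response of a gammatone filter is, in its standard form,
$$g(t) = a\,t^{n-1} e^{-2\pi b t} \cos(2\pi f_c t + \phi),$$
where $n$ is the order of the filter, $b$ is the bandwidth of the filter, $a$ is the amplitude, $f_c$ is the filter center frequency, and $\phi$ is the phase.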
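A minimal sketch of one such filter follows, using the standard gammatone form $g(t) = a\,t^{n-1} e^{-2\pi b t}\cos(2\pi f_c t + \phi)$ with the symbols defined above. The bandwidth rule (1.019 times the equivalent rectangular bandwidth of Glasberg and Moore), the fourth-order default, the 50 ms duration and the 16 kHz sampling rate are common choices assumed here for illustration; the source does not state which values the authors used.

```python
import numpy as np

def erb_bandwidth(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone_impulse_response(fc_hz, fs_hz, duration_s=0.05,
                               order=4, amplitude=1.0, phase=0.0):
    """Sample g(t) = a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phi)."""
    t = np.arange(int(duration_s * fs_hz)) / fs_hz
    b = 1.019 * erb_bandwidth(fc_hz)   # assumed bandwidth rule, not from the paper
    envelope = amplitude * t ** (order - 1) * np.exp(-2.0 * np.pi * b * t)
    return envelope * np.cos(2.0 * np.pi * fc_hz * t + phase)

# One channel centered at 1 kHz, sampled at an assumed 16 kHz.
fs_hz = 16000.0
h = gammatone_impulse_response(fc_hz=1000.0, fs_hz=fs_hz)

# Locate the spectral peak to confirm the filter is tuned near fc.
magnitude = np.abs(np.fft.rfft(h))
peak_hz = np.fft.rfftfreq(len(h), d=1.0 / fs_hz)[np.argmax(magnitude)]
```

A filter bank is obtained by repeating this for a set of center frequencies with non-uniform spacing, each with its own bandwidth.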
where $h_m(k)$ is the impulse response of the filter.
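One plausible reading of this definition is that each channel output is the convolution of the input signal with the channel impulse response $h_m(k)$. The sketch below shows that operation under this assumption; the decaying-cosine impulse responses are placeholders standing in for real gammatone responses, and the function name is hypothetical.

```python
import numpy as np

def filter_bank_outputs(signal, impulse_responses):
    """Convolve the input with every channel impulse response h_m(k).

    Returns an array of shape (num_channels, len(signal)).
    """
    signal = np.asarray(signal, dtype=float)
    return np.stack([
        np.convolve(signal, h, mode="full")[:len(signal)]
        for h in impulse_responses
    ])

fs = 16000
k = np.arange(400) / fs  # 25 ms of samples
# Placeholder impulse responses (decaying cosines), one per channel.
impulse_responses = [
    np.exp(-200.0 * k) * np.cos(2.0 * np.pi * fc * k)
    for fc in (500.0, 1000.0, 2000.0)
]

# A unit impulse at the input returns each impulse response at the output.
x = np.zeros(1000)
x[0] = 1.0
y = filter_bank_outputs(x, impulse_responses)
```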
where $x_i[t]$ is the $i$th channel log gamma spectral value, and $y_i[t]$ is the corresponding sigmoid-compressed value. The optimal parameters are derived from evaluation on the Resource Management development set in additive noise at 10 dB.
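The exact parameterization of the sigmoid is not recoverable from the text, so the sketch below uses a generic logistic function with a hypothetical `slope` and `midpoint` to show how log spectral values $x_i[t]$ would be mapped to compressed values $y_i[t]$. The parameter names and the example values are assumptions, not the tuned values mentioned above.

```python
import numpy as np

def sigmoid_compress(log_spectrum, slope=1.0, midpoint=0.0):
    """Map log spectral values x_i[t] to compressed values y_i[t] in (0, 1).

    slope and midpoint are hypothetical parameters for illustration.
    """
    x = np.asarray(log_spectrum, dtype=float)
    return 1.0 / (1.0 + np.exp(-slope * (x - midpoint)))

# Rows are frames t, columns are channels i (synthetic values).
x = np.array([[-4.0, 0.0, 4.0],
              [-1.0, 1.0, 2.0]])
y = sigmoid_compress(x, slope=0.5, midpoint=0.0)
```

The function is monotonic, so the ordering of spectral values is preserved, while very low and very high values are squeezed towards 0 and 1.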
Database description and experiments
The Aurora 4 evaluation task provides a standard database for comparing the effectiveness of robust techniques in LVCSR tasks in the presence of mismatched channels and additive noises. It is a part of the ETSI standardization process and is derived from the standard 5k WSJ0 Wall Street Journal database. It has 7,180 training utterances totalling approximately 15 h and 330 test utterances with an average duration of 7 s.
The acoustic data (both training and test) are also available in two different sampling frequencies (8 and 16 kHz), compressed or uncompressed. Two different training conditions were specified. Under clean training (clean train), the training set is the full SI-84 WSJ train set processed with no noise added. Under multicondition training (multi-train), about half of the training data were recorded using one microphone; the other half were recorded under a different channel (also used in some of the test sets) with different types of noise and different SNRs added. The noise types are similar to the noisy conditions in test.
The Aurora 4 test data include 14 test sets from two different channel conditions and six different added noises (in addition to the clean environment). The SNR was randomly selected between 0 and 15 dB on an utterance-by-utterance basis. The evaluation set, which comprises 5,000 words under two different channel conditions, covers one clean and six noisy environments: no noise (set01), car (set02), babble (set03), restaurant (set04), street (set05), airport (set06) and train (set07). The original audio data for test conditions 1 to 7 was recorded with a Sennheiser microphone (Lower Saxony, Germany), while test conditions 8 to 14 were recorded using a second microphone that was randomly selected from a set of 18 different microphones. These included such common types as a Crown PCC-160 (Elkhart, IN, USA), Sony ECM-50PS (New York, NY, USA) and a Nakamichi CM100 (Tokyo, Japan). Noise was digitally added to this audio data to simulate operational environments.
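The digital addition of noise at a randomly drawn SNR can be sketched as follows. This is a generic power-based mixing procedure, not the tool used to build Aurora 4; the function name, the synthetic signals and the assumption that the noise recording is at least as long as the speech are all illustrative.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the mixture has the requested SNR, then add it.

    Assumes the noise recording is at least as long as the speech.
    Returns the noisy signal and the gain applied to the noise.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain

rng = np.random.default_rng(0)
speech = np.sin(2.0 * np.pi * 440.0 * np.arange(16000) / 16000.0)  # stand-in for speech
noise = rng.normal(size=16000)                                      # stand-in for noise
snr_db = rng.uniform(0.0, 15.0)   # per-utterance SNR drawn from the range stated above
noisy, gain = mix_at_snr(speech, noise, snr_db)

# SNR actually achieved by the scaled noise, in dB.
achieved_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean((gain * noise) ** 2))
```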
The HTK setup uses three-state cross-word triphone models tied to approximately 3,000 states, each represented by four-component Gaussian mixtures with diagonal covariance, together with a 5,000-word closed vocabulary bigram language model (LM). Triphone states were tied using the linguistically driven top-down decision-tree clustering technique, resulting in a total of 3,135 tied states in clean train and 3,068 tied states in multi-train. The CMU dictionary was used to map lexical items into phoneme strings. The LM weights, pruning thresholds and insertion penalties were adopted from earlier work.
Accuracy rate (%) baselines for different feature extraction techniques
Accuracy rates (%) for the different extraction techniques
We hypothesize that we do not observe a similar effect in this case because of the different task (large vocabulary with triphones) and the different environment (additive noise only). However, from the table, we can see that the optimized non-linearity improved the performance of GFCC and GFCC-MS considerably. Further, we can also observe that the contribution of the non-linearity to the improved performance is consistent for all types of noise. This indicates that including a non-linearity is beneficial for improving robustness in noisy environments.
The features proposed in the present study are derived from auditory characteristics, which include gammatone filtering, non-linear processing and modulation spectral processing, emulating the cochlea and the middle ear to improve robustness. In earlier studies, several auditory processing-motivated features have improved robustness for small and medium vocabulary tasks. The paper has studied the application of these techniques to a large and complex vocabulary task, namely, the Aurora-4 database. The results have shown that the proposed features considerably improved the robustness in all types of noise conditions. However, the present study is essentially confined to handling noise effects on speech and has not considered reverberant conditions. The selected weights for the non-linearity were heuristic, and automatic selection of optimal weights from the evaluation data is desirable. In future work, we would like to investigate these issues and evaluate the performance of the proposed features for reverberant environments and large vocabulary tasks.
- Barker J, Vincent E, Ma N, Christensen C, Green P: The PASCAL CHiME speech separation and recognition challenge. Comput. Speech Lang 2013, 27(3):621-633. 10.1016/j.csl.2012.10.004
- Droppo J, Acero A: Environmental robustness. In Springer Handbook of Speech Processing. Edited by: Benesty J, Sondhi MM, Huang Y. New York: Springer; 2008:653-679.
- Gales MJF: Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang 1998, 12(2):75-98. 10.1006/csla.1998.0043
- Omologo M, Svaizer P, Matassoni M: Environmental conditions and acoustic transduction in hands-free speech recognition. Speech Commun. 1998, 25: 75-95. 10.1016/S0167-6393(98)00030-2
- Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process 2001, 9(5):504-512. 10.1109/89.928915
- Hermansky H, Morgan N: RASTA processing of speech. IEEE Trans. Speech Audio Proc 1994, 2(4):578-589. 10.1109/89.326616
- Gales MJF, Young SJ: A fast and flexible implementation of parallel model combination. ICASSP 1995, 1: 133-136.
- Kim C: Signal processing for robust speech recognition motivated by auditory processing. Ph.D. Thesis, CMU 2010
- Brown GJ, Palomaki KJ: A reverberation-robust automatic speech recognition system based on temporal masking. J. Acoustical Soc. Am 2008, 123(5):2978.
- Ghitza O: Auditory models and human performance in tasks related to speech coding and speech recognition. IEEE Trans. Speech Audio Proc 1994, 2(1):115-132.
- Kim D-S, Lee S-Y, Kil RM: Auditory processing of speech signals for robust speech recognition in real-world noisy environments. IEEE Trans. Speech Audio Proc 1999, 7: 55-69. 10.1109/89.736331
- Dimitriadis D, Maragos P, Potamianos A: On the effects of filterbank design and energy computation on robust speech recognition. IEEE Trans. Audio Speech Lang. Proc 2011, 19: 1504-1516.
- Flynn R, Jones E: A comparative study of auditory-based front-ends for robust speech recognition using the Aurora 2 database. Paper presented at the IET Irish signals and systems conference, Dublin, Ireland, 28–30 June 2006, pp. 28–30
- Schluter R, Bezrukov I, Wagner H, Ney H: Gammatone features and feature combination for large vocabulary speech recognition. Paper presented at the IEEE international conference on acoustics, speech, and signal processing (ICASSP), Honolulu, HI, USA, 15–20 April 2007, pp. 649–652
- Shao Y, Jin Z, Wang DL, Srinivasan S: An auditory-based feature for robust speech recognition. Paper presented at the IEEE international conference on acoustics, speech, and signal processing (ICASSP), Taipei, Taiwan, 19–24 April 2009, pp. 4625–4628
- Drullman R, Festen J, Plomp R: Effect of reducing slow temporal modulations on speech reception. J. Acoustical Soc. Am 1994, 95: 2670-2680. 10.1121/1.409836
- Kanedera N, Arai T, Hermansky H, Pavel M: On the importance of various modulation frequencies for speech recognition. Paper presented at Eurospeech, Rhodes, Greece, 22–25 Sept 1997, pp. 1079–1082
- Falk TH, Chan WY: Modulation spectral features for robust far-field speaker identification. IEEE Trans. Audio Speech Lang. Process 2010, 18(1):90-100.
- Maganti HK, Matassoni M: An auditory based modulation spectral feature for reverberant speech recognition. Paper presented at the 13th annual conference of the International Speech Communication Association (Interspeech), Makuhari, Japan, 26–30 Sept 2010, pp. 570–573
- Deng L, Sheikhzadeh H: Use of temporal codes computed from a cochlear model of speech recognition, chapter 15. In Listening to Speech: An Auditory Perspective. Edited by: Greenberg S, Ainsworth W. Mahwah: Lawrence Erlbaum; 2006:237-256.
- Kleinschmidt M, Tchorz J, Kollmeier B: Combining speech enhancement and auditory feature extraction for robust speech recognition. Speech Commun. 2001, 34: 75-91. 10.1016/S0167-6393(00)00047-9
- Dau T, Pueschel D, Kohlrausch A: A quantitative model of the effective signal processing in the auditory system. J. Acoustical Soc. Am 1996, 99: 3615-3622. 10.1121/1.414959
- Xiong X, Eng Siong C, Haizhou L: Normalization of the speech modulation spectra for robust speech recognition. IEEE Trans. Audio Speech Lang. Proc 2008, 16(8):1662-1674.
- Mitra V, Franco H, Graciarena M, Mandal A: Normalized amplitude modulation features for large vocabulary noise-robust speech recognition. Paper presented at the IEEE international conference on acoustics, speech and signal processing (ICASSP), Kyoto, Japan, 25–30 March 2012, pp. 4117–4120
- Valente F, Magimai-Doss M, Plahl C, Ravuri SV: Hierarchical processing of the modulation spectrum for GALE Mandarin LVCSR system. Paper presented at the meeting of the International Speech Communication Association (Interspeech), Brighton, UK, 6–10 Sept 2009, pp. 2963–2966
- Chiu Y-HB, Raj B, Stern RM: Learning-based auditory encoding for robust speech recognition. Paper presented at the IEEE international conference on acoustics, speech and signal processing (ICASSP), Dallas, TX, USA, 14–19 March 2010, pp. 4278–4281
- Zhao X, Shao Y, Wang DL: CASA-based robust speaker identification. IEEE Trans. Audio Speech Lang. Proc 2012, 20(5):1608-1616.
- Zhao X, Wang DL: Analyzing noise robustness of MFCC and GFCC features in speaker identification. Paper presented at the IEEE international conference on acoustics, speech and signal processing (ICASSP), Vancouver, Canada, 26–31 May 2013, pp. 7204–7208
- Matassoni M, Maganti HK, Omologo M: Non-linear spectro-temporal modulations for reverberant speech recognition. Paper presented at the joint workshop on hands-free speech communication and microphone arrays (HSCMA), Edinburgh, Scotland, 30 May–1 June 2011, pp. 115–120
- Slaney M: An Efficient Implementation of the Patterson-Holdsworth Auditory Filter Bank. Apple technical report, Perception Group, 1993
- Glasberg B, Moore B: Derivation of auditory filter shapes from notched-noise data. Hearing Res 1990, 47: 103-108. 10.1016/0378-5955(90)90170-T
- Ellis DPW: Gammatone-like spectrograms. http://www.ee.columbia.edu/~dpwe/resources/matlab/gammatonegram/. Accessed 6 June 2011.
- Parihar N, Picone J, Pearce D, Hirsch HG: Performance analysis of the Aurora large vocabulary baseline system. Paper presented at the 12th European signal processing conference (EUSIPCO), Vienna, Austria, 6–10 Sept 2004, pp. 553–556
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.