A perceptual masking approach for noise robust speech recognition
© Maganti and Matassoni; licensee Springer. 2012
Received: 8 December 2011
Accepted: 12 November 2012
Published: 22 December 2012
This article describes a modified technique for enhancing noisy speech to improve automatic speech recognition (ASR) performance. The proposed approach improves on the widely used spectral subtraction method, which inherently suffers from musical noise. Through psychoacoustic masking and critical band variance normalization, the artifacts produced by spectral subtraction are minimized, improving ASR accuracy. The popular ETSI-2 advanced front end is used for comparison. Speech recognition evaluations on the standard noisy AURORA-2 tasks show enhanced performance for all noise conditions.
Enhancement of noise-corrupted speech signals is a challenging task for speech processing systems deployed in real-world applications. In practice, speech signals are usually degraded by additive background noise, reverberation and speech from other speakers. Robust speech processing techniques have two primary goals: to improve the intelligibility and quality of noise-corrupted speech for human listeners, and to extract robust features that lead to improved performance for speech recognition systems.
Apart from extracting robust features, that is, parameters made less sensitive to noise by modifying the extracted features, other research directions aimed at increasing the performance of speech recognizers in noise are speech signal enhancement, model adaptation and hybrid methods [2–5]. Model adaptation techniques fail in constantly changing environments where little or no adaptation data is available, while hybrid methods attempt to preprocess speech signals and depend on the reliability of the estimates for the processed segments. Signal enhancement techniques, on the other hand, require no training and provide “real-time” enhancement of the recognition accuracy. Spectral subtraction is the most widely used conventional enhancement method for reducing additive noise. Many improvements have been proposed to deal with the problems typically associated with spectral subtraction, such as residual broadband noise and narrow-band tonal noise, referred to as musical noise. Variants of spectral subtraction for this purpose include spectral subtraction with over-subtraction, non-linear spectral subtraction, multi-band spectral subtraction, MMSE spectral subtraction, and extended spectral subtraction [3, 6].
Spectral subtraction based on perceptual properties has been investigated to improve the intelligibility and quality of speech signals [7–9]. The masking properties of the human auditory system are incorporated into the enhancement process in order to attenuate the noise components that are already inaudible due to masking. In one approach, the selected masking threshold level is high, so that the residual noise is masked and inaudible. In another, a psychoacoustical spectral weighting rule is proposed that uses only estimates of the masking threshold and the noise power spectral density to completely mask the distortions of the residual noise. The use of human auditory masking in Kalman filtering for speech enhancement has also been considered. Furthermore, a sub-band variance normalization technique has been proposed in which speech frames are characterized by high variance and noise frames by low variance; the latter are suppressed to improve ASR performance in the presence of both additive noise and reverberation.
In the present study, an alternative approach based on a psychoacoustical model is proposed for reducing the artifacts associated with spectral subtraction, with the aim of improving speech recognition performance in the presence of additive noise. Based on the human auditory system, the noise below the audible threshold is suppressed, which reduces the amount of modification to the spectral magnitude and hence the amount of distortion introduced into the cleaned speech signal. Further, critical band variance normalization is performed to minimize the musical noise, which is caused by increased variance at random frequencies. The features derived from this combination of techniques are shown to be reliable and robust to the effects of additive noise. Their effectiveness is demonstrated with experiments on the noisy AURORA-2 database. For comparison, recognition results obtained with standard spectral subtraction and with the ETSI advanced front-end are also reported.
The article is organized as follows. Section 2 briefly summarizes the principle of spectral subtraction using over-subtraction. Section 3 describes the psychoacoustically motivated features, including tone and noise masking and critical band variance normalization. Section 4 discusses spectral subtraction with a perceptual post-filter, and Section 5 describes the database, experiments and results. Finally, Section 6 concludes the article.
Spectral subtraction for speech enhancement
The noise-corrupted speech signal is modeled as

$$y(t) = s(t) + d(t)$$

where y(t) is the degraded speech signal, s(t) represents the clean signal, and d(t) is the additive noise, which is unknown and uncorrelated with the speech signal.
In the frequency domain, the power spectrum of the clean speech is estimated by subtracting a noise estimate from the power spectrum of the noisy speech:

$$|\hat{S}(k)|^2 = |Y(k)|^2 - |\hat{D}(k)|^2$$

where $|\hat{D}(k)|^2$ is the average value of the noise square-magnitude taken during non-speech activity. The performance of this technique depends on the accuracy of the noise estimate and is limited by the processing distortions caused by random variations of the noise spectrum. Where the noise has been overestimated, the subtraction yields randomly located negative values for the estimated clean speech magnitude, and spectral estimates that fall below a threshold are remapped non-linearly. This leads to undesired residual noise called musical noise (a narrow-band spectrum with tones randomly distributed over time and frequency).
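To reduce the musical noise, over-subtraction with a spectral floor is applied:

$$|\hat{S}(k)|^2 = \begin{cases} |Y(k)|^2 - \alpha\,|\hat{D}(k)|^2, & \text{if } |Y(k)|^2 - \alpha\,|\hat{D}(k)|^2 > \beta\,|\hat{D}(k)|^2 \\ \beta\,|\hat{D}(k)|^2, & \text{otherwise} \end{cases}$$

where α and β are the subtraction factor and the spectral floor parameter, respectively, and $|\hat{D}(k)|^2$ is the noise power estimate. To reduce the speech distortion caused by large values of α, its value is adapted from frame to frame depending on the segmental noisy signal-to-noise ratio (NSNR) of the frame. In general, less subtraction is applied for frames with high NSNR and more for frames with low NSNR. The noise power spectrum is estimated with the minimum statistics technique, and the NSNR is computed using a decision-directed approach.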
where m is the frame index and α0 is the value of α at 0 dB NSNR. With higher over-subtraction, strong components with a low SNR are attenuated, which prevents musical noise. However, too strong an over-subtraction may suppress too many components and distort the signal.
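As a rough illustration of the over-subtraction rule and the NSNR-dependent subtraction factor discussed in this section, the following Python sketch processes one frame of power-spectrum values. The linear schedule for α (slope 3/20 between −5 and 20 dB NSNR) and the defaults α0 = 4 and β = 0.01 are the values commonly quoted from Berouti et al.; they are assumptions here, not settings confirmed by this article, and the function names are hypothetical.

```python
def subtraction_factor(nsnr_db, alpha0=4.0):
    """Over-subtraction factor for one frame, given its NSNR in dB.

    Assumed Berouti-style schedule: alpha decreases linearly with NSNR
    between -5 dB and 20 dB and is held constant outside that range,
    so alpha equals alpha0 at 0 dB NSNR.
    """
    clipped = min(max(nsnr_db, -5.0), 20.0)
    return alpha0 - 3.0 * clipped / 20.0


def oversubtract(noisy_power, noise_power, alpha, beta=0.01):
    """Power spectral subtraction with over-subtraction and a spectral floor.

    noisy_power: per-bin power of the noisy frame.
    noise_power: per-bin noise power estimate.
    Bins whose subtracted value falls below beta * noise are set to that floor.
    """
    cleaned = []
    for y, d in zip(noisy_power, noise_power):
        diff = y - alpha * d
        floor = beta * d
        cleaned.append(diff if diff > floor else floor)
    return cleaned


# Usage: a frame at 0 dB NSNR uses alpha0; the second bin hits the floor.
alpha = subtraction_factor(0.0)
cleaned = oversubtract([10.0, 1.0], [1.0, 1.0], alpha)
```

Frames with high NSNR receive a smaller α and are therefore modified less, which matches the trade-off between noise reduction and speech distortion described above.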
Psychoacoustical masking model
Spectral subtraction with over-subtraction reduces the noise to some extent, but the musical noise is not completely eliminated, which affects the quality of the speech signal. There is a trade-off between the amount of noise reduction and the speech distortion. Perceptually based techniques help reduce the noise by taking advantage of the masking properties of the auditory system. To further enhance the quality, the noise and tones are masked and critical band variance normalization is performed, both exploiting the masking properties of the human auditory system. The human auditory system does not perceive all frequencies in the same way, and certain sounds are masked in the presence of competing sounds. The two main properties of the human auditory system that make up the psychoacoustic model are the absolute threshold of hearing and auditory masking.
Absolute threshold of hearing
The frequency components of the signal with power levels below the absolute threshold of hearing (ATH) can be discarded, as they do not contribute to the perceptibility of the signal.
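The text does not state which ATH curve is used. A minimal sketch, assuming the widely used Terhardt approximation of the threshold in quiet (the form given in the perceptual audio coding literature), could look like this; the function names are hypothetical.

```python
import math


def ath_db_spl(freq_hz):
    """Approximate absolute threshold of hearing in dB SPL (Terhardt's formula).

    Assumption: this approximation stands in for whatever ATH curve the
    article actually used. Valid for positive frequencies in Hz.
    """
    f = freq_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


def is_audible(freq_hz, level_db_spl):
    """True if a component at this level lies at or above the threshold in quiet."""
    return level_db_spl >= ath_db_spl(freq_hz)
```

The curve has its minimum near 3.3 kHz, where hearing is most sensitive, and rises steeply toward very low and very high frequencies, so components there are the first candidates to be discarded.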
Masking is the inability of the human auditory system to distinguish small differences in frequency between sounds played at the same time. For example, a strong 1 kHz signal masks nearby frequencies, making them inaudible to the listener. For a masked signal to be heard, its power has to be raised above a threshold that is determined by the frequency and strength of the masker tone.
The masking analysis method described in the MPEG-1 audio coder is used to detect the tonal and non-tonal components. The tonal and noise masking thresholds, which give the maximum level of noise that is inaudible in the presence of speech, are then computed. The calculation steps are:
The masking threshold is computed from the short-term power spectral density estimate of the input signal. The power density spectrum is obtained from the FFT of the input signal, following multiplication by a Hann window. The magnitude of each spectral component is converted to a decibel scale, to obtain the estimate P[k]. The power spectrum is normalized to a level of 96 dB SPL, such that the maximum spectral component corresponds to this value.
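The normalization step above can be sketched as follows, assuming a single frame of samples and a simple peak-referenced offset. To stay self-contained the sketch uses a naive DFT from the standard library instead of an FFT routine; the function name and the small power floor that avoids log(0) are assumptions of this example.

```python
import cmath
import math


def normalized_psd_db(frame, peak_db=96.0):
    """Hann-windowed power spectrum in dB, shifted so its maximum equals peak_db."""
    n = len(frame)  # assumes n >= 2
    # Symmetric Hann window.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    x = [s * w for s, w in zip(frame, window)]
    psd_db = []
    for k in range(n // 2 + 1):
        # Naive DFT bin; a practical implementation would call an FFT instead.
        acc = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        power = abs(acc) ** 2
        psd_db.append(10.0 * math.log10(max(power, 1e-12)))
    offset = peak_db - max(psd_db)
    return [p + offset for p in psd_db]


# Usage: a sinusoid centred on bin 4 of a 64-sample frame.
frame = [math.sin(2.0 * math.pi * 4 * i / 64) for i in range(64)]
psd = normalized_psd_db(frame)
```

Referencing the strongest component to 96 dB SPL gives the later masking computations a fixed playback-level assumption, since the true sound pressure level of a recorded signal is not known.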
Tonal maskers are removed from the power spectrum P[k], by setting all frequency lines within the examined range to −∞. The sound pressure levels of noise maskers are obtained by summing the energies of spectral lines within each critical band to compute PTM(z).
where z_j, Δz, and P_M(z_j) represent, respectively, the masker Bark frequency, the Bark frequency separation between the masker and target, and the sound pressure level of the masker. The spread of masking is only considered within the range −3 ≤ Δz < 8, for reasons of implementation complexity.
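The two operations described in these steps, combining spectral lines within a critical band and spreading a masker's level across neighbouring bands, can be sketched as below. The piecewise spreading function is the one commonly given for MPEG-1 psychoacoustic model 1 in the perceptual coding literature, and the Bark conversion is Zwicker's approximation; both are assumptions here, since the article's own equations are not reproduced in this text. The function names are hypothetical.

```python
import math


def bark(freq_hz):
    """Zwicker's approximation of the Bark (critical band rate) scale."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def band_level_db(levels_db):
    """Combine spectral-line levels (dB) in one critical band by summing energies.

    Lines set to -inf (removed tonal maskers) contribute no energy.
    """
    energy = sum(10.0 ** (0.1 * p) for p in levels_db)
    return 10.0 * math.log10(energy) if energy > 0.0 else float("-inf")


def spreading_db(delta_z, masker_db):
    """Spread of masking in dB at a Bark distance delta_z from the masker.

    Assumed MPEG-1 model 1 form; only -3 <= delta_z < 8 is considered,
    and -inf (no contribution) is returned outside that range.
    """
    if -3.0 <= delta_z < -1.0:
        return 17.0 * delta_z - 0.4 * masker_db + 11.0
    if -1.0 <= delta_z < 0.0:
        return (0.4 * masker_db + 6.0) * delta_z
    if 0.0 <= delta_z < 1.0:
        return -17.0 * delta_z
    if 1.0 <= delta_z < 8.0:
        return (0.15 * masker_db - 17.0) * delta_z - 0.15 * masker_db
    return float("-inf")


# Usage: two 60 dB lines in the same band, and the spread one Bark above a masker.
combined = band_level_db([60.0, 60.0])
spread = spreading_db(bark(1200.0) - bark(1000.0), 60.0)
```

The asymmetry of the function reflects that a masker affects higher frequencies over a wider range than lower ones, which is why the considered range extends further above the masker than below it.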
Spectral subtraction with perceptual post-filter
MFCC features are extracted from the log Mel spectrum by applying a discrete cosine transform. The first thirteen cepstral coefficients, along with their first- and second-order derivatives, are used.
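The cepstral step can be illustrated with a short sketch that applies an unnormalized DCT-II to one frame of log Mel energies and keeps the first thirteen coefficients. The number of Mel channels (23 in the usage line) and the lack of scaling are assumptions of this example, and the function name is hypothetical.

```python
import math


def cepstra_from_log_mel(log_mel, num_ceps=13):
    """First num_ceps DCT-II coefficients of one frame of log Mel energies."""
    m = len(log_mel)
    return [
        sum(e * math.cos(math.pi * n * (j + 0.5) / m)
            for j, e in enumerate(log_mel))
        for n in range(num_ceps)
    ]


# Usage: a flat log Mel spectrum has energy only in the zeroth coefficient.
ceps = cepstra_from_log_mel([1.0] * 23)
```

The first- and second-order derivatives mentioned above would then be computed across consecutive frames of these coefficients, giving a 39-dimensional feature vector per frame.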
Experiments and results
The ASR experiments are performed with the proposed approach using a full HTK-based recognition system on the Aurora 2 database. The database was designed to evaluate the front end of ASR systems in noisy conditions, and training and testing follow the specifications of the database. The task is speaker-independent connected digit recognition.
The test data include eight types of realistic background noise (subway, babble, car, exhibition hall, restaurant, street, airport and train station) at various SNRs (clean, 20, 15, 10, 5, 0, and −5 dB). There are three test sets. Set A contains 4004 utterances in the first four types of noise, set B contains 4004 utterances in the other four, and set C contains 2002 utterances in which only subway and street noise are present.
From Figures 6 and 7, it can be seen that spectral subtraction has the highest word error rates (WER) for both clean and multi-condition training. It can also be observed that, at the cost of a minimal average loss on clean speech for both clean and multi-condition training, the proposed features yield an improvement for all noise conditions. The improvement is particularly large at −5, 0, 5, 10 and 15 dB SNR. The improvements are also larger for training on clean data than on multi-condition data, which is consistent with earlier findings. In both cases, the approach removes much of the noise, improving the recognition accuracy.
It can be clearly observed from Figures 6, 7, and 8 that the performance of the proposed features is consistently the best for all noise conditions, irrespective of the training. The combination of psychoacoustical masking and critical band variance normalization overcomes the basic limitation of spectral subtraction in recognition tasks: it minimizes the artifacts without distorting the original speech signal, thereby improving the ASR accuracy.
This article presented a psychoacoustical masking and critical band variance normalization based spectral subtraction approach to improve the speech recognition performance in noisy environments.
The spectral subtraction method was used to reduce the broadband noise due to peaks, and the combination of masking and variance normalization was effective in reducing the artifacts by reducing the dynamic range of the magnitude spectrum, which resulted in improved speech recognition performance. The proposed approach has been evaluated on the standard Aurora-2 database. Comparison with the standard ETSI-2 advanced front-end shows that the proposed features perform consistently better, in terms of both robustness and reliability, for all types of noise.
Future work will address the improvement of auditory-based features to deal with additive noise and reverberation simultaneously. The evaluation of these features on large-vocabulary tasks with real-world noisy speech will also be studied.
- Droppo J, Acero A: Environmental robustness, chapter 33. In Springer Handbook of Speech Processing. Edited by: Benesty J, Sondhi MM, Huang Y. 2008:653-680.
- Furui S, Sondhi M: Advances in Speech Signal Processing. Marcel Dekker, Inc., New York; 1991.
- Ephraim Y, Cohen I: Recent advancements in speech enhancement. The Electrical Engineering Handbook, CRC Press; 2006.
- Berouti M, Schwartz R, Makhoul J: Enhancement of speech corrupted by acoustic noise. In Proc. IEEE International Conf. on Acoustics, Speech, and Signal Processing (ICASSP). Washington, USA; 1979:208-211.
- Gales MJF: Robust Speech Recognition of Uncertain or Missing Data, chapter 1. In Springer series of Signal, Image and Speech Processing. Edited by: Kolossa D, Haeb-Umbach R. 2011:101-125.
- Krishnamoorthy P, Prasanna SRM: Reverberant speech enhancement by temporal and spectral processing. IEEE Transactions on Speech and Audio Processing 2009, 17(2):253-266.
- Virag N: Single channel speech enhancement based on masking properties of the human auditory system. IEEE Transactions on Speech and Audio Processing 1999, 7(2):126-137. 10.1109/89.748118
- Gustafsson S, Jax P, Vary P: A novel psychoacoustically motivated audio enhancement algorithm preserving background noise characteristics. In Proc. IEEE International Conf. on Acoustics, Speech, and Signal Processing (ICASSP). Washington, USA; 1998:397-400.
- Ma N, Bouchard M, Goubran RA: Speech enhancement using a masking threshold constrained Kalman filter and its heuristic implementations. IEEE Transactions on Speech and Audio Processing 2006, 14(1):19-32.
- Maganti HK, Zanon S, Matassoni M: Sub-band spectral variance feature for noise robust ASR. In Proc. 19th European Signal Processing Conference (EUSIPCO). Barcelona, Spain; 2011:2114-2118.
- ETSI ES 202050 STQ: Distributed Speech Recognition, Advanced Front-End Feature Extraction Algorithm, Compression Algorithm, ETSI ES 202 050 v1.1.3. 2003-11.
- Boll S: Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing 1979, 27(2):113-120. 10.1109/TASSP.1979.1163209
- Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on Speech and Audio Processing 2001, 9:504-512. 10.1109/89.928915
- Ephraim Y, Malah D: Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 1984, ASSP-32(6):1109-1121.
- Painter T, Spanias A: Perceptual coding of digital audio. Proc. of the IEEE 2000, 88:451-513.
- Young SJ, Evermann G, Gales MJF, Kershaw D, Moore G, Odell JJ, Ollason DG, Povey D, Valtchev V, Woodland PC: The HTK book (version 3.4). Cambridge University Engineering Department, Cambridge, UK; 2006.
- Hirsch HG, Pearce D: Applying the Advanced ETSI Frontend to the Aurora-2 task, in version 1.1. 2006.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.