Speech intelligibility improvement in noisy reverberant environments based on speech enhancement and inverse filtering
© The Author(s). 2018
- Received: 24 October 2017
- Accepted: 24 April 2018
- Published: 23 May 2018
The speech intelligibility of indoor public address systems is degraded by reverberation and background noise. This paper proposes a preprocessing method that combines speech enhancement and inverse filtering to improve the speech intelligibility in such environments. An energy redistribution speech enhancement method was modified for use in reverberation conditions, and an auditory-model-based fast inverse filter was designed to achieve better dereverberation performance. An experiment was performed in various noisy, reverberant environments, and the test results verified the stability and effectiveness of the proposed method. In addition, a listening test was carried out to compare the performance of different algorithms subjectively. The objective and subjective evaluation results reveal that the speech intelligibility is significantly improved by the proposed method.
- Speech intelligibility
- Speech enhancement
- Inverse filtering
- Auditory model
An indoor public address (I-PA) system is a sound amplification system that is widely used in auditoriums, classrooms, factories, and conference rooms. However, its speech intelligibility is often degraded by near-end reverberation and background noise. Therefore, it is desirable to find an effective way to improve the speech intelligibility in such environments.
Reverberation is caused by wall reflections that distort the sound transmission channel, while background noise degrades the speech intelligibility through noise masking. Thus, methods for improving speech intelligibility can be broadly classified into two categories. The first focuses on compensation for transmission channel distortion [3, 5–12], and the second focuses on noise suppression and speech enhancement. In the first category, sound transmission in enclosed spaces is regarded as a linear time-invariant (LTI) system [5, 6], so the output response of the system can be expressed as the convolution of the input signal and the room impulse response (RIR). Therefore, the influence of reverberation can be eliminated by realizing the inverse of the RIR. However, this inverse will be either unstable or acausal since the RIR is generally considered a nonminimum phase function.
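The convolution model and the instability of a direct inverse can be illustrated with a minimal sketch. The two-tap "RIRs" and the helper names `convolve` and `causal_inverse` below are hypothetical toy choices for illustration only; real RIRs have thousands of taps.

```python
def convolve(x, h):
    """Linear convolution y[n] = sum_k h[k] * x[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def causal_inverse(h, length):
    """First `length` taps of the causal filter v satisfying h * v = delta."""
    v = []
    for n in range(length):
        target = 1.0 if n == 0 else 0.0
        acc = sum(h[k] * v[n - k] for k in range(1, min(n, len(h) - 1) + 1))
        v.append((target - acc) / h[0])
    return v


# Toy minimum-phase "RIR": the reflection is weaker than the direct sound
# (zero at z = -0.5, inside the unit circle), so the causal inverse decays.
h_min = [1.0, 0.5]
# Toy nonminimum-phase "RIR": the reflection is stronger than the direct
# sound (zero at z = -2, outside the unit circle), so the causal inverse grows.
h_nonmin = [1.0, 2.0]

speech = [1.0, 0.0, -1.0]                 # stand-in for a dry input signal
reverberant = convolve(speech, h_nonmin)  # LTI model of what the listener receives

print(causal_inverse(h_min, 5))     # tap magnitudes shrink
print(causal_inverse(h_nonmin, 5))  # tap magnitudes grow without bound
```

The growing taps in the second case are the reason a stable inverse of a nonminimum phase response has to be either acausal or approximated with a delay.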
In the first study of this problem, Neely realized a stable and causal inverse filter by decomposing the RIR into minimum-phase and all-pass components. This inverse filter can basically eliminate the distortion caused by wall reflections. An adaptive equalization (A-EQ) method was later proposed to compensate for the distortion of the room frequency response. The equalizer could adaptively minimize the squared error between the target response and the input signal, but the method was very sensitive to peaks and notches in the room response.
Based on Neely’s method, a new equalization method was proposed by combining a vector quantization method with an all-pole room transfer function (RTF) model to reduce the effects of the reverberation by using a lower equalizer order. However, this approach is based on an approximation of the RTF, so the exact solution of the inverse filter cannot be obtained. Kirkeby and Nelson proposed a fast inverse filtering (FIF) method for designing single or multichannel sound reproduction systems [8–10]. This method uses the principles of least squares optimization to obtain a stable and causal inverse filter, as well as regularization, to realize fast deconvolution. Although this method needs to use relatively long inverse filters, the algorithm has higher accuracy and fast deconvolution speed. Therefore, this algorithm has received much attention and is still used in current equalization methods.
Based on this algorithm, a warped domain equalization (W-EQ) method was proposed to improve the listening experience. This method uses the Bark scale, which is related to auditory perception and low-frequency response equalization, and produces a better listening experience than other equalization methods [6, 13, 14]. However, the Bark scale is not an auditory model and cannot simulate the frequency response characteristics of the basilar membrane in the cochlea. Moreover, these equalization methods do not account for the influence of background noise on speech intelligibility.
Increasing the playback level is one clear solution to improve the speech intelligibility in the event of background noise. However, it is impossible to increase the output level indefinitely due to the limited power output of loudspeakers and the pain-threshold pressure limitation of the ear. In addition, in the case of I-PA systems, the listener is located in a noisy environment, and the noise reaches the ears without any possibility of intercepting it beforehand. Therefore, a preprocessing speech enhancement method that does not increase the output power would be more suitable for use with I-PA systems [16, 17].
An energy redistribution voiced/unvoiced (ERVU) method was proposed to improve intelligibility without increasing the output power. The method redistributes more speech energy to the transient regions to reinforce speech signals. A perceptual distortion measure (PDM)-based speech enhancement (PDMSE) method was proposed based on the ERVU method and the PDM algorithm. Compared with the ERVU method, the PDMSE method can further improve speech quality without decreasing intelligibility. However, these methods do not consider the influence of reverberation on speech intelligibility.
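The idea of redistributing energy without raising the output power can be sketched as a per-frame gain followed by a global renormalization. This is only an illustration of the power constraint: the gains below are hypothetical hand-picked values, whereas the ERVU and PDMSE methods derive them from the signal and a perceptual model, and the function name `redistribute_energy` is an assumption.

```python
import math


def redistribute_energy(frames, gains):
    """Scale each frame by its gain, then renormalize so that the total
    output energy equals the total input energy."""
    e_in = sum(s * s for frame in frames for s in frame)
    scaled = [[g * s for s in frame] for frame, g in zip(frames, gains)]
    e_out = sum(s * s for frame in scaled for s in frame)
    c = math.sqrt(e_in / e_out) if e_out > 0.0 else 1.0
    return [[c * s for s in frame] for frame in scaled]


# Two toy frames: a weak transient followed by a strong steady vowel.
frames = [[0.1, -0.1], [1.0, -1.0]]
gains = [2.0, 1.0]  # hypothetical: boost the transient frame only

enhanced = redistribute_energy(frames, gains)
```

After renormalization the transient frame carries a larger share of the same total energy, which is the behavior the paragraph above describes.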
In recent years, only a few studies have considered the effects of reverberation and background noise simultaneously [19–22]. Some methods just use the near-end speech enhancement method to reduce the influence of both reverberation and background noise [19, 20]. Other methods pre-compensate the output speech by obtaining the optimal solution of the established mathematical model to improve intelligibility [21, 22]. Crespo and Hendriks proposed a multizone speech reinforcement method based on a general optimization framework. The signal model considered the influence of RTF on intelligibility in noisy environments, and the effectiveness of this approach was verified by simulation.
Hendriks et al. proposed an approximated speech intelligibility index (ASII) method to improve the speech intelligibility in a single-zone scenario. Unlike the Multizone method, the ASII method uses a speech intelligibility index to establish a mathematical model that includes late reverberation and noise. The optimal solution of the mathematical model is used to preprocess the output speech to improve intelligibility. Although the Multizone and ASII methods could improve the speech intelligibility in noisy and reverberant environments, the distortion of the speech transmission channel and the auditory features of the human ear were not considered at the same time during the signal preprocessing. Therefore, the Multizone and ASII methods do not fundamentally compensate for the distortion of the transmission channel, and the dereverberation performance is quite limited.
This paper proposes a new preprocessing method for improving speech intelligibility by a combination of the PDMSE method and the FIF method. The PDMSE method was modified for reverberant environments, and a new Gammatone (GT)-filter-based FIF method was designed to achieve better equalization and dereverberation performance. Compared with the A-EQ, W-EQ, and FIF equalization methods, the GT-filter-based FIF method can further decrease the distortion of the transmission channel. Compared with individual FIF and PDMSE methods, the improved combination method has better stability and higher speech quality. Furthermore, compared with the multizone and ASII methods, the combination method can significantly improve the speech intelligibility in different noisy and reverberant environments.
To validate the method, an experiment was performed in real environments with various noise and reverberation conditions. The speech transmission index, spectrogram, log-spectral distortion measure, short-time objective intelligibility measure, and modified rhyme test were used to compare the performance. The objective and subjective evaluation results illustrate that the method can effectively improve the speech intelligibility of I-PA systems in noisy and reverberant environments.
The remainder of the paper is organized as follows. Section 2 describes the algorithm in detail. Section 3 describes the experimental design and hardware setup. Section 4 presents the test results of the evaluations, and Section 5 concludes the paper.
In the block for speech preprocessing and synthesis, a modified PDMSE method is used in the speech enhancement stage to increase the energy of transient speech. Next, a GT-filter-based FIF method is used in the equalization stage to pre-compensate the distortion of the transmission channel. The final preprocessing and synthesis signal $s_{\mathrm{out}}(n)$ is used as an input for the loudspeaker to broadcast. The distortion signal $\varepsilon(n)$ is then recorded by a microphone, and TF decomposition and GT filtering are once again performed to obtain the short-term distortion frame $\varepsilon_{m,i}$.
The power spectral density (PSD) estimation module is next applied to estimate the energy of background noise. Finally, the gain function $\alpha$ is calculated by the PDMSE algorithm, and the inverse sub-filters $v_i$ are obtained by the GT-filter-based FIF algorithm. Both parameters are used to adjust the preprocessing speech signal to obtain the best speech intelligibility. Furthermore, based on the method by Meng et al., a sine sweep signal with a length of 10 s is used as an excitation signal to obtain the RIR in advance to calculate the inverse filter.
Three modules in the block diagram of Fig. 1 are mainly discussed in the next sections. Section 2.1 gives the description of the PDMSE algorithm module. Section 2.2 discusses the GT-filter-based FIF algorithm module, and Section 2.3 presents the block for speech preprocessing and synthesis.
2.1 Improved preprocessing speech enhancement
2.2 Improved fast inverse filtering
For the single-input-single-output (SISO) system, the RIR between the loudspeaker and receiver point contains all the information of the sound transmission channel. The GT filter bank was used to decompose the RIR $h$ into sub-filters based on the auditory model; that is, $h_i = h * g_i$, where $g_i$ denotes the $i$th GT filter and $h_i$ denotes the $i$th decomposed sub-filter. In this process, a total of 40 sub-filters are obtained in the frequency range of 125 to 8000 Hz. The fast Fourier transform (FFT) is then performed on the $i$th sub-filter to obtain its frequency response $H_i(k)$.
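The decomposition $h_i = h * g_i$ can be sketched as follows. Several details are assumptions not stated in the text above: the centre frequencies are spaced on the ERB-rate scale of Glasberg and Moore (1990), the GT filters are fourth-order with a bandwidth factor of 1.019 and truncated to 16 ms, and the eight-tap `rir` is a toy stand-in for a measured RIR.

```python
import math

FS = 16000  # sampling frequency in Hz, as used for the test speech


def erb_spaced_centres(f_lo, f_hi, count):
    """Centre frequencies equally spaced on the ERB-rate scale (assumed spacing)."""
    def to_erb_rate(f):
        return 21.4 * math.log10(4.37e-3 * f + 1.0)

    def from_erb_rate(e):
        return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

    e_lo, e_hi = to_erb_rate(f_lo), to_erb_rate(f_hi)
    step = (e_hi - e_lo) / (count - 1)
    return [from_erb_rate(e_lo + i * step) for i in range(count)]


def gammatone_ir(fc, fs, duration=0.016, order=4, b=1.019):
    """Peak-normalized gammatone impulse response centred at fc."""
    erb = 24.7 * (4.37e-3 * fc + 1.0)  # equivalent rectangular bandwidth in Hz
    g = []
    for n in range(round(duration * fs)):
        t = n / fs
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * erb * t)
        g.append(envelope * math.cos(2.0 * math.pi * fc * t))
    peak = max(abs(v) for v in g)
    return [v / peak for v in g]


def convolve(x, h):
    """Linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


rir = [1.0, 0.0, 0.0, 0.6, 0.0, -0.3, 0.0, 0.1]   # toy RIR, not measured data
centres = erb_spaced_centres(125.0, 8000.0, 40)   # 40 bands, 125 to 8000 Hz
sub_filters = [convolve(rir, gammatone_ir(fc, FS)) for fc in centres]
```

Each entry of `sub_filters` plays the role of one $h_i$; its spectrum would then be inverted band by band.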
The time-domain inverse sub-filters $v_i(k)$ are determined by computing the inverse FFT of the $i$th frequency-domain inverse sub-filters $V_i(k)$. A “cyclic shift” of the inverse FFT is used to implement a modeling delay to obtain causal and stable time-domain inverse sub-filters. Since finite impulse response (FIR) filters of limited length are used in place of the “true” inverse sub-filters during the computation, a window function is applied to $v_i(k)$ to suppress aliasing in the time domain.
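The sequence of steps (frequency-domain inversion, inverse transform, cyclic shift, windowing) can be sketched for a single filter. The expression for $V_i(k)$ is not reproduced in this text, so the sketch assumes the common regularized form $V(k) = H^*(k) / (|H(k)|^2 + \beta)$ associated with fast deconvolution; the value of `beta`, the Hann window, the 64-point transform size, and the two-tap response are illustrative assumptions. A naive DFT is used only to keep the sketch dependency-free.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def regularized_inverse(h, n_fft, beta=1e-3):
    """Return (v, delay): a windowed FIR inverse of h and its modeling delay."""
    padded = list(h) + [0.0] * (n_fft - len(h))
    H = dft(padded)
    # Assumed regularized inversion: V(k) = conj(H(k)) / (|H(k)|^2 + beta).
    V = [Hk.conjugate() / (abs(Hk) ** 2 + beta) for Hk in H]
    v = [c.real for c in dft(V, inverse=True)]
    # Cyclic shift by half the transform length implements the modeling delay,
    # moving the acausal part of the inverse to positive time.
    delay = n_fft // 2
    v = v[-delay:] + v[:-delay]
    # Hann window (assumed) centred on the delayed main peak tapers the ends.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]
    return [vn * wn for vn, wn in zip(v, window)], delay


def convolve(x, h):
    """Linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


h = [1.0, 2.0]                      # toy nonminimum-phase response
v, delay = regularized_inverse(h, 64)
equalized = convolve(h, v)          # close to a unit impulse delayed by `delay`
```

In this toy case the equalized response is approximately a single impulse at the modeling delay, with small residuals caused by the regularization and the window.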
2.3 Synthesis of preprocessing speech signals
Hanning analysis and synthesis windowing are used with 50% overlap.
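A minimal sketch of windowed overlap-add with 50% overlap is given below. It assumes square-root periodic Hann windows for both analysis and synthesis, so that their product is a Hann window whose shifted copies sum to one; whether the original implementation uses this exact variant is not stated. The `process` argument is a hypothetical placeholder for the per-frame processing.

```python
import math


def hann(n_len):
    """Periodic Hann window of length n_len."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_len) for n in range(n_len)]


def overlap_add(x, frame_len, process=lambda frame: frame):
    """Analysis windowing, per-frame processing, synthesis windowing, overlap-add."""
    hop = frame_len // 2                              # 50% overlap
    win = [math.sqrt(w) for w in hann(frame_len)]     # sqrt-Hann, applied twice
    y = [0.0] * len(x)
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = [x[start + n] * win[n] for n in range(frame_len)]
        frame = process(frame)
        for n in range(frame_len):
            y[start + n] += frame[n] * win[n]
    return y


x = [math.sin(0.3 * n) for n in range(64)]
y = overlap_add(x, 16)   # identity processing: interior samples are reconstructed
```

With identity processing, every sample covered by two frames is reconstructed exactly up to rounding; only the first and last half frames are attenuated.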
A SISO audio system was established to simulate the I-PA system and applied in real environments to validate the proposed algorithm. To obtain data in different noisy reverberant environments, the experiments were performed using different rooms, types of noise, and signal-to-noise ratios (SNRs).
3.1 Experimental design
Table: Information about the four test rooms (column: room size, m)
In this experiment, the volume of the loudspeaker was adjusted to keep the sound pressure level (SPL) at the listener position at 60 dB. To simplify the experimental system, it was assumed that the input speech from the far end is clear speech without any distortion. Furthermore, clear female speech with a sampling frequency of 16,000 Hz was randomly selected from the TIMIT database as the input signal.
3.2 Hardware setup
The speech recording and computing were performed using MATLAB software. For the hardware layout in each room, the distance between the sound source and measuring microphone was set between 3 and 5 m depending on the size of the room. The noise source was set up on a different side from the measuring microphone at a distance of 1.5 m. The sound source, noise source, and microphone were each mounted on a tripod at a height of 1.5 m above the floor. To ensure the consistency and validity of the listening test samples, 640 speech signals were selected from the MRT database. The signals were tested and saved in this experiment and later used as the test samples in the subjective evaluation.
A total of 576 conditions were tested in the real environments (4 noise types × 6 SNRs × 6 algorithms × 4 rooms). For conciseness, only the most representative experimental results are presented.
4.1 Objective results
Four kinds of measurements were performed to evaluate and compare the performance of the proposed method objectively. The speech transmission index was used to evaluate the dereverberation performance of the GT-filter-based FIF method. The spectrogram was used to visually display the changes of speech intelligibility before and after processing. The log-spectral distortion measure was used to compare the speech distortion of algorithms under different noise types, and the short-time objective intelligibility measure was used to predict and compare the changes of speech intelligibility of different algorithms.
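For orientation, a log-spectral distortion value can be computed from short-time power spectra as sketched below. The definition used here (root-mean-square difference of log power spectra per frame, averaged over frames) is one common form and is an assumption; the exact definition and framing used in the experiments are not reproduced in this text, and the function name is hypothetical.

```python
import math


def log_spectral_distortion(ref_frames, proc_frames, floor=1e-12):
    """Mean over frames of the RMS difference between log power spectra, in dB.

    Each frame is a list of power-spectrum values; `floor` avoids log of zero.
    """
    total = 0.0
    for ref, proc in zip(ref_frames, proc_frames):
        sq = [
            (10.0 * math.log10(max(r, floor)) - 10.0 * math.log10(max(p, floor))) ** 2
            for r, p in zip(ref, proc)
        ]
        total += math.sqrt(sum(sq) / len(sq))
    return total / len(ref_frames)


# Toy power spectra: two frames with three frequency bins each.
clean = [[1.0, 0.5, 0.25], [2.0, 1.0, 0.5]]
boosted = [[10.0 * p for p in frame] for frame in clean]  # uniform 10 dB offset

print(log_spectral_distortion(clean, clean))
print(log_spectral_distortion(clean, boosted))
```

A lower value indicates that the processed spectrum stays closer to the clean reference.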
4.1.1 Speech transmission index
Table: Evaluation standards of STI values according to IEC 60268-16 (column: subjective intelligibility impression)
Table 3 Comparison of STI values of different methods (entries include: no equalized RIR; improved FIF method)
The STI values with no equalized RIR decrease as RT increases. When RT increases to 3.57 s, the STI value decreases to 0.4, and the transmission channel is seriously distorted due to reverberation. However, after equalizing the RIR by the proposed algorithm, the STI values are significantly improved. Compared with the other methods in Table 3, it is clear that the STI values for the proposed method are always higher than those of the other equalization or dereverberation methods. These results prove that the auditory-model-based sub-filter equalization can further improve the speech intelligibility of the transmission channel under different RT conditions.
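The downward trend of STI with increasing RT can be reproduced qualitatively with a simplified calculation. The sketch assumes an ideal exponential decay, for which the modulation transfer function is $m(F) = 1/\sqrt{1 + (2\pi F T / 13.8)^2}$, uses the 14 standard modulation frequencies, and omits octave-band weighting and auditory masking. It is therefore not the full IEC 60268-16 procedure, and its values should not be expected to match the measured ones reported here.

```python
import math

# The 14 modulation frequencies (Hz) used in STI calculations.
MOD_FREQS = [0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5, 3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5]


def simplified_sti(rt60, snr_db=None):
    """Single-band STI estimate for an ideal exponential decay (simplified)."""
    indices = []
    for f in MOD_FREQS:
        # Modulation reduction caused by reverberation.
        m = 1.0 / math.sqrt(1.0 + (2.0 * math.pi * f * rt60 / 13.8) ** 2)
        # Optional additional reduction caused by stationary noise.
        if snr_db is not None:
            m /= 1.0 + 10.0 ** (-snr_db / 10.0)
        m = min(max(m, 1e-6), 1.0 - 1e-6)
        # Apparent SNR, limited to +/-15 dB, mapped to a 0..1 transmission index.
        snr_eff = 10.0 * math.log10(m / (1.0 - m))
        snr_eff = min(max(snr_eff, -15.0), 15.0)
        indices.append((snr_eff + 15.0) / 30.0)
    return sum(indices) / len(indices)


for rt in (0.5, 1.5, 3.57):
    print(rt, round(simplified_sti(rt), 2))
```

Passing `snr_db` lowers the estimate further, which reflects the combined effect of noise and reverberation on the modulation depth.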
4.1.2 Spectrogram
Figure 4d shows noisy reverberant speech degraded by white noise and reverberation simultaneously. The degraded speech loses the speech information above 2000 Hz, and the remaining speech information is quite blurry. Figure 4e shows the speech signals obtained by the proposed method in a noisy reverberant environment. Compared with the noisy reverberant speech in Fig. 4d, the speech frames are independent of each other without smearing effects after applying the proposed method. Compared with the clean speech in Fig. 4a, the processed speech has not lost any important speech information. Therefore, the comparison results intuitively show that the proposed method can significantly improve the speech intelligibility in noisy reverberant environments.
4.1.3 Log-spectral distortion measure
4.1.4 Short-time objective intelligibility measure
The short-time objective intelligibility (STOI) measure obtains intelligibility scores directly by analyzing the clean and processed signals. It yields high correlations with subjective listening results and is usually used to evaluate the intelligibility of denoised speech. For investigating the effectiveness of the proposed method, an objective intelligibility score is more meaningful than the LSD measure. However, no unified objective intelligibility evaluation standard has been designed to predict distortions caused by additive noise and reverberation simultaneously. Nevertheless, we still attempted to use the STOI measure to predict the changes of speech intelligibility objectively.
There is no literature to support the use of the STOI measure for the intelligibility evaluation of reverberant speech. Therefore, the STOI measure was merely used to predict the intelligibility trends of different algorithms. However, compared with the results of the subjective listening test in Section 4.2, it is clear that the STOI prediction results have highly consistent trends with the subjective evaluation results. Therefore, the STOI prediction can be regarded as a meaningful reference result among the various objective evaluations.
4.2 Subjective results
The modified rhyme test (MRT) was used for subjective and realistic evaluation of the speech intelligibility in a noisy and reverberant environment. The MRT database contains a total of 2700 audio source files, including five males and four females reading 300 words. The 300 words read by each person are divided into 50 six-word groups of rhyming or similar-sounding English words, such as “same,” “name,” “game,” “tame,” “came,” and “fame.” Each word is a monosyllable of the form consonant-vowel-consonant (CVC), and the six words in each list differ in only the leading or trailing consonant. In this listening test, a total of 640 audio source files (4 RTs × 4 SNRs × 4 algorithms × 10 groups) were randomly selected from the database, modified using the four different methods, and degraded by factory noise-II at SNRs of −10, −5, 0, and 5 dB in different reverberation conditions. This procedure was performed in the experiment described in Section 3, and the processed audio files were recorded by a laptop as test speech for the subjective evaluation.
Eighteen non-native English speakers (13 males and 5 females, aged 23 to 32) were invited to the listening test. All the listeners were knowledgeable about English pronunciation and had no hearing impairments. Importantly, all of the listeners were Master’s or Ph.D. students with a technical background in acoustics, and they were familiar with the basic concepts of reverberation and noise. The subjective tests were carried out in an anechoic chamber to prevent the effects of background noise and reverberation on the test speech. The same loudspeaker used in the experiment was also used in the listening test, and the volume of the loudspeaker was adjusted to keep the output SPL within the normal hearing range.
Before the listening test, some training samples were presented to the listeners to familiarize them with the test procedure. The audio files were played randomly for the different algorithms, RTs, and SNRs. Each sentence was played only once, and the listener had 5 s to choose the right answer from a set of six alternative words on the response sheet. The intelligibility score of different algorithms under various SNR and RT conditions was obtained as the mean percentage of correct words.
A speech preprocessing method that combines the modified PDMSE and the improved FIF was proposed to improve the speech intelligibility of I-PA systems in noisy reverberant environments. The combination method reduces noise masking by means of speech enhancement and eliminates the influence of reverberation by means of transmission channel equalization. The experimental results showed that the speech intelligibility is significantly improved in noisy reverberant environments by the proposed method.
Compared with individual PDMSE and FIF methods, the combination method can stably reduce speech distortion under various noisy reverberant conditions. Furthermore, the subjective listening tests confirmed the validity and stability of the proposed method, and its mean intelligibility score was higher than those of state-of-the-art reference algorithms. Future work will focus on a method to obtain RIR in real time under noisy reverberant environments to realize real-time and steady improvement of speech intelligibility in variable room boundary conditions.
This work was supported by a 2017 grant from the Russian Science Foundation (Project No. 17-19-01389).
CML gave academic guidance to this research work and revised the manuscript. HYD designed the core methodology of this study, programmed the algorithms and carried out the experiments, and drafted the manuscript. Both authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
1. Taal, CH, Hendriks, RC, Heusdens, R (2012). A speech preprocessing strategy for intelligibility improvement in noise based on a perceptual distortion measure, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4061–4064).
2. Crespo, JB, & Hendriks, RC (2014). Speech reinforcement in noisy reverberant environments using a perceptual distortion measure, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 910–914).
3. Miyoshi, M, & Kaneda, Y. (1988). Inverse filtering of room acoustics. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(2), 145–152.
4. Maganti, HK, & Matassoni, M. (2012). A perceptual masking approach for noise robust speech recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2012(1), 29.
5. Neely, ST, & Allen, JB. (1979). Invertibility of a room impulse response. The Journal of the Acoustical Society of America, 66(1), 165–169.
6. Elliott, SJ, & Nelson, PA. (1989). Multiple-point equalization in a room using adaptive digital filters. Journal of the Audio Engineering Society, 37(11), 899–907.
7. Mourjopoulos, JN. (1994). Digital equalization of room acoustics. Journal of the Audio Engineering Society, 42(11), 884–900.
8. Tokuno, H, Kirkeby, O, Nelson, PA, et al. (1997). Inverse filter of sound reproduction systems using regularization. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 80(5), 809–820.
9. Kirkeby, O, Nelson, PA, Hamada, H, et al. (1998). Fast deconvolution of multichannel systems using regularization. IEEE Transactions on Speech and Audio Processing, 6(2), 189–194.
10. Kirkeby, O, & Nelson, PA. (1999). Digital filter design for inversion problems in sound reproduction. Journal of the Audio Engineering Society, 47(7/8), 583–595.
11. Radlovic, BD, & Kennedy, RA. (2000). Nonminimum-phase equalization and its subjective importance in room acoustics. IEEE Transactions on Speech and Audio Processing, 8(6), 728–737.
12. Cecchi, S, Romoli, L, Carini, A, et al. (2014). A multichannel and multiple position adaptive room response equalizer in warped domain: real-time implementation and performance evaluation. Applied Acoustics, 82, 28–37.
13. Mourjopoulos, J, Clarkson, P, Hammond, J (1982). A comparative study of least-squares and homomorphic techniques for the inversion of mixed phase signals, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1858–1861).
14. Fuster, L, de Diego, M, Ferrer, M, et al. (2012). A biased multichannel adaptive algorithm for room equalization, In Proceedings of the 20th European Signal Processing Conference (EUSIPCO) (pp. 1344–1348).
15. Sauert, B, & Vary, P. (2006). Improving speech intelligibility in noisy environments by near end listening enhancement. ITG-Fachbericht-Sprachkommunikation.
16. Skowronski, MD, & Harris, JG. (2006). Applied principles of clear and Lombard speech for automated intelligibility enhancement in noisy environments. Speech Communication, 48(5), 549–558.
17. Taal, CH, Hendriks, RC, Heusdens, R. (2014). Speech energy redistribution for intelligibility improvement in noise based on a perceptual distortion measure. Computer Speech & Language, 28(4), 858–872.
18. Taal, C, & Heusdens, R (2009). A low-complexity spectro-temporal based perceptual model, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 153–156).
19. Kusumoto, A, Arai, T, Kinoshita, K, et al. (2005). Modulation enhancement of speech by a pre-processing algorithm for improving intelligibility in reverberant environments. Speech Communication, 45(2), 101–113.
20. Hodoshima, N, Arai, T, Kusumoto, A, et al. (2006). Improving syllable identification by a preprocessing method reducing overlap-masking in reverberant environments. The Journal of the Acoustical Society of America, 119(6), 4055–4064.
21. Crespo, JB, & Hendriks, RC. (2014). Multizone speech reinforcement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(1), 54–66.
22. Hendriks, RC, Crespo, JB, Jensen, J, et al. (2015). Optimal near-end speech intelligibility improvement incorporating additive noise and late reverberation under an approximation of the short-time SII. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(5), 851–862.
23. Meng, Q, Sen, D, Wang, S, et al. (2008). Impulse response measurement with sine sweeps and amplitude modulation schemes, IEEE 2nd International Conference on Signal Processing and Communication Systems (ICSPCS) (pp. 1–5).
24. Hendriks, RC, Heusdens, R, Jensen, J (2010). MMSE based noise PSD tracking with low complexity, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4266–4269).
25. Faraji, N, & Hendriks, RC (2012). Noise power spectral density estimation for public address systems in noisy reverberant environments, In Proceedings of International Workshop on Acoustic Signal Enhancement (IWAENC) (pp. 1–4).
26. Peng, W, Ser, W, Zhang, M (2001). Bark scale equalizer design using warped filter, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3317–3320).
27. Moore, BC, & Glasberg, BR. (1983). Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. The Journal of the Acoustical Society of America, 74(3), 750–753.
28. Maganti, HK, & Matassoni, M. (2014). Auditory processing-based features for improving speech recognition in adverse acoustic conditions. EURASIP Journal on Audio, Speech, and Music Processing, 2014(1), 21.
29. Patterson, RD (1986). Auditory filters and excitation patterns as representations of frequency resolution. Frequency selectivity in hearing, (pp. 123–177).
30. Toole, FE (2000). The acoustics and psychoacoustics of loudspeakers and rooms—the stereo past and the multichannel future, Audio Engineering Society Convention (p. 109).
31. Varga, A, & Steeneken, HJ. (1993). Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3), 247–251.
32. Zue, V, Seneff, S, Glass, J. (1990). Speech database development at MIT: TIMIT and beyond. Speech Communication, 9(4), 351–356.
33. Miner, R, & Danhauer, JL. (1975). Modified rhyme test and synthetic sentence identification test scores of normal and hearing-impaired subjects listening in multitalker noise. Journal of the American Audiology Society, 2(2), 61–67.
34. International Electrotechnical Commission. (2011). IEC 60268-16 Sound system equipment-Part 16: Objective rating of speech intelligibility by speech transmission index. Paris: International OECD Publishing.
35. Flanagan, JL (2013). Speech analysis synthesis and perception, (vol. 3, p. 150). New York: Springer Science & Business Media.
36. Gomez, R, Nakamura, K, Mizumoto, T, et al. (2013). Mitigating the effects of reverberation for effective human-robot interaction in the real world, 13th IEEE International Conference on RAS, Humanoid Robots (Humanoids) (pp. 177–182).
37. Habets, EAP. (2007). Single-and multi-microphone speech dereverberation using spectral enhancement. Dissertation Abstracts International, 68(04).
38. Naylor, PA, & Gaubitch, ND (2010). Speech dereverberation, (p. 40). New York: Springer Science & Business Media.
39. Taal, CH, Hendriks, RC, Heusdens, R, et al. (2011). An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7), 2125–2136.
40. Loizou, PC (2013). Speech enhancement: theory and practice, (2nd ed., pp. 552–567). Boca Raton: CRC Press.