- Research Article
- Open Access
Fast Noise Compensation and Adaptive Enhancement for Speech Separation
© R. Hu and Y. Zhao 2008
- Received: 4 December 2007
- Accepted: 12 May 2008
- Published: 5 June 2008
We propose a novel approach to improve adaptive decorrelation filtering- (ADF-) based speech source separation in diffuse noise. The effects of noise on system adaptation and separation outputs are handled separately. First, fast noise compensation (NC) is developed for adaptation of separation filters, forcing ADF to focus on source separation; next, output noises are suppressed by speech enhancement. By tracking noise components in output cross-correlation functions, the bias effect of noise on the system adaptation objective function is compensated, and by adaptively estimating output noise autocorrelations, the speech separation output is enhanced. For fast noise compensation, a blockwise fast ADF (FADF) is implemented. Experiments were conducted on real and simulated diffuse noises. Speech mixtures were generated by convolving TIMIT speech sources with acoustic path impulse responses measured in a real room with reverberation time second. The proposed techniques significantly improved separation performance and phone recognition accuracy of ADF outputs.
- Blind Source Separation
- Speech Enhancement
- Separate Speech
- Output Noise
- Separation Output
Interference speech and diffuse noise present a twofold challenge for hands-free automatic speech recognition (ASR) and speech communication. For practical applications of blind source separation (BSS), it is important to address the effects of noise on speech separation: (1) noise may degrade the conditions for BSS and hence hurt separation performance; (2) BSS aims at source separation and has limited ability to suppress diffuse noise. Although "bias removal" has been identified as a general approach for improving speech separation in noise, its performance depends largely on the specific separation algorithm. Some noise compensation (NC) methods have been proposed for a natural gradient-based separation algorithm. Other reported studies either focused primarily on theoretical issues, handled only restricted conditions such as uncorrelated noises, or assumed simplified mixing models such as anechoic mixing. The limitations of BSS in noise suppression have been reported previously. Araki et al. [6, 7] established the similarity in mechanism between BSS and the adaptive null beamformer. Asano et al. grouped the two approaches into "spatial inverse" processing and pointed out that they are able to suppress only directional interferences, not omnidirectional ambient noises. Therefore, when both interference speech and diffuse noise are present, output noise suppression is needed in addition to separation processing. On the other hand, speech enhancement algorithms that are formulated for stationary noises cannot be applied directly in this scenario, because the adaptation of the separation filters makes the output noise statistics time varying. Such variation may happen frequently when the mixing acoustic paths change, for example, when a speaker moves.
In our previous works [9, 10], the separation model of adaptive decorrelation filtering (ADF) [11, 12] was significantly improved for noise-free speech mixtures in terms of both convergence rate and steady-state filter estimation accuracy. A noise-compensated ADF was proposed for speech mixtures contaminated by white uncorrelated noises. However, in real sound fields, diffuse noises are colored and spatially correlated at low frequencies, and they deteriorate ADF performance more severely than uncorrelated noises. It might appear that noise could simply be removed from the speech inputs prior to ADF separation. However, such noise prefiltering deteriorates the conditions for subsequent source separation, due to the nonlinear distortions introduced by speech enhancement.
In the current work, we propose to address the challenge of speech separation and diffuse noise suppression by a two-step strategy. First, a noise compensation (NC) algorithm is developed to improve speech separation performance; efficient blockwise implementations of the compensation processing and ADF filtering are derived using the FFT. As the separation filters change over time, the output noise cross-correlation statistics are tracked so that the filter adaptation bias can be removed. Second, output noise autocorrelations are estimated and used to enhance the speech signals separated in the first step, so as to improve speech quality. Speech separation, enhancement, and phone recognition experiments were conducted, and the results are presented to show the performance of the proposed separation and enhancement techniques.
In the following, we use variables in bold lower case for vectors, bold upper case for matrices, superscript for transposition, for the identity matrix, "*" for convolution, and for expectation. The correlation matrix formed by vectors and is defined as , and the correlation vector between a scalar and a vector as . and denote filter and block lengths, respectively. Speech and noise signal vectors contain consecutive samples up to current time , and their counterparts with samples up to time are marked with tilde.
and approximating by instantaneous correlations , the same adaptation equation can be obtained. For the step-size ,  proposed an input-normalized technique based on a convergence analysis, which was combined in  with variable step-size (VSS) techniques to accelerate convergence and reduce ADF estimation error.
Direct computation of the noise cross-correlation vectors in the NC-ADF adaptation (9) is not feasible for real-time applications, since the terms in (4) require matrix-vector multiplications at every time sample. For fixed speaker locations, the changes of the ADF filters are in general small within short time intervals (e.g., around 30 milliseconds). The slow change of the ADF parameters and the short-term stationarity of the input noise make it possible to update the compensation terms in a blockwise fashion, reducing the update rate by a factor equal to the block length. To speed up NC-ADF, we first reduce the update rate for the compensation terms and then utilize the Toeplitz structures of both the system and the correlation matrices to derive an FFT-based estimation of (4).
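The role of the Toeplitz structure can be illustrated with a generic identity: a product of a Toeplitz matrix with a vector is a convolution, so it can be evaluated with FFTs instead of an explicit matrix-vector multiplication. The sketch below is not the paper's equation (4); the function name `toeplitz_matvec_fft` and its arguments are hypothetical and chosen only for illustration.

```python
import numpy as np

def toeplitz_matvec_fft(first_col, first_row, x):
    """Compute T @ x for the Toeplitz matrix T defined by its first column and row.

    Assumes first_col[0] == first_row[0] and len(first_col) == len(first_row) == len(x).
    """
    first_col = np.asarray(first_col, dtype=float)
    first_row = np.asarray(first_row, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Embed T in a 2n-point circulant matrix whose first column is
    # [first_col, 0, first_row reversed without its first element].
    circ = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    x_pad = np.concatenate([x, np.zeros(n)])
    # A circulant product is a circular convolution, i.e. a product of FFTs.
    y = np.fft.irfft(np.fft.rfft(circ) * np.fft.rfft(x_pad), 2 * n)
    # Only the first n points are the wanted product; the rest are discarded.
    return y[:n]
```

With this formulation the cost per product grows as n log n rather than n squared, which is the kind of saving that makes a blockwise update of the compensation terms attractive.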
where is the -point reverse of . Similarly, components of are obtained by with The vectors and also have a similar structure, where with , and with . Based on such convolutive expressions, the -point sequences , , and can be computed by -point FFTs ( ). For modularity, the -point sequence can be decomposed into two -point subsequences and computed with two -point FFT-IFFT modules. In this way, all the sequences above only need to be zero-padded to length , because only -point results are required in each module. The remaining points, which contain aliasing, are irrelevant and are discarded.
The samplewise procedures of filtering (1) and adaptation (6) of ADF are also modified for a blockwise implementation to enable fast noise compensation. The fast computation of (1) can use the standard overlap-add fast convolution under the approximation that the filters are constant within each block.
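As a reminder of how overlap-add fast convolution works, the following minimal sketch filters a signal block by block with a fixed FIR filter. It is a generic textbook illustration, not the FADF implementation: in ADF the filter would be re-estimated between blocks, whereas here `h` is held constant for the whole signal, and the names `overlap_add_filter` and `block_len` are assumptions made for this example.

```python
import numpy as np

def overlap_add_filter(x, h, block_len):
    """Linear convolution of x with h, computed blockwise with FFTs (overlap-add)."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    # The FFT length must hold one full block response: block_len + len(h) - 1.
    n_fft = 1
    while n_fft < block_len + len(h) - 1:
        n_fft *= 2
    H = np.fft.rfft(h, n_fft)  # filter spectrum, reused for every block
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block_len):
        block = x[start:start + block_len]
        # rfft zero-pads the block to n_fft, so the circular convolution
        # equals the linear convolution of this block with h.
        seg = np.fft.irfft(np.fft.rfft(block, n_fft) * H, n_fft)
        stop = min(start + n_fft, len(y))
        # The tail of each block response overlaps the next block and is added.
        y[start:stop] += seg[:stop - start]
    return y
```

The result matches `np.convolve(x, h)` up to floating-point error, while the per-sample cost depends on the logarithm of the FFT length rather than on the filter length.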
can be computed by an FFT-based fast implementation .
where is defined by replacing and with their noisy counterparts in (19), is from (15), and the block step-size is computed by (10). The normalization gain factor in (11) uses ADF input powers that are estimated from the samples of both the current and previous blocks. To prevent overcompensation in NC-FADF, in (17) is set to zero when negative values occur. A small positive number is also added to the denominator in (10) to avoid division by zero. Triangular windows , , are applied to both the correlation estimates and the ADF adaptation vectors to prevent instability.
The overlap-add method requires . When = and the FFT length = , the computation of -point FFTs is distributed to the block of length , resulting in a complexity of per time-sample for NC-FADF, in contrast to for a direct estimation of NC terms that are required by matrix-vector multiplications.
Utilizing the adaptively estimated noise statistics, many algorithms can be considered for postenhancement of ADF outputs. The time domain constrained (TDC) type of the generalized subspace (GSub) method is tested due to its ability to handle colored noise. The TDC-GSub processing is applied to every block of ADF outputs. For each block it requires the noise autocorrelation matrix, which can be constructed by forming a symmetric Toeplitz matrix from the output autocorrelation vector in (21). Specifically, this vector constitutes the first column and the first row of the matrix. The TDC-GSub algorithm also requires the autocorrelation matrix of the noisy ADF output, which is estimated from the ADF outputs of the current block. The TDC-GSub processing is performed on each nonoverlapping subframe, and the major steps follow the GSub method.
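A minimal sketch of the two ingredients described above is given here: building the symmetric Toeplitz noise autocorrelation matrix from an autocorrelation vector, and forming a subspace estimator from the noise and noisy-output autocorrelation matrices. It follows a common reading of the generalized subspace approach (joint diagonalization, then a gain of the form lambda / (lambda + mu) on each component) and is not the authors' exact procedure. In particular, the constant `mu` and the function names are assumptions; the published method selects its Lagrange multiplier from an SNR estimate.

```python
import numpy as np

def symmetric_toeplitz(r):
    """Symmetric Toeplitz matrix whose first column and first row are both r."""
    r = np.asarray(r, dtype=float)
    idx = np.arange(len(r))
    return r[np.abs(idx[:, None] - idx[None, :])]

def tdc_subspace_filter(R_y, R_n, mu=1.0):
    """Return H such that x_hat = H @ y for one subframe (hypothetical helper).

    R_y: autocorrelation matrix of the noisy output; R_n: that of the noise
    (must be positive definite); mu: assumed constant trade-off parameter.
    """
    # Whiten with the Cholesky factor of R_n so that the generalized
    # eigenproblem R_y v = lam * R_n v becomes an ordinary symmetric one.
    L = np.linalg.cholesky(R_n)
    L_inv = np.linalg.inv(L)
    lam, U = np.linalg.eigh(L_inv @ R_y @ L_inv.T)
    V = L_inv.T @ U  # satisfies V.T @ R_n @ V = I
    # Eigenvalues of R_n^{-1} R_y - I; negative values are treated as noise only.
    lam_x = np.maximum(lam - 1.0, 0.0)
    gains = np.divide(lam_x, lam_x + mu, out=np.zeros_like(lam_x), where=lam_x > 0)
    # H = V^{-T} diag(gains) V^T, using V^{-T} = R_n @ V.
    return R_n @ V @ np.diag(gains) @ V.T
```

Two limiting cases give a quick sanity check on this sketch: when the noisy and noise matrices coincide, the estimator is the zero matrix, and when `mu` is zero with speech present in every component, it reduces to the identity.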
The computations of matrix inversion, multiplication, and eigendecomposition become acceptable when a short subframe length is used (2.5 milliseconds). In addition, a measure is taken to speed up TDC-GSub by utilizing the short-term stationarity of the separated speech signals. Within 20 milliseconds, their variations are relatively small, obviating the need to update the eigendecompositions in every subframe. In practice, the computation rate for both steps 1 and 2 is thus reduced to once every 12.5 milliseconds, without introducing significant degradation.
The basic setup for ADF was and and the separation filters were initialized with zeros, representing a totally blind condition (if certain prior knowledge of the acoustic paths can be incorporated into the initial separation filters, then ADF separation performance can be improved, especially in severe noise). In all cases, a pre-emphasis ( ) was applied to speech mixtures to remove the 6-dB/octave tilt of speech long-term spectrum and to reduce eigenvalue dispersion for faster convergence . Pre-emphasis enhances perceptually important speech components, and it also alters input noise properties as well as the relative strengths of noise and speech measured in signal-to-noise ratio (SNR): , where is the power of the clean speech mixture signal, and is the power of the noise component. In fact, the simulated speech-shaped noise spectrum was flattened by pre-emphasis, resulting in a loss of SNR of approximately 3 dB. On the other hand, the recorded diffuse noise retained a significant amount of coloration and spatial correlation after pre-emphasis that increased SNR by 12 dB through suppressing strongly correlated low-frequency noise components (see Figure 3). In subsequent discussions, SNR and target-to-interference ratio (TIR) refer to those evaluated on pre-emphasized input and output components, where TIR is defined as , with the power of target speech and the power of interference speech component. For FADF and NC-FADF, the block length was and the FFT length was . Since VSS without NC would corrupt adaptation at high levels of noise, it was not applied to ADF (6) and FADF. In the appendix, more details are provided for the definitions of SNR and TIR.
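The quantities discussed above can be sketched in a few lines. The example assumes the usual first-order pre-emphasis filter y[n] = x[n] - alpha * x[n-1] with a placeholder coefficient `alpha = 0.97`, which is not the value used in the experiments, and it measures SNR and TIR as power ratios in decibels. All function names are hypothetical.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= alpha * x[:-1]
    return y

def de_emphasis(y, alpha=0.97):
    """Inverse of pre_emphasis: x[n] = y[n] + alpha * x[n-1]."""
    y = np.asarray(y, dtype=float)
    x = np.empty(len(y))
    prev = 0.0
    for n, value in enumerate(y):
        prev = value + alpha * prev
        x[n] = prev
    return x

def power_ratio_db(numerator, denominator):
    """10 * log10 of the ratio of mean powers of two signal components."""
    p_num = np.mean(np.square(numerator))
    p_den = np.mean(np.square(denominator))
    return 10.0 * np.log10(p_num / p_den)
```

Under these assumptions, SNR would be `power_ratio_db(clean_mixture, noise)` and TIR would be `power_ratio_db(target, interference)`, each evaluated on pre-emphasized components. Because the filter attenuates low frequencies strongly, noise whose energy is concentrated at low frequencies loses far more power than speech does, which is consistent with the SNR increase reported for the recorded diffuse noise.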
Gain in TIR (dB) (simulated speech-shaped noise).
Output SNR (dB) (simulated speech-shaped noise).
Gain in TIR (dB) (real diffuse noise).
Experiments were conducted to compare the cases of using NC-FADF or FADF, with and without adaptive speech enhancement. Since SNR was altered by pre-emphasis differently for simulated and real diffuse noises, the ranges of initial SNRs were chosen differently for these two cases so that the input target speech had the same SNRs after pre-emphasis. After adaptive online speech enhancement, a de-emphasis was applied to the enhanced speech.
Phone recognition was performed using the HTK toolkit for the noisy mixture, the noisy separated speech, and the enhanced separated speech of the target. The speech signals were represented by sequences of feature vectors obtained from overlapped short-time analysis windows of 20 milliseconds. Each feature vector consisted of 13 cepstral coefficients and their first- and second-order time derivatives. Both training and test data from the TIMIT database were processed with spectral mean subtraction. Hidden Markov modeling (HMM) was used for 39 context-independent phone units, defined by a standard phone grouping scheme. Each phone unit had 3 emission states, with state observation probabilities modeled by size-8 Gaussian mixture densities. A phone bigram was used as the "language model."
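To make the feature layout concrete, the sketch below assembles 39-dimensional vectors from 13 cepstral coefficients per frame by appending first- and second-order time derivatives. The derivatives use the regression formula commonly used in HTK-style front ends with an assumed window of two frames on each side, and the mean subtraction is applied per utterance to the cepstral coefficients; both choices, and the function names, are assumptions rather than the configuration used in the experiments.

```python
import numpy as np

def delta(features, theta=2):
    """Regression-based time derivatives over +/- theta frames (theta is assumed).

    features: array of shape (num_frames, num_coeffs). Edge frames are
    handled by repeating the first and last frame.
    """
    features = np.asarray(features, dtype=float)
    num_frames = len(features)
    padded = np.concatenate([
        np.repeat(features[:1], theta, axis=0),
        features,
        np.repeat(features[-1:], theta, axis=0),
    ])
    denom = 2.0 * sum(k * k for k in range(1, theta + 1))
    out = np.zeros_like(features)
    for k in range(1, theta + 1):
        ahead = padded[theta + k: theta + k + num_frames]
        behind = padded[theta - k: theta - k + num_frames]
        out += k * (ahead - behind)
    return out / denom

def build_feature_vectors(cepstra):
    """Mean-subtract the cepstra, then append first- and second-order deltas."""
    cepstra = np.asarray(cepstra, dtype=float)
    cepstra = cepstra - cepstra.mean(axis=0)  # per-utterance mean subtraction
    d1 = delta(cepstra)
    d2 = delta(d1)
    return np.hstack([cepstra, d1, d2])
```

For an input of shape (num_frames, 13) the output has shape (num_frames, 39), matching the description of 13 coefficients plus two orders of derivatives.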
In this paper, we have presented methods of noise compensation and adaptive speech enhancement to improve the performances of ADF speech separation in diffuse noise. Fast implementations for ADF and noise compensation have been made that warrant real-time online applications. FADF has achieved performance comparable to that of ADF with a much faster speed. NC-FADF significantly improved the separation performance for speech mixtures in diffuse noise, and the integration of NC-FADF with speech enhancement significantly improved phone recognition accuracies in separated speech. Future investigations may include other enhancement algorithms and noise-reduction implementations for a more streamlined integration with the NC-FADF procedure.
- Hyvarinen A, Karhunen J, Oja E: Independent Component Analysis. John Wiley & Sons, New York, NY, USA; 2001.
- Aichner R, Buchner H, Kellermann W: Convolutive blind source separation for noisy mixtures. Proceedings of Joint Meeting of the German and the French Acoustical Societies (CFA/DAGA '04), March 2004, Strasbourg, France.
- Douglas SC, Cichocki A, Amari S: Bias removal technique for blind source separation with noisy measurements. Electronics Letters 1998, 34(14):1379-1380. doi:10.1049/el:19980994
- Hu R, Zhao Y: Adaptive decorrelation filtering algorithm for speech source separation in uncorrelated noises. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), March 2005, Philadelphia, Pa, USA 1: 1113-1116.
- Balan R, Rosca J, Richard S: Scalable non-square blind separation in the presence of noise. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003, Hong Kong 5: 293-296.
- Araki S, Makino S, Mukai R, Saruwatari H: Equivalence between frequency domain blind source separation and frequency domain adaptive beamforming. Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech '01), September 2001, Aalborg, Denmark 4: 2595-2598.
- Araki S, Mukai R, Makino S, Nishikawa T, Saruwatari H: The fundamental limitation of frequency domain blind source separation for convolutive mixtures of speech. IEEE Transactions on Speech and Audio Processing 2003, 11(2):109-116. doi:10.1109/TSA.2003.809193
- Asano F, Hayamizu S, Yamada T, Nakamura S: Speech enhancement based on the subspace method. IEEE Transactions on Speech and Audio Processing 2000, 8(5):497-507. doi:10.1109/89.861364
- Hu R, Zhao Y: Variable step size adaptive decorrelation filtering for competing speech separation. Proceedings of the 9th European Conference on Speech Communication and Technology (Eurospeech '05), September 2005, Lisbon, Portugal 1: 2297-2300.
- Zhao Y, Hu R, Li X: Speedup convergence and reduce noise for enhanced speech separation and recognition. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(4):1235-1244.
- Weinstein E, Feder M, Oppenheim AV: Multi-channel signal separation by decorrelation. IEEE Transactions on Speech and Audio Processing 1993, 1(4):405-413. doi:10.1109/89.242486
- Yen K-C, Zhao Y: Adaptive co-channel speech separation and recognition. IEEE Transactions on Speech and Audio Processing 1999, 7(2):138-151. doi:10.1109/89.748119
- Yen K, Huang J, Zhao Y: Co-channel speech separation in the presence of correlated and uncorrelated noises. Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech '99), September 1999, Budapest, Hungary 2587-2589.
- Hu R, Zhao Y: Fast noise compensation for speech separation in diffuse noise. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), May 2006, Toulouse, France 5: 865-868.
- Hu R, Zhao Y: Adaptive speech enhancement for speech separation in diffuse noise. Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH '06), September 2006, Pittsburgh, PA, USA 2618-2621.
- Oppenheim AV, Schafer RW: Discrete-Time Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, USA; 1989.
- Hu Y, Loizou PC: A generalized subspace approach for enhancing speech corrupted by colored noise. IEEE Transactions on Speech and Audio Processing 2003, 11(4):334-341. doi:10.1109/TSA.2003.814458
- RWCP Sound Scene Database in Real Acoustic Environments. ATR Spoken Language Translation Research Lab, Japan, 2001.
- Odell J, Ollason D, Valtchev V, Young S, Kershaw D, Woodland P: HTK Speech Recognition Toolkit. 1999, http://htk.eng.cam.ac.uk/docs/docs.shtml
- Lee K-F, Hon H-W: Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing 1989, 37(11):1641-1648. doi:10.1109/29.46546
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.