 Research Article
 Open Access
Fast Noise Compensation and Adaptive Enhancement for Speech Separation
Rong Hu^{1} and
Yunxin Zhao^{1}
https://doi.org/10.1155/2008/349214
© R. Hu and Y. Zhao 2008
 Received: 4 December 2007
 Accepted: 12 May 2008
 Published: 5 June 2008
Abstract
We propose a novel approach to improve adaptive decorrelation filtering (ADF) based speech source separation in diffuse noise. The effects of noise on system adaptation and on the separation outputs are handled separately. First, fast noise compensation (NC) is developed for the adaptation of the separation filters, forcing ADF to focus on source separation; next, output noises are suppressed by speech enhancement. By tracking the noise components in the output cross-correlation functions, the bias effect of noise on the system adaptation objective is compensated, and by adaptively estimating the output noise autocorrelations, the speech separation outputs are enhanced. For fast noise compensation, a blockwise fast ADF (FADF) is implemented. Experiments were conducted on real and simulated diffuse noises. Speech mixtures were generated by convolving TIMIT speech sources with acoustic path impulse responses measured in a reverberant real room. The proposed techniques significantly improved the separation performance and the phone recognition accuracy of the ADF outputs.
Keywords
 Blind Source Separation
 Speech Enhancement
 Separate Speech
 Output Noise
 Separation Output
1. Introduction
Interference speech and diffuse noise present a twofold challenge for hands-free automatic speech recognition (ASR) and speech communication. For practical applications of blind source separation (BSS), it is important to address the effects of noise on speech separation: (1) noise may degrade the conditions of BSS and hence hurt separation performance; (2) BSS aims at source separation and has limited ability to suppress diffuse noise. Although "bias removal" has been identified as a general approach for improving speech separation in noise [1], its performance depends largely on the specific separation algorithm. Some noise compensation (NC) methods, for example [2], were proposed for a natural-gradient-based separation algorithm. Other reported studies either focused primarily on theoretical issues, for example [3], or handled only restricted conditions such as uncorrelated noises, for example [4], or simplified mixing models, such as anechoic mixing [5]. The limitations of BSS in noise suppression have been reported previously. Araki et al. [6, 7] established the mechanistic similarities between BSS and the adaptive null beamformer. Asano et al. [8] grouped the two approaches into "spatial inverse" processing and pointed out that they are only able to suppress directional interferences but not omnidirectional ambient noises. Therefore, when both interference speech and diffuse noise are present, output noise suppression is needed in addition to separation processing. On the other hand, speech enhancement algorithms that are formulated for stationary noises cannot be applied directly in this scenario, because the adaptation of the separation filters makes the output noise statistics time varying. Such variation may happen frequently when the mixing acoustic paths change, for example, when a speaker moves.
In our previous works [9, 10], the separation model of adaptive decorrelation filtering (ADF) [11, 12] was significantly improved for noise-free speech mixtures in both convergence rate and steady-state filter estimation accuracy. A noise-compensated ADF [4] was proposed for speech mixtures contaminated by white uncorrelated noises. In real sound fields, however, diffuse noises are colored and spatially correlated at low frequencies, which deteriorates ADF performance more severely than uncorrelated noises do [13]. It might appear that noise could simply be removed from the speech inputs prior to ADF separation, but such noise prefiltering worsens the conditions for subsequent source separation, due to the nonlinear distortions introduced by speech enhancement [13].
In the current work, we propose to address the challenge of speech separation and diffuse noise suppression by an effective two-step strategy. First, a noise compensation (NC) [14] algorithm is developed to improve speech separation performance; efficient blockwise implementations of the compensation processing and the ADF filtering are derived using the FFT. As the separation filters change over time, the output noise cross-correlation statistics are tracked so that the filter adaptation bias can be removed. Second, the output noise autocorrelations are estimated and used to enhance the speech signals separated in the first step [15], so as to improve speech quality. Speech separation, enhancement, and phone recognition experiments were conducted, and the results are presented to show the performance of the proposed separation and enhancement techniques.
2. ADF Model in Noise
In the following, we use bold lower case for vectors, bold upper case for matrices, superscript T for transposition, I for the identity matrix, "*" for convolution, and E[·] for expectation. The correlation matrix formed by vectors u and v is defined as R_uv = E[u v^T], and the correlation vector between a scalar s and a vector v as r_sv = E[s v]. L and N denote the filter and block lengths, respectively. Speech and noise signal vectors contain L consecutive samples up to the current time, and their counterparts with samples up to an earlier time are marked with a tilde.
Approximating the expectations by instantaneous correlations, the same adaptation equation is obtained. For the step size, [12] proposed an input-normalized technique based on a convergence analysis, which was combined in [9] with variable step size (VSS) techniques to accelerate convergence and reduce ADF estimation error.
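The decorrelation principle behind the adaptation can be sketched as follows. This is a generic two-channel ADF-style update, not the paper's exact recursion: the names h12, h21, and mu, the buffer layout, and the sign convention are illustrative assumptions.

```python
import numpy as np

def adf_update(h12, h21, y1_buf, y2_buf, mu):
    """One generic ADF-style step: nudge the cross-coupling filters so
    that the instantaneous cross-correlation between the two separation
    outputs is driven toward zero. y1_buf/y2_buf hold the most recent
    L output samples, newest first; mu is a small step size."""
    r12 = y1_buf[0] * y2_buf      # instantaneous estimate of r_{y1 y2}(0..L-1)
    r21 = y2_buf[0] * y1_buf      # instantaneous estimate of r_{y2 y1}(0..L-1)
    h12 = h12 + mu * r12          # decorrelation-gradient update
    h21 = h21 + mu * r21
    return h12, h21
```

While the outputs remain correlated, the correlation estimates are nonzero and the filters keep moving; once the outputs are decorrelated, the expected update vanishes, which is the fixed point of the algorithm.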
3. Noise Compensation for ADF
where the estimate of the output noise cross-correlation is subtracted with a discount factor to prevent overcompensation; the same discount factor is used throughout the following.
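The compensation itself amounts to subtracting a discounted estimate of the output-noise cross-correlation from the noisy correlation before it drives the adaptation. In this minimal sketch, the default discount value 0.9 is an assumed placeholder, not the paper's setting:

```python
import numpy as np

def compensate_crosscorr(r_noisy, r_noise_est, discount=0.9):
    """Remove the (discounted) noise contribution from a noisy output
    cross-correlation estimate. discount < 1 guards against
    overcompensation when the noise estimate itself is uncertain."""
    return r_noisy - discount * r_noise_est
```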
4. Fast Implementation of Noise Compensation and ADF
4.1. Fast Update of Compensation Terms
Direct computation of the noise cross-correlation vectors in the NC-ADF adaptation (9) is not feasible for real-time applications, since the terms in (4) require matrix–vector multiplications for every time sample. For fixed speaker locations, the changes of the ADF filters are in general small within short time intervals (e.g., around 30 milliseconds). The slow change of the ADF parameters and the short-term stationarity of the input noise make it possible to update the compensation terms in a blockwise fashion, reducing the update rate by a factor equal to the block length. To speed up NC-ADF, we first reduce the update rate for the compensation terms and then utilize the Toeplitz structures of both the system and the correlation matrices to derive an FFT-based estimation of (4).
where the second sequence is the point-reverse of the first. The components of the remaining correlation terms have similar convolutive structures. Based on such convolutive expressions, these sequences can be computed with FFTs. For modularity, a longer sequence can be decomposed into two subsequences and computed with two FFT–IFFT modules of half the length. In this way, all of the sequences above only need to be zero-padded to the module length, because only part of the points are required in each module; the remaining points, which contain aliasing, are irrelevant and are discarded.
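The zero-padding idea can be illustrated with a minimal FFT-based cross-correlation. This is the generic textbook construction, not the paper's exact modular decomposition; the transform size below is simply the next power of two that keeps the first L lags alias-free.

```python
import numpy as np

def fft_crosscorr(x, y, L):
    """r(k) = sum_n x[n+k] * y[n] for k = 0..L-1, computed via FFT.
    Zero-padding to nfft >= len(x)+L keeps the first L lags of the
    circular correlation free of aliasing; the remaining lags would be
    aliased and are discarded."""
    nfft = int(2 ** np.ceil(np.log2(len(x) + L)))
    X = np.fft.rfft(x, nfft)
    Y = np.fft.rfft(y, nfft)
    r = np.fft.irfft(X * np.conj(Y), nfft)
    return r[:L]
```

Only the first L points of the inverse transform are kept, mirroring the text's observation that the aliased remainder of each module's output is irrelevant.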
4.2. Fast ADF and NC-FADF
The sample-wise procedures of ADF filtering (1) and adaptation (6) are also modified for a blockwise implementation to enable fast noise compensation. The fast computation of (1) can use the standard overlap-add fast convolution [16], under the approximation that the filters are constant within each block.
The blockwise adaptation terms can likewise be computed by an FFT-based fast implementation [16].
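For reference, the standard overlap-add block convolution from [16] can be sketched as follows. Block and FFT sizes here are illustrative; the paper's blockwise FADF additionally holds the filters fixed within each block.

```python
import numpy as np

def overlap_add_filter(x, h, N):
    """Convolve signal x with FIR filter h using overlap-add: each
    length-N input block is convolved via an FFT of size >= N+len(h)-1,
    and the overlapping tails of consecutive block outputs are summed."""
    L = len(h)
    nfft = int(2 ** np.ceil(np.log2(N + L - 1)))
    H = np.fft.rfft(h, nfft)           # filter spectrum, computed once
    y = np.zeros(len(x) + L - 1)
    for start in range(0, len(x), N):
        blk = x[start:start + N]
        yb = np.fft.irfft(np.fft.rfft(blk, nfft) * H, nfft)
        seg = len(blk) + L - 1         # valid (non-aliased) part of block output
        y[start:start + seg] += yb[:seg]
    return y
```

The per-sample cost is dominated by the FFTs, which are amortized over the whole block; this is the source of the speedup analyzed in the next subsection.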
where the noisy correlation term is defined by replacing the clean quantities in (19) with their noisy counterparts, the compensation term is from (15), and the block step size is computed by (10). The normalization gain factor in (11) uses ADF input powers estimated from the samples of both the current and the previous blocks. To prevent overcompensation in NC-FADF, the compensated quantity in (17) is set to zero whenever negative values occur. A small positive number is also added to the denominator in (10) to avoid divide-by-zero. Triangular windows are applied to both the correlation estimates and the ADF adaptation vectors to prevent instability.
The overlap-add method requires the FFT length to be at least the sum of the block length and the filter length minus one. With the block length on the order of the filter length and a correspondingly sized FFT, the cost of the FFTs is amortized over the block, resulting in a per-sample complexity for NC-FADF that is far lower than that of directly estimating the NC terms through matrix–vector multiplications.
5. Adaptive Enhancement of Separated Speech
5.1. Tracking of ADF Output Noise Autocorrelations
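A generic way to track a slowly varying output-noise autocorrelation is exponential smoothing of short-term estimates. The smoothing constant and the use of per-frame instantaneous estimates below are assumptions for illustration, not the paper's exact recursion:

```python
import numpy as np

def track_noise_autocorr(r_prev, noise_frame, alpha=0.98, lags=8):
    """Exponentially weighted update of an autocorrelation vector
    r(0..lags-1) from one frame believed to contain (output) noise.
    alpha close to 1 gives slow, stable tracking of nonstationarity
    introduced by the adapting separation filters."""
    n = len(noise_frame)
    r_inst = np.array([np.dot(noise_frame[k:], noise_frame[:n - k])
                       for k in range(lags)]) / n
    return alpha * r_prev + (1 - alpha) * r_inst
```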
5.2. Enhancement of Separated Speech
Utilizing the adaptively estimated noise statistics, many algorithms can be considered for post-enhancement of the ADF outputs. The time-domain-constrained (TDC) type of the generalized subspace (GSub) method [17] is tested here due to its ability to handle colored noise. The TDC-GSub processing is applied to every block of ADF outputs; for each block it requires the noise autocorrelation matrix, which can be constructed as a symmetric Toeplitz matrix from the output noise autocorrelation vector in (21), whose entries constitute the first column and the first row of the matrix. Another piece of information that the TDC-GSub algorithm takes is the autocorrelation matrix of the noisy ADF output, which is estimated from the ADF outputs of the current block. The TDC-GSub processing is performed on each non-overlapping subframe, and the major steps are the same as in [17].
Step 1.
Perform an eigendecomposition of the matrix formed from the noise and noisy-output correlation matrices, and determine the number of positive eigenvalues.
Step 2.
Form the gain matrix from the positive eigenvalues.
Step 3.
Enhance the current ADF output subframe with the resulting filter.
The computations of matrix inversion, multiplication, and eigendecomposition become acceptable when a small subframe length is used (2.5 milliseconds). In addition, a measure is taken to speed up TDC-GSub by utilizing the short-term stationarity of the separated speech signals. Within 20 milliseconds, their variations are relatively small, obviating the need to update the eigendecompositions in every subframe. In practice, the computation rate for both Steps 1 and 2 is thus reduced to once every 12.5 milliseconds, without introducing significant degradation.
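The three steps can be sketched as follows, using the TDC estimator form of the generalized subspace method in [17]: a filter H = V diag(λ/(λ+μ)) V⁻¹ built from the eigendecomposition of R_n⁻¹R_s. The Lagrange multiplier mu and the correlation estimates are treated as given inputs; this is an illustration of the estimator form, not the paper's tuned implementation.

```python
import numpy as np

def tdc_gsub_enhance(y, R_y, R_n, mu=1.0):
    """TDC-style generalized-subspace enhancement of one subframe y.
    R_y: autocorrelation matrix of the noisy ADF output subframe;
    R_n: (Toeplitz) autocorrelation matrix of the output noise."""
    R_s = R_y - R_n                          # clean-speech correlation estimate
    Sigma = np.linalg.solve(R_n, R_s)        # Step 1: eigen-problem matrix R_n^{-1} R_s
    lam, V = np.linalg.eig(Sigma)            # eigenvalues are real (Sigma is
    lam, V = np.real(lam), np.real(V)        # similar to a symmetric matrix)
    g = np.where(lam > 0, lam / (lam + mu), 0.0)  # Step 2: gains; zero the
                                                  # non-positive eigenvalues
    H = V @ np.diag(g) @ np.linalg.inv(V)    # Step 3: TDC filter
    return H @ y
```

Zeroing the gains of non-positive eigenvalues discards the noise-dominated subspace, while the remaining components are attenuated according to their eigenvalue-domain SNR.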
6. Complexity Analysis
Table: Counts of real multiplications for the direct and fast implementations of ADF filtering, ADF adaptation, the two sets of compensation-term updates, SS, and TDC-GSub, together with the resulting complexity gains.
7. Experiments
7.1. Experimental Data and Setup
In the basic setup for ADF, the separation filters were initialized with zeros, representing a totally blind condition (if some prior knowledge of the acoustic paths can be incorporated into the initial separation filters, ADF separation performance can be improved, especially in severe noise). In all cases, a pre-emphasis was applied to the speech mixtures to remove the 6 dB/octave tilt of the long-term speech spectrum and to reduce eigenvalue dispersion for faster convergence [10]. Pre-emphasis enhances perceptually important speech components, and it also alters the input noise properties as well as the relative strengths of noise and speech as measured by the signal-to-noise ratio (SNR), defined as the ratio of the clean speech-mixture power to the noise power. In fact, the simulated speech-shaped noise spectrum was flattened by pre-emphasis, resulting in an SNR loss of approximately 3 dB. On the other hand, the recorded diffuse noise retained a significant amount of coloration and spatial correlation after pre-emphasis, which increased SNR by 12 dB through suppressing strongly correlated low-frequency noise components (see Figure 3). In subsequent discussions, SNR and the target-to-interference ratio (TIR) refer to values evaluated on the pre-emphasized input and output components, where TIR is defined as the ratio of the target speech power to the interference speech power. For FADF and NC-FADF, the block length and the FFT length were set as in Section 4. Since VSS without NC would corrupt the adaptation at high levels of noise, it was not applied to ADF (6) or FADF. More details on the definitions of SNR and TIR are provided in the appendix.
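The pre-emphasis and SNR conventions above can be made concrete with a short sketch. The coefficient 0.97 is a common choice assumed here; the paper's exact value is not shown in this excerpt.

```python
import numpy as np

def preemphasize(x, a=0.97):
    """First-order pre-emphasis y[n] = x[n] - a*x[n-1], flattening the
    -6 dB/octave tilt of the long-term speech spectrum."""
    return np.concatenate(([x[0]], x[1:] - a * x[:-1]))

def snr_db(speech, noise):
    """SNR in dB: ratio of clean speech-mixture power to noise power."""
    return 10.0 * np.log10(np.sum(speech ** 2) / np.sum(noise ** 2))
```

Pre-emphasis is applied to the mixtures before separation, and SNR/TIR are then measured on the pre-emphasized components, which is why the effective SNR can shift by several dB relative to the original recordings.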
7.2. Speech Separation Performance
Table: Gain in TIR (dB) under simulated speech-shaped noise for the baseline, FADF, and NC-FADF, at original input SNRs of 3, 9, 15, 21, and 27 dB, together with the corresponding pre-emphasized SNRs.
Table: Output SNR (dB) under simulated speech-shaped noise for the baseline, FADF, and NC-FADF, at original input SNRs of 3, 9, 15, 21, and 27 dB, together with the corresponding pre-emphasized SNRs.
Table: Gain in TIR (dB) under real diffuse noise for the baseline, FADF, and NC-FADF, at five original input SNR levels (two below 0 dB, plus 0, 6, and 12 dB), together with the corresponding pre-emphasized SNRs.
Table: Output SNR (dB) under real diffuse noise at the same five original input SNR levels (two below 0 dB, plus 0, 6, and 12 dB), together with the corresponding pre-emphasized SNRs.
7.3. Speech Enhancement and Phone Recognition
Experiments were conducted to compare the cases of using NC-FADF or FADF, with and without adaptive speech enhancement. Since SNR was altered by pre-emphasis differently for the simulated and the real diffuse noises, the ranges of initial SNRs were chosen differently for the two cases, so that the input target speech had the same SNRs after pre-emphasis. After adaptive online speech enhancement, a de-emphasis was applied to the enhanced speech.
Phone recognition was performed using the HTK toolkit [19] on the noisy mixture, the noisy separated speech, and the enhanced separated speech of the target. The speech signals were represented by sequences of feature vectors obtained from overlapping short-time analysis windows of 20 milliseconds. Each feature vector consisted of 13 cepstral coefficients and their first- and second-order time derivatives. Both the training and the test data from the TIMIT database were processed with spectral mean subtraction. Hidden Markov models (HMMs) were used for 39 context-independent phone units, defined by the phone grouping scheme of [20]. Each phone unit had 3 emitting states, with state observation probabilities modeled by 8-component Gaussian mixture densities. A phone bigram was used as the language model.
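The first- and second-order time derivatives ("deltas") of the cepstral coefficients can be computed with the standard regression formula; the window half-width w = 2 and edge replication used below are common HTK-style defaults assumed for illustration.

```python
import numpy as np

def deltas(feats, w=2):
    """Regression deltas d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2*sum_k k^2)
    over a window of +/- w frames; feats has shape (frames, coeffs).
    Boundary frames are replicated. Apply the function twice to obtain
    second-order (delta-delta) coefficients."""
    pad = np.pad(feats, ((w, w), (0, 0)), mode='edge')
    denom = 2.0 * sum(k * k for k in range(1, w + 1))
    T = len(feats)
    d = np.zeros(feats.shape)
    for k in range(1, w + 1):
        d += k * (pad[w + k:w + k + T] - pad[w - k:w - k + T])
    return d / denom
```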
7.4. Sensitivity to Noise Estimation
8. Conclusions and Future Work
In this paper, we have presented methods of noise compensation and adaptive speech enhancement to improve the performance of ADF speech separation in diffuse noise. Fast implementations of ADF and noise compensation have been developed that enable real-time online applications. FADF achieved performance comparable to that of ADF at much lower computational cost. NC-FADF significantly improved the separation performance for speech mixtures in diffuse noise, and the integration of NC-FADF with speech enhancement significantly improved phone recognition accuracy on the separated speech. Future investigations may include other enhancement algorithms and noise-reduction implementations for a more streamlined integration with the NC-FADF procedure.
References
 Hyvärinen A, Karhunen J, Oja E: Independent Component Analysis. John Wiley & Sons, New York, NY, USA; 2001.
 Aichner R, Buchner H, Kellermann W: Convolutive blind source separation for noisy mixtures. Proceedings of the Joint Meeting of the German and the French Acoustical Societies (CFA/DAGA '04), March 2004, Strasbourg, France.
 Douglas SC, Cichocki A, Amari S: Bias removal technique for blind source separation with noisy measurements. Electronics Letters 1998, 34(14):1379–1380. doi:10.1049/el:19980994
 Hu R, Zhao Y: Adaptive decorrelation filtering algorithm for speech source separation in uncorrelated noises. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), March 2005, Philadelphia, PA, USA, 1:1113–1116.
 Balan R, Rosca J, Richard S: Scalable non-square blind separation in the presence of noise. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003, Hong Kong, 5:293–296.
 Araki S, Makino S, Mukai R, Saruwatari H: Equivalence between frequency domain blind source separation and frequency domain adaptive beamforming. Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech '01), September 2001, Aalborg, Denmark, 4:2595–2598.
 Araki S, Mukai R, Makino S, Nishikawa T, Saruwatari H: The fundamental limitation of frequency domain blind source separation for convolutive mixtures of speech. IEEE Transactions on Speech and Audio Processing 2003, 11(2):109–116. doi:10.1109/TSA.2003.809193
 Asano F, Hayamizu S, Yamada T, Nakamura S: Speech enhancement based on the subspace method. IEEE Transactions on Speech and Audio Processing 2000, 8(5):497–507. doi:10.1109/89.861364
 Hu R, Zhao Y: Variable step size adaptive decorrelation filtering for competing speech separation. Proceedings of the 9th European Conference on Speech Communication and Technology (Eurospeech '05), September 2005, Lisbon, Portugal, 1:2297–2300.
 Zhao Y, Hu R, Li X: Speedup convergence and reduce noise for enhanced speech separation and recognition. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(4):1235–1244.
 Weinstein E, Feder M, Oppenheim AV: Multichannel signal separation by decorrelation. IEEE Transactions on Speech and Audio Processing 1993, 1(4):405–413. doi:10.1109/89.242486
 Yen KC, Zhao Y: Adaptive co-channel speech separation and recognition. IEEE Transactions on Speech and Audio Processing 1999, 7(2):138–151. doi:10.1109/89.748119
 Yen K, Huang J, Zhao Y: Co-channel speech separation in the presence of correlated and uncorrelated noises. Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech '99), September 1999, Budapest, Hungary, 2587–2589.
 Hu R, Zhao Y: Fast noise compensation for speech separation in diffuse noise. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), May 2006, Toulouse, France, 5:865–868.
 Hu R, Zhao Y: Adaptive speech enhancement for speech separation in diffuse noise. Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH '06), September 2006, Pittsburgh, PA, USA, 2618–2621.
 Oppenheim AV, Schafer RW: Discrete-Time Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, USA; 1989.
 Hu Y, Loizou PC: A generalized subspace approach for enhancing speech corrupted by colored noise. IEEE Transactions on Speech and Audio Processing 2003, 11(4):334–341. doi:10.1109/TSA.2003.814458
 RWCP Sound Scene Database in Real Acoustic Environments. ATR Spoken Language Translation Research Lab, Japan, 2001.
 Odell J, Ollason D, Valtchev V, Young S, Kershaw D, Woodland P: HTK Speech Recognition Toolkit. 1999, http://htk.eng.cam.ac.uk/docs/docs.shtml
 Lee KF, Hon HW: Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing 1989, 37(11):1641–1648. doi:10.1109/29.46546
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.