- Research Article
- Open Access
Fast Noise Compensation and Adaptive Enhancement for Speech Separation
EURASIP Journal on Audio, Speech, and Music Processing volume 2008, Article number: 349214 (2008)
We propose a novel approach to improve adaptive decorrelation filtering- (ADF-) based speech source separation in diffuse noise. The effects of noise on system adaptation and separation outputs are handled separately. First, fast noise compensation (NC) is developed for adaptation of separation filters, forcing ADF to focus on source separation; next, output noises are suppressed by speech enhancement. By tracking noise components in output cross-correlation functions, the bias effect of noise on the system adaptation objective function is compensated, and by adaptively estimating output noise autocorrelations, the speech separation output is enhanced. For fast noise compensation, a blockwise fast ADF (FADF) is implemented. Experiments were conducted on real and simulated diffuse noises. Speech mixtures were generated by convolving TIMIT speech sources with acoustic path impulse responses measured in a real room with reverberation time second. The proposed techniques significantly improved separation performance and phone recognition accuracy of ADF outputs.
Interfering speech and diffuse noise pose a twofold challenge for hands-free automatic speech recognition (ASR) and speech communication. For practical applications of blind source separation (BSS), it is important to address the effects of noise on speech separation: (1) noise may degrade the conditions of BSS and hence hurt separation performance; (2) BSS aims at source separation and has limited ability to suppress diffuse noise. Although "bias removal" has been identified as a general approach for improving speech separation in noise , the performance depends largely on the specific separation algorithm. Some noise compensation (NC) methods, for example , were proposed for a natural gradient-based separation algorithm. Other reported studies either focused primarily on theoretical issues, for example , or handled only conditions like uncorrelated noises, for example , or simplified mixing models, such as anechoic mixing . The limitations of BSS in noise suppression were reported previously. Araki et al. [6, 7] established the similarity in mechanism between BSS and the adaptive null beamformer. Asano et al.  grouped the two approaches into "spatial inverse" processing and pointed out that they can suppress only directional interferences, not omnidirectional ambient noises. Therefore, when both interfering speech and diffuse noise are present, output noise suppression is needed in addition to separation processing. On the other hand, speech enhancement algorithms that are formulated for stationary noises cannot be applied directly in this scenario, because the adaptation of separation filters makes the output noise statistics time varying. Such variation may happen frequently when the mixing acoustic paths change, for example, when a speaker moves.
In our previous works [9, 10], the separation model of adaptive decorrelation filtering (ADF) [11, 12] was significantly improved for noise-free speech mixtures in both convergence rate and steady-state filter estimation accuracy. A noise-compensated ADF  was proposed for speech mixtures contaminated by white uncorrelated noises. However, in real sound fields, diffuse noises are colored and spatially correlated at low frequencies, which deteriorates ADF performance more severely than uncorrelated noises do . It may appear that noise could simply be removed from the speech inputs prior to ADF separation. However, such noise prefiltering deteriorates the conditions for subsequent source separation, owing to the nonlinear distortions introduced by speech enhancement .
In the current work, we propose to address the challenge of speech separation and diffuse noise suppression by an effective two-step strategy. First, a noise compensation (NC)  algorithm is developed to improve speech separation performance; effective blockwise implementations of compensation processing and ADF filtering are derived using the FFT. As separation filters change over time, output noise statistics of cross-correlations are tracked so that filter adaptation bias can be removed. Second, output noise autocorrelations are estimated and used to enhance the speech signals separated in the first step , so as to improve speech quality. Speech separation, enhancement, and phone recognition experiments were conducted, and the results are presented to show the performance of the proposed separation and enhancement techniques.
2. ADF Model in Noise
In the following, we use variables in bold lower case for vectors, bold upper case for matrices, superscript for transposition, for the identity matrix, "*" for convolution, and for expectation. The correlation matrix formed by vectors and is defined as , and the correlation vector between a scalar and a vector as . and denote filter and block lengths, respectively. Speech and noise signal vectors contain consecutive samples up to current time , and their counterparts with samples up to time are marked with tilde.
where and are vectors of the clean speech mixture and the noise, respectively, with , , . The filter matrix
is , where is an Toeplitz matrix and its th row is , . For the noisy ADF output , its speech-only output is denoted by and the noise output component by . Then, the effect of noise in the system output correlation matrix is described by . The I/O relations in correlation vectors of speech are
The noise correlation I/O relations have the same form:
In the absence of noise, the basic ADF adaptation algorithm is given in  as
It has been shown in  that by taking the decorrelation objective functions as
and approximating by instantaneous correlations , the same adaptation equation can be obtained. For the step-size ,  proposed an input-normalized technique based on a convergence analysis, which was combined in  with variable step-size (VSS) techniques to accelerate convergence and reduce ADF estimation error.
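As an illustration of the samplewise decorrelation adaptation discussed above, the following Python sketch runs a two-channel ADF with a fixed step size. It assumes the commonly used feedforward form, in which each output is the corresponding input minus a filtered version of the other input, and each cross filter is updated by the instantaneous cross-correlation between one output sample and the recent outputs of the other channel. The function name `adf_separate`, the fixed step size `mu`, and the tap count are illustrative assumptions; they are not the input-normalized or variable step sizes referred to in the text.

```python
import numpy as np

def adf_separate(y1, y2, n_taps=8, mu=1e-3):
    """Samplewise two-channel ADF sketch (assumed feedforward form)."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    T = len(y1)
    pad = n_taps - 1
    # Zero-padded histories so that every signal vector has n_taps entries.
    y1p = np.concatenate([np.zeros(pad), y1])
    y2p = np.concatenate([np.zeros(pad), y2])
    v1p = np.zeros(T + pad)
    v2p = np.zeros(T + pad)
    h12 = np.zeros(n_taps)  # cross filter applied to channel 2
    h21 = np.zeros(n_taps)  # cross filter applied to channel 1
    for t in range(T):
        # Vectors hold the samples at times t, t-1, ..., t-n_taps+1.
        y1_vec = y1p[t:t + n_taps][::-1]
        y2_vec = y2p[t:t + n_taps][::-1]
        v1 = y1[t] - h12 @ y2_vec
        v2 = y2[t] - h21 @ y1_vec
        v1p[t + pad] = v1
        v2p[t + pad] = v2
        v1_vec = v1p[t:t + n_taps][::-1]
        v2_vec = v2p[t:t + n_taps][::-1]
        # Instantaneous cross-correlations drive the filter updates.
        h12 = h12 + mu * v1 * v2_vec
        h21 = h21 + mu * v2 * v1_vec
    return v1p[pad:], v2p[pad:], h12, h21
```

With `mu=0.0` the filters stay at their zero initialization and the outputs equal the inputs, which corresponds to the totally blind starting condition.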
The proposed system for improving ADF in noise works in two steps, as shown in Figure 2. In the NC step, the noise effects on the adaptation procedure (6), including the step-size computation, are reduced to improve speech separation. In the adaptive enhancement step, the ADF speech outputs are enhanced by noise reduction. The details of the techniques for these two processing steps are covered in Sections 3 and 4, respectively.
3. Noise Compensation for ADF
Since the objective function in the form of (7) becomes , the presence of noise deteriorates the adaptation performance of (6) which contains bias caused by output noise cross-correlations. As shown in (4), the noise component in output cross-correlation varies as filters adapt. The time-varying noise effect can be reduced by using an estimate of speech cross-correlation , that is,
Based on (8), the noise-compensated ADF (NC-ADF) is obtained as
where is the estimate of output noise cross-correlation, and the discount factor to prevent overcompensation. In the following, is used.
In the current work, for the computation of step-sizes, the VSS technique of  is extended to include a compensation of output noise powers. The effect of unequal source energies on filter estimation errors is that the lower the relative strength of the th source, the higher the estimation error will be for the filter . To reduce the ADF estimation error caused by unbalanced source energies, step-sizes can be scaled by relative short-term powers of ADF outputs as
where the normalizing gain factor was given by 
with the short-term power of the th input, and () the constant gain factor that controls convergence speed. The estimated average speech output power is
The noise compensation to output power is made by subtracting noise power from the power of noisy ADF output, that is,
and the output noise power is obtained from (5) as
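The two compensation ideas of this section, removing the noise cross-correlation bias from the adaptation term and subtracting noise power from the noisy output power, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the update sign convention, the function names `nc_adf_update` and `compensated_power`, and the default discount factor `gamma=1.0` are placeholders chosen for illustration, and the step size `mu` is passed in rather than computed by the variable step-size rule.

```python
import numpy as np

def nc_adf_update(h, v_i, v_j_vec, r_noise_cross, mu, gamma=1.0):
    """One noise-compensated filter update (assumed sign convention).

    The instantaneous output cross-correlation is corrected by a
    discounted estimate of the output noise cross-correlation.
    """
    instantaneous = v_i * np.asarray(v_j_vec, dtype=float)
    bias = gamma * np.asarray(r_noise_cross, dtype=float)
    return np.asarray(h, dtype=float) + mu * (instantaneous - bias)

def compensated_power(noisy_power, noise_power):
    """Noise-free output power estimate, floored at zero."""
    return max(noisy_power - noise_power, 0.0)
```

When the instantaneous cross-correlation equals the estimated noise cross-correlation and `gamma` is 1, the filter is left unchanged, which is the intended effect of the bias removal: noise alone should not drive the adaptation.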
4. Fast Implementation of Noise Compensation and ADF
4.1. Fast Update of Compensation Terms
Direct computations of noise cross-correlation vectors in NC-ADF adaptation (9) are not feasible for real-time applications since the terms in (4) require matrix-vector multiplications for every time sample. For fixed speaker locations, the changes of ADF filters are in general small within short time intervals (e.g., around 30 milliseconds). The slow change of ADF parameters and the short-term stationarity of input noise make it possible to update compensation terms in a blockwise fashion, reducing the update rate by a factor of (the block length). To speed up NC-ADF, we first reduce the update rate for compensation terms and then utilize the Toeplitz structures of both the system and the correlation matrices to derive an FFT-based estimation of (4).
The estimate of output bias (4) can be rewritten as
with , , , and . Computations of and share the same structure. The components of vector , that is, , , can be expressed as the last samples, in reversed order, of the convolution , that is,
where is the -point reverse of . Similarly, components of are obtained by with . The vectors and also have a similar structure, where with , and with . Based on such convolutive expressions, the -point sequences , , and can be computed by -point FFTs (). For modularity, the -point sequence can be decomposed into two -point subsequences and computed with two -point FFT-IFFT modules. In this way, all the sequences above only need to be zero-padded to length , because only -point results are required in each module. The remaining points, which are affected by aliasing, are irrelevant and are discarded.
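The idea of replacing a Toeplitz matrix-vector product by a zero-padded FFT convolution, and keeping only the alias-free output points, can be illustrated with the following Python sketch. The function name `toeplitz_matvec_fft` and the storage convention for the Toeplitz entries are assumptions made for this example; they are not the notation of the derivation above.

```python
import numpy as np

def toeplitz_matvec_fft(c, x):
    """Product of an N x N Toeplitz matrix with a vector via FFT.

    c has length 2N-1 and stores the diagonals so that the matrix
    entry in row i, column j equals c[i - j + N - 1].
    """
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)
    n = len(x)
    fft_len = 2 * n  # enough padding to keep the N wanted points alias-free
    spectrum = np.fft.rfft(c, fft_len) * np.fft.rfft(x, fft_len)
    circular = np.fft.irfft(spectrum, fft_len)
    # Only N points are needed; the other points are aliased and discarded.
    return circular[n - 1:2 * n - 1]
```

The full linear convolution would have 3N-2 points, but wrap-around in a length-2N circular convolution only contaminates the first N-2 points, so the N retained points match the direct product.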
From (13)–(15), the noise-free ADF output powers used in the VSS computation are estimated by
4.2. Fast ADF and NC-FADF
The samplewise procedures of filtering (1) and adaptation (6) of ADF are also modified for a blockwise implementation to enable fast noise compensation. The fast computation of (1) can use the standard overlap-add fast convolution  under the approximation that filters are constant within each block.
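The overlap-add fast convolution mentioned above can be sketched in Python as follows. The filter is held constant here for the whole signal, whereas in the blockwise ADF it would be held constant only within each block; the function name `overlap_add_filter` and the power-of-two FFT length selection are illustrative choices.

```python
import numpy as np

def overlap_add_filter(x, h, block_len):
    """FIR filtering by overlap-add with FFT-based block convolution."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    n_taps = len(h)
    # Smallest power of two that holds one full block convolution.
    fft_len = 1
    while fft_len < block_len + n_taps - 1:
        fft_len *= 2
    H = np.fft.rfft(h, fft_len)
    y = np.zeros(len(x) + n_taps - 1)
    for start in range(0, len(x), block_len):
        block = x[start:start + block_len]
        seg = np.fft.irfft(np.fft.rfft(block, fft_len) * H, fft_len)
        n_out = len(block) + n_taps - 1
        # The tail of each block overlaps the head of the next one.
        y[start:start + n_out] += seg[:n_out]
    return y
```

The result agrees with direct linear convolution, for example `np.convolve(x, h)`, up to floating-point rounding.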
By using a constant step-size in each block, a block-adaptive procedure for filter update can be obtained. For noise-free ADF, consider the th block covering samples from to , and let be the filters of the current block. After obtaining ADF outputs of the th block by a fast convolution filtering, the step-size can be estimated to update filters in the entire current block. By summing up both sides of (6) for , the new filters for the next block, , can be estimated as
The cross-correlation estimate
can be computed by an FFT-based fast implementation .
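A blockwise cross-correlation of this kind can be computed with one FFT-IFFT pass, as in the Python sketch below. The sketch assumes that the correlation vector for a block is the sum, over the samples of the block, of one output sample times the lagged samples of the other output; the function name `block_xcorr_fft` and the argument layout are hypothetical.

```python
import numpy as np

def block_xcorr_fft(a_blk, b_ext, n_taps):
    """Blockwise cross-correlation r[k] = sum_t a(t) b(t - k), k = 0..n_taps-1.

    a_blk holds the B samples of the current block; b_ext holds the
    same block of the other signal preceded by its n_taps-1 past samples.
    """
    a_blk = np.asarray(a_blk, dtype=float)
    b_ext = np.asarray(b_ext, dtype=float)
    if len(b_ext) != len(a_blk) + n_taps - 1:
        raise ValueError("b_ext must contain n_taps-1 past samples plus the block")
    fft_len = 1
    while fft_len < len(b_ext):
        fft_len *= 2
    A = np.fft.rfft(a_blk, fft_len)
    B = np.fft.rfft(b_ext, fft_len)
    # Circular correlation; the first n_taps shifts are free of wrap-around.
    corr = np.fft.irfft(B * np.conj(A), fft_len)
    # Shift d corresponds to lag k = n_taps - 1 - d, hence the reversal.
    return corr[:n_taps][::-1]
```

Summing the instantaneous products over a block in this way is what allows a single filter update per block instead of one per sample.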
Similarly, the blockwise NC-FADF is obtained from (9) as
where is defined by replacing and with their noisy counterparts in (19), is from (15), and the block step-size is computed by (10). The normalization gain factor in (11) uses ADF input powers that are estimated from the samples of both current and previous blocks. To prevent overcompensation in NC-FADF, in (17) is set to zero when negative values occur. A small positive number is also added to the denominator in (10) to avoid division by zero. Triangular windows , , are applied to both correlation estimate and ADF adaptation vectors to prevent instability.
The overlap-add method requires . When = and the FFT length = , the computation of -point FFTs is distributed to the block of length , resulting in a complexity of per time-sample for NC-FADF, in contrast to for a direct estimation of NC terms that are required by matrix-vector multiplications.
5. Adaptive Enhancement of Separated Speech
5.1. Tracking of ADF Output Noise Autocorrelations
Although NC-FADF improves the speech separation performance in noise, the separation outputs are still contaminated by noise. Thus, a speech enhancement postprocessing should be integrated with ADF to reduce noise in each output. To do so, we need to track the time-varying output noise statistics as filters evolve from block to block by a fast computation of (5). Similar to the derivations of (15), we obtain autocorrelation of ADF output noise for the th block:
where , , , and . Since input noise is stationary, its auto- and cross-correlations can be measured a priori during a speech inactive period. The fast mappings from input noise correlations to output noise autocorrelation, depending only on current system parameters 's and 's, are implemented as fast convolutions of the following signal sequences:
where , , , and .
5.2. Enhancement of Separated Speech
Utilizing the adaptively estimated noise statistics , many algorithms can be considered for postenhancement of ADF outputs. The time domain constrained (TDC) type of the generalized subspace (GSub) method  is tested due to its ability to handle colored noise. The TDC-GSub processing is applied to every block of ADF outputs, where for the th block it requires the noise autocorrelation matrix , which can be constructed by forming a symmetric Toeplitz matrix from the output autocorrelation vector in (21). Specifically, constitutes the first column and the first row of . Another piece of information that the TDC-GSub algorithm takes is the autocorrelation matrix of the noisy ADF output, , which is estimated from ADF outputs of the current block. The TDC-GSub processing is performed on each nonoverlapping subframe of length and the major steps are the same as in .
(1) Perform the eigendecomposition of the matrix , with , and is the number of positive eigenvalues.
(2) Compute the optimal estimator , where the eigendomain filtering gains are obtained by , , and is determined from
(3) Enhance the th ADF output by .
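The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the exact procedure of this work: it follows the commonly stated form of the generalized subspace estimator, in which the eigendecomposition is taken of the noise-whitened matrix built from the noisy-output and noise autocorrelation matrices, gains of the form eigenvalue divided by (eigenvalue plus a multiplier) are applied to the positive eigenvalues, and the remaining gains are zero. The function name `tdc_gsub_estimator` is hypothetical, and the multiplier `mu` is a fixed placeholder instead of the SNR-dependent rule referred to in step (2).

```python
import numpy as np

def tdc_gsub_estimator(R_y, R_n, mu=1.0):
    """Estimator matrix H for one subframe (assumed generalized subspace form).

    R_y: autocorrelation matrix of the noisy output (K x K).
    R_n: autocorrelation matrix of the output noise (K x K, positive definite).
    """
    # Whitening by the Cholesky factor of R_n gives a symmetric eigenproblem
    # with the same eigenvalues as inv(R_n) @ R_y.
    L = np.linalg.cholesky(R_n)
    L_inv = np.linalg.inv(L)
    A = L_inv @ R_y @ L_inv.T
    A = 0.5 * (A + A.T)  # remove rounding asymmetry
    lam_y, U = np.linalg.eigh(A)
    lam = lam_y - 1.0    # eigenvalues of inv(R_n) @ R_y - I
    V = L_inv.T @ U      # eigenvectors, scaled so that V.T @ R_n @ V = I
    gains = np.zeros_like(lam)
    pos = lam > 0.0      # only the positive eigenvalues are kept
    gains[pos] = lam[pos] / (lam[pos] + mu)
    # With this scaling, the inverse transpose of V equals R_n @ V.
    return R_n @ V @ np.diag(gains) @ V.T
```

An enhanced subframe is then obtained as `H @ y_subframe`. As a sanity check on the assumed form, for white noise and `mu=1.0` the estimator reduces to the classical Wiener filter.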
The computations of matrix inversion, multiplication, and eigendecomposition become acceptable when a small value is used for (2.5 milliseconds). In addition, TDC-GSub is sped up by exploiting the short-term stationarity of the separated speech signals 's. Within 20 milliseconds, the variations of 's are relatively small, obviating the need to update their eigendecompositions in every subframe. In practice, the computation rate for both steps 1 and 2 is thus reduced to once every 12.5 milliseconds, without introducing significant degradation.
6. Complexity Analysis
The complexities of the major computation steps in terms of the average number of real multiplications per time-sample are listed in Table 1. Trivial computation overheads are ignored. The gains of the fast over the direct implementations are evaluated for and . The counts for FFT are based on the regular radix-2 method. It is possible to further reduce the complexities of computations. In Table 1, only a coarse complexity estimate is made for TDC-GSub, based on direct implementations of matrix operations. Faster computation techniques for TDC-GSub and complexity analyses are out of the scope of this paper.
7.1. Experimental Data and Setup
Speech mixtures were generated by convolving clean speech sources from the TIMIT database with real acoustic impulse responses measured in a room of reverberation time second . The speakers were approximately away from two microphones that were mounted apart on a circular array of radius , and the distance between the two speakers was . The target speech was sampled at 16 kHz and consisted of 40 sentences from 4 speakers (faks0, felc0, mdab0, mreb0). The competing speech contained randomly selected TIMIT sentences. Both simulated and real diffuse noise conditions were tested. The simulated noise is speech-shaped and was generated by the following procedure:
where 's are white Gaussian excitations and 's are linear prediction coefficients (LPC) estimated from clean TIMIT data. Real diffuse noises were recorded in a computer lab with a pair of omnidirectional microphones placed in the center of the lab, where the microphones were the same distance apart as that of the array microphone pair. Ventilation and air-conditioning systems and 8 desktop workstations were working simultaneously, generating diffuse noises that fit the stationary assumption. As a default setting, a 2-second speech inactive segment immediately preceding the speech was used to estimate input noise statistics. Figure 3 illustrates the cross-power spectra for both types of noises.
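The generation of speech-shaped noise by driving an all-pole (LPC) filter with white Gaussian excitation can be sketched in Python as below. The sketch covers a single channel only and does not model the spatial correlation between the two noise channels; the function name `ar_synthesize` and the example coefficients in the usage note are hypothetical, not LPC values estimated from TIMIT.

```python
import numpy as np

def ar_synthesize(excitation, lpc):
    """All-pole synthesis: n(t) = sum_k lpc[k-1] * n(t-k) + w(t)."""
    w = np.asarray(excitation, dtype=float)
    a = np.asarray(lpc, dtype=float)
    p = len(a)
    # p leading zeros act as the zero initial state of the recursion.
    n = np.zeros(len(w) + p)
    for t in range(len(w)):
        past = n[t:t + p][::-1]  # n(t-1), n(t-2), ..., n(t-p)
        n[t + p] = a @ past + w[t]
    return n[p:]
```

For example, `ar_synthesize(np.random.default_rng(0).standard_normal(16000), [0.5, -0.3])` produces one second of colored noise at 16 kHz from a stable second-order filter.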
The basic setup for ADF was and and the separation filters were initialized with zeros, representing a totally blind condition (if certain prior knowledge of the acoustic paths can be incorporated into the initial separation filters, then ADF separation performance can be improved, especially in severe noise). In all cases, a pre-emphasis () was applied to speech mixtures to remove the 6-dB/octave tilt of speech long-term spectrum and to reduce eigenvalue dispersion for faster convergence . Pre-emphasis enhances perceptually important speech components, and it also alters input noise properties as well as the relative strengths of noise and speech measured in signal-to-noise ratio (SNR): , where is the power of the clean speech mixture signal, and is the power of the noise component. In fact, the simulated speech-shaped noise spectrum was flattened by pre-emphasis, resulting in a loss of SNR of approximately 3 dB. On the other hand, the recorded diffuse noise retained a significant amount of coloration and spatial correlation after pre-emphasis that increased SNR by 12 dB through suppressing strongly correlated low-frequency noise components (see Figure 3). In subsequent discussions, SNR and target-to-interference ratio (TIR) refer to those evaluated on pre-emphasized input and output components, where TIR is defined as , with the power of target speech and the power of interference speech component. For FADF and NC-FADF, the block length was and the FFT length was . Since VSS without NC would corrupt adaptation at high levels of noise, it was not applied to ADF (6) and FADF. In the appendix, more details are provided for the definitions of SNR and TIR.
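Because SNR here is evaluated on pre-emphasized components, it can help to see how a first-order pre-emphasis filter and a power-ratio SNR fit together. The Python sketch below is illustrative only: the coefficient `alpha=0.97` is a common default and an assumption (the value used in this work is not reproduced here), and the names `pre_emphasize` and `snr_db` are hypothetical.

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """First-order pre-emphasis y(t) = x(t) - alpha * x(t-1)."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))

def snr_db(speech, noise):
    """SNR in dB as the ratio of mean powers of the two components."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))
```

Comparing `snr_db(speech, noise)` with `snr_db(pre_emphasize(speech), pre_emphasize(noise))` shows how the same filter can raise or lower SNR depending on how the noise spectrum is distributed relative to the speech spectrum.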
7.2. Speech Separation Performance
The separation performances were evaluated by system gains in TIR, defined as . In Tables 2 and 4, the TIR gains of NC-FADF outperform those of the baseline for both types of noises, at the cost of a slightly decreased SNR, as shown in Tables 3 and 5. Since FADF is a fast and approximate implementation of the baseline ADF, it suffered a slight degradation from the baseline and showed occasional instability in the iterative estimations of separation filters. The TIR gain values in Tables 2 and 4 are computed from the noise-free components in the noisy outputs and . It is interesting to observe that under severe noise conditions, for example dB (original), the baseline ADF actually increased SNR. This is consistent with the analysis in  that in correlated noises, the baseline ADF tends to divert from speech separation to noise cancellation. Tables 3 and 5 show that the NC algorithm can force ADF to focus on speech separation, rather than noise cancellation.
7.3. Speech Enhancement and Phone Recognition
Experiments were conducted to compare the cases of using NC-FADF or FADF, with and without adaptive speech enhancement. Since SNR was altered by pre-emphasis differently for simulated and real diffuse noises, the ranges of initial SNRs were chosen differently for these two cases so that the input target speech had the same SNRs after pre-emphasis. After adaptive online speech enhancement, a de-emphasis was applied to the enhanced speech.
The overall enhancement of target speech against the effects of both the interfering jammer and noise is shown by the target-to-interference-and-noise ratio (TINR) in Figures 4 and 5, where TINRs are defined in the appendix for the input, the separation output, and the separation output with noise reduction. It is seen that NC-FADF outperformed FADF in both types of noises under almost all SNR conditions. At high SNRs, the TINR improvements come mainly from the separation processing of NC-FADF or FADF, as the speech jammer is the dominant problem. The larger TINR gains obtained by NC-FADF over FADF are also attributed to its use of the variable step-size adaptation defined in (10) and (12)–(14), whereas without noise compensation the VSS was unavailable to FADF. This advantage of variable over fixed step size in ADF adaptation is consistent with the findings in . At low SNRs, the TINR improvement comes mainly from the suppression of the noise components, and in the real diffuse noise, the separation processing had a stronger effect on TINR improvement than in the simulated noise. When the SNR is very low, so that the energy of the speech mixture is dominated by the noise, the TIR improvement (between target and jammer speech) by NC-FADF contributed less to the overall TINR gains, and here the enhancement processing by TDC-GSub improved TINR greatly in both types of noises.
Phone recognition was performed using the HTK toolkit  for the noisy mixture, the noisy separated speech, and the enhanced separated speech of the target. The speech signals were represented by sequences of feature vectors obtained from overlapped short-time analysis windows of 20 milliseconds. Each feature vector consisted of 13 cepstral coefficients and their first- and second-order time derivatives. Both training and test data from the TIMIT database were processed with spectral mean subtraction. Hidden Markov modeling (HMM) was used for 39 context-independent phone units, defined by the phone grouping scheme of . Each phone unit had 3 emission states, with state observation probabilities modeled by size-8 Gaussian mixture densities. A phone bigram was used as the "language model."
The phone accuracy results in simulated and real diffuse noise cases are shown in Figures 6 and 7, respectively. The upper limit of phone accuracy was , which was obtained from the target speech separated from the clean speech mixtures by ADF. It is observed that when SNR is low or moderate, the adaptive enhancement techniques significantly improved the phone recognition accuracy of the separation outputs. Similar to the TINR results, at high SNRs, the improvement in phone accuracy comes mainly from speech separation, where NC-FADF is significantly better than FADF. Comparative experimental results were also generated for the proposed approach of applying TDC-GSub as a postprocessor after FADF (FADF enhanced by TDC-GSub postprocessing) and the apparent alternative of using TDC-GSub as a preprocessor prior to FADF (FADF after TDC-GSub preprocessing). It is seen that the former performed better than the latter, especially in real diffuse noise. In general, the combination of NC-FADF with TDC-GSub postprocessing achieved the highest accuracy.
7.4. Sensitivity to Noise Estimation
In real applications, there are scenarios where the speech inactive periods are short, which would reduce the reliability of noise statistic estimation. It is therefore of interest to evaluate the feasibility of the proposed NC-FADF algorithm when the input noise statistics are estimated from short data segments. For this purpose, an experiment was performed to vary the speech inactive period from 0.5 second through 2.5 seconds, and the noise statistics computed from the different periods were used by NC-FADF followed by TDC-GSub to perform speech separation and enhancement. The test results confirmed that for the two types of noises investigated in the current work, there is no significant difference in the overall system performance over this range of speech-inactive intervals. Figure 8 illustrates the phone recognition performance versus the speech inactive interval lengths in real diffuse noise. It is seen that except for a performance drop when the speech inactive length was 0.5 second, phone accuracy remained essentially the same for all other speech inactive lengths. In simulated noise, the accuracy performance remained essentially the same for all of the speech inactive lengths, including the 0.5 second case. In general, in an online system a voice activity detection module is needed to identify speech inactive periods, and for fast-varying nonstationary input noises, robust algorithms are needed to estimate time-varying noise properties with adaptive memory lengths. Although this issue is practically important, it is out of the scope of the current work.
8. Conclusions and Future Work
In this paper, we have presented methods of noise compensation and adaptive speech enhancement to improve the performances of ADF speech separation in diffuse noise. Fast implementations for ADF and noise compensation have been made that warrant real-time online applications. FADF has achieved performance comparable to that of ADF with a much faster speed. NC-FADF significantly improved the separation performance for speech mixtures in diffuse noise, and the integration of NC-FADF with speech enhancement significantly improved phone recognition accuracies in separated speech. Future investigations may include other enhancement algorithms and noise-reduction implementations for a more streamlined integration with the NC-FADF procedure.
Hyvarinen A, Karhunen J, Oja E: Independent Component Analysis. John Wiley & Sons, New York, NY, USA; 2001.
Aichner R, Buchner H, Kellermann W: Convolutive blind source separation for noisy mixtures. Proceedings of Joint Meeting of the German and the French Acoustical Societies (CFA/DAGA '04), March 2004, Strasbourg, France
Douglas SC, Cichocki A, Amari S: Bias removal technique for blind source separation with noisy measurements. Electronics Letters 1998,34(14):1379-1380. 10.1049/el:19980994
Hu R, Zhao Y: Adaptive decorrelation filtering algorithm for speech source separation in uncorrelated noises. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), March 2005, Philadelphia, Pa, USA 1: 1113-1116.
Balan R, Rosca J, Richard S: Scalable non-square blind separation in the presence of noise. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003, Hong Kong 5: 293-296.
Araki S, Makino S, Mukai R, Saruwatari H: Equivalence between frequency domain blind source separation and frequency domain adaptive beamforming. Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech '01), September 2001, Aalborg, Denmark 4: 2595-2598.
Araki S, Mukai R, Makino S, Nishikawa T, Saruwatari H: The fundamental limitation of frequency domain blind source separation for convolutive mixtures of speech. IEEE Transactions on Speech and Audio Processing 2003,11(2):109-116. 10.1109/TSA.2003.809193
Asano F, Hayamizu S, Yamada T, Nakamura S: Speech enhancement based on the subspace method. IEEE Transactions on Speech and Audio Processing 2000,8(5):497-507. 10.1109/89.861364
Hu R, Zhao Y: Variable step size adaptive decorrelation filtering for competing speech separation. Proceedings of the 9th European Conference on Speech Communication and Technology (Eurospeech '05), September 2005, Lisbon, Portugal 1: 2297-2300.
Zhao Y, Hu R, Li X: Speedup convergence and reduce noise for enhanced speech separation and recognition. IEEE Transactions on Audio, Speech and Language Processing 2006,14(4):1235-1244.
Weinstein E, Feder M, Oppenheim AV: Multi-channel signal separation by decorrelation. IEEE Transactions on Speech and Audio Processing 1993,1(4):405-413. 10.1109/89.242486
Yen K-C, Zhao Y: Adaptive co-channel speech separation and recognition. IEEE Transactions on Speech and Audio Processing 1999,7(2):138-151. 10.1109/89.748119
Yen K, Huang J, Zhao Y: Co-channel speech separation in the presence of correlated and uncorrelated noises. Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech '99), September 1999, Budapest, Hungary 2587-2589.
Hu R, Zhao Y: Fast noise compensation for speech separation in diffuse noise. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), May 2006, Toulouse, France 5: 865-868.
Hu R, Zhao Y: Adaptive speech enhancement for speech separation in diffuse noise. Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH '06), September 2006, Pittsburgh, PA, USA 2618-2621.
Oppenheim AV, Schafer RW: Discrete-Time Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, USA; 1989.
Hu Y, Loizou PC: A generalized subspace approach for enhancing speech corrupted by colored noise. IEEE Transactions on Speech and Audio Processing 2003,11(4):334-341. 10.1109/TSA.2003.814458
RWCP Sound Scene Database in Real Acoustic Environments ATR Spoken Language Translation Research Lab, Japan, 2001
Odell J, Ollason D, Valtchev V, Young S, Kershaw D, Woodland P: HTK Speech Recognition Toolkit. 1999, http://htk.eng.cam.ac.uk/docs/docs.shtml
Lee K-F, Hon H-W: Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing 1989,37(11):1641-1648. 10.1109/29.46546
Definitions of SNR, TIR, and TINR
Since the ADF filtering model (1) is linear, the superposition principle holds, that is, its output components of target, interference, and noise can be computed separately from their respective input components. Unlike the linear model of ADF, the speech enhancement module is nonlinear and its output components cannot be separately estimated from its individual input components. Therefore, the separate computation of output TIR and SNR is not feasible for the speech enhancement module. Instead, TINRs can be estimated by taking the signal energies other than the original target as the sum of noise and interference signals. The computations of SNR, TIR, and TINR are defined below with respect to channel 1 (the definitions are similar for channel 2):
At ADF input, and are the powers of the clean mixture and the noise components, respectively; and are the powers of the target and the interference speech signals, respectively; is the sum of interference speech and noise. At ADF output, and , and , and are the counterparts of the above components at ADF input. The component is the output speech after enhancement processing.
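The superposition argument and the residual-based TINR estimate can be illustrated with the following Python sketch. It is a sketch under stated assumptions: a single fixed FIR filter stands in for the linear separation stage, TINR is taken as the power of the target component divided by the power of everything else in the signal, and the function names `filter_components` and `tinr_db` are hypothetical.

```python
import numpy as np

def filter_components(h, *components):
    """Pass each input component separately through the same linear filter."""
    return [np.convolve(np.asarray(c, dtype=float), h) for c in components]

def tinr_db(target, total):
    """TINR in dB, treating everything in total other than target as
    interference plus noise."""
    target = np.asarray(target, dtype=float)
    residual = np.asarray(total, dtype=float) - target
    return 10.0 * np.log10(np.mean(target ** 2) / np.mean(residual ** 2))
```

For a linear stage, the sum of the separately filtered components equals the filtered sum, so output TIR and SNR can be computed component by component. After a nonlinear enhancement stage only the residual form in `tinr_db` remains applicable, with `total` set to the enhanced output.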