Combining Superdirective Beamforming and Frequency-Domain Blind Source Separation for Highly Reverberant Signals
© Lin Wang et al. 2010
Received: 14 January 2010
Accepted: 1 June 2010
Published: 24 June 2010
Frequency-domain blind source separation (BSS) performs poorly in high reverberation because the independence assumption collapses at each frequency bin when the number of bins increases. To improve the separation result, this paper proposes a method that combines two techniques by using beamforming as a preprocessor for blind source separation. Assuming the sound source locations are known, the mixed signals are dereverberated and enhanced by beamforming; the beamformed signals are then further separated by blind source separation. To implement the proposed method, a superdirective fixed beamformer is designed for beamforming, and an interfrequency dependence-based permutation alignment scheme is presented for frequency-domain blind source separation. Because beamforming shortens the mixing filters and reduces noise before blind source separation, the combined method works better in reverberation. The performance of the proposed method is investigated by separating up to 4 sources in different environments with reverberation times from 100 ms to 700 ms. Simulation results verify that the proposed method outperforms beamforming or blind source separation used alone. Analysis demonstrates that the proposed method is computationally efficient and appropriate for real-time processing.
The objective of acoustic source separation is to estimate original sound sources from the mixed signals. This technique has found a lot of applications in noise-robust speech recognition and high-quality hands-free telecommunication systems. A classical example is to separate audio sources observed in a real room, known as a cocktail party environment, where a number of people are talking concurrently. A lot of research has focused on the problem but development is currently still in progress. Two kinds of techniques are promising in achieving source separation with multiple microphones: beamforming and blind source separation.
Beamforming is a technique used in sensor arrays for directional signal reception [1, 2]. Based on a model of the wavefront from acoustic sources, it enhances the target direction and suppresses unwanted ones by coherently summing signals from the sensors. Beamforming can be classified as either fixed or adaptive, depending on how the beamformer weights are chosen. The weights of a fixed beamformer do not depend on the array data and are chosen to present a specified response for all scenarios. The most conventional fixed beamformer is the delay-and-sum beamformer, which, however, requires a large number of microphones to achieve high performance. Another fixed design, the filter-and-sum beamformer, achieves a superdirective response with optimized weights. The weights of an adaptive beamformer are chosen based on the statistics of the array data to optimize the array response. In a source separation system, each source signal may be obtained separately using the directivity of the array if the directions of the sources are known. However, beamforming has limited performance in highly reverberant conditions because it cannot suppress the interfering reverberation coming from the desired direction.
Blind source separation (BSS) is a technique for recovering the source signals from observed signals when the mixing process is unknown. It relies only on the independence assumption of the source signals to estimate them from the mixtures. The cocktail party problem is a challenge because the mixing process is convolutive: the observations are combinations of filtered versions of the sources. A large number of unmixing filter coefficients must be calculated simultaneously to recover the original signals. The convolutive BSS problem can be solved in the time domain or the frequency domain. In time-domain BSS, the separation network is derived by optimizing a time-domain cost function [5–7]. However, these approaches may not be effective due to slow convergence and a large computational load. In frequency-domain BSS, the observed time-domain signals are converted into the time-frequency domain by the short-time Fourier transform (STFT); instantaneous BSS is then applied to each frequency bin, after which the separated signals of all frequency bins are combined and inverse-transformed to the time domain [8, 9]. Although satisfactory instantaneous separation may be achieved within all frequency bins, combining them to recover the original sources is a challenge because of the unknown permutations associated with individual frequency bins. This is the permutation ambiguity problem. There are two common strategies to solve this problem. The first strategy is to exploit the interfrequency dependence of separated signals [10, 11]. The second strategy is to exploit the position information of sources, such as the direction of arrival [12, 13]. By analyzing the directivity pattern formed by a separation matrix, the source direction can be estimated and the permutations aligned. Generally, these two strategies can be combined to obtain a better permutation alignment.
Besides the permutation problem, another fundamental problem also limits the performance of frequency-domain BSS: the dilemma in determining the STFT analysis frame length [15–17]. Frames shorter than the mixing filters generate incomplete instantaneous mixtures, while long frames collapse the independence measure at each frequency bin and disturb separation. The conflict is even more severe in high reverberation with long mixing filters. Generally, a frequency-domain BSS method that works well in low (100–200 ms) reverberation has degraded performance in medium (200–500 ms) and high (>500 ms) reverberation. Since the problem originates from a processing step in frequency-domain BSS that approximates linear convolutions with circular convolutions, we call it the "circular convolution approximation problem". This problem will be further elaborated in Section 2.2. Although great progress has been made on the permutation problem in recent years, few methods have been proposed that give good separation results in a highly reverberant environment.
To improve the separation performance in high reverberation, this paper proposes a method which combines beamforming and blind source separation. Assuming the sound source locations are known, the proposed method employs beamforming as a preprocessor for blind source separation. With beamforming reducing reverberation and enhancing signal-to-noise ratio, blind source separation works well in reverberant environments, and thus the combined method performs better than using either of the two methods alone. Since the proposed method requires the knowledge of source locations for beamforming, it is a semiblind method. However, the source locations may be estimated with an array sound source localization algorithm or using other approaches, which is beyond the scope of this paper [18, 19].
In fact, the relationship between blind source separation and beamforming has been intensively investigated in recent years, and adaptive beamforming is commonly used to explain the physical principle of convolutive BSS [15, 20]. In addition, many approaches have been presented that combine both techniques. Some of these combined approaches aim at resolving the permutation ambiguity inherent in frequency-domain BSS [12, 21], whereas others use beamforming to provide a good initialization for BSS or to accelerate its convergence [22–24]. As far as we know, there have been no systematic studies on a direct application of the BSS-beamforming combination to highly reverberant environments.
The rest of the paper is organized as follows. Frequency-domain BSS and its circular convolution approximation problem are introduced in Section 2. The proposed method combining BSS and beamforming is presented in Section 3. Section 4 gives experimental results in various reverberant environments. Finally, conclusions are drawn in Section 5.
2. Frequency-Domain BSS and Its Fundamental Problem
2.1. Frequency-Domain BSS
where is the STFT of , is the Fourier transform of ; is a permutation matrix and a scaling matrix, all at frequency . The source permutation and gain indeterminacy are problems inherent in frequency-domain BSS. It is necessary to correct them before transforming the signals back to the time domain.
2.2. Circular Convolution Approximation Problem
Equation (3) is only an approximation since it implies a circular convolution but not a linear convolution in the time domain; it is correct only when the mixing filter length is short compared to .
As a result, it is necessary to work with to ensure the accuracy of (3). However in that case, the instantaneous separation performance is saturated before reaching a sufficient separation, because decreased time resolution for STFT and fewer data available in each frequency bin will collapse the independence assumption and deteriorate instantaneous separation [15, 17].
In a nutshell, short frames make the conversion to instantaneous mixtures incomplete, while long frames disturb the separation. This contradiction is even more severe in highly reverberant environments, where the mixing filters are much longer than the STFT analysis frame. This is the reason for the poor performance of frequency-domain BSS in high reverberation.
It is necessary to work with to ensure the accuracy of (3). In this case, however, long frames worsen the time resolution in the time-frequency domain and decrease the number of samples in each bin. As a result, the independence of the source signals decreases greatly at some bins, which deteriorates instantaneous BSS and hence significantly reduces convolutive BSS performance in high reverberation [15, 17]. In other words, short frames make the conversion to instantaneous mixtures incomplete, while long frames disturb the separation. The conflict becomes more severe in highly reverberant environments and leads to degraded performance.
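The wrap-around error behind the circular convolution approximation can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes a rectangular analysis window, integer test data, and hypothetical helper names. It compares one frame of the truly (linearly) convolved signal with what a per-frame circular convolution model predicts, and counts the samples on which the two disagree.

```python
def true_mixed_frame(s, h, start, n):
    """Samples start..start+n-1 of the linear convolution of s with h."""
    return [sum(h[k] * s[start + m - k]
                for k in range(len(h)) if start + m - k >= 0)
            for m in range(n)]


def circular_model_frame(s, h, start, n):
    """Circular convolution of the n-sample source frame with h.

    This is what a bin-wise multiplicative model implies in the time domain.
    """
    frame = s[start:start + n]
    return [sum(h[k] * frame[(m - k) % n] for k in range(len(h)))
            for m in range(n)]


def corrupted_samples(s, h, start, n):
    """Number of frame samples where the circular model is wrong."""
    exact = true_mixed_frame(s, h, start, n)
    model = circular_model_frame(s, h, start, n)
    return sum(1 for a, b in zip(exact, model) if a != b)


source = list(range(1, 65))        # toy integer "source" signal
frame_len, frame_start = 8, 16     # hypothetical frame layout
short_filter = [1, 1, 1]           # 3-tap mixing filter
long_filter = [1, 1, 1, 1, 1, 1]   # 6-tap mixing filter

print(corrupted_samples(source, short_filter, frame_start, frame_len))
print(corrupted_samples(source, long_filter, frame_start, frame_len))
```

In this toy setting the first P - 1 samples of each frame are affected for a P-tap filter, so the error grows as the filter length approaches the frame length, which is the situation described above for long room impulse responses.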
3. Combined Separation Method
The interfering residuals due to reverberation after beamforming are further reduced by blind source separation.
The poor separation performance of blind source separation in reverberant environments is compensated for by beamforming, which suppresses the reflected paths and shortens the mixing filters.
The beamformer enhances the source in its look direction and suppresses those outside it. It thus enhances the signal-to-noise ratio and provides a cleaner output for blind source separation to process.
where is the beamforming output vector, is the observed vector, is a sequence of matrices containing the impulse responses of beamformer, is the global impulse response by combining and , and the operator "*" denotes matrix convolution.
It can be seen from (5)–(7) that, with beamforming reducing reverberation and enhancing signal-to-noise ratio, the combined method is able to replace the original mixing network , which results from the room impulse response, with a new mixing network , which is easier to separate.
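The matrix convolution that forms the new mixing network can be sketched as follows. This is only an illustration under assumed names: `b_taps` stands for the sequence of beamformer impulse response matrices, `h_taps` for the sequence of room mixing matrices, and the result for the global impulse response, computed tap by tap as G(n) = sum over k of B(k) H(n - k). None of these identifiers come from the paper.

```python
def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matadd(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def matrix_convolve(b_taps, h_taps):
    """Matrix convolution: g[n] = sum_k b_taps[k] @ h_taps[n - k]."""
    rows, cols = len(b_taps[0]), len(h_taps[0][0])
    length = len(b_taps) + len(h_taps) - 1
    g = [[[0] * cols for _ in range(rows)] for _ in range(length)]
    for k, b_k in enumerate(b_taps):
        for l, h_l in enumerate(h_taps):
            g[k + l] = matadd(g[k + l], matmul(b_k, h_l))
    return g


# With 1x1 "matrices" this reduces to ordinary scalar convolution.
print(matrix_convolve([[[1]], [[2]]], [[[3]], [[4]]]))
```

The length of the combined response is the sum of the two lengths minus one, so the benefit of beamforming does not come from a shorter nominal filter but from the energy of the combined response being concentrated in fewer taps.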
Regarding the implementation details, two techniques are employed: a superdirective beamformer, which can fully exploit the dereverberation and noise reduction ability of a microphone array, and frequency-domain blind source separation, which is well known for its fast convergence and low computational cost. These two components are described below.
An adaptive beamformer obtains a directive response mainly by analyzing the statistical information contained in the array data, not by utilizing the spatial information directly. Its essence is similar to that of convolutive blind source separation. Cascading the two is equivalent to using the same technique repeatedly, and hence contributes little to performance improvement.
An adaptive beamformer generally adapts its weights during breaks in the target signal. However, it is a challenge to predict signal breaks when several people are talking concurrently. This significantly limits the applicability of adaptive beamforming to source separation.
In contrast, a fixed beamformer, which relies mainly on spatial information, does not have such disadvantages. It is data-independent and more stable. Given a look direction, the directive response is obtained for all scenarios. Thus a fixed beamformer is preferred in the proposed method.
where is the beamforming weight vector composed of beamforming weights from each sensor, is the output vector composed of outputs from each sensor, and denotes conjugate transpose. The weight vector depends on the array geometry and source directivity, as well as on the array output optimization criterion, such as a signal-to-noise ratio (SNR) gain criterion [29–31].
The terms and in (10) depend on the array geometry and the target source direction. For a circular array, the calculation of and is given as follows.
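As an illustration only, one common textbook formulation of a superdirective beamformer can be sketched as follows. The assumptions here are not confirmed by the text: a far-field plane-wave steering vector d, a spherically isotropic (diffuse) noise coherence matrix with entries sin(x)/x, a small diagonal loading term for robustness, and weights w proportional to the inverse coherence matrix applied to d, normalized so that the look direction is passed undistorted. All function names, the array radius, and the loading value are hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def circular_array(num_mics, radius):
    """(x, y) positions of microphones equally spaced on a circle."""
    return [(radius * math.cos(2 * math.pi * m / num_mics),
             radius * math.sin(2 * math.pi * m / num_mics))
            for m in range(num_mics)]


def steering_vector(positions, azimuth, freq):
    """Far-field steering vector for a plane wave from `azimuth` (radians)."""
    ux, uy = math.cos(azimuth), math.sin(azimuth)
    return [cmath.exp(2j * math.pi * freq * (x * ux + y * uy) / SPEED_OF_SOUND)
            for x, y in positions]


def diffuse_coherence(positions, freq, loading):
    """Diffuse-field coherence matrix sin(kd)/(kd) plus diagonal loading."""
    n = len(positions)
    gamma = [[0j] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            dist = math.hypot(positions[a][0] - positions[b][0],
                              positions[a][1] - positions[b][1])
            arg = 2 * math.pi * freq * dist / SPEED_OF_SOUND
            gamma[a][b] = complex(1.0 if arg == 0 else math.sin(arg) / arg)
        gamma[a][a] += loading
    return gamma


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for i in reversed(range(n)):
        acc = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / aug[i][i]
    return x


def superdirective_weights(positions, azimuth, freq, loading=1e-2):
    """Weights w = G^-1 d / (d^H G^-1 d), with G the loaded coherence matrix."""
    d = steering_vector(positions, azimuth, freq)
    g_inv_d = solve(diffuse_coherence(positions, freq, loading), d)
    norm = sum(di.conjugate() * gi for di, gi in zip(d, g_inv_d))
    return [gi / norm for gi in g_inv_d], d


def response(weights, d):
    """Beamformer response w^H d to a source with steering vector d."""
    return sum(w.conjugate() * di for w, di in zip(weights, d))


mics = circular_array(16, 0.1)  # 16 elements; the 0.1 m radius is assumed
w, d = superdirective_weights(mics, azimuth=0.0, freq=1000.0)
print(abs(response(w, d)))      # unit gain toward the look direction
```

Because the weights depend only on the geometry, the look direction, and the frequency, they can be computed once per frequency bin in advance, which is what makes such a design a fixed, data-independent beamformer.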
3.2. Frequency-Domain Blind Source Separation
As discussed before, the workflow of frequency-domain blind source separation is shown in Figure 1. Three realization details will be addressed: instantaneous BSS, permutation alignment, and scaling correction.
3.2.1. Instantaneous BSS
After decomposing time-domain convolutive mixing into frequency-domain instantaneous mixing, it is possible to perform separation at each frequency bin with a complex-valued instantaneous BSS algorithm. Here we use the Scaled Infomax algorithm, which is not sensitive to initial values and is able to converge to the optimal solution within 100 iterations.
3.2.2. Permutation Alignment
Permutation ambiguity inherent in frequency-domain BSS is a challenge in the combined method. Generally, there are two approaches to cope with the permutation problem. One is to exploit the dependence of separated signals across frequencies. The other is to exploit the position information of sources: the directivity pattern of the mixing/unmixing matrix provides a good reference for permutation alignment. However, in the combined method, the directivity information contained in the mixing matrix no longer exists after beamforming. Even if the source positions are known, they are of little help for permutation alignment. Consequently, the only cue available for permutation alignment is the first one: the interfrequency dependence of separated signals. In previous work we proposed a permutation alignment approach with good results, based on an interfrequency dependence measure: the powers of separated signals. Its principle is briefly given below.
where the denominator is the total power of the observed signals , the numerator is the power of the th separated signal, and is the th component of the separated signal , that is, . Being in the range [0, 1], (17) is close to 1 when the th separated signal is dominant, and close to 0 when others are dominant. Owing to the sparsity of speech signals, the power ratio measure clearly exhibits the signal activity.
where and are indices of two separated channels, and are two frequencies, , , are, respectively, the correlation, mean, and standard deviation at time (the time index is omitted for clarity). Note that denotes expectation. Being in the range [−1, 1], (18) tends to be high if the output channels and originate from the same source and low if they represent different sources. This property will be used for aligning the permutation.
A permutation alignment approach based on the power ratio measure has been proposed previously. Binwise permutation alignment is applied first across all frequency bins, using the correlation of separated signal powers; the full frequency band is then partitioned into small regions based on the binwise permutation alignment result. Finally, regionwise permutation alignment is performed, which prevents misalignment at isolated frequency bins from spreading to others and thus improves the permutation result. This permutation alignment approach is employed in the proposed method.
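The binwise step can be illustrated with a short sketch. It is a simplification under stated assumptions, not the region-growing algorithm itself: the power ratio is computed directly from the separated coefficients (each channel's power divided by the total power over channels at that frame), the correlation coefficient is the usual (correlation minus product of means) divided by the product of standard deviations, and two bins are aligned by trying all permutations. Function names and test data are hypothetical.

```python
from itertools import permutations
from statistics import mean, pstdev


def power_ratios(separated):
    """separated[i][t]: coefficient of channel i at frame t in one bin.

    Returns v[i][t] = |y_i|^2 / sum_k |y_k|^2 (simplified power ratio).
    """
    frames = range(len(separated[0]))
    totals = [sum(abs(ch[t]) ** 2 for ch in separated) for t in frames]
    return [[abs(ch[t]) ** 2 / totals[t] if totals[t] > 0 else 0.0
             for t in frames]
            for ch in separated]


def correlation(u, v):
    """Correlation coefficient of two power ratio sequences over time."""
    su, sv = pstdev(u), pstdev(v)
    if su == 0 or sv == 0:
        return 0.0
    cross = mean([a * b for a, b in zip(u, v)])
    return (cross - mean(u) * mean(v)) / (su * sv)


def best_permutation(reference, candidate):
    """Permutation p such that candidate[p[i]] best matches reference[i]."""
    n = len(reference)
    return max(permutations(range(n)),
               key=lambda p: sum(correlation(reference[i], candidate[p[i]])
                                 for i in range(n)))


# Two bins with the same activity pattern but swapped output channels.
bin_a = [[1 + 0j, 2j, 0.1, 0.1], [0.1, 0.1, 3, 1j]]
bin_b = [[0.2, 0.1, 2, 2], [3, 1, 0.1, 0.2]]
print(best_permutation(power_ratios(bin_a), power_ratios(bin_b)))
```

Exhaustive search over permutations is affordable for the small source counts considered here (4 sources give 24 permutations per bin); the regionwise stage described above then guards against isolated binwise errors propagating.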
3.2.3. Scaling Correction
where is after permutation correction and is the one after scaling correction, denotes inversion of a square matrix or pseudoinversion of a rectangular matrix; retains only the main diagonal components of the matrix.
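One form consistent with these definitions is W_s(f) = diag(W_p(f)^-1) W_p(f), where W_p and W_s are names assumed here for the unmixing matrices before and after scaling correction. The sketch below applies it to a 2x2 complex matrix only; the rectangular case with a pseudoinverse is not covered, and the helper names are hypothetical.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def scaling_correction(w_p):
    """Return diag(inv(w_p)) @ w_p for a 2x2 unmixing matrix w_p."""
    w_inv = inverse_2x2(w_p)
    return [[w_inv[i][i] * w_p[i][j] for j in range(2)] for i in range(2)]


w_p = [[2 + 1j, 0.5], [0.3j, 4.0]]
w_s = scaling_correction(w_p)

# Rescaling the rows of w_p arbitrarily leaves the corrected matrix unchanged.
rescaled = [[(3 - 2j) * x for x in w_p[0]], [0.25j * x for x in w_p[1]]]
w_s_rescaled = scaling_correction(rescaled)
print(w_s)
print(w_s_rescaled)
```

The correction removes the arbitrary gain of each separated output: whatever per-channel scaling the instantaneous BSS stage produced at a given bin, the corrected matrix is the same, so the bins can be recombined consistently.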
3.3. Computational Complexity Analysis
We think the result is quite acceptable. For 4 sources recorded by a 16-element microphone array, , , the average computation involves about 7200 complex-valued multiplications for each sample time (with 16 sample points). Thus, in terms of computational complexity, the proposed algorithm is promising for real-time applications.
4. Experiment Results and Analysis
We evaluate the performance of the proposed method in simulated experiments in two parts. The first part verifies the dereverberation performance of beamforming. The second investigates the performance of the proposed method in various reverberant conditions, and compares it with a BSS-only method and a beamforming-only one.
The implementation details of the algorithm are as follows. For blind source separation, the Tukey window is used in the STFT, with a shift size of 1/4 of the window length. The number of iterations of the instantaneous Scaled Infomax algorithm is 100. The processing bandwidth is between 100 and 3750 Hz (the sampling rate being 8 kHz). The STFT frame size varies according to the experimental conditions. For beamforming, a circular microphone array is used to design the beamformer, with a filter length of 2048; the array size varies according to the experimental conditions.
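For orientation, the STFT settings above translate into a frame shift and a range of processed frequency bins as in the following sketch. The function name is hypothetical, and it assumes that bin k corresponds to frequency k times the sampling rate divided by the frame length and that bins on the band edges are included.

```python
import math


def stft_layout(frame_len, fs=8000, f_lo=100.0, f_hi=3750.0):
    """Frame shift, bin spacing, and processed bin range for one frame size."""
    hop = frame_len // 4                # shift of 1/4 window length
    bin_hz = fs / frame_len             # frequency spacing between bins
    k_lo = math.ceil(f_lo / bin_hz)     # first bin at or above f_lo
    k_hi = math.floor(f_hi / bin_hz)    # last bin at or below f_hi
    return hop, bin_hz, k_lo, k_hi


for frame_len in (1024, 2048):
    hop, bin_hz, k_lo, k_hi = stft_layout(frame_len)
    print(frame_len, hop, bin_hz, k_lo, k_hi, k_hi - k_lo + 1)
```

Under these assumptions, a frame size of 2048 gives a shift of 512 samples and 935 processed bins, each of which requires its own instantaneous BSS run and permutation decision.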
4.1. Simulation Environment and Evaluation Measures
The separation performance is measured by signal-to-interference ratio (SIR) in dB.
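As a reminder of how such a figure is computed, a generic SIR can be written as ten times the base-10 logarithm of the ratio of target power to interference power. The sketch below uses that generic definition with a hypothetical function name; the exact input and output SIR definitions used in the experiments may differ in how the target and interference components are obtained.

```python
import math


def sir_db(target, interference):
    """Signal-to-interference ratio in dB from two sample sequences."""
    target_power = sum(abs(x) ** 2 for x in target)
    interference_power = sum(abs(x) ** 2 for x in interference)
    return 10.0 * math.log10(target_power / interference_power)


print(sir_db([10, 10], [1, 1]))   # target 100 times stronger in power
print(sir_db([1, 0], [0, 1]))     # equal powers
```

On this scale, equal target and interference powers correspond to 0 dB, and every factor of 10 in the power ratio adds 10 dB.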
4.2. Dereverberation Experiment
It can be seen that the diagonal components in Figure 8 dominate the off-diagonal ones, which implies that the target sources are dominant in the outputs. To demonstrate the dereverberation performance of beamforming, Figure 8(a) is enlarged in Figure 7(b) and compared with the original impulse response in Figure 7(a). The mixing filter clearly becomes shorter after beamforming, and the reverberation becomes weaker, which indicates that dereverberation is achieved. The two advantages of beamforming, dereverberation and noise reduction, are thus observed as expected, so the new mixing network should be easier to separate than the original mixing network. In this experiment, the average input SIR is SIR_IN = −2.8 dB, and the output SIR after beamforming is SIR_BM = 3.3 dB. Setting the STFT frame size to 2048 and applying BSS to the beamformed signals, we obtain an average output SIR for the combined method of SIR_OUT = 16.3 dB, a 19.1 dB improvement over the input: a 6.1 dB improvement at the beamforming stage and a further 13 dB improvement at the BSS stage.
4.3. Experiments in Reverberant Environments
Three experiments are conducted to investigate the performance of the proposed method and compare it with the BSS-only and the beamforming-only method. The first examines the performance of the BSS-only method in medium reverberation with different STFT frame sizes. The second compares the performance of the proposed method and the other two methods in various reverberant conditions. The third examines the performance of the proposed method with various microphone array sizes.
4.3.1. BSS with Different STFT Frame Size
Obviously, an optimal STFT frame size may exist for a specific reverberation. However, due to complex acoustical environments and varieties of source signals, it is difficult to determine this value precisely. How to choose an appropriate frame length may be a topic of our future research. Generally, 1024 or 2048 can be used as a common frame length. Here we use an analysis frame length of 2048 for all reverberant conditions in the remaining experiments.
4.3.2. Performance Comparison among Three Methods
The experiments in this part illustrate the superiority of the proposed method over using beamforming or blind source separation alone. A comparison of the proposed method with other hybrid methods in different reverberant conditions will be investigated in our future research.
4.3.3. Performance of the Combined Method with Different Microphone Array Size
Given the poor performance of blind source separation in high reverberation, this paper proposes a method that combines beamforming and blind source separation. Using superdirective beamforming as a preprocessor for frequency-domain blind source separation, the combined method integrates the advantages of both techniques and compensates for the weaknesses each has when used alone. Simulations in different conditions (reverberation times of 100 ms–700 ms) illustrate the superiority of the proposed method over using beamforming or blind source separation alone, and the performance improvement increases with the microphone array size. With its high computational efficiency, the proposed method is promising for real-time processing.
This paper is partly supported by the National Natural Science Foundation of China (60772161, 60372082) and the Specialized Research Fund for the Doctoral Program of Higher Education of China (200801410015). This paper is also supported by NRC-MOE Research and Post-doctoral Fellowship Program from Ministry of Education of China and National Research Council of Canada. The authors would like to thank Dr. Michael R. Stinson of National Research Council Canada for his invaluable discussions.
- Van Veen BD, Buckley KM: Beamforming: a versatile approach to spatial filtering. IEEE ASSP Magazine 1988, 5(2):4-24.
- Van Trees HL: Optimum Array Processing: Part IV of Detection, Estimation, and Modulation Theory. Wiley-Interscience, New York, NY, USA; 2002.
- Hyvärinen A, Karhunen J, Oja E: Independent Component Analysis. John Wiley & Sons, New York, NY, USA; 2001.
- Pedersen MS, Larsen J, Kjems U, Parra LC: A survey of convolutive blind source separation methods. In Springer Handbook on Speech Processing and Speech Communication. Springer, London, UK; 2007:1-34.
- Douglas SC, Sun X: Convolutive blind separation of speech mixtures using the natural gradient. Speech Communication 2003, 39(1-2):65-78. 10.1016/S0167-6393(02)00059-6
- Aichner R, Buchner H, Yan F, Kellermann W: A real-time blind source separation scheme and its application to reverberant and noisy acoustic environments. Signal Processing 2006, 86(6):1260-1277. 10.1016/j.sigpro.2005.06.022
- Douglas SC, Gupta M, Sawada H, Makino S: Spatio-temporal FastICA algorithms for the blind separation of convolutive mixtures. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(5):1511-1520.
- Smaragdis P: Blind separation of convolved mixtures in the frequency domain. Neurocomputing 1998, 22(1–3):21-34.
- Sawada H, Araki S, Makino S: Frequency-domain blind source separation. In Blind Speech Separation. Springer, London, UK; 2007:47-78.
- Murata N, Ikeda S, Ziehe A: An approach to blind source separation based on temporal structure of speech signals. Neurocomputing 2001, 41(1–4):1-24.
- Sawada H, Araki S, Makino S: Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS. Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '07), May 2007 3247-3250.
- Saruwatari H, Kurita S, Takeda K, Itakura F, Nishikawa T, Shikano K: Blind source separation combining independent component analysis and beamforming. EURASIP Journal on Applied Signal Processing 2003, 2003(11):1135-1146. 10.1155/S1110865703305104
- Ikram MZ, Morgan DR: Permutation inconsistency in blind speech separation: Investigation and solutions. IEEE Transactions on Speech and Audio Processing 2005, 13(1):1-13.
- Sawada H, Mukai R, Araki S, Makino S: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Transactions on Speech and Audio Processing 2004, 12(5):530-538. 10.1109/TSA.2004.832994
- Araki S, Mukai R, Makino S, Nishikawa T, Saruwatari H: The fundamental limitation of frequency domain blind source separation for convolutive mixtures of speech. IEEE Transactions on Speech and Audio Processing 2003, 11(2):109-116. 10.1109/TSA.2003.809193
- Hiroe A: Blind vector deconvolution: convolutive mixture models in short-time Fourier transform domain. Proceedings of the International Workshop on Independent Component Analysis (ICA '07), 2007, Lecture Notes in Computer Science 4666: 471-479.
- Nishikawa T, Saruwatari H, Shikano K: Blind source separation of acoustic signals based on multistage ICA combining frequency-domain ICA and time-domain ICA. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 2003, E86-A(4):846-858.
- Silverman HF, Yu Y, Sachar JM, Patterson WR III: Performance of real-time source-location estimators for a large-aperture microphone array. IEEE Transactions on Speech and Audio Processing 2005, 13(4):593-606.
- Madhu N, Martin R: A scalable framework for multiple speaker localisation and tracking. Proceedings of the International Workshop on Acoustic Echo and Noise Control, 2008, Seattle, Wash, USA 1-4.
- Parra L, Fancourt C: An adaptive beamforming perspective on convolutive blind source separation. In Noise Reduction in Speech Applications. Edited by: Davis GM. CRC Press, Boca Raton, Fla, USA; 2002:361-376.
- Ikram MZ, Morgan DR: A beamforming approach to permutation alignment for multichannel frequency-domain blind speech separation. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, May 2002 1: 881-884.
- Parra LC, Alvino CV: Geometric source separation: Merging convolutive source separation with geometric beamforming. IEEE Transactions on Speech and Audio Processing 2002, 10(6):352-362. 10.1109/TSA.2002.803443
- Saruwatari H, Kawamura T, Nishikawa T, Lee A, Shikano K: Blind source separation based on a fast-convergence algorithm combining ICA and beamforming. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(2):666-678.
- Gupta M, Douglas SC: Beamforming initialization and data prewhitening in natural gradient convolutive blind source separation of speech mixtures. In Independent Component Analysis and Signal Separation. Volume 4666. Springer, Berlin, Germany; 2007:512-519. 10.1007/978-3-540-74494-8_64
- Bingham E, Hyvärinen A: A fast fixed-point algorithm for independent component analysis of complex valued signals. International Journal of Neural Systems 2000, 10(1):1-8.
- Bell AJ, Sejnowski TJ: An information-maximization approach to blind separation and blind deconvolution. Neural Computation 1995, 7(6):1129-1159. 10.1162/neco.1995.7.6.1129
- Amari S, Cichocki A, Yang HH: A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems 1996, 8: 757-763.
- Pan Q, Aboulnasr T: Combined spatial/beamforming and time/frequency processing for blind source separation. Proceedings of the European Signal Processing Conference, 2005, Antalya, Turkey 1-4.
- Cox H, Zeskind RM, Kooij T: Practical supergain. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986, 34(3):393-398. 10.1109/TASSP.1986.1164847
- Ryan JG, Goubrân RA: Array optimization applied in the near field of a microphone array. IEEE Transactions on Speech and Audio Processing 2000, 8(2):173-176. 10.1109/89.824702
- Bouchard C, Havelock DI, Bouchard M: Beamforming with microphone arrays for directional sources. Journal of the Acoustical Society of America 2009, 125(4):2098-2104. 10.1121/1.3089221
- Douglas SC, Gupta M: Scaled natural gradient algorithms for instantaneous and convolutive blind source separation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), April 2007 2: 637-640.
- Wang L, Ding H, Yin F: A region-growing permutation alignment approach in frequency-domain blind source separation of speech mixtures. IEEE Transactions on Audio, Speech and Language Processing. In press
- Matsuoka K, Nakashima S: Minimal distortion principle for blind source separation. Proceedings of the International Workshop on Independent Component Analysis (ICA '01), 2001 722-727.
- Allen JB, Berkley DA: Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America 1979, 65(4):943-950. 10.1121/1.382599
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.