Nonparametric Bayesian sparse factor analysis for frequency domain blind source separation without permutation ambiguity
© Nagira et al.; licensee Springer. 2013
- Received: 16 June 2012
- Accepted: 27 October 2012
- Published: 22 January 2013
Blind source separation (BSS) and sound activity detection (SAD) from a sound source mixture with minimum prior information are two major requirements for computational auditory scene analysis that recognizes auditory events in many environments. In daily environments, BSS suffers from many problems such as reverberation, a permutation problem in frequency-domain processing, and uncertainty about the number of sources in the observed mixture. While many conventional BSS methods resort to a cascaded combination of subprocesses, e.g., frequency-wise separation and permutation resolution, to overcome these problems, their outcomes may be affected by the worst subprocess. Our aim is to develop a unified framework to cope with these problems. Our method, called permutation-free infinite sparse factor analysis (PF-ISFA), is based on a nonparametric Bayesian framework that enables inference without a pre-determined number of sources. It solves BSS, SAD and the permutation problem at the same time. Our method has two key ideas: unified source activities for all the frequency bins and the activation probabilities of all the frequency bins of all the sources. Experiments were carried out to evaluate the separation performance and the SAD performance under four reverberant conditions. For separation performance in the BSS_EVAL criteria, our method outperformed conventional complex ISFA under all conditions. For SAD performance, our method outperformed the conventional method by 5.9–0.5% in F-measure under the condition RT20 = 30–600 [ms], respectively.
- Independent Component Analysis
- Sound Source
- Source Separation
- Blind Source Separation
- Short Time Fourier Transform
unknown mixing processes,
source number uncertainty,
reverberation,
performance degradation caused by mutually dependent functions.
The first one indicates that the CASA system should work without information specific to a particular environment or situation, such as the environment’s impulse responses or the sound source locations. The second one expresses that the CASA system should achieve robust estimation when the number of sources is unknown. The third one means that the mixture of audio signals captured in a room may contain reverberations that affect the microphone array processing. The last one means that cascaded processing to cope with the above-mentioned difficulties may be severely affected by the worst subprocess of the CASA system. When source separation is performed in the frequency domain, the output signals are affected by the permutation problem, which is the ambiguity in the output order across frequency bins. Conventional methods take the cascade approach: the mixed signals are first separated in each frequency bin, and then the permutation problem is solved. As mentioned above, the overall performance is limited by the performance of the worst subprocess.
Our solution for overcoming these difficulties is as follows. The mixing process is modeled stochastically and inferred on the basis of this model. To handle source number uncertainty, we introduce a nonparametric Bayesian approach. The reverberation is absorbed by using frequency-domain processing. Unified analysis of the source separation and permutation resolution is used to optimize these mutually dependent functions.
This article presents a permutation-free infinite sparse factor analysis (PF-ISFA): a joint estimation method that simultaneously achieves frequency-domain source separation and SAD using a minimum amount of prior information. PF-ISFA achieves robust estimation without using prior information about the number of sources. PF-ISFA extends the frequency-domain ISFA, which is a nonparametric Bayesian frequency-domain source separation method. We build a generative process that explains the observed sound source mixture and derive a Bayesian inference to retrieve respective sound sources and sound activities. The key idea of PF-ISFA is that all the frequency bins of signals are processed at the same time to avoid the permutation problem. In particular, a unified source activity for all frequency bins is introduced into its generative model.
The rest of this article is organized as follows. Section 2 summarizes the main problem treated and introduces studies related to our method. Section 3 explains conventional ISFA in the time and frequency domains and then introduces our new method, PF-ISFA. Section 4 gives detailed posterior inferences of PF-ISFA. Section 5 presents experimental results, and Section 6 concludes this article.
This section first summarizes conventional methods for ISFA: Section 3.1 shows the model of ISFA in the time domain, and Section 3.2 explains its expansion into the frequency domain (FD-ISFA) and its problems. Then, Section 3.3 describes a model of FD-ISFA without permutation ambiguity (PF-ISFA).
ISFA in time domain
Here, $a_k$ is the $k$th row of $A$, and $p_\varepsilon$, $q_\varepsilon$, $p_A$, $q_A$, $p_\alpha$, and $q_\alpha$ are the hyperparameters. IBP($\alpha$) is the Indian buffet process (IBP) with concentration parameter $\alpha$. The IBP is a stochastic process that can deal with a potentially infinite number of signals. It is used in order to achieve separation without using prior knowledge about the number of sources.
A Bayesian hierarchical model aims at explaining the uncertainty in the model from the observed data by treating latent variables as a probabilistic variable rather than a fixed value. In our model, we place a gamma prior on the concentration parameter of IBP so that the emergence of sources in Z can be controlled by the data we have.
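To illustrate how an IBP prior lets the number of sources grow with the data, the following minimal sketch draws a binary activity matrix from IBP(α) using the sequential "restaurant" construction of Griffiths and Ghahramani. The function name `sample_ibp`, the matrix layout (sources × time frames), and the example values are assumptions made for illustration; they are not taken from the article.

```python
import numpy as np

def sample_ibp(alpha, n_frames, rng):
    """Draw a binary matrix Z (sources x frames) from IBP(alpha)."""
    rows = []    # one 0/1 list per source, grown frame by frame
    counts = []  # number of frames in which each source has been active so far
    for t in range(n_frames):
        # Existing sources: source k is active with probability m_k / (t + 1).
        for k in range(len(rows)):
            active = int(rng.random() < counts[k] / (t + 1))
            rows[k].append(active)
            counts[k] += active
        # New sources: Poisson(alpha / (t + 1)) sources appear at frame t.
        n_new = rng.poisson(alpha / (t + 1))
        for _ in range(n_new):
            rows.append([0] * t + [1])
            counts.append(1)
    if not rows:
        return np.zeros((0, n_frames), dtype=int)
    return np.array(rows, dtype=int)

# Example: the number of rows (sources) is not fixed in advance.
rng = np.random.default_rng(0)
Z = sample_ibp(alpha=2.0, n_frames=50, rng=rng)
print(Z.shape)
```

A larger α tends to produce more sources, which is why placing a gamma prior on α lets the data influence how many sources emerge.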
ISFA in frequency domain
Since the convolutive mixture is converted into complex spectra by using the STFT, the elements of X, S, A, and E become complex-valued variables. FD-ISFA is a model for the complex values that arise in frequency-domain processing. It can deal with an instantaneous mixture of complex spectra.
Conjugacy is one of the helpful properties in Bayesian inference. If we choose a conjugate prior, the posterior can be given in closed form. The variances of the noise and the mixing matrix have conjugate inverse gamma priors, and a Gaussian conjugate prior can be used for the mixing matrix A. For simplicity, the univariate complex normal distribution is introduced as a conjugate prior of the source signal S. Note that a super-Gaussian prior, such as the Student-t or Laplace distribution, should be used for speech signals; however, the complex extension of these distributions is non-trivial. We do not deal with complex super-Gaussian priors in this article; this is left for future study.
The processing flow of FD-ISFA is as follows. After STFT, the complex spectra are whitened in each frequency bin, and FD-ISFA is applied for each frequency bin of these complex spectra independently. FD-ISFA is plagued by two well-known ambiguities of frequency-domain BSS: the scaling ambiguity and permutation ambiguity. The scaling ambiguity is that the amplitude of the output signals may not equal that of the original sources. Some post-processing methods are needed to resolve these two ambiguities. The projection back method is an effective solution for the scaling ambiguity. The permutation ambiguity is solved by using the methods mentioned above [15, 16]. After these problems have been solved, estimated complex spectra are assembled into source signals by using inverse STFT.
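The whitening step of this flow can be sketched as follows. The article does not state which whitening transform is used, so the eigendecomposition-based (PCA) whitening below is an assumption, as are the function names `whiten_bin` and `whiten_all_bins` and the array layout (frequency bins × microphones × time frames). The sketch also assumes that the covariance in each bin has full rank.

```python
import numpy as np

def whiten_bin(X_f):
    """Whiten one frequency bin. X_f: complex array of shape (D, T)."""
    X_c = X_f - X_f.mean(axis=1, keepdims=True)   # remove the mean of each channel
    cov = X_c @ X_c.conj().T / X_c.shape[1]       # D x D Hermitian sample covariance
    eigval, eigvec = np.linalg.eigh(cov)          # cov = E diag(eigval) E^H
    W = np.diag(1.0 / np.sqrt(eigval)) @ eigvec.conj().T
    return W @ X_c, W                             # whitened spectra and whitening matrix

def whiten_all_bins(X):
    """X: complex spectra of shape (F, D, T); each bin is whitened independently."""
    return np.stack([whiten_bin(X[f])[0] for f in range(X.shape[0])])
```

Because each bin is whitened and separated on its own, nothing in this step ties the output order of one bin to that of another, which is where the permutation ambiguity enters.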
New method: PF-ISFA
Our new method, permutation-free ISFA (PF-ISFA), achieves both BSS and SAD without being affected by the permutation problem. Its key idea for avoiding the permutation problem is unified activity for each time frame. Conventional ISFA is applied independently to each frequency bin. That is to say, conventional ISFA does not consider any relations across frequency bins. This is the main reason for the permutation problem. By contrast, in the PF-ISFA model, all frequency bins are unified by the activity matrix. Since this unified activity controls the output order of source signals, PF-ISFA is not affected by the permutation problem.
PF-ISFA estimates the source signals S, their time-frequency activities Z, the mixing matrix A, unified activities B, activation probabilities Ψ, and other parameters by using only the observed signal X.
The main differences between the PF-ISFA model and the conventional ISFA model are the unified activity matrix for each time frame, B, and the activation probability matrix for each frequency bin, Ψ. A graphical model of conventional ISFA is shown in Figure 2. Whereas each frequency bin is independently estimated in the conventional ISFA model, all frequency bins are bundled together by the unified activity matrix in the PF-ISFA model.
Here, all data points are assumed to be independent and identically distributed. The smaller the sum of the noise terms is, the higher the likelihood of PF-ISFA is.
1. Initialize parameters using their priors.
2. At each time $t$, carry out the following:
3. For each source $k$ and frequency bin $f$, sample the activation probability $\psi_{kf}$ from Equation (28).
4. For each source $k$ and frequency bin $f$, sample the mixing matrix $a_{kf}$ from Equation (29).
5. If there is a source that is always inactive, remove it.
6. Update the noise variance, the mixing matrix variance, and $\alpha$ from Equations (30), (31), and (32), respectively.
7. Go to 2.
This method is based on the Metropolis-Hastings algorithm. The posterior distributions of the latent variables are derived from Bayes’ theorem by multiplying the priors by the likelihood function.
Here, $s_{-fkt}$ means $s_{ft}$ except for $s_{fkt}$, and $\varepsilon_{-fkt}$ means .
Source activity of each time-frequency frame
is the probability of prior.
Unified activity for each time frame
Here, $X_t$ is $x_{1t}, \ldots, x_{Ft}$, and $S_{-kt}$ and $Z_{-kt}$ are $S$ and $Z$ except for $s_{1kt}, \ldots, s_{Fkt}$ and $z_{1kt}, \ldots, z_{Fkt}$, respectively.
where . This is derived from the priors of source activity based on IBP .
To decide whether or not $b_{kt}$ is active, we sample $u$ from Uniform(0, 1) and compare it with $r / (1 + r)$. If $u \le r / (1 + r)$, then $b_{kt}$ becomes active; otherwise, it remains dormant.
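The decision rule above amounts to drawing a Bernoulli variable whose success probability is r / (1 + r). A minimal sketch follows; the function names are hypothetical, and the second variant, which takes log r to avoid overflow when the ratio is very large or very small, is an added convenience that the article does not describe.

```python
import numpy as np

def sample_activity(r, rng):
    """Return 1 (active) with probability r / (1 + r), otherwise 0 (dormant)."""
    p_active = r / (1.0 + r)
    u = rng.uniform(0.0, 1.0)
    return int(u <= p_active)

def sample_activity_from_log_ratio(log_r, rng):
    """Same rule, given log r; r / (1 + r) is computed as 1 / (1 + exp(-log r))."""
    p_active = np.exp(-np.logaddexp(0.0, -log_r))
    return int(rng.uniform(0.0, 1.0) <= p_active)

# Example: a ratio of 3 means the source is active with probability 0.75.
rng = np.random.default_rng(0)
draws = [sample_activity(3.0, rng) for _ in range(10)]
```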
Number of new sources
Some source signals that were not active at the beginning are active at time $t$ for the first time. Let $\kappa_t$ be the number of these sources. This $\kappa_t$ is sampled with the Metropolis-Hastings algorithm.
Here, the additional part of $A_f$ is a $D \times \kappa_t$ matrix. When $\kappa_t$ new sources appear, the mixing matrix should be expanded from $D \times K$ to $D \times (K + \kappa_t)$. The additional part is the mixing matrix for these new sources.
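The expansion of the mixing matrix can be sketched as below. Drawing the new columns from a zero-mean circular complex Gaussian with variance `sigma_a2` is an assumption made for illustration; the distribution actually used for the new columns in the article's Metropolis-Hastings step may differ. The function name and arguments are hypothetical.

```python
import numpy as np

def expand_mixing_matrix(A_f, kappa_t, sigma_a2, rng):
    """Append kappa_t new columns to A_f, giving a D x (K + kappa_t) matrix."""
    D = A_f.shape[0]
    # Split the variance equally between the real and imaginary parts.
    scale = np.sqrt(sigma_a2 / 2.0)
    A_new = scale * (rng.standard_normal((D, kappa_t))
                     + 1j * rng.standard_normal((D, kappa_t)))
    return np.hstack([A_f, A_new])

# Example: D = 2 microphones, K = 2 existing sources, 1 new source.
rng = np.random.default_rng(0)
A_f = np.eye(2, dtype=complex)
A_expanded = expand_mixing_matrix(A_f, kappa_t=1, sigma_a2=1.0, rng=rng)
```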
Activation probability for each frequency bin
where is the number of active time-frequency frames of source k in the f th frequency bin, and is the number of active time frames of source k.
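A minimal sketch of how these two counts could drive the sampling of the activation probabilities. It assumes a symmetric Beta(β, β) prior on each activation probability and a Bernoulli model for the time-frequency activities within active time frames, which gives a Beta posterior. Both assumptions, the function name, and the array layouts are illustrative and may differ from Equation (28).

```python
import numpy as np

def sample_activation_probability(Z, B, beta, rng):
    """Sample psi[f, k] for every frequency bin f and source k.

    Z: binary array (F, K, T) of time-frequency activities.
    B: binary array (K, T) of unified activities.
    """
    # Active time-frequency frames of source k in bin f (counted within active frames).
    m = (Z * B[None, :, :]).sum(axis=2)    # shape (F, K)
    # Active time frames of source k.
    n = B.sum(axis=1)                      # shape (K,)
    # Assumed Beta posterior: Beta(beta + m, beta + n - m).
    return rng.beta(beta + m, beta + n[None, :] - m)

# Example with F = 3 bins, K = 2 sources, T = 100 frames.
rng = np.random.default_rng(0)
B = rng.integers(0, 2, size=(2, 100))
Z = rng.integers(0, 2, size=(3, 2, 100)) * B[None, :, :]
psi = sample_activation_probability(Z, B, beta=0.5, rng=rng)
```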
Variance of noise and mixing matrix
Concentration parameter of IBP
where K+ is the number of active sources, and $H_n$ is the $n$th harmonic number.
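A minimal sketch of this update, assuming the standard conjugate form for the IBP concentration parameter: with a Gamma(p_α, q_α) prior in the shape/rate parameterization, the posterior is Gamma(p_α + K+, q_α + H_N), where N is taken to be the number of time frames. The parameterization, the choice of N, and the function names are assumptions and should be checked against Equation (32).

```python
import numpy as np

def harmonic_number(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return float(np.sum(1.0 / np.arange(1, n + 1)))

def sample_alpha(k_active, n_frames, p_alpha, q_alpha, rng):
    """Draw alpha from the assumed Gamma posterior (shape/rate parameterization)."""
    shape = p_alpha + k_active
    rate = q_alpha + harmonic_number(n_frames)
    # NumPy's gamma sampler takes a scale parameter, i.e. 1 / rate.
    return rng.gamma(shape, 1.0 / rate)

# Example with the hyperparameters (p_alpha, q_alpha) = (3.2, 0.21).
rng = np.random.default_rng(0)
alpha = sample_alpha(k_active=2, n_frames=200, p_alpha=3.2, q_alpha=0.21, rng=rng)
```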
In this section, we evaluate the separation performance and the accuracy of the source activity. Section 5.1 presents the results of separation performance and SAD performance compared with FD-ISFA. Section 5.2 shows the separation results compared with PF-ICA using two or four microphones (D = 2, 4) and various source locations.
Compared with FD-ISFA
No. of sources K
STFT window length
STFT shift length
$(p_\varepsilon, q_\varepsilon) = (10000, 1.0)$
$(p_A, q_A) = (1.1, 0.1)$
$(p_\alpha, q_\alpha) = (3.2, 0.21)$
β = 0.5
The values of these hyperparameters were selected empirically. A small noise variance means a small noise term. Therefore, $p_\varepsilon$ and $q_\varepsilon$ are set to 10000 and 1.0 in order to obtain a small variance. In contrast, the variance of the mixing matrix should have a certain magnitude because it affects the amplitudes of the output signals. If it is too large, the power of the estimated signals becomes small, and these signals are then considered to be inactive.
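The effect of these settings can be checked numerically. The sketch below assumes the shape/scale parameterization InverseGamma(p, q), whose mean is q / (p − 1) for p > 1; the article does not state its parameterization here, so this is an assumption, and the function name is hypothetical.

```python
import numpy as np

def sample_inverse_gamma(shape, scale, size, rng):
    """If G ~ Gamma(shape, rate=scale), then 1 / G ~ InverseGamma(shape, scale)."""
    # NumPy's gamma sampler takes scale = 1 / rate.
    return 1.0 / rng.gamma(shape, 1.0 / scale, size=size)

rng = np.random.default_rng(0)
# Noise variance prior: tightly concentrated near q / (p - 1) = 1 / 9999.
noise_var = sample_inverse_gamma(10000.0, 1.0, size=5000, rng=rng)
# Mixing matrix variance prior: much broader and heavy-tailed.
mixing_var = sample_inverse_gamma(1.1, 0.1, size=5000, rng=rng)
print(noise_var.mean(), np.median(mixing_var))
```

Under this assumption, the prior with (10000, 1.0) pins the noise variance to a small value, while the prior with (1.1, 0.1) leaves the mixing matrix variance largely to be determined by the data.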
When FD-ISFA is used, the results, shown in Figure 9, contained many horizontal lines; however, there are fewer of these lines in Figure 10. These lines are the spectrogram of the other separated signal. This means that the output orders of the FD-ISFA results are not aligned for all frequency bins. However, there are no horizontal lines in the spectrogram of PF-ISFA (Figure 8). This shows that the output order is aligned; in other words, the permutation problem has been solved by using PF-ISFA.
The spectrogram shown in Figure 8 has a vivid time structure. This indicates that the constraint imposed by the unified activity is too strong and that the activation probability for each frequency bin becomes almost one. To mitigate this, we might introduce a hyperparameter that controls the activation probability according to the observed signals.
We also evaluated our method in terms of the signal-to-distortion ratio (SDR), the image-to-spatial distortion ratio (ISR), the source-to-interference ratio (SIR), and the source-to-artifacts ratio (SAR). SDR is an overall measure of the separation performance; ISR is a measure of the correctness of the inter-channel information; SIR is a measure of the suppression of the interference signals; and SAR is a measure of the naturalness of the separated signals.
Table: Separation result of Section 5.1 [dB], by reverberation condition (RT20 = 150 ms, 400 ms, 600 ms).
Our method (PF-ISFA) outperformed FD-ISFA with permutation solver for all criteria except for SAR under all conditions. In particular, it improved the SIR by 2.82 dB under the condition RT20 = 30 [ms], 0.91 dB under RT20 = 150 [ms], 0.41 dB under RT20 = 400 [ms], and 0.70 dB under RT20 = 600 [ms].
One reason for the poor performance of FD-ISFA is the cascade approach. The results show that FD-ISFA achieves better performance if the permutation problem is perfectly solved; therefore, the poor performance comes from the permutation solver. This indicates that the overall performance of the cascade approach is severely affected by the performance of the worst subprocess.
These results show that the performance in rooms with reverberation times of 150, 400, and 600 [ms] is worse than with RT20 = 30 [ms]. This is because the reverberation times of these rooms are longer than the STFT window length (64 [ms]). If the reverberation time is longer than the STFT window length, the reverberation affects multiple time frames, and this degrades the performance.
The result of PF-ISFA (Perm) differs from that of PF-ISFA (Non-Perm). If the source activity results are poor, the activities of the two separated signals become similar. In this case, the permutation ambiguity is likely to arise because the unified activity matrix becomes meaningless. In other words, PF-ISFA yields better results when the source signals have different activities.
Next, we evaluated the SAD accuracy of our method. The SAD result of PF-ISFA was taken from the estimated unified source activities, that is, the parameter $b_{kt}$ in Section 3.3. Since FD-ISFA estimates the sound activity for each frequency bin independently, we counted the number of active bins in each time frame and determined the source activity of each time frame by thresholding.
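The thresholding applied to the FD-ISFA output can be sketched as follows. The threshold value used in the experiments is not given in the text, so the value in the example is purely illustrative, and the function name is hypothetical.

```python
import numpy as np

def frame_activity_from_bins(Z_k, threshold):
    """Convert per-bin activities of one source into per-frame activity.

    Z_k: binary array (F, T); returns a binary array (T,) that is 1 where
    at least `threshold` frequency bins are active in that time frame.
    """
    active_bins = Z_k.sum(axis=0)               # active bins in each time frame
    return (active_bins >= threshold).astype(int)

# Example: 3 frequency bins, 4 time frames, hypothetical threshold of 2 bins.
Z_k = np.array([[1, 0, 1, 0],
                [1, 0, 0, 0],
                [1, 1, 0, 0]])
frame_activity = frame_activity_from_bins(Z_k, threshold=2)
```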
Average SAD performance: precision, recall, and F-measure
Our method achieved a higher precision and a lower recall than FD-ISFA, and the results show that PF-ISFA achieved robust SAD performance under reverberant conditions. This is because PF-ISFA estimates the source activities using a unified parameter for all frequency bins. PF-ISFA is less likely to determine that the time frame is active, even if some frequency bins have a certain power level.
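For reference, precision, recall, and F-measure over time frames can be computed as in the sketch below, using their standard definitions. The function name and the convention of returning 0.0 when a denominator is zero are assumptions for illustration.

```python
import numpy as np

def sad_scores(estimated, reference):
    """Frame-wise precision, recall, and F-measure for binary activity sequences."""
    est = np.asarray(estimated).astype(bool)
    ref = np.asarray(reference).astype(bool)
    tp = int(np.sum(est & ref))     # frames correctly marked active
    fp = int(np.sum(est & ~ref))    # frames wrongly marked active
    fn = int(np.sum(~est & ref))    # active frames that were missed
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall > 0:
        f_measure = 2.0 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure

# Example: a conservative detector marks few frames active, so it can have
# high precision and low recall at the same time.
precision, recall, f_measure = sad_scores([1, 0, 0, 0], [1, 1, 1, 0])
```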
Compared with PF-ICA
Table: Separation result of Section 5.2 [dB], by reverberation condition (RT20 = 150 ms, 400 ms, 600 ms).
Table 4 indicates that PF-ISFA achieves a better average SIR except under the condition RT20 = 150 [ms]. This means that PF-ISFA can suppress the interference signal better than PF-ICA. PF-ISFA and PF-ICA yield similar average SDR when D = 2, and the SDR score of PF-ISFA is lower than that of PF-ICA when D = 4. This is because these SDR scores are affected by the SAR scores. The output signals of PF-ICA are created by multiplying the observed signals by a separation matrix, so artificial noise is unlikely to emerge. In contrast, PF-ISFA estimates the source signals by sampling, and the PF-ISFA output is based on the single best sample among all samples created during estimation.
This article presented a joint estimation method of BSS and SAD in the frequency domain that also solves the permutation problem. It was designed by using a nonparametric Bayesian approach. Unified source activity was introduced to automatically align the permutations of the output order for all frequency bins.
Our method improves the average SIR by 2.82–0.41 dB compared with the baseline method based on FD-ISFA when separating convolutive mixtures in room environments with RT20 = 30–600 [ms]. It also outperforms FD-ISFA under reverberant conditions (RT20 = 150, 400, 600 ms). For SAD performance, our method outperforms the conventional method by 5.9–0.5% in F-measure under the condition RT20 = 30–600 [ms], respectively.
In the future, we will evaluate the separation performance of a mixture of signals from three or more talkers. We will attempt to develop a method that can separate mixtures with longer reverberations (i.e., longer than the STFT window length) robustly. Last but not least, the method should be sped up to achieve real-time processing so that it can be applied to robot applications.
This study was partially supported by KAKENHI and Honda Research Institute Japan Inc., Ltd.
- Rosenthal D, Okuno HG: Computational auditory scene analysis. USA: CRC press; 1998.
- Wang D, Brown G: Computational auditory scene analysis: principles, algorithms, and applications. USA: Wiley-IEEE press; 2006.
- Sohn J, Kim N, Sung W: A statistical model-based voice activity detection. IEEE Signal Process. Lett 1999, 6: 1-3.
- Ramírez J, Segura J, Benítez C, De La Torre A, Rubio A: Efficient voice activity detection algorithms using long-term speech information. Speech Commun 2004, 42(3):271-287. 10.1016/j.specom.2003.10.002
- Nagira K, Takahashi T, Ogata T, Okuno HG: Complex extension of infinite sparse factor analysis for blind speech separation. In Proc. of International Conference on Latent Variable Analysis and Signal Separation. Tel-Aviv; 2012:388-396.
- Pedersen MS, Larsen J, Kjems U, Parra LC: Convolutive blind source separation methods, Part I. In Springer Handbook of Speech Processing. Edited by: Benesty J, Sondhi MM, Huang Y. Springer Press; 2008:1065-1094.
- Nakadai K, Takahashi T, Okuno H, Nakajima H, Hasegawa Y, Tsujino H: Design and implementation of robot audition system “HARK” open source software for listening to three simultaneous speakers. Adv. Robot 2010, 24(5):739-761. 10.1163/016918610X493561
- Asano F, Ikeda S, Ogawa M, Asoh H, Kitawaki N: Combined approach of array processing and independent component analysis for blind separation of acoustic signals. IEEE Trans. Speech Audio Process 2003, 11(3):204-215. 10.1109/TSA.2003.809191
- Nakajima H, Nakadai K, Hasegawa Y, Tsujino H: Adaptive step-size parameter control for real-world blind source separation. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing. Las Vegas; 2008:149-152.
- Comon P: Independent component analysis, a new concept? Signal Process 1994, 36(3):287-314. 10.1016/0165-1684(94)90029-9
- Hyvärinen A: Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Netw 1999, 10(3):626-634. 10.1109/72.761722
- Cardoso J, Souloumiac A: Blind beamforming for non-Gaussian signals. IEE Proceedings F Radar and Signal Processing 1993, 140(6):362-370. 10.1049/ip-f-2.1993.0054
- Sawada H, Mukai R, Araki S, Makino S: Polar coordinate based nonlinear function for frequency-domain blind source separation. In Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing. Orlando; 2002:1001-1004.
- Knowles D, Ghahramani Z: Infinite sparse factor analysis and infinite independent components analysis. In Proc. of Independent Component Analysis and Signal Separation. London; 2007:381-388.
- Sawada H, Mukai R, Araki S, Makino S: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. Speech Audio Process 2004, 12(5):530-538. 10.1109/TSA.2004.832994
- Sawada H, Araki S, Makino S: Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS. In Proc. of IEEE International Symposium on Circuits and Systems. New Orleans; 2007:3247-3250.
- Lee I, Kim T, Lee T: Fast fixed-point independent vector analysis algorithms for convolutive blind source separation. Signal Process 2007, 87(8):1859-1871. 10.1016/j.sigpro.2007.01.010
- Hiroe A: Solution of permutation problem in frequency domain ICA, using multivariate probability density functions. In Proc. of International Conference on Independent Component Analysis and Blind Signal Separation. Charleston; 2006:601-608.
- Hirayama J, Maeda S, Ishii S: Markov and semi-Markov switching of source appearances for nonstationary independent component analysis. IEEE Trans. Neural Netw 2007, 18(5):1326-1342.
- Hsieh H, Chien J: Online Bayesian learning for dynamic source separation. In Proc. of IEEE International Conference on Acoustics Speech and Signal Processing. Dallas; 2010:1950-1953.
- Araki S, Sawada H, Makino S: Blind speech separation in a meeting situation with maximum SNR beamformers. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing. Honolulu; 2007:41-44.
- Griffiths T, Ghahramani Z: Infinite latent feature models and the Indian buffet process. Adv. Neural Inf. Process. Syst 2006, 18: 475-482.
- Murata N, Ikeda S, Ziehe A: An approach to blind source separation based on temporal structure of speech signals. Neurocomputing 2001, 41: 1-24. 10.1016/S0925-2312(00)00345-3
- Hastings W: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970, 57(1):97-109. 10.1093/biomet/57.1.97
- Meeds E, Ghahramani Z, Neal R, Roweis S: Modeling dyadic data with binary latent factors. Adv. Neural Inf. Process. Syst 2007, 19: 977-984.
- Vincent E, Sawada H, Bofill P, Makino S, Rosca J: First stereo audio source separation evaluation campaign: data, algorithms and results. In Proc. of Independent Component Analysis and Signal Separation. London; 2007:552-559.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.