Nonparametric Bayesian sparse factor analysis for frequency domain blind source separation without permutation ambiguity
EURASIP Journal on Audio, Speech, and Music Processing volume 2013, Article number: 4 (2013)
Abstract
Blind source separation (BSS) and sound activity detection (SAD) from a sound source mixture with minimum prior information are two major requirements for computational auditory scene analysis that recognizes auditory events in many environments. In daily environments, BSS suffers from many problems such as reverberation, a permutation problem in frequency-domain processing, and uncertainty about the number of sources in the observed mixture. While many conventional BSS methods resort to a cascaded combination of subprocesses, e.g., frequency-wise separation and permutation resolution, to overcome these problems, their outcomes may be affected by the worst subprocess. Our aim is to develop a unified framework to cope with these problems. Our method, called permutation-free infinite sparse factor analysis (PF-ISFA), is based on a nonparametric Bayesian framework that enables inference without a predetermined number of sources. It solves BSS, SAD, and the permutation problem at the same time. Our method has two key ideas: unified source activities across all the frequency bins and the activation probabilities of all the frequency bins of all the sources. Experiments were carried out to evaluate the separation performance and the SAD performance under four reverberant conditions. In terms of separation performance under the BSS_EVAL criteria, our method outperformed conventional complex ISFA under all conditions. In terms of SAD performance, our method outperformed the conventional method by 5.9 to 0.5 percentage points in F-measure as RT_{20} increased from 30 to 600 [ms].
Introduction
Computational auditory scene analysis (CASA) aims to find auditory events and extract valuable information from captured sound signals [1, 2]. An overview of a CASA system is depicted in Figure 1. First, the CASA system captures sound signals by using a microphone array. Then, it detects the sound activity of each source and separates the mixture into individual sources. Finally, it visualizes the auditory events or recognizes the separated sound sources. This article focuses on source activity detection (SAD) and sound source separation. SAD is useful for CASA systems because it helps these systems discover audio sources, especially when a huge amount of archived audio signals is analyzed. Another benefit of SAD is compatibility with automatic speech recognition: for accurate recognition, it is necessary to extract the voiced part, which is referred to as voice activity detection [3, 4]. Sound source separation is essential for CASA systems because we often observe a mixture of multiple sound sources in our daily environment. Our goal is to develop a system that performs sound activity detection and sound source separation simultaneously for CASA.
The combination of sound source separation and source activity detection should overcome the following difficulties for real-world applications:

1.
unknown mixing processes,

2.
source number uncertainty,

3.
reverberation, and

4.
performance degradation caused by mutually dependent functions.
The first difficulty indicates that the CASA system should work without information specific to a certain environment or situation, such as the environment's impulse responses or the sound source locations. The second expresses that the CASA system should achieve robust estimation even when the number of sources is unknown. The third means that the mixture of audio signals captured in a room may contain reverberations that affect microphone array processing. The last means that cascaded processing to cope with the above-mentioned difficulties may be severely affected by the worst subprocess of the CASA system. When source separation is performed in the frequency domain, the output signals are affected by the permutation problem, which is an ambiguity in the output order for different frequency bins. Conventional methods take the cascade approach: the mixed signals are separated for each frequency bin first, and then the permutation problem is solved. As mentioned above, the overall performance is limited by the performance of the worst subprocess.
Our solution for overcoming these difficulties is as follows. The mixing process is modeled stochastically and inferred on the basis of this model. To handle source number uncertainty, we introduce a nonparametric Bayesian approach. The reverberation is absorbed by using frequency-domain processing. Unified analysis of the source separation and permutation resolution is used to optimize these mutually dependent functions.
This article presents permutation-free infinite sparse factor analysis (PF-ISFA): a joint estimation method that simultaneously achieves frequency-domain source separation and SAD using a minimum amount of prior information. PF-ISFA achieves robust estimation without using prior information about the number of sources. PF-ISFA extends the frequency-domain ISFA [5], which is a nonparametric Bayesian frequency-domain source separation method. We build a generative process that explains the observed sound source mixture and derive a Bayesian inference to retrieve the respective sound sources and sound activities. The key idea of PF-ISFA is that all the frequency bins of the signals are processed at the same time to avoid the permutation problem. In particular, a unified source activity for all frequency bins is introduced into its generative model.
The rest of this article is organized as follows. Section 2 summarizes the main problem treated and introduces studies related to our method. Section 3 explains conventional ISFA in the time and frequency domains and then introduces our new method, PF-ISFA. Section 4 gives the detailed posterior inferences of PF-ISFA. Section 5 presents experimental results, and Section 6 concludes this article.
Problem statement and related study
This section starts by summarizing the problem that is solved in this article and the assumptions needed to solve it. After that, the studies related to this problem, especially concerning source separation, the permutation problem, and sound detection methods, are introduced.
Problem statement
The problem statement is briefly summarized below.
Input:

Sound mixtures of K sources captured by D microphones.
Output:

Estimated K source signals,

Detected source activities of source signals.
Assumptions:

1.
The number of sources K is not more than the number of microphones D.

2.
The locations of the sources do not change.
The sound activity represents whether or not sound is active in each time frame. This sound activity estimation enables sound detection. The system estimates the source activities of the K source signals and separates the D mixed signals captured by the microphones into K sources without prior information, such as source locations, microphone locations, and impulse responses between the sound sources and microphones. The first assumption means that this system deals with a determined or overdetermined problem. The second assumption means that the mixing process from the sources to the microphones is unchanged.
Requirements
This system should fulfill some requirements in order to work in daily environments. These requirements are summarized as follows.

1.
Blind source separation,

2.
Frequency domain processing,

3.
Permutation resolution,

4.
Robust estimation without source knowledge, and

5.
Unified approach.
These requirements are described in detail below.
Blind source separation
One of the system’s major requirements is to work with the minimum amount of prior information. This is because getting prior information, such as the direction of arrival of sound or the reverberation level of the room, in advance is a troublesome task for the system. In addition, even if the prior information can be obtained, the separation performance is severely affected by the quality of the information. The system should not be dependent on such information. The source separation method that uses the minimum prior information is called blind source separation (BSS).
Frequency domain processing
There are two reasons why frequency domain processing is inevitable for CASA. One is to deal with reverberation and the other is to model source signals using the sparseness of sound energy.
The mixing process of speech signals in our daily surroundings is modeled as a convoluted mixture [6]. The signals captured by the microphones consist of a mixture of signals from various sources, contaminated by reflections, reverberations, and arrival time lags at the microphones. To model these time-delayed signals, a convoluted mixture is often used.
Attempts to solve a BSS problem involving convoluted mixtures of signals mainly use frequency domain processing. This is because a convoluted mixture in the time domain can be expressed in a simple form in the frequency domain. Specifically, the short-time Fourier transform (STFT) converts a convoluted mixture in the time domain into instantaneous mixtures for all frequency bins. In other words, STFT can absorb the reverberation of the source signals within the window length. Thus, frequency domain processing is effective when BSS is applied to audio signals in practical situations.
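As an illustration of this point, the following sketch (with an artificial source and impulse response; all names and sizes are our own choices) checks how well one STFT frame of a convolved signal is approximated by a per-bin product H[f]·S[f], and how the approximation improves as the window grows relative to the impulse response:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy source and a short synthetic "room impulse response" (RIR).
n = 8192
s = rng.standard_normal(n)
h = rng.standard_normal(32) * np.exp(-np.arange(32) / 8.0)
x = np.convolve(s, h)[:n]          # time-domain convolutive observation

def mtf_error(win):
    """Relative error of the per-bin approximation X[f] ~= H[f] * S[f]
    for one rectangular analysis frame taken from inside the signal."""
    frame = slice(win, 2 * win)
    X = np.fft.rfft(x[frame])      # spectrum of the convolved frame
    S = np.fft.rfft(s[frame])      # spectrum of the source frame
    H = np.fft.rfft(h, win)        # RIR transfer function, zero-padded
    return np.linalg.norm(X - H * S) / np.linalg.norm(X)

# The longer the window relative to the RIR, the better the
# instantaneous-mixture approximation in each frequency bin.
print(mtf_error(256), mtf_error(2048))
```

The residual comes from frame-boundary effects and shrinks as the window length grows relative to the impulse response, which is why the STFT window should cover the reverberation.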
Permutation resolution
As mentioned above, the convoluted mixture in the time domain is converted into instantaneous mixtures for individual frequency bins. Many frequency-domain BSS methods independently separate the mixed signals for all the frequency bins; thus, an ambiguity arises in the output order. The system must arrange the separated signals in the correct order for the frequency bins. This is called the “permutation problem”. The permutation problem should be solved in order to achieve frequency-domain BSS.
Robust estimation without source knowledge
Many CASA systems and many source separation methods use prior knowledge about source signals for robust estimation to improve the performance. For instance, HARK [7] localizes the sound sources before separation by using the number of sources. When independent component analysis (ICA), a well-known BSS method, is applied to the input signals, principal component analysis (PCA) is commonly used as preprocessing for ICA [8]. This is because the number of dimensions of ICA's input signals can be reduced. However, getting prior knowledge about sources is difficult for the system, so robust estimation without source knowledge is desirable. A nonparametric Bayesian framework is helpful for robust inference without knowing the number of sources.
Unified approach
A unified estimation method enables effective processing because it makes the most of the information available from the observed signals. Many source separation frameworks use a cascaded approach. For instance, HARK [7] localizes the sources first and then separates the observed signals into individual sources; conventional frequency-domain ICA separates the observations and then resolves the permutation problem. One of the critical weak points of these cascaded approaches is that the separation performance is limited by the performance of the worst subprocess.
Related study
Source separation method of speech signals
Source separation is being actively studied for signal processing. Some methods use the source and microphone locations. Delay-and-sum beamforming and null beamforming are methods that emphasize or suppress the signal from a specific direction. These methods can be implemented with low computational complexity. HARK uses geometric higher-order decorrelation-based source separation (GHDSS) [9]. GHDSS separates mixed signals by using higher-order decorrelation between the sound source signals and geometric constraints derived from the positional relationships among the microphones. The weak point of these methods is that they require the source and microphone locations. This prior information cannot easily be obtained in advance.
Many BSS methods have already been introduced. One well-known BSS method is ICA, which separates mixed signals on the basis of the statistical independence of different source signals. Many algorithms are used for ICA, such as minimization of mutual information [10], FastICA [11], and JADE [12]. For BSS of speech signals, frequency-domain ICA is commonly used [13]. While ICA does achieve BSS, it does not detect the activities of individual sources; moreover, frequency-domain ICA is plagued by the permutation problem.
ISFA [14] is a BSS method based on the nonparametric Bayesian approach. It achieves SAD and BSS simultaneously, but it is modeled in the time domain, so it is vulnerable to the reverberation that often appears in our daily surroundings.
Frequency-domain ISFA (FD-ISFA), which we proposed in our previous study [5], can handle a convoluted mixture that contains room reverberation. One problem for FD-ISFA is the permutation problem. Conventional FD-ISFA independently separates the signals for all the frequency bins, so it cannot avoid permutation ambiguity.
Permutation problem
Some methods solve the permutation problem by post-processing. One method is based on estimation of the direction of arrival and the inter-frequency correlation of the signal envelopes [15]; another uses the power ratio of the signals as a dominance measure [16].
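The envelope-correlation idea of [15] can be sketched as follows. This is a simplified greedy variant of our own making, not the cited authors' exact algorithm: for each frequency bin, choose the source ordering whose amplitude envelopes correlate best with a running reference built from the bins aligned so far.

```python
import numpy as np
from itertools import permutations

def align_permutations(Y):
    """Greedy permutation alignment across frequency bins.

    Y: array (F, K, T) of separated complex spectra, one K-tuple per bin.
    Returns a copy with each bin's sources reordered to maximize the
    correlation of its amplitude envelopes with a running centroid.
    """
    F, K, T = Y.shape
    Y = Y.copy()
    env = np.abs(Y)
    centroid = env[0].copy()              # reference envelopes from bin 0
    for f in range(1, F):
        best, best_score = None, -np.inf
        for perm in permutations(range(K)):
            score = sum(np.corrcoef(centroid[k], env[f, perm[k]])[0, 1]
                        for k in range(K))
            if score > best_score:
                best, best_score = perm, score
        Y[f] = Y[f, list(best)]           # apply the winning reordering
        env[f] = env[f, list(best)]
        centroid += env[f]                # accumulate aligned envelopes
    return Y
```

The exhaustive search over permutations is only practical for small K; for larger K a Hungarian-style assignment would be used instead.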
Other methods avoid this problem by using a unified criterion across all frequency bins. Independent vector analysis (IVA) [17] and permutation-free ICA [18] are BSS methods that avoid the permutation problem. These methods are based on ICA and cannot simultaneously achieve sound source detection.
BSS framework achieving SAD
Some BSS frameworks obtain SAD information simultaneously. Switching ICA [19] is a BSS method that can achieve SAD. Switching ICA employs a hidden Markov model (HMM) in its model to represent whether the source is active or not. The SAD information is obtained from the estimated hidden variables of the HMM. Nonstationary Bayesian ICA [20] achieves dynamic source separation by estimating the sources and the mixing matrices for each time frame on the basis of variational Bayesian inference. The SAD information is obtained from automatic relevance determination (ARD) parameters, which are the precision parameters of the probabilistic density of the mixing matrix. Since these methods are time-domain approaches, they are not appropriate for speech separation of convoluted mixtures.
The combination of a maximum signal-to-noise-ratio beamformer, a voice activity detector, and online clustering achieves BSS and SAD [21]. This method is a cascade approach: it performs SAD and time-difference-of-arrival estimation first and then separates the signals using them. As mentioned above, the weak point of a cascaded approach is that the separation performance is limited by the performance of the worst subprocess.
ISFA
This section first summarizes conventional methods for ISFA: Section 3.1 shows the model of ISFA in the time domain [14], and Section 3.2 explains its expansion into the frequency domain (FD-ISFA) [5] and its problems. Then, Section 3.3 describes a model of FD-ISFA without permutation ambiguity (PF-ISFA).
ISFA in time domain
ISFA [14] achieves BSS of instantaneous mixtures of time-domain signals without knowing the number of sources. It is based on the following instantaneous mixture model, which expresses the D × T observed data X as a linear combination of the K × T source signals S:

\mathbf{X} = \mathbf{A}\left(\mathbf{Z} \odot \mathbf{S}\right) + \mathbf{E},

where A is a D × K mixing matrix, E is a D × T Gaussian noise term, and Z is a K × T binary mask on S. ⊙ denotes element-wise multiplication. Let x_{dt}, a_{dk}, z_{kt}, s_{kt}, and ε_{dt} be the elements of X, A, Z, S, and E, respectively. The generative model of ISFA is shown in Figure 2. {\sigma}_{A}^{2} and {\sigma}_{\epsilon}^{2} are the variance parameters of the elements of A and E.
The priors of these parameters are as follows:
Here, a_{ k } is the k-th column of A, and p_{ ε }, q_{ ε }, p_{ A }, q_{ A }, p_{ α }, and q_{ α } are the hyperparameters. IBP(α) denotes the Indian buffet process (IBP) [22] with concentration parameter α, a stochastic process that can deal with a potentially infinite number of signals. It is used in order to achieve separation without prior knowledge about the number of sources.
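For intuition, the IBP can be simulated with its “restaurant” construction (a standard textbook sketch, not code from this paper): each new time frame re-activates an existing source k with probability proportional to how often that source has been active so far, and then adds a Poisson(α/t) number of brand-new sources.

```python
import numpy as np

def sample_ibp(T, alpha, rng):
    """One draw of a binary matrix Z (rows = time frames, columns =
    sources) from the Indian buffet process with concentration alpha."""
    rows = []          # per-frame activity vectors
    counts = []        # counts[k] = frames in which source k was active
    for t in range(1, T + 1):
        # Re-activate existing source k with probability counts[k] / t.
        row = [1 if rng.random() < c / t else 0 for c in counts]
        new = rng.poisson(alpha / t)     # brand-new sources for this frame
        row += [1] * new
        counts = [c + z for c, z in zip(counts, row[:len(counts)])] + [1] * new
        rows.append(row)
    K = len(counts)
    Z = np.zeros((T, K), dtype=int)
    for t, row in enumerate(rows):
        Z[t, :len(row)] = row            # later sources are 0 in earlier rows
    return Z
```

The expected number of sources after T frames is α·H_T (the T-th harmonic number), so the gamma prior placed on α effectively controls how many sources emerge.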
In the time domain, each element of X, A, S, and E is a real-valued variable. Each of these variables has a normal distribution as a prior. \mathcal{N}(\mu ,{\sigma}^{2}) is a normal distribution with mean μ and variance σ^{2}. The probability density function of this normal distribution is
The IBP concentration parameter has a gamma prior, and the variance parameters of A and E have inverse gamma priors. \mathcal{G}(b,\theta ) and \mathcal{I}\mathcal{G}(b,\theta ) are the gamma distribution and the inverse gamma distribution with shape parameter b and scale parameter θ, respectively. The probability density functions of these distributions are
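The density formulas themselves appear to have been lost in extraction; their standard forms, written with the shape b and scale θ convention used above, are:

```latex
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
\mathcal{G}(x \mid b, \theta) = \frac{x^{b-1}\, e^{-x/\theta}}{\Gamma(b)\,\theta^{b}},
\qquad
\mathcal{IG}(x \mid b, \theta) = \frac{\theta^{b}}{\Gamma(b)}\, x^{-b-1}\, e^{-\theta/x}.
```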
A Bayesian hierarchical model aims at explaining the uncertainty in the model from the observed data by treating latent variables as probabilistic variables rather than fixed values. In our model, we place a gamma prior on the concentration parameter of the IBP so that the emergence of sources in Z can be controlled by the data we have.
ISFA in frequency domain
Since the convoluted mixture is converted into complex spectra by using STFT, the elements of X, S, A, and E become complex-valued variables. FD-ISFA is a model for the complex values that arise in frequency-domain processing. It can deal with an instantaneous mixture of complex spectra.
The generative model is the same as for time-domain ISFA. However, the priors of these complex-valued elements are different from those of time-domain ISFA.
Here, instead of the normal distribution, a univariate complex normal distribution {\mathcal{N}}_{C} is used for complex-valued parameters. The probability density function of this distribution is
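The equation appears to have been lost in extraction; the standard circularly-symmetric univariate complex normal density is:

```latex
\mathcal{N}_{C}(x \mid \mu, \sigma^2) = \frac{1}{\pi\sigma^2}
  \exp\!\left(-\frac{|x-\mu|^2}{\sigma^2}\right).
```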
Conjugacy is one of the helpful properties in Bayesian inference. If we choose a conjugate prior, a closed-form expression can be given for the posterior. The variances {\sigma}_{\mathit{\epsilon}}^{2} and {\sigma}_{\mathbf{A}}^{2} have conjugate inverse gamma priors, and the Gaussian conjugate prior can be used for the mixing matrix A. For simplicity, the univariate complex normal distribution is introduced as a conjugate prior of the source signal S. Note that a super-Gaussian prior, such as a Student's t or Laplace distribution, should ideally be used for speech signals; however, the complex extension of these distributions is non-trivial. We do not deal with complex super-Gaussian priors in this article and leave them for future study.
The processing flow of FD-ISFA is as follows. After STFT, the complex spectra are whitened in each frequency bin, and FD-ISFA is applied to each frequency bin of these complex spectra independently. FD-ISFA is plagued by two well-known ambiguities of frequency domain BSS: the scaling ambiguity and the permutation ambiguity. The scaling ambiguity means that the amplitude of the output signals may not equal that of the original sources. Some post-processing methods are needed to resolve these two ambiguities. The projection back method [23] is an effective solution for the scaling ambiguity. The permutation ambiguity is solved by using the methods mentioned above [15, 16]. After these problems have been solved, the estimated complex spectra are assembled into source signals by using inverse STFT.
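The projection back step can be sketched as follows (a minimal least-squares version for one frequency bin; the variable names are ours): each separated spectrum is rescaled so that its image best matches the observation at a reference microphone.

```python
import numpy as np

def projection_back(Y, x_ref):
    """Resolve the scaling ambiguity of frequency-domain BSS.

    Y:     (K, T) separated complex spectra in one frequency bin.
    x_ref: (T,)   observed spectrum at a reference microphone, same bin.
    Returns Y rescaled so that each row is the least-squares image of
    that source at the reference microphone.
    """
    # Per-source complex scale: argmin_c || x_ref - c * y_k ||
    c = (Y.conj() @ x_ref) / np.einsum('kt,kt->k', Y.conj(), Y).real
    return c[:, None] * Y
```

When the separated sources are (nearly) orthogonal over the frame, each recovered scale is exactly the source's contribution to the reference microphone.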
New method: PF-ISFA
Our new method, permutation-free ISFA (PF-ISFA), achieves both BSS and SAD without being affected by the permutation problem. Its key idea for avoiding the permutation problem is a unified activity for each time frame. Conventional ISFA is applied independently to each frequency bin; that is to say, it does not consider any relations across frequency bins. This is the main reason for the permutation problem. By contrast, in the PF-ISFA model, all frequency bins are unified by the activity matrix. Since this unified activity controls the output order of the source signals, PF-ISFA is not affected by the permutation problem.
The flow of PF-ISFA is depicted in Figure 3, and the generative process of PF-ISFA is described in Figure 4. Let F be the number of frequency bins. PF-ISFA is also based on an instantaneous mixture for each frequency bin, but it deals with all F frequency bins at the same time. The elements of X, A, Z, S, and E are defined as x_{fdt}, a_{fdk}, z_{fkt}, s_{fkt}, and ε_{fdt}, respectively.
The following model is introduced to unify the activities of all frequency bins.
where Bernoulli(x) is the Bernoulli distribution with parameter x. b_{kt} is the unified source activity of source k at time t, and ψ_{kf} is the probability of source k becoming active (the activation probability) in the f-th frequency bin. B represents the K × T matrix of b_{kt}, and Ψ the K × F matrix of ψ_{kf}. Let β be the hyperparameter. The prior distributions of the newly introduced variables are assumed to be as follows:
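The gating structure of this model can be sketched as a toy forward simulation (with stand-in priors of our own choosing: Bernoulli(0.5) rows for B instead of the IBP, and a flat Beta(1, 1) for Ψ):

```python
import numpy as np

rng = np.random.default_rng(0)
K, F, T = 3, 16, 100     # illustrative sizes

# Unified per-frame activity B (K x T) -- Bernoulli(0.5) rows here stand
# in for the IBP prior used in the paper.
B = rng.integers(0, 2, size=(K, T))

# Per-bin activation probabilities Psi (K x F); flat Beta(1, 1) stand-in.
Psi = rng.beta(1.0, 1.0, size=(K, F))

# Time-frequency mask Z: source k can be active in bin f at frame t
# only when its unified activity b_kt is on.
Z = (rng.random((K, F, T)) < Psi[:, :, None]) * B[:, None, :]
```

The key property is in the last line: a time-frequency activity z_{kft} can be on only when the unified activity b_{kt} is on, which is what ties all frequency bins of a source together and removes the permutation ambiguity.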
PF-ISFA estimates the source signals S, their time-frequency activities Z, the mixing matrix A, the unified activities B, the activation probabilities Ψ, and other parameters by using only the observed signal X.
One of the main differences between the PF-ISFA model and the conventional ISFA model is the unified activity matrix for each time frame, B, and the activation probability matrix for each frequency bin, Ψ. A graphical model of conventional ISFA is shown in Figure 2. Whereas each frequency bin is independently estimated in the conventional ISFA model, all frequency bins are bundled together by the unified activity matrix in the PF-ISFA model.
The likelihood function of PF-ISFA is written as follows.
where
Here, all data points are assumed to be independent and identically distributed. The smaller the sum of the noise terms, the higher the likelihood of PF-ISFA.
Inference of PFISFA
The model parameters of PF-ISFA are estimated by using an iterative algorithm based on the nonparametric Bayesian model. Sound source separation and SAD are achieved by estimating s_{kft} and b_{kt}, respectively. The parameter update algorithm is given as follows.

1. Initialize the parameters using their priors.
2. At each time t, carry out the following:
   2-1. For each source k, sample b_{kt} from Equation (26).
   2-2. If b_{kt} = 1, sample z_{kft} from Equation (20) for each frequency bin f; otherwise z_{kft} = 0.
   2-3. If z_{kft} = 1, sample s_{kft} from Equation (18); otherwise s_{kft} = 0.
   2-4. Determine the number of new sources κ_{ t }, and initialize their parameters.
3. For each source k and frequency bin f, sample the activation probability ψ_{kf} from Equation (28).
4. For each source k and frequency bin f, sample the mixing matrix a_{kf} from Equation (29).
5. If there is a source that is always inactive, remove it.
6. Update {\sigma}_{\mathit{\epsilon}}^{2}, {\sigma}_{\mathbf{A}}^{2}, and α from Equations (30), (31), and (32), respectively.
7. Go to 2.
This method is based on the Metropolis-Hastings algorithm [24]. The posterior distributions of the latent variables are derived from Bayes' theorem by multiplying the priors by the likelihood function.
Sound sources
When z_{fkt} is active, s_{fkt} is sampled by using the following posterior.
where
Here, s_{f,−k,t} denotes s_{ft} excluding s_{fkt}, and ε_{fkt} denotes the noise term ε evaluated with z_{fkt} = 0.
Source activity of each time-frequency frame
If b_{kt} = 1, z_{fkt} is sampled from its posterior distribution. The posterior of z_{fkt} is calculated as follows.
where
is the likelihood, and
is the prior probability.
Then, the following posterior distribution is derived.
where
Unified activity for each time frame
To calculate the ratio of the probability that b_{kt} becomes active to the probability that b_{kt} becomes inactive, we use Equation (23). This ratio r is divided into two parts: the ratio of the priors r_{ p } and the ratio of the likelihoods of the f-th frequency bin r_{l,f}.
where
Here, X_{ t } is x_{1t}, …, x_{Ft} and S_{kt} and Z_{kt} are S and Z except for s_{1kt}, …, s_{Fkt} and z_{1kt}, …, z_{Fkt}, respectively.
The ratio of prior r_{ p } is calculated by using:
where {m}_{k,-t}=\sum _{{t}^{\prime}\ne t}{b}_{k{t}^{\prime}}. This is derived from the prior of the source activity based on the IBP [22].
The ratio of likelihood r_{l,f} is calculated by using Equation (25).
The posterior probability of b_{kt} = 1 is calculated using the ratio r.
To decide whether or not b_{kt} is active, we sample u from Uniform(0,1) and compare it with r / (1 + r). If u ≤ r / (1 + r), then b_{kt} becomes active; otherwise, it remains dormant.
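This accept step can be sketched as follows (a minimal version of our own; the exact forms of the prior and likelihood ratios are those of Equations (24) and (25), and the leave-one-out odds m/(T − m) shown here is one common form of the IBP conditional, stated as an assumption):

```python
import numpy as np

def prior_odds(m_minus_t, T):
    """IBP leave-one-out prior odds of b_kt = 1: the source was active in
    m_minus_t of the other T - 1 frames (assumed form)."""
    return m_minus_t / (T - m_minus_t)

def sample_activity(r_p, r_l, rng):
    """Sample the binary activity b_kt given the prior odds r_p and the
    per-frequency-bin likelihood ratios r_l:
        r = r_p * prod_f r_l[f],  P(b_kt = 1) = r / (1 + r)."""
    r = r_p * np.prod(r_l)
    return 1 if rng.random() <= r / (1.0 + r) else 0
```

A source that was never active elsewhere (m = 0) has zero prior odds, so it can only be (re)born through the new-source step described next.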
Number of new sources
Some sources that were not active before become active at time t for the first time. Let κ_{ t } be the number of these sources. κ_{ t } is sampled with the Metropolis-Hastings algorithm.
First, the prior distribution of κ_{ t } is P({\kappa}_{t} \mid \alpha )=\text{Poisson}\left(\frac{\alpha}{T}\right). After sampling κ_{ t }, we initialize the new sources and their activities. Next, we decide whether this update is acceptable or not. Let ξ and ξ^{∗} be the current state (i.e., the condition before the transition) and the next-state candidate (the condition after the transition), respectively. The acceptance probability of the transition is \text{min}(1,{r}_{\xi \to {\xi}^{\ast}}). According to Meeds [25] and Knowles [14], {r}_{\xi \to {\xi}^{\ast}} is the ratio of the likelihood of the next state to that of the current state. This ratio can be calculated as follows.
where
Here, {\mathbf{A}}_{f}^{\ast} is the D × κ_{ t } matrix of the additional part of A_{ f }. When κ_{ t } new sources appear, the mixing matrix is expanded from D × K to D × (K + κ_{ t }); {\mathbf{A}}_{f}^{\ast} is the mixing matrix for these new sources.
Activation probability for each frequency bin
ψ_{kf} is sampled from the following posterior.
where {n}_{\text{kf}}=\sum _{t=1}^{T}{z}_{\text{kft}} is the number of active timefrequency frames of source k in the f th frequency bin, and {m}_{k}=\sum _{t=1}^{T}{b}_{\text{kt}} is the number of active time frames of source k.
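Assuming a Beta(β, 1) prior built from the hyperparameter β introduced above (the exact prior is not reproduced here, so this shape is an assumption), the conjugate posterior draw would take this form:

```python
import numpy as np

def sample_psi(n_kf, m_k, beta, rng):
    """Conjugate Beta posterior draw for the activation probability of
    source k in bin f: n_kf active TF frames out of m_k active frames,
    under an assumed Beta(beta, 1) prior."""
    return rng.beta(beta + n_kf, 1.0 + m_k - n_kf)
```

The posterior mean approaches the empirical rate n_kf / m_k as the counts grow, so frequently active bins get activation probabilities near one.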
Mixing matrix
The mixing matrix is estimated column by column. The posterior distribution is
where
Variance of noise and mixing matrix
The variance of noise corresponds to the noise level of the estimated signals, and the variance of the mixing matrix affects the scale of the estimated signals. Their posteriors are as follows.
Concentration parameter of IBP
The posterior distribution of concentration parameter α is
where K_{+} is the number of active sources, and {H}_{n}=\sum _{j=1}^{n}\frac{1}{j} is the n-th harmonic number.
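Assuming the standard conjugate update for the IBP concentration parameter under a Gamma(p_α, q_α) shape/scale prior (a textbook form stated as an assumption, not copied from Equation (32)), the draw can be sketched as:

```python
import numpy as np

def harmonic(n):
    """H_n = sum_{j=1}^{n} 1/j, the n-th harmonic number."""
    return sum(1.0 / j for j in range(1, n + 1))

def sample_alpha(K_plus, T, p_alpha, q_alpha, rng):
    """Conjugate Gamma posterior draw for the IBP concentration alpha:
    the shape grows with the number of active sources K_plus and the
    rate grows with the harmonic number H_T (assumed standard update)."""
    shape = p_alpha + K_plus
    scale = q_alpha / (1.0 + q_alpha * harmonic(T))
    return rng.gamma(shape, scale)
```

Intuitively, many active sources pull α up, while many observed frames (through H_T) pull it down.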
Experimental results
In this section, we evaluate the separation performance and the accuracy of the source activity detection. Section 5.1 presents the separation and SAD performance compared with FD-ISFA [5]. Section 5.2 shows the separation results compared with PF-ICA [18] using two or four microphones (D = 2, 4) and various source locations.
Compared with FD-ISFA
The experiments used simulated mixtures in four rooms with reverberation times of 20, 150, 400, and 600 [ms]. The simulated mixtures were generated by convolving the impulse responses measured in the rooms. These impulse responses were recorded by using the microphone array depicted in Figure 5. We used two microphones in these experiments (D = 2). The microphone and source locations are shown in Figure 6, and the experimental conditions are listed in Table 1. For each condition, 200 mixtures using JNAS phoneme-balanced sentences were tested.
The values of these hyperparameters were selected empirically. A small {\sigma}_{\epsilon}^{2} makes the noise term small; therefore, p_{ ε } and q_{ ε } were set to 10000 and 1.0 in order to obtain a small variance. In contrast, {\sigma}_{A}^{2} should have a certain magnitude because {\sigma}_{A}^{2} affects the amplitudes of the output signals. If {\sigma}_{A}^{2} is too large, the power of the estimated signals becomes small, and these signals are then considered to be inactive.
Separation performance
First, an example of the results obtained from the separation experiment using mixed signals (D = 2) in a room with a reverberation time of 20 [ms] is shown. Spectrograms of a source signal, a signal separated using PF-ISFA, a signal separated using conventional FD-ISFA, and a permutation-aligned signal separated using FD-ISFA are shown in Figures 7, 8, 9, and 10, respectively.
When FD-ISFA is used, the result, shown in Figure 9, contains many horizontal lines; there are fewer of these lines in Figure 10. These lines belong to the spectrogram of the other separated signal, which means that the output orders of the FD-ISFA results are not aligned across all frequency bins. In contrast, there are no horizontal lines in the spectrogram of PF-ISFA (Figure 8). This shows that the output order is aligned; in other words, the permutation problem has been solved by using PF-ISFA.
The spectrogram shown in Figure 8 has a vivid temporal structure. This indicates that the constraint of the unified activity is too strong and the activation probability for each frequency bin becomes almost one. To mitigate this phenomenon, we might introduce a hyperparameter that can control the activation probability appropriately for the observed signals.
We also evaluated our method in terms of the signal-to-distortion ratio (SDR), the image-to-spatial distortion ratio (ISR), the source-to-interference ratio (SIR), and the source-to-artifacts ratio (SAR) [26]. SDR is an overall measure of the separation performance; ISR is a measure of the correctness of the inter-channel information; SIR is a measure of the suppression of the interference signals; and SAR is a measure of the naturalness of the separated signals.
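A stripped-down SDR computation looks like this (only the overall projection-based ratio; the BSS_EVAL toolkit additionally decomposes the residual into interference, spatial, and artifact terms):

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio in dB: project the estimate
    onto the reference and compare target energy with residual energy."""
    s = np.asarray(reference, float)
    y = np.asarray(estimate, float)
    target = (y @ s) / (s @ s) * s       # orthogonal projection onto s
    noise = y - target                   # everything not explained by s
    return 10.0 * np.log10((target @ target) / (noise @ noise))
```

Because the target is obtained by projection, the measure is invariant to the overall scale of the estimate, which is why it remains meaningful despite the scaling ambiguity of frequency-domain BSS.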
The results are summarized in Table 2. Larger values mean better separation. “Non-Perm” was calculated from the output signals themselves; in other words, their permutations were not aligned. “Solver” means that the permutations were aligned using the inter-frequency correlation of the signal envelopes. “Perm” means that the permutations of the output signals were aligned using the correlation between the outputs and the original sources; in other words, the permutations were aligned by using the original source signals as a reference.
Our method (PF-ISFA) outperformed FD-ISFA with the permutation solver for all criteria except SAR under all conditions. In particular, it improved the SIR by 2.82 dB under the condition RT_{20} = 30 [ms], 0.91 dB under RT_{20} = 150 [ms], 0.41 dB under RT_{20} = 400 [ms], and 0.70 dB under RT_{20} = 600 [ms].
One of the reasons for the poor performance of FD-ISFA is its cascade approach. The results show that FD-ISFA achieves better performance if the permutation problem is perfectly solved; therefore, the poor performance comes from the permutation solver. This indicates that the overall performance of a cascade approach is severely affected by the performance of the worst subprocess.
These results show that the performance in the rooms with reverberation times of 150, 400, and 600 [ms] is worse than under the RT_{20} = 30 [ms] condition. This is because the reverberation times of these rooms are longer than the STFT window length (64 [ms]). If the reverberation time is longer than the STFT window length, the reverberation affects multiple time frames, and this degrades the performance.
The results of PF-ISFA (Perm) and PF-ISFA (Non-Perm) differ. If the source activity results are poor, the activities of the two separated signals become similar. In this case, permutation ambiguity is likely to arise because the unified activity matrix becomes meaningless. In other words, PF-ISFA yields better results when each source signal has a distinct activity.
SAD performance
Next, we evaluated our method in terms of SAD accuracy. The SAD result of PF-ISFA was obtained from the estimated unified source activities, that is, the parameter b_{ft} in Section 3.3. Since FD-ISFA estimates the sound activity of each frequency bin independently, we counted the number of active bins in each time frame and determined the source activity of each time frame by threshold processing.
The precision rate, recall rate, and F-measure of the source activity detection are listed in Table 3. PF-ISFA results are indicated in bold type. PF-ISFA outperformed FD-ISFA in precision rate and F-measure under all reverberant conditions. In particular, it improved the F-measure by 5.9 points, 0.9 points, 1.7 points, and 0.5 points under the conditions RT_{20} = 30 [ms], 150 [ms], 400 [ms], and 600 [ms], respectively.
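As a concrete sketch of this evaluation (hypothetical function names; the bin-count threshold mirrors the post-processing applied to the FD-ISFA output, and the metrics follow the standard definitions):

```python
import numpy as np

def frame_activity_from_bins(binwise, min_active_bins):
    """Reduce a (frequency x time) binary activity matrix to one decision
    per time frame: a frame is active when at least `min_active_bins` of
    its frequency bins are active (threshold processing)."""
    return np.sum(binwise, axis=0) >= min_active_bins

def precision_recall_f(estimated, reference):
    """Standard precision, recall, and F-measure for binary frame-level
    activity against a ground-truth labeling."""
    est = np.asarray(estimated, dtype=bool)
    ref = np.asarray(reference, dtype=bool)
    tp = np.sum(est & ref)                      # correctly detected active frames
    precision = tp / max(np.sum(est), 1)
    recall = tp / max(np.sum(ref), 1)
    f = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f

# 3 frequency bins x 3 time frames; at least 2 active bins -> frame active
binwise = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [0, 0, 1]])
activity = frame_activity_from_bins(binwise, 2)   # [True, False, True]
p, r, f = precision_recall_f([1, 0, 1], [1, 1, 1])
```

A method that rarely raises false alarms but misses some frames scores high precision and lower recall, which is exactly the behavior of PF-ISFA discussed below.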
Our method achieved a higher precision rate and a lower recall rate than FD-ISFA, and the results show that PF-ISFA achieved robust SAD performance under reverberant conditions. This is because PF-ISFA estimates the source activities using a parameter unified over all frequency bins; it is therefore less likely to judge a time frame active merely because some frequency bins have a certain power level.
Comparison with PF-ICA
In the second experiment, we used two or four microphones (D = 2, 4) to observe a mixture of two sound sources with intervals θ = 60, 120, and 180 [deg]. For each interval, 20 mixtures were tested using JNAS phoneme-balanced sentences. The microphone and source locations are shown in Figure 11. The red microphones were used when D = 2. To calculate the SDR, ISR, SIR, and SAR, the two signals that maximize the SDR score were chosen from the estimated signals when four microphones were used.
The average SDR and SIR of the separated signals are shown in Figures 12 and 13 for each interval when D = 2 and 4, respectively. Table 4 summarizes the average SDR, ISR, SIR, and SAR over all intervals.
Table 4 indicates that PF-ISFA achieves a better average SIR except under the condition RT_{20} = 150 [ms]. This means that PF-ISFA can suppress the interference signals better than PF-ICA. PF-ISFA and PF-ICA achieve similar average SDR when D = 2, and the SDR score of PF-ISFA is lower than that of PF-ICA when D = 4. This is because the SDR scores are affected by the SAR scores. The output signals of PF-ICA are created by multiplying the observed signals by the separation matrix, so artificial noise is unlikely to emerge. In contrast, PF-ISFA estimates the source signals by sampling, and its output is based on the single best sample among all the samples created during estimation.
Conclusion and future study
This article presented a method that jointly performs BSS and SAD in the frequency domain while also solving the permutation problem. It was designed using a nonparametric Bayesian approach. A unified source activity was introduced to automatically align the permutations of the output order across all frequency bins.
Our method improves the average SIR by 2.82–0.41 dB compared with the baseline method based on FD-ISFA when separating convoluted mixtures in RT_{20} = 30 [ms]–600 [ms] room environments. It also outperforms FD-ISFA under reverberant conditions (RT_{20} = 150, 400, and 600 [ms]). For SAD performance, our method outperforms the conventional method by 5.9–0.5 points in F-measure under the conditions RT_{20} = 30–600 [ms], respectively.
In the future, we will evaluate the separation performance on mixtures of three or more talkers. We will also attempt to develop a method that robustly separates mixtures with longer reverberations (i.e., longer than the STFT window length). Last but not least, the method should be sped up to achieve real-time processing so that it can be applied to robot applications.
References
Rosenthal D, Okuno HG: Computational Auditory Scene Analysis. USA: CRC Press; 1998.
Wang D, Brown G: Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. USA: Wiley-IEEE Press; 2006.
Sohn J, Kim N, Sung W: A statistical model-based voice activity detection. IEEE Signal Process. Lett 1999, 6(1):1-3.
Ramírez J, Segura J, Benítez C, De La Torre A, Rubio A: Efficient voice activity detection algorithms using long-term speech information. Speech Commun 2004, 42(3):271-287. 10.1016/j.specom.2003.10.002
Nagira K, Takahashi T, Ogata T, Okuno HG: Complex extension of infinite sparse factor analysis for blind speech separation. In Proc. of International Conference on Latent Variable Analysis and Signal Separation. Tel-Aviv; 2012:388-396.
Pedersen MS, Larsen J, Kjems U, Parra LC: Convolutive blind source separation methods, Part I. In Springer Handbook of Speech Processing. Edited by: Benesty J, Sondhi MM, Huang Y. Springer Press; 2008:1065-1094.
Nakadai K, Takahashi T, Okuno H, Nakajima H, Hasegawa Y, Tsujino H: Design and implementation of robot audition system “HARK”, open source software for listening to three simultaneous speakers. Adv. Robot 2010, 24(5):739-761. 10.1163/016918610X493561
Asano F, Ikeda S, Ogawa M, Asoh H, Kitawaki N: Combined approach of array processing and independent component analysis for blind separation of acoustic signals. IEEE Trans. Speech Audio Process 2003, 11(3):204-215. 10.1109/TSA.2003.809191
Nakajima H, Nakadai K, Hasegawa Y, Tsujino H: Adaptive step-size parameter control for real-world blind source separation. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing. Las Vegas; 2008:149-152.
Comon P: Independent component analysis, a new concept? Signal Process 1994, 36(3):287-314. 10.1016/0165-1684(94)90029-9
Hyvärinen A: Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Netw 1999, 10(3):626-634. 10.1109/72.761722
Cardoso J, Souloumiac A: Blind beamforming for non-Gaussian signals. IEE Proceedings F Radar and Signal Processing 1993, 140(6):362-370. 10.1049/ip-f-2.1993.0054
Sawada H, Mukai R, Araki S, Makino S: Polar coordinate based nonlinear function for frequency-domain blind source separation. In Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing. Orlando; 2002:1001-1004.
Knowles D, Ghahramani Z: Infinite sparse factor analysis and infinite independent components analysis. In Proc. of Independent Component Analysis and Signal Separation. London; 2007:381-388.
Sawada H, Mukai R, Araki S, Makino S: A robust and precise method for solving the permutation problem of frequency-domain blind source separation. IEEE Trans. Speech Audio Process 2004, 12(5):530-538. 10.1109/TSA.2004.832994
Sawada H, Araki S, Makino S: Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS. In Proc. of IEEE International Symposium on Circuits and Systems. New Orleans; 2007:3247-3250.
Lee I, Kim T, Lee T: Fast fixed-point independent vector analysis algorithms for convolutive blind source separation. Signal Process 2007, 87(8):1859-1871. 10.1016/j.sigpro.2007.01.010
Hiroe A: Solution of permutation problem in frequency domain ICA, using multivariate probability density functions. In Proc. of International Conference on Independent Component Analysis and Blind Signal Separation. Charleston; 2006:601-608.
Hirayama J, Maeda S, Ishii S: Markov and semi-Markov switching of source appearances for nonstationary independent component analysis. IEEE Trans. Neural Netw 2007, 18(5):1326-1342.
Hsieh H, Chien J: Online Bayesian learning for dynamic source separation. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing. Dallas; 2010:1950-1953.
Araki S, Sawada H, Makino S: Blind speech separation in a meeting situation with maximum SNR beamformers. In Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing. Honolulu; 2007:41-44.
Griffiths T, Ghahramani Z: Infinite latent feature models and the Indian buffet process. Adv. Neural Inf. Process. Syst 2006, 18:475-482.
Murata N, Ikeda S, Ziehe A: An approach to blind source separation based on temporal structure of speech signals. Neurocomputing 2001, 41:1-24. 10.1016/S0925-2312(00)00345-3
Hastings W: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970, 57(1):97-109. 10.1093/biomet/57.1.97
Meeds E, Ghahramani Z, Neal R, Roweis S: Modeling dyadic data with binary latent factors. Adv. Neural Inf. Process. Syst 2007, 19:977-984.
Vincent E, Sawada H, Bofill P, Makino S, Rosca J: First stereo audio source separation evaluation campaign: data, algorithms and results. In Proc. of Independent Component Analysis and Signal Separation. London; 2007:552-559.
Acknowledgements
This study was partially supported by KAKENHI and Honda Research Institute Japan Inc., Ltd.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cite this article
Nagira, K., Otsuka, T. & Okuno, H.G. Nonparametric Bayesian sparse factor analysis for frequency domain blind source separation without permutation ambiguity. J AUDIO SPEECH MUSIC PROC. 2013, 4 (2013). https://doi.org/10.1186/1687-4722-2013-4