An iterative model-based approach to cochannel speech separation
EURASIP Journal on Audio, Speech, and Music Processing volume 2013, Article number: 14 (2013)
Abstract
Cochannel speech separation aims to separate two speech signals from a single mixture. In a supervised scenario, the identities of two speakers are given, and current methods use pretrained speaker models for separation. One issue in model-based methods is the mismatch between training and test signal levels. We propose an iterative algorithm to adapt speaker models to match the signal levels in testing. Our algorithm first obtains initial estimates of source signals using unadapted speaker models and then detects the input signal-to-noise ratio (SNR) of the mixture. The input SNR is then used to adapt the speaker models for more accurate estimation. The two steps iterate until convergence. Compared to search-based SNR detection methods, our method is not limited to given SNR levels. Evaluations demonstrate that the iterative procedure converges quickly in a considerable range of SNRs and improves separation results significantly. Comparisons show that the proposed system performs significantly better than related model-based systems.
1 Introduction
In daily listening environments, noise corrupts speech and creates substantial difficulty for various applications such as hearing aid design and automatic speech recognition. When noise is a nonspeech signal, existing algorithms often exploit the intrinsic properties of speech/noise for segregation. However, when interference is another voice, the generic properties of speech signals alone are insufficient for separation, and current methods also utilize speaker characteristics. The problem of separating two voices from a single mixture is often referred to as cochannel speech separation. Depending on the information used in cochannel speech separation, we can classify the algorithms into two categories: unsupervised and supervised. In unsupervised methods, speaker identities and pretraining with clean speech are not available, while supervised methods often assume both.
Motivated by human perceptual principles, computational auditory scene analysis (CASA) aims to segregate a voice of interest by exploiting inherent features of speech such as pitch and common onsets [1]. CASA methods are typically unsupervised. For example, pitch and amplitude modulation are utilized to separate voiced portions of cochannel speech, and the estimated pitches in neighboring frames are grouped using pitch continuity [2]. To group temporally disjoint time-frequency (TF) regions, a system [3] employs speaker models to perform a joint estimation of speaker identities and sequential grouping. Later in [4], the system is extended to handle unvoiced speech based on onset/offset-based segmentation [5] and model-based grouping. Similarly, another CASA system extracts speaker-homogeneous TF regions and employs speaker models and missing-data techniques to group them into speech streams [6]. Note that the aforementioned methods use speaker models for sequential grouping, or to group temporally disjoint speech regions, and thus are not completely unsupervised. A recent system [7] applies unsupervised clustering to group speech regions into two speaker groups by maximizing the ratio of between- and within-cluster distances.
Supervised methods often formulate separation as an estimation problem, i.e., given an input mixture, one estimates the two underlying speech sources. To solve this underdetermined equation, a general approach is to represent the speakers by two trained models, and the two patterns (each from one speaker) best approximating the mixture are used to reconstruct the sources. For example, an early study [8] employs a factorial hidden Markov model (HMM) to model a speaker, and a binary mask is generated by comparing the two estimated sources. In another system [9], Gaussian mixture models (GMMs) are used to describe speakers, and speech signals are estimated by a minimum mean-square error (MMSE) estimator. In MMSE estimation, the posterior probabilities of all Gaussian pairs are computed and used to reconstruct the sources (see [10] for a similar system). The GMM-based methods [9, 10] do not model the temporal dynamics of speech. A layered HMM model is proposed to model both temporal and grammar dynamics by transition matrices [11]. A 2D Viterbi decoding technique is used to detect the most likely Gaussian pair in each frame, and a maximum a posteriori (MAP) estimator is used for estimation. In a speaker-independent setting, Stark et al. [12] propose a factorial HMM to model vocal tract characteristics and use detected pitch to reconstruct speech sources. In addition to these methods, other models are applied to capture speakers, including eigenvectors to model and adapt speakers [13], nonnegative matrix factorization-based models [14, 15], and sinusoidal models [16].
As pointed out in [9], one problem the model-based methods face is generalization to different input signal-to-noise ratio (SNR) levels (note here that we consider interfering speech as noise). The system [9] does not address this problem and assumes that test mixtures have the same energy level as the training mixtures. Further, the system is designed to only handle 0-dB mixtures. Similarly, a conditional random field-based method in [17] is only applied to separate 0-dB speech mixtures. The factorial HMM system [12] employs quantile filtering to estimate a gain for each frame and then uses that to adjust the corresponding mean vector in a codebook. Radfar and Dansereau [18] propose a search-based method to detect the input SNR, but one has to specify the search range. In this method, different gains are hypothesized, and the one maximizing the likelihood of the whole utterance is taken as the estimate. Radfar et al. [19] use a quadratic function to approximate the likelihood function of a factorial HMM and employ an iterative approach to estimate the gain. The HMM system [11] detects the model gains jointly with the speaker identities given a closed set of speakers and uses an expectation-maximization (EM) algorithm to further adapt the gains. However, the complexity of gain adaptation is quadratic in the number of states, and the convergence speed of the EM algorithm is unknown. Sinusoidal models are also employed to model speakers for joint speaker separation and identification [20], and SNR estimation can be achieved by adapting a universal background model using segregated speech [21].
In this work, we propose an iterative algorithm to generalize to different input SNR conditions given speaker identities. Building on the GMM system [9], we first incorporate temporal dynamics using transition matrices [11]. Then, our algorithm estimates initial TF masks for two speakers by assuming that the input SNR is 0 dB. The initial masks are used to estimate an utterance-level SNR, which is in turn used to adapt the speaker models. Then, the adapted models are used in a new iteration of separation. The above two steps iterate until both the input SNR and the estimated masks become stable. Experiments show that it converges relatively fast and is computationally simple. Compared to the method of [19], our method is simpler and can be applied to factorial HMMs as well as other models (e.g., GMMs). In addition, our method does not require a search range for the estimated input SNR. Comparisons show that the proposed algorithm significantly outperforms related methods.
The rest of the paper is organized as follows. We first present the basic model in Section 2. Section 3 describes iterative estimation. Evaluation and comparison are given in Section 4, and we conclude the paper in Section 5.
2 Model-based separation
We first introduce speaker models and source estimation methods. Throughout the paper, we denote vectors by boldface lowercase letters and matrices by boldface uppercase letters. Given two speakers a and b, the time-domain cochannel speech signal is a simple addition of the two source speech signals. Decomposing the signals into the TF domain using a linear filterbank and assuming that the two source signals are uncorrelated at each channel, we have

$$Y(c,m) = X_a(c,m) + X_b(c,m), \qquad (1)$$

where $X_a(c,m)$ and $X_b(c,m)$ denote the power spectrum at the TF unit of channel c and time frame m of speakers a and b, respectively, and $Y(c,m)$ is the power spectrum of the mixture. We then take the logarithm of all quantities and use the log-max approximation to model the relationship between the mixture and sources: in the log-spectral domain, the mixture at each TF unit is equal to the stronger source. Thus, (1) can be approximated as

$$y(c,m) \approx \max\{x_a(c,m),\, x_b(c,m)\}, \qquad (2)$$

where the lowercase $x_a(c,m)$, $x_b(c,m)$, and $y(c,m)$ represent the logarithms of $X_a(c,m)$, $X_b(c,m)$, and $Y(c,m)$, respectively. The log-max approximation was originally proposed in [22] to describe the mixing process of speech and noise in robust speech recognition and was later employed in two-speaker separation. A mathematical analysis in [9] shows that the approximation error in (2) is reasonable, but more accurate approximations exist that take both amplitude and phase into consideration [23].
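To give a feel for the size of the log-max approximation error, the following minimal sketch compares the exact log of a sum of two powers with the larger of the two logs. The function name `logmax_error` and the use of the natural logarithm are illustrative assumptions and are not taken from the paper.

```python
import math

def logmax_error(power_a, power_b):
    """Return log(a + b) - max(log a, log b) for two positive powers."""
    exact = math.log(power_a + power_b)
    approx = max(math.log(power_a), math.log(power_b))
    return exact - approx

# The gap is largest for equal powers and shrinks as one source dominates.
for ratio_db in (0, 3, 10, 20):
    weaker = 10 ** (-ratio_db / 10)
    print(ratio_db, round(logmax_error(1.0, weaker), 4))
```

Under these assumptions the error never exceeds log 2, which is reached only when both sources have equal power in a TF unit.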
2.1 Speaker models
We use a gammatone filterbank consisting of 128 filters to decompose the input signal into different frequency channels [1]. The center frequencies of the filters spread logarithmically from 50 to 8,000 Hz. Each filtered signal is then divided into 20-ms time frames with a 10-ms frame shift, resulting in a cochleagram. The log spectra are computed by taking the element-wise logarithm of the energy in the cochleagram matrix.
Following [9], we build speaker models using GMMs. For each speaker, we build a 128-dimensional GMM from the log spectra of their clean utterances and use a diagonal covariance matrix for each Gaussian for efficiency and tractability. Letting $\mathbf{x}_a$ be a log-spectral vector of speaker a, the GMM for speaker a can be parameterized as

$$p(\mathbf{x}_a) = \sum_{k=1}^{K} p_a(k) \prod_{c=1}^{128} N(x_a^c;\mu_{a,k}^c,\sigma_{a,k}^c), \qquad (3)$$
where K is the number of Gaussians indexed by k, c is the index of frequency channels, and $x_a^c$ is the cth element of $\mathbf{x}_a$. $N(\cdot;\mu_{a,k}^c,\sigma_{a,k}^c)$ denotes a one-dimensional Gaussian distribution with mean $\mu_{a,k}^c$ and variance $\sigma_{a,k}^c$, which correspond to the cth dimension of the kth Gaussian in the GMM. In addition, $p_a(k)$ denotes the prior of the kth Gaussian. Similarly, the model of speaker b is

$$p(\mathbf{x}_b) = \sum_{k=1}^{K} p_b(k) \prod_{c=1}^{128} N(x_b^c;\mu_{b,k}^c,\sigma_{b,k}^c). \qquad (4)$$
For each speaker, the conditional distribution given a specific Gaussian is a 128-dimensional Gaussian distribution, i.e., $p(\mathbf{x}_a \mid k_a)=\prod_{c=1}^{128} N(x_a^c;\mu_{a,k_a}^c,\sigma_{a,k_a}^c)$ and $p(\mathbf{x}_b \mid k_b)=\prod_{c=1}^{128} N(x_b^c;\mu_{b,k_b}^c,\sigma_{b,k_b}^c)$, where $k_a$ and $k_b$ are two Gaussian indices, and $p(x_a^c \mid k_a)$ and $p(x_b^c \mid k_b)$ are one-dimensional Gaussians.
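Given the above speaker models and the mixing Equation (2), we can derive a per-channel statistical relationship between the mixture and two sources as follows:

$$p(y^c \mid k_a,k_b) = p_{x_a^c}(y^c \mid k_a)\,\Phi_{x_b^c}(y^c \mid k_b) + p_{x_b^c}(y^c \mid k_b)\,\Phi_{x_a^c}(y^c \mid k_a). \qquad (5)$$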
Here, we use subscripts $x_a^c$ and $x_b^c$ to differentiate the probability functions for speakers a and b. $\Phi_{x_a^c}(\cdot \mid k_a)$ and $\Phi_{x_b^c}(\cdot \mid k_b)$ are their corresponding cumulative distributions. In a probabilistic manner, (5) provides a way of approximating the mixture using two clean speaker models, which in turn can be used to estimate two source signals given the mixture as the observation.
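The per-channel relationship can be illustrated with a short sketch. Under the log-max model the mixture value is the maximum of two independent Gaussian variables, whose density is the sum of two terms: each speaker's density at the observed value times the other speaker's cumulative distribution at that value. The helper names below are hypothetical, and the variances are passed as variances (not standard deviations), which is an assumption about the parameterization.

```python
import math

def gauss_pdf(x, mean, var):
    """One-dimensional Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gauss_cdf(x, mean, var):
    """One-dimensional Gaussian cumulative distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def mixture_likelihood(y, mean_a, var_a, mean_b, var_b):
    """Density of y = max(x_a, x_b) for independent Gaussians x_a and x_b."""
    a_dominant = gauss_pdf(y, mean_a, var_a) * gauss_cdf(y, mean_b, var_b)
    b_dominant = gauss_pdf(y, mean_b, var_b) * gauss_cdf(y, mean_a, var_a)
    return a_dominant + b_dominant
```

Because the result is a proper density in y, summing it over a fine grid should give a value close to one, which is a convenient sanity check.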
2.2 Source estimation
One method to estimate the sources is the MMSE estimator, which aims to minimize the expectation of the square error between the estimated and underlying true signals given the observations [9]. As a result, for a log-spectral vector $\mathbf{y}$ of the mixture, the cth element of source $\mathbf{x}_a$ can be estimated as

$$\hat{x}_a^c = E[x_a^c \mid \mathbf{y}] = \int x_a^c\, p(x_a^c \mid \mathbf{y})\, dx_a^c. \qquad (6)$$
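According to the total probability formula, $p(x_a^c \mid \mathbf{y})$ in (6) can be expanded as follows:

$$p(x_a^c \mid \mathbf{y}) = \sum_{k_a,k_b} p(k_a,k_b \mid \mathbf{y})\, p(x_a^c \mid k_a,k_b,y^c). \qquad (7)$$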
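Note that $p(x_a^c \mid k_a,k_b,y^c)$ here only depends on $y^c$ instead of $\mathbf{y}$ due to the diagonal covariance assumption. The posterior $p(k_a,k_b \mid \mathbf{y})$ in (7) can be calculated as

$$p(k_a,k_b \mid \mathbf{y}) = \frac{p_a(k_a)\,p_b(k_b)\,p(\mathbf{y} \mid k_a,k_b)}{\sum_{k_a',k_b'} p_a(k_a')\,p_b(k_b')\,p(\mathbf{y} \mid k_a',k_b')}, \qquad (8)$$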
where $p(\mathbf{y} \mid k_a,k_b)=\prod_{c=1}^{128} p(y^c \mid k_a,k_b)$, again because of the diagonal covariance matrix. On the other hand, $p(x_a^c \mid k_a,k_b,y^c)$ in (7) can be computed by using the Bayes rule:
From (9) to (10), the constraint $x_a^c \le y^c$ and the log-max assumption are used, and a detailed derivation can be found in [22]. We then substitute (8) and (10) into (7) and combine the result with (6) to estimate the source signal of speaker a
The MMSE estimate of speaker b can be computed similarly.
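In addition to directly estimating the sources, we estimate a soft mask for speaker a as

$$p(x_a^c > x_b^c \mid \mathbf{y}) = \sum_{k_a,k_b} p(k_a,k_b \mid \mathbf{y})\, \frac{p_{x_a^c}(y^c \mid k_a)\,\Phi_{x_b^c}(y^c \mid k_b)}{p(y^c \mid k_a,k_b)}. \qquad (12)$$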
Note that the soft mask for speaker b is $p(x_a^c \le x_b^c \mid \mathbf{y}) = 1 - p(x_a^c > x_b^c \mid \mathbf{y})$. In [9], the soft mask is found to perform consistently better than a binarized mask.
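The soft mask can be sketched for a single frame as follows: for every Gaussian pair, compute the pair's posterior weight from the priors and the log-max mixture likelihood, and average the per-channel probability that speaker a dominates. This is a minimal sketch under stated assumptions: the model layout (a list of `(prior, means, variances)` tuples) and all function names are hypothetical, and probabilities are multiplied directly, which would underflow for 128 channels and should be replaced by log-domain arithmetic in practice.

```python
import math
from itertools import product

def pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cdf(x, mean, var):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def soft_mask(y, model_a, model_b):
    """Per-channel probability that speaker a dominates the mixture frame y.

    y: list of log-spectral values, one per channel.
    model_a, model_b: lists of (prior, means, variances), one tuple per Gaussian.
    """
    n_channels = len(y)
    weights, ratios = {}, {}
    for (ia, (pa, ma, va)), (ib, (pb, mb, vb)) in product(
            enumerate(model_a), enumerate(model_b)):
        likelihood = pa * pb
        ratio = []
        for c in range(n_channels):
            a_dom = pdf(y[c], ma[c], va[c]) * cdf(y[c], mb[c], vb[c])
            b_dom = pdf(y[c], mb[c], vb[c]) * cdf(y[c], ma[c], va[c])
            likelihood *= a_dom + b_dom          # per-channel mixture likelihood
            ratio.append(a_dom / (a_dom + b_dom))
        weights[ia, ib] = likelihood
        ratios[ia, ib] = ratio
    total = sum(weights.values())                # normalizer of the pair posterior
    return [sum(weights[k] / total * ratios[k][c] for k in weights)
            for c in range(n_channels)]
```

With identical models for both speakers the mask is 0.5 in every channel, and the mask for speaker b is one minus the returned values.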
An alternative to the MMSE estimator is a MAP estimator. The essence of MAP estimation is similar to MMSE, but instead of using every pair of Gaussians in (7), it only uses the most likely Gaussian pair

$$(k_a^{\ast},k_b^{\ast}) = \arg\max_{k_a,k_b}\, p(k_a,k_b \mid \mathbf{y}), \qquad (13)$$
where ${k}_{a}^{\ast}$ and ${k}_{b}^{\ast}$ correspond to the pair of Gaussians yielding the highest posterior probability among all possible pairs. The estimate of source signals can be computed similarly to (11) but using only ${k}_{a}^{\ast}$ and ${k}_{b}^{\ast}$. A soft mask can also be derived like (12) using only ${k}_{a}^{\ast}$ and ${k}_{b}^{\ast}$. In experiments, we find that the performance of the MAP estimator is similar to that of MMSE, mainly because at each frame, one pair of Gaussians often approximates the mixture much better than others.
2.3 Incorporating temporal dynamics
The cochannel speech separation system in [9] models speaker characteristics using GMMs and ignores the temporal information of speech signals. A natural extension to the GMMs to incorporate temporal dynamics is using a factorial HMM model. Specifically, for each speaker, we can estimate the most likely Gaussian index for each frame in a clean utterance using a MAP estimator. Each utterance thus generates a sequence of Gaussian indices. The transitions between all neighboring Gaussian indices are then used to build a 2D histogram, which can then be normalized to produce a transition matrix [11].
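The histogram-based construction of a transition matrix can be sketched as below. Each training utterance is assumed to have already been converted into a sequence of most likely Gaussian indices. Row normalization (so that each row is a distribution over next states) and the uniform fallback for states never seen as a predecessor are assumptions of this sketch, not details stated in the text.

```python
def transition_matrix(index_sequences, n_states):
    """Estimate p(current | previous) from sequences of Gaussian indices."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for sequence in index_sequences:
        # Count every transition between neighboring frames (2D histogram).
        for previous, current in zip(sequence, sequence[1:]):
            counts[previous][current] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        if total > 0:
            matrix.append([count / total for count in row])
        else:
            # Assumption: states never seen as a predecessor get a uniform row.
            matrix.append([1.0 / n_states] * n_states)
    return matrix
```

For example, `transition_matrix([[0, 1, 1, 0], [1, 1]], 3)` counts one transition out of state 0 and three out of state 1, and state 2 falls back to the uniform row.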
In the factorial HMM system, the hidden states of the two HMMs at each frame are the most likely Gaussian indices of two speakers. While the detection of the Gaussian indices is based on only individual frames in a GMM-based model, a 2D Viterbi search is used in [11] to find the most likely Gaussian index sequences. Specifically, the 2D Viterbi integrates all frames and the transition information across time to find the most likely two Gaussian sequences, each of which corresponds to one speaker [24].
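We use $\delta_t(k_a,k_b)$ to denote the highest probability along a single path (i.e., a sequence of state pairs) accounting for the first t frames and ending at the state pair $(k_a,k_b)$:

$$\delta_t(k_a,k_b) = \max_{s_a^1,s_b^1,\ldots,s_a^{t-1},s_b^{t-1}} p(s_a^1,s_b^1,\ldots,s_a^{t}=k_a,s_b^{t}=k_b,\mathbf{y}_1,\ldots,\mathbf{y}_t \mid \lambda), \qquad (14)$$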
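where $s_a^t$ and $s_b^t$ denote the hidden states of speakers a and b at time frame t, respectively, and λ represents the factorial HMM. (14) can be computed iteratively by

$$\delta_t(k_a,k_b) = \max_{k_a',k_b'}\left[\delta_{t-1}(k_a',k_b')\,p(k_a \mid k_a')\,p(k_b \mid k_b')\right] p(\mathbf{y}_t \mid k_a,k_b), \qquad (15)$$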
where $p(k_a \mid k_a')$ is the transition probability of speaker a from state $k_a'$ to $k_a$, and $p(k_b \mid k_b')$ is that of speaker b. $p(\mathbf{y}_t \mid k_a,k_b)$ can be calculated similarly as in (8). The optimal Gaussian index sequences are detected by a 2D Viterbi decoding [24], and the MAP estimator is used for estimating sources.
In (15), an exhaustive search for each pair of $k_a$ and $k_b$ across T frames has a complexity of O(T K^{4}), where K is the number of Gaussians for each speaker and T is the number of frames. It is time consuming if K is relatively large. In our study, we use a beam search to speed up the process (see also [25]). Given a beam width of W, we only search for the W most likely previous state pairs (i.e., $k_a'$ and $k_b'$ in (15)), and the time complexity is reduced to O(T W K^{2}). The results presented in Section 4 indicate that a beam width of 16 gives a comparable performance to the exhaustive search.
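The beam-pruned 2D Viterbi recursion can be sketched in the log domain as follows. At each frame only the W best-scoring previous state pairs are considered as predecessors, so each of the K² current pairs inspects W candidates instead of K². The argument layout (nested lists of log probabilities, a joint log prior for the first frame) and the function name `viterbi_2d` are assumptions made for illustration.

```python
def viterbi_2d(log_obs, log_trans_a, log_trans_b, log_init, beam_width):
    """Most likely sequence of state pairs (k_a, k_b) with beam pruning.

    log_obs[t][ka][kb]  : log p(y_t | ka, kb)
    log_trans_x[i][j]   : log p(j | i) for speaker x
    log_init[ka][kb]    : log prior of the first state pair (an assumption)
    """
    n_a, n_b = len(log_trans_a), len(log_trans_b)
    pairs = [(ka, kb) for ka in range(n_a) for kb in range(n_b)]
    delta = {(ka, kb): log_init[ka][kb] + log_obs[0][ka][kb] for ka, kb in pairs}
    backpointers = []
    for t in range(1, len(log_obs)):
        # Beam pruning: only the W most likely previous pairs are extended.
        beam = sorted(delta, key=delta.get, reverse=True)[:beam_width]
        new_delta, pointers = {}, {}
        for ka, kb in pairs:
            def score(prev):
                return delta[prev] + log_trans_a[prev[0]][ka] + log_trans_b[prev[1]][kb]
            best_prev = max(beam, key=score)
            new_delta[ka, kb] = score(best_prev) + log_obs[t][ka][kb]
            pointers[ka, kb] = best_prev
        backpointers.append(pointers)
        delta = new_delta
    # Backtrack from the best final pair.
    state = max(delta, key=delta.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Setting `beam_width` to the total number of state pairs recovers the exhaustive search, while `beam_width=1` gives the greedy variant that keeps a single path per frame.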
3 Iterative estimation
As mentioned in Section 1, model-based methods such as [9] face the difficulty of generalizing to different mixing conditions. It is partly because the GMMs are trained using log-spectral vectors and hence are sensitive to the overall speech energy. More importantly, if the GMMs of two speakers are trained using clean utterances at certain energy levels, in testing they need to be adjusted according to the input SNR. In [9], mixtures with nonzero input SNR are separated using unadjusted models, but the performance is worse.
We propose to detect the input SNR and use that to adapt the speaker models and re-estimate the sources. To estimate the input SNR from the mixture, one has to first have some source information. Thus, SNR detection and source estimation become a chicken-and-egg problem, i.e., the performance of one task depends on the success of the other. One general approach to deal with this type of problem is to perform an iterative estimation (e.g., [2]). In the initial stage of the iterative procedure, we apply the unadapted speaker models to obtain initial separation. Based on the initial source estimates, we calculate the input SNR and use that to adapt the speaker models. The adapted models are in turn used to re-estimate the sources. The two steps iterate until convergence. As an alternative, we also explore a search-based method which jointly estimates sources and the input SNR.
3.1 Initial mask estimation
For a pair of speakers, we first perform an initial estimate by using their models pretrained using clean utterances at a per-utterance energy level of 60 dB. Initially, the input SNR is assumed to be 0 dB, and a mixture is scaled to an energy level of 63 dB corresponding to the addition of two 60-dB source signals. We use the 2D Viterbi decoding based on (15) to detect the most likely Gaussian index sequence and then estimate a soft mask of the target speaker using the MAP estimator in Section 2.2.
3.2 SNR estimation and model adaptation
Denoting the estimated soft masks of speakers a and b as $\mathbf{M}_a$ and $\mathbf{M}_b$, respectively, we use them to filter the mixture cochleagram to obtain the corresponding segregated signals. With the mixture cochleagram $\mathbf{E}_y$, the SNR of the target and interferer in the cochleagram domain can be calculated as

$$R = 10\log_{10}\frac{\sum_{c,m} M_a(c,m)\,E_y(c,m)}{\sum_{c,m} M_b(c,m)\,E_y(c,m)}, \qquad (16)$$
where $M_a(c,m)$ denotes the ratio of speaker a at the TF unit of channel c and frame m, and $M_b(c,m)=1-M_a(c,m)$. R corresponds to the input SNR of the filtered speech signals. As analyzed in [26], due to gammatone filtering which has a certain passband, one usually should compensate for the loss of energy to calculate the SNR of the original time-domain signals. However, in our work, the frequency range of the gammatone filterbank is between 50 and 8,000 Hz, and both target and interference are speech signals with a sampling frequency of 16 kHz. There is thus little energy loss in the filtering process, and the estimated SNR of filtered signals is close to that of the original time-domain signals. Thus, we directly use the SNR of filtered signals in (16) as our estimate.
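The mask-based SNR estimate can be sketched as the ratio, in dB, of the mixture energy assigned to the target by its soft mask to the energy assigned to the interferer by the complementary mask. The nested-list layout (`[channel][frame]`) and the function name are assumptions of this sketch.

```python
import math

def estimate_input_snr(mask_a, energy):
    """Utterance-level SNR (dB) from a soft mask and the mixture cochleagram.

    mask_a[c][m] lies in [0, 1]; the interferer mask is 1 - mask_a[c][m].
    energy[c][m] is the mixture energy in channel c and frame m.
    """
    target = 0.0
    interferer = 0.0
    for mask_row, energy_row in zip(mask_a, energy):
        for mask_value, unit_energy in zip(mask_row, energy_row):
            target += mask_value * unit_energy
            interferer += (1.0 - mask_value) * unit_energy
    return 10.0 * math.log10(target / interferer)
```

A mask of 0.5 everywhere yields 0 dB regardless of the cochleagram, and the function is undefined when either masked energy is zero, so a practical implementation would guard that case.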
We then adapt the two speaker models to match the estimated input SNR. In particular, the target speaker model (speaker a) is fixed (i.e., trained by using 60-dB clean utterances), and we adapt the interferer model and the mixture. Given an input SNR of R dB, the interfering signal energy level is thus

$$10\log_{10}\sum_{t} x_b^2[t] = 60 - R, \qquad (17)$$
where $x_b[t]$ denotes the time-domain speech of speaker b. That is, instead of using 60-dB utterances, the interferer model should be trained using (60−R)-dB signals, i.e., the energy of the original utterances should be scaled by a multiplicative factor of 10^{−R/10}. Since the difference lies in a constant factor, we can directly adjust the parameters of the GMMs instead of retraining them. Specifically, the means of the interferer GMM are shifted by an additive term β = log(10^{−R/10}) since log-spectral vectors are used in training, while the variances remain unchanged because β is an additive term.
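On the other hand, the mixture energy level can be computed by combining the target and interfering signal levels:

$$10\log_{10}\sum_t y^2[t] = 10\log_{10}\Big(\sum_t x_a^2[t] + \sum_t x_b^2[t]\Big) = 60 + 10\log_{10}\big(1 + 10^{-R/10}\big), \qquad (18)$$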
where $y[t]$ is the time-domain cochannel signal, and $x_a[t]$ is the source signal of speaker a. In the above calculation, we assume that the time-domain target and interfering signals are uncorrelated at each frame. Given (17) and (18), we have adapted the interfering speaker model and the mixture and created a more matched condition for separation.
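The adaptation step amounts to two small computations: shifting every mean of the interferer GMM by β and rescaling the mixture to the level implied by the estimated SNR. The sketch below assumes a natural logarithm for β (matching log spectra computed with the natural logarithm), a 60-dB target level, and hypothetical function names.

```python
import math

def adapt_interferer_means(means, snr_db):
    """Shift all interferer GMM means by beta = log(10 ** (-R / 10)).

    means: list of mean vectors (one list of floats per Gaussian).
    Variances are left untouched because the shift is additive.
    """
    beta = math.log(10.0 ** (-snr_db / 10.0))
    return [[mu + beta for mu in mean_vector] for mean_vector in means]

def mixture_level_db(snr_db, target_level_db=60.0):
    """Energy level of target plus interferer when the target is fixed."""
    return target_level_db + 10.0 * math.log10(1.0 + 10.0 ** (-snr_db / 10.0))
```

At an SNR of 0 dB the means are unchanged and the mixture level is about 63 dB, which agrees with the initial scaling described in Section 3.1.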
3.3 Iterative estimation
Given any input mixture, we first obtain the initial mask estimates M_{a,0} and M_{b,0} as described in Section 3.1. Given M_{a,0} and M_{b,0}, we then estimate the input SNR using (16). The estimated SNR is used to adapt the model of speaker b and mixture by (17) and (18), respectively. They are then used together with the target speaker model to reestimate the soft masks based on the 2D Viterbi decoding described in Section 2.3 and the MAP estimator in Section 2.2. To get the maximal performance, the iterative process should continue until neither the estimated input SNR nor speaker masks change. However, empirically, we observe that the separation performance becomes stable when the estimated input SNR change is smaller than 0.5 dB. We thus use this as the stop criterion and terminate the estimation process when the difference of estimated input SNRs between two iterations is less than 0.5 dB.
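The overall loop can be sketched as below. The two callables stand in for the separation stage (model adaptation, 2D Viterbi decoding, and MAP mask estimation at a given SNR) and for the mask-based SNR estimate; both are placeholders supplied by the caller, and the iteration cap `max_iterations` is a safety assumption that is not part of the described procedure.

```python
def iterate_separation(separate, estimate_snr, tolerance_db=0.5, max_iterations=20):
    """Alternate mask estimation and SNR estimation until the SNR stabilizes.

    separate(snr_db)    -> masks obtained with models adapted to snr_db
    estimate_snr(masks) -> utterance-level SNR estimate in dB
    Returns the final masks and the history of SNR estimates.
    """
    snr_db = 0.0                      # the initial stage assumes a 0-dB mixture
    masks = separate(snr_db)
    history = [snr_db]
    for _ in range(max_iterations):
        new_snr_db = estimate_snr(masks)
        history.append(new_snr_db)
        converged = abs(new_snr_db - snr_db) < tolerance_db
        snr_db = new_snr_db
        if converged:
            break
        masks = separate(snr_db)      # re-estimate with the adapted models
    return masks, history
```

The loop stops as soon as two consecutive SNR estimates differ by less than 0.5 dB, so the returned masks were computed with an SNR within that tolerance of the final estimate.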
As an illustration, Figure 1a shows a cochleagram of a cochannel signal at −9 dB consisting of two male utterances, where a brighter unit indicates stronger energy. Figure 1b shows the clean target speech and Figure 1c the clean interfering speech. We show the initially segregated target and interferer in Figure 1d,e, respectively, and the final segregated target and interferer are presented in Figure 1f,g, respectively. As shown in the figure, the iterative estimation improves the quality of segregated speech signals.
3.4 An alternative method
In addition to the iterative method, we have also tried a search-based method to jointly estimate the source state sequences and the input SNR. For example, we use a test corpus described in Section 4 and hypothesize the input SNR in a range from −9 to 6 dB with an increment of 3 dB. At each hypothesized input SNR, we adapt the mixture and interfering speaker model according to (17) and (18) and use them to detect state sequences using the 2D Viterbi decoding, and then estimate the soft masks based on the MAP estimator. For all hypothesized SNR conditions, we calculate the joint likelihood of all mixture frames and the Gaussian sequences being generated by the factorial HMM, and the hypothesized input SNR corresponding to the highest likelihood is selected as the detected value. The corresponding state sequence is then used for estimation. We have evaluated the performance of this method using the corpus described in Section 4, and it is about 0.5 dB worse than the iterative method and is computationally more expensive. Note that the discrete SNR range includes the true SNR value in each testing condition to favor the SNR-based search method. How to specify the input SNR levels in search is unclear in practice.
4 Evaluation and comparisons
We use two-talker mixtures in the Speech Separation Challenge (SSC) corpus [27] for evaluation. For each speaker, a 256-component GMM model (i.e., K=256) is trained using all of the speaker’s clean utterances in the training set. Here, K is chosen with the consideration of performance and computation complexity. In training, each clean utterance is normalized to a 60-dB energy level, and the log spectra are calculated as described in Section 2.1. An HMM model is then built upon each GMM using the same utterances as described in Section 2.3. We use the test part of the SSC corpus and create two-speaker mixtures at SNRs from −9 to 6 dB (with an increment of 3 dB) for evaluation. We randomly select 100 two-speaker mixtures in each SNR condition for testing. Note that the mixture utterances are the same across different SNRs, and mixtures at opposite SNRs are not symmetric since they are generated by fixing the target and scaling the interfering utterances. The 100 mixtures contain 51 different-gender mixtures, 23 male-male mixtures, and 26 female-female mixtures. All test mixtures are downsampled from 25 to 16 kHz for faster processing.
We evaluate the segregation performance using the SNR gain of the target speaker, which is calculated as the output SNR of the segregated target speech minus the corresponding input SNR. For each segregated target, we take its clean speech signal as the ground truth and compute the output SNR as

$$\mathrm{SNR}_{\mathrm{out}} = 10\log_{10}\frac{\sum_t x_a^2[t]}{\sum_t \left(x_a[t]-\hat{x}_a[t]\right)^2}, \qquad (19)$$
where $x_a[t]$ and $\hat{x}_a[t]$ are the original clean signal and the signal resynthesized from the estimated mask, respectively. Note that a waveform signal can be obtained from a soft mask [1]. In our test conditions, target and interfering speakers are treated symmetrically, e.g., an interferer at 6 dB is considered as a target at −6 dB. Thus, at each input SNR, we calculate the target SNR gain as the average of the target SNR gain at that input SNR and the interferer SNR gain at the negative of that input SNR. For example, the SNR gain at −6 dB is the average of the target SNR gain at the −6 dB SNR and the interferer SNR gain at the 6 dB SNR.
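The evaluation metric can be sketched as a waveform-level output SNR, with the clean target as the reference and the difference between clean and resynthesized signals treated as noise, followed by subtraction of the input SNR. The function names are hypothetical, and both signals are assumed to be time-aligned lists of equal length.

```python
import math

def output_snr_db(clean, estimate):
    """Output SNR (dB) of a resynthesized signal against its clean reference."""
    signal_energy = sum(x * x for x in clean)
    error_energy = sum((x - e) ** 2 for x, e in zip(clean, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)

def snr_gain_db(clean, estimate, input_snr_db):
    """SNR gain: output SNR minus the input SNR of the mixture."""
    return output_snr_db(clean, estimate) - input_snr_db
```

A perfect estimate makes the error energy zero, so a practical implementation would add a small floor to the denominator.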
4.1 System configuration
As we mentioned in Section 2.3, an exhaustive 2D Viterbi search is time consuming, and we use beam search for speedup. The beam width W needs to be chosen to balance the performance and complexity. In Figure 2, we vary W over 1, 4, 16, 64, and 256, and the corresponding target SNR gains are shown in different curves. For the largest beam width of 256, the beam search already performs comparably to an exhaustive search. On the other hand, a beam width of 1 amounts to a greedy algorithm where we only keep the path with the highest likelihood at each frame. In Figure 2, we observe that when W is between 16 and 256, the SNR gains at all conditions are almost the same. However, the gains degrade significantly when W is further reduced. We thus choose W to be 16. Compared to an exhaustive search, the computational complexity is greatly reduced from O(T K^{4}) to O(T W K^{2}) with W = 16.
Another parameter impacting the system performance is the number of iterations in iterative estimation. In our experiments, we observe that the estimated input SNR and masks become stable quickly. Figures 3 and 4 show the SNR and mask estimation performance, respectively, in terms of the number of iterations. In Figure 3, we measure the SNR estimation performance as the difference between the estimated and the true input SNRs. Each curve in the figure corresponds to the estimation errors at one SNR condition. Before any estimation (i.e., number of iterations = 0), the input SNR is assumed to be 0 dB and the error is the negative of the underlying true SNR. After the first iteration, the errors decrease significantly for all SNR conditions except for the 0-dB case. This is because at 0 dB, the initial estimate happens to be the same as the true SNR, and any estimation can only deviate away from 0 dB. In this case, we observe that the estimated SNR gets a little worse and then becomes stable. For other SNR conditions, the errors keep decreasing as more iterations are performed, and all of them become stable by the fifth iteration. In Figure 4, we measure the performance of mask estimation by the SNR gain of the segregated target. Initially, the SNR gain is 0 dB, and then, the quality of estimated masks improves substantially after the iteration starts. As shown in the figure, the first iteration brings about 4- to 8-dB improvements for all SNR conditions, and the second iteration mainly improves the performance at −6 and −9 dB (by 1.8 and 3 dB, respectively). The performance at most SNR conditions becomes stable after three iterations. At −9 dB, the estimated mask gains a small improvement from further iterations. In the experiments, we observe that the estimated masks often become stable when the estimated input SNR changes by less than 0.5 dB. Thus, we use this as the stop criterion for iterative estimation.
By this criterion, an average of 3 iterations is often enough for convergence.
4.2 Comparisons
We compare the proposed system to related model-based methods, which include the MMSE-based system by [9], a similar system based on a MAP estimator, and an HMM-based system incorporating temporal dynamics. Note that all aforementioned systems are implemented by us in the cochleagram domain for matched comparisons. In training GMMs, we follow [9] and normalize mixtures to have 0 mean and unit variance and use 256 Gaussians in GMMs. We use the soft mask result instead of the direct estimates in [9] since it gives the best result. The transition probabilities in HMM are calculated according to [11]. The mean SNR gains with 95% confidence intervals of these methods are presented in Figure 5.

As shown in Figure 5, the proposed system achieves an SNR gain of 11.9 dB at the input SNR of −9 dB, and the gain decreases gradually as the input SNR increases. At 6 dB, the SNR gain is about 3.9 dB. On average, our method achieves an SNR gain of 7.4 dB. Compared to the method of Reddy and Raj [9], our method performs comparably at 0 dB but significantly better at other input SNRs. For example, the proposed system performs about 2.7 dB better at −9 dB, and the improvement gets smaller as the input SNR gets closer to 0 dB. A similar trend is also observed at positive input SNRs. On average, the proposed system performs 1.2 dB better than the Reddy and Raj method. In the figure, we also show the performance of another MMSE method (black bars), a version of the Reddy and Raj system that does not require the energy levels of training and testing to be the same. In this method, we assume the input SNR to be 0 dB and scale the mixture as described in Section 3.1. As we expect, the performance is a little worse (about 0.3 dB) than the original Reddy and Raj system due to the unmatched signal levels. We also compare to a MAP-based separation method described in Section 2.2. Using only the most likely Gaussian pair for estimation, the MAP method is more efficient than the MMSE method but performs about 0.1 dB worse. Our system performs about 1.6 dB better than the MAP-based method. To isolate the effect of iterative estimation, we have also evaluated the performance of the HMM system alone. As shown in the figure, this method achieves an average SNR gain of about 6.3 dB, about 0.5 dB better than the MAP-based method. This improvement comes from the use of temporal dynamics. Comparing this performance with the proposed system, we get the benefit of iterative estimation, which further increases the SNR gain of the HMM system by about 1.1 dB. In addition, we note that iterative estimation can also be incorporated into other model-based systems.
For example, we add iterative estimation to the MMSE method (denoted as MMSE-iterative in Figure 5) and obtain an improvement of 1.2 dB. Similarly, the MAP-iterative method outperforms the original MAP method by about 1.2 dB. Lastly, to show the upper bound performance of our system, we have utilized the true input SNR and ideal hidden states in estimation. This ideal performance is presented as the HMM ideal in Figure 5. It is about 0.9 dB better than the proposed system, which indicates that our system is close to the ceiling performance.
We have also compared to a factorial HMM-based method which is capable of adapting speaker models for separating mixtures with different signal levels [12]. In this method, pitches of two speakers are first estimated by a factorial HMM. Then, vocal tract responses are modeled by vector quantization or nonnegative matrix factorization (NMF) and used with estimated pitches to estimate the source signals. Since the vocal tract responses are normalized in modeling, a gain factor is introduced to scale the source spectra. Specifically, a gain vector is calculated as the difference of the mixture and source spectra, and then quantile filtering is used to select a robust estimate. To compare to this method, we use the criterion of target-to-masker ratio (TMR) as in [12] in the following experiments. In the speaker-dependent case, the method reports about a 6.6-dB gain in terms of TMR at 0-dB input TMR. Specifically, it achieves a TMR of about 7 dB in the same-gender female (SGF) case, 4.5 dB in the same-gender male (SGM) case, and 8.3 dB in the different-gender (DG) case. These results correspond to the best performance in a setting where NMF is used for modeling. We evaluate our method using TMR, and the results for 0-dB mixtures are shown in Figure 6. Note that we used the same corpus as in [12], but the exact mixtures may be different. As in [12], we show the TMRs in the SGM, SGF, and DG cases separately; the horizontal lines in the centers of the boxes correspond to means, and the distance between a line and a box boundary depicts the standard deviation. The improvements are 9.6, 8.4, and 10.4 dB in the SGF, SGM, and DG cases, respectively, and on average, the improvement is about 9.4 dB. These results show that our system performs substantially better than [12] in all three cases.
In addition to the SNR performance, we also evaluate the system using the hit minus false-alarm (HIT−FA) rate, which has been shown to be a good indicator of human speech intelligibility [28]. As in [28], we calculate the hit rate as the percentage of correctly labeled target-dominant TF units and the false-alarm (FA) rate as the percentage of incorrectly labeled interferer-dominant TF units. To calculate these rates, we convert the soft masks to binary masks using a threshold of 0.5, i.e., the TF units with a probability greater than 0.5 are labeled as 1 and all others as 0. The HIT−FA rates of our system and the Reddy and Raj system are shown in Figure 7. We observe that the proposed algorithm performs uniformly better than the Reddy and Raj system at all SNR conditions. For our system, the average HIT−FA rate is about 64.4%, and the rates are relatively stable across input SNR conditions. On average, it is about 7.5% better than the Reddy and Raj system. The performance gap between our system and the Reddy and Raj system is larger when the input SNR deviates from 0 dB. This again confirms that iterative estimation is effective for generalizing to nonzero SNR mixtures.
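The HIT−FA computation described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the experiments: the function name `hit_minus_fa` is hypothetical, and it assumes that a ground-truth (ideal) binary mask marking the target-dominant TF units is available as the reference labeling.

```python
import numpy as np

def hit_minus_fa(soft_mask, ideal_binary_mask, threshold=0.5):
    """Return (HIT, FA, HIT - FA) in percent for one mixture.

    soft_mask:          estimated target probabilities per TF unit.
    ideal_binary_mask:  assumed reference labels, 1 = target dominant,
                        0 = interferer dominant.
    """
    # Binarize the soft mask: units with probability > threshold become 1.
    est = np.asarray(soft_mask) > threshold
    ref = np.asarray(ideal_binary_mask).astype(bool)

    n_target = ref.sum()
    n_interferer = (~ref).sum()

    # HIT: share of target-dominant units that are labeled 1.
    hit = 100.0 * np.logical_and(est, ref).sum() / n_target if n_target else 0.0
    # FA: share of interferer-dominant units that are wrongly labeled 1.
    fa = 100.0 * np.logical_and(est, ~ref).sum() / n_interferer if n_interferer else 0.0
    return hit, fa, hit - fa
```

Because the two rates are normalized separately by the number of target-dominant and interferer-dominant units, a mask that labels every unit as 1 obtains a HIT−FA of zero rather than a high score.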
Finally, we evaluate our system and compare it with the Reddy and Raj system using the short-time objective intelligibility (STOI) measure [29], which has been shown to be highly correlated with human speech intelligibility. As shown in Figure 8, both our method and the Reddy and Raj system perform significantly better than unprocessed mixtures. Our method generally performs better than Reddy and Raj’s across a range of SNRs, especially when the SNR is far from 0 dB. As Mowlaee et al. have also evaluated their sinusoidal modeling-based method using STOI [20], it is interesting to draw some comparisons with their performance. Since the exact mixtures in our experiments are different from those in [20], it is more informative to look at the relative STOI improvements over unprocessed mixtures. Roughly speaking, our STOI improvements are comparable to those in [20]. For example, our improvement is about 0.15 at −9 dB and 0.1 at 6 dB, while in [20] (Figure 8), the improvement at −9 dB is about 0.22, but there is no improvement at 6 dB.
5 Conclusions
We have proposed an iterative algorithm for model-based cochannel speech separation. First, temporal dynamics are incorporated into speaker models using HMMs. We then present an iterative method to deal with signal level differences between training and test conditions. Specifically, the proposed system first uses unadapted speaker models to segregate two speech signals and detects the input SNR. The detected SNR is then used to adapt the interferer model and the mixture for re-estimation. The two steps iterate until convergence. Systematic evaluations show that our iterative method improves segregation performance significantly and also converges quickly. Comparisons show that it performs significantly better than related model-based methods in terms of SNR gains as well as HIT−FA and STOI scores.
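The alternation between separation and SNR detection summarized above can be sketched as a simple loop. This is only an outline under stated assumptions, not the system itself: `separate` is a hypothetical callable standing in for the HMM-based estimation with models adapted to a given SNR, the initial value of 0 dB represents the unadapted models, and the stopping tolerance `tol_db` and iteration cap `max_iter` are illustrative choices.

```python
import numpy as np

def snr_db(target, interferer, eps=1e-12):
    """Energy ratio of two signals in dB."""
    t = np.sum(np.square(target))
    i = np.sum(np.square(interferer))
    return 10.0 * np.log10((t + eps) / (i + eps))

def iterative_separation(mixture, separate, init_snr_db=0.0,
                         tol_db=0.5, max_iter=20):
    """Alternate separation and SNR detection until the SNR settles.

    separate(mixture, snr) -> (target_est, interferer_est) is assumed to
    adapt the speaker models to `snr` internally (hypothetical interface).
    Returns the final estimates, the detected SNR, and the iteration count.
    """
    snr = init_snr_db
    for n_iter in range(1, max_iter + 1):
        # Step 1: estimate both sources with models adapted to the current SNR.
        target_est, interferer_est = separate(mixture, snr)
        # Step 2: detect the input SNR from the two estimates.
        new_snr = snr_db(target_est, interferer_est)
        converged = abs(new_snr - snr) < tol_db
        snr = new_snr
        if converged:
            break
    return target_est, interferer_est, snr, n_iter
```

Passing the separator as a callable keeps the loop independent of how the sources are estimated, so the same outline applies to either the MMSE or the MAP estimator.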
We note that SNR estimation in our system uses the whole mixture, which would not be feasible for real-time applications. However, one can slightly modify it to work in real time. For example, at a given frame, one could use only the previous frames for Viterbi decoding and SNR detection. The detected SNR could be used to adapt the speaker models for separation in later frames and then be updated correspondingly. Such an update may be performed periodically to track the input SNR, and the update frequency would depend on the extent to which the input SNR varies.
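One possible form of such a periodic, causal update is sketched below. It is a simplification under stated assumptions: the per-frame energies of the two estimated sources are taken as given inputs, whereas in a real system they would themselves depend on the SNR used for separation; the function name `causal_snr_track`, the update period `update_every`, and the initial value of 0 dB are hypothetical.

```python
import numpy as np

def causal_snr_track(target_energy, interferer_energy, update_every=10,
                     init_snr_db=0.0, eps=1e-12):
    """Return, for each frame, the SNR (dB) available for model adaptation.

    Only frames that have already been processed contribute to the
    estimate, and the estimate is refreshed every `update_every` frames.
    """
    t_cum, i_cum = 0.0, 0.0
    snr = init_snr_db
    track = np.empty(len(target_energy))
    for n, (t, i) in enumerate(zip(target_energy, interferer_energy)):
        # The SNR used at frame n is based on earlier frames only.
        track[n] = snr
        t_cum += t
        i_cum += i
        # Periodic update from the accumulated past energies.
        if (n + 1) % update_every == 0:
            snr = 10.0 * np.log10((t_cum + eps) / (i_cum + eps))
    return track
```

A shorter update period tracks a varying input SNR more closely, at the cost of adapting the models more often.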
In this work, our description is limited to two-talker situations as in related model-based methods. The proposed system could be extended to deal with multi-talker separation problems. For example, the MMSE estimators can be extended to perform three-talker separation according to [9]. As for iterative estimation, one can estimate the energy ratios between multiple speakers instead of the SNR in the two-speaker case and adapt the speaker models accordingly. One issue in multi-talker situations is that the complexity of (13) is exponential in the number of speakers, and a faster decoding method thus needs to be used (e.g., [9, 30]).
References
 1.
Wang DL, Brown GJ (eds): Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Hoboken: Wiley-IEEE Press; 2006.
 2.
Hu G, Wang DL: A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans. Audio, Speech, Lang. Process 2010, 18: 2067–2079.
 3.
Shao Y, Wang DL: Sequential organization of speech in computational auditory scene analysis. Speech Comm 2009, 51: 657–667. 10.1016/j.specom.2009.02.003
 4.
Shao Y, Srinivasan S, Jin Z, Wang DL: A computational auditory scene analysis system for speech segregation and robust speech recognition. Comput. Speech Lang 2010, 24: 77–93. 10.1016/j.csl.2008.03.004
 5.
Hu G, Wang DL: Auditory segmentation based on onset and offset analysis. IEEE Trans. Audio, Speech, Lang. Process 2007, 15: 396–405.
 6.
Barker J, Ma N, Coy A, Cooke M: Speech fragment decoding techniques for simultaneous speaker identification and speech recognition. Comput. Speech Lang 2010, 24: 94–111. 10.1016/j.csl.2008.05.003
 7.
Hu K, Wang DL: An unsupervised approach to cochannel speech separation. IEEE Trans. Audio, Speech, Lang. Process 2013, 21: 120–129.
 8.
Roweis S: One microphone source separation. Adv. Neural Inf. Process. Syst. 2001, 13: 793–799.
 9.
Reddy A, Raj B: Soft mask methods for single-channel speaker separation. IEEE Trans. Audio, Speech, Lang. Process 2007, 15(6):1766–1776.
 10.
Radfar MH, Dansereau RM: Single-channel speech separation using soft masking filtering. IEEE Trans. Audio, Speech, Lang. Process 2007, 15(8):2299–2310.
 11.
Hershey JR, Rennie SJ, Olsen PA, Kristjansson TT: Super-human multi-talker speech recognition: a graphical modeling approach. Comput. Speech Lang 2010, 24: 45–66. 10.1016/j.csl.2008.11.001
 12.
Stark M, Wohlmayr M, Pernkopf F: Source-filter-based single-channel speech separation using pitch information. IEEE Trans. Audio, Speech, Lang. Process 2011, 19(2):242–255.
 13.
Weiss R, Ellis D: Speech separation using speaker-adapted eigenvoice speech models. Comput. Speech Lang 2010, 24: 16–29. 10.1016/j.csl.2008.03.003
 14.
Mysore GJ, Smaragdis P, Raj B: Nonnegative hidden Markov modeling of audio with application to source separation. In Proc. 9th Int. Conf. Latent Variable Analysis and Signal Separation. Heidelberg: Springer; 2010.
 15.
Smaragdis P: Convolutive speech bases and their application to supervised speech separation. IEEE Trans. Audio, Speech, Lang. Process 2007, 15: 1–12.
 16.
Mowlaee P, Christensen MG, Jensen SH: New results on single-channel speech separation using sinusoidal modeling. IEEE Trans. Audio, Speech, Lang. Process 2011, 19: 1265–1277.
 17.
Yeung YT, Lee T, Leung CC: Integrating multiple observations for model-based single-microphone speech separation with conditional random fields. In Proc. ICASSP. New York: IEEE; 2012:257–260.
 18.
Radfar MH, Dansereau RM: Long-term gain estimation in model-based single channel speech separation. In Proc. WASPAA. New York: IEEE; 2007.
 19.
Radfar MH, Wong W, Dansereau RM, Chan WY: Scaled factorial hidden Markov models: a new technique for compensating gain differences in model-based single channel speech separation. 2010.
 20.
Mowlaee P, Saeidi R, Christensen MG, Tan ZH, Kinnunen T, Franti P, Jensen SH: A joint approach for single-channel speaker identification and speech separation. IEEE Trans. Audio, Speech, Lang. Process 2012, 20(9):2586–2601.
 21.
Saeidi R, Mowlaee P, Kinnunen T, Tan ZH, Christensen MG, Jensen SH, Franti P: Signal-to-signal ratio independent speaker identification for cochannel speech signals. In Proc. 20th International Conference on Pattern Recognition (ICPR). New York: IEEE; 2010:4565–4568.
 22.
Nádas A, Nahamoo D, Picheny MA: Speech recognition using noise-adaptive prototypes. IEEE Trans. Acoust., Speech, Signal Process 1989, 37: 1495–1503. 10.1109/29.35387
 23.
Mowlaee P, Martin R: On phase importance in parameter estimation for single-channel source separation. In Proceedings of IWAENC 2012, International Workshop on Acoustic Signal Enhancement. VDE. New York: IEEE; 2012:1–4.
 24.
Varga AP, Moore RK: Hidden Markov model decomposition of speech and noise. 1990.
 25.
Shao Y, Wang DL: Model-based sequential organization in cochannel speech. IEEE Trans. Audio, Speech, Lang. Process 2006, 14: 289–298.
 26.
Narayanan A, Wang DL: A CASA-based system for long-term SNR estimation. IEEE Trans. Audio, Speech, Lang. Process 2012, 20: 2518–2527.
 27.
Cooke M, Lee T: Speech Separation Challenge. 21 September 2006. http://staffwww.dcs.shef.ac.uk/people/M.Cooke/SpeechSeparationChallenge.htm
 28.
Kim G, Lu Y, Hu Y, Loizou PC: An algorithm that improves speech intelligibility in noise for normal-hearing listeners. J. Acoust. Soc. Am 2009, 126(3):1486–1494.
 29.
Taal CH, Hendriks RC, Heusdens R, Jensen J: A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2010:4214–4217.
 30.
Rennie S, Hershey J, Olsen P: Single-channel multitalker speech recognition: graphical modeling approaches. IEEE Signal Process. Mag 2010, 27(6):66–80.
Acknowledgements
This research was supported by an AFOSR grant (FA9550-12-1-0130).
Additional information
Competing interests
Both authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Keywords
 Speech Signal
 Gaussian Mixture Model
 Nonnegative Matrix Factorization
 Iterative Estimation
 Speaker Model