An iterative model-based approach to cochannel speech separation
 Ke Hu^{1} and
 DeLiang Wang^{1, 2}
https://doi.org/10.1186/1687-4722-2013-14
© Hu and Wang; licensee Springer. 2013
Received: 24 January 2013
Accepted: 10 June 2013
Published: 26 June 2013
Abstract
Cochannel speech separation aims to separate two speech signals from a single mixture. In a supervised scenario, the identities of two speakers are given, and current methods use pretrained speaker models for separation. One issue in model-based methods is the mismatch between training and test signal levels. We propose an iterative algorithm to adapt speaker models to match the signal levels in testing. Our algorithm first obtains initial estimates of source signals using unadapted speaker models and then detects the input signal-to-noise ratio (SNR) of the mixture. The input SNR is then used to adapt the speaker models for more accurate estimation. The two steps iterate until convergence. Compared to search-based SNR detection methods, our method is not limited to given SNR levels. Evaluations demonstrate that the iterative procedure converges quickly in a considerable range of SNRs and improves separation results significantly. Comparisons show that the proposed system performs significantly better than related model-based systems.
1 Introduction
In daily listening environments, noise corrupts speech and creates substantial difficulty for various applications such as hearing aid design and automatic speech recognition. When noise is a nonspeech signal, existing algorithms often exploit the intrinsic properties of speech/noise for segregation. However, when interference is another voice, the generic properties of speech signals alone are insufficient for separation, and current methods also utilize speaker characteristics. The problem of separating two voices from a single mixture is often referred to as cochannel speech separation. Depending on the information used in cochannel speech separation, we can classify the algorithms into two categories: unsupervised and supervised. In unsupervised methods, speaker identities and pretraining with clean speech are not available, while supervised methods often assume both.
Motivated by human perceptual principles, computational auditory scene analysis (CASA) aims to segregate a voice of interest by exploiting inherent features of speech such as pitch and common onsets [1]. CASA methods are typically unsupervised. For example, pitch and amplitude modulation are utilized to separate voiced portions of cochannel speech, and the estimated pitches in neighboring frames are grouped using pitch continuity [2]. To group temporally disjoint time-frequency (TF) regions, a system [3] employs speaker models to perform a joint estimation of speaker identities and sequential grouping. Later in [4], the system is extended to handle unvoiced speech based on onset/offset-based segmentation [5] and model-based grouping. Similarly, another CASA system extracts speaker-homogeneous TF regions and employs speaker models and missing-data techniques to group them into speech streams [6]. Note that the aforementioned methods use speaker models for sequential grouping, or to group temporally disjoint speech regions, and thus are not completely unsupervised. A recent system [7] applies unsupervised clustering to group speech regions into two speaker groups by maximizing the ratio of between- and within-cluster distances.
Supervised methods often formulate separation as an estimation problem, i.e., given an input mixture, one estimates the two underlying speech sources. To solve this underdetermined problem, a general approach is to represent the speakers by two trained models, and the two patterns (one from each speaker) best approximating the mixture are used to reconstruct the sources. For example, an early study [8] employs a factorial hidden Markov model (HMM) to model a speaker, and a binary mask is generated by comparing the two estimated sources. In another system [9], Gaussian mixture models (GMMs) are used to describe speakers, and speech signals are estimated by a minimum mean-square error (MMSE) estimator. In MMSE estimation, the posterior probabilities of all Gaussian pairs are computed and used to reconstruct the sources (see [10] for a similar system). The GMM-based methods [9, 10] do not model the temporal dynamics of speech. A layered HMM model is proposed to model both temporal and grammar dynamics by transition matrices [11]. A 2D Viterbi decoding technique is used to detect the most likely Gaussian pair in each frame, and a maximum a posteriori (MAP) estimator is used for estimation. In a speaker-independent setting, Stark et al. [12] propose a factorial HMM to model vocal tract characteristics and use detected pitch to reconstruct speech sources. In addition to these methods, other models are applied to capture speakers, including eigenvectors to model and adapt speakers [13], non-negative matrix factorization-based models [14, 15], and sinusoidal models [16].
As pointed out in [9], one problem the model-based methods face is generalization to different input signal-to-noise ratio (SNR) levels (note here that we consider interfering speech as noise). The system [9] does not address this problem and assumes that test mixtures have the same energy level as the training mixtures. Further, the system is designed to handle only 0-dB mixtures. Similarly, a conditional random field-based method in [17] is only applied to separate 0-dB speech mixtures. The factorial HMM system [12] employs quantile filtering to estimate a gain for each frame and then uses that gain to adjust the corresponding mean vector in a codebook. Radfar and Dansereau [18] propose a search-based method to detect the input SNR, but one has to specify the search range. In this method, different gains are hypothesized, and the one maximizing the likelihood of the whole utterance is taken as the estimate. Radfar et al. [19] use a quadratic function to approximate the likelihood function of a factorial HMM and employ an iterative approach to estimate the gain. The HMM system [11] detects the model gains jointly with the speaker identities given a closed set of speakers and uses an expectation-maximization (EM) algorithm to further adapt the gains. However, the complexity of gain adaptation is quadratic in the number of states, and the convergence speed of the EM algorithm is unknown. Sinusoidal models are also employed to model speakers for joint speaker separation and identification [20], and SNR estimation can be achieved by adapting a universal background model using segregated speech [21].
In this work, we propose an iterative algorithm to generalize to different input SNR conditions given speaker identities. Building on the GMM system [9], we first incorporate temporal dynamics using transition matrices [11]. Then, our algorithm estimates initial TF masks for two speakers by assuming that the input SNR is 0 dB. The initial masks are used to estimate an utterance-level SNR, which is in turn used to adapt the speaker models. Then, the adapted models are used in a new iteration of separation. The above two steps iterate until both input SNR and the estimated masks become stable. Experiments show that it converges relatively fast and is computationally simple. Compared to the method of [19], our method is simpler and can be applied to factorial HMMs as well as other models (e.g., GMMs). In addition, our method does not require a search range for the estimated input SNR. Comparisons show that the proposed algorithm significantly outperforms related methods.
The rest of the paper is organized as follows. We first present the basic model in Section 2. Section 3 describes iterative estimation. Evaluation and comparison are given in Section 4, and we conclude the paper in Section 5.
2 Model-based separation
where $\mathcal{X}_a(c,m)$, $\mathcal{X}_b(c,m)$, and $\mathcal{Y}(c,m)$ represent the logarithms of $X_a(c,m)$, $X_b(c,m)$, and $Y(c,m)$, respectively. The log-max approximation was originally proposed in [22] to describe the mixing process of speech and noise in robust speech recognition and was later employed in two-speaker separation. A mathematical analysis in [9] shows that the approximation error in (2) is reasonable, but more accurate approximations exist that take both amplitude and phase into consideration [23].
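As a numerical illustration of the log-max idea, the following minimal sketch compares the exact logarithm of two summed energies with the larger of the two log energies. It assumes that the energies of the two sources add within a TF unit (phase ignored) and that natural logarithms are used; the function names `log_sum` and `log_max` are hypothetical and not part of the system described here.

```python
import math


def log_sum(xa: float, xb: float) -> float:
    """Exact log of the summed energies, given the two log energies."""
    hi, lo = max(xa, xb), min(xa, xb)
    # log(e^hi + e^lo) = hi + log(1 + e^(lo - hi)), computed stably
    return hi + math.log1p(math.exp(lo - hi))


def log_max(xa: float, xb: float) -> float:
    """Log-max approximation: the stronger source dominates the TF unit."""
    return max(xa, xb)


if __name__ == "__main__":
    for xa, xb in [(5.0, 5.0), (5.0, 3.0), (5.0, 0.0)]:
        err = log_sum(xa, xb) - log_max(xa, xb)
        print(f"xa={xa}, xb={xb}: approximation error = {err:.4f}")
```

Under these assumptions, the error is largest (log 2) when the two sources have equal energy and shrinks quickly as one source dominates.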
2.1 Speaker models
We use a gammatone filterbank consisting of 128 filters to decompose the input signal into different frequency channels [1]. The center frequencies of the filters spread logarithmically from 50 to 8,000 Hz. Each filtered signal is then divided into 20-ms time frames with a 10-ms frame shift, resulting in a cochleagram. The log spectra are computed by taking the element-wise logarithm of the energy in the cochleagram matrix.
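The framing step can be sketched as follows for a single, already filtered channel. This is only an illustration: the gammatone filtering itself is omitted, the 16-kHz sampling rate is taken from Section 4, and the function name `log_energy_frames` and the small energy floor are assumptions introduced here to avoid taking the logarithm of zero.

```python
import math


def log_energy_frames(signal, fs=16000, frame_ms=20, shift_ms=10, floor=1e-12):
    """Return the log energy of each 20-ms frame (10-ms shift) of one channel."""
    frame_len = fs * frame_ms // 1000   # 320 samples at 16 kHz
    shift = fs * shift_ms // 1000       # 160 samples at 16 kHz
    log_energies = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len]
        energy = sum(s * s for s in frame)
        # The floor is an assumption of this sketch, not part of the paper.
        log_energies.append(math.log(max(energy, floor)))
    return log_energies
```

Applying this function to each of the 128 filter outputs and stacking the results would give one log-spectral vector per frame.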
For each speaker, the conditional distribution given a specific Gaussian is a 128-dimensional Gaussian distribution, i.e., $p(\mathbf{x}_a \mid k_a)=\prod_{c=1}^{128} N(x_a^c;\mu_{a,k_a}^c,\sigma_{a,k_a}^c)$ and $p(\mathbf{x}_b \mid k_b)=\prod_{c=1}^{128} N(x_b^c;\mu_{b,k_b}^c,\sigma_{b,k_b}^c)$, where $k_a$ and $k_b$ are two Gaussian indices, and $p(x_a^c \mid k_a)$ and $p(x_b^c \mid k_b)$ are one-dimensional Gaussians.
Here, we use subscripts $x_a^c$ and $x_b^c$ to differentiate the probability functions for speakers a and b. $\Phi_{x_a^c}(\cdot \mid k_a)$ and $\Phi_{x_b^c}(\cdot \mid k_b)$ are their corresponding cumulative distributions. In a probabilistic manner, (5) provides a way of approximating the mixture using two clean speaker models, which in turn can be used to estimate two source signals given the mixture as the observation.
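To make the role of the densities and cumulative distributions concrete, the sketch below evaluates the density of the maximum of two independent Gaussians, which is the standard form of a log-max mixture likelihood for one Gaussian pair. It is offered under the assumption that this is the form the likelihood takes here; the function names are hypothetical, and the `sigma` arguments are treated as standard deviations.

```python
import math


def gauss_pdf(x, mu, sigma):
    """One-dimensional Gaussian density (sigma is a standard deviation)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def gauss_cdf(x, mu, sigma):
    """One-dimensional Gaussian cumulative distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def logmax_channel_likelihood(y, mu_a, sig_a, mu_b, sig_b):
    """Density of y = max(x_a, x_b) for independent Gaussian x_a and x_b."""
    a_is_max = gauss_pdf(y, mu_a, sig_a) * gauss_cdf(y, mu_b, sig_b)
    b_is_max = gauss_pdf(y, mu_b, sig_b) * gauss_cdf(y, mu_a, sig_a)
    return a_is_max + b_is_max


def logmax_frame_loglik(y, mu_a, sig_a, mu_b, sig_b, floor=1e-300):
    """Sum of per-channel log-likelihoods for one frame and one Gaussian pair."""
    total = 0.0
    for args in zip(y, mu_a, sig_a, mu_b, sig_b):
        total += math.log(max(logmax_channel_likelihood(*args), floor))
    return total
```

The first term covers the case where speaker a dominates the channel and the second the case where speaker b dominates, so the two terms also indicate how a per-channel mask can be formed.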
2.2 Source estimation
The MMSE estimate of speaker b can be computed similarly.
Note that the soft mask for speaker b is $p(x_a^c \le x_b^c \mid \mathbf{y}) = 1 - p(x_a^c > x_b^c \mid \mathbf{y})$. In [9], the soft mask is found to perform consistently better than a binarized mask.
where ${k}_{a}^{\ast}$ and ${k}_{b}^{\ast}$ correspond to the pair of Gaussians yielding the highest posterior probability among all possible pairs. The estimate of source signals can be computed similarly to (11) but using only ${k}_{a}^{\ast}$ and ${k}_{b}^{\ast}$. A soft mask can also be derived like (12) using only ${k}_{a}^{\ast}$ and ${k}_{b}^{\ast}$. In experiments, we find that the performance of the MAP estimator is similar to that of MMSE, mainly because at each frame, one pair of Gaussians often approximates the mixture much better than others.
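A soft mask restricted to a single Gaussian pair can be sketched as below. The sketch assumes the log-max model, so that the probability of speaker a dominating a channel is the share of the "a is larger" term in the density of the maximum of two independent Gaussians. The name `map_pair_soft_mask`, the use of standard deviations, and the small constant in the denominator are assumptions of this illustration.

```python
import math


def _pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def _cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def map_pair_soft_mask(y, mean_a, std_a, mean_b, std_b, tiny=1e-300):
    """Per-channel probability that speaker a dominates, for one Gaussian pair."""
    mask = []
    for yc, ma, sa, mb, sb in zip(y, mean_a, std_a, mean_b, std_b):
        a_dominates = _pdf(yc, ma, sa) * _cdf(yc, mb, sb)
        b_dominates = _pdf(yc, mb, sb) * _cdf(yc, ma, sa)
        # tiny guards against division by zero far in the tails
        mask.append(a_dominates / (a_dominates + b_dominates + tiny))
    return mask
```

The mask for speaker b is one minus this value in each channel. When one pair explains the frame far better than all others, a mask of this kind should be close to the posterior-weighted average over all pairs.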
2.3 Incorporating temporal dynamics
The cochannel speech separation system in [9] models speaker characteristics using GMMs and ignores the temporal information of speech signals. A natural extension to the GMMs to incorporate temporal dynamics is using a factorial HMM model. Specifically, for each speaker, we can estimate the most likely Gaussian index for each frame in a clean utterance using a MAP estimator. Each utterance thus generates a sequence of Gaussian indices. The transitions between all neighboring Gaussian indices are then used to build a 2D histogram, which can then be normalized to produce a transition matrix [11].
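The histogram-and-normalize construction described above can be sketched as follows. Rows index the previous Gaussian and columns the current one, so each row is normalized to sum to one. The function name, the optional additive smoothing, and the uniform fallback for Gaussians that never occur are assumptions of this sketch.

```python
def transition_matrix(index_sequences, num_states, smoothing=0.0):
    """Estimate p(current | previous) from per-utterance Gaussian index sequences."""
    counts = [[smoothing] * num_states for _ in range(num_states)]
    for seq in index_sequences:
        # Count every pair of neighboring indices within an utterance.
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        if total > 0:
            matrix.append([c / total for c in row])
        else:
            # Unseen previous state: fall back to a uniform row (assumption).
            matrix.append([1.0 / num_states] * num_states)
    return matrix
```

For example, `transition_matrix([[0, 1, 1, 0], [0, 0, 1]], 2)` counts the transitions 0→1 twice, 0→0 once, 1→1 once, and 1→0 once before normalizing each row.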
In the factorial HMM system, the hidden states of the two HMMs at each frame are the most likely Gaussian indices of two speakers. While the detection of the Gaussian indices is based on only individual frames in a GMM-based model, a 2D Viterbi search is used in [11] to find the most likely Gaussian index sequences. Specifically, the 2D Viterbi integrates all frames and the transition information across time to find the most likely two Gaussian sequences, each of which corresponds to one speaker [24].
where $p(k_a \mid k_a')$ is the transition probability of speaker a from state $k_a'$ to $k_a$, and $p(k_b \mid k_b')$ is that of speaker b. $p(y_t \mid k_a,k_b)$ can be calculated similarly as in (8). The optimal Gaussian index sequences are detected by a 2D Viterbi decoding [24], and the MAP estimator is used for estimating sources.
In (15), an exhaustive search for each pair of $k_a$ and $k_b$ across T frames has a complexity of $O(TK^4)$, where K is the number of Gaussians for each speaker and T is the number of frames. It is time consuming if K is relatively large. In our study, we use a beam search to speed up the process (see also [25]). Given a beam width of W, we only search for the W most likely previous state pairs (i.e., $k_a'$ and $k_b'$ in (15)), and the time complexity is reduced to $O(TWK^2)$. The results presented in Section 4 indicate that a beam width of 16 gives a comparable performance to the exhaustive search.
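A beam-pruned 2D Viterbi search of this kind can be sketched as follows. The sketch assumes that frame log-likelihoods for every Gaussian pair are already available as `loglik[t][ka][kb]`, and that transitions and initial probabilities are given in the log domain; all argument names are hypothetical. At each frame, only the `beam_width` best previous state pairs are considered as predecessors.

```python
import heapq


def beam_viterbi_2d(loglik, log_trans_a, log_trans_b, log_init_a, log_init_b, beam_width):
    """Return the most likely sequence of (ka, kb) pairs under beam pruning."""
    num_a, num_b = len(log_init_a), len(log_init_b)
    # score[(ka, kb)]: best log-probability of a path ending in that pair
    score = {(ka, kb): log_init_a[ka] + log_init_b[kb] + loglik[0][ka][kb]
             for ka in range(num_a) for kb in range(num_b)}
    backptrs = []
    for t in range(1, len(loglik)):
        # Keep only the W most likely previous state pairs.
        survivors = heapq.nlargest(beam_width, score.items(), key=lambda kv: kv[1])
        new_score, ptr = {}, {}
        for ka in range(num_a):
            for kb in range(num_b):
                best_prev, best_val = max(
                    (((pa, pb), s + log_trans_a[pa][ka] + log_trans_b[pb][kb])
                     for (pa, pb), s in survivors),
                    key=lambda item: item[1])
                new_score[(ka, kb)] = best_val + loglik[t][ka][kb]
                ptr[(ka, kb)] = best_prev
        backptrs.append(ptr)
        score = new_score
    # Backtrack from the best final pair.
    state = max(score, key=score.get)
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

Setting `beam_width` to the total number of state pairs recovers the exhaustive search, which makes it easy to compare pruned and unpruned decoding on small examples.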
3 Iterative estimation
As mentioned in Section 1, model-based methods such as [9] face the difficulty of generalizing to different mixing conditions. It is partly because the GMMs are trained using log-spectral vectors and hence are sensitive to the overall speech energy. More importantly, if the GMMs of two speakers are trained using clean utterances at certain energy levels, in testing they need to be adjusted according to the input SNR. In [9], mixtures with nonzero input SNR are separated using unadjusted models, but the performance is worse.
We propose to detect the input SNR and use that to adapt the speaker models and re-estimate the sources. To estimate the input SNR from the mixture, one has to first have some source information. Thus, SNR detection and source estimation become a chicken-and-egg problem, i.e., the performance of one task depends on the success of the other. One general approach to deal with this type of problem is to perform an iterative estimation (e.g., [2]). In the initial stage of the iterative procedure, we apply the unadapted speaker models to obtain initial separation. Based on the initial source estimates, we calculate the input SNR and use that to adapt the speaker models. The adapted models are in turn used to re-estimate the sources. The two steps iterate until convergence. As an alternative, we also explore a search-based method which jointly estimates sources and the input SNR.
3.1 Initial mask estimation
For a pair of speakers, we first perform an initial estimate by using their models pretrained using clean utterances at a per-utterance energy level of 60 dB. Initially, the input SNR is assumed to be 0 dB, and a mixture is scaled to an energy level of 63 dB corresponding to the addition of two 60-dB source signals. We use the 2D Viterbi decoding based on (15) to detect the most likely Gaussian index sequence and then estimate a soft mask of the target speaker using the MAP estimator in Section 2.2.
3.2 SNR estimation and model adaptation
where $M_a(c,m)$ denotes the ratio of speaker a at the TF unit of channel c and frame m, and $M_b(c,m)=1-M_a(c,m)$. R corresponds to the input SNR of the filtered speech signals. As analyzed in [26], due to gammatone filtering which has a certain passband, one usually should compensate for the loss of energy to calculate the SNR of the original time-domain signals. However, in our work, the frequency range of the gammatone filterbank is between 50 and 8,000 Hz, and both target and interference are speech signals with a sampling frequency of 16 kHz. There is thus little energy loss in the filtering process, and the estimated SNR of filtered signals is close to that of the original time-domain signals. Thus, we directly use the SNR of filtered signals in (16) as our estimate.
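A mask-based SNR estimate of this kind can be sketched as follows. The sketch assumes that each mask value is the fraction of the mixture energy in a TF unit that belongs to speaker a, so that the target and interferer energies are mask-weighted sums of the mixture energy. This weighting, the function name `estimate_input_snr`, and the small constant `eps` are assumptions of the illustration rather than a restatement of (16).

```python
import math


def estimate_input_snr(mask_a, mixture_energy, eps=1e-12):
    """Utterance-level SNR in dB from a soft mask and per-unit mixture energy.

    mask_a and mixture_energy are lists of rows indexed as [channel][frame].
    """
    target = 0.0
    interferer = 0.0
    for mask_row, energy_row in zip(mask_a, mixture_energy):
        for m, e in zip(mask_row, energy_row):
            target += m * e              # energy attributed to speaker a
            interferer += (1.0 - m) * e  # remainder attributed to speaker b
    return 10.0 * math.log10((target + eps) / (interferer + eps))
```

With a mask of 0.5 in every unit, the estimate is 0 dB regardless of the mixture energy, which matches the initial assumption of equal-level sources.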
where $x_b[t]$ denotes the time-domain speech of speaker b. That is, instead of using 60-dB utterances, the interferer model should be trained using 60−R dB signals, and the original utterances should be scaled by a multiplicative factor of $10^{-R/10}$. Since the difference lies in a constant factor, we can directly adjust the parameters of the GMM models, i.e., the mean and variance. Specifically, the means of the interferer GMM are shifted by an additive term $\beta = \log(10^{-R/10})$ since log-spectral vectors are used in training, while the variances remain unchanged because β is an additive term.
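The mean shift can be illustrated with a few lines of code. The sketch assumes natural logarithms for the log spectra and represents the interferer GMM means as a list of mean vectors; the function name `adapt_interferer_means` is hypothetical.

```python
import math


def adapt_interferer_means(means, snr_db):
    """Shift every log-spectral mean of the interferer GMM by beta.

    beta = log(10 ** (-R / 10)) for an estimated input SNR of R dB.
    Variances are left untouched because the shift is additive.
    """
    beta = math.log(10.0 ** (-snr_db / 10.0))
    return [[mu + beta for mu in mean_vec] for mean_vec in means]
```

For R = 0 dB the shift is zero and the models are unchanged; for R = 10 dB every mean moves down by log 10 ≈ 2.30, and for negative R the means move up.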
where $y[t]$ is the time-domain cochannel signal, and $x_a[t]$ is the source signal of speaker a. In the above calculation, we assume that the time-domain target and interfering signal are uncorrelated at each frame. Given (17) and (18), we have adapted the interfering speaker model and the mixture and created a more matched condition for separation.
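Under the uncorrelatedness assumption, the energies of the target and the interferer add, which gives a simple way to compute the level the mixture should be scaled to. The sketch below assumes a 60-dB target and a (60−R)-dB interferer, as in the text; the function name `mixture_level_db` is hypothetical, and the calculation is an illustration rather than a restatement of (18).

```python
import math


def mixture_level_db(snr_db, target_db=60.0):
    """Energy level (dB) of the sum of an uncorrelated target and interferer."""
    target_energy = 10.0 ** (target_db / 10.0)
    interferer_energy = 10.0 ** ((target_db - snr_db) / 10.0)
    return 10.0 * math.log10(target_energy + interferer_energy)
```

For R = 0 dB this yields about 63 dB, the level used in the initial stage (Section 3.1). For large positive R the level approaches 60 dB because the interferer contributes little energy.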
3.3 Iterative estimation
Given any input mixture, we first obtain the initial mask estimates $M_{a,0}$ and $M_{b,0}$ as described in Section 3.1. Given $M_{a,0}$ and $M_{b,0}$, we then estimate the input SNR using (16). The estimated SNR is used to adapt the model of speaker b and the mixture by (17) and (18), respectively. They are then used together with the target speaker model to re-estimate the soft masks based on the 2D Viterbi decoding described in Section 2.3 and the MAP estimator in Section 2.2. To get the maximal performance, the iterative process should continue until neither the estimated input SNR nor the speaker masks change. However, empirically, we observe that the separation performance becomes stable when the change in the estimated input SNR is smaller than 0.5 dB. We thus use this as the stopping criterion and terminate the estimation process when the difference of estimated input SNRs between two iterations is less than 0.5 dB.
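The control flow of the iteration can be sketched as follows. The two callables are placeholders: `separate(snr_db)` stands for model adaptation followed by 2D Viterbi decoding and MAP mask estimation, and `estimate_snr(masks)` stands for the mask-based SNR estimate. Both names, the iteration cap `max_iters`, and the choice to compare the first estimate against the assumed 0 dB are assumptions of this sketch.

```python
def iterative_separation(separate, estimate_snr, tol_db=0.5, max_iters=20):
    """Alternate separation and SNR estimation until the SNR estimate settles."""
    snr_db = 0.0                 # initial assumption: 0-dB mixture
    masks = separate(snr_db)     # initial pass with unadapted models
    history = []
    for _ in range(max_iters):
        new_snr = estimate_snr(masks)
        history.append(new_snr)
        converged = abs(new_snr - snr_db) < tol_db
        snr_db = new_snr
        if converged:
            break
        masks = separate(snr_db)  # re-estimate with adapted models
    return masks, snr_db, history


if __name__ == "__main__":
    # Toy stand-ins: each SNR estimate moves halfway toward a true SNR of 6 dB.
    masks, snr, history = iterative_separation(
        separate=lambda s: s,
        estimate_snr=lambda m: m + 0.5 * (6.0 - m))
    print(snr, history)
```

In the toy run, the estimates are 3, 4.5, 5.25, and 5.625 dB, and the loop stops at the fourth estimate because it differs from the third by less than 0.5 dB.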
3.4 An alternative method
In addition to the iterative method, we have also tried a search-based method to jointly estimate the source state sequences and the input SNR. For example, we use a test corpus described in Section 4 and hypothesize the input SNR in a range from −9 to 6 dB with an increment of 3 dB. At each hypothesized input SNR, we adapt the mixture and interfering speaker model according to (17) and (18) and use them to detect state sequences using the 2D Viterbi decoding, and then estimate the soft masks based on the MAP estimator. For all hypothesized SNR conditions, we calculate the joint likelihood of all mixture frames and the Gaussian sequences being generated by the factorial HMM, and the hypothesized input SNR corresponding to the highest likelihood is selected as the detected value. The corresponding state sequence is then used for estimation. We have evaluated the performance of this method using the corpus described in Section 4, and it is about 0.5 dB worse than the iterative method and is computationally more expensive. Note that the discrete SNR range includes the true SNR value in each testing condition to favor the SNR-based search method. How to specify the input SNR levels in search is unclear in practice.
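The search-based alternative reduces to scoring each hypothesized SNR and keeping the best one. In the sketch below, `score_hypothesis` is a hypothetical callable that would adapt the models for a given SNR, run the decoding, and return the joint log-likelihood; only the grid from −9 to 6 dB in 3-dB steps comes from the text.

```python
def search_snr(score_hypothesis, snr_grid=(-9, -6, -3, 0, 3, 6)):
    """Return the hypothesized SNR (dB) with the highest score, and that score."""
    best_snr = max(snr_grid, key=score_hypothesis)
    return best_snr, score_hypothesis(best_snr)


if __name__ == "__main__":
    # Toy score peaking at 2.2 dB: the search can only return a grid value.
    print(search_snr(lambda r: -(r - 2.2) ** 2))
```

The toy example shows the limitation noted above: when the true SNR falls between grid points, the search can only return the nearest hypothesized level, and each hypothesis requires a full decoding pass.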
4 Evaluation and comparisons
We use two-talker mixtures in the Speech Separation Challenge (SSC) corpus [27] for evaluation. For each speaker, a 256-component GMM model (i.e., K=256) is trained using all of the speaker’s clean utterances in the training set. Here, K is chosen with the consideration of performance and computation complexity. In training, each clean utterance is normalized to a 60-dB energy level, and the log spectra are calculated as described in Section 2.1. An HMM model is then built upon each GMM using the same utterances as described in Section 2.3. We use the test part of the SSC corpus and create two-speaker mixtures at SNRs from −9 to 6 dB (with an increment of 3 dB) for evaluation. We randomly select 100 two-speaker mixtures in each SNR condition for testing. Note that the mixture utterances are the same across different SNRs, and mixtures at opposite SNRs are not symmetric since they are generated by fixing the target and scaling the interfering utterances. The 100 mixtures contain 51 different-gender mixtures, 23 male-male mixtures, and 26 female-female mixtures. All test mixtures are downsampled from 25 to 16 kHz for faster processing.
where x_{ a }[t] and ${\widehat{x}}_{a}\left[t\right]$ are the original clean signals and signals resynthesized from the estimated mask, respectively. Note that a waveform signal can be obtained from a soft mask [1]. In our test conditions, target and interfering speakers are treated symmetrically, e.g., an interferer at 6 dB is considered as a target at −6 dB. Thus, at each input SNR, we calculate the target SNR gain as the average of the target SNR gain at that input SNR and the interferer SNR gain at the negative of that input SNR. For example, the SNR gain at −6 dB is the average of the target SNR gain at the −6 dB SNR and the interferer SNR gain at the 6 dB SNR.
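The evaluation arithmetic can be sketched as follows. The sketch assumes the common definition of output SNR as the ratio of clean-signal energy to the energy of the difference between the clean and resynthesized signals, with the SNR gain being the output SNR minus the input SNR. These definitions, the function names, and the constant `eps` are assumptions of the illustration.

```python
import math


def output_snr_db(clean, estimate, eps=1e-12):
    """SNR (dB) of a resynthesized signal measured against the clean source."""
    signal = sum(x * x for x in clean)
    noise = sum((x - y) ** 2 for x, y in zip(clean, estimate))
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def snr_gain_db(clean, estimate, input_snr_db):
    """Improvement of the output SNR over the input SNR of the mixture."""
    return output_snr_db(clean, estimate) - input_snr_db


def symmetric_gain(target_gain_at_snr, interferer_gain_at_neg_snr):
    """Average the target gain at SNR R and the interferer gain at SNR -R."""
    return 0.5 * (target_gain_at_snr + interferer_gain_at_neg_snr)
```

For instance, the reported gain at −6 dB would be `symmetric_gain(g_target, g_interferer)`, where `g_target` is the target gain measured on −6 dB mixtures and `g_interferer` is the interferer gain measured on 6 dB mixtures.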
4.1 System configuration
4.2 Comparisons

As shown in Figure 5, the proposed system achieves an SNR gain of 11.9 dB at the input SNR of −9 dB, and the gain decreases gradually as the input SNR increases. At 9 dB, the SNR gain is about 3.9 dB. On average, our method achieves an SNR gain of 7.4 dB. Compared to the method of Reddy and Raj, our method performs comparably at 0 dB but significantly better at other input SNRs. For example, the proposed system performs about 2.7 dB better at −9 dB, and the improvement gets smaller as the input SNR gets closer to 0 dB. A similar trend is also observed at positive input SNRs. On average, the proposed system performs 1.2 dB better than the Reddy and Raj method. In the figure, we also show the performance of another MMSE method (black bars), a version of the Reddy and Raj system that does not require the energy levels of training and testing to be the same. In this method, we assume the input SNR to be 0 dB and scale the mixture as described in Section 3.1. As we expect, the performance is a little worse (about 0.3 dB) than the original Reddy and Raj system due to the unmatched signal levels. We also compare to a MAP-based separation method described in Section 2.2. Using only the most likely Gaussian pair for estimation, the MAP method is more efficient than the MMSE method but performs about 0.1 dB worse. Our system performs about 1.6 dB better than the MAP-based method. To isolate the effect of iterative estimation, we have also evaluated the performance of the HMM system alone. As shown in the figure, this method achieves an average SNR gain of about 6.3 dB, about 0.5 dB better than the MAP-based method. This improvement comes from the use of temporal dynamics. Comparing this performance with the proposed system, we get the benefit of iterative estimation, which further increases the SNR gain of the HMM system by about 1.1 dB. In addition, we note that iterative estimation can also be incorporated into other model-based systems.
For example, we add iterative estimation to the MMSE method (denoted as MMSE-iterative in Figure 5) and obtain an improvement of 1.2 dB. Similarly, the MAP-iterative method outperforms the original MAP method by about 1.2 dB. Lastly, to show the upper bound performance of our system, we have utilized the true input SNR and ideal hidden states in estimation. This ideal performance is presented as the HMM ideal in Figure 5. It is about 0.9 dB better than the proposed system, which indicates that our system is close to the ceiling performance.
5 Conclusions
We have proposed an iterative algorithm for model-based cochannel speech separation. First, temporal dynamics is incorporated into speaker models using HMMs. We then present an iterative method to deal with signal level differences between training and test conditions. Specifically, the proposed system first uses unadapted speaker models to segregate two speech signals and detects the input SNR. The detected SNR is then used to adapt the interferer model and the mixture for re-estimation. The two steps iterate until convergence. Systematic evaluations show that our iterative method improves segregation performance significantly and also converges quickly. Comparisons show that it performs significantly better than related model-based methods in terms of SNR gains as well as HIT−FA and STOI scores.
We note that SNR estimation in our system uses the whole mixture, which would not be feasible for real-time applications. However, one can slightly modify it to work in real time. For example, at one frame, one could use only previous frames for Viterbi decoding and SNR detection. The detected SNR could be used to adapt speaker models for separation in later frames and then get updated correspondingly. Such an update may be performed periodically to track the input SNR, and the update frequency would depend on the extent to which the input SNR varies.
In this work, our description is limited to two-talker situations as in related model-based methods. The proposed system could be extended to deal with multitalker separation problems. For example, the MMSE estimators can be extended to perform three-talker separation according to [9]. As for iterative estimation, one can estimate the energy ratios between multiple speakers instead of the SNR in the two-speaker case and adapt the speaker models accordingly. One issue in multitalker situations is that the complexity of (13) is exponential in the number of speakers, and a faster decoding method thus needs to be used (e.g., [9, 30]).
Declarations
Acknowledgements
This research was supported by an AFOSR grant (FA9550-12-1-0130).
Authors’ Affiliations
References
1. Wang DL, Brown GJ (eds): Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Hoboken: Wiley-IEEE Press; 2006.
2. Hu G, Wang DL: A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans. Audio, Speech, Lang. Process 2010, 18: 2067-2079.
3. Shao Y, Wang DL: Sequential organization of speech in computational auditory scene analysis. Speech Comm 2009, 51: 657-667. 10.1016/j.specom.2009.02.003
4. Shao Y, Srinivasan S, Jin Z, Wang DL: A computational auditory scene analysis system for speech segregation and robust speech recognition. Comput. Speech Lang 2010, 24: 77-93. 10.1016/j.csl.2008.03.004
5. Hu G, Wang DL: Auditory segmentation based on onset and offset analysis. IEEE Trans. Audio, Speech, Lang. Process 2007, 15: 396-405.
6. Barker J, Ma N, Coy A, Cooke M: Speech fragment decoding techniques for simultaneous speaker identification and speech recognition. Comput. Speech Lang 2010, 24: 94-111. 10.1016/j.csl.2008.05.003
7. Hu K, Wang DL: An unsupervised approach to cochannel speech separation. IEEE Trans. Audio, Speech, Lang. Process 2013, 21: 120-129.
8. Roweis S: One microphone source separation. Adv. Neural Inf. Process. Syst. 2001, 13: 793-799.
9. Reddy A, Raj B: Soft mask methods for single-channel speaker separation. IEEE Trans. Audio, Speech, Lang. Process 2007, 15(6):1766-1776.
10. Radfar MH, Dansereau RM: Single-channel speech separation using soft masking filtering. IEEE Trans. Audio, Speech, Lang. Process 2007, 15(8):2299-2310.
11. Hershey JR, Rennie SJ, Olsen PA, Kristjansson TT: Super-human multi-talker speech recognition: a graphical modeling approach. Comput. Speech Lang 2010, 24: 45-66. 10.1016/j.csl.2008.11.001
12. Stark M, Wohlmayr M, Pernkopf F: Source-filter-based single-channel speech separation using pitch information. IEEE Trans. Audio, Speech, Lang. Process 19(2):242-255.
13. Weiss R, Ellis D: Speech separation using speaker-adapted eigenvoice speech models. Comput. Speech Lang 2010, 24: 16-29. 10.1016/j.csl.2008.03.003
14. Mysore GJ, Smaragdis P, Raj B: Non-negative hidden Markov modeling of audio with application to source separation. In Proc. 9th Int. Conf. Latent Variable Analysis and Signal Separation. Heidelberg: Springer; 2010.
15. Smaragdis P: Convolutive speech bases and their application to supervised speech separation. IEEE Trans. Audio, Speech, Lang. Process 2007, 15: 1-12.
16. Mowlaee P, Christensen MG, Jensen SH: New results on single-channel speech separation using sinusoidal modeling. IEEE Trans. Audio, Speech, Lang. Process 2011, 19: 1265-1277.
17. Yeung YT, Lee T, Leung CC: Integrating multiple observations for model-based single-microphone speech separation with conditional random fields. In Proc. ICASSP-12. New York: IEEE; 2012:257-260.
18. Radfar MH, Dansereau RM: Long-term gain estimation in model-based single channel speech separation. In Proc. WASPAA. New York: IEEE; 2007.
19. Radfar MH, Wong W, Dansereau RM, Chan WY: Scaled factorial hidden Markov models: a new technique for compensating gain differences in model-based single channel speech separation. 2010.
20. Mowlaee P, Saeidi R, Christensen MG, Tan ZH, Kinnunen T, Franti P, Jensen SH: A joint approach for single-channel speaker identification and speech separation. IEEE Trans. Audio, Speech, Lang. Process 2012, 20(9):2586-2601.
21. Saeidi R, Mowlaee P, Kinnunen T, Tan ZH, Christensen MG, Jensen SH, Franti P: Signal-to-signal ratio independent speaker identification for cochannel speech signals. In Proc. 20th International Conference on Pattern Recognition (ICPR). New York: IEEE; 2010:4565-4568.
22. Nádas A, Nahamoo D, Picheny MA: Speech recognition using noise-adaptive prototypes. IEEE Trans. Acoust., Speech, Signal Process 1989, 37: 1495-1503. 10.1109/29.35387
23. Mowlaee P, Martin R: On phase importance in parameter estimation for single-channel source separation. In Proc. IWAENC 2012, International Workshop on Acoustic Signal Enhancement. New York: IEEE; 2012:1-4.
24. Varga AP, Moore RK: Hidden Markov model decomposition of speech and noise. 1990.
25. Shao Y, Wang DL: Model-based sequential organization in cochannel speech. IEEE Trans. Audio, Speech, Lang. Process 2006, 14: 289-298.
26. Narayanan A, Wang DL: A CASA-based system for long-term SNR estimation. IEEE Trans. Audio, Speech, Lang. Process 2012, 20: 2518-2527.
27. Cooke M, Lee T: Speech Separation Challenge. 21 September 2006. http://staffwww.dcs.shef.ac.uk/people/M.Cooke/SpeechSeparationChallenge.htm
28. Kim G, Lu Y, Hu Y, Loizou PC: An algorithm that improves speech intelligibility in noise for normal-hearing listeners. 2009, 126(3):1486-1494.
29. Taal CH, Hendriks RC, Heusdens R, Jensen J: A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2010:4214-4217.
30. Rennie S, Hershey J, Olsen P: Single-channel multitalker speech recognition: graphical modeling approaches. IEEE Signal Process. Mag 2010, 27(6):66-80.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.