- Open Access
An iterative model-based approach to cochannel speech separation
© Hu and Wang; licensee Springer. 2013
- Received: 24 January 2013
- Accepted: 10 June 2013
- Published: 26 June 2013
Cochannel speech separation aims to separate two speech signals from a single mixture. In a supervised scenario, the identities of two speakers are given, and current methods use pre-trained speaker models for separation. One issue in model-based methods is the mismatch between training and test signal levels. We propose an iterative algorithm to adapt speaker models to match the signal levels in testing. Our algorithm first obtains initial estimates of source signals using unadapted speaker models and then detects the input signal-to-noise ratio (SNR) of the mixture. The input SNR is then used to adapt the speaker models for more accurate estimation. The two steps iterate until convergence. Compared to search-based SNR detection methods, our method is not limited to given SNR levels. Evaluations demonstrate that the iterative procedure converges quickly in a considerable range of SNRs and improves separation results significantly. Comparisons show that the proposed system performs significantly better than related model-based systems.
- Speech Signal
- Gaussian Mixture Model
- Nonnegative Matrix Factorization
- Iterative Estimation
- Speaker Model
In daily listening environments, noise corrupts speech and creates substantial difficulty for various applications such as hearing aid design and automatic speech recognition. When noise is a nonspeech signal, existing algorithms often exploit the intrinsic properties of speech/noise for segregation. However, when interference is another voice, the generic properties of speech signals alone are insufficient for separation, and current methods also utilize speaker characteristics. The problem of separating two voices from a single mixture is often referred to as cochannel speech separation. Depending on the information used in cochannel speech separation, we can classify the algorithms into two categories: unsupervised and supervised. In unsupervised methods, speaker identities and pretraining with clean speech are not available, while supervised methods often assume both.
Motivated by human perceptual principles, computational auditory scene analysis (CASA) aims to segregate a voice of interest by exploiting inherent features of speech such as pitch and common onsets . CASA methods are typically unsupervised. For example, pitch and amplitude modulation are utilized to separate voiced portions of cochannel speech, and the estimated pitches in neighboring frames are grouped using pitch continuity . To group temporally disjoint time-frequency (T-F) regions, a system  employs speaker models to perform a joint estimation of speaker identities and sequential grouping. Later in , the system is extended to handle unvoiced speech based on onset/offset-based segmentation  and model-based grouping. Similarly, another CASA system extracts speaker homogeneous T-F regions and employs speaker models and missing data techniques to group them into speech streams . Note that the aforementioned methods use speaker models for sequential grouping, or to group temporally disjoint speech regions, and thus are not completely unsupervised. A recent system  applies unsupervised clustering to group speech regions into two speaker groups by maximizing the ratio of between- and within-cluster distances.
Supervised methods often formulate separation as an estimation problem, i.e., given an input mixture, one estimates the two underlying speech sources. To solve this underdetermined equation, a general approach is to represent the speakers by two trained models, and the two patterns (each from one speaker) best approximating the mixture are used to reconstruct the sources. For example, an early study  employs a factorial hidden Markov model (HMM) to model a speaker, and a binary mask is generated by comparing the two estimated sources. In another system , Gaussian mixture models (GMM) are used to describe speakers, and speech signals are estimated by a minimum mean-square estimator (MMSE). In MMSE estimation, the posterior probabilities of all Gaussian pairs are computed and used to reconstruct the sources (see  for a similar system). The GMM-based methods [9, 10] do not model the temporal dynamics of speech. A layered HMM model is proposed to model both temporal and grammar dynamics by transition matrices . A 2-D Viterbi decoding technique is used to detect the most likely Gaussian pair in each frame, and a maximum a posteriori (MAP) estimator is used for estimation. In a speaker-independent setting, Stark et al.  propose a factorial HMM to model vocal tract characteristics and use detected pitch to reconstruct speech sources. In addition to these methods, other models are applied to capture speakers, including eigenvectors to model and adapt speakers , nonnegative matrix factorization-based models [14, 15], and sinusoidal models .
As pointed out in , one problem the model-based methods face is generalization to different input signal-to-noise ratio (SNR) levels (note here that we consider interfering speech as noise). The system  does not address this problem and assumes that test mixtures have the same energy level as the training mixtures. Further, the system is designed to only handle 0-dB mixtures. Similarly, a conditional random field-based method in  is only applied to separate 0-dB speech mixtures. The factorial HMM system  employs quantile filtering to estimate a gain for each frame and then uses that gain to adjust the corresponding mean vector in a codebook. Radfar and Dansereau  propose a search-based method to detect the input SNR, but one has to specify the search range. In this method, different gains are hypothesized, and the one maximizing the likelihood of the whole utterance is taken as the estimate. Radfar et al.  use a quadratic function to approximate the likelihood function of a factorial HMM and employ an iterative approach to estimate the gain. The HMM system  detects the model gains jointly with the speaker identities given a closed set of speakers and uses an expectation-maximization (EM) algorithm to further adapt the gains. However, the complexity of gain adaptation is quadratic in the number of states, and the convergence speed of the EM algorithm is unknown. Sinusoidal models are also employed to model speakers for joint speaker separation and identification , and SNR estimation can be achieved by adapting a universal background model using segregated speech .
In this work, we propose an iterative algorithm to generalize to different input SNR conditions given speaker identities. Building on the GMM system , we first incorporate temporal dynamics using transition matrices . Then, our algorithm estimates initial T-F masks for two speakers by assuming that the input SNR is 0 dB. The initial masks are used to estimate an utterance-level SNR, which is in turn used to adapt the speaker models. Then, the adapted models are used in a new iteration of separation. The above two steps iterate until both input SNR and the estimated masks become stable. Experiments show that it converges relatively fast and is computationally simple. Compared to the method of , our method is simpler and can be applied to factorial HMMs as well as other models (e.g., GMMs). In addition, our method does not require a search range for the estimated input SNR. Comparisons show that the proposed algorithm significantly outperforms related methods.
The rest of the paper is organized as follows. We first present the basic model in Section 2. Section 3 describes iterative estimation. Evaluation and comparison are given in Section 4, and we conclude the paper in Section 5.
where $X_a(c,m)$, $X_b(c,m)$, and $Y(c,m)$ represent the logarithms of the spectra of speaker a, speaker b, and the mixture, respectively, at channel $c$ and frame $m$. The log-max approximation was originally proposed in  to describe the mixing process of speech and noise in robust speech recognition and was later employed in two-speaker separation. A mathematical analysis in  shows that the approximation error in (2) is reasonable, but more accurate approximations exist that take both amplitude and phase into consideration .
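To make the size of the log-max approximation error concrete, the following minimal sketch compares the exact log of two summed energies with the maximum of the two log energies. It assumes the two sources add in energy (i.e., they are uncorrelated) and uses natural logarithms; the function names are illustrative and do not come from the original system.

```python
import math

def log_sum(xa, xb):
    """Exact log of the summed energies, given two log energies."""
    hi, lo = max(xa, xb), min(xa, xb)
    return hi + math.log1p(math.exp(lo - hi))

def log_max(xa, xb):
    """Log-max approximation: keep only the dominant log energy."""
    return max(xa, xb)

def approximation_error(xa, xb):
    """Nonnegative gap between the exact value and the log-max value."""
    return log_sum(xa, xb) - log_max(xa, xb)

if __name__ == "__main__":
    for xa, xb in [(5.0, 5.0), (5.0, 3.0), (10.0, 0.0)]:
        print(xa, xb, approximation_error(xa, xb))
```

Under this assumption the error is largest, $\log 2$, when the two sources have equal energy in a T-F unit, and it shrinks quickly as one source dominates.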
2.1 Speaker models
We use a gammatone filterbank consisting of 128 filters to decompose the input signal into different frequency channels . The center frequencies of the filters spread logarithmically from 50 to 8,000 Hz. Each filtered signal is then divided into 20-ms time frames with 10-ms frame shift, resulting in a cochleagram. The log spectra are computed by taking the element-wise logarithm of the energy in the cochleagram matrix.
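The framing step above can be sketched as follows for a single filterbank channel. This is only an illustration under stated assumptions: the gammatone filtering is assumed to have been done already, a 16-kHz sampling rate is assumed, no window function is applied, and the small `floor` that guards against taking the log of zero is a choice made here rather than a detail of the original system.

```python
import math

def frame_log_energy(channel_signal, sample_rate=16000, frame_ms=20,
                     shift_ms=10, floor=1e-12):
    """Log energy per frame for one filterbank channel output."""
    frame_len = sample_rate * frame_ms // 1000   # 320 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    log_energies = []
    for start in range(0, len(channel_signal) - frame_len + 1, shift):
        frame = channel_signal[start:start + frame_len]
        energy = sum(s * s for s in frame)
        log_energies.append(math.log(max(energy, floor)))
    return log_energies
```

Stacking the outputs of all 128 channels row by row would give a 128-by-frames log-spectral matrix of the kind the speaker models are trained on.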
For each speaker, the conditional distribution given a specific Gaussian is a 128-dimensional Gaussian distribution that factors over the frequency channels into one-dimensional Gaussians, where $k_a$ and $k_b$ denote the Gaussian indices of speakers a and b.
Here, subscripts differentiate the probability functions for speakers a and b, and each density has a corresponding cumulative distribution. In a probabilistic manner, (5) provides a way of approximating the mixture using two clean speaker models, which in turn can be used to estimate two source signals given the mixture as the observation.
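A minimal sketch of such a mixture likelihood is given below. It assumes that (5) takes the standard form for the maximum of two independent one-dimensional Gaussian variables in each channel (density of one speaker times the cumulative distribution of the other, summed over both orderings) and that channels are independent given the Gaussian pair. The function names and the `floor` guard are illustrative.

```python
import math

def gauss_pdf(y, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gauss_cdf(y, mean, var):
    """One-dimensional Gaussian cumulative distribution."""
    return 0.5 * (1.0 + math.erf((y - mean) / math.sqrt(2.0 * var)))

def logmax_channel_density(y, mean_a, var_a, mean_b, var_b):
    """Density of y = max(x_a, x_b) for independent 1-D Gaussians."""
    return (gauss_pdf(y, mean_a, var_a) * gauss_cdf(y, mean_b, var_b)
            + gauss_pdf(y, mean_b, var_b) * gauss_cdf(y, mean_a, var_a))

def logmax_frame_loglik(y_vec, means_a, vars_a, means_b, vars_b, floor=1e-300):
    """Log-likelihood of one mixture frame given one Gaussian per speaker."""
    total = 0.0
    for y, ma, va, mb, vb in zip(y_vec, means_a, vars_a, means_b, vars_b):
        density = logmax_channel_density(y, ma, va, mb, vb)
        total += math.log(max(density, floor))
    return total
```

Evaluating `logmax_frame_loglik` for every pair of Gaussian indices yields the per-frame scores from which posterior probabilities over pairs can be formed.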
2.2 Source estimation
The MMSE estimate of speaker b can be computed similarly.
Note that the soft mask for speaker b is $M_b(c,m) = 1 - M_a(c,m)$. In , the soft mask is found to perform consistently better than a binarized mask.
where the selected indices correspond to the pair of Gaussians yielding the highest posterior probability among all possible pairs. The estimate of source signals can be computed similarly to (11) but using only this pair. A soft mask can also be derived like (12) using only this pair. In experiments, we find that the performance of the MAP estimator is similar to that of MMSE, mainly because at each frame, one pair of Gaussians often approximates the mixture much better than others.
2.3 Incorporating temporal dynamics
The cochannel speech separation system in  models speaker characteristics using GMMs and ignores the temporal information of speech signals. A natural extension to the GMMs to incorporate temporal dynamics is using a factorial HMM model. Specifically, for each speaker, we can estimate the most likely Gaussian index for each frame in a clean utterance using a MAP estimator. Each utterance thus generates a sequence of Gaussian indices. The transitions between all neighboring Gaussian indices are then used to build a 2-D histogram, which can then be normalized to produce a transition matrix .
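The histogram-and-normalize step can be sketched as below. Row normalization (so that each row is a distribution over the next state given the previous one) and the optional `smoothing` count are assumptions made for this illustration; the text only states that the 2-D histogram is normalized.

```python
def transition_matrix(index_sequences, num_states, smoothing=0.0):
    """Estimate p(k | k') from per-utterance Gaussian index sequences."""
    counts = [[smoothing] * num_states for _ in range(num_states)]
    for seq in index_sequences:
        # Count every transition between neighboring frames.
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        if total > 0:
            matrix.append([c / total for c in row])
        else:
            # State never left during training: fall back to uniform.
            matrix.append([1.0 / num_states] * num_states)
    return matrix
```

For example, the two sequences `[0, 1, 1, 0]` and `[1, 1]` contain the transitions 0→1, 1→1, 1→0, and 1→1, so the row for state 1 becomes one third and two thirds.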
In the factorial HMM system, the hidden states of the two HMMs at each frame are the most likely Gaussian indices of two speakers. While the detection of the Gaussian indices is based on only individual frames in a GMM-based model, a 2-D Viterbi search is used in  to find the most likely Gaussian index sequences. Specifically, the 2-D Viterbi integrates all frames and the transition information across time to find the most likely two Gaussian sequences, each of which corresponds to one speaker .
where $p(k_a \mid k_a')$ is the transition probability of speaker a from state $k_a'$ to $k_a$, and $p(k_b \mid k_b')$ is that of speaker b. $p(y_t \mid k_a, k_b)$ can be calculated similarly as in (8). The optimal Gaussian index sequences are detected by a 2-D Viterbi decoding , and the MAP estimator is used for estimating sources.
In (15), an exhaustive search for each pair of $k_a$ and $k_b$ across $T$ frames has a complexity of $O(TK^4)$, where $K$ is the number of Gaussians for each speaker and $T$ is the number of frames. It is time consuming if $K$ is relatively large. In our study, we use a beam search to speed up the process (see also ). Given a beam width of $W$, we only search for the $W$ most likely previous state pairs (i.e., $k_a'$ and $k_b'$ in (15)), and the time complexity is reduced to $O(TWK^2)$. The results presented in Section 4 indicate that a beam width of 16 gives a comparable performance to the exhaustive search.
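The beam-pruned 2-D Viterbi search can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: all quantities are in the log domain, `frame_loglik(t, ka, kb)` is a hypothetical callable returning the mixture log-likelihood of frame `t` for a Gaussian pair, and the initial state log-probabilities are passed in explicitly.

```python
def beam_viterbi_2d(num_frames, num_states, frame_loglik,
                    log_trans_a, log_trans_b,
                    log_init_a, log_init_b, beam_width=16):
    """Most likely sequence of (k_a, k_b) pairs using a pruned search."""
    K = num_states
    # scores[(ka, kb)]: best log score of any path ending in that pair.
    scores = {(ka, kb): log_init_a[ka] + log_init_b[kb] + frame_loglik(0, ka, kb)
              for ka in range(K) for kb in range(K)}
    backpointers = []
    for t in range(1, num_frames):
        # Keep only the W most likely previous state pairs.
        beam = sorted(scores, key=scores.get, reverse=True)[:beam_width]
        new_scores, pointers = {}, {}
        for ka in range(K):
            for kb in range(K):
                prev = max(beam, key=lambda p: scores[p]
                           + log_trans_a[p[0]][ka] + log_trans_b[p[1]][kb])
                new_scores[(ka, kb)] = (scores[prev]
                                        + log_trans_a[prev[0]][ka]
                                        + log_trans_b[prev[1]][kb]
                                        + frame_loglik(t, ka, kb))
                pointers[(ka, kb)] = prev
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final pair.
    state = max(scores, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Each frame evaluates $K^2$ current pairs against at most $W$ previous pairs, which matches the $O(TWK^2)$ cost stated above; setting `beam_width` to $K^2$ recovers the exhaustive search.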
As mentioned in Section 1, model-based methods such as  face the difficulty of generalizing to different mixing conditions. It is partly because the GMMs are trained using log-spectral vectors and hence are sensitive to the overall speech energy. More importantly, if the GMMs of two speakers are trained using clean utterances at certain energy levels, in testing they need to be adjusted according to the input SNR. In , mixtures with nonzero input SNR are separated using unadjusted models, but the performance is worse.
We propose to detect the input SNR and use that to adapt the speaker models and re-estimate the sources. To estimate the input SNR from the mixture, one has to first have some source information. Thus, SNR detection and source estimation become a chicken-and-egg problem, i.e., the performance of one task depends on the success of the other. One general approach to deal with this type of problem is to perform an iterative estimation (e.g., ). In the initial stage of the iterative procedure, we apply the unadapted speaker models to obtain initial separation. Based on the initial source estimates, we calculate the input SNR and use that to adapt the speaker models. The adapted models are in turn used to re-estimate the sources. The two steps iterate until convergence. As an alternative, we also explore a search-based method which jointly estimates sources and the input SNR.
3.1 Initial mask estimation
For a pair of speakers, we first perform an initial estimate by using their models pre-trained using clean utterances at a per-utterance energy level of 60 dB. Initially, the input SNR is assumed to be 0 dB, and a mixture is scaled to an energy level of 63 dB corresponding to the addition of two 60-dB source signals. We use the 2-D Viterbi decoding based on (15) to detect the most likely Gaussian index sequence and then estimate a soft mask of the target speaker using the MAP estimator in Section 2.2.
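The level arithmetic behind the 63-dB figure can be checked with a short sketch. It assumes that the energy level of a signal is 10 log10 of its mean square and that the two sources are uncorrelated so their energies add; the exact level definition used in the original system is not stated here, so treat these helpers as illustrative.

```python
import math

def level_db(signal):
    """Energy level in dB, taken here as 10*log10 of the mean square."""
    mean_square = sum(s * s for s in signal) / len(signal)
    return 10.0 * math.log10(mean_square)

def scale_to_level(signal, target_db):
    """Scale a waveform so that its level equals target_db."""
    gain = 10.0 ** ((target_db - level_db(signal)) / 20.0)
    return [gain * s for s in signal]

def combined_level_db(level_a_db, level_b_db):
    """Level of the sum of two uncorrelated signals."""
    return 10.0 * math.log10(10.0 ** (level_a_db / 10.0)
                             + 10.0 ** (level_b_db / 10.0))

if __name__ == "__main__":
    print(combined_level_db(60.0, 60.0))
```

Two 60-dB sources combine to about 63.01 dB, which is why a mixture assumed to be at 0 dB input SNR is scaled to 63 dB.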
3.2 SNR estimation and model adaptation
where $M_a(c,m)$ denotes the ratio of speaker a at the T-F unit of channel $c$ and frame $m$, and $M_b(c,m) = 1 - M_a(c,m)$. $R$ corresponds to the input SNR of the filtered speech signals. As analyzed in , due to gammatone filtering which has a certain passband, one usually should compensate for the loss of energy to calculate the SNR of the original time-domain signals. However, in our work, the frequency range of the gammatone filterbank is between 50 and 8,000 Hz, and both target and interference are speech signals with a sampling frequency of 16 kHz. There is thus little energy loss in the filtering process, and the estimated SNR of filtered signals is close to that of the original time-domain signals. Thus, we directly use the SNR of filtered signals in (16) as our estimate.
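One plausible reading of this SNR estimate is sketched below: the soft mask splits the mixture energy in each T-F unit between the two speakers, and the utterance-level SNR is the ratio of the two totals. The exact form of (16) is not reproduced in this text, so the weighting used here is an assumption, as are the function name and the `floor` guard.

```python
import math

def estimate_input_snr_db(mask_a, mixture_energy, floor=1e-12):
    """Utterance-level SNR in dB from a soft mask and mixture energies.

    mask_a and mixture_energy are lists of rows (channels by frames);
    the mask of speaker b is taken as 1 - mask_a.
    """
    target = 0.0
    interferer = 0.0
    for mask_row, energy_row in zip(mask_a, mixture_energy):
        for m_a, e in zip(mask_row, energy_row):
            target += m_a * e
            interferer += (1.0 - m_a) * e
    return 10.0 * math.log10(max(target, floor) / max(interferer, floor))
```

A mask that is 0.5 everywhere yields 0 dB, and a mask that assigns ten times more energy to speaker a than to speaker b yields 10 dB.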
where $x_b[t]$ denotes the time-domain speech of speaker b. That is, instead of using 60-dB utterances, the interferer model should be trained using $(60 - R)$-dB signals, and the original utterances should be scaled by a multiplicative factor of $10^{-R/10}$. Since the difference lies in a constant factor, we can directly adjust the parameters of the GMM models, i.e., the mean and variance. Specifically, the means of the interferer GMM are shifted by an additive term $\beta = \log(10^{-R/10})$ since log-spectral vectors are used in training, while the variances remain unchanged because $\beta$ is an additive term.
where $y[t]$ is the time-domain cochannel signal, and $x_a[t]$ is the source signal of speaker a. In the above calculation, we assume that the time-domain target and interfering signal are uncorrelated at each frame. Given (17) and (18), we have adapted the interfering speaker model and the mixture and created a more matched condition for separation.
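The mean shift described above can be sketched in a few lines. The sketch assumes natural-log energy features, so that $\beta = \log(10^{-R/10}) = -(R/10)\ln 10$, and represents the interferer GMM simply as a list of mean vectors; both are simplifications made for illustration.

```python
import math

def adapt_interferer_means(means, snr_db):
    """Shift every mean of the interferer GMM by beta; variances are untouched."""
    beta = math.log(10.0 ** (-snr_db / 10.0))
    return [[mu + beta for mu in mean_vec] for mean_vec in means]

if __name__ == "__main__":
    # An estimated input SNR of 10 dB lowers every mean by ln(10).
    print(adapt_interferer_means([[1.0, 2.0]], 10.0))
```

Because adding a constant to a Gaussian variable moves its mean but not its spread, no change to the variances is needed.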
3.3 Iterative estimation
Given any input mixture, we first obtain the initial mask estimates $M_{a,0}$ and $M_{b,0}$ as described in Section 3.1. Given $M_{a,0}$ and $M_{b,0}$, we then estimate the input SNR using (16). The estimated SNR is used to adapt the model of speaker b and the mixture by (17) and (18), respectively. They are then used together with the target speaker model to re-estimate the soft masks based on the 2-D Viterbi decoding described in Section 2.3 and the MAP estimator in Section 2.2. To get the maximal performance, the iterative process should continue until neither the estimated input SNR nor the speaker masks change. However, empirically, we observe that the separation performance becomes stable when the estimated input SNR change is smaller than 0.5 dB. We thus use this as the stop criterion and terminate the estimation process when the difference of estimated input SNRs between two iterations is less than 0.5 dB.
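The control flow of the iterative procedure can be sketched as below. The two callables are hypothetical placeholders: `separate(snr_db)` stands for model adaptation plus 2-D Viterbi decoding and mask estimation at a given SNR, and `estimate_snr_db(masks)` stands for the SNR estimate in (16). The `max_iters` safety cap is an addition made here and is not part of the description above.

```python
def iterate_separation(separate, estimate_snr_db, tol_db=0.5, max_iters=20):
    """Alternate mask estimation and SNR estimation until the SNR settles."""
    # Initial pass: unadapted models, input SNR assumed to be 0 dB.
    masks = separate(0.0)
    snr_db = estimate_snr_db(masks)
    num_iters = 0
    for _ in range(max_iters):
        num_iters += 1
        masks = separate(snr_db)            # re-estimate with adapted models
        new_snr_db = estimate_snr_db(masks)
        change = abs(new_snr_db - snr_db)
        snr_db = new_snr_db
        if change < tol_db:                 # stop criterion: below 0.5 dB
            break
    return masks, snr_db, num_iters
```

With a toy stand-in such as `separate = lambda s: s` and `estimate_snr_db = lambda m: 0.5 * m + 3.0`, the successive SNR estimates are 3.0, 4.5, 5.25, and 5.625, and the loop stops after three iterations.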
3.4 An alternative method
In addition to the iterative method, we have also tried a search-based method to jointly estimate the source state sequences and the input SNR. For example, we use a test corpus described in Section 4 and hypothesize the input SNR in a range from −9 to 6 dB with an increment of 3 dB. At each hypothesized input SNR, we adapt the mixture and interfering speaker model according to (17) and (18) and use them to detect state sequences using the 2-D Viterbi decoding, and then estimate the soft masks based on the MAP estimator. For all hypothesized SNR conditions, we calculate the joint likelihood of all mixture frames and the Gaussian sequences being generated by the factorial HMM, and the hypothesized input SNR corresponding to the highest likelihood is selected as the detected value. The corresponding state sequence is then used for estimation. We have evaluated the performance of this method using the corpus described in Section 4, and it is about 0.5 dB worse than the iterative method and is computationally more expensive. Note that the discrete SNR range includes the true SNR value in each testing condition to favor the SNR-based search method. How to specify the input SNR levels in search is unclear in practice.
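For comparison, the search-based alternative reduces to a loop over hypothesized SNR values. In this sketch, `decode_loglik(snr_db)` is a hypothetical callable that adapts the models to the hypothesized SNR, runs the 2-D Viterbi decoding, and returns the joint log-likelihood of the utterance; the default grid mirrors the −9 to 6 dB range with 3-dB steps mentioned above.

```python
def search_input_snr(decode_loglik, snr_grid_db=(-9, -6, -3, 0, 3, 6)):
    """Pick the hypothesized input SNR with the highest joint log-likelihood."""
    best_snr, best_ll = None, float("-inf")
    for snr_db in snr_grid_db:
        ll = decode_loglik(snr_db)   # one full decoding pass per hypothesis
        if ll > best_ll:
            best_snr, best_ll = snr_db, ll
    return best_snr, best_ll
```

The sketch makes the two drawbacks visible: the cost grows with the number of grid points, and the answer can never fall between or outside the hypothesized values.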
We use two-talker mixtures in the Speech Separation Challenge (SSC) corpus  for evaluation. For each speaker, a 256-component GMM model (i.e., K=256) is trained using all of the speaker’s clean utterances in the training set. Here, K is chosen with the consideration of performance and computation complexity. In training, each clean utterance is normalized to a 60-dB energy level, and the log spectra are calculated as described in Section 2.1. An HMM model is then built upon each GMM using the same utterances as described in Section 2.3. We use the test part of the SSC corpus and create two-speaker mixtures at SNRs from −9 to 6 dB (with an increment of 3 dB) for evaluation. We randomly select 100 two-speaker mixtures in each SNR condition for testing. Note that the mixture utterances are the same across different SNRs, and mixtures at opposite SNRs are not symmetric since they are generated by fixing the target and scaling the interfering utterances. The 100 mixtures contain 51 different-gender mixtures, 23 male-male mixtures, and 26 female-female mixtures. All test mixtures are downsampled from 25 to 16 kHz for faster processing.
where $x_a[t]$ is the original clean signal and its counterpart is the signal resynthesized from the estimated mask. Note that a waveform signal can be obtained from a soft mask . In our test conditions, target and interfering speakers are treated symmetrically, e.g., an interferer at 6 dB is considered as a target at −6 dB. Thus, at each input SNR, we calculate the target SNR gain as the average of the target SNR gain at that input SNR and the interferer SNR gain at the negative of that input SNR. For example, the SNR gain at −6 dB is the average of the target SNR gain at the −6 dB SNR and the interferer SNR gain at the 6 dB SNR.
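The averaging rule can be sketched as follows. The output SNR is taken here as the ratio of clean-signal energy to the energy of the difference between the clean and resynthesized signals, which is a common definition but an assumption as far as this text is concerned; the dictionaries keyed by input SNR are likewise illustrative.

```python
import math

def output_snr_db(clean, estimate, floor=1e-12):
    """SNR of a resynthesized signal measured against the clean signal."""
    signal = sum(x * x for x in clean)
    error = sum((x - e) ** 2 for x, e in zip(clean, estimate))
    return 10.0 * math.log10(max(signal, floor) / max(error, floor))

def symmetric_snr_gain(target_gain_by_snr, interferer_gain_by_snr, snr_db):
    """Average the target gain at snr_db with the interferer gain at -snr_db."""
    return 0.5 * (target_gain_by_snr[snr_db] + interferer_gain_by_snr[-snr_db])
```

For instance, a hypothetical target gain of 10 dB at −6 dB and an interferer gain of 8 dB at 6 dB would be reported as a 9-dB gain at −6 dB.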
4.1 System configuration
As shown in Figure 5, the proposed system achieves an SNR gain of 11.9 dB at the input SNR of −9 dB, and the gain decreases gradually as the input SNR increases. At 9 dB, the SNR gain is about 3.9 dB. On average, our method achieves an SNR gain of 7.4 dB. Compared to the method of Reddy and Raj, our method performs comparably at 0 dB but significantly better at other input SNRs. For example, the proposed system performs about 2.7 dB better at −9 dB, and the improvement gets smaller as the input SNR gets closer to 0 dB. A similar trend is also observed at positive input SNRs. On average, the proposed system performs 1.2 dB better than the Reddy and Raj method. In the figure, we also show the performance of another MMSE method (black bars), a version of the Reddy and Raj system that does not require the energy levels of training and testing to be the same. In this method, we assume the input SNR to be 0 dB and scale the mixture as described in Section 3.1. As we expect, the performance is a little worse (about 0.3 dB) than the original Reddy and Raj system due to the unmatched signal levels. We also compare to a MAP-based separation method described in Section 2.2. Using only the most likely Gaussian pair for estimation, the MAP method is more efficient than the MMSE method but performs about 0.1 dB worse. Our system performs about 1.6 dB better than the MAP-based method. To isolate the effect of iterative estimation, we have also evaluated the performance of the HMM system alone. As shown in the figure, this method achieves an average SNR gain of about 6.3 dB, about 0.5 dB better than the MAP-based method. This improvement comes from the use of temporal dynamics. Comparing this performance with the proposed system, we get the benefit of iterative estimation, which further increases the SNR gain of the HMM system by about 1.1 dB. In addition, we note that iterative estimation can also be incorporated into other model-based systems. 
For example, we add iterative estimation to the MMSE method (denoted as MMSE-iterative in Figure 5) and obtain an improvement of 1.2 dB. Similarly, the MAP-iterative method outperforms the original MAP method by about 1.2 dB. Lastly, to show the upper bound performance of our system, we have utilized the true input SNR and ideal hidden states in estimation. This ideal performance is presented as the HMM ideal in Figure 5. It is about 0.9 dB better than the proposed system, which indicates that our system is close to the ceiling performance.
We have proposed an iterative algorithm for model-based cochannel speech separation. First, temporal dynamics are incorporated into speaker models using HMMs. We then present an iterative method to deal with signal level differences between training and test conditions. Specifically, the proposed system first uses unadapted speaker models to segregate two speech signals and detects the input SNR. The detected SNR is then used to adapt the interferer model and the mixture for re-estimation. The two steps iterate until convergence. Systematic evaluations show that our iterative method improves segregation performance significantly and also converges quickly. Comparisons show that it performs significantly better than related model-based methods in terms of SNR gains as well as HIT−FA and STOI scores.
We note that SNR estimation in our system uses the whole mixture, which would not be feasible for real-time applications. However, one can slightly modify it to work in real time. For example, at one frame, one could use only previous frames for Viterbi decoding and SNR detection. The detected SNR could be used to adapt speaker models for separation in later frames and then get updated correspondingly. Such an update may be performed periodically to track the input SNR, and the update frequency would depend on the extent to which the input SNR varies.
In this work, our description is limited to two-talker situations as in related model-based methods. The proposed system could be extended to deal with multi-talker separation problems. For example, the MMSE estimators can be extended to perform three-talker separation according to . As for iterative estimation, one can estimate the energy ratios between multiple speakers instead of the SNR in the two-speaker case and adapt the speaker models accordingly. One issue in multi-talker situations is that the complexity of (13) is exponential in the number of speakers, and a faster decoding method thus needs to be used (e.g., [9, 30]).
This research was supported by an AFOSR grant (FA9550-12-1-0130).
- Wang DL, Brown GJ (eds): Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Hoboken: Wiley-IEEE Press; 2006.
- Hu G, Wang DL: A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans. Audio, Speech, Lang. Process. 2010, 18: 2067-2079.
- Shao Y, Wang DL: Sequential organization of speech in computational auditory scene analysis. Speech Comm. 2009, 51: 657-667. doi:10.1016/j.specom.2009.02.003
- Shao Y, Srinivasan S, Jin Z, Wang DL: A computational auditory scene analysis system for speech segregation and robust speech recognition. Comput. Speech Lang. 2010, 24: 77-93. doi:10.1016/j.csl.2008.03.004
- Hu G, Wang DL: Auditory segmentation based on onset and offset analysis. IEEE Trans. Audio, Speech, Lang. Process. 2007, 15: 396-405.
- Barker J, Ma N, Coy A, Cooke M: Speech fragment decoding techniques for simultaneous speaker identification and speech recognition. Comput. Speech Lang. 2010, 24: 94-111. doi:10.1016/j.csl.2008.05.003
- Hu K, Wang DL: An unsupervised approach to cochannel speech separation. IEEE Trans. Audio, Speech, Lang. Process. 2013, 21: 120-129.
- Roweis S: One microphone source separation. Adv. Neural Inf. Process. Syst. 2001, 13: 793-799.
- Reddy A, Raj B: Soft mask methods for single-channel speaker separation. IEEE Trans. Audio, Speech, Lang. Process. 2007, 15(6):1766-1776.
- Radfar MH, Dansereau RM: Single-channel speech separation using soft masking filtering. IEEE Trans. Audio, Speech, Lang. Process. 2007, 15(8):2299-2310.
- Hershey JR, Rennie SJ, Olsen PA, Kristjansson TT: Super-human multi-talker speech recognition: a graphical modeling approach. Comput. Speech Lang. 2010, 24: 45-66. doi:10.1016/j.csl.2008.11.001
- Stark M, Wohlmayr M, Pernkopf F: Source-filter-based single-channel speech separation using pitch information. IEEE Trans. Audio, Speech, Lang. Process. 19(2):242-255.
- Weiss R, Ellis D: Speech separation using speaker-adapted eigenvoice speech models. Comput. Speech Lang. 2010, 24: 16-29. doi:10.1016/j.csl.2008.03.003
- Mysore GJ, Smaragdis P, Raj B: Non-negative hidden Markov modeling of audio with application to source separation. In Proc. 9th Int. Conf. Latent Variable Analysis and Signal Separation. Heidelberg: Springer; 2010.
- Smaragdis P: Convolutive speech bases and their application to supervised speech separation. IEEE Trans. Audio, Speech, Lang. Process. 2007, 15: 1-12.
- Mowlaee P, Christensen MG, Jensen SH: New results on single-channel speech separation using sinusoidal modeling. IEEE Trans. Audio, Speech, Lang. Process. 2011, 19: 1265-1277.
- Yeung YT, Lee T, Leung CC: Integrating multiple observations for model-based single-microphone speech separation with conditional random fields. In Proc. ICASSP-12. New York: IEEE; 2012:257-260.
- Radfar MH, Dansereau RM: Long-term gain estimation in model-based single channel speech separation. In Proc. WASPAA. New York: IEEE; 2007.
- Radfar MH, Wong W, Dansereau RM, Chan WY: Scaled factorial hidden Markov models: a new technique for compensating gain differences in model-based single channel speech separation. 2010.
- Mowlaee P, Saeidi R, Christensen MG, Tan ZH, Kinnunen T, Franti P, Jensen SH: A joint approach for single-channel speaker identification and speech separation. IEEE Trans. Audio, Speech, Lang. Process. 2012, 20(9):2586-2601.
- Saeidi R, Mowlaee P, Kinnunen T, Tan ZH, Christensen MG, Jensen SH, Franti P: Signal-to-signal ratio independent speaker identification for co-channel speech signals. In Proc. 20th International Conference on Pattern Recognition (ICPR). New York: IEEE; 2010:4565-4568.
- Nádas A, Nahamoo D, Picheny MA: Speech recognition using noise-adaptive prototypes. IEEE Trans. Acoust., Speech, Signal Process. 1989, 37: 1495-1503. doi:10.1109/29.35387
- Mowlaee P, Martin R: On phase importance in parameter estimation for single-channel source separation. In Proc. IWAENC 2012, International Workshop on Acoustic Signal Enhancement. New York: IEEE; 2012:1-4.
- Varga AP, Moore RK: Hidden Markov model decomposition of speech and noise. 1990.
- Shao Y, Wang DL: Model-based sequential organization in cochannel speech. IEEE Trans. Audio, Speech, Lang. Process. 2006, 14: 289-298.
- Narayanan A, Wang DL: A CASA-based system for long-term SNR estimation. IEEE Trans. Audio, Speech, Lang. Process. 2012, 20: 2518-2527.
- Cooke M, Lee T: Speech Separation Challenge. 21 September 2006. http://staffwww.dcs.shef.ac.uk/people/M.Cooke/SpeechSeparationChallenge.htm
- Kim G, Lu Y, Hu Y, Loizou PC: An algorithm that improves speech intelligibility in noise for normal-hearing listeners. 2009, 126(3):1486-1494.
- Taal CH, Hendriks RC, Heusdens R, Jensen J: A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2010:4214-4217.
- Rennie S, Hershey J, Olsen P: Single channel multi-talker speech recognition: graphical modeling approaches. IEEE Signal Process. Mag. 2010, 27(6):66-80.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.