- Research Article
- Open Access
Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions
© Dorothea Kolossa et al. 2010
- Received: 25 September 2009
- Accepted: 1 April 2010
- Published: 10 May 2010
When a number of speakers are simultaneously active, for example in meetings or noisy public places, the sources of interest need to be separated from interfering speakers and from each other in order to be robustly recognized. Independent component analysis (ICA) has proven a valuable tool for this purpose. However, ICA outputs can still contain strong residual components of the interfering speakers whenever noise or reverberation is high. In such cases, nonlinear postprocessing can be applied to the ICA outputs, for the purpose of reducing remaining interferences. In order to improve robustness to the artefacts and loss of information caused by this process, recognition can be greatly enhanced by considering the processed speech feature vector as a random variable with time-varying uncertainty, rather than as deterministic. The aim of this paper is to show the potential to improve recognition of multiple overlapping speech signals through nonlinear postprocessing together with uncertainty-based decoding techniques.
- Independent Component Analysis
- Speech Recognition
- Automatic Speech Recognition
- MFCC Feature
When speech recognition is to be used in arbitrary, noisy environments, interfering speech poses significant problems due to the overlapping spectra and nonstationarity. If automatic speech recognition (ASR) is nonetheless required, for example for robust voice control in public spaces or for meeting transcription, independent component analysis (ICA) can be used to segregate all involved speech sources for subsequent recognition. In order to attain the best results, it is often helpful to apply an additional nonlinear gain function to the ICA output to suppress residual speech and noise. After a short introduction to ICA in Section 2, this paper shows in Section 3 how such nonlinear gain functions can be obtained based on three different principal approaches.
However, while source separation itself is greatly improved by nonlinear postprocessing, speech recognition results often suffer from artefacts and loss in information due to such masks. In order to compensate for these losses and to obtain results exceeding those of ICA alone, we suggest the use of uncertainty-of-observation techniques for the subsequent speech recognition. This allows for the utilization of a feature uncertainty estimate, which can be derived considering both artefacts and incorrectly suppressed components of target speech, and will be described in more detail in Section 4. From such an uncertain description of the speech signal in the spectrum domain, uncertainties need to be made available also in the feature domain, in order to be used for recognition. This can be achieved by the so-called "uncertainty propagation," which converts an uncertain description of speech from the spectrum domain, where ICA takes place, to the feature domain of speech recognition. After this uncertainty propagation, detailed in Section 5, recognition can take place under observation uncertainty, as shown in Section 6.
The entire process is vitally dependent on the appropriate estimation of uncertainties. Results given in Section 8 show that when the exact uncertainty in the spectrum domain is known, recognition results with the suggested approach are far in excess of those achievable by ICA alone. Also, a realistically computable uncertainty estimate is introduced, and experiments and results given in Sections 7 and 8 show that with this practical uncertainty measure, significant improvements of recognition performance can be attained for noisy, reverberant room recordings.
The presented method is closely related to other works that consider observation vectors as uncertain for decoding purposes, most often for noisy speech recognition [1–4], but in some cases also for speech recognition in multitalker conditions, as, for example, [5, 6], or in conjunction with speech segregation via binary masking (see, e.g., [8, 9]).
The main novelty in comparison with the above techniques is the use of independent component analysis in conjunction with uncertainty estimation and with a piecewise approach of transforming uncertainties to the feature domain of interest. This allows for the suggested approach to utilize the combined strengths of independent component analysis and soft time-frequency masking, and to still be used with a wide range of feature parameterizations, often without the need for recomputing the uncertainty mapping function to the desired ASR-domain. Corresponding results are shown here for both MFCC and RASTA-PLP coefficients, but the discussed uncertainty transformation approach also generalizes well to the ETSI advanced front end, as shown in .
Independent component analysis has been successfully employed for the separation of speech mixtures in both clean and noisy environments [11, 12]. Alternative methods include adaptive beamforming, which is closely related to independent component analysis when information-theoretic cost functions are applied , sparsity-based methods that utilize amplitude-delay histograms [6, 8, 14], or grouping cues typical of human stream segregation . Here, independent component analysis has been chosen due to its inherent robustness to noise and its ability to handle strong reverberation by frequency-by-frequency optimization of the cost function.
is the frequency bin at which the permutation problem has to be solved and denotes the frequency bin to be used as reference, and is a constant. For this strategy, ordering permutations first at higher frequencies and proceeding downward has proven beneficial; therefore, the ordering at the maximum frequency bin was chosen as reference, and sorting according to (8) took place binwise in descending order.
However, an ideal mask is impossible to obtain realistically; thus, approximations to it are required. For obtaining such an approximation, mask estimation based on ICA results has been proposed and shown to be successful, both for binary and soft masks, see, for example, [17, 18, 20]. The motivation for this procedure lies both in the noise-robustness of ICA, which can therefore unmix signals even when large interferences make the estimation of a time-frequency mask extremely difficult, and also in the fact that ICA will unmix signals even in those time-frequency regions where two or more of them are simultaneously active to a significant extent.
two types of interference-based masks,
which will be described in the subsequent sections.
3.1. Amplitude-Based Masks
One of the simplest post-masks suitable for postprocessing of ICA results is based on comparing the magnitude of all ICA outputs . Due to the sparsity of sources in an appropriate spectral representation , only one should be dominant; therefore, all others are discarded.
before the mask is computed.
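The amplitude comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact mask of this article: the function name `amplitude_mask` and the decibel threshold `threshold_db` are hypothetical, and only a binary decision per time-frequency bin is shown.

```python
import math

def amplitude_mask(magnitudes, threshold_db=0.0):
    """Binary amplitude mask for a single time-frequency bin.

    magnitudes: list of |Y_i| for all ICA outputs in this bin (two or more).
    An output is kept (1.0) only if its level exceeds that of every other
    output by more than threshold_db (a hypothetical parameter).
    """
    eps = 1e-12  # avoids log(0) in silent bins
    levels = [20.0 * math.log10(m + eps) for m in magnitudes]
    mask = []
    for i, level in enumerate(levels):
        strongest_other = max(l for j, l in enumerate(levels) if j != i)
        mask.append(1.0 if level - strongest_other > threshold_db else 0.0)
    return mask

# Three bins of a two-output separation: clear dominance, near tie, clear dominance.
bins = [[1.0, 0.1], [0.2, 0.25], [0.05, 0.8]]
masks = [amplitude_mask(b, threshold_db=3.0) for b in bins]
```

With a positive threshold, bins in which no output clearly dominates are discarded for all outputs, which is one way of expressing the sparsity assumption.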
3.2. Phase-Based Masks
The source separation performance of ICA can also be seen from a beamforming perspective. When the unmixing filters learned by ICA are viewed as frequency-variant beamformers, it can be shown that successful ICA effectively places zeros in the directions of all interfering sources . Therefore, the zero directions of the unmixing filters should be indicative of all source directions. Thus, when the local direction of arrival (DOA) is estimated from the phase of any one given time-frequency bin, this should give an indication of the dominant source in this bin. This is the principle underlying phase-based time-frequency masking strategies.
Phase-based postmasking of ICA outputs was introduced in . In this method, the angle between the 'th target basis vector of the unmixing matrix and the microphone signal vector is used in order to determine whether and to what degree a given channel should be masked.
where the estimated mixing matrix is given in terms of its constituent column vectors, . When comparing (18) and (2), and considering (3), it can be seen that the columns of correspond to the columns of , the matrix containing the values of the room transfer function for each frequency, up to an arbitrary scaling of column vectors and a reordering of sources, which is constant over frequencies after the permutation correction. Thus, in those time-frequency bins, where source is dominant, the associated basis vector should correspond to the column of the mixing matrix associated with source . In general, the index may be different from the index , due to possible permutations. However, as this change of indices will be consistent over frequency, it is disregarded in the following.
After the normalized basis vectors are thus available, masking is carried out based on the angle between the observed vector and the basis vector . This angle is computed in a whitened space, where and are premultiplied by the whitening matrix , which is the inverse square root of the sensor autocorrelation matrix, .
The parameter describes the steepness of the mask and is the transition point, where the mask takes on the value . More details on the mask computation can be found in .
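As an illustration of the angle-based decision, the following sketch computes the angle between an observation vector and a basis vector and maps it to a soft mask value. The logistic form of `phase_mask` is an assumption chosen because it has one steepness parameter and one transition point; the exact mask function of this article is not reproduced here. Both vectors are assumed to have been whitened already, and all function and parameter names are hypothetical.

```python
import math

def angle_between(x, b):
    """Angle in radians (0 to pi/2) between two complex vectors."""
    inner = sum(xi.conjugate() * bi for xi, bi in zip(x, b))
    norm_x = math.sqrt(sum(abs(xi) ** 2 for xi in x))
    norm_b = math.sqrt(sum(abs(bi) ** 2 for bi in b))
    cos_theta = min(1.0, abs(inner) / (norm_x * norm_b))
    return math.acos(cos_theta)

def phase_mask(theta, gain, theta_t):
    """Assumed logistic soft mask: close to 1 for theta well below theta_t."""
    z = gain * (theta - theta_t)
    if z > 700.0:  # guard against overflow in exp for very steep masks
        return 0.0
    return 1.0 / (1.0 + math.exp(z))

# Observation aligned with the basis vector versus orthogonal to it.
aligned = angle_between([1 + 0j, 1j], [1 + 0j, 1j])
orthogonal = angle_between([1 + 0j, 0j], [0j, 1 + 0j])
m_aligned = phase_mask(aligned, gain=10.0, theta_t=0.5)
m_orthogonal = phase_mask(orthogonal, gain=10.0, theta_t=0.5)
```

In this assumed form the mask passes through 0.5 at `theta_t`, and larger values of `gain` bring it closer to a binary decision.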
3.3. Interference-Based Masks
As an alternative criterion for masking, residual interference in the signal may be estimated and the mask may be computed as an MMSE estimator of the clean signal. This can be achieved with a number of approaches, two of which will be presented here in more detail.
3.3.1. Ephraim-Malah Filter-Based Post-Filtering
where is the amplitude estimator gain. For the calculation of the gain , different speech enhancement algorithms can be used. In the following, we are using the log spectral amplitude estimator (LSA) as proposed by Ephraim and Malah .
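For reference, the LSA gain of Ephraim and Malah is commonly written in terms of the a priori SNR and the a posteriori SNR as shown below. The symbols $\xi$, $\gamma$, and $\nu$ follow the usual conventions of the speech enhancement literature and are not necessarily the notation used in this article.

```latex
G_{\mathrm{LSA}}(\xi,\gamma) = \frac{\xi}{1+\xi}\,
  \exp\!\left(\frac{1}{2}\int_{\nu}^{\infty}\frac{e^{-t}}{t}\,dt\right),
\qquad \nu = \frac{\xi}{1+\xi}\,\gamma .
```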
3.3.2. Inclusion of Speech Presence Probabilities
Due to the use of time-frequency masking, part of the information of the original signal might be eliminated along with the interfering sources. To compensate for this lack of information, each masked estimated source is considered as uncertain and described in the form of a posterior distribution of each Fourier coefficient of the clean signal given the available information.
Estimating the uncertainty in the spectrum domain has clear advantages, when contrasted with uncertainty estimation in the domain of speech recognition, since much intermediate information about the signal and noise process as well as the mask is known in this phase of signal processing, but is generally not available in the further steps of feature extraction. This has motivated a number of studies on spectrum domain uncertainty estimation, most recently for example [7, 10]. In contrast to other methods, the suggested strategy possesses two advantages: it does not need a detailed spectrum domain speech prior, which may require a large number of components or may incur the need for adaptation to the speaker and environment; and it gives a computationally very inexpensive approximation that is applicable for both binary and soft masks.
where the mean is set equal to the Fourier coefficient obtained from post-masking and the variance represents the lack of information, or uncertainty. In order to determine , two alternative procedures were used.
4.1. Ideal Uncertainties
where is the reference signal. However, these ideal uncertainties are available only in experiments where a reference signal has been recorded. Thus, the ideal results may only serve as a perspective of what the suggested method would be capable of if a very high quality error estimate were already available.
4.2. Masking Error Estimate
In practice, it is necessary to approximate the ideal uncertainty estimate using values that are actually available. Since much of the estimation error is due to the time-frequency mask, in further experiments such a masking error was used as the single basis of the uncertainty measure.
In order to avoid adapting parameters to each of the test signals and masks, this minimization was carried out only once and only for a mixture not used in testing. After averaging over all mask types, the same value of was used in all experiments and for all datasets. This optimal value was .
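The two uncertainty measures can be sketched as follows. This is a hedged illustration: the ideal uncertainty is taken to be the squared error with respect to the reference, the masking error is taken to be the squared magnitude of what the mask removed from the ICA output, and the scaling factor `alpha` is fitted by least squares on a held-out mixture. The function names, the least-squares criterion, and the exact form of the masking error are assumptions, and no attempt is made to reproduce the optimal value reported in the article.

```python
def ideal_uncertainty(reference, estimate):
    """Squared error between reference and estimated Fourier coefficient."""
    return abs(reference - estimate) ** 2

def masking_error_uncertainty(ica_output, mask, alpha):
    """Scaled squared magnitude of the part of the ICA output removed by the mask."""
    masked = mask * ica_output
    return alpha * abs(ica_output - masked) ** 2

def fit_alpha(ica_outputs, masks, references):
    """Least-squares scale so that estimated uncertainties match ideal ones."""
    num = 0.0
    den = 0.0
    for y, m, s in zip(ica_outputs, masks, references):
        e = masking_error_uncertainty(y, m, 1.0)
        num += e * ideal_uncertainty(s, m * y)
        den += e * e
    return num / den if den > 0.0 else 0.0

# Toy held-out data: a single bin with ICA output 2, mask 0.5, reference 3.
alpha = fit_alpha([2 + 0j], [0.5], [3 + 0j])
sigma2 = masking_error_uncertainty(2 + 0j, 0.5, alpha)
```

A bin that passes the mask unchanged receives zero uncertainty under this measure, which is why it can only approximate the ideal uncertainty.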
When uncertain features are available in the STFT domain, they could in principle be used for spectrum domain speech recognition. However, as shown in , due to the less robust spectrum domain models, this does not provide for optimum results. Instead, a more successful approach is to transform the uncertain description of speech from the spectrum domain to the domain of speech recognition. This can in principle be achieved by two approaches, data-driven as in  or model-driven as in . In the following, we only consider the model-driven approach, which can achieve very low propagation errors with small memory requirements and without the need for a training phase . However, a detailed comparison of both principal methods remains an interesting target for future work.
In order to carry out the propagation through the feature extraction process, the uncertain spectrum domain description is considered as specifying speech as a random variable according to (35). If such an uncertain description of the STFT is used, the corresponding posterior distribution has to be propagated into the feature domain. For this purpose, the effect of all transformations in the feature extraction process on this probability distribution needs to be considered, which will result in an estimated feature domain random variable, describing both the mean of the speech features as well as the associated degree of uncertainty. Since this computation takes place for each feature and in each bin, subsequent recognition will have a maximally precise description of all uncertainties, allowing the algorithm to focus most on those features that are most reliable, and, if desired, to replace the uncertain ones by better estimates under simultaneous consideration of the recognizer speech model.
In conventional automatic speech recognition, only the STFT of each estimated source must be transformed into the feature domain of automatic speech recognition. Feature extractions involve multiple transformations, some of them nonlinear, which are performed jointly on multiple features of the same frame or by combining features from different time frames. Propagating an uncertain description of the STFT of each estimated source is therefore a complicated task that can be simplified by propagating only first- and second-order information. This section shows how this propagation can be attained by a piecewise approach in which the feature extraction is divided into different steps and the optimal method is chosen to perform uncertainty propagation in each step. Uncertainty propagation is used with two of the more robust speech recognition features, namely the Mel-cepstrum coefficients (MFCCs)  and the cepstral coefficients obtained from the RelAtive SpecTrAl Perceptual Linear Prediction (RASTA-PLP) feature extraction , here denoted as RASTA-LPCCs.
5.1. Mel-Cepstral Feature Extraction
1. Extract the short-time spectral amplitude (STSA) from the STFT.
2. Compute each filter output of a Mel-filterbank as a weighted sum of the STSA features of each frame.
3. Apply the logarithm to each filter output.
4. Compute the discrete cosine transform (DCT) from each frame of log-filterbank features.
In order to propagate random variables rather than deterministic signals, these steps were modified as follows.
Step 3 corresponds to the computation of the logarithm. Since the distribution of the uncertain Mel-STSA features has a relatively low skewness and the dimensionality of the features has been reduced by approximately one order of magnitude through the application of the Mel-filterbank, the use of the pseudo-Monte Carlo method termed unscented transform  provides an acceptable trade-off between accuracy and computational cost. Details regarding the use of the unscented transform for uncertainty propagation can be found in .
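The idea of the unscented transform can be illustrated for a single feature. The sketch below propagates a scalar mean and variance through the logarithm using three sigma points. It is a simplified, one-dimensional assumption for illustration only: the article applies the transform to feature vectors, and the function name and the choice `kappa=2.0` are hypothetical.

```python
import math

def unscented_log(mean, var, kappa=2.0):
    """Propagate a scalar (mean, var) through log() with three sigma points."""
    n = 1  # dimensionality of the input
    spread = math.sqrt((n + kappa) * var)
    points = [mean, mean + spread, mean - spread]
    if min(points) <= 0.0:
        raise ValueError("sigma points must be positive for the logarithm")
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]
    logs = [math.log(p) for p in points]
    out_mean = sum(w * y for w, y in zip(weights, logs))
    out_var = sum(w * (y - out_mean) ** 2 for w, y in zip(weights, logs))
    return out_mean, out_var

# A filterbank output with mean 10 and small variance.
log_mean, log_var = unscented_log(10.0, 0.01)
```

For small input variance the result approaches the first-order approximation, that is, a mean near log(10) and a variance near 0.01/10².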
5.2. Relative Spectral Perceptual Linear Prediction Feature Extraction
1. Extract the power spectral density (PSD) from the STFT.
2. Compute each filter output of a Bark-filterbank as a weighted sum of the PSD features of each frame.
3. Apply the logarithm to each filter output.
4. Filter the resulting frames with the RASTA IIR filter.
5. Add the equal loudness curve and multiply by 0.33 to simulate the power law of hearing.
6. Apply the exponential to invert the effect of the logarithm.
7. Compute an all-pole model of each frame to obtain the linear prediction coefficients (LPCs).
8. Compute cepstral coefficients from each LPC frame.
Steps 4 and 5 correspond to conventional linear transformations in the logarithm domain, and therefore the propagation through them can be solved by applying (41) to obtain the means and covariances. Furthermore, since the assumption of log-normality in the Bark-PSD domain implies that the log-domain features are normally distributed, RASTA, preemphasis, and power-law transformations do not alter this condition.
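Propagation through linear steps uses the standard result that a linear map y = A x transforms a mean μ into A μ and a covariance Σ into A Σ Aᵀ. The following sketch shows this for a small matrix and for the scalar affine case of adding an offset and scaling, as in the equal loudness and power-law step. Equation (41) is not reproduced in this text, so this is a sketch of the textbook result and the function names are hypothetical.

```python
def propagate_linear(matrix, mean, cov):
    """Mean and covariance of y = A x for x with the given mean and covariance."""
    rows, cols = len(matrix), len(mean)
    out_mean = [sum(matrix[i][k] * mean[k] for k in range(cols))
                for i in range(rows)]
    # Intermediate product A * cov, then (A * cov) * A^T.
    a_cov = [[sum(matrix[i][k] * cov[k][j] for k in range(cols))
              for j in range(cols)] for i in range(rows)]
    out_cov = [[sum(a_cov[i][k] * matrix[j][k] for k in range(cols))
                for j in range(rows)] for i in range(rows)]
    return out_mean, out_cov

def propagate_affine_scalar(mean, var, scale, offset):
    """Mean and variance of y = scale * (x + offset) for scalar x."""
    return scale * (mean + offset), scale ** 2 * var

# Summing two independent features with variances 1 and 4.
y_mean, y_cov = propagate_linear([[1.0, 1.0]], [1.0, 2.0],
                                 [[1.0, 0.0], [0.0, 4.0]])
# Offset of 1 followed by multiplication with 0.33.
z_mean, z_var = propagate_affine_scalar(2.0, 4.0, 0.33, 1.0)
```

A constant offset shifts only the mean, whereas a scale factor affects the variance quadratically.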
The final steps of the RASTA-LPCC feature extraction, Steps 7 and 8, correspond to the computation of the all-pole model to obtain the LPC coefficients, described in the conventional PLP technique , and the computation of the cepstral coefficients from the LPCs using [30, equation 3]. Due to the complex nature of these transformations and the low skewness of the uncertain features after the exponential transformation, the propagation is computed using the unscented transform, similarly to the case of the logarithm transformation for the Mel-cepstral features.
When features for speech recognition are given not as point estimates, but rather in the form of a posterior distribution with estimated mean and covariance the speech decoder must be modified in order to take this additional information into account. A number of approaches exist, both for binary and for continuous-valued uncertainties, for example, [2, 36, 37].
Here, two missing feature approaches were applied, which are capable of considering real-valued uncertainties. These methods, modified imputation  and HMM variance compensation , have been implemented for the Hidden Markov Model Toolkit (HTK)  and were used in the tests.
Both methods are appropriate for HMM-based systems, where recognition takes place by finding the optimum HMM state sequence , which gives the best match to the feature vector sequence when each HMM state has an associated output probability distribution .
6.1. HMM Variance Compensation
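In its commonly cited form, HMM variance compensation evaluates each state's Gaussian with the feature uncertainty added to the model variance. The sketch below assumes diagonal covariances and a single Gaussian per state, whereas the recognizer used here employs mixtures. The function name is hypothetical and the code is not the HTK implementation used in the experiments.

```python
import math

def compensated_log_likelihood(feat_mean, feat_var, state_mean, state_var):
    """Log N(feat_mean; state_mean, state_var + feat_var), diagonal covariances."""
    total = 0.0
    for mx, vx, mq, vq in zip(feat_mean, feat_var, state_mean, state_var):
        v = vq + vx  # model variance widened by the observation uncertainty
        total += -0.5 * (math.log(2.0 * math.pi * v) + (mx - mq) ** 2 / v)
    return total

# A feature far from the state mean, first treated as certain, then as uncertain.
ll_certain = compensated_log_likelihood([3.0], [0.0], [0.0], [1.0])
ll_uncertain = compensated_log_likelihood([3.0], [8.0], [0.0], [1.0])
```

An uncertain feature that mismatches the model is penalized less than a certain one, so unreliable components have less influence on the state sequence.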
6.2. Modified Imputation
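Modified imputation, in the form usually given in the literature, replaces each uncertain feature by an estimate that weights the observed mean and the state mean by their respective variances, and then evaluates the unmodified state likelihood at that estimate. The sketch below assumes diagonal covariances and a single Gaussian per state with strictly positive model variance. The function name is hypothetical, and details may differ from the formulation used in the experiments.

```python
def modified_imputation(feat_mean, feat_var, state_mean, state_var):
    """Per-dimension estimate that combines observation and state mean."""
    estimate = []
    for mx, vx, mq, vq in zip(feat_mean, feat_var, state_mean, state_var):
        # Certain features (vx = 0) are kept; uncertain ones move toward mq.
        estimate.append((vq * mx + vx * mq) / (vq + vx))
    return estimate

certain = modified_imputation([3.0], [0.0], [0.0], [1.0])
balanced = modified_imputation([3.0], [1.0], [0.0], [1.0])
unreliable = modified_imputation([3.0], [1e6], [0.0], [1.0])
```

Because the estimate depends on the state, it has to be recomputed for every state that is evaluated during decoding.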
7.1. Room Recordings
Distance between speaker
Angular position of the speaker (as shown in Figure 4)
Angular position of the speaker (as shown in Figure 4)
7.2. Model Training
The HMM speech recognizer was trained with the HTK toolkit . The trained HMMs comprised phoneme-level models with 6-component MOG emitting probabilities and a conventional left-right structure. The training data was mixed and it comprised the 114 speakers of the TI-DIGITS clean speech database along with the room recordings for speakers sa and rk used for adaptation. Speakers used for adaptation were removed from the test set. The feature extractions presented in Section 5 were also complemented with cepstral mean subtraction (CMS) for further reduction of convolutive effects. Since CMS is a linear operation, it poses no additional difficulty for uncertainty propagation.
7.3. Parameter Settings of Time-Frequency Masks
Parameters of all masks were set manually for good performance on all datasets, and were kept consistent throughout all experiments.
7.3.1. Amplitude-Based Masking
7.3.2. Phase-Based Masking
In phase-based masking according to (22), there are two free parameters as well, again a mask gain and also a mask threshold, the angle threshold . However, optimum performance was reached for different parameter values depending on the recognizer parameterization. For optimal performance on MFCC features, they were set to and , which will be referred to as Phase1 in the results. In contrast, for RASTA-PLP-based recognition, better results were generally achieved with and (Phase2), that is, the same threshold but a less steep mask gain.
7.3.3. Interference-Based Masking
The second interference-based algorithm additionally includes the speech probability estimate defined in Section 3.3.2. Thus, in addition to the parameters and , there are additional parameters in the weighting function (34). These are and , parameters specifying the two threshold points and the mask gain. They are defined to correspond to the mean absolute value of the estimated signal Fourier coefficients , the mean absolute value of the noise estimate Fourier coefficient ; and the mask gain is set to . For windowing in (33), a Hanning window of size is used. For this algorithm, the abbreviation IBPE will be used.
8.1. Recognition Performance Measurement
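Word accuracy and relative error rate reductions of the kind reported in the following can be computed as sketched below. The definition assumes the HTK-style accuracy, which counts deletions, substitutions, and insertions against the number of reference words. The function names are hypothetical and the numbers in the example are illustrative, not results from this article.

```python
def word_accuracy(n_ref, deletions, substitutions, insertions):
    """Word accuracy in percent, assuming the HTK-style definition."""
    return 100.0 * (n_ref - deletions - substitutions - insertions) / n_ref

def relative_wer_reduction(wa_before, wa_after):
    """Relative reduction of the word error rate in percent."""
    wer_before = 100.0 - wa_before
    wer_after = 100.0 - wa_after
    return 100.0 * (wer_before - wer_after) / wer_before

wa = word_accuracy(n_ref=100, deletions=5, substitutions=10, insertions=5)
reduction = relative_wer_reduction(wa_before=50.0, wa_after=82.0)
```

Because insertions are counted, word accuracy can become negative for very poor recognition output.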
8.2. Multispeaker Recognition Results
Word accuracy (WA) of ASR tests for RASTA-PLP features, estimated uncertainties. Here, the algorithms Phase1 and Phase2 utilize the parameters defined in Section 7.3.2, the entries with the heading Amplitude correspond with the mask given in Section 7.3.1, and the two interference-based strategies IB and IBPE are specified in Section 7.3.3. The two robust recognition strategies are abbreviated by MI for modified imputation and UD for uncertainty decoding.
Word accuracy (WA) of ASR tests for RASTA-PLP features, true uncertainties.
No. of speakers
ICA + Mask
ICA + Mask + UD
ICA + Mask + MI
Similar performance gains can be observed in the case of MFCC features, where word error rates can be reduced by 64% and 62% for uncertainty decoding and modified imputation, respectively. Comparing the uncertain recognition strategies, again, modified imputation is on average the better performer for RASTA-PLPs, whereas uncertainty decoding leads to better performance gains for MFCCs. Concerning the masking strategies, it is clear that the IB-mask, which has fairly aggressive parameter settings and an extremely low recognition rate without missing feature approaches, is the best for this case of ideal uncertainties.
An overview of the use of independent component analysis for speech recognition under multitalker conditions has been given. As shown by the presented results, the conventional strategy of purely linear source separation can be improved by post-masking in the time-frequency domain, if this is accompanied by missing-feature speech recognition. Especially for three-speaker scenarios, this improves the recognition rate notably. Interestingly, the optimal decoding strategy is apparently dependent on the features that are used for recognition. Whereas modified imputation was clearly superior for RASTA features, better results for MFCC features have almost consistently been achieved by uncertainty decoding, even though uncertainties were estimated in the spectrum domain for both features and propagated to the recognition domain of interest. Further work will be necessary to determine how these results correspond to the degree of model mismatch in both domains, with the aim of determining an optimal decoding strategy depending on specific application scenarios.
A vital aspect of missing feature recognition is still the estimation of the feature uncertainty. Here, an ideal uncertainty estimate will result in superior recognition performance for all considered test cases and all applied post masks. Since such an ideal uncertainty is not available in practice, the value needs to be estimated from available data. In the presented cases, this measure has been derived from the ICA output signal and the applied nonlinear gain function. The resulting uncertainty estimate has a correlation coefficient of 0.45 with the true uncertainties, leading to superior and consistent performance among all tested uncertainty estimates.
However, uncertainty estimation for the ICA output signals should be improved further, in order to approximate more closely the ideally achievable performance of this strategy. For this purpose, it will be interesting to compare the proposed uncertainty estimation to other approaches. Specifically, the uncertainty estimation described in  is of interest for use with any type of recognition feature and preprocessing method, but it requires learning of a regression tree for the given specific feature set and environment. In contrast, feature-specific methods described for example in [2, 3] are applicable only to the feature domain they have been derived for, but can be used without the need for additional training stages.
Since none of the above methods is designed specifically for use with ICA, another direction of research is a better use of the statistical information gathered during source separation. Further research can thus focus on an optimal use of this intermediate data, and on its combination with more detailed prior models in the spectrum domain, as those in , for arriving at more accurate uncertainty estimates which utilize all available data from multiple microphones.
- Kristjansson TT, Frey BJ: Accounting for uncertainty in observations: a new paradigm for robust automatic speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), 2002.
- Deng L, Droppo J, Acero A: Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion. IEEE Transactions on Speech and Audio Processing 2005, 13(3):412-421.
- Stouten V, Van Hamme H, Wambacq P: Application of minimum statistics and minima controlled recursive averaging methods to estimate a cepstral noise model for robust ASR. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), May 2006 1:
- Van Segbroeck M, Van Hamme H: Robust speech recognition using missing data techniques in the prospect domain and fuzzy masks. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), 2008 4393-4396.
- Kolossa D, Klimas A, Orglmeister R: Separation and robust recognition of noisy, convolutive speech mixtures using time-frequency masking and missing data techniques. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '05), 2005 82-85.
- Kühne M, Togneri R, Nordholm S: Time-frequency masking: linking blind source separation and robust speech recognition. In Speech Recognition: Technologies and Applications. IN-TECH, Vienna, Austria; 2008:61-80.
- Srinivasan S, Wang D: Transforming binary uncertainties for robust speech recognition. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(7):2130-2140.
- Yilmaz Ö, Rickard S: Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing 2004, 52(7):1830-1847. 10.1109/TSP.2004.828896
- Brown G, Wang D: Separation of speech by computational auditory scene analysis. In Speech Enhancement. Edited by: Benesty J, Makino S, Chen J. Springer, New York, NY, USA; 2005:371-402.
- Astudillo RF, Kolossa D, Mandelartz P, Orglmeister R: An uncertainty propagation approach to robust ASR using the ETSI advanced front-end. IEEE Journal of Selected Topics in Signal Processing. In press.
- Buchner H, Aichner R, Kellermann W: TRINICON: a versatile framework for multichannel blind signal processing. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), 2004 3: 889-892.
- Makino S, Lee T-W, Sawada H (Eds): Blind Speech Separation. Springer, New York, NY, USA; 2007.
- Kumatani K, McDonough J, Klakow D, Garner PN, Li W: Adaptive beamforming with a maximum negentropy criterion. Proceedings of the Hands-Free Speech Communication and Microphone Arrays (HSCMA '08), January 2008, Trento, Italy 180-183.
- Roman N, Wang D, Brown GJ: Speech segregation based on sound localization. Journal of the Acoustical Society of America 2003, 114(4 I):2236-2252.
- Brown GJ, Cooke M: Computational auditory scene analysis. Computer Speech and Language 1994, 8(4):297-336. 10.1006/csla.1994.1016
- Cichocki A, Amari S: Adaptive Blind Signal and Image Processing. John Wiley & Sons, New York, NY, USA; 2002.
- Sawada H, Araki S, Mukai R, Makino S: Blind extraction of dominant target sources using ICA and time-frequency masking. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(6):2165-2173.
- Hoffmann E, Kolossa D, Orglmeister R: A batch algorithm for blind source separation of acoustic signals using ICA and time-frequency masking. Proceedings of the 7th International Conference on Independent Component Analysis and Signal Separation (ICA '07), 2007, London, UK 480-487.
- Kamata K, Hu X, Kobatake H: A new approach to the permutation problem in frequency domain blind source separation. Proceedings of the 5th International Conference on Independent Component Analysis and Blind Signal Separation (ICA '04), September 2004, Granada, Spain 849-856.
- Kolossa D, Orglmeister R: Nonlinear postprocessing for blind speech separation. Proceedings of the 5th International Conference on Independent Component Analysis and Signal Separation (ICA '04), 2004, Granada, Spain 832-839.
- Pedersen MS, Wang D, Larsen J, Kjems U: Overcomplete blind source separation by combining ICA and binary time-frequency masking. Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, September 2005 15-20.
- Kolossa D: Independent component analysis for environmentally robust speech recognition, Ph.D. dissertation. TU Berlin, Berlin, Germany; 2007.
- Araki S, Makino S, Hinamoto Y, Mukai R, Nishikawa T, Saruwatari H: Equivalence between frequency-domain blind source separation and frequency-domain adaptive beamforming for convolutive mixtures. EURASIP Journal on Applied Signal Processing 2003, 2003(11):1157-1166. 10.1155/S1110865703305074
- Ephraim Y, Malah D: Speech enhancement using a minimum mean-square error-log-spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985, 33(2):443-445. 10.1109/TASSP.1985.1164550
- Cohen I: Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator. IEEE Signal Processing Letters 2002, 9(4):113-116. 10.1109/97.1001645
- Cohen I: On speech enhancement under signal presence uncertainty. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), May 2001, Salt Lake City, Utah, USA 167-170.
- Ephraim Y, Cohen I: Recent advancements in speech enhancement. In The Electrical Engineering Handbook. CRC Press, Boca Raton, Fla, USA; 2006.
- Astudillo RF, Kolossa D, Orglmeister R: Propagation of statistical information through non-linear feature extractions for robust speech recognition. Proceedings of the 27th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt '07), November 2007 954: 245-252.
- Raj B, Seltzer ML, Stern RM: Reconstruction of missing features for robust speech recognition. Speech Communication 2004, 43(4):275-296. 10.1016/j.specom.2004.03.007
- Davis SB, Mermelstein P: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 1980, 28(4):357-366. 10.1109/TASSP.1980.1163420
- Hermansky H, Morgan N: RASTA processing of speech. IEEE Transactions on Speech and Audio Processing 1994, 2(4):578-589. 10.1109/89.326616
- Julier S, Uhlmann J: A general method for approximating nonlinear transformations of probability distributions. University of Oxford, Oxford, UK; 1996.
- Astudillo RF, Kolossa D, Orglmeister R: Uncertainty propagation for speech recognition using RASTA features in highly nonstationary noisy environments. Proceedings of the Workshop for Speech Communication (ITG '08), 2008.
- Gales M: Model-based techniques for noise robust speech recognition, Ph.D. thesis. Cambridge University, Cambridge, UK; 1996.
- Hermansky H, Hanson BA, Wakita H: Perceptually based linear predictive analysis of speech. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), 1985 509-512.
- Barker J, Green P, Cooke M: Linking auditory scene analysis and robust ASR by missing data techniques. Proceedings of the Workshop on Innovation in Speech Processing (WISP '01), 2001.
- Arrowood J, Clements M: Using observation uncertainty in HMM decoding. Proceedings of the International Conference on Spoken Language Processing (ICSLP '02), 2002.
- Young S: The HTK Book (for HTK Version 3.4). Cambridge University, Engineering Department.
- Leonard RG: Database for speaker-independent digit recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '84), 1984 3:
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.