Skip to main content

Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions

Abstract

When a number of speakers are simultaneously active, for example in meetings or noisy public places, the sources of interest need to be separated from interfering speakers and from each other in order to be robustly recognized. Independent component analysis (ICA) has proven a valuable tool for this purpose. However, ICA outputs can still contain strong residual components of the interfering speakers whenever noise or reverberation is high. In such cases, nonlinear postprocessing can be applied to the ICA outputs, for the purpose of reducing remaining interferences. In order to improve robustness to the artefacts and loss of information caused by this process, recognition can be greatly enhanced by considering the processed speech feature vector as a random variable with time-varying uncertainty, rather than as deterministic. The aim of this paper is to show the potential to improve recognition of multiple overlapping speech signals through nonlinear postprocessing together with uncertainty-based decoding techniques.

1. Introduction

When speech recognition is to be used in arbitrary, noisy environments, interfering speech poses significant problems due to the ovelapping spectra and nonstationarity. If automatic speech recognition (ASR) is nonetheless required, for example for robust voice control in public spaces or for meeting transcription, the use of independent component analysis (ICA) can be important to segregate all involved speech sources for subsequent recognition. In order to attain the best results, it is often helpful to apply an additional nonlinear gain function to the ICA output to suppress residual speech and noise. After a short introduction to ICA in Section 2, this paper shows in Section 3 how such nonlinear gain functions can be attained based on three different principal approaches.

However, while source separation itself is greatly improved by nonlinear postprocessing, speech recognition results often suffer from artefacts and loss in information due to such masks. In order to compensate for these losses and to obtain results exceeding those of ICA alone, we suggest the use of uncertainty-of-observation techniques for the subsequent speech recognition. This allows for the utilization of a feature uncertainty estimate, which can be derived considering both artefacts and incorrectly suppressed components of target speech, and will be described in more detail in Section 4. From such an uncertain description of the speech signal in the spectrum domain, uncertainties need to be made available also in the feature domain, in order to be used for recognition. This can be achieved by the so-called "uncertainty propagation," which converts an uncertain description of speech from the spectrum domain, where ICA takes place, to the feature domain of speech recognition. After this uncertainty propagation, detailed in Section 5, recognition can take place under observation uncertainty, as shown in Section 6.

The entire process is vitally dependent on the appropriate estimation of uncertainties. Results given in Section 8 show that when the exact uncertainty in the spectrum domain is known, recognition results with the suggested approach are far in excess of those achievable by ICA alone. Also, a realistically computable uncertainty estimate is introduced, and experiments and results given in Sections 7 and 8 show that with this practical uncertainty measure, significant improvements of recognition performance can be attained for noisy, reverberant room recordings.

The presented method is closely related to other works that consider observation vectors as uncertain for decoding purposes, most often for noisy speech recognition [1–4], but in some cases also for speech recognition in multitalker conditions, as, for example, [5, 6], or [7] in conjunction with speech segregation via binary masking (see, e.g. [8, 9]).

The main novelty in comparison with the above techniques is the use of independent component analysis in conjunction with uncertainty estimation and with a piecewise approach of transforming uncertainties to the feature domain of interest. This allows for the suggested approach to utilize the combined strengths of independent component analysis and soft time-frequency masking, and to still be used with a wide range of feature parameterizations, often without the need for recomputing the uncertainty mapping function to the desired ASR-domain. Corresponding results are shown here for both MFCC and RASTA-PLP coefficients, but the discussed uncertainty transformation approach also generalizes well to the ETSI advanced front end, as shown in [10].

2. Independent Component Analysis for Reverberant Speech

Independent component analysis has been successfully employed for the separation of speech mixtures in both clean and noisy environments [11, 12]. Alternative methods include adaptive beamforming, which is closely related to independent component analysis when information-theoretic cost functions are applied [13], sparsity-based methods that utilize amplitude-delay histograms [6, 8, 14], or grouping cues typical of human stream segregation [15]. Here, independent component analysis has been chosen due to its inherent robustness to noise and its ability to handle strong reverberation by frequency-by-frequency optimization of the cost function.

In order to separate a number of simultaneously active speech signals from recordings, with , the reverberant, noisy mixing process is modelled as

(1)

where the room impulse response from source to sensor is considered time-invariant.

Since convolutions are easily separable in the frequency domain, this expression is transformed by a short-time Fourier transform (STFT). Then, (1) becomes

(2)

where is composed of the room transfer functions from all sources to the sensors , and is the sensor noise. Here, and denote the integer-valued frequency bin index and frame index, respectively.

In order to extract the original sources from the mixtures, ICA finds an unmixing matrix

(3)

for each frequency bin , which by principle can only be known up to an arbitrary scaling and permutation described by the diagonal scaling matrix and the permutation matrix . The unmixing matrix is found by maximizing the statistical independence of the unmixed signals . Finally, unmixing is carried out separately in each frequency bin according to

(4)

To learn the matrix , the adaptive algorithm described in [16, Table , Equation ] is used. In this algorithm, the demixing matrix is calculated using a gradient descent. The update rule for the matrix in the th iteration consists of two steps. At first, the current estimate of the source signals is computed by (4), using the result of the previous iteration for unmixing. Then, the update of the unmixing matrix takes place according to

(5)

where is a diagonal matrix with

(6)

Here, denotes the mean value and

(7)

Ideally, this optimization will result in independent output signals in each frequency bin. To obtain a complete spectrum of unmixed sources, it is additionally necessary to correctly sort the outputs, since their ordering after ICA is arbitrary and may vary from frequency bin to frequency bin. This so-called permutation problem can be solved in a number of ways; see, for example, [17, 18]. In all following work, permutations have been corrected by sorting outputs in accordance with the distance criterion

(8)

described in [19]. Here, is defined by

(9)

is the frequency bin at which the permutation problem has to be solved and denotes the frequency bin to be used as reference, and is a constant. For this strategy, ordering permutations first at higher frequencies and proceeding downward has proven beneficial; therefore, the ordering at the maximum frequency bin was chosen as reference, and sorting according to (8) took place binwise in descending order.

3. Time-Frequency Masking for ICA

However, some residual noise and interference are to be expected even after applying source separation, especially in reverberant environments. For removing these, post-masking of ICA-results has often been employed [14, 17, 20, 21] using

(10)

This is motivated by the potential gains in Signal-to-Interference Ratio (SIR), which can already be attained by simple binary masking with an ideal mask. With such an, albeit practically elusive, mask, which is given by evaluating the true knowledge about the signal spectra via

(11)

it is possible to obtain more than 40 dB SIR improvement on two-speaker mixtures even without ICA, while remaining above 20 dB of Signal-to-Distortion Ratio (SDR) [22]. The results of one such exemplary experiment are shown in Figure 1. For this figure, an additional masking threshold was introduced, and the mask in (11) was only set to 1, if the source of interest was greater than all other sources by at least  dB, that is, if

(12)
Figure 1
figure 1

Performance of an ideal binary mask, tested on 12 pairs of same-and mixed-gender speakers. Performance is shown for frame lengths (NFFT) of 256, 512, 1024, and 2048 samples in terms of SDR and SIR-improvement. When the SNR-threshold is increased, the red SDR-curves are decreasing monotonically, while a more pronounced monotonic increase can be observed for the SIR-improvement, shown in green color.

However, an ideal mask is impossible to obtain realistically; thus, approximations to it are required. For obtaining such an approximation, mask estimation based on ICA results has been proposed and shown to be successful, both for binary and soft masks, see, for example, [17, 18, 20]. The motivation for this procedure lies both in the noise-robustness of ICA, which can therefore unmix signals even when large interferences make the estimation of a time-frequency mask extremely difficult, and also in the fact that ICA will unmix signals even in those time-frequency regions, where two or more of them are simultanously active to a significant extent.

The architecture of such systems is shown in Figure 2 for the exemplary case of two sources and microphones.

Figure 2
figure 2

Structure of ICA with Postmasking. Here, stands for those ICA-based features that may be needed for mask estimation. All double arrows show the data flow of signal spectrograms, while single arrows indicate auxilliary information flow. The figure corresponds to the special case of two microphones and two estimated signals.

In the following, four types of masks are considered:

  1. (i)

    amplitude-based masks,

  2. (ii)

    phase-based masks,

  3. (iii)

    two types of interference-based masks,

which will be described in the subsequent sections.

3.1. Amplitude-Based Masks

One of the simplest post-masks suitable for postprocessing of ICA results is based on comparing the magnitude of all ICA outputs [20]. Due to the sparsity of sources in an appropriate spectral representation [8], only one should be dominant; therefore, all others are discarded.

In order for the strategy to be independent of the source signal energies, all ICA output signals need to be normalized to equal variance via

(13)

before the mask is computed.

Then, a hard amplitude mask can be obtained by comparing a local dominance ratio to an acceptance threshold via

(14)

with defined by

(15)

This is a rather simple approach, which has been enhanced in the following by applying a sigmoid nonlinearity to reduce artefacts. This can be easily achieved by redefining to

(16)

where is the mask gain controlling its steepness.

3.2. Phase-Based Masks

The source separation performance of ICA can also be seen from a beamforming perspective. When the unmixing filters learned by ICA are viewed as frequency-variant beamformers, it can be shown that successful ICA effectively places zeros in the directions of all interfering sources [23]. Therefore, the zero directions of the unmixing filters should be indicative of all source directions. Thus, when the local direction of arrival (DOA) is estimated from the phase of any one given time-frequency bin, this should give an indication of the dominant source in this bin. This is the principle underlying phase-based time-frequency masking strategies.

Phase-based postmasking of ICA outputs was introduced in [17]. In this method, the angle between the 'th target basis vector of the unmixing matrix and the microphone signal vector is used in order to determine whether and to what degree a given channel should be masked.

According to (2), when noise is not considered, the mixing system can be modeled by

(17)

Here, denotes the 'th column of the mixing matrix, and is the value of source in frequency at frame .

ICA results in an unmixing matrix , which is used to obtain estimated source signals according to (4). This corresponds to

(18)

where the estimated mixing matrix is given in terms of its constituent column vectors, . When comparing (18) and (2), and considering (3), it can be seen that the columns of correspond to the columns of , the matrix containing the values of the room transfer function for each frequency, up to an arbitrary scaling of column vectors and a reordering of sources, which is constant over frequencies after the permutation correction. Thus, in those time-frequency bins, where source is dominant, the associated basis vector should correspond to the column of the mixing matrix associated with source . In general, the index may be different from the index , due to possible permutations. However, as this change of indices will be consistent over frequency, it is disregarded in the following.

Thus, after appropriate normalization, in frames with dominant source , the associated basis vector would also be equal to of the current frame. If an anechoic model is appropriate for the mixing process at hand, the basis vectors should form clusters, one for each of the sources. For this purpose, the basis vectors need to be normalized regarding both their phases and amplitudes as detailed in [17]. For phasenormalization, they are first normalized with respect to a reference sensor and secondly frequency-normalized, which gives

(19)

as a normalized vector. Here, stands for the center frequency in Hz of frequency bin ; is the velocity of sound and stands for the distance between the reference sensor and the farthest of all other microphones . For this vector, the phase varies only between

(20)

which is important for computing a distance measure between vectors. Finally, amplitude-normalization is carried out by

(21)

After the normalized basis vectors are thus available, masking is carried out based on the angle between the observed vector and the basis vector . This angle is computed in a whitened space, where and are premultiplied by the whitening matrix , which is the inverse square root of the sensor autocorrelation matrix, .

The mask is a soft mask, which is determined from by the logistic function

(22)

The parameter describes the steepness of the mask and is the transition point, where the mask takes on the value . More details on the mask computation can be found in [17].

3.3. Interference-Based Masks

As an alternative criterion for masking, residual interference in the signal may be estimated and the mask may be computed as an MMSE estimator of the clean signal. This can be achieved with a number of approaches, two of which will be presented here in more detail.

3.3.1. Ephraim-Malah Filter-Based Post-Filtering

The remaining noise components in the separated signals can be minimized based on the Ephraim-Malah filter technique. For this purpose, the following signal model is assumed

(23)

where the clean signal is corrupted by a noise component , the remaining sum of the interfering signals and the background noise. The estimated clean signals are obtained by

(24)

where is the amplitude estimator gain. For the calculation of the gain , different speech enhancement algorithms can be used. In the following, we are using the log spectral amplitude estimator (LSA) as proposed by Ephraim and Malah [24].

For the algorithm, the a posteriori and a priori SNR are defined by

(25)

Here, is a smoothing parameter, is the th ICA-output, and is the noise power

(26)

with the noise estimate given by

(27)

With these parameters, the log spectral amplitude estimator is given by

(28)

with denoting the local a priori SNR and

(29)

3.3.2. Inclusion of Speech Presence Probabilities

According to [25], the previous approach can be expanded using additional information for calculation of speech presence probabilities. The gain function of the Ephraim-Malah filter becomes

(30)

where is a spectral attenuation floor, the gain of the speech enhancement method, and the speech presence probability [26, 27]. The infomation needed for speech presence probability calculation is gained from a bin-wise noise dominance estimate, which can be computed in the spectrum domain by [18]

(31)

A similar measure of speech dominance is needed in addition

(32)

Both measures utilize the difference between the estimated target spectrogram and the sum of estimated nontarget signals . The Euclidean norm operator is applied to two-dimensional windowed spectrograms here by taking the sum over their squared entries, and

(33)

uses a two-dimensional window function of size , usually a two-dimensional Hanning window. The speech presence probability is then approximated by a soft mask via

(34)

Here, and are parameters specifying the two threshold points and the mask gain, respectively.

4. Estimation of Uncertainties

Due to the use of time-frequency masking, part of the information of the original signal might be eliminated along with the interfering sources. To compensate for this lack of information, each masked estimated source is considered as uncertain and described in the form of a posterior distribution of each Fourier coefficient of the clean signal given the available information.

Estimating the uncertainty in the spectrum domain has clear advantages, when contrasted with uncertainty estimation in the domain of speech recognition, since much intermediate information about the signal and noise process as well as the mask is known in this phase of signal processing, but is generally not available in the further steps of feature extraction. This has motivated a number of studies on spectrum domain uncertainty estimation, most recently for example [7, 10]. In contrast to other methods, the suggested strategy possesses two advantages: it does not need a detailed spectrum domain speech prior, which may require a large number of components or may incur the need for adaptation to the speaker and environment; and it gives a computationally very inexpensive approximation that is applicable for both binary and soft masks.

The model used here for this purpose is the complex Gaussian uncertainty model [28]

(35)

where the mean is set equal to the Fourier coefficient obtained from post-masking and the variance represents the lack of information, or uncertainty. In order to determine , two alternative procedures were used.

4.1. Ideal Uncertainties

Ideal Uncertainties describe the squared difference between the true and the estimated signal magnitude. They are computed by

(36)

where is the reference signal. However, these ideal uncertainties are available only in experiments where a reference signal has been recorded. Thus, the ideal results may only serve as a perspective of what the suggested method would be capable of if a very high quality error estimate were already available.

4.2. Masking Error Estimate

In practice, it is necessary to approximate the ideal uncertainty estimate using values that are actually available. Since much of the estimation error is due to the time-frequency mask, in further experiments such a masking error was used as the single basis of the uncertainty measure.

This uncertainty due to masking can be computed by

(37)

If , this error estimate would assume that the time-frequency mask leads to missing signal information with 100 certainty. The value should be lower to reflect the fact that some of the masked time-frequency bins contain no target speech information at all. To obtain the most suitable value for , the following expression was minimized

(38)

In order to avoid adapting parameters to each of the test signals and masks, this minimization was carried out only once and only for a mixture not used in testing. After averaging over all mask types, the same value of was used in all experiments and for all datasets. This optimal value was .

5. Propagation of Uncertainties

When uncertain features are available in the STFT domain, they could in principle be used for spectrum domain speech recognition. However, as shown in [29], due to the less robust spectrum domain models, this does not provide for optimum results. Instead, a more successful approach is to transform the uncertain description of speech from the spectrum domain to the domain of speech recognition. This can in principle be achieved by two approaches, data-driven as in [7] or model-driven as in [5]. In the following, we only consider the model-driven approach, which can achieve very low propagation errors with small memory requirements and without the need for a training phase [10]. However, a detailed comparison of both principal methods remains an interesting target for future work.

In order to carry out the propagation through the feature extraction process, the uncertain spectrum domain description is considered as specifying speech as a random variable according to (35). If such an uncertain description of the STFT is used, the corresponding posterior distribution has to be propagated into the feature domain. For this purpose, the effect of all transformations in the feature extraction process on this probability distribution needs to be considered, which will result in an estimated feature domain random variable, describing both the mean of the speech features as well as the associated degree of uncertainty. Since this computation takes place for each feature and in each bin, subsequent recognition will have a maximally precise description of all uncertainties, allowing the algorithm to focus most on those features that are most reliable, and, if desired, to replace the uncertain ones by better estimates under simultaneous consideration of the recognizer speech model.

In conventional automatic speech recognition, only the STFT of each estimated source must be transformed into the feature domain of automatic speech recognition. Feature extractions involve multiple transformations, some of them nonlinear, which are performed jointly on multiple features of the same frame or by combining features from different time frames. Propagating an uncertain description of the STFT of each estimated source is therefore a complicated task that can be simplified by propagating only first- and second-order information. This section shows how this propagation can be attained by a piecewise approach in which the feature extraction is divided into different steps and the optimal method is chosen to perform uncertainty propagation in each step. Uncertainty propagation is used with two of the more robust speech recognition features, namely the Mel-cepstrum coefficients (MFCCs) [30] and the cepstral coefficients obtained from the RelAtive SpecTrAl Perceptual Linear Prediction (RASTA-PLP) feature extraction [31], here denoted as RASTA-LPCCs.

5.1. Mel-Cepstral Feature Extraction

The conventional Mel-cepstral feature extraction consists of the following steps.

  1. (1)

    Extract the short-time spectral amplitude (STSA) from the STFT.

  2. (2)

    Compute each filter output of a Mel-filterbank as a weighted sum of the STSA features of each frame.

  3. (3)

    Apply the logarithm to each filter output.

  4. (4)

    Compute the discrete cosine transform (DCT) from each frame of log-filterbank features.

In order to propagate random variables rather than deterministic signals, these steps were modified as follows.

Step can be solved if we take into account that if a Fourier coefficient is complex Gaussian distributed as given by (35), its amplitude is Rice distributed. From the first raw moment of the Rice distribution, it is possible to compute the mean of the uncertain STSA features as [28]

(39)

where and correspond to the modified Bessel functions of order zero and one, respectively. The variance of the uncertain STSA features can be computed from the first and second raw moments as

(40)

Step in the Mel-cepstral feature extraction corresponds to the Mel-filterbank, which is a linear transformation and bears no additional difficulty for the propagation of mean and covariance. In general, given a random vector variable and a linear transformation defined by the matrix , the transformed mean and covariance correspond to

(41)

Step corresponds to the computation of the logarithm. Since the distribution of the Mel-STSA uncertain features has a relatively low skewness and the dimensionality of the features has been reduced by approximately one order of magnitude through the application of the Mel-filterbank, the use of the pseudo-Montecarlo method termed unscented transform [32] provides an acceptable trade-off between accuracy and computational cost. Details regarding the use of the unscented transform for uncertainty propagation can be found in [28].

Step , the DCT transform, completes the computation of the MFCC coefficients. Since this is a linear transformation like the Mel-filterbank, it can be computed according to (41).

5.2. Relative Spectral Perceptual Linear Prediction Feature Extraction

The obtention of the RASTA-Linear Prediction Cepstral Coefficients (RASTA-LPCCs) corresponds to the following steps.

  1. (1)

    Extract the power spectral density (PSD) from the STFT.

  2. (2)

    Compute each filter output of a Bark-filterbank as a weighted sum of the PSD features of each frame.

  3. (3)

    Apply the logarithm to each filter output.

  4. (4)

    Filter the resulting frames with the RASTA IIR filter.

  5. (5)

    Add the equal loudness curve and multiply by 0.33 to simulate the power law of hearing.

  6. (6)

    Apply the exponential to invert the effect of the logarithm.

  7. (7)

    Compute an all-pole model of each frame to obtain the linear prediction coefficients (LPCs).

  8. (8)

    Compute cepstral coefficients from each LPC frame.

This feature extraction also requires a set of modifications and approximations in order to be applicable for uncertain features. An overview of these is shown in Figure 3 and the necessary computational steps are given in detail below.

Figure 3
figure 3

Block diagram of the RASTA-LPCC feature extraction extended to encompass uncertainty propagation. Arrows indicate the propagation of mean and covariance at each step. The figure shows the propagation of the posterior corresponding to the estimated source.

Step can be solved similarly to the case of the STSA. The propagated mean and covariance can be computed from the second and fourth raw moments of the Rice distribution as [33]

(42)

Step which corresponds to the Bark-filterbank, can be resolved identically to the case of the Mel-filterbank of the MFCCs by using (41).

Step of the RASTA-PLP transformation consists of the computation of the logarithm as in the case of the Mel-cepstral feature extraction. However, the distribution of the Bark-PSD uncertain features presents a much higher skewness compared to the case of the Mel-STSA features. Consequently, the propagation through this step is more accurately computed using the assumption of log-normality of the Bark-PSD features, also used in other propagation approaches like [3, 5, 34]. The covariance under this assumption can be approximated by [34, equation 5.47], yielding

(43)

where , are the filterbank indices and and correspond to the mean and covariance after the Bark-filterbank transformation. The mean can be approximated by [34, equation 5.46]

(44)

Step corresponds to the RASTA filter. The RASTA filter is an IIR filter that imitates the preference of humans for sounds with a certain rate of change. It realizes the transfer function

(45)

This can also be expressed by the following difference equation

(46)

where is a column vector containing the th frame of RASTA-filtered features, and and correspond to previous logarithm domain input and RASTA domain output frames, respectively. The scalars and are the normalized feedforward and feedback coefficients. Computing the propagation of the mean through this transformation is identical to the case of the Mel or Bark filterbanks. The computation of the covariance is, however, more complex due to the created time correlation between inputs and outputs. The correlation matrix for the th filter output can be computed from (46) as

(47)

where the last summand accounts for the input output correlation. The corresponding covariance of the RASTA features can be obtained as

(48)

Steps and correspond to conventional linear transformations in the logarithm domain, and therefore the propagation through them can be solved by applying (41) to obtain the means and covariances . Furthermore, since the assumption of log-normality in the Bark-PSD domain implies that the log-domain features are normally distributed, RASTA, preemphasis, and power-law transformations do not alter this condition.

Step corresponds to the transformation through the exponential. Since this transformation is the inverse of the logarithm, the corresponding features are log-normally distributed with mean and covariance computable from [34, equations 5.44, 5.45]

(49)

The final steps of the RASTA-LPCC feature extraction, Steps and correspond to the computation of the all-pole model to obtain the LPC coefficients, described in the conventional PLP technique [35], and the computation of the cepstral coefficients from the LPCs using [30, equation 3] Due to the complex nature of these transformations and the low skewness of the uncertain features after the exponential transformation, the propagation is computed using the unscented transform, similarly to the case of the logarithm transformation for the Mel-cepstral features.

6. Recognition of Uncertain Features

When features for speech recognition are given not as point estimates, but rather in the form of a posterior distribution with estimated mean and covariance the speech decoder must be modified in order to take this additional information into account. A number of approaches exist, both for binary and for continuous-valued uncertainties, for example, [2, 36, 37].

Here, two missing feature approaches were applied, which are capable of considering real-valued uncertainties. These methods, modified imputation [5] and HMM variance compensation [2], have been implemented for the Hidden Markov Model Toolkit (HTK) [38] and were used in the tests.

Both methods are appropriate for HMM-based systems, where recognition takes place by finding the optimum HMM state sequence , which gives the best match to the feature vector sequence when each HMM state has an associated output probability distribution .

6.1. HMM Variance Compensation

In HMM variance compensation, the computation of state output probabilities is modified to incorporate frame-by-frame and feature-by-feature uncertainties [2]. This is formulated as an averaging of the output probability distribution over all possible unseen cepstra defined by the posterior

(50)

which leads to

(51)

Here, denotes the HMM state, with mean and covariance . For Gaussian mixture models, the same procedure can be applied to each mixture component. This yields

(52)

for an -component mixture model with weights .

6.2. Modified Imputation

In modified imputation, the idea is to replace the imputation equation, originally proposed for completely missing features in [36], with an alternative formulation, which also allows for real-valued degrees of uncertainty. Thus, whereas missing parts of feature vectors are replaced by the corresponding components of the HMM model mean in classical imputation, modified imputation finds the maximum a posteriori estimate

(53)

Assuming a flat prior for , as shown in [5], (53) leads to

(54)

Finally, the modified imputation estimate of the feature vector in state

(55)

can be obtained. This estimate is used to evaluate the pdf of the HMM state at time , as in conventional recognition or classical imputation.

For mixture-of-Gaussian (MOG) models, (55) is evaluated separately for each mixture component to obtain separate estimates , and all mixture component probabilities are finally added to obtain the feature likelihood for state via

(56)

where stands for the mixture weight of component . This, again, is analogous to the process in conventional recognition or classical imputation.

7. Experiments

7.1. Room Recordings

For the evaluation of the proposed approaches, recordings were made in a noisy lab room with a reverberation time of  160 ms. In these recordings, audio files from the TIDigits database [39] were used and mixtures with two and three speakers were recorded at 11 kHz. The distance between the loudspeakers and the center of the microphone array was varied between 0.9 and 3 m. The experimental setup is shown schematically in Figure 4. The distance between two sensors was 3 cm and a linear array of four microphones was used in all experiments. The recording conditions for all mixtures are summarized in Tables 1 and 2.

Table 1 Mixture description.
Table 2 Mixture description.
Figure 4
figure 4

Experimental Setup.

7.2. Model Training

The HMM speech recognizer was trained with the HTK toolkit [38]. The trained HMMs comprised phoneme-level models with 6-component MOG emitting probabilities and a conventional left-right structure. The training data was mixed and it comprised the 114 speakers of the TI-DIGITS clean speech database along with the room recordings for speakers sa and rk used for adaptation. Speakers used for adaptation were removed from the test set. The feature extractions presented in Section 5 were also complemented with cepstral mean subtraction (CMS) for further reduction of convolutive effects. Since CMS is a linear operation, it poses no additional difficulty for uncertainty propagation.

7.3. Parameter Settings of Time-Frequency Masks

Parameters of all masks were set manually for good performance on all datasets, and were kept consistent throughout all experiments.

7.3.1. Amplitude-Based Masking

For amplitude-based masking, a soft mask according to (14) and (16) was used. Thus, there are two parameters, the mask threshold and the gain , which were set to and , respectively.

7.3.2. Phase-Based Masking

In phase-based masking according to (22), there are two free parameters as well, again a mask gain and also a mask threshold, the angle threshold . However, optimum performance was reached for different parameter values depending on the recognizer parameterization. For optimal performance on MFCC features, they were set to and , which will be refered to as Phase1 in the results. In contrast, for RASTA-PLP-based recognition, better results were generally achieved with and (Phase2), that is, the same threshold but less steep of a mask gain.

7.3.3. Interference-Based Masking

For the first interference-based mask, defined in Section 3.3.1, the two smoothing parameters defining the algorithm are set to and . This algorithm will be denoted by IB in the following.

The second interference-based algorithm additionally includes the speech probability estimate defined in Section 3.3.2. Thus, in addition to the parameters and , there are additional parameters in the weighting function (34). These are and , parameters specifying the two threshold points and the mask gain. They are defined to correspond to the mean absolute value of the estimated signal Fourier coefficients , the mean absolute value of the noise estimate Fourier coefficient ; and the mask gain is set to . For windowing in (33), a Hanning window of size is used. For this algorithm, the abbreviation IBPE will be used.

8. Results

8.1. Recognition Performance Measurement

To evaluate recognition performance, the number of reference labels (), of substitutions (), insertions () and deletions () are counted. From these values, the recognition accuracy PA is defined as

(57)

The value of , output by the HTK scoring tool, corresponds with , where is the word error rate that is also commonly used in the evaluation of speech recognition performance.

8.2. Multispeaker Recognition Results

At first, results are given for the estimated uncertainty values and RASTA-PLP features in Table 3 and for MFCC features in Table 4. Especially for RASTA-PLP features, results are improved notably by masking and missing feature recognition by modified imputation, averaging an absolute improvement of more than 10% over all tested masks and experiments. For MFCCs, significant improvements can also be achieved by the suggested strategy. This is true especially for the two strategies of phase masking and interference-based filtering with speech probability estimation. In both cases, an absolute improvement of about 6% can be achieved. It is also clearly visible that here, uncertainty decoding performs better on average.

Table 3 Word accuracy (WA) of ASR tests for RASTA-PLP features, estimated uncertainties. Here, the algorithms Phase1 and Phase2 utilize the parameters defined in Section 7.3.2, the entries with the heading Amplitude correspond with the mask given in Section 7.3.1, and the two interference-based strategies IB and IBPE are specified in Section 7.3.3. The two robust recognition strategies are abbreviated by MI for modified imputation and UD for uncertainty decoding.
Table 4 Word accuracy (WA) of ASR tests for MFCC features, estimated uncertainties.

When true rather than estimated uncertainties are used, results are again improved greatly, both for RASTA-PLP and for MFCC features, as shown in Tables 5 and 6. Compared to the use of ICA alone, a relative error rate reduction of 59% for uncertainty decoding and of 69% for modified imputation is achieved in the case of RASTA features.

Table 5 Word accuracy (WA) of ASR tests for RASTA-PLP features, true uncertainties.
Table 6 Word accuracy (WA) of ASR tests for MFCC features, true uncertainties.

Similar performance gains can be observed in the case of MFCC features, where word error rates can be reduced by 64% and 62% for uncertainty decoding and modified imputation, respectively. Comparing the uncertain recognition strategies, again, modified imputation is on average the better performer for RASTA-PLPs, whereas uncertainty decoding leads to better performance gains for MFCCs. Concerning the masking strategies, it is clear that the IB-mask, which has fairly aggressive parameter settings and an extremely low recognition rate without missing feature approaches, is the best for this case of ideal uncertainties.

9. Conclusion

An overview of the use of independent component analysis for speech recognition under multitalker conditions has been given. As shown by the presented results, the conventional strategy of purely linear source separation can be improved by post-masking in the time-frequency domain, if this is accompanied by missing-feature speech recognition. Especially for three-speaker scenarios, this improves the recognition rate notably. Interestingly, the optimal decoding strategy is apparently dependent on the features that are used for recognition. Whereas modified imputation was clearly superior for RASTA features, better results for MFCC features have almost consistently been achieved by uncertainty decoding, even though uncertainties were estimated in the spectrum domain for both features and propagated to the recognition domain of interest. Further work will be necessary to determine how these results correspond to the degree of model mismatch in both domains, with the aim of determining an optimal decoding strategy depending on specific application scenarios.

A vital aspect of missing feature recognition is still the estimation of the feature uncertainty. Here, an ideal uncertainty estimate will result in superior recognition performance for all considered test cases and all applied post masks. Since such an ideal uncertainty is not available in practice, the value needs to be estimated from available data. In the presented cases, this measure has been derived from the ICA output signal and the applied nonlinear gain function. The resulting uncertainty estimate has a correlation coefficient of 0.45 with the true uncertainties, leading to superior and consistent performance among all tested uncertainty estimates.

However, uncertainty estimation for the ICA output signals should be improved further, in order to approximate more closely the ideally achievable performance of this strategy. For this purpose, it will be interesting to compare the proposed uncertainty estimation to other approaches. Specifically, the uncertainty estimation described in [7] is of interest for use with any type of recognition feature and preprocessing method, but it requires learning of a regression tree for the given specific feature set and environment. In contrast, feature-specific methods described for example in [2, 3] are applicable only to the feature domain they have been derived for, but can be used without the need for additional training stages.

Since none of the above methods is designed specifically for use with ICA, another direction of research is a better use of the statistical information gathered during source separation. Further research can thus focus on an optimal use of this intermediate data, and on its combination with more detailed prior models in the spectrum domain, as those in [29], for arriving at more accurate uncertainty estimates which utilize all avaliable data from multiple microphones.

References

  1. Kristjansson TT, Frey BJ: Accounting for uncertainty in observations: a new paradigm for robust automatic speech recognition. Proceedings of the IEEE International Conference on Acustics, Speech, and Signal Processing (ICASSP '02), 2002

    Google Scholar 

  2. Deng L, Droppo J, Acero A: Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion. IEEE Transactions on Speech and Audio Processing 2005, 13(3):412-421.

    Article  Google Scholar 

  3. Stouten V, Van Hamme H, Wambacq P: Application of minimum statistics and minima controlled recursive averaging methods to estimate a cepstral noise model for robust ASR. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), May 2006 1:

    Google Scholar 

  4. Van Segbroeck M, Van Hamme H: Robust speech recognition using missing data techniques in the prospect domain and fuzzy masks. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), 2008 4393-4396.

    Google Scholar 

  5. Kolossa D, Klimas A, Orglmeister R: Separation and robust recognition of noisy, convolutive speech mixtures using time-frequency masking and missing data techniques. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '05), 2005 82-85.

    Google Scholar 

  6. Kühne M, Togneri R, Nordholm S: Time-frequency masking: linking blind source separation and robust speech recognition. In Speech Recognition: Technologies and Applications. IN-TECH, Vienna, Austria; 2008:61-80.

    Google Scholar 

  7. Srinivasan S, Wang D: Transforming binary uncertainties for robust speech recognition. IEEE Transactions on Audio, Speech and Language Processing 2007, 15(7):2130-2140.

    Article  Google Scholar 

  8. Yilmaz Ö, Rickard S: Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing 2004, 52(7):1830-1847. 10.1109/TSP.2004.828896

    Article  MathSciNet  Google Scholar 

  9. Brown G, Wang D: Separation of speech by computational auditory scene analysis. In Speech Enhancement. Edited by: Benesty J, Makino S, Chen J. Springer, New York, NY, USA; 2005:371-402.

    Chapter  Google Scholar 

  10. Astudillo RF, Kolossa D, Mandelartz P, Orglmeister R: An uncertainty propagation approach to robust ASR using the ETSI advanced front-end. IEEE Journal of Selected Topics in Signal Processing. In press

  11. Buchner H, Aichner R, Kellermann W: TRINICON: a versatile framework for multichannel blind signal processing. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), 2004 3: 889-892.

    Google Scholar 

  12. Makino S, Lee T-W, Sawada H (Eds): Blind Speech Separation. Springer, New York, NY, USA; 2007.

    Google Scholar 

  13. Kumatani K, McDonough J, Klakow D, Garner PN, Li W: Adaptive beamforming with a maximum negentropy criterion. Proceedings of the Hands-Free Speech Communication and Microphone Arrays (HSCMA '08), January 2008, Trento, Italy 180-183.

    Chapter  Google Scholar 

  14. Roman N, Wang D, Brown GJ: Speech segregation based on sound localization. Journal of the Acoustical Society of America 2003, 114(4 I):2236-2252.

    Article  Google Scholar 

  15. Brown GJ, Cooke M: Computational auditory scene analysis. Computer Speech and Language 1994, 8(4):297-336. 10.1006/csla.1994.1016

    Article  Google Scholar 

  16. Cichocki A, Amari S: Adaptive Blind Signal and Image Processing. John Wiley & Sons, New York, NY, USA; 2002.

    Book  Google Scholar 

  17. Sawada H, Araki S, Mukai R, Makino S: Blind extraction of dominant target sources using ICA and time-frequency masking. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(6):2165-2173.

    Article  Google Scholar 

  18. Hoffmann E, Kolossa D, Orglmeister R: A batch algorithm for blind source separation of acoustic signals using ICA and time-frequency masking. Proceedings of the 7th International Conference on Independent Component Analysis and Signal Separation (ICA '07), 2007, London, UK 480-487.

    Chapter  Google Scholar 

  19. Kamata K, Hu X, Kobatake H: A new approach to the permutation problem in frequency domain blind source separation. Proceedings of the 5th International Conference on Independent Component Analysis and Blind Signal Separation (ICA '04), September 2004, Granada, Spain 849-856.

    Chapter  Google Scholar 

  20. Kolossa D, Orglmeister R: Nonlinear postprocessing for blind speech separation. Proceedings of the 5th International Conference on Independent Component Analysis and Signal Separation (ICA '04), 2004, Granada, Spain 832-839.

    Chapter  Google Scholar 

  21. Pedersen MS, Wang D, Larsen J, Kjems U: Overcomplete blind source separation by combining ICA and binary time-frequency masking. Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, September 2005 15-20.

    Google Scholar 

  22. Kolossa D: Independent component analysis for environmentally robust speech recognition, Ph.D. dissertation. TU Berlin, Berlin, Germany; 2007.

    Google Scholar 

  23. Araki S, Makino S, Hinamoto Y, Mukai R, Nishikawa T, Saruwatari H: Equivalence between frequency-domain blind source separation and frequency-domain adaptive beamforming for convolutive mixtures. EURASIP Journal on Applied Signal Processing 2003, 2003(11):1157-1166. 10.1155/S1110865703305074

    Article  MATH  Google Scholar 

  24. Ephraim Y, Malah D: Speech enhancement using a minimum mean-square error-log-spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985, 33(2):443-445. 10.1109/TASSP.1985.1164550

    Article  Google Scholar 

  25. Cohen I: Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator. IEEE Signal Processing Letters 2002, 9(4):113-116. 10.1109/97.1001645

    Article  Google Scholar 

  26. Cohen I: On speech enhancement under signal presence uncertainty. Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), May 2001, Salt Lake City, Utah, USA 167-170.

    Google Scholar 

  27. Ephraim Y, Cohen I: Recent advancements in speech enhancement. In The Electrical Engineering Handbook. CRC Press, Boca Raton, Fla, USA; 2006.

    Google Scholar 

  28. Astudillo RF, Kolossa D, Orglmeister R: Propagation of statistical information through non-linear feature extractions for robust speech recognition. Proceedings of the 27th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt '07), November 2007 954: 245-252.

    Article  Google Scholar 

  29. Raj B, Seltzer ML, Stern RM: Reconstruction of missing features for robust speech recognition. Speech Communication 2004, 43(4):275-296. 10.1016/j.specom.2004.03.007

    Article  Google Scholar 

  30. Davis SB, Mermelstein P: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 1980, 28(4):357-366. 10.1109/TASSP.1980.1163420

    Article  Google Scholar 

  31. Hermansky H, Morgan N: RASTA processing of speech. IEEE Transactions on Speech and Audio Processing 1994, 2(4):578-589. 10.1109/89.326616

    Article  Google Scholar 

  32. Julier S, Uhlmann J: A general method for approximating nonlinear transformations of probability distributions. University of Oxford, Oxford, UK; 1996.

    Google Scholar 

  33. Astudillo RF, Kolossa D, Orglmeister R: Uncertainty propagation for speech recognition using RASTA features in highly nonstationary noisy environments. Proceedings of the Workshop for Speech Communication (ITG '08), 2008

    Google Scholar 

  34. Gales M: Model-based techniques for noise robust speech recognition, Ph.D. thesis. Cambridge University, Cambridge, UK; 1996.

    Google Scholar 

  35. Hermansky H, Hanson BA, Wakita H: Perceptually based linear predictive analysis of speech. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), 1985 509-512.

    Chapter  Google Scholar 

  36. Barker J, Green P, Cooke M: Linking auditory scene analysis and robust ASR by missing data techniques. Proceedings of the Workshop on Innovation in Speech Processing (WISP '01), 2001

    Google Scholar 

  37. Arrowood J, Clements M: Using observation uncertainty in HMM decoding. Proceedings of the International Conference on Spoken Language Processing (ICSLP '02), 2002

    Google Scholar 

  38. Young S: The HTK Book (for HTK Version 3.4). Cambridge University, Engineering Department

  39. Leonard RG: Database for speaker-independent digit recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '84), 1984 3:

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dorothea Kolossa.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Kolossa, D., Fernandez Astudillo, R., Hoffmann, E. et al. Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions. J AUDIO SPEECH MUSIC PROC. 2010, 651420 (2010). https://doi.org/10.1155/2010/651420

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1155/2010/651420

Keywords