Noise-robust speech feature processing with empirical mode decomposition
EURASIP Journal on Audio, Speech, and Music Processing volume 2011, Article number: 9 (2011)
Abstract
In this article, a novel technique based on the empirical mode decomposition methodology for processing speech features is proposed and investigated. The empirical mode decomposition generalizes the Fourier analysis. It decomposes a signal as the sum of intrinsic mode functions. In this study, we implement an iterative algorithm to find the intrinsic mode functions for any given signal. We design a novel speech feature post-processing method based on the extracted intrinsic mode functions to achieve noise-robustness for automatic speech recognition. Evaluation results on the noisy-digit Aurora 2.0 database show that our method leads to significant performance improvement. The relative improvement over the baseline features increases from 24.0 to 41.1% when the proposed post-processing method is applied on mean-variance normalized speech features. The proposed method also improves over the performance achieved by a very noise-robust front-end when the test speech data are highly mismatched.
1 Introduction
State-of-the-art automatic speech recognition (ASR) systems can achieve satisfactory performance under well-matched conditions. However, when there is a mismatch between the train data and the test data, the performance often degrades quite severely. The versatility of everyday environments requires ASR systems to function well in a wide range of unseen noisy conditions. Therefore, noise-robust speech processing technology for recognition is an important research topic.
Many techniques for noise-robustness have been proposed and put to tests. Speech enhancement methods, such as the well-known spectral subtraction [1] and Wiener filters [2], introduce pre-processing steps to remove the noise part or estimate the clean part given the noisy speech signal. Auditory front-end approaches incorporate knowledge of human auditory systems acquired from psychoacoustic experiments, such as critical bands and spectral/temporal masking effects [3, 4], in the process of speech feature extraction. Noise-robust feature post-processing techniques, such as cepstral mean subtraction (CMS) [5], cepstral variance normalization (CVN) [6], and histogram equalization (HEQ) [7], aim to convert raw speech features to a form that is less vulnerable to the corruption of adverse environments.
In this article, we study a feature post-processing technique for noise-robust ASR based on the empirical mode decomposition (EMD) [8]. Through EMD, a feature sequence (as a function of time) is decomposed into intrinsic mode functions (IMFs). The basic idea behind our proposed method is that the low-order IMFs contain high-frequency components, and they are removed based on a threshold estimated from training data. Alternatively, the recombination weights can be decided using other algorithms [9]. Since EMD is a temporal-domain technique, a comparison of EMD with common temporal processing techniques is in order. In the RASTA processing of speech [10], a filter combining temporal difference and integration effects is designed. It results in a band-pass filter, which discriminates speech and noise by their difference in temporal properties. The RASTA processing technique is generally considered very effective for both additive and convolution noises. However, a basic assumption underlying any filtering technique is that the signals being processed are approximately stationary, which may not be the case for speech or non-stationary noises. Furthermore, using linear filters implies a decomposition of the signal into sinusoidal functions. In contrast, IMFs used in EMD are data driven, so they are theoretically more general than sinusoidal functions, and may lead to better signal-noise decomposition. A comparison between the results of using EMD and RASTA is given in Section 5. In the modulation spectrogram approach [11], modulation patterns of temporal envelope signals of the critical-band channels are represented by the amplitudes at 4 Hz (via FFT) dynamically. This representation proves to be robust for syllable recognition under noise corruption. For a different application, critical parameters of central frequency may have to be tuned.
In temporal modulation processing of speech signals [12], the DC part of the signal is denoised for better speech detection in noisy conditions, and to provide an SNR estimator via cross-correlation with the low modulation-frequency (1.5 Hz) part of the signal. In contrast to the temporal processing methods reviewed above, we note that the proposed EMD does not assume stationarity of the signal, and there are no task-dependent parameters to be tuned when we extract IMFs.
The rest of this article is organized as follows. In Section 2, we introduce the formulation of EMD and show that it is a generalization of the Fourier analysis. In Section 3, we design an iterative algorithm to extract IMFs for EMD. In Section 4, we describe the proposed EMD-based feature post-processing method and give a few illustrative examples. Experimental results are presented in Section 5. Finally, concluding remarks are summarized in Section 6.
2 Empirical mode decomposition
The EMD generalizes the Fourier series. Sinusoidal basis functions used in the Fourier analysis are generalized to data-dependent IMFs. Compared to a sinusoidal function, an IMF satisfies the generalized alternating property and the generalized zero-mean property, and relaxes the amplitude and frequency from being constant to being generally time-varying.
2.1 The Fourier series
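A signal x(t) of finite duration, say T, can be represented by a Fourier series, which is a weighted sum of complex exponential functions with frequencies ω_{k} = 2πk/T. That is, we can write
$$x(t) = \sum_{k=-\infty}^{\infty} a_k e^{j\omega_k t}. \tag{1}$$
Defining
$$p_k = a_k + a_{-k}, \qquad q_k = j\,(a_k - a_{-k}), \tag{2}$$
we can rewrite (1) as
$$x(t) = a_0 + \sum_{k=1}^{\infty} \left( p_k \cos\omega_k t + q_k \sin\omega_k t \right). \tag{3}$$
If x(t) is real, p_{k} and q_{k} in (2) are real. Equation 3 can be seen as a decomposition of x(t) in the vector space spanned by the following basis set:
$$\{\, 1,\ \cos\omega_k t,\ \sin\omega_k t \;:\; k = 1, 2, \ldots \,\}. \tag{4}$$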
The following properties of the basis functions of the Fourier series are critical in the generalization to EMD.

(alternating property) A basis function has alternating stationary points and zeros. That is, there is exactly one zero between two stationary points, and exactly one stationary point between two zeros.

(zero-mean property) The maxima and minima of the basis functions are opposite in sign, and the average of the maxima and the minima is 0.
2.2 Empirical mode decomposition
In EMD, a real-valued signal x(t) is decomposed as
$$x(t) = \sum_{k} c_k(t). \tag{5}$$
The functions c_{k}(t) in (5) are called IMFs. As a generalization of the sinusoidal function, an IMF is required to satisfy the following generalized properties.

(generalized alternating property) The difference between the number of extrema (maxima and minima) and the number of zeros is either 0 or 1.

(generalized zero-mean property) The average of the upper envelope (a smooth curve through the maxima) and the lower envelope (a smooth curve through the minima) is zero.
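The generalized alternating property can be checked numerically on a sampled sequence. The following is a minimal sketch in Python, not the authors' implementation: the function names are hypothetical, extrema are taken to be strict interior slope sign changes, and zeros are counted as sign changes between neighboring samples. Both counting rules are assumptions about how the property is evaluated on discrete data.

```python
def count_extrema_and_zeros(c):
    """Count interior local extrema and zero crossings of a sampled sequence."""
    extrema = 0
    for i in range(1, len(c) - 1):
        left, right = c[i] - c[i - 1], c[i + 1] - c[i]
        if left * right < 0:  # slope changes sign: strict local max or min
            extrema += 1
    zeros = 0
    for i in range(len(c) - 1):
        if c[i] == 0 or c[i] * c[i + 1] < 0:  # exact zero or sign change
            zeros += 1
    return extrema, zeros


def has_alternating_property(c):
    """True if the numbers of extrema and zeros differ by at most one."""
    extrema, zeros = count_extrema_and_zeros(c)
    return abs(extrema - zeros) <= 1
```

A sampled sinusoid passes this check, while the same sinusoid shifted by a constant larger than its amplitude fails it, because the shifted sequence keeps its extrema but has no zeros. Flat plateaus (equal neighboring samples) are not counted as extrema in this sketch.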
The amplitude and frequency of an IMF are defined as follows. Given a real-valued function c_{k}(t), let d_{k}(t) be the Hilbert transform of c_{k}(t). A complex function f_{k}(t) is formed by
$$f_k(t) = c_k(t) + j\,d_k(t) = \alpha_k(t)\, e^{\,j \int \nu_k(t)\, dt}. \tag{6}$$
In (6), we identify α_{k}(t) and ν_{k}(t) as the time-dependent amplitude and the time-dependent frequency of f_{k}(t). Note that the Fourier analysis is a special case of (6), since sin ω_{k}t is the Hilbert transform of cos ω_{k}t. While sinusoidal functions have constant amplitudes and frequencies, IMFs have time-varying amplitudes and frequencies.
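As an illustration of how a time-dependent amplitude and frequency can be read off a sampled sequence, the sketch below builds the complex function c + j d with a discrete Hilbert transform. It is an assumption-laden example rather than the authors' code: the function names are hypothetical, the sampling period is taken as one frame, the transform is computed with a plain O(N^2) DFT to stay dependency-free, and the frequency is estimated from the phase increment between neighboring samples.

```python
import cmath
import math


def analytic_signal(c):
    """Return f = c + j*d, where d is the discrete Hilbert transform of c.

    Keep the DC (and Nyquist) bins, double the positive-frequency bins,
    zero the negative-frequency bins, then invert the DFT.
    """
    n = len(c)
    spectrum = [sum(c[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue  # DC and Nyquist bins stay as they are
        spectrum[k] *= 2 if k < n / 2 else 0
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def amplitude_and_frequency(c):
    """Instantaneous amplitude |f| and frequency (radians per sample) of c."""
    f = analytic_signal(c)
    amplitude = [abs(z) for z in f]
    # Phase increment between neighbors; the ratio avoids explicit unwrapping.
    # Assumes f[t] is never exactly zero.
    frequency = [cmath.phase(f[t + 1] / f[t]) for t in range(len(f) - 1)]
    return amplitude, frequency
```

For a cosine with an integer number of cycles in the window, this returns a constant amplitude of 1 and a constant frequency equal to the cosine's angular frequency, which is the Fourier special case mentioned above.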
3 Intrinsic mode functions
The core problem for EMD is to find IMFs given a signal. In the following subsections, we state the algorithm that we design for EMD and highlight properties of IMFs with an illuminating instance.
3.1 Extraction algorithm
The pseudocode of the extraction of IMFs is stated as follows.
Require: input signal x(t), maximum number of IMFs K;
remainder function r(t);
extracted IMF c_{k}(t);
upper envelope function u(t);
lower envelope function l(t);
hypothetical function h(t);
k ← 1;
r(t) ← x(t);
while k ≤ K and r(t) is not monotonic do
    h(t) ← 0;
    while h(t) is not an IMF do
        u(t) ← the upper envelope of r(t);
        l(t) ← the lower envelope of r(t);
        h(t) ← r(t) − (1/2)(u(t) + l(t));
        if (h(t) is an IMF or a stopping criterion is met) then
            c_{k}(t) ← h(t);
            r(t) ← x(t) − Σ_{i=1}^{k} c_{i}(t);
            k ← k + 1;
            break;
        else
            r(t) ← h(t);
        end if
    end while
end while
return the IMFs c_{1}(t), ..., c_{k−1}(t);
In this algorithm, there is an outer loop to control the number of IMFs and there is an inner loop to find the next IMF given the current remainder function. The spline interpolation is used to find envelopes (cf. Section 4.2). To guard against slow convergence, we enforce a criterion to terminate the iteration if the difference between the old and new candidates of h(t) is below a threshold.
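The two nested loops can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the names `local_extrema`, `envelope`, `emd`, and `max_sift` are hypothetical; the envelopes are piecewise-linear through the extrema and the end points, as a stand-in for the spline interpolation used in the article; and the inner loop stops on the difference-based criterion (or an iteration cap) without an explicit IMF test.

```python
def local_extrema(r):
    """Indices of strict interior local maxima and minima of r."""
    maxima = [i for i in range(1, len(r) - 1) if r[i - 1] < r[i] > r[i + 1]]
    minima = [i for i in range(1, len(r) - 1) if r[i - 1] > r[i] < r[i + 1]]
    return maxima, minima


def envelope(r, knots):
    """Piecewise-linear curve through r at `knots` plus both end points."""
    knots = sorted(set([0, len(r) - 1] + knots))
    out = []
    for a, b in zip(knots, knots[1:]):
        for t in range(a, b):
            out.append(r[a] + (r[b] - r[a]) * (t - a) / (b - a))
    out.append(r[-1])
    return out


def emd(x, max_imfs=5, sd_threshold=0.25, max_sift=50):
    """Return (imfs, residual) such that the IMFs plus the residual rebuild x."""
    imfs, residual = [], list(x)
    while len(imfs) < max_imfs:                  # outer loop: one IMF per pass
        maxima, minima = local_extrema(residual)
        if len(maxima) + len(minima) < 2:        # (nearly) monotonic: stop
            break
        h = residual
        for _ in range(max_sift):                # inner loop: sifting
            maxima, minima = local_extrema(h)
            if len(maxima) + len(minima) < 2:
                break
            upper, lower = envelope(h, maxima), envelope(h, minima)
            new_h = [v - 0.5 * (u + l) for v, u, l in zip(h, upper, lower)]
            # Difference between old and new candidates (small guard term assumed).
            sd = sum((a - b) ** 2 / (a * a + 1e-12) for a, b in zip(h, new_h))
            h = new_h
            if sd < sd_threshold:
                break
        imfs.append(h)
        residual = [r - c for r, c in zip(residual, h)]
    return imfs, residual
```

By construction, adding the returned IMFs to the residual recovers the input sample by sample, up to floating-point rounding.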
3.2 An important property
In the extraction of IMFs, the remainder function r(t) is recursively replaced by the hypothetical function h(t),
$$r(t) \leftarrow h(t) = r(t) - \frac{1}{2}\left(u(t) + l(t)\right). \tag{7}$$
The envelopes u(t) and l(t) are smoother than r(t) as each envelope is the spline interpolation of a proper subset of points of r(t). Being the remainder after the subtraction of the envelope mean, h(t) approximates the time-varying local high-frequency part of r(t). Whenever h(t) is a valid IMF, it is set to c_{k}(t) and subtracted, so the remaining part of the signal is smoother. Thus, we expect c_{k}(t) to be progressively smoother as k increases.
For an illustrative example, IMFs extracted from the log-energy sequence of an utterance in the Aurora 2.0 database with a signal-to-noise ratio (SNR) of 0 dB are shown in Figure 1. One can see clearly that the degree of oscillation decreases as k increases, as is predicted by our analysis.
4 EMD-based feature post-processing
The goal of speech feature post-processing is to reduce the mismatch between clean speech and noisy speech. In order to achieve this goal, we first look at the patterns introduced by the presence of noises of varying levels, then we propose a method to counter such patterns.
The patterns created by noises of several SNRs can be observed on the log-energy sequences of an underlying clean utterance in the Aurora 2.0 database, as shown at the top of Figure 2. We can see that the degree of oscillation of the speech feature sequence increases with the noise level. That is, the spurious spikes in the sequence basically stem from the noise signal, rather than from the speech signal.
4.1 Basic idea
Since the spikes introduced by the noise are manifest in the low-order IMFs, we propose to subtract these IMFs to alleviate mismatch. That is, for a noisy speech signal x(t) with EMD
$$x(t) = \sum_{k} c_k(t), \tag{8}$$
we simply subtract a small number, say N, of IMFs from x(t), i.e.,
$$\hat{x}(t) = x(t) - \sum_{k=1}^{N} c_k(t). \tag{9}$$
At the bottom of Figure 2, EMD post-processed sequences of the same instances are shown. Comparing them to the original sequences at the top, we can see that the mismatch between clean and noisy speech is significantly reduced.
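The subtraction step itself is simple once the IMFs are available. The helper below is a hypothetical sketch (the name `subtract_low_order_imfs` and the list-of-lists layout for the IMFs are assumptions); it removes the first N IMFs from a feature sequence, in the order in which they were extracted.

```python
def subtract_low_order_imfs(x, imfs, n):
    """Return x(t) minus the sum of the first n IMFs.

    x    : feature sequence (list of floats, one value per frame)
    imfs : IMFs in extraction order, each the same length as x
    n    : number of low-order IMFs to remove (n = 0 leaves x unchanged)
    """
    out = list(x)
    for c in imfs[:n]:
        out = [v - ck for v, ck in zip(out, c)]
    return out
```

With n = 1 this corresponds to removing only the most oscillatory component of the sequence.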
4.2 Implementation details
Spline interpolation is employed to find upper and lower envelopes during the process of IMF extraction. For upper envelopes (and similarly for lower envelopes), we use the local maximum points and the end points as the interpolation points. These interpolation points divide the entire time span into segments, and each segment, say segment i, is interpolated by a cubic function,
$$s_i(t) = \alpha_i t^3 + \beta_i t^2 + \gamma_i t + \delta_i, \tag{10}$$
where the parameters α_{i}, β_{i}, γ_{i}, δ_{i} can be determined by requiring the overall interpolation function to be continuous up to the second-order derivatives [13].
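A cubic spline of this kind can be computed by solving a tridiagonal system for the second derivatives at the interpolation points. The sketch below is one possible realization, not the authors' code: the function name is hypothetical, the end condition (zero second derivative at both ends, the "natural" spline) is an assumption because the article does not state one, and each segment is written in terms of knot values and second derivatives rather than the four coefficients per segment.

```python
import bisect


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing and contain at least two points.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Tridiagonal system a[i]*m[i-1] + b[i]*m[i] + c[i]*m[i+1] = d[i] for the
    # second derivatives m, with m[0] = m[n] = 0 (natural end condition).
    a = [0.0] * (n + 1)
    b = [1.0] * (n + 1)
    c = [0.0] * (n + 1)
    d = [0.0] * (n + 1)
    for i in range(1, n):
        a[i], b[i], c[i] = h[i - 1], 2.0 * (h[i - 1] + h[i]), h[i]
        d[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    for i in range(1, n + 1):               # forward elimination
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    m = [0.0] * (n + 1)
    m[n] = d[n] / b[n]
    for i in range(n - 1, -1, -1):          # back substitution
        m[i] = (d[i] - c[i] * m[i + 1]) / b[i]

    def evaluate(x):
        i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        t0, t1 = x - xs[i], xs[i + 1] - x
        return ((m[i] * t1 ** 3 + m[i + 1] * t0 ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * t1
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * t0)

    return evaluate
```

To build an upper envelope, `xs` would hold the frame indices of the local maxima together with the two end points, and `ys` the feature values at those frames; the returned function is then evaluated at every frame index.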
In the extraction algorithm, we also guard against perpetual changes in the extraction process of IMF via a threshold on the standard deviation (SD), which is defined as follows:
$$\mathrm{SD} = \sum_{t=1}^{T} \frac{\left| h_o(t) - h_n(t) \right|^2}{h_o^2(t)}, \tag{11}$$
where T is the total number of points in the sequence, and h_{o}(t) and h_{n}(t) are the old and new candidates for the IMF. The iteration stops when SD falls below a threshold σ. In our experiments, we set σ = 0.25 [8]. The number of iterations needed to find the first IMF varies with the input signal. The histogram (statistics) of this iteration scheme applied on a data set is given in Figure 3.
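For concreteness, the stopping measure can be written as a short function. This sketch assumes the normalized squared-difference form used in [8]; the function name and the small guard term `eps` (which avoids division by zero where the old candidate is exactly zero) are assumptions added for illustration.

```python
def sifting_sd(h_old, h_new, eps=1e-12):
    """Normalized squared difference between successive IMF candidates."""
    return sum((o - n) ** 2 / (o * o + eps) for o, n in zip(h_old, h_new))
```

Sifting would then stop once `sifting_sd(h_old, h_new)` drops below the chosen threshold, 0.25 in the experiments reported here.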
5 Experiments
The proposed EMD-based approach to noise-robustness is evaluated on the Aurora 2.0 database [14]. After the baseline results are reproduced, we first apply the commonly used per-utterance mean-variance normalization (MVN) on the speech features to boost the performance, then we apply the proposed EMD-based post-processing to achieve further improvement. After seeing significant performance gain over the baseline, we apply the proposed method to ETSI advanced front-end (AFE) speech features [15] to see if further improvement can be achieved on speech features that are already very noise-robust to begin with. We also compare EMD with the RASTA processing method.
5.1 Aurora database
The Aurora 2.0 noisy digit database is widely used for the evaluation of noise-robust front-ends [14]. Eight types of additive noises are artificially added to clean speech data with SNR levels ranging from 20 to −5 dB. The data may be further convolved with two types of convolution noises. The multi-train recognizer is trained by a data set (called the multi-train set) consisting of clean and multi-condition noisy speech samples. The clean-train recognizer is trained by a data set (called the clean-train set) consisting of clean speech samples only. Test data in Set A are matched to the multi-condition train data, test data in Set B are not matched to the multi-condition train data, and test data in Set C are further mismatched due to convolution. Note that the proportion of the data amounts of Set A, Set B, and Set C is 2 : 2 : 1.
5.2 Front-end and back-end
The baseline speech feature vector consists of the static features of 12 mel-frequency cepstral coefficients (MFCC) C_{1}, . . ., C_{12} and the log energy. Dynamic features of velocity (delta) and acceleration (delta-delta) are also derived, resulting in a 39-dimension vector per frame.
The standard back-end recognizer of Aurora evaluation [14] is adopted. That is, we use 16-state whole-word models for digits, a 3-state silence model, and a 1-state short-pause model. The state of the short-pause model is tied to the middle state of the silence model. The state-emitting probability density is a 3-component Gaussian mixture for a word state, and a 6-component Gaussian mixture for a silence/short-pause state.
5.3 Results
Three sets of experiments have been carried out in this research. In the first set of experiments, noisy feature sequences are replaced by the corresponding clean feature sequences. This is possible in Aurora 2.0 because clean and noisy speech data are "parallel", i.e., each noisy utterance has a corresponding clean utterance. The results are compared to the case where each sequence is post-processed by EMD. In the second set of experiments, various aspects of EMD are investigated. In the final set of experiments, the proposed EMD method is compared to the well-known temporal filtering method of RASTA.
5.3.1 Feature replacement experiments
The first set of experiments is designed to determine which speech feature sequence the EMD-based post-processing should be applied to. For each of the 13 static features, we replace noisy feature sequences with clean feature sequences (RwC: replace with clean). Based on the results summarized in Table 1, it is clear that replacing the noisy log-energy sequence leads to the most significant improvement. The performance level decreases as we move down the table from C_{1} to C_{12}. Thus, unless otherwise stated, in the remaining investigation, we focus on using log-energy sequences as the targets to be processed by the proposed EMD.
In addition, we apply the proposed EMD to noisy feature sequences and the results are also shown in Table 1. It is interesting to see that EMD even leads to better performance than clean feature replacement in the cases from C_{2} to C_{12}. Furthermore, applying EMD to all features does not yield better performance than EMD on log energy alone, although the performance levels are quite close. Higher-order cepstral features provide information for the more delicate structures in the speech signal. It is more difficult to recover such information lost in the presence of noise through EMD. In contrast, the loss of information conveyed by log energy due to noise is relatively easy to recover.
5.3.2 Effectiveness of EMD
The recognition accuracy rates of clean-train tasks averaged over 0 to 20 dB noisy test data with different degrees of feature post-processing are listed in Table 2. The row of "baseline" shows the results of using the raw speech features extracted by the ETSI standard front-end. The row of "MVN" shows the results after the application of the mean-variance normalization (MVN). MVN achieves 24.0% relative improvement.
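Per-utterance mean-variance normalization rescales each feature sequence to zero mean and unit variance over the utterance. The following is a minimal sketch for a single feature dimension; the function name is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the article does not specify it.

```python
import statistics


def mvn(sequence):
    """Normalize one feature sequence to zero mean and unit variance."""
    mean = statistics.fmean(sequence)
    std = statistics.pstdev(sequence)   # population standard deviation (assumed)
    if std == 0:
        return [0.0 for _ in sequence]  # constant sequence: nothing to scale
    return [(v - mean) / std for v in sequence]
```

In a full front-end the same operation would be applied independently to each of the feature dimensions of an utterance.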
The proposed EMD-based method is applied to the log-energy feature sequences, by subtracting the first IMF for each utterance. Applying EMD on the MVN feature sequence, the relative improvement increases from 24.0 to 41.1%. The results show that the EMD-based post-processing of subtracting IMFs from the speech feature sequences significantly reduces the mismatch between clean and noisy feature sequences.
It is very encouraging to see that the most significant improvement by EMD is on Set C (from 66.4 to 75.3%). We note that Set C contains arguably the most mismatched data because convolution noises are applied to the utterances in addition to additive noises. With only MVN, the accuracy level on Set C is significantly below that of Set A or Set B. After EMD, the accuracy levels of the three sets become very close. Thus, EMD does increase the noise-robustness of the ASR system.
A detailed comparison between the word accuracy rates of the MVN method and the proposed EMD-based post-processing method is given in Table 3. In addition, we present a scatter plot of the word accuracy rates in Figure 4. It can be clearly seen that the recognition accuracy is improved by EMD.
In addition to the ETSI basic front-end feature sequence, we also apply the proposed EMD-based method on the ETSI AFE feature sequence. It is important for us to point out that AFE is a strongly noise-robust front-end, which combines modules for voice activity detection (VAD), Wiener-filter noise reduction, and blind equalization. From Table 3, we can see that while AFE already achieves a relative improvement of 67.1% over the baseline, the application of EMD further improves the performance, achieving further improvements in Sets A and C. The improvement on the most mismatched test data set (Set C) is the most significant (from 85.6 to 86.1%).
We also compare subtracting different numbers of IMFs. Essentially, the more IMFs are subtracted, the smoother the resultant sequence becomes. Recognition accuracies when subtracting 1 IMF (MVN+EMD1) and 2 IMFs (MVN+EMD2) are listed in Table 4. From the results, we can see that for the noisier 0 and 5 dB data, MVN+EMD2 yields better accuracy. The results confirm that we should subtract fewer IMFs in higher SNRs, because the interference of noise is not as severe as in lower-SNR cases.
Based on the arguments given in Section 4, it is clear that the noise level and the number of IMFs to be subtracted from the signal to reduce mismatch are closely related. Therefore, we use a scheme that allows the number of IMFs to be subtracted from speech feature sequences to vary from utterance to utterance. We calculate the average oscillation frequency of the log-energy feature sequences from the clean-train data and use it as a threshold. If the oscillation frequency of the remainder is lower than the threshold, we stop finding and subtracting the next IMF. The results of recognition experiments are listed in Table 5. We can see that this scheme, denoted by MVN+EMDd, does outperform the schemes of subtracting a fixed number (1 or 2) of IMFs. We also inspect the number of IMFs, N in (9), subtracted in the dynamic scheme of EMD. Figure 5 shows the average of N on the test set as a function of SNR, for the MVN feature and the AFE feature. As expected, it increases as SNR decreases, i.e., as the noise level increases.
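The dynamic scheme can be sketched as follows. This is an illustration under explicit assumptions, not the authors' procedure: the article does not define "oscillation frequency", so it is taken here to be the number of strict interior local extrema per frame, and all function names are hypothetical. The IMFs are passed in already extracted and in extraction order, which gives the same result as interleaving extraction with the threshold test.

```python
def oscillation_frequency(seq):
    """Strict interior local extrema per frame (an assumed measure)."""
    extrema = sum(1 for i in range(1, len(seq) - 1)
                  if (seq[i] - seq[i - 1]) * (seq[i + 1] - seq[i]) < 0)
    return extrema / len(seq)


def average_oscillation_frequency(sequences):
    """Threshold estimate: mean oscillation frequency over training sequences."""
    return sum(oscillation_frequency(s) for s in sequences) / len(sequences)


def dynamic_subtract(x, imfs, threshold):
    """Subtract IMFs in order until the remainder oscillates below threshold.

    Returns the processed sequence and the number N of IMFs subtracted.
    """
    remainder, n = list(x), 0
    for c in imfs:
        if oscillation_frequency(remainder) < threshold:
            break
        remainder = [r - ck for r, ck in zip(remainder, c)]
        n += 1
    return remainder, n
```

Under this measure, a noisier utterance starts with a higher oscillation frequency and therefore tends to need more subtractions before it falls below the clean-data threshold, which matches the trend reported for N.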
5.3.3 EMD and RASTA
Since EMD is essentially a technique that alters feature sequences in the temporal domain, it is of interest to compare its effectiveness with common temporal-domain techniques. The proposed EMD method is compared to the RASTA processing since both are temporal processing techniques. The results are summarized in Table 6, and it is clearly seen that EMD outperforms RASTA in this evaluation. The results support our analysis in Section 1 from the theoretical perspective that EMD is potentially more effective on non-stationary signals than conventional techniques based on temporal filtering. Decomposition with IMFs is more general than decomposition with sinusoidal functions, in allowing time-varying amplitudes and frequencies for input signals.
It is important to point out that EMD processing is an utterance-level method, so the latency is generally longer than using frame-level methods such as the RASTA filter or the advanced ETSI front-end. There is a trade-off between complexity, latency, and accuracy. In certain scenarios where low latency is critical, fast online/sequential methods without significant sacrifice in performance may be preferred to batch techniques.
6 Conclusion
In this article, we propose a feature post-processing scheme for a noise-robust speech recognition front-end based on EMD. We introduce EMD as a generalization of the Fourier analysis. Our motivation is that speech signals are non-stationary and non-linear, so EMD is theoretically superior to Fourier analysis for signal decomposition. We implement an algorithm to find IMFs. Based on properties of the extracted IMFs, we propose to subtract low-order IMFs to reduce the mismatch between clean and noisy data. Evaluation results on the Aurora 2.0 database show that the proposed method can effectively improve recognition accuracy. Furthermore, with the ETSI AFE speech features, which are very noise-robust by design, the application of the EMD method further improves recognition accuracy, which is remarkable.
References
[1] Boll S: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust Speech Signal Process 1979, 27(2):113-120. 10.1109/TASSP.1979.1163209
[2] Berstein A, Shallom I: A hypothesized Wiener filtering approach to noisy speech recognition. ICASSP 1991, 913-916.
[3] Zhu W, O'Shaughnessy D: Incorporating frequency masking filtering in a standard MFCC feature extraction algorithm. Proceedings of the IEEE International Conference on Signal Processing 2004, 617-620.
[4] Strope B, Alwan A: A model of dynamic auditory perception and its application to robust word recognition. IEEE Trans Speech Audio Process 1997, 5(5):451-464. 10.1109/89.622569
[5] Furui S: Cepstral analysis technique for automatic speaker verification. IEEE Trans Acoust Speech Signal Process 1981, 29(2):254-272. 10.1109/TASSP.1981.1163530
[6] Viikki O, Bye D, Laurila K: A recursive feature vector normalization approach for robust speech recognition in noise. Proceedings of the ICASSP 1998, 733-736.
[7] de La Torre A, Peinado A, Segura J, Perez-Cordoba J, Benitez M, Rubio A: Histogram equalization of speech representation for robust speech recognition. IEEE Trans Speech Audio Process 2005, 13(3):355-366.
[8] Huang N, Shen Z, Long S, Wu M, Shih H, Zheng Q, Yen N, Tung C, Liu H: The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc R Soc London Ser A Math Phys Eng Sci 1998, 454:903-995. 10.1098/rspa.1998.0193
[9] Li XY: The FPGA implementation of robust speech recognition system by combining genetic algorithm and empirical mode decomposition. Master's thesis, National Kaohsiung University; 2009.
[10] Hermansky H, Morgan N: RASTA processing of speech. IEEE Trans Speech Audio Process 1994, 2(4):578-589. 10.1109/89.326616
[11] Greenberg S, Kingsbury BED: The modulation spectrogram: in pursuit of an invariant representation of speech. Proceedings of the ICASSP 1997, 1647-1650.
[12] You H, Alwan A: Temporal modulation processing of speech signals for noise robust ASR. Proceedings of the INTERSPEECH 2009, 36-39.
[13] Knott GD: Interpolating Cubic Splines. Birkhäuser, Boston; 1999.
[14] Pearce D, Hirsch H: The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions. ISCA ITRW ASR2000 2000.
[15] ETSI Standard ETSI ES 202 050: Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithms 2007.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Wu, KH., Chen, CP. & Yeh, BF. Noise-robust speech feature processing with empirical mode decomposition. J AUDIO SPEECH MUSIC PROC. 2011, 9 (2011). https://doi.org/10.1186/1687-4722-2011-9
DOI: https://doi.org/10.1186/1687-4722-2011-9