Intraframe cepstral subband weighting and histogram equalization for noise-robust speech recognition
EURASIP Journal on Audio, Speech, and Music Processing volume 2013, Article number: 29 (2013)
Abstract
In this paper, we propose a novel noise-robustness method known as weighted subband histogram equalization (WSHEQ) to improve speech recognition accuracy in noise-corrupted environments. Based on the observation that the highpass and lowpass portions of the intraframe cepstral features are of unequal importance for noise-corrupted speech recognition, WSHEQ is intended to reduce the highpass components of the cepstral features. Furthermore, we provide four types of WSHEQ, whose structures partially follow that of spatial histogram equalization (SHEQ). In the experiments conducted on the Aurora2 noisy-digit database, the presented WSHEQ yields significant recognition improvements relative to the Mel-scaled filterbank cepstral coefficient (MFCC) baseline and to cepstral histogram normalization (CHN) in various noise-corrupted situations and exhibits a behavior superior to that of SHEQ in most cases.
1 Introduction
The performance of speech recognition systems is often degraded due to noise in application environments. A significant number of noise-robustness techniques have been proposed to address the noise problem, and one prevailing subset of these techniques is focused on reducing the statistical mismatch of speech features in the training and testing conditions of the recognizer. Typical examples are perceptual masking [1], empirical mode decomposition [2], optimally modified log-spectral amplitude estimation [3], wavelet packet decomposition with AR modeling [4], cepstral mean and variance normalization (MVN) [5], cepstral histogram normalization (CHN) [6, 7], MVN with ARMA filtering (MVA) [8], higher-order cepstral moment normalization (HOCMN) [9], and temporal structure normalization (TSN) [10]. In some of these methods, the compensation is performed on each individual cepstral channel sequence of an utterance by assuming that these channels are mostly uncorrelated [7].
Recently, certain studies have investigated the use of cepstral frame-based processing to compensate for the noise effect and achieve better recognition accuracy. For example, the work in [11] revealed that in the CHN method, even though each cepstral channel is processed by histogram equalization (HEQ), a significant histogram mismatch still exists between the training and testing cepstral features for the lowpass filtered (LPF) and highpass filtered (HPF) portions of the intraframe cepstra. Thus, the method of spatial HEQ in [11] further performs HEQ on the LPF and HPF portions to eliminate the aforementioned mismatch for the CHN-preprocessed cepstra. Compared with conventional CHN, which processes each individual cepstral channel, spatial HEQ (SHEQ) additionally takes the neighboring cepstral channels into consideration collectively and produces superior noise robustness. Furthermore, for a frame signal, the LPF and HPF portions of the cepstral vector correspond to the logarithmic filterbank (LFB) components at lower and higher frequencies, respectively. However, compensation performed directly on LPF and HPF is more helpful than that applied to the LFB components, most likely because the LFB components are significantly correlated [11].
Partly inspired by SHEQ, here we develop a novel scheme known as the weighted SHEQ (WSHEQ) to improve the recognition performance and operation efficiency of SHEQ in three directions. First, because the LPF and HPF portions of the original or CHN-preprocessed cepstra possess different characteristics in noisy environments and provide unequal contributions to the recognition accuracy, we scale the HPF portion produced in the original SHEQ and show that this adjustment can outperform SHEQ in recognition accuracy. Second, we change the order of the procedures in SHEQ by first splitting the original intraframe cepstra (not the CHN-preprocessed cepstra) into LPF and HPF, subsequently compensating LPF and HPF individually, and finally normalizing the fullband cepstra. This new structure can reduce the effect of noise on the LPF and HPF portions in the plain cepstra more directly in comparison with SHEQ. Finally, because SHEQ requires three HEQ operations, we use the simpler process of MVN to replace any of the three HEQ processes in SHEQ to improve the computational efficiency. The experimental results show that some variants of WSHEQ, which require fewer HEQ operations, provide a similar or even better recognition accuracy relative to SHEQ.
The remainder of this paper is organized as follows. Section 2 reviews SHEQ, and the basic concept and detailed procedures of the proposed WSHEQ are presented in Section 3. Section 4 describes the experimental setup, and Sections 5 and 6 contain a series of recognition experiments for WSHEQ together with their corresponding discussions. Finally, the concluding remarks are summarized in Section 7.
2 Brief review of SHEQ
If we consider using the Mel-scaled filterbank cepstral coefficients (MFCC) as the baseline features for speech recognition, then the cepstral feature vector stream associated with an arbitrary utterance is represented by a matrix C:
$$\mathbf{C}=\left[c(m,n)\right],\quad 0\le m\le M-1,\ 0\le n\le N-1,$$(1)
where m is the cepstral channel index within a frame and n is the frame index, and M and N are the total number of channels and frames within the utterance, respectively. In temporal processing methods such as MVN and CHN, the compensation is often directly performed on the individual channel stream (i.e., the sequence $\{c(\tilde{m},n);\ 0\le n\le N-1\}$ with respect to the $\tilde{m}$th channel), and therefore, all of the channel streams of the features are treated independently. According to the general concept that the cepstral coefficients within a frame are mostly uncorrelated [7], such a process is quite reasonable.
Recently, a novel method known as the spatial HEQ (SHEQ) was suggested to decompose each frame of a CHN-preprocessed cepstral vector into two parts, a highpass filtered portion and a lowpass filtered portion (denoted hereafter as HPF and LPF), such that the temporal sequences of HPF and LPF can be processed separately and then the updated HPF and LPF can be combined to form the new feature vector stream. The work in [11] shows that SHEQ outperforms the conventional CHN by providing better recognition accuracy. The overall procedure of SHEQ is depicted in Figure 1.
3 Proposed approach: WSHEQ
SHEQ [11] offers additional insight into the possible distortions left unprocessed by CHN and a method for achieving even better noise robustness for speech features. In this section, we further examine SHEQ to assess whether it can be further improved. The following two observations can be made about SHEQ:

1.
SHEQ divides each CHN-preprocessed cepstral vector into HPF and LPF and subsequently treats the temporal streams of these two parts in the same manner (i.e., with HEQ processing). Therefore, SHEQ does not consider the characteristic differences between HPF and LPF. According to [11], the plain HPF (from the original cepstra, not the CHN-preprocessed cepstra) is often more vulnerable to noise and displays more mismatch than the plain LPF, whereas SHEQ compensates for the CHN-preprocessed HPF and LPF directly. Additionally, HPF and LPF possess unequal importance in speech recognition, which will be shown later.

2.
In SHEQ, the HEQ operation is performed three times: once for the original feature stream set and once each for the HPF and LPF stream sets. Thus, SHEQ requires the computational effort of two additional HEQ passes compared with the conventional CHN method, which processes only the original stream set once via HEQ.
In this work, we design a simple experiment to evaluate the relative importance of different subbands of the cepstral features in speech recognition. With the Aurora2 database [12], we select the 8,440 clean utterances for the clean-condition training task as the data used to train the acoustic models and the 8,440 noisy utterances (corrupted by any of four types of noise at five signal-to-noise ratios) originally for the multi-condition training task as the testing data. Each utterance in the training and testing sets is first converted into a sequence of 13-dimensional cepstral vectors ($c_0, c_1, \ldots, c_{12}$). The obtained cepstra are either kept unchanged or processed by CHN. Next, for each original/CHN-processed cepstral vector, we obtain its ‘subband’ version with the following two steps:
Step 1. Find the spectrum of the cepstral vector via discrete Fourier transform (DFT):
Let $\mathbf{c}=[c_0\ c_1\ c_2\ \ldots\ c_{12}]^T$ denote an arbitrary cepstral vector; its spectrum is obtained by
$$C[k]=\sum_{m=0}^{12}c_m e^{-j\frac{2\pi km}{13}},\quad 0\le k\le 12.$$(2)
Due to the conjugate symmetry of $\{C[k]\}$, we only need to retain the first seven points, which correspond to $\{k\frac{2\pi}{13};\ 0\le k\le 6\}$ in normalized frequency.
Step 2. Retain a contiguous portion of the spectral points and transform them (together with their conjugate symmetric parts) into a new cepstral vector via inverse DFT. For example, if we retain the first to fifth spectral points unchanged and set the zeroth and the sixth spectral points to zero, then the resulting new cepstral vector is a subband version of the original cepstral vector and corresponds approximately to the band range of $\left[\frac{2\pi}{13},\frac{10\pi}{13}\right]$.
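The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard DFT definition; the function names (`dft`, `idft`, `dft_subband`) and the example cepstral values are hypothetical and are not taken from the experiments.

```python
import cmath

def dft(x):
    """N-point DFT: X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    size = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / size) for n in range(size))
            for k in range(size)]

def idft(spec):
    """Inverse of dft()."""
    size = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * n / size) for k in range(size)) / size
            for n in range(size)]

def dft_subband(cepstrum, k_low, k_high):
    """Keep spectral points k_low..k_high plus their conjugate-symmetric
    mirrors, zero every other point, and return the real inverse DFT."""
    size = len(cepstrum)
    spec = dft(cepstrum)
    kept = [0j] * size
    for k in range(k_low, k_high + 1):
        mirror = (size - k) % size
        kept[k] = spec[k]
        kept[mirror] = spec[mirror]
    return [value.real for value in idft(kept)]

# Example from the text: keep points 1..5, drop points 0 and 6.
cepstrum = [float((3 * m) % 7) for m in range(13)]  # arbitrary illustrative values
subband = dft_subband(cepstrum, 1, 5)
```

Choosing `k_low = 0` and `k_high = 6` keeps all seven retained points and therefore returns the fullband vector unchanged, up to rounding.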
The recognition accuracy rates for different cepstral features obtained from the above subband processing are shown in Figures 2a and 3a, the former being for the original cepstra and the latter being for the CHN-processed cepstra. (Please note that the testing data undergo the same process as the training data in the recognition experiment. Therefore, the original testing cepstra are recognized by the acoustic models trained from the original training cepstra, and the CHN-processed testing cepstra are recognized by the acoustic models trained from the CHN-processed training cepstra.) The vertical axis in Figures 2a and 3a denotes the word accuracy rate, and the other two axes indicate the initial and final spectral points, $k_L$ and $k_H$, of the assigned subband, respectively. Clearly, the CHN-processed cepstra outperform the original cepstra in recognition results. In addition, for both types of cepstra, the fullband features always achieve the highest accuracy, and decreasing the bandwidth of the subband worsens the accuracy. However, we can further evaluate the relative importance of different spectral points in the subband from the two figures using the following equation:
where $r_m$ denotes the averaged contribution of the $m$th spectral point, $R_{m,k}$ is the recognition rate using the cepstra within the subband including the $m$th to $k$th spectral points, and $N_r$ is the total number of items in the summation of Equation 3. (The term ‘relative importance’ and its definition shown in Equation 3 are borrowed from [13], in which a series of bandpass filters are used to evaluate the various modulation spectral components in their contribution to the recognition accuracy.) The obtained results from the original and the CHN-processed cepstra are shown in Figures 2b and 3b, respectively. Note that in Equation 3, the number of spectral points in the assigned subband range is always greater than or equal to 2 because the cepstra associated with a single spectral point quite often result in a rather poor (even negative) recognition accuracy.
From Figures 2b and 3b, the seven spectral points possess unequal importance in noisy speech recognition. The middle and lower frequency points (except for the DC point) seem to contribute more to the recognition accuracy than the upper points. These results suggest that attenuating the higher frequency components in the cepstra is likely to result in better recognition performance in a noisy environment. In addition, comparing Figure 3b with Figure 2b, we find that the CHN process helps the higher frequency points to reinforce their importance in speech recognition, especially for the point at frequency $\frac{10\pi}{13}$.
The spectrum of the cepstra in the aforementioned evaluation experiment is created via the DFT, mainly because lowpass and highpass filters are to be applied to the cepstra in later discussions, and the effect of a filter on the processed signal is usually evaluated in the Fourier-based frequency domain. Also, in most cases, the characteristics of a filter are investigated through its frequency response, i.e., the Fourier transform of its impulse response. However, since each framewise cepstral vector is the truncated version of the inverse discrete cosine transform (IDCT) of the logarithmic spectrum of the corresponding frame, here we repeat the preceding evaluation experiment based on the ‘DCT-based’ spectrum of the original/CHN-processed cepstra. That is, in step 1 of the experiment, we obtain the 13-point spectrum of any arbitrary cepstral vector $\mathbf{c}=[c_0\ c_1\ c_2\ \ldots\ c_{12}]^T$ via DCT:
$$\tilde{C}[k']=\sum_{m=0}^{12}c_m\cos\left(\frac{\pi k'(2m+1)}{26}\right),\quad 0\le k'\le 12,$$(4)
and then in step 2, a contiguous portion of the DCT-based spectral points is retained and transformed into a new cepstral vector via IDCT.
Some differences between the DCT-based spectrum $\{\tilde{C}[k']\}$ in Equation 4 and the DFT-based spectrum $\{C[k]\}$ in Equation 2 are as follows:

1.
Unlike the DFT-based spectrum $\{C[k]\}$, which is complex-valued and conjugate symmetric, the real-valued DCT-based spectrum $\{\tilde{C}[k']\}$ is in general not symmetric in any sense. Thus, we cannot discard the second half of the points of $\{\tilde{C}[k']\}$ as we do for $\{C[k]\}$.

2.
$\{\tilde{C}[k']\}$ possesses a higher frequency resolution than $\{C[k]\}$. Comparing Equation 4 with Equation 2, the frequency difference between any two adjacent bins of $\{\tilde{C}[k']\}$ is $\frac{\pi}{13}$, while it is $\frac{2\pi}{13}$ for $\{C[k]\}$.

3.
Referring to [14], the N-point DCT, $\{\tilde{C}[k']\}$, of a length-N sequence $\{c[n],\ 0\le n\le N-1\}$ (here N=13) can be computed via the 2N-point DFT, denoted by $\{D[k']\}$, of another length-2N sequence $\{\tilde{c}[n],\ 0\le n\le 2N-1\}$, in which $\tilde{c}[n]$ is the even extension of $c[n]$ satisfying $\tilde{c}[n]=c[n]$ for $0\le n\le N-1$ and $\tilde{c}[n]=c[2N-1-n]$ for $N\le n\le 2N-1$. $\{\tilde{C}[k']\}$ and $\{D[k']\}$ are related by:
$$\tilde{C}[k']=0.5\,e^{-j\frac{\pi k'}{2N}}D[k']\quad\text{for}\ 0\le k'\le N-1.$$(5)
Generally speaking, the DCT-based spectrum $\{\tilde{C}[k']\}$ is more concentrated at low frequencies than the DFT-based spectrum $\{C[k]\}$, which is well known as the ‘energy compaction property’ of DCT. An underlying reason for this phenomenon is that the DFT implicitly assumes a periodic extension of the processed signal, which often causes artificial discontinuities at the signal boundary and adds high-frequency content to the DFT-based spectrum. To show this, a length-N sequence $\{x[n],\ 0\le n\le N-1\}$ is treated by the N-point DFT as an N-periodic signal, denoted by $x_e[n]$, in which $x_e[n]=x[n]$ for $0\le n\le N-1$ and $x_e[n+N]=x_e[n]$. Thus, $x_e[n]$ is generally discontinuous at the (original) boundary positions:
$$x_e[0]=x[0]\ne x[N-1]=x_e[-1],$$(6)
$$x_e[N-1]=x[N-1]\ne x[0]=x_e[N].$$(7)
However, as mentioned earlier, the N-point DCT of a length-N sequence $\{x[n]\}$ (starting at n=0) can be obtained from the 2N-point DFT of the even extension of $\{x[n]\}$, and the corresponding 2N-periodic signal, denoted by $\tilde{x}_e[n]$, remains continuous at the boundary positions:
$$\tilde{x}_e[0]=\tilde{x}[0]=\tilde{x}[2N-1]=\tilde{x}_e[2N-1]=\tilde{x}_e[-1],$$(8)
$$\tilde{x}_e[2N-1]=\tilde{x}[2N-1]=\tilde{x}[0]=\tilde{x}_e[0]=\tilde{x}_e[2N].$$(9)
As a result, the (N-point) DCT-based spectrum does not contain the high-frequency artifacts present in the (N-point) DFT-based spectrum, and it appears more compact at low frequencies.
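The relation between the N-point DCT and the 2N-point DFT of the even extension can be checked numerically. The sketch below assumes an unnormalized DCT (a plain sum of cosine terms with no scale factor), which is the form implied by the factor 0.5 in Equation 5; the function names are hypothetical.

```python
import cmath
import math

def dct_direct(c):
    """Unnormalized DCT: C~[k] = sum_n c[n] * cos(pi*k*(2n+1)/(2N)).
    The absence of a scale factor is an assumption of this sketch."""
    size = len(c)
    return [sum(c[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * size)) for n in range(size))
            for k in range(size)]

def dct_via_dft(c):
    """Same DCT computed from the 2N-point DFT of the even extension."""
    size = len(c)
    two_n = 2 * size
    # Even extension: ext[n] = c[n] for n < N and ext[n] = c[2N-1-n] for n >= N.
    ext = list(c) + list(reversed(c))
    out = []
    for k in range(size):
        d_k = sum(ext[n] * cmath.exp(-2j * cmath.pi * k * n / two_n) for n in range(two_n))
        # Phase correction and factor 0.5 as in Equation 5.
        out.append(0.5 * cmath.exp(-1j * cmath.pi * k / two_n) * d_k)
    return out

cepstrum = [float((5 * n) % 4) - 1.5 for n in range(13)]  # arbitrary illustrative values
direct = dct_direct(cepstrum)
via_dft = dct_via_dft(cepstrum)
# The two results agree up to rounding; the imaginary parts of via_dft vanish.
max_gap = max(abs(a - b) for a, b in zip(direct, via_dft))
```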
With the cepstra from the IDCT of subband DCT-based spectra, the corresponding evaluation experiment is performed to obtain the recognition accuracy rates, which are shown in Figures 4a and 5a, and the relative importance of different spectral points, which is shown in Figures 4b and 5b. Figure 4a,b is for the original cepstra and Figure 5a,b is for the CHN-processed cepstra. These two figures roughly reveal that the lower and middle DCT-based spectral points contribute more to the recognition than the upper ones, which roughly coincides with our observations from Figures 2a,b and 3a,b associated with the DFT-based spectra. In addition, comparing Figure 4b with Figure 2b and Figure 5b with Figure 3b, we find that the higher DCT-based spectral points reveal more importance than the higher DFT-based spectral points, which partially agrees with our previous statement that the DFT-based spectrum contains some artificial high-frequency content, which distorts the higher spectral points and reduces the corresponding contribution.
In light of the aforementioned discussions, we developed a novel method known as WSHEQ to enhance the noise robustness of speech features. The initial concept of WSHEQ is to apply a weighting factor to the HPF portion in SHEQ (as shown in Figure 1) to reduce the intraframe higher frequency components, and we further provide several variations on the presented WSHEQ. First, according to the order of the HEQ processing for the fullband cepstra and subband cepstra, we describe two structures:
Structure I. HEQ first operates on the plain (intraframe) fullband cepstra and subsequently on the subband cepstra.
Structure II. HEQ first operates on the plain (intraframe) subband cepstra and subsequently on the fullband cepstra.
Please note that in the above two structures, the two subband cepstral portions, LPF and HPF, are obtained with simple two-point FIR filters operating on the fullband cepstra [11]:
where $c_{lp}(m,n)$ and $c_{hp}(m,n)$ denote the lowpass and highpass filtered parts of the $n$th cepstral frame.
Next, according to the different treatments (i.e., the compensation methods HEQ and MVN) of LPF and HPF in Equations 10 and 11, each structure of WSHEQ has the following four types of variations:
where HEQ[·] and MVN[·] denote the operators of the HEQ and MVN processes, respectively; $\tilde{c}_{lp}$ and $\tilde{c}_{hp}$ are the updated LPF and HPF, respectively (we omit the indices (m,n) for simplicity); and the parameter α with a range of [0, 1] is the scaling factor selected specifically for the HPF component. The flowcharts of the various structures and types of WSHEQ are depicted in Figure 6a,b.
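The two structures and the choice of compensation operator can be sketched as follows. This is a minimal sketch under several assumptions rather than the implementation used in the experiments: the two-point filters are taken to be a sum and a difference of adjacent channels scaled by $1/\sqrt{2}$ (with zero used for the missing channel before $c_0$), the subbands are recombined as $(\tilde{c}_{lp}+\alpha\tilde{c}_{hp})/\sqrt{2}$, and HEQ is realized as a rank-based mapping to a standard normal target. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def mvn(stream):
    """Mean and variance normalization of one temporal stream."""
    mean = sum(stream) / len(stream)
    var = sum((v - mean) ** 2 for v in stream) / len(stream)
    std = math.sqrt(var) or 1.0  # guard against a constant stream
    return [(v - mean) / std for v in stream]

def heq(stream):
    """Rank-based HEQ toward a standard normal target (needs a sort)."""
    size = len(stream)
    order = sorted(range(size), key=lambda i: stream[i])
    inv_cdf = NormalDist().inv_cdf
    out = [0.0] * size
    for rank, i in enumerate(order):
        out[i] = inv_cdf((rank + 0.5) / size)
    return out

def per_channel(op, frames):
    """Apply op to each channel stream; frames[n][m] is channel m of frame n."""
    channels = [op(list(col)) for col in zip(*frames)]
    return [list(row) for row in zip(*channels)]

def split(frames):
    """Assumed two-point filters along the channel index m."""
    scale = math.sqrt(2.0)
    lp, hp = [], []
    for frame in frames:
        prev = [0.0] + list(frame[:-1])  # assumed zero before channel 0
        lp.append([(a + b) / scale for a, b in zip(frame, prev)])
        hp.append([(a - b) / scale for a, b in zip(frame, prev)])
    return lp, hp

def combine(lp, hp, alpha):
    """Recombine the subbands, scaling the HPF portion by alpha."""
    scale = math.sqrt(2.0)
    return [[(l + alpha * h) / scale for l, h in zip(fl, fh)]
            for fl, fh in zip(lp, hp)]

def wsheq(frames, alpha, structure="I", lp_op=heq, hp_op=heq):
    """Structure I: fullband HEQ first, then subband compensation.
    Structure II: subband compensation first, then fullband HEQ."""
    if structure == "I":
        frames = per_channel(heq, frames)
    lp, hp = split(frames)
    out = combine(per_channel(lp_op, lp), per_channel(hp_op, hp), alpha)
    if structure == "II":
        out = per_channel(heq, out)
    return out
```

Passing `lp_op` and `hp_op` as either `heq` or `mvn` selects among the four types, and `wsheq(frames, 1.0, "I", heq, heq)` corresponds to SHEQ under the stated assumptions.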
For clarity, in the following discussions, the term ‘WSHEQ’ is written with an additional subscript of ‘I’ or ‘II’, and a superscript of ‘(1)’, ‘(2)’, ‘(3)’, or ‘(4)’ to identify different structures and different processing schemes for LPF and HPF in the presented WSHEQ method. For example, ${\text{WSHEQ}}_{\text{II}}^{\left(3\right)}$ indicates the WSHEQ method that applies the second structure shown in Figure 6b and uses HEQ and MVN for the LPF and HPF portions, respectively. Additional discussions on the various forms of WSHEQ are given below:

1.
Because the HEQ operation is nonlinear, ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ with α=1.0 (no attenuation for HPF) as shown in Equation 12 in which HEQ is first performed on the subband cepstra and subsequently on the fullband cepstra, is different from SHEQ (equivalent to ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ with α=1.0) shown in Figure 1, in which the fullband cepstra are HEQprocessed in advance.

2.
In the first type of $\text{WSHEQ}_{\text{II}}$ (viz. ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$), both LPF and HPF (of the original MFCC) are processed by HEQ. The resulting new HPF is attenuated by a factor of α and then combined with the new LPF to form the fullband cepstra, which are further processed by HEQ in the final stage. Therefore, ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ requires three HEQ operations, the same as SHEQ, so SHEQ and ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ are similar in computational complexity.

3.
The other three types of WSHEQ as shown in Equations 13 to 15 differ from the first type in that they compensate either or both of the LPF and HPF portions via MVN instead of HEQ. MVN can be implemented more efficiently than HEQ because MVN involves only the operations of addition and multiplication, whereas a sorting algorithm is required in HEQ. We expect that saving the cost of HEQ on HPF/LPF in this way will not noticeably affect the recognition accuracy.
4 Experimental setup
The performance of our proposed WSHEQ scheme is examined on two databases. One is the Aurora2 database [12], corresponding to a connected English-digit recognition task, and the other is a subset of the TCC300 database [15] for the recognition of 408 Chinese syllables. Briefly, we conduct more comprehensive experiments with the Aurora2 database to analyze and compare the various forms of the presented WSHEQ together with some other robustness algorithms, while a smaller number of experiments conducted on the subset of the TCC300 database simply examine whether the presented WSHEQ can be extended to work well in a medium-size vocabulary recognition task, which is more complicated than Aurora2. Furthermore, to avoid ambiguity and confusion in the discussion, the remainder of this section and Section 5 are devoted to the Aurora2 evaluation task, while the detailed discussions about the TCC300 subset task will be given in Section 6.
As for the Aurora2 database, the test data consist of 4,004 utterances, and three different subsets are defined for the recognition experiments: test sets A and B are both affected by four types of noise, and test set C is affected by two types. Each noise instance is artificially added to the clean speech signal at seven SNR levels (ranging from 20 to −5 dB). The signals in test sets A and B are filtered with a G.712 filter, and those in test set C are filtered with an MIRS filter. In the ‘clean-condition training, multi-condition testing’ evaluation task defined in [12], the training data consist of 8,440 noise-free clean utterances filtered with a G.712 filter. Thus, compared with the training data, test sets A and B are distorted by additive noise, and test set C is affected by additive noise and a channel mismatch.
In the experiments, each utterance in the clean training set and the three testing sets is first converted to a 13-dimensional MFCC ($c_0, c_1, \ldots, c_{12}$) sequence. Next, the MFCC features are processed by either SHEQ [11] or the various forms of WSHEQ noted in Section 3. In addition, the selected target distribution of the HEQ operation applied to any of the fullband, LPF, and HPF cepstra is the standard normal (Gaussian), with a zero mean and unity variance. (Please note that, given the fullband cepstral sequences being standard normal and approximately mutually uncorrelated, the corresponding LPF and HPF via the operations in Equations 10 and 11 are also standard normal. Similarly, if the HPF and LPF are both standard normal and approximately mutually uncorrelated, then the corresponding fullband cepstra are normally distributed with a zero mean and a variance of less than 1 since we scale down the HPF portion.)
The resulting 13 new features, in addition to their first- and second-order derivatives, are the components of the final 39-dimensional feature vector. With the new feature vectors in the clean training set, the hidden Markov models (HMMs) for each digit and for silence are trained with the scripts provided by the Aurora2 CD set [16]. Each digit HMM contains 16 states, with three Gaussian mixtures per state.
In particular, the 8,440 noisy utterances (corrupted by four types of noise at five signal-to-noise ratios) originally intended for the multi-condition training task [12], which were mentioned earlier in Section 3, serve as the development set here in order to obtain an appropriate selection of the scaling factor α for the HPF portion in Equations 12 to 15. The value of α is varied from 0.0 to 1.0 with an interval of 0.1 in each form of WSHEQ, and the value that achieves the optimal recognition accuracy for the development set is then chosen for the corresponding WSHEQ in practice. The selected values of α for different forms of WSHEQ are listed in Table 1.
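The selection of α described above amounts to a one-dimensional grid search over the development set. A minimal sketch is given below; `select_alpha` and `toy_dev_accuracy` are hypothetical names, and the toy accuracy curve is a placeholder for a full train-and-recognize run, not data from the experiments.

```python
def select_alpha(dev_accuracy, step=0.1):
    """Return the grid value in [0, 1] that maximizes dev_accuracy(alpha)."""
    steps = round(1.0 / step)
    grid = [round(i * step, 10) for i in range(steps + 1)]  # 0.0, 0.1, ..., 1.0
    return max(grid, key=dev_accuracy)

def toy_dev_accuracy(alpha):
    # Placeholder curve for illustration only. In practice this would
    # process the features with the given alpha, train the models on the
    # clean training set, and recognize the development set.
    return 84.0 - 10.0 * (alpha - 0.6) ** 2

best_alpha = select_alpha(toy_dev_accuracy)
```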
5 Experimental results and discussions for the Aurora2 task
5.1 Recognition accuracy
The presented WSHEQ is evaluated in terms of recognition accuracy. Tables 2 and 3 show the individual set recognition accuracy rates averaged over five SNR conditions (0 to 20 dB, with a 5-dB interval) for the MFCC baseline, CHN, SHEQ (equivalent to ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ with α=1.0), and various forms of the presented WSHEQ, while Table 4 further lists the recognition accuracy rates for each individual SNR condition, averaged over ten noise situations. In addition, Figure 7 depicts the overall averaged word error rates achieved by several methods, including MVA, HOCMN, TSN, CHN, SHEQ, ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}(\alpha =0.6)$, and ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}(\alpha =0.6)$. From Tables 2, 3, and 4 and Figure 7, we have the following findings:

1.
Compared with the MFCC baseline, all of the HEQ-related methods provide very similar accuracy rates for the clean situation, and they are able to provide significant improvement in recognition accuracy for various noise-corrupted situations, showing that HEQ is quite helpful for speech features in terms of noise robustness.

2.
SHEQ (${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ with α=1.0) outperforms CHN by around 2.3% in the averaged accuracy, and thus, further manipulation of the mismatch in LPF and HPF with two extra HEQ operations can benefit the recognition performance.

3.
${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ with α=1.0 produces results similar to those of SHEQ, and thus, the proposed structure II (shown in Figure 6b) performs quite well. Additionally, provided that no attenuation exists for HPF by setting α=1.0, using structure II in the other three types of WSHEQ, i.e., ${\text{WSHEQ}}_{\text{II}}^{\left(2\right)}$, ${\text{WSHEQ}}_{\text{II}}^{\left(3\right)}$, and ${\text{WSHEQ}}_{\text{II}}^{\left(4\right)}$ as shown in Equations 13 to 15, outperforms the respective methods under structure I. In particular, ${\text{WSHEQ}}_{\text{II}}^{\left(2\right)}$ behaves better than ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$, whereas ${\text{WSHEQ}}_{\text{I}}^{\left(2\right)}$ behaves worse than ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$, revealing that applying structure II can make WSHEQ less costly in computation and can obtain improved recognition results simultaneously.

4.
Reducing the HPF component by setting the factor α as less than 1.0 as in Table 1 significantly improves the recognition accuracy, regardless of the different structures and types of WSHEQ. ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ gives an averaged accuracy of 84.99%, which is optimal among all of the methods and corresponds to error reduction rates of 62.71%, 23.73%, and 13.83% relative to the MFCC baseline, CHN, and SHEQ, respectively. These results support the aforementioned observations that HPF is more extensively contaminated by noise and that lowering HPF is beneficial.

5.
Among the four types of WSHEQ listed in Equations 12 to 15, by assigning α as less than 1.0, $\text{WSHEQ}^{(1)}$, which requires three HEQ operations, displays the best behavior, regardless of the selected structure. However, the two types that require only two HEQ operations (i.e., $\text{WSHEQ}^{(2)}$ and $\text{WSHEQ}^{(3)}$) perform quite similarly to $\text{WSHEQ}^{(1)}$ when structure II is used. Finally, $\text{WSHEQ}^{(4)}$ performs worse than the other three types, possibly because it applies only one HEQ operation. Even so, ${\text{WSHEQ}}_{\text{I}}^{\left(4\right)}$ and ${\text{WSHEQ}}_{\text{II}}^{\left(4\right)}$ with α=0.6 can behave very close to SHEQ (${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ with α=1.0).

6.
The presented ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ with α=0.6 behaves better in the overall averaged word error rate when compared with several well-known noise-robustness methods: TSN, HOCMN, MVA, CHN, and SHEQ. The absolute error rate reduction of ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ with α=0.6 relative to the MFCC baseline is as high as 25.24%.
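The absolute and relative error reductions quoted in the findings above are related by simple arithmetic, sketched below. The function name is hypothetical, and the baseline accuracy is not quoted directly in the text: it is assumed here to be the value implied by the 25.24% absolute reduction, i.e., 84.99% − 25.24% = 59.75%.

```python
def relative_error_reduction(acc_base, acc_new):
    """Relative word error rate reduction (in %) from accuracies given in %."""
    err_base = 100.0 - acc_base
    err_new = 100.0 - acc_new
    return 100.0 * (err_base - err_new) / err_base

acc_wsheq = 84.99             # averaged accuracy quoted in the text
acc_mfcc = acc_wsheq - 25.24  # assumed: implied by the absolute reduction
print(round(relative_error_reduction(acc_mfcc, acc_wsheq), 2))  # 62.71
```

Under this assumption, the result agrees with the 62.71% relative error reduction reported for the MFCC baseline.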
Taking a step further, among the methods used for comparison, MVA and TSN explicitly apply a temporal filter, and in most cases, the filter used is lowpass so as to perform a ‘temporal’ smoothing on the cepstral time series. In contrast, the presented WSHEQ lowers HPF (the highpass filtered portion) of each cepstral vector and is analogous to a ‘spatial’ smoothing operation. Such an observation leads to the idea of combining either MVA or TSN with WSHEQ in order to achieve a two-dimensional smoothing. To realize this idea, the cepstra are first processed with any of the eight forms of WSHEQ and then further compensated by MVA or TSN. The obtained recognition results are shown in Tables 5 and 6, in which the applied WSHEQ uses the scaling factor α listed in Table 1. The results in Tables 5 and 6 show that the pairing of WSHEQ and MVA/TSN consistently achieves better performance than the individual component method, regardless of the various forms of WSHEQ. For example, the method ‘${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$+TSN’ obtains the averaged accuracy of 86.25%, better than TSN (81.39%) and ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ (84.99%). These results indicate that the joint spatial-temporal smoothing can provide the cepstral features with better noise robustness in comparison with either spatial smoothing or temporal smoothing in isolation.
In particular, different forms of WSHEQ behave very similarly and can give around 85% in averaged accuracy when TSN/MVA is integrated, implying that when employing TSN/MVA as a postprocessing technique, simpler versions of WSHEQ, such as ${\text{WSHEQ}}_{\text{I}}^{\left(4\right)}$ and ${\text{WSHEQ}}_{\text{II}}^{\left(4\right)}$, are relatively more appropriate in practical applications due to their high recognition performance and relatively low computational complexity in comparison with SHEQ, ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$, and ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$.
5.2 The influence of the parameter α in WSHEQ
As stated previously, the parameter α in WSHEQ determines the degree of attenuation for the HPF portion of the processed cepstra. Here, we would like to investigate how the value of α in WSHEQ influences the recognition accuracy of the test sets. For simplicity, we vary the parameter α from 0.0 to 1.0 in two types of WSHEQ: ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ and ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$, and the corresponding recognition accuracy rates averaged over all noise types and levels in three test sets are shown in Figure 8a,b. These two figures reveal that

1. Lowering the HPF part by tuning α from 1.0 to 0.4 in both ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ and ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ consistently achieves better results than the same two methods with α=1.0. However, reducing the HPF part further degrades the recognition accuracy, which implies that the HPF part also contains information helpful for recognition.

2. The optimal accuracy for ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ and ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ occurs when α is set to 0.5 and 0.6, respectively, while the results from the development set suggest α=0.6 for both methods (as shown in Table 1). However, ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ with α=0.6 gives a recognition rate of 84.27%, very close to the optimal one (84.32%). This suggests that the development set can help to determine a nearly optimal parameter for the test sets.

3. The performance of ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ and ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ is not very sensitive to the parameter α: the accuracy difference is below 1.0% provided that α is within the range [0.4, 0.7].
Next, we explore the best possible recognition results for each testing situation achieved by WSHEQ with various assignments of the scaling parameter α. Please note that in the preceding experiments, the scaling parameter α in WSHEQ is determined by the development set and then uniformly applied to every test set. Here, we investigate whether the optimal choice of α (the one that gives the highest recognition accuracy) depends on the noise type and level (viz. the SNR) of the testing utterances. To do this, we vary the value of α from 0.0 to 1.0 with an interval of 0.1 in each form of WSHEQ to process the features in the training and testing sets and then perform the experiment. The optimal recognition accuracy rate and the associated α with respect to each noise type and level in the testing set achieved by ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ and ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ are shown in Tables 7 and 8, respectively. Some contents of these tables, together with the data obtained from the other six forms of WSHEQ (not listed here because of their large volume), are further summarized in Tables 9 and 10, which also contain a portion of the data in Tables 2 and 3 for the purpose of comparison. Observing these tables, we find that the value of α that achieves the optimal recognition accuracy indeed depends on the noise type and level of the utterances. However, there seems to be no general rule for selecting a better α for any specific noise situation. Furthermore, as seen in Table 9, in most cases, the accuracy rates obtained with the optimal α for each individual noise situation are very close to the accuracy rates obtained with the fixed α that is optimal for the development set. The maximum difference between these two types of accuracy rates is 0.75%, which occurs for ${\text{WSHEQ}}_{\text{I}}^{\left(2\right)}$.
As a result, we can roughly conclude that using the α recommended by the development set suffices to provide WSHEQ with nearly optimal performance.
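The parameter selection discussed in this subsection amounts to a one-dimensional grid search followed by a sensitivity check. The sketch below shows that procedure in a generic form. The callable `evaluate` is a hypothetical stand-in for a full train-and-recognize run on the development set that returns an accuracy for a given α; it, the helper names, and the tolerance argument are assumptions for illustration only.

```python
def alpha_grid(start=0.0, stop=1.0, step=0.1):
    """Grid of scaling factors, e.g. 0.0, 0.1, ..., 1.0 (rounded to avoid float drift)."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(count)]


def select_alpha(evaluate, alphas):
    """Evaluate every alpha and return (best_alpha, scores).

    `evaluate` is a hypothetical callable mapping alpha to the development-set
    accuracy; in practice it would wrap feature processing, HMM training,
    and decoding.
    """
    scores = {a: evaluate(a) for a in alphas}
    best_alpha = max(scores, key=scores.get)
    return best_alpha, scores


def near_optimal(scores, tolerance):
    """Alphas whose accuracy is within `tolerance` of the best one."""
    best = max(scores.values())
    return sorted(a for a, s in scores.items() if best - s <= tolerance)
```

Calling `select_alpha` with the development-set evaluator gives the fixed α used for all test sets, and `near_optimal(scores, 1.0)` would list the values of α whose accuracy lies within one percentage point of the best, which is the kind of sensitivity statement made above.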
5.3 The feature distortion reduced by WSHEQ
Apart from the recognition performance, in this subsection, we evaluate WSHEQ in terms of its capacity to reduce the feature distortion caused by noise. The incoherent feature distortion [7] defined by
is measured for the feature streams processed by the noise-robustness method, where $\tilde{X}[k]$ and $X[k]$ denote the DFT of the noise-free clean feature stream and its noise-corrupted counterpart, respectively. Figure 9 depicts the feature distortion associated with each cepstral channel at the SNR of 10 dB, averaged over the 1,001 utterances in test set A of the Aurora-2 database, for the feature streams processed by SHEQ, ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ with α=0.6, and ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ with α=0.6. From Figure 9, two observations are made: first, ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ with α=0.6 results in smaller distortions than SHEQ irrespective of the cepstral channel, implying that lowering the HPF portion of the cepstra can further reduce the effect of noise; second, with the parameter α set to 0.6, the distortions produced by ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ are slightly smaller than those produced by ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ for most of the cepstral channels, which agrees with the finding that ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ slightly outperforms ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ in recognition accuracy.
5.4 The effect of lowering HPF in different schemes
In order to further examine the effect of attenuating HPF on recognition accuracy, here we additionally design three schemes to process the cepstra in the training and testing sets:
Scheme 1. The original (full-band) cepstra are split into LPF and HPF, and then the HPF portion is scaled by a factor α. Finally, the original LPF and the attenuated HPF are combined to constitute the new cepstra. This scheme removes all three HEQ processes in ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ or ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ shown in Figure 6a,b, and its flowchart is depicted in Figure 10a.
Scheme 2. The original (full-band) cepstra are preprocessed by HEQ, and then the HEQ-preprocessed cepstra are split into LPF and HPF. We scale HPF by a factor α and finally combine LPF and the attenuated HPF to obtain the new cepstra. This scheme removes the two HEQ processes for LPF and HPF in ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ shown in Figure 6a, and its flowchart is depicted in Figure 10b.
Scheme 3. The original cepstra are split into LPF and HPF, and then the HPF portion is tuned with the scaling factor α. Next, we combine LPF and the attenuated HPF to obtain the full-band cepstra, which are further postprocessed by HEQ to obtain the new cepstra. This scheme removes the two HEQ processes for LPF and HPF in ${\text{WSHEQ}}_{\text{II}}^{\left(1\right)}$ shown in Figure 6b, and its flowchart is depicted in Figure 10c.
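The three schemes differ only in where (or whether) HEQ is applied around the HPF weighting, which the following minimal sketch makes explicit. It rests on two assumptions that are not taken from this section: the intraframe split is a two-tap sum/difference filter pair along the cepstral index, chosen so that LPF plus HPF reproduces the original frame, and HEQ is a rank-based mapping of each cepstral channel onto a standard normal distribution over the frames of an utterance. All function names are illustrative.

```python
from statistics import NormalDist


def split_subbands(frame):
    """Assumed two-tap split along the cepstral index; lpf[i] + hpf[i] == frame[i]."""
    lpf, hpf = [], []
    prev = 0.0  # zero initial condition before the first coefficient
    for c in frame:
        lpf.append(0.5 * (c + prev))
        hpf.append(0.5 * (c - prev))
        prev = c
    return lpf, hpf


def weight_hpf(cepstra, alpha):
    """Scale the HPF part of every frame by alpha and recombine with LPF."""
    out = []
    for frame in cepstra:
        lpf, hpf = split_subbands(frame)
        out.append([l + alpha * h for l, h in zip(lpf, hpf)])
    return out


def heq(cepstra):
    """Rank-based histogram equalization per cepstral channel (assumed form).

    Each channel's values over the utterance are replaced by standard-normal
    quantiles according to their rank among the frames.
    """
    n_frames, n_dims = len(cepstra), len(cepstra[0])
    std_normal = NormalDist()
    out = [[0.0] * n_dims for _ in range(n_frames)]
    for d in range(n_dims):
        order = sorted(range(n_frames), key=lambda t: cepstra[t][d])
        for rank, t in enumerate(order):
            out[t][d] = std_normal.inv_cdf((rank + 0.5) / n_frames)
    return out


def scheme1(cepstra, alpha):
    """No HEQ at all: split, weight HPF, recombine."""
    return weight_hpf(cepstra, alpha)


def scheme2(cepstra, alpha):
    """HEQ as preprocessing, then HPF weighting."""
    return weight_hpf(heq(cepstra), alpha)


def scheme3(cepstra, alpha):
    """HPF weighting, then HEQ as postprocessing."""
    return heq(weight_hpf(cepstra, alpha))
```

Here `cepstra` is a list of frames, each a list of cepstral coefficients for one utterance. With `alpha=1` the weighting step is an identity in this sketch, so scheme 1 returns the input and schemes 2 and 3 both reduce to plain HEQ.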
The scaling factor α in the above three schemes is varied from 0.6 to 1.4 with an interval of 0.2. Please note that the case with α>1 corresponds to amplifying HPF and thus reducing the proportion of LPF in the overall cepstra. The recognition results for the three schemes are shown in Table 11. From this table, we have the following observations:

1. In the clean (noise-free) case, all three schemes keep the recognition accuracy at around 99%, nearly irrespective of the scaling factor α, which implies that neither lowering nor raising the HPF portion of the cepstra significantly influences the recognition performance on clean speech. A possible explanation for this result is that the back-end acoustic modeling with HMMs compensates well for the variation of the front-end speech features.

2. From the results for scheme 1, reducing HPF (using α<1) without pre- or postprocessing with HEQ produces degraded performance under noise-corrupted situations compared with the case using α=1, which disagrees with the results for the various forms of WSHEQ shown in the preceding subsections. Under the same situations, setting α>1 to amplify HPF (and thus to reduce the proportion of LPF) cannot improve the accuracy, either. Therefore, the relative importance of LPF and HPF in noise-corrupted cepstra discussed in Section 3 cannot be reflected in recognition accuracy when there is no noise-robust processing such as HEQ. In other words, merely emphasizing LPF or HPF fails to result in more noise-robust cepstra and produces worse recognition accuracy.

3. In contrast to the results for scheme 1, the results for schemes 2 and 3 show that when the cepstra are pre- or postprocessed by HEQ, reducing the HPF part by setting α<1 improves the recognition accuracy under noise-corrupted situations (except for the case of −5 dB SNR). On the other hand, the cases with α>1, in which HPF is raised, produce worse results. The underlying reason is probably that the noise effect in HPF is relatively difficult to alleviate, so simply lowering HPF helps HEQ to give better performance. Similar situations can also be found in Tables 2 and 3 by comparing the results of ${\text{WSHEQ}}_{\text{I}}^{\left(2\right)},{\text{WSHEQ}}_{\text{I}}^{\left(3\right)},{\text{WSHEQ}}_{\text{II}}^{\left(2\right)}$ and ${\text{WSHEQ}}_{\text{II}}^{\left(3\right)}$ with α=1. ${\text{WSHEQ}}_{\text{I}}^{\left(2\right)}$ and ${\text{WSHEQ}}_{\text{II}}^{\left(2\right)}$ outperform ${\text{WSHEQ}}_{\text{I}}^{\left(3\right)}$ and ${\text{WSHEQ}}_{\text{II}}^{\left(3\right)}$, respectively, indicating that a stronger normalization strategy such as HEQ is required to compensate for the distortion in HPF, while a relatively simple MVN process suffices to improve LPF. Furthermore, comparing Table 11 with Tables 2 and 3, we find that the effect of lowering HPF on recognition accuracy is much more significant when we further compensate the subband cepstra (viz. HPF and LPF) by HEQ/MVN, again in agreement with the statements about SHEQ [11] that additionally normalizing HPF and LPF can reduce the environmental mismatch caused by noise.
6 The experiment on the TCC300 Mandarin dataset
Besides the evaluation on the Aurora-2 dataset described in the previous two sections, the recognition experiments with the presented WSHEQ are further carried out on another dataset, the eleventh group of the TCC300 microphone speech database from the Association for Computational Linguistics and Chinese Language Processing in Taiwan [15]. This dataset includes 7,009 Mandarin character strings uttered by 50 male and 50 female adult speakers. The corresponding read speaking-style speech signals were recorded with a microphone at the sampling rate of 16 kHz. The Mandarin characters included in the utterances of this dataset correspond to 408 different Mandarin syllables. In the experiment, syllable recognition is performed on this dataset without any language model or grammar constraint at the back end so that the recognition performance is more closely related to the front-end acoustic features used. As a result, in comparison with the 11-digit recognition on the Aurora-2 telephone-band dataset in the previous sections, here we conduct a more complicated task of medium-vocabulary recognition (408 syllables) on broadband speech data. Among the 7,009 Mandarin utterances in the TCC300 subset, 6,509 strings are selected for acoustic model training, while the other 500 are used for testing. The utterances in the training set are kept noise-free, while the utterances in the testing set are artificially corrupted with additive noise at four SNR levels (20, 15, 5, and 0 dB) to produce noise-corrupted speech data. The noise types include white (broadband) and pink (narrowband), both taken from the NOISEX-92 database [17]. These utterances for training and testing are first converted into 13-dimensional MFCCs ($c_0$, $c_1$–$c_{12}$) and then processed by various kinds of noise-robustness algorithms. Similar to the feature parameter settings for the Aurora-2 database, the resulting 13 new features plus their first- and second-order derivatives constitute the final 39-dimensional feature vector.
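Adding noise at a prescribed SNR, as done for the testing set above, can be sketched as follows. The sketch assumes the SNR is defined on average sample power over the whole utterance and that the noise recording is at least as long as the speech; the tool and the exact power measure actually used in the experiments are not stated here, so the function name and these choices are illustrative assumptions.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise scaled so that the utterance-level SNR equals snr_db.

    SNR is taken as 10*log10(P_speech / P_noise) with P the mean squared
    sample value (an assumption of this sketch).
    """
    if len(noise) < len(speech):
        raise ValueError("noise segment must be at least as long as the speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(v * v for v in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise segment has zero power")
    # Gain g such that p_speech / (g**2 * p_noise) == 10**(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * v for s, v in zip(speech, noise)]
```

For each test utterance, one call per SNR level and noise type would produce the corresponding noise-corrupted version, while the training utterances are left untouched.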
As for the acoustic modeling, we train the HMMs of INITIAL and FINAL units, which correspond to the semi-syllables in Mandarin Chinese. In most cases, a Mandarin Chinese syllable can be split into INITIAL/FINAL parts analogous to the consonant/vowel pair in English. In total, there are 112 right-context-dependent INITIAL HMMs and 38 context-independent FINAL HMMs to be trained. Each INITIAL HMM consists of five states with eight Gaussian mixtures per state, while each FINAL HMM contains ten states with eight Gaussian mixtures per state. The HMM for each of the 408 Mandarin syllables is then constructed by concatenating the associated INITIAL and FINAL HMMs.
Tables 12 and 13 list the syllable recognition accuracy rates of the MFCC baseline and the various robustness methods including CHN, SHEQ (equivalent to ${\text{WSHEQ}}_{\text{I}}^{\left(1\right)}$ with α=1.0), and seven forms of the presented WSHEQ for the white and pink noise environments, respectively. The scaling parameter α in WSHEQ is set to 0.6, which is not optimized but is chosen simply to examine whether lowering HPF can give rise to performance improvement. From these two tables, we have the following findings:

1. Owing to the simple free-syllable decoding framework in the recognition procedure, the recognition accuracy of the MFCC baseline features in the clean noise-free condition is only around 75%. In addition, the noise-robustness methods used here give similar or even better performance compared with the MFCC baseline when the testing utterances contain no noise.

2. Both types of noise seriously degrade the performance of MFCC as the SNR gets worse, while CHN and all of the other HEQ-related algorithms improve the recognition accuracy significantly. In particular, the various forms of WSHEQ with α=1 outperform CHN, indicating that additionally processing LPF and HPF with HEQ or MVN can further enhance CHN to produce better results.

3. Reducing the scaling factor α from 1.0 to 0.6 in the eight forms of WSHEQ consistently brings about better results by significant margins in all noise-corrupted situations. This result reconfirms the capability of the presented HPF lowering operation in boosting the noise robustness of CHN-processed features. Furthermore, when α is set to 0.6, the performance difference among the various forms of WSHEQ becomes relatively small in comparison with that under the condition of α=1.0.
7 Conclusions
In this paper, we explored the relative importance of different frequency components of the intraframe speech features and subsequently presented a novel algorithm, WSHEQ, to improve noisy speech recognition. WSHEQ mainly reduces the intraframe high-pass filtered component of the speech features, which appears more vulnerable to noise. Compared with the well-known SHEQ method, WSHEQ can achieve superior recognition accuracy, higher computational efficiency, or both. In future work, we will pursue new filter structures for obtaining the LPF and HPF components for WSHEQ to achieve better results. Additionally, we will investigate how to tune the intraframe speech features more flexibly in the corresponding DFT or DCT domains for further noise reduction.
References
1. Maganti HK, Matassoni M: A perceptual masking approach for noise robust speech recognition. EURASIP J. Audio Speech Music Process 2012, 2012: 29.
2. Wu K, Chen C, Yeh B: Noise-robust speech feature processing with empirical mode decomposition. EURASIP J. Audio Speech Music Process 2011, 2011: 9.
3. Cohen I, Berdugo B: Speech enhancement for non-stationary noise environments. Signal Process 2001, 81(11): 2403–2418. 10.1016/S0165-1684(01)00128-1
4. Kotnik B, Kačič Z: A noise robust feature extraction algorithm using joint wavelet packet subband decomposition and AR modeling of speech signals. Signal Process 2007, 87(6): 1202–1223. 10.1016/j.sigpro.2006.10.009
5. Tibrewala S, Hermansky H: Multiband and adaptation approaches to robust speech recognition. In 5th Eurospeech Conference on Speech Communications and Technology. Rhodes: Eurospeech; 22–25 Sept 1997.
6. Hilger F, Ney H: Quantile based histogram equalization for noise robust large vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process 2006, 14: 845–854.
7. Benesty J, Sondhi MM, Huang Y (Eds): Springer Handbook of Speech Processing. 2008.
8. Chen C, Bilmes J: MVA processing of speech features. IEEE Trans. Audio Speech Lang. Process 2007, 15: 257–270.
9. Hsu CW, Lee LS: Higher order cepstral moment normalization for improved robust speech recognition. IEEE Trans. Audio Speech Lang. Process 2009, 17: 205–220.
10. Xiao X, Chng ES, Li H: Normalization of the speech modulation spectra for robust speech recognition. IEEE Trans. Audio Speech Lang. Process 2008, 16: 1662–1674.
11. Joshi V, Bilgi R, Umesh S, García L, Benítez MC: Subband level histogram equalization for robust speech recognition. In 12th International Conference on Spoken Language Processing. Florence: Interspeech; 27–31 Sept 2011.
12. Hirsch HG, Pearce D: The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy conditions. In Proceedings of the 2000 Automatic Speech Recognition: Challenges for the new Millenium. Paris: ISCA ITRW ASR; 18–20 Sept 2000.
13. Kanedera N, Arai T, Hermansky H, Pavel M: On the importance of various modulation frequencies for speech recognition. In 5th European Conference on Speech Communication and Technology. Rhodes: Eurospeech; 22–25 Sept 1997.
14. Huang X, Acero A, Hon HW: Spoken Language Processing: A Guide to Theory, Algorithm and System Development. New Jersey: Prentice Hall; 2001.
15. ACLCLP 1990. http://www.aclclp.org.tw/corp.php, Accessed 10 Aug 2013
16. ELDA 1995. http://www.elda.org/article52.html, Accessed 8 Aug 2013
17. Varga AP, Steeneken HJM, Tomlinson M, Jones D: The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical report, DRA Speech Research Unit (1992)
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Hung, Jw., Fan, Ht. Intraframe cepstral subband weighting and histogram equalization for noise-robust speech recognition. J AUDIO SPEECH MUSIC PROC. 2013, 29 (2013). https://doi.org/10.1186/1687-4722-2013-29
Keywords
 Subband division
 Speech recognition
 Robust speech features
 Histogram equalization