Skip to main content

Intra-frame cepstral sub-band weighting and histogram equalization for noise-robust speech recognition

Abstract

In this paper, we propose a novel noise-robustness method known as weighted sub-band histogram equalization (WS-HEQ) to improve speech recognition accuracy in noise-corrupted environments. Considering the observations that high- and low-pass portions of the intra-frame cepstral features possess unequal importance for noise-corrupted speech recognition, WS-HEQ is intended to reduce the high-pass components of the cepstral features. Furthermore, we provide four types of WS-HEQ, which partially refers to the structure of spatial histogram equalization (S-HEQ). In the experiments conducted on the Aurora-2 noisy-digit database, the presented WS-HEQ yields significant recognition improvements relative to the Mel-scaled filter-bank cepstral coefficient (MFCC) baseline and to cepstral histogram normalization (CHN) in various noise-corrupted situations and exhibits a behavior superior to that of S-HEQ in most cases.

1 Introduction

The performance of speech recognition systems is often degraded due to noise in application environments. A significant number of noise-robustness techniques have been proposed to address the noise problem, and one prevailing subset of these techniques is focused on reducing the statistical mismatch of speech features in the training and testing conditions of the recognizer. Typical examples are perceptual masking [1], empirical mode decomposition [2], optimally modified log-spectral amplitude estimation [3], wavelet packet decomposition with AR modeling [4], cepstral mean and variance normalization (MVN) [5], cepstral histogram normalization (CHN) [6, 7], MVN with ARMA filtering (MVA) [8], higher order cepstral moment normalization (HOCMN) [9], and temporal structure normalization (TSN) [10]. In some of these methods, the compensation is performed on each individual cepstral channel sequence of an utterance by assuming that these channels are mostly uncorrelated [7].

Recently, certain studies have investigated the use of cepstral frame-based processing to compensate for the noise effect to achieve better recognition accuracy. For example, the work in [11] revealed that in the CHN method, even though each cepstral channel is processed by histogram equalization (HEQ), a significant histogram mismatch still exists among the training and testing cepstral features for the low-pass filtered (LPF) and high-pass filtered (HPF) portions of the intra-frame cepstra. Thus, the method of spatial HEQ in [11] further performs HEQ on the LPF and HPF portions to eliminate the aforementioned mismatch for the CHN-preprocessed cepstra. Compared with conventional CHN that processes each individual cepstral channel, spatial HEQ (S-HEQ) additionally takes the neighboring cepstral channels into consideration collectively and produces superior noise robustness. Furthermore, for a frame signal, the LPF and HPF portions of the cepstral vector just correspond to the logarithmic filter-bank (LFB) components at lower and higher frequencies, respectively. However, compensation performed directly on LPF and HPF is more helpful than that applied to the LFB components, most likely because the LFB components are significantly correlated [11].

Partly inspired by S-HEQ, here we develop a novel scheme known as the weighted S-HEQ (WS-HEQ) to improve the recognition performance and operation efficiency of S-HEQ in three directions. First, because the LPF and HPF portions of the original or CHN-preprocessed cepstra possess different characteristics in noisy environments and provide unequal contributions to the recognition accuracy, we tune the portion of HPF produced in the original S-HEQ and show that this adjustment can outperform S-HEQ in recognition accuracy. Second, we change the order of the procedures in S-HEQ by first splitting the original intra-frame cepstra (not the CHN-preprocessed cepstra) into LPF and HPF, subsequently compensating LPF and HPF individually, and finally, normalizing the full-band cepstra. This new structure can reduce the effect of noise on the LPF and HPF portions in the plain cepstra more directly in comparison with S-HEQ. Finally, because S-HEQ requires three HEQ operations, we use the simpler process of MVN to replace any of the three HEQ processes in S-HEQ to improve the computational efficiency. The experimental results show that some variants of WS-HEQ, which require fewer HEQ operations, provide a similar or even better recognition accuracy relative to S-HEQ.

The remainder of this paper is organized as follows. Section 2 reviews S-HEQ, and the basic concept and detailed procedures of the proposed WS-HEQ are presented in Section 3. Section 4 describes the experimental setup, and Sections 5 and 6 contain a series of recognition experiments for WS-HEQ together with their corresponding discussions. Finally, the concluding remarks are summarized in Section 7.

2 Brief review of S-HEQ

If we consider using the Mel-scaled filter-bank cepstral coefficients (MFCC) as the baseline features for speech recognition, then the cepstral feature vector stream associated with an arbitrary utterance is represented by a matrix C:

C={c(m,n);0mM1,0nN1},
(1)

where m is the cepstral channel index within a frame and n is the frame index, and M and N are the total number of channels and frames within the utterance, respectively. In the temporal processing methods as MVN and CHN, the compensation is often directly performed on the individual channel stream (i.e., the sequence {c( m ~ ,n);0nN1} with respect to the m ~ th channel), and therefore, all of the channel streams of the features are treated independently. According to the general concept that the cepstral coefficients within a frame are mostly uncorrelated [7], such a process is quite reasonable.

Recently, a novel method known as the spatial HEQ (S-HEQ) was suggested to decompose each frame of a CHN-preprocessed cepstral vector into two parts, a high-pass filtered and low-pass filtered portion (denoted hereafter as HPF and LPF), such that the temporal sequences of HPF and LPF can be processed separately and then the updated HPF and LPF can be combined to form the new feature vector stream. The work in [11] shows that S-HEQ outperforms the conventional CHN by providing better recognition accuracy. The overall procedure of S-HEQ is depicted in Figure 1.

Figure 1
figure 1

The structure of the S-HEQ algorithm.

3 Proposed approach: WS-HEQ

S-HEQ [11] offers additional insight into the possible distortions left unprocessed by CHN and a method for achieving even better noise robustness for speech features. In this section, we further examine S-HEQ to assess whether it can be further improved. The following two observations can be made about S-HEQ:

  1. 1.

    S-HEQ divides each CHN-preprocessed cepstral vector into HPF and LPF and subsequently treats the temporal stream of these two parts in the same manner (i.e., with HEQ processing). Therefore, S-HEQ does not consider the characteristic differences between HPF and LPF. According to [11], the plain HPF (from the original cepstra, not the CHN-preprocessed cepstra) is often more vulnerable to noise and displays more mismatch than the plain LPF, whereas S-HEQ compensates for the CHN-preprocessed HPF and LPF directly. Additionally, HPF and LPF possess unequal importance in speech recognition, which will be shown later.

  2. 2.

    In S-HEQ, the HEQ operation is repeated up to three times: one for the original feature stream set and the other two for the HPF and LPF stream sets. Thus, S-HEQ requires twice more computational effort than the conventional CHN method, which only processes the original stream set once via HEQ.

In this work, we design a simple experiment to evaluate the relative importance of different sub-bands of the cepstral features in speech recognition. With the Aurora-2 database [12], we select 8,440 clean utterances for the clean-condition training task as the data used to train the acoustic models and 8,440 noisy utterances (corrupted by any of four types of noise at five signal-to-noise ratios) originally for the multi-condition training task as the testing data. Each utterance in the training and testing sets is first converted into a sequence of 13-dimensional cepstral vectors (c 0, c 1 to c 12). The obtained cepstra are either kept unchanged or processed by CHN. Next, for each original/CHN-processed cepstral vector, we obtain its ‘sub-band’ version with the following two steps:

Step 1. Find the spectrum of the cepstral vector via discrete Fourier transform (DFT):

Let c=[ c0c1c2c12]T denote an arbitrary cepstral vector, and its spectrum is obtained by

C[k]= m = 0 12 c m e j 2 πmk 13 ,0k12.
(2)

Due to the conjugate symmetry of {C[ k]}, we only need to retain the first seven points, which correspond to {k 2 π 13 ;0k6} in normalized frequency.

Step 2. Retain a contiguous portion of the spectral points and transform them (together with their conjugate symmetric parts) into a new cepstral vector via inverse DFT. For example, if we retain the first to fifth spectral points unchanged and set the zeroth and the sixth spectral points to zero, then the resulting new cepstral vector is a sub-band version of the original cepstral vector and corresponds approximately to the band range of [ 2 π 13 , 10 π 13 ].

The recognition accuracy rates for different cepstral features obtained from the above sub-band processing are shown in Figures 2a and 3a, the former being for the original cepstra and the later being for the CHN-processed cepstra (Please note that the testing data undergo the same process as the training data in the recognition experiment. Therefore, the original testing cepstra are recognized by the acoustic models trained from the original training cepstra, and the CHN-processed testing cepstra are recognized by the acoustic models trained from the CHN-processed training cepstra). The vertical axis in Figures 2a and 3a denotes the word accuracy rate, and the other two axes indicate the initial and final spectral points, k L and k H , of the assigned sub-band, respectively. Obviously, the CHN-processed cepstra outperform the original cepstra in recognition results. Besides, for both types of cepstra the full-band features are always able to achieve the highest accuracy, and decreasing the bandwidth of the sub-band worsens the accuracy. However, we can further evaluate the relative importance of different spectral points in the sub-band from the two figures using the following equation:

r m = 1 N r k > m + 1 ( R m , k R m + 1 , k ) + k < m 1 ( R k , m R k , m 1 ) ,
(3)
Figure 2
figure 2

Some information about the DFT-based spectrum of cepstra without CHN processing. (a) Recognition rates for the band-pass filtered cepstra. (b) The contribution of each individual spectral point.

Figure 3
figure 3

Some information about the DFT-based spectrum of CHN-processed cepstra. (a) Recognition rates for the band-pass filtered cepstra. (b) The contribution of each individual spectral point.

where r m denotes the averaged contribution of the m th spectral points, Rm,k is the recognition rate using the cepstra within the sub-band including the m th to k th spectral points, and N r is the total number of items in the summation of Equation 3. (The term ‘relative importance’ and its definition shown in Equation 3 are borrowed from [13], in which a series of band-pass filters are used to evaluate the various modulation spectral components in their contribution to the recognition accuracy.) The obtained results from the original and the CHN-processed cepstra are shown in Figures 2b and 3b, respectively. Note that in Equation 3, the number of spectral points in the assigned sub-band range is always greater than or equal to 2 because the cepstra associated with a single spectral point quite often result in a rather poor (even negative) recognition accuracy.

From Figures 2b and 3b, the seven spectral points possess unequal importance in noisy speech recognition. The middle and lower frequency points (except for the DC point) seem to contribute more to the recognition accuracy than the upper points. These results suggest that alleviating the higher frequency components in the cepstra more likely results in better recognition performance in a noisy environment. Besides, comparing Figure 3b with Figure 2b, we find that the CHN process helps the higher frequency points to reinforce their importance in speech recognition, especially for the point at frequency 10 π 13 .

The spectrum of the cepstra in the aforementioned evaluation experiment is created via the DFT, with the main reason that low-pass and high-pass filters are to be applied to the cepstra in later discussions, and we often evaluate the effect of a filter on the processed signal in the Fourier-based frequency domain. Also, in most cases, the characteristics of a filter are investigated by its frequency response; the Fourier transform, of its impulse response. However, since each frame-wise cepstral vector is the truncated version of the inverse discrete cosine transform (IDCT) of the logarithmic spectrum of the corresponding frame, here we reconduct the preceding evaluation experiment based on the ‘DCT-based’ spectrum of the original/CHN-processed cepstra. That is, in step 1 of the experiment, we obtain the 13-point spectrum of any arbitrary cepstral vector c= [ c0c1c2c12]T via DCT:

C ~ [ k ]= m = 0 12 c m cos π ( m + 1 2 ) k 13 ,0 k 12,
(4)

and then in step 2, a contiguous portion of the DCT-based spectral points is retained and transformed into a new cepstral vector via IDCT.

Some differences between the DCT-based spectrum { C ~ [ k ]} in Equation 4 and DFT-based spectrum {C[ k]} in Equation 2 are as follows:

  1. 1.

    Unlike the DFT-based spectrum {C[ k]} which is complex-valued and conjugate symmetric, in general, the real-valued DCT-based spectrum { C ~ [ k ]} is not symmetric in any sense. Thus, we cannot discard the second half points of { C ~ [ k ]} as we do on {C[ k]}.

  2. 2.

    { C ~ [ k ]} possesses a higher frequency resolution than {C[ k]}. Comparing Equation 4 with Equation 2, the frequency difference between any two adjacent bins of { C ~ [ k ]} is π 13 , while it is 2 π 13 for {C[ k]}.

  3. 3.

    Referring to [14], the N-point DCT, { C ~ [ k ]}, of a length-N sequence {c[ n],0≤nN−1} (here N=13), can be computed via a 2N DFT of another length- 2N sequence { c ~ [n],0n2N1}, denoted by {D[ k ]}, in which c ~ [n] is the even extension of c[ n] satisfying c ~ [n]=c[n] for 0≤nN−1 and c ~ [n]=x[2N1n] for Nn≤2N−1. { C ~ [ k ]} and {D[ k ]} are related by:

    C ~ [ k ]=0.5 e j πk 2 N D[ k ]for0kN1.
    (5)

    Generally speaking, the DCT-based spectrum { C ~ [ k ]} is more concentrated at low frequencies than the DFT-based spectrum {C[ k]}, which is well known as the ‘energy compaction property’ of DCT. An underlying reason for this phenomenon is that DFT implicitly assumes the periodic extension of the processed signal and often causes the artificial discontinuities at the signal boundary, which adds high frequency contents in the DFT-based spectrum. To show this, a length-N sequence {x[ n],0≤nN−1} is treated by N-point DFT as an N-periodic signal, denoted by x e [ n], in which x e [ n]=x[ n] for 0≤nN−1 and x e [ n+N]=x e [ n]. Thus, x e [ n] is generally discontinuous at the (original) boundary positions:

    x e [ 0 ] = x [ 0 ] x [ N 1 ] = x e [ 1 ] ,
    (6)
    x e [ N 1 ] = x [ N 1 ] x [ 0 ] = x e [ N ] .
    (7)

    However, as mentioned earlier, the N-point DCT of a length-N sequence {x[ n]} (starting at n=0) can be obtained from the 2N-point DFT of the even extension of {x[ n]}, and the corresponding 2N-periodic signal, denoted by x ~ e [n], remains continuous at the boundary positions:

    x ~ e [ 0 ] = x ~ [ 0 ] = x ~ [ 2 N 1 ] = x ~ e [ 2 N 1 ] = x ~ e [ 1 ] ,
    (8)
    x ~ e [ 2 N 1 ] = x ~ [ 2 N 1 ] = x ~ [ 0 ] = x ~ e [ 0 ] = x ~ e [ 2 N ] .
    (9)

    As a result, the (N-point) DCT-based spectrum does not contain the high frequency artifacts as the (N-point) DFT-based spectrum, and it appears more compact at low frequencies.

With the cepstra from the IDCT of sub-band DCT-based spectra, the corresponding evaluation experiment is performed to obtain the recognition accuracy rates, which are shown in Figures 4a and 5a, and the relative importance of different spectral points are shown in Figures 4b and 5b. Figure 4a,b is for the original cepstra and Figure 5a,b is for the CHN-processed cepstra. These two figures roughly reveal that the lower and middle DCT-based spectral points contribute to the recognition more than the upper ones in recognition, which somewhat coincides our observations from Figures 2a,b and 3a,b associated with the DFT-based spectra. In addition, comparing Figure 4b with Figure 2b and Figure 5b with Figure 3b, we find that the higher DCT-based spectral points reveal more importance than the higher DFT-based spectral points. which partially agrees with our previous statement that the DFT-based spectrum contains some artificial high frequency contents, which distort the higher spectral points and reduce the corresponding contribution.

Figure 4
figure 4

Some information about the DCT-based spectrum of cepstra without CHN processing. (a) Recognition rates for the band-pass filtered cepstra. (b) The contribution of each individual spectral point.

Figure 5
figure 5

Some information about the DCT-based spectrum of CHN-processed cepstra. (a) Recognition rates for the band-pass filtered cepstra. (b) The contribution of each individual spectral point.

In light of the aforementioned discussions, we developed a novel method known as the WS-HEQ to enhance the speech features in noise robustness. The initial concept of WS-HEQ is to apply a weighting factor to the HPF portion in S-HEQ (as shown in Figure 1) to reduce the intra-frame higher frequency components, and we further provide several variations on the presented WS-HEQ. First, according to the order of the HEQ processing for the full-band cepstra and sub-band cepstra, we describe two structures:

Structure I. HEQ first operates on the plain (intra-frame) full-band cepstra and subsequently on the sub-band cepstra.

Structure II. HEQ first operates on the plain (intra-frame) sub-band cepstra and subsequently on the full-band cepstra.

Please note that in the above two structures, the two sub-band cepstral portions, LPF and HPF, are obtained with simple two-point FIR filters operating on the full-band cepstra [11]:

LPF: c lp ( m , n ) = c ( m , n ) + c ( m 1 , n ) 2 ,
(10)
HPF: c hp ( m , n ) = c ( m , n ) c ( m 1 , n ) 2 ,
(11)

where c lp (m,n) and c hp (m,n) denote the low-pass and high-pass filtered parts of the n th cepstral frame.

Next, according to different treatments (i.e., the compensation methods, HEQ, and MVN) of LPF and HPF in Equations 10 and 11, each structure of WS-HEQ has the following four types of variations:

Type 1: c ~ lp = HEQ [ c lp ] , c ~ hp = α HEQ [ c hp ] ,
(12)
Type 2: c ~ lp = MVN [ c lp ] , c ~ hp = α HEQ [ c hp ] ,
(13)
Type 3: c ~ lp = HEQ [ c lp ] , c ~ hp = α MVN [ c hp ] ,
(14)
Type 4: c ~ lp = MVN [ c lp ] , c ~ hp = α MVN [ c hp ] ,
(15)

where HEQ [ ·] and MVN [ ·] denote the operators of the HEQ and MVN processes, respectively; c ~ lp and c ~ hp are the updated LPF and HPF, respectively (we omit the indices (m,n) for simplicity); and the parameter α with a range of [0, 1] is the scaling factor selected specifically for the HPF component. The flowcharts of the various structures and types of WS-HEQ are depicted in Figure 6a,b.

Figure 6
figure 6

The flowcharts of two structures of WS-HEQ. (a) Structure I and (b) structure II.

For clarity, in the following discussions, the term ‘WS-HEQ’ is written with an additional subscript of ‘I’ or ‘II’, and a superscript of ‘(1)’, ‘(2)’, ‘(3)’, or ‘(4)’ to identify different structures and different processing schemes for LPF and HPF in the presented WS-HEQ method. For example, WS-HEQ II ( 3 ) indicates that the WS-HEQ method applying the second structure shown in Figure 6b and uses HEQ and MVN for the LPF and HPF portions, respectively. Additional discussions on the various forms of WS-HEQ are given:

  1. 1.

    Because the HEQ operation is nonlinear, WS-HEQ II ( 1 ) with α=1.0 (no attenuation for HPF) as shown in Equation 12 in which HEQ is first performed on the sub-band cepstra and subsequently on the full-band cepstra, is different from S-HEQ (equivalent to WS-HEQ I ( 1 ) with α=1.0) shown in Figure 1, in which the full-band cepstra are HEQ-processed in advance.

  2. 2.

    In the first type of WS-HEQ II, (viz. WS-HEQ II ( 1 ) ), both LPF and HPF (of the original MFCC) are processed by HEQ. The resulting new HPF is attenuated by a factor of α and then combined with the new LPF to form the full-band cepstra, which are further processed by HEQ in the final stage. Therefore, WS-HEQ II ( 1 ) requires three HEQ operations, the same as S-HEQ, demonstrating that S-HEQ and WS-HEQ II ( 1 ) are similar in computational complexity.

  3. 3.

    The other three types of WS-HEQ as shown in Equations 13 to 15 differ from the first type in that they compensate either or both of the LPF and HPF portions via MVN instead of HEQ. MVN can be implemented more efficiently than HEQ because MVN involves only the operations of addition and multiplication, whereas a sorting algorithm is required in HEQ. We expect that the cost savings of HEQ on HPF/LPF will not affect the prospective recognition accuracy.

4 Experimental setup

The performance of our proposed WS-HEQ scheme is examined in two databases. One is the Aurora-2 database [12] corresponding to a connected English-digit recognition task, and the other is a subset of the TCC-300 database [15] for the recognition of 408 Chinese syllables. Briefly speaking, we conduct more comprehensive experiments with the Aurora-2 database for analysis and comparison upon the various forms of the presented WS-HEQ together with some other robustness algorithms, and a smaller number of experiments conducted on the subset of the TCC database are simply to examine if the presented WS-HEQ can be extended to work well in a median-size vocabulary recognition task which is more complicated than Aurora-2. Furthermore, in order to avoid the ambiguity and confusion in discussion, the remainder of this section and Section 5 are specially for the Aurora-2 evaluation task, while the detailed discussions about the TCC-300 subset task will given in Section 6.

As for the Aurora-2 database, the test data consist of 4,004 utterances, and three different subsets are defined for the recognition experiments: test sets A and B are both affected by four types of noise, and test set C is affected by two types. Each noise instance is artificially added to the clean speech signal at seven SNR levels (ranging from 20 to −5 dB). The signals in test sets A and B are filtered with a G.712 filter, and those in Set C are filtered with an MIRS filter. In the ‘clean-condition training, multi-condition testing’ evaluation task defined in [12], the training data consist of 8,440 noise-free clean utterances filtered with a G.712 filter. Thus, compared with the training data, test sets A and B are distorted by additive noise, and test set C is affected by additive noise and a channel mismatch.

In the experiments, each utterance in the clean training set and the three testing sets is first converted to a 13-dimensional MFCC (c 0, c 1 to c 12) sequence. Next, the MFCC features are processed by either S-HEQ [11] or the various forms of WS-HEQ noted in Section 3. In addition, the selected target distribution of the HEQ operation applied to any of the full-band, LPF, and HPF cepstra is the standard normal (Gaussian), with a zero mean and unity variance. (Please note that, given the full-band cepstral sequences being standard normal and approximately mutually uncorrelated, the corresponding LPF and HPF via the operations in Equations 10 and 11 are also standard normal. Similarly, if the HPF and LPF are both standard normal and approximately mutually uncorrelated, then the corresponding full-band cepstra are normally distributed with a zero mean and a variance of less than 1 since we scale down the HPF portion.)

The resulting 13 new features, in addition to their first- and second-order derivatives, are the components of the final 39-dimensional feature vector. With the new feature vectors in the clean training set, the hidden Markov models (HMMs) for each digit and for silence are trained with the scripts provided by the Aurora-2 CD set [16]. Each HMM digit contains 16 states, with three Gaussian mixtures per state.

In particular, the 8,440 noisy utterances (corrupted by four types of noise at five signal-to-noise ratios) originally for the multi-condition training task [12], which has been mentioned earlier in Section 3, are served as the development set here in order to obtain an appropriate selection of the scaling factor α for the HPF portion in Equations 12 to 15. The value of α is varied from 0.0 to 1.0 with an interval of 0.1 in each form of WS-HEQ, and then the one that achieves the optimal recognition accuracy for the development set is chosen for the corresponding WS-HEQ in practice. The selected values of α for different forms of WS-HEQ are listed in Table 1.

Table 1 Scaling factor α for each type of WS-HEQ

5 Experimental results and discussions for the Aurora-2 task

5.1 Recognition accuracy

The presented WS-HEQ is evaluated in terms of recognition accuracy. Tables 2 and 3 show the individual set recognition accuracy rates averaged over five SNR conditions (0 to 20 dB, with a 5-dB interval) for the MFCC baseline, CHN, S-HEQ (equivalent to WS-HEQ I ( 1 ) with α=1.0), and various forms of the presented WS-HEQ, while Table 4 further lists the recognition accuracy rates for each individual SNR situations but averaged over ten noise situations. In addition, Figure 7 depicts the overall averaged word error rates achieved by several methods, including MVA, HOCMN, TSN, CHN, S-HEQ, WS-HEQ I ( 1 ) (α=0.6), and WS-HEQ II ( 1 ) (α=0.6). From Tables 2,3,4 and Figure 7, we have the following findings:

Table 2 The recognition accuracy results (%) of the MFCC baseline, CHN, S-HEQ, and WS-HEQ with structure I
Table 3 The recognition accuracy results (%) of the MFCC baseline, CHN, and WS-HEQ with structure II
Table 4 The recognition accuracy results (%) of the MFCC baseline, CHN, and eight forms of WS-HEQ
Figure 7
figure 7

Overall word error rate (%) averaged over all noise types and levels achieved by different noise-robustness methods.

  1. 1.

    Compared with the MFCC baseline, all of the HEQ-related methods provide very similar accuracy rates for the clean situation, and they are able to provide significant improvement in recognition accuracy for various noise-corrupted situations, showing that HEQ is quite helpful for speech features in terms of noise robustness.

  2. 2.

    S-HEQ ( WS-HEQ I ( 1 ) with α=1.0) outperforms CHN by around 2.3% in the averaged accuracy, and thus, further manipulation of the mismatch in LPF and HPF with two extra HEQ operations can benefit the recognition performance.

  3. 3.

    WS-HEQ II ( 1 ) with α=1.0 produces results similar to those of S-HEQ, and thus, the proposed structure II (shown in Figure 6b) performs quite well. Additionally, provided that no attenuation exists for HPF by setting α=1.0, using structure II in the other three types of WS-HEQ, i.e., WS-HEQ II ( 2 ) , WS-HEQ II ( 3 ) , and WS-HEQ II ( 4 ) as shown in Equations 13 to 15, outperforms the respective methods under structure I. In particular, WS-HEQ II ( 2 ) behaves better than WS-HEQ II ( 1 ) , whereas WS-HEQ I ( 2 ) behaves worse than WS-HEQ I ( 1 ) , revealing that applying structure II can make WS-HEQ less costly in computation and can obtain improved recognition results simultaneously.

  4. 4.

    Reducing the HPF component by setting the factor α as less than 1.0 as in Table 1 significantly improves the recognition accuracy, regardless of the different structures and types of WS-HEQ. WS-HEQ II ( 1 ) gives an averaged accuracy of 84.99%, which is optimal among all of the methods and corresponds to error reduction rates of 62.71%, 23.73%, and 13.83% relative to the MFCC baseline, CHN, and S-HEQ, respectively. These results support the aforementioned observations that HPF is more extensively contaminated by noise and that lo wering HPF is beneficial.

  5. 5.

    Among the four types of WS-HEQ listed in Equations 12 to 15, by assigning α as less than 1.0, WS-HEQ (1), which requires three HEQ operations, displays the best behavior, regardless of the selected structure. However, the two types that require only two HEQ operations (i.e., WS-HEQ (2) and WS-HEQ (3)) perform quite similarly to WS-HEQ (1) when structure II is used. Finally, WS-HEQ (4) performs worse than the other three types, possibly because it applies only one HEQ operation. Even so, WS-HEQ I ( 4 ) and WS-HEQ II ( 4 ) with α=0.6 can behave very close to S-HEQ ( WS-HEQ I ( 1 ) with α=1.0).

  6. 6.

    The presented WS-HEQ II ( 1 ) with α=0.6 behaves better in the overall averaged word error rate when compared with several well-known noise-robustness methods: TSN, HOCMN, MVA, CHN, and S-HEQ. The absolute error rate reduction of WS-HEQ II ( 1 ) with α=0.6 relative to the MFCC baseline is as high as 25.24%.

Taking a step further, among the methods used for comparison, MVA and TSN explicitly applies a temporal filter, and in most cases, the used filter is low pass so as to perform a ‘temporal’ smoothing on the cepstral time series. In contrast, the presented WS-HEQ lowers HPF (the high-pass filtered portion) of each cepstral vector and is analogous to a ‘spatial’ smoothing operation. Such an observation leads to the idea of combining either MVA or TSN with WS-HEQ in order to achieve a two-dimensional smoothing. To realize this idea, the cepstra are first processed with any of the eight forms of WS-HEQ and then further compensated by MVA or TSN. The obtained recognition results are shown in Tables 5 and 6, in which the applied WS-HEQ uses the scaling factor α listed in Table 1. As we look into the results shown in Tables 5 and 6, it can be found that the pairing of WS-HEQ and MVA/TSN consistently achieves better performance than the individual component method, regardless of the various forms of WS-HEQ. For example, the method ‘ WS-HEQ II ( 1 ) +TSN’ obtains the averaged accuracy of 86.25%, better than TSN (81.39%) and WS-HEQ II ( 1 ) (84.99%). These results indicate that the joint spatial-temporal smoothing can provide the cepstral features with better noise robustness in comparison with either spatial smoothing or temporal smoothing in isolation. In particular, different forms of WS-HEQ behave very similar and can give around 85% in averaged accuracy when TSN/MVA is integrated, implying that when employing TSN/MVA as a post-processing technique, simpler versions of WS-HEQ, such as WS-HEQ I ( 4 ) and WS-HEQ II ( 4 ) , are relatively more appropriate in practical applications due to their high recognition performance and relatively low computation complexity in comparison with S-HEQ, WS-HEQ I ( 1 ) , and WS-HEQ II ( 1 ) .

Table 5 The recognition accuracy results (%) achieved by the combination of MVA and WS-HEQ
Table 6 The recognition accuracy results (%) achieved by the combination of TSN and WS-HEQ

5.2 The influence of the parameter α in WS-HEQ

As stated previously, the parameter α in WS-HEQ determines the degree of attenuation for the HPF portion of the processed cepstra. Here, we would like to investigate how the value of α in WS-HEQ influences the recognition accuracy of the test sets. For simplicity, we vary the parameter α from 0.0 to 1.0 in two types of WS-HEQ: WS-HEQ I ( 1 ) and WS-HEQ II ( 1 ) , and the corresponding recognition accuracy rates averaged over all noise types and levels in three test sets are shown in Figure 8a,b. These two figures reveal that

  1. 1.

    Lowering the HPF part by tuning α from 1.0 to 0.4 in both WS-HEQ I ( 1 ) and WS-HEQ II ( 1 ) achieves better results consistently relative to these two WS-HEQ methods using α=1.0. However, further reducing the HPF part can ruin the recognition accuracy, which implies that the HPF part also contains information helpful for recognition.

  2. 2.

    The optimal accuracy for WS-HEQ I ( 1 ) and WS-HEQ II ( 1 ) occurs when α is assigned to 0.5 and 0.6, respectively, while the results from the development set suggest the parameter α to be 0.6 for these two WS-HEQ methods (as shown in Table 1). However, WS-HEQ I ( 1 ) with α=0.6 gives the recognition rate of 84.27%, very close to the optimal one (84.32%). Therefore, it assures us that the development set can help to determine the nearly optimal parameter in the test sets.

  3. 3.

    The performance of WS-HEQ I ( 1 ) and WS-HEQ II ( 1 ) is not very sensitive to the parameter α, which is based on the observation that the accuracy difference is below 1.0% provided the value of α is within the range [ 0.4,0.7].

Figure 8
figure 8

The overall recognition accuracy achieved by two WS-HEQ methods with different α . (a) WS-HEQ I ( 1 ) . (b) WS-HEQ II ( 1 ) .

Next, we explore the best possible recognition results for each testing situation achieved by WS-HEQ with various assignments of the scaling parameter α. Please note that, in the preceding experiments the scaling parameter α in WS-HEQ is determined by the development set and then uniformly applied to the every test set. Here, we would like to investigate whether the optimal choice of α (which gives rise to the highest recognition accuracy) depends on the noise type and level (viz. the SNR) of the testing utterances. To do this, we vary the value of α from 0.0 to 1.0 with an interval of 0.1 in each form of WS-HEQ to process the features in the training and testing sets and then perform the experiment. The optimal recognition accuracy rate and the associated α with respect to each noise type and level in the testing set achieved by WS-HEQ I ( 1 ) and WS-HEQ II ( 1 ) are respectively shown in Tables 7 and 8. Some contents of the tables together with the data obtained from the other six forms of WS-HEQ (which are not listed here due to their huge amount) are further summarized in Tables 9 and 10, which also contain a portion of the data in Tables 2 and 3 for the purpose of comparison. Observing these tables, we find that the value of the factor α that achieves the optimal recognition accuracy indeed depends on the noise type and level of the utterances. However, there seems no general rule for selecting a better α with respect to any specific noise situation. Furthermore, as seen in Table 9, in most cases, the accuracy rates obtained with the optimal α associated with the individual noise situation are very close to the accuracy rates using a fixed α which gets the optimal results for the development set. The maximum difference between the above two types of accuracy rates is 0.75%, which occurs at the method of WS-HEQ I ( 2 ) . As a result, we can roughly conclude that using the α recommended by the development set suffices to provide WS-HEQ with nearly optimal performance.

Table 7 Recognition accuracy results (%) of WS-HEQ I ( 1 ) using the optimal scaling factor α (in parentheses)
Table 8 The recognition accuracy results (%) of WS-HEQ II ( 1 ) using the optimal scaling factor α (in parentheses)
Table 9 The recognition accuracy results (%) of various forms of WS-HEQ for different test sets
Table 10 The recognition accuracy results (%) of various forms of WS-HEQ at different SNRs

5.3 The feature distortion reduced by WS-HEQ

Apart from the recognition performance, in this subsection, we evaluate WS-HEQ in the capacity of reducing the feature distortion caused by noise. The incoherent feature distortion [7] defined by

φ= Σ k ( | X [ k ] | | X ~ [ k ] | ) 2 Σ k | X ~ [ k ] | 2
(16)

is measured for the feature streams processed by the noise-robustness method, where X ~ [k] and X[ k] denote the DFT of the noise-free clean feature stream and its noise-corrupted counterpart, respectively. Figure 9 depicts the feature distortion associated with any cepstral channel at the SNR of 10 dB, averaged over the 1,001 utterances in test set A of the Aurora-2 database, with respect to the feature streams processed by any of S-HEQ, WS-HEQ I ( 1 ) with α=0.6, and WS-HEQ II ( 1 ) with α=0.6. From Figure 9, two observations are made: first, WS-HEQ I ( 1 ) with α=0.6 results in smaller distortions than S-HEQ irrespective of the cepstral channel, implying that to lower the HPF portion of the cepstra can further reduce the effect of noise; second, by setting the parameter α to 0.6, the distortions provided by WS-HEQ II ( 1 ) are slightly smaller than those by WS-HEQ I ( 1 ) for most of the cepstral channels, which agrees with the finding that WS-HEQ II ( 1 ) slightly outperforms WS-HEQ I ( 1 ) in recognition accuracy.

Figure 9
figure 9

Feature distortion averaged over the 1,001 utterances of test set A. It is achieved by S-HEQ, WS-HEQ I(1)(α=0.6), and WS-HEQ II ( 1 ) (α=0.6). The DFT size used in Equation 16 is set to 512.

5.4 The effect of lowering HPF in different schemes

In order to further examine the effect of attenuating HPF in recognition accuracy, here we additionally design three schemes to process the cepstra in training and testing sets:

Scheme 1. The original (full-band) cepstra is split into LPF and HPF, and then the HPF portion is scaled by a factor α. Finally, the original LPF and the attenuated HPF are combined to constitute the new cepstra. This scheme is to remove all three HEQ processes in WS-HEQ I ( 1 ) or WS-HEQ II ( 1 ) shown in Figure 6a,b, and its flowchart is depicted in Figure 10a.

Scheme 2. The original (full-band) cepstra is preprocessed by HEQ, and then the HEQ-preprocessed cepstra is split into LPF and HPF. We scale LPF by a factor α and finally combine LPF and attenuated HPF to obtain the new cepstra. This scheme is to remove the two HEQ processes for LPF and HPF in WS-HEQ I ( 1 ) shown in Figure 6a, and its flowchart is depicted in Figure 10b.

Scheme 3. The original cepstra is split into LPF and HPF, and then the HPF portion is tuned with the scaling factor α. Next, we combine LPF and attenuated HPF to obtain the full-band cepstra, which are further post-processed by HEQ to obtain the new cepstra. This scheme is to remove the two HEQ processes for LPF and HPF in WS-HEQ II ( 1 ) shown in Figure 6b, and its flowchart is depicted in Figure 10c.

Figure 10
figure 10

Flowcharts of three schemes defined in defined in Section 5.4. (a) Scheme 1. (b) Scheme 2. (c) Scheme 3.

The scaling factor α in the above three schemes is varied from 0.6 to 1.4 with an interval of 0.2. Please note that the case with α>1 corresponds to amplifying HPF and thus reducing the proportion of LPF in the overall cepstra. The recognition results for the three schemes are shown in Table 11. From this table, we have the following observations:

  1. 1.

    At the clean noise-free case in all three schemes, the recognition accuracy remains as high as around 99% nearly irrespective of the varied scaling factor α, which implies that neither lowering nor raising the HPF portion of the cepstra can significantly influence the recognition performance. The possible explanation for this result is that the back-end acoustic modeling with HMMs compensates well for the variation of the front-end speech features.

  2. 2.

    From the results for scheme 1, reducing HPF (using α<1) without pre- or post-processing with HEQ produces degraded performance under noise-corrupted situations compared with the case using α=1, which disagrees with the results for various forms of WS-HEQ as shown in the preceding sub-sections. Under the same situations, setting α>1 to amplify HPF (and thus to reduce the proportion of LPF) cannot improve the accuracy, either. Therefore, the relative importance of LPF and HPF in noise-corrupted cepstra discussed in Section 3 cannot be reflected in recognition accuracy when there is no noise-robust processing such as HEQ. In other words, merely emphasizing LPF or HPF fails to result in more noise-robust cepstra and produces worse recognition accuracy.

  3. 3.

    Different from the results for scheme 1, the results associated with schemes 2 and 3 show that when the cepstra are pre- or post-processed by HEQ, reducing the HPF part by setting α<1 can promote the recognition accuracy under noise-corrupted situations (except for the case of −5-dB SNR). On the other hand, the cases corresponding to α>1 in which HPF is raised produce worse results. The underlying reason is probably that the noise effect of HPF is relatively difficult to alleviate, and simply lowering HPF can benefit HEQ to give better performance. Similar situations can be also found in Tables 2 and 3 by comparing the results of WS-HEQ I ( 2 ) , WS-HEQ I ( 3 ) , WS-HEQ II ( 2 ) and WS-HEQ II ( 3 ) with α=1. WS-HEQ I ( 2 ) and WS-HEQ II ( 2 ) outperform WS-HEQ I ( 3 ) and WS-HEQ II ( 3 ) , respectively, indicating a stronger normalization strategy like HEQ is required to compensate the distortion in HPF, while a relatively simple MVN process suffices to improve LPF well. Furthermore, comparing Table 11 with Tables 2 and 3 we find that the effect of lowering HPF in recognition accuracy appears a lot more significant when we further compensate the sub-band cepstra (viz. HPF and LPF) by HEQ/MVN, again in agreement with the statements about S-HEQ [11] that additionally normalizing HPF and LPF can reduce the environmental mismatch caused by noise.

Table 11 The recognition accuracy results (%) of the three schemes defined in Section 5.4

6 The experiment on the TCC-300 Mandarin dataset

Besides the evaluation on the Aurora-2 dataset as described in the previous two sections, here the recognition experiments with the presented WS-HEQ are further carried out in another dataset, the eleventh group of the TCC-300 microphone speech database from the Association for Computational Linguistics and Chinese Language Processing in Taiwan [15]. This dataset includes 7,009 Mandarin character strings uttered by 50 male and 50 female adult speakers. The corresponding read speaking-style speech signals were recorded with a microphone at the sampling rate of 16 kHz. The Mandarin characters included in the utterances of this dataset correspond to 408 different Mandarin syllables. In the experiment, the syllable recognition is performed on this dataset without any language model or grammar constraint at the back end so that the recognition performance can be more related to the used front-end acoustic features. As a result, in comparison with the 11-digit recognition on the Aurora-2 telephone-band dataset in the previous sections, here we conduct a more complicated task of medium-vocabulary recognition (408 syllables) on the broad-band speech data. Among the 7,009 Mandarin utterances in the TCC-300 subset, 6,509 strings are selected in acoustic model training, while the other 500 are in testing. The utterances in the training set are kept noise-free, while the utterances in the testing set are artificially added with noise at four SNR levels (20, 15, 5, and 0 dB) to produce noise-corrupted speech data. The noise types include white (broad-band) and pink (narrow-band), both taken from the NOISEX 92 database [17]. These utterances for training and testing are first converted into 13-dimensional MFCCs (c 0, c 1−c 12), and then processed by various kinds of noise-robustness algorithms. Similar to the feature parameter settings for the Aurora-2 database, the resulting 13 new features plus their first- and second-order derivatives constitute the finally used 39-dimensional feature vector.

As for the acoustic modeling, we train the HMMs of INITIAL and FINAL units, which corresponds to the semi-syllables in Mandarin Chinese. In most cases, a Mandarin Chinese syllable can be split into INITIAL/FINAL parts analogous to the consonant/vowel pair in English. There are totally 112 right-context-dependent INITIAL HMMs and 38 context-independent FINAL HMMs to be trained. Each INITIAL HMM consists of five states and eight Gaussian mixtures per state, while each FINAL HMM contains ten states and eight Gaussian mixtures per state. The HMM for each of the 408 Mandarin syllables is then constructed by concatenating the associated INITIAL and FINAL HMMs.

Tables 12 and 13 list the syllable recognition accuracy rates of the MFCC baseline and the various robustness methods including CHN, S-HEQ (equivalent to WS-HEQ I ( 1 ) with α=1.0), and seven forms of the presented WS-HEQ for the white and pink noise environments, respectively. The scaling parameter α in WS-HEQ is set to 0.6, which is not optimized but just to clarify whether lowering HPF can give rise to performance improvement. From these two tables, we have the following findings:

  1. 1.

    Due to the simple free-syllable decoding framework in the recognition procedure, the recognition accuracy of MFCC baseline features at the clean noise-free condition is just around 75%. Besides, the noise robustness methods used here result in similar or even better performance compared with the MFCC baseline when the testing utterances contain no noise.

  2. 2.

    Both types of noise degrade the performance of MFCC seriously as the SNR gets worse, while CHN and all of the other HEQ-related algorithms benefit the recognition accuracy significantly. In particular, the various forms of WS-HEQ with α=1 outperforms CHN, indicating that additionally processing LPF and HPF with HEQ or MVN can further enhance CHN to produce better results.

  3. 3.

    Reducing the scaling factor α from 1.0 to 0.6 in the eight forms of WS-HEQ consistently brings about better results by significant margins in all noise-corrupted situations. This result reconfirms the capability of the presented HPF lowering operation in boosting noise robustness of CHN-processed features. Furthermore, when α is set to 0.6, the performance difference among various forms of WS-HEQ becomes relatively small in comparison with that under the condition of α=1.0.

Table 12 Recognition accuracy results (%) of WS-HEQ for different SNR conditions at white noise environment
Table 13 Recognition accuracy results (%) of WS-HEQ for different SNR condition at the pink noise environment

7 Conclusions

In this paper, we explored the relative importance of different frequency components of the intra-frame speech features and subsequently presented a novel algorithm, WS-HEQ, to improve noisy speech recognition. WS-HEQ mainly reduces the intra-frame high-pass filtered component of the speech features, which appears more vulnerable to noise. Compared with the well-known S-HEQ method, WS-HEQ can achieve superior recognition accuracy, higher computational efficiency, or both. In future work, we will pursue new filter structures for obtaining the LPF and HPF components for WS-HEQ to achieve better results. Additionally, we will investigate how to tune the intra-frame speech features more flexibly in the corresponding DFT or DCT domains for further noise reduction.

References

  1. Maganti HK, Matassoni M: A perceptual masking approach for noise robust speech recognition. EURASIP J. Audio Speech Music Process 2012., 2012(29):

    Google Scholar 

  2. Wu K, Chen C, Yeh B: Noise-robust speech feature processing with empirical mode decomposition. EURASIP J. Audio Speech Music Process 2011., 2011(9):

    Google Scholar 

  3. Cohen I, Berdugo B: Speech enhancement for non-stationary noise environments. Signal Process 2001, 81(11):2403-2418. 10.1016/S0165-1684(01)00128-1

    Article  Google Scholar 

  4. Kotnik B, Kačič Z: A noise robust feature extraction algorithm using joint wavelet packet subband decomposition and AR modeling of speech signals. Signal Process 2007, 87(6):1202-1223. 10.1016/j.sigpro.2006.10.009

    Article  Google Scholar 

  5. Tibrewala S, Hermansky H: Multi-band and adaptation approaches to robust speech recognition. In 5th Eurospeech Conference on Speech Communications and Technology. Rhodes: Eurospeech; 22–25 Sept 1997.

    Google Scholar 

  6. Hilger F, Ney H: Quantile based histogram equalization for noise robust large vocabulary speech recognition. IEEE Trans. Audio Lang. Process 2006, 14: 845-854.

    Article  Google Scholar 

  7. Benesty J, Sondhi MM, Huang Y (Eds): In Springer Handbook of Speech Processing. 2008.

    Google Scholar 

  8. Chen C, Bilmes J: MVA processing of speech features. IEEE Trans. Audio Speech Lang. Process 2007, 15: 257-270.

    Article  Google Scholar 

  9. Hsu C-W, Lee L-S: Higher order cepstral moment normalization for improved robust speech recognition. IEEE Trans. Audio Speech Lang. Process 2009, 17: 205-220.

    Article  Google Scholar 

  10. Xiao X, Chng ES, Li H: Normalization of the speech modulation spectra for robust speech recognition. IEEE Trans. Audio Speech Lang. Process 2008, 16: 1662-1674.

    Article  Google Scholar 

  11. Joshi V, Bilgi R, Umesh S, García L, Benítez MC: Sub-band level histogram equalization for robust speech recognition. In 12th International Conference on Spoken Language Processing. Florence: Interspeech; 27–31 Sept 2011.

    Google Scholar 

  12. Hirsch HG, Pearce D: The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy conditions. In Proceedings of the 2000 Automatic Speech Recognition: Challenges for the new Millenium. Paris: ISCA ITRW ASR; 18–20 Sept 2000.

    Google Scholar 

  13. Kanedera N, Arai T, Hermansky H, Pavel M: On the importance of various modulation frequencies for speech recognition. In 5th European Conference on Speech Communication and Technology. Rhodes: Eurospeech; 22–25 Sept 1997.

    Google Scholar 

  14. Huang X, Acero A, Hon H-W: Spoken Language Processing: A Guide to Theory, Algorithm and System Development. New Jersey: Prentice Hall; 2001.

    Google Scholar 

  15. ACLCLP 1990.http://www.aclclp.org.tw/corp.php, Accessed 10 Aug 2013

  16. ELDA 1995.http://www.elda.org/article52.html, Accessed 8 Aug 2013

  17. Varga AP, Steeneken HJM, Tomlinson M, Jones D: The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical report, DRA Speech Research Unit (1992)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jeih-weih Hung.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ original submitted files for images

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Hung, Jw., Fan, Ht. Intra-frame cepstral sub-band weighting and histogram equalization for noise-robust speech recognition. J AUDIO SPEECH MUSIC PROC. 2013, 29 (2013). https://doi.org/10.1186/1687-4722-2013-29

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1687-4722-2013-29

Keywords