 Research Article
 Open access
Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-In Speaker Normalization
EURASIP Journal on Audio, Speech, and Music Processing volume 2008, Article number: 148967 (2008)
Abstract
A proven method for achieving effective automatic speech recognition (ASR) in the presence of speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed that require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN requires two separate warping phases, whereas the proposed BISN method uses a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.
1. Introduction
Current speaker-independent automatic speech recognition (ASR) systems perform well in most real-world applications, but the performance gap between speaker-dependent and speaker-independent settings is still significant. Although a reasonable amount of progress has occurred in recent years in general ASR technology by exploiting more complex algorithms with the help of faster computing [1], little progress has been reported in the development of core speech processing algorithms. Many speech researchers would agree that there is still significant potential in formulating an acoustic representation of the speech signal that will successfully maintain the information needed for efficient speech recognition, especially in noise, while eliminating irrelevant speaker-dependent information [1]. The perceptual MVDR (PMVDR) coefficients have proven to be more effective than the MFCC front end on a number of tasks, especially in noisy environments [2, 3]. This paper introduces a new and computationally efficient speaker normalization algorithm within the PMVDR [2, 3] framework which we call built-in speaker normalization (BISN). BISN is computationally efficient and can be completely integrated into the front end.
There are different ways to address speaker variability for automatic speech recognition. One approach is to normalize speaker variabilities in the feature space prior to employing an HMM acoustic recognizer framework. A number of effective algorithms have been developed to compensate for such variabilities due to speaker stress and emotion (see [4] for an overview). Probably the most successful approach is adaptive cepstral compensation (ACC) [5], which was shown to significantly reduce the impact of speaker variability for ASR. This approach uses a low-level voiced/transitional/unvoiced segmentation scheme followed by a source generator framework to compensate the MFCC cepstral feature sequence prior to ASR. More recent approaches have focused on reducing the impact of vocal-tract length differences in the spectral domain [6, 7].
Basic likelihood-based warp estimation was first introduced by Andreou et al. [8]. However, it was computationally cumbersome and required a substantial amount of speech from each speaker in order to estimate the best warp factor. Their basic motivation was to extract acoustic features that have reduced speaker dependency. In order to achieve this, they linearly warped the frequency axis. The degree of this linear warping is in fact a speaker-dependent factor and must be estimated for each speaker. For the estimation of the warp factor, they proposed a set of maximum likelihood-based procedures. Unfortunately, these procedures were computationally very expensive.
Lee and Rose [6, 7] proposed a set of speaker normalization procedures using maximum likelihood estimates of the best warp for each speaker. There was no attempt to recover the underlying vocal-tract shape. Instead, their motivation was to use an optimization criterion directly related to the one used in the recognizer. They revised the set of maximum likelihood estimation procedures proposed by Andreou et al. [8] to estimate the warp factors for each speaker. These procedures are now widely known as vocal-tract length normalization (VTLN). The most popular way of estimating VTLN warps is to use likelihood-based estimation techniques [6, 7], in which a set of HMM models, trained on a large population of speakers with one Gaussian per state, is scored against warped features. Incoming features are extracted using different VTLN warps, and the warp producing the maximum likelihood (given the HMMs and the transcription) is used as the best VTLN warp for that speaker. VTLN has been shown to be effective for a number of tasks, but the computational load of determining the best warp for each speaker, especially at the time of recognition, is not tractable. Lee and Rose also proposed computationally more efficient variants of VTLN based on GMM modeling of each VTLN warp [6, 7]. However, these variants are less accurate because the use of GMMs discards the temporal information buried in the speech signal. As a result, although it is a good method for offline simulations, classical VTLN is rarely used in practical systems where computational efficiency is of primary concern. Therefore, there is a need for achieving on-the-fly speaker normalization by introducing computationally more efficient algorithms.
Eide and Gish [9] proposed a waveform-based algorithm in which they estimate the warping factors by using the average position of the third formant. Their idea is that the third formant is not affected by context as much as the first and second formants and is therefore more closely related to the speaker's vocal-tract length. By using the ratio of the average third-formant location for a particular speaker to the average third-formant location for a large population of speakers, they were able to determine reasonable normalization factors, which helped reduce inter-speaker variations. Although this approach has the advantage of estimating the speaker-normalization warps directly from the speech signal, the difficulty of estimating the third formant reliably, even for clean speech, is apparent, as some speakers may not even have clear third-formant locations.
Acero [10] proposed a speaker-dependent bilinear transform (BLT) to account for inter-speaker variations. In that study, an LPC-based front end is used with the FFT spectrum warped before the computation of the cepstral coefficients. A vector quantization distortion measure is computed to estimate the best BLT warp for each speaker. Substantial performance improvements were obtained with the LPC-based cepstral coefficients (LPCCs). The proposed BISN algorithm has some similarities with Acero's approach [10]. In both methods, a first-order all-pass system (or a BLT) is used to incorporate the perceptual scale into the feature extraction process. A fixed BLT warp factor is used to approximate the Mel and Bark scales as needed. In order to reduce the speaker differences, a best BLT warp factor is specifically estimated for each speaker, which, in some sense, integrates the perceptual BLT warp and the speaker normalization BLT warp into a single speaker-dependent BLT warp factor. The procedure employed to estimate the best BLT warp factor for each speaker, on the other hand, differs substantially. As mentioned above, Acero used a vector quantization distortion measure in order to estimate the best BLT warp factor for each speaker. Our approach in BISN is fundamentally different in the sense that each best BLT warp factor is estimated within the VTLN framework proposed by Lee and Rose [6, 7]. Moreover, several other algorithms are also integrated within the search process in order to reduce the computational load to manageable levels for real-time implementations.
The feasibility of bilinear and all-pass transforms (BLT, APT) has also been extensively studied by McDonough [11, 12]. In that study, the BLT is implemented in the cepstral domain. The best BLT parameters were estimated with a Gaussian mixture model (GMM) as those maximizing the likelihood of the incoming data [11, 12]. The BISN approach is related to this method, but only in the use of a BLT for speaker normalization. McDonough did not attempt to integrate the perceptual warp and the speaker normalization BLT warp into a single warp (which BISN does). Rather, he used cepstrum transformation matrices (which are derived from the BLT) on the final MFCC vectors to achieve the speaker normalization. This means that the perceptual and speaker normalization warps are still performed in two separate steps: the perceptual warp is achieved through the use of a nonlinearly distributed Mel-filterbank, whereas speaker normalization is achieved through the use of an appropriate matrix transformation after the Mel cepstra have been computed.
In this paper, we integrate BLT-based speaker normalization within the perceptual MVDR (PMVDR) coefficients framework [2, 3]. First, we demonstrate that the perceptual warp is actually meant to remove some of the existing speaker differences. By estimating a specific perceptual warp factor for each speaker, it is possible to further remove these speaker-dependent differences. Then, the warp estimation process is computationally improved by integrating a binary tree search (BTS) [13] approach, which reduces the computation by 67% with respect to classical VTLN. Next, performing the best warp search in the model space rather than in the feature space [14] further reduces the computational resources needed for real-time applicability and performance. Finally, a configuration for on-the-fly implementation of this built-in speaker normalization (BISN) algorithm is proposed for an in-car speech recognition task, which reduces the word error rate (WER) by 24% relative to the baseline PMVDR-based system.
In Section 2, we summarize the theoretical background for the PMVDR front end, which is the basis for the BISN algorithm. In Section 3, we consider the underlying meaning of so-called perceptual warping. We show via a modified LDA-based analysis [15, 16] that perceptual warping successfully removes a substantial amount of inter-speaker variability. This observation leads to the idea of using a specific self-normalization warp factor for each speaker. The offline approach for vocal-tract length normalization (VTLN) is summarized in Section 4, together with its disadvantages in terms of computational efficiency. Section 5 formulates the built-in speaker normalization (BISN) algorithm in detail. Improvements to the search are introduced in Sections 5.1 and 5.2. We summarize our evaluation results in Section 6 for two different tasks, the CU-Move extended digit recognition task and the speech in noisy environment (SPINE-2) task. Section 7 explains how one can easily integrate the BISN algorithm within the PMVDR framework for a real-world application. After summarizing computational considerations for the different algorithms proposed in this paper in Section 8, we make concluding remarks in Section 9.
2. The PMVDR Acoustic Front End
PMVDR is a new acoustic front end which does not use a nonlinearly spaced filterbank to incorporate perceptual considerations. Instead of using a filterbank, the FFT spectrum is directly warped before the envelope extraction stage [2, 3]. The envelope is extracted via a low-order all-pole MVDR spectrum, which is shown to be superior to linear prediction (LP) based envelopes [17]. Warping the FFT power spectrum directly, without filterbank processing, avoids the smoothing effect of a filterbank and preserves almost all of the information that exists in the short-term speech spectrum. Also, using the MVDR method to extract the envelope contributes greatly to superior performance in noisy conditions [2, 3]. We now briefly summarize the MVDR spectrum estimation used to extract the spectral envelope and the warping-via-interpolation algorithm used to directly warp the FFT spectrum. For the details of the PMVDR computation, we refer readers to [2, 3].
2.1. Minimum Variance Distortionless Response (MVDR) Spectrum Estimation
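All-pole modeling is commonly used in speech spectrum analysis for speech processing applications. MVDR can be seen as an alternative all-pole modeling technique to the popular linear prediction (LP) [17]. The MVDR spectrum for all frequencies can be expressed in a parametric form. Let the $M$th-order MVDR spectrum be written as
$$P_{MV}^{(M)}(\omega) = \frac{1}{\sum_{k=-M}^{M} \mu(k)\, e^{-j\omega k}}. \tag{1}$$
The parameters, $\mu(k)$, hence the MVDR spectrum, can be easily obtained by a modest noniterative computation proposed by Musicus [18]. The parameters, $\mu(k)$, are computed from the LP coefficients $a_i$ and the prediction error variance $P_e$ as
$$\mu(k) = \begin{cases} \dfrac{1}{P_e} \sum_{i=0}^{M-k} (M+1-k-2i)\, a_i\, a_{i+k}^{*}, & k = 0, \ldots, M, \\ \mu^{*}(-k), & k = -M, \ldots, -1. \end{cases} \tag{2}$$
Therefore, the coefficients, $\mu(k)$, are sufficient to completely specify the MVDR spectrum $P_{MV}^{(M)}(\omega)$.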
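As a minimal sketch of this computation, assuming real-valued LP coefficients with $a_0 = 1$ (prediction-error filter convention) and the Musicus closed form for $\mu(k)$, the MVDR spectrum can be evaluated as follows. The function names `mvdr_parameters` and `mvdr_spectrum` are illustrative and do not come from the paper.

```python
import math


def mvdr_parameters(a, pe):
    """Return mu(0..M) from real LP coefficients a (a[0] == 1) and error variance pe.

    Assumes the closed form mu(k) = (1/pe) * sum_i (M+1-k-2i) * a[i] * a[i+k].
    """
    order = len(a) - 1
    mu = []
    for k in range(order + 1):
        acc = sum((order + 1 - k - 2 * i) * a[i] * a[i + k]
                  for i in range(order - k + 1))
        mu.append(acc / pe)
    return mu


def mvdr_spectrum(a, pe, omega):
    """MVDR power at angular frequency omega (radians per sample)."""
    mu = mvdr_parameters(a, pe)
    # For real coefficients mu(-k) == mu(k), so the two-sided sum reduces to cosines.
    denom = mu[0] + 2.0 * sum(mu[k] * math.cos(omega * k)
                              for k in range(1, len(mu)))
    return 1.0 / denom
```

For a first-order model with autocorrelation lags $r(0) = 1$ and $r(1) = 0.5$ (so `a = [1.0, -0.5]` and `pe = 0.75`), this agrees with the direct definition $1 / (v^H R^{-1} v)$ at $\omega = 0$ and $\omega = \pi$.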
2.2. Direct Warping of FFT Spectrum
It has been shown that implementing the perceptual scales through the use of a first-order all-pass system is feasible [19, 20]. In fact, both Mel and Bark scales are determined by changing the single parameter, $\alpha$, of the system [20]. The transfer function, $H(z)$, and the phase response, $\hat{\omega}$, of the system are given as
$$H(z) = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}, \quad |\alpha| < 1, \tag{3}$$
$$\hat{\omega} = \tan^{-1} \frac{(1 - \alpha^{2}) \sin \omega}{(1 + \alpha^{2}) \cos \omega - 2\alpha}, \tag{4}$$
where $\omega$ represents the linear frequency, while $\hat{\omega}$ represents the warped frequency. Here, the value of $\alpha$ controls the degree of warping. We are more interested in the nonlinear phase response through which we implement the perceptual warping. For 16 kHz sampled signals, $\alpha$ is set to a fixed value that approximates either the Mel or the Bark scale; for 8 kHz signals, these values are adjusted accordingly [20]. The Bark scale performs more warping in the lower frequencies when compared to the Mel scale.
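The frequency mapping of the first-order all-pass system can be sketched in a few lines. The sketch below assumes the commonly used closed form of the phase response, $\hat{\omega} = \tan^{-1}[(1-\alpha^2)\sin\omega / ((1+\alpha^2)\cos\omega - 2\alpha)]$, and uses `atan2` so that the result stays in $[0, \pi]$; the names `warp_frequency` and `unwarp_frequency` are hypothetical.

```python
import math


def warp_frequency(omega, alpha):
    """Map linear frequency omega in [0, pi] to the warped axis for |alpha| < 1."""
    return math.atan2((1.0 - alpha ** 2) * math.sin(omega),
                      (1.0 + alpha ** 2) * math.cos(omega) - 2.0 * alpha)


def unwarp_frequency(omega_hat, alpha):
    """Inverse mapping: the same formula with alpha replaced by -alpha."""
    return warp_frequency(omega_hat, -alpha)
```

With a positive `alpha`, low frequencies are stretched (for example, `warp_frequency(0.3, 0.5)` is larger than 0.3), and a larger `alpha` stretches them more, which is the sense in which one scale "performs more warping" than another.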
2.3. Implementation of Direct Warping
Warping via interpolation is a simple and fast method to implement direct warping. We would like to obtain the value of the power spectrum in the warped frequency space, $\hat{\omega}$, by using its corresponding value in the linear-frequency space, $\omega$. The inverse relation that takes us from the warped to the linear frequency space can be easily obtained from (4) by replacing $\alpha$ with $-\alpha$:
$$\omega = \tan^{-1} \frac{(1 - \alpha^{2}) \sin \hat{\omega}}{(1 + \alpha^{2}) \cos \hat{\omega} + 2\alpha}. \tag{5}$$
A stepbystep algorithm that describes how warping can be efficiently implemented via interpolation can be given as follows.

(1)
Take the FFT of the input speech frame of length to obtain the FFT power spectrum. should be selected as the nearest possible powerof2, thus providing spectral points (i.e., , ) in linear power spectrum space.

(2)
Calculate linearly spaced spectral points over the warped frequency space by dividing the entire warped frequency range into equispaced points:
(6)

(3)
Compute the linear frequencies and FFT indexes that correspond to these warped frequencies using
(7)

(4)
For the final step, perform an interpolation of the nearest linear spectral values to obtain the warped spectral value:
(8)
where is the lower nearest linear FFT bin, is the nearest upper linear FFT bin, and is the value of the warped power spectrum that corresponds to FFT bin . Thus, the spectral value , at the warped frequency index , is computed as the linear interpolation of nearest upper, , and lower, , spectral values in the linear frequency space.
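The four steps above can be sketched as follows. This is only an illustration under stated assumptions: it works on a half power spectrum whose bins cover $[0, \pi]$ (rather than the full FFT length), it assumes the inverse all-pass mapping obtained by negating $\alpha$, and the function names are hypothetical.

```python
import math


def unwarp_frequency(omega_hat, alpha):
    """Linear frequency that the all-pass system maps onto omega_hat (both in [0, pi])."""
    return math.atan2((1.0 - alpha ** 2) * math.sin(omega_hat),
                      (1.0 + alpha ** 2) * math.cos(omega_hat) + 2.0 * alpha)


def warp_power_spectrum(spectrum, alpha):
    """Resample a half power spectrum (at least 2 bins over 0..pi) onto a warped axis."""
    n = len(spectrum)
    warped = []
    for i in range(n):
        omega_hat = math.pi * i / (n - 1)            # equispaced point on the warped axis
        k = unwarp_frequency(omega_hat, alpha) * (n - 1) / math.pi  # fractional linear bin
        lower = min(int(math.floor(k)), n - 2)       # nearest lower linear bin
        upper = lower + 1                            # nearest upper linear bin
        # Linear interpolation between the two neighbouring linear-frequency bins.
        warped.append((upper - k) * spectrum[lower] + (k - lower) * spectrum[upper])
    return warped
```

With `alpha = 0` the routine returns the input unchanged (up to rounding). With a positive `alpha`, the middle of the warped axis reads from a lower linear frequency, so more output points are spent on the low-frequency region.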
2.4. Implementation of PMVDR
In utilizing a filterbank for incorporating perceptual scales, the filterbank has two tasks: (i) warping the spectrum nonlinearly and (ii) smoothing out excitation details. In using direct warping, on the other hand, no averaging of the FFT power spectrum is used to achieve smoothing; only warping of the spectrum is performed. The smoothing is achieved through a low-order MVDR analysis that follows the warping step. Therefore, in the direct warping of the spectrum, little information is lost.
The remainder of the PMVDR algorithm can be summarized in the following steps.

(1) Obtain the perceptually warped FFT power spectrum via interpolation.

(2) Compute the "perceptual autocorrelation lags" by taking the IFFT of the "perceptually warped" power spectrum.

(3) Perform an $M$th-order LP analysis via Levinson-Durbin recursion using the perceptual autocorrelation lags [21, 22].

(4) Calculate the $M$th-order MVDR spectrum using (2) from the LP coefficients [17].

(5) Obtain the final cepstrum coefficients using the straightforward FFT-based approach [23]. In this implementation, after obtaining the MVDR coefficients from the perceptually warped spectrum, we take the FFT of the parametrically expressible MVDR spectrum. After applying the log operation, we apply the IFFT to return to the cepstral domain.

(6) Take the first cepstral coefficients (generally 12, excluding the 0th-order cepstrum) as the output of the PMVDR front end. This is the cepstral truncation step.
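Step (3) relies on the Levinson-Durbin recursion. The following is a minimal textbook sketch for real autocorrelation lags, not the authors' implementation; the function name and the convention $a_0 = 1$ (prediction-error filter) are assumptions made for illustration.

```python
def levinson_durbin(r, order):
    """Solve the LP normal equations from autocorrelation lags r[0..order].

    Returns (a, err), where a[0] == 1 holds the prediction-error filter
    coefficients and err is the final prediction error variance.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        # Correlation of the current predictor with the next lag.
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err                      # reflection coefficient
        updated = a[:]
        for i in range(1, m):
            updated[i] = a[i] + k * a[m - i]
        updated[m] = k
        a = updated
        err *= (1.0 - k * k)
    return a, err
```

The returned coefficients and error variance are exactly the inputs needed to evaluate the MVDR parameters in step (4).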
A flow diagram for the PMVDR algorithm is given in Figure 1 [3]. For further details on the PMVDR front end and its evaluation on different databases, we refer the reader to [2, 3, 24].
3. The "Meaning" of Perceptual Warping
Virtually all acoustic front ends proposed for ASR use some form of nonlinear warping of the spectrum at some level. The MFCC front end, for example, uses a Mel-scaled filterbank in order to incorporate perceptual considerations. The argument for applying a nonlinear warping, or so-called perceptual warping, to the speech spectrum is strongly tied to the fact that the human auditory system performs similar processing. This is generally justified because experimental results have shown that lower frequencies of the speech spectrum carry more crucial information for ASR than higher frequencies; therefore, these frequencies are generally emphasized by a nonlinear warping function. In this section, we consider the real "meaning" of perceptual warping from the standpoint of the inter-speaker variability analysis proposed in [15]. In all of our experiments, introducing a perceptual warp always yields better recognition accuracy (on the order of 20%, relative). We believe that perceptual warping has another important "task" other than emphasizing lower frequencies. In fact, the perceptual warp was actually meant to remove some of the existing inter-speaker variability in the feature set. To justify this claim, we conducted an analysis within the framework explained in [2, 15, 25]. We extracted the PMVDR features for the CU-Move in-vehicle speech [26] training set (see Section 6) (1) with no perceptual warping, (2) using the Bark scale, and (3) using the BISN warp factors (see Section 5). Afterwards, we computed the variation of the trace measure (TM). The larger the TM is, the more effectively the speaker variability is removed [2, 15, 25]. Figure 2 shows the variation of the trace measure (with respect to the minimum of the number of speech classes and the feature dimension [15]) for the three cases. The figure verifies that using the perceptual warp indeed leads to the removal of a significant amount of inter-speaker variability.
However, using the BISN warps specifically estimated for each speaker removes even more of the inter-speaker variability, signifying the applicability of BISN in the context of speaker normalization.
4. Offline VTLN
The most popular method for speaker normalization is vocal-tract length normalization (VTLN), in which the speech spectrum is linearly warped with an optimal warp factor ($\beta$) [6, 7, 27]. The warping can also be performed by rearranging the position of the Mel filters [6, 7]. However, in the PMVDR front end, we no longer use a filterbank structure, and therefore warping is directly performed on the FFT power spectrum. In the offline VTLN application, a two-step warp needs to be performed. The first warp is called the perceptual warp and is applied during the extraction of acoustic features. The VTLN warp also needs to be performed in cascade with the perceptual warp within the acoustic front end. The speaker-dependent parameter, $\beta$, is generally determined by conducting likelihood computations for different values within the range [0.84, 1.16] (for our purpose we extend the range slightly to facilitate the binary search algorithm described in Section 5.1). Generally, a single-Gaussian HMM set which is trained on all available training data is used to estimate the warp factor.
4.1. Warping Factor Estimation
Assume that we have a set of utterances from speaker $i$ and would like to estimate the warp factor for this speaker. Here, we define the following terms as in [7]:
(i) $X_i^{\beta}$ denotes the set of feature vectors for all of the available utterances from speaker $i$, warped by warp factor $\beta$,
(ii) $W_i$ denotes the set of transcriptions of all utterances,
(iii) $\hat{\beta}_i$ denotes the best warp factor for speaker $i$,
(iv) $\lambda$ denotes a given HMM trained from a large population of speakers.
The best warp factor $\hat{\beta}_i$ for speaker $i$ is estimated by maximizing the likelihood of the warped features with respect to the HMM model $\lambda$ and transcriptions $W_i$:
$$\hat{\beta}_i = \arg\max_{\beta} \Pr\left(X_i^{\beta} \mid \lambda, W_i\right). \tag{9}$$
Obtaining a closed-form solution for the best warp factor is difficult since the frequency warping corresponds to a highly nonlinear transformation of the speech features. Therefore, the best warp is estimated by searching over a grid of 33 points spaced evenly in the range of [0.84, 1.16]. The goal of training is to obtain a canonical (normalized) set of HMMs, $\lambda_N$, in the sense that each speaker's utterance is warped with an appropriate warping factor and the resulting HMM is defined over a frequency-normalized feature set. Initially, the HMM set is trained from unwarped utterances, and this model is used to estimate the best warp factor for each speaker. Afterwards, every speaker's utterances are parameterized with the estimated best warp factor, and the HMM model set is then re-estimated from this warped feature set. In theory, this new canonical model can be used to re-estimate the optimal warp factors, another HMM can be trained, and the procedure can be iterated several times. However, during our experimentation with offline VTLN, we observed that further iterations did not yield significant improvements over the first iteration; therefore, we estimate the optimal warps only once and train the canonical HMMs from the feature set parameterized with these optimal warps.
During recognition, our goal is to warp the frequency scale of each test utterance to best match the canonical HMMs, $\lambda_N$. Unlike training, in the test phase, only one utterance is used to estimate the warp factor and the transcription is not available. A general approach is to use a two-pass strategy. First, the $j$th unwarped utterance of the $i$th speaker, $X_{i,j}$, and the normalized model $\lambda_N$ are used to obtain a preliminary transcription of the utterance, $W_{i,j}$. Afterwards, the optimal warp factor, $\hat{\beta}$, is estimated via the general search procedure:
$$\hat{\beta} = \arg\max_{\beta} \Pr\left(X_{i,j}^{\beta} \mid \lambda_N, W_{i,j}\right). \tag{10}$$
Finally, we warp the utterance with the estimated warp factor and redecode using the normalized HMM model, $\lambda_N$. The output of the recognizer is our final recognition result. For the offline VTLN experiments reported in this paper, however, we used all the available data from each test speaker to estimate the best warps in an offline setting (i.e., warp factors are not estimated for every single utterance).
Typically, we parameterize speech within the range of [0.84, 1.16] with a step size of 0.01, yielding a 33-point search space. Using the monotonic property, we compare the likelihoods at the current warp and at the previous warp. When the difference is negative, the best warp is found. On average, the estimation of the best VTLN warp for a speaker requires 18 times the computational resources for one feature extraction and one likelihood computation. During the test, we must perform recognition twice in order to obtain an initial transcription to estimate the optimal warp.
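The early-stopping linear scan described above can be sketched as follows. The callable `score` is a stand-in for "extract features at this warp and compute the likelihood" and is purely hypothetical; the sketch only illustrates the stopping rule and the resulting evaluation count, assuming the likelihood rises to a single peak and then falls.

```python
def linear_warp_search(score, num_points):
    """Scan grid indices 0..num_points-1 and stop once the score starts to drop.

    Returns (best_index, number_of_score_evaluations).
    """
    best_idx = 0
    best = score(0)
    evaluations = 1
    for idx in range(1, num_points):
        current = score(idx)
        evaluations += 1
        if current - best < 0:      # likelihood fell: previous warp was the best
            break
        best_idx, best = idx, current
    return best_idx, evaluations
```

On a 33-point grid with the peak at the centre (index 16), this scan makes 18 evaluations, and on a 17-point grid with the peak at index 8 it makes 10, which is in line with the average costs quoted in the text for the linear search.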
5. Built-In Speaker Normalization (BISN)
Our earlier inter-speaker variability analysis showed that so-called perceptual warping is in fact a speaker-normalization warping too. Motivated by this outcome, we can adjust the perceptual warp parameter specifically for each speaker and call this new warp the self-normalization warp. This should, in turn, normalize the vocal-tract length differences. Since this procedure does not require two applications of warping to the spectrum (one for the perceptual warp and one for the VTLN warp), as in offline VTLN, it is more efficient. Moreover, the normalization is achieved by only adjusting an internal parameter of the acoustic front end (i.e., the perceptual warp factor $\alpha$), making it a built-in procedure, hence the name built-in speaker normalization (BISN). The self-normalization warp ($\alpha$) in the BISN context refers to a nonlinear mapping (as defined by (3) and (4)), whereas in the VTLN context the speaker normalization warp ($\beta$) refers to a linear mapping of the frequency axis.
The estimation of the self-normalization warp, $\hat{\alpha}_i$, for speaker $i$ is done in a manner similar to offline VTLN. Here, $\hat{\alpha}_i$ is estimated as the warp that maximizes the total likelihood of the data given a single-Gaussian HMM set. Another advantage of BISN is the reduced search space. While in classical VTLN the search space is generally a 33-point grid, for the BISN case, a 17-point search space yields sufficient accuracy. (In our implementation, the search was over this range, but one may reduce the dimension of the search space at the expense of performance.) In a typical setting with a perceptual warp factor of $\alpha = 0.57$ (Bark scale at 16 kHz), the search space for the self-normalization warps can be chosen as [0.49, 0.65], reducing the search space by half versus that for VTLN. The search for the self-normalization warp within the BISN framework requires 10 times the computational resources for one feature extraction and one likelihood computation, which is still computationally expensive, a disadvantage that has been noticed by other researchers [13]. Taking advantage of the monotonic property of the likelihood function, one can use a binary tree search [13] rather than a linear search, which reduces the computational load substantially with no performance loss (i.e., by producing exactly the same warp factors).
5.1. Binary Tree Search (BTS) Approach
The likelihood of the data from a specific speaker is monotonically increasing (with the changing warp factor) up to a maximum, that is, until reaching the best warp, and then becomes monotonically decreasing. We present two sample likelihood variations in Figure 3 for a male and a female speaker from the WSJ database [28]. For illustration purposes, the single-Gaussian HMM models for the optimal warp search were trained with a fixed warp, and the search space was sampled with a uniform step size, resulting in a 33-point search space. In general, a step size of 0.01 provides sufficient resolution for optimal performance.
Using this monotonic property of the likelihood function, it is possible to devise a much more efficient search algorithm than the linear search approach [13]. In [13], a Brent search was used in order to efficiently obtain the best warp factor. Without loss of generality, we refer to this efficient search process as binary tree search (BTS) in this paper.
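Let the single-Gaussian HMM set be trained with $\alpha_0$ (e.g., $\alpha_0 = 0.57$) and let the search space be chosen as [$\alpha_{\min}$, $\alpha_{\max}$] (e.g., [0.49, 0.65]) with a step size $\Delta\alpha$ (e.g., $\Delta\alpha = 0.01$), resulting in an $N_{\alpha}$-point (e.g., $N_{\alpha} = 17$) one-dimensional search space, where
$$N_{\alpha} = \frac{\alpha_{\max} - \alpha_{\min}}{\Delta\alpha} + 1. \tag{11}$$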
We can summarize the steps of the binary tree search (BTS) algorithm as follows.

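(1) Compute the likelihood, $L(\alpha_M)$, for the warp $\alpha_M$ at the center of the search space; we refer to this warp as the middle warp.

(2) Compute the lower warp, $\alpha_L$, as the mean of the lower limit, $\alpha_{\min}$, and the middle warp, and similarly the higher warp, $\alpha_U$, as the mean of the upper limit, $\alpha_{\max}$, and the middle warp, as follows:
$$\alpha_L = \frac{\alpha_{\min} + \alpha_M}{2}, \qquad \alpha_U = \frac{\alpha_{\max} + \alpha_M}{2}. \tag{12}$$
These two steps divide the warp space in half, into a lower region and an upper region, whose middle warps are $\alpha_L$ and $\alpha_U$, respectively.

(3) Compute $L(\alpha_L)$ for $\alpha_L$. If $L(\alpha_L) > L(\alpha_M)$, then disregard the upper region, consider the lower region as the new search space whose middle warp is $\alpha_L$, and return to Step (2). If $L(\alpha_L) \le L(\alpha_M)$, then compute $L(\alpha_U)$ for $\alpha_U$. If $L(\alpha_U) > L(\alpha_M)$, then disregard the lower region, consider the upper region as the new search space whose middle warp is $\alpha_U$, and return to Step (2). For the last case, where $L(\alpha_M)$ is at least as large as both $L(\alpha_L)$ and $L(\alpha_U)$, take the new search space to be [$\alpha_L$, $\alpha_U$], whose middle warp is $\alpha_M$, and return to Step (2). In all the cases, the search space is reduced by half.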
By recursively repeating Steps (2) and (3), we compute the optimal warp for a speaker with an average of 6 times the computational resources for one feature extraction and one likelihood computation (with the example settings above). Thus, the BTS algorithm summarized above reduces the number of likelihood computations from 10 to 6 for the BISN algorithm while producing exactly the same self-normalization warps. For the BTS approach integrated within the BISN algorithm (considering a 17-point search space), the number of feature extractions and likelihood computations is 6; hence, when compared with classical VTLN (18 on average), it estimates the self-normalization warps with a 67% relative reduction in the computational load.
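A minimal sketch of such a halving search over a discrete warp grid is given below. It is one possible reading of the steps above rather than the authors' code: it works on grid indices, assumes the grid has $2^k + 1$ points (as the 17-point and 33-point grids do) and that the likelihood has a single peak, and finishes by comparing the last three candidates. The name `binary_tree_search` and the callable `score` (standing in for one feature extraction plus one likelihood computation) are hypothetical.

```python
def binary_tree_search(score, num_points):
    """Find the index maximising a single-peaked score on a (2**k + 1)-point grid.

    Returns (best_index, number_of_distinct_score_evaluations).
    """
    if num_points < 3 or (num_points - 1) & (num_points - 2):
        raise ValueError("num_points must be 2**k + 1")
    cache = {}

    def f(idx):
        # Each grid point is scored at most once.
        if idx not in cache:
            cache[idx] = score(idx)
        return cache[idx]

    lo, hi = 0, num_points - 1
    mid = (lo + hi) // 2
    while hi - lo > 2:
        lower = (lo + mid) // 2          # middle of the lower region
        upper = (mid + hi) // 2          # middle of the upper region
        if f(lower) > f(mid):
            hi, mid = mid, lower         # peak lies below the old middle warp
        elif f(upper) > f(mid):
            lo, mid = mid, upper         # peak lies above the old middle warp
        else:
            lo, hi = lower, upper        # peak lies between lower and upper
    best = max((lo, mid, hi), key=f)
    return best, len(cache)
```

For a 17-point grid whose peak is at index 12, the sketch returns the peak after 7 score evaluations, against 14 for an early-stopping linear scan of the same grid. The exact count depends on where the peak lies, so the average of 6 quoted in the text should not be expected from every run of this sketch.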
5.2. Model Versus Feature Space Search
In the current implementation, the search is conducted in the feature space. This means that the single-Gaussian HMM set is trained on unwarped features and tested on warped features for different warps throughout the search space. However, there are two motivating reasons to use the model space as the search space [14]. The first is the unaccounted Jacobian. The warped features are generated by transforming the frequency axis by a suitable warping function (a speaker-dependent BLT in our case); the models, on the other hand, are trained on unwarped features. The likelihood computation, therefore, needs to be corrected using the Jacobian of the frequency transformation [14, 29]. Assume that we warp the spectra of the $i$th speaker by different warping factors (i.e., $\alpha$) and compute the warped features over time as $X_i^{\alpha}$. Let $W_i$ denote the transcription of the utterance $X_i$ from speaker $i$. If $\lambda$ denotes a set of single-Gaussian HMM models trained from a large population of speakers, then the optimal warping factor for the $i$th speaker, $\hat{\alpha}_i$, is obtained by maximizing the likelihood of the warped utterances with respect to the model and the transcription [14]:
$$\hat{\alpha}_i = \arg\max_{\alpha} \Pr\left(X_i^{\alpha} \mid \lambda, W_i\right). \tag{13}$$
If $x_i$ and $x_i^{\alpha}$ are the original and transformed feature vectors, respectively, for speaker $i$, then the log-likelihood of $x_i$ is given by
$$\log p(x_i \mid \lambda) = \log \left| J(\alpha) \right| + \log p\left(x_i^{\alpha} \mid \lambda\right), \tag{14}$$
where $J(\alpha)$ is the Jacobian of the transformation taking $x_i$ to $x_i^{\alpha}$ [14]. In conventional speaker normalization, the contribution of the Jacobian is not taken into account, which may cause systematic errors in optimal warp factor estimation. When the search is conducted in the model space, the need to compensate for the Jacobian of the transformation is eliminated [14].
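The role of the Jacobian term can be illustrated with a toy example. The sketch below assumes a one-dimensional feature and a simple linear scaling $y = c\,x$ in place of the BLT warp, so it is only an analogy for the change-of-variables identity; all names are hypothetical.

```python
import math


def gaussian_logpdf(value, mean, std):
    """Log density of a one-dimensional Gaussian."""
    z = (value - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def loglik_with_jacobian(x, scale, model_mean, model_std):
    """Log-likelihood of x when the model is defined on the transformed value scale * x.

    For y = scale * x the Jacobian is simply |scale|.
    """
    return math.log(abs(scale)) + gaussian_logpdf(scale * x, model_mean, model_std)
```

With the correction, scoring `x` directly under a Gaussian of standard deviation `s` equals scoring `c * x` under the matching Gaussian of standard deviation `c * s`. Without it, a fixed model always prefers transformations that shrink the features (for `x = 2.0`, the uncorrected score at `c = 0.5` exceeds the score at `c = 1.0`), which is the kind of systematic bias in warp selection that the text refers to.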
The second motivating reason is the computational gain implied by the model-based search. In the model-based search, we train a single-Gaussian HMM set for each warp in the search space offline. We then extract the features for the no-warp case only once and compute the probability for the different warped models. This removes the heavy computational load of extracting the features over and over for each warp in the search space. Since this is integrated within the BTS approach, the model-based search only requires 1 feature extraction and 6 likelihood computations. We call this the model space-binary tree search (MS-BTS) approach, which can be summarized as follows.

(1)
Train single-Gaussian HMM models for each warping factor in the search space. An example search space would be the range [0.49, 0.65] with a step size of 0.01.

(2)
For the estimation of the optimal warp, extract the features with selfnormalization warp, (this generally can be chosen as , which is the center of our search space) and then select the model (trained with ) yielding the maximum likelihood given the warped features. The search is again performed with the BTS approach to quickly find the warped model giving the largest likelihood, .

(3)
The optimal selfnormalization warp is the inverse of with respect to and can easily be calculated using
(15)

(4)
When the input features are extracted using the center of our search space (i.e., ), the above equation becomes
(16)
After determining the self-normalization warps using the model space search approach summarized above, the rest of the normalization is similar to offline VTLN. The canonical HMMs are trained from warped features, which are extracted using the appropriate self-normalization warps. During the test, the same model-based approach is used to determine the self-normalization warp factors, and a two-pass recognition is performed.
Changing the search space from the feature space to the model space further reduces the computational load of estimating the optimal self-normalization warps. For the MSBTS-based BISN, we need to extract the features only once and then perform 6 likelihood computations on average to obtain the optimal self-normalization warp.
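To make the idea of a model-space search concrete, the following is a minimal Python sketch of a coarse-to-fine search over a grid of pre-trained warped models. It is not the BTS procedure of this paper: the stride-halving strategy, the function name `coarse_to_fine_warp_search`, and the callable `score_model` are all assumptions made for illustration, and the 0.01 grid step is inferred from the 17 models mentioned in Section 8. The sketch assumes the likelihood has a single peak over the grid.

```python
def coarse_to_fine_warp_search(score_model, warps):
    """Return (best_warp, number_of_likelihood_evaluations).

    score_model(warp) is a hypothetical callable that returns the
    log-likelihood of the once-extracted features under the model set
    trained with `warp`. `warps` must be sorted, and the likelihood is
    assumed to have a single peak over this grid.
    """
    cache = {}

    def score(index):
        # Each warped model is scored at most once.
        if index not in cache:
            cache[index] = score_model(warps[index])
        return cache[index]

    best = (len(warps) - 1) // 2   # start at the center of the grid
    step = (len(warps) - 1) // 2   # stride is halved after every round
    while step >= 1:
        candidates = [i for i in (best - step, best, best + step)
                      if 0 <= i < len(warps)]
        best = max(candidates, key=score)
        step //= 2
    return warps[best], len(cache)


# Assumed grid: 0.49 to 0.65 in steps of 0.01, i.e., 17 warped models.
WARP_GRID = [round(0.49 + 0.01 * k, 2) for k in range(17)]
```

The features are never re-extracted inside the loop; only the model being scored changes, which is the point of moving the search to the model space. On the 17-point grid this particular sketch scores at most 9 models, so it does not reproduce the 6 likelihood computations reported for BTS.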
6. Experimental Framework
In order to test the effectiveness of the proposed BISN algorithm, recognition experiments were performed on two different databases that address different adverse conditions. We believe that it is important to test speaker normalization algorithms in actual adverse environments in order to determine whether they have practical value. The databases used in the experiments are (a) the CU-Move database, extended digits portion [30], for real noisy in-car environments, and (b) speech in noisy environments (SPINE) [31], for simulated noisy military task conditions. These databases are good examples of environments where reliable and efficient speaker normalization is needed.
6.1. General System Description
For all experiments, we used SONIC [32, 33], the University of Colorado's HMM-based large vocabulary speech recognition system. We used a window length of 25 milliseconds and a skip rate of 10 milliseconds, applying a Hamming window to each frame before further processing. The 39-dimensional feature set contains 12 static coefficients with their deltas and delta-deltas, along with normalized log energy, delta energy, and delta-delta energy. Cepstral mean normalization (CMN) was applied to the final feature vectors.
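The framing and normalization steps described above can be sketched in a few lines of plain Python. This is an illustration only and not SONIC's implementation: the function names `frame_signal` and `cepstral_mean_normalize` are hypothetical, and the sketch assumes that the mean is computed over whatever block of frames is passed in (for example, one utterance).

```python
import math


def frame_signal(samples, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Split a signal into Hamming-windowed frames (25 ms window, 10 ms skip)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    # Standard Hamming window coefficients.
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (win - 1))
              for k in range(win)]
    frames = []
    # Only full frames are kept; trailing samples are dropped.
    for start in range(0, len(samples) - win + 1, hop):
        frames.append([samples[start + k] * window[k] for k in range(win)])
    return frames


def cepstral_mean_normalize(features):
    """Subtract the per-dimension mean over all frames (CMN)."""
    if not features:
        return []
    dim = len(features[0])
    means = [sum(frame[d] for frame in features) / len(features)
             for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in features]
```

At a 16 kHz sampling rate, these settings correspond to 400-sample frames advanced by 160 samples.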
For both VTLN and BISN experiments, a single best warp is estimated for each speaker offline using all available data. We re-extracted PMVDR features using these best warps and retrained the HMM model set in order to obtain canonical models. During the test, a two-pass strategy was used. First, all utterances from a single speaker are recognized with the non-canonical HMM set, and the best warp factors are estimated using the result of this recognition. In the second step, the features for that speaker's utterances are extracted using the best warps obtained in the first step, and a second recognition is performed with the canonical models to obtain the final hypothesis.
6.2. Experiments for CU-Move Extended Digits Task
For noisy speech experiments, we use the CU-Move extended digits corpus [30], which was collected in real car environments. The database and noise conditions are analyzed in detail in [34, 35].
A total of 60 speakers balanced across gender and age (18–70 years old) were used in the training set. (Note that [34] summarizes recommended training, development, and test sets for the CU-Move corpus.) The test set contained another 50 speakers, again gender and age balanced. The HMMs were trained using SONIC's decision-tree HMM trainer [32, 33], resulting in a model set with approximately 10 K total Gaussians. The 40-word vocabulary is very convenient for telephone dialing applications since it contains many necessary words like "dash", "pound", "sign" in addition to numbers. We used the PMVDR settings optimized for the CU-Move task [3].
The recognition performance for different normalization approaches is given in Table 1. As can be seen, PMVDR integrated with BISN achieves a relative WER reduction of close to 50% with respect to the MFCC baseline.
Although there is no substantial improvement in the WER performance of the BISN-based techniques with respect to the VTLN baseline, there is a computational gain, as well as the convenience of performing the normalization within the acoustic front end by merely changing an internal parameter. BISN-based normalization can be easily integrated into embedded systems, such as in-car speech-based navigation systems, without increasing the computational cost significantly.
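For readers less familiar with the metric, the relative WER reductions quoted above are computed from absolute WERs as shown in the short Python sketch below. The function name `relative_wer_reduction` is hypothetical, and the numbers in the usage line are made up for illustration; they are not the entries of Table 1.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent, with both WERs given in percent."""
    if baseline_wer <= 0.0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical example: a drop from 10.0% to 5.0% WER is a 50% relative reduction.
example = relative_wer_reduction(10.0, 5.0)
```

A relative figure depends strongly on the baseline: the same absolute gain yields a much smaller relative reduction on a task with a high baseline WER.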
6.3. Experiments for the SPINE Task
The SPINE task uses the ARCON communicability exercise (ACE), which was originally developed to test communication systems. The training data for the SPINE2 task consists of four parts: (1) SPINE1 training data (8.7 hours), (2) SPINE1 evaluation data (7.3 hours), (3) SPINE2 training data (3.4 hours), and (4) SPINE2 development data (1.1 hours), for a total of 20.5 hours of training data. The evaluation data consists of 64 talker-pair conversations, which amount to 3.5 hours of stereo data (2.8 hours of total talk time). On average, each of the 128 conversation sides contains 1.3 minutes of speech activity. For the SPINE2 evaluation, a class N-gram language model is trained from the training data text. For further details about the task, we refer readers to [33].
The test data contains large segments of silence, and a voice activity detector (VAD) is used to estimate speech segments. For the speaker normalization experiments, however, we preferred to use the reference hand-cuts provided by NRL in order to objectively evaluate the performance of the speaker normalization algorithms. We again trained gender-independent HMMs using SONIC's decision-tree HMM trainer. The models had about 2500 clusters and around 50 K Gaussians. The PMVDR front end used a perceptual warp corresponding to the Mel scale at 16 kHz.
The recognition performance for different speaker normalization approaches is given in Table 2. PMVDR w/BISN achieves a relative WER reduction of about 8.5% with respect to the MFCC baseline. This moderate improvement can be attributed to the high WER of the task: because the recognition results (and hence the alignments) are not sufficiently accurate, the warp estimates are poor. Again, the WER performance is comparable with VTLN. Relative to the MFCC baseline, we observe a larger improvement for female speakers than for male speakers.
7. Application of BISN in a Real-Time Scenario
We now elaborate on the application of BISN w/MSBTS within a real-world scenario. In a real-time system, all the training data is available in advance, so the self-normalization warps can be determined offline using all the available data from each speaker. During the test, however, we do not have access to all speech from a specific speaker to determine the self-normalization warp for that speaker. Moreover, we do not know when speaker changes occur. The algorithm should therefore be able to adapt the self-normalization warps to changing speakers. It should also be flexible (i.e., slowly changing) even for the same speaker to account for slight variations in the vocal-tract characteristics. By making effective use of all the algorithms described so far, it is possible to establish a cooperation between the acoustic front end and the recognizer which enables the front end to normalize itself automatically without the need to perform recognition twice. We give the block diagram for the application of this self-normalizing front end (BISN w/MSBTS) in Figure 4.
Assume that we have the canonical models, trained on speaker-normalized training data, and would like to perform online VTLN during the test. Also assume that recognition is performed for small sections of speech (i.e., utterances). We can summarize the operation of the self-normalizing front end as follows.

(i) First parameterize the $n$th input utterance with the perceptual warp $\bar{\alpha}(n)$.

(ii) Recognize the utterance and pass the transcription (with alignment) information to the MSBTS block.

(iii) Determine the best self-normalization warp (i.e., the instantaneous warp $\hat{\alpha}(n)$ for the current utterance $n$).

(iv) Pass $\hat{\alpha}(n)$ through a recursive averaging block with a forgetting factor ($\beta$) to obtain an averaged version (i.e., $\bar{\alpha}(n+1)$). Here, the forgetting factor was set to 0.6; an optimization experiment is presented later in this section.

(v) Supply $\bar{\alpha}(n+1)$ to the PMVDR front end as an estimate of the self-normalization warp for the $(n+1)$th incoming utterance.
In summary, the front end estimates the self-normalization warp for the incoming utterance by using the self-normalization warp estimated from the earlier utterances via recursive averaging with a forgetting factor. After performing recognition with the estimated self-normalization warp, the recognizer feeds back the alignment information so that the self-normalization warp for the next utterance can be estimated (and updated).
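The feedback loop summarized above can be sketched in Python as follows. This is a schematic illustration, not the authors' implementation: `extract`, `recognize`, and `estimate_warp` are hypothetical callables standing in for the PMVDR front end, the recognizer, and the MSBTS block, and the update assumes that the forgetting factor weights the previous averaged warp. The initial warp of 0.57 is the center of the search space.

```python
def run_self_normalizing_front_end(utterances, extract, recognize, estimate_warp,
                                   beta=0.6, initial_warp=0.57):
    """Process utterances sequentially with a single recognition pass each."""
    warp = initial_warp
    history = []
    for audio in utterances:
        features = extract(audio, warp)                      # step (i)
        hypothesis, alignment = recognize(features)          # step (ii)
        instantaneous = estimate_warp(features, alignment)   # step (iii)
        history.append({"hypothesis": hypothesis,
                        "used_warp": warp,
                        "instantaneous_warp": instantaneous})
        # Steps (iv) and (v): assumed recursive average, fed to the next utterance.
        warp = beta * warp + (1.0 - beta) * instantaneous
    return history


# Toy stand-ins: each "utterance" is just the warp that would suit it best.
def _toy_extract(audio, warp):
    return {"audio": audio, "warp": warp}


def _toy_recognize(features):
    return "hypothesis", None


def _toy_estimate_warp(features, alignment):
    return features["audio"]


toy_history = run_self_normalizing_front_end(
    [0.61] * 20, _toy_extract, _toy_recognize, _toy_estimate_warp)
```

Each utterance is recognized exactly once, and the warp applied to an utterance depends only on earlier utterances, which is what removes the second recognition pass.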
In this way, we never have to perform the recognition twice, and we sequentially refine the warp estimate to accommodate slight variations of the vocal tract even for the same speaker. Moreover, the recursive averaging ensures quick adaptation of the self-normalization warp to changing speakers over time. If we call the instantaneous warp estimated for the current utterance $\hat{\alpha}(n)$, then the self-normalization warp estimate for the incoming utterance can be computed as follows:
$$\bar{\alpha}(n+1) = \beta\,\bar{\alpha}(n) + (1-\beta)\,\hat{\alpha}(n), \quad n = 1, 2, \ldots, N,$$
where $\bar{\alpha}(n)$ is the averaged warp used in the parameterization of the $n$th utterance, $\hat{\alpha}(n)$ is the instantaneous warp estimated for the $n$th utterance given the features from the front end and the alignment from the recognizer, and $\bar{\alpha}(n+1)$ is the estimated warp factor to be used in the parameterization of the $(n+1)$th utterance. As an initial condition for the first utterance, we can choose the center warp of our search space (i.e., $\bar{\alpha}(1) = 0.57$). Finally, $N$ is the total number of utterances in the test set. The forgetting factor $\beta$ provides a means for smoothing the self-normalization warp estimate and helps account for changes in vocal-tract characteristics. Since the instantaneous self-normalization warp is estimated from a short segment of data (as short as one spoken digit), it fluctuates considerably. For comparison, Figure 5 shows the variation of the instantaneous self-normalization warp ($\hat{\alpha}(n)$) and of the recursively averaged self-normalization warp ($\bar{\alpha}(n)$). The fixed self-normalization warps obtained from the offline BISN w/MSBTS algorithm are also superimposed on the averaged self-normalization warp graph. The averaged self-normalization warp tracks the fixed self-normalization warp, permitting slow variations within the same speaker. Allowing some flexibility for the warp factor even within the same speaker compensates for variations which may stem from the Lombard effect, stress, or a number of other physiological factors [36]. Figure 5 also shows that the averaged self-normalization warp successfully and quickly adapts to new speakers with no need to detect speaker turns.
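How quickly such a recursive average adapts after a speaker change can be worked out in closed form. The derivation below is a sketch under two assumptions that are not taken from the text: the forgetting factor $\beta$ weights the previous averaged warp, and the instantaneous warp of the new speaker is idealized as a constant $\alpha^{\ast}$ (a symbol introduced here for illustration).

```latex
% Assumed update with a constant instantaneous warp \alpha^{\ast}
\bar{\alpha}(n+1) - \alpha^{\ast}
  \;=\; \beta\,\bigl(\bar{\alpha}(n) - \alpha^{\ast}\bigr)

% Unrolling k steps: the remaining mismatch decays geometrically
\bar{\alpha}(n+k) - \alpha^{\ast}
  \;=\; \beta^{k}\,\bigl(\bar{\alpha}(n) - \alpha^{\ast}\bigr)

% For \beta = 0.6 and k = 5 utterances
0.6^{5} \approx 0.078
```

Under these assumptions, about 92% of the initial mismatch is removed within five utterances at a forgetting factor of 0.6. For comparison, the corresponding residuals after five utterances are roughly 0.01 at 0.4 and 0.33 at 0.8, which illustrates the trade-off between adaptation speed and smoothing.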
As observed from Figure 5, the fluctuation in the instantaneous self-normalization warp is mostly smoothed out by the recursive averaging. To determine a good value for the forgetting factor, we measured the WER for a range of forgetting factors; the results are presented in Figure 6. As observed, the particular value is not that crucial as long as it is within the range [0.4–0.8]. We infer that, for the CU-Move task, a good value of the forgetting factor is 0.6.
In Table 3, we summarize the recognition results for the CU-Move task, in which each test speaker had an average of approximately 60 utterances. The results, which are slightly better than those of the offline experiments, confirm the applicability of the proposed self-normalizing front end (BISN w/MSBTS). The slight improvement can be attributed to the more accurate alignments obtained during the on-the-fly normalization. In the offline case, all speech for a specific speaker is recognized first and then a warp factor is determined. Since unwarped models and features are used in the first round of recognition, the recognition results (and hence the alignments) are only moderately accurate. In the on-the-fly experiments, however, the warp is adjusted as more and more data becomes available from the same speaker, and normalized models and features are used to update the self-normalization warp. The alignments supplied by the recognizer are therefore more accurate, yielding better estimates of the self-normalization warp. We also note that for Table 3, it is not possible to directly compare BISN w/MSBTS with VTLN, since VTLN can only be applied offline.
8. Computational Considerations
This final section aims to evaluate all algorithms in terms of their computational efficiency. We consider the number of warpings performed on the FFT spectrum (NW), the number of feature extractions (NFEs) required for the whole system (both for search and recognition), the number of likelihood computations (NLCs), and lastly the number of recognition passes (NRPs). Table 4 clearly illustrates the computational gain obtained by moving from classical VTLN to the on-the-fly version of BISN w/MSBTS. Moving from classical VTLN to BISN eliminates the need to perform warping on the FFT spectrum twice. The perceptual and speaker normalization warps are integrated into a single speaker-dependent warp. Integration of the MSBTS algorithm within the BISN framework for an on-the-fly application eliminates even the need to extract the features twice. The features extracted for recognition are also passed to the MSBTS block to estimate the self-normalization warp for the incoming utterance. Since the estimation is sequential, the need to perform recognition twice is also eliminated. The self-normalization warp for the incoming utterance is recursively estimated from earlier utterances. The computational load is now reduced to realistic levels even for embedded systems. The only drawback is that we need to keep all single-Gaussian models trained at each point of the search space (here we have 17 single-Gaussian models in the BISN case) in memory all the time. However, since these are only single-Gaussian models, they do not require a large amount of memory.
9. Conclusions
In this paper, we have proposed a new and efficient algorithm for performing online VTLN which can easily be implemented within the PMVDR front end. In VTLN, we need to perform warping on the spectrum twice: once to accommodate perceptual considerations and once to normalize for speaker differences. The proposed BISN algorithm, on the other hand, estimates a self-normalization warp for each speaker which performs both the perceptual warp and speaker normalization in a single warp. The use of a single warp to achieve both the perceptual warp and the VTLN warp unifies these two concepts. The model space-binary tree search (MSBTS) algorithm was integrated to reduce the computational load in the search stage for the estimation of self-normalization warps. Moving the search from the feature space to the model space [13] removed the need to extract the features for each point in the search space, which in turn eliminated the need for high computational resources. A sequential on-the-fly implementation of the BISN w/MSBTS algorithm also eliminated the need to perform multipass recognition, which makes it possible to integrate this scheme with low-resource speech recognition systems.
We have shown that the BISN approach is effective for two different databases, the CU-Move in-vehicle dialog (extended digits portion) database and the SPINE military noisy speech database. The on-the-fly implementation of the BISN w/MSBTS algorithm was also shown to be slightly more accurate than the offline version, with considerable savings in computational resources. Integrated with the BISN approach, the PMVDR front end can now be considered an intelligent front end which cooperates with the recognizer in order to automatically normalize itself with respect to the incoming speaker/speech. Since it can quickly adapt to changing vocal-tract characteristics, it does not require any detection of speaker changes. We believe that the PMVDR front end integrated with the BISN algorithm is well suited to systems requiring noise robustness and a measurable level of speaker normalization (especially embedded systems). It can perform acoustic feature extraction with moderate computational requirements and achieve self-normalization with respect to changing speakers very efficiently, yielding a sound acoustic front end that can be used in today's demanding speech recognition applications.
References
Hunt MJ: Spectral signal processing for ASR. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '99), December 1999, Keystone, Colo, USA 1: 17–26.
Yapanel UH, Hansen JHL: A new perspective on feature extraction for robust in-vehicle speech recognition. Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH '03), September 2003, Geneva, Switzerland 1281–1284.
Yapanel UH, Hansen JHL: A new perceptually motivated MVDR-based acoustic front-end (PMVDR) for robust automatic speech recognition. Speech Communication 2008, 50(2):142–152. 10.1016/j.specom.2007.07.006
Hansen JHL: Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition. Speech Communication 1996, 20(1-2):151–173. 10.1016/S0167-6393(96)00050-7
Hansen JHL: Morphological constrained feature enhancement with adaptive cepstral compensation (MCE-ACC) for speech recognition in noise and Lombard effect. IEEE Transactions on Speech and Audio Processing 1994, 2(4):598–614. 10.1109/89.326618
Lee L, Rose RC: Speaker normalization using efficient frequency warping procedures. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '96), May 1996, Atlanta, Ga, USA 1: 353–356.
Lee L, Rose RC: A frequency warping approach to speaker normalization. IEEE Transactions on Speech and Audio Processing 1998, 6(1):49–60. 10.1109/89.650310
Andreou A, Kamm T, Cohen J: Experiments in vocal tract normalization. Proceedings of the CAIP Workshop: Frontiers in Speech Recognition II, July–August 1994, Piscataway, NJ, USA
Eide E, Gish H: A parametric approach to vocal tract length normalization. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '96), May 1996, Atlanta, Ga, USA 1: 346–348.
Acero A: Acoustical and environmental robustness in automatic speech recognition, Ph.D. thesis. Carnegie Mellon University, Pittsburgh, Pa, USA; 1990.
McDonough J: Speaker compensation with all-pass transforms, Ph.D. thesis. The Johns Hopkins University, Baltimore, Md, USA; 2000.
McDonough J, Byrne W, Luo X: Speaker adaptation with all-pass transforms. Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP '98), November–December 1998, Sydney, Australia 6: 2307–2310.
Hain T, Woodland PC, Niesler TR, Whittaker EWD: The 1998 HTK system for transcription of conversational telephone speech. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), March 1999, Phoenix, Ariz, USA 57–60.
Sinha R, Umesh S: A method for compensation of Jacobian in speaker normalization. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003, Hong Kong 1: 560–563.
Haeb-Umbach R: Investigations on inter-speaker variability in the feature space. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), March 1999, Phoenix, Ariz, USA 1: 397–400.
Kim Y: Signal modeling for robust speech recognition with frequency warping and convex optimization, Ph.D. thesis. Department of Electrical Engineering, Stanford University, Palo Alto, Calif, USA; 2000.
Murthi MN, Rao BD: All-pole modeling of speech based on the minimum variance distortionless response spectrum. IEEE Transactions on Speech and Audio Processing 2000, 8(3):221–239. 10.1109/89.841206
Musicus BR: Fast MLM power spectrum estimation from uniformly spaced correlations. IEEE Transactions on Acoustics, Speech, and Signal Processing 1985, 33(5):1333–1335. 10.1109/TASSP.1985.1164696
Smith JO III, Abel JS: Bark and ERB bilinear transforms. IEEE Transactions on Speech and Audio Processing 1999, 7(6):697–708. 10.1109/89.799695
Tokuda K, Masuko T, Kobayashi T, Imai S: Mel-generalized cepstral analysis - a unified approach to speech spectral estimation. Proceedings of the International Conference on Spoken Language Processing (ICSLP '94), September 1994, Yokohama, Japan 1043–1046.
Haykin S: Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ, USA; 1991.
Makhoul J: Linear prediction: a tutorial review. Proceedings of the IEEE 1975, 63(4):561–580.
Oppenheim AV, Schafer RW: Discrete-Time Signal Processing. Prentice-Hall, Englewood Cliffs, NJ, USA; 1989.
Yapanel UH: Acoustic modeling and speaker normalization strategies with application to robust in-vehicle speech recognition and dialect classification, Ph.D. thesis. Center for Spoken Language Research, Department of Electrical and Computer Engineering, University of Colorado at Boulder, Boulder, Colo, USA; 2005.
Yapanel UH, Dharanipragada S, Hansen JHL: Perceptual MVDR-based cepstral coefficients (PMCCs) for high accuracy speech recognition. Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH '03), September 2003, Geneva, Switzerland 1829–1832.
Hansen JHL, Zhang X, Akbacak M, et al.: CU-MOVE: advanced in-vehicle speech systems for route navigation. In DSP for In-Vehicle and Mobile Systems. Kluwer Academic Publishers, Dordrecht, The Netherlands; 2004:19–45.
Zhan P, Westphal M: Speaker normalization based on frequency warping. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), April 1997, Atlanta, Ga, USA 2: 1039–1042.
LDC, http://wave.ldc.upenn.edu/catalog/catalogentry.jsp?catalogid=ldc93s6a
Pitz M, Molau S, Schluter R, Ney H: Vocal tract normalization equals linear transformation in cepstral space. Proceedings of the 7th European Conference on Speech Communication and Technology (EUROSPEECH '01), September 2001, Aalborg, Denmark
Hansen JHL, Sarikaya R, Yapanel UH, Pellom B: Robust speech recognition in noise: an evaluation using the SPINE corpus. Proceedings of the 7th European Conference on Speech Communication and Technology (EUROSPEECH '01), September 2001, Aalborg, Denmark 2: 905–908.
Pellom B: SONIC: the University of Colorado continuous speech recognizer. Center for Spoken Language Research, University of Colorado at Boulder, Boulder, Colo, USA; 2001.
Pellom B, Hacioglu K: Recent improvements in the CU Sonic ASR system for noisy speech: the SPINE task. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003, Hong Kong 1: 4–7.
Hansen JHL: Getting started with the CU-Move database. The Center for Spoken Language Research, Boulder, Colo, USA; 2002. http://www.cumove.colorado.edu.com
Hansen JHL, Angkititrakul P, Plucienkowski J, et al.: CU-Move: analysis & corpus development for interactive in-vehicle speech systems. Proceedings of the 7th European Conference on Speech Communication and Technology (EUROSPEECH '01), September 2001, Aalborg, Denmark 209–212.
Bou-Ghazale SE, Hansen JHL: A comparative study of traditional and newly proposed features for recognition of speech under stress. IEEE Transactions on Speech and Audio Processing 2000, 8(4):429–442. 10.1109/89.848224
Acknowledgment
This work was supported by the US Air Force Research Laboratory, Rome, NY, under Contract no. FA8750-04-1-0058.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Yapanel, U.H., Hansen, J.H. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Builtin Speaker Normalization. J AUDIO SPEECH MUSIC PROC. 2008, 148967 (2008). https://doi.org/10.1155/2008/148967
DOI: https://doi.org/10.1155/2008/148967