DWT and LPC based feature extraction methods for isolated word recognition
© Nehe and Holambe; licensee Springer. 2012
Received: 21 January 2011
Accepted: 30 January 2012
Published: 30 January 2012
In this article, new feature extraction methods that combine wavelet decomposition with reduced-order linear predictive coding (LPC) coefficients are proposed for speech recognition. The coefficients are derived from speech frames decomposed using the discrete wavelet transform. LPC coefficients derived from the subband decomposition of a speech frame (abbreviated as WLPC) provide a better representation than modeling the frame directly. The WLPC coefficients are further normalized in the cepstrum domain to obtain a new set of features, denoted wavelet subband cepstral mean normalized features. The proposed approaches provide effective (better recognition rate), efficient (reduced feature vector dimension), and noise-robust features. The performance of these techniques has been evaluated on the TI-46 isolated word database and a self-created Marathi digits database in a white noise environment using the continuous density hidden Markov model. The experimental results also show the superiority of the proposed techniques over conventional methods such as linear predictive cepstral coefficients, Mel-frequency cepstral coefficients, spectral subtraction, and cepstral mean normalization in the presence of additive white Gaussian noise.
A speech recognition system has two major components, namely, feature extraction and classification. The feature extraction method plays a vital role in the speech recognition task. There are two dominant approaches to acoustic measurement. The first is a temporal-domain or parametric approach such as linear prediction, which was developed to closely match the resonant structure of the human vocal tract that produces the corresponding sound. The linear prediction coefficients (LPC) technique is not well suited to representing speech because it assumes that the signal is stationary within a given frame and hence cannot analyze localized events accurately. It is also unable to capture unvoiced and nasalized sounds properly. The second is a nonparametric frequency-domain approach based on the human auditory perception system, known as Mel-frequency cepstral coefficients (MFCC). The widespread use of MFCCs is due to their low computational complexity and better performance for automatic speech recognition (ASR) under clean, matched conditions. The performance of MFCC degrades rapidly in the presence of noise, and the degradation grows as the signal-to-noise ratio (SNR) decreases. The poor performance of LPC and its different forms, such as reflection coefficients and linear prediction cepstral coefficients (LPCC), as well as MFCC and its various forms, in noisy conditions has led many researchers to investigate alternative robust feature extraction algorithms.
In the literature, various techniques have been proposed to improve the performance of ASR systems in the presence of noise. Speech enhancement techniques such as spectral subtraction (SS) or cepstra derived from the difference of power spectrum reduce the effect of noise either by using statistical information about the noise or by filtering the noise from the noisy speech before feature extraction. Techniques such as perceptual linear prediction and relative spectra incorporate some features of the human auditory mechanism and give noise-robust ASR. Feature enhancement techniques such as cepstral mean subtraction and parallel model combination improve ASR performance by compensating for mismatch effects in cepstral-domain features.
In another approach [11–16], the wavelet transform and the wavelet packet tree have been used for speech feature extraction, with the energies of the wavelet-decomposed subbands used in place of the Mel-filtered subband energies. Because of its better energy compaction property, wavelet transform-based features give better recognition accuracy than LPC and MFCC. A Mel filter-like admissible wavelet packet structure performs better than MFCC in unvoiced phoneme recognition. Previously proposed wavelet subband features use normalized subband energies, which show good performance in the presence of additive white noise. However, in these wavelet-based approaches, the time information is lost because only the subband energies are used. We used the actual wavelet coefficients, as proposed in earlier work, which preserve the time information; these features performed better than LPCC and MFCC owing to the combined advantages of LPC and the wavelet transform (WT). LPC can better distinguish words having distinct vowel sounds, and the WT can model the details of the unvoiced portions of the speech signal. However, these features do not perform well for noisy speech recognition.
We propose a modification of these earlier features to derive effective, efficient, and noise-robust features from the frequency subbands of the frame. Each frame of the speech signal is decomposed (uniformly or dyadically) into different frequency subbands using the discrete wavelet transform (DWT), and each subband is further modeled using linear predictive coding (LPC). The WT is better able to model the details of unvoiced sound portions; hence, the subband decomposition is performed by means of the DWT. The DWT is popular in digital signal processing because of its multiresolution capability and its constant-Q property, which many signal processing applications require, especially speech processing (human hearing is perceptually constant-Q). Dyadic wavelet decomposition results in a logarithmic set of bandwidths, which is very similar to the logarithmic frequency response of the human ear. The LPC coefficients derived from the speech subbands obtained after DWT decomposition provide the WLPC features. These features are further normalized in the cepstrum domain using the well-known cepstral mean normalization (CMN) technique to obtain noise-robust features. The new features are denoted wavelet subband-based cepstral mean normalized (WSCMN) features, and they perform better in an additive white noise environment. The performance of the proposed features is tested on the TI-46 and Marathi digits databases using a continuous density hidden Markov model (CDHMM) as the classifier.
The rest of the article is organized as follows. Section 2 briefly describes the theory of the DWT. The proposed WLPC feature extraction and its normalization are described in Section 3. The experiments and recognition results are given in Section 4. Section 5 gives concluding remarks based on the experiments.
2. Discrete wavelet transform
Speech is a nonstationary signal. The Fourier transform (FT) is not suitable for analyzing such a signal because it provides only the frequency content of the signal and gives no information about when each frequency is present. The windowed short-time FT (STFT) provides temporal information about the frequency content of the signal, but its time resolution is fixed by the fixed window length. The WT, with its flexible time-frequency window, is an appropriate tool for analyzing nonstationary signals such as speech, which contain both short high-frequency bursts and long quasi-stationary components.
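For reference, the standard two-scale (dilation) relations of multiresolution analysis can be written with the symbols defined in the next paragraph. The wavelet function symbol $\psi(t)$ and the $\sqrt{2}$ normalization are assumptions here; they follow one common textbook convention and may differ from the notation originally used.

```latex
\begin{align}
  \phi(t) &= \sqrt{2} \sum_{n} h[n]\, \phi(2t - n), \\
  \psi(t) &= \sqrt{2} \sum_{n} g[n]\, \phi(2t - n),
\end{align}
```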
where the function ϕ(t) is called the scaling function, h[n] is the impulse response of a low-pass filter, and g[n] is the impulse response of a high-pass filter. The scaling and wavelet functions can be implemented efficiently using this pair of filters, h[n] and g[n]. These filters are called quadrature mirror filters and satisfy the property g[n] = (-1)^(1-n) h[1-n]. The input signal is low-pass filtered to give the approximation components and high-pass filtered to give the detail components of the input speech signal. The approximation signal at each stage is further decomposed using the same low-pass and high-pass filters to obtain the approximation and detail components for the next stage. This type of decomposition is called dyadic decomposition, whereas decomposing the detail signal along with the approximation signal at each stage is called uniform decomposition. Dyadic decomposition divides the input signal bandwidth into a logarithmic set of bandwidths, whereas uniform decomposition divides it into a uniform set of bandwidths.
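The filter-bank view described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the Haar filter pair is used because it is the shortest quadrature mirror pair, the text does not say which wavelet was actually used, and the function names are hypothetical.

```python
import math

def qmf_highpass(h):
    """Build g[n] = (-1)**(1 - n) * h[1 - n] for a 2-tap low-pass filter h."""
    assert len(h) == 2, "this sketch only handles 2-tap (Haar) filters"
    return [(-1) ** (1 - n) * h[1 - n] for n in range(2)]

def analyze(x, h, g):
    """One analysis stage: filter with h and g, keeping one output per input pair."""
    half = len(x) // 2
    approx = [sum(h[n] * x[2 * k + n] for n in range(2)) for k in range(half)]
    detail = [sum(g[n] * x[2 * k + n] for n in range(2)) for k in range(half)]
    return approx, detail

def dyadic(x, h, g, levels=3):
    """Split only the approximation at each stage; returns [A_L, D_L, ..., D_1]."""
    details = []
    approx = x
    for _ in range(levels):
        approx, detail = analyze(approx, h, g)
        details.append(detail)
    return [approx] + details[::-1]

def uniform(x, h, g, levels=2):
    """Split every band at each stage; returns 2**levels equal-width bands.

    The bands come out in filter-bank (natural) order, which is not
    strictly increasing frequency order for a wavelet packet tree.
    """
    bands = [x]
    for _ in range(levels):
        bands = [part for band in bands for part in analyze(band, h, g)]
    return bands

# Assumed example: Haar low-pass filter and a 320-sample test frame.
h = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
g = qmf_highpass(h)
frame = [math.sin(0.1 * n) for n in range(320)]
dyadic_bands = dyadic(frame, h, g)
uniform_bands = uniform(frame, h, g)
```

With three dyadic levels the subband lengths halve at each stage, while two uniform levels give four subbands of equal length, mirroring the logarithmic and uniform bandwidth splits described in the text.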
In a speech signal, high frequencies are present very briefly at the onset of a sound, while lower frequencies are present later for a long period. The DWT resolves all these frequencies well. The DWT parameters contain information at different frequency scales, which helps in extracting the speech information of the corresponding frequency band. To parameterize the speech signal, the signal is decomposed into four frequency bands, either uniformly or in dyadic fashion.
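As a worked example of the two four-band splits, the ideal band edges can be computed for the 12.5 kHz sampling frequency reported for TI-46 in Section 4. The sketch assumes ideal half-band filters and hypothetical function names; real wavelet filters overlap at the band edges.

```python
def dyadic_band_edges(fs, levels=3):
    """Ideal (low, high) edges in Hz for A_levels, D_levels, ..., D_1."""
    high = fs / 2.0              # Nyquist frequency
    edges = []
    for _ in range(levels):
        low = high / 2.0
        edges.append((low, high))  # detail band at this level
        high = low
    edges.append((0.0, high))      # remaining approximation band
    return edges[::-1]

def uniform_band_edges(fs, levels=2):
    """Ideal (low, high) edges in Hz for 2**levels equal-width bands."""
    n_bands = 2 ** levels
    width = (fs / 2.0) / n_bands
    return [(i * width, (i + 1) * width) for i in range(n_bands)]

print(dyadic_band_edges(12500))
# [(0.0, 781.25), (781.25, 1562.5), (1562.5, 3125.0), (3125.0, 6250.0)]
print(uniform_band_edges(12500))
# [(0.0, 1562.5), (1562.5, 3125.0), (3125.0, 4687.5), (4687.5, 6250.0)]
```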
3. Proposed WLPC feature extraction
where i = 1,2,...,p. The LPC and LPCC features obtained in this way cannot capture the high-frequency peaks present in the speech signal and cannot analyze localized events accurately, whereas the wavelet transform can. However, LPC distinguishes words that have distinct vowel sounds better than words that share common vowel sounds. The WT models the details of the unvoiced portions of speech better than LPC. In addition, the subband signals (wavelet coefficients) obtained from the wavelet decomposition preserve the time information, and LPC can easily be estimated from such time-domain signals. We can therefore apply the LPC technique to each subband signal after the wavelet decomposition, which gives the combined benefits of LPC and the WT. Hence, the combination of LPC with the WT is proposed in this article.
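For concreteness, LPC estimation and the conversion to cepstral coefficients can be sketched as follows. This is a minimal sketch, assuming the autocorrelation method with the Levinson-Durbin recursion and one common textbook form of the LPC-to-cepstrum recursion, c_m = a_m + sum_{k=1}^{m-1} (k/m) c_k a_{m-k}; the sign convention s[n] ≈ sum_k a_k s[n-k] and all function names are assumptions, not taken from the text.

```python
def autocorrelation(x, p):
    """Autocorrelation lags r[0..p] of one frame."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(p + 1)]

def levinson_durbin(r, p):
    """Solve the LPC normal equations for predictor coefficients a_1..a_p.

    Convention: s[n] is predicted as sum_k a_k * s[n - k].
    Returns (coefficients, final prediction error).
    """
    a = [0.0] * (p + 1)
    err = r[0]
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err

def lpc_to_cepstrum(a):
    """Cepstral coefficients c_1..c_p from predictor coefficients a_1..a_p."""
    p = len(a)
    c = [0.0] * (p + 1)
    for m in range(1, p + 1):
        c[m] = a[m - 1] + sum((k / m) * c[k] * a[m - k - 1] for k in range(1, m))
    return c[1:]

# Autocorrelation of an ideal first-order process with coefficient 0.5.
coeffs, error = levinson_durbin([1.0, 0.5, 0.25], 2)
cepstrum = lpc_to_cepstrum(coeffs)
```

For this example the recursion recovers a_1 = 0.5 and a_2 = 0, and the cepstrum follows the known series 0.5^m / m of a single-pole model.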
where, is a row vector formed using the prediction coefficients obtained from the approximation components A3 at the third level and is a row vector formed using the prediction coefficients obtained from the detail components D_j (j = 1,2,3) at the jth level. T indicates the vector transpose.
Figure 1b shows the schematic of uniform wavelet decomposed LPC (UWLPC) feature extraction from subbands of uniform bandwidth. The subbands are obtained by a two-level wavelet packet decomposition. The UWLPC feature vector is then formed, as for DWLPC, by concatenating the LPC coefficients estimated from the uniformly decomposed subband signals.
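The dyadic variant (three-level decomposition, LPC per subband, concatenation) can be sketched end to end as below. Several details are assumptions made only for illustration: the Haar wavelet, the per-subband LPC order `order_per_band=3`, the concatenation order A3, D3, D2, D1, and every function name are hypothetical and are not stated in the text.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def haar_step(x):
    """One Haar analysis stage: (approximation, detail), each half as long."""
    half = range(len(x) // 2)
    approx = [(x[2 * k] + x[2 * k + 1]) / SQRT2 for k in half]
    detail = [(x[2 * k] - x[2 * k + 1]) / SQRT2 for k in half]
    return approx, detail

def lpc(x, order):
    """Predictor coefficients a_1..a_order (autocorrelation method)."""
    r = [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate band: leave remaining coefficients at zero
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]

def dwlpc(frame, order_per_band=3, levels=3):
    """Concatenate subband LPC vectors in the assumed order A3, D3, D2, D1."""
    details = []
    approx = frame
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)          # D1 first, then D2, D3
    features = []
    for band in [approx] + details[::-1]:
        features.extend(lpc(band, order_per_band))
    return features

# Assumed test frame: two sinusoids plus a little seeded noise, 320 samples.
rng = random.Random(0)
frame = [math.sin(0.3 * n) + 0.5 * math.sin(1.7 * n) + 0.01 * rng.gauss(0.0, 1.0)
         for n in range(320)]
features = dwlpc(frame)
```

With four subbands and three coefficients per subband, this sketch yields a 12-dimensional vector per frame; a uniform variant would replace the dyadic loop with a two-level split of every band.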
3.1. WSCMN features
After normalization, the cepstral sequence has zero mean and unit variance. This normalization is also called cepstral mean and variance normalization. CMN makes the features robust to some linear filtering of the acoustic signal, which might be caused by microphones with different transfer functions, varying distance from the user to the microphone, room acoustics, or transmission channels.
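The normalization described above can be sketched as a per-utterance, per-dimension operation. This is a minimal sketch that assumes statistics are computed over all frames of one utterance; the function name is hypothetical.

```python
import math

def cmvn(frames):
    """Normalize each cepstral dimension to zero mean and unit variance.

    frames: list of equal-length cepstral vectors from one utterance.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    return [[(f[d] - means[d]) / stds[d] if stds[d] > 0.0 else 0.0
             for d in range(dim)]
            for f in frames]

# A fixed channel adds a constant vector in the cepstral domain;
# the normalized features are unchanged by it.
clean = [[0.1 * t, math.sin(t)] for t in range(50)]
channel = [0.7, -0.3]
filtered = [[c + o for c, o in zip(f, channel)] for f in clean]
norm_clean = cmvn(clean)
norm_filtered = cmvn(filtered)
```

Because a linear channel multiplies the spectrum, it appears as an additive constant in the cepstrum, which is why subtracting the mean removes it.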
4. Experimental results
This section evaluates the performance of the proposed techniques on isolated words in the presence of stationary white noise, using the TI-46 database and a self-created Marathi database.
The speech recognition experiments were conducted under clean and noisy conditions using the TI-46 database and a self-created Marathi digit database. The TI-46 Speaker Dependent Isolated Word Corpus has two datasets, namely, TI-20 and TI-ALPHA. The TI-20 vocabulary consists of the ten English digits "zero" through "nine" and ten control words: "yes, no, erase, rubout, repeat, go, enter, help, stop, and start". The TI-ALPHA subset consists of the English letters "a" through "z". In both subsets, data were collected from eight male and eight female speakers. There are 26 utterances of each word from each speaker, of which 10 were used as training tokens and the remaining 16 as testing tokens. The TI-20 subset therefore has a total of 3200 training samples and 5120 test samples, whereas TI-ALPHA has 4160 training samples and 6656 test samples. All data samples were digitized at a sampling frequency of 12.5 kHz.
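The training and test set sizes quoted above follow directly from the vocabulary size, the number of speakers, and the tokens per word. A short check, with hypothetical constant and function names:

```python
SPEAKERS = 16               # eight male and eight female
UTTERANCES_PER_WORD = 26    # per speaker
TRAIN_PER_WORD = 10
TEST_PER_WORD = UTTERANCES_PER_WORD - TRAIN_PER_WORD  # 16

def split_sizes(vocabulary_size):
    """Return (training samples, test samples) for one TI-46 subset."""
    train = vocabulary_size * SPEAKERS * TRAIN_PER_WORD
    test = vocabulary_size * SPEAKERS * TEST_PER_WORD
    return train, test

print(split_sizes(20))  # TI-20: (3200, 5120)
print(split_sizes(26))  # TI-ALPHA: (4160, 6656)
```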
English and equivalent Marathi digit pronunciation
4.2 Experimental setup
The input speech samples are pre-emphasized by a first-order filter with transfer function H(z) = 1 - 0.97z^(-1). The pre-emphasized speech data are divided into frames of 25.6 ms duration with 50% overlap between adjacent frames. A Hamming window is applied to each frame to ensure smooth frequency transitions.
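The front end described above can be sketched as follows. The sketch assumes the 12.5 kHz TI-46 sampling frequency, so that a 25.6 ms frame is 320 samples with a 160-sample hop; how the first sample and any trailing partial frame are handled is also an assumption, and the function names are hypothetical.

```python
import math

def pre_emphasize(x, alpha=0.97):
    """Apply H(z) = 1 - alpha * z^-1; the first sample is passed through."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def hamming(length):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_signal(x, fs=12500, frame_ms=25.6, overlap=0.5):
    """Cut x into overlapping Hamming-windowed frames; drops a trailing partial frame."""
    frame_len = int(round(fs * frame_ms / 1000.0))   # 320 samples at 12.5 kHz
    hop = int(frame_len * (1.0 - overlap))           # 160 samples for 50% overlap
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([x[start + n] * window[n] for n in range(frame_len)])
    return frames

# One second of a test tone at the assumed sampling frequency.
signal = [math.sin(0.01 * n) for n in range(12500)]
frames = frame_signal(pre_emphasize(signal))
```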
Noisy test samples of each dataset (TI-20, TI-ALPHA, and Marathi Digits) were obtained by artificially adding stationary white Gaussian noise to the test samples over a wide range of SNRs (0, 5, 10, 15, 20, and 30 dB). Tests were carried out on clean as well as noisy test samples. For training and testing, a diagonal-covariance left-right CDHMM with 4 mixtures and 5 states (the combination that yielded the best performance) was used as the classifier.
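Adding white Gaussian noise at a prescribed SNR can be sketched as below. The sketch assumes the SNR is defined from the average power over the whole utterance; the text does not say whether silence was excluded when measuring signal power, and the function name and seed are hypothetical.

```python
import math
import random

def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus zero-mean white Gaussian noise at the target SNR (dB)."""
    rng = random.Random(seed)
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]

# Build one noisy copy per SNR condition listed in the text.
clean = [math.sin(0.05 * n) for n in range(20000)]
noisy_sets = {snr: add_white_noise(clean, snr) for snr in (0, 5, 10, 15, 20, 30)}
```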
4.3 Baseline experiment
The baseline experiments were performed using LPCC and MFCC features on each database. For LPCC feature extraction, the prediction coefficients were first extracted from each speech frame using 13th-order LPC. From these prediction coefficients, the cepstral coefficients and their temporal derivatives (first and second derivatives) were computed and concatenated to form the final LPCC feature vector (giving a feature dimension of 39).
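The 39-dimensional vector is 13 static coefficients plus their first and second temporal derivatives. A minimal sketch, assuming the common regression formula over a window of two frames on each side with edge frames repeated; the text does not state how the derivatives were computed, and the function names are hypothetical.

```python
def deltas(frames, width=2):
    """Regression-based temporal derivative of a sequence of feature vectors."""
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            num = sum(n * (frames[min(t + n, last)][d] - frames[max(t - n, 0)][d])
                      for n in range(1, width + 1))
            row.append(num / denom)
        out.append(row)
    return out

def add_dynamic_features(static):
    """Append first and second derivatives to each static vector."""
    first = deltas(static)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]

# Ten frames of 13 static coefficients that grow linearly with time.
static = [[float(t)] * 13 for t in range(10)]
full = add_dynamic_features(static)
```

For a feature that grows by one per frame, the first derivative away from the edges is exactly one, which gives a quick sanity check of the regression weights.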
Percentage recognition rate of LPCC and MFCC features on various datasets
We tested the performance of LPCC and MFCC features for different LPC orders and different numbers of Mel filters in the triangular filter bank, respectively. It was observed that 13th-order LPC (p = 13), 20 Mel filters in the filter bank, and a feature vector of length 39 (13 LPC/MFCC coefficients and their first and second derivatives) yield the best performance on the databases. Hence, the results were obtained with these parameter values.
4.4 WLPC features
Percentage recognition rates of different features on the TI-20 database (columns: feature vector length, % recognition rate)
Performance of WLPC features on the TI-ALPHA database (columns: feature vector length, % recognition rate)
In this article, DWT and LPC-based techniques (UWLPC and DWLPC) for isolated word recognition have been presented. Experimental results show that the proposed WLPC (UWLPC and DWLPC) features are effective and efficient compared with LPCC and MFCC because they combine the advantages of LPC and the DWT when estimating the features. The feature vector dimension for WLPC is almost half that of LPCC and MFCC, which reduces the memory requirement and the computation time. It is also observed that DWLPC performs better than UWLPC, because the dyadic (logarithmic) frequency decomposition mimics the human auditory perception system better than the uniform frequency decomposition.
WSCMN features are noise robust because of the normalization in the cepstrum domain. The proposed WSCMN features yield better performance than the popular existing methods in the presence of white noise because this technique captures the differences between phonemes (especially in the Marathi database) more clearly than MFCC and CMN. The experiments also show that the proposed approaches provide effective (better recognition rate), efficient (reduced feature vector dimension), and robust features.
1. Itakura F: Minimum prediction residual principle applied to speech recognition. IEEE Trans Acoust Speech Signal Process 1975, ASSP-23: 67-72.
2. Rabiner L, Juang BH: Fundamentals of Speech Recognition. Prentice-Hall Inc., Englewood Cliffs, NJ; 1993.
3. Davis SB, Mermelstein P: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust Speech Signal Process 1980, ASSP-28(4):357-366.
4. Wang K, Lee CH, Juang BH: Selective feature extraction via signal decomposition. IEEE Signal Process Lett 1997, 4: 8-11. 10.1109/97.551687
5. Boll SF: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust Speech Signal Process 1979, 27: 113-120. 10.1109/TASSP.1979.1163209
6. Xu J, Wei G: Noise-robust speech recognition based on difference of power spectrum. Electron Lett 2000, 36(14):1247-1248. 10.1049/el:20000848
7. Hermansky H: Perceptual linear predictive (PLP) analysis of speech. J Acoust Soc Am 1990, 87(4):1738-1752. 10.1121/1.399423
8. Hermansky H, Morgan N: RASTA processing of speech. IEEE Trans Speech Audio Process 1994, 2: 578-589. 10.1109/89.326616
9. Rosenberg AE, Lee CH, Soong FK: Cepstral channel normalization techniques for HMM-based speaker verification. In Proc ICSLP. Yokohama, Japan; 1994:1835-1838.
10. Gales MJF, Young SJ: Robust speech recognition using parallel model combination. IEEE Trans Speech Audio Process 1996, 4: 352-359. 10.1109/89.536929
11. Tufekci Z, Gowdy JN: Feature extraction using discrete wavelet transform for speech recognition. In IEEE International Conference Southeastcon 2000. Nashville, TN, USA; 2000:116-123.
12. Gupta M, Gilbert A: Robust speech recognition using wavelet coefficient features. In Proc IEEE workshop on Automatic Speech Recognition and Understanding (ASRU'01). Madonna di Campiglio, Trento, Italy; 2001:445-448.
13. Gowdy JN, Tufekci Z: Mel-scaled discrete wavelet coefficients for speech recognition. In Proc IEEE Inter Conf Acoustics, Speech, and Signal Processing (ICASSP'00). Volume 3. Istanbul, Turkey; 2000:1351-1354.
14. Farooq O, Datta S: Mel filter-like admissible wavelet packet structure for speech recognition. IEEE Signal Process Lett 2001, 8(7):196-198. 10.1109/97.928676
15. Farooq O, Datta S: Wavelet based robust sub-band features for phoneme recognition. IEE Vis Image Signal Process 2004, 151(4):187-193.
16. Kotnik B, Kačič Z: A comprehensive noise robust speech parameterization algorithm using wavelet packet decomposition-based denoising and speech feature representation techniques. EURASIP J Adv Signal Process 2007, 1: 1-20.
17. Mallat S: A Wavelet Tour of Signal Processing. Academic, New York; 1998.
18. Nehe NS, Holambe RS: New feature extraction methods using DWT and LPC for isolated word recognition. In Proc of IEEE TENCON 2008. Hyderabad, India; 2008:1-6.
19. Krishnan M, Neophytou CP, Prescott G: Wavelet transform speech recognition using vector quantization, dynamic time warping and artificial neural networks. In International Conference On Spoken Language Processing. Yokohama, Japan; 1994:1191-1193.
20. Hao Y, Zhu X: A new feature in speech recognition based on wavelet transform. In Proc IEEE 5th Inter Conf on Signal Processing (WCCC-ICSP 2000). Volume 3. Beijing, China; 2000:1526-1529.
21. Soman KP, Ramchandran KI: Insight into Wavelets from Theory to Practice. 2nd edition. Prentice-Hall of India, New Delhi; 2005.
22. 1991.
23. Pallett DS: A benchmark for speaker-dependent recognition using the Texas Instruments 20 Word and Alpha-set speech database. In Proc of Speech Recognition Workshop. Bristol, UK; 1986:67-72.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.