Significance of Joint Features Derived from the Modified Group Delay Function in Speech Processing

Hegde, Rajesh M.; Murthy, Hema A.; Gadde, V. R. R.

doi:10.1155/2007/79032

Research Article
Open access
Published: 20 December 2006

EURASIP Journal on Audio, Speech, and Music Processing volume 2007, Article number: 079032 (2006)

Abstract

This paper investigates the significance of combining cepstral features derived from the modified group delay function with features derived from the short-time spectral magnitude, such as the MFCC. The conventional group delay function fails to capture the resonant structure and the dynamic range of the speech spectrum, primarily because of spikes caused by pitch periodicity effects. The group delay function is modified to suppress these spikes and to restore the dynamic range of the speech spectrum. Cepstral features derived from the modified group delay function are called the modified group delay feature (MODGDF). The complementarity and robustness of the MODGDF relative to the MFCC are also analyzed using spectral reconstruction techniques. The combination of several spectral magnitude-based features with the MODGDF, using feature fusion and likelihood combination, is described. These features are then used for three speech processing tasks, namely, syllable, speaker, and language recognition. Results indicate that combining the MODGDF with the MFCC at the feature level gives significant improvements for speech recognition tasks in noise. Combining the MODGDF with the spectral magnitude-based features improves recognition performance by up to 11%, while combining any two features derived from the spectral magnitude does not give any significant improvement.
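The pipeline summarized above can be illustrated with a minimal Python sketch. The abstract gives no formulas, so the sketch assumes the commonly published form of the modified group delay function: the numerator of the group delay computed from the Fourier transforms of x(n) and n·x(n), divided by a cepstrally smoothed spectrum raised to 2γ, then compressed with an exponent α. The function names, the parameter values (alpha, gamma, lifter_len, n_ceps, n_fft), the liftering scheme, and the fixed weight in the likelihood combination are all illustrative assumptions, not the authors' settings.

```python
import numpy as np

def dct2(x, n_coeffs):
    """Unnormalized DCT-II of a 1-D array, keeping the first n_coeffs terms."""
    n = len(x)
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x

def smoothed_spectrum(mag, lifter_len):
    """Cepstrally smooth a full-length FFT magnitude by low-time liftering."""
    cep = np.fft.ifft(np.log(mag + 1e-10)).real
    lifter = np.zeros_like(cep)
    lifter[:lifter_len] = 1.0
    if lifter_len > 1:
        # Keep the mirrored low-time part so the smoothed log spectrum stays real.
        lifter[-(lifter_len - 1):] = 1.0
    return np.exp(np.fft.fft(cep * lifter).real)

def modgdf(frame, alpha=0.4, gamma=0.9, lifter_len=8, n_ceps=13, n_fft=512):
    """Cepstral features from a modified group delay function (assumed form)."""
    n = np.arange(len(frame))
    X = np.fft.fft(frame, n_fft)        # spectrum of x(n)
    Y = np.fft.fft(n * frame, n_fft)    # spectrum of n * x(n)
    S = smoothed_spectrum(np.abs(X), lifter_len)
    # Dividing by the smoothed spectrum instead of |X|^2 suppresses the spikes.
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma))
    # Sign-preserving compression limits the dynamic range.
    tau_m = np.sign(tau) * np.abs(tau) ** alpha
    return dct2(tau_m[: n_fft // 2 + 1], n_ceps)

def fuse_features(modgdf_vec, magnitude_vec):
    """Feature-level fusion: concatenate the two streams into one vector."""
    return np.concatenate([modgdf_vec, magnitude_vec])

def combine_log_likelihoods(ll_modgdf, ll_magnitude, weight=0.5):
    """Likelihood combination as a weighted sum of per-class log-likelihoods."""
    return weight * np.asarray(ll_modgdf) + (1.0 - weight) * np.asarray(ll_magnitude)
```

For one windowed frame, `modgdf(frame)` returns `n_ceps` coefficients. In this sketch, `fuse_features` would be applied before training a single model on the joint vector, whereas `combine_log_likelihoods` would be applied to the scores of two separately trained models.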

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of California San Diego, San Diego, CA, 92122, USA
Rajesh M. Hegde
Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, 600 036, India
Hema A. Murthy
STAR Lab, SRI International, 333 Ravenswood Avenue, Menlo Park, CA, 94025, USA
V. R. R. Gadde

Corresponding author

Correspondence to Rajesh M. Hegde.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Hegde, R.M., Murthy, H.A. & Gadde, V.R.R. Significance of Joint Features Derived from the Modified Group Delay Function in Speech Processing. J AUDIO SPEECH MUSIC PROC. 2007, 079032 (2006). https://doi.org/10.1155/2007/79032

Received: 01 April 2006
Revised: 20 September 2006
Accepted: 10 October 2006
Published: 20 December 2006
DOI: https://doi.org/10.1155/2007/79032