On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modelling

Abstract

Recent research on the TIMIT corpus suggests that longer-length acoustic models are more appropriate for pronunciation variation modelling than the context-dependent phones used by conventional automatic speech recognisers. However, the impressive speech recognition results obtained with longer-length models on TIMIT remain to be reproduced on other corpora. To understand the conditions under which longer-length acoustic models yield considerable improvements in recognition performance, we carry out recognition experiments on both TIMIT and the Spoken Dutch Corpus and analyse the differences between the two sets of results. We establish that the details of the procedure used for initialising the longer-length models have a substantial effect on the speech recognition results. However, even when initialised appropriately, longer-length acoustic models that borrow their topology from a sequence of triphones cannot capture the pronunciation variation phenomena that hinder recognition performance the most.

[1-27]

References

  1. Ostendorf M: Moving beyond the 'beads-on-a-string' model of speech. Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '99), December 1999, Keystone, Colo, USA 79-84.

  2. Kessens JM, Cucchiarini C, Strik H: A data-driven method for modeling pronunciation variation. Speech Communication 2003, 40(4):517-534. 10.1016/S0167-6393(02)00150-4

  3. Jurafsky D, Ward W, Banping Z, Herold K, Xiuyang Y, Sen Z: What kind of pronunciation variation is hard for triphones to model? Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), May 2001, Salt Lake City, Utah, USA 1: 577-580.

  4. Ganapathiraju A, Hamaker J, Picone J, Ordowski M, Doddington GR: Syllable-based large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing 2001, 9(4):358-366. 10.1109/89.917681

  5. McAllaster D, Gillick L: Studies in acoustic training and language modeling using simulated speech data. Proceedings of the 6th European Conference on Speech Communication and Technology (EUROSPEECH '99), September 1999, Budapest, Hungary 1787-1790.

  6. Plannerer B, Ruske G: Recognition of demisyllable based units using semicontinuous hidden Markov models. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '92), March 1992, San Francisco, Calif, USA 1: 581-584.

  7. Jones RJ, Downey S, Mason JS: Continuous speech recognition using syllables. Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH '97), September 1997, Rhodes, Greece 3: 1171-1174.

  8. Sethy A, Narayanan S: Split-lexicon based hierarchical recognition of speech using syllable and word level acoustic units. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003, Hong Kong 1: 772-775.

  9. Sethy A, Ramabhadran, Narayanan S: Improvements in English ASR for the MALACH project using syllable-centric models. Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '03), November-December 2003, St. Thomas, Virgin Islands, USA 129-134.

  10. Jouvet D, Messina R: Context dependent "long units" for speech recognition. Proceedings of the 8th International Conference on Spoken Language Processing (ICSLP '04), October 2004, Jeju Island, Korea 645-648.

  11. Hämäläinen A, de Veth J, Boves L: Longer-length acoustic units for continuous speech recognition. Proceedings of European Signal Processing Conference (EUSIPCO '05), September 2005, Antalya, Turkey.

  12. Hämäläinen A, Boves L, de Veth J: Syllable-length acoustic units in large-vocabulary continuous speech recognition. Proceedings of the 10th International Conference on Speech and Computer (SPECOM '05), October 2005, Patras, Greece 499-502.

  13. Schiller NO, Meyer AS, Levelt WJM: The syllabic structure of spoken words: evidence from the syllabification of intervocalic consonants. Language and Speech 1997, 40(2):103-140.

  14. Pallier C: Phonemes and syllables in speech perception: size of attentional focus in French. Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH '97), September 1997, Rhodes, Greece 2159-2162.

  15. Greenberg S: Speaking in shorthand – a syllable-centric perspective for understanding pronunciation variation. Speech Communication 1999, 29(2):159-176. 10.1016/S0167-6393(99)00050-3

  16. TIMIT acoustic-phonetic continuous speech corpus. In NTIS Order PB91-505065. National Institute of Standards and Technology, Gaithersburg, Md, USA; 1990. Speech Disc 1-1.1.

  17. Oostdijk N, Goedertier W, Van Eynde F, et al.: Experiences from the spoken Dutch corpus project. Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC '02), May 2002, Las Palmas, Canary Islands, Spain 1: 340-347.

  18. Kullback S, Leibler R: On information and sufficiency. Annals of Mathematical Statistics 1951, 22(1):79-86. 10.1214/aoms/1177729694

  19. Young S, Evermann G, Hain T, et al.: The HTK Book (for HTK Version 3.2.1). Cambridge University, Cambridge, UK; 2002.

  20. Fisher WM: tsylb2-1.1 syllabification software. 1996. http://www.nist.gov/speech/tools/index.htm

  21. Kahn D: Syllable-based generalisations in English phonology, Ph.D. thesis. Indiana University Linguistics Club, Bloomington, Ind, USA; 1976.

  22. Baayen RH, Piepenbrock R, Gulikers L: The CELEX Lexical Database (Release 2). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, Pa, USA; 1995.

  23. Printz H, Olsen P: Theory and practice of acoustic confusability. Proceedings of Automatic Speech Recognition: Challenges for the New Millenium (ISCA ITRW ASR '00), September 2000, Paris, France 77-84.

  24. Wester M: Pronunciation variation modeling for Dutch automatic speech recognition, Ph.D. thesis. University of Nijmegen, Nijmegen, The Netherlands; 2002.

  25. Hain T: Implicit modelling of pronunciation variation in automatic speech recognition. Speech Communication 2005, 46(2):171-188. 10.1016/j.specom.2005.03.008

  26. Greenberg S, Chang S: Linguistic dissection of switchboard-corpus automatic speech recognition systems. Proceedings of Automatic Speech Recognition: Challenges for the New Millenium (ISCA ITRW ASR '00), September 2000, Paris, France 195-202.

  27. Sun J, Deng L: An overlapping-feature-based phonological model incorporating linguistic constraints: applications to speech recognition. Journal of the Acoustical Society of America 2002, 111(2):1086-1101. 10.1121/1.1420380

Author information

Corresponding author

Correspondence to Annika Hämäläinen.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Hämäläinen, A., Boves, L., de Veth, J. et al. On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modelling. J AUDIO SPEECH MUSIC PROC. 2007, 046460 (2007). https://doi.org/10.1155/2007/46460

Keywords

  • Acoustics
  • Speech Recognition
  • Substantial Effect
  • Recognition Performance
  • Considerable Improvement