
  • Research Article
  • Open Access

On the Utility of Syllable-Based Acoustic Models for Pronunciation Variation Modelling

EURASIP Journal on Audio, Speech, and Music Processing 2007, 2007:046460

  • Received: 6 December 2006
  • Accepted: 18 May 2007
  • Published:


Recent research on the TIMIT corpus suggests that longer-length acoustic models are more appropriate for pronunciation variation modelling than the context-dependent phones that conventional automatic speech recognisers use. However, the impressive speech recognition results obtained with longer-length models on TIMIT remain to be reproduced on other corpora. To understand the conditions in which longer-length acoustic models result in considerable improvements in recognition performance, we carry out recognition experiments on both TIMIT and the Spoken Dutch Corpus and analyse the differences between the two sets of results. We establish that the details of the procedure used for initialising the longer-length models have a substantial effect on the speech recognition results. When initialised appropriately, longer-length acoustic models that borrow their topology from a sequence of triphones cannot capture the pronunciation variation phenomena that hinder recognition performance the most.
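As a rough illustration of what it means for a longer-length model to borrow its topology from a sequence of triphones, the sketch below expands a syllable's phone sequence into triphone names and concatenates their states into one left-to-right state list. The function names, the `l-c+r` naming, the silence context at the syllable edges, and the choice of three states per triphone are all assumptions made for this example; they are not taken from the article.

```python
from typing import List

# Assumption for illustration: each triphone contributes three emitting states.
STATES_PER_TRIPHONE = 3


def to_triphones(phones: List[str], left: str = "sil", right: str = "sil") -> List[str]:
    """Expand a phone sequence into triphone names of the form left-centre+right."""
    padded = [left] + phones + [right]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def syllable_topology(phones: List[str]) -> List[str]:
    """Concatenate the states of the underlying triphones into one state list.

    The result is a hypothetical left-to-right syllable model with
    len(phones) * STATES_PER_TRIPHONE states, each labelled by the
    triphone state it would be initialised from.
    """
    return [
        f"{triphone}[{state}]"
        for triphone in to_triphones(phones)
        for state in range(1, STATES_PER_TRIPHONE + 1)
    ]


if __name__ == "__main__":
    # Hypothetical three-phone syllable.
    for state in syllable_topology(["k", "ae", "t"]):
        print(state)
```

A model built this way has a fixed number of states determined by the canonical transcription, which is one way to see why such a topology might struggle with variation that removes or merges whole segments.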


  • Acoustics
  • Speech Recognition
  • Substantial Effect
  • Recognition Performance
  • Considerable Improvement


Authors’ Affiliations

Centre for Language and Speech Technology (CLST), Faculty of Arts, Radboud University Nijmegen, P.O. Box 9103, Nijmegen, 6500 HD, The Netherlands




© Annika Hämäläinen et al. 2007

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.