

  • Research Article
  • Open Access

Visual Contribution to Speech Perception: Measuring the Intelligibility of Animated Talking Heads

  • Slim Ouni (Email author) and three co-authors
EURASIP Journal on Audio, Speech, and Music Processing 2007:047891

  • Received: 7 January 2006
  • Accepted: 21 July 2006


Abstract

Animated agents are increasingly common in speech science research and applications. An important challenge is to evaluate the effectiveness of such an agent in terms of the intelligibility of its visible speech. In three experiments, we extend and test the Sumby and Pollack (1954) metric to allow the comparison of an agent against a standard or reference, and we also propose a new metric, based on the fuzzy logical model of perception (FLMP), to describe the benefit provided by a synthetic animated face relative to the benefit provided by a natural face. A valid metric would allow direct comparisons across different experiments and would measure the benefit of a synthetic animated face relative to a natural face (or indeed any two conditions), as well as how this benefit varies as a function of the type of synthetic face, the test items (e.g., syllables versus sentences), different individuals, and applications.
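The abstract does not reproduce the formulas, but the Sumby and Pollack (1954) relative gain and the two-alternative FLMP combination rule are commonly written as sketched below. This is an illustrative implementation, not code from the paper; the function names and the example numbers are ours.

```python
# Hedged sketch of the two metric families named in the abstract.
# Proportions are proportions correct in [0, 1].

def relative_visual_benefit(p_av: float, p_a: float) -> float:
    """Sumby & Pollack (1954) style relative gain:
    R = (AV - A) / (1 - A), i.e. the fraction of the auditory-alone
    errors that is recovered when the face is added. Comparing a
    synthetic and a natural face amounts to comparing their R values."""
    if not (0.0 <= p_a < 1.0 and 0.0 <= p_av <= 1.0):
        raise ValueError("p_a must be in [0, 1) and p_av in [0, 1]")
    return (p_av - p_a) / (1.0 - p_a)

def flmp_two_alternatives(a: float, v: float) -> float:
    """FLMP integration for a two-alternative task (Massaro, 1998):
    the auditory support a and visual support v, each in (0, 1), are
    combined multiplicatively and renormalized."""
    return (a * v) / (a * v + (1.0 - a) * (1.0 - v))

# Illustrative numbers: auditory-alone 40% correct, audiovisual 70% correct,
# so the face recovers half of the auditory-alone errors.
print(relative_visual_benefit(0.70, 0.40))  # 0.5
```

Note that when either source is uninformative (support 0.5), the FLMP prediction reduces to the other source's support alone, which is what makes it usable as a baseline for integration efficiency.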


Keywords

  • Logical Model
  • Acoustics
  • Test Item
  • Speech Perception
  • Important Challenge


Authors’ Affiliations

LORIA, Campus Scientifique, BP 239, Vandœuvre-lès-Nancy Cedex, 54506, France
Perceptual Science Laboratory, University of California, Santa Cruz, CA 95064, USA


References

  1. Sumby WH, Pollack I: Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America 1954, 26(2):212-215. doi:10.1121/1.1907309
  2. Benoît C, Mohamadi T, Kandel S: Effects of phonetic context on audio-visual intelligibility of French. Journal of Speech and Hearing Research 1994, 37(5):1195-1203.
  3. Jesse A, Vrignaud N, Cohen MM, Massaro DW: The processing of information from multiple sources in simultaneous interpreting. Interpreting 2000, 5(2):95-115. doi:10.1075/intp.5.2.04jes
  4. Summerfield AQ: Use of visual information for phonetic perception. Phonetica 1979, 36(4-5):314-331. doi:10.1159/000259969
  5. Bailly G, Bérar M, Elisei F, Odisio M: Audiovisual speech synthesis. International Journal of Speech Technology 2003, 6(4):331-346. doi:10.1023/A:1025700715107
  6. Beskow J: Talking heads: models and applications for multimodal speech synthesis. Ph.D. thesis, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden; 2003.
  7. Massaro DW: Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press, Cambridge, Mass, USA; 1998.
  8. Odisio M, Bailly G, Elisei F: Tracking talking faces with shape and appearance models. Speech Communication 2004, 44(1-4):63-82.
  9. Pelachaud C, Badler NI, Steedman M: Generating facial expressions for speech. Cognitive Science 1996, 20(1):1-46. doi:10.1207/s15516709cog2001_1
  10. Massaro DW, Beskow J, Cohen MM, Fry CL, Rodriguez T: Picture my voice: audio to visual speech synthesis using artificial neural networks. In Proceedings of Auditory-Visual Speech Processing (AVSP '99), August 1999, Santa Cruz, Calif, USA. Edited by: Massaro DW. 133-138.
  11. Beskow J, Karlsson I, Kewley J, Salvi G: SYNFACE: a talking head telephone for the hearing-impaired. In Proceedings of the 9th International Conference on Computers Helping People with Special Needs (ICCHP '04), July 2004, Paris, France. Edited by: Miesenberger K, Klaus J, Zagler W, Burger D. 1178-1186.
  12. Bosseler A, Massaro DW: Development and evaluation of a computer-animated tutor for vocabulary and language learning in children with autism. Journal of Autism and Developmental Disorders 2003, 33(6):653-672.
  13. Massaro DW, Light J: Improving the vocabulary of children with hearing loss. Volta Review 2004, 104(3):141-174.
  14. Massaro DW, Light J: Read my tongue movements: bimodal learning to perceive and produce non-native speech /r/ and /l/. Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH '03), September 2003, Geneva, Switzerland. 2249-2252.
  15. Massaro DW, Light J: Using visible speech for training perception and production of speech for hard of hearing individuals. Journal of Speech, Language, and Hearing Research 2004, 47(2):304-320. doi:10.1044/1092-4388(2004/025)
  16. Nass C: Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship. MIT Press, Cambridge, Mass, USA; 2005.
  17. Cohen MM, Walker RL, Massaro DW: Perception of synthetic visual speech. In Speechreading by Humans and Machines: Models, Systems, and Applications. Edited by: Stork DG, Hennecke ME. Springer, Berlin, Germany; 1996:153-168.
  18. Siciliano C, Williams G, Beskow J, Faulkner A: Evaluation of a multilingual synthetic talking face as a communication aid for the hearing impaired. Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS '03), August 2003, Barcelona, Spain. 131-134.
  19. LeGoff B, Guiard-Marigny T, Cohen MM, Benoît C: Real-time analysis-synthesis and intelligibility of talking faces. Proceedings of the 2nd International Conference on Speech Synthesis, September 1994, Newark, NY, USA.
  20. Ouni S, Cohen MM, Massaro DW: Training Baldi to be multilingual: a case study for an Arabic Badr. Speech Communication 2005, 45(2):115-137. doi:10.1016/j.specom.2004.11.008
  21. Grant KW, Walden BE: Evaluating the articulation index for auditory-visual consonant recognition. Journal of the Acoustical Society of America 1996, 100(4):2415-2424. doi:10.1121/1.417950
  22. Bernstein LE, Eberhardt SP: Johns Hopkins Lipreading Corpus Videodisk Set. The Johns Hopkins University, Baltimore, Md, USA; 1986.
  23. Grant KW, Seitz PF: Measures of auditory-visual integration in nonsense syllables and sentences. Journal of the Acoustical Society of America 1998, 104(4):2438-2450. doi:10.1121/1.423751
  24. Grant KW, Walden BE, Seitz PF: Auditory-visual speech recognition by hearing-impaired subjects: consonant recognition, sentence recognition, and auditory-visual integration. Journal of the Acoustical Society of America 1998, 103(5):2677-2690. doi:10.1121/1.422788
  25. Grant KW, Walden BE: Predicting auditory-visual speech recognition in hearing-impaired listeners. Proceedings of the 13th International Congress of Phonetic Sciences, August 1995, Stockholm, Sweden. 3:122-129.
  26. Massaro DW, Cohen MM: Tests of auditory-visual integration efficiency within the framework of the fuzzy logical model of perception. Journal of the Acoustical Society of America 2000, 108(2):784-789. doi:10.1121/1.429611
  27. Massaro DW, Cohen MM, Campbell CS, Rodriguez T: Bayes factor of model selection validates FLMP. Psychonomic Bulletin and Review 2001, 8(1):1-17. doi:10.3758/BF03196136
  28. Chen TH, Massaro DW: Mandarin speech perception by ear and eye follows a universal principle. Perception and Psychophysics 2004, 66(5):820-836. doi:10.3758/BF03194976
  29. Massaro DW: From multisensory integration to talking heads and language learning. In Handbook of Multisensory Processes. Edited by: Calvert G, Spence C, Stein BE. MIT Press, Cambridge, Mass, USA; 2004:153-176.
  30. Lesner SA: The talker. Volta Review 1988, 90(5):89-98.
  31. Johnson K, Ladefoged P, Lindau M: Individual differences in vowel production. Journal of the Acoustical Society of America 1993, 94(2):701-714. doi:10.1121/1.406887
  32. Kricos PB, Lesner SA: Differences in visual intelligibility across talkers. Volta Review 1982, 84:219-225.
  33. Gesi AT, Massaro DW, Cohen MM: Discovery and expository methods in teaching visual consonant and word identification. Journal of Speech and Hearing Research 1992, 35(5):1180-1188.
  34. Montgomery AA, Jackson PL: Physical characteristics of the lips underlying vowel lipreading performance. Journal of the Acoustical Society of America 1983, 73(6):2134-2144. doi:10.1121/1.389537
  35. Preminger JE, Lin H-B, Payen M, Levitt H: Selective visual masking in speechreading. Journal of Speech, Language, and Hearing Research 1998, 41(3):564-575.


© Slim Ouni et al. 2007

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.