
Visual Contribution to Speech Perception: Measuring the Intelligibility of Animated Talking Heads

Abstract

Animated agents are becoming increasingly frequent in research and applications in speech science. An important challenge is to evaluate the effectiveness of the agent in terms of the intelligibility of its visible speech. In three experiments, we extend and test the Sumby and Pollack (1954) metric to allow the comparison of an agent relative to a standard or reference, and also propose a new metric based on the fuzzy logical model of perception (FLMP) to describe the benefit provided by a synthetic animated face relative to the benefit provided by a natural face. A valid metric would allow direct comparisons across different experiments and would give measures of the benefit of a synthetic animated face relative to a natural face (or indeed any two conditions) and how this benefit varies as a function of the type of synthetic face, the test items (e.g., syllables versus sentences), different individuals, and applications.


References

  1. Sumby WH, Pollack I: Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America 1954, 26(2):212-215. 10.1121/1.1907309

  2. Benoît C, Mohamadi T, Kandel S: Effects of phonetic context on audio-visual intelligibility of French. Journal of Speech and Hearing Research 1994, 37(5):1195-1203.

  3. Jesse A, Vrignaud N, Cohen MM, Massaro DW: The processing of information from multiple sources in simultaneous interpreting. Interpreting 2000, 5(2):95-115. 10.1075/intp.5.2.04jes

  4. Summerfield AQ: Use of visual information for phonetic perception. Phonetica 1979, 36(4-5):314-331. 10.1159/000259969

  5. Bailly G, Bérar M, Elisei F, Odisio M: Audiovisual speech synthesis. International Journal of Speech Technology 2003, 6(4):331-346. 10.1023/A:1025700715107

  6. Beskow J: Talking heads - models and applications for multimodal speech synthesis. Ph.D. thesis, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden; 2003.

  7. Massaro DW: Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press, Cambridge, Mass, USA; 1998.

  8. Odisio M, Bailly G, Elisei F: Tracking talking faces with shape and appearance models. Speech Communication 2004, 44(1-4):63-82.

  9. Pelachaud C, Badler NI, Steedman M: Generating facial expressions for speech. Cognitive Science 1996, 20(1):1-46. 10.1207/s15516709cog2001_1

  10. Massaro DW, Beskow J, Cohen MM, Fry CL, Rodriguez T: Picture my voice: audio to visual speech synthesis using artificial neural networks. In Proceedings of Auditory-Visual Speech Processing (AVSP '99), August 1999, Santa Cruz, Calif, USA. Edited by: Massaro DW. 133-138.

  11. Beskow J, Karlsson I, Kewley J, Salvi G: SYNFACE - a talking head telephone for the hearing-impaired. In Proceedings of the 9th International Conference on Computers Helping People with Special Needs (ICCHP '04), July 2004, Paris, France. Edited by: Miesenberger K, Klaus J, Zagler W, Burger D. 1178-1186.

  12. Bosseler A, Massaro DW: Development and evaluation of a computer-animated tutor for vocabulary and language learning in children with autism. Journal of Autism and Developmental Disorders 2003, 33(6):653-672.

  13. Massaro DW, Light J: Improving the vocabulary of children with hearing loss. Volta Review 2004, 104(3):141-174.

  14. Massaro DW, Light J: Read my tongue movements: bimodal learning to perceive and produce non-native speech /r/ and /l/. Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH '03), September 2003, Geneva, Switzerland. 2249-2252.

  15. Massaro DW, Light J: Using visible speech for training perception and production of speech for hard of hearing individuals. Journal of Speech, Language, and Hearing Research 2004, 47(2):304-320. 10.1044/1092-4388(2004/025)

  16. Nass C: Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship. MIT Press, Cambridge, Mass, USA; 2005.

  17. Cohen MM, Walker RL, Massaro DW: Perception of synthetic visual speech. In Speechreading by Humans and Machines: Models, Systems, and Applications. Edited by: Stork DG, Hennecke ME. Springer, Berlin, Germany; 1996:153-168.

  18. Siciliano C, Williams G, Beskow J, Faulkner A: Evaluation of a multilingual synthetic talking face as a communication aid for the hearing impaired. Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS '03), August 2003, Barcelona, Spain. 131-134.

  19. LeGoff B, Guiard-Marigny T, Cohen MM, Benoît C: Real-time analysis-synthesis and intelligibility of talking faces. Proceedings of the 2nd International Conference on Speech Synthesis, September 1994, Newark, NY, USA.

  20. Ouni S, Cohen MM, Massaro DW: Training Baldi to be multilingual: a case study for an Arabic Badr. Speech Communication 2005, 45(2):115-137. 10.1016/j.specom.2004.11.008

  21. Grant KW, Walden BE: Evaluating the articulation index for auditory-visual consonant recognition. Journal of the Acoustical Society of America 1996, 100(4):2415-2424. 10.1121/1.417950

  22. Bernstein LE, Eberhardt SP: Johns Hopkins Lipreading Corpus Videodisk Set. The Johns Hopkins University, Baltimore, Md, USA; 1986.

  23. Grant KW, Seitz PF: Measures of auditory-visual integration in nonsense syllables and sentences. Journal of the Acoustical Society of America 1998, 104(4):2438-2450. 10.1121/1.423751

  24. Grant KW, Walden BE, Seitz PF: Auditory-visual speech recognition by hearing-impaired subjects: consonant recognition, sentence recognition, and auditory-visual integration. Journal of the Acoustical Society of America 1998, 103(5):2677-2690. 10.1121/1.422788

  25. Grant KW, Walden BE: Predicting auditory-visual speech recognition in hearing-impaired listeners. Proceedings of the 13th International Congress of Phonetic Sciences, August 1995, Stockholm, Sweden. 3:122-129.

  26. Massaro DW, Cohen MM: Tests of auditory-visual integration efficiency within the framework of the fuzzy logical model of perception. Journal of the Acoustical Society of America 2000, 108(2):784-789. 10.1121/1.429611

  27. Massaro DW, Cohen MM, Campbell CS, Rodriguez T: Bayes factor of model selection validates FLMP. Psychonomic Bulletin and Review 2001, 8(1):1-17. 10.3758/BF03196136

  28. Chen TH, Massaro DW: Mandarin speech perception by ear and eye follows a universal principle. Perception and Psychophysics 2004, 66(5):820-836. 10.3758/BF03194976

  29. Massaro DW: From multisensory integration to talking heads and language learning. In Handbook of Multisensory Processes. Edited by: Calvert G, Spence C, Stein BE. MIT Press, Cambridge, Mass, USA; 2004:153-176.

  30. Lesner SA: The talker. Volta Review 1988, 90(5):89-98.

  31. Johnson K, Ladefoged P, Lindau M: Individual differences in vowel production. Journal of the Acoustical Society of America 1993, 94(2):701-714. 10.1121/1.406887

  32. Kricos PB, Lesner SA: Differences in visual intelligibility across talkers. Volta Review 1982, 84:219-225.

  33. Gesi AT, Massaro DW, Cohen MM: Discovery and expository methods in teaching visual consonant and word identification. Journal of Speech and Hearing Research 1992, 35(5):1180-1188.

  34. Montgomery AA, Jackson PL: Physical characteristics of the lips underlying vowel lipreading performance. Journal of the Acoustical Society of America 1983, 73(6):2134-2144. 10.1121/1.389537

  35. Preminger JE, Lin H-B, Payen M, Levitt H: Selective visual masking in speechreading. Journal of Speech, Language, and Hearing Research 1998, 41(3):564-575.


Author information

Correspondence to Slim Ouni.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



Cite this article

Ouni, S., Cohen, M.M., Ishak, H. et al. Visual Contribution to Speech Perception: Measuring the Intelligibility of Animated Talking Heads. J AUDIO SPEECH MUSIC PROC. 2007, 047891 (2006). https://doi.org/10.1155/2007/47891


Keywords

  • Logical Model
  • Acoustics
  • Test Item
  • Speech Perception
  • Important Challenge