- Research Article
- Open Access
Visual Contribution to Speech Perception: Measuring the Intelligibility of Animated Talking Heads
EURASIP Journal on Audio, Speech, and Music Processing volume 2007, Article number: 047891 (2006)
Animated agents are becoming increasingly frequent in research and applications in speech science. An important challenge is to evaluate the effectiveness of the agent in terms of the intelligibility of its visible speech. In three experiments, we extend and test the Sumby and Pollack (1954) metric to allow the comparison of an agent relative to a standard or reference, and also propose a new metric based on the fuzzy logical model of perception (FLMP) to describe the benefit provided by a synthetic animated face relative to the benefit provided by a natural face. A valid metric would allow direct comparisons across different experiments and would give measures of the benefit of a synthetic animated face relative to a natural face (or indeed any two conditions) and how this benefit varies as a function of the type of synthetic face, the test items (e.g., syllables versus sentences), different individuals, and applications.
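To make the comparison concrete, the classic Sumby-Pollack-style measure normalizes the audiovisual gain by the room left for improvement over audio alone: R = (AV − A) / (1 − A), where A and AV are proportions correct in the auditory-alone and audiovisual conditions. The sketch below is illustrative only; the function name and the example accuracies are hypothetical, not values from the article.

```python
def visual_benefit(a_correct: float, av_correct: float) -> float:
    """Relative visual benefit in the Sumby-Pollack tradition:
    (AV - A) / (1 - A), the audiovisual gain normalized by the
    headroom remaining above auditory-alone performance."""
    if not 0.0 <= a_correct < 1.0:
        raise ValueError("auditory-alone accuracy must be in [0, 1)")
    return (av_correct - a_correct) / (1.0 - a_correct)

# Hypothetical accuracies in one noise condition:
natural = visual_benefit(a_correct=0.30, av_correct=0.80)    # ~0.714
synthetic = visual_benefit(a_correct=0.30, av_correct=0.65)  # 0.5

# Benefit of the synthetic face expressed as a fraction of the
# benefit of the natural face (one way to compare the two agents):
relative = synthetic / natural  # 0.7
```

Because R is scaled by 1 − A, it can be compared across noise levels where raw auditory accuracy differs, which is what makes cross-experiment comparisons of agents feasible.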
Sumby WH, Pollack I: Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America 1954,26(2):212-215. 10.1121/1.1907309
Benoît C, Mohamadi T, Kandel S: Effects of phonetic context on audio-visual intelligibility of French. Journal of Speech and Hearing Research 1994,37(5):1195-1203.
Jesse A, Vrignaud N, Cohen MM, Massaro DW: The processing of information from multiple sources in simultaneous interpreting. Interpreting 2000,5(2):95-115. 10.1075/intp.5.2.04jes
Summerfield AQ: Use of visual information for phonetic perception. Phonetica 1979,36(4-5):314-331. 10.1159/000259969
Bailly G, Bérar M, Elisei F, Odisio M: Audiovisual speech synthesis. International Journal of Speech Technology 2003,6(4):331-346. 10.1023/A:1025700715107
Beskow J: Talking heads - models and applications for multimodal speech synthesis, Ph.D. thesis. Department of Speech, Music and Hearing, KTH, Stockholm, Sweden; 2003.
Massaro DW: Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press, Cambridge, Mass, USA; 1998.
Odisio M, Bailly G, Elisei F: Tracking talking faces with shape and appearance models. Speech Communication 2004,44(1–4):63-82.
Pelachaud C, Badler NI, Steedman M: Generating facial expressions for speech. Cognitive Science 1996,20(1):1-46. 10.1207/s15516709cog2001_1
Massaro DW, Beskow J, Cohen MM, Fry CL, Rodriguez T: Picture my voice: audio to visual speech synthesis using artificial neural networks. In Proceedings of Auditory-Visual Speech Processing (AVSP '99), August 1999, Santa Cruz, Calif, USA Edited by: Massaro DW. 133-138.
Beskow J, Karlsson I, Kewley J, Salvi G: SYNFACE-a talking head telephone for the hearing-impaired. In Proceedings of 9th International Conference on Computers Helping People with Special Needs (ICCHP '04), July 2004, Paris, France Edited by: Miesenberger K, Klaus J, Zagler W, Burger D. 1178-1186.
Bosseler A, Massaro DW: Development and evaluation of a computer-animated tutor for vocabulary and language learning in children with autism. Journal of Autism and Developmental Disorders 2003,33(6):653-672.
Massaro DW, Light J: Improving the vocabulary of children with hearing loss. Volta Review 2004,104(3):141-174.
Massaro DW, Light J: Read my tongue movements: bimodal learning to perceive and produce non-native speech /r/ and /l/. Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH '03), September 2003, Geneva, Switzerland 2249-2252.
Massaro DW, Light J: Using visible speech for training perception and production of speech for hard of hearing individuals. Journal of Speech, Language, and Hearing Research 2004,47(2):304-320. 10.1044/1092-4388(2004/025)
Nass C: Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship. MIT Press, Cambridge, Mass, USA; 2005.
Cohen MM, Walker RL, Massaro DW: Perception of synthetic visual speech. In Speechreading by Humans and Machines: Models, Systems, and Applications. Edited by: Stork DG, Hennecke ME. Springer, Berlin, Germany; 1996:153-168.
Siciliano C, Williams G, Beskow J, Faulkner A: Evaluation of a multilingual synthetic talking face as a communication aid for the hearing impaired. Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS '03), August 2003, Barcelona, Spain 131-134.
LeGoff B, Guiard-Marigny T, Cohen MM, Benoît C: Real-time analysis-synthesis and intelligibility of talking faces. Proceedings of the 2nd International Conference on Speech Synthesis, September 1994, Newark, NY, USA
Ouni S, Cohen MM, Massaro DW: Training Baldi to be multilingual: a case study for an Arabic Badr. Speech Communication 2005,45(2):115-137. 10.1016/j.specom.2004.11.008
Grant KW, Walden BE: Evaluating the articulation index for auditory-visual consonant recognition. Journal of the Acoustical Society of America 1996,100(4):2415-2424. 10.1121/1.417950
Bernstein LE, Eberhardt SP: Johns Hopkins Lipreading Corpus Videodisk Set. The Johns Hopkins University, Baltimore, Md, USA; 1986.
Grant KW, Seitz PF: Measures of auditory-visual integration in nonsense syllables and sentences. Journal of the Acoustical Society of America 1998,104(4):2438-2450. 10.1121/1.423751
Grant KW, Walden BE, Seitz PF: Auditory-visual speech recognition by hearing-impaired subjects: consonant recognition, sentence recognition, and auditory-visual integration. Journal of the Acoustical Society of America 1998,103(5):2677-2690. 10.1121/1.422788
Grant KW, Walden BE: Predicting auditory-visual speech recognition in hearing-impaired listeners. Proceedings of the 13th International Congress of Phonetic Sciences, August 1995, Stockholm, Sweden 3: 122-129.
Massaro DW, Cohen MM: Tests of auditory-visual integration efficiency within the framework of the fuzzy logical model of perception. Journal of the Acoustical Society of America 2000,108(2):784-789. 10.1121/1.429611
Massaro DW, Cohen MM, Campbell CS, Rodriguez T: Bayes factor of model selection validates FLMP. Psychonomic Bulletin and Review 2001,8(1):1-17. 10.3758/BF03196136
Chen TH, Massaro DW: Mandarin speech perception by ear and eye follows a universal principle. Perception and Psychophysics 2004,66(5):820-836. 10.3758/BF03194976
Massaro DW: From multisensory integration to talking heads and language learning. In Handbook of Multisensory Processes. Edited by: Calvert G, Spence C, Stein BE. MIT Press, Cambridge, Mass, USA; 2004:153-176.
Lesner SA: The talker. Volta Review 1988,90(5):89-98.
Johnson K, Ladefoged P, Lindau M: Individual differences in vowel production. Journal of the Acoustical Society of America 1993,94(2):701-714. 10.1121/1.406887
Kricos PB, Lesner SA: Differences in visual intelligibility across talkers. Volta Review 1982, 84: 219-225.
Gesi AT, Massaro DW, Cohen MM: Discovery and expository methods in teaching visual consonant and word identification. Journal of Speech and Hearing Research 1992,35(5):1180-1188.
Montgomery AA, Jackson PL: Physical characteristics of the lips underlying vowel lipreading performance. Journal of the Acoustical Society of America 1983,73(6):2134-2144. 10.1121/1.389537
Preminger JE, Lin H-B, Payen M, Levitt H: Selective visual masking in speechreading. Journal of Speech, Language, and Hearing Research 1998,41(3):564-575.
Ouni, S., Cohen, M.M., Ishak, H. et al. Visual Contribution to Speech Perception: Measuring the Intelligibility of Animated Talking Heads. J AUDIO SPEECH MUSIC PROC. 2007, 047891 (2006) doi:10.1155/2007/47891