Visual Contribution to Speech Perception: Measuring the Intelligibility of Animated Talking Heads

Ouni, Slim; Cohen, Michael M; Ishak, Hope; Massaro, Dominic W

doi:10.1155/2007/47891

Research Article
Open access
Published: 23 October 2006

Visual Contribution to Speech Perception: Measuring the Intelligibility of Animated Talking Heads

Slim Ouni¹,
Michael M Cohen²,
Hope Ishak² &
…
Dominic W Massaro²

EURASIP Journal on Audio, Speech, and Music Processing volume 2007, Article number: 047891 (2006) Cite this article

2331 Accesses
23 Citations
3 Altmetric
Metrics details

Abstract

Animated agents are becoming increasingly frequent in research and applications in speech science. An important challenge is to evaluate the effectiveness of the agent in terms of the intelligibility of its visible speech. In three experiments, we extend and test the Sumby and Pollack (1954) metric to allow the comparison of an agent relative to a standard or reference, and also propose a new metric based on the fuzzy logical model of perception (FLMP) to describe the benefit provided by a synthetic animated face relative to the benefit provided by a natural face. A valid metric would allow direct comparisons accross different experiments and would give measures of the benfit of a synthetic animated face relative to a natural face (or indeed any two conditions) and how this benefit varies as a function of the type of synthetic face, the test items (e.g., syllables versus sentences), different individuals, and applications.

[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35]

References

Sumby WH, Pollack I: Visual contribution to speech intelligibility in noise. Journal of Acoustical Society of America 1954,26(2):212-215. 10.1121/1.1907309
Article Google Scholar
Benoît C, Mohamadi T, Kandel S: Effects of phonetic context on audio-visual intelligibility of French. Journal of Speech and Hearing Research 1994,37(5):1195-1203.
Article Google Scholar
Jesse A, Vrignaud N, Cohen MM, Massaro DW: The processing of information from multiple sources in simultaneous interpreting. Interpreting 2000,5(2):95-115. 10.1075/intp.5.2.04jes
Article Google Scholar
Summerfield AQ: Use of visual information for phonetic perception. Phonetica 1979,36(4-5):314-331. 10.1159/000259969
Article Google Scholar
Bailly G, Bérar M, Elisei F, Odisio M: Audiovisual speech synthesis. International Journal of Speech Technology 2003,6(4):331-346. 10.1023/A:1025700715107
Article Google Scholar
Beskow J: Talking heads - models and applications for multimodal speech synthesis, Ph.D. thesis. Department of Speech, Music and Hearing, KTH, Stockholm, Sweden; 2003.
Google Scholar
Massaro DW: Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press, Cambridge, Mass, USA; 1998.
Google Scholar
Odisio M, Bailly G, Elisei F: Tracking talking faces with shape and appearance models. Speech Communication 2004,44(1–4):63-82.
Article Google Scholar
Pelachaud C, Badler NI, Steedman M: Generating facial expressions for speech. Cognitive Science 1996,20(1):1-46. 10.1207/s15516709cog2001_1
Article Google Scholar
Massaro DW, Beskow J, Cohen MM, Fry CL, Rodriguez T: Picture my voice: audio to visual speech synthesis using artificial neural networks. In Proceedings of Auditory-Visual Speech Processing (AVSP '99), August 1999, Santa Cruz, Calif, USA Edited by: Massaro DW. 133-138.
Google Scholar
Beskow J, Karlsson I, Kewley J, Salvi G: SYNFACE-a talking head telephone for the hearing-impaired. In Proceedings of 9th International Conference on Computers Helping People with Special Needs (ICCHP '04), July 2004, Paris, France Edited by: Miesenberger K, Klaus J, Zagler W, Burger D. 1178-1186.
Chapter Google Scholar
Bosseler A, Massaro DW: Development and evaluation of a computer-animated tutor for vocabulary and language learning in children with autism. Journal of Autism and Developmental Disorders 2003,33(6):653-672.
Article Google Scholar
Massaro DW, Light J: Improving the vocabulary of children with hearing loss. Volta Review 2004,104(3):141-174.
Google Scholar
Massaro DW, Light J: Read my tongue movements: bimodal learning to perceive and produce non-native speech /r/and /l/. Proceedings of the 8th European Conference on Speech Communication and Technology (EUROSPEECH '03), September 2003, Geneva, Switzerland 2249-2252.
Google Scholar
Massaro DW, Light J: Using visible speech for training perception and production of speech for hard of hearing individuals. Journal of Speech, Language, and Hearing Research 2004,47(2):304-320. 10.1044/1092-4388(2004/025)
Article Google Scholar
Nass C: Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship. MIT Press, Cambridge, Mass, USA; 2005.
Google Scholar
Cohen MM, Walker RL, Massaro DW: Perception of synthetic visual speech. In Speechreading by Humans and Machines: Models, Systems, and Applications. Edited by: Stork DG, Hennecke ME. Springer, Berlin, Germany; 1996:153-168.
Chapter Google Scholar
Siciliano C, Williams G, Beskow J, Faulkner A: Evaluation of a multilingual synthetic talking face as a communication aid for the hearing impaired. Proceedings of the 15th International Congress of Phonetic Science (ICPhS '03), August 2003, Barcelona, Spain 131-134.
Google Scholar
LeGoff B, Guiard-Marigny T, Cohen MM, Benoît C: Real-time analysis-synthesis and intelligibility of talking faces. Proceedings of the 2nd International Conference on Speech Synthesis, September 1994, Newark, NY, USA
Google Scholar
Ouni S, Cohen MM, Massaro DW: Training Baldi to be multilingual: a case study for an Arabic Badr. Speech Communication 2005,45(2):115-137. 10.1016/j.specom.2004.11.008
Article Google Scholar
Grant KW, Walden BE: Evaluating the articulation index for auditory-visual consonant recognition. Journal of the Acoustical Society of America 1996,100(4):2415-2424. 10.1121/1.417950
Article Google Scholar
Bernstein LE, Eberhardt SP: Johns Hopkins Lipreading Corpus Videodisk Set. The Johns Hopkins University, Baltimore, Md, USA; 1986.
Google Scholar
Grant KW, Seitz PF: Measures of auditory-visual integration in nonsense syllables and sentences. Journal of the Acoustical Society of America 1998,104(4):2438-2450. 10.1121/1.423751
Article Google Scholar
Grant KW, Walden BE, Seitz PF: Auditory-visual speech recognition by hearing-impaired subjects: consonant recognition, sentence recognition, and auditory-visual integration. Journal of the Acoustical Society of America 1998,103(5):2677-2690. 10.1121/1.422788
Article Google Scholar
Grant KW, Walden BE: Predicting auditory-visual speech recognition in hearing-impaired listeners. Proceedings of the 13th International Congress of Phonetic Sciences, August 1995, Stockholm, Sweden 3: 122-129.
Google Scholar
Massaro DW, Cohen MM: Tests of auditory-visual integration efficiency within the framework of the fuzzy logical model of perception. Journal of the Acoustical Society of America 2000,108(2):784-789. 10.1121/1.429611
Article Google Scholar
Massaro DW, Cohen MM, Campbell CS, Rodriguez T: Bayes factor of model selection validates FLMP. Psychonomic Bulletin and Review 2001,8(1):1-17. 10.3758/BF03196136
Article Google Scholar
Chen TH, Massaro DW: Mandarin speech perception by ear and eye follows a universal principle. Perception and Psychophysics 2004,66(5):820-836. 10.3758/BF03194976
Article Google Scholar
Massaro DW: From multisensory integration to talking heads and language learning. In Handbook of Multisensory Processes. Edited by: Calvert G, Spence C, Stein BE. MIT Press, Cambridge, Mass, USA; 2004:153-176.
Google Scholar
Lesner SA: The talker. Volta Review 1988,90(5):89-98.
Google Scholar
Johnson K, Ladefoged P, Lindau M: Individual differences in vowel production. Journal of the Acoustical Society of America 1993,94(2):701-714. 10.1121/1.406887
Article Google Scholar
Kricos PB, Lesner SA: Differences in visual intelligibility across talkers. Volta Review 1982, 84: 219-225.
Google Scholar
Gesi AT, Massaro DW, Cohen MM: Discovery and expository methods in teaching visual consonant and word identification. Journal of Speech and Hearing Research 1992,35(5):1180-1188.
Article Google Scholar
Montgomery AA, Jackson PL: Physical characteristics of the lips underlying vowel lipreading performance. Journal of the Acoustical Society of America 1983,73(6):2134-2144. 10.1121/1.389537
Article Google Scholar
Preminger JE, Lin H-B, Payen M, Levitt H: Selective visual masking in speechreading. Journal of Speech, Language, and Hearing Research 1998,41(3):564-575.
Article Google Scholar

Download references

Author information

Authors and Affiliations

LORIA, Campus Scientifique, BP 239, Vandoeure lès Nancy Cedex, 54506, France
Slim Ouni
Perceptual Science Laboratory, University of California, Santa Cruz, CA, 95064, USA
Michael M Cohen, Hope Ishak & Dominic W Massaro

Authors

Slim Ouni
View author publications
You can also search for this author in PubMed Google Scholar
Michael M Cohen
View author publications
You can also search for this author in PubMed Google Scholar
Hope Ishak
View author publications
You can also search for this author in PubMed Google Scholar
Dominic W Massaro
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Slim Ouni.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Ouni, S., Cohen, M.M., Ishak, H. et al. Visual Contribution to Speech Perception: Measuring the Intelligibility of Animated Talking Heads. J AUDIO SPEECH MUSIC PROC. 2007, 047891 (2006). https://doi.org/10.1155/2007/47891

Download citation

Received: 07 January 2006
Revised: 21 July 2006
Accepted: 21 July 2006
Published: 23 October 2006
DOI: https://doi.org/10.1155/2007/47891

Visual Contribution to Speech Perception: Measuring the Intelligibility of Animated Talking Heads

Abstract

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords