Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

Iwano, Koji; Yoshinaga, Tomoaki; Tamura, Satoshi; Furui, Sadaoki

doi:10.1155/2007/64506

Research Article
Open access
Published: 15 March 2007

EURASIP Journal on Audio, Speech, and Music Processing volume 2007, Article number: 064506 (2007)

Abstract

This paper proposes an audio-visual speech recognition method that uses lip information extracted from side-face images to increase noise robustness in mobile environments. The method assumes that lip images can be captured by a small camera installed in a handset. Two kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly in combination with audio features. Phoneme HMMs modeling the audio and visual features are built using the multistream HMM technique. Experiments on Japanese connected-digit speech contaminated with white noise at various SNRs show the effectiveness of the proposed method: using the visual information improves recognition accuracy in all SNR conditions. The visual features remained effective even when the audio HMM was adapted to noise by the MLLR method.
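As a rough illustration of the multistream HMM idea mentioned in the abstract, the sketch below combines an audio-stream and a visual-stream state log-likelihood using stream weights. It is not the authors' implementation. The single diagonal Gaussian per stream, the constraint that the two weights sum to one, the function names, and all numeric values are assumptions made for this example.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def multistream_log_likelihood(audio_ll, visual_ll, audio_weight):
    """Weighted sum of per-stream log-likelihoods.

    Assumption: the visual weight is 1 - audio_weight, so the two
    stream exponents sum to one.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_ll + (1.0 - audio_weight) * visual_ll


# Toy HMM state with one Gaussian per stream (illustrative values only).
audio_ll = diag_gaussian_logpdf([0.2, -0.1], [0.0, 0.0], [1.0, 1.0])
visual_ll = diag_gaussian_logpdf([1.5], [1.0], [0.5])

# A lower audio weight shifts trust toward the visual stream,
# which is the direction one would move as the SNR drops.
for w in (1.0, 0.7, 0.0):
    print(w, multistream_log_likelihood(audio_ll, visual_ll, w))
```

With an audio weight of 1.0 the score reduces to the audio-only likelihood, and with 0.0 it reduces to the visual-only likelihood. Intermediate weights interpolate between the two in the log domain.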

References

Bregler C, Konig Y: "Eigenlips" for robust speech recognition. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '94), April 1994, Adelaide, SA, Australia 2: 669-672.
Tomlinson MJ, Russell MJ, Brooke NM: Integrating audio and visual information to provide highly robust speech recognition. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '96), May 1996, Atlanta, Ga, USA 2: 821-824.
Potamianos G, Cosatto E, Graf HP, Roe DB: Speaker independent audio-visual database for bimodal ASR. Proceedings of ESCA Workshop on Audio-Visual Speech Processing (AVSP '97), September 1997, Rhodes, Greece 65-68.
Neti C, Potamianos G, Luettin J, et al.: Audio-visual speech recognition. In Final Workshop 2000 Report. Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, Md, USA; 2000.
Dupont S, Luettin J: Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia 2000,2(3):141-151. 10.1109/6046.865479
Zhang Y, Levinson S, Huang TS: Speaker independent audio-visual speech recognition. Proceedings of IEEE International Conference on Multi-Media and Expo (ICME '00), July-August 2000, New York, NY, USA 1073-1076.
Chu SM, Huang TS: Bimodal speech recognition using coupled hidden Markov models. Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP '00), October 2000, Beijing, China 2: 747-750.
Miyajima C, Tokuda K, Kitamura T: Audio-visual speech recognition using MCE-based HMMs and model-dependent stream weights. Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP '00), October 2000, Beijing, China 2: 1023-1026.
Iwano K, Tamura S, Furui S: Bimodal speech recognition using lip movement measured by optical-flow analysis. Proceedings of International Workshop on Hands-Free Speech Communication (HSC '01), April 2001, Kyoto, Japan 187-190.
Tamura S, Iwano K, Furui S: Multi-modal speech recognition using optical-flow analysis for lip images. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology 2004,36(2-3):117-124.
Tamura S, Iwano K, Furui S: Improvement of audio-visual speech recognition in cars. Proceedings of the 18th International Congress on Acoustics (ICA '04), April 2004, Kyoto, Japan 4: 2595-2598.
Yoshinaga T, Tamura S, Iwano K, Furui S: Audio-visual speech recognition using new lip features extracted from side-face images. Proceedings of COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction (Robust '04), August 2004, Norwich, UK.
Yoshinaga T, Tamura S, Iwano K, Furui S: Audio-visual speech recognition using lip movement extracted from side-face images. Proceedings of International Conference on Audio-Visual Speech Processing, ISCA Tutorial and Research Workshop (AVSP '03), September 2003, St. Jorioz, France 117-120.
Bovik AC, Desai MD: Basic binary image processing. In Handbook of Image and Video Processing. Edited by: Bovik AC. Academic Press, San Diego, Calif, USA; 2000:37-52.
Horn BKP, Schunck BG: Determining optical flow. Artificial Intelligence 1981,17(1-3):185-203.
Leggetter CJ, Woodland PC: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language 1995,9(2):171-185. 10.1006/csla.1995.0010
Lucey P, Potamianos G: Lipreading using profile versus frontal views. Proceedings of the 8th IEEE Workshop on Multimedia Signal Processing (MMSP '06), October 2006, Victoria, BC, Canada 24-28.
Potamianos G, Graf HP: Discriminative training of HMM stream exponents for audio-visual speech recognition. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), May 1998, Seattle, Wash, USA 6: 3733-3736.
Nakamura S, Ito H, Shikano K: Stream weight optimization of speech and lip image sequence for audio-visual speech recognition. Proceedings of 6th International Conference on Spoken Language Processing (ICSLP '00), October 2000, Beijing, China 3: 20-24.
Tamura S, Iwano K, Furui S: A stream-weight optimization method for audio-visual speech recognition using multi-stream HMMs. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), May 2004, Montreal, Quebec, Canada 1: 857-860.
Tamura S, Iwano K, Furui S: A stream-weight optimization method for multi-stream HMMs based on likelihood value normalization. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), March 2005, Philadelphia, Pa, USA 1: 469-472.

Author information

Authors and Affiliations

Department of Computer Science, Tokyo Institute of Technology, 2-12-1-W8-77 Ookayama, Meguro-ku, Tokyo, 152-8552, Japan
Koji Iwano, Tomoaki Yoshinaga, Satoshi Tamura & Sadaoki Furui

Corresponding author

Correspondence to Koji Iwano.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Iwano, K., Yoshinaga, T., Tamura, S. et al. Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images. J AUDIO SPEECH MUSIC PROC. 2007, 064506 (2007). https://doi.org/10.1155/2007/64506

Received: 12 July 2006
Revised: 24 January 2007
Accepted: 25 January 2007
Published: 15 March 2007
DOI: https://doi.org/10.1155/2007/64506
