Optimization of an Image-Based Talking Head System
© K. Liu and J. Ostermann. 2009
Received: 25 February 2009
Accepted: 3 July 2009
Published: 30 September 2009
This paper presents an image-based talking head system, which includes two parts: analysis and synthesis. The audiovisual analysis part creates a face model of a recorded human subject, which is composed of a personalized 3D mask as well as a large database of mouth images and their related information. The synthesis part generates natural-looking facial animations from phonetic transcripts of text. A critical issue of the synthesis is the unit selection, which selects and concatenates the appropriate mouth images from the database such that they match the spoken words of the talking head. Selection is based on lip synchronization and the similarity of consecutive images. The unit selection is refined in this paper, and Pareto optimization is used to train the unit selection. Experimental results of subjective tests show that most people cannot distinguish our facial animations from real videos.
Generally, the image-based talking head system includes two parts: the offline analysis and the online synthesis. The analysis provides a large database of mouth images and their related information for the synthesis. The quality of synthesized animations depends mainly on the database and the unit selection.
The database contains tens of thousands of mouth images and their associated parameters, such as feature points of mouth images and the motion parameters. If these parameters are not analyzed precisely, the animations look jerky. Instead of the template matching-based feature detection used in earlier work, we use Active Appearance Model (AAM)-based feature point detection [6–8] to locate the facial feature points, which is robust to the illumination changes on the face resulting from head and mouth motions. Another contribution of our work in the analysis is to estimate the head motion using a gradient-based approach rather than a feature point-based approach. Since feature-based motion estimation is very sensitive to the detected feature points, that approach is not stable for the whole sequence.
The training of an image-based facial animation system is time consuming and can only find one of the possible optimal parameter sets [1, 11], such that the facial animation system can only achieve good quality for a limited set of sentences. To better train the facial animation system, an evolutionary algorithm (Pareto optimization) [12, 13] is chosen. Pareto optimization is used to solve a multiobjective problem, that is, to search the parameter space efficiently for optimal parameter sets and to track many optimization targets according to defined objective criteria. In this paper, objective criteria are proposed to train the facial animation system using the Pareto optimization approach.
In the remainder of this paper, we compare our approach to other talking head systems in Section 2. Section 3 gives an overview of the talking head system. Section 4 presents the process of database building. Section 5 refines the unit selection synthesis. The unit selection is optimized by the Pareto optimization approach in Section 6. Experimental results and subjective evaluation are shown in Section 7. Conclusions are given in Section 8.
2. Previous Work
According to the underlying face model, talking heads can be categorized into 3D model-based animation and image-based rendering of models. Image-based facial animation can achieve more realistic animations, while 3D-based approaches are more flexible, since they can render the talking head in any view and under any lighting conditions.
The 3D model-based approach usually requires a mesh of 3D polygons that define the head shape, which can be deformed parametrically to perform facial actions. A texture is mapped over the mesh to render facial parts. Such a facial animation has become a standard defined in ISO/IEC MPEG-4. A typical shortcoming is that the texture is changed during the animation. Pighin et al. present another 3D model-based facial animation system, which can synthesize facial expressions by morphing static 3D models with textures. A more flexible approach is to model the face by 3D morphable models [17, 18]. Hair is not included in the 3D model and the model building is time consuming. Morphed static facial expressions look surprisingly realistic nowadays, whereas a realistic talking head (animation with synchronized audio) is not possible yet. The physics-based animation [19, 20] has an underlying anatomical structure such that the model allows a deformation of the head in anthropometrically meaningful ways. These techniques allow the creation of subjectively pleasing animations. However, due to the complexity of real surfaces, texture, and motion, such talking faces are immediately identified as synthetic.
The image-based approaches analyze the recorded image sequences, and animations are synthesized by combining different facial parts. A 3D model is not necessary for animations. Bregler et al. proposed a prototype called Video Rewrite, which used triphones as the elements of the database. A new video is synthesized by selecting the most appropriate triphone videos. Ezzat et al. developed a multidimensional morphable model (MMM), which is capable of morphing between various basic mouth shapes. Cosatto et al. described another image-based approach with higher realism and flexibility. A large database is built including all facial parts. A new sequence is rendered by stitching facial part images to the correct position in a previously recorded background sequence. Due to the use of a large number of recorded natural images, this technique has the potential of creating realistic animations. For short sentences, animations without expressions can be indistinguishable from real videos.
A talking head can be driven by text or speech. The text-driven talking head consists of a text-to-speech (TTS) synthesizer and the talking head. The TTS synthesizes the audio with phoneme information from the input text. Then the phoneme information drives the talking head. The speech-driven talking head uses phoneme information from original sounds. A text-driven talking head is flexible and can be used in many applications, but the quality of its speech is not as good as that of a speech-driven talking head.
Both the text-driven and the speech-driven talking head face an essential problem: lip synchronization. The mouth movement of the talking head has to match the corresponding audio utterance. Lip synchronization is rather complicated due to the coarticulation phenomenon, which means that a particular mouth shape depends not only on its own phoneme but also on its preceding and succeeding phonemes. Generally, the 3D model-based approaches use a coarticulation model with an articulation mapping between a phoneme and the model's action parameters. Image-based approaches implicitly make use of the coarticulation of the recorded speaker when selecting an appropriate sequence of mouth images. Compared to 3D model-based animations, each frame in the image-based animations looks realistic. However, selecting mouth images that provide a smooth movement remains a challenge.
The mouth movement can be derived from the coarticulation property of the vocal tract. Key-frame-based rendering interpolates the frames between key frames. For example, the basic visemes have been defined as the key frames, with the transitions in the animation based on morphing visemes. A viseme is the basic mouth image corresponding to the speech unit "phoneme"; for example, the phonemes "m", "b", and "p" correspond to the closure viseme. However, this approach does not take into account the coarticulation models [24, 26]. As preceding and succeeding visemes affect the vocal tract, the transition between two visemes is also affected by other neighboring visemes.
Recently, hidden Markov models (HMMs) have been used for lip synchronization. Rao et al. presented a Gaussian mixture-based HMM for converting speech features to facial features. The problem becomes one of estimating the missing facial feature vectors based on trained HMMs and given audio feature vectors. Based on the joint speech and facial probability distribution, conditional expectation values of facial features are calculated as the optimal estimates for given speech data. Only the speech features at a given instant in time are considered to estimate the corresponding facial features. Therefore, this model is sensitive to noise in the input speech. Furthermore, coarticulation is disregarded in the approach. Hence, abrupt changes in the estimated facial features occur and the mouth movement appears jerky.
Building on this work, Choi et al. proposed a Baum-Welch HMM inversion to estimate facial features from speech. The speech-facial HMMs are trained using joint audiovisual observations; optimal facial features are generated directly by Baum-Welch iterations in the Maximum Likelihood (ML) sense. The estimated facial features are used for driving the mouth movement of a 3D face model. In the above two approaches, the facial features are simply parameterized by the mouth width and height. Both lack an explicit and concise articulatory model that simulates the speech production process, which sometimes results in wrong mouth movements.
In contrast to the above models, Xie and Liu developed a Dynamic Bayesian Network (DBN)-structured articulatory model, which takes into account the articulator variables that produce the speech. The articulator variables (with discrete values) are defined as voicing (on, off), velum (open, closed), lip rounding (rounded, slightly rounded, mid, wide), tongue show (touching top teeth, near alveolar ridge, touching alveolar, others), and teeth show (on, off). After training the articulatory model parameters, an EM-based conversion algorithm converts audio to facial features in a maximum likelihood sense. The facial features are parameterized by PCA (Principal Component Analysis). The mouth images are interpolated in PCA space to generate animations. One problem of this approach is that it needs a lot of manual work to determine the values of the articulator variables from the training video clips. Due to the interpolation in PCA space, unnatural images with teeth shining through lips may be generated.
Another image-based facial animation system uses shape and appearance models to create a realistic talking head. Each recorded video is mapped to a trajectory in the model space. In the synthesis, the synthesis units are segments extracted from the trajectories. These units are selected and concatenated by matching the phoneme similarity. The synthesized trajectory in the model space is a sequence of appearance images and 2D feature points. The final animations are created by warping the appearance model to the corresponding feature points. However, the linear texture modes obtained by PCA are unable to model nonlinear variations of the mouth part. Therefore, the talking head has a rendering problem with mouth blurring, which results in unrealistic animations.
Thus, there exists a significant need to improve the coarticulation model for lip synchronization. The image-based approach selects appropriate mouth images matching the desired values from a large database, in order to preserve the mechanism of mouth movement during speaking. As with unit selection synthesis in a text-to-speech synthesizer, the resulting talking heads could achieve the highest naturalness.
3. System Overview of Image-Based Talking Head
4.1. Audio-Visual Analysis
The audio-visual analysis of recorded human subjects results in a database of mouth images and their relevant features suitable for synthesis. The audio and video of a human subject reading texts of a predefined corpus are recorded. As shown in Figure 4(a), the recorded audio and video data are analyzed by motion estimation and an aligner.
Table 1: Phoneme-viseme mapping of SAPI American English Phoneme Representation. There are 43 phonemes and 22 visemes. Phoneme groups that share one viseme type include: ae, ax, ah; ey, eh, uh; sh, ch, jh, zh; iy, y, ih, ix; d, t, n; k, g, ng; p, b, m.
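As a minimal sketch of how such a mapping could be used in code, the following Python snippet groups the phonemes listed above into viseme classes. The integer class labels and the helper name `viseme_of` are illustrative assumptions; they are not the SAPI viseme type numbers.

```python
# Phoneme groups listed above. The integer labels are arbitrary placeholders
# (an assumption), not the SAPI viseme type numbers.
VISEME_GROUPS = [
    ["ae", "ax", "ah"],
    ["ey", "eh", "uh"],
    ["sh", "ch", "jh", "zh"],
    ["iy", "y", "ih", "ix"],
    ["d", "t", "n"],
    ["k", "g", "ng"],
    ["p", "b", "m"],
]

# Invert the groups into a phoneme -> viseme label lookup table.
PHONEME_TO_VISEME = {
    phoneme: label
    for label, group in enumerate(VISEME_GROUPS)
    for phoneme in group
}


def viseme_of(phoneme):
    """Return the placeholder viseme label of a phoneme, or None if unknown."""
    return PHONEME_TO_VISEME.get(phoneme)
```

With such a table, all mouth images recorded for "p", "b", or "m" can serve as candidates for any of the three phonemes.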
The head motion of the recorded videos is estimated and the mouth images are normalized. A 3D face mask is adapted to the first frame of the video using the calibrated camera parameters and 6 facial feature points (4 eye corners and 2 nostrils). A gradient-based motion estimation approach is carried out to compute the rotation and translation parameters of the head movement in the later frames. These motion parameters are used to compensate for head motion such that normalized mouth images can be parameterized by PCA correctly.
4.2. Parameterization of Normalized Mouth Images
Figure 4(b) shows the parameterization of mouth images. As PCA transforms the mouth image data into principal component space, reflecting the original data structure, we use PCA parameters to measure the distance of the mouth images in the objective criteria for system training. In order to maintain the system consistency, PCA is also used to parameterize the mouth images to describe the texture information.
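The PCA parameterization described above can be sketched as follows. This is only an illustration under stated assumptions: the images are random stand-ins for normalized mouth images, the number of components (10) is arbitrary, and the function names `fit_pca` and `project` are hypothetical.

```python
import numpy as np


def fit_pca(images, n_components):
    """Fit PCA on flattened images (one image per row).

    Returns the mean image, the principal axes (one per row), and the
    eigenvalues of the sample covariance for the kept components.
    """
    mean = images.mean(axis=0)
    centered = images - mean
    # SVD of the centered data: the rows of vt are the principal axes,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = singular_values ** 2 / (len(images) - 1)
    return mean, vt[:n_components], eigenvalues[:n_components]


def project(images, mean, axes):
    """Map images to their PCA parameters."""
    return (images - mean) @ axes.T


rng = np.random.default_rng(0)
mouth_images = rng.random((50, 16 * 16))  # synthetic stand-in data (assumption)
mean, axes, eigenvalues = fit_pca(mouth_images, n_components=10)
params = project(mouth_images, mean, axes)  # one 10-dimensional vector per image
```

Each row of `params` is then a compact texture descriptor that can be compared with the descriptors of other mouth images.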
The geometric parameters, such as mouth corner points and lip position, are obtained by a template matching-based approach in the reference system. This method is very sensitive to the illumination changes resulting from mouth movement and head motion during speaking, even though the environment lighting is consistent in the studio. Furthermore, the detection of the mouth corners may be less accurate when the mouth is very wide open. The same problem also exists in the detection of eye corners, which results in incorrect motion estimation and normalization.
In order to detect stable and precise feature points, AAM-based feature point detection has been proposed. AAM-based feature detection uses not only the texture but also the shape of the face. AAM models are built from a training set including different appearances. The shape is manually marked. Because the AAM is built in a PCA space, the AAM is not sensitive to the illumination change on the face, provided there are enough training data to construct the PCA space. Typically the training data set consists of about 20 mouth images.
5.1. Unit Selection
First, a search graph is built. Each frame is populated with a list of candidate mouth images that belong to the viseme corresponding to the phoneme of the frame. Using a viseme instead of a phoneme increases the number of valid candidates for a given target, given the relatively small database. Each candidate is fully connected to the candidates of the next frame. The connectivity of the candidates builds a search graph as depicted in Figure 6. Target costs are assigned to each candidate and concatenation costs are assigned to each connection. A Viterbi search through the graph finds the optimal path with minimal total cost. As shown in Figure 6, the selected sequence is composed of several segments. The segments are extracted from the recorded sequence. Lip synchronization is achieved by defining target costs that are small for images recorded with the same phonetic context as the current image to be synthesized.
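The search described above can be sketched in a few lines of Python. This is a generic Viterbi search over a fully connected candidate graph, not the authors' implementation; the function name `viterbi_select`, the toy cost values, and the toy concatenation cost (zero for frames that follow each other in the recording, one otherwise) are assumptions made for illustration.

```python
def viterbi_select(target_costs, concat_cost):
    """Find the candidate path with minimal total cost.

    target_costs[t][k] is the target cost of candidate k at frame t.
    concat_cost(t, j, k) is the cost of connecting candidate j at frame
    t - 1 to candidate k at frame t.
    Returns (list of chosen candidate positions per frame, total cost).
    """
    best = list(target_costs[0])  # best path cost ending in each candidate
    back = []                     # back-pointers for frames 1..T-1
    for t in range(1, len(target_costs)):
        new_best, pointers = [], []
        for k, tc in enumerate(target_costs[t]):
            costs = [best[j] + concat_cost(t, j, k) for j in range(len(best))]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min] + tc)
            pointers.append(j_min)
        best = new_best
        back.append(pointers)
    # Trace the cheapest path back from the last frame.
    k = min(range(len(best)), key=best.__getitem__)
    total = best[k]
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path, total


# Toy example: candidates are frame indices in the recorded database.
candidates = [[10, 40], [11, 41], [12, 70]]
target = [[0.2, 0.1], [0.2, 0.1], [0.2, 0.1]]


def concat(t, j, k):
    # Consecutive recorded frames concatenate for free (toy assumption).
    return 0.0 if candidates[t][k] == candidates[t - 1][j] + 1 else 1.0


path, total = viterbi_select(target, concat)
selected = [candidates[t][k] for t, k in enumerate(path)]
```

In this toy graph the search prefers the single recorded segment 10, 11, 12 over candidates with lower target cost, because switching segments incurs a concatenation cost.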
The target cost relies on a distance between phonemes, which is computed from the average PCA weights of the two phonemes in the reduced-dimension PCA space of mouth images. Each PCA component is weighted according to its discrimination; an exponential factor is used for these weights.
The concatenation cost combines a visual cost and a skip cost with the weights wccf and wccg. The two candidates, one from the current frame and one from its neighboring frame, each have a feature vector of the mouth image that considers the articulator features, including teeth, tongue, lips, appearance, and geometric features.
The visual cost measures the Euclidean distance between the two feature vectors in the articulator feature space. Each feature is given a weight which is proportional to its discrimination. For example, the weight for each component of the PCA parameters is proportional to its corresponding eigenvalue of the PCA analysis.
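A weighted Euclidean distance with eigenvalue-proportional weights could look like the sketch below. The exact way the weights enter the distance and the normalization of the weights to sum to one are assumptions; the function names are hypothetical.

```python
import numpy as np


def eigenvalue_weights(eigenvalues):
    """Weights proportional to the PCA eigenvalues, normalized to sum to 1
    (the normalization is an assumption made for this sketch)."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return eigenvalues / eigenvalues.sum()


def weighted_distance(u1, u2, weights):
    """Weighted Euclidean distance between two feature vectors."""
    u1 = np.asarray(u1, dtype=float)
    u2 = np.asarray(u2, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(weights * (u1 - u2) ** 2)))


weights = eigenvalue_weights([4.0, 1.0])                    # -> [0.8, 0.2]
d_first = weighted_distance([1.0, 0.0], [0.0, 0.0], weights)
d_second = weighted_distance([0.0, 1.0], [0.0, 0.0], weights)
```

A unit difference in the first component therefore counts more than the same difference in the second component, which reflects the higher discrimination of the first component.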
These weights should be trained. Two approaches have been proposed to train the weights of the unit selection for a speech synthesizer. In the first approach, weight space search, a range of weight sets in the weight space is searched to find the best weight set, which minimizes the difference between the natural waveform and the synthesized waveform. In the second approach, regression training is used to determine the weights for the target cost and the weights for the concatenation cost separately. Exhaustive comparison of the units in the database and multiple linear regression are involved. Both methods are time consuming and the weights are not globally optimal. An approach similar to weight space search has also been presented, which uses only one objective measurement to train the weights of the unit selection. However, other objective measurements are not optimized. Therefore, these approaches are only suboptimal for training the unit selection, which has to find a compromise between partially opposing objective quality measures. Considering multiobjective measurements, a novel training method for optimizing the unit selection is presented in the next section.
5.2. Rendering Performance
The performance of visual speech synthesis depends mainly on the TTS synthesizer, the unit selection, and the OpenGL rendering of the animations. We have measured that the TTS synthesizer has about 10 ms latency in a WLAN network. The unit selection runs as a thread, which only delays the program at the first sentence. The unit selection for the second sentence runs while the first sentence is rendered. Therefore, the unit selection is done in real time. The OpenGL rendering takes most of the time of the animations and relies on the graphics card. For our system (CPU: AMD Athlon XP 1.1 GHz, graphics card: NVIDIA GeForce FX 5200), the rendering needs only 25 ms for each frame of a sequence in CIF format at 25 fps.
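The real-time claim can be checked with simple arithmetic: at 25 fps each frame has a budget of 40 ms, so a rendering time of 25 ms leaves headroom. The small Python sketch below makes that check explicit; the function names are hypothetical.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def is_real_time(render_ms_per_frame, fps):
    """True if one frame is rendered within its time budget."""
    return render_ms_per_frame <= frame_budget_ms(fps)


budget = frame_budget_ms(25)   # 40.0 ms per frame at 25 fps
headroom = budget - 25.0       # 15.0 ms left per frame
```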
6. Unit Selection Training by Pareto Optimization
As discussed in Section 5.1, several weights, influencing TC, CC, and PC, should be trained. Generally, the training set includes several original recorded sentences (as ground truth) which are not included in the database. Using the database, an animation is generated using the given weights for unit selection. We use objective evaluator functions as the Face Image Distance Measure (FIDM). The evaluator functions are the average target cost, the average segment length, and the average visual difference between segments. The average target cost indicates the lip synchronization, while the average segment length and the average visual difference indicate the smoothness.
6.1. Multiobjective Measurements
A mouth sequence with minimal path cost is found by the Viterbi search in the unit selection. Each mouth image in the selected sequence has a target cost and a concatenation cost, the latter including a visual cost and a skip cost.
The average segment length is obtained by dividing the summed length of the selected segments by the number of segments in the final animation; the animation in Figure 6 gives an example.
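The average segment length can be computed from the selected frames as in the following sketch. It assumes that a segment is a run of frames that are consecutive in the recorded database, and the example frame indices are invented for illustration.

```python
def split_segments(frame_indices):
    """Split selected database frame indices into runs of consecutive frames."""
    segments = []
    for idx in frame_indices:
        if segments and idx == segments[-1][-1] + 1:
            segments[-1].append(idx)   # continues the current segment
        else:
            segments.append([idx])     # a jump in the database starts a new one
    return segments


def average_segment_length(frame_indices):
    """Number of selected frames divided by the number of segments."""
    if not frame_indices:
        raise ValueError("empty selection")
    return len(frame_indices) / len(split_segments(frame_indices))


selection = [10, 11, 12, 40, 41, 7]        # hypothetical selected frames
segments = split_segments(selection)       # [[10, 11, 12], [40, 41], [7]]
avg_len = average_segment_length(selection)  # 6 frames / 3 segments = 2.0
```

Longer average segments mean fewer jumps between recorded sequences and therefore smoother animations.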
6.2. Pareto Optimization of Unit Selection
Inspired by ideas from natural evolution, Pareto optimization evolves a population of candidate solutions (i.e., weights), adapting them to multiobjective evaluator functions (i.e., FIDM). This process takes advantage of evolution mechanisms such as the survival of the fittest and genetic material recombination. The fitness test is an evaluation process, which finds the weights that maximize the multiobjective evaluator functions. The Pareto algorithm starts with an initial population. Each individual is a weight vector containing weights to be adjusted. Then, the population is evaluated by the multiobjective evaluator functions (i.e., FIDM). A number of best weight sets are selected to build a new population with the same size as the previous one. The individuals of the new population are recombined in two steps, that is, crossover and mutation. The first step recombines the weight values of two individuals to produce two new children. The children replace their parents in the population. The second step introduces random perturbations to the weights with a given probability. Finally, a new population is obtained to replace the original one, starting the evolutionary cycle again. This process stops when a certain termination criterion is satisfied.
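The two recombination steps can be sketched as follows. This is a generic illustration, not the Pareto optimization software used in this work: one-point crossover, Gaussian mutation, the mutation probability and scale, and the clamping of weights to [0, 1] are all assumptions, and the selection step is omitted.

```python
import random


def crossover(parent_a, parent_b, rng):
    """One-point crossover: two parents produce two children."""
    point = rng.randrange(1, len(parent_a))  # needs at least two weights
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(weights, rng, probability=0.1, scale=0.05):
    """Perturb each weight with the given probability, clamped to [0, 1]."""
    return [min(1.0, max(0.0, w + rng.gauss(0.0, scale)))
            if rng.random() < probability else w
            for w in weights]


def next_generation(selected, rng):
    """Recombine the selected weight vectors pairwise, then mutate."""
    shuffled = selected[:]
    rng.shuffle(shuffled)
    children = []
    for a, b in zip(shuffled[0::2], shuffled[1::2]):
        children.extend(crossover(a, b, rng))  # children replace their parents
    if len(shuffled) % 2:                      # odd one out is carried over
        children.append(shuffled[-1])
    return [mutate(child, rng) for child in children]


rng = random.Random(0)
population = [[rng.random() for _ in range(4)] for _ in range(6)]
population = next_generation(population, rng)
```

The population size stays constant from one generation to the next, as required by the cycle described above.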
Once the Pareto-front is obtained, the best weight combination is located on the Pareto-front. A subjective test is the ultimate way to find the best weight combination, but many weight combinations produce similar results that subjects cannot distinguish. Therefore, it is necessary to define objective measurements to find the best weight combination automatically and objectively.
The measurable criteria reflect the subjective impression of quality. We have performed the following objective evaluations. The similarity of the real sequence and the animated sequence is described by directly comparing the visual parameters of the animated sequence with the real parameters extracted from the original video. We use the cross-correlation of the two visual parameters as the measure of similarity. The visual parameters are the size of the open mouth and the texture parameter.
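For illustration, the Pareto-front of a set of evaluated weight sets can be extracted with a simple non-dominated filter. The sketch assumes that all objectives are minimized (an objective to be maximized, such as the average segment length, would be negated first); the objective values in the example are invented.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives minimized):
    a is no worse in every objective and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the indices of the non-dominated points."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p)
                       for j, q in enumerate(points) if j != i)]


# Hypothetical (average target cost, average visual difference) pairs.
objectives = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
front = pareto_front(objectives)  # (3.0, 4.0) is dominated by (2.0, 3.0)
```

Every point on the front is a different compromise between the objectives, which is why a further criterion is needed to pick one weight set.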
Here the two compared series are the first principal component coefficient of the PCA parameters, or the mouth height, of the mouth image at each frame in the real and in the animated sequence, respectively. The cross-correlation is computed relative to the means of the two series, over the total number of frames of the sequence.
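A common form of such a similarity measure is the normalized cross-correlation at zero lag, sketched below. Whether the measure used in this work has exactly this form (for example, whether it includes a delay term) is an assumption, and the sample series are invented.

```python
import math


def cross_correlation(real, animated):
    """Normalized cross-correlation (zero lag) of two equally long series."""
    if len(real) != len(animated) or len(real) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(real)
    mean_r = sum(real) / n
    mean_a = sum(animated) / n
    numerator = sum((x - mean_r) * (y - mean_a)
                    for x, y in zip(real, animated))
    denominator = math.sqrt(sum((x - mean_r) ** 2 for x in real)
                            * sum((y - mean_a) ** 2 for y in animated))
    if denominator == 0.0:
        raise ValueError("a constant series has no defined correlation")
    return numerator / denominator


# Hypothetical mouth heights per frame in the real and animated sequence.
real_height = [1.0, 2.0, 3.0, 2.0, 1.0]
animated_height = [1.1, 2.1, 2.9, 2.2, 0.9]
similarity = cross_correlation(real_height, animated_height)
```

A value close to 1 indicates that the animated mouth follows the real mouth movement closely.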
7. Experimental Results
7.1. Data Collection
In order to test our talking head system, two data sets are used, comprising the data from our Institute (TNT) and the data from LIPS2008.
In our studio a subject is recorded while reading a corpus including about 300 sentences. A lighting system is designed and developed for an audio-visual recording with high image quality, which minimizes the shadow on the face of a subject and reduces the change of illumination in the recorded sequences. The capturing is done using an HD camera (Thomson LDK 5490). The video is originally captured at 50 fps and then cropped, still at 50 fps. The audio signal is sampled at 48 kHz. 148 utterances are selected to build a database to synthesize animations. The database contains 22 762 normalized mouth images.
The database from LIPS2008 consists of 279 sentences, supporting the phoneme transcription of the texts. The video format is at 50 fps. 180 sentences are selected to build a database for visual speech synthesis. The database contains 36 358 normalized mouth images.
7.2. Unit Selection Optimization
The weight set corresponding to the point on the Pareto-front with maximal similarity is used in the unit selection. Animations generated by the optimized facial animation system are used for the following formal subjective tests.
7.3. Subjective Tests
A subjective test is defined and carried out to evaluate the facial animation system. The goal of the subjective test is to assess the naturalness of the animations, that is, whether they can be distinguished from real videos.
Assessing the quality of a talking head system becomes even more urgent as the animations become more lifelike, since improvements may be more subtle and subjective. A subjective test where observers give feedback is the ultimate measure of quality, although objective measurements used by the Pareto optimization can greatly accelerate the development and also increase the efficiency of subjective tests by focusing them on the important issues. Since a large number of observers is required, preferably from different demographic groups, we designed a Website for subjective tests.
To obtain a fair subjective evaluation, to let the viewers focus on the lips, and to separate out the different factors that influence speech perception, such as head motions and expressions, we selected a short recorded video with neutral expressions and tiny head movements as the background sequence. The mouth images, which are cropped from a recorded video, are overlaid on the background sequence in the correct position and orientation to generate a new video, named the original video. The corresponding real audio is used to generate a synthesized video by the optimized unit selection. Thus a pair of videos uttering the same sentence is ready for subjective tests. Overall, 5 pairs of original and synthesized videos are collected to build a video database available for subjective tests on our Website. The real videos corresponding to the real audios are not part of the database.
A Turing test was performed to evaluate our talking head system. 30 students and employees of Leibniz University of Hanover were invited to take part in the formal subjective tests. All video pairs from the video database were presented in random order, and each pair was shown to the participant only once. Immediately after a video pair was displayed, the participant had to decide which video was the original and which was the synthesized one.
Table 2: Results of the subjective tests for talking heads using the TNT database. 5 video pairs were shown to 30 viewers. The number of viewers who identified the real and synthesized video correctly (NCI) was counted. The correct identifying rate (CIR) for each video pair was calculated.
Table 2 shows the results of the subjective tests. A CIR of 50% is the expected value if the animations are as realistic as the real videos. From the results of the subjective tests, we find that the original videos of video pairs 1 and 5 are correctly recognized by 70% of the viewers. Video pairs 2 and 3 are almost indistinguishable to the viewers, with a CIR approaching 50%. The synthesized video of video pair 4 is judged by most viewers to be the original video.
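The CIR follows directly from the NCI and the number of viewers. The sketch below shows the arithmetic; the NCI values are hypothetical examples chosen to match the percentages mentioned in the text, not the entries of Table 2.

```python
def correct_identifying_rate(nci, n_viewers):
    """CIR in percent: share of viewers who identified the videos correctly."""
    if n_viewers <= 0 or not 0 <= nci <= n_viewers:
        raise ValueError("invalid counts")
    return 100.0 * nci / n_viewers


N_VIEWERS = 30
cir_recognized = correct_identifying_rate(21, N_VIEWERS)  # 70.0 (hypothetical NCI)
cir_chance = correct_identifying_rate(15, N_VIEWERS)      # 50.0, pure guessing
```

A CIR well below 50% means that most viewers took the synthesized video for the original one.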
The talking heads generated using the LIPS2008 database were evaluated at the Interspeech 2008 conference. In comparison to the other participating systems, our proposed talking head system achieved the highest audio-visual consistency in terms of naturalness. The Mean Opinion Score (MOS) of our system was about 3.7 in the subjective test evaluated on a 5-point grading scale (5: Excellent, 4: Good, 3: Fair, 2: Poor, 1: Bad). The original videos were scored with about 4.7.
The subjective tests carried out in our institute show that the talking head generated using the TNT database performs better than the talking head generated using the LIPS2008 database. A reason for the better animation results is the designed light settings, which result in a high quality recording. All viewers think the videos from TNT look better, since the lighting contrast of the image has a big impact on the perception of the overall quality of talking heads in the subjective tests. Furthermore, the shadow and the illumination changes on the face cause problems in motion estimation, which makes the final animations jerky and blinking. Therefore, talking heads generated using the LIPS2008 database do not look as realistic as those generated using the TNT database.
Based on the facial animation system, Web-based interactive services such as E-shop and Newsreader were developed. The demos and related Website are available at http://www.tnt.uni-hannover.de/project/facialanimation/demo/. In addition, the video pairs used for the subjective tests can be downloaded from http://www.tnt.uni-hannover.de/project/facialanimation/demo/subtest/.
8. Conclusions
We have presented the optimization of an image-based talking head system. The image-based talking head system consists of an offline audio-visual analysis and an online unit selection synthesis. In the analysis part, Active Appearance Model (AAM)-based facial feature detection is used to find geometric parameters of mouth images instead of the color template-based approach used as the reference method. By doing so, the accuracy of facial features is improved to subpixel level. In the synthesis part, we have refined the unit selection algorithm. Furthermore, optimization of the unit selection synthesis is a difficult problem because the unit selection is a nonlinear system. A Pareto optimization algorithm is chosen to train the unit selection so that the visual speech synthesis is stable for arbitrary input texts. The optimization criteria include lip synchronization, visual smoothness, and others. Formal subjective tests show that synthesized animations generated by the optimized talking head system match the corresponding audio naturally. More encouragingly, 3 out of 5 synthesized animations are so realistic that the viewers cannot distinguish them from original videos.
In future work, we are planning to record additional videos in which the subject is smiling while speaking. We hope to generate expressive talking heads by switching between the smile and the neutral mouth images.
This research work was funded by EC within FP6 under Grant 511568 with the acronym 3DTV. The authors acknowledge Holger Blume for his support with the Pareto optimization software. The authors would like to thank Tobias Elbrandt for his helpful comments and suggestions in the evaluation of the subjective tests. The authors also wish to thank all the people involved in the subjective tests.
- Cosatto E, Ostermann J, Graf HP, Schroeter J: Lifelike talking faces for interactive services. in Proceedings of the IEEE 2003,91(9):1406-1429. 10.1109/JPROC.2003.817141View ArticleGoogle Scholar
- Liu K, Ostermann J: Realistic talking head for human-car-entertainment services. Proceedings of the Informationssysteme fuer Mobile Anwendungen (IMA '08), September 2008, Braunschweig, Germany 108-118.Google Scholar
- Beskow J: Talking Heads—Models and Applications for Multimodal Speech Synthesis, Doctoral thesis. Department of Speech, Music and Hearing, KTH, Stockholm, Sweden; 2003.Google Scholar
- Pandzic IS, Ostermann J, Millen DR: User evaluation: synthetic talking faces for interactive services. The Visual Computer 1999,15(7-8):330-340. 10.1007/s003710050182View ArticleGoogle Scholar
- Ostermann J, Weissenfeld A: Talking faces—technologies and applications. Proceedings of the International Conference on Pattern Recognition (ICPR '04), August 2004 3: 826-833.Google Scholar
- Cootes TF, Edwards GJ, Taylor CJ: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence 2001,23(6):681-685. 10.1109/34.927467View ArticleGoogle Scholar
- Stegmann MB, Ersbøll BK, Larsen R: FAME—a flexible appearance modeling environment. IEEE Transactions on Medical Imaging 2003,22(10):1319-1331. 10.1109/TMI.2003.817780View ArticleGoogle Scholar
- Liu K, Weissenfeld A, Ostermann J, Luo X: Robust AAM building for morphing in an image-based facial animation system. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '08), June 2008, Hanover, Germany 933-936.Google Scholar
- Weissenfeld A, Urfalioglu O, Liu K, Ostermann J: Robust rigid head motion estimation based on differential evolution. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '06), July 2006, Toronto, Canada 225-228.Google Scholar
- Cosatto E, Graf HP: Photo-realistic talking-heads from image samples. IEEE Transactions on Multimedia 2000,2(3):152-163. 10.1109/6046.865480View ArticleGoogle Scholar
- Weissenfeld A, Liu K, Klomp S, Ostermann J: Personalized unit selection for an image-based facial animation system. Proceedings of the IEEE 7th Workshop on Multimedia Signal Processing (MMSP '05), October 2005, Shanghai, ChinaGoogle Scholar
- Zitzler E, Laumanns M, Bleuler S: A tutorial on evolutionary multiobjective optimization. In Proceedings of the Multiple Objective Metaheuristics (MOMH '03), 2003, Berlin, Germany. Springer;Google Scholar
- Von Livonius J, Blume H, Noll TG: Flexible Umgebung zur Pareto-Optimierung von Algorithmen—Anwendungen in der Videosignalverarbeitung. ITG 2007Google Scholar
- Deng Z, Neumann U: Data-Driven 3D Facial Animation. Springer; 2008.Google Scholar
- Ostermann J: Animation of synthetic faces in MPEG-4. Proceedings of the Computer Animation, June 1998, Philadelphia, Pa, USA 98: 49-55.
- Pighin F, Hecker J, Lischinski D, Szeliski R, Salesin DH: Synthesizing realistic facial expressions from photographs. Proceedings of the 29th ACM annual conference on Computer Graphics (SIGGRAPH'98), July 1998, Orlando, Fla, USA 3: 75-84.
- Blanz V, Vetter T: A morphable model for the synthesis of 3D faces. Proceedings of the 26th ACM Annual Conference on Computer Graphics (SIGGRAPH '99), August 1999, Los Angeles, Calif, USA 187-194.
- Blanz V, Basso C, Poggio T, Vetter T: Reanimating faces in images and video. Proceedings of the Computer Graphics Forum (Eurographics '03), November 2003, Basel, Switzerland 22: 641-650.
- Terzopoulos D, Waters K: Physically-based facial modeling analysis and animation. Journal of Visualization and Computer Animation 1990,1(4):73-80.
- Waters K, Frisbie J: Coordinated muscle model for speech animation. Proceedings of the Graphics Interface Conference, May 1995 163-170.
- Kaehler K, Haber J, Yamauchi H, Seidel H-P: Head shop: generating animated head models with anatomical structure. Proceedings of the ACM Computer Animation Conference (SIGGRAPH '02), 2002 55-63.
- Bregler C, Covell M, Slaney M: Video rewrite: driving visual speech with audio. Proceedings of the ACM Conference on Computer Graphics (SIGGRAPH '97), August 1997, Los Angeles, Calif, USA 353-360.
- Ezzat T, Geiger G, Poggio T: Trainable videorealistic speech animation. Proceedings of the ACM Transactions on Graphics (SIGGRAPH '02), July 2002 21(3):388-397.
- Cohen MM, Massaro DW: Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation. Edited by: Magnenat-Thalmann M, Thalmann D. Springer, Tokyo, Japan; 1993:139-156.
- Ezzat T, Poggio T: MikeTalk: a talking facial display based on morphing visemes. Proceedings of the 7th IEEE Eurographics Workshop on Computer Animation, 1998 96-102.
- Hewlett N, Hardcastle WJ: Coarticulation: Theory, Data and Techniques. Cambridge University Press, Cambridge, UK; 2000.
- Rao RR, Chen T, Mersereau RM: Audio-to-visual conversion for multimedia communication. IEEE Transactions on Industrial Electronics 1998,45(1):12-22.
- Choi K, Luo Y, Hwang J-N: Hidden Markov model inversion for audio-to-visual conversion in an MPEG-4 facial animation system. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology 2001,29(1-2):51-61.
- Xie L, Liu Z-Q: Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Transactions on Multimedia 2007,9(3):500-510.
- Jolliffe I: Principal Component Analysis. Springer, New York, NY, USA; 1989.
- Theobald BJ, Bangham JA, Matthews IA, Cawley GC: Near-videorealistic synthetic talking faces: Implementation and evaluation. Speech Communication 2004,44(1–4):127-140.
- Theobald B, Fagel S, Bailly G, Elisei F: LIPS2008: visual speech synthesis challenge. Proceedings of the Interspeech, 2008 2310-2313.
- Hunt AJ, Black AW: Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '96), 1996 1: 373-376.
- Guenther R: Aufbau eines Mehrkamerastudios fuer audio-visuelle Aufnahmen, Diplomarbeit. Leibniz University of Hannover, Hannover, Germany; 2009.
- LIPS2008: Visual Speech Synthesis Challenge http://www.lips2008.org/
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.