- Open Access
Identifying underlying articulatory targets of Thai vowels from acoustic data based on an analysis-by-synthesis approach
© Prom-on et al.; licensee Springer. 2014
- Received: 26 December 2013
- Accepted: 17 April 2014
- Published: 8 May 2014
This paper investigates the estimation of underlying articulatory targets of Thai vowels as invariant representations of vocal tract shapes by means of analysis-by-synthesis based on acoustic data. The basic idea is to simulate the process of learning speech production as a distal learning task, with acoustic signals of natural utterances in the form of Mel-frequency cepstral coefficients (MFCCs) as input, VocalTractLab, a 3D articulatory synthesizer controlled by target approximation models, as the learner, and stochastic gradient descent as the target training method. To test the effectiveness of this approach, a speech corpus was designed to contain contextual variations of Thai vowels by juxtaposing the nine Thai long vowels in two-syllable sequences. A speech corpus consisting of 81 disyllabic utterances was recorded from a native Thai speaker. Nine vocal tract shapes, each corresponding to a vowel, were estimated by optimizing the vocal tract shape parameters of each vowel to minimize the sum of squared errors of MFCCs between the original and synthesized speech, with stochastic gradient descent used to iteratively optimize the shape parameters. The optimized vocal tract shapes were then used to synthesize Thai vowels both in monosyllables and in disyllabic sequences. The results, both numerical and perceptual, indicate that this model-based analysis strategy allows us to effectively and economically estimate vocal tract shapes that synthesize accurate Thai vowels as well as smooth formant transitions between adjacent vowels.
- Articulatory target
- Articulatory synthesis
- Target approximation
- Acoustic-to-articulatory inversion
- Thai vowels
Speaking requires accurate control of highly variable successive articulatory movements, each involving simultaneous actions of multiple articulators [1, 2], all coordinated in such a way that many layers of meaning are simultaneously encoded. Even more intriguingly, a human child seems to be able to acquire such highly intricate motor skills without specific articulatory instructions and without direct observation of mature speakers' articulators, other than the visible ones such as the lips. The only definitive input the child receives that is articulatorily (as opposed to meaning-wise) informative is the acoustics of the speech utterances. Understanding how proper articulatory skills can be learned from acoustic data, a task known as acoustic-to-articulatory inversion, is therefore key to our understanding of the nature of human speech acquisition and production. Such knowledge is also beneficial to both speech recognition and speech synthesis.
Different approaches have been proposed to achieve acoustic-to-articulatory inversion [6–16]. These methods rely on either explicit mapping between acoustic and articulatory data [6–14] or optimization of articulatory synthesis model parameters [15, 16]. Various methods of mapping between articulatory and acoustic data have been tested, using probabilistic models such as hidden Markov models (HMMs) [6, 7], neural networks [8, 9], codebooks [10–13], or filters. Except for the methods based on the task dynamic (TD) model, however, most share a common drawback of the mapping paradigm: the speech production mechanism is not included in the modeling process, in particular the dynamic movement of speech gestures [1, 2] that produces the smooth spectral transitions observed in natural acoustic data. An alternative approach is to use an analysis-by-synthesis strategy [15, 16] in which parameters of a synthesis model are iteratively adjusted to minimize a cost function, such as the acoustic error between the original speech and speech synthesized with the current parameters. This strategy, when implemented with an articulatory synthesizer of sufficient capacity to generate acoustic data from model parameters, has the potential to achieve the closest simulation of speech learning behavior.
This paper reports the results of a study based on this alternative approach. The study attempts to identify the underlying articulatory targets of Thai vowels by means of model-based optimization. Using the analysis-by-synthesis strategy, the underlying articulatory targets representing the vocal tract shapes are estimated from coarticulated disyllabic vowels. The modeling process iteratively adjusts the articulatory parameters toward the optimal condition by minimizing the acoustic error between the original sounds and those synthesized with the tentative vocal tract parameters. The accuracy of the estimated vocal tract shapes is then evaluated by comparing the formants of the synthetic vowels to the formant trajectories of the natural utterances and to those of previous studies, and by a listening experiment that compared the perceptual accuracy and naturalness of synthetic speech to those of natural speech.
The corpus was designed to cover the full range of contextual vowel variation in Thai to facilitate the modeling process. Thai has nine vowels, in short and long minimal pairs, which are evenly spread across the vowel space. To estimate the articulatory targets of all the Thai vowels, each utterance was designed to have two syllables consisting of only vowels, in the form /V1 V2/, where both V1 and V2 are one of the nine long vowels (/a:/, /i:/, /u:/, /e:/, /ɛ:/, /ɯ:/, /ɤ:/, /o:/, /ɔ:/). There are thus 81 combinations in total. These disyllabic vowel sequences carry no meaning. This design allows us to fully study the spectral changes resulting from transitions between vowels and to simulate their dynamics through computational modeling.
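The full factorial design described above can be enumerated directly. A minimal sketch of the 81 disyllabic combinations (the vowel symbols are taken from the inventory listed in the text):

```python
from itertools import product

# The nine Thai long vowels from the corpus design
vowels = ["a:", "i:", "u:", "e:", "ɛ:", "ɯ:", "ɤ:", "o:", "ɔ:"]

# Every /V1 V2/ pairing, including identical-vowel sequences
corpus = [f"/{v1} {v2}/" for v1, v2 in product(vowels, repeat=2)]
```

A consequence of the design is that each vowel appears exactly nine times in each syllable position, which is what balances positional effects across the corpus.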
Speech data were recorded by a native Thai male speaker who had lived in the Greater Bangkok region for the past 20 years and had no self-reported speech or hearing disabilities. Recordings were made in a sound-treated room at King Mongkut's University of Technology Thonburi, Bangkok, Thailand. The speaker was instructed to produce the disyllabic vowel sequences continuously, with the mid tone on both syllables and without pauses between vowels, as if they formed a simple noun-verb sentence. This places the stress on the second syllable, in accordance with the general rule of Thai pronunciation. No particular normalization was done to remove the effect of stress; nevertheless, the full factorial design of the corpus balances the occurrence of each vowel in both positions. The utterances were recorded at a sampling rate of 22.05 kHz with 16-bit resolution.
The corpus was annotated using Praat. Syllable boundaries were manually marked according to the concept of target approximation, detailed later in Section 2.4. Briefly, the articulation of a segment is defined as a unidirectional movement toward its underlying target. As a result, the moment a movement starts to turn away from the segmental target is viewed as the offset of one segment and the onset of the next. The boundary between two syllables was therefore marked at the point where the spectrogram starts to change. This strategy, also used in our previous studies [20–23], differs from the conventional marking of syllable boundaries and is supported by evidence from speech production. Since the syllables contained only vowels, no consonantal boundaries were marked.
2.2 Overview of analysis-by-synthesis strategy for target estimation
2.3 VocalTractLab, the articulatory synthesizer
The core of the analysis-by-synthesis strategy in this paper is VocalTractLab, an articulatory synthesizer that can generate acoustic output from articulatory parameters. VocalTractLab is capable of generating a full range of speech sounds by controlling vocal tract shapes, aerodynamics, and voice quality [27–29]. It consists of a detailed 3D model of the vocal tract that can be configured to fit the anatomy of a specific speaker, an advanced self-oscillating model of the vocal folds, and an efficient method for aeroacoustic simulation of the speech signal.
The acoustic simulation in VocalTractLab approximates the trachea, the glottis, and the vocal tract as a series of cylindrical tube sections with variable lengths, as shown in Figure 1. The aeroacoustic simulation is based on a transmission-line circuit representation of the tube system (see prior work for a detailed derivation). The simulation accounts for fluid dynamic losses at constrictions, as well as losses due to radiation, soft walls, and viscous friction.
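For intuition about the tube-section representation, the textbook idealization of the vocal tract as a single uniform tube, closed at the glottis and open at the lips, predicts formants at odd quarter-wavelength resonances. This is a simplified sketch of the underlying acoustics only, not VocalTractLab's transmission-line simulation:

```python
def uniform_tube_formants(length_m=0.175, c=350.0, n_formants=3):
    """Resonance frequencies (Hz) of a uniform tube closed at one end
    and open at the other: F_k = (2k - 1) * c / (4 * L).

    length_m: tube length in meters (~17.5 cm, a common adult male value)
    c: speed of sound in warm, humid air (m/s)
    Both defaults are illustrative assumptions, not values from the paper.
    """
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n_formants + 1)]
```

With the defaults, this gives roughly 500, 1500, and 2500 Hz, the classic neutral-vowel pattern; a transmission-line model such as VocalTractLab's refines this by letting each short tube section have its own cross-sectional area and by adding losses.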
2.4 Target approximation model
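In the TA model, the dynamic response of each articulatory parameter to its target is modeled as a critically damped linear system. One common way to write this, using a time constant τ (a sketch consistent with the description below; the paper's exact notation may differ), is the transfer function

```latex
H(s) = \frac{1}{\left(1 + \tau s\right)^{N}}
```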
where N is the order of the system and s is the complex frequency. The higher N, the more bell-shaped the velocity profile associated with a step response of the system, a shape typically found in human target-directed movements. However, with increasing order, the delay between input (target) and output (action) also increases. In VocalTractLab, sixth-order systems are used as a compromise.
Also, as a result of this purely sequential target approximation, no gestural overlap is assumed as far as any particular articulator is concerned: a target approximation movement does not start until the previous one is over. Any apparent overlap between adjacent movements is simulated by the combined effect of cross-boundary state transfer and the articulation strength of the later movement, which determines how quickly it can eliminate the carryover effect of the transferred state.
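The sequential target approximation with cross-boundary state transfer can be sketched as a cascade of first-order stages whose state carries across segment boundaries. The sampling rate and time constant below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def ta_trajectory(targets, durations, tau=0.015, order=6, fs=500):
    """Sequential target approximation for one articulatory parameter.

    Each segment's static target drives a cascade of `order` first-order
    stages (giving an order-th order critically damped step response).
    The filter state is NOT reset at segment boundaries - this
    cross-boundary state transfer is what produces smooth transitions
    without explicit gestural overlap.
    """
    dt = 1.0 / fs
    # Stage outputs; initialised at the first target for simplicity,
    # so the trajectory starts at rest on the first target.
    s = np.full(order, float(targets[0]))
    out = []
    for target, dur in zip(targets, durations):
        for _ in range(int(dur * fs)):
            s[0] += dt * (target - s[0]) / tau   # first stage chases the target
            for k in range(1, order):
                s[k] += dt * (s[k - 1] - s[k]) / tau  # each stage chases the previous
            out.append(s[-1])
    return np.array(out)
```

For example, `ta_trajectory([0.0, 1.0], [0.2, 0.3])` rises smoothly toward 1.0 during the second segment, with a bell-shaped velocity profile rather than an abrupt jump at the boundary.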
A key advantage of TA is that it allows variant surface trajectories, due to phonetic context, stress, speech rate, etc., to be mapped to a single invariant target, as demonstrated in our previous studies in prosody modeling [3, 20–23, 31]. TA simplifies the problem of inverse mapping from acoustics to underlying articulatory targets. This is possible because of the clear separation of the transient and steady-state responses in the TA representation of articulatory movements. Estimating the articulatory targets underlying the movements due to these factors allows us to capture the trend of the variability, which can then be summarized into a single contextually invariant target by analyzing the parameter distribution [21, 22]. This approach of using an invariant target in inverse mapping differs from the Directions Into Velocities of Articulators (DIVA) framework, which defines articulatory targets as regions [32, 33]. The DIVA framework relies on a neural network to map the associations between acoustic and articulatory data; in this sense, DIVA is largely a mapping method. In the TA framework, there is only one invariant articulatory target corresponding to a specific functional condition. The contextual variability due to the transition from one target to another is modeled as a transient response, a by-product of the transition. This means that for each phonetic unit a single target, or a single compound target, can be learned from its many context-sensitive realizations. The feasibility of this approach has been shown in our recent work on F0 modeling [21–23].
List of articulatory target parameters of VocalTractLab:
- Horizontal jaw position
- Tongue tip positions (x, y)
- Tongue blade positions (x, y)
- Tongue body positions (x, y)
- Tongue side elevations 1-4
2.5 Optimization via analysis-by-synthesis
One of the critical issues in modeling the learning process is to determine the level of representation of the observable data. In this paper, we explicitly assume that the calculation of the comparison is done only at the acoustic level. This may not entirely cover the range of inputs that a child receives in the actual learning process, which may also involve orofacial features [34–36]. But it allows us to systematically and separately test the effectiveness of information that may be present in acoustic data independent of the visual features. Therefore, parameters of the visible articulators such as the lips and the jaw are acoustically optimized. Note that this strategy does not prevent future studies from including visual information as additional training input.
The representation of the acoustic kinematics should be sufficiently detailed to allow accurate analysis-by-synthesis but not so detailed as to make the computation infeasible. For segmental learning, we need a spectral representation that best captures the articulatory changes and reflects human speech perception. A good candidate is Mel-frequency cepstral coefficients (MFCCs), which have been used successfully in speech recognition and HMM-based synthesis. In this paper, MFCCs of the surface acoustics of both natural and synthetic speech are calculated using a standard setting in Praat, and the difference between the two is used as the error in the optimization of the articulatory targets.
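The targets are optimized by minimizing the sum of squared MFCC differences between natural and synthetic speech. Under the definitions given below, this cost can be written (a reconstruction in the standard sum-of-squared-errors form) as

```latex
E = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( c_{ij} - \hat{c}_{ij} \right)^{2}
```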
where n is the number of acoustic feature timeframes, m is the number of MFCC coefficients, c_ij is the j-th cepstral coefficient of the i-th frame in the natural utterance, and ĉ_ij is the j-th cepstral coefficient of the i-th frame in the synthetic utterance.
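The iterative optimization can be sketched as finite-difference stochastic gradient descent over the target parameters. The forward model, step sizes, and iteration count below are stand-ins for illustration; in the paper the forward model is VocalTractLab synthesis followed by MFCC extraction, which is far more expensive per evaluation:

```python
import random

def estimate_targets(forward, targets, cost, lr=0.05, eps=1e-3, iters=200):
    """Finite-difference stochastic gradient descent over a target vector.

    forward: maps a target vector to acoustic features (stand-in for
             articulatory synthesis + MFCC extraction)
    cost:    maps acoustic features to a scalar error (e.g. MFCC SSE)

    Each iteration perturbs one randomly chosen parameter, estimates the
    cost gradient from two forward evaluations, and steps downhill.
    """
    targets = list(targets)
    for _ in range(iters):
        i = random.randrange(len(targets))
        base = cost(forward(targets))
        targets[i] += eps
        grad = (cost(forward(targets)) - base) / eps
        targets[i] -= eps            # undo the probe
        targets[i] -= lr * grad      # gradient step
    return targets
```

With a toy identity forward model and a quadratic cost, the estimates converge to the cost minimum; the same loop structure applies when the forward model is a full articulatory synthesizer, only much more slowly.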
Some articulatory parameters may not be entirely independent of others, however. For example, the tongue parameters have been found to be positively correlated with each other in articulatory movements for certain places of articulation, such as alveolar, palatal, and velar . This correlation suggests that there is a constraint weakly tying these parameters together so that the changes in one parameter also affect other parameters, depending on the physiological locations. Such an embodiment relationship can be used to help the optimization process so that the parameter adjustment is more realistic. In this paper, we modeled this embodiment constraint by co-adjusting nearby articulators. For example, whenever the tongue blade parameters (TBX/TBY) were adjusted, those of tongue tip and tongue body (TTX/TTY, TCX/TCY) were also modified by a small amount (20%) in the direction of the main adjustment.
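The co-adjustment described above can be sketched as follows. The parameter names (TBX/TBY, TTX/TTY, TCX/TCY) and the 20% coupling fraction follow the text, but this helper function and its neighbour map are illustrative assumptions, not VocalTractLab's API:

```python
# Embodiment constraint (illustrative): adjusting the tongue blade drags
# the tongue tip and tongue body along by a fraction of the adjustment.
COUPLED = {
    "TBX": ("TTX", "TCX"),
    "TBY": ("TTY", "TCY"),
}

def adjust_with_coupling(params, name, delta, coupling=0.2):
    """Return a copy of `params` with `name` adjusted by `delta` and its
    coupled neighbours co-adjusted by coupling * delta in the same
    direction."""
    new = dict(params)
    new[name] = new.get(name, 0.0) + delta
    for neighbour in COUPLED.get(name, ()):
        new[neighbour] = new.get(neighbour, 0.0) + coupling * delta
    return new
```

For example, raising TBX by 1.0 also raises TTX and TCX by 0.2 each, keeping the optimizer's tongue configurations physiologically plausible.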
2.6 Numerical assessment and perceptual evaluation
After obtaining the optimal target values, the accuracy of the estimated articulatory targets was assessed by comparing the formant tracks of the synthesized utterances with those of the original utterances. Time-normalized formant tracks (F1-F3) of both synthesized and original utterances were extracted using FormantPro, a Praat script for large-scale systematic analysis of continuous formant movements. The comparison was made by measuring, for each syllable, the root mean square error (RMSE) between the synthesized and original utterances.
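The numerical comparison can be sketched as time-normalizing each formant track to a fixed number of points and computing the RMSE. The point count and linear interpolation here are illustrative choices; the paper uses FormantPro's time-normalized tracks and reports RMSE in percent:

```python
import numpy as np

def formant_rmse(natural_track, synth_track, n_points=20):
    """RMSE between two formant tracks after linear time normalization.

    Tracks are sequences of formant values (Hz) sampled over a syllable;
    they may have different lengths, so both are resampled onto a common
    normalized time axis before comparison.
    """
    def time_normalize(track):
        track = np.asarray(track, dtype=float)
        old = np.linspace(0.0, 1.0, len(track))
        new = np.linspace(0.0, 1.0, n_points)
        return np.interp(new, old, track)

    a = time_normalize(natural_track)
    b = time_normalize(synth_track)
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

The same function applies to each of F1-F3 separately, after which per-vowel means can be tabulated as in the results section.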
To assess the synthesis quality, a listening experiment was conducted with native Thai participants, who identified the synthetic vowels and rated their naturalness. Target parameters of the same vowel optimized through the analysis-by-synthesis strategy were averaged across the multiple contexts to form the underlying representation of that vowel. A monosyllabic word for each vowel was then synthesized predictively (no monosyllables were used in training) using the estimated articulatory parameters. The standard vocal tract configuration of a male German speaker provided by VocalTractLab version 2.1 was used to generate the stimuli. It should be noted that while the original vocal tract configuration is derived from a German speaker, the articulatory parameters and the synthesis process are language independent. As controls, natural stimuli of the same words were recorded by the same speaker as in the training corpus, at a sampling rate of 22.05 kHz with 16-bit resolution. Recording was done in a sound-treated room at King Mongkut's University of Technology Thonburi. Both natural and synthetic stimuli had their intensities normalized to 70 dB using Praat. In total, there were 18 stimuli.
Twenty native Thai listeners participated in the experiment, which was conducted with the ExperimentMFC function of Praat. The stimulus words were randomly presented to the listeners, who were asked to first identify the Thai word they heard and then select a naturalness score on a five-level scale from terrible (1) to excellent (5). Listeners were allowed to listen to the stimuli as many times as they preferred.
3.1 Synthesis accuracy of estimated vowel targets
Mean formant RMSEs in percentage of each vowel in each syllable
3.2 Vocal tract shapes of Thai vowels
3.3 Synthesis quality
The results of the present study have shown that it is possible to estimate the underlying articulatory targets of vowels with the surface acoustics of continuous speech and manually annotated segmental boundaries as the input, an articulatory synthesizer controlled by target approximation models as the learner, and analysis-by-synthesis optimization as the training regimen. The numerical assessment in Table 2 and Figure 4 indicates that the learned targets can be used to consistently synthesize acoustic data that closely approximate the natural utterances. The visual impression of the synthesis examples in Figures 5 and 6 suggests a good match in the dynamics of the acoustic data. The perceptual evaluation shows that the underlying articulatory targets learned this way can be used to generate isolated vowels that approximate the originals both perceptually, as shown in Figure 8, and in terms of articulation, as shown in Figure 7. All these results indicate that the estimated articulatory parameters closely represent the underlying targets of the Thai vowels.
One advantage of the present approach over previous attempts is the decoupling of the observed data from the speech production mechanism. Compared to the mapping approaches [6–14], the present study does not use any actual articulatory data in the optimization process but instead exploits the knowledge of speech production mechanisms incorporated in the articulatory synthesizer. This also leaves open the option for future studies to integrate visible articulator data into the scheme to reduce the total degrees of freedom. Another advantage over the mapping approach concerns data unknown to the system: a mapping approach has difficulty applying the learned model directly to new data, since it has to be trained on a specific speaker. In contrast, the present approach has shown that even with a German vocal tract configuration, the system can learn Thai vowels that are numerically and perceptually accurate. Such decoupling of the articulatory mechanisms from the linguistic information enables this approach to be applied to different languages.
Another main feature of the present approach is the use of the TA model. With the TA model, the optimization only needs to estimate the targets, letting the transient responses of the articulatory trajectories be calculated by the TA model. This significantly reduces the degrees of freedom of the estimation to only a set of articulatory parameters, along with a time constant, instead of mapping from every frame of acoustic data to articulatory estimates. The use of the TA model also allows us to simulate the smooth transitions in the acoustic data observed in natural utterances. While this feature may be present in different forms in previous attempts, such as a generalized smoothness constraint, state transitions in HMMs [6, 7], or smoothing algorithms [11–14], the direct incorporation of the TA model into the articulatory synthesizer provides a simple yet effective strategy for simulating articulatory dynamics.
Some mapping studies based on the TD model [9, 10] do take dynamic gestural control into consideration. The TD model provides a mechanism for generating movements of tract variables, using a critically damped second-order system to describe the movement. In TD, gestural movements are assumed to always be completed, and adjacent gestural movements are assumed to overlap. This contrasts with TA, which assumes that targets are not always reached and allows the remaining momentum at the end of a target approximation movement to be transferred to the next interval as its initial conditions.
Further development of the framework for organizing trained targets is still needed. First, no consonants have been simulated, so the effect of gestural overlap between consonant and vowel has not yet been modeled; strategies have yet to be developed to simulate the learning of fully overlapped CV gestures. Second, a timing model needs to be incorporated into the articulatory synthesis, as timing specifications of the segments are required prior to the generation process. Third, visible articulatory data could be directly incorporated into the learning strategy; this would further reduce the degrees of freedom of the optimization process and may further improve the effectiveness of the system. Finally, the acoustic-to-articulatory inversion in the current study is not fully automatic, as the segmentation of continuous utterances into discrete unidirectional movements was done manually. The underlying assumption is that perceptual segmentation is learned prior to the learning of articulatory targets, but the validity of this assumption is not fully established and has to be addressed in future studies.
In this study, we explored the estimation of articulatory targets of Thai vowels as a distal learning task using a model-based analysis-by-synthesis strategy. Articulatory targets as vocal tract parameters of each vowel were iteratively optimized by minimizing the acoustic error between original and synthetic utterances. The estimated vocal tract shape targets were used to synthesize the acoustical vowels, and the perceptual evaluation confirmed the synthesis quality. These results demonstrate that distal learning with an articulatory synthesizer that incorporates knowledge of speech production mechanisms is an effective strategy for the simulation of speech production acquisition.
We would like to thank the Royal Academy of Engineering (UK) for financial support through the Newton International Fellowship Alumni follow-on funding and the Thai Research Fund (Thailand) through the Research Grant for New Researcher (TRG5680096 to SP).
- Mermelstein P: Articulatory model for the study of speech production. J. Acoust. Soc. Am. 1973, 53(4):1070-1082. doi:10.1121/1.1913427
- Saltzman EL, Munhall KG: A dynamical approach to gestural patterning in speech production. Ecol. Psychol. 1989, 1:333-382. doi:10.1207/s15326969eco0104_2
- Xu Y: Speech melody as articulatorily implemented communicative functions. Speech Commun. 2005, 46:220-251. doi:10.1016/j.specom.2005.02.014
- Sun J, Deng L: An overlapping-feature-based phonological model incorporating linguistic constraints: applications to speech recognition. J. Acoust. Soc. Am. 2002, 111:1086. doi:10.1121/1.1420380
- Ling Z, Richmond K, Yamagishi J, Wang R: Integrating articulatory features into HMM-based parametric speech synthesis. IEEE Audio Speech Lang. Process. 2009, 17(6):1171-1185. doi:10.1109/TASL.2009.2014796
- Hofer G, Yamagishi J, Shimodaira H: Speech-driven lip motion generation with a trajectory HMM. In Proceedings of Interspeech 2008, Brisbane, 22–26 September 2008, pp. 2314-2317.
- Tamura M, Kondo S, Masuko T, Kobayashi T: Text-to-visual speech synthesis based on parameter generation from HMM. In Proceedings of ICASSP 98, Seattle, WA, 12–15 May 1998, pp. 3745-3748.
- Uria B, Renals S, Richmond K: A deep neural network for acoustic-articulatory speech inversion. In Proceedings of the NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, Sierra Nevada, Spain, 16 December 2011. http://www.cstr.ed.ac.uk/downloads/publications/2011/articulatory_inversion.pdf
- Mitra V, Nam H, Espy-Wilson CY, Saltzman E, Goldstein L: Retrieving tract variables from acoustics: a comparison of different machine learning strategies. IEEE J. Sel. Topics Signal Process. 2010, 4(6):1027-1045. doi:10.1109/JSTSP.2010.2076013
- Nam H, Mitra V, Tiede M, Hasegawa-Johnson M, Espy-Wilson C, Saltzman E, Goldstein L: A procedure for estimating gestural scores from speech acoustics. J. Acoust. Soc. Am. 2012, 132(6):3980-3989. doi:10.1121/1.4763545
- Schroeter J, Sondhi MM: Dynamic programming search of articulatory codebooks. In Proceedings of ICASSP 1989, vol. 1, Glasgow, UK, 23–26 May 1989, pp. 588-591.
- Ouni S, Laprie Y: Modeling the articulatory space using a hypercube codebook for acoustic-to-articulatory inversion. J. Acoust. Soc. Am. 2005, 118(1):444-460. doi:10.1121/1.1921448
- Potard B, Laprie Y, Ouni S: Incorporation of phonetic constraints in acoustic-to-articulatory inversion. J. Acoust. Soc. Am. 2008, 123(4):2310-2323. doi:10.1121/1.2885747
- Ghosh PK, Narayanan S: A generalized smoothness criterion for acoustic-to-articulatory inversion. J. Acoust. Soc. Am. 2010, 128(4):2162-2172. doi:10.1121/1.3455847
- Panchapagesan S, Alwan A: A study of acoustic-to-articulatory inversion of speech by analysis-by-synthesis using chain matrices and the Maeda articulatory model. J. Acoust. Soc. Am. 2011, 129(4):2144-2162. doi:10.1121/1.3514544
- McGowan R: Recovering articulatory movement from formant frequency trajectories using task dynamics and a genetic algorithm: preliminary model test. Speech Commun. 1994, 14:19-48. doi:10.1016/0167-6393(94)90055-8
- Tingsabadh K, Abramson AS: Thai. J. Int. Phon. Assoc. 1993, 22(1):24-48. doi:10.1017/S0025100300004746
- Boersma P: Praat, a system for doing phonetics by computer. Glot Int. 2001, 5(9/10):314-345.
- Xu Y, Liu F: Tonal alignment, syllable structure and coarticulation: toward an integrated model. Italian J. Linguist. 2006, 18:125-159.
- Prom-on S, Birkholz P, Xu Y: Training an articulatory synthesizer with continuous acoustic data. In Proceedings of Interspeech 2013, Lyon, 25–29 August 2013, pp. 349-353.
- Prom-on S, Thipakorn B, Xu Y: Modeling tone and intonation in Mandarin and English as a process of target approximation. J. Acoust. Soc. Am. 2009, 125:405-424. doi:10.1121/1.3037222
- Xu Y, Prom-on S: Toward invariant functional representations of variable surface fundamental frequency contours: synthesizing speech melody via model-based stochastic learning. Speech Commun. 2014, 57:181-208. doi:10.1016/j.specom.2013.09.013
- Prom-on S, Liu F, Xu Y: Post-low bouncing in Mandarin Chinese: acoustic analysis and computational modeling. J. Acoust. Soc. Am. 2012, 132:421-432. doi:10.1121/1.4725762
- Xu Y, Liu F: Determining the temporal interval of segments with the help of F0 contours. J. Phon. 2007, 35:398-420. doi:10.1016/j.wocn.2006.06.002
- Jordan MI, Rumelhart DE: Forward models: supervised learning with a distal teacher. Cogn. Sci. 1992, 16:307-354. doi:10.1207/s15516709cog1603_1
- Birkholz P: VocalTractLab 2.1 for Windows. 2013. http://www.vocaltractlab.de. Accessed 17 December 2013.
- Birkholz P: Modeling consonant-vowel coarticulation for articulatory speech synthesis. PLoS One 2013, 8(4):e60603. doi:10.1371/journal.pone.0060603
- Birkholz P, Kröger BJ, Neuschaefer-Rube C: Model-based reproduction of articulatory trajectories for consonant-vowel sequences. IEEE Audio Speech Lang. Process. 2011, 19(5):1422-1433. doi:10.1109/TASL.2010.2091632
- Birkholz P, Kröger BJ, Neuschaefer-Rube C: In Proceedings of Interspeech 2011, Florence, 28–31 August 2011, pp. 2681-2684.
- Birkholz P, Jackèl D, Kröger BJ: Simulation of losses due to turbulence in the time-varying vocal system. IEEE Audio Speech Lang. Process. 2007, 15(4):1218-1226. doi:10.1109/TASL.2006.889731
- Xu Y, Wang QE: Pitch targets and their realization: evidence from Mandarin Chinese. Speech Commun. 2001, 33:319-337. doi:10.1016/S0167-6393(00)00063-7
- Guenther FH, Vladusich T: A neural theory of speech acquisition and production. J. Neurolinguist. 2012, 25:402-422. doi:10.1016/j.jneuroling.2009.08.006
- Guenther FH: Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production. Psychol. Rev. 1995, 102:594-621.
- Green JR, Moore CA, Higashikawa M, Steeve RW: The physiologic development of speech motor control: lip and jaw coordination. J. Speech Lang. Hear. Res. 2000, 43:239-255. PMCID: PMC2890218
- Green JR, Moore CA, Reilly KJ: The sequential development of jaw and lip control for speech. J. Speech Lang. Hear. Res. 2002, 45:66-79. doi:10.1044/1092-4388(2002/005). PMCID: PMC2890215
- Harold MP, Barlow SM: Effects of environmental stimulation on infant vocalizations and orofacial dynamics at the onset of canonical babbling. Infant Behav. Dev. 2013, 36:84-93. doi:10.1016/j.infbeh.2012.10.001
- Taylor P: Text-to-Speech Synthesis. Cambridge: Cambridge University Press; 2009.
- Green JR, Wang YT: Tongue-surface movement patterns during speech and swallowing. J. Acoust. Soc. Am. 2003, 113(5):2820-2833. doi:10.1121/1.1562646
- Xu Y: FormantPro version 1.1. http://www.phon.ucl.ac.uk/home/yi/FormantPro. Accessed 24 December 2013.
- McGowan RS, Berger MA: Acoustic-articulatory mapping in vowels by locally weighted regression. J. Acoust. Soc. Am. 2009, 126(4):2011-2032. doi:10.1121/1.3184581
- Abramson AS: The vowels and tones of standard Thai: acoustical measurements and experiments. Bloomington: Indiana University Research Center in Anthropology, Folklore, and Linguistics, Pub. 20; 1962. http://www.haskins.yale.edu/Reprints/HL0035.pdf. Accessed 26 February 2014.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.