Exploring the Effect of Differences in the Acoustic Correlates of Adults' and Children's Speech in the Context of Automatic Speech Recognition
© S. Ghai and R. Sinha. 2010
Received: 1 June 2009
Accepted: 25 January 2010
Published: 21 March 2010
This work explores the effect of mismatches between adults' and children's speech, due to differences in various acoustic correlates, on automatic speech recognition performance under mismatched conditions. The correlates studied in this work include the pitch, the speaking rate, the glottal parameters (open quotient, return quotient, and speed quotient), and the formant frequencies. An effort is made to quantify the effect of these correlates by explicitly normalizing each of them using techniques already available in the literature. Our initial study, done on a connected digit recognition task, shows that among these parameters only the formant frequencies, the pitch, and the speaking rate affect the automatic speech recognition performance. Significant improvements in performance are obtained with normalization of these three parameters. With combined normalization of the pitch, the speaking rate, and the formant frequencies, 80% and 70% relative improvements are obtained over the baseline for children's speech and adults' speech recognition, respectively, under mismatched conditions.
In recent years, the development of speech recognition systems has enhanced the use of machines and other interactive multimedia systems in diverse areas. Nowadays children have also become potential users of these systems and, therefore, there is a need for children's speech recognition. This will make their interaction with machines possible for various tasks like reading tutors, language learning by children, information retrieval, and entertainment applications [2–5]. Most speech recognition systems perform reasonably well for adult users but exhibit severe degradation in the case of child users [6, 7]. Children's speech differs considerably from adults' speech in many important aspects and characteristics. The various acoustic and linguistic differences include differences in the pitch, the formant frequencies, the average phone duration, the speaking rate, the glottal parameters, pronunciation, and grammar. Children have a greater range of values, with different means and variances, for these parameters than adults due to the anatomical and physiological changes occurring during a child's growth, thus resulting in high inter- and intraspeaker acoustic variability. These differences together cause the deterioration in the recognition performance of children's speech on adults' speech trained models and vice versa [9, 10].
Children have nonlinearly increasing formants located at high values [11–13]. They also have high pitch frequency values, causing large spacing between the harmonics [11–13]. These high formant and pitch frequencies are attributed to their inherently short vocal tract and vocal fold lengths, respectively. For instance, five-year-old children have been reported to have 50% higher formant frequency values than adult males. The higher formants of children fall outside the transmission bandwidth of the telephone channel, resulting in a loss of spectral information in the case of narrowband speech recognition. In comparison to the 3-4 formants of an adult present in the 0.3–3.2 kHz bandwidth range, children have only 2-3 formants in that range. The phoneme durations and the average sentence durations of children have also been observed to be nearly 10% longer than those of adults [8, 11, 13, 15], which in turn reduces their speaking rate [7, 8]. The physiological differences among speakers cause differences in the glottal parameters and thus in the source spectrum. For instance, the open quotient (OQ) mainly affects the levels of the lower part of the source spectrum, so that a large OQ typically means a higher level of the lowest few harmonics. The return quotient (RQ) affects the steepness of the source spectrum; a large RQ corresponds to greater attenuation of the higher frequencies. These glottal parameters, namely the open quotient (OQ), the return quotient (RQ), and the speed quotient (SQ), have also been observed to differ between speech from child and adult speakers [17, 18]. Children exhibit less precision in the control of their articulators, especially at the age of 5-6 years, leading to various pronunciation problems like disfluencies, false starts, and extraneous speech. Their vocabulary is smaller than that of adults and sometimes also contains spurious words which are not found in the case of adults.
It has been reported that children of the age of 5 years have about 60% vowel classification accuracy, against about 90% for adults.
Various methodologies and research issues have been investigated for improving children's speech recognition performance on adults' speech trained models. The foremost includes vocal tract length normalization (VTLN) [7, 13]. It diminishes the effect of varying vocal tract length among different speakers by warping the frequency axis of the speech power spectrum during signal analysis. Various forms of speaker adaptation techniques like maximum a posteriori (MAP) and maximum likelihood linear regression (MLLR), speaker adaptive training (SAT) [20, 21], constrained MLLR speaker normalization (CMLSN), and their combinations have also been tried so as to reduce the mismatch of children's speech with adults' speech trained models. SAT performs speaker-specific transformations to compensate for the interspeaker acoustic variations in the training set. It involves MLLR adaptation of the means of the output distributions of continuous density hidden Markov models (HMMs). The CMLSN method transforms the acoustic observation vectors by means of speaker-specific affine transformations obtained through constrained MLLR. In order to cope with age-dependent variability, age-specific modeling of recognizers has also been tried [3, 9, 23]. However, training age-specific speech models requires a huge amount of data from speakers of the target age, making the method costly. To address the linguistic mismatches between children's and adults' speech, language modeling [24, 25] and pronunciation modeling have also been explored.
In contrast to the various feature and model domain techniques, a few recent studies have reported explicit normalization of various differences in the signal domain. A voice transformation technique has been explored which normalizes the children's speech signal before it is fed to the adults' speech trained recognizer. It modifies the speech signal by transforming its pitch using the time-domain pitch-synchronous overlap-add (TD-PSOLA) method and achieves VTLN by linear compression of the spectral envelope of each window. The use of the phase vocoder algorithm has also been demonstrated for achieving the same transformation. Speaking rate normalization in combination with VTLN has also been explored to achieve a better performance for children's speech on an adults' speech trained recognizer.
Motivated by the studies done in [27, 28], this paper explores the independent effect of all of the acoustic sources of mismatch between adults' and children's speech reported in the literature, that is, the pitch, the speaking rate, the formant frequencies, and the glottal parameters (OQ, RQ, and SQ), on the recognition performance on a linguistically neutral task. Among these different acoustic sources of mismatch, the independent effects of the pitch and the glottal parameters on ASR have not been reported so far. The study is done on a limited vocabulary task (i.e., digit recognition) where the linguistic differences would be minimal.
The rest of the paper is organized as follows. Section 2 describes the technique used for transformation of different acoustic parameters of speech signals. Section 3 presents the details about the speech corpus and the experimental setup. Section 4 studies the degree of variation in various acoustic correlates between the adults' and the children's speech data used in this work. Section 5 discusses the results of the recognition experiments and the paper concludes in Section 6.
2. Transformation Procedures
In this work, the pitch, the signal duration (for modifying the speaking rate), and the glottal parameters, namely, the OQ, the RQ, and the SQ of the speech signals, are modified using a recently proposed pitch-synchronous time-scaling (PSTS) method. The PSTS method is reported to provide faithful transformations over a wide range of transformation factors for the above parameters.
For addressing the mismatch in the formant frequencies between adults' and children's speech, the commonly used frequency warping is employed. For warping the frequency axis of the utterances during computation of the mel frequency cepstral coefficient (MFCC) features, the piece-wise linear frequency warping of the filterbank, as supported in the hidden Markov model toolkit (HTK), has been used. In the following subsections, we describe the use of the PSTS method for transforming the average pitch, the signal duration, and the glottal parameters (OQ, RQ, and SQ) of the speech signals.
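The piece-wise linear warping used for VTLN can be sketched as follows. This is a simplified illustration, not HTK's exact implementation: frequencies below an assumed knee are scaled by the warp factor, and a second linear segment maps the remainder so that the band edge stays fixed (`f_cut` and `f_max` are assumed values for illustration).

```python
def warp_frequency(f, alpha, f_cut=0.875, f_max=4000.0):
    """Piece-wise linear frequency warping (HTK-style sketch).

    Below the knee, frequencies are scaled by the warp factor alpha;
    above it, a steeper linear segment maps the remainder so that
    f_max maps onto f_max (keeping the warped axis within the bandwidth).
    """
    f0 = f_cut * f_max  # knee of the piece-wise linear map (assumed placement)
    if f <= f0:
        return alpha * f
    # linear segment from (f0, alpha*f0) to (f_max, f_max)
    slope = (f_max - alpha * f0) / (f_max - f0)
    return alpha * f0 + slope * (f - f0)
```

Applying this map to the filterbank center frequencies with alpha < 1 compresses the spectrum, which is the direction needed when recognizing children's speech on adult-trained models.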
2.1. PSTS Method
The PSTS method involves pitch-synchronous time-scaling of the linear prediction (LP) residual waveform of the speech signal. By time-scaling the short-time signals, the overlapping interval can be changed while maintaining the energy balance of the modified signal. Since the LP residual signal approximates the derivative of the excitation signal, the time-scaling operation also helps in preserving various important parameters of the glottal waveform like the OQ, the RQ, and the SQ. Additionally, it overcomes the problem of energy fluctuations at large pitch modification factors which has been observed in the case of pitch transformation using pitch-synchronous overlap-add-based approaches.
2.1.1. Pitch and Signal Duration Transformation
The pitch marks and the LP residual signal are computed as described in Section 2.1. The modified pitch mark locations are then computed in accordance with the desired pitch and signal duration (for speaking rate) modification. The shift between successive synthesis pitch marks is equal to the desired pitch period. The short-time signals are computed by mapping the synthesis pitch marks onto the estimated analysis pitch marks. Each short-time signal is then time scaled by the ratio of the desired synthesis pitch period to the corresponding analysis pitch period, resulting in the modified short-time signal.
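The placement of synthesis pitch marks described above can be sketched as follows. This is a much-simplified illustration of the idea, assuming a single average analysis pitch period instead of the local period tracking of the actual PSTS method; the function and its interface are hypothetical.

```python
def synthesis_pitch_marks(analysis_marks, pitch_factor, duration_factor):
    """Sketch of synthesis pitch-mark placement (simplified assumption:
    one average analysis pitch period for the whole voiced segment).

    pitch_factor > 1 raises the pitch (shorter synthesis period);
    duration_factor scales the total duration (speaking rate change).
    """
    periods = [b - a for a, b in zip(analysis_marks, analysis_marks[1:])]
    mean_period = sum(periods) / len(periods)
    step = mean_period / pitch_factor        # desired synthesis pitch period
    t_end = analysis_marks[-1] * duration_factor
    marks = [analysis_marks[0]]
    while marks[-1] + step <= t_end:         # shift between successive marks = step
        marks.append(marks[-1] + step)
    return marks
```

Each synthesis mark would then be mapped to its nearest analysis mark, and the corresponding residual short-time signal time scaled by the ratio of the two periods.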
For pitch transformation factors less than one, the spectrum of the speech signal gets compressed, giving rise to an "energy hole" at the higher frequencies. Since in our experiments children's speech is transformed toward adults' speech, this problem is further accentuated. To overcome it, a high-frequency regeneration method based on time-scaling the open phase of the glottal source waveform has been used; reducing the OQ boosts the energy in the high-frequency region of the source spectrum and thus fills the energy hole.
2.1.2. Glottal Parameter Transformation
The pitch marks and the LP residual signal are computed as described in Section 2.1. Corresponding to each pitch cycle, a short-time analysis frame is determined using (1). The following time instants are then estimated for each of the voiced short-time analysis frames: the glottal closure instant, the glottal opening instant, and the instant of the maximum of the glottal flow.
In order to transform the glottal flow parameters, time-scale transformations are applied over the segments corresponding to the glottal flow phases in each of the short-time analysis frames. The segments corresponding to each of the glottal cycle phases are computed from the extracted time instants.
To increase OQ, both the return phase duration and the peak flow duration must be increased; to decrease OQ, both durations must be shortened. Thus, the time-scale factor is equal to the required modification factor for OQ. Due to the time-scale transformation, it is necessary to adjust the duration of the closed phase so as to preserve the pitch period of the glottal waveform.
The return quotient can be increased or decreased by a time scale expansion or compression of the return phase. To maintain the pitch period and the open quotient, the peak flow duration is also time scaled by an adequate factor.
The speed quotient can be increased with a time scale expansion of the opening phase and a time scale compression of the closing phase so that the peak flow duration remains constant. SQ can be decreased by the opposite transformation.
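The OQ modification described above can be illustrated on a single glottal cycle. This is a simplified sketch with a hypothetical four-segment cycle model (opening, closing, return, closed durations); RQ and SQ modifications follow analogously by scaling the corresponding segments.

```python
def scale_open_quotient(opening, closing, ret, closed, oq_factor):
    """Sketch of OQ modification on one glottal cycle (assumed segment model).

    The open phase (opening + closing + return) is time scaled by oq_factor,
    and the closed phase is adjusted so the pitch period stays constant,
    so the new OQ equals oq_factor times the original OQ.
    """
    period = opening + closing + ret + closed
    new_opening = opening * oq_factor
    new_closing = closing * oq_factor
    new_ret = ret * oq_factor
    # adjust the closed phase to preserve the pitch period
    new_closed = period - (new_opening + new_closing + new_ret)
    if new_closed < 0:
        raise ValueError("oq_factor too large for this cycle")
    return new_opening, new_closing, new_ret, new_closed
```

For example, halving the open phase of a cycle with OQ = 0.6 yields a cycle of the same period with OQ = 0.3.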
Finally, the complete synthesis LP residual signal and the modified synthesis speech signal are computed as described in Section 2.1.1. The sample speech files with the average pitch, the average utterance duration, and the average values of the glottal parameters (OQ, RQ, and SQ) modified by different factors are available at "http://www.iitg.ac.in/ece/emstlab/psts.htm" for assessing the quality of the various transformations.
3. Speech Corpus and Experimental Setup
Age group-wise breakup of the children's speech data: age group (in years), number of speakers, and number of utterances.
For the experiments done on the adults' speech trained recognizer, the adults' speech training set referred to as "TR1", the adults' speech test set referred to as "AD", and the children's speech test set referred to as "CH1" have been derived from the TIDIGITS corpus. "TR1" comprises adults' speech data containing a total of 11,016 utterances, or 35,566 digits, from 90 male and 107 female speakers. "AD" comprises adults' speech data containing a total of 3,303 utterances, or 10,813 digits, from 29 male and 52 female speakers. "CH1" comprises the whole of the children's speech data, containing a total of 7,772 utterances, or 25,525 digits, available from 50 boys and 51 girls.
Details of all of the training and test speech sets: number of utterances and number of speakers per set.
The recognition performance is measured in terms of the word error rate (WER), computed as WER = ((Sub + Del + Ins) / N) × 100%, where "Sub" is the number of substitutions, "Del" is the number of deletions, "Ins" is the number of insertions, and N is the total number of words in the reference transcriptions.
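The substitution, deletion, and insertion counts come from a minimal-edit alignment of the hypothesis against the reference. A compact sketch of the computation (a simplified illustration; actual scoring tools such as HTK's HResults also report the full alignment):

```python
def word_error_rate(ref, hyp):
    """Word error rate via Levenshtein alignment:
    WER = (Sub + Del + Ins) / N * 100, with N the number of reference words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                                 # all deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                                 # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return 100.0 * dp[len(r)][len(h)] / len(r)
```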
The connected digit recognizer used in this work has been developed using the HTK toolkit. The 11 digits (0–9 and "OH") are modeled as whole-word left-to-right hidden Markov models (HMMs). Each word model has 16 states with simple left-to-right paths and no skip paths over the states. The observation densities are mixtures of five multivariate Gaussian distributions with diagonal covariance matrices. Silence is explicitly modeled using a three-state HMM having six Gaussian mixtures per state. A single-state short-pause model tied to the middle state of the silence model is also used. A 21-channel filterbank is used for computing the 13-dimensional MFCC (C0 to C12) features. In addition to the base features, their first- and second-order derivatives are also appended, making the final feature dimension 39. Cepstral mean subtraction is applied to all features. The speech is preemphasized using a factor of 0.97, and for analysis a Hamming window of length 25 ms and a frame rate of 100 Hz are used.
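The front-end described above could be expressed as an HTK configuration fragment of roughly the following form. This is a sketch: the parameter names are standard HTK, the values are taken from the text, but whether C0 or log-energy was appended is an assumption (`_0` here).

```
# Sketch of an HTK front-end configuration matching the setup described above
TARGETKIND   = MFCC_0_D_A_Z   # 13 MFCCs (C0 assumed) + deltas + accelerations, CMS
TARGETRATE   = 100000.0       # 10 ms frame shift (units of 100 ns) => 100 Hz frame rate
WINDOWSIZE   = 250000.0       # 25 ms analysis window
USEHAMMING   = T              # Hamming window
PREEMCOEF    = 0.97           # pre-emphasis factor
NUMCHANS     = 21             # mel filterbank channels
NUMCEPS      = 12             # cepstral coefficients C1..C12 (C0 added by _0)
```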
4. Acoustic Analysis of the Speech Database
In this section, we quantify the degree of mismatch in various acoustic correlates of the adults' and the children's speech data used for the recognition experiments in this work. This is done in order to hypothesize the relative effect of normalization of each of these acoustic correlates on the ASR performance under mismatched conditions. The various acoustic correlates that have been analyzed include the pitch, the speaking rate, and the glottal parameters (OQ, RQ, and SQ).
Therefore, on noting the high degree of variation in the average pitch values of the children's and the adults' speech data and the effect of high pitch value on the smooth spectrum corresponding to MFCC feature, it is hypothesized that pitch-normalization would significantly improve the ASR performance for children's speech due to reduction in the pitch-dependent distortions observed in the spectral envelope.
4.2. Speaking Rate
Thus, the state transition probabilities of models trained on speech data with a fast speaking rate (adults' speech) would adversely affect the ASR performance for speech data with a slow speaking rate (children's speech) [36–38]. So, in order for a particular acoustic property to be recognized as the intended phonetic segment, the speaking rate differences need to be normalized.
4.3. Glottal Parameters
5. Experimental Results and Discussion
This section describes our experiments studying the effect of various acoustic correlates in addressing the mismatch between adults' and children's speech on their recognition performance under mismatched conditions. This study is first explored in detail for the recognition of children's speech on adults' speech trained models. Following that, we also present the results of a similar study for the vice-versa condition.
5.1. Children's Speech Recognition on Adults' Speech Trained Models
The adults' speech trained models used in this study have been developed using the adults training set "TR1" derived from the TIDIGITS corpus. The baseline word error rates for the adult test set "AD" and the children test set "CH1" are 0.43% and 11.37%, respectively.
The appropriate value of each acoustic correlate is estimated per utterance in a maximum-likelihood (ML) sense over a grid of candidate transformation factors: the factor chosen for the i-th utterance is the one whose transformed feature X_i maximizes Pr(X_i | λ, W_i), where X_i is the feature corresponding to a particular value of the acoustic correlate for the i-th utterance, λ is the speech recognition model, and W_i is the transcription of the i-th utterance. The transcription W_i is determined by doing an initial recognition pass using the original feature (i.e., with no transformation).
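The maximum-likelihood selection above amounts to a grid search per utterance. A sketch with a hypothetical interface: `transform` stands in for the feature transformation at a given factor, and `log_likelihood` for the alignment score of the features against the model given the first-pass transcription.

```python
def select_factor(utterance, candidates, transform, log_likelihood, transcription):
    """Sketch of the per-utterance ML factor search (hypothetical interface).

    Each candidate factor transforms the features; the factor whose features
    best fit the model given the first-pass transcription is selected.
    """
    best_factor, best_score = None, float("-inf")
    for alpha in candidates:
        feats = transform(utterance, alpha)           # e.g. pitch-normalized features
        score = log_likelihood(feats, transcription)  # forced-alignment score
        if score > best_score:
            best_factor, best_score = alpha, score
    return best_factor
```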
The appropriate values for transformation of the average pitch frequency, the signal duration (and thus the speaking rate), the glottal source parameters (OQ, RQ, SQ), and the formant frequencies are obtained using the above procedure. In this work, the average pitch frequency, the signal duration, and the glottal source parameters are transformed explicitly in the signal domain prior to feature computation, whereas the formant frequencies are modified in the feature domain. The average pitch of a speech signal is estimated using the ESPS tool available in the WaveSurfer software package. The average speaking rate of a signal is measured in syllables per second, computed as the ratio of the number of syllables in an utterance to the total length of the utterance. Each of the 11 digits constituting the training and test set utterances used in this work comprises only one syllable, except for the digits "zero" and "seven", which contain two syllables each. In the following, we describe in detail the experimental conditions and the results obtained by normalizing each of these acoustic correlates independently as well as in combinations.
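The syllable-rate measure above is straightforward for this vocabulary; a minimal sketch (the helper name and table are illustrative, with the syllable counts taken from the text):

```python
# Syllable counts for the digit vocabulary: one syllable each,
# except "zero" and "seven" (two each), as stated in the text.
SYLLABLES = {d: 1 for d in
             ["one", "two", "three", "four", "five",
              "six", "eight", "nine", "oh"]}
SYLLABLES.update({"zero": 2, "seven": 2})

def speaking_rate(transcription, duration_sec):
    """Average speaking rate in syllables per second."""
    n_syl = sum(SYLLABLES[w] for w in transcription.lower().split())
    return n_syl / duration_sec
```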
For pitch-normalization of the children test set "CH1", the signals are transformed to seven different pitch values ranging from 70 Hz to 250 Hz with a step size of 30 Hz. This pitch range has been chosen based on the pitch distribution of the training data as shown in Figure 2.
Performances of the children test set "CH1" (with a breakup into different pitch groups based on original average pitch values) with and without pitch-normalization. The quantity in brackets shows the number of utterances in that group. The 95% confidence interval for the performance is 0.39 (for the <250 Hz, 250–300 Hz, and >300 Hz pitch groups the confidence intervals turn out to be 0.39, 0.79, and 3.37, resp.).
5.1.2. Speaking Rate
For normalization of the speaking rate of the children test set "CH1" according to that of the adults' speech trained models, the duration of the signals is reduced by factors ranging from 0.6 to 1 with a step size of 0.05, thereby increasing the speaking rate of the signals by factors ranging from 1 to 1.65. The choice of such duration transformation factors is based on the distribution of the speaking rate of the signals belonging to the adults training set as shown in Figure 5.
Performances of the children test set "CH1" with and without normalization of different acoustic correlates of speech. The 95% confidence interval for the performances is 0.39.
Norm. (Speaking Rate)
Norm. (Open Quotient)
Norm. (Return Quotient)
Norm. (Speed Quotient)
Norm. (Formant Frequencies)
5.1.3. Glottal Parameters
For normalizing the variations in the glottal parameters of the children test set "CH1" with respect to those of the adults training set "TR1" used for training the acoustic models, the OQ, RQ, and SQ of the signals are modified, using the ML-based approach, by factors ranging from 0.55 to 1, 0.35 to 1, and 0.45 to 1, respectively, each with a step size of 0.05. The choice of these transformation factors for OQ, RQ, and SQ modification is supported by studies in the literature which report that children's speech has higher OQ, RQ, and SQ values than adults' speech [17, 18, 40]. The recognition performances of the children test set "CH1" with and without normalization of the three glottal parameters are given in Table 4. As hypothesized earlier, none of the glottal parameters gives any significant improvement over the baseline after normalization. Although the glottal parameters have been found to be of significance in the case of one-to-one voice transformation, in the case of ASR, where the acoustic model is trained using data from a large number of speakers, there is enough variation in the glottal parameters within the training set itself, leaving very little mismatch due to differences in the glottal parameters between the training and the test data.
5.1.4. Formant Frequencies
5.1.5. Combined Normalization of Acoustic Correlates
From the study done in the previous subsections analyzing the independent effect of each of the acoustic correlates of speech, that is, the pitch, the speaking rate, the glottal parameters, and the formant frequencies, on children's speech recognition performance, it is noted that significant improvements in the recognition performance are obtained with the normalization of the pitch, the speaking rate, and the formant frequencies only. In this subsection, we study the effect of combined normalization of these three acoustic correlates on children's speech recognition performance. The combined normalization has been done in a sequential manner; that is, for obtaining speech signals normalized in both the speaking rate and the pitch, first the speaking rate of the speech signal is normalized, followed by its pitch-normalization. As mentioned earlier, VTLN is performed in the feature domain whereas the speaking rate and pitch-normalization are done in the signal domain. Thus, to incorporate VTLN in combination with the speaking rate and pitch-normalization, the signal is first speaking rate and/or pitch-normalized, followed by VTLN. Between the speaking rate and pitch-normalization, we have normalized the speaking rate first, since the speaking rate transformation may result in a slight modification of the pitch of the signals.
Performances of the children test set "CH1" with and without normalization of different acoustic correlates of speech in various combinations. The 95% confidence interval for the performances is 0.39.
Norm. (Speaking Rate + Pitch)
Norm. (Speaking Rate + Formant Freq.)
Norm. (Pitch + Formant Freq.)
Norm. (Speaking Rate + Pitch + Formant Freq.)
Norm. (Rate + Formant Freq.)
(Using "Back Off" Procedure)
Norm. (Pitch + Formant Freq.)
(Using "Back Off" Procedure)
Norm. (Speaking Rate + Pitch + Formant Freq.)
(Using "Back Off" Procedure)
Figure 19 shows the distribution of the warp factors estimated for VTLN of the speaking rate normalized, the pitch-normalized, and the combined speaking rate and pitch-normalized signals of the children test set "CH1". It is noted that after speaking rate normalization a larger number of signals choose the optimal warp factor of 0.88 than in the case of the original speech signals, which is consistent with the fact that children's speech spectra need greater compression to align with those of adults' speech. After pitch-normalization, although a larger number of signals again choose the optimal warp factor of 0.88 than in the original speech case, unlike with the speaking rate normalization, for all warp factors greater than 0.92 the number of normalized signals choosing those values increases considerably compared to the original speech case. Further, we found a few signals choosing warp factors greater than 1 after pitch-normalization, as against values close to 0.88 in the original speech case. Such estimation of warp factors after pitch-normalization is inappropriate, as children's speech spectra need compression rather than expansion with respect to those of adults' speech. This behavior is attributed to the distortions introduced in the spectra of a few speech signals during explicit pitch-normalization, as it involves decimation and/or interpolation for the time-scaling operations. To overcome these distortions we have followed a "Back Off" procedure in which any warp factor estimated on the speaking rate and/or pitch-normalized speech that is greater than the one obtained prior to that normalization is replaced by the value obtained on the signal prior to normalization. This allows us to reduce the errors introduced in the warp factor estimation, particularly after pitch-normalization.
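The "Back Off" rule above is a simple per-utterance comparison. A minimal sketch (function name and list interface are illustrative):

```python
def back_off_warp_factors(after_norm, before_norm):
    """"Back Off" selection sketch: if normalization pushed an utterance's
    estimated warp factor above its pre-normalization value (a sign of
    distortion from the time-scaling), fall back to the earlier estimate."""
    return [a if a <= b else b for a, b in zip(after_norm, before_norm)]
```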
The recognition performances on doing VTLN of the speaking rate and/or pitch-normalized signals of the children test set "CH1" using the "Back Off" procedure are also given in the last three rows of Table 5 for ease of comparison. It is noted that using the warp factors estimated with the "Back Off" procedure for VTLN of the speaking rate and/or pitch-normalized signals improves their recognition performances, with significant improvements for the signals involving pitch-normalization. The combined normalization of the pitch, the speaking rate, and the formant frequencies of the signals results in a relative improvement of 80% over the baseline. This shows that the improvements obtained with the pitch and the speaking rate normalization are significant and additive to that obtained with VTLN.
5.2. Adults' Speech Recognition on Children's Speech Trained Models
Following the observations made in Section 5.1, it would be interesting to study the behavior of the normalization of the three identified significant acoustic correlates in the context of adults' speech recognition on children's speech trained models. For this purpose, a new recognizer has been developed using the children training set "TR2" derived from the TIDIGITS corpus. The word error rates for the children test set "CH2" and the adult test set "AD" on this new recognizer are 1.01% and 13.28%, respectively. It is worth noting here that the children's matched-condition error rate is more than twice that of the adults'. This is consistent with the known fact that children's speech has higher intraspeaker variability than adults' speech, leading to a larger variance of the acoustic models.
Performances of the adult test set "AD" with and without normalization of different acoustic correlates. The 95% confidence interval for the performances is 0.64.
Norm. (Formant Frequencies)
Norm. (Speaking Rate + Pitch + Formant Frequencies)
(Using "Back Off" Procedure)
In this work, the effect of differences in various acoustic correlates of speech, namely, the pitch, the speaking rate, the glottal parameters (OQ, RQ, SQ), and the formant frequencies, between children's and adults' speech has been explored in the context of ASR under mismatched conditions. Our study, done on a connected digit recognition task, indicates that the differences in the pitch, the speaking rate, and the formant frequencies significantly affect the ASR performance and thus lead to significant improvements after normalization. On the other hand, the glottal parameters (OQ, RQ, SQ) have not been found to have any significant impact on the ASR performance. The normalization of the three significant acoustic correlates (the pitch, the speaking rate, and the formant frequencies) in various combinations has also been studied. The experimental results show that the improvements due to normalization of these three acoustic correlates can be successfully combined, resulting in overall relative improvements of 80% and 70% over the baseline for children's speech recognition and adults' speech recognition, respectively, under mismatched conditions. Our future work aims at studying the effect of explicit normalization of the pitch and the speaking rate on the cepstral features, since some studies have already related the frequency warping for VTLN to a linear transformation of the cepstra. Such cepstral domain transformations would not only reduce the computational complexity of the normalization process but would also allow us to include the Jacobian of the transformation in the estimation of the normalization factors.
Part of this work has been supported by the ongoing project No. SR/S3/EECE/39/2009 sponsored by the Department of Science and Technology, Government of India.
- Yildiz P: The multimedia interactive theatre by virtual means regarding computational intelligence in space design as HCI and samples from Turkey. International Journal of Humanities and Social Sciences 2008., 2(1):Google Scholar
- Giuliani D, Mich O, Nardon M: A study on the use of a voice interactive system for teaching English to Italian children. Proceedings of the 3rd IEEE International Conference on Advanced Learning Technologies (ICALT '03), July 2003 376-377.View ArticleGoogle Scholar
- Hagen A, Pellom B, Cole R: Children's speech recognition with application to interactive books and tutors. Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU '03), December 2003 186-191.Google Scholar
- Russell M, Brown C, Skilling A, et al.: Applications of automatic speech recognition to speech and language development in young children. Proceedings of the International Conference on Spoken Language Processing (ICSLP '96), October 1996 1: 176-179.View ArticleGoogle Scholar
- Russell M, Series RW, Wallace JL, Brown C, Skilling A: The STAR system: an interactive pronunciation tutor for young children. Computer Speech and Language 2000, 14(2):161-175. 10.1006/csla.2000.0139View ArticleGoogle Scholar
- Burnett D, Fanty M: Rapid unsupervised adaptation to children's speech on a connected-digit task. Proceedings of the International Conference on Spoken Language Processing (ICSLP '96), 1996 2: 1145-1148.View ArticleGoogle Scholar
- Narayanan S, Potamianos A: Creating conversational interfaces for children. IEEE Transactions on Speech and Audio Processing 2002, 10(2):65-78. 10.1109/89.985544View ArticleGoogle Scholar
- Lee S, Potamianos A, Narayanan S: Acoustics of children's speech: developmental changes of temporal and spectral parameters. Journal of the Acoustical Society of America 1999, 105(3):1455-1468. 10.1121/1.426686View ArticleGoogle Scholar
- Wilpon JG, Jacobsen CN: A study of speech recognition for children and the elderly. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '96), 1996 349-352.Google Scholar
- Giuliani D, Gerosa M: Investigating recognition of children's speech. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003 2: 137-140.Google Scholar
- Potamianos G, Narayanan S, Lee S: Analysis of children speech: duration, pitch and formants. Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '97), 1997 473-476.Google Scholar
- Benzeguiba M, Mori RD, Deroo O, et al.: Automatic speech recognition and intrinsic speech variation. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), May 2006 5:Google Scholar
- Gerosa M, Giuliani D, Brugnara F: Acoustic variability and automatic recognition of children's speech. Speech Communication 2007, 49(10-11): 847-860. doi:10.1016/j.specom.2007.01.002.
- Potamianos A, Narayanan S: Robust recognition of children's speech. IEEE Transactions on Speech and Audio Processing 2003, 11(6): 603-616. doi:10.1109/TSA.2003.818026.
- Gerosa M, Giuliani D, Brugnara F: Speaker adaptive acoustic modeling with mixture of adult and children's speech. Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech '05), 2005, 2193-2196.
- Klatt DH, Klatt LC: Analysis, synthesis, and perception of voice quality variations among female and male talkers. Journal of the Acoustical Society of America 1990, 87(2): 820-856. doi:10.1121/1.398894.
- Iseli M, Shue Y-L, Alwan A: Age- and gender-dependent analysis of voice source characteristics. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), 2006, I389-I392.
- Weinrich B, Salz B, Hughes M: Aerodynamic measurements: normative data for children ages 6:0 to 10:11 years. Journal of Voice 2005, 19(3): 326-339. doi:10.1016/j.jvoice.2004.07.009.
- Lee L, Rose RC: Speaker normalization using efficient frequency warping procedures. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '96), May 1996, 1: 353-356.
- Anastasakos T, McDonough J, Schwartz R, Makhoul J: A compact model for speaker-adaptive training. Proceedings of the International Conference on Spoken Language Processing (ICSLP '96), 1996, 1137-1140.
- Gales MJF: Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech and Language 1998, 12(2): 75-98. doi:10.1006/csla.1998.0043.
- Giuliani D, Gerosa M, Brugnara F: Improved automatic speech recognition through speaker normalization. Computer Speech and Language 2006, 20(1): 107-123. doi:10.1016/j.csl.2005.05.002.
- Potamianos A, Narayanan S, Lee S: Automatic speech recognition for children. Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '97), September 1997, Rhodes, Greece, 2371-2374.
- Das S, Nix D, Picheny M: Improvements in children's speech recognition performance. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), 1998, 433-436.
- Hagen A, Pellom B, Cole R: Highly accurate children's speech recognition for interactive reading tutors using subword units. Speech Communication 2007, 49(12): 861-873. doi:10.1016/j.specom.2007.05.004.
- Benzeghiba M, Mori RD, Deroo O, et al.: Automatic speech recognition and speech variability: a review. Speech Communication 2007, 49(10-11): 763-786. doi:10.1016/j.specom.2007.02.006.
- Gustafson J, Sjölander K: Voice transformations for improving children's speech recognition in a publicly available dialogue system. Proceedings of the International Conference on Spoken Language Processing (ICSLP '02), September 2002, 297-300.
- Stemmer G, Hacker C, Steidl S, Nöth E: Acoustic normalization of children's speech. Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '03), 2003, 1313-1316.
- Cabral JP, Oliveira LC: Pitch-synchronous time-scaling for prosodic and voice quality transformations. Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech '05), 2005, 1137-1140.
- Young S, Evermann G, Gales M, et al.: The HTK Book Version 3.4. Cambridge University Engineering Department, Cambridge, UK; 2006.
- Cabral JP, Oliveira LC: Pitch-synchronous time-scaling for high-frequency excitation regeneration. Proceedings of the 9th European Conference on Speech Communication and Technology (Interspeech '05), 2005, 1513-1516.
- Leonard R: A database for speaker-independent digit recognition. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '84), 1984, 42.11.1-42.11.4.
- Sinha R, Ghai S: On the use of pitch normalization for improving children's speech recognition. Proceedings of the International Speech Communication Association (Interspeech '09), September 2009, 568-571.
- Miller J: Effects of speaking rate on segmental distinctions. Perspectives on the Study of Speech 1981, 39-74.
- Peterson G, Lehiste I: Duration of syllable nuclei in English. Journal of the Acoustical Society of America 1960, 32: 693-703. doi:10.1121/1.1908183.
- Pfau T, Faltlhauser R, Ruske G: A combination of speaker normalization and speech rate normalization for automatic speech recognition. Proceedings of the International Conference on Spoken Language Processing (ICSLP '00), October 2000, 4: 362-365.
- Mirghafori N, Fosler E, Morgan N: Towards robustness to fast speech in ASR. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '96), 1996, 1: 335-338.
- Siegler M, Stern R: On the effects of speech rate in large vocabulary speech recognition systems. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '95), May 1995, 1: 612-615.
- Wavesurfer version 1.8.5, open source software from the KTH Speech Group, January 2008, http://www.speech.kth.se/software
- Sulter AM, Wit HP: Glottal volume velocity waveform characteristics in subjects with and without vocal training, related to gender, sound intensity, fundamental frequency, and age. Journal of the Acoustical Society of America 1996, 100(5): 3360-3373. doi:10.1121/1.416977.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.