- Open Access
The influence of speech rate on Fujisaki model parameters
© Mixdorff et al. 2014
Received: 16 January 2014
Accepted: 23 July 2014
Published: 13 August 2014
The current paper examines influences of speech rate on Fujisaki model parameters based on read speech from the BonnTempo-Corpus containing productions by 12 native speakers of German at five different intended tempo levels (very slow, slow, normal, fast, fastest possible). The normal condition was produced at an average rate of 6.34 syllables/s or 100%, the very slow version at 67%, and the fastest version at 161% of the normal rate. We extracted F0 contours and subjected them to decomposition using the Fujisaki model. We ordered all the data with respect to their actual speech rates. First, we assessed how prosodic realizations vary with speech rate and examined phrase command magnitudes, the number of phrase commands as well as the base frequency, accent command amplitudes, and the timing of accent commands with respect to the underlying syllables and their nuclear vowels. Second, we analyzed between-sentence variability within and between speakers and investigated whether and how the prosodic structure is preserved at different speech rates. For very slow speech, we found for some of the speakers that the original phrase structure had disintegrated into something like a list of isolated words separated by pauses. Very fast speech became chains of uniform syllables at very high pitch and with almost flat intonation. With respect to the F0 range reflected by the amplitude of accent commands, we found strong interspeaker differences. While four of the subjects exhibited a significant reduction at higher speech rates, the others did not. As speed increases, it appears that F0 gestures commence earlier in the syllable, that is, the onset time of accent commands is located closer to the syllable/vowel onset than at lower speed.
To date, there are relatively few accounts of the effects of speech rate on fundamental frequency F0. It is well established, for example, that an increase in speaking rate correlates with a decrease in pauses and a decrease in prosodic boundary marking (cf. Goldman-Eisler; Rietveld and Gussenhoven). Caspers and van Heuven reported that rises in Dutch are steeper at fast articulation rates. Ladd et al. showed that in accentual F0 rises, the rise time becomes shorter as the articulation rate increases.
In the current paper, we will employ the well-known Fujisaki model  to examine the dependency of F0 contour on the syllable rate. This model reproduces a given F0 contour by superimposing three components: a speaker-individual base frequency Fb, a phrase component, and an accent component. The phrase component results from impulse responses to impulse-wise phrase commands associated with prosodic breaks. Phrase commands are described by their onset time T0, magnitude Ap, and time constant alpha. The accent component results from step-wise accent commands associated with accented syllables. Accent commands are described by on- and offset times T1 and T2, amplitude Aa, and time constant beta.
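For reference, the superposition described above is commonly written as follows. This is a sketch in the notation of this section; the numbers of commands I and J and the accent ceiling γ (often set to 0.9) are not mentioned in the text and are included here only because they belong to the standard formulation of the model.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p,i}\, G_p(t - T_{0,i})
  + \sum_{j=1}^{J} A_{a,j}\, \bigl[ G_a(t - T_{1,j}) - G_a(t - T_{2,j}) \bigr]

G_p(t) =
  \begin{cases}
    \alpha^2 t\, e^{-\alpha t}, & t \ge 0 \\
    0, & t < 0
  \end{cases}
\qquad
G_a(t) =
  \begin{cases}
    \min\bigl[ 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \bigr], & t \ge 0 \\
    0, & t < 0
  \end{cases}
```

Because the components add in the logarithmic domain, Ap and Aa are dimensionless and scale the F0 contour multiplicatively relative to Fb.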
Within the framework of the Fujisaki model, Fujisaki and Hirose found that phrase command magnitude Ap is lower for speakers with fast articulation rates than for speakers with a normal or slow rate. Fujisaki and Hirose reported that accent command amplitudes Aa, too, decrease in faster articulation rate conditions. Moreover, they showed that faster speaking rate leads to the merging of accent commands. In his D.Eng. thesis, the first author already addressed the influence of speech rate on the intonational features of German, albeit on a very small data set. A single trained speaker was asked to read a short text at comfortable (henceforth ‘medium’), slow, and fast speeds. Analysis showed that the fast version was produced at a speed 28% higher and the slow version at a speed 15% lower than the medium one. Ap and Aa for the N↑ and I↓ intonemes, that is, accents with rising or falling F0, respectively (see Section 2), become smaller when speed increases. This means that the F0 range is reduced. However, for these parameters, the mean difference between fast and slow versions only amounts to 17% for an overall speed difference of about 50%. Interestingly, the change between slow and medium versions was greater than between medium and fast versions, though the difference in speed was not. As noted above, increased speech rate reduces the number of prosodic phrases and also reduces the duration of pauses. It also leads to the merging of some accent commands that are separate at lower speed. In more recent work on Swiss German, Leemann showed that higher articulation rates can lead to a reduction of phrase boundaries, which has an effect on the other intonation phrases, making them overall longer in duration.
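The figure of "about 50%" follows from comparing the fast and slow versions directly rather than adding the two percentages. A minimal Python sketch of that arithmetic, using only the two percentages quoted above (the function name is illustrative):

```python
def rate_factor(percent_change: float) -> float:
    """Convert a percentage change relative to the medium rate into a factor."""
    return 1.0 + percent_change / 100.0


fast = rate_factor(28.0)    # fast version: 28% above the medium rate
slow = rate_factor(-15.0)   # slow version: 15% below the medium rate

# Speed of the fast version expressed relative to the slow version
fast_vs_slow = fast / slow
print(f"fast is {100.0 * (fast_vs_slow - 1.0):.1f}% faster than slow")
```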
Although these results seem to indicate an inherent coupling between speech rate and the F0 contour, one also has to take into account that speakers employ individual strategies when producing speech at different velocities. In their well-known cineradiographic study, Kuehn and Moll performed measurements of the velocity and displacement of the tongue during speech production and found considerable intra-speaker variation of these two parameters. It has also been shown that in fast speech, segment shortening tends to cause phonetic target undershoot or spatial reduction of articulatory targets, which leads to reduced articulatory displacement toward the phonetic goal and slower peak velocity of participating articulators. More generally speaking, speaker-specific articulatory strategies are an important factor in explaining articulatory variation. Hence, it can also be expected that the impact of speech rate on F0 gestures will to a certain extent be speaker-specific.
The remainder of this paper is structured as follows: Section 2 introduces the methodology for modeling German intonation adopted in this work. Section 3 discusses the corpus employed in this study and the prosodic features we extracted from it. Section 4 then presents results for individual samples as well as statistical analyses of the entire corpus. Section 5 concludes the paper with a discussion of the findings.
2 The concept of intonemes and their quantitative analysis
We first discuss some of the basics of the framework adopted in this study. In the works of Isačenko and Schädlich and Stock and Zacharias, a given F0 contour is mainly described as a sequence of communicatively motivated tone switches, major transitions of the F0 contour aligned with accented syllables. Tone switches can be thought of as the phonetic realization of phonologically distinct intonational elements, the so-called intonemes. In the original formulation by Stock, depending on their communicative function, three classes of intonemes are distinguished, namely the N↑ intoneme (‘non-terminal intoneme,’ signaling incompleteness and continuation, rising tone switch), I↓ intoneme (‘information intoneme’ at declarative-final accents, falling tone switch, conveying information), and the C↑ intoneme (‘contact intoneme’ associated, for instance, with question-final accents, rising tone switch, establishing contact). Hence, intonemes in the original sense mainly distinguish sentence modality, although there exists a variant of the I↓ intoneme, I(E)↓, which denotes emphatic accentuation and occurs in contrastive, narrowly focused environments. Intonemes for reading style speech are predictable by applying a set of phonological rules to a string of text as to word accentability and accent group formation. Other F0 transitions - termed ‘pitch interrupters’ by Isačenko - will occur at phrase boundaries or in unstressed syllables where they do not have the same prominence-lending effect as tone switches (see Mixdorff and Widera).
Based on this concept, Mixdorff and Jokisch developed a model of German prosody that anchors prosodic features such as F0, duration, and intensity to the syllable as the basic unit of speech rhythm. In order to quantify the interval and timing of the tone switches with respect to the syllabic grid, the framework adopts the Fujisaki model for parameterizing F0 contours. In a perception study employing synthetic stimuli of identical wording but varying F0 contours created with the Fujisaki model, it was shown that information intonemes are characterized by an accent command ending before or early in the accented syllable, creating a falling contour. N↑ intonemes were associated with rising tone switches to the mid-range of the speaker's voice, produced by an accent command beginning early in the accented syllable and continuing plateau-like up to the phrase boundary. C↑ intonemes, in contrast, required F0 transitions spanning a total interval of more than 10 semitones and generally starting later in the accented syllable, although the F0 interval was a more important factor than the precise alignment.
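Because the Fujisaki model operates on ln F0, an interval in semitones and an accent command amplitude Aa can be related by a simple conversion. The Python sketch below illustrates this under a simplifying assumption: the accent component is taken to reach its full amplitude Aa, ignoring the ceiling of the accent response and the finite duration of the command, so the values are an upper bound on the realized excursion. The function names are illustrative.

```python
import math


def semitones_between(f_low_hz: float, f_high_hz: float) -> float:
    """Musical interval in semitones between two frequencies."""
    return 12.0 * math.log2(f_high_hz / f_low_hz)


def amplitude_to_semitones(aa: float) -> float:
    """Interval produced by an ln-F0 excursion of size aa (ratio exp(aa))."""
    return 12.0 * aa / math.log(2.0)


def semitones_to_amplitude(semitones: float) -> float:
    """ln-F0 excursion needed to span the given interval."""
    return semitones * math.log(2.0) / 12.0


# The 10-semitone interval mentioned for C-up intonemes, as an ln-F0 excursion
aa_10st = semitones_to_amplitude(10.0)
print(f"10 semitones correspond to an ln-F0 excursion of {aa_10st:.3f}")
```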
On the utterance level: The base frequency Fb marks the floor of the F0 pattern. So far, we regard Fb as a speaker-individual constant varying only slightly. However, we have observed that Fb can also vary depending on the emotional content of an utterance, for instance .
On the phrase level: The phrase magnitude Ap reflects the degree of declination line reset at phrase boundaries. Earlier work suggested that Ap decreases as the tempo rises. We also expect to find fewer phrase commands at higher tempos as prosodic phrases will merge.
On the syllable level: The accent command amplitude Aa corresponds to the interval of local F0 excursions associated with accented syllables and boundary tones. So far, we assume that Aa decreases with increasing tempo, and due to accent command merging or suppression of secondary accents, there will be fewer accent commands. The accent command onset times T1 and accent command offset times T2 with respect to the underlying syllable or nuclear vowel onset or offset times reflect the precise alignment of F0 excursions with the segmental tier. We hypothesize that increased speech rate also requires the F0 gesture to occur earlier in the syllable. We also wish to examine whether accent commands and hence the F0 gestures are more strongly anchored to the nuclear vowel onset than to the onset of the syllable.
3 Speech material and method of analysis
The speech material used in the current study consists of the recordings of German L1 speech from the BonnTempo-Corpus. It contains data from four male and eight female native speakers of standard German. The corpus is based on readings of a text from a novel by Schlink comprising 76 syllables in three sentences (four main and three subordinate clauses). Versions at different tempos were elicited as follows: Subjects were provided with the text and asked to familiarize themselves with it by reading it aloud several times. Subsequently, they were recorded reading the text in a way they considered ‘normal reading’. After that, subjects were recorded twice, the first time being instructed to read the text ‘slowly’ and the second time to read the text ‘even slower’. In a third step, subjects were recorded under the instruction to read the text ‘fast’ and were then encouraged to read the text ‘faster’ until they considered themselves to have reached their maximum reading speed or until their performance seriously deteriorated. From the resulting materials, five versions are examined in the current study: normal (no), slow (s1), even slower (s2), fast (f1), and fastest (f2). These were labeled on the syllabic level by the third author and his colleagues. In addition, they labeled the nuclear vowels. We are aware that the syllabic rate as a correlate of speech rate is inferior to the perceptual local speech rate (PLSR) proposed by Pfitzinger, as the local syllable rate and the local phone rate are not well correlated, since they represent different perceptual aspects of speech rate. Perception experiments with short stretches of speech being judged on a rate scale revealed that neither syllable rate nor phone rate alone is sufficient to predict the perception results.
Subsequently, it was shown that a linear combination of the two measures yielded a correlation of r = 0.91 and a mean deviation of 10% which is accurate enough to successfully extract PLSR from large spoken language corpora. However, the BonnTempo-Corpus does not contain manually corrected phone labels. Since the Fujisaki model commands are anchored to the syllabic layer (see Section 2) and we did not require an exact local estimate of speech rate, but a broad classification of speech rate on the utterance level, the following investigation is performed with respect to the syllabic rate. Based on the underlying text of the utterances, we marked all lexically stressed syllables of content words.
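The utterance-level syllabic rate used here can be illustrated with a short Python sketch. The 76-syllable count and the normal rate of 6.34 syllables/s come from the text; the 12.0 s duration is a hypothetical value chosen for illustration, and whether pauses are included in the duration is not stated in the paper and is left to the caller.

```python
def syllable_rate(n_syllables: int, duration_s: float) -> float:
    """Syllables per second over a stretch of speech of the given duration."""
    if duration_s <= 0.0:
        raise ValueError("duration must be positive")
    return n_syllables / duration_s


def percent_of_normal(rate: float, normal_rate: float) -> float:
    """Express a rate as a percentage of the normal-tempo rate."""
    return 100.0 * rate / normal_rate


# Hypothetical reading of the 76-syllable text lasting 12.0 s
rate = syllable_rate(76, 12.0)
print(f"{rate:.2f} syllables/s")
print(f"{percent_of_normal(rate, 6.34):.0f}% of the reported normal rate")
```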
F0 values were extracted at a step of 10 ms with F0 floors and ceilings for male (50 to 300 Hz) and female participants (120 to 400 Hz) using the PRAAT default method. All F0 contours were then subjected to Fujisaki model parameter extraction, with an alpha of 2.0/s, a beta of 20.0/s, and variable Fb. Results were checked and, if necessary, corrected in the FujiParaEditor. To evaluate the alignment of phrase and accent commands with the underlying syllables, while taking into account whether these syllables were lexically stressed, we associated each accent command with the closest syllable. As explained in Section 2, a rising tone switch is invariably connected with the onset of an accent command and a falling tone switch with the offset of an accent command. All other accent command onsets or offsets are related to pitch interrupters at unaccented syllables.
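To make the role of the fixed time constants concrete, the following Python sketch synthesizes an F0 contour from a set of commands using the standard formulation of the Fujisaki model. It is not the extraction procedure used in this study, which works in the opposite direction. The values alpha = 2.0/s, beta = 20.0/s, and the 10 ms step are taken from the text; the accent ceiling GAMMA = 0.9 and all command values in the example are assumptions chosen for illustration.

```python
import math

ALPHA = 2.0   # phrase control time constant in 1/s (from the text)
BETA = 20.0   # accent control time constant in 1/s (from the text)
GAMMA = 0.9   # accent ceiling: common default, NOT stated in the text


def phrase_response(t: float, alpha: float = ALPHA) -> float:
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def accent_response(t: float, beta: float = BETA, gamma: float = GAMMA) -> float:
    """Step response of the accent control mechanism, limited by gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds):
    """F0 in Hz at time t.

    phrase_cmds: list of (T0, Ap); accent_cmds: list of (T1, T2, Aa).
    """
    ln_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        ln_f0 += ap * phrase_response(t - t0)
    for t1, t2, aa in accent_cmds:
        ln_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
    return math.exp(ln_f0)


# Hypothetical commands for a two-second utterance starting at t = 0
fb = 180.0
phrase_cmds = [(-0.5, 0.27)]
accent_cmds = [(0.30, 0.55, 0.30), (1.20, 1.45, 0.25)]

# Sample the contour every 10 ms, matching the F0 extraction step
contour = [(i * 0.01, fujisaki_f0(i * 0.01, fb, phrase_cmds, accent_cmds))
           for i in range(200)]
print(f"F0 at 0.40 s: {fujisaki_f0(0.40, fb, phrase_cmds, accent_cmds):.1f} Hz")
```

With alpha = 2.0/s the phrase response peaks 1/alpha = 0.5 s after the command, which is why a phrase command placed before the segmental onset shapes the declination at the start of the utterance.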
4 Results of analysis
4.1 General observations
Overview of speaker-specific means (M) and standard deviations (SD) of syllable rate for the five different intended tempos, numbered from 1 (very slow) to 5 (very fast)
Figure 2 shows results of analysis for all tempos produced by female speaker 1 uttering sentence 1, Am nächsten Tag fuhr ich nach Husum - ‘On the next day I went to Husum’. The figure displays the following, from top to bottom: the speech waveform, the F0 contour (+signs, extracted; solid line, model-based), the German SAMPA phone segmentation, and the underlying phrase and accent commands. The boundaries of underlying syllables are indicated by vertical dotted lines. Vowels carrying lexical stress are marked by bracketing, for instance (E:). Pauses are marked by underscores ‘_’.
As can be seen, the prosodic structure as reflected by the configuration of underlying accent commands aligned with accented syllables remains intact throughout all conditions, although the amplitudes Aa, durations, and alignments of commands vary. There is a tendency for Aa as well as for the durations of accent commands to become smaller as speed increases. The pause after ‘Tag’ disappears only in the fastest condition. The declination lines in all utterances can be modeled by a single phrase command preceding each utterance by about 475 ms relative to the segmental onset. This is slightly less than the ideal value of 500 ms. It has to be taken into account that the automatic extraction method aims at attaining a globally optimal fit for the entire phrase component. This may lead to the phrase command occurring closer to the segmental onset of the phrase.
All data were analyzed using R, the R packages lme4 and languageR, and JMP. If not indicated otherwise, data were analyzed using linear mixed effect models. Normality was checked by visual inspection of quantile plots. Speaker and sentence were treated as random effects, and intended tempo as fixed effect. Effects were tested by model comparison between a full model in which the factor in question is entered as either a fixed or a random effect (R code example: lmer(dependent_variable ~ fixed_factor + (1|random_factor1) + (1|random_factor2), data = data)) and a reduced model in which the factor in question is excluded (R code example: lmer(dependent_variable ~ 1 + (1|random_factor1) + (1|random_factor2), data = data)). P values were retrieved by comparing the two models using ANOVAs (R code: anova(model_full, model_reduced)). To assess the relative goodness of fit, we report Akaike information criterion (AIC) values, which decrease as the fit improves. P values that are considered significant at the α = 0.05 level are reported.
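In R, anova() applied to two nested lmer fits reports a likelihood-ratio chi-square test together with AIC values. The Python sketch below reproduces only that arithmetic from two log-likelihoods; it does not fit mixed models and is not the analysis code used in this study. The log-likelihoods and parameter counts in the example are hypothetical.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion; smaller values indicate a better fit."""
    return 2.0 * n_params - 2.0 * log_likelihood


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0.0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-y) * sum_{i=0}^{df/2-1} y^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= y / i
            total += term
        return math.exp(-y) * total
    # Odd df: erfc(sqrt(y)) + exp(-y) * sum_{i=1}^{(df-1)/2} y^(i-1/2) / Gamma(i+1/2)
    total = math.erfc(math.sqrt(y))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-y) * y ** (i - 0.5) / math.gamma(i + 0.5)
    return total


def likelihood_ratio_test(ll_full, k_full, ll_reduced, k_reduced):
    """Return (chi-square statistic, degrees of freedom, p value)."""
    stat = 2.0 * (ll_full - ll_reduced)
    df = k_full - k_reduced
    return stat, df, chi2_sf(stat, df)


# Hypothetical fits: the full model adds a five-level factor (4 parameters)
stat, df, p = likelihood_ratio_test(ll_full=440.0, k_full=8,
                                    ll_reduced=430.0, k_reduced=4)
print(f"chi2({df}) = {stat:.1f}, p = {p:.5f}")
print(f"AIC full = {aic(440.0, 8):.0f}, AIC reduced = {aic(430.0, 4):.0f}")
```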
4.2 Phrase command parameters
Summary of the statistics for the phrase-level Fujisaki model parameters
Fujisaki model parameter
Phrase command magnitude Ap
P < 0.0001, AIC = −872
P = 0.0009, AIC = −872
P = 0.033, AIC = −860
P = 0.023, AIC = 1,274
P = 0.009, AIC = 1,291
Figure 5 reveals that normal speech shows the highest Ap (M = 0.266, SD = 0.13), followed by very slow speech (M = 0.264, SD = 0.15), and fast intended tempo (M = 0.263, SD = 0.12). Very fast speech shows the lowest Ap (M = 0.17, SD = 0.12). Visually, if the boxes' notches do not overlap, this can be taken as strong evidence that their medians (solid black lines) differ. Results further revealed a main effect of speaker for phrase duration.
4.3 Accent command parameters
Summary of the statistics for the accent-level Fujisaki model parameters
Fujisaki model parameter
Accent command amplitude Aa
P < 0.0001, AIC = −838
P < 0.0001, AIC = −838
P < 0.0001, AIC = −894
t1relon (distance between T1 and syllable onset)
P < 0.0001, AIC = −1,282
P < 0.0001, AIC = −1,282
P < 0.0001, AIC = −1,295
t1relvon (distance between T1 and nuclear vowel onset)
P < 0.0001, AIC = −1,427
P < 0.0001, AIC = −1,427
The interaction obtained for intended tempo*speaker becomes evident in Figure 6: whereas speaker 4, for example, exhibited a trend of increasing amplitudes the faster he speaks (except for the very fast condition), speaker 11 performed conversely: the faster his speech, the lower the accent command amplitudes. Given the interaction of rate*speaker, the main effects are no longer readily interpretable. To test for the simple effect of intended tempo, we processed 12 ANOVAs, one for each speaker. Only 3 of the 12 ANOVAs showed significant effects of intended tempo (Bonferroni adjusted for speaker, α = 0.0042; speakers 2, 4, and 11). Correlation analysis between Aa and speech rate in syllables/s only yielded significant dependencies for four of the speakers with a weak Pearson's r < −0.3 indicating a compressed F0 range at higher rates. The syllabic distance between accent commands increases with speed: At the normal rate, subjects produce on average one accent command every 3.1 syllables, at very slow tempo every 2.7 syllables, and at very fast speed every 4.6 syllables.
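The adjusted significance level quoted above is the usual Bonferroni correction for 12 per-speaker tests, and the correlation is a standard Pearson coefficient. A small Python sketch of both computations follows; the rate and amplitude values are made up for illustration and are not data from the corpus.

```python
import math


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def pearson_r(xs, ys) -> float:
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


print(f"adjusted alpha = {bonferroni_alpha(0.05, 12):.4f}")

# Hypothetical speaker: syllable rate (syllables/s) against Aa
rates = [4.1, 5.0, 6.3, 8.2, 10.1]
amplitudes = [0.34, 0.31, 0.33, 0.27, 0.25]
print(f"r = {pearson_r(rates, amplitudes):.2f}")
```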
As for the temporal distance between accent command onset and the onset of a stressed syllable (t1relon), the model with speaking rate as a fixed effect provided an improved goodness of fit, which means that between-rate variation was again significant. Descriptive statistics showed that the faster the speech, the smaller the distance between accent command onset and syllable onset (very slow M = 0.14, SD = 0.09; slow M = 0.12, SD = 0.08; normal M = 0.09, SD = 0.05; fast M = 0.08, SD = 0.05; very fast M = 0.05, SD = 0.04). In other words, local rises occur earlier in the syllable the faster a person speaks. Results indicated that there was a significant effect of speaker and a significant interaction of intended tempo*speaker. To test for the simple effect of intended tempo, we again ran 12 ANOVAs, one for each speaker. Again, only 3 of the 12 ANOVAs showed significant effects of intended tempo (Bonferroni adjusted for speaker, α = 0.0042; speakers 4, 5, and 11).
Figure 7 reveals that most speakers exhibited the trend mentioned above: the faster the speech, the earlier the rise relative to the vowel onset. This is particularly evident for speakers 4, 9, and 11.
The difference between accent command onset time T1 and (1) the syllable onset time (t1relon), (2) the syllable offset time (t1reloff), (3) the vowel onset time (t1relvon), and (4) the vowel offset time (t1relvoff).
The difference between accent command offset time T2 and (1) the syllable onset time (t2relon), (2) the syllable offset time (t2reloff), (3) the vowel onset time (t2relvon), and (4) the vowel offset time (t2relvoff).
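The eight distance measures defined above can be sketched in Python as follows. Two points are assumptions rather than statements from the paper: "closest syllable" is interpreted as the syllable whose interval lies nearest to the command time, and differences are signed so that a positive value means the command time falls after the landmark. The class and function names and the example times are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    onset: float         # syllable onset time in s
    offset: float        # syllable offset time in s
    vowel_onset: float   # nuclear vowel onset time in s
    vowel_offset: float  # nuclear vowel offset time in s


def closest_syllable(t: float, syllables):
    """Syllable whose interval is nearest to t (distance 0 if t lies inside)."""
    def distance(s: Syllable) -> float:
        if s.onset <= t <= s.offset:
            return 0.0
        return min(abs(t - s.onset), abs(t - s.offset))
    return min(syllables, key=distance)


def relative_timing(t: float, syl: Syllable, prefix: str) -> dict:
    """Signed distances of a command time to the four syllable landmarks."""
    return {
        f"{prefix}relon": t - syl.onset,
        f"{prefix}reloff": t - syl.offset,
        f"{prefix}relvon": t - syl.vowel_onset,
        f"{prefix}relvoff": t - syl.vowel_offset,
    }


# Hypothetical syllable labels and one accent command (T1, T2)
syllables = [Syllable(0.00, 0.20, 0.05, 0.15),
             Syllable(0.20, 0.45, 0.27, 0.40)]
t1, t2 = 0.25, 0.52

onset_measures = relative_timing(t1, closest_syllable(t1, syllables), "t1")
offset_measures = relative_timing(t2, closest_syllable(t2, syllables), "t2")
print(onset_measures)
print(offset_measures)
```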
Means ( M ) and standard deviations (SD) in milliseconds for distance measures between accent commands and syllables or nuclear vowels, respectively
Rising F0 (N↑ intonemes)
Falling F0 (I↓ intonemes)
The current paper examined the relationship between the F0 contour and speech rate. We employed the Fujisaki model for decomposing F0 contours into utterance level, phrase level, and syllable level components, that is, the base frequency Fb, phrase commands, and accent commands, respectively. We found that the prosodic structure disintegrates only at the extreme ends of the tempo range. Otherwise, the configuration in terms of phrase and accent command numbers and positions remains relatively unchanged, and speech rate mostly influences the amplitudes and exact timings of commands.
In general, we found the following trends: The base frequency Fb increases with speech rate. The phrase duration measured as the distance between consecutive phrase commands decreased with speech rate. Although we expect phrases to contain more syllables at higher speech rate, these syllables are of shorter duration. Unless phrases are merged, phrase duration will therefore also decrease. As the examples in Figure 2 indicated, we do not necessarily see an increase in the numbers of phrase commands at low speed, unless the utterance is broken into very small chunks like in Figure 3, top. This tendency might well be due to the character of the underlying reading material which contains mostly short phrases. The deep boundaries are produced similarly at all tempos, and low speed does not give rise to new phrase commands at shallow boundaries which are rather marked by short pauses.
With respect to accent command amplitude Aa, our outcome suggests that despite a slight trend for Aa to decrease at higher tempos, the influence of the speaker on this parameter by far outweighs that of speech rate. This means that speakers have idiosyncratic ways of manipulating F0 parameters as a function of speech rate. For some speakers, for example, accent command amplitude increases the faster they speak (speaker 4, Figure 6), whereas for others, Aa decreases (speaker 11, Figure 6). We also found that speakers tend to produce fewer accent commands at higher speech rates. This indicates that a high articulatory rate imposes limitations on the frequency of F0 gestures.
As regards the alignment of F0 gestures with the syllable, our results suggest that increased tempo also leads to an early execution of F0 gestures. These observations are similar regardless of whether we anchor the accent command to the syllable onset or the nuclear vowel onset.
It is possible that the reading task underlying our data might have affected the outcome of our analyses, as people repeated the same sentences over and over again. Future work will examine other speech materials produced with more natural tempo variations.
- Goldman-Eisler F: Psycholinguistics: Experiments in Spontaneous Speech. Academic, New York; 1968.
- Rietveld ACM, Gussenhoven CC: Perceived speech rate and intonation. J. Phonetics 1987, 15: 273-285.
- Caspers J, van Heuven VJ: Effects of time pressure on the phonetic realization of the Dutch accent-lending pitch rise and fall. Phonetica 1993, 50: 161-171. doi:10.1159/000261936
- Ladd DR, Faulkner D, Faulkner H, Schepman A: Constant segmental anchoring of F0 movements under changes in speech rate. J. Acoust. Soc. Am. 1999, 106: 1543-1554. doi:10.1121/1.427151
- Fujisaki H, Hirose K: Analysis of voice fundamental frequency contours for declarative sentences of Japanese. J. Acoust. Soc. Japan (E) 1984, 5(4):233-241. doi:10.1250/ast.5.233
- Fujisaki H, Hirose K: Modeling the dynamic characteristics of voice fundamental frequency with applications to analysis and synthesis of intonation: preprints of the working group on intonation. 13th International Congress of Linguistics, Tokyo; 1982.
- Mixdorff H: Intonation patterns of German - model-based quantitative analysis and synthesis of F0-contours. D.Eng. thesis, TU Dresden; 1998.
- Leemann A: Swiss German Intonation Patterns. Benjamins, Amsterdam/New York; 2012.
- Kuehn DP, Moll KL: A cineradiographic study of VC and CV articulatory velocities. J. Phonetics 1976, 23(4):303-320.
- Lindblom B: Explaining phonetic variation: a sketch of the H&H theory. In Speech Production and Speech Modelling. Edited by: Hardcastle WJ, Marchal A. Kluwer; 1990:403-439.
- Flege JE: Effects of speaking rate on tongue position and velocity of movement in vowel production. J. Acoust. Soc. Am. 1988, 84: 901-916. doi:10.1121/1.396659
- Isačenko AV, Schädlich HJ: Untersuchungen über die deutsche Satzintonation. Akademie-Verlag, Berlin; 1964.
- Stock E, Zacharias C: Deutsche Satzintonation. VEB Verlag Enzyklopädie, Leipzig; 1982.
- Mixdorff H, Widera C: Perceived prominence in terms of a linguistically motivated quantitative intonation model. Proceedings of Eurospeech 2001, Aalborg, Denmark; 2001.
- Mixdorff H, Jokisch O: Building an integrated prosodic model of German, vol. 2. Proceedings of Eurospeech 2001, Aalborg, Denmark; 2001.
- Mixdorff H, Fujisaki H: Production and perception of statement, question and non-terminal intonation in German. Proc. ICPhS, Stockholm 1995, 2: 410-413.
- Amir N, Mixdorff H, Amir O, Rochman D, Diamond GM, Isserles T, Abramson S, Pfitzinger HR: Unresolved anger: prosodic analysis and classification of speech from a therapeutical setting. Proceedings of Speech Prosody 2010, Chicago, USA; 2010.
- Dellwo V, Wagner P: Relationships between speech rate and rhythm. Proceedings of the ICPhS 2003, Barcelona; 2003.
- Dellwo V, Steiner I, Aschenberner B, Dankovicova J, Wagner P: The BonnTempo-corpus & BonnTempo-tools: a database for the study of speech rhythm and rate. Proceedings of ICSLP 2005; 2005.
- Schlink B: Selbs Betrug. Diogenes Verlag, Zurich; 1994.
- Pfitzinger HR: Local speech rate perception in German speech. Proc. ICPhS 1999, 1999: 893-896.
- Boersma P: Praat, a system for doing phonetics by computer. Glot Int. 2001, 5: 341-345.
- Mixdorff H: A novel approach to the fully automatic extraction of Fujisaki model parameters, vol. 3. Proceedings of ICASSP 2000, Istanbul, Turkey; 2000.
- Mixdorff H: FujiParaEditor. 2009.
- R: A language and environment for statistical computing. R Foundation for Statistical Computing; 2013.
- Bates DM, Maechler M: lme4: linear mixed-effects models using S4 classes. 2009.
- Baayen RH: Analyzing linguistic data: a practical introduction to statistics using R. CUP, Cambridge; 2008.
- Baayen RH: LanguageR: data sets and functions with analyzing linguistic data: a practical introduction to statistics using R. 2009.
- JMP, Version 9.0. SAS Institute Inc, Cary, NC; 1989-2007.
- Kliegl R, Wei P, Dambacher M, Yan M, Zhou X: Experimental effects and individual differences in linear mixed models: estimating the relationship between spatial, object, and attraction effects in visual attention. Front. Psychol. 2011, 1(238):1-12.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.