Classification of speech under stress based on modeling of the vocal folds and vocal tract
© Yao et al.; licensee Springer. 2013
Received: 30 October 2012
Accepted: 21 June 2013
Published: 5 July 2013
In this study, we focus on the classification of neutral and stressed speech based on a physical model. In order to represent the characteristics of the vocal folds and vocal tract during the process of speech production and to explore the physical parameters involved, we propose a method using the two-mass model. As feature parameters, we focus on stiffness parameters of the vocal folds, vocal tract length, and cross-sectional areas of the vocal tract. The stiffness parameters and the area of the entrance to the vocal tract are extracted from the two-mass model after we fit the model to real data using our proposed algorithm. These parameters are related to the velocity of glottal airflow and acoustic interaction between the vocal folds and the vocal tract and can precisely represent features of speech under stress because they are affected by the speaker’s psychological state during speech production. In our experiments, the physical features generated using the proposed approach are compared with traditionally used features, and the results demonstrate a clear improvement of up to 10% to 15% in average stress classification performance, which shows that our proposed method is more effective than conventional methods.
Keywords: Speech under stress; Stress classification; Physical parameters; Two-mass model; Vocal folds; Vocal tract
Stress is a psycho-physiological state characterized by subjective strain, increased physiological activity, and deterioration of performance. Factors that induce stress in speakers include workload, background noise, emotions, physical environmental factors (e.g., G-force), and fatigue. These factors are believed to affect voice quality and to degrade the performance of communication equipment, especially automated systems with speech interfaces. It has therefore become increasingly important to study speech under stress in order to improve the performance of speech recognition systems, to recognize when people are in a stressed state, and to understand the contexts in which speakers are communicating.
Researchers have attempted to identify reliable indicators of stress by analyzing acoustic variables. Both external factors (workload, background noise, etc.) and internal factors (emotional state, fatigue, etc.) may induce stress. The first investigations of emotional speech were conducted in the mid-1980s, using the statistical properties of acoustic features to detect emotions from speech [3, 4]. It has been found that fundamental frequency (F0) has different characteristics for each emotion and that respiration patterns and muscle tension also change. The influence of the Lombard effect on speech recognition has also been examined [7, 8]. Selected acoustic features, such as the amplitude and distribution of spectral energy, have been analyzed, and spectral energy was found to shift to higher frequencies for consonants in the presence of loud background noise. High workload stress has been shown to have a significant impact on the performance of speech recognition systems, with speech under workload sounding faster, softer, or louder than neutral speech [9, 10]. Matsuo et al. examined the frequency domain and showed how differences in the spectrum of the high frequency band under stressful workload conditions could be used to catch people committing remittance fraud, and their proposed measure achieved better classification performance. Furthermore, the Teager energy operator (TEO) was proposed to explore variations in the energy of airflow characteristics within the glottis for the purpose of stress classification. However, the features examined in these previous studies lack a physical basis, and the methods do not consider the whole process of speech production, which is believed to be essential for effective classification of speech under stress.
We propose a stressed speech classification method based on a physical model characterizing the vocal folds (VF) and the vocal tract (VT). This method can represent the process of speech production and model airflow patterns in the vocal folds and the vocal tract, which are essential for stress classification. In this physical model, changes in the physical characteristics of the vocal folds, such as muscle tension, have a modulating effect on the formants, while the shape of the vocal tract can also influence the glottal source because of the interaction between the vocal folds and the vocal tract. It is believed that the presence of stress can result in variations in the physical characteristics of physiological systems and influence the acoustic interaction between the vocal folds and the vocal tract. The parameters of the physical model also represent the influence of speaking style more directly and clearly. A physical model is therefore helpful for estimating the parameters of the physiological system.
An early but still prominent physical model is the source-filter model, which models speech as the combination of a glottal source (such as the vocal folds) and a linear acoustic filter representing the vocal tract and its radiation characteristic. An important assumption that is often made in the use of the source-filter model is independence of the source and filter. In such cases, the model should more accurately be referred to as the ‘independent source-filter model’. In 1961, Dunn proposed a linear model of speech production using a lossless tube model of the vocal tract. In 1979, a linear source tract model was proposed to model the glottal source, the vocal tract, and radiation impedance as linear filters, using covariance analysis. However, the vocal tract and vocal folds do not function independently of each other; instead, there is some form of interaction between them, which results in significant changes in fundamental frequency and formant characteristics.
The two-mass model is a physical model that attempts to simulate the physical process of vocal fold vibration, characterizing the vocal folds and the vocal tract, and also to model the effect of glottis-vocal tract interaction. Parameters affected under stressed conditions are extracted from the physical model and are used as features to identify speech under stress more precisely. We use the two-mass model as a physical model, and our proposed method estimates the values of parameters included in the model from input speech. To identify speech under stress, we evaluate parameters affected by stress.
In this paper, we propose a method for fitting a physical model to real speech in order to estimate the physical parameters which characterize the vocal folds and the vocal tract. For the physical model, a two-mass model connected to a four-tube model is used to simulate the process of speech production. The physical parameters (stiffness, vocal tract length, and cross-sectional areas of the vocal tract) are estimated by fitting the model to real speech. The estimated parameters can be further analyzed and proposed as features for the classification of neutral and stressed speech. Furthermore, different cost functions are proposed to compare classification performance. As a result, stiffness of the vocal folds and cross-sectional areas of the vocal tract are selected as features for the classification of neutral and stressed speech.
The paper is organized as follows: In Overview, an overview of our method is presented. Physical parameters, related to the vocal folds and the vocal tract, based on the two-mass model are described as features for classification in Physical parameters. This is followed by the presentation of a fitting algorithm for real speech data in Estimation method to help estimate the physical parameters. Classification describes the classification method used for evaluation. In Evaluation, experiments are performed to evaluate the obtained parameters and show their corresponding classification performances when separating neutral and stressed speech. Finally, we draw our conclusions in Conclusion.
Initially, we propose physical parameters considered likely to be useful, which include stiffness parameters of the vocal folds, vocal tract length, and cross-sectional areas of the vocal tract. These parameters characterize the behavior of vocal folds and the shape of the vocal tract. Furthermore, the relationship between the selected physical parameters and acoustic parameters has been shown to represent characteristics of the interaction between the vocal folds and the vocal tract.
The proposed physical parameters are then estimated by fitting the two-mass model to real speech. An algorithm based on the analysis-by-synthesis method is proposed for fitting the model to real speech. The Nelder-Mead simplex method is used as a search strategy in order to find the optimal physical parameters. Because there is interaction between the VF and VT, vocal fold fitting and vocal tract fitting are performed iteratively to estimate the parameters.
For classification, a linear classifier is trained using utterances from each speaker. Currently, a simple linear classifier based on Euclidean distance is used for classification. Also, since we only have speech data for a small number of speakers, we evaluate our proposed method as a speaker-dependent system.
3. Physical parameters
A method which fits the two-mass model to real speech is proposed for classifying speech under stress. Some of the physical parameters characterizing the vocal folds and the vocal tract are estimated. The two-mass vocal fold model was originally proposed by Ishizaka and Flanagan to simulate the process of speech production. We propose three types of feature parameters extracted from the two-mass model: stiffness, vocal tract length, and cross-sectional area of the entrance of the vocal tract. In the following sections, we will define these parameters and describe their characteristics.
The stiffness parameters are related to muscle tension in the vocal folds. Generally, the stiffness of the vocal folds is considered to depend mainly on two muscles: the cricothyroid muscle (CT) and thyroarytenoid muscle (TA). In the two-mass model, coupling stiffness kc is related to the tension in the TA muscle, so a high k1 value and a low value for kc represent the contraction of the CT muscle and relaxation of the TA muscle.
Notations and variables in the two-mass model for the vocal folds:
- The horizontal displacements measured from the rest (neutral) position x0
- The equivalent viscous resistances
- The force related to tissue elasticity
- The force of airflow, which is determined by subglottal pressure
- The stiffness coefficients
- The coupling stiffness
- A coefficient of the nonlinear relations
Stiffness parameters are the main factors relating to fundamental frequency, and they can also determine the amplitude of the glottal area and glottal volume velocity, so source excitation is significantly influenced by the degree of stiffness. During the production of speech, the natural frequency of the vocal folds is determined by both their mass and stiffness. However, in order to simplify the estimation algorithm, only the stiffness parameters are estimated, with mass fixed as a constant.
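The dependence of natural frequency on mass and stiffness can be illustrated with a short calculation. The sketch below treats the lower mass as a single undamped mass-spring unit, which is a simplifying assumption and not the coupled two-mass model itself; it uses the typical values quoted in this paper (k1 = 80,000 dyn/cm and the male value m1 = 1.25 × 10−4 kg). The function names are hypothetical.

```python
import math


def dyn_per_cm_to_n_per_m(k_dyn_per_cm: float) -> float:
    """Convert stiffness from dyn/cm to N/m (1 dyn = 1e-5 N, 1 cm = 1e-2 m)."""
    return k_dyn_per_cm * 1e-3


def natural_frequency_hz(stiffness_n_per_m: float, mass_kg: float) -> float:
    """Undamped natural frequency f = sqrt(k / m) / (2 * pi) of one mass-spring unit."""
    return math.sqrt(stiffness_n_per_m / mass_kg) / (2.0 * math.pi)


# Typical values quoted in the paper; coupling (kc) and damping are ignored here.
k1 = dyn_per_cm_to_n_per_m(80_000.0)  # 80 N/m
m1 = 1.25e-4                          # kg, typical male value
f_m1 = natural_frequency_hz(k1, m1)
print(round(f_m1, 1))  # 127.3
```

With mass held constant, the frequency scales with the square root of stiffness, which is why fixing the masses still leaves the stiffness parameters able to account for changes in fundamental frequency.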
3.2 Vocal tract length and cross-sectional area
The supraglottic area includes the structures that lie above the true vocal folds and below the base of the tongue. The anatomical structures present in this area that are important to speech production lie posterior to the epiglottis. They include the ventricle, false vocal folds, epiglottis, arytenoids, laryngeal aspects of the aryepiglottic folds, and vestibule.
Notations and variables in the two-mass model for the vocal tract:
- The cross-sectional areas in the tube model
- The cylinder lengths in the tube model
- The thickness of m1 and m2
- The cross-sectional areas of the glottis
- The average volume velocity across the glottal area
- The velocity of sound
- The air density
- The radian frequency
The length of the vocal tract and its cross-sectional areas are the main parameters which determine the shape of the vocal tract and have a significant impact on the distribution of formants. Vocal tract length and cross-sectional areas of the tube model are computed from real speech.
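As a rough illustration of how vocal tract length shifts the formants, the sketch below uses the textbook quarter-wavelength resonances of a uniform lossless tube that is closed at the glottis and open at the lips. This is a simplification of the four-tube model used in this paper, and the speed of sound (350 m/s) is an assumed value; the function name is hypothetical.

```python
def uniform_tube_formants(length_m, n_formants=3, c=350.0):
    """Resonances F_n = (2n - 1) * c / (4 * L) of a uniform closed-open tube.

    length_m : tube (vocal tract) length in metres
    c        : speed of sound in m/s (assumed value, not taken from the paper)
    """
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_formants + 1)]


# Vocal tract lengths spanning the 13 to 19 cm range considered in the paper.
for length_cm in (13, 16, 19):
    formants = uniform_tube_formants(length_cm / 100.0)
    print(length_cm, [round(f) for f in formants])

f1_at_16cm = uniform_tube_formants(0.16)[0]
```

Under these assumptions a longer tract lowers every resonance in proportion, so an error in the assumed vocal tract length shifts all simulated formants together.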
3.3 Relationship between physical parameters and acoustic parameters
In this section, we describe experiments which were performed to represent the presence of acoustic interaction and show the relationship between physical and acoustic parameters. Aerodynamics in the glottis is modeled using the two-mass model. In order to clarify the relationship between physical and acoustic parameters, we will first briefly describe the main equations representing the aerodynamics of speech production.
where ρ is the air density, Ug is the volume velocity of glottal airflow, and Ag1 is the cross-sectional lower glottal area, which is represented by Ag1 = 2lg(x0 + x1), where lg is the length of the vocal fold and x0 is the displacement when the vocal folds are in the rest position.
where μ is the air viscosity coefficient and d1 is the width of m1.
where P21 is the air pressure at the lower edge of m2 and Ag2 is the cross-sectional upper glottal area.
where P1 is the pressure at the inlet of the vocal tract. Here, the parameter N is defined as N = Ag2 / A1, where A1 is the area of the entrance to the vocal tract. N denotes the ratio of the area of the outlet of the vocal folds to that of the inlet of the vocal tract, which is significant to the acoustic interaction between the vocal folds and the vocal tract. Since glottal area Ag2 does not change significantly during the oscillation of the vocal folds, A1 is the parameter relating to the acoustic interaction.
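For orientation, the quasi-steady pressure relations along the glottis in the standard Ishizaka-Flanagan formulation are sketched below. They appear to match the symbol definitions given above, but they are reproduced from the general two-mass literature and are not necessarily identical, in form or numbering, to the equations of this paper. The intermediate pressures P11, P12, and P22, the subglottal pressure Ps, and the thickness d2 of m2 are notational assumptions, and the air-inertance terms are omitted.

```latex
\begin{align}
P_s - P_{11}    &= 1.37\,\frac{\rho}{2}\left(\frac{U_g}{A_{g1}}\right)^{2}
  && \text{(contraction at the glottal inlet)} \\
P_{11} - P_{12} &= \frac{12\,\mu\, d_1\, l_g^{2}}{A_{g1}^{3}}\,U_g
  && \text{(viscous loss along } m_1\text{)} \\
P_{12} - P_{21} &= \frac{\rho}{2}\,U_g^{2}\left(\frac{1}{A_{g2}^{2}} - \frac{1}{A_{g1}^{2}}\right)
  && \text{(area change between } m_1 \text{ and } m_2\text{)} \\
P_{21} - P_{22} &= \frac{12\,\mu\, d_2\, l_g^{2}}{A_{g2}^{3}}\,U_g
  && \text{(viscous loss along } m_2\text{)} \\
P_{22} - P_1    &= -\frac{\rho}{2}\left(\frac{U_g}{A_{g2}}\right)^{2} 2N\,(1 - N),
  \qquad N = \frac{A_{g2}}{A_1}
  && \text{(expansion at the glottal outlet)}
\end{align}
```

In this form, the last relation makes explicit why both the glottal flow velocity Ug / Ag2 and the area ratio N enter the pressure recovery at the entrance to the vocal tract.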
Physical and acoustic parameters
Physical parameters: k1, kc, A1, A2, A3, LVT
Acoustic parameters: F0, F1, F2, F3
We first examine how the stiffness parameters affect the distribution of formants. To do so, we fixed the shape of the vocal tract and examined how variation in the stiffness parameters of the vocal folds affects the shift of formants. The vocal tract model was represented by a standard tube configuration for the vowels /a/ and /e/. In order to reduce the number of parameters to be estimated and simplify the proposed method, typical values were adopted for the configuration of the tube model. The length chosen for the vocal tract was therefore LVT = 16 cm, with each element li = 4 cm, and the cross-sectional areas were fixed at A1 = 0.8 cm2, A2 = 0.4 cm2, A3 = 3 cm2, and A4 = 8 cm2 for /a/ and A1 = 1 cm2, A2 = 8 cm2, A3 = 8 cm2, and A4 = 8 cm2 for /e/. When a specific stiffness parameter was examined, the other stiffness parameters were fixed at typical values. We varied the stiffness parameters k1 (20 to 240 kdyn/cm), k2 (2 to 40 kdyn/cm), and kc (2.5 to 70 kdyn/cm) to examine the variation in formants. Formant estimation is based on modeling the vocal tract frequency response using linear predictive coding (LPC) techniques, which estimate formant frequencies from the all-pole model of the vocal tract transfer function.
Next, we fixed the configuration of the vocal folds and examined how variation of the cross-sectional area of the vocal tract impacts the fundamental frequency (F0) of speech. Stiffness was fixed at typical values k1 = 80,000 dyn/cm, k2 = 8,000 dyn/cm, and kc = 25,000 dyn/cm to check how the fundamental frequency changes with the area function. When checking the impact of a specific area, the other areas and the vocal tract length (VTL) were fixed at typical values for /a/ or /e/. When considering VTL, all the cross-sectional areas were fixed at typical values. We then changed the cross-sectional area or VTL to examine their impact on F0. The variation range for VTL was 13 to 19 cm, and for the cross-sectional areas of the VT, the range was 0.1 to 20 cm2. The fundamental frequency was estimated using the YIN algorithm, which is based on the well-known autocorrelation method with a number of modifications that combine to reduce errors.
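The core of a YIN-style estimator can be sketched as follows. This is a reduced version written for illustration: it keeps the difference function, the cumulative mean normalized difference, and the absolute threshold, but omits the parabolic interpolation and best-local-estimate steps of the published algorithm. The search range, the threshold of 0.1, and the function name are assumptions.

```python
import math


def yin_f0(signal, fs, fmin=60.0, fmax=400.0, threshold=0.1):
    """Estimate F0 (Hz) of one frame with a simplified YIN procedure."""
    tau_min = int(fs / fmax)
    tau_max = int(fs / fmin)
    window = len(signal) - tau_max  # number of terms summed for every lag
    if window <= 0:
        raise ValueError("frame too short for the requested fmin")

    # Step 1: squared-difference function d(tau).
    d = [0.0] * (tau_max + 1)
    for tau in range(1, tau_max + 1):
        d[tau] = sum((signal[j] - signal[j + tau]) ** 2 for j in range(window))

    # Step 2: cumulative mean normalized difference d'(tau).
    cmnd = [1.0] * (tau_max + 1)
    running = 0.0
    for tau in range(1, tau_max + 1):
        running += d[tau]
        cmnd[tau] = d[tau] * tau / running if running > 0.0 else 1.0

    # Step 3: first dip below the threshold, followed to its local minimum.
    for tau in range(tau_min, tau_max + 1):
        if cmnd[tau] < threshold:
            while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return fs / tau

    # Fallback: no dip below the threshold, so use the global minimum.
    best = min(range(tau_min, tau_max + 1), key=lambda t: cmnd[t])
    return fs / best


# Synthetic 64 ms frame at 8 kHz containing a 100 Hz sinusoid.
fs = 8000
frame = [math.sin(2.0 * math.pi * 100.0 * n / fs) for n in range(512)]
demo_f0 = yin_f0(frame, fs)
print(demo_f0)
```

The normalization in step 2 is what keeps the estimator from locking onto very short lags, one of the modifications that distinguishes YIN from plain autocorrelation.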
Therefore, it is our conclusion that stiffness of the vocal folds and cross-sectional area A1 affect both the fundamental frequency and formants and, further, the interaction between the vocal folds and the vocal tract.
3.4 Parameters representing stress
In Relationship between physical parameters and acoustic parameters, the experimental results show that stiffness of the vocal folds and cross-sectional area A1 have an impact on the interaction between the vocal folds and the vocal tract. It is believed that the variations in acoustic interaction differ markedly between neutral and stressed speech, so stiffness and A1 should be selected as parameters for representing stress.
In theory, Equation 8 shows that both the velocity of glottal airflow and the difference between the area of the outlet of the vocal folds and the inlet of the vocal tract have an impact on the pressure difference inside and outside of the glottis. Thus, the two factors can cause variations in the airflow patterns in the glottis and thus are likely to be effective to represent the presence of stress.
Variation in the stiffness of the vocal folds influences the time span of glottal opening and closing phases and causes glottal airflow to accelerate in the glottis, thus impacting the velocity of glottal airflow. Therefore, we can also assume that stiffness parameters can be potential parameters for stress detection.
A1 in the four-tube model is the area of the entrance to the vocal tract in the supraglottis. Narrowing A1 facilitates phonation by decreasing the oscillation threshold pressure of the vocal folds. Since glottal area Ag2 does not change significantly during the oscillation of the vocal folds, A1 is the main factor determining the pressure difference between the inside and outside of the glottis and has an impact on the acoustic interaction between VF and VT. Based on these considerations, we also make the assumption that A1 is an effective parameter for stress classification.
4. Estimation method
4.1 Algorithm for fitting
For vocal tract fitting, the stiffness parameters are first fixed at typical values and taken as input, and the parameters for the cross-sectional areas are estimated. Next, the obtained areas are used as input for vocal fold fitting, and the two-mass model is fit to estimate new stiffness parameters. When the new stiffness differs significantly from the typical value, the corresponding formants are also affected, and some variations can occur. In such cases, vocal tract fitting needs to be performed again. We alternate between the two fitting steps until the results converge.
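The alternation between the two fitting steps can be sketched as a simple fixed-point loop. The convergence test (maximum relative change in stiffness below a tolerance) and the iteration cap are assumptions, since the text does not specify them, and the two toy fitting functions are hypothetical stand-ins for the real vocal tract and vocal fold fits.

```python
def alternate_fit(fit_vocal_tract, fit_vocal_folds, stiffness0, tol=1e-3, max_iter=20):
    """Alternate VT and VF fitting until the stiffness estimate stops changing.

    fit_vocal_tract(stiffness) -> areas      (stiffness held fixed)
    fit_vocal_folds(areas)     -> stiffness  (areas held fixed)
    """
    stiffness = list(stiffness0)
    areas = None
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        areas = fit_vocal_tract(stiffness)
        new_stiffness = fit_vocal_folds(areas)
        # Assumed convergence measure: largest relative change in any stiffness.
        change = max(
            abs(new - old) / max(abs(old), 1e-12)
            for new, old in zip(new_stiffness, stiffness)
        )
        stiffness = list(new_stiffness)
        if change < tol:
            break
    return areas, stiffness, n_iter


# Toy stand-ins with a weak mutual dependence, used only to exercise the loop.
def toy_fit_vt(stiffness):
    return [1.0 + 0.01 * stiffness[0]]


def toy_fit_vf(areas):
    return [50.0 + 10.0 * areas[0]]


fit_areas, fit_stiffness, fit_iters = alternate_fit(toy_fit_vt, toy_fit_vf, [80.0])
print(fit_areas, fit_stiffness, fit_iters)
```

In this toy case the coupling between the two steps is weak, so a handful of alternations suffices; the real fits may need more, which is why a cap on the number of iterations is included.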
The Nelder-Mead algorithm is a simplex method for finding the minimum of a function of several variables. It is a direct search method and does not require the calculation of derivatives. To solve our optimization problem, we use the Nelder-Mead method, which compares the values of the cost function at the n + 1 vertices of a simplex for n-dimensional decision variables. Here, we select A1, A2, A3, and A4 as variables in vocal tract fitting. Each calculation generates a new vertex for the simplex. If this new point is better than at least one of the existing vertices, it replaces the worst vertex. The simplex vertices are changed through reflection, expansion, contraction, and shrinkage operations in order to find an improved solution. In this way, the Nelder-Mead simplex method searches for the physical parameters that minimize the cost function.
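A compact version of the simplex search described above can be written in a few dozen lines. The sketch below is a simplified variant (inside contraction only, standard coefficients 1, 2, 0.5, and 0.5) and is not the implementation used in this paper. The quadratic toy cost, which simply measures distance to the /a/ area values quoted earlier, stands in for the real cost functions, and the stopping rule is an assumption.

```python
def nelder_mead(cost, x0, step=0.5, tol=1e-12, max_iter=2000):
    """Minimize cost(x) over n variables with a basic Nelder-Mead simplex search."""
    n = len(x0)
    # Initial simplex: x0 plus one vertex offset along each coordinate axis.
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)
    values = [cost(p) for p in simplex]

    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if abs(values[-1] - values[0]) < tol:
            break
        worst = simplex[-1]
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]

        # Reflection of the worst vertex through the centroid of the others.
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        f_r = cost(reflected)
        if f_r < values[0]:
            # Expansion: try moving further in the same direction.
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            f_e = cost(expanded)
            if f_e < f_r:
                simplex[-1], values[-1] = expanded, f_e
            else:
                simplex[-1], values[-1] = reflected, f_r
        elif f_r < values[-2]:
            simplex[-1], values[-1] = reflected, f_r
        else:
            # Contraction: pull the worst vertex toward the centroid.
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            f_c = cost(contracted)
            if f_c < values[-1]:
                simplex[-1], values[-1] = contracted, f_c
            else:
                # Shrinkage: move every vertex halfway toward the best one.
                best = simplex[0]
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, p)] for p in simplex[1:]
                ]
                values = [values[0]] + [cost(p) for p in simplex[1:]]

    best_i = min(range(n + 1), key=lambda i: values[i])
    return simplex[best_i], values[best_i]


# Toy cost: squared distance to the /a/ tube areas (cm2) quoted in the text.
TARGET_AREAS = [0.8, 0.4, 3.0, 8.0]


def toy_cost(areas):
    return sum((a - t) ** 2 for a, t in zip(areas, TARGET_AREAS))


best_areas, best_cost = nelder_mead(toy_cost, [1.0, 1.0, 1.0, 1.0])
print([round(a, 3) for a in best_areas])
```

Because only cost values are compared, the same routine can be reused unchanged for the stiffness variables by swapping in a different cost function.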
where S(ω) and S*(ω) are the power spectra of the residual signal for simulated and real speech, respectively. Here, we select the stiffness parameters k1, k2, and kc as variables for vocal fold fitting.
Here, we use the residual signal from LPC analysis to estimate the parameters of the vocal folds. The LPC model is based on a mathematical approximation of the vocal tract. We use it to remove the effect of the vocal tract and obtain the residual signal, from which the stiffness parameters are estimated with the cost functions described above. In order to make a comparison with the spectrum of the residual signal from real speech, an LPC inverse filter is also applied to the simulated speech to obtain its residual signal. Our target here is to evaluate the similarity of the spectra of the residual signals from real and simulated speech, rather than to reproduce the source waveform, since the aim of this paper is to classify speech under stress. It is believed that the main differences between neutral and stressed speech lie in the harmonic structure of the spectrum of the residual signal. Thus, in this study, obtaining the residual signal using LPC works well for showing the harmonic structure of the spectrum.
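The residual extraction can be sketched with the autocorrelation method and the Levinson-Durbin recursion. The code below is an illustrative pure-Python version; the absence of windowing and pre-emphasis, the synthetic test signal (a two-pole resonator driven by a 100 Hz impulse train at 8 kHz), and the function names are assumptions made for the example.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of a frame."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a1 z^-1 + ... + ap z^-p; returns (a, error_energy)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def lpc_residual(x, order=12):
    """Inverse-filter the frame with its own LPC polynomial A(z)."""
    a, _ = levinson_durbin(autocorrelation(x, order), order)
    return [
        sum(a[k] * x[n - k] for k in range(order + 1) if n - k >= 0)
        for n in range(len(x))
    ]


# Synthetic 64 ms frame: impulse train (period 80 samples) through a resonator.
frame = []
for n in range(512):
    y = 1.0 if n % 80 == 0 else 0.0
    if n >= 1:
        y += 1.6 * frame[n - 1]
    if n >= 2:
        y -= 0.8 * frame[n - 2]
    frame.append(y)

residual = lpc_residual(frame, order=12)
frame_energy = sum(v * v for v in frame)
residual_energy = sum(v * v for v in residual)
print(residual_energy / frame_energy)
```

The inverse filter removes most of the resonance, so the residual is dominated by the periodic excitation, whose harmonic structure is what the vocal fold cost function compares.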
4.2 Cost functions for vocal tract fitting
As for the definition of cost function 2, we utilized four different cost functions in order to compare their classification performance.
4.2.1 Formant
where the asterisk denotes the target value for real speech. The weights are given the values α1 and α2 to normalize the different target parameters to the same range, and the overbar denotes mean values over the target region.
4.2.2 RMS distance of spectral envelope (Crms)
4.2.3 Itakura-Saito distance of spectral envelope (CI-S)
4.2.4 Envelope and formant (CE-F)
where F1, F2, H1, and H2 refer to the frequency and amplitude of the first and second formants and n is the iteration number.
It would be helpful to evaluate the accuracy of the fitting method to show that the proposed method works well. However, it is difficult to compare the simulated values with the actual values because sensors are not available to measure the actual values for human beings. In this paper, we calculate the error in acoustic features between real and simulated speech to describe the accuracy of the fitting method.
where the asterisk denotes the target value for real speech.
During the training process, all of the speech samples from a specific speaker are labeled as neutral or stressed speech. The labeled speech is segmented into fixed frames, and all of the frames are fit using the two-mass model to estimate the proposed parameters. A linear classifier based on minimum Euclidean distance is trained for the classification, using the estimated physical parameters from all of the frames.
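A minimum-Euclidean-distance classifier of this kind can be sketched as a nearest-class-mean rule, which for two classes gives a linear decision boundary. The feature vectors below are invented numbers used only to exercise the code, not measured data, and whether the features are normalized before the distance is computed is not stated in the text, so none is applied here.

```python
import math


def class_means(features, labels):
    """Mean feature vector for each class label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def classify(x, means):
    """Assign x to the class whose mean is nearest in Euclidean distance."""
    return min(means, key=lambda y: math.dist(x, means[y]))


# Invented per-frame features [k1 (kdyn/cm), kc (kdyn/cm), A1 (cm2)].
train_x = [[80.0, 25.0, 0.8], [82.0, 24.0, 0.9], [110.0, 20.0, 0.5], [115.0, 18.0, 0.4]]
train_y = ["neutral", "neutral", "stressed", "stressed"]

means = class_means(train_x, train_y)
predicted = classify([108.0, 21.0, 0.6], means)
print(predicted)  # stressed
```

Because the features have very different numeric ranges, an unnormalized distance is dominated by the largest one, so some form of scaling may be needed in practice.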
A K-fold cross-validation method was used in the training and testing process, and K was set to 4. Using this method, the data set was divided into four subsets, and for each classification, one of the subsets was used as a test set and the other three subsets were combined to form a training set. The final result was obtained by calculating the average classification rate across four trials.
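The four-fold protocol can be sketched as follows. How the samples are assigned to subsets is not stated in the text, so the interleaved assignment used here is an assumption, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=4):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Assumed assignment: sample i goes to fold i mod k, giving near-equal folds.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j in range(k) if j != i for idx in folds[j]]
        yield train, test


def average_rate(rates):
    """Final result: mean classification rate over the folds."""
    return sum(rates) / len(rates)


splits = list(k_fold_indices(10, k=4))
for train, test in splits:
    print(sorted(train), test)
```

Each sample is used for testing exactly once and for training in the other three trials, so the averaged rate reflects all of the available data for a speaker.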
6.1 Database and experimental setup
In the experiments, we used a database collected by the Fujitsu Corporation containing speech samples from eleven subjects (four males and seven females). To simulate mental pressure resulting in psychological stress during a telephone call, the speakers performed three different tasks while having telephone conversations with an operator.
The three tasks involved (a) concentration, (b) time pressure, and (c) risk taking. For each speaker, there are four dialogues with different tasks. In two dialogues, the speaker was asked to finish the tasks within a limited amount of time, and in the other dialogues, there is relaxed chat without any task.
All of the data comes from telephone calls, so the sampling frequency was 8 kHz. Segments with the vowels /a/ and /e/ were cut from the speech and selected as samples. The experiments were conducted for each speaker, and all of the results were speaker dependent. The number of samples was different for each speaker. The range of the total number of samples is from 100 to 250 for each vowel from each person. We randomly chose six speakers (three males and three females) from eleven subjects to test classification performance. A K-fold cross-validation method was used in the classification experiments, in which K was set to 4. Using this method, the data set was divided evenly into four subsets, and for each classification, one of the subsets was used as a test set and the other three subsets were combined to form a training set. The final result was obtained by calculating the average classification rate across four trials. The samples were analyzed with 12-order LPC, and the frame size chosen to perform the experiment was 64 ms, with 16 ms for frame shift.
For configuration of the two-mass model, the following values were adopted, using typical values for males: m1M = 1.25 × 10−4 kg, m2M = 2.5 × 10−5 kg, lgM = 0.014 m, d1M = 0.0025 m, d2M = 5 × 10−4 m, ζ1M = 0.1, ζ2M = 0.6, x0 = 2 × 10−4 m, and Ps = 500 Pa. The vocal tract model was represented by a tube model, and the number of elements was limited to four cylindrical sections of equal length. Typical values used for configuration for females were as follows: m1F = 4.56 × 10−5 kg, m2F = 9.1 × 10−6 kg, lgF = 0.01 m, d1F = 1.79 × 10−3 m, d2F = 3.6 × 10−4 m, ζ1F = 0.1, ζ2F = 0.6, x0 = 2 × 10−4 m, and Ps = 500 Pa. Furthermore, the ranges for the control parameters were k1 = 10 to 140 kdyn/cm, k2 = 2 to 14 kdyn/cm, kc = 4 to 45 kdyn/cm, VTL = 13 to 19 cm, and A1, A2, A3, A4 = 0.2 to 20 cm2.
6.2 Results for cost functions
The results illustrate that classification performance is improved when vocal tract values are variable. In this case, the cost functions for the vocal tract are used, and formants are also considered, which results in more information about the frequency domain of the speech being available, making the estimated results more reliable. Furthermore, we compared the performance of different cost functions. Our results show that the stress classification rate for CE-F is higher than for the other cost functions. Since CE-F can match the rough shape of the spectral envelope and also effectively catch the characteristics of F1 and F2, which have been proven to be sensitive to the interaction between the VF and VT, the classification of stressed speech is improved.
6.3 Results for physical parameters
In the second evaluation, VTL was first estimated for each speaker, and further evaluations were based on the obtained vocal tract length. Here, we selected cost function CE-F, which achieved the best performance in classification during the first evaluation. The purpose of this evaluation was to verify which parameters in the stiffness and area functions are related to stress and then check the classification performance of these parameters in comparison to traditionally used features.
6.3.1 Evaluation of vocal tract length estimation
6.3.2 Evaluation of stiffness parameters of the vocal folds
6.3.3 Evaluation of parameters of the cross-sectional areas of the vocal tract
A2 and A3 do affect F0 to some extent, which was illustrated in Figure 5, so they have some influence on acoustic interaction and, further, on stress classification; however, we believe their influence is insignificant. The characteristics of the vocal tract also affect stress classification to some extent. Since A2 and A3 represent the shape of the vocal tract, [k1, kc, A1, A2, A3] can achieve some improvement in the recognition rate, but the increase is very small, which suggests that A2 and A3 are less important for stress classification than A1.
6.3.4 Evaluation for proposed physical parameters
6.4 Results of Gaussian mixture modeling
In this section, we modeled the features using Gaussian mixture models (GMMs), which are widely used statistical classifiers. Two GMMs were trained, one for neutral speech and the other for stressed speech.
The data set for each speaker was divided evenly into four subsets, and for each classification, one of the subsets was used as a test set and the other three subsets were combined to form a training set. The final result was obtained by calculating the average classification rate across the four trials of this K-fold cross-validation. In order to increase the amount of training data, the GMMs were trained using the training sets from three male speakers. The test sets of the three male speakers and all of the data from the female speakers were combined to generate the testing data used in this experiment.
Classification rates (%) with different numbers of mixtures
In this paper, we explored more effective features for the classification of neutral and stressed speech based on a physical model. To achieve this target, a two-mass model characterizing the properties of the vocal folds and the vocal tract was used to simulate speech production. Physical parameters including stiffness of the vocal folds, vocal tract length, and cross-sectional area of the vocal tract were investigated and estimated using a method that fits the two-mass model to real data. Cost functions were used as targets to reach more reliable results. The obtained parameters were used as physical features to classify stressed speech. We concluded that the two parameters: (1) stiffness of the vocal folds and (2) the area at the entrance to the vocal tract in the supraglottis, which is related to the velocity of glottal airflow and acoustic interaction between the vocal folds and the vocal tract, are key indicators of stress during phonation. The average performance in the classification of speech under stress was improved by 10% to 15% using the proposed features, compared to traditional methods of stressed speech classification. In the future, our work should be focused on the exploration of parameters for a speaker-independent stressed speech classification system.
This work has been partially supported by the ‘Core Research for Evolutional Science and Technology’ (CREST) project of the Japan Science and Technology Agency (JST). We are very grateful to Mr. Matsuo of the Fujitsu Corporation for allowing us to use their database and for his valuable suggestions.
- Steeneken HJM, Hansen JHL: Speech under stress conditions: overview of the effect on speech production and on system performance. In Proc. ICASSP. Atlanta, Georgia; 1996.
- Cairns D, Hansen JHL: Nonlinear analysis and detection of speech under stressed conditions. J. Acoust. Soc. Am. 1994, 96(6):3392-3400. 10.1121/1.410601
- Bezooijen RV: The characteristics and recognizability of vocal expression of emotions. Foris, Dordrecht; 1984.
- Tolkmitt FJ, Scherer KR: Effect of experimentally induced stress on vocal parameters. J. Exp. Psychol. 1986, 12(3):302-313.
- Williams CE, Stevens KN: Emotions and speech: some acoustical correlates. J. Acoust. Soc. Am. 1972, 52(4):1238-1250.
- Bou-Ghazale SE, Hansen JHL: Generating stressed speech from neutral speech using a modified CELP vocoder. Speech Commun. 1996, 20: 93-110. 10.1016/S0167-6393(96)00047-7
- Bond ZS, Moore TJ: A note on loud and Lombard speech. In International Conference on Spoken Language Processing. Kobe; 1990:969-972.
- Hansen JHL: Analysis and compensation of stressed and noisy speech with application to robust automatic recognition. Ph.D. dissertation. Georgia Institute of Technology, Atlanta; 1988.
- Murray IR, Baber C, South A: Toward a definition and working model of stress and its effects on speech. Speech Commun. 1996, 20: 3-12. 10.1016/S0167-6393(96)00040-4
- Whitmore J, Fisher S: Speech during sustained operations. Speech Commun. 1996, 20: 55-70. 10.1016/S0167-6393(96)00044-1
- Kamano A, Washio N, Harada S, Matsuo N: A study of psychological suppression detection based on non-verbal information. IEICE Technical Report IEICE-SP2010-64. IEICE, Tokyo; 2010:107-110. (in Japanese)
- Kaiser JF: On Teager’s energy algorithm and its generalization to continuous signals. In Proceedings of the 4th IEEE Digital Signal Processing Workshop. New Paltz; 1990.
- Zhou G, Hansen JHL, Kaiser JF: Nonlinear feature based classification of speech under stress. IEEE Trans. Speech Audio Process. 2001, 3: 201-206.
- Fant G: Acoustic Theory of Speech Production. Mouton, The Hague; 1960.
- Dunn HK: Methods of measuring vowel formant bandwidths. J. Acoust. Soc. Am. 1961, 33(12):1737-1746. 10.1121/1.1908558
- Wong DY, Markel JD, Gray AH: Glottal inverse filtering from the acoustic speech waveform. IEEE Trans. Acoust. Speech Signal Process. 1979, 27(4):350-355. 10.1109/TASSP.1979.1163260
- Kaiser JF: Some observations on vocal tract operation from a fluid flow point of view. In Vocal Fold Physiology: Biomechanics, Acoustics, and Phonatory Control. Edited by: Titze IR, Scherer RC. Denver Center for the Performing Arts, Denver; 1983:358-386.
- Ishizaka K, Flanagan JL: Synthesis of voiced sounds from a two-mass model of the vocal cords. Bell. Syst. Tech. J. 1972, 51: 1233-1268.
- Kincaid D, Cheney W: Numerical Analysis: Mathematics of Scientific Computing. 3rd edition. Brooks/Cole, Pacific Grove; 2002:722-723.
- Lucero C: Chest- and falsetto-like oscillations in a two-mass model of vocal folds. J. Acoust. Soc. Am. 1996, 100: 3355-3399. 10.1121/1.416976
- Titze IR: Acoustic interpretation of resonant voice. J. Voice 2001, 15: 519-528. 10.1016/S0892-1997(01)00052-2
- Flanagan JL: Speech Analysis, Synthesis, and Perception. Springer-Verlag, New York; 1972.
- de Cheveigne A, Kawahara H: YIN, a fundamental frequency estimator for speech and music. J. Acoust. Soc. Am. 2002, 111(4):1917-1930. 10.1121/1.1458024
- Titze IR, Story BH: Acoustic interactions of the voice source with the lower vocal tract. J. Acoust. Soc. Am. 1997, 101: 2234-2243. 10.1121/1.418246
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.