Classification of speech under stress based on modeling of the vocal folds and vocal tract

Yao, Xiao; Jitsuhiro, Takatoshi; Miyajima, Chiyomi; Kitaoka, Norihide; Takeda, Kazuya

doi:10.1186/1687-4722-2013-17

Research
Open access
Published: 05 July 2013

EURASIP Journal on Audio, Speech, and Music Processing volume 2013, Article number: 17 (2013)

Abstract

In this study, we focus on the classification of neutral and stressed speech based on a physical model. In order to represent the characteristics of the vocal folds and vocal tract during the process of speech production and to explore the physical parameters involved, we propose a method using the two-mass model. As feature parameters, we focus on stiffness parameters of the vocal folds, vocal tract length, and cross-sectional areas of the vocal tract. The stiffness parameters and the area of the entrance to the vocal tract are extracted from the two-mass model after we fit the model to real data using our proposed algorithm. These parameters are related to the velocity of glottal airflow and acoustic interaction between the vocal folds and the vocal tract and can precisely represent features of speech under stress because they are affected by the speaker’s psychological state during speech production. In our experiments, the physical features generated using the proposed approach are compared with traditionally used features, and the results demonstrate a clear improvement of up to 10% to 15% in average stress classification performance, which shows that our proposed method is more effective than conventional methods.

1. Introduction

Stress is a psycho-physiological state characterized by subjective strain, increased physiological activity, and deterioration of performance [1]. Factors inducing stress on speakers include workload, background noise, emotions, physical environmental factors (e.g., G-force), and fatigue. These factors are believed to affect voice quality and are detrimental to the performance of communication equipment, especially automated systems with speech interfaces. Therefore, it has become increasingly important to study speech under stress in order to improve the performance of speech recognition systems, to recognize when people are in a stressed state and to understand contexts in which speakers are communicating.

Researchers have attempted to probe reliable indicators of stress by analyzing acoustic variables. Some external factors (workload, background noise, etc.) and internal factors (emotional state, fatigue, etc.) may induce stress [2]. The first investigations of emotional speech were conducted in the mid-1980s, using the statistical properties of acoustic features in order to detect emotions from speech [3, 4]. It has been found that fundamental frequency (F₀) has different characteristics for each emotion [5] and that respiration patterns and muscle tension also change [6]. The influence of the Lombard effect on speech recognition has also been examined [7, 8]. Selected acoustic features have been analyzed, such as amplitude and distribution of spectral energy, and it was found that spectral energy shifted to higher frequencies for consonants in the presence of loud background noise. High workload stress has been proven to have a significant impact on the performance of speech recognition systems, with speech under workload sounding faster, softer, or louder than neutral speech [9, 10]. Matsuo et al. examined the frequency domain and showed how differences in the spectrum of the high frequency band under stressful workload conditions could be used to catch people committing remittance fraud, and their proposed measure achieved better classification performance [11]. Furthermore, the Teager energy operator (TEO) [12] was proposed to explore variations in the energy of airflow characteristics within the glottis for the purpose of stress classification [13]. However, the features examined in these previous studies lack a physical basis, and the methods do not consider the whole process of speech production, which is believed to be essential for effective classification of speech under stress.

We propose a stressed speech classification method based on a physical model characterizing the vocal folds (VF) and the vocal tract (VT). This method can represent the process of speech production and model airflow patterns in the vocal folds and the vocal tract, which are essential for stress classification. In this physical model, changes in the physical characteristics of the vocal folds, such as muscle tension, have a modulating effect on the formants, while the shape of the vocal tract can also influence the glottal source because of the interaction between the vocal folds and the vocal tract. It is believed that the presence of stress can result in variations in the physical characteristics of physiological systems and influence the acoustic interaction between the vocal folds and the vocal tract [2]. The parameters of the physical model also help represent the influence of speaking style more directly and clearly. Therefore, a physical model is helpful to estimate the parameters of the physiological system.

An early but still prominent physical model is the source-filter model [14], which models speech as the combination of a glottal source (such as the vocal folds) and a linear acoustic filter representing the vocal tract and its radiation characteristic. An important assumption that is often made in the use of the source-filter model is independence of the source and filter. In such cases, the model should more accurately be referred to as the ‘independent source-filter model’. In 1961, Wong proposed a linear model of speech production using a lossless tube model of the vocal tract [15]. In 1979, a linear source tract model was proposed to model the glottal source, the vocal tract, and radiation impedance as linear filters, using covariance analysis [16]. However, the vocal tract and vocal folds do not function independently of each other; instead, there is some form of interaction between them [17], which results in significant changes in fundamental frequency and formant characteristics.

The two-mass model is a physical model, which attempts to simulate the physical process of vocal fold vibration, characterizing the vocal folds and the vocal tract, and to also model the effect of glottis-vocal tract interaction [18]. Parameters affected under stressed conditions are extracted from the physical model and are used as features to identify speech under stress more precisely. We use the two-mass model as a physical model, and our proposed method estimates the values of parameters included in the model from input speech. To identify speech under stress, we evaluate parameters affected by stress.

In this paper, we propose a method for fitting a physical model to real speech in order to estimate the physical parameters which characterize the vocal folds and the vocal tract. For the physical model, a two-mass model connected to a four-tube model is used to simulate the process of speech production. The physical parameters (stiffness, vocal tract length, and cross-sectional areas of the vocal tract) are estimated by fitting the model to real speech. The estimated parameters can be further analyzed and proposed as features for the classification of neutral and stressed speech. Furthermore, different cost functions are proposed to compare classification performance. As a result, stiffness of the vocal folds and cross-sectional areas of the vocal tract are selected as features for the classification of neutral and stressed speech.

The paper is organized as follows: In Overview, an overview of our method is presented. Physical parameters, related to the vocal folds and the vocal tract, based on the two-mass model are described as features for classification in Physical parameters. This is followed by the presentation of a fitting algorithm for real speech data in Estimation method to help estimate the physical parameters. Classification describes the classification method used for evaluation. In Evaluation, experiments are performed to evaluate the obtained parameters and show their corresponding classification performances when separating neutral and stressed speech. Finally, we draw our conclusions in Conclusion.

2. Overview

An overview of our work is shown in Figure 1. It includes the three steps needed to perform stressed speech classification: proposal of physical parameters, parameter estimation by fitting them to the two-mass model, and the classification of neutral and stressed speech.

Initially, we propose physical parameters considered likely to be useful, which include stiffness parameters of the vocal folds, vocal tract length, and cross-sectional areas of the vocal tract. These parameters characterize the behavior of vocal folds and the shape of the vocal tract. Furthermore, the relationship between the selected physical parameters and acoustic parameters has been shown to represent characteristics of the interaction between the vocal folds and the vocal tract.

The proposed physical parameters are then estimated by fitting the two-mass model to real speech. An algorithm based on the analysis-by-synthesis method is proposed for this fitting. The Nelder-Mead simplex method [19] is used as a search strategy in order to find the optimal physical parameters. Because there is interaction between the VF and VT, vocal fold fitting and vocal tract fitting are performed iteratively to estimate the parameters.

For classification, a linear classifier is trained using utterances from each speaker. Currently, a simple linear classifier based on Euclidean distance is used for classification. Also, since we only have speech data for a small number of speakers, we evaluate our proposed method as a speaker-dependent system.

3. Physical parameters

A method which fits the two-mass model to real speech is proposed for classifying speech under stress. Some of the physical parameters characterizing the vocal folds and the vocal tract are estimated. The two-mass vocal fold model was originally proposed by Ishizaka and Flanagan to simulate the process of speech production [18]. We propose three types of feature parameters extracted from the two-mass model: stiffness, vocal tract length, and cross-sectional area of the entrance of the vocal tract. In the following sections, we will define these parameters and describe their characteristics.

3.1 Stiffness

The stiffness parameters are related to muscle tension in the vocal folds. Generally, the stiffness of the vocal folds is considered to depend mainly on two muscles: the cricothyroid muscle (CT) and thyroarytenoid muscle (TA) [16]. In the two-mass model, coupling stiffness k_c is related to the tension in the TA muscle, so a high k₁ value and a low k_c value represent contraction of the CT muscle and relaxation of the TA muscle, respectively.

Figure 2 shows a sketch of the model. Each vocal fold is represented by a mass-spring-damper system, joined with a coupling stiffness [18]. It is represented as:

m_{1} \frac{d^{2} x_{1}}{d t^{2}} + r_{1} \frac{d x_{1}}{dt} + s_{1} (x_{1}) + k_{c} (x_{1} - x_{2}) = F_{1},

(1)

m_{2} \frac{d^{2} x_{2}}{d t^{2}} + r_{2} \frac{d x_{2}}{dt} + s_{2} (x_{2}) + k_{c} (x_{2} - x_{1}) = F_{2} .

(2)

Tissue elasticity (or ‘spring’) s_i represents the tension of the vocal folds, which depends on the contraction of different muscles. The equivalent tensions are given by:

s_{i} (x_{i}) = k_{i} (x_{i} + η x_{i}^{3}), i = 1, 2,

(3)

whose notations and variables are documented in Table 1.

Table 1 Notations and variables in the two-mass model for the vocal folds

Stiffness parameters are the main factors relating to fundamental frequency, and they can also determine the amplitude of the glottal area and glottal volume velocity [20], so source excitation is significantly influenced by the degree of stiffness. During the production of speech, the natural frequency of the vocal folds is determined by both their mass and stiffness. However, in order to simplify the estimation algorithm, only the stiffness parameters are estimated, with mass fixed as a constant.
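As a minimal sketch of how Equations 1 to 3 can be evaluated numerically, the function below solves the two equations of motion for the accelerations of the two masses. The function names, the dictionary layout, and the values of the masses, damping terms, and η are illustrative assumptions (they are not taken from Table 1); only the stiffness values correspond to the typical values quoted later in the text.

```python
def spring_force(x, k, eta):
    """Nonlinear restoring force s_i(x_i) = k_i (x_i + eta x_i^3), Equation 3."""
    return k * (x + eta * x ** 3)


def two_mass_accelerations(x1, v1, x2, v2, f1, f2, p):
    """Solve Equations 1 and 2 for the accelerations of masses m1 and m2.

    x, v are displacements (cm) and velocities (cm/s); f1, f2 are the
    driving forces F_1, F_2 (dyn); p holds the model parameters.
    """
    a1 = (f1 - p["r1"] * v1 - spring_force(x1, p["k1"], p["eta"])
          - p["kc"] * (x1 - x2)) / p["m1"]
    a2 = (f2 - p["r2"] * v2 - spring_force(x2, p["k2"], p["eta"])
          - p["kc"] * (x2 - x1)) / p["m2"]
    return a1, a2


# Assumed, illustrative parameter set (CGS units).
params = {
    "m1": 0.125, "m2": 0.025,        # masses in g (assumed)
    "r1": 20.0, "r2": 10.0,          # damping in dyn*s/cm (assumed)
    "k1": 80000.0, "k2": 8000.0,     # stiffness in dyn/cm
    "kc": 25000.0,                   # coupling stiffness in dyn/cm
    "eta": 100.0,                    # nonlinearity coefficient in 1/cm^2 (assumed)
}

# Lower mass displaced by 0.01 cm, everything else at rest, no driving force.
acc1, acc2 = two_mass_accelerations(0.01, 0.0, 0.0, 0.0, 0.0, 0.0, params)
```

In this configuration the lower mass is pulled back toward rest while the coupling spring pushes the upper mass in the direction of the displacement, which is the mechanism by which k_c links the motion of the two masses.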

3.2 Vocal tract length and cross-sectional area

The supraglottic area includes the structures that lie above the true vocal folds and below the base of the tongue. The anatomical structures present in this area that are important to speech production lie posterior to the epiglottis. They include the ventricle, false vocal folds, epiglottis, arytenoids, laryngeal aspects of the aryepiglottic folds, and vestibule [21].

The two-mass model is connected to a four-tube model representing the vocal tract [18]. The tube model is constructed using a transmission line analogy involving n cylindrical, hard-walled sections. The elemental values of the model are determined by cross-sectional areas A₁ ⋯ A_n and cylinder lengths l₁ ⋯ l_n. The total length of the vocal tract is defined as L_VT. The tube model can be represented by an equivalent circuit, as shown in Figure 3. The inductances are L_n = ρl_n / 2A_n, the capacitances are C_n = l_n ⋅ A_n / ρc², and the resistances are $R_{n} = (S_{n} / A_{n}^{2}) \sqrt{ρμω / 2}$, where c is the velocity of sound. Here, the tube model has been limited to four cylindrical sections of equal length, n = 4. In this study, the model is limited to only vowel articulation (as vowels were the subject of the experiments) and modal voice production. These assumptions greatly simplify the modeling of the vocal tract and the glottal source. Following the original paper [18], we use a four-tube model to simulate the vocal tract. Furthermore, in the following analysis, we propose A₁ as one of our feature parameters because the other areas, A₂, A₃, and A₄, are less effective for classification than A₁. Thus, we currently consider the four-tube model to be sufficient.

The model is terminated in a radiation load equal to that of a circular piston in an infinite baffle, with $L_{R} = 8 ρ / (3 π \sqrt{π A_{n}})$ and R_R = 128ρc / (9π²A_n), where A_n is the area of the mouth. The notations and variables are documented in Table 2.
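The element values of one tube section can be computed directly from the expressions for L_n, C_n, and R_n given above. The sketch below is illustrative only: the physical constants are assumed CGS values, and S_n is assumed to be the circumference of a circular section, since its definition is not restated in the text.

```python
import math

RHO = 1.14e-3      # air density in g/cm^3 (assumed value)
C_SOUND = 3.5e4    # velocity of sound in cm/s (assumed value)
MU = 1.86e-4       # air viscosity in dyn*s/cm^2 (assumed value)


def tube_section_elements(area, length, omega):
    """Return (L_n, C_n, R_n) for one cylindrical section.

    area   : cross-sectional area A_n in cm^2
    length : section length l_n in cm
    omega  : angular frequency in rad/s at which R_n is evaluated
    """
    # Assumption: S_n is the circumference of a circular section of area A_n.
    circumference = 2.0 * math.sqrt(math.pi * area)
    inductance = RHO * length / (2.0 * area)
    capacitance = length * area / (RHO * C_SOUND ** 2)
    resistance = (circumference / area ** 2) * math.sqrt(RHO * MU * omega / 2.0)
    return inductance, capacitance, resistance


# Four equal sections of 4 cm with the /a/ area function used later in the text.
areas_a = [0.8, 0.4, 3.0, 8.0]
elements = [tube_section_elements(a, 4.0, 2.0 * math.pi * 500.0) for a in areas_a]
```

Narrow sections give large inductances and small capacitances, so a change in a single area such as A₁ alters the load seen by the glottal source.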

Table 2 Notations and variables in the two-mass model for the vocal tract

Therefore, the differential equations related to the volume velocities of the system are:

\begin{array}{l} (R_{k 1} + R_{k 2}) |U_{g}| U_{g} + (R_{v 1} + R_{v 2}) U_{g} + (L_{g 1} + L_{g 2}) \frac{d U_{g}}{dt} + \\ L_{1} \frac{d U_{g}}{dt} + R_{1} U_{g} + \frac{1}{C_{1}} \int_{0}^{t} (U_{g} - U_{1}) dt - P_{s} = 0 \end{array}

\begin{array}{l} (L_{1} + L_{2}) \frac{d U_{1}}{dt} + (R_{1} + R_{2}) U_{1} + \\ \frac{1}{C_{2}} \int_{0}^{t} (U_{1} - U_{2}) dt + \frac{1}{C_{1}} \int_{0}^{t} (U_{1} - U_{g}) dt = 0 \\ \begin{array}{c} ⋮ \end{array} \end{array}

(4)

L_{R} \frac{d}{dt} (U_{R} + U_{L}) + R_{R} \cdot U_{R} = 0,

where $R_{v 1} = 12 \frac{μ l_{g}^{2} d_{1}}{A_{g 1}^{3}}$ , $R_{v 2} = 12 \frac{μ l_{g}^{2} d_{2}}{A_{g 2}^{3}}$ , $L_{g 1} = \frac{ρ d_{1}}{A_{g 1}}$ , $L_{g 2} = \frac{ρ d_{2}}{A_{g 2}}$ , $R_{k 1} = \frac{0.19 ρ}{A_{g 1}^{2}}$ , and $R_{k 2} = \frac{ρ [0.5 - \frac{A_{g 2}}{A_{1}} (1 - \frac{A_{g 2}}{A_{1}})]}{A_{g 2}^{2}}$ .

The length of the vocal tract and its cross-sectional areas are the main parameters which determine the shape of the vocal tract and have a significant impact on the distribution of formants. Vocal tract length and cross-sectional areas of the tube model are computed from real speech.

3.3 Relationship between physical parameters and acoustic parameters

In this section, we describe experiments which were performed to represent the presence of acoustic interaction and show the relationship between physical and acoustic parameters. Aerodynamics in the glottis is modeled using the two-mass model. In order to clarify the relationship between physical and acoustic parameters, we will first briefly describe the main equations representing the aerodynamics of speech production.

If subglottal pressure is represented as P_s, then air pressure drops to P₁₁ when air enters the glottis (at the edge of m₁) according to Bernoulli’s equation. The abrupt contraction in the cross-sectional area at the inlet to the glottis causes a phenomenon called vena contracta, which causes the air pressure to undergo an even greater drop. The drop is determined by the flow measurements of van den Berg:

P_{s} - P_{11} = (1.00 + 0.37) \frac{ρ U_{g}^{2}}{2 A_{g 1}^{2}},

(5)

where ρ is the air density, U_g is the volume velocity of glottal airflow, and A_g1 is the cross-sectional lower glottal area, which is represented by A_g1 = 2l_g(x₀ + x₁), where l_g is the length of the vocal fold and x₀ is the displacement when the vocal folds are in the rest position.

Along masses m₁ and m₂, pressure drops as a result of air viscosity:

P_{i 1} - P_{i 2} = \frac{12 μ d_{i} l_{g}^{2} U_{g}}{A_{g i}^{3}}, i = 1, 2,

(6)

where μ is the air viscosity coefficient and d_i is the width of m_i.

At the boundary between the two masses, the pressure drop can be calculated by:

P_{21} - P_{12} = \frac{ρ U_{g}^{2}}{2} (\frac{1}{A_{g 1}^{2}} - \frac{1}{A_{g 2}^{2}}),

(7)

where P₂₁ is the air pressure at the lower edge of m₂ and A_g2 is the cross-sectional upper glottal area.

At the glottal outlet, abrupt expansion causes the pressure to recover because of the relatively large area of the vocal tract. This pressure is given by:

P_{1} - P_{22} = \frac{1}{2} ρ \frac{U_{g}^{2}}{A_{g 2}^{2}} [2 N (1 - N)],

(8)

where P₁ is the pressure at the inlet of the vocal tract. Here, the parameter N is defined as N = A_g2 / A₁, where A₁ is the area of the entrance to the vocal tract. N is the ratio of the area at the outlet of the vocal folds to the area at the inlet of the vocal tract, which is significant to the acoustic interaction between the vocal folds and the vocal tract [18]. Since glottal area A_g2 does not change significantly during the oscillation of the vocal folds, A₁ is the parameter relating to the acoustic interaction.
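To make the role of A₁ concrete, the pressure terms of Equations 5 to 8 can be evaluated for a given glottal flow. The following is a minimal sketch; the function name, the default air density and viscosity, and the example dimensions are assumptions chosen for illustration.

```python
def glottal_pressure_drops(u_g, a_g1, a_g2, a_1, d1, d2, l_g,
                           rho=1.14e-3, mu=1.86e-4):
    """Evaluate the pressure terms of Equations 5 to 8 (CGS units).

    u_g        : glottal volume velocity (cm^3/s)
    a_g1, a_g2 : glottal areas at masses m1 and m2 (cm^2)
    a_1        : area of the entrance to the vocal tract (cm^2)
    d1, d2     : widths of m1 and m2 (cm); l_g: vocal fold length (cm)
    rho, mu    : assumed air density and viscosity
    """
    n = a_g2 / a_1
    return {
        # Equation 5: P_s - P_11 (Bernoulli drop plus vena contracta loss)
        "inlet": (1.00 + 0.37) * rho * u_g ** 2 / (2.0 * a_g1 ** 2),
        # Equation 6: viscous drops along m1 and m2
        "viscous_1": 12.0 * mu * d1 * l_g ** 2 * u_g / a_g1 ** 3,
        "viscous_2": 12.0 * mu * d2 * l_g ** 2 * u_g / a_g2 ** 3,
        # Equation 7: P_21 - P_12 at the junction of the two masses
        "junction": 0.5 * rho * u_g ** 2 * (1.0 / a_g1 ** 2 - 1.0 / a_g2 ** 2),
        # Equation 8: P_1 - P_22, pressure recovery at the glottal outlet
        "recovery": 0.5 * rho * u_g ** 2 / a_g2 ** 2 * (2.0 * n * (1.0 - n)),
    }


# Example with equal glottal areas, so the junction term vanishes.
drops = glottal_pressure_drops(u_g=200.0, a_g1=0.5, a_g2=0.5, a_1=1.0,
                               d1=0.25, d2=0.05, l_g=1.4)
```

Because the recovery term depends on A₁ only through N, changing A₁ while holding the glottal areas fixed directly changes the pressure at the vocal tract inlet.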

In Equation 4, it is shown that airflow velocity U_g depends on both the stiffness of the vocal folds and area of the entrance to the vocal tract A₁. Therefore, it is our assumption that parameters k₁, k₂, k_c, and A₁ related to velocity have an impact on acoustic interaction. In this paper, experiments are performed to represent the presence of this interaction by showing the relationship between physical and acoustic parameters. Due to the presence of these interactions, changes in the oscillation of the vocal folds affect the distribution of formants, and different shapes of the vocal tract (length and area) also influence the glottal source. Table 3 lists the physical and acoustic parameters.

Table 3 Physical and acoustic parameters

We first examined how the stiffness parameters impact the distribution of formants. To do so, we fixed the shape of the vocal tract and examined how variation in the stiffness parameters of the vocal folds affects the shift of formants. The vocal tract model was represented by a standard tube configuration for the vowels /a/ and /e/ [22]. In order to reduce the number of parameters to be estimated and simplify the proposed method, typical values were adopted for the configuration of the tube model: the length chosen for the vocal tract was L_VT = 16 cm, with each element l_i = 4 cm, and the cross-sectional areas were fixed at A₁ = 0.8 cm², A₂ = 0.4 cm², A₃ = 3 cm², and A₄ = 8 cm² for /a/ and A₁ = 1 cm², A₂ = 8 cm², A₃ = 8 cm², and A₄ = 8 cm² for /e/. When a specific stiffness parameter was examined, the other stiffness parameters were fixed at typical values. We varied stiffness parameters k₁ (20 to 240 kdyn/cm), k₂ (2 to 40 kdyn/cm), and k_c (2.5 to 70 kdyn/cm) to examine variation in formants. Formants were estimated by modeling the vocal tract frequency response using linear predictive coding (LPC), which yields formant frequencies from the all-pole model of the vocal tract transfer function.
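LPC-based formant estimation of the kind described here can be sketched as follows. This is not the authors' implementation: it assumes the autocorrelation method for the LPC coefficients and reads formant candidates from the angles of the poles of the all-pole model. The function names and the synthetic test signal are illustrative.

```python
import numpy as np


def lpc_coefficients(signal, order):
    """Autocorrelation-method LPC; returns [1, a_1, ..., a_p] of A(z)."""
    x = np.asarray(signal, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    toeplitz = np.array([[r[abs(i - j)] for j in range(order)]
                         for i in range(order)])
    # Normal equations: sum_k a_k r[|i - k|] = -r[i], i = 1..p
    a = np.linalg.solve(toeplitz, -r[1:])
    return np.concatenate(([1.0], a))


def formants_from_lpc(a, fs):
    """Formant candidates in Hz from the upper-half-plane poles of 1/A(z)."""
    poles = np.roots(a)
    poles = poles[np.imag(poles) > 0]
    return np.sort(np.angle(poles) * fs / (2.0 * np.pi))


# Synthetic check: impulse response of an all-pole filter with two resonances.
fs = 8000.0
targets_hz, bandwidth_hz = [700.0, 1200.0], 80.0
radius = np.exp(-np.pi * bandwidth_hz / fs)
poles = [radius * np.exp(2j * np.pi * f / fs) for f in targets_hz]
a_true = np.real(np.poly(poles + [p.conjugate() for p in poles]))
h = np.zeros(2000)
for n in range(len(h)):
    acc = 1.0 if n == 0 else 0.0
    for k in range(1, len(a_true)):
        if n - k >= 0:
            acc -= a_true[k] * h[n - k]
    h[n] = acc
estimated = formants_from_lpc(lpc_coefficients(h, order=4), fs)
```

On real speech, the model order, windowing, and pre-emphasis all affect which poles correspond to formants, so candidate frequencies usually need to be filtered by bandwidth before they are used.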

Figure 4 shows the relationship between the stiffness parameters and different formants. It shows that k₂ does not significantly influence formants, but that first and second formants will shift their location to a lower frequency with the increase of k₁, although there is no significant change in the third formant (F₃). A similar phenomenon occurs for k_c. When k_c decreases, F₁ also has a tendency to shift to a lower frequency, while F₂ and F₃ are less influenced by the variation of k_c. Therefore, it is shown that stiffness parameters k₁ and k_c can affect the distribution of formants and that the first and second formants are easily affected by acoustic interaction.

Next, we fixed the configuration of the vocal folds and examined how variation of the cross-sectional area of the vocal tract impacts the fundamental frequency (F₀) of speech. Stiffness was fixed at typical values k₁ = 80,000 dyn/cm, k₂ = 8,000 dyn/cm, and k_c = 25,000 dyn/cm to check how the fundamental frequency changes with the area function. When checking the impact of a specific area, other areas and vocal tract length (VTL) were fixed at typical values for /a/ or /e/. When considering VTL, all the cross-sectional areas were fixed at typical values. We then changed the cross-sectional area or VTL to examine their impact on F₀. The variation range for VTL was 13 to 19 cm, and for cross-sectional area of the VT, the range was 0.1 to 20 cm². The fundamental frequency of speech was estimated with the YIN algorithm [23], which is based on the well-known autocorrelation method, with a number of modifications that combine to prevent errors.
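The core of YIN can be illustrated with a simplified sketch that keeps only the difference function, the cumulative mean normalized difference, and the absolute threshold; the parabolic interpolation and best-local-estimate steps of the full algorithm [23] are omitted. The function name and default search range are assumptions.

```python
import numpy as np


def yin_f0(frame, fs, fmin=60.0, fmax=500.0, threshold=0.1):
    """Simplified YIN estimate of F0 in Hz for one frame.

    The frame must be longer than fs / fmin samples.
    """
    x = np.asarray(frame, dtype=float)
    tau_min = int(fs / fmax)
    tau_max = int(fs / fmin)
    window = len(x) - tau_max                     # integration window length
    # Step 1: squared difference function d(tau)
    d = np.zeros(tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff = x[:window] - x[tau:tau + window]
        d[tau] = np.dot(diff, diff)
    # Step 2: cumulative mean normalized difference
    cmnd = np.ones(tau_max + 1)
    running_sum = np.maximum(np.cumsum(d[1:]), 1e-12)
    cmnd[1:] = d[1:] * np.arange(1, tau_max + 1) / running_sum
    # Step 3: first dip below the threshold, followed down to its local minimum
    tau = tau_min
    while tau < tau_max:
        if cmnd[tau] < threshold:
            while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return fs / tau
        tau += 1
    # Fallback: global minimum in the search range
    return fs / (tau_min + int(np.argmin(cmnd[tau_min:])))


# Example: a pure 200 Hz tone sampled at 16 kHz.
fs = 16000.0
t = np.arange(2048) / fs
f0 = yin_f0(np.sin(2.0 * np.pi * 200.0 * t), fs)
```

Without interpolation the estimate is quantized to integer lags, which is adequate for observing trends in F₀ but coarser than the full algorithm.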

Figure 5 shows the relationship between the vocal tract parameters (vocal tract length and cross-sectional area) and fundamental frequency. It shows that VTL has little impact on F₀ and only determines the distribution of formants. However, an increase in cross-sectional area A₁ can cause F₀ to change significantly. Cross-sectional areas A₂ and A₃ also have an impact on F₀ to some extent, but their influence is insignificant compared to that of A₁.

Therefore, it is our conclusion that stiffness of the vocal folds and cross-sectional area A₁ affect both the fundamental frequency and formants and, further, the interaction between the vocal folds and the vocal tract.

3.4 Parameters representing stress

In Relationship between physical parameters and acoustic parameters, the experimental results show that stiffness of the vocal folds and cross-sectional area A₁ have an impact on the interaction between the vocal folds and the vocal tract. It is believed that the variations in acoustic interaction differ markedly between neutral and stressed speech [2], so stiffness and A₁ should be selected as parameters for representing stress.

In theory, Equation 8 shows that both the velocity of glottal airflow and the difference between the area of the outlet of the vocal folds and the inlet of the vocal tract have an impact on the pressure difference inside and outside of the glottis. Thus, the two factors can cause variations in the airflow patterns in the glottis and thus are likely to be effective to represent the presence of stress.

Variation in the stiffness of the vocal folds influences the time span of glottal opening and closing phases and causes glottal airflow to accelerate in the glottis, thus impacting the velocity of glottal airflow. Therefore, we can also assume that stiffness parameters can be potential parameters for stress detection.

A₁ in the four-tube model is the area of the entrance to the vocal tract in the supraglottis. Narrowing A₁ facilitates phonation by decreasing the oscillation threshold pressure of the vocal folds [24]. Since glottal area A_g2 does not change significantly during the oscillation of the vocal folds, A₁ is the main factor determining the pressure difference between the inside and outside of the glottis and has an impact on the acoustic interaction between VF and VT. Based on these considerations, we also make the assumption that A₁ is an effective parameter for stress classification.

4. Estimation method

4.1 Algorithm for fitting

The goal of stress classification is to determine from speech data if a specific person is under stress when he or she is speaking. When speech is input to the system, it is split into several frames, and further estimation of the physical parameters is performed for each frame. VTL for each speaker is first calculated; then, the obtained VTL is input as a known parameter. Then, the two-mass model is fit to each speech sample to simulate the vocal folds and the vocal tract. An outline of our method is shown in Figure 6.

In the first step, estimation of VTL is performed. Since VTL has no impact on the glottal source, it can be estimated separately. Because VTL varies with each speaker, all of the neutral speech data for vowel /a/ from each speaker is used to estimate the vocal tract length of that speaker. Here, we mainly consider the neutral speech for each speaker in the database. During VTL estimation, real speech from a database is analyzed using LPC to obtain the spectral envelope. The stiffness parameters are fixed at typical values and are taken as an input. The two-mass model is then fit to the neutral speech of each speaker to estimate the parameters of vocal tract length and cross-sectional area. Nelder-Mead simplex method [19] is used to search for the optimal values for fitting. For each speaker i, the probability distribution P_i(L_VT(i, k)) of VTL L_VT(i, k) for all neutral speech is calculated, and we choose the one with the highest probability as the estimated vocal tract length.

L_{VT}^{*} (i) = \arg \max_{L_{VT} (i, k)} P_{i} (L_{VT} (i, k)) .

(9)

The detailed fitting procedure is the same as that used for vocal tract fitting described below, which is shown in Figure 7. Equation 12 is used as the cost function.
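The selection in Equation 9 amounts to taking the mode of the per-utterance VTL estimates for a speaker. A minimal sketch is shown below, under the assumption that the probability distribution P_i is approximated by a histogram; the bin width and the function name are illustrative choices, not values from the paper.

```python
from collections import Counter


def most_probable_vtl(vtl_estimates_cm, bin_width_cm=0.5):
    """Return the center of the most populated histogram bin (Equation 9).

    vtl_estimates_cm : per-utterance VTL estimates L_VT(i, k) for one speaker
    bin_width_cm     : assumed histogram resolution
    """
    bins = Counter(round(v / bin_width_cm) for v in vtl_estimates_cm)
    best_bin, _count = max(bins.items(), key=lambda item: item[1])
    return best_bin * bin_width_cm


# Hypothetical estimates from six neutral /a/ utterances of one speaker.
vtl_star = most_probable_vtl([15.9, 16.1, 16.0, 17.2, 16.2, 14.8])
```

Using the mode rather than the mean makes the estimate less sensitive to occasional utterances for which the fitting converges to an implausible length.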

In the next step, the estimated VTL of this speaker, which was obtained during the first step, is used, and the two-mass model is fit to the real speech to estimate the other physical parameters. Fitting the model to real speech poses a difficulty: estimation of too many parameters may make the fitting method unstable. The solution to this problem is to split the process into two main parts so that the VF and VT are fit with two different cost functions. However, the existence of interaction between VF and VT makes it impossible to fit VF and VT separately, and changes in the stiffness parameters and in A₁ in the tube model can influence both formants and the glottal source. An alternative is to perform iteration when fitting the vocal folds and the vocal tract. Thus, an iteration method is used for vocal fold and vocal tract fitting, which are accomplished as follows. Figure 8 shows the structure of the fitting algorithm.

For vocal tract fitting, stiffness parameters are fixed at typical values and are taken as an input to vocal tract fitting. The parameters for the cross-sectional areas are then estimated. Next, the obtained areas are used as an input for vocal fold fitting, and the two-mass model is fit to estimate the new stiffness parameters. When the current stiffness differs significantly from the typical value, the corresponding formants are also affected, and some variations can occur. In such cases, vocal tract fitting needs to be performed again. We alternate between the two parts until the results converge.
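The alternation between vocal tract fitting and vocal fold fitting can be sketched as a simple loop. In this sketch, `fit_tract` and `fit_folds` are hypothetical callables standing in for the two fitting stages, and the convergence test on the change in stiffness is an assumed criterion, since the text does not specify one.

```python
def alternate_fitting(fit_tract, fit_folds, stiffness0, tol=1e-6, max_rounds=50):
    """Alternate vocal tract and vocal fold fitting until stiffness settles.

    fit_tract(stiffness) -> areas      (hypothetical vocal tract fitting stage)
    fit_folds(areas)     -> stiffness  (hypothetical vocal fold fitting stage)
    """
    stiffness = list(stiffness0)
    areas = None
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        areas = fit_tract(stiffness)           # areas given current stiffness
        new_stiffness = fit_folds(areas)       # stiffness given new areas
        change = max(abs(n - o) for n, o in zip(new_stiffness, stiffness))
        stiffness = list(new_stiffness)
        if change < tol:
            break
    return areas, stiffness, rounds


# Toy stand-ins with a known fixed point (areas = [2.0], stiffness = [2.0]).
toy_tract = lambda k: [0.5 * k[0] + 1.0]
toy_folds = lambda a: [0.5 * a[0] + 1.0]
areas, stiffness, rounds = alternate_fitting(toy_tract, toy_folds, [0.0])
```

In the real procedure each stage is itself a Nelder-Mead search over the two-mass model, so the number of outer rounds directly multiplies the cost of fitting a frame.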

The detailed structure of vocal tract fitting and vocal fold fitting is shown in Figures 9 and 10. Vocal tract fitting includes two steps. First, real speech from a database is analyzed using LPC to calculate the spectral envelope. In the second step, a simulation is performed using the two-mass model to produce speech using an initial area function. The same spectral envelope is calculated from the simulated speech and is compared with the one obtained in the first step to find the difference between them. The difference between the simulated spectrum and the target spectrum is represented by a cost function. The area function is then varied, and glottal flow is simulated until the cost function reaches a minimum. Optimal values of the physical parameters are then estimated using the Nelder-Mead simplex method [19]. Cost function 2 is used in vocal tract fitting. In this paper, we utilize four cost functions in order to compare classification performance, which are described in Cost functions for vocal tract fitting.

The Nelder-Mead algorithm is a simplex method for finding the minimum of a function of several variables. It is a direct search method and does not require the calculation of derivatives. For n-dimensional decision variables, the method compares the values of the cost function at the n + 1 vertices of a simplex. Here, we select A₁, A₂, A₃, and A₄ as variables in vocal tract fitting. Each calculation generates a new vertex for the simplex. If this new point is better than at least one of the existing vertices, it replaces the worst vertex. The simplex vertices are changed through reflection, expansion, contraction, and shrinkage operations in order to find an improved solution. In this way, the Nelder-Mead simplex method searches for the physical parameters that minimize the cost function.
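The four simplex operations can be illustrated with a compact, simplified variant of the Nelder-Mead method. This sketch uses the standard coefficients (reflection 1, expansion 2, contraction 0.5, shrinkage 0.5) and only inside contraction; it is not the implementation used in the paper, and the quadratic cost below is a stand-in for the spectral cost evaluated through the two-mass model.

```python
import numpy as np


def nelder_mead(cost, x0, step=0.1, tol=1e-10, max_iter=2000):
    """Minimize cost(x) with a simplified Nelder-Mead simplex search."""
    n = len(x0)
    simplex = [np.asarray(x0, dtype=float)]
    for i in range(n):                          # n + 1 initial vertices
        vertex = simplex[0].copy()
        vertex[i] += step
        simplex.append(vertex)
    values = [cost(v) for v in simplex]
    for _ in range(max_iter):
        order = np.argsort(values)              # best first, worst last
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if abs(values[-1] - values[0]) < tol:
            break
        centroid = np.mean(simplex[:-1], axis=0)
        reflected = centroid + (centroid - simplex[-1])
        f_r = cost(reflected)
        if f_r < values[0]:                     # try expansion
            expanded = centroid + 2.0 * (centroid - simplex[-1])
            f_e = cost(expanded)
            if f_e < f_r:
                simplex[-1], values[-1] = expanded, f_e
            else:
                simplex[-1], values[-1] = reflected, f_r
        elif f_r < values[-2]:                  # accept reflection
            simplex[-1], values[-1] = reflected, f_r
        else:                                   # contraction, else shrinkage
            contracted = centroid + 0.5 * (simplex[-1] - centroid)
            f_c = cost(contracted)
            if f_c < values[-1]:
                simplex[-1], values[-1] = contracted, f_c
            else:
                best = simplex[0]
                simplex = [best + 0.5 * (v - best) for v in simplex]
                values = [cost(v) for v in simplex]
    best_index = int(np.argmin(values))
    return simplex[best_index], values[best_index]


# Stand-in cost: squared distance to the /a/ area function (A1..A4 in cm^2).
target = np.array([0.8, 0.4, 3.0, 8.0])
areas, final_cost = nelder_mead(lambda a: float(np.sum((a - target) ** 2)),
                                [1.0, 1.0, 1.0, 1.0])
```

With four area variables the simplex has five vertices, and each iteration typically needs only one or two evaluations of the cost, which matters when every evaluation requires a full speech simulation.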

Vocal fold fitting uses the same process as vocal tract fitting, except that the residual signal is obtained using LPC analysis and its spectrum is used to construct cost function 1 in Figure 10. This cost function evaluates the difference between the residual spectra of simulated and real speech and is defined as:

C = \frac{\sum_{i = 1}^{fs / 2} {|S^{*} (ω_{i}) - S (ω_{i})|}^{2}}{\sum_{i = 1}^{fs / 2} {|S (ω_{i})|}^{2}},

(10)

where S(ω) and S*(ω) are the power spectra of the residual signal for simulated and real speech, respectively. Here, we select the stiffness parameters k₁, k₂, and k_c as variables for vocal fold fitting.
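Equation 10 is a normalized squared difference between two residual power spectra. A minimal sketch, assuming both spectra are sampled on the same frequency grid up to fs/2 (the function name is illustrative):

```python
import numpy as np


def residual_spectrum_cost(real_power, simulated_power):
    """Equation 10: sum |S* - S|^2 / sum |S|^2 over the frequency bins.

    real_power      : S*(w_i), residual power spectrum of real speech
    simulated_power : S(w_i), residual power spectrum of simulated speech
    """
    s_star = np.asarray(real_power, dtype=float)
    s = np.asarray(simulated_power, dtype=float)
    return float(np.sum(np.abs(s_star - s) ** 2) / np.sum(np.abs(s) ** 2))


# Example: a real spectrum that is twice the simulated one in every bin.
simulated = np.array([1.0, 4.0, 2.0, 0.5])
cost = residual_spectrum_cost(2.0 * simulated, simulated)
```

The normalization makes the cost a relative error, so frames with different overall residual energy can be compared on the same scale.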

Here, we use the residual signal from LPC analysis to estimate the parameters of the vocal folds. The LPC model is based on a mathematical approximation of the vocal tract. We use it to remove the effect of the vocal tract and obtain the residual signal to estimate the stiffness parameters with generated cost functions. In order to make a comparison with the spectrum of the residual signal from real speech, an LPC inverse filter is used for the simulated speech to obtain the residual signal. Our target here is to evaluate the similarity of the spectrums of residual signals both from real and simulated speech instead of representing the source wave. The aim of this paper is to classify speech under stress. It is believed that the main differences between neutral and stressed speech are focused on the harmonic structure of the spectrum of residual signal [11]. Thus, in this study, obtaining the residual signal using LPC can work well for showing the harmonic structure of the spectrum.
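Obtaining the residual amounts to passing the speech through the LPC inverse filter A(z) and then taking the power spectrum of the result. The sketch below assumes that the LPC coefficients are already available in the form [1, a_1, ..., a_p]; the function names and the one-pole toy signal are illustrative.

```python
import numpy as np


def lpc_residual(signal, a):
    """Apply the inverse filter A(z) = 1 + a_1 z^-1 + ... + a_p z^-p."""
    x = np.asarray(signal, dtype=float)
    return np.convolve(x, np.asarray(a, dtype=float))[:len(x)]


def power_spectrum(signal):
    """Power spectrum from 0 up to the Nyquist frequency."""
    return np.abs(np.fft.rfft(signal)) ** 2


# Toy check: a one-pole 'vocal tract' excited by a single impulse.
a = [1.0, -0.9]
speech = 0.9 ** np.arange(200)
residual = lpc_residual(speech, a)
residual_power = power_spectrum(residual)
```

For voiced speech the residual is approximately a pulse train, so its power spectrum shows the harmonic structure that the vocal fold cost function compares.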

4.2 Cost functions for vocal tract fitting

As for the definition of cost function 2, we utilized four different cost functions in order to compare their classification performance.

4.2.1 Formant ( $C^{F_{1} - F_{2}}$ )

The presence of stress causes an increase in the variability of airflow characteristics due to differences in the muscle tension of the vocal folds. This should cause changes in acoustic interaction around the false vocal folds, thus having an impact on the first and second formants (F₁ and F₂). Thus, F₁ and F₂ are calculated from the spectral envelope to define a cost function:

\begin{array}{l} C^{F_{1} - F_{2}} = α_{1} {(F_{1}^{*} - F_{1})}^{2} + α_{2} {(F_{2}^{*} - F_{2})}^{2}, \\ α_{1} = \frac{1}{\bar{F_{1}}}, α_{2} = \frac{1}{\bar{F_{2}}}, \end{array}

(11)

where the asterisk denotes the target value for real speech. The weights are given the values α₁ and α₂ to normalize the different target parameters to the same range, and the overbar denotes mean values over the target region.
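As a small worked sketch of Equation 11, the function below evaluates the formant cost for one frame. The argument names are hypothetical, and the mean formant values used for the weights are assumed to be supplied by the caller.

```python
def formant_cost(f1_sim, f2_sim, f1_target, f2_target, f1_mean, f2_mean):
    """Equation 11: weighted squared error of the first two formants (Hz)."""
    alpha1 = 1.0 / f1_mean  # normalizes the F1 term by its mean over the target region
    alpha2 = 1.0 / f2_mean  # normalizes the F2 term by its mean over the target region
    return alpha1 * (f1_target - f1_sim) ** 2 + alpha2 * (f2_target - f2_sim) ** 2

# Illustrative values only: a 10 Hz error in F1 and a 20 Hz error in F2.
cost = formant_cost(700.0, 1200.0, 710.0, 1180.0, f1_mean=700.0, f2_mean=1200.0)
```

With these illustrative numbers the two terms are 100/700 and 400/1200, so the weights keep the F₂ error from dominating simply because F₂ is numerically larger.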

4.2.2 RMS distance of spectral envelope (C_rms)

$C^{F_{1} - F_{2}}$ focuses only on the frequencies of the first two formants, which is not accurate enough to describe the spectrum. Thus, we fit an all-pole model with coefficients a_k to obtain the spectral envelope P(ω), and define the cost function as the root mean square (RMS) distance between the log spectral envelopes of the simulated speech and the original speech:

\begin{array}{l} C_{rms} = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {|log P (ω_{i}) - log P^{*} (ω_{i})|}^{2}} \\ P (ω) = \frac{1}{{|A (ω)|}^{2}} = \frac{1}{{|\sum_{k = 0}^{P} a_{k} e^{- jωk}|}^{2}} . \end{array}

(12)
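Equation 12 can be sketched directly from all-pole coefficients. The uniform frequency grid and its size are assumptions made for this example, as are the function names.

```python
import numpy as np

def allpole_power_spectrum(a, n_freqs=256):
    """P(w) = 1 / |sum_k a_k exp(-j w k)|^2 on n_freqs points in [0, pi)."""
    a = np.asarray(a, dtype=float)
    omega = np.linspace(0.0, np.pi, n_freqs, endpoint=False)
    k = np.arange(len(a))
    A = np.exp(-1j * np.outer(omega, k)) @ a  # A(w) evaluated on the grid
    return 1.0 / np.abs(A) ** 2

def rms_log_spectral_cost(a_sim, a_real, n_freqs=256):
    """Equation 12: RMS distance between the two log spectral envelopes."""
    log_p = np.log(allpole_power_spectrum(a_sim, n_freqs))
    log_p_star = np.log(allpole_power_spectrum(a_real, n_freqs))
    return np.sqrt(np.mean((log_p - log_p_star) ** 2))
```

Because the distance is taken between log spectra, a pure gain difference between the two envelopes contributes a constant offset rather than being weighted by spectral level.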

4.2.3 Itakura-Saito distance of spectral envelope (C_I-S)

The Itakura-Saito distance is a measure of the perceptual difference between an original spectrum and an approximation of that spectrum. It was proposed by Fumitada Itakura and Shuzo Saito in the late 1960s and can be described as:

C_{I-S} = \frac{1}{N} \sum_{i = 1}^{N} \left[ \frac{P (ω_{i})}{P^{*} (ω_{i})} - \log \frac{P (ω_{i})}{P^{*} (ω_{i})} - 1 \right] .

(13)
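A minimal sketch of Equation 13, with all three terms evaluated inside the average over frequency points. The inputs are assumed to be strictly positive power spectra sampled on the same grid; the function name is hypothetical.

```python
import numpy as np

def itakura_saito_cost(p_sim, p_real):
    """Equation 13: mean of P/P* - log(P/P*) - 1 over the frequency grid."""
    ratio = np.asarray(p_sim, dtype=float) / np.asarray(p_real, dtype=float)
    return float(np.mean(ratio - np.log(ratio) - 1.0))
```

The measure is zero only when the two spectra coincide, and it is not symmetric: overestimating the target by a factor of two costs 1 - ln 2, whereas underestimating it by the same factor costs ln 2 - 0.5.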

4.2.4 Envelope and formant (C_E-F)

The cost functions C_rms and C_I-S catch the difference between the rough shapes of the spectral envelopes, but they neglect local information when locating the formant. Since only the first two formants are affected by the oscillation of the vocal folds, the characteristics of F₁ and F₂ should be the chief focus. We propose matching the spectral envelope initially in the first iteration, and then, in the next iteration, the characteristics of the formant are fully considered:

C_{E-F}^{(n)} = \begin{cases} \frac{1}{N} \sum_{i = 1}^{N} {|\log P (ω_{i}) - \log P^{*} (ω_{i})|}^{2}, & n = 1, \\ α_{1} {(F_{1}^{*} - F_{1})}^{2} + α_{2} {(F_{2}^{*} - F_{2})}^{2} + w_{1} {(H_{1}^{*} - H_{1})}^{2} + w_{2} {(H_{2}^{*} - H_{2})}^{2}, & n \geq 2, \end{cases}

(14)

where F₁ and F₂ are the frequencies and H₁ and H₂ the amplitudes of the first and second formants, w₁ and w₂ are the weights on the amplitude terms, and n is the iteration number.
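The iteration-dependent cost in Equation 14 can be sketched as a single function that switches on the iteration number. The dictionary keys and the default unit weights are assumptions for illustration; the paper does not give values for w₁ and w₂.

```python
import numpy as np

def envelope_formant_cost(n, log_p_sim, log_p_real, sim, real,
                          alpha=(1.0, 1.0), w=(1.0, 1.0)):
    """Equation 14. `sim` and `real` are dicts with keys F1, F2, H1, H2."""
    if n < 1:
        raise ValueError("iteration number n must be >= 1")
    if n == 1:
        # First iteration: match the overall shape of the spectral envelope.
        diff = np.asarray(log_p_sim, dtype=float) - np.asarray(log_p_real, dtype=float)
        return float(np.mean(diff ** 2))
    # Later iterations: match formant frequencies and amplitudes.
    return (alpha[0] * (real["F1"] - sim["F1"]) ** 2
            + alpha[1] * (real["F2"] - sim["F2"]) ** 2
            + w[0] * (real["H1"] - sim["H1"]) ** 2
            + w[1] * (real["H2"] - sim["H2"]) ** 2)
```

Starting from the envelope match gives the optimizer a coarse fit before the formant terms, which carry only local information, take over.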

It would be helpful to evaluate the accuracy of the fitting method to show that the proposed method works well. However, it is difficult to compare the simulated values with the actual values because sensors are not available to measure the actual values for human beings. In this paper, we calculate the error in acoustic features between real and simulated speech to describe the accuracy of the fitting method.

Using the fitting method described above, the optimal simulated speech corresponding to the inputted real speech can be obtained. Some acoustic features like F₀, F₁, F₂, F₃, and F₄ can also be estimated from the simulated speech. In order to describe the accuracy of the fitting method, we calculate the error in F₀, F₁, F₂, F₃, and F₄ between real and simulated speech. Here, cost function C_E-F is used.

\begin{array}{c} {Err}_{F_{0}} = (F_{0} - F_{0}^{*}) \\ {Err}_{F_{1}} = (F_{1} - F_{1}^{*}) \\ {Err}_{F_{2}} = (F_{2} - F_{2}^{*}) \\ {Err}_{F_{3}} = (F_{3} - F_{3}^{*}) \\ {Err}_{F_{4}} = (F_{4} - F_{4}^{*}), \end{array}

(15)

where the asterisk denotes the target value for real speech.

We calculate the errors between simulated and real speech for all the samples of the vowels /a/ and /e/; the distributions of the errors are shown in Figure 11. Simulated results using the four cost functions are shown in Figure 12. As Figure 11 shows, the errors in F₀, F₁, and F₂ are small (within ±3 Hz), indicating high accuracy for these features. The errors in F₃ and F₄ tend to be larger, because the chosen cost function places more emphasis on the first and second formants, which are believed to be more important for stress classification. Thus, the distributions of the errors show that the proposed method fits real speech with reliable accuracy.

5. Classification

Evaluation of the physical parameters is speaker dependent. The structure of the classification method is shown in Figure 13.

During the training process, all of the speech samples from a specific speaker are labeled as neutral or stressed speech. The labeled speech is segmented into fixed frames, and all of the frames are fit using the two-mass model to estimate the proposed parameters. A linear classifier based on minimum Euclidean distance is trained for the classification, using the estimated physical parameters from all of the frames.

During testing, test speech is input into the system and split into frames, and the trained linear classifier then separates them into neutral or stressed speech. We use Euclidean distance to make a final decision for speech data with several frames. For a test sample with K frames, the feature vector of the i-th frame is V_i. We calculate its Euclidean distances d_i(V_i, a_N) and d_i(V_i, a_S) to the neutral and stressed classes, respectively, where a_N and a_S are the mean vectors of the neutral and stressed classes. The final decision is made for the test sample using the following equation:

j = \arg \min_{c \in \{N, S\}} \sum_{i = 1}^{K} d_{i} (V_{i}, a_{c}) .

(16)
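The frame-level decision rule can be sketched as below. In line with the minimum Euclidean distance criterion used for training, this sketch assigns the sample to the class with the smallest summed distance; the function names and the "N"/"S" label strings are assumptions for the example.

```python
import numpy as np

def class_means(frames, labels):
    """Mean feature vector per class, e.g. a_N and a_S."""
    frames = np.asarray(frames, dtype=float)
    labels = np.asarray(labels)
    return {str(c): frames[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify_sample(sample_frames, means):
    """Sum per-frame Euclidean distances to each class mean; pick the smallest."""
    sample_frames = np.asarray(sample_frames, dtype=float)
    totals = {c: float(np.linalg.norm(sample_frames - m, axis=1).sum())
              for c, m in means.items()}
    return min(totals, key=totals.get)
```

Summing over frames before deciding means that a few outlying frames cannot flip the label of an otherwise consistent sample.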

A K-fold cross-validation method was used in the training and testing process, and K was set to 4. Using this method, the data set was divided into four subsets, and for each classification, one of the subsets was used as a test set and the other three subsets were combined to form a training set. The final result was obtained by calculating the average classification rate across four trials.
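The four-fold protocol described above can be sketched with the standard library alone. The shuffling seed and the `evaluate` callback (which is assumed to train on one index set and return the classification rate on the other) are hypothetical helpers introduced for this example.

```python
import random

def k_fold_indices(n_samples, k=4, seed=0):
    """Yield (train, test) index lists for k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

def cross_validated_rate(n_samples, evaluate, k=4, seed=0):
    """Average the per-fold classification rate returned by `evaluate`."""
    rates = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(rates) / len(rates)
```

Each sample appears in exactly one test fold, so the averaged rate uses every sample for testing once.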

6. Evaluation

6.1 Database and experimental setup

In the experiments, we used a database collected by the Fujitsu Corporation containing speech samples from eleven subjects (four males and seven females) [24]. To induce mental pressure resulting in psychological stress, the speakers performed three different tasks while having telephone conversations with an operator, simulating a situation involving pressure during a telephone call.

The three tasks involved (a) concentration, (b) time pressure, and (c) risk taking. For each speaker, there are four dialogues with different tasks. In two dialogues, the speaker was asked to finish the tasks within a limited amount of time, and in the other dialogues, there is relaxed chat without any task.

All of the data comes from telephone calls, so the sampling frequency was 8 kHz. Segments with the vowels /a/ and /e/ were cut from the speech and selected as samples. The experiments were conducted for each speaker, and all of the results were speaker dependent. The number of samples was different for each speaker. The range of the total number of samples is from 100 to 250 for each vowel from each person. We randomly chose six speakers (three males and three females) from eleven subjects to test classification performance. A K-fold cross-validation method was used in the classification experiments, in which K was set to 4. Using this method, the data set was divided evenly into four subsets, and for each classification, one of the subsets was used as a test set and the other three subsets were combined to form a training set. The final result was obtained by calculating the average classification rate across four trials. The samples were analyzed with 12-order LPC, and the frame size chosen to perform the experiment was 64 ms, with 16 ms for frame shift.
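At a sampling frequency of 8 kHz, a 64 ms frame is 512 samples and a 16 ms shift is 128 samples. A minimal framing sketch follows; dropping the incomplete trailing frame is an assumption, since the paper does not say how the end of a segment is handled.

```python
def frame_signal(samples, sample_rate=8000, frame_ms=64, shift_ms=16):
    """Split a sequence of samples into overlapping fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000  # 512 samples at 8 kHz
    hop = sample_rate * shift_ms // 1000        # 128 samples at 8 kHz
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]

frames = frame_signal(list(range(8000)))  # one second of placeholder samples
```

One second of audio yields 59 frames under these settings, each overlapping its neighbor by 75%.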

For configuration of the two-mass model, the following values were adopted, using typical values for males: m_1M = 1.25 × 10⁻⁴ kg, m_2M = 2.5 × 10⁻⁵ kg, l_gM = 0.014 m, d_1M = 0.0025 m, d_2M = 5 × 10⁻⁴ m, ζ_1M = 0.1, ζ_2M = 0.6, x₀ = 2 × 10⁻⁴ m, and P_s = 500 Pa. The vocal tract model was represented by a tube model, and the number of elements was limited to four cylindrical sections of equal length. Typical values used for configuration for females were as follows: m_1F = 4.56 × 10⁻⁵ kg, m_2F = 9.1 × 10⁻⁶ kg, l_gF = 0.01 m, d_1F = 1.79 × 10⁻³ m, d_2F = 3.6 × 10⁻⁴ m, ζ_1F = 0.1, ζ_2F = 0.6, x₀ = 2 × 10⁻⁴ m, and P_s = 500 Pa. Furthermore, the ranges for the control parameters were k₁ = 10 to 140 kdyn/cm, k₂ = 2 to 14 kdyn/cm, k_c = 4 to 45 kdyn/cm, VTL = 13 to 19 cm, and A₁, A₂, A₃, A₄ = 0.2 to 20 cm².
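The fixed parameters and search ranges above can be collected in a small configuration sketch. The class and field names are hypothetical; the numerical values are those listed in the text. Since the masses and lengths are given in SI units while the stiffness ranges are in kdyn/cm, the conversion is made explicit: 1 kdyn = 10⁻² N and 1 cm = 10⁻² m, so 1 kdyn/cm = 1 N/m.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TwoMassConfig:
    m1: float            # mass 1 (kg)
    m2: float            # mass 2 (kg)
    lg: float            # l_g as listed in the text (m)
    d1: float            # d_1 as listed in the text (m)
    d2: float            # d_2 as listed in the text (m)
    zeta1: float = 0.1   # zeta_1
    zeta2: float = 0.6   # zeta_2
    x0: float = 2e-4     # x_0 (m)
    ps: float = 500.0    # P_s (Pa)

MALE = TwoMassConfig(m1=1.25e-4, m2=2.5e-5, lg=0.014, d1=0.0025, d2=5e-4)
FEMALE = TwoMassConfig(m1=4.56e-5, m2=9.1e-6, lg=0.01, d1=1.79e-3, d2=3.6e-4)

KDYN_PER_CM_TO_N_PER_M = 1.0  # 1 kdyn/cm equals 1 N/m
STIFFNESS_RANGES_KDYN_PER_CM = {"k1": (10.0, 140.0), "k2": (2.0, 14.0), "kc": (4.0, 45.0)}
VTL_RANGE_CM = (13.0, 19.0)

def stiffness_range_si(name):
    """Return the (low, high) search range of a stiffness parameter in N/m."""
    low, high = STIFFNESS_RANGES_KDYN_PER_CM[name]
    return low * KDYN_PER_CM_TO_N_PER_M, high * KDYN_PER_CM_TO_N_PER_M

def clip(value, bounds):
    """Keep a candidate value inside its allowed search range."""
    return min(max(value, bounds[0]), bounds[1])
```

Keeping the ranges next to the fixed values makes it straightforward to constrain a fitting routine to the physiologically plausible region.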

6.2 Results for cost functions

In the first evaluation, we estimated the vocal tract length of all of the speakers, and two comparisons were made. First, we estimated the cross-sectional area function using the vocal tract fitting method with the four proposed cost functions and then the shape of the vocal tract was fixed at the obtained values (length and area). We used [k₁, k_c] to check classification performance for neutral and stressed speech using only the cost function for the vocal folds in Equation 10. In the second comparison, we estimated stiffness parameters [k₁, k_c] with varied vocal tract, so cost functions both for VF and VT were used to perform the fitting, and iteration was performed. Here, varied VT denotes that the parameters for cross-sectional area are also estimated by fitting the two-mass model instead of being fixed as constants. Finally, the performance of cost functions $C^{F_{1} ‒ F_{2}}$ , C_rms, C_I-S, and C_E-F was evaluated using the classification rate of [k₁, k_c]. We used a linear classifier for classification, and the average classification rate for all of the speakers was calculated. The results are shown in Figure 14.

The results illustrate that classification performance is improved when vocal tract values are variable. In this case, the cost functions for the vocal tract are used, and formants are also considered, which results in more information about the frequency domain of the speech being available, making the estimated results more reliable. Furthermore, we compared the performance of different cost functions. Our results show that the stress classification rate for C_E-F is higher than for the other cost functions. Since C_E-F can match the rough shape of the spectral envelope and also effectively catch the characteristics of F₁ and F₂, which have been proven to be sensitive to the interaction between the VF and VT, the classification of stressed speech is improved.

6.3 Results for physical parameters

In the second evaluation, VTL was first estimated for each speaker, and further evaluations were based on the obtained vocal tract length. Here, we selected cost function C_E-F, which achieved the best performance in classification during the first evaluation. The purpose of this evaluation was to verify which parameters in the stiffness and area functions are related to stress and then check the classification performance of these parameters in comparison to traditionally used features.

6.3.1 Evaluation of vocal tract length estimation

A comparison was first made to evaluate the vocal tract length estimation for each speaker. In this experiment, segments with the vowels /a/ and /e/ were selected as samples. However, the samples for /a/ and /e/ were not mixed together. The two vowels were first used for evaluation separately, and then the average recognition rate for the two vowels was calculated to show the experimental results. The physical parameters were estimated using the proposed fitting method, and the estimated parameters were used as features to perform the stress classification. The evaluation results for VTL estimation are shown in Figure 15. Features of physical parameters [k₁, k_c] were compared for their classification performance before and after VTL estimation. Our results show that the performance of [k₁, k_c] is improved by the estimation of VTL. Since a speaker’s vocal tract length is calculated from the neutral speech of that specific speaker and used as a known value for the estimation of other physical parameters, improvement in classification can be achieved by improving the accuracy of VTL estimation.

6.3.2 Evaluation of stiffness parameters of the vocal folds

In this evaluation, we focused on the stiffness parameters of the vocal folds and examined the effect of each stiffness parameter on stress recognition. The physical parameters k₁, k₂, k_c, A₁, A₂, A₃, and A₄ were estimated from varied VF and varied VT values with estimated VTL, and other physical parameters were fixed at the typical values described in Section 6.1 (Database and experimental setup). We focused on the evaluation of k₁, k₂, and k_c. The classification performances of {[k₁]}, {[k₁, k_c]}, and {[k₁, k₂, k_c]} for different speakers are shown in Figure 16. These results show that stress classification performance is improved when k_c is considered; k₁ and k_c, therefore, are the parameters which are effective in stress classification. However, average classification accuracy decreases when k₂ is taken into account. This suggests that k₂ is not effective in the classification of neutral and stressed speech; therefore, it is sufficient to select k₁ and k_c as feature parameters in further evaluations.

6.3.3 Evaluation of parameters of the cross-sectional areas of the vocal tract

We focused on each parameter of the cross-sectional area individually, and each area’s impact on stress recognition was then examined separately. The parameters k₁, k₂, k_c, A₁, A₂, A₃, and A₄ were estimated with varied VF and varied VT values. The parameter sets {[k₁, k_c]}, {[k₁, k_c, A₁]}, {[k₁, k_c, A₁, A₂]}, and {[k₁, k_c, A₁, A₂, A₃]} were also evaluated. Their performance is shown in Figure 17. Among the results, we first consider sets {[k₁, k_c]} and {[k₁, k_c, A₁]}. The results show that stiffness [k₁, k_c] is a better parameter for classifying stressed speech. When A₁ is taken into account, classification performance is further improved. This suggests that A₁ is an important parameter strongly related to stress. When A₁ is increasing, it indicates that the area in the supraglottis is broadening. This results in a decrease in the pressure difference inside and outside of the glottis, causing variation in the airflow pattern and further changes in the interaction around the false vocal folds. Considering the performance of sets {[k₁, k_c, A₁]}, {[k₁, k_c, A₁, A₂]}, and {[k₁, k_c, A₁, A₂, A₃]}, we found that they have roughly the same classification accuracy. This illustrates that performance cannot be greatly improved by taking A₂ and A₃ into account and that A₂ and A₃ probably have only a small effect on acoustic interaction. It appears that A₁ is sufficient to classify stressed speech from neutral speech, which agrees with the conclusion of our first evaluation.

A₂ and A₃ do affect F₀ to some extent, which was illustrated in Figure 5, so they have some influence on acoustic interaction and, further, on stress classification; however, we believe their influence is insignificant. The characteristics of the vocal tract also affect stress classification to some extent. Since A₂ and A₃ represent the shape of the vocal tract, [k₁, k_c, A₁, A₂, A₃] can achieve some improvement in the recognition rate, but the increase is very small, which suggests that A₂ and A₃ are less important for stress classification than A₁.

6.3.4 Evaluation for proposed physical parameters

As a result of our evaluation process, parameter set [k₁, k_c, A₁] was proposed. Figure 18 shows the distribution results for k₁, k_c, and A₁ with an estimated VTL. These results show that the proposed parameters are effective for stress classification. The estimated values of the parameters are limited in range, and these ranges correspond to the actual range of human beings. As this distribution shows, stiffness and area of the entrance to the vocal tract are good indicators of stressed speech. Under stressed conditions, the value of k₁ becomes relatively large, k_c smaller, and A₁ increases compared with the same parameters under neutral conditions. This indicates that stress causes variation in the muscle tension of the vocal folds and that the area at the entrance to the vocal tract in the supraglottis becomes wider when the speaker is under stress.

We then compared the performance of proposed parameters [k₁, k_c, A₁] with traditionally proposed features, namely [SFM, F₀], [TEO], and [MFCC]. The results are shown in Figure 19. As our experimental results show, [SFM, F₀], which characterizes the vocal folds, works well in classifying stressed speech. This shows that the characteristics of the vocal folds play a very important role in stress classification. MFCC, which represents vocal tract information, is also effective for stress classification, illustrating that the characteristics of the vocal tract also affect stress classification to some extent, which agrees with our previous results in Figure 17. The results shown in Figure 19 demonstrate that our proposed physical parameters outperform the features traditionally used for stress detection, which suggests that parameters estimated from a physical model are more effective at representing stress during phonation than traditional methods. Results show that [k₁, k_c, A₁] has the best stress recognition performance of the physical parameter sets. This illustrates that stiffness of the vocal folds and the cross-sectional area at the entrance to the vocal tract in the supraglottis are the factors which are most impacted when a speaker is under stress.

6.4 Results of Gaussian mixture modeling

In this section, we modeled the features using Gaussian mixture models (GMMs), which are widely used statistical classifiers. Two GMMs were trained, one for neutral speech and the other for stressed speech.

The data set for each speaker was divided evenly into four subsets, and for each classification, one of the subsets was used as a test set and the other three subsets were combined to form a training set. The final result was obtained by K-fold cross-validation, averaging the classification rate across the four trials. In order to increase the amount of training data, the GMMs were trained using the training sets from the three male speakers. The test sets of the three male speakers and all of the data from the female speakers were combined to generate the testing data used in this experiment.

We performed an experiment to find the number of mixtures that gives the best performance for the proposed features [k₁, k_c, A₁]. Table 4 shows that the best performance is obtained when the number of mixtures equals four. Increasing the number of mixtures further decreased the classification rate and also made the GMM more complicated. Therefore, the number of mixture components of the GMM was set to four. The features [SFM, F₀], [TEO-FM-VAR], [MFCC], [k₁, k_c], and [k₁, k_c, A₁] were modeled using GMMs with four mixture components. Classification performance is shown in Figure 20, which shows that improvement is achieved for each feature. However, the increase in classification rates is small because of the lack of training data. If the size of the training data were increased significantly, larger gains in classification rate would be expected. We therefore consider a GMM with four mixture components acceptable for improving stress classification.
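The two-model GMM decision can be illustrated with a small from-scratch sketch. The paper does not state the covariance type, initialization, or toolkit, so the diagonal covariances, random-sample initialization, variance floor, and fixed iteration count below are all assumptions, as are the class and function names.

```python
import numpy as np

class DiagonalGMM:
    """Diagonal-covariance Gaussian mixture trained with plain EM (illustrative)."""

    def __init__(self, n_components=4, n_iter=50, var_floor=1e-3, seed=0):
        self.n_components = n_components
        self.n_iter = n_iter
        self.var_floor = var_floor
        self.seed = seed

    def _log_joint(self, x):
        # log(weight_k) + log N(x | mean_k, var_k), shape (n_samples, n_components)
        diff = x[:, None, :] - self.means[None, :, :]
        return (np.log(self.weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * self.vars), axis=1)
                - 0.5 * np.sum(diff ** 2 / self.vars, axis=2))

    def fit(self, x):
        x = np.asarray(x, dtype=float)
        n = x.shape[0]
        rng = np.random.default_rng(self.seed)
        self.means = x[rng.choice(n, self.n_components, replace=False)]
        self.vars = np.tile(x.var(axis=0) + self.var_floor, (self.n_components, 1))
        self.weights = np.full(self.n_components, 1.0 / self.n_components)
        for _ in range(self.n_iter):
            # E-step: responsibilities via a stable log-sum-exp
            log_joint = self._log_joint(x)
            log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
            resp = np.exp(log_joint - log_norm)
            # M-step: update weights, means, and floored variances
            nk = resp.sum(axis=0) + 1e-10
            self.weights = nk / nk.sum()
            self.means = (resp.T @ x) / nk[:, None]
            second_moment = (resp.T @ x ** 2) / nk[:, None]
            self.vars = np.maximum(second_moment - self.means ** 2, self.var_floor)
        return self

    def score(self, x):
        """Total log-likelihood of all frames in x."""
        log_joint = self._log_joint(np.asarray(x, dtype=float))
        return float(np.logaddexp.reduce(log_joint, axis=1).sum())

def classify_with_gmms(frames, gmm_neutral, gmm_stressed):
    """Label a multi-frame sample by the model with the higher log-likelihood."""
    return "N" if gmm_neutral.score(frames) >= gmm_stressed.score(frames) else "S"
```

In use, one model would be fit on the neutral training frames and one on the stressed training frames, for example with three-dimensional [k₁, k_c, A₁] feature vectors, and each test sample would be labeled by comparing the two scores.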

Table 4 Classification rates with different numbers of mixtures

7. Conclusion

In this paper, we explored more effective features for the classification of neutral and stressed speech based on a physical model. To achieve this target, a two-mass model characterizing the properties of the vocal folds and the vocal tract was used to simulate speech production. Physical parameters including stiffness of the vocal folds, vocal tract length, and cross-sectional area of the vocal tract were investigated and estimated using a method that fits the two-mass model to real data. Cost functions were used as targets to reach more reliable results. The obtained parameters were used as physical features to classify stressed speech. We concluded that the two parameters: (1) stiffness of the vocal folds and (2) the area at the entrance to the vocal tract in the supraglottis, which is related to the velocity of glottal airflow and acoustic interaction between the vocal folds and the vocal tract, are key indicators of stress during phonation. The average performance in the classification of speech under stress was improved by 10% to 15% using the proposed features, compared to traditional methods of stressed speech classification. In the future, our work should be focused on the exploration of parameters for a speaker-independent stressed speech classification system.

References

[1] Steeneken HJM, Hansen JHL: Speech under stress conditions: overview of the effect on speech production and on system performance. In Proc. ICASSP. Atlanta, Georgia; 1996.
[2] Cairns D, Hansen JHL: Nonlinear analysis and detection of speech under stressed conditions. J. Acoust. Soc. Am. 1994, 96(6):3392-3400. 10.1121/1.410601
[3] Bezooijen RV: The characteristics and recognizability of vocal expression of emotions. Foris, Dordrecht; 1984.
[4] Tolkmitt FJ, Scherer KR: Effect of experimentally induced stress on vocal parameters. J. Exp. Psychol. 1986, 12(3):302-313.
[5] Williams CE, Stevens KN: Emotions and speech: some acoustical correlates. J. Acoust. Soc. Am. 1972, 52(4):1238-1250.
[6] Bou-Ghazale SE, Hansen JHL: Generating stressed speech from neutral speech using a modified CELP vocoder. Speech Commun. 1996, 20: 93-110. 10.1016/S0167-6393(96)00047-7
[7] Bond ZS, Moore TJ: A note on loud and Lombard speech. In International Conference on Spoken Language Processing. Kobe; 1990:969-972.
[8] Hansen JHL: Analysis and compensation of stressed and noisy speech with application to robust automatic recognition. Ph.D. dissertation. Georgia Institute of Technology, Atlanta; 1988.
[9] Murray IR, Baber C, South A: Toward a definition and working model of stress and its effects on speech. Speech Commun. 1996, 20: 3-12. 10.1016/S0167-6393(96)00040-4
[10] Whitmore J, Fisher S: Speech during sustained operations. Speech Commun. 1996, 20: 55-70. 10.1016/S0167-6393(96)00044-1
[11] Kamano A, Washio N, Harada S, Matsuo N: A study of psychological suppression detection based on non-verbal information. In IEICE Technical Report IEICE-SP2010-64. IEICE, Tokyo; 2010:107-110. In Japanese.
[12] Kaiser JF: On Teager’s energy algorithm and its generalization to continuous signals. In Proceedings of the 4th IEEE Digital Signal Processing Workshop. New Paltz; 1990.
[13] Zhou G, Hansen JHL, Kaiser JF: Nonlinear feature based classification of speech under stress. IEEE Trans. Speech Audio Process. 2001, 3: 201-206.
[14] Fant G: Acoustic Theory of Speech Production. Mouton, The Hague; 1960.
[15] Dunn HK: Methods of measuring vowel formant bandwidths. J. Acoust. Soc. Am. 1961, 33(12):1737-1746. 10.1121/1.1908558
[16] Wong DY, Markel JD, Gray AH: Glottal inverse filtering from the acoustic speech waveform. IEEE Trans. Acoust. Speech Signal Process. 1979, 27(4):350-355. 10.1109/TASSP.1979.1163260
[17] Kaiser JF: Some observations on vocal tract operation from a fluid flow point of view. In Vocal Fold Physiology: Biomechanics, Acoustics, and Phonatory Control. Edited by: Titze IR, Scherer RC. Denver Center for the Performing Arts, Denver; 1983:358-386.
[18] Ishizaka K, Flanagan JL: Synthesis of voiced sounds from a two-mass model of the vocal cords. Bell Syst. Tech. J. 1972, 51: 1233-1268.
[19] Kincaid D, Cheney W: Numerical Analysis: Mathematics of Scientific Computing. 3rd edition. Brooks/Cole, Pacific Grove; 2002:722-723.
[20] Lucero C: Chest- and falsetto-like oscillations in a two-mass model of vocal folds. J. Acoust. Soc. Am. 1996, 100: 3355-3399. 10.1121/1.416976
[21] Titze IR: Acoustic interpretation of resonant voice. J. Voice 2001, 15: 519-528. 10.1016/S0892-1997(01)00052-2
[22] Flanagan JL: Speech Analysis, Synthesis, and Perception. Springer-Verlag, New York; 1972.
[23] de Cheveigne A, Kawahara H: YIN, a fundamental frequency estimator for speech and music. J. Acoust. Soc. Am. 2002, 111(4):1917-1930. 10.1121/1.1458024
[24] Titze IR, Story BH: Acoustic interactions of the voice source with the lower vocal tract. J. Acoust. Soc. Am. 1997, 101: 2234-2243. 10.1121/1.418246

Acknowledgments

This work has been partially supported by the ‘Core Research for Evolutional Science and Technology’ (CREST) project of the Japan Science and Technology Agency (JST). We are very grateful to Mr. Matsuo of the Fujitsu Corporation for allowing us to use their database and for his valuable suggestions.

Author information

Authors and Affiliations

Graduate School of Information Science, Nagoya University, Nagoya, Aichi, Japan
Xiao Yao, Takatoshi Jitsuhiro, Chiyomi Miyajima, Norihide Kitaoka & Kazuya Takeda
Department of Media Informatics, Aichi University of Technology, Gamagori, Aichi, Japan
Takatoshi Jitsuhiro

Corresponding author

Correspondence to Xiao Yao.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Yao, X., Jitsuhiro, T., Miyajima, C. et al. Classification of speech under stress based on modeling of the vocal folds and vocal tract. J AUDIO SPEECH MUSIC PROC. 2013, 17 (2013). https://doi.org/10.1186/1687-4722-2013-17

Received: 30 October 2012
Accepted: 21 June 2013
Published: 05 July 2013
DOI: https://doi.org/10.1186/1687-4722-2013-17
