- Research
- Open access
- Published:
Classification of speech under stress based on modeling of the vocal folds and vocal tract
EURASIP Journal on Audio, Speech, and Music Processing volume 2013, Article number: 17 (2013)
Abstract
In this study, we focus on the classification of neutral and stressed speech based on a physical model. In order to represent the characteristics of the vocal folds and vocal tract during the process of speech production and to explore the physical parameters involved, we propose a method using the two-mass model. As feature parameters, we focus on stiffness parameters of the vocal folds, vocal tract length, and cross-sectional areas of the vocal tract. The stiffness parameters and the area of the entrance to the vocal tract are extracted from the two-mass model after we fit the model to real data using our proposed algorithm. These parameters are related to the velocity of glottal airflow and acoustic interaction between the vocal folds and the vocal tract and can precisely represent features of speech under stress because they are affected by the speaker’s psychological state during speech production. In our experiments, the physical features generated using the proposed approach are compared with traditionally used features, and the results demonstrate a clear improvement of up to 10% to 15% in average stress classification performance, which shows that our proposed method is more effective than conventional methods.
1. Introduction
Stress is a psycho-physiological state characterized by subjective strain, increased physiological activity, and deterioration of performance [1]. Factors inducing stress on speakers include workload, background noise, emotions, physical environmental factors (e.g., G-force), and fatigue. These factors are believed to affect voice quality and are detrimental to the performance of communication equipment, especially automated systems with speech interfaces. Therefore, it has become increasingly important to study speech under stress in order to improve the performance of speech recognition systems, to recognize when people are in a stressed state and to understand contexts in which speakers are communicating.
Researchers have attempted to probe reliable indicators of stress by analyzing acoustic variables. Some external factors (workload, background noise, etc.) and internal factors (emotional state, fatigue, etc.) may induce stress [2]. The first investigations of emotional speech were conducted in the mid-1980s, using the statistical properties of acoustic features in order to detect emotions from speech [3, 4]. It has been found that fundamental frequency (F0) has different characteristics for each emotion [5] and that respiration patterns and muscle tension also change [6]. The influence of the Lombard effect on speech recognition has also been examined [7, 8]. Selected acoustic features have been analyzed, such as amplitude and distribution of spectral energy, and it was found that spectral energy shifted to higher frequencies for consonants in the presence of loud background noise. High workload stress has been proven to have a significant impact on the performance of speech recognition systems, with speech under workload sounding faster, softer, or louder than neutral speech [9, 10]. Matsuo et al. examined the frequency domain and showed how differences in the spectrum of the high frequency band under stressful workload conditions could be used to catch people committing remittance fraud, and their proposed measure achieved better classification performance [11]. Furthermore, the Teager energy operator (TEO) [12] was proposed to explore variations in the energy of airflow characteristics within the glottis for the purpose of stress classification [13]. However, the features examined in these previous studies lack a physical basis, and the methods do not consider the whole process of speech production, which is believed to be essential for effective classification of speech under stress.
We propose a stressed speech classification method based on a physical model characterizing the vocal folds (VF) and the vocal tract (VT). This method can represent the process of speech production and model airflow patterns in the vocal folds and the vocal tract, which are essential for stress classification. In this physical model, changes in the physical characteristics of the vocal folds, such as muscle tension, have a modulating effect on the formants, while the shape of the vocal tract can also influence the glottal source because of the interaction between the vocal folds and the vocal tract. It is believed that the presence of stress can result in variations in the physical characteristics of physiological systems and influence the acoustic interaction between the vocal folds and the vocal tract [2]. The parameters of the physical model also help represent the influence of speaking style more directly and clearly. Therefore, a physical model is helpful to estimate the parameters of the physiological system.
An early but still prominent physical model is the source-filter model [14], which models speech as the combination of a glottal source (such as the vocal folds), and a linear acoustic filter representing the vocal tract and its radiation characteristic. An important assumption that is often made in the use of the source-filter model is independence of the source and filter. In such cases, the model should more accurately be referred to as the ‘independent source-filter model’. In 1961, Wong proposed a linear model of speech production using a lossless tube model of the vocal tract [15]. In 1979, a linear source tract model was proposed to model the glottal source, the vocal tract, and radiation impedance as linear filters, using covariance analysis [16]. However, the vocal tract and vocal folds do not function independently of each other instead there is some form of interaction between them [17], which results in significant changes in fundamental frequency and formant characteristics.
The two-mass model is a physical model, which attempts to simulate the physical process of vocal fold vibration, characterizing the vocal folds and the vocal tract, and to also model the effect of glottis-vocal tract interaction [18]. Parameters affected under stressed conditions are extracted from the physical model and are used as features to identify speech under stress more precisely. We use the two-mass model as a physical model, and our proposed method estimates the values of parameters included in the model from input speech. To identify speech under stress, we evaluate parameters affected by stress.
In this paper, we propose a method for fitting a physical model to real speech in order to estimate the physical parameters which characterize the vocal folds and the vocal tract. For the physical model, a two-mass model connected to a four-tube model is used to simulate the process of speech production. The physical parameters (stiffness, vocal tract length, and cross-sectional areas of the vocal tract) are estimated by fitting the model to real speech. The estimated parameters can be further analyzed and proposed as features for the classification of neutral and stressed speech. Furthermore, different cost functions are proposed to compare classification performance. As a result, stiffness of the vocal folds and cross-sectional areas of the vocal tract are selected as features for the classification of neutral and stressed speech.
The paper is organized as follows: In Overview, an overview of our method is presented. Physical parameters, related to the vocal folds and the vocal tract, based on the two-mass model are described as features for classification in Physical parameters. This is followed by the presentation of a fitting algorithm for real speech data in Estimation method to help estimate the physical parameters. Classification describes the classification method used for evaluation. In Evaluation, experiments are performed to evaluate the obtained parameters and show their corresponding classification performances when separating neutral and stressed speech. Finally, we draw our conclusions in Conclusion.
2. Overview
An overview of our work is shown in Figure 1. It includes the three steps needed to perform stressed speech classification: proposal of physical parameters, parameter estimation by fitting them to the two-mass model, and the classification of neutral and stressed speech.
Initially, we propose physical parameters considered likely to be useful, which include stiffness parameters of the vocal folds, vocal tract length, and cross-sectional areas of the vocal tract. These parameters characterize the behavior of vocal folds and the shape of the vocal tract. Furthermore, the relationship between the selected physical parameters and acoustic parameters has been shown to represent characteristics of the interaction between the vocal folds and the vocal tract.
The proposed physical parameters are then estimated by fitting the two-mass model to real speech. An algorithm based on the analysis-by-synthesis method is proposed for fitting the model to real speech. The Nelder-Mead simplex method [19] is used as a search strategy in order to find the optimal physical parameters. An iteration method is performed for vocal fold fitting and vocal tract fitting to estimate parameters, because there is interaction between the VF and VT.
For classification, a linear classifier is trained using utterances from each speaker. Currently, a simple linear classifier based on Euclidean distance is used for classification. Also, since we only have speech data for a small number of speakers, we evaluate our proposed method as a speaker-dependent system.
3. Physical parameters
A method which fits the two-mass model to real speech is proposed for classifying speech under stress. Some of the physical parameters characterizing the vocal folds and the vocal tract are estimated. The two-mass vocal fold model was originally proposed by Ishizaka and Flanagan to simulate the process of speech production [18]. We propose three types of feature parameters extracted from the two-mass model: stiffness, vocal tract length, and cross-sectional area of the entrance of the vocal tract. In the following sections, we will define these parameters and describe their characteristics.
3.1 Stiffness
The stiffness parameters are related to muscle tension in the vocal folds. Generally, the stiffness of the vocal folds is considered to depend mainly on two muscles: the cricothyroid muscle (CT) and thyroarytenoid muscle (TA) [16]. In the two-mass model, coupling stiffness kc is relative to the tension in the TA muscle, so a high k1 value and a low value for kc represent the contraction of the CT muscle and relaxation of the TA muscle.
Figure 2 shows a sketch of the model. Each vocal fold is represented by a mass-spring-damper system, joined with a coupling stiffness [18]. It is represented as:
Tissue elasticity (or ‘spring’) s i represents the tension of the vocal folds, which depends on the contraction of different muscles. The equivalent tensions are given by:
whose notations and variables are documented in Table 1.
Stiffness parameters are the main factors relating to fundamental frequency, and they can also determine the amplitude of the glottal area and glottal volume velocity [20], so source excitation is significantly influenced by the degree of stiffness. During the production of speech, the natural frequency of the vocal folds is determined by both their mass and stiffness. However, in order to simplify the estimation algorithm, only the stiffness parameters are estimated, with mass fixed as a constant.
3.2 Vocal tract length and cross-sectional area
The supraglottic area includes the structures that lie above the true vocal folds and below the base of the tongue. The anatomical structures present in this area that are important to speech production lie posterior to the epiglottis. They include the ventricle, false vocal folds, epiglottis, arytenoids, laryngeal aspects of the aryepiglottic folds, and vestibule [21].
The two-mass model is connected to a four-tube model representing the vocal tract [18]. The tube model is constructed using a transmission line analogy involving n cylindrical, hard-walled sections. The elemental values of the model are determined by cross-sectional areas A1 ⋯ A n and cylinder lengths l1 ⋯ l n . The total length of the vocal tract is defined as LVT. The tube model can be represented by an equivalent circuit, as shown in Figure 3. The inductances L n = ρl n / 2A n , the capacitances C n = l n ⋅ A n / ρc2, and the resistances , where c is the velocity of sound. Here, the tube model has been limited to four cylindrical sections of equal length, n = 4. In this study, the model is limited to only vowel articulation (as vowels were the subject of the experiments) and modal voice production. These assumptions greatly simplify the modeling of the vocal tract and the glottal source. In this paper, we use a four-tube model to simulate the vocal tract, which followed the original paper [18]. Furthermore, in the following analysis, we propose A1 as one of our feature parameters because the other areas, A2, A3, and A4 are less effective on classification than A1. Thus, we currently consider the four-tube model to be sufficient.
The model is terminated in a radiation load equal to that of a circular piston in an infinite baffle. , R R = 128ρc/9π2A n , where A n is the area of the mouth. The notations and variables are documented in Table 2.
Therefore, the differential equations related to the volume velocities of the system are:
where , , , , , and .
The length of the vocal tract and its cross-sectional areas are the main parameters which determine the shape of the vocal tract and have a significant impact on the distribution of formants. Vocal tract length and cross-sectional areas of the tube model are computed from real speech.
3.3 Relationship between physical parameters and acoustic parameters
In this section, we describe experiments which were performed to represent the presence of acoustic interaction and show the relationship between physical and acoustic parameters. Aerodynamics in the glottis is modeled using the two-mass model. In order to clarify the relationship between physical and acoustic parameters, we will first briefly describe the main equations representing the aerodynamics of speech production.
If subglottal pressure is represented as Ps, then air pressure drops to P11 when air enters the glottis (at the edge of m1) according to Bernoulli’s equation. The abrupt contraction in the cross-sectional area at the inlet to the glottis causes a phenomenon called vena contracta, which causes the air pressure to undergo an even greater drop. The drop is determined by the flow measurements of van den Berg:
where ρ is the air density, Ug is the volume velocity of glottal airflow, and Ag1 is the cross-sectional lower glottal area, which is represented by Ag1 = 2lg(x0 + x1), where lg is the length of the vocal fold and x0 is the displacement when the vocal folds are in the rest position.
Along masses m1 and m2, pressure drops as a result of air viscosity:
where μ is the air viscosity coefficient and d1 is the width of m1.
At the boundary between the two masses, the pressure drop can be calculated by:
where P21 is the air pressure at the lower edge of m2 and Ag2 is the cross-sectional lower glottal area.
At the glottal outlet, abrupt expansion causes the pressure to recover because of the relatively large area of the vocal tract. This pressure is given by:
where P1 is the pressure at the inlet of the vocal tract. Here, the parameter N is defined as N = Ag2 / A1, where A1 is the area of the entrance to the vocal tract. N denotes the difference in area between the outlet of the vocal folds and the inlet of the vocal tract, which is significant to the acoustic interaction between the vocal folds and the vocal tract [18]. Since glottal area Ag2 does not change significantly during the oscillation of the vocal folds, A1 is the parameter relating to the acoustic interaction.
In Equation 4, it is shown that airflow velocity Ug depends on both the stiffness of the vocal folds and area of the entrance to the vocal tract A1. Therefore, it is our assumption that parameters k1, k2, kc, and A1 related to velocity have an impact on acoustic interaction. In this paper, experiments are performed to represent the presence of this interaction by showing the relationship between physical and acoustic parameters. Due to the presence of these interactions, changes in the oscillation of the vocal folds affect the distribution of formants, and different shapes of the vocal tract (length and area) also influence the glottal source. Table 3 lists the physical and acoustic parameters.
We first examine how stiffness parameters impact the distribution of formants. First, we fixed the shape of the vocal tract and examined how variation in the stiffness parameters of the vocal folds affects the shift of formants. The vocal tract model was represented by a standard tube configuration for the vowels /a/ and /e/ [22]. In order to reduce the number of parameters to be estimated and simplify the proposed method, typical values were adopted for the configuration of the tube model. Therefore, as typical values, the length chosen for the vocal tract was LVT = 16 cm, with each element l i = 4 cm, and the cross-sectional area was fixed at A1 = 0.8 cm2, A2 = 0.4 cm2, A3 = 3 cm2, and A4 = 8 cm2 for /a/ and A1 = 1 cm2, A2 = 8 cm2, A3 = 8 cm2, and A4 = 8 cm2 for /e/. When a specific stiffness is checked, the other stiffness parameters are fixed at typical values. We changed stiffness parameters k1 (20 to 240 kdyn/cm), k2 (2 to 40 kdyn/cm), and kc (2.5 to 70 kdyn/cm) to examine variation in formants. Formant estimation is based on modeling vocal tract frequency response using linear predictive coding (LPC) techniques. It estimates formant frequencies from the all-pole model of the vocal tract transfer function.
Figure 4 shows the relationship between the stiffness parameters and different formants. It shows that k2 does not significantly influence formants, but that first and second formants will shift their location to a lower frequency with the increase of k1, although there is no significant change in the third formant (F3). A similar phenomenon occurs for kc. When kc decreases, F1 also has a tendency to shift to a lower frequency, while F2 and F3 are less influenced by the variation of kc. Therefore, it is shown that stiffness parameters k1 and kc can affect the distribution of formants and that the first and second formants are easily affected by acoustic interaction.
Next, we fixed the configuration of the vocal folds and examined how variation of the cross-sectional area of the vocal tract impacts the fundamental frequency (F0) of speech. Stiffness was fixed at typical values k1 = 80,000 dyn/cm, k2 = 8,000 dyn/cm, and kc = 25,000 dyn/cm to check how the fundamental frequency changes with the area function. When checking the impact of a specific area, other areas and vocal tract length (VTL) were fixed at typical values for /a/ or /e/. When considering VTL, all the cross-sectional areas were fixed at typical values. We then change the cross-sectional area or VTL to examine their impact on F0. The variation range for VTL was 13 to 19 cm, and for cross-sectional area of VT, the range was 0.1 to 20 cm. The algorithm for estimation of the fundamental frequency of speech is YIN [23]. It is based on the well-known autocorrelation method, with a number of modifications that combine to prevent error.
Figure 5 shows the relationship between the vocal tract parameters (vocal tract length and cross-sectional area) and fundamental frequency. It shows that VTL has less impact on F0 and only determines the distribution of formants. However, an increase in cross-sectional area A1 can cause F0 to change significantly. While cross-sectional areas A2 and A3 also have an impact on F0 to some extent, but their influence is insignificant compared to A1.
Therefore, it is our conclusion that stiffness of the vocal folds and cross-sectional area A1 affect both the fundamental frequency and formants and, further, the interaction between the vocal folds and the vocal tract.
3.4 Parameters representing stress
In Relationship between physical parameters and acoustic parameters, the experimental results show that stiffness of the vocal folds and cross-sectional area A1 have an impact on the interaction between the vocal folds and the vocal tract. It is believed that the variations in acoustic interaction differ markedly between neutral and stressed speech [2], so stiffness and A1 should be selected as parameters for representing stress.
In theory, Equation 8 shows that both the velocity of glottal airflow and the difference between the area of the outlet of the vocal folds and the inlet of the vocal tract have an impact on the pressure difference inside and outside of the glottis. Thus, the two factors can cause variations in the airflow patterns in the glottis and thus are likely to be effective to represent the presence of stress.
Variation in the stiffness of the vocal folds influences the time span of glottal opening and closing phases and causes glottal airflow to accelerate in the glottis, thus impacting the velocity of glottal airflow. Therefore, we can also assume that stiffness parameters can be potential parameters for stress detection.
A1 in the four-tube model is the area of the entrance to the vocal tract in the supraglottis. Narrowing A1 facilitates phonation by decreasing the oscillation threshold pressure of the vocal folds [24]. Since glottal area Ag2 does not change significantly during the oscillation of the vocal folds, A1 is the main factor determining the pressure difference between the inside and outside of the glottis and has an impact on the acoustic interaction between VF and VT. Based on these considerations, we also make the assumption that A1 is an effective parameter for stress classification.
4. Estimation method
4.1 Algorithm for fitting
The goal of stress classification is to determine from speech data if a specific person is under stress when he or she is speaking. When speech is input to the system, it is split into several frames, and further estimation of the physical parameters is performed for each frame. VTL for each speaker is first calculated; then, the obtained VTL is input as a known parameter. Then, the two-mass model is fit to each speech sample to simulate the vocal folds and the vocal tract. An outline of our method is shown in Figure 6.
In the first step, estimation of VTL is performed. Since VTL has no impact on the glottal source, it can be estimated separately. Because VTL varies with each speaker, all of the neutral speech data for vowel /a/ from each speaker is used to estimate the vocal tract length of that speaker. Here, we mainly consider the neutral speech for each speaker in the database. During VTL estimation, real speech from a database is analyzed using LPC to obtain the spectral envelope. The stiffness parameters are fixed at typical values and are taken as an input. The two-mass model is then fit to the neutral speech of each speaker to estimate the parameters of vocal tract length and cross-sectional area. Nelder-Mead simplex method [19] is used to search for the optimal values for fitting. For each speaker i, the probability distribution P i (LVT(i, k)) of VTL LVT(i, k) for all neutral speech is calculated, and we choose the one with the highest probability as the estimated vocal tract length.
The detailed fitting procedure is the same as that used for vocal tract fitting described below, which is shown in Figure 7. Equation 12 is used as the cost function.
In the next step, the estimated VTL of this speaker, which was obtained during the first step, is used, and the two-mass model is fit to the real speech to estimate the other physical parameters. Fitting the model to real speech poses a difficulty: estimation of too many parameters may make the fitting method unstable. The solution to this problem is to split the process into two main parts so that the VF and VT are fit with two different cost functions. However, the existence of interaction between VF and VT makes it impossible to fit VF and VT separately, and changes in the stiffness parameters and in A1 in the tube model can influence both formants and the glottal source. An alternative is to perform iteration when fitting the vocal folds and the vocal tract. Thus, an iteration method is used for vocal fold and vocal tract fitting, which are accomplished as follows. Figure 8 shows the structure of the fitting algorithm.
For vocal tract fitting, stiffness parameters are fixed at typical values and are taken as an input to vocal tract fitting. The parameters for the cross-sectional areas are then estimated. Next, the obtained areas are used as an input for vocal fold fitting, and the two-mass model is fit to estimate the new stiffness parameters. When current stiffness differs significantly from the typical value, the corresponding formants are also affected, and some variations can occur. In such cases, vocal tract fitting needs to be performed again. We take iterations for the two parts until the results reach convergence.
The detailed structure of vocal tract fitting and vocal fold fitting is shown in Figures 9 and 10. Vocal tract fitting includes two steps. First, real speech from a database is analyzed using LPC to calculate the spectral envelope. In the second step, a simulation is performed using the two-mass model to produce speech using an initial area function. The same spectral envelope is calculated from the simulated speech and is compared with the one obtained in the first step to find the difference between them. The difference between the simulated spectrum and the target spectrum is represented by a cost function. The area function is then varied, and glottal flow is simulated until the cost function reaches a minimum. Optimal values of the physical parameters are then estimated using the Nelder-Mead simplex method [19]. Cost function 2 is used in vocal tract fitting. In this paper, we utilize four cost functions in order to compare classification performance, which are described in Cost functions for vocal tract fitting.
The Nelder-Mead algorithm is a simplex method for finding the minimum of a function involving several variables. It is a direct search method and does not require the calculation of a derivative. We use the Nelder-Mead method based on the comparison of the values of the cost function at the n + 1 vertices for n-dimensional decision variables to solve our optimization problem. Here, we select A1, A2, A3, and A4 as variables in vocal tract fitting. Each calculation will generate a new vertex for the simplex. If this new point is better than at least one of the existing vertices, it replaces the worst vertex. The simplex vertices are changed through reflection, expansion, shrinkage, and contraction operations in order to find an improved solution to estimate the parameters. Optimal values of the physical parameters are estimated using the Nelder-Mead simplex method, which is implemented to search for the optimal physical parameters to minimize the cost function.
Vocal fold fitting uses the same process as vocal tract fitting, with the difference that the residual signal is obtained using LPC analysis, and the spectrum of the residual signal is available to construct the cost function 1 in Figure 10 for vocal fold fitting. It is used to evaluate the difference in the spectrum of the residual signal in vocal fold fitting, which is described as:
where S(ω) and S*(ω) are the power spectrums of the residual signal for simulated and real speech, respectively. Here, we select the stiffness parameters k1, k2, and kc as variables for vocal tract fitting.
Here, we use the residual signal from LPC analysis to estimate the parameters of the vocal folds. The LPC model is based on a mathematical approximation of the vocal tract. We use it to remove the effect of the vocal tract and obtain the residual signal to estimate the stiffness parameters with generated cost functions. In order to make a comparison with the spectrum of the residual signal from real speech, an LPC inverse filter is used for the simulated speech to obtain the residual signal. Our target here is to evaluate the similarity of the spectrums of residual signals both from real and simulated speech instead of representing the source wave. The aim of this paper is to classify speech under stress. It is believed that the main differences between neutral and stressed speech are focused on the harmonic structure of the spectrum of residual signal [11]. Thus, in this study, obtaining the residual signal using LPC can work well for showing the harmonic structure of the spectrum.
4.2 Cost functions for vocal tract fitting
As for the definition of cost function 2, we utilized four different cost functions in order to compare their classification performance.
4.2.1 Formant ()
The presence of stress causes an increase in the variability of airflow characteristics due to differences in the muscle tension of the vocal folds. This should cause changes in acoustic interaction around the false vocal folds, thus having an impact on the first and second formants (F1 and F2). Thus, F1 and F2 are calculated from the spectral envelope to define a cost function:
where the asterisk denotes the target value for real speech. The weights are given the values α1 and α2 to normalize the different target parameters to the same range, and the overbar denotes mean values over the target region.
4.2.2 RMS distance of spectral envelope (Crms)
CF 1-F 2 only focuses on the frequency of the first two formants, which is not accurate enough to describe the spectrum. Thus, we find a set of all-pole model coefficients, the cost function of which can be defined as the root mean square (RMS) distance between the spectral envelope of simulated speech and the original speech:
4.2.3 Itakura-Saito distance of spectral envelope (CI-S)
The Itakura-Saito distance is a measure of the perceptual difference between an original spectrum and an approximation of that spectrum. It was proposed by Fumitada Itakura and Shuzo Saito in the 1970s and can be described as:
4.2.4 Envelope and formant (CE-F)
The cost functions Crms and CI-S catch the difference between the rough shapes of the spectral envelopes, but they neglect local information when locating the formant. Since only the first two formants are affected by the oscillation of the vocal folds, the characteristics of F1 and F2 should be the chief focus. We propose matching the spectral envelope initially in the first iteration, and then, in the next iteration, the characteristics of the formant are fully considered:
where F1, F2, H1, and H2 refer to the frequency and amplitude of the first and second formants and n is the iteration number.
It would be helpful to evaluate the accuracy of the fitting method to show that the proposed method works well. However, it is difficult to compare the simulated values with the actual values because sensors are not available to measure the actual values for human beings. In this paper, we calculate the error in acoustic features between real and simulated speech to describe the accuracy of the fitting method.
Using the fitting method described above, the optimal simulated speech corresponding to the inputted real speech can be obtained. Some acoustic features like F0, F1, F2, F3, and F4 can also be estimated from the simulated speech. In order to describe the accuracy of the fitting method, we calculate the error in F0, F1, F2, F3, and F4 between real and simulated speech. Here, cost function CE-F is used.
where the asterisk denotes the target value for real speech.
We calculate the errors from simulated and real speech for all the samples for vowels /a/ and /e/ and show the distributions of the errors as shown in Figure 11. Simulated results using these four cost functions are shown in Figure 12. The errors, as shown in Figure 11, are smaller in F0, F1, and F2 (±3 Hz) to obtain higher accuracy. However, the errors in F3 and F4 may be increasing, because the cost function chosen places more emphasis on the first and second formants, which are believed to be more important for stress classification. Thus, based on the distributions of errors, it is shown that the proposed method provides reliable accuracy for the fitting to real speech.
5. Classification
Evaluation of the physical parameters is speaker dependent. The structure of the classification method is shown in Figure 13.
During the training process, all of the speech samples from a specific speaker are labeled as neutral or stressed speech. The labeled speech is segmented into fixed frames, and all of the frames are fit using the two-mass model to estimate the proposed parameters. A linear classifier based on minimum Euclidean distance is trained for the classification, using the estimated physical parameters from all of the frames.
During testing, test speech is input into the system and split into frames, and the trained linear classifier then separates them into neutral or stressed speech. We use Euclidean distance to make a final decision for speech data with several frames. For a test sample with K frames, the feature vector of the i th frame is V i . We calculate its Euclidean distance d i (V i , aN) d i (V i , aS) to the neutral and stressed classes, respectively, where aN and aS are the average vectors of classes for neutral and stressed speech. The final decision is made for the test sample using the following equation:
A K-fold cross-validation method was used in the training and testing process, and K was set to 4. Using this method, the data set was divided into four subsets, and for each classification, one of the subsets was used as a test set and the other three subsets were combined to form a training set. The final result was obtained by calculating the average classification rate across four trials.
6. Evaluation
6.1 Database and experimental setup
In the experiments, we used a database collected by the Fujitsu Corporation containing speech samples from eleven subjects (four males and seven females) [24]. To simulate mental pressure resulting in psychological stress, the speakers performed three different tasks while having telephone conversations with an operator, in order to simulate a situation involving pressure during a telephone call.
The three tasks involved (a) concentration, (b) time pressure, and (c) risk taking. For each speaker, there are four dialogues with different tasks. In two dialogues, the speaker was asked to finish the tasks within a limited amount of time, and in the other dialogues, there is relaxed chat without any task.
All of the data comes from telephone calls, so the sampling frequency was 8 kHz. Segments with the vowels /a/ and /e/ were cut from the speech and selected as samples. The experiments were conducted for each speaker, and all of the results were speaker dependent. The number of samples was different for each speaker. The range of the total number of samples is from 100 to 250 for each vowel from each person. We randomly chose six speakers (three males and three females) from eleven subjects to test classification performance. A K-fold cross-validation method was used in the classification experiments, in which K was set to 4. Using this method, the data set was divided evenly into four subsets, and for each classification, one of the subsets was used as a test set and the other three subsets were combined to form a training set. The final result was obtained by calculating the average classification rate across four trials. The samples were analyzed with 12-order LPC, and the frame size chosen to perform the experiment was 64 ms, with 16 ms for frame shift.
For configuration of the two-mass model, the following values were adopted, using typical values for males: m1M = 1.25 × 10−4 kg, m2M = 2.5 × 10−5 kg, lgM = 0.014 m, d1M = 0.0025 m, d2M = 5 × 10−4 m, ζ1M = 0.1, ζ2M = 0.6, x0 = 2 × 10−4 m, and Ps = 500 Pa. The vocal tract model was represented by a tube model, and the number of elements was limited to four cylindrical sections of equal length. Typical values used for configuration for females were as follows: m1F = 4.56 × 10−5 kg, m2F = 9.1 × 10−6 kg, lgF = 0.01 m, d1F = 1.79 × 10−3 m, d2F = 3.6 × 10−4 m, ζ1F = 0.1, ζ2F = 0.6, x0 = 2 × 10−4 m, and Ps = 500 Pa. Furthermore, the ranges for the control parameters were k1 = 10 to 140 kdyn/cm, k2 = 2 to 14 kdyn/cm, kc = 4 to 45 kdyn/cm, VTL = 13 to 19 cm, and A1, A2, A3, A4 = 0.2 to 20 cm.
6.2 Results for cost functions
In the first evaluation, we estimated the vocal tract length of all of the speakers, and two comparisons were made. First, we estimated the cross-sectional area function using the vocal tract fitting method with the four proposed cost functions and then the shape of the vocal tract was fixed at the obtained values (length and area). We used [k1, kc] to check classification performance for neutral and stressed speech using only the cost function for the vocal folds in Equation 10. In the second comparison, we estimated stiffness parameters [k1, kc] with varied vocal tract, so cost functions both for VF and VT were used to perform the fitting, and iteration was performed. Here, varied VT denotes that the parameters for cross-sectional area are also estimated by fitting the two-mass model instead of being fixed as constants. Finally, the performance of cost functions , Crms, CI-S, and CE-F was evaluated using the classification rate of [k1, kc]. We used a linear classifier for classification, and the average classification rate for all of the speakers was calculated. The results are shown in Figure 14.
The results illustrate that classification performance is improved when vocal tract values are variable. In this case, the cost functions for the vocal tract are used, and formants are also considered, which results in more information about the frequency domain of the speech being available, making the estimated results more reliable. Furthermore, we compared the performance of different cost functions. Our results show that the stress classification rate for CE-F is higher than for the other cost functions. Since CE-F can match the rough shape of the spectral envelope and also effectively catch the characteristics of F1 and F2, which have been proven to be sensitive to the interaction between the VF and VT, the classification of stressed speech is improved.
6.3 Results for physical parameters
In the second evaluation, VTL was first estimated for each speaker, and further evaluations were based on the obtained vocal tract length. Here, we selected cost function CE-F, which achieved the best performance in classification during the first evaluation. The purpose of this evaluation was to verify which parameters in the stiffness and area functions are related to stress and then check the classification performance of these parameters in comparison to traditionally used features.
6.3.1 Evaluation of vocal tract length estimation
A comparison was first made to evaluate the vocal tract length estimation for each speaker. In this experiment, segments with the vowels /a/ and /e/ were selected as samples. However, the samples for /a/ and /e/ were not mixed together. The two vowels were first used for evaluation separatelyand then the average recognition rate for the two vowels was calculated to show the experimental results. The physical parameters were estimated using the proposed fitting method, and the estimated parameters were used as features to perform the stress classification. The evaluation results for VTL estimation are shown in Figure 15. Features of physical parameters [k1, kc] were compared for their classification performance before and after VTL estimation. Our results show that the performance of [k1, kc] is improved by the estimation of VTL. Since a speaker’s vocal tract length is calculated from the neutral speech of that specific speaker and used as a known value for the estimation of other physical parameters, improvement in classification can be achieved by improving the accuracy of VTL estimation.
6.3.2 Evaluation of stiffness parameters of the vocal folds
In this evaluation, we focused on the stiffness parameters of the vocal folds, and the effect of each stiffness parameter on stress recognition was then examined. The physical parameters k1, k2, kc, A1, A2, A3, and A4 were estimated from varied VF and varied VT values with estimated VTL, and other physical parameters were fixed at the typical values described in Database and experimental setup. We focused on the evaluation of k1, k2, and kc. The classification performances of {[k1]}, {[k1, kc]}, and {[k1, k2, kc]} for different speakers are shown in Figure 16. These results that stress classification performance is improved when kc is considered. k1 and kc, therefore, are the parameters which are effective in stress classification. However, average classification accuracy decreases when taking k2 into account. It suggests that k2 is not effective in the classification of neutral and stressed speech; therefore, it is sufficient to select k1 and kc as feature parameters in further evaluations.
6.3.3 Evaluation of parameters of the cross-sectional areas of the vocal tract
We focused on each parameter of the cross-sectional area individually, and each area’s impact on stress recognition was then examined separately. The parameters k1, k2, kc, A1, A2, A3, and A4 were estimated with varied VF and varied VT values. The parameter sets {[k1, kc]}, {[k1, kc, A1]}, {[k1, kc, A1, A2]}, and {[k1, kc, A1, A2, A3]} were also evaluated. Their performance is shown in Figure 17. Among the results, we first consider sets {[k1, kc]} and {[k1, kc, A1]}. The results show that stiffness [k1, kc] is a better parameter for classifying stressed speech. When A1 is taken into account, classification performance is further improved. This suggests that A1 is an important parameter strongly related to stress. When A1 is increasing, it indicates that the area in the supraglottis is broadening. This results in a decrease in the pressure difference inside and outside of the glottis, causing variation in the airflow pattern and further changes in the interaction around the false vocal folds. Considering the performance of sets {[k1, kc, A1]}, {[k1, kc, A1, A2]}, and {[k1, kc, A1, A2, A3]}, we found that they have roughly the same classification accuracy. This illustrates that performance cannot be greatly improved by taking A2 and A3 into account and that A2 and A3 probably have only a small effect on acoustic interaction. It appears that A1 is sufficient to classify stressed speech from neutral speech, which agrees with the conclusion of our first evaluation.
A2 and A3 do affect F0 to some extent, which was illustrated in Figure 5, so they have some influence on acoustic interaction and, further, on stress classification; however, we believe their influence is insignificant. The characteristics of the vocal tract also affect stress classification to some extent. Since A2 and A3 represent the shape of the vocal tract, [k1, kc, A1, A2, A3] can achieve some improvement in the recognition rate, but the increase is very small, which suggests that A2 and A3 are less important for stress classification than A1.
6.3.4 Evaluation for proposed physical parameters
As a result of our evaluation process, parameter set [k1, kc, A1] was proposed. Figure 18 shows the distribution results for k1, kc, and A1 with an estimated VTL. These results show that the proposed parameters are effective for stress classification. The estimated values of the parameters are limited in range, and these ranges correspond to the actual range of human beings. As this distribution shows, stiffness and area of the entrance to the vocal tract are good indicators of stressed speech. Under stressed conditions, the value of k1 becomes relatively large, kc smaller, and A1 increases compared with the same parameters under neutral conditions. This indicates that stress causes variation in the muscle tension of the vocal folds and that the area at the entrance to the vocal tract in the supraglottis becomes wider when the speaker is under stress.
We then compared the performance of proposed parameters [k1, kc, A1] with traditionally proposed features, namely [SFM, F0], [TEO], and [MFCC]. The results are shown in Figure 19. As our experimental results show, [SFM, F0], which characterizes the vocal folds, works well in classifying stressed speech. This shows that the characteristics of the vocal folds play a very important role in stress classification. MFCC, which represents vocal tract information, is also effective for stress classification, illustrating that the characteristics of the vocal tract also affect stress classification to some extent, which agrees with our previous results in Figure 17. The results shown in Figure 19 demonstrate that our proposed physical parameters outperform the features traditionally used for stress detection, which suggests that parameters estimated from a physical model are more effective at representing stress during phonation than traditional methods. Results show that [k1, kc, A1] has the best stress recognition performance of the physical parameter sets. This illustrates that stiffness of the vocal folds and the cross-sectional area at the entrance to the vocal tract in the supraglottis are the factors which are most impacted when a speaker is under stress.
6.4 Results of Gaussian mixture modeling
In this section, we modeled the features using Gaussian mixture model (GMM), which are widely used statistical classifier. Two GMM models were trained, one for neutral speech the other for stressed speech.
The data set for each speaker was divided evenly into four subsets, and for each classification, one of the subsets was used as a test set and the other three subsets were combined to form a training set. The final result was obtained by calculating the average classification rate across four trials by a K-fold cross-validation method. In order to increase the amount of training data, the GMMs were trained using training set from three male speakers. The testing set of three male speakers and all of the data from female speakers were combined to generate the testing data used in this experiment.
We performed an experiment to find the best number of mixtures which corresponds to the best performance for proposed features [k1, kc, A1]. Table 4 shows that the best performance is obtained when the number of mixtures equals to four. When we increased the number of mixtures, the classification rate decreased, and it also makes the GMM more complicated. Therefore, the number of mixture components of the GMM was set to four, which obtained the best performance. The features for [SFM, F0] [TEO-FM-VAR], [MFCC], [k1, kc], and [k1, kc, A1] were modeled using GMMs with four mixture components. Classification performance is shown in Figure 20, which shows that improvement is achieved for each feature. However, the increase in classification rates is small because of the lack of training data. If we increase the size of training data significantly, major gains in classification rate should be achieved. Here, it is recommended that a GMM with four mixture components is acceptable for improving stress classification.
7. Conclusion
In this paper, we explored more effective features for the classification of neutral and stressed speech based on a physical model. To achieve this target, a two-mass model characterizing the properties of the vocal folds and the vocal tract was used to simulate speech production. Physical parameters including stiffness of the vocal folds, vocal tract length, and cross-sectional area of the vocal tract were investigated and estimated using a method that fits the two-mass model to real data. Cost functions were used as targets to reach more reliable results. The obtained parameters were used as physical features to classify stressed speech. We concluded that the two parameters: (1) stiffness of the vocal folds and (2) the area at the entrance to the vocal tract in the supraglottis, which is related to the velocity of glottal airflow and acoustic interaction between the vocal folds and the vocal tract, are key indicators of stress during phonation. The average performance in the classification of speech under stress was improved by 10% to 15% using the proposed features, compared to traditional methods of stressed speech classification. In the future, our work should be focused on the exploration of parameters for a speaker-independent stressed speech classification system.
References
Steeneken HJM, Hansen JHL: Speech under stress conditions: overview of the effect on speech production and on system performance. In Proc. ICASSP. Atlanta, Georgia; 1996.
Cairns D, Hansen JHL: Nonlinear analysis and detection of speech under stressed conditions. J. Acoust. Soc. Am. 1994, 96(6):3392-3400. 10.1121/1.410601
Bezooijen RV: The characteristics and recognizability of vocal expression of emotions. Foris, Drodrecht; 1984.
Tolkmitt FJ, Scherer KR: Effect of experimentally induced stress on vocal parameters. J. Exp. Psychol. 1986, 12(3):302-313.
Williams CE, Stevens KN: Emotions and speech: some acoustical correlates. J. Acoust. Soc. Am. 1972, 52(4):1238-1250.
Bou-Ghazale SE, Hansen JHL: Generating stressed speech from neutral speech using a modified CELP vocoder. Speech Commun. 1996, 20: 93-110. 10.1016/S0167-6393(96)00047-7
Bond ZS, Moore TJ International Conference on Spoken Language Processing. In A note on loud and Lombard speech. Kobe; 1990:969-972.
Hansen JHL Ph.D. dissertation. In Analysis and compensation of stressed and noisy speech with application to robust automatic recognition. Georgia Institute of Technology, Atlanta; 1988.
Murray IR, Baber C: A South, Toward a definition and working model of stress and its effects on speech. Speech Commun. 1996, 20: 3-12. 10.1016/S0167-6393(96)00040-4
Whitmore J, Fisher S: Speech during sustained operations. Speech Commun. 1996, 20: 55-70. 10.1016/S0167-6393(96)00044-1
Kamano A, Washio N, Harada S, Matsuo N IEICE Technical Report IEICE-SP2010-64. In A study of psychological suppression detection based on non-verbal information. IEICE, Tokyo; 2010:107-110. in Japanese
Kaiser JF: On Teager’s energy algorithm and its generalization to continuous signals. In Proceedings of the 4th IEEE Digital Signal Processing Workshop. New Paltz; 1990.
Zhou G, Hansen JHL, Kaiser JF: Nonlinear feature based classification of speech under stress. IEEE Trans. Speech Audio Process. 2001, 3: 201-206.
Fant G: Acoustic Theory of Speech Production. Mouton, The Hague; 1960.
Dunn HK: Methods of measuring vowel formant bandwidths. J. Acoust. Soc. Am. 1961, 33(12):1737-1746. 10.1121/1.1908558
Wong DY, Markel JD, Gray AH: Glottal inverse filtering from the acoustic speech waveform. IEEE Trans. Acoust. Speech Signal Process 1979, 27(4):350-355. 10.1109/TASSP.1979.1163260
Kaiser JF: Some observations on vocal tract operation from a fluid flow point of view. In Vocal Fold Physiology: Biomechanics, Acoustics, and Phonatory Control. Edited by: Titze IR, Scherer RC. Denver Center for the Performing Arts, Denver; 1983:358-386.
Ishizaka K, Flanagan JL: Synthesis of voiced sounds from a two-mass model of the vocal cords. Bell. Syst. Tech. J. 1972, 51: 1233-1268.
Kincaid D, Cheney W: Numerical Analysis: Mathematics of Scientific Computing. 3rd edition. Brook/Cole, Pacific Grove; 2002:722-723.
Lucero C: Chest- and falsetto-like oscillations in a two-mass model of vocal folds. J. Acoust. Soc. Am. 1996, 100: 3355-3399. 10.1121/1.416976
Titze IR: Acoustic interpretation of resonant voice. J. Voice 2001, 15: 519-528. 10.1016/S0892-1997(01)00052-2
Flanagan JL: Speech Analysis, Synthesis, and Perception. Springer-Verlag, New York; 1972.
de Cheveigne A, Kawahara H: YIN, a fundamental frequency estimator for speech and music. J. Acoust. Soc. Am. 2002, 111(4):1917-1930. 10.1121/1.1458024
Titze IR, Story BH: Acoustic interactions of the voice source with the lower vocal tract. J. Acoust. Soc. Am. 1997, 101: 2234-2243. 10.1121/1.418246
Acknowledgments
This work has been partially supported by the ‘Core Research for Evolutional Science and Technology’ (CREST) project of the Japan Science and Technology Agency (JST). We are very grateful to Mr. Matsuo of the Fujitsu Corporation for allowing us to use their database and for his valuable suggestions.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Yao, X., Jitsuhiro, T., Miyajima, C. et al. Classification of speech under stress based on modeling of the vocal folds and vocal tract. J AUDIO SPEECH MUSIC PROC. 2013, 17 (2013). https://doi.org/10.1186/1687-4722-2013-17
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1687-4722-2013-17