Evaluation of influence of spectral and prosodic features on GMM classification of Czech and Slovak emotional speech

Přibil, Jiří; Přibilová, Anna

doi:10.1186/1687-4722-2013-8

Research
Open access
Published: 24 April 2013

Evaluation of influence of spectral and prosodic features on GMM classification of Czech and Slovak emotional speech

Jiří Přibil¹ &
Anna Přibilová²

EURASIP Journal on Audio, Speech, and Music Processing volume 2013, Article number: 8 (2013) Cite this article

4563 Accesses
23 Citations
1 Altmetric
Metrics details

Abstract

This article analyzes and compares influence of different types of spectral and prosodic features for Czech and Slovak emotional speech classification based on Gaussian mixture models (GMM). Influence of initial setting of parameters (number of mixture components and used number of iterations) for GMM training process was analyzed, too. Subsequently, analysis was performed to find how correctness of emotion classification depends on the number and the order of the parameters in the input feature vector and on the computation complexity. Another test was carried out to verify the functionality of the proposed two-level architecture comprising the gender recognizer and of the emotional speech classifier. Next tests were realized to find dependence of some negative aspect (processing of the input speech signal with too short time duration, the gender of a speaker incorrectly determined, etc.) on the stability of the results generated during the GMM classification process. Evaluations and tests were realized with the speech material in the form of sentences of male and female speakers expressing four emotional states (joy, sadness, anger, and a neutral state) in Czech and Slovak languages. In addition, a comparative experiment using the speech data corpus in other language (German) was performed. The mean classification error rate of the whole classifier structure achieves about 21% for all four emotions and both genders, and the best obtained error rate was 3.5% for the sadness style of the female gender. These values are acceptable in this first stage of development of the GMM classifier. On the other hand, the test showed the principal importance of correct classification of the speaker gender in the first level, which has heavy influence on the resulting recognition score of the emotion classification. This GMM classifier should be used for evaluation of the synthetic speech quality after applied voice conversion and emotional speech style transformation.

1. Introduction

Speaker identification and emotional speech recognition systems, as well as speech recognition systems, use different types of speech features which can systematically be divided into segmental and supra-segmental ones[1]. These include traditional features such as linear predictive coefficients, linear prediction cepstral coefficients, mel-frequency cepstral coefficients (MFCC)[2], or unconventional ones like perceptual linear predictive coefficients, log frequency power coefficients[3], gammatone frequency cepstral coefficients[4], or compact multiclass support vector machines[5]. Several spectral features [spectral centroid (SC), spectral flatness measure (SFM)[6, 7], spectral entropy (SE)[8, 9], etc.] are used to complement the mentioned basic segmental features for speaker recognition[10]. Supra-segmental features comprise statistical values of parameters describing prosody by duration, fundamental frequency, and energy. Included in this category is also a separate group of features constituting the voice quality parameters: jitter, shimmer[11], Hammarberg index[12], Liljencrants-Fant features[13], and spectral tilt[14]. All mentioned speech identification systems and classifiers are usually based on statistical approach, using the discriminative or artificial neural networks[15, 16], hidden Markov models (HMM)[17], or Gaussian mixture models (GMM)[18, 19]. Spectral features like MFCC together with energy and prosodic parameters are most commonly used in GMM emotional speech classification[20]. On the other hand, in automatic speech recognition systems based on HMM approach, the acoustic vector comprises such components as the formant central frequencies and bandwidths. Relative position of formants and formant trajectories can be used as the main indicator for speech classification in the voiced parts[21].

We are mainly focused on voice conversion and emotional speech style transformation in the text-to-speech systems speaking in Czech and Slovak[22] for the voice communication systems with the human–machine (computer) interface[23], or in the communication aids for handicapped people[24, 25]. These two languages (belonging to the Slavonic languages) are similar but different, therefore we can use a common speech corpus to obtain spectral parameters, but on the phonetic and prosody level the synthetic speech must be processed separately. In our previous work, we performed statistical analysis and comparison of emotional speech properties for the Czech and Slovak languages using basic spectral features consisting of the first three formant positions together with their bandwidths and formant tilts, complementary spectral features (CSF) (SC, SFM, and SE), and prosodic parameters—fundamental frequency (F0), microintonation, jitter, shimmer[26].

The aim of this study is to develop a simple emotional speech style classifier based on GMM approach usable for objective evaluation of the finally produced synthetic speech quality as an option to manually performed listening tests. This statistical evaluation approach can be combined with the classical one in the form of listening tests or it can replace them. The main advantage of this system is that it works automatically without human interaction which is a great problem in collective realization of listening tests (more people together—for keeping the same test conditions), and the obtained results can numerically be matched—as the objective comparison criterion. The article describes performed experiments and comparison of GMM classification of male and female acted speech in four emotional states (joy, sadness, anger, and a neutral state) spoken in Czech and Slovak. This speech corpus was primarily used for determination of spectral and prosodic parameters for emotional speech conversion[26]. This article is also aimed to verify a functionality of the proposed GMM emotional speech classifier structure including the stability of the obtained results, to perform an analysis of influence of setting of parameters for GMM training process (number of used mixture components and used number of iterations), and above all, to investigate the influence of different types of used speech features (spectral and/or supra-segmental). In addition, we try to confirm our working hypothesis that speech data corpora in the other languages (primarily intended for emotional speech recognition) can successfully be used for basic testing of the designed GMM emotional speech classifier. On the other hand, the order of parameters in the input feature vector has minimal influence on the classification error rate of the whole emotional speech classifier.

2. Subject and method

2.1. Short description of the developed emotional speech classifier and its expected properties

The basic draft functional structure of our currently developed GMM emotional speech classifier consists of the two-level architecture as it can be seen in Figure1. In the first step, the gender type (male/female) is recognized, and consequently the emotional speech style is identified for each of two gender classes. In both levels of the identification process, different types of the feature vectors together with the trained GMM models (with different number of used mixtures) are used due to different requirements and different statistical properties necessary for gender type classification and emotional style recognition. Because we would like to recognize four emotional speech styles and two basic types of gender, we need to obtain four trained emotion models for classification of speech pronounced by male speakers and four models for classification of sentences spoken by female speakers, and two summary models for gender recognition (trained on the data of sentences pronounced in all classified emotional styles). By the reason of not knowing exactly how and which speech parameters characterize several emotions of speech in Czech and Slovak, we formulate six basic sets of speech parameters for the GMM classifier. Another issue is to find the optimum number of parameters in the feature vector for robust GMM classification of emotions. As the first trial, the length of the input feature vector was experimentally set to 16, as a result of compromise between lower limit of functionality and computational complexity requirements.

The two-level classifier is based on the statistical approach—therefore outputs from the gender recognition or emotion classification block are the probability values subsequently evaluated in the block called the score discriminator (see Figure1). Consequently, different values of the score can be obtained when the same sentence is processed. These different score values can bring about an error in evaluation of the gender type or the emotion class. This situation can arise from several various reasons including

processing of the input speech signal with a short time duration, from which only a small number of feature vectors is obtained during the analysis,
classification using too short input feature vector (small number of parameters in the vector),
application of an incorrect type of a gender model for determination of an emotional class (e.g., using the male model for classification of emotion sentences uttered by a female speaker).

Hence, the stability tests to verifying the proper function parts of the recognizer as well as the whole classifier are necessary to be performed. These tests are also important for mapping of the mentioned negative reasons of the resulting system error. In addition, we assume that the choice of feature types (spectral properties and prosodic parameters) and the method of their determination from the input speech signal would significantly determine the proper function of the GMM classifier. The correctness and quality of obtained results also depends on the correctness and accuracy of the initialization and training phase during the creation of a given GMM model. Above all, it means the properly determined number of used mixtures and the number of passed iterations. It means that it is also necessary to judge influence of these parameters on gender recognition and emotion classification error rate. Before the first practical use of the whole classifier, individual function blocks as well as their cascade connection must be tested. Subsequently, suitability of the whole classifier for our purpose—objective tool for evaluation of the synthetic speech quality after applied emotional style conversion in Czech and Slovak—will be determined.

2.2. Basic principles of applied classification method

The GMM can be defined as a linear combination of multiple Gaussian probability density functions (GPDF) of the input data vector x

\begin{array}{l} f (x) = \sum_{k = 1}^{K} α_{k} P_{k} (x), P (x) \\ = \frac{1}{\sqrt{{(2 π)}^{d} det (\sum)}} exp (- \frac{{(x - μ)}^{T} \sum^{- 1} (x - μ)}{2}), \end{array}

(1)

where P_k(x) is the GPDF (expressed with the help of d as the dimension of the GPDF, Σ is the covariance matrix, and μ is the vector of mean values), K is the number of these distribution functions, and α_k are the weighting parameters. For GMM creation it is necessary to determine the covariance matrix Σ, the vector of mean values μ, and the weighting parameters α_k from the input training data. Using the expectation-maximization (EM) iteration algorithm the maximum likelihood function of GMM is defined as follows:

log L (Θ| x) = log \prod_{m = 1}^{M} \sum_{k = 1}^{K} α_{k} P_{k} (x_{m}| Θ_{k}),

(2)

where P_k( ) are the GPDFs, K is the number of these functions in a mixture, M is the number of trained vectors, α_k are the weighting parameters, and the term Θ = (μ, Σ) represents parameters of the Gaussian probability distribution. For control of the EM algorithm, the N_iter parameter corresponding to the number of iteration steps is used, and the N_gmix represents the used number of mixtures in each of the GMM models. The iteration stops when the difference between the previous and the current probabilities fulfills the internal condition or the predetermined maximum number of iterations is reached. To initialize the GMM model parameters, the K-means algorithm is usually used—this procedure is repeated several times until the minimum deviation of the input data sorted in N clusters S = {S₁,S₂,…,S_N} is found.

The GMM classifier returns probabilities (the so-called scores) that the tested utterance belongs to the GMM model while the identification of emotion (or gender) i^* is given by the maximum overall probability for the given emotion (gender)

i^{*} = arg max_{1 \leq i \leq N} score (T, i),

(3)

where the emotion/gender score(T, i) is the returned probability value of the GMM classifier for the models trained for each emotion/gender category and the tested sentence T (an input vector of features obtained from this sentence).

2.3. Determination of basic and complementary spectral properties of emotional speech

The basic speech spectral properties consist of the formant positions F₁, F₂, F₃, and their bandwidths as well as the auxiliary parameters (the formant tilts) that can be calculated by several techniques. We apply the approach combining two basic methods for formant position determination (see Figure2).

1.
Indirect—formant positions are determined as the first three local maxima of the smoothed spectral envelope where its gradient changes from positive to negative. Corresponding bandwidths are obtained as frequency intervals between the points of 3 dB decrease of the magnitude spectrum relative to the formant amplitudes. The smooth spectral envelope of the speech signal can be determined during cepstral analysis [27]. Cepstral analysis of the speech signal is performed in the following way: first, the complex spectrum using fast Fourier transform (FFT) algorithm is calculated from the input samples (after segmentation and weighting by a Hamming window). In the next step, the power spectrum is computed and the natural logarithm is applied. Application of the inverse FFT algorithm gives the symmetric real cepstrum. Limitation to the first N ₀ + 1 cepstral coefficients represents an approximation of the log spectrum envelope
$S (e^{jω}) = c_{0} + 2 \sum_{n = 1}^{N_{0}} c_{n} cos (n \cdot ω),$
(4)

where the first cepstral coefficient c₀ corresponds to the signal energy.
2.
Immediate—estimation of the formant frequencies and their bandwidths directly from the complex roots of the linear predictive coding (LPC) polynomial A(z)—poles of the LPC transfer function. The formant frequency F _k and the 3 dB bandwidth B _k in (Hz) can be determined as follows:
$F_{k} = \frac{f_{s}}{2 π} θ_{k} = \frac{arg (z_{k})}{2 π} f_{s}, B_{k} = - \frac{f_{s}}{π} ln |z_{k}|,$
(5)

where f_s is the sampling frequency and θ_k is the angle in (rad) of the complex root.
Figure 2
Block diagram of determination of the basic and auxiliary formant features from the spectral envelope.
Full size image

Resulting values obtained with the help of the direct method are corrected by the results of indirect determination of the spectral envelope (smoothed by cepstral limitation) according to the following two criteria:

the values of 3-dB bandwidths must be less than 500 Hz[28],
the found values of the first three formant positions must fall within the corresponding frequency interval depending on the gender type (male/female)[29].

The auxiliary spectral parameters like the formant tilts are defined as directions and angles between the first three spectral maxima of a smoothed envelope. The general bisector formula in the parametric form can be used for calculation

y - y_{1} = k (x - x_{1}), k = \frac{y_{2} - y_{1}}{x_{2} - x_{1}}, k = tg (ϕ),

(6)

where k is a bisector direction, y_1,2 represent values of power spectral density (PSD) in (dB) of determined formants, and x_1,2 are positions of the formants on the frequency axis in (Hz). For k < 0, the formants have declining trend, for k > 0 the formants have ascending trend. The resulting angle φ in degrees is defined as φ = (Arctg (k)/π)⋅180.

The cepstral coefficients {c_n} obtained during the cepstral analysis process bring information about spectral properties of the human vocal tract[27]. As the shape of the vocal tract depends also on the emotional state of the speaker, these coefficients can be used in the feature vector for GMM emotional classification. The mentioned cepstral analysis (see Figure3) can also be used for determination of additional speech parameters—the CSF including

1.
The SC defined as a center of gravity of the power spectrum [10] which can be calculated using the absolute value of the FFT |S(k)| of the speech signal x(n). The SC values in (Hz) are determined as
$SC = \frac{\sum_{k = 1}^{N_{FFT} / 2} k {|S (k)|}^{2}}{\sum_{k = 1}^{N_{FFT} / 2} {|S (k)|}^{2}} \cdot \frac{f_{s}}{N_{FFT}},$
(7)

where f_s is the sampling frequency, and N_FFT represents the number of the processed points for FFT calculation.
2.
The SFM can be used to determine the degree of periodicity in the signal [6, 7]. This spectral feature is calculated as a ratio of the geometric and the arithmetic mean values of the power spectrum by the following formula
$SFM = \frac{{[\prod_{k = 1}^{N_{FFT} / 2} {|S (k)|}^{2}]}^{\frac{2}{N_{FFT}}}}{\frac{2}{N_{FFT}} \sum_{k = 1}^{N_{FFT} / 2} {|S (k)|}^{2}} .$
(8)
3.
The SE is a measure of spectral distribution [10]. It quantifies a degree of randomness of spectral probability density represented by normalized frequency components of the spectrum. SE will be low for spectra having clear formants whereas for unvoiced sounds it will be higher. Shannon SE is defined as follows:
$SE = - \sum_{k = 1}^{N_{FFT} / 2} P (k) {log}_{2} P (k),$
(9)

where P(k) represents the PSD values.
4.
The harmonics-to-noise ratio (HNR) provides an indication of the overall periodicity of the speech signal. Specifically, it quantifies the ratio between the periodic and aperiodic components in the signal [30]. The HNR is a function of glottal noise and other factors such as jitter and shimmer which are responsible for the aperiodic component in the voice. Noise at harmonic locations is typically estimated as the average of the noise estimates at either side of the harmonic locations. The spectral-based HNR expressed in (dB) is computed as follows:
$HNR = 10 {log}_{10} (\frac{\sum_{k = N_{FB} LO}^{N_{FFT} / 2} {|S (k)|}^{2}}{\sum_{k = N_{FB} HI}^{N_{FFT} / 2} {|N (k)|}^{2}}), N_{FB} = \frac{f_{max FB} N_{FFT}}{f_{s}},$
(10)

where |S(k)| represents harmonic amplitudes, |N(k)| is the noise estimate, and N_FFT is the number of points up to the sampling frequency. The summation index N_FB depends on the chosen frequency band, where f_s is the sampling frequency and f_maxFB is the maximum frequency of the band (N_FB equals N_FFT/2 for the whole band up to f_s/2). The spectrum portion of harmonic amplitudes is summed from low frequencies corresponding to the index N_FBLO (approx. 50–70 Hz), the noise portion is calculated from high frequencies corresponding to the index N_FBHI (approx. 1500–2000 Hz—depending on the gender type).
Figure 3
Block diagram of calculation of the basic and CSF of emotional speech.
Full size image

In our algorithm, the values of the HNR, SC and SFM are obtained only from the voiced speech frames. In the case of the SE parameter, the values are determined from the voiced as well as unvoiced frames with the signal energy higher than the threshold (calculated as e^{c 0} using the first cepstral coefficient) for elimination of speech pauses between words within the sentence and beginning and ending parts of the sentence[26].

2.3. Estimation of supra-segmental features of emotional speech

Microintonation, together with sentence melody and word melody, represents melody of speech given by F0 contour. Microintonation component of speech melody can be supposed to be a random, band-pass signal described by its spectrum and statistical parameters. The voice quality parameter “jitter” describes pitch perturbations in the context of vocal expression. Our approach to microintonation estimation is somewhat similar to that of[31] where a jitter related to microvariations of a pitch curve is computed as a relative number of zero crossings of a derivative pitch curve normalized by utterance duration. Speech frames classified as voiced are analyzed separately depending on the emotional state and the gender type. The whole supra-segmental feature analysis process is divided into seven phases corresponding to the block diagram in Figure4:

1.
Determination of F0 values, definition of the voiced and unvoiced parts of the processed speech signal.
2.
Determination of F0_Mean values and calculation of the linear trend (LT) by the least mean square method.
3.
Calculation of differential microintonation signal F0_DIFF by subtraction of these values from the corresponding F0 contours (F0_Mean and LT removal)
$F 0_{DIFF} (n) = (F 0 (n) - F 0_{Mean}) - LT (n) .$
(11)
4.
Detection of zero crossings, calculation of zero crossing periods L _Z, and relative values defined as L _{Z rel} = N _Z/N _V, where N _Z is the total number of zero crossings in each of the four emotions, and N _V is the total number of voiced frames.
5.
Calculation of the frequency parameters from the zero crossing periods
$F 0_{ZCR} = f_{F} / (2 \cdot L_{Z rel}),$
(12)

where f_F is the frame frequency.
6.
Calculation of the absolute jitter J _Abs values as the average absolute difference between consecutive pitch periods L measured in samples [30]
$J_{Abs} = \frac{1}{f_{s} (N_{L} - 1)} \sum_{n = 1}^{N_{L} - 1} |L_{n} - L_{n + 1}|,$
(13)

where f_s is the sampling frequency and N_L is the number of extracted pitch periods.
7.
Calculation of the shimmer measure as a period-to-period variability of amplitudes of a speech signal [30]
$shimmer = \frac{|A_{n} - A_{n + 1}|}{\frac{1}{N_{V}} \sum_{n = 1}^{N_{V} - 1} A_{n}},$
(14)

where A_n is the peak amplitude value of the n th frame of the input speech signal, and N_V is the number of voiced frames.
Figure 4
Block diagram of estimation of the supra-segmental features.
Full size image

3. Description of performed analysis and comparison experiments

Our experiments were aimed at comparison and analysis of

1.
influence of the used number of mixtures and the used number of training iterations on GMM emotion classification;
2.
influence of the used number of mixtures and the used number of training iterations on the GMM gender recognition error rate;
3.
influence of different length of the feature vector on GMM emotion classification error rate;
4.
influence of different length of the feature vector on the computational time (complexity) of the phases: GMM creation, training, and classification (recognition);
5.
influence of the type of the features in the feature vector on GMM emotion classification and gender recognition error rate;
6.
test of the complete GMM emotion classifier with the best training parameters (N _iter and N _gmix) and the feature set with the best score (minimum mean error rate).

To find the optimum number of mixtures for GMM classification and the optimum number of training iterations the influence of using one to eight mixtures was investigated for classification of four emotional speech styles and the influence of one to four mixtures was tested for recognition between male and female genders. The influence of the used number of iterations on the GMM classification/recognition error rate was analyzed in eight cases with the values in the range of <100–1500>. For the analysis of different number of values in the feature vector (see points 3 and 4), three types of vectors were used with different lengths of N_FEAT = 8, 16, and 32 values. In the case of the shortest one with the length of 8 we used parameters {1, 5, 6, 8, 10, 12, 13, 16} of the original feature vector with the length N_FEAT = 16.

In addition, we perform a set of tests of stability consisting of

1.
stability of the GMM emotion classification process when the time duration of the input processed sentence shortens;
2.
stability of the GMM emotion classification process with the limited length of the feature vector;
3.
stability of the emotion classification when the gender type of the GMM model is chosen incorrectly;
4.
stability test of the obtained GMM scores and finally determined emotional class for correctly set male or female genders.

The same testing sentence was processed to compare recognition scores of the GMM classifiers. This test passed for 500 times using the same set of the trained models. The sentence “Vlak už nejede” (No more train leaves today) was used for testing. It was expressed by two male and two female speakers in neutral and emotional speaking styles with mean duration of 1.5 s (which corresponds approx. to 125 frames for analysis). The length of the original feature vector was N_FEAT = 16. For limited length of N_FEAT = 12, the zero values were used at the positions 7, 9, 11, and 15 of the original feature vector. For the length the N_FEAT = 8, the zero values were used at the positions 2, 3, 4, 7, 9, 11, 14, and 15.

Finally, we realized two experiments for verifying of our working hypothesis about:

1.
usability of the speech database in other language using the German database as a data source for GMM emotion training and testing (recognition);
2.
minimal influence of the order of parameters in the input feature vector on the GMM emotion classification score.

Verification of the second working hypothesis was realized within the framework of analysis of influence of the type of the feature vector and the order of features in the feature vector on the recognition error rate and the stability of the classifier.

3.1. Used types of features in the input vectors of the GMM classifier

As it was mentioned in Section 1, our research is focused mainly on analysis and comparison of basic and complementary spectral properties of the emotional speech including the prosodic—supra-segmental parameters.

For that reason, also in this experiment, these types of speech parameters were used as the input features for the emotion classification based on the GMM approach.

In the case of the spectral features, the basic statistical parameters—mean value, and standard deviation (std)—were used as the representative values in the feature vectors for GMM emotion and gender recognition. The special category of the spectral features is represented by coefficients of the real cepstrum[27]. The calculated histograms of distribution were used to determine the extended statistical parameters—skewness and kurtosis that were used in the feature vectors. For implementation of the supra-segmental parameters of emotional speech, the statistical types of median values, range of values, std, and/or relative maximum and minimum we used in the feature vectors.

For our experiments, we set up six basic feature sets and a special one as the input data vectors for GMM training and classification—see detailed description of their structure in Tables 1,2,3,4,5,6 and7:

1.
feature set containing only statistical values of supra-segmental parameters (P1);
2.
feature set consisting of extended statistical values of spectral parameters together with extended statistical values of supra-segmental parameters (P2);
3.
feature set including complete values of CSF and extended statistical values of supra-segmental parameters (P3);
4.
feature set containing a ratio of formant frequencies F ₁, F ₂, a formant tilt, values for all types of CSF, and extended values of supra-segmental parameters (P4);
5.
feature set including extended statistical parameters of the first three cepstral coefficients (c ₁–c ₃) together with basic values of CSF (excluding the HNR), and basic supra-segmental parameters (P5);
6.
feature set containing a mix of basic spectral parameters (skewness of the first four cepstral coefficients, a formant ratio, and a tilt), complete values of CSF, and basic supra-segmental parameters (P6);
7.
special feature set consisting of 32 values including extended mix of basic spectral parameters (a skewness and a kurtosis of the first four cepstral coefficients, formant ratios of the first three formant frequencies F ₁, F ₂, F ₃, and formant tilts computed also from the first three formants), values for all types of CSF, and extended statistical values of supra-segmental parameters (P8).

Table 1 Structure of the feature set P1

Evaluation of influence of spectral and prosodic features on GMM classification of Czech and Slovak emotional speech

Abstract

1. Introduction

2. Subject and method

2.1. Short description of the developed emotional speech classifier and its expected properties

2.2. Basic principles of applied classification method

2.3. Determination of basic and complementary spectral properties of emotional speech

2.3. Estimation of supra-segmental features of emotional speech

3. Description of performed analysis and comparison experiments

3.1. Used types of features in the input vectors of the GMM classifier

3.2. Description of the used speech corpora and methods of processing of sentences

3.3. Obtained results of performed experiments

4. Discussion of results

5. Conclusion

Abbreviations

References

Acknowledgment

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords