Decision tree SVM model with Fisher feature selection for speech emotion recognition

Sun, Linhui; Fu, Sheng; Wang, Fu

doi:10.1186/s13636-018-0145-5

Research
Open access
Published: 07 January 2019

EURASIP Journal on Audio, Speech, and Music Processing volume 2019, Article number: 2 (2019)

Abstract

In multi-class speech emotion recognition, the overall recognition rate drops as confusion between emotions increases. To address this problem, we propose a speech emotion recognition method based on a decision tree support vector machine (SVM) model with Fisher feature selection. At the feature selection stage, the Fisher criterion is used to select the feature parameters with higher discriminative ability. At the emotion classification stage, an algorithm is proposed to determine the structure of the decision tree. The decision tree SVM performs a two-step classification: a rough classification followed by a fine classification. In this way, redundant parameters are eliminated and emotion recognition performance is improved. In this method, the decision tree SVM framework is first established by calculating the confusion degree between emotions, and then the features with higher discriminative ability are selected for each SVM of the decision tree according to the Fisher criterion. Finally, speech emotion recognition is carried out with this model. Decision tree SVMs with Fisher feature selection are constructed on the CASIA Chinese emotional speech corpus and the Berlin speech corpus to validate the effectiveness of our framework. The experimental results show that the average emotion recognition rate of the proposed method is 9% higher than that of the traditional SVM classification method on CASIA, and 8.26% higher on the Berlin speech corpus. These results verify that the proposed method can effectively reduce emotional confusion and improve the emotion recognition rate.

1 Introduction

In recent years, speech emotion recognition has been widely applied in the field of human-computer interaction [1,2,3]. Emotion recognition helps machines understand and learn human emotions. However, the performance of emotion recognition still falls short of researchers' expectations. Speech emotion recognition faces two main difficulties [4]: how to find effective speech emotion features, and how to construct a suitable speech emotion recognition model. In previous studies, some effective feature parameters were extracted for emotion recognition tasks. Zhao et al. adopted the pitch frequency, short-term energy, formant frequency, and chaotic characteristics to construct a 144-dimensional emotion feature vector for recognition [5]. Cao et al. combined feature parameters such as energy, zero-crossing rate, and first-order derivative for speech emotion recognition, and obtained encouraging results in comparison with other methods [6]. In [7], the first 120 Fourier coefficients of the speech signal were extracted, and a recognition rate of 79.51% was obtained on the Berlin speech emotion database with 6 emotions. In [8], new harmonic and Zipf-based features were proposed for better speech emotion characterization in the valence dimension and better emotional class discrimination. In [9], prosodic features and voice quality information were combined for emotion recognition. The methods mentioned above improve the performance of emotion recognition by feature fusion. However, feature fusion may lead to high-dimensional and redundant features, so it is vital to select the feature parameters with higher discriminative ability. The Fisher criterion is a classical linear decision method that can achieve satisfactory results in feature selection. Huang et al. used the Fisher discriminant coefficient to select 10-dimensional features from 84-dimensional features for the identification of 5 emotions [10], which increased the emotion recognition rate by 8%.

To further improve the performance of speech emotion recognition, an effective emotion recognition model needs to be constructed. Currently, several classifiers are extensively used in speech emotion recognition, including the Gaussian mixture model (GMM) [11], the artificial neural network (ANN) [12], and the support vector machine (SVM). Among them, the SVM has a unique advantage in solving nonlinear, small-sample, and high-dimensional pattern recognition problems, so it is widely used in speech emotion recognition [13, 14]. In [15], Zhang et al. proposed an improved leaping algorithm to optimize the SVM classifier and applied it to speech emotion recognition. In [16], an integrated system of hidden Markov model (HMM) and SVM, combining the dynamic time warping capability of the HMM with the pattern recognition capability of the SVM, was proposed for emotion classification; it achieved an 18.3% improvement over the HMM-only method in speaker-independent emotion classification. The work in [17] applied GMM-MAP/SVM generative and discriminative models to speech emotion recognition, which increased the average emotion recognition rate by 6.1% compared to the method using SVM alone. In addition, the binary decision tree SVM recognition model has also been applied to multi-class emotion recognition with good performance [18, 19].

In our study, we found that statistical variables computed over multiple frames of emotional features performed better in emotion recognition than emotional features extracted frame by frame. Because different features represent different information, combining various features can achieve better performance, and this has become a current research hotspot [20,21,22]. Moreover, in multi-class emotion recognition, we can select an optimal feature set according to the ability of each feature to discriminate between emotional types, thereby eliminating redundant parameters.

In multi-class speech emotion recognition, the overall recognition rate drops as confusion between emotions increases. Inspired by the above methods, we propose a speech emotion recognition method based on the decision tree SVM model with Fisher feature selection. In this method, a high-performance decision tree SVM classifier is established by calculating the degree of emotional confusion, so as to perform a two-step classification: a rough classification followed by a fine classification. For each SVM in the decision tree, we select the feature parameters with higher discriminative ability by the Fisher criterion to obtain an optimal feature set. Finally, this model is used for speech emotion recognition, yielding better emotion classification performance.

The contributions of this paper are as follows: (1) we adopt the Fisher criterion to remove redundant features and improve emotion recognition performance; (2) we propose an algorithm that determines the structure of the decision tree dynamically, and construct the system frameworks on the CASIA Chinese speech emotion corpus and the EMO-DB Berlin speech corpus; and (3) we combine the Fisher criterion with the decision tree SVM and adopt a genetic algorithm to optimize the SVM parameters, further improving the emotion recognition rate.

The rest of this paper is organized as follows. In Section 2, we present the idea of decision tree SVM model with Fisher feature selection. Section 3 introduces the experiment and the analysis of results. In Section 4, we summarize our paper and discuss the future work for speech emotion recognition.

2 Decision tree SVM model with Fisher feature selection

2.1 Emotion features

To effectively recognize emotions from speech signals, we need to extract feature parameters that reflect the emotional information in the speech signal, and then use these parameters to train the model used for emotion recognition. The quality of the selected feature parameters directly affects the recognition rate of the system. Traditional speech features used for emotion recognition tasks are mainly divided into prosodic features, spectrum-based features, and voice quality features [23]. The prosodic features include speech rate, pitch period, amplitude energy, etc. The spectrum-based features include linear prediction coefficients (LPC), Mel-frequency cepstral coefficients (MFCC), etc. The voice quality features include formant frequency and glottal parameters. In addition, some basic parameters, such as Fourier coefficients [24, 25], are often used in speech emotion recognition.

Feature parameters are usually extracted frame by frame. Since a single frame contains little information, most researchers compute statistical variables of the feature parameters over multiple frames for emotion recognition tasks. In this paper, five kinds of features are adopted (MFCC, energy, Fourier coefficients, pitch frequency, and zero-crossing rate), and five statistical variables of the multi-frame features (maximum, minimum, mean, standard deviation, and median) are calculated and applied to the recognition tasks.
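The step from frame-level features to utterance-level statistics can be illustrated with a minimal sketch. The function name `functionals`, the toy input, and the use of the population standard deviation are assumptions made for illustration; the paper does not state which estimator is used.

```python
import statistics

def functionals(frames):
    """Collapse per-frame feature vectors into one utterance-level vector.

    frames: list of equal-length lists, one list of feature values per frame.
    For each feature dimension the output holds, in order:
    maximum, minimum, mean, standard deviation, median.
    """
    out = []
    for dim_values in zip(*frames):  # iterate over feature dimensions
        out.extend([
            max(dim_values),
            min(dim_values),
            statistics.fmean(dim_values),
            statistics.pstdev(dim_values),  # population std (assumption)
            statistics.median(dim_values),
        ])
    return out

# Toy input: 3 frames, 2 feature dimensions (values are illustrative only).
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
utterance_vector = functionals(frames)
```

Each feature dimension contributes five values, so a frame-level vector of dimension D becomes an utterance-level vector of dimension 5D regardless of the number of frames.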

2.2 Support vector machine model

The support vector machine (SVM), proposed in the 1990s, is a machine learning method applied in many areas. For linearly non-separable problems, its basic idea is to map the input space into a high-dimensional feature space by a nonlinear transformation and to find the optimal hyperplane in the new space. The optimal hyperplane must not only separate the categories correctly but also maximize the margin between them, which gives the SVM stronger generalization capability. In other words, the goal of training an SVM is to find the hyperplane with the maximum margin.

The objective function of the SVM for the linearly non-separable case is given by:

$$ \min \left(\frac{1}{2}{\boldsymbol{\omega}}^T\boldsymbol{\omega} +C\sum \limits_{i=1}^N{\xi}_i\right) $$

$$ \mathrm{s.t.}\quad {y}_i\left({\boldsymbol{\omega}}^T{\boldsymbol{x}}_i+b\right)\ge 1-{\xi}_i,\quad {\xi}_i\ge 0,\quad i=1,2,\dots, N $$

(1)

where ω represents the weight coefficient vector, and b is a constant. C denotes the penalty coefficient, which controls the degree of penalty for misclassified samples and balances model complexity against loss error. ξ_i represents the relaxation factor (slack variable), which adjusts the number of misclassified samples that are allowed to exist in the process of classification.
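To make Eq. (1) concrete, the following sketch evaluates the objective for a fixed linear hyperplane. For given ω and b, the smallest slack that satisfies both constraints is ξ_i = max(0, 1 − y_i(ω^T x_i + b)). The function name and the toy data are hypothetical, and no kernel or optimization is involved; this only evaluates the quantity being minimized.

```python
def soft_margin_objective(w, b, C, X, y):
    """Evaluate Eq. (1) for a fixed hyperplane (w, b).

    X: list of feature vectors, y: list of labels in {+1, -1}.
    Returns (objective value, list of slack variables).
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b)
        # Smallest xi_i with y_i(w.x_i + b) >= 1 - xi_i and xi_i >= 0.
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(w_k * w_k for w_k in w) + C * sum(slacks)
    return objective, slacks

# Illustrative data: two samples per class, one of them misclassified.
w, b, C = [1.0, 0.0], 0.0, 2.0
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 1.0]]
y = [1, 1, -1, -1]
objective, slacks = soft_margin_objective(w, b, C, X, y)
```

Here the second sample lies inside the margin (slack 0.5) and the fourth is on the wrong side of the hyperplane (slack 1.5); a larger C makes such violations more expensive.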

When the SVM is used to solve multi-class classification problems, two strategies can be adopted: ONE-TO-ALL and ONE-TO-ONE. According to previous studies, the ONE-TO-ONE strategy has an advantage in speed [26], so this paper uses the ONE-TO-ONE strategy. The kernel function is the key to the SVM. Commonly used kernel functions include the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, and the multilayer perceptron kernel. The RBF kernel, which performed best in our previous experiments, is used in this paper.

2.3 Construction strategy of decision tree SVM

In multi-class emotion recognition, the overall recognition rate is reduced as confusion between emotions increases. To solve this problem, this paper establishes a decision tree SVM by calculating the degree of emotional confusion, and uses the decision tree SVM as a classifier for speech emotion recognition.

First, we define E = {e₁, e₂, e₃, …, e_n} as the set of emotional states, where n is the number of states. The confusion degree between the ith emotion e_i and the jth emotion e_j is denoted I_{i, j}; it is the average of the probability that the ith emotion is misjudged as the jth emotion and the probability that the jth emotion is misjudged as the ith emotion [27]. The formula is given by:

$$ {I}_{i,j}=\frac{P\left(r=j|\boldsymbol{x}\in {e}_i\right)+P\left(r=i|\boldsymbol{x}\in {e}_j\right)}{2} $$

(2)

where x represents the test sample, and r represents the result of classification corresponding to x.
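As an illustration of Eq. (2), the sketch below derives confusion degrees from a row-normalized confusion matrix, where entry `conf[i][j]` is the percentage of samples of emotion i classified as emotion j. The function names and the 3-class matrix are invented for illustration and are not taken from the paper's tables.

```python
def confusion_degree(conf, i, j):
    """Eq. (2): average of the two cross-misclassification rates."""
    return (conf[i][j] + conf[j][i]) / 2.0

def confusion_degree_matrix(conf):
    """Symmetric matrix of confusion degrees; the diagonal is set to 0."""
    n = len(conf)
    return [[0.0 if i == j else confusion_degree(conf, i, j)
             for j in range(n)] for i in range(n)]

# Hypothetical confusion matrix in percent (each row sums to 100).
conf = [
    [80.0, 15.0, 5.0],
    [25.0, 70.0, 5.0],
    [10.0, 0.0, 90.0],
]
degrees = confusion_degree_matrix(conf)
```

With this toy matrix, the first two emotions have a confusion degree of (15 + 25) / 2 = 20%, while the second and third have (5 + 0) / 2 = 2.5%.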

The proposed decision tree SVM algorithm is as follows:

(a) Calculate the emotional confusion matrix using the traditional SVM method; only the MFCC parameters are used to train the SVM.
(b) Set an appropriate initial threshold P for the primary classification. Emotions whose confusion degree exceeds the threshold P are placed in the same group: if I_{a, b} > P, a and b are placed in one group; if I_{a, b} > P and I_{b, c} > P, then a, b, and c are placed in one group. If the confusion degree between a certain emotion and every other emotion is less than the threshold, that emotion forms a group by itself.
(c) Calculate the confusion degree between ungrouped emotions and other emotions according to Eq. (2), and then move to step (b) to assign the ungrouped emotional states to existing groups or to a new group.
(d) Count the emotional states in each group. If the number is greater than 2, increase the threshold by P and move to step (a); otherwise, move to step (e).
(e) All emotions are categorized, and the algorithm ends.

The algorithm above determines the depth of the decision tree dynamically, which means that the decision tree structure differs from corpus to corpus. First, the initial threshold P is chosen on the basis of a large number of comparison experiments. Once the initial threshold is determined, the thresholds of the remaining layers follow: the threshold is 2P for the second layer, 3P for the third layer, and so on. With the threshold of each layer determined, the emotional states can be classified. When the confusion degree between emotions is greater than the threshold of a layer, these emotions are placed in the same group. If their confusion degree is below the threshold, these emotions need not be grouped and are directly classified by the SVM in this layer. In this way, an optimal decision tree structure can be obtained for each corpus.
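The grouping procedure can be sketched as follows, under two simplifying assumptions that are not from the paper: the confusion degrees are computed once and reused at every layer (the paper recomputes the confusion matrix when it returns to step (a)), and a layer that fails to split a group is skipped by moving to the next threshold. The function names, emotion labels "a" to "e", and all numeric values are hypothetical.

```python
from itertools import combinations

def group_by_confusion(emotions, degree, threshold):
    """Step (b): chain together emotions whose confusion degree > threshold."""
    groups, unvisited = [], list(emotions)
    while unvisited:
        stack, group = [unvisited.pop(0)], []
        while stack:
            a = stack.pop()
            group.append(a)
            linked = [b for b in unvisited
                      if degree[frozenset((a, b))] > threshold]
            for b in linked:
                unvisited.remove(b)
            stack.extend(linked)
        groups.append(sorted(group))
    return groups

def build_tree(emotions, degree, p, layer=1):
    """Nested lists: each list is one SVM node, strings are leaf emotions."""
    if len(emotions) <= 2:            # one binary SVM separates them directly
        return list(emotions)
    groups = group_by_confusion(emotions, degree, layer * p)
    if len(groups) == 1:              # threshold too low to split this group
        return build_tree(emotions, degree, p, layer + 1)
    return [g[0] if len(g) == 1 else build_tree(g, degree, p, layer + 1)
            for g in groups]

# Hypothetical confusion degrees in percent; unspecified pairs default to 1.
emotions = ["a", "b", "c", "d", "e"]
degree = {frozenset(pair): 1.0 for pair in combinations(emotions, 2)}
degree[frozenset(("a", "b"))] = 12.0
degree[frozenset(("b", "c"))] = 8.0
degree[frozenset(("a", "c"))] = 3.0
degree[frozenset(("d", "e"))] = 20.0

tree = build_tree(emotions, degree, p=5.0)
```

With P = 5, the first layer groups {a, b, c} and {d, e}; at the second layer (threshold 2P = 10) only a and b remain linked, so c is split off and the pairs {a, b} and {d, e} are left to binary SVMs.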

2.4 Feature selection strategy for decision tree SVM

To improve the recognition rate of multi-class speech emotion recognition, we propose a speech emotion recognition method based on the decision tree SVM model and Fisher feature selection. In this method, the speech signal is preprocessed by pre-emphasis and framing, and the MFCC coefficients of the speech signal are extracted. Then, the confusion matrix between emotions is obtained using the MFCC coefficients and a traditional SVM, and the confusion degree between emotions is calculated from the confusion matrix. Next, the decision tree SVM is constructed from the confusion degrees according to the construction strategy of Section 2.3. Once the decision tree SVM is constructed, the Fisher discriminant coefficient is obtained by calculating the mean and variance of each feature dimension. Finally, the feature parameters with higher discriminative ability are selected for each SVM in the decision tree according to the Fisher discriminant coefficient and are used for training. The specific flow chart is shown in Fig. 1.

In speech emotion recognition, due to the difference in the ability to discriminate emotional states for various features, it is vital to select appropriate features to discriminate different sets of emotions. In an ideal feature space, the distance between different categories should be as large as possible, and the distance between the same categories should be as small as possible, so that we can classify effectively. For the feature selection of the extracted features, the mean and variance of the feature are used as a criterion to measure the characteristics of the feature. Assume that the feature matrix F_P of Pth emotion is given by:

$$ {F}_P=\left[\begin{array}{cccccccc}{X}_{11}^P& {X}_{12}^P& {X}_{13}^P& {X}_{14}^P& {X}_{15}^P& {X}_{16}^P& \dots & {X}_{1N}^P\\ {}{X}_{21}^P& {X}_{22}^P& {X}_{23}^P& {X}_{24}^P& {X}_{25}^P& {X}_{26}^P& \dots & {X}_{2N}^P\\ {}{X}_{31}^P& {X}_{32}^P& {X}_{33}^P& {X}_{34}^P& {X}_{35}^P& {X}_{36}^P& \dots & {X}_{3N}^P\\ {}\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ {}{X}_{M1}^P& {X}_{M2}^P& {X}_{M3}^P& {X}_{M4}^P& {X}_{M5}^P& {X}_{M6}^P& \dots & {X}_{MN}^P\end{array}\right] $$

(3)

where M and N are the number and the dimension of the feature parameters respectively. $ {\boldsymbol{T}}_{id}=\left\{{X}_{1d}^i,{X}_{2d}^i,{X}_{3d}^i,\dots, {X}_{Md}^i\right\} $ represents the set of the d-dim feature of the ith type emotion. For the d-dim feature, the Fisher discriminant coefficient is defined as:

$$ f(d)=\frac{{\left({\mu}_{1d}-{\mu}_{2d}\right)}^2}{\sigma_{1d}^2+{\sigma}_{2d}^2} $$

(4)

where μ_id and $ {\sigma}_{id}^2 $ denote the mean and variance of the vector T_id, respectively. The size of the Fisher discriminant coefficient reflects the degree of dissimilarity between different classes and the similarity within the same class. For distinguishing emotional states, the larger the Fisher discriminant coefficient of a feature is, the greater the contribution that feature makes. For multiple classes, the Fisher discriminant coefficient is calculated as:

$$ f(d)=\frac{1}{C_Q^2}\sum \limits_{1\le i<j\le Q}\frac{{\left({\mu}_{id}-{\mu}_{jd}\right)}^2}{\sigma_{id}^2+{\sigma}_{jd}^2} $$

(5)

where Q is the total number of emotional states and $ C_Q^2=Q(Q-1)/2 $ is the number of emotion pairs.
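A minimal sketch of Eqs. (4) and (5) follows; for two classes, Eq. (5) reduces to Eq. (4). The function names, the class labels, and the toy feature values are hypothetical, the population variance is an assumption, and the sketch assumes that no pair of classes has zero variance in the same dimension.

```python
import statistics
from itertools import combinations

def fisher_scores(class_features):
    """Eq. (5): mean pairwise Fisher discriminant coefficient per dimension.

    class_features: dict mapping emotion label -> list of feature vectors.
    Returns a list with one score f(d) per feature dimension d.
    """
    stats = {}
    for label, rows in class_features.items():
        columns = list(zip(*rows))
        stats[label] = [(statistics.fmean(c), statistics.pvariance(c))
                        for c in columns]
    pairs = list(combinations(stats, 2))      # all C(Q, 2) emotion pairs
    n_dims = len(next(iter(stats.values())))
    scores = []
    for d in range(n_dims):
        total = 0.0
        for i, j in pairs:
            mu_i, var_i = stats[i][d]
            mu_j, var_j = stats[j][d]
            total += (mu_i - mu_j) ** 2 / (var_i + var_j)
        scores.append(total / len(pairs))
    return scores

def select_top(scores, k):
    """Indices of the k dimensions with the largest Fisher coefficient."""
    order = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    return order[:k]

# Toy data: 2 emotions, 2 samples each, 2 feature dimensions.
data = {
    "x": [[1.0, 5.0], [3.0, 6.0]],
    "y": [[7.0, 5.0], [9.0, 7.0]],
}
scores = fisher_scores(data)
best = select_top(scores, 1)
```

In the toy data the first dimension separates the two classes well (f = 36 / 2 = 18) while the second barely does (f = 0.25 / 1.25 = 0.2), so only the first dimension would be kept when one feature is selected.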

3 Experiments

To evaluate the effectiveness of the proposed method, two corpora are employed: the CASIA Chinese speech emotion corpus and the EMO-DB Berlin speech corpus. The CASIA Chinese speech emotion corpus was recorded and provided by the Institute of Automation, Chinese Academy of Sciences. The CASIA corpus contains six basic emotions: Angry, Happy, Fear, Neutral, Surprise, and Sad. The EMO-DB corpus contains seven basic emotions: Angry, Happy, Fear, Neutral, Boring, Disgust, and Sad. It consists of 535 speech utterances, all of which are used in the experiments. The experiments in this paper are all based on tenfold cross-validation. In other words, the samples are randomly divided into ten equal parts, of which 9/10 of the samples are used for training and 1/10 for testing. The experiment is repeated ten times, and the final recognition result is the average of these results. This paper uses the SVM as the emotion recognition model and uses the LIBSVM toolbox, developed by Professor Chih-Jen Lin of National Taiwan University, for SVM training and testing. MATLAB R2013a is used to extract the emotion features, and LIBSVM is installed in the Visual Studio 2010 environment.
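The tenfold protocol described above can be sketched in a few lines. The function name `ten_fold_indices` and the fixed random seed are assumptions for illustration; the paper does not describe how the random partition was generated.

```python
import random

def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # random partition (seed assumed)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for m, fold in enumerate(folds) if m != k for i in fold]
        yield train, test

# EMO-DB has 535 utterances, so fold sizes differ by at most one.
splits = list(ten_fold_indices(535))
fold_sizes = sorted(len(test) for _, test in splits)
```

Every utterance appears in exactly one test fold, and the reported recognition rate is the average over the ten rounds. Since 535 is not divisible by 10, five folds hold 53 utterances and five hold 54.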

Before extracting the parameters of the speech signal, endpoint detection is first performed on the speech signal, and the signal is then divided into frames with a frame length of 256 points and a frame shift of 128 points. The feature parameters of the experiment include the first 160 Fourier coefficients, the amplitude energy, the pitch frequency, the zero-crossing rate, the 24th-order MFCC, and the first-order difference. The statistical variables include the maximum, the minimum, the mean, the median, and the variance. In addition, all 1055-dimensional feature parameters are normalized.
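The framing step and the feature dimension can be checked with a short sketch. The function name `frame_signal` and the synthetic signal are hypothetical. The dimension count assumes that the first-order difference is taken over the 24 MFCC coefficients, which is one reading that reproduces the stated total of 1055.

```python
def frame_signal(samples, frame_len=256, frame_shift=128):
    """Split a signal into overlapping frames, dropping the incomplete tail."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len]
            for s in range(0, last_start + 1, frame_shift)]

# Synthetic 1024-sample signal, used only to show the frame layout.
samples = list(range(1024))
frames = frame_signal(samples)

# Per-frame feature count, assuming the difference applies to the MFCCs.
per_frame_dims = (160      # Fourier coefficients
                  + 1      # amplitude energy
                  + 1      # pitch frequency
                  + 1      # zero-crossing rate
                  + 24     # MFCC
                  + 24)    # first-order difference of MFCC (assumption)
total_dims = per_frame_dims * 5   # five statistical variables per dimension
```

With a frame shift equal to half the frame length, adjacent frames overlap by 50%, and 211 frame-level features times five statistics gives the 1055 dimensions mentioned above.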

3.1 CASIA Chinese speech emotion corpus

3.1.1 Decision tree SVM model for CASIA

According to the proposed decision tree SVM algorithm, the confusion among emotions needs to be calculated. The confusion matrix of the six emotions, obtained using MFCC parameters and a traditional SVM, is shown in Table 1. The confusion degree among emotions, calculated according to Eq. (2), is shown in Table 2. The initial threshold P is set to 7% on the basis of a large number of experiments on the CASIA corpus (the determination of the optimal initial threshold is introduced in Section 3.3). From Table 2, we can see that the confusion degree between Angry and Surprise is 11.25%, and that between Happy and Surprise is also 11.25%. Both exceed the initial classification threshold of 7%. According to the decision tree SVM construction algorithm, Angry, Happy, and Surprise are placed in the first group. The confusion degree between Fear and Sad is 31.75%, so Fear and Sad are placed in the second group. Since the confusion degree between Neutral and every other emotion is less than 7%, Neutral forms the third group. The SVM that implements this three-group classification is denoted SVM1.

Table 1 Recognition confusion matrix of six emotions (%)
