Speech emotion recognition based on Graph-LSTM neural network

Graph Neural Networks have recently been extended to the field of speech signal processing, as graphs offer a more compact and flexible way to represent speech sequences. However, the relational structures used in recent studies tend to be relatively simple, and the graph convolution module exhibits limitations that impede its adaptability to intricate application scenarios. In this study, we construct the speech graph using feature similarity and introduce a novel graph neural network architecture that leverages an LSTM aggregator and weighted pooling. An unweighted accuracy of 65.39% and a weighted accuracy of 71.83% are obtained on the IEMOCAP dataset, performance comparable to or better than existing graph baselines. The method also improves the interpretability of the model to some extent and identifies speech emotion features effectively.


Introduction
Speech emotion recognition (SER) is a branch of automatic emotion recognition and automatic speech recognition [1]. It recognizes the emotional state of speech by analyzing the acoustic features and linguistic content of the speech, and is currently applied to multimodal generation tasks [2], assisted psychotherapy [3], video games [4] and telephone services [5]. The speech emotion recognition task is divided into two main phases: feature extraction and emotion classification. The speech signal is first processed based on time-domain and frequency-domain characteristics to quantize the raw speech. Subsequently, the processed data is fed into deep learning models for emotion classification. The most popular models are the convolutional neural network (CNN) [6], recurrent neural network (RNN) [7], long short-term memory network (LSTM) [8], as well as large-scale speech recognition models [9]. However, vocal state and emotional expression vary continuously, so accurately identifying the emotional state within a short time remains a great challenge.
The graph neural network (GNN) is an extension of convolutional networks to non-Euclidean data, with the core idea of building interpretable features from the associations within the data [10]. It has been successfully applied to computer vision and natural language processing tasks. Because speech is a linear sequence, it is difficult to convert into irregular non-Euclidean data, which has limited the application of graph neural networks in speech signal processing. In recent years, researchers have treated linear sequences as a special case of graphs and applied graph convolutions as encoders through transformations such as line graphs, cycle graphs [11,12], and complete graphs [13], building lightweight architectures with excellent performance. However, the relational structures of these constructions are uniform, and the graph convolution is constrained by the graph topology, which limits flexibility and results in poor generalization in complex scenes.
This paper focuses on the task of sentence-level speech emotion classification. To facilitate this task, individual frames are treated as nodes. The backbone of the graph is a directed cycle, while the feature similarity between speech frames is computed to determine additional connections between nodes; specifically, the K edges with the highest weights are selected. For such complex topological graphs, we choose the Message Passing Neural Network (MPNN) based on spatial-domain convolution to design a more flexible classification model.
Our contributions are as follows.
1) The development of a more adaptable directed graph of speech by leveraging feature similarity allows for greater flexibility in representing speech.
2) The introduction of a graph neural network architecture based on an LSTM Aggregator employs a message passing mechanism to capture input dependencies and facilitates accurate recognition of speech emotions, particularly in graphs with higher complexity.
3) The proposal of a weighted graph pooling operation for graph-level classification tasks enables the extraction of global features. The experimental results show that weighted pooling can effectively remove redundant information and leads to a more stable convergence trend.

SER based on deep learning
Currently, classifiers of SER can be categorized into two types, traditional classifiers and deep learning classifiers.
Traditional classifiers include Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs) and Support Vector Machines (SVMs), etc. [14], which rely on extensive preprocessing and precise feature engineering [15]. With the development of deep learning technology, the performance of SER has improved significantly. Some studies combine Deep Neural Networks (DNNs) with traditional classifiers; e.g., [16] proposes a DNN-decision tree SVM model that captures more distinctive emotion features than traditional SVM and DNN-SVM. Most recognition frameworks based on neural networks utilize CNNs, LSTMs, and their combinations [17,18]. For example, [19] modifies the initial model with an incremental approach and feeds multiple acoustic features to a 1D CNN, improving accuracy. [20] constructs a robust and effective recognition model based on key sequence fragments, combining a CNN and a BiLSTM. The attention mechanism is another key tool for deep learning recognizers to handle hidden information. Attention-based DNNs can mine unevenly distributed features in speech and emphasize saliently emotional information, adapting better to changes in speech emotion [21]. By directing self-attention to deal with missing and hidden information, a more robust structure [22] obtains satisfactory performance. Furthermore, a remaining challenge for neural SER systems is poor generalization due to data mismatch. To address this problem, [23,24] make significant progress on generalization by sharing feature representations among auxiliary tasks through multi-task learning. However, traditional deep learning recognition systems have complex structures and weak interpretability of speech features. The graph has therefore been introduced into speech tasks as a compact and efficient representation, and the superiority of GNNs in graph processing has received widespread attention.

SER based on GNNs
At present, the application of graph neural networks in the field of speech technology still has some limitations [25], but several studies have verified the advantages of graph convolution in speech technology and its potential for wide use, such as conversational speech recognition [26], sentence-level [27] and conversation-level speech emotion recognition [28], speech enhancement [29], and Q&A rewriting [30]. The methods of graph construction can be divided into sample point-based, frame-based, speech channel-based, and historical dialogue-based approaches, as shown in Fig. 1. In addition, graph neural networks perform well in low-resource speech emotion recognition; for example, [31] uses transductive ensemble learning algorithms based on graph neural networks to address the challenge of Portuguese speech emotion classification.
In current studies, researchers mostly use frame-based construction: each frame is considered as one node, and down-sampling is additionally used to reduce the number of frames and simplify the structure. For example, the study [11] modeled the speech signal as a frame-based cyclic graph and constructed a lightweight and precise graph convolution architecture, achieving performance comparable with existing techniques. The studies [10,12,25] extend the context acceptance range by constructing neighbors within specific time windows over the deep frame-level features obtained by recurrent neural networks. Similarly, the study [32] extends to dialogue speech emotion recognition by introducing a CNN-BiLSTM to extract conversation features and constructing edges through a fixed past context window. These studies depend heavily on the feature processing capability of sequence models, and the connections are relatively fixed. The study [33] proposes an ideal graph structure based on cosine similarity and constructs a graph convolutional network with better robustness. However, in practical applications, speech sequences are prone to high feature similarity and feature instability, so the threshold approach is not applicable to realistic scenarios.
To address the problems of inflexible graph structure and poor generalization ability in the above studies, this paper proposes a graph neural network based on LSTM aggregator and weighted pooling to transform the speech emotion recognition task into the graph classification task.

Proposed approach
In this section, we discuss each component of the Graph-LSTM neural network (GLNN) in detail.

Graph construction
Inspired by studies [11,12], the speech signal is processed into frames, and each frame is considered as a node. To preserve feature integrity and build a scalable graph, down-sampling and fixed-length cutting are discarded. The speech, with a variable number of frames, is transformed into a graph based on the temporal relationship and feature similarity. Thus, the speech graph is heterogeneous.
The graph dataset is represented as G = (V, E), where V is the set of nodes and E is the set of edges. The feature matrix of the nodes is X ∈ R^(n×D), where n is the number of nodes and D is the feature dimension; x_i, the feature vector of the i-th node, is composed of a set of low-level descriptors extracted by openSMILE 3.0. The edges fall into two categories. The first is the directed edges given by the temporal relationship: the one-way edges {v_i → v_(i+1)}, i = 1, …, n−1, are constructed depending only on time, and the loop is closed by v_n → v_1. This directed cycle graph is used as the backbone to improve the stability of the graph structure. The second category is the directed edges obtained from feature similarity. To reduce computational complexity, dot-product similarity is used:

S = X̂ X̂ᵀ,  edges = { e_ji | j ∈ TopK(S_(·i), k) }

where X̂ is the standardized feature matrix; the dot product between standardized node features yields the similarity weights S. edges denotes the set of constructed edges, j indexes the adjacent nodes of the i-th node selected by the TopK function, and e_ji is the edge built between the i-th and j-th nodes, pointing from the j-th node to the i-th node.
The heat map of the weights is shown in Fig. 2. It can be observed that the feature similarity between nodes is greatest in the region centered on the diagonal, and is higher within a small neighborhood, which is consistent with the temporal characteristics of speech. To screen out redundant information and select the edges with the highest correlation, the TopK algorithm [34] is used to select the k nodes with the highest similarity to the target node v_i. Through experimental verification, the value of k is set to 10, which improves the stability of the model's convergence.
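As a concrete illustration, the construction above can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' code: the function name `build_speech_graph` and the z-score standardization are our own assumptions, and the TopK step simply keeps the k largest dot-product scores per target node.

```python
import numpy as np

def build_speech_graph(X, k=10):
    """Construct directed edges for a frame-level speech graph.

    Backbone: a directed cycle v_i -> v_{i+1}, closed by v_n -> v_1.
    Similarity edges: for each target node i, the k most similar source
    nodes j (dot-product similarity on standardized features) get j -> i.
    Edges are returned as (source, target) pairs.
    """
    n = X.shape[0]
    # Backbone cycle edges (i -> i+1, with n-1 -> 0 closing the loop).
    cycle = [(i, (i + 1) % n) for i in range(n)]
    # Standardize features (assumed z-score), then dot-product similarity.
    Xn = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)          # exclude self-similarity
    sim_edges = []
    for i in range(n):
        topk = np.argsort(S[:, i])[-k:]   # k most similar sources for node i
        sim_edges.extend((int(j), i) for j in topk)
    return cycle, sim_edges
```

Note that the cycle edges guarantee every node keeps at least one temporal in-edge even when its similarity scores are uninformative.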

Graph-LSTM neural network
The structure of the Graph-LSTM neural network (GLNN) is shown in Fig. 3. The architecture, built on the speech graph, consists of three graph convolution layers, a pooling layer and a classifier. In Fig. 3, A is the overall structure of GLNN; B is the structure of the graph convolution, consisting of the LSTM aggregator and linear updater; C is the structure of the weighted pooling layer. The solid lines represent the backbone, and the dashed lines represent the possible edges constructed by similarity in the graph of A.
The model is built on the message passing network, with the two forward phases of message aggregation and readout [35]. Each convolution layer of the Graph-LSTM model consists of an aggregator and an updater. The aggregation step is

x_aggr = AGGR_(j ∈ N(i)) ( ϕ_α(x_j) )   (3)

where ϕ_α and ϕ_β represent linear transformations; N(i) is the neighborhood of the target node; x_aggr is the neighborhood feature obtained by aggregation; x_i is the feature vector of the i-th node and x_j the feature vector of an adjacent node.
Given the graph structure of Section 3.1, and considering the continuity and complexity of speech features, simple aggregation operators [36] are no longer suitable for this application scenario. As a result, the LSTM aggregation operator [37] is chosen to perform inductive representation learning over adjacent features. Information flows from the source nodes to the target nodes: the neighbors are arranged into a time series and fed into the LSTM for inference, yielding deep aggregated features. The node is then updated as

x_up = ϕ_β( x′_i ⊕ x_aggr ) + γ   (5)

where ⊕ denotes concatenation, x′_i is the mapped feature of the i-th node, and γ is a bias term.
Fig. 2 Similarity weighting heat map
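A minimal NumPy sketch of one Graph-LSTM convolution layer follows. It assumes a standard LSTM cell with random, untrained weights; all names (`lstm_step`, `lstm_aggregate`, `glnn_layer`) are hypothetical, and ordering neighbors by index stands in for the temporal sequencing described above.

```python
import numpy as np

rng = np.random.default_rng(0)  # shared generator for the random weights

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell (gates: input, forget, cell, output)."""
    d = h.shape[0]
    z = W @ x + U @ h + b
    i = 1 / (1 + np.exp(-z[:d]))          # input gate
    f = 1 / (1 + np.exp(-z[d:2*d]))       # forget gate
    g = np.tanh(z[2*d:3*d])               # candidate cell state
    o = 1 / (1 + np.exp(-z[3*d:]))        # output gate
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def lstm_aggregate(X, neighbors, i, W, U, b, d_hidden):
    """Run the neighbors of node i through the LSTM as a sequence;
    the final hidden state plays the role of x_aggr."""
    h = np.zeros(d_hidden)
    c = np.zeros(d_hidden)
    for j in sorted(neighbors[i]):        # index order stands in for time order
        h, c = lstm_step(X[j], h, c, W, U, b)
    return h

def glnn_layer(X, neighbors, d_hidden):
    """One Graph-LSTM convolution: x_up = phi_beta(x_i ⊕ x_aggr), bias folded in."""
    n, d_in = X.shape
    W = rng.normal(0, 0.1, (4 * d_hidden, d_in))
    U = rng.normal(0, 0.1, (4 * d_hidden, d_hidden))
    b = np.zeros(4 * d_hidden)
    W_upd = rng.normal(0, 0.1, (d_hidden, d_in + d_hidden))  # phi_beta
    out = np.zeros((n, d_hidden))
    for i in range(n):
        x_aggr = lstm_aggregate(X, neighbors, i, W, U, b, d_hidden)
        out[i] = W_upd @ np.concatenate([X[i], x_aggr])      # x_i ⊕ x_aggr
    return out
```

In a trained model the weights would of course be learned; the sketch only shows the data flow of aggregation followed by the linear update.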
The Graph-LSTM neural network consists of a 3-layer graph convolution module, which realizes feature aggregation and update. It then performs the read-out operation through the pooling layer to obtain graph-level features, which are input to the classifier.

Weighted pooling
The construction of the speech graph establishes connections based on temporal order and similarity, so there are many overlapping regions and redundant features among neighbors. Conventional pooling operations struggle to filter representative features from such dense connections, while the temporal nature of speech requires preserving the integrity of node features. Therefore, weighted pooling is constructed from the global pooling operations of sum, max and mean, calculated as shown in Eq. 6:

x_pooling = α · max_i(x_i) + β · mean_i(x_i) + γ · sum_i(x_i)   (6)

where max, mean and sum represent the three global pooling operations; x_i is the feature vector of the i-th node; x_pooling is the global feature vector; and α, β and γ are the weights of the three pooling operations, set to {0.3, 0.3, 0.3} in the experiments. Through the weighted pooling operation, feature integrity is retained while redundant information is removed.
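Eq. 6 translates directly into code; a minimal NumPy version (the function name is our own) is:

```python
import numpy as np

def weighted_pooling(X, alpha=0.3, beta=0.3, gamma=0.3):
    """Graph read-out combining max, mean and sum pooling over node
    features X (n nodes x D dims), as in Eq. 6."""
    return (alpha * X.max(axis=0)
            + beta * X.mean(axis=0)
            + gamma * X.sum(axis=0))
```

For example, for two nodes with features [1, 2] and [3, 4], the pooled vector is 0.3·[3, 4] + 0.3·[2, 3] + 0.3·[4, 6] = [2.7, 3.9].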

Dataset and features
The dataset used for this study is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [38], containing 12 hours of audiovisual data. The data were collected from two-person situational dialogues performed by actors in a scripted or improvised manner; the actors' facial expressions and hand movements were recorded simultaneously. The speech emotion recognition task in this study uses only the speech data: five dyadic sessions segmented into utterances. IEMOCAP uses multiple annotators to label these data with 11 emotion categories. For objective experimental analysis and performance comparison, we used four classes in our experiments, namely angry, happy, sad, and neutral, totaling 4490 utterances.
Audio features are extracted with the open-source tool openSMILE 3.0 [39]. openSMILE is a large-space audio feature extractor widely used for affective computing tasks; feature extraction is driven by the command line and configuration files. The experiment uses the INTERSPEECH 2010 Paralinguistic Challenge feature set to extract a set of low-level descriptors (LLDs), including MFCCs, maxPos, amean and skewness, together with the corresponding first-order delta coefficients. The speech is framed by a fixed-size sliding window, with the frame length set to 25 ms and the shift set to 10 ms. In addition, a spontaneous binary feature is added to each frame, inspired by spontaneous learning [40]. As a result, 77-dimensional features are generated for each speech frame.

Experimental setup
The dataset is divided into training and test sets with stratification to balance the categories, using an 8:2 ratio. Training uses the Adam optimizer with a learning rate of 1e-5, a weight decay of 1e-4, and a batch size of 8. All experiments are performed on an NVIDIA Tesla V100 GPU. Model performance is evaluated using the weighted accuracy (WA) and unweighted accuracy (UA) metrics.
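The stratified 8:2 split can be sketched without any ML library; the helper below is a hypothetical illustration of the procedure, not the authors' code. It splits each label group 80/20 so both sets keep the class balance.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.2, seed=0):
    """Split sample indices per class so train/test keep the label balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    train, test = [], []
    for lab, idxs in by_class.items():
        rng.shuffle(idxs)                        # randomize within the class
        cut = int(round(len(idxs) * test_ratio)) # 20% of this class to test
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return sorted(train), sorted(test)
```

In practice the same effect is obtained with `train_test_split(..., stratify=labels)` from scikit-learn.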

Comparison model
We compare the proposed method with sequence-based SER models and graph-based SER models respectively.

SER models
We selected three SER models as baselines. DCNN [41]: a 1-D convolutional neural network that uses hybrid features as input and modifies the initial model with incremental methods to improve classification accuracy; the model generalizes well.
ResNet34 [42]: a transfer learning method combined with spectrogram augmentation that can efficiently handle variable-length inputs using a pre-trained residual network; it alleviates over-fitting and improves the generalization ability of the model. ADAN + SVM [43]: an adversarial data augmentation network that generates augmented data and makes SVM classifiers outperform RNN classifiers in terms of local attention.

GNN baselines
Compact SER [11]: a lightweight graph convolutional network based on cycle or line graphs that maintains performance comparable to existing techniques with fewer learnable parameters.
PATCHY-SAN [44]: a general framework that extracts locally connected regions so that convolutional networks can learn over arbitrary graphs; it is computationally efficient.
PATCHY-Diff [45]: a differentiable graph pooling module that generates hierarchical representations, combined with multiple graph neural networks in an end-to-end fashion for graph classification tasks.
For the above methods, the four classes of data, angry, happy, sad and neutral, totaling 4490 utterances, were used for analysis.
GA-GRU [25]: a speech emotion recognition framework applies graph attention mechanism to gating units, combining long time sequences and graph data to enhance feature saliency.
CoGCN [33]: a graph convolutional network is based on cosine similarity with good noise immunity.
LSTM-GIN [46]: a speech emotion recognition network based on LSTM and GIN applies Graph Isomorphism Network to extract global feature representations.
The above approaches merge the happy and excited categories when validating model performance, extending the happy category to 1636 utterances, for a total of 5531 utterances.

Performance comparison
Table 1 shows the results of GLNN compared with the baselines. First, the basic architecture of GLNN, using global average pooling, obtains a WA of 68.15% on IEMOCAP, exceeding the baseline methods. However, the UA is only 59.16%, lower than ResNet34 [42] and Compact SER [11]. A possible reason is category imbalance: the happy class, with only 595 utterances, is much smaller than the others, which may lead to lower UA values. For validation, the happy category is combined with the excited category, and the results are compared with the methods of [25,33,46].
With more balanced categories, the gap between the WA and UA of GLNN narrows; in particular, UA improves significantly, by 9.49%. This indicates that the amount of training data per category has a large impact on model performance, and that GLNN is not accurate enough when trained on a small, unbalanced dataset, because the graph structure and graph convolution used by GLNN may lead to feature redundancy and unstable feature extraction with few training samples. To solve this problem, the weighted pooling layer is constructed.
After adopting the weighted pooling method, GLNN exhibits a notable enhancement, achieving a WA of 71.83% and a UA of 65.39%. These results surpass the baseline models, and the disparity between the two metrics is further reduced. In practical applications, category imbalance is a common data problem that is difficult to eliminate, so it is more feasible to use weighted pooling to optimize model performance and mitigate the over-smoothing problem.

Ablation
We set up three groups of ablation experiments to verify the rationality of the proposed method. Table 2 analyzes the effect of the number of graph convolution layers and reports the corresponding parameter counts. The results show that the best performance is obtained by the 3-layer convolution module, a large improvement over the 2-layer module; however, performance decreases when further graph convolution layers are added.

Table 1 Comparison between SER baselines and proposed model
Bold represents the best results. '-' means that the result is not recorded in the corresponding report.

Meanwhile, the complexity of the graph convolution is analyzed. The space complexity determines the number of parameters, and Table 2 records the parameters of the graph convolution.

Fig. 4 The convergence curves of five pooling methods. The blue, orange, green, red and purple curves represent max-pooling, mean-pooling, sum-pooling, topk-pooling and weighted-pooling respectively; WA and UA curves are drawn separately.

Table 3 analyzes the effect of the k value used when constructing edges with the TopK algorithm, i.e., the effect of the number of edges. The results in Table 3 show that k = 10 obtains a large improvement over k = 5, with a gain of 9.6% on WA. However, the performance gain is very small when k is increased to 15, indicating that the information obtained from adjacent nodes is saturated; increasing the number of edges brings no extra information gain.
Table 4 compares the effects of different pooling methods on accuracy. In addition to the three simple read-out operations of maximum, mean and summation, we also try topk pooling, which filters out 50% of the nodes before performing mean pooling. From Table 4, mean-pooling performs better than max-pooling and sum-pooling, but worse than topk-pooling, indicating that filtering nodes to remove redundant features helps performance. Weighted pooling maximally preserves the integrity of node features while effectively selecting representative features, and performs best among the pooling methods compared. Figure 4 shows the test curves of the different pooling methods: weighted pooling effectively mitigates over-smoothing and converges more stably.

Conclusion
In this paper, we explore a graph neural network based on an LSTM aggregator and weighted pooling, applied to the speech emotion recognition task. The process is as follows. First, speech features are extracted by openSMILE. Then, the connection relationships for speech graph construction are selected based on feature similarity and the TopK algorithm. Finally, a classification model is designed based on the message passing architecture to convert speech classification into a graph classification task. Our evaluation on the IEMOCAP dataset demonstrates superior performance compared to the baseline models. However, there are some shortcomings at the current stage, including 1) complex connections and a large number of redundant features in the graph; 2) unstable processing and analysis of small datasets; and 3) neglect of speaker information. The research also focuses on adult speech and lacks exploration of children's speech emotion recognition [47,48].
To address these challenges, we will adopt the following strategies in the next stage. 1) For redundant features, we will consider more versatile approaches to graph construction to further reduce the data-size requirement and optimize the model framework. 2) For data scarcity, a transfer learning strategy [49] will be adopted to design a multi-task framework for speech recognition and emotion recognition, improving adaptability to small-sample data through feature sharing. 3) For differences in speakers' acoustic and linguistic features, a speaker converter will be introduced to learn an adaptive transformation that enables the model to eliminate feature differences.

Fig. 1 Four examples of graph construction used in the above studies; the nodes of these graphs are frames, sample points, speech channels and dialogues.

The parameter count of the three-layer convolution is 409K, which is moderate to train and fits well. The time complexity is calculated over three steps: feature mapping, feature aggregation and feature updating. Let n denote the number of nodes, D the original input dimension, D′1 the mapping dimension and D′2 the output feature dimension. First, the features of all nodes are mapped, with time complexity O(n · D · D′1). Then feature aggregation is performed by the LSTM aggregator, with complexity approximately O(n · D′1²). Finally, feature updating is completed by the linear layer, with time complexity O(n · D′1 · D′2). In summary, the time complexity is O(n(D′1 · D′2 + D′1²)).
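The three complexity terms can be checked numerically. The helper below is our own illustration (the function name and the example dimensions are assumptions), counting approximate multiplies per layer for each step.

```python
def glnn_layer_cost(n, D, D1, D2):
    """Approximate multiply counts for one Graph-LSTM layer, following the
    three steps in the text: feature mapping, LSTM aggregation, linear update.

    n: number of nodes; D: input dim; D1: mapping dim; D2: output dim.
    """
    mapping = n * D * D1        # O(n * D * D'_1)
    aggregation = n * D1 ** 2   # ~O(n * D'_1^2), LSTM recurrence per node
    update = n * D1 * D2        # O(n * D'_1 * D'_2)
    return mapping, aggregation, update
```

For example, with n = 100 frames, D = 77 input features and D′1 = D′2 = 64, the mapping, aggregation and update steps cost roughly 492,800, 409,600 and 409,600 multiplies respectively, showing the three terms are of comparable magnitude.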

Table 2
Comparison between different layers

Table 3
Comparison of number of K

Table 4
Comparison between different pooling methods