Black-box adversarial attacks through speech distortion for speech emotion recognition

Speech emotion recognition is a key branch of affective computing. Nowadays, it is common to detect emotional disorders through speech emotion recognition. Various emotion recognition models, such as LSTM, GCN, and CNN, show excellent performance. However, their recognition results can deviate substantially when the input is perturbed, which calls their robustness into question. In this article, we therefore use black-box adversarial attacks to explore the robustness of these models. Under the best of three different black-box attacks, the accuracy of the CNN-MAA model decreased by 69.38%, while the word error rate (WER) of the attacked speech increased by only 6.24%, indicating that the model is not robust under our black-box attack methods. After adversarial training, the model accuracy decreased by only 13.48%, which shows the effectiveness of adversarial training against such attacks. Our code is available on GitHub.


Introduction
Machine recognition of emotional content in speech is crucial in many human-centric systems, such as behavioral health monitoring and empathetic conversational systems. Speech emotion recognition [1] is the computer simulation of the human process of perceiving and understanding emotion. Its task is to extract acoustic features that express emotion from collected speech signals and to find the mapping between these acoustic features and human emotion. Speech Emotion Recognition (SER) is therefore a challenging task in general, owing to the huge variability in emotion expression and perception across speakers, languages, and cultures.
Many SER approaches follow a two-stage framework. First, a set of Low-Level Descriptors (LLDs) is extracted from raw speech. The LLDs are then fed to a deep learning model to generate discrete (or continuous) emotion labels [2][3][4][5]. While the use of handcrafted acoustic features is still common in SER, lexical features [6,7] and log Mel spectrograms are also used as input [8]. Spectrograms are often used with Convolutional Neural Networks (CNNs), which do not explicitly model speech dynamics. Explicit modeling of the temporal dynamics is important in SER, as it reflects the changes in emotion over time [9]. Deep learning models for time series show excellent performance in this regard, such as CNNs combined with Long Short-Term Memory networks (CNN-LSTM), Graph Convolutional Networks (GCN), and Convolutional Neural Networks with Multiscale Area Attention (CNN-MAA) [10][11][12], among various other deep learning techniques [13][14][15][16]. These models are outstanding at capturing the temporal dynamics of emotion, and their performance in SER is the best so far.
Despite their outstanding accuracies in SER, recent research [17][18][19] has shown that neural networks are easily fooled by malicious attackers, who can force a model to produce a wrong result or even a targeted output value. Yet the robustness of SER models against intentional attacks has been largely neglected. Understanding this robustness is important for the following reasons: (i) speech privacy protection methods can be migrated to the field of speech emotion recognition as black-box adversarial attacks; (ii) if speakers have emotional problems and do not want their voice to be analyzed to probe their privacy, such an operation can protect them, while the interference with the original signal content remains minimal and imperceptible. The former can be regarded as a defense-motivated attack on speech emotion recognition, while the latter is a protection of the speaker's emotional privacy that prevents privacy leakage.
To probe model robustness, there are gradient-based adversarial attack methods such as the Fast Gradient Sign Method (FGSM) [20] and Projected Gradient Descent (PGD) [21], but such methods require the attacker to know the structure and parameters of the original recognition model, or to train substitute models. However, we found that the spectral envelope distortion methods common in the field of speech privacy protection achieve a strong adversarial attack effect on speech emotion recognition systems. These methods require neither knowledge of the original recognition model nor additional corpora, and can be used for black-box attacks without training substitute models. In this paper, we use the McAdams transformation, Vocal Tract Length Normalization (VTLN), and Modulation Spectrum Smoothing (MSS) [22] to explore their impact on current state-of-the-art SER systems. To our knowledge, this is the first work that investigates adversarial examples in the field of speech emotion recognition.
The contributions of this work are summarized as follows: (i) We migrate voice privacy protection methods to the field of speech emotion recognition for use as black-box adversarial attacks. (ii) We use these methods to mount adversarial attacks against SER and summarize the results to obtain the best-performing hyperparameters α. (iii) We are the first to propose black-box adversarial attack methods for analyzing the robustness of SER models.
First, Section 2 introduces the three SER models (CNN-LSTM, GCN, CNN-MAA) studied in this paper. Section 3 then presents the three speech transformations (McAdams, VTLN, MSS). Section 4 presents the experimental setup and results, and Section 5 concludes the article.

Related work
This section reviews three advanced speech emotion recognition models, which are used as test models in the subsequent parts.

SER based on CNN-LSTM
Speech emotion recognition is a challenging task whose accuracy largely depends on the acoustic features of the input and on the network architecture used. Acoustic features are mainly computed from the contextual information in the input speech. The combination of Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks has a great advantage in learning the contextual information that is crucial for emotion recognition: CNNs overcome the scalability problems of traditional neural networks, while LSTMs provide long-term memory and mitigate vanishing and exploding gradients when training on long sequences.
In [10], Siddique Latif et al. proposed using parallel convolutional layers to harness multiple temporal resolutions in the feature extraction block; this block is jointly trained with an LSTM-based classification network for emotion recognition and achieves better performance.

SER based on GCN
In 2021, Amir Shirian et al. proposed a lightweight deep graph method for the task of speech emotion recognition [11]. Following the theory of graph signal processing, modeling speech signals as cycle graphs or line graphs is a more compact, efficient, and scalable form than traditional CNN networks. Compared with general graph structures, the authors also greatly simplify the graph convolution by removing learnable edge weights, so the number of learnable parameters in the SER task is significantly reduced, and the model outperforms LSTM, standard GCN, and other state-of-the-art graph models in SER.

SER based on CNN-MAA
In SER, emotional features are often represented by multiple energy patterns in the spectrogram. Conventional attention-based neural network classifiers for SER are usually optimized at a fixed attention granularity. Xu Mingke et al. [12] instead applied multiscale area attention in a deep convolutional neural network to attend to emotional features at different granularities, so the classifier can benefit from attention with different scales. Meanwhile, vocal tract length perturbation is used for data augmentation to improve the generalization ability of the classifier. Compared with other emotion recognition models, it obtains more advanced recognition results.

Adversarial attacks based on spectral envelope distortion
In speech privacy protection, many spectral envelope distortion methods have been used to hide the personal information contained in speech. Our research found that speech whose spectral envelope has been distorted has an excellent adversarial attack effect on speech emotion recognition systems trained on the original speech. Here we describe the signal processing-based methods that form our voice modification module; each method has an individual scalar hyperparameter α*. By adjusting α*, we explore adversarial attacks against the SER models and the robustness of those models. Figure 1 shows the flow of the black-box attack on an SER model. After the attack, to test the robustness of the model, we use adversarial training: the samples generated by the attack are added to the training set with the original labels as the correct labels, and the retrained model is then used to recognize both normal and adversarial samples.
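The adversarial training step described above reduces to augmenting the training set with attacked copies of each utterance under the original label. A minimal sketch of that augmentation, where the `transform` argument is a stand-in for any of the three distortions described below:

```python
def adversarial_training_set(train_wavs, train_labels, transform):
    """Augment a training set for adversarial training: append one
    transformed (adversarial) copy of every utterance, keeping the
    original emotion label for each transformed copy."""
    aug_wavs = list(train_wavs) + [transform(w) for w in train_wavs]
    aug_labels = list(train_labels) * 2     # originals, then copies
    return aug_wavs, aug_labels
```

The retrained model is then evaluated on both clean and transformed test utterances, as in Section 4.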

Vocal tract length normalization
Vocal Tract Length Normalization (VTLN) [23] was originally used in speech-to-text recognition to remove distortions caused by differences in vocal tract length by modifying the magnitude spectrum of the original speech through a warping function. The warping function maps each normalized source frequency ω0 ∈ [0, 1] to a warped frequency ω1 ∈ [0, 1], where α_vtln ∈ [−1, 1] is a hyperparameter that controls the degree of frequency warping. Figure 2 shows the warping results for different hyperparameter choices. When α_vtln < 0 and α_vtln > 0, the warped spectral curve becomes convex and concave, respectively, corresponding to contraction and expansion of the spectrum. When α_vtln = 0, ω1 = ω0, which means no warping. In this paper, we first obtain the log amplitude spectrum of the original speech using the short-time Fourier transform (STFT), then perform frequency warping with VTLN to obtain the warped log amplitude spectrum, and finally obtain the transformed speech by the inverse STFT of the modified amplitude spectrogram with the original phase spectrogram.
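The STFT-warp-inverse-STFT pipeline can be sketched as follows. Note that the specific warping function below (a bilinear all-pass warp, a common choice in VTLN work) is our assumption; the text only constrains α_vtln ∈ [−1, 1] and the behavior at the endpoints:

```python
import numpy as np
from scipy.signal import stft, istft

def vtln_warp(signal, sr, alpha=0.15, n_fft=512):
    """Warp the magnitude spectrogram of `signal` along frequency and
    resynthesize with the original phase. The bilinear all-pass warp
    is an assumed choice of warping function (not specified here)."""
    f, t, X = stft(signal, fs=sr, nperseg=n_fft)
    mag, phase = np.abs(X), np.angle(X)
    # Normalized source frequencies w0 in [0, 1].
    w0 = np.linspace(0.0, 1.0, mag.shape[0])
    # Bilinear warp: w1(0) = 0 and w1(1) = 1; alpha bends the curve.
    w1 = w0 + (2.0 / np.pi) * np.arctan2(alpha * np.sin(np.pi * w0),
                                         1.0 - alpha * np.cos(np.pi * w0))
    # Move each frame's magnitudes onto the warped frequency grid.
    warped = np.empty_like(mag)
    for i in range(mag.shape[1]):
        warped[:, i] = np.interp(w0, w1, mag[:, i])
    _, y = istft(warped * np.exp(1j * phase), fs=sr, nperseg=n_fft)
    return y
```

With alpha = 0 the warp is the identity and the STFT/inverse-STFT pair reconstructs the input; positive and negative alpha bend the frequency axis in opposite directions, matching the convex/concave behavior described above.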

McAdams transformation
McAdams transformation [24] transforms speech by modifying its formant frequencies. Performing Linear Predictive Coding (LPC) [25] on the original speech yields N poles. Each pole p_n ∈ ℂ is written as A_n exp(jθ_n) in polar coordinates, where the radius A_n ∈ [0, 1] obtained from the LPC is less than 1 and θ_n ∈ [0, π] is the phase. The pipeline of the McAdams transformation approach is shown in Fig. 3. The transformed phase θ′_n ∈ [0, π] is obtained by raising the original phase θ_n to the power of the McAdams coefficient α_mas ∈ ℝ+, i.e., θ′_n = θ_n^α_mas. We obtain the corrected pole p′_n by combining the original formant intensity with the transformed formant frequency, i.e., p′_n = A_n exp(jθ′_n). In general, a speech waveform can be synthesized by adding multiple cosine oscillations, s(t) = Σ_{k=1}^{K} r_k(t) cos(2π k f_0 t + ϕ_k), where k is the harmonic index, r_k(t) is the amplitude, ϕ_k is the phase, and t is time; this represents the synthesis of a periodic signal from harmonic cosine oscillations, each with its own amplitude and phase offset. The McAdams coefficient adjusts the frequency of each harmonic, namely θ_n, and thereby produces transformed speech by modifying the harmonics of the original speech. Figure 4 shows an example: the left picture shows how the pole positions are moved by the McAdams transformation, and the right picture shows the effect on the spectral envelope.
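The pole manipulation can be sketched as below, assuming a textbook autocorrelation-method LPC in place of whatever LPC routine is actually used; real poles are left untouched, and conjugate pole pairs stay conjugate because the phase warp preserves the sign of the angle:

```python
import numpy as np

def lpc_coeffs(frame, order):
    """LPC by the autocorrelation method, solving the normal equations
    directly (a stand-in for a library LPC routine)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))      # A(z) = 1 - sum_k a_k z^{-k}

def mcadams_poles(a, alpha=0.8):
    """Move each complex LPC pole A_n * exp(j*theta_n) to
    A_n * exp(j * theta_n**alpha); real poles are kept unchanged."""
    new = []
    for p in np.roots(a):
        if abs(p.imag) > 1e-10:             # complex pole: warp its phase
            r, theta = np.abs(p), np.angle(p)
            new.append(r * np.exp(1j * np.sign(theta) * abs(theta) ** alpha))
        else:                               # real pole: keep as-is
            new.append(p)
    return np.array(new)
```

The modified all-pole filter can then be rebuilt with `np.poly` on the new poles and used to resynthesize the frame from the LPC residual.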

Modulation spectrum smoothing
Modulation spectrum smoothing modifies speech by removing the temporal fluctuation of speech features [26]. A short-time Fourier transform of the original speech yields the complex spectrogram X ∈ ℂ^(F×T), where F and T are the numbers of frequency bins and time frames, respectively. The amplitude trajectory of each frequency bin is then smoothed along time with a low-pass filter whose normalized cutoff frequency is α_ms ∈ [0, 1]. After filtering, the inverse short-time Fourier transform combines the smoothed amplitude with the original phase spectrogram to generate the transformed speech. In Fig. 5, the left side shows the smoothing effect on a single frequency trajectory of the spectral envelope, and the right side shows the complete smoothing effect.
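A sketch of this step, assuming a Butterworth low-pass filter applied along the time axis of the magnitude spectrogram (the filter type is our assumption; the text only specifies a low-pass filter with cutoff α_ms):

```python
import numpy as np
from scipy.signal import stft, istft, butter, filtfilt

def modulation_smooth(signal, sr, alpha_ms=0.25, n_fft=512):
    """Low-pass filter each frequency bin's magnitude trajectory over
    time (cutoff alpha_ms, normalized to the frame rate's Nyquist),
    then resynthesize with the original phase spectrogram."""
    f, t, X = stft(signal, fs=sr, nperseg=n_fft)
    mag, phase = np.abs(X), np.angle(X)
    b, a = butter(4, alpha_ms)                 # assumed 4th-order filter
    smoothed = filtfilt(b, a, mag, axis=1)     # zero-phase, along time
    smoothed = np.maximum(smoothed, 0.0)       # magnitudes stay non-negative
    _, y = istft(smoothed * np.exp(1j * phase), fs=sr, nperseg=n_fft)
    return y
```

`filtfilt` is used so that the smoothing does not shift the magnitude trajectories in time, which would desynchronize them from the retained phase.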

Dataset
The most widely used dataset for the above SER models is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [27]. Therefore, to explore the effect of black-box attacks on these models, we also use this dataset in our study. It contains 12 h of emotional speech performed by 10 actors from the Drama Department of the University of Southern California. The recordings are divided into improvised and scripted parts, according to whether the actors performed from a fixed script. The utterances are labeled with nine emotion types: anger, happiness, excitement, sadness, frustration, fear, surprise, other, and neutral state. A single utterance may have multiple labels owing to different annotators; we consider only the label with majority agreement. In previous studies [28][29][30], because the data in the dataset are imbalanced (there is little happy data), researchers usually choose the more common emotions such as the neutral state, sadness, and anger, and, since excitement and happiness are rather similar, either replace excitement with happiness or merge excitement and happiness to increase the amount of data. In this paper, we likewise use the four emotions neutral, excitement, sadness, and anger from the IEMOCAP dataset.

Evaluation metrics
The recognition performance of the above SER models is evaluated with weighted accuracy (WA) and unweighted accuracy (UA). WA is the total number of correct predictions divided by the total number of samples, which weighs each class according to its number of samples, whereas UA is the average of the per-class accuracies, which gives equal weight to each class. In the binary case,

WA = (TP + TN) / (P + N),    UA = (1/2)(TP/P + TN/N),

where P is the number of positive instances, N is the number of negative instances, and True Positives (TP) and True Negatives (TN) are the numbers of positive and negative samples predicted correctly, respectively. Following [31], since WA and UA may not reach their maximum in the same model, their average ACC is used as the final evaluation measure (the smaller the ACC, the better the attack on the model). At the same time, to measure the actual auditory effect of the transformed speech, we use automatic speech recognition (ASR) to compare the speech before and after processing and compute the word error rate

WER = (N_sub + N_del + N_ins) / N_ref,

where N_sub, N_del, and N_ins are the numbers of substitution, deletion, and insertion errors, respectively, and N_ref is the number of words in the reference [22]. We compute the WER of the speech before and after the attack as the measure of speech quality.
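The three metrics can be computed directly. The sketch below assumes integer (or hashable) class labels and whitespace-tokenized transcripts, with the WER edit counts obtained from a standard Levenshtein alignment:

```python
import numpy as np

def wa_ua(y_true, y_pred):
    """WA: overall accuracy (classes weighted by their size).
       UA: mean of per-class accuracies (equal weight per class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = np.mean(y_true == y_pred)
    classes = np.unique(y_true)
    ua = np.mean([np.mean(y_pred[y_true == c] == c) for c in classes])
    return wa, ua

def wer(ref, hyp):
    """WER = (N_sub + N_del + N_ins) / N_ref via Levenshtein distance
    over whitespace-separated words."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)          # deletions only
    d[0, :] = np.arange(len(h) + 1)          # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(r), len(h)] / len(r)
```

For example, `wa_ua([0, 0, 0, 1], [0, 0, 1, 1])` gives WA = 0.75 but UA ≈ 0.833, since the minority class is fully correct and counts equally in UA.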

Evaluation setup
In the experiments, we randomly split the dataset into a training set (80%) and a test set (20%) for cross-validation. The three SER models are first trained on the training set and evaluated on the test set; the test set is then processed with the three black-box attack methods, and the attack effect is measured. Finally, adversarial training is added to explore the robustness of the models. Table 1 shows the results of the three emotion recognition models on the clean dataset (IEMOCAP). VTLN is used first to attack the three models; Table 2 describes their performance under this attack. As the hyperparameter α_vtln is adjusted, the success rate of the attack increases. However, an excessive and obvious transformation changes the original voice content too much, which destroys the value of the speech; the loss of speech quality, i.e., the growth of WER, therefore also needs to be taken into account when choosing the best attack setting. In our experiments, the best trade-off is obtained at α_vtln = 0.15, where the WER increases from 11.23 to 21.40% as shown in Table 5, i.e., the speech quality decreases by 10.17%. The recognition accuracy of the three models drops to about 10% (Fig. 6), indicating a strong attack against the emotion recognition systems. Table 3 shows the recognition results of the three models under the McAdams transformation attack. Because of the particular form of the McAdams coefficient, there are two roughly symmetric transformation modes, forward and reverse, so the recognition results in the table also show a symmetry. According to the experimental results in Fig. 7, the best attack performance is obtained at α_mas = 1.20 (0.80 in reverse), reducing the recognition accuracy of the three models to 8-10%.
Meanwhile, the WER increases by only 6.24% (Table 5). Table 4 shows the results of the three models under the modulation spectrum smoothing attack. As shown in Fig. 8, the best attack effect is obtained at α_ms = 0.25, where the accuracy of emotion recognition is reduced to 12-14% and the WER increases by 8.83% (Table 5). Under all three attack methods, the recognition accuracy of the models drops significantly: already at the initial hyperparameters α* (0.05, 0.95, and 0.05, respectively), the model accuracy drops to 20-25%, showing that the three black-box adversarial attacks are effective and that the robustness of the models is not excellent.

Evaluation results
We then add the three kinds of adversarial samples to the training, as shown in Table 6. As shown in Fig. 9, VTLN-train, Mas-train, and MSS-train each add one kind of adversarial sample with the correct label to the training, after which the accuracy of the model is tested. The best-performing model is CNN-MAA, with a recognition accuracy of 64.60%. According to our analysis, the GCN and CNN-MAA models remain strongly robust after adversarial training because incorporating the graph structure and the area attention mechanism, respectively, gives them a better learning effect on the dispersion of the samples.

Conclusion
By transferring voice privacy protection methods to the field of SER, we carry out black-box attacks under the condition of an unknown emotion recognition system and find that warping transformations have a strong adversarial effect on emotion recognition. After a simple warping transformation, the voice is well protected against the trained SER models while the usability of the voice content is preserved. Different speech transformations yield different attack effects; experiments show that the McAdams attack performs best, with WA = 8.32%. The attacks transfer well across different emotion recognition models and have a low time cost. This kind of black-box attack is an untargeted attack: there is no actual direction or prediction for the result of the attack. Meanwhile, after the adversarial samples are added to the training, although the accuracy of the model decreases to a certain extent, the recognition results still retain a certain accuracy, and the model acquires a certain robustness to such adversarial samples.
In future work, we will study how to mount a clear and targeted attack through these speech warping transformations.