Cross-corpus speech emotion recognition using subspace learning and domain adaption

Speech emotion recognition (SER) is a hot topic in speech signal processing. When the training data and the test data come from different corpora, their feature distributions differ, which degrades recognition performance. To solve this problem, a cross-corpus speech emotion recognition method based on subspace learning and domain adaptation is proposed in this paper. Specifically, the training set data and the test set data are used to form the source domain and the target domain, respectively. Then, the Hessian matrix is introduced to obtain the subspace for the extracted features in both the source and target domains. In addition, an information entropy-based domain adaption method is introduced to construct the common space. In the common space, the difference between the feature distributions in the source domain and the target domain is reduced as much as possible. To evaluate the performance of the proposed method, extensive experiments are conducted on cross-corpus speech emotion recognition. Experimental results show that the proposed method achieves better performance than some existing subspace learning and domain adaptation methods.


Introduction
There are many ways for people to express emotions, such as through speech, actions, and facial expressions. Among these, speech is an important way to express emotions, because it carries rich emotions, such as happiness, anger, and sadness. Speakers can deliver their intentions through different tones, volumes, or content. How to judge a speaker's emotion through speech therefore becomes crucial. Speech emotion recognition (SER) is an important branch of multimodal affective computing, and it is also an important part of speech recognition. With its development, SER has been applied in fields such as psychotherapy and human-computer interaction. According to the results of SER, a machine can generate appropriate responses for the user in an interactive environment. Therefore, SER is one of the most important technologies for human-computer interaction [1][2][3][4].
The semantic-based methods are an important class of SER methods, because emotions can be expressed effectively by semantics. If speakers use emotive words to communicate with others, then the emotion can be judged directly from the semantics of the words. Therefore, semantic-based research gradually began to develop. A multi-classifier emotion recognition model based on prosodic information and semantic labels is introduced in [5]. Similarly, semantic labels and non-verbal sounds in speech, such as crying, laughter, or sighing, are used in SER [6]. Subsequently, temporal and semantic coherence is introduced for SER [7]. In addition, a bimodal SER model that fuses acoustic and linguistic information is proposed in [8].
Although understanding semantics is simple for humans, it is a complex process for machines. Therefore, more research is currently aimed at speech features that are easily processed by machines, which is also important for SER. Compared with semantic information, speech features are more abstract, but they are very important for expressing the speaker's emotions. The main features used in SER are divided into acoustic features and spectral features. The acoustic features include intensity, pitch, and timbre. Features like energy, Mel-frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC), and fundamental frequency are called spectral features. Features such as pitch, MFCC, formant, intensity, and chroma are adopted for SER [9,10]. Also, pitch, spectrum, and formant are combined with semantic information for recognizing emotions in [5]. To improve the robustness of SER, some methods have been used to process the features. Specifically, PCA is adopted to reduce the dimensionality of the features [11], and a statistical method is utilized to find robust spectral features [12].
In practical scenarios, the speaker's emotion is very complex. The speaker may have multiple emotions at the same time, rather than a single emotion, or the emotion expressed by the speaker may be inconsistent with the actual emotion. This makes SER difficult, and some research has addressed such complex emotions. A circular continuous dimensional model for describing an emotion, called the valence-arousal (VA) model, was proposed in [13,14]. The model no longer regards emotions as discrete but uses two-dimensional coordinates to describe the continuous distribution of emotions. The PAD emotional model was presented in [15,16]; it uses P (pleasure), A (arousal), and D (dominance) values to represent all emotional states. In addition, based on the emotional probability distribution, an ambiguous label is proposed to solve the inconsistency problem in ambiguous emotional cognition [17].
Another problem in SER is how to recognize emotions. To this end, some machine learning methods were adopted to recognize emotions, such as the support vector machine (SVM) [18], hidden Markov model (HMM) [19], and Gaussian mixture model (GMM) [20]. In recent years, with the rapid development of deep learning, various neural network structures have been introduced in SER. Convolutional neural networks (CNN) [21], recurrent neural networks (RNN) [22], back propagation neural networks (BPNN) [23], deep neural networks (DNN) [24], sequential capsule networks [25], and adversarial data augmentation networks [26] have all been used for SER. A segment-based iterative self-learning enhanced speech emotion recognition model is proposed in [27]. The above algorithms perform well in traditional SER, and the recognition accuracy of some algorithms can even exceed 80% in some corpus settings. In real scenes, however, the speech signals do not belong to a specific corpus and are recorded in different environments. The speech data is also affected by language, gender, speaking style, and other factors. So, when the training set and the test set come from different corpora, the training and testing data often follow different feature distributions, and the recognition performance is reduced.
Therefore, transfer learning is adopted to solve the cross-corpus problem [28]. The known corpus data is considered as the source domain, and the unknown data to be learned constitutes the target domain. Transfer learning transfers the knowledge of the source domain to the target domain to reduce the difference in data distribution between the two domains. In SER, the features of the source and target domains are distributed in different spaces. So, the transfer from the source domain to the target domain is a feature-based transfer; that is, a mapping relationship between the two domains is established to reduce the differences in feature distributions. With the development of transfer learning, more transfer learning algorithms have been applied to SER. Among them, in order to solve the cross-corpus SER problem, many studies focus on transfer subspace learning and domain adaptation, such as unsupervised transfer subspace learning [28], transfer subspace learning based on feature selection [29], transfer subspace learning based on non-negative matrix factorization [30], transfer linear subspace learning [31], and Universum autoencoder-based domain adaptation [32]. In addition, a cross-corpus speech emotion recognition method based on domain adaptive least squares regression is proposed in [33], and in [34,35], ADDoG-based and DANN-based methods are proposed according to the idea of domain adversarial learning. Most of the above methods involve transfer subspace learning and domain adaptation, which are important issues in transfer learning and the focus of this paper. The two parts are considered jointly in this paper. Therefore, inspired by the framework in [36], a cross-corpus speech emotion recognition method is proposed.
The contributions of the proposed method are summarized as follows:
• The proposed method combines subspace learning and mapping to realize speech emotion recognition across corpora. The feasibility of the proposed method is supported by experimental results.
• A subspace learning model is constructed based on the Hessian matrix, so that the extracted features in both the source domain and the target domain have good robustness in their independent subspaces, which improves the subsequent cross-corpus transfer ability.
• Information entropy is used to establish a domain adaption model in the proposed method. Numerical descent is used to minimize the information entropy, so that a common space of the source and target domains is learned and the difference in feature distribution between the two domains is reduced.
The rest of the paper is organized as follows. In Section 2, the specific process of the proposed method is introduced, along with some optimizations. In Section 3, the emotion recognition performance of the proposed method is analyzed on three public datasets, and the effects of different parameters on the performance are analyzed through experiments. Finally, the conclusion is drawn in Section 4.

The proposed method
A cross-corpus speech emotion recognition method is proposed by combining subspace learning and domain adaption. The block diagram of the proposed method is shown in Fig. 1.
Firstly, features of speech in the source corpus and the target corpus are extracted to form the source domain and the target domain. Then, Hessian-based subspace learning is performed on the features in the source domain and the target domain to obtain low-dimensional features that form their own independent subspaces. The flowchart of the Hessian-based subspace learning part is shown in Fig. 2. Furthermore, the mapping relationship between the source domain subspace and the target domain subspace is established by using information entropy, which reduces the difference in feature distribution between the domains. This mapping relationship is expressed by the common space. Therefore, it is important to find the common space corresponding to the two domains in this method. The flowchart of the domain adaption part is shown in Fig. 3. Finally, emotions are predicted.
In the part of Hessian-based subspace learning, the neighboring frames of the current frame are found based on neighborhood calculation. Then, the Hessian matrix [37] is constructed for low-dimensional embedding to obtain the subspace of the source and target domain, respectively.
After obtaining the subspaces of the source and target domains, the transformation matrix is obtained through the correlation coefficients of the subspaces. Then, the distance between the feature data of each frame in the source domain subspace and that of each frame in the target domain subspace is calculated. The probability that a frame in the target domain subspace is a neighbor of each frame in the source domain is obtained according to the distance. In this way, the posterior probability that the features of each frame in the target domain subspace belong to a certain class can be estimated according to the known class labels of the features of each frame in the source domain subspace. Then, the entropy between the target domain features and emotion labels and the entropy between the features and domain labels of the two domains are calculated. Finally, the two information entropies are jointly optimized by numerical descent. The mapping relationship between the source domain subspace and the target domain subspace is thereby acquired, and it is described by a common space.
Then, Hessian-based subspace learning [38] and the domain adaption based on information entropy are introduced in detail. Finally, a specific optimization method for finding the common space is given.

Hessian-based subspace learning
An input feature matrix $X = (x_{mn})_{M \times N}$ is given, which is composed of the features of the speech. $m$ and $n$ are the feature index and the frame index, respectively. $M$ and $N$ are the feature dimension and the number of frames, respectively. First, the feature energy of each frame is computed as follows:

$$x^e_n = \sum_{m=1}^{M} x_{mn}^2, \tag{1}$$

where $x^e_n$ represents the feature energy of the $n$th frame, and $x_{mn}$ represents the feature of the $m$th dimension in the $n$th frame.
Thus, an energy matrix can be formed as $X^e = [x^e_1, x^e_2, \ldots, x^e_N]$. Then, two new feature energy matrices $A$ and $B$, which are used for calculating the distance of the feature between different frames, are defined as follows:

$$A = (a_{ij})_{N \times N}, \quad B = (b_{ij})_{N \times N}, \tag{2}$$

where $a_{ij} = x^e_j$, $b_{ij} = x^e_i$, $1 \le i, j \le N$, and $i$ and $j$ represent the index of the row and column, respectively. In order to find the nearest $K$ frames of each frame, the distance $D^e = (d_{ij})_{N \times N}$ of the feature between different frames is calculated as follows:

$$D^e = A + B - 2X^T X, \tag{3}$$

where $d_{ij}$ represents the distance between the feature energy of the $i$th frame and the $j$th frame. The smaller the distance $d_{ij}$ is, the closer the feature energies of the $i$th frame and the $j$th frame are. In fact, the definition of the distance $D^e$ is derived from the Euclidean distance. $A$ and $B$ are formed by the squares of the elements in the input matrix $X$. According to Eqs. (1), (2), and (3), the distance defined in this paper meets the requirements of non-negativity, directness, and identity. $A$ and $B$ are constructed in a way that also satisfies the symmetry of the distance.
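The distance computation and the neighbor search described above can be sketched in a few lines. This is only an illustration: it assumes that the distance takes the standard squared-Euclidean form built from the per-frame energies and the cross term $X^T X$, and the function names `pairwise_energy_distance` and `nearest_neighbors` are hypothetical.

```python
import numpy as np


def pairwise_energy_distance(X):
    """Squared Euclidean distance between the frames (columns) of X, shape (M, N).

    Assumed form: D = A + B - 2 X^T X with a_ij = x^e_j and b_ij = x^e_i.
    """
    energy = np.sum(X ** 2, axis=0)            # x^e_n for every frame, shape (N,)
    A = np.tile(energy, (X.shape[1], 1))       # every row is the energy vector
    B = A.T                                    # every column is the energy vector
    D = A + B - 2.0 * (X.T @ X)
    D = np.maximum(D, 0.0)                     # remove tiny negatives from round-off
    np.fill_diagonal(D, 0.0)                   # a frame is at distance 0 from itself
    return D


def nearest_neighbors(D, K):
    """Return the K x N matrix of neighbor indices (column j lists neighbors of frame j).

    The first sorted entry of each column is the frame itself, so it is skipped.
    Duplicate frames (ties at distance 0) would need extra handling.
    """
    order = np.argsort(D, axis=0)
    return order[1:K + 1, :]
```

Building the full $N \times N$ matrix keeps the sketch close to the matrix notation of the text; for long recordings a tree-based neighbor search would use less memory.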
The $j$th column of the matrix $D^e$ (i.e., $[d_{1j}, d_{2j}, \ldots, d_{Nj}]^T$) denotes the distance vector of feature energy between the $j$th frame and each frame. Sorting this column in ascending order gives the index sequence $S_j = [S_j(1), S_j(2), \ldots, S_j(N)]$ of the frames sorted by their distance from the $j$th frame, where $S_j(1)$ represents the index with the minimum distance in $d_{ij}$ and $S_j(N)$ is the index with the maximum distance. It is worth mentioning that, for each frame, $d_{jj}$ is the minimum distance, because it is the distance from the frame to itself. $K$ denotes the number of neighbor frames. Thereby, the $K \times N$ adjacent index matrix $I = [i_1, i_2, \ldots, i_N]$ of the $N$ frames is obtained. Then, the elements in the input matrix $X$ that correspond to the indices in $I$ are selected to form a neighborhood matrix $Z_n$ of size $M \times K$, where $k$, $m$, and $n$ are the neighbor index, the feature index, and the frame index, respectively, and $Z_n$ represents the neighborhood matrix corresponding to the $n$th frame. $E_n$ is the centralized matrix of $Z_n$.
The purpose of the proposed Hessian-based subspace learning is to obtain the local coordinates of the neighborhood, which are obtained through tangent coordinates. The tangent space consists of tangent coordinates and is regarded as a subspace of the Euclidean space. An orthonormal coordinate system, associated with the inner product inherited from the Euclidean space, can be obtained by using singular value decomposition. Therefore, $Z_n - E_n$ is subjected to singular value decomposition, and the orthonormal basis $V_n = (v^n_{ij})_{K \times K}$ is obtained as follows:

$$Z_n - E_n = U_n \Sigma_n V_n^T,$$

where $(\cdot)^T$ denotes transposition, $U_n$ contains the left singular vectors of $Z_n - E_n$, and $\Sigma_n$ is a diagonal matrix of singular values.
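The neighborhood step can be illustrated with a short sketch. It assumes that centering means subtracting the mean of the $K$ neighbor columns, which is the usual choice in Hessian eigenmaps but is not stated explicitly here; the function name `tangent_coordinates` is hypothetical.

```python
import numpy as np


def tangent_coordinates(X, neighbor_idx, d):
    """Tangent coordinates of one neighborhood.

    X            : (M, N) feature matrix, one frame per column.
    neighbor_idx : 1-D integer array with the K neighbor indices of a frame.
    d            : number of tangent directions to keep (d <= min(M, K)).
    Returns a (K, d) matrix whose columns are the leading right singular vectors.
    """
    Z = X[:, neighbor_idx]                       # M x K neighborhood matrix
    E = Z.mean(axis=1, keepdims=True)            # assumed centering: neighborhood mean
    _, _, Vt = np.linalg.svd(Z - E, full_matrices=False)
    V = Vt.T                                     # columns are right singular vectors
    return V[:, :d]
```

Because the centered matrix maps the all-ones vector to zero, the returned columns are orthonormal and each sums to approximately zero whenever the corresponding singular values are nonzero.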
The first $d$ columns of $V_n$ are extracted to constitute the tangent coordinates $V^d_n$. Next, an association Hessian matrix $Q_n = (q^n_{kj})$ is given by using $V^d_n$, where $q^n_{kj} = v^n_{kj_1} v^n_{kj_2}$, $n$ is the frame index, $1 \le n \le N$, and $j_1$ and $j_2$ are the dimension indexes. Each column index $j$ of $Q_n$ corresponds to one pair $(j_1, j_2)$ of dimension indexes. Then, the estimated matrix $L_n$ is constructed, and the matrix $G_n$ can be obtained by Schmidt orthogonalization of the estimated matrix $L_n$ [39]. The last columns of $G_n$ are taken to obtain the matrix $G^b_n$. Then, the Hessian quadratic matrix $H$ can be constructed by using the matrices $G^b_n$, $1 \le n \le N$, where $S_n(j)$ denotes the index of the $j$th frame in the list sorted by the distance from the $n$th frame.
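As a rough illustration of this step, the sketch below follows the standard Hessian eigenmaps construction, which is an assumption rather than the exact formulation of this paper: the estimated matrix is taken to be a column of ones, the tangent coordinates, and their pairwise products; the orthogonalization is done with a QR decomposition (equivalent to Gram-Schmidt up to column signs); and the last $d(d+1)/2$ columns are kept. The names `local_hessian_block` and `accumulate_hessian` are hypothetical.

```python
import numpy as np


def local_hessian_block(V_d):
    """Orthonormal Hessian estimator columns for one neighborhood.

    V_d : (K, d) tangent coordinates, with K >= 1 + d + d * (d + 1) // 2.
    Returns a (K, d * (d + 1) // 2) matrix.
    """
    K, d = V_d.shape
    # Pairwise products v_{k, j1} * v_{k, j2} for every pair j1 <= j2.
    Q = np.column_stack(
        [V_d[:, a] * V_d[:, b] for a in range(d) for b in range(a, d)]
    )
    L = np.column_stack([np.ones(K), V_d, Q])    # assumed layout of the estimated matrix
    G, _ = np.linalg.qr(L)                       # orthonormalize the columns
    return G[:, 1 + d:]                          # keep the quadratic part only


def accumulate_hessian(H, idx, G_b):
    """Add one neighborhood's contribution to the global N x N matrix H in place."""
    H[np.ix_(idx, idx)] += G_b @ G_b.T
    return H
```

With this layout the kept columns are orthogonal to both the constant vector and the tangent coordinates, so constant and linear functions contribute nothing to the quadratic form.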
Next, the $d$-dimensional subspace corresponding to the $d$ smallest eigenvalues can be obtained by using $H$; it is a null space and is denoted as $U = (u_{ij})_{N \times d}$. If a manifold is locally isometric to an open subset of Euclidean space, then the mapping function from this manifold to the open subset is a linear function. The second-order mixed derivatives of a linear function are 0, so the local quadratic form formed by the Hessian coefficients is also 0. Hence, the global Hessian matrix has a $(d+1)$-dimensional null space. The first dimension of this null space is spanned by a constant function, and the other $d$ dimensions form the isometric coordinates. Then, the embedding matrix $R = (r_{ij})_{d \times d}$ is calculated, where $J$ represents the set of indexes of the neighborhood frames. Finally, the subspace $Y$ is obtained according to the low-dimensional embedding, where $\mu$ is a regularization parameter.
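The null-space extraction can be sketched as an eigendecomposition of the symmetric matrix $H$. The sketch assumes that the constant direction is the eigenvector with the smallest eigenvalue, which is how common implementations proceed; when several eigenvalues are numerically equal to zero, this ordering is not guaranteed. The function name `null_space_embedding` is hypothetical.

```python
import numpy as np


def null_space_embedding(H, d):
    """Return U (N x d): eigenvectors of H for the 2nd to (d+1)th smallest eigenvalues."""
    H_sym = 0.5 * (H + H.T)                     # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(H_sym)    # eigenvalues in ascending order
    return eigvecs[:, 1:d + 1]                  # skip the (assumed) constant direction
```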
There may be a small number of outliers in the subspace $Y$ after the low-dimensional embedding. In order to solve this problem, the outliers in the subspace $Y$ are corrected in this paper. These outliers are few in number, and their values deviate from the distribution of most of the data. So, detection thresholds are set to recognize the outliers. Then, the outliers are replaced with $2\,\mathrm{Tr}(U^T E U)$ [40], where $\mathrm{Tr}(\cdot)$ means the trace of the matrix in parentheses. $E = (e_{ij})_{N \times N}$ is a diagonal matrix whose elements $e_{ij}$ are defined as in [41]. Following the above steps, the source domain subspace $Y_s$ and the target domain subspace $Y_t$ can be obtained.

Information entropy-based domain adaption
A domain adaption method is proposed to build the relationship between the source domain subspace and the target domain subspace. In detail, a common space with similar feature distributions in the source and target domains is constructed. Both the information entropy between the data and emotion labels and the entropy between the data and domain labels are used to optimize the mapping [42]. Thereby, the difference in feature distribution between different corpora can be reduced. The neighbor frames are detected according to the distance between the features of each frame. Therefore, a conditional probability model $p_{ij}$ is defined in Eq. (16), where $1 \le i \le N$, $1 \le j \le N$, and $p_{ij}$ is the conditional probability density that the $j$th frame in the target domain is adjacent to the $i$th frame in the source domain. It describes the probability of being nearest neighbors for each frame feature in the source domain and each frame feature in the target domain. The emotion label corresponding to the $i$th frame in the source domain is $\mathrm{Label}_i$, with $\mathrm{Label}_i \in \mathrm{Label} = \{1, 2, \ldots, L\}$, i.e., there are a total of $L$ types of emotion. According to Eq. (16), an emotion label probability estimate $p_{lj}$ of the $j$th frame in the target domain is given in Eq. (17), where $1 \le l \le L$, $1 \le j \le N$, $1 \le i \le N$, and $p_{lj}$ expresses the probability that the $j$th frame in the target domain is discriminated as the $l$th type of emotion when the emotions of the source domain are known.
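A minimal sketch of the two probability estimates follows. The exact forms of Eqs. (16) and (17) are not reproduced in this text, so the sketch assumes a softmax over negative distances for $p_{ij}$ and a sum over the source frames that carry label $l$ for $p_{lj}$; both choices are assumptions for illustration. Labels are 0-based in the code, and the function names are hypothetical.

```python
import numpy as np


def neighbor_probabilities(D):
    """D[i, j]: distance between source frame i and target frame j.

    Returns P with P[i, j] = probability that target frame j neighbors source frame i
    (assumed softmax form; every column sums to 1).
    """
    logits = -D
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=0, keepdims=True)


def label_probabilities(P, labels, n_classes):
    """Sum neighbor probabilities over the source frames of each emotion class.

    labels : integer array of length N with values in 0 .. n_classes - 1.
    Returns an (n_classes, N_target) matrix whose columns sum to 1.
    """
    out = np.zeros((n_classes, P.shape[1]))
    for l in range(n_classes):
        out[l] = P[labels == l].sum(axis=0)
    return out
```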
Since $p_{lj}$ is a preliminary probability estimate of the emotion label of each frame feature in the target domain, the relationship between target domain features and emotion labels cannot be directly revealed by $p_{lj}$ [43][44][45]. Therefore, the entropy $I(X_t; \mathrm{Label})$ between the target domain features and emotion labels is calculated by using $p_{lj}$ in this paper, as defined in Eq. (18). Equation (18) is composed of two parts. In the first part, the entropy of the average probability that the features of all frames in the target domain belong to each emotion label is calculated. The average of the entropy of the features in the target domain belonging to each emotion label is computed in the second part. In order to reduce the influence of incorrect labels on the feature discrimination results of each frame in the target domain, Eq. (18) needs to be optimized later. It should be noted that if only the second part is minimized, a degenerate solution will be obtained. That is, all frames in the target domain may be classified into the same type of emotion. So, the first part in Eq. (18) is necessary.
Then, the entropy $I_{st}(X)$ between the features and domain labels of the two domains is introduced to maximize the similarity between the two domains, as defined in Eq. (19). To calculate the entropy $I_{st}(X)$, firstly, the distance $d'_{ij}$ between the $i$th frame feature in the source domain and the $j$th frame feature in the target domain is calculated according to Eq. (3), where $X = (x_{ij})_{d \times (N + M)}$ denotes the features for all frames in the source and target domains, and $N$ and $M$ denote the number of frames in the source domain and the target domain, respectively. In this paper, the number of frames in the source domain is the same as that in the target domain, i.e., $N = M$. Then, the probability $p'_{ij}$ of the $i$th frame feature and the $j$th frame feature being adjacent to each other in the source domain and the target domain is calculated according to Eq. (16) using $d'_{ij}$. Next, the probability $p_{tj}$ that the $j$th frame in the source domain and the target domain is judged as belonging to the target domain or the source domain is calculated according to Eq. (17).
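The two-part structure described for Eq. (18) matches the usual mutual-information estimate: the entropy of the averaged label distribution minus the average of the per-frame entropies. The sketch below assumes exactly that form (with natural logarithms) and uses the hypothetical name `label_information`; it also shows why the first part is needed.

```python
import numpy as np


def entropy(p, axis=0):
    """Shannon entropy (natural log) along the given axis."""
    p = np.clip(p, 1e-12, 1.0)                  # avoid log(0)
    return -np.sum(p * np.log(p), axis=axis)


def label_information(P_label):
    """P_label: (L, N) matrix; column j holds the label probabilities of target frame j.

    Returns H(mean_j p_j) - mean_j H(p_j), the assumed two-part quantity.
    """
    first_part = entropy(P_label.mean(axis=1))        # entropy of the average distribution
    second_part = entropy(P_label, axis=0).mean()     # average of the per-frame entropies
    return first_part - second_part
```

If every frame is confidently assigned to the same emotion, the second part is zero but so is the first, and the quantity is 0. If frames are confidently spread evenly over $L$ emotions, the quantity reaches $\log L$. This is the degenerate case that the first part guards against.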

Optimization
In this subsection, an iterative optimization algorithm based on numerical descent [46] is introduced using Eqs. (18) and (19). The objective function combines the two entropies, where $\lambda$ is the regularization parameter. In the optimization process, the transfer coefficient matrix $g$ is given for numerical descent; it combines $g(X_t; \mathrm{Label})$ and $g_{st}(X)$ with the same regularization parameter $\lambda$. The calculation process of $g(X_t; \mathrm{Label})$ is as follows. First, an information matrix $I^C = (i^c_{lj})_{L \times N}$ is defined using $p_{lj}$, where $i^c_{lj}$ represents the difference between the probability that the feature of the $j$th frame in the target domain belongs to the emotion of the $l$th category and the average probability that the features of all frames in the target domain belong to the emotion of that category. Then, a matrix with elements $o_{ij} = i^c_{lj}$ for $\mathrm{Label}_i = l$ is formed. Finally, $g(X_t; \mathrm{Label})$ is obtained, where $\Omega$ is a diagonal matrix whose main diagonal elements are $\sum_{j=1}^{N} \gamma_{ij}$, and $W$ is the transfer matrix.
Since the calculation processes of $g(X_t; \mathrm{Label})$ and $g_{st}(X)$ are the same, only the calculation process of $g(X_t; \mathrm{Label})$ is introduced in detail in this paper. The variables for the calculation process of $g_{st}(X)$ follow the calculation process of $I_{st}(X)$.
Finally, the common space $L$ is obtained. So, the feature data in the source domain after mapping is $F_s = Y_s^T L$, and the feature data from the target domain is $F_t = Y_t^T L$.

Experiments and results analysis
To evaluate the effectiveness of the proposed cross-corpus speech emotion recognition method, a number of experiments are conducted with some baseline methods on commonly used standard datasets, namely Berlin [47], NNIME [48], IEMOCAP [49], MSP-Improv [50], and MSP-PODCAST [51]. The specific statistics of each dataset are shown in Table 1.

Data preparation
Berlin dataset is a German emotional speech corpus recorded by the Technical University of Berlin. In this dataset, ten actors performed 7 emotions, including neutral, angry, fearful, happy, sad, disgusted, and bored. The sampling rate is 16 kHz. The dataset contains 233 male emotional sentences and 302 female emotional sentences saved in WAV format. The NTHU-NTUA Chinese Interactive Multimodal Emotional Corpus (i.e., NNIME) is a multimodal dataset. In this dataset, audio, video, ECG, etc. were recorded for 44 actors during oral interactions. There are 6 emotions including anger, happy, sad, neutral, frustration, and surprise in this dataset. The audio sampling rate is 16 kHz. The dataset also contains annotation results from 49 annotators in different perspectives.
IEMOCAP, known as the Interactive Emotional Dyadic Motion Capture database, is recorded by the Signal Analysis and Interpretation Laboratory at the University of Southern California. Ten emotions are shown by recording the expressions, movements, and audio of 10 actors in this dataset. Twelve hours of data are contained in this dataset. The audio sampling rate is 16 kHz. Considering the relevance and ambiguity of different types of emotions, the audio data of 4 typical emotions (angry, neutral, happy, and sad) were selected from the above three datasets in this paper.
MSP-Improv is an improvised multimodal emotional corpus. There are 6 sessions, and each session is a dyadic interaction between two speakers. Each session contains twenty target sentences. In this corpus, 12 actors (six male and six female) performed 4 emotions, including neutral, angry, happy, and sad. Two actors improvise emotion-specific situations, leading them to utter contextualized, non-read renditions of sentences that have fixed lexical content and convey different emotions. The sampling rate is 44.1 kHz. MSP-Improv is more natural than other corpora. Hereinafter, MSP-Improv is referred to as MSP.
MSP-PODCAST is a large and natural emotional corpus. It relies on existing spontaneous recordings obtained from audio-sharing websites. The criterion for selecting the podcasts is to include only episodes that can be shared with the broader community. In this corpus, the types of emotions and themes are diverse, and the audio quality is very good, because segments recorded with poor quality are removed. Segments with SNR values less than 20 dB are discarded. Phone-quality speech is also removed; this step removes segments that do not have significant energy above 4 kHz. Podcasts in the corpus contain 9 emotions, including angry, sad, happy, neutral, fear, surprise, disgust, others, and contempt. However, only angry, happy, neutral, and sad are selected in this paper. There are also many other real-world corpora, such as LSSED [52].

Experimental settings
In this experiment, 5 artificial audio features are used, including static MFCC with their first- and second-order dynamic differences, LPC, log amplitude-frequency characteristics, Philips Fingerprints [53], and spectral entropy. The selected audio features are listed in Table 2.
In the following, the amplitude characteristic of the frequency coefficient is described by log amplitude-frequency characteristics (LAFC).
Considering that different features contribute differently to speech emotion recognition, each feature in the source domain and the target domain is weighted before training. The weights are set by the dimensions of the features in this paper. For MFCC, LPC, LAFC, Philips Fingerprint, and Spectral Entropy, the corresponding weights are $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, and $\beta_5$, respectively.
After subspace learning and domain adaption, the weighted features in the source domain are trained. That is, the features are used to build a training set. Similarly, the weighted features in the target domain are used to build a test set.
In the training process, a constant recognition accuracy threshold $\alpha$ is set in advance. Next, the test set is divided into two parts with equal amounts of data, i.e., test set 1 and test set 2. Test set 1 is used to assist training, and test set 2 is used to optimize the performance of the proposed method. If the recognition accuracy of a certain type of emotion is less than $\alpha$ in the first training, the features corresponding to that emotion need to be re-trained in the next training. These operations are repeated until one of the following conditions is met: (1) the recognition accuracy of all emotions is greater than $\alpha$, or (2) the number of emotions with recognition accuracy less than $\alpha$ remains unchanged in two adjacent trainings.
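The stopping rule can be sketched as a small loop. This is only an illustration of the two termination conditions: `train_round` is a hypothetical callback that retrains on the listed emotions (all emotions when it receives `None`) and returns the per-emotion accuracy, and `max_rounds` is an added safety limit that is not part of the description above.

```python
def retrain_until_stable(train_round, alpha, max_rounds=20):
    """Repeat training until every emotion reaches alpha or no progress is made.

    train_round(weak) -> dict mapping emotion name to recognition accuracy.
    Returns the last accuracy dict and the list of weak emotions after each round.
    """
    weak = None            # None: the first round trains on every emotion
    previous_count = None
    history = []
    accuracy = {}
    for _ in range(max_rounds):
        accuracy = train_round(weak)
        weak = sorted(e for e, a in accuracy.items() if a < alpha)
        history.append(weak)
        if not weak:                                   # condition (1)
            break
        if previous_count is not None and len(weak) == previous_count:
            break                                      # condition (2)
        previous_count = len(weak)
    return accuracy, history
```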
To evaluate the performance of the proposed method in the cross-corpus condition, the Berlin, NNIME, and IEMOCAP datasets are combined in pairs in this paper. In each pair, one dataset is taken as the source domain and the other as the target domain. Therefore, a total of 6 combination cases are designed as follows:
• N-B: NNIME is the source domain dataset, and Berlin is the target domain dataset.
• B-N: Berlin is the source domain dataset, and NNIME is the target domain dataset.
• N-I: NNIME is the source domain dataset, and IEMOCAP is the target domain dataset.
• I-N: IEMOCAP is the source domain dataset, and NNIME is the target domain dataset.
• B-I: Berlin is the source domain dataset, and IEMOCAP is the target domain dataset.
• I-B: IEMOCAP is the source domain dataset, and Berlin is the target domain dataset.
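The six cases are simply the ordered pairs of the three corpora, which can be enumerated as follows. The short codes follow the case names used above; the function name `cross_corpus_cases` is hypothetical.

```python
from itertools import permutations

CORPORA = {"B": "Berlin", "N": "NNIME", "I": "IEMOCAP"}


def cross_corpus_cases(corpora):
    """Map each case name 'S-T' to its (source corpus, target corpus) pair."""
    return {
        f"{source}-{target}": (corpora[source], corpora[target])
        for source, target in permutations(corpora, 2)
    }
```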

Parameter details
Linear SVM is chosen for training and testing. The grid search method is used to optimize the kernel function coefficients of the SVM and the independent terms of the sum function. There are four hyperparameters and five feature weight coefficients in this experiment. The recognition accuracy threshold $\alpha$ is set to 0.45, which is determined by an informal experiment. According to the dimensions of the features, the weight coefficients $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, and $\beta_5$ are set to 0.3, 0.3, 0.3, 0.05, and 0.05, respectively. The complexity of the algorithm is affected by $K$: the larger the value of $K$ is, the higher the algorithm complexity is, and the more features are extracted. So, the range of the number of neighbors $K$ is set as [3, 9]. For the two regularization parameters $\mu$ and $\lambda$, the ranges are set to $\{-1/4, -1/3, -1/2, 1, 1/2, 1/3, 1/4\}$ and $\{0.001, 0.01, 0.1, 1, 10, 100, 1000\}$, respectively. Considering that the embedding regularization parameter $\mu$ is an exponent, if $\mu$ is a positive integer, its value will affect the values of the elements in $Y$. If $\mu$ is a positive or negative fraction, it may affect the value range of the elements in $R$. Hence, both integers and fractions can be chosen for $\mu$. The regularization parameter $\lambda$ controls the relative importance of the two information entropies. For the proposed method, the dimension of the simplified subspace feature is set to 169.
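The search space implied by these ranges can be enumerated directly. The sketch assumes that $K$ takes the integer values 3 to 9 and that every combination of $K$, $\mu$, and $\lambda$ is tried; `evaluate` is a hypothetical callback that returns the recognition accuracy for one combination.

```python
from fractions import Fraction
from itertools import product

K_VALUES = range(3, 10)                      # assumed: integers 3, 4, ..., 9
MU_VALUES = [
    Fraction(-1, 4), Fraction(-1, 3), Fraction(-1, 2), Fraction(1),
    Fraction(1, 2), Fraction(1, 3), Fraction(1, 4),
]
LAMBDA_VALUES = [0.001, 0.01, 0.1, 1, 10, 100, 1000]


def parameter_grid():
    """All (K, mu, lambda) combinations in the stated ranges."""
    return list(product(K_VALUES, MU_VALUES, LAMBDA_VALUES))


def grid_search(evaluate):
    """Return the combination with the highest score under the given callback."""
    return max(parameter_grid(), key=lambda params: evaluate(*params))
```

Under these assumptions the grid has 7 × 7 × 7 = 343 combinations, so an exhaustive search is feasible as long as one evaluation is not too expensive.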

Traditional linear baseline
In order to evaluate the performance of the proposed method for cross-corpus speech emotion recognition, on the basis of the above 6 sets of experiments, the proposed method is compared with some commonly used and advanced transfer learning methods. These baseline methods are introduced as follows:
• Principal components analysis (PCA) [54]: A dimensionality reduction method that maps data into a low-dimensional subspace through a linear transformation while preventing information loss as much as possible.
• Linear discriminant analysis (LDA) [55]: In this method, the projection direction that maximizes the ratio of the inter-class distance to the intra-class distance is found. The subsequent classification results are affected while the dimension is reduced.
• Kernel spectral regression (KSR) [56][57][58]: In reproducing kernel Hilbert spaces (RKHS), the problem of learning embedding functions is transformed by SR into a regression problem.
• Geodesic flow kernel (GFK) [59]: The movement of the domain is simulated by integrating an infinite number of subspaces. The changes in geometric and statistical properties from the source domain to the target domain are described by these subspaces.
• Subspace alignment (SA) [60]: SA is a transfer learning algorithm that matches the features of two subspaces. The core of this method is to seek a linear transformation that aligns the different data.
• Manifold embedded distribution alignment (MEDA) [61]: Taking into account the importance of both conditional and marginal distributions, a domain-invariant classifier is learned on a Grassmann manifold with structural risk minimization.

Results analysis
Comparison with the traditional linear baseline method
In this section, the recognition accuracy of the proposed method is compared with that of some traditional linear baseline methods. The results are shown in Tables 3 and 4. From Table 3, it is clear that the proposed method outperforms the other methods in most cases. Only in the case of I-B is the performance of the proposed method slightly lower than that of BDA and TJM. For the proposed method, the average recognition accuracy reached 58.20% in the six cases. In the case of I-B, the recognition accuracy is the lowest among the six cases, which is 46.88%. In contrast, in the case of I-N, the recognition accuracy reached 67.75%, which is the highest among the six cases. Compared with TJM, which has the highest recognition accuracy among the baseline methods, the average recognition accuracy of the proposed method is significantly improved by 13.3%.
Weighted accuracy is an important indicator of the overall classification performance of a model, but it is affected by an unbalanced distribution of sample classes. Unweighted accuracy is therefore very important for evaluating the model when the class distribution is unbalanced. Table 4 shows that the unweighted accuracy of almost all methods is lower than their weighted accuracy. For the proposed method, unweighted accuracy is 3.27% lower than weighted accuracy, but the method still has an advantage over the baseline methods. Furthermore, the average recognition accuracy of the proposed method, the distribution adaptation methods, and the feature selection methods is higher than that of most subspace learning methods. The reason is that the data distributions differ across domains, so traditional subspace learning algorithms perform poorly in cross-corpus speech emotion recognition, whereas transfer learning can improve recognition performance.
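The difference between the two measures can be made concrete with a small sketch. It assumes the convention common in speech emotion recognition, in which weighted accuracy is the fraction of all samples classified correctly and unweighted accuracy is the mean of the per-class recalls; the function names and the toy labels are illustrative and are not taken from the experiments.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Fraction of all samples classified correctly (dominated by large classes)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (every class counts equally)."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy, unbalanced example: 8 neutral samples and 2 sad samples.
y_true = ["neutral"] * 8 + ["sad"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "sad"]
wa = weighted_accuracy(y_true, y_pred)    # 9 of 10 correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 8/8 and 1/2
```

In this toy case the weighted accuracy is 0.9 while the unweighted accuracy is 0.75, because the error on the minority class is averaged with equal weight. This is the mechanism by which unweighted accuracy falls below weighted accuracy on unbalanced corpora.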
In addition, the confusion matrices of the proposed method in the six cases are shown in Fig. 4. In most cases, two types of emotion are recognized with more than 50% accuracy. In the cases of N-B and N-I, the highest recognition accuracy is achieved for neutral. Fig. 4b and f show that the proposed method recognizes happy well, and the highest recognition accuracy is achieved for angry in the cases of I-N and B-I. Moreover, sad is easier to recognize than the other emotions in most cases.

Ablation experiment
In this section, a set of ablation experiments is established to verify the impact of the two parts of the proposed method on the recognition performance. The specific results are shown in Fig. 5. The specific settings are as follows:
• Subspace learning: Only Hessian-based subspace learning is performed.
• Domain adaption: Only information entropy-based domain adaption is performed.
• Subspace learning and domain adaption: Hessian-based subspace learning and domain adaption are combined.
The average recognition accuracy of the ablation experiments is shown in Fig. 5. The combined method (i.e., the proposed method) performs better than either Hessian-based subspace learning or domain adaption alone, which indicates that both components play a positive role in cross-corpus speech emotion recognition. In the cases of N-B, B-N, and N-I, the recognition accuracy of the domain adaption method is slightly higher than that of the Hessian-based subspace learning method. Conversely, in the cases of I-N, B-I, and I-B, the recognition accuracy of the Hessian-based subspace learning method is slightly higher than that of the domain adaption method.

Comparison with deep learning-based method
In this section, IEMOCAP and MSP-Improv are used for cross-corpus speech emotion recognition. The ADDoG-based method and the CNN-based method [34] are chosen as reference methods, and the recognition accuracy of the proposed method is compared with theirs. The results are shown in Fig. 6. When MSP-Improv is the source domain and IEMOCAP is the target domain, the unweighted accuracy of the proposed method is better than that of the CNN-based method but slightly lower than that of the ADDoG-based method. In the reverse experiment, however, the unweighted accuracy of the proposed method is slightly higher than that of both the CNN-based method and the ADDoG-based method. The performance of the ADDoG-based method is the most stable among the three methods. In general, the proposed method achieves good performance compared with traditional linear methods and deep learning methods.

Experiment of real-world corpus
To verify that the proposed method is also effective in the real world, in this section the real-world corpus MSP-PODCAST and several corpora recorded in controlled experimental environments are used for cross-corpus speech emotion recognition. MSP-PODCAST is used in turn as the source corpus and as the target corpus in experiments with the other corpora. Fig. 7 shows the recognition accuracy of the proposed method with MSP-PODCAST as the target corpus, and Fig. 8 shows the recognition accuracy with MSP-PODCAST as the source corpus. Figs. 7 and 8 show that the proposed method performs better when MSP-PODCAST is the target corpus than when it is the source corpus. When MSP-PODCAST is used as the source corpus, the transferable knowledge is limited by the influence of complex acoustic conditions. This indicates that the performance of speech emotion recognition is indeed affected by the corpus environment. In addition, the recognition performance of the proposed method with IEMOCAP and MSP-Improv is better than that with the other corpora.

Parameters analysis
The influence of different parameters on the recognition performance of the proposed method is analyzed in this section. The analyzed parameters are the number of nearest neighbors K, the embedding regularization parameter μ, and the information entropy regularization parameter λ. Different values of these parameters yield different recognition accuracies. First, the number of nearest neighbors K is analyzed, which determines how many neighboring frames of the current frame are identified. K also affects the complexity of the algorithm. The smaller K is, the fewer neighboring frames are identified and the less feature information is provided; the larger K is, the more neighboring frames are identified and the more feature information is provided. However, if K is set too large, some frames that are not useful for recognition may be identified as neighboring frames, which also leads to high algorithmic complexity. Therefore, the range of K is set from 3 to 9 in this paper. The recognition accuracy for different K in the different cases is shown in Fig. 9, from which it can be seen that the proposed method achieves a good recognition accuracy when K = 6. However, recognition accuracy alone is not enough to measure the recognition performance under different corpus settings. Therefore, the variance of the recognition accuracy is also introduced in the parameter analysis to measure the recognition performance across corpus settings. For K, the variances under different corpus settings are shown in Fig. 10. Although the variance of the recognition accuracy reaches its maximum when K = 6, it differs only slightly across the values of K. Therefore, considering both algorithmic complexity and recognition performance, K is set to 6 in this paper.
Then, the embedding regularization parameter μ is analyzed, which is used to control the value of the embedded coordinates. The range of μ is set as {−1/2, −1/3, −1/4, 1/4, 1/3, 1/2, 1} in this paper. The recognition accuracy of the proposed method with different μ in the different cases is shown in Fig. 11, from which it is clear that the proposed method achieves a good recognition accuracy when μ = 1/4. The variance of the recognition accuracy with different μ under different corpus settings is shown in Fig. 12. Although the variance is very small when μ = 1, the recognition accuracy is significantly lower than for the other values. Therefore, considering both the recognition accuracy and its variance, μ = 1/4 is chosen in this paper.
Finally, the information entropy regularization parameter λ is analyzed, which controls the weight of the information entropy. The range of λ is set as {0.001, 0.01, 0.1, 1, 10, 100, 1000} in this paper. The recognition accuracy of the proposed method with different λ in the different cases is shown in Fig. 13. As shown in Fig. 13, the recognition accuracy changes greatly when λ = 100 and λ = 1000. Although the recognition accuracy in both the N-I and B-I cases exceeds 70% when λ = 100, it is not stable in these two cases, as shown in Fig. 14. Therefore, as a compromise between the recognition accuracy and its variance, λ = 10 is chosen in this paper.
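The trade-off between recognition accuracy and its variance that guides the choice of K, μ, and λ can be sketched as a simple selection rule. The rule below (the best mean accuracy among candidates whose variance stays under a limit), the function names, the threshold, and the toy numbers are all hypothetical; they only illustrate the kind of compromise described above and are not results or the selection procedure of this paper.

```python
from statistics import mean, pvariance


def summarise(acc_by_value):
    """Map each candidate value to (mean accuracy, variance across corpus settings)."""
    return {v: (mean(a), pvariance(a)) for v, a in acc_by_value.items()}


def pick(acc_by_value, max_variance):
    """Hypothetical rule: best mean accuracy among candidates with acceptable variance."""
    stats = summarise(acc_by_value)
    stable = {v: m for v, (m, var) in stats.items() if var <= max_variance}
    if not stable:
        raise ValueError("no candidate satisfies the variance limit")
    return max(stable, key=stable.get)


# Toy numbers only (not taken from the experiments): accuracy of each
# candidate parameter value under three made-up corpus settings.
toy = {
    0.1: [0.50, 0.52, 0.51],
    1.0: [0.55, 0.56, 0.54],
    10.0: [0.70, 0.40, 0.62],
}
chosen = pick(toy, max_variance=0.001)
```

In this toy example the last candidate has the highest mean accuracy but is rejected because its accuracy fluctuates strongly between settings, so the rule selects the middle candidate instead.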

Complexity analysis
For the performance evaluation of a method, both recognition accuracy and model complexity should be considered. For a deep learning-based method, the complexity of the model is determined by the network structure and the number of parameters. Therefore, a complexity analysis of the proposed method and the reference methods is given in this subsection. For the CNN-based method, the feature encoder consists of two convolution layers and a max pooling layer, and the emotion classifier consists of fully connected layers and softmax. On this basis, the ADDoG model adds a critic composed of fully connected layers. As the number of input MFBs increases, the amount of computation and the number of trainable parameters of each layer increase further. In addition, during training, when the number of samples in the source domain and target domain increases, the computational complexity of the

Conclusion
In this paper, a cross-corpus speech emotion recognition method is proposed using subspace learning and domain adaptation. In the subspace learning part, the Hessian matrix is introduced to locally embed the features in both source and target domains to form the feature subspace. In the domain adaption part, the mapping relationship is constructed based on information entropy. Then, the common space of both the source and target domains is obtained, which reduces the discrepancy in feature distribution between the source and target domains. Extensive experiments on datasets in three different languages are conducted to verify the performance of the proposed method.