Cross-corpus speech emotion recognition using subspace learning and domain adaption
EURASIP Journal on Audio, Speech, and Music Processing volume 2022, Article number: 32 (2022)
Abstract
Speech emotion recognition (SER) is a hot topic in speech signal processing. When the training data and the test data come from different corpora, their feature distributions differ, which degrades recognition performance. To solve this problem, a cross-corpus speech emotion recognition method based on subspace learning and domain adaptation is proposed in this paper. Specifically, the training set data and the test set data are used to form the source domain and the target domain, respectively. Then, the Hessian matrix is introduced to obtain the subspace for the extracted features in both source and target domains. In addition, an information entropy-based domain adaption method is introduced to construct the common space. In the common space, the difference between the feature distributions in the source domain and target domain is reduced as much as possible. To evaluate the performance of the proposed method, extensive experiments are conducted on cross-corpus speech emotion recognition. Experimental results show that the proposed method achieves better performance compared with some existing subspace learning and domain adaptation methods.
1 Introduction
There are many ways for people to express emotions, such as through speech, actions, and facial expressions. Among these, speech is an important one, because it carries rich emotions, such as happiness, anger, and sadness. Speakers can deliver their intentions through different tones, volumes, or content. How to judge a speaker's emotion through speech therefore becomes crucial. Speech emotion recognition (SER) is an important branch of multimodal affective computing, and it is also an important part of speech recognition. With its development, SER has been applied in fields such as psychotherapy and human-computer interaction. According to the results of SER, the machine can generate appropriate responses for the user in an interactive environment. Therefore, SER is one of the most important technologies for human-computer interaction [1,2,3,4].
Semantic-based methods are an important class of SER methods, because emotions can be expressed effectively by semantics. If speakers use emotive words to communicate with others, the emotion can be judged directly from the semantics of the words. Therefore, semantic-based research gradually began to develop. A multi-classifier emotion recognition model based on prosodic information and semantic labels is introduced in [5]. Similarly, semantic labels and nonverbal sounds in speech, such as crying, laughter, or sighing, are used for SER in [6]. Subsequently, temporal and semantic coherence is introduced for SER [7]. In addition, a bimodal SER model that fuses acoustic and linguistic information is proposed in [8].
Although understanding semantics is simple for humans, it is a complex process for machines. Therefore, more research is currently aimed at speech features that are easily processed by machines, which is also important for SER. Compared with semantic information, speech features are more abstract, but they are very important for expressing the speaker's emotions. The main features used in SER are divided into acoustic features and spectral features. The acoustic features include intensity, pitch, and timbre. Features like energy, Mel-frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC), and fundamental frequency are called spectral features. Features such as pitch, MFCC, formant, intensity, and chroma are adopted for SER [9, 10]. Also, the pitch, spectrum, and formant are combined with semantic information for recognizing emotions in [5]. To improve the robustness of SER, some methods have been used to process the features. Specifically, PCA is adopted to reduce the dimensionality of the features [11], and a statistical method is utilized to find robust spectral features [12].
In practical scenarios, the speaker's emotion is very complex. The speaker may have multiple emotions at the same time rather than a single emotion, or the emotion expressed by the speaker may be inconsistent with the actual emotion. This makes SER difficult. Research has also been carried out on complex emotions. A circular continuous dimensional model for describing an emotion, called the valence-arousal (VA) model, was proposed in [13, 14]. The model no longer regards emotions as discrete but uses two-dimensional coordinates to describe the continuous distribution of emotions. The PAD emotional model was presented in [15, 16], which uses P (pleasure), A (arousal), and D (dominance) values to represent all emotional states. In addition, based on the emotional probability distribution, an ambiguous label is proposed to solve the inconsistency problem in ambiguous emotional cognition [17].
Another problem in SER is how to recognize emotions. To this end, some machine learning methods were adopted to recognize emotions, such as the support vector machine (SVM) [18], hidden Markov model (HMM) [19], and Gaussian mixture model (GMM) [20]. In recent years, with the rapid development of deep learning, various neural network structures have been introduced in SER. Convolutional neural networks (CNN) [21], recurrent neural networks (RNN) [22], back propagation neural networks (BPNN) [23], deep neural networks (DNN) [24], sequential capsule networks [25], and adversarial data augmentation networks [26] have all been used for SER. A segment-based iterative self-learning enhanced speech emotion recognition model is proposed in [27]. The above algorithms perform well in traditional SER, and the recognition accuracy of some algorithms can even exceed 80% in some corpus settings. In real scenes, however, the speech signals do not belong to a specific corpus and are recorded in different environments. The speech data is also affected by language, gender, speaking style, and other factors. So, when the training set and the test set come from different corpora, the training and testing data often follow different feature distributions, and the recognition performance is reduced.
Therefore, transfer learning is adopted to solve the cross-corpus problem [28]. The known corpus data is considered as the source domain, and the unknown data to be learned constitutes the target domain. Transfer learning transfers the knowledge of the source domain to the target domain to reduce the data distribution difference between the two domains. In SER, the features of the source and target domains are distributed in different spaces. So, the transfer from the source domain to the target domain is a feature-based transfer, that is, a mapping relationship between the two domains is established to reduce the differences in feature distributions. With the development of transfer learning, more transfer learning algorithms are applied to SER. Among them, in order to solve the cross-corpus SER problem, many studies focus on transfer subspace learning and domain adaptation, such as unsupervised transfer subspace learning [28], transfer subspace learning based on feature selection [29], transfer subspace learning based on non-negative matrix factorization [30], transfer linear subspace learning [31], and Universum autoencoder-based domain adaptation [32]. In addition, a cross-corpus speech emotion recognition method based on domain adaptive least squares regression is proposed in [33], and in [34, 35], ADDoG-based and DANN-based methods are proposed according to the idea of domain adversarial learning. Most of the above methods involve transfer subspace learning and domain adaptation, which are important issues in transfer learning and the focus of this paper. The two parts are considered jointly in this paper. Therefore, inspired by the framework in [36], a cross-corpus speech emotion recognition method is proposed.
The contributions of the proposed method are summarized as follows:

The proposed method combines subspace learning and mapping to realize speech emotion recognition across corpora. The feasibility of the proposed method is verified by experimental results.

In this paper, a subspace learning model is constructed based on the Hessian matrix, so that the extracted features in both the source domain and the target domain are robust in their independent subspaces, which improves the subsequent cross-corpus transfer ability.

Information entropy is used to establish a domain adaption model in the proposed method. Numerical descent is used to minimize the information entropy, so that a common space of the source and target domains is learned and the difference in feature distribution between the two domains is reduced.
The rest of the paper is organized as follows. In Section 2, the specific process of the proposed method is introduced, along with some optimizations. In Section 3, the emotion recognition performance of the proposed method is analyzed on three public datasets, and the effects of different parameters on the performance are analyzed through experiments. Finally, the conclusion is drawn in Section 4.
2 The proposed method
A cross-corpus speech emotion recognition method is proposed by combining subspace learning and domain adaption. The block diagram of the proposed method is shown in Fig. 1.
Firstly, features of speech in the source corpus and target corpus are extracted to form the source domain and the target domain. Then, Hessian-based subspace learning is performed on the features in the source domain and the target domain to obtain low-dimensional features that form their own independent subspaces. The flowchart of the Hessian-based subspace learning part is shown in Fig. 2. Furthermore, the mapping relationship between the source domain subspace and the target domain subspace is established by using information entropy, which reduces the difference in feature distribution between the domains. This mapping relationship is described by the common space. Therefore, it is important to find the common space corresponding to the two domains in this method. The flowchart of the domain adaption part is shown in Fig. 3. Finally, emotions are predicted.
In the part of Hessian-based subspace learning, the neighboring frames of the current frame are found based on neighborhood calculation. Then, the Hessian matrix [37] is constructed for low-dimensional embedding to obtain the subspaces of the source and target domains, respectively.
After obtaining the subspaces of the source and target domains, the transformation matrix is obtained through the correlation coefficients of the subspaces. Then, the distance between the feature data of each frame in the source domain subspace and that of each frame in the target domain subspace is calculated. The probability that a frame in the target domain subspace is a neighbor of each frame in the source domain is obtained according to the distance. In this way, the posterior probability that the features of each frame in the target domain subspace belong to a certain class can be estimated from the known class labels of the features of each frame in the source domain subspace. Then, the entropy between the target domain features and emotion labels and the entropy between the features and domain labels of the two domains are calculated. Finally, the two information entropies are jointly optimized by numerical descent. The mapping relationship between the source domain subspace and the target domain subspace is thereby acquired, which is described by a common space.
In the following, Hessian-based subspace learning [38] and the domain adaption based on information entropy are introduced in detail. Finally, a specific optimization method for finding the common space is given.
2.1 Hessian-based subspace learning
An input feature matrix X = (x_{mn})_{M × N} is given, which is composed of the features of the speech. m and n are the feature index and the frame index, respectively. M and N are the feature dimension and the number of frames, respectively. First, the feature energy of each frame is as follows:
where \({x}_n^{\textrm{e}}\) represents the feature energy of the nth frame, and x_{mn} represents the feature of the mth dimension in the nth frame.
Thus, an energy matrix can be formed as X^{e}= [\({x}_1^{\textrm{e}},{x}_2^{\textrm{e}},\dots, {x}_N^{\textrm{e}}\)]. Then, two new feature energy matrices A and B, which are used for calculating the distance of the feature between different frames, are defined as follows:
where \({a}_{ij}={x}_j^{\textrm{e}}\), \({b}_{ij}={x}_i^{\textrm{e}}\), 1 ≤ i, j ≤ N, and i and j represent the index of the row and column, respectively. In order to find the nearest K frames of each frame, the distance D_{e} = (d_{ij})_{N × N} of the feature between different frames is calculated as follows:
where d_{ij} represents the distance between the feature energy of the ith frame and the jth frame. The smaller the distance d_{ij} is, the closer the feature energies of the ith frame and the jth frame are. In fact, the definition of distance D_{e} is derived from Euclidean distance. A and B are formed by the square of the elements in the input matrix X. According to Eqs. (1), (2), and (3), the distance defined in this paper meets the requirements of nonnegativity, directness, and identity. A and B are constructed in a way that also satisfies the symmetry of the distance.
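The computation described by Eqs. (1) to (3) can be sketched in plain Python as follows. The sketch assumes that the feature energy of a frame is the sum of its squared feature values and that the distance is the squared Euclidean distance between frames, d_ij = a_ij + b_ij - 2 Σ_m x_mi x_mj; whether the original definition applies a square root is not stated here. The function names are illustrative only.

```python
def feature_energy(X):
    """Per-frame energy: sum of squared feature values in each column of X (M x N)."""
    M, N = len(X), len(X[0])
    return [sum(X[m][n] ** 2 for m in range(M)) for n in range(N)]


def energy_distance(X):
    """Pairwise squared Euclidean distance between frames (columns of X).

    Assumed form: d_ij = a_ij + b_ij - 2 * <x_i, x_j>, with a_ij = e_j and b_ij = e_i.
    """
    M, N = len(X), len(X[0])
    e = feature_energy(X)
    D = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            inner = sum(X[m][i] * X[m][j] for m in range(M))
            # Clamp tiny negative values caused by floating-point rounding.
            D[i][j] = max(e[j] + e[i] - 2.0 * inner, 0.0)
    return D


# Toy input: M = 2 features, N = 3 frames.
X = [[1.0, 2.0, 4.0],
     [0.0, 2.0, 4.0]]
D = energy_distance(X)
```

Under this assumption the matrix is symmetric with a zero diagonal, which matches the symmetry and identity properties mentioned above.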
The jth column of the matrix D_{e} (i.e., \({\boldsymbol{d}}_j^e={\left[{d}_{1j}^e,{d}_{2j}^e,\dots, {d}_{Nj}^e\right]}^T\)) denotes the distance vector of feature energy between the jth frame and each frame. The distance vector sorted in ascending order is \({\textbf{d}}_j^{eS}={\left[{d}_{S_j(1)j}^e,{d}_{S_j(2)j}^e,\dots, {d}_{S_j(N)j}^e\right]}^T\), where S_{j}(i) denotes the index of the frame with the ith smallest distance from the jth frame; S_{j}(1) is the index of the minimum distance in \({\boldsymbol{d}}_j^e\), and S_{j}(N) is the index of the maximum distance. It is worth mentioning that for each frame, \({d}_{jj}^e\) is the minimum element in \({\boldsymbol{d}}_j^e\), i.e., S_{j}(1) = j. The 2nd to the (K+1)th smallest distances in \({\textbf{d}}_j^{eS}\) are selected to form the adjacent index vector i_{j} = [S_{j}(2), S_{j}(3), …, S_{j}(K + 1)]^{T} of the jth frame, where K denotes the number of neighbor frames. Thereby, the K × N adjacent index matrix I = [i_{1}, i_{2}, …, i_{N}] of N frames is obtained. Then, the elements of the input matrix X corresponding to the indices in I are selected to form a neighborhood matrix Z_{n}, which is defined as follows:
where \({z}_{mk}^n={x}_{m{S}_n\left(k+1\right)}\), 1 ≤ k ≤ K, 1 ≤ m ≤ M, 1 ≤ n ≤ N. k, m, and n are the neighbor index, the feature index, and the frame index, respectively. Z_{n} represents the neighborhood matrix corresponding to the nth frame.
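The neighbor selection can be illustrated with a short sketch. It takes a precomputed distance matrix, sorts each column, and gathers the K nearest frames. As a deliberate deviation from the text, the frame itself is removed by index rather than by assuming S_j(1) = j, which keeps the result well defined when two frames are identical. The function names and the list-per-frame layout (the transpose of the K × N matrix I) are assumptions of this sketch.

```python
def nearest_neighbors(D, K):
    """For each frame j, return the indices of its K nearest frames, excluding j itself."""
    N = len(D)
    index = []
    for j in range(N):
        column = [D[i][j] for i in range(N)]
        # Sort frame indices by distance to frame j; ties are broken by index.
        order = sorted(range(N), key=lambda i: (column[i], i))
        # Drop frame j explicitly instead of assuming it is always first in the order.
        order = [i for i in order if i != j]
        index.append(order[:K])
    return index


def neighborhood_matrix(X, neighbors_of_n):
    """Gather the columns of X (M x N) listed in neighbors_of_n into an M x K matrix."""
    return [[row[i] for i in neighbors_of_n] for row in X]


# Toy data: 3 frames with 2 features each and their pairwise distances.
X = [[1.0, 2.0, 4.0],
     [0.0, 2.0, 4.0]]
D = [[0.0, 5.0, 25.0],
     [5.0, 0.0, 8.0],
     [25.0, 8.0, 0.0]]
I = nearest_neighbors(D, K=2)
Z_0 = neighborhood_matrix(X, I[0])
```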
E_{n} is the centering matrix of Z_{n}, which is defined as follows:
where \(e_{mk}^n=\frac1K\sum_{k=1}^Kz_{mk}^n\)
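As a small illustration of the centering step, the following sketch subtracts from every row of a neighborhood matrix its mean over the K neighbors, which corresponds to forming Z_n - E_n when every entry in row m of E_n equals that row mean. The function name is hypothetical.

```python
def center_rows(Z):
    """Subtract from each row of Z (M x K) its mean over the K neighbors."""
    centered = []
    for row in Z:
        mean = sum(row) / len(row)
        centered.append([value - mean for value in row])
    return centered


# Toy neighborhood matrix: M = 2 features, K = 3 neighbors.
Z = [[2.0, 4.0, 6.0],
     [1.0, 1.0, 4.0]]
Zc = center_rows(Z)
```

After centering, every row sums to zero, so the subsequent decomposition describes the spread of the neighbors around their mean rather than their absolute position.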
The purpose of the proposed Hessian-based subspace learning is to obtain the local coordinates of the neighborhood, which are transitioned by tangent coordinates. The tangent space consists of tangent coordinates and is regarded as a subspace of the Euclidean space. A standard orthogonal coordinate system is associated with the inner product inherited from the Euclidean space, which can be obtained by using singular value decomposition. Therefore, Z_{n} − E_{n} is subjected to singular value decomposition. The standard orthonormal basis \({\textbf{V}}_n={\left({v}_{ij}^n\right)}_{K\times K}\) can be obtained by singular value decomposition as follows:
where (·)^{T} denotes transposition, U_{n} is the matrix of left singular vectors of Z_{n} − E_{n}, and Σ_{n} is a diagonal matrix of singular values.
The first d columns of V_{n} are extracted to constitute the tangent coordinates \({\textbf{V}}_n^d={\left({v}_{ij}^n\right)}_{K\times d}\) with dimension K × d.
Next, an association Hessian matrix Q_{n} is given by using \({\textbf{V}}_n^d\), which is defined as follows:
where \({q}_{kj}^n={v}_{k{j}_1}^n{v}_{k{j}_2}^n\), n is the frame index, 1 ≤ n ≤ N, and j_{1} and j_{2} are the dimension indexes. The corresponding relationship among j, j_{1}, and j_{2} is given as follows:
where 1 ≤ j_{1} ≤ d, 1 ≤ j_{2} ≤ d, \(j=1,2,\dots, \frac{d\left(d+1\right)}{2}\).
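The construction of Q_n can be sketched as follows. Because j runs from 1 to d(d+1)/2, the sketch assumes that j enumerates the unordered pairs (j_1, j_2) with j_1 ≤ j_2, taken row by row; this ordering is an assumption of the sketch and may differ from the correspondence used in the original method. Column j of Q_n is then the elementwise product of columns j_1 and j_2 of the tangent coordinates.

```python
def pair_index(d):
    """Enumerate the d(d+1)/2 pairs (j1, j2) with 1 <= j1 <= j2 <= d (assumed ordering)."""
    return [(j1, j2) for j1 in range(1, d + 1) for j2 in range(j1, d + 1)]


def association_matrix(V):
    """Build Q (K x d(d+1)/2) from tangent coordinates V (K x d).

    Column j of Q is the elementwise product of columns j1 and j2 of V.
    """
    d = len(V[0])
    pairs = pair_index(d)
    return [[row[j1 - 1] * row[j2 - 1] for (j1, j2) in pairs] for row in V]


# Toy tangent coordinates: K = 2 neighbors, d = 2 dimensions.
V = [[1.0, 2.0],
     [3.0, 4.0]]
Q = association_matrix(V)
```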
Furthermore, an estimation matrix \({\textbf{L}}_n={\left({l}_{ij}^n\right)}_{K\times \left(1+d+\frac{d\left(d+1\right)}{2}\right)}\) is constructed as follows:
where 1 ≤ i ≤ K, 1 ≤ n ≤ N.
\({\textbf{G}}_n={\left({g}_{ij}^n\right)}_{K\times \left(1+d+\frac{d\left(d+1\right)}{2}\right)}\) can be obtained by Gram-Schmidt orthogonalization of the estimation matrix L_{n} [39]. The last \(\frac{d\left(d+1\right)}{2}\) columns of G_{n} are taken to obtain the matrix \({\textbf{G}}_n^b={\left({g}_{ij}^{bn}\right)}_{K\times \frac{d\left(d+1\right)}{2}}\). Then, the Hessian quadratic matrix H can be constructed by using the matrix \({\textbf{G}}_n^b\), which is formed as follows:
where \({\textbf{C}}_n={\left({\textrm{c}}_{ij}\right)}_{\frac{d\left(d+1\right)}{2}\times N}\) is a matrix composed of \({{\textbf{G}}_n^b}^T\), and it is defined as follows:
where \(1\le i\le \frac{d\left(d+1\right)}{2}\), and S_{n}(j) denotes the index of the frame sorted by the distance from the nth frame, 1 ≤ n ≤ N.
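The orthogonalization step can be sketched in plain Python. The sketch assumes that the estimation matrix L_n is the column-wise concatenation of a column of ones, the tangent coordinates, and Q_n, which matches its stated size K × (1 + d + d(d+1)/2), and it uses the modified Gram-Schmidt variant for numerical stability. All function names are illustrative.

```python
import math


def estimation_matrix(V, Q):
    """Concatenate a column of ones, V (K x d) and Q (K x d(d+1)/2) row by row."""
    return [[1.0] + list(v) + list(q) for v, q in zip(V, Q)]


def gram_schmidt_columns(L, tol=1e-12):
    """Orthonormalize the columns of L with modified Gram-Schmidt.

    Columns that are numerically dependent on earlier ones become zero columns.
    """
    K, C = len(L), len(L[0])
    cols = [[L[i][j] for i in range(K)] for j in range(C)]
    basis = []
    for col in cols:
        v = list(col)
        for q in basis:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v] if norm > tol else [0.0] * K
        basis.append(v)
    # Return in the original K x C layout.
    return [[basis[j][i] for j in range(C)] for i in range(K)]


def last_columns(G, count):
    """Keep the last `count` columns of G."""
    return [row[-count:] for row in G]


# Toy case: K = 4 neighbors, d = 1, so Q has d(d+1)/2 = 1 column.
V = [[-1.5], [-0.5], [0.5], [1.5]]
Q = [[v[0] * v[0]] for v in V]
G = gram_schmidt_columns(estimation_matrix(V, Q))
G_b = last_columns(G, 1)
```

Keeping only the last d(d+1)/2 columns retains the part of the quadratic terms that is orthogonal to the constant and linear terms, which is what the Hessian estimate is built from.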
Next, the d-dimensional subspace corresponding to the d smallest eigenvalues of H can be obtained; it is a null space and is denoted as U = (u_{ij})_{N × d}. If a manifold is locally isometric to an open subset of Euclidean space, then the mapping function from this manifold to the open subset is a linear function. The second-order mixed derivative of a linear function is 0, so the local quadratic form formed by the Hessian coefficients is also 0. Hence, the global Hessian matrix has a (d+1)-dimensional null space. The first dimension of this null space is spanned by a constant function, and the other d dimensions form the isometric coordinates. Then, the embedding matrix R = (r_{ij})_{d × d} can be calculated as follows:
where J represents the set of the index of the neighborhood frames, 1 ≤ i ≤ d, 1 ≤ j ≤ d.
Finally, the subspace Y is obtained according to the low-dimensional embedding:
where μ is a regularization parameter, and (·)^{T} denotes transposition.
There may be a small number of outliers in the subspace Y after the low-dimensional embedding. In order to solve this problem, the outliers in the subspace Y are corrected in this paper. These outliers are few in number, and their values deviate from the distribution of most of the data. So, detection thresholds are set to recognize the outliers. Then, the outliers are replaced with 2Tr(U^{T}EU) [40], where Tr(·) denotes the trace of the matrix in parentheses. E = (e_{ij})_{N × N} is a diagonal matrix, where e_{ij} is defined as [41]:
Following the above steps, the source domain subspace Y_{s} and the target domain subspace Y_{t} can be obtained.
2.2 Information entropy-based domain adaption
A domain adaption method is proposed to build the relationship between the source domain subspace and the target domain subspace. In detail, a common space with similar feature distributions in the source and target domains is constructed. Both the information entropy between the data and emotion labels and the entropy between the data and domain labels are used to optimize the mapping [42]. Thereby, the difference in feature distribution between different corpora can be reduced.
After obtaining the source domain subspace \({\textbf{Y}}_{\textrm{s}}={\left({y}_{ij}^{\textrm{s}}\right)}_{d\times N}\) and target domain subspace \({\textbf{Y}}_{\textrm{t}}={\left({y}_{ij}^{\textrm{t}}\right)}_{d\times N}\), the principal component coefficient matrices of the source domain \({\textbf{W}}_{\textrm{s}}={\left({w}_{ij}^s\right)}_{d\times d}\) and of the target domain \({\textbf{W}}_{\textrm{t}}={\left({w}_{ij}^t\right)}_{d\times d}\) are calculated. In some cases, the dimensions of the source domain and the target domain are different, which leads to different dimensions of the principal component coefficients. In that case, the smaller of the two dimensions should be taken as d_{w}. The dimensions of the source domain and the target domain are the same in this paper, so d_{w} is set as d. Since the transfer is carried out from the source domain to the target domain, the target domain is used as the basis for the transformation space. The transformation matrix W for both the source domain and the target domain is set as W = W_{t}. Features in the source domain and target domain can be mapped into a common space by W.
First, the distance matrix D = (d_{ij})_{N × N} formed by the features between different frames from the source domain subspace and the target domain subspace is given as follows:
where \({\textbf{X}}_{\textrm{s}}={\left({x}_{mn}^{\textrm{s}}\right)}_{d\times N}={\textbf{W}}^T{\textbf{Y}}_{\textrm{s}}\) denotes the source domain subspace features in transform space, \({\textbf{X}}_{\textrm{t}}={\left({x}_{mn}^{\textrm{t}}\right)}_{d\times N}={\textbf{W}}^T{\textbf{Y}}_{\textrm{t}}\) denotes the target domain subspace features in transform space, \(\boldsymbol{A}^{\boldsymbol{'}}={(a_{ij})}_{N\times N},a_{ij}=\sum\nolimits_{m=1}^{d}\left(x_{mj}^{\text{s}}\right)^{2}\), \(\mathbf{B}^{\boldsymbol{'}}={(b_{ij})}_{N\times N},b_{ij}=\sum\nolimits_{m=1}^{d}\left(x_{mi}^{\text{t}}\right)^{2}\).
The neighbor frames are detected according to the distance between the feature of each frame. Therefore, a conditional probability model is defined as follows:
where 1 ≤ i ≤ N, 1 ≤ j ≤ N, and p_{ij} is the conditional probability that the jth frame in the target domain is adjacent to the ith frame in the source domain. It describes how likely each frame feature in the source domain is to be the nearest neighbor of each frame feature in the target domain.
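One common way to turn distances into such neighbor probabilities is a softmax over negative distances, sketched below. The exponential form is an assumption of this sketch and is not confirmed by the text; the indexing follows the description of p_ij above, with i running over source frames and j over target frames, so that the probabilities for each target frame sum to 1 over the source frames.

```python
import math


def neighbor_probability(D):
    """Column-wise softmax of negative distances (assumed form of p_ij).

    D[i][j] is the distance between source frame i and target frame j;
    the result P[i][j] sums to 1 over the source frames i for every target frame j.
    """
    n_src, n_tgt = len(D), len(D[0])
    P = [[0.0] * n_tgt for _ in range(n_src)]
    for j in range(n_tgt):
        column = [D[i][j] for i in range(n_src)]
        shift = min(column)  # subtract the minimum for numerical stability
        weights = [math.exp(-(value - shift)) for value in column]
        total = sum(weights)
        for i in range(n_src):
            P[i][j] = weights[i] / total
    return P


# Two source frames (rows) and two target frames (columns).
D = [[0.0, 2.0],
     [math.log(3.0), 2.0]]
P = neighbor_probability(D)
```

In this toy case the first target frame is three times more likely to be a neighbor of the first source frame than of the second, while the second target frame is equidistant from both.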
The emotion label corresponding to the ith frame in the source domain is Label_{i}, Label_{i} ∈ Label = {1, 2, ... , L}, i.e., there are a total of L types of emotion. According to formula (16), an emotion label probability estimate \({\hat{p}}_{lj}\) of the jth frame in the target domain is given as follows:
where 1 ≤ l ≤ L, 1 ≤ j ≤ N, 1 ≤ i ≤ N, and \({\hat{p}}_{lj}\) expresses the probability that the jth frame in the target domain is discriminated as the lth type of emotion when the emotion of the source domain is known.
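The label probability estimate can be illustrated by summing, for each target frame, the neighbor probabilities of all source frames that carry a given label. This aggregation rule is an assumption consistent with the description above; the function name and the toy values are hypothetical.

```python
def label_posterior(P, source_labels, num_labels):
    """Sum neighbor probabilities over the source frames that carry each label.

    P[i][j]: probability that target frame j is a neighbor of source frame i.
    source_labels[i]: label of source frame i, an integer in 1..num_labels.
    Returns p_hat with p_hat[l - 1][j] for label l and target frame j.
    """
    n_tgt = len(P[0])
    p_hat = [[0.0] * n_tgt for _ in range(num_labels)]
    for i, label in enumerate(source_labels):
        for j in range(n_tgt):
            p_hat[label - 1][j] += P[i][j]
    return p_hat


# Three source frames with labels 1, 2, 1 and two target frames.
P = [[0.5, 0.25],
     [0.25, 0.25],
     [0.25, 0.5]]
p_hat = label_posterior(P, source_labels=[1, 2, 1], num_labels=2)
```

Because each column of P sums to 1, each target frame also receives label probabilities that sum to 1.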
Since \({\hat{p}}_{lj}\) is a preliminary probability estimate of the emotion label of each frame feature in the target domain, the relationship between target domain features and emotion labels cannot be directly revealed by \({\hat{p}}_{lj}\) [43,44,45]. Therefore, the entropy I(X_{t}; Label) between the target domain features and emotion labels is calculated by using \({\hat{p}}_{lj}\) in this paper, which is defined as follows:
Equation (18) is composed of two parts. In the first part, the entropy of the average probability that the feature of all frames in the target domain belongs to each emotion label is calculated. The average of the entropy of the feature in the target domain belonging to each emotion label is computed in the second part. In order to reduce the influence of incorrect labels on the feature discrimination results of each frame in the target domain, Eq. (18) needs to be optimized later. It should be noted that if only the second part is minimized, a degenerate solution will be obtained. That is, all frames in the target domain may be classified into the same type of emotion. So, the first part in Eq. (18) is necessary.
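The two parts described above can be computed with a short sketch, assuming Shannon entropy with the natural logarithm and assuming that the two parts are combined as their difference. The sketch also shows why the first part matters: a degenerate assignment that puts every frame in the same class makes the second part zero, but the first part is then zero as well, whereas a confident and balanced assignment gives a first part of ln 2 for two classes.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution; 0 * log 0 is taken as 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def label_information(p_hat):
    """Return (first_part, second_part, first_part - second_part).

    p_hat[l][j] is the estimated probability of label l for target frame j.
    first_part: entropy of the label distribution averaged over all frames.
    second_part: average over frames of the per-frame label entropy.
    """
    L, N = len(p_hat), len(p_hat[0])
    mean_p = [sum(p_hat[l]) / N for l in range(L)]
    first = entropy(mean_p)
    second = sum(entropy([p_hat[l][j] for l in range(L)]) for j in range(N)) / N
    return first, second, first - second


# Confident and balanced: each frame is assigned to a different class.
balanced = label_information([[1.0, 0.0], [0.0, 1.0]])
# Degenerate: every frame is assigned to the first class.
degenerate = label_information([[1.0, 1.0], [0.0, 0.0]])
```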
Then, the entropy I^{st}(X) between the features and domain labels of the two domains is introduced to maximize the similarity between the two domains, which is defined as:
where 1 ≤ j ≤ N + M.
To calculate the entropy I^{st}(X), firstly, the distance \({d}_{ij}^{\prime }\) between the ith frame feature in the source domain and the jth frame feature in the target domain is calculated according to Eq. (3), where X = (x_{ij})_{d × (N + M)} denotes the features of all frames in the source and target domains, A = (a_{ij})_{(N + M) × (N + M)}, \({a}_{ij}=\sum_{m=1}^d{\left({x}_{mj}\right)}^2\), B = (b_{ij})_{(N + M) × (N + M)}, and \({b}_{ij}=\sum_{m=1}^d{\left({x}_{mi}\right)}^2\). N and M denote the number of frames in the source domain and target domain, respectively. In this paper, the number of frames in the source domain is the same as that in the target domain, i.e., N = M. Then, the probability \({p}_{ij}^{\prime }\) that the ith frame and the jth frame are adjacent to each other across the source domain and the target domain is calculated according to Eq. (16) using \({d}_{ij}^{\prime }\). Next, the probability p_{tj} that the jth frame among the source and target domain frames is judged as belonging to the target domain or the source domain is calculated according to Eq. (17).
2.3 Optimization
In this subsection, an iterative optimization algorithm based on numerical descent [46] is introduced using Eqs. (18) and (19). The objective function is:
where λ is the regularization parameter.
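For orientation, one plausible form of the objective is written out below. It assumes that the label information I(X_t; Label) of Eq. (18) is to be increased while the domain information I^{st}(X) of Eq. (19) is to be decreased, with λ balancing the two terms. The sign convention and the placement of λ are assumptions of this sketch and should be checked against the original formulation.

```latex
\min_{\mathbf{W}} \; -I\left(\mathbf{X}_{\mathrm{t}};\,\mathrm{Label}\right) + \lambda\, I^{\mathrm{st}}\left(\mathbf{X}\right)
```

Under this reading, a larger λ puts more weight on making the two domains indistinguishable, and a smaller λ puts more weight on confident and balanced emotion predictions in the target domain.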
In the optimization process, the transfer coefficient matrix g is given for numerical descent in this paper, which is defined as follows:
where λ is the regularization parameter.
The calculation process of g(X_{t}; Label) is as follows. First, an information matrix \({\textbf{I}}^C={\left({i}_{lj}^c\right)}_{L\times N}\) is defined using \({\hat{p}}_{lj}\) as:
where \({i}_{lj}^c\) represents the difference between the probability that the feature of the jth frame in the target domain belongs to the emotion of the lth category and the average probability that the features of all frames in the target domain belong to the emotion of the category.
Next, a coefficient matrix Γ = (γ_{ij})_{N × N} is calculated from p_{ij} and \({i}_{lj}^c\) as follows:
where \({o}_{ij}={i}_{lj}^c\), Label_{i} = l. g(X_{t}; Label) is obtained as follows:
where Ω is a diagonal matrix whose main diagonal elements are \(\sum_{j=1}^N{\gamma}_{ij}\), and W is the transfer matrix.
Since the calculation processes of g(X_{t}; Label) and g^{st}(X) are the same, only the calculation process of g(X_{t}; Label) is introduced in detail in this paper. The variables for the calculation process of g^{st}(X) follow those used in the calculation of I^{st}(X).
Finally, the common space L is obtained. So, the feature data in the source domain after mapping is F_{s} = Y_{s}^{T}L, and the feature data from the target domain is F_{t} = Y_{t}^{T}L.
3 Experiments and results analysis
To evaluate the effectiveness of the proposed cross-corpus speech emotion recognition method, a number of experiments are conducted with some baseline methods on five commonly used datasets, namely Berlin [47], NNIME [48], IEMOCAP [49], MSP-Improv [50], and MSP-PODCAST [51]. The specific statistics of each dataset are shown in Table 1.
3.1 Data preparation
The Berlin dataset is a German emotional speech corpus recorded by the Technical University of Berlin. In this dataset, ten actors performed 7 emotions, including neutral, angry, fearful, happy, sad, disgusted, and bored. The sampling rate is 16 kHz. The dataset contains 233 male emotional sentences and 302 female emotional sentences saved in WAV format.
The NTHU-NTUA Chinese Interactive Multimodal Emotional Corpus (i.e., NNIME) is a multimodal dataset. In this dataset, audio, video, ECG, etc. were recorded for 44 actors during oral interactions. There are 6 emotions in this dataset, including anger, happy, sad, neutral, frustration, and surprise. The audio sampling rate is 16 kHz. The dataset also contains annotation results from 49 annotators from different perspectives.
IEMOCAP, known as the Interactive Emotional Dyadic Motion Capture Database, is recorded by the Signal Analysis and Interpretation Laboratory at the University of Southern California. Ten emotions are covered by recording the expressions, movements, and audio of 10 actors in this dataset. The dataset contains twelve hours of data. The audio sampling rate is 16 kHz. Considering the relevance and ambiguity of different types of emotions, audio data of 4 typical emotions (angry, neutral, happy, and sad) were selected from the above three datasets in this paper.
MSP-Improv is an improvised multimodal emotional corpus. There are 6 sessions, and each session is a dyadic interaction between two speakers. Each session contains twenty target sentences. In this corpus, 12 actors (six male and six female) performed 4 emotions, including neutral, angry, happy, and sad. Two actors improvise emotion-specific situations, leading them to utter contextualized, non-read renditions of sentences that have fixed lexical content and convey different emotions. The sampling rate is 44.1 kHz. MSP-Improv is more natural than other corpora. Hereinafter, MSP-Improv is referred to as MSP.
MSP-PODCAST is a large and natural emotional corpus. It relies on existing spontaneous recordings obtained from audio-sharing websites. The criterion for selecting the podcasts is to include only episodes that can be shared with the broader community. In this corpus, the types of emotions and themes are diverse, and the audio quality is very good, because segments recorded with poor quality are removed. Segments with SNR values less than 20 dB are discarded. Phone-quality speech is also removed, so segments that do not have significant energy above 4 kHz are discarded as well. Podcasts in the corpus contain 9 emotions, including angry, sad, happy, neutral, fear, surprise, disgust, others, and contempt. However, only angry, happy, neutral, and sad are selected in this paper. There are also many other real-world corpora, such as LSSED [52].
3.2 Experimental settings
In this experiment, 5 hand-crafted audio features are used, including static MFCC and their first- and second-order dynamic differences, LPC, log amplitude-frequency characteristics, Philips Fingerprints [53], and spectral entropy. The selected audio features are listed in Table 2.
In the following, the amplitude characteristic of the frequency coefficient is described by log amplitude-frequency characteristics (LAFC).
Considering that different features contribute differently to speech emotion recognition, each feature in the source domain and the target domain is weighted before training. The weights are set by the dimensions of the features in this paper. For MFCC, LPC, LAFC, Philips Fingerprint, and Spectral Entropy, the corresponding weights are β_{1}, β_{2}, β_{3}, β_{4}, and β_{5}, respectively.
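The weighting step can be sketched as scaling each feature group of a frame by its weight before concatenation. The weight values are those given for β_1 to β_5 in the parameter details of this paper; the dictionary keys, the function name, and the toy feature dimensions are hypothetical, and applying the weight as a plain multiplicative factor is an assumption of the sketch.

```python
# Weights beta_1 .. beta_5 as reported in the parameter details.
WEIGHTS = {
    "mfcc": 0.3,
    "lpc": 0.3,
    "lafc": 0.3,
    "fingerprint": 0.05,
    "spectral_entropy": 0.05,
}


def weight_features(frame_features, weights=WEIGHTS):
    """Scale each feature group of one frame by its weight and concatenate the result."""
    weighted = []
    for name, weight in weights.items():
        weighted.extend(weight * value for value in frame_features[name])
    return weighted


# Toy frame with very small, made-up feature dimensions.
frame = {
    "mfcc": [1.0, 2.0],
    "lpc": [10.0],
    "lafc": [0.0],
    "fingerprint": [20.0],
    "spectral_entropy": [2.0],
}
weighted_frame = weight_features(frame)
```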
After subspace learning and domain adaption, the weighted features in the source domain are trained. That is, the features are used to build a training set. Similarly, the weighted features in the target domain are used to build a test set.
In the training process, a constant recognition accuracy threshold α is set in advance. Next, the test set is divided into two parts with an equal amount of data, i.e., test set 1 and test set 2. Test set 1 is used to assist training, and test set 2 is used to optimize the performance of the proposed method. If the recognition accuracy of a certain type of emotion is less than α in the first training, the features corresponding to that emotion need to be retrained in the next training. The operations are repeated until one of the following conditions is met: (1) the recognition accuracy of all emotions is greater than α, or (2) the number of emotions with recognition accuracy less than α remains unchanged in two adjacent trainings.
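The two stopping conditions can be expressed as a small helper. The sketch only covers the decision logic; the training itself is outside its scope. The function name and the toy accuracies are hypothetical, and the default threshold of 0.45 is the value of α given in the parameter details.

```python
def should_stop(accuracy_by_emotion, previous_weak_count, alpha=0.45):
    """Return (stop, weak_emotions) for one training round.

    Stop when no emotion falls below alpha, or when the number of emotions
    below alpha is the same as in the previous round.
    """
    weak = sorted(e for e, acc in accuracy_by_emotion.items() if acc < alpha)
    stop = not weak or len(weak) == previous_weak_count
    return stop, weak


# Round 1: two emotions fall below alpha, so their features are retrained.
round1 = should_stop({"angry": 0.6, "happy": 0.3, "sad": 0.5, "neutral": 0.4}, None)
# Round 2: one weak emotion is left, fewer than before, so training continues.
round2 = should_stop({"angry": 0.6, "happy": 0.5, "sad": 0.5, "neutral": 0.4}, 2)
# Round 3: the count of weak emotions is unchanged, so training stops.
round3 = should_stop({"angry": 0.6, "happy": 0.5, "sad": 0.5, "neutral": 0.4}, 1)
```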
To evaluate the performance of the proposed method in the cross-corpus condition, the Berlin, NNIME, and IEMOCAP corpora are combined in pairs in this paper. Within each pair, either corpus can serve as the source domain, with the other serving as the target domain. Therefore, a total of six combination cases are designed as follows:

NB: NNIME is the source domain dataset, and Berlin is the target domain dataset.

BN: Berlin is the source domain dataset, and NNIME is the target domain dataset.

NI: NNIME is the source domain dataset, and IEMOCAP is the target domain dataset.

IN: IEMOCAP is the source domain dataset, and NNIME is the target domain dataset.

BI: Berlin is the source domain dataset, and IEMOCAP is the target domain dataset.

IB: IEMOCAP is the source domain dataset, and Berlin is the target domain dataset.
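The six cases are simply the ordered pairs of the three corpora, which can be enumerated as in the sketch below. The dictionary and variable names are illustrative only; the case labels follow the two-letter convention used above (source initial first, target initial second).

```python
from itertools import permutations

# One-letter abbreviations used in the case names above.
CORPORA = {"B": "Berlin", "N": "NNIME", "I": "IEMOCAP"}

# Every ordered (source, target) pair of distinct corpora.
cases = {
    source + target: (CORPORA[source], CORPORA[target])
    for source, target in permutations(CORPORA, 2)
}

for name, (source_corpus, target_corpus) in cases.items():
    print(f"{name}: source={source_corpus}, target={target_corpus}")
```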
3.2.1 Parameter details
A linear SVM is chosen for training and testing. The grid search method is used to optimize the kernel function coefficients of the SVM and the independent terms of the sum function. There are four hyperparameters and five feature weight coefficients in this experiment. The recognition accuracy threshold α is set to 0.45, a value determined by an informal experiment. According to the dimensions of the features, the weight coefficients β_{1}, β_{2}, β_{3}, β_{4}, and β_{5} are set to 0.3, 0.3, 0.3, 0.05, and 0.05, respectively. The complexity of the algorithm is affected by K: the larger the value of K is, the higher the algorithmic complexity is, and the more features are extracted. So, the range of the number of neighbors K is set to [3, 9]. For the two regularization parameters μ and λ, the ranges are set to {−1/4, −1/3, −1/2, 1, 1/2, 1/3, 1/4} and {0.001, 0.01, 0.1, 1, 10, 100, 1000}, respectively. Because the embedding regularization parameter μ is an exponent, a positive integer value of μ affects the values of the elements in Y, whereas a positive or negative fractional value may affect the value range of the elements in R. Hence, both integer and fractional values can be chosen for μ. The regularization parameter λ controls the relative importance of the two information entropy terms. For the proposed method, the dimension of the simplified subspace feature is set to 169.
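The candidate ranges for K, μ, and λ can be enumerated as in the following sketch. It assumes an exhaustive joint search over the three ranges, which the paper does not state; the paper only reports the ranges and analyzes each parameter in Section 3.3.5. The `toy_score` function is a hypothetical stand-in for "train the model and measure accuracy", hand-built to peak at the values the paper reports choosing, and it is not a measured result.

```python
import math
from fractions import Fraction
from itertools import product

# Candidate ranges reported in the text.
K_VALUES = list(range(3, 10))  # K in [3, 9]
MU_VALUES = [Fraction(-1, 2), Fraction(-1, 3), Fraction(-1, 4),
             Fraction(1, 4), Fraction(1, 3), Fraction(1, 2), Fraction(1)]
LAMBDA_VALUES = [0.001, 0.01, 0.1, 1, 10, 100, 1000]


def candidate_settings():
    """All (K, mu, lambda) combinations of the three ranges."""
    return list(product(K_VALUES, MU_VALUES, LAMBDA_VALUES))


def grid_search(evaluate):
    """Return the setting with the highest score under `evaluate`."""
    return max(candidate_settings(), key=lambda setting: evaluate(*setting))


def toy_score(k, mu, lam):
    """Placeholder objective (an assumption, not a real accuracy)."""
    return -(abs(k - 6) + abs(float(mu) - 0.25) + abs(math.log10(lam) - 1.0))


best = grid_search(toy_score)
print(len(candidate_settings()), best)
```

With seven candidates per parameter, a joint search visits 7 × 7 × 7 = 343 settings, which is one reason to keep each range short.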
3.2.2 Traditional linear baseline
To evaluate the performance of the proposed method for cross-corpus speech emotion recognition, the proposed method is compared, on the six experimental cases above, with several related methods that are commonly used or advanced, including transfer learning methods. These baseline methods are introduced as follows:

Principal components analysis (PCA) [54]: A dimensionality reduction method that maps data into a low-dimensional subspace through a linear transformation while preserving as much information as possible.

Linear discriminant analysis (LDA) [55]: This method finds the projection direction that maximizes the ratio of the inter-class distance to the intra-class distance. The dimensionality reduction therefore also affects the subsequent classification results.

Kernel spectral regression (KSR) [56,57,58]: In a reproducing kernel Hilbert space (RKHS), spectral regression (SR) transforms the problem of learning embedding functions into a regression problem.

Geodesic flow kernel (GFK) [59]: The shift between domains is modeled by integrating an infinite number of subspaces. These subspaces describe the changes in geometric and statistical properties from the source domain to the target domain.

Subspace alignment (SA) [60]: SA is a transfer learning algorithm that matches the features of two subspaces. The core of this method is to seek a linear transformation that aligns the data from the different domains.

Manifold embedded distribution alignment (MEDA) [61]: Taking into account the importance of both conditional and marginal distributions, a domain-invariant classifier is learned on a Grassmann manifold with structural risk minimization.

Joint distribution adaptation (JDA) [62]: The marginal and conditional probability distributions of the source and target domains are adapted jointly to reduce the distribution difference between the domains.

Transfer component analysis (TCA) [63]: The data in both domains are mapped together into a high-dimensional reproducing kernel Hilbert space. In this space, the distance between the source-domain data and the target-domain data is minimized.

Balanced distribution adaptation (BDA) [64]: Building on JDA, the weights of the marginal and conditional distributions are balanced adaptively.

Transfer joint matching (TJM) [65]: The domain difference is reduced by jointly matching features and reweighting instances across domains in a dimensionality reduction process. The resulting feature representation is invariant to both the distribution difference and irrelevant instances.
3.3 Results analysis
3.3.1 Comparison with the traditional linear baseline method
In this section, the recognition accuracy of the proposed method is compared with that of the traditional linear baseline methods. The results are shown in Tables 3 and 4.
Table 3 shows that the proposed method outperforms the other methods in most cases. Only in the IB case is its performance slightly lower than that of BDA and TJM. For the proposed method, the average recognition accuracy over the six cases is 58.20%. The lowest recognition accuracy among the six cases, 46.88%, is obtained in the IB case, and the highest, 67.75%, is obtained in the IN case. Compared with TJM, which has the highest recognition accuracy among the baseline methods, the average recognition accuracy of the proposed method is improved by 13.3%.
Weighted accuracy is an important indicator of the overall classification performance of a model, but it is affected by an unbalanced distribution of sample classes. Unweighted accuracy is therefore important for evaluating overall classification performance when the class distribution is unbalanced. Table 4 shows that the unweighted accuracy of almost all methods is lower than the weighted accuracy. For the proposed method, the unweighted accuracy is 3.27% lower than the weighted accuracy, but it still has an advantage over the baseline methods.
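The difference between the two metrics can be made concrete with a short sketch. The paper does not define them formally, so the definitions below are the ones conventionally used in speech emotion recognition and are an assumption here: weighted accuracy is the fraction of all samples classified correctly, and unweighted accuracy is the mean of the per-class recalls. The confusion matrix is a made-up example, not data from the paper.

```python
from typing import List

Matrix = List[List[int]]  # rows: true class, columns: predicted class


def weighted_accuracy(confusion: Matrix) -> float:
    """Overall accuracy: correctly classified samples over all samples."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def unweighted_accuracy(confusion: Matrix) -> float:
    """Mean per-class recall; every class counts equally regardless of size."""
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion) if sum(row) > 0]
    return sum(recalls) / len(recalls)


# Made-up, unbalanced two-class example: 100 samples of class 0, 10 of class 1.
toy_confusion = [[90, 10],
                 [5, 5]]
wa = weighted_accuracy(toy_confusion)
ua = unweighted_accuracy(toy_confusion)
```

In this toy example the weighted accuracy is 95/110 (about 0.86), while the unweighted accuracy is (0.9 + 0.5)/2 = 0.70, because the poorly recognized minority class pulls the per-class average down. This is the same direction of gap that the text reports between Tables 3 and 4.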
Furthermore, the average recognition accuracy of the proposed method, the distribution adaptation methods, and the feature selection methods is higher than that of most subspace learning methods. The reason is that the data distributions differ across domains, so traditional subspace learning algorithms perform poorly in cross-corpus speech emotion recognition. Transfer learning can be used to improve recognition performance.
In addition, the confusion matrices of the proposed method in the six cases are shown in Fig. 4. In most cases, two emotions are recognized with more than 50% accuracy. In the NB and NI cases, the highest recognition accuracy is achieved for neutral. Fig. 4b and f show that the proposed method recognizes happy well, and the highest recognition accuracy is achieved for angry in the IN and BI cases. Moreover, sad is easier to recognize than the other emotions in most cases.
3.3.2 Ablation experiment
In this section, a set of ablation experiments is conducted to verify the impact of the two parts of the proposed method on the recognition performance. The results are shown in Fig. 5. The specific settings are as follows:

Subspace learning: Only Hessian-based subspace learning is performed.

Domain adaption: Only information entropy-based domain adaption is performed.

Subspace learning and domain adaption: Hessian-based subspace learning and domain adaption are combined.
The average recognition accuracy of the ablation experiments is shown in Fig. 5. The recognition performance of the combined method (i.e., the proposed method) is better than that of the methods using only Hessian-based subspace learning or only domain adaption. The ablation experiments thus show that both Hessian-based subspace learning and domain adaption play a positive role in cross-corpus speech emotion recognition. In the NB, BN, and NI cases, the recognition accuracy of the domain adaption method is slightly higher than that of the Hessian-based subspace learning method. In contrast, in the IN, BI, and IB cases, the recognition accuracy of the Hessian-based subspace learning method is slightly higher than that of the domain adaption method.
3.3.3 Comparison with deep learning-based methods
In this section, IEMOCAP and MSP-Improv are used for cross-corpus speech emotion recognition. The ADDoG-based method and the CNN-based method [34] are chosen as reference methods, and the recognition accuracy of the proposed method is compared with that of these reference methods. The results are shown in Fig. 6.
Fig. 6 shows that when MSP-Improv is the source domain and IEMOCAP is the target domain, the unweighted accuracy of the proposed method is better than that of the CNN-based method but slightly lower than that of the ADDoG-based method. However, in the reverse experiment, the unweighted accuracy of the proposed method is slightly higher than that of both the CNN-based method and the ADDoG-based method. The performance of the ADDoG-based method is the most stable among the three methods. In general, the proposed method achieves competitive performance compared with both traditional linear methods and deep learning methods.
3.3.4 Experiment on a real-world corpus
To verify that the proposed method is also effective in the real world, the real-world corpus MSP-PODCAST and several corpora recorded in controlled experimental environments are used for cross-corpus speech emotion recognition in this section. MSP-PODCAST is used in turn as the source corpus and as the target corpus in experiments with the other corpora. The recognition accuracy of the proposed method with MSP-PODCAST as the target corpus is shown in Fig. 7, and Fig. 8 shows the recognition accuracy with MSP-PODCAST as the source corpus.
Figs. 7 and 8 show that the recognition performance of the proposed method is better when MSP-PODCAST is the target corpus than when it is the source corpus. When MSP-PODCAST is used as the source corpus, the transferable knowledge is limited by the complex acoustic conditions. The performance of speech emotion recognition is therefore indeed affected by the corpus environment. In addition, the recognition performance of the proposed method with IEMOCAP and MSP-Improv is better than that with the other corpora.
3.3.5 Parameters analysis
The influence of different parameters on the recognition performance of the proposed method is analyzed in this section. The analyzed parameters are the number of nearest neighbors K, the embedding regularization parameter μ, and the information entropy regularization parameter λ. Different recognition accuracies are obtained with different parameter values.
First, the number of nearest neighbors K is analyzed, which determines how many neighboring frames of the current frame are identified. The complexity of the algorithm is affected by K. The smaller K is, the fewer neighboring frames are identified and the fewer features are provided; the larger K is, the more neighboring frames are identified and the more features are provided. However, if K is set too large, some frames that are not useful for recognition may be identified as neighboring frames, and the algorithmic complexity becomes high. So, the range of K is set from 3 to 9 in this paper. The recognition accuracy for different values of K in the different cases is shown in Fig. 9, which shows that the proposed method achieves good recognition accuracy when K = 6. However, recognition accuracy alone is not enough to measure the recognition performance under different corpus settings. Therefore, the variance of the recognition accuracy is also introduced in the parameter analysis. For K, the variances under different corpus settings are shown in Fig. 10. Although the variance of the recognition accuracy reaches its maximum when K = 6, it differs only slightly across the values of K. Therefore, considering both algorithmic complexity and recognition performance, K is set to 6 in this paper.
Then, the embedding regularization parameter μ is analyzed, which is used to control the value of the embedded coordinates. The range of μ is set to {−1/2, −1/3, −1/4, 1/4, 1/3, 1/2, 1} in this paper. The recognition accuracy of the proposed method for different values of μ in the different cases is shown in Fig. 11, which shows that the proposed method achieves good recognition accuracy when μ = 1/4. The variance of the recognition accuracy for different values of μ under different corpus settings is shown in Fig. 12. Although the variance of the recognition accuracy is very small when μ = 1, the recognition accuracy is significantly lower than that obtained with the other values. Therefore, considering both the recognition accuracy and its variance, μ = 1/4 is chosen in this paper.
Finally, the information entropy regularization parameter λ is analyzed, which controls the weight of the information entropy. The range of λ is set to {0.001, 0.01, 0.1, 1, 10, 100, 1000} in this paper. The recognition accuracy of the proposed method for different values of λ in the different cases is shown in Fig. 13. As shown in Fig. 13, the recognition accuracy changes greatly when λ = 100 and λ = 1000. Although the recognition accuracy in both the NI and BI cases exceeds 70% when λ = 100, it is not stable in these two cases, as shown in Fig. 14. Therefore, as a compromise between the recognition accuracy and its variance, λ = 10 is chosen in this paper.
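The two statistics used throughout this analysis, the average accuracy and the variance of the accuracy across corpus settings, can be computed per candidate value as in the sketch below. It assumes that the variance is taken over the six corpus cases and is a population variance; the paper does not specify either point. The accuracy values are made-up placeholders, not the values plotted in Figs. 9 to 14.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def summarize(accuracy_by_value: Dict[object, List[float]]) -> Dict[object, Tuple[float, float]]:
    """Map each candidate parameter value to (mean accuracy, variance of accuracy).

    Each list holds one accuracy per corpus case (e.g., NB, BN, NI, IN, BI, IB).
    """
    return {
        value: (mean(accuracies), pvariance(accuracies))
        for value, accuracies in accuracy_by_value.items()
    }


# Placeholder accuracies for three hypothetical candidate values of K.
placeholder_runs = {
    5: [0.50, 0.52, 0.54, 0.50, 0.52, 0.54],
    6: [0.60, 0.50, 0.70, 0.60, 0.50, 0.70],
    7: [0.55, 0.55, 0.55, 0.55, 0.55, 0.55],
}
summary = summarize(placeholder_runs)
for value, (avg, var) in summary.items():
    print(f"value={value}: mean={avg:.3f}, variance={var:.5f}")
```

A candidate with a high mean but also a high variance (like the placeholder value 6 here) illustrates the kind of trade-off discussed above, where the final choice is a compromise between the two statistics rather than the maximum of either one.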
3.4 Complexity analysis
For the performance evaluation of a method, both recognition accuracy and model complexity should be considered. For the deep learning-based methods, the complexity of the model is determined by the network structure and the number of parameters. Therefore, a complexity analysis of the proposed method and the reference methods is given in this subsection. For the CNN-based method, the feature encoder consists of two convolution layers and a max pooling layer, and the emotion classifier consists of fully connected layers and softmax. On this basis, the ADDoG model adds a critic composed of fully connected layers. As the number of input MFBs increases, the amount of computation and the number of trainable parameters of each layer increase further. In addition, during training, when the number of samples in the source domain and target domain increases, the computational complexity of the loss function and the number of iterations increase. Although there is a user-defined maximum number of iterations for the proposed method, convergence is achieved in 50 or fewer iterations on average under each experimental setting. In summary, the proposed method requires relatively few adaptation steps compared with fine-tuning a whole deep neural network.
4 Conclusion
In this paper, a cross-corpus speech emotion recognition method is proposed using subspace learning and domain adaptation. In the subspace learning part, the Hessian matrix is introduced to locally embed the features in both source and target domains to form the feature subspace. In the domain adaption part, the mapping relationship is constructed based on information entropy. Then, the common space of both the source and target domains is obtained, which reduces the discrepancy in feature distribution between the source and target domains. Extensive experiments on datasets in three different languages are conducted to verify the performance of the proposed method.
Availability of data and materials
Not applicable.
References
S. Zhao, G. Jia, J. Yang, G. Ding, K. Keutzer, Emotion recognition from multiple modalities: fundamentals and methodologies. IEEE Sign. Process. Magazine 38(6), 59–73 (2021)
X. Wu, S. Hu, Z. Wu, X. Liu, H. Meng, in 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP). Neural architecture search for speech emotion recognition (2022), pp. 1–4
C.C. Lee, K. Sridhar, J.L. Li, W.C. Lin, S. BoHao, C. Busso, Deep representation learning for affective speech signal analysis and processing: preventing unwanted signal disparities. IEEE Sign. Process. Magazine 38(6), 22–38 (2021)
J.S. Gómez-Cañón, E. Cano, T. Eerola, P. Herrera, H. Xiao, Y.H. Yang, E. Gómez, Music emotion recognition: toward new, robust standards in personalized and context-sensitive applications. IEEE Sign. Process. Magazine 38(6), 106–114 (2021)
W. ChungHsien, W.B. Liang, Emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information and semantic labels. IEEE Trans. Affect. Comput. 2(1), 10–21 (2011)
J.H. Hsu, M.H. Su, C.H. Wu, Y.H. Chen, Speech emotion recognition considering nonverbal vocalization in affective conversations. IEEE/ACM Trans. Audio, Speech Lang. Process. 29, 1675–1686 (2021)
B. Chen, Q. Cao, M. Hou, Z. Zhang, G. Lu, D. Zhang, Multimodal emotion recognition with temporal and semantic consistency. IEEE/ACM Trans. Audio, Speech Lang. Process. 29, 3592–3603 (2021)
B.T. Atmaja, A. Sasou, M. Akagi, Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion. Speech Commun. 140, 11–28 (2022)
Y. Jin, P. Song, W. Zheng, L. Zhao, Novel feature fusion method for speech emotion recognition based on multiple kernel learning. J. South. Univ. (English Edition) 29(2), 129–133 (2013)
U. Garg, S. Agarwal, S. Gupta, R. Dutt, D. Singh, in 2020 12th International Conference on Computational Intelligence and Communication Networks (CICN). Prediction of emotions from the audio speech signals using MFCC, MEL and Chroma (2020), pp. 87–91
N.P. Jagini, R.R. Rao, in 2017 International Conference on Intelligent Computing and Control Systems (ICICCS). Exploring emotion specific features for emotion recognition system using PCA approach (2017), pp. 58–62
S.R. Krishna, R.R. Rao, in 2017 International Conference on Communication and Signal Processing (ICCSP). Exploring robust spectral features for emotion recognition using statistical approaches (2017), pp. 1838–1843
J.A. Russell, A circumplex model of affect. J. Pers. Soc. Psychol. 39, 1161–1178 (1980)
J. Posner, J.A. Russell, B.S. Peterson, The circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology. Dev. Psychopathol. 17(3), 715–734 (2005)
A. Mehrabian, Basic dimensions for a general psychological theory (Oelgeschlager, Gunn & Hain, Incorporated, Cambridge, 1980), pp. 39–53
R.F. Bales, Social interaction systems: theory and measurement (Transaction Publishers, Piscataway, 2001), pp. 139–140
Y. Zhou, X. Liang, Y. Gu, Y. Yin, L. Yao, Multi-classifier interactive learning for ambiguous speech emotion recognition. IEEE/ACM Trans. Audio, Speech Lang. Process. 30, 695–705 (2022)
Y. Pan, P. Shen, L. Shen, Speech emotion recognition using support vector machine. Int. J. Smart Home 6(2), 101–108 (2012)
S. Mao, D. Tao, G. Zhang, P.C. Ching, T. Lee, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. Revisiting hidden Markov models for speech emotion recognition (2019), pp. 6715–6719
H. Hu, M. Xu, W. Wu, in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07. GMM supervector based SVM with spectral features for speech emotion recognition (2007), pp. IV-413–IV-416
Y.C. Kao, C.T. Li, T.C. Tai, J.C. Wang, in 2021 9th International Conference on Orange Technology (ICOT). Emotional speech analysis based on convolutional neural networks (2021), pp. 1–4
C.H. Park, D.W. Lee, K.B. Sim, in 2002 International Conference on Machine Learning and Cybernetics. Emotion recognition of speech based on RNN, vol 4 (2002), pp. 2210–2213
S. Wang, X. Ling, F. Zhang, J. Tong, in 2010 International Conference on Measuring Technology and Mechatronics Automation. Speech emotion recognition based on principal component analysis and back propagation neural network (2010), pp. 437–440
K.H. Lee, H. Kyun Choi, B.T. Jang, D.H. Kim, in 2019 International Conference on Information and Communication Technology Convergence (ICTC). A study on speech emotion recognition using a deep neural network (2019), pp. 1162–1165
X. Wu et al., in IEEE/ACM Transactions on Audio, Speech, and Language Processing. Speech emotion recognition using sequential capsule networks, vol 29 (2021), pp. 3280–3291
L. Yi, M.W. Mak, Improving speech emotion recognition with adversarial data augmentation network. IEEE Trans. Neural Netw. Learn. Syst. 33(1), 172–184 (2022)
S. Mao, P.C. Ching, T. Lee, Enhancing segment-based speech emotion recognition by iterative self-learning. IEEE/ACM Trans. Audio, Speech Lang. Process. 30, 123–134 (2022)
N. Liu et al., Transfer subspace learning for unsupervised cross-corpus speech emotion recognition. IEEE Access 9, 95925–95937 (2021)
P. Song, W. Zheng, Feature selection based transfer subspace learning for speech emotion recognition. IEEE Trans. Affect. Comput. 11(3), 373–382 (2020)
H. Luo, J. Han, Nonnegative matrix factorization based transfer subspace learning for cross-corpus speech emotion recognition. IEEE/ACM Trans. Audio, Speech Lang. Process. 28, 2047–2060 (2020)
P. Song, Transfer linear subspace learning for cross-corpus speech emotion recognition. IEEE Trans. Affect. Comput. 10(2), 265–275 (2019)
J. Deng, X. Xu, Z. Zhang, S. Frühholz, B. Schuller, Universum autoencoder-based domain adaptation for speech emotion recognition. IEEE Sign. Process. Lett. 24(4), 500–504 (2017)
Y. Zong, W. Zheng, T. Zhang, X. Huang, Cross-corpus speech emotion recognition based on domain-adaptive least-squares regression. IEEE Sign. Process. Lett. 23(5), 585–589 (2016)
J. Gideon, M.G. McInnis, E.M. Provost, Improving cross-corpus speech emotion recognition with adversarial discriminative domain generalization (ADDoG). IEEE Trans. Affect. Comput. 12(4), 1055–1068 (2021)
M. Abdelwahab, C. Busso, Domain adversarial for acoustic emotion recognition. IEEE/ACM Trans. Audio, Speech Lang. Process. 26(12), 2423–2435 (2018)
W. Zhang, P. Song, Transfer sparse discriminant subspace learning for cross-corpus speech emotion recognition. IEEE/ACM Trans. Audio, Speech Lang. Process. 28, 307–318 (2020)
D.L. Donoho et al., Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. U. S. A. 100(10), 5591–5596 (2003)
L. Zhang, D. Tao, W. Liu, in Proceedings of the 16th International Conference on Communication Technology. Supervised Hessian Eigenmap for dimensionality reduction (IEEE, Hangzhou, China, 2015), pp. 903–907
F. Asano, Y. Suzuki, D.C. Swanson, Optimization of control source configuration in active control systems using Gram-Schmidt orthogonalization. IEEE Trans. Speech Audio Process. 7(2), 213–220 (1999)
F. Nie, H. Huang, X. Cai, et al., in Proceedings of the 24th Annual Conference on Neural Information Processing Systems. Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization (NIPS, Vancouver, BC, Canada, 2010), pp. 1–9
R. He, T. Tan, L. Wang, W. Zheng, in 2012 IEEE Conference on Computer Vision and Pattern Recognition. ℓ2,1 regularized correntropy for robust feature selection (2012), pp. 2504–2511
Y. Shi, F. Sha, in Proceedings of the 29th International Conference on Machine Learning. Information-Theoretical Learning of Discriminative Clusters for Unsupervised Domain Adaptation (IMLS, Edinburgh, United Kingdom, 2012), pp. 1079–1086
B. Gholami, P. Sahu, O. Rudovic, K. Bousmalis, V. Pavlovic, Unsupervised multi-target domain adaptation: an information theoretic approach. IEEE Trans. Image Process. 29, 3993–4002 (2020)
Y. Tu, M. Mak, J. Chien, Variational domain adversarial learning with mutual information maximization for speaker verification. IEEE/ACM Trans. Audio, Speech Lang. Process. 28, 2013–2024 (2020)
D. Xin, T. Komatsu, S. Takamichi, H. Saruwatari, in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Disentangled speaker and language representations using mutual information minimization and domain adaptation for cross-lingual TTS (2021), pp. 6608–6612
X. Wang, L. Yan, Q. Zhang, in Proceedings of the International Conference on Computer Network, Electronic and Automation. Research on the Application of Gradient Descent Algorithm in Machine Learning (IEEE, Xi'an, China, 2021), pp. 11–15
F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, B. Weiss, in Proceedings of the Interspeech. A database of German emotional speech (ISCA, Lisbon, Portugal, 2005), pp. 1517–1520
H.C. Chou, W.C. Lin, L.C. Chang, C.C. Li, H.P. Ma, C.C. Lee, in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII). NNIME: the NTHU-NTUA Chinese interactive multimodal emotion corpus (2017), pp. 292–298
C. Busso, M. Bulut, C.C. Lee, et al., IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resourc. Eval. 42(4), 335–359 (2008)
C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, E.M. Provost, MSP-IMPROV: an acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 8(1), 119–130 (2017)
R. Lotfian, C. Busso, Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Trans. Affect. Comput. 10(4), 471–483 (2019)
W. Fan et al., in Proceedings of the International Conference on Acoustics, Speech and Signal Processing. LSSED: a large-scale dataset and benchmark for speech emotion recognition (IEEE, Toronto, Canada, 2021), pp. 641–645
J. Haitsma, T. Kalker, in Proceedings of the 3rd International Conference on Music Information Retrieval. A highly robust audio fingerprinting system (ISMIR, Paris, France, 2002), pp. 107–115
Y.C. Du, W.C. Hu, L.Y. Shyu, in The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. The effect of data reduction by independent component analysis and principal component analysis in hand motion identification (2004), pp. 84–86
S. Ji, J. Ye, Generalized linear discriminant analysis: a unified framework and efficient model selection. IEEE Trans. Neural Netw. 19(10), 1768–1782 (2008)
D. Cai, Spectral Regression: A Regression Framework for Efficient Regularized Subspace Learning. (Doctoral dissertation, University of Illinois at Urbana-Champaign), 2009
D. Cai, X. He, J. Han, Speed up kernel discriminant analysis. Int. J. Very Large Data Bases 20(1), 187–191 (2011)
D. Cai, X. He, J. Han, in Seventh IEEE International Conference on Data Mining (ICDM 2007). Spectral regression: a unified approach for sparse subspace learning (2007), pp. 73–82
B. Gong, Y. Shi, F. Sha, K. Grauman, in 2012 IEEE Conference on Computer Vision and Pattern Recognition. Geodesic flow kernel for unsupervised domain adaptation (2012), pp. 2066–2073
B. Fernando, A. Habrard, M. Sebban, T. Tuytelaars, in 2013 IEEE International Conference on Computer Vision. Unsupervised visual domain adaptation using subspace alignment (2013), pp. 2960–2967
J. Wang, W. Feng, Y. Chen, et al., in Proceedings of the ACM Multimedia Conference. Visual Domain Adaptation with Manifold Embedded Distribution Alignment (ACM, Seoul, Korea, 2018), pp. 402–410
M. Long, J. Wang, G. Ding, J. Sun, P.S. Yu, in 2013 IEEE International Conference on Computer Vision. Transfer feature learning with joint distribution adaptation (2013), pp. 2200–2207
S.J. Pan, I.W. Tsang, J.T. Kwok, Q. Yang, Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw. 22(2), 199–210 (2011)
J. Wang, Y. Chen, S. Hao, W. Feng, Z. Shen, in 2017 IEEE International Conference on Data Mining (ICDM). Balanced distribution adaptation for transfer learning (2017), pp. 1129–1134
M. Long, J. Wang, G. Ding, J. Sun, P.S. Yu, in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Transfer joint matching for unsupervised domain adaptation (2014), pp. 1410–1417
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grants (61971015), Beijing Natural Science Foundation (No. L223033), and the Cooperative Research Project of BJUTNTUT (No. NTUTBJUT11005).
Funding
This work was supported by the National Natural Science Foundation of China under Grants (61971015) and the Cooperative Research Project of BJUTNTUT (No. NTUTBJUT11005).
Author information
Contributions
CX performed the whole research and wrote the paper. JM provided support to the writing and experiments. The authors read and approved the final version of the paper.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Cao, X., Jia, M., Ru, J. et al. Cross-corpus speech emotion recognition using subspace learning and domain adaption. J AUDIO SPEECH MUSIC PROC. 2022, 32 (2022). https://doi.org/10.1186/s13636022002645
DOI: https://doi.org/10.1186/s13636022002645
Keywords
 Speech emotion recognition
 Cross-corpus
 Subspace learning
 Domain adaption