Cross-corpus speech emotion recognition using subspace learning and domain adaption

Cao, Xuan; Jia, Maoshen; Ru, Jiawei; Pai, Tun-wen

doi:10.1186/s13636-022-00264-5

Methodology
Open access
Published: 27 December 2022


Xuan Cao¹,
Maoshen Jia ORCID: orcid.org/0000-0002-3452-3913¹,
Jiawei Ru¹ &
…
Tun-wen Pai²

EURASIP Journal on Audio, Speech, and Music Processing volume 2022, Article number: 32 (2022)


Abstract

Speech emotion recognition (SER) is a hot topic in speech signal processing. When the training data and the test data come from different corpora, their feature distributions differ, which degrades the recognition performance. To solve this problem, a cross-corpus speech emotion recognition method based on subspace learning and domain adaptation is proposed in this paper. Specifically, the training set data and the test set data are used to form the source domain and the target domain, respectively. Then, the Hessian matrix is introduced to obtain the subspace for the extracted features in both the source and target domains. In addition, an information entropy-based domain adaption method is introduced to construct the common space. In the common space, the difference between the feature distributions in the source domain and the target domain is reduced as much as possible. To evaluate the performance of the proposed method, extensive experiments are conducted on cross-corpus speech emotion recognition. Experimental results show that the proposed method achieves better performance than some existing subspace learning and domain adaptation methods.

1 Introduction

There are many ways for people to express emotions, such as through speech, actions, and facial expressions. Among these, speech is an important one, because it carries rich emotions, such as happiness, anger, and sadness. Speakers can convey their intentions through different tones, volumes, or content. How to judge a speaker’s emotion from speech therefore becomes crucial. Speech emotion recognition (SER) is an important branch of multimodal affective computing, and it is also an important part of speech recognition. With the development of SER, it has been applied in fields such as psychotherapy and human-computer interaction. According to the results of SER, a machine can generate appropriate responses for the user in an interactive environment. Therefore, SER is one of the most important technologies for human-computer interaction [1,2,3,4].

Semantic-based methods are an important class of SER methods, because emotions can be expressed effectively by semantics. If speakers use emotive words to communicate with others, the emotion can be judged directly from the semantics of the words. Therefore, semantic-based research gradually began to develop. A multi-classifier emotion recognition model based on prosodic information and semantic labels is introduced in [5]. Similarly, semantic labels and non-verbal sounds in speech, such as crying, laughter, or sighing, are used for SER in [6]. Subsequently, temporal and semantic coherence is introduced for SER [7]. In addition, a bimodal SER model that fuses acoustic and linguistic information is proposed in [8].

Although semantic understanding is simple for humans, it is a complex process for machines. Therefore, more research is currently aimed at speech features that are easily processed by machines, which are also important for SER. Compared with semantic information, speech features are more abstract, but they are very important for expressing the speaker’s emotions. The main features used in SER are divided into acoustic features and spectral features. The acoustic features include intensity, pitch, and timbre. Features like energy, Mel-frequency cepstral coefficients (MFCC), linear prediction coefficients (LPC), and fundamental frequency are called spectral features. Features such as pitch, MFCC, formant, intensity, and chroma are adopted for SER in [9, 10]. Also, pitch, spectrum, and formant are combined with semantic information for recognizing emotions in [5]. To improve the robustness of SER, some methods have been used to process the features. Specifically, PCA is adopted to reduce the dimensionality of the features [11], and a statistical method is utilized to find robust spectral features [12].

In practical scenarios, the speaker’s emotion is very complex. The speaker may have multiple emotions at the same time, rather than a single emotion, or the emotion expressed by the speaker is inconsistent with the actual emotion. It makes SER difficult. There is also research proposed for complex emotions. A circular continuous dimensional model to describe an emotion, called valence-arousal model (VA) was proposed in [13, 14]. The model no longer regards emotions as discrete but uses two-dimensional coordinates to describe the continuous distribution of emotions. The PAD emotional model was shown in [15, 16], which has P (pleasure), A (arousal), and D (dominance) values to represent all emotional states. In addition, based on the emotional probability distribution, an ambiguous label is proposed to solve the inconsistency problem in ambiguous emotional cognition [17].

Another problem in SER is how to recognize emotions. To this end, some machine learning methods were adopted to recognize emotions, such as the support vector machine (SVM) [18], hidden Markov model (HMM) [19], and Gaussian mixture model (GMM) [20]. In recent years, with the rapid development of deep learning, various neural network structures have been introduced in SER. Convolutional neural networks (CNN) [21], recurrent neural networks (RNN) [22], back propagation neural networks (BPNN) [23], deep neural networks (DNN) [24], sequential capsule networks [25], and adversarial data augmentation networks [26] have all been used for SER. A segment-based iterative self-learning enhanced speech emotion recognition model is proposed in [27]. The above algorithms perform well in traditional SER, and the recognition accuracy of some algorithms can even exceed 80% in some corpus settings. In practical scenarios, however, speech signals are recorded in different scenes and do not belong to a single specific corpus. The speech data is also affected by language, gender, speaking style, and other factors. So, when the training set and the test set come from different corpora, the training and testing data often follow different feature distributions, and the recognition performance is reduced.

Therefore, transfer learning is adopted to solve the cross-corpus problem [28]. The known corpus data is considered as the source domain, and the unknown data to be learned constitutes the target domain. Transfer learning transfers the knowledge of the source domain to the target domain to reduce the data distribution difference between the two domains. In SER, the features of the source and target domains are distributed in different spaces, so the transfer from the source domain to the target domain is a feature-based transfer, that is, a mapping relationship between the two domains is established to reduce the differences in feature distributions. With the development of transfer learning, more transfer learning algorithms are applied to SER. Among them, in order to solve the cross-corpus SER problem, much research focuses on transfer subspace learning and domain adaptation, such as unsupervised transfer subspace learning [28], transfer subspace learning based on feature selection [29], transfer subspace learning based on non-negative matrix factorization [30], transfer linear subspace learning [31], and Universum autoencoder-based domain adaptation [32]. In addition, a cross-corpus speech emotion recognition method based on domain adaptive least squares regression is proposed in [33], and in [34, 35], ADDoG-based and DANN-based methods are proposed according to the idea of domain adversarial learning. Most of the above methods involve transfer subspace learning and domain adaptation, which are important issues in transfer learning and the focus of this paper. The two parts are considered jointly in this paper. Therefore, inspired by the framework in [36], a cross-corpus speech emotion recognition method is proposed.

The contributions of the proposed method are summarized as follows:

The proposed method combines subspace learning and mapping to realize speech emotion recognition across corpora. The feasibility of the proposed method is demonstrated by experimental results.
In this paper, a subspace learning model is constructed based on the Hessian matrix, so that the extracted features in both the source domain and the target domain are robust in their independent subspaces, which improves the subsequent cross-corpus transfer ability.
Information entropy is used to establish a domain adaption model in the proposed method. Numerical descent is used to minimize the information entropy, so that a common space of the source and target domains is learned, thereby reducing the difference in feature distribution between the two domains.

The rest of the paper is organized as follows. In Section 2, the specific process of the proposed method is introduced, along with some optimizations. In Section 3, the emotion recognition performance of the proposed method is analyzed on three public datasets, and the effects of different parameters on the performance are analyzed through experiments. Finally, the conclusion is drawn in Section 4.

2 The proposed method

A cross-corpus speech emotion recognition method is proposed by combining subspace learning and domain adaption. The block diagram of the proposed method is shown in Fig. 1.

Firstly, features of speech in the source corpus and target corpus are extracted to form the source domain and the target domain. Then, the Hessian-based subspace learning is performed on the feature in the source domain and the target domain to obtain low-dimensional features for forming their own independent subspace. The flowchart of the Hessian-based subspace learning part is shown in Fig. 2. Furthermore, the mapping relationship between the source domain subspace and the target domain subspace is established by using information entropy, which is used for reducing the difference of feature distribution between different domains. This mapping relationship is revealed by the common space. Therefore, it is important to find the common space corresponding to the two domains in this method. The flowchart of the domain adaption part is shown in Fig. 3. Finally, emotions are predicted.

In the part of Hessian-based subspace learning, the neighboring frames of the current frame are found based on neighborhood calculation. Then, the Hessian matrix [37] is constructed for low-dimensional embedding to obtain the subspace of the source and target domain, respectively.

After obtaining the subspaces of the source and target domains, the transformation matrix is obtained through the correlation coefficients of the subspace. Then, the distance between the feature data of each frame in the source domain subspace and that of each frame in the target domain subspace is calculated. The probability that a frame in the target domain subspace is a neighbor of each frame in the source domain is obtained according to this distance. In this way, the posterior probability that the features of each frame in the target domain subspace belong to a certain class can be estimated from the known class labels of the frames in the source domain subspace. Then, the entropy between the target domain features and emotion labels and the entropy between the features and domain labels of the two domains are calculated. Finally, the two information entropies are jointly optimized by numerical descent. The mapping relationship between the source domain subspace and the target domain subspace is thereby acquired, and it is described by a common space.

In the following, Hessian-based subspace learning [38] and the domain adaption based on information entropy are introduced in detail. Finally, a specific optimization method for finding the common space is given.

2.1 Hessian-based subspace learning

An input feature matrix X=(x_mn)_M × N is given, which is composed of the features of the speech. m and n are the feature index and the frame index, respectively. M and N are the total number of the feature dimension and the number of frames, respectively. First, the feature energy of each frame is as follows:

$$x_n^\text{e}={\textstyle\sum_{m=1}^M}\;x_{mn}^2,$$

(1)

where ${x}_n^{\textrm{e}}$ represents the feature energy of the nth frame, and x_mn represents the feature of the mth dimension in the nth frame.

Thus, an energy matrix can be formed as X^e= [${x}_1^{\textrm{e}},{x}_2^{\textrm{e}},\dots, {x}_N^{\textrm{e}}$]. Then, two new feature energy matrices A and B, which are used for calculating the distance of the feature between different frames, are defined as follows:

$$\left\{\begin{array}{c}\textbf{A}={\left({a}_{ij}\right)}_{N\times N}\\ {}\textbf{B}={\left({b}_{ij}\right)}_{N\times N}\end{array}\right.$$

(2)

where ${a}_{ij}={x}_j^{\textrm{e}}$, ${b}_{ij}={x}_i^{\textrm{e}}$, 1 ≤ i, j ≤ N, and i and j represent the index of the row and column, respectively. In order to find the nearest K frames of each frame, the distance D_e = (d_ij)_N × N of the feature between different frames is calculated as follows:

$${\textbf{D}}_e=\textbf{A}+\textbf{B}-2{\textbf{X}}^T\textbf{X}$$

(3)

where d_ij represents the distance between the feature energy of the ith frame and the jth frame. The smaller the distance d_ij is, the closer the feature energies of the ith frame and the jth frame are. In fact, the definition of distance D_e is derived from Euclidean distance. A and B are formed by the square of the elements in the input matrix X. According to Eqs. (1), (2), and (3), the distance defined in this paper meets the requirements of non-negativity, directness, and identity. A and B are constructed in a way that also satisfies the symmetry of the distance.
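The computation in Eqs. (1), (2), and (3) can be illustrated with a minimal sketch. It assumes the feature matrix is stored as a plain list of M rows with N columns (one column per frame); the function names and the toy input are hypothetical and are not taken from the paper.

```python
def frame_energies(X):
    """Eq. (1): sum of squared features for each frame (column) of X."""
    M, N = len(X), len(X[0])
    return [sum(X[m][n] ** 2 for m in range(M)) for n in range(N)]


def pairwise_distances(X):
    """Eq. (3): D_e = A + B - 2 X^T X, with a_ij = x_j^e and b_ij = x_i^e."""
    M, N = len(X), len(X[0])
    energy = frame_energies(X)
    D = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            # (X^T X)_ij is the inner product of frame i and frame j.
            gram = sum(X[m][i] * X[m][j] for m in range(M))
            D[i][j] = energy[j] + energy[i] - 2.0 * gram
    return D


# Toy input (assumed for illustration): M = 2 features, N = 3 frames.
X = [[0.0, 3.0, 1.0],
     [0.0, 4.0, 1.0]]
D = pairwise_distances(X)
```

Each entry D[i][j] equals the squared Euclidean distance between frames i and j, which is why the diagonal is zero and the matrix is symmetric.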

The jth column of the matrix D_e (i.e., ${\boldsymbol{d}}_j^e={\left[{d}_{1j}^e,{d}_{2j}^e,\dots, {d}_{Nj}^e\right]}^T$) denotes the vector of feature-energy distances between the jth frame and every frame. Sorting it in ascending order gives the vector ${\textbf{d}}_j^{eS}={\left[{d}_{S_j(1)j}^e,{d}_{S_j(2)j}^e,\dots, {d}_{S_j(N)j}^e\right]}^T$, where S_j(i) denotes the index of the frame with the ith smallest distance from the jth frame: S_j(1) is the index of the minimum distance in ${d}_{ij}^e$, and S_j(N) is the index of the maximum distance. It is worth mentioning that for each frame, ${d}_{jj}^e$ is the minimum element in ${\boldsymbol{d}}_j^e$, i.e., S_j(1) = j. The frames with the 2nd to the (K+1)-th smallest distances in ${\textbf{d}}_j^{eS}$ are selected to form the adjacent index vector i_j = [S_j(2), S_j(3), …, S_j(K + 1)]^T of the jth frame, where K denotes the number of nearest neighbor frames. Thereby, the K×N adjacent index matrix I = [i₁, i₂, …, i_N] of the N frames is obtained. Then, the columns of the input matrix X corresponding to the indices in I are selected to form a neighborhood matrix Z_n, which is defined as follows:

$${\textbf{Z}}_n={\left({z}_{mk}^n\right)}_{M\times K}$$

(4)

where ${z}_{mk}^n={x}_{m{S}_n\left(k+1\right)}$, 1 ≤ k ≤ K, 1 ≤ m ≤ M, 1 ≤ n ≤ N. k, m, and n are the neighbor index, the feature index, and the frame index, respectively. Z_n represents the neighborhood matrix corresponding to the nth frame.
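The neighbor selection and Eq. (4) can be sketched as follows. This is a minimal illustration under two assumptions: indices are 0-based, and the frame itself is excluded explicitly instead of skipping the first sorted entry (the two are equivalent whenever S_j(1) = j). The function names and the toy distance matrix are hypothetical.

```python
def neighbor_indices(D, K):
    """For each frame j, return the indices of its K nearest other frames."""
    N = len(D)
    I = []
    for j in range(N):
        # Candidates are all frames except j itself, sorted by distance to j.
        candidates = [i for i in range(N) if i != j]
        candidates.sort(key=lambda i: D[i][j])
        I.append(candidates[:K])
    return I


def neighborhood_matrix(X, idx):
    """Eq. (4): gather the columns of X listed in idx into an M x K matrix."""
    return [[row[i] for i in idx] for row in X]


# Toy data (assumed): 3 frames with 2 features and their squared distances.
X = [[0.0, 3.0, 1.0],
     [0.0, 4.0, 1.0]]
D = [[0.0, 25.0, 2.0],
     [25.0, 0.0, 13.0],
     [2.0, 13.0, 0.0]]
I = neighbor_indices(D, K=2)
Z_1 = neighborhood_matrix(X, I[1])  # neighborhood of the second frame
```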

E_n is a centralized matrix of Z_n, which is defined as follows:

$${\textbf{E}}_n={\left({e}_{mk}^n\right)}_{M\times K}$$

(5)

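where $e_{mk}^n=\frac{1}{K}\sum_{k'=1}^{K}z_{mk'}^n$, i.e., every entry in the mth row of E_n equals the mean of the mth row of Z_n over the K neighbors.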
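As a small illustration of the centering that precedes the singular value decomposition, the sketch below subtracts from each row of a neighborhood matrix its mean over the K neighbors, which corresponds to forming Z_n − E_n. The function name and the list-of-rows layout are assumptions made for this example.

```python
def center_rows(Z):
    """Return Z - E, where each row of E repeats the mean of that row of Z."""
    centered = []
    for row in Z:
        mean = sum(row) / len(row)  # mean over the K neighbors
        centered.append([z - mean for z in row])
    return centered


# Toy neighborhood matrix (assumed): M = 2 features, K = 2 neighbors.
Z = [[1.0, 3.0],
     [2.0, 2.0]]
Z_centered = center_rows(Z)
```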

The purpose of the proposed Hessian-based subspace learning is to obtain the local coordinates of the neighborhood, which are transitioned by tangent coordinates. The tangent space consists of tangent coordinates, which is regarded as a subspace of the Euclidean space. A standard orthogonal coordinate system is associated with the inner product inheritance of the Euclidean space, which can be obtained by using singular value decomposition. Therefore, Z_n − E_n is subjected to singular value decomposition. The standard orthonormal basis ${\textbf{V}}_n={\left({v}_{ij}^n\right)}_{K\times K}$ can be obtained by singular value decomposition as follows:

$${\textbf{Z}}_n-{\textbf{E}}_n={\textbf{U}}_n{\boldsymbol{\Sigma}}_n{\textbf{V}}_n^T$$

(6)

where (·)^T denotes transposition. U_n is the left singular vector of Z_n − E_n. Σ_n is a diagonal matrix of singular values.

First d columns of V_n are extracted to constitute the tangent coordinates ${\textbf{V}}_n^d={\left({v}_{ij}^n\right)}_{K\times d}$ with dimension K × d.

Next, an association Hessian matrix Q_n is given by using ${\textbf{V}}_n^d$, which is defined as follows:

$${\textbf{Q}}_n={\left({q}_{kj}^n\right)}_{K\times \frac{d\left(d+1\right)}{2}}$$

(7)

where ${q}_{kj}^n={v}_{k{j}_1}^n{v}_{k{j}_2}^n$, n is the frame index, 1≤ n ≤N. j₁ and j₂ are the dimension indexes. The corresponding relationship among j, j₁, and j₂ is given as follows:

$$j=j_2+{\textstyle\sum_{l=1}^{j_1-1}}{\textstyle\sum_{i=j_1}^{d}1}$$

(8)

where 1 ≤ j₁ ≤ d, 1 ≤ j₂ ≤ d, $j=1,2,\dots, \frac{d\left(d+1\right)}{2}$.
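The construction of Q_n from the tangent coordinates can be sketched as below. The sketch assumes that the d(d+1)/2 columns correspond to the index pairs (j₁, j₂) with j₁ ≤ j₂, enumerated row by row; this ordering is an assumption made for illustration, and the column order actually intended is the one fixed by Eq. (8). The function names are hypothetical.

```python
def index_pairs(d):
    """List the 1-based pairs (j1, j2) with j1 <= j2 in row-major order."""
    return [(j1, j2) for j1 in range(1, d + 1) for j2 in range(j1, d + 1)]


def hessian_products(V):
    """Build Q (K x d(d+1)/2) with entries q_kj = v_{k,j1} * v_{k,j2}."""
    d = len(V[0])
    pairs = index_pairs(d)
    return [[row[j1 - 1] * row[j2 - 1] for (j1, j2) in pairs] for row in V]


# Toy tangent coordinates (assumed): K = 2 neighbors, d = 2 dimensions.
V_d = [[2.0, 3.0],
       [1.0, -1.0]]
Q = hessian_products(V_d)
```

For d = 2 the columns are the products for the pairs (1,1), (1,2), and (2,2), so Q has three columns as required by Eq. (7).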

Furthermore, an estimation matrix ${\textbf{L}}_n={\left({l}_{ij}^n\right)}_{K\times \left(1+d+\frac{d\left(d+1\right)}{2}\right)}$ is constructed as follows:

$${l}_{ij}^n=\begin{cases}1, & j=1\\ {v}_{ij}^n, & 2\le j\le d\\ {q}_{ij}^n, & d+1\le j\le \frac{d\left(d+1\right)}{2}\end{cases}$$

(9)

where 1 ≤ i ≤ K, 1 ≤ n ≤ N.

${\textbf{G}}_n={\left({g}_{ij}^n\right)}_{K\times \left(1+d+\frac{d\left(d+1\right)}{2}\right)}$ can be obtained by Gram-Schmidt orthogonalization of the estimation matrix L_n [39]. The last $\frac{d\left(d+1\right)}{2}$ columns of G_n are taken to obtain the matrix ${\textbf{G}}_n^b={\left({g}_{ij}^{bn}\right)}_{K\times \frac{d\left(d+1\right)}{2}}$. Then, the Hessian quadratic matrix H can be constructed by using the matrix ${\textbf{G}}_n^b$, which is formed as follows:

$$\textbf{H}={\textstyle\sum_{n=1}^N}\textbf{C}_n^T{\textbf{C}}_n$$

(10)

where ${\textbf{C}}_n={\left({\textrm{c}}_{ij}\right)}_{\frac{d\left(d+1\right)}{2}\times N}$ is a matrix composed of ${{\textbf{G}}_n^b}^T$, and it is defined as follows:

$${c}_{i{S}_n(j)}=\left\{\begin{array}{c}{g}_{ij}^{bn},\kern0.5em 1\le j\le K\\ {}\begin{array}{cc}0,& K<j\le N\end{array}\end{array}\right.$$

(11)

where $1\le i\le \frac{d\left(d+1\right)}{2}$, and S_n(j) denotes the index of the frame sorted by the distance from the nth frame, 1≤n ≤ N.
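The orthogonalization step that produces G_n from the estimation matrix L_n can be sketched as follows. This is a minimal sketch of the modified Gram-Schmidt procedure applied to the columns from left to right; it assumes K is at least the number of columns, and it sets numerically dependent columns to zero, which is a choice made here and not stated in the paper. The function name and the toy matrix are hypothetical.

```python
import math


def gram_schmidt_columns(L, tol=1e-12):
    """Orthonormalize the columns of L (given as a list of rows)."""
    rows, cols = len(L), len(L[0])
    basis = []  # orthonormal columns found so far
    for j in range(cols):
        v = [L[i][j] for i in range(rows)]
        for q in basis:
            # Remove the component of v along the earlier direction q.
            coeff = sum(a * b for a, b in zip(q, v))
            v = [a - coeff * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm < tol:
            v = [0.0] * rows  # dependent column (assumed handling)
        else:
            v = [a / norm for a in v]
        basis.append(v)
    # Return the result in the same rows x cols layout as the input.
    return [[basis[j][i] for j in range(cols)] for i in range(rows)]


# Toy estimation matrix (assumed): columns 1, x, x^2 sampled at 4 points.
L_demo = [[1.0, 1.0, 1.0],
          [1.0, 2.0, 4.0],
          [1.0, 3.0, 9.0],
          [1.0, 4.0, 16.0]]
G = gram_schmidt_columns(L_demo)
```

Because the first column of L_n is constant, the first column of G is constant as well, and the remaining columns are orthogonal to it.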

Next, the d-dimensional subspace corresponding to the d smallest eigenvalues can be obtained by using H, which is a null space and is denoted as U = (u_ij)_N × d. If a manifold is locally equidistant to an open subset in Euclidean space, then the mapping function from this manifold to the open subset is a linear function. The quadratic mixed derivative of the linear function is 0, so the local quadratic form formed by the Hessian coefficients is also 0. Hence, the global Hessian matrix has a (d+1)-dimensional null space. The first-dimension subspace of the Hessian matrix is composed of a constant function, and the other d-dimensional subspaces form equidistant coordinates. Then, the embedding matrix R = (r_ij)_d × d can be calculated as follows:

$$r_{ij}=\sum_{l\in \boldsymbol{J}}u_{li}u_{lj}$$

(12)

where J represents the set of the index of the neighborhood frames, 1 ≤ i ≤ d, 1 ≤ j ≤ d.

Finally, the subspace Y is obtained according to the low-dimensional embedding:

$$\textbf{Y}={\textbf{R}}^{\mu }{\textbf{U}}^T$$

(13)

where μ is a regularization parameter, and (·)^T denotes transposition.

There may be a small number of outliers in the subspace Y after the low-dimensional embedding. In order to solve this problem, the outliers in the subspace Y are corrected in this paper. These outliers are characterized by a small number, with values that deviate from the distribution of most data. So, the detection thresholds are set to recognize the outliers. Then, the outliers are replaced with 2Tr(U^TEU) [40], where Tr(·) means the trace of the matrix in parentheses. E = (e_ij)_N × N is a diagonal matrix, where e_ij is defined as [41]:

$${e}_{ij}=\begin{cases}\frac{1}{2{\parallel {u}_i\parallel}_2}, & i=j\\ 0, & i\ne j\end{cases}$$

(14)

Following the above steps, the source domain subspace Y_s and the target domain subspace Y_t can be obtained.

2.2 Information entropy-based domain adaption

A domain adaption method was proposed to build the relationship between the source domain subspace and the target domain subspace. In detail, a common space with similar feature distributions in the source and target domains is constructed. Both the information entropy between the data and emotion labels and the entropy between data and domain labels are used to optimize the mapping [42]. Thereby, the difference in feature distribution in different corpora can be reduced.

After obtaining the source domain subspace ${\textbf{Y}}_{\textrm{s}}={\left({y}_{ij}^{\textrm{s}}\right)}_{d\times N}$ and target domain subspace ${\textbf{Y}}_{\textrm{t}}={\left({y}_{ij}^{\textrm{t}}\right)}_{d\times N}$, the principal component coefficients of the source domain ${\textbf{W}}_{\textrm{s}}={\left({w}_{ij}^s\right)}_{d\times d}$ and of the target domain ${\textbf{W}}_{\textrm{t}}={\left({w}_{ij}^t\right)}_{d\times d}$ are calculated. In some cases, the source domain and the target domain have different dimensions, which leads to principal component coefficients of different dimensions. In that case, the smaller of the two dimensions should be taken as d_w. The dimensions of the source domain and the target domain are the same in this paper, so d_w is set as d. Since the transfer is carried out from the source domain to the target domain, the target domain is used as the basis for the transformation space. The transformation matrix W for both the source domain and the target domain is set as W = W_t. Features in the source domain and the target domain can be mapped into a common space by W.

First, the distance matrix D = (d_ij)_N × N formed by the features between different frames from the source domain subspace and the target domain subspace is given as follows:

$$\textbf{D}={\textbf{A}}^{\prime }+{\textbf{B}}^{\prime }-2{{\textbf{X}}_{\textrm{s}}}^T{\textbf{X}}_{\textrm{t}}$$

(15)

where ${\textbf{X}}_{\textrm{s}}={\left({x}_{mn}^{\textrm{s}}\right)}_{d\times N}={\textbf{W}}^T{\textbf{Y}}_{\textrm{s}}$ denotes the source domain subspace features in transform space, ${\textbf{X}}_{\textrm{t}}={\left({x}_{mn}^{\textrm{t}}\right)}_{d\times N}={\textbf{W}}^T{\textbf{Y}}_{\textrm{t}}$ denotes the target domain subspace features in transform space, ${\textbf{A}}^{\prime }={\left({a}_{ij}\right)}_{N\times N}$ with ${a}_{ij}=\sum\nolimits_{m=1}^{d}{\left({x}_{mj}^{\textrm{s}}\right)}^{2}$, and ${\textbf{B}}^{\prime }={\left({b}_{ij}\right)}_{N\times N}$ with ${b}_{ij}=\sum\nolimits_{m=1}^{d}{\left({x}_{mi}^{\textrm{t}}\right)}^{2}$.

The neighbor frames are detected according to the distance between the feature of each frame. Therefore, a conditional probability model is defined as follows:

$${p}_{ij}=\frac{e^{-{d}_{ij}}}{\sum_{i=1}^N{e}^{-{d}_{ij}}}$$

(16)

where 1≤ i ≤ N, 1≤ j ≤ N, and p_ij is the conditional probability density that the jth frame in the target domain is adjacent to the ith frame in the source domain. It can describe the probability of the nearest neighbor between each frame feature in the source domain and the frame feature in the target domain.
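Eq. (16) can be illustrated with the following minimal sketch, which normalizes exp(−d_ij) over the source index i for every target frame j. Subtracting the column minimum before exponentiating is a numerical-stability measure added here; it leaves the ratio in Eq. (16) unchanged. The function name and the toy distances are assumptions.

```python
import math


def neighbor_probabilities(D):
    """Eq. (16): p_ij = exp(-d_ij) / sum_i exp(-d_ij), per target frame j."""
    n_src, n_tgt = len(D), len(D[0])
    P = [[0.0] * n_tgt for _ in range(n_src)]
    for j in range(n_tgt):
        # Shift by the smallest distance so the largest weight is exp(0) = 1.
        d_min = min(D[i][j] for i in range(n_src))
        weights = [math.exp(-(D[i][j] - d_min)) for i in range(n_src)]
        total = sum(weights)
        for i in range(n_src):
            P[i][j] = weights[i] / total
    return P


# Toy distances (assumed): rows are source frames, columns are target frames.
D = [[0.0, 4.0],
     [0.0, 1.0],
     [5.0, 1.0]]
P = neighbor_probabilities(D)
```

Every column of P sums to one, so the jth column can be read as a distribution over source frames for the jth target frame.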

The emotion label corresponding to the ith frame in the source domain is Label_i, Label_i∈Label = {1, 2, ... , L}, i.e., there are a total of L types of emotion. According to formula (16), an emotion label probability estimate ${\hat{p}}_{lj}$ of the jth frame in the target domain is given as follows:

$${\widehat p}_{lj}=\sum_{i:\,Label_i=l}p_{ij}$$

(17)

where 1≤l ≤ L, 1 ≤ j ≤ N, 1 ≤ i ≤ N, and ${\hat{p}}_{lj}$ express the probability that the jth frame in the target domain is discriminated as the lth type of emotion when the emotion of the source domain is known.
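The label probability estimate of Eq. (17) amounts to summing the neighbor probabilities of all source frames that share a label. A minimal sketch, assuming labels are integers in 1..L as in the text and that P is stored as a list of rows indexed by source frame (the function name and toy values are hypothetical):

```python
def label_posteriors(P, labels, num_labels):
    """Eq. (17): p_hat[l][j] is the sum of P[i][j] over i with Label_i = l."""
    n_tgt = len(P[0])
    p_hat = [[0.0] * n_tgt for _ in range(num_labels)]
    for i, label in enumerate(labels):
        for j in range(n_tgt):
            p_hat[label - 1][j] += P[i][j]  # labels start at 1
    return p_hat


# Toy neighbor probabilities (assumed): 3 source frames, 2 target frames.
P = [[0.5, 0.25],
     [0.25, 0.25],
     [0.25, 0.5]]
labels = [1, 2, 1]  # emotion label of each source frame
p_hat = label_posteriors(P, labels, num_labels=2)
```

Since each column of P sums to one, each column of p_hat also sums to one.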

Since ${\hat{p}}_{lj}$ is a preliminary probability estimate of the emotion label of each frame feature in the target domain, the relationship between target domain features and emotion labels cannot be directly revealed by ${\hat{p}}_{lj}$ [43,44,45]. Therefore, the entropy I(X_t; Label) between the target domain features and emotion labels is calculated by using ${\hat{p}}_{lj}$ in this paper, which is defined as follows:

$$I\left({\textbf{X}}_{\textrm{t}};\textbf{Label}\right)=-\sum\nolimits_{l=1}^L\left(\log \left(\sum\nolimits_{j=1}^N\frac{{\hat{p}}_{lj}}{N}\right)\sum\nolimits_{j=1}^N\frac{{\hat{p}}_{lj}}{N}\right)-\frac{\left(-{\sum}_{j=1}^N{\sum}_{l=1}^L\left({\hat{p}}_{lj}\log \left({\hat{p}}_{lj}\right)\right)\right)}{N}$$

(18)

Equation (18) is composed of two parts. In the first part, the entropy of the average probability that the feature of all frames in the target domain belongs to each emotion label is calculated. The average of the entropy of the feature in the target domain belonging to each emotion label is computed in the second part. In order to reduce the influence of incorrect labels on the feature discrimination results of each frame in the target domain, Eq. (18) needs to be optimized later. It should be noted that if only the second part is minimized, a degenerate solution will be obtained. That is, all frames in the target domain may be classified into the same type of emotion. So, the first part in Eq. (18) is necessary.
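The two parts of Eq. (18) can be checked numerically with the sketch below, which computes the entropy of the average label probability and subtracts the average per-frame entropy. The helper names are hypothetical, natural logarithms are assumed, and zero probabilities are treated as contributing zero to the entropy.

```python
import math


def entropy(p):
    """Shannon entropy of a probability vector, with 0 * log(0) taken as 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def label_information(p_hat):
    """Eq. (18): entropy of the mean posterior minus the mean frame entropy."""
    L, N = len(p_hat), len(p_hat[0])
    mean_posterior = [sum(p_hat[l]) / N for l in range(L)]
    mean_frame_entropy = sum(
        entropy([p_hat[l][j] for l in range(L)]) for j in range(N)
    ) / N
    return entropy(mean_posterior) - mean_frame_entropy


confident_balanced = [[1.0, 0.0], [0.0, 1.0]]  # each frame sure, classes even
all_one_class = [[1.0, 1.0], [0.0, 0.0]]       # degenerate: one emotion only
uninformative = [[0.5, 0.5], [0.5, 0.5]]       # every frame undecided
```

Under these assumptions the confident and balanced case gives log 2, while both the degenerate case and the uninformative case give 0. This matches the remark that the second part alone cannot distinguish a useful solution from one that assigns all frames to the same emotion.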

Then, the entropy I^st(X) between the features and domain labels of the two domains is introduced to maximize the similarity between the two domains, which is defined as:

$$I^{st}\left(\textbf{X}\right)=-\sum\nolimits_{t=1}^2\left(\sum\nolimits_{j=1}^{N+M}\frac{p_{tj}}{N+M}\log\left(\sum\nolimits_{j=1}^{N+M}\frac{p_{tj}}{N+M}\right)\right)-\frac{\left(-\sum_{j=1}^{N+M}\sum_{t=1}^2\left(p_{tj}\log\left(p_{tj}\right)\right)\right)}{N+M}$$

(19)

where 1 ≤ j ≤ N + M.

To calculate the entropy I^st(X), firstly, the distance ${d}_{ij}^{\prime }$ between the ith frame feature in the source domain and the jth frame feature in the target domain is calculated according to Eq. (3), where X = (x_ij)_{d × (N + M)} denotes the features of all frames in the source and target domains, A = (a_ij)_{(N + M) × (N + M)}, ${a}_{ij}=\sum_{m=1}^d{\left({x}_{mj}\right)}^2$, B = (b_ij)_{(N + M) × (N + M)}, and ${b}_{ij}=\sum_{m=1}^d{\left({x}_{mi}\right)}^2$. N and M denote the number of frames in the source domain and the target domain, respectively. In this paper, the number of frames in the source domain is the same as that in the target domain, i.e., N = M. Then, the probability ${p}_{ij}^{\prime }$ that the ith frame and the jth frame are adjacent to each other in the source domain and the target domain is calculated according to Eq. (16) using ${d}_{ij}^{\prime }$. Next, the probability p_tj that the jth frame in the source domain and the target domain is judged as belonging to the target domain or the source domain is calculated according to Eq. (17).

2.3 Optimization

In this subsection, an iterative optimization algorithm based on numerical descent [46] is introduced using Eqs. (18) and (19). The objective function is:

$$f=\min\ \left\{\uplambda {I}^{st}\left(\textbf{X}\right)-I\Big({\textbf{X}}_{\textrm{t}};\textbf{Label}\Big)\right\}$$

(20)

where λ is the regularization parameter.

In the optimization process, the transfer coefficient matrix g is given for numerical descent in this paper, which is defined as follows:

$$\boldsymbol{g}=\uplambda {\boldsymbol{g}}^{st}\left(\textbf{X}\right)-\boldsymbol{g}\left({\textbf{X}}_{\textrm{t}};\textbf{Label}\right)$$

(21)

where λ is the regularization parameter.

The calculation process of g(X_t; Label) is as follows. First, an information matrix ${\textbf{I}}^C={\left({i}_{lj}^c\right)}_{L\times N}$ is defined using ${\hat{p}}_{lj}$ as:

$${i}_{lj}^c=\frac{\log \left({\hat{p}}_{lj}\right)-\log \left(\sum_{j=1}^N\frac{{\hat{p}}_{lj}}{N}\right)}{N}$$

(22)

where ${i}_{lj}^c$ represents the difference between the probability that the feature of the jth frame in the target domain belongs to the emotion of the lth category and the average probability that the features of all frames in the target domain belong to the emotion of the category.

Next, a coefficient matrix Γ = (γ_ij)_N × N is calculated from p_ij and ${i}_{lj}^c$ as follows:

$${\gamma}_{ij}=\left(\sum\nolimits_{i'=1}^N{o}_{i'j}{p}_{i'j}-{o}_{ij}\right){p}_{ij}$$

(23)

where ${o}_{ij}={i}_{lj}^c$, Label_i = l. g(X_t; Label) is obtained as follows:

$$\boldsymbol{g}\left({\textbf{X}}_{\textrm{t}};\textbf{Label}\right)=2\left[{\textbf{Y}}_{\textrm{s}}\boldsymbol{\Omega} {{\textbf{Y}}_{\textrm{s}}}^T+{\textbf{Y}}_{\textrm{t}}\boldsymbol{\Omega} {{\textbf{Y}}_{\textrm{t}}}^T-{\textbf{Y}}_{\textrm{s}}\boldsymbol{\Gamma} {{\textbf{Y}}_{\textrm{t}}}^T-{\textbf{Y}}_{\textrm{t}}\boldsymbol{\Gamma} {{\textbf{Y}}_{\textrm{s}}}^T\right]\textbf{W}$$

(24)

where Ω is a diagonal matrix, and the main diagonal element is $\sum_{j=1}^N{\gamma}_{ij}$. W is the transfer matrix.

Since g(X_t; Label) and g^st(X) are calculated in the same way, only the calculation of g(X_t; Label) is introduced in detail in this paper. The variables used to calculate g^st(X) follow those used in the calculation of I^st(X).

Finally, the common space L is obtained. So, the feature data in the source domain after mapping is F_s = Y_s^TL, and the feature data from the target domain is F_t = Y_t^TL.

3 Experiments and results analysis

To evaluate the effectiveness of the proposed cross-corpus speech emotion recognition method, a number of experiments are conducted with some baseline methods on five commonly used datasets, namely Berlin [47], NNIME [48], IEMOCAP [49], MSP-Improv [50], and MSP-PODCAST [51]. The specific statistics of each dataset are shown in Table 1.

Table 1 Database statistics


3.1 Data preparation

The Berlin dataset is a German emotional speech corpus recorded by the Technical University of Berlin. In this dataset, ten actors performed 7 emotions, including neutral, angry, fearful, happy, sad, disgusted, and bored. The sampling rate is 16 kHz. The dataset contains 233 male emotional sentences and 302 female emotional sentences saved in WAV format.

The NTHU-NTUA Chinese Interactive Multimodal Emotional Corpus (i.e., NNIME) is a multimodal dataset. In this dataset, audio, video, ECG, etc. were recorded for 44 actors during oral interactions. There are 6 emotions including anger, happy, sad, neutral, frustration, and surprise in this dataset. The audio sampling rate is 16 kHz. The dataset also contains annotation results from 49 annotators in different perspectives.

IEMOCAP, the Interactive Emotional Dyadic Motion Capture database, was recorded by the Speech Analysis and Interpretation Laboratory at the University of Southern California. Ten emotions are covered by recording the expressions, movements, and audio of 10 actors in this dataset. The dataset contains twelve hours of data. The audio sampling rate is 16 kHz. Considering the relevance and ambiguity of different types of emotions, audio data of 4 typical emotions (angry, neutral, happy, and sad) were selected from the above three datasets in this paper.

MSP-Improv is an improvised multimodal emotional corpus. It contains 6 sessions, each of which is a dyadic interaction between two speakers. Each session includes twenty target sentences. In this corpus, 12 actors (six male and six female) performed 4 emotions, including neutral, angry, happy, and sad. Two actors improvise emotion-specific situations, leading them to utter contextualized, non-read renditions of sentences that have fixed lexical content and convey different emotions. The sampling rate is 44.1 kHz. MSP-Improv is more natural than other corpora. Hereinafter, MSP-Improv is referred to as MSP.

MSP-PODCAST is a large and natural emotional corpus. It relies on existing spontaneous recordings obtained from audio-sharing websites. The criterion to select the podcasts is to include only episodes that can be shared with the broader community. The types of emotions and themes in this corpus are diverse, and the audio quality is very good because segments recorded with poor quality are removed. Segments with SNR values less than 20 dB are discarded. Phone-quality speech is also removed, so segments that do not have significant energy above 4 kHz are dropped as well. Podcasts in the corpus contain 9 emotions, including angry, sad, happy, neutral, fear, surprise, disgust, others, and contempt. However, only angry, happy, neutral, and sad are selected in this paper. There are also many other real-world corpora, such as LSSED [52].

3.2 Experimental settings

In this experiment, 5 artificial audio features are used, including static MFCC and their first- and second-order dynamic differences, LPC, log amplitude-frequency characteristics, Philips Fingerprints [53], and spectral entropy. The selected audio features are listed in Table 2.

Table 2 The features used in this paper


In the following, the amplitude characteristic of the frequency coefficient is described by log amplitude-frequency characteristics (LAFC).

Considering that different features contribute differently to speech emotion recognition, each feature in the source domain and the target domain is weighted before training. The weights are set by the dimensions of the features in this paper. For MFCC, LPC, LAFC, Philips Fingerprint, and Spectral Entropy, the corresponding weights are β₁, β₂, β₃, β₄, and β₅, respectively.
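A minimal sketch of the weighting step follows. The weight values are those reported in the parameter details (0.3, 0.3, 0.3, 0.05, 0.05). How the weights are applied is an assumption here (each feature block is scaled by its weight and the blocks are concatenated), and the per-feature dimensions, dictionary keys, and random data are hypothetical placeholders.

```python
import numpy as np

# Weights as reported in the parameter details; the way they are applied is an assumption.
BETA = {"mfcc": 0.3, "lpc": 0.3, "lafc": 0.3, "fingerprint": 0.05, "spectral_entropy": 0.05}

def weight_and_concat(features, beta=BETA):
    """Scale each feature block by its weight and concatenate along the feature axis."""
    blocks = [beta[name] * np.asarray(features[name], dtype=float) for name in beta]
    return np.concatenate(blocks, axis=1)

# Placeholder blocks for 4 utterances; the per-feature dimensions are hypothetical.
rng = np.random.default_rng(0)
dims = {"mfcc": 39, "lpc": 12, "lafc": 20, "fingerprint": 32, "spectral_entropy": 1}
features = {name: rng.normal(size=(4, dim)) for name, dim in dims.items()}

X = weight_and_concat(features)  # one weighted feature vector per utterance
```

The same function would be applied to source-domain and target-domain features so that both are scaled identically.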

After subspace learning and domain adaption, the weighted features in the source domain are trained. That is, the features are used to build a training set. Similarly, the weighted features in the target domain are used to build a test set.

In the training process, a constant recognition accuracy threshold α is set in advance. Next, the test set is divided into two parts with equal amounts of data, i.e., test set 1 and test set 2. Test set 1 is used to assist training, and test set 2 is used to optimize the performance of the proposed method. If the recognition accuracy of a certain type of emotion is less than α in the first training, the features corresponding to that emotion are re-trained in the next training. These operations are repeated until one of the following conditions is met: (1) the recognition accuracy of all emotions is greater than α, or (2) the number of emotions with recognition accuracy less than α remains unchanged between two adjacent trainings.
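The stopping rule described above can be sketched as follows. The function names and the per-round accuracies are hypothetical and purely illustrative (they are not results from the paper), the threshold uses the value α = 0.45 given in the parameter details, and the actual re-training step is left out.

```python
def low_accuracy_emotions(per_emotion_acc, alpha):
    """Return the emotions whose recognition accuracy falls below the threshold."""
    return {emo for emo, acc in per_emotion_acc.items() if acc < alpha}

def should_stop(prev_low, curr_low):
    """Stop when no emotion is below alpha, or when the count of such emotions is unchanged."""
    if not curr_low:
        return True
    return prev_low is not None and len(curr_low) == len(prev_low)

alpha = 0.45
# Toy accuracies for three successive training rounds (illustrative values only).
rounds = [
    {"angry": 0.60, "happy": 0.30, "neutral": 0.40, "sad": 0.55},
    {"angry": 0.62, "happy": 0.35, "neutral": 0.50, "sad": 0.56},
    {"angry": 0.61, "happy": 0.41, "neutral": 0.52, "sad": 0.57},
]

prev_low, stopped_at = None, None
for i, acc in enumerate(rounds):
    curr_low = low_accuracy_emotions(acc, alpha)
    if should_stop(prev_low, curr_low):
        stopped_at = i  # index of the round at which training would end
        break
    prev_low = curr_low  # these emotions would be re-trained in the next round
```

In this toy run, the loop ends at the third round because one emotion stays below the threshold in two adjacent rounds, which corresponds to condition (2).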

To evaluate the performance of the proposed method in the cross-corpus condition, Berlin, NNIME, and IEMOCAP are combined in pairs in this paper, with one dataset of each pair taken as the source domain and the other as the target domain. Therefore, a total of 6 combination cases are designed as follows:

N-B: NNIME is the source domain dataset, and Berlin is the target domain dataset.
B-N: Berlin is the source domain dataset, and NNIME is the target domain dataset.
N-I: NNIME is the source domain dataset, and IEMOCAP is the target domain dataset.
I-N: IEMOCAP is the source domain dataset, and NNIME is the target domain dataset.
B-I: Berlin is the source domain dataset, and IEMOCAP is the target domain dataset.
I-B: IEMOCAP is the source domain dataset, and Berlin is the target domain dataset.
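The six cases listed above are simply the ordered pairs of the three corpora. A short sketch that enumerates them, with a hypothetical `CORPORA` mapping introduced only for illustration:

```python
from itertools import permutations

# Abbreviations as used in the case names above.
CORPORA = {"B": "Berlin", "N": "NNIME", "I": "IEMOCAP"}

# Each ordered pair (source, target) of distinct corpora is one experimental case.
cases = {f"{s}-{t}": (CORPORA[s], CORPORA[t]) for s, t in permutations(CORPORA, 2)}

for name, (source, target) in cases.items():
    print(f"{name}: source={source}, target={target}")
```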

3.2.1 Parameter details

Linear SVM is chosen for training and testing. The grid search method is used to optimize the kernel function coefficients of the SVM and the independent terms of the sum function. There are four hyperparameters and five feature weight coefficients in this experiment. The recognition accuracy threshold α is set to 0.45, as determined by an informal experiment. According to the dimensions of the features, the weight coefficients β₁, β₂, β₃, β₄, and β₅ are set to 0.3, 0.3, 0.3, 0.05, and 0.05, respectively. The complexity of the algorithm is affected by K: the larger the value of K, the higher the algorithm complexity and the more features are extracted. So, the range of the neighbor number K is set as [3, 9]. For the two regularization parameters μ and λ, the ranges are set to {−1/4, −1/3, −1/2, 1, 1/2, 1/3, 1/4} and {0.001, 0.01, 0.1, 1, 10, 100, 1000}, respectively. Considering that the embedding regularization parameter μ is an exponent, if μ is a positive integer, it will affect the values of the elements in Y, whereas if μ is a positive or negative fraction, it may affect the value range of the elements in R. Hence, both integers and fractions can be chosen for μ. The regularization parameter λ controls the relative importance of the two information entropy terms. For the proposed method, the dimension of the simplified subspace feature is set to 169.
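The candidate values for K, μ, and λ stated above can be enumerated as a grid, as in the sketch below. The text does not say whether the three parameters were searched jointly or one at a time, so the joint Cartesian product here is an assumption, and the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

K_VALUES = range(3, 10)  # neighbor number K in [3, 9]
MU_VALUES = [Fraction(-1, 2), Fraction(-1, 3), Fraction(-1, 4),
             Fraction(1, 4), Fraction(1, 3), Fraction(1, 2), Fraction(1)]
LAMBDA_VALUES = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

# Joint grid over all three parameters (an assumption about the search strategy).
grid = list(product(K_VALUES, MU_VALUES, LAMBDA_VALUES))
print(len(grid))  # 7 * 7 * 7 candidate settings
```

Storing μ as exact fractions avoids rounding issues when a setting such as μ = 1/3 is later looked up or compared.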

3.2.2 Traditional linear baseline

In order to evaluate the performance of the proposed method for cross-corpus speech emotion recognition, the proposed method is compared, on the basis of the above 6 sets of experiments, with some of the most commonly used and advanced related transfer learning methods. These baseline methods are introduced as follows:

Principal components analysis (PCA) [54]: A dimensionality reduction method that maps data into a low-dimensional subspace through linear transformation to prevent information loss as much as possible.
Linear discriminant analysis (LDA) [55]: This method finds the projection direction that maximizes the ratio of the inter-class distance to the intra-class distance. The dimension reduction therefore also affects the subsequent classification results.
Kernel spectral regression (KSR) [56,57,58]: In reproducing kernel Hilbert spaces (RKHS), the problem of learning embedding functions is transformed by SR into a regression problem.
Geodesic flow kernel (GFK) [59]: The movement of the domain is simulated by integrating an infinite number of subspaces. The changes in geometric and statistical properties from the source domain to the target domain are described by these subspaces.
Subspace alignment (SA) [60]: SA is a transfer learning algorithm that matches the features of two subspaces. The core of this method is to seek a linear transformation that aligns the source subspace with the target subspace.
Manifold embedded distribution alignment (MEDA) [61]: Taking into account the importance of both conditional and marginal distributions, a domain-invariant classifier is learned via a Grassmann manifold with structural risk minimization.
Joint distribution adaptation (JDA) [62]: The marginal probability distribution and conditional probability distribution of the source and target domains are adapted to reduce the distribution difference between different domains.
Transfer component analysis (TCA) [63]: The data in both domains are mapped together into a high-dimensional reproducing kernel Hilbert space. In this space, the distance between the data in the source domain and the target domain is minimized.
Balanced distribution adaptation (BDA) [64]: The weights of marginal and conditional distributions are adaptively utilized on the basis of JDA.
Transfer joint matching (TJM) [65]: The domain variance is reduced by jointly matching features and reweighting instances across domains in a dimensionality reduction process. The new feature representations invariant to both distributional variance and uncorrelated instances are built.
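To make one of the baselines above concrete, the following is a minimal NumPy sketch of subspace alignment (SA) as it is usually formulated: PCA bases are computed for each domain, and the source basis is aligned to the target basis through the matrix M = P_sᵀP_t. The function names, the centering choice, and the toy data are assumptions for illustration and are not taken from the paper's experimental code.

```python
import numpy as np

def pca_basis(X, d):
    """Top-d principal directions of X (n x D), returned as a D x d matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T

def subspace_alignment(Xs, Xt, d):
    """Align the source PCA subspace to the target one and project both domains."""
    Ps, Pt = pca_basis(Xs, d), pca_basis(Xt, d)
    M = Ps.T @ Pt                          # d x d alignment matrix
    Zs = (Xs - Xs.mean(axis=0)) @ Ps @ M   # source in target-aligned coordinates
    Zt = (Xt - Xt.mean(axis=0)) @ Pt       # target in its own subspace
    return Zs, Zt

# Toy data: 20 source and 15 target samples with 8 features (placeholders).
rng = np.random.default_rng(0)
Xs = rng.normal(size=(20, 8))
Xt = rng.normal(size=(15, 8)) + 0.5
Zs, Zt = subspace_alignment(Xs, Xt, d=3)
```

A classifier would then be trained on `Zs` with the source labels and applied to `Zt`.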

3.3 Results analysis

3.3.1 Comparison with the traditional linear baseline method

In this section, the recognition accuracy of the proposed method is compared with that of some traditional linear baseline methods. The results are shown in Tables 3 and 4.

Table 3 Weighted accuracy (%) of different methods in different cases


Table 4 Unweighted accuracy (%) of different methods in different cases


From Table 3, it is clear that the proposed method outperforms the other methods in most cases. Only in the case of I-B is the performance of the proposed method slightly lower than that of BDA and TJM. For the proposed method, the average recognition accuracy reached 58.20% over the six cases. In the case of I-B, the recognition accuracy is the lowest among the six cases, at 46.88%. In contrast, in the case of I-N, the recognition accuracy reached 67.75%, which is the highest among the six cases. Compared with TJM, which has the highest recognition accuracy among the baseline methods, the average recognition accuracy of the proposed method is significantly improved by 13.3%.

Although weighted accuracy is an important indicator of the overall classification performance of the model, it is affected by an unbalanced distribution of sample classes. Therefore, unweighted accuracy is very important for evaluating the overall classification performance of the model when the distribution of sample classes is unbalanced. It can be seen from Table 4 that the unweighted accuracy of almost all methods is lower than the weighted accuracy. For the proposed method, unweighted accuracy is 3.27% lower than weighted accuracy. Compared with the baseline methods, the proposed method still has an advantage.
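The difference between the two metrics can be illustrated with a short sketch. It assumes the definitions conventionally used in speech emotion recognition, which the text does not spell out: weighted accuracy is the overall fraction of correctly classified samples, and unweighted accuracy is the mean of the per-class recalls. The labels below are a toy example, not data from the paper.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all samples classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy imbalanced example: 8 neutral utterances and 2 sad utterances.
y_true = ["neutral"] * 8 + ["sad"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "sad"]

wa = weighted_accuracy(y_true, y_pred)    # 9 of 10 samples correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0 (neutral) and 0.5 (sad)
```

In this toy case the majority class dominates the weighted accuracy, while the unweighted accuracy exposes the weaker recall on the minority class, which is consistent with unweighted accuracy tending to be the lower of the two on unbalanced data.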

Furthermore, we can find that the average recognition accuracy of the proposed method, the distribution adaptation methods, and the feature selection methods is higher than that of most subspace learning methods. The reason is that the distribution of data in different domains is different. Therefore, the recognition performance of traditional subspace learning algorithms is poor in cross-corpus speech emotion recognition. Transfer learning can be used to improve recognition performance.

In addition, the confusion matrices of the proposed method in the six cases are shown in Fig. 4. It can be seen that there are two types of emotion with more than 50% recognition accuracy in most cases. In the cases of N-B and N-I, the highest recognition accuracy is achieved for neutral. From Fig. 4b and f, it is clear that the proposed method has a good recognition ability for happy, and the highest recognition accuracy is achieved for angry in the cases of I-N and B-I. Moreover, it can also be found that sad is easier to recognize than other emotions in most cases.

3.3.2 Ablation experiment

In this section, a set of ablation experiments is established to verify the impact of the two parts of the proposed method on the recognition performance. The specific results are shown in Fig. 5. The specific settings are as follows:

Subspace learning: Only Hessian-based subspace learning is performed.
Domain adaption: Only information entropy-based domain adaption is performed.
Subspace learning and domain adaption: Hessian-based subspace learning and domain adaption are combined.

The average recognition accuracy of the ablation experiments is shown in Fig. 5. It can be found that the recognition performance of the combined method (i.e., the proposed method) is better than that of the method only with Hessian-based subspace learning or domain adaption. Through ablation experiments, it is clear that both Hessian-based subspace learning and domain adaption have played a positive role in cross-corpus speech emotion recognition. In the cases of N-B, B-N, and N-I, the recognition accuracy of the domain adaption method is slightly higher than that of the Hessian-based subspace learning method. On the contrary, in the cases of I-N, B-I, and I-B, the recognition accuracy of the Hessian-based subspace learning method is slightly higher than that of the domain adaption method.

3.3.3 Comparison with deep learning-based method

In this section, IEMOCAP and MSP-Improv are used for cross-corpus speech emotion recognition. ADDoG-based method and CNN-based method [34] are chosen as reference methods. The recognition accuracy of the proposed method is compared with these reference methods. The result is shown in Fig. 6:

It can be seen from Fig. 6 that when MSP-Improv is the source domain and IEMOCAP is the target domain, the unweighted accuracy of the proposed method is better than that of the CNN-based method but slightly lower than that of the ADDoG-based method. However, in the reverse experiment, the unweighted accuracy of the proposed method is slightly higher than that of both the CNN-based method and the ADDoG-based method. It can be clearly seen that the performance of the ADDoG-based method is the most stable among the three methods. In general, the proposed method achieves good performance compared with traditional linear methods and deep learning methods.

3.3.4 Experiment of real-world corpus

In order to verify that the proposed method is also effective on real-world data, a real-world corpus, MSP-PODCAST, and several corpora recorded in controlled experimental environments are used for cross-corpus speech emotion recognition in this section. MSP-PODCAST is set as the source corpus and as the target corpus in turn for experiments with the other corpora. The recognition accuracy of the proposed method using MSP-PODCAST as the target corpus is shown in Fig. 7, and Fig. 8 shows the recognition accuracy using MSP-PODCAST as the source corpus:

It can be seen from Figs. 7 and 8 that the recognition performance of the proposed method using MSP-PODCAST as the target corpus is better than that using MSP-PODCAST as the source corpus. When MSP-PODCAST is used as a source corpus, the transferable knowledge is limited due to the influence of complex acoustic conditions. It can be seen that the performance of speech emotion recognition is indeed affected by the corpus environment. In addition, it is clear that the recognition performance of the proposed method using IEMOCAP and MSP-Improv is better than that of other corpora.

3.3.5 Parameters analysis

The influence of different parameters on the recognition performance of the proposed method is analyzed in this section. The analyzed parameters include the number of the nearest neighbors K, the embedding regularization parameter μ, and the information entropy regularization parameter λ. Different recognition accuracy can be obtained by selecting different values of parameters.

First of all, the nearest neighbor number K is analyzed, which determines the number of neighboring frames of the current frame. The complexity of the algorithm is affected by K. The smaller K is, the fewer neighboring frames are identified and the fewer features are provided; the larger K is, the more neighboring frames are identified and the more features are provided. However, if K is set too large, some frames that are not useful for recognition may be identified as neighboring frames, which may lead to high algorithmic complexity. So, the range of K is set from 3 to 9 in this paper. The recognition accuracy for different K in the different cases is shown in Fig. 9. From Fig. 9, we can find that the proposed method achieves a good recognition accuracy when K = 6. However, recognition accuracy alone is not enough to measure the recognition performance under different corpus settings. Therefore, the variance of the recognition accuracy is also introduced in the parameter analysis. For K, the variances under different corpus settings are shown in Fig. 10. It can be seen from Fig. 10 that, although the variance of the recognition accuracy reaches its maximum when K = 6, it differs only slightly across the values of K. Therefore, considering the algorithmic complexity and recognition performance, K is selected as 6 in this paper.

Then, the embedding regularization parameter μ is analyzed, which is used to control the value of the embedded coordinates. The range of μ is set as {− 1/2, − 1/3, − 1/4, 1/4, 1/3, 1/2, 1} in this paper. In different cases, the recognition accuracy of the proposed method with different μ is shown in Fig. 11. From Fig. 11, it is clear that the proposed method can achieve a good recognition accuracy when μ = 1/4. The variance of recognition accuracy with different μ under different corpus settings is shown in Fig. 12. Although the variance of recognition accuracy is very small when μ = 1, the recognition accuracy is significantly lower than that under other conditions. Therefore, in consideration of recognition accuracy and variance of recognition accuracy, μ = 1/4 is chosen in this paper.

Finally, the information entropy regularization parameter λ is analyzed, which controls the weight of the information entropy. The range of λ is set as {0.001, 0.01, 0.1, 1, 10, 100, 1000} in this paper. The recognition accuracy of the proposed method with different λ in the different cases is shown in Fig. 13. As shown in Fig. 13, the recognition accuracy changes greatly when λ = 100 and λ = 1000. Although the recognition accuracy in both the N-I and B-I cases exceeds 70% when λ = 100, it is not stable in these two cases, as shown in Fig. 14. Therefore, as a compromise between recognition accuracy and the variance of recognition accuracy, λ = 10 is chosen in this paper.
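The selection procedure used for K, μ, and λ (looking at both the accuracy and its variance across corpus settings) can be sketched as below. The helper name and the accuracy values are hypothetical toy numbers, not results from the paper, and the use of the population variance (NumPy's default) is an assumption, since the text does not state which variance estimator was used.

```python
import numpy as np

def summarize(acc_by_value):
    """Map each parameter value to (mean, variance) of its accuracy across corpus cases."""
    return {v: (float(np.mean(a)), float(np.var(a))) for v, a in acc_by_value.items()}

# Toy accuracies for two candidate values over two corpus cases (illustrative only).
toy_acc = {
    3: [0.50, 0.60],
    6: [0.55, 0.65],
}

summary = summarize(toy_acc)
for value, (mean_acc, var_acc) in summary.items():
    print(f"value={value}: mean={mean_acc:.3f}, variance={var_acc:.4f}")
```

A value with a high mean and an acceptably small variance would then be preferred, which mirrors the compromise described for λ = 10.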

3.4 Complexity analysis

For the performance evaluation of a method, both recognition accuracy and model complexity should be considered. For the deep learning-based method, the complexity of the model is determined by the network structure and the number of parameters. Therefore, some complexity analysis of the proposed method and the reference methods is given in this subsection. For the CNN-based method, the feature encoder consists of two convolution layers and a max pooling layer, and the emotion classifier consists of fully connected layers and softmax. On this basis, the ADDoG model adds a critic composed of fully connected layers. As the number of input MFBs increases, the amount of computation and the number of trainable parameters in each layer increase. In addition, during training, when the number of samples in the source domain and target domain increases, the computational complexity of the loss function and the number of iterations increase. Although there is a user-defined maximum number of iterations for the proposed method, convergence can be achieved in an average of 50 or fewer iterations under each experimental setting. In summary, the proposed method requires relatively few adaptation steps compared with fine-tuning a whole deep neural network.

4 Conclusion

In this paper, a cross-corpus speech emotion recognition method is proposed using subspace learning and domain adaptation. In the subspace learning part, the Hessian matrix is introduced to locally embed the features in both source and target domains to form the feature subspace. In the domain adaption part, the mapping relationship is constructed based on information entropy. Then, the common space of both the source and target domains is obtained, which reduces the discrepancy in feature distribution between the source and target domains. Extensive experiments on datasets in three different languages are conducted to verify the performance of the proposed method.

Availability of data and materials

Not applicable.

References

S. Zhao, G. Jia, J. Yang, G. Ding, K. Keutzer, Emotion recognition from multiple modalities: fundamentals and methodologies. IEEE Sign. Process. Magazine 38(6), 59–73 (2021)
X. Wu, S. Hu, Z. Wu, X. Liu, H. Meng, in 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP). Neural architecture search for speech emotion recognition (2022), pp. 1–4
C.-C. Lee, K. Sridhar, J.-L. Li, W.-C. Lin, S. Bo-Hao, C. Busso, Deep representation learning for affective speech signal analysis and processing: preventing unwanted signal disparities. IEEE Sign. Process. Magazine 38(6), 22–38 (2021)
J.S. Gómez-Cañón, E. Cano, T. Eerola, P. Herrera, H. Xiao, Y.-H. Yang, E. Gómez, Music emotion recognition: toward new, robust standards in personalized and context-sensitive applications. IEEE Sign. Process. Magazine 38(6), 106–114 (2021)
W. Chung-Hsien, W.-B. Liang, Emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information and semantic labels. IEEE Trans. Affect. Comput. 2(1), 10–21 (2011)
J.-H. Hsu, M.-H. Su, C.-H. Wu, Y.-H. Chen, Speech emotion recognition considering nonverbal vocalization in affective conversations. IEEE/ACM Trans. Audio, Speech Lang. Process. 29, 1675–1686 (2021)
B. Chen, Q. Cao, M. Hou, Z. Zhang, G. Lu, D. Zhang, Multimodal emotion recognition with temporal and semantic consistency. IEEE/ACM Trans. Audio, Speech Lang. Process. 29, 3592–3603 (2021)
B.T. Atmaja, A. Sasou, M. Akagi, Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion. Speech Commun. 140, 11–28 (2022)
Y. Jin, P. Song, W. Zheng, L. Zhao, Novel feature fusion method for speech emotion recognition based on multiple kernel learning. J. South. Univ. (English Edition) 29(2), 129–133 (2013)
U. Garg, S. Agarwal, S. Gupta, R. Dutt, D. Singh, in 2020 12th International Conference on Computational Intelligence and Communication Networks (CICN). Prediction of emotions from the audio speech signals using MFCC, MEL and Chroma (2020), pp. 87–91
N.P. Jagini, R.R. Rao, in 2017 International Conference on Intelligent Computing and Control Systems (ICICCS). Exploring emotion specific features for emotion recognition system using PCA approach (2017), pp. 58–62
S.R. Krishna, R.R. Rao, in 2017 International Conference on Communication and Signal Processing (ICCSP). Exploring robust spectral features for emotion recognition using statistical approaches (2017), pp. 1838–1843
J.A. Russell, A circumplex model of affect. J. Pers. Soc. Psychol. 39, 1161–1178 (1980)
J. Posner, J.A. Russell, B.S. Peterson, The circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology. Dev. Psychopathol. 17(3), 715–734 (2005)
A. Mehrabian, Basic dimensions for a general psychological theory (Oelgeschlager, Gunn & Hain, Incorporated, Cambridge, 1980), pp. 39–53
R.F. Bales, Social interaction systems: theory and measurement (Transaction Publishers, Piscataway, 2001), pp. 139–140
Y. Zhou, X. Liang, Y. Gu, Y. Yin, L. Yao, Multi-classifier interactive learning for ambiguous speech emotion recognition. IEEE/ACM Trans. Audio, Speech Lang. Process. 30, 695–705 (2022)
Y. Pan, P. Shen, L. Shen, Speech emotion recognition using support vector machine. Int. J. Smart Home 6(2), 101–108 (2012)
S. Mao, D. Tao, G. Zhang, P.C. Ching, T. Lee, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. Revisiting hidden Markov models for speech emotion recognition (2019), pp. 6715–6719
H. Hu, M. Xu, W. Wu, in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ‘07. GMM supervector based SVM with spectral features for speech emotion recognition (2007), pp. IV-413–IV-416
Y.-C. Kao, C.-T. Li, T.-C. Tai, J.-C. Wang, in 2021 9th International Conference on Orange Technology (ICOT). Emotional speech analysis based on convolutional neural networks (2021), pp. 1–4
C.-H. Park, D.-W. Lee, K.-B. Sim, in 2002 International Conference on Machine Learning and Cybernetics. Emotion recognition of speech based on RNN, vol 4 (2002), pp. 2210–2213
S. Wang, X. Ling, F. Zhang, J. Tong, in 2010 International Conference on Measuring Technology and Mechatronics Automation. Speech emotion recognition based on principal component analysis and back propagation neural network (2010), pp. 437–440
K.H. Lee, H. Kyun Choi, B.T. Jang, D.H. Kim, in 2019 International Conference on Information and Communication Technology Convergence (ICTC). A study on speech emotion recognition using a deep neural network (2019), pp. 1162–1165
X. Wu et al., in IEEE/ACM Transactions on Audio, Speech, and Language Processing. Speech emotion recognition using sequential capsule networks, vol 29 (2021), pp. 3280–3291
L. Yi, M.-W. Mak, Improving speech emotion recognition with adversarial data augmentation network. IEEE Trans. Neural Netw. Learn. Syst. 33(1), 172–184 (2022)
S. Mao, P.C. Ching, T. Lee, Enhancing segment-based speech emotion recognition by iterative self-learning. IEEE/ACM Trans. Audio, Speech Lang. Process. 30, 123–134 (2022)
N. Liu et al., Transfer subspace learning for unsupervised cross-corpus speech emotion recognition. IEEE Access 9, 95925–95937 (2021)
P. Song, W. Zheng, Feature selection based transfer subspace learning for speech emotion recognition. IEEE Trans. Affect. Comput. 11(3), 373–382 (2020)
H. Luo, J. Han, Nonnegative matrix factorization based transfer subspace learning for cross-corpus speech emotion recognition. IEEE/ACM Trans. Audio, Speech Lang. Process. 28, 2047–2060 (2020)
P. Song, Transfer linear subspace learning for cross-corpus speech emotion recognition. IEEE Trans. Affect. Comput. 10(2), 265–275 (2019)
J. Deng, X. Xu, Z. Zhang, S. Frühholz, B. Schuller, Universum autoencoder-based domain adaptation for speech emotion recognition. IEEE Sign. Process. Lett. 24(4), 500–504 (2017)
Y. Zong, W. Zheng, T. Zhang, X. Huang, Cross-corpus speech emotion recognition based on domain-adaptive least-squares regression. IEEE Sign. Process. Lett. 23(5), 585–589 (2016)
J. Gideon, M.G. McInnis, E.M. Provost, Improving cross-corpus speech emotion recognition with adversarial discriminative domain generalization (ADDoG). IEEE Trans. Affect. Comput. 12(4), 1055–1068 (2021)
M. Abdelwahab, C. Busso, Domain adversarial for acoustic emotion recognition. IEEE/ACM Trans. Audio, Speech Lang. Process. 26(12), 2423–2435 (2018)
W. Zhang, P. Song, Transfer sparse discriminant subspace learning for cross-corpus speech emotion recognition. IEEE/ACM Trans. Audio, Speech Lang. Process. 28, 307–318 (2020)
D.L. Donoho et al., Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. U. S. A. 100(10), 5591–5596 (2003)
L. Zhang, D. Tao, W. Liu, in Proceedings of the 16th International Conference on Communication Technology. Supervised Hessian Eigenmap for dimensionality reduction (IEEE, Hangzhou, China, 2015), pp. 903–907
F. Asano, Y. Suzuki, D.C. Swanson, Optimization of control source configuration in active control systems using Gram-Schmidt orthogonalization. IEEE Trans. Speech Audio Process. 7(2), 213–220 (1999)
F. Nie, H. Huang, X. Cai, et al., in Proceedings of the 24th Annual Conference on Neural Information Processing Systems. Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization (NIPS, Vancouver, BC, Canada, 2010), pp. 1–9
R. He, T. Tan, L. Wang, W. Zheng, in 2012 IEEE Conference on Computer Vision and Pattern Recognition. ℓ2,1 regularized correntropy for robust feature selection (2012), pp. 2504–2511
Y. Shi, F. Sha, in Proceedings of the 29th International Conference on Machine Learning. Information-Theoretical Learning of Discriminative Clusters for Unsupervised Domain Adaptation (IMLS, Edinburgh, United Kingdom, 2012), pp. 1079–1086
B. Gholami, P. Sahu, O. Rudovic, K. Bousmalis, V. Pavlovic, Unsupervised multi-target domain adaptation: an information theoretic approach. IEEE Trans. Image Process. 29, 3993–4002 (2020)
Y. Tu, M. Mak, J. Chien, Variational domain adversarial learning with mutual information maximization for speaker verification. IEEE/ACM Trans. Audio, Speech Lang. Process. 28, 2013–2024 (2020)
D. Xin, T. Komatsu, S. Takamichi, H. Saruwatari, in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Disentangled speaker and language representations using mutual information minimization and domain adaptation for cross-lingual TTS (2021), pp. 6608–6612
X. Wang, L. Yan and Q. Zhang, in Proceedings of the International Conference on Computer Network, Electronic and Automation. Research on the Application of Gradient Descent Algorithm in Machine Learning (IEEE, Xi'an, China, 2021), pp. 11–15
F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier and B. Weiss, in Proceedings of the Interspeech. A database of German emotional speech (ISCA, Lisbon, Portugal, 2005), pp. 1517–1520
H.-C. Chou, W.-C. Lin, L.-C. Chang, C.-C. Li, H.-P. Ma, C.-C. Lee, in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII). NNIME: the NTHU-NTUA Chinese interactive multimodal emotion corpus (2017), pp. 292–298
C. Busso, M. Bulut, C.C. Lee, et al., IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resourc. Eval. 42(4), 335–359 (2008)
C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, E.M. Provost, MSP-IMPROV: an acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 8(1), 119–130 (2017)
R. Lotfian, C. Busso, Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Trans. Affect. Comput. 10(4), 471–483 (2019)
W. Fan, et al., in Proceedings of the International Conference on Acoustics, Speech and Signal Processing. LSSED: a large-scale dataset and benchmark for speech emotion recognition (IEEE, Toronto, Canada, 2021), pp. 641–645
J. Haitsma, T. Kalker, in Proceedings of the 3rd International Conference on Music Information Retrieval. A highly robust audio fingerprinting system (ISMIR, Paris, France, 2002), pp. 107–115
Y.C. Du, W.C. Hu, L.Y. Shyu, in The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. The effect of data reduction by independent component analysis and principal component analysis in hand motion identification (2004), pp. 84–86
S. Ji, J. Ye, Generalized linear discriminant analysis: a unified framework and efficient model selection. IEEE Trans. Neural Netw. 19(10), 1768–1782 (2008)
D. Cai, Spectral Regression: A Regression Framework for Efficient Regularized Subspace Learning. (Doctoral dissertation, University of Illinois at Urbana-Champaign), 2009
D. Cai, X. He, J. Han, Speed up kernel discriminant analysis. Int. J. Very Large Data Bases 20(1), 187–191 (2011)
D. Cai, X. He, J. Han, in Seventh IEEE International Conference on Data Mining (ICDM 2007). Spectral regression: a unified approach for sparse subspace learning (2007), pp. 73–82
B. Gong, Y. Shi, F. Sha, K. Grauman, in 2012 IEEE Conference on Computer Vision and Pattern Recognition. Geodesic flow kernel for unsupervised domain adaptation (2012), pp. 2066–2073
B. Fernando, A. Habrard, M. Sebban, T. Tuytelaars, in 2013 IEEE International Conference on Computer Vision. Unsupervised visual domain adaptation using subspace alignment (2013), pp. 2960–2967
J. Wang, W. Feng, Y. Chen, et al., in Proceedings of the ACM Multimedia Conference. Visual domain adaptation with manifold embedded distribution alignment (ACM, Seoul, Korea, 2018), pp. 402–410
M. Long, J. Wang, G. Ding, J. Sun, P.S. Yu, in 2013 IEEE International Conference on Computer Vision. Transfer feature learning with joint distribution adaptation (2013), pp. 2200–2207
S.J. Pan, I.W. Tsang, J.T. Kwok, Q. Yang, Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw. 22(2), 199–210 (2011)
J. Wang, Y. Chen, S. Hao, W. Feng, Z. Shen, in 2017 IEEE International Conference on Data Mining (ICDM). Balanced distribution adaptation for transfer learning (2017), pp. 1129–1134
M. Long, J. Wang, G. Ding, J. Sun, P.S. Yu, in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Transfer joint matching for unsupervised domain adaptation (2014), pp. 1410–1417

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61971015), the Beijing Natural Science Foundation (No. L223033), and the Cooperative Research Project of BJUT-NTUT (No. NTUT-BJUT-110-05).

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 61971015) and the Cooperative Research Project of BJUT-NTUT (No. NTUT-BJUT-110-05).

Author information

Authors and Affiliations

Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
Xuan Cao, Maoshen Jia & Jiawei Ru
Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei, Taiwan
Tun-wen Pai

Contributions

CX performed the whole research and wrote the paper. JM supported the writing and the experiments. The authors read and approved the final version of the paper.

Corresponding author

Correspondence to Maoshen Jia.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Cao, X., Jia, M., Ru, J. et al. Cross-corpus speech emotion recognition using subspace learning and domain adaption. J AUDIO SPEECH MUSIC PROC. 2022, 32 (2022). https://doi.org/10.1186/s13636-022-00264-5

Received: 20 August 2022
Accepted: 14 December 2022
Published: 27 December 2022
DOI: https://doi.org/10.1186/s13636-022-00264-5
