A cross-corpus speech emotion recognition method is proposed by combining subspace learning and domain adaptation. The block diagram of the proposed method is shown in Fig. 1.

First, features are extracted from the speech in the source and target corpora to form the source domain and the target domain. Then, Hessian-based subspace learning is performed on the features in both domains to obtain low-dimensional features, each domain forming its own independent subspace. The flowchart of the Hessian-based subspace learning part is shown in Fig. 2. Next, a mapping between the source domain subspace and the target domain subspace is established using information entropy, which reduces the difference in feature distribution between the two domains. This mapping is realized through a common space, so finding the common space shared by the two domains is central to the method. The flowchart of the domain adaptation part is shown in Fig. 3. Finally, emotions are predicted.

In the Hessian-based subspace learning part, the neighboring frames of the current frame are found by a neighborhood calculation. Then, the Hessian matrix [37] is constructed for low-dimensional embedding to obtain the subspaces of the source and target domains.

After the subspaces of the source and target domains are obtained, the transformation matrix is derived from the correlation coefficients of the subspaces. The distance between the feature data of each frame in the source domain subspace and that of each frame in the target domain subspace is then calculated, and from these distances the probability that a frame in the target domain subspace is a neighbor of each frame in the source domain is obtained. In this way, given the known class labels of the frames in the source domain subspace, the posterior probability that each frame in the target domain subspace belongs to a certain class can be estimated. Then, the entropy between the target domain features and the emotion labels, and the entropy between the features of both domains and the domain labels, are calculated. Finally, the two information entropies are jointly optimized by numerical descent, yielding the mapping between the source domain subspace and the target domain subspace, which is described by a common space.

In the following, Hessian-based subspace learning [38] and information entropy-based domain adaptation are introduced in detail. Finally, a specific optimization method for finding the common space is given.

### 2.1 Hessian-based subspace learning

An input feature matrix **X**=(*x*_{mn})_{M × N} is given, which is composed of the features of the speech. *m* and *n* are the feature index and the frame index, respectively. *M* and *N* are the feature dimension and the number of frames, respectively. First, the feature energy of each frame is computed as follows:

$$x_n^\text{e}={\textstyle\sum_{m=1}^M}\;x_{mn}^2,$$

(1)

where \({x}_n^{\textrm{e}}\) represents the feature energy of the *n*th frame, and *x*_{mn} represents the feature of the *m*th dimension in the *n*th frame.
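Equation (1) sums the squared features of each frame over all dimensions. A minimal NumPy sketch (the function name `feature_energy` is illustrative, not from the paper):

```python
import numpy as np

def feature_energy(X):
    """Per-frame feature energy (Eq. 1): x_n^e = sum_m x_mn^2.

    X: (M, N) feature matrix, rows = feature dimensions, cols = frames.
    Returns a length-N vector of frame energies.
    """
    return np.sum(X ** 2, axis=0)

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])      # M = 2 features, N = 2 frames
print(feature_energy(X))        # [10. 20.]
```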

Thus, an energy matrix can be formed as **X**^{e}= [\({x}_1^{\textrm{e}},{x}_2^{\textrm{e}},\dots, {x}_N^{\textrm{e}}\)]. Then, two new feature energy matrices **A** and **B**, which are used for calculating the distance of the feature between different frames, are defined as follows:

$$\left\{\begin{array}{c}\textbf{A}={\left({a}_{ij}\right)}_{N\times N}\\ {}\textbf{B}={\left({b}_{ij}\right)}_{N\times N}\end{array}\right.$$

(2)

where \({a}_{ij}={x}_j^{\textrm{e}}\), \({b}_{ij}={x}_i^{\textrm{e}}\), 1 ≤ *i*, *j* ≤ *N*, and *i* and *j* represent the index of the row and column, respectively. In order to find the nearest *K* frames of each frame, the distance **D**_{e} = (*d*_{ij})_{N × N} of the feature between different frames is calculated as follows:

$${\textbf{D}}_e=\textbf{A}+\textbf{B}-2{\textbf{X}}^T\textbf{X}$$

(3)

where *d*_{ij} represents the distance between the feature energies of the *i*th frame and the *j*th frame; the smaller *d*_{ij} is, the closer the two frames are. In fact, the distance **D**_{e} is derived from the squared Euclidean distance: **A** and **B** are formed from the squares of the elements of the input matrix **X**. According to Eqs. (1), (2), and (3), the distance defined in this paper satisfies non-negativity and identity, and the construction of **A** and **B** also guarantees the symmetry of the distance.
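The identity **D**_{e} = **A** + **B** − 2**X**^{T}**X** can be checked against the direct pairwise definition. A small NumPy sketch with random toy data (`energy_distance` is an illustrative name):

```python
import numpy as np

def energy_distance(X):
    """Pairwise squared Euclidean distances between frames (Eq. 3):
    D = A + B - 2 X^T X, with a_ij = ||x_j||^2 and b_ij = ||x_i||^2."""
    e = np.sum(X ** 2, axis=0)          # frame energies (Eq. 1)
    A = np.tile(e, (len(e), 1))         # a_ij = e_j (constant along each row)
    B = A.T                             # b_ij = e_i
    return A + B - 2.0 * X.T @ X

X = np.random.default_rng(0).normal(size=(5, 4))   # M = 5, N = 4
D = energy_distance(X)
# Agrees with the direct definition ||x_i - x_j||^2 and is symmetric:
ref = np.array([[np.sum((X[:, i] - X[:, j]) ** 2) for j in range(4)]
                for i in range(4)])
print(np.allclose(D, ref), np.allclose(D, D.T))
```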

The *j*th column of the matrix **D**_{e} (i.e., \({\boldsymbol{d}}_j^e={\left[{d}_{1j}^e,{d}_{2j}^e,\dots, {d}_{Nj}^e\right]}^T\)) is the vector of feature-energy distances between the *j*th frame and every frame. Sorting it in ascending order gives \({\textbf{d}}_j^{eS}={\left[{d}_{S_j(1)j}^e,{d}_{S_j(2)j}^e,\dots, {d}_{S_j(N)j}^e\right]}^T\), where *S*_{j}(*i*) denotes the index of the frame at the *i*th position when sorted by distance from the *j*th frame: *S*_{j}(1) is the index with the minimum distance in \({d}_{ij}^e\), and *S*_{j}(*N*) is the index with the maximum distance. It is worth mentioning that for each frame, \({d}_{jj}^e\) is the minimum element of \({\boldsymbol{d}}_j^e\), i.e., *S*_{j}(1) = *j*. The 2nd to the (*K*+1)-th smallest distances in \({\textbf{d}}_j^{eS}\) are selected to form the adjacent index vector *i*_{j} = [*S*_{j}(2), *S*_{j}(3), …, *S*_{j}(*K* + 1)]^{T} of the *j*th frame, where *K* denotes the number of nearest neighbor frames. Thereby, the *K*×*N* adjacent index matrix **I** = [*i*_{1}, *i*_{2}, …, *i*_{N}] of all *N* frames is obtained. The elements of the input matrix **X** indexed by **I** are then selected to form a neighborhood matrix **Z**_{n}, which is defined as follows:

$${\textbf{Z}}_n={\left({z}_{mk}^n\right)}_{M\times K}$$

(4)

where \({z}_{mk}^n={x}_{m{S}_n\left(k+1\right)}\), 1 ≤ *k* ≤ *K*, 1 ≤ *m* ≤ *M*, 1 ≤ *n* ≤ *N*. *k*, *m*, and *n* are the neighbor index, the feature index, and the frame index, respectively. **Z**_{n} represents the neighborhood matrix corresponding to the *n*th frame.
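The neighbor selection above (sort each distance column, drop the frame itself, keep the next *K* indices) can be sketched in NumPy as follows; the data and the name `neighbor_indices` are illustrative:

```python
import numpy as np

def neighbor_indices(D, K):
    """For each frame j, the indices of its K nearest frames (self excluded),
    i.e. S_j(2)..S_j(K+1) of the sorted distance vector."""
    order = np.argsort(D, axis=0)       # S_j(1)..S_j(N) for each column j
    return order[1:K + 1, :]            # drop S_j(1) = j itself -> (K, N)

# Toy data: frames 0 and 1 are close; frame 2 is far away.
X = np.array([[0.0, 0.1, 5.0]])
e = np.sum(X ** 2, axis=0)
D = e[None, :] + e[:, None] - 2.0 * X.T @ X   # Eq. (3)
I = neighbor_indices(D, K=1)
print(I.ravel())                        # nearest neighbor of each frame
```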

**E**_{n} is a centering matrix of **Z**_{n}, which is defined as follows:

$${\textbf{E}}_n={\left({e}_{mk}^n\right)}_{M\times K}$$

(5)

where \(e_{mk}^n=\frac1K\sum_{k^{\prime}=1}^Kz_{mk^{\prime}}^n\), i.e., every column of **E**_{n} is the mean of the columns of **Z**_{n}.

The purpose of the proposed Hessian-based subspace learning is to obtain the local coordinates of each neighborhood, which are reached via tangent coordinates. The tangent coordinates span the tangent space, which is regarded as a subspace of the Euclidean space. An orthonormal coordinate system, inheriting the inner product of the Euclidean space, can be obtained by singular value decomposition. Therefore, **Z**_{n} **− E**_{n} is subjected to singular value decomposition, which yields the orthonormal basis \({\textbf{V}}_n={\left({v}_{ij}^n\right)}_{K\times K}\) as follows:

$${\textbf{Z}}_n-{\textbf{E}}_n={\textbf{U}}_n{\boldsymbol{\Sigma}}_n{\textbf{V}}_n^T$$

(6)

where (·)^{T} denotes transposition, **U**_{n} is the matrix of left singular vectors of **Z**_{n} **− E**_{n}, and **Σ**_{n} is the diagonal matrix of singular values.

The first *d* columns of **V**_{n} are extracted to constitute the tangent coordinates \({\textbf{V}}_n^d={\left({v}_{ij}^n\right)}_{K\times d}\) with dimension *K* × *d*.
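Steps (5) and (6) can be sketched in NumPy: center the neighborhood, take the SVD, and keep the first *d* right singular vectors. The name `tangent_coordinates` and the toy data are illustrative:

```python
import numpy as np

def tangent_coordinates(Z, d):
    """Tangent coordinates of one neighborhood (Eqs. 5-6).

    Z: (M, K) neighborhood matrix Z_n. The neighborhood is centered by
    subtracting its column mean (the role of E_n), an SVD is taken, and the
    first d right singular vectors form the K x d tangent coordinates V_n^d.
    """
    E = Z.mean(axis=1, keepdims=True)        # broadcast mean over the K columns
    _, _, Vt = np.linalg.svd(Z - E, full_matrices=True)
    return Vt[:d, :].T                       # (K, d)

rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 4))                  # M = 6, K = 4
V = tangent_coordinates(Z, d=2)
print(V.shape, np.allclose(V.T @ V, np.eye(2)))   # orthonormal columns
```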

Next, an association Hessian matrix **Q**_{n} is given by using \({\textbf{V}}_n^d\), which is defined as follows:

$${\textbf{Q}}_n={\left({q}_{kj}^n\right)}_{K\times \frac{d\left(d+1\right)}{2}}$$

(7)

where \({q}_{kj}^n={v}_{k{j}_1}^n{v}_{k{j}_2}^n\), *n* is the frame index, 1≤ *n* ≤*N*. *j*_{1} and *j*_{2} are the dimension indexes. The corresponding relationship among *j*, *j*_{1}, and *j*_{2} is given as follows:

$$j=j_2+{\textstyle\sum_{l=1}^{j_1-1}}\left(d-l\right)$$

(8)

where 1 ≤ *j*_{1} ≤ *j*_{2} ≤ *d*, \(j=1,2,\dots, \frac{d\left(d+1\right)}{2}\).
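Since the product \({v}_{k{j}_1}^n{v}_{k{j}_2}^n\) is symmetric in *j*_{1} and *j*_{2}, only the \(\frac{d(d+1)}{2}\) unordered dimension pairs are needed. The sketch below implements one consistent enumeration of those pairs (assuming *j*_{1} ≤ *j*_{2}; the helper name `pair_index` is illustrative) and checks that it is a bijection onto 1..*d*(*d*+1)/2:

```python
def pair_index(j1, j2, d):
    """Map a dimension pair (j1, j2), 1 <= j1 <= j2 <= d, to a flat
    column index j in 1..d(d+1)/2 for the association matrix Q_n."""
    return j2 + sum(d - l for l in range(1, j1))

d = 4
flat = [pair_index(j1, j2, d) for j1 in range(1, d + 1)
                              for j2 in range(j1, d + 1)]
print(flat)   # hits each of 1..d(d+1)/2 exactly once
```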

Furthermore, an estimation matrix \({\textbf{L}}_n={\left({l}_{ij}^n\right)}_{K\times \left(1+d+\frac{d\left(d+1\right)}{2}\right)}\) is constructed as follows:

$${l}_{ij}^n=\left\{\begin{array}{ll}1,& j=1\\ {}{v}_{i\left(j-1\right)}^n,& 2\le j\le d+1\\ {}{q}_{i\left(j-d-1\right)}^n,& d+2\le j\le 1+d+\frac{d\left(d+1\right)}{2}\end{array}\right.$$

(9)

where 1 ≤ *i* ≤ *K*, 1 ≤ *n* ≤ *N*.

\({\textbf{G}}_n={\left({g}_{ij}^n\right)}_{K\times \left(1+d+\frac{d\left(d+1\right)}{2}\right)}\) can be obtained by Gram-Schmidt orthogonalization of the estimation matrix **L**_{n} [39]. The last \(\frac{d\left(d+1\right)}{2}\) columns of **G**_{n} are taken to obtain the matrix \({\textbf{G}}_n^b={\left({g}_{ij}^{bn}\right)}_{K\times \frac{d\left(d+1\right)}{2}}\). Then, the Hessian quadratic matrix **H** can be constructed from \({\textbf{G}}_n^b\) as follows:

$$\textbf{H}={\textstyle\sum_{n=1}^N}\textbf{C}_n^T{\textbf{C}}_n$$

(10)

where \({\textbf{C}}_n={\left({\textrm{c}}_{ij}\right)}_{\frac{d\left(d+1\right)}{2}\times N}\) is a matrix composed of \({{\textbf{G}}_n^b}^T\), and it is defined as follows:

$${c}_{i{S}_n(j)}=\left\{\begin{array}{cc}{g}_{ji}^{bn},& 1\le j\le K\\ {}0,& K<j\le N\end{array}\right.$$

(11)

where \(1\le i\le \frac{d\left(d+1\right)}{2}\), and *S*_{n}(*j*) denotes the index of the frame sorted by the distance from the *n*th frame, 1≤*n* ≤ *N*.

Next, the *d*-dimensional subspace corresponding to the *d* smallest eigenvalues of **H** can be obtained; it approximates the null space and is denoted as **U** = (*u*_{ij})_{N × d}. If a manifold is locally isometric to an open subset of Euclidean space, the mapping from the manifold to that subset is a linear function. The mixed second derivatives of a linear function are 0, so the local quadratic form formed by the Hessian coefficients is also 0. Hence, the global Hessian matrix has a (*d*+1)-dimensional null space: one dimension is spanned by the constant function, and the other *d* dimensions form the isometric coordinates. Then, the embedding matrix **R** = (*r*_{ij})_{d × d} can be calculated as follows:

$$r_{ij}={\textstyle\sum_{l\in \boldsymbol{J}}}u_{li}u_{lj}$$

(12)

where *J* represents the set of the index of the neighborhood frames, 1 ≤ *i* ≤ *d*, 1 ≤ *j* ≤ *d*.
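A minimal sketch of this step: take the eigenvectors of **H** next to the constant direction as the null-space coordinates **U**, then form **R** over a neighborhood index set *J* via Eq. (12). **H** here is a random symmetric stand-in, not a real Hessian estimate, and the index set is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 8, 2
A = rng.normal(size=(N, N))
H = A @ A.T                                  # symmetric PSD stand-in for Eq. (10)

w, V = np.linalg.eigh(H)                     # eigenvalues in ascending order
U = V[:, 1:d + 1]                            # skip the constant direction,
                                             # keep the next d null-space vectors
J = [0, 1, 2]                                # example neighborhood index set
R = U[J, :].T @ U[J, :]                      # r_ij = sum_{l in J} u_li u_lj (Eq. 12)
print(U.shape, R.shape)                      # (8, 2) (2, 2)
```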

Finally, the subspace **Y** is obtained according to the low-dimensional embedding:

$$\textbf{Y}={\textbf{R}}^{\mu }{\textbf{U}}^T$$

(13)

where *μ* is a regularization parameter, and (·)^{T} denotes transposition.

There may be a small number of outliers in the subspace **Y** after the low-dimensional embedding, i.e., a few values that deviate from the distribution of most of the data. To correct them, detection thresholds are set to recognize the outliers, which are then replaced with 2*Tr*(**U**^{T}**EU**) [40], where *Tr*(·) denotes the trace of a matrix. **E** = (*e*_{ij})_{N × N} is a diagonal matrix, where *e*_{ij} is defined as [41]:

$${e}_{ij}=\left\{\begin{array}{c}\begin{array}{cc}\frac{1}{2\parallel {u}_i{\parallel}_2}& i=j\end{array}\\ {}\begin{array}{cc}0& \kern2em i\ne j\end{array}\end{array}\right.$$

(14)

Following the above steps, the source domain subspace **Y**_{s} and the target domain subspace **Y**_{t} can be obtained.

### 2.2 Information entropy-based domain adaptation

A domain adaptation method is proposed to build the relationship between the source domain subspace and the target domain subspace. In detail, a common space in which the source and target domains have similar feature distributions is constructed. Both the information entropy between the data and the emotion labels and the entropy between the data and the domain labels are used to optimize the mapping [42]. Thereby, the difference in feature distribution between different corpora can be reduced.

After the source domain subspace \({\textbf{Y}}_{\textrm{s}}={\left({y}_{ij}^{\textrm{s}}\right)}_{d\times N}\) and the target domain subspace \({\textbf{Y}}_{\textrm{t}}={\left({y}_{ij}^{\textrm{t}}\right)}_{d\times N}\) are obtained, the principal component coefficients of the source domain \({\textbf{W}}_{\textrm{s}}={\left({w}_{ij}^s\right)}_{d\times d}\) and of the target domain \({\textbf{W}}_{\textrm{t}}={\left({w}_{ij}^t\right)}_{d\times d}\) are calculated. When the dimensions of the source and target domains differ, the principal component coefficients also differ in dimension, and the smaller of the two dimensions should be taken as *d*_{w}. In this paper, the dimensions of the two domains are the same, so *d*_{w} is set to *d*. Since the transfer is carried out from the source domain to the target domain, the target domain serves as the basis of the transformation space, and the transformation matrix for both domains is set as **W** = **W**_{t}. Features in the source domain and the target domain can then be mapped into a common space by **W**.

First, the distance matrix **D** = (*d*_{ij})_{N × N} formed by the features between different frames from the source domain subspace and the target domain subspace is given as follows:

$$\textbf{D}={\textbf{A}}^{\prime }+{\textbf{B}}^{\prime }-2{{\textbf{X}}_{\textrm{s}}}^T{\textbf{X}}_{\textrm{t}}$$

(15)

where \({\textbf{X}}_{\textrm{s}}={\left({x}_{mn}^{\textrm{s}}\right)}_{d\times N}={\textbf{W}}^T{\textbf{Y}}_{\textrm{s}}\) denotes the source domain subspace features in transform space, \({\textbf{X}}_{\textrm{t}}={\left({x}_{mn}^{\textrm{t}}\right)}_{d\times N}={\textbf{W}}^T{\textbf{Y}}_{\textrm{t}}\) denotes the target domain subspace features in transform space, \(\boldsymbol{A}^{\boldsymbol{'}}={(a_{ij})}_{N\times N},a_{ij}=\sum\nolimits_{m=1}^{d}\left(x_{mj}^{\text{s}}\right)^{2}\), \(\mathbf{B}^{\boldsymbol{'}}={(b_{ij})}_{N\times N},b_{ij}=\sum\nolimits_{m=1}^{d}\left(x_{mi}^{\text{t}}\right)^{2}\).

The neighbor frames are detected according to the distance between the feature of each frame. Therefore, a conditional probability model is defined as follows:

$${p}_{ij}=\frac{e^{-{d}_{ij}}}{\sum_{i=1}^N{e}^{-{d}_{ij}}}$$

(16)

where 1≤ *i* ≤ *N*, 1≤ *j* ≤ *N*, and *p*_{ij} is the conditional probability density that the *j*th frame in the target domain is adjacent to the *i*th frame in the source domain. It can describe the probability of the nearest neighbor between each frame feature in the source domain and the frame feature in the target domain.
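Equation (16) is a column-wise softmax over the negated distances. A NumPy sketch with a toy distance matrix (`neighbor_probability` is an illustrative name):

```python
import numpy as np

def neighbor_probability(D):
    """Conditional probability of Eq. (16): column-wise softmax of -D.
    p_ij is the probability that target frame j is adjacent to source frame i."""
    E = np.exp(-D)
    return E / E.sum(axis=0, keepdims=True)

D = np.array([[0.0, 2.0],
              [1.0, 0.5]])
P = neighbor_probability(D)
print(np.allclose(P.sum(axis=0), 1.0))   # each column is a probability distribution
```

In practice, subtracting the column minimum from each column of `D` before exponentiating avoids underflow for large distances without changing the result.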

The emotion label corresponding to the *i*th frame in the source domain is Label_{i}, Label_{i}∈**Label** = {1, 2, ... , *L*}, i.e., there are *L* types of emotion in total. According to Eq. (16), the emotion label probability estimate \({\hat{p}}_{lj}\) of the *j*th frame in the target domain is given as follows:

$${\widehat p}_{lj}={\textstyle\sum_{\mathrm{Label}_i=l}}p_{ij}$$

(17)

where 1 ≤ *l* ≤ *L*, 1 ≤ *j* ≤ *N*, 1 ≤ *i* ≤ *N*, and \({\hat{p}}_{lj}\) expresses the probability that the *j*th frame in the target domain is discriminated as the *l*th type of emotion when the emotion labels of the source domain are known.
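Equation (17) sums the neighbor probabilities over all source frames sharing a label. A NumPy sketch with toy values (`label_probability` is an illustrative name):

```python
import numpy as np

def label_probability(P, labels, L):
    """Emotion-label estimate for target frames (Eq. 17):
    p_hat[l-1, j] = sum of p_ij over source frames i with Label_i = l.

    P: (N, N) neighbor probabilities; labels: length-N source labels in 1..L.
    """
    p_hat = np.zeros((L, P.shape[1]))
    for l in range(1, L + 1):
        p_hat[l - 1] = P[labels == l].sum(axis=0)   # pool same-label source frames
    return p_hat

P = np.array([[0.5, 0.2],
              [0.3, 0.3],
              [0.2, 0.5]])            # columns sum to 1 (from Eq. 16)
labels = np.array([1, 1, 2])          # source frames 0,1 -> emotion 1; frame 2 -> emotion 2
p_hat = label_probability(P, labels, L=2)
print(p_hat)
```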

Since \({\hat{p}}_{lj}\) is a preliminary probability estimate of the emotion label of each frame feature in the target domain, the relationship between target domain features and emotion labels cannot be directly revealed by \({\hat{p}}_{lj}\) [43,44,45]. Therefore, the entropy *I*(**X**_{t}; **Label**) between the target domain features and emotion labels is calculated by using \({\hat{p}}_{lj}\) in this paper, which is defined as follows:

$$I\left({\textbf{X}}_{\textrm{t}};\textbf{Label}\right)=-\sum\nolimits_{l=1}^L\left(\left(\sum\nolimits_{j=1}^N\frac{{\hat{p}}_{lj}}{N}\right)\log \left(\sum\nolimits_{j=1}^N\frac{{\hat{p}}_{lj}}{N}\right)\right)-\frac{-{\sum}_{j=1}^N{\sum}_{l=1}^L{\hat{p}}_{lj}\log \left({\hat{p}}_{lj}\right)}{N}$$

(18)

Equation (18) is composed of two parts. The first part is the entropy of the average probability that the frames of the target domain belong to each emotion label; the second part is the average of the per-frame entropies of the target domain features over the emotion labels. In order to reduce the influence of incorrect labels on the per-frame discrimination results in the target domain, Eq. (18) is optimized later. It should be noted that minimizing only the second part yields a degenerate solution in which all frames in the target domain may be classified into the same type of emotion; the first part of Eq. (18) is therefore necessary.
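The two parts of Eq. (18) can be sketched as follows (the function name `label_information` is illustrative; the toy inputs avoid zero probabilities so the logarithms are finite). Confident, diverse label estimates score higher than uniform ones, which is the behavior the optimization exploits:

```python
import numpy as np

def label_information(p_hat):
    """Entropy criterion of Eq. (18): entropy of the average label
    distribution minus the average per-frame label entropy."""
    N = p_hat.shape[1]
    p_bar = p_hat.mean(axis=1)                        # average over target frames
    marginal_entropy = -np.sum(p_bar * np.log(p_bar))  # first part of Eq. (18)
    mean_entropy = -np.sum(p_hat * np.log(p_hat)) / N  # second part of Eq. (18)
    return marginal_entropy - mean_entropy

confident = np.array([[0.99, 0.01],
                      [0.01, 0.99]])   # nearly one-hot, balanced labels
uniform = np.full((2, 2), 0.5)         # maximally uncertain
print(label_information(confident) > label_information(uniform))  # True
```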

Then, the entropy *I*^{st}(**X**) between the features of the two domains and the domain labels is introduced to maximize the similarity between the two domains; it is defined as:

$$I^{st}\left(\text{X}\right)=-\sum\nolimits_{t=1}^2\left(\sum\nolimits_{j=1}^{N+M}\frac{p_{tj}}{N+M}\log\left(\sum\nolimits_{j=1}^{N+M}\frac{p_{tj}}{N+M}\right)\right)-\frac{\left(-\sum_{j=1}^{N+M}\sum_{t=1}^2\left(p_{tj}\log\left(p_{tj}\right)\right)\right)}{N+M}$$

(19)

where 1 ≤ *j* ≤ *N* + *M*.

To calculate the entropy *I*^{st}(**X**), firstly, the distance \({d}_{ij}^{\prime }\) between the *i*th frame feature in the source domain and the *j*th frame feature in the target domain is calculated according to Eq. (3), where **X** = (*x*_{ij})_{d × (N + M)} denotes the features of all frames in the source and target domains, **A =** (*a*_{ij})_{(N + M) × (N + M)}, \({a}_{ij}=\sum_{m=1}^d{\left({x}_{mj}\right)}^2\), **B =** (*b*_{ij})_{(N + M) × (N + M)}, and \({b}_{ij}=\sum_{m=1}^d{\left({x}_{mi}\right)}^2\). *N* and *M* denote the number of frames in the source domain and the target domain, respectively. In this paper, the number of frames in the source domain is the same as that in the target domain, i.e., *N* = *M*. Then, the probability \({p}_{ij}^{\prime }\) that the *i*th frame feature and the *j*th frame feature of the two domains are adjacent is calculated according to Eq. (16) using \({d}_{ij}^{\prime }\). Next, the probability *p*_{tj} that the *j*th frame is judged as belonging to the source domain or the target domain is calculated according to Eq. (17).

### 2.3 Optimization

In this subsection, an iterative optimization algorithm based on numerical descent [46] is introduced using Eqs. (18) and (19). The objective function is:

$$f=\min\ \left\{\uplambda {I}^{st}\left(\textbf{X}\right)-I\Big({\textbf{X}}_{\textrm{t}};\textbf{Label}\Big)\right\}$$

(20)

where *λ* is the regularization parameter.

In the optimization process, the transfer coefficient matrix **g** used for numerical descent is defined as follows:

$$\boldsymbol{g}=\uplambda {\boldsymbol{g}}^{st}\left(\textbf{X}\right)-\boldsymbol{g}\left({\textbf{X}}_{\textrm{t}};\textbf{Label}\right)$$

(21)


The calculation process of *g*(**X**_{t}; **Label**) is as follows. First, an information matrix \({\textbf{I}}^C={\left({i}_{lj}^c\right)}_{L\times N}\) is defined using \({\hat{p}}_{lj}\) as:

$${i}_{lj}^c=\frac{\log \left({\hat{p}}_{lj}\right)-\log \left(\sum_{j=1}^N\frac{{\hat{p}}_{lj}}{N}\right)}{N}$$

(22)

where \({i}_{lj}^c\) represents the difference between the probability that the feature of the *j*th frame in the target domain belongs to the *l*th emotion category and the average probability that the features of all frames in the target domain belong to that category.

Next, a coefficient matrix **Γ** = (*γ*_{ij})_{N × N} is calculated from *p*_{ij} and \({i}_{lj}^c\) as follows:

$${\gamma}_{ij}=\left(\sum\nolimits_{i=1}^N{o}_{ij}{p}_{ij}-{o}_{ij}\right){p}_{ij}$$

(23)

where \({o}_{ij}={i}_{lj}^c\) with *Label*_{i} = *l*. Then *g*(**X**_{t}; **Label**) is obtained as follows:

$$\boldsymbol{g}\left({\textbf{X}}_{\textrm{t}};\textbf{Label}\right)=2\left[{\textbf{Y}}_{\textrm{s}}\boldsymbol{\Omega} {{\textbf{Y}}_{\textrm{s}}}^T+{\textbf{Y}}_{\textrm{t}}\boldsymbol{\Omega} {{\textbf{Y}}_{\textrm{t}}}^T-{\textbf{Y}}_{\textrm{s}}\boldsymbol{\Gamma} {{\textbf{Y}}_{\textrm{t}}}^T-{\textbf{Y}}_{\textrm{t}}\boldsymbol{\Gamma} {{\textbf{Y}}_{\textrm{s}}}^T\right]\textbf{W}$$

(24)

where **Ω** is a diagonal matrix whose main diagonal elements are \(\sum_{j=1}^N{\gamma}_{ij}\), and **W** is the transfer matrix.
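The matrix products of Eq. (24) can be sketched directly in NumPy. All inputs below are random stand-ins: **Γ** would come from Eq. (23) and **W** from the target-domain principal components; only the shapes and the formula itself are taken from the text:

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 3, 5
Ys = rng.normal(size=(d, N))                 # source subspace Y_s (stand-in)
Yt = rng.normal(size=(d, N))                 # target subspace Y_t (stand-in)
W = np.eye(d)                                # transfer matrix (stand-in)
Gamma = rng.normal(size=(N, N))              # coefficient matrix of Eq. (23)
Omega = np.diag(Gamma.sum(axis=1))           # diagonal of row sums of Gamma

# Eq. (24): g(X_t; Label)
g = 2.0 * (Ys @ Omega @ Ys.T + Yt @ Omega @ Yt.T
           - Ys @ Gamma @ Yt.T - Yt @ Gamma @ Ys.T) @ W
print(g.shape)                               # (3, 3): a d x d descent direction
```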

Since the calculation of *g*^{st}(**X**) parallels that of *g*(**X**_{t}; **Label**), only the latter is described in detail in this paper; the variables used in calculating *g*^{st}(**X**) follow those in the calculation of *I*^{st}(**X**).

Finally, the common space **L** is obtained. The mapped feature data of the source domain are **F**_{s} = **Y**_{s}^{T}**L**, and those of the target domain are **F**_{t} = **Y**_{t}^{T}**L**.