
Speaker adaptation in the maximum a posteriori framework based on the probabilistic 2-mode analysis of training models

Abstract

In this article, we describe a speaker adaptation method based on the probabilistic 2-mode analysis of training models. Probabilistic 2-mode analysis is a probabilistic extension of multilinear analysis. We apply probabilistic 2-mode analysis to speaker adaptation by representing each of the hidden Markov model mean vectors of training speakers as a matrix, and derive the speaker adaptation equation in the maximum a posteriori (MAP) framework. The adaptation equation becomes similar to the speaker adaptation equation using the MAP linear regression adaptation. In the experiments, the adapted models based on probabilistic 2-mode analysis showed performance improvement over the adapted models based on Tucker decomposition, which is a representative multilinear decomposition technique, for small amounts of adaptation data while maintaining good performance for large amounts of adaptation data.

1 Introduction

In automatic speech recognition (ASR) systems using hidden Markov models (HMMs) [1], mismatches between the training and testing conditions lead to performance degradation. One such mismatch results from speaker variation. Thus, speaker adaptation techniques [2] are employed to transform a well-trained canonical model (e.g., a speaker-independent (SI) HMM) toward the target speaker. Speaker adaptation requires less adaptation data than is needed to build a speaker-dependent (SD) model. Among speaker adaptation techniques, the eigenvoice (EV) approach [3] expresses the model of a new speaker as a linear combination of basis vectors, which are built from the principal component analysis (PCA) of the HMM mean vectors of training speakers.

In a similar approach, speaker adaptation based on tensor analysis using Tucker decomposition[4] was investigated in[5], where bases were constructed from the multilinear decomposition of a tensor that consisted of the HMM mean vectors of training speakers. In the approach, all the training models were collectively arranged in a third-order tensor (3-D array):

$$\mathcal{M} \in \mathbb{R}^{R \times D \times S} \qquad (1)$$

where the first, second, and third modes (dimensions) correspond to the mixture component, the dimension of the mean vector, and the training speaker, respectively. In [5], Tucker decomposition was used to build bases, and in the experiments, speaker adaptation using Tucker decomposition showed better performance than eigenvoice and maximum likelihood linear regression (MLLR) adaptation [6]. The improvement seemed to be attributable to the increased number of adaptation parameters and compact bases. It was also noted in [5] that an increased number of adaptation parameters did not guarantee good performance when the amount of adaptation data was small (determining the proper number of adaptation parameters for given adaptation data is a model-order selection problem). Extending the tensor-based approach, in [7] a fourth mode for noise was added (making the array four-dimensional) so that the training models of various speakers and noise conditions were decomposed.

In this article, we describe a speaker adaptation method using probabilistic 2-mode analysis, which is an application of probabilistic tensor analysis (PTA)[8] to the second-order tensor (i.e., matrix); PTA is an application of probabilistic PCA (PPCA)[9] to tensor objects. Using probabilistic 2-mode analysis, we derive bases from training models in a probabilistic framework, and formulate the speaker adaptation equation in the maximum a posteriori (MAP) framework[10]. The speaker adaptation equation based on the probabilistic approach becomes similar to MAP linear regression (MAPLR) adaptation[11] as shown below. The experiments showed that the proposed method further improved the performance of the speaker adaptation based on Tucker decomposition for small amounts of adaptation data.

The rest of this article is organized as follows. Section 2.1 reviews tensor algebra and tensor decomposition. Section 2.2 describes speaker adaptation using Tucker decomposition, which is compared with the probabilistic 2-mode analysis-based method. Section 2.3 explains the probabilistic 2-mode analysis of a set of mean vectors of training HMMs, and Section 2.4 the construction of the probabilistic 2-mode model for speaker adaptation. In Section 2.5, the estimation of the prior distribution of the adaptation parameter is described. Section 2.6 describes the speaker adaptation in the MAP framework using the bases and the prior. We explain the experiments in Section 3 and conclude the article in Section 4. Some of the notations used in this article are summarized in Table 1.

Table 1 Notations used in the article

2 Methods

2.1 Multilinear decomposition

Following the convention of multilinear algebra, we denote vectors, matrices, and tensors by lowercase boldface letters (e.g., $\mathbf{m}$), uppercase boldface letters (e.g., $\mathbf{M}$), and calligraphic letters (e.g., $\mathcal{M}$), respectively, in this article.

A tensor is a multidimensional array, and an $N$-dimensional array is called an $N$-th-order tensor (or $N$-way array). The order of a tensor is the number of indices needed to address its elements; so the order of $\mathcal{M} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is $N$. A scalar, a vector, and a matrix are zeroth-, first-, and second-order tensors, respectively. There are three indices for addressing the array in a third-order tensor, as depicted in Figure 1.

Figure 1. A third-order tensor.

Tensor algebra is performed in terms of matrix and vector representations of tensors; the mode-$n$ flattening (matricization) of a tensor $\mathcal{M}$, denoted $\mathbf{M}_{(n)}$, is obtained by reordering the elements as follows:

$$\mathbf{M}_{(n)} \in \mathbb{R}^{I_n \times (I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N)}. \qquad (2)$$

That is, all the column vectors along mode $n$ are arranged into a matrix. For example, a third-order tensor $\mathcal{M} \in \mathbb{R}^{I \times J \times K}$ can be flattened into an $I \times (JK)$, $J \times (KI)$, or $K \times (IJ)$ matrix, as depicted in Figure 2; for a $2 \times 2 \times 2$ tensor:

$$\mathcal{M} = \left[\, \mathbf{M}(:,:,1) \;\; \mathbf{M}(:,:,2) \,\right], \quad \mathbf{M}(:,:,1) = \begin{bmatrix} m_{111} & m_{121} \\ m_{211} & m_{221} \end{bmatrix}, \quad \mathbf{M}(:,:,2) = \begin{bmatrix} m_{112} & m_{122} \\ m_{212} & m_{222} \end{bmatrix}, \qquad (3)$$
Figure 2. Mode-n flattening of a third-order tensor.

the mode-$n$ flattenings are given as:

$$\mathbf{M}_{(1)} = \begin{bmatrix} m_{111} & m_{121} & m_{112} & m_{122} \\ m_{211} & m_{221} & m_{212} & m_{222} \end{bmatrix}, \quad \mathbf{M}_{(2)} = \begin{bmatrix} m_{111} & m_{211} & m_{112} & m_{212} \\ m_{121} & m_{221} & m_{122} & m_{222} \end{bmatrix}, \quad \mathbf{M}_{(3)} = \begin{bmatrix} m_{111} & m_{211} & m_{121} & m_{221} \\ m_{112} & m_{212} & m_{122} & m_{222} \end{bmatrix}. \qquad (4)$$

The mode-$n$ flattening operation will be denoted as $\mathrm{mat}_n(\cdot)$, i.e., $\mathrm{mat}_n(\mathcal{M}) = \mathbf{M}_{(n)}$.

Multiplication of a tensor by a matrix is performed through the $n$-mode product; the $n$-mode product of a tensor $\mathcal{W}$ with a matrix $\mathbf{U}$ is denoted as

$$\mathcal{M} = \mathcal{W} \times_n \mathbf{U} \qquad (5)$$

and is carried out by matrix multiplication in terms of flattened matrices:

$$\mathbf{M}_{(n)} = \mathbf{U}\,\mathbf{W}_{(n)} \qquad (6)$$

or elementwise

$$\left(\mathcal{W} \times_n \mathbf{U}\right)_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} w_{i_1 i_2 \cdots i_N}\, u_{j i_n} \qquad (7)$$

where $w$ and $u$ denote the elements of $\mathcal{W}$ and $\mathbf{U}$, respectively. If $\mathcal{W} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\mathbf{U}^T \in \mathbb{R}^{K_n \times I_n}$, then the dimension of $\mathcal{W} \times_n \mathbf{U}^T$ becomes $I_1 \times I_2 \times \cdots \times I_{n-1} \times K_n \times I_{n+1} \times \cdots \times I_N$.
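To make the two operations above concrete, the following is a minimal NumPy sketch of the mode-$n$ flattening of Equation (2) and the $n$-mode product of Equation (6); the function names unfold and mode_product are our own labels, not notation from the article.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n flattening: bring axis `mode` to the front and reshape to I_n x (product of the rest)."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """n-mode product: multiply the mode-n flattening by `matrix` (Equation (6)) and fold back."""
    new_shape = [matrix.shape[0]] + [s for i, s in enumerate(tensor.shape) if i != mode]
    flat = matrix @ unfold(tensor, mode)
    return np.moveaxis(flat.reshape(new_shape), 0, mode)

# example: a 3 x 4 x 5 tensor multiplied along mode 0 by a 2 x 3 matrix gives a 2 x 4 x 5 tensor
W = np.random.randn(3, 4, 5)
U = np.random.randn(2, 3)
print(mode_product(W, U, mode=0).shape)   # (2, 4, 5)
```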

As an extension of singular value decomposition (SVD) to tensor objects, Tucker decomposition decomposes a tensor as follows[4]:

$$\mathcal{M}_{I_1 \times I_2 \times \cdots \times I_N} \approx \mathcal{W}_{K_1 \times K_2 \times \cdots \times K_N} \prod_{n=1}^{N} \times_n \mathbf{U}_n \qquad (8)$$

where $\mathbf{U}_n \in \mathbb{R}^{I_n \times K_n}$ and $K_n \le I_n$ $(n = 1, \ldots, N)$. The core tensor $\mathcal{W}$ and the mode matrices $\mathbf{U}_n$ correspond to the matrix of singular values and the orthonormal basis vectors in matrix SVD, respectively. An example of the Tucker decomposition of a third-order tensor is illustrated in Figure 3.

Figure 3. Tucker decomposition of a third-order tensor.

The core tensor $\mathcal{W}$ and mode matrices $\mathbf{U}_n$ in Tucker decomposition can be computed such that they minimize

$$\mathrm{Error} = \left\| \mathcal{M} - \mathcal{W} \prod_{n=1}^{N} \times_n \mathbf{U}_n \right\|^2 \qquad (9)$$

where the norm of a tensor is defined as $\|\mathcal{X}\| = \sqrt{\sum_{i_1=1}^{I_1}\sum_{i_2=1}^{I_2}\cdots\sum_{i_N=1}^{I_N} x_{i_1 i_2 \cdots i_N}^2}$. A representative technique for Tucker decomposition is alternating least squares (ALS) [12]; the basic idea is to compute each mode matrix $\mathbf{U}_n$ in turn while keeping the other mode matrices fixed. For more details on Tucker decomposition, refer to [4]. In the following sections, we explain the Tucker decomposition-based speaker adaptation and the probabilistic 2-mode analysis in the context of speaker adaptation.
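As a rough illustration of the alternating idea (a sketch only, not the exact algorithm of [12]), the loop below updates each mode matrix from the leading left singular vectors of the tensor projected onto the other modes, reusing unfold and mode_product from the earlier sketch; the function name tucker_als and the HOSVD-style initialization are our assumptions.

```python
import numpy as np

def tucker_als(X, ranks, n_iter=20):
    """ALS-style Tucker decomposition sketch: returns the core tensor and mode matrices."""
    N = X.ndim
    # HOSVD-style initialization: leading left singular vectors of each mode-n flattening
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :ranks[n]] for n in range(N)]
    for _ in range(n_iter):
        for n in range(N):
            Y = X
            for m in range(N):
                if m != n:               # project onto every mode matrix except mode n
                    Y = mode_product(Y, U[m].T, m)
            U[n] = np.linalg.svd(unfold(Y, n), full_matrices=False)[0][:, :ranks[n]]
    core = X
    for n in range(N):                   # core tensor: X x_1 U_1^T x_2 U_2^T ... x_N U_N^T
        core = mode_product(core, U[n].T, n)
    return core, U
```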

2.2 Speaker adaptation using Tucker decomposition

The probabilistic 2-mode analysis based method is a probabilistic extension of the Tucker decomposition based method. Thus, we compare the probabilistic approach with the Tucker decomposition based method in the experiments. In this section, we explain the speaker adaptation based on the Tucker decomposition of training models in[5]. In this article, speaker adaptation is performed by updating the mean vectors of the output distribution of an HMM. The HMM mean vectors of each training speaker are arranged in an R×D matrix:

$$\mathbf{M}_s = \left[\, \boldsymbol{\mu}_{s;1} \;\cdots\; \boldsymbol{\mu}_{s;r} \;\cdots\; \boldsymbol{\mu}_{s;R} \,\right]^T, \quad s = 1, \ldots, S. \qquad (10)$$

Here, μs;r denotes the mean vector corresponding to mixture r of the s th training speaker model.

All the centered HMM mean vectors of training speakers, $\{\mathbf{M}_s - \bar{\mathbf{M}}\}_{s=1}^{S}$ where $\bar{\mathbf{M}} = (1/S)\sum_s \mathbf{M}_s$, are collectively expressed as a third-order tensor $\tilde{\mathcal{M}}$, and we decompose the training tensor by Tucker decomposition as follows:

$$\tilde{\mathcal{M}}_{R \times D \times S} \approx \mathcal{G}_{K_R \times K_D \times K_S} \times_1 \mathbf{U}_{\mathrm{mixture}} \times_2 \mathbf{U}_{\mathrm{dim}} \times_3 \mathbf{U}_{\mathrm{speaker}} = \mathcal{G}_{K_R \times K_D \times K_S} \times_3 \mathbf{U}_{\mathrm{speaker}} \times_1 \mathbf{U}_{\mathrm{mixture}} \times_2 \mathbf{U}_{\mathrm{dim}}. \qquad (11)$$

In the above equation, $\mathbf{U}_{\mathrm{mixture}} \in \mathbb{R}^{R \times K_R}$, $\mathbf{U}_{\mathrm{dim}} \in \mathbb{R}^{D \times K_D}$, and $\mathbf{U}_{\mathrm{speaker}} \in \mathbb{R}^{S \times K_S}$ are basis matrices for the mixture component, the dimension of the mean vector, and the training speaker, respectively ($K_R \le R-1$, $K_D \le D-1$, and $K_S \le S-1$); the core tensor $\mathcal{G}$ is common across the mixture component, the dimension of the mean vector, and the training speaker. In Equation (11), the $s$-th row vector of $\mathbf{U}_{\mathrm{speaker}}$, denoted $\mathbf{u}_{\mathrm{speaker};s}$, corresponds to the speaker weight of the $s$-th speaker; thus the low-rank approximation of the $s$-th speaker model is given by

$$\mathbf{M}_s \approx \mathcal{G}_{K_R \times K_D \times K_S} \times_3 \mathbf{u}_{\mathrm{speaker};s} \times_1 \mathbf{U}_{\mathrm{mixture}} \times_2 \mathbf{U}_{\mathrm{dim}} + \bar{\mathbf{M}}. \qquad (12)$$

If we define the speaker weight matrix $\mathbf{W}_s \in \mathbb{R}^{K_R \times K_D} \equiv \mathcal{G}_{K_R \times K_D \times K_S} \times_3 \mathbf{u}_{\mathrm{speaker};s}$, Equation (12) becomes

$$\mathbf{M}_s \approx \mathbf{W}_s \times_1 \mathbf{U}_{\mathrm{mixture}} \times_2 \mathbf{U}_{\mathrm{dim}} + \bar{\mathbf{M}} = \mathbf{U}_{\mathrm{mixture}} \mathbf{W}_s \mathbf{U}_{\mathrm{dim}}^T + \bar{\mathbf{M}}. \qquad (13)$$

Thus, we express the model of a new speaker as

$$\mathbf{M}_{\mathrm{new}} = \mathbf{U}_{\mathrm{mixture}} \mathbf{W}_{\mathrm{new}} \mathbf{U}_{\mathrm{dim}}^T + \bar{\mathbf{M}}. \qquad (14)$$

For the given adaptation data $\mathbf{O} = \{\mathbf{o}_1, \ldots, \mathbf{o}_T\}$, we derive the equation for finding the speaker weight under a maximum likelihood (ML) criterion:

$$\sum_t \sum_r \gamma_r(t)\, \mathbf{C}_r^{-1} \underbrace{\mathbf{U}_{\mathrm{dim}} \mathbf{W}_{\mathrm{new}}^T}_{\mathbf{W}_{\mathrm{new,aug}}^T} \mathbf{u}_{\mathrm{mixture};r}^T \mathbf{u}_{\mathrm{mixture};r} = \sum_t \sum_r \gamma_r(t)\, \mathbf{C}_r^{-1} \left( \mathbf{o}_t - \bar{\mathbf{m}}_r^T \right) \mathbf{u}_{\mathrm{mixture};r} \qquad (15)$$

where $\gamma_r(t)$ denotes the occupation probability of mixture $r$ at time $t$ given $\mathbf{O}$, and $\mathbf{C}_r$ the covariance matrix of the $r$-th Gaussian component of the SI HMM (a diagonal covariance matrix in this article); $\mathbf{u}_{\mathrm{mixture};r}$ and $\bar{\mathbf{m}}_r$ denote the $r$-th row vectors of $\mathbf{U}_{\mathrm{mixture}}$ and $\bar{\mathbf{M}}$, respectively. In the above equation, $\mathbf{W}_{\mathrm{new,aug}}$ can be computed using a technique similar to MLLR adaptation, and the weight of the new speaker is obtained by

$$\hat{\mathbf{W}}_{\mathrm{new}} = \mathbf{W}_{\mathrm{new,aug}}\, \mathbf{U}_{\mathrm{dim}} \qquad (16)$$

which is plugged into Equation (14) to produce the model updated for the new speaker.
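For orientation, the small sketch below shows how the adapted model is assembled once the augmented weight of Equation (15) has been estimated; we assume here that W_new_aug has shape $K_R \times D$, as implied by the underbrace in Equation (15), and the function name is ours.

```python
import numpy as np

def tucker_adapted_means(W_new_aug, U_mixture, U_dim, M_bar):
    """Recover the speaker weight (Equation (16)) and the adapted mean matrix (Equation (14))."""
    W_new = W_new_aug @ U_dim                      # Equation (16): K_R x K_D
    M_new = U_mixture @ W_new @ U_dim.T + M_bar    # Equation (14): R x D adapted HMM means
    return W_new, M_new
```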

2.3 Probabilistic 2-mode analysis

The advantage of probabilistic 2-mode analysis over Tucker decomposition is similar to that of PPCA over standard PCA; probabilistic 2-mode analysis can deal with missing entries in the data tensor (although this is not the case in our experiments). From a modeling perspective, probabilistic 2-mode analysis assumes a distribution over latent variables, which makes it suitable for a MAP framework.

In this section, the ensemble of training models is expressed as

$$\mathcal{M} = \{\mathbf{M}_s\}_{s=1}^{S}. \qquad (17)$$

Assuming the HMM mean vectors of training speakers are drawn from the matrix-variate normal distribution[13], we derive the adaptation equation based on the probabilistic 2-mode analysis of training models. We use probabilistic 2-mode analysis, the second-order case of PTA[8], to decompose the training models expressed in matrix form. The latent tensor model is expressed as

$$\mathcal{M} = \mathcal{W} \prod_{n=1}^{N} \times_n \mathbf{U}_n + \mathcal{M}_{\mathrm{mean}} + \mathcal{E} \qquad (18)$$

where $\mathcal{W}$ denotes the latent tensor, $\mathbf{U}_n$ the factor loading matrices, $\mathcal{M}_{\mathrm{mean}}$ the mean, and $\mathcal{E}$ the error/noise process. The 2-mode case of the latent tensor model is given by

$$\mathbf{M} = \mathbf{W} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 + \mathbf{M}_{\mathrm{mean}} + \mathbf{E} \qquad (19)$$

which becomes, for the training models {M1,…,M S },

$$\mathbf{M}_s = \mathbf{W}_s \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 + \mathbf{M}_{\mathrm{mean}} + \mathbf{E}_s = \mathbf{U}_{\mathrm{mixture}} \mathbf{W}_s \mathbf{U}_{\mathrm{dim}}^T + \mathbf{M}_{\mathrm{mean}} + \mathbf{E}_s \qquad (20)$$

where $\mathbf{W}_s \in \mathbb{R}^{K_R \times K_D}$ denotes the latent matrix, $\mathbf{U}_{\mathrm{mixture}} \in \mathbb{R}^{R \times K_R}$ and $\mathbf{U}_{\mathrm{dim}} \in \mathbb{R}^{D \times K_D}$ the factor loading matrices ($K_R \le R-1$ and $K_D \le D-1$), $\mathbf{M}_{\mathrm{mean}}$ the mean, and $\mathbf{E}_s$ the error/noise process. (Mode matrices and dimensions are defined as follows: $\mathbf{U}_1 = \mathbf{U}_{\mathrm{mixture}}$, $\mathbf{U}_2 = \mathbf{U}_{\mathrm{dim}}$, $I_1 = R$, $I_2 = D$, $K_1 = K_R$, and $K_2 = K_D$.) The distribution of $\mathbf{W}_s$ is assumed to be matrix-variate normal, i.e., $\mathbf{W}_s \sim \mathcal{N}(\mathbf{0}_{K_R \times K_D}, \mathbf{I}_{K_R} \otimes \mathbf{I}_{K_D})$ where $\otimes$ denotes the Kronecker product, and independent of $\mathbf{E}_s$, whose elements follow $\mathcal{N}(0, \sigma^2)$. Figure 4 shows the graphical model representing the probabilistic 2-mode model.
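As a sanity check on the generative assumptions of Equation (20), the sketch below draws one synthetic speaker model; drawing the entries of $\mathbf{W}_s$ independently from a standard normal is equivalent to the matrix-variate normal with identity row and column covariances, and the function name and RNG choice are our own.

```python
import numpy as np

def sample_speaker_model(U_mixture, U_dim, M_mean, sigma2, rng=None):
    """Draw one M_s from the probabilistic 2-mode model of Equation (20)."""
    rng = np.random.default_rng() if rng is None else rng
    K_R, K_D = U_mixture.shape[1], U_dim.shape[1]
    W_s = rng.standard_normal((K_R, K_D))                       # W_s ~ N(0, I kron I)
    E_s = np.sqrt(sigma2) * rng.standard_normal(M_mean.shape)   # elementwise noise N(0, sigma^2)
    return U_mixture @ W_s @ U_dim.T + M_mean + E_s
```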

Figure 4. Graphical model representation of the probabilistic 2-mode model.

In Equation (20), it is computationally intractable to calculate U n ’s simultaneously. So, the following decoupled predictive density is defined:

$$p\!\left(\mathcal{M} \mid \mathbf{M}_{\mathrm{mean}}, \{\mathbf{U}_n\}_{n=1}^{N}, \sigma^2\right) \approx \prod_{n=1}^{N} p\!\left(\mathcal{M} \,\bar{\times}_n \mathbf{U}_n^T \mid \bar{\mathbf{t}}_n, \sigma_n^2\right) \qquad (21)$$

where $\bar{\mathbf{t}}_n \in \mathbb{R}^{I_n \times 1}$ and $\sigma_n^2$ denote the mean vector and noise variance, respectively, for mode $n$, and $\mathcal{M} \,\bar{\times}_n \mathbf{U}_n^T \equiv \mathcal{M} \times_1 \mathbf{U}_1^T \cdots \times_{n-1} \mathbf{U}_{n-1}^T \times_{n+1} \mathbf{U}_{n+1}^T \cdots \times_N \mathbf{U}_N^T$, i.e., the product of $\mathcal{M}$ with all the mode matrices except that of mode $n$, which is called the contracted $n$-mode product [14]. That is, the $n$-th probability function is defined over $\mathcal{M}$ projected by all $\mathbf{U}_j$ except $\mathbf{U}_n$. Given the observed data $\mathcal{M}$, the decoupled posterior is defined as

$$p\!\left(\mathbf{M}_{\mathrm{mean}}, \{\mathbf{U}_n\}_{n=1}^{N}, \sigma^2 \mid \mathcal{M}\right) \approx \prod_{n=1}^{N} p\!\left(\bar{\mathbf{t}}_n, \mathbf{U}_n, \sigma_n^2 \mid \mathcal{M}, \{\mathbf{U}_j\}_{j=1, j\neq n}^{N}\right). \qquad (22)$$

By Bayes’ theorem, the n th posterior distribution can be expressed in terms of the decoupled likelihood function and the decoupled prior distribution:

$$p\!\left(\bar{\mathbf{t}}_n, \mathbf{U}_n, \sigma_n^2 \mid \mathcal{M}, \{\mathbf{U}_j\}_{j=1, j\neq n}^{N}\right) \propto p\!\left(\mathcal{M}, \{\mathbf{U}_j\}_{j=1, j\neq n}^{N} \mid \bar{\mathbf{t}}_n, \mathbf{U}_n, \sigma_n^2\right) p\!\left(\bar{\mathbf{t}}_n, \mathbf{U}_n, \sigma_n^2\right). \qquad (23)$$

Therefore, the decoupled predictive density is given by

$$p\!\left(\mathbf{M} \mid \mathcal{M}\right) \propto p\!\left(\mathbf{M} \mid \mathbf{M}_{\mathrm{mean}}, \{\mathbf{U}_n\}_{n=1}^{N}, \sigma^2\right) \times p\!\left(\mathbf{M}_{\mathrm{mean}}, \{\mathbf{U}_n\}_{n=1}^{N}, \sigma^2 \mid \mathcal{M}\right) = \prod_{n=1}^{N} p\!\left(\mathbf{M} \,\bar{\times}_n \mathbf{U}_n^T \mid \bar{\mathbf{t}}_n, \sigma_n^2\right) \times p\!\left(\mathcal{M}, \{\mathbf{U}_j\}_{j=1, j\neq n}^{N} \mid \bar{\mathbf{t}}_n, \mathbf{U}_n, \sigma_n^2\right). \qquad (24)$$

(the prior $p(\bar{\mathbf{t}}_n, \mathbf{U}_n, \sigma_n^2)$ is dropped for a fixed $\mathbf{U}_n$). This is the 2-mode case of the PTA in [8]. In our case, Equation (24) is given by

$$p\!\left(\mathbf{M} \mid \mathcal{M}\right) \propto p\!\left(\mathbf{U}_{\mathrm{dim}}^T \mathbf{M}^T \mid \bar{\mathbf{t}}_{\mathrm{mixture}}, \sigma_{\mathrm{mixture}}^2\right) \times p\!\left(\mathcal{M}, \mathbf{U}_{\mathrm{dim}} \mid \bar{\mathbf{t}}_{\mathrm{mixture}}, \mathbf{U}_{\mathrm{mixture}}, \sigma_{\mathrm{mixture}}^2\right) \times p\!\left(\mathbf{U}_{\mathrm{mixture}}^T \mathbf{M} \mid \bar{\mathbf{t}}_{\mathrm{dim}}, \sigma_{\mathrm{dim}}^2\right) \times p\!\left(\mathcal{M}, \mathbf{U}_{\mathrm{mixture}} \mid \bar{\mathbf{t}}_{\mathrm{dim}}, \mathbf{U}_{\mathrm{dim}}, \sigma_{\mathrm{dim}}^2\right). \qquad (25)$$

Now, U n ’s are obtained by maximizing the following posterior distribution:

$$p\!\left(\{\mathbf{U}_n\}_{n=1}^{N} \mid \mathcal{M}\right) \approx \prod_{n=1}^{N} p\!\left(\mathbf{U}_n \mid \mathcal{M} \,\bar{\times}_n \mathbf{U}_n^T\right) \qquad (26)$$

where $\mathcal{M} \,\bar{\times}_n \mathbf{U}_n^T$ denotes $\{\mathbf{M}_s \,\bar{\times}_n \mathbf{U}_n^T\}_{s=1}^{S}$. The expectation-maximization (EM) algorithm [15] is applied to compute the $\mathbf{U}_n$. The application of the EM algorithm to construct the probabilistic 2-mode model is explained in the next section.

2.4 Construction of probabilistic 2-mode model for speaker adaptation

In Equation (20), for the given training models, the ML estimate of $\mathbf{M}_{\mathrm{mean}}$ is given as $\bar{\mathbf{M}} = (1/S)\sum_s \mathbf{M}_s$, and $\{\mathbf{U}_n, \sigma_n^2\}$ can be estimated as follows. First, let $\mathbf{t}_{n;j} \in \mathbb{R}^{I_n \times 1}$ be the $j$-th column vector of

$$\mathbf{T}_{(n)} = \mathrm{mat}_n\!\left(\mathcal{M} \,\bar{\times}_n \mathbf{U}_n^T\right) \qquad (27)$$

for $1 \le j \le \bar{I}_n S$ ($\bar{I}_n = \prod_{j=1, j\neq n}^{N} I_j$), and let $\mathbf{x}_{n;j} \in \mathbb{R}^{K_n \times 1}$ be the $j$-th column vector of

$$\mathbf{X}_{(n)} = \mathrm{mat}_n\!\left(\mathcal{M} \prod_{n=1}^{N} \times_n \mathbf{U}_n^T\right). \qquad (28)$$

Let us suppose $\mathbf{t}_n \mid \mathbf{x}_n \sim \mathcal{N}\!\left(\mathbf{U}_n \mathbf{x}_n + \bar{\mathbf{t}}_n, \sigma_n^2 \mathbf{I}_{I_n}\right)$ and $\mathbf{x}_n \sim \mathcal{N}\!\left(\mathbf{0}_{K_n \times 1}, \mathbf{I}_{K_n}\right)$. Then, by integrating out $\mathbf{x}_n$, $\mathbf{t}_n \sim \mathcal{N}\!\left(\bar{\mathbf{t}}_n, \mathbf{G}_n\right)$, where $\bar{\mathbf{t}}_n = 1/(\bar{I}_n S)\sum_{j=1}^{\bar{I}_n S}\mathbf{t}_{n;j}$ and $\mathbf{G}_n = \mathbf{U}_n\mathbf{U}_n^T + \sigma_n^2\mathbf{I}_{I_n}$. Consequently,

$$\mathbf{x}_n \mid \mathbf{t}_n \sim \mathcal{N}\!\left(\mathbf{H}_n^{-1}\mathbf{U}_n^T(\mathbf{t}_n - \bar{\mathbf{t}}_n),\; \sigma_n^2 \mathbf{H}_n^{-1}\right) \qquad (29)$$

where $\mathbf{H}_n = \mathbf{U}_n^T \mathbf{U}_n + \sigma_n^2 \mathbf{I}_{K_n}$. The right-hand side of Equation (26) becomes

$$\log p\!\left(\mathbf{U}_n \mid \mathcal{M} \,\bar{\times}_n \mathbf{U}_n^T\right) \propto -\frac{\bar{I}_n S}{2}\left(\log|\mathbf{G}_n| + \mathrm{tr}\!\left[\mathbf{G}_n^{-1}\mathbf{S}_n\right]\right) \qquad (30)$$

where $\mathbf{S}_n = 1/(\bar{I}_n S - 1)\sum_{j=1}^{\bar{I}_n S}\left(\mathbf{t}_{n;j} - \bar{\mathbf{t}}_n\right)\left(\mathbf{t}_{n;j} - \bar{\mathbf{t}}_n\right)^T$ and $\mathrm{tr}[\,\cdot\,]$ denotes the trace of a matrix. Summing over all the modes, we obtain the following log-likelihood function of the posterior distribution:

$$L = \sum_n \log p\!\left(\mathbf{U}_n \mid \mathcal{M} \,\bar{\times}_n \mathbf{U}_n^T\right) \propto -\sum_n \frac{\bar{I}_n S}{2}\left(\log|\mathbf{G}_n| + \mathrm{tr}\!\left[\mathbf{G}_n^{-1}\mathbf{S}_n\right]\right). \qquad (31)$$

The graphical model representation of the decoupled probabilistic model is shown in Figure5.

Figure 5. Graphical model representation of the decoupled probabilistic model.

We seek the $\mathbf{U}_n$ that maximize the log-likelihood function in an alternating fashion. The mode matrices $\mathbf{U}_1$ and $\mathbf{U}_2$ are initialized with the results of the Tucker decomposition that minimizes the reconstruction error:

$$\mathrm{Error} = \sum_s \left\| \mathbf{M}_s - \left(\mathbf{W}_s \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 + \bar{\mathbf{M}}\right) \right\|^2. \qquad (32)$$

With the initial U1 and U2, the following procedure is performed for each mode (n=1,2).

Each training model is projected onto the mode matrices of all modes other than mode $n$ and expressed as a mode-$n$ flattened matrix:

$$\mathbf{T}_{s,(n)} = \mathrm{mat}_n\!\left(\mathbf{M}_s \,\bar{\times}_n \mathbf{U}_n^T\right). \qquad (33)$$

All the column vectors of $\{\mathbf{T}_{s,(n)}\}_{s=1}^{S}$ constitute the training data set:

$$\{\mathbf{t}_{n;j}\}, \quad 1 \le j \le \bar{I}_n S. \qquad (34)$$

Then, with an initial estimate of $\sigma_n^2$ (e.g., 0.005 was used in the experiments), the EM algorithm is iterated as follows until $\mathbf{U}_n$ and $\sigma_n^2$ converge.

E-step: From Equation (31), the expectation of the complete-data log-likelihood with respect to $p\!\left(\mathbf{x}_{n;j} \mid \mathbf{t}_{n;j}, \bar{\mathbf{t}}_n, \mathbf{U}_n, \sigma_n^2\right)$ is given as

$$\langle L_c \rangle = \sum_n \sum_s \mathbb{E}\!\left[\log p\!\left(\mathbf{M}_s, \mathbf{W}_s \mid \{\mathbf{U}_j\}_{j=1, j\neq n}^{N}\right)\right] = \sum_n \sum_{j=1}^{\bar{I}_n S} \mathbb{E}\!\left[\log p\!\left(\mathbf{t}_{n;j}, \mathbf{x}_{n;j} \mid \{\mathbf{U}_j\}_{j=1, j\neq n}^{N}\right)\right] \qquad (35)$$

where

$$\log p\!\left(\mathbf{t}_{n;j}, \mathbf{x}_{n;j} \mid \{\mathbf{U}_j\}_{j=1, j\neq n}^{N}\right) = \log p(\mathbf{x}_{n;j}) + \log p\!\left(\mathbf{t}_{n;j} \mid \mathbf{x}_{n;j}\right) \propto -\left\|\mathbf{x}_{n;j}\right\|^2 - \frac{I_n}{2}\log\!\left(\sigma_n^2\right) - \frac{1}{\sigma_n^2}\left\|\mathbf{t}_{n;j} - \mathbf{U}_n\mathbf{x}_{n;j} - \bar{\mathbf{t}}_n\right\|^2. \qquad (36)$$

So,

$$\langle L_c \rangle \propto -\sum_n \sum_{j=1}^{\bar{I}_n S}\left\{ \mathrm{tr}\!\left[\left\langle \mathbf{x}_{n;j}\mathbf{x}_{n;j}^T\right\rangle\right] + \frac{1}{\sigma_n^2}\left(\mathbf{t}_{n;j}-\bar{\mathbf{t}}_n\right)^T\left(\mathbf{t}_{n;j}-\bar{\mathbf{t}}_n\right) + \frac{I_n}{2}\log\!\left(\sigma_n^2\right) + \frac{1}{\sigma_n^2}\mathrm{tr}\!\left[\mathbf{U}_n^T\mathbf{U}_n\left\langle \mathbf{x}_{n;j}\mathbf{x}_{n;j}^T\right\rangle\right] - \frac{2}{\sigma_n^2}\left\langle \mathbf{x}_{n;j}\right\rangle^T\mathbf{U}_n^T\left(\mathbf{t}_{n;j}-\bar{\mathbf{t}}_n\right)\right\} \qquad (37)$$

where the sufficient statistics are given as follows from Equation (29):

$$\left\langle \mathbf{x}_{n;j}\right\rangle = \mathbf{H}_n^{-1}\mathbf{U}_n^T\left(\mathbf{t}_{n;j} - \bar{\mathbf{t}}_n\right), \qquad \left\langle \mathbf{x}_{n;j}\mathbf{x}_{n;j}^T\right\rangle = \sigma_n^2\mathbf{H}_n^{-1} + \left\langle \mathbf{x}_{n;j}\right\rangle\left\langle \mathbf{x}_{n;j}\right\rangle^T. \qquad (38)$$

M-step: Model parameters are updated by maximizing $\langle L_c\rangle$ with respect to $\mathbf{U}_n$ and $\sigma_n^2$. Setting $\nabla_{\mathbf{U}_n}\langle L_c\rangle = \mathbf{0}$ produces

$$\mathbf{U}_n = \left[\sum_{j=1}^{\bar{I}_n S}\left(\mathbf{t}_{n;j} - \bar{\mathbf{t}}_n\right)\left\langle \mathbf{x}_{n;j}\right\rangle^T\right]\left[\sum_{j=1}^{\bar{I}_n S}\left\langle \mathbf{x}_{n;j}\mathbf{x}_{n;j}^T\right\rangle\right]^{-1}. \qquad (39)$$

Next, setting $\nabla_{\sigma_n^2}\langle L_c\rangle = 0$ produces

$$\sigma_n^2 = \frac{1}{I_n\bar{I}_n S}\sum_{j=1}^{\bar{I}_n S}\left\{\left\|\mathbf{t}_{n;j} - \bar{\mathbf{t}}_n\right\|^2 - 2\left\langle\mathbf{x}_{n;j}\right\rangle^T\mathbf{U}_n^T\left(\mathbf{t}_{n;j} - \bar{\mathbf{t}}_n\right) + \mathrm{tr}\!\left[\left\langle\mathbf{x}_{n;j}\mathbf{x}_{n;j}^T\right\rangle\mathbf{U}_n^T\mathbf{U}_n\right]\right\}. \qquad (40)$$

Essentially, the procedure applies PPCA to the data set {tn;j} for each mode.
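The per-mode loop can therefore be written as standard PPCA over the column vectors $\mathbf{t}_{n;j}$. The sketch below is a minimal NumPy version of the E- and M-steps of Equations (38)-(40); the function name, the fixed iteration count, and the SVD-based initialization of $\mathbf{U}_n$ are our own choices rather than details from the article.

```python
import numpy as np

def ppca_em_mode(T, K, sigma2=0.005, n_iter=50):
    """EM updates of Equations (38)-(40) for a single mode.
    T: I_n x (Ibar_n * S) matrix whose columns are the vectors t_{n;j}; K: latent dimension K_n.
    Returns the mode matrix U_n, the noise variance sigma_n^2, and the mode mean t_bar_n."""
    I_n, J = T.shape
    t_bar = T.mean(axis=1, keepdims=True)
    Tc = T - t_bar                                           # centered columns t_{n;j} - t_bar_n
    U = np.linalg.svd(Tc, full_matrices=False)[0][:, :K]     # initialization (e.g., from SVD)
    for _ in range(n_iter):
        # E-step, Equation (38)
        H = U.T @ U + sigma2 * np.eye(K)
        Hinv = np.linalg.inv(H)
        X = Hinv @ U.T @ Tc                                  # columns are <x_{n;j}>
        XX = J * sigma2 * Hinv + X @ X.T                     # sum_j <x_{n;j} x_{n;j}^T>
        # M-step, Equations (39) and (40)
        U = (Tc @ X.T) @ np.linalg.inv(XX)
        sigma2 = (np.sum(Tc ** 2)
                  - 2.0 * np.sum(X * (U.T @ Tc))
                  + np.trace(XX @ U.T @ U)) / (I_n * J)
    return U, sigma2, t_bar.ravel()
```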

2.5 Estimation of prior distribution

Given the model parameters $\{\bar{\mathbf{M}}, \mathbf{U}_n, \sigma_n^2\}$, the weight matrix for the training speaker model $\mathbf{M}_s$ is obtained by

$$\mathbf{W}_s = \left(\mathbf{M}_s - \bar{\mathbf{M}}\right)\prod_{n=1}^{2}\times_n\left(\mathbf{H}_n^{-1}\mathbf{U}_n^T\right) = \mathbf{H}_1^{-1}\mathbf{U}_{\mathrm{mixture}}^T\left(\mathbf{M}_s - \bar{\mathbf{M}}\right)\left(\mathbf{H}_2^{-1}\mathbf{U}_{\mathrm{dim}}^T\right)^T. \qquad (41)$$

From the set of weight matrices $\{\mathbf{W}_s\}_{s=1}^{S}$, the distribution of the weight is estimated. In deriving the adaptation equation in the MAP framework, the parameters of the prior distribution can be obtained in closed form if $p(\mathbf{W})$ follows a conjugate distribution. Hence, we assume the prior distribution of the weight to be a matrix-variate normal:

$$p(\mathbf{W}) \propto \frac{1}{|\boldsymbol{\Sigma}|^{K_D/2}\,|\boldsymbol{\Psi}|^{K_R/2}}\exp\left\{-\frac{1}{2}\mathrm{tr}\!\left[\left(\mathbf{W} - \mathbf{W}_{\mathrm{mean}}\right)^T\boldsymbol{\Sigma}^{-1}\left(\mathbf{W} - \mathbf{W}_{\mathrm{mean}}\right)\boldsymbol{\Psi}^{-1}\right]\right\}. \qquad (42)$$

Furthermore, the hyperparameters of p(W) can easily be estimated in an ML criterion if Ψ is known[16]. So, Ψ is assumed to be the identity matrix[17], and the hyperparameters are estimated as:

$$\hat{\mathbf{W}}_{\mathrm{mean}} = \frac{1}{S}\sum_s\mathbf{W}_s = \mathbf{0}_{K_R\times K_D}, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{S-1}\sum_s\mathbf{W}_s\mathbf{W}_s^T. \qquad (43)$$
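A compact sketch of this step is shown below, under the assumption that the training models are stored as $R \times D$ NumPy arrays; it projects each centered model as in Equation (41) and accumulates the sample covariance of Equation (43) with $\boldsymbol{\Psi}$ fixed to the identity. The function name is ours.

```python
import numpy as np

def estimate_weight_prior(M_list, M_bar, U_mixture, U_dim, sigma2_mixture, sigma2_dim):
    """Training-speaker weights of Equation (41) and prior covariance of Equation (43)."""
    K_R, K_D = U_mixture.shape[1], U_dim.shape[1]
    H1 = U_mixture.T @ U_mixture + sigma2_mixture * np.eye(K_R)
    H2 = U_dim.T @ U_dim + sigma2_dim * np.eye(K_D)
    P1 = np.linalg.solve(H1, U_mixture.T)                     # H1^{-1} U_mixture^T, K_R x R
    P2 = np.linalg.solve(H2, U_dim.T)                         # H2^{-1} U_dim^T,     K_D x D
    W_list = [P1 @ (M_s - M_bar) @ P2.T for M_s in M_list]    # Equation (41)
    S = len(W_list)
    Sigma_hat = sum(W @ W.T for W in W_list) / (S - 1)        # Equation (43); W_mean is zero
    return W_list, Sigma_hat
```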

2.6 Speaker adaptation in the MAP framework

Based on Equation (20), we express the model of a new speaker as

$$\mathbf{M}_{\mathrm{new}} = \mathbf{U}_{\mathrm{mixture}}\mathbf{W}_{\mathrm{new}}\mathbf{U}_{\mathrm{dim}}^T + \bar{\mathbf{M}}. \qquad (44)$$

For the given adaptation data $\mathbf{O} = \{\mathbf{o}_1, \ldots, \mathbf{o}_T\}$, we estimate the adaptation parameter under a MAP criterion:

$$\Lambda_{\mathrm{MAP}} = \arg\max_{\Lambda}\, p(\Lambda \mid \mathbf{O}) = \arg\max_{\Lambda}\, p(\mathbf{O} \mid \Lambda)\,p(\Lambda) = \arg\max_{\Lambda}\,\left[\log p(\mathbf{O} \mid \Lambda) + \log p(\Lambda)\right] \qquad (45)$$

where Λ={Wnew} denotes the model parameter.

Using the EM algorithm, we obtain the following auxiliary Q-function to be optimized (discarding the terms that are independent of the model parameter):

$$Q(\Lambda,\hat{\Lambda}) = -\frac{1}{2}\sum_t\sum_r\gamma_r(t)\left(\mathbf{o}_t - \mathbf{m}_{\mathrm{new};r}^T\right)^T\mathbf{C}_r^{-1}\left(\mathbf{o}_t - \mathbf{m}_{\mathrm{new};r}^T\right) - \frac{1}{2}\mathrm{tr}\!\left[\left(\mathbf{W}_{\mathrm{new}}\mathbf{U}_{\mathrm{dim}}^T\right)^T\hat{\boldsymbol{\Sigma}}^{-1}\left(\mathbf{W}_{\mathrm{new}}\mathbf{U}_{\mathrm{dim}}^T\right)\right] \qquad (46)$$

where $\Lambda$ and $\hat{\Lambda}$ denote the current and updated model parameters, respectively, and $\mathbf{m}_{\mathrm{new};r} = \mathbf{u}_{\mathrm{mixture};r}\mathbf{W}_{\mathrm{new}}\mathbf{U}_{\mathrm{dim}}^T + \bar{\mathbf{m}}_r$. In finding the speaker weight, we compute $\mathbf{W}_{\mathrm{new,aug}} \equiv \mathbf{W}_{\mathrm{new}}\mathbf{U}_{\mathrm{dim}}^T$, from which $\mathbf{W}_{\mathrm{new}}$ is obtained. Solving in this way, we can use the row-by-row technique of MLLR adaptation [6]. Setting $\nabla_{\mathbf{W}_{\mathrm{new}}} Q(\Lambda,\hat{\Lambda}) = \mathbf{0}$ yields the following equation:

$$\sum_t\sum_r\gamma_r(t)\,\mathbf{C}_r^{-1}\underbrace{\mathbf{U}_{\mathrm{dim}}\mathbf{W}_{\mathrm{new}}^T}_{\mathbf{W}_{\mathrm{new,aug}}^T}\mathbf{u}_{\mathrm{mixture};r}^T\mathbf{u}_{\mathrm{mixture};r} + \underbrace{\mathbf{U}_{\mathrm{dim}}\mathbf{W}_{\mathrm{new}}^T}_{\mathbf{W}_{\mathrm{new,aug}}^T}\hat{\boldsymbol{\Sigma}}^{-1} = \sum_t\sum_r\gamma_r(t)\,\mathbf{C}_r^{-1}\left(\mathbf{o}_t - \bar{\mathbf{m}}_r^T\right)\mathbf{u}_{\mathrm{mixture};r}. \qquad (47)$$

The above equation can be solved for $\mathbf{W}_{\mathrm{new,aug}}$ in a way similar to MLLR adaptation in [6]. We define the following quantities:

$$\mathbf{V}_r = \sum_t\gamma_r(t)\,\mathbf{C}_r^{-1}, \quad \mathbf{D}_r = \mathbf{u}_{\mathrm{mixture};r}^T\mathbf{u}_{\mathrm{mixture};r}, \quad \mathbf{G}^{(i)} = \sum_r v_r(i,i)\,\mathbf{D}_r, \quad \mathbf{Z} = \sum_t\sum_r\gamma_r(t)\,\mathbf{C}_r^{-1}\left(\mathbf{o}_t - \bar{\mathbf{m}}_r^T\right)\mathbf{u}_{\mathrm{mixture};r}, \quad \boldsymbol{\Sigma}^{(i)} = \frac{1}{S-1}\sum_s\mathbf{w}_{s;i}\mathbf{w}_{s;i}^T \qquad (48)$$

where $v_r(i,i)$ denotes the $(i,i)$ element of $\mathbf{V}_r$ and $\mathbf{w}_{s;i}$ the $i$-th column vector of $\mathbf{W}_{s,\mathrm{aug}} \equiv \mathbf{W}_s\mathbf{U}_{\mathrm{dim}}^T$. Then, the speaker weight can be computed:

$$\mathbf{w}_{\mathrm{new,aug},(i)}^T = \left[\mathbf{G}^{(i)} + \left(\boldsymbol{\Sigma}^{(i)}\right)^{-1}\right]^{-1}\mathbf{z}_{(i)}^T, \quad i = 1,\ldots,D \qquad (49)$$

where $\mathbf{w}_{\mathrm{new,aug},(i)}$ denotes the $i$-th row of $\mathbf{W}_{\mathrm{new,aug}}$ and $\mathbf{z}_{(i)}$ the $i$-th row vector of $\mathbf{Z}$. The method becomes similar to MAPLR adaptation in [11]. Finally, the speaker weight is obtained as

$$\hat{\mathbf{W}}_{\mathrm{new}} = \mathbf{W}_{\mathrm{new,aug}}\left[\mathbf{U}_{\mathrm{dim}}^T\right]^{+} \qquad (50)$$

where [ · ]+ denotes the pseudoinverse of a matrix. The weight is plugged into Equation (44) to produce the model adapted to the new speaker.
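To tie Equations (48)-(50) together, the following is a rough NumPy sketch of the accumulation and the per-dimension solves; the array layouts (gamma as T x R, diagonal inverse covariances as an R x D array, W_new_aug stored as K_R x D so that Equation (49) fills one column per feature index i) and the function name are our assumptions, not the article's notation.

```python
import numpy as np

def map_speaker_weight(gamma, O, C_inv_diag, U_mixture, U_dim, M_bar, Sigma_list):
    """MAP estimation of Equations (48)-(50) followed by reconstruction via Equation (44).
    gamma:      T x R occupation probabilities gamma_r(t)
    O:          T x D adaptation observations o_t
    C_inv_diag: R x D diagonals of the inverse covariances C_r^{-1}
    Sigma_list: D prior covariance matrices Sigma^(i), each K_R x K_R"""
    R, D, K_R = gamma.shape[1], O.shape[1], U_mixture.shape[1]

    occ = gamma.sum(axis=0)                                   # sum_t gamma_r(t) for each mixture
    V = occ[:, None] * C_inv_diag                             # diagonals of V_r in Equation (48)
    D_r = [np.outer(U_mixture[r], U_mixture[r]) for r in range(R)]   # u_{mixture;r}^T u_{mixture;r}

    Z = np.zeros((D, K_R))
    for r in range(R):
        resid = (gamma[:, r:r + 1] * (O - M_bar[r])).sum(axis=0)     # sum_t gamma_r(t)(o_t - m_bar_r)
        Z += (C_inv_diag[r] * resid)[:, None] @ U_mixture[r:r + 1]   # C_r^{-1}(.) u_{mixture;r}

    # Equation (49): one K_R-dimensional solve per feature dimension i (stored column-wise here)
    W_new_aug = np.zeros((K_R, D))
    for i in range(D):
        G_i = sum(V[r, i] * D_r[r] for r in range(R))
        W_new_aug[:, i] = np.linalg.solve(G_i + np.linalg.inv(Sigma_list[i]), Z[i])

    W_new = W_new_aug @ np.linalg.pinv(U_dim.T)               # Equation (50)
    M_new = U_mixture @ W_new @ U_dim.T + M_bar               # Equation (44)
    return W_new, M_new
```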

2.7 Speaker adaptation techniques compared in the experiments

In this section, we briefly review the speaker adaptation techniques compared with the probabilistic 2-mode analysis based method: eigenvoice adaptation[3], MLLR adaptation[6], and MAPLR adaptation[11].

In eigenvoice adaptation, the collection of HMM mean vectors of speaker $s$ is arranged in an $(RD)\times 1$ supervector:

$$\boldsymbol{\mu}_s = \left[\,\boldsymbol{\mu}_{s;1}^T \;\; \boldsymbol{\mu}_{s;2}^T \;\cdots\; \boldsymbol{\mu}_{s;R}^T\,\right]^T. \qquad (51)$$

Then, the set of S supervectors, {μ1,…,μ S }, is decomposed by PCA to produce the adaptation model

$$\boldsymbol{\mu}_{\mathrm{new}} = \boldsymbol{\Phi}\mathbf{w}_{\mathrm{new}} + \bar{\boldsymbol{\mu}} \qquad (52)$$

where $\boldsymbol{\Phi} = [\boldsymbol{\phi}_1 \cdots \boldsymbol{\phi}_K]$ is the basis matrix consisting of the $K$ dominant eigenvectors from PCA, and $\bar{\boldsymbol{\mu}} = (1/S)\sum_s\boldsymbol{\mu}_s$. The $K\times 1$ weight vector can be obtained by maximizing the likelihood of the adaptation data, which gives

$$\hat{\mathbf{w}}_{\mathrm{new}} = \left[\sum_t\sum_r\gamma_r(t)\,\boldsymbol{\Phi}_r^T\mathbf{C}_r^{-1}\boldsymbol{\Phi}_r\right]^{-1}\left[\sum_t\sum_r\gamma_r(t)\,\boldsymbol{\Phi}_r^T\mathbf{C}_r^{-1}\left(\mathbf{o}_t - \bar{\boldsymbol{\mu}}_r\right)\right] \qquad (53)$$

where $\boldsymbol{\Phi}_r$ and $\bar{\boldsymbol{\mu}}_r$ denote the $D\times K$ submatrix and the $D\times 1$ subvector of $\boldsymbol{\Phi}$ and $\bar{\boldsymbol{\mu}}$ corresponding to the $r$-th mixture, respectively.
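A small sketch of Equation (53) with diagonal covariances is given below, assuming Phi is stored as an (R*D) x K array and gamma as T x R; the function name and these layouts are ours.

```python
import numpy as np

def eigenvoice_weight(gamma, O, C_inv_diag, Phi, mu_bar):
    """ML eigenvoice weight of Equation (53).
    gamma: T x R occupation probabilities; O: T x D observations;
    C_inv_diag: R x D diagonals of C_r^{-1}; Phi: (R*D) x K basis; mu_bar: (R*D,) mean supervector."""
    R, D, K = gamma.shape[1], O.shape[1], Phi.shape[1]
    A = np.zeros((K, K))
    b = np.zeros(K)
    for r in range(R):
        Phi_r = Phi[r * D:(r + 1) * D]                        # D x K block for mixture r
        mu_r = mu_bar[r * D:(r + 1) * D]
        occ = gamma[:, r].sum()                               # sum_t gamma_r(t)
        resid = (gamma[:, r:r + 1] * (O - mu_r)).sum(axis=0)  # sum_t gamma_r(t)(o_t - mu_bar_r)
        A += occ * (Phi_r.T * C_inv_diag[r]) @ Phi_r          # Phi_r^T C_r^{-1} Phi_r terms
        b += (Phi_r.T * C_inv_diag[r]) @ resid                # Phi_r^T C_r^{-1} (.) terms
    return np.linalg.solve(A, b)                              # Equation (53)
```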

In MLLR adaptation, the updated model for a new speaker is obtained by linearly transforming the SI model (assuming a global regression matrix):

$$\boldsymbol{\mu}_{\mathrm{new},r} = \mathbf{W}_{\mathrm{new}}\boldsymbol{\xi}_r, \qquad \boldsymbol{\xi}_r = \left[\,\omega \;\; \boldsymbol{\mu}_{\mathrm{SI},r}^T\,\right]^T \qquad (54)$$

where μSI, r denotes the mean vector of the SI HMM corresponding to mixture r and ω is the bias offset term: ω=1 to include the term and ω=0 otherwise (ω=1 in our experiments). The D×(D+1) transformation matrix can be obtained in an ML criterion, which yields the following equation:

$$\sum_t\sum_r\gamma_r(t)\,\mathbf{C}_r^{-1}\mathbf{o}_t\boldsymbol{\xi}_r^T = \sum_t\sum_r\gamma_r(t)\,\mathbf{C}_r^{-1}\mathbf{W}_{\mathrm{new}}\boldsymbol{\xi}_r\boldsymbol{\xi}_r^T. \qquad (55)$$

The above equation can be solved for Wnew:

$$\hat{\mathbf{w}}_{\mathrm{new},(i)}^T = \left[\mathbf{G}^{(i)}\right]^{-1}\mathbf{z}_{(i)}^T, \quad i = 1,\ldots,D \qquad (56)$$

where $\hat{\mathbf{w}}_{\mathrm{new},(i)}$ and $\mathbf{z}_{(i)}$ denote the $i$-th row vectors of $\hat{\mathbf{W}}_{\mathrm{new}}$ and $\mathbf{Z}$, respectively; $\mathbf{G}^{(i)}$ and $\mathbf{Z}$ are defined as:

$$\mathbf{V}_r = \sum_t\gamma_r(t)\,\mathbf{C}_r^{-1}, \quad \mathbf{D}_r = \boldsymbol{\xi}_r\boldsymbol{\xi}_r^T, \quad \mathbf{G}^{(i)} = \sum_r v_r(i,i)\,\mathbf{D}_r, \quad \mathbf{Z} = \sum_t\sum_r\gamma_r(t)\,\mathbf{C}_r^{-1}\mathbf{o}_t\boldsymbol{\xi}_r^T \qquad (57)$$

where v r (i,i) denotes the (i,i) element of V r .
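For completeness, a minimal NumPy sketch of the row-by-row ML solution of Equations (55)-(57) with diagonal covariances is given below; the array layouts and the function name are our assumptions.

```python
import numpy as np

def mllr_transform(gamma, O, C_inv_diag, mu_si, omega=1.0):
    """Global MLLR transform of Equations (55)-(57).
    gamma: T x R occupation probabilities; O: T x D observations;
    C_inv_diag: R x D diagonals of C_r^{-1}; mu_si: R x D SI mean vectors."""
    R, D = gamma.shape[1], O.shape[1]
    xi = np.hstack([np.full((R, 1), omega), mu_si])           # extended mean vectors xi_r, R x (D+1)
    occ = gamma.sum(axis=0)
    V = occ[:, None] * C_inv_diag                              # diagonals of V_r, Equation (57)
    D_r = [np.outer(xi[r], xi[r]) for r in range(R)]           # xi_r xi_r^T

    Z = np.zeros((D, D + 1))
    for r in range(R):
        acc = (gamma[:, r:r + 1] * O).sum(axis=0)              # sum_t gamma_r(t) o_t
        Z += (C_inv_diag[r] * acc)[:, None] @ xi[r:r + 1]      # C_r^{-1}(.) xi_r^T

    W_new = np.zeros((D, D + 1))
    for i in range(D):
        G_i = sum(V[r, i] * D_r[r] for r in range(R))          # Equation (57)
        W_new[i] = np.linalg.solve(G_i, Z[i])                  # Equation (56), one row per dimension
    return W_new
```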

In MAPLR adaptation, the prior for the transformation matrix is used in the MLLR framework. The parameters for the prior are obtained from the MLLR transformation matrices of training speakers {W1,…,W S }:

$$\bar{\mathbf{w}}_{(i)} = \frac{1}{S}\sum_s\mathbf{w}_{s,(i)}, \qquad \boldsymbol{\Sigma}^{(i)} = \frac{1}{S-1}\sum_s\left(\mathbf{w}_{s,(i)} - \bar{\mathbf{w}}_{(i)}\right)^T\left(\mathbf{w}_{s,(i)} - \bar{\mathbf{w}}_{(i)}\right) \qquad (58)$$

where ws, (i) denotes the i th row vector of W s . Then, the transformation matrix for a new speaker is obtained in a MAP criterion. Deriving the equation in the same way as above, we can obtain the following:

$$\hat{\mathbf{w}}_{\mathrm{new},(i)}^T = \left[\mathbf{G}^{(i)} + \left(\boldsymbol{\Sigma}^{(i)}\right)^{-1}\right]^{-1}\left[\mathbf{z}_{(i)} + \bar{\mathbf{w}}_{(i)}\left(\boldsymbol{\Sigma}^{(i)}\right)^{-1}\right]^T. \qquad (59)$$

3 Experiments

We carried out large-vocabulary continuous-speech recognition (LVCSR) experiments using the Wall Street Journal corpus WSJ0 [18]. In building the SI model, we used 12,754 utterances from 101 speakers of the corpus. As the acoustic feature vector, we used a 39-dimensional vector consisting of 13 mel-frequency cepstral coefficients (MFCCs) including the 0th cepstral coefficient, their delta coefficients, and their acceleration coefficients. The feature vector was extracted with a 20-ms Hamming window and a frame shift of 10 ms. Using the HMM toolkit (HTK) [19], we built a tied-state triphone model (word-internal triphones) with 3472 tied states and 8 Gaussian mixture components per state.

To build the training models for constructing bases, we transformed the SI model by MLLR adaptation [6] using 32 regression classes, followed by maximum a posteriori (MAP) adaptation [10]. We used the 101 adapted models to build the Tucker decomposition based and probabilistic 2-mode based models as well as the eigenvoice model.

For adaptation and recognition testing, we used the Nov'92 5K non-verbalized adaptation and test sets. The number of test speakers was 8; the adaptation set was used for adaptation, and the test set of 330 sentences was used for the recognition test (the number of test utterances per speaker was about 40). The length of an adaptation sentence was about 6 s, and adaptation was performed in supervised mode. For the recognition test, we used the WSJ 5K non-verbalized closed-vocabulary set and the WSJ standard 5K non-verbalized closed bigram language model.

The word recognition accuracy of the SI model is 91.54%. Table 2 shows the results of the Tucker decomposition and probabilistic 2-mode based methods ($K_S = 100$ in the Tucker decomposition based model). In the table, the probabilistic 2-mode based method shows improved performance over the Tucker decomposition based method for small amounts of adaptation data, which can be seen clearly in Figure 6 for the Tucker decomposition and probabilistic 2-mode based models with ($K_R = 20$, $K_D = 35$). The results of MAPLR [11] are also shown in the figure. The use of the MAP framework contributes to improved performance for small amounts of adaptation data. The number of free parameters of each method is as follows: 20 · 35 for the Tucker 3-mode and probabilistic 2-mode based models, and 39 · 40 for MAPLR adaptation. In Figure 7, the Tucker decomposition based method is compared with MLLR and eigenvoice adaptation. The figure shows that the Tucker decomposition based method outperforms MLLR and eigenvoice adaptation when more than one adaptation sentence is available. It can be inferred from the figure that eigenvoice adaptation will outperform the Tucker decomposition based method or MLLR for sparser adaptation data. The p-values from the matched-pair t-test are shown in Table 3; although the values are not always small, the performance improvement of the probabilistic 2-mode based method appears meaningful. Additionally, Figure 8 shows the performance of the probabilistic 2-mode based model with ($K_R = 20$, $K_D = 35$), MLLR adaptation with a full regression matrix, and MAPLR adaptation for adaptation data of about 6-240 s; for 10 or more adaptation sentences (about 60 s), the probabilistic 2-mode based model shows performance comparable to MLLR adaptation and MAPLR adaptation. In Figure 8, the p-values are: p < 0.01 for 1-5 adaptation sentences between the probabilistic 2-mode based model and MLLR adaptation, and p < 0.05 for 2-5 adaptation sentences between the probabilistic 2-mode based model and MAPLR adaptation. The number of free parameters of each method is summarized in Table 4.

Figure 6. Word recognition accuracy of the probabilistic 2-mode based model, Tucker 3-mode based model, and MAPLR adaptation.

Figure 7. Word recognition accuracy of the Tucker 3-mode based model, MLLR, and eigenvoice adaptation.

Table 2 Word recognition accuracy (%) of the Tucker 3-mode and probabilistic 2-mode based methods
Table 3 p-values from the matched-pair t-test
Figure 8. Word recognition accuracy of the probabilistic 2-mode based model, MLLR, and MAPLR adaptation.

Table 4 Number of free parameters of adaptation techniques

We think that the performance improvement of the proposed method over MLLR or MAPLR adaptation comes from the use of basis vectors and a speaker weight of large dimension. We also attribute the improvement of the probabilistic 2-mode based method in the MAP framework over the Tucker decomposition based method in the ML framework for small amounts of adaptation data (e.g., one adaptation sentence) to the constraint the prior places on the weight. When the amount of adaptation data is small, the weight cannot be reliably estimated in the ML framework, where it is estimated from the adaptation data alone without any constraint, as is done in the Tucker decomposition based method. The results confirm that constraining the weight in the MAP framework can produce a better model when the amount of adaptation data is small.

The selection of appropriate dimensions of the model parameters (e.g., $K_R$ and $K_D$) in the probabilistic 2-mode analysis depends on the training models and on the available adaptation data. This selection affects the performance of the system, but how to choose the optimal dimensions is not obvious and requires further study.

4 Conclusions

In this article, we applied probabilistic tensor analysis to the adaptation of HMM mean vectors to a new speaker. The training models consisted of the mean vectors of HMMs expressed in matrix form and the training set was decomposed by probabilistic 2-mode analysis. The prior distribution of the adaptation parameter was estimated from the training models. Then, the speaker adaptation equation was derived in the MAP framework. Compared with the speaker adaptation method based on Tucker 3-mode decomposition in the ML framework, the proposed method further improved the performance for small amounts of adaptation data.

Abbreviations

ALS:

Alternating Least Squares

ASR:

Automatic Speech Recognition

EM:

Expectation-Maximization

HMM:

Hidden Markov Model

HTK:

HMM Toolkit

LVCSR:

Large-Vocabulary Continuous-Speech Recognition

MAP:

Maximum A Posteriori

MAPLR:

Maximum A Posteriori Linear Regression

MFCC:

Mel-Frequency Cepstral Coefficient

ML:

Maximum Likelihood

MLLR:

Maximum Likelihood Linear Regression

PCA:

Principal Component Analysis

PPCA:

Probabilistic Principal Component Analysis

PTA:

Probabilistic Tensor Analysis

SD:

Speaker-Dependent

SI:

Speaker-Independent

SVD:

Singular Value Decomposition

References

  1. Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77(2):257-286. 10.1109/5.18626


  2. Gales M, Young S: The application of hidden Markov models in speech recognition. Found. Trends Signal, Process 2008, 1(3):195-304.


  3. Kuhn R, Junqua J-C, Nguyen P, Niedzielski N: Rapid speaker adaptation in eigenvoice space. IEEE Trans. Speech Audio Process 2000, 8(6):695-707. 10.1109/89.876308


  4. Kolda TG, Bader BW: Tensor decompositions and applications. SIAM Rev 2009, 51(3):455-500. 10.1137/07070111X


  5. Jeong Y: Speaker adaptation based on the multilinear decomposition of training speaker models. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing. TX: Dallas; 2010:4870-4873.


  6. Leggetter CJ, Woodland PC: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Lang 1995, 9(2):171-185. 10.1006/csla.1995.0010


  7. Jeong Y: Acoustic model adaptation based on tensor analysis of training models. IEEE Signal Process. Lett 2011, 18(6):347-350.


  8. Tao D, Song M, Li X, Shen J, Sun J, Wu X, Faloutsos C, Maybank SJ: Bayesian tensor approach for 3-D face modeling. IEEE Trans. Circ. Syst. Video Technol 2008, 18(10):1397-1410.


  9. Tipping ME, Bishop CM: Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B-Stat. Methodol 1999, 61(3):611-622. 10.1111/1467-9868.00196


  10. Gauvain J-L, Lee C-H: Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process 1994, 2(2):291-298. 10.1109/89.279278


  11. Chesta C, Siohan O, Lee C-H: Maximum a posteriori linear regression for hidden Markov model adaptation. In Proceedings of EUROSPEECH. Hungary: Budapest; 1999:211-214.


  12. Carroll JD, Chang JJ: Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition. Psychometrika 1970, 35(3):283-319. 10.1007/BF02310791


  13. Gupta AK, Nagar DK: Matrix Variate Distributions. Boca Raton, FL: Chapman and Hall/CRC; 1999.


  14. Bader BW, Kolda TG: Algorithm 862: MATLAB tensor classes for fast algorithm prototyping. ACM Trans. Math. Softw 2006, 32(4):635-653. 10.1145/1186785.1186794


  15. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B-Stat. Methodol 1977, 39(1):1-38.


  16. Gupta AK, Varga T: Elliptically Contoured Models in Statistics. Norwell, MA: Kluwer; 1993.


  17. Siohan O, Chesta C, Lee C-H: Joint maximum a posteriori adaptation of transformation and HMM parameters. IEEE Trans. Speech Audio Process 2001, 9(14):417-428.


  18. Paul DB, Baker JM: The design for the Wall Street Journal-based CSR corpus. In Proceedings of DARPA Speech and Natural Language Workshop. TX: Austin; 1992:357-362.


  19. Young S, Evermann G, Kershaw D, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P: The HTK Book, Version 3.2. England: Cambridge University Engineering Department; 2002.



Author information


Corresponding author

Correspondence to Yongwon Jeong.

Additional information

Competing interests

The author declares that they have no competing interests.


Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Jeong, Y. Speaker adaptation in the maximum a posteriori framework based on the probabilistic 2-mode analysis of training models. J AUDIO SPEECH MUSIC PROC. 2013, 7 (2013). https://doi.org/10.1186/1687-4722-2013-7
