Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis
EURASIP Journal on Audio, Speech, and Music Processing volume 2014, Article number: 12 (2014)
Abstract
Decision tree-clustered context-dependent hidden semi-Markov models (HSMMs) are typically used in statistical parametric speech synthesis to represent probability densities of acoustic features given contextual factors. This paper addresses three major limitations of this decision tree-based structure: (i) the decision tree structure lacks adequate context generalization; (ii) it is unable to express complex context dependencies; (iii) parameters generated from this structure exhibit sudden transitions between adjacent states. To alleviate these limitations, many earlier papers applied multiple decision trees with an additive assumption over those trees. The current study also uses multiple decision trees, but instead of the additive assumption, it proposes to train the smoothest distribution by maximizing an entropy measure. Increasing the smoothness of the distribution improves context generalization. The proposed model, named hidden maximum entropy model (HMEM), estimates a distribution that maximizes entropy subject to multiple moment-based constraints. Due to the simultaneous use of multiple decision trees and the maximum entropy measure, the three aforementioned issues are considerably alleviated. Relying on HMEM, a novel speech synthesis system has been developed with maximum likelihood (ML) parameter re-estimation as well as maximum output probability parameter generation. Additionally, an effective and fast algorithm that builds multiple decision trees in parallel is devised. Two sets of experiments have been conducted to evaluate the performance of the proposed system. In the first set of experiments, HMEM with some heuristic context clusters is implemented. This system outperformed the decision tree structure in small training databases (i.e., 50, 100, and 200 sentences). In the second set of experiments, the HMEM performance with four parallel decision trees is investigated using both subjective and objective tests.
All evaluation results of the second experiment confirm significant improvement of the proposed system over the conventional HSMM.
1 Introduction
Statistical parametric speech synthesis (SPSS) has dominated the speech synthesis research area over the last decade [1, 2]. This is mainly due to the advantages of SPSS over traditional concatenative speech synthesis approaches; these advantages include the flexibility to change voice characteristics [3–5], multilingual support [6–8], coverage of acoustic space [1], small footprint [1], and robustness [4, 9]. All of the above advantages stem from the fact that SPSS provides a statistical model for acoustic features instead of using original speech waveforms. However, these advantages are achieved at the expense of one major disadvantage, i.e., degradation in the quality of synthetic speech [1]. This shortcoming results from three important factors: vocoding distortion [10–13], accuracy of statistical models [14–25], and accuracy of parameter generation algorithms [26–28]. This paper is an attempt to alleviate the second factor and improve the accuracy of statistical models. Most of the research carried out to improve acoustic modeling performance has aimed to develop systems that generate natural and high-quality speech using large training speech databases (more than 30 min) [18, 21, 22]. Nevertheless, there exist a great number of under-resourced languages (such as Persian) for which only a limited amount of data is available. To alleviate this shortcoming, we target developing a statistical approach that leads to an appropriate speech synthesis system not only with large but also with small training databases.
Every SPSS system consists of two distinct phases, namely training and synthesis [1, 2]. In the training phase, acoustic and contextual factors are first extracted for the whole training database using a vocoder [12, 29, 30] and a natural language preprocessor. Next, the relationship between acoustic and contextual factors is modeled using a context-dependent statistical approach [14–25]. The synthesis phase starts with a parameter generation algorithm [26–28] that exploits the trained context-dependent statistical models and aims to generate realistic acoustic feature trajectories for a given input text. Acoustic trajectories are then fed into the same vocoder used during the training phase in order to generate the desired synthesized speech.
In the most predominant statistical parametric approach, the spectrum, excitation, and duration of speech are expressed concurrently in a unified framework of context-dependent multi-space probability distribution hidden semi-Markov model (HSMM) [14]. More specifically, a multi-space probability distribution [17] is estimated for each leaf node of decision trees [31]. These decision tree-based structures split the contextual space into a number of non-overlapped clusters which form multiple groups of context-dependent HMM states, and each group shares the same output probability distribution [31]. In order to capture acoustic variations accurately, the model has to be able to express a large number of robust distributions [19, 20]. Decision trees are not efficient for such expression because increasing the number of distributions by growing the tree reduces the population of each leaf and consequently reduces the robustness of the distributions. This problem stems from the fact that a decision tree assigns each HMM state to only one cluster (a small region in the contextual space); therefore, each state contributes to modeling just one distribution. In other words, the decision tree structure makes the models match training data only in the non-overlapped regions which are expressed through decision tree terminal nodes [31]. In the case of limited training data, the decision tree would be small, so it cannot split the contextual factor space sufficiently. In this case, the accordance between model and data is not sufficient, and therefore, the speech synthesis system generates unsatisfactory output. Accordingly, by extending the decision tree in such a way that each state affects multiple distributions (a larger portion of the contextual space), the generalization to unseen models will be improved.
The main idea of this study is to extend the non-overlapped regions of one decision tree to overlapped regions of multiple decision trees and hence exploit contextual factors more efficiently.
A large number of research works have already been performed to improve the quality of the basic decision tree-clustered HSMM. Some of them are based on a model adaptation technique. This method exploits valuable prior knowledge attained from an average voice model [3] and adapts this general model using an adaptation algorithm such as maximum likelihood linear regression (MLLR) [32], maximum a posteriori (MAP) [33], and cluster adaptive training (CAT) [21]. However, working with average voice models is difficult for under-resourced languages since building such a general model requires considerable effort to design, record, and transcribe a thorough multi-speaker speech database [3]. To alleviate the data sparsity problem in under-resourced languages, the speaker and language factorization (SLF) technique can be used [34]. SLF attempts to factorize speaker-specific and language-specific characteristics in the training data and then model them using different transforms. By representing the speaker attributes by one transform and language characteristics by a different transform, the speech synthesis system will be able to alter language and speaker separately. In this framework, it is possible to exploit the data from different languages to predict speaker-specific characteristics of the target speaker, and consequently, the data sparsity problem will be alleviated. The authors in [15, 16] also developed a new technique by replacing the maximum likelihood (ML) point estimate of HSMM with a variational Bayesian method. Their system was shown to outperform HSMM when the amount of training data is small. Other notable structures used to improve statistical modeling accuracy are deep neural networks (DNNs) [18]. The decision tree structure is not efficient enough to model complicated context dependencies such as XORs or multiplexers [18].
To model such complex contextual functions, the decision tree has to be excessively large, whereas DNNs are capable of modeling complex contextual factors by employing multiple hidden layers. Additionally, a great number of overlapped contextual factors can be fed into a DNN to approximate output acoustic features, so DNNs are able to provide efficient context generalization. Speech synthesis based on Gaussian process regression (GPR) [35] is another novel approach that has recently been proposed to overcome the limitations of HMM-based speech synthesis. The GPR model predicts frame-level acoustic trajectories from frame-level contextual factors. The frame-level contextual factors include the relative position of the current frame within the phone and some articulatory information. These frame-level contextual factors are employed as the explanatory variable in GPR. The frame-level modeling of GPR removes the inaccurate stationarity assumption of the state output distribution in HMM-based speech synthesis. Also, GPR can directly represent complex context dependencies without using parameter tying by decision tree clustering; therefore, it is capable of improving context generalization.
Acoustic modeling with a contextual additive structure has also been proposed to represent dependencies between contextual factors and acoustic features more precisely [19, 20, 23, 32, 36–40]. In this structure, acoustic trajectories are considered to be a sum of independent acoustic components which have different context dependencies (different decision trees have to be trained for those components). Since the mean vectors and covariance matrices of the distribution are equal to the sum of the mean vectors and covariance matrices of the additive components, the model is able to exploit contextual factors more efficiently. Furthermore, in this structure, each training data sample contributes to modeling multiple mean vectors and covariance matrices. Many papers applied the additive structure just for F0 modeling [37–40]. The authors in [37] proposed an additive structure with multiple decision trees for mean vectors and a single tree for variance terms. In that work, different sets of contextual factors were used for different additive components, and multiple trees were built simultaneously. In [40], multiple additive decision trees are also employed, but the structure is trained using the minimum generation error (MGE) criterion. Sakai [38] defines an additive model with three distinct layers, namely intonational phrase, word-level, and pitch-accent layers. All of these components were trained simultaneously using a regularized least square error criterion. Qian et al. [39] propose to use multiple additive regression trees with a gradient-based tree-boosting algorithm. Decision trees are trained in successive stages to minimize the squared error. Takaki et al. [19, 20] applied the additive structure to spectral modeling and reported that the computational complexity of this structure is extremely high for full context labels as used in speech synthesis.
To alleviate this issue, they proposed two approaches: covariance parameter tying and a likelihood calculation algorithm using the matrix inversion lemma [19]. Despite all its advantages, this additive structure may not match training data accurately because, once training is done, the first and second moments of the training data and of the model may not be exactly the same in some regions.
Another important problem of conventional decision tree-clustered acoustic modeling is the difficulty in capturing the effect of weak contextual factors such as word-level emphasis [23, 36]. This is mainly because weak contexts have less influence on the likelihood measure [23]. One clear approach to address this issue is to construct the decision tree in two successive steps [36]. In the first step, all selections are made among weak contextual factors, and in the second step, the remaining questions are adopted [36]. This procedure can effectively exploit weak contextual factors, but it leads to a reduction in the amount of training data available for normal contextual factors. Context adaptive training with factorized decision trees [23] is another approach that can exploit weak context questions efficiently. In this system, a canonical model is trained using normal contextual factors and then a set of transforms is built by weak contextual factors. In fact, canonical models and transforms, respectively, represent the effects of normal and weak contextual factors [23]. This structure also improves the context generalization of conventional HMM-based synthesis by exploiting adaptation techniques.
This paper introduces maximum entropy model (MEM)-based speech synthesis. MEM [41] has been demonstrated to be effective in numerous applications of speech and natural language processing such as speech recognition [42], prosody labeling [43], and part-of-speech tagging [44]. Accordingly, the overall idea of this research is to improve HSMM context generalization by taking advantage of a distribution which not only matches training data in many overlapped contextual regions but is also optimum in the sense of an entropy criterion. This system has the potential to model the dependencies between contextual factors and acoustic features such that each training sample contributes to training multiple sets of model parameters. As a result, context-dependent acoustic modeling based on MEM could lead to a promising synthesis system even for limited training data.
The rest of the paper is organized as follows. Section 2 presents HSMM-based speech synthesis. The hidden maximum entropy model (HMEM) structure and the proposed HMEM-based speech synthesis system are explained in Section 3. Section 4 is dedicated to experimental results. Finally, Section 5 concludes this paper.
2 HSMM-based speech synthesis
This section aims to explain the predominant statistical modeling approach applied in speech synthesis, i.e., the context-dependent multi-space probability distribution left-to-right HSMM without skip transitions [3, 14] (simply called HSMM in the remainder of this paper). The discussion presented in this section provides a preliminary framework which will be used as a basis to introduce the proposed HMEM technique in Section 3. The most significant drawback of HSMM, namely inadequate context generalization, is also pointed out.
2.1 HSMM structure
HSMM is a hidden Markov model (HMM) with an explicit state duration distribution instead of self-state transition probabilities. Figure 1 illustrates the standard HSMM. As can be observed, HSMM initially partitions acoustic parameter (observation) trajectories into a fixed number of time slices (so-called states) in order to moderate the undesirable influence of non-stationarity. Note that state durations are latent variables and have to be trained in an unsupervised manner. An $N$-state HSMM $\lambda$ is specified by a set of state output probability distributions $\{b_i(\cdot)\}_{i=1}^{N}$ and a complementary set of state duration probability distributions $\{p_i(\cdot)\}_{i=1}^{N}$. To model these distributions, a number of distinct decision trees are used for output and duration probability distributions. Conventionally, different trees are trained for different states [31]. These trees cluster the whole contextual factor space into a large number of tiny regions which are expressed by terminal nodes. Thereafter, in each terminal node, the output distribution $b_i(\cdot)$ is modeled by a multi-space probability distribution, and similarly, a typical Gaussian distribution is considered for the duration probability $p_i(\cdot)$ [14].
To handle the absence of fundamental frequency in unvoiced regions, a multi-space probability distribution (MSD) is used for the output probability distribution [17]. In accordance with commonly used synthesizers, this paper assumes that the acoustic sample space consists of $G$ spaces. Each of these spaces, specified by an index $g$, represents an $n_g$-dimensional real space, i.e., $\mathcal{R}^{n_g}$. Each observation vector $o_t$ has a probability $w_g$ of being generated by the $g$-th space iff the dimensionality of $o_t$ is identical to $n_g$. In other words, we have
where $S(o_t)$ represents the set of all space indexes with the same dimensionality as $o_t$, and where $\mathcal{N}_l(\cdot; \mu, \Sigma)$ denotes an $l$-dimensional Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$ ($\mathcal{N}_0$ is defined to be 1). Furthermore, the output probability distribution of the $i$-th state and $g$-th space is denoted by $b_{ig}(o_t)$, which is a Gaussian distribution with mean vector $\mu_{ig}$ and covariance matrix $\Sigma_{ig}$. Also, $m_i$ and $\sigma_i^2$ represent the mean and variance of the state duration probability.
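The multi-space output probability described above can be illustrated with a minimal sketch. It assumes only two spaces (a 0-dimensional "unvoiced" space and a 1-dimensional "voiced" space) and scalar observations; the function names, the dictionary layout, and all numeric values are hypothetical and chosen for illustration only.

```python
import math

def gaussian_pdf(x, mean, var):
    """One-dimensional Gaussian density N_1(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def msd_output_prob(obs, spaces):
    """Multi-space output probability of one state for one observation.

    obs    -- None for an unvoiced frame (0-dimensional) or a float (1-dimensional)
    spaces -- list of dicts with keys 'dim' and 'weight', plus 'mean' and 'var' when dim == 1
    """
    obs_dim = 0 if obs is None else 1
    total = 0.0
    for space in spaces:
        if space["dim"] != obs_dim:
            continue  # only spaces in S(o_t), i.e. with matching dimensionality, contribute
        if obs_dim == 0:
            total += space["weight"]  # N_0 is defined to be 1
        else:
            total += space["weight"] * gaussian_pdf(obs, space["mean"], space["var"])
    return total

# Hypothetical state: 30% unvoiced, 70% voiced with log-F0 centred at 5.0
state_spaces = [
    {"dim": 0, "weight": 0.3},
    {"dim": 1, "weight": 0.7, "mean": 5.0, "var": 0.04},
]
unvoiced_prob = msd_output_prob(None, state_spaces)
voiced_prob = msd_output_prob(5.0, state_spaces)
```

For an unvoiced frame the result is just the weight of the 0-dimensional space, which is how the MSD handles frames where no fundamental frequency exists.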
Regarding the method for providing context dependency, it should be noted that HSMM normally offers binary decision trees and acoustic models are established for each leaf of these trees, separately [45, 46]. Suppose f and L are contextual functions based on a decision tree Y and are defined as
Applying the above functions, all model parameters of Equations 1 and 2 can be expressed by linear combinations of model parameters defined for each terminal node. More precisely,
where $Y_o$ and $Y_d$ are the decision trees trained for modeling output observation vectors and state durations. All symbols with superscript $l$ indicate model parameters defined for the $l$-th leaf.
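As a rough sketch of the idea, the leaf indicator functions of a decision tree and the resulting "linear combination" of leaf parameters can be written as follows. The tree encoding, the question functions, and the leaf values are hypothetical; the sketch assumes that the indicator equals 1 for the single leaf containing the context and 0 for all other leaves, as the surrounding text implies.

```python
def leaf_index(context, tree):
    """Walk a binary decision tree and return the index of the leaf that is reached.

    A tree is either an int (a leaf index) or a tuple (question, yes_subtree, no_subtree),
    where question maps a context dict to True or False.
    """
    while not isinstance(tree, int):
        question, yes_branch, no_branch = tree
        tree = yes_branch if question(context) else no_branch
    return tree

def leaf_indicators(context, tree, num_leaves):
    """Indicator vector over leaves: 1 for the leaf containing the context, 0 elsewhere."""
    active = leaf_index(context, tree)
    return [1 if leaf == active else 0 for leaf in range(num_leaves)]

def combine(indicators, leaf_params):
    """Linear combination of per-leaf parameters weighted by the indicators."""
    return sum(f * p for f, p in zip(indicators, leaf_params))

# Hypothetical tree with three leaves and hypothetical leaf means
tree = (lambda c: c["is_vowel"],
        (lambda c: c["stressed"], 0, 1),
        2)
leaf_means = [5.2, 4.9, 0.0]

context = {"is_vowel": True, "stressed": False}
indicators = leaf_indicators(context, tree, num_leaves=3)
state_mean = combine(indicators, leaf_means)
```

Because exactly one indicator is active for any context, the linear combination simply selects the parameters of one leaf; this is the property that Section 2.4 identifies as the source of limited generalization.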
2.2 HSMM likelihood
Having described the HSMM structure, we can now probe the exact expression for the model likelihood, i.e., the probability of the observation sequence $O = [o_1, o_2, \ldots, o_T]$, as [14]:
where this equality is valid for every value of $t \in [1, T]$. Also, $\alpha_t(i)$ and $\beta_t(i)$ are partial forward and backward probability variables that are calculated successively from their previous or next values as follows [3, 14]:
where the initial forward and backward variables for every state index $i$ are $\alpha_0(i) = 1$ and $\beta_T(i) = 1$.
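To make the forward computation concrete, the following is a simplified sketch for a left-to-right HSMM without skip transitions. It is not the exact recursion of [3, 14]: it assumes scalar observations, a single Gaussian per state, an (unnormalized) Gaussian duration density, and that the likelihood is read off at the final frame only; all function and field names are hypothetical. A brute-force sum over every possible segmentation is included to check the recursion.

```python
import math
from itertools import product

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def segment_prob(state, obs, start, end):
    """Duration probability times output probabilities for frames start..end-1."""
    prob = gaussian_pdf(end - start, state["dur_mean"], state["dur_var"])
    for o in obs[start:end]:
        prob *= gaussian_pdf(o, state["mean"], state["var"])
    return prob

def forward_likelihood(states, obs):
    """alpha[t][i]: probability of the first t frames with state i ending exactly at frame t."""
    T, N = len(obs), len(states)
    alpha = [[0.0] * N for _ in range(T + 1)]
    for t in range(1, T + 1):
        alpha[t][0] = segment_prob(states[0], obs, 0, t)
        for i in range(1, N):
            # state i lasts d frames and follows state i-1, which ended at frame t-d
            alpha[t][i] = sum(alpha[t - d][i - 1] * segment_prob(states[i], obs, t - d, t)
                              for d in range(1, t))
    return alpha[T][N - 1]

def brute_force_likelihood(states, obs):
    """Sum over every duration assignment that covers the whole sequence."""
    T, N = len(obs), len(states)
    total = 0.0
    for durations in product(range(1, T + 1), repeat=N):
        if sum(durations) != T:
            continue
        prob, start = 1.0, 0
        for state, d in zip(states, durations):
            prob *= segment_prob(state, obs, start, start + d)
            start += d
        total += prob
    return total

# Hypothetical three-state model and five-frame observation sequence
states = [
    {"mean": 0.0, "var": 1.0, "dur_mean": 2.0, "dur_var": 1.0},
    {"mean": 1.0, "var": 0.5, "dur_mean": 2.0, "dur_var": 1.0},
    {"mean": 2.0, "var": 1.0, "dur_mean": 1.0, "dur_var": 0.5},
]
obs = [0.1, -0.2, 1.1, 0.9, 2.2]
likelihood = forward_likelihood(states, obs)
```

The recursion visits each (frame, state, duration) triple once, whereas the brute-force sum grows exponentially with the number of states, which is why the forward and backward variables are used in practice.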
2.3 HSMM parameter re-estimation
The ML criterion is commonly used to estimate the model parameters of HSMM. However, the latent variables, i.e., state durations and space indexes, are not observed; therefore, an expectation maximization (EM) algorithm has to be adopted. Applying the EM algorithm leads to the following re-estimation formulas [14]:
where $\gamma_t(i, g)$ denotes the posterior probability of being in state $i$ and space $g$ at time $t$, and $\chi_t^d(i)$ is the probability of occupying the $i$-th state from time $t - d + 1$ to $t$. The following equations calculate the above probabilities:
2.4 Inefficient context generalization
A major drawback of decision tree-clustered HSMM can now be clarified. Suppose we have only two real contextual factors, $f_1$ and $f_2$. Figure 2 shows a sample decision tree and the regions represented by its terminal nodes. By training HSMM, the model matches training data in all non-overlapped regions expressed by the terminal nodes. However, there is no guarantee that this accordance holds for overlapped regions such as the region $R$ in Figure 2.
It can be noticed from the definition of the function $f_i(c; Y)$ in Equation 3 that this function can be viewed as a set of $L(Y)$ non-overlapped binary contextual factors. The fact that these contextual factors are non-overlapped leads to insufficient context generalization, because it makes each training sample contribute to the model of only one leaf and only one Gaussian distribution. Hence, by extending $f_i(c; Y)$ to overlapped contextual factors, more efficient context generalization capabilities could be achieved. Section 3 proposes an approach which enables the conventional structure to model overlapped contextual factors and thus improves the modeling performance for unseen contexts.
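A toy numerical illustration of this drawback is sketched below. The four samples, the single-question tree, and the region $R$ are all hypothetical and are not taken from Figure 2; the point is only that leaf-wise means reproduce the data statistics inside each leaf but not inside a region that cuts across leaves.

```python
# Each sample is (f1, f2, observed value); all numbers are hypothetical.
samples = [
    (0.2, 0.1, 1.0),
    (0.3, 0.9, 3.0),
    (0.8, 0.2, 5.0),
    (0.7, 0.8, 9.0),
]

def leaf_of(f1, f2):
    """Single-question tree: the only split is on f1 < 0.5."""
    return 0 if f1 < 0.5 else 1

def in_region_r(f1, f2):
    """Overlapping region R, which cuts across both leaves."""
    return f2 > 0.5

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Training the tree-clustered model: one mean per leaf
leaf_means = {leaf: mean(x for f1, f2, x in samples if leaf_of(f1, f2) == leaf)
              for leaf in (0, 1)}

# Compare the data and the model inside region R
empirical_mean_r = mean(x for f1, f2, x in samples if in_region_r(f1, f2))
model_mean_r = mean(leaf_means[leaf_of(f1, f2)]
                    for f1, f2, x in samples if in_region_r(f1, f2))
```

Here the model matches the data in each leaf (means 2.0 and 7.0) but predicts 4.5 on average inside $R$, where the data average is 6.0. Adding $R$ as an additional, overlapped contextual factor is what allows the mismatch to be constrained away.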
3 Hidden maximum entropy model
The goal of this section is to develop a context-dependent statistical model for acoustic parameters with adequate context generalization. The previous section on HSMM revealed that inadequate generalization stems from the exclusive use of non-overlapped features. Consequently, relating acoustic parameters to contextual information by incorporating overlapped features could improve generalization efficiency. This section proposes HMEM to establish this relation.
3.1 HMEM structure
The proposed HMEM technique exploits exactly the same structure and graphical model as the original HSMM, and thus, the model likelihood expression given by Equation 5 is also valid for HMEM. The only difference between HSMM and HMEM is the way they incorporate contextual factors in the output and duration probability distributions (i.e., $\{b_i(\cdot)\}_{i=1}^{N}$ and $\{p_i(\cdot)\}_{i=1}^{N}$). HSMM builds a decision tree and then trains a Gaussian distribution for each leaf of the tree. In contrast, HMEM follows the maximum entropy modeling approach, which is described in the next subsection.
3.1.1 Maximum entropy modeling
Let us now derive a simple maximum entropy model. Suppose an $\ell$-dimensional random vector process with output $x$ that may be influenced by some contextual information $c$. Our target is to construct a stochastic model that precisely predicts the behavior of $x$ when $c$ is given, i.e., $P(x \mid c)$. The maximum entropy principle first imposes a set of constraints on $P(x \mid c)$ and then chooses a distribution as close as possible to a uniform distribution by maximizing the entropy criterion [41]. In fact, this method finds the least biased distribution among all distributions that satisfy the constraints. In other words,
the employed constraints make the model preserve some context-dependent statistics of the training data. $\mathcal{H}(P)$ represents the entropy criterion [41], which is calculated as
Computing the above expression is extremely complex because there are a large number of contextual factors and not all possible values of $c$ can be enumerated. However, the authors in [41] applied the following approximation for $P(x, c)$:
where $\tilde{P}(c)$ denotes the empirical probability, which can be calculated directly from the training database [41]. The above approximation simplifies the entropy expression as
where the second term is constant and does not affect the optimization problem. Therefore, we have
Additionally, we adopt a set of $L_f$ predefined binary contextual factors, $f_l(c)$, and another set of $L_g$ binary contextual factors, $g_l(c)$, both of which may be highly overlapped. In order to obtain a Gaussian distribution for $\widehat{P}(x \mid c)$ and extend the conventional HSMM distribution, the first- and second-order context-dependent moments expressed in Equation 14 are considered for the constraints.
subject to the following constraints:
where $E$ and $\widehat{E}$ indicate the real and empirical mathematical expectations given in the following equations:
where $x(c)$ denotes the realization of the $\ell$-dimensional random vector $x$ for the context $c$ in the database. If there are multiple realizations of $x$, $x(c)$ is obtained by taking the average over those values. In sum, the proposed context-dependent acoustic modeling approach obtains the smoothest (maximum entropy) distribution that captures the first-order moments of the training data in the $L_f$ regions indicated by $\{f_l(c)\}_{l=1}^{L_f}$ and the second-order moments of the data computed in $\{g_l(c)\}_{l=1}^{L_g}$.
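For orientation, the chain of steps described in this subsection can be written out as below. This is a reconstruction from the surrounding definitions (the approximation $P(x, c) \approx \tilde{P}(c) P(x \mid c)$, the binary factors $f_l$ and $g_l$, and the realizations $x(c)$); the exact notation, normalization, and equation numbering of the original displayed equations are assumptions.

```latex
% Entropy under the approximation P(x,c) \approx \tilde{P}(c) P(x|c)
\begin{align*}
\mathcal{H}(P)
  &= -\sum_{c} \int P(x, c)\, \log P(x, c)\, dx \\
  &\approx -\sum_{c} \tilde{P}(c) \int P(x \mid c)\, \log P(x \mid c)\, dx
     \;-\; \sum_{c} \tilde{P}(c)\, \log \tilde{P}(c)
\end{align*}

% The second term does not depend on P(x|c), so the problem becomes
\begin{equation*}
\widehat{P}(x \mid c)
  = \arg\max_{P(x \mid c)}
    \; -\sum_{c} \tilde{P}(c) \int P(x \mid c)\, \log P(x \mid c)\, dx
\end{equation*}

% subject to first- and second-order moment constraints
\begin{align*}
E\{ f_l(c)\, x \} &= \widehat{E}\{ f_l(c)\, x \}, & l &= 1, \dots, L_f \\
E\{ g_l(c)\, x x^{\top} \} &= \widehat{E}\{ g_l(c)\, x x^{\top} \}, & l &= 1, \dots, L_g
\end{align*}

% with, for the first-order case (the second-order case is analogous),
\begin{align*}
E\{ f_l(c)\, x \} &= \sum_{c} \tilde{P}(c)\, f_l(c) \int P(x \mid c)\, x \, dx \\
\widehat{E}\{ f_l(c)\, x \} &= \sum_{c} \tilde{P}(c)\, f_l(c)\, x(c)
\end{align*}
```

The first step uses $\log P(x, c) = \log \tilde{P}(c) + \log P(x \mid c)$ together with $\int P(x \mid c)\, dx = 1$, which is why the term involving only $\tilde{P}(c)$ separates out as a constant.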
In order to solve the optimization problem expressed by Equation 10, the Lagrange multipliers method is applied. This method defines a new optimization function as follows:
where $u_l$ denotes a vector of Lagrange multipliers for satisfying the $l$-th first-order moment constraint and $H_l$ is a matrix of Lagrange multipliers for satisfying the $l$-th second-order moment constraint. Taking derivatives of the above function with respect to $P$ leads to the following equality.
Therefore, one possible solution that maximizes entropy with the constraint of Equation 15 using Lagrange multipliers can be expressed as:
where $H_l$ and $u_l$ are the model parameters related to the $l$-th contextual factors $g_l(c)$ and $f_l(c)$, respectively. $H_l$ is an $\ell$-by-$\ell$ matrix and $u_l$ is an $\ell$-dimensional vector. When $f_l(c)$ becomes 1 (i.e., it is active), $u_l$ affects the distribution; otherwise, it has no effect on the distribution. In fact, Equation 19 is nothing but the well-known Gaussian distribution with mean vector $-0.5 H^{-1} u$ and covariance matrix $H^{-1}$, both calculated from a specific context-dependent combination of model parameters. Indeed, the main difference between MEM and other methods such as the spectral additive structure [19, 20] is that the mean and variance in MEM are not a linear combination of other parameters. This type of combination enables MEM to match training data in all overlapped regions.
This form of context-dependent Gaussian distribution offers promising flexibility in utilizing contextual information. On one hand, using detailed and non-overlapped contextual factors such as the features defined by Equation 3 (decision tree terminal node indicators) generates context-dependent Gaussian distributions which are identical to those used in conventional HSMM. These distributions have a straightforward and efficient training procedure but suffer from insufficient context generalization capabilities. On the other hand, incorporating general and highly overlapped contextual factors overcomes the latter shortcoming and provides efficient context generalization, but the training procedure becomes more computationally complex. In the case of highly overlapped contextual factors, an arbitrary context activates several contextual factors, and hence, each observation vector is involved in modeling several model parameters.
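The context-dependent Gaussian described above can be sketched for the one-dimensional case as follows. The sketch assumes a density of the form $P(x \mid c) \propto \exp\{-0.5\,(h x^2 + u x)\}$, with $h$ and $u$ obtained by summing the parameters of the active factors; this form is inferred from the stated mean $-0.5 H^{-1} u$ and covariance $H^{-1}$ rather than copied from Equation 19. The factor functions and parameter values are hypothetical.

```python
def me_gaussian(context, first_order, second_order):
    """Return (mean, variance) of the 1-D maximum entropy distribution for a context.

    first_order  -- list of (f_l, u_l): binary factor function and scalar parameter
    second_order -- list of (g_l, h_l): binary factor function and scalar parameter
    Assumed density: P(x | c) proportional to exp(-0.5 * (h * x**2 + u * x)).
    """
    u = sum(u_l for f_l, u_l in first_order if f_l(context))
    h = sum(h_l for g_l, h_l in second_order if g_l(context))
    if h <= 0.0:
        raise ValueError("active second-order parameters must sum to a positive value")
    return -0.5 * u / h, 1.0 / h

# Hypothetical, overlapping binary contextual factors
def is_vowel(c):
    return c["vowel"]

def is_stressed(c):
    return c["stressed"]

def always(c):
    return True

first_order = [(is_vowel, -8.0), (is_stressed, -2.0)]
second_order = [(always, 1.0), (is_vowel, 1.0)]

stressed_vowel = me_gaussian({"vowel": True, "stressed": True}, first_order, second_order)
stressed_consonant = me_gaussian({"vowel": False, "stressed": True}, first_order, second_order)
```

A stressed vowel activates two first-order and two second-order factors at once, so its mean ($-0.5 \cdot (-10) / 2 = 2.5$) is a ratio of sums rather than the parameter of any single region.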
3.1.2 ME-based modeling vs. additive modeling
At first glance, the contextual additive structure [19, 20, 32, 37] seems to have the same capabilities as the proposed ME-based context-dependent acoustic modeling. Therefore, to clarify their differences, this section compares HMEM with the additive structure through a very simple example.
In this example, the goal is to model a one-dimensional observation value using both ME-based modeling and a contextual additive structure. Due to the prime importance of mean parameters in HMM-based speech synthesis [47], we investigate the difference between the mean values predicted by the two systems.
Figure 3A shows a three-dimensional contextual factor space ($c_1$, $c_2$, $c_3$) which is clustered by an additive structure. The additive structure consists of three different additive components with three different decision trees, namely $Q_1$, $Q_2$, and $Q_3$. Each tree has a simple structure with just one binary question that splits a specific dimension of the contextual factor space into two regions. Each region is represented by a leaf node, and inside that leaf node, a mean parameter of each additive component is written. As depicted in the figure, these trees split the contextual factor space into eight different cubic clusters. The mean values estimated for these cubic clusters are computed by adding the mean values of the additive components.
In contrast, Figure 3B shows the corresponding ME-based modeling approach. As described in the previous subsection, ME-based context-dependent modeling needs two sets of regions, $\{f_l(c)\}_{l=1}^{L_f}$ and $\{g_l(c)\}_{l=1}^{L_g}$. This example assumes that the leaves of $Q_1$ and $Q_2$ are defined as the first set of regions $\{f_l(c)\}_{l=1}^{L_f}$, and the leaves of $Q_3$ are defined as the second set $\{g_l(c)\}_{l=1}^{L_g}$. Therefore, according to the explanation of the previous subsection, the first empirical moments of $Q_1$ and $Q_2$, in addition to the second empirical moments of $Q_3$, are captured by ME-based modeling. Figure 3B shows the estimated model mean values for all eight cubic clusters. As can be seen from the figure, the model mean values estimated by ME-based modeling are obtained by adding the parameters defined for the regions $\{f_l(c)\}_{l=1}^{L_f}$ and dividing the result by the parameters defined for the regions $\{g_l(c)\}_{l=1}^{L_g}$. In fact, the proposed ME-based modeling is an extension of the additive structure that ties all covariance matrices [19]. This is clear because if $\{g_l(c)\}_{l=1}^{L_g}$ is defined with a single region containing the whole contextual feature space, ME-based modeling reduces to the additive structure that ties all covariance matrices [19].
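The comparison can be reproduced numerically with a small sketch. The leaf parameters below are hypothetical and are not the values shown in Figure 3; the ME-based mean is assumed to take the one-dimensional form $-0.5\,u/h$, with $u$ summed over the active leaves of $Q_1$ and $Q_2$ and $h$ taken from the active leaf of $Q_3$.

```python
from itertools import product

# Hypothetical leaf parameters; index 0 or 1 selects the answer to each tree's question.
additive_means = {"Q1": (1.0, 2.0), "Q2": (0.5, 1.5), "Q3": (-0.2, 0.2)}
me_u = {"Q1": (-2.0, -4.0), "Q2": (-1.0, -3.0)}  # first-order parameters (regions f_l)
me_h = {"Q3": (1.0, 2.0)}                         # second-order parameters (regions g_l)

def additive_mean(cluster):
    """Additive structure: the cluster mean is the sum of the component means."""
    c1, c2, c3 = cluster
    return additive_means["Q1"][c1] + additive_means["Q2"][c2] + additive_means["Q3"][c3]

def me_mean(cluster, h_table=me_h["Q3"]):
    """ME-based modeling: summed first-order parameters divided by the second-order one."""
    c1, c2, c3 = cluster
    u = me_u["Q1"][c1] + me_u["Q2"][c2]
    return -0.5 * u / h_table[c3]

clusters = list(product((0, 1), repeat=3))  # the eight cubic clusters
additive_table = {c: additive_mean(c) for c in clusters}
me_table = {c: me_mean(c) for c in clusters}

# One second-order region covering the whole space: h is the same everywhere
tied_table = {c: me_mean(c, h_table=(1.0, 1.0)) for c in clusters}
```

In `additive_table`, moving from one leaf of $Q_3$ to the other shifts every cluster mean by the same amount, whereas in `me_table` the shift depends on $c_1$ and $c_2$ because $Q_3$ acts as a divisor. With a single second-order region (`tied_table`), the ME-based means become purely additive again, which matches the reduction to the tied-covariance additive structure stated above.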
3.1.3 HMEM-based speech synthesis
HMEM improves both the state duration distributions $\{p_i(\cdot)\}_{i=1}^{N}$ and the output observation distributions $\{b_i(\cdot)\}_{i=1}^{N}$ using maximum entropy modeling. According to the discussion presented in Section 3.1.1, MEM requires two sets of contextual factors. In this section, for the sake of simplicity, it is assumed that the contextual regions defined for the first-order moment constraints $\{f_l(c)\}_{l=1}^{L_f}$ are identical to the regions defined for the second-order moment constraints $\{g_l(c)\}_{l=1}^{L_g}$. All equations presented in this section are based on this assumption; however, their extension to the general case (different $\{f_l(c)\}_{l=1}^{L_f}$ and $\{g_l(c)\}_{l=1}^{L_g}$) is straightforward. Therefore, we define $f_l^d(i)$ and $f_l^o(i)$ as $L^d$ and $L^o$ contextual factors which are designed carefully for the purpose of modeling the duration and acoustic parameters of the $i$-th state. The maximum entropy criterion leads to the following duration and output probability distributions.
In these equations, $S(o_t)$ is the set of all possible spaces defined for $o_t$. $u_i^l$ and $h_i^l$ are the duration model parameters, and $w_{ig}^l$, $u_{ig}^l$, and $H_{ig}^l$ denote the output model parameters related to the $l$-th contextual factor, $g$-th space, and $i$-th state.
We can now examine the differences between HSMM and HMEM context-dependent acoustic modeling. The two approaches are closely related: defining the HMEM contextual factors based on the decision trees described by Equation 3 reduces HMEM to HSMM. Accordingly, HMEM extends HSMM and enables its structure to exploit overlapped contextual factors.
Another significant conclusion of this section is that several HSMM concepts carry over to the HMEM framework. These include the Viterbi algorithm, the methods that calculate forward/backward variables and occupation probabilities, and all parameter generation algorithms [26–28]. One only needs to define the mean vectors, covariance matrices, and space probabilities of HSMM in accordance with Equation 20.
3.2 HMEM parameter re-estimation
In the training phase, we are given a set of K i.i.d. training data {\left\{{\mathit{O}}^{\mathit{k}}\right\}}_{\mathit{k}=1}^{\mathit{K}}; the goal is to find the best set of model parameters \phantom{\rule{0.12em}{0ex}}\widehat{\mathit{\lambda}}, which maximizes the log likelihood:
Substituting Equation 5 for the likelihood P(O^{(k)} | λ) leads to an excessively complex optimization problem with no tractable direct solution. The major issue is that the distribution depends entirely on latent variables, which are unknown. The expectation maximization (EM) technique offers an iterative algorithm that overcomes this problem:
where d and q represent possible state durations and space indexes for the k th training utterance and the second summation is calculated over all possible values of d and q. In general, these functions cannot be optimized in closed form. Therefore, a numerical optimization technique such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) [48] method or the Newton algorithm has to be applied to find one of the local optima. This paper proposes to use the BFGS algorithm, owing to its favorable characteristics. BFGS requires only the first partial derivatives of the cost functions, which are calculated as follows:
where γ_{ t }(i, g) and {\mathit{\chi}}_{\mathit{t}}^{\mathit{d}}\left(\mathit{i}\right) are defined in Section 2.3. Therefore, at every BFGS iteration, the above gradient values are computed, and BFGS then estimates new parameters that are closer to the optimum.
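The gradient-based re-estimation described above can be illustrated with a deliberately simplified sketch. It assumes a scalar observation, a unit variance tied across all contexts, and a mean equal to the sum of the parameters of the active (possibly overlapping) contextual factors; it also uses plain gradient ascent rather than BFGS. The function names and the toy data are hypothetical and are not taken from the paper. Under these assumptions, the gradient for each factor is the difference between the empirical and the model first-order moments over the corresponding region, so the moment constraints are satisfied at convergence.

```python
def train_overlapping_means(samples, num_factors, step=0.5, iters=500):
    """Fit one parameter per contextual factor by gradient ascent.

    samples: list of (active_factor_indices, observation) pairs.
    Simplifying assumption: unit variance, mean = sum of active parameters.
    """
    u = [0.0] * num_factors
    n = len(samples)
    for _ in range(iters):
        grad = [0.0] * num_factors
        for active, x in samples:
            mean = sum(u[l] for l in active)
            for l in active:
                # empirical moment minus model moment for region l
                grad[l] += x - mean
        u = [u_l + step * g / n for u_l, g in zip(u, grad)]
    return u


def region_moments(samples, u, num_factors):
    """Return (empirical, model) first-order moments for every region."""
    empirical = [0.0] * num_factors
    model = [0.0] * num_factors
    for active, x in samples:
        mean = sum(u[l] for l in active)
        for l in active:
            empirical[l] += x
            model[l] += mean
    return empirical, model


# Toy data: factors 0/1 and 2/3 form two overlapping partitions of the contexts.
samples = [((0, 2), 1.0), ((0, 3), 2.0), ((1, 2), 3.0), ((1, 3), 5.0), ((0, 2), 1.4)]
u = train_overlapping_means(samples, 4)
empirical, model = region_moments(samples, u, 4)
```

A quasi-Newton method such as BFGS would use the same gradient values but typically needs far fewer iterations than the fixed-step update shown here.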
At first glance, calculating the above gradient expressions seems to be computationally expensive, but they can be calculated efficiently if we rewrite them in terms of sufficient statistics as in the following equations. By doing this, the computational complexity no longer depends on the number of training observation vectors, but rather on the total number of states. Furthermore, storing sufficient statistics instead of all observation vectors reduces the amount of main memory usage of the training procedure. These equations are expressed as
where {\tilde{\mathit{X}}}_{\mathit{i}}, {\tilde{\mathit{m}}}_{\mathit{i}}, and {\tilde{\mathit{r}}}_{\mathit{i}} are sufficient statistics required to train duration distribution and are calculated as
Also, \tilde{\mathit{\gamma}}\left(\mathit{i},\mathit{g}\right), \tilde{\mathit{\mu}}\left(\mathit{i},\mathit{g}\right), and \tilde{\mathit{R}}\left(\mathit{i},\mathit{g}\right) are sufficient statistics related to output probability distribution:
These equations show that, apart from the computation of sufficient statistics, an EM iteration in HMEM is equivalent to training three maximum entropy models: one for the state duration distribution, one for the state output distribution of each subspace, and one for the subspace probability.
Having introduced the HMEM parameter estimation procedure, we can now explain the overall structure of the system. Figure 4 shows the architecture of the HMEM-based speech synthesis system. Like other statistical parametric approaches, it consists of two phases, training and synthesis. In the training phase, we first extract a parametric representation of the speech signal (i.e., acoustic features), including both spectral and excitation features, from the training speech database. In parallel, contextual factors are obtained for all states of the database. Thereafter, both acoustic features and contextual factors are used for HMEM training. The training procedure iterates through three steps: computing sufficient statistics, training all maximum entropy distributions, and calculating occupation probabilities. However, the training procedure needs prior information about state occupation probabilities for the first iteration. This paper proposes to use a trained HMM for this purpose. Training continues until the increase in likelihood falls below a specified threshold. The synthesis phase is identical to that of a typical HSMM-based speech synthesis system. The only difference is that in HMEM, the state mean and covariance parameters are computed in accordance with Equation 20 instead of by traversing a binary decision tree.
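The first of the three training steps, computing sufficient statistics, can be sketched as follows. The example assumes scalar observations and a single Gaussian per state, and it accumulates unnormalized occupancy-weighted sums; the exact definitions and normalization of the tilde quantities in the paper may differ, and all function names are hypothetical. The point of the sketch is that a weighted log likelihood evaluated from three numbers per state equals the one evaluated from all frames, which is why the cost of each optimizer iteration no longer grows with the number of observation vectors.

```python
import math


def accumulate_stats(observations, occupancies, num_states):
    """occupancies[t][i] plays the role of the occupation probability of state i at frame t."""
    stats = [{"occ": 0.0, "first": 0.0, "second": 0.0} for _ in range(num_states)]
    for o_t, gamma_t in zip(observations, occupancies):
        for i in range(num_states):
            stats[i]["occ"] += gamma_t[i]
            stats[i]["first"] += gamma_t[i] * o_t
            stats[i]["second"] += gamma_t[i] * o_t * o_t
    return stats


def loglik_from_stats(stat, mean, var):
    """Occupancy-weighted Gaussian log likelihood using only the three statistics."""
    quad = stat["second"] - 2.0 * mean * stat["first"] + stat["occ"] * mean * mean
    return -0.5 * (stat["occ"] * math.log(2.0 * math.pi * var) + quad / var)


def loglik_from_frames(observations, occupancies, i, mean, var):
    """Same quantity computed frame by frame, for comparison."""
    total = 0.0
    for o_t, gamma_t in zip(observations, occupancies):
        frame_ll = -0.5 * (math.log(2.0 * math.pi * var) + (o_t - mean) ** 2 / var)
        total += gamma_t[i] * frame_ll
    return total


observations = [0.2, 0.5, 1.1, 1.4]
occupancies = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.0, 1.0]]
stats = accumulate_stats(observations, occupancies, 2)
```

For every candidate mean and variance, `loglik_from_stats(stats[i], mean, var)` agrees with `loglik_from_frames(observations, occupancies, i, mean, var)` up to rounding.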
3.3 Decision tree-based context clustering
Statistical parametric speech synthesis systems typically exploit around 50 different types of contextual factors [23]. For such a system, it is impossible to prepare training data covering all context-dependent models, and there are a large number of unseen models that have to be predicted in the synthesis phase. Therefore, a context clustering approach such as decision tree-based clustering has to be used to decide about unseen contexts [31, 45]. Due to the critical importance of context clustering algorithms in HMM-based speech synthesis systems, this section focuses on designing a clustering algorithm for HMEM.
To implement the proposed architecture, we first need to define two sets of contextual regions. These regions are represented by two sets, namely {\left\{\phantom{\rule{0.12em}{0ex}}{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}} and {\left\{{\mathit{g}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{g}}}. First- and second-order moment constraints have to be satisfied for all regions in {\left\{\phantom{\rule{0.12em}{0ex}}{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}} and {\left\{{\mathit{g}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{g}}}, respectively. Before training, the first empirical moments of all regions in {\left\{\phantom{\rule{0.12em}{0ex}}{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}} and the second empirical moments of all regions in {\left\{{\mathit{g}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{g}}} are computed using training data. Then, HMEM is trained to be consistent with these empirical moments. The major difficulty in defining these regions is to find a satisfactory balance between model complexity and the availability of training data. For limited training databases, a model with a small number of parameters, i.e., a small number of regions, has to be defined. In this case, bigger (strongly overlapped) contextual regions seem to be more desirable, because they can alleviate the problem of weak context generalization. On the other hand, for large training databases, a larger number of contextual regions has to be defined to avoid underfitting the training data. In this case, smaller contextual regions can be applied to capture the details of acoustic features.
This section introduces an algorithm that defines multiple contextual regions for first- and second-order moments by taking the HMEM structure into account.
Due to the complex relationship between acoustic features and contextual factors, it is extremely difficult to find the optimum sets of contextual regions that maximize likelihood for HMEM. For the sake of simplicity, we have made some simplifying assumptions to find a number of suboptimum contextual regions. These assumptions are expressed as follows:

We have used conventional binary decision tree structures to define {\left\{\phantom{\rule{0.12em}{0ex}}{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}} and {\left\{{\mathit{g}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{g}}}. This is a common approach in many former papers [19, 20, 23]. It should be noted that the decision tree structure is not the only possible structure to express the relationship between acoustic features and contextual factors. For example, other approaches such as neural networks or soft-clustering methods can be applied as well. However, in this paper, we limit our discussion to the conventional binary decision tree structure.

Multiple decision trees are trained for {\left\{\phantom{\rule{0.12em}{0ex}}{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}}, and just one decision tree is constructed for {\left\{{\mathit{g}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{g}}}. In this way, the final HMEM preserves the first empirical moments of multiple decision trees and the second moments of just one decision tree. This assumption is motivated by the observation that first-order moments seem to be more important than second-order moments [32, 47].

The preceding discussion shows that the ML estimates of the parameters defined for {\left\{\phantom{\rule{0.12em}{0ex}}{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}} and {\left\{{\mathit{g}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{g}}} depend significantly on each other. Therefore, in each step of decision tree construction, a BFGS optimization has to be executed to re-estimate both sets of parameters simultaneously, which leads to a prohibitive computational cost. To alleviate this problem, it is proposed to borrow {\left\{{\mathit{g}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{g}}} from a baseline system (a conventional HMM-based speech synthesis system) and construct {\left\{\phantom{\rule{0.12em}{0ex}}{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}} independently.

In the HMEM structure, {\left\{\phantom{\rule{0.12em}{0ex}}{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}} is responsible for providing a satisfactory clustering of first-order moments (mean vectors). Similarly, contextual additive structures [19, 20, 37] that tie all covariance matrices offer multiple overlapped clusterings of mean vectors based on the likelihood criterion; therefore, an appropriate method is to borrow {\left\{\phantom{\rule{0.12em}{0ex}}{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}} from the contextual additive structure.

However, training a contextual additive structure using the algorithms proposed in [19, 20] is still computationally expensive for large training databases (more than 500 sentences). Three modifications are applied to the algorithm proposed by Takaki et al. [19] to reduce the computational complexity: (i) The number of decision trees is fixed (in our experiments, an additive structure with four decision trees is built). (ii) Questions are selected one by one for different decision trees. Therefore, all trees are grown simultaneously, and the size of all trees would be equal. (iii) In the process of selecting the best pair of question and leaf, it is assumed that just the parameters of the candidate leaf will be changed and all other parameters remain unchanged. It should be noted that the selection procedure is repeated until the total number of free parameters reaches the number of parameters trained for the baseline system (HSMM-based speech synthesis system).
In sum, the final algorithm for determining {\left\{\phantom{\rule{0.12em}{0ex}}{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}} and {\left\{{\mathit{g}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{g}}} is as follows. {\left\{{\mathit{g}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{g}}} is simply borrowed from a conventional HMM-based speech synthesis system. {\left\{\phantom{\rule{0.12em}{0ex}}{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}} is obtained from an independent context clustering algorithm that is a fast and simplified version of the contextual additive structure [19]. This clustering algorithm builds four binary context-dependent decision trees simultaneously. It should be noted that the clustering algorithm terminates when the number of clusters reaches the number of leaves of the decision tree trained for an HSMM-based system.
The following algorithm shows the overall procedure of the proposed context clustering.
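As a rough illustration of the three modifications listed above (a fixed number of trees, questions selected tree by tree in turn, and only the candidate leaf updated when a split is scored), the following sketch grows several trees in round-robin order. It is not the paper's algorithm listing: it assumes scalar observations with a tied unit variance, so that the likelihood gain reduces to a reduction in squared residual, and all names, the data layout, and the stopping budget expressed in leaves are hypothetical.

```python
def grow_trees_round_robin(samples, questions, num_trees, leaf_budget):
    """samples: list of (context, observation); questions: list of context -> bool."""
    n = len(samples)
    global_mean = sum(x for _, x in samples) / n
    # Every tree starts as one root leaf covering all samples.
    trees = [[{"members": list(range(n)), "param": 0.0}] for _ in range(num_trees)]
    trees[0][0]["param"] = global_mean
    pred = [global_mean] * n          # additive prediction: sum over trees
    total_leaves, stalled, t = num_trees, 0, 0
    while total_leaves < leaf_budget and stalled < num_trees:
        best = None
        for li, leaf in enumerate(trees[t]):
            members = leaf["members"]
            for question in questions:
                yes = [i for i in members if question(samples[i][0])]
                no = [i for i in members if not question(samples[i][0])]
                if not yes or not no:
                    continue
                # Mean residuals: only this leaf's parameters are re-estimated.
                m_yes = sum(samples[i][1] - pred[i] for i in yes) / len(yes)
                m_no = sum(samples[i][1] - pred[i] for i in no) / len(no)
                m_all = (len(yes) * m_yes + len(no) * m_no) / len(members)
                gain = (len(yes) * m_yes ** 2 + len(no) * m_no ** 2
                        - len(members) * m_all ** 2)
                if best is None or gain > best[0]:
                    best = (gain, li, yes, no, m_yes, m_no)
        if best is None or best[0] <= 1e-12:
            stalled += 1              # this tree has no useful split left
        else:
            stalled = 0
            _, li, yes, no, m_yes, m_no = best
            parent = trees[t].pop(li)
            trees[t].append({"members": yes, "param": parent["param"] + m_yes})
            trees[t].append({"members": no, "param": parent["param"] + m_no})
            for i in yes:
                pred[i] += m_yes
            for i in no:
                pred[i] += m_no
            total_leaves += 1
        t = (t + 1) % num_trees       # move on to the next tree
    return trees, pred


samples = [({"vowel": False, "stressed": False}, 1.0),
           ({"vowel": True, "stressed": False}, 3.0),
           ({"vowel": False, "stressed": True}, 4.0),
           ({"vowel": True, "stressed": True}, 6.0)]
questions = [lambda c: c["vowel"], lambda c: c["stressed"]]
trees, pred = grow_trees_round_robin(samples, questions, num_trees=2, leaf_budget=4)
```

In this toy case, each of the two trees ends up asking one question, and the sum of the two active leaf parameters reproduces every observation, which a single two-leaf tree could not do.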
4 Experiments
We have conducted two sets of experiments. First, the performance of HMEM with heuristic context clusters is examined; second, the impact of the proposed method for decision tree-based context clustering presented in Section 3.3 is evaluated.
4.1 Performance evaluation of HMEM with heuristic context clusters
This subsection compares HMEM-based acoustic modeling with the conventional HSMM-based method. Here, the contextual regions of HMEM are defined heuristically and are kept fixed for different sizes of the training database.
4.1.1 Experimental conditions
A Persian speech database [49] consisting of 1,000 utterances from a male speaker was used throughout our experiments. Sentences were between 5 and 20 words long and had an average duration of 8 s. This database was specifically designed for the purpose of speech synthesis. Sentences in the database covered the most frequent Persian words, all bi-letter combinations, all bi-phoneme combinations, and the most frequent Persian syllables. In the modeling of the synthesis units, 31 phonemes were used, including silence. As presented in Section 4.1.2, a large variety of phonetic and linguistic contextual factors was considered in this work.
Speech signals were sampled at a rate of 16 kHz and windowed by a 25-ms Blackman window with a 5-ms shift. 40 mel-cepstral coefficients, 5 band-pass aperiodicity parameters, and fundamental frequency, together with their delta and delta-delta coefficients, extracted by STRAIGHT [11], were employed as our acoustic features. In this experiment, the number of states was 5, and a multi-stream left-to-right MSD-HSMM with no skip paths was trained as the traditional HSMM system. Decision trees were built using the maximum likelihood criterion, and the size of the decision trees was determined by the MDL principle [46]. Additionally, the global variance (GV)-based parameter generation algorithm [20, 26] and the STRAIGHT vocoder were applied in the synthesis phase.
Both subjective and objective tests were carried out to compare HMEM that uses some heuristic contextual regions with the traditional HSMM system. In our experiments, two different synthesis systems named HMEM1 and HMEM2 were developed based on the proposed approach. HMEM1 employs a small number of general highly overlapped contextual factors that are designed carefully for each stream, while HMEM2 uses a larger number of contextual factors.
More precisely, a set of 64 initial contextual factors was extracted for each segment (phoneme) of the Persian database. These factors contain both segmental and suprasegmental contextual features. From these contextual factors, a set of approximately 8,000 contextual questions was designed, and the HSMM system was trained using these questions. Each question can form two regions; therefore, these 8,000 questions can be converted to 16,000 regions. For each stream of HMEM1, a small number of these contextual regions that seem to be more important for that stream were selected, and HMEM1 was trained using them. Contextual factors of HMEM2 contain all contextual factors of HMEM1 in addition to a number of detailed ones. The number of contextual regions in HMEM2 is twice the number of regions in HMEM1. Regions of both HMEM1 and HMEM2 were selected based on linguistic knowledge of the Persian language. Table 1 shows the number of contextual regions for different synthesis systems (namely HSMM with different training data sizes, HMEM1, and HMEM2).
Experiments were conducted on five different training sets with 50, 100, 200, 400, and 800 utterances. Additionally, a fixed set of 200 utterances, not included in the training sets, was used for testing.
4.1.2 Employed contextual factors
In our experiments, contextual factors contained phonetic, syllable, word, phrase, and sentence level features. In each of these levels, both general and detailed features were considered. Features such as phoneme identity, syllable stress pattern, or word part-of-speech tag are examples of general features, and a feature such as the position of the current phoneme is an example of a detailed one. Specific information with regard to contextual features is presented in this subsection.
Contextual factors play a significant role in the proposed HMEM method. As a consequence, they have been designed carefully and are now briefly presented:

➢ Phonetic-level features

Phoneme identity before the preceding phoneme; preceding, current, and succeeding phonemes; and phoneme identity after the next phoneme

Position of the current phoneme in the current syllable (forward and backward)

Whether this phoneme is ‘Ezafe’ [50] or not (Ezafe is a special feature in Persian pronounced as a short vowel ‘e’ and relates two different words together. Ezafe is not written but is pronounced and has a profound effect on intonation)


➢ Syllable-level features

Stress level of this syllable (five different stress levels are defined for our speech database)

Position of the current syllable in the current word and phrase (forward and backward)

Type of the current syllable (syllables in Persian language are structured as CV, CVC, or CVCC, where C and V denote consonants and vowels, respectively)

Number of the stressed syllables before and after the current syllable in the current phrase

Number of syllables from the previous stressed syllable to the current syllable

Vowel identity of the current syllable


➢ Word-level features

Part-of-speech (POS) tag of the preceding, current, and succeeding word

Position of the current word in the current sentence (forward and backward)

Whether the current word contains ‘Ezafe’ or not

Whether this word is the last word in the sentence or not


➢ Phrase-level features

Number of syllables in the preceding, current, and succeeding phrase

Position of the current phrase in the sentence (forward and backward)


➢ Sentence-level features

Number of syllables, words, and phrases in the current sentence

Type of the current sentence

4.1.3 Illustrative example
Before going further with the objective and subjective evaluations, the superiority of HMEM over HSMM when few training data are available can already be illustrated. Although the improvement will be shown in Sections 4.1.4 and 4.1.5 to be achieved for all speech characteristics (log F0, duration, and spectral features), it is emphasized here for the prediction of log F0 trajectories. Figure 5 shows the trajectory of log F0 generated by HSMM and HMEM1 with 100 training utterances, compared with the natural contour. This plot illustrates the superiority of HMEM over HSMM in modeling fundamental frequency when the amount of training data is small, as the contour generated by HMEM is far closer to the natural one than that generated by HSMM.
In limited training sets, HSMM produces sudden transitions between adjacent states. This drawback is the result of decision tree-clustered context-dependent modeling. More specifically, when few data are available for training, the number of leaves in the decision tree is reduced. As a result, the distance between the mean vectors of adjacent states can be large. Even the parameter generation algorithms proposed in [26–28] cannot compensate for such jumps. In such cases, the quality of synthetic speech with HSMM is expected to deteriorate.
In contrast, if we let adjacent states contain common active contextual factors, then the variation of mean vectors in state transitions will be smoother. This is the key idea of HMEM, which makes it possible to outperform HSMM when the data are limited. However, the use of overlapped contextual factors in HMEM will result in an over-smoothing problem when the size of the training data is increased. Therefore, the detailed contextual factors are additionally considered in HMEM2 to alleviate the over-smoothing issue.
4.1.4 Objective evaluation
The average mel-cepstral distortion (MCD) [51] and the root-mean-square (RMS) error of phoneme durations (expressed in number of frames) were selected as the metrics for our objective assessment. For the calculation of both the average mel-cepstral distance and the RMS error of phoneme durations, the state boundaries (state durations) were determined using Viterbi alignment with the speaker's real utterance.
The MCD measure is defined by:
where mc_{ i } is the i th mel-cepstral coefficient in a frame, mc^{t} is the target coefficient we are comparing against, and mc^{p} is the generated coefficient. In addition, the RMS error is defined as:
where N is the total number of states in a sentence, d_{ s } is the duration of the s th state, {\mathit{d}}_{\mathit{s}}^{\mathit{t}} is the original duration, and {\mathit{d}}_{\mathit{s}}^{\mathit{p}} is the estimated duration.
Figure 6 shows the average mel-cepstral distance between spectra generated by the proposed method and spectra obtained by analyzing the speaker's real utterance. For comparison, we also present the average distance between spectra generated by the HSMM-based method and the real spectra. In this figure, it is clearly observed that the proposed HMEM systems outperform the standard HSMM approach for limited training datasets. Nonetheless, this advantage disappears when more than 200 utterances are available for training. It can be noticed that a reduction of the size of the training set has a dramatic impact on the performance of HSMM, contrary to HMEM-based systems.
The same conclusions can be drawn from Figure 7, in which the durations generated by the proposed systems are compared against those of HSMM. It can again be noticed that the proposed systems outperform HSMM in small databases. However, when the size of the database increases, HSMM gradually surpasses the proposed HMEM systems. Furthermore, the detailed features added in HMEM2 affect the proposed method constructively when the synthesis units are modeled using large databases. Thus, we expect that the proposed method could be comparable with HSMM or outperform it even for large databases if we apply more detailed and well-designed features.
In summary, from these figures and the illustrative example presented before, we can see that when the available data are limited, all features (log F0, duration, and spectra) of synthetic speech generated by HMEM are closer to the original features than those obtained with HSMM. However, when the training database is large, the HSMM-based method performs better than HMEM. Nevertheless, employing more detailed features can help the proposed method approach the quality of HSMM-based synthetic speech.
In addition to the above-mentioned objective measurements, we have compared the accuracy of voiced/unvoiced detection in the proposed system with its counterpart in HSMM-based synthesis. Table 2 shows the false negative (FN), false positive (FP), true negative (TN), and true positive (TP) rates. Moreover, the data in Table 2 are summarized in Table 3, in which the accuracy of detecting voiced/unvoiced regions is presented. As these tables show, the proposed method detects voiced/unvoiced regions more accurately than HSMM regardless of the size of the database. In other words, not only in small databases but also for larger ones, HMEM outperforms HSMM in terms of detecting voiced/unvoiced regions.
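The two objective measures defined earlier in this subsection can be sketched in code. The MCD function below uses the commonly quoted form, (10 / ln 10) times the square root of twice the summed squared coefficient differences, which is assumed here to match the definition in [51]; whether the zeroth coefficient is included is left to the caller. The duration error is the plain root-mean-square difference over the N states. Function names are hypothetical.

```python
import math


def frame_mcd(target, generated):
    """Mel-cepstral distortion (dB) between two coefficient vectors of one frame."""
    squared = sum((t - p) ** 2 for t, p in zip(target, generated))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)


def average_mcd(target_frames, generated_frames):
    """Average MCD over time-aligned frames."""
    values = [frame_mcd(t, p) for t, p in zip(target_frames, generated_frames)]
    return sum(values) / len(values)


def duration_rmse(original, estimated):
    """RMS error between original and estimated state durations (in frames)."""
    n = len(original)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(original, estimated)) / n)


# Example: two aligned frames with two coefficients each, and three state durations.
mcd = average_mcd([[1.0, 0.5], [0.8, 0.1]], [[1.0, 0.5], [0.8, 0.2]])
rmse = duration_rmse([5, 7, 4], [6, 7, 2])
```

Both functions assume that the target and generated sequences have already been time-aligned, as is the case when state boundaries are taken from a Viterbi alignment with the natural utterance.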
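The relation between the four rates and the summary accuracy can be made concrete with a short sketch. It assumes frame-level voiced/unvoiced decisions, with "voiced" treated as the positive class; the function names and the boolean encoding are hypothetical.

```python
def vuv_counts(reference, predicted):
    """Count TP, TN, FP, FN over frames; True means voiced."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for ref, hyp in zip(reference, predicted):
        if ref and hyp:
            counts["TP"] += 1
        elif not ref and not hyp:
            counts["TN"] += 1
        elif hyp:
            counts["FP"] += 1      # predicted voiced, actually unvoiced
        else:
            counts["FN"] += 1      # predicted unvoiced, actually voiced
    return counts


def vuv_accuracy(counts):
    """Fraction of frames whose voicing decision is correct."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total


reference = [True, True, False, False, True]
predicted = [True, False, False, True, True]
counts = vuv_counts(reference, predicted)
accuracy = vuv_accuracy(counts)
```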
4.1.5 Subjective evaluation
Two different subjective methods are employed in order to show the effectiveness of the proposed system and assess the effect of the size of the training database. A comparative mean opinion score (CMOS) test [52] with a 7-point scale, ranging from −3 (meaning that method A is much better than method B) to 3 (meaning the opposite), and a preference scoring [53] are used to evaluate the subjective quality of the synthesized speech. The results of this evaluation are shown in Figures 8 and 9, respectively.
Twenty native participants were asked to listen to ten randomly chosen pairs of synthesized speech samples generated by two different systems (selected arbitrarily among HMEM1, HMEM2, and HSMM).
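How such listening-test ratings are typically aggregated can be sketched as follows. The sketch assumes one integer rating per presented pair on the 7-point scale described above, where negative values favor method A, and it derives the preference percentages from the sign of each rating; the preference scoring procedure of [53] may differ in detail. Function names are hypothetical.

```python
def cmos(ratings):
    """Mean of pairwise ratings on the -3..3 scale (negative favors method A)."""
    return sum(ratings) / len(ratings)


def preference_scores(ratings):
    """Percentage of pairs preferring A, preferring B, or showing no preference."""
    n = len(ratings)
    prefer_a = sum(1 for r in ratings if r < 0)
    prefer_b = sum(1 for r in ratings if r > 0)
    return {
        "A": 100.0 * prefer_a / n,
        "B": 100.0 * prefer_b / n,
        "none": 100.0 * (n - prefer_a - prefer_b) / n,
    }


ratings = [-3, -1, 0, 2]
score = cmos(ratings)
prefs = preference_scores(ratings)
```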
Remarkably, the proposed systems prove particularly advantageous when the training data are limited (i.e., for 50, 100, and 200 utterances), which is in line with the conclusions of the objective assessments. The superiority of HMEM1 over HSMM and HMEM2 is clear in the training sets containing 50 and 100 utterances. In other words, general contextual factors lead the proposed system to a better performance when the amount of training data is very small. Gradually, as the number of utterances in the training set increases, detailed features help the proposed system produce better synthetic speech. Therefore, HMEM2 surpasses HMEM1 for training sets with 200 and more utterances. However, for relatively large training sets (400 and 800), the use of HSMM is recommended.
Table 1 compares the number of leaf nodes in different speech synthesis systems. It can be seen from the table that, to model the mgc stream, HMEM2 exploits more parameters than HSMM400 and HSMM800, but the objective evaluations presented in Figure 6 show that HSMM400 and HSMM800 result in better mel-cepstral distances. This shows that HMEM with heuristic contextual clusters cannot exploit model parameters efficiently. In fact, a great number of contextual regions in HMEM1 and HMEM2 are redundant; therefore, their corresponding parameters are not useful. The next section evaluates the performance of HMEM with the suboptimum context clustering algorithm proposed in Section 3.3. This clustering algorithm selects appropriate contextual regions and consequently addresses the aforementioned problem.
4.2 Performance evaluation of HMEM with decision tree-based context clustering
This section is dedicated to the second set of experiments, conducted to evaluate the performance of HMEM with the decision tree construction algorithm proposed in Section 3.3. As the first set of experiments shows, HMEM with heuristic and naïve contextual regions cannot outperform HSMM in large training databases. This section shows that, by employing appropriate sets of {\left\{{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathrm{l}=1}^{{\mathit{L}}_{\mathit{f}}} and {\left\{{\mathit{g}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathrm{l}=1}^{{\mathit{L}}_{\mathit{g}}}, HMEM outperforms HSMM even for large databases.
4.2.1 Experimental conditions
Experiments were carried out on Nick [54], a British male speech database collected at Edinburgh University. This database consists of 2,500 utterances from a male speaker. We considered five sets including 50, 100, 200, 400, and 800 utterances for training, and 200 sentences that were not included in the training sets were used as test data. Each sentence in the database contains about 5 s of speech. Speech signals are sampled at 48 kHz and windowed by a 25-ms Blackman window with a 5-ms shift. This database was specifically designed for the purpose of speech synthesis research, and utterances in the database covered the most frequent English words. Also, different segmental and suprasegmental contextual factors were extracted for this database.
The speech analysis conditions and model topologies of CSTR/EMIME HTS 2010 [54] were used in this experiment. Bark cepstrum was extracted from smooth STRAIGHT trajectories [11]. Also, instead of log F0 and five frequency subbands (0 to 1, 1 to 2, 2 to 4, 4 to 6, and 6 to 8 kHz), pitch in mel and auditory-scale motivated frequency bands for the aperiodicity measure were applied [54]. The analysis process resulted in 40 bark cepstrum coefficients, 1 pitch value in mel, and 25 auditory-scale motivated frequency band aperiodicity parameters for each frame of the training speech signals. These parameters, together with their delta and delta-delta parameters, were used as the observation vectors of the statistical parametric model.
A five-state multi-stream left-to-right MSD-HSMM with no skip paths was trained as the baseline system. The conventional maximum likelihood-based decision tree clustering algorithm was used to tie HMM states, and the MDL criterion was used to determine the size of the decision trees.
In order to have a fair comparison, the proposed system (HMEM with decision tree structure) was trained with the same number of free model parameters as the baseline system. HMEM was trained based on the decision tree construction algorithm presented in Section 3.3 and the parameter re-estimation algorithm proposed in Section 3.2. It should be noted that four decision trees were built for {\left\{{\mathit{f}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{f}}} and one decision tree for {\left\{{\mathit{g}}_{\mathit{l}}\left(\mathit{c}\right)\right\}}_{\mathit{l}=1}^{{\mathit{L}}_{\mathit{g}}}. After training the acoustic models, in the synthesis phase, the GV-based parameter generation algorithm [20, 26] and the STRAIGHT synthesis module generated the synthesized speech signals. Both subjective and objective tests were conducted to compare HMEM that uses decision tree-based clusters with traditional HSMM-based synthesis.
It is useful to mention that training the proposed HMEM structure with decision tree-based context clustering took approximately 5 days for 800 training sentences, while training the corresponding HSMM-based synthesis system took approximately 16 h.
4.2.2 Employed contextual factors
In this experiment, employed contextual factors contained phonetic, syllable, word, phrase, and sentence level factors. In each of these levels, all important features were considered. Specific information about these features is presented in this subsection.

➢ Phonetic-level features

Phoneme identity before the preceding phoneme; preceding, current, and succeeding phonemes; and phoneme identity after the next phoneme

Position of the current phoneme in the current syllable, word, phrase, and sentence


➢ Syllable-level features

Stress level of previous, current, and next syllable (three different stress levels are defined for this database)

Position of the current syllable in the current word, phrase, and sentence

Number of the phonemes of the previous, current, and next syllable

Whether the previous, current, and next syllable is accented or not

Number of the stressed syllables before and after the current syllable in the current phrase

Number of syllables from the previous stressed syllable to the current syllable

Number of syllables from the previous accented syllable to the current syllable


➢ Word-level features

Part-of-speech (POS) tag of the preceding, current, and succeeding word

Position of the current word in the current phrase and sentence (forward and backward)

Number of syllables of the previous, current, and next word

Number of content words before and after current word in the current phrase

Number of words from previous and next content word


➢ Phrase-level features

Number of syllables and words of the preceding, current, and succeeding phrase

Position of the current phrase in the sentence

Current phrase ToBI end tone


➢ Sentence-level features

Number of phonemes, syllables, words, and phrases in the current utterance

Type of the current sentence
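
As an illustration of how such contextual factors are typically consumed by decision tree-based clustering, the following minimal sketch stores a small subset of the listed features in a record and answers binary (yes/no) questions about it. All field names, question names, and the `VOWELS` set are hypothetical and chosen only for illustration; they are not the label format or the question set used in this work.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical vowel inventory, for illustration only.
VOWELS = frozenset({"a", "e", "i", "o", "u"})


@dataclass(frozen=True)
class Context:
    """A small, illustrative subset of the contextual factors listed above."""
    # Phonetic level: two preceding, current, and two succeeding phonemes
    prev_prev_phone: str
    prev_phone: str
    phone: str
    next_phone: str
    next_next_phone: str
    # Syllable level
    stress: int        # one of three stress levels: 0, 1, or 2
    accented: bool
    # Word level
    pos_tag: str
    # Phrase level
    tobi_end_tone: str
    # Sentence level
    sentence_type: str


# Each question maps a context to True/False; a decision tree splits its
# nodes on the answers to questions of this kind.
QUESTIONS: Dict[str, Callable[[Context], bool]] = {
    "C-Vowel": lambda c: c.phone in VOWELS,
    "L-Vowel": lambda c: c.prev_phone in VOWELS,
    "Syl-Stressed": lambda c: c.stress > 0,
    "Syl-Accented": lambda c: c.accented,
    "Word-IsNoun": lambda c: c.pos_tag == "N",
    "Sent-IsQuestion": lambda c: c.sentence_type == "question",
}


def answer_questions(ctx: Context) -> Dict[str, bool]:
    """Evaluate every question on one context."""
    return {name: question(ctx) for name, question in QUESTIONS.items()}


example = Context(
    prev_prev_phone="s", prev_phone="a", phone="l", next_phone="a",
    next_next_phone="m", stress=1, accented=False, pos_tag="N",
    tobi_end_tone="L-L%", sentence_type="statement",
)
answers = answer_questions(example)
```

In a real system, the full feature set above would be encoded for every phoneme, and the question set would be considerably larger.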

4.2.3 Objective evaluation
Two well-known measures were used for objective evaluation of the proposed decision tree-based HMEM against the conventional HSMM. The first measure is the RMS error of the generated log F0 trajectories, and the second compares synthesized spectra using the average MCD criterion. The results are shown in Figures 10 and 11. Figure 10 shows the RMS error of log F0, in cents, for different sizes of training data. The log F0 trajectories generated by the proposed approach are closer to the natural log F0 trajectories, so HMEM improves log F0 modeling. However, this improvement shrinks slightly as the database grows. This suggests that, for log F0 modeling, the overlapped regions benefit small databases relatively more than large ones. Figure 11 shows the result of the average MCD test, which also confirms that HMEM outperforms the conventional HSMM for all training database sizes. According to the figure, the improvement in average MCD is approximately constant across database sizes.
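
For readers who wish to reproduce measures of this kind, the sketch below shows one common way to compute them. It assumes that F0 is given in Hz with 0 marking unvoiced frames, that the F0 error is computed only over frames voiced in both trajectories, and that MCD takes the widely used form $(10/\ln 10)\sqrt{2\sum_d (c_d - \hat{c}_d)^2}$ with the zeroth (energy) coefficient excluded. These conventions and the function names are assumptions made for illustration; the exact definitions used in the experiments may differ.

```python
import math


def f0_rmse_cents(f0_ref, f0_syn):
    """RMS error between two F0 trajectories (Hz), expressed in cents.

    Only frames voiced (F0 > 0) in both trajectories are compared.
    """
    if len(f0_ref) != len(f0_syn):
        raise ValueError("trajectories must have the same number of frames")
    squared = [
        (1200.0 * math.log2(syn / ref)) ** 2
        for ref, syn in zip(f0_ref, f0_syn)
        if ref > 0 and syn > 0
    ]
    if not squared:
        raise ValueError("no frame is voiced in both trajectories")
    return math.sqrt(sum(squared) / len(squared))


def average_mcd(mcep_ref, mcep_syn):
    """Average mel-cepstral distortion (dB) over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients; index 0 is
    skipped because it mainly reflects energy.
    """
    if len(mcep_ref) != len(mcep_syn) or not mcep_ref:
        raise ValueError("need the same, non-zero number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(mcep_ref, mcep_syn):
        dist2 = sum((a - b) ** 2 for a, b in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * dist2)
    return total / len(mcep_ref)


# Toy data: one octave error (1200 cents) in one of two voiced frames.
rmse = f0_rmse_cents([100.0, 0.0, 200.0], [200.0, 150.0, 200.0])
mcd = average_mcd([[0.0, 1.0]], [[5.0, 0.0]])
```

Both functions assume that the natural and synthesized sequences are already time-aligned frame by frame.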
4.2.4 Subjective evaluation
We conducted paired comparison tests and report the comparative mean opinion score (CMOS) and the preference score as subjective evaluations. Fifteen non-professional native listeners were presented with 30 randomly chosen pairs of synthesized utterances generated by HMEM and HSMM. For each pair, listeners selected the utterance that sounded better and rated how much better it was (much better, better, slightly better, or about the same). The results are shown in Figures 12 and 13.
Both the CMOS test and the preference score confirm the superiority of the proposed method over HSMM for all database sizes. Thus, when context clusters are determined through an effective approach, the proposed HMEM outperforms HSMM.
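
The two subjective scores can be derived from the same listener responses. The sketch below assumes that each response is mapped to an integer between -3 and +3, where +3 means that the first system (e.g., HMEM) is much better, 0 means about the same, and -3 means that the second system (e.g., HSMM) is much better. This numeric mapping and the function names are assumptions for illustration; the ratings in the example are made up and are not experimental results.

```python
def cmos(ratings):
    """Comparative mean opinion score: the mean of ratings in [-3, 3]."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < -3 or r > 3 for r in ratings):
        raise ValueError("ratings must lie in [-3, 3]")
    return sum(ratings) / len(ratings)


def preference_scores(ratings):
    """Percentage of pairs preferring system A, system B, or neither."""
    if not ratings:
        raise ValueError("at least one rating is required")
    n = len(ratings)
    prefer_a = sum(1 for r in ratings if r > 0)
    prefer_b = sum(1 for r in ratings if r < 0)
    no_pref = n - prefer_a - prefer_b
    return {
        "A": 100.0 * prefer_a / n,
        "B": 100.0 * prefer_b / n,
        "same": 100.0 * no_pref / n,
    }


# Made-up ratings for eight pairs (positive values favour system A).
toy_ratings = [3, 1, 0, -1, 2, 0, 1, -2]
toy_cmos = cmos(toy_ratings)
toy_pref = preference_scores(toy_ratings)
```

A positive CMOS indicates an overall preference for the first system, while the preference score ignores the strength of each judgment and counts only its direction.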
5. Conclusions
This paper addressed the main shortcoming of HSMM in context-dependent acoustic modeling, namely inadequate context generalization. HSMM uses decision tree-based context clustering, which does not generalize efficiently because each acoustic feature vector contributes to modeling only one context cluster. To alleviate this problem, this paper proposed HMEM, a new acoustic modeling technique based on the maximum entropy modeling approach. HMEM improves on HSMM by enabling its structure to take advantage of overlapped contextual factors and can therefore provide superior context generalization. Experimental results using objective and subjective criteria showed that the proposed system outperforms HSMM.
Despite the advantages that enabled our system to outperform HSMM, its training procedure is computationally expensive, which is a drawback for large databases.
References
Zen H, Tokuda K, Black AW: Statistical parametric speech synthesis. Speech Comm. 2009, 51(11):1039-1064. 10.1016/j.specom.2009.04.004
Black AW, Zen H, Tokuda K: Statistical parametric speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4. Honolulu, Hawaii, USA; 2007:IV-1229-IV-1232.
Yamagishi J, Kobayashi T: Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. IEICE Trans. Info. Syst. 2007, 90(2):533-543.
Yamagishi J, Nose T, Zen H, Ling ZH, Toda T, Tokuda K, King S, Renals S: Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing 2009, 17(6):1208-1230.
Yamagishi J, Kobayashi T, Nakano Y, Ogata K, Isogai J: Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Transactions on Audio, Speech, and Language Processing 2009, 17(1):66-83.
Wu YJ, Nankaku Y, Tokuda K: State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis. In INTERSPEECH. Brighton, UK; 2009:528-531.
Liang H, Dines J, Saheer L: A comparison of supervised and unsupervised cross-lingual speaker adaptation approaches for HMM-based speech synthesis. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). Dallas, Texas, USA; 2010:4598-4601.
Gibson M, Hirsimaki T, Karhila R, Kurimo M, Byrne W: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). Dallas, Texas, USA; 2010:4642-4645.
Yamagishi J, Ling Z, King S: Robustness of HMM-based speech synthesis. In INTERSPEECH. Brisbane, Australia; 2008:581-584.
Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T: Mixed excitation for HMM-based speech synthesis. In INTERSPEECH. Aalborg, Denmark; 2001:2263-2266.
Kawahara H, Masuda-Katsuse I, de Cheveigné A: Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Comm. 1999, 27(3):187-207.
Drugman T, Wilfart G, Dutoit T: A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis. In INTERSPEECH. Brighton, United Kingdom; 2009:1779-1782.
Drugman T, Dutoit T: The deterministic plus stochastic model of the residual signal and its applications. IEEE Trans. Audio. Speech. Lang. Process. 2012, 20(3):968-981.
Zen H, Tokuda K, Masuko T, Kobayashi T, Kitamura T: A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Info. Syst. 2007, 90(5):825.
Hashimoto K, Nankaku Y, Tokuda K: A Bayesian approach to hidden semi-Markov model based speech synthesis. In Proceedings of INTERSPEECH. Brighton, United Kingdom; 2009:1751-1754.
Hashimoto K, Zen H, Nankaku Y, Masuko T, Tokuda K: A Bayesian approach to HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Taipei, Taiwan; 2009:4029-4032.
Tokuda K, Masuko T, Miyazaki N, Kobayashi T: Multi-space probability distribution HMM. IEICE Trans. Info. Syst. 2002, 85(3):455-464.
Zen H, Senior A, Schuster M: Statistical parametric speech synthesis using deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, British Columbia, Canada; 2013:7962-7966.
Takaki S, Nankaku Y, Tokuda K: Spectral modeling with contextual additive structure for HMM-based speech synthesis. In Proceedings of 7th ISCA Speech Synthesis Workshop. Kyoto, Japan; 2010:100-105.
Takaki S, Nankaku Y, Tokuda K: Contextual partial additive structure for HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, British Columbia, Canada; 2013:7878-7882.
Gales MJ: Cluster adaptive training of hidden Markov models. IEEE Trans. Speech. Audio. Process. 2000, 8(4):417-428. 10.1109/89.848223
Zen H, Gales MJ, Nankaku Y, Tokuda K: Product of experts for statistical parametric speech synthesis. IEEE Trans. Audio. Speech. Lang. Process. 2012, 20(3):794-805.
Yu K, Zen H, Mairesse F, Young S: Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis. Speech Comm. 2011, 53(6):914-923. 10.1016/j.specom.2011.03.003
Toda T, Young S: Trajectory training considering global variance for HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Taipei, Taiwan; 2009:4025-4028.
Qin L, Wu YJ, Ling ZH, Wang RH, Dai LR: Minimum generation error criterion considering global/local variance for HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Las Vegas, Nevada, USA; 2008:4621-4624.
Toda T, Tokuda K: Speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Info. Syst. 2007, E90-D(5):816-824. 10.1093/ietisy/e90-d.5.816
Tokuda K, Yoshimura T, Masuko T, Kobayashi T, Kitamura T: Speech parameter generation algorithms for HMM-based speech synthesis. In ICASSP, vol. 3. Istanbul; 2000:1315-1318.
Tokuda K, Kobayashi T, Imai S: Speech parameter generation from HMM using dynamic features. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. Detroit, Michigan, USA; 1995:660-663.
Comparing glottal-flow-excited statistical parametric speech synthesis methods. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, British Columbia, Canada; 2013:7830-7834.
Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T: Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. Proceedings of Eurospeech 1999, 2347-2350.
Young SJ, Odell JJ, Woodland PC: Tree-based state tying for high accuracy acoustic modeling. Proceedings of the Workshop on Human Language Technology, Association for Computational Linguistics 1994, 307-312.
Leggetter CJ, Woodland PC: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech. Lang. 1995, 9(2):
Digalakis VV, Neumeyer LG: Speaker adaptation using combined transformation and Bayesian methods. IEEE Trans. Speech. Audio. Process. 1996, 4(4):294-300. 10.1109/89.506933
Zen H, Braunschweiler N, Buchholz S, Gales MJ, Knill K, Krstulovic S, Latorre J: Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Audio. Speech. Lang. Process. 2012, 20(6):1713-1724.
Koriyama T, Nose T, Kobayashi T: Statistical parametric speech synthesis based on Gaussian process regression. IEEE Journal of Selected Topics in Signal Processing 2013, 1-11.
Yu K, Mairesse F, Young S: Word-level emphasis modeling in HMM-based speech synthesis. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). Dallas, Texas, USA; 2010:4238-4241.
Zen H, Braunschweiler N: Context-dependent additive log f_0 model for HMM-based speech synthesis. In INTERSPEECH. Brighton, United Kingdom; 2009:2091-2094.
Sakai S: Additive modeling of English f0 contour for speech synthesis. In Proceedings of ICASSP. Las Vegas, Nevada, USA; 2008:277-280.
Qian Y, Liang H, Soong FK: Generating natural F0 trajectory with additive trees. In INTERSPEECH. Brisbane, Australia; 2008:2126-2129.
Wu YJ, Soong F: Modeling pitch trajectory by hierarchical HMM with minimum generation error training. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan; 2012:4017-4020.
Berger AL, Pietra VJD, Pietra SAD: A maximum entropy approach to natural language processing. Computer Ling 1996, 22: 39-71. 10.1016/0096-0551(96)00005-7
Borthwick A: A maximum entropy approach to named entity recognition, PhD dissertation (New York University). 1999.
Rangarajan V, Narayanan S, Bangalore S: Exploiting acoustic and syntactic features for prosody labeling in a maximum entropy framework. In Proceedings of NAACL HLT. 2007, 1-8.
Ratnaparkhi A: A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1996, 1: 133-142.
Odell JJ: The use of context in large vocabulary speech recognition, PhD dissertation (Cambridge University). 1995.
Shinoda K, Takao W: MDL-based context-dependent subword modeling for speech recognition. J. Acoust. Soc. Jpn 2000, 21(2):79-86. 10.1250/ast.21.79
Oura K, Zen H, Nankaku Y, Lee A, Tokuda K: A covariance-tying technique for HMM-based speech synthesis. J. IEICE 2010, E93-D(3):595-601.
Nocedal J, Wright SJ: Numerical Optimization. Springer, USA; 1999.
Bijankhan M, Sheikhzadegan J, Roohani MR, Samareh Y, Lucas C, Tebiani M: The speech database of Farsi spoken language. Proceedings of 5th Australian International Conference on Speech Science and Technology (SST) 1994, 826-831.
Ghomeshi J: Non-projecting nouns and the ezafe construction in Persian. Nat. Lang. Ling. Theor. 1997, 15(4):729-788. 10.1023/A:1005886709040
Kubichek R: Mel-cepstral distance measure for objective speech quality assessment. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, vol. 1 1993, 125-128.
Picart B, Drugman T, Dutoit T: Continuous control of the degree of articulation in HMM-based speech synthesis. In 12th Annual Conference of the International Speech Communication Association (ISCA). INTERSPEECH, Florence, Italy; 2011:1797-1800.
Yamagishi J: Average-Voice-Based Speech Synthesis, PhD dissertation. Tokyo Institute of Technology, Yokohama; 2006.
Yamagishi J, Watts O: The CSTR/EMIME HTS system for Blizzard challenge. In Proceedings of Blizzard Challenge 2010. Kyoto, Japan; 2010:1-6.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Khorram, S., Sameti, H., Bahmaninezhad, F. et al. Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis. J AUDIO SPEECH MUSIC PROC. 2014, 12 (2014). https://doi.org/10.1186/1687-4722-2014-12
DOI: https://doi.org/10.1186/1687-4722-2014-12