Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis
 Soheil Khorram^{1},
 Hossein Sameti^{1},
 Fahimeh Bahmaninezhad^{1},
 Simon King^{2} and
 Thomas Drugman^{3}
https://doi.org/10.1186/1687-4722-2014-12
© Khorram et al.; licensee Springer. 2014
Received: 12 August 2013
Accepted: 3 March 2014
Published: 7 April 2014
Abstract
Decision tree-clustered context-dependent hidden semi-Markov models (HSMMs) are typically used in statistical parametric speech synthesis to represent probability densities of acoustic features given contextual factors. This paper addresses three major limitations of this decision tree-based structure: (i) the decision tree structure lacks adequate context generalization; (ii) it is unable to express complex context dependencies; and (iii) parameters generated from this structure exhibit sudden transitions between adjacent states. To alleviate these limitations, many earlier papers applied multiple decision trees with an additive assumption over those trees. The current study also uses multiple decision trees, but instead of the additive assumption, it proposes to train the smoothest distribution by maximizing an entropy measure. Increasing the smoothness of the distribution improves context generalization. The proposed model, named hidden maximum entropy model (HMEM), estimates a distribution that maximizes entropy subject to multiple moment-based constraints. Due to the simultaneous use of multiple decision trees and the maximum entropy measure, the three aforementioned issues are considerably alleviated. Relying on HMEM, a novel speech synthesis system has been developed with maximum likelihood (ML) parameter re-estimation as well as maximum output probability parameter generation. Additionally, an effective and fast algorithm that builds multiple decision trees in parallel is devised. Two sets of experiments have been conducted to evaluate the performance of the proposed system. In the first set of experiments, HMEM with some heuristic context clusters is implemented. This system outperformed the decision tree structure for small training databases (i.e., 50, 100, and 200 sentences). In the second set of experiments, the performance of HMEM with four parallel decision trees is investigated using both subjective and objective tests. All evaluation results of the second experiment confirm a significant improvement of the proposed system over the conventional HSMM.
Keywords
1 Introduction
Statistical parametric speech synthesis (SPSS) has dominated the speech synthesis research area over the last decade [1, 2]. This is mainly due to the advantages of SPSS over traditional concatenative speech synthesis approaches; these advantages include the flexibility to change voice characteristics [3–5], multilingual support [6–8], coverage of acoustic space [1], small footprint [1], and robustness [4, 9]. All of these advantages stem from the fact that SPSS provides a statistical model for acoustic features instead of using original speech waveforms. However, they are achieved at the expense of one major disadvantage, i.e., degradation in the quality of synthetic speech [1]. This shortcoming results from three important factors: vocoding distortion [10–13], accuracy of statistical models [14–25], and accuracy of parameter generation algorithms [26–28]. This paper attempts to address the second factor and improve the accuracy of statistical models. Most of the research carried out to improve acoustic modeling performance has aimed to develop systems that generate natural and high-quality speech using large training speech databases (more than 30 min) [18, 21, 22]. Nevertheless, there exist a great number of under-resourced languages (such as Persian) for which only a limited amount of data is available. To alleviate this shortcoming, we aim to develop a statistical approach that leads to an appropriate speech synthesis system not only with large but also with small training databases.
Every SPSS system consists of two distinct phases, namely training and synthesis [1, 2]. In the training phase, acoustic and contextual factors are first extracted for the whole training database using a vocoder [12, 29, 30] and a natural language preprocessor. Next, the relationship between acoustic and contextual factors is modeled using a context-dependent statistical approach [14–25]. The synthesis phase starts with a parameter generation algorithm [26–28] that exploits the trained context-dependent statistical models and aims to generate realistic acoustic feature trajectories for a given input text. Acoustic trajectories are then fed into the same vocoder used during the training phase in order to generate the desired synthesized speech.
In the most predominant statistical parametric approach, spectrum, excitation, and duration of speech are expressed concurrently in a unified framework of context-dependent multi-space probability distribution hidden semi-Markov model (HSMM) [14]. More specifically, a multi-space probability distribution [17] is estimated for each leaf node of decision trees [31]. These decision tree-based structures split the contextual space into a number of non-overlapped clusters which form multiple groups of context-dependent HMM states, and each group shares the same output probability distribution [31]. In order to capture acoustic variations accurately, the model has to be able to express a large number of robust distributions [19, 20]. Decision trees are not efficient for such expression because increasing the number of distributions by growing the tree reduces the population of each leaf and consequently reduces the robustness of the distributions. This problem stems from the fact that a decision tree assigns each HMM state to only one cluster (a small region in contextual space); therefore, each state contributes to modeling just one distribution. In other words, the decision tree structure makes the models match training data just in non-overlapped regions which are expressed through decision tree terminal nodes [31]. In the case of limited training data, the decision tree would be small, so it cannot split the contextual factor space sufficiently. In this case, the accordance between model and data is not sufficient, and therefore, the speech synthesis system generates unsatisfactory output. Accordingly, by extending the decision tree in such a way that each state affects multiple distributions (a larger portion of the contextual space), the generalization to unseen models can be improved. The main idea of this study is to extend the non-overlapped regions of one decision tree to the overlapped regions of multiple decision trees and hence exploit contextual factors more efficiently.
A large number of research works have already been performed to improve the quality of the basic decision tree-clustered HSMM. Some of them are based on model adaptation techniques. These methods exploit invaluable prior knowledge attained from an average voice model [3] and adapt this general model using an adaptation algorithm such as maximum likelihood linear regression (MLLR) [32], maximum a posteriori (MAP) [33], or cluster adaptive training (CAT) [21]. However, working with average voice models is difficult for under-resourced languages since building such a general model requires considerable effort to design, record, and transcribe a thorough multi-speaker speech database [3]. To alleviate the data sparsity problem in under-resourced languages, the speaker and language factorization (SLF) technique can be used [34]. SLF attempts to factorize speaker-specific and language-specific characteristics in training data and then model them using different transforms. By representing the speaker attributes by one transform and language characteristics by a different transform, the speech synthesis system will be able to alter language and speaker separately. In this framework, it is possible to exploit the data from different languages to predict speaker-specific characteristics of the target speaker, and consequently, the data sparsity problem will be alleviated. The authors of [15, 16] also developed a new technique by replacing the maximum likelihood (ML) point estimate of HSMM with a variational Bayesian method. Their system was shown to outperform HSMM when the amount of training data is small. Other notable structures used to improve statistical modeling accuracy are deep neural networks (DNNs) [18]. The decision tree structure is not efficient enough to model complicated context dependencies such as XORs or multiplexers [18]. To model such complex contextual functions, the decision tree has to be excessively large, but DNNs are capable of modeling complex contextual factors by employing multiple hidden layers. Additionally, a great number of overlapped contextual factors can be fed into a DNN to approximate output acoustic features, so DNNs are able to provide efficient context generalization. Speech synthesis based on Gaussian process regression (GPR) [35] is another novel approach that has recently been proposed to overcome the limitations of HMM-based speech synthesis. The GPR model predicts frame-level acoustic trajectories from frame-level contextual factors. The frame-level contextual factors include the relative position of the current frame within the phone and some articulatory information. These frame-level contextual factors are employed as the explanatory variable in GPR. The frame-level modeling of GPR removes the inaccurate stationarity assumption of the state output distribution in HMM-based speech synthesis. Also, GPR can directly represent complex context dependencies without using parameter tying by decision tree clustering; therefore, it is capable of improving context generalization.
Acoustic modeling with a contextual additive structure has also been proposed to represent dependencies between contextual factors and acoustic features more precisely [19, 20, 23, 32, 36–40]. In this structure, acoustic trajectories are considered to be a sum of independent acoustic components which have different context dependencies (different decision trees have to be trained for those components). Since the mean vectors and covariance matrices of the distribution are equal to the sum of the mean vectors and covariance matrices of the additive components, the model is able to exploit contextual factors more efficiently. Furthermore, in this structure, each training data sample contributes to modeling multiple mean vectors and covariance matrices. Many papers applied the additive structure just for F0 modeling [37–40]. The authors of [37] proposed an additive structure with multiple decision trees for mean vectors and a single tree for variance terms. In that paper, different sets of contextual factors were used for different additive components, and multiple trees were built simultaneously. In [40], multiple additive decision trees are also employed, but the structure is trained using the minimum generation error (MGE) criterion. Sakai [38] defines an additive model with three distinct layers, namely intonational phrase, word-level, and pitch-accent layers. All of these components were trained simultaneously using a regularized least square error criterion. Qian et al. [39] propose to use multiple additive regression trees with a gradient-based tree-boosting algorithm. Decision trees are trained in successive stages to minimize the squared error. Takaki et al. [19, 20] applied the additive structure to spectral modeling and reported that the computational complexity of this structure is extremely high for full context labels as used in speech synthesis. To alleviate this issue, they proposed two approaches: covariance parameter tying and a likelihood calculation algorithm using the matrix inversion lemma [19]. Despite all its advantages, this additive structure may not match training data accurately because, once training is done, the first and second moments of the training data and the model may not be exactly the same in some regions.
Another important problem of conventional decision tree-clustered acoustic modeling is the difficulty in capturing the effect of weak contextual factors such as word-level emphasis [23, 36]. This is mainly because weak contexts have less influence on the likelihood measure [23]. One clear approach to address this issue is to construct the decision tree in two successive steps [36]. In the first step, all selections are made among weak contextual factors, and in the second step, the remaining questions are adopted [36]. This procedure can effectively exploit weak contextual factors, but it leads to a reduction in the amount of training data available for normal contextual factors. Context adaptive training with factorized decision trees [23] is another approach that can exploit weak context questions efficiently. In this system, a canonical model is trained using normal contextual factors and then a set of transforms is built by weak contextual factors. In fact, canonical models and transforms, respectively, represent the effects of normal and weak contextual factors [23]. This structure also improves the context generalization of conventional HMM-based synthesis by exploiting adaptation techniques.
This paper introduces maximum entropy model (MEM)-based speech synthesis. MEM [41] has been demonstrated to be effective in numerous applications of speech and natural language processing such as speech recognition [42], prosody labeling [43], and part-of-speech tagging [44]. Accordingly, the overall idea of this research is to improve HSMM context generalization by taking advantage of a distribution which not only matches training data in many overlapped contextual regions but also is optimum in the sense of an entropy criterion. This system has the potential to model the dependencies between contextual factors and acoustic features such that each training sample contributes to training multiple sets of model parameters. As a result, context-dependent acoustic modeling based on MEM could lead to a promising synthesis system even for limited training data.
The rest of the paper is organized as follows. Section 2 presents HSMM-based speech synthesis. The hidden maximum entropy model (HMEM) structure and the proposed HMEM-based speech synthesis system are explained in Section 3. Section 4 is dedicated to experimental results. Finally, Section 5 concludes this paper.
2 HSMM-based speech synthesis
This section aims to explain the predominant statistical modeling approach applied in speech synthesis, i.e., the context-dependent multi-space probability distribution left-to-right HSMM without skip transitions [3, 14] (simply called HSMM in the remainder of this paper). The discussion presented in this section provides a preliminary framework which will be used as a basis to introduce the proposed HMEM technique in Section 3. The most significant drawback of HSMM, namely inadequate context generalization, is also pointed out.
2.1 HSMM structure
where $S(o_t)$ represents the set of all space indexes with the same dimensionality as $o_t$, and $\mathcal{N}_l(\cdot;\mu,\Sigma)$ denotes an $l$-dimensional Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$ ($\mathcal{N}_0$ is defined to be 1). Furthermore, the output probability distribution of the $i$th state and $g$th space is denoted by $b_{ig}(o_t)$, which is a Gaussian distribution with mean vector $\mu_{ig}$ and covariance matrix $\Sigma_{ig}$. Also, $m_i$ and $\sigma_i^2$ represent the mean and variance of the state duration probability.
where $Y_o$ and $Y_d$ are the decision trees trained for modeling output observation vectors and state durations, respectively. All symbols with superscript $l$ indicate model parameters defined for the $l$th leaf.
2.2 HSMM likelihood
where the initial forward and backward variables for every state index $i$ are $\alpha_0(i) = 1$ and $\beta_T(i) = 1$.
2.3 HSMM parameter re-estimation
2.4 Inefficient context generalization
It can be noticed from the definition of the function $f_i(c;Y)$ in Equation 3 that this function can be viewed as a set of $L(Y)$ non-overlapped binary contextual factors. Because these contextual factors are non-overlapped, each training sample contributes to the model of only one leaf and only one Gaussian distribution, which leads to insufficient context generalization. Hence, by extending $f_i(c;Y)$ to overlapped contextual factors, more efficient context generalization could be achieved. Section 3 proposes an approach which enables the conventional structure to model overlapped contextual factors and thus improves the modeling performance for unseen contexts.
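The contrast between non-overlapped and overlapped binary contextual factors can be illustrated with a minimal Python sketch. The tree encoding, the question names (`is_vowel`, `is_stressed`, `is_phrase_final`), and the helper functions below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def active_leaf(context, node):
    """Walk a binary tree until a leaf (an int index) is reached."""
    while not isinstance(node, int):
        question, yes_branch, no_branch = node
        node = yes_branch if context[question] else no_branch
    return node


def indicators(context, tree, n_leaves):
    """Binary leaf indicators of one tree: exactly one entry is 1."""
    leaf = active_leaf(context, tree)
    return [1 if l == leaf else 0 for l in range(n_leaves)]


# Hypothetical trees: (question, yes-subtree, no-subtree) or a leaf index.
tree_a = ("is_vowel", ("is_stressed", 0, 1), 2)
tree_b = ("is_phrase_final", 0, 1)
context = {"is_vowel": True, "is_stressed": False, "is_phrase_final": True}

# One tree: the context activates a single factor (non-overlapped regions).
single = indicators(context, tree_a, 3)
# Two trees: the same context activates one factor per tree (overlapped regions).
overlapped = single + indicators(context, tree_b, 2)
```

With a single tree, a training sample observed in this context can only inform one leaf; with the concatenated indicators, it informs one leaf in every tree.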
3 Hidden maximum entropy model
The goal of this section is to develop a context-dependent statistical model for acoustic parameters with adequate context generalization. The previous section on HSMM revealed that inadequate generalization stems from the use of non-overlapped features only. Consequently, relating acoustic parameters to contextual information by incorporating overlapped features could improve generalization. This section proposes HMEM to establish this relation.
3.1 HMEM structure
The proposed HMEM technique exploits exactly the same structure and graphical model as the original HSMM, and thus, the model likelihood expression given by Equation 5 is also valid for HMEM. The only difference between HSMM and HMEM is the way they incorporate contextual factors in the output and duration probability distributions (i.e., $\{b_i(\cdot)\}_{i=1}^{N}$ and $\{p_i(\cdot)\}_{i=1}^{N}$). HSMM builds a decision tree and then trains a Gaussian distribution for each leaf of the tree. In contrast, HMEM follows the maximum entropy modeling approach, which is described in the next subsection.
3.1.1 Maximum entropy modeling
where $x(c)$ denotes the realization of the $\ell$-dimensional random vector $x$ for the context $c$ in the database. If there are multiple realizations of $x$, $x(c)$ is obtained by averaging those values. In sum, the proposed context-dependent acoustic modeling approach obtains the smoothest (maximum entropy) distribution that captures the first-order moments of the training data in the $L_f$ regions indicated by $\{f_l(c)\}_{l=1}^{L_f}$ and the second-order moments of the data computed in $\{g_l(c)\}_{l=1}^{L_g}$.
where $H_l$ and $u_l$ are the model parameters related to the $l$th contextual factors $g_l(c)$ and $f_l(c)$, respectively. $H_l$ is an $\ell$-by-$\ell$ matrix and $u_l$ is an $\ell$-dimensional vector. When $f_l(c)$ becomes 1 (i.e., it is active), $u_l$ affects the distribution; otherwise, it has no effect on the distribution. In fact, Equation 19 is nothing but the well-known Gaussian distribution with mean vector $-0.5H^{-1}u$ and covariance matrix $H^{-1}$, both calculated from a specific context-dependent combination of model parameters. Indeed, the main difference between MEM and other methods such as the spectral additive structure [19, 20] is that the mean and variance in MEM are not a linear combination of other parameters. This type of combination enables MEM to match training data in all overlapped regions.
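The mapping from ME parameters to a Gaussian can be sketched in a few lines of Python. This is a minimal illustration under two assumptions that are not spelled out in the text above: the context-dependent parameters are taken to be the sums of the active parameters, $u = \sum_l f_l(c)u_l$ and $H = \sum_l g_l(c)H_l$, and every $H_l$ is restricted to a diagonal matrix (stored as a list) so that no matrix inversion is needed. The function name and the numbers are hypothetical.

```python
def me_gaussian(f, g, u_params, h_params):
    """Mean and variance of the context-dependent Gaussian.

    f, g      : binary indicators of the active first- and second-order regions
    u_params  : one vector u_l per first-order region
    h_params  : one diagonal of H_l per second-order region (assumed diagonal)
    """
    dim = len(u_params[0])
    # Assumed combination rule: sum the parameters of the active regions.
    u = [sum(fl * ul[k] for fl, ul in zip(f, u_params)) for k in range(dim)]
    h = [sum(gl * hl[k] for gl, hl in zip(g, h_params)) for k in range(dim)]
    mean = [-0.5 * u[k] / h[k] for k in range(dim)]  # -0.5 * H^{-1} u
    var = [1.0 / h[k] for k in range(dim)]           # diagonal of H^{-1}
    return mean, var


# Two of three first-order regions and one of two second-order regions are active.
f = [1, 0, 1]
g = [0, 1]
u_params = [[-2.0, 4.0], [1.0, 1.0], [-6.0, 0.0]]
h_params = [[1.0, 1.0], [2.0, 4.0]]
mean, var = me_gaussian(f, g, u_params, h_params)
```

The mean depends on a ratio of summed parameters, so it is not a linear combination of per-region means unless the denominator is shared by all contexts.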
This form of context-dependent Gaussian distribution presents a promising flexibility in utilizing contextual information. On the one hand, using detailed and non-overlapped contextual factors such as the features defined by Equation 3 (decision tree terminal node indicators) generates context-dependent Gaussian distributions which are identical to those used in conventional HSMM. These distributions have a straightforward and efficient training procedure but suffer from insufficient context generalization capabilities. On the other hand, incorporating general and highly overlapped contextual factors overcomes the latter shortcoming and provides efficient context generalization, but the training procedure becomes more computationally complex. In the case of highly overlapped contextual factors, an arbitrary context activates several contextual factors, and hence, each observation vector is involved in modeling several model parameters.
3.1.2 ME-based modeling vs. additive modeling
At first glance, the contextual additive structure [19, 20, 32, 37] seems to have the same capabilities as the proposed ME-based context-dependent acoustic modeling. Therefore, to clarify their differences, this section compares HMEM with the additive structure through a very simple example.
In this example, the goal is to model a one-dimensional observation value using both ME-based modeling and a contextual additive structure. Due to the prime importance of mean parameters in HMM-based speech synthesis [47], we investigate the difference between the mean values predicted by the two systems.
In contrast, Figure 3B shows the corresponding ME-based modeling approach. As described in the previous subsection, ME-based context-dependent modeling needs two sets of regions, $\{f_l(c)\}_{l=1}^{L_f}$ and $\{g_l(c)\}_{l=1}^{L_g}$. This example assumes that the leaves of $Q_1$ and $Q_2$ are defined as the first set of regions $\{f_l(c)\}_{l=1}^{L_f}$, and the leaves of $Q_3$ are defined as the second set $\{g_l(c)\}_{l=1}^{L_g}$. Therefore, according to the explanation of the previous subsection, the first empirical moments of $Q_1$ and $Q_2$, in addition to the second empirical moments of $Q_3$, are captured by ME-based modeling. Figure 3B shows the estimated model mean values for all eight cubic clusters. As can be seen from the figure, the model mean values estimated by ME-based modeling are obtained by adding the parameters that belong to the active regions in $\{f_l(c)\}_{l=1}^{L_f}$ and dividing the result by the parameters defined for the regions $\{g_l(c)\}_{l=1}^{L_g}$. In fact, the proposed ME-based modeling is an extension of the additive structure that ties all covariance matrices [19]. This extension is clear because if $\{g_l(c)\}_{l=1}^{L_g}$ is defined with one region containing the whole contextual feature space, the ME-based modeling reduces to the additive structure that ties all covariance matrices [19].
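The relation between the two structures can be made explicit in the one-dimensional case. The following short derivation is a sketch under the assumption that the context-dependent parameters are formed as $u(c)=\sum_l f_l(c)\,u_l$ and $h(c)=\sum_l g_l(c)\,h_l$, with mean $-0.5\,u(c)/h(c)$ and variance $1/h(c)$; the symbols $m_l$ and $h_1$ are introduced here only for illustration.

$$
\mu(c) \;=\; -\,\frac{\sum_{l=1}^{L_f} f_l(c)\,u_l}{2\sum_{l=1}^{L_g} g_l(c)\,h_l},
\qquad
\sigma^2(c) \;=\; \frac{1}{\sum_{l=1}^{L_g} g_l(c)\,h_l}
$$

$$
L_g = 1,\; g_1(c) = 1 \;\; \forall c
\;\;\Longrightarrow\;\;
\mu(c) \;=\; \sum_{l=1}^{L_f} f_l(c)\,m_l,
\quad m_l = -\frac{u_l}{2h_1},
\qquad
\sigma^2(c) = \frac{1}{h_1}
$$

Under this assumption, a single all-covering variance region yields a mean that is a plain sum of per-region terms with one shared variance, i.e., the additive structure with tied covariances, whereas several variance regions rescale the summed first-order parameters differently in each region.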
3.1.3 HMEM-based speech synthesis
In these equations, $S(o_t)$ is the set of all possible spaces defined for $o_t$. $u_i^l$ and $h_i^l$ are the duration model parameters, and $w_{ig}^l$, $u_{ig}^l$, and $H_{ig}^l$ denote the output model parameters related to the $l$th contextual factor, $g$th space, and $i$th state.
We can now examine the differences between HSMM and HMEM context-dependent acoustic modeling. These two modeling approaches are very close to each other: defining the HMEM contextual factors based on the decision trees described by Equation 3 reduces HMEM to HSMM. Accordingly, HMEM extends HSMM and enables its structure to exploit overlapped contextual factors.
Moreover, another significant conclusion that can be drawn from this section is that several HSMM concepts carry over to the HMEM framework. These concepts include the Viterbi algorithm, the methods that calculate forward/backward variables and occupation probabilities, and even all parameter generation algorithms [26–28]. It is only necessary to define the mean vectors, covariance matrices, and space probabilities of HSMM in accordance with Equation 20.
3.2 HMEM parameter re-estimation
where $\gamma_t(i, g)$ and $\chi_t^d(i)$ are defined in Section 2.3. Therefore, at every iteration of BFGS, the above gradient values are computed, and BFGS then estimates new parameters that are closer to the optimum ones.
3.3 Decision tree-based context clustering
Statistical parametric speech synthesis systems typically exploit around 50 different types of contextual factors [23]. For such a system, it is impossible to prepare training data covering all context-dependent models, and there are a large number of unseen models that have to be predicted in the synthesis phase. Therefore, a context clustering approach such as decision tree-based clustering has to be used to handle unseen contexts [31, 45]. Due to the critical importance of context clustering algorithms in HMM-based speech synthesis systems, this section focuses on designing a clustering algorithm for HMEM.
As can be seen from the preceding discussion, in order to implement the proposed architecture, we initially need to define two sets of contextual regions. These regions are represented by two sets, namely $\{f_l(c)\}_{l=1}^{L_f}$ and $\{g_l(c)\}_{l=1}^{L_g}$. First- and second-order moment constraints have to be satisfied for all regions in $\{f_l(c)\}_{l=1}^{L_f}$ and $\{g_l(c)\}_{l=1}^{L_g}$, respectively. Before training, the first empirical moments of all regions in $\{f_l(c)\}_{l=1}^{L_f}$ and the second empirical moments of all regions in $\{g_l(c)\}_{l=1}^{L_g}$ are computed using the training data. Then, HMEM is trained to be consistent with these empirical moments. The major difficulty in defining these regions is to find a satisfactory balance between model complexity and the availability of training data. For limited training databases, a model with a small number of parameters, i.e., a small number of regions, has to be defined. In this case, bigger (strongly overlapped) contextual regions seem to be more desirable, because they can alleviate the problem of weak context generalization. On the other hand, for large training databases, a larger number of contextual regions has to be defined to avoid underfitting the training data. In this case, smaller contextual regions can be applied to capture the details of acoustic features. This section introduces an algorithm that defines multiple contextual regions for first- and second-order moments by considering the HMEM structure.
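The pre-training step of collecting empirical moments over possibly overlapping regions can be sketched as follows. This is a minimal one-dimensional illustration; the context encoding, the region predicates, and the normalization (a plain average over the contexts in each region) are assumptions made for the example, since the exact form of the constraints is not reproduced in this section.

```python
from collections import defaultdict


def average_by_context(samples):
    """Collapse repeated observations of a context c into their mean x(c)."""
    acc = defaultdict(list)
    for context, value in samples:
        acc[context].append(value)
    return {c: sum(v) / len(v) for c, v in acc.items()}


def region_moment(x_of_c, region, order):
    """Average of x(c)**order over the contexts that fall inside a region."""
    vals = [x ** order for c, x in x_of_c.items() if region(c)]
    return sum(vals) / len(vals)


# Hypothetical contexts: (phone, stress) pairs with scalar observations.
samples = [
    (("a", "stressed"), 1.0),
    (("a", "stressed"), 3.0),
    (("a", "unstressed"), 4.0),
    (("t", "unstressed"), 0.0),
]
x_of_c = average_by_context(samples)

# Two overlapping first-order regions and one all-covering second-order region.
first_a = region_moment(x_of_c, lambda c: c[0] == "a", order=1)
first_unstressed = region_moment(x_of_c, lambda c: c[1] == "unstressed", order=1)
second_all = region_moment(x_of_c, lambda c: True, order=2)
```

The context `("a", "unstressed")` lies in both first-order regions, so its observation contributes to two empirical moments, which is the overlap that the model is later trained to match.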
Due to the complex relationship between acoustic features and contextual factors, it is extremely difficult to find the optimum sets of contextual regions that maximize likelihood for HMEM. For the sake of simplicity, we have made some simplifying assumptions to find a number of suboptimum contextual regions. These assumptions are expressed as follows:

We have used conventional binary decision tree structures to define $\{f_l(c)\}_{l=1}^{L_f}$ and $\{g_l(c)\}_{l=1}^{L_g}$. This is a common approach in many former papers [19, 20, 23]. It should be noted that the decision tree structure is not the only possible structure to express the relationship between acoustic features and contextual factors. For example, other approaches such as neural networks or soft-clustering methods can be applied as well. However, in this paper, we limit our discussion to the conventional binary decision tree structure.

Multiple decision trees are trained for $\{f_l(c)\}_{l=1}^{L_f}$, and just one decision tree is constructed for $\{g_l(c)\}_{l=1}^{L_g}$. In this way, the final HMEM preserves the first empirical moments of multiple decision trees and the second moments of just one decision tree. This assumption is a result of the fact that first-order moments seem to be more important than second-order moments [32, 47].

The discussion in the current section shows that the ML estimates of the parameters defined for $\{f_l(c)\}_{l=1}^{L_f}$ and $\{g_l(c)\}_{l=1}^{L_g}$ depend significantly on each other. Therefore, in each step of decision tree construction, a BFGS optimization algorithm has to be executed to re-estimate both sets of parameters simultaneously, and this procedure leads to an extreme amount of computational complexity. To alleviate this problem, it is proposed to borrow $\{g_l(c)\}_{l=1}^{L_g}$ from a baseline system (a conventional HMM-based speech synthesis system) and construct $\{f_l(c)\}_{l=1}^{L_f}$ independently.

In the HMEM structure, $\{f_l(c)\}_{l=1}^{L_f}$ is responsible for providing a satisfactory clustering of first-order moments (mean vectors). Similarly, contextual additive structures [19, 20, 37] that tie all covariance matrices offer multiple overlapped clusterings of mean vectors based on the likelihood criterion; therefore, an appropriate method is to borrow $\{f_l(c)\}_{l=1}^{L_f}$ from the contextual additive structure.

However, training a contextual additive structure using the algorithms proposed in [19, 20] is still computationally expensive for large training databases (more than 500 sentences). Three modifications are applied to the algorithm proposed by Takaki et al. [19] to reduce the computational complexity: (i) The number of decision trees is fixed (in our experiments, an additive structure with four decision trees is built). (ii) Questions are selected one by one for the different decision trees. Therefore, all trees are grown simultaneously, and the sizes of all trees are equal. (iii) In the process of selecting the best pair of question and leaf, it is assumed that just the parameters of the candidate leaf will be changed and all other parameters remain unchanged. It should be noted that the selection procedure is repeated until the total number of free parameters reaches the number of parameters trained for the baseline system (the HSMM-based speech synthesis system).
In sum, the final algorithm for determining $\{f_l(c)\}_{l=1}^{L_f}$ and $\{g_l(c)\}_{l=1}^{L_g}$ can be summarized as follows. $\{g_l(c)\}_{l=1}^{L_g}$ is simply borrowed from a conventional HMM-based speech synthesis system. $\{f_l(c)\}_{l=1}^{L_f}$ results from an independent context clustering algorithm that is a fast and simplified version of the contextual additive structure [19]. This clustering algorithm builds four binary context-dependent decision trees simultaneously. It should be noted that the clustering algorithm terminates when the number of clusters reaches the number of leaves of the decision tree trained for an HSMM-based system.
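The three simplifications described above (a fixed number of trees, round-robin question selection, and re-fitting only the candidate leaf) can be sketched in Python on toy data. This is not the authors' implementation: it works on one-dimensional targets, uses the squared-error reduction as a stand-in for the likelihood gain (the two rank splits identically only for a Gaussian with one fixed, shared variance), and uses a total leaf budget as a proxy for the parameter-count stopping rule. All names are hypothetical.

```python
from itertools import product


def grow_additive_trees(contexts, targets, n_trees=4, leaf_budget=8):
    """Grow n_trees additive trees in turn; each leaf holds an additive offset."""
    n = len(targets)
    trees = [[{"idx": set(range(n)), "offset": 0.0}] for _ in range(n_trees)]
    trees[0][0]["offset"] = sum(targets) / n  # the global mean lives in tree 0

    def predict(i):
        # Additive model: sum the offsets of the leaves that contain sample i.
        return sum(leaf["offset"] for tree in trees for leaf in tree if i in leaf["idx"])

    n_questions = len(contexts[0])
    t, idle = 0, 0
    while sum(len(tree) for tree in trees) < leaf_budget and idle < n_trees:
        resid = [targets[i] - predict(i) for i in range(n)]
        best = None
        for leaf in trees[t]:
            base = sum(resid[i] for i in leaf["idx"]) ** 2 / len(leaf["idx"])
            for q in range(n_questions):
                yes = {i for i in leaf["idx"] if contexts[i][q]}
                no = leaf["idx"] - yes
                if not yes or not no:
                    continue
                # Error reduction if only this leaf is split and re-fitted.
                gain = sum(sum(resid[i] for i in part) ** 2 / len(part)
                           for part in (yes, no)) - base
                if gain > 1e-12 and (best is None or gain > best[0]):
                    best = (gain, leaf, yes, no)
        if best is None:
            idle += 1  # this tree has no useful split left
        else:
            idle = 0
            _, leaf, yes, no = best
            trees[t].remove(leaf)
            for part in (yes, no):
                shift = sum(resid[i] for i in part) / len(part)
                trees[t].append({"idx": part, "offset": leaf["offset"] + shift})
        t = (t + 1) % n_trees  # round-robin: move on to the next tree
    return trees, predict


# Toy data: three yes/no questions with a purely additive effect on the target.
contexts = list(product([0, 1], repeat=3))
targets = [2 * a + b - c for a, b, c in contexts]
trees, predict = grow_additive_trees(contexts, targets, n_trees=3, leaf_budget=6)
```

On this toy problem, each tree ends up asking a different question, and six leaves reproduce all eight context-dependent means, whereas a single tree would need eight leaves for the same fit.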
4 Experiments
We conducted two sets of experiments. First, the performance of HMEM with heuristic context clusters is examined; second, the impact of the proposed decision tree-based context clustering method presented in Section 3.3 is evaluated.
4.1 Performance evaluation of HMEM with heuristic context clusters
This subsection compares HMEM-based acoustic modeling with the conventional HSMM-based method. Here, the contextual regions of HMEM are defined heuristically and are kept fixed across the different sizes of training database.
4.1.1 Experimental conditions
A Persian speech database [49] consisting of 1,000 utterances from a male speaker was used in this first set of experiments. Sentences were between 5 and 20 words long and had an average duration of 8 s. This database was specifically designed for speech synthesis. Sentences in the database covered the most frequent Persian words, all bi-letter combinations, all bi-phoneme combinations, and the most frequent Persian syllables. In modeling the synthesis units, 31 phonemes were used, including silence. As presented in Section 4.1.2, a large variety of phonetic and linguistic contextual factors was considered in this work.
Speech signals were sampled at a rate of 16 kHz and windowed by a 25-ms Blackman window with a 5-ms shift. Forty mel-cepstral coefficients, five band-pass aperiodicity coefficients, and the fundamental frequency, together with their delta and delta-delta coefficients, all extracted by STRAIGHT [11], were employed as our acoustic features. In this experiment, the number of states was 5, and a multi-stream left-to-right MSD-HSMM with no skip paths was trained as the traditional HSMM system. Decision trees were built using the maximum likelihood criterion, and the size of the decision trees was determined by the MDL principle [46]. Additionally, the global variance (GV)-based parameter generation algorithm [20, 26] and the STRAIGHT vocoder were applied in the synthesis phase.
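As a rough illustration of the analysis conditions, the framing arithmetic can be written out as follows. This is a sketch only: STRAIGHT performs its own pitch-adaptive analysis, so the plain windowing below merely shows what a 25-ms Blackman window with a 5-ms shift means at 16 kHz, and the names `frame_signal`, `WIN`, and `SHIFT` are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16000                 # Hz
WIN = SAMPLE_RATE * 25 // 1000      # 25-ms window  -> 400 samples
SHIFT = SAMPLE_RATE * 5 // 1000     # 5-ms shift    -> 80 samples

# Static features per frame: 40 mel-cepstral + 5 aperiodicity + 1 F0 value.
STATIC_DIM = 40 + 5 + 1
# Each static feature is accompanied by its delta and delta-delta coefficients.
OBSERVATION_DIM = 3 * STATIC_DIM

def frame_signal(x):
    """Cut a 1-D signal (at least WIN samples long) into Blackman-windowed frames."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - WIN) // SHIFT
    idx = np.arange(WIN)[None, :] + SHIFT * np.arange(n_frames)[:, None]
    return x[idx] * np.blackman(WIN)
```

With these settings, one second of speech yields 196 full frames, and the 8-s average utterance of this database corresponds to roughly 1,600 frames.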
Both subjective and objective tests were carried out to compare HMEM that uses some heuristic contextual regions with the traditional HSMM system. In our experiments, two different synthesis systems named HMEM1 and HMEM2 were developed based on the proposed approach. HMEM1 employs a small number of general highly overlapped contextual factors that are designed carefully for each stream, while HMEM2 uses a larger number of contextual factors.
Table 1 The number of leaf nodes for each stream in different speech synthesis systems

| Stream of acoustic features | HSMM100 | HSMM200 | HSMM400 | HSMM800 | HMEM1 | HMEM2 |
|---|---|---|---|---|---|---|
| bap | 239 | 392 | 581 | 958 | 565 | 1,130 |
| dur | 124 | 193 | 319 | 512 | 256 | 512 |
| log F0 | 590 | 904 | 1,425 | 2,487 | 565 | 1,130 |
| mgc | 267 | 416 | 736 | 1,279 | 695 | 1,390 |
| Total parameters | 75,628 | 118,314 | 204,683 | 354,133 | 188,217 | 377,834 |
Experiments were conducted on five different training sets with 50, 100, 200, 400, and 800 utterances. Additionally, a fixed set of 200 utterances, not included in the training sets, was used for testing.
4.1.2 Employed contextual factors
In our experiments, contextual factors contained phonetic-, syllable-, word-, phrase-, and sentence-level features. At each of these levels, both general and detailed features were considered. Features such as phoneme identity, syllable stress pattern, or word part-of-speech tag are examples of general features, and a feature such as the position of the current phoneme is an example of a detailed one. Specific information on the contextual features is presented in this subsection.

➢ Phonetic-level features

Phoneme identity before the preceding phoneme; preceding, current, and succeeding phonemes; and phoneme identity after the next phoneme

Position of the current phoneme in the current syllable (forward and backward)

Whether this phoneme is ‘Ezafe’ [50] or not (Ezafe is a special feature in Persian pronounced as a short vowel ‘e’ and relates two different words together. Ezafe is not written but is pronounced and has a profound effect on intonation)


➢ Syllable-level features

Stress level of this syllable (five different stress levels are defined for our speech database)

Position of the current syllable in the current word and phrase (forward and backward)

Type of the current syllable (syllables in Persian language are structured as CV, CVC, or CVCC, where C and V denote consonants and vowels, respectively)

Number of the stressed syllables before and after the current syllable in the current phrase

Number of syllables from the previous stressed syllable to the current syllable

Vowel identity of the current syllable


➢ Word-level features

Part-of-speech (POS) tag of the preceding, current, and succeeding word

Position of the current word in the current sentence (forward and backward)

Whether the current word contains ‘Ezafe’ or not

Whether this word is the last word in the sentence or not


➢ Phrase-level features

Number of syllables in the preceding, current, and succeeding phrase

Position of the current phrase in the sentence (forward and backward)


➢ Sentence-level features

Number of syllables, words, and phrases in the current sentence

Type of the current sentence

4.1.3 Illustrative example
In limited training sets, HSMM produces sudden transitions between adjacent states. This drawback results from decision tree-clustered context-dependent modeling. More specifically, when few data are available for training, the number of leaves in the decision tree is reduced. As a result, the distance between the mean vectors of adjacent states can be large. Even the parameter generation algorithms proposed in [26–28] cannot compensate for such jumps. In such cases, the quality of synthetic speech with HSMM is expected to deteriorate.
Conversely, if adjacent states are allowed to share active contextual factors, the variation of the mean vectors across state transitions will be smoother. This is the key idea of HMEM, which makes it possible to outperform HSMM when the data are limited. However, the use of overlapped contextual factors in HMEM results in an oversmoothing problem as the size of the training data increases. Therefore, detailed contextual factors are additionally considered in HMEM2 to alleviate the oversmoothing issue.
4.1.4 Objective evaluation
The average mel-cepstral distortion (MCD) [51] and the root-mean-square (RMS) error of phoneme durations (expressed as a number of frames) were selected as the metrics for our objective assessment. For the calculation of both, the state boundaries (state durations) were determined using Viterbi alignment with the speaker's real utterance.
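For reference, the two metrics can be sketched as follows. The MCD expression is the commonly used form in dB; whether the evaluation here used exactly this form, and whether the 0th (energy) coefficient was excluded, are assumptions. The function names are hypothetical.

```python
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """Average MCD in dB between two time-aligned (frames x coefficients) arrays.

    Assumption: column 0 holds the energy coefficient and is excluded.
    """
    diff = np.asarray(ref, dtype=float)[:, 1:] - np.asarray(syn, dtype=float)[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def rms_duration_error(ref_frames, syn_frames):
    """RMS error between reference and predicted phoneme durations, in frames."""
    d = np.asarray(ref_frames, dtype=float) - np.asarray(syn_frames, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))
```

Both functions assume that the natural and synthetic sequences are already aligned frame by frame, which is what the Viterbi alignment mentioned above provides.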
In summary, from these figures and the illustrative example presented before, we can see that when the available data are limited, all features (log F0, duration, and spectra) of synthetic speech generated by HMEM are closer to the original features than those obtained with HSMM. However, when the training database is large, the HSMM-based method performs better than HMEM. Nevertheless, employing more detailed features can bring the proposed method closer to the HSMM-based synthetic speech.
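FN, FP, TN, and TP rates of detecting voiced/unvoiced regions through HMEM2 and the HSMM-based method

| # training data | Implemented system | Detected as | Really voiced (%) | Really unvoiced (%) |
|---|---|---|---|---|
| 50 | HMEM2 | Voiced | 77.00 | 3.70 |
| 50 | HMEM2 | Unvoiced | 7.09 | 12.21 |
| 50 | HSMM | Voiced | 78.23 | 5.79 |
| 50 | HSMM | Unvoiced | 5.86 | 10.12 |
| 100 | HMEM2 | Voiced | 75.78 | 2.75 |
| 100 | HMEM2 | Unvoiced | 8.31 | 13.16 |
| 100 | HSMM | Voiced | 78.07 | 5.34 |
| 100 | HSMM | Unvoiced | 6.02 | 10.57 |
| 200 | HMEM2 | Voiced | 77.25 | 1.54 |
| 200 | HMEM2 | Unvoiced | 6.84 | 14.37 |
| 200 | HSMM | Voiced | 78.43 | 4.43 |
| 200 | HSMM | Unvoiced | 5.66 | 11.48 |
| 400 | HMEM2 | Voiced | 77.18 | 1.34 |
| 400 | HMEM2 | Unvoiced | 6.91 | 14.57 |
| 400 | HSMM | Voiced | 76.09 | 2.70 |
| 400 | HSMM | Unvoiced | 8.00 | 13.21 |
| 800 | HMEM2 | Voiced | 77.10 | 0.83 |
| 800 | HMEM2 | Unvoiced | 6.99 | 15.08 |
| 800 | HSMM | Voiced | 77.17 | 2.66 |
| 800 | HSMM | Unvoiced | 6.92 | 13.25 |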
Accuracy of voiced/unvoiced detector

| # training data | HMEM2 accuracy (%) | HSMM accuracy (%) |
|---|---|---|
| 50 | 89.21 | 88.35 |
| 100 | 88.94 | 88.64 |
| 200 | 91.62 | 89.91 |
| 400 | 91.75 | 89.30 |
| 800 | 92.18 | 90.42 |
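The accuracy figures follow directly from the voiced/unvoiced rates reported above: since the four rates of each system sum to 100% of the frames, accuracy is simply the sum of the two correct cells. The short check below treats "voiced" as the positive class (an assumption about naming only) and copies the values for 50 and 800 training utterances from the tables.

```python
# (TP, FP, FN, TN) in percent of all frames, with "voiced" as the positive class:
# TP = detected voiced / really voiced,   FP = detected voiced / really unvoiced,
# FN = detected unvoiced / really voiced, TN = detected unvoiced / really unvoiced.
RATES = {
    (50, "HMEM2"): (77.00, 3.70, 7.09, 12.21),
    (50, "HSMM"): (78.23, 5.79, 5.86, 10.12),
    (800, "HMEM2"): (77.10, 0.83, 6.99, 15.08),
    (800, "HSMM"): (77.17, 2.66, 6.92, 13.25),
}

def accuracy(tp, fp, fn, tn):
    """Share of correctly classified frames; the four rates already sum to 100%."""
    return tp + tn

for (n_utts, system), cells in RATES.items():
    print(n_utts, system, round(accuracy(*cells), 2))
```

The same rates show where the gain comes from: HMEM2 labels fewer truly unvoiced frames as voiced (0.83% against 2.66% at 800 utterances), while the two systems miss a similar share of voiced frames.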
4.1.5 Subjective evaluation
Twenty native participants were asked to listen to ten randomly chosen pairs of synthesized speech samples generated by two different systems (selected arbitrarily among HMEM1, HMEM2, and HSMM).
Notably, the proposed systems are of great interest when the training data are limited (i.e., for 50, 100, and 200 utterances), which is in line with the conclusions of the objective assessments. The superiority of HMEM1 over HSMM and HMEM2 is clear for the training sets containing 50 and 100 utterances. In other words, general contextual factors lead the proposed system to better performance when the amount of training data is very small. As the number of utterances in the training set increases, detailed features gradually help the proposed system produce better synthetic speech. Therefore, HMEM2 surpasses HMEM1 for training sets with 200 or more utterances. However, for relatively large training sets (400 and 800 utterances), the use of HSMM is recommended.
Table 1 compares the number of leaf nodes in the different speech synthesis systems. It can be seen from the table that, to model the mgc stream, HMEM2 uses more parameters than HSMM400 and HSMM800, but the objective evaluations presented in Figure 6 show that HSMM400 and HSMM800 result in better mel-cepstral distances. This shows that HMEM with heuristic contextual clusters cannot exploit model parameters efficiently. In fact, a great number of contextual regions in HMEM1 and HMEM2 are redundant; therefore, their corresponding parameters are not useful. The next section evaluates the performance of HMEM with the suboptimal context clustering algorithm proposed in Section 3.3. This clustering algorithm selects appropriate contextual regions and consequently solves the aforementioned problem.
4.2 Performance evaluation of HMEM with decision tree-based context clustering
This section is dedicated to the second set of experiments, conducted to evaluate the performance of HMEM with the decision tree construction algorithm proposed in Section 3.3. As the first set of experiments showed, HMEM with heuristic and naïve contextual regions cannot outperform HSMM on large training databases. This section shows that by employing appropriate sets of $\{f_l(c)\}_{l=1}^{L_f}$ and $\{g_l(c)\}_{l=1}^{L_g}$, HMEM outperforms HSMM even for large databases.
4.2.1 Experimental conditions
Experiments were carried out on Nick [54], a British male speech database collected at Edinburgh University. This database consists of 2,500 utterances from a male speaker. We considered five training sets of 50, 100, 200, 400, and 800 utterances, and 200 sentences not included in the training sets were used as test data. Each sentence in the database is about 5 s of speech. Speech signals were sampled at 48 kHz and windowed by a 25-ms Blackman window with a 5-ms shift. This database was specifically designed for speech synthesis research, and utterances in the database covered the most frequent English words. Also, different segmental and suprasegmental contextual factors were extracted for this database.
The speech analysis conditions and model topologies of CSTR/EMIME HTS 2010 [54] were used in this experiment. Bark cepstrum was extracted from smooth STRAIGHT trajectories [11]. Also, instead of log F0 and five frequency sub-bands (0 to 1, 1 to 2, 2 to 4, 4 to 6, and 6 to 8 kHz), pitch in mel and auditory-scale motivated frequency bands for the aperiodicity measure were applied [54]. The analysis process resulted in 40 bark cepstrum coefficients, 1 pitch value in mel, and 25 auditory-scale motivated frequency-band aperiodicity parameters for each frame of the training speech signals. These parameters, together with their delta and delta-delta parameters, were used as the observation vectors of the statistical parametric model.
A five-state multi-stream left-to-right MSD-HSMM with no skip paths was trained as the baseline system. The conventional maximum likelihood-based decision tree clustering algorithm was used to tie HMM states, and the MDL criterion was used to determine the size of the decision trees.
In order to have a fair comparison, the proposed system (HMEM with decision tree structure) was trained with the same number of free model parameters as the baseline system. HMEM was trained based on the decision tree construction algorithm presented in Section 3.3 and the parameter re-estimation algorithm proposed in Section 3.2. It should be noted that four decision trees were built for $\{f_l(c)\}_{l=1}^{L_f}$ and one decision tree for $\{g_l(c)\}_{l=1}^{L_g}$. After training the acoustic models, in the synthesis phase, the GV-based parameter generation algorithm [20, 26] and the STRAIGHT synthesis module generated the synthesized speech signals. Both subjective and objective tests were conducted to compare HMEM using decision tree-based clusters with traditional HSMM-based synthesis.
It is worth mentioning that training the proposed HMEM structure with decision tree-based context clustering took approximately 5 days for 800 training sentences, while training the corresponding HSMM-based synthesis system took approximately 16 h.
4.2.2 Employed contextual factors

➢ Phonetic-level features

Phoneme identity before the preceding phoneme; preceding, current, and succeeding phonemes; and phoneme identity after the next phoneme

Position of the current phoneme in the current syllable, word, phrase, and sentence


➢ Syllable-level features

Stress level of previous, current, and next syllable (three different stress levels are defined for this database)

Position of the current syllable in the current word, phrase, and sentence

Number of the phonemes of the previous, current, and next syllable

Whether the previous, current, and next syllable is accented or not

Number of the stressed syllables before and after the current syllable in the current phrase

Number of syllables from the previous stressed syllable to the current syllable

Number of syllables from the previous accented syllable to the current syllable


➢ Word-level features

Part-of-speech (POS) tag of the preceding, current, and succeeding word

Position of the current word in the current phrase and sentence (forward and backward)

Number of syllables of the previous, current, and next word

Number of content words before and after current word in the current phrase

Number of words from previous and next content word


➢ Phrase-level features

Number of syllables and words of the preceding, current, and succeeding phrase

Position of the current phrase in the sentence

Current phrase ToBI end tone


➢ Sentence-level features

Number of phonemes, syllables, words, and phrases in the current utterance

Type of the current sentence

4.2.3 Objective evaluation
4.2.4 Subjective evaluation
Both CMOS test and preference score confirm the superiority of the proposed method over HSMM in all databases. Thus, if context clusters are determined through an effective approach, the proposed HMEM will outperform HSMM.
5 Conclusions
This paper addressed a main shortcoming of HSMM in context-dependent acoustic modeling, namely inadequate context generalization. HSMM uses decision tree-based context clustering, which does not provide efficient generalization because each acoustic feature vector contributes to modeling only one context cluster. In order to alleviate this problem, this paper proposed HMEM as a new acoustic modeling technique based on the maximum entropy modeling approach. HMEM improves on HSMM by enabling its structure to take advantage of overlapped contextual factors, and therefore, it can provide superior context generalization. Experimental results using objective and subjective criteria showed that the proposed system outperforms HSMM.
Despite these advantages, the training procedure of the proposed system is computationally complex, which becomes a drawback for large databases.
References
1. Zen H, Tokuda K, Black AW: Statistical parametric speech synthesis. Speech Comm. 2009, 51(11):1039-1064. 10.1016/j.specom.2009.04.004
2. Black AW, Zen H, Tokuda K: Statistical parametric speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4. Honolulu, Hawaii, USA; 2007:IV-1229-IV-1232.
3. Yamagishi J, Kobayashi T: Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. IEICE Trans. Info. Syst. 2007, 90(2):533-543.
4. Yamagishi J, Nose T, Zen H, Ling ZH, Toda T, Tokuda K, King S, Renals S: Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing 2009, 17(6):1208-1230.
5. Yamagishi J, Kobayashi T, Nakano Y, Ogata K, Isogai J: Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Transactions on Audio, Speech, and Language Processing 2009, 17(1):66-83.
6. Wu YJ, Nankaku Y, Tokuda K: State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis. In INTERSPEECH. Brighton, UK; 2009:528-531.
7. Liang H, Dines J, Saheer L: A comparison of supervised and unsupervised cross-lingual speaker adaptation approaches for HMM-based speech synthesis. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). Dallas, Texas, USA; 2010:4598-4601.
8. Gibson M, Hirsimaki T, Karhila R, Kurimo M, Byrne W: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). Dallas, Texas, USA; 2010:4642-4645.
9. Yamagishi J, Ling Z, King S: Robustness of HMM-based speech synthesis. In INTERSPEECH. Brisbane, Australia; 2008:581-584.
10. Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T: Mixed excitation for HMM-based speech synthesis. In INTERSPEECH. Aalborg, Denmark; 2001:2263-2266.
11. Kawahara H, Masuda-Katsuse I, de Cheveigné A: Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Comm. 1999, 27(3):187-207.
12. Drugman T, Wilfart G, Dutoit T: A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis. In INTERSPEECH. Brighton, United Kingdom; 2009:1779-1782.
13. Drugman T, Dutoit T: The deterministic plus stochastic model of the residual signal and its applications. IEEE Trans. Audio. Speech. Lang. Process. 2012, 20(3):968-981.
14. Zen H, Tokuda K, Masuko T, Kobayashi T, Kitamura T: A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Info. Syst. 2007, 90(5):825.
15. Hashimoto K, Nankaku Y, Tokuda K: A Bayesian approach to hidden semi Markov model based speech synthesis. In Proceedings of INTERSPEECH. Brighton, United Kingdom; 2009:1751-1754.
16. Hashimoto K, Zen H, Nankaku Y, Masuko T, Tokuda K: A Bayesian approach to HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Taipei, Taiwan; 2009:4029-4032.
17. Tokuda K, Masuko T, Miyazaki N, Kobayashi T: Multi-space probability distribution HMM. IEICE Trans. Info. Syst. 2002, 85(3):455-464.
18. Zen H, Senior A, Schuster M: Statistical parametric speech synthesis using deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, British Columbia, Canada; 2013:7962-7966.
19. Takaki S, Nankaku Y, Tokuda K: Spectral modeling with contextual additive structure for HMM-based speech synthesis. In Proceedings of 7th ISCA Speech Synthesis Workshop. Kyoto, Japan; 2010:100-105.
20. Takaki S, Nankaku Y, Tokuda K: Contextual partial additive structure for HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, British Columbia, Canada; 2013:7878-7882.
21. Gales MJ: Cluster adaptive training of hidden Markov models. IEEE Trans. Speech. Audio. Process. 2000, 8(4):417-428. 10.1109/89.848223
22. Zen H, Gales MJ, Nankaku Y, Tokuda K: Product of experts for statistical parametric speech synthesis. IEEE Trans. Audio. Speech. Lang. Process. 2012, 20(3):794-805.
23. Yu K, Zen H, Mairesse F, Young S: Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis. Speech Comm. 2011, 53(6):914-923. 10.1016/j.specom.2011.03.003
24. Toda T, Young S: Trajectory training considering global variance for HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Taipei, Taiwan; 2009:4025-4028.
25. Qin L, Wu YJ, Ling ZH, Wang RH, Dai LR: Minimum generation error criterion considering global/local variance for HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Las Vegas, Nevada, USA; 2008:4621-4624.
26. Toda T, Tokuda K: Speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Info. Syst. 2007, E90-D(5):816-824. 10.1093/ietisy/e90-d.5.816
27. Tokuda K, Yoshimura T, Masuko T, Kobayashi T, Kitamura T: Speech parameter generation algorithms for HMM-based speech synthesis. In ICASSP, vol. 3. Istanbul; 2000:1315-1318.
28. Tokuda K, Kobayashi T, Imai S: Speech parameter generation from HMM using dynamic features. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. Detroit, Michigan, USA; 1995:660-663.
29. Comparing glottal-flow-excited statistical parametric speech synthesis methods. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, British Columbia, Canada; 2013:7830-7834.
30. Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T: Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. Proceedings of Eurospeech 1999, 2347-2350.
31. Young SJ, Odell JJ, Woodland PC: Tree-based state tying for high accuracy acoustic modeling. Proceedings of the Workshop on Human Language Technology, Association for Computational Linguistics 1994, 307-312.
32. Leggetter CJ, Woodland PC: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech. Lang. 1995, 9(2).
33. Digalakis VV, Neumeyer LG: Speaker adaptation using combined transformation and Bayesian methods. IEEE Trans. Speech. Audio. Process. 1996, 4(4):294-300. 10.1109/89.506933
34. Zen H, Braunschweiler N, Buchholz S, Gales MJ, Knill K, Krstulovic S, Latorre J: Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Audio. Speech. Lang. Process. 2012, 20(6):1713-1724.
35. Koriyama T, Nose T, Kobayashi T: Statistical parametric speech synthesis based on Gaussian process regression. IEEE Journal of Selected Topics in Signal Processing 2013, 1-11.
36. Yu K, Mairesse F, Young S: Word-level emphasis modeling in HMM-based speech synthesis. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). Dallas, Texas, USA; 2010:4238-4241.
37. Zen H, Braunschweiler N: Context-dependent additive log f_0 model for HMM-based speech synthesis. In INTERSPEECH. Brighton, United Kingdom; 2009:2091-2094.
38. Sakai S: Additive modeling of English f0 contour for speech synthesis. In Proceedings of ICASSP. Las Vegas, Nevada, USA; 2008:277-280.
39. Qian Y, Liang H, Soong FK: Generating natural F0 trajectory with additive trees. In INTERSPEECH. Brisbane, Australia; 2008:2126-2129.
40. Wu YJ, Soong F: Modeling pitch trajectory by hierarchical HMM with minimum generation error training. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan; 2012:4017-4020.
41. Berger AL, Pietra VJD, Pietra SAD: A maximum entropy approach to natural language processing. Computer Ling 1996, 22:39-71. 10.1016/0096-0551(96)00005-7
42. Borthwick A: A maximum entropy approach to named entity recognition. PhD dissertation, New York University; 1999.
43. Rangarajan V, Narayanan S, Bangalore S: Exploiting acoustic and syntactic features for prosody labeling in a maximum entropy framework. In Proceedings of NAACL HLT. 2007, 1-8.
44. Ratnaparkhi A: A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1996, 1:133-142.
45. Odell JJ: The use of context in large vocabulary speech recognition. PhD dissertation, Cambridge University; 1995.
46. Shinoda K, Watanabe T: MDL-based context-dependent subword modeling for speech recognition. J. Acoust. Soc. Jpn 2000, 21(2):79-86. 10.1250/ast.21.79
47. Oura K, Zen H, Nankaku Y, Lee A, Tokuda K: A covariance-tying technique for HMM-based speech synthesis. J. IEICE 2010, E93-D(3):595-601.
48. Nocedal J, Wright SJ: Numerical Optimization. Springer, USA; 1999.
49. Bijankhan M, Sheikhzadegan J, Roohani MR, Samareh Y, Lucas C, Tebiani M: The speech database of Farsi spoken language. Proceedings of 5th Australian International Conference on Speech Science and Technology (SST) 1994, 826-831.
50. Ghomeshi J: Non-projecting nouns and the ezafe: construction in Persian. Nat. Lang. Ling. Theor. 1997, 15(4):729-788. 10.1023/A:1005886709040
51. Kubichek R: Mel-cepstral distance measure for objective speech quality assessment. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, vol. 1 1993, 125-128.
52. Picart B, Drugman T, Dutoit T: Continuous control of the degree of articulation in HMM-based speech synthesis. In 12th Annual Conference of the International Speech Communication Association (INTERSPEECH). Florence, Italy; 2011:1797-1800.
53. Yamagishi J: Average-Voice-Based Speech Synthesis. PhD dissertation, Tokyo Institute of Technology, Yokohama; 2006.
54. Yamagishi J, Watts O: The CSTR/EMIME HTS system for Blizzard challenge. In Proceedings of Blizzard Challenge 2010. Kyoto, Japan; 2010:1-6.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.