Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis
© Khorram et al.; licensee Springer. 2014
Received: 12 August 2013
Accepted: 3 March 2014
Published: 7 April 2014
Decision tree-clustered context-dependent hidden semi-Markov models (HSMMs) are typically used in statistical parametric speech synthesis to represent probability densities of acoustic features given contextual factors. This paper addresses three major limitations of this decision tree-based structure: (i) the decision tree structure lacks adequate context generalization; (ii) it is unable to express complex context dependencies; (iii) parameters generated from this structure exhibit sudden transitions between adjacent states. To alleviate these limitations, many previous studies applied multiple decision trees with an additive assumption over those trees. The current study likewise uses multiple decision trees, but instead of the additive assumption, it proposes to train the smoothest distribution by maximizing an entropy measure; increasing the smoothness of the distribution clearly improves context generalization. The proposed model, named the hidden maximum entropy model (HMEM), estimates a distribution that maximizes entropy subject to multiple moment-based constraints. Owing to the simultaneous use of multiple decision trees and the maximum entropy measure, the three aforementioned issues are considerably alleviated. Relying on HMEM, a novel speech synthesis system has been developed with maximum likelihood (ML) parameter re-estimation as well as maximum output probability parameter generation. Additionally, an effective and fast algorithm that builds multiple decision trees in parallel is devised. Two sets of experiments were conducted to evaluate the performance of the proposed system. In the first set of experiments, HMEM with a number of heuristic context clusters was implemented. This system outperformed the decision tree structure on small training databases (i.e., 50, 100, and 200 sentences).
In the second set of experiments, the performance of HMEM with four parallel decision trees is investigated using both subjective and objective tests. All evaluation results of the second experiment confirm a significant improvement of the proposed system over the conventional HSMM.
Statistical parametric speech synthesis (SPSS) has dominated the speech synthesis research area over the last decade [1, 2]. This is mainly due to the advantages of SPSS over traditional concatenative speech synthesis approaches, which include the flexibility to change voice characteristics [3–5], multilingual support [6–8], coverage of the acoustic space, small footprint, and robustness [4, 9]. All of the above advantages stem from the fact that SPSS provides a statistical model of acoustic features instead of using original speech waveforms. However, these advantages are achieved at the expense of one major disadvantage, i.e., degradation in the quality of synthetic speech. This shortcoming results from three important factors: vocoding distortion [10–13], the accuracy of statistical models [14–25], and the accuracy of parameter generation algorithms [26–28]. This paper is an attempt to address the second factor and improve the accuracy of statistical models. Most of the research carried out to improve acoustic modeling performance has aimed to develop systems that generate natural and high-quality speech using large training speech databases (more than 30 min) [18, 21, 22]. Nevertheless, there exist a great number of under-resourced languages (such as Persian) for which only a limited amount of data is available. To address this shortcoming, we target a statistical approach that leads to an appropriate speech synthesis system not only with large but also with small training databases.
Every SPSS system consists of two distinct phases, namely training and synthesis [1, 2]. In the training phase, first, acoustic and contextual factors are extracted for the whole training database using a vocoder [12, 29, 30] and a natural language pre-processor. Next, the relationship between acoustic and contextual factors is modeled using a context-dependent statistical approach [14–25]. The synthesis phase starts with a parameter generation algorithm [26–28] that exploits the trained context-dependent statistical models and aims to generate realistic acoustic feature trajectories for a given input text. The acoustic trajectories are then fed into the same vocoder used during the training phase in order to generate the desired synthesized speech.
In the most predominant statistical parametric approach, spectrum, excitation, and duration of speech are expressed concurrently in a unified framework of context-dependent multi-space probability distribution hidden semi-Markov models (HSMMs). More specifically, a multi-space probability distribution is estimated for each leaf node of the decision trees. These decision tree-based structures split the contextual space into a number of non-overlapped clusters which form multiple groups of context-dependent HMM states, and each group shares the same output probability distribution. In order to capture acoustic variations accurately, the model has to be able to express a large number of robust distributions [19, 20]. Decision trees are not efficient for such expression because increasing the number of distributions by growing the tree reduces the population of each leaf and consequently reduces the robustness of the distributions. This problem stems from the fact that a decision tree assigns each HMM state to only one cluster (a small region of the contextual space); therefore, each state contributes to modeling just one distribution. In other words, the decision tree structure makes the models match the training data only in non-overlapped regions expressed through the decision tree terminal nodes. In the case of limited training data, the decision tree is small, so it cannot split the contextual factor space sufficiently. In this case, the accordance between model and data is insufficient, and therefore the speech synthesis system generates unsatisfactory output. Accordingly, by extending the decision tree in such a way that each state affects multiple distributions (a larger portion of the contextual space), the generalization to unseen models will be improved. The main idea of this study is to extend the non-overlapped regions of one decision tree to the overlapped regions of multiple decision trees and hence exploit contextual factors more efficiently.
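A minimal contrast between the single-tree assignment just described and the overlapped multi-tree idea can be sketched as follows (the contexts, questions, and leaf names are hypothetical, purely for illustration):

```python
# Toy illustration: a single decision tree assigns each context to exactly
# one cluster, while multiple trees let every context fall into one leaf per
# tree, so each training sample shapes several distributions.

def single_tree_leaf(context):
    # One binary question per level; each context reaches exactly one leaf.
    if context["is_vowel"]:
        return "leaf_vowel" if context["stressed"] else "leaf_vowel_unstr"
    return "leaf_consonant"

def multi_tree_leaves(context, trees):
    # With K trees, a context activates K leaves (overlapping clusters).
    return [tree(context) for tree in trees]

tree_a = lambda c: "A_vowel" if c["is_vowel"] else "A_cons"
tree_b = lambda c: "B_str" if c["stressed"] else "B_unstr"

ctx = {"is_vowel": True, "stressed": False}
print(single_tree_leaf(ctx))                      # one cluster
print(multi_tree_leaves(ctx, [tree_a, tree_b]))   # two clusters at once
```

With a single tree, the sample influences one distribution only; with the two overlapping trees it contributes statistics to two parameter sets, which is the generalization gain argued for above.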
A large number of research works have already been performed to improve the quality of the basic decision tree-clustered HSMM. Some of them are based on model adaptation techniques. This approach exploits invaluable prior knowledge obtained from an average voice model and adapts this general model using an adaptation algorithm such as maximum likelihood linear regression (MLLR), maximum a posteriori (MAP) adaptation, or cluster adaptive training (CAT). However, working with average voice models is difficult for under-resourced languages, since building such a general model requires considerable effort to design, record, and transcribe a thorough multi-speaker speech database. To alleviate the data sparsity problem in under-resourced languages, the speaker and language factorization (SLF) technique can be used. SLF attempts to factorize speaker-specific and language-specific characteristics in the training data and then model them using different transforms. By representing the speaker attributes with one transform and the language characteristics with another, the speech synthesis system is able to alter language and speaker separately. In this framework, it is possible to exploit data from different languages to predict speaker-specific characteristics of the target speaker, and consequently the data sparsity problem is alleviated. The authors of [15, 16] also developed a new technique by replacing the maximum likelihood (ML) point estimate of the HSMM with a variational Bayesian method. Their system was shown to outperform HSMM when the amount of training data is small. Other notable structures used to improve statistical modeling accuracy are deep neural networks (DNNs). The decision tree structure is not efficient enough to model complicated context dependencies such as XORs or multiplexers.
To model such complex contextual functions, a decision tree has to be excessively large, but DNNs are capable of modeling complex contextual factors by employing multiple hidden layers. Additionally, a great number of overlapped contextual factors can be fed into a DNN to approximate output acoustic features, so DNNs are able to provide efficient context generalization. Speech synthesis based on Gaussian process regression (GPR) is another novel approach that has recently been proposed to overcome the limitations of HMM-based speech synthesis. The GPR model predicts frame-level acoustic trajectories from frame-level contextual factors, which include the relative position of the current frame within the phone and some articulatory information. These frame-level contextual factors are employed as the explanatory variable in GPR. The frame-level modeling of GPR removes the inaccurate stationarity assumption of state output distributions in HMM-based speech synthesis. Also, GPR can directly represent complex context dependencies without parameter tying by decision tree clustering; therefore, it is capable of improving context generalization.
Acoustic modeling with a contextual additive structure has also been proposed to represent the dependencies between contextual factors and acoustic features more precisely [19, 20, 23, 32, 36–40]. In this structure, acoustic trajectories are considered to be a sum of independent acoustic components which have different context dependencies (different decision trees have to be trained for those components). Since the mean vectors and covariance matrices of the distribution are equal to the sum of the mean vectors and covariance matrices of the additive components, the model is able to exploit contextual factors more efficiently. Furthermore, in this structure, each training data sample contributes to modeling multiple mean vectors and covariance matrices. Many papers applied the additive structure just for F0 modeling [37–40]. One line of work proposed an additive structure with multiple decision trees for mean vectors and a single tree for variance terms; for different additive components, different sets of contextual factors were used and multiple trees were built simultaneously. Another approach also employs multiple additive decision trees, but trains this structure using the minimum generation error (MGE) criterion. Sakai defined an additive model with three distinct layers, namely intonational phrase, word-level, and pitch-accent layers; all of these components were trained simultaneously using a regularized least squares error criterion. Qian et al. proposed multiple additive regression trees with a gradient-based tree-boosting algorithm, in which decision trees are trained in successive stages to minimize the squared error. Takaki et al. [19, 20] applied the additive structure to spectral modeling and reported that the computational complexity of this structure is extremely high for the full context labels used in speech synthesis.
To alleviate this issue, they proposed two approaches: covariance parameter tying and a likelihood calculation algorithm using the matrix inversion lemma. Despite all its advantages, this additive structure may not match the training data accurately, because once training is done, the first and second moments of the training data and the model may not be exactly the same in some regions.
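The additive assumption described above can be sketched in a few lines (the trees, leaves, and mean values below are hypothetical; only the summation rule reflects the structure being discussed):

```python
import numpy as np

# Sketch of the contextual additive structure: the model mean for a context
# is the sum of the mean vectors stored at the leaf that each component tree
# assigns to that context (covariances combine analogously).

def additive_mean(context, trees, leaf_means):
    total = np.zeros(2)
    for k, tree in enumerate(trees):
        total += leaf_means[k][tree(context)]  # one leaf per component tree
    return total

# Two hypothetical component trees with different context dependencies.
trees = [lambda c: "vowel" if c["is_vowel"] else "cons",
         lambda c: "str" if c["stressed"] else "unstr"]
leaf_means = [{"vowel": np.array([1.0, 0.0]), "cons": np.array([0.0, 0.0])},
              {"str": np.array([0.0, 2.0]), "unstr": np.array([0.0, 0.5])}]

m = additive_mean({"is_vowel": True, "stressed": True}, trees, leaf_means)
print(m)  # [1. 2.]
```

Note how a single training sample falling into the "vowel" and "str" leaves would update two mean vectors at once, which is exactly the data-sharing property credited to the additive structure.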
Another important problem of conventional decision tree-clustered acoustic modeling is the difficulty of capturing the effect of weak contextual factors such as word-level emphasis [23, 36]. This is mainly because weak contexts have less influence on the likelihood measure. One clear approach to address this issue is to construct the decision tree in two successive steps: in the first step, all selections are made among weak contextual factors, and in the second step, the remaining questions are adopted. This procedure can effectively exploit weak contextual factors, but it reduces the amount of training data available for normal contextual factors. Context adaptive training with factorized decision trees is another approach that can exploit weak context questions efficiently. In this system, a canonical model is trained using normal contextual factors, and then a set of transforms is built from weak contextual factors. In effect, the canonical models and the transforms represent the effects of normal and weak contextual factors, respectively. This structure also improves the context generalization of conventional HMM-based synthesis by exploiting adaptation techniques.
This paper introduces maximum entropy model (MEM)-based speech synthesis. MEM has been demonstrated to be effective in numerous applications of speech and natural language processing, such as speech recognition, prosody labeling, and part-of-speech tagging. Accordingly, the overall idea of this research is to improve HSMM context generalization by taking advantage of a distribution which not only matches the training data in many overlapped contextual regions but is also optimal in the sense of an entropy criterion. This system has the potential to model the dependencies between contextual factors and acoustic features such that each training sample contributes to training multiple sets of model parameters. As a result, context-dependent acoustic modeling based on MEM can lead to a promising synthesis system even for limited training data.
The rest of the paper is organized as follows. Section 2 presents HSMM-based speech synthesis. The hidden maximum entropy model (HMEM) structure and the proposed HMEM-based speech synthesis system are explained in Section 3. Section 4 is dedicated to experimental results. Finally, Section 5 concludes this paper.
2 HSMM-based speech synthesis
This section aims to explain the predominant statistical modeling approach applied in speech synthesis, i.e., the context-dependent multi-space probability distribution left-to-right HSMM without skip transitions [3, 14] (simply called HSMM in the remainder of this paper). The discussion presented in this section provides a preliminary framework which will be used as a basis to introduce the proposed HMEM technique in Section 3. The most significant drawback of HSMM, namely inadequate context generalization, is also pointed out.
2.1 HSMM structure
where S(o_t) represents the set of all space indexes with the same dimensionality as o_t, and N(·; μ, Σ) denotes an l-dimensional Gaussian distribution with mean μ and covariance matrix Σ (the zero-dimensional Gaussian is defined to be 1). Furthermore, the output probability distribution of the i-th state and g-th space is denoted by b_{i|g}(o_t), which is a Gaussian distribution with mean vector μ_{ig} and covariance matrix Σ_{ig}. Also, m_i and σ_i^2 represent the mean and variance of the state duration probability.
where Y_o and Y_d are the decision trees trained for modeling output observation vectors and state durations, respectively. All symbols with superscript l indicate model parameters defined for the l-th leaf.
2.2 HSMM likelihood
where the initial forward and backward variables for every state index i are α_0(i) = 1 and β_T(i) = 1.
2.3 HSMM parameter re-estimation
2.4 Inefficient context generalization
It can be noticed from the definition of the function f_i(c; Y) in Equation 3 that this function can be viewed as a set of L(Y) non-overlapped binary contextual factors. The fact that these contextual factors are non-overlapped leads to insufficient context generalization, because it makes each training sample contribute to the model of only one leaf and only one Gaussian distribution. Hence, by extending f_i(c; Y) to overlapped contextual factors, more efficient context generalization capabilities can be achieved. Section 3 proposes an approach which enables the conventional structure to model overlapped contextual factors and thus improves the modeling performance for unseen contexts.
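The distinction drawn here can be made concrete with a toy feature extractor (the questions and leaves are invented for illustration): leaf indicators form a one-hot vector, so each context activates exactly one factor, while general overlapped questions may activate several.

```python
# Sketch: f_i(c; Y) as leaf indicators yields a one-hot feature vector --
# exactly one active entry per context -- whereas overlapped contextual
# factors may activate several entries at once.

def leaf_indicator_features(context):
    leaves = ["vowel_str", "vowel_unstr", "cons"]
    if context["is_vowel"]:
        active = "vowel_str" if context["stressed"] else "vowel_unstr"
    else:
        active = "cons"
    return [1 if leaf == active else 0 for leaf in leaves]

def overlapped_features(context):
    # General questions that are not mutually exclusive.
    return [int(context["is_vowel"]), int(context["stressed"]),
            int(context["is_vowel"] and context["stressed"])]

f = leaf_indicator_features({"is_vowel": True, "stressed": True})
g = overlapped_features({"is_vowel": True, "stressed": True})
print(sum(f))  # 1  (non-overlapped: one active factor)
print(sum(g))  # 3  (overlapped: several active factors)
```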
3 Hidden maximum entropy model
The goal of this section is to develop a context-dependent statistical model for acoustic parameters with adequate context generalization. The previous section on HSMM revealed that inappropriate generalization stemmed from the application of non-overlapped features only. Consequently, relating acoustic parameters to contextual information by incorporating overlapped features could improve generalization efficiency. This section proposes HMEM to establish this relation.
3.1 HMEM structure
The proposed HMEM technique exploits exactly the same structure and graphical model as the original HSMM; thus, the model likelihood expression given by Equation 5 is also valid for HMEM. The only difference between HSMM and HMEM is the way they incorporate contextual factors into the output and duration probability distributions. HSMM builds a decision tree and then trains a Gaussian distribution for each leaf of the tree. In contrast, HMEM follows the maximum entropy modeling approach, which is described in the next subsection.
3.1.1 Maximum entropy modeling
where x(c) denotes the realization of the ℓ-dimensional random vector x for the context c in the database. If there are multiple realizations of x, x(c) is obtained by averaging over those values. In sum, the proposed context-dependent acoustic modeling approach obtains the smoothest (maximum entropy) distribution that captures the first-order moments of the training data in the L_f regions indicated by the first set of contextual regions and the second-order moments of the data computed in the second set.
where H_l and u_l are model parameters related to the l-th contextual factors g_l(c) and f_l(c), respectively; H_l is an ℓ-by-ℓ matrix and u_l is an ℓ-dimensional vector. When f_l(c) becomes 1 (i.e., it is active), u_l affects the distribution; otherwise, it has no effect. In fact, Equation 19 is nothing but the well-known Gaussian distribution with mean vector −0.5H⁻¹u and covariance matrix H⁻¹, both calculated from a specific context-dependent combination of model parameters. Indeed, the main difference between MEM and other methods such as the spectral additive structure [19, 20] is that the mean and variance in MEM are not a linear combination of other parameters. This type of combination enables MEM to match the training data in all overlapped regions.
This form of context-dependent Gaussian distribution offers promising flexibility in utilizing contextual information. On one hand, using detailed and non-overlapped contextual factors, such as the features defined by Equation 3 (decision tree terminal node indicators), generates context-dependent Gaussian distributions identical to those used in conventional HSMM. These distributions have a straightforward and efficient training procedure but suffer from insufficient context generalization. On the other hand, incorporating general and highly overlapped contextual factors overcomes this shortcoming and provides efficient context generalization, but the training procedure becomes more computationally complex. In the case of highly overlapped contextual factors, an arbitrary context activates several contextual factors, and hence each observation vector is involved in modeling several model parameters.
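A small numerical sketch of this context-dependent Gaussian follows; the specific H_l and u_l values, and the rule of summing the parameters of the active factors, are illustrative assumptions consistent with the description above:

```python
import numpy as np

# Sketch of the context-dependent Gaussian of Equation 19: the active
# factors select which H_l and u_l enter the combination; the result is a
# Gaussian with covariance H^{-1} and mean -0.5 * H^{-1} u, so the mean is
# NOT a linear combination of per-factor means (unlike the additive model).

def mem_gaussian(g_active, f_active, H_list, u_list):
    H = sum(H_list[l] for l in g_active)   # second-moment (precision) part
    u = sum(u_list[l] for l in f_active)   # first-moment (linear) part
    cov = np.linalg.inv(H)
    mean = -0.5 * cov @ u
    return mean, cov

H_list = [np.eye(2), 2 * np.eye(2)]                      # hypothetical H_l
u_list = [np.array([2.0, 0.0]), np.array([0.0, -4.0])]   # hypothetical u_l
mean, cov = mem_gaussian([0, 1], [0, 1], H_list, u_list)
print(mean)  # mean = -0.5 * (1/3) * [2, -4] = [-1/3, 2/3]
```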
3.1.2 ME-based modeling vs. additive modeling
At first glance, the contextual additive structure [19, 20, 32, 37] seems to have the same capabilities as the proposed ME-based context-dependent acoustic modeling. Therefore, to clarify their differences, this section compares HMEM with the additive structure through a very simple example.
In this example, the goal is to model a one-dimensional observation value using both ME-based modeling and a contextual additive structure. Due to the prime importance of mean parameters in HMM-based speech synthesis, we investigate the difference between the mean values predicted by the two systems.
In contrast, Figure 3B shows the corresponding ME-based modeling approach. The previous subsection described that ME-based context-dependent modeling needs two sets of regions. This example assumes that the leaves of Q1 and Q2 are defined as the first set of regions and the leaves of Q3 as the second set. Therefore, according to the explanation of the previous subsection, the first empirical moments of Q1 and Q2, in addition to the second empirical moments of Q3, are captured by ME-based modeling. Figure 3B shows the estimated model mean values for all eight cubic clusters. As the figure shows, the model mean values estimated by ME-based modeling combine the parameters defined over the first set of regions with those defined over the second set. In fact, the proposed ME-based modeling is an extension of the additive structure that ties all covariance matrices. This is clear because if the second set is defined as a single region containing the whole contextual feature space, ME-based modeling reduces to the additive structure that ties all covariance matrices.
3.1.3 HMEM-based speech synthesis
In these equations, S(o_t) is the set of all possible spaces defined for o_t; the remaining symbols denote the duration model parameters and the output model parameters related to the l-th contextual factor, g-th space, and i-th state.
We can now probe the differences between HSMM and HMEM context-dependent acoustic modeling. The two modeling approaches are closely related: defining the HMEM contextual factors based on the decision trees described by Equation 3 reduces HMEM to HSMM. Accordingly, HMEM extends HSMM and enables its structure to exploit overlapped contextual factors.
Moreover, another significant conclusion that can be drawn from this section is that several HSMM concepts carry over directly to the HMEM framework. These include the Viterbi algorithm, the methods for calculating forward/backward variables and occupation probabilities, and even all parameter generation algorithms [26–28]. One only needs to define the mean vectors, covariance matrices, and space probabilities of HSMM in accordance with Equation 20.
3.2 HMEM parameter re-estimation
where γ_t(i, g) and its duration counterpart are defined in Section 2.3. Therefore, at every iteration of BFGS we need to compute the above gradient values, and BFGS then estimates new parameters that are closer to the optimum.
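As a stand-in for the BFGS update described here, the following toy sketch ascends a concave objective with plain gradient steps; the real system would feed the log-likelihood gradients above into a quasi-Newton (BFGS) update, and the objective below is an invented toy, not the HMEM auxiliary function:

```python
import numpy as np

# Minimal sketch of gradient-based re-estimation: each iteration evaluates
# the gradient of the objective at the current parameters and moves uphill.
# BFGS additionally maintains an approximate inverse Hessian; plain gradient
# ascent is used here to keep the sketch self-contained.

def gradient_ascent(grad_fn, theta, lr=0.1, iters=200):
    for _ in range(iters):
        theta = theta + lr * grad_fn(theta)   # ascend the objective
    return theta

# Toy concave "log-likelihood": L(theta) = -||theta - target||^2,
# whose gradient is -2 (theta - target) and whose maximizer is `target`.
target = np.array([1.0, -2.0])
grad = lambda th: -2.0 * (th - target)
theta_hat = gradient_ascent(grad, np.zeros(2))
print(np.round(theta_hat, 3))  # converges to [ 1. -2.]
```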
3.3 Decision tree-based context clustering
Statistical parametric speech synthesis systems typically exploit around 50 different types of contextual factors. For such a system, it is impossible to prepare training data covering all context-dependent models, and a large number of unseen models have to be predicted in the synthesis phase. Therefore, a context clustering approach such as decision tree-based clustering has to be used to handle unseen contexts [31, 45]. Due to the critical importance of context clustering algorithms in HMM-based speech synthesis systems, this section focuses on designing a clustering algorithm for HMEM.
As the discussion in this section shows, implementing the proposed architecture initially requires defining two sets of contextual regions. First- and second-order moment constraints have to be satisfied for all regions in the first and second sets, respectively. Before training, the first empirical moments of all regions in the first set and the second empirical moments of all regions in the second set are computed from the training data. Then, HMEM is trained to be consistent with these empirical moments. The major difficulty in defining these regions is finding a satisfactory balance between model complexity and the availability of training data. For limited training databases, a model with a small number of parameters, i.e., a small number of regions, has to be defined. In this case, bigger (strongly overlapped) contextual regions seem more desirable, because they alleviate the problem of weak context generalization. On the other hand, for large training databases, a larger number of contextual regions has to be defined to avoid under-fitting the model to the training data. In this case, smaller contextual regions can be applied to capture the details of the acoustic features. This section introduces an algorithm that defines multiple contextual regions for first- and second-order moments by considering the HMEM structure.
Due to the complex relationship between acoustic features and contextual factors, it is extremely difficult to find the optimal sets of contextual regions that maximize the likelihood for HMEM. For tractability, we made several simplifying assumptions to find a number of suboptimal contextual regions. These assumptions are as follows:
We used conventional binary decision tree structures to define both region sets. This is a common approach in many previous papers [19, 20, 23]. It should be noted that the decision tree is not the only possible structure for expressing the relationship between acoustic features and contextual factors; other approaches, such as neural networks or soft-clustering methods, could be applied as well. However, in this paper, we limit our discussion to the conventional binary decision tree structure.
Multiple decision trees are trained for the first-moment regions, and just one decision tree is constructed for the second-moment regions. In this way, the final HMEM preserves the first empirical moments of multiple decision trees and the second moments of just one decision tree. This assumption follows from the observation that first-order moments seem to be more important than second-order moments [32, 47].
The discussion in the current section shows that the ML estimates of the parameters defined for the two region sets depend significantly on each other. Therefore, at each step of decision tree construction, a BFGS optimization would have to be executed to re-estimate both sets of parameters simultaneously, and this procedure leads to an extreme amount of computational complexity. To alleviate this problem, it is proposed to borrow one region set from a baseline system (a conventional HMM-based speech synthesis system) and to construct the other independently.
In the HMEM structure, the first region set is responsible for providing a satisfactory clustering of the first-order moments (mean vectors). Similarly, contextual additive structures [19, 20, 37] that tie all covariance matrices offer multiple overlapped clusterings of mean vectors based on the likelihood criterion; therefore, an appropriate method is to borrow this clustering from the contextual additive structure.
However, training a contextual additive structure using the algorithms proposed in [19, 20] is still computationally expensive for large training databases (more than 500 sentences). Three modifications are applied to the algorithm proposed by Takaki et al. to reduce computational complexity: (i) the number of decision trees is fixed (in our experiments, an additive structure with four decision trees is built); (ii) questions are selected one by one for the different decision trees, so all trees are grown simultaneously and their sizes remain equal; (iii) in selecting the best pair of question and leaf, it is assumed that only the parameters of the candidate leaf change while all other parameters remain unchanged. The selection procedure is repeated until the total number of free parameters reaches the number of parameters trained for the baseline system (the HSMM-based speech synthesis system).
In sum, the final algorithm for determining the two region sets can be summarized as follows. One set is simply borrowed from a conventional HMM-based speech synthesis system. The other results from an independent context clustering algorithm that is a fast, simplified version of the contextual additive structure. This clustering algorithm builds four binary context-dependent decision trees simultaneously, and clustering finishes when the number of clusters reaches the number of leaves of the decision tree trained for the HSMM-based system.
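The simultaneous, equal-size growth of the four trees can be sketched as below; the question picker is a placeholder for the likelihood-based selection, and the budget plays the role of the baseline HSMM leaf count:

```python
# Sketch of the simplified parallel clustering: K trees grow together, one
# question at a time in rotation so their sizes stay equal, and growth stops
# once the leaf budget (the size reached by the baseline tree) is exhausted.

def grow_parallel_trees(num_trees, pick_question, budget):
    trees = {k: [] for k in range(num_trees)}   # accepted splits per tree
    n_leaves = num_trees                        # each tree starts as one root leaf
    k = 0
    while n_leaves < budget:
        trees[k].append(pick_question(k, trees[k]))  # best split for tree k
        n_leaves += 1                                # each split adds one leaf
        k = (k + 1) % num_trees                      # rotate over the trees
    return trees, n_leaves

# Placeholder question picker (hypothetical question names).
pick = lambda k, splits: f"tree{k}_question{len(splits)}"
trees, n = grow_parallel_trees(4, pick, budget=12)
print(n, [len(trees[k]) for k in range(4)])  # 12 [2, 2, 2, 2]
```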
4 Experimental results
We have conducted two sets of experiments. First, the performance of HMEM with heuristic context clusters is examined; second, the impact of the proposed decision tree-based context clustering method presented in Section 3.3 is evaluated.
4.1 Performance evaluation of HMEM with heuristic context clusters
This subsection compares HMEM-based acoustic modeling with the conventional HSMM-based method. Here, the contextual regions of HMEM are defined heuristically and are kept fixed across the different training database sizes.
4.1.1 Experimental conditions
A Persian speech database consisting of 1,000 utterances from a male speaker was used throughout our experiments. Sentences were between 5 and 20 words long and had an average duration of 8 s. This database was specifically designed for the purpose of speech synthesis: its sentences cover the most frequent Persian words, all bi-letter combinations, all bi-phoneme combinations, and the most frequent Persian syllables. In modeling the synthesis units, 31 phonemes were used, including silence. As presented in Section 4.1.2, a large variety of phonetic and linguistic contextual factors was considered in this work.
Speech signals were sampled at a rate of 16 kHz and windowed by a 25-ms Blackman window with a 5-ms shift. Forty mel-cepstral coefficients, five band aperiodicity parameters, the fundamental frequency, and their delta and delta-delta coefficients, all extracted by STRAIGHT, were employed as our acoustic features. In this experiment, the number of states was 5, and a multi-stream left-to-right MSD-HSMM without skip paths was trained as the traditional HSMM system. Decision trees were built using the maximum likelihood criterion, and their size was determined by the MDL principle. Additionally, the global variance (GV)-based parameter generation algorithm [20, 26] and the STRAIGHT vocoder were applied in the synthesis phase.
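The delta and delta-delta coefficients mentioned here are typically computed with a short regression window; the sketch below uses the simplest centered-difference window, which is a common choice but not necessarily the exact window used in this work:

```python
import numpy as np

# Sketch of dynamic feature computation for a static trajectory:
# delta       = centered first difference, 0.5 * (c[t+1] - c[t-1])
# delta-delta = second difference,         c[t+1] - 2 c[t] + c[t-1]
# Edge frames are handled by repeating the boundary values.

def add_dynamic_features(static):
    padded = np.pad(static, (1, 1), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])              # first difference
    ddelta = padded[2:] - 2 * padded[1:-1] + padded[:-2]  # second difference
    return np.stack([static, delta, ddelta], axis=1)      # (T, 3) features

traj = np.array([0.0, 1.0, 4.0, 9.0])
feats = add_dynamic_features(traj)
print(feats.shape)  # (4, 3)
```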
Both subjective and objective tests were carried out to compare HMEM using heuristic contextual regions with the traditional HSMM system. In our experiments, two synthesis systems, named HMEM1 and HMEM2, were developed based on the proposed approach. HMEM1 employs a small number of general, highly overlapped contextual factors designed carefully for each stream, while HMEM2 uses a larger number of contextual factors.
The number of leaf nodes for each stream in the different speech synthesis systems (table rows: the various synthesis systems; columns: the streams of acoustic features)
Experiments were conducted on five different training sets with 50, 100, 200, 400, and 800 utterances. Additionally, a fixed set of 200 utterances, not included in the training sets, was used for testing.
4.1.2 Employed contextual factors
In our experiments, contextual factors included phonetic-, syllable-, word-, phrase-, and sentence-level features. At each of these levels, both general and detailed features were considered. Features such as phoneme identity, syllable stress pattern, or word part-of-speech tag are examples of general features, while a feature like the position of the current phoneme is an example of a detailed one. Specific information on the contextual features is presented in this subsection.
➢ Phonetic-level features
Phoneme identity before the preceding phoneme; preceding, current, and succeeding phonemes; and phoneme identity after the next phoneme
Position of the current phoneme in the current syllable (forward and backward)
Whether this phoneme is ‘Ezafe’  or not (Ezafe is a special feature in Persian pronounced as a short vowel ‘e’ and relates two different words together. Ezafe is not written but is pronounced and has a profound effect on intonation)
➢ Syllable-level features
Stress level of this syllable (five different stress levels are defined for our speech database)
Position of the current syllable in the current word and phrase (forward and backward)
Type of the current syllable (syllables in Persian language are structured as CV, CVC, or CVCC, where C and V denote consonants and vowels, respectively)
Number of the stressed syllables before and after the current syllable in the current phrase
Number of syllables from the previous stressed syllable to the current syllable
Vowel identity of the current syllable
➢ Word-level features
Part-of-speech (POS) tag of the preceding, current, and succeeding word
Position of the current word in the current sentence (forward and backward)
Whether the current word contains ‘Ezafe’ or not
Whether this word is the last word in the sentence or not
➢ Phrase-level features
Number of syllables in the preceding, current, and succeeding phrase
Position of the current phrase in the sentence (forward and backward)
➢ Sentence-level features
Number of syllables, words, and phrases in the current sentence
Type of the current sentence
4.1.3 Illustrative example
With limited training sets, HSMM produces sudden transitions between adjacent states. This drawback results from decision tree-clustered context-dependent modeling: when little data is available for training, the number of leaves in the decision tree is reduced, so the distance between the mean vectors of adjacent states can become large. Even the parameter generation algorithm proposed in [26–28] cannot compensate for such jumps. In such cases, the quality of synthetic speech produced by HSMM is expected to deteriorate.
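For context, the parameter generation algorithms of [26–28] choose the static trajectory that best fits the state means of both static and dynamic features by solving W'U⁻¹Wc = W'U⁻¹m. The one-dimensional sketch below (illustrative delta window and a naive dense solver, not the paper's implementation) shows the mechanism:

```python
# Sketch of ML parameter generation with dynamic features (no GV term):
# solve W' U^-1 W c = W' U^-1 m for the static trajectory c, where W stacks
# the static and delta windows, m the state means, and U diagonal variances.

def generate_trajectory(mean_s, var_s, mean_d, var_d):
    T = len(mean_s)
    A = [[0.0] * T for _ in range(T)]   # A = W' U^-1 W
    b = [0.0] * T                       # b = W' U^-1 m
    for t in range(T):
        # Static row: selects c[t] directly.
        A[t][t] += 1.0 / var_s[t]
        b[t] += mean_s[t] / var_s[t]
        # Delta row: 0.5 * (c[t+1] - c[t-1]), clamped at the edges.
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        if lo != hi:
            w = {lo: -0.5, hi: 0.5}
            for i, wi in w.items():
                b[i] += wi * mean_d[t] / var_d[t]
                for j, wj in w.items():
                    A[i][j] += wi * wj / var_d[t]
    # Naive Gauss-Jordan elimination; real systems exploit the band structure.
    for k in range(T):
        piv = A[k][k]
        for j in range(k, T):
            A[k][j] /= piv
        b[k] /= piv
        for i in range(T):
            if i != k and A[i][k] != 0.0:
                f = A[i][k]
                for j in range(k, T):
                    A[i][j] -= f * A[k][j]
                b[i] -= f * b[k]
    return b
```

The delta constraints smooth the output, but when adjacent state means are far apart the generated trajectory can only soften the jump, not remove it.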
In contrast, if adjacent states are allowed to share common active contextual factors, the mean vectors vary more smoothly across state transitions. This is the key idea of HMEM, and it makes it possible to outperform HSMM when data are limited. However, the use of overlapped contextual factors in HMEM results in an over-smoothing problem as the size of the training data increases. Therefore, detailed contextual factors are additionally considered in HMEM2 to alleviate the over-smoothing issue.
4.1.4 Objective evaluation
The average mel-cepstral distortion (MCD) and the root-mean-square (RMS) error of phoneme durations (expressed in number of frames) were selected as the metrics for our objective assessment. For both measures, the state boundaries (state durations) were determined by Viterbi alignment against the speaker's real utterance.
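These metrics can be sketched as follows, assuming the standard MCD definition (10/ln 10)·sqrt(2·Σ_d (c_d − c'_d)²) averaged over aligned frames and RMS duration error in frames; the Viterbi alignment itself is assumed already done:

```python
import math

# Sketch of the two objective metrics; frame pairs are assumed pre-aligned,
# and the mel-cepstral dimensions passed in are assumed to exclude c0.

def mel_cepstral_distortion(ref, syn):
    """Average MCD (dB) over aligned frames of mel-cepstral vectors."""
    total = 0.0
    for c_ref, c_syn in zip(ref, syn):
        sq = sum((a - b) ** 2 for a, b in zip(c_ref, c_syn))
        total += (10.0 / math.log(10)) * math.sqrt(2.0 * sq)
    return total / len(ref)

def rms_duration_error(ref_frames, syn_frames):
    """RMS error between reference and synthesized phoneme durations."""
    n = len(ref_frames)
    return math.sqrt(sum((r - s) ** 2 for r, s in zip(ref_frames, syn_frames)) / n)
```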
In summary, from these figures and the illustrative example presented earlier, we can see that when the available data are limited, all features (log F0, duration, and spectra) of the synthetic speech generated by HMEM are closer to the original features than those obtained with HSMM. However, when the training database is large, the HSMM-based method performs better than HMEM. Nevertheless, employing more detailed features helps the proposed method approach the quality of HSMM-based synthetic speech.
Table: FN, FP, TN, and TP rates of detecting voiced/unvoiced regions through HMEM2 and the HSMM-based method (rows: number of training utterances; columns: rates over really voiced and really unvoiced frames, in %).
Table: Accuracy of the voiced/unvoiced detector (rows: number of training utterances; columns: HMEM2 accuracy (%) and HSMM accuracy (%)).
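The entries of these tables can be derived from frame-level voiced/unvoiced decisions. A minimal sketch (the helper below is hypothetical, with rates normalized per really-voiced and really-unvoiced frames as in the tables):

```python
# Sketch of the voiced/unvoiced scoring: TP/FN rates are computed over the
# really-voiced frames, TN/FP rates over the really-unvoiced frames, and
# accuracy over all frames.

def vuv_rates(reference, predicted):
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    tn = sum(1 for r, p in zip(reference, predicted) if not r and not p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    voiced = tp + fn        # really voiced frames
    unvoiced = tn + fp      # really unvoiced frames
    return {
        "TP%": 100.0 * tp / voiced,
        "FN%": 100.0 * fn / voiced,
        "TN%": 100.0 * tn / unvoiced,
        "FP%": 100.0 * fp / unvoiced,
        "accuracy%": 100.0 * (tp + tn) / (voiced + unvoiced),
    }
```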
4.1.5 Subjective evaluation
Twenty native participants were asked to listen to ten randomly chosen pairs of synthesized speech samples generated by two different systems (selected arbitrarily among HMEM1, HMEM2, and HSMM).
Remarkably, the proposed systems are of particular interest when the training data are limited (i.e., 50, 100, and 200 utterances), which is in line with the conclusions of the objective assessments. The superiority of HMEM1 over HSMM and HMEM2 is clear for the training sets containing 50 and 100 utterances; in other words, general contextual factors lead the proposed system to a better performance when the amount of training data is very small. As the number of training utterances increases, detailed features help the proposed system achieve more effective synthetic speech, and HMEM2 therefore surpasses HMEM1 for training sets with 200 or more utterances. However, for relatively large training sets (400 and 800 utterances), the use of HSMM is recommended.
Table 1 compares the number of leaf nodes in the different speech synthesis systems. It can be seen that, for the mgc stream, HMEM2 exploits more parameters than HSMM-400 and HSMM-800, yet the objective evaluations presented in Figure 6 show that HSMM-400 and HSMM-800 achieve better mel-cepstral distances. This shows that HMEM with heuristic contextual clusters cannot exploit its model parameters efficiently. In fact, many contextual regions in HMEM1 and HMEM2 are redundant, so their corresponding parameters are not useful. The next section evaluates the performance of HMEM with the suboptimal context clustering algorithm proposed in Section 3.3; this clustering algorithm selects appropriate contextual regions and consequently solves the aforementioned problem.
4.2 Performance evaluation of HMEM with decision tree-based context clustering
This section is dedicated to the second set of experiments, which evaluate the performance of HMEM with the decision tree construction algorithm proposed in Section 3.3. As realized from the first set of experiments, HMEM with heuristic, naïve contextual regions cannot outperform HSMM on large training databases. This section shows that, by employing appropriate sets of contextual regions, HMEM outperforms HSMM even for large databases.
4.2.1 Experimental conditions
Experiments were carried out on Nick, a British male database collected at the University of Edinburgh. The database consists of 2,500 utterances from a male speaker; each sentence is about 5 s of speech. We used five training sets of 50, 100, 200, 400, and 800 utterances, and 200 sentences not included in the training sets served as test data. Speech signals were sampled at 48 kHz and windowed by a 25-ms Blackman window with a 5-ms shift. The database was specifically designed for speech synthesis research, its utterances cover the most frequent English words, and various segmental and suprasegmental contextual factors were extracted for it.
The speech analysis conditions and model topologies of CSTR/EMIME HTS 2010 were used in this experiment. Bark cepstrum was extracted from smooth STRAIGHT trajectories. Also, instead of log F0 and five frequency sub-bands (0 to 1, 1 to 2, 2 to 4, 4 to 6, and 6 to 8 kHz), pitch on the mel scale and auditory-scale-motivated frequency bands for the aperiodicity measure were applied. The analysis process yielded 40 bark cepstrum coefficients, one mel-scale pitch value, and 25 auditory-scale-motivated band aperiodicity parameters for each frame of the training speech signals. These parameters, together with their delta and delta-delta counterparts, were taken as the observation vectors of the statistical parametric model.
A five-state multi-stream left-to-right MSD-HSMM with no skip path was trained as the baseline system. The conventional maximum likelihood-based decision tree clustering algorithm was used to tie HMM states, and the MDL criterion was used to determine the size of the decision trees.
In order to have a fair comparison, the proposed system (HMEM with decision tree structure) was trained with the same number of free model parameters as the baseline system. HMEM was trained using the decision tree construction algorithm presented in Section 3.3 and the parameter re-estimation algorithm proposed in Section 3.2. It should be noted that four decision trees were built for the acoustic feature streams and one decision tree for state duration. After training the acoustic models, in the synthesis phase, the GV-based parameter generation algorithm [20, 26] and the STRAIGHT synthesis module generated the synthesized speech signals. Both subjective and objective tests were conducted to compare HMEM with decision tree-based clusters against traditional HSMM-based synthesis.
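The GV-based generation step of [20, 26] optimizes a likelihood augmented with a global variance term. A common cheap approximation of its effect, shown here only as intuition (it is not the algorithm actually used in this system), rescales the generated trajectory so its global variance matches a target:

```python
import math

# Intuition-only sketch: GV-based generation counteracts over-smoothing by
# restoring the utterance-level variance of the generated parameters. Here we
# simply rescale a 1-D trajectory to a target global variance.

def match_global_variance(traj, target_var):
    n = len(traj)
    mean = sum(traj) / n
    var = sum((x - mean) ** 2 for x in traj) / n
    scale = math.sqrt(target_var / var)
    # Rescale deviations around the mean so the variance hits the target
    return [mean + scale * (x - mean) for x in traj]
```

The actual algorithm instead maximizes the HSMM output probability jointly with a GV likelihood, so it balances smoothness against variance rather than forcing the variance exactly.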
It is worth mentioning that training the proposed HMEM structure with decision tree-based context clustering took approximately 5 days for 800 training sentences, whereas training the corresponding HSMM-based synthesis system took approximately 16 h.
4.2.2 Employed contextual factors
➢ Phonetic-level features
Position of the current phoneme in the current syllable, word, phrase, and sentence
➢ Syllable-level features
Stress level of previous, current, and next syllable (three different stress levels are defined for this database)
Position of the current syllable in the current word, phrase, and sentence
Number of phonemes in the previous, current, and next syllable
Whether the previous, current, and next syllable is accented or not
Number of the stressed syllables before and after the current syllable in the current phrase
Number of syllables from the previous stressed syllable to the current syllable
Number of syllables from the previous accented syllable to the current syllable
➢ Word-level features
Part-of-speech (POS) tag of the preceding, current, and succeeding word
Position of the current word in the current phrase and sentence (forward and backward)
Number of syllables of the previous, current, and next word
Number of content words before and after current word in the current phrase
Number of words from previous and next content word
➢ Phrase-level features
Number of syllables and words of the preceding, current, and succeeding phrase
Position of the current phrase in the sentence
Current phrase ToBI end tone
➢ Sentence-level features
Number of phonemes, syllables, words, and phrases in the current utterance
Type of the current sentence
4.2.3 Objective evaluation
4.2.4 Subjective evaluation
Both the CMOS test and the preference scores confirm the superiority of the proposed method over HSMM for all database sizes. Thus, if the context clusters are determined through an effective approach, the proposed HMEM outperforms HSMM.
This paper addressed the main shortcoming of HSMM-based context-dependent acoustic modeling, namely inadequate context generalization. HSMM uses decision tree-based context clustering, which does not generalize efficiently because each acoustic feature vector is associated with only one context cluster. To alleviate this problem, this paper proposed HMEM, a new acoustic modeling technique based on the maximum entropy modeling approach. HMEM improves on HSMM by enabling its structure to take advantage of overlapped contextual factors, and it can therefore provide superior context generalization. Experimental results using objective and subjective criteria showed that the proposed system outperforms HSMM.
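To make the maximum entropy principle behind HMEM concrete: among all distributions over a finite set that satisfy a moment constraint E[f(x)] = m, the entropy-maximizing one has the exponential form p(x) ∝ exp(λ·f(x)). The toy sketch below (with f the identity and λ found by bisection; this is not the paper's HMEM estimator) illustrates the idea:

```python
import math

# Toy illustration: fit the maximum entropy distribution over a finite set
# subject to a single mean constraint. The solution is exponential-family,
# p(x) proportional to exp(lam * x); we tune lam until the mean matches.

def maxent_distribution(values, target_mean, lo=-50.0, hi=50.0):
    def mean_for(lam):
        w = [math.exp(lam * v) for v in values]
        z = sum(w)
        return sum(v * wi for v, wi in zip(values, w)) / z

    # mean_for is monotone increasing in lam, so bisection converges.
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    w = [math.exp(lam * v) for v in values]
    z = sum(w)
    return [wi / z for wi in w]
```

With multiple overlapping moment constraints, as in HMEM, the same principle yields an exponential model with one weight per constraint, which is what lets overlapped contextual regions share statistical strength.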
Despite these advantages, which enabled our system to outperform HSMM, the proposed approach suffers from a computationally complex training procedure on large databases.
- Zen H, Tokuda K, Black AW: Statistical parametric speech synthesis. Speech Comm. 2009, 51(11):1039-1064. doi:10.1016/j.specom.2009.04.004
- Black AW, Zen H, Tokuda K: Statistical parametric speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4. Honolulu, Hawaii, USA; 2007:IV1229-IV1232.
- Yamagishi J, Kobayashi T: Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. IEICE Trans. Info. Syst. 2007, 90(2):533-543.
- Yamagishi J, Nose T, Zen H, Ling ZH, Toda T, Tokuda K, King S, Renals S: Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Trans. Audio Speech Lang. Process. 2009, 17(6):1208-1230.
- Yamagishi J, Kobayashi T, Nakano Y, Ogata K, Isogai J: Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Trans. Audio Speech Lang. Process. 2009, 17(1):66-83.
- Wu YJ, Nankaku Y, Tokuda K: State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis. In INTERSPEECH. Brighton, UK; 2009:528-531.
- Liang H, Dines J, Saheer L: A comparison of supervised and unsupervised cross-lingual speaker adaptation approaches for HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Dallas, Texas, USA; 2010:4598-4601.
- Gibson M, Hirsimaki T, Karhila R, Kurimo M, Byrne W: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Dallas, Texas, USA; 2010:4642-4645.
- Yamagishi J, Ling Z, King S: Robustness of HMM-based speech synthesis. In INTERSPEECH. Brisbane, Australia; 2008:581-584.
- Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T: Mixed excitation for HMM-based speech synthesis. In INTERSPEECH. Aalborg, Denmark; 2001:2263-2266.
- Kawahara H, Masuda-Katsuse I, de Cheveigné A: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Comm. 1999, 27(3):187-207.
- Drugman T, Wilfart G, Dutoit T: A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis. In INTERSPEECH. Brighton, UK; 2009:1779-1782.
- Drugman T, Dutoit T: The deterministic plus stochastic model of the residual signal and its applications. IEEE Trans. Audio Speech Lang. Process. 2012, 20(3):968-981.
- Zen H, Tokuda K, Masuko T, Kobayashi T, Kitamura T: A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Info. Syst. 2007, 90(5):825.
- Hashimoto K, Nankaku Y, Tokuda K: A Bayesian approach to hidden semi-Markov model based speech synthesis. In INTERSPEECH. Brighton, UK; 2009:1751-1754.
- Hashimoto K, Zen H, Nankaku Y, Masuko T, Tokuda K: A Bayesian approach to HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Taipei, Taiwan; 2009:4029-4032.
- Tokuda K, Masuko T, Miyazaki N, Kobayashi T: Multi-space probability distribution HMM. IEICE Trans. Info. Syst. 2002, 85(3):455-464.
- Zen H, Senior A, Schuster M: Statistical parametric speech synthesis using deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, British Columbia, Canada; 2013:7962-7966.
- Takaki S, Nankaku Y, Tokuda K: Spectral modeling with contextual additive structure for HMM-based speech synthesis. In Proceedings of the 7th ISCA Speech Synthesis Workshop. Kyoto, Japan; 2010:100-105.
- Takaki S, Nankaku Y, Tokuda K: Contextual partial additive structure for HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, British Columbia, Canada; 2013:7878-7882.
- Gales MJ: Cluster adaptive training of hidden Markov models. IEEE Trans. Speech Audio Process. 2000, 8(4):417-428. doi:10.1109/89.848223
- Zen H, Gales MJ, Nankaku Y, Tokuda K: Product of experts for statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process. 2012, 20(3):794-805.
- Yu K, Zen H, Mairesse F, Young S: Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis. Speech Comm. 2011, 53(6):914-923. doi:10.1016/j.specom.2011.03.003
- Toda T, Young S: Trajectory training considering global variance for HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Taipei, Taiwan; 2009:4025-4028.
- Qin L, Wu YJ, Ling ZH, Wang RH, Dai LR: Minimum generation error criterion considering global/local variance for HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Las Vegas, Nevada, USA; 2008:4621-4624.
- Toda T, Tokuda K: Speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Info. Syst. 2007, E90-D(5):816-824. doi:10.1093/ietisy/e90-d.5.816
- Tokuda K, Yoshimura T, Masuko T, Kobayashi T, Kitamura T: Speech parameter generation algorithms for HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3. Istanbul, Turkey; 2000:1315-1318.
- Tokuda K, Kobayashi T, Imai S: Speech parameter generation from HMM using dynamic features. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1. Detroit, Michigan, USA; 1995:660-663.
- Comparing glottal-flow-excited statistical parametric speech synthesis methods. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Vancouver, British Columbia, Canada; 2013:7830-7834.
- Yoshimura T, Tokuda K, Masuko T, Kobayashi T, Kitamura T: Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proceedings of Eurospeech. 1999:2347-2350.
- Young SJ, Odell JJ, Woodland PC: Tree-based state tying for high accuracy acoustic modeling. In Proceedings of the Workshop on Human Language Technology, Association for Computational Linguistics. 1994:307-312.
- Leggetter CJ, Woodland PC: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Lang. 1995, 9(2).
- Digalakis VV, Neumeyer LG: Speaker adaptation using combined transformation and Bayesian methods. IEEE Trans. Speech Audio Process. 1996, 4(4):294-300. doi:10.1109/89.506933
- Zen H, Braunschweiler N, Buchholz S, Gales MJ, Knill K, Krstulovic S, Latorre J: Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Audio Speech Lang. Process. 2012, 20(6):1713-1724.
- Koriyama T, Nose T, Kobayashi T: Statistical parametric speech synthesis based on Gaussian process regression. IEEE J. Sel. Topics Signal Process. 2013:1-11.
- Yu K, Mairesse F, Young S: Word-level emphasis modeling in HMM-based speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Dallas, Texas, USA; 2010:4238-4241.
- Zen H, Braunschweiler N: Context-dependent additive log F0 model for HMM-based speech synthesis. In INTERSPEECH. Brighton, UK; 2009:2091-2094.
- Sakai S: Additive modeling of English F0 contour for speech synthesis. In Proceedings of ICASSP. Las Vegas, Nevada, USA; 2008:277-280.
- Qian Y, Liang H, Soong FK: Generating natural F0 trajectory with additive trees. In INTERSPEECH. Brisbane, Australia; 2008:2126-2129.
- Wu YJ, Soong F: Modeling pitch trajectory by hierarchical HMM with minimum generation error training. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan; 2012:4017-4020.
- Berger AL, Della Pietra VJ, Della Pietra SA: A maximum entropy approach to natural language processing. Comput. Ling. 1996, 22:39-71.
- Borthwick A: A maximum entropy approach to named entity recognition. PhD dissertation, New York University; 1999.
- Rangarajan V, Narayanan S, Bangalore S: Exploiting acoustic and syntactic features for prosody labeling in a maximum entropy framework. In Proceedings of NAACL HLT. 2007:1-8.
- Ratnaparkhi A: A maximum entropy model for part-of-speech tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1996, 1:133-142.
- Odell JJ: The use of context in large vocabulary speech recognition. PhD dissertation, Cambridge University; 1995.
- Shinoda K, Takao W: MDL-based context-dependent subword modeling for speech recognition. J. Acoust. Soc. Jpn. 2000, 21(2):79-86. doi:10.1250/ast.21.79
- Oura K, Zen H, Nankaku Y, Lee A, Tokuda K: A covariance-tying technique for HMM-based speech synthesis. IEICE Trans. Info. Syst. 2010, E93-D(3):595-601.
- Nocedal J, Wright SJ: Numerical Optimization. Springer, USA; 1999.
- Bijankhan M, Sheikhzadegan J, Roohani MR, Samareh Y, Lucas C, Tebiani M: The speech database of Farsi spoken language. In Proceedings of the 5th Australian International Conference on Speech Science and Technology (SST). 1994:826-831.
- Ghomeshi J: Non-projecting nouns and the ezafe construction in Persian. Nat. Lang. Ling. Theor. 1997, 15(4):729-788. doi:10.1023/A:1005886709040
- Kubichek R: Mel-cepstral distance measure for objective speech quality assessment. In IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, vol. 1. 1993:125-128.
- Picart B, Drugman T, Dutoit T: Continuous control of the degree of articulation in HMM-based speech synthesis. In INTERSPEECH. Florence, Italy; 2011:1797-1800.
- Yamagishi J: Average-Voice-Based Speech Synthesis. PhD dissertation, Tokyo Institute of Technology, Yokohama; 2006.
- Yamagishi J, Watts O: The CSTR/EMIME HTS system for Blizzard Challenge 2010. In Proceedings of Blizzard Challenge 2010. Kyoto, Japan; 2010:1-6.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.