A uniform phase representation for the harmonic model in speech synthesis applications
© Degottex and Erro; licensee Springer. 2014
Received: 27 February 2014
Accepted: 7 July 2014
Published: 16 October 2014
Feature-based vocoders, e.g., STRAIGHT, offer a way to manipulate the perceived characteristics of the speech signal in speech transformation and synthesis. For the harmonic model, which provides excellent perceived quality, features for the amplitude parameters already exist (e.g., Line Spectral Frequencies (LSF), Mel-Frequency Cepstral Coefficients (MFCC)). However, because of the wrapping of the phase parameters, phase features are more difficult to design. To randomize the phase of the harmonic model during synthesis, a voicing feature is commonly used, which distinguishes voiced and unvoiced segments. However, voice production allows smooth transitions between voiced and unvoiced states, which sometimes makes the voicing segmentation tricky to estimate. In this article, two phase features are suggested to represent the phase of the harmonic model in a uniform way, without voicing decision. The synthesis quality of the resulting vocoder has been evaluated, using subjective listening tests, in the context of resynthesis, pitch scaling, and Hidden Markov Model (HMM)-based synthesis. The experiments show that the suggested signal model is comparable to STRAIGHT or even better in some scenarios. They also reveal some limitations of the harmonic framework itself in the case of high fundamental frequencies.
Parametric speech signal representations are necessary in almost every field of speech technologies: speech and speaker recognition, speech and speaker transformation, synthesis, diarization, etc. Each of these fields, however, requires a particular type of parametrization scheme. Thus, while low-dimensional filter bank based Mel-Frequency Cepstral Coefficients (MFCC) are sufficiently accurate for recognition purposes, they are not suitable for speech reconstruction by themselves. Indeed, applications involving spoken outputs, such as speech coding, require the speech signals to be represented by a set of features yielding almost transparent analysis/resynthesis. Voice transformation and speech synthesis impose even stricter requirements, since the parametric speech representations they deal with must provide a solid and flexible framework to sculpt all the characteristics of the speech sounds through direct manipulation of the features. Interestingly, recent statistical trends are also encouraging research on parametric speech representations with a constant number of parameters and with good mathematical properties.
Sinusoidal models represent the speech signal by means of a sum of sinusoids given by their instantaneous frequency, amplitude, and phase. These models have been widely used for speech analysis, resynthesis, and modification. Sinusoidal models have evolved over the years, and recently, the so-called adaptive Harmonic Model (aHM) has also been shown to yield practically transparent analysis/resynthesis and excellent modification performance. Despite the inherent assumption that speech can be represented only by harmonic sinusoidal components, even in unvoiced segments, aHM succeeds at capturing the relevant spectral information and noisy nature of a speech signal, and thus represents the speech signal in a uniform way, without using any voicing decision. As long as the perceptual information carried in the phase is preserved, the uniform way of describing and manipulating signals is a remarkable practical advantage of aHM with respect to alternative models involving an explicit separation between harmonics and noise, for two main reasons: (i) locating the voicing boundaries is an error-prone process and (ii) in voice transformation, such a separation implies the need for two independent modification procedures, one for each component, which increases the risk that listeners perceive them as two independent output signals.
The features handled by aHM are not directly compatible with methods involving statistical modeling because the amplitude and phase parameters lie on the harmonic grid, which depends on the fundamental frequency f0. To avoid this issue, amplitude and phase information have to be isolated from f0 and translated into independent parameters. However, while amplitude envelopes are relatively easy to obtain through interpolation between sinusoidal amplitudes, the representation of phase remains an open problem. Recent attempts at obtaining a consistent phase envelope provide features which are theoretically valid in voiced time-frequency regions but are not informative in unvoiced ones. Thus, standard speech parametrization systems used in statistical frameworks tend to discard the phase information. Instead, they rely on a minimum-phase component derived from the amplitude envelope, along with complementary parameters related to the degree of harmonicity in different time-frequency regions, such as band aperiodicities or maximum voiced frequency.
This article presents a novel phase representation that has been designed to handle, in a uniform manner across time, all the relevant information conveyed by the phase parameters of a full-band aHM model, namely the maximum-phase component and the noisiness. This is done through the following steps: first, aHM analysis is performed to obtain the instantaneous phase from the waveform; then, the minimum-phase term is subtracted from the measured phases, and the local Phase Distortion (PD) is calculated; finally, the short-time mean and standard deviation of the PD are computed in the neighborhood of each frame, the former being highly correlated to the maximum-phase component, and the latter to the degree of noisiness. Among the advantages of this novel approach, we can mention the following: (i) It is valid for analyzing signals exhibiting harmonic and noise components that overlap both in time and in frequency, thus avoiding binary voiced/unvoiced decisions, which are error-prone and result in annoying artifacts, especially in synthesis. (ii) Since it avoids an explicit separation between harmonics and noise, it provides a solid and uniform framework for speech manipulation, thus avoiding artifacts near the voicing boundaries. (iii) It can easily be made compatible with statistical frameworks. Moreover, given the continuous nature of the feature streams, the use of Multi-Space Distributions (MSD) can be avoided. In that sense, the involved training and generation procedures can be simplified. In addition to these advantages, the suggested phase representation facilitates the study of the perceptual importance of the maximum-phase component and the degree of noisiness, which are linked to separate features. Indeed, phase perception is still a source of controversy in speech processing.
The next section first summarizes the low-level analysis of the speech signal using the aHM model. Then, the novel phase features based on the mean and standard deviation of PD are described in detail, followed by the description of the synthesis step. Finally, the evaluation section shows the importance of the features and demonstrates the feasibility of the suggested representation in the context of voice transformation and speech synthesis.
2 The adaptive harmonic model
For this work, the Adaptive Iterative Refinement (AIR) algorithm is used to refine the f0(t) values, and the sinusoidal parameters (ai,h and ϕi,h) are estimated using the Least Squares (LS) solution.
Contrary to the conventional harmonic model, the aHM model uses a full-band non-stationary frequency basis. This mainly makes it possible to represent a whole speech recording using a single and continuous harmonic structure during both analysis and synthesis steps. This structural property is very convenient for the phase models and processing used in this work.
Also, assuming that an f0(t) curve can be obtained in unvoiced segments without substantial erratic jumps, it has been shown that aHM can represent both voiced and unvoiced segments uniformly, without voicing decision. Given the goal of this work, this property is a necessary prerequisite. Additionally, together with its harmonic tracking algorithm (i.e., AIR), this model almost always provides the most accurate and precise sinusoidal parameters compared to state-of-the-art methods. This accuracy and precision might not be critical for obtaining the results of this article. However, it minimizes the influence of the sinusoidal parameter estimation on the results, thus strengthening the link between the suggested phase processing techniques and the results obtained. Finally, like the conventional harmonic model, the resynthesis obtained by aHM is almost indistinguishable from the original recording. This ensures that the aforementioned properties come with no perceptual degradation.
Despite the advantages of aHM, the sinusoidal parameters ai,h, ϕi,h, and fi,h=h f0(t i ) lie on the harmonic grid, which is not convenient for manipulation of the perceived characteristics or for statistical modeling. Moreover, the instantaneous phase parameter ϕi,h constantly wraps from one instant to the next, which makes its modeling far from straightforward. In the following steps, we aim at building amplitude and phase features which are independent of the harmonic structure and we focus on the modeling of the instantaneous phase.
2.1 A simple representation of the amplitude
The minimum-phase response ∠ V i (f) is derived from the real cepstrum of |V i (f)|, i.e., the inverse Fourier transform of log |V i (f)|. Modeling the amplitude envelope is a well-investigated subject, and it is outside the scope of this article. In order to estimate |V i (f)| in a robust and simple way, we used a linear interpolation of ai,h across frequency on a discrete scale of 512 frequency bins up to the Nyquist frequency. However, for the sake of clarity, the continuous notation in hertz will be used in the following.
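The two operations mentioned here (linear interpolation of the harmonic amplitudes onto 512 bins, and a minimum-phase response obtained from the real cepstrum) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the clamping of the envelope below the first and above the last harmonic (the default behavior of `np.interp`), and the small floor applied before the logarithm are all assumptions.

```python
import numpy as np

def amplitude_envelope(f0, amps, fs, nbins=512):
    """Linearly interpolate harmonic amplitudes onto nbins bins from 0 Hz to Nyquist."""
    freqs = np.linspace(0.0, fs / 2.0, nbins)
    harm_freqs = f0 * np.arange(1, len(amps) + 1)
    # np.interp holds the first/last amplitude outside the harmonic range (assumption).
    return freqs, np.interp(freqs, harm_freqs, amps)

def minimum_phase(env):
    """Minimum-phase response (radians) of a half-spectrum amplitude envelope."""
    n = 2 * (len(env) - 1)                        # length of the full symmetric spectrum
    log_half = np.log(np.maximum(env, 1e-12))     # floor avoids log(0) (assumption)
    log_full = np.concatenate([log_half, log_half[-2:0:-1]])
    cep = np.fft.ifft(log_full).real              # real cepstrum of the log-amplitude
    fold = np.zeros(n)                            # fold the cepstrum onto its causal part
    fold[0] = cep[0]
    fold[1:n // 2] = 2.0 * cep[1:n // 2]
    fold[n // 2] = cep[n // 2]
    return np.fft.fft(fold).imag[: len(env)]      # imaginary part = minimum phase
```

Folding the real cepstrum onto its causal part is the standard way to turn a log-amplitude spectrum into the complex log-spectrum of a minimum-phase system, whose imaginary part is the phase sought here.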
Even though the assumption of a spectrally flat source is widely used, it is also known that this hypothesis is basically wrong since the glottal pulses have a low-pass characteristic. Therefore, in this work, |V i (f)| encompasses the amplitude spectra of both the glottal source and the vocal tract. Nevertheless, using PD, it has been shown that this assumption makes it possible to extract glottal source information which is almost independent of the vocal tract filter. Indeed, this property was critical for the estimation of glottal model parameters using phase minimization. For the work presented in this article, this same property ensures that the impact of the vocal tract filter on the phase features representing the source is minimal. On the contrary, the impact of the glottal source on the vocal tract feature is far from negligible, which is not convenient. However, a robust separation of the vocal tract filter and the glottal source is far from straightforward. Thus, in this work, we again chose to favor robustness and simplicity, in order to focus first on the phase features. Interposing a separation process within the presented phase feature extraction can be part of future work.
3 Representations of the phase
In this section, we first describe the analytical model of the instantaneous phase used in this work. State-of-the-art phase processing techniques are then described and discussed analytically using this model. Finally, the novel characterization of the short-term statistics of PD is described.
3.1 Theoretical model of the instantaneous phase
In order to represent the instantaneous phase parameter ϕi,h, models have already been suggested for phase synchronization between frames and for speech coding. In this work, we represent the measured ϕi,h using a similar model, whose terms are described below.
In voiced segments, each glottal pulse of the glottal source has a shape which has mainly maximum-phase characteristics. This glottal pulse shape also has a position in time c i . Speech processing techniques often define c i as glottal closure instants, as local energy maxima of a residual signal, or as pitch pulse onsets, in order to center windows and to synchronize instantaneous phase parameters.
Even though such a definition is necessary for many approaches, we will show below that it is not necessary when using the Relative Phase Shift (RPS) or PD, which avoids an extra estimation procedure and its potential estimation errors. In unvoiced segments, one can assume that this shape is basically random for each frame. In the model, the source shape term θi,h represents this pulse shape in both voiced and unvoiced segments. Since the analysis windows are not centered on each c i (i.e., c i ≠t i ), a linear phase term is also necessary in order to represent the time delay between t i (the window’s center) and the position of the source shape c i . In the literature, assuming the frequency structure is stationary within a frame, i.e., f0(t)=f0(t i ), this term is often simplified to a term which is linear in both frequency and time, i.e., h(2π f0(t i )/f s )(t i −c i ). Conversely, in the model used here, we use the integral form since the harmonic structure is not stationary in the aHM model.
Finally, according to the voice production model (Equation 3), the voice source is convolved with the vocal tract filter impulse response. Thus, we add the minimum-phase ∠ V i (ω) to the model.
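Collecting the three terms described above, the phase model can be sketched as follows. This is a reconstruction from the textual description, not a verbatim copy of the original equation; the integration variable τ and the evaluation of the filter phase at the harmonic frequency h f0(t i ) are assumptions.

```latex
\phi_{i,h} \;=\; \underbrace{\theta_{i,h}}_{\text{source shape}}
\;+\; \underbrace{h \int_{c_i}^{t_i} \frac{2\pi f_0(\tau)}{f_s}\, d\tau}_{\text{linear phase}}
\;+\; \underbrace{\angle V_i\!\big(h f_0(t_i)\big)}_{\text{minimum phase}}
```

When f0(t) is assumed constant within the frame, the integral reduces to the simplified term h(2π f0(t i )/f s )(t i −c i ) quoted above.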
The following sections describe the suggested way to characterize ϕi,h for speech processing using statistics of PD and using the RPS as an intermediate step.
3.2 From phases to relative phase shift
The linear phase component of the model described above constantly wraps the instantaneous phase ϕi,h from one time instant to the next. This constitutes a major issue in phase modeling.
Equation 9 shows that the RPS computation discards the linear phase terms. Only two terms remain: the source shape at each harmonic relative to that of the first harmonic, and the contribution of the minimum-phase envelope ∠ V i (f) relative to that at the first harmonic. In voiced segments, these two remaining terms can safely be assumed to evolve smoothly across time because the shape of the glottal pulse and the vocal tract do so. Therefore, this property of RPS basically solves the issue of phase wrapping.
Additionally, c i is also discarded in Equation 9 so that there is no need to estimate any GCI or pitch pulse onset. This avoids misestimation of such time instants and the consequences on speech processing techniques.
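A short derivation may help here. It assumes the usual RPS definition (each harmonic phase minus h times the phase of the first harmonic) and the three-term phase model of Section 3.1, with L i standing for the linear phase integral. Both the notation and the intermediate steps are a sketch, not the original equations.

```latex
\begin{align*}
\phi_{i,h} &= \theta_{i,h} + h\,L_i + \angle V_i\big(h f_0(t_i)\big),
\qquad L_i = \int_{c_i}^{t_i} \frac{2\pi f_0(\tau)}{f_s}\, d\tau \\
\mathrm{RPS}_{i,h} &= \phi_{i,h} - h\,\phi_{i,1} \\
&= \theta_{i,h} - h\,\theta_{i,1}
 + \angle V_i\big(h f_0(t_i)\big) - h\,\angle V_i\big(f_0(t_i)\big)
\end{align*}
```

The term h L i appears once with each sign and cancels, which is why neither the linear phase nor c i survives in the result.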
The third row of Figure 1 shows an example of the RPS once the minimum-phase contribution is removed (with, again, the interpolation on a continuous frequency axis). In Equation 11, only the source shape and the harmonic number h remain. Ideally, we want to extract features from the speech waveform which are as independent from each other as possible. However, h belongs to the harmonic structure, which is already handled by f0(t) and the property of harmonicity of the model. Therefore, this harmonic number is still inconvenient for characterizing the phase properties independently from f0(t). Interpolating the values on a continuous frequency axis (as depicted in Figure 1) removes the harmonic sampling. However, the harmonic number is still present in the interpolated values. Note also that h increases the RPS variance towards high frequencies and drowns the variance of θi,h into that of h θi,1, which is not convenient for characterizing the source shape in mid and high frequencies.
3.3 From relative phase shift to phase distortion
The PD is computed using a finite difference operator across harmonics. The fifth row of Figure 1 shows an example of PD (with the interpolation on a continuous frequency axis). Basically, the PD measures the phase desynchronization which exists between each sinusoidal component of the voice source. Additionally, this desynchronization is centered on the first harmonic phase, as for the RPS. The finite difference also makes the PD similar to the group delay, whose perceived characteristics have already been studied and demonstrated and whose applications are numerous.
Since the linear and filter terms cancel, only the source shape terms remain in Equation 13. Equation 13 also shows that the computation of the PD represents the phase desynchronization of the source shape between each harmonic, centered on that of the first harmonic. Compared to Equation 11, the harmonic number h is also removed, as expected, by using the finite frequency difference. Consequently, when h increases, it adds to the RPS measurement but does not influence the PD measurement. For example, when using PD in fourth and fifth rows of Figure 1, one can see red patterns appearing around 1.5 s between 4 and 8 kHz. On the contrary, no clear pattern appears in the same time-frequency region using the RPS (second and third rows). Using RPS, the region concerned actually seems as blurred as in noisy time-frequency regions (e.g., around 1.8 s). This is explained by the presence of the harmonic number h in RPS which increases the wrapping of the phase values.
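The cancellation described above can be checked numerically. The sketch below assumes the usual definitions (RPS as the harmonic phase minus h times the first harmonic phase, and PD as the finite difference across harmonics of the minimum-phase-compensated phase, minus that of the first harmonic); the function names are hypothetical.

```python
import numpy as np

def wrap(phase):
    """Wrap phase values to (-pi, pi]."""
    return np.angle(np.exp(1j * np.asarray(phase)))

def relative_phase_shift(phi):
    """RPS_h = phi_h - h * phi_1 for harmonics h = 1..H (phi[0] is harmonic 1)."""
    h = np.arange(1, len(phi) + 1)
    return wrap(phi - h * phi[0])

def phase_distortion(phi, min_phase):
    """PD_h = d_{h+1} - d_h - d_1, where d = phi - min_phase (filter phase removed)."""
    d = np.asarray(phi) - np.asarray(min_phase)
    return wrap(d[1:] - d[:-1] - d[0])
```

With measured phases built as source shape + h × (linear phase) + filter phase, `phase_distortion` returns the wrapped value of θ(h+1) − θ(h) − θ(1) regardless of the linear phase, while `relative_phase_shift` still carries the term −h θ(1).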
3.4 Statistical features of the phase distortion
As shown in Equation 13, the phase distortion basically represents the source shape. In voiced segments, the source shape accounts mainly for the shape of the glottal pulse. In unvoiced segments, the time evolution of this shape throughout adjacent frames reproduces the noisiness of the voice source. Therefore, in this section, we suggest statistically characterizing the phase distortion in a short-term window in order to extract a feature related to the shape at a given time and another feature representing the local variation of this shape around that same time. This characterization makes it possible to manipulate the components of the speech in voice transformation and Hidden Markov Model (HMM)-based synthesis.
We first assume that the information carried in PD is independent of the fundamental frequency. As a consequence, we interpolate PDi,h on a linear frequency scale (as done for the previous figures), like a phase spectral envelope, thus removing the harmonic number from the representation. To obtain this phase envelope, we first unwrap PDi,h and then interpolate it linearly on a discrete scale of 512 frequency bins up to the Nyquist frequency, so that PDi,h becomes PD i [ k]. Here, the unwrapping function is necessary to avoid meaningless values during the interpolation process. Nevertheless, the resulting PD i [ k] is still circular data. Instead of the discrete notation in bins, the continuous notation in hertz will be used in the following descriptions and sections for the sake of clarity, i.e., PD i [ k]⇔PD i (f), as for the amplitude spectral envelope V i (f).
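The unwrap-then-interpolate step can be sketched as below. The function name is hypothetical, the unwrapping is done across harmonics with `np.unwrap`, and re-wrapping the interpolated values to (−π, π] is an assumption made here to reflect that the result is still circular data.

```python
import numpy as np

def pd_envelope(pd_harm, f0, fs, nbins=512):
    """Interpolate harmonic PD values onto nbins bins from 0 Hz to Nyquist."""
    freqs = np.linspace(0.0, fs / 2.0, nbins)
    harm_freqs = f0 * np.arange(1, len(pd_harm) + 1)
    # Unwrap across harmonics so that interpolation does not cross a 2*pi jump.
    pd_unwrapped = np.unwrap(pd_harm)
    pd_interp = np.interp(freqs, harm_freqs, pd_unwrapped)
    # Wrap back to (-pi, pi]: the envelope remains circular data (assumption).
    return freqs, np.angle(np.exp(1j * pd_interp))
```

Without the unwrapping, two neighboring harmonics at, say, 3.1 and −3.1 rad would be interpolated through 0 rad, although they are only about 0.08 rad apart on the circle.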
On a frame-by-frame basis in an analysis/synthesis procedure, the sole information carried by PD i (f) might be sufficient to reconstruct an instantaneous phase which has the same perceived characteristics as those of the instantaneous phase ϕi,h. This property has actually been shown through listening tests. However, through manipulation of PD i (f), by time scaling, pitch scaling, or statistical modeling, the short-term statistical characteristics of the analyzed voice might not be preserved. For example, stretching PD i (f) over time would automatically reduce its temporal variance, thus changing the extent of randomness in the voice, which is not the purpose of a time stretching transformation.
In this article, we suggest preserving the short-term mean and short-term standard deviation of PD i (f) in speech processing applications using features that represent these two moments. In order to estimate the mean and standard deviation, we assume that PD i (f) obeys a normal distribution. Moreover, since PD i (f) is circular data defined in (−π,π], we make use of the wrapped normal distribution.
3.4.1 The short-term mean of PD
We used N=25 frames in this work, which corresponds to six periods. This averaging of PD i (f) is necessary for separating the randomness characteristics of the phase from its smoothly varying behavior. Even though six periods might appear substantial, it ensures that the mean does not model the randomness of the phase, which has to be modeled by the feature described below.
3.4.2 The short-term standard deviation of PD
Here, M=9 frames were used. Using two periods for the standard deviation and six for the mean is motivated by the following reasons. A wider window for the standard deviation could cover the end of a noisy segment and the beginning of a voiced segment, thus overestimating the presence of noise at the beginning of the voiced segment. Therefore, a short window seems necessary to quickly adapt the standard deviation estimate in transients. On the other hand, using a wider window for the mean yields a more robust estimate of the source shape in transients, where harmonic sinusoidal parameters are less reliable than in voiced segments.
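The two short-term moments can be sketched with the standard estimators for circular data: the mean is the angle of the averaged unit vectors, and the standard deviation follows from the length R of that average as sqrt(−2 ln R). Using these particular estimators, truncating the windows at the signal edges, and the function name are assumptions; the window lengths N=25 and M=9 are those given in the text.

```python
import numpy as np

def short_term_circular_stats(pd, i, n_mean=25, n_std=9):
    """Circular mean and standard deviation of PD around frame i.

    pd: array of shape (frames, bins) holding PD values in radians.
    """
    def window(n):
        half = n // 2
        # Windows are simply truncated at the edges of the signal (assumption).
        return pd[max(0, i - half): i + half + 1]

    # Mean: angle of the average unit vector over the wide window.
    mean = np.angle(np.mean(np.exp(1j * window(n_mean)), axis=0))
    # Std: from the resultant length over the short window, clipped to (0, 1].
    r = np.abs(np.mean(np.exp(1j * window(n_std)), axis=0))
    std = np.sqrt(-2.0 * np.log(np.clip(r, 1e-12, 1.0)))
    return mean, std
```

Averaging unit vectors rather than raw angles matters near ±π: values of 3.04 and −3.04 rad then average to about π instead of 0.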
In order to have the same number of analysis instants in each period, the step size of the analysis instants was first adapted to f0(t) (see Equation 14). However, both mean and standard deviation have to be independent from f0(t) so that each feature represents independent characteristics of the speech signal. Additionally, a variable step size is not desirable for many applications, such as statistical modeling, where a constant step size is necessary. Consequently, prior to any application, the μ i (f) and σ i (f) features are resampled at new time instants with a constant step size of 5 ms.
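One possible way to carry out this resampling is sketched below. The interpolation method is an assumption, since it is not specified here: σ is interpolated linearly, while μ, being circular, is interpolated through its sine and cosine so that values near ±π are handled consistently. The function names are hypothetical.

```python
import numpy as np

def _interp_columns(new_t, t, values):
    """Linearly interpolate each frequency bin (column) across time."""
    return np.column_stack([np.interp(new_t, t, col) for col in values.T])

def resample_features(times, mu, sigma, step=0.005):
    """Resample mu and sigma (frames x bins) from pitch-adapted instants to a fixed step."""
    new_t = np.arange(times[0], times[-1] + 1e-12, step)
    # mu is circular: interpolate the unit vector, then take its angle.
    sin_i = _interp_columns(new_t, times, np.sin(mu))
    cos_i = _interp_columns(new_t, times, np.cos(mu))
    mu_new = np.arctan2(sin_i, cos_i)
    # sigma is non-negative and not circular: plain linear interpolation.
    sigma_new = _interp_columns(new_t, times, sigma)
    return new_t, mu_new, sigma_new
```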
The analysis steps described above provide, every 5 ms, the features f0(t i ), V i (f), μ i (f), and σ i (f). This section describes the method used to resynthesize a full speech signal using these features. This synthesis method is similar to that used originally for the aHM model. Basically, each harmonic track is first synthesized across time, independently of the others, using a sampling rate f s . The synthetic harmonics are then added together, without using any windowing scheme. Since the synthetic signal is bandlimited to the Nyquist frequency, the continuous notation for the time axis will be used in the following, for the sake of simplicity (i.e., x[ n]⇔x(t)).
Then, these anchor values are interpolated across time on a logarithmic scale in order to obtain .
where H is the maximum value of all H i and the indicator function discards any harmonic segment whose frequency is higher than Nyquist.
The rest of the synthesis process is identical to that of aHM. The continuous relative phase values are interpolated across time using splines, and the linear phase is added at the end in order to obtain the continuous instantaneous phase (Equation 21), which is finally used in Equation 22.
The complete analysis/synthesis procedure is called Harmonic Model + Phase Distortion (HMPD).
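The final summation step can be illustrated with the sketch below. It is a simplified stand-in, not the HMPD implementation: it assumes that the amplitude and relative phase tracks have already been interpolated to the sampling rate, uses a real cosine sum, and obtains the linear phase by a cumulative sum of f0. The function and argument names are hypothetical.

```python
import numpy as np

def synthesize_harmonics(f0, amps, rel_phase, fs):
    """Sum harmonic tracks sample by sample, without windowing.

    f0: (T,) fundamental frequency in Hz at each sample.
    amps, rel_phase: (T, H) per-sample amplitude and relative phase tracks.
    """
    # Linear phase of the first harmonic: running integral of 2*pi*f0/fs.
    phi0 = 2.0 * np.pi * np.cumsum(f0) / fs
    out = np.zeros(len(f0))
    for h in range(1, amps.shape[1] + 1):
        # Indicator function: drop harmonic samples at or above Nyquist.
        below_nyquist = (h * f0) < (fs / 2.0)
        out += below_nyquist * amps[:, h - 1] * np.cos(h * phi0 + rel_phase[:, h - 1])
    return out
```

Because the linear phase is rebuilt from f0(t) alone, the output shares the derivative of the original linear phase but not necessarily its offset.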
4.1 Correction of σ i (f)
This section aims at assessing the quality and versatility of the proposed phase representation. To this end, experiments have been conducted in three different scenarios: resynthesis with no modification (Section 5.1), pitch scale modification (Section 5.2), and HMM-based speech synthesis (Section 5.3).
Even in resynthesis, objective measures such as Signal to Reconstruction Error Ratio (SRER) or PESQ  are not suitable for evaluation as they are waveform sensitive. While it is true that the suggested HMPD representation retains the waveform characteristics of the signal, it does not keep its linear phase term: the original linear phase removed between Equations 8 and 9 and the synthetic one added in Equation 21 are not necessarily the same but just have the same derivative, i.e., f0(t). Consequently, original and synthesized waveforms are not time synchronous. Objective measures are also inconvenient for comparing different configurations of the HMPD vocoder, including those dropping the maximum-phase component given by μ i (f), in which the shape of the glottal pulse is not preserved.
5.1 Quality of resynthesis
The first test was designed mainly to evaluate the importance of μ i (f) and σ i (f) in terms of perceptual quality. The speech database used in this experiment contained a total of 32 utterances spoken in 16 different languages (two utterances per language, one from a male speaker and one from a female speaker). This multilingual database had been carefully designed to exhibit a very wide phonetic variability and also a heterogeneous set of speakers. The sampling frequency of the signals, f s , varied between 16 and 44.1 kHz. All original recordings showed high signal-to-noise ratios as they had been collected from various synthesis databases. The test was conducted through a web-based interface. A total of 43 volunteer listeners were presented with the original recordings of randomly selected signals along with their reconstructed versions using: aHM; the suggested HMPD using both μ i (f) and σ i (f); HMPD using σ i (f) only; the well-known STRAIGHT vocoder, which was used as a hidden anchor. Then, they were asked to grade the quality of these sounds using a 5-point scale. The order of the reconstruction methods was also randomized, and the listening was done through headphones or earphones. For consistency and to avoid listener fatigue, each listener was asked to grade only the voices of two languages (both male and female voices) randomly selected among the 16.
Among other reasons, the measure of randomness using σ i (f) might not adapt quickly enough in transients, so that the beginning and the end of voiced segments can sometimes be over-randomized. Smoothing techniques and different separation procedures for the estimation of μ i (f) and σ i (f) should be investigated in the future.
Regarding the relative performance of HMPD- μ σ and HMPD- σ, the average scores indicate that, for the voices used in this experiment, the listeners were not able to perceive any difference between them. This suggests that the contribution of μ i (f) is not perceptually significant in comparison with that of σ i (f). Moreover, since the link between PDi,h and the maximum-phase of the glottal pulse has been shown and exploited, this suggests that the maximum-phase information is hardly noticeable at this overall quality level. Admittedly, this could also be an indicator that μ i (f) is not capturing the maximum-phase component properly. In any case, the average results also show that the quality provided by HMPD is at least as good as that of STRAIGHT and better for male voices. Note also that, compared to STRAIGHT, the difference in quality between genders is clearly reduced using HMPD. In other words, the phase randomization technique suggested in this paper, which exploits σ i (f), might be a potential improvement and replacement for STRAIGHT’s aperiodicity measures.
5.2 Quality of pitch shifting
A second experiment was conducted to check the consistency of HMPD in a more challenging scenario. In that sense, pitch scaling is preferable over time scaling because it can shed light on possible inaccuracies in isolating amplitude or phase information from periodicity information. Therefore, after the analysis step, f0(t i ) was multiplied by a factor of 2, or 0.5, in order to shift the pitch of the voice one octave upwards or downwards, respectively. The signals in the database described in Section 5.1 were manipulated using three different methods: HMPD- μ σ, HMPD- σ, and STRAIGHT. In the case of HMPD, the pitch modification factor was applied to all f0(t i ) values, without any distinction between voiced and unvoiced segments, while in STRAIGHT unvoiced segments were kept unvoiced.
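Because the amplitude and phase features are stored as envelopes that do not depend on f0, an octave shift amounts to scaling f0 and reading the envelopes at the new harmonic frequencies. The sketch below illustrates this for one frame of the amplitude envelope; the function name is hypothetical and the linear resampling of the envelope is an assumption rather than the procedure used in the experiment.

```python
import numpy as np

def pitch_scale_frame(f0, env_freqs, env_amp, factor, fs):
    """Scale f0 by `factor` and sample the envelope at the new harmonics."""
    new_f0 = factor * f0
    harm_freqs = new_f0 * np.arange(1, int((fs / 2.0) // new_f0) + 1)
    harm_freqs = harm_freqs[harm_freqs < fs / 2.0]   # keep harmonics below Nyquist
    new_amps = np.interp(harm_freqs, env_freqs, env_amp)
    return new_f0, harm_freqs, new_amps
```

With f s = 16 kHz and f0 = 100 Hz, a factor of 2 leaves 39 harmonics below Nyquist while a factor of 0.5 leaves 159, which gives a sense of how much sparser the harmonic grid becomes for upward shifts.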
Using a web-based interface, 30 listeners gave their pairwise preferences for the three possible combinations of methods using a 5-point scale: strong preference for one method, preference for one method, preference for the other method, strong preference for the other method, or uncertainty. Again, each listener assessed the quality of the upwards and downwards shifts of the recordings of two languages, for one female and one male speaker per language. The individual scores given by the listeners were then aggregated into a single mean score for each method, which shows the global preference for one method over all the others.
Concerning the comparison between HMPD and STRAIGHT, for upwards pitch shifting, STRAIGHT is clearly preferred over HMPD- σ. However, for downwards shifts, clear preferences are shown for HMPD- σ. Informal listening revealed that for upwards pitch shifting, the speech signals modified by HMPD sound tenser and lack some noisiness. This is due to the inherent limitations of modeling speech exclusively through harmonics: even for an adequate phase variance across time, at high pitch values, the frequency gap between every two consecutive harmonics does not allow a proper reconstruction of noise characteristics. STRAIGHT is not prone to this effect because it uses a wideband noise. This is undoubtedly one issue in HMPD to be solved in future work.
5.3 Quality of statistical parametric speech synthesis
To assess the quality of HMPD- σ in statistical parametric speech synthesis, we built a system based on the HMM-based Speech Synthesis System (HTS)  (v2.1.1). HTS learns a correspondence between labels containing phonetic, linguistic, and prosodic information and one/many streams of vectors containing acoustic features. This correspondence is modeled at phone level through five-state left-to-right context-dependent HMMs with explicit state duration distributions. The technology behind this well-known system is explained in detail in .
Both HMPD and STRAIGHT were slightly modified to meet the requirements of HTS. In both of them, 39th-order Mel-CEPstral (MCEP) coefficients were used to model the amplitude envelope |V i (f)|, the only difference being that in HMPD, these coefficients were obtained from discrete harmonic amplitudes. To model the degree of noisiness, the aperiodicity measures provided by STRAIGHT were averaged within five meaningful bands, whereas HMPD’s σ i (f), which takes values in the range [ 0,∞) like |V i (f)|, was also translated into MCEP coefficients (order 12). For synthesis, the σ i (f) on linear scale was recovered from the corresponding MCEP coefficients, like the amplitude envelope. Given the importance of pitch artifacts in HMM-based speech synthesis, for a fair comparison, we used the same f0(t i ) values for both vocoders, namely those provided by STRAIGHT. In unvoiced segments, the continuous f0(t) curve required by HMPD- σ was simply obtained by linear interpolation of the non-zero f0(t i ) values. The resulting curve was then modeled using continuous HMMs with one Gaussian mixture per state instead of MSD-HMMs.
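The construction of the continuous f0 curve can be sketched as follows. The function name is hypothetical, unvoiced frames are assumed to be marked with zeros, and holding the first and last voiced values at the edges of the utterance is an assumption (it is the default behavior of `np.interp`).

```python
import numpy as np

def continuous_f0(f0):
    """Fill unvoiced frames (f0 == 0) by linear interpolation between voiced frames."""
    f0 = np.asarray(f0, dtype=float)
    idx = np.arange(len(f0))
    voiced = f0 > 0
    # Requires at least one voiced frame; edges are held at the nearest voiced value.
    return np.interp(idx, idx[voiced], f0[voiced])

print(continuous_f0([0, 100, 0, 0, 130, 0]))  # -> [100. 100. 110. 120. 130. 130.]
```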
Summary of the streams used in the HMM-based synthesis system
We trained models for four different speech databases: one female and one male speaker in Spanish, containing 1.2K and 2K utterances, respectively; and one female and one male speaker in English, containing 1.1K and 2.8K utterances, respectively (all with f s =16 kHz). All the samples using STRAIGHT and HMPD are available on the demonstration page. For the sake of completeness, samples using impulse-based glottal sources (μ i (f)=0 and σ i (f)=0 ∀i,f in the whole signal or only in the voiced segments, as often used in the literature as baseline systems) have also been generated and are available on the same page. However, given their very poor quality, they have not been included in the following listening test in order to avoid their potential influence on the results of STRAIGHT and HMPD. Therefore, we conducted a pairwise preference test between STRAIGHT and HMPD only, similar to that of Section 5.2. For each voice, 31 listeners gave their preference for each method for one synthetic utterance randomly selected among ten.
Interestingly, the gender dependencies observed in the previous experiments also arise in Figure 8. Indeed, listeners seem to prefer the female voices of STRAIGHT and the male voices of HMPD- σ. As mentioned in Section 4, this phenomenon is due to the inherent limitations of harmonic modeling at high pitch values. Future work will address this issue.
In this paper, features based on mean and standard deviation of the PD have been suggested for analysis/synthesis of speech signals, leading to a new HMPD vocoder.
These features avoid voiced and unvoiced segmentation. Thus, the perceived quality of HMPD synthesis is independent of the reliability of a voicing estimator. A first listening test has shown that HMPD resynthesis quality is as good as that of the STRAIGHT vocoder for female voices and better for male voices.
A second preference test on pitch scaling has shown a limitation of HMPD when the harmonics are not dense enough to properly reproduce noise properties (e.g., with high f0). Future work is planned to address this fundamental issue of harmonic models. However, a clear preference has been shown for HMPD for downward shifts, suggesting that additive wideband noise, often used in existing vocoders, is not necessary for low-pitched voices. A last test has suggested that the quality of HMPD in HMM-based speech synthesis is similar to that of the state of the art. Therefore, HMPD simplifies the signal representation, in terms of uniformity, by removing the voicing decision, without losing, on average, perceived quality.
G. Degottex has been funded by the Swiss National Science Foundation (SNSF) (grants PBSKP2_134325, PBSKP2_140021), Switzerland, and the Foundation for Research and Technology-Hellas (FORTH), Heraklion, Greece. D. Erro has been funded by the Basque Government (BER2TEK, IE12-333) and the Spanish Ministry of Economy and Competitiveness (SpeechTech4All, TEC2012-38939-C03-03).
- Gales MJF, Young SJ: The application of hidden Markov models in speech recognition. Foundations Trends Signal Process 2007, 1(3):195-304. 10.1561/2000000004
- Kinnunen T, Li H: An overview of text-independent speaker recognition: from features to supervectors. Speech Commun 2010, 52(1):12-40. 10.1016/j.specom.2009.08.009
- Stylianou Y: Harmonic plus noise models for speech combined with statistical methods, for speech and speaker modification. PhD thesis, TelecomParis, France; 1996.
- Tokuda K, Nankaku Y, Toda T, Zen H, Yamagishi J, Oura K: Speech synthesis based on hidden Markov models. Proc. IEEE 2013, 101(5):1234-1252. 10.1109/JPROC.2013.2251852
- Anguera X, Bozonnet S, Evans N, Fredouille C, Friedland O, Vinyals O: Speaker diarization: a review of recent research. IEEE Trans. Audio Speech Lang. Process 2012, 20(2):356-370. 10.1109/TASL.2011.2125954
- Davis SB, Mermelstein P: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process 1980, 28: 357-366. 10.1109/TASSP.1980.1163420
- Spanias AS: Speech coding: a tutorial review. Proc. IEEE 1994, 82(10):1541-1582. 10.1109/5.326413
- Scott JM, Assmann PF, Nearey TM: Intelligibility of frequency shifted speech. J. Acoust. Soc. Am 2001, 109(5):2316-2316. 10.1121/1.4744141
- Schweinberger SR, Casper C, Hauthal N, Kaufmann JM, Kawahara H, Kloth N, Robertson DM, Simpson AP, Zäske R: Auditory adaptation in voice perception. Curr. Biol 2008, 6(9):684-688. 10.1016/j.cub.2008.04.015
- Kawahara H, Masuda-Katsuse I, de Cheveigne A: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun 1999, 27(3–4):187-207. 10.1016/S0167-6393(98)00085-5
- McAulay R, Quatieri T: Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust. Speech Signal Process 1986, 34(4):744-754. 10.1109/TASSP.1986.1164910
- T Quatieri, RJ McAulay, in Proc. IEEE Int. Conf. on Acoustics, Speech and Sig. Proc. (ICASSP), 10. Speech transformations based on a sinusoidal representation (Tampa, Florida, USA, 1985), pp. 489–492.
- Quatieri TF, McAulay R: Shape invariant time-scale and pitch modification of speech. IEEE Trans. Signal Process 1992, 40(3):497-510. 10.1109/78.120793
- TF Quatieri, R McAulay, in Proc. IEEE Int. Conf. on Acoustics, Speech and Sig. Proc. (ICASSP). Phase coherence in speech reconstruction for enhancement and coding applications (Glasgow, Scotland, UK, 23–26 May 1989), pp. 207–210.
- Pantazis Y, Rosec O, Stylianou Y: Adaptive AM-FM signal decomposition with application to speech analysis. IEEE Trans. Audio Speech Lang. Process 2010, 19(2):290-300. 10.1109/TASL.2010.2047682
- Degottex G, Stylianou Y: Analysis and synthesis of speech using an adaptive full-band harmonic model. IEEE Trans. Audio Speech Lang. Proc 2013, 21(10):2085-2095. 10.1109/TASL.2013.2266772
- G Kafentzis, G Degottex, O Rosec, Y Stylianou, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Time-scale modifications based on a full-band adaptive harmonic model (Vancouver, 26–31 May 2013), pp. 8193–8197.
- G Kafentzis, G Degottex, O Rosec, Y Stylianou, in Proc. IEEE Int. Conf. on Acoustics, Speech and Sig. Proc. (ICASSP). Pitch modifications of speech based on an adaptive harmonic model (Florence, 4–9 May 2014).
- J Laroche, Y Stylianou, E Moulines, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2. HNS: Speech modification based on a harmonic+noise model (Minneapolis, USA, 27–30 Apr 1993), pp. 550–553.
- Richard G, d’Alessandro C: Analysis/synthesis and modification of the speech aperiodic component. Speech Commun 1996, 19(3):221-244. 10.1016/0167-6393(96)00038-6
- E Banos, D Erro, A Bonafonte, A Moreno, in Proc. V Jornadas en Tecnologias del Habla. Flexible harmonic/stochastic modelling for HMM-based speech synthesis (Bilbao, Spain, 12–14 Nov 2008).
- El-Jaroudi A, Makhoul J: Discrete all-pole modeling. IEEE Trans. Signal Process 1991, 39(2):411-423. 10.1109/78.80824
- Campedel-Oudot M, Cappe O, Moulines E: Estimation of the spectral envelope of voiced sounds using a penalized likelihood approach. IEEE Trans. Speech Audio Process 2001, 9(5):469-481. 10.1109/89.928912
- Saratxaga I, Hernaez I, Erro D, Navas E, Sanchez J: Simple representation of signal phase for harmonic speech models. Electron. Lett 2009, 45(7):381-383. 10.1049/el.2009.3328
- G Degottex, A Roebel, X Rodet, in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP). Function of phase-distortion for glottal model estimation (Prague, 22–27 May 2011), pp. 4608–4611.
- Y Ohtani, M Tamura, M Morita, T Kagoshima, M Akamine, in Proc. Interspeech. HMM-based speech synthesis using sub-band basis spectrum model (Portland, Oregon, USA, September 9–13 2012), pp. 1440–1443.
- Maia R, Akamine M, Gales MJF: Complex cepstrum for statistical parametric speech synthesis. Speech Commun 2013, 55(5):606-618. 10.1016/j.specom.2012.12.008
- H Kawahara, J Estill, O Fujimura, in Proc. Second International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA). Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT (Florence, Italy, 2001).
- Erro D, Sainz I, Navas E, Hernaez I: Harmonics plus noise model based vocoder for statistical parametric speech synthesis. IEEE J. Selected Topics Signal Process 2014, 8(2):184-194. 10.1109/JSTSP.2013.2283471
- J Latorre, MJF Gales, S Buchholz, K Knill, M Tamura, Y Ohtani, M Akamine, in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification? (Prague, 22–27 May 2011), pp. 4724–4727.
- Degottex G, Lanchantin P, Roebel A, Rodet X: Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis. Speech Commun 2013, 55(2):278-294. 10.1016/j.specom.2012.08.010
- Tokuda K, Masuko T, Miyazaki N, Kobayashi T: Multi-space probability distribution HMM. IEICE Trans. Inform. Syst 2002, E85-D: 455-464.
- I Saratxaga, I Hernaez, M Pucher, I Sainz, in Proc. Interspeech. Perceptual importance of the phase related information in speech (ISCA, Portland, Oregon, USA, September 9–13 2012).
- Mowlaee P, Saeidi R: Iterative closed-loop phase-aware single-channel speech enhancement. Signal Process. Lett. IEEE 2013, 20(12):1235-1239. 10.1109/LSP.2013.2286748
- P Mowlaee, R Saeidi, R Martin, in Proceedings of the International Conference on Spoken Language Processing. Phase estimation for signal reconstruction in single-channel speech separation (2012), pp. 1–4.
- Miller RL: Nature of the vocal cord wave. J. Acoust. Soc. Am 1959, 31(6):667-677. 10.1121/1.1907771
- Oppenheim AV, Schafer RW: Digital Signal Processing. Prentice-Hall, New Jersey, USA; 1978.
- Paul DB: The spectral envelope estimation vocoder. IEEE Trans. Acoust. Speech Signal Process 1981, 29(4):786-794. 10.1109/TASSP.1981.1163643
- B Doval, C d’Alessandro, N Henrich, in Proc. ISCA Voice Quality: Functions, Analysis and Synthesis (VOQUAL). The voice source as a causal/anticausal linear filter (Geneva, 27–29 Aug 2003), pp. 16–20.
- Degottex G, Roebel A, Rodet X: Phase minimization for glottal model estimation. IEEE Trans. Audio Speech Lang. Process 2011, 19(5):1080-1090. 10.1109/TASL.2010.2076806
- Oppenheim A, Schafer R, Stockham T: Nonlinear filtering of multiplied and convolved signals. Proc. IEEE 1968, 56(8):1264-1291. 10.1109/PROC.1968.6570
- B Bozkurt, B Doval, C d’Alessandro, T Dutoit, in Proc. International Conference on Spoken Language Processing (ICSLP). Zeros of Z-transform (ZZT) decomposition of speech for source-tract separation (South Korea, Japan, 4–8 Oct 2004).
- T Drugman, B Bozkurt, T Dutoit, in Proc. Interspeech. Complex cepstrum-based decomposition of speech for glottal source estimation (Brighton, UK, 6–10 Sep 2009), pp. 116–119.
- T Drugman, T Dubuisson, A Moinet, C d’Alessandro, T Dutoit, in Proc. International Conference on Signal Processing and Multimedia Applications (SIGMAP). Glottal source estimation robustness (Porto, Portugal, 26–29 Jul 2008).
- Laroche J, Dolson M: Improved phase vocoder time-scale modification of audio. IEEE Trans. Speech Audio Process 1999, 7(3):323-332. 10.1109/89.759041
- Stylianou Y: Removing linear phase mismatches in concatenative speech synthesis. IEEE Trans. Speech Audio Process 2001, 9(3):232-239. 10.1109/89.905997
- Agiomyrgiannakis Y, Stylianou Y: Wrapped Gaussian mixture models for modeling and high-rate quantization of phase data of speech. IEEE Trans. Audio Speech Lang. Proc 2009, 17(4):775-786. 10.1109/TASL.2008.2008229
- Smits R, Yegnanarayana B: Determination of instants of significant excitation in speech using group delay function. IEEE Trans. Speech Audio Process 1995, 3(5):325-333. 10.1109/89.466662
- Ananthapadmanabha T, Yegnanarayana B: Epoch extraction from linear prediction residual for identification of closed glottis interval. IEEE Trans. Acoust. Speech Signal Process 1979, 27(4):309-319. 10.1109/TASSP.1979.1163267
- Moulines E, Charpentier F: Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun 1990, 9: 453-467. Neuropeech ’89 10.1016/0167-6393(90)90021-Z
- C Hamon, E Moulines, F Charpentier, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1. A diphone synthesis system based on time-domain prosodic modifications of speech (Glasgow, 23–26 May 1989), pp. 238–241.
- Lipshitz SP, Pocock M, Vanderkooy J: On the audibility of midrange phase distortion in audio systems. J. Audio Eng. Soc 1982, 30(9):580-595.
- Hansen V, Madsen ER: On aural phase detection: Part 1. J. Audio Eng. Soc 1974, 22(1):10-14.
- Hansen V, Madsen ER: On aural phase detection: Part 2. J. Audio Eng. Soc 1974, 22(10):783-788.
- M Tahon, G Degottex, L Devillers, in Proc. International Conference on Speech Prosody. Usual voice quality features and glottal features for emotional valence detection (Shanghai, China, 22–25 May 2012), pp. 693–696.
- Banno H, Takeda K, Itakura F: The effect of group delay spectrum on timbre. Acoust. Sci. Technol 2002, 23(1):1-9. 10.1250/ast.23.1
- Yegnanarayana B, Saikia D, Krishnan T: Significance of group delay functions in signal reconstruction from spectral magnitude or phase. Acoust. Speech Signal Process. IEEE Trans 1984, 32(3):610-623. 10.1109/TASSP.1984.1164365
- Murthy HA, Yegnanarayana B: Speech processing using group delay functions. Elsevier Signal Process 1991, 22: 259-267. 10.1016/0165-1684(91)90014-A
- D Zhu, KK Paliwal, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1. Product of power spectrum and group delay function for speech recognition (Montreal, Quebec, Canada, 17–21 May 2004), pp. 125–81.
- Naylor PA, Kounoudes A, Gudnason J, Brookes M: Estimation of glottal closure instants in voiced speech using the DYPSA algorithm. IEEE Trans. Audio Speech Lang. Process 2007, 15(1):34-43. 10.1109/TASL.2006.876878
- T Drugman, T Dubuisson, T Dutoit, in IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). Phase-based information for voice pathology detection (Prague, 22–27 May 2011), pp. 4612–4615.
- Y Shiga, S King, in Proc. Eurospeech, 3. Estimation of voice source and vocal tract characteristics based on multi-frame analysis (Geneva, 1–4 Sept 2003), pp. 1749–1752.
- J Bonada, in Proc. Digital Audio Effects (DAFx). High quality voice transformations based on modeling radiated voice pulses in frequency domain (Naples, Italy, 5–8 Oct 2004).
- Fisher NI: Statistical Analysis of Circular Data. Cambridge University Press, UK; 1995.
- RJ McAulay, TF Quatieri, in Proc. IEEE Int. Conf. on Acoustics, Speech and Sig. Proc. (ICASSP), 1. Sine-wave phase coding at low data rates (Toronto, 14–17 May 1991), pp. 577–580.
- R McAulay, TF Quatieri, in Proc. IEEE Int. Conf. on Acoustics, Speech and Sig. Proc. (ICASSP), 12. Multirate sinusoidal transform coding at rates from 2.4 kbps to 8 kbps (Dallas, Texas, USA, 1987), pp. 1645–1648.
- A Sugiyama, R Miyahara, in Proc. IEEE Int. Conf. on Acoustics, Speech and Sig. Proc. (ICASSP). Phase randomization - a new paradigm for single-channel signal enhancement (Vancouver, 26–31 May 2013), pp. 7487–7491.
- ITU-T P.862: Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. Technical report, International Telecommunication Union (ITU), Geneva, Switzerland; 2000.
- ITU-R BS.1284-1: General methods for the subjective assessment of sound quality. Technical report, International Telecommunication Union (ITU), Geneva, Switzerland; 2003.
- G Degottex, D Erro, Demonstrations audio samples of HMPD-based synthesis (2014). http://gillesdegottex.eu/ExDegottexG2014jhmpd. Accessed 9 Oct 2014.
- Doval B, d’Alessandro C, Henrich N: The spectrum of glottal flow models. Acta Acustica United Acustica 2006, 92(6):1026-1046.
- H Zen, T Nose, J Yamagishi, S Sako, T Masuko, A Black, K Tokuda, in Proc. ISCA Workshop on Speech Synthesis, SSW 2007. The HMM-based speech synthesis system (HTS) version 2.0 (Bonn, 22–24 Aug 2007).
- Zen H, Toda T, Nakamura M, Tokuda K: Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005. IEICE Trans. Inf. Syst 2007, E90-D(1):325-333. 10.1093/ietisy/e90-1.1.325
- Yu K, Young S: Continuous F0 modeling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process 2011, 19(5):1071-1079. 10.1109/TASL.2010.2076805
- I Sainz, D Erro, E Navas, I Hernáez, J Sánchez, I Saratxaga, I Odriozola, in Proc. of European Language Resources Association (ELRA). Versatile speech databases for high quality synthesis for Basque (Istanbul, Turkey, 23–25 May 2012).
- Rodriguez-Banga E, Garcia-Mateo C: Documentation of the UVIGO_ESDA Spanish database. Technical report, Universidade de Vigo, Vigo, Spain; 2010.
- J Kominek, AW Black, in Proc. ISCA Speech Synthesis Workshop. The CMU ARCTIC speech databases (Geneva, Switzerland, 1–4 Sep 2003), pp. 223–224.
- Cooke M, Mayo C, Valentini-Botinhao C, Stylianou Y, Sauert B, Tang Y: Evaluating the intelligibility benefit of speech modifications in known noise conditions. Speech Commun 2013, 55(4):572-585. 10.1016/j.specom.2013.01.001
- D Erro, I Sainz, E Navas, I Hernaez, in Proc. IEEE Int. Conf. on Acoustics, Speech and Sig. Proc. (ICASSP). HNM-based MFCC+F0 extractor applied to statistical speech synthesis (Prague, 22–27 May 2011), pp. 4728–4731.
- P Lanchantin, G Degottex, X Rodet, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). A HMM-based speech synthesis system using a new glottal source and vocal-tract separation method (Dallas, 14–19 March 2010), pp. 4630–4633.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.