
A parametric prosody coding approach for Mandarin speech using a hierarchical prosodic model

Abstract

In this paper, a novel parametric prosody coding approach for Mandarin speech is proposed. In the encoder, it employs a hierarchical prosodic model (HPM) as a prosody-generating model to analyze the prosody of the input utterance and obtain a parametric representation of four prosodic-acoustic features (syllable pitch contour, syllable duration, syllable energy level, and syllable-juncture pause duration) for encoding. In the decoder, the four prosodic-acoustic features are reconstructed by a synthesis operation using the decoded HPM parameters. The reconstructed prosodic features are then used in an HMM-based speech synthesizer to generate the reconstructed speech. Objective and subjective evaluations showed that the proposed prosody coding approach encoded speech with better quality and a lower data rate than a conventional segment-based coding scheme using vector or scalar quantization. The speech encoded by the proposed approach has good quality at low data rates of 81.4 and 72.7 bps for speaker-dependent and speaker-independent tasks, respectively. An application of the proposed prosody coding approach to speaking-rate conversion, realized by directly changing the HPM parameters to those of a different speaking rate, is also illustrated. An informal listening test confirmed that the converted speech sounded very smooth at both high and low speaking rates.

1 Introduction

Speech coding is a process that transforms a digitized speech signal into a bit-efficient representation while keeping reasonable speech quality, so as to facilitate speech transmission over a band-limited channel or speech storage on memory-limited media. In general, speech coding techniques can be classified into three categories: waveform coding, parametric coding, and hybrid coding. The waveform coding technique attempts to maintain the waveform shape of the original speech signal at the sample level without any knowledge of the speech generation process. Famous standard speech coders of this category are the G.711 A-law and μ-law Pulse Code Modulation (PCM) coders [1] and the G.726 and G.727 Adaptive Differential PCM coders [2]. Generally, a waveform coder works well at a high bit rate of 32 kbps or above. The parametric coding technique represents a speech signal by the parameters of a speech-generating model. Among various speech-generating models, the most successful one is the linear predictive coding (LPC) model, which assumes the speech signal is the output of an all-pole (autoregressive) model fed with an excitation input signal. The parameters of the all-pole filter conceptually represent the vocal tract shape, which is highly correlated with the spectral envelope of the speech, while the excitation signal uses a quasiperiodic impulse train to represent fundamental frequency (F0) information for voiced speech, pseudorandom noise for unvoiced speech, or a combination of the two (i.e., mixed excitation). Coders of this type encode the speech signal frame by frame and can operate at low bit rates ranging from 2 to 5 kbps. Unlike waveform coders, parametric coders make no attempt to preserve the original waveform shape but instead aim to keep the perceptual quality of the reconstructed speech. Famous standard LPC-based speech coders are the FS1015 coder based on the LPC-10e algorithm [3, 4] and MELP (mixed excitation linear prediction) [5]. The hybrid coding technique tries to combine the advantages of both waveform coding and parametric coding. Coders of this type are similar to parametric coders in utilizing speech-generating models, but also similar to waveform coders in keeping the encoded speech waveforms close to the original ones through more detailed modeling of the excitation signal. They generally adopt the code-excited linear prediction (CELP) algorithm [6] to minimize a perceptually weighted error. Representative standard hybrid coders are FS1016 CELP [7, 8], ITU-T G.728 LD-CELP [9], and ETSI AMR-ACELP [10]. A hybrid coder generally operates at a medium bit rate of 5 to 15 kbps.

To encode speech at a very low bit rate of less than 2 kbps, the sample-based and frame-based speech coders mentioned above cannot produce reconstructed speech with good intelligibility and naturalness because of the loss of modeling accuracy at such a low data rate. Therefore, segment-based speech coders, such as segment vocoders [11,12,13,14,15,16,17,18,19,20,21,22,23], phonetic vocoders [24,25,26,27,28,29,30,31,32,33], and text-to-speech (TTS)-based speech coders [34,35,36], were proposed to overcome this limitation. These coders generally process speech at the segment level instead of the sample or frame level. Generally, a segment vocoder [11,12,13,14,15,16,17,18,19,20,21,22,23] first divides the speech signal into a sequence of fixed- or variable-length segments by a speech segmenter or a speech recognizer and then quantizes each segment with a codebook of pre-stored speech segments. A fixed-length segment vocoder simply quantizes a sequence of speech segments of l frames (l > 1) by matrix quantization (MQ) [14, 22]. In a variable-length segment vocoder [11,12,13, 15,16,17,18,19,20,21,22,23], the segmental unit is usually pre-determined in the design of the speech segmenter or speech recognizer. Typical segmental units are phones, di-phones, syllables, or automatically derived acoustic units. Most segment vocoders reported in the literature operate at very low bit rates for speaker-dependent speech coding. However, to apply segment vocoders to speaker-independent speech coding, a higher bit rate may be required because a larger codebook is needed to capture speaker variability. A phonetic vocoder [24,25,26,27,28,29,30,31,32,33] adopts a recognition-synthesis scheme in which speech is delimited into a sequence of phonetic segments (usually phone or phone-like units) by a speech recognizer and reconstructed by a TTS synthesizer given the recognized phone identities and their corresponding quantized prosodic features. Generally, the speech quality of this type of coder is subject to the performance of the speech recognizer and the TTS synthesizer used. Since phonetic vocoders do not encode speaker information, the speaker identity is lost in the reconstructed speech. Therefore, phonetic vocoders are suitable for speaker-dependent speech coding. A TTS-based speech coder [34,35,36] adopts a TTS synthesizer to generate speech from a given text by concatenating speech units properly selected from a large speech inventory and modifying the prosody of the selected speech units to match the quantized prosodic parameters. A speech coder of this type can be viewed as a phonetic vocoder operating in an oracle condition: the correct text is given and all speech segments are well segmented with correct phonetic transcriptions, i.e., speech is segmented by forced alignment. Although segment-based speech coders generally operate with longer coding delay and higher computational complexity than conventional sample- and frame-based coders, they are potentially very useful in applications that require a large amount of pre-recorded speech with limited memory space, such as the coding of story readings in an electronic book, computer-assisted language learning systems and electronic dictionaries, and storing speech in a matrix bar code, i.e., a quick response (QR) code.

From the above discussion, we find that these previous studies mainly focused on the modeling or encoding of spectral information. For frame-based speech coding, one milestone was the use of vector quantization (VQ) in encoding LPCs [37] or line spectral frequencies (LSFs) [38, 39], which greatly reduces the bit amount for spectral information by taking advantage of the high intra-frame correlation among LPC/LSF coefficients. Predictive VQ [40] was proposed to further reduce the bit rate by exploiting inter-frame spectral redundancy or correlation. For segment vocoders [11,12,13,14,15,16,17,18,19,20,21,22,23], the main research issues were the choice of segmental units, the realization of segmentation and segment quantization, and the design of the segment codebook. For phonetic vocoders [24,25,26,27,28,29,30,31,32,33], the studies mainly focused on the choice of the acoustic unit for speech recognition/synthesis [24,25,26,27] and speaker adaptation of spectral information [28, 30, 32, 33]. For TTS-based vocoders [34,35,36], the main concern lay in the methods of unit selection for speech synthesis. On the other hand, encoding of the prosodic information of the speech signal was rarely addressed. Prosody refers to certain inherent suprasegmental properties of the speech signal that carry the melodic, timing, rhythmic, and pragmatic information of continuous speech. Prosodic features are physically realized as variations in pitch contour, energy level, duration, and silence over various domains (generally phone, syllable, word, phrase, sentence, etc.) of spoken utterances. In conventional speech coding methods, prosodic features are generally ignored, or simply scalar- or vector-quantized. For sample-based waveform coding [1, 2], no prosodic features need to be encoded because a waveform coder attempts to maintain the waveform shape of the original signal. For frame-based coders [3,4,5,6,7,8,9,10], information on pitch contour and gain is embedded in the framed excitation signal, which can be efficiently represented by the positions and amplitudes of important residual samples [3, 4], encoded by an excitation codebook [6,7,8,9,10], or represented by a mixed excitation model in terms of pitch period, bandpass voicing strengths, and Fourier magnitudes [5]. For segment-based speech coding, the prosodic information associated with each segmental unit is usually encoded directly after quantization without considering underlying prosodic models. Methods proposed for encoding the segmental pitch contour include scalar quantization of the segmental mean value [11, 13] or of the values at the segment end points [17], vector quantization [25, 31], scalar quantization after parameterization by piecewise linear approximation (PLA) [12, 18, 19, 24, 32, 33, 36], the frame-by-frame scheme used in frame-based coders [14, 15, 21, 23, 27, 28], and quantization using stored pitch contour patterns [34]. Segment duration is usually directly encoded/scalar-quantized [11, 12, 15, 17,18,19, 23, 24, 26,27,28] or vector-quantized [29,30,31,32,33]. Aside from properly encoding prosodic information for bit saving, post-modification or manipulation of the prosody of the encoded speech is also an interesting topic for a segment-based speech coder, enabling attractive functions such as changing the speaking rate, changing the speaker identity, and changing the speaking style or emotion.
Therefore, a parametric prosody coding approach based on a prosody-generating model that describes the suprasegmental variations of the prosodic features well is highly desirable: compared with conventional prosody coding approaches, it has the potential not only to encode the prosodic features with fewer bits but also to make post-modification of the encoded prosody easier to realize.

In this paper, a novel parametric prosody coding approach to efficiently encoding prosodic-acoustic features for segment-based Mandarin speech coding is proposed. It differs from the conventional prosody coding approaches using simple scalar or vector quantization mainly in adopting an analysis-synthesis scheme: a parametric representation of the prosodic features of the input speech is obtained for encoding by an analysis operation in the encoder, and the prosodic features are reconstructed from the decoded parameters by a synthesis operation in the decoder. A hierarchical prosodic model (HPM) proposed previously [41] serves as the prosody-generating model in the analysis-synthesis scheme. The HPM is a sophisticated speech prosody model that describes the various relations among prosodic-acoustic features, prosodic structure, and linguistic features, so it can produce a compact and accurate representation of the prosodic features of the input speech for high-performance prosody coding. Besides, the HPM also provides a platform on which some post-modifications of the decoded prosody can be easily realized by manipulating its parameters. An example of modifying the speaking rate of the reconstructed speech by directly replacing the HPM parameters is demonstrated in this study.

The paper is organized as follows. Section 2 presents the proposed Mandarin-speech prosody coding approach in detail. Section 3 discusses the experimental results of evaluating the proposed prosody coding approach on two continuous-speech databases. In Section 4, an application of the parametric prosody coding to speaking rate conversion is demonstrated. Some conclusions are given in the last section.

2 The proposed method

Figure 1 shows a schematic diagram of the proposed parametric prosody coding approach. In the encoder, the input utterance is first segmented into syllable segments interleaved with pauses by a forced aligner using the linguistic information of the associated text. The prosodic-acoustic features associated with each syllable segment are then extracted. Next, a parametric representation of the prosodic-acoustic features of a syllable segment is estimated by a prosody analysis operation based on the HPM. Lastly, the HPM parameters and some low-level linguistic features are encoded and transmitted to the decoder. In the decoder, the prosodic-acoustic features of each syllable segment are first reconstructed by a prosody synthesis operation that feeds the decoded low-level linguistic features and HPM parameters into the prosody-generating model, i.e., the HPM. The output speech is finally generated by an HMM-based speech synthesizer using the reconstructed prosodic-acoustic features and the decoded low-level linguistic features. The primary parts of the proposed approach are discussed in detail in the following subsections. The HPM serving as the prosody-generating model is introduced in Section 2.1. Then, the analysis-synthesis operations and prosody-parameter coding are described in Section 2.2. Lastly, the reconstruction of the speech signal is discussed in Section 2.3.

Fig. 1 A schematic diagram of the proposed method

2.1 The prosody-generating model HPM

The HPM used in this study is the statistical prosodic model proposed previously [41, 42]. Although the details of the HPM are included in [41], we briefly reintroduce it here to make the presentation of the proposed parametric prosody coding approach more complete and easier to understand. The HPM is designed to describe the various relationships among prosodic-acoustic features, prosodic structure, and linguistic features. Three types of prosodic-acoustic features are modeled in the HPM: syllable prosodic-acoustic features, syllable-juncture prosodic-acoustic features, and inter-syllable differential prosodic-acoustic features. The syllable prosodic-acoustic features include the syllable pitch contour spn, syllable duration sdn, and syllable energy level sen of the n-th syllable. Here, the pitch contour of each syllable is represented by a third-order orthogonal polynomial expansion [43]. The basis polynomials are defined on the time axis normalized to [0,1] and can be expressed as:

$$ {\displaystyle \begin{array}{l}{\phi}_0\left(\frac{i}{M}\right)=1\\ {}{\phi}_1\left(\frac{i}{M}\right)={\left[\frac{12\cdot M}{M+2}\right]}^{1/2}\cdot \left[\frac{i}{M}-\frac{1}{2}\right]\\ {}{\phi}_2\left(\frac{i}{M}\right)={\left[\frac{180\cdot {M}^3}{\left(M-1\right)\left(M+2\right)\left(M+3\right)}\right]}^{1/2}\cdot \left[{\left(\frac{i}{M}\right)}^2-\frac{i}{M}+\frac{M-1}{6\cdot M}\right]\\ {}{\phi}_3\left(\frac{i}{M}\right)={\left[\frac{2800\cdot {M}^5}{\left(M-1\right)\left(M-2\right)\left(M+2\right)\left(M+3\right)\left(M+4\right)}\right]}^{1/2}\\ {}\kern3em \cdot \left[{\left(\frac{i}{M}\right)}^3-\frac{3}{2}{\left(\frac{i}{M}\right)}^2+\frac{6{M}^2-3M+2}{10\cdot {M}^2}\left(\frac{i}{M}\right)-\frac{\left(M-1\right)\left(M-2\right)}{20\cdot {M}^2}\right]\end{array}} $$
(1)

for 0 ≤ i ≤ M, where M + 1 is the length, in frames, of the current syllable's log-pitch contour and M ≥ 3. They are, in fact, discrete Legendre polynomials. The pitch contour Fn(i) of syllable n can then be approximated by:

$$ {F}_n(i)\approx \sum \limits_{j=0}^3{\alpha}_{j,n}\cdot {\phi}_j\left(\frac{i}{M_n}\right)\kern0.5em i=0\sim {M}_n, $$
(2)

where

$$ {\alpha}_{j,n}=\frac{1}{M_n+1}\sum \limits_{i=0}^{M_n}{F}_n(i)\cdot {\phi}_j\left(\frac{i}{M_n}\right)\kern2em j=0\sim 3 $$
(3)

Then, the four coefficients of syllable n form a vector spn = [α0,n, α1,n, α2,n, α3,n]T representing its pitch contour. The syllable-juncture prosodic-acoustic features include the pause duration pdn and energy-dip level edn of the syllable juncture between the n-th and (n + 1)-th syllables (referred to as syllable juncture n hereafter). The inter-syllable differential prosodic-acoustic features include the normalized pitch-level jump pjn and the two normalized duration lengthening factors dln and dfn of syllable juncture n. Note that these differential features are obtained after eliminating the effects of low-level linguistic features, i.e., tone and base-syllable type. Specifically, the normalized pitch-level jump is defined by:

$$ {pj}_n=\left({sp}_{n+1}(1)-{\chi}_{t_{n+1}}\right)-\left({sp}_n(1)-{\chi}_{t_n}\right) $$
(4)

where spn(1) is the first dimension of the syllable pitch contour spn (i.e., the syllable pitch level); tn ∈ {1, 2, 3, 4, 5} is the tone of syllable n; and χt is the average pitch level of tone t. The two normalized duration lengthening factors are defined by:

$$ {dl}_n=\left({sd}_n-{\pi}_{t_n}-{\pi}_{s_n}\right)-\left({sd}_{n-1}-{\pi}_{t_{n-1}}-{\pi}_{s_{n-1}}\right) $$
(5)
$$ {df}_n=\left({sd}_n-{\pi}_{t_n}-{\pi}_{s_n}\right)-\left({sd}_{n+1}-{\pi}_{t_{n+1}}-{\pi}_{s_{n+1}}\right) $$
(6)

where πt and πs represent respectively the average syllable durations of tone t and of base-syllable type s. So, the complete prosodic-acoustic feature sequence is A = {X, Y, Z} = {sp, sd, se, pd, ed, pj, dl, df}, where X = {sp, sd, se}, Y = {pd, ed}, and Z = {pj, dl, df} represent sequences of the syllable prosodic-acoustic features, the syllable-juncture prosodic-acoustic features, and the inter-syllable differential prosodic-acoustic features, respectively.
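For concreteness, the following NumPy sketch illustrates the syllable pitch-contour parameterization of Eqs. (1)-(3): the log-F0 contour of one syllable is projected onto the first four discrete Legendre polynomials to obtain the vector spn. The function and variable names (e.g., `legendre_basis`, `pitch_params`) are illustrative and not part of the original system.

```python
import numpy as np

def legendre_basis(M):
    """Return the four discrete Legendre basis functions phi_0..phi_3
    evaluated at i/M for i = 0..M (Eq. 1); requires M >= 3."""
    x = np.arange(M + 1) / M
    phi0 = np.ones(M + 1)
    phi1 = np.sqrt(12.0 * M / (M + 2)) * (x - 0.5)
    phi2 = np.sqrt(180.0 * M**3 / ((M - 1) * (M + 2) * (M + 3))) * \
        (x**2 - x + (M - 1) / (6.0 * M))
    phi3 = np.sqrt(2800.0 * M**5 /
                   ((M - 1) * (M - 2) * (M + 2) * (M + 3) * (M + 4))) * \
        (x**3 - 1.5 * x**2 + (6 * M**2 - 3 * M + 2) / (10.0 * M**2) * x
         - (M - 1) * (M - 2) / (20.0 * M**2))
    return np.stack([phi0, phi1, phi2, phi3])          # shape (4, M+1)

def pitch_params(log_f0):
    """Project a syllable log-F0 contour (length M+1 >= 4) onto the basis,
    giving the 4-dimensional vector sp_n of Eq. (3)."""
    M = len(log_f0) - 1
    return legendre_basis(M) @ np.asarray(log_f0) / (M + 1)

# Toy contour in Hz; reconstruction (cf. Eq. (2)) is basis.T @ sp_n.
sp = pitch_params(np.log([120, 125, 130, 128, 122, 118]))
print(sp)
```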

The prosodic structure considered in the HPM is the four-layer prosody hierarchy shown in Fig. 2. It is a modified version of the hierarchical prosodic phrase grouping (HPG) model proposed by Tseng [44]. It is composed of four types of layered prosodic constituents, from bottom to top: syllable (SYL), prosodic word (PW), prosodic phrase (PPh), and breath/prosodic phrase group (BG/PG). In the HPM, the prosody hierarchy is represented in terms of two types of prosody tags T = {B, P}: the break type B of a syllable juncture and the prosodic state P of a syllable. The break type B specifies the boundaries of the prosodic constituents, while the prosodic state P specifies the patterns of the higher-level prosodic constituents. As shown in Fig. 2, the four prosodic constituents are delimited by seven break types denoted as B0, B1, B2–1, B2–2, B2–3, B3, and B4 [41, 42]. First, B0 and B1 represent, respectively, non-breaks of a reduced syllable boundary (or tightly coupled syllable juncture) and a normal syllable boundary within a PW; neither has an identifiable pause between SYLs. Second, the PW boundary B2 = {B2–1, B2–2, B2–3} is perceived as a minor-break boundary usually followed by a slight change in tone of voice. Here, B2–1, B2–2, and B2–3 represent PW boundaries with F0 reset, a short pause, and pre-boundary syllable duration lengthening, respectively. Third, the PPh boundary B3 is perceived as a clear pause. Fourth, B4 is defined for a breathing pause or the end of a complete speech paragraph, characterized by final lengthening coupled with weakening of speech sounds. The prosodic state P of a syllable is conceptually defined as its state in a prosodic phrase, accounting for the prosodic-acoustic feature variations imposed by higher-level prosodic constituents (i.e., PW, PPh, and BG/PG). In the HPM, three types of prosodic states are used: pitch prosodic state p, duration prosodic state q, and energy prosodic state r. So, the complete prosodic tag sequence is T = {B, P}, where B = {Bn} is a break type sequence with Bn ∈ {B0, B1, B2–1, B2–2, B2–3, B3, B4} being the break type of syllable juncture n, and P = {p, q, r} is the prosodic-state tag sequence with p = {pn}, q = {qn}, and r = {rn}.

Fig. 2 The prosody hierarchical structure of Mandarin speech used in this study [42]

The linguistic features involved in the HPM can be classified into two classes: the low-level linguistic features and the high-level linguistic features. The low-level linguistic features are those accounting for the prosodic-acoustic feature variation resulting from the prosodic constituent of the lowest level, i.e., SYL, while the high-level linguistic features account for the syllable prosodic-acoustic feature variations imposed on higher-level prosodic constituents (i.e., PW, PPh, and BG/PG) through the prosodic state. The low-level linguistic features are syllable-level features including lexical tone sequence t, base-syllable sequence s, and final type sequence f. The high-level linguistic features are word-level features or above. For simplicity, only the word-level linguistic features are used in the HPM. They include word length sequence WL, part-of-speech sequence POS, and punctuation mark sequence PM. In summary, the linguistic feature sequence used is L = {t, s, f, WL, POS, PM}.

To give a clearer picture of notations for the features and prosodic tags used in this study, we summarize them in Table 1.

Table 1 Notations of prosodic tags, prosodic-acoustic features and linguistic features

The HPM is a model P(T, A| L) designed to describe the various relationships of prosodic-acoustic features, prosodic structure, and linguistic features. The model is formulated as

$$ {\displaystyle \begin{array}{l}P\left(\mathbf{T},\mathbf{A}|\mathbf{L}\right)=P\left(\mathbf{A}|\mathbf{T},\mathbf{L}\right)P\left(\mathbf{T}|\mathbf{L}\right)=P\left(\mathbf{X},\mathbf{Y},\mathbf{Z}|\mathbf{B},\mathbf{P},\mathbf{L}\right)P\left(\mathbf{B},\mathbf{P}|\mathbf{L}\right)\\ {}\kern4em \approx P\left(\mathbf{X}|\mathbf{B},\mathbf{P},\mathbf{L}\right)P\left(\mathbf{Y},\mathbf{Z}|\mathbf{B},\mathbf{L}\right)P\left(\mathbf{P}|\mathbf{B}\right)P\left(\mathbf{B}|\mathbf{L}\right)\end{array}} $$
(7)

where P(X| B, P, L) is the syllable prosodic-acoustic model, which describes the influences of the two types of prosodic tags and the contextual linguistic features on the variations of syllable F0 contour, duration, and energy level; P(Y, Z| B, L) is the syllable-juncture prosodic-acoustic model, describing the inter-syllable acoustic characteristics for different break types and surrounding linguistic features; P(P| B) is the prosodic state model, describing the variation of prosodic state conditioned on the neighboring break type; and P(B| L) is the break-syntax model, describing the dependence of break occurrence on the surrounding linguistic features. The four models are further elaborated as follows.

The syllable prosodic-acoustic model P(X| B, P, L) is further divided into three sub-models by:

$$ {\displaystyle \begin{array}{l}P\left(\mathbf{X}|\mathbf{B},\mathbf{P},\mathbf{L}\right)\approx P\left(\mathbf{sp}|\mathbf{B},\mathbf{p},\mathbf{t}\right)P\left(\mathbf{sd}|\mathbf{B},\mathbf{q},\mathbf{t},\mathbf{s}\right)P\left(\mathbf{se}|\mathbf{B},\mathbf{r},\mathbf{t},\mathbf{f}\right)\\ {}\kern4.75em \approx \prod \limits_{n=1}^NP\left({sp}_n|{B}_{n-1}^n,{p}_n,{t}_{n-1}^{n+1}\right)P\left({sd}_n|{q}_n,{s}_n,{t}_n\right)P\left({se}_n|{r}_n,{f}_n,{t}_n\right)\end{array}} $$
(8)

where \( P\left({sp}_n|{B}_{n-1}^n,{p}_n,{t}_{n-1}^{n+1}\right) \), P(sdn| qn, sn, tn), and P(sen| rn, fn, tn) are sub-models for the pitch contour, duration, and energy level of syllable n, respectively; tn, sn, and fn denote the tone, base-syllable type, and final type of syllable n; \( {B}_{n-1}^n=\left({B}_{n-1},{B}_n\right) \); and \( {t}_{n-1}^{n+1}=\left({t}_{n-1},{t}_n,{t}_{n+1}\right) \). \( P\left({sp}_n|{B}_{n-1}^n,{p}_n,{t}_{n-1}^{n+1}\right) \) is further elaborated to consider four major affecting factors. Assuming that all affecting factors combine additively, we have

$$ {sp}_n={sp}_n^r+{\beta}_{t_n}+{\beta}_{p_n}+{\beta}_{B_{n-1},{tp}_{n-1}}^f+{\beta}_{B_n,{tp}_n}^b+{\mu}_{sp} $$
(9)

where spn is the 4-dimensional vector representing the observed log-F0 contour of syllable n; \( {sp}_n^r \) is the modeling residue; \( {\beta}_{t_n} \) and \( {\beta}_{p_n} \) are the affecting patterns (APs) for tn and pn, respectively; the subscript tpn represents the tone pair \( {t}_n^{n+1} \); \( {\beta}_{B_{n-1},{tp}_{n-1}}^f \) and \( {\beta}_{B_n,{tp}_n}^b \) are the forward and backward coarticulation APs contributed from syllable n − 1 and syllable n + 1, respectively; and μsp is the global mean of the pitch vector. In this study, \( {\beta}_{p_n} \) is set to have a nonzero value only in its first dimension in order to restrict the influence of the prosodic state to the log-F0 level of the current syllable. For simplicity, μsp is also assumed to have a nonzero value only in its first dimension. By assuming that \( {sp}_n^r \) is zero-mean and normally distributed, i.e., \( N\left({sp}_n^r;0,{R}_{sp}\right) \), we have

$$ P\left({sp}_n|{B}_{n-1}^n,{p}_n,{t}_{n-1}^{n+1}\right)=N\left({sp}_n;{\beta}_{t_n}+{\beta}_{p_n}+{\beta}_{B_{n-1},{tp}_{n-1}}^f+{\beta}_{B_n,{tp}_n}^b+{\mu}_{sp},{R}_{sp}\right) $$
(10)

Note that \( {sp}_n^r \) is a noise-like residual signal with very small deviation, so we model it with a normal distribution.

Similar to the design of the syllable pitch contour model, the syllable duration model P(sdn| qn, sn, tn) and the syllable energy level model P(sen| rn, fn, tn) are formulated by

$$ P\left({sd}_n|{q}_n,{s}_n,{t}_n\right)=N\left({sd}_n;{\gamma}_{t_n}+{\gamma}_{s_n}+{\gamma}_{q_n}+{\mu}_{sd},{R}_{sd}\right) $$
(11)
$$ P\left({se}_n|{r}_n,{f}_n,{t}_n\right)=N\left({se}_n;{\omega}_{t_n}+{\omega}_{f_n}+{\omega}_{r_n}+{\mu}_{se},{R}_{se}\right) $$
(12)

where sdn and sen are the observed duration and energy level of syllable n, respectively; γ's and ω's represent APs for syllable duration and syllable energy level; μsd and μse are their global means; and Rsd and Rse are variances of modeling residues.

The syllable-juncture prosodic-acoustic model, P(Y, Z| B, L), is further divided into five sub-models by

$$ {\displaystyle \begin{array}{l}P\left(\mathbf{Y},\mathbf{Z}|\mathbf{B},\mathbf{L}\right)\approx P\left(\mathbf{pd},\mathbf{ed},\mathbf{pj},\mathbf{dl},\mathbf{df}|\mathbf{B},\mathbf{L}\right)\\ {}\kern5.25em \approx \prod \limits_{n=1}^{N-1}P\left({pd}_n,{ed}_n,{pj}_n,{dl}_n,{df}_n|\mathbf{B},\mathbf{L}\right)\\ {}\kern5.25em \approx \prod \limits_{n=1}^{N-1}\left\{g\left({pd}_n;{\alpha}_{B_n,{L}_n},{\eta}_{B_n,{L}_n}\right)N\left({ed}_n;{\mu}_{ed,{B}_n,{L}_n},{\sigma}_{ed,{B}_n,{L}_n}^2\right)\right.\\ {}\kern9.75em \cdot N\left({pj}_n;{\mu}_{pj,{B}_n,{L}_n},{\sigma}_{pj,{B}_n,{L}_n}^2\right)N\left({dl}_n;{\mu}_{dl,{B}_n,{L}_n},{\sigma}_{dl,{B}_n,{L}_n}^2\right)\\ {}\kern9.75em \left.\cdot N\left({df}_n;{\mu}_{df,{B}_n,{L}_n},{\sigma}_{df,{B}_n,{L}_n}^2\right)\right\}\end{array}} $$
(13)

where \( g\left({pd}_n;{\alpha}_{B_n,{L}_n},{\eta}_{B_n,{L}_n}\right) \) is a Gamma distribution for the pause duration pdn of syllable juncture n, and the other four features, edn, pjn, dln, and dfn, are all modeled by normal distributions. Since the space of Ln is large, the CART algorithm [45] with the node-splitting criterion of maximum likelihood (ML) gain is adopted to concurrently classify the five features pdn, edn, pjn, dln, and dfn for each break type according to a question set. The question set consists of 216 questions considering the following linguistic features around the current juncture: (1) the initial type of the following syllable; (2) the interword/intraword indicator; (3) the lengths and (4) the POSs of the words before and after the juncture if it is an interword juncture; and (5) the PM type for an interword juncture. Each leaf node represents the product of the five sub-models. So, seven decision trees are constructed for the syllable-juncture prosodic-acoustic model.
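As a rough illustration of Eq. (13), the sketch below evaluates the log likelihood of the five juncture features at a single leaf node, with a Gamma density for pause duration and Gaussian densities for the rest. The parameter values are made-up placeholders; in the actual system they are estimated per break type and leaf node by the CART training.

```python
from scipy.stats import gamma, norm

def leaf_log_likelihood(pd, ed, pj, dl, df, params):
    """Log of the product of the five per-feature densities at one leaf node."""
    # Pause duration: Gamma(alpha, eta) as in Eq. (13)
    ll = gamma.logpdf(pd, a=params["pd_shape"], scale=params["pd_scale"])
    # Energy dip, pitch jump, and the two lengthening factors: Gaussian
    for value, key in ((ed, "ed"), (pj, "pj"), (dl, "dl"), (df, "df")):
        ll += norm.logpdf(value, loc=params[key + "_mean"],
                          scale=params[key + "_std"])
    return ll

toy_params = {"pd_shape": 2.0, "pd_scale": 0.05,      # hypothetical values
              "ed_mean": -5.0, "ed_std": 2.0,
              "pj_mean": 0.0, "pj_std": 0.3,
              "dl_mean": 0.0, "dl_std": 0.05,
              "df_mean": 0.0, "df_std": 0.05}
print(leaf_log_likelihood(0.12, -4.0, 0.1, 0.02, -0.01, toy_params))
```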

The prosodic state model P(P| B) is further divided into three sub-models:

$$ {\displaystyle \begin{array}{l}P\left(\mathbf{P}|\mathbf{B}\right)\approx P\left(\mathbf{p}|\mathbf{B}\right)P\left(\mathbf{q}|\mathbf{B}\right)P\left(\mathbf{r}|\mathbf{B}\right)\\ {}\kern3.25em \approx P\left({p}_1\right)P\left({q}_1\right)P\left({r}_1\right)\left[\prod \limits_{n=2}^NP\left({p}_n|{p}_{n-1},{B}_{n-1}\right)P\left({q}_n|{q}_{n-1},{B}_{n-1}\right)P\left({r}_n|{r}_{n-1},{B}_{n-1}\right)\right]\end{array}} $$
(14)

where P(pn| pn − 1, Bn − 1), P(qn| qn − 1, Bn − 1), and P(rn| rn − 1, Bn − 1) are the prosodic state transition models for syllable pitch level, duration, and energy level, respectively. Notice that, in the above formulation, the dependency on the break type of the preceding syllable juncture enables these models to properly capture significant pitch/energy resets and the pre-boundary lengthening effect across major breaks. We also note that the three prosodic states are modeled independently for simplicity.

Lastly, the break-syntax model P(B| L) is approximated by

$$ P\left(\mathbf{B}|\mathbf{L}\right)\approx \prod \limits_{n=1}^{N-1}P\left({B}_n|{L}_n\right) $$
(15)

where P(Bn| Ln) is the break type model for juncture n, and Ln is the contextual linguistic features surrounding juncture n. Since the space of linguistic features Ln is large, we partition it into several classes C(Ln) by the CART decision tree algorithm [45] using the maximum likelihood gain criterion and the same question set used in the training of the syllable-juncture prosodic-acoustic model.

The HPM can be trained automatically from a prosody-unlabeled speech database by a joint prosody labeling and modeling (PLM) algorithm [41]. The PLM algorithm is a sequential optimization procedure based on the ML criterion to jointly label the prosodic tags for all utterances of the training corpus and estimate the parameters of all 12 prosodic sub-models.

2.2 The parametric prosody coding approach

The proposed parametric prosody coding approach considers the coding of four prosodic-acoustic features including syllable pitch contour, syllable duration, syllable energy level, and syllable-juncture pause duration. It takes four sub-models of the HPM as the generating models for these four prosodic features. The four sub-models are the syllable pitch contour sub-model \( P\left({sp}_n|{B}_{n-1}^n,{p}_n,{t}_{n-1}^{n+1}\right) \), the syllable duration sub-model P(sdn| qn, sn, tn), the syllable energy level sub-model P(sen| rn, fn, tn), and the syllable-juncture pause duration sub-model \( g\left({pd}_n;{\alpha}_{B_n,{L}_n},{\eta}_{B_n,{L}_n}\right) \). The first three sub-models are controlled directly by low-level linguistic features and prosodic tags through their APs, while the last one is controlled implicitly by high-level linguistic features and prosodic tags through the break type-dependent decision trees. The low-level and high-level linguistic features can be simply obtained from a linguistic processor while the prosodic tags, pn, qn, rn, and Bn, are obtained by the prosody analysis operation to be discussed below. We discuss the prosody analysis operation and the parameter coding for these four prosodic-acoustic features in detail as follows.

2.2.1 Prosody analysis operation

The task of the prosody analysis operation is to find the best prosodic state and break type sequences for the utterance being encoded, given its prosodic-acoustic features and linguistic features. Based on the HPM, the task is formulated as

$$ {\mathbf{T}}^{\ast }=\left\{{\mathbf{B}}^{\ast },{\mathbf{P}}^{\ast}\right\}=\arg \underset{\mathbf{B},\mathbf{P}}{\max }Q $$
(16)

where

$$ {\displaystyle \begin{array}{l}Q=P\left(\mathbf{B}|\mathbf{L}\right)P\left(\mathbf{P}|\mathbf{B}\right)P\left(\mathbf{X}|\mathbf{B},\mathbf{P},\mathbf{L}\right)P\left(\mathbf{Y},\mathbf{Z}|\mathbf{B},\mathbf{L}\right)\kern0.5em \\ {}=\left(\prod \limits_{n=1}^{N-1}P\left({B}_n|{L}_n\right)\right)\left(P\left({p}_1\right)P\left({q}_1\right)P\left({r}_1\right)\left[\prod \limits_{n=2}^NP\left({p}_n|{p}_{n-1},{B}_{n-1}\right)P\left({q}_n|{q}_{n-1},{B}_{n-1}\right)P\left({r}_n|{r}_{n-1},{B}_{n-1}\right)\right]\right)\\ {}\ \left(\prod \limits_{n=1}^NP\left({sp}_n|{B}_{n-1}^n,{p}_n,{t}_{n-1}^{n+1}\right)P\left({sd}_n|{q}_n,{s}_n,{t}_n\right)P\left({se}_n|{r}_n,{f}_n,{t}_n\right)\right)\left(\prod \limits_{n=1}^{N-1}g\right.\left({pd}_n;{\alpha}_{B_n,{L}_n},{\eta}_{B_n,{L}_n}\right)\cdot \\ {}\ \left.N\left({ed}_n;{\mu}_{ed,{B}_n,{L}_n},{\sigma}_{ed,{B}_n,{L}_n}^2\right)N\left({pj}_n;{\mu}_{pj,{B}_n,{L}_n},{\sigma}_{pj,{B}_n,{L}_n}^2\right)N\left({dl}_n;{\mu}_{dl,{B}_n,{L}_n},{\sigma}_{dl,{B}_n,{L}_n}^2\right)N\left({df}_n;{\mu}_{df,{B}_n,{L}_n},{\sigma}_{df,{B}_n,{L}_n}^2\right)\right)\end{array}} $$
(17)

The task is realized by the following iterative procedure (a code sketch of the loop is given after the procedure):

  1)

    Initialization

    For i = 0, find the initial break type sequence by

    $$ {\mathbf{B}}^i=\arg \underset{\mathbf{B}}{\max }P\left(\mathbf{Y},\mathbf{Z}|\mathbf{B},\mathbf{L}\right)P\left(\mathbf{B}|\mathbf{L}\right) $$
    (18)
  2)

    Iteration

    Starting from i = 1, estimate the prosodic state sequence and the break type sequence iteratively by the following three steps:

    Step 1: Given Bi − 1, re-label the prosodic state sequence of each utterance by the Viterbi algorithm so as to maximize Q defined in (17), i.e.,

    $$ {\mathbf{P}}^i=\arg \underset{\mathbf{P}}{\max }P\left(\mathbf{X}|{\mathbf{B}}^{i-1},\mathbf{P},\mathbf{L}\right)P\left(\mathbf{Y},\mathbf{Z}|{\mathbf{B}}^{i-1},\mathbf{L}\right)P\left(\mathbf{P}|{\mathbf{B}}^{i-1}\right)P\left({\mathbf{B}}^{i-1}|\mathbf{L}\right) $$
    (19)

    Step 2: Given Pi, re-label the break type sequence of each utterance by the Viterbi algorithm so as to maximize Q, i.e.,

    $$ {\mathbf{B}}^i=\arg \underset{\mathbf{B}}{\max }P\left(\mathbf{X}|\mathbf{B},{\mathbf{P}}^i,\mathbf{L}\right)P\left(\mathbf{Y},\mathbf{Z}|\mathbf{B},\mathbf{L}\right)P\left({\mathbf{P}}^i|\mathbf{B}\right)P\left(\mathbf{B}|\mathbf{L}\right) $$
    (20)

    Step 3: If a convergence of the value Q is reached, exit the iteration; otherwise, increase i by 1 and go to step 1.

  3)

    Termination

    $$ {\mathbf{B}}^{\ast }={\mathbf{B}}^i\kern0.5em \mathrm{and}\kern0.5em {\mathbf{P}}^{\ast }={\mathbf{P}}^i $$
    (21)
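A minimal Python sketch of this coordinate-ascent loop is given below. The three labeling functions are passed in as callables and stand for the initialization of Eq. (18) and the Viterbi searches of Eqs. (19) and (20); they are placeholders for the HPM-dependent computations, not part of an existing API.

```python
def prosody_analysis(score_Q, init_breaks, label_states, label_breaks,
                     max_iter=20, tol=1e-4):
    """Alternate Viterbi labeling of prosodic states P and break types B
    until the objective Q of Eq. (17) converges (Eqs. (18)-(21))."""
    B = init_breaks()                       # initialization, Eq. (18)
    P = None
    prev_Q = float("-inf")
    for _ in range(max_iter):
        P = label_states(B)                 # Step 1, Eq. (19)
        B = label_breaks(P)                 # Step 2, Eq. (20)
        Q = score_Q(B, P)                   # evaluate Eq. (17)
        if Q - prev_Q < tol:                # Step 3: convergence of Q
            break
        prev_Q = Q
    return B, P                             # termination, Eq. (21)
```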

2.2.2 Coding of prosody parameters

In the HPM, the syllable pitch contour spn, the syllable duration sdn, and the syllable energy level sen are linearly modeled in Eqs. (10), (11), and (12) in terms of several major affecting factors that influence their variations. The affecting factors involved in these three sub-models are some low-level linguistic features and prosodic tags, including the tone tn, base-syllable type sn, final type fn, break-type tag Bn, and the three prosodic-state tags pn, qn, and rn. These affecting factors are the only parameters required to represent the three syllable prosodic-acoustic features by the HPM. So, in the encoder, we only need to consider the encoding of these seven affecting factors. In the decoder, the decoded versions of these seven affecting factors can be used to reconstruct the three syllable prosodic-acoustic features. Specifically, the syllable pitch contour, the syllable duration, and the syllable energy level are simply reconstructed by superimposing the APs associated with these affecting factors, i.e.,

$$ s{p}_n^{\prime }={\beta}_{t_n}+{\beta}_{p_n}+{\beta}_{B_{n-1},{tp}_{n-1}}^f+{\beta}_{B_n,{tp}_n}^b+{\mu}_{sp} $$
(22)
$$ s{d}_n^{\prime }={\gamma}_{t_n}+{\gamma}_{s_n}+{\gamma}_{q_n}+{\mu}_{sd} $$
(23)
$$ s{e}_n^{\prime }={\omega}_{t_n}+{\omega}_{f_n}+{\omega}_{r_n}+{\mu}_{se} $$
(24)

We note that the three means, μsp, μsd, and μse, are sent in advance to the decoder as side information. We also note that the three modeling residuals, \( {sp}_n^r \), \( {sd}_n^r \), and \( {se}_n^r \), are neglected in the above three equations because their variances are all small.
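The decoder-side reconstruction of Eqs. (22)-(24) is just a table lookup and sum of the decoded APs. The sketch below makes this explicit; the dictionary layout of the AP tables (`beta_t`, `gamma_s`, etc.) is a hypothetical stand-in for the side information of Table 3.

```python
def reconstruct_syllable(sym, aps):
    """sym: decoded symbols for one syllable (tone t, base-syllable s, final f,
    prosodic states p/q/r, break types and tone pairs of the two junctures);
    aps: decoded HPM affecting patterns and global means (side information)."""
    sp = (aps["beta_t"][sym["t"]] + aps["beta_p"][sym["p"]]
          + aps["beta_f"][(sym["B_prev"], sym["tp_prev"])]
          + aps["beta_b"][(sym["B"], sym["tp"])]
          + aps["mu_sp"])                                    # Eq. (22)
    sd = (aps["gamma_t"][sym["t"]] + aps["gamma_s"][sym["s"]]
          + aps["gamma_q"][sym["q"]] + aps["mu_sd"])         # Eq. (23)
    se = (aps["omega_t"][sym["t"]] + aps["omega_f"][sym["f"]]
          + aps["omega_r"][sym["r"]] + aps["mu_se"])         # Eq. (24)
    return sp, sd, se
```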

In the HPM, the pause duration is modeled by the syllable-juncture pause duration sub-model, \( g\left({pd}_n;{\alpha}_{B_n,{L}_n},{\eta}_{B_n,{L}_n}\right) \). The sub-model describes the variation of syllable-juncture pause duration as influenced by some contextual linguistic features and the break type, and it is organized into 7 break type-dependent decision trees (BDTs). For each break type, a decision tree is used to determine the probability density function (pdf) of syllable-juncture pause duration according to the contextual linguistic features. Here, all pdfs are assumed to be Gamma distributed. An analysis of these 7 decision trees found that, for the break types with very short pause durations (< 0.03 s), i.e., B0, B1, B2–1, and B2–3, the means of the leaf-node pdfs are very close to those of the root-node pdfs. For the break types with longer pause durations, i.e., B2–2, B3, and B4, the leaf-node pdfs have more sophisticated pause duration distributions. However, in an informal listening test, we found that synthesized speech with pause durations encoded by the root-node pdfs sounds almost the same as speech encoded by the leaf-node pdfs. The pause duration information can therefore be encoded solely by the root-node pdfs. In other words, only the symbols of the break types need to be encoded and sent to the decoder. The means of the pdfs for the 7 break types are sent to the decoder as side information. The decoder reconstructs each syllable-juncture pause duration as the mean of the root-node pdf of the corresponding break type.

In summary, the symbols to be encoded for each syllable segment and its following pause in the proposed parametric prosody coding approach are the tone, base-syllable type, prosodic-state tags, and break-type tag. Table 2 lists the bit assignments for these symbols based on the two experimental settings conducted in this study: prosody coding for a speaker-independent (SI) case and a speaker-dependent (SD) case. The numbers of prosodic states for pn, qn, and rn are all empirically set to 16, while the number of break types is 7, determined by the hierarchical prosody structure used in designing the HPM. There are five lexical tones and 411 base-syllable types in Mandarin Chinese. As shown in Table 2, the total number of bits per syllable is 27 for both the SI and SD cases.

Table 2 Bit assignment for symbols used in the parametric prosody coding
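As a quick sanity check of the bit budget in Table 2, the snippet below computes a fixed-length code size of ceil(log2(alphabet size)) bits for each symbol; this is only an illustrative assumption about how the bits in Table 2 are assigned.

```python
import math

alphabet_sizes = {"tone": 5, "base-syllable type": 411,
                  "pitch prosodic state": 16, "duration prosodic state": 16,
                  "energy prosodic state": 16, "break type": 7}
bits = {name: math.ceil(math.log2(size)) for name, size in alphabet_sizes.items()}
print(bits)                # tone: 3, base-syllable type: 9, each prosodic state: 4, break type: 3
print(sum(bits.values()))  # 27 bits per syllable, matching Table 2
```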

Aside from the above regular bitstream, some HPM parameters are also needed in the decoder to reconstruct the prosodic-acoustic features. They are sent to the decoder in advance as side information. They include the affecting patterns (APs) \( \left\{{\beta}_t,{\beta}_p,{\beta}_{B, tp}^f,{\beta}_{B, tp}^b,{\mu}_{sp}\right\} \) of the syllable pitch-contour sub-model, the APs {γt, γs, γq, μsd} of the syllable duration sub-model, the APs {ωt, ωf, ωr, μse} of the syllable energy level sub-model, and the means \( \left\{{\mu}_T^{pd}\right\} \) of the root-node pdfs of the syllable-juncture pause duration sub-model. Note that the subscript n, which represents the syllable index, is omitted from the APs listed above, i.e., β's, γ's, ω's, to simplify the representation of the types of APs associated with the affecting factors. Specifically, the tone of syllable n, i.e., tn, is one of the tone set t ∈ {1, 2, 3, 4, 5}, so each of the APs {βt, γt, ωt} has five patterns. Similarly, the base-syllable type of syllable n, i.e., sn, is one of the 411 base-syllable types, and hence γs has 411 patterns. Therefore, the total number of parameters is 1477 for the syllable pitch-contour sub-model, including 20 (=5 × 4) for βt, 16 (=16 × 1) for βp, 720 (=(7 × 5 × 5 + 5) × 4) for \( {\beta}_{B, tp}^f \), 720 (=(7 × 5 × 5 + 5) × 4) for \( {\beta}_{B, tp}^b \), and 1 for μsp; 433 for the syllable duration sub-model, including 5 for γt, 16 for γq, 411 for γs, and 1 for μsd; 62 for the syllable energy level sub-model, including 5 for ωt, 40 for ωf, 16 for ωr, and 1 for μse; and 7 BDT root-node means \( {\mu}_T^{pd} \) of the pause-duration pdfs for the 7 break types. Table 3 summarizes the side information of the coding system.

Table 3 Side information of the proposed coding system
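The parameter counts quoted above can be reproduced with simple arithmetic; the sketch below is only a bookkeeping check of those totals, not an actual data structure of the coder.

```python
# Syllable pitch-contour sub-model: beta_t, beta_p, beta^f, beta^b, mu_sp
pitch_aps = 5 * 4 + 16 * 1 + (7 * 5 * 5 + 5) * 4 + (7 * 5 * 5 + 5) * 4 + 1
# Syllable duration sub-model: gamma_t, gamma_q, gamma_s, mu_sd
duration_aps = 5 + 16 + 411 + 1
# Syllable energy level sub-model: omega_t, omega_f, omega_r, mu_se
energy_aps = 5 + 40 + 16 + 1
# Root-node pause-duration means, one per break type
pause_means = 7
print(pitch_aps, duration_aps, energy_aps, pause_means)   # 1477 433 62 7
```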

2.3 Speech synthesis

In this study, an HMM-based speech synthesizer [46,47,48,49] is used to generate the synthetic voice. The standard context-dependent HMM training for speech synthesis [46,47,48] is adopted here to simultaneously construct spectral, voiced/unvoiced and state duration models using the labels containing the information of contextual influential factors.

Five-state left-to-right HMMs are used to model the synthesis units of 21 syllable Initials and 39 syllable Finals. The observations in each HMM state consist of two streams. One is a 75-dimensional spectral feature vector composed of 24-dimensional mel-generalized cepstral coefficients (MGC) [50], delta MGCs, delta-delta MGCs, energy, delta energy, and delta-delta energy. The other is a discrete symbol indicating the voiced/unvoiced status of a frame. The spectral features in each HMM state are modeled by a multivariate single Gaussian, while the voiced/unvoiced symbols are modeled by a discrete probability distribution over the two events. The state durations of each HMM form a 5-dimensional vector which is modeled by a multivariate single Gaussian. Since spectrum, voiced/unvoiced status, and state duration have their own contextual influential factors, the distributions for MGC, the voiced/unvoiced indicator, and state duration are clustered independently. The question set used for the decision tree-based context clustering of HMMs is formed from the context labels. To achieve a better tree clustering result, we merge some individual linguistic features to form several complex questions according to their effects on producing spectrum, state duration, and voiced/unvoiced status. For example, the initial/final types are classified by the manner or place of articulation, the prosodic states of duration are tied according to the values of their APs, and the break types are merged into broader classes according to their corresponding prosodic-acoustic features. In total, 399 questions are formed for the decision tree-based context clustering of HMMs in this study. It is noted that these context-dependent HMMs (CD-HMMs) are embedded-trained [48] with feature vectors consisting of the MGCs along with the voiced/unvoiced indicators so as to avoid a discrepancy between the spectrum and the voiced/unvoiced status in each HMM state. In this study, the training data for the SD case is large enough to conduct speaker-dependent HMM training. The trained HMMs can be directly used to generate synthesized speech given the encoded symbols described in Section 2.2. On the other hand, for the SI case, the speakers in the training set do not overlap with the speakers in the test set, and the number of utterances for each speaker in the test set is very small. The HMM of each speaker in the test set is therefore adapted from the HMMs of the SD case by the CMAPLR approach [51] given only the test utterances.

Figure 3 shows the schematic diagram of the HMM-based speech synthesizer used in this study. To synthesize speech by the HMM-based synthesizer, we first generate the state durations for each syllable segment. The state duration is assumed to be normally distributed and affected by the contextual information of Initial, Final, and prosodic tags, i.e.,

$$ P\left({d}_{n,c}|I\left({s}_{n-1}^{n+1}\right),F\left({s}_{n-1}^{n+1}\right),{p}_n,{q}_n,{r}_n,{B}_{n-1}^n\right)=N\left({d}_{n,c};{\mu}_{n,c},{\sigma}_{n,c}^2\right)\kern1em \mathrm{for}\kern0.5em c=1\sim C $$
(25)

where dn,c denotes the duration of the c-th state of syllable n; C is the total number of states; I(x) and F(x) denote respectively the Initials and Finals of the base-syllable sequence x; and μn,c and σn,c are respectively the mean and standard deviation of the state duration model obtained by finding the leaf node on the decision tree using the contextual information, i.e., \( I\left({s}_{n-1}^{n+1}\right),F\left({s}_{n-1}^{n+1}\right),{p}_n,{q}_n,{r}_n,\mathrm{and}\ {B}_{n-1}^n \). Given the reconstructed syllable duration \( s{d}_n^{\prime } \), the state durations of the syllable segment can be estimated by maximizing the summed log likelihood [48], i.e.,

$$ {d}_{n,1}^{\ast}\cdots {d}_{n,C}^{\ast }=\arg \underset{d_{n,1}\cdots {d}_{n,C}}{\max}\sum \limits_{c=1}^C\log N\left({d}_{n,c};{\mu}_{n,c},{\sigma}_{n,c}^2\right) $$
(26)

under the constraint

$$ s{d}_n^{\prime }=\sum \limits_{c=1}^C{d}_{n,c} $$
(27)
Fig. 3 A schematic diagram of the HMM-based speech synthesizer used in this study

The resulting state durations are expressed by

$$ {d}_{n,c}={\mu}_{n,c}+\rho \cdot {\sigma}_{n,c}^2\kern1em \mathrm{for}\ c=1\sim C $$
(28)

where

$$ \rho =\left(s{d}_n^{\prime }-\sum \limits_{c=1}^C{\mu}_{n,c}\right)/\left(\sum \limits_{c=1}^C{\sigma}_{n,c}^2\right) $$
(29)
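A minimal sketch of Eqs. (28)-(29): the state durations are shifted from their model means in proportion to their variances so that they sum to the reconstructed syllable duration sd'n. The example values below are arbitrary placeholders.

```python
import numpy as np

def allocate_state_durations(sd_prime, means, variances):
    """means, variances: per-state duration statistics of the C HMM states."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    rho = (sd_prime - means.sum()) / variances.sum()   # Eq. (29)
    return means + rho * variances                     # Eq. (28)

# Example: 5 states, target syllable duration 0.24 s (arbitrary numbers)
print(allocate_state_durations(0.24, [0.03, 0.05, 0.06, 0.05, 0.03],
                               [1e-4, 2e-4, 3e-4, 2e-4, 1e-4]))
```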

An HMM state is set to be voiced if the probability of its being voiced is larger than that of its being unvoiced. Therefore, the length and position of the syllable pitch contour can be determined directly from the estimated state durations and the voiced/unvoiced indicators of the HMM states. Using the reconstructed syllable pitch contour parameter \( s{p}_n^{\prime }=\left[{\alpha}_{0,n}^{\prime },{\alpha}_{1,n}^{\prime },{\alpha}_{2,n}^{\prime },{\alpha}_{3,n}^{\prime}\right] \), we can reconstruct the pitch contour of syllable n by orthogonal expansion [43], i.e.,

$$ {F}_n^{\prime }(i)=\sum \limits_{j=0}^3{\alpha}_{j,n}^{\prime}\cdot {\phi}_j\left(\frac{i}{M_n^{\prime }}\right)\kern1em \mathrm{for}\ i=0\sim {M}_n^{\prime } $$
(30)

where \( {M}_n^{\prime }+1 \) is the estimated length of the pitch contour of syllable n. The excitation signal can then be generated using the reconstructed syllable pitch contours. On the other hand, the frame spectral feature (i.e., MGC) vector sequence is generated by an HMM parameter generation algorithm [52] given the CD-HMMs, the estimated state durations, and the contextual information (i.e., \( I\left({s}_{n-1}^{n+1}\right),F\left({s}_{n-1}^{n+1}\right),{p}_n,{q}_n,{r}_n,\mathrm{and}\ {B}_{n-1}^n \)). It is noted that the energy level of each syllable CD-HMM (i.e., an Initial CD-HMM connected with a Final CD-HMM) is scaled to \( s{e}_n^{\prime } \) before executing the parameter generation algorithm so as to make the generated energy contour smooth and approximate the desired syllable energy levels. Lastly, the output speech is synthesized directly from the generated MGC coefficients and the excitation signal by the MLSA filter [53].

3 Experimental results

3.1 Databases and experimental settings

The proposed parametric prosody coding approach was evaluated on two large Mandarin read-speech databases, the Treebank speech corpus and TCC300 [54]. The Treebank speech corpus was designed for constructing a TTS system and consists of 420 utterances with 55,766 syllables uttered by a female professional announcer in a quiet room. Its associated texts are all short paragraphs composed of several sentences selected from the Sinica Treebank Version 3.0 text corpus [55]. The TCC300 database was collected for Mandarin automatic speech recognition (ASR). It consists of two sets: short sentential utterances from 103 speakers (set A), designed for phonetic balance, and paragraphic utterances from 200 speakers (set B), designed for studying the use of prosody in ASR. The texts of set B were selected from the Academia Sinica Balanced Corpus of Modern Chinese (ASBC) [56]. Each database was further divided into training and testing subsets. Table 4 lists the usage of the subsets of each speech corpus and their statistics.

Table 4 The usages of the subsets for each speech corpus and their statistics

The Treebank speech corpus and TCC300 were digitally recorded at 20 kHz and 16 kHz sampling rates, respectively, both with 16-bit resolution. The associated texts were automatically word-segmented and POS-tagged and then manually checked. The tone and base-syllable type of each syllable were transcribed by a linguistic processor with a 130,000-word lexicon and then manually error-corrected. Syllable segmentations of the two test sets, TestTB and TestTC, were obtained by forced alignment with the Hidden Markov Model Toolkit (HTK) [57], using a speaker-dependent (SD) acoustic model (AM) trained from TrainTB and a speaker-independent (SI) AM trained from TrainTC1, respectively. F0 detection was first performed with WaveSurfer [58], then corrected automatically by the method proposed in [59], and lastly corrected manually.

3.2 Training of the HPMs

Two HPMs were trained for the SD and SI prosody coding tasks from the subsets TrainTB and TrainTC2, respectively. For the HPM training, the eight prosodic-acoustic features, including the syllable pitch contour vector, syllable duration, syllable energy level, syllable-juncture pause duration and energy-dip level, inter-syllable normalized pitch-level jump, and the two inter-syllable normalized duration lengthening factors, were extracted after obtaining the time-alignment information of the syllable segments. Note that, to compensate for speaker variability in the SI prosody coding case, the syllable pitch contour vectors were extracted from frame-based F0 values normalized by speaker-level mean and variance, while both syllable duration and syllable energy level were normalized by their corresponding speaker-level means and variances. In the SD prosody coding case, only syllable duration and syllable energy level were normalized, by their corresponding utterance-level means and variances, to compensate for utterance variability. The associated texts were processed by the linguistic processor mentioned previously to extract all linguistic features needed in the HPM training. The PLM algorithm [41] was then applied to automatically generate two sets of 12 prosodic sub-models from the two training subsets, TrainTB and TrainTC2, respectively. In realizing the PLM algorithm, the numbers of pitch, duration, and energy prosodic states were all set to 16. To avoid over-fitting the decision trees of the break-syntax model and the syllable-juncture prosodic-acoustic model, the following two stop criteria were empirically set: (1) the size of a leaf node must be larger than 700/250 syllables for the SI/SD case, and (2) the relative improvement in likelihood must be larger than 0.0065 in a node splitting for both the SI and SD cases. Table 5 shows the total numbers of nodes and leaf nodes for the break-syntax model and the syllable-juncture prosodic-acoustic model in the SI and SD HPMs. As shown in the table, the SD HPM uses much larger decision trees for the two models. This mainly results from the better prosody pronunciation quality of the Treebank speech corpus, which was uttered by a professional announcer.

Table 5 Numbers of nodes (leaf nodes) for the break-syntax model and the syllable-juncture prosodic-acoustic models in SI and SD HPMs

3.3 Performance evaluations

The performance of the proposed approach is evaluated by objective and subjective measures. The objective measures are the root-mean-square errors (RMSEs) of the four reconstructed prosodic-acoustic features and the bit rates. The subjective tests are MUSHRA-style [60] listening tests which ask listeners to rate synthesized speech from different systems on a scale from 0 to 100 using sliders on the screen. To give a meaningful reference for the performance, a baseline prosody coding system (without using knowledge of the prosodic characteristics and prosodic structures of Mandarin Chinese) that uses generic approaches, i.e., vector or scalar quantization, is constructed for comparison with the proposed approach. The syllable pitch contour spn, represented as a four-dimensional vector, is vector-quantized by the k-means clustering algorithm with the squared Euclidean distance metric. The syllable duration sdn, the syllable energy level sen, and the syllable-juncture pause duration pdn are independently scalar-quantized by the k-means clustering algorithm with the squared Euclidean distance metric. Therefore, each of the prosodic-acoustic features, i.e., spn, sdn, sen, and pdn, has its own codebook and corresponding codewords. The speech synthesizer for the baseline system is also implemented with the HMM-based speech synthesizer. Each Initial or Final is modeled as an HMM unit with five states (the same as in the proposed approach). Each HMM unit is described by the Initial/Final type and the codewords for spn, sdn, sen, and pdn. The question set used for the decision tree-based context clustering of HMMs is formed from the contextual Initial or Final types and the prosodic properties of the codewords for spn, sdn, sen, and pdn.
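The baseline quantizers can be built with any standard k-means implementation. The sketch below uses scikit-learn's KMeans on placeholder data as an illustration; the codebook sizes (24 codewords for spn and 19 for sdn) follow the SD operating points reported in Section 3.3.1, and the training arrays are random stand-ins, not the corpus features.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
sp_train = rng.normal(size=(5000, 4))    # stand-in for training sp_n vectors
sd_train = rng.normal(size=(5000, 1))    # stand-in for syllable durations

sp_vq = KMeans(n_clusters=24, n_init=10, random_state=0).fit(sp_train)
sd_sq = KMeans(n_clusters=19, n_init=10, random_state=0).fit(sd_train)

# Encoding = nearest-codeword index; decoding = codebook lookup.
sp_index = sp_vq.predict(sp_train[:1])          # codeword index for one syllable
sp_hat = sp_vq.cluster_centers_[sp_index]       # reconstructed pitch vector
print(sp_index, sp_hat)
```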

3.3.1 Comparison between the equal-RMSE baselines and the proposed approach

Because the coding of spn, sdn, sen, and pdn by the proposed approach shares the symbols of tone, base-syllable type, and break type, a rate-distortion comparison between the baseline and the proposed prosody coding approach for each individual prosodic-acoustic feature is not feasible. The root-mean-square errors (RMSEs) of the four prosodic-acoustic features reconstructed by the proposed approach are hence taken as references for the comparison. Table 6 shows the RMSEs of the four reconstructed prosodic-acoustic features for the inside and outside data. For the SD task (SD-HPM), the RMSE was 0.070/0.064 logHz (inside/outside) in F0 coding, 4.8/4.7 ms in syllable duration coding, 0.68/0.70 dB in syllable energy-level coding, and 41.4/34.3 ms in syllable-juncture pause duration coding. The corresponding figures were 0.065/0.056 logHz, 9.3/7.5 ms, 0.80/0.66 dB, and 44.8/44.9 ms for the SI task (SI-HPM). Except for the RMSEs of pause-duration coding, all values are quite low. Table 7 shows the RMSEs of the reconstructed pause duration for the different break types. It can be seen from the table that they were high only for B2–2, B3, and B4. Since these three break types are minor and major breaks and are more tolerant of large coding errors, the performance was reasonably good.

Table 6 RMSE of syllable logF0 contour (spn), syllable duration (sdn), syllable energy level (sen), and syllable-juncture pause duration (pdn) for the proposed approach and the baseline systems with various logF0 codebooks
Table 7 The RMSE (ms) performance of the reconstructed pause duration with respect to different break types

Figures 4 and 5 show the numbers of codewords versus the RMSEs of the four prosodic-acoustic features for the SD and SI prosody coding of the training sets, respectively. The baseline prosody coder yields RMSEs similar to those of the proposed coder at around 24, 19, 16, and 3 codewords for spn, sdn, sen, and pdn, respectively, in the SD case, and at around 10, 11, 10, and 3 codewords for spn, sdn, sen, and pdn, respectively, in the SI case. Besides the four prosodic-acoustic features, the base-syllable type information must be encoded for speech synthesis. Table 8 lists the bit assignments of each type of codeword for the SD and SI cases. The total numbers of bits per syllable are thus 25 and 23 for the SD and SI cases, respectively. This result shows that the baseline approach can achieve the same RMSE performance as the proposed approach with fewer bits.

Fig. 4 The numbers of the codewords versus RMSEs of the four prosodic-acoustic features for the SD prosody coding of the inside data (TrainTB)

Fig. 5 The numbers of the codewords versus RMSEs of the four prosodic-acoustic features for the SI prosody coding of the training sets (TrainTC2)

Table 8 Bit assignment for codewords used in the baseline prosody coding

It should be noted that a lower or equal RMSE does not necessarily indicate a higher subjective quality. We therefore conducted two independent MUSHRA-style listening tests for the SD and SI cases. Each MUSHRA-style listening test (SD or SI case) presents a sequence of pages to listeners. Listeners were asked to rate the quality of the speech prosody produced by the following four systems or methods: the proposed approach (HPM), the baseline system (BSL), vocoded natural speech (NAT), and the proposed approach with correct prosodic-acoustic features (CPRO). The vocoded natural speech was generated by the MLSA filter [53] with 25-dimensional MGC parameters [50] and was provided to listeners to serve as a reference (100 points) for their assessment. The speech synthesized by the proposed approach with correct prosodic-acoustic features is taken as the oracle performance of the proposed prosody coding, which is bounded by the speech quality of the HMM-based speech synthesizer.

For the SD case, we randomly selected 10 utterances from the test set (TestTB) for the MUSHRA test. For the SI case, we randomly selected three female and three male speakers from the test set (TestTC). Two utterances from each of the selected speakers were then randomly chosen as utterance samples for the MUSHRA test. There are therefore 10 and 12 (2 genders × 3 speakers × 2 utterances) pages for the SD and SI cases, and each page has four parallel synthesized speech instances, one generated by each of the four systems or methods. We recruited 15 native Mandarin Chinese speakers for the SD and SI MUSHRA tests.

Table 9 shows the means and standard deviations of the scores rated in the two MUSHRA tests for the SD and SI cases. It can be seen from the table that the proposed approach (HPM) obtains higher scores than the baseline (BSL) in both the SD and SI cases. In the SD case, the HPM score is very close to that of CPRO, indicating that the proposed approach almost reaches the oracle performance (correct prosodic-acoustic features) bounded by the speech quality of the HMM-based speech synthesizer. Interestingly, we found that the HPM in the SI case even obtains slightly higher scores than CPRO. This unexpected result is due to the poor F0 extraction in the SI case; the poor F0 extraction is less harmful to the F0 reconstruction by the HPM parameters, which exploit the information of tone and the prosodic structure. The scores for NAT are not 100 because of glitches caused by human error. Figure 6 shows the boxplots of the scores in the SD and SI cases. On each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points not considered outliers, and the outliers are plotted individually using the "+" symbol. It is found that the score distributions of BSL are different from those of HPM and CPRO. We then examined the rank order of the different approaches in terms of their MUSHRA scores and checked whether the differences in rank order were significant by the Kruskal-Wallis test at a p value of 0.01. It is found that, in both the SD and SI cases, HPM, CPRO, and NAT have mean ranks significantly different from that of BSL, while HPM has a mean rank not significantly different from that of CPRO.
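A sketch of the significance check is given below. The score arrays are hypothetical, and scipy's Kruskal-Wallis H-test is used here as a stand-in for whatever statistics package was actually employed; the pairwise mean-rank comparison reported above would additionally require a post-hoc procedure (e.g., Dunn's test).

```python
# Hedged sketch: Kruskal-Wallis H-test over MUSHRA scores from four systems.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)
# Hypothetical MUSHRA scores (0-100): 15 listeners x 10 pages = 150 ratings per system.
scores = {
    "NAT":  np.clip(rng.normal(95, 5, 150), 0, 100),
    "CPRO": np.clip(rng.normal(75, 10, 150), 0, 100),
    "HPM":  np.clip(rng.normal(73, 10, 150), 0, 100),
    "BSL":  np.clip(rng.normal(55, 12, 150), 0, 100),
}
h, p = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {h:.1f}, p = {p:.3g} (reject equality of mean ranks if p < 0.01)")
```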

Table 9 The statistics of the scores (mean ± 1 standard deviation) rated in the SI and SD MUSHRA tests with the vocoded natural speech (NAT), the proposed approach with correct prosodic-acoustic features (CPRO), the proposed approach (HPM), and the baseline system (BSL) with an RMSE similar to that of the proposed approach
Fig. 6
figure 6

Boxplots for the approaches of BSL, HPM, CPRO, and NAT in the SD case (left) and the SI case (right)

3.3.2 Comparison between the baselines with various numbers of logF0 codewords and the proposed approach

The results of the MUSHRA tests show that the baseline coder cannot generate speech prosody as natural as the proposed coder does. A detailed analysis reveals that many syllables generated by the baseline approach sound incorrect in tone perception. This indicates that the baseline approach does not have enough codewords to represent the crucial logF0 contours of Mandarin Chinese. We therefore increase the number of codewords for the logF0 contour while keeping the numbers of codewords for syllable duration, syllable energy level, and pause duration the same when implementing the baseline system. For the SD case, we conduct a MUSHRA test in which each presented page contains synthesized speech from the baseline systems with 24, 32, 64, 128, and 256 logF0 codewords, the proposed approach (SD-HPM), and the proposed approach with correct prosodic-acoustic features. The baseline systems with 24, 32, 64, 128, and 256 codewords are denoted as SD-BSL-24, SD-BSL-32, SD-BSL-64, SD-BSL-128, and SD-BSL-256, respectively. Because this MUSHRA experiment focuses on the logF0 contour feature, the speech from the proposed approach with correct prosodic-acoustic features (denoted by SD-CPRO), instead of the vocoded natural speech, is taken as the reference (100 points) for the assessment. The utterances chosen for this MUSHRA test were the same as those used in the previous SD MUSHRA test. This SD MUSHRA test therefore has ten pages, and each page has seven speech instances from the different systems. For the SI case, we also conduct a MUSHRA test with the same utterances chosen in the previous SI MUSHRA test. In this SI MUSHRA test, each presented page contains synthesized speech from the baseline systems with 10, 16, 32, 64, 128, and 256 codewords, the proposed approach, and the proposed approach with correct prosodic-acoustic features. The speech from the proposed approach with correct prosodic-acoustic features (SI-CPRO) is likewise taken as the reference (100 points) for the assessment. Accordingly, this SI MUSHRA test has eight speech instances on each test page. The baseline systems with 10, 16, 32, 64, 128, and 256 codewords are denoted as SI-BSL-10, SI-BSL-16, SI-BSL-32, SI-BSL-64, SI-BSL-128, and SI-BSL-256, respectively. Table 6 also shows the RMSEs of logF0 encoding with various codeword sizes for the SD and SI cases. A decreasing trend of the logF0 RMSEs can be observed for both cases as the numbers of codewords increase. Table 10 shows the means and standard deviations of the rated scores in the MUSHRA tests with various codeword sizes for encoding logF0. Figure 7 shows the results of the MUSHRA tests. It can be found from the tables and figures that the subjective scores increase as the numbers of codewords increase. The subjective scores with 256 codewords (SD-BSL-256 and SI-BSL-256) are close to, but still lower than, the scores of the proposed coding approach (SD-HPM and SI-HPM). We also examine the rank order of the different approaches in terms of their MUSHRA scores and check whether the differences in rank order are significant by the Kruskal-Wallis test at a p value of 0.01. In the SD case, apart from the pairs {SD-BSL-24, SD-BSL-32} and {SD-BSL-128, SD-BSL-64}, all other pairs are significantly different in rank order. The result shows that the proposed coding scheme outperforms the baseline in the subjective tests for the SD case.
In the SI case, SI-BSL-16 is not significantly different from SI-BSL-10 and SI-BSL-32; SI-BSL-64 is not significantly different from SI-BSL-32, SI-BSL-128, and SI-BSL-256; and SI-HPM is not significantly different from SI-CPRO and SI-BSL-256. The result indicates that the performance of the proposed approach is close to those of the baseline system with 256 codewords for logF0 (SI-BSL-256) and the oracle prosody encoder with correct prosodic-acoustic features (SI-CPRO).

Table 10 The statistics of the scores (mean ± 1 standard deviation) rated in the MUSHRA tests
Fig. 7
figure 7

Boxplots for the approaches of BSL with various codeword sizes, HPM, and CPRO in the SD case (up) and the SI case (down)

To find out why the best baseline systems, with 256 logF0 codewords for the SD and SI cases, still cannot reach the performance of the proposed approach (HPM), we carefully examined the utterances synthesized by the baseline systems. It was found that some tone-2 and tone-4 syllables in the first half of a sentence or a PPh, which were spoken with high pitch, were encoded by improper logF0 contours and therefore sounded unnatural or incorrect in tone perception. This result indicates that the baseline approach, which works without knowledge of the tones and prosodic structure of Mandarin, cannot learn perceptually meaningful logF0 contours as the proposed approach can. The proposed approach, in contrast, encodes perceptually meaningful logF0 contours with the additive APs of tone, coarticulation, and pitch prosodic state, where the tone APs represent typical local patterns of the tones, the coarticulation APs describe the logF0 patterns of neighboring tones' interaction, and the prosodic-state APs represent global logF0 patterns of intonation.
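The additive reconstruction can be pictured as in the sketch below. This is an illustrative simplification, not the exact HPM equation defined earlier in the paper; the AP tables, their values, and the tag names are hypothetical.

```python
# Illustrative sketch (assumed, simplified form of the HPM decoding): a syllable's
# 4-dim logF0 representation is reconstructed as the sum of additive affecting
# patterns (APs) looked up by the decoded symbols.
import numpy as np

# Hypothetical AP tables, each giving a 4-dim coefficient vector.
tone_ap   = {2: np.array([0.02, 0.08, 0.00, -0.01])}          # e.g., rising pattern for tone 2
coart_ap  = {("T1", "T2"): np.array([0.01, -0.02, 0.0, 0.0])}  # neighboring-tone interaction
pstate_ap = {7: np.array([0.15, 0.00, 0.0, 0.0])}              # global intonation (pitch prosodic state)

def reconstruct_logf0(tone, context, pitch_state, speaker_mean):
    """Sum the APs selected by the decoded symbols to get the syllable logF0 coefficients."""
    return speaker_mean + tone_ap[tone] + coart_ap[context] + pstate_ap[pitch_state]

coeffs = reconstruct_logf0(2, ("T1", "T2"), 7, speaker_mean=np.array([5.0, 0.0, 0.0, 0.0]))
print(coeffs)   # 4-dim representation of the syllable logF0 contour
```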

Figure 8 shows four examples of reconstructed speech waveform segments and their corresponding syllable boundaries and logF0 contours produced by the prosody coding approaches SD-CPRO, SD-BSL-24, SD-BSL-256, and SD-HPM. Taking SD-CPRO (Fig. 8a) as a reference, some of the logF0 contours generated by SD-BSL-24 (Fig. 8b) are stylized and discontinuous at the syllable junctures. Although SD-BSL-24 and SD-HPM have similar RMSE performances, the speech of SD-BSL-24 sounds unnatural, especially for the second syllable "qian2" and the fourth syllable "you2", which are tone-2 syllables; in this case, they sound like tone-3 syllables. The logF0 contours of SD-HPM (Fig. 8d) are not as vivid or dynamic as those of SD-CPRO. The tones of all the syllables of SD-HPM, however, sound correct and natural because the logF0 contours are continuous at the syllable junctures (e.g., the junctures between the third and fourth syllables "zhong1-you2" and between the last two syllables "gou4-you2") and have logF0 slopes and curvatures similar to those of SD-CPRO (e.g., the second and fourth syllables "qian2" and "you2"). Although the logF0 contours of SD-BSL-256 (Fig. 8c) are very close to those of SD-CPRO (Fig. 8a), the fourth syllable "you2" sounds like a tone-3 syllable. The RMSE for the fourth syllable "you2" of SD-BSL-256 is in fact smaller than that of SD-HPM. Nevertheless, the downward-concave logF0 in the first half of the fourth syllable "you2" of SD-BSL-256 makes the syllable sound like tone 3. The logF0 contour of the fourth syllable "you2" of SD-HPM, on the other hand, is concave upward, matches the logF0 trend of SD-CPRO well, and sounds natural. These examples partially illustrate why the proposed prosody coding scheme performs better than the conventional k-Means coding scheme.

Fig. 8
figure 8

Four examples of the reconstructed prosodic-acoustic features by the prosody coding approach of (a) SD-CPRO, (b) SD-BSL-24, (c) SD-BSL-256, and (d) SD-HPM. The texts are “mu4-qian2 zhong1-you2 zi4 guo2-ji4 shi4-chang3 shang4 gou4 you2 (目前中油自國際市場上購油).”

3.3.3 Analysis of bit rates

Tables 11 and 12 show the data rates of the proposed system and the baseline systems in the SD speech coding task in units of bits/syllable and bits/second, respectively. In the non-compression case (denoted as NC), the baseline system reaches the same numbers of bits per syllable and per second as the proposed system when logF0 is encoded with 128 codewords. This means that the proposed system performs better than the baseline system at the same bit rate (27 bits/syllable or 102.6 bits/second) in terms of the subjective tests. We then apply an entropy coding scheme to compress the source symbols, i.e., the syllable type for both the baseline and the proposed systems, the codewords of {spn, sdn, sen, pdn} solely for the baseline system, and the tone, pitch/duration/energy prosodic-state tags, and break-type tags for the proposed system. We first assume each of these symbol types is a Markov information source of zero order. Specifically, the symbols are encoded by the Huffman coding method with the unigram probability of each symbol type estimated from the training set, i.e., the TrainTB set. The resulting bit rates are listed in the columns denoted by M0 (zero-order Markov) in Tables 11 and 12. Compared with the non-compressed case (NC), this entropy coding reduces the average bits/syllable and bits/second, the standard deviations, and the minimum bits/second. In the test set, the bit rate of the proposed scheme with Huffman coding under the zero-order Markov source assumption (23.04 bits/syllable or 87.5 bits/second) is slightly lower than that of the baseline system SD-BSL-256 (24.64 bits/syllable or 93.6 bits/second), whose subjective performance is closest to the proposed approach.
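The M0 step amounts to building one Huffman code per symbol type from its unigram probabilities. A short sketch follows; the symbol set and probabilities are hypothetical, and only the procedure follows the text.

```python
# Sketch of the M0 (zero-order Markov) entropy coding step: a Huffman code is
# built from unigram probabilities estimated on the training set.
import heapq
import math
from itertools import count

def huffman_code(probs):
    """Return {symbol: bitstring} for a dict of symbol probabilities."""
    tiebreak = count()   # avoids comparing dicts when probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical unigram distribution over break-type tags.
probs = {"B0": 0.35, "B1": 0.25, "B2-1": 0.15, "B2-2": 0.10, "B3": 0.10, "B4": 0.05}
code = huffman_code(probs)
avg_bits = sum(p * len(code[s]) for s, p in probs.items())
print(code)
print(f"average {avg_bits:.2f} bits/symbol vs. {math.ceil(math.log2(len(probs)))} bits uncoded")
```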

Table 11 Statistics of bits per syllable of the proposed approach and the baselines with various logF0 codebooks for the SD speech coding case
Table 12 Statistics of bits per second of the proposed approach and the baselines with various logF0 codebooks for the SD speech coding case

To further reduce the bit rates, the symbols are encoded by the Huffman coding scheme with bigram probabilities (a Markov information source of first order). For the proposed approach, the symbols related to base-syllable types, tones, and break types are encoded by their bigram probabilities. The prosodic states are encoded by the probabilities provided by the three prosodic-state transition models (illustrated in Eq. (14)), i.e., P(pn | pn−1, Bn−1), P(qn | qn−1, Bn−1), and P(rn | rn−1, Bn−1), which directly describe the first-order Markov property of the prosodic states conditioned on the break types. For the baseline approach, the symbols related to base-syllable types and the codewords of the pause duration are encoded by their bigram probabilities. The codewords of the pitch contour, syllable duration, and syllable energy level are encoded by the probabilities P(c(spn) | c(spn−1), c(pdn−1)), P(c(sdn) | c(sdn−1), c(pdn−1)), and P(c(sen) | c(sen−1), c(pdn−1)), where c(x) is the codeword of the feature x. Note that the roles of c(spn), c(sdn), and c(sen) are analogous to those of pn, qn, and rn, and the role of c(pdn−1) is analogous to that of Bn−1. The design of the bigram probabilities for the Huffman coding of the baseline approach is therefore fair for comparison with the proposed approach. The bit rates under the bigram assumption are listed in the columns denoted by M1 (first-order Markov) in Tables 11 and 12. It can be seen from the tables that the bit rates are further reduced from those of the unigram assumption (M0) in the training set. The M1 bit rates for SD-BSL-64, SD-BSL-128, and SD-BSL-256 in the test set, however, are greater than the M0 ones, indicating that the bigram probabilities estimated from the training set are overfitted even though a standard probability smoothing technique has already been applied to their estimation. Summarizing the bit rates shown in Tables 11 and 12, the lowest bit rates for the best proposed approach and the best baseline in terms of the subjective tests are 21.42 bits/syllable (or 81.4 bits/second) in the M1 encoding case and 24.64 bits/syllable (or 93.6 bits/second) in the M0 encoding case, respectively, showing the compactness of the proposed prosody coding scheme.
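The M1 scheme can be thought of as one Huffman table per conditioning context; a hedged continuation of the previous sketch (reusing huffman_code() from above, with hypothetical contexts and probabilities) illustrates this.

```python
# Hedged sketch of the M1 (first-order Markov) case: one Huffman table per
# conditioning context, e.g., P(pn | pn-1, Bn-1) for the pitch prosodic states.
# Reuses huffman_code() from the previous sketch; all values are hypothetical.
conditional_probs = {
    ("state3", "B1"): {"state2": 0.5, "state3": 0.3, "state4": 0.2},
    ("state3", "B3"): {"state1": 0.4, "state5": 0.35, "state7": 0.25},
}
tables = {ctx: huffman_code(p) for ctx, p in conditional_probs.items()}

def encode_state(current, previous_state, previous_break):
    """Pick the Huffman table selected by the (previous state, previous break) context."""
    return tables[(previous_state, previous_break)][current]

print(encode_state("state2", "state3", "B1"))
```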

Tables 13 and 14 show the data rates of the proposed system and the baseline systems in the SI speech coding task in units of bits/syllable and bits/second, respectively. The Huffman coding schemes with the Markov information sources of zero order (M0) and first order (M1) are also applied to reduce the data rates. The proposed approach achieves the lowest data rate of 18.94 bits/syllable (or 72.7 bits/second) in the M1 encoding case, which is lower than the data rate of the best baseline, i.e., 22.80 bits/syllable (or 87.5 bits/second) in the M0 encoding case.

Table 13 Statistics of bit rates per syllable of the proposed approach and the baselines with various logF0 codebooks for the SI speech coding case
Table 14 Statistics of bit rates per second of the proposed approach and the baselines with various logF0 codebooks for the SI speech coding case

The data compression ratios (uncompressed/compressed) for SD-HPM and SI-HPM under M1 encoding are 1.26 and 1.43, respectively. The data compression ratios for SD-BSL-256 and SI-BSL-256 under M0 encoding are 1.14 and 1.18, respectively. The higher data compression ratio achieved by the proposed coding scheme is mainly due to the high autocorrelation of the prosodic-state sequence. The prosodic-state sequence carries the information of the syllable prosodic-acoustic features after the local fluctuations caused by tone, base-syllable type, and the coarticulation effect have been subtracted. The patterns of the prosodic states are therefore smoother than those of the directly encoded observed prosodic-acoustic features, i.e., c(spn), c(sdn), and c(sen). Besides, the M1 encoding of the HPM symbols exactly matches the properties of the HPM model parameters in Eq. (14), i.e., P(pn | pn−1, Bn−1), P(qn | qn−1, Bn−1), and P(rn | rn−1, Bn−1). Evidently, these properties enable the proposed coding scheme to encode speech more efficiently than the baseline system, while the proposed approach delivers better subjective test results than the baseline in the SD case and subjective results equal to the baseline in the SI case.

It is interesting to find that the proposed approach needs a higher bit rate in the SD case (81.4 bps) than in the SI case (72.7 bps) to achieve subjective quality close to that of the oracle prosody encoder with correct prosodic-acoustic features (SD-CPRO and SI-CPRO). This finding is rather counter-intuitive because speech coding for SI cases usually needs more codewords (or higher bit rates) than SD cases to model speakers' variability in spectrum and prosody. Since the proposed coding method generates spectral information from the HPM parameters with the side information of the CD-HMMs for modeling state duration and MGC, we disregarded the bit rates related to the spectrum coding part, which usually contributes the majority of the bitstream in conventional speech coding. When considering prosody coding alone in this study, a higher bit rate for prosody indicates richer prosodic variation made by a speaker to convey more information. Recall that the SD and SI cases in this study encoded the prosodies of the TTS speech corpus (Treebank speech corpus) and the ASR speech corpus (TCC300), respectively. The professional announcer tended to utter the TTS corpus with rich prosodic variation to better convey linguistic, paralinguistic, and non-linguistic information, while the amateur speakers tended to utter the ASR corpus with flat prosody, conveying mostly linguistic information. The counter-intuitive result that the SD case needs a higher bit rate than the SI case is therefore reasonable in this study. Figure 9 shows two typical examples of the reconstructed prosodic-acoustic features of two utterances of the outside test. As shown in the figure, most reconstructed prosodic features are close to their reference values. This shows that the proposed prosody coding approach is very promising.

Fig. 9
figure 9

Two examples of the reconstructed prosodic features for (a) an utterance in Treebank and (b) an utterance in TCC300. From top to bottom: syllable pitch mean, syllable duration, syllable energy level, and pause duration. (open circle: reference, star: reconstructed) The texts are (a) qin-yi gong-si xian-jin zeng-zi yi-dian liu-yi yuan,shen-gou ri-qi jie-zhi shi-yi-yue wu-ri wei-zhi。gai gong-si jin-nian-du xian-jin zeng-zi-gu jiang yu yuan-you gu-fen fen-kai gua-pai,er zeng-zi-gu yu yuan-you gu-fen quan-li yi-wu jian bu-tong de shi,dui ben nian-du ying-yu fen-pei de quan-li,(勤益 公司 現金 增資 一點 六億 元,申購 日期 截至 十一月 五日 為止。該 公司 今年度 現金 增資股 將 與 原有 股份 分開 掛牌,而 增資股 與 原有 股份 權利 義務 間 不同 的 是,對 本年度 盈餘 分配 的 權利,); and (b) lian-ri lai gai qiao zhi yin-dao yin zhi pu yi-ceng bo-bo de bo-you lu-mian jing zhong-xing sha-shi-che zhi zhan ya lu-mian yi sun-huai qian-tian wan-jian ceng fa-sheng qi-che yin lu-kuang bu. shou zhuang-che shi-jian suo-xing wei fan-fu shu-lin zhen-gong-suo ri-zuo pai-yuan kan-cha fa-xian … (連日 來 該 橋 之 引道 因 只 鋪 一層 薄薄 的 柏油 路面,經 重型 砂石車 之 輾 壓 路面 已 損壞,前天 晚間 曾 發生 汽車 因 路況 不 熟 撞車 事件,所幸 未 翻覆。樹林 鎮公所 日昨 派員 勘查,發現…)

4 Application to speaking rate conversion

An example of modifying the speaking rate of reconstructed speech by directly replacing the HPM parameters with those of two different speaking rates is illustrated. The task is to convert the source prosody of reconstructed speech in the SD prosody coding task into the target prosody of a slower or faster speaking rate. This is realized by replacing the side information of the HPM used in the synthesis operation of the decoder with that of an HPM trained at the target speaking rate. Here, we simply assume that the source and target speeches have the same prosodic phrase structure, so the converted utterance shares the decoded break and prosodic-state tags of the source utterance. Since the source Treebank database was recorded at the speed most comfortable for the announcer, it is regarded as the normal speaking-rate speech corpus. Taking the normal-rate speech as a reference, two other parallel corpora, FastTB at the fast rate and SlowTB at the slow rate, were then recorded. Table 15 displays the statistics of these speech corpora. Note that the speech rate (SR) is defined as the average number of syllables uttered per second, while the articulation rate (AR) is defined as the average number of syllables uttered per second excluding all pauses. Figure 10 displays the original waveform, the synthetic waveforms of normal, slow, and fast speaking rates, and their corresponding pitch contours. It can be seen from the figure that both the syllable durations and pause durations change greatly to match the given speaking rate, while the waveforms and pitch contour shapes are mostly kept unchanged. An informal listening test confirmed that the converted speech of both high and low speaking rates sounded very fluent and natural.
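The SR and AR definitions above translate into simple arithmetic; a sketch with hypothetical syllable and pause durations follows.

```python
# Sketch of the rate definitions quoted above (hypothetical durations, in seconds).
syllable_durations = [0.18, 0.21, 0.20, 0.24, 0.19]   # per-syllable durations
pause_durations    = [0.00, 0.05, 0.00, 0.30]          # syllable-juncture pauses

total_time = sum(syllable_durations) + sum(pause_durations)
speech_rate       = len(syllable_durations) / total_time                # SR: syllables/s incl. pauses
articulation_rate = len(syllable_durations) / sum(syllable_durations)   # AR: syllables/s excl. pauses
print(f"SR = {speech_rate:.2f} syl/s, AR = {articulation_rate:.2f} syl/s")
```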

Table 15 The statistics of the speech corpora FastTB and SlowTB
Fig. 10
figure 10

An example of speaking rate conversion: speech signal and pitch contour of (a) the original, (b) the reconstructed of the same speaking rate (normal), (c) the reconstructed of the faster speaking rate, and (d) the reconstructed of the slower speaking rate. The text is “er zeng-zi-gu yu yuan-you gu-fen quan-li yi-wu jian bu-tong de shi ...(而增資股與原有股份權利義務間不同的是...)”

5 Conclusions

A novel parametric prosody coding approach for Mandarin speech has been presented in this paper. Its novelty lies in employing a sophisticated hierarchical prosodic model as the prosody-generating model: in the encoder it analyzes the prosodic-acoustic features and extracts the representing parameters for encoding, and in the decoder it synthesizes the prosodic-acoustic features from the decoded representing parameters. Low average data rates of 81.4 and 72.7 bps have been reached for the SD and SI cases, respectively, with low RMSEs of the reconstructed prosodic-acoustic features and good synthesized speech quality. These data rates are lower than those of the conventional segment-based prosody coding with the scalar or vector quantization scheme. It is interesting to find the counter-intuitive result that the SD case needs a higher bit rate than the SI case. The reason may be that the SD speech corpus uttered by a professional announcer (for constructing a TTS system) is richer in prosodic variation than the SI speech corpus uttered by amateur speakers (for constructing ASR systems); the higher bit rate may indicate that more information is conveyed in speech with richer prosodic variation. The use of the hierarchical prosodic model also gives the proposed approach the additional advantage of easily manipulating the prosody of the synthetic speech. An example of converting the speaking rate of the synthetic speech by changing the parameters of the hierarchical prosodic model in the decoding stage has been illustrated. This makes the proposed approach useful in applications such as e-book readers.

A drawback of the proposed approach lies in its need for the text associated with the encoded speech. This can be addressed by using a front-end automatic speech recognizer (ASR) to extract the linguistic information and to segment the speech signal. A prosody-assisted ASR [61, 62] can be used to help solve this problem.

Abbreviations

AP:

Affecting pattern

AR:

Articulation rate

ASBC:

Academia Sinica Balanced Corpus of Modern Chinese

ASR:

Automatic speech recognition

BG/PG:

Breathe/prosodic phrase group

BSL:

Baseline

CD-HMM:

Context-dependent hidden Markov model

CELP:

Code-excited linear prediction

CMAPLR:

Constrained maximum a posteriori estimation linear regression

CPRO:

Encoding prosody by the proposed HPM with Correct PROsodic-acoustic features

F0:

Fundamental frequency

HMM:

Hidden Markov model

HPM:

Hierarchical prosodic model

HTK:

Hidden Markov model toolkit

LPC:

Linear predictive coding

M0:

The Huffman coding scheme with the Markov information source of zero order

M1:

The Huffman coding scheme with the Markov information source of first order

MELP:

Mixed excitation linear prediction

MGC:

Mel-generalized cepstral coefficient

MLSA:

Mel log spectrum approximation

MQ:

Matrix quantization

MUSHRA:

MUltiple Stimuli with Hidden Reference and Anchor

NAT:

Vocoded NAtural speech with 24-dimensional Mel-generalized cepstral coefficients

PCM:

Pulse Code Modulation

PLA:

Piecewise linear approximation

PLM:

Prosody labeling and modeling

PPh:

Prosodic phrase

PW:

Prosodic word

SD:

Speaker dependent

SI:

Speaker independent

SR:

Speech rate

SYL:

Syllable

TTS:

Text-to-speech

VQ:

Vector quantization

References

1. ITU-T (1993). Pulse code modulation (PCM) of voice frequencies, technical report G.711. Geneva: International Telecommunications Union.
2. ITU-T (1990). 5-, 4-, 3- and 2-bits per sample embedded adaptive differential pulse code modulation (ADPCM), technical report G.727. Geneva: International Telecommunications Union.
3. Campbell, JP, & Tremain, TE (1986). Voiced/unvoiced classification of speech with applications to the U.S. government LPC-10E algorithm. In Proc. ICASSP'86, (pp. 473–476).
4. DDVPC (1984). LPC-10e speech coding standard, technical report FS-1015. U.S. Dept. of Defense Voice Processing Consortium.
5. McCree, AV, & Barnwell III, TP (1995). A mixed excitation LPC vocoder model for low bit rate speech coding. IEEE Trans. Speech Audio Process., 3(4), 242–250.
6. Schroeder, M, & Atal, B (1984). Code-excited linear prediction (CELP): high-quality speech at very low bit rates. In Proc. ICASSP'84, (pp. 937–940).
7. Campbell, JP, Welch, VC, Tremain, TE (1989). An expandable error-protected 4800 BPS CELP coder (U.S. Federal Standard 4800 BPS voice coder). In Proc. ICASSP'89, (pp. 735–738).
8. DDVPC (1989). CELP speech coding standard, technical report FS-1016. U.S. Dept. of Defense Voice Processing Consortium.
9. ITU-T (1992). Coding of speech at 16 kbit/s using low-delay code excited linear prediction, technical report G.728. Geneva: International Telecommunications Union.
10. ETSI (1999). Universal Mobile Telecommunications System (UMTS); mandatory speech codec speech processing functions AMR speech codec; transcoding functions, 3G TS 26.090, version 3.1.0, release 1999.
11. Roucos, S, Schwartz, R, Makhoul, J (1982). Segment quantization for very-low-rate speech coding. In Proc. ICASSP'82, (pp. 1565–1568).
12. Roucos, S, Schwartz, RM, Makhoul, J (1983). A segment vocoder at 150 b/s. In Proc. ICASSP'83, (pp. 61–64).
13. Roucos, S, & Wilgus, AM (1985). The waveform segment vocoder: a new approach for very-low-rate speech coding. In Proc. ICASSP'85, (pp. 236–239).
14. Tsao, C, & Gray, R (1985). Matrix quantizer design for LPC speech using the generalized Lloyd algorithm. IEEE Trans. Acoust., Speech Signal Process., 33(3), 537–545.
15. Shiraki, Y, & Honda, M (1988). LPC speech coding based on variable length segment quantization. IEEE Trans. Acoust., Speech Signal Process., 36(9), 1437–1444.
16. Cernocky, J, Baudoin, G, Chollet, G (1998). Segmental vocoder—going beyond the phonetic approach. In Proc. ICASSP'98, (pp. 605–608).
17. Holmes, WJ (1998). Towards a unified model for low bit-rate speech coding using a recognition-synthesis approach. In Proc. ICSLP'98.
18. Lee, K-S, & Cox, RV (2001). A very low bit rate speech coder based on a recognition/synthesis paradigm. IEEE Trans. Speech Audio Process., 9(5), 482–491.
19. Cox, RV, & Lee, K-S (2002). A segmental speech coder based on a concatenative TTS. Speech Commun., 38(1), 89–100.
20. Baudoin, G, & El Chami, F (2003). Corpus based very low bit rate speech coding. In Proc. ICASSP'03, (pp. 792–795).
21. Chevireddy, S, Murthy, HA, Sekhar, CC (2008). Signal processing based segmentation and HMM based acoustic clustering for a syllable based segment vocoder at 1.4 kbps. In Proc. EUSIPCO-2008.
22. Harish, D, & Ramasubramanian, V (2008). Comparison of segment quantizers: VQ, MQ, VLSQ and unit-selection algorithms for ultra low bit-rate speech coding. In Proc. ICASSP'2008, (pp. 4773–4776).
23. Pradhan, A, Chevireddy, S, Veezhinathan, K, Murthy, H (2010). A low-bit rate segment vocoder using minimum residual energy criteria. In Proc. NCC'10.
24. Schwartz, R, Klovstad, J, Makhoul, J, Sorensen, J (1980). A preliminary design of a phonetic vocoder based on a diphone model. In Proc. ICASSP'80, (pp. 32–35).
25. Picone, J, & Doddington, GR (1989). A phonetic vocoder. In Proc. ICASSP'89, (pp. 580–583).
26. Ismail, M, & Ponting, K (1997). Between recognition and synthesis—300 bits/second speech coding. In Proc. EUROSPEECH'97, (pp. 441–444).
27. Chen, H-C, Chen, C-Y, Tsou, K-M, Chen, OT-C (1997). A 0.75 kbps speech codec using recognition and synthesis schemes. In Proc. IEEE Workshop Speech Coding Telecommunications, (pp. 27–29).
28. Ribeiro, CM, & Trancoso, IM (1997). Phonetic vocoding with speaker adaptation. In Proc. EUROSPEECH'97, (pp. 1291–1294).
29. Tokuda, K, Masuko, T, Hiroi, J, Kobayashi, T, Kitamura, T (1998). A very low bit rate speech coder using HMM-based speech recognition/synthesis techniques. In Proc. ICASSP'98, (pp. 609–612).
30. Masuko, T, Tokuda, K, Kobayashi, T (1998). A very low bit rate speech coder using HMM with speaker adaptation. In Proc. ICSLP'98.
31. Hoshiya, T, Sako, S, Zen, H, Tokuda, K, Masuko, T, Kobayashi, T, Kitamura, T (2003). Improving the performance of HMM-based very low bitrate speech coding. In Proc. ICASSP'03, (pp. 800–803).
32. Halaly, I, & Bistritz, Y (2008). A phonetic vocoder with adaptation to selectable speaker codebooks. In Proc. EUSIPCO-2008.
33. Halaly, I, & Bistritz, Y (2008). A phonetic vocoder with scalable adaptation to speaker codebooks. In Proc. IEEEI'2008, (pp. 684–688).
34. Benbassat, G, & Delon, X (1984). Low bit rate speech coding by concatenation of sound units and prosody coding. In Proc. ICASSP'84, (pp. 121–124).
35. Vepyek, P, & Bradley, AB (1997). Consideration of processing strategies for very-low-rate compression of wide band speech signal with known text transcription. In Proc. EUROSPEECH'97, (pp. 1279–1282).
36. Lee, K-S, & Cox, RV (1999). TTS based very low bit rate speech coder. In Proc. ICASSP'99, (pp. 181–184).
37. Wong, D, Juang, B-H, Gray Jr, AH (1982). An 800 bit/s vector quantization LPC vocoder. IEEE Trans. Acoust., Speech Signal Process., 30(5), 770–780.
38. Paliwal, KK, & Atal, BS (1993). Efficient vector quantization of LPC parameters at 24 bits/frame. IEEE Trans. Speech Audio Process., 1(1), 3–14.
39. Itakura, F (1975). Line spectrum representation of linear predictive coefficients of speech signals. J. Acoust. Soc. Amer., 57(1), 35.
40. Atal, BS (1982). Predictive coding of speech at low bit rates. IEEE Trans. Commun., 30(4), 600–614.
41. Chiang, C-Y, Chen, S-H, Yu, H-M, Wang, Y-R (2009). Unsupervised joint prosody labeling and modeling for Mandarin speech. J. Acoust. Soc. Amer., 125(2), 1164–1183.
42. Chiang, C-Y, Chen, S-H, Wang, Y-R (2009). Advanced unsupervised joint prosody labeling and modeling for Mandarin speech and its application to prosody generation for TTS. In Proc. INTERSPEECH'09, (pp. 504–507).
43. Chen, S-H, & Wang, Y-R (1990). Vector quantization of pitch information in Mandarin speech. IEEE Trans. Commun., 38(9), 1317–1320.
44. Tseng, C-Y, Pin, S-H, Lee, Y-L, Wang, H-M, Chen, Y-C (2005). Fluent speech prosody: framework and modeling. Speech Commun., 46(3–4), 284–309.
45. Breiman, L, Friedman, J, Olshen, R, Stone, C (1984). Classification and regression trees. Belmont: Wadsworth.
46. Tokuda, K, Zen, H, Black, AW (2004). HMM-based approach to multilingual speech synthesis. In S Narayanan, A Alwan (Eds.), Text to speech synthesis: new paradigms and advances. Upper Saddle River: Prentice Hall.
47. Yoshimura, T, Tokuda, K, Masuko, T, Kobayashi, T, Kitamura, T (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. EUROSPEECH'99, (pp. 2347–2350).
48. Yoshimura, T (2002). Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems. Ph.D. thesis, Nagoya Institute of Technology.
49. Zen, H, Nose, T, Yamagishi, J, Sako, S, Masuko, T, Black, AW, Tokuda, K (2007). The HMM-based speech synthesis system version 2.0. In Proc. 6th ISCA Workshop Speech Synth., (pp. 294–299).
50. Tokuda, K, Masuko, T, Kobayashi, T, Imai, S (1994). Mel-generalized cepstral analysis—a unified approach to speech spectral estimation. In Proc. ICSLP'94, (pp. 1043–1046).
51. Yamagishi, J, Kobayashi, T, Nakano, Y, Ogata, K, Isogai, J (2009). Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Trans. Audio, Speech, Language Process., 17(1), 66–83.
52. Tokuda, K, Yoshimura, T, Masuko, T, Kobayashi, T, Kitamura, T (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP'00, (pp. 1315–1318).
53. Imai, S (1983). Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP'83, (pp. 93–96).
54. Mandarin microphone speech corpus—TCC300 Corpus [Online]. http://www.aclclp.org.tw/use_mat.php#tcc300edu. Accessed 9 July 2018.
55. Chen, F-Y, Tsai, P-F, Chen, K-J, Huang, C-R (1999). The construction of Sinica Treebank. Comput. Linguis. Chin. Lang. Process., 4(2), 87–104.
56. Chen, K-J, Huang, C-R, Chang, L-P, Hsu, H-L (1996). Sinica Corpus: design methodology for balanced corpora. In Proc. PACLIC II, (pp. 167–176).
57. Young, S, Evermann, G, Kershaw, D, Moore, G, Odell, J, Ollason, D, Povey, D, Valtchev, V, Woodland, PC (2007). The HTK book (for HTK version 3.4). Cambridge: Cambridge University Press.
58. Sjölander, K, & Beskow, J (2000). WaveSurfer—an open source speech tool. In Proc. ICSLP'00, (pp. 464–467).
59. Sönmez, M, Heck, L, Weintraub, M, Shriberg, E (1997). A lognormal tied mixture model of pitch for prosody-based speaker recognition. In Proc. EUROSPEECH'97, (pp. 1391–1394).
60. ITU-R (2001–2014). Method for the subjective assessment of intermediate quality level of coding systems, Rec. BS.1534-2. Geneva: International Telecommunications Union.
61. Yang, J-H, Liu, M-C, Chang, H-H, Chiang, C-Y, Wang, Y-R, Chen, S-H (2011). Enriching Mandarin speech recognition by incorporating a hierarchical prosody model. In Proc. ICASSP 2011, Prague, Czech Republic, (pp. 5052–5055).
62. Chiang, C-Y, Yang, J-H, Liu, M-C, Wang, Y-R, Liao, Y-F, Chen, S-H (2011). A new model-based Mandarin-speech coding system. In Proc. Interspeech 2011, Florence, Italy, (pp. 2561–2564).


Acknowledgements

The author would like to thank Prof. Sin-Horng Chen of National Chiao Tung University, Taiwan, for his generous guidance.

Funding

This work was supported by the Ministry of Science and Technology of Taiwan under Contract Nos. NSC-102-2221-E-305-005-MY3, MOST-105-2622-E-305-003-CC3, and MOST-106-2221-E-305 -010.

Availability of data and materials

All the speech utterances used in the MUSHRA subjective tests can be found at the website: http://cychiang.tw/pc_hpm/HPM_prosody_coding.htm.

Author information

Contributions

CYC is the single author of this paper. The author read and approved the final manuscript.

Corresponding author

Correspondence to Chen-Yu Chiang.

Ethics declarations

Authors’ information

Chen-Yu Chiang obtained the B.S., M.S., Ph.D. degrees in communication engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2002, 2004, and 2009, respectively. He is currently the Director of the Speech and Multimedia Signal Processing Lab and an Assistant Professor in the Department of Communication Engineering, National Taipei University, Taiwan. His research interests include speech processing, focusing on prosody modeling, automatic speech recognition, and text-to-speech systems (website: http://cychiang.tw/).

Competing interests

The author declares that there are no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Chiang, CY. A parametric prosody coding approach for Mandarin speech using a hierarchical prosodic model. J AUDIO SPEECH MUSIC PROC. 2018, 5 (2018). https://doi.org/10.1186/s13636-018-0129-5


Keywords