Punctuation-generation-inspired linguistic features for Mandarin prosody generation

Chiang, Chen-Yu; Hung, Yu-Ping; Yeh, Han-Yun; Liao, I-Bin; Pan, Chen-Ming

doi:10.1186/s13636-019-0147-y

Research
Open access
Published: 21 February 2019

Chen-Yu Chiang¹ (ORCID: orcid.org/0000-0003-4997-8774), Yu-Ping Hung¹, Han-Yun Yeh¹, I-Bin Liao² & Chen-Ming Pan²

EURASIP Journal on Audio, Speech, and Music Processing volume 2019, Article number: 4 (2019)

Abstract

This paper proposes two novel linguistic features extracted from text input for prosody generation in a Mandarin text-to-speech system. The first feature is the punctuation confidence (PC), which measures the likelihood that a major punctuation mark (MPM) can be inserted at a word boundary. The second feature is the quotation confidence (QC), which measures the likelihood that a word string is quoted as a meaningful or emphasized unit. The proposed PC and QC features are influenced by the properties of automatic Chinese punctuation generation and the linguistic characteristics of the Chinese punctuation system. Because MPMs are highly correlated with prosodic–acoustic features and quoted word strings serve crucial roles in human language understanding, the two features could potentially provide useful information for prosody generation. This idea was realized by employing conditional random field (CRF)-based models to predict MPMs, quoted word string locations, and their associated confidences (that is, PC and QC) for each word boundary. The predicted punctuations and their confidences were then combined with traditional linguistic features, and multilayer perceptrons used the combined features to predict the prosodic–acoustic features needed for speech synthesis. Both objective and subjective tests demonstrated that the prosody generated with the proposed linguistic features was superior to that generated without them. Therefore, the proposed PC and QC are identified as promising features for Mandarin prosody generation.

1 Introduction

Prosody generation serves a crucial role in a text-to-speech (TTS) system. It can be regarded as a function that maps linguistic features to prosodic structures or prosodic–acoustic features. In a practical implementation of an unlimited-text Mandarin TTS (MTTS), the availability and reliability of linguistic features are highly dependent on the performance of the text analyzer employed. A basic text analyzer comprises a Chinese word segmenter, a grapheme-to-phoneme (G2P) converter, and a part-of-speech (POS) tagger. Prosodic structures are abstract descriptions of speech prosody and are usually represented categorically using prosodic break tags, such as nonbreak, minor break, and major break. A commonly agreed-upon Mandarin prosody hierarchy is a four-layer prosodic structure. The four layers are, from the lowest to the highest, the syllable (SYL) layer, prosodic word (PW) layer, intermediate phrase (or prosodic phrase; PPh) layer, and intonation phrase (IP) layer, which are demarcated by nonbreaks, minor breaks, major breaks, and utterance boundaries, respectively [1,2,3]. Prosodic–acoustic features are prosodic information represented numerically by values or vectors of the log-F0 contour, duration, and energy of a linguistic unit, for example, a phone, syllable, initial or final, or word. The representative prosodic–acoustic features for Mandarin speech are the syllable log-F0 contour, syllable duration, pause duration, and syllable energy level [4,5,6]. In hidden Markov model (HMM)-based synthesis, the most popular speech synthesis method [7,8,9,10], prosodic–acoustic features are modeled at the HMM state level, that is, by the state duration, the state log-F0 value, and the energy contour contained in the spectral parameters.

Irrespective of the target of prosody generation (prosodic structure or prosodic–acoustic features), studies on prosody generation have focused on two problems: (1) the design or utilization of a prediction model and (2) the utilization of features. For the first problem, the prediction methods popularly used for generating a prosodic structure are the hierarchical stochastic model [11], the N-gram model [12], the classification and regression tree (CART) [13, 14], bottom-up/sifting hierarchical CART [13], the Markov model [15], artificial neural networks [16], and the maximum entropy model [17]. The popular pattern recognition tools for generating prosodic–acoustic features are the multilayer perceptron (MLP) [18,19,20,21,22,23], the recurrent neural network [4], CART [7,8,9,10, 24], and a decision tree combined with a hidden Markov model that uses multispace distribution modeling of the F0 contour [7,8,9,10]. For the second problem, conventional linguistic features, such as POSs, word length, sentence length, and the position of a word in a sentence, are widely used in many existing MTTSs [4, 12,13,14, 17, 22, 24,25,26,27]. Some studies have improved the accuracy of prosodic structure prediction or prosodic–acoustic feature prediction by incorporating higher-level syntactic features, such as word chunks [16] and syntactic trees [16, 26, 27]. Moreover, statistical linguistic features such as connective degree [14], punctuation confidence (PC) [28,29,30,31], and quotation confidence (QC) [30, 31] have been proposed to avoid complex syntactic tree parsing and manual word chunking, which are impractical when constructing an unlimited-text MTTS.

This paper focuses on the second problem and extends and elaborates on our previous research pertaining to the PC [28,29,30,31] and QC [30, 31] features. More substantial analysis and modeling details are provided to give readers insight into the proposed PC and QC features, the design of which is influenced by automatic Chinese punctuation generation [32] and the linguistic characteristics of the Chinese punctuation system [33]. PC measures the likelihood of inserting a major punctuation mark (MPM) at a word boundary, whereas QC measures the likelihood that a word string is quoted by Chinese quotation marks (or brackets) to emphasize its meaning. In [32], a maximum-entropy-based automatic Chinese punctuation generation method was proposed to insert 16 types of punctuation marks (PMs) into unpunctuated text by using word features and lexical–functional grammar features. The results in [32] indicated that the punctuation generation model could generate alternative or acceptable insertions, deletions, or substitutions of PMs. A similar outcome was obtained in a punctuation experiment involving human readers, reported by Tseng [33], in which different native Mandarin Chinese speakers were found to adopt alternative punctuation strategies. These observations suggest that Chinese PMs serve as a loose reference to the syntactic structure and semantic domain. Therefore, native Chinese writers can freely use PMs to delimit written Chinese into various linguistic elements, such as sentences, phrases, and clauses, to express the meaning of a text clearly. Furthermore, the way a speaker punctuates written Chinese while reading it reflects the speaker’s prosodic phrasing strategy because pause breaks are highly correlated with some MPMs, such as the period, comma, exclamation mark, question mark, semicolon, and colon. Therefore, an automatic punctuation generation model that predicts MPMs and is trained on a large text corpus can learn the punctuation strategies of various contributors and thus provide useful cues for predicting both prosodic breaks [28, 31] and prosodic–acoustic features [29,30,31].

Word strings enclosed by brackets or quotes have essential or unique meanings in sentences. In our analysis of a large text corpus, the Academia Sinica Balanced Corpus of Modern Chinese (ASBC) 4.0 [34], which contains 9,454,734 words (31,126 paragraphs), we discovered that the functions of quoted word strings can be classified into four cases: (1) adding supplementary information to the preceding words; (2) representing the name of a particular person, place, or institution; (3) emphasizing the meaning of a word string; and (4) indicating a newly derived compound word or word chunk that has a complex meaning. In cases (3) and (4), the quoted word strings, which are named quoted phrases in this paper, may form, from small to large linguistic units, newly derived words, compound words, base phrases, word chunks, syntactic phrases, or sentences. These linguistic units are usually larger than common words, carry more complex or even new meanings, and may be higher-level syntactic units than those described by the POSs of individual words. Because a quoted phrase carries richer linguistic information than its individual words, it plays a crucial role in human language understanding during the reading of a text. Moreover, it is generally agreed that speakers can generate good prosody if they understand the meaning of a text. Thus, adding quotations to plain Chinese text and then regarding the added brackets as linguistic features may enable a system to generate prosody that sounds natural. Note that in written Chinese, the use of quotations depends on the writing style or habit of the text contributor. Chinese input texts may thus already contain some brackets for the four functions indicated previously. However, the remaining unquoted words may also be emphasized and regarded as larger syntactic units if they share similar contextual POSs or word structures with the quoted phrases. For Chinese texts containing no quotations, if quotations can be labeled with brackets automatically by a machine given the word and POS information, then the features associated with the labeled brackets could provide richer linguistic information and thus enhance the performance of prosodic–acoustic feature prediction.

To realize automatic MPM and quotation prediction, we constructed two types of conditional random field (CRF)-based [35, 36] automatic punctuation generation models: the CRF-based MPM generation model and the CRF-based quotation generation model. The CRF-based MPM generation model predicts MPMs from MPM-removed word or POS sequences and generates the associated confidence measures, which are referred to as the PC. The PC can be regarded as a statistical linguistic feature measuring the likelihood that an MPM can be inserted at a word juncture. It is reasonable to assume that word junctures at which MPMs are more likely to be inserted are also junctures at which pause breaks are more likely. We could, therefore, expect that the use of PC in prosody generation would improve the performance of prosodic–acoustic feature generation. The CRF-based quotation generation model predicts the structure of a quoted word string (hereafter referred to as the quoted phrase, or QP) from bracket-removed word or POS sequences and calculates the associated confidence, which is referred to as the QC. The QC can likewise be considered a statistical linguistic feature that measures the likelihood that a word string is quoted using left and right brackets. Because the words in brackets form a meaningful unit, it is reasonable to assume that fewer prosodic breaks are inserted within quoted text and that quoted text may be emphasized through some variation in prosodic–acoustic features. Therefore, we inferred that the use of QC may assist in prosody generation.
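As a rough illustration of how a confidence such as the PC could be read off a linear-chain CRF, the sketch below computes posterior label marginals with the forward-backward algorithm. Treating the PC as the marginal probability of the "MPM" label at each word juncture is an assumption made here for illustration only. The two-label set, the toy scores, and the names `marginals`, `unary`, and `trans` are hypothetical and are not taken from the models described in this paper.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginals(unary, trans):
    """Posterior marginals of a linear-chain CRF.

    unary[t][k]: log-score of label k at position t (hypothetical values).
    trans[j][k]: log-score of moving from label j to label k.
    Returns post[t][k] = P(y_t = k | x).
    """
    T, K = len(unary), len(unary[0])
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[0.0] * K for _ in range(T)]  # beta at the last position stays 0
    alpha[0] = list(unary[0])
    # Forward pass: total score of all label prefixes ending in label k.
    for t in range(1, T):
        for k in range(K):
            alpha[t][k] = unary[t][k] + logsumexp(
                [alpha[t - 1][j] + trans[j][k] for j in range(K)]
            )
    # Backward pass: total score of all label suffixes following label j.
    for t in range(T - 2, -1, -1):
        for j in range(K):
            beta[t][j] = logsumexp(
                [trans[j][k] + unary[t + 1][k] + beta[t + 1][k] for k in range(K)]
            )
    log_z = logsumexp(alpha[T - 1])
    return [
        [math.exp(alpha[t][k] + beta[t][k] - log_z) for k in range(K)]
        for t in range(T)
    ]


# Toy example: label 0 = "no MPM", label 1 = "MPM" at three word junctures.
unary = [[0.0, -2.0], [0.0, 0.5], [0.0, 1.5]]
trans = [[0.0, 0.0], [0.0, -1.0]]  # mildly discourages two MPMs in a row
pc = [row[1] for row in marginals(unary, trans)]  # one confidence per juncture
```

Under this reading, each entry of `pc` lies between 0 and 1 and could be fed to a prosody predictor as a continuous feature instead of a hard punctuation decision.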

To evaluate the usefulness of the proposed PC and QC in Mandarin prosody generation, experiments on prosodic–acoustic feature prediction were conducted, together with the corresponding objective and subjective tests. The experimental database was a Mandarin speech corpus, the Treebank speech corpus, which contains 425 utterances with 56,237 syllables uttered by a professional female announcer. The corpus is further divided into three parts: a training set of 301 utterances with 41,317 syllables, a development set of 75 utterances with 10,551 syllables, and a test set of 44 utterances with 3,898 syllables. The corpus used for training the CRF-based punctuation generator was the ASBC 4.0 [34] (hereafter denoted as the ASBC text corpus). For the prosodic–acoustic feature prediction, the proposed linguistic features were combined with conventional linguistic features and used as the input to directly predict four prosodic–acoustic features: the syllable log-F0 contour, syllable duration, syllable energy level, and intersyllable pause duration. The objective tests were scored using the root-mean-square error (RMSE). Subjective tests were then conducted on utterances synthesized using the predicted prosodic–acoustic features.
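For readers unfamiliar with the objective measure, the following minimal sketch shows how an RMSE between predicted and observed values of one prosodic–acoustic feature can be computed. The function name `rmse` and the pause durations in the usage lines are hypothetical illustration values, not results from this paper.

```python
import math


def rmse(predicted, target):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical pause durations in milliseconds (made up for illustration).
observed_ms = [20.0, 250.0, 15.0, 430.0]
predicted_ms = [25.0, 210.0, 30.0, 400.0]
error_ms = rmse(predicted_ms, observed_ms)
```

A lower value indicates that the predicted feature tracks the observed one more closely; the same function applies to syllable duration, energy level, or each dimension of the log-F0 contour.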

Several advantages of the approach were identified. First, the PC and QC are conveniently determined from word and POS sequences, which can be robustly obtained using current word segmentation and POS-tagging technologies, without complicated statistical syntactic parsing. This advantage makes the proposed approach suitable for a practical online unlimited-text TTS. Second, because the CRF-based punctuation generation models were trained by using a large text corpus, the models could learn alternative punctuation strategies from numerous paragraphs by various writers and thus generate more reliable PCs and QCs. Third, the corpus used to train the CRF-based punctuation generator was considerably larger than the text corpora available for constructing a statistical syntactic parser. Therefore, we infer that the obtained PC and QC are more robust than the syntactic features derived from an automatic syntactic parser.

The research process and corresponding section organization of this paper are summarized as follows:

Section 2: Analysis of punctuations

We demonstrate the relationship between punctuations and prosodic structures by analyzing the Treebank speech corpus, which is labeled with prosodic break tags. The analyses that motivated our use of the proposed PC are explained. This section also analyzes the quoted phrases in the ASBC text corpus, thus identifying possible QC candidates for the training of the CRF-based quotation model.

Section 3: Construction of the CRF-based MPM generation model

The CRF-based MPM generation model was trained by using the ASBC text corpus. The precision and recall of the MPM insertions were examined on the test dataset of the ASBC text corpus. The feasibility of using the proposed PC in prosody generation was examined by analyzing the relationship between the prosodic–acoustic features of the training dataset of the Treebank speech corpus and the associated PC generated using the CRF-based MPM generation model.

Section 4: Construction of the CRF-based quotation generation model

The model was trained and examined using the ASBC text corpus. The feasibility of using the QC for prosody generation was determined using the Treebank speech corpus.

Section 5: Prosody generation experiments

The prosody generation experiments were conducted on the Treebank speech corpus. The PC and QC features, generated by the proposed automatic punctuation generation models from the text of the Treebank corpus, were combined with the conventional linguistic features to predict the prosodic–acoustic features of the syllable pitch contour, syllable duration, syllable energy level, and pause duration. Objective and subjective tests were conducted to verify the usefulness of the proposed PC and QC features.

Section 6: Conclusions and future work

2 Analysis of punctuations

Because prosodic–acoustic features are highly dependent on Mandarin’s prosodic structure, and the prosodic structure is categorically represented by a finite set of prosodic break tags, it is more convenient to analyze the relationship between prosodic break types and PMs than to analyze the relationship between numerical prosodic–acoustic features and PMs. Therefore, the relationship between Chinese PMs and the Mandarin prosodic structure is analyzed in this section. The following subsections present the analyses that provided the motivation and rationale for the proposed PC and QC features. The prosody labeling system for determining the prosodic structures of utterances is introduced in Section 2.1. The relationship between the labeled prosodic break types and PM types is discussed in Section 2.2. Section 2.3 presents an experiment in which native Mandarin speakers manually inserted MPMs into PM-removed texts excerpted from the Treebank speech corpus. The relationships between the MPMs inserted by the native Mandarin speakers and the associated prosodic break types are analyzed, thus providing evidence for the proposed PC. An analysis of the quoted phrases in the ASBC text corpus is presented in Section 2.4, identifying the possible QC candidates for the training of the CRF-based quotation generation model.

2.1 Prosody labeling system

The widely used prosody labeling systems are ToBI [37], TILT [38], and C-ToBI [39]. These systems require manual labeling by humans with linguistic expertise. To reduce the human labor required and to increase the consistency of prosody labeling, Chiang et al. [40, 41] proposed an unsupervised joint prosody labeling and modeling (PLM) method for constructing a speaker-dependent statistical hierarchical prosodic model and labeling prosody tags for Mandarin speech. The PLM method was then successfully applied to construct a speaker-independent hierarchical prosodic model for use in a large-vocabulary speech recognition task [42]. Hence, in this study, to avoid intensive human labeling and inconsistent labeling results, the corpus was labeled with seven break types using the PLM method [40, 41]. As illustrated in Fig. 1, the seven break types (B0, B1, B2-1, B2-2, B2-3, B3, and B4) delimit an utterance into four types of prosodic units: the SYL, PW, PPh, and breath group or prosodic phrase group (BG/PG).

In the labeling system, each break type is characterized by the prosodic–acoustic features of its juncture. B4 is a major break with a long pause and an apparent F0 reset across adjacent syllables. B3 is a major break with a medium pause and a medium F0 reset. B0 and B1 are nonbreaks, representing a tightly coupled syllable juncture and a normal syllable boundary within a PW, respectively; neither has an identifiable pause between SYLs. B2 is a minor break with three variants: an F0 reset (B2-1), a short pause (B2-2), and preboundary syllable duration lengthening (B2-3).

Among the various types of prosodic–acoustic features, pause duration is the most salient cue for specifying the boundaries of prosodic units. Figure 2 shows the probability density functions (pdfs) of pause duration, modeled by gamma distributions, for the seven break types and reveals that higher-level break types are generally associated with longer pause durations. According to these pdfs, the long pause of B4 has a duration of ≥ 400 ms, the medium pause of B3 has a duration of 200–400 ms, and the short pause of B2-2 has a duration of 30–200 ms. By contrast, B0, B1, B2-1, and B2-3 have very short pause durations (< 30 ms). On the basis of this analysis, this study defined four break classes for conveniently conducting the analysis presented in Section 2.2: (i) B4, (ii) B3, (iii) B2-2, and (iv) the nonpause break (NPB) class, which comprises B0, B1, B2-1, and B2-3.
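The grouping of the seven break types into four break classes can be written down as a simple lookup, sketched below. The helper `pause_band` only restates the approximate pause ranges quoted above; it is an illustrative assumption and not a labeling rule, because the break tags themselves are assigned by the PLM method rather than by thresholding pause duration. The handling of the exact boundary values (200 ms and 400 ms) is likewise an assumption.

```python
# Four break classes used in the analysis: B4, B3, B2-2, and NPB.
BREAK_CLASS = {
    "B4": "B4",
    "B3": "B3",
    "B2-2": "B2-2",
    "B0": "NPB",
    "B1": "NPB",
    "B2-1": "NPB",
    "B2-3": "NPB",
}


def break_class(tag: str) -> str:
    """Map one of the seven PLM break tags to its break class."""
    return BREAK_CLASS[tag]


def pause_band(pause_ms: float) -> str:
    """Describe which class a pause duration is typical of (illustrative only)."""
    if pause_ms >= 400:
        return "long pause, typical of B4"
    if pause_ms >= 200:
        return "medium pause, typical of B3"
    if pause_ms >= 30:
        return "short pause, typical of B2-2"
    return "very short pause, typical of NPB"


# Example: collapse a hypothetical tag sequence into break classes.
tags = ["B0", "B1", "B2-1", "B2-2", "B1", "B3", "B4"]
classes = [break_class(t) for t in tags]
```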

2.2 Relationship between the labeled break types and PM types

In general, pause breaks are considered to co-occur with PMs. Most TTSs cautiously insert pauses only at MPMs, such as commas and periods. This cautious pause insertion strategy can make synthesized speech very clear, but the speech may sound unnatural because an input sentence can be very long and contain complicated syntactic structures. Table 1 displays the co-occurrence matrix of the four break classes and three syllable juncture types (intraword, non-PM interword, and PM), calculated using the training dataset of the Treebank speech corpus. The table reveals that most PM locations co-occur with pause-related breaks (B2-2, B3, and B4), whereas most intraword locations map to NPBs. Non-PM interword locations co-occur with NPBs, B2-2, and B3. Approximately 40% of prosodic phrase boundaries (B3s) and more than 94% of B2-2s occur at non-PM interword junctures. A more detailed analysis showed that 60% of non-PM B3s coincide with the depth-1 node boundary of a fully parsed syntactic tree. These results imply that inserting pauses only at PM locations would be unsatisfactory.

Table 1 Co-occurrence matrix of four target break types and three syllable juncture types
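A co-occurrence matrix of this kind, and percentages such as the share of B3s falling at non-PM interword junctures, can be tallied as in the sketch below. The function names and the toy list of (break class, juncture type) pairs are hypothetical and do not reproduce the counts of Table 1.

```python
from collections import Counter

JUNCTURES = ("intraword", "non-PM interword", "PM")
CLASSES = ("NPB", "B2-2", "B3", "B4")


def cooccurrence(pairs):
    """Count (break class, juncture type) pairs into a nested dict matrix."""
    counts = Counter(pairs)
    return {c: {j: counts[(c, j)] for j in JUNCTURES} for c in CLASSES}


def share_at(matrix, break_cls, juncture):
    """Fraction of one break class that occurs at the given juncture type."""
    row = matrix[break_cls]
    total = sum(row.values())
    return row[juncture] / total if total else 0.0


# Toy data, made up for illustration: one pair per syllable juncture.
pairs = (
    [("NPB", "intraword")] * 5
    + [("B3", "PM")] * 3
    + [("B3", "non-PM interword")] * 2
    + [("B2-2", "non-PM interword")] * 4
)
matrix = cooccurrence(pairs)
b3_non_pm_share = share_at(matrix, "B3", "non-PM interword")
```

Normalizing each row by its total, as `share_at` does, answers questions of the form "what fraction of this break class occurs at this juncture type", which is how the percentages in the paragraph above are phrased.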
