Table 2 Summary of supervised ETTS approaches. “LingF” stands for linguistic features, “EmoL” stands for emotion labels, “MelS” stands for mel-spectrograms, “PhnSq” stands for phoneme sequence, “ChrSq” stands for character sequence, “ProsF” stands for prosodic features, “LM-F” stands for features from a language model, and “ET” stands for expression/emotion transplantation.

From: Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

| Ref No. | Inputs | Emotion label representation | ET | TTS model |
| --- | --- | --- | --- | --- |
| [80] | LingF+EmoL | One-hot vector | | DL-SPSS, HMM |
| [65] | LingF+EmoL | One-hot vector/dependent layers | \(\checkmark\) | DL-SPSS |
| [66] | LingF+EmoL | Perception vector/matrix | | DL-SPSS |
| [41] | LingF+EmoL | One-hot vector | | DL-SPSS |
| [42] | LingF+EmoL | Dependent layers | | DL-SPSS |
| [81] | LingF+EmoL | One-hot vector/set of neurons | \(\checkmark\) | DL-SPSS |
| [43] | LingF+EmoL | One-hot vector/dependent layers/separated model | | DL-SPSS |
| [82] | LingF+EmoL | One-hot vector | \(\checkmark\) | DL-SPSS |
| [83] | PhnSq+LM-F+EmoL | Embedding vector | | Encoder-Attention-Decoder |
| [28, 78] | LingF+EmoL | One-hot vector/dependent layers/separated model | \(\checkmark\) | DL-SPSS |
| [26] | PhnSq+MelS+EmoL | One-hot vector as ground truth GSTs weights | | Tacotron2 |
| [27] | PhnSq+LingF+EmoL | Embedding vector | | Tacotron2 |
| [84] | LingF+EmoL | Joint embedding with other data labels | | DL-SPSS |
| [85] | LingF+ProsF+EmoL | Ground truth for a classifier | | DL-SPSS |
| [86] | PhnSq+EmoL | Embedding vector | | Transformer TTS |
| [32, 36] | ChrSq+MelS+EmoL | Ground truth for a classifier | | Tacotron2 |
| [69] | LingF+EmoL | One-hot vector/dependent layers | \(\checkmark\) | DL-SPSS |
| [34] | PhnSq+MelS+EmoL | Ground truth for a classifier | | Tacotron2 |
| [64] | ChrSq+LM-F+EmoL | Ground truth for a predictor | | Tacotron2 |
| [39, 87] | PhnSq+MelS+EmoL | Ground truth for a classifier | | Tacotron2 |