Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

EURASIP Journal on Audio, Speech, and Music Processing

Table 2 Summary of supervised ETTS approaches. “LingF” stands for linguistic features, “EmoL” stands for emotion labels, “MelS” stands for mel-spectrograms, “PhnSq” stands for phoneme sequence, “ChrSq” stands for character sequence, “ProsF” stands for prosodic features, “LM-F” stands for features from a language model, and “ET” stands for expression/emotion transplantation

Ref No.	Inputs	Emotion label representation	ET	TTS model
[80]	LingF+EmoL	One-hot vector		DL-SPSS, HMM
[65]	LingF+EmoL	One-hot vector/dependent layers	\(\checkmark\)	DL-SPSS
[66]	LingF+EmoL	Perception vector/matrix		DL-SPSS
[41]	LingF+EmoL	One-hot vector		DL-SPSS
[42]	LingF+EmoL	Dependent layers		DL-SPSS
[81]	LingF+EmoL	One-hot vector/set of neurons	\(\checkmark\)	DL-SPSS
[43]	LingF+EmoL	One-hot vector/dependent layers/separate model		DL-SPSS
[82]	LingF+EmoL	One-hot vector	\(\checkmark\)	DL-SPSS
[83]	PhnSq+LM-F+EmoL	Embedding vector		Encoder-Attention-Decoder
[28, 78]	LingF+EmoL	One-hot vector/dependent layers/separate model	\(\checkmark\)	DL-SPSS
[26]	PhnSq+MelS+EmoL	One-hot vector as ground-truth GST weights		Tacotron2
[27]	PhnSq+LingF+EmoL	Embedding vector		Tacotron2
[84]	LingF+EmoL	Joint embedding with other data labels		DL-SPSS
[85]	LingF+ProsF+EmoL	Ground truth for a classifier		DL-SPSS
[86]	PhnSq+EmoL	Embedding vector		Transformer TTS
[32, 36]	ChrSq+MelS+EmoL	Ground truth for a classifier		Tacotron2
[69]	LingF+EmoL	One-hot vector/dependent layers	\(\checkmark\)	DL-SPSS
[34]	PhnSq+MelS+EmoL	Ground truth for a classifier		Tacotron2
[64]	ChrSq+LM-F+EmoL	Ground truth for a predictor		Tacotron2
[39, 87]	PhnSq+MelS+EmoL	Ground truth for a classifier		Tacotron2
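
To make the most frequent entry in Table 2 concrete, the following minimal sketch shows how an emotion label (EmoL) could be encoded as a one-hot vector and appended to one frame of linguistic features (LingF). It is not taken from any of the cited systems: the emotion inventory, the feature values, and the function names are assumptions chosen only for illustration.

```python
# Hypothetical emotion inventory; real corpora define their own label sets.
EMOTIONS = ["neutral", "happy", "sad", "angry"]


def one_hot(label, inventory=EMOTIONS):
    """Return a one-hot list with 1.0 at the index of `label`."""
    if label not in inventory:
        raise ValueError(f"unknown emotion label: {label!r}")
    return [1.0 if name == label else 0.0 for name in inventory]


def condition_frame(ling_features, label):
    """Append the one-hot emotion code to one frame of linguistic features."""
    return list(ling_features) + one_hot(label)


# Toy linguistic feature frame (values are made up for illustration).
frame = [0.2, 0.7, 0.1]
conditioned = condition_frame(frame, "sad")
print(conditioned)
```

In a sketch like this, the conditioned frame has the length of the linguistic features plus the number of emotions. An embedding-vector representation, as listed for several rows of the table, would typically replace the fixed one-hot code with a learned dense vector per label.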