Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

EURASIP Journal on Audio, Speech, and Music Processing

Table 8 Objective evaluation metrics for expressive speech synthesis models

Metric	Description
Mel-Cepstral Distortion (MCD)	Sums the squared differences between the Mel-Frequency Cepstral Coefficients (MFCCs) of the ground truth and synthesized samples.
Gross Pitch Error (GPE)	Calculates the percentage of voiced frames whose pitch deviates by more than 20% from the ground truth samples.
Voice Decision Error (VDE)	Measures the mismatch in voiced/unvoiced decisions between the ground truth and the synthesized samples.
F0 Frame Error (FFE)	Combines GPE and VDE by measuring the percentage of frames that contain either a pitch error of more than 20% (GPE) or a voicing decision error (VDE) when the synthesized samples are compared with the ground truth.
Word Error Rate (WER)	Measures the word error rate of the synthesized speech’s transcription with respect to the input text. Public automatic speech recognition (ASR) models are used to transcribe the synthesized speech.
Band APeriodicity Distortion (BAPD)	Measures the distortion over linearly spaced band aperiodicity coefficients between the ground truth and the synthesized samples.
Root Mean Square Error (RMSE)	Measures the root mean square error of the F0 or energy of the synthesized samples relative to their ground truth.
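
As a rough illustration of how the F0-based metrics in Table 8 (GPE, VDE, FFE, and F0 RMSE) relate to one another, the following Python sketch computes them from two frame-aligned F0 tracks. It is a minimal sketch under stated assumptions and not the implementation used by any reviewed work: the function name f0_metrics is hypothetical, a value of 0.0 is assumed to mark an unvoiced frame, GPE and RMSE are assumed to be normalized by the frames that are voiced in both signals, and VDE and FFE by the total number of frames. Results are returned as fractions rather than percentages.

```python
import math


def f0_metrics(ref_f0, syn_f0, tol=0.2):
    """Compute GPE, VDE, FFE (fractions) and F0 RMSE (Hz) for two F0 tracks.

    Assumptions (illustrative, not taken from the review):
      - both tracks are frame-aligned and have the same length;
      - an F0 value of 0.0 marks an unvoiced frame;
      - GPE and RMSE are computed over frames voiced in both tracks.
    """
    if len(ref_f0) != len(syn_f0):
        raise ValueError("F0 tracks must be frame-aligned and of equal length")
    n_frames = len(ref_f0)
    if n_frames == 0:
        raise ValueError("F0 tracks must not be empty")

    both_voiced = 0      # frames voiced in reference and synthesis
    gross_errors = 0     # frames with a pitch deviation above tol (20%)
    voicing_errors = 0   # frames whose voiced/unvoiced decision differs
    squared_error = 0.0  # accumulated squared F0 difference

    for ref, syn in zip(ref_f0, syn_f0):
        ref_voiced, syn_voiced = ref > 0, syn > 0
        if ref_voiced != syn_voiced:
            voicing_errors += 1
        elif ref_voiced:
            both_voiced += 1
            squared_error += (syn - ref) ** 2
            if abs(syn - ref) > tol * ref:
                gross_errors += 1

    return {
        "gpe": gross_errors / both_voiced if both_voiced else 0.0,
        "vde": voicing_errors / n_frames,
        "ffe": (gross_errors + voicing_errors) / n_frames,
        "f0_rmse": math.sqrt(squared_error / both_voiced) if both_voiced else 0.0,
    }


# Toy example: five frames, one gross pitch error and two voicing errors.
reference = [0.0, 100.0, 100.0, 200.0, 0.0]
synthesized = [0.0, 100.0, 130.0, 0.0, 50.0]
metrics = f0_metrics(reference, synthesized)
print(metrics)
```

In the toy example, one of the two frames voiced in both tracks deviates by more than 20%, and two of the five frames disagree on voicing, so FFE counts three erroneous frames out of five. In practice the F0 tracks would come from a pitch extractor and the two signals would first need to be time-aligned; both steps are outside this sketch.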