Fig. 2 From: Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

Structure of TTS models based on deep learning. Autoregressive models follow the upper track, which uses a sequential attention mechanism. Non-autoregressive models follow the lower track, which uses a parallel attention unit driven by alignments from an external aligner or a pretrained autoregressive model.
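The two tracks in this caption differ mainly in how output frames are tied to the input text. The following is a minimal, hypothetical sketch of that difference, not the reviewed models' code. It makes several assumptions. The lower track's use of alignments is illustrated with a FastSpeech-style length regulator, where integer per-phoneme durations (standing in for an external aligner's output) repeat each encoding to frame rate. The upper track's learned attention is replaced by a toy softmax dot-product attention. There are no neural networks, and the functions `length_regulate`, `parallel_decode`, `dot_attention`, and `autoregressive_decode` are illustrative names only.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def length_regulate(encodings: Sequence[Vector], durations: Sequence[int]) -> List[Vector]:
    """Expand phoneme encodings to frame rate by repeating encoding i durations[i] times.

    Assumption: the durations stand in for alignments taken from an external
    aligner or a pretrained autoregressive (teacher) model.
    """
    if len(encodings) != len(durations):
        raise ValueError("expected one duration per phoneme encoding")
    frames: List[Vector] = []
    for enc, dur in zip(encodings, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        frames.extend(list(enc) for _ in range(dur))
    return frames


def parallel_decode(
    encodings: Sequence[Vector],
    durations: Sequence[int],
    frame_fn: Callable[[Vector], Vector],
) -> List[Vector]:
    """Lower track: each output frame depends only on its aligned input,
    so all frames are independent and could be computed in parallel."""
    return [frame_fn(frame) for frame in length_regulate(encodings, durations)]


def dot_attention(query: Vector, keys: Sequence[Vector]) -> Vector:
    """Toy softmax dot-product attention returning a weighted average of the keys."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]


def autoregressive_decode(
    encodings: Sequence[Vector],
    num_frames: int,
    step_fn: Callable[[Vector, Vector], Vector],
) -> List[Vector]:
    """Upper track: frame t attends over the encodings using frame t-1 as the
    query, so frames must be produced strictly one after another."""
    prev: Vector = [0.0] * len(encodings[0])  # all-zero "go" frame
    outputs: List[Vector] = []
    for _ in range(num_frames):  # assumption: fixed length instead of a stop token
        context = dot_attention(prev, encodings)
        prev = step_fn(context, prev)
        outputs.append(prev)
    return outputs


# Usage: two phoneme encodings, the first aligned to 2 frames, the second to 1.
enc = [[1.0, 0.0], [0.0, 1.0]]
print(parallel_decode(enc, [2, 1], lambda f: [2 * x for x in f]))
# prints [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
```

The contrast the sketch is meant to show is the loop dependency. In `autoregressive_decode`, each step needs the previous output before it can compute its attention query. In `parallel_decode`, the alignment fixes which input every frame reads, so no frame waits on another. Real systems predict durations with a learned duration model at inference time and stop autoregressive decoding with a learned stop signal. Both are omitted here.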