Fig. 7From: Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resourcesBaseline model (2) for unsupervised approaches: variational autoencoder (VAE) integrated to encoder-decoder TTS model. VAE learns latent space variables mean \(\mu (x)\) and variance \(\sigma ^2(x)\) and samples latent prosody vector(z) from the learned prosody spaceBack to article page