Fig. 8From: Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resourcesBaseline model (3) for unsupervised approaches: Global Style Tokens (GSTs) integrated to encoder-decoder TTS model. Style token layer learns a single weight for each token and generates the style embedding from summation of all the tokens multiplied by their corresponding weightsBack to article page