Fig. 5. From: Automated audio captioning: an overview of recent progress and new challenges. Diagram of an RNN-based language model. The RNN decoder generates the sentence from left (the first word) to right (the final word) in an auto-regressive manner, conditioned on the audio feature sequence produced by the encoder and on the words the decoder has already generated. A start token “<s>” is fed into the RNN at the first time step to begin generation, and generation terminates when a stop token “</s>” is produced.
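
The decoding loop described in the caption can be sketched as follows. This is a minimal illustration, not the model from the article: `greedy_decode`, `toy_step`, the `max_len` cap, and the greedy (arg-max) word choice are all assumptions introduced here. `toy_step` is a hand-written lookup table that stands in for a trained RNN cell, so the audio features are passed through but not actually used.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

START, STOP = "<s>", "</s>"

# Hypothetical step signature: (audio features, previous word, state)
# -> (score per vocabulary word, new state).
StepFn = Callable[
    [Sequence[float], str, Optional[int]],
    Tuple[Dict[str, float], int],
]


def greedy_decode(step: StepFn,
                  audio_features: Sequence[float],
                  max_len: int = 20) -> List[str]:
    """Generate words left to right until the stop token or max_len."""
    words: List[str] = []
    prev_word, state = START, None  # the start token seeds the first step
    for _ in range(max_len):
        scores, state = step(audio_features, prev_word, state)
        prev_word = max(scores, key=scores.get)  # greedy choice (assumption)
        if prev_word == STOP:  # stop token ends generation
            break
        words.append(prev_word)
    return words


def toy_step(audio_features: Sequence[float],
             prev_word: str,
             state: Optional[int]) -> Tuple[Dict[str, float], int]:
    """Stand-in for an RNN cell: a fixed next-word table, not a trained model."""
    script = {"<s>": "a", "a": "dog", "dog": "barks", "barks": "</s>"}
    vocab = ["a", "dog", "barks", "</s>"]
    target = script[prev_word]
    scores = {word: (1.0 if word == target else 0.0) for word in vocab}
    new_state = 0 if state is None else state + 1  # state just counts steps
    return scores, new_state


caption = greedy_decode(toy_step, audio_features=[0.1, 0.2])
print(" ".join(caption))
```

In a real system the step function would wrap a trained recurrent cell whose hidden state carries the history of generated words, and greedy selection is often replaced by beam search. The loop structure (start token in, feed each output back as the next input, stop on the stop token) stays the same.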