Fig. 5 | EURASIP Journal on Audio, Speech, and Music Processing

Fig. 5

From: Automated audio captioning: an overview of recent progress and new challenges

Diagram of an RNN-based language model. The RNN decoder generates the sentence from left (the first word) to right (the final word) in an auto-regressive manner, conditioned on the audio feature sequence produced by the encoder and on the words the decoder has already generated. A start token “<s>” is fed into the RNN at the first time step to begin generation, and generation terminates when the stop token “</s>” is produced.
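The decoding loop described in the caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model from the article: `greedy_decode`, `toy_step`, the integer state, and the three-word caption are all hypothetical names and values chosen for the example. Picking the highest-scoring word at each step (greedy search) is also an assumption, since the caption does not say which search strategy is used. A real decoder would replace `toy_step` with a trained RNN cell that actually uses the encoder's audio features.

```python
from typing import Callable, Dict, List, Sequence, Tuple

START, STOP = "<s>", "</s>"

# Hypothetical step signature: (previous word, state, audio features)
# -> (score per vocabulary word, next state).
StepFn = Callable[[str, int, Sequence[Sequence[float]]], Tuple[Dict[str, float], int]]


def greedy_decode(step: StepFn,
                  audio_features: Sequence[Sequence[float]],
                  init_state: int,
                  max_len: int = 20) -> List[str]:
    """Generate words left to right until the stop token or max_len."""
    words: List[str] = []
    prev, state = START, init_state          # "<s>" is the first input
    for _ in range(max_len):
        scores, state = step(prev, state, audio_features)
        prev = max(scores, key=scores.get)   # greedy choice (an assumption)
        if prev == STOP:                     # "</s>" ends generation
            break
        words.append(prev)                   # fed back in at the next step
    return words


def toy_step(prev_word: str, state: int,
             audio_features: Sequence[Sequence[float]]) -> Tuple[Dict[str, float], int]:
    """Stand-in for an RNN cell: the state is just a position counter.

    It ignores prev_word and audio_features and always emits the same
    fixed caption, which is enough to exercise the loop above.
    """
    caption = ["a", "dog", "barks", STOP]
    target = caption[min(state, len(caption) - 1)]
    scores = {word: (1.0 if word == target else 0.0) for word in caption}
    return scores, state + 1


caption_words = greedy_decode(toy_step, audio_features=[[0.1, 0.2]], init_state=0)
print(" ".join(caption_words))
```

The `max_len` cap is a common safeguard in such loops: it bounds the output length if the decoder never emits the stop token.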
