Fig. 6 Diagram of a Transformer-based language model. When generating a word at each time step, the masked multi-head attention module attends to the previously generated words to exploit contextual information. The output of the masked self-attention module is then fused with the audio feature sequence from the encoder in the cross-attention module.
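To make the dataflow in the figure concrete, here is a minimal sketch (not the authors' implementation; all module names, dimensions, and hyperparameters are illustrative assumptions) of one such decoder block: masked self-attention over the previously generated words, followed by cross-attention that fuses the text representation with the encoder's audio feature sequence.

```python
# Illustrative sketch of the decoder block described in Fig. 6.
# Not the paper's implementation; names and sizes are assumptions.
import torch
import torch.nn as nn


class CaptionDecoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, words, audio_feats):
        # Masked self-attention: a causal mask ensures position t attends
        # only to positions <= t, i.e. the previously generated words.
        T = words.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        x, _ = self.self_attn(words, words, words, attn_mask=causal)
        x = self.norm1(words + x)
        # Cross-attention: text queries attend over the encoder's audio
        # feature sequence (keys/values), fusing the two modalities.
        y, _ = self.cross_attn(x, audio_feats, audio_feats)
        y = self.norm2(x + y)
        return self.norm3(y + self.ff(y))


# Usage: a batch of 1, with 5 generated word embeddings attending over
# 10 audio frames from the encoder.
block = CaptionDecoderBlock()
out = block(torch.randn(1, 5, 256), torch.randn(1, 10, 256))
print(out.shape)  # torch.Size([1, 5, 256])
```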