Fig. 4 | EURASIP Journal on Audio, Speech, and Music Processing


From: Automated audio captioning: an overview of recent progress and new challenges

Fig. 4

Diagram of the Transformer-based audio encoder. The input spectrogram is first split into small patches, which are projected into 1-D embeddings by a linear layer. A positional embedding is then added to each patch embedding to capture position information, and the resulting embeddings are fed into the Transformer blocks to obtain the encoded audio features.
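
A minimal sketch of the encoder described in the caption, assuming a PyTorch implementation; the module name, patch size, embedding dimension, and number of layers are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn


class PatchTransformerAudioEncoder(nn.Module):
    """Sketch of a patch-based Transformer audio encoder (illustrative only)."""

    def __init__(self, n_mels=128, n_frames=1024, patch_size=16,
                 embed_dim=768, num_layers=12, num_heads=12):
        super().__init__()
        # Number of non-overlapping patches covering the spectrogram
        self.num_patches = (n_mels // patch_size) * (n_frames // patch_size)
        # Linear projection of each patch to a 1-D embedding, implemented as a
        # strided convolution (equivalent to splitting into patches + linear layer)
        self.patch_proj = nn.Conv2d(1, embed_dim,
                                    kernel_size=patch_size, stride=patch_size)
        # Learnable positional embedding, one vector per patch
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))
        # Stack of standard Transformer encoder blocks
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, spectrogram):
        # spectrogram: (batch, n_mels, n_frames)
        x = spectrogram.unsqueeze(1)          # (batch, 1, n_mels, n_frames)
        x = self.patch_proj(x)                # (batch, embed_dim, H', W')
        x = x.flatten(2).transpose(1, 2)      # (batch, num_patches, embed_dim)
        x = x + self.pos_embed                # add position information
        return self.blocks(x)                 # encoded audio features


# Example: encode a batch of two 128-mel, 1024-frame spectrograms
if __name__ == "__main__":
    encoder = PatchTransformerAudioEncoder()
    feats = encoder(torch.randn(2, 128, 1024))
    print(feats.shape)  # torch.Size([2, 512, 768])
```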
