Fig. 4 (from "Automated audio captioning: an overview of recent progress and new challenges"): Diagram of the Transformer-based audio encoder. The input spectrogram is first split into small patches, which are projected into 1-D embeddings by a linear layer; a positional embedding is then added to each patch embedding to capture position information. The resulting embeddings are fed into the Transformer blocks to obtain the encoded audio features.
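The patch-embedding pipeline in the caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the spectrogram size, patch size, and embedding dimension below are assumed for the example, the projection and positional-embedding matrices are random stand-ins for learned parameters, and the Transformer blocks themselves are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed (hypothetical) dimensions: a 64-mel x 96-frame spectrogram,
# 16x16 patches, and 128-dimensional embeddings.
n_mels, n_frames = 64, 96
patch, d_model = 16, 128

spec = rng.standard_normal((n_mels, n_frames))

# 1. Split the spectrogram into non-overlapping square patches
#    and flatten each patch into a vector.
patches = (
    spec.reshape(n_mels // patch, patch, n_frames // patch, patch)
        .transpose(0, 2, 1, 3)
        .reshape(-1, patch * patch)
)  # shape: (num_patches, patch * patch) = (24, 256)

# 2. Project each flattened patch to a 1-D embedding via a linear layer
#    (random weights here stand in for learned ones).
W = rng.standard_normal((patch * patch, d_model)) * 0.02
x = patches @ W  # (num_patches, d_model)

# 3. Add a positional embedding to each patch embedding so the model
#    retains position information (learnable in practice, random here).
pos = rng.standard_normal((x.shape[0], d_model)) * 0.02
x = x + pos

# 4. x would now be fed into the Transformer blocks (omitted)
#    to produce the encoded audio features.
print(x.shape)  # (24, 128)
```

The reshape/transpose trick in step 1 is one common way to tile a 2-D array into equal patches; real implementations often use a strided convolution with stride equal to the patch size, which fuses steps 1 and 2.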