Fig. 3From: Automated audio captioning: an overview of recent progress and new challengesDiagram of a 10-layer CNN audio encoder. The input representation is first processed via four convolutional blocks and pooling layers, where each block consists of two convolutional layers. The feature maps output by the last convolutional block are then averaged along the frequency axis and fed into a two-layer multi-layer perceptron (MLP) to obtain encoded audio featuresBack to article page