Fig. 2 | EURASIP Journal on Audio, Speech, and Music Processing

From: Language agnostic missing subtitle detection

Network architecture of our Audio Classification model. The network takes 2 s audio clips sampled at 48 kHz as input and extracts a log mel spectrogram. The spectrogram is passed as input to the VGGish network, which consists of 4 convolutional blocks. Each block consists of conv2D-BatchNorm(BN)-PReLU-conv2D-BN-PReLU followed by a 2×2 MaxPool2D layer. After the blocks, we pool along the temporal axis and reshape the output into a 2D array. This is passed through two fully connected layers of sizes 512 (with a dropout of 0.5) and 121, respectively. Finally, we apply a softmax over the 121 categories
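The shape of the data as it flows through this pipeline can be traced with a short sketch. The clip length (2 s), sample rate (48 kHz), number of blocks (4), pooling size (2×2), and layer widths (512, 121) come from the caption; the mel-bin count, STFT hop length, and final channel width below are illustrative assumptions, not values from the paper.

```python
# Shape walk-through of the VGGish-style classifier in the caption.
# Assumed values (NOT from the paper): HOP, N_MELS, CHANNELS.

SAMPLE_RATE = 48_000                     # caption: 48 kHz input
CLIP_SECONDS = 2                         # caption: 2 s clips
N_SAMPLES = SAMPLE_RATE * CLIP_SECONDS   # 96,000 samples per clip

HOP = 500                                # assumed STFT hop length
N_MELS = 64                              # assumed number of mel bins
frames = N_SAMPLES // HOP                # 192 time frames

# 4 blocks of (conv2D-BN-PReLU) x2 + 2x2 MaxPool2D:
# 'same'-padded convs keep the spatial size; each pool halves it.
h, w = frames, N_MELS
for _ in range(4):
    h, w = h // 2, w // 2

# Pool along the temporal axis (collapsing h), then flatten to a
# (batch, features) array for the fully connected layers.
CHANNELS = 256                           # assumed last-block channel width
features = CHANNELS * w

print(f"log mel spectrogram: {frames} x {N_MELS}")
print(f"after 4 conv blocks: {h} x {w}")
print(f"flattened FC input:  {features}")
# -> FC 512 (dropout 0.5) -> FC 121 -> softmax over 121 categories
```

Under these assumptions the 192×64 spectrogram shrinks to 12×4 after the four 2×2 pools, and temporal pooling plus flattening yields the feature vector fed to the 512-unit layer.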
