Fig. 2
From: Language agnostic missing subtitle detection
Network architecture of our audio classification model. The network takes 2 s audio clips sampled at 48 kHz and extracts a log-mel spectrogram from each clip. The spectrogram is passed to a VGGish network consisting of 4 convolutional blocks. Each block consists of Conv2D-BatchNorm(BN)-PReLU-Conv2D-BN-PReLU followed by a 2 × 2 MaxPool2D layer. After the blocks, we pool along the temporal axis and reshape the result into a 2D array. This array is passed through two fully connected layers of sizes 512 (with a dropout of 0.5) and 121, respectively. Finally, we apply a softmax over the 121 categories.
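The caption's layer sequence can be followed as a shape walk-through. The sketch below is illustrative only, not the authors' implementation: the caption gives the clip length, sample rate, number of blocks, and fully connected sizes, but not the number of mel bands, the spectrogram hop length, the channel widths, or the convolution padding. The values `N_MELS`, `HOP_LENGTH`, and `BLOCK_CHANNELS` are hypothetical, and the convolutions are assumed to preserve spatial size so that only the 2 × 2 max pooling halves each axis. Shapes are per example, with the batch axis omitted.

```python
# Values stated in the caption.
SAMPLE_RATE = 48_000       # Hz
CLIP_SECONDS = 2           # clip length in seconds
NUM_BLOCKS = 4             # convolutional blocks
FC_SIZES = (512, 121)      # fully connected layer widths

# Hypothetical values, NOT given in the caption.
N_MELS = 64                          # mel bands (assumption)
HOP_LENGTH = 480                     # samples per frame step, 10 ms (assumption)
BLOCK_CHANNELS = (64, 128, 256, 512) # output channels per block (assumption)


def trace_shapes(n_mels=N_MELS, hop_length=HOP_LENGTH, channels=BLOCK_CHANNELS):
    """Return (stage name, per-example shape) pairs for the described network."""
    if len(channels) != NUM_BLOCKS:
        raise ValueError(f"expected {NUM_BLOCKS} channel widths")

    n_samples = SAMPLE_RATE * CLIP_SECONDS
    n_frames = n_samples // hop_length  # assumes no padding of the clip

    freq, time = n_mels, n_frames
    shapes = [("log-mel input", (1, freq, time))]

    for index, width in enumerate(channels, start=1):
        # The two convolutions are assumed to keep the spatial size, so the
        # 2 x 2 max pool is the only step that halves frequency and time.
        freq, time = freq // 2, time // 2
        shapes.append((f"block {index}", (width, freq, time)))

    # Pooling along the temporal axis removes the time dimension.
    shapes.append(("temporal pooling", (channels[-1], freq)))

    # Flattening gives one feature vector per clip (2D once batched).
    shapes.append(("flatten", (channels[-1] * freq,)))

    for size in FC_SIZES:
        shapes.append((f"fc {size}", (size,)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes():
        print(f"{name:>18}: {shape}")
```

Under these assumed values, a 2 s clip yields 200 spectrogram frames, the fourth block outputs a (512, 4, 12) tensor, and the flattened vector fed to the first fully connected layer has 2048 features. Different choices for the hypothetical parameters change these intermediate sizes but not the final 121-way output.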