Fig. 1 | EURASIP Journal on Audio, Speech, and Music Processing
From: Language agnostic missing subtitle detection
Fig. 1: Network architecture of our GRU-based VAD model. The model takes an 800 ms audio signal sampled at 32 kHz as input and extracts magnitude (STFT) and instantaneous frequency (IF) spectrograms using a feature extraction module. These two spectrograms are normalized with Batch Normalization (BN) and passed through two parallel two-layer Bi-GRU modules. The outputs of the GRUs are time-averaged, concatenated, and passed through a linear layer (128 dimensions) followed by a Parametric ReLU (PReLU), Batch Normalization (BN), a linear layer (2 dimensions), and a softmax that produces the probabilities of speech and non-speech.
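
To make the layer sequence in the caption concrete, the following PyTorch sketch mirrors the described pipeline. The caption does not specify the STFT parameters, GRU hidden sizes, or how the IF feature is computed, so the values and choices below (`n_fft`, `hop_length`, `hidden_size`, and the wrapped phase-difference IF approximation) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GruVadSketch(nn.Module):
    """Sketch: magnitude + IF spectrograms -> parallel two-layer Bi-GRUs -> classifier."""

    def __init__(self, n_fft=512, hop_length=160, hidden_size=64):  # assumed values
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        n_bins = n_fft // 2 + 1
        # BN over the frequency bins of each spectrogram
        self.bn_mag = nn.BatchNorm1d(n_bins)
        self.bn_if = nn.BatchNorm1d(n_bins)
        # Two parallel two-layer bidirectional GRUs, one per feature stream
        self.gru_mag = nn.GRU(n_bins, hidden_size, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.gru_if = nn.GRU(n_bins, hidden_size, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Classifier: linear (128) -> PReLU -> BN -> linear (2) -> softmax
        self.fc1 = nn.Linear(4 * hidden_size, 128)
        self.prelu = nn.PReLU()
        self.bn_fc = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(128, 2)

    def features(self, wav):
        # wav: (batch, samples); 800 ms at 32 kHz corresponds to 25600 samples
        window = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, hop_length=self.hop_length,
                          window=window, return_complex=True)  # (batch, bins, frames)
        mag = spec.abs()
        phase = torch.angle(spec)
        # IF approximated as the wrapped frame-to-frame phase difference (assumption)
        dphase = torch.diff(phase, dim=-1, prepend=phase[..., :1])
        inst_freq = torch.remainder(dphase + torch.pi, 2 * torch.pi) - torch.pi
        return mag, inst_freq

    def forward(self, wav):
        mag, inst_freq = self.features(wav)
        # BatchNorm1d expects (batch, channels=bins, frames); GRUs want (batch, frames, bins)
        mag = self.bn_mag(mag).transpose(1, 2)
        inst_freq = self.bn_if(inst_freq).transpose(1, 2)
        out_mag, _ = self.gru_mag(mag)      # (batch, frames, 2 * hidden_size)
        out_if, _ = self.gru_if(inst_freq)
        # Time-average each stream, then concatenate
        pooled = torch.cat([out_mag.mean(dim=1), out_if.mean(dim=1)], dim=-1)
        x = self.bn_fc(self.prelu(self.fc1(pooled)))
        return torch.softmax(self.fc2(x), dim=-1)  # [P(speech), P(non-speech)]


# Example: batch of 4 clips, each 800 ms at 32 kHz
probs = GruVadSketch()(torch.randn(4, 25600))
print(probs.shape)  # torch.Size([4, 2])
```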