Fig. 2 | EURASIP Journal on Audio, Speech, and Music Processing
From: An end-to-end approach for blindly rendering a virtual sound source in an audio augmented reality environment
Overview of the proposed model architecture. The top row (a) illustrates the pre-processing step, in which features are extracted from the binaural speech signal and stacked with the logarithmic mel-spectrogram, yielding an input of shape 6 × 64 × 1000 (C × H × W). A time and frequency mask is then applied to the input spectrum, and the result is convolved with 16 kernels of size 3, producing 16 representations of size 32 × 500 that serve as input to the transformer. The middle row (b) describes the architecture of the transformer, where MV2 denotes a MobileNetV2 block and \(\downarrow\) denotes a reduction in the input size. N is the kernel size and M is the number of linear transformers in the MobileViTV3 block. The bottom row (c) shows the architecture of each MobileViTV3 block, where N and M depend on the position of the block
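The shape transition described in panel (a) (6 × 64 × 1000 in, 16 maps of 32 × 500 out) is consistent with the standard convolution output-size formula. A minimal sketch of that arithmetic follows; the caption gives only the kernel size (3), so the stride and padding values used here are assumptions chosen to reproduce the stated halving of both spatial dimensions:

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length along one axis of a standard 2-D convolution."""
    return (size + 2 * padding - kernel) // stride + 1

# Caption: 16 kernels of size 3 map a 64 x 1000 spectrum to 32 x 500.
# Stride 2 and padding 1 (assumed, not stated in the caption) give exactly that:
h = conv_out(64, kernel=3, stride=2, padding=1)
w = conv_out(1000, kernel=3, stride=2, padding=1)
print(h, w)  # 32 500
```

The channel count (6 → 16) is set by the number of kernels alone and does not enter the spatial-size formula.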