Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks

EURASIP Journal on Audio, Speech, and Music Processing

Table 3 Results of DEEP SPECTRUM, pre-trained audio models, the CRNN, and their fusion on DCASE 2018 task 5, DCASE 2017 task 1 [31], and ESC-50

Network / system	Pre-training	DCASE18 Devel (F1 [%])	DCASE18 Test (F1 [%])	DCASE17 Devel (Acc [%])	DCASE17 Test (Acc [%])	ESC-50 Devel (Acc [%])	ESC-50 Test (Acc [%])
Proposed DEEP SPECTRUM [42]
DenseNet121	ImageNet	82.8	81.1	78.9	64.4	73.6	75.0
DenseNet121	None	77.7	75.7	71.2	59.0	45.3	44.0
ResNet50	ImageNet	81.9	80.3	76.5	55.9	70.3	72.0
ResNet50	None	70.1	69.9	72.7	61.0	44.8	44.8
VGG16	ImageNet	79.4	77.0	70.1	54.1	63.0	64.8
VGG16	None	73.3	71.6	72.2	57.8	44.6	45.3
VGG19	ImageNet	78.6	77.9	71.8	57.1	62.9	62.5
VGG19	None	74.6	73.2	72.0	61.0	42.9	46.0
Pre-trained audio models
openl3 [28]	AudioSet	73.3	68.4	79.3	67.7	69.8	70.8
PANN [30]	AudioSet	84.6	84.6	69.3	65.7	91.0	89.3
Proposed fusion
Proposed CRNN		81.4	82.2	68.9	59.2	62.3	68.8
ImageNet pre-trained Deep Spectrum		84.4	84.0	77.7	63.5	70.7	73.5
All untrained Deep Spectrum		77.9	78.1	74.5	63.3	44.9	46.8
All Deep Spectrum		84.6	84.3	78.7	67.3	69.8	75.8
CRNN + Deep Spectrum		85.0	85.5	80.6	70.0	73.5	78.8
Deep Spectrum + AudioSet nets		87.0	87.0	82.7	71.2	90.9	92.3
CRNN + AudioSet nets + Deep Spectrum		86.8	86.8	82.5	71.7	89.6	90.8
Baselines and SOTA
Challenge baselines with CNNs [8,31,32]		84.5	85.0	74.8	61.0	72.4*	72.4*
CNN + Data augmentation [2]		90.0	88.4	–	–	–	–
Data augmentation with GANs [44]		–	–	87.1	83.3	–	–
Fine-tuned PANNs [30]		–	–	–	–	94.7*	94.7*

*The ESC-50 baseline is given for 5-fold cross-validation (CV), which differs from the evaluated 4-fold plus test setup. The best results of every type of evaluated system are marked in bold.
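
The difference between the two ESC-50 protocols named in the footnote can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the "4-fold plus test" setup holds out one of the five predefined ESC-50 folds as a fixed test set and cross-validates on the remaining four. The choice of fold 5 as the test fold and the function names `leave_one_fold_out` and `four_fold_plus_test` are hypothetical.

```python
from typing import List, Tuple

Split = Tuple[List[int], List[int]]  # (training folds, held-out folds)

# ESC-50 is distributed with five predefined folds.
FOLDS = [1, 2, 3, 4, 5]


def leave_one_fold_out(folds: List[int]) -> List[Split]:
    """Standard k-fold CV: hold out each fold once, train on the rest."""
    return [([f for f in folds if f != held], [held]) for held in folds]


def four_fold_plus_test(folds: List[int], test_fold: int) -> Tuple[List[Split], Split]:
    """Assumed variant: fix one fold as test, cross-validate on the others.

    Which fold serves as the test set is an assumption made for illustration.
    """
    devel = [f for f in folds if f != test_fold]
    devel_splits = leave_one_fold_out(devel)  # 4 development splits
    test_split = (devel, [test_fold])         # train on all devel folds, test once
    return devel_splits, test_split


if __name__ == "__main__":
    print("5-fold CV:")
    for train, held in leave_one_fold_out(FOLDS):
        print("  train", train, "eval", held)

    devel_splits, test_split = four_fold_plus_test(FOLDS, test_fold=5)
    print("4-fold plus test:")
    for train, held in devel_splits:
        print("  train", train, "devel", held)
    print("  train", test_split[0], "test", test_split[1])
```

Under this assumption the test fold is never seen during development, whereas in plain 5-fold CV every fold serves as evaluation data once. That is why the starred baseline figures are not directly comparable to the Devel and Test columns.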