Table 3 Results of DEEP SPECTRUM, pre-trained audio models, the CRNN, and their fusion on DCASE 2018 task 5, DCASE 2017 task 1 [31], and ESC-50

From: Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks

| System | Pre-training | DCASE18 F1 [%] (Devel) | DCASE18 F1 [%] (Test) | DCASE17 Acc [%] (Devel) | DCASE17 Acc [%] (Test) | ESC-50 Acc [%] (Devel) | ESC-50 Acc [%] (Test) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Proposed DEEP SPECTRUM [42] | | | | | | | |
| Densenet121 | ImageNet | 82.8 | 81.1 | 78.9 | 64.4 | 73.6 | 75.0 |
| Densenet121 | None | 77.7 | 75.7 | 71.2 | 59.0 | 45.3 | 44.0 |
| ResNet50 | ImageNet | 81.9 | 80.3 | 76.5 | 55.9 | 70.3 | 72.0 |
| ResNet50 | None | 70.1 | 69.9 | 72.7 | 61.0 | 44.8 | 44.8 |
| VGG16 | ImageNet | 79.4 | 77.0 | 70.1 | 54.1 | 63.0 | 64.8 |
| VGG16 | None | 73.3 | 71.6 | 72.2 | 57.8 | 44.6 | 45.3 |
| VGG19 | ImageNet | 78.6 | 77.9 | 71.8 | 57.1 | 62.9 | 62.5 |
| VGG19 | None | 74.6 | 73.2 | 72.0 | 61.0 | 42.9 | 46.0 |
| Pre-trained audio models | | | | | | | |
| openl3 [28] | AudioSet | 73.3 | 68.4 | 79.3 | 67.7 | 69.8 | 70.8 |
| PANN [30] | AudioSet | 84.6 | 84.6 | 69.3 | 65.7 | 91.0 | 89.3 |
| Proposed fusion | | | | | | | |
| Proposed CRNN | | 81.4 | 82.2 | 68.9 | 59.2 | 62.3 | 68.8 |
| ImageNet pre-trained Deep Spectrum | | 84.4 | 84.0 | 77.7 | 63.5 | 70.7 | 73.5 |
| All untrained Deep Spectrum | | 77.9 | 78.1 | 74.5 | 63.3 | 44.9 | 46.8 |
| All Deep Spectrum | | 84.6 | 84.3 | 78.7 | 67.3 | 69.8 | 75.8 |
| CRNN + Deep Spectrum | | 85.0 | 85.5 | 80.6 | 70.0 | 73.5 | 78.8 |
| Deep Spectrum + AudioSet nets | | 87.0 | 87.0 | 82.7 | 71.2 | 90.9 | 92.3 |
| CRNN + AudioSet nets + Deep Spectrum | | 86.8 | 86.8 | 82.5 | 71.7 | 89.6 | 90.8 |
| Baselines and SOTA | | | | | | | |
| Challenge baselines with CNNs [8,31,32] | | 84.5 | 85.0 | 74.8 | 61.0 | 72.4* | 72.4* |
| CNN + Data augmentation [2] | | 90.0 | 88.4 | | | | |
| Data augmentation with GANs [44] | | | | 87.1 | 83.3 | | |
| Fine-tuned PANNs [30] | | | | | | 94.7* | 94.7* |
*ESC-50 baseline given for 5-fold CV, which differs from the evaluated 4-fold plus test setup. The best results of every type of evaluated system are marked in bold.