| Network | Pre-training | DCASE18 Devel (F1 [%]) | DCASE18 Test (F1 [%]) | DCASE17 Devel (Acc [%]) | DCASE17 Test (Acc [%]) | ESC-50 Devel (Acc [%]) | ESC-50 Test (Acc [%]) |
|---|---|---|---|---|---|---|---|
| Proposed DEEP SPECTRUM [42] | | | | | | | |
| Densenet121 | ImageNet | 82.8 | 81.1 | 78.9 | 64.4 | 73.6 | 75.0 |
| Densenet121 | None | 77.7 | 75.7 | 71.2 | 59.0 | 45.3 | 44.0 |
| ResNet50 | ImageNet | 81.9 | 80.3 | 76.5 | 55.9 | 70.3 | 72.0 |
| ResNet50 | None | 70.1 | 69.9 | 72.7 | 61.0 | 44.8 | 44.8 |
| VGG16 | ImageNet | 79.4 | 77.0 | 70.1 | 54.1 | 63.0 | 64.8 |
| VGG16 | None | 73.3 | 71.6 | 72.2 | 57.8 | 44.6 | 45.3 |
| VGG19 | ImageNet | 78.6 | 77.9 | 71.8 | 57.1 | 62.9 | 62.5 |
| VGG19 | None | 74.6 | 73.2 | 72.0 | 61.0 | 42.9 | 46.0 |
| Pre-trained audio models | | | | | | | |
| openl3 [28] | AudioSet | 73.3 | 68.4 | 79.3 | 67.7 | 69.8 | 70.8 |
| PANN [30] | AudioSet | 84.6 | 84.6 | 69.3 | 65.7 | 91.0 | 89.3 |
| Proposed fusion | | | | | | | |
| Proposed CRNN | | 81.4 | 82.2 | 68.9 | 59.2 | 62.3 | 68.8 |
| ImageNet pre-trained Deep Spectrum | | 84.4 | 84.0 | 77.7 | 63.5 | 70.7 | 73.5 |
| All untrained Deep Spectrum | | 77.9 | 78.1 | 74.5 | 63.3 | 44.9 | 46.8 |
| All Deep Spectrum | | 84.6 | 84.3 | 78.7 | 67.3 | 69.8 | 75.8 |
| CRNN + Deep Spectrum | | 85.0 | 85.5 | 80.6 | 70.0 | 73.5 | 78.8 |
| Deep Spectrum + AudioSet nets | | 87.0 | 87.0 | 82.7 | 71.2 | 90.9 | 92.3 |
| CRNN + AudioSet nets + Deep Spectrum | | 86.8 | 86.8 | 82.5 | 71.7 | 89.6 | 90.8 |
| Baselines and SOTA | | | | | | | |
| | | 84.5 | 85.0 | 74.8 | 61.0 | 72.4* | 72.4* |
| CNN + Data augmentation [2] | | 90.0 | 88.4 | – | – | – | – |
| Data augmentation with GANs [44] | | – | – | 87.1 | 83.3 | – | – |
| Fine-tuned PANNs [30] | | – | – | – | – | 94.7* | 94.7* |