| Network | Pre-training | | | | | | |
|---|---|---|---|---|---|---|---|
| Proposed DEEP SPECTRUM [42] | | | | | | | |
| Densenet121 | ImageNet | 82.8 | 81.1 | 78.9 | 64.4 | 73.6 | 75.0 |
| Densenet121 | None | 77.7 | 75.7 | 71.2 | 59.0 | 45.3 | 44.0 |
| ResNet50 | ImageNet | 81.9 | 80.3 | 76.5 | 55.9 | 70.3 | 72.0 |
| ResNet50 | None | 70.1 | 69.9 | 72.7 | 61.0 | 44.8 | 44.8 |
| VGG16 | ImageNet | 79.4 | 77.0 | 70.1 | 54.1 | 63.0 | 64.8 |
| VGG16 | None | 73.3 | 71.6 | 72.2 | 57.8 | 44.6 | 45.3 |
| VGG19 | ImageNet | 78.6 | 77.9 | 71.8 | 57.1 | 62.9 | 62.5 |
| VGG19 | None | 74.6 | 73.2 | 72.0 | 61.0 | 42.9 | 46.0 |
| Pre-trained audio models | | | | | | | |
| openl3 [28] | AudioSet | 73.3 | 68.4 | 79.3 | 67.7 | 69.8 | 70.8 |
| PANN [30] | AudioSet | 84.6 | 84.6 | 69.3 | 65.7 | 91.0 | 89.3 |
| Proposed fusion | | | | | | | |
| Proposed CRNN | | 81.4 | 82.2 | 68.9 | 59.2 | 62.3 | 68.8 |
| ImageNet pre-trained Deep Spectrum | | 84.4 | 84.0 | 77.7 | 63.5 | 70.7 | 73.5 |
| All untrained Deep Spectrum | | 77.9 | 78.1 | 74.5 | 63.3 | 44.9 | 46.8 |
| All Deep Spectrum | | 84.6 | 84.3 | 78.7 | 67.3 | 69.8 | 75.8 |
| CRNN + Deep Spectrum | | 85.0 | 85.5 | 80.6 | 70.0 | 73.5 | 78.8 |
| Deep Spectrum + AudioSet nets | | 87.0 | 87.0 | 82.7 | 71.2 | 90.9 | 92.3 |
| CRNN + AudioSet nets + Deep Spectrum | | 86.8 | 86.8 | 82.5 | 71.7 | 89.6 | 90.8 |
| Baselines and SOTA | | | | | | | |
| Challenge baselines with CNNs [8,31,32] | | 84.5 | 85.0 | 74.8 | 61.0 | 72.4* | 72.4* |
| CNN + Data augmentation [2] | | 90.0 | 88.4 | – | – | – | – |
| Data augmentation with GANs [44] | | – | – | 87.1 | 83.3 | – | – |
| Fine-tuned PANNs [30] | | – | – | – | – | 94.7* | 94.7* |