
Table 3 Results of DEEP SPECTRUM, pre-trained audio models, the CRNN, and their fusion on DCASE 2018 task 5, DCASE 2017 task 1 [31], and ESC-50

From: Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks

  

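| System / Network | Pre-training | DCASE18 Devel (F1 [%]) | DCASE18 Test (F1 [%]) | DCASE17 Devel (Acc [%]) | DCASE17 Test (Acc [%]) | ESC-50 Devel (Acc [%]) | ESC-50 Test (Acc [%]) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Proposed DEEP SPECTRUM [42] | | | | | | | |
| Densenet121 | ImageNet | 82.8 | 81.1 | 78.9 | 64.4 | 73.6 | 75.0 |
| Densenet121 | None | 77.7 | 75.7 | 71.2 | 59.0 | 45.3 | 44.0 |
| ResNet50 | ImageNet | 81.9 | 80.3 | 76.5 | 55.9 | 70.3 | 72.0 |
| ResNet50 | None | 70.1 | 69.9 | 72.7 | 61.0 | 44.8 | 44.8 |
| VGG16 | ImageNet | 79.4 | 77.0 | 70.1 | 54.1 | 63.0 | 64.8 |
| VGG16 | None | 73.3 | 71.6 | 72.2 | 57.8 | 44.6 | 45.3 |
| VGG19 | ImageNet | 78.6 | 77.9 | 71.8 | 57.1 | 62.9 | 62.5 |
| VGG19 | None | 74.6 | 73.2 | 72.0 | 61.0 | 42.9 | 46.0 |
| Pre-trained audio models | | | | | | | |
| openl3 [28] | AudioSet | 73.3 | 68.4 | 79.3 | 67.7 | 69.8 | 70.8 |
| PANN [30] | AudioSet | 84.6 | 84.6 | 69.3 | 65.7 | 91.0 | 89.3 |
| Proposed fusion | | | | | | | |
| Proposed CRNN | - | 81.4 | 82.2 | 68.9 | 59.2 | 62.3 | 68.8 |
| ImageNet pre-trained Deep Spectrum | - | 84.4 | 84.0 | 77.7 | 63.5 | 70.7 | 73.5 |
| All untrained Deep Spectrum | - | 77.9 | 78.1 | 74.5 | 63.3 | 44.9 | 46.8 |
| All Deep Spectrum | - | 84.6 | 84.3 | 78.7 | 67.3 | 69.8 | 75.8 |
| CRNN + Deep Spectrum | - | 85.0 | 85.5 | 80.6 | 70.0 | 73.5 | 78.8 |
| Deep Spectrum + AudioSet nets | - | 87.0 | 87.0 | 82.7 | 71.2 | 90.9 | 92.3 |
| CRNN + AudioSet nets + Deep Spectrum | - | 86.8 | 86.8 | 82.5 | 71.7 | 89.6 | 90.8 |
| Baselines and SOTA | | | | | | | |
| Challenge baselines with CNNs [8,31,32] | - | 84.5 | 85.0 | 74.8 | 61.0 | 72.4* | 72.4* |
| CNN + Data augmentation [2] | - | 90.0 | 88.4 | - | - | - | - |
| Data augmentation with GANs [44] | - | - | - | 87.1 | 83.3 | - | - |
| Fine-tuned PANNs [30] | - | - | - | - | - | 94.7* | 94.7* |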

*The ESC-50 baseline is given for 5-fold cross-validation, which differs from the evaluated 4-fold plus test setup. The best results of every type of evaluated system are marked in bold in the original table.