
Table 3 Performance of different approaches on each animal sound dataset

From: Ensemble of convolutional neural networks to improve animal audio classification

| Approach | Descriptor | BIRD | BIRDZ | WHALE† | BAT |
|---|---|---|---|---|---|
| Handcrafted features with SVM | Acoustic features | 80.2 | 82.1 | 85.8 | – |
| | LBP | 85.8 | 87.0 | 90.6 | 91.2 |
| | LBP-HF | 85.0 | 86.2 | 89.9 | 92.6 |
| | LBP-RI | 86.1 | 87.5 | 91.0 | 93.0 |
| | MLPQ | 87.5 | 88.8 | 92.1 | 93.5 |
| | HASC | 87.9 | 89.1 | 92.0 | 92.9 |
| | LHF | 86.0 | 86.9 | 90.5 | 91.9 |
| | GABOR | 87.3 | 87.2 | 90.3 | 90.9 |
| | BSIF | 88.8 | 87.5 | 90.4 | 92.4 |
| | AHP | 84.4 | 77.5 | 89.9 | 92.1 |
| | LETRIST | 67.7 | 75.6 | 90.3 | 89.5 |
| | BoF | 89.9 | 60.4 | 87.2 | 94.2 |
| Deep learning using the four types of audio images | CNN (Fig. 3) | 61.8 | 84.4 | 93.5 | 98.6 |
| | AlexNet | 79.8 | 88.9 | 95.5 | 97.8 |
| | GoogleNet | 77.8 | 86.1 | 94.8 | 95.9 |
| | Vgg-16 | 83.6 | 90.4 | 96.6 | 90.1 |
| | Vgg-19 | 86.3 | 89.6 | 96.6 | 88.6 |
| | ResNet50 | 81.9 | 88.9 | 96.1 | 93.7 |
| | InceptionV3 | 82.3 | 88.5 | 96.5 | 85.9 |
| Ensembles of deep learning | Fus_Spec | 87.9 | 91.0 | 96.6 | 97.3 |
| | Fus_HP | 49.8 | 88.1 | 95.2 | – |
| | Fus_Scatter | 46.6 | 91.3 | 96.7 | – |
| | Fus_Spec + Fus_HP + Fus_Scatter | 87.2 | 93.9 | 97.1 | 97.3∗ |
| | Fus_Spec + Fus_Scatter | 87.9 | 94.8 | 97.2 | 97.3∗ |
| | Fus_Spec + Fus_Scatter + CNN | 84.0 | 95.1 | 96.1 | 98.7∗ |
| Ensembles of DL and handcrafted | Fus_Spec + Fus_Scatter + CNN + Fus_Hand | 94.1 | 99.0 | 95.9 | 99.3∗ |
| | Fus_Spec + Fus_Scatter + Fus_Hand | 94.7 | 98.9 | 96.5 | 98.9∗ |
| Related works | Deep learning, acoustic, and visual features [36] | 94.8 | – | 93.3 | – |
| | Acoustic and visual features [39] | 94.5 | – | 92.2 | – |
| | MFCC + SVM [64] | – | 93.6 | – | – |
| | DFT + SVM [62] | – | – | – | 92.0 |

  1. Rates are reported as accuracy, except for the WHALE dataset, where they are AUC-ROC.
  2. ∗Fus_Scatter and Fus_HP were not used in this result because they were not available for BAT.
  3. †The metric used for the WHALE dataset is AUC-ROC.
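
As a reading aid, the following minimal sketch shows one way to query the table for the best-scoring approach on each dataset. It is an illustration only, not part of the article: the names `DATASETS`, `ROWS`, and `best_per_dataset` are hypothetical, only a subset of the rows is copied in, and "–" entries are represented as `None`. Scores are compared within a column only, since the WHALE column is AUC-ROC while the others are accuracy.

```python
# Subset of Table 3 rows, values copied from the table above.
# None marks a "–" entry (result not available).
DATASETS = ("BIRD", "BIRDZ", "WHALE", "BAT")

ROWS = {
    "BoF": (89.9, 60.4, 87.2, 94.2),
    "CNN (Fig. 3)": (61.8, 84.4, 93.5, 98.6),
    "Vgg-16": (83.6, 90.4, 96.6, 90.1),
    "Fus_Spec": (87.9, 91.0, 96.6, 97.3),
    "Fus_Scatter": (46.6, 91.3, 96.7, None),
    "Fus_Spec + Fus_Scatter": (87.9, 94.8, 97.2, 97.3),
    "Fus_Spec + Fus_Scatter + CNN + Fus_Hand": (94.1, 99.0, 95.9, 99.3),
    "Fus_Spec + Fus_Scatter + Fus_Hand": (94.7, 98.9, 96.5, 98.9),
}


def best_per_dataset(rows):
    """Return {dataset: (approach, score)} for the top score in each column.

    Columns are never compared with each other, because WHALE is reported
    in AUC-ROC and the remaining datasets in accuracy.
    """
    best = {}
    for col, dataset in enumerate(DATASETS):
        # Skip approaches with no reported result for this dataset.
        scored = [
            (values[col], approach)
            for approach, values in rows.items()
            if values[col] is not None
        ]
        score, approach = max(scored)
        best[dataset] = (approach, score)
    return best


if __name__ == "__main__":
    for dataset, (approach, score) in best_per_dataset(ROWS).items():
        print(f"{dataset}: {approach} ({score})")
```

Restricted to these rows, the ensembles that include handcrafted features lead on BIRD, BIRDZ, and BAT, while Fus_Spec + Fus_Scatter leads on WHALE. The related-work rows are left out of the sketch, so it does not reflect the 94.8 reported for BIRD in [36].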