From: Ensemble of convolutional neural networks to improve animal audio classification
| Approach | Descriptor | BIRD | BIRDZ | WHALE† | BAT |
|---|---|---|---|---|---|
| Handcrafted features with SVM | Acoustic features | 80.2 | 82.1 | 85.8 | – |
| | LBP | 85.8 | 87.0 | 90.6 | 91.2 |
| | LBP-HF | 85.0 | 86.2 | 89.9 | 92.6 |
| | LBP-RI | 86.1 | 87.5 | 91.0 | 93.0 |
| | MLPQ | 87.5 | 88.8 | 92.1 | 93.5 |
| | HASC | 87.9 | 89.1 | 92.0 | 92.9 |
| | LHF | 86.0 | 86.9 | 90.5 | 91.9 |
| | GABOR | 87.3 | 87.2 | 90.3 | 90.9 |
| | BSIF | 88.8 | 87.5 | 90.4 | 92.4 |
| | AHP | 84.4 | 77.5 | 89.9 | 92.1 |
| | LETRIST | 67.7 | 75.6 | 90.3 | 89.5 |
| | BoF | 89.9 | 60.4 | 87.2 | 94.2 |
| Deep learning using the four types of audio images | CNN (Fig. 3) | 61.8 | 84.4 | 93.5 | 98.6 |
| | AlexNet | 79.8 | 88.9 | 95.5 | 97.8 |
| | GoogleNet | 77.8 | 86.1 | 94.8 | 95.9 |
| | Vgg-16 | 83.6 | 90.4 | 96.6 | 90.1 |
| | Vgg-19 | 86.3 | 89.6 | 96.6 | 88.6 |
| | ResNet50 | 81.9 | 88.9 | 96.1 | 93.7 |
| | InceptionV3 | 82.3 | 88.5 | 96.5 | 85.9 |
| Ensembles of deep learning | Fus_Spec | 87.9 | 91.0 | 96.6 | 97.3 |
| | Fus_HP | 49.8 | 88.1 | 95.2 | – |
| | Fus_Scatter | 46.6 | 91.3 | 96.7 | – |
| | Fus_Spec + Fus_HP + Fus_Scatter | 87.2 | 93.9 | 97.1 | 97.3∗ |
| | Fus_Spec + Fus_Scatter | 87.9 | 94.8 | 97.2 | 97.3∗ |
| | Fus_Spec + Fus_Scatter + CNN | 84.0 | 95.1 | 96.1 | 98.7∗ |
| Ensembles of DL and handcrafted | Fus_Spec + Fus_Scatter + CNN + Fus_Hand | 94.1 | 99.0 | 95.9 | 99.3∗ |
| | Fus_Spec + Fus_Scatter + Fus_Hand | 94.7 | 98.9 | 96.5 | 98.9∗ |
| Related works | Deep learning, acoustic, and visual features [36] | 94.8 | – | 93.3 | – |
| | Acoustic and visual features [39] | 94.5 | – | 92.2 | – |
| | MFCC + SVM [64] | – | 93.6 | – | – |
| | DFT + SVM [62] | – | – | – | 92.0 |