Ensemble of convolutional neural networks to improve animal audio classification

EURASIP Journal on Audio, Speech, and Music Processing

Table 3 Performance of different approaches on each animal sound dataset

Approach	Descriptor	BIRD	BIRDZ	WHALE^†	BAT
Handcrafted features with SVM	Acoustic features	80.2	82.1	85.8	–
	LBP	85.8	87.0	90.6	91.2
	LBP-HF	85.0	86.2	89.9	92.6
	LBP-RI	86.1	87.5	91.0	93.0
	MLPQ	87.5	88.8	92.1	93.5
	HASC	87.9	89.1	92.0	92.9
	LHF	86.0	86.9	90.5	91.9
	GABOR	87.3	87.2	90.3	90.9
	BSIF	88.8	87.5	90.4	92.4
	AHP	84.4	77.5	89.9	92.1
	LETRIST	67.7	75.6	90.3	89.5
	BoF	89.9	60.4	87.2	94.2
Deep learning using the four types of audio images	CNN (Fig. 3)	61.8	84.4	93.5	98.6
	AlexNet	79.8	88.9	95.5	97.8
	GoogleNet	77.8	86.1	94.8	95.9
	Vgg-16	83.6	90.4	96.6	90.1
	Vgg-19	86.3	89.6	96.6	88.6
	ResNet50	81.9	88.9	96.1	93.7
	InceptionV3	82.3	88.5	96.5	85.9
Ensembles of deep learning	Fus_Spec	87.9	91.0	96.6	97.3
	Fus_HP	49.8	88.1	95.2	–
	Fus_Scatter	46.6	91.3	96.7	–
	Fus_Spec + Fus_HP + Fus_Scatter	87.2	93.9	97.1	97.3^∗
	Fus_Spec + Fus_Scatter	87.9	94.8	97.2	97.3^∗
	Fus_Spec + Fus_Scatter + CNN	84.0	95.1	96.1	98.7^∗
Ensembles of DL and handcrafted	Fus_Spec + Fus_Scatter + CNN + Fus_Hand	94.1	99.0	95.9	99.3^∗
	Fus_Spec + Fus_Scatter + Fus_Hand	94.7	98.9	96.5	98.9^∗
Related works	Deep learning, acoustic, and visual features [36]	94.8	–	93.3	–
	Acoustic and visual features [39]	94.5	–	92.2	–
	MFCC + SVM [64]	–	93.6	–	–
	DFT + SVM [62]	–	–	–	92.0

Rates are reported as accuracy, except for the WHALE dataset, where they are reported as AUC-ROC
^∗Fus_Scatter and Fus_HP were not used in this result because they were not available for BAT
^†The metric used for the WHALE dataset is AUC-ROC
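
Because the WHALE column reports AUC-ROC while the other columns report accuracy, values should not be compared directly across those columns. The following minimal sketch illustrates how the two metrics can differ on the same predictions. The labels and scores are hypothetical toy values, not data from the paper, and the function names `accuracy` and `auc_roc` are introduced here for illustration only. The results are scaled to 0 to 100, which assumes the table uses the same scale.

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_roc(y_true, scores):
    """AUC-ROC computed as the probability that a randomly chosen positive
    receives a higher score than a randomly chosen negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: binary labels and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

# Accuracy needs hard decisions, so a threshold (0.5, an assumption) is applied.
y_pred = [int(s >= 0.5) for s in scores]

print(f"accuracy: {100 * accuracy(y_true, y_pred):.1f}")  # 66.7
print(f"AUC-ROC:  {100 * auc_roc(y_true, scores):.1f}")   # 88.9
```

Accuracy depends on the decision threshold, whereas AUC-ROC depends only on how the scores rank positives against negatives, so the same classifier can obtain noticeably different values under the two metrics.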