Automated audio captioning: an overview of recent progress and new challenges

Mei, Xinhao; Liu, Xubo; Plumbley, Mark D.; Wang, Wenwu

doi:10.1186/s13636-022-00259-2

EURASIP Journal on Audio, Speech, and Music Processing

Table 3 Performance of some of the surveyed audio captioning methods on the two main datasets. Scores are taken from the respective papers, and only single-model performance is considered. Compared with Clotho v1, Clotho v2 adds new audio clips to the training set and provides a new validation set, while retaining the same evaluation set. Some methods merge the new validation set into the training set; these methods are still evaluated on the same evaluation set, and their results are reported separately.

Dataset	Method	Year	BLEU\(_{1}\)	BLEU\(_{2}\)	METEOR	CIDEr	SPICE	SPIDEr
AudioCaps	Kim et al. [20]	2019	0.614	0.446	0.203	0.593	0.144	0.369
	Koizumi et al. [68]	2020	0.638	0.458	0.199	0.603	0.139	0.371
	Eren et al. [39]	2020	0.710	0.490	0.290	0.750	-	-
	Xu et al. [44]	2021	0.655	0.476	0.229	0.660	0.168	0.414
	Mei et al. [47]	2021	0.647	0.488	0.222	0.679	0.160	0.420
	Gontier et al. [69]	2021	0.699	0.523	0.241	0.753	0.176	0.465
	Liu et al. [70]	2022	0.671	0.498	0.232	0.667	0.172	0.420
Clotho v1	Drossos et al. [64]	2019	0.420	0.140	0.090	0.100	-	-
	Cakir et al. [57]	2020	0.409	0.156	0.088	0.107	0.040	0.074
	Nguyen et al. [33]	2020	0.417	0.154	0.089	0.093	0.040	0.067
	Perez-Castanos et al. [38]	2020	0.469	0.265	0.136	0.214	0.086	0.150
	Tran et al. [40]	2020	0.489	0.303	0.143	0.268	0.095	0.182
	Takeuchi et al. [42]	2020	0.512	0.325	0.145	0.290	0.089	0.190
	Koizumi et al. [18]	2020	0.521	0.309	0.149	0.258	0.097	0.178
	Chen et al. [34]	2020	0.534	0.343	0.160	0.346	0.108	0.227
	Xu et al. [43]	2020	0.561	0.341	0.162	0.338	0.108	0.223
	Eren et al. [39]	2020	0.590	0.350	0.220	0.280	-	-
	Xu et al. [44]	2021	0.556	0.363	0.169	0.377	0.115	0.246
	Koh et al. [66]	2022	0.551	0.369	0.165	0.380	0.111	0.246
Clotho v2	Narisetty et al. [48]	2021	0.536	0.341	0.160	0.346	0.108	0.227
	Won et al. [77]	2021	0.564	0.376	0.177	0.441	0.128	0.285
	Ye et al. [36]	2021	0.577	-	0.174	0.419	0.119	0.269
	Han et al. [37]	2021	0.585	0.392	0.177	0.474	0.130	0.302
Clotho v2 + val set	Narisetty et al. [48]	2021	0.541	0.346	0.161	0.362	0.110	0.236
	Liu et al. [23]	2021	0.553	0.349	0.168	0.368	0.115	0.242
	Mei et al. [35]	2021	0.561	0.374	0.171	0.426	0.124	0.275
	Chen et al. [73]	2022	0.572	0.379	0.171	0.407	0.119	0.263
	Xiao et al. [59]	2022	0.578	0.387	0.177	0.434	0.122	0.278

Highest scores for each split are shown in bold
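
The SPIDEr column can be cross-checked against the CIDEr and SPICE columns. SPIDEr is conventionally defined as the arithmetic mean of CIDEr and SPICE; the table itself does not state this, so the sketch below treats it as an assumption. The three rows are copied from the AudioCaps block of the table, while the function names and the tolerance are hypothetical, chosen only to absorb the three-decimal rounding of the published scores.

```python
# (CIDEr, SPICE, reported SPIDEr) for three AudioCaps rows of Table 3.
ROWS = {
    "Kim et al. [20]": (0.593, 0.144, 0.369),
    "Xu et al. [44]": (0.660, 0.168, 0.414),
    "Gontier et al. [69]": (0.753, 0.176, 0.465),
}


def spider(cider: float, spice: float) -> float:
    """Assumed definition: SPIDEr is the mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


def check_rows(rows: dict, tol: float = 0.0015) -> dict:
    """Return, per method, whether the reported SPIDEr matches the mean
    of the reported CIDEr and SPICE within a rounding tolerance."""
    return {
        name: abs(spider(cider, spice) - reported) <= tol
        for name, (cider, spice, reported) in rows.items()
    }


results = check_rows(ROWS)
for name, ok in results.items():
    print(f"{name}: {'consistent' if ok else 'inconsistent'}")
```

Because the published values are rounded to three decimals, a recomputed mean can differ from the reported SPIDEr in the last digit; the tolerance is there for that reason and is not part of the metric.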