
Table 3 Performance of some surveyed audio captioning methods on the two main datasets. Scores are taken from the respective papers. Only single-model performance is considered. Compared with Clotho v1, Clotho v2 adds new audio clips to the training set and introduces a new validation set, while retaining the same evaluation set. Some methods merge the new validation set into the training set; these methods are still evaluated on the same evaluation set, and we report their results separately.

From: Automated audio captioning: an overview of recent progress and new challenges

| Dataset | Method | Year | BLEU\(_{1}\) | BLEU\(_{2}\) | METEOR | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|---|---|---|---|
| AudioCaps | Kim et al. [20] | 2019 | 0.614 | 0.446 | 0.203 | 0.593 | 0.144 | 0.369 |
| | Koizumi et al. [68] | 2020 | 0.638 | 0.458 | 0.199 | 0.603 | 0.139 | 0.371 |
| | Eren et al. [39] | 2020 | **0.710** | 0.490 | **0.290** | 0.750 | - | - |
| | Xu et al. [44] | 2021 | 0.655 | 0.476 | 0.229 | 0.660 | 0.168 | 0.414 |
| | Mei et al. [47] | 2021 | 0.647 | 0.488 | 0.222 | 0.679 | 0.160 | 0.420 |
| | Gontier et al. [69] | 2021 | 0.699 | **0.523** | 0.241 | **0.753** | **0.176** | **0.465** |
| | Liu et al. [70] | 2022 | 0.671 | 0.498 | 0.232 | 0.667 | 0.172 | 0.420 |
| Clotho v1 | Drossos et al. [64] | 2019 | 0.420 | 0.140 | 0.090 | 0.100 | - | - |
| | Cakir et al. [57] | 2020 | 0.409 | 0.156 | 0.088 | 0.107 | 0.040 | 0.074 |
| | Nguyen et al. [33] | 2020 | 0.417 | 0.154 | 0.089 | 0.093 | 0.040 | 0.067 |
| | Perez-Castanos et al. [38] | 2020 | 0.469 | 0.265 | 0.136 | 0.214 | 0.086 | 0.150 |
| | Tran et al. [40] | 2020 | 0.489 | 0.303 | 0.143 | 0.268 | 0.095 | 0.182 |
| | Takeuchi et al. [42] | 2020 | 0.512 | 0.325 | 0.145 | 0.290 | 0.089 | 0.190 |
| | Koizumi et al. [18] | 2020 | 0.521 | 0.309 | 0.149 | 0.258 | 0.097 | 0.178 |
| | Chen et al. [34] | 2020 | 0.534 | 0.343 | 0.160 | 0.346 | 0.108 | 0.227 |
| | Xu et al. [43] | 2020 | 0.561 | 0.341 | 0.162 | 0.338 | 0.108 | 0.223 |
| | Eren et al. [39] | 2020 | **0.590** | 0.350 | **0.220** | 0.280 | - | - |
| | Xu et al. [44] | 2021 | 0.556 | 0.363 | 0.169 | 0.377 | **0.115** | **0.246** |
| | Koh et al. [66] | 2022 | 0.551 | **0.369** | 0.165 | **0.380** | 0.111 | **0.246** |
| Clotho v2 | Narisetty et al. [48] | 2021 | 0.536 | 0.341 | 0.160 | 0.346 | 0.108 | 0.227 |
| | Won et al. [77] | 2021 | 0.564 | 0.376 | **0.177** | 0.441 | 0.128 | 0.285 |
| | Ye et al. [36] | 2021 | 0.577 | - | 0.174 | 0.419 | 0.119 | 0.269 |
| | Han et al. [37] | 2021 | **0.585** | **0.392** | **0.177** | **0.474** | **0.130** | **0.302** |
| Clotho v2 + val set | Narisetty et al. [48] | 2021 | 0.541 | 0.346 | 0.161 | 0.362 | 0.110 | 0.236 |
| | Liu et al. [23] | 2021 | 0.553 | 0.349 | 0.168 | 0.368 | 0.115 | 0.242 |
| | Mei et al. [35] | 2021 | 0.561 | 0.374 | 0.171 | 0.426 | **0.124** | 0.275 |
| | Chen et al. [73] | 2022 | 0.572 | 0.379 | 0.171 | 0.407 | 0.119 | 0.263 |
| | Xiao et al. [59] | 2022 | **0.578** | **0.387** | **0.177** | **0.434** | 0.122 | **0.278** |

  1. Highest scores for each split are shown in bold
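SPIDEr is conventionally defined as the arithmetic mean of CIDEr and SPICE, so the last column of Table 3 can be cross-checked against the two columns before it. The following is a minimal sketch of such a check, assuming that definition. The values are transcribed from the table (rows without SPICE or SPIDEr scores are omitted), and the names `ROWS`, `spider` and `inconsistent_rows` are hypothetical helpers for illustration, not part of any evaluation toolkit.

```python
# Values transcribed from Table 3: (split, method, CIDEr, SPICE, reported SPIDEr).
# Rows whose SPICE/SPIDEr entries are "-" are left out.
ROWS = [
    ("AudioCaps", "Kim et al. [20]", 0.593, 0.144, 0.369),
    ("AudioCaps", "Koizumi et al. [68]", 0.603, 0.139, 0.371),
    ("AudioCaps", "Xu et al. [44]", 0.660, 0.168, 0.414),
    ("AudioCaps", "Mei et al. [47]", 0.679, 0.160, 0.420),
    ("AudioCaps", "Gontier et al. [69]", 0.753, 0.176, 0.465),
    ("AudioCaps", "Liu et al. [70]", 0.667, 0.172, 0.420),
    ("Clotho v1", "Cakir et al. [57]", 0.107, 0.040, 0.074),
    ("Clotho v1", "Nguyen et al. [33]", 0.093, 0.040, 0.067),
    ("Clotho v1", "Perez-Castanos et al. [38]", 0.214, 0.086, 0.150),
    ("Clotho v1", "Tran et al. [40]", 0.268, 0.095, 0.182),
    ("Clotho v1", "Takeuchi et al. [42]", 0.290, 0.089, 0.190),
    ("Clotho v1", "Koizumi et al. [18]", 0.258, 0.097, 0.178),
    ("Clotho v1", "Chen et al. [34]", 0.346, 0.108, 0.227),
    ("Clotho v1", "Xu et al. [43]", 0.338, 0.108, 0.223),
    ("Clotho v1", "Xu et al. [44]", 0.377, 0.115, 0.246),
    ("Clotho v1", "Koh et al. [66]", 0.380, 0.111, 0.246),
    ("Clotho v2", "Narisetty et al. [48]", 0.346, 0.108, 0.227),
    ("Clotho v2", "Won et al. [77]", 0.441, 0.128, 0.285),
    ("Clotho v2", "Ye et al. [36]", 0.419, 0.119, 0.269),
    ("Clotho v2", "Han et al. [37]", 0.474, 0.130, 0.302),
    ("Clotho v2 + val set", "Narisetty et al. [48]", 0.362, 0.110, 0.236),
    ("Clotho v2 + val set", "Liu et al. [23]", 0.368, 0.115, 0.242),
    ("Clotho v2 + val set", "Mei et al. [35]", 0.426, 0.124, 0.275),
    ("Clotho v2 + val set", "Chen et al. [73]", 0.407, 0.119, 0.263),
    ("Clotho v2 + val set", "Xiao et al. [59]", 0.434, 0.122, 0.278),
]


def spider(cider: float, spice: float) -> float:
    """SPIDEr as the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2


def inconsistent_rows(rows, tol: float = 1e-3):
    """Return rows whose reported SPIDEr differs from (CIDEr + SPICE) / 2 by more than tol.

    The tolerance absorbs the fact that CIDEr, SPICE and SPIDEr are each
    rounded to three decimals before being reported.
    """
    return [row for row in rows if abs(spider(row[2], row[3]) - row[4]) > tol]


if __name__ == "__main__":
    bad = inconsistent_rows(ROWS)
    print(f"{len(ROWS) - len(bad)}/{len(ROWS)} rows consistent within 0.001")
    for split, method, cider, spice, reported in bad:
        print(f"{split} | {method}: expected ~{spider(cider, spice):.4f}, reported {reported}")
```

A tolerance of 0.001 rather than exact equality is used because averaging two already-rounded scores (for example 0.593 and 0.144 give 0.3685) cannot reproduce the third decimal of the reported SPIDEr exactly.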