From: Automated audio captioning: an overview of recent progress and new challenges
Reference | Year | Audio encoder | Text decoder | Key aspects |
---|---|---|---|---|
Drossos et al. [11] | 2017 | RNN | RNN | Attention |
Wu et al. [29] | 2019 | RNN | RNN | N/A |
Xu et al. [19] | 2019 | RNN | RNN | Sentence similarity loss |
Ikawa et al. [30] | 2019 | RNN | RNN | “Specificity” term |
Kim et al. [20] | 2019 | CNN(VGGish)+RNN | RNN | Multi-scale features, semantic attention |
Nguyen et al. [33] | 2020 | RNN | RNN | Temporal subsampling |
Cakir et al. [57] | 2020 | RNN | RNN | Multi-task learning (keywords) |
Perez-Castanos et al. [76] | 2020 | CNN | RNN | Attention |
Chen et al. [34] | 2020 | CNN | Transformer | Pre-trained encoder |
Xu et al. [43] | 2020 | CRNN | RNN | Reinforcement learning |
Takeuchi et al. [42] | 2020 | CNN+RNN | RNN | Keywords, sentence length estimation |
Tran et al. [40] | 2020 | CNN | Transformer | 1-D and 2-D CNN |
Eren et al. [39] | 2020 | CNN(PANNs)+RNN | RNN | Keywords |
Koizumi et al. [18] | 2020 | CNN(VGGish)+Transformer | Transformer | Keywords |
Koizumi et al. [68] | 2020 | CNN(VGGish) | GPT-2+Transformer | GPT-2, similar captions retrieval |
Xu et al. [44] | 2021 | CNN/CRNN | RNN | Attention, transfer learning |
Mei et al. [35] | 2021 | CNN(PANNs) | Transformer | Transfer learning, reinforcement learning |
Mei et al. [47] | 2021 | Transformer | Transformer | Full transformer network |
Han et al. [37] | 2021 | CNN(PANNs) | Transformer | Weakly supervised pre-training, keywords |
Ye et al. [36] | 2021 | CNN(PANNs) | RNN | Keywords, attention |
Gontier et al. [69] | 2021 | CNN(VGGish) | BART | YAMNet tags, BART |
Narisetty et al. [48] | 2021 | CNN(PANNs)+Conformer | Transformer+RNN | ASR techniques |
Liu et al. [23] | 2021 | CNN(PANNs) | Transformer | Contrastive learning |
Won et al. [77] | 2021 | CNN(PANNs) | Transformer | Transfer learning |
Berg et al. [22] | 2021 | CNN | Transformer | Continual learning |
Weck et al. [56] | 2021 | CNN(VGGish, YAMNet, OpenL3, COALA) | Transformer | Transfer learning |
Mei et al. [62] | 2021 | CNN(PANNs) | Transformer | GAN, diversity |
Xiao et al. [59] | 2022 | CNN | Transformer | Attention-free Transformer |
Liu et al. [70] | 2022 | CNN(PANNs) | BERT | Transfer learning, BERT |
Chen et al. [73] | 2022 | CNN | Transformer | Transfer learning, contrastive learning |
Koh et al. [66] | 2022 | CNN(PANNs)+Transformer | Transformer | Transfer learning, regularization |
Narisetty et al. [75] | 2022 | Transformer | Transformer | Joint modeling of ASR and AAC |