Training audio transformers for cover song identification

In the past decades, convolutional neural networks (CNNs) have been commonly adopted in audio perception tasks, which aim to learn latent representations. However, for audio analysis, CNNs may exhibit limitations in effectively modeling temporal contextual information. Motivated by the success of the transformer architecture in computer vision and audio classification, and to better capture long-range global context, we extend this line of work and propose the Audio Similarity Transformer (ASimT), a convolution-free, purely transformer-based architecture for learning effective representations of audio signals. Furthermore, we introduce a novel loss, MAPLoss, which is used in tandem with a classification loss to directly optimize the mean average precision. In our experiments, ASimT demonstrates state-of-the-art performance in cover song identification on public datasets.


Introduction
Cover song identification (CSI), the task of identifying alternative versions of a given song in a music collection, is an important task in the field of music information retrieval (MIR). Various downstream applications can benefit from CSI, such as music rights management, music retrieval, and song recommendation. Despite numerous research efforts, CSI remains a challenging task because versions of the same song can differ in timbre, rhythm, key signature, song structure, and lyrics [1,2]. While humans can easily discern these variations among versions, it is difficult for machines to perform similarity matching between distinct renditions, which makes CSI an intriguing and demanding task within the MIR domain.
In addressing the issue of version identification, a diverse range of approaches has been proposed, which can be broadly classified into two main categories. The first category follows a more traditional methodology, while the second category employs data-driven methods. Specifically, the first category implements a three-stage process: feature extraction, optional post-processing, and similarity estimation. The initial stage focuses on the extraction of relevant features from high-dimensional audio signals. Considering the existence of key, tempo, and structural variations among different song versions, some studies incorporate a second step, which adopts various post-processing techniques to achieve transposition, tempo, timing, and structure invariance in the version identification problem. In the final stage, an array of segmentation schemes and local alignment algorithms is leveraged to measure the similarity between the sequences processed during the preceding stages.
For instance, Bello introduced a CSI system [3] that characterized the harmonic content of audio signals using the Chroma [4] feature, which represents the intensity of twelve pitch classes. Subsequently, the system employed the Needleman-Wunsch-Sellers (NWS) algorithm [5] to estimate the similarity between approximated chord sequences in order to identify possible cover songs. In [1], the authors used an enhanced chroma feature called harmonic pitch class profiles (HPCP) [6] to describe music audio. In addition, they introduced a second stage to attain transposition invariance by transposing the tonality of the target HPCP feature sequence to that of the other songs. In their proposed system, dynamic time warping (DTW) [7] was utilized to measure the similarity between extracted feature sequences.
Following the studies that focused on individual features and alignment methods, Foucard et al. [8] were the first to demonstrate that combining the melody and its accompaniment as distinct modalities, using various fusion schemes, could enhance performance in cover song identification. Their work laid the groundwork for a line of approaches that investigated the combination of different features and/or alignment methods to further improve accuracy. Tralie [9] explored the combination of complementary features, namely HPCP, Mel-frequency cepstral coefficients (MFCC), and self-similarity MFCC, using a similarity network fusion (SNF) technique prior to implementing alignment for version matching. In a similar vein, Chen et al. [10] utilized SNF to fuse two types of similarities, Qmax and Dmax. The fused similarity was subsequently fed into a classifier to identify cover pairs.
With the advent of deep learning techniques, over the last few decades, cover song identification has gradually transitioned from methods that relied heavily on sequence alignment to end-to-end models that learn representations and accomplish the task more efficiently. Convolutional neural networks (CNNs) have become particularly popular in the second category. For instance, in [11][12][13][14][15], CNNs play a critical role in detecting cover songs. While CNNs are widely used to learn audio representations by exploiting spatial locality, we believe that incorporating long-range global context could help improve the performance of the CSI task, as the original song may be restructured (e.g., a main verse might be placed after the chorus in the cover version). However, few attempts have been made to capture long-range dependencies among audio frames in CSI tasks [16]. Ye et al. implemented an LSTM-based Siamese network for the CSI problem [17], which revealed the potential of investigating long-term contexts in CSI tasks. Despite this, the exploration of long-term dependencies in this field remains relatively uncharted.
The transformer architecture proposed in [18] has successfully demonstrated its ability to model sequential data with long-range dependencies in numerous NLP tasks (e.g., text generation and classification) and, more recently, in the computer vision field (e.g., image retrieval and classification). Additionally, an exciting extension of transformer-based models [19,20] to audio classification suggests that transformer-based approaches may find alternative solutions and avoid typical errors caused by convolutional backbones.
Inspired by the success of adopting the transformer architecture for modeling long-term dependencies in audio classification tasks, we propose a transformer-based method to explore whether long-range global contexts can also enhance cover song detection. While there have been some efforts to explore audio comprehension with a transformer architecture [19][20][21][22], to our knowledge, the use of a plain transformer directly in cover song identification has not been studied before. To address this void, we propose the Audio Similarity Transformer (ASimT), which employs a Siamese architecture with a transformer backbone that maps each audio signal to a single embedding vector. Current deep learning-based approaches predominantly employ classification loss, triplet loss, or their variants or combinations during the training stage. However, these losses do not guarantee the optimization of mean average precision (MAP) [23], a critical evaluation metric in version identification tasks. Therefore, in this paper, we explore a rank loss named MAPLoss that directly optimizes MAP for enhanced version identification performance. Given that version identification can also be regarded as a retrieval problem (i.e., retrieving all versions of a query song), our MAPLoss is adapted from the SmoothAP loss, which has been successful in image retrieval tasks [23][24][25]. To boost the learning efficiency and supply additional supervised information, we combine MAPLoss and cross-entropy loss for training our Siamese architecture. Experimental results demonstrate the competitive performance of our proposed method.

Audio feature
Audio feature extraction is necessary for both traditional and deep learning-based approaches, as it is a crucial element in the former and is used for further learning in the latter. The constant-Q transform (CQT) [26], a low-level descriptor, has been used in numerous CSI studies [13][14][15] since it was first introduced. Notably, cover versions tend to maintain similar melodic and harmonic content while they may exhibit variations in style, instrumentation, and arrangement [15]. Consequently, researchers have been motivated to adopt music descriptors representing melodic and harmonic information to tackle the CSI problem. The dominant melody has been studied in [12,27] to describe melodic content for the CSI problem. Chroma [4], which captures the intensity of twelve pitch classes, has been widely used as an essential audio feature in classical approaches [28][29][30]. The pitch class profile (PCP) [31] has emerged as a predominant representation for analyzing harmonic content in audio signals. Subsequently, HPCP was developed to enhance the robustness of tonal content summarization and has been extensively applied to the CSI problem [32,33]. In particular, feature combinations with HPCP have been investigated in the CSI problem [9,34]. Salamon et al. utilized HPCP to summarize harmonic content, subsequently integrating melody and bass content to enhance the performance of their CSI system. HPCP was employed in [9] alongside MFCC and self-MFCC [35] in a fusive manner to improve CSI performance.
As a novel PCP variant, convolutional and recurrent estimators for music analysis (CREMA) [36] estimates the pitch-class information required for chord sequence prediction and has contributed to superior performance in cover song analysis, as reported in [37]. This finding is plausible, given that cover versions often exhibit similar chord progressions. The advancements achieved by CREMA have spurred a series of subsequent studies [38,39], which further corroborate the validity of the CREMA feature in CSI problems. Consequently, we employ CREMA as the input feature of our proposed method. Each feature can be represented as $x \in \mathbb{R}^{12 \times W}$, where $W$ denotes the number of frames. To bolster the performance of our system, we apply data augmentation to the original CREMA feature, yielding a processed CREMA feature with dimension $23 \times W$, as detailed in Section 3.4.

Metric learning
Deep learning-based CSI systems can be further classified into two categories. The first category approaches the version identification problem as a multi-class classification task, with each version group being treated as a unique class [15,40,41]. However, due to the substantial number of version groups (i.e., the classes) and the limited number of versions within each group (i.e., the samples), version identification does not entirely fit within the framework of a classification problem. This observation gives rise to the second category, which leverages metric learning techniques to enhance intra-class similarity and inter-class discrimination [13,14,38,39]. Loss functions employed in these metric learning-based methods, for instance contrastive loss [42] and triplet loss [43], typically consider triplets [13,14,38] to achieve the desired results in the context of CSI. The training procedure for these methods involves repeatedly sampling random and different triplets of song versions and backpropagating the loss gradients. Nonetheless, as Burges et al. [44] highlight, the limited rank-positional awareness provided by the triplet loss may lead to inefficient use of a model's capacity, causing the model to focus on improving the rank order of positive instances at lower ranks, which is often to the detriment of those at high ranks. Consequently, there is no theoretical assurance that minimizing the triplet loss coincides with minimizing the actual ranking loss.
In this paper, we embark on a distinct path by directly optimizing the mean average precision (MAP) metric. While the average precision (AP) is a non-decomposable and non-differentiable function, recent advancements by He et al. have demonstrated that it can be approximated [45]. This method has yielded successful outcomes in image matching and retrieval tasks [23,24]. Given that version identification can be interpreted as a version retrieval problem, and considering that no previous attempts have been made to leverage a loss function to directly improve the MAP value, we introduce into our model an adaptation of the SmoothAP loss (which we term MAPLoss), whose efficacy in image retrieval problems has been demonstrated before.

Transformer architecture
This section describes the transformer architecture in a fashion similar to [18]. We have adopted this architecture in our work.
Since ASimT is designed for similarity metric learning, we utilize only the encoder component of the transformer architecture. The transformer backbone, acting as the encoder, takes as input a sequence of pre-processed CREMA features (the detailed processing procedure is explained in Section 3.4) and produces the corresponding learned latent representation. Given that the standard transformer processes 1D sequences of token embeddings, it is necessary to reshape the processed CREMA features into a sequence of flattened 2D patches. Following the method employed in the vision transformer (ViT) [46], we reshape the processed CREMA sequence $x \in \mathbb{R}^{H \times W}$ into flattened 2D patches $x_p \in \mathbb{R}^{N \times P^2}$, where $(H, W)$ represents the resolution of our processed CREMA feature. In contrast to ViT, whose image input comprises 3 channels, our audio feature is a single-channel spectrogram. $(P, P)$ denotes the resolution of each processed CREMA feature patch, with an overlap of $L$ in both the time and frequency dimensions. Consequently, the number of patches, which is the input sequence length for the standard transformer encoder, is $N = 2L\lfloor (W - L)/(P - L) \rfloor$. In our case, $H = 23$ is the frequency dimension and $W$ is the time dimension. Because we use the SHS 5+ dataset [12] (details of which will be given later), where the CREMA representation spans the first 3 min of the audio of each track, the time dimension has the value of 1937. Following the settings in ViT, we set the patch resolution as $(P, P) = (16, 16)$. Similar to the Audio Spectrogram Transformer (AST), we use an overlap of $L = 6$.
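The overlapping patch extraction described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: it assumes a sliding window with stride $P - L$ and no border padding, the function name `extract_patches` is hypothetical, and the input length of 1800 frames is chosen only for the example. The resulting patch count follows from these assumptions and is not meant to reproduce the expression for $N$ given above.

```python
import numpy as np

def extract_patches(x, patch=16, overlap=6):
    """Split a 2D feature map (H x W) into flattened, overlapping P x P patches.

    Assumption of this sketch: a plain sliding window with stride
    (patch - overlap) and no padding at the borders.
    """
    stride = patch - overlap
    h, w = x.shape
    rows = range(0, h - patch + 1, stride)
    cols = range(0, w - patch + 1, stride)
    # Each patch is flattened to a vector of length patch * patch.
    patches = [x[r:r + patch, c:c + patch].reshape(-1) for r in rows for c in cols]
    return np.stack(patches)

# Toy input with the processed CREMA height (23) and an illustrative length.
x = np.random.default_rng(0).random((23, 1800)).astype(np.float32)
patches = extract_patches(x)
print(patches.shape)  # (number of patches, 256)
```

With a height of 23 and a patch size of 16, only one window fits along the frequency axis under these assumptions, so the sequence length is governed almost entirely by the time dimension.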
Analogous to the bidirectional encoder representations from transformers (BERT), we introduce a learnable [CLS] token at the beginning of the input sequence. More specifically, the input sequence for our framework can be expressed as $x_r = (x_{class}, x_p^1, x_p^2, \ldots, x_p^N)$. This sequence is then mapped into a $D$-dimensional embedding via a trainable linear projection. To preserve positional information, we utilize the position embeddings employed in the standard transformer. These embeddings are added to the patch embeddings, yielding the input sequence for the transformer backbone, as illustrated in Eq. 1:

$$z_0 = [x_{class};\, x_p^1 E;\, x_p^2 E;\, \ldots;\, x_p^N E] + E_{pos} \qquad (1)$$

where $E$ denotes the trainable linear projection and $E_{pos}$ denotes the position embeddings.

As depicted in Fig. 1, the transformer module consists of a multi-headed self-attention block and a feed-forward block. Layer normalization is applied prior to each block, while residual connections are implemented following each block [47]. The multi-headed self-attention block calculates a probabilistic score that indicates the importance of each embedding. Each multi-headed attention layer projects the input sequence to query $Q$, key $K$, and value $V$ through three learnable matrices $W_Q, W_K, W_V \in \mathbb{R}^{D \times D_k}$, where $D_k$ represents the dimension of each attention head. We employ scaled dot-product attention as the attention mechanism. More specifically, for the layer representation at the $l$-th transformer layer, $z_l = [h_l^1, h_l^2, \ldots, h_l^n]$ is utilized to compute the $l$-th layer self-attention head $A_l$:

$$A_l = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{D_k}}\right) V, \quad Q = z_l W_Q,\; K = z_l W_K,\; V = z_l W_V \qquad (2)$$

Fig. 1 The overall framework of our proposed ASimT for CSI

The feed-forward block comprises two linear layers. The first linear layer is followed by a GELU activation and a dropout layer, while after the second layer, only a dropout layer is applied. To optimize the classification task and enhance the learning of efficient representations for MAPLoss, we incorporate two linear layers after the transformer backbone. Each of these layers is followed by a Sigmoid activation function.
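As an illustration of the scaled dot-product attention used inside each head, the following NumPy sketch computes a single attention head for a short token sequence. It is a simplified example under stated assumptions: the names `attention_head`, `w_q`, `w_k`, and `w_v` are hypothetical, the toy dimensions are far smaller than those of the actual model, and layer normalization, residual connections, and the multi-head concatenation are omitted.

```python
import numpy as np

def softmax(a, axis=-1):
    # Subtract the row maximum for numerical stability.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(z, w_q, w_k, w_v):
    """Single scaled dot-product attention head over a token sequence z (n x D)."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scale by sqrt of the head dimension
    weights = softmax(scores, axis=-1)       # each row sums to one
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 8, 4          # toy sizes, not the paper's settings
z = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = attention_head(z, w_q, w_k, w_v)
```

Every output token is a weighted average over all value vectors, which is what allows a patch at the start of a track to attend directly to a patch near the end.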

ImageNet pretraining
Although the transformer is capable of modeling long-range contexts compared to CNN models, it requires more data during the training stage, which can be quite resource-intensive. Therefore, akin to AST [19], we adopt an off-the-shelf ImageNet-pretrained ViT in our proposed ASimT with a few modifications. First, the input of ViT has three channels, whereas the input of our ASimT is a single-channel spectrogram. Thus, we average the weights along the three channels of the ViT patch embedding layer and use them as the weights of the ASimT patch embedding layer. Second, we adopt the cut and bilinear interpolation method proposed in [19] for positional embedding adaptation.
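The channel-averaging step can be sketched in a few lines. The weight layout `(D, 3, P, P)` is an assumption borrowed from common convolution-style patch embedding implementations, and `adapt_patch_embedding` is a hypothetical helper name; random values stand in for the pretrained weights.

```python
import numpy as np

def adapt_patch_embedding(rgb_weight):
    """Collapse a 3-channel patch embedding kernel to a single channel.

    Assumed layout: (embedding_dim, channels, patch_h, patch_w).
    """
    return rgb_weight.mean(axis=1, keepdims=True)

# Random stand-in for pretrained weights (D = 768, P = 16).
rgb_weight = np.random.default_rng(0).normal(size=(768, 3, 16, 16))
mono_weight = adapt_patch_embedding(rgb_weight)
print(mono_weight.shape)
```

Averaging keeps the scale of the patch embedding output comparable to what the pretrained layers would see if a grayscale image were replicated across the three channels and divided by three.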

Loss functions
As discussed in Section 2.2, classification accuracy alone cannot guarantee good mean average precision for the version identification problem. Based on the review of metric learning techniques in Section 2.2, we utilize both the classification loss and a novel MAPLoss to optimize the version identification/retrieval task. MAPLoss can directly improve the MAP value during the training stage, resulting in a more effective latent representation produced by our Siamese network (Fig. 2).
Fig. 2 Motivation for using MAPLoss. In (a), we can see that while a pure classification loss can facilitate successful identification of each class, the minimal inter-class distance may result in erroneous version retrievals. The integration of MAPLoss, as depicted in (b), aims to enhance intra-class compactness and inter-class separation. As a result, this approach can achieve better retrieval performance, i.e., MAP in this work.

A. Cross-entropy loss
To provide more supervised information for the training signals, we also consider a cross-entropy loss during the training phase. The cross-entropy loss is computed as follows:

$$L_{CE} = -\sum_{y} p(y, x) \log P(y \mid z)$$

where $p(y, x)$ is the ground-truth one-hot distribution of sample $x$ and $P(y \mid z)$ is the distribution predicted by our ASimT encoder and linear classifier.
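For concreteness, a numerically stable cross-entropy over a batch of classifier outputs can be sketched as below. The function name `cross_entropy`, the use of raw logits as input, and the averaging over the batch are assumptions of this example.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    # Log-softmax computed with the max-shift trick to avoid overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # With one-hot targets, only the log-probability of the true class survives.
    return float(-log_probs[np.arange(len(labels)), labels].mean())

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])
loss = cross_entropy(logits, labels)
```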

B. Mean average precision loss
Mean average precision is a prevalent metric in the field of cover version identification and is used in the MIREX Audio Cover Song Identification contest.¹ Given an input query song, the task is to rank all instances in the retrieval set, denoted as $\Omega = \{I_i, i = 0, \ldots, k\}$. For each query song $I_q$, the retrieval set can be split into positive and negative sets based on the relevance score. Suppose the set with positive relevance scores is represented by $R_P$ and the set with negative relevance scores by $R_N$. The complete relevance score set is then $R = R_P \cup R_N$. For a query song $I_q$, the AP is approximated using a temperature $\tau$ that parameterizes the margin and the difference matrix $d_{ij} = s(q, i) - s(q, j)$, where $s(\cdot, \cdot)$ denotes the cosine similarity. It can be computed as

$$s(q, i) = \frac{v_q \cdot v_i}{\lVert v_q \rVert \, \lVert v_i \rVert}$$

where $v_q$ is the vectorial latent representation obtained from our Siamese model. Then, the MAP of a batch input can be computed as

$$MAP = \frac{1}{m} \sum_{t=1}^{m} AP_t$$

where $m$ is the number of instances in the batch and $AP_t$ is the average precision of the $t$-th query. MAPLoss is formulated from this batch-level MAP, and our final loss function combines it with the cross-entropy loss.
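To make the idea of a differentiable AP surrogate concrete, the sketch below follows the general Smooth-AP recipe: the hard rank of each positive item is replaced by a sum of sigmoids over pairwise similarity differences. This is an illustrative reconstruction under explicit assumptions, not the exact MAPLoss of this work. The function name `smooth_ap_loss`, the sign convention of the difference matrix inside the code, the default temperature, and the choice of `1 - MAP` as the loss are all assumptions.

```python
import numpy as np

def sigmoid(x, tau):
    return 1.0 / (1.0 + np.exp(-x / tau))

def smooth_ap_loss(embeddings, labels, tau=0.01):
    """Return 1 - mean smoothed AP, using every batch item as a query once."""
    labels = np.asarray(labels)
    v = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = v @ v.T                                  # cosine similarities
    m = len(labels)
    aps = []
    for q in range(m):
        others = np.array([i for i in range(m) if i != q])
        s = sim[q, others]
        pos = labels[others] == labels[q]
        if not pos.any():
            continue                               # query without positives
        # diff[i, j] = s_j - s_i is positive when item j outranks item i.
        g = sigmoid(s[None, :] - s[:, None], tau)
        np.fill_diagonal(g, 0.0)
        rank_all = 1.0 + g.sum(axis=1)             # smoothed rank among all items
        rank_pos = 1.0 + g[:, pos].sum(axis=1)     # smoothed rank among positives
        aps.append(np.mean(rank_pos[pos] / rank_all[pos]))
    return 1.0 - float(np.mean(aps))

emb = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
well_separated = smooth_ap_loss(emb, [0, 0, 1, 1])  # versions cluster together
mixed_up = smooth_ap_loss(emb, [0, 1, 0, 1])        # versions far apart
```

A smaller temperature makes the sigmoid closer to a step function, so the surrogate tracks the true AP more tightly at the cost of sparser gradients.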

C. Contrastive loss
To further underscore the efficacy of MAPLoss, we conduct additional experiments with a commonly used metric learning technique, the contrastive loss. The contrastive loss maximizes the similarity between encoded low-dimensional representations with the same label, referred to as positives, and minimizes the similarity between learned representations with unmatched labels, referred to as negatives. Given a set $\{z_k\}$ of learned representations and a set $\{y_k\}$ including positive and negative samples, the contrastive loss is computed with a constant margin $\beta$. The constant margin is designed to prevent the model from being overwhelmed. With the constant margin, only negative pairs whose similarity is higher than the tolerance contribute to the contrastive loss.
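One common margin-based form that matches this description is sketched below: positive pairs are penalized by how far their cosine similarity falls below one, and negative pairs are penalized only when their similarity exceeds the margin. The exact formulation used in the experiments may differ; the function name `contrastive_loss`, the default `beta`, and the normalization by the number of ordered pairs are assumptions.

```python
import numpy as np

def contrastive_loss(embeddings, labels, beta=0.5):
    """Cosine-similarity contrastive loss with a constant margin beta."""
    labels = np.asarray(labels)
    v = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = v @ v.T
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)    # exclude self-pairs
    # Positives: pull similarity towards 1.
    pos_term = (1.0 - sim)[same & off_diag]
    # Negatives: only pairs more similar than the margin contribute.
    neg_term = np.maximum(0.0, sim - beta)[~same]
    return float(pos_term.sum() + neg_term.sum()) / off_diag.sum()

emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
loss = contrastive_loss(emb, [0, 0, 1, 1])
```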

Data augmentation
In order to enhance the learning of ASimT and prevent it from overfitting to the training data, we adopt two data augmentation functions. The first function considers key transposition and tempo variation, which are common in cover songs. Following the strategy proposed in [41] and [38], we expand the dimension of the CREMA feature from $12 \times T$ to $23 \times T$. To bolster the robustness of ASimT in dealing with potential key transpositions, we randomly roll the input CREMA feature $x$ in the pitch dimension by between 0 and 11 bins. For tempo variation, we adopt the strategy used in [38], stretching the temporal dimension with a random factor ranging from 0.7 to 1.5. Additionally, time warping is incorporated into our first augmentation function, involving the duplication, silencing, or removal of frames with respective probabilities of 0.3, 0.4, and 0.3. The second function addresses the variable lengths of the input audio signals. Training data exceeding the predefined length of 1800 in this work are randomly truncated at any point for further data augmentation. If the resulting sequence falls short of the predefined length, zero-padding is applied. Testing data longer than the predefined length are trimmed from the very beginning.

An overview of the ASimT training process is provided in Algorithm 1.

¹ https://www.music-ir.org/mirex/wiki/2021:Audio_Cover_Song_Identification
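A subset of the augmentation steps described in this section (pitch rolling, pitch-axis expansion, and length fixing) can be sketched as follows. Several details are assumptions made for illustration: the helper names are hypothetical, the expansion from 12 to 23 rows is implemented by stacking two copies along the pitch axis and dropping the last row, the order of the steps is arbitrary, and trimming test data is interpreted as keeping the first 1800 frames. Tempo stretching and time warping are omitted.

```python
import numpy as np

TARGET_LEN = 1800  # predefined length stated in the text

def random_pitch_roll(x, rng):
    """Roll the pitch axis by a random number of bins between 0 and 11."""
    shift = int(rng.integers(0, 12))
    return np.roll(x, shift, axis=0)

def expand_pitch_axis(x):
    """12 x T -> 23 x T (assumed: two stacked copies minus the last row)."""
    return np.concatenate([x, x], axis=0)[:-1]

def fix_length(x, rng=None, target=TARGET_LEN):
    """Crop or zero-pad the time axis to `target` frames.

    With `rng` given (training), the crop start is random; without it
    (testing), the first `target` frames are kept.
    """
    t = x.shape[1]
    if t > target:
        start = int(rng.integers(0, t - target + 1)) if rng is not None else 0
        return x[:, start:start + target]
    return np.pad(x, ((0, 0), (0, target - t)))

rng = np.random.default_rng(0)
x = rng.random((12, 2000))          # toy stand-in for a CREMA feature
y = fix_length(expand_pitch_axis(random_pitch_roll(x, rng)), rng)
```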

Algorithm 1 ASimT's learning algorithm

Experiments
SHS 5+ and SHS 4- were built with the SecondHandSongs API by [27] to train and evaluate CSI systems. Specifically, SHS 5+ is utilized as the training set, whereas SHS 4- is employed for testing. The two datasets are split according to the number of cover versions of each collected song to counteract data imbalance. For optimal data availability during the training phase, SHS 5+ exclusively comprises songs with at least five versions, for a total of 62,311 tracks from 7460 unique original works.
However, in practical scenarios, most songs have only 2 or 3 covers [27]. Consequently, SHS 4-, serving as the test set, consists of 19,455 original works, each with at most four versions, totaling 48,483 tracks. This makes SHS 4- more representative of real-world conditions than SHS 5+ when performing the cover song identification task with normal query audio. Therefore, we employ SHS 4- as the test set to assess the performance of our proposed ASimT.
We use ImageNet pretraining in our experiments. More specifically, our ASimT is initialized with the pretrained weights of a data-efficient image transformer (DeiT) [48]. Our transformer encoder has an embedding dimension of 768 and consists of 12 transformer layers. Furthermore, the multi-head attention block is configured with 12 heads. The first linear layer transforms the 768-dimensional outputs from the transformer backbone into a 256-dimensional space. Subsequently, the second linear layer maps this 256-dimensional output to the number of classes present in the training dataset, which in this study amounts to 7460 classes. In order to facilitate the application of MAPLoss, we set the batch size to 350, which includes 70 classes, each containing 5 instances (equal to the minimum number of versions for each song in SHS 5+). The learning rate is initially set at 2e-3 with cosine learning rate decay. The model is trained with a stochastic gradient descent (SGD) [49] optimizer with a weight decay of 1e-4.
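The class-balanced batch composition (70 classes with 5 instances each) can be sketched with a simple sampler. This is an illustrative example only: `sample_batch` is a hypothetical helper, the toy label list is synthetic, and how batches are drawn across an epoch in the actual training is not specified here.

```python
import random
from collections import defaultdict

def sample_batch(labels, n_classes=70, n_per_class=5, seed=None):
    """Draw indices for a batch of n_classes classes with n_per_class items each."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # Only classes with enough versions can contribute a full group.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= n_per_class]
    batch = []
    for c in rng.sample(eligible, n_classes):
        batch.extend(rng.sample(by_class[c], n_per_class))
    return batch

labels = [i // 6 for i in range(600)]  # toy data: 100 works with 6 versions each
batch = sample_batch(labels, seed=0)
```

Guaranteeing several versions of each sampled work in a batch ensures that every query has positives to rank, which a ranking-based loss such as MAPLoss requires.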

Evaluation on large dataset
For evaluation, we use the MAP and the mean rank of the first correctly identified cover (MR1), which are commonly accepted metrics in the MIREX Audio Cover Song Identification contest. This paper uses the SHS 5+ and SHS 4- datasets for training and testing, respectively. As a result, we compare our proposed method with existing studies that make use of these two datasets.
Triplet loss is explored in [12] for the CSI problem. Notably, the authors compiled the SHS 4- dataset and evaluated their methodology on this collection. Hence, we select this work as a baseline to compare with our method. Yesiler et al. introduced a data distillation method to address the CSI problem, which included reducing the embedding size [39]. Given that their model also employs the CREMA feature, we include it as a baseline for comparison with our method. As their original model was trained on Da-TACOS, we implement their method using our SHS 5+ dataset for training and evaluate it on the SHS 4- dataset. We also conduct experiments training with classification and contrastive loss as additional baselines to explore the validity of MAPLoss.
Table 1 presents the results of our ASimT. Our proposed method surpasses all the baselines in terms of MAP and MR1. This implies that our approach sets a new benchmark in performance, demonstrating how a standard transformer backbone can be effectively adapted for audio understanding and cover song analysis. Furthermore, our experiments show that the combination of classification loss and our proposed MAPLoss outperforms the combination of classification loss and contrastive loss. This indicates that directly optimizing the MAP value during the training stage can significantly improve the performance of version identification.
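Both metrics can be computed from a pairwise similarity matrix, as in the sketch below. This is a minimal example under assumptions: each track is used once as a query, the query itself is removed from its own ranking, queries without any other version are skipped, and the name `evaluate` and the toy similarity values are hypothetical.

```python
import numpy as np

def evaluate(sim, labels):
    """Return (MAP, MR1) for a square similarity matrix and work labels."""
    labels = np.asarray(labels)
    aps, first_ranks = [], []
    for q in range(len(labels)):
        scores = np.delete(sim[q], q)                  # drop the query itself
        relevant = np.delete(labels == labels[q], q)
        if not relevant.any():
            continue                                   # no other version exists
        order = np.argsort(-scores, kind="stable")     # most similar first
        ranks = np.flatnonzero(relevant[order]) + 1    # 1-based ranks of covers
        precision_at_hits = np.arange(1, len(ranks) + 1) / ranks
        aps.append(precision_at_hits.mean())
        first_ranks.append(ranks[0])
    return float(np.mean(aps)), float(np.mean(first_ranks))

# Toy example: tracks 0 and 1 are versions of one work, track 2 is unrelated.
sim = np.array([[1.0, 0.2, 0.9],
                [0.2, 1.0, 0.1],
                [0.9, 0.1, 1.0]])
map_score, mr1 = evaluate(sim, [0, 0, 1])
```

A higher MAP and a lower MR1 both indicate better retrieval, since MR1 measures how far down the ranking the first true cover appears.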

Evaluation on small dataset
In real-world scenarios, a vast number of songs are available on the Internet, including original songs and their cover versions. This abundance of data can be utilized to train CSI models more effectively, thereby improving the accuracy of version identification for practical use. In many cases, the original or alternate versions of a query song may be present in the training collection.
To simulate this situation, we create a small dataset following the approach employed in [27]. We randomly select 350 tracks from the training dataset, comprising works with 7 covers each. Out of these, 100 tracks are included in the training stage. For testing, we compute the similarity between all pairs of the 350 tracks, resulting in a 350 × 349 similarity matrix. As a result, we achieve a MAP of 74.68%. This approach is practical in real-world applications, as companies like Shazam² typically train their models on millions of songs to achieve high accuracy. The results obtained on the small dataset significantly surpass those achieved on the SHS 4- dataset. Such observations align with findings from previous research. For instance, [50] reported a MAP of 0.09475 on a large dataset containing 12,960 tracks. In a similar vein, [15] noted a decrease in accuracy as the dataset size increased. The authors speculated that this could be attributed to the tendency of larger datasets to contain songs with similar melodic structures, chord patterns, and accompaniments, thus complicating the task of identifying cover versions. Our large evaluation set, SHS 4-, consisting of a total of 48,483 tracks, presents a comparable challenge for version identification tasks.

The impact of pooling
In previous work on image classification [46] and retrieval [51], the output of the [CLS] token was used as the latent representation for subsequent classification or metric learning tasks. In a similar manner, AST, like [46], feeds the output of the [CLS] token into a linear layer for class prediction. As AST employs DeiT as the pretrained model, and given that DeiT incorporates two [CLS] tokens, AST averages the outputs of these two tokens for audio event classification. To further explore the impact of pooling in version identification, we conduct experiments comparing the training curves of average pooling and of using only the global feature vector [CLS] (Figs. 3 and 4). This experiment, conducted with only the classification loss, guides our decision to use either average pooling or [CLS] in our final training. Interestingly, the final performance with average pooling and with only the [CLS] token proved similar. However, the accuracy increased more rapidly, and the loss declined more rapidly, when using the global feature vector [CLS] than when using average pooling. As a result, we adopted [CLS] as the output of our transformer backbone for further training.

Conclusion
In this work, we explore how a convolution-free, purely attention-based transformer architecture can be adapted for cover song analysis. Our experiments demonstrated that our MAPLoss could deliver competitive results and also illustrated the potential utility of the transformer model in cover song identification tasks. Nonetheless, when evaluated on the large dataset, both in this work and in related research, the mean average precision was found to be relatively low. This could be due, in part, to the tendency of large datasets to contain music works sharing similar chord sequences. Given that our CREMA feature mainly encapsulates the harmonic context in audio signals, we plan to take the melodic context into account as well in future work to further enhance the performance.
By integrating these two musical dimensions in a fusive approach, we anticipate that we can more effectively identify cover versions, even when dealing with large datasets.

Fig. 3 The training accuracy with or without average pooling