
Two-layer similarity fusion model for cover song identification

Abstract

Various musical descriptors have been developed for Cover Song Identification (CSI). However, different descriptors are based on various assumptions, designed for representing distinct characteristics of music, and often differ in scale and noise level. Therefore, a single similarity function combined with a specific descriptor is generally not able to describe the similarity between songs comprehensively and reliably. In this paper, we propose a two-layer similarity fusion model for CSI, which combines the information carried by different descriptors and similarity functions organically and incorporates the advantages of both early fusion and late fusion. In particular, in the early fusion, the similarities obtained by the same descriptor and different similarity functions are integrated with the Similarity Network Fusion (SNF) technique. Then, in the late fusion, the learning method selected by the sparse group LASSO algorithm is applied to each early fused similarity to obtain the probability that the corresponding song pair belongs to the reference/cover pair. Lastly, the final fused similarity is achieved by averaging all the obtained probabilities. Extensive experimental results on a music collection composed of samples provided by the SecondHandSongs (SHS) dataset verify that the proposed scheme outperforms state-of-the-art fusion based CSI schemes in terms of identification accuracy and classification efficiency.

1 Introduction

The explosion of musical data makes us face new challenges unthinkable two decades ago. For example, how to retrieve different versions, performances, or renditions of a previously recorded musical composition has become a challenging problem [1]. Cover Song Identification (CSI) can help in this regard. Its potential applications include music rights and licenses management and music creation aid. It has become an active research field in Music Information Retrieval (MIR) over the past decades.

Since the cover version may be obtained in different ways (such as remastering, instrumental, mashup, live performance, acoustic, demo, remix, quotation, medley, and standard [2]), it may differ from the original in timbre, tempo, timing, structure, key, harmonization, lyrics and language, and/or noise [3]. What remains almost invariant among the various cover versions are the harmonic progression and the melody evolution, which form the basis of most existing CSI descriptor extraction algorithms. Among these descriptors, the Chroma (also called Pitch Class Profile (PCP)) [4] and its variations [5–13] are the most widely used descriptors for describing harmonic progressions. In [9], the beat-synchronous chroma of two tracks were cross-correlated, and sharp peaks in the result, indicating good local alignment, were used to determine the distance between them. This approach performed best in the audio CSI task of the 2006 Music Information Retrieval Evaluation eXchange (MIREX) [14]. The Harmonic PCP (HPCP) descriptor proposed in [15] shares the common properties of PCP, but since it is based only on the peaks of the spectrum within a certain frequency band, it further reduces the influence of noisy spectral components. It also takes the presence of harmonic frequencies into account and is tuning independent. The CSI scheme based on HPCP and the Qmax similarity measure [5, 16] achieved the highest identification accuracy in the 2009 MIREX audio CSI task. In [10], the lower pitch-frequency cepstral coefficients were discarded and the remaining coefficients were projected onto chroma bins to obtain the Chroma DCT-Reduced log Pitch descriptor. This descriptor achieved a high degree of timbre invariance and, hence, outperformed conventional PCP in music matching and retrieval applications. In [13], to describe the similarity of the singing voice between cover versions of popular songs, two concepts from psychoacoustics (the time-varying loudness contour and the critical band) were combined organically with conventional PCP descriptors to obtain the Cochlear PCP (CPCP). Besides harmonic progression, melody evolution can also be used for the CSI task; for example, in [17–19], the main melody (denoted as MLD in this paper) was extracted for cover song retrieval. Recently, timbre-based descriptors have been studied for the CSI task [12, 20]. In [12], a new descriptor, the Modified Perceptual Linear Prediction Lifted Cepstrum (MPLPLC), was obtained by modifying the Perceptual Linear Prediction (PLP) model from the automatic speech recognition field, introducing new findings in psychophysics and taking the differences between speech and music into consideration to make it suitable for music signal analysis. In addition, different kinds of similarity functions, such as Cross-Correlation (CC) [9], Dynamic Time Warping (DTW) [11], Qmax [5], and Dmax [21], have been proposed for measuring the similarity between descriptors.
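Since several of the descriptors above are chroma variants, a toy PCP-style computation may help fix ideas: spectral energy is folded into 12 pitch classes. The sketch below is purely illustrative (the reference frequency, the normalization, and the absence of tuning estimation are simplifying assumptions) and does not reproduce any of the specific descriptors cited above.

```python
import numpy as np

def toy_chroma(magnitude_spectrum, sample_rate, n_fft, f_ref=440.0):
    """Fold spectral energy into 12 pitch classes (a toy PCP-style chroma)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    chroma = np.zeros(12)
    for f, m in zip(freqs[1:], magnitude_spectrum[1:]):   # skip the DC bin
        pitch_class = int(round(12 * np.log2(f / f_ref))) % 12
        chroma[pitch_class] += m ** 2                      # accumulate energy per class
    return chroma / (chroma.sum() + 1e-12)                 # normalize to unit sum
```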

However, since different descriptors are based on various assumptions, designed to represent distinct characteristics of music, and often differ in scale and noise level, it is impossible to characterize all songs of different genres comprehensively with the same descriptor, nor is it possible to measure the similarity between descriptors reliably with only one similarity function. To solve this problem, some researchers began to study descriptor or similarity fusion models for the CSI task [22–26] (see Section 2). In this paper, we propose a two-layer similarity fusion model for the CSI task aimed at further enhancing identification accuracy and classification efficiency. The main contributions of this paper include the following: (i) our model, combining the advantages of two musical descriptors and two similarity functions, generates a more comprehensive and reliable similarity description between songs. (ii) The sparse group LASSO algorithm [27] is included in the proposed model to select the most suitable learning method for the late fusion stage, which ensures the fusion efficiency and reduces the computational complexity as well. (iii) By incorporating the advantages of early fusion and late fusion organically, the proposed model outperforms state-of-the-art fusion-based CSI schemes in identification accuracy and classification efficiency. (iv) By projecting the ordinary similarity to a probability-based similarity in the late fusion, the proposed model is flexible and generic enough to include more musical descriptors and similarity functions. (v) Extensive experiments have been conducted on a music collection composed of 3364 samples provided by SecondHandSongs (SHS)1 to verify the efficiency of the proposed model in comparison with other CSI schemes with or without similarity fusion.

2 Information fusion for CSI

Information fusion consists of combining information originating from several sources in order to improve decision making [28]. This technique is commonly adopted in the content-based MIR field. For instance, in [29], different descriptors were combined to improve genre classification accuracy. For the CSI task, information fusion is a natural idea because the tonal similarity between tracks is more easily captured by different kinds of descriptors and similarity functions. In fact, some recent studies have suggested that version detection can be improved through the combination of different descriptors [30] or different similarity functions [24–26, 31, 32]. Generally, information fusion in the CSI field can be performed at four levels: feature, descriptor, similarity, and decision.

The feature-level fusion is the simplest way of fusion. In [30], frames from the same moment in time were taken for both chroma and melody and then combined by creating a tuple of note or chord. Finally, to reduce the number of tuples, four different representations with different alphabet sizes were proposed. However, fusion at this level may not yield the desired results in practice because (i) independent analysis of different features often leads to inconsistent conclusions that are hard to integrate (for example, two tracks may be judged as a reference/cover pair by one feature and as a reference/non-cover pair by another feature) and (ii) preselecting a set of features leads to biased analysis. Accordingly, in [30], the improvement achieved by the fusion scheme is limited compared with that achieved by chord-based representations.

The descriptor-level fusion is the strategy that combines different descriptors into one descriptor vector. The simplest way is to concatenate or merge descriptors. In [30], the chord descriptor and the melody descriptor were fused by concatenating or merging them. The problems resulting from this kind of fusion include the following. First, when binding descriptors of different natures/domains, normalization techniques should be applied first to standardize all descriptor values to the same range, which has been a great challenge for the machine learning community [33]. Second, concatenation or merging may result in the "curse of dimensionality" problem, which means the dimension of the descriptor space increases in such a way that the available training instances become indistinguishable and insufficient for defining a good decision hyperplane [33]. Third, concatenation or merging further dilutes the already low signal-to-noise ratio of each descriptor. So, as shown in [30], this kind of fusion may not yield satisfactory results.

The similarity-level fusion is based on the strategy known as mixture of experts. The similarity between two tracks is obtained by calculating an individual pairwise distance for each descriptor and then combining them into a final pairwise distance value. Several similarity-level fusion schemes have been proposed for the CSI task. In [22], the main melody and accompaniment of the music were extracted first. Then, the maximum of the similarities based on the main melody, the accompaniment, and the mixture signal was taken as the final similarity. In [24], the task of detecting cover versions was posed as a classification problem. The similarities based on different descriptors and corresponding similarity functions were concatenated into a feature vector, which was then used to train a classifier for determining whether the corresponding two tracks belong to the reference/cover or reference/non-cover pair. Since only chroma descriptors were considered, the fused similarity only accounted for the same musical facet, the harmony. To solve this problem, in [25], the similarities based on three related yet different descriptors (harmony, melody, and bass line) were fused with a standard classification approach similar to [24]. In [31], the fusion of different similarities was achieved by projecting all similarities into a multi-dimensional space, where the dimensionality of this space was the number of similarities considered. In [26], the similarities based on different descriptors and corresponding similarity functions were obtained first. Then, the Similarity Network Fusion (SNF) technique [34] was used to fuse the similarity communities based on each similarity. Finally, the track-by-track similarities in the fused similarity network were adopted for version identification. Due to the merits of the SNF technique, this fusion scheme could reduce the noise existing in each similarity network and integrate the common as well as complementary information captured by different descriptors and corresponding similarity measures.

Finally, the fourth strategy for fusion is known as decision-level fusion. The CSI scheme proposed in [35] belongs to this kind of fusion. First, the similarities based on different descriptors were adopted to train classifiers. Then, the decisions made by each classifier were integrated with standard rank aggregation. This fused scheme achieved an increase of up to 23.5% in identification accuracy compared to single classifiers.

According to the stage at which the fusion is performed, the above fusion schemes can be classified into early fusion and late fusion. Early fusion happens before the classification step. The feature-level, the descriptor-level, and the similarity-level fusions belong to early fusion. The main advantage of early fusion is that all the features can be "seen" by the classifier and only one learning phase is required [36]. However, the performance of early fusion is greatly affected by the inclusion of features with little contribution. On the other hand, the late fusion approach operates at the decision level [37]. Compared to early fusion, late fusion is easier to perform, but it cannot learn the correlation among features. Usually, another learning procedure is needed to combine the classification outputs. To avoid the over-fitting problem, a simple mean, which can yield better or at least comparable results to training another classifier for fusion, can be adopted [36].

In this work, we propose a two-layer fusion model for the CSI task, which integrates the advantages of both early fusion and late fusion organically. Concretely, in the early fusion, the similarities based on a specific musical descriptor (HPCP [15] or MLD [38]) and two different similarity functions (Qmax [5] and Dmax [21]) are fused with the SNF technique. In the late fusion, one optimal classifier, selected by the sparse group LASSO technique [27], is applied to each early fused similarity to obtain the probability that the corresponding tracks belong to a reference/cover pair. Then, the mean value of these probabilities is taken as the final fused similarity.

3 Proposed CSI scheme

The block diagram of the proposed scheme is shown in Fig. 1.

Fig. 1 Block diagram of the proposed CSI scheme

3.1 Similarity calculation: Qmax and Dmax

The Qmax [16] similarity measure tries to calculate the length of the longest time segment in which two sequences \(f_i\) and \(f_j\) exhibit similar patterns. First, a Cross Recurrence Plot (CRP), denoted as \(\mathbf{c}\), is generated by setting its element \(c_{p,q}\) to "1" when there exists a recurrence between \(f_i(p)\) and \(f_j(q)\) and to "0" otherwise. More details about the CRP calculation can be found in [5]. In the CRP, the length of a diagonal pattern of "1"s indicates the degree of similarity between the two sequences. However, as shown in [32], due to the possible alignment constraints in the Qmax (see Fig. 2 a), it fails to identify cover versions when the CRP exhibits the phenomenon shown in Fig. 3 a, where the diagonal suffers serious short disruptions. This phenomenon may result from skipping some chords or part of the melody when performing the cover version. To solve this problem, we modified the Qmax by changing the possible alignment constraints from those in Fig. 2 a to those in Fig. 2 b, obtaining a new measure called Dmax [21]. As shown in Fig. 3 b, c, for the case shown in Fig. 3 a, the Dmax performs better than the Qmax.

Fig. 2 Possible alignment constraints in the a Qmax and b Dmax

Fig. 3 a The CRP for the song "Addicted to Love" as performed by Tina Turner and Robert Palmer and the corresponding cumulative matrix obtained by the b Qmax and c Dmax

In the Qmax and Dmax measures, a cumulative matrix (denoted as \(\mathbf{o}\) and \(\hat{\mathbf{o}}\), respectively) is first generated based on \(\mathbf{c}\) with Eqs. (1) and (2), respectively.

$$ {}\begin{aligned} o_{p,q}=\!\left\{\!\!\begin{array}{ll} \max\{o_{p-1,q-1},o_{p-2,q-1},o_{p-1,q-2}\}+1, & \; if \; c_{p,q}=1 \\ \max\{0,o_{p-1,q-1}-\gamma(c_{p-1,q-1}), \\ o_{p-2,q-1}-\gamma(c_{p-2,q-1}),\\o_{p-1,q-2}-\gamma(c_{p-1,q-2})\}, & if \; c_{p,q}=0 \end{array}\right. \end{aligned} $$
(1)
$$ \begin{aligned} \hat{o}_{p,q}=\left\{\begin{array}{ll} \max\{\hat{o}_{p-1,q-1},\hat{o}_{p-2,q-1}+c_{p-1,q},\\ \hat{o}_{p-1,q-2}+c_{p,q-1},\\ \hat{o}_{p-3,q-1}+c_{p-2,q}+c_{p-1,q},\\ \hat{o}_{p-1,q-3}+c_{p,q-2}+c_{p,q-1}\}+1, & \; if \; c_{p,q}=1 \\ \max\{0,\hat{o}_{p-1,q-1}-\gamma(c_{p-1,q-1}), \\ \hat{o}_{p-2,q-1}+c_{p-1,q}-\gamma(c_{p-2,q-1}),\\ \hat{o}_{p-1,q-2}+c_{p,q-1}-\gamma(c_{p-1,q-2}),\\ \hat{o}_{p-3,q-1}+c_{p-2,q}+c_{p-1,q}-\gamma(c_{p-3,q-1}),\\ \hat{o}_{p-1,q-3}+c_{p,q-2}+c_{p,q-1}-\gamma(c_{p-1,q-3})\}, & if \; c_{p,q}=0 \end{array}\right. \end{aligned} $$
(2)

In both Eqs. (1) and (2), γ is calculated with Eq. (3).

$$\begin{array}{*{20}l} \gamma(z)=\left\{\begin{array}{ll} \gamma_{o}, & \; if \; z=1 \\ \gamma_{e}, & \; if \; z=0 \end{array}\right. \end{array} $$
(3)

where \(\gamma_{o}\) and \(\gamma_{e}\) are the penalties for a disruption onset and a disruption extension, respectively.

Then, the normalized Qmax distance and Dmax distance, denoted as \(d_{Q}(i,j)\) and \(d_{D}(i,j)\), can be calculated with Eqs. (4) and (5), respectively.

$$ d_{Q}(i,j)=\sqrt{N_{j}}/\max(o_{p,q}) $$
(4)
$$ d_{D}(i,j)=\sqrt{N_{j}}/\max(\hat{o}_{p,q}) $$
(5)

where \(N_{j}\) is the length of \(f_{j}\).

Suppose the track collection is composed of N tracks and \(\mathbf {f}_{i}^{(k)},k=1,\cdots,K\) is the k-th kind of descriptor of the i-th track. For \(\mathbf {f}_{i}^{(k)}\) and \(\mathbf {f}_{j}^{(k)}\), their similarities based on the Qmax function and the Dmax function are denoted as \(d_{Q}^{(k)}(i,j)\) and \(d_{D}^{(k)}(i,j)\), respectively.
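To make the recursion in Eq. (1) concrete, the following minimal NumPy sketch computes the Qmax cumulative matrix and the normalized distance of Eq. (4) from a precomputed binary CRP; the Dmax variant follows Eq. (2) analogously. The function name, the border handling, the penalty values, and the toy CRP are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def qmax_distance(crp, gamma_o=0.5, gamma_e=0.5):
    """Qmax distance (Eqs. (1) and (4)) from a binary cross recurrence plot.

    crp[p, q] == 1 when frame p of track i recurs with frame q of track j;
    gamma_o / gamma_e are the penalties for a disruption onset / extension (Eq. (3)).
    """
    P, Q = crp.shape
    o = np.zeros((P, Q))
    gamma = lambda z: gamma_o if z == 1 else gamma_e
    for p in range(2, P):
        for q in range(2, Q):
            if crp[p, q] == 1:
                o[p, q] = max(o[p - 1, q - 1], o[p - 2, q - 1], o[p - 1, q - 2]) + 1
            else:
                o[p, q] = max(0.0,
                              o[p - 1, q - 1] - gamma(crp[p - 1, q - 1]),
                              o[p - 2, q - 1] - gamma(crp[p - 2, q - 1]),
                              o[p - 1, q - 2] - gamma(crp[p - 1, q - 2]))
    # Eq. (4): a longer shared segment (larger maximum of o) gives a smaller distance.
    return np.sqrt(Q) / o.max() if o.max() > 0 else np.inf

# Toy CRP with a single diagonal of recurrences (a perfectly aligned pair).
print(qmax_distance(np.eye(40, dtype=int)))
```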

3.2 Early fusion: SNF

The early fusion is realized with the SNF technique [34]. For any music descriptor, the track similarity networks based on Qmax and Dmax are represented as graphs \(G_{Q}(V,E_{Q})\) and \(G_{D}(V,E_{D})\), respectively. The vertices V correspond to the track collection, and the edges \(E_{Q}\) (or \(E_{D}\)) are weighted by the similarity based on Qmax (or Dmax). To compute the fused similarity matrix from the Qmax and Dmax matrices, the full kernels (denoted as \(\mathbf{P}_{Q}\) and \(\mathbf{P}_{D}\)) and the sparse kernels (denoted as \(\mathbf{Q}_{Q}\) and \(\mathbf{Q}_{D}\)) are defined on the vertex set V (see Eqs. (6)–(9)), respectively.

$$\begin{array}{*{20}l} P_{Q}(i,j)=\left\{\begin{array}{cc} \frac{d_{Q}(i,j)}{2\sum_{k\neq i }d_{Q}(i,k)}, & \;\;\;\;\;\;\;j\neq i \\ 1/2, & \;\;\;\;\;\;\; j= i \end{array}\right. \end{array} $$
(6)
$$\begin{array}{*{20}l} P_{D}(i,j)=\left\{\begin{array}{cc} \frac{d_{D}(i,j)}{2\sum_{k\neq i }d_{D}(i,k)}, & \;\;\;\;\;\;\;j\neq i \\ 1/2, & \;\;\;\;\;\;\; j= i \end{array}\right. \end{array} $$
(7)

Let \(N_{i,Q}\) (or \(N_{i,D}\)) represent the set of the i-th track's neighbors, including itself, in \(G_{Q}\) (or \(G_{D}\)). For the given graph \(G_{Q}\) (or \(G_{D}\)), the K Nearest Neighbors (KNN) are used to measure local affinity as in Eq. (8) (or Eq. (9)).

$$\begin{array}{*{20}l} Q_{Q}(i,j)=\left\{\begin{array}{cc} \frac{d_{Q}(i,j)}{\sum_{k \in N_{i,Q}}d_{Q}(i,k)}, & \;\;\;\;\;\;\; j \in N_{i,Q} \\ 0, & \;\;\;\;\;\;\; otherwise \end{array}\right. \end{array} $$
(8)
$$\begin{array}{*{20}l} Q_{D}(i,j)=\left\{\begin{array}{cc} \frac{d_{D}(i,j)}{\sum_{k \in N_{i,D}}d_{D}(i,k)}, & \;\;\;\;\;\;\; j \in N_{i,D} \\ 0, & \;\;\;\;\;\;\; otherwise \end{array}\right. \end{array} $$
(9)

Let \(\mathbf{P}_{Q,t=0}=\mathbf{P}_{Q}\) be the initial status matrix at t=0. The similarity matrix based on the Qmax measure is iteratively updated with Eq. (10). After each iteration, normalization (Eq. (8)) is performed on \(\mathbf{P}_{Q,t+1}\). \(\mathbf{P}_{D,t+1}\) is obtained in the same way.

$$ \mathbf{P}_{Q,t+1}=\mathbf{Q}_{Q}\times \left(\mathbf{P}_{D,t}\right) \times \left(\mathbf{Q}_{Q}\right)^{T} $$
(10)

After t steps, the overall status matrix, denoted as \(\mathbf{P}\), is obtained with Eq. (11).

$$ \mathbf{P}=\left(\mathbf{P}_{Q,t}+\mathbf{P}_{D,t}\right)/2 $$
(11)

For the k-th descriptor, \(\mathbf{f}^{(k)}\), the corresponding fused similarity network is denoted as \(G^{(k)}(V,E^{(k)})\). The weights of the edges in \(G^{(k)}(V,E^{(k)})\) are concatenated to generate the k-th early fused similarity vector, denoted as \(\mathbf {X}^{(k)}=[X^{(k)}_{1},\cdots,X^{(k)}_{N^{2}}]^{T}\).
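The following sketch illustrates the SNF updates of Eqs. (6)–(11) on two pairwise similarity matrices. The function name, the neighborhood size, and the number of iterations are assumptions made for illustration, and the per-iteration renormalization mentioned above is omitted for brevity; this is not the exact implementation used in the experiments.

```python
import numpy as np

def snf_fuse(w_q, w_d, k=5, iters=20):
    """Fuse two N x N similarity matrices with the SNF updates of Eqs. (6)-(11)."""

    def full_kernel(w):                              # Eqs. (6)/(7)
        off_diag_sums = w.sum(axis=1) - np.diag(w)
        p = w / (2.0 * off_diag_sums[:, None])
        np.fill_diagonal(p, 0.5)
        return p

    def sparse_kernel(w):                            # Eqs. (8)/(9): KNN local affinity
        q = np.zeros_like(w, dtype=float)
        for i in range(w.shape[0]):
            nbrs = np.argsort(w[i])[::-1][:k]        # k most similar tracks (self included)
            q[i, nbrs] = w[i, nbrs] / w[i, nbrs].sum()
        return q

    p_q, p_d = full_kernel(w_q), full_kernel(w_d)
    q_q, q_d = sparse_kernel(w_q), sparse_kernel(w_d)
    for _ in range(iters):                           # cross diffusion, Eq. (10)
        p_q, p_d = q_q @ p_d @ q_q.T, q_d @ p_q @ q_d.T
    return (p_q + p_d) / 2.0                         # Eq. (11), overall status matrix
```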

3.3 Late fusion and group LASSO algorithm

First, for each descriptor, different learning methods, denoted as \(\mathbf{U}=\{U_{1},\cdots,U_{M}\}\), are applied to the early fused similarities \(X^{(k)}_{n},n=1,\cdots,N^{2}\). For \(X^{(k)}_{n}\), the probabilities that it belongs to a reference/cover pair obtained by the different learning methods are concatenated to obtain \(\hat {\mathbf {X}}^{(k)}_{n}=[\hat {X}^{(k)}_{1n},\cdots,\hat {X}^{(k)}_{Mn}]\). Then, \(\hat {\mathbf {X}}^{(k)}_{n},n=1,\cdots,N^{2}\), combined with their labels (reference/cover or reference/non-cover), are used to train the group LASSO [27] algorithm to select the most efficient learning method for \(\mathbf{X}^{(k)}\). It should be noted that, for the HPCP or MLD descriptor, the early fused similarities are used to train almost all learning methods provided by Weka with default parameters. Only the BayesNet (BN), NaiveBayesUpdateable (NBU), RBFNetwork (RBFN), DecisionTable (DT), and J48 methods yield good results. So, the group LASSO algorithm is applied to the results obtained by each of these five learning methods to select the most efficient one for each descriptor. For each kind of descriptor, the probability obtained by each learning method is regarded as one group. Finally, assuming that for the early fused similarities \(X_{n}^{\mathrm {(HPCP)}}\) and \(X_{n}^{\mathrm {(MLD)}}\), the corresponding probability-based similarities obtained by the most efficient learning method are \(\tilde {X}_{n}^{\mathrm {(HPCP)}}\) and \(\tilde {X}_{n}^{\mathrm {(MLD)}}\), respectively, the final fused similarity is obtained by taking the mean of \(\tilde {X}_{n}^{\mathrm {(HPCP)}}\) and \(\tilde {X}_{n}^{\mathrm {(MLD)}}\).
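The late-fusion averaging step can be sketched as follows. Since the experiments use Weka classifiers, the scikit-learn GaussianNB below is only a hypothetical stand-in for the selected learner (BayesNet in the experiments), and the data layout is an assumption; the selection step itself relies on the group LASSO, which is described next and sketched after Eq. (12).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def late_fusion(x_hpcp, x_mld, labels):
    """x_hpcp, x_mld: (N^2, 1) early fused similarity vectors; labels in {0, 1}."""
    probs = []
    for x in (x_hpcp, x_mld):
        clf = GaussianNB().fit(x, labels)            # stand-in for the selected Weka learner
        probs.append(clf.predict_proba(x)[:, 1])     # probability of being a reference/cover pair
    return (probs[0] + probs[1]) / 2.0               # final fused (probability-based) similarity
```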

As shown in [39], the idea of group LASSO is to incorporate a mixed-norm regularization on logistic regression. It solves the optimization problem shown in Eq. (12).

$$ {}\begin{aligned} \hat{\boldsymbol{\beta}}_{\lambda} &= \mathop{\arg\min}\limits_{\boldsymbol{\beta},\alpha} \sum_{i}\log\left(1+\exp\left(-y_{i}\left(\boldsymbol{\beta}^{T}\mathbf{x}_{i}+\alpha\right)\right)\right) \\ &\qquad \qquad + \lambda\sum_{g=1}^{G}\|\boldsymbol{\beta}_{\mathbf{I}_{g}}\|_{2} \end{aligned} $$
(12)

where \(\mathbf{x}_{i}\) is the i-th training sample, \(y_{i}\) is the ground truth label ({0,1}), and α is the intercept. \(\|\cdot\|_{2}\) refers to the \(\ell_{2}\) norm. \(\boldsymbol{\beta}\) is composed of G predefined non-overlapping groups, and \(\mathbf{I}_{g}\) is the index set of the g-th group. The parameter λ controls the level of sparsity of the resulting model.

To select the most efficient learning method for the early fused similarities, the results of each learning method (the probabilities that the similarities belong to reference/cover pairs) are concatenated to form the vector \(\mathbf{x}_{i}\) in Eq. (12). Then, for a fixed λ, Eq. (12) is solved to obtain \(\hat {\boldsymbol{\beta }}_{\lambda }\). The learning method whose coefficients in \(\hat {\boldsymbol{\beta }}_{\lambda }\) have the largest magnitude is considered the most efficient one.
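As a rough illustration of Eq. (12) and the selection rule above, the sketch below fits a group-lasso logistic regression by proximal gradient descent, with one coefficient group per learning method, and then picks the method whose group has the largest average coefficient magnitude. The solver, step size, λ, and data layout are assumptions for the sketch; the original work relies on the (sparse) group LASSO formulation of [27] rather than this minimal solver.

```python
import numpy as np

def group_lasso_logreg(X, y, groups, lam=0.1, step=1e-3, iters=5000):
    """Proximal-gradient sketch of the group-lasso logistic regression in Eq. (12).

    X      : (n_samples, n_features), probabilities stacked method by method
    y      : labels in {0, 1} (reference/non-cover vs. reference/cover)
    groups : list of index arrays, one group per learning method
    """
    y_pm = 2.0 * np.asarray(y) - 1.0                 # map {0, 1} labels to {-1, +1}
    beta, alpha = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        z = np.clip(y_pm * (X @ beta + alpha), -30, 30)
        w = -y_pm / (1.0 + np.exp(z))                # per-sample gradient weights of the logistic loss
        beta -= step * (X.T @ w)
        alpha -= step * w.sum()
        for g in groups:                             # proximal step: group soft-thresholding
            norm = np.linalg.norm(beta[g])
            if norm > 0:
                beta[g] *= max(0.0, 1.0 - step * lam / norm)
    return beta, alpha

def select_best_method(beta, groups):
    """Pick the learning method whose coefficient group has the largest average magnitude."""
    return int(np.argmax([np.abs(beta[g]).mean() for g in groups]))
```

With the most efficient method selected for each descriptor (BayesNet in the experiments), the final fused similarity is simply the mean of the two probability-based similarities, as described above.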

4 Experiments

All the descriptors, similarity functions, and learning methods adopted in this work are listed in Table 1.

Table 1 The descriptors, similarity functions, and learning methods used

4.1 Datasets

The dataset, denoted as DB3364, is composed of tracks included in the test set of the SHS dataset. There are 1212 original tracks and 2152 cover versions. All the audio files were collected on our own. The average number of tracks in each cover set is 2.76, ranging from 2 to 172. Furthermore, DB3364 is split into one training set, denoted as DB801, and three testing sets, denoted as DB799, DB802, and DB962, respectively. These datasets do not overlap, and their specific information is listed in Table 2. It should be noted that we did not use the descriptors provided by the SHS dataset directly. The HPCP, MLD, CPCP, and Beat-Synchronous Chroma (BSC) descriptors were extracted from the audio with the algorithms shown in [13, 15, 19], and [9], respectively.

Table 2 Cover song datasets used

4.2 Evaluation measures

With the final similarity obtained by each CSI scheme, an ordered list of results for each given query can be obtained. Then, the identification accuracy can be evaluated using standard information retrieval metrics: the Mean of Average Precision (MAP) [5], the Mean averaged Reciprocal Rank (MaRR) [40], the Total number of covers identified in the top 10 (TOP10), and the Mean Rank (MR) of the first correctly identified cover.
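For concreteness, a small sketch of how MAP and MR could be computed from the ranked result lists is given below; the per-query relevance layout is an assumption, and MaRR and TOP10 can be computed in a similar fashion.

```python
import numpy as np

def map_and_mr(ranked_relevance):
    """MAP and MR from per-query binary relevance lists ordered by decreasing similarity."""
    aps, first_ranks = [], []
    for rel in ranked_relevance:
        rel = np.asarray(rel, dtype=float)
        hits = np.flatnonzero(rel)                    # positions of true covers (0-based)
        if hits.size == 0:
            continue
        precision_at_hits = np.cumsum(rel)[hits] / (hits + 1)
        aps.append(precision_at_hits.mean())          # average precision for this query
        first_ranks.append(hits[0] + 1)               # rank of first correctly identified cover
    return float(np.mean(aps)), float(np.mean(first_ranks))

# Two toy queries: covers at ranks 2 and 4 for the first, rank 1 for the second.
print(map_and_mr([[0, 1, 0, 1], [1, 0, 0, 0]]))       # (0.75, 1.5)
```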

In addition, the final fused similarities are adopted to train a classifier (the BayesNet classifier provided by Weka3 with default parameters), which can then be used to estimate whether any two tracks belong to a reference/cover pair or not. Assume the obtained confusion matrix is as shown in Table 3, where class A and class B denote the reference/non-cover and reference/cover class, respectively. Then, three different parameters, the True Negative Rate (TNR), the Classification Accuracy (CA), and the Average Classification Accuracy (ACA), are calculated according to Eqs. (13)–(15) to measure the classification efficiency. The ACA is adopted to avoid the evaluation of classification results being biased towards the majority class (the reference/non-cover class in this work). Since a 10-fold cross-validation protocol is adopted, all reported results are in terms of mean TNR, mean CA, and mean ACA.

$$ \mathrm{TNR=TN/(FP+TN)} $$
(13)
$$ \mathrm{CA=(TP+TN)/(TP+FP+FN+TN)} $$
(14)
$$ \mathrm{ACA=(TN/(FP+TN)+TP/(TP+FN))/2} $$
(15)

Table 3 Confusion matrix
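The three measures of Eqs. (13)–(15) follow directly from the entries of the confusion matrix in Table 3; a minimal sketch, assuming the reference/cover class is treated as the positive class:

```python
def classification_measures(tp, fp, fn, tn):
    """TNR, CA, and ACA of Eqs. (13)-(15), with reference/cover as the positive class."""
    tnr = tn / (fp + tn)                              # Eq. (13)
    ca = (tp + tn) / (tp + fp + fn + tn)              # Eq. (14)
    aca = (tn / (fp + tn) + tp / (tp + fn)) / 2.0     # Eq. (15)
    return tnr, ca, aca
```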

4.3 Experimental results

To illustrate how the proposed model behaves in easy and hard conditions, we manually chose concrete cover sets where one descriptor performs better than the other and where both descriptors perform well. The information on the tracks included in this study is listed in Table 4. The six tracks are used both as queries and as targets. The corresponding 6×6 distance matrices obtained by HPCP-Dmax, HPCP-Qmax, 1L-HPCP-QD (early fusion based on the HPCP descriptor), MLD-Dmax, MLD-Qmax, 1L-MLD-QD (early fusion based on the MLD descriptor), 2L-HPCP-Best1 (two-layer fusion when the BayesNet classifier is applied only to the HPCP-QD similarity), 2L-MLD-Best1 (two-layer fusion when the BayesNet classifier is applied only to the MLD-QD similarity), and 2L-Best1 (two-layer fusion when the BayesNet classifier is applied to both the HPCP-QD and MLD-QD similarities) are shown in Fig. 4 a–i, respectively. The cells corresponding to the query/cover pairs are marked with blue boxes.

Fig. 4 Distance matrices obtained by CSI schemes with or without similarity fusion. The actual values are subtracted by 1 to make the visual comparison easier. a HPCP-Dmax. b HPCP-Qmax. c 1L-HPCP-QD. d MLD-Dmax. e MLD-Qmax. f 1L-MLD-QD. g 2L-HPCP-Best1. h 2L-MLD-Best1. i 2L-Best1

Table 4 The tracks in the cover sets

The experimental results shown in Fig. 4 demonstrate the following. (i) By comparing the results shown in Fig. 4 a, b, d, e, we can see that the HPCP descriptor works better than the MLD descriptor for the No. 1 cover set, the MLD descriptor performs better than the HPCP descriptor for the No. 3 cover set, and both the HPCP and MLD descriptors perform well on the No. 2 cover set. The tracks in the No. 1 cover set are two different versions of "Something Wonderful". These two tracks are mainly composed of the sound of stringed and wind instruments, and they include no prominent melody. In this case, the HPCP descriptor, which describes the harmonic progression very well, performs better than the MLD descriptor. The No. 3 cover set is composed of two versions of "Never Can Say Goodbye" performed by different singers. Both of them include a main melody performed by a female singer, and the accompaniment in these two tracks is weak compared with the vocal sound. In this circumstance, the MLD descriptor performs better than the HPCP descriptor. Since the two tracks in the No. 2 cover set include strong accompaniment and a predominant melody, both the HPCP and MLD descriptors perform well on it. (ii) By comparing the results among Fig. 4 a–c and those among Fig. 4 d–f, we can see that when a single similarity (HPCP-Dmax in this case) cannot distinguish the reference/cover pair (the No. 3 cover set in this case), the early fused similarity (1L-HPCP-QD in this case) can perform very well, and when the two single similarity measures (MLD-Dmax and MLD-Qmax in this case) perform well on the No. 1 cover set, the early fused similarity (1L-MLD-QD in this case) performs even better on it. The underlying reason is that the early fusion method may exploit the complementarity between the Qmax and Dmax in finding alignments. (iii) As shown in Fig. 4 g–i, compared with the 2L-HPCP-Best1 or 2L-MLD-Best1 based schemes, 2L-Best1 performs best on all three cover sets. The possible reason is that the late fusion method may fuse the complementary information in the two early fused similarities (1L-HPCP-QD and 1L-MLD-QD in this case) efficiently. (iv) By comparing the results in Fig. 4 c, f, i, we can see that, compared with the early fusion methods (1L-HPCP-QD and 1L-MLD-QD in this case), the late fusion may further enlarge the gap between the intra-class and inter-class distances, which may result in a higher identification accuracy.

4.3.1 Efficiency of early fusion

To test the validity of the early fusion, the identification accuracy (in terms of MAP, MaRR, TOP10, and MR) and the classification efficiency (in terms of TNR, CA, and ACA) obtained before and after the early fusion are compared in Fig. 5, where HPCP+QD [32] (or MLD+QD, CPCP+QD, and BSC+QD) denotes the early fused similarity based on the HPCP+Qmax [5] (or MLD+Qmax, CPCP+Qmax, and BSC+Qmax) measure and the HPCP+Dmax (or MLD+Dmax, CPCP+Dmax, and BSC+Dmax) measure. We observe that the performances (in terms of all the evaluation measures except for MR) achieved by the early fusion scheme are much better than those of the fused objects (including the scheme proposed in [5]), which verifies that these two similarity measures (Qmax and Dmax) carry complementary information. As shown in [34], the early fusion scheme can integrate this information efficiently because (i) the low-weight edges in each similarity network are cut, which helps to reduce the noise, and (ii) the high-weight edges present in one or both networks are added to the other and the low-weight edges supported by both networks are retained depending on how tightly connected their neighborhoods are across networks, which helps to integrate the common as well as complementary information across the similarity networks.

Fig. 5 Comparison of the identification accuracy in terms of a MAP, b MaRR, c TOP10, and d MR and the classification efficiency in terms of e TNR, f CA, and g ACA before and after early fusion on different datasets. QD: early fusion of the Qmax- and Dmax-based similarities with SNF

4.3.2 Learning method selection

The averages of the magnitudes of the regression coefficients obtained by the group LASSO for each learning method and descriptor considered are plotted in Fig. 6. It can be seen that, among the learning methods considered, BayesNet is the most efficient for both HPCP and MLD.

Fig. 6 Averages of the magnitudes of regression coefficients across the learning methods included for a HPCP and b MLD

In addition, we compare the performances obtained after late fusion under different learning method combinations in Fig. 7, where 2L-Best1 denotes the combination of the classification results of the top learning method for the HPCP+QD similarity and that for the MLD+QD similarity, 2L-Best2 denotes the combination of the classification results of the top two learning methods for the HPCP+QD similarity and those for the MLD+QD similarity, and so on. We observe that (i) the different learning method combinations obtain similar MAP and MaRR performances on all four datasets, and 2L-Best1 performs more stably across different datasets than the other combinations. (ii) 2L-Best1 achieves consistently better performances, in terms of TOP10, TNR, and ACA, than the other combinations on all four datasets. (iii) In Fig. 7 d, the lines with circles, triangles, and squares overlap, which means that the 2L-Best1 scheme performs similarly to the 2L-Best2 and 2L-Best3 schemes and better than the 2L-Best4 and 2L-Best5 schemes in terms of MR. (iv) As shown in Fig. 7 f, 2L-Best1 performs worse than the other four combinations, but the gap is very small (about 0.01%). In general, the 2L-Best1 combination performs much better than the other four combinations, especially when computational complexity is considered. So, 2L-Best1 is adopted to obtain the final fused similarity. Specifically, the mean of the probability-based similarities obtained by the BayesNet classifier for the HPCP-QD similarity and for the MLD-QD similarity is taken as the final fused similarity.

Fig. 7 Comparison of identification accuracy, in terms of a MAP, b MaRR, c TOP10, and d MeanRank, and classification efficiency, in terms of e TNR, f CA, and g ACA, obtained after late fusion based on different learning method combinations on different datasets

4.3.3 Complementarity between different descriptors

Another important question is whether the information carried by different descriptors is complementary. To answer this question, the performances obtained by 2L-HPCP-Best1 and those achieved by the 2L-Best1 combination are compared in Tables 5 and 6. We observe that the 2L-Best1 scheme performs better than the 2L-HPCP-Best1 scheme across all the evaluation measures on all four datasets, except for the CA performance on DB802, where the gap is less than 0.005%. This verifies that the MLD-QD and HPCP-QD similarities carry complementary information. As shown in Tables 5 and 6, a similar conclusion can also be drawn for the CPCP-QD and BSC-QD similarities. Therefore, the combination of different descriptors can help to improve performance.

Table 5 Identification accuracy achieved by different descriptor combinations
Table 6 Classification efficiency achieved by different descriptor combinations

4.3.4 Comparison with state-of-the-art fusion based CSI schemes

In these experiments, the performances of eight fusion techniques are compared on all four datasets. They are the two-layer fusion with the best learning method (2L-Best1), the two-layer fusion when only HPCP is considered (2L-HPCP-Best1), the early fusion based on the HPCP descriptor (1L-HPCP-QD) [32] and on the MLD descriptor (1L-MLD-QD), the schemes proposed in [24, 26], and [31], and a Particle Swarm Optimization (PSO) based one. In the PSO-based scheme, the similarities used in this work, HPCP+Qmax, HPCP+Dmax, MLD+Qmax, and MLD+Dmax, are weighted and added together, and the optimal weight combination is sought by the PSO technique [41]. For the fusion schemes in [26] and [24]4, the fused objects are those provided in [26] and [24], respectively. Unfortunately, we could not obtain the implementation of the pitch salience function used in [31], so the fused objects for [31] are those used in this work.

The comparison results in terms of identification accuracy and classification efficiency are shown in Tables 5 and 6, respectively. It can be seen that (i) for the HPCP-MLD (or BSC-CPCP) based combination, the 2L-Best1 scheme performs much better than the 1L-HPCP-QD [32] (or 1L-BSC-QD) or 1L-MLD-QD (or 1L-CPCP-QD) scheme in terms of all evaluation measures on all four datasets, except for CA on DB802 (where the gap is smaller than 0.005%), which verifies the necessity and validity of the late fusion. (ii) For the HPCP-MLD based combination, the 2L-Best1 scheme performs much better than the state-of-the-art fusion based CSI schemes [24, 26, 31] and the PSO-based one in terms of identification accuracy and classification efficiency on all four datasets, except for the TNR on DB962, where the PSO scheme achieves a higher TNR at the cost of a much lower CA. (iii) For the BSC-CPCP based combination, the 2L-Best1 scheme performs much better than the fusion based CSI scheme in [31] and the PSO-based one in terms of identification accuracy and classification efficiency on all four datasets. However, in some cases, the 2L-Best1 scheme performs worse than the fusion based CSI schemes in [24, 26]. The possible reason is that the number and type of the descriptors fused in [24, 26] are different from those used in the BSC-CPCP based combination.

As shown in Tables 5 and 6, when similar experiments are applied to the CPCP- and BSC-based similarities, similar results are obtained. It should be noted that, since there are many more reference/non-cover pairs than reference/cover pairs in the training set, the classifier tends to label almost all reference/test pairs as reference/non-cover pairs during testing. As a result, almost all CA values in Table 6 are between 98 and 99%.

5 Conclusions

To date, few investigations have addressed describing the similarity between versions by combining different musical descriptors and similarity functions. In this paper, we take a necessary step in this direction, which not only improves performance in terms of identification accuracy and classification efficiency but also improves our understanding of the relationship between different musical descriptors and similarity functions. Two musical descriptors and two different similarity functions are fused by a two-layer fusion model. In the early fusion, the similarities obtained by applying different similarity functions to one music descriptor are fused by the SNF technique. In the late fusion, the early fused similarities based on two different musical descriptors are integrated by mapping the similarities to probability-based similarities with a learning method and then taking the mean value of the results. In addition, the group LASSO technique is adopted to select the best learning method for each kind of early fused similarity before the late fusion, which ensures the fusion efficiency and reduces the computational complexity as well. By incorporating the advantages of early fusion and late fusion, the proposed scheme achieves better performance in terms of identification accuracy and classification efficiency than state-of-the-art fusion-based CSI schemes. Another important advantage of the proposed model is that it is flexible and generic enough to include more musical descriptors and similarity functions to enhance the performance further. However, the disadvantage of the proposed scheme is that it achieves higher CSI identification accuracy at the cost of higher computational complexity.

For future work, considering the similarity between CSI task and Query-By-Humming (QBH) task, we plan to modify the proposed model to make it suitable for the QBH task.

6 Endnotes

1 http://labrosa.ee.columbia.edu/millionsong/secondhand

2 A complete list of tracks included in the music collection and the code of the proposed scheme can be found (http://nchenecust.com/download.html).

3 http://weka.wikispaces.com/

4 http://infiniteseriousness.weebly.com/cover-song-detection.html

Abbreviations

ACA: Average classification accuracy
BN: BayesNet
BSC: Beat-synchronous chroma
CA: Classification accuracy
CC: Cross-correlation
CPCP: Cochlear PCP
CRP: Cross recurrence plot
CSI: Cover Song Identification
DT: DecisionTable
DTW: Dynamic time warping
HPCP: Harmonic PCP
MAP: Mean of average precision
MaRR: Mean averaged reciprocal rank
MIR: Music Information Retrieval
MIREX: Music Information Retrieval Evaluation eXchange
MLD: Melody
MPLPLC: Modified perceptual linear prediction lifted cepstrum
MR: Mean rank
NBU: NaiveBayesUpdateable
PCP: Pitch class profile
PLP: Perceptual linear prediction
PSO: Particle swarm optimization
QBH: Query-By-Humming
QD: Early fusion of Qmax and Dmax similarities based on SNF
RBFN: RBFNetwork
SHS: SecondHandSongs
SNF: Similarity network fusion
TNR: True negative rate
TOP10: Total number of covers identified in top 10

References

1. MA Casey, R Veltkamp, M Goto, M Leman, C Rhodes, M Slaney, Content-based music information retrieval: current directions and future challenges. Proc. IEEE 96(4), 668–696 (2008).
2. J Serrà Julià, Identification of versions of the same musical composition by processing audio descriptions. PhD thesis, Universitat Pompeu Fabra (2011).
3. J Serrà, E Gómez, P Herrera, in Advances in Music Information Retrieval. Audio cover song identification and similarity: background, approaches, evaluation, and beyond (Springer, Berlin, 2010), pp. 307–332.
4. T Fujishima, in Proceedings of the International Computer Music Conference. Realtime chord recognition of musical sound: a system using Common Lisp Music (International Computer Music Association, San Francisco, 1999), pp. 464–467.
5. J Serrà, X Serra, RG Andrzejak, Cross recurrence quantification for cover song identification. New J. Phys. 11(9), 093017 (2009).
6. T-M Chang, E-T Chen, C-B Hsieh, P-C Chang, in Proceedings of the 2013 IEEE 2nd Global Conference on Consumer Electronics (GCCE). Cover song identification with direct chroma feature extraction from AAC files (IEEE, Tokyo, 2013), pp. 55–56.
7. TC Walters, DA Ross, RF Lyon, in International Symposium on Computer Music Modeling and Retrieval, Lecture Notes in Computer Science. The intervalgram: an audio feature for large-scale cover-song recognition (Springer, Berlin, Heidelberg, 2012), pp. 197–213.
8. X Chuan, in Proceedings of the 2012 International Conference on Systems and Informatics (ICSAI). Cover song identification using an enhanced chroma over a binary classifier based similarity measurement framework (IEEE, Yantai, 2012), pp. 2170–2176.
9. DPW Ellis, Identifying 'Cover Songs' with beat-synchronous chroma features. MIREX extended abstract, 1–4 (2006). http://hdl.handle.net/10022/AC:P:13699.
10. M Müller, S Ewert, Towards timbre-invariant audio features for harmony-based music. IEEE Trans. Audio, Speech, Lang. Process. 18(3), 649–662 (2010).
11. J Serrà, E Gómez, P Herrera, X Serra, Chroma binary similarity and local alignment applied to cover song identification. IEEE Trans. Audio, Speech, Lang. Process. 16(6), 1138–1151 (2008).
12. N Chen, JS Downie, H Xiao, Y Zhu, J Zhu, in Proc. ISMIR. Modified perceptual linear prediction liftered cepstrum (MPLPLC) model for pop cover song recognition (the ATIC Research Group of the University of Malaga, Malaga, 2015).
13. N Chen, JS Downie, H-D Xiao, Y Zhu, Cochlear pitch class profile for cover song identification. Appl. Acoust. 99, 92–96 (2015).
14. JS Downie, The music information retrieval evaluation exchange (2005–2007): a window into music information retrieval research. Acoust. Sci. Technol. 29(4), 247–255 (2008).
15. E Gómez, Tonal description of music audio signals. PhD thesis, Universitat Pompeu Fabra (2006).
16. J Serrà, M Zanin, RG Andrzejak, in Proceedings of the 2009 International Society for Music Information Retrieval Conference. Cover song retrieval by cross recurrence quantification and unsupervised set detection (Kobe, 2009), pp. 1–3.
17. W-H Tsai, H-M Yu, H-M Wang, Using the similarity of main melodies to identify cover versions of popular songs for music document retrieval. J. Inform. Sci. Eng. 24(6), 1669–1687 (2008).
18. M Marolt, A mid-level representation for melody-based retrieval in audio collections. IEEE Trans. Multimedia 10(8), 1617–1625 (2008).
19. J Salamon, J Serrà, E Gómez, Tonal representations for music retrieval: from version identification to query-by-humming. Int. J. Multimedia Inform. Retriev. 2(1), 45–58 (2013).
20. CJ Tralie, P Bendich, in Proceedings of the 16th International Society for Music Information Retrieval Conference. Cover song identification with timbral shape sequences (Malaga, 2015), pp. 38–44.
21. F Yang, N Chen, Cover song identification based on cross recurrence plot and local alignment. J. East China Univ. Sci. Technol. 42(2), 247–253 (2016).
22. R Foucard, J-L Durrieu, M Lagrange, G Richard, in Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2010). Multimodal similarity between musical streams for cover version detection (IEEE, Dallas, 2010), pp. 5514–5517.
23. CCS Liem, A Hanjalic, in Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009). Cover song retrieval: a comparative study of system component choices (Kobe, 2009), pp. 573–578.
24. S Ravuri, DP Ellis, in Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2010). Cover song detection: from high scores to general classification (IEEE, Dallas, 2010), pp. 65–68.
25. J Salamon, J Serrà, E Gómez, in Proceedings of the 21st International Conference Companion on World Wide Web. Melody, bass line, and harmony representations for music version identification (ACM, New York, 2012), pp. 887–894.
26. N Chen, H-D Xiao, Similarity fusion scheme for cover song identification. Electron. Lett. 52(13), 1173–1175 (2016).
27. J Friedman, T Hastie, R Tibshirani, A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736 (2010). https://arxiv.org/pdf/1001.0736.pdf.
28. I Bloch, Information fusion in signal and image processing: major probabilistic and non-probabilistic numerical approaches (John Wiley & Sons, Hoboken, 2013).
29. C McKay, I Fujinaga, in ISMIR 2004. Automatic genre classification using large high-level musical feature sets (Audiovisual Institute, Universitat Pompeu Fabra, Barcelona, 2004), pp. 525–530.
30. T Ahonen, et al, Cover song identification using compression-based distance measures. Series of Publications A, Department of Computer Science, University of Helsinki (2016).
31. A Degani, M Dalai, R Leonardi, P Migliorati, in Proceedings of the 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2013). A heuristic for distance fusion in cover song identification (IEEE, Paris, 2013), pp. 1–4.
32. N Chen, Cover song identification based on similarity fusion. MIREX extended abstract (2016). http://www.music-ir.org/mirex/abstracts/2016/CL1.pdf.
33. FA Faria, JA Dos Santos, A Rocha, RdS Torres, A framework for selection and fusion of pattern classifiers in multimedia recognition. Pattern Recognit. Lett. 39, 52–64 (2014).
34. B Wang, AM Mezlini, F Demir, M Fiume, Z Tu, M Brudno, B Haibe-Kains, A Goldenberg, Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11(3), 333–337 (2014).
35. J Osmalsky, J-J Embrechts, P Foster, S Dixon, in Proceedings of the 16th International Society for Music Information Retrieval Conference. Combining features for cover song identification (the ATIC Research Group of the University of Malaga, Malaga, 2015).
36. Z-Z Lan, L Bao, S-I Yu, W Liu, AG Hauptmann, Multimedia classification and event detection using double fusion. Multimedia Tools Appl. 71(1), 333–347 (2014).
37. B Ionescu, J Benois-Pineau, T Piatrik, G Quénot, Fusion in computer vision (Springer International Publishing, Switzerland, 2014).
38. J Salamon, E Gómez, J Bonada, in Proc. of the 14th Int. Conf. on Digital Audio Effects (DAFx-11). Sinusoid extraction and salience function design for predominant melody estimation (Paris, 2011), pp. 73–80.
39. Y Wang, K Han, D Wang, Exploring monaural features for classification-based speech segregation. IEEE Trans. Audio, Speech, Lang. Process. 21(2), 270–279 (2013).
40. J Salamon, Melody extraction from polyphonic music signals. PhD thesis, Universitat Pompeu Fabra (2013).
41. Y Shi, et al, in Proceedings of the 2001 Congress on Evolutionary Computation, vol. 1. Particle swarm optimization: developments, applications and resources (IEEE, Seoul, 2001), pp. 81–86.


Acknowledgements

This work was supported by the National Natural Science Foundation of China [grant number 61271349].

Availability of data and materials

A complete list of tracks included in the music collection and the code of the proposed scheme can be found at http://nchenecust.com/download.html.

Authors’ contributions

NC conceived of the study; participated in the design of the work, data collection, data analysis, interpretation, and coordination; and drafted the manuscript. ML participated in the data collection, data analysis, and interpretation and helped to draft the manuscript. HX participated in the design of the work and critical revision of the article. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Corresponding author: Ning Chen.

Additional information

Ning Chen is the main contributor.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Chen, N., Li, M. & Xiao, H. Two-layer similarity fusion model for cover song identification. J AUDIO SPEECH MUSIC PROC. 2017, 12 (2017). https://doi.org/10.1186/s13636-017-0108-2

