Music classification by low-rank semantic mappings
EURASIP Journal on Audio, Speech, and Music Processing volume 2013, Article number: 13 (2013)
Abstract
A challenging open question in music classification is which music representation (i.e., audio features) and which machine learning algorithm is appropriate for a specific music classification task. To address this challenge, given a number of audio feature vectors for each training music recording that capture the different aspects of music (i.e., timbre, harmony, etc.), the goal is to find a set of linear mappings from several feature spaces to the semantic space spanned by the class indicator vectors. These mappings should reveal the common latent variables, which characterize a given set of classes and simultaneously define a multiclass linear classifier that classifies the extracted latent common features. Such a set of mappings is obtained, building on the notion of the maximum margin matrix factorization, by minimizing a weighted sum of nuclear norms. Since the nuclear norm imposes rank constraints on the learnt mappings, the proposed method is referred to as low-rank semantic mappings (LRSMs). The performance of the LRSMs in music genre, mood, and multilabel classification is assessed by conducting extensive experiments on seven manually annotated benchmark datasets. The reported experimental results demonstrate the superiority of the LRSMs over the classifiers against which they are compared. Furthermore, the best reported classification results are comparable with or slightly superior to those obtained by the state-of-the-art task-specific music classification methods.
1 Introduction
Retail and online music stores usually index their collections by artist or album name. However, people often need to search for music by content. For example, a search facility is offered by emerging music-oriented recommendation services, such as last.fm (http://www.last.fm/) and Pandora (http://www.pandora.com/), where social tags are employed as semantic descriptors of the music content. Social tags are text-based labels, provided by either human experts or amateur users to categorize music with respect to genre, mood, and other semantic tags. The major drawbacks of this approach for the semantic annotation of music content are (1) a newly added music recording must be tagged manually before it can be retrieved [1], which is a time-consuming and expensive process, and (2) unpopular music recordings may not be tagged at all [2]. Consequently, an accurate content-based automatic classification of music should be exploited to mitigate these drawbacks, allowing the deployment of robust music browsing and recommendation engines.
A considerable volume of research in content-based music classification has been conducted so far. The interested reader may refer to [2–5] for a comprehensive survey. Most music classification methods focus on music categorization with respect to genre, mood, or multiple semantic tags. They consist mainly of two stages, namely a music representation stage and a machine learning one. In the first stage, the various aspects of music (i.e., the timbral, the harmonic, the rhythmic content, etc.) are captured by extracting either low- or mid-level features from the audio signal. Such features include timbral texture features, rhythmic features, pitch content, or their combinations, yielding a bag-of-features (BOF) representation [1, 2, 6–18]. Furthermore, spectral, cepstral, and auditory modulation-based features have been recently employed either in BOF approaches or as autonomous music representations in order to capture both the timbral and the temporal structure of music [19–22]. At the machine learning stage, music genre and mood classification are treated as single-label multiclass classification problems. To this end, support vector machines (SVMs) [23], nearest-neighbor (NN) classifiers, Gaussian mixture model-based ones [3], and classifiers relying on sparse and low-rank representations [24] have been employed to classify the audio features into genre or mood classes. In contrast, automatic music tagging (or autotagging) is considered as a multilabel, multiclass classification problem. A variety of algorithms have been exploited in order to associate the tags with the audio features. For instance, music tag prediction may be treated as a set of binary classification problems, where standard classifiers, such as the SVMs [12, 14] or AdaBoost [25], can be applied. Furthermore, probabilistic autotagging systems have been proposed, attempting to infer the correlations or joint probabilities between the tags and the audio features [1, 9, 26].
Despite the existence of many well-performing music classification methods, it is still unclear which music representation (i.e., audio features) and which machine learning algorithm is appropriate for a specific music classification task. A possible explanation for the aforementioned open question is that the classes (e.g., genre, mood, or other semantic classes) in music classification problems are related to and built on some common unknown latent variables, which are different in each problem. For instance, many different songs, although they share instrumentation (i.e., have similar timbral characteristics), convey different emotions and belong to different genres. Furthermore, cover songs, which have the same harmonic content as the originals, may differ in the instrumentation and possibly evoke a different mood, so they are classified into different genres. Therefore, the challenge is to reveal the common latent features based on given music representations, such as timbral, auditory, etc., and to simultaneously learn the models that are appropriate for each specific classification task.
In this paper, a novel, robust, general-purpose music classification method is proposed to address the aforementioned challenge. It is suitable for both single-label (i.e., genre or mood classification) and multilabel (i.e., music tagging) multiclass classification problems, providing a systematic way to handle multiple audio features capturing the different aspects of music. In particular, given a number of audio feature vectors for each training music recording, the goal is to find a set of linear mappings from the feature spaces to the semantic space defined by the class indicator vectors. Furthermore, these mappings should reveal the common latent variables, which characterize a given set of classes and simultaneously define a multiclass linear classifier that classifies the extracted latent common features. Such a model can be derived by building on the notion of the maximum margin matrix factorization [27]. That is, in the training phase, the set of mappings is found by minimizing a weighted sum of nuclear norms. To this end, an algorithm that resorts to the alternating direction augmented Lagrange multiplier method [28] is derived. In the test phase, the class indicator vector for labeling any test music recording is obtained by multiplying each mapping matrix with the corresponding feature vector and then summing the resulting vectors. Since the nuclear norm imposes rank constraints on the learnt mappings, the proposed classification method is referred to as low-rank semantic mappings (LRSMs).
The motivation behind the LRSMs arises from the fact that uncovering hidden shared variables among the classes facilitates the learning process [29]. To this end, various formulations for common latent variable extraction have been proposed for multitask learning [30], multiclass classification [31], collaborative prediction [32], and multilabel classification [33]. The LRSMs differ significantly from the aforementioned methods [29–31, 33] in that the extracted common latent variables come from many different (vector) feature spaces.
The performance of the LRSMs in music genre, mood, and multilabel classification is assessed by conducting experiments on seven manually annotated benchmark datasets. Both the standard evaluation protocols for each dataset and a small sample size setting are employed. The auditory cortical representations [34, 35], the mel-frequency cepstral coefficients [36], and the chroma features [37] were used for music representation. In the single-label case (i.e., genre or mood classification), the LRSMs are compared against three well-known classifiers, namely the sparse representation-based classifier (SRC) [38], the linear SVMs, and the NN classifier with a cosine distance metric. Multilabel extensions of the aforementioned classifiers, namely the multilabel sparse representation-based classifier (MLSRC) [39], the Rank-SVMs [40], and the multilabel k-nearest neighbor (ML-kNN) [41], as well as the parallel factor analysis 2 (PARAFAC2)-based autotagging method [42] are compared with the LRSMs in music tagging. The reported experimental results demonstrate the superiority of the LRSMs over the classifiers against which they are compared. Moreover, the best classification results disclosed are comparable with or slightly superior to those obtained by the state-of-the-art music classification systems.
To summarize, the contributions of the paper are as follows:

A novel method for music classification (i.e., the LRSMs) is proposed that is able to extract the common latent variables that are shared among all the classes and simultaneously learn the models that are appropriate for each specific classification task.

An efficient algorithm for the LRSMs is derived by resorting to the alternating direction augmented Lagrange multiplier method, which is suitable for large-scale data.

The LRSMs provide a systematic way to handle multiple audio features for music classification.

Extensive experiments on seven datasets demonstrate the effectiveness of the LRSMs in music genre, mood, and multilabel classification when the mel-frequency cepstral coefficients (MFCCs), the chroma, and the auditory cortical representations are employed for music representation.
The paper is organized as follows: In Section 2, basic notation conventions are introduced. The audio feature extraction process is briefly described in Section 3. In Section 4, the LRSMs are detailed. Datasets and experimental results are presented in Section 5. Conclusions are drawn in Section 6.
2 Notations
Throughout the paper, matrices are denoted by uppercase boldface letters (e.g., X, L), vectors are denoted by lowercase boldface letters (e.g., x), and scalars appear as either uppercase or lowercase letters (e.g., N, K, i, μ, ϵ). I denotes the identity matrix of compatible dimensions. The i th column of X is denoted as x_{ i }. The set of real numbers is denoted by $\mathbb{R}$, while the set of nonnegative real numbers is denoted by ${\mathbb{R}}_{+}$.
A variety of norms on real-valued vectors and matrices will be used. For example, ∥x∥_{0} is the ℓ_{0} quasi-norm counting the number of nonzero entries in x. The matrix ℓ_{1} norm is denoted by $\parallel \mathbf{X}{\parallel}_{1}=\sum _{i}\sum _{j}\left|{x}_{ij}\right|$. $\parallel \mathbf{X}{\parallel}_{F}=\sqrt{\sum _{i}\sum _{j}{x}_{ij}^{2}}=\sqrt{\text{tr}\left({\mathbf{X}}^{T}\mathbf{X}\right)}$ is the Frobenius norm, where tr(.) denotes the trace of a square matrix. The nuclear norm of X (i.e., the sum of the singular values of X) is denoted by ∥X∥_{∗}. The ℓ_{ ∞ } norm of X, denoted by ∥X∥_{ ∞ }, is defined as the maximum absolute value of the elements of X.
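As a quick illustration of the norms just defined, the following NumPy sketch evaluates each of them on a small matrix. The matrix is an arbitrary example chosen for easy hand-checking, not data from the experiments.

```python
import numpy as np

# Arbitrary 2 x 2 example with singular values 4 and 3.
X = np.array([[3.0, 0.0],
              [0.0, -4.0]])

l0 = np.count_nonzero(X)                          # l0 quasi-norm: number of nonzero entries
l1 = np.abs(X).sum()                              # matrix l1 norm: sum of absolute values
fro = np.sqrt(np.trace(X.T @ X))                  # Frobenius norm via the trace identity
nuc = np.linalg.svd(X, compute_uv=False).sum()    # nuclear norm: sum of singular values
linf = np.abs(X).max()                            # l-infinity norm: largest absolute entry
```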
3 Audio feature extraction
Each music recording is represented by three song-level feature vectors, namely the auditory cortical representations [34, 35], the MFCCs [36], and the chroma features [37]. Although much more elaborate music representations have been proposed in the literature, the just mentioned features perform quite well in practice [14, 22–24]. Most importantly, song-level representations are suitable for large-scale music classification problems since the space complexity for audio processing and analysis is reduced and the database overflow is prevented [3].
3.1 Auditory cortical representations
The auditory cortex plays a crucial role in the hearing process since auditory sensations turn into perception and cognition only when they are processed by the cortical area. Therefore, one should focus on how audio information is encoded in the human primary auditory cortex in order to represent music signals in a psychophysiologically consistent manner [43]. The mechanical and neural processing in the early and central stages of the auditory system can be modeled as a two-stage process. At the first stage, which models the cochlea, the audio signal is converted into an auditory representation by employing the constant-Q transform (CQT). The CQT is a time-frequency representation, where the frequency bins are geometrically spaced and the Q-factors (i.e., the ratios of the center frequencies to the bandwidths) of all bins are equal [44]. The neurons in the primary auditory cortex are organized according to their selectivity in different spectral and temporal stimuli [43]. To this end, in the second stage, the spectral and temporal modulation content of the CQT is estimated by two-dimensional (2D) multiresolution wavelet analysis, ranging from slow to fast temporal rates and from narrow to broad spectral scales. The analysis yields a four-dimensional (4D) representation of time, frequency, rate, and scale that captures the slow spectral and temporal modulation content of audio and is referred to as the auditory cortical representation [34]. Details on the mathematical formulation of the auditory cortical representations can be found in [34, 35].
In this paper, the CQT is computed efficiently by employing the fast implementation scheme proposed in [44]. The audio signal is analyzed by employing 128 constant-Q filters covering eight octaves from 44.9 Hz to 11 kHz (i.e., 16 filters per octave). The magnitude of the CQT is compressed by raising each element of the CQT matrix to the power of 0.1. At the second stage, the 2D multiresolution wavelet analysis is implemented via a bank of 2D Gaussian filters with scales ∈{0.25,0.5,1,2,4,8} (cycles/octave) and (both positive and negative) rates ∈{±2,±4,±8,±16,±32} (Hz). The choice of the just mentioned parameters is based on psychophysiological evidence [34]. For each music recording, the extracted 4D cortical representation is time-averaged, and the 3D rate-scale-frequency cortical representation is obtained. The overall procedure is depicted in Figure 1. Accordingly, each music recording can be represented by a vector $\mathbf{x}\in {\mathbb{R}}_{+}^{7,680}$ by stacking the elements of the 3D cortical representation into a vector. The dimension of the vectorized cortical representation comes from the product of 128 frequency channels, 6 scales, and 10 rates. An ensemble of music recordings is represented by the data matrix $\mathbf{X}\in {\mathbb{R}}_{+}^{7,680\times S}$, where S is the number of the available recordings in each dataset. Finally, the entries of X are post-processed as follows: Each row of X is normalized to the range [0,1] by subtracting the row minimum from each entry and then dividing by the range (i.e., the difference between the row maximum and the row minimum).
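The time-averaging, vectorization, and row-wise normalization described above can be sketched as follows. This is a minimal illustration on random placeholder data; the function names, the axis order (time, frequency, rate, scale), the stacking order, and the handling of constant rows are assumptions and are not specified in the text.

```python
import numpy as np

def vectorize_cortical(rep_4d):
    """Average a (time, frequency, rate, scale) array over time and flatten it."""
    # Assumption: time is the first axis and the stacking order is row-major.
    return rep_4d.mean(axis=0).reshape(-1)

def normalize_rows(X):
    """Rescale every row of X to [0, 1]: (x - row_min) / (row_max - row_min)."""
    row_min = X.min(axis=1, keepdims=True)
    row_range = X.max(axis=1, keepdims=True) - row_min
    # Assumption: constant rows (zero range) are mapped to 0 instead of dividing by zero.
    safe_range = np.where(row_range > 0, row_range, 1.0)
    return (X - row_min) / safe_range

rng = np.random.default_rng(0)
S = 5  # hypothetical number of recordings, 20 time frames each
X = np.column_stack(
    [vectorize_cortical(rng.random((20, 128, 10, 6))) for _ in range(S)]
)
X_norm = normalize_rows(X)  # shape (7680, S), since 128 * 10 * 6 = 7680
```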
3.2 Mel-frequency cepstral coefficients
The MFCCs encode the timbral properties of the music signal by encoding the rough shape of the log-power spectrum on the mel-frequency scale [36]. They exhibit the desirable property that a numerical change in the MFCC coefficients corresponds to a perceptual change. In this paper, MFCC extraction employs frames of 92.9 ms duration with a hop size of 46.45 ms and a bank of 42 band-pass filters. The filters are uniformly spaced on the mel-frequency scale. The correlation between the frequency bands is reduced by applying the discrete cosine transform along the log-energies of the bands, yielding a sequence of 20-dimensional MFCC vectors. By averaging the MFCCs along the time axis, each music recording is represented by a 20-dimensional MFCC vector.
3.3 Chroma features
The chroma features [37] are adept at characterizing the harmonic content of the music signal by projecting the entire spectrum onto 12 bins representing the 12 distinct semitones (or chroma) of a musical octave. They are calculated by employing 92.9 ms frames with a hop size of 23.22 ms as follows: First, the salience of different fundamental frequencies in the range 80 to 640 Hz is calculated. The linear frequency scale is transformed into a musical one by selecting the maximum salience value in each frequency range corresponding to one semitone. Finally, the octave equivalence classes are summed over the whole pitch range to yield a sequence of 12-dimensional chroma vectors.
The chroma as well as the MFCCs, extracted from an ensemble of music recordings, is postprocessed as described in subsection 3.1.
4 Classification by low-rank semantic mappings
Let each music recording be represented by R types of feature vectors of size d_{ r }, ${\mathbf{x}}^{\left(r\right)}\in {\mathbb{R}}^{{d}_{r}}$, r=1,2,…,R. Consequently, an ensemble of N training music recordings is represented by the set {X^{(1)},X^{(2)},…,X^{(R)}}, where ${\mathbf{X}}^{\left(r\right)}=\left[{\mathbf{x}}_{1}^{\left(r\right)},{\mathbf{x}}_{2}^{\left(r\right)},\dots ,{\mathbf{x}}_{N}^{\left(r\right)}\right]\in {\mathbb{R}}^{{d}_{r}\times N}$, r=1,2,…,R. The class labels of the N training samples are represented as indicator vectors forming the matrix L∈{0,1}^{K×N}, where K denotes the number of classes. Clearly, l_{ k n }=1 if the n th training sample belongs to the k th class, and l_{ k n }=0 otherwise. In a multilabel setting, more than one nonzero element may appear in the class indicator vector l_{ n }∈{0,1}^{K}.
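The construction of the class indicator matrix L can be illustrated with a short sketch. The helper name `indicator_matrix` and the use of zero-based class indices are assumptions made for the example only.

```python
import numpy as np

def indicator_matrix(labels, K):
    """Build L in {0,1}^{K x N}; labels[n] lists the (zero-based) classes of sample n."""
    L = np.zeros((K, len(labels)), dtype=int)
    for n, classes in enumerate(labels):
        for k in classes:
            L[k, n] = 1
    return L

# Single-label case: exactly one nonzero entry per column.
single = indicator_matrix([[0], [2], [1]], K=3)
# Multilabel case: a column may contain several nonzero entries.
multi = indicator_matrix([[0, 2], [1], [0, 1, 2]], K=3)
```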
These R different feature vectors characterize different aspects of music (i.e., timbre, rhythm, harmony, etc.), having different properties, and thus, they live in different (vector) feature spaces. Since different feature vectors have different intrinsic discriminative power, an intuitive idea is to combine them in order to improve the classification performance. However, in practice, most of the machine learning algorithms can handle only a single type of feature vectors and thus cannot be naturally applied to multiple features. A straightforward strategy to handle multiple features is to concatenate all the feature vectors into a single feature vector. However, the resulting feature space is rather ad hoc and lacks physical interpretation. It is more reasonable to assume that multiple feature vectors live in a union of feature spaces, which is what the proposed method actually does in a principled way. Leveraging information contained in multiple features can dramatically improve the learning performance as indicated by the recent results in multiview learning [30, 45].
Given a set of (possibly few) training samples along with the associated class indicator vectors, the goal is to learn R mappings ${\mathbf{M}}^{\left(r\right)}\in {\mathbb{R}}^{K\times {d}_{r}}$ from the feature spaces ${\mathbb{R}}^{{d}_{r}}$, r=1,2,…,R, to the label space {0,1}^{K}, having a generalization ability and appropriately utilizing the cross-feature information, so that
$$\mathbf{L}\simeq \sum _{r=1}^{R}{\mathbf{M}}^{\left(r\right)}{\mathbf{X}}^{\left(r\right)}.\qquad (1)$$
As discussed in Section 1, the mappings ${\mathbf{M}}^{\left(r\right)}\in {\mathbb{R}}^{K\times {d}_{r}}$, r=1,2,…,R, should be able to (1) reveal the common latent variables across the classes and (2) predict simultaneously the class memberships based on these latent variables. To do this, we seek ${\mathbf{C}}^{\left(r\right)}\in {\mathbb{R}}^{K\times {p}_{r}}$ and ${\mathbf{F}}^{\left(r\right)}\in {\mathbb{R}}^{{p}_{r}\times {d}_{r}}$, such that ${\mathbf{M}}^{\left(r\right)}={\mathbf{C}}^{\left(r\right)}{\mathbf{F}}^{\left(r\right)}\in {\mathbb{R}}^{K\times {d}_{r}}$, r=1,2,…,R. In this formulation, the rows of F^{(r)} reveal the p_{ r } latent features (variables), and the rows of C^{(r)} are the weights predicting the classes. Clearly, the number p_{ r } of common latent variables and the matrices C^{(r)}, F^{(r)} are unknown and need to be jointly estimated.
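Since the dimensionality of the R latent feature spaces (i.e., p_{ r }) is unknown, inspired by maximum margin matrix factorization [27], we can allow the unknown matrices C^{(r)} to have an unbounded number of columns and F^{(r)}, r=1,2,…,R to have an unbounded number of rows. Here, the matrices C^{(r)} and F^{(r)} are required to be low-norm. This constraint is mandatory because otherwise the resulting linear transform induced by applying first F^{(r)} and then C^{(r)} would degenerate to a single transform. Accordingly, the unknown matrices are obtained by solving the following minimization problem:
$$\underset{\{{\mathbf{C}}^{\left(r\right)},{\mathbf{F}}^{\left(r\right)}\}_{r=1}^{R}}{\operatorname{argmin}}\;\frac{1}{2}{\left\Vert \mathbf{L}-\sum _{r=1}^{R}{\mathbf{C}}^{\left(r\right)}{\mathbf{F}}^{\left(r\right)}{\mathbf{X}}^{\left(r\right)}\right\Vert}_{F}^{2}+\sum _{r=1}^{R}\frac{{\lambda}_{r}}{2}\left(\parallel {\mathbf{C}}^{\left(r\right)}{\parallel}_{F}^{2}+\parallel {\mathbf{F}}^{\left(r\right)}{\parallel}_{F}^{2}\right),\qquad (2)$$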
where λ_{ r }, r=1,2,…,R, are regularization parameters and the least squares loss function $\frac{1}{2}\parallel \mathbf{L}-\sum _{r=1}^{R}{\mathbf{C}}^{\left(r\right)}{\mathbf{F}}^{\left(r\right)}{\mathbf{X}}^{\left(r\right)}{\parallel}_{F}^{2}$ measures the labeling approximation error. It is worth mentioning that the least squares loss function is comparable to other loss functions, such as the hinge loss employed in SVMs [46], since it has been proved to be (universally) Fisher consistent [47]. This property, along with the fact that it leads to the formulation of a tractable optimization problem, motivated us to adopt the least squares loss here. By Lemma 1 in [27], it is known that
$$\parallel {\mathbf{M}}^{\left(r\right)}{\parallel}_{\ast}=\underset{{\mathbf{M}}^{\left(r\right)}={\mathbf{C}}^{\left(r\right)}{\mathbf{F}}^{\left(r\right)}}{min}\;\frac{1}{2}\left(\parallel {\mathbf{C}}^{\left(r\right)}{\parallel}_{F}^{2}+\parallel {\mathbf{F}}^{\left(r\right)}{\parallel}_{F}^{2}\right).\qquad (3)$$
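Thus, based on (3), the optimization problem (2) can be rewritten as
$$\underset{\{{\mathbf{M}}^{\left(r\right)}\}_{r=1}^{R}}{\operatorname{argmin}}\;\frac{1}{2}{\left\Vert \mathbf{L}-\sum _{r=1}^{R}{\mathbf{M}}^{\left(r\right)}{\mathbf{X}}^{\left(r\right)}\right\Vert}_{F}^{2}+\sum _{r=1}^{R}{\lambda}_{r}\parallel {\mathbf{M}}^{\left(r\right)}{\parallel}_{\ast}.\qquad (4)$$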
Therefore, the mappings M^{(r)}, r=1,2,…,R, are obtained by minimizing the weighted sum of their nuclear norms and the labeling approximation error, that is, the nuclear norm-regularized least squares labeling approximation error. Since the nuclear norm is the convex envelope of the rank function [48], the derived mappings between the feature spaces and the semantic space spanned by the class indicator matrix L are low-rank as well. This justifies why the solution of (4) yields low-rank semantic mappings (LRSMs). The LRSMs are strongly related and share the same motivations with the methods in [31] and [32], which have been proposed for multiclass classification and prediction, respectively. In both methods, the nuclear norm-regularized loss is minimized in order to infer relationships between the label vectors and feature vectors. The two key differences between the methods in [31] and [32] and the LRSMs are (1) the LRSMs are able to adequately handle multiple features, drawn from different feature spaces, and (2) the least squares loss function is employed instead of hinge loss, resulting in formulation (4), which can be efficiently solved for large-scale data.
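The variational characterization of the nuclear norm that links the factorized problem to the nuclear norm-regularized one can be checked numerically: the nuclear norm of M equals the minimum of ½(∥C∥_F² + ∥F∥_F²) over all factorizations M = CF, and the minimum is attained by the balanced factorization built from the singular value decomposition. The sketch below uses a random matrix purely for illustration; the variable names are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 6))  # arbitrary K x d example

U, s, Vt = np.linalg.svd(M, full_matrices=False)
nuclear_norm = s.sum()

# Balanced factorization M = C F with C = U sqrt(S) and F = sqrt(S) V^T.
C = U * np.sqrt(s)
F = np.sqrt(s)[:, None] * Vt
balanced = 0.5 * (np.linalg.norm(C, "fro") ** 2 + np.linalg.norm(F, "fro") ** 2)

# Any other factorization of the same M, e.g. a rescaled one, costs at least as much.
C2, F2 = 2.0 * C, 0.5 * F
rescaled = 0.5 * (np.linalg.norm(C2, "fro") ** 2 + np.linalg.norm(F2, "fro") ** 2)
```

Here `balanced` coincides with `nuclear_norm` up to rounding, while `rescaled` is strictly larger even though `C2 @ F2` reproduces the same M.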
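Problem (4) is solved as follows: By introducing the auxiliary variables W^{(r)}, r=1,2,…,R, (4) is equivalent to
$$\underset{\{{\mathbf{M}}^{\left(r\right)},{\mathbf{W}}^{\left(r\right)}\}_{r=1}^{R}}{\operatorname{argmin}}\;\frac{1}{2}{\left\Vert \mathbf{L}-\sum _{r=1}^{R}{\mathbf{M}}^{\left(r\right)}{\mathbf{X}}^{\left(r\right)}\right\Vert}_{F}^{2}+\sum _{r=1}^{R}{\lambda}_{r}\parallel {\mathbf{W}}^{\left(r\right)}{\parallel}_{\ast}\quad \text{subject to}\quad {\mathbf{M}}^{\left(r\right)}={\mathbf{W}}^{\left(r\right)},\;r=1,2,\dots ,R,\qquad (5)$$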
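which can be solved by employing the alternating direction augmented Lagrange multiplier (ADALM) method, a simple but powerful algorithm that is well suited to large-scale optimization problems [28, 49]. That is, (5) is solved by minimizing the augmented Lagrange function [28]
$$\mathcal{L}\left({\mathbf{W}}^{\left(1\right)},\dots ,{\mathbf{W}}^{\left(R\right)},{\mathbf{M}}^{\left(1\right)},\dots ,{\mathbf{M}}^{\left(R\right)},{\mathbf{\Xi}}^{\left(1\right)},\dots ,{\mathbf{\Xi}}^{\left(R\right)}\right)=\frac{1}{2}{\left\Vert \mathbf{L}-\sum _{r=1}^{R}{\mathbf{M}}^{\left(r\right)}{\mathbf{X}}^{\left(r\right)}\right\Vert}_{F}^{2}+\sum _{r=1}^{R}\left({\lambda}_{r}\parallel {\mathbf{W}}^{\left(r\right)}{\parallel}_{\ast}+\text{tr}\left({{\mathbf{\Xi}}^{\left(r\right)}}^{T}\left({\mathbf{M}}^{\left(r\right)}-{\mathbf{W}}^{\left(r\right)}\right)\right)+\frac{\zeta}{2}\parallel {\mathbf{M}}^{\left(r\right)}-{\mathbf{W}}^{\left(r\right)}{\parallel}_{F}^{2}\right),\qquad (6)$$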
where Ξ^{(r)}, r=1,2,…,R, are the Lagrange multipliers and ζ>0 is a penalty parameter. By applying the ADALM, (6) is minimized with respect to each variable in an alternating fashion, and finally, the Lagrange multipliers are updated at each iteration. If only W^{(1)} is varying and all the other variables are kept fixed, we simplify (6) by writing $\mathcal{L}\left({\mathbf{W}}^{\left(1\right)}\right)$ instead of $\mathcal{L}\left({\mathbf{W}}^{\left(1\right)},{\mathbf{W}}^{\left(2\right)},\dots ,{\mathbf{W}}^{\left(R\right)},{\mathbf{M}}^{\left(1\right)},{\mathbf{M}}^{\left(2\right)},\dots ,{\mathbf{M}}^{\left(R\right)},{\mathbf{\Xi}}^{\left(1\right)},{\mathbf{\Xi}}^{\left(2\right)},\dots ,{\mathbf{\Xi}}^{\left(R\right)}\right)$. Let t denote the iteration index. Given ${\mathbf{W}}_{\left[t\right]}^{\left(r\right)},{\mathbf{M}}_{\left[t\right]}^{\left(r\right)}$, r=1,2,…,R, and ζ_{[t]}, the iterative scheme of ADALM for (6) reads as follows:
$${\mathbf{W}}_{[t+1]}^{\left(r\right)}=\underset{{\mathbf{W}}^{\left(r\right)}}{\operatorname{argmin}}\;\mathcal{L}\left({\mathbf{W}}^{\left(r\right)}\right),\qquad (7)$$
$${\mathbf{M}}_{[t+1]}^{\left(r\right)}=\underset{{\mathbf{M}}^{\left(r\right)}}{\operatorname{argmin}}\;\mathcal{L}\left({\mathbf{M}}^{\left(r\right)}\right),\qquad (8)$$
$${\mathbf{\Xi}}_{[t+1]}^{\left(r\right)}={\mathbf{\Xi}}_{\left[t\right]}^{\left(r\right)}+{\zeta}_{\left[t\right]}\left({\mathbf{M}}_{[t+1]}^{\left(r\right)}-{\mathbf{W}}_{[t+1]}^{\left(r\right)}\right).\qquad (9)$$
The solution of (7) is obtained in closed form via the singular value thresholding operator defined for any matrix Q as [50]: ${\mathcal{D}}_{\tau}\left[\mathbf{Q}\right]=\mathbf{U}{\mathcal{S}}_{\tau}\left[\mathbf{\Sigma}\right]{\mathbf{V}}^{T}$ with Q=U Σ V^{T} being the singular value decomposition and ${\mathcal{S}}_{\tau}\left[q\right]=\text{sgn}\left(q\right)max\left(\left|q\right|-\tau ,0\right)$ being the shrinkage operator [51]. The shrinkage operator can be extended to matrices by applying it elementwise. Consequently, ${\mathbf{W}}_{[t+1]}^{\left(r\right)}={\mathcal{D}}_{\frac{{\lambda}_{r}}{{\zeta}_{\left[t\right]}}}\left[{\mathbf{M}}_{\left[t\right]}^{\left(r\right)}+\frac{{\mathbf{\Xi}}_{\left[t\right]}^{\left(r\right)}}{{\zeta}_{\left[t\right]}}\right]$. Problem (8) is an unconstrained least squares problem, which admits a unique closed-form solution, as is indicated in Algorithm 1 summarizing the ADALM method for the minimization of (5). The convergence of Algorithm 1 is just a special case of that of the generic ADALM [28, 49].
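The singular value thresholding operator and one possible realization of the alternating scheme can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not a transcription of Algorithm 1: the function names, the geometric penalty schedule (`rho`, `zeta_max`), the fixed iteration count, and the choice to solve the least squares step jointly for all mappings by stacking the feature matrices are all assumptions made for illustration.

```python
import numpy as np

def svt(Q, tau):
    """Singular value thresholding: shrink the singular values of Q by tau."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def lrsm_adalm(Xs, L, lams, zeta=1.0, rho=1.05, zeta_max=1e4, n_iter=500):
    """Sketch of ADALM for the nuclear norm-regularized least squares problem.

    Xs   : list of R feature matrices, Xs[r] of shape (d_r, N)
    L    : (K, N) class indicator matrix
    lams : list of R regularization parameters
    """
    dims = [X.shape[0] for X in Xs]
    edges = np.cumsum([0] + dims)
    X = np.vstack(Xs)                      # stacked features, (sum d_r) x N
    K, D = L.shape[0], X.shape[0]
    M, W, Xi = (np.zeros((K, D)) for _ in range(3))
    G, LXt = X @ X.T, L @ X.T              # quantities reused in every iteration
    for _ in range(n_iter):
        # W-step: blockwise singular value thresholding of M + Xi / zeta.
        Q = M + Xi / zeta
        for r, lam in enumerate(lams):
            cols = slice(edges[r], edges[r + 1])
            W[:, cols] = svt(Q[:, cols], lam / zeta)
        # M-step: solve M (X X^T + zeta I) = L X^T + zeta W - Xi.
        rhs = LXt + zeta * W - Xi
        M = np.linalg.solve(G + zeta * np.eye(D), rhs.T).T
        # Multiplier update, then increase the penalty (assumed schedule).
        Xi = Xi + zeta * (M - W)
        zeta = min(rho * zeta, zeta_max)
    return [M[:, edges[r]:edges[r + 1]] for r in range(len(Xs))]

# Synthetic check with two feature types and rank-one ground-truth mappings.
rng = np.random.default_rng(0)
Xs = [rng.standard_normal((5, 40)), rng.standard_normal((4, 40))]
M_true = [np.outer(rng.standard_normal(3), rng.standard_normal(d)) for d in (5, 4)]
L = sum(Mt @ X for Mt, X in zip(M_true, Xs))
Ms = lrsm_adalm(Xs, L, lams=[0.01, 0.01])
```

Stacking works because the sum of M^{(r)}X^{(r)} over r equals the product of the horizontally concatenated mappings with the vertically concatenated features, while the nuclear norm penalty is still applied to each block separately.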
The set of the low-rank semantic matrices {M^{(1)},M^{(2)},…,M^{(R)}}, obtained by Algorithm 1, captures the semantic relationships between the label space and the R audio feature spaces. In music classification, the semantic relationships are expected to propagate from the R feature spaces to the label vector space. Therefore, a test music recording can be labeled as follows: Let ${\widehat{\mathbf{x}}}^{\left(r\right)}\in {\mathbb{R}}^{{d}_{r}}$, r=1,2,…,R, be a set of feature vectors extracted from the test music recording and l∈{0,1}^{K} be the class indicator vector of this recording. First, the intermediate class indicator vector $\widehat{\mathbf{l}}\in {\mathbb{R}}^{K}$ is obtained by
$$\widehat{\mathbf{l}}=\sum _{r=1}^{R}{\mathbf{M}}^{\left(r\right)}{\widehat{\mathbf{x}}}^{\left(r\right)}.\qquad (10)$$
Algorithm 1 Solving (5) by the ADALM method
The (final) class indicator vector (i.e., l) has ∥l∥_{0}=v<K, containing 1 in the positions associated with the v largest values in $\widehat{\mathbf{l}}$. Clearly, for single-label multiclass classification, v=1.
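The labeling rule for a test recording can be sketched as follows: each mapping is applied to its feature vector, the results are summed, and the v largest scores are set to 1. The function name and the toy matrices are hypothetical, and v is assumed to be at least 1.

```python
import numpy as np

def predict_indicator(Ms, xs, v=1):
    """Sum the mapped feature vectors and mark the v largest scores with 1 (v >= 1)."""
    scores = sum(M @ x for M, x in zip(Ms, xs))
    l = np.zeros(scores.shape[0], dtype=int)
    l[np.argsort(scores)[-v:]] = 1
    return scores, l

# Toy example with K = 3 classes and R = 2 feature types (d_1 = 2, d_2 = 1).
Ms = [np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]),
      np.array([[0.5], [0.0], [2.0]])]
xs = [np.array([0.2, 0.9]), np.array([0.3])]

scores, l_single = predict_indicator(Ms, xs, v=1)   # scores = [0.35, 0.9, 0.6]
_, l_multi = predict_indicator(Ms, xs, v=2)
```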
4.1 Computational complexity
The dominant cost for each iteration in Algorithm 1 is the computation of the singular value thresholding operator (i.e., step 4), that is, the calculation of the singular vectors of ${\mathbf{M}}_{\left[t\right]}^{\left(r\right)}+\frac{{\mathbf{\Xi}}_{\left[t\right]}^{\left(r\right)}}{{\zeta}_{\left[t\right]}}$ whose corresponding singular values are larger than the threshold $\frac{{\lambda}_{r}}{{\zeta}_{\left[t\right]}}$. Thus, the complexity of each iteration is O(R·d·N^{2}).
Since the computational cost of the LRSMs depends highly on the dimensionality of feature spaces, dimensionality reduction methods can be applied. For computational tractability, dimensionality reduction via random projections is considered. Let the true low dimensionality of the data be denoted by z. Following [52], a random projection matrix, drawn from a normal zero-mean distribution, provides with high probability a stable embedding [53] with the dimensionality of the projection ${d}_{r}^{\prime}$ selected as the minimum value such that ${d}_{r}^{\prime}>2z\,log\left(7,680/{d}_{r}^{\prime}\right)$. Roughly speaking, a stable embedding approximately preserves the Euclidean distances between all vectors in the original space in the feature space of reduced dimensions. In this paper, we propose to estimate z by robust principal component analysis [51] on the high-dimensional training data (e.g., X^{(r)}). That is, the principal component pursuit is solved, which decomposes X^{(r)} into the sum of a low-rank matrix Γ^{(r)} and a sparse matrix of outliers by minimizing a weighted sum of the nuclear norm of the former and the ℓ_{1} norm of the latter.
Then, z is the rank of the outlier-free data matrix Γ^{(r)} [51] and corresponds to the number of its nonzero singular values.
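Once z has been estimated, the projection dimension follows from a simple search for the smallest integer satisfying the stated inequality. The sketch below assumes the natural logarithm and uses a hypothetical value of z; neither is taken from the paper.

```python
import math

def min_projection_dim(z, D=7680):
    """Smallest integer d with d > 2 * z * log(D / d), for original dimension D."""
    # The left side grows and the right side shrinks with d, so a linear scan suffices.
    for d in range(1, D + 1):
        if d > 2 * z * math.log(D / d):
            return d
    return D

z = 100                       # hypothetical rank estimate
d_prime = min_projection_dim(z)
```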
5 Experimental evaluation
5.1 Datasets and evaluation procedure
The performance of the LRSMs in music genre, mood, and multilabel music classification is assessed by conducting experiments on seven manually annotated benchmark datasets for which the audio files are publicly available. In particular, the GTZAN [17], ISMIR, Homburg [54], Unique [16], and 1517Artists [16] datasets are employed for music genre classification, the MTV dataset [15] for music mood classification, and the CAL500 dataset [1] for music tagging. Brief descriptions of these datasets are provided next.
The GTZAN (http://marsyas.info/download/data_sets) consists of 10 genre classes, namely blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock. Each genre class contains 100 excerpts of 30 s duration.
The ISMIR (http://ismir2004.ismir.net/ISMIR_Contest.html) comes from the ISMIR 2004 Genre classification contest and contains 1,458 full music recordings distributed over six genre classes as follows: classical (640), electronic (229), jazz-blues (52), metal-punk (90), rock-pop (203), and world (244), where the number within parentheses refers to the number of recordings which belong to each genre class. Therefore, 43.9% of the music recordings belong to the classical genre.
The Homburg (http://wwwai.cs.unidortmund.de/audio.html) contains 1,886 music excerpts of 10 s length by 1,463 different artists. These excerpts are unequally distributed over nine genres, namely alternative, blues, electronic, folk-country, funk/soul/RnB, jazz, pop, rap/hip-hop, and rock. The largest class is the rap/hip-hop genre containing 26.72% of the music excerpts, while the funk/soul/RnB is the smallest one containing 2.49% of the music excerpts.
The 1517Artists (http://www.seyerlehner.info/index.php?p=1_3_Download) consists of 3,180 full-length music recordings from 1,517 different artists, downloaded free from download.com. The 190 most popular songs, according to the number of total listenings, were selected for each of the 19 genres, i.e., alternative/punk, blues, children’s, classical, comedy/spoken, country, easy listening/vocal, electronic, folk, hip-hop, jazz, latin, new age, RnB/soul, reggae, religious, rock/pop, soundtracks, and world. In this dataset, the music recordings are distributed almost uniformly over the genre classes.
The Unique (http://www.seyerlehner.info/index.php?p=1_3_Download) consists of 3,115 music excerpts of popular and well-known songs, distributed over 14 genres, namely blues, classic, country, dance, electronica, hip-hop, jazz, reggae, rock, schlager (i.e., music hits), soul/RnB, folk, world, and spoken. Each excerpt has 30 s duration. The class distribution is skewed. That is, the smallest class (i.e., spoken music) accounts for 0.83%, and the largest class (i.e., classic) for 24.59% of the available music excerpts.
The MTV (http://www.openaudio.eu/) contains 195 full music recordings with a total duration of 14.2 h from the MTV Europe Most Wanted Top Ten of 20 years (1981 to 2000), covering a wide variety of popular music genres. The ground truth was obtained by five annotators (Rater A to Rater E, four males and one female), who were asked to make a forced binary decision according to the two dimensions in Thayer’s mood plane [55] (i.e., assigning either +1 or −1 for arousal and valence, respectively) according to their mood perception.
The CAL500 (http://cosmal.ucsd.edu/cal/) is a corpus of 500 recordings of Western popular music, each of which has been manually annotated by at least three human annotators, who employ a vocabulary of 174 tags. The tags used in CAL500 dataset annotation span six semantic categories, namely instrumentation, vocal characteristics, genres, emotions, acoustic quality of the song, and usage terms (e.g., ‘I would like to listen to this song while driving’) [1].
Each music recording in the aforementioned datasets was represented by three song-level feature vectors, namely the 20-dimensional MFCCs, the 12-dimensional chroma features, and the auditory cortical representations of reduced dimensions. The dimensionality of the cortical features was reduced via random projections as described in Section 4. In particular, the dimensions of the cortical features after random projections are 1,570 for the GTZAN, 1,391 for the ISMIR, 2,261 for the Homburg, 2,842 for the 1517Artists, 2,868 for the Unique, 518 for the MTV, and 935 for the CAL500 dataset, respectively.
Two sets of experiments in music classification were conducted. First, to compare the performance of the LRSMs with that of the state-of-the-art music classification methods, standard evaluation protocols were applied to the seven datasets. In particular, following [16, 17, 20, 22, 56, 57], stratified 10-fold cross-validation was applied to the GTZAN dataset. Following [15, 16, 54], the same protocol was also applied to the Homburg, Unique, 1517Artists, and MTV datasets. The experiments on the ISMIR 2004 Genre dataset were conducted according to the ISMIR 2004 Audio Description Contest protocol, which defines training and evaluation sets of 729 audio files each. The experiments on music tagging were conducted following the experimental procedure defined in [26]. That is, the 78 tags that have been employed to annotate at least 50 music recordings in the CAL500 dataset were used in the experiments by applying five-fold cross-validation.
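The stratified 10-fold protocol mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `stratified_kfold`, the round-robin assignment, and the balanced toy label set (10 classes with 100 items each) are assumptions introduced here.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=10, seed=0):
    """Return k lists of test indices so that each class is spread evenly over the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin over the folds.
        for pos, idx in enumerate(members):
            folds[(offset + pos) % k].append(idx)
        offset += len(members)
    return folds

# Toy label set (hypothetical): 10 classes with 100 recordings each.
labels = [genre for genre in range(10) for _ in range(100)]
folds = stratified_kfold(labels, k=10)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # A classifier would be trained on train_idx and scored on test_idx here.
```

Each recording appears in exactly one test fold, and every fold preserves the class proportions of the full set as closely as the class sizes allow.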
Fu et al. [3] indicated that the main challenge for future music information retrieval systems is to train music classification systems for large-scale datasets from few labeled data. This situation is very common in practice, since the number of annotated music recordings per class is often limited [3]. To this end, the performance of the LRSMs in music classification given only a few training music recordings is investigated in the second set of experiments. In this small-sample-size setting, only 10% of the available recordings were used for training and the remaining 90% for testing in all datasets except the CAL500. The experiments were repeated 10 times. In music tagging, 20% of the recordings in the CAL500 were used for training and the remaining 80% for testing. This experiment was repeated five times.
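A repeated hold-out evaluation of this kind can be sketched as below. The sketch assumes a plain (non-stratified) random split and assumes that the scores of the repetitions are averaged; neither detail is stated in the text. The names `random_split`, `repeated_holdout`, and `dummy_evaluate` are hypothetical, and the dataset sizes are illustrative.

```python
import random
from statistics import mean

def random_split(n_items, train_fraction, rng):
    """Shuffle the indices and cut them into a training part and a test part."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_train = round(train_fraction * n_items)
    return indices[:n_train], indices[n_train:]

def repeated_holdout(n_items, train_fraction, n_repeats, evaluate, seed=0):
    """Average evaluate(train_idx, test_idx) over n_repeats random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        train_idx, test_idx = random_split(n_items, train_fraction, rng)
        scores.append(evaluate(train_idx, test_idx))
    return mean(scores)

def dummy_evaluate(train_idx, test_idx):
    # Placeholder for "train a classifier, return its test score";
    # here it only reports the fraction of recordings held out for testing.
    return len(test_idx) / (len(train_idx) + len(test_idx))

# 10% training data, 10 repetitions (genre and mood setting).
genre_score = repeated_holdout(1000, 0.10, 10, dummy_evaluate)
# 20% training data, 5 repetitions (tagging setting, 500 recordings).
tag_score = repeated_holdout(500, 0.20, 5, dummy_evaluate)
```

In practice `dummy_evaluate` would be replaced by a function that fits the classifier on the training indices and returns its accuracy or tagging metric on the test indices.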
The LRSMs are compared against three well-known classifiers, namely the SRC [38], the linear SVMs ^{a}, and the NN classifier with a cosine distance metric, in music genre and mood classification by applying the aforementioned experimental procedures. In music tagging, the LRSMs are compared against the multilabel variants of the aforementioned single-label classifiers, namely the MLSRC [39], the RankSVMs [40], and the MLkNN [41], as well as the well-performing PARAFAC2-based autotagging method [42]. The number of neighbors used in the MLkNN was set to 15. The sparse coefficients in the SRC and MLSRC are estimated by the LASSO ^{b} [58].
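As an illustration of the simplest baseline above, a nearest-neighbor classifier with a cosine distance can be sketched as follows. The function names and the two-dimensional toy vectors are hypothetical, and the sketch assumes that no feature vector is all zeros.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nn_classify(train_vectors, train_labels, query):
    """Return the label of the training vector closest to the query."""
    distances = [cosine_distance(x, query) for x in train_vectors]
    best = min(range(len(distances)), key=distances.__getitem__)
    return train_labels[best]

# Toy data (hypothetical): one training vector per class.
train_vectors = [[1.0, 0.0], [0.0, 1.0]]
train_labels = ["rock", "jazz"]
predicted = nn_classify(train_vectors, train_labels, [0.9, 0.2])
```

Because the cosine distance ignores vector length, this baseline is insensitive to the overall scale of a feature vector.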
The performance in music genre and mood classification is assessed by reporting the classification accuracy. Three metrics, namely the mean per-tag precision, the mean per-tag recall, and the F_{1} score, are used to assess the performance of autotagging. These metrics are defined as follows [1]: Per-tag precision is the fraction of music recordings annotated by a method with tag w that are actually labeled with tag w. Per-tag recall is the fraction of music recordings actually labeled with tag w that the method annotates with tag w. The F_{1} score is the harmonic mean of precision and recall. That is, $F_{1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ yields a scalar measure of overall annotation performance. If a tag is never selected for annotation, then following [1, 26], the corresponding precision (which would otherwise be undefined) is set to the prior of the tag in the training set, which equals the performance of a random classifier. In the music tagging experiments, the annotation length, i.e., the number of tags assigned through the class indicator vector returned by the LRSMs as well as the MLSRC, the RankSVMs, the MLkNN, and the PARAFAC2-based autotagging method, was set to 10 as in [1, 26]. That is, each test music recording is annotated with 10 tags. The parameters in the LRSMs have been estimated by employing the method in [59]. That is, for each training set, a validation set (disjoint from the test set) was randomly selected and then used for tuning the parameters (i.e., λ_{r}, r = 1, 2, …, R).
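The tagging metrics defined above can be sketched as follows. This is an illustrative reading, not the evaluation code used in the experiments. Three choices are assumptions made here: the F_{1} score is computed from the mean per-tag precision and the mean per-tag recall, the recall of a tag that never occurs in the ground truth is set to 0, and the tag priors are passed in as a precomputed dictionary. All function and variable names are hypothetical.

```python
def annotate_top_k(tag_scores, k=10):
    """Keep the k highest-scoring tags of one recording (tag_scores: dict tag -> score)."""
    ranked = sorted(tag_scores, key=tag_scores.get, reverse=True)
    return set(ranked[:k])

def mean_per_tag_scores(true_tags, predicted_tags, vocabulary, tag_prior):
    """true_tags and predicted_tags are lists of sets, one set per test recording."""
    precisions, recalls = [], []
    for tag in vocabulary:
        n_predicted = sum(1 for p in predicted_tags if tag in p)
        n_true = sum(1 for t in true_tags if tag in t)
        n_correct = sum(1 for t, p in zip(true_tags, predicted_tags)
                        if tag in t and tag in p)
        # A tag that is never selected falls back to its training-set prior.
        precisions.append(n_correct / n_predicted if n_predicted else tag_prior[tag])
        # Assumption: recall is 0 when the tag has no positive test recordings.
        recalls.append(n_correct / n_true if n_true else 0.0)
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy example (hypothetical tags and priors); "calm" is never predicted.
vocabulary = ["rock", "calm", "guitar"]
true_tags = [{"rock", "guitar"}, {"calm"}, {"rock"}]
predicted_tags = [{"rock", "guitar"}, {"rock"}, {"rock"}]
tag_prior = {"rock": 0.5, "calm": 0.25, "guitar": 0.25}
precision, recall, f1 = mean_per_tag_scores(true_tags, predicted_tags, vocabulary, tag_prior)
```

With an annotation length of 10, `annotate_top_k` would be applied to the scores a method assigns to each test recording before the metrics are computed.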
5.2 Experimental results
In Tables 1, 2, and 3, the experimental results in music genre, mood, and multilabel classification are summarized, respectively. These results have been obtained by applying the standard protocol defined for each dataset. In Tables 4 and 5, music classification results are reported for the case where a small training set is employed. Each classifier is applied to the auditory cortical representations (cortical features) of reduced dimensions, the 20-dimensional MFCCs, the 12-dimensional chroma features, the linear combination of cortical features and MFCCs (fusion cm, i.e., R=2), and the linear combination of all the aforementioned features (fusion cmc, i.e., R=3). Apart from the proposed LRSMs, the competing classifiers handle the fusion of multiple audio features in an ad hoc manner. That is, an augmented feature vector is constructed by stacking the cortical features on top of the 20-dimensional MFCCs and the 12-dimensional chroma features. In the last rows of Tables 1, 2, and 3, the figures of merit for the top-performing music classification methods are included for comparison purposes.
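The ad hoc fusion used by the competing classifiers amounts to concatenating the song-level vectors. A minimal sketch is given below; the function name `stack_features` and the zero-valued placeholder vectors are hypothetical, while the dimensions follow those reported for the GTZAN dataset (1,570 cortical, 20 MFCC, and 12 chroma dimensions).

```python
def stack_features(cortical, mfcc, chroma):
    """Concatenate the three song-level feature vectors into one augmented vector."""
    return list(cortical) + list(mfcc) + list(chroma)

# Placeholder vectors with the GTZAN dimensions quoted in the text.
cortical = [0.0] * 1570
mfcc = [0.0] * 20
chroma = [0.0] * 12
augmented = stack_features(cortical, mfcc, chroma)  # 1570 + 20 + 12 entries
```

Stacking treats all coordinates alike, so the high-dimensional cortical part can dominate the two low-dimensional descriptors, whereas the LRSMs learn a separate mapping for each feature space.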
By inspecting Table 1, the best music genre classification accuracy has been obtained by the LRSMs in four out of five datasets, when all the features have been exploited for music representation. Comparable performance has been achieved by the combination of cortical features and the MFCCs. This is not the case for the Unique dataset, where the SVMs achieve the best classification accuracy when employing the fusion of the cortical features, the MFCCs, and the chroma features. Furthermore, the LRSMs outperform all the classifiers they are compared to when applied to the cortical features. The MFCCs are classified more accurately by the SRC or the SVMs than by the LRSMs. This is because the MFCCs and the chroma features have a low dimensionality, and the LRSMs are not able to extract the appropriate common latent features on which the genre classes are built. The best classification accuracy obtained by the LRSMs on all datasets ranks high compared to that obtained by the majority of music genre classification techniques, as listed in the last rows of Table 1. In particular, for the Homburg, 1517Artists, and Unique datasets, the best accuracy achieved by the LRSMs exceeds that obtained by the state-of-the-art music classification methods. Regarding the GTZAN and ISMIR datasets, it is worth mentioning that the results reported in [20] have been obtained by applying feature aggregation on the combination of four elaborate audio features.
Schuller et al. argued that the two dimensions in Thayer’s mood model, namely arousal and valence, are independent of each other [15]. Therefore, mood classification can reasonably be performed independently in each dimension, as presented in Table 2. That is, each classifier makes binary decisions between excitation and calmness on the arousal scale as well as between negativity and positivity in the valence dimension. Both overall and per-rater music mood classification accuracies are reported. The overall accuracies are the mean accuracies over all raters for all songs in the dataset. The LRSMs outperform the classifiers they are compared to when the cortical features and their fusion with the MFCCs and the chroma features are employed for music representation, yielding higher classification accuracies than those reported in the row entry NONLYR in Tables 12 and 13 of [15], where only audio features are employed. The inclusion of the chroma features does not alter the measured figures of merit. Accordingly, the chroma features could be omitted without any performance deterioration. It is worth mentioning that substantial improvements in the classification accuracy are reported when audio features are combined with lyric features [15]. The overall accuracy achieved by the LRSMs in valence and arousal is considered satisfactory, given the inherent ambiguity in the mood assignments and the realistic nature of the MTV dataset.
The results reported in Table 3 indicate that in music tagging, the LRSMs outperform the MLSRC, the MLkNN, and the PARAFAC2 with respect to per-tag precision, per-tag recall, and F_{1} score for all the music representations employed. The RankSVMs yield the best tagging performance with respect to the F_{1} score and the recall. The cortical features seem to be more appropriate for music annotation than the MFCCs, no matter which annotation method is employed. Although the LRSMs achieve top performance against the state-of-the-art methods with respect to per-tag precision, the reported recall is much smaller than that published for the majority of music tagging methods (last five rows in Table 3). This result is due to the song-level features employed here, which fail to capture the temporal information associated with some tags (e.g., instrumentation). In contrast, the autotagging method that performs well with respect to recall, which is reported in Table 3, employs sequences of audio features for music representation.
In Tables 4 and 5, music classification results obtained in the small-sample-size setting are summarized. These results have been obtained by employing either the fusion of the cortical features, the MFCCs, and the chroma features or the fusion of the former two audio representations. The LRSMs outperform all the classifiers they are compared to in most music classification tasks. The only exceptions are the prediction of valence on the MTV dataset, where the best classification accuracy is achieved by the SRC, and the music genre classification on the Unique dataset, where the top performance is achieved by the SVMs. Given the relatively small number of training music recordings, the results in Tables 4 and 5 are quite acceptable, indicating that the LRSMs are an appealing method for music classification in real-world conditions.
6 Conclusions
The LRSMs have been proposed as a general-purpose music classification method. Given a number of music representations, the LRSMs are able to extract the appropriate features for each specific music classification task, yielding higher performance than the methods they are compared to in most of the experiments. Furthermore, the best classification results obtained by the LRSMs either match or slightly exceed those obtained by the state-of-the-art methods for music genre, mood, and multilabel music classification. The superiority of the auditory cortical representations over the conventional MFCCs and chroma features has also been demonstrated in the three music classification tasks studied. Finally, the LRSMs yield high music classification performance when a small number of training recordings is employed. This result highlights the potential of the proposed method for practical music information retrieval systems.
Endnotes
^{a} The LIBSVM was used in the experiments (http://www.csie.ntu.edu.tw/~cjlin/libsvm/).
^{b} The SPGL1 Matlab solver was used in the implementation of the SRC and the MLSRC (http://www.cs.ubc.ca/~mpf/spgl1/).
Abbreviations
ADALM: Alternating direction augmented Lagrange multiplier
BOF: Bag-of-features
CQT: Constant-Q transform
LRSMs: Low-rank semantic mappings
MFCCs: Mel-frequency cepstral coefficients
MLkNN: Multilabel k-nearest neighbor
MLSRC: Multilabel sparse representation-based classifier
NN: Nearest neighbor
PARAFAC2: Parallel factor analysis 2
SVMs: Support vector machines
2D: Two-dimensional.
References
1. Turnbull D, Barrington L, Torres D, Lanckriet G: Semantic annotation and retrieval of music and sound effects. IEEE Trans. Audio, Speech, Lang. Proc. 2008, 16(2):467-476.
2. Bertin-Mahieux T, Eck D, Mandel M: Automatic tagging of audio: the state-of-the-art. In Machine Audition: Principles, Algorithms and Systems. Edited by: Wang W. Hershey: IGI; 2010:334-352.
3. Fu Z, Lu G, Ting KM, Zhang D: A survey of audio-based music classification and annotation. IEEE Trans. Multimedia 2011, 13(2):303-319.
4. Kim YE, Schmidt EM, Migneco R, Morton BG, Richardson P, Scott J, Speck JA, Turnbull D: Music emotion recognition: a state of the art review. In Proc. 11th Int. Conf. Music Information Retrieval. Utrecht; 9–13 Aug 2010:255-266.
5. Scaringella N, Zoia G, Mlynek D: Automatic genre classification of music content: a survey. IEEE Signal Proc. Mag. 2006, 23(2):133-141.
6. Bergstra J, Mandel M, Eck D: Scalable genre and tag prediction with spectral covariance. In Proc. 11th Int. Conf. Music Inform. Retrieval. Utrecht; 9–13 Aug 2010:507-512.
7. Chen L, Wright P, Nejdl W: Improving music genre classification using collaborative tagging data. In Proc. ACM 2nd Int. Conf. Web Search and Data Mining. Barcelona: ACM; 9–12 Feb 2009:84-93.
8. Chang K, Jang JSR, Iliopoulos CS: Music genre classification via compressive sampling. In Proc. 11th Int. Conf. Music Information Retrieval. Utrecht; 9–13 Aug 2010:387-392.
9. Hoffman M, Blei D, Cook P: Easy as CBA: a simple probabilistic model for tagging music. In Proc. 10th Int. Conf. Music Information Retrieval. Kobe; 26–30 Oct 2009:369-374.
10. Holzapfel A, Stylianou Y: Musical genre classification using nonnegative matrix factorization-based features. IEEE Trans. Audio, Speech, Lang. Proc. 2008, 16(2):424-434.
11. Lukashevich H, Abeßer J, Dittmar C, Grossman H: From multi-labeling to multi-domain-labeling: a novel two-dimensional approach to music genre classification. In Proc. 10th Int. Conf. Music Information Retrieval. Kobe; 26–30 Oct 2009:459-464.
12. Mandel MI, Ellis DPW: Multiple-instance learning for music information retrieval. In Proc. 9th Int. Conf. Music Information Retrieval. Philadelphia; 14–18 Sept 2008:577-582.
13. Miotto R, Barrington L, Lanckriet G: Improving auto-tagging by modeling semantic co-occurrences. In Proc. 11th Int. Conf. Music Information Retrieval. Utrecht; 9–13 Aug 2010:297-302.
14. Ness SR, Theocharis A, Tzanetakis G, Martins LG: Improving automatic music tag annotation using stacked generalization of probabilistic SVM outputs. In Proc. 17th ACM Int. Conf. Multimedia. Beijing; 19–22 Oct 2009:705-708.
15. Schuller B, Hage C, Schuller D, Rigoll G: “Mister D.J., Cheer Me Up!”: musical and textual features for automatic mood classification. J. New Music Res. 2010, 39: 13-34. 10.1080/09298210903430475
16. Seyerlehner K, Widmer G, Pohle T, Knees P: Fusing block-level features for music similarity estimation. In Proc. 13th Int. Conf. Digital Audio Effects. Graz; 6–10 Sept 2010:528-531.
17. Tzanetakis G, Cook P: Musical genre classification of audio signals. IEEE Trans. Speech Audio Proc. 2002, 10(5):293-302. 10.1109/TSA.2002.800560
18. Zhen C, Xu J: Multi-modal music genre classification approach. In Proc. 3rd IEEE Int. Conf. Computer Science and Information Technology. Chengdu; 9–11 July 2010:398-402.
19. Garcia-Garcia D, Arenas-Garcia J, Parrado-Hernandez E, Diaz-de-Maria F: Music genre classification using the temporal structure of songs. In Proc. IEEE 20th Int. Workshop Machine Learning for Signal Processing. Kittila; 29 Aug–1 Sept 2010:266-271.
20. Lee CH, Shih JL, Yu KM, Lin HS: Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features. IEEE Trans. Multimedia 2009, 11(4):670-682.
21. Nagathil A, Gerkmann T, Martin R: Musical genre classification based on a highly-resolved cepstral modulation spectrum. In Proc. 18th European Signal Processing Conf. Aalborg; 23–27 Aug 2010:462-466.
22. Panagakis Y, Kotropoulos C, Arce GR: Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification. IEEE Trans. Audio, Speech, Lang Tech. 2010, 18(3):576-588.
23. Mandel M, Ellis DPW: Song-level features and support vector machines for music classification. In Proc. 6th Int. Conf. Music Information Retrieval. London; 11–15 Sept 2005:594-599.
24. Panagakis Y, Kotropoulos C: Automatic music mood classification via low-rank representation. In Proc. 19th European Signal Processing Conf. Barcelona; 29 Aug–2 Sept 2011:689-693.
25. Bertin-Mahieux T, Eck D, Maillet F, Lamere P: Autotagger: a model for predicting social tags from acoustic features on large music databases. J. New Music Res. 2008, 37(2):115-135. 10.1080/09298210802479250
26. Coviello E, Chan A, Lanckriet G: Time series models for semantic music annotation. IEEE Trans. Audio, Speech, Lang Proc. 2011, 19(5):1343-1359.
27. Srebro N, Rennie JDM, Jaakkola T: Maximum-margin matrix factorization. In 2004 Advances in Neural Information Processing Systems. Vancouver; 13–18 Dec 2004:1329-1336.
28. Bertsekas DP: Constrained Optimization and Lagrange Multiplier Methods. Belmont: Athena Scientific; 1996.
29. Ando RK, Zhang T: A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 2005, 6: 1817-1853.
30. Torralba A, Murphy KP, Freeman WT: Sharing visual features for multiclass and multiview object detection. IEEE Trans. Pattern Anal Mach. Intel. 2007, 29(5):854-869.
31. Amit Y, Fink M, Srebro N, Ullman S: Uncovering shared structures in multiclass classification. In Proc. 24th Int. Conf. on Machine Learning. Corvallis; 20–24 June 2007:17-24.
32. Rennie JDM, Srebro N: Fast maximum margin matrix factorization for collaborative prediction. In Proc. 2005 Int. Conf. Machine Learning. Los Angeles; 15–17 Dec 2005:713-719.
33. Ji S, Tang L, Yu S, Ye J: Extracting shared subspace for multi-label classification. In Proc. 2005 Int. Conf. Machine Learning. Corvallis; 20–24 June 2007.
34. Yang X, Wang K, Shamma SA: Auditory representations of acoustic signals. IEEE Trans. Inf. Theory 1992, 38(2):824-839. 10.1109/18.119739
35. Mesgarani N, Slaney M, Shamma SA: Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations. IEEE Trans. Audio, Speech, Lang Proc. 2006, 14(3):920-930.
36. Logan B: Mel frequency cepstral coefficients for music modeling. In Proc. 1st Int. Symposium Music Information Retrieval. Plymouth; 23–25 Oct 2000.
37. Ryynanen M, Klapuri A: Automatic transcription of melody, bass line, and chords in polyphonic music. Comput Music J. 2008, 32(3):72-86. 10.1162/comj.2008.32.3.72
38. Wright J, Yang A, Ganesh A, Sastry S, Ma Y: Robust face recognition via sparse representation. IEEE Trans. Pattern Anal Mach. Int. 2009, 31(2):210-227.
39. Sakai T, Itoh H, Imiya A: Multi-label classification for image annotation via sparse similarity voting. In Proc. Computer Vision-ACCV 2010 Workshops. Queenstown: Springer; 8–12 Nov 2010:344-353.
40. Elisseeff A, Weston JA: A kernel method for multi-labelled classification. In 14th Advances in Neural Information Processing Systems. Cambridge: MIT; 2002:681-687.
41. Zhang ML, Zhou ZH: ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn. 2007, 40(7):2038-2048. 10.1016/j.patcog.2006.12.019
42. Panagakis Y, Kotropoulos C: Automatic music tagging via PARAFAC2. In Proc. 2011 IEEE Int. Conf. Acoustics, Speech, and Signal Processing. Prague; 22–27 May 2011:481-484.
43. Munkong R, Juang BH: Auditory perception and cognition. IEEE Signal Proc Mag. 2008, 25(3):98-117.
44. Schoerkhuber C, Klapuri A: Constant-Q transform toolbox for music processing. In Proc. 7th Sound and Music Computing Conf. Barcelona; 21–24 July 2010.
45. Memisevic R: On multi-view feature learning. In Proc. 2012 Int. Conf. Machine Learning. Edinburgh; 26 June.
46. Fung GM, Mangasarian OL: Multicategory proximal support vector machine classifiers. Mach Learn. 2005, 59(1-2):77-97. 10.1007/s10994-005-0463-6
47. Zou H, Zhu J, Hastie T: New multicategory boosting algorithms based on multicategory Fisher-consistent losses. Ann Appl. Stat. 2008, 2(4):1290-1306. 10.1214/08-AOAS198
48. Fazel M: Matrix rank minimization with applications. PhD thesis: Department of Electrical Engineering, Stanford University; 2002.
49. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 2011, 3: 1-122.
50. Cai JF, Candes EJ, Shen Z: A singular value thresholding algorithm for matrix completion. SIAM J. Optimization 2009, 2(2):569-592.
51. Candes EJ, Li X, Ma Y, Wright J: Robust principal component analysis? ACM J. 2011, 58(3):1-37.
52. Donoho DL, Tanner J: Counting faces of randomly projected polytopes when the projection radically lowers dimension. J. Am. Math. Soc. 2009, 22: 1-53.
53. Baraniuk R, Cevher V, Wakin MB: Low-dimensional models for dimensionality reduction and signal recovery: a geometric perspective. Proc IEEE 2010, 98(6):959-971.
54. Homburg H, Mierswa I, Moller B, Morik K, Wurst M: A benchmark dataset for audio classification and clustering. In Proc. 6th Int. Conf. Music Information Retrieval. London; 11–15 Sept 2005:528-531.
55. Thayer RE: The Biopsychology of Mood and Arousal. Boston: Oxford University Press; 1989.
56. Bergstra J, Casagrande N, Erhan D, Eck D, Kegl B: Aggregate features and ADABOOST for music classification. Mach Learn. 2006, 65(2-3):473-484. 10.1007/s10994-006-9019-7
57. Tsunoo E, Tzanetakis G, Ono N, Sagayama S: Beyond timbral statistics: improving music classification using percussive patterns and bass lines. IEEE Trans. Audio, Speech, Lang. Proc. 2011, 19(4):1003-1014.
58. Tibshirani R: Regression shrinkage and selection via the LASSO. J. R. Statist. Soc B. 1996, 58: 267-288.
59. Salzberg SL: On comparing classifiers: pitfalls to avoid and a recommended approach. Data Min. Knowl. Discov. 1997, 1(3):317-328. 10.1023/A:1009752403260
60. Aryafar K, Shokoufandeh A: Music genre classification using explicit semantic analysis. In Proc. 1st ACM Int. Workshop Music Information Retrieval with User-Centered and Multimodal Strategies. Scottsdale; 28 Nov–1 Dec 2011:33-38.
61. Osendorfer C, Schluter J, Schmidhuber J, van der Smagt P: Unsupervised learning of low-level audio features for music similarity estimation. In Proc. 28th Int. Conf. Machine Learning. Bellevue; 28 June–2 July 2011.
62. Pampalk E, Flexer A, Widmer G: Improvements of audio-based music similarity and genre classification. In Proc. 6th Int. Conf. Music Information Retrieval. London; 11–15 Sept 2005:628-633.
63. Xie B, Bian W, Tao D, Chordia P: Music tagging with regularized logistic regression. In Proc. 12th Int. Conf. Music Information Retrieval. Miami; 24–28 Oct 2011:711-716.
64. Eck D, Lamere P, Green S, Bertin-Mahieux T: Automatic generation of social tags for music recommendation. 2007.
Acknowledgements
This research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program ‘Education and Lifelong Learning’ of the National Strategic Reference Framework (NSRF) - Research Funding Program: Heraclitus II. Investing in Knowledge Society through the European Social Fund.
Competing interests
Both authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Keywords
 Music classification; Music genre; Music mood; Nuclear norm minimization; Auditory representations