 Research
 Open Access
Piano multipitch estimation using sparse coding embedded deep learning
 Xingda Li^{1},
 Yujing Guan^{1},
 Yingnian Wu^{2} and
 Zhongbo Zhang^{1}
https://doi.org/10.1186/s13636-018-0132-x
© The Author(s) 2018
 Received: 2 November 2017
 Accepted: 2 August 2018
 Published: 12 September 2018
Abstract
As the foundation of many applications, the multipitch estimation problem has long been a focus of acoustic music processing; however, existing algorithms perform poorly due to its complexity. In this paper, we employ deep learning to address the piano multipitch estimation problem by proposing MPENet, built on a novel multimodal sparse incoherent nonnegative matrix factorization (NMF) layer. This layer originates from a multimodal NMF problem with a Lorentzian-Block-Frobenius sparsity constraint and an incoherentness regularization. Experiments show that MPENet achieves state-of-the-art performance (83.65% F-measure for polyphony level 6) on the RAND subset of the MAPS dataset. MPENet enables NMF to do online learning and accomplishes multi-label classification using only monophonic samples as training data. In addition, our layer algorithms can be easily modified and redeveloped for a wide variety of problems.
Keywords
 Multipitch estimation
 Multimodal NMF
 Nonnegative sparse coding
 Nonnegative incoherent dictionary learning
 Deep learning
1 Introduction
Multipitch estimation (MPE, cf. [1–4] and references therein) is the concurrent identification of multiple notes in an acoustic polyphonic music clip, for example, {C_{4},D_{4}}, {E_{0},G_{2},A_{5}}, {F_{3},A_{3},C_{4},E_{4},G_{4},B_{4},D_{5}}^{1}, or other combinations. Generally, it is a prerequisite for Automatic Music Transcription (AMT, [5]), Musical Information Retrieval (MIR, [6]), and many other acoustic music processing applications. It is worth emphasizing that MPE differs from Automatic Chord Estimation (ACE, [7]) in two aspects: (1) note combinations in MPE can be totally random instead of obeying the fixed relationships of ACE, and (2) MPE is a multi-label classification [8] problem while ACE is a single-label one.
Frequency relationships between the overtones of a reference note n and its upper octaves

Harmonics        Interval         Semitones  Variance (cents)
1, 2, 4, 8, 16   Prime (octave)   0          0
17               Minor second     +1         +5
9, 18            Major second     +2         +4
19               Minor third      +3         −2
5, 10, 20        Major third      +4         −14
21               Fourth           +5         −29
11, 22           Tritone          +6         −49
23               Tritone          +6         +28
3, 6, 12, 24     Fifth            +7         +2
25               Minor sixth      +8         −27
13, 26           Minor sixth      +8         +41
27               Major sixth      +9         +6
7, 14, 28        Minor seventh    +10        −31
29               Minor seventh    +10        +30
15, 30           Major seventh    +11        −12
31               Major seventh    +11        +45
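The deviations in the table follow directly from equal temperament: the h-th harmonic lies 1200·log2(h) cents above the fundamental, and its deviation is the distance to the nearest equal-tempered semitone. A short illustrative sketch (not from the paper) that reproduces the table entries:

```python
import math

def nearest_interval(h):
    """Map harmonic number h to its nearest equal-tempered interval.

    Returns (semitones above the reference pitch class, deviation in cents).
    """
    cents = 1200 * math.log2(h)      # cents above the fundamental
    semitones = round(cents / 100)   # nearest equal-tempered step
    return semitones % 12, cents - 100 * semitones

# Harmonic 17 is a minor second (+1 semitone) about 5 cents sharp;
# harmonic 11 is a tritone (+6) about 49 cents flat, matching the table.
print(nearest_interval(17))
print(nearest_interval(11))
```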
Besides, the acoustical characteristics of different instruments make the problem even more difficult: on the one hand, timbre variation results in different overtone magnitude distributions; on the other hand, inharmonicity^{2} leads to various overtone frequency distributions [11]. Pianos are especially harder to deal with than other stringed instruments because of the complicated way their strings are wired. Owing to inharmonicity and its uniqueness on different strings [11], the slight frequency mismatch between the first overtone of a note and its upper octave would cause an interference pattern (a.k.a. acoustic beat) if pianos were tuned in exact equal temperament. To eliminate such beats, pianos are usually tuned individually by well-trained experts (called harmonic tuning; the deviation from exact equal temperament typically forms a Railsback curve [12]).
The other challenge of MPE comes from the complexity of note combination. Strategies for solving multi-label classification generally fall into two categories, "one vs. all" and "one vs. one." Let the class number be n; the former needs n classifiers while the latter needs \(\binom{n}{2} = \frac{n^{2} - n}{2}\). Although this is computationally feasible in most circumstances, the classifiers are trained independently of feature extraction. The lack of supervision in feature extraction may degrade performance, since it is more meaningful for features to minimize classification error rather than reconstruction error [13]. Another existing strategy needs 2^{n} classifiers by encoding multi-labels as single labels. It is only feasible when n is not large; otherwise, one suffers from a dimension explosion. Taking the piano as an example, choosing 7 notes from 88 yields \(\binom{88}{7} \approx 6.3\times 10^{9}\) combinations. Even if only timbre and decay are included, it is almost impossible to construct and train on such a large-scale dataset.
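The classifier counts above are easy to check numerically:

```python
import math

# Number of classifiers for each multi-label strategy over n classes
n = 88                          # piano keys
one_vs_all = n                  # 88 binary classifiers
one_vs_one = math.comb(n, 2)    # (n^2 - n) / 2 = 3828
label_powerset = 2 ** n         # infeasible: ~3.1e26 classifiers

# Size of a hypothetical dataset of all 7-note combinations
print(math.comb(88, 7))         # ~6.3e9 combinations
```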
Moreover, as one of the most commonly used features in acoustic music processing applications, the time-frequency representation is constrained by the uncertainty principle, and algorithm performance may be degraded by such a deficient feature. Meanwhile, recent results have shown that feature fusion from different sensors (namely modalities; one may consider someone's fingerprint and iris, or footage of some action from different angles) has advantages for recognition tasks (cf. [14–16] and references therein). Combined information from multiple sources is more robust and tolerant to noise and errors. Multimodal joint representation under constraints maximizes the utility of different features, which can be exploited more effectively in task-driven scenarios. Note that multimodal features are different from stacking multiple features into one, because the latter does not take the modality relationship into account, and the increased dimensionality brings huge computation and storage costs.

Lorentzian-Block-Frobenius sparsity: A novel norm \(\|\cdot\|_{\mathcal {L}BF, \gamma }\) is imposed on a multimodal NMF model. The penalty is determined by the magnitudes of the class templates across all modalities, so that class sparsity is ensured.

Multimodal sparse incoherent NMF layer: A new deep learning layer based on the above constrained NMF model is presented. Sparse representations, as layer outputs, are computed by Alternating Direction Method of Multipliers (ADMM). Dictionaries, as layer parameters, are updated by Projected Stochastic Gradient Descent (PSGD). Incoherentness is added to the net loss as weight decay. Layer formulation and algorithms are given in Section 3.

Multipitch estimation network (MPENet): Given the decomposition capability of the proposed layer, we employ the "one vs. all" strategy and present a unified deep learning network consisting of a training subnet and a test subnet. Experiments show that the test net achieves state-of-the-art results using only two modalities of monophonic samples as training data. Network details are explained in Section 4.
2 Related work
Owing to the nonnegativity and superposition properties of musical spectra, Nonnegative Matrix Factorization (NMF, [23]) is widely applied in recent acoustic music processing studies. Musical spectral data is decomposed into a dictionary and corresponding coefficients (also referred to as codings, activations, or activities; we use these terms interchangeably according to context). Because NMF algorithms converge in an unsupervised fashion, only rank-1 decomposition makes sense for computational stability and uniqueness. Thus, most NMF-based methods employ a three-step procedure: (1) training note templates individually from samples of each note, (2) constructing a dictionary by concatenating all note templates, and (3) estimating multiple notes by computing the codings with the dictionary fixed. In early studies, each template has only one atom (columns of a dictionary are called atoms). Weninger et al. [24] develop this simple structure by dividing note samples into two parts, onset and decay. Two atoms are then learned from the respective parts, which yields a two-column note template; such a dictionary helps to capture feature variation and distinguish note state over time. O’Hanlon and Plumbley [25] take a further step on dictionary flexibility: note templates are constructed as linear combinations of several predefined fixed narrowband harmonic atoms, and the input spectral data is then approximated under a β-divergence group sparsity constraint. Other methods employing a similar idea with different implementations are proposed in [2, 4, 26–29]. Such procedures use a fixed dictionary to obtain note activations during test, so MPE results depend heavily on the learned note templates, i.e., the training samples. One has to retrain each template once new samples are added to the training set. For other work using NMF with row/group sparsity and incoherent dictionaries, refer to [30–32] and references therein.
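The three-step template procedure can be sketched in a few lines of numpy with the classic Lee-Seung multiplicative updates (Frobenius loss). This is a toy illustration on synthetic nonnegative data, not the configuration used by any of the cited methods:

```python
import numpy as np

def nmf(V, rank, iters=200, seed=0):
    """Lee-Seung multiplicative updates for V ~ W @ H (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    f, t = V.shape
    W = rng.random((f, rank)) + 1e-3
    H = rng.random((rank, t)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Steps 1-2: learn a one-atom template per "note" (here: synthetic
# nonnegative data), then concatenate the templates into a dictionary.
notes = [np.abs(np.random.default_rng(i).normal(size=(64, 20))) for i in range(3)]
D = np.hstack([nmf(V, rank=1)[0] for V in notes])   # 64 x 3 dictionary

# Step 3: with D fixed, estimate activations of a "polyphonic" frame.
def activations(D, v, iters=200):
    h = np.full((D.shape[1], 1), 0.1)
    for _ in range(iters):
        h *= (D.T @ v) / (D.T @ D @ h + 1e-9)
    return h
```

Step 3 is exactly the update on H with W frozen, which is why the learned templates fully determine the test-time behavior.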
Note that there are also studies that use unsupervised NMF instead of training note templates from isolated note samples. Bertin et al. [33] propose a tempering scheme favoring NMF with Itakura-Saito divergence toward global minima. O’Hanlon and Sandler [34] propose an iterative hard thresholding approach for the l_{0} sparse NMF problem with Hellinger distance: ERBT spectrograms of polyphonic music pieces are decomposed directly, and a pitch salience matrix is calculated to detect active notes. For a semi-supervised NMF method, refer to [35].
Many non-NMF-based algorithms have been proposed for the MPE problem. Tolonen and Karjalainen [36] divide the signal into two channels at a fixed frequency and compute the autocorrelation of the low channel and the envelope of the high channel to form the summary autocorrelation function (SACF) and enhanced SACF (ESACF). The SACF and ESACF representations are used to observe the periodicities of the signal and estimate notes. Klapuri [37] calculates a salience representation through a weighted summation of overtone amplitudes; three estimators based on direct, iterative, and joint strategies are proposed to extract notes from the salience function. Emiya et al. [1] employ a probabilistic spectral smoothness principle to iteratively estimate polyphonic content from a set of note candidates; an assumption on the maximum number of concurrent notes (n_{max}=6) is imposed to avoid extracting too many notes. Adalbjörnsson et al. [3] use a fixed dictionary to reconstruct the input signal under a block sparsity constraint, and notes are then identified through coding magnitudes. The fixed dictionary used there, however, is constructed according to the equal-tempered scale, so the algorithm is unsuitable for instruments with inharmonicity.
Deep learning has been used to address the AMT problem in recent papers. Sigtia et al. [38] present a real-time model which introduces recurrent neural networks (RNN) into a convolutional neural network (CNN, with only convolution, pooling, and fully connected layers). Kelz et al. [39] compare the performances of networks with different types of inputs (spectrograms with linearly/logarithmically spaced bins, logarithmically scaled magnitude, and constant-Q transform), layers (dropout and batch normalization), and depths. Hawthorne et al. [40] propose a deep model with bidirectional long short-term memory (BiLSTM) networks and two objective functions (onsets and frames), achieving state-of-the-art performance on MAPS [1] under configuration 2 described in [38]. For more acoustic music processing work using deep learning, refer to [40] and references therein. Note that the deep learning methods listed here all use music pieces as training data, which means polyphonic information can be accessed; hence, a music language model and classifiers are learned simultaneously.
3 Multimodal sparse incoherent NMF layer
3.1 Notation
3.2 Prototype
It enforces dictionaries of different modalities to use the same atom to represent the same event; for example, l_{2,1} encourages collaboration among all modalities, and l_{1,1} imposes extra sparsity within rows.
For the MPE problem, dictionary incoherentness should be imposed to provide the flexibility to model universal note representations without redundancy. As discussed in Section 1, single-atom note templates cannot cover the diversity of music spectra, whereas NMF cannot guarantee stability and uniqueness for multi-atom ones. Because harmonic tuning aggravates overlapping partials, we cannot tell whether a spectral peak is a single note overtone or a summation of several, i.e., it is not feasible to decompose the frequency domain into orthogonal bins according to the center frequencies of harmonic series. In order to detect notes directly from the factorization, a "good" dictionary should be trained under the supervision of data and task, possessing the following properties: (1) note templates are mutually discriminative, and (2) for a certain note, all possible variants can be, and only can be, represented by its templates.
In (3), A_{i□} contains all template coefficients of the ith class across all modalities. The Frobenius norm incorporates the contributions of different modalities; thus, the Lorentzian-Block-Frobenius norm imposes class sparsity instead of row sparsity. The inner product term enforces that the dictionary columns have the least coherentness, which ensures discrimination among class templates as well as linear representation within each template. Note that analytic or direct optimization cannot be done for (3) because it is not jointly convex with respect to (w.r.t.) \(\left \{\mathbf {D}^{i}, i = \mathbb {N}^{m}\right \}\) and A; it is convex w.r.t. either one while the other is fixed. Hence, many alternating schemes ([42, 44, 46, 47]) split (3) into two subproblems, sparse coding and dictionary learning.
3.3 Structure
3.4 Forward pass
(5) can be solved using the Alternating Direction Method of Multipliers (ADMM, cf. [44, 46, 47]); details are given in Algorithm 1 (proofs in the Appendix), where Φ is given in Algorithm 2.
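To illustrate the ADMM splitting structure, here is a minimal sketch for a simpler problem, nonnegative l_{1} sparse coding, rather than the paper's Lorentzian-Block-Frobenius objective: the quadratic term is handled by a ridge solve, the penalty and nonnegativity by a proximal step, and a scaled dual variable ties the two together:

```python
import numpy as np

def admm_nn_sparse_code(D, x, lam=0.1, rho=1.0, iters=300):
    """ADMM for  min_a 0.5||x - D a||^2 + lam||a||_1  s.t. a >= 0."""
    d = D.shape[1]
    G = np.linalg.inv(D.T @ D + rho * np.eye(d))  # reused each iteration
    Dtx = D.T @ x
    z = np.zeros(d)
    u = np.zeros(d)
    for _ in range(iters):
        a = G @ (Dtx + rho * (z - u))             # quadratic subproblem
        z = np.maximum(a + u - lam / rho, 0.0)    # prox: soft threshold + nonneg
        u += a - z                                # scaled dual update
    return z
```

Algorithm 1 in the paper follows the same pattern, with the proximal step replaced by the operator Φ of Algorithm 2.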
3.5 Backward pass
\(\tilde {\mathbf {a}}\) is defined in (7). Then, \(\frac {\partial l_{\text {new}} }{ \partial \mathbf {D}_{p, q} }\) can be computed because \(\frac {\partial \mathbf {a} }{\partial \mathbf {D}_{p, q}}\) can be obtained by taking the derivative w.r.t D_{p,q} on (11) and \(\frac {\partial l_{\text {net}} }{ \partial \mathbf {a} }\) is given by the last layer.
\(i = \mathbb {N}^{m}\), diag(M) is a diagonal matrix whose diagonal entries come from M, and \(\mathbf {U}^{i} \in \mathbb {R}^{d \times d}\).
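When implementing hand-derived backward passes like the one above, a finite-difference check is a standard safeguard. A generic sketch, not specific to the proposed layer (the quadratic example at the end is illustrative):

```python
import numpy as np

def grad_check(f, grad_f, x, eps=1e-6, tol=1e-5):
    """Compare an analytic gradient against central finite differences."""
    g = grad_f(x)
    num = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        num.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)  # central difference
    return np.max(np.abs(g - num)) < tol

# Example: l(a) = 0.5 ||a||^2 has gradient a.
a = np.array([0.3, -1.2, 2.0])
assert grad_check(lambda v: 0.5 * v @ v, lambda v: v, a)
```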
4 MPENet
In this section, we explain the layer and tensor specifics of MPENet in detail. To avoid misunderstandings caused by layer names in different deep learning frameworks (for example, what is commonly called "fully connected" is named "inner product" in Caffe, "linear" in PyTorch, and "dense" in TensorFlow), we give the mathematical expression of some layers where necessary. Meanwhile, to give the most direct idea of how MPENet is constructed, we switch to Caffe terminology accordingly (see Figs. 1 and 2).
4.1 Deep multilabel prediction module (DMP)
The reason for such a structure stems from the property of incoherent dictionaries. If samples of a certain class can be, and only can be, represented by its template atoms, the existence of this class is related only to the coefficient magnitudes, so the "one vs. all" strategy can be employed natively. If the dictionaries are not as good as expected, the cross-entropy loss will correct each "detection" module as well as the dictionaries of the proposed layer through the backward pass, which forms a virtuous circle.
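The "one vs. all" detection on top of coefficient magnitudes can be sketched as per-class sigmoid detectors with a shared threshold. The weights, bias, and threshold below are illustrative, not the paper's exact configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def detect_notes(codings, W, b, t=0.5):
    """Per-note binary detectors on top of sparse codings.

    codings: (d,) coefficient vector; W: (n_notes, d); b: (n_notes,).
    Returns indices of notes whose sigmoid score exceeds threshold t.
    """
    scores = sigmoid(W @ codings + b)
    return np.flatnonzero(scores > t)

def multilabel_bce(scores, labels, eps=1e-12):
    """Cross-entropy loss summed over independent note labels."""
    s = np.clip(scores, eps, 1 - eps)
    return -np.sum(labels * np.log(s) + (1 - labels) * np.log(1 - s))
```

The key point is that each detector is independent, so the loss gradient flows back through each "detection" module separately.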
4.2 Multilabel accuracy layer
5 Experiment results
In this section, we first briefly describe the dataset and features used in our experiment, then illustrate parameter initialization and network configuration in detail. Piano MPE results, experiment results on how MPENet works, timbre robustness results, and AMT results are given at the end of this section.
5.1 Dataset and features
MAPS [1] is a commonly used piano dataset for multipitch estimation and automatic transcription. It contains nine kinds of recording conditions (referred to as "StbgTGd2," "AkPnBsdf," "AkPnBcht," "AkPnCGdD," "AkPnStgb," "SptkBGAm," "SptkBGCl," "ENSTDkAm," and "ENSTDkCl"); two of them ("ENSTDkAm" and "ENSTDkCl") are from real pianos and seven are synthesized by software. Each condition has the same subset hierarchy, which includes ISOL (monophonic recordings), RAND (random combinations), and UCHO (chords).
The ISOL/NO subset, which contains 264 monophonic wav files covering 88 notes (n=88) and 3 loudness levels, is used as the training set. The RAND subset, which contains 6 polyphony levels ranging from 2 to 7 (labeled P2–P7), is used as the test set. Each of P2–P7 has 50 files, and the note combination of each file is generated randomly. In [1], a 93-ms frame starting 10 ms after the onset of each file in P2–P6 is analyzed; for comparison, we conduct a similar evaluation in our experiment. P7 is used as the validation set for parameter tuning.
Feature specifics used in MPENet
Feature  Flen (ms)  Slen (ms)  Minf (Hz)  Maxf (Hz)  Dim  Misc
CQT      23.2       6.5        27.5       7000       576  ppo: 72
STFT                                                 648  nfft: 4096
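The STFT branch of the table can be reproduced approximately with a plain magnitude spectrogram. The sketch below assumes a 44.1-kHz signal and a hop of 287 samples (~6.5 ms); the paper's exact pipeline may differ, and the CQT branch would need a constant-Q transform from a dedicated library, so it is omitted here:

```python
import numpy as np

def stft_feature(y, sr=44100, n_fft=4096, hop=287, fmin=27.5, fmax=7000.0):
    """Magnitude STFT restricted to the [fmin, fmax] Hz band.

    With sr=44100 and n_fft=4096 the bin spacing is ~10.77 Hz, so the
    band keeps 648 bins for these settings, matching the table's STFT Dim.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # (n_fft//2 + 1, n_frames)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    band = (freqs >= fmin) & (freqs <= fmax)
    return spec[band]
```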
5.2 Parameter initialization
Due to the nonconvexity of problem (3), only local minimization can be guaranteed, so the initial value of the dictionaries is crucial for convergence and performance. Totally random initialization makes the codings of the first several epochs meaningless, which wastes time and computational resources. Moreover, because each monophonic file in the training set lasts for over 2 s, many samples are similar to each other during decay, so it is not reasonable to initialize the dictionary with random samples either, as most dictionary learning algorithms do [15]. In our experiment, two procedures are employed to initialize the dictionary.
It is worth emphasizing that (18) is a plain data-driven problem; neither MPENet nor the classifiers are involved at this point. It has nothing to do with deep learning and can be implemented in any language. The form of L extends binary labels to impose classification information, because we have no note probabilities at hand but only codings.
Likewise, LCIDL also needs a good starting dictionary for acceleration. To find it and determine the atom number, a fast clustering algorithm based on density peaks [50] is employed to filter samples hierarchically. Specifically, we first extract 30 cluster centers from each modality of each file in the training set. Then, we stack them according to their note indices, which gives us two matrices with 810 columns for each note (810 = 30 × 3 (loudness) × 9 (recording)). Finally, by computing density peaks on these two matrices and considering the computational and storage overhead, we empirically set the atom number of each note template to 15 (i.e., a=15 in (3)) and obtain a 576×1320 matrix and a 648×1320 matrix for starting LCIDL.
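The density-peaks criterion of [50] selects centers that maximize the product of local density and distance to denser points. A minimal sketch with Euclidean distances and a Gaussian kernel; the cutoff parameter is illustrative, not the value used in our experiments:

```python
import numpy as np

def density_peak_centers(X, k, dc=1.0):
    """Pick k cluster centers by the Rodriguez-Laio density-peaks criterion.

    rho_i: local density (Gaussian kernel with cutoff dc);
    delta_i: distance to the nearest point of higher density;
    centers maximize gamma = rho * delta.
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0   # exclude self-distance
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        delta[i] = d[i, higher].min() if higher.any() else d[i].max()
    gamma = rho * delta
    return np.argsort(gamma)[::-1][:k]
```

On well-separated clusters, the top-gamma points are the density maxima of each cluster, which is what makes the criterion suitable for picking representative template atoms.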
After LCIDL is done, we obtain a "roughly good" dictionary: it has low reconstruction loss, low coherentness, and a coding shape like L, as (18) requires. When the real training of MPENet begins, this dictionary is copied into the MSINMF layer, and the classifiers are initialized randomly. During the first several epochs of training, the learning rate of the classifiers is set larger than that of MSINMF, since we want to hold the codings steady to fit the classifiers first. As the classification error decreases, the learning rates of all layers are made equal for joint learning.
5.3 Network configuration
5.4 MPE results
Precision (%) result with unknown polyphony level
Recall (%) result with unknown polyphony level
            P2     P3     P4     P5     P6
Tolonen     58     43     35     30     28
Tolonen500  77     68     52     45     32
Klapuri     89     88     83     78     62
Emiya       90     82     71     63     47
MPENet      99.00  95.26  87.56  82.15  77.75
Fmeasure (%) result with unknown polyphony level
            P2     P3     P4     P5     P6
Tolonen     53     45     42     38     33
Tolonen500  65     63     57     51     40
Klapuri     91     90     86     81     72
Emiya       93     87     80     75     63
MPENet      97.11  94.25  90.08  86.89  83.65
Fmeasure (%) result with known polyphony level
P2     P3     P4     P5     P6
99.78  94.48  86.91  85.13  80.42
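The metrics in the tables above follow the standard definitions over reference and estimated note sets; a minimal sketch for one frame (the paper's exact evaluation protocol, as in [1], may differ in detail):

```python
def prf(reference, estimated):
    """Precision, recall, and F-measure for one frame of estimated notes."""
    ref, est = set(reference), set(estimated)
    tp = len(ref & est)                       # correctly detected notes
    p = tp / len(est) if est else 0.0         # precision
    r = tp / len(ref) if ref else 0.0         # recall
    f = 2 * p * r / (p + r) if p + r else 0.0 # harmonic mean
    return p, r, f
```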
5.5 How MPENet works
Experiment settings for the ablation comparison: group 1 varies the modality, group 2 the atom number, group 3 joint learning, and group 4 dictionary incoherentness
Group  Name      Modality (m)  Atom number (a)  Incoherentness  Joint learning
1      Config.1  CQT           15               √               √
       Config.2  STFT          15               √               √
2      Config.3  CQT&STFT      5                √               √
       Config.4  CQT&STFT      10               √               √
       Config.5  CQT&STFT      20               √               √
3      Config.6  CQT&STFT      15               √               ×
4      Config.7  CQT&STFT      15               ×               √
       MPENet-d  CQT&STFT      15               √               √
5.5.1 Modality
5.5.2 Atom number
5.5.3 Joint learning
5.5.4 Dictionary incoherentness
The forward and backward algorithms for (19) can be derived according to Algorithms 1 and 3. During training, we find that the training loss stays at a relatively high value (about two orders of magnitude higher than that of MPENet-d). Things do not change even if we reinitialize the parameters or train for several extra epochs. Moreover, during test, the sigmoid outputs of the multi-label accuracy layer for P2–P6 are all below the detection threshold t, so the metrics of Config.7 are all zero and the test fails.
5.6 Timbre robustness
5.7 AMT results
6 Conclusions
In this paper, we propose a new deep learning layer based on an NMF model with multimodal inputs under sparsity and incoherentness constraints. Such "layerization" of an optimization problem makes it possible to learn dictionaries and other features jointly under a unified deep learning framework. It enables modularization, online learning, and parameter fine-tuning for the dictionary learning problem, which simplifies model refactoring and extension. In comparison with the "high-level" features produced by other deep learning layers, the proposed layer learns discriminative and representative dictionaries, so its outputs are more physically meaningful. Experiment results demonstrate that our test net improves MPE performance substantially on the MAPS dataset.
Restricted by hardware, we pay more attention to the layer algorithm and the network structure than to model training. Unlike fully explored and well-tuned deep learning models, MPENet, with its empirical parameters, simple layer combinations, and shallow structure, has plenty of room for improvement. For future work, several directions can be considered: (1) from the layer point of view, performance grows with increasing modality number, and, according to our experiment results, automatic parameter adaptation would also improve the estimation greatly; (2) from the network point of view, regularization, depth, and structure are new focuses for extracting more representative and robust features.
7 Appendix
Proposition 1
a obtained by Algorithm 1 is a minimizer of (5).
Proof
Applying ADMM, the update scheme of (21) is
and KKT conditions are
If ξ<4, i.e., λ<4γ^{2}, (45) is constantly greater than 0, so (43) has only one real root, which can be calculated directly through Cardano's method. If ξ=4, then when \(\gamma ' = \frac {1}{27}\), (45) equals 0 and we have \(\beta = \frac {1}{3}\); otherwise, (43) still has only one real root.
Note that ax^{3}+bx^{2}+cx+d=0 has three different real roots, so its derivative 3ax^{2}+2bx+c has two different real roots, i.e., 4b^{2}−12ac=4A>0 constantly. Finally, for θ∈[0,π], it is easy to find that x_{2}<x_{3}<x_{1}, so we have β_{1}=x_{2} and β_{3}=x_{1}. Substituting the coefficients of (42) into x_{2} and x_{1}, one obtains the equivalent expression described in Algorithm 2.
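The two cases above, one real root via Cardano's formula and three real roots via the trigonometric method, can be sketched generically (this is not the paper's exact Algorithm 2, whose coefficients come from (42)):

```python
import math

def real_cubic_roots(a, b, c, d):
    """Real roots of a x^3 + b x^2 + c x + d = 0 (a != 0).

    Uses the discriminant of the depressed cubic t^3 + p t + q:
    three real roots via the trigonometric method when it is positive,
    otherwise one real root via Cardano's formula.
    """
    b, c, d = b / a, c / a, d / a
    p = c - b * b / 3.0
    q = 2.0 * b**3 / 27.0 - b * c / 3.0 + d
    shift = -b / 3.0
    disc = -4.0 * p**3 - 27.0 * q * q
    if disc > 0:                                  # three distinct real roots
        m = 2.0 * math.sqrt(-p / 3.0)
        theta = math.acos(3.0 * q / (p * m)) / 3.0
        return sorted(m * math.cos(theta - 2.0 * math.pi * k / 3.0) + shift
                      for k in range(3))
    s = math.sqrt(q * q / 4.0 + p**3 / 27.0)      # one real root (Cardano)
    u = math.copysign(abs(-q / 2.0 + s) ** (1 / 3), -q / 2.0 + s)
    v = math.copysign(abs(-q / 2.0 - s) ** (1 / 3), -q / 2.0 - s)
    return [u + v + shift]
```

In the three-root case, the sorted output makes the smallest and largest roots (the x_{2} and x_{1} of the discussion) directly accessible.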
Summing all the above discussion up completes Algorithm 1. □
Proposition 2
\(\left \{ \frac {\partial l_{\text {new}}} {\partial \mathbf {D}^{i}}, i = \mathbb {N}^{m} \right \} \) described in Algorithm 3 is the gradient of l_{new} w.r.t D^{i}.
Proof
where \(i = \mathbb {N}^{f^{k}}\), \(j = \mathbb {N}^{d}\) and \(\mathbf {E}^{k}_{ij} \in \mathbb {R}^{f^{k} \times d}\) denotes an allzero matrix except the (i,j)th element is 1.
where U^{k} is defined in (17). □
For certain stringed instruments, overtones are close to, but not exactly, integer multiples of the fundamental frequency; the degree of departure from whole multiples is called inharmonicity.
Declarations
Acknowledgements
The authors would like to thank the support of the High Performance Computing Center of Changchun Normal University and the Computing Center of Jilin Province. This work is supported by the Project Music Intelligent Analysis of the Education Department of Jilin Province (No. 1105061).
Authors’ contributions
XL initiated the MSINMF algorithm under YG’s supervision, proposed MPENet, implemented MSINMF layer and MPENet by using Caffe, carried out all experiments, and drafted the manuscript. YG and YW participated in algorithm refinement and helped to draft the manuscript. ZZ helped to improve the running time of algorithm and parameter tuning. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 V Emiya, R Badeau, B David, Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. Audio, Speech, Lang. Process.18(6), 1643–1654 (2010). https://doi.org/10.1109/TASL.2009.2038819.
 D Akaue, T Otsuka, K Itoyama, HG Okuno, in Proceedings of the 13th International Society for Music Information Retrieval Conference. Bayesian nonnegative harmonictemporal factorization and its application to multipitch analysis (Porto, Portugal, 2012).
 SI Adalbjörnsson, A Jakobsson, MG Christensen, Multipitch estimation exploiting block sparsity. Sig. Process. 109:, 236–247 (2015). https://doi.org/10.1016/j.sigpro.2014.10.014.
 K Yoshii, M Goto, A nonparametric Bayesian multipitch analyzer based on infinite latent harmonic allocation. IEEE Trans. Audio, Speech, Lang. Process.20(3), 717–730 (2012). https://doi.org/10.1109/TASL.2011.2164530.
 E Benetos, S Dixon, D Giannoulis, H Kirchhoff, A Klapuri, Automatic music transcription: challenges and future directions. J Intell. Inf. Syst.41(3), 407–434 (2013). https://doi.org/10.1007/s10844-013-0258-3.
 JS Downie, Music information retrieval. Annu. Rev. Inf. Sci. Technol.37(1), 295–340 (2005). https://doi.org/10.1002/aris.1440370108.
 M McVicar, R SantosRodriguez, Y Ni, TD Bie, Automatic chord estimation from audio: a review of the state of the art. IEEE/ACM Trans. Audio, Speech, Lang. Process.22(2), 556–575 (2014). https://doi.org/10.1109/taslp.2013.2294580.
 G Tsoumakas, I Katakis, Multilabel classification: an overview. Int. J. Data Warehous. Min.3(3), 1–13 (2007). https://doi.org/10.4018/jdwm.2007070101.
 R Liu, S Li, in 2009 IEEE Youth Conference on Information, Computing and Telecommunication. A review on music source separation, (2009), pp. 343–346. https://doi.org/10.1109/YCICT.2009.5382353.
 R Badeau, V Emiya, B David, in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. Expectationmaximization algorithm for multipitch estimation and separation of overlapping harmonic spectra, (2009), pp. 3073–3076. https://doi.org/10.1109/ICASSP.2009.4960273.
 RW Young, Inharmonicity of plain wire piano strings. J. Acoust. Soc. Am.24(3), 267–273 (1952). https://doi.org/10.1121/1.1906888.
 OL Railsback, Scale temperament as applied to piano tuning. J. Acoust. Soc. Am.9(3), 274–274 (1938). https://doi.org/10.1121/1.1902056.
 S Kong, D Wang, A brief summary of dictionary learning based approach for classification (revised) (2012). http://arxiv.org/abs/1205.6544.
 S Shekhar, VM Patel, NM Nasrabadi, R Chellappa, Joint sparse representation for robust multimodal biometrics recognition. IEEE Trans. Pattern Anal. Mach. Intell.36(1), 113–126 (2014). https://doi.org/10.1109/TPAMI.2013.109.
 S Bahrampour, NM Nasrabadi, A Ray, WK Jenkins, Multimodal taskdriven dictionary learning for image classification. IEEE Trans. Image Process.25(1), 24–38 (2016). https://doi.org/10.1109/TIP.2015.2496275.
 G Monaci, P Jost, P Vandergheynst, B Mailhe, S Lesage, R Gribonval, Learning multimodal dictionaries. IEEE Trans. Image Process.16(9), 2272–2283 (2007). https://doi.org/10.1109/TIP.2007.901813.
 D Yu, L Deng, Deep learning and its applications to signal and information processing [exploratory dsp]. IEEE Sign. Process. Mag.28(1), 145–154 (2011). https://doi.org/10.1109/MSP.2010.939038.
 Y Bengio, Learning deep architectures for AI. Found. Trends Mach. Learn.2(1), 1–127 (2009). https://doi.org/10.1561/2200000006.
 Y LeCun, Y Bengio, G Hinton, Deep learning. Nature. 521(7553), 436–444 (2015). https://doi.org/10.1038/nature14539.
 O Abdel-Hamid, A-r Mohamed, H Jiang, L Deng, G Penn, D Yu, Convolutional neural networks for speech recognition. IEEE Trans. Audio, Speech, Lang. Process.22(10), 1533–1545 (2014). https://doi.org/10.1109/TASLP.2014.2339736.
 SA Raczyński, E Vincent, S Sagayama, Dynamic Bayesian networks for symbolic polyphonic pitch modeling. IEEE Trans. Audio, Speech, Lang. Process.21(9), 1830–1840 (2013). https://doi.org/10.1109/TASL.2013.2258012.
 Y Jia, E Shelhamer, J Donahue, S Karayev, J Long, R Girshick, S Guadarrama, T Darrell, in Proceedings of the 22Nd ACM International Conference on Multimedia. MM ’14. Caffe: Convolutional architecture for fast feature embedding (ACMNew York, 2014), pp. 675–678. https://doi.org/10.1145/2647868.2654889. http://doi.acm.org/10.1145/2647868.2654889.
 DD Lee, HS Seung, Learning the parts of objects by nonnegative matrix factorization. Nature. 401:, 788 (1999).
 F Weninger, C Kirst, B Schuller, HJ Bungartz, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. A discriminative approach to polyphonic piano note transcription using supervised nonnegative matrix factorization, (2013), pp. 6–10. https://doi.org/10.1109/ICASSP.2013.6637598.
 K O’Hanlon, MD Plumbley, in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Polyphonic piano transcription using nonnegative matrix factorisation with group sparsity, (2014), pp. 3112–3116. https://doi.org/10.1109/ICASSP.2014.6854173.
 T Nilsson, SI Adalbjörnsson, NR Butt, A Jakobsson, in 21st European Signal Processing Conference (EUSIPCO 2013). Multipitch estimation of inharmonic signals, (2013), pp. 1–5.
 B Fuentes, R Badeau, G Richard, Harmonic adaptive latent component analysis of audio and application to music transcription. IEEE Trans. Audio, Speech, Lang. Process.21(9), 1854–1866 (2013). https://doi.org/10.1109/TASL.2013.2260741.
 N Boulanger-Lewandowski, Y Bengio, P Vincent, in Proceedings of the 13th International Society for Music Information Retrieval Conference. Discriminative nonnegative matrix factorization for multiple pitch estimation (Porto, Portugal, 2012).
 M Genussov, I Cohen, Multiple fundamental frequency estimation based on sparse representations in a structured dictionary. Digit. Signal Proc.23(1), 390–400 (2013). https://doi.org/10.1016/j.dsp.2012.08.012.
 TST Chan, YH Yang, Informed groupsparse representation for singing voice separation. IEEE Signal Proc. Lett.24(2), 156–160 (2017). https://doi.org/10.1109/LSP.2017.2647810.
 K O’Hanlon, MD Plumbley, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Automatic music transcription using row weighted decompositions, (2013), pp. 16–20. https://doi.org/10.1109/ICASSP.2013.6637600.
 A Lefèvre, F Bach, C Févotte, in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Itakura-Saito nonnegative matrix factorization with group sparsity, (2011), pp. 21–24. https://doi.org/10.1109/ICASSP.2011.5946318.
 N Bertin, C Fevotte, R Badeau, in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. A tempering approach for Itakura-Saito nonnegative matrix factorization, with application to music transcription, (2009), pp. 1545–1548. https://doi.org/10.1109/ICASSP.2009.4959891.
 K O’Hanlon, MB Sandler, in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). An iterative hard thresholding approach to l_{0} sparse Hellinger NMF, (2016), pp. 4737–4741. https://doi.org/10.1109/ICASSP.2016.7472576.
 E Vincent, N Bertin, R Badeau, Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans. Audio, Speech, Lang. Process.18(3), 528–537 (2010). https://doi.org/10.1109/TASL.2009.2034186.
 T Tolonen, M Karjalainen, A computationally efficient multipitch analysis model. IEEE Trans. Audio, Speech, Lang. Process.8(6), 708–716 (2000). https://doi.org/10.1109/89.876309.
 A Klapuri, in Proceedings of the 7th International Conference on Music Information Retrieval. Multiple fundamental frequency estimation by summing harmonic amplitudes (Victoria (BC), Canada, 2006).
 S Sigtia, E Benetos, S Dixon, An endtoend neural network for polyphonic piano music transcription. IEEE Trans. Audio, Speech, Lang. Process.24(5), 927–939 (2016). https://doi.org/10.1109/taslp.2016.2533858.
 R Kelz, M Dorfer, F Korzeniowski, S Böck, A Arzt, G Widmer, On the potential of simple framewise approaches to piano transcription (2016). http://arxiv.org/abs/1612.05153.
 C Hawthorne, E Elsen, J Song, A Roberts, I Simon, C Raffel, J Engel, S Oore, D Eck, Onsets and frames: Dualobjective piano transcription (2017). http://arxiv.org/abs/1710.11153.
 S Bahrampour, A Ray, NM Nasrabadi, KW Jenkins, Qualitybased multimodal classification using treestructured sparsity. 2014 IEEE Conf. Comput. Vision and Pattern Recognition (2014). https://doi.org/10.1109/cvpr.2014.524.
 C Bao, H Ji, Y Quan, Z Shen, Dictionary learning for sparse coding: Algorithms and convergence analysis. IEEE Trans. Pattern Anal. Mach. Intell.38(7), 1356–1369 (2016). https://doi.org/10.1109/TPAMI.2015.2487966.
 J Mairal, F Bach, J Ponce, Taskdriven dictionary learning. IEEE Trans. Pattern Anal.Mach. Intell.34(4), 791–804 (2012). https://doi.org/10.1109/TPAMI.2011.156.
 T Goldstein, S Osher, The split Bregman method for L1-regularized problems. SIAM J. Imaging Sci.2(2), 323–343 (2009). https://doi.org/10.1137/080725891.
 RE Carrillo, KE Barner, Lorentzian iterative hard thresholding: Robust compressed sensing with prior information. IEEE Trans. Sig. Process.61(19), 4822–4833 (2013). https://doi.org/10.1109/TSP.2013.2274275.
 D Han, X Yuan, A note on the alternating direction method of multipliers. J. Optim. Theory Appl.155(1), 227–238 (2012). https://doi.org/10.1007/s10957-012-0003-z.
 J Bolte, S Sabach, M Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program.146(1), 459–494 (2014). https://doi.org/10.1007/s10107-013-0701-9.
 Z Jiang, Z Lin, LS Davis, Label consistent K-SVD: learning a discriminative dictionary for recognition. IEEE Trans. Pattern Anal. Mach. Intell.35(11), 2651–2664 (2013). https://doi.org/10.1109/TPAMI.2013.88.
 J Mairal, F Bach, J Ponce, G Sapiro, Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res.11:, 19–60 (2010).
 A Rodriguez, A Laio, Clustering by fast search and find of density peaks. Science. 344(6191), 1492–1496 (2014). https://doi.org/10.1126/science.1242072.