- Open Access
Musical note analysis of solo violin recordings using recursive regularization
© Lin et al.; licensee Springer 2014
- Received: 17 July 2013
- Accepted: 16 May 2014
- Published: 14 June 2014
Composers may not provide instructions for playing their works, especially for instrument solos, and different musicians may therefore give very different interpretations of the same work. From a signal point of view, such differences usually appear as time, amplitude, or frequency variations of the musical notes in a phrase. This paper proposes a frame-based recursive regularization method for time-dependent analysis of each note present in solo violin recordings. The system of equations evolves as a new frame is added and an old frame is dropped, so that it tracks the varying characteristics of violin playing. This method is compared with a time-dependent non-negative matrix factorization method. The complete recordings of BWV 1005 No. 3 played by Kuijken and of Paganini's 24 Caprices op. 1 no. 24 in A minor are used for the transcription experiment, where the proposed method reaches F-measures of 90.93% and 86.53%, respectively. The analysis results of a short passage extracted from BWV 1005 No. 3 performed by three famous violinists reveal numerous differences in the styles and performances of these violinists.
- Recall Rate
- Intensity Matrix
- Musical Note
- Harmonic Structure
- Input Frame
Analyses of performances are mostly subjective in the domain of musicology. Objective analysis has become possible with advances in information technologies and sound/music analysis tools, such as pitch/partial tracking, score alignment/following, and melody tracking and extraction. Non-negative matrix factorization (NMF) is a popular tool for musical signal analysis tasks such as pitch estimation, chord recognition, and automatic transcription. In NMF, the matrix of the input magnitude spectrum is decomposed into the product of two matrices. One matrix is formed by a certain number of magnitude spectra and is called the template matrix or dictionary matrix. The other matrix holds the intensity information of the notes and is called the intensity matrix or activation matrix. When audio recordings are decomposed, these matrices are, for both NMF and the proposed method, related to several notes, each with a quasi-harmonic spectrum that is active during a specific time period. Furthermore, NMF is usually applied to the Fourier spectrogram, on which time-frequency masking is easy to apply but time-varying sources are hard to extract. Additional models are needed to constrain the decomposition, such as time-dependent parametric and harmonic templates and Markov-chained bases. On the other hand, constant-Q spectrograms make time-frequency masking difficult but show good potential for dealing with the spreading of higher harmonic frequencies as the pitch rises, such as scale invariance across linear frequency and shift invariance across log-frequency. State-of-the-art methods in these areas can be found in the annual Music Information Retrieval Evaluation eXchange (MIREX).
Good results can be achieved when the number of notes and/or the spectral information of the notes is known a priori. In the analysis of polyphonic recordings, the first question is how many notes appear in a single time frame. A harmonic structure is generally desirable, and the spectral bases are usually constrained to be harmonic in applications. In previous works, a fixed number of templates is usually set according to the note range of interest. The pitch of a violin can, however, vary continuously, and fixed pitch templates are unsuitable for the analysis of bowed string instruments. Two issues are therefore addressed in this work: (a) how to determine the exact number of notes and (b) how to model the time-varying notes with suitable templates.
Methods to estimate the possible number of notes have been discussed in previous studies. In one of them, a dynamic note number detection for NMF is proposed to analyze solo bowed string instrument recordings. Since fixed templates are employed in that method, a note whose spectrum varies over time because of playing techniques such as vibrato and portamento tends to be represented by multiple templates. In the time-dependent NMF method, time-dependent parametric and harmonic templates are applied to NMF when the pitch of a note varies. The method provides a parametric representation of the harmonic atoms, which can depend on a fundamental frequency parameter, a chirp parameter, and so on with respect to time. It can therefore represent a time-varying note by using a single template.
Recursive regularization has been widely applied in the areas of system identification, image restoration, noise reduction, echo cancellation, and blind deconvolution. The proposed method decomposes the magnitude spectrogram into the product of a template matrix and an intensity matrix, based on a modified version of previous work in high-resolution image reconstruction. In this work, a new algorithm is developed such that the two matrices are updated whenever a new frame is added and an old frame is dropped. This online scheme is similar to so-called online dictionary learning but does not keep a global dictionary for identified patterns, i.e., musical notes of the same pitch. To analyze violin solo recordings, we regard each note as one single source in this paper. Online dictionary learning has been proposed using L2 norms, KL divergence, and IS divergence. Here, we consider L2 norms to simplify the derivation of the proposed recursive algorithm. Similar to earlier work, the new iterative update procedure also eliminates the matrix inversion operation to reduce the computational complexity. Because the convergence of recursive regularization has been well addressed, interested readers are referred to the related literature. The systematic flow proposed in previous work to find a new note template is modified for the application of this work. The concepts of harmonic and sparseness constraints are also adopted. The proposed method is compared with the time-dependent NMF method by using the complete recordings of BWV 1005 No. 3 played by Kuijken and of Paganini's 24 Caprices op. 1 no. 24 in A minor. Finally, the proposed method is tested using Bach solo violin recordings by three violinists, that is, Arthur Grumiaux, Sigiswald Kuijken, and Hilary Hahn. It is easy to identify the differences in their playing styles when note-by-note spectral and intensity information is available. These differences are discussed in the ‘Results’ section.
The remainder of this paper is organized as follows: The ‘Formulation of regularized analysis system’ section presents the basic formulation of the decomposition problem using the regularization method. The ‘Frame-based recursive regularization analysis’ section presents the frame-based recursive regularization analysis method. We then present some experiments and corresponding results in the ‘Experiments’ and ‘Results’ sections. Lastly, the ‘Conclusions’ section offers the conclusion and a discussion of future work.
The result can be obtained by evaluating (3) and (4) iteratively. Although (1) is similar to NMF in its formulation, (3) and (4), unlike NMF, do not enforce the factorization of a non-negative matrix into two non-negative matrices. Since the goal is to obtain a reasonable distribution of frequency energies, the negative elements of W and H can be set to zero before the equations are re-evaluated, which yields a non-negative result in every iteration. Note that the matrix V is represented as in (1), rather than as V = W H as is common in the NMF-related literature, to make the derivations of the following formulations more readable without loss of generality.
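The idea of alternating between two regularized least-squares updates and zeroing negative entries after each one can be illustrated with a small sketch. It is not a transcription of (3) and (4): it assumes a rank-one model V[f][t] ≈ w[f]·h[t] and an assumed penalty that pulls each update toward the previous estimate with weight `lam`. The function names and the value of `lam` are hypothetical.

```python
def clip_nonneg(values):
    """Set negative entries to zero (the non-negativity step)."""
    return [max(0.0, v) for v in values]


def rank1_regularized_als(V, n_iter=50, lam=0.1):
    """Fit V[f][t] ~ w[f] * h[t] by alternating regularized least squares.

    Assumption: each update is penalized toward the previous estimate
    with weight lam; this only stands in for the penalty of the paper.
    """
    n_freq, n_time = len(V), len(V[0])
    w = [1.0] * n_freq
    h = [1.0] * n_time
    for _ in range(n_iter):
        # Update the template with the intensity fixed, then clip.
        hh = sum(x * x for x in h)
        w = clip_nonneg([
            (sum(V[f][t] * h[t] for t in range(n_time)) + lam * w[f]) / (hh + lam)
            for f in range(n_freq)
        ])
        # Update the intensity with the template fixed, then clip.
        ww = sum(x * x for x in w)
        h = clip_nonneg([
            (sum(V[f][t] * w[f] for f in range(n_freq)) + lam * h[t]) / (ww + lam)
            for t in range(n_time)
        ])
    return w, h


if __name__ == "__main__":
    template = [1.0, 2.0, 0.5]
    intensity = [3.0, 0.0, 1.0, 2.0]
    V = [[a * b for b in intensity] for a in template]
    w, h = rank1_regularized_als(V)
    print(w, h)
```

Because the penalty in this sketch is centred on the previous estimate, it vanishes at convergence and the product w[f]·h[t] reproduces an exactly rank-one input; only the overall scale split between w and h is arbitrary.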
In our experience, the system described by (6) and (7) requires a smaller number of iterations than NMF to converge. Furthermore, the system also requires a smaller number of frames than NMF to obtain reasonably good results. The ‘Experiments’ section presents the simulation results of the proposed method and a comparison to other methods.
The proposed method is designed by considering the following issues. Firstly, it is crucial to determine the exact number of templates to obtain reasonably good results. This issue is widely discussed for conventional NMF-based methods. Secondly, it is crucial to determine how the penalty term in (5) is set. Finally, matrix inversion consumes substantial computing power compared with the gradient descent algorithms used in NMF. These problems are discussed in the following section.
3.1 Refined update rules
Therefore, W(l)=P(l)R(l) and the template matrix for frame-(l+1) can also be calculated by W(l+1)=P(l+1)R(l+1).
The oldest frame is removed by using (22) and (25), and the new input frame is added by using (24) and (26). Hence, the template matrix for each new input frame can be computed recursively without matrix inversion by using the results generated by previous input frames. Since both the frequency response and the intensity of a note evolve slowly over a short time, C_W and C_H are determined by the results of W and H obtained in the previous update iteration, i.e., C_W is updated by C_W(l+1)=W(l).
In (28), C_H⊤(l+1) can be set to H⊤(l) because it is assumed that the intensity cannot change abruptly. The forgetting factors λ and γ determine the effects of old frames. They are both set to 100 in this work. The time-varying template matrix and the corresponding intensity matrix can be calculated alternately when a new input frame is added. Further details and the overall procedure are presented in the following section.
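The principle of adding a new frame and dropping the oldest one without ever inverting a matrix can be sketched with rank-one updates of an inverse (the Sherman-Morrison special case of the Woodbury identity). This is an illustration under assumptions, not the update rules (22) to (26): it assumes that the matrix being tracked is the inverse of δI plus the sum of outer products of the frames in the window, and it omits the forgetting factors. The names `rank_one_update`, `windowed_inverse`, and `delta` are hypothetical.

```python
def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rank_one_update(P, h, sign):
    """Return the inverse of (A + sign * h h^T), given symmetric P = A^-1.

    sign = +1 adds a frame; sign = -1 removes a frame that is currently
    part of A. No matrix inversion is performed.
    """
    Ph = mat_vec(P, h)
    denom = 1.0 + sign * sum(a * b for a, b in zip(h, Ph))
    n = len(h)
    return [[P[i][j] - sign * Ph[i] * Ph[j] / denom for j in range(n)]
            for i in range(n)]


def windowed_inverse(frames, window, delta=1.0):
    """Track inv(delta * I + sum of h h^T over the last `window` frames)."""
    n = len(frames[0])
    P = [[(1.0 / delta if i == j else 0.0) for j in range(n)] for i in range(n)]
    for t, frame in enumerate(frames):
        P = rank_one_update(P, frame, +1.0)               # add the new frame
        if t >= window:
            P = rank_one_update(P, frames[t - window], -1.0)  # drop the oldest
    return P


if __name__ == "__main__":
    frames = [[1.0, 0.5], [0.2, 1.0], [0.7, 0.3]]
    print(windowed_inverse(frames, window=2))
```

The new frame is added before the old one is removed so that the matrix being inverted stays positive definite at every step, which keeps the denominator of the removal step away from zero.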
3.2 Analysis procedure
where I(α,β)=1 in the interval [α,β]; otherwise, it is 0. f_j is the fundamental frequency of the j-th recognized tone, and p is the partial index. ε is set at 3% of the partial frequency, p·f_j.
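A binary harmonic mask built from these quantities can be sketched as follows. The sketch assumes that the interval for partial p is [p·f_j − ε, p·f_j + ε] and that the mask is evaluated on FFT bin centre frequencies; the function names, the number of partials, and the bin grid (taken from the 4,096-point FFT at 44.1 kHz used in the experiments) are illustrative choices.

```python
def indicator(x, lo, hi):
    """I(lo, hi): 1 inside the closed interval [lo, hi], otherwise 0."""
    return 1 if lo <= x <= hi else 0


def harmonic_mask(f0_hz, bin_freqs_hz, n_partials, rel_width=0.03):
    """Binary mask over frequency bins for one note.

    A bin is kept when it lies within rel_width (3%) of some partial
    frequency p * f0_hz, for p = 1 .. n_partials.
    """
    mask = [0] * len(bin_freqs_hz)
    for p in range(1, n_partials + 1):
        centre = p * f0_hz
        eps = rel_width * centre  # tolerance grows with the partial frequency
        for i, f in enumerate(bin_freqs_hz):
            if indicator(f, centre - eps, centre + eps):
                mask[i] = 1
    return mask


if __name__ == "__main__":
    sample_rate, n_fft = 44100, 4096
    bins = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    mask = harmonic_mask(440.0, bins, n_partials=3)
    print([k for k, m in enumerate(mask) if m])
```

Multiplying a template element-wise by such a mask keeps only the energy near the expected partials of one note.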
where is the mask function of frame-l. In (30), represents the original Guard template of frame-l, and ⊗ is the element-wise multiplication. The number of templates, r, is equal to j+1. The procedure described in the previous section is performed again for frame-(l+1) to obtain the new template matrix and intensity matrix.
A re-estimation of the pitch of each note based on the updated template matrix is necessary because all templates, as well as pitches, can vary by frame. Consequently, the mask functions of all templates must be updated. Because each template contains only one harmonic set, C_W(l+1) in Equation 13 is computed by S(l)⊗W(l), where . Based on Equation 5, the iterative update procedure forces W(l+1) to retain harmonic structures for all the notes as much as possible, depending on the regularization parameter, λ.
That is, the i-th note is removed after frame-l if Equation 31 holds. T is empirically set to 0.1 in this work. Removing such notes also reduces the computational complexity.
Two excerpts are generated by a MIDI synthesizer for preliminary tests. In the first test, a synthetic chirp signal is generated whose pitch varies from C5 to A5 over 1,000 time frames. The second test uses a signal with vibrato notes: six single notes in the order E5, D5, C5, B4, A4, and G4, followed by the dyad A4+B4. The vibrato effect is generated by setting proper MIDI commands. All parameter settings are the same as those in the first test. Moreover, two recordings, BWV 1005 No. 3 performed by Kuijken and RWC database C038, are used to evaluate the accuracy of all methods, as proposed in earlier work. The former contains 587 notes that are manually annotated as the ground truth. The latter contains 1,745 notes whose ground truth is provided by the syncRWC annotations.
The window size is 4,096 samples, the hop size is 256 samples, and the sampling rate is 44.1 kHz. A Hamming window and 4096-FFT are subsequently applied.
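The time and frequency resolution implied by these settings can be worked out directly. The short sketch below uses only the stated values; the helper names are hypothetical, and the frame-to-time conversion assumes that the duration is approximated by the number of frames times the hop size. Under that assumption, the 290 frames quoted later for Kuijken's passage correspond to about 1.68 s, which agrees with the text.

```python
SAMPLE_RATE_HZ = 44100  # sampling rate stated in the text
WINDOW_SAMPLES = 4096   # analysis window and FFT length
HOP_SAMPLES = 256       # hop size between consecutive frames


def window_seconds():
    """Duration covered by one analysis window."""
    return WINDOW_SAMPLES / SAMPLE_RATE_HZ


def bin_spacing_hz():
    """Frequency spacing between adjacent FFT bins."""
    return SAMPLE_RATE_HZ / WINDOW_SAMPLES


def frames_to_seconds(n_frames):
    """Approximate duration of n_frames consecutive hops."""
    return n_frames * HOP_SAMPLES / SAMPLE_RATE_HZ


if __name__ == "__main__":
    print(round(window_seconds() * 1000, 1), "ms per window")
    print(round(bin_spacing_hz(), 2), "Hz per bin")
    print(round(frames_to_seconds(290), 2), "s for 290 frames")
```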
4.3 Performance evaluation
An objective measure for evaluating the performance of a source separation method is adopted for the following discussion. To compare different approaches, the signal-to-distortion ratio (SDR), the signal-to-artifact ratio (SAR), and the signal-to-interference ratio (SIR) are computed with each note as the target. In this work, we consider each note as a separate source. The SIR, SAR, and SDR values of these notes are each averaged.
The F-measure is defined as F = 2PR/(P+R), where P and R represent the precision rate and recall rate, respectively.
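The precision, recall, and F-measure values reported in the conclusions are consistent with the usual harmonic-mean definition F = 2PR/(P+R). The sketch below assumes that definition (the function name is hypothetical) and checks it against the two reported result sets.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values reported for Kuijken's BWV 1005 No. 3 and for RWC C038.
    for p, r in [(0.9728, 0.8535), (0.8599, 0.8708)]:
        print(round(100 * f_measure(p, r), 2))
```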
5.1 The first preliminary test: glissando
5.2 The second preliminary test: vibrato
5.3 Main experiments
The system is tested on the complete recordings of BWV 1005 No. 3 played by Kuijken and of Paganini's 24 Caprices op. 1 no. 24 in A minor from the RWC database. A total of 587 + 1,745 = 2,332 notes are annotated as the ground truth. The analysis is performed blindly; however, pitches outside the possible range are excluded. Additional score information is also excluded from the process.
The second recording contains more complicated playing styles and more notes than the first one. False alarms increase markedly compared with the first recording, especially for NMF, because the energies of overlapping partials of voiced and unvoiced notes interfere with each other in the intensity matrix H. The proposed method maintains a balance between precision and recall. Its numbers of unvoiced notes transcribed as voiced and of voiced notes transcribed as unvoiced are lower than those of NMF and TD-NMF, which indicates that our approach is more stable than NMF and TD-NMF.
Numbers of matrix multiplications
(m−k+1)(2nrk + 2r²n + 2r²k + 2rn + 2rk)
+ (2r²n + 2krn + 4r³ + (2k+2)r²)
5.5 Performance analysis
According to the figures, a notable difference is that the pitches of Kuijken’s performance are one semitone lower than those of the other violinists, because period instruments are usually used in historically informed performances (HIP), and musicians usually follow the tradition of the period during which the music was composed. Secondly, violinists use vibrato techniques frequently. The vibrato on the B4 note can be observed in Hahn’s performance (Figure 8) in comparison with those of Grumiaux and Kuijken (Figures 6 and 7, respectively). Thirdly, Kuijken’s style is distinct. He freely used numerous trills, which are not indicated in the score of Bach’s solo violin works. This can be seen in Figure 7b,c. As shown in Figures 6c and 8c, C5 is off and B4 is on. After B4 continues for a period of time, it is off and C5 is on again. Moreover, C5 is the strongest note in Grumiaux’s playing, whereas B4 is the most prominent note in Hahn’s playing. In comparison, Kuijken played these two notes with equal intensity. Hahn’s recording sounds brighter than the other two recordings because the spectral energy of the G3 note in Hahn’s recording is smaller than in the others. This may be a decision of her balance engineer, since the lower notes are weak in amplitude in all of her recordings. A notable mistake is also observed: both Kuijken and Grumiaux played an extra D4 note, which is not in the original score. This may be a coincidence, or they may have used a different score edition. Finally, their tempi also differ considerably. Kuijken took 1.68 s (290 frames) to finish the short passage, whereas Grumiaux took 1.83 s and Hahn took 2.37 s. Hahn’s rendition of the passage therefore takes about 41% longer than Kuijken’s. As an HIP musician, Kuijken played faster than the other violinists.
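To give a sense of scale for the one-semitone offset mentioned above, the following sketch converts note numbers to frequencies. It assumes twelve-tone equal temperament, a reference of A4 = 440 Hz, and MIDI note numbering (A4 = 69, B4 = 71); none of these conventions is stated in the text, and the helper names are hypothetical.

```python
def midi_to_hz(note, a4_hz=440.0):
    """Equal-tempered frequency of a MIDI note number (A4 = 69)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)


def shift_semitones(freq_hz, semitones):
    """Scale a frequency by a signed number of equal-tempered semitones."""
    return freq_hz * 2.0 ** (semitones / 12.0)


if __name__ == "__main__":
    b4 = midi_to_hz(71)
    print(round(b4, 2), "Hz for B4 at A4 = 440 Hz")
    print(round(shift_semitones(b4, -1), 2), "Hz one semitone lower")
```

Under these assumptions, a semitone corresponds to a frequency ratio of about 0.944, that is, roughly a 5.6% drop, which exceeds the 3% tolerance used around each partial when the mask functions are built.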
A recursive regularization analysis method is proposed to analyze acoustic recordings of solo violin works. Similar to NMF, the proposed method factorizes the matrix formed with the Fourier magnitude coefficients of multiple frames into a template matrix and an intensity matrix. The frame-by-frame procedure is designed for time-varying musical signals, such as solo violin recordings. The system of equations is updated by adding a new frame and dropping an old frame to avoid the problems that most NMF methods encounter when the signal varies substantially. The proposed method is compared to the time-dependent NMF method by using two synthesized signals and exhibits superior SDR performance. The objective performance of the proposed method is also verified. For Kuijken’s recording of BWV 1005 No. 3, the precision rate is 97.28%, the recall rate is 85.35%, and the F-measure is 90.93%. For the larger recording from the RWC database, C038, the precision rate is 85.99%, the recall rate is 87.08%, and the F-measure is 86.53%. These results indicate the stability of our approach. Finally, the proposed method is used to analyze the recordings of J.S. Bach’s BWV 1005 No. 3 by three violinists, that is, Arthur Grumiaux, Sigiswald Kuijken, and Hilary Hahn. The results show that the time-varying characteristics of most notes appearing in the recordings can be tracked efficiently. The styles of the three violinists are easily distinguished through the separated results.
We are currently investigating possible approaches to improve the extraction of new notes from the Guard template. An octave error may occur in our case because of the overlapping partials of octave notes. In addition, because the method is based on least squares, it is less effective on signals of small amplitude, such as higher partials, than NMF with other types of cost functions such as KL and IS divergences. Deriving the proposed method for other cost functions may improve performance. Moreover, a supervised learning procedure can be introduced if note activations are available. Note activations can not only eliminate pitch detection errors but also constrain the intensity matrix for each note. As our approach preserves more musical detail at the note level, nearly perfect decomposition may be possible if more constraints, such as timbre, inharmonic bias, and phase, are incorporated. Therefore, many music information retrieval tasks, for example, player identification or expressive remixing, could use our approach as a preprocessing step.
The authors would like to thank the National Science Council, ROC, for its financial support of this work, under contract no. NSC98-2221-E-006-158-MY3.
- Lee DD, Seung HS: Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401(6755):788-791. 10.1038/44565
- P Smaragdis, JC Brown, Non-negative matrix factorization for polyphonic music transcription, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (New Paltz, NY, 19–22 Oct. 2003), pp. 177–180.
- Virtanen T: Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. Audio Speech Lang. Process. IEEE Trans. on 2007, 15(3):1066-1074. 10.1109/TASL.2006.885253
- CT Lee, YH Yang, H Chen, Automatic transcription of piano music by sparse representation of magnitude spectra, in IEEE International Conference on Multimedia and Expo (ICME) (Taipei, Taiwan, 11–15 July 2011), pp. 1–6.
- R Hennequin, R Badeau, B David, Time-dependent parametric and harmonic templates in non-negative matrix factorization, in Proc. of the 13th Int. Conference on Digital Audio Effects (Graz, Austria, 6–10 Sept. 2010).
- M Nakano, JL Roux, H Kameoka, Nonnegative matrix factorization with Markov-chained bases for modeling time-varying patterns in music spectrograms, in LVA/ICA’10 Proceedings of the 9th International Conference on Latent Variable Analysis and Signal Separation (St. Malo, France, 27–30 Sept. 2010), pp. 149–156.
- R Hennequin, Scale-invariant probabilistic latent component analysis, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (New Paltz, NY, 16–19 Oct. 2011).
- Smaragdis P: Relative-pitch tracking of multiple arbitrary sounds. J. Acoust. Soc. Am. 2009, 125: 3406-3413. 10.1121/1.3106529
- MIREX, Music Information Retrieval Evaluation eXchange (MIREX). http://www.music-ir.org/mirex/wiki/MIREX_HOME. Accessed 19 May 2011.
- Bertin N, Badeau R, Vincent E: Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription. Audio Speech Lang. Process. IEEE Trans. on 2010, 18(3):538-549. 10.1109/TASL.2010.2041381
- Yeh C, Roebel A, Rodet X: Multiple fundamental frequency estimation and polyphony inference of polyphonic music signals. Audio Speech Lang. Process. IEEE Trans. on 2010, 18(6):1116-1126. 10.1109/TASL.2009.2030006
- WC Chang, WY Su, C Yeh, A Roebel, X Rodet, Multiple-F0 tracking based on a high-order HMM model, in Proc. of the 11th Int. Conference on Digital Audio Effects (DAFx-08) (Espoo, Finland, 1–4 Sept. 2008).
- TM Wang, YL Chen, WH Liao, A Su, Analysis and trans-synthesis of acoustic bowed-string instrument recordings–a case study using Bach cello suites, in International Conference on Digital Audio Effects (DAFX) (IRCAM, Paris, France, 19–23 Sept. 2011).
- Unser M, Aldroubi A, Eden M: Recursive regularization filters: design, properties, and applications. IEEE Trans. on Pattern Anal. Mach. Intell 1991, 13(3):272-277. 10.1109/34.75514
- Nesta F, Svaizer P, Omologo M: Convolutive BSS of short mixtures by ICA recursively regularized across frequencies. Audio Speech Lang. Process. IEEE Trans. on 2011, 19(3):624-639. 10.1109/TASL.2010.2053027
- Kim SP, Su WY: Recursive high-resolution reconstruction of blurred multiframe images. Image Process. IEEE Trans. on 1993, 2(4):534-539. 10.1109/83.242363
- Mairal J, Bach F, Ponce J, Sapiro G: Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res 2010, 11: 19-60.
- F Wang, C Tan, AC König, P Li, Efficient document clustering via online nonnegative matrix factorizations, in Proc. SIAM, Hilton Phoenix East/Mesa (Mesa, USA, 28–30 Apr 2011).
- Duan Z, Mysore G, Smaragdis P: Online PLCA for real-time semi-supervised source separation. Latent Variable Anal. Signal Sep 2012, 7191: 34-41. 10.1007/978-3-642-28551-6_5
- A Lefevre, F Bach, C Févotte, Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence, in Proc. WASPAA (IEEE, New Paltz, NY, 16–19 Oct. 2011), pp. 313–316.
- Vincent E, Bertin N, Badeau R: Harmonic and inharmonic nonnegative matrix factorization for polyphonic pitch transcription. In Proc. of International Conference on Acoustics, Speech and Signal Processing. IEEE, Las Vegas, Nevada, USA; 2008:109-112.
- Hoyer PO: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res 2004, 5: 1457-1469.
- S Kuijken, Bach: Sonatas & Partitas, BWV 1001-1006, CD 1 (Deutsche Harmonia Mundi, 1990).
- M GOTO, RWC Music Database. http://staff.aist.go.jp/m.goto/RWC-MDB/. Accessed 13–17 Oct. 2002.
- A Grumiaux, Bach: Sonatas & Partitas (BWV 1001-1006), CD 1 (Philips, 2006).
- H Hahn, Hilary Hahn plays Bach (Sony, 1997).
- MA Woodbury, Inverting modified matrices. Memorandum Rep. 42, 106 (1950).
- Siao YS, Chang WC, Su WY: Pitch detection/tracking strategy for musical recordings of solo bowed-string and wind instruments. J. Inf. Sci. Eng 2009, 25(4):1239-1253.
- S Dixon, On the computer recognition of solo piano music, in Proceedings of Australasian Computer Music Conference (Brisbane, Australia, 17 Jul 2011), pp. 31–37.
- M GOTO, Music synchronization for RWC Music Database (classical music). http://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/SyncRWC/. Accessed 24 Sept. 2010.
- Vincent E, Gribonval R, Févotte C: Performance measurement in blind audio source separation. IEEE Trans. on Audio Speech Lang. Process 2006, 14(4):1462-1469. 10.1109/TSA.2005.858005
- S Ewert, M Müller, Using score-informed constraints for NMF-based source separation, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Kyoto, Japan, 25–30 Mar 2012).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License(http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.