Musical note analysis of solo violin recordings using recursive regularization

Composers may not provide instructions for playing their works, especially for instrument solos, and therefore, different musicians may give very different interpretations of the same work. Such differences usually lead to time, amplitude, or frequency variations of musical notes in a phrase in the signal point of view. This paper proposes a frame-based recursive regularization method for time-dependent analysis of each note presenting in solo violin recordings. The system of equations evolves when a new frame is added and an old frame is dropped to track the varying characteristics of violin playing. This method is compared with a time-dependent non-negative matrix factorization method. The complete recordings of both BWV 1005 No. 3 played by Kuijken and 24 Caprices op. 1 no. 24 in A minor played by Paganini are used for the transcription experiment, where the proposed method performs strongly. The analysis results of a short passage extracted from BWV 1005 No. 3 performed by three famous violinists reveal numerous differences in the styles and performances of these violinists.


Introduction
Analyses of performances are mostly subjective in the domain of musicology. Objective analysis has become possible with advances in information technologies and sound/music analysis tools, such as pitch/partial tracking, score alignment/following, melody tracking, and extraction. Non-negative matrix factorization (NMF) [1] is a popular tool for musical signal analysis such as pitch estimation, chord recognition, and automatic transcription [2][3][4]. In NMF, the matrix of the input magnitude spectrum is decomposed into the product of two matrices. One matrix is formed by a certain number of magnitude spectra and is called the template matrix or dictionary matrix. The other matrix is the intensity information of the notes and is called the intensity matrix or activation matrix. When considering the decomposition of audio recordings, these matrices are, for both NMF and the proposed method, related to several notes, each with a quasiharmonic spectrum activating during a specific time period. Furthermore, NMF is usually used on the Fourier spectrogram which is easy to apply time-frequency masking but hard to extract time-varying sources. Some additional models are needed to enforce the procedure of *Correspondence: showmin@csie.ncku.edu.tw SCREAM Laboratory, Department of Computer Science Information Engineering, National Cheng-Kung University, Tainan 701, Taiwan decomposition, such as time-dependent parametric and harmonic templates [5] and Markov-chained base [6]. On the other hand, decomposing constant-Q spectrograms is difficult to apply time-frequency masking but shows good potential to deal with spreading of higher harmonic frequencies when the pitch is getting higher, such like scale invariance across linear frequency [7] and shift invariance across log-frequency [8]. State-of-the-art methods in these areas can be found in the annual Music Information Retrieval Evaluation eXchange (MIREX) [9].
Good results can be achieved when the number of notes and/or the spectra information of notes is known a priori. In the analysis of polyphonic recordings, to determine the number of notes appearing in a single time frame is firstly discussed. A harmonic structure is generally desirable, and the spectral bases are usually constrained to be harmonic in the applications [10]. A fixed number of templates are usually set in previous works according to the note range of interest. Pitches of a violin can, however, vary continually, and fixed pitch templates are unsuitable in the analysis of bowed string instruments. Two issues are then welcome to be discussed in this work: (a) how to determine the exact number of notes and (b) how to model the time-varying notes with suitable templates.
Methods to estimate the possible number of notes have been discussed in [11,12]. In [13], a dynamic note http://asmp.eurasipjournals.com/content/2014/1/25 number detection for NMF is proposed to analyze solo bowed string instrument recordings. Since fixed template is employed in [13], a note with a time-varying spectrum which resulted from performing skills such as vibrato and portamento is encouraged to be obtained by using multiple templates. In [5], the time-dependent parametric and harmonic templates are applied to NMF when the pitch of a note varies. The method provides a parametric representation of the harmonic atoms, which can depend on a fundamental frequency parameter, a chirp parameter, and so on with respect to time. It can therefore represent a time-varying note by using only one single template.
Recursive regularization [14] has been widely applied in the areas of system identification, image restoration, noise reduction, echo cancellation, and blind deconvolution [15,16]. The proposed method decomposes the magnitude spectrogram into the product of a template matrix and an intensity matrix based on the modified version of the previous work in high-resolution image reconstruction [16]. In this work, a new algorithm is developed such that the two matrices are updated whenever a new frame is added and an old frame is dropped. This online scheme is similar to the so-called online dictionary learning but does not keep a global dictionary for identified patterns, i.e., musical notes of the same pitch. To analyze violin solo recordings, we regard each note as one single source in this paper. Some works have been proposed for online dictionary learning using L2 norms [17,18], KL divergence [19], and IS divergence [20]. Here, we considered L2 norms to simplify the derivations of the proposed recursive algorithm. Similar to [16], the new iterative update procedure also eliminates the matrix inversion operation to reduce the computational complexity. Because the convergence of the recursive regularization has been well addressed, those who are interested could find related materials in [16]. The systematic flow proposed in the previous work [13] to find a new note template is modified for the application of this work. The concepts of harmonic and sparseness constraint [21,22] are also adopted. The proposed method is compared with the time-dependent NMF method [5] by using the complete recordings of BWV 1005 No. 3 played by Kuijken [23] and 24 Caprices op. 1 no. 24 in A minor played by Paganini [24]. Finally, the proposed method is tested using Bach solo violin recordings by three violinists, that is, Arthur Grumiaux, Sigiswald Kuijken, and Hilary Hahn [23,25,26]. It is easy to identify the differences in their playing styles when note-by-note spectral and intensity information is available. The insightful discussions will be discussed in the 'Results' section.
The remainder of this paper is organized as follows: The 'Formulation of regularized analysis system' section presents the basic formulation of the decomposition problem using the regularization method. The 'Frame-based recursive regularization analysis' section presents the frame-based recursive regularization analysis method. We then present some experiments and corresponding results in the 'Experiments' and 'Results' sections. Lastly, the 'Conclusions' section offers the conclusion and the discussion of future works.

Formulation of regularized analysis system
Let m be the number of consecutive frames. Each frame has 2n samples. Discrete Fourier transform (DFT) is performed for each frame to obtain its magnitude information. We obtain an m-by-n matrix, V, where V ji represents the magnitude of the ith Fourier coefficient of the jth frame. Let r be the number of pitches present in these frames. We obtain an r-by-n template matrix, W, where W ki represents the magnitude of the ith Fourier coefficient of the kth template. Finally, we obtain an m-by-r intensity matrix, H, where H jk represents the intensity of the kth template of the jth frame. Hence, W and H are used to construct V, as follows: A cost function can be set as Subsequently, the template matrix and the intensity matrix can be obtained easily as The result can be obtained by evaluating (3) and (4) iteratively. Although (1) is similar to NMF in its formulation, (3) and (4) do not enforce the factorization of a non-negative matrix into two non-negative matrices, in comparison to NMF. Since the goal is to get a reasonable distribution of frequency energies, the negative elements of W and H can be set to zeros to re-evaluate the equations and obtain a non-negative result in every iteration. Notice that the matrix V is represented as (1) rather than V = WH commonly used in NMF-related literatures to make the derivatives of following formulations more readable without loss of generality.
To improve (3) and (4), a penalty term is added to support a temporal smoothness mechanism in the context of the proposed online scheme. For example, (2) can be modified to where λ > 0 and γ > 0 are carefully chosen to ensure the stability of the solution. Such a process is called regularization [14]. Since both frequency response and intensity of a note evolve slowly in a short time, C W is determined as a template matrix in the previous update iteration, http://asmp.eurasipjournals.com/content/2014/1/25 where λ is a corresponding penalty factor to achieve spectral smoothness. Similarly, C H and γ are the terms related to temporal smoothness. The solution becomes In our experience, the system described by (6) and (7) requires a smaller number of iterations than NMF to converge. Furthermore, the system also requires a smaller number of frames than NMF to obtain reasonably good results. The 'Experiments' section presents the simulation results of the proposed method and a comparison to other methods.
The proposed method is designed by considering the following issues. Firstly, it is crucial to determine the exact number of templates to obtain reasonably good results. Such an issue is widely discussed in [13] for conventional NMF-based methods. Secondly, it is crucial to determine in which manner the penalty term is set in (5). Finally, matrix inversion consumes substantial computing power, compared to the gradient descent algorithms used in NMF. These problems are discussed in the following section.

Refined update rules
This section presents the high computation complexity problem caused by matrix inversions. The template and intensity matrices can be adaptively learned from the current input frame and some previous input frames. For brevity, we present only the derivation to recursively evaluate the time-varying template matrix W. The derivation of the update rule for the intensity matrix H is omitted. Let the system start with the first m input frames. m must not be large when the signal varies rapidly. Let the r-byn template matrix for the lth input frame be denoted by W(l), which is obtained using frame-l, frame-(l − 1), . . ., and frame-(l − m + 1). V(l) and H(l) are subsequently defined as where l − m + 1 ≤ q ≤ l. Hence, the cost function in (5) for frame-l is rewritten as Subsequently, the template matrix is obtained as (13) Therefore, to obtain the template matrix and the corresponding intensity values for every new input frame, (6) and (7) must be re-evaluated when a new input frame is added. A recursive procedure is proposed to reduce the number of matrix inversions. First, two new matrices are defined as where Therefore, W(l) = P(l)R(l) and the template matrix for frame-(l + 1) can also be calculated by W(l + 1) = P(l + 1) R(l + 1).
Similar to (16) and (17), we obtain We definẽ By using the Woodbury matrix identity [27], . Therefore, we obtainP(l) without matrix inversion bỹ Next, By using (21) with A =P(l) −1 = l k=l−m+2 h (k)h(k) + λI, B = h (l + 1), C = +1, and D = h(l + 1), we obtain A matrix inversion is unnecessary when R(l + 1) is achieved bỹ The oldest frame is removed by using (22) and (25), and the new input frame is added by using (24) and (26). Hence, the template matrix for each new input frame can be computed recursively without matrix inversion by using the results generated by previous input frames. Since both frequency response and intensity of a note evolve slowly in a short time, C W and C H are determined  by the the results of W and H obtained in the previous update iteration, i.e., C W is updated by C W (l + 1) = W(l).
The intensity matrix is subsequently calculated. Similar to (13), the intensity matrix for the (l −m+1)th frame, the (l − m + 2)th frame, . . ., and the lth frame is obtained as Subsequently, the intensity information corresponding to the template matrix of the (l + 1)th frame is computed as In (28), C H (l + 1) can be set to H (l) because it is assumed that the intensity cannot change abruptly. The forgetting factors λ and γ can determine the effects of old frames. They are both set to 100 in this work. The timevarying template matrix and the corresponding intensity matrix can be calculated alternatively when a new input frame is added. Further details and the overall procedure are presented in the following section.

Analysis procedure
As shown in Figure 1, the analysis system started from only one template initialized with random values, which is called Guard template. An initialization loop determines all possible note templates and evaluates the template matrix and the intensity matrix in the same time. After the initialization loop, the main loop takes care of the new frame addition, the new note template detection, the offset note template evaluation, and the old frame removal. For the new note template detection, we used the weighted greatest common divisor and vote (WGCDV) [28] method to detect new tones in Guard template by using a floating point GCD lookup table and a framebased correction method. WGCDV estimates F0 in three steps: (a) locates the peaks of the frequency response of Objective performance comparison of NMF with harmonic constraints [13], time-dependent parametric NMF [5], and the proposed method by using SIR, SAR, and SDR. http://asmp.eurasipjournals.com/content/2014/1/25 the current frame and regards them as possible partials, (b) finds a likely GCD value for each partial pair using a lookup table method, (c) weights the likely GCD values and voting the deterministic GCD according to the spectral energy. Once a harmonic structure is recognized within Guard template at the current frame, these harmonic peaks will be extracted from Guard template and a corresponding new template is added to the original template matrix. Figure 2 shows an example of extracted harmonic structures of C4 and D4 notes from the original Guard template by using a mask function, S. As shown in Figure 2, each mask function represented in a dashed line can be defined by a harmonic set corresponding to the detected note as follows: where I(α, β) = 1 in the interval [α, β]; otherwise, it is 0. f j is the fundamental frequency of the jth recognized tone, and p is the partial index. is set at 3% of the partial frequency, pf j . Subsequently, the new template is computed by where S l j is the mask function of frame-l. In (30), W l 0 = [W l 01 W l 02 . . . W l 0n ] T represents the original Guard template of frame-l, and ⊗ is the element-wise multiplication. The number of templates, r, is equal to j + 1. The procedure described in the previous section is performed again for frame-(l + 1) to obtain the new template matrix and intensity matrix.
A re-estimation of the pitch of each note based on the updated template matrix is necessary because all templates, as well as pitches, can vary by frame. Consequently, the mask functions of all templates must be updated. , the iterative update procedure forces W(l + 1) to retain harmonic structures for all the notes as much as possible, depending on the regularization parameter, λ.
Finally, a musical note is muted eventually after it is played for a while. In this study, a note is removed from the analysis process by removing its template and the corresponding intensity information. A note can be removed if That is, the ith note is removed after frame-l if Equation 31 holds. T is empirically set to 0.1 in this work. By removing such notes, the computation complexity is also reduced.

Experiments
Database Two excerpts are generated by a MIDI synthesizer for preliminary tests. Firstly, a synthetic chirp signal is generated, and its pitch varies from C5 to A5 in 1,000 time frames. The second test uses a signal with vibrating notes including six notes in the following order: E5, D5, C5, B4, A4, G4, and A4+B4. A vibrato effect is generated by setting proper MIDI commands. All parameter sets are the same as those in the previous test. Moreover, two recordings, the BWV 1005 No. 3 performed by Kuijken [23] and RWC database C038 [24], are used to evaluate the accuracy of all methods, as proposed in [29]. The former contains 587 notes that are manually annotated as the ground truth. The latter contains 1,745 notes whose ground truth is provided from the syncRWC annotations [30].

Parameters
The window size is 4,096 samples, the hop size is 256 samples, and the sampling rate is 44.1 kHz. A Hamming window and 4096-FFT are subsequently applied.

Performance evaluation
An objective measure for evaluating the performance of a source separation method proposed in [31] is adopted for the following discussion. To compare different approaches, the signal-to-distortion ratio (SDR), the signal-to-artifact ratio (SAR), and the signal-tointerference ratio (SIR) are computed with each note as the target. In this work, we considered each note as a separate source. The SIR, SAR, and SDR values of these notes are averaged respectively.
To evaluate the performance of music transcription, we chose note-level metrics with a tolerance of one window before and after the reference onset time. Overall accuracy is defined as where TP (true positives) is the number of correctly transcribed voiced notes (within a quarter tone distance in frequency and 50-ms onset distance in time), FP (false positives) is the number of unvoiced notes transcribed as voiced, and FN (false negatives) is the number of voiced notes transcribed as unvoiced. This measure is defined for note-based onset evaluation in this case and is bounded by 0 and 1, with 1 corresponding to perfect transcription. Another metric, called F-measure, is defined as whereas where P and R represent the precision rate and recall rate, respectively.

The first preliminary test: glissando
The first part of this section presents a comparison of the proposed method by using a glissando signal to the conventional NMF method with harmonic constraints on matrix W [32] and the time-dependent parametric Table 4 Complexity comparison
NMF method [5] that uses the same harmonic structure for different notes of the same instrument. For the time-dependent parametric NMF, 10 templates are used to cover the possible pitch range of the signals. For the proposed method, it started with the initialized Guard template, and the required number of templates is determined adaptively during the analysis process. Figure 3a shows the original spectrogram of the note, and Figure 3b,c shows the reconstructed spectrograms that resulted from the time-dependent NMF method and the proposed method, respectively. In [5], the pitch of a template is constrained to vary within a fixed semitone; therefore, a number of discontinuous effects occurred among the templates, as shown in Figure 3b. Moreover, 10 note templates are required for the time-dependent NMF approach in this case. Because the pitch of a template is allowed to vary freely in the proposed method, only one template is required to represent the chirp signal. This accounts for the superior performance of the proposed method in which the signal varies more than one semitone.

The second preliminary test: vibrato
In this case, a synthetic signal with six vibrating notes appearing in the following order, E5, D5, C5, B4, A4, G4, and A4+B4, is used. Figure 4 shows the simulation results. Figure 4b shows that the harmonic-constrained NMF is limited in presenting the signal with frequency modulation because of the consistency of the templates. As discussed in [13], separation performance can be improved by increasing the number of templates for each note. The time-dependent parametric NMF performs efficiently in signal that includes vibrato effects; the result is shown in Figure 4c. The only disadvantage is that the amplitudes of the partials are not determined independently from time to time, as shown in the circled region of Figure 4c. The amplitude set trends toward the average among all played notes. The spectrogram obtained from the proposed method is closer to the original spectrogram than those obtained from NMF methods. Table 1 shows the separation results of all methods. The timedependent NMF experiences a number of interference errors from the additional partial energy shown in the circled region of Figure 4c. By considering the SAR, the proposed method preserves most of the vibrato effects in all cases, whereas the harmonic-constrained NMF preserves only steady partial energies. Based on the success of both measures, the proposed method scores the highest SDR as well.

Main experiments
The system is tested on the complete recordings of BWV 1005 No. 3 played by Kuijken [23] and 24 Caprices op. 1 no. 24 in A minor played by Paganini from RWC database [24]. A total of 587 + 1,745 notes are annotated as the ground truth. The analysis is performed blindly; however, pitches outside the possible range are excluded. Additional score information is also excluded from the process. Table 2  their perfect fifth notes are too strong, 56 notes remain in Guard templates because the automatic detection method fails to extract them from Guard template, and 8 notes are mixed with other detected notes because their pitches are close to those of certain high intensity notes. Hence, the following results are obtained: precision rate = 501/(501 + 14) = 97.28%, recall rate = 501/(501 + 86) = 85.35%, accuracy = 501/(501 + 14 + 86) = 83.36%, and F-measure = 90.93%.
The second recording contains more complicated performing styles with larger amount of notes than the first one. False alarms are enormously increased compared with the first recording especially in NMF case, since the energies of overlapping partials of voiced notes and unvoiced notes interfere each other in the intensity matrix H. The performance of the proposed method maintains balance between precision and recall rate. Its number of unvoiced notes transcribed as voiced and voiced notes transcribed as unvoiced is lower than NMF and TD-NMF. That shows our approach is more stable than NMF and TD-NMF.
In Tables 2 and 3, the recursive regularization method without the online scheme means that the sliding frame size is maximized to the whole signal duration. It is notable that, according to accuracy and F-measure of the transcription report, the results of this conventional recursive regularization method provides a baseline performance of the proposed algorithm. This work takes advantages of high-speed property of recursive regularization and improves the performance by imposing the proposed online scheme.

Complexity
In addition, the complexity of the proposed update rules is compared with the NMF as shown in Table 4. Let n be the number of frequency bins, m be the number of frames, and r be the number of templates. Suppose m input frames are set to be processed and only k frames are analyzed for each iteration, where k < m. For each iteration, the complexities of two update rules are O(krn) and O(r 2 n).
The complexity of the proposed update rules is similar to traditional NMF update rules in the case of regarding Euclidean distance as the cost function. Under the similar computational complexity, the proposed method is able to handle the time-varying music features and offer better results.

Performance analysis
Here, the proposed method attempts to analyze the performances of three violinists of different generations: Arthur Grumiaux, Sigiswald Kuijken, and Hilary Hahn. The signals are extracted from their CDs [23,25,26]. The following passages excerpted from BWV 1005 No. 3 are used to compare their performances. The musical score is shown in Figure 5. This study analyzes the signals of the short period during five notes in the bracket. The According to the figures, a notable difference is that the pitches of Kuijken's performance are one semitone lower than those of the other violinists' since the instruments are usually used in historically informed performances (HIP), and musicians usually follow the tradition of the period during which the music is composed. Secondly, violinists use vibrato techniques frequently. We can observe the vibrato effects of B4 note in the performance played by Hahn, as shown in Figure 8, compared to those played by Grumiaux and Kuijken, as shown in Figures 6 and 7, respectively. Thirdly, Kuijken's style is distinct. He freely used numerous trills, which were not indicated in the score of Bach's solo violin works. This can be viewed in Figure 7b,c. As shown in Figures 6c and 8c, C5 is off and B4 is on. After B4 continues for a period of time, it is off and C5 is on again. Moreover, C5 is the strongest note in Grumiaux's playing, whereas B4 is the most prominent note in Hahn's playing. Comparatively, Kuijken played equal intensity on these two notes. Hahn's recording sounded brighter than the other two recordings since the spectral energy of the G3 note in Hahn's recording is larger than the others'. This may be caused by the decision of her balance engineer since the lower notes are weak in amplitude in all of her recordings. A notable mistake is also observed, that is, both Kuijken and Grumiaux played an extra D4 note, which is not in the original score. This may be a coincidence, or it is possible that they used a different score edition. Finally, their tempi also differ considerably. Kuijken used 1.68 s (290 frames) to finish the short passage, whereas Grumiaux used 1.83 s, and Hahn used 2.37 s. Hahn's tempo is 40% slower than Kuijken's. As an HIP musician, Kuijken played faster than the other violinists.

Conclusions
A recursive regularization analysis method is proposed to analyze acoustic recordings of solo violin works. Similar to NMF, the proposed method factorizes the matrix formed with the Fourier magnitude coefficients of multiple frames into a template matrix and an intensity matrix. The frameby-frame-based procedure is designed for time-varying musical signals, such as solo violin recordings. The system of equations is updated by adding a new frame and dropping an old frame to avoid the problems of most NMF methods when the signal varies substantially. The proposed method is compared to the time-dependent NMF method by using two synthesized signals and exhibited superior SDR performances. The objective performance of the proposed method is also verified. For Kuijken's recording of BWV 1005 No. 3, the precision rate is 97.28%, the recall rate is 85.35%, and the F-measure is 90.93%. For a larger recording database from RWC C038, the precision rate is 85.99%, the recall rate is 87.08%, and the F-measure is 86.53%. It shows the stability of our approach. Finally, the proposed method is used to analyze the recordings of J.S. Bach's BWV 1005 No. 3 by three violinists, that is, Arthur Grumiaux, Sigiswald Kuijken, and Hilary Hahn. The results show that the time-varying characteristics of most notes appearing in the recordings can be tracked efficiently. The styles of the three violinists are easily distinguished through the separated results.
We are currently investigating possible approaches to improve the extraction of new notes from Guard template. An octave error may occur in our case because of the overlapping partials of the octave notes. In addition, because of the basis of the least-squares method, the performance of the proposed method with respect to signals of small amplitude, such as higher partials, is not as effective as NMF using other types of cost functions such as KL and IS divergences. The derivation of other cost functions into the proposed method may improve performance. Moreover, a supervised learning procedure can be introduced if note activations are available. Note activations can not only eliminate pitch detection errors but also constrain the intensity matrix for each note. As our approach preserves more musical characteristic details in the note level, nearly perfect decomposition is possible if it incorporates with more constraints, such as timbre, inharmonic bias, and phase. Therefore, many music information retrieval tasks are suitable to take our approach as a preprocessing, for example, player identification or expressive remix.