Percussive/harmonic sound separation by non-negative matrix factorization with smoothness/sparseness constraints
- Francisco Jesus Canadas-Quesada^{1},
- Pedro Vera-Candeas^{1},
- Nicolas Ruiz-Reyes^{1},
- Julio Carabias-Orti^{2} and
- Pablo Cabanas-Molero^{1}
https://doi.org/10.1186/s13636-014-0026-5
© Canadas-Quesada et al.; licensee Springer. 2014
Received: 23 February 2014
Accepted: 12 June 2014
Published: 11 July 2014
Abstract
In this paper, unsupervised learning is used to separate percussive and harmonic sounds from monaural non-vocal polyphonic signals. Our algorithm is based on a modified non-negative matrix factorization (NMF) procedure in which no labelled data is required to distinguish between percussive and harmonic bases, because information about percussive and harmonic sounds is integrated into the decomposition process. The factorization assumes that harmonic sounds exhibit spectral sparseness (narrowband sounds) and temporal smoothness (steady sounds), whereas percussive sounds exhibit spectral smoothness (broadband sounds) and temporal sparseness (transient sounds). The evaluation is performed using several real-world excerpts from different musical genres. A comparison of the developed approach with three current state-of-the-art separation systems produces promising results.
Introduction
The separation of percussive and harmonic sounds remains a challenging problem in music research. Percussive sound pertains to drum instruments, whereas the harmonic sound pertains to pitched instruments. We develop a method to separate monaural music signals (for which spatial information is unavailable), motivated by the large number of one-channel music recordings such as live performances or old recordings (from before the 1960s). In a musical context, a listener can effortlessly distinguish between percussive and harmonic sounds; therefore, these two types of sounds must have significantly different characteristics. Ono et al. [1],[2] used Harmonic/Percussive Sound Separation (HPSS) to separate harmonic and percussive sounds by exploiting the anisotropy of harmonic and percussive sounds in a maximum a posteriori (MAP) framework. The authors considered a spectrogram, assuming anisotropic smoothness [1], i.e., percussive sounds have a structure that is vertically smooth in frequency, whereas harmonic sounds are temporally stable and have a structure that is horizontally smooth in time.
Over the last decade, several approaches have been developed to separate percussive/harmonic sounds from monaural polyphonic music [3]-[6]. The method developed here offers the following advantages over the method outlined in [3]: (i) The method is robust for various sources and not only for flat spectrum sources, (ii) threshold choices are not required, (iii) hand-tuning is not necessary, (iv) the method is quite fast (e.g., the developed method can factorize an input signal lasting 30 s in approximately 18 s). Unlike the methods given in [4],[5], no labelled data is required by the proposed method because the percussive/harmonic information is obtained from the spectro-temporal features used in the factorization stage. Other recently published state-of-the-art techniques are presented in [7],[8]. In [7], anisotropy [1] is used in Median Filtering-based Separation (MFS) under two assumptions: the harmonics are considered to be outliers in a temporal slice that contains a mixture of percussive and pitched instruments, and the percussive onsets are considered to be outliers in a frequency slice. A median operator is used to remove these outliers because median filtering has been used extensively in image processing for removing salt and pepper noise from images [9]. In this manner, the extraction of percussive sounds can be seen as the removal of outliers (overtones from harmonic sounds) in a time frame of a spectrogram while the extraction of harmonic sounds can be seen as the removal of outliers (onsets from percussive sounds) in a frequency bin of a spectrogram. In [8], drum source separation is performed using non-negative matrix partial co-factorization (NMPCF). In NMPCF, the input spectrogram and a drum-only matrix (which is picked up from a priori drum recordings) are simultaneously decomposed. 
The shared basis vectors in this co-factorization are associated with the drum characteristics, which are used to extract drum-related components from the musical signals.
The developed approach is practically useful in the field of audio engineering applications for music information retrieval, where the percussive/harmonic separation task can be used as a preprocessing tool. The extraction of a harmonic sound source can be used to enhance music transcription [11] and chord detection [12]. Extracting a percussive sound source can also enhance onset detection [2]. Extracting both harmonic and percussive sound sources is useful for remixing and for audio to score alignment [13].
We implement an unsupervised approach in which imposed smoothness and sparseness constraints are used to automatically discriminate between percussive and harmonic signals in an NMF framework. Our specific contribution is the inclusion of sparseness criteria in an NMF framework for percussive/harmonic separation. Compared to methods that require some training (semi-supervised or supervised), our approach provides a more robust source-to-distortion ratio (as we will show in Section 3) because the separation process does not depend on supervised training. In Section 3, we show that the developed method produces promising results in comparison with two unsupervised (i.e., untrained) approaches (HPSS and MFS) and one supervised (i.e., trained) approach (NMPCF) dedicated to percussive/harmonic separation.
The remainder of this paper is organized as follows: In Section 2, we describe our novel method. In Section 3, the results are evaluated and compared. Finally, conclusions and future work are presented in Section 4.
Developed method
Non-negative matrix factorization (NMF) has been widely used in the field of digital image processing [14]-[17] in recent years and has also been successfully applied to music analysis [18],[19]. Following [5],[6],[8], we apply NMF to percussive/harmonic separation, motivated in part by the aforementioned promising results.
Standard NMF approximates a non-negative spectrogram X by the product of two non-negative matrices, X ≈ WH, where the i-th column of matrix W is a frequency basis that represents the spectral pattern of a component that is active in the spectrogram. Additionally, the i-th row of matrix H is a temporal gain or activation and represents the time interval over which the i-th frequency basis is active. Standard NMF can only ensure convergence to local minima: this enables the reconstruction of the signal spectrogram, but it cannot discriminate between percussive and harmonic frequency bases.
For clarity, the term source refers to a musical instrument, and the term component refers to a specific frequency basis.
2.1 Signal representation
2.2 A modified non-negative matrix factorization for percussive/harmonic separation
where X_{ P }, X_{ H }, W_{ P }, H_{ P }, W_{ H } and H_{ H } are non-negative matrices. The parameter R_{ p } denotes the number of percussive components, and the parameter R_{ h } denotes the number of harmonic components used in the factorization process.
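The displayed model that this sentence refers to is not shown in this copy. Given that Section 2.2.1 compares the spectrogram with X_{ P }+X_{ H }, and given the indices used in the normalization formulas of Sections 2.2.2 and 2.2.3, the model presumably has the following form. This is a reconstruction under those assumptions, with F frequency bins and T frames, and not necessarily the authors' exact notation.

```latex
X \approx X_P + X_H = W_P H_P + W_H H_H,
\qquad
W_P \in \mathbb{R}_{+}^{F \times R_p},\;
H_P \in \mathbb{R}_{+}^{R_p \times T},\;
W_H \in \mathbb{R}_{+}^{F \times R_h},\;
H_H \in \mathbb{R}_{+}^{R_h \times T}
```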
Next, we detail the decomposition process. This process adapts the concept of anisotropy, which was initially exploited in [1],[2], to an NMF framework. Anisotropy is used to estimate W_{ P }, H_{ P }, W_{ H } and H_{ H } by minimizing a global objective function that depends on a β-divergence cost, a percussive cost and a harmonic cost.
2.2.1 β-divergence cost
where $x={X}_{{n}_{\beta}}$ and y=X_{ P }+X_{ H }.
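The cost equation itself is not shown in this copy. As a minimal sketch, the element-wise β-divergence in its usual three-case form (see [21]) can be written as follows; it is assumed, not confirmed, that this is the form used here. The function names are illustrative.

```python
import math

def beta_divergence(x, y, beta):
    """Element-wise beta-divergence d_beta(x | y) for positive scalars x and y."""
    if beta == 0:  # Itakura-Saito divergence
        return x / y - math.log(x / y) - 1.0
    if beta == 1:  # generalized Kullback-Leibler divergence
        return x * math.log(x / y) - x + y
    # General case; beta = 2 gives half the squared Euclidean distance
    return (x ** beta + (beta - 1.0) * y ** beta
            - beta * x * y ** (beta - 1.0)) / (beta * (beta - 1.0))

def total_beta_divergence(X, Y, beta):
    """Sum of the element-wise divergence over two matrices given as lists of rows."""
    return sum(beta_divergence(x, y, beta)
               for row_x, row_y in zip(X, Y)
               for x, y in zip(row_x, row_y))
```

In the setting of this section, X would hold the normalized spectrogram and Y the model X_{ P }+X_{ H }; the divergence is zero only where the two agree.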
This factorization only considers the β-divergence cost and cannot automatically determine whether a basis belongs to a percussive signal or a harmonic signal. To overcome this drawback, the factorization process is modified to include specific spectro-temporal features related to percussive and harmonic sounds. Therefore, our contribution to the analysis of monaural signals is the development of a percussive/harmonic separation using an unsupervised NMF approach. This NMF approach models the mixture signal using an objective function that considers the β-divergence cost and common spectro-temporal features from the percussive/harmonic signals. Unlike other methods [4]-[6],[8] that have been developed for percussive/harmonic separation, our approach does not use any labelled data to train the NMF bases, while still maintaining competitive SDR results.
2.2.2 Percussive cost
where a high cost is assigned to large changes in the frequency between the bases ${W}_{{P}_{f,{r}_{p}}}$ and ${W}_{{P}_{f-1,{r}_{p}}}$ in adjacent frequency bins. Normalization is used to make the global objective function independent of the signal norm, i.e., the bases W_{ P } are normalized by ${\sigma}_{{W}_{P}}$. The value ${\sigma}_{{W}_{P}}$ of each percussive component r_{ p } is calculated as ${\sigma}_{{W}_{{P}_{{r}_{p}}}}=\sqrt{\frac{1}{F}\sum _{f=1}^{F}{W}_{{P}_{f,{r}_{p}}}^{2}}$. To ensure that each percussive or harmonic cost has the same weight in the global objective function, each cost is normalized. Taking SSM into account, it is normalized by a factor equal to $\frac{T}{{R}_{p}}$. Thus, we avoid scaling problems that can arise from the number of frames, the number of frequency bins or the number of components considered.
Similarly to the W_{ P } treatment, the activations H_{ P } are normalized by ${\sigma}_{{H}_{P}}$. The value ${\sigma}_{{H}_{P}}$ of each percussive component r_{ p } is calculated as ${\sigma}_{{H}_{{P}_{{r}_{p}}}}=\sqrt{\frac{1}{T}\sum _{t=1}^{T}{H}_{{P}_{{r}_{p},t}}^{2}}$. As mentioned above, the TSP is normalized by a factor $\frac{F}{{R}_{p}}$ such that the percussive costs are equally weighted.
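The two percussive costs can be sketched as follows. The displayed definitions of SSM and TSP are not shown in this copy, so the sketch assumes the penalty forms of [22]: squared differences between adjacent entries for smoothness and an absolute-value (L1-type) sum for sparseness, each computed on vectors divided by their root-mean-square value σ as described above. The `scale` argument stands in for the normalization factors mentioned in the text, and all function names are illustrative.

```python
import math

def rms(values):
    """Root-mean-square value of a vector (the sigma used for normalization)."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def smoothness_cost(vectors, scale=1.0):
    """Sum of squared differences between adjacent entries of each
    RMS-normalized vector; vectors must not be all-zero."""
    total = 0.0
    for v in vectors:
        sigma = rms(v)
        total += sum(((a - b) / sigma) ** 2 for a, b in zip(v[1:], v[:-1]))
    return scale * total

def sparseness_cost(vectors, scale=1.0):
    """Sum of absolute values of each RMS-normalized vector."""
    total = 0.0
    for v in vectors:
        sigma = rms(v)
        total += sum(abs(a) / sigma for a in v)
    return scale * total

# Toy vectors: a flat (broadband-like) pattern and a single-peak (transient-like) one
flat = [1.0] * 8
peak = [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0]
```

For SSM the vectors would be the columns of W_{ P } (one per percussive component), and for TSP the rows of H_{ P }. On the toy vectors, the flat pattern has zero smoothness cost and the single-peak pattern has the lower sparseness cost, which matches the intended percussive model; dividing by σ makes both costs independent of the scale of each vector.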
2.2.3 Harmonic cost
Normalization is used to make the global objective function independent of the signal norm; thus, the H_{ H } gains are normalized by ${\sigma}_{{H}_{H}}$. The value ${\sigma}_{{H}_{H}}$ of each harmonic component r_{ h } is calculated as ${\sigma}_{{H}_{{H}_{{r}_{h}}}}=\sqrt{\frac{1}{T}\sum _{t=1}^{T}{H}_{{H}_{{r}_{h},t}}^{2}}$. The harmonic cost TSM is normalized by a factor $\frac{F}{{R}_{h}}$.
Similarly to the treatment for H_{ H }, the bases W_{ H } are normalized by ${\sigma}_{{W}_{H}}$. The value ${\sigma}_{{W}_{H}}$ of each harmonic component r_{ h } is calculated as ${\sigma}_{{W}_{{H}_{{r}_{h}}}}=\sqrt{\frac{1}{F}\sum _{f=1}^{F}{W}_{{H}_{f,{r}_{h}}}^{2}}$. The harmonic cost SSP is normalized by a factor $\frac{T}{{R}_{h}}$.
2.2.4 Global percussive/harmonic NMF algorithm
Preliminary results do not show significant differences from initializing K_{SSM} and K_{TSM} with different values; thus, these parameters are set equal to each other (i.e., K_{SSM}=K_{TSM}). The parameters K_{TSP} and K_{SSP} are treated in the same way (i.e., K_{TSP}=K_{SSP}) because a similar behaviour is observed for these parameters in the preliminary studies. To determine if both sparseness and smoothness affect the performance of the separation process, we define the parameter K_{SP}=K_{TSP}=K_{SSP} to represent the sparseness terms and the parameter K_{SM}=K_{TSM}=K_{SSM} to represent the smoothness terms.
where ⊙ is the element-wise product operator and the fraction is the element-wise division operator.
By iteratively updating the matrices W_{ P }, H_{ P }, W_{ H }, H_{ H } using maxIter iterations, our scheme can automatically distinguish between the bases belonging to the percussive or harmonic sounds.
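The update equations are not shown in this copy. The sketch below illustrates only the general pattern of a multiplicative update, θ ← θ ⊙ (negative gradient part) / (positive gradient part), using the well-known β-divergence terms for a single pair W, H. It is an assumption that the developed method follows this pattern with the positive and negative parts of the smoothness and sparseness gradients, weighted by K_{SM} and K_{SP}, added to the denominators and numerators; those terms are omitted here, and all function names are illustrative.

```python
import random

def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def power(A, p, eps=1e-12):
    """Element-wise power, with a small offset to avoid 0 ** negative."""
    return [[(a + eps) ** p for a in row] for row in A]

def hadamard(A, B):
    """Element-wise product."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def multiplicative_update(theta, negative, positive, eps=1e-12):
    """theta <- theta (element-wise product) negative / positive."""
    return [[t * n / (p + eps) for t, n, p in zip(rt, rn, rp)]
            for rt, rn, rp in zip(theta, negative, positive)]

def beta_nmf_step(X, W, H, beta):
    """One update of H followed by one update of W (beta-divergence terms only)."""
    Y = matmul(W, H)
    Wt = transpose(W)
    H = multiplicative_update(H,
                              matmul(Wt, hadamard(X, power(Y, beta - 2))),
                              matmul(Wt, power(Y, beta - 1)))
    Y = matmul(W, H)
    Ht = transpose(H)
    W = multiplicative_update(W,
                              matmul(hadamard(X, power(Y, beta - 2)), Ht),
                              matmul(power(Y, beta - 1), Ht))
    return W, H

def cost(X, W, H, beta):
    """Total beta-divergence between X and WH (valid for beta not in {0, 1})."""
    Y = matmul(W, H)
    return sum((x ** beta + (beta - 1) * y ** beta - beta * x * y ** (beta - 1))
               / (beta * (beta - 1))
               for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

# Toy problem with random positive data (F bins, T frames, R components)
random.seed(0)
F, T, R = 6, 8, 2
X = [[random.random() + 0.1 for _ in range(T)] for _ in range(F)]
W = [[random.random() + 0.1 for _ in range(R)] for _ in range(F)]
H = [[random.random() + 0.1 for _ in range(T)] for _ in range(R)]
initial_cost = cost(X, W, H, beta=1.5)
for _ in range(100):  # plays the role of maxIter
    W, H = beta_nmf_step(X, W, H, beta=1.5)
final_cost = cost(X, W, H, beta=1.5)
```

Because every factor in the update is non-negative, W and H stay non-negative without any projection step, which is the main reason multiplicative rules are popular for NMF.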
2.2.5 Signal reconstruction
The developed percussive/harmonic sound separation is detailed in Algorithm 1.
Evaluation and comparison
3.1 Test data
Identifier, title and artist of the files of the development database E
Identifier | Title | Artist |
---|---|---|
E_01 | Two minutes to midnight | Iron Maiden |
E_02 | Bullet with butterfly wings | Smashing Pumpkins |
E_03 | Gamma ray | Beck |
E_04 | Go your own way | Fleetwood Mac |
E_05 | Hotel California | Eagles |
Identifier, title and artist of the files of the first test database T1
Identifier | Title | Artist |
---|---|---|
T1_01 | In my place | Coldplay |
T1_02 | La bamba | Los Lobos |
T1_03 | Livin’ on a prayer | Bon Jovi |
T1_04 | No one to depend on | Santana |
T1_05 | Ring of fire | Johnny Cash |
T1_06 | Rooftops | Lostprophets |
T1_07 | So lonely | The Police |
T1_08 | Song 2 | Blur |
T1_09 | Sultans of swing | Dire Straits |
T1_10 | Under pressure | Queen |
T1_11 | Are you gonna go my way | Lenny Kravitz |
T1_12 | Feel the pain | Dinosaur Jr |
T1_13 | Hollywood nights | Bob Seger & The Silver Bullet Band |
T1_14 | Hurts so good | John Mellencamp |
T1_15 | Kick out the jams | MC5’s Wayne Kramer |
T1_16 | Make it wit chu | Queens Of The Stone Age |
T1_17 | One way or another | Blondie |
T1_18 | Shiver | Coldplay |
T1_19 | Shout it out loud | Kiss |
T1_20 | Sympathy for the devil | The Rolling Stones |
Identifier, title and artist of the files of the second test database T2
Identifier | Title | Artist |
---|---|---|
T2_01 | Roads | Bearlin |
T2_02 | The_ones_we_love | Another Dreamer |
T2_03 | Remember_the_name | Fort Minor |
T2_04 | Tour | Ultimate NZ Tour |
In summary, the dataset is composed of three databases. The development database E (five excerpts) is used to optimize the parameters (β, K_{SM}, K_{SP}, R_{ p } and R_{ h }) of the developed method. Two test databases T1 (20 excerpts) and T2 (4 excerpts) are then used to evaluate the performance of the separation process. Note that the development database E is not a part of the test databases T1 and T2.
3.2 Experimental setup
The quality of the audio separation using different frame sizes N, time shifts J and numbers of iterations maxIter is evaluated in preliminary studies. The experimental results show approximately the same performance using N>1024 samples and J<512 samples; therefore, we use the values N=1024 and J=512 because these values provide the best trade-off between the performance and the computational cost. The convergence of the algorithm is observed empirically: in all the simulations performed, convergence is achieved after 100 iterations. For this reason, we choose maxIter=100. The values R_{ p } and R_{ h } are initialized to 150 because we initially supposed that this number of components would be representative of the percussive and harmonic spectral patterns. However, these parameters are analyzed later (subsection 3.4.1).
Sound source separation applications based on NMF approaches usually adopt a logarithmic frequency discretization (e.g., uniformly spaced sub-bands on the equivalent rectangular bandwidth (ERB) scale [28]). Because harmonic signals are organized on a chromatic scale, in which musical notes are defined with semitone resolution, the developed method uses a 1/4 semitone resolution. Therefore, the time-frequency representation is obtained by integrating the STFT bins corresponding to each 1/4 semitone interval. To obtain the separated signals, the frequency resolution of the percussive and harmonic masks defined in Equations 15 and 16 must be extended to the resolution of the STFT. Because each STFT bin belongs to exactly one 1/4 semitone interval, each bin of the STFT-resolution masks takes the mask value of the 1/4 semitone interval to which it belongs. Consequently, all bins belonging to the same 1/4 semitone have the same mask value. Percussive and harmonic masks with the resolution of the STFT are then obtained and the inverse transform can be computed following Equations 17 and 18.
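The mapping between STFT bins and 1/4 semitone bands can be sketched as follows. Several details are assumptions made only for illustration: a 44.1 kHz sampling rate, a 27.5 Hz reference frequency, integration by plain summation of magnitudes, and grouping all bins at or below the reference into the lowest band. The function names are hypothetical.

```python
import math

def bin_to_band(num_bins, sample_rate, n_fft, f_ref=27.5, bands_per_semitone=4):
    """Return, for each STFT bin, the index of its 1/4 semitone band.
    Band indices are re-numbered consecutively so that no band is empty."""
    raw = []
    for k in range(num_bins):
        freq = k * sample_rate / n_fft
        if freq <= f_ref:
            raw.append(0)  # everything at or below the reference shares band 0
        else:
            raw.append(int(round(12 * bands_per_semitone * math.log2(freq / f_ref))))
    dense = {band: i for i, band in enumerate(sorted(set(raw)))}
    return [dense[band] for band in raw]

def integrate_bands(frame, mapping):
    """Sum the STFT magnitudes of one frame inside each band."""
    bands = [0.0] * (max(mapping) + 1)
    for value, band in zip(frame, mapping):
        bands[band] += value
    return bands

def expand_mask(band_mask, mapping):
    """Extend a band-resolution mask to STFT resolution: every bin takes
    the mask value of the band it belongs to."""
    return [band_mask[band] for band in mapping]

# N = 1024 gives 513 non-negative-frequency bins
mapping = bin_to_band(513, 44100, 1024)
frame = [1.0] * 513
bands = integrate_bands(frame, mapping)
stft_mask = expand_mask([i / len(bands) for i in range(len(bands))], mapping)
```

At low frequencies each bin ends up in its own band, while at high frequencies several bins share one band, so the band representation is shorter than the STFT frame and the expanded mask is piecewise constant.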
3.3 Algorithms for comparison
We use three recent state-of-the-art percussive/harmonic sound separation methods to evaluate the developed method: HPSS [1], MFS [7] and NMPCF [8]. HPSS and MFS are implemented in this study, whereas the separation results from NMPCF have been provided directly by the authors.
3.4 Results
Three metrics are used to assess the performance of the developed method [29],[30]: (1) the source-to-distortion ratio (SDR), which provides information on the overall quality of the separation process; (2) the source-to-interferences ratio (SIR), which is a measure of the presence of harmonic sounds in the percussive signal and vice versa; and (3) the source-to-artifacts ratio (SAR), which provides information on the artifacts in the separated signal from separation and/or resynthesis.
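As a brief illustration of how the three ratios relate, the sketch below computes them from an estimate that has already been decomposed into a target part, an interference part and an artifact part, following the energy-ratio definitions of [29]. The decomposition itself (done by the BSS_EVAL toolbox [30]) is not reproduced, the noise term is assumed to be zero, and the toy signals are made up.

```python
import math

def energy(signal):
    """Sum of squared samples."""
    return sum(s * s for s in signal)

def separation_metrics(s_target, e_interf, e_artif):
    """SDR, SIR and SAR in dB from an already-decomposed estimate
    (estimate = s_target + e_interf + e_artif, noise term omitted)."""
    total_error = [i + a for i, a in zip(e_interf, e_artif)]
    target_plus_interf = [s + i for s, i in zip(s_target, e_interf)]
    sdr = 10 * math.log10(energy(s_target) / energy(total_error))
    sir = 10 * math.log10(energy(s_target) / energy(e_interf))
    sar = 10 * math.log10(energy(target_plus_interf) / energy(e_artif))
    return sdr, sir, sar

# Made-up four-sample signals, for illustration only
sdr, sir, sar = separation_metrics([1.0, 0.0, -1.0, 0.0],
                                   [0.1, 0.0, 0.0, 0.0],
                                   [0.0, 0.1, 0.0, 0.0])
```

SDR penalizes interference and artifacts together, which is why a method can lead in SIR and still trail in SDR when its artifacts are large.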
As previously mentioned, the output of each method is composed of two signals, a percussive signal x_{ p }(t) (i.e., harmonic sounds have been attenuated or removed) and a harmonic signal x_{ h }(t) (i.e., percussive sounds have been attenuated or removed). In the database T1, the percussive average (Perc) is computed using the mean of all of the separated percussive signals. The harmonic average (Harm) is computed using the mean of all of the separated harmonic signals. In the same way, the overall average (Overall) is computed using the mean of all of the separated percussive and harmonic signals.
3.4.1 Parameters optimization
Figure 3 shows the optimization of the parameters K_{SP} and K_{SM} using β=1.5 and different numbers of percussive R_{ p } and harmonic R_{ h } components in order to analyze the effect of different dictionary sizes. The results show that the optimal values of the parameters K_{SP} and K_{SM} are obtained using K_{SP}=K_{SM}=0.1 for all the dictionary sizes evaluated. This suggests that the dictionary size does not affect the optimal values of K_{SP} and K_{SM}, because these values provide the maximum SDR for each dictionary size evaluated.
Figure 4 shows the optimization of the parameter β using the optimal K_{SP}=K_{SM}=0.1 and different numbers of percussive and harmonic components in order to analyze the effect of different dictionary sizes. For each value of β, the results show that the SDR performance exhibits similar behaviour independently of the dictionary size, increasing from β=0 to β=1.5 and decreasing from β=1.5 to β=2. As occurred in Figure 2, the maximum SDR is achieved using β=1.5, but the differences between the results obtained with different dictionary sizes are not significant.
Figure 5 shows the optimization of the smoothness K_{TSM}- K_{SSM} and sparseness K_{TSP}- K_{SSP} parameters. Figure 5a shows that, although different values of K_{TSM} and K_{SSM} have been used, the maximum SDR performance is obtained using K_{TSM}=K_{SSM}=0.1, as occurred when these parameters were initialized with the same values (see Figure 2d). This suggests that the smoothness parameters K_{TSM} and K_{SSM} can be initialized using equal values K_{SM}=K_{TSM}=K_{SSM}, since initializing them with different values does not significantly change the best SDR performance. A similar behaviour can be observed for the sparseness parameters K_{TSP}- K_{SSP} (see Figure 5b), which obtain the maximum SDR performance when they are set equal to each other, K_{SP}=K_{TSP}=K_{SSP}=0.1.
3.4.2 Performance evaluation
Analysis of the statistical significance of the percussive/harmonic SDR-SIR-SAR results
Method | Percussive SDR | Percussive SIR | Percussive SAR | Harmonic SDR | Harmonic SIR | Harmonic SAR |
---|---|---|---|---|---|---|
HPSS | <10^{−8} | 0.2 | <10^{−11} | <10^{−9} | 0.9 | <10^{−14} |
MFS | 0.2 | <10^{−3} | 0.8 | 0.1 | <4·10^{−2} | <10^{−6} |
NMPCF | <10^{−7} | <10^{−4} | <10^{−7} | <10^{−6} | <10^{−4} | <10^{−5} |
Figure 9 shows that HPSS produces the best overall SIR results, although the SIR performance of the developed method is nearly identical. A one-sided paired t test confirms that HPSS does not significantly outperform the developed method in SIR for either percussive or harmonic sounds (see Table 4). However, Table 4 shows that the developed method significantly improves the SIR results compared to MFS and NMPCF for both percussive and harmonic sounds. Both HPSS and the developed method (i) remove most of the harmonic content while maintaining the quality of the percussive signal, and vice versa, and (ii) capture polyphonic richness. The developed method produces a better percussive SIR than MFS because it models percussive sounds using information that MFS does not use. That is, the developed method models percussive sounds using smoothness in frequency and sparseness in time, whereas MFS models percussive sounds using only smoothness in frequency (by removing outliers in a temporal slice).
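For readers unfamiliar with the test, a one-sided paired t test on per-excerpt scores can be sketched with the standard library as follows. The sample input is the percussive SDR of the proposed method and of HPSS for the four T2 excerpts, taken from the table of T2 results; it serves only as example data and does not reproduce Table 4, and four excerpts are far too few to draw conclusions. The p value is obtained by numerically integrating the Student t density, and all function names are illustrative.

```python
import math

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for the paired differences a - b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # unbiased sample variance
    return mean / math.sqrt(var / n), n - 1

def t_pdf(t, df):
    """Probability density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def one_sided_p_value(t, df, steps=2000):
    """P(T >= t), computed as 0.5 minus the Simpson-rule integral of the
    density from 0 to t (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Percussive SDR (dB) on T2_01..T2_04: proposed method vs. HPSS
proposed = [4.0, 5.2, 2.8, 7.5]
hpss = [2.6, 2.4, 2.6, 5.5]
t_value, dof = paired_t_statistic(proposed, hpss)
p_value = one_sided_p_value(t_value, dof)
```

The test is paired because both methods are evaluated on the same excerpts, so only the per-excerpt differences matter; it is one-sided because the question is whether one method is better than the other, not merely different.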
As previously mentioned, HPSS produces the best overall SIR on average. However, these results are obtained at the expense of introducing more artifacts, which leads to greater overall distortion. This can be observed in Figure 10, in which the worst percussive and harmonic SAR results are obtained by HPSS. Considering the SAR results and using a one-sided paired t test (see Table 4), the developed method significantly outperforms HPSS and NMPCF in percussive sounds and the three state-of-the-art methods in harmonic sounds. The developed method also offers the advantage of producing the best SAR results (excluding NMPCF, which will be discussed in the next paragraph) because the artifacts in the reconstructed signal are minimized.
In the case of NMPCF, the SDR and SIR results exhibit the worst separation performance, and therefore, this method always ranks last. The poor performance of NMPCF may be attributed to its high dependence on the drum-only matrix used in the decomposition process. This drum-only matrix is obtained by training on drum sounds, but these trained drum features may be sufficiently different from the percussive features evaluated in the test database T1. The harmonic signal provided by NMPCF is composed of most of the original harmonic and percussive sounds. This causes a high harmonic SAR because the proportion of artifacts is very small compared to the proportion of the target (harmonic) and the interference (percussive) sounds. However, the percussive signal provided by NMPCF is composed of a residual part of the original percussive and harmonic sounds. This causes a low percussive SAR because the proportion of artifacts is very high compared to the proportion of the target (percussive) and the interference (harmonic) sounds.
Percussive, harmonic and overall SDR, SIR and SAR results (in dB) for each excerpt of the database T2
Identifier | HPSS SDR | HPSS SIR | HPSS SAR | MFS SDR | MFS SIR | MFS SAR | Proposed SDR | Proposed SIR | Proposed SAR |
---|---|---|---|---|---|---|---|---|---|
Percussive separation | | | | | | | | | |
T2_01 | 2.6 | 13.2 | 1.1 | −0.2 | −1.5 | 8.4 | 4.0 | 6.5 | 5.7 |
T2_02 | 2.4 | 10.2 | 3.4 | 3.1 | 8.0 | 4.9 | 5.2 | 8.3 | 7.5 |
T2_03 | 2.6 | 6.9 | 4.0 | 2.5 | 2.1 | 12.3 | 2.8 | 2.6 | 11.1 |
T2_04 | 5.5 | 11.5 | 6.5 | 6.2 | 9.6 | 8.0 | 7.5 | 10.3 | 10.3 |
Average | 3.2 | 10.5 | 3.8 | 2.9 | 4.6 | 8.4 | 4.9 | 7.0 | 8.7 |
Harmonic separation | | | | | | | | | |
T2_01 | 9.8 | 13.8 | 11.9 | 7.1 | 13.8 | 11.5 | 11.0 | 14.8 | 13.9 |
T2_02 | 4.8 | 6.3 | 9.8 | 5.5 | 16.2 | 11.6 | 7.5 | 9.3 | 12.1 |
T2_03 | 4.8 | 8.7 | 6.3 | 4.6 | 11.0 | 8.0 | 5.0 | 9.1 | 8.6 |
T2_04 | 5.6 | 11.5 | 6.7 | 6.2 | 9.3 | 8.7 | 7.5 | 10.6 | 10.5 |
Average | 6.3 | 10.1 | 8.7 | 5.9 | 12.6 | 10.0 | 7.8 | 11.0 | 11.3 |
Overall separation | | | | | | | | | |
T2_01 | 6.2 | 13.5 | 6.5 | 3.5 | 6.2 | 10.0 | 7.5 | 10.7 | 9.8 |
T2_02 | 3.6 | 8.3 | 6.6 | 4.3 | 12.1 | 8.3 | 6.4 | 8.8 | 9.8 |
T2_03 | 3.7 | 7.8 | 5.2 | 3.6 | 6.6 | 10.2 | 3.9 | 5.9 | 9.9 |
T2_04 | 5.6 | 11.5 | 6.6 | 6.2 | 9.5 | 8.4 | 7.5 | 10.5 | 10.4 |
Average | 4.8 | 10.3 | 6.2 | 4.4 | 8.6 | 9.2 | 6.3 | 9.0 | 10.0 |
To illustrate the separation performance of the developed method, audio examples (from the T1 and T2 databases) have been uploaded to a web page. Each audio example (mixed track, separated-percussive track and separated-harmonic track) has been evaluated using HPSS, MFS and the developed method. The web page can be found at https://dl.dropboxusercontent.com/u/22448214/PercHarmFeb2014/index.html.
Conclusions
This paper presents an unsupervised learning process for separating percussive and harmonic sounds from monaural instrumental music. Our formulation is based on a modified NMF approach that automatically distinguishes between percussive and harmonic bases by integrating spectro-temporal features, such as anisotropic smoothness or time-frequency sparseness, into the factorization process. The developed method exhibits the following advantages: (i) prior knowledge of the number of instruments playing the music excerpt is not required, and (ii) neither prior information about the musical instruments nor supervised training is required to classify the bases.
Different experiments are performed to optimize the parameters of the developed method. The results show that (i) the value β=1.5 provides the maximum SDR since this measure has been computed at the best trade-off between high and low energy changes in the frequency; (ii) the maximum SDR is achieved using a higher value of smoothness constraints compared to sparseness constraints when evaluating the databases T1 and T2; and (iii) a higher number of components, i.e., R_{ p }>250 and R_{ h }>500, reduces the SDR performance.
The analysis of the dependence between the parameters of the developed method shows that (i) the dictionary size does not appear to affect the optimal values of K_{SP} and K_{SM}, because these values provide the maximum SDR for each dictionary size evaluated; (ii) the maximum SDR is obtained using β=1.5 independently of the dictionary size; and (iii) the smoothness K_{TSM}- K_{SSM} and sparseness K_{TSP}- K_{SSP} parameters can be initialized using equal values K_{SM}=K_{TSM}=K_{SSM} and K_{SP}=K_{TSP}=K_{SSP}, since initializing them with different values does not significantly change the maximum SDR performance.
Evaluating the database T1 shows that the developed method obtains the best percussive, harmonic and overall SDR among the separation methods compared, including the three state-of-the-art methods. The proposed method significantly outperforms the other methods for most of the percussive/harmonic SDR, SIR and SAR results, as confirmed by a one-sided paired t test. A significant strength of the developed method is its robustness across different databases.
Evaluating the database T2 illustrates some of the strengths and weaknesses of the developed method. An interesting strength shown by the developed method is its successful separation performance in evaluating purely harmonic or percussive sounds. However, harmonic onsets and audio effects are not successfully separated because their spectro-temporal features have not been modelled in the factorization process.
Future work will focus on three topics. First, we will try to improve the quality of the separated signals by defining a new spectral distance that integrates novel spectro-temporal features of the percussive and harmonic sounds. Second, a novel constraint based on the vibrato effect will be investigated to extend this method to singing-voice signals. Finally, in order to improve the singing-voice signals, a set of novel percussive constraints will be analyzed to distinguish between percussive music instruments and unvoiced vocal sounds.
Appendix
Detailed terms of the multiplicative update rules of percussive sounds
where ^{ T } denotes the transpose matrix operator. The terms ${\left[\frac{\partial \text{SSM}}{\partial {W}_{P}}\right]}^{\pm}$ and ${\left[\frac{\partial \text{TSP}}{\partial {H}_{P}}\right]}^{\pm}$ are defined using [22], as adapted to the matrix W_{ P } and H_{ P }.
Detailed terms of the multiplicative update rules of harmonic sounds
where ^{ T } denotes the transpose matrix operator. The terms ${\left[\frac{\partial \text{SSP}}{\partial {W}_{H}}\right]}^{\pm}$ and ${\left[\frac{\partial \text{TSM}}{\partial {H}_{H}}\right]}^{\pm}$ are defined using [22], as adapted to the matrix W_{ H } and H_{ H }.
Declarations
Acknowledgements
This work was supported by the Andalusian Business, Science and Innovation Council under project P2010-TIC-6762 and (FEDER) the Spanish Ministry of Economy and Competitiveness under Project TEC2012-38142-C04-03.
Authors’ Affiliations
References
- N Ono, K Miyamoto, J Le Roux, H Kameoka, S Sagayama, in Proceedings of the European Signal Processing Conference. Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram (Lausanne, Switzerland, August 2008), pp. 25–29.
- N Ono, K Miyamoto, H Kameoka, S Sagayama, in Proceedings of the Ninth International Conference on Music Information Retrieval (ISMIR). A real-time equalizer of harmonic and percussive components in music signals (Philadelphia, Pennsylvania USA, September 14–18 2008), pp. 139–144.
- L Daudet, in Proceedings of the Third International Conference on Computer Music Modeling and Retrieval. Review on techniques for the extraction of transients in musical signals (Pisa, Italy, September 26–28 2005), pp. 219–232.
- M Helen, T Virtanen, in Proceedings of the European Signal Processing Conference. Separation of drums from polyphonic music using non-negative matrix factorisation and support vector machine (Antalya, Turkey, September 4–8 2005).
- Gillet O, Richard G: Transcription and separation of drum signals from polyphonic music. IEEE Trans. Audio Speech Lang. Process 2008, 16(3):529-540.
- Ozerov A, Vincent E, Bimbot F: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Process 2012, 20(4):1118-1133. 10.1109/TASL.2011.2172425
- D Fitzgerald, in Proceedings of DAFX. Harmonic/percussive separation using median filtering (Graz, Austria, September 6–10 2010).
- J Yoo, M Kim, K Kang, S Choi, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Nonnegative matrix partial co-factorization for drum source separation (Dallas, Texas, USA, March 14–19 2010).
- Jain R, Kasturi R, Schunck B: Machine Vision. McGraw-Hill, New York; 1995.
- H Tachibana, H Kameoka, S Sagayama, in International Conference on Acoustics, Speech and Signal Processing (ICASSP). Comparative evaluations of various harmonic/percussive sound separation algorithms based on anisotropic continuity of spectrogram (Tokyo, Japan, March 25–30 2012).
- Canadas-Quesada F, Ruiz-Reyes N, Vera-Candeas P, Carabias J, Maldonado S: A multiple-F0 estimation approach based on Gaussian spectral modelling for polyphonic music transcription. J. New Music Res 2010, 39(1):93-107. 10.1080/09298211003695579
- Y Ueda, Y Uchiyama, T Nishimoto, N Ono, S Sagayama, HMM-based approach for automatic chord detection using refined acoustic features, (Dallas, Texas, USA, March 14–19 2010).
- D Zhiyao, B Pardo, in International Conference on Acoustics, Speech and Signal Processing (ICASSP). A state space model for online polyphonic audio-score alignment (Prague, Czech Republic, May 22–27 2011).
- Lee D, Seung S: Learning the parts of objects by nonnegative matrix factorization. Nature 1999, 401(21):788-791.
- Hoyer P: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res 2004, 5: 1457-1469.
- Monga V, Mihcak M: Robust and secure image hashing via non-negative matrix factorizations. IEEE Trans. Inf. Forensics Secur 2007, 2(3):376-390. 10.1109/TIFS.2007.902670
- Kotsia I, Zafeiriou S, Pitas I: A novel discriminant non-negative matrix factorization algorithm with applications to facial image characterization problems. IEEE Trans. Inf. Forensics Secur 2007, 2(3):588-595. 10.1109/TIFS.2007.902017
- P Smaragdis, J Brown, in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Non-negative matrix factorization for polyphonic music transcription (New Paltz, New York, USA, October 19–22 2003).
- J Paulus, T Virtanen, in Proceedings of the European Signal Processing Conference. Drum transcription with non-negative spectrogram factorisation (Antalya, Turkey, September 4–8 2005).
- D Lee, H Seung, in Advances in NIPS. Algorithms for non-negative matrix factorization, (2000), pp. 556–562.
- Févotte C, Bertin N, Durrieu JL: Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis. Neural Comput. 2009, 21(3):793-830. 10.1162/neco.2008.04-08-771
- Virtanen T: Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process 2007, 15(3):1066-1074. 10.1109/TASL.2006.885253
- J Eggert, E Korner, in Proceedings of the International Joint Conference on Neural Networks (IJCNN4). Sparse coding and NMF (Budapest, Hungary, 25–29 July 2004), pp. 2529–2533.
- J Parras-Moral, F Canadas-Quesada, P Vera-Candeas, N Ruiz-Reyes, in Stockholm Music Acoustics Conference jointly with Sound And Music Computing Conference. Audio restoration of solo guitar excerpts using a excitation-filter instrument model (Stockholm, Sweden, 30 July).
- Activision, Guitar hero World Tour. [http://en.wikipedia.org/wiki/Guitar_Hero_World_Tour]. Accessed 09/06/2014.
- Activision, Guitar hero 5. [http://en.wikipedia.org/wiki/Guitar_hero_5]. Accessed 09/06/2014.
- S Araki, A Ozerov, V Gowreesunker, H Sawada, F Theis, G Nolte, D Lutter, N Duong, in 9th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA’10). The 2010 signal separation evaluation campaign (SiSEC2010): audio source separation (Saint-Malo, France, September 2010), pp. 114–122.
- Vincent E: Musical source separation using time-frequency source priors. IEEE Trans. Audio Speech Lang. Process 2006, 14(1):91-98. 10.1109/TSA.2005.860342
- Vincent E, Févotte C, Gribonval R: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process 2006, 14(4):1462-1469. 10.1109/TSA.2005.858005
- C Févotte, R Gribonval, E Vincent, BSS_EVAL toolbox user guide - Revision 2.0, Technical Report 1706, IRISA (April 2005).
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.