RPCA-DRNN technique for monaural singing voice separation

In this study, we propose a methodology for separating a singing voice from musical accompaniment in a monaural musical mixture. The proposed method uses robust principal component analysis (RPCA), followed by postprocessing, including median filtering, morphology, and high-pass filtering, to decompose the mixture. Subsequently, a deep recurrent neural network (DRNN) comprising two jointly optimized parallel stacked recurrent neural networks (sRNNs) with mask layers, trained on limited data with modest computation, is applied to the decomposed components to correct misclassified or residual singing and background music from the initial separation and to optimize the final estimates of the separated singing voice and background music. The experimental results on the MIR-1K, ccMixter, and MUSDB18 datasets and the comparison with ten existing techniques indicate that the proposed method achieves competitive performance in monaural singing voice separation. On MUSDB18, the proposed method reaches comparable separation quality with less training data and lower computational cost than another state-of-the-art technique.


Introduction
In a natural environment rich in sound emanating from multiple sources, a target sound reaching our ears is usually mixed with other acoustic interference. The sources of background acoustic interference, including car noise, street noise, music, other people's voices [1], and even reverberations [2], corrupt the target sound, complicate signal processing, pose severe challenges for the hearing impaired, and degrade the performance of automatic sound recognition systems. In musical pieces, instead of background noise, singing voices are often mixed with musical accompaniments. Generally, a song is a combination of human vocal singing and music played using string and percussion instruments. Vocal melody has a unique pitch contour, whereas background music is a repetitive rhythm created using a variety of instruments. With respect to a singing voice, which is generally the focus, musical accompaniment can be considered interference or noise: in most cases, the singing voice is the most impressive part of a song to listeners, and it conveys abundant information useful in a wide variety of research, for instance, determining the lyrics [3], language [4], singer [5,6], and emotion [7] conveyed by a song. Therefore, techniques for separating singing voices from accompaniments are important for various music information retrieval (MIR) applications, such as automatic lyric recognition [3] and alignment [8], melody extraction [9], song language identification [4], singer identification [5,6], and content-based music retrieval and annotation [10]. Such applications are indispensable in systems such as karaoke gaming, query-by-humming, active music listening [11], and audio remixing.
However, the separation of singing voice from musical accompaniment is genuinely challenging. Human listeners generally have the remarkable ability to segregate sound streams from a mixture of sounds in day-to-day life, but this remains a highly demanding job for machines, especially in the monaural case because it lacks the spatial cues that can be acquired when two or more microphones are used. Furthermore, the experience of speech separation may not straightforwardly apply to singing separation. Singing voice and speech, both human sounds, have many similarities, but they are also dissimilar. Therefore, the difficulties encountered in the separation of singing and speech from their respective backgrounds are different. The most important difference between singing and speech in terms of their separation from a background is the nature of the other coexisting sounds. The background interference mixed with speech may be harmonic or nonharmonic, narrowband, or broadband and generally uncorrelated to the speech. However, the musical accompaniment in a song is usually harmonic and broadband, correlated to the singing, and does not fit the general assumptions of noise, such as whiteness or stationarity. Hence, traditional noise-suppression methods are unsuitable.
Additionally, singing voices usually contain clear and strong harmonic structures and rapidly changing harmonics, such as vibratos or slides, and musical accompaniment can be considered the sum of percussive sounds and harmonics. Simple harmonic extraction techniques are not useful for polyphonic mixtures and rapidly changing harmonics because the extraction results are inaccurate and harmonic instruments (not only singing) also contain harmonics. Moreover, onset and offset cues, which are generally useful in auditory scene analysis because different sounds normally start and end at different times, are not useful either because the starting and ending times of singing voices and musical accompaniments usually coincide. In addition, in singing, lyrics are expressed by changing notes according to the melody, which makes singing an interpretation of a predefined musical score. Therefore, pitch in singing tends to be piecewise constant, with abrupt pitch changes and different types of fluctuations. The pitch range of singing could be as high as 1000 or 1400 Hz for soprano singers [12], compared with the normal range of 80 to 400 Hz for speech. Hence, pitch-extraction techniques are commonly inaccurate, and in songs, distinguishing between voiced and unvoiced is problematic.
Among existing approaches, by assuming and utilizing the underlying properties of singing and musical accompaniment, Huang et al. [30] performed robust principal component analysis (RPCA) to decompose the magnitude spectrogram of a song into low-rank and sparse matrices, which correspond to the musical accompaniment and singing voice, respectively. Studies have demonstrated that such decomposition methods outperform sophisticated pitch-based methods [31]. However, the assumptions of low-rank and sparsity may not hold in all cases. For example, the sound of drums is sparse but not low-rank, and the vocal part of a song can sometimes be low-rank [31].
In addition to the aforementioned categories of methods, hybrids or fusions of existing building blocks have emerged. Among them, some integrate pitch or F0 information to improve separation. For example, Virtanen et al. [49] proposed a hybrid method that combines pitch-based inference and nonnegative spectrogram factorization. Rafii et al. [50] combined a pitch-based method with the repetition-based repeating pattern extraction technique with a similarity matrix (REPET-SIM), a generalization of REPET that uses a similarity matrix to identify the repeating elements of the background music. Ikemiya et al. [51,52] utilized the mutual dependency of F0 estimation and source separation to improve singing voice separation by combining a time-frequency mask based on RPCA with a mask based on harmonic structures. Other cascading [53] and fusion methods [54] have been proposed as well.
In the real world, learning involves observing large numbers of objects and drawing inferences about their categories, coupled with occasional experiences of supervised learning [55]. In other words, information gleaned from data may create some underlying assumptions or rules and extend the knowledge obtained from labeled data, which helps to improve category learning [56]. Although the manner in which humans combine different ways of learning and jointly exploit different data remains unclear [55], we may assume that humans use underlying knowledge derived from observation and inference plus supervised learning for pattern recognition. In our daily experience, we glean information from large amounts of data to arrive at a reasonable central tendency and draw boundaries between different categories. The hidden structure discovered by this process can be leveraged to obtain deep insight into the informational content, and the insight can lead to assumptions about the underlying properties, with the result that no prior training is required. The RPCA approach [30] is a famous example. We therefore chose RPCA as a preprocessing method; the subsequent supervised learning can further adjust the results of RPCA to increase their accuracy.
Therefore, our intention is to effectively combine assumptions about the underlying properties with supervised learning to improve the separation of the singing voice and background music in a monaural mixture. Because labeled data are always difficult to obtain and are usually insufficient, employing approaches without prior training for the initial separation and then applying supervised learning with limited data to the initial separation, rather than to the original input, can help improve the separation quality. Benefiting from the initial separation without prior training, our method may achieve good results without data augmentation if the amount of data is not too low and can therefore greatly reduce the computational load. Hence, we propose using RPCA, based on the underlying low-rank and sparse properties of accompaniments and vocals, respectively, to achieve the initial separation and applying a supervised DRNN trained on limited data to further separate the results of RPCA, thereby correcting misclassified or residual singing and background music from the initial separation.
The remainder of this paper is organized as follows. Section 2 introduces the proposed RPCA-DRNN model, including RPCA, postprocessing (median filter, morphology, and high-pass filter), and the architecture of the DRNN. Section 3 describes the datasets, objective and subjective measures, and experimental results, together with a comparison of the proposed method with the reference methods. Finally, conclusions are provided in the final section.

Proposed RPCA-DRNN method
Music is usually composed of multiple mixed sounds, such as human vocals and various instrumental sounds. Huang et al. [30] reported that the magnitude spectrogram of a song can be regarded as the superposition of a sparse matrix and a low-rank matrix and can be decomposed by RPCA. The decomposed sparse and low-rank matrices appear to correspond to the singing voice and accompaniment, respectively. Hence, based on the assumed correspondence of singing with the sparse matrix and of accompaniment with the low-rank matrix, RPCA can be applied to the singing/accompaniment separation problem. Without any pretraining, its results are superior to those of sophisticated pitch-based methods [31].
However, the underlying low-rank and sparsity assumptions may not be true in all cases. The decomposed sparse matrix may contain instrumental sounds (e.g., percussion) besides singing voice, and the decomposed low-rank matrix may contain vocal besides instrumental sounds. Upon listening to the separated singing voice, it is apparent that there is some residual background music. Likewise, some part of the singing voice is misclassified as background music. Therefore, additional methods or techniques are needed to reclassify the RPCA output to increase the separation accuracy.
We propose an RPCA-DRNN method that employs RPCA with postprocessing to perform the initial separation and a supervised DRNN to perform the subsequent separation. The mixed signal is input into the RPCA and separated into the sparse and low-rank matrices. Then, postprocessing, including median filter, morphology, and high-pass filter, is applied. The DRNN that follows comprises two jointly optimized parallel stacked recurrent neural networks (sRNNs) with mask layers. The sparse and low-rank matrices obtained after RPCA and postprocessing are sent to their corresponding sRNNs. One sRNN further separates the sparse matrix into estimated singing and musical accompaniment parts because there is a residual background music component in the initially separated sparse matrix. Similarly, the other sRNN further separates the low-rank matrix into estimated singing and musical accompaniment parts because there is a residual singing vocal component in the low-rank matrix. The final estimated singing is the sum of the singing part estimated from the sparse matrix and the residual singing part estimated from the low-rank matrix, and it is compared with the original clean singing voice. Correspondingly, the final estimated musical accompaniment is the sum of the residual musical accompaniment part estimated from the sparse matrix and the musical accompaniment part estimated from the low-rank matrix, and it is compared with the original clean musical accompaniment. By reducing the error between the estimated and clean singing parts and that between the estimated and clean musical accompaniment parts, we can jointly optimize the DRNN and obtain the final model. The time-domain waveform of the singing/music is reconstructed by applying the inverse short-time Fourier transform (ISTFT) to the estimated magnitude spectrum of the singing/music along with the phase spectrum of the sparse/low-rank matrix.
In the following subsections, details of the techniques associated with each part of the proposed method are discussed.

RPCA
The convex program RPCA was proposed by Candès et al. [57] to recover a low-rank matrix L from highly corrupted measurements C = L + S, where S is a sparse matrix whose entries may have arbitrary magnitude. The convex optimization problem is defined as

min_{L,S} ‖L‖_* + λ‖S‖_1  subject to  L + S = C,

where ‖•‖_* and ‖•‖_1 denote the nuclear norm and l1-norm (i.e., the sum of the singular values and the sum of the absolute values of the matrix entries), respectively. The dimensions of L, S, and C are m × n, and λ is a positive tradeoff parameter that can be selected based on prior knowledge about solutions to practical problems [57]. Generally, λ = 1/√max(m, n), which works well for incoherent matrices [54].
Musical accompaniment generally has an underlying repeating structure and can be considered a low-rank signal L. By contrast, singing voices, with more variation, have a higher rank and are comparatively sparse in the time and frequency domains; they can be considered sparse signals S [30]. Then, C, L, and S can be regarded as the spectra of the mixture, accompaniment, and singing, respectively, and m and n as the numbers of frequency bins and frames. Therefore, RPCA can be used to separate singing vocals from a mixture without training.
The separation is performed as follows. First, the short-time Fourier transform (STFT) is used to obtain the spectrogram of the mixture C. Then, the inexact augmented Lagrange multiplier (ALM) method [58] is used to obtain L and S.
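To make the decomposition concrete, the following NumPy sketch implements a simplified inexact ALM loop for RPCA; the initialization of the multiplier Y, the penalty μ, its growth factor ρ, and the stopping tolerance are common settings assumed for illustration, not values taken from the paper.

```python
import numpy as np

def rpca_inexact_alm(C, lam=None, tol=1e-6, max_iter=200):
    """Decompose C into a low-rank L and a sparse S via inexact ALM.

    A simplified sketch of the method of [58]; lam defaults to the
    usual 1/sqrt(max(m, n)) trade-off parameter."""
    m, n = C.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))
    norm_C = np.linalg.norm(C, ord='fro')
    # Dual-variable and penalty initialization (common choices).
    Y = C / max(np.linalg.norm(C, 2), np.abs(C).max() / lam)
    mu = 1.25 / np.linalg.norm(C, 2)
    rho = 1.5
    L = np.zeros_like(C)
    S = np.zeros_like(C)
    for _ in range(max_iter):
        # Singular-value thresholding updates the low-rank part.
        U, sig, Vt = np.linalg.svd(C - S + Y / mu, full_matrices=False)
        sig = np.maximum(sig - 1.0 / mu, 0.0)
        L = (U * sig) @ Vt
        # Soft thresholding updates the sparse part.
        T = C - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        # Dual update and convergence check on the residual.
        Z = C - L - S
        Y = Y + mu * Z
        mu = rho * mu
        if np.linalg.norm(Z, ord='fro') / norm_C < tol:
            break
    return L, S
```

Applied to the magnitude spectrogram of a mixture, L and S then serve as the accompaniment and singing estimates before masking.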
To improve the quality of the separation results, binary time-frequency masking is further applied in [30]. The binary time-frequency mask M_b is defined as

M_b(i, j) = 1 if |S(i, j)| > κ |L(i, j)|, and M_b(i, j) = 0 otherwise,

for i = 1…m and j = 1…n, where κ is the threshold on the magnitude ratio of the sparse to the low-rank component. When the ratio is greater than the threshold, the binary mask is set to 1.
However, the use of a soft mask in REPET can marginally improve the quality of the overall results (only statistically significant for the source-to-artifact ratio (SAR) of the singing voice), except for the source-to-interference ratio (SIR) of the singing voice [59]. Moreover, the experiments of [29] revealed that a soft mask is perceptually better than a binary mask. Therefore, the following soft mask M_s is adopted in the proposed method:

M_s(i, j) = |S(i, j)| / (|S(i, j)| + |L(i, j)|), i = 1…m, j = 1…n.

RPCA is a method that does not need any training or labeled data, and hence, it is convenient to use. Nevertheless, the sparse and low-rank assumptions are rather strong, and they may not be suitable for every situation. For example, the sound of drums is sparse and can be classified as singing voice, and the vocal part can sometimes be classified as low-rank. The decomposed low-rank matrix might be a mixture of singing and instrumental sounds, and the decomposed sparse matrix might be a mixture of vocal and percussion sounds [31]. Separation is even more troublesome when the low-rank matrix contains a nonvocal harmonic instrument (such as an electric guitar or a string instrument). According to Yang [60], the sparse signals generated by RPCA also often contain percussive components because percussive sound can be considered a periodic stream that is sparse in the time domain [61]. Moreover, RPCA does not consider other information, such as pitch or structure information. The output of RPCA still contains some background music in the separated singing, and the separated music still contains singing. Thus, the quality of such RPCA separation is limited, and other methods must be employed to improve the results.
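For illustration, the binary and ratio-style soft masks can be computed as below; the |S| / (|S| + |L|) form of the soft mask is an assumption of this sketch (a common choice), and eps is a small constant added to avoid division by zero.

```python
import numpy as np

def soft_mask(S_mag, L_mag, eps=1e-12):
    # Ratio-style soft mask: the fraction of each time-frequency unit
    # assigned to the singing voice (assumed |S| / (|S| + |L|) form).
    return S_mag / (S_mag + L_mag + eps)

def binary_mask(S_mag, L_mag, kappa=1.0):
    # Binary mask: 1 where the sparse magnitude exceeds kappa times
    # the low-rank magnitude, 0 otherwise.
    return (S_mag > kappa * L_mag).astype(float)
```

The singing spectrogram is then estimated as the element-wise product of the mask with the mixture spectrogram.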

Postprocessing
The soft mask generated from RPCA is postprocessed to improve the separation performance. The postprocessing is applied to the soft mask instead of to L and S, so the sum of the obtained low-rank and sparse matrices is still equal to the mixture. The postprocessing includes median filtering, morphology, and high-pass filtering, as depicted in Fig. 1. Compared with the singing spectrum obtained using RPCA, the spectrum of clean singing is cleaner and has a clearer structure. By contrast, the spectrum of the estimated singing contains noise, has a broken structure, and has very-low-frequency components that seldom appear in vocals. Hence, postprocessing is needed to further improve the mask.

Median filter
Median filtering, a widely used nonlinear digital filtering technique in image processing and sound separation, especially for separating harmonic from percussive sounds [62], is applied to remove noise from the soft mask M_s. Because the low-rank hypothesis barely holds for drum sounds, a median filter is applied to enhance the separation. The two-dimensional median filter runs through the time-frequency units of the mask unit by unit and replaces each unit with the median value of the neighboring d_m × d_n units within a window sliding over the mask. The soft mask after median filtering is M_sm.
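A plain-NumPy sketch of the two-dimensional median filter described above; the window size (d_m, d_n) and the replicated-border handling are illustrative choices, not the paper's settings.

```python
import numpy as np

def median_smooth(mask, d_m=3, d_n=3):
    """2-D median filter over a time-frequency mask.

    Each unit is replaced by the median of its d_m x d_n neighborhood;
    edge units use replicated borders."""
    pm, pn = d_m // 2, d_n // 2
    padded = np.pad(mask, ((pm, pm), (pn, pn)), mode='edge')
    out = np.empty_like(mask, dtype=float)
    m, n = mask.shape
    for i in range(m):
        for j in range(n):
            out[i, j] = np.median(padded[i:i + d_m, j:j + d_n])
    return out
```

Isolated mask spikes (single noisy units) are removed, while large smooth regions pass through unchanged.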

Morphology
Morphology [63] is a set of popular operations in image processing that processes images based on predefined kernels or structuring elements. Two very common morphology operators, erosion and dilation [63], with predefined structuring elements, are applied to the soft mask M_sm to enhance the possible singing spectrum pattern. By creating a structuring element of a certain size and shape, operations sensitive to specific shapes can be constructed. A structuring element defines common shapes, such as lines or circles, and is represented by a matrix of 0s and 1s, where 1 symbolizes the neighborhood. The structuring element slides over the image with its center on the pixel being processed. First, normalization is performed to transform the soft mask into a grayscale image. Then, grayscale erosion is performed, followed by grayscale dilation. The erosion operation outputs the minimum neighborhood value of the pixels that fall within the predefined structuring element. The dilation operation, by contrast, outputs the maximum neighborhood value. Because the original singing spectrogram contains horizontal line-like structures, as in Fig. 2a, and the horizontal line structures in the singing spectrogram after RPCA with soft mask and median filter, as in Fig. 2b, are broken, a line-structuring element of length len and degree θ, as shown in Fig. 3a, is applied in both the erosion and dilation operations. Figure 3b is a schematic diagram of erosion and dilation using a line-structuring element of length 10 and degree 5 on a binary example. As the circled regions in Fig. 3b show, after the erosion and dilation, the small gap on the horizontal line is patched up.
With D_K representing the domain of the kernel (the structuring element) K, grayscale erosion is defined and performed as follows:

(I ⊖ K)(x, y) = min_{(i, j) ∈ D_K} I(x + i, y + j),

which is equivalent to a local-minimum operator; ⊖ is the erosion operator. By contrast, grayscale dilation is equivalent to a local-maximum operator and is defined as follows:

(I ⊕ K)(x, y) = max_{(i, j) ∈ D_K} I(x − i, y − j),

where ⊕ is the dilation operator. By using the erosion and dilation operations, the skeleton of the horizontal line structure of the singing spectrum can be reconstructed. Observing the singing spectrogram after applying morphology, as in Fig. 2c, the line structures are rebuilt.
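The two operators can be sketched directly from the definitions; this version assumes a flat (binary) structuring element, for which erosion and dilation reduce to the local minimum and maximum over the element's support. The window handling and the horizontal line element below are illustrative.

```python
import numpy as np

def grey_erode(img, selem):
    # Flat grayscale erosion: local minimum over the structuring
    # element's support D_K (entries equal to 1), with edge padding.
    km, kn = selem.shape
    pm, pn = km // 2, kn // 2
    padded = np.pad(img, ((pm, pm), (pn, pn)), mode='edge')
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            win = padded[i:i + km, j:j + kn]
            out[i, j] = win[selem == 1].min()
    return out

def grey_dilate(img, selem):
    # Flat grayscale dilation: local maximum over the support.
    km, kn = selem.shape
    pm, pn = km // 2, kn // 2
    padded = np.pad(img, ((pm, pm), (pn, pn)), mode='edge')
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            win = padded[i:i + km, j:j + kn]
            out[i, j] = win[selem == 1].max()
    return out

# A horizontal line structuring element (length 5, degree 0);
# the paper's length and angle are tunable parameters.
line = np.ones((1, 5))
```

Running erosion and then dilation with such a line element, as described above, acts along horizontal ridges of the mask.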

High-pass filter
Liutkus et al. [59] demonstrated that the application of a high-pass filter at the cutoff frequency of 100 Hz to the estimated singing voice yields overall statistically superior results, except for SAR. Therefore, we adopt the same filtering scheme in the postprocessing of the vocal estimate because the frequency of a singing voice is rarely lower than 100 Hz.
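A minimal sketch of the 100 Hz high-pass step applied to the vocal mask; the sampling rate, FFT size, and the frequency-on-axis-0 layout are assumptions of this example.

```python
import numpy as np

def highpass_mask(vocal_mask, sr=16000, n_fft=1024, cutoff=100.0):
    # Zero out vocal-mask bins below the cutoff frequency; because the
    # masks are complementary, that energy goes to the accompaniment.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)[:vocal_mask.shape[0]]
    out = vocal_mask.copy()
    out[freqs < cutoff, :] = 0.0
    return out
```

At 16 kHz with a 1024-point STFT, the bin spacing is 15.625 Hz, so the first seven bins fall below 100 Hz.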

DRNN
Although the RPCA with postprocessing can separate a mixture into singing voice and background music through the processed mask, the estimated singing voice is doped with background music. Similarly, the estimated background music is doped with vocal melody. Therefore, it is necessary to use other techniques to generate a model for suitably reclassifying the doped part as either singing voice or background music.
Neural networks can effectively perform this separation. Among neural networks, recurrent neural networks (RNNs), which introduce the memory from previous time steps, are widely used to model the temporal information in time-series signals, such as audio or speech [64]. However, in the current time step, there is only one layer between the input information and the output. If hierarchical processing or multiple time scales are needed for processing the time series, RNNs do not support such operations. To solve the problem, a DRNN is proposed for performing hierarchical processing and capturing the structure of the time series [65].
The architecture of the DRNN [37], which is conceptually a combination of DNN and RNN, is a multilayer perceptron in which each layer is equivalent to an RNN; that is, each layer has temporal feedback loops. Hermans and Schrauwen [65] demonstrated that a DRNN generates diverse time scales at different levels. Therefore, it can capture a time series more inherently. The architecture of the DRNN can be represented with multiple hidden layers. Temporally recurrent connections can occur in all layers, as in the sRNN [66], or in a single layer. We used an sRNN in the experiments conducted herein; however, using recurrent connections at a single layer is a possible choice as well.

sRNN
sRNNs, which have multiple levels of transition functions, can be presented as follows:

h_t^l = f_h(h_t^{l−1}, h_{t−1}^l) = ∅_l(W^l h_t^{l−1} + U^l h_{t−1}^l),

where h_t^l is the hidden activation of the lth layer at time t, and h_t^0 is equal to the input; f_h is a state transition function; ∅_l(•) is an element-wise nonlinear function in the lth layer; W^l is the weight matrix of the lth layer, which is multiplied by the activation of layer l−1, h_t^{l−1}; and U^l is the weight matrix of the recurrent connection in the lth layer, which is multiplied by the activation of layer l at time t−1, h_{t−1}^l.
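The recursion above can be sketched for one time step as follows; the layer sizes, bias terms, and tanh nonlinearity are illustrative choices.

```python
import numpy as np

def srnn_step(x_t, h_prev, Ws, Us, bs, phi=np.tanh):
    # One time step of a stacked RNN:
    #   h_t^l = phi(W^l h_t^{l-1} + U^l h_{t-1}^l + b^l), with h_t^0 = x_t.
    h = []
    below = x_t  # activation coming from the layer below
    for W, U, b, h_l in zip(Ws, Us, bs, h_prev):
        h_new = phi(W @ below + U @ h_l + b)
        h.append(h_new)
        below = h_new
    return h
```

Running this over t = 1…T, feeding each step's output list back in as h_prev, unrolls the full stacked recurrence.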

Gated recurrent unit
Instead of using a traditional nonlinear function unit, such as sigmoid, tanh, or rectified linear unit (ReLU) in the sRNN, we use the gated recurrent unit (GRU) as the hidden unit. The GRU [67] is a variant of the long short-term memory (LSTM) unit, and it combines the forget and input gates into a single update gate and is simpler to compute and implement. Chung et al. [68] reported that in polyphonic music modeling and speech signal modeling, the performance of the GRU is comparable to that of an LSTM unit and superior to that of the traditional unit tanh.
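The GRU update can be written compactly as below; the weight names are ours and biases are omitted for brevity, so this is a sketch of the standard formulation rather than the exact implementation used here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    # Single GRU step: an update gate z and reset gate r replace the
    # LSTM's separate input/forget gates.
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde         # interpolated new state
```

Because the new state is a convex combination of the old state and the candidate, the GRU can retain information over long spans with fewer parameters than an LSTM.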

Proposed model architecture
The architecture of the proposed method is depicted in Fig. 4. A DRNN comprising two jointly optimized parallel sRNNs with mask layers, which are not trainable and are simply arithmetic operations on the network outputs, is used to further improve the results of RPCA with postprocessing. The inputs to the DRNN are S_t and L_t, which are the sparse and low-rank magnitude spectra obtained from the RPCA output after postprocessing at time t.
The output predictions V̂_{S,t} ∈ ℝ^F and B̂_{S,t} ∈ ℝ^F represent the predicted magnitude spectra of the singing voice and residual background music separated from S_t, respectively, and B̂_{L,t} ∈ ℝ^F and V̂_{L,t} ∈ ℝ^F represent the predicted magnitude spectra of the background music and residual singing voice separated from L_t, respectively, where F is the dimension of the magnitude spectra. The ReLU is used as the activation function in the output layer owing to its advantages of efficient computation, scale invariance, superior gradient propagation, biological plausibility, and sparse activation.
Huang et al. [35] used a time-frequency mask to further smooth the separation outcomes and enforce the constraint that the sum of the separated components is equal to the original unseparated signal. They incorporated the mask as a layer in the neural network to ensure that the DRNN is optimized based on the masked output. Hence, a mask layer is added to each sRNN, as in the architecture depicted in Fig. 4. The predicted magnitude spectra of the singing voice Ṽ_{S,t} and the residual background music B̃_{S,t} separated from S_t by incorporating the mask concept can be respectively expressed as follows:

Ṽ_{S,t} = |V̂_{S,t}| / (|V̂_{S,t}| + |B̂_{S,t}|) ⊙ S_t, (7)
B̃_{S,t} = |B̂_{S,t}| / (|V̂_{S,t}| + |B̂_{S,t}|) ⊙ S_t, (8)

where ⊙ denotes element-wise multiplication and the division is element-wise. Likewise, the predicted magnitude spectra of the background music B̃_{L,t} and the residual singing voice Ṽ_{L,t} separated from L_t by incorporating the mask concept can, respectively, be expressed as follows:

B̃_{L,t} = |B̂_{L,t}| / (|B̂_{L,t}| + |V̂_{L,t}|) ⊙ L_t, (9)
Ṽ_{L,t} = |V̂_{L,t}| / (|B̂_{L,t}| + |V̂_{L,t}|) ⊙ L_t. (10)

The final estimated singing Ṽ_t is the sum of Ṽ_{S,t} and Ṽ_{L,t}, as expressed in (11), and it is compared with the original clean singing voice:

Ṽ_t = Ṽ_{S,t} + Ṽ_{L,t}. (11)

The final estimated musical accompaniment B̃_t is the sum of B̃_{S,t} and B̃_{L,t}, as expressed in (12), and it is compared with the original clean musical accompaniment:

B̃_t = B̃_{S,t} + B̃_{L,t}. (12)

Therefore, an extra layer to perform the summation is added, as depicted in Fig. 4.
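The mask layers and the final summation are simple arithmetic on the network outputs and can be sketched as follows; the function and variable names are hypothetical, and eps is a small constant added to avoid division by zero.

```python
import numpy as np

def mask_layer(a, b, mix, eps=1e-12):
    # Time-frequency mask layer: rescales the two network outputs so
    # that the two masked estimates sum exactly to the branch input.
    denom = np.abs(a) + np.abs(b) + eps
    return np.abs(a) / denom * mix, np.abs(b) / denom * mix

def combine(V_S, B_S, B_L, V_L, S, L):
    # Sparse branch yields singing + residual music; low-rank branch
    # yields music + residual singing. Final estimates are the sums.
    v_s, b_s = mask_layer(V_S, B_S, S)
    b_l, v_l = mask_layer(B_L, V_L, L)
    return v_s + v_l, b_s + b_l
```

Because each mask layer renormalizes its branch, the final singing and accompaniment estimates sum to S + L, i.e., to the original mixture spectrogram.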

Discriminative training
The neural network is optimized by minimizing the sum of the squared errors between the estimated and clean singing voices and those between the estimated and clean background musical accompaniments. Moreover, two discriminative terms [35] are added to further penalize the interference from other sources. The loss function is defined as follows:

J = Σ_{t=1}^{T} ( ‖Ṽ_t − V_t‖² + ‖B̃_t − B_t‖² − ω_dis‖Ṽ_t − B_t‖² − ω_dis‖B̃_t − V_t‖² ), (13)

where T represents the length of an input sequence, and 0 ≤ ω_dis ≤ 1. The output targets V_t ∈ ℝ^F and B_t ∈ ℝ^F represent the clean magnitude spectra of the singing voice and background music at time t, respectively. In (13), ‖Ṽ_t − V_t‖² and ‖B̃_t − B_t‖² are subloss terms that penalize the deviation between the final estimated and clean singing voices and that between the final estimated and clean background musical accompaniments; moreover, −ω_dis‖Ṽ_t − B_t‖² and −ω_dis‖B̃_t − V_t‖² are discriminative terms that further penalize the interference from other sources. The weight ω_dis controls the prominence of the discriminative terms.
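The loss can be sketched directly; the value of ω_dis here is arbitrary, and the time dimension is folded into the sums.

```python
import numpy as np

def discriminative_loss(V_hat, B_hat, V, B, w_dis=0.05):
    # Squared error to the clean targets, minus weighted discriminative
    # terms that penalize similarity to the competing source.
    return (np.sum((V_hat - V) ** 2) - w_dis * np.sum((V_hat - B) ** 2)
            + np.sum((B_hat - B) ** 2) - w_dis * np.sum((B_hat - V) ** 2))
```

The negative terms reward estimates that stay far from the other source's target, which sharpens the separation at the cost of a small bias.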

Experiment results and evaluations
Three datasets, namely MIR-1K [15], an amateur Chinese karaoke set; ccMixter [69], gathered from ccmixter.org; and MUSDB18 [70], a professionally produced set, were used, and ten existing source separation techniques were evaluated and compared in our experiments.

Dataset
MIR-1K, ccMixter, and MUSDB18 are used in our experiments. The MIR-1K dataset was developed by Jyh-shing Roger Jang [15]. This dataset consists of 110 Chinese karaoke songs performed by 11 male and 8 female amateurs. These songs are split into 1000 song clips with durations ranging from 4 to 13 s. The sampling rate is 16 kHz, and each sample occupies 16 bits. Each clip contains the singing voice and background music in different channels. The mixture is generated by mixing the singing voice and background music at the same energy, that is, at a signal-to-noise ratio of 0 dB. One hundred and seventy-five clips with a total length of 23 min 36 s, sung by one male and one female singer, were used as the training set. The remaining 825 clips, with a total length of 1 h 49 min 49 s and sung by ten male and seven female singers, were used as the test set. Because the training dataset was not large and may have lacked variety, we repeatedly shifted the background music by 10,000 samples in each instance and mixed the shifted background music with the singing voice to create a more diverse mixture. After the above-described circular shift, the total length of the clips in the training dataset was 5 h 8 min 7 s. The ccMixter dataset [69] contains 50 full-length tracks of many different musical genres. Each track ranges from 1 min 17 s to 7 min 36 s, and the total length is 3 h 12 min 48 s. In training and testing, 40 and 10 tracks are used, with lengths of 2 h 33 min 34 s and 39 min 13 s, respectively. A sampling rate of 16 kHz, down-sampled from 44.1 kHz, is used.
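The circular-shift augmentation described above can be sketched as follows; the number of shifted copies is an illustrative parameter.

```python
import numpy as np

def circular_shift_mixtures(vocal, music, shift=10000, n_shifts=3):
    # Circularly shift the accompaniment by multiples of `shift`
    # samples and remix with the unshifted vocal at equal energy (0 dB).
    return [vocal + np.roll(music, (k + 1) * shift)
            for k in range(n_shifts)]
```

Each shifted remix pairs the same vocal with a differently aligned accompaniment, increasing the variety of mixtures without new recordings.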
The MUSDB18 dataset was developed by Rafii et al. [70] and released for the 2018 community-based signal separation evaluation campaign (SiSEC 2018), which aims to compare the performance of source separation systems on the same data and metrics. The songs in MUSDB18 are stereophonic mixtures. Each song contains four source categories, namely vocals, bass, drums, and others. The MUSDB18 dataset contains a total of 150 songs of different genres, with 100 of them used for training and 50 for testing. The total length of MUSDB18 is 11 h 14 min 15 s, of which the training portion is 7 h 46 min 24 s and the testing portion is 3 h 27 min 50 s. The sampling rate is 44.1 kHz. In monaural singing voice separation, sources other than vocals are treated as accompaniment. We estimate the vocals and accompaniment from the left and right channels of the MUSDB18 mixtures, respectively.
sRNN: implemented reference method; the sRNN technique with the model architecture shown in Fig. 5. In the last two layers of the sRNN, two branches are used so that the network output can be optimized to be as close as possible to the clean vocals and clean accompaniment at the same time. The GRU is used as the hidden unit. Joint mask optimization, as presented in (7)-(10), and discriminative training (ω_dis = 0.5), as described in Section 2, are applied as well. The sRNN architecture contains three hidden layers with 1000 neurons per layer. The input spectrum is calculated by carrying out a 1024-point STFT with a hop size of 256. The Adam optimizer [68] is used with a batch size of 64 and a learning rate of 0.0001. A global step of 100,000 is used as the stopping criterion.
MLRR: reference method proposed in [31]. MLRR considers both the vocal and instrumental spectrograms as low-rank matrices and uses the learned dictionaries for decomposition. The results were directly reported from the literature.
RNMF: reference method proposed in [24]. RNMF is a nonnegative variant of RPCA. The results were directly reported from [35].
MOD-GD: reference method proposed in [37]. MOD-GD function for learning the time-frequency masks of the sources is used. The results were directly reported from the literature of the 2-DRNN architecture.
U-Net: reference method proposed in [45]. U-Net is a convolutional network initially developed for biomedical image segmentation. The results were directly reported from [43].
EFN: reference method proposed in [42]. EFN is an end-to-end framework for extracting effective representations of the magnitude spectra. The results were directly reported from the literature for the model with GRU.
CRNN-A: reimplemented reference method proposed in [43]. CRNN-A uses a CNN as the front-end of an RNN. The objective result was directly reported from the literature and the subjective result was obtained from our implementation.
Open-Unmix: reference method proposed in [41]. Open-Unmix is based on a bidirectional LSTM model. It used stronger data augmentation methods on MUSDB18 than E-MRP-CNN [47]. Normalization and input/output scaling are also used. The results were directly reported from the literature.
E-MRP-CNN: reference method proposed in [47]. E-MRP-CNN automatically searches for effective MRP-CNN structures using genetic algorithms. Gain and sliding data augmentation are used on MUSDB18; the augmented data are four times the original data. The results were directly reported from the literature for model S-17-1-MUS.
RPCA-DRNN: the proposed method, which uses RPCA with a soft mask, median filter, morphology, and high-pass filter, followed by a DRNN containing two parallel sRNNs to further correct the residual singing voice and music in the output. The window size and hop size of the STFT are 1024 and 256, respectively. RPCA-DRNN_f: a feather version of RPCA-DRNN, which uses only 5 songs of MUSDB18 with data augmentation for training. After the circular-shift augmentation, the total length was 54 min 28 s.
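Two of the postprocessing steps can be sketched as follows (a minimal NumPy sketch under assumed parameter values; the cutoff frequency, filter size, and function names are illustrative, not the exact settings of the proposed system, and the morphological step could be realized analogously with a binary opening of the thresholded mask):

```python
import numpy as np

def highpass_mask(spec, sr=16000, n_fft=1024, cutoff_hz=100.0):
    """Zero out spectrogram bins below cutoff_hz (rows are frequency bins).

    Bin k corresponds to frequency k * sr / n_fft, so the cutoff bin is
    cutoff_hz * n_fft / sr.
    """
    out = spec.copy()
    cutoff_bin = int(cutoff_hz * n_fft / sr)
    out[:cutoff_bin, :] = 0.0
    return out

def median_filter_2d(x, size=3):
    """Naive 2D median filter with edge padding (loop-based, for clarity).

    Smooths the time-frequency mask and suppresses isolated outlier bins.
    """
    pad = size // 2
    xp = np.pad(x, pad, mode="edge")
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.median(xp[i:i + size, j:j + size])
    return out
```

A single spurious time-frequency bin in an otherwise empty mask is removed by the median filter, which is the smoothing effect exploited here.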

Objective measures
There are four fundamental metrics, namely the source-to-distortion ratio (SDR), image-to-spatial distortion ratio (ISR), source-to-interference ratio (SIR), and source-to-artifact ratio (SAR) [72][73][74], and a derived metric, namely the normalized SDR (NSDR) [13]. To compare the performance with other existing source separation systems on the same data and metrics, we used the overall performance metrics, namely the global NSDR (GNSDR), global SIR (GSIR), and global SAR (GSAR) [35,47], to objectively measure the performance of the evaluated methods in the experiments on MIR-1K and ccMixter. For the same reason, BSS Eval version 4 [74] (SDR, ISR, SIR, and SAR), the evaluation metrics released by SiSEC, is used in the experiment on MUSDB18, a dataset also released by SiSEC.
Assume that an estimated source signal in separation is ŝ(t). This signal equals the clean source signal s(t) plus the spatial distortion e_spat(t) [75], the interference error e_interf(t), and the artifact error e_artif(t), as presented in (14):

ŝ(t) = s(t) + e_spat(t) + e_interf(t) + e_artif(t) (14)
The metrics SDR, ISR, SIR, and SAR are defined as follows:

SDR = 10 log10 (‖s‖² / ‖e_spat + e_interf + e_artif‖²) (15)
ISR = 10 log10 (‖s‖² / ‖e_spat‖²) (16)
SIR = 10 log10 (‖s + e_spat‖² / ‖e_interf‖²) (17)
SAR = 10 log10 (‖s + e_spat + e_interf‖² / ‖e_artif‖²) (18)

where ‖•‖ represents the Euclidean norm. The other metrics derived from SDR, SAR, and SIR are NSDR, GNSDR, GSIR, and GSAR. NSDR [13] calculates the difference in SDR between the mixture and the separated singing voice, as in (19); it can be considered the improvement in SDR owing to the adoption of the separation technique. GNSDR, GSIR, and GSAR [35] are the length-weighted means of NSDR, SIR, and SAR, respectively, as expressed in (19)-(22):

NSDR(v̂_k, v_k, c_k) = SDR(v̂_k, v_k) − SDR(c_k, v_k) (19)
GNSDR = Σ_k l_k NSDR(v̂_k, v_k, c_k) / Σ_k l_k (20)
GSIR = Σ_k l_k SIR(v̂_k, v_k) / Σ_k l_k (21)
GSAR = Σ_k l_k SAR(v̂_k, v_k) / Σ_k l_k (22)
where k is the song index and v̂_k, v_k, c_k, and l_k are the estimated singing voice, clean singing voice, mixture, and song length of the kth song, respectively. GNSDR, GSIR, and GSAR are adopted as objective measures in the experiment. From (15)-(22), the higher the values of SDR, ISR, SIR, SAR, NSDR, GNSDR, GSIR, and GSAR, the better the separation performance.
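The NSDR and length-weighted GNSDR computations in (19) and (20) can be sketched as follows (a simplified sketch in which SDR is computed directly as the ratio of reference energy to residual energy, without the full BSS Eval error decomposition; the function names are illustrative):

```python
import numpy as np

def sdr(est, ref):
    """Simplified SDR in dB: reference energy over residual energy."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((est - ref) ** 2))

def gnsdr(est_voices, clean_voices, mixtures, lengths):
    """Length-weighted mean of NSDR over all songs, as in (19)-(20).

    NSDR_k = SDR(v_hat_k, v_k) - SDR(c_k, v_k): the SDR improvement of
    the separated voice over the unprocessed mixture.
    """
    nsdrs = np.array([sdr(vh, v) - sdr(c, v)
                      for vh, v, c in zip(est_voices, clean_voices, mixtures)])
    lengths = np.asarray(lengths, dtype=float)
    return float(np.sum(lengths * nsdrs) / np.sum(lengths))
```

For example, if separation shrinks the residual noise in the mixture by a factor of 10 in amplitude, the NSDR of that song is 20 dB regardless of the signal energy.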

Subjective measures
The objective measures GNSDR, GSIR, and GSAR help to objectively compare the separation quality of our proposed methods and the reference methods. Higher values of GNSDR, GSIR, and GSAR for the estimated separated singing voice indicate closeness to the original clean singing voice. However, the estimated separated singing voice with the highest objective scores is not necessarily perceived as the cleanest separation. Therefore, subjective measures are applied, in which listeners are asked to consider interference (the residue of background music) and artifacts in the separated singing voice. Two subjective measures, namely the mean opinion score (MOS) [76] and the comparison mean opinion score (CMOS) [76], are adopted to this end.
The MOS is commonly used in audio and video analysis. The absolute category rating scale is often used, typically in the range of 1-5, representing the ratings bad, poor, fair, good, and excellent. Given the difficulty of absolute grading of subjective perceptions of the MOS for separated singing, the CMOS measure prescribed in Annex E of ITU-T Recommendation P.800 [76] is additionally used as another subjective measure of separation quality. In the CMOS test, listeners listen to and compare the target separated singing voice with a reference voice, assigning scores ranging from −3 to +3, totaling seven levels of assessment. To make subtle differences easier for evaluators to distinguish and grade, we reduced the number of assessment levels to 5, where the scores −2, −1, 0, 1, and 2 for the singing voice represent worse, slightly worse, equal, slightly better, and better, respectively, relative to the reference sound.

Experiment results of ablation study
To compare the contributions of each part of RPCA-DRNN, an ablation study that removes parts of the system was conducted. Three datasets, namely MIR-1K, ccMixter, and MUSDB18, are used. Table 1 lists all the combinations of different parts of RPCA-DRNN included in the ablation experiment; the results are shown in Table 2.
Observing the results in Table 2 (parts a and b), adding any postprocessing step helps reduce the total error (the sum of the interference and artifact errors). Among the three postprocessing steps (median filter, morphology, and high-pass filter), the high-pass filter reduces the total error the most: RPCA_s_h performed better in GNSDR than RPCA_s_m and RPCA_s_M, and RPCA_s_h-DRNN performed better in GNSDR than RPCA_s_m-DRNN and RPCA_s_M-DRNN. In addition, the combinations without the high-pass filter performed worst in reducing the total error: RPCA_s_m_M was worse in GNSDR than RPCA_s_m_h and RPCA_s_M_h, and RPCA_s_m_M-DRNN was worse in GNSDR than RPCA_s_m_h-DRNN and RPCA_s_M_h-DRNN. sRNN outperformed all the combinations without DRNN in GNSDR. Finally, the proposed RPCA-DRNN outperforms all the combinations in GNSDR and GSAR, and outperforms RPCA_b-DRNN and RPCA_s-DRNN in all the objective measures. Therefore, RPCA-DRNN performs better than conventional RPCA and sRNN, and applying the postprocessing to the soft mask does improve the separation quality.
Moreover, in the results of Table 2 (part c) on MUSDB18, the combinations without the high-pass filter performed worst in SDR, ISR, SIR, and SAR: RPCA_s_m_M-DRNN was worse in all four measures than RPCA_s_m_h-DRNN and RPCA_s_M_h-DRNN. Finally, the proposed RPCA-DRNN outperforms all combinations in all four measures. Therefore, on MUSDB18, it is confirmed again that RPCA-DRNN performs better than sRNN and that the high-pass filter is the most influential part of the postprocessing.

Experiment results of MIR-1K
Songs from the test set were used in the objective and subjective tests. Both the separated singing voice and the accompaniment were evaluated. Ten techniques, namely RPCA_b, RPCA_p, sRNN, MLRR, RNMF, MOD-GD, U-Net, EFN, CRNN-A, and RPCA-DRNN, were compared in our objective experiment. Accordingly, ten varieties of separated singing voices were evaluated in the objective voice quality assessment. The comparison of the proposed RPCA-DRNN with RPCA_b, RPCA_p, sRNN, MLRR, RNMF, MOD-GD, U-Net, EFN, and CRNN-A in terms of the objective measures GNSDR, GSIR, and GSAR is summarized in Table 3. The results indicate that the proposed RPCA-DRNN is superior to all the reference methods in GNSDR and GSAR. Therefore, RPCA-DRNN reduces the total error the most, although it is more successful in reducing artifact error than interference error. Box plots comparing the proposed RPCA-DRNN with RPCA_b, RPCA_p, sRNN, and CRNN-A are presented in Fig. 6, which gives a clearer statistical insight.
RPCA-DRNN, CRNN-A, and sRNN were further compared in a subjective assessment with ten listeners. All of them are music enthusiasts but are unfamiliar with source separation and audio engineering. Each listener was allotted ten sets of RPCA-DRNN-, CRNN-A-, and sRNN-separated singing voice clips from different songs in the test set, for a total of 100 testing sets of singing voice clips. The ordering of the target and reference voices was randomized and not revealed to the evaluators. The listeners were asked to evaluate the separation performance and provide MOS scores. Table 4 shows the percentage distribution of the MOS scores assigned to the singing voices separated using sRNN, CRNN-A, and the proposed RPCA-DRNN. Sixty-seven percent, 51%, and 43% of the singing voice clips separated using RPCA-DRNN, CRNN-A, and sRNN, respectively, were rated good or excellent; the percentage of good and excellent scores was thus clearly higher for RPCA-DRNN than for CRNN-A and sRNN. Furthermore, the average MOS of RPCA-DRNN was 3.79, whereas those of CRNN-A and sRNN were 3.54 and 3.46, respectively. Therefore, the subjective performance of the proposed RPCA-DRNN in terms of MOS scores was superior to that of CRNN-A and sRNN.
A further analysis of the separated singing voice clips was then conducted by performing a CMOS test to measure the subjective quality of separation. Two target-reference pairs, RPCA-DRNN vs. sRNN and RPCA-DRNN vs. CRNN-A, were used. The same ten listeners and 100 testing sets as in the MOS test were used, but different testing sets were allocated to each listener. The percentages of RPCA-DRNN-separated clips assigned "worse," "slightly worse," "equal," "slightly better," and "better" CMOS scores relative to the sRNN- and CRNN-A-separated clips are listed in Table 5. The results in Table 5 indicate that RPCA-DRNN was preferred over sRNN and CRNN-A. Against sRNN, based on the vote percentages, RPCA-DRNN was preferred (slightly better or better) in 70% of the testing pairs, and in 18% of the pairs its output was perceived as indistinguishable from that of sRNN; sRNN was preferred in only 12% of the pairs. Against CRNN-A, RPCA-DRNN was voted equal, slightly better, or better in 75% of the pairs.
To ensure that the results of our subjective auditory tests in Table 5 are statistically significant and support our argument, we examined the p-values of the binomial statistic in addition to the preference percentages. Given that five options were available to the listeners (worse, slightly worse, equal, slightly better, better), we assumed the probability of choosing any answer to be 1/5 when calculating the p-values. For example, in 48 of 100 trials, RPCA-DRNN was voted better than sRNN; consequently, the p-value of the binomial statistic was less than 5.3e−10. This is considerably lower than 0.05, which is, by convention [77], considered statistically significant with a confidence level of 95%. This means that there is an extremely low chance that the observed differences among the listeners' choices were due to chance, and the listeners did have preferences. In the comparison with CRNN-A, RPCA-DRNN was voted better in 31 of 100 trials; consequently, the p-value of the binomial statistic was less than 0.0084, and the p-values of the preference rates of RPCA-DRNN for individual listeners were also considerably lower than 0.05.

The estimated training duration for the training set, inference duration for the testing set, and carbon footprint of training of the proposed and reference methods are shown in Table 6. The power of the graphics processing unit (GPU) used (one 1080Ti) is about 250 W. The carbon footprint is obtained under the assumption that 1 kWh of electricity discharges 0.62 kg of carbon dioxide and that only the GPU power consumption is considered.
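The exact one-sided binomial p-value used above can be computed as follows (a small stdlib-only sketch; the function name is illustrative):

```python
from math import comb

def binom_tail_p(successes, n, p):
    """Exact one-sided binomial p-value: P(X >= successes) for X ~ Binomial(n, p).

    Sums the upper tail of the binomial probability mass function.
    """
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(successes, n + 1))
```

For instance, binom_tail_p(48, 100, 0.2) falls below the 5.3e−10 bound reported above, far under the 0.05 significance threshold.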

Experiment results of MUSDB18
In the experiment on MUSDB18, sounds other than vocals, such as bass, drums, and others, are considered accompaniment. Vocals and accompaniment are estimated from the left channel and the right channel, respectively. Both the separated singing voice and the accompaniment are evaluated. Table 7 compares the results of RPCA-DRNN, RPCA-DRNN_l, RPCA-DRNN_f, sRNN, Open-Unmix, and E-MRP-CNN. Readers interested in the separation performance of other techniques on MUSDB18 up to 2018 can refer to the results of the 2018 Signal Separation Evaluation Campaign [74]. RPCA-DRNN is superior to sRNN in both vocal and accompaniment separation. Moreover, in vocal separation, RPCA-DRNN is superior to Open-Unmix and E-MRP-CNN in SIR (19.53 vs. 12.19 and 13.40) and slightly better in SAR (6.87 vs. 5.98 and 6.32) and SDR (6.41 vs. 5.57 and 6.36), but slightly worse in ISR (12.32 vs. 14.07 and 13.61). In accompaniment separation, RPCA-DRNN is superior to Open-Unmix and E-MRP-CNN in SIR (24.77 vs. 19.62 and 16.18) and slightly better in SAR (15.78 vs. 12.54 and 14.41), but worse in SDR (8.70 vs. 11.06 and 12.99) and ISR (18.37 vs. 19.06 and 23.00). Since the proposed method targets monaural separation, spatial distortion is not under consideration; the poor performance in reducing spatial distortion results in low ISR. For the separated vocals, given the larger spatial distortion of RPCA-DRNN, it follows from the definition of SDR that RPCA-DRNN reduces the sum of interference and artifact errors more than Open-Unmix and E-MRP-CNN do. For the separated accompaniment, since the spatial distortion of RPCA-DRNN is larger than that of the other two methods, its SDR is also low. Note that E-MRP-CNN used gain and sliding augmentation, with augmented data four times the original data, and Open-Unmix used even stronger data augmentation methods.
Open-Unmix also used normalization and input/output scaling, whereas the proposed RPCA-DRNN did not use any data augmentation in the experiment on MUSDB18. Box plots are presented in Fig. 7, which provides more statistical detail; the data of E-MRP-CNN are from [47]. From Fig. 7, compared to E-MRP-CNN, RPCA-DRNN is better in SIR and slightly better in SDR, ISR, and SAR in vocal separation, and better in SIR, slightly better in SAR, worse in SDR, and slightly worse in ISR in accompaniment separation.

RPCA-DRNN was further compared with Open-Unmix and E-MRP-CNN in a subjective assessment with twenty listeners. All of them are music enthusiasts but are unfamiliar with source separation and audio engineering; four were former band members. Each listener was randomly assigned 5 of 6 songs. For each song, listeners were allocated the three separated singing voices from Open-Unmix, E-MRP-CNN, and RPCA-DRNN in random order. Table 8 shows the percentage distribution of the MOS scores; all three methods were mostly evaluated as good or excellent.

A further analysis of the separated singing voice was then conducted by performing a CMOS test. The pairs RPCA-DRNN vs. Open-Unmix and RPCA-DRNN vs. E-MRP-CNN were evaluated. The same 20 listeners and 6 songs as in the MOS test were used, and 5 songs (two pairs per song) were randomly allocated to each listener. Table 9 lists the CMOS results. RPCA-DRNN was voted equal or slightly better than Open-Unmix and E-MRP-CNN in 88% and 91% of comparisons, respectively. For both pairs (RPCA-DRNN vs. Open-Unmix and RPCA-DRNN vs. E-MRP-CNN), the p-value of the binomial statistic was less than 2.2e−16, considerably lower than 0.05 and thus statistically significant with a confidence level of 95%.
Compared to Open-Unmix, RPCA-DRNN received a higher percentage of "slightly better" than "slightly worse" votes, and compared to E-MRP-CNN, the percentages of "slightly better" and "slightly worse" votes were equal. Therefore, RPCA-DRNN achieves better CMOS performance than Open-Unmix and competitive CMOS performance with E-MRP-CNN in monaural singing voice separation.
The training duration for the training set, the inference duration for the testing set, the GPU (and its power) used, and the carbon footprint of training on MUSDB18 of RPCA-DRNN, RPCA-DRNN_l, RPCA-DRNN_f, sRNN, and E-MRP-CNN are shown in Table 10. The powers of one 1080Ti and one 3090 are about 250 W and 350 W, respectively. The carbon footprint of training is computed under the assumption that 1 kWh of electricity discharges 0.62 kg of carbon dioxide and that only the GPU power consumption is considered. The estimate for E-MRP-CNN counts only the most time-consuming evolution process and is based on the total power consumption of 1560 W of the 6 GPUs used (two 1080Ti, one 2080Ti, one Titan RTX, one Titan V, and one Titan XP), under the condition of 100 generations with 2 h of running per evolution, whereas RPCA-DRNN counts the total computation. Even so, the carbon footprint of E-MRP-CNN is about 2.5 times that of RPCA-DRNN on one 1080Ti, 5.7 times that of RPCA-DRNN on one 3090, and 32.2 times those of RPCA-DRNN_l and RPCA-DRNN_f on one 3090. Therefore, the proposed RPCA-DRNN provides competitive performance at a lower training cost. Moreover, the light and feather versions of RPCA-DRNN, which achieve better separation quality than sRNN, have only half the carbon footprint of sRNN.
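The carbon-footprint estimate follows directly from the stated assumptions (a small sketch; the function name is illustrative): the energy in kWh is the GPU power times the runtime, multiplied by the 0.62 kg-CO2-per-kWh emission factor.

```python
def training_carbon_kg(gpu_watts, hours, kg_co2_per_kwh=0.62):
    """CO2 (kg) attributed to GPU power draw alone: kWh consumed times
    the assumed grid emission factor."""
    return gpu_watts / 1000.0 * hours * kg_co2_per_kwh
```

For example, the E-MRP-CNN evolution process at 1560 W for 200 h (100 generations with 2 h each) yields about 193 kg of CO2 under these assumptions.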

Conclusions
We proposed a method inspired by our daily learning experience: first, underlying knowledge and characteristics gleaned or inferred are used, without prior training, to separate sources on the basis of reasonable tendencies and assumptions; then, supervised learning jointly exploits labeled data to further improve the separation results. A method combining RPCA and a supervised DRNN was employed in an experiment to improve the separation of the singing voice from the musical accompaniment in monaural mixtures. First, RPCA was used to roughly separate the mixture into a sparse voice and low-rank music. Second, postprocessing, including median filtering, morphology, and high-pass filtering, was performed to smooth and enhance the spectral structure of the estimated singing voice and filter out unnecessary parts. Then, a supervised DRNN was utilized to achieve further separation; the misclassified or residual singing and background music from the initial separation were further corrected to improve the results. Based on the objective scores on MIR-1K, the proposed method was found to be superior to RPCA, sRNN, MLRR, RNMF, MOD-GD, U-Net, EFN, and CRNN-A in terms of GNSDR and GSAR. Moreover, when the total numbers of neurons are the same, RPCA-DRNN with two smaller nets outperformed sRNN with one larger net in the subjective tests in terms of MOS and CMOS scores, because RPCA-DRNN combines underlying knowledge with supervised learning, and the DRNN of RPCA-DRNN only further corrects the residual singing voice and music in the output of RPCA with soft mask, median filter, morphology, and high-pass filter. The variation of the inputs to the DRNNs of RPCA-DRNN is relatively small compared with the input to sRNN, which is the original sound mixture. RPCA-DRNN was also voted 75% equal, slightly better, or better than CRNN-A.
In addition, based on the objective scores on MUSDB18, RPCA-DRNN is superior to Open-Unmix and E-MRP-CNN in SDR, SIR, and SAR in vocal separation, and superior to them in SIR and SAR in accompaniment separation. This result was obtained without applying any data augmentation in the proposed RPCA-DRNN, whereas both Open-Unmix and E-MRP-CNN use data augmentation. The subjective tests also confirm the preference for RPCA-DRNN. Moreover, the performance of the light and feather versions of RPCA-DRNN remains highly competitive even with very limited training data.
Therefore, the combination of inferred underlying properties and supervised learning, which is characteristic of humans' daily learning experience, improved the separation of a singing voice from background music in the case of a monaural mixture. Benefiting from the initial RPCA separation, which requires no prior training, the proposed method achieves competitive results even with limited data or without data augmentation and hence can greatly reduce the computational load.
The main limitation of RPCA-DRNN is that at least one GPU at the level of a 1080Ti or higher is needed for training. For databases with little training data (e.g., MIR-1K, with 23 min 36 s of training data), data augmentation is recommended; databases with sufficient training data (e.g., MUSDB18, with 7 h 46 min 24 s of training data, or 53 min 45 s in RPCA-DRNN_l) can still yield good results without data augmentation.
In the future, we will experiment with other neural network architectures and data augmentation methods that generate realistic mixtures, and apply the proposed method in applications such as singing voice analysis and resynthesis systems. The proposed system can also be extended to separate more sources by adding additional DRNNs. Moreover, adapting the method to stereo source separation by handling the spatial relation of the sound in different channels is also interesting future work.