The whole is greater than the sum of its parts: improving music source separation by bridging networks
EURASIP Journal on Audio, Speech, and Music Processing volume 2024, Article number: 39 (2024)
Abstract
This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in calculation cost. It consists of three components: (i) multi-domain loss (MDL), (ii) a bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL takes advantage of both the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as to each independent source; hence, we call it CL. MDL and CL can easily be applied to many DNN-based separation methods, as they are merely loss functions that are used only during training and do not affect the inference step. The bridging operation does not increase the number of learnable parameters in the network. Experimental results showed the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net), and the convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net, and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMX-L), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX.
1 Introduction
There is a huge amount of music in our lives, e.g., on radio and TV, as background music in stores, or provided by online streaming services [1,2,3,4,5,6]. To adapt music to diverse purposes, it is sometimes necessary to remix it, e.g., making the vocal track louder, suppressing undesired instruments, or upmixing to more audio channels. Such operations are easy to implement when we have independent access to each audio source that was used to mix the music. However, if we only have access to the final recording, which is often the case, this is much more challenging. In such cases, it is necessary to separate the music into its individual instruments, which is called music source separation (MSS), to achieve the above operations.
MSS has a long history and is known to be a very challenging problem [7]; therefore, many approaches have been investigated, e.g., local Gaussian modeling [8, 9], non-negative matrix factorization [10,11,12], kernel additive modeling [13], and combinations of these approaches [14, 15]. Data-driven machine learning approaches for MSS have also been of great interest to researchers. Many methods that use deep neural networks (DNNs) have been investigated to improve MSS performance. Specifically, multilayer perceptrons (MLPs) [16], convolutional neural networks (CNNs) [17], and recurrent neural networks (RNNs) [18], the three basic DNN architectures, have all been used for MSS. An MLP was used to separate the input spectra and obtain separated results [19, 20]. CNNs and RNNs were used to achieve source separation with better quality [21,22,23] than previous MLP-based methods since their convolutional and recurrent layers can effectively capture temporal contexts.
Although the above studies drastically improved MSS performance, there are two problems with respect to the training of music separation networks:

(P1)
Most DNN-based MSS methods tend to handle only the time or the frequency domain but not both.

(P2)
They do not handle the mutual effect among output sources since the network architectures and loss functions are computed independently for each estimated source and its corresponding ground truth.
For example, a well-known open-source MSS method called Open-Unmix (UMX) [24]^{Footnote 1} executes MSS only in the frequency domain. It also applies the conventional mean squared error (MSE) loss function to individual pairs of estimated and corresponding ground-truth magnitude spectrograms for each instrument. In other words, UMX trains networks individually for each instrument and achieves MSS by applying each independent network one-by-one. In the field of speech enhancement (SE), which can be regarded as a special case of audio source separation, there are methods for solving the above problems. For (P1), Kim et al. [25] showed the effectiveness of multi-domain processing via hybrid denoising networks, and Su et al. [26] reported that building two discriminators responsible for the time and frequency domains enables effective denoising and dereverberation in their generative adversarial network (GAN)-based scheme. For (P2), from classical SE methods such as the Wiener filter [27] to current SE methods, e.g., noise-aware training [28] and the noise-aware variational autoencoder [29], there are many situations in which knowing and using information about the noise, such as its type, level, and time variation, is beneficial for the subsequent extraction of the target speech. In MSS, the other non-target sources can similarly be regarded as "noise," and their information may be beneficial for the target source separation. Wiener filtering has also been used for MSS [19], but only as post-processing; thus, the information of the other non-target sources is not used to train a DNN. Since our first work [30], more models such as Hybrid Demucs (HDemucs) and Hybrid Transformer Demucs (HT Demucs) [31, 32] as well as Band-split RNN (BSRNN) [33] have appeared and shown the benefit of working jointly in both domains.
Inspired by these discussions, we first append an additional differentiable short-time Fourier transform (STFT) or inverse STFT (ISTFT) layer^{Footnote 2} during training only. To consider the characteristics of both the time and frequency domains, some existing methods such as [31] adopt an architecture with two separate branches, one for time features and one for frequency features; however, such an architecture is unique to each method and thus difficult to transfer to other methods. In contrast, applying loss functions in both the time and frequency domains, i.e., applying multi-domain loss (MDL), is easily feasible for almost all existing methods since MDL is merely a loss function built on multiple domains. Intuitively, the two domains also give complementary views of the separation performance. For a time-domain (TD) loss, a periodic noise pattern might go unnoticed since the loss only computes an instantaneous error. In the frequency domain, however, such a periodic noise pattern becomes visible and will be reduced by the frequency-domain (FD) loss. Conversely, an FD loss cannot account for the effect of phase information since it only deals with magnitude spectrograms, whereas a TD loss implicitly contains it in its error. Furthermore, to consider the relationship among output sources, we bridge the individual instrument networks by adding averaging operations whenever the original source separation is achieved by applying an independent network per instrument to the input mixture. This is called the bridging operation. For the bridged network to better determine the relationship among output instruments, we produce output spectrograms for instrument combinations and apply MDL to them. We call this loss computation combination loss (CL). The combination of the bridging operation and CL helps the separation network determine the cause of an estimation error, i.e., which sources are leaking into the target instrument.
In summary, MDL solves (P1) since the separation network can determine the estimation error in both the time and frequency domains. The bridging operation and CL solve (P2) since they enable the separation network to handle the mutual relationships among the separated sources. We collectively call this "X-scheme," which crosses the information among all sources with MDL. It is important to note that X-scheme can improve the performance of DNN-based MSS systems while maintaining the original calculation cost. This is because MDL and CL only affect the training step and thus do not change the original inference step. Moreover, the bridging operation requires only a slight network modification that does not increase the number of parameters to be learned and only slightly increases the computational cost. More specifically, the relative increase in computational cost from applying X-scheme depends on the original size of the target network. However, as our X-scheme merely adds averaging operators to merge sub-networks, these additional costs can usually be neglected. For instance, only four additional averaging operators are needed in the case of a four-instrument dataset like MUSDB18. No matter how small a deep neural network is, its computational cost should be much larger than that of a few averaging operators. Hence, there is almost no increase in computational cost with our proposed X-scheme.
Although we confirmed the validity of X-scheme in our previously proposed DNN-based MSS method, i.e., extended UMX (X-UMX) [30], realized by applying X-scheme to UMX, three questions remain: (i) its generality to other types of network architectures, (ii) the effective positions at which we should bridge the paths of the target networks, and (iii) its scalability to a large-scale data regime. Hence, in this paper, we address these questions. Specifically, we validate the effectiveness of X-scheme by applying it to different types of DNN-based MSS methods: well-known CNN-based and RNN-based ones, i.e., densely connected dilated DenseNet (D3Net) [34, 35] and Open-Unmix (UMX) [24]. Furthermore, not only these frequency-domain networks (i.e., UMX and D3Net) but also a well-known time-domain one, the convolutional time-domain audio separation network (Conv-TasNet) [1], is extended with X-scheme in this paper. We also present a detailed study regarding the bridging positions and the potential of using a large dataset for training X-UMX.
The rest of this paper is organized as follows. In Section 2, we give a brief review of related work. In Section 3, we present X-scheme. In Section 4, we show the effectiveness of X-scheme by applying it to UMX, D3Net, and Conv-TasNet, resulting in X-UMX, X-D3Net, and X-Conv-TasNet, on the MSS task. Finally, we conclude this paper in Section 5.
2 Related work
DNN-based MSS methods can be roughly categorized into time- and frequency-domain methods. UMX [24] receives the input spectrogram of a mixture song and extracts the target instrument by using fully connected and bidirectional long short-term memory (BLSTM) layers on the spectrogram, i.e., it works in the frequency domain. Similarly, D3Net [34, 35] extracts the target instrument from the input spectrogram by using convolutional layers in the frequency domain. Note that both use a multichannel Wiener filter (MWF) [19] to reduce the artifacts caused by the nonlinearity of DNN-based separation. Frequency-domain methods are powerful; thus, both methods recorded good scores on MUSDB18, a public dataset prepared for the signal separation evaluation campaign (SiSEC) 2018 [36].
Time-domain methods directly operate on time-domain signals. To the best of our knowledge, Lluís et al. and Stoller et al. almost simultaneously started to explore time-domain MSS methods [37, 38]. However, the MSS performance of such methods was inferior to that of frequency-domain methods. Specifically, the overall signal-to-distortion ratio (SDR) was reported to be only around 3.2 dB, which was almost 2 dB behind that of frequency-domain methods. Note that the experiments in the above studies were conducted on the same public dataset, i.e., MUSDB18; thus, we can compare their results. Défossez et al. then investigated a new time-domain method, Demucs [39], which is based on Wave-U-Net [38]. Demucs improves the modeling capability by incorporating gated linear unit layers [40], BLSTM, and faster strided convolutions; thus, it demonstrated results competitive with frequency-domain methods on MUSDB18.
Although both time- and frequency-domain methods have recorded good MSS performance, there are still concerns. Almost all DNN-based frequency-domain methods tend to use only the magnitude spectrogram without phase information since it is difficult for DNNs to work with complex-valued data. The phase information is thus often ignored by such methods, and the phase of the input mixture is combined with the output magnitude spectrogram to compute the ISTFT, although this might yield a mismatch with the target source's spectrogram. Moreover, the Fourier basis, which is used to calculate the above spectrogram, is not always optimal for DNN-based MSS methods. Time-domain methods, in contrast, can optimize their networks end-to-end, i.e., including the phase information, but their training tends to be more difficult. Inspired by this insight, we previously proposed X-UMX, which can use time-domain information via MDL [30], and confirmed that it performed better than UMX. Methods using time and frequency information in a hybrid manner have also been proposed for DNN-based MSS. For example, KUIELab-MDX-Net [41] and Danna-Sep [42] are hybrid methods using time and frequency features. Specifically, they combine heterogeneous time- and frequency-based MSS networks on the basis of the blending scheme [22], resulting in high-performing hybrid MSS.
The number of methods using complex-valued features, i.e., the spectrogram magnitude and the corresponding phase via STFT, for MSS has recently been increasing [43,44,45]. Specifically, latent source attentive frequency transformation (LaSAFT) [43] and its light version, LightSAFT [44], use complex-as-channels (CaC) [46] built on U-Net [47], enabling MSS in the complex-valued domain. Défossez et al. also improved upon the original Demucs by using CaC, resulting in HDemucs [31], to use time as well as complex-valued frequency information. Its architecture consists of two branches, each handling either the time or the complex-valued frequency input. Liu et al. proposed the channel-wise subband phase-aware ResUNet (CWS-PResUNet) [45], which includes phase estimation by using the loss function of the complex ideal ratio mask (cIRM) [48]. Their motivations, which involve phase information as well as the spectrogram magnitude, are similar to those of the hybrid methods that compensate for the missing phase information by adding the time-domain signal. Therefore, the above complex-domain methods can also be regarded as hybrid methods.
There have also been several attempts to directly estimate the phase of the target source [49, 50]. PhaseNet [49] successfully predicts the phase information by defining the phase-estimation problem as a classification over discretized phase values. DiffPhase [51] generates as well as predicts the phase, suitable for the given spectrogram magnitude, through the framework of a diffusion-based generative model. The authors reported that the perceptual scores of the reconstructed time signals were high even when their phases were partially generated.
From this literature review, we can see that using the time domain or similar features in addition to the frequency domain is important for achieving good MSS performance. However, changing the network architecture such that time- and frequency-domain input features can be used jointly, and then optimizing this new architecture, may be a laborious task. X-scheme, which includes MDL, is simple and easy to use; it thus enables many methods to handle both time- and frequency-domain features in a hybrid manner.
Furthermore, some studies have attempted to integrate sub-networks, each of which is dedicated to extracting one specific instrument [3,4,5]. Specifically, Meseguer-Brocal et al. proposed a DNN-based MSS method that uses just a single conditioned network [3]. By applying Feature-wise Linear Modulation (FiLM) [6] to the target network as conditioning, separating an arbitrary desired instrument through a single network becomes feasible; selecting which instrument should be separated is achieved by FiLM-based conditioning without instrument-wise training. Slizovskaia et al. proposed a conditioned-network-based MSS method that accepts visual features as well as audio ones, i.e., audio-visual features [4]. Besides conditioned networks, Kadandale et al. proposed a multi-task-model-based MSS method that uses a unified single network outputting all instruments simultaneously [5]. While a conditioning scheme might require longer network training to ensure enough training steps for each conditioning signal, multi-task networks need fewer iterations since all sources are learned simultaneously as multiple tasks. They increase the number of output instruments by changing the number of output kernels of U-Net, easily turning each instrument's dedicated network into a unified multi-task one. However, their method was only tried on a CNN-based model, i.e., U-Net, and it might be difficult to apply to other types of DNNs.
Our X-scheme can be regarded as a modification that turns the target network into a multi-task one by bridging, and it can further be applied not only to CNN-based networks but also to other types of networks, as shown in the following sections.
3 X-scheme for DNN-based MSS
In this section, we describe X-scheme, which consists of three components, i.e., MDL, the bridging operation, and CL. As mentioned in Section 1, MDL should solve (P1), while the bridging operation and CL should solve (P2).
Throughout the paper, we use the following notation. We first assume that the time-domain mixture signal \(\varvec{x}\) consists of J sources, i.e.,
$$\varvec{x} = \sum_{j=1}^{J} \varvec{y}_j, \qquad (1)$$
where \(\varvec{y}_j\) denotes the time-domain signal of the jth source. Note that \(\varvec{x}\) and \(\varvec{y}_j\) are column vectors containing the samples of monaural signals. In general, the audio signal of music consists of two channels, i.e., a stereo signal. However, some metrics used in our method, such as MSE and SDR, do not have operators specialized for multichannel signals; thus, we calculated the following loss values channel by channel and summed them up to obtain the final loss. To the best of our knowledge, although there is a multichannel version of the classical SDR (e.g., https://github.com/sigsep/bsseval), almost all other existing methods and their implementations also handle each channel of stereo signals one-by-one. Thus, for the sake of simplicity, the vectors used in the following equations are denoted as single-channel, i.e., monaural, signals. The DNN then predicts the spectrogram of the jth target source from the input mixture spectrogram \(\varvec{X} = \mathcal {S} \{ \varvec{x} \}\):
$$\hat{\varvec{Y}}_j = \mathrm{DNN}_j(\varvec{X}), \quad \hat{\varvec{y}}_j = \mathcal{S}^{-1}\{\hat{\varvec{Y}}_j\}, \qquad (2)$$
where \(\mathcal {S}\) and \(\mathcal {S}^{-1}\) represent the operators of the STFT and inverse STFT (ISTFT), respectively. A variable with the hat symbol, e.g., \(\hat{\bullet }\), denotes a result estimated with the DNN. Therefore, \(\hat{\varvec{y}}_j\) and \(\hat{\varvec{Y}}_j\) are respectively the predicted time- and frequency-domain counterparts of the ground truths \(\varvec{y}_j\) and \(\varvec{Y}_j\).
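As a small illustration of the channel-wise loss computation described above, the following sketch applies a monaural loss to each channel of a stereo signal and sums the results; the function name stereo_loss and the placeholder MSE are our own illustrative choices, not the exact implementation.

```python
import torch

def stereo_loss(loss_fn, x, y, y_hat):
    # Apply a monaural loss channel by channel and sum the results,
    # as described above for stereo signals.
    return sum(loss_fn(x[c], y[c], y_hat[c]) for c in range(x.shape[0]))

# Example with a simple MSE on a 2-channel (stereo) signal.
mse = lambda x, y, y_hat: torch.mean((y - y_hat) ** 2)
x, y = torch.randn(2, 44100), torch.randn(2, 44100)
print(stereo_loss(mse, x, y, y + 0.1 * torch.randn(2, 44100)))
```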
3.1 Multi-domain loss
For MDL, we first append an additional differentiable and fixed STFT or ISTFT layer after the final layer of the target DNN, as shown in Fig. 1. The STFT and ISTFT consist only of product-sum operations, known as butterfly computation [2]; thus, all of their computational operations are differentiable. In other words, the STFT and ISTFT are just matrix-vector products, each of which is differentiable. It is then possible to calculate loss functions in both the time and frequency domains before and after the appended layer. Hence, we can easily add the STFT and ISTFT as differentiable operators, resulting in STFT and ISTFT layers. Since this appended layer is only used during training for computing MDL, it does not affect the inference step. In X-scheme, we use the MSE and weighted signal-to-distortion ratio (wSDR) [52] as the frequency- and time-domain loss functions, respectively, i.e.,
$$\mathcal{L}_{\mathrm{MDL}} = \mathcal{L}_{\mathrm{MSE}}^{J} + \alpha \left( \mathcal{L}_{\mathrm{wSDR}}^{J} + 1.0 \right), \qquad (3)$$
where \(\alpha\) is a scaling parameter for mixing the losses of multiple domains. Specifically, \(\mathcal {L}_{\text {MSE}}^J\) and \(\mathcal {L}_{\text {wSDR}}^J\) are respectively calculated as follows:
$$\mathcal{L}_{\mathrm{MSE}}^{J} = \frac{1}{J} \sum_{j=1}^{J} \sum_{t,f} \left( Y_j(t,f) - \hat{Y}_j(t,f) \right)^2, \qquad (4a)$$
$$\mathcal{L}_{\mathrm{wSDR}}^{J} = \frac{1}{J} \sum_{j=1}^{J} \left[ -\rho_j \frac{\varvec{y}_j^{\top} \hat{\varvec{y}}_j}{\Vert \varvec{y}_j \Vert \Vert \hat{\varvec{y}}_j \Vert} - (1 - \rho_j) \frac{(\varvec{x} - \varvec{y}_j)^{\top} (\varvec{x} - \hat{\varvec{y}}_j)}{\Vert \varvec{x} - \varvec{y}_j \Vert \Vert \varvec{x} - \hat{\varvec{y}}_j \Vert} \right], \qquad (4b)$$
where t and f denote the indices of the time frame and frequency bin of the spectrogram \(Y_j(t,f)\), respectively. In addition, \(\rho _j\) is the energy ratio between the jth source \(\varvec{y}_j\) and the mixture \(\varvec{x}\) in the time domain, i.e., \(\rho _j = \Vert \varvec{y}_j\Vert ^2 / (\Vert \varvec{y}_j\Vert ^2 + \Vert \varvec{x} - \varvec{y}_j\Vert ^2)\). Note that the output range of the wSDR in Eq. (4b) is bounded to \([-1, 1]\). Therefore, the term \(\left( \mathcal {L}_{\text {wSDR}}^J + 1.0\right)\) in Eq. (3) is bounded to [0, 2.0], which makes it convenient to mix with another type of loss, i.e., the MSE in our case. Although the SDR is traditionally calculated with a logarithm, we keep the logarithm-free form of Eq. (4b) for MDL for the above reason.
By using MDL, the target DNN can leverage the advantages of both domains even if the original network operates in only one of them. MDL can also be applied to many conventional DNN-based MSS methods by simply replacing the loss function; thus, no additional calculation is required during inference.
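To make this concrete, below is a minimal PyTorch sketch of MDL under the following assumptions: the network outputs magnitude spectrograms and the mixture phase is reused for the ISTFT (as in UMX); all function names, tensor shapes, and epsilon terms are our own illustrative choices rather than the exact implementation.

```python
import torch

def cos_sim(a, b, eps=1e-8):
    # Cosine similarity along the last (time-sample) axis.
    return (a * b).sum(-1) / (a.norm(dim=-1) * b.norm(dim=-1) + eps)

def wsdr_loss(x, y, y_hat, eps=1e-8):
    # Time-domain weighted SDR of Eq. (4b); its value is bounded to [-1, 1].
    noise, noise_hat = x - y, x - y_hat
    rho = y.pow(2).sum(-1) / (y.pow(2).sum(-1) + noise.pow(2).sum(-1) + eps)
    return (-rho * cos_sim(y, y_hat)
            - (1.0 - rho) * cos_sim(noise, noise_hat)).mean()

def multi_domain_loss(x, y, Y_mag, Y_hat_mag, phase_x,
                      alpha=10.0, n_fft=4096, hop=1024):
    # Frequency-domain part: MSE between magnitude spectrograms, Eq. (4a).
    l_mse = torch.mean((Y_mag - Y_hat_mag) ** 2)
    # Differentiable ISTFT layer (training only): reuse the mixture phase
    # to map the estimated magnitude back to the time domain.
    window = torch.hann_window(n_fft, device=x.device)
    y_hat = torch.istft(Y_hat_mag * torch.exp(1j * phase_x),
                        n_fft=n_fft, hop_length=hop, window=window,
                        length=x.shape[-1])
    # Eq. (3): mix both domains; "+ 1.0" keeps the wSDR term in [0, 2].
    return l_mse + alpha * (wsdr_loss(x, y, y_hat) + 1.0)
```

Since gradients flow through the appended ISTFT layer, a frequency-domain network such as UMX receives time-domain error information without any architectural change at inference time.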
3.2 Combination schemes
In this subsection, we explain the bridging operation for DNN-based MSS (Section 3.2.1) and CL (Section 3.2.2), which help the independent extraction networks support each other.
3.2.1 Bridging operation
As shown in the blue rectangle of Fig. 2a, if DNN-based MSS is achieved using independent instrument networks, it is difficult for each network to take their mutual effect into account. Thus, we argue that it is effective to cross the network graphs to help the independent sub-networks support each other^{Footnote 3}. This is why X-scheme includes the bridging operation. Note that we adopt just a simple averaging layer as the bridging operation. There are other possible ways to join sub-networks, e.g., cross-attention [53], squeeze-and-excitation [54], and transform-average-concatenate [55], but they may increase the computational cost and add parameters that need to be learned. One of our motivations is to enhance existing DNN-based MSS methods while keeping the calculation cost and original simplicity as much as possible; thus, we focus on adding a simple averaging layer as the bridging operation. Please note that a larger CPU/GPU memory tends to be necessary since our X-scheme requires putting all sub-networks, each of which separates one instrument, on the CPU/GPU in parallel during training. However, this is only a bottleneck during training and might merely require adjusting the batch size. At separation time, i.e., inference, this is in general no longer a problem since the batch size is one.
We previously did not investigate the detailed settings of the bridging operation, such as its position and number. As shown in Fig. 2b, it is possible to place a bridge between layers #l and #\((l+1)\). We can place multiple bridges depending on the number of target network layers L; namely, we can place up to \((L-1)\) bridges. That is, we connect the paths that cross each source's network by adding one or more averaging operators to the original network, as sketched below. Note that the bridging operation does not have any learnable parameters; thus, the calculation cost increases only slightly compared with the original network, owing to merely adding a few averaging operations. We can then regard the parts before and after the last added bridge as the interaction part and each source's extraction part, respectively; thus, their capacity, which depends on the bridging position, affects the final MSS performance. Motivated by this discussion, we conduct experiments on X-UMX (Section 4.3) to confirm the effect of the number and position of bridging operations on MSS performance.
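The following is a minimal sketch of the bridging operation, assuming each instrument's sub-network can be split into the layers before and after the bridge; the module and variable names (BridgedSeparator, heads, tails) are our own illustrative choices, not the actual implementation.

```python
import torch
import torch.nn as nn

class BridgedSeparator(nn.Module):
    """J independent sub-networks with one parameter-free bridge
    (a simple averaging operation) inserted between their layers."""
    def __init__(self, heads, tails):
        super().__init__()
        self.heads = nn.ModuleList(heads)  # per-instrument layers before the bridge
        self.tails = nn.ModuleList(tails)  # per-instrument layers after the bridge

    def forward(self, X):
        # Each sub-network processes the same mixture spectrogram.
        h = [head(X) for head in self.heads]
        # Bridging operation: average the hidden features over all
        # sub-networks; this adds no learnable parameters.
        h_avg = torch.stack(h, dim=0).mean(dim=0)
        # Every tail continues from the shared (bridged) features but
        # keeps its own weights, so the outputs still differ per source.
        return [tail(h_avg) for tail in self.tails]

# Example: four tiny per-instrument sub-networks bridged in the middle.
J, D = 4, 2049
model = BridgedSeparator(
    heads=[nn.Sequential(nn.Linear(D, 512), nn.ReLU()) for _ in range(J)],
    tails=[nn.Linear(512, D) for _ in range(J)],
)
outs = model(torch.randn(8, D))  # list of J per-source estimates
```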
3.2.2 Combination loss
As mentioned above, the purpose of the bridging operation is to enable each source-extraction network to handle the relationships among the output sources via the built bridges. In other words, it is necessary for each source-extraction network to learn these mutual relationships during training. However, the bridging operation alone is insufficient for the networks to work together if the loss function is still computed independently for each instrument. Thus, it is effective to cross the loss function as well as the network paths, via CL, to boost the benefit of the built bridges. For CL, we consider combinations of output spectrograms to enable the DNN-based source extractors to interact with each other. Specifically, we combine two or more estimated spectrograms into new ones, each of which extracts two or more sources from the mixture. Using the newly obtained combination spectrograms enables us to compute more loss functions than when we use only the individual instrument spectrograms independently, i.e.,
$$\mathcal{L}_{\mathrm{CL}} = \mathcal{L}_{\mathrm{MSE}}^{N} + \alpha \left( \mathcal{L}_{\mathrm{wSDR}}^{N} + 1.0 \right), \qquad (5)$$
where \(\mathcal{L}_{\mathrm{MSE}}^{N}\) and \(\mathcal{L}_{\mathrm{wSDR}}^{N}\) are obtained from Eqs. (4a) and (4b) by letting the sum over the J individual sources run over the combined sources instead, \(N > J\) is the total number of possible combinations excluding the mix of all J sources, i.e., \(N = \sum \nolimits _{i=1}^{J-1} \left( {\begin{array}{c}J\\ i\end{array}}\right)\), and n denotes the index of the nth combination^{Footnote 4}. For instance, when separating \(J=4\) sources, as is the case with MUSDB18, we can consider \(14 \left(= \sum \nolimits _{i=1}^{4-1} \left( {\begin{array}{c}4\\ i\end{array}}\right) \right)\) combinations in total, as shown in Fig. 3, whereas conventional methods handle only each source independently, i.e., 4 source spectrograms.
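As an illustration, the combinations and a CL computation can be sketched as follows; here mdl stands for any per-pair multi-domain loss, and its simplified signature is our own assumption rather than the exact implementation.

```python
from itertools import combinations

J = 4  # e.g., bass, drums, other, vocals in MUSDB18

# All combinations of 1 to (J - 1) sources; mixing all J sources is
# excluded, as noted in Footnote 4.
combos = [c for r in range(1, J) for c in combinations(range(J), r)]
print(len(combos))  # 14 for J = 4, matching the text

def combination_loss(y, y_hat, mdl):
    # y, y_hat: lists of J ground-truth / estimated source signals.
    # A combined source is simply the sum of its member sources, and
    # the multi-domain loss is evaluated on every such pair.
    return sum(mdl(sum(y[j] for j in c), sum(y_hat[j] for j in c))
               for c in combos)
```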
To explain the advantage of CL, let us consider the following example. Assume that we have a system in which vocals leak into both drums and bass, resulting in similar errors for both instruments. By considering the combination drums + bass, we notice that the two errors are correlated, resulting in an even larger leakage of vocals, which we try to mitigate using CL. More formally, let \(\epsilon _j\) denote the prediction error of the jth source, i.e., \(\hat{y}_j = y_j + \epsilon _j\). We can then consider the MSE of the combination \(u = y_1 + y_2\):
$$\text{MSE}\left( u, \hat{u} \right) = \mathbb{E}\left[ \left( \epsilon_1 + \epsilon_2 \right)^2 \right] = \mathbb{E}\left[ \epsilon_1^2 + \epsilon_2^2 + 2 \epsilon_1 \epsilon_2 \right]. \qquad (6)$$
When we consider \(y_1\) and \(y_2\) separately, without the combination, the term "\(2 \epsilon _1 \epsilon _2\)" does not appear in the MSE: \(\text {MSE}\left( y_1, \hat{y}_1 \right) + \text {MSE}\left( y_2, \hat{y}_2 \right) = \mathbb {E} \left[ \epsilon _1^2 + \epsilon _2^2 \right]\). Therefore, by using CL, we can monitor the error-correlation term "\(\mathbb {E}[2 \epsilon _1 \epsilon _2]\)," which helps train the source-extraction networks when their errors are correlated. Specifically, we expect the term "\(\mathbb {E}[2 \epsilon _1 \epsilon _2]\)" to detect errors leaking into the wrong track. To efficiently reduce this term, we use the bridging operations, which allow each sub-network to be aware of the others and, hence, to reduce potential leakage into a wrong source. Specifically, tying the networks together helps the training since gradient information is now exchanged, which can help the networks learn to keep the "\(2 \epsilon _1 \epsilon _2\)" term small. Furthermore, they also benefit from joint feature extraction. Therefore, we bridged the network by just adding simple averaging operators, as shown in Fig. 2b, which turned out to be beneficial: the results actually improved despite using the same configurations except for applying our X-scheme.
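A small numeric experiment (our own illustration with synthetic, correlated errors) shows how the cross term surfaces only in the combined MSE:

```python
import torch

torch.manual_seed(0)
n = 100_000
e1 = torch.randn(n)                   # error of source 1
e2 = 0.8 * e1 + 0.2 * torch.randn(n)  # correlated error (shared leakage)

separate = (e1**2 + e2**2).mean()     # sum of the two per-source MSEs
combined = ((e1 + e2)**2).mean()      # MSE of the combination, Eq. (6)

# The difference is the monitored term E[2*e1*e2] (about 1.6 here),
# which is invisible to losses computed per source.
print((combined - separate).item())
```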
We can also analyze CL from a geometric viewpoint. Focusing on Eq. (4b), since the wSDR consists of two cosine similarity functions, it monitors the angle between the ground truth \(y_j\) and the corresponding prediction \(\hat{y}_j\). However, there is a critical case in which the prediction error cannot be detected in terms of cosine similarity. As shown in Fig. 4, when the predictions \(\hat{y}_1\) and \(\hat{y}_2\) are respectively orthogonal to the corresponding ground truths \(y_1\) and \(y_2\), it is difficult to detect the prediction error since \(\cos(y_1, \hat{y}_1)\) and \(\cos(y_2, \hat{y}_2)\) are both zero. However, CL can detect this prediction error via the combined signals u and \(\hat{u}\), since a score of \(\cos (u, \hat{u}) = -1\) penalizes the target network when substituted into the wSDR-based loss function. There is possibly a case in which \(\cos(u, \hat{u})\), \(\cos(y_1, \hat{y}_1)\), and \(\cos(y_2, \hat{y}_2)\) all become zero simultaneously. However, in that case all vectors (including their sums) are orthogonal, and CL simply does not bring a benefit but also does not cause any degradation; there is no harm, it is just not effective. Furthermore, if we include the multichannel Wiener filter (MWF) as in UMX and X-UMX, we can expect that this case cannot appear, as the MWF redistributes the residual to all sources and thus always yields a non-orthogonal sum that results in an error. Note that we need to apply our X-scheme after using the MWF in that case.
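The failure case and its detection by CL can be reproduced with a tiny 2-D example (our own construction, matching the situation sketched in Fig. 4):

```python
import torch

cos = torch.nn.functional.cosine_similarity

# Ground truths and orthogonal (worst-case) predictions in 2-D.
y1, y2 = torch.tensor([1., 0.]), torch.tensor([0., 1.])
y1_hat, y2_hat = torch.tensor([0., -1.]), torch.tensor([-1., 0.])

# Per-source cosine similarities are zero: the error is invisible to a
# purely per-source cosine-based (wSDR) loss.
print(cos(y1, y1_hat, dim=0))  # tensor(0.)
print(cos(y2, y2_hat, dim=0))  # tensor(0.)

# The combined signals, however, point in opposite directions, so CL
# detects the error with the maximum possible penalty.
u, u_hat = y1 + y2, y1_hat + y2_hat
print(cos(u, u_hat, dim=0))    # tensor(-1.)
```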
The independent sub-networks can thus be aware of each other via the added bridges and CL. A DNN-based MSS network extended with X-scheme can handle multiple sources together, i.e., separate two or more sources jointly, rather than each source independently. From a different viewpoint, CL can be considered to provide a benefit similar to that of multi-task learning [56], since it handles multiple objectives jointly by computing combinational loss functions.
We can apply X-scheme to many DNN-based MSS methods to improve their performance while maintaining almost the same computational cost as the original method, since MDL and CL are merely loss functions and the bridging operation is achieved with simple averaging operations without increasing the number of learnable parameters. As discussed in Section 4, X-scheme improves DNN-based MSS performance.
4 Experiments
In this section, we present our experiments on X-scheme for MSS. We first explore the effect of the bridging position using X-UMX [30] to provide insights on the optimal position and its sensitivity. Next, we confirm the scalability of X-scheme in a large-scale data regime. Finally, we demonstrate the generality of X-scheme by applying it to other types of network architectures, D3Net and Conv-TasNet.
We used the following datasets and STFT/ISTFT settings for the experiments.
4.1 MUSDB18 [57]
The MUSDB18 dataset comprises 150 songs, each of which was recorded at a 44.1-kHz sampling rate. It consists of two subsets ("train" and "test"), where we further split the train set into "train" and "valid" as defined in the official "musdb" package^{Footnote 5}. For each song, the mixture and its four sources, i.e., bass, drums, other, and vocals, are available.
4.2 STFT/ISTFT
We used a Hann window with a length of 4096 samples and 75% overlap. We used the STFT magnitudes obtained from the mixture signal as input and trained the networks to estimate target masks \(M_j(t, f)\) or spectrograms \(Y_j(t, f)\), where f is the frequency bin and t the frame index. To use the STFT and ISTFT as differentiable layers for MDL, we used "torch.stft" and "torch.istft" from PyTorch, which are readily available and provide a differentiable implementation of the STFT/ISTFT^{Footnote 6}. Please also see https://github.com/asteroid-team/asteroid for our actual implementation.
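The settings above translate into the following sketch; the window length and hop follow the text, while the random input and the gradient check are our own illustration.

```python
import torch

n_fft, hop = 4096, 1024                 # Hann window, 75% overlap
window = torch.hann_window(n_fft)

x = torch.randn(2, 44100 * 6, requires_grad=True)  # 2 channels, 6 s at 44.1 kHz
X = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)
mag, phase = X.abs(), X.angle()         # magnitude is fed to the network

# Round trip through the differentiable ISTFT layer used for MDL.
x_rec = torch.istft(mag * torch.exp(1j * phase), n_fft, hop_length=hop,
                    window=window, length=x.shape[-1])
x_rec.sum().backward()                  # gradients flow through STFT/ISTFT
print(x.grad.shape)                     # torch.Size([2, 264600])
```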
4.3 X-UMX
The network architecture of UMX is illustrated in Fig. 5a. The network was trained to estimate all the sources' masks with the Adam [58] optimizer for 1000 epochs. The learning rate was set to 0.001 with a weight decay of 0.00001. The batch size was set to 14, and each input was a random 6.0-s crop from the dataset. The scaling parameter \(\alpha\), introduced in Eq. (3) for MDL, was set to 10.0 to approximately equalize the ranges of \(\mathcal {L}^J_{\text {MSE}}\) and \(\mathcal {L}^J_{\text {wSDR}}\), judging from each loss function's learning curve. Note that the details of the other settings are given in our code^{Footnote 7} and previous paper [30].
4.3.1 Bridging positions
As shown in Fig. 2, the bridging operation can be applied at arbitrary positions between layers, and the number of bridges can be increased depending on the number of gaps between adjacent layers. Therefore, in this section, we present results regarding the position of the bridging operation in X-UMX trained under the same configurations, e.g., number of epochs, regularization parameters, and type of optimizer, as mentioned in the previous subsection. We show the simplified network architecture and possible bridging candidates of UMX in Fig. 5b. UMX roughly consists of three affine blocks and a BLSTM block. Each affine block has a fully connected layer, a batch normalization layer, and an activation function. The BLSTM block has three consecutive BLSTM layers with dropout. In this experiment, we considered three positions as candidates for inserting the bridge and examined the performance for all combinations of bridging positions.
The results are shown in Fig. 6. The performance of almost all bridged versions of UMX, i.e., bridging positions (BP) 1 to 7 (BP1-BP7), was superior to the baseline in terms of source-to-interference ratio (SIR) and source image-to-spatial distortion ratio (ISR) (see "Avg." of Fig. 6b and c). Only the SIR result of BP1 did not outperform the baseline, but it was comparable. Hence, we argue that the bridging operation can improve the suppression of interfering instruments without increasing linear distortion, since ISR becomes low when the output signal contains more linear distortion. Focusing on the SDR results, which reflect the weighted combination of SIR, ISR, and the sources-to-artifacts ratio (SAR), we argue that X-UMX outperformed UMX because the SDR results of BP1-BP7 improved in most cases compared with the baseline (see Fig. 6a). In particular, BP4, which bridges the paths between the "Affine Block" and the "BLSTMs Block," performed best in terms of SDR. Hence, bridging paths at the boundaries between different types of blocks or layers is probably effective for sharing each sub-extraction network's information.
4.3.2 Effectiveness of CL
First, we confirmed the validity of the term "\(2 \epsilon _1 \epsilon _2\)" mentioned in Section 3.2.2, which is ignored when each instrument's sub-network is trained separately. Not only this term but also the bridging of networks brings a benefit: tying the networks together helps the training since gradient information is exchanged, which can help keep the "\(2 \epsilon _1 \epsilon _2\)" term small, and the networks further benefit from joint feature extraction. By computing this term through our X-scheme, we take the mutual effect among sources into account when training the DNN. Specifically, considering this term in the loss function is expected to penalize correlated errors between instruments when one instrument is wrongly separated into another's track. To confirm this, we compared the actual separation results of UMX and X-UMX. As shown in Fig. 7, X-UMX succeeded in suppressing errors in the vocals track leaking from the drums, as expected. In particular, the improvement in the regions highlighted by colored rectangles was obvious and also audible. We consider that this power, i.e., energy from the drums that leaked into the wrong track, was penalized through the loss function by our X-scheme, as discussed in Section 3.2.2, resulting in the performance improvement shown in Fig. 7a.
To confirm the validity of CL in more detail, we monitored the performance change according to the number of combinations. As discussed in Section 3.2.2, the combined vector may not bring a benefit, especially when the number of target sources is small. Therefore, we fixed the target source to "vocals" and trained two X-UMX models, using 3 and 4 instruments for CL and the bridging operation, respectively. Note that both models always received the mixture signal consisting of 4 instruments as input, but the number of output instruments, i.e., sub-networks, differed: "X-UMX on 3 sources" separated 3 instruments from the 4-source input mixture, whereas "X-UMX on 4 sources" separated all 4 instruments. Consequently, the number of combinations used for CL also differed. The results are summarized in Table 1.
As shown in the table, all results of "X-UMX on 3 sources" were inferior to those of the model with all four sources. Intuitively, the more the excluded source was related to the vocals, the worse the results tended to be. The power of "Bass" is concentrated in the lower frequency bands; thus, "Bass" has a lower correlation with "Vocals" than "Drums" does.
4.3.3 Scalability with large training datasets
In this section, we discuss the potential of X-UMX for a large-scale training dataset, i.e., X-UMX-L, which was not assessed in our previous study [30]. DNNs can generalize well if enough data is available for training, and some regularization methods might become ineffective in such a case. Thus, it is important to investigate the scalability of X-scheme. Specifically, we trained UMX and X-UMX on an internal dataset consisting of 1505 songs with a total duration of approximately 100 h, which is 10 times larger than MUSDB18. The dataset exhibits a diverse linguistic composition, with 63% of the songs in English, 20% in French, and 6% in German, while the remaining 11% comprises various other languages such as Italian, Spanish, and Dutch. Regarding musical genres, the collection predominantly features pop and rock music; it also includes a selection of country songs and movie soundtracks, though these are less prevalent. We denote this dataset as "INTERNAL" and note that it has no songs overlapping with MUSDB18. Each song in INTERNAL consists of four instruments, as in MUSDB18.
The results are summarized in Table 2. X-UMX and X-UMX-L outperformed the corresponding UMX and UMX-L when trained on the same dataset, i.e., MUSDB18 or INTERNAL, and they did so for all instruments (see the boldface in Table 2). This shows that X-scheme is effective even when more training data is available.
It is worth noting that X-UMX-L greatly outperformed not only our self-implemented UMX-L trained on INTERNAL but also the "public UMX-L" provided by the authors of UMX, although the size of our dataset is one fifth of theirs^{Footnote 8} (see the yellow highlighted cells in Table 2). From these results, we argue that X-scheme can use a given dataset for training more effectively and can even outperform a traditional setup with more training data.
4.4 X-D3Net and X-Conv-TasNet
Next, we integrated X-scheme into D3Net, resulting in X-D3Net. The network architecture is shown in Fig. 8. The original D3Net, C1, uses band-wise MDenseNets [21] and integrates their outputs by applying a dense block, but the band-wise networks are independent of each other, i.e., there is no path for sharing information among them. Hence, the bridging path is added at the end of the band-wise D3 blocks, resulting in X-D3Net, following the insight from the experiment in Section 4.3.1 that a semantic boundary can be a good position for inserting the bridging operation. The differences in network architecture between D3Net and X-D3Net are shown in Fig. 8. Each network of X-D3Net was trained to estimate all the sources' spectrograms with the Adam [58] optimizer for 70 epochs. The initial learning rate was set to 0.001 with a weight decay of 0.00001, and the learning rate was dropped to 0.0003 and 0.0001 after 40 and 60 epochs, respectively. The batch size was set to 6, and each input was a randomly cropped music spectrogram with 352 frames. The scaling parameter \(\alpha\) was again set to 10, as for X-UMX.
The results are shown in Fig. 9. Note that "P" denotes the proposed method that includes all components of X-scheme, i.e., MDL, the bridging operation, and CL, while "C1-C7" denote comparative methods lacking some of these components in order to confirm their effectiveness one-by-one. In terms of SDR, the methods using at least one component of X-scheme, i.e., C2-C7 and P, were superior to D3Net, i.e., C1 (see the average performance denoted as "Avg." in Fig. 9a). Therefore, the validity of each component of X-scheme was confirmed on a CNN-based MSS method (D3Net) as well as an RNN-based MSS method (UMX). Overall, we could improve MSS performance by 0.3 dB.
In particular, the positive effect of MDL was notable compared with our previous corresponding results on X-UMX [30] (see the results of the methods including MDL, i.e., C2, C5, C6, and P). Therefore, regardless of whether the target network is an originally integrated one or is tied together from independent per-source sub-networks, the loss-function-related core components of X-scheme, i.e., MDL and CL, can improve MSS performance.
Finally, to see the effectiveness of applying our X-scheme to DNN-based MSS methods, we summarize the performance comparison before and after applying X-scheme in Table 3. To confirm the effectiveness not only for frequency-domain networks, i.e., UMX and D3Net as used above, but also for a time-domain network, we also applied X-scheme to Conv-TasNet [1], resulting in X-Conv-TasNet. Note that time-domain networks tend to require much more memory and correspondingly longer training time than frequency-domain ones; thus, we did not run an ablation study but instead applied our X-scheme to Conv-TasNet using the learnings from X-UMX and X-D3Net. As shown in Table 3, all instruments were improved compared with the vanilla networks. In addition to the quantitative results, we also studied the spectrograms shown in Fig. 10. As shown in the figure, all vanilla methods tend to miss the power of the "Other" track, but all of them become able to detect and extract it after applying our X-scheme. This is because the missing power, which had leaked into the wrong tracks, was penalized through the X-scheme loss function, as discussed in Section 3.2.2. Thus, we can argue that our X-scheme works well not only for frequency-domain networks but also for time-domain ones, e.g., Conv-TasNet.
From the aforementioned results, we conclude that our X-scheme can be applied to diverse types of networks, such as CNN-based time- and frequency-domain models as well as RNN-based ones. However, please note that the detailed effects of our method, e.g., the most effective bridging position, the number of combinations and bridges that should be used, and the types of time- and frequency-domain loss functions that are effective, may differ depending on the characteristics of the target network. Therefore, it is important to insert each core part of X-scheme, (i) MDL, (ii) the bridging operation, and (iii) CL, one-by-one and adapt them such that the optimal configuration for the target network is found.
5 Conclusion
We revisited our previous proposal and summarized its core component, a versatile scheme called X-scheme, which consists of three parts: (i) MDL, (ii) the bridging operation, and (iii) CL, and which improves the performance of DNN-based MSS with almost no increase in calculation cost. Specifically, as MDL and CL are merely loss functions used during training, they do not affect the computational cost at inference. As shown in Fig. 2, the bridging operation does not increase the calculation cost either, since it adds only a few averaging computations. To verify X-scheme on network types that differ from the recurrent type, i.e., UMX, we derived X-scheme-based convolutional networks in this paper: the frequency-domain and time-domain convolutional networks extended by X-scheme are X-D3Net and X-Conv-TasNet, respectively. We confirmed their validity compared with the original networks through experiments. We also examined the detailed effectiveness of X-UMX through the following experiments: (a) searching for the effective bridging position(s) and (b) using a larger dataset than MUSDB18. X-UMX-L, obtained by training X-UMX in our large-scale data regime (INTERNAL), greatly outperformed the public UMX-L, although the size of our dataset is about 20% of that of the private training dataset for the public UMX-L. Therefore, by sharing information on the progress of MSS among all sub-networks, our X-scheme-based MSS method can versatilely improve MSS performance beyond the original method. It is worth noting that X-scheme is highly practical since its components, i.e., MDL, CL, and the bridging operation, have almost no effect on the inference of the original target method and are easy to implement.
Availability of data and materials
The dataset generated and/or analyzed during the current study is available in the "MUSDB18" repository, https://sigsep.github.io/datasets/musdb.html [57]. The "INTERNAL" dataset (see Table 2) generated and/or analyzed during the current study is not publicly available since it is a proprietary internal dataset.
Notes
Implementations in two different libraries are available at https://github.com/sigsep/open-unmix-pytorch and https://github.com/sigsep/open-unmix-nnabla
If the network outputs a spectrogram, we append an ISTFT layer, whereas an STFT layer is added if the network outputs a time signal.
Note that the bridging operation may only be needed for methods such as UMX that consist of individual extraction networks. In other words, this bridging is not necessary for methods that learn one network for all sources, such as Demucs [39].
Initial experiments showed that the combination \({}_J\textrm{C}_J\), i.e., adding and mixing all sources, does not further improve MSS performance; thus, it is not used in Eq. (5).
Some popular DNN libraries such as PyTorch and TensorFlow already provide officially implemented STFT and ISTFT layers.
The size of the dataset used for training UMX-L, i.e., 500 h, was confirmed with the developers.
References
Y. Luo, N. Mesgarani, Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27(8), 1256–1266 (2019)
A.V. Oppenheim, R.W. Schafer, J.R. Buck, Discrete-Time Signal Processing, 2nd edn. (Prentice-Hall, Englewood Cliffs, USA, 1999)
G. Meseguer-Brocal, G. Peeters, in Proc. of the 20th International Society for Music Information Retrieval Conference (ISMIR), ed. by A. Flexer, G. Peeters, J. Urbano, A. Volk. Conditioned-U-Net: Introducing a control mechanism in the U-Net for multiple source separations (2019), pp. 159–165. http://archives.ismir.net/ismir2019/paper/000017.pdf. Accessed 29 Apr 2024
O. Slizovskaia, G. Haro, E. Gómez, Conditioned source separation for musical instrument performances. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 2083–2095 (2021). https://doi.org/10.1109/TASLP.2021.3082331
V.S. Kadandale, J.F. Montesinos, G. Haro, E. Gómez, in Proc. of IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP). Multichannel UNet for music source separation (2020), pp. 1–6. https://doi.org/10.1109/MMSP48831.2020.9287108
E. Perez, F. Strub, H. de Vries, V. Dumoulin, A. Courville, FiLM: Visual reasoning with a general conditioning layer. Proc. AAAI Conf. Artif. Intell. 32(1) (2018). https://doi.org/10.1609/aaai.v32i1.11671
E. Cano, D. FitzGerald, A. Liutkus, M.D. Plumbley, F.R. Stöter, Musical source separation: An introduction. IEEE Signal Process. Mag. 36(1), 31–40 (2019). https://doi.org/10.1109/MSP.2018.2874719
N.Q.K. Duong, E. Vincent, R. Gribonval, Underdetermined reverberant audio source separation using a fullrank spatial covariance model. IEEE Trans. Audio Speech Lang. Process. 18(7), 1830–1840 (2010)
D. FitzGerald, A. Liutkus, R. Badeau, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). PROJET — Spatial audio separation using projections (Institute of Electrical and Electronics Engineers (IEEE), Shanghai, 2016), pp. 36–40
A. Liutkus, D. Fitzgerald, R. Badeau, in Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Cauchy nonnegative matrix factorization (Institute of Electrical and Electronics Engineers (IEEE), New Paltz, 2015), pp. 1–5
J. Le Roux, J.R. Hershey, F. Weninger, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Deep NMF for speech separation (Institute of Electrical and Electronics Engineers (IEEE), South Brisbane, 2015), pp. 66–70
Y. Mitsufuji, S. Koyama, H. Saruwatari, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Multichannel blind source separation based on nonnegative tensor factorization in wavenumber domain (Institute of Electrical and Electronics Engineers (IEEE), Shanghai, 2016), pp. 56–60
A. Liutkus, D. Fitzgerald, Z. Rafii, B. Pardo, L. Daudet, Kernel additive models for source separation. IEEE Trans. Signal Process. 62(16), 4298–4310 (2014)
A. Ozerov, C. Fevotte, Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans. Audio Speech Lang. Process. 18(3), 550–563 (2010)
A. Liutkus, D. Fitzgerald, Z. Rafii, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Scalable audio separation with light kernel additive modelling (Institute of Electrical and Electronics Engineers (IEEE), South Brisbane, 2015), pp. 76–80
C. Van Der Malsburg, in Brain Theory, ed. by G. Palm, A. Aertsen. Frank Rosenblatt: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, (Springer Berlin Heidelberg, Berlin, Heidelberg, 1986), pp. 245–248
K. Fukushima, Neocognitron: A selforganizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980)
D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by backpropagating errors. Nature 323(6088), 533–536 (1986)
A.A. Nugraha, A. Liutkus, E. Vincent, in Proc. of 24th European Signal Processing Conference (EUSIPCO). Multichannel music separation with deep neural networks (Institute of Electrical and Electronics Engineers (IEEE), Budapest, 2016), pp. 1748–1752
S. Uhlich, F. Giron, Y. Mitsufuji, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Deep neural network based instrument extraction from music (Institute of Electrical and Electronics Engineers (IEEE), South Brisbane, 2015), pp. 2135–2139
N. Takahashi, Y. Mitsufuji, in Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Multi-scale multi-band DenseNets for audio source separation (Institute of Electrical and Electronics Engineers (IEEE), New Paltz, 2017), pp. 21–25
S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, Y. Mitsufuji, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Improving music source separation based on deep neural networks through data augmentation and network blending (Institute of Electrical and Electronics Engineers (IEEE), New Orleans, 2017), pp. 261–265
N. Takahashi, N. Goswami, Y. Mitsufuji, in Proc. of IWAENC. MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation (2018)
F.R. Stöter, S. Uhlich, A. Liutkus, Y. Mitsufuji, Open-Unmix - A reference implementation for music source separation. J. Open Source Softw. 4, 1667 (2019)
J.H. Kim, J. Yoo, S. Chun, A. Kim, J.W. Ha, Multi-domain processing via hybrid denoising networks for speech enhancement. arXiv (2018)
J. Su, Z. Jin, A. Finkelstein, in Proc. of Interspeech. HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks (International Speech Communication Association (ISCA), Shanghai, 2020), pp. 4506–4510. https://doi.org/10.21437/Interspeech.2020-2143
N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications (The MIT Press, USA, 1949). https://doi.org/10.7551/mitpress/2946.001.0001
J. Lee, Y. Jung, M. Jung, H. Kim, in Proc. of Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Dynamic noise embedding: Noise-aware training and adaptation for speech enhancement (Institute of Electrical and Electronics Engineers (IEEE), Auckland, 2020), pp. 739–746
H. Fang, G. Carbajal, S. Wermter, T. Gerkmann, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Variational autoencoder for speech enhancement with a noiseaware encoder (Institute of Electrical and Electronics Engineers (IEEE), Toronto, 2021), pp. 676–680
R. Sawata, S. Uhlich, S. Takahashi, Y. Mitsufuji, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). All for one and one for all: Improving music separation by bridging networks (Institute of Electrical and Electronics Engineers (IEEE), Toronto, 2021), pp. 51–55
A. Défossez, in Proc. of the International Society for Music Information Retrieval (ISMIR) Conference Workshop on Music Source Separation. Hybrid Spectrogram and Waveform Source Separation (International Society for Music Information Retrieval (ISMIR), 2021)
S. Rouard, F. Massa, A. Défossez, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Hybrid transformers for music source separation (IEEE, Rhodes Island, 2023), pp. 1–5
Y. Luo, J. Yu, Music source separation with band-split RNN. IEEE/ACM Trans. Audio Speech Lang. Process. (2023)
N. Takahashi, Y. Mitsufuji, in Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Densely connected multi-dilated convolutional networks for dense prediction tasks (2021), pp. 993–1002. https://doi.org/10.1109/CVPR46437.2021.00105
N. Takahashi, Y. Mitsufuji, D3Net: Densely connected multi-dilated DenseNet for music source separation. CoRR. abs/2010.01733 (2020). https://arxiv.org/abs/2010.01733
F.R. Stöter, A. Liutkus, N. Ito, in Proc. of International Conference on Latent Variable Analysis and Signal Separation. The 2018 signal separation evaluation campaign (Springer International Publishing, Guildford, 2018), pp. 293–305
F. Lluís, J. Pons, X. Serra, in Proc. of Interspeech. End-to-end music source separation: Is it possible in the waveform domain? (International Speech Communication Association (ISCA), Graz, 2019), pp. 4619–4623
D. Stoller, S. Ewert, S. Dixon, in Proc. of the 19th International Society for Music Information Retrieval (ISMIR) Conference. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation (International Society for Music Information Retrieval (ISMIR), Paris, 2018), pp. 334–340
A. Défossez, N. Usunier, L. Bottou, F. Bach, Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed. CoRR. abs/1909.01174 (2019). http://arxiv.org/abs/1909.01174
Y.N. Dauphin, A. Fan, M. Auli, D. Grangier, in Proc. of the 34th International Conference on Machine Learning (ICML), vol. 70. Language modeling with gated convolutional networks (Proceedings of Machine Learning Research (PMLR), Sydney, 2017), pp. 933–941
M. Kim, W. Choi, J. Chung, D. Lee, S. Jung, in Proc. of the International Society for Music Information Retrieval (ISMIR) Conference Workshop on Music Source Separation. KUIELab-MDX-Net: A two-stream neural network for music demixing (International Society for Music Information Retrieval (ISMIR), 2021)
C.Y. Yu, K.W. Cheuk, in Proc. of the International Society for Music Information Retrieval (ISMIR) Conference Workshop on Music Source Separation. Danna-Sep: Unite to separate them all (International Society for Music Information Retrieval (ISMIR), 2021)
W. Choi, M. Kim, J. Chung, S. Jung, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). LaSAFT: Latent source attentive frequency transformation for conditioned source separation (Institute of Electrical and Electronics Engineers (IEEE), Toronto, 2021), pp. 171–175
Y.S. Jeong, J. Kim, W. Choi, J. Chung, S. Jung, in Proc. of the International Society for Music Information Retrieval (ISMIR) Conference Workshop on Music Source Separation. LightSAFT: Lightweight latent source aware frequency transform for source separation (International Society for Music Information Retrieval (ISMIR), 2021)
H. Liu, Q. Kong, J. Liu, in Proc. of the International Society for Music Information Retrieval (ISMIR) Conference Workshop on Music Source Separation. CWS-PResUNet: Music source separation with channel-wise subband phase-aware ResUNet (International Society for Music Information Retrieval (ISMIR), 2021)
W. Choi, M. Kim, J. Chung, D. Lee, S. Jung, in Proc. of the 21st International Society for Music Information Retrieval (ISMIR) Conference. Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation (International Society for Music Information Retrieval (ISMIR), Montreal, 2020), pp. 192–198
O. Ronneberger, P. Fischer, T. Brox, in Proc. of Medical Image Computing and Computer-Assisted Intervention (MICCAI). U-Net: Convolutional networks for biomedical image segmentation (Springer, Munich, 2015), pp. 234–241
Q. Kong, Y. Cao, H. Liu, K. Choi, Y. Wang, in Proc. of the 22nd International Society for Music Information Retrieval (ISMIR) Conference. Decoupling magnitude and phase estimation with deep ResUNet for music source separation (International Society for Music Information Retrieval (ISMIR), 2021), pp. 342–349
N. Takahashi, P. Agrawal, N. Goswami, Y. Mitsufuji, in Proc. Interspeech. PhaseNet: Discretized phase modeling with deep neural networks for audio source separation (International Speech Communication Association (ISCA), Hyderabad, 2018), pp. 2713–2717
D. Yin, C. Luo, Z. Xiong, W. Zeng, in Proc. AAAI. Phasen: A phaseandharmonicsaware speech enhancement network (AAAI Press, New York City, 2020), pp. 9458–9465
T. Peer, S. Welker, T. in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Gerkmann, DiffPhase: Generative DiffusionBased STFT Phase Retrieval (Institute of Electrical and Electronics Engineers (IEEE), Rhodes Island, 2023), pp. 1–5. https://doi.org/10.1109/ICASSP49357.2023.10095396
H.S. Choi, J.H. Kim, J. Huh, A. Kim, J.W. Ha, K. Lee, Phaseaware Speech Enhancement with Deep Complex UNet. CoRR abs/1903.03107 (2019). http://arxiv.org/abs/1903.03107
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L.U. Kaiser, I. Polosukhin, in Proc. of Advances in Neural Information Processing Systems (NeurIPS), ed. by I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett. Attention is all you need, vol. 30 (Neural Information Processing Systems Foundation, Inc. (NeurIPS), Long Beach, 2017)
J. Hu, L. Shen, G. Sun, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Squeezeandexcitation networks (Institute of Electrical and Electronics Engineers (IEEE), Salt Lake City, 2018)
Y. Luo, Z. Chen, N. Mesgarani, T. Yoshioka, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Endtoend microphone permutation and number invariant multichannel speech separation (Institute of Electrical and Electronics Engineers (IEEE), Barcelona, 2020), pp. 6394–6398
R. Caruana, Multitask learning. Mach. Learn. 28(1), 41–75 (1997)
Z. Rafii, A. Liutkus, F.R. Stöter, S.I. Mimilakis, R. Bittner. The MUSDB18 corpus for music separation (2017). https://doi.org/10.5281/zenodo.1117372
D.P. Kingma, J. Ba, in Proc. of 3rd International Conference on Learning Representations (ICLR), ed. by Y. Bengio, Y. LeCun. Adam: A method for stochastic optimization (OpenReview.net, San Diego, 2015)
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
RS conducted all experiments, analyzed the results, and was a major contributor in writing the manuscript. NT and SU technically supervised RS and polished the initial manuscript. All authors read and approved the final manuscript.
Authors’ information
RS received his B.S. and M.S. degrees in Electronics and Information Engineering from Hokkaido University, Japan, in 2014 and 2016, respectively. From 2022 to 2023, he worked at the Stanford Vision and Learning Lab (SVL) at Stanford University, USA. He is currently a researcher at Sony Research and a Ph.D. candidate at the Graduate School of Information Science and Technology, Hokkaido University. His research interests include biosignal processing, music information retrieval, acoustic signal processing, and 3D computer vision. He is a member of IEEE.
NT received his Ph.D. from the University of Tsukuba, Japan, in 2020. He previously worked at the Computer Vision Lab at ETH Zurich, Switzerland. Since joining Sony in 2008, he has conducted research in audio, computer vision, and machine learning. In 2018, he won the Sony Outstanding Engineer Award, the highest form of individual recognition for Sony Group engineers. He achieved the best scores in several challenges, including the Signal Separation Evaluation Campaign (SiSEC) 2018 and the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 challenge. He has authored several papers and served as a reviewer for CVPR, ICASSP, Interspeech, ICCV, Trans. ASLP, Trans. MM, and more. He co-organized DCASE 2022 Task 3.
SU received the Dipl.-Ing. and Ph.D. degrees in electrical engineering from the University of Stuttgart, Germany, in 2006 and 2012, respectively. From 2007 to 2011, he was a research assistant at the Chair of System Theory and Signal Processing, University of Stuttgart, where he worked on statistical signal processing, focusing especially on parameter estimation theory and methods. Since 2011, he has been with the Sony Stuttgart Technology Center, where he works as a Senior Principal Engineer on problems in music source separation and deep neural network compactization.
ST received his B.S. degree in communications engineering and M.S. degree in information science from Tohoku University, Japan, in 2000 and 2002, respectively. He is currently a researcher and a senior manager at Sony Group Corporation, Japan. His research interests include speech and audio signal processing and machine learning. His team achieved first place in Task 3 of the DCASE 2021 Challenge, and he co-organized the DCASE Challenge in 2022 and 2023. He also co-organized the Sound Demixing Challenge 2023, where real audio from Sony Pictures Entertainment movies was used for the evaluation.
YM received the B.S. and M.S. degrees in information science from Keio University in 2002 and 2004, respectively, and the Ph.D. degree in information science and technology from the University of Tokyo in 2020. He is currently leading the Creative AI Lab at Sony Group Corporation while serving as a Specially Appointed Associate Professor at the Tokyo Institute of Technology. He joined Sony Corporation in 2004 and has led teams that developed the sound design for the PlayStation game title “Gran Turismo Sport” and the spatial audio solution “Sonic Surf VR.” He has won several awards, including the TIGA Award for Best Audio Design for “Gran Turismo Sport” and a jury selection at the Japan Media Arts Festival for the 576-channel sound field synthesis “Acoustic Vessel Odyssey.” From 2011 to 2012, he was a visiting researcher at the Analysis/Synthesis Team, Institut de Recherche et Coordination Acoustique/Musique (IRCAM), Paris, France, where he was involved in the 3DTV content search project sponsored by the European FP7 programme, in research collaboration with IRCAM. In 2021, his team organized the Music Demixing (MDX) Challenge, where Sony Music provided a professionally produced music dataset for evaluating the systems submitted to an online platform on AIcrowd. His team also participated in the DCASE 2021 Challenge and achieved first place in Task 3.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix 1: Robustness against initialization
To confirm robustness against the choice of initial random seed, we trained X-UMX ten times under identical experimental settings, varying only the random seed used for initialization, and compared the final performances. As shown in Table 4, for every instrument, the results averaged over all seeds outperformed those of the vanilla UMX. Furthermore, even when examining each seed’s results individually, almost no score was inferior to UMX’s. We therefore argue that our X-scheme enables the target network to enhance its performance independent of initialization effects.
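For concreteness, the following is a minimal sketch of this seed-robustness protocol, assuming a PyTorch-based setup. The helpers set_seed and run_once, the instrument list, and the random placeholder scores are hypothetical stand-ins; in the actual experiment, run_once would train X-UMX and return its per-instrument SDR.

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    # Fix every RNG that can influence weight initialization and data
    # shuffling, so that a training run is fully determined by `seed`.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def run_once(seed: int) -> dict:
    # Hypothetical stand-in for one training/evaluation run; in the real
    # experiment this would train X-UMX and return per-instrument SDR in dB.
    set_seed(seed)
    return {inst: float(torch.randn(1)) for inst in ("bass", "drums", "other", "vocals")}


# Repeat the run for ten different seeds and average per instrument,
# mirroring the comparison reported in Table 4.
scores = [run_once(seed) for seed in range(10)]
mean_sdr = {inst: float(np.mean([s[inst] for s in scores])) for inst in scores[0]}
print(mean_sdr)
```

When training on GPU, one would additionally fix CUDA-side determinism (e.g., via torch.cuda.manual_seed_all); the sketch omits this for brevity.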
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sawata, R., Takahashi, N., Uhlich, S. et al. The whole is greater than the sum of its parts: improving music source separation by bridging networks. J AUDIO SPEECH MUSIC PROC. 2024, 39 (2024). https://doi.org/10.1186/s13636-024-00354-6