Stripe-Transformer: deep stripe feature learning for music source separation

Music source separation (MSS) aims to isolate individual instrument signals from a given music mixture. Stripes are common in music spectrograms and potentially indicate high-level musical information. For example, a vertical stripe marks the time of a drum hit, and a horizontal stripe indicates a harmonic component such as a singing voice. These stripe features affect the performance of MSS systems, yet previous MSS studies have not explicitly explored them. In this paper, we propose stripe-Transformer, a deep stripe feature learning method for MSS with a Transformer-based architecture. A stripe-wise self-attention mechanism is designed to capture global dependencies along the time and frequency axes of music spectrograms. Experimental results on the Musdb18 dataset show that our proposed model reaches an average source-to-distortion ratio (SDR) of 6.71 dB on four target sources, achieving state-of-the-art performance with fewer parameters. Visualization results further show that the proposed model can extract beat and harmonic structure in music signals.


Introduction
Music source separation (MSS) is an essential technology in music information retrieval (MIR), with the aim of recovering one or more target musical sources from the mixture. The target musical sources usually refer to some musical instruments such as bass, drums, and singing voice. The mixture refers to combinations of source signals.
MSS has an extensive range of applications, such as music remixing [1] and accompaniment extraction for karaoke systems [2]. It can also be used as a preprocessing technique for other MIR tasks [3][4][5]. When the background accompaniment is removed, the results of algorithms such as singer identification [6], vocal melody extraction [7], music emotion recognition [8], and query-by-humming [9] show promising improvements. Some other MIR studies also use source separation as a joint optimization target [10][11][12][13][14] to achieve better performance.
Recently, with the rapid development of deep learning, DNN-based methods have achieved more competitive results in MSS, using various neural networks such as CNNs [32][33][34], RNNs [35,36], and Transformers [37,38]. These networks are trained on multi-track music datasets to learn to separate various kinds of instruments and singing voices.
However, most previous DNN-based MSS studies do not explicitly explore the stripe features that are common in music spectrograms. Because melodic instruments produce harmonics, many parallel horizontal stripes appear at integer multiples of the fundamental frequency $f_0$ [39], as can be seen in the "vocals" and "bass" spectrograms shown in Fig. 1. For example, given an $f_0$ of 200 Hz, there are corresponding harmonic components around 400 Hz, 600 Hz, etc. On the other hand, vertical stripes appear in spectrograms when rhythmic instruments such as drum kits are played [26], as presented in the "drums" spectrogram.
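As a small illustration of where such horizontal stripes would fall, the following sketch lists the first harmonics of a 200 Hz fundamental and the nearest STFT bin for each. The sample rate (44.1 kHz) and window length (4096) are taken from the experimental settings reported later in the paper; the function name `harmonic_bins` is hypothetical and only serves this example.

```python
def harmonic_bins(f0_hz, n_harmonics, sample_rate=44100, n_fft=4096):
    """Return (frequency_hz, nearest_stft_bin) for the first harmonics of f0."""
    bin_width_hz = sample_rate / n_fft  # spacing between adjacent STFT bins
    result = []
    for k in range(1, n_harmonics + 1):
        freq = k * f0_hz  # the k-th harmonic sits at an integer multiple of f0
        result.append((freq, round(freq / bin_width_hz)))
    return result


for freq, bin_index in harmonic_bins(200.0, 4):
    print(f"{freq:.0f} Hz -> bin {bin_index}")
```

Because the harmonics are evenly spaced in frequency, the corresponding stripes are evenly spaced along the frequency axis of a linear-frequency spectrogram.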
The aim of our study is to appropriately process these unique characteristics of musical spectrograms for music source separation. Considering the excellent performance of deep neural networks in recent MIR developments, we choose the DNN-based architecture to process stripe features of music spectrograms.
Our contributions in this paper mainly include the following four aspects: (1) For the MSS task, we are the first to propose combining the U-Net architecture with a Transformer backbone network. (2) In the proposed model, high-level spectral feature maps are modeled as sequences of horizontal or vertical stripes. We design a stripe-wise self-attention (SiSA) module, a novel attention mechanism that captures long-term dependencies within and between these stripes. (3) Under the optimal experimental setting, the proposed method achieves state-of-the-art (SOTA) results on the Musdb18 dataset with fewer parameters. (4) We present a visual analysis of the attention maps of stripe features and of the reconstructed spectrograms, which shows that the proposed model can better extract stripe features such as beat and harmonic structure in music spectrograms.

Related works
Recent SOTA algorithms for music source separation are mostly based on deep neural networks. This section introduces existing MSS methods that use the neural networks involved in our proposed method.

CNN-based methods
The convolutional neural network (CNN) was initially proposed for image classification [32]. The convolution layer slides different convolution kernels over the input image and applies linear operations. Because the kernel parameters are shared across positions, this operation significantly reduces the number of model parameters and extracts local features well.
CNNs also perform well in audio and music source separation. Chandna et al. proposed a CNN-based network for audio source separation [40]. Takahashi et al. proposed a DenseNet-based network that introduced connections across multiple feature maps through down-sampling and up-sampling [29], which achieved SOTA performance among spectrogram-based methods.

U-Net-based methods
U-Net was initially proposed for medical image segmentation [43] and adopts a U-shaped encoder-decoder structure. Skip connections are introduced to connect the convolutional layers of the same resolution between the encoder and decoder. In this way, shallow features and deep semantic features are fused, which significantly boosts the performance of image segmentation. Jansson et al. first applied U-Net to singing voice separation [28], which surpassed the SOTA method at that time in both subjective and objective indicators. Since then, there has been a series of MSS works based on this model architecture [12,29,44,45]. Choi et al. examined various types of intermediate blocks that can be used in the U-Net architecture [46]. Wave-U-Net was proposed as a time-domain audio source separation method adapted from the U-Net architecture [15,47].

Transformer-based methods
Recently, Transformers have been widely applied in natural language processing [37,38], image processing [48,49], and audio processing [50,51]. Transformers with the self-attention mechanism can capture long-term dependencies and highlight essential features in a parallel computation pattern. For the music source separation task, Li et al. proposed a sliced attention-based neural network, which showed the effectiveness of the self-attention mechanism [30]. Yu et al. proposed a pure spectral-temporal Transformer-based encoder that outperformed previous singing voice separation methods [52].

Method
This section introduces the overall architecture of the proposed model and the details of its main components. The stripe-Transformer block and stripe-wise self-attention mechanism are further explained.

Overall architecture
As presented in Fig. 2, we design our proposed model according to the SIMO (single-input-multi-output) architecture [53], in which the single-input refers to the input mixture and multi-outputs refer to spectrograms of target sources. In terms of the model architecture, we use the U-Net-like structure, with the consideration of the impressive performance of this symmetric structure for source separation tasks.
It is difficult to directly learn global information from low-level feature maps, and the computational cost would be prohibitive if the Transformer module operated directly on the input spectrogram, which has a relatively large frequency dimension (1536 frequency bins in our experimental setting). We therefore first down-sample the spectral feature maps using convolution layers with a stride of 2 on the frequency dimension. Each down-sampling convolutional layer is followed by a residual CNN block. Three down-sampling CNN blocks reduce the encoded feature maps to 192 frequency bins. Then, the stripe feature learning module is placed at the bottleneck part of the U-Net structure to process multi-scale feature representations. The outputs of the stripe feature learning module are then passed into the convolutional decoder. Skip connections link the spectral feature maps of the down-sampling and up-sampling paths.
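The frequency-axis arithmetic above can be checked with a short sketch. It assumes that each stride-2 layer halves the frequency dimension exactly (that is, padding is chosen so no bins are lost), which the text implies but does not state; the function names are hypothetical.

```python
def encoder_frequency_sizes(n_bins=1536, n_stages=3, stride=2):
    """Frequency dimension after each stride-2 down-sampling stage."""
    sizes = [n_bins]
    for _ in range(n_stages):
        sizes.append(sizes[-1] // stride)  # assumed exact halving per stage
    return sizes


def attention_pairs(n_tokens):
    """Number of pairwise scores for self-attention over n_tokens tokens."""
    return n_tokens * n_tokens


sizes = encoder_frequency_sizes()
print(" -> ".join(str(s) for s in sizes))
# Ratio of pairwise attention scores along frequency before and after down-sampling.
print(attention_pairs(sizes[0]) // attention_pairs(sizes[-1]))
```

Since self-attention cost grows quadratically with sequence length, shrinking the frequency axis by a factor of 8 reduces the number of pairwise scores along that axis by a factor of 64, which is one way to read the motivation for placing the Transformer at the bottleneck.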

Residual CNN block
Residual CNN blocks are placed in the encoder and decoder, where they focus on local regions to recover high-resolution details. The structure of a residual CNN block is presented in Fig. 2b. It consists of two convolutional layers, in which each layer follows a LeakyReLU activation and a batch normalization layer. An additional convolution layer with a 1 × 1 kernel connects the input and the output of the main branch.

Stripe-Transformer block
The stripe-Transformer block is used to capture dependencies of horizontal and vertical stripes in multi-scale feature representations. The structure of a stripe-Transformer block is presented in Fig. 2c. It mainly consists of a stripe-wise self-attention (SiSA) module, a squeeze-and-excitation (SE) module, and a mixed-scale convolutional FFN (MixCFN).
The SiSA module is an attention-based dual-path network with two branches that process horizontal and vertical stripe features, respectively. Specifically, the input of the SiSA module $x \in \mathbb{R}^{H \times W \times C}$ is first divided into two groups, $x_H \in \mathbb{R}^{H \times W \times \frac{C}{2}}$ and $x_V \in \mathbb{R}^{H \times W \times \frac{C}{2}}$, along the channel dimension. The two groups are then processed by the horizontal and vertical branches separately, in which feature maps are treated as sequences of horizontal stripes and vertical stripes, respectively. The details of SiSA are shown in Fig. 3 and will be described in Section 3.4. We denote the horizontal and vertical branches of the SiSA module as $\mathrm{SiSA}_H$ and $\mathrm{SiSA}_V$, respectively. The outputs of these two branches, $h_H \in \mathbb{R}^{H \times W \times \frac{C}{2}}$ and $h_V \in \mathbb{R}^{H \times W \times \frac{C}{2}}$, are obtained by applying $\mathrm{SiSA}_H$ to $x_H$ and $\mathrm{SiSA}_V$ to $x_V$.
The MixCFN [55] is used to further process the feature outputs from the attention layers. MixCFN is based on the structure of the common feed-forward network (FFN), which consists of two fully connected (FC) layers and a GELU activation function. To further extract multi-scale local information, MixCFN adds two depth-wise convolution paths between the two FC layers. Specifically, the feature maps after the first FC layer are split into two parts along the channel dimension and then passed into 3 × 3 and 5 × 5 depth-wise convolution layers.
Finally, the output of the MixCFN, $z \in \mathbb{R}^{H \times W \times C}$, is also the output of the stripe-Transformer block. Layer normalization is used to speed up network convergence, and residual connections are used to avoid vanishing gradient problems.

Stripe-wise self-attention
The SiSA module contains horizontal and vertical branches, as mentioned in the above section. We take the vertical branch of the SiSA module as an example for further explanation, as shown in Fig. 3. The horizontal branch is in a similar pattern to the vertical branch and will not be discussed.
Each feature map in the vertical branch can be modeled as a sequence of vertical stripes, in which each stripe is itself a sequence of frequency bins at a certain time step. Previous works [27,30] have demonstrated that capturing long-term dependencies of feature sequences benefits the source separation task. Therefore, we propose to use two kinds of attention-based networks, inter-stripe multi-head self-attention (MHSA) and inner-stripe MHSA, to deal with the long-term dependency problem of stripe-level feature modeling. The former captures dependencies between different stripes, and the latter captures dependencies within each stripe.
Let $I \in \mathbb{R}^{H \times W \times C}$ be the input feature maps of the vertical SiSA, in which $C$ is half the channel number of the input of the stripe-Transformer block. For inter-stripe MHSA, $I$ is first processed by a linear transform along the channel dimension and reshaped to $Q$, $K$, and $V \in \mathbb{R}^{n \times H \times W \times c}$, in which $n$ denotes the number of heads and $c$ denotes the number of channels per head. The strip-pooling [56] strategy is used to down-sample the feature maps $Q$ and $K$ into stripe tokens. It performs global average pooling (GAP) on each vertical stripe, which can be denoted as

$$y_j = \frac{1}{H} \sum_{i=1}^{H} x_{i,j} \qquad (5)$$

in which $x_{i,j}$ denotes the feature at frequency index $i$ of the $j$-th stripe and $y_j$ denotes the down-sampled feature of a single stripe. Through this spatial reduction operation, the shape of $Q$ and $K$ becomes $n \times W \times c$. Then, the attention maps $A \in \mathbb{R}^{n \times W \times W}$ can be obtained from the inner product of $Q$ and $K$, in which an attention map
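The pooling and inter-stripe attention steps described above can be sketched for a single head in plain Python. This is a minimal illustration, not the authors' implementation: the $1/\sqrt{c}$ scaling and the row-wise softmax follow the standard scaled dot-product attention and are assumptions here, since the text only states that the attention maps come from the inner product of the pooled $Q$ and $K$. All function names are hypothetical.

```python
import math


def strip_pool_vertical(x):
    """Average an H x W x c feature map over H: one c-dim token per vertical stripe."""
    height, width, channels = len(x), len(x[0]), len(x[0][0])
    return [
        [sum(x[i][j][d] for i in range(height)) / height for d in range(channels)]
        for j in range(width)
    ]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def inter_stripe_attention(q, k):
    """Return a W x W attention map between pooled vertical stripes (single head)."""
    q_tokens = strip_pool_vertical(q)
    k_tokens = strip_pool_vertical(k)
    scale = 1.0 / math.sqrt(len(q_tokens[0]))  # assumed scaled dot-product
    scores = [
        [scale * sum(a * b for a, b in zip(q_tok, k_tok)) for k_tok in k_tokens]
        for q_tok in q_tokens
    ]
    return [softmax(row) for row in scores]


# Toy input: H = 2 frequency bins, W = 3 time steps, c = 2 channels.
features = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[3.0, 0.0], [0.0, 3.0], [1.0, 1.0]],
]
tokens = strip_pool_vertical(features)
attention = inter_stripe_attention(features, features)
```

Pooling before the inner product means the score matrix has $W \times W$ entries per head rather than $(HW) \times (HW)$, which is the spatial reduction the text refers to.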

Experiments
In this section, we first introduce our experimental settings, including the dataset, model configuration, training, and evaluation strategies. We then compare the performance of the proposed model with other DNN-based networks. We also perform comparison experiments concerning the construction of the proposed model and the effect of input audio length. Finally, the visualization results and analyses are discussed.

Dataset
We use the open dataset Musdb18 [57], a professional multi-track dataset that contains four target sources: "vocals," "bass," "drums," and "other." The "other" source covers musical instruments besides the previous three, such as piano and violin. The dataset contains 150 pieces in total, including 100 songs in the training set and 50 songs in the testing set. We further divide the songs in the training set into 86 songs for model training and 14 songs for model validation. All audio materials are in stereo format at 44.1 kHz.

Experimental settings
During data generation, random segments of the training songs are selected for each training iteration. Each song segment is processed by STFT to obtain the spectrogram, with a window length and hop size of 4096 and 1024 samples, respectively. The size of the obtained spectrogram is 2 × 2049 × 256, which represents a roughly 6-s song piece. Since the frequency bandwidth of the tracks in the Musdb18 dataset is limited to 16 kHz, we cut the high-frequency part and finally obtain a spectrogram with the size of 2 × 1536 × 256, which is then fed into the neural networks. We also use data augmentation to obtain more training data, mainly random remixing and random amplitude scaling [58]. The random remixing strategy randomly takes 6-s pieces from each audio source track and then obtains the mixture by the linear summation of all tracks.
For the detailed settings of our proposed model, the channel numbers of the ResCNN part in the encoder are 32, 48, and 64, and the kernel size is kept at 3 × 3. In the multi-scale stripe-Transformer blocks, the channel numbers are 128, 256, and 512, and the head numbers are 4, 8, and 16. The expansion ratio for MixCFN is kept at 3. We use the method that decouples the estimation of magnitudes and phases to optimize the model [29], in which complex ideal ratio masks (cIRMs) are obtained from the final layer of the network. The reconstructed source signals are obtained from the product of the input complex spectrogram and the output cIRMs in the complex domain. The loss we use is the L1 loss between the reconstructed waveform source signals and the ground truths. We use the Adam optimizer without regularization. The learning rate is set to 0.0001 initially and is multiplied by a factor of 0.9 after every 10K steps. The batch size is set to 16. Stripe-Transformer and the other comparison models are trained for 200K iterations with four V100 32G GPUs.
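The spectrogram dimensions quoted above follow from the STFT settings, as the sketch below shows. It assumes the usual one-sided STFT (window length divided by two, plus one bin) and approximates segment duration as frames times hop size, ignoring window edge effects; the function name `stft_shape` is hypothetical.

```python
def stft_shape(n_fft=4096, hop=1024, n_frames=256, sample_rate=44100, kept_bins=1536):
    """Return (frequency bins, segment duration in s, cutoff frequency in Hz)."""
    n_bins = n_fft // 2 + 1                  # one-sided spectrum of a real signal
    duration_s = n_frames * hop / sample_rate  # approximate, ignores window edges
    bin_width_hz = sample_rate / n_fft
    cutoff_hz = kept_bins * bin_width_hz     # upper edge of the retained bins
    return n_bins, duration_s, cutoff_hz


n_bins, duration_s, cutoff_hz = stft_shape()
print(n_bins, round(duration_s, 2), cutoff_hz)
```

Under these assumptions, keeping 1536 bins retains content up to about 16.5 kHz, which is consistent with the stated 16 kHz bandwidth limit, and 256 frames correspond to about 5.94 s.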
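Two of the steps above lend themselves to a short sketch: applying a complex mask to the mixture spectrogram and the step-wise learning rate decay. The code below is an illustration under stated assumptions, not the training code: spectrogram bins are flattened into a plain list of Python complex numbers, the decay is read as a staircase schedule (the rate changes only at multiples of 10K steps), and the function names are hypothetical.

```python
def apply_cirm(mixture_bins, mask_bins):
    """Element-wise complex product of mixture STFT bins and cIRM values."""
    return [x * m for x, m in zip(mixture_bins, mask_bins)]


def learning_rate(step, base_lr=1e-4, decay=0.9, decay_every=10_000):
    """Staircase schedule: multiply by `decay` after every `decay_every` steps."""
    return base_lr * decay ** (step // decay_every)


# A complex mask can rescale a bin (real factor) or rotate its phase (imaginary factor).
mixture = [1 + 1j, 2 + 0j]
mask = [0.5 + 0j, 0 + 1j]
print(apply_cirm(mixture, mask))

for step in (0, 10_000, 200_000):
    print(step, learning_rate(step))
```

Because the mask is complex, it can correct both magnitude and phase of each bin, which a real-valued magnitude mask cannot do. With this schedule, the learning rate after 200K iterations is about 1.2e-5.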
For the inference of the model, we refer to the practice of Sams-Net [30] to cut the original complete audio into several continuous segments, and each piece will be fed into the stripe-Transformer network. Finally, the results of all segments are assembled to obtain recovered songs.
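We evaluate the proposed model and the comparison models using three objective indicators, namely source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifact ratio (SAR). Given an estimate of a source $s_i$ that is composed of the true source $s_{\text{target}}$ and three error terms, interference $e_{\text{interf}}$, noise $e_{\text{noise}}$, and artifacts $e_{\text{artif}}$ [59], the SDR, SIR, and SAR can be defined as follows:

$$\mathrm{SDR} = 10 \log_{10} \frac{\| s_{\text{target}} \|^2}{\| e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}} \|^2}$$

$$\mathrm{SIR} = 10 \log_{10} \frac{\| s_{\text{target}} \|^2}{\| e_{\text{interf}} \|^2}$$

$$\mathrm{SAR} = 10 \log_{10} \frac{\| s_{\text{target}} + e_{\text{interf}} + e_{\text{noise}} \|^2}{\| e_{\text{artif}} \|^2}$$

All metrics mentioned above are calculated by the Python package museval [60] using the median of frames and median of tracks.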
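To make the three ratios concrete, the sketch below evaluates them on a toy decomposition using the standard BSS Eval definitions from [59]: SDR compares the target energy with the energy of all error terms combined, SIR with the interference energy only, and SAR compares the energy of everything except artifacts with the artifact energy. It takes the decomposition as given; the projection step that museval uses to obtain the error terms, and its frame-wise median aggregation, are not reproduced. The helper names are hypothetical.

```python
import math


def energy(signal):
    """Sum of squared samples."""
    return sum(v * v for v in signal)


def add(*signals):
    """Sample-wise sum of equally long signals."""
    return [sum(vals) for vals in zip(*signals)]


def bss_metrics(s_target, e_interf, e_noise, e_artif):
    """Return (SDR, SIR, SAR) in dB for a given decomposition of an estimate."""
    sdr = 10 * math.log10(energy(s_target) / energy(add(e_interf, e_noise, e_artif)))
    sir = 10 * math.log10(energy(s_target) / energy(e_interf))
    sar = 10 * math.log10(energy(add(s_target, e_interf, e_noise)) / energy(e_artif))
    return sdr, sir, sar


# Toy decomposition: target energy 100, interference energy 1, artifact energy 1.
sdr, sir, sar = bss_metrics(
    s_target=[10.0, 0.0, 0.0, 0.0],
    e_interf=[0.0, 1.0, 0.0, 0.0],
    e_noise=[0.0, 0.0, 0.0, 0.0],
    e_artif=[0.0, 0.0, 1.0, 0.0],
)
print(round(sdr, 2), round(sir, 2), round(sar, 2))
```

In this toy case SDR is lower than SIR because it penalizes interference and artifacts together, which is why SDR is commonly used as the overall score.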

Ablation study
We design several ablation experiments to verify the effectiveness of stripe-Transformer. First, we verify the performance of stacked stripe-Transformer blocks as the bottleneck part of the network, compared with stacked blocks using two other backbone networks. We denote our proposed model as "Stripe-T." The first baseline is the residual CNN block, with a structure consistent with the encoder as shown in Fig. 2b. We denote this method as "ResCNNs." The second baseline is the spatial-reduction Transformer (SR-T) block in Pyramid Vision Transformer (PVT) [61]. The SR-T block is another Transformer-based network component widely used in image segmentation, which can also use a self-attention mechanism to process high-resolution feature maps. Specifically, it adopts a depth-wise CNN with a certain stride before the self-attention operation. This spatial-reduction design avoids high computation costs when capturing global dependencies. In our experiments, the down-sampling stride of the spatial-reduction convolution kernel is set to 8 × 8, 4 × 4, and 2 × 2 in the three stages, and the head numbers and channel numbers are the same as in Stripe-T. We denote this method as "SR-T." We keep the structure of the encoder and decoder the same as in Fig. 2.
The experimental results can be seen in Table 1. We compare the SDR, SIR, and SAR performance of the three bottlenecks. According to the SDR values, ResCNNs and SR-T have similar performance: ResCNNs is slightly higher than SR-T on "bass" and "other" and slightly lower than SR-T on "drums" and "vocals." Stripe-Transformer outperforms the other two methods on all four targets. According to the average values of the three indicators, ResCNNs and SR-T both reach 7.75 dB, while stripe-Transformer reaches 8.52 dB, an improvement of about 0.77 dB over both methods.
To further explore the construction of the proposed stripe-Transformer, we test the performance of the system when some components of the stripe-Transformer are removed, as presented in Table 2. We first verify the effectiveness of the horizontal and vertical branches of the SiSA module. When the horizontal or vertical branch of the stripe-Transformer blocks is removed, the mean of the metrics decreases by 0.43 dB and 0.30 dB, respectively, indicating that the removal of the horizontal SiSA has a slightly larger impact on performance. We also verify the effect of inner-stripe and inter-stripe MHSA inside the SiSA module. When these two parts are removed separately, the mean of the metrics decreases by 0.61 dB and 0.63 dB, respectively. In summary, the removal of any branch of the SiSA module degrades the performance of the proposed system.

Context length of stripe-Transformer
For Transformer-based networks, the length of the input sequence affects the performance of the model [30,62]. We test the performance of the model with different input segment lengths. The metric we use is the average of the SDR, SIR, and SAR scores. We set the frame lengths of the input audio to 64, 128, 256, 512, and 1024, in which 256 frames correspond to around 6 s in our experimental settings. The results are shown in Fig. 4. When the frame length is set to 256, the average score is the highest for "vocals" and "drums" separation. The score of "bass" separation is the highest when the number of input frames is 1024. The separation performance for these three instruments remains poor when the number of frames is 64, which provides the least context information.

Comparison with other MSS systems
Table 3 shows the comparison of SDR scores between our model and existing MSS methods. Wave-U-Net [15] is an adaptation of the U-Net architecture to time-domain representations. Meta-TasNet [18] is a meta-learning-inspired architecture for source separation. These two methods are both time-domain methods, and the rest are spectrogram-based methods. Open-Unmix [63] is a frequently used benchmark system on the Musdb18 dataset, with three bidirectional LSTMs as the backbone network. MMDenseLSTM [27] uses multi-band multi-scale CNNs [41] and integrates long short-term memory (LSTM). Sams-Net [30] introduces the sliced attention mechanism to the spectrogram domain. D3Net [31] uses densely multi-dilated convolution to further improve the performance based on a multi-band configuration. Deep ResUNet [29] uses a 143-layer network with a novel cIRM estimation strategy, which achieves the SOTA performance among spectrogram-based methods.
As shown in Table 3, the proposed stripe-Transformer achieves 7.63 dB on the "drums" category and 5.89 dB on the "other" category, which outperforms the above systems. On the "vocals" category, stripe-Transformer achieves 7.83 dB, a significant improvement over all other methods except Deep ResUNet. On the "bass" category, stripe-Transformer is comparable with most spectrogram-based methods and is weaker than Deep ResUNet. For the overall performance, stripe-Transformer achieves an average of 6.71 dB, which is comparable with the 6.73 dB of Deep ResUNet while using around one-tenth of the number of parameters. We compare stripe-Transformer with the mentioned methods in terms of overall performance and model parameters, as summarized in Fig. 5.

Visualization
To further investigate the effect of the stripe-wise self-attention mechanism of the proposed model, we extract the attention maps of stripe-Transformer. The stripe-attention maps of the vertical branch are taken from the first stage of the bottleneck part, and those of the horizontal branch are taken from the third stage. The results are shown in Fig. 6. The attention score is an average over all query stripes and attention heads, which provides a more global interpretation.
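The averaging used for these attention profiles can be sketched as follows. The tensor layout `attention[head][query][key]` is an assumption for illustration, as is the function name; the sketch only shows how averaging over heads and query stripes yields one score per stripe.

```python
def stripe_attention_profile(attention):
    """Average attention[head][query][key] over heads and queries: one score per key stripe."""
    n_heads = len(attention)
    n_queries = len(attention[0])
    n_keys = len(attention[0][0])
    profile = [0.0] * n_keys
    for head in attention:
        for query_row in head:
            for key_index, weight in enumerate(query_row):
                profile[key_index] += weight
    return [total / (n_heads * n_queries) for total in profile]


# Toy example: 2 heads, 2 query stripes, 2 key stripes; each row sums to 1.
toy_attention = [
    [[0.5, 0.5], [1.0, 0.0]],
    [[1.0, 0.0], [0.5, 0.5]],
]
profile = stripe_attention_profile(toy_attention)
print(profile)
```

A stripe with a high value in this profile is one that many query stripes attend to across heads, which is what a one-dimensional attention bar along the time or frequency axis would display.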
The bottom attention bar profiles the vertical stripes of the spectrograms, i.e., time steps. The most highlighted areas along the time axis are located around the drum signals, especially the time steps of the kick drums. The left attention bar profiles the horizontal stripes of the spectrograms, i.e., frequency bands. The relatively lower frequency band (<4000 Hz) receives the higher attention score in the above two cases, since the lower frequency band is considered the most complicated part along the frequency axis and contains more instrumental energy.
Fig. 4 The average score of SDR, SIR, and SAR on "vocals," "drums," and "bass" categories with different input audio lengths
We also evaluate the quality of the reconstructed spectrograms for each target source, using the different comparison models mentioned in Section 4.3. As shown in Fig. 7, we use red boxes to highlight differences in the target spectrograms estimated by ResCNNs, SR-T, and stripe-Transformer. In the reconstruction outputs of the "drums" category, the vertical stripes are broken when using ResCNNs, and the boundary between the vertical stripes is not clear enough when using SR-T. In the lower frequency part, the constituents of the music are often more complicated, which makes mistakes more likely when separating drum activities. In comparison, stripe-Transformer can better recover these details. Since drum activities appear as vertical stripes in the spectrogram, their locations and relationships can be better handled by stripe-level feature modeling. For the "bass" category, most of the energy in the red box is lost when using ResCNNs, while it is preserved well when using SR-T and stripe-Transformer. This demonstrates the importance of capturing global dependencies using strategies such as the self-attention mechanism.
For the "other" category, ResCNNs and SR-T drop percussive signals in the red box, which appear as vertical stripes in the spectrogram. For the "vocals" category, stripe-Transformer can also better recover the labeled region compared with ResCNNs and SR-T, owing to its better ability to process harmonic signals, which appear as horizontal stripes. We use yellow boxes to highlight the differences caused by removing horizontal or vertical SiSAs in stripe-Transformer. For the "drums" category, both models miss high-frequency energy in the labeled regions. For the "bass" category, the stripe-Transformer with only horizontal SiSAs almost misses one note. For the "other" category, there are no significant percussive signals in the labeled spectrogram estimated by the stripe-Transformer with only vertical SiSAs. These examples suggest that removing the horizontal or vertical branch of the SiSA module can degrade the performance.

Conclusion
In this paper, we propose a novel deep neural network architecture, stripe-Transformer, for the task of music source separation. The stripe feature learning module in the proposed model significantly boosts the performance of MSS. The experimental results on the Musdb18 dataset show that the proposed model achieves SOTA performance with fewer parameters in terms of SDR score. The quality of the reconstructed spectrograms is better when using stripe-Transformer compared with ResCNNs and SR-T. Visualization results of attention maps show that our proposed model can better highlight beat and harmonic structures in music spectrograms.
In our future work, we will enlarge the proposed network and apply it to more instrumental separation tasks. Moreover, Transformer-based networks usually need a large amount of training data. We will further investigate data augmentation techniques and semi-supervised methods such as noisy self-training [64] to further improve the performance of the model.