Transformer-based ensemble method for multiple predominant instruments recognition in polyphonic music

Multiple predominant instrument recognition in polyphonic music is addressed using decision-level fusion of three transformer-based architectures on an ensemble of visual representations. The ensemble consists of Mel-spectrogram, modgdgram, and tempogram. Predominant instrument recognition refers to the problem where the prominent instrument is identified from a mixture of instruments being played together. We experimented with two transformer architectures, Vision transformer (Vi-T) and Shifted window transformer (Swin-T), for the proposed task. The performance of the proposed system is compared with that of the state-of-the-art Han’s model, convolutional neural networks (CNN), and deep neural networks (DNN). Transformer networks learn the distinctive local characteristics from the visual representations and classify the instrument to the group where it belongs. The proposed system is systematically evaluated using the IRMAS dataset with eleven classes. A wave generative adversarial network (WaveGAN) architecture is also employed to generate audio files for data augmentation. We train our networks from fixed-length music excerpts with a single-labeled predominant instrument and estimate an arbitrary number of predominant instruments from the variable-length test audio file without any sliding window analysis and aggregation strategy as in existing algorithms. The ensemble voting scheme using Swin-T reports micro and macro F1 scores of 0.66 and 0.62, respectively. These metrics are 3.12% and 12.72% higher, in relative terms, than those obtained by the state-of-the-art Han’s model. The architectural choice of transformers with ensemble voting on Mel-spectro-/modgd-/tempogram has merit in recognizing the predominant instruments in polyphonic music.


Introduction
Music information retrieval (MIR) is a growing field of research with many real-world applications and is widely applied in categorizing, manipulating, and synthesizing music. An important MIR task of predominant instrument recognition is addressed in this paper. Predominant instrument recognition refers to the problem where the prominent instrument is identified from a mixture of instruments being played together [1]. The task of identifying the leading instrument in polyphonic music is challenging due to the presence of interfering partials in the orchestral background. The auditory scene produced by a musical composition can be regarded as a multi-source environment, where different sound sources are played at various pitches and loudness, and even the spatial position of a given sound source may vary with respect to time [2]. Automatic identification of lead instruments is important, since the performance of source separation can be improved significantly by knowing the type of the instrument [1]. If the instrument information is included in the tags, it allows people to search for music with the specific instrument they want. Audio enhancement based on instrument-specific equalization is also in high demand in music processing. Instrument identification also helps to enhance fundamental MIR tasks like auto-tagging [3] and automatic music transcription [4].
An extensive review of approaches for isolated musical instrument classification can be found in [5]. The nonnegative matrix factorization (NMF) model [6], the end-to-end model [7], and the fusion model with spectral, temporal, and modulation features [8] can be regarded as initial attempts at the proposed task in a polyphonic environment. More recent works deal with instrument recognition in polyphonic music, which is a more demanding and challenging problem. A method for automatic recognition of predominant instruments with support vector machine (SVM) classifiers trained with features extracted from real musical audio signals is proposed in [2]. Bosch et al. improved this algorithm with source separation in a preprocessing step [9]. Han et al. [1] developed a deep CNN for instrument recognition based on Mel-spectrogram inputs and aggregation of multiple outputs from sliding windows over the audio data. Pons et al. [10] analyzed the architecture of Han et al. in order to formulate an efficient design strategy to capture the relevant information about timbre. Both approaches were trained and validated on the IRMAS dataset of polyphonic music excerpts. Detecting the activity of music instruments using a deep neural network (DNN) through a temporal max-pooling aggregation is addressed in [11]. Dongyan Yu et al. [12] employed a network with an auxiliary classification scheme to learn the instrument categories through multitask learning. Gomez et al. [13] investigated the role of two source separation algorithms as pre-processing steps to improve the performance in the context of predominant instrument detection tasks. It was found that both source separation and transfer learning could significantly improve the recognition performance, especially for a small dataset composed of highly similar musical instruments.
In [14], the Hilbert-Huang transform (HHT) is employed to map one-dimensional audio data into a two-dimensional matrix format, followed by a CNN to learn rich and effective features for the task. The work in [15] employed an attention mechanism and a multiple-instance learning (MIL) framework to address the challenge of weakly labeled instrument recognition in the OpenMIC dataset.
The modified group delay feature (MODGDF) is proposed for pitched musical instrument recognition in an isolated environment in [16]. While the commonly applied Mel-frequency cepstral coefficients (MFCC) feature is capable of modeling the resonances introduced by the filter of the instrument body, it neglects the spectral characteristics of the vibrating source, which also play their role in human perception of musical sounds [17]. Incorporating phase information attempts to preserve this neglected component. It has already been established in the literature that the modified group delay function emphasizes peaks in spectra well [18]. It has also been shown in [19] that sinusoids in noise can be estimated well using the group delay function. Furthermore, it was shown that even for shorter windows, the phase spectrum could contribute as much as the magnitude spectrum to speech intelligibility [20]. In our work, we introduce the phase-based modgdgram as a complementary feature to the magnitude-based spectrogram in recognizing predominant instruments from a polyphonic environment. The source information is completely suppressed in the modgdgram compared to the spectrogram, and the system-specific information is retained, which is a vital clue in instrument identification.
Tempo-based features are employed in various music information retrieval tasks. Grosche et al. point out the potential of integrating the concept of tempo representation into music structural segmentation [21]. Tempo-based features have also been used for cross-version novelty detection in [22]. In [23], an ensemble of VGG-like CNN classifiers was trained on non-augmented, pitch-synchronized, tempo-synchronized, and genre-similar excerpts of IRMAS for the proposed task. They employed tempo-syncing as one of the data augmentation techniques and achieved better results than the baseline model.
The fusion of multiple modalities can offer significant performance gains over using a single modality and is widely used in recent music processing applications [24][25][26]. The performance of the various features depends on the instrument characteristics and other unknown factors, and no one feature consistently outperforms all others. Consequently, researchers have investigated the possibility of fusing multiple features to take advantage of their strengths. In our work, we utilize transformer architectures to learn instrument-specific characteristics using Mel-spectro-/modgd-/tempogram to estimate predominant instruments from polyphonic music. Transformer-based systems have outperformed previous approaches for various natural language processing (NLP) and computer vision tasks [27], [28].

Contributions
The major contributions of the proposed experiment can be summed up as:
2. We present a high-capacity transformer model for Mel-spectrogram inputs. Our model is derived from [29] with some significant changes as described in Section 4, and it outperforms the existing models, including [1]. The efficacy of transformer models and attention mechanisms is demonstrated by comparison with CNN and DNN architectures.
3. We explore the time-domain strategy of synthetic music audio generation for data augmentation using WaveGAN. The proposed task is addressed with and without data augmentation.
4. In the development phase, the performance is evaluated using various schemes like Mel-spectrogram, modgdgram, and tempogram followed by ensemble voting.
The outline of the rest of the paper is as follows. Section 3 explains the proposed system. The model architectures are described in Section 4. The performance evaluation is explained in Section 5 followed by the analysis of results in Section 6. The paper is concluded in Section 7.

System description
The proposed Vision transformer (Vi-T) and Shifted window transformer (Swin-T) methods are shown in Figs. 2 and 3, respectively. In the proposed model, transformers are used to learn the distinctive characteristics of Mel-spectro-/modgd-/tempogram to identify the leading instrument in a polyphonic context. As a part of data augmentation, additional training files are generated using WaveGAN (Fig. 1). The probability values reported at the output nodes of the trained model are taken as the scores for a test file input. The final decision on the test file is based on soft voting. Soft voting involves summing the predicted probabilities for class labels (from three networks) followed by thresholding. The candidates above the chosen threshold are considered as predominant instruments. The performance of the proposed system is compared with that of the state-of-the-art Han's model and a DNN model. A detailed description of each phase is given in the following subsections.

Mel-spectrogram
Mel-spectrogram is widely used in speech and music processing applications [30], [31]. Mel-spectrogram approximates how the human auditory system works and can be seen as a smoothed spectrogram, with high precision in the low frequencies and low precision in the high frequencies [32]. All audio files in the IRMAS dataset are in a 16-bit stereo .wav format with a sampling rate of 44,100 Hz. The time-domain waveform is converted to a time-frequency representation using a short-time Fourier transform (STFT) with a frame size of 50 ms and a hop size of 10 ms. The resulting linear-frequency spectrogram is then converted to the Mel scale using 128 Mel-frequency bins.
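The framing parameters above translate directly into sample counts. The following is a minimal Python sketch, assuming Librosa as a stand-in for the MATLAB implementation used in this work; the function names `stft_params` and `mel_spectrogram_db`, the mono downmix, and the decibel scaling are illustrative assumptions rather than details stated in the text.

```python
def stft_params(sr, frame_ms=50.0, hop_ms=10.0):
    """Convert frame and hop durations in milliseconds to sample counts."""
    n_fft = int(round(sr * frame_ms / 1000.0))
    hop_length = int(round(sr * hop_ms / 1000.0))
    return n_fft, hop_length


def mel_spectrogram_db(path, n_mels=128):
    """Load an audio file and return a log-scaled Mel-spectrogram (assumed scaling)."""
    # Imported here so that stft_params can be used without librosa installed.
    import librosa
    import numpy as np

    # sr=None keeps the native sampling rate; mono=True is an assumption.
    y, sr = librosa.load(path, sr=None, mono=True)
    n_fft, hop_length = stft_params(sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)
```

At 44,100 Hz, a 50 ms frame corresponds to 2205 samples and a 10 ms hop to 441 samples.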

Modified group delay functions and modgdgram
Group delay features are being employed in numerous speech and music processing applications [18], [33]. The group delay function is defined as the negative derivative of the unwrapped Fourier transform phase with respect to frequency. Group delay functions, $\tau(e^{j\omega})$, are mathematically defined as

$$\tau(e^{j\omega}) = -\frac{d\,\arg(X(e^{j\omega}))}{d\omega} \tag{1}$$

where $X(e^{j\omega})$ is the Fourier transform of the signal $x[n]$ and $\arg(X(e^{j\omega}))$ is the phase function. It can be computed directly from the signal $x[n]$ by [34],

$$\tau(e^{j\omega}) = \frac{X_R(e^{j\omega})\,Y_R(e^{j\omega}) + Y_I(e^{j\omega})\,X_I(e^{j\omega})}{|X(e^{j\omega})|^{2}} \tag{2}$$

where the subscripts $R$ and $I$ denote the real and imaginary parts and $X(e^{j\omega})$ and $Y(e^{j\omega})$ are the Fourier transforms of $x[n]$ and $n\,x[n]$ (signal multiplied with index), respectively. The spiky nature of the group delay spectrum due to zeros that are located close to the unit circle can be suppressed by replacing the term $|X(e^{j\omega})|$ in the denominator of Eq. (2) with its cepstrally smoothed version, $S(e^{j\omega})$, thereby resulting in modified group delay functions (MODGD) [18]. The modified group delay functions are obtained by

$$\tau_m(e^{j\omega}) = \left(\frac{\tau_c(e^{j\omega})}{|\tau_c(e^{j\omega})|}\right)\,|\tau_c(e^{j\omega})|^{\alpha} \tag{3}$$

where

$$\tau_c(e^{j\omega}) = \frac{X_R(e^{j\omega})\,Y_R(e^{j\omega}) + Y_I(e^{j\omega})\,X_I(e^{j\omega})}{|S(e^{j\omega})|^{2\gamma}} \tag{4}$$

Two new parameters, $\alpha$ and $\gamma$ ($0 < \alpha \le 1$ and $0 < \gamma \le 1$), are introduced to control the dynamic range of MODGD [18]. Modgdgram is the visual representation of MODGD with time and frequency on the horizontal and vertical axes, respectively. In a third dimension, the amplitude of the group delay function at a particular time is represented by the intensity or color of each point in the image. Modgdgrams are computed with a frame size of 50 ms and a hop size of 10 ms. The parameters $\alpha$ and $\gamma$ have been empirically chosen as 0.9 and 0.5, respectively. Mel-spectrograms and modgdgrams are implemented using MATLAB.
Typically in spectrograms, we can see pitch components and their harmonics as striations along with formant structure. In the modgdgram, however, system-specific information (formant tracks) is enhanced by suppressing the source information. In music, the body of the musical instrument is the counterpart of the vocal tract (system) in speech. Davis et al. [35] claim that timbres are properties of musical instruments which rely on the physical characteristics of the instrument. Thus, timbre is what makes a particular musical instrument or human voice sound different from another, even when they play or sing the same note.
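The computation described above, in which the squared magnitude in the denominator is replaced by a cepstrally smoothed spectrum and the dynamic range is then compressed with α and γ, can be sketched for a single frame as follows. This is a minimal NumPy illustration, not the MATLAB implementation used in this work; the function name `modgd_frame` and the cepstral lifter length `lifter` are hypothetical choices, and no analysis window is applied.

```python
import numpy as np


def modgd_frame(x, alpha=0.9, gamma=0.5, lifter=8):
    """Modified group delay of one frame, returned for non-negative frequencies."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.fft(x)        # transform of x[n]
    Y = np.fft.fft(n * x)    # transform of n * x[n]

    # Cepstrally smoothed magnitude spectrum S: keep only the low-time part
    # of the real cepstrum (lifter length is an assumed value).
    cepstrum = np.fft.ifft(np.log(np.abs(X) + 1e-10)).real
    liftered = np.zeros_like(cepstrum)
    liftered[:lifter] = cepstrum[:lifter]
    if lifter > 1:
        liftered[-(lifter - 1):] = cepstrum[-(lifter - 1):]
    S = np.exp(np.fft.fft(liftered).real)

    # Numerator of the group delay divided by the smoothed spectrum.
    tau_c = (X.real * Y.real + X.imag * Y.imag) / S ** (2 * gamma)
    # Sign-preserving compression of the dynamic range.
    tau_m = np.sign(tau_c) * np.abs(tau_c) ** alpha
    return tau_m[: len(x) // 2 + 1]
```

A modgdgram would then be obtained by stacking the outputs of successive 50 ms frames taken every 10 ms. As a sanity check, a unit impulse delayed by three samples has a constant group delay of three samples when α = γ = 1.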

Tempogram
A tempogram is a time-pulse representation of an audio signal laid out such that it indicates the variation of pulse strength over time given a specific time lag l or a beats per minute (BPM) value [36]. It is a time-tempo representation that encodes the local tempo of a music signal over time. The calculation of the tempogram is based on the assumption that music exhibits coherent and locally periodic patterns. These patterns may be characterized by peaks in the autocorrelation function (ACF) of the onset detection function [36] at certain time lags. The training and testing audio files are read and processed using the Librosa framework. The principle of autocorrelation is used to estimate the tempo at every segment in the novelty function [37]. Autocorrelation tempograms are computed with librosa.feature.tempogram using a 2048-point FFT window and a hop size of 512.
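The autocorrelation principle behind the tempogram can be illustrated without any audio library. The sketch below is a simplified NumPy version of a local-ACF tempogram computed from a precomputed onset (novelty) curve; it is not the Librosa implementation, and the names `acf_tempogram` and `lag_to_bpm`, the Hann window, and the zero padding are assumptions made for illustration.

```python
import numpy as np


def acf_tempogram(onset_env, win_length=384):
    """Local autocorrelation of an onset envelope; rows are lags, columns are frames."""
    onset_env = np.asarray(onset_env, dtype=float)
    pad = win_length // 2
    padded = np.pad(onset_env, (pad, pad))   # zero padding centers each window
    window = np.hanning(win_length)
    n_frames = len(onset_env)
    tg = np.empty((win_length, n_frames))
    for t in range(n_frames):
        seg = padded[t:t + win_length] * window
        # Autocorrelation via the power spectrum; zero-padding to twice the
        # window length avoids circular wrap-around.
        spec = np.fft.rfft(seg, 2 * win_length)
        acf = np.fft.irfft(np.abs(spec) ** 2)[:win_length]
        tg[:, t] = acf / (acf[0] + 1e-12)    # normalize by the lag-0 energy
    return tg


def lag_to_bpm(lag, sr, hop_length):
    """Convert a lag in onset-envelope frames to beats per minute."""
    return 60.0 * sr / (hop_length * lag)
```

For an onset curve with a pulse every ten frames, the strongest non-zero lag in each column is ten frames, which `lag_to_bpm` maps to a tempo once the sampling rate and hop size are known.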

DNN
A DNN framework on musical texture features (MTF) is experimented with to examine the performance of deep learning methodology on handcrafted features. MTF includes MFCC (13 dim), spectral centroid, spectral bandwidth, root mean square energy, spectral roll-off, and chroma STFT. The features are computed with a frame size of 40 ms and a hop size of 10 ms using the Librosa framework. The DNN consists of seven layers, with the number of units increasing from 8 to 512. ReLU has been chosen as the activation function for the hidden layers and softmax for the output layer. The approach attempted in [38] has been customized for multi-label classification and has been experimented with to analyze the role of machine learning techniques, especially using the MTF-SVM framework.

CNN
CNN uses a deep architecture with repeated convolutions followed by max-pooling. A total of five layers are used with the number of filters starting from 32 to 512 for Mel-spectrogram processing. The first two layers used 5 × 5 filters, and the remaining layers used 3 × 3 filters. Using filters of different shapes seems an efficient way of learning spectrogram-based CNNs [10]. To achieve the best performance, the optimal filter size is usually chosen empirically by either experimental validation or visualization for each convolutional layer [39]. The initial layers help to extract general features and also help in noise reduction. The last convolutional layers used 3 × 3 filters as later layers reveal more specific and complex patterns and final layers activations help to recognize the predominant instruments from accompaniments. Global max-pooling is adopted in the final max-pooling layer, which is then fed to a fully connected layer. For modgdgram processing, we used six convolutional layers with the number of filters increasing from 8 to 256, followed by 2 × 2 max pooling. We used filters of size 3 × 3 in all six layers with a fixed stride size of one. For tempogram processing, we used the same model as Mel-spectrogram. A dropout of 0.5 is introduced after the fully connected layer to avoid overfitting in all processing. Leaky ReLU with α = 0.33 in hidden layers has been empirically chosen for optimum performance in Mel-spectrogram processing. But in modgdgram and tempogram processing, the best performance is obtained for ReLU. Softmax is used as the activation function for the output layer.

Vi-T
Inspired by the success of the Transformer [27] in various natural language processing tasks, Vision Transformers (ViT) [40] constitute the first pure transformer-based architecture that can achieve good performance on the image recognition task. Figure 2 shows the architecture of our proposed method. As shown in Fig. 2(a), the input image is $x \in \mathbb{R}^{H \times W \times L}$, where $H$, $W$, and $L$ represent the height, width, and number of channels of the image $x$. The input image is partitioned into non-overlapping patches, called tokens. In our work, we choose a patch size of $M \times M = 6 \times 6$. Each patch is then linearly projected to a dimension of 64 and, along with position embeddings, the resulting sequence of vectors is fed to a standard transformer encoder. The number of patches is $P = HW/M^{2}$. The various hyperparameters selected for our proposed method are shown in Table 1. Position embeddings are added to the patch embeddings to retain positional information. The Transformer encoder is shown in Fig. 2(b) and consists of alternating layers of multi-headed self-attention (MSA) with eight attention heads and two multilayer perceptron (MLP) layers with 2048 and 1024 nodes with Gaussian error linear unit (GELU) nonlinearity in between. Layernorm (LN) is applied before every MSA and MLP layer, and residual connections are placed after each module. MSA is defined in [27] as

$$\mathrm{MSA}(Q, K, V) = \mathrm{Concat}(H_1, H_2, \ldots, H_h)\,W^{O} \tag{5}$$

where

$$H_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{6}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{7}$$

Here $Q$, $K$, and $V$ are the query, key, and value matrices [27], while $d_k$ is the dimension of the query vector. The outputs from all 8 attention heads are concatenated to form a single output vector before passing it through the feed-forward network. The model is then trained on instrument classification in a supervised manner.
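The patch count and the multi-head self-attention step described above can be sketched in a few lines of NumPy. This is an illustrative forward pass only, with random weights, no layer normalization, and no MLP block; the 48 × 48 input size and the names `num_patches` and `multi_head_self_attention` are hypothetical and are not taken from Table 1.

```python
import numpy as np


def num_patches(height, width, patch):
    """Number of non-overlapping patch x patch tokens in a height x width image."""
    return (height // patch) * (width // patch)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads=8):
    """x: (P, D) token embeddings; w_*: (D, D) projections.

    Returns the (P, D) output and the (num_heads, P, P) attention weights.
    """
    P, D = x.shape
    d_k = D // num_heads

    def split(t):  # (P, D) -> (num_heads, P, d_k)
        return t.reshape(P, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # scaled dot product
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                # (num_heads, P, d_k)
    concat = heads.transpose(1, 0, 2).reshape(P, D)    # concatenate the heads
    return concat @ w_o, weights


# Hypothetical example: a 48 x 48 image, 6 x 6 patches, projection dimension 64.
rng = np.random.default_rng(0)
P, D = num_patches(48, 48, 6), 64
tokens = rng.normal(size=(P, D))
w_q, w_k, w_v, w_o = (0.1 * rng.normal(size=(D, D)) for _ in range(4))
out, attn = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o)
```

Each row of every head's attention matrix sums to one, so each output token is a weighted average of the value vectors of all tokens in the image.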

Swin-T
The main drawback of ViT is that it produces feature maps of a single low resolution and has quadratic computational complexity with respect to input image size due to the global computation of self-attention. Also, the tokens are of fixed scale and are thus unsuitable for vision applications. Unlike other transformers, Swin-T [29] has a hierarchical architecture and has linear computational complexity through the proposed shifted window-based self-attention approach. The computational complexities of global MSA, as used in Vi-T, and of window-based MSA (W-MSA), as used in Swin-T, are given by [29]

$$\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C \tag{8}$$

$$\Omega(\mathrm{W\text{-}MSA}) = 4hwC^{2} + 2M_w^{2}hwC \tag{9}$$

where $h \times w$ is the number of patches, $C$ is the embedding dimension, and $M_w$ is the window size in patches (denoted $M$ in [29]). The computational complexity drops for Swin-T as per Eq. (9) above. MSA has quadratic computational complexity with respect to the patch number $hw$, while W-MSA has linear computational complexity due to the shifted window approach [29]. Figure 3 shows the architecture of our proposed method. The input image is partitioned into non-overlapping patches, called tokens, during patch partitioning. In our work, we choose a patch size of $M \times M = 4 \times 4$. The second step is linear embedding, in which the raw feature values of each patch are projected to a $C$-dimensional vector. The hyperparameter $C$ has been empirically chosen as 96 for our work. The various selected hyperparameters for our proposed method are shown in Table 1. The output of the patch embedding layer leads to two Swin Transformer networks. The output of the second Swin-T network is applied to a patch merging layer. Patch merging works in a similar way to a CNN's pooling layer by concatenating the features of each group of neighboring patches and applying a linear embedding layer to change the output dimension to $2C$. Hence the output of the patch merging layer is $\frac{H}{8} \times \frac{W}{8} \times 2C$ and is followed by global average pooling and a dense layer with 11 nodes and a softmax activation function. Figure 3(b) shows the internal architecture of the Swin-T block. The shifted windows approach is used in the encoder to implement the multi-head self-attention (MSA) scheme.
The output of the patch embedding layer is divided into non-overlapping windows (in our work, we choose N = 4, where N is the number of windows). To compute the self-attention of a given patch within a window, we ignore the patches in all other windows. As illustrated in Fig. 3(b), W-MSA is the windowed multi-head self-attention in which we divide the patched image into non-overlapping windows and compute attention for patches within the window. In SW-MSA, the window is shifted forward by two patches, just like kernel striding in a CNN, and attention is computed within the shifted window. For the empty patches, the process was repeated after zero padding. W-MSA and SW-MSA are followed by a 2-layer MLP, each layer with 256 nodes, and GELU nonlinearity in between. LN is applied before each MSA module and MLP, and a residual connection is applied after each module. The modified equation for attention [29] is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}} + B\right)V \tag{10}$$

where $B$ is the relative position bias of the window.
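The difference between quadratic and linear growth can be made concrete by evaluating the two complexity expressions from [29]: 4hwC² + 2(hw)²C for global MSA and 4hwC² + 2(window)²hwC for window-based MSA. In the sketch below the function names and the 32 × 32 patch grid with a window of 4 × 4 patches are hypothetical values chosen only for illustration.

```python
def msa_flops(h, w, C):
    """Global multi-head self-attention: quadratic in the number of patches h*w."""
    return 4 * h * w * C ** 2 + 2 * (h * w) ** 2 * C


def wmsa_flops(h, w, C, window):
    """Window-based self-attention: linear in h*w for a fixed window size."""
    return 4 * h * w * C ** 2 + 2 * window ** 2 * h * w * C


# Hypothetical grid of 32 x 32 patches with C = 96 and windows of 4 x 4 patches.
for side in (16, 32, 64):
    ratio = msa_flops(side, side, 96) / wmsa_flops(side, side, 96, 4)
    print(side, round(ratio, 2))
```

The ratio printed for each grid size grows as the grid gets larger, because only the global variant contains the (hw)² term.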
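Window partitioning, and the shift by two patches with zero padding described above, can be sketched as follows. This NumPy illustration is an interpretation of the description rather than the authors' code: the name `window_partition`, the direction of the shift, and the 8 × 8 example grid are assumptions. Attention would then be computed independently inside each returned window.

```python
import numpy as np


def window_partition(x, ws, shift=0):
    """Split a (H, W, C) patch grid into non-overlapping ws x ws windows.

    Returns an array of shape (num_windows, ws * ws, C). If shift > 0, the
    window grid is moved forward by `shift` patches and the positions left
    without data are filled with zeros (assumed reading of the zero padding).
    """
    H, W, C = x.shape
    if shift:
        shifted = np.zeros_like(x)
        shifted[:H - shift, :W - shift] = x[shift:, shift:]
        x = shifted
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)


# Hypothetical 8 x 8 grid of single-channel patches, windows of 4 x 4 patches.
grid = np.arange(64).reshape(8, 8, 1)
regular = window_partition(grid, ws=4)           # W-MSA windows
shifted = window_partition(grid, ws=4, shift=2)  # SW-MSA windows
```

Because the second partition is offset by two patches, patches that were separated by a window boundary in the first partition fall into the same window in the second, which lets information flow across windows.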

Dataset
The performance of the proposed system is evaluated using the IRMAS (Instrument Recognition in Musical Audio Signals) dataset, developed by the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF). It consists of more than 6000 musical audio excerpts from various styles with annotations of the predominant instruments present. All audio files in the IRMAS dataset are in a 16-bit stereo .wav format with a sampling rate of 44,100 Hz. The training data are single-labeled, fixed-length excerpts of three seconds. On the other hand, the testing data are multi-labeled and consist of 2874 audio files with lengths between 5 and 20 s and contain multiple predominant instruments. This dataset has two disadvantages when training models. First, the number of audio files available for certain instruments like cello, clarinet, and flute is less than 500, and the models trained with the data are hardly generalizable. Second, the dataset is not well balanced in terms of either musical genre or instrumentation. This would not be a problem if the dataset were larger and its distribution represented the real world. Data augmentation offers an excellent solution to this issue. Data augmentation means training the deep network with additional diverse data. This increases the generalization capability of the network and thus reduces overfitting.

Data augmentation using WaveGAN
Generative adversarial networks (GANs) have been successfully applied to a variety of problems in image generation [41] and style transfer [42]. The WaveGAN architecture is similar to the deep convolutional GAN (DCGAN), which is used for Mel-spectrogram generation in various music processing applications. The DCGAN generator uses transposed convolution to iteratively upsample low-resolution feature maps into a high-resolution image. In the WaveGAN architecture, the transposed convolution operation is modified to widen its receptive field. Specifically, longer one-dimensional filters of length 25 are used instead of two-dimensional filters of size 5 × 5, and the intermediate representation is upsampled by a factor of four instead of two at each layer. The input to the generator is a random sample taken from a uniform distribution between −1 and 1 and is projected and reshaped to the dimension 16 × 1024. This is followed by six transposed convolution layers that upsample the input feature map to a fine and detailed output. The output of the generator is 65,536 samples (corresponding to about 4.1 s of audio at 16 kHz). It can also produce 1.49 s of audio at 44.1 kHz by choosing a slice length of 65,536 samples. The output of the generator is directly applied to the input of the discriminator. The discriminator is an efficient CNN that discriminates between real and generated samples. The discriminator is also modified similarly, using length-25 filters in one dimension and increasing the stride from two to four, which results in the WaveGAN architecture [43].
The transposed convolution in the generator produces checkerboard artifacts [43]. To ensure that the discriminator does not learn these artifacts, we use the phase shuffle operation (with hyperparameter n = 2) as suggested in [43]. ReLU is used as the activation for transposed convolution layers and LReLU with α = 0.2 is chosen for the convolution operation. Finally, the system is trained using the Wasserstein GAN with gradient penalty (WGAN-GP) strategy [44] to tackle the vanishing gradient problem and enhance training stability. For training, the WaveGAN optimizes WGAN-GP using Adam for both generator and discriminator. A constant learning rate of 0.0001 is used with $\beta_1 = 0.5$ and $\beta_2 = 0.9$. WaveGAN is trained for 2000 epochs on the three-second audio files of each class to generate similar audio files based on a similarity metric (s) [45] with an acceptance criterion of s > 0.1. The values of parameters and hyperparameters associated with WaveGAN for our experiments are listed in Table 2. A total of 6585 audio files with cello (625), clarinet (482), flute (433), acoustic guitar (594), electric guitar (732), organ (657), piano (698), saxophone (597), trumpet (521), violin (526), and voice (720) are generated. Training files available in the corpus are denoted by Train DB. The generated files are added to the available training corpus, and the augmented corpus is denoted by Train AugDB. Mel-spectrogram, modgdgram, and tempogram of natural and generated audio files for acoustic guitar are shown in Fig. 1. The experiment details and a few audio files can be accessed at https://sites.google.com/view/audiosamples-2020/home/instrument.
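The output length of the generator follows from the upsampling schedule described above: a length-16 feature map upsampled by a factor of four in each of six transposed convolution layers. The short Python check below reproduces that arithmetic; the function name `generator_lengths` is illustrative.

```python
def generator_lengths(start=16, factor=4, layers=6):
    """Temporal length of the feature map after each upsampling layer."""
    lengths = [start]
    for _ in range(layers):
        lengths.append(lengths[-1] * factor)
    return lengths


lengths = generator_lengths()
n_samples = lengths[-1]
print(lengths)
# Duration in seconds of the generated slice at two sampling rates.
print(round(n_samples / 16000, 3), round(n_samples / 44100, 3))
```

The final length is 16 × 4⁶ = 65,536 samples, which lasts 4.096 s at 16 kHz and about 1.486 s at 44.1 kHz.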
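Phase shuffle, as described in [43], randomly shifts the activations of a discriminator layer by between −n and n samples and fills the vacated positions by reflection, so that the discriminator cannot rely on the exact phase of periodic artifacts. The following NumPy sketch illustrates the operation on a single (time, channels) activation map; the function names are hypothetical and the sketch is not tied to any particular deep learning framework.

```python
import numpy as np


def shift_with_reflection(x, shift):
    """Shift x of shape (time, channels) along time, filling gaps by reflection."""
    length = x.shape[0]
    if shift == 0:
        return x.copy()
    if shift > 0:
        # Pad at the start and drop the samples pushed past the end.
        return np.pad(x, ((shift, 0), (0, 0)), mode="reflect")[:length]
    # Pad at the end and drop the samples pushed past the start.
    return np.pad(x, ((0, -shift), (0, 0)), mode="reflect")[-length:]


def phase_shuffle(x, n=2, rng=None):
    """Apply a random shift drawn uniformly from {-n, ..., n}."""
    rng = np.random.default_rng() if rng is None else rng
    shift = int(rng.integers(-n, n + 1))
    return shift_with_reflection(x, shift)
```

With n = 2, each call picks one of five possible shifts, and the output always has the same shape as the input.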
The quality of generated files is evaluated using a perceptual test. It is conducted with ten listeners to assess the quality of generated files for 275 files covering all classes. Listeners are asked to grade the quality by choosing one among the five opinion grades varying from poor to excellent quality (scores, 1 to 5). A mean opinion score (MOS) of 3.64 is obtained. This value is comparable to the MOS score obtained in [43] and [46] using WaveGAN.

Experimental set-up
The experiment proceeds in four phases, namely Mel-spectrogram-based, modgdgram-based, and tempogram-based, followed by soft voting. Hard or majority voting is not used in our method since the presence of simultaneously occurring partials degrades its performance [1]. Han's model [1] is implemented with a 1 s slice length for performance comparison. In their approach, sigmoid outputs obtained by sliding window analysis on Mel-spectrogram inputs were aggregated followed by thresholding, and the candidates above that particular threshold were considered as predominant instruments. In our proposed method of soft voting, the predicted probabilities from three networks are summed followed by thresholding. We choose a threshold value of 0.5 empirically as it helps to recognize most of the predominant instruments [1].
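The soft voting rule can be sketched as follows. The class order, the probability values, and the function name `soft_vote` are illustrative assumptions; in particular, the sketch thresholds the raw sum of the three probability vectors, since no normalization of the sum is mentioned in the text.

```python
import numpy as np

# The eleven IRMAS classes; this ordering is an assumption for illustration.
IRMAS_CLASSES = [
    "cello", "clarinet", "flute", "acoustic guitar", "electric guitar",
    "organ", "piano", "saxophone", "trumpet", "violin", "voice",
]


def soft_vote(prob_list, threshold=0.5, labels=IRMAS_CLASSES):
    """Sum per-class probabilities over the networks and keep classes above threshold."""
    summed = np.sum(np.stack(prob_list), axis=0)
    predicted = [labels[i] for i in np.flatnonzero(summed > threshold)]
    return predicted, summed


# Made-up outputs of the Mel-spectrogram, modgdgram, and tempogram networks.
mel = np.zeros(11)
mel[[6, 10]] = [0.7, 0.3]
modgd = np.zeros(11)
modgd[[0, 6, 10]] = [0.1, 0.5, 0.4]
tempo = np.zeros(11)
tempo[[4, 6, 10]] = [0.4, 0.4, 0.2]

predicted, summed = soft_vote([mel, modgd, tempo])
print(predicted)
```

In this made-up example, piano and voice receive summed scores of 1.6 and 0.9 and are returned as predominant instruments, while electric guitar (0.4) and cello (0.1) fall below the threshold.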

Training configuration
The DNN is trained with a categorical cross-entropy loss function using the Adam optimizer with a learning rate of 0.001 and a mini-batch size of 128. For the CNNs, we choose a batch size of 128 and the Adam optimizer with a categorical cross-entropy loss function.
For Vi-T, we used categorical cross-entropy loss function using Adam optimizer, with a learning rate of 0.001 and weight decay of 0.0001, and the mini-batch size was set to 256. For Swin-T we used categorical cross-entropy using the Adam optimizer, with a learning rate of 0.001 and gradient clip value of 0.5, and the mini-batch size was set to 32. 20% of training data is used for tuning the hyperparameters during validation for all the models. The training was stopped when the validation loss did not decrease for more than two epochs.
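The stopping rule above can be expressed as a small framework-independent helper. This sketch assumes that "did not decrease for more than two epochs" means training stops once the validation loss has failed to improve on its best value for three consecutive epochs; the class name and the loss values are hypothetical.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for more than `patience` epochs."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs > self.patience


def stop_epoch(losses, patience=2):
    """Index of the epoch at which training stops, or None if it never does."""
    stopper = EarlyStopping(patience)
    for epoch, loss in enumerate(losses):
        if stopper.step(loss):
            return epoch
    return None
```

For example, with validation losses of 1.0, 0.8, 0.9, 0.85, 0.81, and 0.7, the best value of 0.8 is not improved on during the next three epochs, so training stops at the fifth epoch and the final value is never reached.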

Testing configuration
2874 polyphonic files of variable length with multiple predominant instruments are used for the testing phase.
Since the number of annotations for each class was not equal, we computed precision, recall, and F1 measures for both the micro and the macro averages. For the micro averages, we calculated the metrics globally, thus giving more weight to the instrument with a higher number of appearances. On the other hand, we calculated the metrics for each label and found their unweighted average for the macro averages.
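The difference between the two averages can be seen in a small multi-label example. The NumPy sketch below computes both scores from binary label matrices; the function name `f1_scores` and the toy data are illustrative and do not come from the experiments.

```python
import numpy as np


def f1_scores(y_true, y_pred):
    """Micro- and macro-averaged F1 for binary matrices of shape (files, classes)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0).astype(float)
    fp = (~y_true & y_pred).sum(axis=0).astype(float)
    fn = (y_true & ~y_pred).sum(axis=0).astype(float)

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        # Classes with no true or predicted instances get an F1 of zero.
        return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

    micro = float(f1(tp.sum(), fp.sum(), fn.sum()))   # counts pooled over classes
    macro = float(f1(tp, fp, fn).mean())              # unweighted mean over classes
    return micro, macro


# Toy example with three classes; the first class is the most frequent.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0]]
micro, macro = f1_scores(y_true, y_pred)
```

Here the frequent class is always recognized while the two rare classes are always missed, so the micro average (about 0.67) is much higher than the macro average (about 0.33).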

Results and analysis
The overall performance of different phases of the Swin-T experiment with data augmentation Train AugDB is tabulated in Table 3  network and reports a better macro score than Han's model. It outperforms Han's model on five instrument classes and our proposed Mel-spectrogram-Swin-T network on three instruments. Thus, our proposed voting-Swin-T and Mel-spectrogram-Swin-T showed improved performance over the state-of-the-art Han's model.

Analysis of instrument-wise identification performance
The instrument-wise recall for all our voting experiments with data augmentation is shown in Fig. 4. The proposed Voting frameworks showed superior performance to the state-of-the-art Han's model. In the case of ensemble voting using CNN, instruments like the clarinet, electric guitar, piano, and trumpet show improved performance over Han's model. In the case of voting using transformers, seven instruments showed improved performance over Han's model. For all the voting techniques, the voice reports a high recall due to its distinct spectral characteristic [1].

Effect of data augmentation
For deep learning, the number of training examples is more critical for performance than in the case of hand-crafted features, because the network aims to learn features from the low-level input data [1]. The problem with small datasets is that models trained with them do not generalize well to the validation and test sets [47]. Han's model using Train DB reports a low F1 score of about 0.20 for cello, and the authors suggest that it is due to the insufficient number of training samples [1]. The same experiment, when repeated using Train AugDB and our Mel-spectrogram-Swin-T, showed an improved F1 score, which validates the claim in [1]. The significance of data augmentation in the proposed model can be analyzed from Table 4. While the proposed method of Voting-Swin-T without data augmentation (Train DB) reports micro and macro F1 scores of 0.59 and 0.60, respectively, the metrics improve to 0.66 and 0.62, respectively, with data augmentation. This corresponds to relative improvements of 11.86% and 3.33% over the experiments with Train DB. Similar performance improvement is observed for Han's model and the MTF-SVM and DNN frameworks.
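The relative improvements quoted above follow from the F1 scores with and without augmentation. A short check, with `relative_gain` as an illustrative helper name:

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


micro_gain = relative_gain(0.66, 0.59)   # micro F1, Train AugDB versus Train DB
macro_gain = relative_gain(0.62, 0.60)   # macro F1, Train AugDB versus Train DB
print(f"{micro_gain:.2f}% {macro_gain:.2f}%")
```

The gain is much larger for the micro average than for the macro average.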

Effect of transformer architecture and attention
The instrument-wise F1 scores for all the Melspectrogram experiments are shown in Fig. 5. The model using CNN alone does not show improved performance as expected; this is mainly because of the difficulty in predicting the multiple predominant instruments from the variable-length testing file while training with single predominant fixed-length training files. Only the instruments with distinct spectral characteristics and voice show good performance. On the other hand, experiments with transformer architecture showed improved performance for all the instruments. This is mainly because the transformer architecture with a multi-head attention mechanism helps to focus or attend to specific regions of the visual representation for predominant instruments recognition. Another important point is that it requires very few trainable parameters to learn the model, which  [15]. Compared to self-attention multi-head attention gives the attention layer multiple representation subspaces, and as the image passes through different heads, predictions about the predominant instruments are more refined than employing single head self-attention. In the case of ViT, we have to compute the self-attention for a given patch with all the other patches in an input image. On the other hand, Swin-T with shifted window scheme gives the effect of kernel striding in CNN which along with multi-head attention helps to recognize multiple predominant instruments with linear computational complexity.
We also conducted an ablation study of the architecture to gain a better understanding of the network's behavior. We investigated the performance by changing the number of heads, patch size, projection dimension, and the number of MLP nodes. The results are tabulated in Table 5. The optimal parameters obtained through the Mel-spectrogram analysis are applied to the modgdgram and tempogram architectures through a similar ablation study. To summarize, the results show the potential of the Swin-T architecture and the promise of visual representations other than the conventional Mel-spectrogram for predominant instrument recognition tasks.
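To make the roles of the number of heads and the projection dimension concrete, the following is a minimal NumPy sketch of multi-head self-attention over a sequence of patch embeddings. It is a generic textbook formulation, not the implementation used in this work; the random weights, the sequence length, and the values of `num_heads` and `proj_dim` are assumptions chosen only for illustration.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """tokens: (n, d) patch embeddings; all weight matrices are (d, d)."""
    n, d = tokens.shape
    d_head = d // num_heads

    def split(x):
        # (n, d) -> (num_heads, n, d_head): one subspace per head.
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    attn = softmax(scores, axis=-1)                      # each row sums to 1
    heads = attn @ v                                     # (heads, n, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n, d)      # concatenate heads
    return merged @ w_o, attn


# Hypothetical sizes: 16 patches, projection dimension 32, 4 heads.
rng = np.random.default_rng(0)
n_patches, proj_dim, num_heads = 16, 32, 4
tokens = rng.normal(size=(n_patches, proj_dim))
w_q, w_k, w_v, w_o = (rng.normal(size=(proj_dim, proj_dim)) * 0.1 for _ in range(4))

out, attn = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, attn.shape)
```

Each head attends over the same patches but in its own `proj_dim // num_heads`-dimensional subspace, which is why the two hyperparameters are varied together in an ablation of this kind.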

Effect of voting and ablation study of ensemble
Several studies [48,49] have demonstrated that consolidating information from multiple sources can achieve better performance than uni-modal systems, which motivated us to adopt the ensemble voting method. We also conducted an ablation study of the ensemble to evaluate the contribution of the individual parts of the proposed ensemble classification framework for predominant instrument recognition. Since there are three visual representations, we experimented with the different fusion schemes shown in Table 6, which reports F1 measures for the fusion strategies trained with Train AugDB. Spect, Modgd, and Tempo refer to Mel-spectrogram-Swin-T, Modgdgram-Swin-T, and Tempogram-Swin-T, respectively.
It is important to note that Spect + Modgd and Modgd + Tempo show improvement in macro measures compared to the Mel-spectrogram-based Han's model. This shows the importance of phase information in the proposed task. Conventionally, the spectrum-related features used in instrument recognition take into account only the magnitude information. However, there is often additional information concealed in the phase, which can be beneficial for recognition, as seen in [16]. In the case of the tempogram, Spect + Tempo showed improved performance over Han's model. The advantage of onsets in extracting informative cues for musical instrument recognition is discussed in [50]. Human listeners identify instrument sounds more easily from onset portions than from other portions of the sound. Cemgil et al. [51] define the "tempogram", which induces a probability distribution over the pairs (pulse period, pulse phase) given the onsets. In most automated tempo and beat tracking approaches, the first step is to estimate the positions of note onsets within the music signal. The results of the experiments described in [52] suggest that the presence of onsets is beneficial, in particular for instrument sounds. Since onset detection is the primary step in computing the tempogram, it can provide useful information about predominant instruments. Our experimental results support the claim in [52]. The advantage of voting is that it is unlikely that all classifiers will make the same mistake: as long as every error is made by a minority of the classifiers, an optimal classification can be achieved [53].
Since ensemble soft voting over the three representations gives the best performance, we adopted it as the final scheme. The proposed ensemble frameworks outperform the state-of-the-art Han's model.
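Soft voting of the kind described in this section can be sketched as follows: the class-probability vectors produced by the three single-representation models are averaged, and every class whose averaged probability reaches a threshold is reported as a predominant instrument. This is only a minimal illustration under stated assumptions; the three-class label set, the probability values, and the threshold of 0.25 are hypothetical and are not taken from the experiments.

```python
import numpy as np


def soft_vote(prob_vectors, labels, threshold):
    """Average per-model class probabilities and keep classes above threshold."""
    avg = np.mean(np.stack(prob_vectors), axis=0)
    predicted = [labels[i] for i in np.flatnonzero(avg >= threshold)]
    return avg, predicted


# Hypothetical outputs of the three models for one test excerpt.
labels = ["pia", "voi", "vio"]
spect_probs = np.array([0.60, 0.30, 0.10])  # Mel-spectrogram model
modgd_probs = np.array([0.20, 0.50, 0.30])  # modgdgram model
tempo_probs = np.array([0.40, 0.40, 0.20])  # tempogram model

avg, predicted = soft_vote(
    [spect_probs, modgd_probs, tempo_probs], labels, threshold=0.25
)
print(np.round(avg, 3), predicted)
```

In this toy case the first two models disagree on the most likely class, but averaging keeps both candidates above the threshold, which illustrates how fusion can recover more than one predominant instrument.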

Comparison to existing algorithms
The performance metrics for various algorithms on the IRMAS corpus are reported in Table 7. The number of trainable parameters is also indicated. Bosch et al. [9] modified Fuhrmann's algorithm [2] and used typical hand-made timbral audio features with their frame-wise mean and variance statistics to train SVMs, with a source separation technique called flexible audio source separation framework (FASST) as a preprocessing step. It reports micro and macro F1 scores of 0.50 and 0.43, respectively, and it is evident that the proposed ensemble frameworks outperform the hand-crafted  [10] customized the architecture of Han et al. and introduced two models, namely, single-layer and multi-layer approaches. They used the same aggregation strategy as Han's model, averaging the softmax predictions and selecting the candidates with a threshold of 0.2. Unlike the existing approaches, we estimated the predominant instruments using the entire Mel-spectrogram, without sliding window and aggregation analysis. The better micro and macro measures show that it is possible to predict multiple instruments from the visual representations without any sliding window analysis. Also, our proposed Swin-T for Mel-spectrogram requires approximately one quarter of the trainable parameters of Han's model [54]. In [15], the use of an attention layer was shown to improve classification results on the OpenMIC dataset when applied to a set of Mel-spectrogram features extracted from a pre-trained VGG net. While that work focuses on the Mel-spectrogram, we experimented with the effect of phase and tempo information along with magnitude information. Our proposed ensemble voting technique outperformed existing algorithms and the MTF DNN and SVM frameworks on the IRMAS dataset for both the micro and the macro F1 measures.
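For contrast with the whole-input approach, the sliding-window aggregation described above (averaging window-level softmax predictions and keeping classes above a threshold of 0.2) can be sketched as below. This is a simplified illustration, not a reimplementation of Han's model: the window length, hop size, and random input are assumptions, `toy_classifier` is a hypothetical stand-in for a trained network, and any further normalization used in the original method is omitted.

```python
import numpy as np


def sliding_windows(spec: np.ndarray, win: int, hop: int):
    """Yield fixed-length (n_bands, win) slices of a (n_bands, n_frames) array."""
    for start in range(0, spec.shape[1] - win + 1, hop):
        yield spec[:, start:start + win]


def toy_classifier(window: np.ndarray, n_classes: int = 3) -> np.ndarray:
    """Hypothetical stand-in for a trained model: softmax over band-group energy."""
    groups = np.array_split(window, n_classes, axis=0)
    logits = np.array([g.mean() for g in groups])
    e = np.exp(logits - logits.max())
    return e / e.sum()


def aggregate(spec, classify, win, hop, threshold=0.2):
    """Average window-level softmax outputs, then threshold the mean."""
    probs = np.stack([classify(w) for w in sliding_windows(spec, win, hop)])
    mean_probs = probs.mean(axis=0)
    return mean_probs, np.flatnonzero(mean_probs >= threshold)


# Hypothetical variable-length input: 12 bands, 100 frames.
rng = np.random.default_rng(0)
spec = rng.random((12, 100))

mean_probs, candidates = aggregate(spec, toy_classifier, win=20, hop=10)
print(np.round(mean_probs, 3), candidates)
```

The approach proposed in this work skips the windowing and averaging steps and passes the entire representation of the test file to the network in one shot.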

Conclusion
We presented a transformer-based predominant instrument recognition system using multiple visual representations. Transformer models are used to capture the instrument-specific characteristics and then perform classification. We experimented with Vi-T and the recent Swin-T architectures with a detailed ablation study, and our proposed experiments using Swin-T outperform existing algorithms with far fewer trainable parameters. We introduced an alternative visual representation to the conventionally used Mel-spectrogram. Our study shows that the modgdgram is a visual representation that can be explored in many applications. We believe that optimum parameters may lead to a better visual representation for modified group delay functions. It is worth noting that many recent deep learning schemes in image processing, such as transfer learning, attention mechanisms, and transformers, are transferable to the audio processing domain. Modified group delay functions can be computed directly from the music signal and also from the flattened music spectrum. The resulting representations are known as the direct-modgdgram (or simply "modgdgram") and the source-modgdgram, respectively. The direct-modgdgram emphasizes system information, whereas the source-modgdgram provides information about the multiple sources present in the music signal [55]. The source-modgdgram has been used effectively for melody extraction [56] and multi-pitch estimation [57]. Since we need system information to track the presence of instruments, we employ the direct-modgdgram for the task of instrument recognition.
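For readers unfamiliar with the representation, the sketch below computes a direct-modgdgram from a signal using the commonly cited formulation of the modified group delay function: τ(ω) = (X_R Y_R + X_I Y_I) / S(ω)^(2γ), followed by sign(τ)|τ|^α, where X and Y are the spectra of x(n) and n·x(n) and S is a cepstrally smoothed magnitude spectrum. This is an illustrative sketch only; the values of `alpha`, `gamma`, `lifter`, the frame length, and the hop size are assumptions and are not the settings used in this work.

```python
import numpy as np


def modgd_frame(frame, alpha=0.9, gamma=0.5, lifter=8, eps=1e-10):
    """Modified group delay function of one frame (assumed parameter values)."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)
    Y = np.fft.rfft(n * frame)
    # Cepstral smoothing of |X|: keep only the low-quefrency coefficients.
    cep = np.fft.irfft(np.log(np.abs(X) + eps), n=len(frame))
    keep = np.zeros(len(frame))
    keep[:lifter] = 1.0
    if lifter > 1:
        keep[-(lifter - 1):] = 1.0  # symmetric counterpart of the kept part
    S = np.exp(np.fft.rfft(cep * keep).real)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + eps)
    return np.sign(tau) * np.abs(tau) ** alpha


def modgdgram(signal, frame_len=64, hop=32, **kwargs):
    """Stack frame-wise modified group delay functions into a 2-D array."""
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    cols = [modgd_frame(signal[s:s + frame_len] * window, **kwargs) for s in starts]
    return np.stack(cols, axis=1)  # shape: (frame_len // 2 + 1, n_frames)


# Sanity check: an impulse delayed by 5 samples has a group delay of 5 everywhere.
impulse = np.zeros(64)
impulse[5] = 1.0
gd = modgd_frame(impulse, alpha=1.0, gamma=1.0)

# Hypothetical input: 256 samples of noise standing in for a music excerpt.
rng = np.random.default_rng(0)
image = modgdgram(rng.normal(size=256))
print(gd[:3], image.shape)
```

With `alpha` and `gamma` set to 1 and a flat spectrum, the function reduces to the ordinary group delay, which is what the impulse check exercises.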
The proposed method is evaluated using the IRMAS dataset. As observed in many music information retrieval tasks, data augmentation has also shown its promise in the proposed task. We explored a time-domain strategy of synthetic music generation for data augmentation using WaveGAN. WaveGAN data augmentation is probably a new attempt in predominant instrument recognition. As future work, we would like to focus on synthesizing high-quality audio files using the recent high-fidelity audio synthesis approaches discussed in [58], and to compare the pipeline of traditional audio augmentations used in many tasks [23] with adversarial audio synthesis. The ensemble voting framework outperforms the existing state-of-the-art algorithms and the music texture feature (MTF) DNN and SVM frameworks. The results show the potential of the ensemble voting technique for predominant instrument recognition in polyphonic music.