Comparison of semi-supervised deep learning algorithms for audio classification

In this article, we adapted five recent SSL methods to the task of audio classification. The first two methods, namely Deep Co-Training (DCT) and Mean Teacher (MT), involve two collaborative neural networks. The three other algorithms, called MixMatch (MM), ReMixMatch (RMM), and FixMatch (FM), are single-model methods that rely primarily on data augmentation strategies. Using the Wide-ResNet-28-2 architecture in all our experiments, with 10% of the data labeled and the remaining 90% used as unlabeled data for training, we first compare the error rates of the five methods on three standard benchmark audio datasets: Environmental Sound Classification (ESC-10), UrbanSound8K (UBS8K), and Google Speech Commands (GSC). In all but one case, MM, RMM, and FM outperformed MT and DCT significantly, MM and RMM being the best methods in most experiments. On UBS8K and GSC, MM achieved 18.02% and 3.25% error rate (ER), respectively, outperforming models trained with 100% of the available labeled data, which reached 23.29% and 4.94%, respectively. RMM achieved the best results on ESC-10 (12.00% ER), followed by FM, which reached 13.33%. Second, we explored adding the mixup augmentation, used in MM and RMM, to DCT, MT, and FM. In almost all cases, mixup brought consistent gains. For instance, on GSC, FM reached 4.44% and 3.31% ER without and with mixup, respectively. Our PyTorch code will be made available upon paper acceptance at https://github.com/Labbeti/SSLH.


Introduction
Semi-supervised learning (SSL) aims to reduce the dependency of deep learning systems on labeled data by integrating unlabeled data during the learning phase. It is essential since the creation of a large labeled dataset is expensive, dependent on the task to be learned, and time-consuming. On the contrary, the acquisition of unlabeled data is cheaper and quicker regardless of the task to perform. Using unlabeled data while maintaining high performance can be done in three different ways: i) consistency regularization [1,2], which encourages a model to produce consistent predictions when the input is perturbed, ii) entropy minimization [3][4][5], which encourages the model to output high-confidence predictions on unlabeled files, and iii) standard regularization by using weight decay [6,7], mixup [8] or adversarial examples [9]. The most direct approach for SSL is pseudo-labeling [5], but since then, many new and better approaches have come out, such as Mean Teacher (MT) [10], Deep Co-Training (DCT) [11], MixMatch (MM) [12], ReMixMatch (RMM) [13], and FixMatch (FM) [14].
In previous work [15], we compared MT and DCT for the task of audio tagging (AT), a classification task that consists of automatically assigning an audio event label to an audio recording. Both approaches use two neural networks during training. In the present article, we extend our comparison by adapting to AT the three single-model SSL methods MM, RMM and FM. One difficulty lies in choosing audio data augmentation techniques that work for different types of sound events and spoken words [16]. The augmentations used on images for object recognition, such as flips and rotations, are most often not relevant for audio data. We compare the error rates on three audio datasets with different scopes and sizes: i) Environmental Sound Classification 10 (ESC-10) [17], with audio event categories such as dog barking and helicopter, ii) UrbanSound8k (UBS8K) [18], more specific to urban noises such as car horns, sirens and street music, and iii) Google Speech Commands v2 (GSC) [19], containing spoken words exclusively.
In MM and RMM, a successful data augmentation technique called mixup [8] is used. It consists of mixing pairs of samples, both the data samples and the labels with a random coefficient. We propose to add mixup to the three other SSL approaches, namely MT, DCT and FM, which do not already use it. The results reported in this article will highlight the positive impact of mixup in almost all our experiments.
The article contributions are mainly two-fold: i) the application and comparison of several recent SSL methods for audio tagging on three different datasets, ii) the modification of these methods with the integration of mixup, which resulted in systematic error rate reductions. We shall see that in most cases, MM outperformed the other methods, closely followed by FixMatch+mixup.
The structure of the paper is as follows. Section 2 reviews related work. Section 3 describes the augmentations we used and the mixup mechanism at the core of the present work. Section 4 describes the five SSL methods, Section 5 presents the experimental settings, and finally, Section 6 presents and discusses the results.

Related work
Semi-supervised learning (SSL) is a well-known machine learning setting, for which a lot of research has been conducted, before the rise in popularity of deep learning [20,21]. In this work, we explore recent SSL approaches that were proposed in the framework of deep learning, since we use deep neural networks as state-of-the-art classifiers for audio tagging. These new approaches, as we shall see, were driven by the simplicity of incorporating unsupervised loss terms into the cost functions of neural networks [22].

Semi-supervised deep learning taxonomy
In their SSL survey [22], Van Engelen and colleagues proposed a detailed taxonomy for SSL methods in the framework of deep learning. The algorithms explored in the present article fit in the intrinsically semi-supervised inductive methods category, meaning methods that attempt to construct a classifier by directly optimizing an objective function for labeled and unlabeled samples. Most semi-supervised neural networks make use of perturbation-based learning methods, where the training data samples (labeled or unlabeled or both) are perturbed with data augmentation techniques. This is meant to incorporate the so-called smoothness assumption in SSL, which states that a classifier should be robust to local perturbations in its input. This is the case of the five methods explored in our work: MT, DCT, MM, RMM, and FM. If we follow Van Engelen et al.'s taxonomy, MT is a consistency regularization method, in which the predictions of a teacher model and a student model are penalized when they differ. DCT is described as a pseudo-labeling method, based on the disagreement between two models trained on two different views of the same data. As we shall see in the DCT description, the second view is automatically created by deriving adversarial examples of the original data samples. Finally, MM, RMM and FM are considered hybrid methods, in that they combine pseudo-labeling, consistency regularization and entropy minimization for performance improvement. Entropy minimization refers to methods that artificially lower the uncertainty of the predictions made on the unlabeled data. We will see, for instance, the use of a sharpening function in MM.

Semi-supervised deep learning in audio classification
In the seminal articles in which the five SSL methods were proposed, the experiments were carried out on image classification tasks only, not on audio-related tasks. If we focus on SSL applied to sound event detection (SED), the most used technique in the literature is MT. In particular, the system ranked first in the Detection and Classification of Acoustic Scenes and Events (DCASE) task 4 2018 challenge (Large-scale weakly labeled semi-supervised sound event detection in domestic environments) used MT with convolutional recurrent neural networks trained on a small labeled subset and a larger unlabeled one [23]. Since then, MT has been used in the baseline system provided by the challenge organizers, and in most of the systems proposed by the participants [24,25]. Also in the framework of DCASE Task 4, Shi and colleagues adapted MM for the task [26]. Their MM method outperformed their solution based on MT (see http://dcase.community/challenge2019/task-soundevent-detection-in-domestic-environments-results). SED is a task consisting of segmenting an audio recording into possibly overlapping audio events. It is slightly different from audio classification, the target task of the present work, in which we more simply aim to tag audio recordings globally with a single audio event category per recording. Outside DCASE, MT has been favorably compared to supervised learning in [27] for audio classification. The authors show the importance of using diverse collections of noise as perturbations in MT. They also used mixup successfully, as we will in the present article. Although they used two datasets in common with us (Google Speech Commands and UrbanSound8k), their results cannot be compared to ours because of differences in the evaluation strategies: train/test splits different from the official ones and no cross-validation on UrbanSound8k, and a different number of target classes with Google Speech Commands.
Finally, FM and MT were recently compared on music, industrial sounds, and acoustic scenes classification data sets. FM outperformed MT and supervised learning in all cases [16].

An extension of our previous work
In previous work, we already compared two SSL methods for AT, namely MT and DCT, and we showed that DCT was consistently better than MT [15]. We build on this preliminary work to consider three simpler SSL methods, based on a single neural network instead of two models: MM, RMM and FM. Although some of these SSL methods were applied (in modified forms) to audio data in the context of audio classification before, as we just saw, the present work is among the first ones to compare a number of them in a systematic way.
As we shall see in their technical description, a key aspect of these three "holistic" methods is the extensive use of data augmentation techniques on both the labeled and unlabeled data subsets. In the results that we will report, we used the same augmentation techniques to train our fully-supervised baselines, which gave much stronger baselines than in our previous work [15]. Finally, another novelty of the present work is the addition of the mixup [8] augmentation to the SSL methods MT, DCT and FM.

Audio data augmentation
Augmentations are at the heart of most recent semi-supervised learning mechanisms. In this section, we begin by describing the mixup mechanism, which we extensively use in this work, and then the other audio data augmentations used in some of the training settings.

Mixup
Mixup [8] is a successful data augmentation/regularization technique that mixes pairs of samples (images, audio clips, etc.). If $x_1$ and $x_2$ are two different input samples (spectrograms in our case) and $y_1$, $y_2$ their respective one-hot encoded labels, then the mixed sample and target are obtained by a simple convex combination:
$$x_{mix} = \lambda x_1 + (1 - \lambda) x_2, \qquad y_{mix} = \lambda y_1 + (1 - \lambda) y_2 \tag{1}$$
where $\lambda$ is a scalar sampled from a symmetric Beta distribution at each mini-batch generation:
$$\lambda \sim \mathrm{Beta}(\alpha, \alpha)$$
where $\alpha$ is a real-valued hyper-parameter to tune (always smaller than 1.0 in our case).
In the original MM algorithm, an "asymmetric" version of mixup is used, in which the maximum value between $\lambda$ and $1 - \lambda$ is retained:
$$\lambda' = \max(\lambda, 1 - \lambda)$$
This keeps the mixing coefficient between 0.5 and 1, and close to one in most draws, so that the resulting mixed batches remain closer to $x_1$. This may be useful when the method mixes labeled and unlabeled samples, when only slight perturbations are wanted.
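As an illustration of the two mixup variants described above, here is a minimal sketch in plain Python. It is not the authors' PyTorch code: the function names `sample_lambda` and `mixup` are hypothetical, and samples are represented as flat lists of floats rather than spectrogram tensors.

```python
import random

def sample_lambda(alpha, asymmetric=False, rng=random):
    """Draw the mixing coefficient from a symmetric Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    # Asymmetric variant: keep the larger of lam and 1 - lam so that
    # the first sample stays the dominant component of the mix.
    return max(lam, 1.0 - lam) if asymmetric else lam

def mixup(x1, y1, x2, y2, lam):
    """Convex combination of two samples and of their one-hot labels."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x_mix, y_mix

# Example: mix two toy "spectrograms" with an asymmetric coefficient.
lam = sample_lambda(alpha=0.4, asymmetric=True)
x_mix, y_mix = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam)
```

Because the labels are mixed with the same coefficient as the inputs, the mixed target still sums to one and can be used directly with a cross-entropy loss on soft targets.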

Audio signal augmentation methods
We tested several audio augmentation techniques and retained three of them: Occlusion, CutOut [28], and Speed Perturbation [29]. In addition to the three selected augmentations described below, we also tried to add uniform noise on the log-mel spectrograms, invert the mel frequency axis and the time axis, but no gains were observed with these techniques.
• Occlusion: applied to the raw audio signal, Occlusion consists of setting a segment of the waveform to zero. The size of the segment is randomly chosen up to a user-defined maximum size. The position of the segment is also chosen randomly.
• CutOut: applied to the log-mel spectrograms, CutOut sets the values within a random rectangular area to -80 dB, which corresponds to the silence energy level in our spectrograms. The length and width of the removed sections are randomly chosen from a predefined interval and depend on the spectrogram size.
• Speed Perturbation: we resample the raw audio signal up (nearest-neighbor upsampling) or down (decimation) according to a rate chosen randomly within a predefined interval. The resulting waveform is either shorter or longer. Padding or cropping is randomly applied at the start and the end of the stretched signal to keep the signal duration constant.
The difference between Occlusion and CutOut is that CutOut sets a time-frequency rectangle to the -80 dB value, whereas Occlusion sets to zero a whole portion of the waveform.
We used Occlusion, CutOut and Speed Perturbation in augmented supervised learning settings, and in MM, RMM, and FM. During training, one of them is randomly applied to each audio sample. RMM and FM make use of so-called "weak" and "strong" augmentations. The difference between the two lies in the strength and randomness with which an augmentation is applied. A "weak" augmentation has a 50% chance of being applied, and a "strong" one is always applied.
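The following sketch illustrates Occlusion, CutOut, and the weak/strong application rule on plain Python lists. It is an assumption-laden illustration rather than the authors' implementation: the function names, the argument names (`max_size`, `max_h`, `max_w`) and the way sizes are drawn are hypothetical, and only the application probability (not the augmentation strength) distinguishes "weak" from "strong" here.

```python
import random

SILENCE_DB = -80.0  # silence level of the log-mel spectrograms (from the text)

def occlusion(waveform, max_size, rng=random):
    """Set a randomly placed segment of the waveform to zero."""
    size = rng.randint(0, min(max_size, len(waveform)))
    start = rng.randint(0, len(waveform) - size)
    return waveform[:start] + [0.0] * size + waveform[start + size:]

def cutout(spec, max_h, max_w, rng=random):
    """Fill a random rectangle of a spectrogram (list of rows) with silence."""
    n_rows, n_cols = len(spec), len(spec[0])
    h = rng.randint(1, min(max_h, n_rows))
    w = rng.randint(1, min(max_w, n_cols))
    top = rng.randint(0, n_rows - h)
    left = rng.randint(0, n_cols - w)
    out = [row[:] for row in spec]  # copy so the input is left untouched
    for r in range(top, top + h):
        for c in range(left, left + w):
            out[r][c] = SILENCE_DB
    return out

def apply_augment(x, augment, strength, rng=random):
    """'strong' always applies the augmentation, 'weak' applies it half the time."""
    p = 1.0 if strength == "strong" else 0.5
    return augment(x) if rng.random() < p else x

# Example: weakly occlude a toy waveform of 100 samples.
wave = apply_augment([1.0] * 100, lambda w: occlusion(w, max_size=20), "weak")
```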
In order to tune these augmentations, we performed a grid search on their hyperparameters, training Wide-ResNet-28-2 models on the Google Speech Commands dataset (this architecture and dataset are described hereafter). The resulting hyperparameters are listed in Table 1.
No augmentation was used in DCT nor in MT, except Gaussian noise in MT.

Semi-supervised deep learning algorithms
This section provides a detailed description of the five SSL approaches we compare for audio classification. We chose them for their high performance reported for object recognition in images. Two of these approaches, Mean Teacher (MT) [10] and Deep Co-Training (DCT) [11], use the principle of consistency regularization between the outputs of two models. The other methods, MixMatch (MM) [12], ReMixMatch (RMM) [13], and FixMatch (FM) [14], use a single model and combine the three SSL mechanisms described in the introduction. We provide a figure to illustrate each of the five methods. In Section 4.6, we explain how we add mixup to MT, DCT and FM, since MM and RMM already use it. We included a blue box in the method workflow figures to show where mixup is optionally integrated. We will refer to the modified methods as "method+mixup", for instance, FM+mixup.

Mean Teacher (MT)
MT uses two neural networks: a "student" $f$ and a "teacher" $g$, which share the same architecture. The weights $\omega$ of the student model are updated using the standard gradient descent algorithm, whereas the weights $W$ of the teacher model are the Exponential Moving Average (EMA) of the student weights. The teacher weights are computed at every mini-batch iteration $t$, as the convex combination of its weights at $t - 1$ and the student weights, with a smoothing constant $\alpha_{ema}$:
$$W_t = \alpha_{ema} W_{t-1} + (1 - \alpha_{ema}) \, \omega_t$$
Figure 1: MT workflow. Both models receive as input labeled $x_s$ and unlabeled files $x_u$. A supervised loss $L_s$ is computed between the ground truth and the student model predictions, whereas a consistency cost $L_{cc}$ is computed between the student and teacher model predictions.
There are two loss functions, applied either on the labeled or unlabeled data subsets. On the labeled data $x_s$, the usual cross-entropy (CE) is used between the student model's predictions and the ground truth $y_s$:
$$L_s = \mathrm{CE}(f(x_s), y_s)$$
The consistency cost is computed from the student predictions $f(x_s)$ and $f(x_u)$, and from the teacher predictions $g(x'_s)$ and $g(x'_u)$, where $x'_s$ and $x'_u$ correspond to the same samples but slightly perturbed with Gaussian noise with a 15 dB signal-to-noise ratio [24]. This cost is a Mean Square Error (MSE) loss:
$$L_{cc} = \mathrm{MSE}\big(f(x), \perp g(x')\big), \quad x \in \{x_s, x_u\}$$
The symbol $\perp$ denotes the stop gradient operator, meaning that the teacher weights $W_t$ are a constant with respect to optimization. The final loss function is the sum of the supervised loss function and the consistency cost weighted by a factor $\lambda_{cc}$ which controls its influence:
$$L = L_s + \lambda_{cc} L_{cc}$$
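To make the Mean Teacher mechanics concrete, here is a minimal sketch of the EMA teacher update and of the total loss, written with plain Python lists instead of PyTorch tensors. The function names are hypothetical, and the default `alpha_ema=0.999` is the value reported later in the training configuration; the stop-gradient is only noted in a comment since no automatic differentiation is involved here.

```python
def ema_update(teacher_w, student_w, alpha_ema=0.999):
    """Teacher weights: convex combination of their previous value and the student weights."""
    return [alpha_ema * w + (1.0 - alpha_ema) * s
            for w, s in zip(teacher_w, student_w)]

def mse(p, q):
    """Mean squared error between two prediction vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)

def mean_teacher_loss(sup_loss, student_pred, teacher_pred, lambda_cc):
    """Supervised loss plus the weighted consistency cost.

    The teacher prediction is treated as a constant target (stop-gradient)."""
    return sup_loss + lambda_cc * mse(student_pred, teacher_pred)

# Example: one EMA step on a two-parameter toy model.
teacher = ema_update(teacher_w=[1.0, 2.0], student_w=[0.0, 0.0], alpha_ema=0.9)
```

With a smoothing constant close to one, the teacher changes slowly and acts as a temporal ensemble of recent student states.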

Deep Co-Training (DCT)
DCT has been recently proposed by Qiao et al. [11]. It is based on Co-Training (CT), the well-known generic framework for SSL proposed by Blum and colleagues in 1998 [30]. The main idea of Co-Training is based on the assumption that two independent views on a training dataset are available to train two models separately. Ideally, the two views are conditionally independent given the class. The two models are then used to make predictions on the unlabeled data subset. The most confident predictions are selected and added to the labeled subset. This process is iterative, like pseudo-labeling.
DCT is an adaptation of CT in the context of deep learning. Instead of relying on views of the data that are different, DCT makes use of adversarial examples to ensure the independence in the "view" presented to the models. Each batch is composed of a supervised and an unsupervised part. Thus, the unlabeled data are directly used, and the iterative aspect of the algorithm is removed.
Let S and U be the subsets of labeled and unlabeled data, respectively, and let f and g be the two neural networks that are expected to collaborate.
The DCT loss function is comprised of three terms, as shown in Eq. (8). These terms correspond to loss functions estimated either on S, U, or both. Note that during training, a mini-batch is comprised of labeled and unlabeled samples in a fixed proportion. Furthermore, in a given mini-batch, the labeled examples given to each of the two models are sampled independently.
$$L = L_s + \lambda_{cot} L_{cot} + \lambda_{diff} L_{diff} \tag{8}$$
The first term, $L_s$, given in Eq. (9), corresponds to the standard supervised classification loss function for the two models $f$ and $g$, estimated on examples $x_1$ and $x_2$ respectively, which are sampled from S with their ground-truth labels $y_1$ and $y_2$.
$$L_s = \mathrm{CE}(f(x_1), y_1) + \mathrm{CE}(g(x_2), y_2) \tag{9}$$
In our case, we use categorical Cross-Entropy (CE), the standard loss function used in classification tasks with mutually exclusive classes.
As in MT, a consistency cost on the unlabeled examples is used in DCT. It takes the form of the Jensen-Shannon (JS) divergence between the two sets of predictions on examples $x_u$ sampled from the unlabeled subset U, given by:
$$L_{cot} = H\left(\tfrac{1}{2}\big(f(x_u) + g(x_u)\big)\right) - \tfrac{1}{2}\big(H(f(x_u)) + H(g(x_u))\big)$$
where $H$ denotes the entropy.
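The Jensen-Shannon consistency term can be sketched as follows for a single pair of prediction vectors. This is an illustrative sketch with hypothetical function names, using the standard definition of the JS divergence as the entropy of the average prediction minus the average of the two entropies.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(v * math.log(v) for v in p if v > 0.0)

def js_divergence(p, q):
    """Entropy of the mean prediction minus the mean of the two entropies."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))

# Identical predictions give 0; fully disagreeing one-hot predictions give log(2).
agree = js_divergence([0.5, 0.5], [0.5, 0.5])
disagree = js_divergence([1.0, 0.0], [0.0, 1.0])
```

The term is therefore zero when both models agree on an unlabeled example and grows as their predictions diverge.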
For DCT to work, the two models need to be complementary: on a subset different from S ∪ U, examples misclassified by one model should be correctly classified by the other model [31]. In DCT, this is achieved by generating adversarial examples with one model and training the other model to be robust to these adversarial samples. To generate adversarial examples, we used the Fast Gradient Sign Method (FGSM, [32]), as in Qiao's work. The $L_{diff}$ loss term (Eq. (11)) is the sum of two Cross-Entropy losses. The first one is computed between the predictions $f(x_3)$ and $g(x^f_3)$, where $x_3$ is sampled from S ∪ U and $x^f_3$ is the adversarial example generated with the model $f$ from $x_3$ taken as input. The second term is the symmetric term for model $g$, with $x_4$ sampled from S ∪ U and $x^g_4$ the adversarial example generated with $g$ from $x_4$.
$$L_{diff} = \mathrm{CE}\big(g(x^f_3), f(x_3)\big) + \mathrm{CE}\big(f(x^g_4), g(x_4)\big) \tag{11}$$
For more in-depth details on the technical aspects of DCT, the reader may refer to [11]. We implemented DCT as precisely as described in Qiao's article, using PyTorch, and made sure to accurately reproduce their results on CIFAR-10: about 90% accuracy when using only 10% of the training data as labeled data (5000 images).

MixMatch
MixMatch [12] (MM) uses entropy minimization and standard regularization, namely pseudo-labeling [5], mixup, and weak data augmentation, to leverage the unlabeled data and provide better generalization capabilities. Unlike MT and DCT, this approach uses only one model. The different steps are shown in Fig. 3 and detailed in the following paragraphs.
During the learning phase, each mini-batch is composed of labeled $x_s$ and unlabeled $x_u$ samples in equal proportions. The first step consists of applying an augmentation to the labeled part of the mini-batch and K augmentations to the unlabeled part in parallel. These K augmentations are sampled from the three augmentations (weak) described in Section 3. In the second step, pseudo-labels $\hat{y}_u$ are generated for the unlabeled files using the model's prediction averaged on these K variants as shown in Eq. (12), where $x_{u,i}$ denotes the i-th augmented variant of an unlabeled file.
$$\hat{y}_u = \frac{1}{K} \sum_{i=1}^{K} f(x_{u,i}) \tag{12}$$
To encourage the model to produce confident predictions, a post-processing step is necessary to decrease the output's entropy. To do so, the highest probability is increased and the other ones decreased. This process is called "sharpening" by the method authors, and it is defined as:
$$\mathrm{sharpen}(p, T)_i = \frac{p_i^{1/T}}{\sum_{j} p_j^{1/T}} \tag{13}$$
The sharpen function is applied to the pseudo-labels $p = \hat{y}_u$. The parameter $T$, called temperature, controls the strength of the sharpen function. When $T$ tends towards zero, the entropy of the distribution produced is lowered.
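The pseudo-labeling and sharpening steps can be illustrated with a short sketch on plain Python lists. The function names are hypothetical; the sharpening rule raises each probability to the power 1/T and renormalizes, following the MixMatch definition.

```python
def average_predictions(preds):
    """Average the model predictions obtained on the K augmented variants."""
    k = len(preds)
    return [sum(col) / k for col in zip(*preds)]

def sharpen(p, temperature):
    """Raise each probability to the power 1/T, then renormalize to sum to one."""
    powered = [v ** (1.0 / temperature) for v in p]
    total = sum(powered)
    return [v / total for v in powered]

# Example with K = 2 variants and T = 0.5 (the values used in the experiments).
y_hat = average_predictions([[0.6, 0.4], [0.8, 0.2]])   # about [0.7, 0.3]
y_sharp = sharpen(y_hat, temperature=0.5)               # about [0.845, 0.155]
```

With T = 1 the distribution is left unchanged, while smaller temperatures push the pseudo-label towards a one-hot vector.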
Finally, the labeled and unlabeled augmented samples are concatenated and shuffled into a set W, which is then used as a pool of training samples by the asymmetric mixup function. Asymmetric mixup is applied separately on the labeled and unlabeled parts of the mini-batch, as formulated here:
$$x'_{s,i} = \mathrm{mixup}(x_{s,i}, W_i), \quad i \in \{1, \dots, B_s\}$$
$$x'_{u,i} = \mathrm{mixup}(x_{u,i}, W_{i + B_s}), \quad i \in \{1, \dots, |W| - B_s\}$$
where $B_s$ and $|W|$ are the number of labeled samples and the size of the whole W set. The W set and the corresponding labels are shuffled in the same order. Each labeled sample is then perturbed by a second labeled or unlabeled sample. Mixing the two is done so that the original labeled sample remains the main component of the resulting sample. The operation has been detailed in Section 3.1. The same procedure is applied to the unlabeled files using the remaining samples from W.
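The original MixMatch loss function is composed of the standard CE cost for the supervised loss $L_s$, and the MSE for the unsupervised loss $L_u$. We replace MSE with CE in all our experiments, as proposed in the ReMixMatch paper, since CE performed better than MSE in our experiments.
$$L_s = \frac{1}{B_s} \sum_{i=1}^{B_s} \mathrm{CE}\big(f(x'_{s,i}), y'_{s,i}\big)$$
$$L_u = \frac{1}{B_u} \sum_{i=1}^{B_u} \mathrm{CE}\big(f(x'_{u,i}), \hat{y}'_{u,i}\big) \tag{17}$$
where $B_s$ and $B_u$ are the number of examples in the labeled and unlabeled mini-batches, and the primes denote samples and targets after mixup. The final loss is the sum of the two components, with a hyper-parameter $\lambda_u$:
$$L = L_s + \lambda_u L_u$$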

ReMixMatch (RMM)
ReMixMatch (RMM) [13] was presented as an improvement of MixMatch and introduced the concept of strong and weak augmentations and a so-called distribution alignment mechanism.
At every iteration, the batch is composed of labeled $x_s$ and unlabeled $x_u$ samples. One weak augmentation and K strong augmentations are applied on $x_u$. The weakly-augmented sample is used to compute the pseudo-label vectors $\hat{y}_u$ of the unlabeled examples. A distribution alignment mechanism modifies the pseudo-labels to make them follow the class distribution of the labeled subset. Two "distributions" $p_s$ and $p_u$ are estimated in the form of vectors, which are respectively the averages of the true labels $y_s$ and of the pseudo-labels $\hat{y}_u$, calculated over the samples of the N previous batches. Then, distribution alignment is applied to $\hat{y}_u$ with this equation:
$$\hat{y}_u \leftarrow \mathrm{Normalize}\left(\hat{y}_u \times \frac{p_s}{p_u}\right)$$
where the product and ratio are element-wise and Normalize rescales the vector so that it sums to one. Finally, we apply the sharpen function from Eq. (13) to the pseudo-labels $\hat{y}_u$, as done in MixMatch. The labels $\hat{y}_u$ will be used as targets for the weakly and strongly augmented batches. As in MixMatch, we concatenate the labeled and unlabeled batches into a set W for the mixup augmentation, and the labeled and unlabeled losses $L_s$ and $L_u$ remain the same. ReMixMatch also introduced a strong-augmentation loss component for increasing stability and accuracy. This component is computed with the first strongly-augmented version of $x_u$, called $x_{u1}$:
$$L_{u1} = \mathrm{CE}\big(f(x_{u1}), \hat{y}_u\big)$$
In the original ReMixMatch, the authors added another loss term, a self-supervised learning component that predicts which transformation is applied to the $x_{u1}$ batch. The transformation used was a rotation of 0, 90, 180, or 270 degrees, and the model had to guess which angle the image had been rotated by (a four-class classification task). In some configurations, it was supposed to help the model avoid collapsing during training. We removed this component because it did not show any positive impact in our experiments, and using rotations or flips on audio spectrograms is difficult to justify in terms of audio semantics.
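The distribution alignment step described above can be sketched as follows. This is a minimal illustration with a hypothetical function name; the small constant `eps` is an assumption added here to avoid division by zero and is not mentioned in the text.

```python
def distribution_alignment(y_hat, p_s, p_u, eps=1e-8):
    """Rescale a pseudo-label by p_s / p_u (element-wise), then renormalize.

    y_hat: pseudo-label vector of one unlabeled example
    p_s:   running average of the true labels
    p_u:   running average of the pseudo-labels
    eps:   assumed numerical safeguard, not part of the original method description
    """
    scaled = [y * s / (u + eps) for y, s, u in zip(y_hat, p_s, p_u)]
    total = sum(scaled)
    return [v / total for v in scaled]

# Example: the model over-predicts class 0 (p_u) while labels are balanced (p_s),
# so the aligned pseudo-label shifts probability mass towards class 1.
aligned = distribution_alignment([0.5, 0.5], p_s=[0.5, 0.5], p_u=[0.8, 0.2])
```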
In our experiments, the final loss is the sum of the three different components:
$$L = L_s + \lambda_u L_u + \lambda_{u1} L_{u1}$$

FixMatch
FixMatch [14] (FM) is another SSL method which proposes a simplification of MM and ReMixMatch. The method also uses one model, removes mixup and replaces the sharpen function by binary pseudo-labels. FM uses both weak augmentations (weak) and strong augmentations (strong). The strong augmentations can mislead the model predictions by distorting the training data too much. Figure 5 shows the main pipeline of FixMatch. As in the other method illustrations, we added a mixup box in blue, to indicate where we add it to the algorithm in our modified FM algorithm, thus called FM+mixup.
The supervised loss component is the standard cross-entropy applied to the weakly-augmented data:
$$L_s = \mathrm{CE}\big(f(\mathrm{weak}(x_s)), y_s\big)$$
Then, we guess the labels of the weakly augmented unlabeled data and apply a binarization (argmax) of these predictions to obtain a one-hot encoded label. This label is used as the target for training the model with strongly augmented unlabeled data. It allows the model to generalize with weak and strong augmentations, and it also uses the guessed label to improve the model accuracy with unlabeled data:
$$\hat{y}_u = \mathrm{onehot}\big(\arg\max f(\mathrm{weak}(x_u))\big), \qquad L_u = \mathrm{CE}\big(f(\mathrm{strong}(x_u)), \hat{y}_u\big)$$
To avoid training on incorrect guessed labels, FM uses a threshold $\tau$ that ensures that the unsupervised cost function can only be applied to predictions made with high confidence, i.e., above this threshold. This can be easily implemented in the form of a mask:
$$\mathrm{mask} = \mathbb{1}\big[\max f(\mathrm{weak}(x_u)) > \tau\big]$$
As in MixMatch, we sum the loss components to compute the final loss:
$$L = L_s + \lambda_u \, \mathrm{mask} \cdot L_u$$

Adding mixup to MT, DCT and FM
As described above, MM and RMM already use mixup in their workflows. In order to measure the impact of mixup, we will report results when we remove mixup from MM and RMM. In contrast, the three other SSL methods explored in our work (MT, DCT, FM) do not use mixup in their original version. We explored several ways to add mixup to them, and retained the best one for each of the three methods. Note that we illustrate where the mixup operation has been added in the figures describing the different methods in the previous section.
Since the labeled and unlabeled data flow is very similar in MM and FM, we added mixup to FM at the same place as in MM: both labeled and unlabeled samples are mixed up. Similarly, it is also the asymmetric mixup variant that we used in MM and FM since mixup is applied to labeled and unlabeled samples together, as in the original MM method. Using mixup on labeled and unlabeled examples separately seems to hurt performance with these two methods.
In MT, mixup is applied on labeled and unlabeled samples separately and only for the teacher model. The perturbation with Gaussian noise applied to the unlabeled samples is removed, since no gain was observed when mixup is used instead.
For DCT, mixup is applied on the unlabeled samples only, which are common to both models in each mini-batch during training. Applying mixup on the labeled samples, which are sampled differently for the two models at each training step, led to worse results. It is therefore not necessary to use the asymmetric variant for MT and DCT.
Finally, in all cases, we apply mixup on the log-mel spectrograms, which are the input features given to our deep neural networks (feature extraction is detailed in the experiment section).

Experiments
In this section, we describe our experimental setup. We give a brief description of the datasets and metrics, describe the Wide ResNet architecture we used, together with the training strategy details.

Datasets and evaluation metrics
Environmental Sound Classification 10 (ESC-10) [17] is a selection of 400 five-second-long recordings of audio events separated into ten balanced categories. The dataset is provided with five uniformly sized cross-validation folds that will be used to perform the evaluation. The files are sampled at 44.1 kHz and are converted into 431 × 64 log-mel spectrograms.
UrbanSound8k (UBS8K) [18] is a dataset composed of 8732 files between 1 and 4 seconds long, separated into ten categories. The dataset is provided with ten cross-validation folds of uniform size that will be used to perform the evaluation. The files are zero-padded to 4 seconds, resampled to 22.05 kHz, and converted to 173 × 64 log-mel spectrograms.
Google Speech Commands Dataset v2 (GSC) [19] is an audio dataset of spoken words designed to evaluate keyword spotting systems. The dataset is split into 85511 training files, 10102 validation files, and 4890 testing files. The latter subset is used for the evaluation of our systems. We ran the task of classifying the 35 word categories of this dataset. The files are zero-padded to 1 second if needed and sampled at 16 kHz before being converted into 32 × 64 log-mel spectrograms.
In all cases, the 64 mel-coefficients were extracted using a window size of 2048 samples and a hop length of 512 samples. For ESC-10 and UBS8K, we used the official cross-validation folds. We report the average classification Error Rate (ER) along with standard deviations. ER is defined as the number of errors divided by the total number of samples.
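The spectrogram sizes quoted above can be checked with a short calculation, and the error rate is a one-line function. The sketch below assumes a centered short-time analysis in which the number of frames is 1 + floor(n_samples / hop_length), as is the default in common audio libraries, and assumes that the rounded sampling rates of ESC-10 and UBS8K stand for 44.1 kHz and 22.05 kHz; both assumptions are ours, and the function names are hypothetical.

```python
def n_frames(duration_s, sample_rate, hop_length=512):
    """Number of spectrogram frames for a centered analysis (assumed framing)."""
    n_samples = int(round(duration_s * sample_rate))
    return 1 + n_samples // hop_length

def error_rate(predictions, targets):
    """Number of misclassified samples divided by the total number of samples."""
    errors = sum(p != t for p, t in zip(predictions, targets))
    return errors / len(targets)

esc10_frames = n_frames(5, 44100)   # 431 frames for a 5 s clip
ubs8k_frames = n_frames(4, 22050)   # 173 frames for a 4 s clip
gsc_frames = n_frames(1, 16000)     # 32 frames for a 1 s clip
er = error_rate([0, 1, 2, 2], [0, 1, 1, 2])  # one error out of four: 0.25
```

Under these assumptions, the frame counts match the 431, 173 and 32 time steps reported for the three datasets.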

Models
We used the Wide-ResNet-28-2 [33] architecture, which reaches SOTA performance on the three datasets when trained in a 100% supervised setting. Moreover, its small size, about 1.4 million parameters, allows for quick experimentation. Its structure consists of an initial convolutional layer (conv1) followed by three groups of residual blocks (block1, block2, and block3). Finally, an average pooling and a linear layer act as a classifier. The residual blocks, composed of two BasicBlock, are repeated three times and their structure is defined in Eq. (27). The number of channels of the convolution layers is referred to as l, BN stands for Batch Normalization and ReLU [34] for the Rectified Linear Unit activation function. We used the official implementation available in PyTorch [35].

Training configurations
Each model was trained using the Adam [36] optimizer. Table 3 shows the hyper-parameter values used for each method, such as the learning rate lr, the mini-batch size bs, the warmup length wl if used, and the number of epochs e. These parameters are identical regardless of the dataset used, unless otherwise specified. They were obtained by performing a reasonably short grid search on the first validation fold of the UBS8K dataset.
For supervised training, MM and FM, the learning rate remains constant throughout training. For MT and DCT, the learning rate is weighted by a descending cosine rule, which is a function of the learning epoch $t$ and of the number of epochs $N_e$. All the SSL approaches except FixMatch introduce one or more subsidiary terms to the loss. To alleviate their impact at the beginning of the training, these terms are weighted by a factor $\lambda$, which ramps up to its maximum value within a warmup length wl. The ramp-up strategy is defined in Eq. (29) for MT and DCT, and is linear in MM during the first 16k learning iterations.
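As an illustration of the linear ramp-up used for MM, the sketch below increases a loss weight from zero to its maximum over a warmup length of 16k iterations. Only the linear schedule is sketched; the ramp-up used for MT and DCT follows a different rule that is not reproduced here. The function and argument names are hypothetical.

```python
def linear_rampup(iteration, max_value=1.0, warmup_len=16000):
    """Weight growing linearly from 0 to max_value over warmup_len iterations."""
    if warmup_len <= 0:
        return max_value
    progress = min(1.0, max(0.0, iteration / warmup_len))
    return max_value * progress

# Weight of the unsupervised MM loss term at a few training iterations.
weights = [linear_rampup(it) for it in (0, 8000, 16000, 20000)]
```

The unsupervised term thus has no influence at the very start of training, when pseudo-labels are least reliable, and reaches its full weight once the warmup is over.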
In MT, the maximum value of λ cc is 1 and α ema is set to 0.999. In DCT, the maximum values of λ cot and λ diff are 1 and 0.5, respectively. In MM the maximum value of λ u is 1. FM and RMM do not use a ramp up strategy. In FM, the value of λ u is set to 1 and in RMM the values of λ u , λ u1 and λ r are set to 1.5, 0.5 and 0.5, respectively.
In MM and RMM, we use two augmentations (K = 2), and the sharpening temperature T is set to 0.5. In FM, we use a threshold τ = 0.8 on the ESC-10 and GSC datasets, and τ = 0.95 for UBS8K. In RMM, the number of labels N kept for distribution alignment is set to 128.
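To make the role of the FixMatch threshold τ concrete, here is a minimal sketch of the masked unsupervised loss on plain Python lists. It assumes that the loss is averaged over the whole unlabeled batch, with masked samples contributing zero, as in the original FixMatch formulation; the function name and the clamping constant are hypothetical.

```python
import math

def fixmatch_unsup_loss(weak_preds, strong_preds, tau=0.95):
    """Cross-entropy on strongly augmented inputs, against argmax pseudo-labels
    taken from weakly augmented inputs, kept only when confidence exceeds tau."""
    total = 0.0
    for weak, strong in zip(weak_preds, strong_preds):
        confidence = max(weak)
        label = weak.index(confidence)  # argmax gives the one-hot pseudo-label
        if confidence > tau:            # confidence mask
            total += -math.log(max(strong[label], 1e-12))
    return total / len(weak_preds)      # masked samples count as zero loss

# Example: only the first sample is confident enough to pass tau = 0.95.
weak = [[0.97, 0.03], [0.60, 0.40]]
strong = [[0.50, 0.50], [0.10, 0.90]]
loss = fixmatch_unsup_loss(weak, strong, tau=0.95)
```

A lower threshold such as τ = 0.8 lets more unlabeled samples contribute to the loss, at the cost of more noisy pseudo-labels.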
For MM, FM and RMM, on ESC-10, the batch size is 60 because ESC-10 is a small dataset of 400 files only. During training, only four folds are used, that is, 320 files. In the 10% configuration, and because the number of labeled files per class is rounded down by integer division (3 of the 32 training files available per class), this represents only 30 supervised files in total. Each mini-batch must contain as many labeled as unlabeled files, hence the batch size of 60. Moreover, because of this small number of files, the training phase only lasts for 2700 iterations, and therefore, warm-up ends prematurely.
For our proposed variants, which include mixup, we kept the same configurations and parameter values.

Results
We first report the results obtained in a supervised setting, with and without the data augmentation methods used in the SSL algorithms, including mixup.
We then compare the error rates obtained by the five SSL methods and show that adding mixup is beneficial in almost all cases.

Supervised learning
This section presents the results obtained with supervised learning in different settings, using either 10% or 100% of the labeled data available. MM, RMM and FM use augmentations as their core mechanism. RMM and FM use weak and strong augmentations, while MM uses a combination of weak augmentations and mixup. Therefore, for fair comparisons, it seems essential to use the same augmentations in the supervised settings too. We trained models without any augmentation (Supervised), using mixup alone (mixup), weak augmentations alone (Weak), a combination of weak augmentations and mixup (Weak+mixup), strong augmentations alone (Strong), and finally a combination of strong augmentations with mixup (Strong+mixup). Table 4 presents the results on ESC-10, UBS8K, and GSC. To give an idea of how our results compare to the literature, we report three results from the literature in the "CNN models (literature)" row of the table. We chose to report results from works in which the models are primarily based on a CNN architecture, for a fair comparison with the Wide-ResNet we used. There are better results in the recent literature, but they involve large transformer models, sometimes pretrained on AudioSet. For instance, the state-of-the-art result on UBS8K is 10.0% ER, obtained with a 25M-parameter transformer pretrained on AudioSet [40].
ESC-10. In the 10% setting, the supervised model reached an ER of 32.00%. The use of Weak yielded the best performance with 22.67% ER, outperforming the supervised model by 9.33 points (29.16% relative). In the 100% setting, the supervised model reached an ER of 8.00%, and the best ER of 4.67% was achieved when using Weak+mixup. The gain is 3.33 points (41.62% relative).
UBS8K. In a 10% setting, the supervised model reached 33.80% ER, and the best supervised result was obtained with Weak+mixup, with a 23.75% ER. It represents an improvement of 10.05 points, 29.73% relative improvement. In the 100% setting, the same augmentation combination reached an ER of 17.96%, outperforming the 23.29% ER from the supervised model by 5.33 points, 22.88% relative improvement.
GSC. In a 10% setting, the supervised model reached 10.01% ER, and Weak+mixup yielded the best ER of 6.58%. It represents an improvement of 3.43 points, 34.26% relative improvement. In the 100% setting, Strong+mixup reached an ER of 2.98%, outperforming the 4.94% ER from the supervised model by 1.96 points, 39.68% relative improvement.
Overall, we observe that in a supervised setting, the combination of mixup with a weak or a strong augmentation is systematically better than using a single augmentation, except on the ESC-10 dataset.
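The absolute and relative gains reported in this section follow from a simple computation, sketched below with a hypothetical helper: the absolute gain is the difference between the two error rates in points, and the relative gain is this difference divided by the baseline error rate.

```python
def er_gain(baseline_er, new_er):
    """Return (absolute gain in points, relative gain in %) between two error rates."""
    absolute = baseline_er - new_er
    relative = 100.0 * absolute / baseline_er
    return round(absolute, 2), round(relative, 2)


# UBS8K, 10% setting: Supervised (33.80% ER) versus Weak+mixup (23.75% ER).
ubs8k_gain = er_gain(33.80, 23.75)
# GSC, 100% setting: Supervised (4.94% ER) versus Strong+mixup (2.98% ER).
gsc_gain = er_gain(4.94, 2.98)
```

These two calls reproduce the 10.05 points (29.73% relative) and 1.96 points (39.68% relative) quoted above for UBS8K and GSC.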

Semi-supervised learning
We report in Table 5 the results of the SSL methods. Since mixup is already used in the original MM and RMM methods, we compare MM to MM without mixup (MM-mixup) and RMM to RMM without mixup (RMM-mixup). For the three other methods, we denote by FM+mixup, for instance, the FM algorithm augmented with mixup.
On all three datasets, the five SSL methods decreased the ER compared to the 10% supervised learning setup without augmentation. Only MM, RMM, and FM performed better than the best supervised training result, which used the weak augmentations. Furthermore, they also significantly outperformed MT and DCT in all but one case (DCT is better than RMM on UBS8K), showing that single-model SSL methods are more effective than two-model methods, at least on these three datasets and among the five methods that were compared.
For ESC-10, in the 10% setting, the lowest ER was achieved by RMM with 12.00%, compared to 22.67% for weakly augmented supervised training. The GSC dataset results confirm the previous observations. MM is the best method with an ER of 3.25%, representing a gain of 6.76 points (67.53% relative) or 3.33 points (50.61% relative) compared to supervised training without augmentation and with Weak+mixup augmentations, respectively. RMM and FM+mixup obtained results very similar to MM: 3.54% and 3.31% ER, respectively.

Impact of mixup
Given that the best SSL methods so far were MM and RMM, and that mixup is used in these approaches, we decided to add mixup to MT, DCT, and FM, in a different way for each method, as explained in Section 4.6. In [14], Appendix D.2, mixup on the inputs (not on the labels) was added to FM, removing all the other image augmentations. In this setting, FM was shown to reach an accuracy very close to that of MM on CIFAR-10.
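For readers unfamiliar with mixup applied to the inputs only, the following minimal sketch mixes two input vectors with a coefficient drawn from a Beta distribution. The value α = 0.4, the use of max(λ, 1 - λ) to keep the mixed input closer to the first example (so that its label remains meaningful when labels are not mixed), and the function name are assumptions made for illustration; they do not necessarily match the configurations described in Section 4.6.

```python
import random


def mixup_inputs(x1, x2, alpha=0.4, rng=random):
    """Mix two inputs of equal length; labels are left untouched.

    Returns the mixed input and the mixing coefficient lam.
    """
    lam = rng.betavariate(alpha, alpha)
    # Assumption: keep the mixed input closer to x1 than to x2.
    lam = max(lam, 1.0 - lam)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    return mixed, lam


# Example on two toy "spectrogram" vectors.
random.seed(0)
mixed, lam = mixup_inputs([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
```

When the labels are mixed as well, the same coefficient lam is applied to the one-hot (or pseudo-label) vectors of the two examples.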
In Table 5, we report the results obtained when adding mixup to MT, DCT and FM (MT+mixup, DCT+mixup, FM+mixup). We also give the ER obtained when removing mixup from MM and RMM, in the rows named MM-mixup and RMM-mixup.
First, MM and RMM always perform worse without mixup. For instance, with MM on UBS8K, the ER increased from 18.02% to 20.42% when mixup was removed. This is particularly visible with RMM on ESC-10 and UBS8K. Moreover, adding mixup to the other SSL methods brought performance improvements in almost all cases. The only counter-example observed is FM on ESC-10, which went from 13.33% to 14.67% ER. The standard deviation also increased significantly, from 2.89% to 7.21%.
In contrast, FM on UBS8K went from 21.44% ER without mixup to 18.24% with mixup. On GSC, RMM showed the largest gap, going from 5.15% ER without mixup to 3.54% with mixup.
It is also important to note that using mixup made it possible to reach ER values very close to those obtained with fully supervised training (100% setting) using augmentations, on UBS8K and GSC. This is observable with MM, RMM, and FM+mixup. For instance, compared to Weak+mixup 100% supervised, MM shows a difference of only 0.06 points on UBS8K and 0.27 points on GSC.
When we look at our supervised training performance, we can observe that an improvement does not systematically follow the use of weak or strong augmentations. However, when combined with mixup, ER is frequently improved. This can be partly explained by the fact that audio augmentations are often difficult to choose and that their impact is often dependent on the dataset and the task at hand [27]. With this in mind, mixup seems to be beneficial regardless of the dataset used.

Training time
The mean normalized training durations of the five methods are shown in Fig. 6. The values were computed on the three datasets from the total duration d, the number of folds in the dataset N_f, the number of epochs N_e, and the batch size bs used in each method. We compute the mean for each of the three datasets, then report the average of the three values. Finally, we use the supervised 100% execution time as the reference (training duration of one). We also assessed the impact of adding mixup, but it was negligible, about 0.5%.
Among the SSL approaches, the fastest one is MT, which has a training time 4.5 times longer than fully supervised training. FM and MM follow, with training times about 6 times longer. DCT, with its high complexity and use of adversarial data, took up to 7.6 times longer, and finally the slowest of all is RMM, 11.6 times longer, due to the large number of augmentations involved.
Figure 6: Normalized mean training duration for all methods without mixup.

Discussion
Why are MM, RMM and FM better than MT and DCT?
This question remains open. Several key components may explain this gap in performance. First, data augmentation (both weak and strong) is extensively used in these methods, on the labeled data and on the unlabeled subset, to satisfy the consistency criterion of SSL. No data augmentation is used in the basic DCT and MT methods, except the addition of noise in MT on the unlabeled subset at the input of the teacher model. Nevertheless, when mixup was added to MT, no significant gain was observed. Thus, other augmentations should be explored for MT. Second, MM, RMM and FM use pseudo-labeling, with either explicit entropy minimization (sharpening in MM and RMM) or threshold-based selection (confidence masking in FM).
In DCT and MT, no entropy minimization is used: the predictions on the unlabeled part of the data are used as is in a consistency criterion between the two collaborating networks.

Which augmentations?
We used three augmentations (besides mixup): Occlusion, CutOut and Speed Perturbation. An advantage of these is that they are task-agnostic. We tuned their hyperparameters once on GSC, and then used them as is on ESC-10 and UBS8K, where they brought performance improvements. More audio-specific augmentations remain to be explored. For instance, we did not try pitch shifting or dynamic range compression [41]. Those would need careful parameter tuning depending on the audio event types and on the dataset involved in the experiments. Finally, Occlusion and CutOut could be replaced by SpecAugment [42], originally proposed for automatic speech recognition and now very often used in audio processing tasks such as audio tagging. There are two small differences, though, in using SpecAugment. First, it drops out one or several vertical and horizontal stripes from the spectrograms, while CutOut drops out a single rectangle of random shape. Second, we randomly applied either Occlusion or CutOut, but not a combination of the two. To evaluate the effect of SpecAugment, we ran supervised learning experiments on GSC with 100% of the labeled training data, using Speed Perturbation, mixup, and SpecAugment instead of Occlusion and CutOut. We tested several configurations for SpecAugment. Our best setting was zero, one or two frequency stripes of width between 0 and 7 bins, and zero or one time stripe of width also between 0 and 7 bins. This setting led to a 2.51% ER, which is better than the 2.98% ER of our best supervised baseline. This confirms experimentally, at least on GSC in the supervised setting, that SpecAugment could replace Occlusion and CutOut, acting as a combination of the two. We did not rerun all the SSL experiments with SpecAugment, but one might expect slightly better results than those obtained with Occlusion and CutOut.
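The best SpecAugment setting described above can be sketched as follows, on a spectrogram stored as a list of frequency rows. This is a simplified illustration and not the implementation used in our experiments: the function name, the choice of masking with zeros, and the uniform sampling of the stripe count, width and position are assumptions.

```python
import random


def spec_augment(spec, max_freq_stripes=2, max_time_stripes=1, max_width=7, rng=random):
    """Mask random frequency and time stripes of a spectrogram.

    spec is a list of rows indexed as spec[frequency_bin][time_frame].
    A masked copy is returned; the input is left unchanged.
    """
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    # Zero, one or two horizontal (frequency) stripes of width 0 to 7 bins.
    for _ in range(rng.randint(0, max_freq_stripes)):
        width = rng.randint(0, min(max_width, n_freq))
        start = rng.randint(0, n_freq - width)
        for f in range(start, start + width):
            out[f] = [0.0] * n_time
    # Zero or one vertical (time) stripe of width 0 to 7 frames.
    for _ in range(rng.randint(0, max_time_stripes)):
        width = rng.randint(0, min(max_width, n_time))
        start = rng.randint(0, n_time - width)
        for row in out:
            for t in range(start, start + width):
                row[t] = 0.0
    return out


# Example on a toy spectrogram of 16 frequency bins and 20 time frames.
random.seed(0)
toy_spec = [[1.0] * 20 for _ in range(16)]
augmented = spec_augment(toy_spec)
```

Unlike CutOut, which removes a single rectangle, the stripes span the full time axis or the full frequency axis of the spectrogram.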

Conclusions
In this article, we reported audio classification experiments in a semi-supervised setting on three standard datasets of different sizes and content: the very small ESC-10 with generic audio events, UrbanSound8K with urban noises, and Google Speech Commands with speech. We used only 10% of the labeled training data samples and the remaining 90% as unlabeled samples. We adapted and compared five SSL algorithms for this task: two methods that use two neural networks in parallel, Mean Teacher and Deep Co-Training, and three single-model methods, MixMatch, ReMixMatch and FixMatch, which strongly rely on data augmentation.
All five methods brought significant gains compared to supervised training without augmentation using 10% of labeled data. On UBS8K, MixMatch and FixMatch with mixup were very close to fully supervised learning with augmentation (100% of labeled training data). On ESC-10, ReMixMatch reached the best Error Rate of 12.00%. The relative gains were 62% and 47%, when compared to supervised training using 10% of labeled data, without and with augmentation, respectively. On UrbanSound8K, MixMatch obtained the best results, reaching 18.02% Error Rate. Compared to 10% supervised training without and with augmentation, the respective relative improvements were 47% and 24%. On Google Speech Commands, MixMatch again reached the best Error Rate of 3.25%. The relative improvements were 68% and 51%, compared to 10% supervised training without and with augmentation, respectively. Mixup is an efficient regularization technique that is at the heart of the MixMatch and ReMixMatch algorithms. Its consistent impact in MM and RMM encouraged us to add it to the other SSL approaches. In almost all the experiments, adding mixup brought consistent improvements, which allowed us to get closer to the best supervised learning settings using 100% of the labeled data available. For instance, adding mixup to FixMatch reduced the error rate from 21.44% to 18.24% on UrbanSound8K, and from 4.44% to 3.31% on Google Speech Commands, to be compared with 17.96% and 2.98% respectively, obtained in the best supervised learning settings.
In conclusion, if we were to recommend a method out of the ones tested in our work, we would recommend MixMatch and FixMatch+mixup, which perform very similarly. Their good results are consistent across the three datasets. The gains brought by these methods are worth their training time, about six times that of the 100% supervised setting.
ReMixMatch obtained the best results on ESC-10, but this method is more demanding in training time.
Many questions remain open, though. The fact that MM and RMM were slightly better than FM needs to be further investigated, in particular the use of audio augmentations different in nature for the weak and the strong ones may be a direction to explore. MT and DCT do not use augmentations in their original version. It would be interesting, though, to try the weak augmentations used in the holistic methods with them.
We also plan to adapt the SSL methods to multilabel audio tagging, for instance on AudioSet [43] or FSD50K [44]. In particular, we would have to adapt the sharpening operation in MixMatch and the thresholding operation in FixMatch. Finally, new SSL methods have been proposed very recently and could be added to our list, such as Unsupervised Data Augmentation (UDA) [45] and the recent Meta Pseudo Labels method [46].