Deep semantic learning for acoustic scene classification

Acoustic scene classification (ASC) is the process of identifying the acoustic environment or scene in which an audio signal is recorded. In this work, we propose an encoder-decoder-based approach to ASC, which is borrowed from SegNet in image semantic segmentation tasks. We also propose a novel feature normalization method named Mixup Normalization, which combines channel-wise instance normalization and the Mixup method to learn information that is useful for scene classification and discard device-specific information. In addition, we propose an event extraction block, which extracts accurate semantic segmentation regions from the segmentation network, to imitate the effect of image segmentation on audio features. With four data augmentation techniques, our best single system achieved an average accuracy of 71.26% across different devices on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 ASC Task 1A dataset. This result is a margin of at least 17% over the DCASE 2020 challenge Task 1A baseline system. The system has lower complexity and higher performance than other state-of-the-art CNN models, without using any supplementary data other than the official challenge dataset.


Introduction
"Acoustic scene" is the concept that humans commonly use to identify a particular acoustic environment. The task of sensing and understanding the environment in which a sound is detected is known as acoustic scene classification [1]. It aims to categorize the detected sound into one of several predefined classes, such as park, airport, or bus. In recent years, methods using CNNs have been widely studied, in which the spectrum of the acoustic scene is used as an image input so that best-practice image classification methods can be applied [2][3][4]. However, there are still many issues to address.
First, classification accuracy is low for similar audio scenes [5], such as airports and shopping malls. Both are indoor places with many people and contain many similar sounds, such as conversation, broadcasting, and people moving about. The only difference is that the airport contains the roar of aircraft engines. However, the engine sound can be much weaker inside the airport, and it is hard to recognize when it is weak all the time. Therefore, if the deep learning approach cannot learn the features that distinguish similar scenes, it cannot recognize them correctly, because a high proportion of their content is similar.
Second, generalization to unknown devices is poor [5]. Because the microphones in different recording equipment have different filtering properties, recording quality varies across equipment. When there are few recording devices, the network learns the characteristics of the equipment, which makes the model parameters overfit to the known equipment. However, practical applications involve many kinds of recording equipment. If the impact of the equipment cannot be eliminated, the model will not be widely usable.
Finally, current network structures are very complex. Among the top three systems in DCASE 2020 challenge Task 1A, the smallest model is 39 MB and the largest is 130 MB [5]. Although larger models can achieve higher accuracy, their hardware requirements are also higher, so they cannot be used on lightweight hardware.
To solve the above problems, we aim to develop a low-complexity CNN system that improves recognition performance on unseen devices. We propose a concept of semantic segmentation for acoustic scene classification with multiple devices. Drawing on the SegNet [6] networks used for image semantic segmentation, we propose Audio-SegNet networks for audio semantic segmentation, which extend our previous work [7]. To reduce the number of parameters and simplify Audio-SegNet as much as possible, we delete some layers of the original SegNet network. Whereas the original model uses 26 conventional convolution layers, our proposed network has only 6. Moreover, we change the convolution kernel size from 3 × 3 to 2 × 3 to further reduce the number of parameters, which extends our previously proposed Mini-SegNet architecture [8].
We then propose a novel feature normalization method, which we term Mixup Normalization. It can learn useful information about the scene and discard unnecessary device-specific information. This normalization layer is added to the first and the last convolution layers. Compared with batch normalization (BN) [9], our normalization layer can greatly improve the convergence speed and ensure independence between features [10].
In addition, we propose a new module, which we term the event extraction block. This module is added to the last layer of the decoder to obtain the semantic segmentation area and thereby improve the prediction of similar audio scenes.
Our main contributions are summarized as follows: 1) We propose an audio semantic segmentation system with as low a complexity as possible, without using any model compression method. 2) We propose a new event extraction block module to improve the recognition performance on similar audio scenes.
The rest of the paper is organized as follows. In Section 2, we introduce the development history of ASC, describe some acoustic scene classification methods and their existing problems, and present the main idea of the proposed system. In Section 3, we present the proposed ASC systems, including the encoder-decoder architectures, event extraction block, data augmentation, and Mixup Normalization. In Section 4, we describe the database and the experimental setup, and then explain and analyze the experimental results obtained with our system. Finally, a summary and conclusion are presented in Section 5.

Previous works
The first Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, held in 2013 [11], was organized by the IEEE Audio and Acoustic Signal Processing (AASP) Technical Committee. It released open, established datasets and provided a scenario in which to evaluate and benchmark different approaches to acoustic scene analysis.
In the ASC tasks of DCASE 2013, 2016, and 2017, the audio data come from a single kind of high-quality recording equipment. To study acoustic scene classification more widely, DCASE 2018 [12] and 2019 [13] introduced a mismatch task with different recording devices A, B, C, and D. In DCASE 2020 [14], ASC challenge Task 1A was Acoustic Scene Classification with Multiple Devices. This task includes 10 classes of sounds recorded on multiple devices. The dataset contains a fair number of examples from a high-quality device (referred to as A), as well as a limited number from the targeted low-quality devices (referred to as B and C) and simulated devices (referred to as S1-S6). The gap in the amount and quality of the recorded data causes overfitting in the classification results. In particular, a part of the evaluation set is a compressed version of audio recorded with device D and simulated devices S7-S11. This not only brings ASC closer to real-world conditions but also presents a huge challenge.
In the early period, researchers studied acoustic characteristics such as Zero Crossing Rate [15], Perceptual Linear Prediction [16], and Mel Frequency Cepstral Coefficients (MFCC) [17] for the classification of acoustic scenes. In recent years, the most commonly selected features are the Constant-Q transform (CQT) [18] and the log-mel spectrogram [19]. However, there seems to be no general consensus on which features are best. Recently, Helin et al. [20] proposed several spectrogram processing fusion strategies to obtain more discriminative information for ASC, including log-Mel spectrogram, CQT, Gamma, and MFCC.
After that, more and more CNN-based classifiers have been designed [2-5, 21, 22]. In [23], the authors presented a localized (small) kernel CNN layer. Sequence correction and local spectral-temporal information are used in parallel networks of CNN and LSTM [24]. Phaye et al. [25] developed a CNN architecture based on sub-spectrograms. McDonnell and Gao [26] proposed a two-path residual network that explicitly divides the high and low frequencies of the spectrum into two parallel pathways within the same network.
In real life, environmental sound is mostly collected by different recording devices. Therefore, many data augmentation methods are used to reduce the impact of device characteristics, such as SpecAugment++ [27], GAN [28,29], Mixup, temporal crop, spectrum correction, pitch shift, speed change, adding random noise, and mixing audios [30]. Meanwhile, in order to get better recognition results, network structures have become more complex. For example, although ResNet [31,32] and FCNN [30] currently achieve state-of-the-art recognition accuracy, they have several drawbacks. Their network structures are highly complex, and they use more data augmentation methods. Larger models may require better hardware for training and fine-tuning, such as a graphics processing unit (GPU). In addition, hardware resources are limited in many real-world applications, such as smart wearable devices, Bluetooth earphones, and smartphones. Therefore, large models may also face deployment issues on computationally limited platforms [33], so it is hard to use complex networks widely. In DCASE 2021 Task 1A [22], researchers need not only to solve the generalization problem that some devices appear only in the evaluation dataset but also to meet a model complexity limit of 128 KB for the non-zero parameters. Therefore, some methods [34][35][36] are used to reduce the complexity of the model, such as pruning, quantization, and knowledge distillation. At the same time, Yang et al. [37] proposed a novel neural model compression strategy called Acoustic Lottery. Specifically, they use the Lottery Ticket Hypothesis [38] to find a sub-network with a small number of non-zero model parameters inside an advanced neural network. However, this method only reduces the number of non-zero parameters; the total number of parameters does not decrease unless a sparse representation is adopted.
Although the previous methods have greatly improved performance, there are still many basic problems worth exploring, such as confusion between scenes that are similar over time, the difficulty of developing high-performance systems in the presence of overlapping sound events, and the lack of distinguishing commonalities between different scene categories. In the classification of acoustic scenes recorded with different devices in particular, inconsistent audio quality remains a problem. Semantic segmentation networks, which classify well in image recognition, can address these issues by effectively distinguishing acoustic segments in different scenes [39]. Examples of such networks include Fully Convolutional Networks (FCN) [40], SegNet [6], U-net [41], and DeepLab [42][43][44]. For audio classification tasks, encoder-decoder network-based methods have been successfully applied to music source separation [45,46]. For instance, Liu et al. [47] used the U-net network with a self-attention method to separate voice and accompaniment in music. In their self-attention subnets, the same musical patterns can be reconstructed to achieve better source separation performance. Moreover, Huang et al. [48] proposed an RNN-based encoder-decoder framework for pitch tracking, in which the encoder part, acting as the pitch extractor, can then be applied to a downstream Mandarin tone classification task. Based on these points, we believe that acoustic scenes are composed of basic units (acoustic events) that contain certain semantic information. Therefore, we propose audio semantic segmentation with an event extraction block and Mixup Normalization for acoustic scene classification.

Network architecture
This section introduces an efficient model design for acoustic scene classification with multiple devices. It also describes the details of the processing flow and the model architecture.
The diagram of the ASC classifiers used in our proposed SegNet approach is illustrated in Fig. 1. The motivation of our method is to extract fine-grained features from acoustic events with a convolutional encoder-decoder. Our system consists of two important stages. First, mono audio signals are converted to time-frequency representations with zero-mean and unit-variance normalization. Second, the log-mel feature is fed to Mini-SegNet models for feature learning. The output layer includes a dense layer of K classes and a softmax function for classification.

Proposed Mini-SegNet system
In the realm of ASC, CNNs have become the preferred method [34,49] for classifying log-Mel spectrograms [5,22]. Specifically, a 2D time-frequency representation is first extracted from a given audio clip. The neural network then performs feature extraction and dimensionality reduction through operations such as convolution and pooling [50], resulting in a deep representation.
We think that an acoustic scene is composed of basic units (acoustic events), just as language governs the syntax of phonemes and words. For example, bird chirping is recorded in a park, and the sound of aircraft engines is recorded in an airport. Bird chirping and aircraft engine noise are what we call acoustic events. These acoustic events contain semantic information that has an internal relationship with the discrimination of acoustic scenes. Therefore, we propose audio semantic segmentation with an event extraction block for acoustic scene classification. In the field of image segmentation, the SegNet network has achieved encouraging results [6]. This is primarily because max pooling and subsampling reduce the feature map resolution, and multiscale feature mapping is used to improve segmentation performance. CNN-based models have been widely used to encode complicated scene utterances into high-level semantic representations [50]. SegNet arises from the need to map low-resolution features back to the input resolution for pixel-wise classification [6]. Inspired by the SegNet network, the encoder-decoder network Audio-SegNet and the event extraction block are designed to capture the temporal and spatial information of an audio feature for acoustic scene classification.
The main idea of this paper is to use an encoder-decoder architecture to learn the acoustic scene for precise semantic mapping. Therefore, we verify the idea of Audio-SegNet, which uses pooling indices to inform the upsampling layers and extracts acoustic features from the pooling layers in the encoding process. This makes it easier for the decoder to obtain a precise semantic segmentation in frequency. This paper conducts a more in-depth study based on our previous work on Mini-SegNet [8].
In our work, we proposed the Audio-SegNet to extract the multi-granularity abstract features, as shown in Fig. 1.
In the encoder module, convolution and pooling are used to extract features and reduce dimensions. In the decoding process, the position and frequency band information are recovered by upsampling (Upsampling2D) followed by convolution, mirroring the corresponding encoder module, to make up for the missing pixel information. This method makes full use of the semantic information of sound events in the acoustic scene through the encoding and decoding process and uses the rule of "acoustic scene based on sound events" to provide a preliminary basis for future work.
The details of Mini-SegNet are shown in Fig. 2. In this network, we use simpler and smaller convolution/up-convolution and max pooling/upsampling operations. In image segmentation, better performance can be achieved by using only the information from the last feature map [6], but in our case this does not perform satisfactorily. ASC is an overall classification task and, unlike Image-SegNet [6], does not need to predict a label for each spatial output. However, we can obtain a more refined semantic segmentation through upsampling and then obtain an accurate proportion of events through the event extraction block, as shown in Fig. 2. We therefore add either a global average pooling layer or an event extraction block after the decoder of the network. The high-dimensional feature representation at the output of the final decoder is fed to a trainable softmax classifier. In Image-SegNet, this softmax classifies each pixel independently [6]. To realize ASC, we modify it in Audio-SegNet so that the softmax classifier outputs probabilities over the K acoustic scene classes.
The design of Audio-SegNet represents the first technical contribution to this task. To analyze the performance of Audio-SegNet, we constructed several Audio-SegNet networks with different convolution network depths and convolution kernel sizes to classify [...]. As shown in Fig. 2, Mini-SegNet is a simple encoder-decoder architecture, mainly composed of encoder and decoder modules. First, considering the amount of data, we reduce the number of network layers to make the best use of deep learning. Second, we change the original 3 × 3 convolution kernel to 2 × 3, which gives better performance in our experiments. The numbers of feature maps in the encoder are 64 and 128, and the numbers of feature maps in the decoder are 128 and 64. The encoder module consists of two Conv blocks. Conv_block 1 contains a 2D convolution layer with a kernel size of 2 × 3 and 64 filters, followed by a normalization, a ReLU non-linearity, and a max pooling layer with a pool size of 2 × 3. Conv_block 2 contains a 2D convolution layer with a kernel size of 2 × 3 and 128 filters, followed by a batch normalization and a ReLU non-linearity. After that, the corresponding convolution, batch normalization, and activation are performed again. In the max pooling layer, the key part of the feature is retained and other weak features are discarded. For each sample, the indices of the max locations computed during pooling are stored and passed to the decoder.
The output of the encoder is taken as the input of the decoder module. A decoder upsamples its input using the pooling indices transferred from its encoder to produce a sparse feature map. It then performs convolution with a trainable filter bank to densify the feature map. The final decoder output feature maps are fed to a softmax classifier for classification. The decoder module is similar to the encoder and consists of two DeConv blocks. In each block, 2D upsampling with a size of 2 × 3 is performed first, followed by convolutional layers, normalization, and ReLU. Finally, the global average pooling layer or the event extraction block is used. Two dense layers with dropout output the final prediction.
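The index-based upsampling described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names are hypothetical, it handles a single 2D feature map with non-overlapping 2 × 3 windows, and it leaves out the trainable convolution that densifies the sparse map afterwards.

```python
import numpy as np

def maxpool_with_indices(x, ph=2, pw=3):
    """Non-overlapping max pooling on a 2D map.

    Returns the pooled map and, for each window, the flat position
    (0 .. ph*pw-1) of its maximum. Names and layout are illustrative.
    """
    f, t = x.shape
    assert f % ph == 0 and t % pw == 0, "map must tile exactly into windows"
    # Rearrange into (F/ph, T/pw, ph*pw): one row of values per pooling window.
    win = (x.reshape(f // ph, ph, t // pw, pw)
             .transpose(0, 2, 1, 3)
             .reshape(f // ph, t // pw, ph * pw))
    idx = win.argmax(axis=-1)
    pooled = np.take_along_axis(win, idx[..., None], axis=-1)[..., 0]
    return pooled, idx

def unpool_with_indices(pooled, idx, ph=2, pw=3):
    """Place each pooled value back at its stored position; zeros elsewhere."""
    fo, to = pooled.shape
    win = np.zeros((fo, to, ph * pw), dtype=pooled.dtype)
    np.put_along_axis(win, idx[..., None], pooled[..., None], axis=-1)
    # Undo the window rearrangement to recover the (F, T) layout.
    return (win.reshape(fo, to, ph, pw)
               .transpose(0, 2, 1, 3)
               .reshape(fo * ph, to * pw))

x = np.arange(24, dtype=float).reshape(4, 6)
pooled, idx = maxpool_with_indices(x)
sparse = unpool_with_indices(pooled, idx)
```

The unpooled map is sparse by construction: only one position per 2 × 3 window is non-zero, which is why the decoder follows it with a convolution.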

Event extraction block
In this work, we designed an event extractor and applied it to the Mini-SegNet network structure, as shown in Fig. 2. In a semantic segmentation network, the semantics of a point are represented by the channel that holds the maximum value at that position across the channel layers [6]. Different channel layers therefore represent different semantic regions. We believe that the different semantic segmentation regions in the whole feature map represent different events. Suppose $x \in \mathbb{R}^{N \times F \times T \times C}$ is the input feature, where N, F, T, and C represent the batch size, frequency dimension, time dimension, and number of channels, respectively.
We first obtain the semantic segmentation tensor $S \in \mathbb{Z}^{N \times F \times T}$, which holds the indices of the maximum values of the input feature along the channel axis $c \in \{0, 1, \ldots, C-1\}$:

$$S = \arg\max_{c} x \tag{1}$$

Then, we extract the top-k semantic segmentation regions:

$$Y = \operatorname{mode}_k(S) \tag{2}$$

where $\operatorname{mode}_k(\cdot)$ is the top-k mode, i.e., the k values that appear most often in S, and $Y \in \{0, 1, \ldots, C-1\}^{N \times k}$, where N and C represent the batch size and the number of channels, respectively. In our experiments, the best choice of k is 4. We also find that a normalized Y performs better, so we normalize the values of Y by the number of channels so that there is no large deviation in subsequent learning:

$$\tilde{Y} = \frac{Y}{C} \tag{3}$$
Second, the output feature of the final decoder will be fed to the global average pooling layer at the same time.Then, the output feature of the global average pooling layer along the channel axis is concatenated with Ỹ as the output tensor of the event extraction block.The tensor has shape (N , C + k).
Finally, the event extraction block output will be fed to a trainable softmax classifier which consists of an affine transformation followed by the softmax function.
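The steps of the event extraction block (channel-wise argmax, top-k mode, normalization by the number of channels, and concatenation with the global average pooling output) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name is hypothetical, ties between equally frequent channels are broken in favour of the lower channel index, and channels that never win the argmax can still be returned if fewer than k distinct channels occur.

```python
import numpy as np

def event_extraction_block(x, k=4):
    """x: decoder output of shape (N, F, T, C). Returns an (N, C + k) array."""
    n, f, t, c = x.shape
    # Semantic segmentation map: index of the strongest channel at each (f, t).
    seg = x.argmax(axis=-1)                                  # (N, F, T)
    y = np.empty((n, k), dtype=np.int64)
    for i in range(n):
        counts = np.bincount(seg[i].ravel(), minlength=c)    # votes per channel
        # Stable sort on negated counts: most frequent first,
        # ties resolved by lower channel index (an assumption).
        y[i] = np.argsort(-counts, kind="stable")[:k]
    y_norm = y / c                                           # scale indices to [0, 1)
    gap = x.mean(axis=(1, 2))                                # global average pooling, (N, C)
    return np.concatenate([gap, y_norm], axis=1)             # (N, C + k)

# Tiny example: channel 5 wins at three positions, channel 2 at two, channel 0 at one.
x = np.zeros((1, 2, 3, 6))
x[0, 0, :, 5] = 1.0
x[0, 1, :2, 2] = 1.0
out = event_extraction_block(x, k=3)
```

In the example, the last three entries of `out[0]` are the normalized indices of the three most frequent channels, appended to the six pooled channel averages.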
We think that the audio scene is also composed of a variety of audio events, so obtaining the audio events in the audio scene through event extraction block can be used to distinguish similar audio scenes.

Data augmentation
Data augmentation is an efficient way to avoid overfitting and enhance the model's generalization in deep neural networks [51]. We use mixup, ImageDataGenerator, SpecAugment, and cropping for data augmentation. We do not use any additional data, and we train the model from scratch. In our work, data augmentation does improve performance, and we make a detailed comparison in the next section.
Mixup [51] is performed at the mini-batch level: two data batches, along with their corresponding labels, are randomly mixed in each training step. Mixup creates a new training sample by mixing a pair of training samples. It generates a new training sample $(x, y)$ from the data and label pairs $(x_1, y_1)$ and $(x_2, y_2)$ by Eq. (4):

$$x = \lambda x_1 + (1 - \lambda) x_2, \qquad y = \lambda y_1 + (1 - \lambda) y_2 \tag{4}$$

Here, $\lambda \in [0, 1]$ is sampled from the beta distribution $\mathrm{Beta}(\alpha, \alpha)$, and $\alpha$ is a hyperparameter. A characteristic of mixup is that, besides the data $x_1$ and $x_2$, the labels $y_1$ and $y_2$ are mixed as well.
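A minimal NumPy sketch of batch-level mixup is shown here for illustration. It assumes one mixing coefficient per batch and pairs each sample with a randomly permuted copy of the same batch, which is one common way to realize "two data batches randomly mixed"; the function name and these choices are assumptions, not details confirmed by the paper.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.3, rng=None):
    """Mix a batch with a shuffled copy of itself.

    x: (N, ...) features, y: (N, K) one-hot labels.
    Returns mixed features, mixed (soft) labels, and the sampled coefficient.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))        # partner sample for each item
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 5, 3))
labels = np.eye(10)[rng.integers(0, 10, size=8)]
x_mix, y_mix, lam = mixup_batch(features, labels, alpha=0.3, rng=rng)
```

Because both labels are mixed with the same coefficient, each mixed label row still sums to 1 and can be used directly with a cross-entropy loss.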
In addition, we tried ImageDataGenerator [52] in this task. It is an image generator mainly used in image classification. It can also augment the data in batches, expand the size of the dataset, and enhance the generalization ability of the model. In our work, it is applied with width shift and height shift.
We additionally used crop augmentation [26] along the temporal axis: each of the two samples combined using mixup was first cropped independently and randomly. Then, we applied SpecAugment [53] at the mini-batch level. For a batch of data in the training step, each feature map is randomly masked along both the time and frequency axes.
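The temporal crop and the time/frequency masking can be sketched as below. The sketch assumes features of shape (F, T, C), one time mask and two frequency masks with maximum widths of 80 and 30 (the values reported in the experiment setup), and zero as the mask value; the function names and the zero fill are assumptions.

```python
import numpy as np

def random_time_crop(x, length, rng):
    """Cut a random window of `length` frames from x with shape (F, T, C)."""
    start = rng.integers(0, x.shape[1] - length + 1)
    return x[:, start:start + length, :]

def spec_mask(x, rng, n_time=1, n_freq=2, max_t=80, max_f=30):
    """Zero out random time and frequency bands on a copy of x."""
    out = x.copy()
    n_bins, n_frames, _ = out.shape
    for _ in range(n_time):
        width = rng.integers(0, max_t + 1)            # mask width in frames
        t0 = rng.integers(0, n_frames - width + 1)
        out[:, t0:t0 + width, :] = 0.0
    for _ in range(n_freq):
        width = rng.integers(0, max_f + 1)            # mask width in mel bins
        f0 = rng.integers(0, n_bins - width + 1)
        out[f0:f0 + width, :, :] = 0.0
    return out

rng = np.random.default_rng(0)
feat = np.ones((256, 423, 3))
cropped = random_time_crop(feat, 400, rng)
masked = spec_mask(cropped, rng)
```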

Mixup Normalization
We found that instance normalization (IN) performs well in image style transfer [54]. Its function is equivalent to unifying different pictures into one style. In short, IN can learn domain differences from the channel mean and variance in the image domain for better domain style transfer [55,56]. As the 2D visualization in Fig. 3 shows, the differences between the audio devices (A to S3) in the official dataset are revealed along different dimensions of the concatenated means and standard deviations of the Mini-SegNet encoder output. So we use instance normalization to obtain device-generalized audio features in the channel dimension, as below:

$$\mathrm{IN}(x) = \frac{x - \mu_{nc}}{\sigma_{nc} + \epsilon} \tag{5}$$

where

$$\mu_{nc} = \frac{1}{FT} \sum_{f=1}^{F} \sum_{t=1}^{T} x_{nftc} \tag{6}$$

$$\sigma_{nc} = \sqrt{\frac{1}{FT} \sum_{f=1}^{F} \sum_{t=1}^{T} \left(x_{nftc} - \mu_{nc}\right)^2} \tag{7}$$

Here, $\mu_{nc}, \sigma_{nc} \in \mathbb{R}^{N \times C}$ are the mean and standard deviation of the input feature $x \in \mathbb{R}^{N \times F \times T \times C}$, where N, F, T, and C represent the batch size, frequency dimension, time dimension, and number of channels, respectively. $\epsilon$ is a small number added to $\sigma$ to avoid division by zero.
As far as we know, direct use of IN learns only the style information and loses information that is useful for classification. To compensate for the lost classification information and reduce the influence of excessive IN, we add a hyperparameter inspired by the Mixup method, which balances the weights of the two sides. This normalization method, named Mixup Normalization (MixupNorm), is given in Eq. (8).
We apply MixupNorm to the first encoder layer and the last decoder layer in Fig. 2. There are a total of two MixupNorm modules in the Mini-SegNet network.
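The MixupNorm equation itself is not legible in this copy of the text. One form consistent with the description (a Mixup-style convex combination that reduces to plain IN when the hyperparameter is 0) would be $\lambda x + (1 - \lambda)\,\mathrm{IN}(x)$. The sketch below implements that assumed form in NumPy; the symbol `lam`, the function names, and the placement of `eps` are assumptions rather than the authors' definition.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each (sample, channel) slice of x, shape (N, F, T, C),
    using its own mean and standard deviation over frequency and time."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    return (x - mu) / (sigma + eps)

def mixup_norm(x, lam=0.1, eps=1e-5):
    """Assumed form: blend the raw feature with its instance-normalized version.
    lam = 0 gives plain instance normalization; lam = 1 returns x unchanged."""
    return lam * x + (1.0 - lam) * instance_norm(x, eps)

rng = np.random.default_rng(0)
feat = rng.normal(3.0, 2.0, size=(2, 8, 10, 4))
pure_in = mixup_norm(feat, lam=0.0)
blended = mixup_norm(feat, lam=0.1)
```

With a small `lam`, most device-dependent channel statistics are removed while a fraction of the original feature is retained, which matches the stated motivation of keeping classification-relevant information.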

Dataset
To evaluate our system, we use the Task 1A acoustic scene classification data from the official TAU Urban Acoustic Scenes 2020 Mobile development dataset [14]. The dataset consists of 10 acoustic scenes: airport, bus, metro, metro_station, park, public_square, shopping_mall, street_pedestrian, street_traffic, and tram. The development set contains data from 10 cities and 9 devices: 3 real devices (A, B, C) and 6 simulated devices (S1-S6). Most of the experimental data were collected with the high-quality recording device A. The other devices are commonly available consumer devices: device B is a Samsung Galaxy S7, device C is an iPhone SE, and device [...]. The development dataset comprises 40 h of data from device A and smaller amounts from the other devices. Audio is provided in single-channel 44.1 kHz 24-bit format and was split into 10-s segments that are provided as individual files. The organizer of the challenge provides basic metadata for a train/test split consisting of 13,965 samples in the training set and 2970 samples in the test set. As shown in Table 1, some devices (S4, S5, S6) appear only in the test subset, so the device-specific information of S4-S6 cannot be learned in training.

Experiment setup
We train our models on a GPU with a batch size of 64, using stochastic gradient descent with a momentum of 0.9 as the optimizer. We use a warm restart learning rate schedule [57]: the learning rate returns to its maximum value of 0.1 after 11, 31, 71, 151, and 311 epochs and then decays according to a cosine pattern to 1 × 10^−5. We transformed the audio data into a power spectrogram using a 2048-sample Hanning window with a hop of 1024 samples. Each 10-s audio file yielded a spectrum of 431 frames, and each spectrum was compressed into 256 mel-frequency bins. Additionally, deltas and delta-deltas were calculated from the log-mel spectrogram and stacked along the channel axis. The number of frames of the input feature was cropped to the length of the delta-delta channel, so that the final shape becomes [256 × 423 × 3]. Each network was trained for 310 epochs. During the training stage, different data augmentation methods are used for Mini-SegNet, with the following parameters: Mixup with α = 0.3, ImageDataGenerator with width_shift_range = 0.6 and height_shift_range = 3, and SpecAugment with one temporal mask and two frequency masks, with mask parameters of 80 and 30, respectively. The input data is randomly cropped to a fixed length along the time axis: in our experiments, the input data of size [256 × 423 × 3] was cropped into a [256 × 400 × 3] input feature map. We used MixupNorm with its hyperparameter set to 0.1 and to 0 for comparison on unseen devices.
For the train-test split, we adopt the officially recommended way to split the development material. There are 13,965 training audio clips and 2970 test audio clips. The training set includes audio from devices A, B, C, and S1-S3. The test set covers data from those six devices and extra data from the unseen devices S4, S5, and S6. We applied data augmentation to increase the diversity of the data distribution. The augmented data was generated in real time during training from each mini-batch of 64 samples. Experiments show that this method can improve the accuracy of acoustic scene classification.
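The warm restart schedule can be sketched as a plain function of the epoch number. This is one possible reading of the description, with assumptions stated in the code: epochs are counted from 1, the first cycle starts at epoch 1, each restart epoch listed in the text begins a new cosine cycle, and the cosine decays from 0.1 towards 1 × 10^−5 within each cycle. The names are hypothetical.

```python
import math

# 1-indexed epochs at which the learning rate is reset to its maximum.
# The leading 1 (start of training) is an assumption; the rest come from the text.
RESTARTS = (1, 11, 31, 71, 151, 311)

def warm_restart_lr(epoch, lr_max=0.1, lr_min=1e-5, restarts=RESTARTS):
    """Cosine-annealed learning rate with warm restarts for a 1-indexed epoch."""
    for start, end in zip(restarts[:-1], restarts[1:]):
        if start <= epoch < end:
            progress = (epoch - start) / (end - start)   # 0 at restart, -> 1 at cycle end
            return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
    raise ValueError("epoch is outside the scheduled range")

schedule = [warm_restart_lr(e) for e in range(1, 311)]   # the 310 training epochs
```

Under this reading the cycle lengths are 10, 20, 40, 80, and 160 epochs, so training for 310 epochs ends exactly at the end of the fifth cycle, where the learning rate is close to its minimum.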
For multi-class classification tasks, cross-entropy (CE) is generally used as the loss function:

$$L_{CE} = -\sum_{j=1}^{K} y_j \log p_j \tag{9}$$

where p is the model's estimated probability, y is the ground-truth class label (one-hot vector), and j indexes the j-th class. We adopt the CE loss as the loss function for the proposed model.
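As a small worked example of the cross-entropy loss, the NumPy sketch below averages the per-sample loss over a batch. The clipping constant and the function name are illustrative choices, not part of the paper.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Mean cross-entropy between predicted probabilities p and labels y.

    p, y: arrays of shape (N, K); y may be one-hot or soft (e.g. after mixup).
    """
    p = np.clip(p, eps, 1.0)                      # avoid log(0)
    return float(-(y * np.log(p)).sum(axis=-1).mean())

p = np.array([[0.5, 0.25, 0.25]])
y = np.array([[1.0, 0.0, 0.0]])
loss = cross_entropy(p, y)                        # equals -log(0.5)
```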

Results and discussion
To illustrate the properties and performance of the Audio-SegNet proposed in this paper, we adopt the officially recommended train/test split of the TAU Urban Acoustic Scenes 2020 Mobile development dataset, as shown in Table 1. We compared and analyzed various versions of Audio-SegNet, which have different numbers of convolution layers and different kernel sizes. We also verify the performance of the Mixup Normalization and event extraction block methods on unseen devices (S4-S6 in the test set).

Validation results of Mini-SegNet
The DCASE 2020 Task 1A challenge [14] is evaluated by the average of the class-wise accuracy, also known as "macro-average accuracy." All the work in this paper is tested on the challenge dataset with its train/test setup, because its data come from different devices. Our experimental results are mainly reported as the average accuracy, that is, the average accuracy of scene classification across the various devices.
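To make the metric concrete, the sketch below computes macro-average accuracy as the mean of per-class accuracies and contrasts it with plain accuracy on a toy example. The function name is hypothetical, and classes that do not occur in the ground truth are skipped, which is an assumption.

```python
import numpy as np

def macro_average_accuracy(y_true, y_pred, num_classes):
    """Average of class-wise accuracies (recall per class)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for k in range(num_classes):
        mask = y_true == k
        if mask.any():                            # skip classes absent from y_true
            per_class.append((y_pred[mask] == k).mean())
    return float(np.mean(per_class))

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
macro = macro_average_accuracy(y_true, y_pred, num_classes=2)
plain = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

In the toy example, class 0 is recognized in two of three cases and class 1 in its single case, so the macro average is (2/3 + 1)/2 while the plain accuracy is 3/4. The two measures differ whenever classes are unbalanced.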
As shown in Table 2, different Audio-SegNet networks perform differently in terms of the number of training parameters, training time, and training accuracy. Following the structure of Image-SegNet, we first constructed the Audio-SegNet for ASC. In our experiments, we term this network SegNet-L, which means it is a [...]. The results show that its performance is poor; in particular, there is a large gap between the training set and the test set. The main reason is that the SegNet-L architecture is a deep network that cannot be fully utilized when data is limited, so the final classification accuracy suffers from severe overfitting. The simplest way to prevent overfitting is to reduce the model size, that is, to reduce the number of learnable parameters in the model, which is determined by the number of layers and the number of units in each layer. Therefore, we made many attempts to modify the depth of the network. Compared with SegNet-L, SegNet-M not only has fewer convolution layers but also has an order of magnitude fewer training parameters. The training time is clearly reduced, and the accuracy on the test set is improved, but overfitting remains. We then further reduced the number of network layers and constructed two kinds of networks, SegNet-S and Mini-SegNet. Mini-SegNet has better performance, fewer parameters, and a shorter training time, and overfitting is also alleviated. In our work, acoustic scene classification is limited by the amount of data and cannot use networks as deep as those used in image segmentation. Therefore, after analysis and testing, we build the acoustic scene classification system on the Mini-SegNet network.
A convolution kernel can be regarded as a weighted summation over a local region; it corresponds to local perception. The principle is that when we observe an object, we can neither observe each pixel nor take in the whole at once, but instead start to understand it from local regions, which corresponds to convolution. For the same receptive field, the smaller the convolution kernel, the fewer the parameters and the lower the computational complexity. To extract local features more fully, we compare the recognition performance of several different convolution kernel sizes.
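The effect of the kernel size on the parameter count can be checked with simple arithmetic: a 2D convolution layer has (kernel height × kernel width × input channels + 1) × output channels parameters, where the +1 is the bias. The example below compares 3 × 3 and 2 × 3 kernels for a layer with 64 input and 128 output channels; these channel counts follow the encoder description, but treating them as the input and output of one layer is an assumption made for illustration.

```python
def conv2d_params(kh, kw, c_in, c_out, bias=True):
    """Number of trainable parameters in a standard 2D convolution layer."""
    return kh * kw * c_in * c_out + (c_out if bias else 0)

params_3x3 = conv2d_params(3, 3, 64, 128)   # 73,856
params_2x3 = conv2d_params(2, 3, 64, 128)   # 49,280
saving = 1 - params_2x3 / params_3x3        # roughly one third fewer parameters
```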
Table 3 shows the accuracy of different kernel sizes, maxpooling size, and upsampling size on the basis of the Mini-SegNet network.In our work, we initially kept the original Image-SegNet kernel size configuration.However, there is still overfitting phenomenon in Mini-SegNet.Therefore, we analyze and test the different sizes of the convolution kernels.From Table 3, we can see that the problem of overfitting can be improved by reducing the kernel size to a certain extent.When the kernel size is equal to 2 × 3, the classification perfor- mance of the system is the best.
The accuracy on the training set is 80.49%, and that on the test set is 71.26%.At this time, overfitting problems can be ignored.Compared with the 3 × 3 kernel, the overall performance, such as the amount of training parameters, training time, and accuracy, has been improved.However, if we further reduce the convolution kernel size to 1 × 2, the accuracy is very poor.In Fig. 4, we analyzed the characteristic maps of frequency-wise and channel-wise of the Mini-SegNet encoder output.We found that the 2 × 3 kennel size for the channel-wise of frequency would be better between 80 and 100 than 3 × 3. Compared with the 3 × 3 kernel size, the 2 × 3 kernel size has higher spectral density at channel dimensions 80 to 100.Therefore, when the feature maps of the decoder output are pooled using the  global average pooling layer along the channel axis, the distinction of 2 × 3 features will be higher than 3 × 3.So our acoustic scene classification system is a Mini-SegNet network with convolution kernel size of 2 × 3. Finally, the average classification accuracy is 71.26%.The total training parameters are 478,218, and the average training time is 173 s under 310 epochs.Meanwhile, we use a variety of data enhancement methods to further improve the classification accuracy, without using additional data.Table 4 shows results for Mini-SegNet trained in various configurations using the official test-train split.Every configuration was tested on both architectures.
Mixup data augmentation has been thoroughly validated for acoustic scene classification in [58], so we use mixup directly in our work. We then evaluate temporal crop, SpecAugment, and ImageDataGenerator individually. The results in Table 4 show that all three improve performance in acoustic scene classification: they not only raise the overall classification accuracy but also alleviate overfitting. In our parameter set, width_shift_range is 0.6, expressed as a fraction of the total width, and height_shift_range is 3, that is, the maximum random vertical offset of the image, in pixels, during augmentation. To a certain extent, this suggests that the sound signal carries more information in the frequency domain, which the experimental results also support. In general, the ImageDataGenerator method, which originates from image data augmentation, can also be applied effectively to acoustic scene classification. As shown in Fig. 5, the proposed system with a warm restart learning rate schedule achieves better performance on the development set than with a simpler linear learning rate schedule.
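A warm restart schedule can be sketched as cosine annealing that periodically resets to the maximum learning rate. The text does not give the cycle lengths or learning rate bounds, so everything below is an assumption: the function names, the bounds `lr_max` and `lr_min`, and the choice of a first cycle of 10 epochs that doubles after each restart. That cycle choice is used only because it sums to exactly 310 epochs (10 + 20 + 40 + 80 + 160).

```python
import math


def sgdr_learning_rate(epoch, lr_max=0.1, lr_min=1e-5, first_cycle=10, cycle_mult=2):
    """Cosine-annealed learning rate with warm restarts for a 0-based epoch."""
    cycle_len, cycle_start = first_cycle, 0
    # Walk forward through the cycles until we reach the one containing `epoch`.
    while epoch >= cycle_start + cycle_len:
        cycle_start += cycle_len
        cycle_len *= cycle_mult
    progress = (epoch - cycle_start) / cycle_len  # position in the cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def restart_epochs(total_epochs, first_cycle=10, cycle_mult=2):
    """Epochs at which a new cycle starts, excluding epoch 0."""
    restarts, boundary, cycle_len = [], first_cycle, first_cycle
    while boundary < total_epochs:
        restarts.append(boundary)
        cycle_len *= cycle_mult
        boundary += cycle_len
    return restarts


print(restart_epochs(310))
print(sgdr_learning_rate(9), sgdr_learning_rate(10))
```

With these assumed settings the rate decays within each cycle and jumps back to `lr_max` at every restart, whereas a linear schedule only decreases.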

Mixup Normalization and event extraction block
We test Mixup Normalization and the event extraction block in the Mini-SegNet network structure and compare them with batch normalization (BN). The baseline is Mini-SegNet; as shown in Fig. 2, when the event extraction block is omitted, only global average pooling is used in its place. The results are shown in Table 5.
In Table 5, the average accuracy (A-S6) of Mini-SegNet is 65.93% with BN, 69.82% with IN, 70.11% with MixupNorm, and 70.97% with MixupNorm and the event extraction block. MixupNorm reduces to IN when the hyperparameter in Eq. 8 is 0. In addition, the event extraction block is effective on unseen devices. We chose the average accuracy over the recording devices (all-accuracy) as the main performance metric because the task targets the generalization of systems across a number of different devices. The confusion matrix of the acoustic scene classification results over all devices is shown in Fig. 6a. This figure shows that generalization is better on some classes, with an accuracy of up to 85% for acoustic scenes such as bus, park, and street_traffic. Comparing Fig. 6a and b, we find that the event extraction block effectively reduces the mutual confusion between similar scenes, such as airport and shopping_mall, or street_pedestrian and public_square. As shown in Table 6, we performed ablation experiments on the hyperparameter of the Mixup Normalization method. The performance is best when the parameter is set to 0.1.
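Per-class accuracy and the mutual confusion between two similar scenes can both be read from a confusion matrix. The sketch below uses a toy three-class matrix with invented counts, not the values behind Fig. 6, and the helper names are hypothetical.

```python
def per_class_accuracy(confusion, labels):
    """Fraction of each true class (row) that was predicted correctly."""
    return {label: confusion[i][i] / sum(confusion[i]) for i, label in enumerate(labels)}


def mutual_confusion(confusion, labels, first, second):
    """Share of clips from two classes that were labeled as the other class."""
    i, j = labels.index(first), labels.index(second)
    swapped = confusion[i][j] + confusion[j][i]
    return swapped / (sum(confusion[i]) + sum(confusion[j]))


# Toy counts for illustration only: rows are true classes, columns are predictions.
LABELS = ["airport", "shopping_mall", "bus"]
TOY_CONFUSION = [
    [60, 30, 10],
    [25, 70, 5],
    [5, 5, 90],
]

accuracy = per_class_accuracy(TOY_CONFUSION, LABELS)
pair_rate = mutual_confusion(TOY_CONFUSION, LABELS, "airport", "shopping_mall")
print(accuracy)
print(pair_rate)
```

A lower `pair_rate` for a pair such as airport and shopping_mall corresponds to the kind of reduction in mutual confusion described for the event extraction block.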

Comparison with recent state-of-the-art systems
Table 7 compares our proposed Mini-SegNet network with current state-of-the-art systems without applying ensemble techniques. Compared with the systems in the DCASE2020 Task 1A challenge, our proposed system has comparable performance and lower complexity. In the DCASE2021 Task 1A challenge, a model complexity limit of 128 KB was set for the non-zero parameters. Therefore, researchers used or proposed many model compression methods, such as knowledge distillation and the lottery ticket hypothesis (LTH) [37,38]. Unlike the top five best-performing systems in the DCASE2021 Task 1A challenge, the proposed system does not use any compression method, so it needs no additional resources or time to first train a complex network and then compress it, as knowledge distillation does. The proposed system still performs comparably to systems with a similar number of parameters. The acoustic scene classification system proposed in this paper takes the log-mel spectrum as the acoustic feature and Mini-SegNet as the classifier. It achieved an average accuracy of 71.26% across the different devices on the development dataset.
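To give a sense of scale for the 128 KB limit, the sketch below converts a parameter count into a dense storage size. It assumes that 128 KB means 128 × 1024 bytes and that every parameter is stored densely at a fixed width, with no pruning or quantization; the helper names are hypothetical.

```python
TOTAL_PARAMS = 478_218          # parameter count reported for Mini-SegNet
LIMIT_BYTES = 128 * 1024        # assumed interpretation of the 128 KB limit


def dense_size_bytes(num_params, bytes_per_param):
    """Storage needed when every parameter is kept at a fixed width."""
    return num_params * bytes_per_param


def max_params_within(limit_bytes, bytes_per_param):
    """Largest parameter count that fits in the byte budget."""
    return limit_bytes // bytes_per_param


for name, width in (("float32", 4), ("float16", 2)):
    size = dense_size_bytes(TOTAL_PARAMS, width)
    budget = max_params_within(LIMIT_BYTES, width)
    print(f"{name}: {size / 1024:.1f} KiB dense, budget allows {budget} parameters")
```

Under these assumptions the budget corresponds to 32,768 parameters at 32 bits or 65,536 at 16 bits, which is the regime the compression methods mentioned above target.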

Fig. 1 Block diagram of the Mini-SegNet for acoustic scene classification

Fig. 2 Details of the Mini-SegNet model and the event extraction block

Fig. 3 2D visualization of feature maps of mean and standard deviations. Top: frequency-wise. Bottom: channel-wise

Fig. 4 Comparison of kernel sizes 2 × 3 and 3 × 3 at channel indices 80 to 100. Left: feature map of the Mini-SegNet encoder output with the 3 × 3 kernel. Right: feature map of the Mini-SegNet encoder output with the 2 × 3 kernel

Table 1
TAU Urban Acoustic Scenes 2020 Mobile Development dataset

SegNet with more convolution layers and larger kernel size. In Table 2, 64 × 2 represents two convolution layers with 64 output mappings. In SegNet-L, each encoder layer has a corresponding decoder layer, and the encoder network has 13 convolutional layers. The number of parameters is 31,880,650, and the training time of each epoch is 328 s. The all-accuracy is 93.86% on the training set and 59.06% on the test set.

Table 4
All-accuracy (%) under various data augmentation methods

IN improves on BN by 3.89 percentage points, and MixupNorm improves on BN by 4.18 percentage points. For the unseen devices (S4-S6) on the test set, MixupNorm reaches an average accuracy of 67.11%, which is more than 7 and 1 percentage points better than BN and IN, respectively.

Fig. 5 Accuracy of the proposed system (310 epochs). Top: with warm restart learning rate schedule. Bottom: without warm restart learning rate schedule

Table 5
Experimental results on Task 1A. Mixup Normalization and the event extraction block are effective on unseen devices (S4-S6) on the TAU Urban Acoustic Scenes 2020 Mobile Development dataset

Table 6
Effects of the Mixup Normalization hyperparameter on the TAU Urban Acoustic Scenes 2020 Mobile Development dataset

Table 7
Comparison with recent state-of-the-art systems using the performance of individual systems without a score-level ensemble. The third to seventh rows list the top five best-performing systems in the DCASE2020 Task 1A challenge. The ninth to 13th rows list the top five best-performing systems in the DCASE2021 Task 1A challenge