Frequency-dependent auto-pooling function for weakly supervised sound event detection

Sound event detection (SED), which is typically treated as a supervised problem, aims at detecting the types of sound events and their corresponding temporal information. It requires estimating the onset and offset of sound events at the frame level. Many available sound event datasets contain only audio tags without precise temporal information; such datasets are referred to as weakly labeled. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (T-F) masks of each sound event from a T-F representation of an audio clip. The DDC-block is shown experimentally to be more effective and computationally lighter than the “VGG-like” block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP performs better than the state-of-the-art source separation-based method in the SED task.


Introduction
Sound event detection (SED) has become an important research topic in auditory perception. It has many potential applications, such as healthcare in smart homes [1,2], surveillance monitoring in public areas [3], and large-scale information retrieval [4]. The goal of SED is to predict event classes and the corresponding time stamps, i.e., onset and offset times of sound events, while audio tagging (AT) aims at detecting what occurred in an audio clip. Therefore, it is desirable to train SED systems on strongly labeled data, which contain the precise presence and absence times of sound classes [5][6][7][8]. However, it is costly to acquire such strongly annotated [...] neural network approaches [16][17][18][19] are employed to perform prediction for each event class. The pooling function is used to reduce the dimension of the dynamic feature space, and it has a great impact on the overall performance of a weakly supervised SED system. Several pooling functions have been explored in the literature. Global max pooling (GMP) focuses on the instance with the highest probability, which makes it difficult to estimate onsets and offsets for long events. Global average pooling (GAP) assumes that all instances contribute equally, and hence short events are likely to be underestimated. Attention pooling [20,21] is a flexible weighting method that adds a dense neural network to learn weights for each frame in parallel. However, a limitation of attention pooling is that larger attention weights concentrate on frames with smaller probabilities when the label is negative [19]. An auto-pooling (AP) function is developed in [22] by introducing a learnable parameter for each class to deal with the weakly labeled SED problem.
The AP function reduces to min-, mean-, or max-operators with the increase of the learnable parameter, which can be interpreted as an automatic interpolation between different standard pooling behaviors. Another aspect of research is based on source separation framework for non-overlapping case [23,24] or overlapping case [25,26]. As a starting point, [24] focuses on the non-overlapping sound events. Time-frequency (T-F) segmentation masks are learned from the clip-level tags and then aggregated over both the time and frequency indices to obtain present probabilities of sound events.
The T-F segmentation mask is equivalent to ideal ratio mask (IRM) [27] in the context of speech enhancement and source separation. As a byproduct, each sound event can be separated from the mixed audio. In this method, a global weighted rank pooling (GWRP) [28] function is employed to aggregate the masks to clip-level predictions.
The T-F bin with a larger value is assigned a larger weight. GWRP is in essence a generalization of GMP and GAP. However, the decay coefficient of the GWRP function is chosen manually and may not be optimal in practice.
In this paper, we propose an improved source separation-based approach to solve the problem of weakly supervised SED. The proposed method has a framework similar to that in [24], which consists of a segmentation mapping stage and a classification mapping stage. In the segmentation mapping stage, we employ a CNN to capture local patterns of the input spectrogram, i.e., to learn a T-F mask of each specific sound event from weakly labeled data. Concretely, we build a dilated depthwise separable convolution block, named the DDC-block. The DDC-block first applies a single-layer dilated filter to each input channel and then applies a 1 × 1 convolution to combine the outputs of the previous layer. The presented DDC-block outperforms the "VGG-like" CNN originally used in [24] in terms of both detection performance and complexity. In the classification mapping stage, we present a frequency-dependent auto-pooling (FAP) function to aggregate T-F masks into clip-level predictions of sound events. The FAP function inherently accounts for the fact that each sound event exhibits different frequency characteristics by introducing a learnable frequency-varying vector for each class. Furthermore, we show that there are close links between the proposed FAP and the commonly used GMP, soft-max pooling, GAP, and AP functions. The proposed method is not specifically designed for handling overlapping sounds. We first focus on the weakly labeled problem without considering the impact of overlapping. Next, we evaluate the proposed method on the DCASE 2017 Task 4 and DCASE 2020 Task 4 datasets, which are recorded in realistic environments and contain overlapping sound events.
The remainder of the paper is organized as follows: In Section 2, we present the proposed method in detail, including DDC-block for producing T-F masks of sound events and FAP used to aggregate T-F masks to clip-level predictions. In Section 3, we carry out extensive experiments to evaluate the performance of the new method. Section 4 concludes the paper.

Proposed method
We now describe the proposed source separation-based framework and how this method can be used to solve the weakly supervised SED as well as AT problems.
The training process of the new framework consists of two steps, i.e., segmentation mapping and classification mapping. In the segmentation mapping stage, a log-mel spectrogram of the audio clip $x(n)$ is extracted to obtain a feature matrix $X = [|X(t, f)|] \in \mathbb{R}_+^{T \times F}$, where $t = 1, 2, \ldots, T$ and $f = 1, 2, \ldots, F$ represent the frame and frequency indices, respectively; $T$ denotes the number of frames in the audio clip; and $F$ is the number of frequency bands. Then, a segmentation mapping $X \to \hat{M}$ is modeled via a deep neural network, where $\hat{M} = [\hat{M}_k(t, f)] \in \mathbb{R}_+^{K \times T \times F}$ is the estimate of the IRM for each sound event class, $k$ is the class index, and $K$ represents the number of predefined classes. In [24], a "VGG-like" CNN is employed to complete the transformation from the input feature to the specific mask of each sound event. In this paper, we utilize DDC-blocks for this task to obtain better performance. The details of the implementation are presented in Section 2.1. Because only the clip-level tag $y = [y_1, y_2, \cdots, y_K]^T \in \mathbb{R}_+^{K \times 1}$ is available for weakly supervised problems, a global pooling function should be designed to transform the estimated T-F mask into the presence probability of the $k$th sound event. In the classification mapping stage, we map the estimated T-F mask into the clip-level prediction, i.e., $\hat{M} \to \hat{y}$, where $\hat{y} = [\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_K]^T \in \mathbb{R}_+^{K \times 1}$ denotes the clip-level probability. The objective is to minimize the binary cross-entropy between $\hat{y}_k$ and the clip-level tag $y_k$, and the loss function is given by:
$$\mathcal{L} = -\sum_{k=1}^{K} \left[ y_k \log \hat{y}_k + (1 - y_k) \log (1 - \hat{y}_k) \right]$$
Several global pooling functions such as GMP, GAP, and GWRP have been adopted for the classification mapping. We present a frequency-dependent pooling function to fully exploit the potential of the source separation-based approach in Section 2.2.
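The binary cross-entropy objective described above can be illustrated with a minimal sketch. The function name `binary_cross_entropy`, the clipping constant `eps`, and the choice to sum (rather than average) over classes are assumptions made for illustration; they are not specified in the text.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Sum of per-class binary cross-entropy terms for one clip.

    y_true: clip-level tags y_k in {0, 1}, one per class.
    y_pred: clip-level probabilities in [0, 1], one per class.
    eps:    hypothetical clipping constant that keeps log() finite.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Example: three classes, of which the first is present in the clip.
loss = binary_cross_entropy([1, 0, 0], [0.8, 0.1, 0.3])
```

Because only clip-level tags enter this loss, the T-F masks are never supervised directly; they are shaped only through the pooling function that produces the clip-level probabilities.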
Both AT and SED tasks share the same training stage as described above. Once the training is completed, the prediction of the trained model is taken as the AT result. For the SED task, however, extra operations need to be carried out at inference to obtain frame-level probabilities. Since the estimated mask $\hat{M}_k(t, f)$ contains the information of sound event activities, the frame-level probability can be obtained by aggregating the mask across the frequency axis as:
$$\hat{y}_k(t) = \sum_{f=1}^{F} w_k(f) \, \hat{M}_k(t, f)$$
where $w_k(f)$ denotes the weight of the $f$th frequency band for the $k$th class, and $\hat{y}_k(t)$ is the estimated frame-level probability. In [24], Kong et al. average the mask over the frequency axis with $w_k(f) = 1/F$, whereas we utilize the learned vector $\alpha_k$ of FAP to calculate the weighted average along the frequency dimension. To produce smooth frame-level predictions [24], we first select a frame $t$ as a seed where $\hat{y}_k(t) \geq 0.2$. Then, we merge the neighboring frames on both sides in a region-growing style until reaching a frame $t'$ where $\hat{y}_k(t') \leq 0.1$. The diagram of the proposed source separation-based method is shown in Fig. 1.
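The inference steps above (frequency-weighted aggregation followed by region growing with the 0.2 and 0.1 thresholds) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the mask is a plain nested list for one class, and it is assumed that the frame at which the probability drops to 0.1 or below is itself excluded from the detected region.

```python
def frame_probabilities(mask, weights):
    """Weighted sum of a T x F mask over frequency, one value per frame.

    mask:    nested list, mask[t][f] in [0, 1], for a single class.
    weights: list of F band weights w_k(f); 1/F each gives a plain average.
    """
    return [sum(w * m for w, m in zip(weights, row)) for row in mask]


def region_grow(probs, high=0.2, low=0.1):
    """Mark frames active by growing outward from seed frames.

    A frame with probability >= high is a seed; neighbors on both sides
    are merged until a frame with probability <= low is reached.
    """
    n = len(probs)
    active = [False] * n
    for t, p in enumerate(probs):
        if p >= high and not active[t]:
            active[t] = True
            i = t - 1
            while i >= 0 and probs[i] > low:  # grow to the left
                active[i] = True
                i -= 1
            i = t + 1
            while i < n and probs[i] > low:   # grow to the right
                active[i] = True
                i += 1
    return active


# Example: uniform band weights on a tiny 2-frame, 2-band mask.
probs = frame_probabilities([[0.2, 0.4], [0.6, 0.8]], [0.5, 0.5])
```

Using two thresholds rather than one keeps weakly active frames that adjoin a confident detection, while isolated low-probability frames are ignored.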

The segmentation mapping stage
The segmentation mapping performs feature transformation via the deep neural network. Models consisting of VGG-blocks have proven quite promising since they can capture local patterns of input features. However, the use of VGG-blocks leads to a high computational cost. To solve this problem, we build a convolutional block, i.e., the DDC-block, which employs depthwise separable convolution [29,30] with dilated filters instead of the typical CNN used in [24]. The architecture of the DDC-block is shown in Fig. 2. Each of the three convolution operations is followed by a nonlinear activation and batch normalization. The stack of a depthwise and a pointwise convolution is called a depthwise separable convolution, which is counted as a single convolution layer, as in a typical CNN. Thus, the number of convolution layers in the DDC-block is the same as that in the VGG-block.
For the depthwise convolution, the number of filters is required to be equal to the number of input channels, which means that the spatial convolution is performed independently in each input channel. For the pointwise convolution, 1 × 1 filters are used to project the outputs of the depthwise layer onto a new feature space. It has been demonstrated that the stack of depthwise and pointwise layers reduces the number of parameters to a fraction $1/N + 1/(w_{\mathrm{width}} \cdot w_{\mathrm{height}})$ of that of a standard convolution, where $w_{\mathrm{width}}$ and $w_{\mathrm{height}}$ represent the width and height of the filter, respectively, and $N$ is the number of output channels [31]. Such a design can significantly alleviate the over-fitting problem. Additionally, we propose to use dilated filters in the depthwise layer. Dilated filters increase the size of the receptive field without introducing extra parameters. We carried out experiments (not shown here) and found that the use of dilation has a positive impact on both AT and SED tasks. The process of depthwise separable convolution with dilation is shown in Fig. 3. We now present the detailed configuration of the model used in the segmentation mapping stage. We first apply four DDC-blocks to the input log-mel spectrogram; the channel numbers of the four blocks are 32, 64, 128, and 128, respectively. Then, a K-channel convolution layer with 1 × 1 filters converts the output of the last DDC-block into T-F segmentation masks through sigmoid activation functions. Finally, a global pooling function aggregates the estimated masks into the clip-level prediction. The proposed network has a depth similar to that of the "VGG-like" network in [24], which allows a fair comparison. The details of the proposed model architecture are summarized in Table 1.
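The parameter ratio quoted above can be checked with a short calculation. The sketch below counts weights only (biases and batch-normalization parameters are ignored), and the 3 × 3 kernel with 32 input and 64 output channels is a hypothetical example rather than a configuration taken from Table 1.

```python
def standard_conv_params(c_in, c_out, k_w, k_h):
    """Weights in a standard 2-D convolution (bias ignored)."""
    return k_w * k_h * c_in * c_out


def separable_conv_params(c_in, c_out, k_w, k_h):
    """Weights in a depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k_w * k_h * c_in  # one spatial filter per input channel
    pointwise = c_in * c_out      # 1 x 1 projection to c_out channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
c_in, c_out, k_w, k_h = 32, 64, 3, 3
ratio = separable_conv_params(c_in, c_out, k_w, k_h) / standard_conv_params(
    c_in, c_out, k_w, k_h
)
predicted = 1 / c_out + 1 / (k_w * k_h)  # the 1/N + 1/(w_width * w_height) rule
```

The dilation rate does not appear in either count, which reflects the statement that dilated filters enlarge the receptive field without adding parameters.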

The classification mapping stage
In this subsection, we model the classification mapping $\hat{M} \to \hat{y}$ via the pooling function. To this end, pooling functions such as GMP, GAP, and GWRP [26] are commonly used in the SED field. GMP considers only the T-F bin with the maximum probability, leading to a constrained gradient path and inefficient computation. GAP assumes that all T-F bins contribute equally to the clip-level prediction, which means that GAP is unable to focus on specific T-F bins.
(1) Global weighted rank pooling. GWRP can be understood as a generalization of GMP and GAP. The main idea of GWRP is to put high weights on the T-F bins with high values [26]. The T-F bins of the $k$th sound event $\hat{M}_k$ are sorted in descending order, and the $j$th element of the sorted sequence is denoted by $\hat{M}_{k,j}$. The clip-level prediction can be represented as:
$$\hat{y}_k = \frac{1}{Z(r_k)} \sum_{j=1}^{TF} r_k^{\,j-1} \hat{M}_{k,j}, \qquad Z(r_k) = \sum_{j=1}^{TF} r_k^{\,j-1}$$
where $r_k \in [0, 1]$ is a hyperparameter that controls the behavior of the GWRP function. Notice that GWRP reduces to GMP for $r_k = 0$ and to GAP for $r_k = 1$. Since the weight increases as $\hat{M}_{k,j}$ becomes large, the aggregation performance is improved compared with GMP and GAP. However, the performance of GWRP depends strongly on the interpolation coefficient $r_k$, which is difficult to choose in practice. We thus propose the FAP function, which determines the interpolation behavior automatically.
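The rank-based weighting described above can be sketched in a few lines. This sketch assumes the usual GWRP form, in which the $j$th largest bin receives weight $r^{j-1}$ and the weights are normalized by their sum; the function name `gwrp` and the toy mask are hypothetical.

```python
def gwrp(mask, r):
    """Global weighted rank pooling of a T x F mask for one class.

    mask: nested list of values in [0, 1].
    r:    decay coefficient in [0, 1]; r = 0 gives max pooling and
          r = 1 gives average pooling (Python evaluates 0 ** 0 as 1).
    """
    values = sorted((v for row in mask for v in row), reverse=True)
    weights = [r ** j for j in range(len(values))]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


toy_mask = [[0.1, 0.9], [0.5, 0.3]]
as_max = gwrp(toy_mask, 0.0)    # behaves like GMP
as_mean = gwrp(toy_mask, 1.0)   # behaves like GAP
in_between = gwrp(toy_mask, 0.5)
```

Note that `r` is fixed by hand here, which is precisely the limitation that motivates a pooling function with learnable parameters.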
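(2) Frequency-dependent auto-pooling. The FAP function is an improved version of soft-max pooling that introduces a learnable parameter vector $\alpha_k = [\alpha_k(1), \alpha_k(2), \cdots, \alpha_k(F)]^T \in \mathbb{R}^{F \times 1}$ as weighting coefficients for the $k$th class. FAP treats $\alpha_k$ as a free vector that can be learned during training. The expression of FAP is:
$$\hat{y}_k = \frac{\sum_{t=1}^{T} \sum_{f=1}^{F} \hat{M}_k(t, f) \exp\left(\alpha_k(f) \, \hat{M}_k(t, f)\right)}{\sum_{t=1}^{T} \sum_{f=1}^{F} \exp\left(\alpha_k(f) \, \hat{M}_k(t, f)\right)}$$
where $\alpha_k(f)$ is the weight of the $f$th frequency band. Note that the frequency-varying $\alpha_k(f)$ is shared among all frames in each frequency band.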
We now show the relationship between the proposed FAP and several well-known pooling functions. The proposed FAP function can be treated as an extension of the AP function proposed in [22]. FAP is specifically designed for 2D data such as spectrograms, so that prior information about the frequency distribution of event classes can be taken into account during aggregation. That is, FAP can focus on the crucial frequency bands adaptively by learning the vectors. Additionally, FAP reduces to GMP, soft-max pooling, and GAP when $\alpha_k(f) \to \infty$, $\alpha_k(f) = 1$, and $\alpha_k(f) = 0$, respectively.
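The limiting cases listed above can be checked numerically with a minimal sketch. It assumes that FAP has the form of auto-pooling [22] with a per-band exponent, i.e., each bin $\hat{M}_k(t, f)$ is weighted by $\exp(\alpha_k(f) \hat{M}_k(t, f))$ and the weights are normalized over all T-F bins; this form is an assumption chosen to match the stated reductions, and the function name `fap` is hypothetical. In training, `alpha` would be a learnable vector, whereas here it is passed in as a fixed list.

```python
import math


def fap(mask, alpha):
    """Frequency-dependent auto-pooling of a T x F mask for one class.

    mask:  nested list, mask[t][f] in [0, 1].
    alpha: list of F values; alpha[f] is shared by all frames of band f.
    """
    numerator = 0.0
    denominator = 0.0
    for row in mask:
        for a, m in zip(alpha, row):
            weight = math.exp(a * m)
            numerator += m * weight
            denominator += weight
    return numerator / denominator


toy_mask = [[0.1, 0.9], [0.5, 0.3]]
like_gap = fap(toy_mask, [0.0, 0.0])      # all weights equal: average pooling
like_softmax = fap(toy_mask, [1.0, 1.0])  # soft-max pooling
like_gmp = fap(toy_mask, [50.0, 50.0])    # large alpha approaches max pooling
band_focus = fap(toy_mask, [0.0, 50.0])   # emphasis on the second band only
```

The last call shows the frequency-dependent behavior: with a large value in only one band, the pooled output is dominated by the largest bin of that band.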
A word on the bound of the parameter $\alpha_k(f)$ is appropriate here. For $\alpha_k(f) < 0$, FAP is similar to min-pooling, and hence T-F bins with smaller values attract much more attention, which is not desired. On the other hand, for $\alpha_k(f) \to \infty$, FAP simplifies to GMP, which may result in the gradient explosion problem. We thus propose to constrain the parameter to $0 < \alpha_k(f) < \alpha_{\max}$, where $\alpha_{\max}$ is a predefined constant. In this paper, $\alpha_{\max} = 10$ is chosen empirically and achieves satisfactory performance.

Data preparation
We use the audio clips of the DCASE 2018 Task 1 dataset as background noise; they are recorded in 10 scenes such as a metro station and a shopping mall. For the sound events, 3710 manually verified clips covering 41 categories are obtained from DCASE 2018 Task 2. These events involve various human activities, household events, instrument events, etc. All of the audio clips are sampled at 32 kHz. More details on the preparation of the data are shown in Table 2. We fix the duration of every sound event to 2 s, as in [24], to make sure that the generated clips are non-overlapping. Specifically, events shorter than 2 s are padded with zeros to 2 s. For events longer than 2 s, we keep the first 2 s as training data and discard the rest of the recording. Three randomly selected events are mixed with the background noise without any overlap at 0-dB SNR. The onsets of the events are 0.5 s, 3 s, and 5.5 s, respectively. This ensures that the impact of overlapping is avoided, which lets us focus on the weakly labeled task first. We synthesize 8000 audio clips and divide them into 4 cross-validation folds.
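The synthesis procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, waveforms are plain lists of samples, and the SNR is assumed to be the ratio of the RMS of the event samples passed in to the RMS of the whole background clip. The text does not specify how the SNR is measured.

```python
import math

SAMPLE_RATE = 32000                        # Hz, as stated in the text
EVENT_LEN = 2 * SAMPLE_RATE                # every event is fixed to 2 s
ONSETS = [int(s * SAMPLE_RATE) for s in (0.5, 3.0, 5.5)]  # onset samples


def fix_length(event, n):
    """Trim an event to n samples, or zero-pad it up to n samples."""
    return event[:n] + [0.0] * max(0, n - len(event))


def rms(x):
    """Root-mean-square level of a waveform."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_event(background, event, onset, snr_db=0.0):
    """Add an event to the background at a sample offset and a target SNR."""
    event_rms = rms(event)
    if event_rms == 0.0:
        return list(background)            # silent event: nothing to add
    gain = rms(background) * 10 ** (snr_db / 20.0) / event_rms
    mixed = list(background)
    for i, v in enumerate(event):
        if onset + i < len(mixed):
            mixed[onset + i] += gain * v
    return mixed
```

With 2-s events starting at 0.5 s, 3 s, and 5.5 s, the occupied intervals are 0.5 to 2.5 s, 3 to 5 s, and 5.5 to 7.5 s, so the three events never overlap within a 10-s clip.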

Setup
We choose a 64-band log-mel spectrogram as the input representation for our model. A 64-ms-long Hanning window is employed for the STFT with 50% overlap. Then, each frame is converted into a 64-dimensional vector by a log-mel filter bank. This process converts a 10-s audio clip into a 64 × 311 dimensional log-mel spectrogram. The learning rate is initially set to 1e−3 and is automatically reduced to 0.9 times its previous value every 1000 iterations. Xavier initialization is used to initialize the model. The experimental settings are shown in Table 3, and each result shown in this paper is obtained by averaging over 10 independent experiments. We observed that the prediction result of the AT task is more reliable than that of the SED task. In order to reduce false positives for the SED task, we first evaluate the clip-level predictions of each clip. Only the classes that are predicted as active at the clip level are selected for evaluating the frame-level predictions [24]. Since most events are longer than 10 frames, we treat predictions shorter than 10 frames as false positives and remove them to reduce insertions. Moreover, some classes, such as "Knock", occur discontinuously in audio clips, but their ground-truth frame-level labels are always continuous. Thus, events shorter than 10 frames are removed, and silence gaps between events shorter than 10 frames are merged. We set an onset collar of 200 ms and an offset collar of 200 ms/50% to count the true positives of the prediction, which is similar to the configuration of [32].
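The post-processing of frame-level predictions described above can be sketched as follows. The order of the two operations is not stated in the text; this sketch assumes that short gaps are merged first and short events are removed afterwards. The function names are hypothetical, and segments are half-open frame intervals `(start, end)`.

```python
def to_segments(active):
    """Convert a list of per-frame booleans into (start, end) segments."""
    segments = []
    start = None
    for t, is_active in enumerate(active):
        if is_active and start is None:
            start = t
        elif not is_active and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments


def postprocess(active, min_len=10):
    """Merge gaps shorter than min_len, then drop events shorter than min_len."""
    merged = []
    for start, end in to_segments(active):
        if merged and start - merged[-1][1] < min_len:
            merged[-1] = (merged[-1][0], end)  # bridge the short silence gap
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_len]


# Two bursts separated by a 3-frame gap, then an isolated 4-frame burst.
frames = [True] * 6 + [False] * 3 + [True] * 6 + [False] * 20 + [True] * 4
events = postprocess(frames)
```

In this example the two nearby bursts are joined into one 15-frame event, while the isolated 4-frame burst is discarded as a likely insertion.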
As for evaluation metrics, we use F-score, area under the curve (AUC), mean average precision (mAP) [32], and error rate (ER) to evaluate the performance of AT and SED tasks. The F-score is computed as a harmonic mean between precision and recall. The AUC is the area under ROC curve which plots true-positive rate (TPR) versus false-positive rate (FPR). The mAP is the average of precision at different recall values regardless of thresholds. The mAP can evaluate the model comprehensively and is widely used in the weakly supervised SED field. Error rate (ER) measures the number of errors including deletions (D) and insertions (I). ER is a score rather than a percentage which can become larger than 1 in the case when the system makes more errors than correct predictions.
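Two of the metrics above reduce to simple arithmetic on counts, as the following sketch shows. The function names are hypothetical, and the error rate follows the description in the text (deletions plus insertions, divided by the number of reference events); the normalization by the reference count is an assumption based on common SED practice.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def error_rate(deletions, insertions, n_reference):
    """Errors per reference event; this is a score, not a percentage."""
    return (deletions + insertions) / n_reference


balanced = f_score(tp=8, fp=2, fn=2)
many_insertions = error_rate(deletions=3, insertions=9, n_reference=10)
```

The second example yields a value above 1, which illustrates why ER is a score rather than a percentage: a system that produces many insertions can make more errors than there are reference events.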

Performance evaluation of the pooling functions
In this section, we evaluate the performance of five pooling functions: GMP, GAP, GWRP, AP, and FAP. To fairly validate the effectiveness of the FAP function, we employ the same model in the segmentation mapping stage for all of them. Specifically, the DDC-blocks in Table 1 are replaced with four VGG-blocks. For the GWRP function, the interpolation coefficient $r_k$ is set to 0.9998 as in [24]. As seen from Table 4, GMP and GAP are inferior to the other approaches because of their impractical assumptions. As expected, FAP achieves the highest scores among all the methods in terms of mAP and AUC for both AT and SED tasks. In particular, FAP achieves a significant performance improvement over GWRP and AP. This is mainly because the FAP function can automatically interpolate between different pooling behaviors through the learnable frequency-wise vectors. Figure 4 illustrates the parameters learned by the proposed FAP function. Clearly, the learned weights vary with frequency band for certain acoustic events, and the weight vectors $\alpha_k$ exhibit different distribution characteristics for different sound events. For instance, the energy of keys_jangling events and bark events is mainly distributed in the high- and low-frequency bands, respectively. The corresponding vectors $\alpha_k$ show a tendency consistent with the energy distribution. It is observed in the experiments that $\alpha_k(f) \leq 3$ in most cases, and we present the results for $\alpha_{\max} = 3, 5, 10$. Table 5 investigates the effect of the upper bound of $\alpha_k(f)$ on the overall performance. FAP achieves the best performance for $\alpha_{\max} = 10$, and hence this value is used for the proposed method in the other experiments.

Performance evaluation of the proposed method
We compare the performance of DDC-blocks and VGG-blocks based on the GWRP, AP, and FAP functions. Two well-known examples of MIL methods, i.e., Attention [20] and TALNet [19], are also included to make a comprehensive comparison. The results are shown in Table 6. We first compare the performance of DDC-blocks and VGG-blocks. With the same pooling function, the methods with DDC-blocks outperform those with VGG-blocks in terms of the F1-score, AUC, and mAP. As for the model size, the number of parameters required by the DDC-block-based approaches is 49.5% lower than that of the VGG-block-based methods. Thus, the use of DDC-blocks significantly reduces the number of parameters while achieving better segmentation performance. Moreover, DDC-FAP achieves the highest mAP and AUC, the lowest ER, and the fewest insertions among all the source separation-based methods. The combination of the DDC-block and FAP thus achieves a significant performance improvement compared with the method in [24]. In the AT task, DDC-FAP is comparable with Attention [20] and TALNet [19]. The proposed method achieves a somewhat higher AUC score in the AT task. This indicates that DDC-FAP makes fewer false-negative predictions, which is consistent with the observation that DDC-FAP produces fewer deletions than Attention [20]. As for the SED task, DDC-FAP achieves the highest mAP (0.427) and AUC (0.868) and hence outperforms the other methods. To provide a better illustration, we summarize the results of 10 independent experiments in the box plot of the main metrics in Fig. 5. It shows that the results of DDC-FAP are relatively stable in these metrics and superior to the other methods in the SED task. In order to show the performance of the aforementioned methods more intuitively, we plot the number of model parameters on the abscissa and the mAP in the SED task on the ordinate, as shown in Fig. 6. An ideal SED system should require fewer parameters and achieve a higher mAP.
It can be seen from Fig. 6 that the method using the DDC-block outperforms all the other approaches.

Performance evaluation on DCASE 2020 Task 4
In contrast to the aforementioned synthetic dataset, some of the events in DCASE 2020 Task 4 overlap. The DCASE 2020 Task 4 dataset mainly consists of the FUSS dataset and the DESED dataset. The FUSS dataset, which is used for the sound separation task, does not provide labels for event classes, so the DDC-FAP method cannot use it for training. The DESED dataset is used for the SED task and consists of strongly labeled, weakly labeled, and unlabeled audio clips. We evaluate the proposed method on the weakly labeled training set of DESED. The results are shown in Table 7. The performance of the mentioned methods on the DCASE 2020 Task 4 dataset is similar to that in Section 3.4. For the SED task, DDC-FAP still achieves the best results for all metrics. For the AT task, DDC-FAP ranks second, slightly behind Attention. The experimental results show that although the proposed method is not specifically designed for overlapping events, it still performs well in the overlapping case.

Performance evaluation on DCASE 2017 Task 4
[Fig. 6: The number of parameters and performance in the SED task]
Data imbalance is also a challenging problem for the SED task. We utilize DCASE 2017 Task [...] a mini-batch data balancing operation to ensure that the most frequent events occur at most five times as often as the least frequent ones in a mini-batch. To be consistent with [19], the class-specific thresholds that achieve the highest F-score in the AT task are used to make the clip-level predictions. The performance of the AT and SED tasks is evaluated at the clip level and the 1-s segment level, respectively. In addition to the mentioned methods, an adaptive distance-based pooling function has been proposed recently. It compensates for non-relevant information of audio events by applying an adaptive transformation along the temporal axis. For a comprehensive comparison, we show the results of the mentioned methods in Table 8. It can be seen that the proposed method achieves the best F-score in both the AT and SED tasks. Besides being unbalanced, the dataset contains some overlapping sound events. Moreover, the events of DCASE 2017 Task 4 have not been padded or trimmed, and therefore exhibit a more natural and diverse distribution of durations. The results show that DDC-FAP performs well in these situations, which indicates its robustness to complex scenes.

Conclusion
In this paper, we proposed a novel source separation-based method for weakly supervised SED. In the segmentation mapping stage, we designed a model consisting of four DDC-blocks to convert the input feature into the T-F mask of each sound event. To utilize the prior frequency information, we proposed the FAP function, which introduces learnable vectors to find the key bands when aggregating the T-F masks. Both the temporal location of the predefined events and the separated waveform can be obtained from the trained T-F mask. Extensive experiments demonstrated that the DDC-block is more effective and computationally lighter than the VGG-block in the segmentation mapping stage, and that the FAP function outperforms the widely used pooling operators. The proposed DDC-FAP method achieves better performance than the state-of-the-art source separation-based methods in various situations such as the non-overlapped, overlapped, and unbalanced cases.