Frequency-dependent auto-pooling function for weakly supervised sound event detection
EURASIP Journal on Audio, Speech, and Music Processing volume 2021, Article number: 19 (2021)
Abstract
Sound event detection (SED), which is typically treated as a supervised problem, aims at detecting the types of sound events and the corresponding temporal information. It requires estimating onset and offset annotations for sound events at each frame. Many available sound event datasets contain only audio tags without precise temporal information; such datasets are therefore classified as weakly labeled. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (TF) masks of each sound event from a TF representation of an audio clip. The DDC-block is experimentally shown to be more effective and computationally lighter than the "VGG-like" block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP outperforms the state-of-the-art source separation-based method on the SED task.
1 Introduction
Sound event detection (SED) has become an important research topic in auditory perception. It has many potential applications such as healthcare in smart homes [1, 2], surveillance monitoring in public areas [3], and large-scale information retrieval [4]. The goal of SED is to predict event classes and the corresponding time stamps, i.e., onset and offset times of sound events, while audio tagging (AT) only aims at detecting what occurred in an audio clip. It is therefore desirable to train SED systems on strongly labeled data that contains the precise presence and absence times of sound classes [5–8]. However, acquiring such strongly annotated data is costly in practice. On the other hand, there is a large amount of weakly labeled data that is tagged only with the types of sound events at the clip level and does not provide the corresponding temporal information. For instance, AudioSet released by Google consists of a collection of 2,084,320 human-labeled 10-s sound clips [9, 10].
Multiple instance learning (MIL) [11–13] is a common framework for training with weakly labeled data. In MIL methods for SED, the audio clip (bag) is divided into overlapping frames (instances), where only the ground-truth labels of clips are available. An audio clip is labeled positive if it contains at least one positive frame. MIL methods usually consist of two parts: a dynamic predictor that generates the presence probability of a specific event in each frame, and a pooling function that aggregates the frame-level probabilities into a clip-level prediction. For the dynamic predictor, conventional support vector machines (SVM) [14], Gaussian mixture models (GMM) [15], and neural network approaches [16–19] have been employed to perform prediction for each event class. The pooling function reduces the dimension of the dynamic feature space and has a great impact on the overall performance of the weakly supervised SED system. Several pooling functions have been exploited in the literature. Global max pooling (GMP) focuses on the instance with the highest probability, which makes it difficult to estimate onset and offset annotations for long-duration events. Global average pooling (GAP) assumes that all instances contribute equally, and hence short-duration events are likely to be underestimated. Attention pooling [20, 21] is a flexible weighting method that adds a dense neural network to learn weights for each frame in parallel. However, a limitation of attention pooling is that larger attention weights concentrate on frames with smaller probabilities when the label is negative [19]. An auto-pooling (AP) function is developed in [22] by introducing a learnable parameter for each class to deal with the weakly labeled SED problem. The AP function reduces to the min, mean, or max operator as the learnable parameter varies, which can be interpreted as an automatic interpolation between different standard pooling behaviors.
Another line of research is based on the source separation framework for the non-overlapping case [23, 24] or the overlapping case [25, 26]. As a starting point, [24] focuses on non-overlapping sound events. Time-frequency (TF) segmentation masks are learned from the clip-level tags and then aggregated over both the time and frequency indices to obtain the presence probabilities of sound events. The TF segmentation mask is equivalent to the ideal ratio mask (IRM) [27] in the context of speech enhancement and source separation. As a byproduct, each sound event can be separated from the mixed audio. In this method, a global weighted rank pooling (GWRP) [28] function is employed to aggregate the masks into clip-level predictions. A TF bin with a larger value is assigned a larger weight. GWRP is in essence a generalization of GMP and GAP. However, the decay coefficient of the GWRP function is chosen manually and may not be optimal in practice.
In this paper, we propose an improved source separation-based approach to solve the problem of weakly supervised SED. The proposed method has a framework similar to [24], consisting of a segmentation mapping stage and a classification mapping stage. In the segmentation mapping stage, we employ a CNN to capture local patterns of the input spectrogram, i.e., to learn a TF mask of each specific sound event from weakly labeled data. Concretely, we build a dilated depthwise separable convolution block, named the DDC-block. The DDC-block first applies a single-layer dilated filter to each input channel and then applies a 1×1 convolution to combine the outputs of the previous layer. The presented DDC-block outperforms the "VGG-like" CNN originally used in [24] in terms of both detection performance and complexity. In the classification mapping stage, we present a frequency-dependent auto-pooling (FAP) function to aggregate the TF masks into clip-level predictions of sound events. The FAP function inherently accounts for the fact that each sound event exhibits different frequency characteristics by introducing a learnable frequency-varying vector for each class. Furthermore, we show that there are close links between the proposed FAP and the commonly used GMP, softmax pooling, GAP, and AP functions. The proposed method is not specifically designed for handling overlapping sounds; we first focus on the weakly labeled problem without considering the impact of overlapping. We then evaluate the proposed method on the DCASE 2017 Task 4 and DCASE 2020 Task 4 datasets, which are recorded in realistic environments and contain overlapping sound events.
The remainder of the paper is organized as follows. Section 2 presents the proposed method in detail, including the DDC-block for producing TF masks of sound events and the FAP function used to aggregate TF masks into clip-level predictions. Section 3 carries out extensive experiments to evaluate the performance of the new method. Section 4 concludes the paper.
2 Proposed method
We now describe the proposed source separation-based framework and how it can be used to solve the weakly supervised SED and AT problems.
The training process of the new framework consists of two steps, i.e., segmentation mapping and classification mapping. In the segmentation mapping stage, a log-mel spectrogram of the audio clip x(n) is extracted to obtain a feature matrix \(\mathbf{X}=[X(t,f)] \in \text{I\!R}_{+}^{T \times F}\), where t=1,2,...,T and f=1,2,...,F represent the frame and frequency indices, respectively; T denotes the number of frames of the audio clip; and F is the number of frequency bands. Then, a segmentation mapping \(\mathbf{X}\rightarrow\hat{\mathbf{M}}\) is modeled via a deep neural network, where \(\hat{\mathbf{M}} = [\hat{M}_{k}(t,f)] \in \text{I\!R}_{+}^{K \times T \times F}\) is the estimate of the IRM for each sound event class, k is the class index, and K represents the number of predefined classes. In [24], a "VGG-like" CNN is employed to perform the transformation from the input feature to the specific mask of each sound event. In this paper, we instead utilize DDC-blocks for this task to obtain a better performance; the implementation details are presented in Section 2.1. Because only the clip-level tag \(\mathbf{y} = [y_{1}, y_{2},\cdots,y_{K}]^{\mathrm{T}} \in \text{I\!R}_{+}^{K \times 1}\) is available for weakly supervised problems, a global pooling function must be designed to transform the estimated TF mask into the presence probability of the kth sound event. In the classification mapping stage, we map the estimated TF mask into the clip-level prediction, i.e., \(\hat{\mathbf{M}}\rightarrow\hat{\mathbf{y}}\), where \(\hat{\mathbf{y}}=[\hat{y}_{1},\hat{y}_{2},\cdots,\hat{y}_{K}]^{\mathrm{T}} \in \text{I\!R}_{+}^{K \times 1}\) denotes the clip-level probabilities. The objective is to minimize the binary cross-entropy between \(\hat{y}_{k}\) and the clip-level tag \(y_{k}\), and the loss function is given by:

$$ \mathcal{L} = -\sum_{k=1}^{K} \left[ y_{k} \log \hat{y}_{k} + (1-y_{k}) \log (1-\hat{y}_{k}) \right] $$
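The clip-level binary cross-entropy objective can be sketched in plain Python (a minimal illustration, not the authors' training code; the class count and probability values below are made up):

```python
import math

def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy between clip-level tags y_k and predictions
    y_hat_k, summed over the K sound event classes."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total

# K = 3 classes: the first event is tagged present, the others absent.
loss = bce_loss([1.0, 0.0, 0.0], [0.9, 0.2, 0.1])
```

The loss is small when the predicted probabilities agree with the tags and grows without bound as a confident prediction contradicts its tag.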
Several global pooling functions such as GMP, GAP, and GWRP have been adopted for the classification mapping. In Section 2.2, we present a frequency-dependent pooling function to fully exploit the potential of the source separation-based approach.
Both the AT and SED tasks share the same training stage as described above. Once training is completed, the prediction of the trained model directly gives the AT result. For the SED task, however, extra operations are needed at inference time to obtain frame-level probabilities. Since the estimated mask \(\hat{M}_{k}(t,f)\) contains the information of sound event activities, the frame-level probability can be obtained by aggregating the mask across the frequency axis as:

$$ \hat{y}_{k}(t) = \sum_{f=1}^{F} w_{k}(f) \hat{M}_{k}(t,f) $$
where w_{k}(f) denotes the weight of the fth frequency band for the kth class, and \(\hat{y}_{k}(t)\) is the estimated frame-level probability. In [24], Kong et al. average over the frequency axis of the mask with w_{k}(f)=1/F, whereas we utilize the learned vector \(\mathbf{w}_{k} = [w_{k}(1),w_{k}(2),\cdots,w_{k}(F)]^{\mathrm{T}} \in \text{I\!R}_{+}^{F \times 1}\) of FAP to compute a weighted average along the frequency dimension. To produce smooth frame-level predictions [24], we first select a frame t as a seed where \(\hat{y}_{k}(t)\ge 0.2\). Then, we merge the neighboring frames on both sides in a region-growing style until a frame t′ where \(\hat{y}_{k}(t')\le 0.1\). The diagram of the proposed source separation-based method is shown in Fig. 1.
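The seed-and-grow smoothing just described can be sketched as follows (an illustrative re-implementation of our reading of the procedure, not the authors' code; the returned intervals use exclusive offsets):

```python
def region_grow(probs, seed_th=0.2, stop_th=0.1):
    """Turn frame-level probabilities into (onset, offset) frame intervals.

    A frame with probability >= seed_th seeds an event; the event is then
    grown on both sides while neighboring frames stay above stop_th
    (a hysteresis-style double threshold).
    """
    T = len(probs)
    active = [False] * T
    for t, p in enumerate(probs):
        if p >= seed_th and not active[t]:
            lo = t
            while lo > 0 and probs[lo - 1] > stop_th:
                lo -= 1
            hi = t
            while hi < T - 1 and probs[hi + 1] > stop_th:
                hi += 1
            for i in range(lo, hi + 1):
                active[i] = True
    # Collapse the binary activity mask into (onset, offset) pairs.
    events, start = [], None
    for t, a in enumerate(active + [False]):  # sentinel closes a trailing event
        if a and start is None:
            start = t
        elif not a and start is not None:
            events.append((start, t))
            start = None
    return events
```

The double threshold keeps weakly active frames adjacent to a confident seed, while isolated low-probability frames never start an event.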
2.1 The segmentation mapping stage
The segmentation mapping performs feature transformation via a deep neural network. A model consisting of VGG-blocks has proven quite promising since it can capture local patterns of the input features. However, the use of VGG-blocks leads to a high computational cost. To solve this problem, we build a convolutional block, the DDC-block, which employs depthwise separable convolution [29, 30] with dilated filters instead of the typical CNN as in [24]. The architecture of the DDC-block is shown in Fig. 2. Each of the three convolution operations is followed by a nonlinear activation and a batch normalization process. The stack of depthwise and pointwise convolutions is called a depthwise separable convolution and is counted as a single convolution layer, as in a typical CNN. Thus, the number of convolution layers in the DDC-block is the same as that in the VGG-block.
For the depthwise convolution, the number of filters is required to equal the number of input channels, which means the spatial convolution is performed independently in each input channel. For the pointwise convolution, 1×1 filters are used to project the outputs of the depthwise layer onto a new feature space. It has been demonstrated that the stack of depthwise and pointwise layers can reduce the number of parameters by a factor of 1/N+1/(w_{width}·w_{height}), where w_{width} and w_{height} represent the width and height of the filter, respectively, and N is the number of output channels [31]. Such a design can significantly alleviate the overfitting problem. Additionally, we propose to use dilated filters in the depthwise layer. Dilated filters increase the size of the receptive field without introducing extra parameters. We carried out experiments (not shown here) and found that the use of dilation has a positive impact on both the AT and SED tasks. The process of depthwise separable convolution with a dilation rate is shown in Fig. 3.
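The parameter saving can be verified with a few lines of arithmetic (the channel and filter sizes below are illustrative, not the paper's exact configuration; biases are ignored):

```python
def conv_params(c_in, c_out, kw, kh):
    """Weight count of a standard 2-D convolution layer (bias omitted)."""
    return c_in * c_out * kw * kh

def separable_params(c_in, c_out, kw, kh):
    """Depthwise (one kw x kh filter per input channel) plus 1x1 pointwise."""
    return c_in * kw * kh + c_in * c_out

# 3x3 filters, 64 input channels, N = 128 output channels.
std = conv_params(64, 128, 3, 3)       # 73728 weights
sep = separable_params(64, 128, 3, 3)  # 576 + 8192 = 8768 weights
ratio = sep / std                      # equals 1/N + 1/(kw*kh) = 1/128 + 1/9
```

The ratio is exactly 1/N + 1/(w_width · w_height), matching the reduction factor cited from [31].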
We now present the detailed configuration of the model used in the segmentation mapping stage. We first apply four DDC-blocks to the input log-mel spectrogram; the channel numbers of the four blocks are 32, 64, 128, and 128, respectively. Then, a K-channel convolution layer with 1×1 filters converts the output of the last DDC-block into TF segmentation masks through sigmoid activation functions. Finally, a global pooling function aggregates the estimated mask into the clip-level prediction. The proposed network has a depth similar to that of the "VGG-like" network in [24], which enables a fair comparison. The details of the proposed model architecture are summarized in Table 1.
2.2 The classification mapping stage
In this subsection, we model the classification mapping \(\hat{\mathbf{M}}\rightarrow\hat{\mathbf{y}}\) via the pooling function. To this end, pooling functions such as GMP, GAP, and GWRP [26] are commonly used in the SED field. GMP considers only the TF bin with the maximum probability, leading to a constrained gradient path and inefficient computation. GAP assumes that all TF bins contribute equally to the clip-level prediction, which means GAP is unable to focus on specific TF bins.
(1) Global weighted rank pooling. GWRP can be understood as a generalization of GMP and GAP. The main idea of GWRP is to put high weights on the TF bins with high values [26]. The TF bins of the kth sound event mask \(\hat{\mathbf{M}}_{k}\) are sorted in descending order, and the jth element of the sorted sequence is denoted by \(\hat{M}_{k,j}\). The clip-level prediction can be represented as:

$$ \hat{y}_{k} = \frac{1}{Z(r_{k})} \sum_{j=1}^{TF} r_{k}^{\,j-1} \hat{M}_{k,j}, \qquad Z(r_{k}) = \sum_{j=1}^{TF} r_{k}^{\,j-1} $$
where r_{k}∈[0,1] is a hyperparameter that controls the behavior of the GWRP function. Notice that GWRP reduces to GMP for r_{k}=0 and to GAP for r_{k}=1. Since the weight grows as \(\hat{M}_{k,j}\) becomes large, the aggregation performance is improved compared with GMP and GAP. However, the performance of GWRP depends strongly on the interpolation coefficient r_{k}, which is difficult to choose in practice. We thus propose the FAP function, which can determine the interpolation coefficient automatically as required.
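A minimal sketch of GWRP over a flattened mask (pure Python, for illustration; the mask values are made up):

```python
def gwrp(mask_values, r):
    """Global weighted rank pooling: sort values in descending order,
    weight the j-th value by r**(j-1), and normalize by the weight sum."""
    vals = sorted(mask_values, reverse=True)
    weights = [r ** j for j in range(len(vals))]  # r**0, r**1, ...
    return sum(w * v for w, v in zip(weights, vals)) / sum(weights)

bins = [0.9, 0.1, 0.4, 0.2]
gwrp(bins, 0.0)  # reduces to GMP: only the largest bin survives
gwrp(bins, 1.0)  # reduces to GAP: plain average of all bins
```

Intermediate values of r interpolate between the two extremes, which is exactly the behavior controlled by the manually chosen decay coefficient.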
(2) Frequency-dependent auto-pooling. The FAP function is an improved version of softmax pooling, which introduces a learnable parameter vector \(\boldsymbol{\alpha}_{k}=[\alpha_{k}(1),\alpha_{k}(2),\cdots,\alpha_{k}(F)]^{\mathrm{T}} \in \text{I\!R}^{F \times 1}\) as weighting coefficients for the kth class. FAP treats \(\boldsymbol{\alpha}_{k}\) as a free vector that is learned during training. The expression of FAP is:

$$ \hat{y}_{k} = \frac{\sum_{t=1}^{T}\sum_{f=1}^{F} \hat{M}_{k}(t,f)\, \exp\left(\alpha_{k}(f)\, \hat{M}_{k}(t,f)\right)}{\sum_{t=1}^{T}\sum_{f=1}^{F} \exp\left(\alpha_{k}(f)\, \hat{M}_{k}(t,f)\right)} $$
where α_{k}(f) is the weight of the fth frequency band. Note that the frequencyvarying α_{k}(f) is shared among all frames in each frequency band.
We now show the relationship between the proposed FAP and several well-known pooling functions. The proposed FAP function can be treated as an extension of the AP proposed in [22]. FAP is specifically designed for 2-D data such as spectrograms, so prior information on the frequency distribution of event classes can be taken into account during aggregation. That is, FAP can focus on the crucial frequency bands adaptively by learning the vectors. Additionally, FAP reduces to GMP, softmax pooling, and GAP when α_{k}(f)→∞, α_{k}(f)=1, and α_{k}(f)=0, respectively.
A word on the bound of the parameter α_{k}(f) is appropriate here. For α_{k}(f)<0, FAP behaves like min-pooling, and hence TF bins with smaller values attract most of the attention, which is not desired. On the other hand, for α_{k}(f)→∞, FAP degenerates to GMP, which may cause the gradient explosion problem. We thus propose to constrain the parameter to 0<α_{k}(f)<α_{max}, where α_{max} is a predefined constant. In this paper, α_{max}=10 is chosen empirically and achieves satisfactory performance.
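The interpolation property of FAP can be demonstrated with a small sketch (pure Python with a toy 2×2 mask; not the authors' implementation, and the α values are illustrative only):

```python
import math

def fap(mask, alpha):
    """Frequency-dependent auto-pooling over a T x F mask.

    alpha[f] is the weight of frequency band f; per the text it is
    constrained to (0, alpha_max) during training.
    """
    num = den = 0.0
    for row in mask:                 # loop over the T time frames
        for f, m in enumerate(row):  # loop over the F frequency bands
            w = math.exp(alpha[f] * m)
            num += m * w
            den += w
    return num / den

mask = [[0.9, 0.1], [0.3, 0.2]]
mean_pool = fap(mask, [0.0, 0.0])    # alpha = 0 recovers GAP (plain mean)
near_max  = fap(mask, [50.0, 50.0])  # large alpha approaches GMP (max = 0.9)
```

With per-band α values in between, the pooling weights the frequency bands unequally, which is the mechanism by which FAP exploits the frequency characteristics of each event class.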
3 Experimental results
3.1 Data preparation
We utilize the audio clips of the DCASE 2018 Task 1 dataset as background noises, which are recorded in 10 scenes such as metro stations and shopping malls. For the sound events, 3710 manually verified clips covering 41 categories are obtained from DCASE 2018 Task 2. These events involve various human activities, household events, instrument events, etc. All of the audio clips are sampled at 32 kHz. More details on the data preparation are shown in Table 2. We fix the duration of every sound event to 2 s, the same as in [24], to make sure that the generated clips are non-overlapping. Specifically, events shorter than 2 s are zero-padded to 2 s. For events longer than 2 s, we extract the first 2 s as training data and discard the rest of the recording. Three randomly selected events are mixed with the background noise without any overlap at 0 dB SNR. The onsets of the events are 0.5 s, 3 s, and 5.5 s, respectively. Thus, we ensure that the impact of overlapping is avoided, which helps us to focus on the weakly labeled task first. We synthesize 8000 audio clips and divide them into 4 cross-validation folds.
3.2 Setup
We choose the 64-band log-mel spectrogram as the input data representation of our model. A 64-ms-long Hanning window is employed for the STFT with 50% overlap. Then, each frame is converted into a 64-dimensional vector by a log-mel filter bank. This process converts a 10-s audio clip into a 64×311-dimensional log-mel spectrogram representation. The learning rate is initially set to 1e-3 and automatically reduced to 0.9 times the previous value every 1000 iterations. Xavier initialization is used to initialize the model. The experimental settings are shown in Table 3, and each result reported in this paper is obtained by averaging over 10 independent experiments.
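The frame count of 311 follows directly from the framing parameters, assuming the common no-padding convention (this arithmetic is a sanity check, not taken from the paper):

```python
# 10-s clip at 32 kHz, 64-ms Hanning window, 50% overlap (hop = 32 ms).
sr = 32000
win = int(0.064 * sr)   # 2048 samples per window
hop = win // 2          # 1024 samples between frame starts
n_samples = 10 * sr     # 320000 samples per clip
n_frames = 1 + (n_samples - win) // hop
# n_frames == 311, matching the 64 x 311 log-mel representation
```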
We observed that the prediction result of the AT task is more reliable than that of the SED task. In order to reduce false positives for the SED task, we first evaluate the clip-level predictions of each clip. Only the classes that are predicted as active at the clip level are selected for the evaluation of the frame-level predictions [24]. Since the lengths of most events are longer than 10 frames, we treat predictions shorter than 10 frames as false-positive cases and remove them to reduce insertions. Moreover, some classes, such as "Knock", occur discontinuously in audio clips, but their ground-truth frame-level labels are always continuous. Thus, events shorter than 10 frames are removed, and silence gaps between events shorter than 10 frames are merged. We set an onset collar of 200 ms and an offset collar of 200 ms/50% to count the true positives of the prediction, which is similar to the configuration of [32].
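The minimum-duration and gap-merging rules can be sketched as follows (a hypothetical helper written from the description above; the paper does not give its exact implementation, and the interval format with exclusive offsets is our choice):

```python
def postprocess(events, min_len=10, min_gap=10):
    """Merge short silence gaps, then drop short events.

    events: sorted list of (onset, offset) frame intervals, offset exclusive.
    Gaps shorter than min_gap frames are merged into one event; the
    surviving events must span at least min_len frames.
    """
    merged = []
    for on, off in events:
        if merged and on - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], off)   # close the short silence gap
        else:
            merged.append((on, off))
    return [(on, off) for on, off in merged if off - on >= min_len]

# A 4-frame gap is merged; the resulting 25-frame event survives,
# while the isolated 5-frame event is discarded as a likely insertion.
postprocess([(0, 12), (16, 25), (100, 105)])
```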
As for evaluation metrics, we use the F-score, area under the curve (AUC), mean average precision (mAP) [32], and error rate (ER) to evaluate the performance of the AT and SED tasks. The F-score is computed as the harmonic mean of precision and recall. The AUC is the area under the ROC curve, which plots the true-positive rate (TPR) versus the false-positive rate (FPR). The mAP is the average of precision at different recall values regardless of thresholds; it evaluates the model comprehensively and is widely used in the weakly supervised SED field. The error rate (ER) measures the number of errors, including deletions (D) and insertions (I). ER is a score rather than a percentage and can become larger than 1 when the system makes more errors than correct predictions.
3.3 Performance evaluation of the pooling functions
In this section, we evaluate the performance of five pooling functions: GMP, GAP, GWRP, AP, and FAP. To validate the effectiveness of the FAP function fairly, we employ the same model in the segmentation mapping stage. Specifically, the DDC-blocks in Table 1 are replaced with four VGG-blocks. For the GWRP function, the interpolation coefficient r_{k} is set to 0.9998 as in [24].
As seen from Table 4, GMP and GAP are inferior to the other approaches due to their impractical assumptions. As expected, FAP achieves the highest scores among all the evaluated methods in terms of mAP and AUC for both the AT and SED tasks. In particular, FAP achieves a significant performance improvement over GWRP and AP. This is mainly because the FAP function can automatically interpolate between different pooling behaviors through the learnable frequency-wise vectors.
Figure 4 illustrates the parameters learned by the proposed FAP function. Clearly, the learned weights vary with the frequency bands for certain acoustic events, and the weight vectors α_{k} exhibit different distribution characteristics for different sound events. For instance, the energy of keys_jangling events and bark events is mainly distributed in the high- and low-frequency bands, respectively, and the corresponding vectors α_{k} show a tendency consistent with the energy distribution. We observed in the experiments that α_{k}(f)≤3 in most cases, so we present the results for α_{max}=3, 5, and 10, respectively. Table 5 investigates the effect of the upper bound of α_{k}(f) on the overall performance. With α_{max}=10, the FAP function achieves the best performance, and hence this value is used for the proposed method in the other experiments.
3.4 Performance evaluation of the proposed method
We compare the performance of DDC-blocks and VGG-blocks combined with the GWRP, AP, and FAP functions. Two well-known MIL methods, i.e., Attention [20] and TALNet [19], are also included for a comprehensive comparison. The results are shown in Table 6.
We first compare the performance of DDC-blocks and VGG-blocks. Using the same pooling function, the methods with DDC-blocks outperform those with VGG-blocks in terms of F1-score, AUC, and mAP. As for the model size, the number of parameters required by the DDC-block-based approaches decreases by 49.5% compared with the VGG-block-based methods. Thus, the use of DDC-blocks significantly reduces the number of parameters while achieving a better segmentation performance. Moreover, DDC-FAP achieves the highest mAP and AUC, the lowest ER, and the fewest insertions among all the source separation-based methods. It turns out that the combination of the DDC-block and FAP achieves a significant performance improvement over the method in [24].
In the AT task, DDC-FAP is comparable with Attention [20] and TALNet [19]. The proposed method achieves a somewhat higher AUC score in the AT task, indicating that DDC-FAP makes fewer false-negative predictions, which is consistent with the observation that DDC-FAP produces fewer deletions than Attention [20]. As for the SED task, DDC-FAP clearly achieves the highest mAP (0.427) and AUC (0.868) and hence outperforms the other methods. To provide a better illustration, we summarize the results of the 10 independent experiments in the box plots of the main metrics in Fig. 5. The figure shows that the results of DDC-FAP are relatively stable on these metrics and superior to the other methods in the SED task.
To show the performance of the aforementioned methods more intuitively, we plot the number of model parameters on the abscissa and the mAP of the SED task on the ordinate in Fig. 6. An ideal SED system should require fewer parameters and achieve a higher mAP. It can be seen from Fig. 6 that the methods using the DDC-block outperform all the other approaches.
3.5 Performance evaluation on DCASE 2020 Task 4
Compared with the aforementioned synthetic dataset, some of the events in DCASE 2020 Task 4 are overlapping. The DCASE 2020 Task 4 dataset mainly consists of the FUSS dataset and the DESED dataset. The FUSS dataset, used for the sound separation task, does not provide labels for event classes, so the DDC-FAP method cannot utilize it for training. The DESED dataset is used for the SED task and consists of strongly labeled, weakly labeled, and unlabeled audio clips. We evaluate the proposed method on the weakly labeled training set of DESED. The results are shown in Table 7. The performance of the compared methods on the DCASE 2020 Task 4 dataset is similar to that in Section 3.4. For the SED task, DDC-FAP still achieves the best results on all metrics. For the AT task, DDC-FAP ranks 2nd, slightly behind Attention. The experimental results show that although the proposed method is not specifically designed for overlap, it still performs well in the overlapping case.
3.6 Performance evaluation on DCASE 2017 Task 4
Data imbalance is also a challenging problem for the SED task. We utilize the DCASE 2017 Task 4 dataset to verify the effectiveness of DDC-FAP in the unbalanced situation. We add a mini-batch data balancing operation to ensure that the number of samples of the most frequent events is at most five times that of the least frequent events in a mini-batch. To be consistent with [19], the class-specific thresholds that achieve the highest F-score on the AT task are used to make the clip-level predictions. The performance of the AT and SED tasks is evaluated at the clip level and the 1-s segment level, respectively. In addition to the aforementioned methods, an adaptive distance-based pooling function has been proposed recently, which compensates for non-relevant information of audio events by applying an adaptive transformation along the temporal axis. For a comprehensive comparison, we show the results of these methods in Table 8.
It can be seen that the proposed method achieves the best F-score in both the AT and SED tasks. Besides being unbalanced, the dataset also contains overlapping sound events. Moreover, the events of DCASE 2017 Task 4 have not been padded or trimmed, and therefore exhibit a more natural and diverse distribution of durations. The results show that DDC-FAP performs well in these situations, which indicates its robustness to complex scenes.
4 Conclusion
In this paper, we proposed a novel source separation-based method for weakly supervised SED. In the segmentation mapping stage, we designed a model consisting of four DDC-blocks to convert the input feature into the TF mask of each sound event. To utilize prior frequency information, we proposed the FAP function, which introduces learnable vectors to find the key frequency bands when aggregating the TF masks. Both the temporal locations of the predefined events and the separated waveforms can be obtained from the trained TF mask. Extensive experiments demonstrated that the DDC-block is more effective and computationally lighter than the VGG-block in the segmentation mapping stage, and that the FAP function outperforms the widely used pooling operators. The proposed DDC-FAP method achieves better performance than the state-of-the-art source separation-based methods in various situations, including the non-overlapped, overlapped, and unbalanced cases.
Availability of data and materials
The datasets analyzed during the current study are available in the DCASE 2018 repository, http://dcase.community/challenge2018/task-general-purpose-audio-tagging
Abbreviations
SED: Sound event detection; DDC-block: Dilated depthwise separable convolution block; TF: Time-frequency; FAP: Frequency-dependent auto-pooling; AT: Audio tagging; MIL: Multiple instance learning; SVM: Support vector machine; GMM: Gaussian mixture model; GMP: Global max pooling; GAP: Global average pooling; AP: Auto-pooling; IRM: Ideal ratio mask; GWRP: Global weighted rank pooling; AUC: Area under the curve; mAP: Mean average precision; ER: Error rate; TPR: True-positive rate; FPR: False-positive rate; D: Deletion; I: Insertion
References
T. Virtanen, M. D. Plumbley, D. Ellis, Computational analysis of sound scenes and events (Springer, Heidelberg, 2018).
Y. Lavner, R. Cohen, D. Ruinskiy, H. IJzerman, in 2016 IEEE International Conference on the Science of Electrical Engineering (ICSEE). Baby cry detection in domestic environment using deep learning, (2016), pp. 1–5.
G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, A. Sarti, in 2007 IEEE Conference on Advanced Video and Signal Based Surveillance. Scream and gunshot detection and localization for audiosurveillance systems, (2007), pp. 21–26.
A. Jati, D. Emmanouilidou, in ICASSP 20202020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Supervised deep hashing for efficient audio event retrieval, (2020), pp. 4497–4501.
M. Ravanelli, B. Elizalde, K. Ni, G. Friedland, in 2014 22nd European Signal Processing Conference (EUSIPCO). Audio concept classification with hierarchical deep neural networks, (2014), pp. 606–610.
H. Zhang, I. McLoughlin, Y. Song, in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Robust sound event recognition using convolutional neural networks, (2015), pp. 559–563.
E. Cakır, G. Parascandolo, T. Heittola, H. Huttunen, T. Virtanen, Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Trans. Audio. Speech. Lang. Process.25:, 1291–1303 (2017).
D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, M. D. Plumbley, Detection and classification of acoustic scenes and events. IEEE Trans. Multimed.17:, 1733–1746 (2015).
A. Mesaros, T. Heittola, T. Virtanen, in 2016 24th European Signal Processing Conference (EUSIPCO). TUT database for acoustic scene classification and sound event detection, (2016), pp. 1128–1132.
J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, M. Ritter, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Audio set: an ontology and humanlabeled dataset for audio events, (2017), pp. 776–780.
A. Kumar, B. Raj, in Proceedings of the 24th ACM International Conference on Multimedia. Audio event detection using weakly labeled data, (2016).
M. Ilse, J. M. Tomczak, M. Welling, in International conference on machine learning. Attentionbased deep multiple instance learning, (2018), pp. 2127–2136.
T. W. Su, J. Y. Liu, Y. H. Yang, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Weaklysupervised audio event detection using eventspecific gaussian filters and fully convolutional networks, (2017), pp. 791–795.
S. E. Küçükbay, M. Sert, in Proceedings of the 2015 IEEE 9th International Conference on Semantic Computing (IEEE ICSC 2015). Audiobased event detection in office live environments using optimized MFCCSVM approach, (2015), pp. 475–480.
A. Kumar, B. Raj, in 2017 International Joint Conference on Neural Networks (IJCNN). Audio event and scene recognition: a unified approach using strongly and weakly labeled data, (2017), pp. 3475–3482.
S. Adavanne, G. Parascandolo, P. Pertilä, T. Heittola, T. Virtanen, in Scenes and Events 2016 Workshop (DCASE2016). Sound event detection in multichannel audio using spatial and harmonic features, (2016), p. 6.
M. Espi, M. Fujimoto, K. Kinoshita, T. Nakatani, Exploiting spectro-temporal locality in deep learning based acoustic event detection. EURASIP J. Audio Speech Music Process. 2015, 26 (2015).
D. de Benito-Gorron, A. Lozano-Diez, D. T. Toledano, J. Gonzalez-Rodriguez, Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset. EURASIP J. Audio Speech Music Process. 2019, 9 (2019).
Y. Wang, J. Li, F. Metze, in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling, (2019), pp. 31–35.
Y. Xu, Q. Kong, W. Wang, M. D. Plumbley, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Large-scale weakly supervised audio classification using gated convolutional neural network, (2018), pp. 121–125.
C. Yu, K. S. Barsim, Q. Kong, B. Yang, Multi-level attention model for weakly supervised audio classification. CoRR abs/1803.02353 (2018). http://arxiv.org/abs/1803.02353.
B. McFee, J. Salamon, J. P. Bello, Adaptive pooling operators for weakly labeled sound event detection. IEEE/ACM Trans. Audio Speech Lang. Process. 26, 2180–2193 (2018).
Q. Kong, Y. Xu, W. Wang, M. D. Plumbley, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). A joint separation-classification model for sound event detection of weakly labelled data, (2018), pp. 321–325.
Q. Kong, Y. Xu, I. Sobieraj, W. Wang, M. D. Plumbley, Sound event detection and time–frequency segmentation from weakly labelled data. IEEE/ACM Trans. Audio Speech Lang. Process. 27, 777–787 (2019).
T. Heittola, A. Mesaros, T. Virtanen, A. Eronen, in 2011 Machine Listening in Multisource Environments. Sound event detection in multisource environments using source separation, (2011), pp. 36–40.
F. Pishdadian, G. Wichern, J. Le Roux, Finding strength in weakness: learning to separate sounds with weak supervision. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 2386–2399 (2020).
A. Narayanan, D. Wang, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Ideal ratio mask estimation using deep neural networks for robust speech recognition, (2013), pp. 7092–7096.
A. Kolesnikov, C. H. Lampert, in European Conference on Computer Vision. Seed, expand and constrain: three principles for weakly-supervised image segmentation, (2016), pp. 695–711.
F. Chollet, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Xception: deep learning with depthwise separable convolutions, (2017).
J. Guo, Y. Li, W. Lin, Y. Chen, J. Li, Network decoupling: from regular to depthwise separable convolutions. CoRR abs/1808.05517 (2018). http://arxiv.org/abs/1808.05517.
A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017). http://arxiv.org/abs/1704.04861.
A. Mesaros, T. Heittola, T. Virtanen, Metrics for polyphonic sound event detection. Appl. Sci. 6, 162 (2016).
I. Martín-Morató, M. Cobos, F. J. Ferri, Adaptive distance-based pooling in convolutional neural networks for audio event classification. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1925–1935 (2020).
Acknowledgements
Not applicable.
Funding
This work was supported by the Youth Innovation Promotion Association of Chinese Academy of Sciences under Grant 2018027, National Natural Science Foundation of China under Grants 11804368 and 11674348, IACAS Young Elite Researcher Project QNYC201812, National Key R&D Program of China under Grant 2017YFC0804900, and the Strategic Priority Research Program of Chinese Academy of Sciences under Grant XDC02020400.
Author information
Contributions
SCL conducted the research and performed the experiments. FRY and YC supervised the experimental work and polished the structure and text of the manuscript. JY provided guidance for the whole work. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, S., Yang, F., Cao, Y. et al. Frequency-dependent auto-pooling function for weakly supervised sound event detection. J AUDIO SPEECH MUSIC PROC. 2021, 19 (2021). https://doi.org/10.1186/s13636-021-00206-7