
Sound event triage: detecting sound events considering priority of classes

Abstract

We propose a new task for sound event detection (SED): sound event triage (SET). The goal of SET is to detect an arbitrary number of high-priority event classes while allowing misdetections of low-priority event classes, where the priority is given for each event class. Conventional methods of SED for targeting a specific sound event class can give priority to only a single event class. Moreover, the level of priority is not adjustable, i.e., the conventional methods can use only the type of target event class, such as a one-hot vector, as an input. To flexibly control information on the target events, the proposed SET exploits not only the types of target sound but also the extent to which each target sound is detected with priority. To implement the detection of events with priority, we propose class-weighted training, in which the loss functions and the network are stochastically weighted by the priority parameter of each class. As this is the first paper on SET, we particularly introduce an implementation of single target SET, which is a subtask of SET. The results of experiments using the URBAN–SED dataset show that the proposed method of single target SET outperforms the conventional SED method by 8.70, 6.66, and 6.09 percentage points for “air_conditioner,” “car_horn,” and “street_music,” respectively, in terms of the intersection-based F-score. For the class-average score, the proposed methods increase the intersection-based F-score by up to 3.37 percentage points compared with the conventional SED and other target-class-conditioned models.

1 Introduction

In our everyday life, humans utilize much information obtained from various environmental sounds [1]. The automatic analysis of environmental sounds will lead to the realization of many applications, e.g., anomalous sound detection systems [2], life-logging systems [3], systems for hard-of-hearing persons [4], systems for smart cars [5], and monitoring systems [6].

Sound event detection (SED) [7] is a major task in environmental sound analysis, which identifies sound event classes (e.g., “dog barking,” “car passing by,” and “people walking”) together with their time stamps. In conventional SED, many methods based on the hidden Markov model (HMM) [8, 9] and non-negative matrix factorization (NMF) [10, 11] have been proposed. Recently, numerous deep neural network (DNN)-based SED methods have been developed. In DNN-based SED, the convolutional neural network (CNN) [12], recurrent neural network (RNN) [13], and convolutional bidirectional gated recurrent unit (CNN–BiGRU) [14] have been applied. Moreover, some studies have shown that the self-attention-based Transformer [15, 16] and Conformer [17] are useful for SED.

The target sounds to be analyzed depend on the user or application. In the analysis of environmental sounds, a method to target a specific class in sound event localization and detection (SELD) has been proposed [18]. For SED, target sound detection (TSD) has been proposed by Yang et al. [19], in which only a single target sound class is detected, where a reference audio signal or a one-hot vector of the target sound is input to the TSD model. In SED, the goal is to generalize the performance of detecting all sound events, i.e., the SED models are trained to detect all of the sound events in a dataset equally. The widely used objective function of SED, binary cross entropy, is equally weighted for each event class. In real environments, however, the detection priority for each event depends on the user or application. For example, when SED is used for a surveillance system, anomalous events such as “gunshot” or “baby crying” have to be preferentially detected over other events. On the other hand, in the case of a life-logging system, sound events such as “kettle” or “footsteps” have to be detected more preferentially in addition to other events. The conventional TSD system is trained only once on many event types and allows the user to choose a target event to focus on during inference. However, its limitation is that the degree of interest cannot be controlled.

To tackle this problem, we propose a new SED task: sound event triage (SET). The goal of SET is to improve the performance of detecting an arbitrary number of high-priority sound event classes while allowing the performance of detecting low-priority event classes to be compromised, where the priority is given for each event class; this is the triage. The concept of the SET task is illustrated in Fig. 1. A SET system enables user-preference sound event detection. The difference between the conventional methods for SED, including TSD [19], and the SET task is whether the degree to which events are detected with priority can be set. Both SET and TSD models are trained once and can then select a class of interest. Furthermore, only SET models can control the degree of interest in that class, i.e., the priority, at the inference stage. For this first paper on SET, we design a network architecture for single target SET, a subtask of SET wherein a single event class is targeted with priority, and evaluate it in detail. We propose a method for single target SET in which loss-conditional training [20] is utilized for detecting sound events with priority.

Fig. 1 Concept of SET task

2 Related works

In this section, we describe works related to the proposed method. The section comprises three subsections: strongly supervised SED, conventional methods of environmental sound analysis using class-conditional techniques, and You Only Train Once (YOTO) [20], with which an arbitrary linear combination of loss weights for multiple tasks can be handled by a single model.

2.1 Strongly supervised SED

In strongly supervised SED, given a SED model f, model parameters \(\varvec{\Theta }\), an acoustic feature \(\textbf{X}\), and a ground truth \(z_{n,t} \in \{0,1\}\) for a sound event n at time frame t, the SED model outputs the probability \({y}_{n,t}\) for event n at time frame t:

$$\begin{aligned} {y}_{n,t} = {P}( z_{n,t} \mid f, \varvec{\Theta }, \textbf{X}). \end{aligned}$$
(1)

In the training of the DNN-based SED model, to optimize the model parameters \(\varvec{\Theta }\), the following binary cross-entropy (BCE) loss function is used:

$$\begin{aligned} \mathcal {L}_{\textrm{SED}} = - \sum _{n=1}^{N} \left\{ \textbf{z}_{n} \log s(\textbf{y}_{n}) + (1 - \textbf{z}_{n}) \log \left( 1 - s(\textbf{y}_{n}) \right) \right\} \end{aligned}$$
(2)
$$\begin{aligned} = - \sum _{n=1}^{N} \sum _{t=1}^{T} \left\{ z_{n,t} \log s(y_{n,t}) + (1 - z_{n,t}) \log \left( 1 - s(y_{n,t}) \right) \right\} , \end{aligned}$$
(3)

where \(s(\cdot )\) denotes the sigmoid function. N and T are the numbers of sound event classes and total time frames, respectively. In an inference stage of SED, \(s(y_{n,t})\) is binarized with a predefined threshold to obtain detection results. As can be seen in Eq. 3, all sound events are equally weighted for generalizing the performance of detecting all sound events.
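For illustration, the following is a minimal PyTorch sketch of the frame-level BCE loss in Eq. 3; the tensor shapes and the function name sed_bce_loss are our own assumptions, not part of any published implementation.

```python
import torch

def sed_bce_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Eq. (3): BCE summed over N event classes and T frames, equally weighted.

    logits, targets: tensors of shape (N, T); targets take values in {0, 1}.
    """
    s = torch.sigmoid(logits)                      # s(y_{n,t})
    eps = 1e-7                                     # numerical stability
    bce = -(targets * torch.log(s + eps)
            + (1 - targets) * torch.log(1 - s + eps))
    return bce.sum()                               # every class and frame weighted equally
```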

2.2 Methods of environmental sound analysis for targeting a specific sound

In the analysis of environmental sounds, several methods considering user-guided or target-class-conditioned information have been proposed [18, 19]. In works related to SED, TSD [19] and class-conditioned SELD [18], in which information on target event classes or sounds is employed, have been studied. Yang et al. [19] have proposed the TSD task derived from SED, which detects target sound events given a condition describing the target event class. In the TSD network, a reference signal or a one-hot vector of the target event class is input to the network as a condition, which is embedded and then fused with a SED network. Slizovskaia et al. [18] have proposed class-conditioned SELD, which, similarly to TSD, analyzes a specific target event and detects it using a one-hot vector of the target class. These conventional systems related to SED can only handle information about the types of sound to be analyzed, not the degree of priority given to the events of special interest.

2.3 YOTO

YOTO [20] is a technique that enables a single network to change into various specialist models without retraining in inference stages. Each specialist model has better performance for a particular task. For example, assume two specialist models, an image-quality specialist and a compression-rate specialist [20]. A single network using YOTO can change into either of the two specialists, or into a model with an expertise intermediate between them, without retraining in inference stages. The YOTO scheme is efficient in terms of training cost and model complexity compared with training a separate network for each type of expertise.

In YOTO, we assume a problem setting in which a single DNN-based network performs multiple tasks and is trained with a loss for each task. Let \(\mathcal {L}_m\) be the loss function for task m. The following loss function is often used for optimizing the parameters of the network:

$$\begin{aligned} \mathcal {L} = \sum\limits_{m=1}^{M} \lambda _{m} \mathcal {L}_{m}, \end{aligned}$$
(4)

where \(\lambda _{m} (0 \le \lambda _{m} \le 1.0 )\) is the m-th element of a vector \(\varvec{\lambda }\) for balancing among the tasks. M denotes the number of tasks. In the training stage, the single network is trained with various \(\varvec{\lambda }\). In inference stages, an arbitrary \(\varvec{\lambda }\) can be input to the trained single network. When \(\lambda _{m}\) is set larger than the weights of the other tasks, the network focuses on the training and/or inference of task m instead of the other tasks.
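As a rough illustration of Eq. 4, the sketch below combines per-task losses with a weight vector that is re-sampled for each training batch and supplied by the user at inference; the function name yoto_loss and the tensor shapes are our assumptions.

```python
import torch

def yoto_loss(task_losses: list, lam: torch.Tensor) -> torch.Tensor:
    """Eq. (4): weighted sum of M per-task losses.

    task_losses: list of M scalar loss tensors; lam: weights of shape (M,).
    During training, lam is re-sampled for every batch and also fed to the
    network as a condition; at inference, an arbitrary lam can be chosen.
    """
    return (torch.stack(task_losses) * lam).sum()
```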

3 Proposed method

3.1 Framework of SET

In the SET task, an arbitrary number of event classes are detected with priority. In the training stage of SET, in addition to acoustic features and model parameters, triage weights are given for detecting target events with priority:

$$\begin{aligned} {y}_{n,t} = {P}( z_{n,t} \mid f, \boldsymbol{\Theta }, \textbf{X}, \boldsymbol{\lambda }), \end{aligned}$$
(5)

where \(\boldsymbol{\lambda } = (\lambda _{1}, \lambda _{2}, ..., \lambda _{N})\) are the parameters for the triage, that is, for detecting sound events with priority. \(\lambda _{n} (0 \le \lambda _{n} \le 1.0 )\) is the triage weight for sound event n. In an inference stage of SET, an arbitrary triage parameter \(\varvec{\lambda }\) is input to the SET model. When \(\lambda _{n}\) is set to a larger value than the others, the model targets sound event n with higher priority than the other events. Examples of the loss functions on the right side of Fig. 2 are described in Section 3.4.

Fig. 2 Overview of SET in training stage

3.2 Class-weighted training

In DNN-based SED, as shown in Eq. 3, the BCE loss can be divided into losses for each event class. In other words, the BCE loss can be regarded as the sum of losses for the task of detecting each sound event class. To train a SET model, the following loss function is used:

$$\begin{aligned} \mathcal {L}_{\textrm{SET}} = \mathcal {L}(F(\textbf{X},\boldsymbol{\lambda }),\textbf{Z},\boldsymbol{\lambda }) \end{aligned}$$
(6)
$$\begin{aligned} = \sum _{n=1}^{N} N\lambda _{n} \mathcal {L}_{\textrm{SED}}(\textbf{y}_{n},\textbf{z}_{n}), \end{aligned}$$
(7)

where F(\(\cdot\)) is the output of a SED model. \(\textbf{Z}\) and \({\varvec{\lambda }}\) indicate the ground truths and the triage parameters, respectively. \(\mathcal {L}_{\textrm{SED}}(\textbf{y}_{n},\textbf{z}_{n})\) and \(\lambda _{n}\) are the loss function and triage parameter for event class n, where \(\textbf{y}_{n}\) and \(\textbf{z}_{n}\) are \(\{y_{n,1} \dots y_{n,t} \dots y_{n,T}\}\) and \(\{z_{n,1} \dots z_{n,t} \dots z_{n,T}\}\), respectively. The loss function for each event class is weighted by the priority parameter of that class. We call this training scheme “class-weighted training.” In Eq. 7, we normalize \(\varvec{\lambda }\) so that \(\sum _{n}\lambda _{n}=1.0\), and scale the loss function by multiplying by N. As can be seen from Eq. 6, both the SED model and the loss function are conditioned on the triage parameters.

To use arbitrary \(\lambda _{n}\) in inference stages of a SET network, \(\varvec{\lambda } = (\lambda _{1}, \lambda _{2},... ,\lambda _{N})\) are repeatedly and randomly sampled from a distribution during training so that the single SET network covers various \(\varvec{\lambda }\) values. The sampled parameters \(\varvec{\lambda }\) are input to the SET network and used for the loss calculation (Eq. 7) in the training stage. As shown in Fig. 2, the \(\varvec{\lambda }\) values are first fed to two multilayer perceptrons (MLPs). The MLPs output two vectors, \(\varvec{\mu } = (\mu _{1}, \mu _{2},... ,\mu _{C})\) and \(\varvec{\sigma } = (\sigma _{1}, \sigma _{2},... ,\sigma _{C})\). As shown in Fig. 3, feature-wise linear modulation (FiLM) [21] is then used to bridge between the outputs of the MLPs and the SED model for detecting events (Fig. 2). FiLM is applied to a feature map in the CNN layers of the SED model. The feature map is multiplied by \(\varvec{\sigma }\) and shifted by \(\varvec{\mu }\): \({ \mathcal {\hat{M}}}_{ijc} = \mathcal {M}_{ijc}\sigma _{c}+\mu _{c}\). Here, \(\mathcal {M}_{ijc}\) is a feature at a location (i, j) of a channel index c in a feature map of a CNN layer, as shown in Fig. 3. FiLM has been reported to perform better than including the conditional information, e.g., the triage weights, as additional inputs to the network [20]. Because all the CNN layers in our network have the same number of channels, we feed the same \(\varvec{\mu }\) and \(\varvec{\sigma }\) to all of them for convenience. In addition to the conditioning of the SED model, the sampled \(\varvec{\lambda }\) values are directly used in the losses (Eq. 7) in training stages.
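The following is a minimal sketch of this conditioning path; the module name TriageFiLM and the hidden layer size are placeholders, not the exact architecture (which is given in Table 1).

```python
import torch
import torch.nn as nn

class TriageFiLM(nn.Module):
    """Map triage weights lambda (length N) to channel-wise FiLM parameters
    sigma and mu (length C) and modulate a CNN feature map (Figs. 2 and 3)."""

    def __init__(self, num_classes: int, num_channels: int, hidden: int = 64):
        super().__init__()
        self.mlp_sigma = nn.Sequential(nn.Linear(num_classes, hidden), nn.ReLU(),
                                       nn.Linear(hidden, num_channels))
        self.mlp_mu = nn.Sequential(nn.Linear(num_classes, hidden), nn.ReLU(),
                                    nn.Linear(hidden, num_channels))

    def forward(self, feat: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
        # feat: (batch, C, freq, time) CNN feature map; lam: (batch, N) triage weights
        sigma = self.mlp_sigma(lam).unsqueeze(-1).unsqueeze(-1)  # (batch, C, 1, 1)
        mu = self.mlp_mu(lam).unsqueeze(-1).unsqueeze(-1)
        return feat * sigma + mu                                 # M_hat = M * sigma + mu
```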

Fig. 3 Illustration of FiLM operation

3.3 Single target SET

As the initial work on SET, we introduce a model training scheme for single target SET, which is a subtask of SET. The overview of the single target SET method in the training stage is shown in Fig. 2. For single target SET, as the distribution of the triage weights \(\lambda _{n}\), we use the Dirichlet distribution \(\mathcal {D}(\varvec{\alpha })\). The probability density function of the \((K-1)\)-dimensional Dirichlet distribution is

$$\begin{aligned} \mathcal {D}(\varvec{\alpha }) = \frac{\Gamma \left( \sum _{k=1}^{K}\alpha _{k}\right) }{\prod _{k=1}^{K}\Gamma (\alpha _{k})}\prod _{k=1}^{K}x_{k}^{\alpha _{k}-1}, \end{aligned}$$
(8)
$$\begin{aligned} \text {s.t.}\quad \sum _{k=1}^{K}x_{k}=1, \end{aligned}$$
(9)

where \(x_{k} \ge 0\) is a stochastic variable for dimension k and \(\alpha _{k} > 0\) is a parameter determining the shape of the distribution. \(\Gamma (\cdot )\) represents the gamma function. In the training stage of a SET model, we need a distribution that can be controlled to give a larger weight to a specific class (target) than to the other classes (nontarget). A Dirichlet distribution with small \(\varvec{\alpha }\) tends to produce vectors in which one class has a much larger weight than the others, which makes it suitable for single target SET. Our single target SET model is conditioned by class-weighted training using YOTO, where the triage weight \(\varvec{\lambda }\) is sampled for each sound event class, to handle arbitrary \(\varvec{\lambda }\), i.e., the priority of the detection. The inputs to the MLPs for conditioning the SED model are also multiplied by N, as shown in Fig. 2.
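A minimal sketch of this sampling step, assuming N = 10 classes and the symmetric concentration of 0.1 used later in Section 4.1; the variable names are ours.

```python
import torch

num_classes = 10   # N = K: number of sound event classes
alpha = 0.1        # small symmetric concentration -> one class tends to dominate

dirichlet = torch.distributions.Dirichlet(torch.full((num_classes,), alpha))
lam = dirichlet.sample()           # lam sums to 1 and is typically peaked on one class
lam_scaled = num_classes * lam     # scaled by N before being fed to the MLPs (Fig. 2)
```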

3.4 SET losses with priority of event classes

In this section, we specify Eq. 7 to perform our SET. We propose two loss functions for the class-weighted training of SET. First, we introduce a loss function of SET with active and inactive frames (SET–AI) as follows.

$$\begin{aligned} \mathcal {L}_{\mathrm {SET-AI}} = - \sum _{n=1}^{N} N\lambda _{n}\left\{ \textbf{z}_{n} \log s(\textbf{y}_{n}) + (1 - \textbf{z}_{n}) \log \left( 1 - s(\textbf{y}_{n}) \right) \right\} \end{aligned}$$
(10)
$$\begin{aligned} = - \sum _{n=1}^{N} \sum _{t=1}^{T} N\lambda _{n}\left\{ z_{n,t} \log s(y_{n,t}) + (1 - z_{n,t}) \log \left( 1 - s(y_{n,t}) \right) \right\} \end{aligned}$$
(11)

In SET–AI, the triage weight \(\lambda _{n}\) affects both the active and inactive frames of sound event n. When \(\lambda _{n}\) is set to a larger value than the others, the model focuses on the training and/or inference of the active and inactive frames of event n compared with the other events. In SET–AI, the loss of inactive frames is thus also multiplied by \(\lambda _{n}\).

However, a large number of inactive frames disturbs the training of active frames, as reported in [22]. Hence, we also introduce a loss function of SET with active frames (SET–A), wherein only the loss of active frames is multiplied by \(\lambda _{n}\) and the model thus focuses on the training of the active frames, as follows.

$$\begin{aligned} \mathcal {L}_{\mathrm {SET-A}} = - \sum _{n=1}^{N} \left\{ N\lambda _{n}\textbf{z}_{n} \log s(\textbf{y}_{n}) + (1 - \textbf{z}_{n}) \log \left( 1 - s(\textbf{y}_{n}) \right) \right\} \end{aligned}$$
(12)
$$\begin{aligned} = - \sum _{n=1}^{N} \sum _{t=1}^{T} \left\{ N\lambda _{n}z_{n,t} \log s(y_{n,t}) + (1 - z_{n,t}) \log \left( 1 - s(y_{n,t}) \right) \right\} \end{aligned}$$
(13)

The difference between SET–AI and SET–A is whether the loss of inactive frames is multiplied by the triage weight \(\lambda _{n}\).
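For illustration, a minimal PyTorch sketch of the two losses in Eqs. 11 and 13, assuming logits and targets of shape (N, T) and a triage vector lam of shape (N,) that sums to 1; the function names are ours.

```python
import torch

def set_ai_loss(logits, targets, lam):
    """Eq. (11): both active and inactive frames are weighted by N * lambda_n."""
    n_classes = logits.shape[0]
    s = torch.sigmoid(logits)
    eps = 1e-7
    bce = -(targets * torch.log(s + eps)
            + (1 - targets) * torch.log(1 - s + eps))       # (N, T)
    return (n_classes * lam.unsqueeze(1) * bce).sum()

def set_a_loss(logits, targets, lam):
    """Eq. (13): only the active-frame term is weighted by N * lambda_n."""
    n_classes = logits.shape[0]
    s = torch.sigmoid(logits)
    eps = 1e-7
    active = n_classes * lam.unsqueeze(1) * targets * torch.log(s + eps)
    inactive = (1 - targets) * torch.log(1 - s + eps)
    return -(active + inactive).sum()
```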

4 Experiments

4.1 Experimental conditions

To evaluate the effectiveness of our methods, we conducted the following experiments:

  • [Experiment 1]: We verified that class-weighted training enables the detection of sound events with priority in terms of F-scores (Sections 4.2.1 and 4.2.2).

  • [Experiment 2]: To analyze in more detail the properties of the proposed methods, we observed misdetection results in terms of insertion or deletion rate for the proposed methods (Section 4.2.3).

  • [Experiment 3]: We investigated how the optimal triage weights vary with the event class and evaluation metric (Section 4.2.4).

For the experiments, we used the URBAN–SED dataset [23]. URBAN–SED includes 10,000 synthetic audio clips (train, 6000; validation, 2000; test, 2000), where each clip is 10 s long with a sampling rate of 44,100 Hz. The dataset consists of 10 sound event classes. The numbers of active and inactive frames for each event are indicated in Fig. 4. As acoustic features, we used 64-dimensional log-mel band energies, which were calculated with a window size of 40 ms and a hop size of 20 ms. This setup is based on the baseline system of DCASE2018 Challenge task4 [24]. The threshold value for detecting events was tuned on the validation set for each event class and method using the intersection-based F-score. The other hyperparameters were also optimized with the intersection-based F-score. For post-processing before detection, a median filter was applied, where the filter size was tuned on the validation set for each event class and method. The batch size was 64, and models were trained for 100 epochs. To measure the detection performance, we used frame-based and intersection-based metrics [25]. For the intersection-based metric, the detection tolerance criterion (DTC) and ground truth intersection criterion (GTC) were both set to 0.5.
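A minimal sketch of this feature extraction with librosa under the stated settings; the file name clip.wav is a placeholder.

```python
import librosa

# 44.1 kHz audio, 40 ms windows, 20 ms hops, 64 mel bands (Section 4.1)
y, sr = librosa.load("clip.wav", sr=44100)
mel = librosa.feature.melspectrogram(y=y, sr=sr,
                                     n_fft=int(0.040 * sr),       # 40 ms window
                                     hop_length=int(0.020 * sr),  # 20 ms hop
                                     n_mels=64)
log_mel = librosa.power_to_db(mel)  # (64, T) log-mel band energies
```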

Fig. 4 Numbers of active and inactive frames for each event

As the SED model in Fig. 2, we used two models. We used CNN–BiGRU with selective kernel units (CNN–BiGRU–SK) [26, 27], which achieved the best performance in DCASE2021 Challenge task4. In CNN–BiGRU–SK, kernels of multiple sizes are adopted in the CNN of a single model to handle various types of sound event. Moreover, for comparison, we used the TSD model [19]. We used two versions of the TSD model: one conditioned with a one-hot vector and one conditioned with a reference signal. For the one-hot-vector-based TSD model, to make the conditions for TSD and our SET the same, we used only the detection and conditional networks. In other words, we did not employ the classification loss, which is used only when the TSD model is conditioned by a reference signal. For the detection network, we used the same architecture as that of the one-hot-vector-based TSD model. For the conditional network of the one-hot-vector-based TSD model, we used the same two MLP layers as those in the single target SET method because one-hot vectors of sound events were utilized instead of reference signals. For the reference-signal-based TSD model, we randomly chose clips of UrbanSound8k [28], as in [19]. Other detailed parameters are shown in Table 1, where “FC” means fully connected.

Table 1 Experimental conditions

The FiLM operation (\({\hat{\mathcal {M}}}_{ijc} = \mathcal {M}_{ijc}\sigma _{c}+\mu _{c}\)) was implemented between the convolution and max pooling in each CNN layer. As shown in Fig. 2, the SED model and the two MLPs are simultaneously optimized using the loss function in Eq. 11 or 13. In this work, \(\alpha _{k}\) is set to 0.1 for all k, i.e., a symmetric Dirichlet distribution \(\mathcal {D}(\varvec{\alpha })\), which was tuned using the validation set. \(\varvec{\lambda } \sim \mathcal {D}(\varvec{\alpha })\) is sampled for each batch of an epoch during the training of a SET model. K is set to the number of sound event classes.

4.2 Experimental results

4.2.1 [Experiment 1]: SET results in terms of frame-based F-score

In this experiment, we selected one target event class n for single target SET and then observed the F-scores of the target event class with various \(\lambda _{n}\) values in inference stages. The triage weight for a nontarget class is fixed at \(1.0/\sum _{n}\lambda _n\). For example, when the index of the target class is \(n=1\) and the triage weight \(\lambda _{1}\) is set to \(5.0/\sum _{n}\lambda _n\), \(\varvec{\lambda } = (5.0, 1.0, \ldots , 1.0)/\sum _{n}\lambda _n\). All detection results of our SET are obtained with the optimal triage weight tuned using the validation set. The optimal triage weights are set for each method, class, and evaluation metric using the validation set. As mentioned in Section 3.2, \(\varvec{\lambda }\) is multiplied by N for scaling before being input to the two MLPs for \(\varvec{\mu }\) and \(\varvec{\sigma }\).
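For concreteness, a small sketch of how such an inference-time triage vector can be built; the function name and printout are ours.

```python
import numpy as np

def triage_vector(num_classes: int, target_idx: int, weight: float) -> np.ndarray:
    """Target class gets `weight`, nontarget classes get 1.0; the vector is
    normalized to sum to 1 and scaled by N before being fed to the MLPs."""
    lam = np.ones(num_classes)
    lam[target_idx] = weight
    lam = lam / lam.sum()
    return num_classes * lam

# e.g., target class n = 1 (index 0) with triage weight 5.0 among 10 classes
print(triage_vector(10, 0, 5.0))
```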

Figure 5 shows the results of the proposed methods in terms of the frame-based F-score. “baseline” indicates the results of CNN–BiGRU–SK. “TSD w/ one-hot” and “TSD w/ signal” represent the TSD model conditioned by the one-hot vector and by the reference signal, respectively. “SET w/ SET–AI” and “SET w/ SET–A” are SET with class-weighted training using Eqs. 11 and 13, respectively. The results show that the proposed SET methods with class-weighted training achieved a reasonable performance. For the average detection performance of the classes, the SET model with SET–A loss improved the frame-based F-score by 2.29 percentage points compared with the baseline value. Moreover, the performance of detecting those sound events increases when using the SET–A loss compared with using the SET–AI loss. This is because the number of inactive frames of the sound events is large in the training set, as can be seen in Fig. 4. In other words, the models using the SET–AI loss might focus on the training of inactive frames. This leads to the degradation of the detection performance, as reported in [22].

For individual classes, the SET–A loss improved the performance of detecting the sound events “air_conditioner,” “car_horn,” and “street_music” by 6.17, 3.65, and 5.12 percentage points, respectively, compared with the baseline values. In particular, the detection performance for “air_conditioner” using the conventional methods is lower than those of the other classes, but it markedly increased when using the proposed methods. This indicates that our SET can boost the performance of recognizing difficult-to-detect classes. On the other hand, the sound events “drilling” and “jackhammer” are not well detected using the proposed SET compared with the baseline values. This might be because “drilling” is acoustically similar to “jackhammer.” In [28], the timbres of these two events were reported to be similar, and the events were also confused in the classification task. To confirm the similarities among the sound events, we visualized the acoustic feature space using t-SNE in Fig. 6. Note that we used UrbanSound8k [28], which is composed of isolated sound events of the URBAN–SED dataset, for visualizing relationships among the sound event classes. This is because it is difficult to clearly visualize the relationships owing to overlapped sound events, i.e., polyphony, in an audio clip of the URBAN–SED dataset. Figure 6 also indicates that “drilling” is acoustically similar to “jackhammer.” The SET–A loss, which focuses on the active frames, may detect a target sound event and an acoustically similar one simultaneously, compared with the SET–AI loss.

Fig. 5 SET results in terms of frame-based F-score (%) for target classes

Fig. 6 Relationship among event classes on test set of UrbanSound8k in terms of acoustic features visualized using t-SNE

We next compare the TSD models with our single target SET models. As shown in Fig. 5, our single target SET models detected many events, e.g., “air_conditioner,” “car_horn,” and “street_music,” better than the TSD models. The TSD models outperform the SED model for some events such as “engine_idling” and “gun_shot.” In particular, the F-score of “gun_shot” when using the TSD models achieved a reasonable performance comparable to that of our single target SET. As previously indicated in Fig. 4, the number of active frames of “gun_shot” is smaller than those of the other classes. This indicates that targeting methods such as TSD and SET are useful for event classes for which the number of active frames is small, e.g., rare sound event classes.

4.2.2 [Experiment 1]: SET results in terms of intersection-based F-score

We also evaluated the proposed methods in terms of the intersection-based F-score. In the intersection-based F-score, unlike the frame-based F-score, models are evaluated instance by instance. Here, an “instance” means a block with an associated onset and offset [30]. Figure 7 shows the SET results in terms of the intersection-based F-score. The results show that the F-score of the proposed SET method is improved compared with that of the baseline system. For the average detection performance over the classes, the SET model with the SET–A loss increased the intersection-based F-score by 3.37 percentage points compared with the baseline model. For each class, the SET–A loss improved the performance of detecting the sound events “air_conditioner,” “car_horn,” and “street_music” by 8.70, 6.66, and 6.09 percentage points, respectively, compared with the baseline values. Comparing Figs. 5 and 7, we find that SET w/ SET–A outperformed SET w/ SET–AI for most of the classes by a larger margin in terms of the intersection-based F-score than in terms of the frame-based F-score.

Fig. 7 SET results in terms of intersection-based F-score (%) for target classes

4.2.3 [Experiment 2]: SET results in terms of misdetection

To investigate the performance of SET in more detail, we used error-related evaluation metrics, that is, the frame-based insertion rate (IR) and deletion rate (DR), as shown in Figs. 8 and 9.

Fig. 8 SET results in terms of insertion rate for target classes

Fig. 9 SET results in terms of deletion rate for target classes

Given false positives (FPs) and false negatives (FNs) for each event and time frame t, IR and DR are defined using the insertion (I) and deletion (D) [30] as follows:

$$\begin{aligned} \textrm{I}(n,t) = \max \left( 0, \textrm{FP}(n,t)-\textrm{FN}(n,t)\right) \end{aligned}$$
(14)
$$\begin{aligned} \textrm{D}(n,t) = \max \left( 0, \textrm{FN}(n,t)-\textrm{FP}(n,t)\right) \end{aligned}$$
(15)
$$\begin{aligned} \textrm{IR}(n) = \frac{\sum _{t=1}^{T}\textrm{I}(n,t)}{\sum _{t=1}^{T}\textrm{A}(n,t)}, \end{aligned}$$
(16)
$$\begin{aligned} \textrm{DR}(n) = \frac{\sum _{t=1}^{T}\textrm{D}(n,t)}{\sum _{t=1}^{T}\textrm{A}(n,t)}, \end{aligned}$$
(17)

where n and t represent the indexes of a sound event class and a time frame, respectively. \(\textrm{I}(n,t)\), \(\textrm{D}(n,t)\), \(\textrm{FP}(n,t)\), and \(\textrm{FN}(n,t)\) are each a binary variable indicating whether there is an insertion, deletion, false positive, or false negative of event n at time frame t, respectively. \(\textrm{A}(n,t)\) is a binary variable indicating whether event n is active at frame t. \(\textrm{IR}(n)\) and \(\textrm{DR}(n)\) are the insertion and deletion rates for event class n, respectively.
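A minimal numpy sketch of Eqs. 14–17, assuming binary (N, T) matrices for false positives, false negatives, and ground-truth activity; the function name is ours.

```python
import numpy as np

def insertion_deletion_rates(fp: np.ndarray, fn: np.ndarray, active: np.ndarray):
    """Frame-based insertion and deletion rates per class (Eqs. 14-17).

    fp, fn, active: binary arrays of shape (N, T) marking false positives,
    false negatives, and ground-truth activity for each class and frame.
    """
    fp = fp.astype(int)
    fn = fn.astype(int)
    insertion = np.maximum(0, fp - fn)                  # Eq. (14)
    deletion = np.maximum(0, fn - fp)                   # Eq. (15)
    ir = insertion.sum(axis=1) / active.sum(axis=1)     # Eq. (16)
    dr = deletion.sum(axis=1) / active.sum(axis=1)      # Eq. (17)
    return ir, dr
```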

Figures 8 and 9 show results in terms of IR and DR for each target event. The results show that the proposed SET methods outperformed the conventional methods in terms of IR and DR. In particular, SET w/ SET–AI reduced the IR of “air conditioner” by 0.605 points compared with the baseline value. On the other hand, the IRs of the proposed SET for “drilling” and “jackhammer” are higher than the baseline values. Our SET tends to increase false positives of acoustically similar classes more than the conventional method. As shown in Figs. 8 and 9, most of the classes suffer from the trade-off between IR and DR. On the other hand, both the IR and DR of “engine_idling” and “street_music” are improved when using the proposed method compared with the baseline values.

Comparing the TSD and SET models, we find that the detection performance of the TSD models is less stable among the event classes than that of the SET models. In other words, compared with the difference between SED and SET, the difference in detection performance between SED and TSD is large, and which of the two is superior often reverses from class to class. In the training of the TSD models, the loss function of the target class is weighted with the same value even if the target class is being overtrained. On the other hand, in our SET, the loss function of the target class is weighted with a random value drawn from the Dirichlet distribution. This random weighting may lessen the instability of the detection performance.

4.2.4 [Experiment 3]: Optimal triage weights of SET

We then analyze the optimal triage weight for each event class and evaluation metric. Figure 10 shows the optimal triage weights represented as a radar chart. As previously mentioned, the SET results are obtained using the optimal triage weight, which gives the best detection performance for each target event class, method, and evaluation metric. As shown in Fig. 10, for most of the event classes, the optimal triage weights of SET w/ SET–A are larger than those of SET w/ SET–AI. This is because the ability to detect the active frames of the target class does not change significantly when changing the triage weight with the SET–AI loss, as mentioned in Section 4.2.1. In SET w/ SET–A, most of the optimal triage weights for the classes are between 10 and 20. This is because a weight of 5 is too small to boost the detection of the target class, whereas weights over 25 are rarely sampled from the Dirichlet distribution we used. In other words, triage weights with lower probability density do not greatly contribute to the training of the target class. In practical applications, the proposed methods allow detection models to be tuned for other evaluation metrics or scenes without retraining; this flexibility is not provided by the conventional methods.

Fig. 10 Optimal triage weights for each event class and evaluation metric

Figure 11 shows the acoustic features, system outputs, and ground truths for selected events. In this figure, the SET results where the triage weight is optimized for the intersection-based F-score are shown. In most of the cases, the proposed SET outperformed the conventional methods. In particular, SET w/ SET–A is outstanding at detecting the active frames of “gun shot.” For “air conditioner,” however, the proposed methods still produce false positives, as do the conventional methods. The problem of producing false positives needs to be solved in future work.

Fig. 11 Examples of detection results for each event

5 Conclusion

In this work, we proposed a new task for SED: sound event triage (SET), in which an arbitrary number of event classes are prioritized. In this first study of SET, we introduced a training method for single target SET, which is a subtask of SET. To perform single target SET, class-weighted training is used for detecting events with priority. In class-weighted training, the loss functions and the network are stochastically weighted by the priority parameter of each class. In inference stages, the single target SET network trained with class-weighted training can change into various specialists for each class without retraining.

The results of the experiments using the URBAN–SED dataset show that the proposed method with the SET–A loss outperforms the conventional SED method by 8.70, 6.66, and 6.09 percentage points for “air_conditioner,” “car_horn,” and “street_music,” respectively, in terms of the intersection-based F-score. The results revealed that the SET–A loss contributes more to the detection of a target class than the SET–AI loss. In the average performance of the classes, the proposed methods increased the intersection-based F-score by 3.37 percentage points compared with the conventional SED and TSD models.

As a limitation of the proposed methods, the results indicate that confusion errors among acoustically similar events might be increased. In future work, multitarget SET needs to be studied by redesigning the distribution of the priorities used during training. Moreover, the SET performance with a small amount of training data, e.g., one-shot or few-shot learning, also needs to be investigated.

Availability of data and materials

The dataset used in the experiments of this article is available on Zenodo, https://zenodo.org/record/1002874#.YjigeOrP1dg.

Abbreviations

SET:

Sound event triage

SED:

Sound event detection

YOTO:

You only train once

HMM:

Hidden Markov model

NMF:

Non-negative matrix factorization

DNN:

Deep neural network

CNN:

Convolutional neural network

RNN:

Recurrent neural network

CNN–BiGRU:

Convolutional bidirectional gated recurrent unit

TSD:

Target sound detection

SELD:

Sound event localization and detection

FiLM:

Feature-wise linear modulation

MLPs:

Multilayer perceptrons

SET–AI:

SET with active and inactive frames

SET–A:

SET with active frames

DTC:

Detection tolerance criterion

GTC:

Ground truth intersection criterion

CNN–BiGRU–SK:

CNN–BiGRU with selective kernel units

IR:

Insertion rate

DR:

Deletion rate

FPs:

False positives

FNs:

False negatives

References

  1. K. Imoto, Introduction to acoustic event and scene analysis. Acoust. Sci. Technol. 39(3), 182–188 (2018)


  2. Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, N. Harada, Unsupervised detection of anomalous sound based on deep learning and the Neyman-Pearson lemma. IEEE/ACM Trans. Audio Speech Lang. Process. 27(1), 212–224 (2019)


  3. J. Stork, L. Spinello, J. Silva, K. O. Arras, in Proc. IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). Audio-based human activity recognition using non-Markovian ensemble voting (2012), pp. 509–514

  4. Y.T. Peng, C.Y. Lin, M.T. Sun, K.C. Tsai, in Proc. IEEE International Conference on Multimedia and Expo (ICME). Healthcare audio event classification using hidden Markov models and hierarchical hidden Markov models (2009), pp. 1218–1221

  5. M. Nandwana, T. Hasan, in Proc. INTERSPEECH. Towards smart-cars that can listen: abnormal acoustic event detection on the road (2016), pp. 2968–2971

  6. S. Ntalampiras, I. Potamitis, N. Fakotakis, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). On acoustic surveillance of hazardous situations (2009), pp. 165–168

  7. A. Mesaros, T. Heittola, T. Virtanen, M.D. Plumbley, Sound event detection: a tutorial. IEEE Signal Proc. Mag. 38(5), 67–83 (2021)


  8. A. Mesaros, T. Heittola, A. Eronen, T. Virtanen, in Proc. European Signal Processing Conference (EUSIPCO). Acoustic event detection in real life recordings (2010), pp. 1267–1271

  9. T. Heittola, A. Mesaros, A. Eronen, T. Virtanen, Context-dependent sound event detection. EURASIP J. Audio Speech Music Process. 2013(1), 1–13 (2013)


  10. J. Gemmeke, L. Vuegen, P. Karsmakers, B. Vanrumste, H. Van hamme, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). An exemplar-based NMF approach to audio event detection (2013), pp. 1–4

  11. T. Komatsu, T. Toizumi, R. Kondo, Y. Senda, in Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE). Acoustic event detection method using semi-supervised non-negative matrix factorization with mixtures of local dictionaries (2016), pp. 45–49

  12. S. Hershey, S. Chaudhuri, D. Ellis, J. Gemmeke, A. Jansen, C. Moore, M. Plakal, D. Platt, R. Saurous, B. Seybold, M. Slaney, R. Weiss, K. Wilson, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). CNN architectures for large-scale audio classification (2017), pp. 131–135

  13. T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. Le Roux, K. Takeda, Duration-controlled LSTM for polyphonic sound event detection. IEEE/ACM Trans. Audio Speech Lang. Process. 25(11), 2059–2070 (2017)


  14. E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, T. Virtanen, Convolutional recurrent neural networks for polyphonic sound event detection. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1291–1303 (2017)


  15. Q. Kong, Y. Xu, W. Wang, M.D. Plumbley, Sound event detection of weakly labelled data with CNN-Transformer and automatic threshold optimization. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 2450–2460 (2020)


  16. K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, K. Takeda, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Weakly-supervised sound event detection with self-attention (2020), pp. 66–70

  17. K. Miyazaki, T. Komatsu, T. Hayashi, S. Watanabe, T. Toda, K. Takeda, in Tech. Rep. DCASE Challenge. Conformer-based sound event detection with semi-supervised learning and data augmentation (2020), pp. 1–5

  18. O. Slizovskaia, G. Wichern, Z. Wang, J.L. Roux, in arXiv, arXiv:2203.04197. Locate this, not that: class-conditioned sound event DOA estimation (2022), pp. 1–5

  19. D. Yang, H. Wang, Y. Zou, C. Weng, in arXiv, arXiv:2112.10153. Detect what you want: target sound detection (2021), pp. 1–5

  20. A. Dosovitskiy, J. Djolonga, in Proc. International Conference on Learning Representations (ICLR). You only train once: loss-conditional training of deep networks (2020), pp. 1–17

  21. E. Perez, F. Strub, H. Vries, V. Dumoulin, A. Courville, in Proc. the Association for the Advancement of Artificial Intelligence (AAAI). FiLM: Visual reasoning with a general conditioning layer (2018), pp. 1–17

  22. K. Imoto, S. Mishima, Y. Arai, R. Kondo, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Impact of sound duration and inactive frames on sound event detection performance (2021), pp. 860–864

  23. J. Salamon, D. MacConnell, M. Cartwright, P. Li, J.P. Bello, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Scaper: a library for soundscape synthesis and augmentation (2017), pp. 344–348

  24. R. Serizel, N. Turpault, H. Eghbal-Zadeh, A.P. Shah, in Proc. Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE). Large-scale weakly labeled semi-supervised sound event detection in domestic environments (2018), pp. 19–23

  25. Ç. Bilen, G. Ferroni, F. Tuveri, J. Azcarreta, S. Krstulović, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). A framework for the robust evaluation of sound event detection (2020), pp. 61–65

  26. X. Zheng, Y. Song, I. McLoughlin, L. Liu, L. Dai, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). An improved mean teacher based method for large scale weakly labeled semi-supervised sound event detection (2021), pp. 356–360

  27. X. Zheng, H. Chen, Y. Song, in Tech. Rep. DCASE Challenge. Zheng USTC team’s submission for DCASE2021 task4 - semi-supervised sound event detection (2021), pp. 1–3

  28. J. Salamon, C. Jacoby, J.P. Bello, in Proc. 22nd ACM International Conference on Multimedia (ACM-MM’14). A dataset and taxonomy for urban sound research (2014), pp. 1041–1044

  29. D.P. Kingma, J. Ba, in Proc. International Conference on Learning Representations (ICLR). Adam: a method for stochastic optimization (2015)

  30. A. Mesaros, T. Heittola, T. Virtanen, Metrics for polyphonic sound event detection. Appl. Sci. 6(6), 1–17 (2016)



Acknowledgements

Not applicable.

Funding

Not applicable.

Author information


Contributions

N. Tonami proposed the methodology, conducted the experiments, and wrote the manuscript. K. Imoto supervised the research and refined the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Keisuke Imoto.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Tonami, N., Imoto, K. Sound event triage: detecting sound events considering priority of classes. J AUDIO SPEECH MUSIC PROC. 2023, 5 (2023). https://doi.org/10.1186/s13636-022-00270-7

