Sound event triage: detecting sound events considering priority of classes

We propose a new task for sound event detection (SED): sound event triage (SET). The goal of SET is to detect an arbitrary number of high-priority event classes while allowing misdetections of low-priority event classes, where a priority is given for each event class. Conventional methods of SED that target a specific sound event class can give priority to only a single event class. Moreover, the level of priority is not adjustable, i.e., the conventional methods can use only the type of target event class, e.g., a one-hot vector, as input. To flexibly control more information on the target events, the proposed SET exploits not only the types of target sound but also the extent to which each target sound is detected with priority. To implement the detection of events with priority, we propose class-weighted training, in which the loss functions and the network are stochastically weighted by the priority parameter of each class. As this is the first paper on SET, we particularly introduce an implementation of single-target SET, which is a subtask of SET. The results of experiments using the URBAN-SED dataset show that the proposed method of single-target SET outperforms the conventional SED method by 8.70, 6.66, and 6.09 percentage points for “air_conditioner,” “car_horn,” and “street_music,” respectively, in terms of the intersection-based F-score. For the class-averaged score, the proposed methods increase the intersection-based F-score by up to 3.37 percentage points compared with the conventional SED and other target-class-conditioned models.


INTRODUCTION
In our everyday life, humans utilize much information obtained from various environmental sounds [1]. The automatic analysis of environmental sounds will lead to the realization of many applications, e.g., anomalous sound detection systems [2], life-logging systems [3], systems for hard-of-hearing persons [4], systems for smart cars [5], and monitoring systems [6].
The target sounds to be analyzed depend on the user or application. In speech processing, source separation based on beamforming [18] for extracting a target speech from mixed sounds has been proposed. In music processing, to extract the sound intended by a user, a music separation method using the user's humming [19] has been proposed. In the analysis of environmental sounds, universal sound selection [20], environmental sound extraction [21], and sound event localization and detection (SELD) [22] have been proposed. Okamoto et al. [21] have proposed a method of extracting a target environmental sound. For SED, target sound detection (TSD) has been proposed by Yang et al. [23], in which only the target sound is detected; a reference audio signal or a one-hot vector of the target sound is input to the TSD model. In SED, the goal is to generalize the performance of detecting all sound events, i.e., SED models are trained to detect all sound events equally. In real environments, however, the detection priority of each event depends on the user or application. For example, when SED is used for a surveillance system, anomalous events such as "gunshot" or "baby crying" have to be detected preferentially over other events. On the other hand, in the case of a life-logging system, a sound event such as "kettle" or "footsteps" has to be detected more preferentially in addition to other events. In the conventional method for TSD, target events can be selected, but the degree of priority with which the target events are detected cannot be controlled.
To tackle this problem, we propose a new SED task: sound event triage (SET). The goal of SET is to improve the performance of detecting a high-priority sound event while allowing the performance of detecting a low-priority event to be compromised, where the extent of priority is given for each event class; that is, the triage. In Fig. 1, the concept of SET is illustrated. SET enables user-preference sound event detection. The difference between the conventional methods for SED, including TSD [23], and SET is whether the degree to which events are detected with priority can be set. We then propose a method for SET, in which loss-conditional training [24] is utilized for detecting sound events with priority.

RELATED WORKS
In this section, we describe works related to the proposed method. In particular, the section comprises three subsections: strongly supervised SED, the conventional methods of environmental sound analysis using class-conditional techniques, and a loss-conditional method based on You Only Train Once (YOTO) [24], with which the arbitrary linear combination of loss weights for multiple tasks can be set with a single model.

Strongly supervised SED
In strongly supervised SED, given a SED model f, model parameters Θ, an acoustic feature X, and a ground truth z_{n,t} ∈ {0, 1} for a sound event n in time frame t, the SED model outputs y_{n,t} for the event n and time frame t:

y_{n,t} = f(X; \Theta)

In the training of the DNN-based SED model, to optimize the model parameters Θ, the following binary cross-entropy (BCE) loss function is used:

\mathcal{L}_{\rm SED} = - \sum_{n=1}^{N} \sum_{t=1}^{T} \{ z_{n,t} \log \sigma(y_{n,t}) + (1 - z_{n,t}) \log (1 - \sigma(y_{n,t})) \}   (3)

where σ(·) denotes the sigmoid function. N and T are the numbers of sound event classes and total time frames, respectively. In an inference stage of SED, σ(y_{n,t}) is binarized with a predefined threshold to obtain detection results. As can be seen in Eq. 3, all sound events are equally weighted for generalizing the performance of detecting all sound events.
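The frame-level BCE loss of Eq. 3 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the logit/label shapes and the function names are assumptions for the example.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def bce_sed_loss(y, z):
    """BCE loss of Eq. 3 (illustrative sketch).

    y: (N, T) raw model outputs (logits) for N event classes and T frames.
    z: (N, T) ground-truth activity in {0, 1}.
    """
    p = sigmoid(y)
    eps = 1e-12  # numerical stability for log(0)
    return -np.sum(z * np.log(p + eps) + (1.0 - z) * np.log(1.0 - p + eps))

# Toy check: zero logits give sigma(y) = 0.5, so each of the N*T = 12
# terms contributes -log(0.5), whatever the labels are.
y = np.zeros((3, 4))
z = np.random.randint(0, 2, size=(3, 4)).astype(float)
loss = bce_sed_loss(y, z)
```

Note that all N classes contribute with equal weight, which is exactly the property SET relaxes.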

Methods of environmental sound analysis for targeting a specific sound
In the analysis of environmental sounds, several methods considering user-guided or target-class-conditioned information have been proposed [20,21,22,23]. In the methods in [20,21], environmental sound extraction systems utilize user-guided information. Ochiai et al. [20] have proposed a method that conditions a universal sound separation system with a multi-hot vector of target classes for extracting target event classes from a mixture of sound events.
The system requires many pairs of an acoustic signal and a sound event label in a supervised manner. In works related to SED, TSD [23] and class-conditioned SELD, where information on target event classes or sounds is employed [22], have been studied. Yang et al. [23] have proposed the TSD task derived from SED, which detects target sound events with a one-hot vector of the target event class. In the TSD network, a reference signal or a one-hot vector of the target event class is input to a conditioning network, which is embedded and then fused with a SED network. Slizovskaia et al. [22] have proposed class-conditioned SELD, which analyzes a specific target event similarly to TSD, detecting the target sound using a one-hot vector of the target class. These conventional systems related to SED can only handle information about the types of sound to be analyzed.

YOTO
YOTO [24] is a technique that enables a single network to behave as various models at inference time, where each model has a different type of expertise. In the task of image compression, the types of expertise are image quality and compression rate [24]. In this case, a single network using YOTO can act as various image-quality or compression-rate specialists at inference time. The YOTO scheme is efficient in terms of training cost and model complexity compared with training multiple networks, one for each type of expertise. In YOTO, we assume a problem setting in which a single DNN-based network performs multiple tasks and is trained with one loss per task. Let L_m be the loss function for task m.
The following loss function is often used for optimizing the parameters of the network:

\mathcal{L} = \sum_{m=1}^{M} \lambda_m \mathcal{L}_m

where λ_m (0 ≤ λ_m ≤ 1.0) is the weight of L_m for balancing the tasks, and M denotes the number of tasks. In the training stage, the single network is trained with various λ_m values. In inference stages, an arbitrary λ_m can be input to the trained single network. When λ_m is set to be larger than the weights of the other tasks, the network focuses on the training and/or inference of task m instead of the other tasks.
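The weighted combination above is easy to make concrete. A minimal sketch follows; the uniform sampling and normalization of the weights are assumptions made for the example (the YOTO paper itself chooses a task-specific sampling distribution).

```python
import numpy as np

def yoto_loss(task_losses, lam):
    """Weighted sum of per-task losses: L = sum_m lam_m * L_m (sketch)."""
    return float(np.dot(lam, task_losses))

# During training, lam is re-sampled at every step so that a single
# network is exposed to many weightings.  Here we simply draw uniform
# weights and normalize them to sum to one (an illustrative choice).
rng = np.random.default_rng(0)
lam = rng.random(3)
lam /= lam.sum()
combined = yoto_loss(np.array([0.5, 1.0, 2.0]), lam)
```

Because the weights are normalized, the combined loss is a convex combination of the individual task losses.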

Framework of SET
In SET, not only types of target sound but also the degree of priority is used for conditioning models. In Fig. 2, the overview of SET in the training stage is shown. In the training stage of SET, triage weights are given for detecting target events with priority in addition to acoustic features and model parameters.
The SET model outputs y_{n,t} conditioned on the triage weights:

y_{n,t} = f(X, \boldsymbol{\lambda}; \Theta)

where λ = (λ_1, λ_2, ..., λ_N) are the parameters for the triage, that is, for detecting sound events with priority, and λ_n (0 ≤ λ_n ≤ 1.0) is the triage weight for sound event n. In an inference stage of SET, an arbitrary triage parameter λ can be input to the SET model. When λ_n is set to a larger value than the others, the model targets the sound event n with higher priority than the other events. The loss functions on the right side of Fig. 2 are described in Section 3.3.

Class-level loss-conditional training
In DNN-based SED, as shown in Eq. 3, the BCE loss can be divided into losses for each event class. In other words, the BCE loss can be regarded as the sum of losses for the tasks of detecting each sound event class. To perform the event-class-level loss-conditional training of SET, the following loss function is used:

\mathcal{L}_{\rm SET} = N \sum_{n=1}^{N} \lambda_n \mathcal{L}_{\rm SED}(n)   (6)

where λ_n and L_SED(n) indicate the triage weight and the loss function for sound event n, respectively. In Eq. 6, note that the loss is multiplied by N for scaling because \sum_n \lambda_n = 1.0.
To use arbitrary λ_n in inference stages of a SET network, λ = (λ_1, λ_2, ..., λ_N) is repeatedly and randomly sampled from a distribution during training, which covers various λ values with the single SET network. The sampled parameters λ are input to the SET network and used for the loss calculation (Eq. 6) in the training stage. As shown in Fig. 2, the λ values are first fed to two multilayer perceptrons (MLPs). The MLPs output two vectors, µ = (µ_1, µ_2, ..., µ_C) and σ = (σ_1, σ_2, ..., σ_C). As shown in Fig. 3, feature-wise linear modulation (FiLM) [25] is then used to bridge the outputs of the MLPs and the main network for detecting events (Fig. 2). FiLM is applied to a feature map in the CNN layers of the main network: the feature map is multiplied by σ and added to µ, i.e., \hat{M}_{ijc} = σ_c M_{ijc} + µ_c. Here, M_{ijc} is a feature at location (i, j) of channel index c in a feature map of a CNN layer, as shown in Fig. 3. In addition to conditioning the main network, the sampled λ values are directly used for the losses (Eq. 6) in training stages.
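The FiLM modulation itself is a channel-wise affine transform. The following sketch shows only that transform; the two MLPs that map λ to the scale and shift vectors are omitted, and the names gamma/beta (standing for the σ and µ of the text) and the (H, W, C) layout are assumptions for the example.

```python
import numpy as np

def film(feature_map, gamma, beta):
    """FiLM: M'_ijc = gamma_c * M_ijc + beta_c (channel-wise affine)."""
    # gamma and beta have shape (C,); broadcasting applies them
    # over all (i, j) locations of the (H, W, C) feature map.
    return feature_map * gamma + beta

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8, 4))          # toy CNN feature map
gamma = np.array([1.0, 2.0, 0.5, 1.5])      # scale per channel (from one MLP)
beta = np.array([0.0, -1.0, 0.5, 2.0])      # shift per channel (from the other MLP)
out = film(M, gamma, beta)
```

Each channel is scaled and shifted independently, which is what lets the triage weights steer the main network's features.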
As the distribution of the triage weights λ_n, we use the Dirichlet distribution D(α). Its probability density function is

p(\boldsymbol{x}; \boldsymbol{\alpha}) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} x_k^{\alpha_k - 1}

where x_k ≥ 0 (with \sum_k x_k = 1) is a stochastic variable for dimension k and α_k > 0 is a parameter for the shape of the distribution. Γ(·) represents the gamma function. When α is small, the Dirichlet distribution is sharpened; that is, most of the probability mass concentrates on a few dimensions, a shape that is beneficial for the triage. Our SET is conditioned by the event-class-level YOTO, where the triage weight λ is sampled for each sound event class, to handle arbitrary λ, i.e., arbitrary extents of priority of the detection. The inputs of the MLPs for conditioning the main network are also multiplied by N, as shown in Fig. 2.
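Sampling the triage weights from a symmetric Dirichlet is a one-liner in NumPy. The concentration value 0.1 mirrors the setting reported in the experimental conditions; the class count is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10                    # number of event classes (illustrative)
alpha = np.full(N, 0.1)   # symmetric concentration; small alpha sharpens

# Each draw lies on the probability simplex: lam_n >= 0, sum_n lam_n = 1,
# so the sampled vector can be used directly as the triage weights.
lam = rng.dirichlet(alpha)
```

With such a small α, most of the mass of a draw typically lands on one or two classes, which is the "sharpened" shape useful for triage.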

SET losses with priority of event classes
In this section, we specialize Eq. 6 to perform our SET. We propose two loss functions for the class-level loss-conditional training of SET. First, we introduce a loss function of SET with active and inactive frames (SET-AI) as follows:
\mathcal{L}_{\rm SET\text{-}AI} = -N \sum_{n=1}^{N} \lambda_n \sum_{t=1}^{T} \{ z_{n,t} \log \sigma(y_{n,t}) + (1 - z_{n,t}) \log (1 - \sigma(y_{n,t})) \}   (10)

In SET-AI, the triage weight λ_n affects both the active and inactive frames of sound event n. When λ_n is set to a larger value than the others, the model focuses on the training and/or inference of the active and inactive frames of event n compared with the other events. In SET-AI, the inactive frames are also multiplied by λ_n.
A large number of inactive frames disturbs the training on active frames, as reported in [26]. Hence, it is preferable to multiply only the active frames by λ_n. Thus, we also introduce a loss function of SET with active frames (SET-A), wherein the model focuses on the training of the active frames, as follows:
\mathcal{L}_{\rm SET\text{-}A} = -N \sum_{n=1}^{N} \sum_{t=1}^{T} \{ \lambda_n z_{n,t} \log \sigma(y_{n,t}) + (1 - z_{n,t}) \log (1 - \sigma(y_{n,t})) \}   (12)

The difference between SET-AI and SET-A is whether the inactive frames are multiplied by the triage weight λ_n.
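The two losses differ only in which term the triage weight multiplies, which a small NumPy sketch makes explicit. This is an illustrative reading of Eqs. 10 and 12, not the authors' code; shapes and names are assumptions.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def set_losses(y, z, lam):
    """Sketch of the SET-AI (Eq. 10) and SET-A (Eq. 12) losses.

    y:   (N, T) logits, z: (N, T) ground truth, lam: (N,) triage weights.
    """
    N = y.shape[0]
    p = sigmoid(y)
    eps = 1e-12
    active = z * np.log(p + eps)                   # active-frame term
    inactive = (1.0 - z) * np.log(1.0 - p + eps)   # inactive-frame term
    w = lam[:, None]
    set_ai = -N * np.sum(w * (active + inactive))  # lam weights both terms
    set_a = -N * np.sum(w * active + inactive)     # lam weights active only
    return set_ai, set_a

# With uniform weights lam_n = 1/N, SET-AI reduces to the plain BCE loss,
# since the leading factor N cancels the 1/N weights.
rng = np.random.default_rng(0)
y = rng.standard_normal((5, 20))
z = (rng.random((5, 20)) < 0.3).astype(float)
lam = np.full(5, 1.0 / 5)
set_ai, set_a = set_losses(y, z, lam)
```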

Experimental conditions
To evaluate the effectiveness of our methods, we conducted the experiments described in the following sections. For the experiments, we used the URBAN-SED dataset [28]. URBAN-SED includes 10,000 synthetic audio clips (train, 6,000; validation, 2,000; test, 2,000), where the duration of each clip is 10 s with a sampling rate of 44,100 Hz. The dataset consists of 10 sound event classes. In Fig. 4, the numbers of active and inactive frames for each event are indicated. As acoustic features, we used 64-dimensional log-mel band energies, calculated with a window size of 40 ms and a hop size of 20 ms. This setup is based on the baseline system of DCASE2018 Challenge task4 [29]. The threshold value for σ(y_{n,t}) was 0.5. The batch size was 64, and the models were trained for 100 epochs. To measure the detection performance, we used frame-based and intersection-based metrics [30]. In the intersection-based metric, the detection tolerance criterion (DTC) and ground truth intersection criterion (GTC) were both set to 0.5.
As the main network in Fig. 2, we used two models. First, we used CNN-BiGRU [14], a combination of a CNN and a BiGRU, which is widely used as a baseline system for SED. Second, we used CNN-BiGRU with selective kernel units (CNN-BiGRU-SK) [31,32], which achieved the best performance in DCASE2021 Challenge task4. In CNN-BiGRU-SK, kernels of multiple sizes are adopted in the CNN of a single model to handle various types of sound event. Other detailed parameters are shown in Table 1, where "FC" means fully connected.
The FiLM operation (\hat{M}_{ijc} = σ_c M_{ijc} + µ_c) was implemented between the convolution and max pooling in each CNN layer. In this work, α_k is set to 0.1 for all k, i.e., a symmetric Dirichlet distribution D(α), which was tuned using the training and validation sets. K is set to the number of sound event classes.

[Experiment 1]: SET results in terms of frame-based F-score
In this experiment, we selected one target event class n for SET and then observed the F-scores of the target event class with various λ_n values in inference stages. The triage weight for each nontarget class is fixed at 1.0/\sum_n λ_n. For example, when the index of the target class is n = 1 and its triage weight is set to 5.0/\sum_n λ_n, then λ = (5.0, 1.0, . . . , 1.0)/\sum_n λ_n. As mentioned in Section 3.2, λ is multiplied by N for scaling before being input to the two MLPs for µ and σ. Figure 5 shows the results of the proposed methods with various triage weights in terms of the frame-based F-score. In the legend, "Baseline" indicates the results obtained by the conventional methods CNN-BiGRU and CNN-BiGRU-SK using the BCE loss function. "SET-AI loss" and "SET-A loss" denote SET with the class-level loss-conditional training using Eqs. 10 and 12, respectively. The results show that the proposed SET methods using the class-level loss-conditional training achieved reasonable performance. The performance of detecting the sound events "air conditioner," "children playing," and "engine idling" gradually increases with the triage weight. Moreover, the performance of detecting those sound events markedly increases when using the SET-A loss compared with the SET-AI loss. This is because the number of inactive frames of these sound events is large in the training set, as can be seen in Fig. 4; in other words, models using the SET-AI loss might focus on the training of inactive frames, which degrades the detection performance, as reported in [26]. On the other hand, the results for the sound events "drilling" and "jackhammer" show a different trend from those of "air conditioner," "children playing," and "engine idling." As the triage weight becomes larger, the detection performance using the SET-A loss degrades much more than that using the SET-AI loss. This might be because "drilling" is acoustically similar to "jackhammer."
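The construction of the inference-time triage vector described above can be written compactly. The helper name is hypothetical; the normalization follows the text (target class gets a raw weight, nontarget classes get 1.0, and the vector is divided by its sum).

```python
import numpy as np

def triage_weights(n_target, raw_weight, num_classes):
    """Build the inference-time triage vector: (raw, 1, ..., 1) / sum."""
    lam = np.ones(num_classes)
    lam[n_target] = raw_weight
    return lam / lam.sum()

# Example from the text: target class index 0, raw weight 5.0, 10 classes,
# giving lam = (5, 1, ..., 1) / 14.
lam = triage_weights(n_target=0, raw_weight=5.0, num_classes=10)
```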
In [33], the timbre of these two events is reported to be similar, and they can also be confused in classification tasks. The SET-A loss, which focuses on the active frames, may detect a target sound event and a similar one simultaneously, compared with the SET-AI loss. Tables 2 and 3 show detailed SET results in terms of the frame-based F-score for each target class with various weights based on CNN-BiGRU and CNN-BiGRU-SK, respectively. As shown in Table 2, many sound events are detected better using the proposed methods than using the baseline system. Notably, the detection performance of the proposed method using the SET-A loss is higher than that of the proposed method using the SET-AI loss. The sound events "air conditioner," "children playing," "engine idling," "siren," and "street music" were detected better using the SET-A loss than using the baseline system or the SET-AI loss. In particular, the proposed method using the SET-A loss improved the frame-based F-score of "air conditioner" by 7.53 and 9.72 percentage points compared with the conventional CNN-BiGRU and CNN-BiGRU-SK, respectively. Moreover, for "street music," the proposed method using the SET-A loss achieved a better F-score than the conventional CNN-BiGRU and CNN-BiGRU-SK. These well-detected events, "air conditioner" and "street music," are characterized as continuous sounds, as reported in [34], which are relatively easier to detect than the others. The proposed method using SET-A mainly focuses on the active frames in Eq. 12; it thus detects more active frames regardless of some misdetections and is good at detecting such continuous sounds with a large number of active frames.

[Experiment 1]: SET results in terms of intersection-based F-score
We also evaluated the proposed methods in terms of the intersection-based F-score. In the intersection-based F-score, unlike the frame-based F-score, models are evaluated instance by instance. Here, "instance" means a block with associated onset and offset [35]. Tables 4 and 5 show SET results in terms of the intersection-based F-score compared with those of CNN-BiGRU and CNN-BiGRU-SK for various triage weights. The results show that the F-score of the proposed SET method is improved compared with that of the baseline system. In particular, as shown in Table 5, the proposed method using the SET-A loss improved the intersection-based F-scores of the sound events "car horn," "dog bark," and "gun shot" by 7.44, 8.39, and 7.82 percentage points compared with the conventional CNN-BiGRU-SK. As shown in Table 4, however, these events were not precisely detected using the proposed method. This implies that more sophisticated event detection methods can boost SET performance, which needs to be investigated in future work. Unlike the results in terms of the frame-based F-score, the performance of detecting many sound events gradually deteriorates as the triage weight increases. This issue is discussed in the next experiment.

[Experiment 2]: SET results of relationships among misdetections and F-scores
To investigate the performance of SET in more detail, we used error-related evaluation metrics: the frame-based insertion rate (IR) and deletion rate (DR). Given the false positives (FPs) and false negatives (FNs) for each event and time frame t, IR and DR are defined using the insertions (I) and deletions (D) [35] as follows:

\mathrm{IR} = \frac{\sum_{t=1}^{T} \max\left(0, \mathrm{FP}(t) - \mathrm{FN}(t)\right)}{\sum_{t=1}^{T} A(t)}, \qquad \mathrm{DR} = \frac{\sum_{t=1}^{T} \max\left(0, \mathrm{FN}(t) - \mathrm{FP}(t)\right)}{\sum_{t=1}^{T} A(t)}

where FP(t) and FN(t) are the numbers of FPs and FNs at frame t, respectively, and A(t) represents the number of active events at frame t. The aforementioned results indicated different behaviors between the frame-based and intersection-based F-scores. Figure 6 shows the relationships between the frame-based and intersection-based F-scores and the insertion rate with various triage weights for each target event. In the figure, dashed lines indicate the relationships between the frame-based F-score and IR, and solid lines indicate the relationships between the intersection-based F-score and IR. The left figure shows that, when using the SET-AI loss, the intersection-based F-score for each target event moderately decreases as the triage weight increases. As shown on the right side of Fig. 6, when using the SET-A loss, the frame-based F-score for each target event greatly increases as the triage weight increases. On the other hand, the result indicates that the intersection-based F-scores for most target events first increase and then decrease as the triage weight increases. This is because the number of FPs is dominant in the intersection-based F-score and is counted differently from that in the frame-based F-score when the triage weight is set to be large. As shown in Fig. 7, the number of FPs is counted instance by instance in the intersection-based F-score. Figure 7(a) shows an example of an instance for FPs. In Fig. 7(b), the detection result is divided into two instances, each of which is short. In the intersection-based F-score, the number of FPs in Fig. 7(b) is twice that in Fig.
7(a); that is, one instance divided into multiple short instances has a significant impact. On the other hand, in the frame-based F-score, there is no large difference in the number of FPs between Figs. 7(a) and 7(b). Tables 6 and 7 show results in terms of IR and DR for each target event with various triage weights. The results show that the IR of many events is worse than the DR; that is, the systems are affected mainly by FPs. The proposed methods using the SET-A loss degrade the IR compared with those using the SET-AI loss. This is because the SET-A loss focuses on recognizing active frames while neglecting inactive frames.
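The IR and DR computation can be sketched directly from per-frame FP/FN counts. This follows the usual convention of [35], where insertions and deletions are the parts of the FP and FN counts not explained as substitutions; the function name and toy counts are illustrative.

```python
import numpy as np

def insertion_deletion_rates(fp, fn, a):
    """Frame-based insertion rate (IR) and deletion rate (DR) sketch.

    fp, fn: (T,) false-positive / false-negative counts per frame.
    a:      (T,) number of active ground-truth events per frame.
    """
    insertions = np.maximum(0, fp - fn)  # I(t): FPs beyond substitutions
    deletions = np.maximum(0, fn - fp)   # D(t): FNs beyond substitutions
    ir = insertions.sum() / a.sum()
    dr = deletions.sum() / a.sum()
    return ir, dr

# Toy example over four frames.
fp = np.array([2, 0, 1, 0])
fn = np.array([0, 1, 1, 0])
a = np.array([3, 2, 2, 1])
ir, dr = insertion_deletion_rates(fp, fn, a)
```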
From the results in this section, we conclude that our SET detects more active frames for target events, but the number of divided or short-instance FPs might be increased.

CONCLUSION
In this work, we proposed a new task for SED: sound event triage (SET), in which both the type of target sound and the extent of priority are considered for targeting specific sound events. To perform SET, class-level loss-conditional training, in which the model and loss are conditioned with sampled priority parameters, is utilized for detecting events with priority. The results of experiments using the URBAN-SED dataset show that the proposed methods achieve reasonable detection performance in terms of the F-scores. The proposed method mainly considering active frames outperforms the conventional SED method by around 10 percentage points in terms of the frame-based F-score for some events. As a limitation of the proposed methods, the results indicate that confusion errors among similar events might be increased.