
GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration


The polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times, and estimate their corresponding directions of arrival (DOA), where polyphonic refers to the occurrence of multiple overlapping sound sources in a segment. However, vanilla SSLD methods based on the convolutional recurrent neural network (CRNN) suffer from insufficient feature extraction. Convolutions with a single kernel scale in the CRNN fail to adequately extract multi-scale features of sound events, which have diverse time-frequency characteristics. As a result, the extracted features lack the fine-grained information that helps localize sound sources. In response to these challenges, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), where the global-local feature (GLF) extractor is designed to extract the multi-scale global features through an omni-directional dynamic convolution (ODConv) layer and a multi-scale feature extraction (MSFE) module. The local feature extraction (LFE) unit is designed for capturing detailed information. In addition, we design a feature recalibration (FR) module to emphasize the crucial features along multiple dimensions. On the open datasets of Task3 in the DCASE 2021 and 2022 Challenges, we compared our proposed GLFER-Net with six and four SSLD methods, respectively. The results show that the GLFER-Net achieves competitive performance. The modules we designed are verified to be effective through a series of ablation experiments and visualization analyses.

1 Introduction

Sound source localization and detection (SSLD) can be regarded as a joint task of locating sound sources and detecting sound events. The SSLD system should predict the boundaries of active sound events, identify their categories, and provide the spatial trajectories of sound sources simultaneously. In recent years, SSLD has gained much popularity and has been helpful in many aspects of daily applications [1]. For instance, with the assistance of the SSLD system, robots can assist human-machine interaction [2]. Moreover, SSLD can collaborate with the speech enhancement task to denoise specific speakers by capturing their positions in intelligent meeting rooms [3]. It is also applied in smart cities for real-time environmental sound monitoring [4].

The SSLD task can be divided into two sub-tasks: sound event detection (SED) and sound source localization (SSL). SED identifies not only the categories of sound events but also their onset and offset times. Currently, the SED task mainly focuses on polyphonic SED, which aims to predict overlapping events of different categories. Semi-supervised learning is widely used for this task because its datasets contain a large amount of unlabeled data [5]. SSL estimates the direction of arrival (DOA), which is crucial for determining the positions of sound sources relative to the microphones at each time frame. It can provide auxiliary localization information about each sound source for downstream tasks. For real-world scenarios with an unknown number of sources, Fu et al. proposed the iterative sound source localization (ISSL) method to make the model more stable without using a threshold [6].

Adavanne et al. pioneered a convolutional recurrent neural network (CRNN) structure for the SSLD task [7], taking full advantage of the feature extraction ability of the convolutional neural network (CNN) and the superior temporal context modeling ability of the recurrent neural network (RNN). This CRNN-based approach, adopted as the baseline of Task3 of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, has become widely recognized and extensively applied in subsequent research [8,9,10,11].

In recent years, to bring SSLD closer to practical application, the dataset used in DCASE 2022 Task3 has supplemented the previous computationally simulated spatial recordings with recordings of real-world spatial sound scenes that have accurate manual annotations. These realistic datasets not only have stronger reverberation and more diverse environmental settings but also include frequently moving sound sources, more background noise, and a large number of homogeneous sound events. These complex factors steeply increase the difficulty of the SSLD task, especially with the sharp increase in the number of overlapping sound events, posing higher demands on the capability of encoding spatial relations.

Vanilla CRNN-based SSLD models that use CNNs with a single kernel size struggle to adequately extract the feature maps of sound events with diverse time-frequency characteristics. Additionally, the spatial features extracted by a vanilla CNN lack fine-grained information, resulting in inaccurate localization of sound sources, and lack spatial information across multiple dimensions, resulting in inaccurate classification of sound events. To overcome these limitations of the CRNN structure, we propose a polyphonic SSLD network, global-local feature extraction and recalibration (GLFER-Net), in which the global-local feature (GLF) extractor and feature recalibration (FR) module replace the vanilla CNN of a CRNN-based model. Our primary contributions are summarized as follows:

(1) We design a GLF extractor consisting of two branches, where an omni-directional dynamic convolution (ODConv) layer and a multi-scale feature extraction (MSFE) module make up the upper branch for extracting global feature maps, while two local feature extraction (LFE) units make up the lower branch for extracting the local features.

(2) We design a cross-scale shuffle (CSS) unit to fuse the multi-scale features in the MSFE module and an FR module to recalibrate the feature maps output by the GLF extractor.

(3) We compared our proposed GLFER-Net with six and four SSLD methods on synthetic and realistic datasets, respectively. A series of ablation studies and visualization analyses show the effectiveness of the GLF extractor and FR module. Additionally, we exploited data augmentation techniques to enhance the generalization capability of the proposed GLFER-Net.

This paper is organized as follows: Section 2 introduces the existing SSLD methods. Section 3 details our proposed network. The experimental setup, datasets, and evaluation metrics will be described in Section 4. A series of experimental results and discussion will be detailed in Section 5. Finally, we will draw conclusions in Section 6.

2 Related works

Recent SSLD methods have explored the input features, output format, model architecture, and data augmentation techniques. Several novel input features have been introduced to incorporate as much positional information as possible. Nguyen et al. proposed the spatial cue-augmented log-spectrogram (SALSA) features [12], which consist of multi-channel log-spectrograms and the normalized principal eigenvectors of spatial covariance matrices in their respective time-frequency bins. These features proved crucial for addressing overlapping sound events. Rosero et al. devised a Gammatone-based Sound Events Localization and Detection (G-SELD) system [13], which seeks to enhance the SSLD performance by employing bio-inspired gammatone auditory filters for acoustic feature extraction.

There has been increasing attention on the design of output formats in the field of SSLD. Shimada et al. proposed the activity-coupled Cartesian DOA (ACCDOA) vector to simplify the model’s output [14], treating SSLD as a single regression task. Cao et al. introduced a track-wise output format [15], which showed excellent generalization for overlapping sound events of the same class. Shimada et al. extended ACCDOA to the multi-ACCDOA vector in the format of track-wise output [16]. However, it still needs to exhibit more generalization ability in the real world to face the challenge of the polyphonic acoustic environment with an unknown number of overlaps. To solve this problem, Kim et al. explored an approach for multi-task SSLD based on angular distance [17], which adapts the framework of “You Only Look Once” (YOLO) in the SSLD task, achieving impressive performance on multiple datasets.

In order to overcome the inefficient sequential data processing of RNNs, Sound Event Localization and Detection via Temporal Convolutional Networks (SELD-TCN) [18] was proposed, which introduced the temporal convolutional network (TCN) with dilated convolutions to replace the bidirectional gated recurrent unit (BiGRU) in the CRNN framework and capture long-range temporal dependencies. In addition, the Transformer [19] can process sequential data in parallel, bringing substantial improvement in efficiency. The Conformer [20] was first proposed for the speech recognition task to endow the Transformer with the capability to capture local information. This model integrates a convolution module and a self-attention mechanism within a unified single-branch structure. Inspired by it, Wang et al. combined the ResNet and Conformer structures for the SSLD task [21], resulting in promising performance gains. However, the Conformer places a substantial burden on computing resources due to its large number of model parameters. To tackle this challenge, Zhou et al. proposed a dual-branch attention module to improve the structure of the Conformer [22], which utilizes both self-attention and convolution branches to capture global and local contextual information of the sound events.

Data augmentation methods have been commonly exploited to increase the quantity of data, thus enhancing the generalization ability of the model. Mazzon et al. proposed the first-order Ambisonics (FOA) domain spatial augmentation method based on the well-known rotational property of FOA sound encoding [23]. The basic idea of the method is to apply transformations to the FOA channels, along with the corresponding labels, so that they simulate a new DOA of the recorded sounds. Ronchini et al. used data augmentation methods to improve the generalization of the SSLD system to unseen data [24]. Their technique is based on channel rotations and reflection on the x-y plane, which augments the DOA labels while keeping the physical relationships between channels. Wang et al. integrated four methods in a step-by-step manner to devise an effective four-stage data augmentation scheme, including audio channel swapping, multi-channel simulation, time-domain mixing, and time-frequency masking [21]. This strategy allows models trained on augmented data to remain robust to variations in sound source locations.

3 The proposed method

In this section, we introduce our proposed GLFER-Net in detail. The overall diagram of the network is illustrated in Fig. 1, where the log-linear spectrograms from four channels and normalized three-channel sound intensity vectors (SIVs) are concatenated as the input features. An encoder first processes them for shallow feature extraction. The encoder comprises two convolutional blocks and an average pooling operation; each convolutional block includes a 2D-convolutional layer with a kernel size of 3 \(\times\) 3, a batch normalization (BN) layer, and a Gaussian error linear unit (GELU) activation function [25], as shown in Fig. 2. Note that there are no residual connections in the encoder, which avoids the interference of noise from the original input features. Subsequently, the output of the encoder is fed into the part enclosed by the dashed box in Fig. 1, which consists of a GLF extractor, an FR module, and an average pooling operation. This part is repeated three times to extract high-level features with rich spatial information. Feature maps are halved in size by each average pooling operation in the time and frequency dimensions. With one pooling operation in the encoder and three in the repeated part, the features are downsampled by a factor of sixteen in both time and frequency dimensions before being fed into the BiGRU. Next, a two-layer BiGRU is utilized to capture the temporal context information and model the relationships among time frames.

Fig. 1 The diagram of the proposed GLFER-Net

Fig. 2 The detailed structure of the encoder

Finally, the SED and DOA branches each use two fully connected (FC) layers to produce the SED and DOA predictions, respectively. The SED branch adopts a sigmoid activation function, whereas the DOA branch uses tanh. The following subsections present a detailed description of the input features and each module.

3.1 Input features

The log-mel spectrogram is extensively utilized in the SSLD task. However, Nguyen et al. demonstrated that the linear-scale feature outperforms the mel-scale one [12]. Hence, we use the log-linear spectrogram. The four-channel log-linear spectrograms are calculated from the complex spectrograms S(t,f) using:

$$\begin{aligned} LINSPEC\left( t,f \right) = \log \left( \left\| S(t,f) \right\| ^{2} \right) \in {\mathbb {R} }^{4\times T\times F} \end{aligned}$$

where t and f are the time and frequency indices, \(T\) is the number of time frames, \(F\) is the number of frequency bins, and \(||\cdot ||\) denotes the modulus (magnitude) of a complex number.

Each FOA signal consists of four channels (W, X, Y, Z), where W is the 0-th order spherical harmonic capturing the omnidirectional information, while X, Y, and Z are the 1-st order spherical harmonics conveying spatial information along the Cartesian axes of the sound field. SIVs carry valuable information about the direction of sound propagation, and their inverse direction is commonly interpreted as the DOA. Moreover, FOA-based SIVs can be directly used for precise DOA estimation [26]. In the short-time Fourier transform (STFT) domain, the SIV is computed from the four-channel spectra using:

$$\begin{aligned} I(t,f) =\frac{1}{\rho _{0}c} \Re \left\{ W^{*}(t,f)\cdot \left( \begin{array}{l} X(t,f)\\ Y(t,f)\\ Z(t,f) \end{array}\right) \right\} \end{aligned}$$

where \(\rho _{0}\) and c are the density of the medium (air) and the speed of sound, respectively, \(\Re\) denotes the real part of a complex number, and * denotes complex conjugation. The SIVs are normalized as follows:

$$\begin{aligned} I_{norm}(t,f)=\frac{I(t,f)}{\left\| I(t,f) \right\| } \end{aligned}$$

The log-linear spectrograms and normalized SIVs are concatenated as input features \(X_{e}\in {\mathbb {R} }^{7\times T_{e}\times F_{e}}\).
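The feature computation above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function name `glfer_input_features`, the channel order (W, X, Y, Z) along the first axis, and the small constant `eps` used to avoid `log(0)` and division by zero are assumptions. Because the constant \(1/(\rho _{0}c)\) cancels in the normalization, it is omitted.

```python
import numpy as np

def glfer_input_features(stft, eps=1e-12):
    """Build 7-channel input features from a complex FOA STFT.

    stft: complex array of shape (4, T, F), assumed ordered (W, X, Y, Z).
    eps:  small constant (an assumption) guarding log(0) and 0/0.
    """
    # Log-linear spectrogram: log of the squared magnitude of each channel.
    linspec = np.log(np.abs(stft) ** 2 + eps)             # (4, T, F)

    w, xyz = stft[0], stft[1:]                             # (T, F), (3, T, F)
    # Sound intensity vector: real part of conj(W) times (X, Y, Z).
    siv = np.real(np.conj(w)[None] * xyz)                  # (3, T, F)
    # Normalize each time-frequency bin to unit length.
    norm = np.linalg.norm(siv, axis=0, keepdims=True)      # (1, T, F)
    siv_norm = siv / np.maximum(norm, eps)

    # Concatenate along the channel axis: 4 + 3 = 7 channels.
    return np.concatenate([linspec, siv_norm], axis=0)     # (7, T, F)

# Random complex data standing in for a real FOA STFT.
rng = np.random.default_rng(0)
stft = rng.standard_normal((4, 10, 6)) + 1j * rng.standard_normal((4, 10, 6))
features = glfer_input_features(stft)
```

The last three channels of `features` hold unit-length vectors in every time-frequency bin, so they encode direction only and discard the energy information already carried by the spectrogram channels.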

3.2 Global-local feature extractor

Sound events differ widely in duration and frequency range [27]. Designing an effective feature extractor is necessary to obtain richer spatial information for the SSLD task. Furthermore, the classification of sound events needs coarse-grained global information, while detecting the onset and offset times of sound events and estimating the spatial trajectory of the sound source require fine-grained local information [28]. In order to obtain a comprehensive feature representation, we devise a GLF extractor, whose structure is illustrated in Fig. 3. The upper branch comprises an ODConv layer and an MSFE module in order to obtain global feature maps. In the lower branch, we introduce an LFE unit to extract local features, stacked twice to ensure that local features are extracted sufficiently.

To aggregate global and local features, we exploit the attentional feature fusion (AFF) unit [29], which realizes a soft selection between the two branches instead of direct summation. This approach highlights the important components of different feature maps while restraining unnecessary ones. The whole procedure of the GLF extractor can be formulated as follows:

$$\begin{aligned} y_{1}=MSFE(ODConv(x))+ODConv(x) \end{aligned}$$
$$\begin{aligned} y_{2}=LFE_{2}(LFE_{1}(x)) \end{aligned}$$
$$\begin{aligned} y=AFF(y_{1},y_{2}) \end{aligned}$$

Next, we provide a detailed description of the modules in the GLF extractor.

Fig. 3 The diagram of the global-local feature (GLF) extractor

3.2.1 Omni-directional dynamic convolution

Dynamic convolution [30] makes the convolution weights input-adaptive through dynamic mechanisms, enhancing the model’s generalizability. However, dynamic convolution suffers from a significant limitation in that only a single dimension (the convolutional kernel) possesses dynamic characteristics, while other dimensions are overlooked [31]. Li et al. proposed the ODConv, which introduces a multi-dimensional attention mechanism [32] that adopts a parallel strategy to learn diverse attentions for convolutional kernels along four dimensions of the kernel space, that is, the spatial, input channel, output channel, and convolutional kernel dimensions. These four kinds of attention are mutually complementary and are multiplied with the convolutional kernel \(W_{i}\) successively in the location-wise, channel-wise, filter-wise, and kernel-wise order. The ODConv can be formulated as follows:

$$\begin{aligned} y=\left(\,\alpha _{\omega 1}\odot \alpha _{f1}\odot \alpha _{c1}\odot \alpha _{s1}\odot W_{1}+ ... +\,\alpha _{\omega n}\odot \alpha _{fn}\odot \alpha _{cn}\odot \alpha _{sn}\odot W_{n}\,\right) *x \end{aligned}$$

where \({\alpha _{\omega i}}\) denotes the attention scalar for the whole kernel \(W_{i}\), and \(\alpha _{s i}\in \mathbb {R}^{k\times k}\), \(\alpha _{c i}\in \mathbb {R}^{c_{in}}\), and \({\alpha _{f i}\in \mathbb {R}^{c_{out}}}\) denote three attentions computed along the spatial, input channel, and output channel dimensions in the kernel space for the convolutional kernel \(W_{i}\), respectively. They are calculated by the squeeze-and-excitation (SE) module [33]. \(\odot\) represents multiplication along the corresponding dimension of the kernel space, and \(*\) denotes the convolution operation. Introducing these multi-dimensional dynamic weights can improve the modeling ability of the convolution with additional spatial interaction [34]. In our proposed method, the ODConv is used to calibrate the input feature maps of the GLF extractor, suppressing redundant information and highlighting important information. The calibrated feature maps are then fed into the MSFE module, which extracts multi-scale feature maps with larger receptive fields and fuses them into global feature maps through the CSS unit.
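To make the kernel aggregation in the ODConv formula concrete, the following NumPy sketch builds the single effective kernel that is subsequently convolved with the input. It is a simplified illustration under stated assumptions: the function name `odconv_kernel` is hypothetical, and the four attentions are passed in as plain arrays, whereas in ODConv they are produced from the input by SE-style branches.

```python
import numpy as np

def odconv_kernel(kernels, a_w, a_f, a_c, a_s):
    """Aggregate n candidate kernels into one input-dependent kernel.

    kernels: (n, c_out, c_in, k, k)  candidate kernels W_1..W_n
    a_w:     (n,)                    kernel-wise attention scalars
    a_f:     (n, c_out)              output-channel (filter-wise) attentions
    a_c:     (n, c_in)               input-channel attentions
    a_s:     (n, k, k)               spatial (location-wise) attentions
    """
    # Broadcast every attention along the kernel-space dimension it acts on.
    scaled = (kernels
              * a_s[:, None, None, :, :]
              * a_c[:, None, :, None, None]
              * a_f[:, :, None, None, None]
              * a_w[:, None, None, None, None])
    # Sum over the n candidates; the result is then used in an ordinary convolution.
    return scaled.sum(axis=0)                               # (c_out, c_in, k, k)

# Demo with arbitrary sizes; all-ones attentions reduce to a plain kernel sum.
rng = np.random.default_rng(0)
n, c_out, c_in, k = 4, 8, 6, 3
kernels = rng.standard_normal((n, c_out, c_in, k, k))
effective = odconv_kernel(kernels, np.ones(n), np.ones((n, c_out)),
                          np.ones((n, c_in)), np.ones((n, k, k)))
```

Because the attentions only rescale the kernels before one convolution is applied, the cost of the convolution itself is the same as for a static kernel of the same size.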

3.2.2 Multi-scale feature extraction module

We design an MSFE module to capture information across feature maps of multiple scales, which helps identify sound events with various time-frequency characteristics. In addition, we introduce an attention mechanism when combining multi-scale feature maps to emphasize the contribution of important features. We also employ a channel shuffle operation to enhance information communication among them.

Fig. 4 The overall structure of the multi-scale feature extraction (MSFE) module and the details of the cross-scale shuffle (CSS) unit therein

As shown in Fig. 4a, the outputs of the preceding ODConv layer are first fed into a multi-branch convolutional unit. Assuming that the number of channels is C, we split the feature maps into four segments along the channel dimension, each with the same channel number of \(\frac{C}{4}\), to mitigate computational complexity. Then, group convolutions (GConvs) with distinct kernel sizes are exploited to extract the spatial information at multiple scales in parallel. We adhere to the grouping setting presented in [35], where the relationship between the kernel size \(K_{i}\) and the group size \(g_{i}\) of the i-th branch is defined as follows:

$$\begin{aligned} g_{i}=2^{\frac{K_{i}-1}{2} } ,i=0,1,2,3 \end{aligned}$$
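As a quick worked example of this rule, the snippet below evaluates the group size for four odd kernel sizes. The kernel sizes 3, 5, 7, and 9 are an assumption chosen for illustration; this section does not state the values used in GLFER-Net.

```python
def group_size(kernel_size):
    """Group size g = 2 ** ((K - 1) / 2) for an odd kernel size K."""
    if kernel_size % 2 == 0:
        raise ValueError("the rule assumes an odd kernel size")
    return 2 ** ((kernel_size - 1) // 2)

# Hypothetical kernel sizes for the four branches (not taken from the paper).
kernel_sizes = [3, 5, 7, 9]
groups = [group_size(k) for k in kernel_sizes]
print(groups)  # [2, 4, 8, 16]
```

Larger kernels are paired with more groups, which keeps the parameter count of the wide-kernel branches in check. Note that a group convolution also requires the per-branch channel number \(\frac{C}{4}\) to be divisible by the group size.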

Subsequently, the SE mechanism [33] is exploited along multiple branches to acquire the attention weights for multi-scale feature maps. By employing a softmax function, the feature maps of different scales can be adaptively selected under the guidance of multi-scale channel weights denoted as att. For the i-th branch, it can be computed by:

$$\begin{aligned} att_{i}=Softmax(Z_{i})=\frac{exp(Z_{i})}{\sum _{j=0}^{3}{exp(Z_{j})}},i=0,1,2,3 \end{aligned}$$

where \(Z_{i}\in \mathbb {R}^{C/4\times 1\times 1}\) is the output of the SE module for the i-th branch. The above operation establishes the long-range channel dependency. Then, the feature map from the GConv in the i-th branch, denoted as \(F_{i}\), is calibrated as follows:

$$\begin{aligned} Y_{i}=F_{i}\odot att_{i},i=0,1,2,3 \end{aligned}$$

where \(\odot\) represents the channel-wise multiplication. In this way, the module assigns distinct weights to the feature maps of multiple scales and to different channels within a single scale, emphasizing cross-channel multi-scale spatial information. Finally, these four calibrated feature maps, \(Y_{i}\), are concatenated as the outputs of MSFE.
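The cross-branch softmax and the channel-wise calibration can be sketched as follows. This is a minimal NumPy illustration, assuming the SE outputs \(Z_{i}\) are already available (here as random numbers) and that the four branch outputs are stacked along a leading axis; the function name `msfe_recalibrate` is hypothetical.

```python
import numpy as np

def msfe_recalibrate(feats, z):
    """Weight four branch outputs with a softmax taken across branches.

    feats: (4, C/4, T, F)  outputs F_i of the four group convolutions
    z:     (4, C/4)        SE outputs Z_i (trailing 1 x 1 axes dropped)
    """
    # Softmax over the branch axis; subtracting the max is for numerical stability.
    z = z - z.max(axis=0, keepdims=True)
    att = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)   # (4, C/4)
    # Channel-wise multiplication Y_i = F_i * att_i.
    y = feats * att[:, :, None, None]
    # Concatenate the four calibrated maps along the channel axis.
    return y.reshape(-1, *feats.shape[2:]), att               # (C, T, F)

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 16, 20, 8))
z = rng.standard_normal((4, 16))
out, att = msfe_recalibrate(feats, z)
```

For every channel index, the four attention weights sum to one, so the branches compete with each other: a scale that is emphasized for one channel necessarily suppresses the other scales for that channel.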

In the MSFE module, we directly concatenate multi-scale features without consideration of their interdependencies. The utilization of group convolution leads to a side effect that the outputs of certain groups only relate to a small subset of input feature maps, lacking effective inter-group communication [36]. To alleviate this side effect, we introduce a CSS unit that can fuse the multi-scale feature maps and enhance their correlations to obtain global feature maps. The detailed architecture is illustrated in Fig. 4b.

The channel shuffle operation facilitates the flow of information across multi-scale feature maps [36]. This operation can be modeled as a process of “reshape-transpose-reshape”: given the input feature maps with dimension of \((*,n)\), it is firstly reshaped to \((*,g_{s},n/g_{s})\), further transposed to \((*,n/g_{s},g_{s})\), and reshaped back to \((*,n)\), where n and \(g_{s}\) denote channel number and group size, respectively, \(*\) represents other dimensions. After this operation, we can get the shuffled feature maps, which are aggregated together with the original ones through the aggregation block. The aggregation block includes two convolution layers with a kernel size of \(1\times 1\). The first one reduces the number of channels, and the second one further fuses the feature maps from various channels, retaining the information of original channel positions. Additionally, we incorporate residual connections supplementing original information after the CSS unit and the MSFE module.
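The "reshape-transpose-reshape" process described above can be written directly in NumPy. The sketch follows the convention of the text, with the channel axis last; the function name `channel_shuffle` is an illustrative choice.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle the last (channel) axis of x across `groups` groups.

    x: array of shape (..., n) with n divisible by `groups`.
    """
    *lead, n = x.shape
    x = x.reshape(*lead, groups, n // groups)   # (*, g_s, n / g_s)
    x = np.swapaxes(x, -1, -2)                  # (*, n / g_s, g_s)
    return x.reshape(*lead, n)                  # back to (*, n)

# Eight channels in two groups: [0 1 2 3] and [4 5 6 7].
print(channel_shuffle(np.arange(8), 2))  # [0 4 1 5 2 6 3 7]
```

After the shuffle, neighbouring channels come from different groups, so a subsequent grouped or \(1\times 1\) convolution sees information from every scale.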

3.2.3 Local feature extraction unit

Ding et al. proposed asymmetric convolution as a replacement for the standard convolution layer with a square kernel to enrich the feature space [37]. Asymmetric convolution can explicitly enhance the representational capacity of a CNN while extracting more delicate spatial features. Here, we design an LFE unit based on asymmetric convolutions to acquire more fine-grained features. As illustrated in the bottom part of Fig. 3, we design a combination of \(1\,\times \,3\,+\,3\,\times \,1\) and \(3\,\times \,1\,+\,1\,\times \,3\) convolution layers, which is followed by a BN layer and a GELU activation function. The asymmetric convolutions with the kernel sizes of \(1\,\times \,3\) and \(3\,\times \,1\) focus on information in the frequency and time dimensions, respectively. The impact of the LFE unit on SSLD performance is explored further in Section 5.

3.3 Feature recalibration module

The attention mechanism can improve the feature representation of a CNN by building dependencies among channels or spatial positions [38], and previous research has achieved significant results with it [33, 39, 40]. Inspired by these methods, we propose a feature recalibration (FR) module with three attention branches, which independently compute the attention weights along the time, channel, and frequency dimensions to combine the advantages of the three attention mechanisms. Its overall structure is shown in Fig. 5.

Fig. 5 The diagram of the Feature Recalibration (FR) module. P denotes the operation of permutation. A-POOL represents global average pooling and 1D convolution. \(\sigma\) denotes sigmoid function

In the FR module, we suppose the shape of the input feature M is (C, T, F), where C, T, and F denote the channel, time, and frequency dimensions. It is first mapped or permuted into three feature maps, \(M_{1}\) with the shape of (T, C, F), \(M_{2}\) with the shape of (C, T, F), and \(M_{3}\) with the shape of (F, T, C), which serve as the inputs of the three attention branches, respectively. Global average pooling (GAP) is applied separately to obtain the spatial statistics g along the time t, channel c, and frequency f dimensions in each branch:

$$\begin{aligned} g_{t}=\mathcal {F}_{GAP}(M_{1})=\frac{1}{C\times F} \sum \limits _{i=1}^{C} \sum \limits _{j=1}^{F} m_{1t}\left( i,j \right) , \end{aligned}$$
$$\begin{aligned} g_{c}=\mathcal {F}_{GAP}(M_{2})=\frac{1}{T\times F} \sum \limits _{i=1}^{T} \sum \limits _{j=1}^{F} m_{2c}\left( i,j \right) , \end{aligned}$$
$$\begin{aligned} g_{f}=\mathcal {F}_{GAP}(M_{3})=\frac{1}{C\times T} \sum \limits _{i=1}^{C} \sum \limits _{j=1}^{T} m_{3f}\left( i,j \right) , \end{aligned}$$

where \(m_{1t}\) is the t-th time frame feature map of \(M_{1}\), \(m_{2c}\) is the c-th channel feature map of \(M_{2}\), and \(m_{3f}\) is the f-th frequency bin feature map of \(M_{3}\). Then, three 1D-convolution layers with the same kernel size of 1 \(\times\) 3 are used in parallel to dynamically select task-related time frames, channels, and frequency bins, in the manner of efficient channel attention (ECA) [41]. The three kinds of attention maps are obtained through the sigmoid function:

$$\begin{aligned} \textbf{A}_{\textbf{t}} =sigmoid \left[ W(g_{t})\right] , \end{aligned}$$
$$\begin{aligned} \textbf{A}_{\textbf{c}} =sigmoid \left[ W(g_{c}) \right] , \end{aligned}$$
$$\begin{aligned} \textbf{A}_{\textbf{f}} =sigmoid \left[ W(g_{f})\right] \end{aligned}$$

where W is the weight matrix of the 1D-convolution in the corresponding branch, \(\textbf{A}_{\textbf{t}} \in \mathbb {R} ^{T\times 1\times 1}\), \(\textbf{A}_{\textbf{c}} \in \mathbb {R} ^{C\times 1\times 1}\), \(\textbf{A}_{\textbf{f}} \in \mathbb {R} ^{F\times 1\times 1}\). The three attention maps are multiplied with \(M_{1}\), \(M_{2}\), and \(M_{3}\), respectively, to recalibrate the feature maps along the three dimensions. This process increases the weight of the important time frames, channels, and frequency bins in the feature maps. Finally, the three recalibrated feature maps are reshaped to the same shape, denoted by \(M_{1}'\), \(M_{2}'\), and \(M_{3}'\), and summed as the final output. We will explore each attention branch through ablation studies and investigate the impacts of various attention combinations on SSLD performance.
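The three-branch recalibration can be sketched in NumPy as below. This is a simplified illustration rather than the authors' implementation: the function names are hypothetical, the size-3 kernels are passed in as plain arrays instead of learned layers, and broadcasting along the relevant axis is used in place of the explicit permutations, which yields the same result once the branches are brought back to a common shape.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def axis_attention(m, axis, kernel):
    """Attention weights along one axis of a (C, T, F) feature map."""
    other = tuple(a for a in range(m.ndim) if a != axis)
    g = m.mean(axis=other)                            # GAP over the two other axes
    # Size-3 1D convolution with zero padding. np.convolve flips the kernel,
    # which only reorders the weights of a learned filter.
    a = sigmoid(np.convolve(g, kernel, mode="same"))
    shape = [1] * m.ndim
    shape[axis] = -1
    return a.reshape(shape)                           # broadcastable to m

def feature_recalibration(m, k_c, k_t, k_f):
    """Sum of the channel-, time-, and frequency-recalibrated feature maps."""
    return sum(m * axis_attention(m, axis, k)
               for axis, k in zip((0, 1, 2), (k_c, k_t, k_f)))

rng = np.random.default_rng(0)
m = rng.standard_normal((16, 20, 8))                  # (C, T, F)
k_c, k_t, k_f = (rng.standard_normal(3) for _ in range(3))
out = feature_recalibration(m, k_c, k_t, k_f)         # same shape as m
```

Each branch needs only three weights, so the module adds very few parameters, which is the main appeal of the ECA-style design over a fully connected SE block.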

4 Experimental setup

4.1 Datasets

We evaluated the performance of the proposed GLFER-Net and conducted ablation experiments on the DCASE 2021 Task3 development dataset [10]. The spatialization of all sound events in this dataset is based on filtering with real room impulse responses, which were captured in multiple rooms of distinct sizes and acoustic absorption characteristics. Realistic spatial ambient noise is added to the recordings with signal-to-noise ratios (SNR) ranging from 30 dB to 6 dB. Furthermore, each recording is provided in two spatial formats: microphone array (MIC) and FOA. The FOA dataset is obtained with encoding filters based on the Eigenmike array, which convert the 32-channel microphone array signal to a 4-channel one. The MIC dataset uses a 4-channel subset of the Eigenmike spherical array, consisting of the microphones numbered 6, 10, 26, and 22. The development dataset comprises 400 training recordings, 100 testing recordings, and 100 validation recordings, each with a duration of one minute. The samples cover 12 sound event categories. More information about the dataset is available in [10].

To verify the generalization of our model in real acoustic environments, we also evaluated the performance of the proposed GLFER-Net on the dataset used for the baseline of DCASE 2022 Challenge Task3, which includes the STARSS 2022 development dataset [42] and another 1200 synthetic recordings. Similar to previous challenges, the dataset has two spatial recording formats: FOA and MIC. The recordings in the STARSS 2022 dataset were collected from real acoustic environments, contain more noise and reverberation, and include up to three overlapping sound events. In addition, the recordings in this dataset were organized into sessions, each occurring in a unique room. Multiple independent recordings with lengths from 30 s to 6 min were collected during each session. Within the recordings, 13 target classes are identified. The development dataset comprises 67 training recordings and 54 testing recordings. Details of this dataset can be found in [42]. The most significant distinction of DCASE 2022 Challenge Task 3 from previous challenges is that the models are tested on annotated recordings of real scenes.

Both datasets we used in this paper are in FOA format. The specific distinctions between the DCASE 2021 and 2022 Task3 development datasets are summarized in Table 1.

Table 1 The detailed characteristics of the DCASE 2021 and 2022 Task3 development datasets

We also exploit data augmentation techniques as a data transformation mechanism to improve the learning ability of the model without expanding the dataset size. Similar to [12], we consecutively utilize three data augmentation methods including audio channel swapping, random cutout and frequency shifting.
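As an illustration of how channel swapping works for FOA recordings, the sketch below rotates the sound field by 90° in azimuth. It shows one possible transformation only; the full set of swaps, as well as the cutout and frequency-shifting parameters used in the paper, is not specified in this section. The channel order (W, X, Y, Z), Cartesian DOA labels, and the function name `rotate_azimuth_90` are assumptions.

```python
import numpy as np

def rotate_azimuth_90(foa, doa):
    """Rotate an FOA signal and its DOA labels by +90 degrees in azimuth.

    foa: array (4, ...) with channels assumed ordered (W, X, Y, Z)
    doa: array (..., 3) of Cartesian unit vectors (x, y, z)
    """
    w, x, y, z = foa
    # X ~ cos(azi) and Y ~ sin(azi), so azi + 90 deg gives X' = -Y and Y' = X.
    foa_rot = np.stack([w, -y, x, z])
    # The labels must undergo the same transformation as the audio.
    doa_rot = np.stack([-doa[..., 1], doa[..., 0], doa[..., 2]], axis=-1)
    return foa_rot, doa_rot

# Plane wave from azimuth 30 deg, elevation 0 deg, encoded into FOA.
azi = np.deg2rad(30.0)
s = np.sin(np.linspace(0.0, 1.0, 50))
foa = np.stack([s, s * np.cos(azi), s * np.sin(azi), np.zeros_like(s)])
doa = np.array([np.cos(azi), np.sin(azi), 0.0])
foa_rot, doa_rot = rotate_azimuth_90(foa, doa)        # source now at 120 deg
```

Because the transformation only permutes channels and flips signs, it costs almost nothing at training time while presenting the network with source directions that are absent from the original recordings.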

4.2 Evaluation metrics

We used the official metrics [43] of DCASE Challenge Task3 to assess the SSLD performance. They consist of individual SED and SSL metrics. For the SED task, we used the location-dependent error rate (ER) and F1 score, while for the SSL task we used the class-dependent localization error (\(LE_{CD}\)) and localization recall (\(LR_{CD}\)).

The SSLD system deems a sound event detection to be accurate only when the predicted sound event has the correct class label, and its estimated DOA is within \(D^{\circ }\) of the reference DOA, where D is generally set to \(20^{\circ }\). Thus, these SED metrics are denoted as \(ER_{20^{\circ }}\) and \(F_{20^{\circ }}\).

Mathematically, \(ER_{20^{\circ }}\) and \(F_{20^{\circ }}\) are calculated as follows:

$$\begin{aligned} ER_{20^{\circ }}=\frac{S+D+I}{N}, \end{aligned}$$
$$\begin{aligned} F_{20^{\circ }}=\frac{2TP}{2TP+FP+FN} \end{aligned}$$

where TP is the number of true positives, that is, sound events that are active in both the ground truth and the prediction. FP is the number of false positives, that is, sound events predicted as active that are inactive in the ground truth. Conversely, FN is the number of false negatives, that is, sound events predicted as inactive that are active in the ground truth. N denotes the total number of sound events active in the ground truth. S, I, and D refer to substitution, insertion, and deletion errors, respectively. The mathematical definitions of these statistics are provided below:

$$\begin{aligned} S=\min (FN,FP) \end{aligned}$$
$$\begin{aligned} D=\max (0,FN-FP) \end{aligned}$$
$$\begin{aligned} I=\max (0,FP-FN) \end{aligned}$$
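The definitions above translate directly into code. The following is a minimal sketch that works on already aggregated counts; the official evaluation accumulates these statistics segment by segment, and the function name `sed_scores` as well as the example counts are hypothetical.

```python
def sed_scores(tp, fp, fn, n_ref):
    """Location-dependent error rate and F-score from event counts.

    tp, fp, fn: true positives, false positives, false negatives
    n_ref:      number of active reference events N (must be positive)
    """
    s = min(fn, fp)            # substitutions
    d = max(0, fn - fp)        # deletions
    i = max(0, fp - fn)        # insertions
    er = (s + d + i) / n_ref
    f = 2 * tp / (2 * tp + fp + fn)
    return er, f

# Hypothetical counts: 100 reference events, 70 detected within 20 degrees.
er, f = sed_scores(tp=70, fp=20, fn=30, n_ref=100)
print(round(er, 3), round(f, 3))  # 0.3 0.737
```

Note that S + D + I always equals max(FN, FP), so a false positive paired with a false negative counts as a single substitution rather than two separate errors.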

The SSL metrics are class-dependent (CD), meaning the localization prediction can only be considered when the corresponding sound class is correctly detected. \(LE_{CD}\) expresses the average angular distance between predictions and ground truths and can be formulated as:

$$\begin{aligned} LE_{CD}=\arccos (\textbf{u}_{\textbf{ref}}\cdot {\textbf{u}_{\textbf{pre}}}) \end{aligned}$$

where \(\textbf{u}_{\textbf{ref}}\) and \(\textbf{u}_{\textbf{pre}}\) denote the position vectors of the reference and predicted sound event, respectively. The localization recall metric \(LR_{CD}\) is the true positive rate of localization estimation.
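For a single matched pair of reference and predicted directions, the angular distance can be computed as sketched below; \(LE_{CD}\) is then the average of this value over the matched pairs. The explicit normalization and the clamping of the cosine are additions for numerical safety, and the function name is hypothetical.

```python
import math

def angular_error_deg(u_ref, u_pre):
    """Angle in degrees between two 3D direction vectors."""
    dot = sum(a * b for a, b in zip(u_ref, u_pre))
    norm = math.sqrt(sum(a * a for a in u_ref)) * math.sqrt(sum(b * b for b in u_pre))
    # Clamp to [-1, 1] so that rounding errors cannot break acos.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

# A prediction along +y for a source on +x is 90 degrees off.
error = angular_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```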

Furthermore, SSLD error (\(\varepsilon _{SSLD}\)), which is an average of four metrics, was introduced as an additional general performance assessment:

$$\begin{aligned} \varepsilon _{SSLD}=\frac{ER_{20^{\circ }}+(1-F_{20^{\circ }})+{{\frac{LE_{CD}}{180^{\circ }}}} +(1-LR_{CD})}{4} \end{aligned}$$

It is worth noting that, unlike the previous challenges, which used micro-averaged metrics, DCASE 2022 Task3 introduces a new evaluation scheme in which the metrics \(F_{20^{\circ }}\), \(LE_{CD}\), and \(LR_{CD}\) are macro-averaged, assigning equal weight to each class. Hence, we used the macro-averaged metrics to evaluate the methods trained on the STARSS 2022 development dataset to ensure the fairness of the experiments. A good SSLD system should exhibit lower scores on the \(ER_{20^{\circ }}\), \(LE_{CD}\), and \(\varepsilon _{SSLD}\) metrics, along with higher scores on the \(F_{20^{\circ }}\) and \(LR_{CD}\) metrics.
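The aggregate SSLD error is a plain average of four terms that are each arranged so that lower is better. A short sketch follows, with hypothetical metric values chosen only to make the arithmetic easy to follow.

```python
def ssld_error(er, f, le_deg, lr):
    """Aggregate SSLD error from the four individual metrics.

    er:     error rate ER_20 (lower is better)
    f:      F-score F_20 in [0, 1] (higher is better)
    le_deg: localization error LE_CD in degrees
    lr:     localization recall LR_CD in [0, 1]
    """
    return (er + (1 - f) + le_deg / 180 + (1 - lr)) / 4

# Hypothetical scores: (0.4 + 0.4 + 0.1 + 0.3) / 4
score = ssld_error(er=0.4, f=0.6, le_deg=18.0, lr=0.7)
print(round(score, 3))  # 0.3
```

Dividing \(LE_{CD}\) by 180° maps the localization error to roughly the same [0, 1] range as the other three terms, so no single metric dominates the average.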

4.3 Loss functions

GLFER-Net was trained using a multi-objective learning approach that simultaneously optimizes the SED and SSL sub-tasks. The SED task is treated as a multi-label classification task and uses the binary cross-entropy loss [44] as follows:

$$\begin{aligned} L^{\textrm{SED}}=-\frac{1}{T_{0}N}\sum \limits _{t_{0},n}\left[ \textbf{y}_{\textbf{t}_{\textbf{0}}\textbf{,n}}^{\textrm{SED}}\textrm{log}\hat{\textbf{y}}_{\textbf{t}_{\textbf{0}}\textbf{,n}}^{\textrm{SED}}+ (1-\textbf{y}_{\textbf{t}_{\textbf{0}}\textbf{,n}}^{\textrm{SED}})\textrm{log}(1-\hat{\textbf{y}}_{\textbf{t}_{\textbf{0}}\textbf{,n}}^{\textrm{SED}})\right] , \end{aligned}$$

where \(\textbf{y}_{\textbf{t}_{\textbf{0}}\textbf{,n}}^{\textrm{SED}}\) and \(\hat{\textbf{y}}_{\textbf{t}_{\textbf{0}}\textbf{,n}}^{\textrm{SED}}\) are the reference and estimated activity probabilities for the n-th sound event at the \(t_{0}\)-th frame, respectively. \(T_{0}\) represents the total number of frames in one batch, and N represents the number of classes. In this paper, \(T_{0}=T_{e}/16\). The SSL task is treated as a regression task and uses the mean squared error [45] as follows:

$$\begin{aligned} L^{\textrm{SSL}}=\frac{1}{T_{0}N}\sum \limits _{t_{0},n}||(\hat{\textbf{y}}_{t_{0},n}^{\textrm{DOA}}-\textbf{y}_{t_{0},n}^{\textrm{DOA}})||^{2} \end{aligned}$$

where \(\hat{\textbf{y}}_{t_{0},n}^{\textrm{DOA}}\) is the DOA estimate for the n-th sound event at the \(t_{0}\)-th frame, and \(\textbf{y}_{t_{0},n}^{\textrm{DOA}}\) denotes the corresponding ground truth. The final loss function L can be expressed as follows:

$$\begin{aligned} L=\lambda L^{\textrm{SED}}+(1-\lambda )\textbf{1}_{\textrm{active}}L^{\textrm{SSL}} \end{aligned}$$

where \(\lambda\) is a hyper-parameter that weights the two losses; empirically, we set it to 0.3 in this paper. The indicator \(\textbf{1}_{\textrm{active}}\) means that \(L^{\textrm{SSL}}\) is only computed for the active sound events in each frame.
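To make the combination concrete, the following is a minimal pure-Python sketch of the three equations above, not the authors' training code. It assumes that the activity mask is applied per (frame, class) cell inside the sum and that the SSL term is still divided by \(T_{0}N\); the function names and the clipping constant `eps` are illustrative.

```python
import math


def sed_loss(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy averaged over all (frame, class) cells."""
    total, count = 0.0, 0
    for row_true, row_prob in zip(y_true, y_prob):
        for y, p in zip(row_true, row_prob):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            total += y * math.log(p) + (1 - y) * math.log(1 - p)
            count += 1
    return -total / count


def ssl_loss(doa_true, doa_pred, active):
    """Squared DOA error summed over active cells, divided by T0 * N (assumption)."""
    total, count = 0.0, 0
    for row_true, row_pred, row_active in zip(doa_true, doa_pred, active):
        for v_true, v_pred, is_active in zip(row_true, row_pred, row_active):
            count += 1
            if is_active:
                total += sum((p - t) ** 2 for p, t in zip(v_pred, v_true))
    return total / count


def total_loss(y_true, y_prob, doa_true, doa_pred, lam=0.3):
    """Weighted sum of the SED and SSL losses with weight lam on the SED term."""
    sed = sed_loss(y_true, y_prob)
    ssl = ssl_loss(doa_true, doa_pred, y_true)
    return lam * sed + (1 - lam) * ssl


# One frame, two classes: class 0 is active, class 1 is inactive.
y_true = [[1, 0]]
y_prob = [[0.5, 0.5]]
doa_true = [[(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
doa_pred = [[(0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]]
print(total_loss(y_true, y_prob, doa_true, doa_pred))
```

In this example the DOA error of the inactive class does not contribute to the loss, which is the effect of the indicator in the final loss equation.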

4.4 Training setup

The sampling rate for both datasets is 24 kHz. We employed a Hann window with 512 points and a hop size of 300 points for the STFT. The network was trained for 50 epochs with a batch size of 8, using the AdamW optimizer with an initial learning rate of \(3\times 10^{-4}\) and the cosine annealing schedule [46]. Input signals are segmented into 8-s non-overlapping segments. The numbers of GLF extractors and FR modules are both set to three, with 64, 128, and 256 filters for the three GLF extractors, respectively. The hidden layer dimension of the BiGRU is set to 128. A threshold of 0.3 is applied to binarize the SED predictions. All models are implemented using PyTorch.
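For reference, a cosine annealing schedule of the kind described in [46] can be sketched as below. The minimum learning rate of 0, the per-epoch update, and the absence of warm restarts are assumptions, since the text does not specify them; the function name is illustrative.

```python
import math


def cosine_annealed_lr(epoch, total_epochs=50, lr_init=3e-4, lr_min=0.0):
    """Learning rate at a given epoch under a single cosine decay cycle."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term


for epoch in (0, 25, 50):
    print(epoch, cosine_annealed_lr(epoch))
```

The rate starts at the initial value, reaches half of it at the midpoint, and decays to the minimum at the last epoch. PyTorch provides a comparable scheduler as `torch.optim.lr_scheduler.CosineAnnealingLR`, although the text does not state which implementation was used.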

Fig. 6 The column charts of four metrics for models with various group sizes (\(g_{s}\)) of the shuffle operation in CSS units. The left chart shows the scores of \(ER_{20^{\circ }}\) and \(F_{20^{\circ }}\) for the SED task, and the right one those of \(LE_{CD}\) and \(LR_{CD}\) for the SSL task

5 Experimental results and discussion

We conducted an exhaustive analysis of experimental results. Firstly, we investigated the group size of the channel shuffle operation in the CSS unit and determined an optimal one. Subsequently, a series of ablation experiments were carried out to demonstrate the significance of GLF extractor and FR module. Following that, we explored the combination of the attention branches in the FR module. Besides, we conducted an ablation study of channel shuffle operation and aggregation block in the CSS unit to verify their roles. The above ablation experiments were all conducted on the DCASE 2021 Task3 development dataset. Finally, we performed the comparisons with multiple competitive methods on two datasets and visualized the predicted results of the model to demonstrate the superiority of the proposed GLFER-Net.

5.1 Ablation study of group size in the CSS unit

We investigate which group size of the channel shuffle operation in the CSS unit is best for the SSLD task. In this part, all experiments were conducted on our proposed GLFER-Net without data augmentation techniques but with various \(g_{s}\) of the channel shuffle operation. Figure 6 illustrates the experimental results with different \(g_{s}\). The optimum performance is achieved when \(g_{s}\) is set to 8; values of \(g_{s}\) either smaller or larger than 8 lead to considerable reductions in performance. As the group size increases, the number of channels in each group decreases accordingly. Each group of channels then carries too little information, so the information exchange among different groups is incomplete, potentially weakening the representation ability of the model. Consequently, the group size of the channel shuffle operation is set to 8 in the following experiments.
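The channel shuffle operation itself, as introduced in ShuffleNet [36], can be sketched on a list of channel indices. This is an illustrative sketch, not the CSS unit; it assumes that \(g_{s}\) plays the role of the number of groups (`groups` below), which matches the statement that a larger \(g_{s}\) leaves fewer channels per group.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (reshape to groups x per_group, transpose, flatten)."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # Read the (groups, per_group) grid column by column.
    return [channels[g * per_group + k] for k in range(per_group) for g in range(groups)]


print(channel_shuffle(list(range(8)), groups=2))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

After shuffling, neighbouring channels come from different groups, so a subsequent grouped operation sees information from every group. With as many groups as channels, each group holds a single channel and the shuffle leaves the order unchanged.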

5.2 Ablation study of FR module and GLF extractor

To verify the effectiveness of the GLF extractor, which is designed to extract features containing rich spatial information, and that of the FR module, which recalibrates the feature maps, we conducted a series of experiments. As shown in Table 2, when the FR module is removed, GLFER-Net performs worse: the scores of \(ER_{20^{\circ }}\), \(LE_{CD}\), and \(\varepsilon _{SSLD}\) increase by 0.05, \(1.4^{\circ }\), and 0.02, and that of \(F_{20^{\circ }}\) decreases by 2.3%. When the GLF extractor is further replaced with two serial convolutional blocks, the performance declines notably, as shown in the last row of the table, especially \(\varepsilon _{SSLD}\), which worsens by 0.07.

Table 2 Ablation study on the DCASE 2021 Task3 development dataset

To show the effectiveness of the LFE unit in the GLF extractor visually, we visualize the feature maps in the first GLF extractor. In Fig. 7, the regions enclosed by blue boxes indicate the moving sound event Speech, and those enclosed by red boxes the periodic, discontinuous sound event Alarm. Figure 7a illustrates a feature map of x, the input of the GLF extractor that follows the encoder. It still contains substantial noise, so the model has difficulty recognizing each sound event. Figure 7b is a feature map of \(y_{1}\), which is the output of the MSFE module in the first GLF extractor. Comparing Fig. 7b with Fig. 7a, we can see that the components unrelated to sound events are reduced after the MSFE module. However, the harmonic components of Speech are unclear, and the energy distribution of Alarm appears indistinct. Figure 7c is a feature map of y, which is the output of the first GLF extractor. It contains fine-grained information captured by the LFE units. As seen in Fig. 7c, the periodic nature of Alarm is discernible, but the harmonic components of Speech after 1 s are lost. Figure 7d is another feature map of y. Although the characteristics of Alarm are not clear enough, the harmonic components of Speech are more complete than in Fig. 7c.

Fig. 7 The visualization of feature maps in the first GLF extractor. a A feature map of the input of GLF extractor, b that of the output of MSFE module, c and d two feature maps of the output of GLF extractor. c The discernible periodic nature of Alarm and d the clear harmonic components of Speech

By comparing Fig. 7b with Fig. 7c and d, we can see that with the assistance of LFE units, the specific frequency components of two kinds of sound events can be clearly distinguished from the feature maps. By comparing Fig. 7a with Fig. 7c and d, we can verify the effectiveness of the GLF extractor.

5.3 Exploration of attention branches in FR module

We also explored the effectiveness and the combination of various attention branches in the FR module. Table 3 provides the results of various attention branches, where CA represents the channel attention branch, TA the time attention branch, and FA the frequency attention branch. From Table 3, we can draw the following conclusions:

Table 3 The exploration of combinations of three attention branches in the FR module on the DCASE 2021 Task3 development dataset
  1. The top three rows of Table 3 correspond to using a single attention branch. The experimental results show that the model with the CA branch achieves the best performance, because the CA mechanism can learn inter-channel correlations, which is crucial for enhancing localization performance. Notably, the model with the CA branch achieves the highest \(LR_{CD}\) score of 67.1%.

  2. Among the three combinations of two attention branches, the model combining the CA and TA branches yields relatively superior results. TA is an attention mechanism along the temporal dimension that highlights crucial time frames, aiding the detection of active sound event boundaries. Integrating TA with CA can further improve the accuracy of the SED task, with the scores of \(ER_{20^{\circ }}\) and \(F_{20^{\circ }}\) improving by 0.01 and 1.0%, respectively. However, a marginal gap of 0.01 remains in the \(\varepsilon _{SSLD}\) score between this model and the one with all three attention branches, shown in the final row.

  3. The FA branch emphasizes the important spectral components to provide more accurate spatial information for the SED and SSL tasks. When it is employed together with the CA and TA branches, their complementary effects lead to excellent performance, and all metrics improve significantly. Hence, in the following experiments, the FR module consists of three attention branches.

5.4 Ablation study of two components in CSS unit

We performed ablation experiments on the channel shuffle operation (Shuf) and the aggregation block (Agg), with the FR module containing all three attention branches and the group size (\(g_{s}\)) of the channel shuffle operation set to 8. In Table 4, the first row gives the performance of the complete GLFER-Net, and the second row that of the model without the aggregation block in the CSS unit, where the shuffled features are directly added to the original ones as the output. Compared to the first row, the scores in the second row increase by 0.04, \(1.5^{\circ }\), and 0.02 on \(ER_{20^{\circ }}\), \(LE_{CD}\), and \(\varepsilon _{SSLD}\) and decrease by 2.1% and 0.7% on \(F_{20^{\circ }}\) and \(LR_{CD}\), respectively. These results verify the effectiveness of the aggregation block in preserving the information from the original channels. The last row gives the results of the model without both the aggregation block and the channel shuffle operation, i.e., the model uses a convolutional layer with a kernel size of 1 × 1 in place of the CSS unit. Compared to the first row, the scores in the last row increase by 0.05, \(1.8^{\circ }\), and 0.03 on \(ER_{20^{\circ }}\), \(LE_{CD}\), and \(\varepsilon _{SSLD}\) and decrease by 3.1% and 1.3% on \(F_{20^{\circ }}\) and \(LR_{CD}\), respectively. These results demonstrate the effectiveness of the CSS unit, which integrates multi-scale features and enhances information communication among them.

Table 4 Ablation study of the CSS unit on the DCASE 2021 Task3 development dataset
Table 5 The comparison between GLFER-Net and six other methods on the DCASE 2021 Task3 development dataset
Table 6 The comparison between our proposed GLFER-Net and four other methods on the DCASE 2022 Task3 development dataset

5.5 Comparisons with other methods

We first compared our proposed GLFER-Net with six other methods on the DCASE 2021 Task3 development dataset. All models were trained and evaluated on the same training, validation, and test splits for fairness. In Table 5, the compared methods did not use data augmentation or post-processing techniques. They all employed simple CRNN-based architectures, in which the CNNs use single-scale convolution kernels. Compared to these methods, our proposed GLFER-Net demonstrates stronger feature extraction capability by leveraging GLF extractors and FR modules. The feature representation of GLFER-Net contains more comprehensive spatial information, enhancing the classification accuracy of diverse sound events. Additionally, the multi-dimensional attention mechanism within the FR module contributes to improving localization performance. As shown in Table 5, GLFER-Net outperforms almost all compared methods; it is slightly worse than AD-YOLO [17] only on the \(LE_{CD}\) metric, while having fewer parameters. The last row lists the results of GLFER-Net with the data augmentation techniques mentioned in Section 4.1. Compared to training GLFER-Net on the original data, training it on the transformed data increases the scores by 8.2% and 4.7% on \(F_{20^{\circ }}\) and \(LR_{CD}\) and decreases them by 0.1, \(3.2^{\circ }\), and 0.06 on \(ER_{20^{\circ }}\), \(LE_{CD}\), and \(\varepsilon _{SSLD}\), respectively. This demonstrates that this kind of consecutive data transformation brings a substantial improvement in SSLD performance.

Then, to verify the generalization ability of our proposed method, we also compared it with four other methods on the DCASE 2022 Task3 development dataset. As shown in Table 6, GLFER-Net without data augmentation techniques is only slightly superior to Resnet-Conformer [21] on \(\varepsilon _{SSLD}\). However, it exhibits strong localization performance, with an improvement of 9.7% on \(LR_{CD}\).

Fig. 8 The visualization of ground truth and predicted scores of three ablation experiments. In each sub-graph, the first row displays the results of the SED task, and the second and third rows show the results of azimuth and elevation for the SSL task, respectively. In all sub-plots, the horizontal axes represent the time, and the vertical ones in the SED task sub-plots are the sound event class indices, each depicted with a unique color. The vertical axes of the second and third rows of each sub-graph represent the azimuth angle with a range of \([-180^{\circ }, 180^{\circ }]\) and the elevation angle with a range of \([-90^{\circ }, 90^{\circ }]\), respectively

5.6 Visualization analysis

Taking “fold6_room1_mix012”, an audio clip from the test set of the DCASE 2021 Task3 development dataset, as an example, we present a visualization analysis. For the SED task, in comparison to the reference in Fig. 8a, GLFER-Net without the FR module predicts an erroneous category at around 20 s, marked by the light blue rectangular box in Fig. 8b. The complete GLFER-Net correctly detects the boundaries and categories of most sound events, as shown in Fig. 8c.

For the SSL task, the second and third rows of each sub-graph visualize the azimuth and elevation, respectively. In the reference diagram, the green rectangular box marks the azimuth trajectory of the sound event labeled Knock, which corresponds to a moving sound source occurring at around 55 s. GLFER-Net shows a certain level of robustness to moving sound sources but confuses the sound events Knock, indicated by the blue line, and Phone, indicated by the green line, from 50 s to 60 s, as shown in Fig. 8c. However, such confusion is effectively alleviated with the assistance of data augmentation methods. Furthermore, our method produces biases in the elevation angle relative to the ground truth, marked by the black rectangular box. The comparison between Fig. 8c and d shows that the data augmentation method with three consecutive transformations reduces such biases.

6 Conclusion

In this paper, we propose GLFER-Net, which is based on the GLF extractor and the FR module, for polyphonic sound source localization and detection. The LFE units in the GLF extractor supplement the multi-scale features from the MSFE module with fine-grained information, and a CSS unit in the MSFE module is designed to fuse the multi-scale features and enhance the information communication among them. After each GLF extractor, an FR module is introduced to emphasize the crucial features along multiple dimensions. We also use three consecutive data augmentation methods as a data transformation mechanism to improve the generalization ability of the model. On the open datasets of Task3 in the DCASE 2021 and 2022 Challenges, our proposed GLFER-Net outperforms six and four SSLD methods, respectively. Through a series of ablation experiments and visualization analyses on the DCASE 2021 Task3 development dataset, the effectiveness of the GLF extractor, the FR module, the LFE unit, and the two components in the CSS unit is verified. However, our proposed method still has some limitations. When compared with other methods on the development dataset of the DCASE 2022 Challenge, the \(ER_{20^{\circ }}\) score of GLFER-Net is slightly higher than that of Resnet-Conformer. Additionally, the complexity of the model needs to be further reduced.

Availability of data and materials

The DCASE 2021 dataset used in the experiments of this paper is available at

The DCASE 2022 dataset used in the experiments of this paper is available at

Code availability

GLFER-Net was implemented using the PyTorch backend. The code is available at





Abbreviations

Activity-coupled Cartesian direction of arrival
Attentional feature fusion
Bidirectional gated recurrent unit
Batch normalization
Convolutional neural network
Convolutional recurrent neural network
Cross-scale shuffle
Detection and classification of acoustic scenes and events
Direction of arrival
Efficient channel attention
Fully connected
First-order ambisonics
Feature recalibration
Gammatone-based Sound Events Localization and Detection
Global averaging pooling
Group convolution
Gaussian error linear unit
Global-local feature
Polyphonic SSLD network adopting global-local feature extraction and recalibration
Iterative sound source localization
Local feature extraction
Microphone array
Multi-scale feature extraction
Omni-directional dynamic convolution
Recurrent neural network
Spatial cue-augmented log spectrogram
Sound event detection
Sound Event Localization and Detection via Temporal Convolutional Network
Sound intensity vector
Sound source localization
Sound source localization and detection
Short-time Fourier transform
You Only Look Once


  1. P.A. Grumiaux, S. Kitić, L. Girin, A. Guérin, A survey of sound source localization with deep learning methods. J. Acoust. Soc. Am. 152(1), 107–151 (2022)

  2. L. Jin, J. Yan, X. Du, X. Xiao, D. Fu, Rnn for solving time-variant generalized sylvester equation with applications to robots and acoustic source localization. IEEE Trans. Ind. Inform. 16(10), 6359–6369 (2020).

  3. Y. Yang, Q. Hu, Q. Zhao, P. Zhang, So-das: A two-step soft-direction-aware speech separation framework. IEEE Signal Proc. Lett. 30, 344–348 (2023)

  4. Y. Zhang, A. Zheng, K. Han, Y. Wang, J.N. Hwang, in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vehicle 3d localization in road scenes via a monocular moving camera (2021), pp. 2390–2394.

  5. Y. Hu, X. Zhu, Y. Li, H. Huang, L. He, A multi-grained based attention network for semi-supervised sound event detection (Interspeech, 2022)

  6. Y. Fu, M. Ge, H. Yin, X. Qian, L. Wang, G. Zhang, J. Dang, Iterative sound source localization for unknown number of sources. (INTERSPEECH, 2022)

  7. S. Adavanne, A. Politis, J. Nikunen, T. Virtanen, Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE J. Sel. Top. Signal Process. 13(1), 34–48 (2019).

  8. S. Adavanne, A. Politis, T. Virtanen, A multi-room reverberant dataset for sound event localization and detection (2019). arXiv preprint arXiv:1905.08546

  9. K. Shimada, N. Takahashi, S. Takahashi, Y. Mitsufuji, Sound event localization and detection using activity-coupled cartesian doa vector and rd3net (2020). arXiv preprint arXiv:2006.12014

  10. A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, T. Virtanen, in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection (Barcelona, Spain, 2021), pp. 125–129.

  11. Q. Wang, J. Du, Z. Nian, S. Niu, L. Chai, H. Wu, J. Pan, C.H. Lee, in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Loss function design for DNN-based sound event localization and detection on low-resource realistic data (IEEE, 2023), pp. 1–5

  12. T.N.T. Nguyen, K.N. Watcharasupat, N.K. Nguyen, D.L. Jones, W.S. Gan, Salsa: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 1749–1762 (2022).

  13. K. Rosero, F. Grijalva, B. Masiero, Sound events localization and detection using bio-inspired gammatone filters and temporal convolutional neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. (2023)

  14. K. Shimada, Y. Koyama, N. Takahashi, S. Takahashi, Y. Mitsufuji, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Accdoa: Activity-coupled cartesian direction of arrival representation for sound event localization and detection (IEEE, 2021), pp. 915–919

  15. Y. Cao, T. Iqbal, Q. Kong, F. An, W. Wang, M.D. Plumbley, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), An improved event-independent network for polyphonic sound event localization and detection (IEEE, 2021), pp. 885–889

  16. K. Shimada, Y. Koyama, S. Takahashi, N. Takahashi, E. Tsunoo, Y. Mitsufuji, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Multi-accdoa: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training (IEEE, 2022), pp. 316–320

  17. J.S. Kim, H.J. Park, W. Shin, S.W. Han, in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Ad-yolo: You look only once in training multiple sound event localization and detection (IEEE, 2023), pp. 1–5

  18. K. Guirguis, C. Schorn, A. Guntoro, S. Abdulatif, B. Yang, in 2020 28th European Signal Processing Conference (EUSIPCO), SELD-TCN: Sound event localization and detection via temporal convolutional networks (2021), pp. 16–20.

  19. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)

  20. A. Gulati, J. Qin, C.C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., Conformer: Convolution-augmented transformer for speech recognition (2020). arXiv preprint arXiv:2005.08100

  21. Q. Wang, J. Du, H.X. Wu, J. Pan, F. Ma, C.H. Lee, A four-stage data augmentation approach to ResNet-Conformer based acoustic modeling for sound event localization and detection. IEEE/ACM Trans. Audio Speech Lang. Process. 31, 1251–1264 (2023)

  22. Y. Zhou, H. Wan, Dual-branch attention module-based network with parameter sharing for joint sound event detection and localization. EURASIP J. Audio Speech Music Process. 2023(1), 27 (2023)

  23. L. Mazzon, Y. Koizumi, M. Yasuda, N. Harada, First order ambisonics domain spatial augmentation for DNN-based direction of arrival estimation (2019). arXiv preprint arXiv:1910.04388

  24. F. Ronchini, D. Arteaga, A. Pérez-López, in DCASE, Sound event localization and detection based on CRNN using rectangular filters and channel rotation data augmentation (2020), pp. 180–184

  25. D. Hendrycks, K. Gimpel, Gaussian error linear units (GELUs) (2016). arXiv preprint arXiv:1606.08415

  26. Y. Cao, T. Iqbal, Q. Kong, M. Galindo, W. Wang, M.D. Plumbley, in Proc. Detection Classification Acoustic Scenes Events (DCASE) Challenge, Two-stage sound event localization and detection using intensity vector and generalized cross-correlation (2019)

  27. Y. Hu, X. Sun, L. He, H. Huang, A generalized network based on multi-scale densely connection and residual attention for sound source localization and detection. J. Acoust. Soc. Am. 151(3), 1754–1768 (2022)

  28. Y. Hu, X. Zhu, Y. Li, H. Huang, L. He, A multi-grained based attention network for semi-supervised sound event detection (2022). arXiv preprint arXiv:2206.10175

  29. Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, K. Barnard, in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Attentional feature fusion (2021), pp. 3560–3569

  30. Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, Z. Liu, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Dynamic convolution: Attention over convolution kernels (2020), pp. 11030–11039

  31. B. Yang, G. Bender, Q.V. Le, J. Ngiam, Condconv: Conditionally parameterized convolutions for efficient inference. Adv. Neural Inf. Process. Syst. 32 (2019)

  32. C. Li, A. Zhou, A. Yao, in International Conference on Learning Representations, Omni-dimensional dynamic convolution (2022)

  33. J. Hu, L. Shen, G. Sun, in Proceedings of the IEEE conference on computer vision and pattern recognition, Squeeze-and-excitation networks (2018), pp. 7132–7141

  34. Y. Rao, W. Zhao, Y. Tang, J. Zhou, S.N. Lim, J. Lu, Hornet: Efficient high-order spatial interactions with recursive gated convolutions. Adv. Neural Inf. Process. Syst. 35, 10353–10366 (2022)

  35. H. Zhang, K. Zu, J. Lu, Y. Zou, D. Meng, in Proceedings of the Asian Conference on Computer Vision, Epsanet: An efficient pyramid squeeze attention block on convolutional neural network (2022), pp. 1161–1177

  36. X. Zhang, X. Zhou, M. Lin, J. Sun, in Proceedings of the IEEE conference on computer vision and pattern recognition, Shufflenet: An extremely efficient convolutional neural network for mobile devices (2018), pp. 6848–6856

  37. X. Ding, Y. Guo, G. Ding, J. Han, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Acnet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks (2019)

  38. D. Misra, T. Nalamada, A.U. Arasanipalai, Q. Hou, in Proceedings of the IEEE/CVF winter conference on applications of computer vision, Rotate to attend: Convolutional triplet attention module (2021), pp. 3139–3148

  39. S. Woo, J. Park, J.Y. Lee, I.S. Kweon, in Proceedings of the European conference on computer vision (ECCV), CBAM: Convolutional block attention module (2018), pp. 3–19

  40. S. Yu, X. Sun, Y. Yu, W. Li, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Frequency-temporal attention network for singing melody extraction (IEEE, 2021), pp. 251–255

  41. Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, Q. Hu, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, ECA-Net: Efficient channel attention for deep convolutional neural networks (2020), pp. 11534–11542

  42. A. Politis, K. Shimada, P. Sudarsanam, S. Adavanne, D. Krause, Y. Koyama, N. Takahashi, S. Takahashi, Y. Mitsufuji, T. Virtanen, in Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022), STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events (Nancy, 2022), pp. 125–129.

  43. A. Mesaros, S. Adavanne, A. Politis, T. Heittola, T. Virtanen, in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Joint measurement of localization and detection of sound events (IEEE, 2019), pp. 333–337

  44. U. Ruby, V. Yendapalli, Binary cross entropy with deep learning technique for image classification. Int. J. Adv. Trends Comput. Sci. Eng. 9(10) (2020)

  45. C.J. Willmott, K. Matsuura, Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 30(1), 79–82 (2005)

  46. I. Loshchilov, F. Hutter, in International Conference on Learning Representations, SGDR: stochastic gradient descent with warm restarts (2016)


This work is supported by the National Natural Science Foundation of China (NSFC).


Not applicable.

Author information

M. Ma: conceptualization, methodology, software, validation, writing (original draft preparation), and visualization. Y. Hu: conceptualization, methodology, validation, and writing (review and editing). H. Huang: resources and supervision. L. He: supervision and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Ying Hu.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Ma, M., Hu, Y., He, L. et al. GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration. J AUDIO SPEECH MUSIC PROC. 2024, 34 (2024).
