GLFER-Net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration

Polyphonic sound source localization and detection (SSLD) task aims to recognize the categories of sound events, identify their onset and offset times


Introduction
Sound source localization and detection (SSLD) can be regarded as a joint task of locating sound sources and detecting sound events. An SSLD system should simultaneously predict the boundaries of active sound events, identify their categories, and provide the spatial trajectories of the sound sources. In recent years, SSLD has gained much popularity and has proved helpful in many everyday applications [1]. For instance, with the assistance of an SSLD system, robots can better support human-machine interaction [2]. Moreover, SSLD can collaborate with the speech enhancement task to denoise specific speakers by capturing their positions in intelligent meeting rooms [3]. It is also applied in smart cities for real-time environmental sound monitoring [4].
The SSLD task can be divided into two sub-tasks: sound event detection (SED) and sound source localization (SSL). SED identifies not only the categories of sound events but also their onset and offset times. Currently, the SED task mainly focuses on polyphonic SED, which aims to predict overlapping events of different categories. Semi-supervised learning is widely used for this task because its datasets contain a large amount of unlabeled data [5].
SSL estimates the direction of arrival (DOA), which is crucial for determining the positions of sound sources relative to the microphones at each time frame. It can provide auxiliary localization information about each sound source for downstream tasks. For real-world scenarios with an unknown number of sources, Fu et al. proposed the iterative sound source localization (ISSL) method to make the model more stable without using a threshold [6]. Adavanne et al. pioneered a convolutional recurrent neural network (CRNN) structure for the SSLD task [7], taking full advantage of the feature extraction ability of the convolutional neural network (CNN) and the temporal context modeling ability of the recurrent neural network (RNN). This CRNN-based approach, adopted as the baseline of Task3 of the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, has become widely recognized and extensively applied in subsequent research [8][9][10][11].
In recent years, to support practical applications of SSLD, the dataset used in DCASE 2022 Task3 has supplemented the earlier computationally simulated spatial recordings with recordings of real-world spatial sound scenes that carry accurate manual annotations. This realistic dataset not only has stronger reverberation and more diverse environmental settings but also includes frequently moving sound sources, more background noise, and a large number of homogeneous sound events. These factors considerably increase the difficulty of the SSLD task, especially given the sharp increase in the number of overlapping sound events, and place higher demands on the capability of encoding spatial relations.
Vanilla CRNN-based SSLD models that use CNNs with a single kernel size struggle to adequately extract the feature maps of sound events with diverse time-frequency characteristics. Additionally, the spatial features extracted by a vanilla CNN lack fine-grained information, resulting in inaccurate localization of sound sources, and lack spatial information across multiple dimensions, resulting in inaccurate classification of sound events. To overcome these limitations of the CRNN structure, we propose a polyphonic SSLD network based on global-local feature extraction and recalibration (GLFER-Net), in which a global-local feature (GLF) extractor and a feature recalibration (FR) module replace the vanilla CNN of a CRNN-based model. Our primary contributions are summarized as follows: (1) We design a GLF extractor consisting of two branches: an omni-directional dynamic convolution (ODConv) layer and a multi-scale feature extraction (MSFE) module make up the upper branch for extracting global feature maps, while two local feature extraction (LFE) units make up the lower branch for extracting local features. (2) We design a cross-scale shuffle (CSS) unit to fuse the multi-scale features in the MSFE module and an FR module to recalibrate the feature maps output by the GLF extractor. (3) We compare the proposed GLFER-Net with six SSLD methods on the synthetic dataset and with four on the realistic dataset. A series of ablation studies and visualization analyses show the effectiveness of the GLF extractor and FR module. Additionally, we exploit data augmentation techniques to enhance the generalization capability of the proposed GLFER-Net.
This paper is organized as follows. Section 2 introduces the existing SSLD methods. Section 3 details our proposed network. The experimental setup, datasets, and evaluation metrics are described in Section 4. The experimental results and discussion are presented in Section 5. Finally, we draw conclusions in Section 6.

Related works
Recent SSLD methods have explored the input features, output format, model architecture, and data augmentation techniques. Several novel input features have been introduced to incorporate as much positional information as possible. Nguyen et al. proposed the spatial cue-augmented log-spectrogram (SALSA) features [12], which consist of multi-channel log-spectrograms and the normalized principal eigenvectors of the spatial covariance matrices in their respective time-frequency bins. These features proved crucial for addressing overlapping sound events. Rosero et al. devised a Gammatone-based Sound Events Localization and Detection (G-SELD) system [13], which seeks to enhance SSLD performance by employing bio-inspired gammatone auditory filters for acoustic feature extraction.
There has been increasing attention on the design of output formats in the field of SSLD. Shimada et al. proposed the activity-coupled Cartesian DOA (ACCDOA) vector to simplify the model's output [14], treating SSLD as a single regression task. Cao et al. introduced a track-wise output format [15], which showed excellent generalization for overlapping sound events of the same class. Shimada et al. extended ACCDOA to the multi-ACCDOA vector in the track-wise output format [16]. However, this format still generalizes insufficiently to real-world polyphonic acoustic environments with an unknown number of overlaps. To solve this problem, Kim et al. explored an approach for multi-task SSLD based on angular distance [17], which adapts the framework of "You Only Look Once" (YOLO) to the SSLD task and achieves impressive performance on multiple datasets.
To overcome the inefficient sequential data processing of RNNs, Sound Event Localization and Detection via Temporal Convolutional Networks (SELD-TCN) [18] was proposed, which uses a temporal convolutional network (TCN) with dilated convolutions to replace the bidirectional gated recurrent unit (BiGRU) in the CRNN framework and capture long-range temporal dependencies. Besides, the Transformer [19] can process sequential data in parallel, bringing a substantial improvement in efficiency. Conformer [20] was first proposed for the speech recognition task to endow the Transformer with the capability to capture local information. This model integrates a convolution module and a self-attention mechanism within a unified single-branch structure. Inspired by it, Wang et al. combined the ResNet and Conformer structures for the SSLD task [21], resulting in promising performance gains. However, Conformer places a substantial burden on computing resources due to its large number of model parameters. To tackle this challenge, Zhou et al. proposed a dual-branch attention module to improve the structure of Conformer [22], which utilizes both self-attention and convolution branches to capture global and local contextual information of the sound events.
Data augmentation methods have been commonly exploited to increase the quantity of data, thus enhancing the generalization ability of the model. Mazzon et al. proposed the first-order Ambisonics (FOA) domain spatial augmentation method based on the well-known rotational property of FOA sound encoding [23]. The basic idea of the method is to apply transformations to the FOA channels, along with the corresponding labels, so that they simulate a new DOA of the recorded sounds. Ronchini et al. used data augmentation methods to improve the generalization of SSLD system performance to unseen data [24]. Their technique is based on channel rotations and reflection on the x-y plane, which expands the set of DOA labels while keeping the physical relationships between channels. Wang et al. integrated four methods in a step-by-step manner to devise an effective four-stage data augmentation scheme, including audio channel swapping, multichannel simulation, time-domain mixing, and time-frequency masking [21]. This strategy makes models trained on the augmented data more robust to variations in sound source location.

The proposed method
In this section, we introduce our proposed GLFER-Net in detail. The overall diagram of the network is illustrated in Fig. 1, where the log-linear spectrograms from four channels and the normalized three-channel sound intensity vectors (SIVs) are concatenated as the input features. An encoder first processes them for shallow feature extraction. The encoder comprises two convolutional blocks and an average pooling operation; each convolutional block includes a 2D-convolutional layer with a kernel size of 3 × 3, a batch normalization (BN) layer, and a Gaussian error linear unit (GELU) activation function [25], as shown in Fig. 2. Note that there are no residual connections in the encoder, which avoids interference from noise in the original input features. Subsequently, the output of the encoder is fed into the part enclosed by the dashed box in Fig. 1, which consists of a GLF extractor, an FR module, and an average pooling operation. This part is repeated three times to extract high-level features with rich spatial information. Each average pooling operation halves the feature maps in the time and frequency dimensions, so the features are downsampled by a factor of sixteen in both dimensions before being fed into the BiGRU. Next, a two-layer BiGRU is utilized to capture the temporal context information and model the relationships among time frames.
Fig. 1 The diagram of the proposed GLFER-Net
Finally, the SED and DOA branches each use two fully connected (FC) layers to produce the SED and DOA predictions, respectively. The SED branch adopts a sigmoid activation function, whereas the DOA branch uses tanh. The following subsections present a detailed description of the input features and each module.

Input features
The log-mel spectrogram is extensively utilized in the SSLD task. However, Nguyen et al. demonstrated that the linear-scale feature outperforms the mel-scale one [12]. Hence, we prefer the log-linear spectrogram. The four-channel log-linear spectrograms are calculated from the complex spectrograms $S(t,f)$ by applying the modulus operation $\lVert \cdot \rVert$ followed by a logarithm, where $t$ and $f$ are the time and frequency indices, $T$ denotes the number of time frames, and $F$ is the number of frequency bins.
Each FOA signal consists of four channels (W, X, Y, Z), where W is the 0th-order spherical harmonic capturing omnidirectional information, while X, Y, and Z are the 1st-order spherical harmonics conveying spatial information along the Cartesian axes of the sound field. SIVs carry valuable information about the direction of sound propagation, and their inverse direction is commonly interpreted as the DOA. Moreover, FOA-based SIVs can be directly used for precise DOA estimation [26]. In the short-time Fourier transform (STFT) domain, the SIV is computed from the four-channel spectra by multiplying the complex conjugate of W with each of X, Y, and Z, taking the real part, and scaling by a factor that depends on $\rho_0$ and $c$, the density and velocity of sound. Here $\Re$ denotes the real part of a complex number and $^*$ the complex conjugate. The SIVs are then normalized, and the log-linear spectrograms and normalized SIVs are concatenated as the input features $X_e \in \mathbb{R}^{7 \times T_e \times F_e}$.
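The SIV computation described above can be illustrated for a single time-frequency bin. This is only a minimal sketch: it assumes the commonly used form $I = -\frac{1}{\rho_0 c}\Re\{W^{*}\,[X, Y, Z]^{T}\}$ and a unit-length normalization, and the function names, the default values of `rho0` and `c`, and the small `eps` are illustrative assumptions rather than details taken from this paper.

```python
import math

def intensity_vector(w, x, y, z, rho0=1.2, c=343.0):
    """Intensity vector of one T-F bin from complex FOA STFT coefficients.

    Assumed form: I = -(1 / (rho0 * c)) * Re{ conj(W) * [X, Y, Z] }.
    rho0 (air density) and c (speed of sound) are illustrative defaults.
    """
    scale = -1.0 / (rho0 * c)
    w_conj = w.conjugate()
    return tuple(scale * (w_conj * ch).real for ch in (x, y, z))

def unit_normalize(vec, eps=1e-12):
    """Scale a 3-D vector to unit length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return tuple(v / (norm + eps) for v in vec)

# A bin in which only the X channel is in phase with W.
siv = unit_normalize(intensity_vector(1 + 0j, 1 + 0j, 0j, 0j))
# The inverse direction of the intensity vector is read as the DOA.
doa = tuple(-v for v in siv)
```

In this toy bin the recovered DOA points along the positive x axis, which matches the statement that the inverse direction of the SIV is interpreted as the DOA.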

Global-local feature extractor
Various sound events exhibit diverse durations and frequency distribution ranges [27]. Designing an effective feature extractor is necessary to obtain richer spatial information for the SSLD task. Furthermore, the classification of sound events needs coarse-grained global information, while detecting the onset and offset times of sound events and estimating the spatial trajectory of the sound source require fine-grained local information [28]. To obtain a comprehensive feature representation, we devise a GLF extractor, whose structure is illustrated in Fig. 3. The upper branch comprises an ODConv layer and an MSFE module to obtain global feature maps. In the lower branch, we introduce an LFE unit to extract local features, stacked twice to ensure that local features are extracted sufficiently. Instead of directly summing the two branches, we exploit the attentional feature fusion (AFF) unit [29] to realize a soft selection between them when aggregating global and local features. This highlights the important components of the different feature maps while suppressing unnecessary ones. In summary, the input of the GLF extractor passes through the ODConv layer and the MSFE module in the upper branch and through the two stacked LFE units in the lower branch, and the AFF unit fuses the two branch outputs. Next, we provide a detailed description of the modules in the GLF extractor.

Omni-directional dynamic convolution
Dynamic convolution [30] gives the weights input-adaptive capabilities through dynamic mechanisms, enhancing the model's generalizability. However, dynamic convolution suffers from a significant limitation in that only a single dimension (the convolutional kernel number) possesses dynamic characteristics, while the other dimensions are overlooked [31]. Li et al. proposed ODConv, which introduces a multi-dimensional attention mechanism [32] and adopts a parallel strategy to learn diverse attentions for convolutional kernels along four dimensions of the kernel space, that is, the spatial, input channel, output channel, and convolutional kernel dimensions. These four kinds of attention are mutually complementary and are multiplied with the convolutional kernel $W_i$ successively in the location-wise, channel-wise, filter-wise, and kernel-wise orders. ODConv can be formulated as follows:
$$y = \left(\alpha_{w1} \odot \alpha_{f1} \odot \alpha_{c1} \odot \alpha_{s1} \odot W_1 + \dots + \alpha_{wn} \odot \alpha_{fn} \odot \alpha_{cn} \odot \alpha_{sn} \odot W_n\right) * x,$$
where $x$ and $y$ denote the input and output feature maps, $n$ is the number of convolutional kernels, $*$ is the convolution operation, $\alpha_{wi}$ denotes the attention scalar, and $\alpha_{si} \in \mathbb{R}^{k \times k}$, $\alpha_{ci} \in \mathbb{R}^{c_{in}}$, and $\alpha_{fi} \in \mathbb{R}^{c_{out}}$ denote three attentions computed along the spatial, input channel, and output channel dimensions of the kernel space for the convolutional kernel $W_i$, respectively. They are calculated by the squeeze-and-excitation (SE) module [33]. $\odot$ represents the channel-wise multiplication. Introducing these multi-dimensional dynamic weights can improve the modeling ability of the convolution with additional spatial interaction [34]. In our proposed method, ODConv is used to calibrate the input feature maps of the GLF extractor so as to suppress redundant information and highlight important information. The calibrated feature maps are then fed into the MSFE module to extract multi-scale feature maps with larger receptive fields, from which the CSS unit further obtains the global feature maps.

Multi-scale feature extraction module
We design an MSFE module to capture information across feature maps of multiple scales, which helps identify sound events with various time-frequency characteristics. In addition, we introduce an attention mechanism when combining the multi-scale feature maps to emphasize the contribution of important features. We also employ a channel shuffle operation to enhance information communication among them.
As shown in Fig. 4a, the outputs of the preceding ODConv layer are first fed into a multi-branch convolutional unit. Assuming that the number of channels is $C$, we split the feature maps into four segments along the channel dimension, each with $C/4$ channels, to mitigate computational complexity. Then, group convolutions (GConvs) with distinct kernel sizes are exploited to extract the spatial information at multiple scales in parallel. We adhere to the grouping setting presented in [35], where the relationship between the multi-scale kernel size $K_i$ and the group size $g_i$ is defined as follows:
$$g_i = 2^{\frac{K_i - 1}{2}}.$$
Subsequently, the SE mechanism [33] is exploited along the multiple branches to acquire the attention weights for the multi-scale feature maps. By employing a softmax function, the feature maps of different scales can be adaptively selected under the guidance of the multi-scale channel weights, denoted as att. For the $i$-th branch, they can be computed by:
$$\mathrm{att}_i = \mathrm{softmax}(Z_i) = \frac{\exp(Z_i)}{\sum_{j=1}^{4} \exp(Z_j)},$$
where $Z_i \in \mathbb{R}^{C/4 \times 1 \times 1}$ is the output of the SE module for the $i$-th branch. This operation captures the long-range channel dependency. Then, the feature map from the GConv in the $i$-th branch, denoted as $F_i$, is calibrated as follows:
$$Y_i = F_i \odot \mathrm{att}_i,$$
where $\odot$ represents the channel-wise multiplication. In this way, the module can assign distinct weights to the feature maps of multiple scales and to different channels within a single scale, emphasizing cross-channel multi-scale spatial information. Finally, the four calibrated feature maps $Y_i$ are concatenated as the output of the MSFE module.
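The grouping rule and the branch-wise softmax described above can be sketched in a few lines of plain Python. This is a hedged illustration, not the authors' implementation: it assumes the grouping rule $g = 2^{(K-1)/2}$ of [35], assumes the softmax is taken across the four branches independently for every channel, collapses each channel to a single scalar for brevity, and uses kernel sizes 3, 5, 7, and 9 purely as an assumed example since the text does not state them.

```python
import math

def group_size(kernel_size):
    """Group count for an odd kernel size K, assuming g = 2 ** ((K - 1) / 2)."""
    return 2 ** ((kernel_size - 1) // 2)

def branch_softmax(z):
    """Softmax across branches, applied independently to every channel.

    z[i][c] is the SE descriptor of channel c in branch i.
    """
    num_channels = len(z[0])
    weights = [[0.0] * num_channels for _ in z]
    for c in range(num_channels):
        peak = max(branch[c] for branch in z)  # subtract the max for stability
        exps = [math.exp(branch[c] - peak) for branch in z]
        total = sum(exps)
        for i, e in enumerate(exps):
            weights[i][c] = e / total
    return weights

def calibrate(features, weights):
    """Channel-wise product Y_i = F_i * att_i (one scalar per channel here)."""
    return [[w * f for w, f in zip(w_row, f_row)]
            for w_row, f_row in zip(weights, features)]

kernel_sizes = [3, 5, 7, 9]  # assumed example values
groups = [group_size(k) for k in kernel_sizes]

descriptors = [[0.0, 2.0], [0.0, 1.0], [0.0, 0.0], [0.0, -1.0]]  # 4 branches, 2 channels
att = branch_softmax(descriptors)
```

For every channel the four branch weights sum to one, so a branch with a larger descriptor contributes more of that channel to the concatenated output.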
In the MSFE module, we directly concatenate the multi-scale features without considering their interdependencies. The use of group convolution has the side effect that the outputs of certain groups relate only to a small subset of the input feature maps, lacking effective inter-group communication [36]. To alleviate this side effect, we introduce a CSS unit that fuses the multi-scale feature maps and enhances their correlations to obtain global feature maps. The detailed architecture is illustrated in Fig. 4b.
The channel shuffle operation facilitates the flow of information across multi-scale feature maps [36]. This operation can be modeled as a process of "reshape-transpose-reshape": given input feature maps with dimension $(*, n)$, they are first reshaped to $(*, g_s, n/g_s)$, then transposed to $(*, n/g_s, g_s)$, and finally reshaped back to $(*, n)$, where $n$ and $g_s$ denote the channel number and group size, respectively, and $*$ represents the other dimensions. After this operation, we obtain the shuffled feature maps, which are aggregated with the original ones through the aggregation block. The aggregation block includes two convolution layers with a kernel size of 1 × 1. The first one reduces the number of channels, and the second one further fuses the feature maps from various channels, retaining the information of the original channel positions. Additionally, we incorporate residual connections that supplement the original information after the CSS unit and the MSFE module.
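The "reshape-transpose-reshape" process can be sketched on a plain list that stands in for the channel axis. This is a minimal sketch in pure Python; the function name `channel_shuffle` and the use of channel indices in place of feature maps are illustrative assumptions.

```python
def channel_shuffle(channels, group_size):
    """Shuffle a list of channels via reshape -> transpose -> reshape.

    channels   : list of length n (any per-channel payload)
    group_size : g_s, which must divide n
    """
    n = len(channels)
    if n % group_size != 0:
        raise ValueError("group_size must divide the number of channels")
    per_group = n // group_size
    # reshape (n,) -> (g_s, n / g_s)
    grid = [channels[g * per_group:(g + 1) * per_group] for g in range(group_size)]
    # transpose (g_s, n / g_s) -> (n / g_s, g_s)
    transposed = [[grid[g][i] for g in range(group_size)] for i in range(per_group)]
    # reshape back to (n,)
    return [ch for row in transposed for ch in row]

shuffled = channel_shuffle(list(range(8)), 2)
```

With eight channels and a group size of two, channels from the two halves become interleaved, so a subsequent grouped operation sees inputs that originate from every group.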

Local feature extraction unit
Ding et al. proposed asymmetric convolution as a replacement for the standard convolution layer with a square kernel to enrich the feature space [37]. Asymmetric convolution can explicitly enhance the representational capacity of a CNN while extracting more delicate spatial features. Here, we design an LFE unit based on asymmetric convolutions to acquire more fine-grained features. As illustrated in the bottom part of Fig. 3, we design a combination of 1 × 3 + 3 × 1 and 3 × 1 + 1 × 3 convolution layers, which is followed by a BN layer and a GELU activation function. The asymmetric convolutions with kernel sizes of 1 × 3 and 3 × 1 focus on information in the frequency and time dimensions, respectively. Regarding the impact of the LFE unit on SSLD performance, we will make further exploration in Section 4.
Fig. 4 The overall structure of the multi-scale feature extraction (MSFE) module and the details of the cross-scale shuffle (CSS) unit therein

Feature recalibration module
The attention mechanism can improve the feature representation of a CNN by building dependencies among channels or spatial positions [38], and previous research has achieved significant results with it [33,39,40]. Inspired by these methods, we propose a feature recalibration (FR) module with three attention branches that independently compute attention weights along the time, channel, and frequency dimensions, so as to combine the advantages of the three attention mechanisms. Its overall structure is shown in Fig. 5.
In the FR module, we suppose that the shape of the input feature $M$ is $(C, T, F)$, where $C$, $T$, and $F$ denote the channel, time, and frequency dimensions. It is first mapped or permuted into three feature maps, $M_1$ with shape $(T, C, F)$, $M_2$ with shape $(C, T, F)$, and $M_3$ with shape $(F, T, C)$, which serve as the inputs of the three attention branches, respectively. Global average pooling (GAP) is applied separately in each branch to obtain the spatial statistics $g$ along the time $t$, channel $c$, and frequency $f$ dimensions:
$$g_t = \frac{1}{CF}\sum_{i=1}^{C}\sum_{j=1}^{F} m_{1t}(i,j), \qquad g_c = \frac{1}{TF}\sum_{i=1}^{T}\sum_{j=1}^{F} m_{2c}(i,j), \qquad g_f = \frac{1}{TC}\sum_{i=1}^{T}\sum_{j=1}^{C} m_{3f}(i,j),$$
where $m_{1t}$ is the $t$-th time frame feature map of $M_1$, $m_{2c}$ is the $c$-th channel feature map of $M_2$, and $m_{3f}$ is the $f$-th frequency bin feature map of $M_3$. Then, three 1D-convolution layers with the same kernel size of 1 × 3 are used in parallel to dynamically select task-related time frames, channels, and frequency bins, in the manner of efficient channel attention (ECA) [41]. The three kinds of attention maps are obtained by applying the sigmoid function to the outputs of these 1D convolutions, where $W$ is the weight matrix of the 1D convolution. The three attention maps are multiplied with $M_1$, $M_2$, and $M_3$, respectively, to recalibrate the feature maps along the three dimensions. This process emphasizes the relevant time frames, channels, and frequency bins in the feature maps. Finally, the three recalibrated feature maps are reshaped to the same shape, denoted by $M'_1$, $M'_2$, and $M'_3$, and summed as the final output. We explore each attention branch through ablation studies and investigate the impact of various attention combinations on SSLD performance.
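One attention branch of the FR module (pool, 1D convolution, sigmoid, rescale) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the zero padding at the borders, and the example kernel values are assumptions, and the same routine is meant to stand for any of the three branches once the attended axis has been moved to the front.

```python
import math

def pool_slices(slices):
    """Global average pooling: one scalar per slice along the attended axis."""
    return [sum(sum(row) for row in s) / (len(s) * len(s[0])) for s in slices]

def conv1d_same(signal, kernel):
    """1D convolution (cross-correlation) with zero padding, output length kept."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def attention_weights(slices, kernel):
    """Sigmoid of the 1D convolution over the pooled statistics."""
    return [1.0 / (1.0 + math.exp(-v))
            for v in conv1d_same(pool_slices(slices), kernel)]

def recalibrate_axis(slices, kernel):
    """Multiply every slice by its attention weight."""
    weights = attention_weights(slices, kernel)
    return [[[w * v for v in row] for row in s]
            for w, s in zip(weights, slices)]

# Three slices (e.g. three time frames), each a 2 x 2 map; kernel of size 3.
frames = [[[1.0, 1.0], [1.0, 1.0]],
          [[4.0, 4.0], [4.0, 4.0]],
          [[1.0, 1.0], [1.0, 1.0]]]
weights = attention_weights(frames, [0.0, 1.0, 0.0])  # hypothetical kernel values
```

In the full module this routine would be applied to $M_1$, $M_2$, and $M_3$ separately, and the three recalibrated tensors would be permuted back to a common shape and summed.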

Datasets
We evaluated the performance of the proposed GLFER-Net and conducted ablation experiments on the DCASE 2021 Task3 development dataset [10]. All sound events in this dataset are spatialized by filtering with real room impulse responses, which were captured in multiple rooms of distinct sizes and acoustic absorption characteristics. Realistic spatial ambient noise is added to the recordings with signal-to-noise ratios (SNRs) ranging from 30 dB to 6 dB. Furthermore, each recording is provided in two spatial formats: microphone array (MIC) and FOA. The FOA format is obtained with encoding filters based on the Eigenmike array, which convert the 32-channel microphone array signal into a 4-channel one. The MIC format uses four channels of the Eigenmike spherical array, namely microphones 6, 10, 26, and 22. The development dataset comprises 400 training recordings, 100 testing recordings, and 100 validation recordings, each with a duration of one minute. The samples cover 12 sound event categories. More information about the dataset is available in [10].
To verify the generalization of our model in real acoustic environments, we also evaluated the proposed GLFER-Net on the dataset used for the baseline of DCASE 2022 Challenge Task3, which includes the STARSS 2022 development dataset [42] and another 1200 synthetic recordings. Similar to previous challenges, the dataset has two spatial recording formats: FOA and MIC. The recordings in the STARSS 2022 dataset, collected in real acoustic environments, contain more noise and reverberation and include up to three overlapping sound events. Besides, the recordings in this dataset are organized into sessions, each taking place in a unique room. Multiple independent recordings with lengths from 30 s to 6 min were collected during each session. Within the recordings, 13 target classes are identified. The development dataset comprises 67 training recordings and 54 testing recordings. Details of this dataset can be found in [42]. The most significant distinction of DCASE 2022 Challenge Task3 from previous challenges is that the models are tested on annotated recordings of real scenes.
Both datasets we used in this paper are in FOA format.The specific distinctions between the DCASE 2021 and 2022 Task3 development datasets are summarized in Table 1.
We also exploit data augmentation techniques as a data transformation mechanism to improve the learning ability of the model without expanding the dataset size.Similar to [12], we consecutively utilize three data augmentation methods including audio channel swapping, random cutout and frequency shifting.

Evaluation metrics
We used the official metrics [43] of the DCASE Challenge Task3 to assess SSLD performance. They consist of individual SED and SSL metrics. For the SED task, we used the location-dependent error rate (ER) and F1 score, while for the SSL task we used the class-dependent localization error ($\mathrm{LE}_{\mathrm{CD}}$) and localization recall ($\mathrm{LR}_{\mathrm{CD}}$).
The SSLD system deems a sound event detection to be accurate only when the predicted sound event has the correct class label and its estimated DOA is within $D^\circ$ of the reference DOA, where the threshold is generally set to $20^\circ$. Thus, these SED metrics are denoted as $\mathrm{ER}_{20^\circ}$ and $F_{20^\circ}$. Mathematically, they are calculated as follows:
$$\mathrm{ER}_{20^\circ} = \frac{S + D + I}{N}, \qquad F_{20^\circ} = \frac{2\,TP}{2\,TP + FP + FN},$$
where $TP$ represents the true positives, that is, sound events that are active in both the ground truth and the prediction. $FP$ refers to false positives, indicating sound events that are predicted as active but are inactive in the ground truth. Conversely, $FN$ stands for false negatives, that is, sound events that are predicted as inactive but are active in the ground truth. $N$ denotes the total number of sound events active in the ground truth. $S$, $I$, and $D$ refer to substitution, insertion, and deletion errors, respectively. The mathematical definitions of these statistics are provided below:
$$S = \min(FN, FP), \qquad D = \max(0, FN - FP), \qquad I = \max(0, FP - FN).$$
The SSL metrics are class-dependent (CD), meaning that a localization prediction is considered only when the corresponding sound class is correctly detected. $\mathrm{LE}_{\mathrm{CD}}$ expresses the average angular distance between predictions and ground truths and can be formulated as:
$$\mathrm{LE}_{\mathrm{CD}} = \arccos\left(\frac{u_{\mathrm{ref}} \cdot u_{\mathrm{pre}}}{\lVert u_{\mathrm{ref}} \rVert \, \lVert u_{\mathrm{pre}} \rVert}\right),$$
where $u_{\mathrm{ref}}$ and $u_{\mathrm{pre}}$ denote the position vectors of the reference and predicted sound event, respectively. The localization recall metric $\mathrm{LR}_{\mathrm{CD}}$ is the true positive rate of the localization estimation.
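Furthermore, the SSLD error ($\varepsilon_{\mathrm{SSLD}}$), which is an average of the four metrics, was introduced as an additional general performance assessment:
$$\varepsilon_{\mathrm{SSLD}} = \frac{1}{4}\left[\mathrm{ER}_{20^\circ} + \left(1 - F_{20^\circ}\right) + \frac{\mathrm{LE}_{\mathrm{CD}}}{180^\circ} + \left(1 - \mathrm{LR}_{\mathrm{CD}}\right)\right].$$
It is worth noting that, contrary to the previous challenges, which used micro-averaged metrics, DCASE 2022 Task3 introduces a new evaluation scheme in which the metrics $F_{20^\circ}$, $\mathrm{LE}_{\mathrm{CD}}$, and $\mathrm{LR}_{\mathrm{CD}}$ are macro-averaged, which assigns equal weight to each class. Hence, we used the macro-averaged metrics to evaluate the methods trained on the STARSS 2022 development dataset to ensure the fairness of the experiments. A good SSLD system should exhibit lower scores of $\mathrm{ER}_{20^\circ}$, $\mathrm{LE}_{\mathrm{CD}}$, and $\varepsilon_{\mathrm{SSLD}}$, along with higher scores of $F_{20^\circ}$ and $\mathrm{LR}_{\mathrm{CD}}$.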
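The metrics in this section can be sketched from aggregate counts as follows. This is a simplified illustration under stated assumptions: it uses the standard definitions of the error rate, F-score, angular distance, and averaged SSLD error, works on already accumulated counts rather than per-segment statistics, and all function names and example numbers are hypothetical.

```python
import math

def sed_scores(tp, fp, fn, n_ref):
    """Error rate and F-score from aggregate counts."""
    substitutions = min(fn, fp)
    deletions = max(0, fn - fp)
    insertions = max(0, fp - fn)
    error_rate = (substitutions + deletions + insertions) / n_ref
    f_score = 2 * tp / (2 * tp + fp + fn)
    return error_rate, f_score

def angular_error_deg(u_ref, u_pre):
    """Angle in degrees between reference and predicted position vectors."""
    dot = sum(a * b for a, b in zip(u_ref, u_pre))
    norm_ref = math.sqrt(sum(a * a for a in u_ref))
    norm_pre = math.sqrt(sum(b * b for b in u_pre))
    cosine = max(-1.0, min(1.0, dot / (norm_ref * norm_pre)))  # guard rounding
    return math.degrees(math.acos(cosine))

def ssld_error(error_rate, f_score, le_deg, lr):
    """Average of the four metrics, with F and LR given as fractions in [0, 1]."""
    return (error_rate + (1.0 - f_score) + le_deg / 180.0 + (1.0 - lr)) / 4.0

# Hypothetical counts: 8 hits, 2 false alarms, 2 misses, 10 reference events.
er, f = sed_scores(tp=8, fp=2, fn=2, n_ref=10)
le = angular_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
score = ssld_error(er, f, le_deg=18.0, lr=0.9)
```

Because the localization error is divided by 180 degrees, all four terms lie on a comparable scale before averaging, and a lower aggregate score indicates a better system.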

Loss functions
GLFER-Net was trained using a multi-objective learning approach that simultaneously optimizes the SED and SSL sub-tasks. The SED task is treated as a multi-label classification task and uses the binary cross-entropy loss [44] between $y^{\mathrm{SED}}_{t_0,n}$ and $\hat{y}^{\mathrm{SED}}_{t_0,n}$, the reference and estimated activity probabilities of the $n$-th sound event at the $t_0$-th frame, respectively. $T_0$ represents the total number of frames in one batch, and $N$ represents the number of classes. In this paper, $T_0 = T_e/16$. The SSL task is treated as a regression task and uses the mean squared error [45] between $\hat{y}^{\mathrm{DOA}}_{t_0,n}$, the DOA estimate for the $n$-th sound event at the $t_0$-th frame, and $y^{\mathrm{DOA}}_{t_0,n}$, the corresponding ground truth. The final loss function $L$ combines the SED loss and the SSL loss $L_{\mathrm{SSL}}$ through a loss-weight hyper-parameter, which we empirically set to 0.3 in this paper. $L_{\mathrm{SSL}}$ is only computed for the active sound events in each frame.
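The two loss terms can be sketched in plain Python as follows. This is a hedged illustration only: it assumes a mean reduction over all frame-class pairs for the binary cross-entropy and over the active frame-class pairs for the masked squared error, and the function names and the clipping constant `eps` are assumptions. How the two terms are weighted in the final loss is not reproduced here.

```python
import math

def bce_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over frames and classes.

    y_true[t][n] is 0 or 1, y_prob[t][n] is a sigmoid output in (0, 1).
    """
    total, count = 0.0, 0
    for frame_true, frame_prob in zip(y_true, y_prob):
        for y, p in zip(frame_true, frame_prob):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count

def masked_mse_loss(doa_true, doa_pred, active):
    """Squared DOA error averaged over active (frame, class) pairs only.

    doa_true[t][n] and doa_pred[t][n] are (x, y, z) tuples;
    active[t][n] is truthy when class n is active in frame t.
    """
    total, count = 0.0, 0
    for f_true, f_pred, f_act in zip(doa_true, doa_pred, active):
        for u, v, is_active in zip(f_true, f_pred, f_act):
            if is_active:
                total += sum((a - b) ** 2 for a, b in zip(u, v))
                count += 1
    return total / count if count else 0.0

sed = bce_loss([[1, 0]], [[0.5, 0.5]])
ssl = masked_mse_loss([[(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]],
                      [[(0.0, 0.0, 0.0), (9.0, 9.0, 9.0)]],
                      [[1, 0]])
```

In the toy call, the large error of the second class does not contribute to the localization term because that class is inactive in the frame.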

Training setup
The sampling rate for both datasets is 24 kHz. We employed a Hann window with 512 points and a hop size of 300 points for the STFT. The network was trained for 50 epochs with a batch size of 8, using the AdamW optimizer with an initial learning rate of $3 \times 10^{-4}$ and the cosine annealing schedule [46]. Input signals are segmented into 8-s non-overlapping segments. The numbers of GLF extractors and FR modules are both set to three, with 64, 128, and 256 filters for the three GLF extractors, respectively. The hidden layer dimension of the BiGRU is set to 128. A threshold of 0.3 is applied to binarize the SED predictions. All models are implemented using PyTorch.
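The numbers above imply a simple frame budget per training segment, sketched below. This is an approximate illustration: it counts frames as samples divided by hop size and ignores edge frames introduced by windowing or padding, it takes the sixteen-fold time downsampling from the architecture description, and the cosine schedule assumes a minimum learning rate of zero without restarts; all names are hypothetical.

```python
import math

SAMPLE_RATE = 24_000      # Hz
HOP_SIZE = 300            # samples between STFT frames
SEGMENT_SECONDS = 8       # length of one training segment
NUM_POOLINGS = 4          # one pooling in the encoder plus three repeated stages
EPOCHS = 50
BASE_LR = 3e-4

# Approximate frame count per segment (edge frames ignored).
frames_per_segment = SEGMENT_SECONDS * SAMPLE_RATE // HOP_SIZE
downsample_factor = 2 ** NUM_POOLINGS
frames_after_pooling = frames_per_segment // downsample_factor

def cosine_lr(epoch, base_lr=BASE_LR, total_epochs=EPOCHS, min_lr=0.0):
    """Cosine annealing from base_lr to min_lr over total_epochs (assumed form)."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine
```

Under these assumptions an 8-s segment yields about 640 STFT frames and 40 output frames, that is, one prediction every 0.2 s.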

Experimental results and discussion
We conducted an extensive analysis of the experimental results. First, we investigated the group size of the channel shuffle operation in the CSS unit and determined the optimal value. Subsequently, a series of ablation experiments was carried out to demonstrate the significance of the GLF extractor and FR module. Following that, we explored combinations of the attention branches in the FR module. Besides, we conducted an ablation study of the channel shuffle operation and the aggregation block in the CSS unit to verify their roles. All of the above ablation experiments were conducted on the DCASE 2021 Task3 development dataset. Finally, we compared GLFER-Net with multiple competitive methods on the two datasets and visualized the predictions of the model to demonstrate the superiority of the proposed GLFER-Net.

Ablation study of group size in the CSS unit
We investigate which group size of the channel shuffle operation in the CSS unit is best for the SSLD task. In this part, all experiments were conducted on the proposed GLFER-Net without data augmentation techniques but with various values of $g_s$ for the channel shuffle operation. Figure 6 illustrates the experimental results for different $g_s$. The optimum performance is achieved when $g_s$ is set to 8; both smaller and larger values lead to considerable reductions in performance. As the group size increases, the number of channels in each group decreases accordingly, so each group carries too little information, which leads to incomplete information exchange among different groups and potentially weakens the representation ability of the model. Consequently, the group size of the channel shuffle operation is set to 8 in the following experiments.

Ablation study of FR module and GLF extractor
To verify the effectiveness of the GLF extractor, which is designed to extract features containing rich spatial information, and that of the FR module, which recalibrates the feature maps, we conducted a series of experiments. As shown in Table 2, when the FR module is removed, GLFER-Net performs worse, with the scores of $\mathrm{ER}_{20^\circ}$, $\mathrm{LE}_{\mathrm{CD}}$, and $\varepsilon_{\mathrm{SSLD}}$ increasing by 0.05, $1.4^\circ$, and 0.02, and that of $F_{20^\circ}$ degrading by 2.3%. When the GLF extractor is further replaced with two serial convolutional blocks, the performance declines notably, as shown in the last row of the table, especially $\varepsilon_{\mathrm{SSLD}}$, which worsens by 0.07. To show the effectiveness of the LFE unit in the GLF extractor visually, we visualize the feature maps in the first GLF extractor. In Fig. 7, the regions surrounded by blue boxes indicate the moving sound event Speech, and those surrounded by red boxes indicate the periodic, discontinuous sound event Alarm. Figure 7a illustrates a feature map of $x$, the input of the GLF extractor that follows the encoder. It still contains substantial noise, so the model has difficulty recognizing each sound event. Figure 7b is a feature map of $y_1$, which is the output of the MSFE module in the first GLF extractor. Comparing Fig. 7b with Fig. 7a, we can see that the components unrelated to sound events are reduced after the MSFE module. However, the harmonic components of Speech are unclear, and the energy distribution of Alarm appears indistinct. Figure 7c is a feature map of $y$, which is the output of the first GLF extractor. It contains fine-grained information captured by the LFE units. As seen in Fig. 7c, the periodic nature of Alarm is discernible, but the harmonic components of Speech after 1 second are lost. Figure 7d is another feature map of $y$. Although the characteristic of Alarm is not clear enough, the harmonic components of Speech are more complete than in Fig. 7c.
By comparing Fig. 7b with Fig. 7c and d, we can see that with the assistance of LFE units, the specific frequency components of two kinds of sound events can be clearly distinguished from the feature maps.By comparing Fig. 7a with Fig. 7c and d, we can verify the effectiveness of the GLF extractor.

Exploration of attention branches in FR module
We also explored the effectiveness and the combination of various attention branches in the FR module. Table 3 provides the results of various attention branches, where CA represents the channel attention branch, TA the time attention branch, and FA the frequency attention branch. From Table 3, we can draw the following conclusions:
1) The top three rows of Table 3 correspond to using a single attention branch. The model with the CA branch achieves the best performance among them, because the CA mechanism learns inter-channel correlations, which is crucial for enhancing localization performance. Notably, the model with the CA branch achieves the highest LR_CD score of 67.1%.
2) Among the three combinations of two attention branches, the combination of the CA and TA branches yields relatively superior results. TA is an attention mechanism along the temporal dimension that highlights crucial time frames, aiding in the detection of active sound event boundaries. Integrating TA with CA further improves the accuracy of the SED task, with the ER_20° and F_20° scores improving by 0.01 and 1.0%, respectively. However, a marginal gap of 0.01 in ε_SSLD remains between this combination and the model with all three attention branches, as shown in the final row.
3) The FA branch emphasizes the important spectral components to provide more accurate spatial information for the SED and SSL tasks. When it is employed together with the CA and TA branches, their complementary effects lead to excellent performance, and all metrics improve significantly. Hence, in the following experiments, the FR module consists of three attention branches.
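The three branches can be illustrated with a minimal NumPy sketch. Following the Fig. 5 caption, each branch applies global average pooling over the other two dimensions, a 1D convolution along its own dimension, and a sigmoid. The names `axis_attention` and `recalibrate`, the fixed smoothing kernel standing in for learned convolution weights, and the averaging of the three recalibrated maps are all assumptions made for illustration; how the branches are actually fused is not stated in this section.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def axis_attention(x: np.ndarray, axis: int, kernel: np.ndarray) -> np.ndarray:
    """Attention weights along one axis of x with shape (channels, time, freq)."""
    other_axes = tuple(a for a in range(x.ndim) if a != axis)
    pooled = x.mean(axis=other_axes)                   # global average pooling
    scores = np.convolve(pooled, kernel, mode="same")  # 1D convolution along the axis
    weights = sigmoid(scores)                          # values in (0, 1)
    shape = [1] * x.ndim
    shape[axis] = x.shape[axis]
    return weights.reshape(shape)                      # broadcastable over x


def recalibrate(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Combine channel (CA), time (TA), and frequency (FA) attention."""
    ca = axis_attention(x, 0, kernel)
    ta = axis_attention(x, 1, kernel)
    fa = axis_attention(x, 2, kernel)
    # Assumption: the three recalibrated maps are simply averaged.
    return (x * ca + x * ta + x * fa) / 3.0


# Symmetric kernel, so convolution and cross-correlation coincide.
kernel = np.array([0.25, 0.5, 0.25])
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 10, 6))  # (channels, time frames, frequency bins)
y = recalibrate(x, kernel)
```

Dropping one of `ca`, `ta`, or `fa` from the sum, and adjusting the divisor, mimics the single-branch and two-branch configurations compared in Table 3.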

Ablation study of two components in CSS unit
We performed ablation experiments on the channel shuffle operation (Shuf) and the aggregation block (Agg), with the FR module containing three attention branches and the group size (g_s) of the channel shuffle operation set to 8. In Table 4, the first row denotes the performance of the complete GLFER-Net, and the second row is that of the model without the aggregation block in the CSS unit, where the shuffled features are directly added to the original ones as the output. Compared to the results in the first row, those in the second row increase by 0.04, 1.5°, and 0.02 on ER_20°, LE_CD, and ε_SSLD and decrease by 2.1% and 0.7% on F_20° and LR_CD, respectively.

Comparisons with other methods
We first compared our proposed GLFER-Net with six other methods on the DCASE 2021 Task3 development dataset. Then, to verify the generalization of our proposed method, we also made a comparison with four other methods on the DCASE 2022 Task3 development dataset. As shown in Table 6, the performance of GLFER-Net without data augmentation techniques is only slightly superior to that of Resnet-Conformer [21] on ε_SELD. However, it exhibits strong localization performance, with an improvement of 9.7% on LR_CD.

Visualization analysis
We take "fold6_room1_mix012", an audio clip from the test set of the DCASE 2021 Task3 development dataset, as an example for visualization analysis. For the SED task, in comparison to the reference in Fig. 8a, GLFER-Net without the FR module predicts an erroneous category at around 20 s, marked by the light blue rectangular box in Fig. 8b. The complete GLFER-Net correctly detects the boundaries and categories of most sound events, as shown in Fig. 8c.
For the SSL task, the second and third rows of each sub-graph are the visualizations of azimuth and elevation, respectively. In the reference diagram, the green rectangular box marks the azimuth trajectory of the sound event labeled Knock, which corresponds to a moving sound source occurring at around 55 s. GLFER-Net exhibits a certain level of robustness to moving sound sources but confuses the sound events Knock (blue line) and Phone (green line) from 50 s to 60 s, as shown in Fig. 8c. However, such confusion is effectively alleviated with the assistance of data augmentation methods. Furthermore, our method produces biases in the elevation angle compared to the ground truth, marked by the black rectangular box. The comparison between Fig. 8c and d shows that the data augmentation method with three consecutive transformations reduces such biases.
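To quantify such azimuth and elevation deviations for a single frame, the angular distance between a reference and an estimated DOA can be computed on the unit sphere. The sketch below assumes the usual great-circle definition of angular error; the helper names and example angles are hypothetical and are not taken from the clip above.

```python
import math


def doa_to_unit_vector(azimuth_deg: float, elevation_deg: float):
    """Convert azimuth/elevation in degrees to a Cartesian unit vector."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))


def angular_error_deg(reference, estimate) -> float:
    """Great-circle angle in degrees between two (azimuth, elevation) pairs."""
    u = doa_to_unit_vector(*reference)
    v = doa_to_unit_vector(*estimate)
    dot = sum(a * b for a, b in zip(u, v))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))


# Hypothetical frame: the azimuth is correct, the elevation is off by 10 degrees.
elevation_bias = angular_error_deg((30.0, 0.0), (30.0, 10.0))
# Hypothetical frame: two sources on the horizontal plane, 90 degrees apart.
quarter_turn = angular_error_deg((0.0, 0.0), (90.0, 0.0))
```

A pure elevation bias contributes its full magnitude to the angular error, which is why reducing it matters for the localization metrics.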

Conclusion
In this paper, we propose GLFER-Net, based on the GLF extractor and the FR module, for polyphonic sound source localization and detection. The LFE units in the GLF extractor complement the multi-scale features from the MSFE module with fine-grained information, where a CSS unit is designed to fuse the multi-scale features and enhance the information communication among them. After each GLF extractor, an FR module is introduced to emphasize the crucial features along multiple dimensions. We also use three consecutive data augmentation methods as a data transformation mechanism to improve the generalization ability of the model. On the open datasets of Task3 in the DCASE 2021 and 2022 Challenges, our proposed GLFER-Net outperforms six and four SSLD methods, respectively. Through a series of ablation experiments and visualization analyses on the DCASE 2021 Task3 development dataset, the effectiveness of the GLF extractor, FR module, LFE unit, and the two components in the CSS unit is verified. However, our proposed method still has some limitations. When compared with other methods on the development dataset of the DCASE 2022 Challenge, the ER_20° score of GLFER-Net is slightly higher than that of Resnet-Conformer. Additionally, the complexity of the model needs to be further reduced.

Fig. 2 The detailed structure of the encoder

Fig. 5 The diagram of the Feature Recalibration (FR) module. P denotes the operation of permutation. A-POOL represents global average pooling and 1D convolution. σ denotes the sigmoid function

Fig. 6 The column charts of four metrics on models with various group sizes (g_s) of the shuffle operation in CSS units. The left chart shows the scores of ER_20° and F_20° for the SED task, the right one those of LE_CD and LR_CD for the SSL task

Fig. 7 The visualization of feature maps in the first GLF extractor. a A feature map of the input of the GLF extractor, b that of the output of the MSFE module, c and d two feature maps of the output of the GLF extractor. c The discernible periodic nature of Alarm and d the clear harmonic components of Speech

Fig. 8 The visualization of ground truth and predicted scores of three ablation experiments. In each sub-graph, the first row displays the results of the SED task, and the second and third rows show the results of azimuth and elevation for the SSL task, respectively. In all sub-plots, the horizontal axes represent the time, and the vertical ones in the SED task sub-plots are the sound event class indices, each depicted with a unique color. The vertical axes of the second and third rows of each sub-graph represent the azimuth angle with a range of [−180°, 180°] and the elevation angle with a range of [−90°, 90°], respectively

Table 1
The detailed characteristics of the DCASE 2021 and 2022 Task3 development datasets

Table 2
Ablation study on the DCASE 2021 Task3 development dataset

Table 3
The exploration of combinations of three attention branches in the FR module on the DCASE 2021 Task3 development dataset

Table 4
Ablation study of the CSS unit on the DCASE 2021 Task3 development dataset

Compared with the complete GLFER-Net in the first row, the model without the aggregation block in the second row increases ER_20°, LE_CD, and ε_SSLD by 0.04, 1.5°, and 0.02 and decreases F_20° and LR_CD by 2.1% and 0.7%, respectively. These results verify the effectiveness of the aggregation block in preserving the information from the original channels. The last row denotes the results of the model without the aggregation block and the channel shuffle operation, which means that the model uses a convolutional layer with a kernel size of 1 × 1 to replace the CSS unit. Compared to the scores in the first row, those in the last row increase by 0.05, 1.8°, and 0.03 on ER_20°, LE_CD, and ε_SSLD and decrease by 3.1% and 1.3% on F_20° and LR_CD, respectively. These results demonstrate the effectiveness of the CSS unit, which can integrate multi-scale features and enhance information communication among them.

Table 5
The comparison between GLFER-Net and seven other methods on the DCASE 2021 Task3 development dataset

Table 6
The comparison between our proposed GLFER-Net and four other methods on the DCASE 2022 Task3 development dataset

All models were trained on the same training, validation, and testing datasets for fairness. In Table 5, the compared methods did not use data augmentation or post-processing techniques. They all employed simple CRNN-based architectures, where the CNNs use single-scale kernel convolutions. Compared to these methods, our proposed GLFER-Net demonstrates stronger feature extraction capability by leveraging GLF extractors and FR modules. The feature representation of GLFER-Net contains more comprehensive spatial information, enhancing the classification accuracy of diverse sound events. Additionally, the multi-dimensional attention mechanism within the FR module contributes to improving localization performance. As shown in Table 5, GLFER-Net outperforms almost all methods; it is slightly worse than AD-YOLO [17] only on the LE_CD metric, but has fewer parameters. The last row lists the results of GLFER-Net with the data augmentation techniques mentioned in Section 4.1. Compared to training GLFER-Net with the original data, training it with the transformed data increases F_20° and LR_CD by 8.2% and 4.7% and decreases ER_20°, LE_CD, and ε_SELD by 0.1, 3.2°, and 0.06, respectively. This demonstrates that this kind of consecutive data transformation brings a substantial improvement in SSLD performance.
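The reported changes under data augmentation can be cross-checked with a short calculation. The sketch assumes that the aggregate score follows the common DCASE Task 3 convention, i.e., the mean of ER, 1 − F, LE/180°, and 1 − LR; this definition is not restated in this section, so it is an assumption here. The function names are hypothetical.

```python
def aggregate_score(er: float, f: float, le_deg: float, lr: float) -> float:
    """Aggregate error under the assumed convention (lower is better).

    er: error rate, f: F-score in [0, 1], le_deg: localization error in degrees,
    lr: localization recall in [0, 1].
    """
    return (er + (1.0 - f) + le_deg / 180.0 + (1.0 - lr)) / 4.0


def aggregate_score_change(d_er: float, d_f: float, d_le_deg: float, d_lr: float) -> float:
    """Change of the aggregate score given changes of the four metrics.

    The score is linear in its inputs, so the change depends only on the deltas.
    """
    return (d_er - d_f + d_le_deg / 180.0 - d_lr) / 4.0


# Deltas reported for data augmentation: ER -0.1, F +8.2%, LE -3.2 degrees, LR +4.7%.
augmentation_delta = aggregate_score_change(-0.1, 0.082, -3.2, 0.047)
# approximately -0.06, consistent with the reported decrease of the aggregate score
```

Under this assumed definition, the four individual improvements account for the reported decrease of 0.06 in the aggregate score.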