To enhance the expression of channel and spatial feature information and to avoid network degradation, this section presents an attention-mechanism fusion module and a residual structure. Building on the M-CRNN model of Section 2, whose localization performance still needs improvement, the Residual spatially and channel Squeeze-Excitation Convolutional Recurrent Neural Network (Res-scSE-CRNN) model is constructed.

### 3.1 Residual structure

As the network deepens, the gradients propagated backward through the network shrink under repeated multiplication, causing the vanishing-gradient problem: the parameters of the shallow layers can no longer be updated. When training a neural network, a deeper structure is usually chosen over a shallow one in the hope of a better learning effect. However, as the depth increases, it becomes difficult for the extra layers to learn even an identity mapping correctly, and these redundant layers easily cause network degradation [24].

To solve the above problems, this paper adopts the characteristics of ResNet, namely training a deeper network with fewer parameters, to mitigate gradient vanishing and degradation. The structure of ResNet is shown in Fig. 4.

In Fig. 4, *x* is the input of the residual unit and *F(x)* is the output after the linear operations of two convolution layers and the nonlinear activation function. The calculation is as follows:

$$\begin{aligned} &F(x)=w_i\sigma (w_{i-1}x) \end{aligned}$$

(1)

where \(w_i\) is the weight of the *i*-th layer (here \(i=2\)) and \(\sigma (\cdot )\) denotes the ReLU activation function.

It can be seen from Eq. (1) and Fig. 4 that the learning objective becomes \(F(x)=H(x)-x\). When ResNet is used to design the network layers, the layers are optimized while the shortcut connection adds neither extra parameters nor extra computing cost.
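As a minimal sketch of this shortcut (NumPy; two fully connected layers stand in for the two convolution layers of Fig. 4, and `w1`, `w2` are illustrative names, not from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Residual unit of Fig. 4: H(x) = F(x) + x,
    with F(x) = w2 @ relu(w1 @ x) as in Eq. (1)."""
    f = w2 @ relu(w1 @ x)  # two linear layers with ReLU between them
    return f + x           # identity shortcut: no extra parameters
```

With identity weights and a positive input the block simply doubles the input, illustrating that the shortcut passes *x* through unchanged.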

### 3.2 Spatially and channel Squeeze-Excitation

Acoustic events are complex and diverse, so the key to detecting acoustic event categories and estimating source orientation is extracting their directional features. A neural network with an attention mechanism pays more attention to effective information, ignores invalid information, and extracts high-level features, which makes it suitable for detecting and locating sound events.

In the baseline CRNN model, although the convolution kernel integrates information across the spatial and channel dimensions of the feature map within the local receptive field, the spatial and channel feature mappings cannot be learned independently. To solve this problem, the spatially and channel Squeeze-Excitation (scSE) module is added to the base CRNN model. High-level feature maps are obtained by squeezing and weighting the channel features: channels that contribute more to classification are enhanced and channels that contribute less are suppressed, so the information at localizations that play a key role in classification is strengthened. The construction of the scSE module is described below.

The Squeeze-and-Excitation network (SE) [25, 26] learns spatial and channel feature mappings independently. scSE is a variant of SE: a network module that combines the spatial Squeeze-Excitation (sSE) model and the channel Squeeze-Excitation (cSE) model. In this module, channel and spatial relationships are considered at the same time, and the outputs of sSE and cSE are added to enhance the spatial coding ability of the convolutional layers and improve the recognition performance of the network. The sSE and cSE models are presented below.

In Fig. 5, the sSE model introduces the attention mechanism from the perspective of spatial relations, realizing spatial excitation through channel squeezing. The \(H{\times }W{\times }C\) feature map *U* is reduced along the channel dimension by a \(1\times 1\) convolution with weight \(W_{sq} \in R^{1\times 1\times C}\), giving the single-channel feature tensor \(q=W_{sq}U\), which is then normalized to [0, 1] by the sigmoid function. After this feature recalibration, the output *Û* is obtained by multiplying each spatial localization of *U* by the corresponding activation. The operation is expressed as:

$$\begin{aligned}{} & {} \hat{U}_{sSE}=F_{sSE}(U) \nonumber \\ {} & {} \quad =[\sigma (q_{1,1})u^{1,1},...,\sigma (q_{i,j})u^{i,j},...\sigma (q_{H,W})u^{H,W}] \end{aligned}$$

(2)

where \(\sigma (q_{i,j})\) represents the importance of localization (*i*, *j*) in the feature map.
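A minimal NumPy sketch of this operation, assuming the \(1\times 1\) convolution is expressed as a dot product over the channel axis (`w_sq` is an illustrative name):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sse(u, w_sq):
    """Spatial squeeze-excitation (Eq. 2).
    u: feature map of shape (H, W, C);
    w_sq: weights of the 1x1 convolution, shape (C,)."""
    q = u @ w_sq                      # channel squeeze: one (H, W) map
    return u * sigmoid(q)[..., None]  # excite every channel at each (i, j)
```

With zero weights every spatial activation is sigmoid(0) = 0.5, so the output is the input scaled uniformly, which makes the recalibration easy to verify.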

In Fig. 6, the cSE model uses the mutual stimulus between channels to model the interdependence of feature-map channels. Inserting this module at specific points in a network yields better results than the original network, which has been well verified on image classification tasks. First, global average pooling squeezes each channel of the feature map into a scalar; the resulting vector then passes through two fully connected layers with different weights and a ReLU activation to enhance the independence of each channel, and is finally normalized to [0, 1] by a sigmoid layer. That is, for the input \(U=[u_1,u_2,...,u_C]\) with \(u_i\in R^{H\times {W}}\), the *k*-th element of the vector *z* output by the global pooling layer is:

$$\begin{aligned} z_{k}=\frac{1}{H\times W}\sum \limits _{i}^{H}\sum \limits _{j}^{W}u_k(i,j) \end{aligned}$$

(3)

$$\begin{aligned} \hat{z}=W_1(\sigma (W_2z)),W_1\in R^{c \times \frac{c}{p}},W_2\in R^{\frac{c}{p} \times c} \end{aligned}$$

(4)

where *ẑ* represents the importance of each channel's features, *p* is the reduction ratio, and \(W_1,W_2\) are the weights of the two fully connected layers. The independence of each channel is enhanced by the ReLU activation function, and \(\sigma (\hat{z})\) is finally obtained by sigmoid normalization to [0, 1]. The process is calculated as follows:

$$\begin{aligned}{} & {} \hat{U}_{c S E}=F_{c S E}(U)\nonumber \\ {} & {} \quad =\left[ \sigma \left( \hat{z}_1\right) u_1, \sigma \left( \hat{z}_2\right) u_2, \ldots \sigma \left( \hat{z}_c\right) u_c\right] \end{aligned}$$

(5)
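The three steps in Eqs. (3)-(5) can be sketched in NumPy as follows (`w2` and `w1` are illustrative names for the reduction and expansion weights):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cse(u, w2, w1):
    """Channel squeeze-excitation (Eqs. 3-5).
    u: (H, W, C); w2: (C/p, C) reduction weights; w1: (C, C/p)."""
    z = u.mean(axis=(0, 1))    # Eq. (3): global average pooling per channel
    z_hat = w1 @ relu(w2 @ z)  # Eq. (4): two FC layers with ReLU between
    return u * sigmoid(z_hat)  # Eq. (5): channel-wise rescaling
```

As with sSE, zero weights give a uniform gate of sigmoid(0) = 0.5 per channel, a quick sanity check of the channel-wise broadcast.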

Since the sSE model accounts for the spatial structure and the cSE model for the channel relationships, this paper builds the scSE model by element-wise addition of the outputs of the two models. The expression is as follows:

$$\begin{aligned} \hat{U}_{s c S E}=\hat{U}_{s S E}+\hat{U}_{c S E} \end{aligned}$$

(6)

The scSE module established in this paper is shown in Fig. 7.

The scSE module shown in Fig. 7 is inserted into ResNet after the Exponential Linear Unit (ELU) activation function.

The scSE model combines the advantages of sSE and cSE: it recalibrates feature maps in both the spatial and channel dimensions and merges the output information of the two modules to improve the detection of acoustic events.
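Combining the two branches per Eq. (6) can be sketched as below (NumPy; all weight names are illustrative, and the 1×1 convolution and fully connected layers are written as dot products):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scse(u, w_sq, w2, w1):
    """Eq. (6): U_scSE = U_sSE + U_cSE for u of shape (H, W, C)."""
    u_sse = u * sigmoid(u @ w_sq)[..., None]     # spatial excitation branch
    z_hat = w1 @ relu(w2 @ u.mean(axis=(0, 1)))  # channel excitation branch
    u_cse = u * sigmoid(z_hat)
    return u_sse + u_cse                         # element-wise sum
```

With zero weights both branches gate the input by 0.5, so their sum reproduces the input exactly, confirming that the two recalibrations are added rather than multiplied.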

### 3.3 Network model construction based on residual attention mechanism fusion

Considering the characteristics of the ResNet structure, scSE, RNN, and fully connected layers, the Res-scSE-CRNN network structure proposed in this paper is shown in Fig. 8.

In Fig. 8, the Res-scSE module replaces the convolution layers of the CRNN model to achieve high accuracy in both SED and DOA estimation. Two Bi-GRU layers capture context information, and the extracted features are then reduced in dimension by fully connected layers. Through the nonlinear sigmoid and tanh activation functions, the sound-event categories and the azimuth estimates are output, respectively. Log-mel spectrograms with acoustic intensity vectors, and log-mel spectrograms with GCC-PHAT features, are used as inputs.

The improved Res-scSE module is shown in Fig. 9.

As shown in Fig. 9, the module creates two skip connections, before and after SE recalibration. This double skip connection lets the network learn residual mappings with and without SE recalibration simultaneously, and the residual learning eases training by alleviating the vanishing-gradient problem.

### 3.4 Evaluation indicators

Error rate (*ER*), *F1*-Score, DOA error (*DE*), and frame recall (*FR*) are commonly used to evaluate SEDL, and these four indicators are used in this paper to evaluate the detection and localization performance of the SEDL model.

(1) Evaluation index of detection

The *F1*-Score and *ER* are used to evaluate detection performance, as follows.

The *F1*-Score is the harmonic mean of precision *P* and recall *R*. The precision is calculated as:

$$\begin{aligned} P=\frac{\sum \limits _{k=1}^{K} T P(k)}{\sum \limits _{k=1}^{K} T P(k)+\sum \limits _{k=1}^{K} F P(k)} \end{aligned}$$

(7)

In the formula, *TP* counts true positive samples in frame *k*: predicted positive and actually positive. *FP* counts false positives, predicted positive but actually negative; *FN* counts false negatives, predicted negative but actually positive.

The recall is expressed as:

$$\begin{aligned} R=\frac{\sum \limits _{k=1}^{K} T P(k)}{\sum \limits _{k=1}^{K} T P(k)+\sum \limits _{k=1}^{K} F N(k)} \end{aligned}$$

(8)

The *F1*-Score is obtained from *P* and *R*. The relationship between the two is:

$$\begin{aligned} \text{F1-Score}=\frac{2PR}{P+R} \end{aligned}$$

(9)

Equivalently, the *F1*-Score can be calculated directly from the counts:

$$\begin{aligned}{} & {} \text{F1-Score}= \nonumber \\ {} & {} \quad \frac{2 \sum \limits _{k=1}^{K} T P(k)}{2 \sum \limits _{k=1}^{K} T P(k)+\sum \limits _{k=1}^{K} F P(k)+\sum \limits _{k=1}^{K} F N(k)} \end{aligned}$$

(10)
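Eq. (10) can be checked against Eqs. (7)-(9) with a small sketch (NumPy; the counts are hypothetical):

```python
import numpy as np

def f1_score(tp, fp, fn):
    """Eq. (10): micro-averaged F1 from per-frame TP/FP/FN counts."""
    tp, fp, fn = (float(np.sum(v)) for v in (tp, fp, fn))
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical per-frame counts over K = 2 frames
tp, fp, fn = [3, 2], [1, 0], [1, 1]
p = 5 / 6  # Eq. (7): 5 / (5 + 1)
r = 5 / 7  # Eq. (8): 5 / (5 + 2)
# Eq. (10) agrees with the harmonic mean 2PR/(P+R) of Eq. (9)
```

Both routes give 10/13 here, confirming the count-based formula is the harmonic mean in disguise.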

The *ER* aggregates the frame-wise substitution errors *S*(*k*), deletion errors *D*(*k*), and insertion errors *I*(*k*), normalized by the number of active reference events *N*(*k*):

$$\begin{aligned} E R=\frac{\sum \limits _{k=1}^K S(k)+\sum \limits _{k=1}^K D(k)+\sum \limits _{k=1}^K I(k)}{\sum \limits _{k=1}^K N(k)} \end{aligned}$$

(11)
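A minimal sketch of Eq. (11), assuming `s`, `d`, `i`, `n` hold the per-frame substitution, deletion, insertion, and active-reference-event counts (the standard segment-based SED definitions):

```python
def error_rate(s, d, i, n):
    """Eq. (11): total substitutions, deletions, and insertions
    normalized by the total number of active reference events."""
    return (sum(s) + sum(d) + sum(i)) / sum(n)
```

For example, one substitution and one deletion over four active events gives an ER of 0.5.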

(2) Evaluation index of localization

The localization performance is evaluated using *DE* and *FR* as follows:

The DOA error is calculated as follows:

$$\begin{aligned} D E=\frac{1}{\sum \limits _{k=1}^K D_E^k} \sum _{k=1}^K H\left( D O A_R^k, D O A_E^k\right) \end{aligned}$$

(12)

where \(D_E^k\) is the number of DOA estimates at frame *k*, \(DOA_R^k\) and \(DOA_E^k\) are the reference and estimated DOAs, and \(H(\cdot )\) denotes the Hungarian algorithm used to solve the assignment between them.
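A sketch of Eq. (12) for 1-D azimuths, with brute-force matching standing in for the Hungarian algorithm and absolute difference as the assumed angular distance (a real implementation would use a proper great-circle or wrapped-angle distance):

```python
from itertools import permutations

def doa_error(ref_frames, est_frames):
    """Eq. (12): total angular error of the optimal reference/estimate
    matching per frame, divided by the total number of estimates."""
    total, n_est = 0.0, 0
    for ref, est in zip(ref_frames, est_frames):
        m = min(len(ref), len(est))
        # best matching cost over all assignments (small m assumed)
        total += min(sum(abs(r - e) for r, e in zip(perm, est))
                     for perm in permutations(ref, m))
        n_est += len(est)
    return total / n_est
```

For one frame with references [10, 350] and estimates [12, 340], the optimal matching costs 2 + 10 = 12 over two estimates, so DE = 6.0.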

The DOA frame recall is calculated as:

$$\begin{aligned} FR=\frac{\sum \limits _{k=1}^K 1\left( D_R^k=D_E^k\right) }{K} \end{aligned}$$

(13)

where \(D_R^k\) is the number of reference DOAs at frame *k*. The indicator equals 1 when \(D_R^k=D_E^k\), and *FR* is obtained by summing this indicator over all frames and dividing by *K*.
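A sketch of Eq. (13), assuming `n_ref` and `n_est` hold per-frame reference and estimated DOA counts:

```python
def frame_recall(n_ref, n_est):
    """Eq. (13): fraction of frames where the number of estimated
    DOAs equals the number of reference DOAs."""
    return sum(r == e for r, e in zip(n_ref, n_est)) / len(n_ref)
```

If two of three frames have matching counts, FR = 2/3.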

Ideally, *ER* and *DE* are close to 0 while *F1*-Score and *FR* are close to 1; the closer the indicators are to these values, the better the system performance.