Channel and temporal-frequency attention UNet for monaural speech enhancement

The presence of noise and reverberation significantly impedes speech clarity and intelligibility. To mitigate these effects, numerous deep learning-based network models have been proposed for speech enhancement tasks aimed at improving speech quality. In this study, we propose a monaural speech enhancement model called the channel and temporal-frequency attention UNet (CTFUNet). CTFUNet takes the noisy spectrum as input and produces a complex ideal ratio mask (cIRM) as output. To improve the speech enhancement performance of CTFUNet, we employ multi-scale temporal-frequency processing to extract input speech spectrum features. We also utilize multi-conv head channel attention and residual channel attention to capture temporal-frequency and channel features. Moreover, we introduce the channel temporal-frequency skip connection to alleviate information loss between down-sampling and up-sampling. On the blind test set of the first deep noise suppression challenge, our proposed CTFUNet has better denoising performance than the champion models and the latest models. Furthermore, our model outperforms recent models such as Uformer and MTFAA in both denoising and dereverberation performance.


Introduction
Speech is vital in various aspects of our daily lives, including mobile communication, audio chat, remote conferences, and speech control. There are many sources of noise, such as car honking, machine noise, rain, and murmurs. Reverberation occurs when sound waves propagate indoors and are reflected and absorbed by walls, ceilings, floors, and other obstacles. Even after the sound source has stopped, the sound wave persists in the room, being reflected and absorbed until it eventually dissipates. Therefore, noise and reverberation frequently disrupt speech, severely affecting the listener's experience. In light of these issues, removing background noise and reverberation from noisy speech is essential. Because users expect high-quality speech, speech enhancement and dereverberation technologies are increasingly critical.
Speech enhancement techniques can be broadly classified into traditional methods and deep neural network (DNN)-based methods. Traditional methods complete the speech enhancement task via signal processing and certain statistical assumptions. Examples of traditional methods include subspace algorithms [1], spectral subtraction [2], and algorithms based on statistical models [3,4]. Traditional methods mainly operate under the assumption that noise signals are stationary. However, most noise in natural environments is non-stationary, so most traditional methods have limitations.
Due to significant advancements in computing power, DNN-based speech enhancement methods have become increasingly prevalent. Because DNNs are highly effective in handling non-stationary noise, research on DNN-based speech enhancement is increasingly abundant. DNN-based speech enhancement methods can be categorized into time-domain and time-frequency domain approaches. The time-domain approach directly estimates the clean speech signal based on the noisy speech, taking an end-to-end approach. Wavenet [5], the first DNN capable of generating natural human speech and better modeling acoustic features, was instrumental in developing end-to-end denoising methods. Rethage et al. [6] propose an end-to-end denoising method that retains the acoustic feature modeling capability of Wavenet while reducing the algorithm's time complexity by removing its autoregressive features. Speech enhancement models generally process only the amplitude spectrum, ignoring the phase information. To exploit phase information fully, Stoller et al. [7] propose the time-domain end-to-end speech enhancement model Wave-U-Net, which allows the modeling of phase information and avoids fixed spectral transformations. To mitigate high delay and computational cost issues, Luo et al. [8] propose an end-to-end full-convolution time-domain speech separation network (Conv-TasNet).
Scholars have made significant progress in speech enhancement in the time domain; however, speech enhancement in the time-frequency domain is becoming increasingly popular. The time-frequency domain approach offers several advantages, such as the ability to focus on features often overlooked by the time-domain approach, enhanced robustness, and reduced computational cost [9]. The time-frequency domain approach typically involves two main methods: spectral mapping and spectral masking.
Spectral mapping refers to estimating the clean spectrum from the noisy spectrum. In the time-frequency domain-based speech enhancement algorithms, the phase information plays a crucial role in the enhancement performance [9,10]. However, estimating the phase spectrum directly is challenging since it lacks a clear structure. To address this problem, Tan et al. propose a novel framework using a convolutional recurrent network (CRN) [11] and introduce a gated convolution module [12] to estimate the phase spectrum.
Spectral masking takes the noisy spectrum as input and the mask as a training target. The mask can take various forms, such as ideal binary mask (IBM), ideal ratio mask (IRM), and complex ideal ratio mask (cIRM). Chen et al. [13] propose a separation and enhancement model based on long short-term memory (LSTM) that focuses on the temporal dynamics features of speech to improve speech intelligibility. This model significantly enhances objective speech intelligibility under low delay. Hao et al. propose FullSubNet [14], which uses a combination of a pure full-band model and a pure sub-band model to model the signal smoothly, pay attention to local features, and capture global long-distance features. To address the problems of input and output mismatch and rough handling of the frequency band, FullSubNet+ is proposed by Chen et al. [15].
In the field of speech enhancement, the attention mechanism plays a pivotal role in dynamically adjusting focus to distinct regions based on the unique characteristics of input signals. As a result, it has gained widespread adoption in speech enhancement models, aiming to enhance the quality and intelligibility of speech signals. However, the vanilla attention approach presents significant challenges due to its high computational complexity, rendering it impractical for speech-processing tasks. Consequently, finding effective strategies to mitigate the complexity of attention remains a significant challenge.
In this work, based on time-frequency domain speech enhancement and spectral masking, we propose a novel model for speech enhancement called the channel and temporal-frequency attention UNet (CTFUNet), which combines channel and time-frequency attention mechanisms to denoise and dereverberate speech signals.
CTFUNet takes the noisy complex spectrum as input and produces the cIRM as the output, achieving excellent performance in speech enhancement. Our contributions are summarized as follows:
• To alleviate the computational complexity of vanilla self-attention, we propose the multi-conv head channel attention (MCHCA) module. It enables the extraction of temporal-frequency speech features while maintaining linear complexity calculations for self-attention.
• With the aim of improving the efficiency of channel feature extraction, we introduce the residual channel attention module (RCAM) into our work. This module can selectively highlight the channels with the most features in the neural network.
• In the encoding and decoding framework, the encoding process compresses and loses a large amount of detailed information. To alleviate this problem and further extract features from channel dimensions and temporal-frequency dimensions at multiple scales and levels, we propose the channel temporal-frequency skip connection (CTFSC) between the down-sampling and up-sampling modules.
The remaining contents of this paper are presented as follows: In Section 2, we provide an overview of the related works relevant to our study. Section 3 presents the signal model and the various components of our proposed CTFUNet in detail. Section 4 elaborates on the datasets used in our experiments and the implementation details of our experiments. Section 5 presents the results of our experiments and provides a detailed analysis. Finally, in Section 6, we draw conclusions based on our findings.

Self-attention
Self-attention is a widely used attention mechanism and a crucial component of the transformer. By utilizing self-attention, the network can capture long-range dependencies in the input, and the multi-head structure enables parallel attention calculation. Recently, the effectiveness of self-attention has been demonstrated in various fields such as computer vision, natural language processing, and speech processing [16][17][18]. The specific calculation process of self-attention is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$
where Q, K, and V denote query, key, and value projection vectors, and $d_k$ represents the dimension of K. However, self-attention consumes a significant amount of computation and graphics memory when calculating the attention map. For example, for an image with H × W pixels, its complexity is $O(H^2 W^2)$.
To alleviate this problem, Zhao et al. [19] make improvements to the vanilla attention by dividing temporal-frequency attention into temporal attention and frequency attention, reducing the complexity of attention. Zhang et al. [20] propose an axial self-attention (ASA) for speech enhancement. ASA can reduce the need for memory and computation, making it more suitable for speech signals. In our research, we make improvements to self-attention, reducing its complexity by calculating self-attention on the channel dimension, which encodes global information implicitly.

Temporal convolutional network
Previous research has demonstrated that recurrent neural networks (RNNs) excel in addressing sequence-related tasks [21,22]. However, RNNs operate one time step at a time and process the next step only after completing the previous one. As a result, RNN calculations require significant memory to store all intermediate results. To overcome this challenge, researchers propose a new network for time series processing called temporal convolutional network (TCN) [23]. TCN is based on convolutional neural networks (CNNs) and incorporates causal convolution, dilated convolution, and a residual module. In comparison to RNNs, TCN offers several benefits, including:
• Parallelism. Unlike RNN, TCN processes the input time series as a whole without waiting for the last time step to complete processing.
• Flexible receptive field size. Dilated convolution enlarges the receptive field, so TCN can flexibly change the size of the receptive field by using dilated convolution.
• Because of the introduction of the residual module, TCN has stable gradients, which can avoid gradient explosion or vanishing.
• During training, TCN requires less memory than RNN.
In the field of speech enhancement, TCN is widely used because it handles sequence-related tasks better than RNN [24]. Pandey et al. [25] insert a TCN between the encoder and decoder and achieve good speech enhancement performance with fewer trainable parameters. Lin et al. [26] combine self-attention with TCN and adopt the multi-stage learning method to extract features. In our study, we use the temporal-frequency convolutional network (TFCN) [27] instead of TCN. TFCN is an improvement of TCN that can simultaneously utilize features from both temporal and frequency dimensions, resulting in stronger modeling capabilities.

UNet
UNet [28,29] is a network model that follows a symmetrical U-shaped structure. It is typically an encoder-decoder structure. The first half of UNet is responsible for feature extraction and continuously reducing the input size, typically achieved through convolution and down-sampling operations. The latter half aims to restore the original input size. Apart from convolution, the crucial steps of this process include up-sampling and skip connections. Skip connections concatenate the location information of the bottom layer with the semantic information of the deep layer to achieve better results. While UNet has a straightforward structure and good performance, its model size is relatively large, and its performance may be affected by the receptive field.
Because the network structure of UNet has local connectivity characteristics, it can be used for speech signal processing. Choi et al. [30] improve UNet by proposing Tiny Recurrent UNet (TRUNet) and propose a phase-aware β-sigmoid mask (PHM) for speech enhancement. Fu et al. [29] build a network framework based on UNet and Conformer [31]. In addition, they simultaneously model the real and imaginary parts of the input speech spectrum and calculate self-attention on both temporal-frequency dimensions. In our study, we also improve UNet to focus not only on temporal-frequency dimensional features, but also on channel dimensional features.

Signal model
Assuming that x(t) represents a clean speech signal, the acoustic signal captured by the microphone in a noisy room can be expressed as follows:
$$y(t) = h(t) * x(t) + n(t) \tag{2}$$
where h(t) denotes the room impulse response (RIR), n(t) indicates the background noise, and * denotes the convolution operation. Moreover, based on the definition of reverberation, the RIR h(t) can be decomposed into the direct part $h_d(t)$ and the reflection part $h_r(t)$, so y(t) can be re-expressed as:
$$y(t) = h_d(t) * x(t) + h_r(t) * x(t) + n(t) = d(t) + r(t) + n(t) \tag{3}$$
where d(t) denotes direct sound, which is the sound that travels directly from the sound source to the listener without reflecting off any surfaces, and r(t) denotes reverberation, which is the sound that is reflected off the surfaces in the room before reaching the listener's ear. The discrete Fourier transform of Eq. 3 is given by:
$$Y(l, f) = D(l, f) + R(l, f) + N(l, f) \tag{4}$$
where l and f denote frame index and frequency bin, respectively. D(l, f) represents the target to be estimated, while R(l, f) and N(l, f) represent the complex spectrum of the reverberation and noise that need to be removed, respectively.
The proposed model in our study takes Y(l, f) as input and outputs the estimated spectrum mask M(l, f). Subsequently, we use M(l, f) to estimate the desired output D(l, f).
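The decomposition of the microphone signal into direct sound, reverberation, and noise can be checked numerically. The following is a minimal sketch using only the Python standard library; the toy RIR taps, the choice of the first tap as the direct path, and all variable names are illustrative assumptions rather than values from this work.

```python
import random

def convolve(h, x):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            out[i + j] += hi * xj
    return out

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(64)]        # toy "clean speech"
h = [1.0, 0.0, 0.0, 0.5, 0.0, 0.25, 0.125]             # toy RIR (hypothetical taps)
split = 1                                              # assumed: only the first tap is the direct path
h_d = h[:split] + [0.0] * (len(h) - split)             # direct part of the RIR
h_r = [0.0] * split + h[split:]                        # reflection part of the RIR
n = [random.gauss(0.0, 0.1) for _ in range(len(x) + len(h) - 1)]  # background noise

d = convolve(h_d, x)                                   # direct sound d(t)
r = convolve(h_r, x)                                   # reverberation r(t)
y = [c + ni for c, ni in zip(convolve(h, x), n)]       # y = h * x + n
y_split = [di + ri + ni for di, ri, ni in zip(d, r, n)]  # y = d + r + n

# Both forms describe the same signal up to floating-point rounding.
max_err = max(abs(a - b) for a, b in zip(y, y_split))
```

Because convolution is linear, splitting h(t) into two parts that sum to h(t) splits the reverberant speech into d(t) and r(t) without changing y(t); the enhancement target corresponds to d(t) alone.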

Overall structure
In recent years, UNet has demonstrated its efficacy in feature extraction from data. This architecture has been widely adopted in speech enhancement and has shown remarkable results [28,29]. The proposed CTFUNet, which follows a typical UNet structure, is presented in Fig. 1. The input to CTFUNet is the complex spectrum of noisy speech. First, a phase encoder (PE) is employed to convert complex spectral features to real spectral features. Then, a 3×3 input convolution layer extracts features and changes the channel number for the later calculations. Following this, three encoders, two neck modules, three decoders, and CTFSC are utilized to construct the main network.
Each encoder mainly comprises a frequency down-sampling (FD) module, a temporal-frequency convolution module (TFCM), an MCHCA module, and an RCAM. The neck module consists of a TFCM, an MCHCA module, and an RCAM. The structure of the decoder is similar to that of the encoder, but with a frequency up-sampling (FU) module replacing the FD module. Furthermore, we utilize the CTFSC to connect the encoder and decoder. Finally, an output convolution layer is employed to obtain the cIRM M(l, f), and the masking method proposed in [20] is applied to obtain the enhanced spectrum D(l, f).

Phase encoder and TF-convolution module
Some previous studies [32,33] have proven that real-valued speech enhancement networks are easier to build and achieve better enhancement effects on various datasets. Inspired by [20], we introduce the PE module into the model to perform the mapping of complex spectral features to real spectral features. The structure of our PE module is similar to that in [20], but it consists of only one complex convolution layer for processing the noisy speech spectrum. The kernel size and the stride of the complex convolution layer are set to (1,3) and (1,1). The feature dynamic range compression layer's power compression ratio [34] is 0.5.

Fig. 1 Overall structure diagram of the proposed channel and temporal-frequency attention UNet

To efficiently extract temporal-frequency features using small parameters and convolution kernels, [27] proposes a TFCN by replacing 1-D convolutions in TCN with 2-D convolutions. Motivated by this study, we introduce TFCM, which contains 6 TFCNs, each consisting of two point-wise convolution layers and a 2-D dilated convolution layer. The kernel size and stride of the 2-D dilated convolution layer are (3,3) and (1,1), respectively. For the ith TFCN, the dilation of the 2-D dilated convolution layer is set to $2^{i-1}$, and its detailed description is shown in Table 1.
The input size and the output size of each layer are specified in channel_numbers × frequency_frames × time_frames format, and the hyperparameters in (kernelsize, strides, dilations) format.
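To give a feel for how the dilation pattern widens the context seen by one TFCM, the sketch below computes the receptive field along the dilated axis. It assumes one 3-tap dilated convolution per TFCN with dilation 2^(i−1) for i = 1..6, that the point-wise convolutions do not widen the field, and that the dilated axis is time with the 10 ms hop and 20 ms frame length stated in the implementation details; the function name is hypothetical.

```python
def dilated_receptive_field(kernel_size, dilations):
    """Receptive field (in steps) of a stack of stride-1 dilated convolutions."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field

dilations = [2 ** (i - 1) for i in range(1, 7)]   # i = 1..6 -> 1, 2, 4, 8, 16, 32
rf_frames = dilated_receptive_field(3, dilations)

# Assumed framing: 10 ms hop, 20 ms frames, dilation applied along the time axis.
hop_ms, frame_ms = 10, 20
span_ms = (rf_frames - 1) * hop_ms + frame_ms
```

Under these assumptions a single TFCM covers 127 frames, or roughly 1.28 s of signal, while each layer still uses only a 3-tap kernel along that axis.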

Multi-conv head channel attention
Due to its large receptive field, self-attention has been widely used to capture long-term dependencies between features. However, its use in neural networks significantly increases the network's computational complexity. For instance, when calculating the self-attention map of a speech spectrum with a size of $C \times F \times L$, the time complexity can be as high as $C \times F^2 \times L^2$. To ease this problem, many scholars have put forward solutions [35,36]. Motivated by these works, we propose the MCHCA module as illustrated in Fig. 2. After replacing vanilla self-attention with MCHCA, the time complexity of calculating the self-attention map becomes $C^2 \times F \times L$, where C is far less than L. MCHCA can capture long-term information with linear complexity, thanks to its two key features:
• MCHCA avoids calculating self-attention across the temporal-frequency dimension and instead obtains the self-attention map across the channel dimension to encode global information implicitly. This approach effectively reduces the computational complexity of traditional self-attention.
• To focus on local information, we incorporate 1×1 point-wise convolutions and 3×3 depth-wise convolutions prior to generating the self-attention map.
In MCHCA, we first apply layer normalization to the input. After the point-wise convolution layer captures the cross-channel information, we use the depth-wise convolution layer to extract the temporal-frequency information and obtain query (Q), key (K), and value (V) projection vectors. The process is mathematically represented as follows:
$$Q = W_D^{Q} W_P^{Q}\,\mathrm{LN}(X), \quad K = W_D^{K} W_P^{K}\,\mathrm{LN}(X), \quad V = W_D^{V} W_P^{V}\,\mathrm{LN}(X) \tag{5}$$
where X denotes the input, LN(·) denotes layer normalization, and $W_P^{(\cdot)}$ and $W_D^{(\cdot)}$ represent the projection matrices in the point-wise convolution and depth-wise convolution layers. The integration of point-wise and depth-wise convolution layers exploits the features of different channels in the same temporal-frequency position, enabling the network to concentrate on local information.

Fig. 2 The structure diagram of multi-conv head channel attention module
For subsequent computation, we reshape $Q \in \mathbb{R}^{(C, F \times L)}$, $K \in \mathbb{R}^{(F \times L, C)}$, and $V \in \mathbb{R}^{(C, F \times L)}$ from the original size of $\mathbb{R}^{(C, F, L)}$. Then, we calculate the dot product of Q and K to encode global information across the channel dimension. Following this, we apply the Softmax function to the result to obtain the channel attention map, which has a size of $\mathbb{R}^{(C \times C)}$. Finally, we take the dot product of the channel attention map and V to obtain the channel attention. The complete channel attention calculation process is expressed as Eq. 6:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK}{\mu}\right)V \tag{6}$$
where µ is a learnable scaling factor to adjust the result of the dot product of Q and K. The overall calculation process of MCHCA is expressed as follows:
Furthermore, we incorporate multi-head processing on the channel dimension in MCHCA, which enables parallel computation of attention and the capture of features at multiple scales. Table 2 provides a detailed description of MCHCA; the hyperparameters are in (kernelsize, strides) format.
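The channel-wise attention step described above can be sketched in a few lines of plain Python. This is an illustrative single-head toy, not the authors' implementation: the point-wise and depth-wise convolutions are omitted (Q, K, and V are taken from the same toy features), and applying µ as a divisor of the scores is an assumption, since the text only states that µ scales the dot product.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def channel_attention(q, k, v, mu=1.0):
    """q, v: C x (F*L); k: (F*L) x C. Returns the C x C map and the C x (F*L) output."""
    scores = matmul(q, k)                                   # C x C channel similarities
    attn = softmax_rows([[s / mu for s in row] for row in scores])  # assumed: mu divides the scores
    return attn, matmul(attn, v)

random.seed(0)
C, F, L = 4, 3, 5                                           # toy sizes
flat = [[random.gauss(0.0, 1.0) for _ in range(F * L)] for _ in range(C)]
q = v = flat                                                # conv projections omitted in this toy
k = [list(col) for col in zip(*flat)]                       # reshaped to (F*L) x C
attn_map, out = channel_attention(q, k, v, mu=2.0)

map_macs = C * C * F * L                                    # cost of the channel attention map
vanilla_macs = C * (F * L) ** 2                             # cost of a map over all T-F positions
```

The attention map is C × C regardless of the number of frames, so its cost grows linearly with F × L, whereas a map over all temporal-frequency positions grows quadratically.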
To confirm that our proposed MCHCA indeed reduces time complexity, we compared it with vanilla self-attention (VSA), improved T-F self-attention (ISA) [19], and axial self-attention (ASA) [20].To ensure the successful operation of vanilla self-attention, the length of the speech is selected as 5 s.The comparison results are shown in Table 3.
Compared to VSA and ASA, MCHCA requires more MACs and parameters but greatly shortens the runtime. Compared to ISA, MCHCA achieves a similar runtime with fewer MACs and parameters.

Residual channel attention module
Although we use MCHCA to obtain the channel attention map with the size of $\mathbb{R}^{(C \times C)}$, its essence is still temporal-frequency attention. [37] proposes a residual channel attention block, which allows the network to focus more on useful feature channels. Based on it, we introduce RCAM to capture features across different channels. The specific structure of RCAM is illustrated in Fig. 3.
The input first passes through the instance normalization layer and then through the depth convolution-ReLU-depth convolution block (a simple residual block) to obtain the residual features. Subsequently, the residual features are used to obtain the feature information of all channels through 2-D average pooling, down-sampling convolution, ReLU, up-sampling convolution, and sigmoid activation functions. Finally, the residual features are multiplied by the channel feature information and added to the input to enable the network to use channel information fully. Table 4 provides a detailed description of RCAM; the hyperparameters are in (kernelsize, strides) format.
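The gating path of RCAM (pooling, channel reduction, ReLU, channel expansion, sigmoid, rescaling, residual addition) can be illustrated with a small standard-library sketch. The instance normalization and the depth-wise convolution block are omitted, so the residual features are simply given as input; the weights, sizes, and function names are hypothetical, and the reduction factor of 4 follows the implementation details.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_gate(residual, w_down, w_up):
    """residual: C channels of flattened T-F values.
    w_down is (C/r) x C and w_up is C x (C/r); both are hypothetical bias-free weights."""
    pooled = [sum(ch) / len(ch) for ch in residual]                 # 2-D average pooling
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled)))     # down-sampling conv + ReLU
              for row in w_down]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden)))        # up-sampling conv + sigmoid
            for row in w_up]

def rcam_output(x, residual, w_down, w_up):
    """Scale each residual channel by its gate and add the module input."""
    gate = channel_gate(residual, w_down, w_up)
    return [[xi + g * ri for xi, ri in zip(x_ch, r_ch)]
            for x_ch, r_ch, g in zip(x, residual, gate)]

C, r, T = 8, 4, 6                                                   # toy sizes, reduction factor 4
x = [[float(c + t) for t in range(T)] for c in range(C)]
residual = [[0.1 * (c - t) for t in range(T)] for c in range(C)]
w_down = [[0.05 * (i + 1)] * C for i in range(C // r)]
w_up = [[0.1 * (j + 1)] * (C // r) for j in range(C)]

gate = channel_gate(residual, w_down, w_up)
y = rcam_output(x, residual, w_down, w_up)
zero_residual = [[0.0] * T for _ in range(C)]
y_identity = rcam_output(x, zero_residual, w_down, w_up)
```

Each gate value lies strictly between 0 and 1, so the module can attenuate a residual channel but never amplify it, and with zero residual features the block reduces to an identity mapping.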

Frequency down and up sampling
The work in [20] demonstrates that FD and FU modules are effective in extracting multi-scale features. Based on this, we incorporate FD and FU modules into our approach. Additionally, at each scale, we introduce TFCM, MCHCA, and RCAM to enable the network to more effectively capture temporal-frequency and channel features.
The FD and FU modules in our work have similar structures to those proposed in [20].However, we make modifications by replacing the batch normalization layer with the instance normalization layer, which is better for speech enhancement tasks [38,39].Furthermore, we set the kernel size, stride, and groups of both the convolution layer and the transpose convolution layer to (4, 4), (2, 1), and 2, respectively.

Channel temporal-frequency skip connection
Up to now, significant progress has been made in research on attention mechanisms.The incorporation of attention can not only highlight critical regions but also enhance the representation power of these regions.Woo et al. [40] and Hu et al. [41] calculate attention weights on both the channel and spatial dimensions, highlighting the importance of channel attention.To further exploit the features of the temporal-frequency and channel dimensions at multiple scales and levels, we propose CTFSC, as illustrated in Fig. 4. CTFSC mainly consists of a channel focussing module and a temporal-frequency focussing module.
In the channel focussing module, input first passes through the average pooling layer and the max pooling layer to aggregate the temporal-frequency features of speech and obtain $P_{ca}$ and $P_{cm}$, respectively. Then, $P_{ca}$ and $P_{cm}$ are fed into a convolution block (CB) with shared parameters. Finally, we merge and output the channel eigenvector $F_c$ using a sigmoid function and element-wise addition. The overall calculation process of the channel focussing module is as follows:
$$F_c = \sigma\big(\mathrm{CB}(\mathrm{Avg}(x)) + \mathrm{CB}(\mathrm{Max}(x))\big) \otimes x \tag{8}$$
where σ denotes the sigmoid function, Avg(•) represents the average pooling calculation, Max(•) denotes the max pooling calculation, x represents the input, and ⊗ denotes the element-wise product.
In the temporal-frequency focussing module, the output of the channel focussing module $F_c$ passes through the average pooling layer and the max pooling layer along the channel dimension to aggregate the channel features of speech, obtaining $P_{sa}$ and $P_{sm}$, respectively. Subsequently, the concatenation of $P_{sa}$ and $P_{sm}$ passes through a convolution layer with a kernel size of (7, 7) followed by a sigmoid layer to obtain the output. The calculation process of the temporal-frequency focussing module can be expressed as follows:
$$F_{tf} = \sigma\big(W([P_{sa}; P_{sm}])\big) \tag{9}$$
where $F_{tf}$ denotes the output of the module, $[\cdot\,;\cdot]$ denotes concatenation, and W denotes the projection matrix of the 7×7 convolution layer. Table 5 provides a detailed description of CTFSC; the hyperparameters are in (kernelsize, strides) format.

Fig. 4 The structure of channel temporal-frequency skip connection
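The two pooling stages of CTFSC can be sketched as follows. This is only an illustration of the data flow: the shared convolution block is replaced by a hypothetical identity stand-in, the 7×7 convolution of the temporal-frequency focussing module is left out, and the feature values are made up.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_focus(x, conv_block):
    """x: C channels of flattened T-F values; conv_block maps C pooled values to C values."""
    p_ca = [sum(ch) / len(ch) for ch in x]                  # average pooling over T-F positions
    p_cm = [max(ch) for ch in x]                            # max pooling over T-F positions
    weights = [sigmoid(a + m)                               # shared block, addition, sigmoid
               for a, m in zip(conv_block(p_ca), conv_block(p_cm))]
    return [[w * v for v in ch] for ch, w in zip(x, weights)]  # element-wise product with input

def pool_over_channels(f_c):
    """Average and max over the channel dimension at every T-F position."""
    positions = list(zip(*f_c))
    p_sa = [sum(p) / len(p) for p in positions]
    p_sm = [max(p) for p in positions]
    return p_sa, p_sm

def identity_block(pooled):
    """Hypothetical stand-in for the shared convolution block (CB)."""
    return list(pooled)

x = [[0.2, -1.0, 0.5, 0.1],
     [1.5, 0.3, -0.4, 0.8],
     [-0.6, 0.9, 0.0, 0.7]]                                 # 3 channels, 4 T-F positions
f_c = channel_focus(x, identity_block)
p_sa, p_sm = pool_over_channels(f_c)
```

In the full module, the two pooled maps would be concatenated and passed through the 7×7 convolution and a sigmoid; that step is omitted here to keep the sketch dependency-free.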

Loss function
For speech enhancement tasks, both magnitude and phase information are crucial. Therefore, we adopt the complex mean squared error (cMSE) proposed in [42] as our loss function. The cMSE is defined as follows:
$$\mathcal{L}_{cMSE} = \alpha P_{cRI} + \beta P_{cMag} \tag{10}$$
where $P_{cRI}$ and $P_{cMag}$ can be expressed as follows:
$$P_{cRI} = \mathrm{MSE}\big(S_{cRI}, \hat{S}_{cRI}\big), \quad P_{cMag} = \mathrm{MSE}\big(S_{cMag}, \hat{S}_{cMag}\big) \tag{11}$$
where $S_{cRI}$ and $S_{cMag}$ represent the complex compression spectrum and magnitude compression spectrum of clean speech, and $\hat{S}_{*}$ denotes the corresponding estimated speech spectrum.
It should be noted that we omit the frame index l and the frequency index f for brevity. α and β are 0.3 and 0.7. $S_{cRI}$ and $S_{cMag}$ can be specifically expressed as follows:
$$S_{cRI} = |S|^{c}\,\frac{S}{|S|}, \quad S_{cMag} = |S|^{c} \tag{12}$$
where c denotes the compressibility coefficient, set as 0.3.
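A power-compressed complex loss of this kind can be sketched with Python's `cmath` module. The sketch assumes that α weights the compressed complex term and β the compressed magnitude term, that errors are averaged over temporal-frequency bins, and that compression keeps the phase while raising the magnitude to the power c; the function names are hypothetical.

```python
import cmath

def compress(s, c=0.3):
    """Return (compressed complex value, compressed magnitude) for one T-F bin."""
    mag = abs(s) ** c
    return cmath.rect(mag, cmath.phase(s)), mag            # phase is kept, magnitude compressed

def cmse(clean, estimate, alpha=0.3, beta=0.7, c=0.3):
    """Weighted sum of compressed complex and compressed magnitude mean squared errors."""
    p_cri = 0.0
    p_cmag = 0.0
    for s, s_hat in zip(clean, estimate):
        s_cri, s_cmag = compress(s, c)
        e_cri, e_cmag = compress(s_hat, c)
        p_cri += abs(s_cri - e_cri) ** 2                    # sensitive to magnitude and phase
        p_cmag += (s_cmag - e_cmag) ** 2                    # sensitive to magnitude only
    n = len(clean)
    return alpha * p_cri / n + beta * p_cmag / n

clean = [1 + 0j, 0.5 - 0.5j, -0.2 + 0.9j]
loss_perfect = cmse(clean, clean)                           # identical spectra
loss_phase_only = cmse([1 + 0j], [0 + 1j])                  # same magnitude, 90 degree phase error
```

A pure phase error leaves the magnitude term at zero and is penalized only through the complex term, which is why the two terms are combined.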

Datasets
In our experiment, we utilize three training datasets, all of which are derived from clean speech and noise datasets provided by the first Deep Noise Suppression (DNS) Challenge [43]. The clean speech datasets include about 500 h of English speech clips from 2150 speakers. The noise datasets are composed of 181 h of clips from 150 classes. To conduct the ablation study, we first use the image source method to obtain 100,000 pairs of RIRs with reverberation time RT60 from 0.3 s to 1.4 s. We convolve 75% of the clean speech with randomly selected RIRs and add noise with a random signal-to-noise ratio (SNR) ranging from −5 to 20 dB to the reverberant speech. Finally, we generate a 100-h train dataset and a 20-h validation dataset. For our test set, we select the blind test set provided by the 1st DNS challenge, which comprises two parts: with reverberation and without reverberation. The SNR of the blind test set ranges from 0 to 20 dB. We generate the second dataset to compare the denoising performance with other models. All generation processes are the same as above, except that the duration of the dataset is 500 h. To ensure the fairness of comparison, we also select the blind test set provided by the 1st DNS challenge as the test set.
To evaluate the denoising and dereverberation performance of the CTFUNet, we still employ the clean speech datasets and the noise datasets provided by the 1st DNS challenge to generate our third dataset.We divide the clean speech and noise datasets into the train, validation, and test datasets according to the proportion of 80%, 10%, and 10%.Subsequently, all clean speech is convolved with the RIRs generated earlier, and noise with SNR range from −5 to 20 dB is randomly added.Finally, we obtain a 100-h train dataset, a 10-h validation dataset, and a 5-h test dataset for our experiments.
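Adding noise at a randomly drawn SNR, as in the dataset generation above, amounts to rescaling the noise before summation. The sketch below shows one common way to do this with the standard library; the toy Gaussian signals and the helper names are assumptions, and the actual data pipeline may differ (for example in how silent segments are handled).

```python
import math
import random

def power(signal):
    """Mean power of a finite signal."""
    return sum(v * v for v in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)], gain

random.seed(0)
speech = [random.gauss(0.0, 1.0) for _ in range(1600)]      # toy (reverberant) speech
noise = [random.gauss(0.0, 0.3) for _ in range(1600)]       # toy noise clip
snr_db = random.uniform(-5.0, 20.0)                         # SNR range used for training data
noisy, gain = mix_at_snr(speech, noise, snr_db)

# Verify the achieved SNR from the scaled noise.
achieved_db = 10 * math.log10(power(speech) / power([gain * n for n in noise]))
```

The gain follows from requiring 10·log10(P_speech / (gain² · P_noise)) to equal the target SNR.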
The sampling rate of all the above speech is 16 kHz.

Implementation details
In the experiment, the frame length and hop length of the STFT complex spectrum are 20 ms and 10 ms. The output channel numbers of PE and input convolution layer are 2 and 32. The output channel numbers of the three FDs are 64, 128, and 256. The output channel numbers of the three FUs are 128, 64, and 32. The head numbers of MCHCA are 1, 2, and 4 in encoders and 8 in the neck module. The multiple of down-sampling convolution and up-sampling convolution in RCAM is 4. The output channel number of the output convolution layer is 4. Table 6 provides a detailed description of CTFUNet; the hyperparameters are in (kernelsize, strides) format.
The optimizer is AdamW, and the initial learning rate is 0.001, decaying exponentially by a factor of 0.98 as the training epoch increases. We train the network for 50 epochs with a batch size of 2.
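The framing and learning-rate settings translate into concrete numbers as sketched below. The sketch assumes that the decay is applied once per epoch starting from epoch 0; the constant and function names are hypothetical.

```python
SAMPLE_RATE = 16000                        # Hz, sampling rate of all speech
FRAME_MS, HOP_MS = 20, 10                  # STFT frame length and hop length

frame_len = SAMPLE_RATE * FRAME_MS // 1000 # samples per frame
hop_len = SAMPLE_RATE * HOP_MS // 1000     # samples between frame starts

def learning_rate(epoch, base=1e-3, gamma=0.98):
    """Exponentially decayed learning rate (assumed: one decay step per epoch)."""
    return base * gamma ** epoch

schedule = [learning_rate(epoch) for epoch in range(50)]
```

At 16 kHz this gives 320-sample frames with a 160-sample hop, that is, 50% overlap between consecutive frames.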

Ablation study
In this section, we conduct an ablation study to investigate the impact of key modules in the proposed CTFUNet on performance. We evaluate denoising performance on the blind test set while disregarding dereverberation performance. Specifically, we replace MCHCA with ISA and ASA to demonstrate the superiority of MCHCA. We do not choose VSA because its computational complexity is too large to be used for speech processing tasks. Besides, we individually remove the RCAM, MCHCA, and CTFSC modules from the CTFUNet architecture. "+ISA" and "+ASA" refer to replacing MCHCA with ISA and ASA, respectively. "−CTFSC" refers to simply passing the output of FD into FU, concatenating, and element-wise multiplying with the input of FU without any further processing. Table 7 presents the results of the ablation study. After replacing MCHCA with ISA and ASA, there is a significant decrease in performance. Combined with Table 3, our proposed MCHCA has significant advantages in both runtime and performance. CTFSC increases the number of parameters by 0.2 M, but it significantly enhances the denoising performance of the network. The CTFSC module helps alleviate information loss during down-sampling and up-sampling. MCHCA adds 1 M parameters to the network but fully extracts temporal-frequency features and improves denoising capability. RCAM captures channel dimension features, leading to a parameter increase of 1.2 M.

Denoising performance comparison
Table 8 illustrates the denoising performance comparison between our proposed CTFUNet and other models with similar parameters. Compared with the champion model DCCRN [51] in the real-time track of the 1st DNS challenge, the NB-PESQ scores of CTFUNet increased by 0.664 and 0.373 with and without reverberation. In comparison with the champion model PoCoNet [53] in the non-real-time track of the 1st DNS challenge, the WB-PESQ scores of CTFUNet increased by 0.535 and 0.428 with and without reverberation. Additionally, we also compare with several speech enhancement models proposed in recent years, such as GaGNet [55], FullSubNet+ [15], and FS-CANet [56]. The results demonstrate that the WB-PESQ, NB-PESQ, STOI, and SI-SDR of CTFUNet are significantly better than those of other models, with or without reverberation. Therefore, our proposed CTFUNet can achieve excellent denoising performance with a 500-h train dataset, which is much smaller than the datasets used by other models.

Denoising and dereverberation performance comparison
In this section, we conduct a comparative analysis of the denoising and dereverberation performance of CTFUNet, Uformer [29], and MTFAA [20].The datasets and experimental conditions for the three models are identical.The evaluation results are presented in Tables 9, 10, and 11.
To compare the denoising performance separately, we generate five noisy test sets with SNR ranging from −5 to 15 dB in 5 dB intervals. Each test set lasts 1 h, and all speech has no reverberation. We use WB-PESQ, DNSMOS, STOI, and SI-SDR as metrics. Table 9 shows the results of denoising performance. Our proposed CTFUNet clearly outperforms MTFAA and Uformer in denoising performance. On average, CTFUNet improves WB-PESQ by 1.002, DNSMOS by 0.795, STOI by 5.762%, and SI-SDR by 5.161 compared with noisy speech, demonstrating significant denoising capability.
To compare the dereverberation performance separately, we generate six reverberation test sets with RT60 ranging from 0.4 s to 1.4 s in steps of 0.2 s. WB-PESQ, DNSMOS, STOI, SI-SDR, SRMR, CD, LLR, and $\mathrm{SNR}_{fw}$ are selected as metrics, and the results of dereverberation performance are illustrated in Table 10. Compared with unprocessed reverberant speech, the three models significantly improve WB-PESQ, DNSMOS, STOI, SI-SDR, and SRMR, and reduce CD and LLR at each reverberation time. For $\mathrm{SNR}_{fw}$, all models decrease at low RT60 and increase at high RT60 compared with reverberant speech, but only the $\mathrm{SNR}_{fw}$ of CTFUNet is higher than that of reverberant speech in the end. Overall, CTFUNet has more significant advantages than other models in all metrics except SI-SDR and SRMR. On average, CTFUNet improves WB-PESQ by 0.927 and DNSMOS by 0.978 and decreases CD by 1.297 and LLR by 0.306. Therefore, CTFUNet exhibits an excellent enhancement effect on reverberant speech.
Finally, we generate a test set containing noise and reverberation to evaluate the denoising and dereverberation performance of CTFUNet simultaneously. The test set contains noisy-reverberant speech with RT60 ranging from 0.4 s to 1.4 s and SNR ranging from −5 to 15 dB. The results in Table 11 show that CTFUNet has notable advantages in all metrics except SRMR, which is basically consistent with previous experimental results. Compared with unprocessed noisy-reverberant speech, CTFUNet improves WB-PESQ by 0.869 and DNSMOS by 1.421 and decreases CD by 2.173 and LLR by 0.524.
To observe the speech enhancement effect of CTFUNet more intuitively, Fig. 5 illustrates the comparison results of the unprocessed speech spectrogram and the enhanced speech spectrogram. The enhanced speech spectrogram of CTFUNet is similar to the clean speech spectrogram, demonstrating that CTFUNet can effectively suppress noise and reverberation. In addition, we visualize the learned attention matrix of each layer, as shown in Fig. 6. This figure shows that MCHCA has learned the correlation between different channels and is able to utilize global contextual information.
Based on the results of all experiments, we can conclude that CTFUNet can effectively improve the clarity and intelligibility of speech under different noise and reverberation levels.

Conclusions
Noise and reverberation seriously affect the quality and intelligibility of speech. To address this issue, we propose CTFUNet, a speech enhancement model that adopts a typical encoder-decoder framework. We mainly use the temporal-frequency convolution module and the multi-conv head channel attention with linear complexity to extract the temporal-frequency features of the signal. We use the residual channel attention module to capture the signal's channel features. Additionally, we introduce the channel temporal-frequency skip connection to mitigate the information loss problem in the process of down-sampling and up-sampling. Experimental results demonstrate that CTFUNet can effectively suppress different levels of noise and reverberation, exhibiting excellent speech enhancement performance.

Table 1
Architecture of the ith TFCN

Table 2
Architecture of MCHCA

Table 3
Comparison results of several self-attentions

Fig. 3 The structure diagram of residual channel attention module

Table 4
Architecture of RCAM

Table 5
Architecture of CTFSC

Table 6
Architecture of CTFUNet

Table 8
Denoising performance comparison of CTFUNet with other models

Table 9
Denoising performance on test dataset without reverberation

Table 10
Dereverberation performance on test dataset without noise

Table 11
Denoising and dereverberation performance on test dataset