
A time-frequency fusion model for multi-channel speech enhancement

Abstract

Multi-channel speech enhancement plays a critical role in numerous speech-related applications. Several previous works explicitly utilize deep neural networks (DNNs) to exploit tempo-spectral signal characteristics, which often leads to excellent performance. In this work, we present a time-frequency fusion model, namely TFFM, for multi-channel speech enhancement. We utilize three cascaded U-Nets to capture three types of high-resolution features, aiming to investigate their individual contributions. To be specific, the first U-Net keeps the time dimension and performs feature extraction along the frequency dimension for the high-resolution spectral features with global temporal information, the second U-Net keeps the frequency dimension and extracts features along the time dimension for the high-resolution temporal features with global spectral information, and the third U-Net downsamples and upsamples along both the frequency and time dimensions for the high-resolution tempo-spectral features. These three cascaded U-Nets are designed to aggregate local and global features, thereby effectively handling the tempo-spectral information of speech signals. The proposed TFFM in this work outperforms state-of-the-art baselines.

1 Introduction

Noise disturbances have severe impacts on speech quality and intelligibility. Speech enhancement is an important and challenging technique aimed at extracting desired speech signals in noisy environments. Different from single-channel speech enhancement, multi-channel speech enhancement can exploit both spatial and tempo-spectral characteristics of the target signal. With the development of devices with multiple microphones, such as hearing aids, mobile phones, and cameras, multi-channel speech enhancement technology has attracted increasing attention in recent years. Traditional multi-channel speech enhancement techniques, such as the minimum variance distortionless response (MVDR) beamformer [1] and the multi-channel Wiener filter (MWF) [2, 3], are often based on ideal assumptions and tend to perform well with stationary noise but poorly with non-stationary noise.

Recent advancements in deep learning-based multi-channel speech enhancement techniques have shown remarkable performance, surpassing traditional methods by not relying on prior assumptions and demonstrating adaptability to real-world acoustic environments. A key development in this field is the DNN-based time-frequency (T-F) masking technique. This approach involves estimating T-F masks such as the ideal binary mask (IBM), ideal ratio mask (IRM), and complex ideal ratio mask (cIRM), inspired by the auditory masking phenomenon where weaker sounds become inaudible in the presence of louder sounds within a critical band [4].

The application of these masks can be categorized into two distinct approaches. The first approach uses the estimated mask to assist in calculating the speech and noise spatial covariance matrices (SCMs). In this context, Erdogan and his team utilize long short-term memory (LSTM) networks for T-F mask estimation for each channel, which is then used in computing minimum variance distortionless response (MVDR) beamforming [5]. Similarly, Cui and colleagues employ a bidirectional LSTM network to combat noise and reverberation [6], while Ni and his team integrate an LSTM neural network with a Model-Based EM Source Separation and Localization (MESSL) framework for T-F masking estimation and speech signal quality enhancement [7]. The second approach directly applies the estimated mask to the noisy signal captured by the reference microphone. A notable example of this method is seen in research where a Dense Frequency-Time Attentive Network (DeFT-AN) was introduced to estimate T-F masks, which are then directly applied to noisy speech for extracting the desired signals [8]. Additionally, a Multi-Cue Fusion Network (McNET) has been proposed for cIRM estimation [9]. Furthermore, researchers have explored the use of convolutional recurrent neural networks for T-F masking prediction [10], evaluating the performance of these two differing approaches.

Deep learning-based beamforming has emerged as a popular and effective method, demonstrating impressive performance in the realm of multi-channel speech enhancement [11]. In [12], the authors factorize the MVDR beamformer and estimate the factors using a DNN, thereby avoiding the need for SCM estimation. Ren et al. [13] directly add a beamforming operation at the end of their proposed network for real-time multi-channel speech enhancement. Li et al. [14] propose an embedding-and-beamforming paradigm that derives the beamforming weights during the embedding stage, thus allowing a filter-and-sum operation to be implemented in the beamforming stage. Similarly, [15] adopts a two-stage paradigm that utilizes two cascaded networks for beamforming filter estimation: the first network estimates the time-frequency (T-F) mask, while the second network is dedicated to estimating the beamforming filters. Motivated by the success of these methods, in this work, we focus on directly estimating the beamformer for multi-channel speech enhancement.

Whether it is the mask estimation network or the beamforming weights estimation network, most of them exploit the spatial and tempo-spectral information, which is key to performance improvements in multi-channel speech enhancement systems. A joint spatial and tempo-spectral non-linear filter (FT-JNF), proposed in [16], demonstrates the interdependency of three information sources (spatial, spectral, and temporal). For fusing different types of information, multiple cues fusion network [9] cascades four modules to exploit the full-band spatial, narrowband spatial, sub-band spectral, and full-band spectral information. In this work, our aim is to capture three types of high-resolution features. We investigate their individual contributions in effectively handling the tempo-spectral information of speech signals.

The U-Net architecture consists of an encoder for downsampling, a decoder for upsampling, and skip connections between down and up blocks. It was originally proposed for image segmentation tasks [17] but has also demonstrated excellent performance in speech enhancement tasks [18]. The feature maps in the deeper layers have higher resolution. In particular, some variations of U-Net extract features along the frequency dimension to model contextual information [19], which results in higher frequency resolutions in the deeper layers. In [20], the authors propose a U-Net that progressively reduces the time dimension layer by layer to achieve high temporal resolutions. Additionally, there is a channel-attention dense U-Net that captures high tempo-spectral resolution by performing down- and up-sampling along both the frequency and time dimensions [21].

The multi-resolution U-Net structure has been shown to be powerful for speech enhancement tasks. This type of network extracts features by performing both down-sampling and up-sampling processes. In this context, we present a novel approach for multi-channel speech enhancement: a time-frequency fusion model, which we have named TFFM. It is specifically designed to investigate the roles of features with high-frequency, high-temporal, and high tempo-spectral resolutions. The contributions of this work are outlined as follows:

  • We propose a time-frequency fusion model (TFFM) to aggregate local and global features, thereby effectively handling the tempo-spectral information of speech signals. TFFM is composed of three sequentially cascaded U-Nets, each tailored for a specific purpose. The first U-Net maintains the time dimension while extracting features along the frequency axis, enabling the exploration of high frequency resolution features combined with global temporal information. The second U-Net, in contrast, preserves the frequency dimension and focuses on extracting features with higher temporal resolution, thereby investigating the impact of high temporal resolution within the context of global frequency information. The third U-Net operates along both frequency and time dimensions, enhancing the resolution in both axes to facilitate the study of features with high tempo-spectral resolution.

  • To investigate the global and local dependencies within the feature maps of these three U-Nets, we consider changing the order of the third U-Net, which relies on the local tempo-spectral features. We also conduct an ablation study to examine the individual contributions of each U-Net layer to the overall performance on our dataset. Furthermore, to ensure a fairer comparison of TFFM, we test our proposed model on the L3DAS22 dataset, launched in 2022 specifically for 3D speech enhancement tasks.

The remainder of this paper is organized as follows: Section 2 reviews related work relevant to our study and discusses the innovations of our method. Section 3 presents the signal model of multi-channel speech enhancement and details the proposed model. Section 4 describes the datasets and experimental setup. In Section 5, the performance investigation of the proposed system is presented. In Section 6, we provide the conclusions of the paper.

2 Related works

In recent years, networks based on the multi-resolution U-Net structure, originally proposed for image segmentation tasks [17], have been experimentally demonstrated to be highly effective for speech enhancement tasks. Stoller et al. [22] describe an end-to-end time domain model that adapts the U-Net to the one-dimensional time domain, enabling the computation and combination of features at different time scales. Wave-U-Net was subsequently extended to recover the clean speech waveform [23]. Pandey et al. [24] propose a dense convolutional network (DCN) for directly predicting enhanced speech samples from noisy ones; this DCN integrates nested dense blocks and self-attention blocks within the U-Net architecture. The aforementioned methods exemplify various extensions of the U-Net structure within the time domain. Additionally, many novel networks based on the U-Net architecture have demonstrated excellent performance in the STFT (short-time Fourier transform) domain. The authors of [21] incorporate channel attention into the U-Net architecture, effectively simulating the process of beamforming. In [19], Wang et al. propose a novel network for speaker separation and dereverberation; it is based on the U-Net architecture, chosen for its ability to maintain local fine-grained structure. Koyama and Raj [15] exploit two cascaded U-Nets to combine the mask-based approach with the filter-based approach for multi-channel speech enhancement.

Spectral information is present in both the full-band and sub-band representations. In the past, many studies have exploited full-band spectral patterns in single-channel speech enhancement tasks. The FullSubNet, as proposed in [25], captures the global spectral context with a full-band model and attends to local spectral patterns with a sub-band model for single-channel speech enhancement; this method demonstrates that sub-bands are also informative. In multi-channel speech enhancement tasks, spatial information plays an essential role and is likewise present in both the full-band and sub-band. Yang et al. [9] fuse four modules to fully exploit spectral, temporal-dynamics, and spatial information, verifying the effectiveness of these modules. Quan and Li [26] extensively exploit spatial information to handle multi-channel joint speech tasks, developing interleaved narrow-band and cross-band blocks to exploit narrow-band and cross-band spatial information. Furthermore, [16] exploits full-band information in the first LSTM layer and narrow-band information in the second LSTM layer, improving performance by incorporating both global spectral and global temporal information. In [8], the authors use the F-transformer and T-conformer to manage spectral and temporal information, respectively.

The remarkable performance of these methods demonstrates that spectral and temporal information play a crucial role in speech enhancement tasks. Additionally, local and global tempo-spectral information contribute differently to handling speech signals. A multi-resolution U-Net produces feature maps with varying resolutions at each layer, and in this work we utilize these feature maps to investigate the contributions of local and global time-frequency features. The three cascaded U-Nets proposed in our study perform down- and up-sampling along different dimensions, resulting in features of varying resolutions. The first U-Net captures global temporal information while attending to local frequency information at higher resolution in the deeper layers. The second U-Net performs the opposite function, retaining global frequency information while capturing local temporal information at higher resolution in the deeper layers. The third U-Net focuses on local tempo-spectral information. Together, these three U-Nets aggregate local and global tempo-spectral information for multi-channel speech enhancement and show impressive performance compared to advanced methods.

3 Proposed method

3.1 Problem formulation

The multi-channel mixture of M microphones can be expressed as a summation of the target signal and noise signal. In the STFT domain, it can be written as:

$$\begin{aligned} {y}_m(f,t) & = {s}_m(f,t) + {z}_m(f,t) \nonumber \\ & = {s}_{m}^{d}(f,t) + {s}_{m}^{r}(f,t) + {z}_m(f,t) \nonumber \\ & = {s}_{m}^{d}(f,t) + {n}_m(f,t) \end{aligned}$$
(1)

where \(s_m(f,t)\) is the received speech, including the direct-path speech \({s}_{m}^{d}(f,t)\) and the reverberated speech \({s}_{m}^{r}(f,t)\), at the m-th microphone. \(z_m(f,t)\) denotes the noise signal at the m-th microphone, and \({n}_m(f,t)\) is the overall interference including noise and reverberation. For the sake of brevity, the frequency and time frame indexes, f and t, will be omitted in the following text. In this work, we aim to estimate the direct-path clean speech of the reference microphone from the mixture signals captured by the microphone array. Here, the first microphone is designated as the reference microphone.
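As a concrete illustration of this notation, the minimal sketch below forms the multi-channel STFT tensor from placeholder microphone signals; the window and hop values anticipate the analysis settings given later in Section 4.3, and the signals themselves are random stand-ins rather than real recordings.

```python
import torch

M, num_samples = 4, 3 * 16000                 # 4 microphones, 3 s at 16 kHz (placeholders)
y_time = torch.randn(M, num_samples)          # each row: y_m = s_m^d + s_m^r + z_m

n_fft, hop = 512, 256                         # 512-sample Hann window, 50% overlap
Y = torch.stft(y_time, n_fft=n_fft, hop_length=hop,
               window=torch.hann_window(n_fft), return_complex=True)
print(Y.shape)                                # (M, F, T) with F = n_fft // 2 + 1 = 257
```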

3.2 Proposed method

The overview of the TFFM is shown in Fig. 1. The input signals are transformed into complex features using the short-time Fourier transform (STFT), resulting in a three-dimensional tensor \(Y = [y_1,...,y_M] \in \mathbb {C}^{M\times F\times T}\), where F denotes the number of frequency bins and T denotes the number of frames. The network input is then formed by stacking the real (R) and imaginary (I) components of the STFT along the channel dimension. This input is fed into the first U-Net module, named the U-Net-F module, which preserves the time dimension and extracts features along the frequency dimension. With an increased resolution of the frequency dimension, the U-Net-F module captures more detailed frequency information while considering the global temporal information. The outputs of the first U-Net module then undergo down- and up-sampling along the time dimension while maintaining the frequency dimension: the second U-Net, named the U-Net-T module, captures higher temporal resolution to learn more detailed temporal information while considering the global frequency information. Next, the outputs are processed by the third U-Net, named the U-Net-TF module, which increases the resolution of both the frequency and time dimensions through down- and up-sampling along both dimensions. As a result, U-Net-TF can capture more abstract representations of time-frequency information. The output of the cascaded U-Nets then passes through a 2D convolutional layer to derive the beamforming weights (BF weights). These weights are applied to the noisy spectrum to extract the expected speech signal, and the resulting spectrum is finally transformed into the time-domain enhanced signal using the inverse short-time Fourier transform (ISTFT).
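To make the data flow concrete, the following is a minimal PyTorch schematic of this pipeline. It is a sketch, not the authors' implementation: the three U-Net modules are replaced by identity placeholders (their actual structure is described below and configured in Table 1), and the (batch, channel, time, frequency) layout is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class TFFMSketch(nn.Module):
    def __init__(self, mics=4):
        super().__init__()
        # Placeholders for the three cascaded U-Nets (U-Net-F, U-Net-T, U-Net-TF).
        self.unet_f = nn.Identity()    # would down/up-sample frequency, keep time
        self.unet_t = nn.Identity()    # would down/up-sample time, keep frequency
        self.unet_tf = nn.Identity()   # would down/up-sample both dimensions
        # 2D conv producing the real/imaginary parts of the beamforming weights.
        self.weight_conv = nn.Conv2d(2 * mics, 2 * mics, kernel_size=1)

    def forward(self, Y):                         # Y: complex STFT, (B, M, T, F)
        x = torch.cat([Y.real, Y.imag], dim=1)    # stack R/I -> (B, 2M, T, F)
        x = self.unet_f(x)                        # high frequency resolution, global time
        x = self.unet_t(x)                        # high temporal resolution, global frequency
        x = self.unet_tf(x)                       # high tempo-spectral resolution
        return self.weight_conv(x)                # (B, 2M, T, F): Re/Im of the BF weights
```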

The beamforming operation involves complex multiplication of the input features with beamforming weights, followed by summing along the channel dimension. This process can be demonstrated as follows:

$$\begin{aligned} \hat{s}\ = \sum _{m=0}^{M-1} {w_m} \odot {y_m} \end{aligned}$$
(2)

where \({w_m}\in \mathbb {C}^{B\times T\times F}\) denotes the beamforming weight, B is the batch size, and \(\odot\) represents the element-wise multiplication operation. In this work, we adopt a 2D convolutional layer to extract the real and imaginary components of the beamforming weights. These components are shaped identically to the input features, denoted as [B, 2M, T, F], and the output \(\hat{s}\in \mathbb {C}^{B\times T\times F}\) denotes the enhanced complex spectrum.
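A possible implementation of Eq. (2) and of the weight extraction described above is sketched below; the shapes follow the [B, 2M, T, F] convention used in the text, and the `torch.istft` comment only indicates how the waveform would be recovered.

```python
import torch

def apply_beamforming(w_ri: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """w_ri: (B, 2M, T, F) real tensor holding the Re/Im parts of the weights;
       Y:    (B, M, T, F) complex noisy STFT.  Returns the (B, T, F) enhanced spectrum."""
    w_re, w_im = w_ri.chunk(2, dim=1)          # each (B, M, T, F)
    w = torch.complex(w_re, w_im)
    return (w * Y).sum(dim=1)                  # Eq. (2): element-wise product, sum over mics

# Recovering the waveform (the spectrum must be (B, F, T) for torch.istft):
# s_time = torch.istft(s_hat.transpose(-2, -1), n_fft=512, hop_length=256,
#                      window=torch.hann_window(512))
```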

Fig. 1

Overview of the TFFM. The proposed TFFM includes three U-Net modules: the U-Net-F module, which preserves the time dimension while extracting features along the frequency dimension; the U-Net-T module, which preserves the frequency dimension while extracting features along the time dimension; and the U-Net-TF module, which performs both down-sampling and up-sampling across the frequency and time dimensions

Fig. 2

The diagram of the basic U-Net. The proposed TFFM comprises three U-Net modules, all of which are based on this common encoder-decoder structure

Fig. 3

The diagram illustrates a Dense block that consists of five blocks, each comprising a 2D convolution, an ELU activation function, and an IN step

Multi-resolution U-Net can preserve fine-grained local spectral structure due to the use of skip connections. These connections allow information from early layers to be directly passed to later layers [17]. The three U-Net modules employed in this study are all based on a common encoder-decoder structure, as shown in Fig. 2. The encoder’s first layer includes a 2D convolutional layer (Conv2d) followed by a DenseBlock. The subsequent four middle layers share a consistent architecture, each comprising a 2D convolutional layer, an ELU activation function [27], instance normalization (IN), and a DenseBlock. The encoder’s final layer is made up of a 2D convolutional layer, an ELU activation function, and instance normalization. The decoder’s architecture mirrors that of the encoder, with the key differences being the reversed order of the DenseBlock and the use of a 2D deconvolution layer (DConv2d) instead of a 2D convolution layer. Importantly, each layer in the decoder combines the output from the previous layer with the output from the corresponding layer in the encoder as its input. The local and global tempo-spectral information is fused by cascading these three U-Net modules.
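A sketch of one middle encoder layer under this description is given below. It is only illustrative: the DenseBlock is left as a placeholder (a sketch of it follows the Fig. 3 description), the kernel size (4, 3) follows Section 5.1, and the padding is an assumption.

```python
import torch.nn as nn

def encoder_layer(in_ch, out_ch, stride, kernel=(4, 3)):
    """One middle encoder layer: Conv2d -> ELU -> InstanceNorm -> DenseBlock (placeholder)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=kernel, stride=stride, padding=(1, 1)),
        nn.ELU(),
        nn.InstanceNorm2d(out_ch),
        nn.Identity(),   # stands in for the DenseBlock sketched after the Fig. 3 description
    )

# The matching decoder layer would mirror this with nn.ConvTranspose2d (DConv2d) and take
# the concatenation of the previous decoder output and the encoder skip connection as input.
```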

These three U-Net modules extract features along different dimensions by setting the parameters of the Conv2d layers, which are detailed in Table 1 in Section 4. Specifically, the encoder in the first U-Net consists of six convolutional layers, each employing a stride of (1,2). This stride setting effectively halves the dimensions along the frequency axis while preserving the time dimension layer by layer. In contrast, the encoder in the second U-Net uses a stride of (2,1) in each layer, which progressively reduces the dimensions along the time axis while keeping the frequency dimension constant. The encoder in the third U-Net features six convolutional layers, each with a stride of (2,2), halving the dimensions along both the time axis and the frequency axis layer by layer.
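The following shape-only trace illustrates how the three stride settings shrink the feature maps over six layers. The sizes are approximate (kernel and padding effects are ignored), and F = 257 and T = 200 are placeholder values, not the exact configuration of Table 1.

```python
import math

# Per-layer downsampling factors along (time, frequency), as described above.
STRIDES = {"U-Net-F": (1, 2), "U-Net-T": (2, 1), "U-Net-TF": (2, 2)}

def trace_shapes(name, T=200, F=257, depth=6):
    shapes = [(T, F)]
    for _ in range(depth):
        s_t, s_f = STRIDES[name]
        T, F = math.ceil(T / s_t), math.ceil(F / s_f)
        shapes.append((T, F))
    return shapes

for name in STRIDES:
    print(name, trace_shapes(name))
# U-Net-F:  F halves each layer, T stays fixed
# U-Net-T:  T halves each layer, F stays fixed
# U-Net-TF: both dimensions halve each layer
```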

Dense blocks promote feature reuse and strengthen feature propagation, making them effective in speech enhancement tasks [28]. The dense block shown in Fig. 3 consists of five blocks, each comprising a 2D convolution, an ELU activation function, and an IN step. The input to each block is a concatenation of the outputs from all previous blocks in the dense block as well as the initial input. The output dimensions of each block are the same as the input dimensions of the first block.
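A minimal sketch of this dense block is shown below, assuming each of the five units follows the Conv2d, ELU, IN order of Fig. 3 and that all units keep the channel count of the block input; any details not stated above are assumptions.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, channels, num_units=5):
        super().__init__()
        self.units = nn.ModuleList()
        for i in range(num_units):
            self.units.append(nn.Sequential(
                # input = initial input plus outputs of all previous units, concatenated
                nn.Conv2d(channels * (i + 1), channels, kernel_size=3, stride=1, padding=1),
                nn.ELU(),
                nn.InstanceNorm2d(channels),
            ))

    def forward(self, x):
        inputs = [x]
        for unit in self.units:
            out = unit(torch.cat(inputs, dim=1))   # concatenate all previous outputs
            inputs.append(out)
        return out                                 # same shape as the block's input
```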

3.3 Loss function

We train our model with the complex compressed loss proposed in [29], which is shown as follows:

$$\begin{aligned} {\mathscr {L}} = \frac{1}{T \times F}\left(\lambda \sum _{f,t} {|{s^c-\hat{s}^c} |^2} + (1-\lambda )\sum _{f,t} {|{|s|^c-|\hat{s} |^c} |^2}\right) \end{aligned}$$
(3)

where s represents the clean complex spectrum from the reference microphone, and \(\hat{s}\) represents the enhanced complex spectrum, the linear weight \(\lambda = 0.3\), and the compression factor \(c=0.3\). The compressed spectrum \(s^c\) is given by:

$$\begin{aligned} s^c = |s|^c \frac{s}{\max (|s|,\mu )} \end{aligned}$$
(4)

where \(\mu\) is a very small constant. For more details about this loss function, please refer to [29].
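A sketch of Eqs. (3)-(4) in PyTorch is given below, with \(\lambda = 0.3\) and \(c = 0.3\) as stated above; the small constant \(\mu\) is represented by `eps`, whose exact value is an assumption.

```python
import torch

def compressed_loss(s, s_hat, lam=0.3, c=0.3, eps=1e-8):
    """s, s_hat: clean and enhanced complex spectra of shape (B, F, T)."""
    mag, mag_hat = s.abs(), s_hat.abs()
    # Eq. (4): magnitude-compressed complex spectra
    s_c = (mag ** c) * s / torch.clamp(mag, min=eps)
    s_hat_c = (mag_hat ** c) * s_hat / torch.clamp(mag_hat, min=eps)
    # Eq. (3): weighted sum of complex and magnitude terms, normalized by T * F
    complex_term = (s_c - s_hat_c).abs().pow(2).sum(dim=(-2, -1))
    mag_term = (mag ** c - mag_hat ** c).pow(2).sum(dim=(-2, -1))
    F, T = s.shape[-2], s.shape[-1]
    return ((lam * complex_term + (1 - lam) * mag_term) / (T * F)).mean()
```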

4 Experiments

4.1 Datasets

To evaluate the proposed approach, we utilize two datasets: the L3DAS22 multi-channel speech enhancement challenge dataset (footnote 1) [30], a public dataset launched in 2022, and the spatial DNS (SPA-DNS) dataset, which is simulated using the Python toolbox Pyroomacoustics [31]. Descriptions of these two datasets are provided below.

  • L3DAS22 multi-channel speech enhancement challenge dataset: The L3DAS22 dataset, provided as part of the ICASSP 2022 challenges, was simulated using two first-order A-format Ambisonic arrays in a noisy-reverberant office environment. One microphone array was fixed at the center of the room, while the other was placed 20 cm away from the center. Both microphones were positioned at a height of 1.3 m. The dry speech and noise signals were sourced from the Librispeech [32] and FSD50K [33] datasets, respectively. The generated A-format Ambisonic noisy mixtures are converted into their B-format counterparts, which includes applying a prefilter, a mixing matrix, and a post-filter. The signal-to-noise ratio (SNR) varied from 6 dBFS to 16 dBFS. The simulated rooms have approximate dimensions of 6 m (length) × 5 m (width) × 3 m (height). The L3DAS22 dataset comprises 37,398 mixtures (\(\sim\)81.3 h) for training, 2362 mixtures (\(\sim\)3.9 h) for validation, and 2189 mixtures (\(\sim\)3.5 h) for testing.

  • The SPA-DNS dataset: We use the Python toolbox Pyroomacoustics [31] to simulate the SPA-DNS dataset. The clean speech and noise files are sourced from the DNS Challenge 2020 corpus [34]. The simulated rooms vary in size, ranging from \(5 \times 5 \times 3\ \textrm{m}^3\) to \(10 \times 10 \times 4\ \textrm{m}^3\), with RT60 values between 0.2 and 1.2 s, as shown in Fig. 4. Additionally, we randomly place a circular microphone array with four microphones and a radius of 10 cm within the room. Two sources, one for speech and one for noise, are also randomly positioned inside the room. The distance between these two sources is kept between 0.75 m and 2 m, and all points are set at least 0.5 m away from the walls. For training, we generate 85k utterances of 3 to 6 s, for validation 4.4k utterances of 3 to 10 s, and for testing 2.7k utterances, also of 3 to 10 s. The SNR range is from − 5 to 10 dB. A similar approach to generating multi-channel datasets was previously used in [35]; a minimal simulation sketch is given after this list.
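The room simulation can be reproduced roughly with Pyroomacoustics [31], as sketched below. The room size, RT60, array position, source positions, and signals are placeholders within the ranges stated above, not the exact generation script, and no scaling to a target SNR is performed here.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
room_dim = [6.0, 5.0, 3.0]                          # within the 5x5x3 to 10x10x4 m^3 range
rt60 = 0.6                                          # within the 0.2-1.2 s range
e_abs, max_order = pra.inverse_sabine(rt60, room_dim)
room = pra.ShoeBox(room_dim, fs=fs, materials=pra.Material(e_abs), max_order=max_order)

# 4-microphone circular array, radius 10 cm (center and height are example values)
mic_xy = pra.circular_2D_array([3.0, 2.5], 4, 0.0, 0.10)
mic_xyz = np.vstack([mic_xy, 1.5 * np.ones(4)])
room.add_microphone_array(pra.MicrophoneArray(mic_xyz, fs))

# Speech and noise sources roughly 1.6 m apart and at least 0.5 m from the walls
speech, noise = np.random.randn(3 * fs), np.random.randn(3 * fs)   # placeholder signals
room.add_source([2.0, 2.0, 1.5], signal=speech)
room.add_source([3.2, 3.0, 1.5], signal=noise)

room.simulate()
mixture = room.mic_array.signals                    # shape: (4, num_samples)
```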

Fig. 4

The configuration of shoebox-like rooms

4.2 Evaluation metrics

On the L3DAS22 dataset, the performance of our work was evaluated and compared using three key metrics. The primary evaluation metric for the L3DAS22 3D speech enhancement task is a combination of short-time objective intelligibility (STOI) and word error rate (WER):

$$\begin{aligned} \text {Task1Metric} = \frac{\text {STOI} + (1 - \text {WER})}{2} \end{aligned}$$
(5)

Here, STOI is the metric for estimating the intelligibility of the output speech signal [36], and WER is used to assess the effects of the enhancement for speech recognition purposes. The WER is computed by comparing the transcription of the enhanced speech with that of the dry speech, both of which are decoded using a pre-trained wav2vec2 ASR (Automatic Speech Recognition) model [37]. Since both STOI and WER scores fall within the range of [0, 1], the composite metric also lies in this range.
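As a quick worked example of Eq. (5), a minimal implementation is shown below; the STOI and WER values are assumed to have been computed beforehand and to lie in [0, 1].

```python
def task1_metric(stoi: float, wer: float) -> float:
    """Composite L3DAS22 Task 1 metric of Eq. (5); higher is better."""
    return (stoi + (1.0 - wer)) / 2.0

# Example: STOI = 0.90 and WER = 0.10 give Task1Metric = (0.90 + 0.90) / 2 = 0.90.
```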

To evaluate the performance of this work on the SPA-DNS dataset, we consider the following metrics: Perceptual Evaluation of Speech Quality (PESQ) [38], which assesses the objective speech quality; STOI, which evaluates the objective speech intelligibility; and the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) [39]. The PESQ score ranges from − 0.5 to 4.5, while the STOI score ranges from 0 to 1, with higher scores indicating better quality and intelligibility for both metrics.

4.3 Experimental settings

All data in this study are sampled at 16 kHz. For transforming the data into the STFT domain, a 512-sample Hann window with 50% overlap is used. The three U-Nets implemented in this work share the same structure, and their configurations are outlined in Table 1. The decoders mirror the configuration of the encoders. Each dense block nested within the U-Net comprises five layers, with all convolutional layers in the dense block having a kernel size of 3 \(\times\) 3, a stride of 1 \(\times\) 1, and padding of 1 \(\times\) 1. The network is trained using the Adam optimizer [40], with a learning rate of 0.001 and a batch size of eight input sequences. Our model is trained using the same loss function proposed in [41], and the number of training epochs is set to 150. The number of parameters of the proposed TFFM is 5.1M.
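For reference, these settings correspond roughly to the following analysis and optimizer configuration; this is only a sketch, and the model here is a trivial placeholder rather than the 5.1M-parameter TFFM.

```python
import torch

# STFT analysis settings: 16 kHz sampling, 512-sample Hann window, 50% overlap.
fs, n_fft, hop = 16000, 512, 256
window = torch.hann_window(n_fft)

# Optimizer and training settings from Section 4.3 (placeholder model).
model = torch.nn.Conv2d(8, 8, kernel_size=(4, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch_size, num_epochs = 8, 150
```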

Table 1 The configuration of six Conv2D layers within the encoder part of each U-Net module

5 Experimental results and analysis

5.1 Ablation study

In this section, we conduct an ablation study on the SPA-DNS dataset to examine the individual contributions of each module as described above. The denoising system comprising solely the first U-Net module is denoted as U-Net-F. The configuration including the second U-Net module is referred to as U-Net-T, and the one with the third U-Net module is termed U-Net-TF. Furthermore, we experiment with different arrangements of these three U-Net modules to evaluate the influence of local time-frequency information. The sequence “F,” “T,” “TF” corresponds to the arrangement of the U-Net-F module, U-Net-T module, and U-Net-TF module, respectively. The efficacy of these configurations was assessed using the test dataset of the SPA-DNS dataset, with SNR values ranging from − 5 dB to 10 dB.

Given that the proposed networks are based on a standard U-Net architecture, we have conducted an analysis focusing on the depth of encoders and kernel size within the foundational U-Net. Specifically, we have chosen to evaluate these parameters using the U-Net-TF model. As demonstrated in Fig. 5, there is a noticeable reduction in loss associated with an increase in the number of layers. After considering both the parameters and performance outcomes, we chose the depth of encoders to be 6. Regarding the kernel size, the performance differences among various sizes were relatively minor; consequently, we have chosen a convolution filter size of (4,3).

Fig. 5

To assess the influence of varying configurations, we analyzed the validation loss across epochs for the U-Net-TF model on the training dataset, specifically focusing on (a) variations in the number of layers and (b) alterations in kernel size

Table 2 presents the average PESQ, STOI, and SI-SDR results of the ablation study. It can be seen that all three U-Net modules contribute to the improvement of speech enhancement performance. We also observe that U-Net-F yields higher improvement scores than U-Net-T and U-Net-TF, with U-Net-TF performing the worst. This difference could be attributed to U-Net-F retaining global temporal information in its higher-frequency-resolution layers, which is more beneficial than high temporal resolution features with global frequency information or high tempo-spectral resolution features. These results underscore the importance of leveraging global information for speech enhancement. The poor performance of U-Net-TF could be due to its exclusive reliance on local time and local frequency dimensions, without incorporating any global information. Additionally, U-Net-F-T-TF performs much better than U-Net-TF-F-T, which highlights the importance of the position of the U-Net-TF module. If the U-Net-TF module is placed first, the input STFT loses global information in the deeper layers, so no global information is available in the subsequent processing steps. If it is placed last, however, it can not only exploit the global information captured by the first two U-Net modules but also further utilize more abstract representations of tempo-spectral features at higher resolution, which leads to improved performance. Overall, the results in Table 2 confirm the effectiveness of TFFM in fusing local and global features and show the impact of the different U-Net modules in handling the tempo-spectral information of speech signals.

Table 2 Experimental results of ablation study. The bold values show the best results
Table 3 Experimental results of various models on L3DAS22 dataset. The bold values show the best results

5.2 Comparison to the baseline models

In this section, we conduct a comparative analysis of our proposed multi-channel speech enhancement method, TFFM, against other state-of-the-art models, utilizing both the L3DAS22 and SPA-DNS datasets. The results of this comparison, specifically for the L3DAS22 dataset, are detailed in Table 3. The data presented in Table 3 were previously reported in [44], where the authors proposed a multi-channel complex domain denoising network to make full use of spatial information and neural network-based masking estimation.

In both the 4-channel and 8-channel results, our proposed TFFM demonstrates competitive performance. We have compared our model with four others in a 4-channel setup, including both causal and non-causal models. Compared to the time domain speech enhancement system, FasNet, TFFM exhibits competitive performance (+0.16 in Task1Metric). Moreover, TFFM outperforms MIMO Unet (+0.098 in Task1Metric), which serves as a challenge baseline, and EaBNet, a neural beamformer, achieving better results with a more compact model design (+0.071 in Task1Metric). Although the non-causal version of Spatial-DCCRN with 8 channels achieved the best metric score in [44], our proposed TFFM still shows performance improvements in the 8-channel setup. The outstanding performance of our proposed TFFM underscores its effective fusion of features.

For a fair comparison, we compare four non-causal advanced models with TFFM on the SPA-DNS dataset, as our proposed TFFM is also non-causal. These models are re-trained on the SPA-DNS dataset by us, and the results of this comparison are detailed in Table 4. These non-causal models are described below:

  • FasNet-TAC (footnote 2): A multi-channel speech enhancement system operating in the time domain. This approach fully utilizes information from all microphones, resulting in a significant improvement in separation performance [45].

  • EaBNet (footnote 3): A neural beamformer paradigm, proposed in [14], that estimates beamforming weights implicitly. This approach incorporates two modules: an EM module designed to utilize spatial-spectral information and a BM module designed to further suppress residual noise.

  • FT-JNF (footnote 4): A DNN-based joint non-linear filter [16] that serves as a strong mask estimation network. This approach exploits the interdependencies between spatial and tempo-spectral information, demonstrating the superiority of FT-JNF.

  • TFFM (mapping): A model with the same TFFM configuration that does not estimate the weights of a neural beamformer but rather directly estimates the clean spectrogram.

Table 4 Experimental results of various models on SPA-DNS dataset. The bold values show the best results
Fig. 6

Spectrogram example of (a) noisy, (b) clean, (c) TFFM, (d) FasNet-TAC, (e) EaBNet, and (f) FT-JNF. The red circles emphasize the areas where TFFM outperforms its counterparts

In this section, Table 4 details the number of parameters for each advanced baseline, the number of multiply-accumulate operations (MACs) per second, the real-time factor (RTF) measured on a GPU (Nvidia 3090), and their respective evaluation results. We utilize the Python toolbox torcheval to calculate the MACs. Figure 6 showcases the spectrograms of noisy, clean, and enhanced speech produced by these models. Compared to the time domain network FasNet-TAC, TFFM demonstrates superior speech enhancement performance; the spectrograms of FasNet-TAC exhibit poorer reconstruction in both the low-frequency and high-frequency regions relative to the clean versions. Compared to EaBNet, which has fewer parameters than TFFM, our proposed method exhibits significant performance improvements (+0.542 in PESQ, +0.062 in STOI, +5.506 in SI-SDR); the spectrograms of EaBNet show better results in the high-frequency region but limited improvement in the low-frequency region. Similarly, while FT-JNF stands out with its small parameter size and decent performance, our proposed TFFM improves the SI-SDR score by 4.141 dB while maintaining an acceptable parameter size (5.1M). Moreover, TFFM outperforms TFFM (mapping) by +0.07 in PESQ, +0.01 in STOI, and +1.28 in SI-SDR. This improvement may be attributed to our model's comprehensive utilization of both local and global tempo-spectral features at a higher resolution, which enhances its ability to accurately estimate the weights of a neural beamformer. The differences between our results and the findings reported in [16] could be attributed to these factors or to the different types of noise used. Furthermore, TFFM effectively restores both the low-frequency and high-frequency regions, demonstrating that the time-frequency fusion across the three U-Nets works well: TFFM fully exploits local and global tempo-spectral features at higher resolution, thereby achieving excellent performance in restoring speech signals in both regions. In conclusion, the results demonstrate that TFFM consistently achieves the highest scores, underscoring the effectiveness of our proposed approach.

6 Conclusion

In this work, we propose a novel method called TFFM, which is based on time-frequency U-Nets, for multi-channel speech enhancement. TFFM utilizes three cascaded U-Nets to capture three types of high-resolution features, and the ablation study demonstrates the individual contributions of each module proposed in this work. We find that all three U-Net modules contribute to the improvement of speech enhancement performance and that the U-Net-F module contributes the most, as all its layers contain global temporal information, which benefits the reconstruction of the speech signal. Additionally, we show the importance of the order of the three modules. Experimental results reveal that TFFM effectively handles the tempo-spectral information of speech signals and demonstrates performance that is either superior to or comparable with other state-of-the-art approaches.

Availability of data and materials

The datasets generated during and/or analyzed during the current study are available in the L3DAS22 Challenge repository [30].

Notes

  1. https://www.l3das.com/icassp2022/

  2. https://github.com/yluo42/TAC

  3. https://github.com/Andong-Li-speech/EaBNet

  4. https://github.com/sp-uhh/deep-non-linear-filter

Abbreviations

DNN:

Deep neural networks

TFFM:

Time-frequency fusion model

MVDR:

Minimum variance distortionless response

MWF:

Multi-channel Wiener filter

T-F:

Time-frequency

IBM:

Ideal binary mask

IRM:

Ideal ratio mask

cIRM:

Complex ideal ratio mask

SCMs:

Speech and noise spatial covariance matrices

LSTM:

Long short-term memory

MESSL:

Model-Based EM Source Separation and Localization

DeFT-AN:

Dense Frequency-Time Attentive Network

McNet:

Multiple cues fusion network

FT-JNF:

Joint spatial and tempo-spectral non-linear filter

DCN:

Dense convolutional network

STFT:

Short-time Fourier transform

R:

Real

I:

Imaginary

Conv2d:

2D convolutional layer

IN:

Instance normalization

DConv2d:

2D deconvolution layer

SPA-DNS dataset:

Spatial DNS dataset

SNR:

Signal-to-noise ratio

STOI:

Short-time objective intelligibility

WER:

Word error rate

ASR:

Automatic Speech Recognition

PESQ:

Perceptual Evaluation of Speech Quality

SI-SDR:

Scale-invariant signal-to-distortion ratio

References

  1. J. Chen, J. Benesty, Y. Huang, A minimum distortion noise reduction algorithm with multiple microphones. IEEE Trans. Audio Speech Lang. Process. 16(3), 481–493 (2008)


  2. A. Spriet, M. Moonen, J. Wouters, Robustness analysis of multichannel wiener filtering and generalized sidelobe cancellation for multimicrophone noise reduction in hearing aid applications. IEEE Trans. Speech Audio Process. 13(4), 487–503 (2005)


  3. B. Cornelis, M. Moonen, J. Wouters, Performance analysis of multichannel wiener filter-based noise reduction in hearing aids under second order statistics estimation errors. IEEE Trans. Audio Speech Lang. Process. 19(5), 1368–1381 (2010)


  4. D. Wang, Time-frequency masking for speech separation and its potential for hearing aid design. Trends Amplification. 12(4), 332–353 (2008)

  5. H. Erdogan, J.R. Hershey, S. Watanabe, M.I. Mandel, J. Le Roux, Improved MVDR beamforming using single-channel mask prediction networks. Interspeech, pp. 1981–1985 (2016)

  6. X. Cui, Z. Chen, F. Yin, Multi-objective based multi-channel speech enhancement with bilstm network. Appl. Acoust. 177, 107927 (2021)

  7. Z. Ni, F. Grèzes, V.A. Trinh, M.I. Mandel, Improved MVDR beamforming using LSTM speech models to clean spatial clustering masks (2020). arXiv preprint http://arxiv.org/abs/2012.02191

  8. D. Lee, J.W. Choi, Deft-an: Dense frequency-time attentive network for multichannel speech enhancement. IEEE Signal. Process. Lett. 30, 155–159 (2023)


  9. Y. Yang, C. Quan, X. Li, in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mcnet: Fuse multiple cues for multichannel speech enhancement (IEEE, 2023), pp. 1–5

  10. S. Chakrabarty, E.A. Habets, Time-frequency masking based online multi-channel speech enhancement with convolutional recurrent neural networks. IEEE J. Sel. Top. Signal Process. 13(4), 787–799 (2019)


  11. X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M.L. Seltzer, G. Chen, Y. Zhang, M. Mandel, D. Yu, in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Deep beamforming networks for multi-channel speech recognition (IEEE, 2016), pp. 5745–5749

  12. H. Kim, K. Kang, J.W. Shin, Factorized MVDR deep beamforming for multi-channel speech enhancement. IEEE Signal Proc. Lett. 29, 1898–1902 (2022)


  13. X. Ren, X. Zhang, L. Chen, X. Zheng, C. Zhang, L. Guo, B. Yu, A causal U-Net based neural beamforming network for real-time multi-channel speech enhancement. Interspeech, pp. 1832–1836 (2021)

  14. A. Li, W. Liu, C. Zheng, X. Li, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Embedding and beamforming: All-neural causal beamformer for multichannel speech enhancement (IEEE, 2022), pp. 6487–6491

  15. Y. Koyama, B. Raj, W-Net BF: DNN-based beamformer using joint training approach (2019). arXiv preprint  arXiv:1910.14262

  16. K. Tesch, T. Gerkmann, Insights into deep non-linear filters for improved multi-channel speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 31, 563–575 (2022)


  17. O. Ronneberger, P. Fischer, T. Brox, in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, U-Net: convolutional networks for biomedical image segmentation (Springer, 2015), pp. 234–241

  18. M.T. Ho, J. Lee, B.K. Lee, D.H. Yi, H.G. Kang, A cross-channel attention-based wave-u-net for multi-channel speech enhancement. Interspeech, pp. 4049–4053 (2020)

  19. Z.Q. Wang, P. Wang, D. Wang, Multi-microphone complex spectral mapping for utterance-wise and continuous speech separation. IEEE Trans. Audio Speech Lang. Process. 29, 2001–2014 (2021)


  20. X. Xiang, X. Zhang, H. Chen, A nested U-Net with self-attention and dense connectivity for monaural speech enhancement. IEEE Sig. Process. Lett. 29, 105–109 (2021)


  21. B. Tolooshams, R. Giri, A.H. Song, U. Isik, A. Krishnaswamy, in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Channel-attention dense U-Net for multichannel speech enhancement (IEEE, 2020), pp. 836–840

  22. D. Stoller, S. Ewert, S. Dixon, Wave-U-Net: A multi-scale neural network for end-to-end audio source separation (2018). arXiv preprint arXiv:1806.03185

  23. H. Lee, H.Y. Kim, W.H. Kang, J. Kim, N.S. Kim, in Proc. Interspeech 2019, End-to-end multi-channel speech enhancement using inter-channel time-restricted attention on raw waveform (2019), pp. 4285–4289. https://doi.org/10.21437/Interspeech.2019-2397

  24. A. Pandey, D. Wang, Dense cnn with self-attention for time-domain speech enhancement. IEEE Trans. Audio Speech Lang. Process. 29, 1270–1279 (2021)


  25. X. Hao, X. Su, R. Horaud, X. Li, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Fullsubnet: A full-band and sub-band fusion model for real-time single-channel speech enhancement (IEEE, 2021), pp. 6633–6637

  26. C. Quan, X. Li, Spatialnet: Extensively learning spatial information for multichannel joint speech separation, denoising and dereverberation (2023). arXiv preprint arXiv:2307.16516

  27. D.A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (elus) (2015). arXiv preprint arXiv:1511.07289

  28. G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks. CVPR. pp. 4700–4708 (2017)

  29. Z. Zhang, S. Xu, X. Zhuang, Y. Qian, M. Wang, Dual branch deep interactive unet for monaural noisy-reverberant speech enhancement. Appl. Acoust. 212, 109574 (2023)

  30. E. Guizzo, C. Marinoni, M. Pennese, X. Ren, X. Zheng, C. Zhang, B. Masiero, A. Uncini, D. Comminiello, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), L3DAS22 challenge: learning 3D audio sources in a real office environment (IEEE, 2022), pp. 9186–9190

  31. R. Scheibler, E. Bezzam, I. Dokmanić, in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), Pyroomacoustics: A python package for audio room simulation and array processing algorithms (IEEE, 2018), pp. 351–355

  32. V. Panayotov, G. Chen, D. Povey, S. Khudanpur, in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), LibriSpeech: An ASR corpus based on public domain audio books (IEEE, 2015), pp. 5206–5210

  33. E. Fonseca, X. Favory, J. Pons, F. Font, X. Serra, FSD50K: an open dataset of human-labeled sound events. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 829–852 (2021)


  34. C.K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, et al., The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results (2020). arXiv preprint arXiv:2005.13981

  35. A. Pandey, B. Xu, A. Kumar, J. Donley, P. Calamia, D. Wang, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), TPARN: Triple-path attentive recurrent network for time-domain multichannel speech enhancement (IEEE, 2022), pp. 6497–6501

  36. C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, in 2010 IEEE international conference on acoustics, speech and signal processing, A short-time objective intelligibility measure for time-frequency weighted noisy speech (IEEE, 2010), pp. 4214–4217

  37. A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 33, 12449–12460 (2020)

  38. A.W. Rix, J.G. Beerends, M.P. Hollier, A.P. Hekstra, in 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221), vol. 2, Perceptual evaluation of speech quality (PESQ)-A new method for speech quality assessment of telephone networks and codecs (IEEE, 2001), pp. 749–752

  39. J. Le Roux, S. Wisdom, H. Erdogan, J.R. Hershey, in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), SDR–half-baked or well done? (IEEE, 2019), pp. 626–630

  40. D.P. Kingma, J. Ba, Adam: A method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980

  41. S. Braun, I. Tashev, in International Conference on Speech and Computer, Data augmentation and loss normalization for deep noise suppression (Springer, 2020), pp. 79–86

  42. Y. Luo, C. Han, N. Mesgarani, E. Ceolini, S.C. Liu, in 2019 IEEE automatic speech recognition and understanding workshop (ASRU), FaSNet: low-latency adaptive beamforming for multi-microphone audio processing (IEEE, 2019), pp. 260–267

  43. X. Ren, L. Chen, X. Zheng, C. Xu, X. Zhang, C. Zhang, L. Guo, B. Yu, in 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), A neural beamforming network for B-format 3D speech enhancement and recognition (IEEE, 2021), pp. 1–6

  44. L. Shubo, Y. Fu, J. Yukai, L. Xie, W. Zhu, W. Rao, Y. Wang, in 2022 IEEE Spoken Language Technology Workshop (SLT), Spatial-DCCRN: DCCRN equipped with frame-level angle feature and hybrid filtering for multi-channel speech enhancement (IEEE, 2023), pp. 436–443

  45. Y. Luo, Z. Chen, N. Mesgarani, T. Yoshioka, End-to-end microphone permutation and number invariant multi-channel speech separation (ICASSP, 2019), pp. 6394–6398


Acknowledgements

The authors are very grateful to the editors and anonymous reviewers for their guidance and useful suggestions.

Funding

This research was supported by the National Natural Science Foundation of China under Grant No. 62176102, the National Natural Science Foundation of China under Grant No. 62276076, the Natural Science Foundation of Guangdong Province under Grant No. 2020B1515120, and the Joint Fund of Basic and Applied Basic Research Fund of Guangdong Province under Grant No. 2023A1515140109.

Author information

Authors and Affiliations

Authors

Contributions

X.Z. conceptualized the study, implemented the codebase, and wrote the initial draft of the manuscript. S.X. ran the experiments and further refined the details of the model. X.Z. and S.X. revised the manuscript. M.W. supervised the work. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mingjiang Wang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Zeng, X., Xu, S. & Wang, M. A time-frequency fusion model for multi-channel speech enhancement. J AUDIO SPEECH MUSIC PROC. 2024, 47 (2024). https://doi.org/10.1186/s13636-024-00367-1


Keywords