Deep encoder/decoder dual-path neural network for speech separation in noisy reverberation environments
EURASIP Journal on Audio, Speech, and Music Processing volume 2023, Article number: 41 (2023)
Abstract
In recent years, speaker-independent, single-channel speech separation has made significant progress with the development of deep neural networks (DNNs). However, separating the speech of each speaker of interest from an environment that includes the speech of other speakers, background noise, and room reverberation remains challenging. To address this problem, a speech separation method for noisy reverberant environments is proposed. Firstly, a time-domain end-to-end network structure, the deep encoder/decoder dual-path neural network, is introduced for speech separation. Secondly, to keep the model from falling into a local optimum during training, a loss function called the stretched optimal scale-invariant signal-to-noise ratio (SOSISNR) is proposed, inspired by the scale-invariant signal-to-noise ratio (SISNR). In addition, to bring training closer to the human auditory system, the joint loss function is extended based on short-time objective intelligibility (STOI). Thirdly, an alignment operation is proposed to reduce the influence of the time delay caused by reverberation on separation performance. With these methods combined, subjective and objective evaluation metrics show that the proposed method achieves better separation performance in complex sound field environments than the baseline methods.
1 Introduction
Speech separation is widely known as the cocktail party problem [1, 2]. Its goal is to separate the target speaker’s speech from complex sound field environments (other speakers, background noise, and reverberation). Human beings have a strong speech separation capability and can recognize the target speaker’s speech even in complex environments; for machine systems, however, this remains a significant challenge.
Speech separation, as an important front-end processing technique, is widely used in tasks such as hearing prosthesis, mobile communication, and robust automatic speech and speaker recognition. It has received extensive attention from researchers. However, the performance of current speech separation systems still falls short of the requirements of human auditory perception, especially in complex sound field environments.
Speech separation has been studied for decades. In the early stages, under the assumptions that speech signals conform to a specific probability distribution (Gaussian or Laplacian) and that the background noise is stationary (its spectral characteristics do not change with time), methods such as Computational Auditory Scene Analysis (CASA) [3], Independent Component Analysis (ICA) [4], and Non-negative Matrix Factorization (NMF) [5] were adopted for speech separation. In addition, because the mask is essential in the process of speech separation, mask estimation methods based on probabilistic mixture models [6, 7] and sparse component analysis [8] have been proposed. These methods exhibit good separation performance in low-reverberation environments, but their performance decreases as the reverberation time and/or noise level increases.
With the development of deep learning, data-driven approaches have been introduced to speech separation. These methods learn features and patterns directly from the data without making assumptions about the task domain. Based on this methodology, joint optimization of masking functions and deep recurrent neural networks was proposed for single-channel speech separation in the time domain [9]. Another approach operates in the complex domain, simultaneously enhancing the magnitude and phase spectra by estimating the real and imaginary components of the ideal ratio mask [10].
However, two main difficulties have hindered the development of speech separation. These are the “permutation problem” and the “output dimension mismatch problem.”
To address the permutation problem, the permutation invariant training (PIT) [11] method is used in the training phase of the speech separation model to resolve the uncertainty in speaker order in the mixed signals. Specifically, PIT enumerates all possible permutations and uses the one with the minimum separation error to update the network. In the end, the source labels corresponding to the separated signals are obtained.
In addition, deep clustering (DPCL) [12] separates speech by calculating an embedding vector for each time–frequency bin and applying K-means clustering. Because it uses permutation-free training, it can handle multiple sound sources simultaneously, achieving speaker-independent speech separation. However, the K-means clustering used in DPCL [12] requires significant computational resources. To overcome this issue, a deep attractor network (DANet) [13] was proposed to perform mask estimation without clustering. The time–frequency bins corresponding to each source are integrated by creating attractor points in the high-dimensional space of the mixture signal. Compared with DPCL, the computational effort is significantly reduced. A variety of time–frequency mask-based separation methods have since been proposed [14,15,16]. To achieve sufficient frequency resolution, time–frequency decomposition inevitably decouples phase and magnitude, which results in imperfect reconstruction of the sources. Performing separation in the time domain avoids this problem. Hence, a long short-term memory time-domain audio separation network (LSTM-TasNet) [17] was proposed, in which a codec (encoder/decoder) architecture models the signal in the time domain. Furthermore, a fully convolutional time-domain audio separation network (Conv-TasNet) [18] was proposed to solve the overfitting of LSTM-TasNet by using a temporal convolutional network (TCN) structure. As an end-to-end time-domain separation network, it can model long-term dependencies in speech signals because of its deep one-dimensional dilated convolution blocks.
However, the input mixture signal is composed of a large number of time steps. If the receptive field of a one-dimensional convolutional neural network is smaller than the sequence length, it is difficult to achieve utterance-level modeling. Therefore, a dual-path recurrent neural network time-domain separation network (DPRNN-TasNet) [19] was proposed to model long sequences through iterative intra- and inter-chunk operations. Similarly employing the dual-path strategy, the dual-path transformer network (DPTNet) [20] utilizes a transformer module that enables long-term dependency modeling of speech signals. In addition, Wavesplit [21] achieved better speech separation performance by computing speaker vectors within a temporal window and obtaining the global vectors via clustering.
The above models [11,12,13,14,15,16,17,18,19,20,21] have been trained on the WSJ0-2mix [12] dataset, and through continuous updating and optimization the objective evaluation metric, scale-invariant signal-to-noise ratio improvement (SISNRi), has improved steadily. However, WSJ0-2mix is an idealized dataset recorded in an anechoic acoustic environment (without noise or room reverberation) and contains only clean two-speaker speech. Trained on such clean speech, these methods achieve excellent performance. Nevertheless, when the models [11, 14, 15, 17,18,19,20,21] are trained and tested on datasets with more complex sound field environments, the separation performance degrades. Even careful optimization of the model’s hyperparameters provides only limited relief [22, 23].
In complex sound scenarios, it is challenging to achieve satisfactory separation performance using single-channel information. Therefore, a model trained on azimuth and distance has been proposed, using the distinct spatial locations of the speakers captured by a microphone array [24]. Moreover, additional types of information, such as video, have been adopted for speech separation [25, 26]. These methods improve separation performance to some extent but place higher requirements on recording equipment.
In the most recent study, TF-GridNet [27] achieves speaker separation through the utilization of complex spectral mapping, in conjunction with loss functions and DNN architectures. The model once again underscores its significant potential in the domain of time–frequency monaural speech separation.
In summary, motivated by the previous work, we propose a method for single-channel speech separation in complex sound field environments. End-to-end speech separation is performed on the signal captured by a single microphone in the presence of noise and reverberation. The contributions of the proposed method are summarized as follows:

A deep encoder/decoder dual-path neural network structure is proposed, which enhances the model’s ability to extract speech features. Experimental results demonstrate the feasibility of the method.

A new loss function known as the stretched optimal scale-invariant signal-to-noise ratio (SOSISNR) is proposed, and experimental results show that it outperforms the scale-invariant signal-to-noise ratio (SISNR) in complex sound field environments.

Using a multi-objective joint optimization strategy, the loss function is extended based on short-time objective intelligibility (STOI) [28] to better match the human auditory system.

The alignment operation is proposed to reduce the model’s reliance on a priori knowledge of the sound field and to increase the robustness of the model.
The rest of the paper is organized as follows. Sections 2 and 3 describe the specific implementation steps of the proposed method and the rationale. Section 4 describes the experimental procedure. Finally, the conclusions and analysis of the experiments are presented in Section 5.
2 Description of the separation model
2.1 Problem formulation
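In a multi-source scenario, the signal \(y\left(t\right)\) recorded by a mono microphone can be modeled in the time domain as:
$$y\left(t\right)=\sum_{i=1}^{I}{s}_{i}\left(t\right)*{r}_{i}\left(t\right)+\sigma \left(t\right)$$(1)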
where \(i=1,2,\dots ,I\) and \(I\) represents the number of sound sources. \(t\) represents continuous time, indicating the signal’s continuous variation along the time dimension. \({s}_{i}\left(t\right)\) and \({r}_{i}\left(t\right)\) denote the \(i\)th speaker’s speech and the room impulse response (RIR) between the \(i\)th speaker and the microphone, respectively, and “*” denotes the convolution operation. \(\sigma (t)\) denotes the background noise.
The aim of this paper is to estimate each speaker’s speech \({s}_{i}\left(t\right)\), \(i=1,2,\dots ,I\), from a recorded signal contaminated by noise and reverberation.
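As a rough illustration of this signal model, the following sketch builds a two-speaker reverberant mixture with plain Python lists. The helper names (`convolve`, `mix`), the toy impulse responses, and the truncation of the mixture to the length of the noise are assumptions made for this example and are not taken from the paper; discrete samples stand in for continuous time \(t\).

```python
def convolve(signal, rir):
    """Full discrete convolution of a source with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for n, s in enumerate(signal):
        for k, r in enumerate(rir):
            out[n + k] += s * r
    return out


def mix(sources, rirs, noise):
    """y = sum_i (s_i * r_i) + noise, truncated to the length of the noise."""
    y = list(noise)
    for s, r in zip(sources, rirs):
        reverberant = convolve(s, r)
        for n in range(len(y)):
            y[n] += reverberant[n]
    return y


# Two toy sources, each with its own (hypothetical) impulse response.
s1 = [1.0, 0.0, -1.0, 0.0]
s2 = [0.0, 2.0, 0.0, 2.0]
r1 = [1.0, 0.5]   # direct path plus one reflection
r2 = [0.0, 1.0]   # pure one-sample delay
noise = [0.0, 0.0, 0.0, 0.0]

y = mix([s1, s2], [r1, r2], noise)
print(y)
```

The separation task is the inverse of `mix`: recover `s1` and `s2` given only `y`.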
2.2 Deep encoder/decoder
Currently, architectures consisting of an encoder, a separator, and a decoder [18,19,20, 29, 30] are widely adopted in speech separation.
Specifically, the role of the encoder is to extract features from mixture signals and map them to a feature space of an appropriate dimension. The features are then analyzed and processed by the separator to separate the high-dimensional representation of each component in the mixed speech. Finally, the decoder uses the separated features to reconstruct the original speech signal. Existing works have mainly focused on the separator, with relatively less attention paid to the encoder and decoder. Linear (shallow) operators are commonly used to extract features, which can limit the expressiveness of the model and may not achieve satisfactory performance in complex sound scenarios.
In order to explore the transformation capabilities of the deep encoder/decoder structure on complex signals, this paper uses this structure to focus more on the local features of the speech signal [31]. By also integrating advanced dual-path neural network separation modules, we aim for the model to exhibit superior separation performance. Therefore, a speech separation method for noisy and reverberant environments is proposed by combining a deep encoder/decoder and a dual-path neural network. The deep encoder/decoder structure is shown in Fig. 1.
As shown in Fig. 1, the encoder takes the single-channel signal \(y\left(t\right)\) recorded by the microphone, resamples it, and arranges it as a matrix \(\mathbf{Y}\in {\mathbb{R}}^{1\times S}\), where \(S\) represents the number of time steps. A one-dimensional convolution operation is performed in the first layer of the encoder, expressed as follows:
$${\mathbf{E}}_{1}=Conv\left(\mathbf{Y},\mathbf{V}\right)$$(2)
where \(\mathbf{V}\) represents a kernel of size \(L\) and stride \(L/2\), and \(Conv\left(\cdot \right)\) represents a one-dimensional convolution operation. \({\mathbf{E}}_{1}\in {\mathbb{R}}^{K\times S}\) is obtained by one-dimensional convolution with the kernel \(\mathbf{V}\), where \(K\) is the feature dimension of the signal. The recorded signal \(y\left(t\right)\) is thus initially mapped with a linear transform. Then, starting from the second layer, the input of the \(p\)th layer is the output of the \((p-1)\)th layer. The output of the \(p\)th layer can be expressed as:
$${\mathbf{E}}_{p}=PReLU\left(Conv\left({\mathbf{E}}_{p-1},{\mathbf{V}}_{p}\right)\right)$$(3)
where \(p=2,3,\dots ,P\) is the index of the encoder layer and \(P\) is the number of deep encoder layers (\(P=4\) in this paper). The \(p\)th layer has a kernel \({\mathbf{V}}_{p}\) of size 3 with stride 1 and a \(PReLU\), and \({\mathbf{E}}_{p}\in {\mathbb{R}}^{K\times S}\) represents the output of the \(p\)th layer. In this way, the recorded signal \(y\left(t\right)\) is further mapped into a nonlinear latent space by stacking the encoding layers. Using the output from the deep encoder, the separation module estimates the mask \({\mathbf{M}}_{i}\) corresponding to the \(i\)th speaker.
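The layer structure described above can be sketched in dependency-free Python as follows. This is only an illustration under stated assumptions: the weights are random rather than trained, the PReLU slope is fixed at 0.25 instead of being learned, zero padding of one sample is assumed in layers 2 to \(P\) so that the number of frames is preserved, and the toy sizes (`K=3`, `L=4`) as well as the function names are hypothetical.

```python
import random


def conv1d(x, weight, stride=1, pad=0):
    """Multi-channel 1-D convolution (cross-correlation, no bias).

    x:      list of C_in rows, each with T samples
    weight: list of C_out kernels, each of shape C_in x L
    """
    length = len(weight[0][0])
    padded = [[0.0] * pad + list(row) + [0.0] * pad for row in x]
    frames = (len(padded[0]) - length) // stride + 1
    return [
        [
            sum(
                kernel[c][j] * padded[c][t * stride + j]
                for c in range(len(padded))
                for j in range(length)
            )
            for t in range(frames)
        ]
        for kernel in weight
    ]


def prelu(rows, slope=0.25):
    """PReLU with one fixed slope for negative inputs (learned in practice)."""
    return [[v if v >= 0 else slope * v for v in row] for row in rows]


def deep_encoder(y, K=3, L=4, P=4, seed=0):
    rng = random.Random(seed)

    def init(c_out, c_in, size):
        # Random placeholder weights; a real model would learn these.
        return [[[rng.uniform(-1, 1) for _ in range(size)]
                 for _ in range(c_in)] for _ in range(c_out)]

    # Layer 1: linear convolution, kernel size L, stride L/2.
    e = conv1d([y], init(K, 1, L), stride=L // 2)
    # Layers 2..P: kernel size 3, stride 1, followed by PReLU.
    for _ in range(2, P + 1):
        e = prelu(conv1d(e, init(K, K, 3), stride=1, pad=1))
    return e


features = deep_encoder([0.1 * n for n in range(10)])
print(len(features), len(features[0]))  # K rows, one column per frame
```

With a stride of \(L/2\), the first layer turns the waveform into a shorter sequence of frames; the deeper layers then refine the \(K\)-dimensional features without changing the number of frames.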
The input feature \({\mathbf{D}}_{0,i}\) of the decoder can be obtained from the output of the deep encoder and the mask \({\mathbf{M}}_{i}\) as follows:
$${\mathbf{D}}_{0,i}=\mathbf{E}\odot {\mathbf{M}}_{i}$$(4)
where \({\mathbf{D}}_{0,i}\in {\mathbb{R}}^{K\times S}\), "0" represents layer 0 of the decoder, which is the input of the deep decoder. \(\mathbf{E}\) is the output of the deep encoder, and \(i\) represents the index of the speaker. “⨀” represents element-wise multiplication.
The decoder reconstructs the waveform by transforming the two-dimensional feature \({\mathbf{D}}_{0,i}\in {\mathbb{R}}^{K\times S}\). Starting from the first layer of the decoder, the input of the \(q\)th layer is the output of the \((q-1)\)th layer. The output of the \(q\)th layer can be expressed as:
$${\mathbf{D}}_{q,i}=PReLU\left(TrConv\left({\mathbf{D}}_{q-1,i},{\mathbf{U}}_{q}\right)\right)$$(5)
where \(q=1,2,\dots ,Q\) is the index of the decoder layer, and \(Q+1\) is the number of deep decoder layers (\(Q=3\) in this paper). \(i=1,2,\dots ,I\), where \(I\) represents the number of sound sources. \(TrConv(\cdot )\) represents the transposed convolution operation. The \(q\)th layer of the decoder has a kernel \({\mathbf{U}}_{q}\) of size 3 with stride 1 and a \(PReLU\), and \({\mathbf{D}}_{q,i}\in {\mathbb{R}}^{K\times N}\) denotes the output of the \(q\)th layer. The fourth layer performs the transposed convolution with the kernel \(\mathbf{U}\) as follows:
$${\widehat{\mathbf{s}}}_{i}=TrConv\left(\mathbf{D},\mathbf{U}\right)$$(6)
where \(\mathbf{U}\) represents a kernel of size \(L\) and stride \(L/2\), \(\mathbf{D}\in {\mathbb{R}}^{K\times S}\) is a high-dimensional representation of the waveform reconstructed by the decoder, and \({\widehat{\mathbf{s}}}_{i}\in {\mathbb{R}}^{1\times S}\) is the reconstructed speech signal.
2.3 DPRNN separation module
The DPRNN module is used as the separator in this paper. It consists of three operations: segmentation, DPRNN block processing, and overlap-add [19]. The overall flowchart is shown in Fig. 2:
The output of the deep encoder \(\mathbf{E}\in {\mathbb{R}}^{K\times S}\) is obtained according to formula (3), where \(K\) and \(S\) can be considered as the feature dimension and the time dimension, respectively. In the segmentation stage, as shown in Fig. 2, the output of the deep encoder is segmented into \(F\) chunks of equal size, each with length \(H\) and hop size \({\hslash }_{p}\) (\({\hslash }_{p}=\frac{H}{2}\)); the first and last chunks are zero-padded so that the entire output of the deep encoder can be processed. Each chunk can be represented as \({\mathbf{T}}_{f}\in {\mathbb{R}}^{K\times H}\), where \(f=1,2,\dots ,F\) is the chunk index and \(F\) is the number of chunks. The \(F\) chunks are concatenated into a three-dimensional tensor \(\mathbf{G}=\left[{\mathbf{T}}_{1},{\mathbf{T}}_{2},\dots ,{\mathbf{T}}_{F}\right]\in {\mathbb{R}}^{K\times H\times F}\).
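The segmentation step, together with its inverse, can be sketched as below. For brevity each frame is a scalar rather than a \(K\)-dimensional feature vector, and the exact padding scheme (one hop of zeros at the front, enough zeros at the end to complete the last chunk) is an assumption of this example. The division by two in `overlap_add` is likewise an assumption, included only so that the round trip returns the input.

```python
def segment(frames, chunk_len):
    """Split a sequence into 50%-overlapping chunks with zero padding."""
    assert chunk_len % 2 == 0, "chunk length must be even for a hop of H/2"
    hop = chunk_len // 2
    tail = (-len(frames)) % hop  # zeros needed to reach a multiple of the hop
    padded = [0.0] * hop + list(frames) + [0.0] * (tail + hop)
    n_chunks = len(padded) // hop - 1
    return [padded[f * hop: f * hop + chunk_len] for f in range(n_chunks)]


def overlap_add(chunks, n_frames):
    """Sum the overlapping chunks back into a sequence of n_frames frames."""
    chunk_len = len(chunks[0])
    hop = chunk_len // 2
    out = [0.0] * ((len(chunks) + 1) * hop)
    for f, chunk in enumerate(chunks):
        for j, value in enumerate(chunk):
            out[f * hop + j] += value
    # Drop the front padding; every original frame was covered by two chunks.
    return [v / 2 for v in out[hop: hop + n_frames]]


frames = [1.0, 2.0, 3.0, 4.0, 5.0]
chunks = segment(frames, chunk_len=4)
for chunk in chunks:
    print(chunk)
print(overlap_add(chunks, len(frames)))
```

Because the hop is half the chunk length, every input frame appears in exactly two chunks, which is what lets the inter-chunk pass see the whole utterance with only about \(2S/H\) steps.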
In the DPRNN block processing stage, the tensor \(\mathbf{G}\) is passed to a stack of \(C\) DPRNN blocks. Each block consists of intra-block and inter-block processing. Intra-block processing (local modeling) and inter-block processing (global modeling) are performed iteratively, while keeping the dimensionality of the three-dimensional tensor constant, so that dependencies can be learned at different time scales.
Specifically, intra-block processing starts with the intra-chunk RNN applied along the third dimension of tensor \(\mathbf{G}\), as follows:
$${\mathbf{W}}_{c}={\mathrm{h}}_{c}\left({\mathbf{G}}_{c-1}\right)$$(7)
where \(c=1,2,\dots ,C\) is the stack index of the DPRNN block, and \(C\) is the number of repetitions of the DPRNN block. \({\mathbf{G}}_{0}=\mathbf{G}\), \({\mathbf{W}}_{c}\in {\mathbb{R}}^{X\times H\times F}\) is the output of the intra-chunk RNN, and \({\mathrm{h}}_{c}(\cdot )\) represents the mapping function of the intra-chunk RNN. \({\mathbf{G}}_{c-1}\in {\mathbb{R}}^{K\times H\times F}\) is the three-dimensional speech feature tensor from the previous block. Further, to ensure that the dimension of the tensor does not change, a linear fully-connected (FC) layer is applied, which is defined as follows:
$${\widehat{\mathbf{W}}}_{c}=\mathbf{J}{\mathbf{W}}_{c}+\mathbf{a}$$(8)
where \({\widehat{\mathbf{W}}}_{c}\in {\mathbb{R}}^{K\times H\times F}\) is the transformed speech feature tensor and \({\widehat{\mathbf{W}}}_{c}\) has the same dimension as \({\mathbf{G}}_{c-1}\). \(\mathbf{J}\in {\mathbb{R}}^{K\times X}\) is the weight tensor of the FC layer, and \(\mathbf{a}\) is the bias of the FC layer.
To improve the generalization ability of the model, layer normalization (LN) is applied to the transformed speech feature tensor \({\widehat{\mathbf{W}}}_{c}\). In addition, to avoid model degradation during training, a residual connection is added between the input \({\mathbf{G}}_{c-1}\) of the intra-chunk RNN and the output of the LN, as shown in formula (9):
$${\widehat{\mathbf{G}}}_{c}={\mathbf{G}}_{c-1}+\mathrm{LN}\left({\widehat{\mathbf{W}}}_{c}\right)$$(9)
where \({\widehat{\mathbf{G}}}_{c}\in {\mathbb{R}}^{K\times H\times F}\) is the output of the intra-block processing as well as the input of the inter-block processing, and \(\mathrm{LN}\left(\cdot \right)\) is the layer normalization operation.
The inter-block processing is performed along the second dimension of the three-dimensional tensor \({\widehat{\mathbf{G}}}_{c}\) by the inter-chunk RNN, as follows:
$${\mathbf{R}}_{c}={\mathrm{v}}_{c}\left({\widehat{\mathbf{G}}}_{c}\right)$$(10)
where \({\mathbf{R}}_{c}\in {\mathbb{R}}^{X\times H\times F}\) is the output of the inter-chunk RNN and \({\mathrm{v}}_{c}(\cdot )\) represents the mapping function of the inter-chunk RNN. The subsequent operations are similar to those of the intra-block processing and consist of the inter-chunk RNN mapping followed by the FC and LN layers. The output of the \(c\)th DPRNN block is represented as \({\mathbf{G}}_{c}\in {\mathbb{R}}^{K\times H\times F}\). Thus, the final output of the block processing, \({\mathbf{G}}_{C}\in {\mathbb{R}}^{K\times H\times F}\), is obtained after \(C\) repeated DPRNN blocks.
Finally, a two-dimensional convolutional layer learns the mask of each sound source, and an overlap-add operation is performed, yielding the mask \({\mathbf{M}}_{i}\in {\mathbb{R}}^{K\times S}\) corresponding to the \(i\)th speaker.
3 Training objective
3.1 Joint loss function
In clean speech separation tasks, the loss function is often based on SISNR and utterance-level permutation invariant training (uPIT) [18, 19], with the aim of driving the model’s predictions ever closer to the original clean signal.
A new loss function, SOSISNR, was designed to cope with single-channel speech separation in complex sound field environments. Meanwhile, the loss function was extended using STOI to improve the intelligibility of separated speech and make it more compatible with the human auditory system.
Using a multi-objective joint optimization strategy, the joint loss function \(\mathcal{L}\) is expressed as follows:
where \({\mathcal{L}}_{\mathrm{sosisnr}}\) is the loss function SOSISNR, and \({\mathcal{L}}_{\mathrm{STOI}}\) is the STOI-based loss function. \(\lambda\) is the weight of \({\mathcal{L}}_{\mathrm{STOI}}\), which is set to 2 in this paper. The derivation and analysis of the two loss functions (i.e., \({\mathcal{L}}_{\mathrm{sosisnr}},{\mathcal{L}}_{\mathrm{STOI}}\)) are given in Sections 3.2 and 3.3, respectively.
3.2 Stretched optimal scale-invariant signal-to-noise ratio
SISNR, as a time-domain loss function, measures the similarity between the output of the model and the ground truth. It has been widely used in signal processing fields such as speech separation [18, 19] and speech enhancement [32, 33]. SISNR and SOSISNR are illustrated in Fig. 3.
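As shown in the orange part of Fig. 3, the original clean speech \(\mathbf{s}\) is first mapped to the target signal \({\mathbf{s}}_{target}\), in order to attenuate the effect of scale variations in either the original clean speech \(\mathbf{s}\) or the estimated signal \(\widehat{\mathbf{s}}\). The target signal \({\mathbf{s}}_{target}\) is defined in formula (12):
$${\mathbf{s}}_{target}=\frac{\langle \widehat{\mathbf{s}},\mathbf{s}\rangle \mathbf{s}}{{\left|\mathbf{s}\right|}^{2}}$$(12)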
Then the difference between the estimated signal \(\widehat{\mathbf{s}}\) and the target signal \({\mathbf{s}}_{target}\), i.e., the noise signal \({\mathbf{e}}_{noise}\), is defined as follows:
$${\mathbf{e}}_{noise}=\widehat{\mathbf{s}}-{\mathbf{s}}_{target}$$(13)
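Finally, the intensity ratio of the target signal \({\mathbf{s}}_{target}\) to the noise signal \({\mathbf{e}}_{noise}\) (SISNR) is defined as:
$$\mathrm{SISNR}=10{\mathrm{log}}_{10}\frac{{\left|{\mathbf{s}}_{target}\right|}^{2}}{{\left|{\mathbf{e}}_{noise}\right|}^{2}}$$(14)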
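Furthermore, \({\mathbf{s}}_{target}\) and \({\mathbf{e}}_{noise}\) can be rewritten as follows:
$${\mathbf{s}}_{target}=\left|\widehat{\mathbf{s}}\right|\mathrm{cos}\theta \frac{\mathbf{s}}{\left|\mathbf{s}\right|}$$(15)
$${\mathbf{e}}_{noise}=\widehat{\mathbf{s}}-\left|\widehat{\mathbf{s}}\right|\mathrm{cos}\theta \frac{\mathbf{s}}{\left|\mathbf{s}\right|}$$(16)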
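Combining formulas (14), (15), and (16), the expression of SISNR is further derived as:
$$\mathrm{SISNR}=10{\mathrm{log}}_{10}\frac{{\left|\widehat{\mathbf{s}}\right|}^{2}{\mathrm{cos}}^{2}\theta }{{\left|\widehat{\mathbf{s}}\right|}^{2}{\mathrm{sin}}^{2}\theta }=10{\mathrm{log}}_{10}{\mathrm{cot}}^{2}\theta$$(17)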
where \(\theta\) represents the angle between \(\widehat{\mathbf{s}}\) and \(\mathbf{s}\). Obviously, as in formula (15), by adjusting the original clean speech \(\mathbf{s}\) to a suitable scale, the magnitude of \({\mathbf{s}}_{target}\) is not associated with the original clean speech \(\mathbf{s}\). Instead, \({\mathbf{s}}_{target}\) is expressed in terms of the estimated signal \(\widehat{\mathbf{s}}\) as well as the angle \(\theta\) trigonometric function, with no change in direction compared to \(\mathbf{s}\). As in formula (17), SISNR is only related to the angle \(\theta\), which is irrelevant to the magnitude of \(\widehat{\mathbf{s}}\) and \(\mathbf{s}\). The functional relationship between \(\theta\) and SISNR is shown as the orange part in Fig. 4.
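In separation performance evaluation, the angle information between the separated signal and the original signal is more important than the magnitude information. Compared to the signal-to-noise ratio (SNR) [34]:
$$\mathrm{SNR}=10{\mathrm{log}}_{10}\frac{{\left|\mathbf{s}\right|}^{2}}{{\left|\mathbf{s}-\widehat{\mathbf{s}}\right|}^{2}}$$(18)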
SISNR is more suitable for speech separation tasks. Specifically, by regularizing the separated speech signal \(\mathbf{s}\), SISNR overcomes the disadvantage that SNR is susceptible to variations in the energy of input signal. In other words, SISNR is able to evaluate the angle between the separated signal and the original signal without being affected by the change of signal energy [35, 36].
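The difference between the two measures can be checked numerically. The sketch below uses the standard projection-based definition of SI-SNR and the usual SNR definition; the function names and the two-sample toy vectors are assumptions for illustration. It also compares the computed SI-SNR against the angle-only expression \(20{\mathrm{log}}_{10}\left|\mathrm{cot}\theta \right|\), which follows from that definition.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def si_snr(est, ref):
    """Scale-invariant SNR in dB (undefined if est is parallel or orthogonal to ref)."""
    scale = dot(est, ref) / dot(ref, ref)
    target = [scale * r for r in ref]            # projection of est onto ref
    noise = [e - t for e, t in zip(est, target)]
    return 10 * math.log10(dot(target, target) / dot(noise, noise))


def snr(est, ref):
    """Conventional SNR in dB, sensitive to the scale of the estimate."""
    err = [r - e for r, e in zip(ref, est)]
    return 10 * math.log10(dot(ref, ref) / dot(err, err))


theta = 0.3
ref = [1.0, 0.0]
est = [3 * math.cos(theta), 3 * math.sin(theta)]  # angle theta, magnitude 3

# SI-SNR matches the angle-only expression.
print(si_snr(est, ref), 20 * math.log10(abs(1 / math.tan(theta))))
# Rescaling the estimate leaves SI-SNR unchanged but changes SNR.
print(si_snr([2 * e for e in est], ref))
print(snr(est, ref), snr([2 * e for e in est], ref))
# Flipping the sign of the estimate (angle theta + pi) gives the same SI-SNR.
print(si_snr([-e for e in est], ref))
```

The last line is worth noting: a sign-flipped estimate scores exactly as well as the original one, because the measure depends on \(\theta\) only through \({\mathrm{cot}}^{2}\theta\).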
In fact, SISNR is not necessarily the optimal choice for speech separation in complex acoustic environments. As shown in the orange part of Fig. 3, the \({\mathbf{e}}_{noise}\) of SISNR is not necessarily orthogonal to \(\mathbf{s}\). Using SISNR may mislead the model in complex environments, which often causes the model to fall into a local optimum [35]. Meanwhile, as the orange curve in Fig. 4 shows, there are several extreme points (i.e., 0 and \(\pm \pi\)) over one period (\(-\pi \sim \pi\)). The closer the angle \(\theta\) is to \(\pi\) or \(-\pi\), the larger the SISNR value, which would imply that \(\widehat{\mathbf{s}}\) and \(\mathbf{s}\) are similar at such angles. Obviously, such a conclusion is incorrect [36].
In order to have only one correct extreme point in the range of \(-\pi\) to \(\pi\), we reconstruct a phase-corrected signal \({\widehat{\mathbf{s}}}_{\eta }\). The goal is to double the period of the SISNR so as to remove the spurious extreme points. At the same time, in order to prevent the training from falling into a local optimum, we define a new target signal in the direction of the vector \(\mathbf{s}\). The new target signal \({\mathbf{s}}_{target}^{\mathrm{^{\prime}}}\) is obtained by scaling the original signal to make its amplitude independent of the original signal.
Based on the above analysis, we propose a loss function, SOSISNR, which is illustrated in the blue part of Fig. 3. Specifically, \({\widehat{\mathbf{s}}}_{\eta }\) is reconstructed based on the spatial relationship between the estimated signal \(\widehat{\mathbf{s}}\) and the original speech \(\mathbf{s}\). This is achieved by keeping the magnitude of the reconstructed signal \({\widehat{\mathbf{s}}}_{\eta }\) equal to that of the estimated signal \(\widehat{\mathbf{s}}\) (\(\left|{\widehat{\mathbf{s}}}_{\eta }\right|=\left|\widehat{\mathbf{s}}\right|\)) and halving the angle between \(\widehat{\mathbf{s}}\) and \(\mathbf{s}\) \(\left({\theta }^{\mathrm{^{\prime}}}=\frac{\theta }{2}\right)\), where \({\theta }^{\mathrm{^{\prime}}}\) represents the angle between \({\widehat{\mathbf{s}}}_{\eta }\) and \(\mathbf{s}\).
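The effect of halving the angle can be seen in a small numerical sketch. It assumes the standard angle-only form of SI-SNR, \(20{\mathrm{log}}_{10}\left|\mathrm{cot}\theta \right|\), and simply evaluates the same expression at \(\theta /2\). This is not the SOSISNR loss itself, whose final form also involves the scale adjustment factor \(\alpha\); the function names are hypothetical and the example only illustrates why doubling the period leaves a single maximum in \(\left(-\pi ,\pi \right)\).

```python
import math


def angle_si_snr(theta):
    """SI-SNR as a function of the angle only: 20*log10|cot(theta)|."""
    return 20 * math.log10(abs(math.cos(theta) / math.sin(theta)))


def half_angle_si_snr(theta):
    """The same expression evaluated at theta/2, i.e. with a doubled period."""
    return angle_si_snr(theta / 2)


for theta in (0.1, 1.0, 2.0, 3.0):
    print(f"theta={theta:.1f}  original={angle_si_snr(theta):7.2f} dB  "
          f"half-angle={half_angle_si_snr(theta):7.2f} dB")
```

Near \(\theta =\pi\) the original expression rises again, whereas the half-angle version keeps decreasing, so an estimate pointing away from the reference is no longer rewarded.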
Meanwhile, the new target signal \({\mathbf{s}}_{target}^{\mathrm{^{\prime}}}\) is defined as follows:
where \(\alpha\) represents the scale adjustment factor. Substituting formula (19) into formula (14), \({\mathrm{SISNR}}^{\prime}\) is obtained:
To simplify the calculation, an intermediate function \(f\left(\cdot \right)\) is defined as follows:
In order to obtain the scale adjustment factor \(\alpha\) corresponding to the maximum value of \(f\left(\alpha \right)\), the derivative of \(f\left(\alpha \right)\) is calculated below:
Therefore, \(\alpha\) can be obtained:
After adjusting the period of SISNR and recalculating the scale adjustment factor \(\alpha\), with formulas (23) and (19), the reconstructed target signal \({\mathbf{s}}_{target}^{\mathrm{^{\prime}}}\) can be rewritten as:
Substituting formula (24) into (13), the noise \({\mathbf{e}}_{noise}^{\mathrm{^{\prime}}}\) corresponding to the reconstructed target signal \({\mathbf{s}}_{target}^{\mathrm{^{\prime}}}\) is given as follows:
Furthermore, the loss function SOSISNR is defined by combining formulas (24) and (25) as follows:
The functional relationship between \(\theta\) and the SOSISNR value is shown in the blue part of Fig. 4. SOSISNR has only one extreme point for \(\theta\) in the range from \(-\pi\) to \(\pi\). It reaches its maximum value only when the angle \(\theta\) is zero, which correctly reflects the similarity of \(\widehat{\mathbf{s}}\) and \(\mathbf{s}\). Also, unlike SISNR \(\in \left(-\infty ,+\infty \right)\), SOSISNR ranges from 0 to positive infinity.
3.3 STOI-based loss function
Although SISNR reflects the correlation between the estimated speech \(\widehat{\mathbf{s}}\) and the original clean speech \(\mathbf{s}\) well, as a distance-based loss function it cannot directly reflect the effect of signals on human hearing [37]. In contrast, STOI is a commonly used objective evaluation metric [23] (STOI \(\in \left[0,1\right]\), with a higher value representing better speech intelligibility) that is closely related to human auditory perception [38]. Moreover, it analyzes speech segments as a whole [39], which is more conducive to learning long-range context dependencies. Based on this motivation, the joint loss function is extended with the STOI-based loss function \({\mathcal{L}}_{\mathrm{STOI}}\) [40].
Taking the estimated speech \(\widehat{\mathbf{s}}\) and the original clean speech \(\mathbf{s}\) as input, \({\mathcal{L}}_{\mathrm{STOI}}\) can be obtained as follows:

a. Removal of silent frames from the estimated speech \(\widehat{\mathbf{s}}\) and the original clean speech \(\mathbf{s}\).

b. The short-time Fourier transform (STFT) is used to obtain the corresponding representation in the time–frequency domain.

c. Perform a one-third octave band analysis.

d. To compensate for the global level difference and improve the stability of the STOI, normalization and clipping are performed.

e. Measure intelligibility. The intermediate intelligibility \({\zeta }_{b,n}\) is defined as the correlation coefficient between the two temporal envelopes:
$${\zeta }_{b,n}=\frac{{\left({\widehat{\mathbf{s}}}_{b,n}-{m}_{{\widehat{\mathbf{s}}}_{b,n}}\right)}^{T}\left({\mathbf{s}}_{b,n}-{m}_{{\mathbf{s}}_{b,n}}\right)}{{\Vert {\widehat{\mathbf{s}}}_{b,n}-{m}_{{\widehat{\mathbf{s}}}_{b,n}}\Vert }_{2}{\Vert {\mathbf{s}}_{b,n}-{m}_{{\mathbf{s}}_{b,n}}\Vert }_{2}}$$(27)
where \(b\) and \(n\) are the indexes of the one-third octave band and the short-time temporal envelope vector, respectively, with \(b=1,2,\dots ,B\) and \(n=1,2,\dots ,N\). \(B\) and \(N\) are the numbers of one-third octave bands and short-time temporal envelope vectors, respectively. \({\widehat{\mathbf{s}}}_{b,n}\) and \({\mathbf{s}}_{b,n}\) represent the short-time spectrogram vectors of the estimated speech \(\widehat{\mathbf{s}}\) and the original clean speech \(\mathbf{s}\), respectively. \({m}_{\left(\cdot \right)}\) is the sample mean of the corresponding vector.
Ultimately, \({\mathcal{L}}_{\mathrm{STOI}}\) can be obtained by averaging the intermediate intelligibility over all bands and short-time temporal envelope vectors:
$${\mathcal{L}}_{\mathrm{STOI}}=\frac{1}{BN}\sum_{b=1}^{B}\sum_{n=1}^{N}{\zeta }_{b,n}$$(28)
Choosing an appropriate loss function is crucial for training and optimizing deep learning models. When using a joint loss function, the numerical range and sign of each component must be chosen so that parameter updates remain effective. This avoids the problem of vanishing gradients, which occurs when the gradients of the different loss functions cancel each other out. When the gradient vanishes, the network cannot effectively perform backpropagation and cannot be optimized in the right direction. The value of \({\mathcal{L}}_{\mathrm{STOI}}\) in formula (11) ranges from 0 to 1, so the other loss function in the joint loss function \(\mathcal{L}\) must be non-negative. SOSISNR meets this condition.
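The last step of the STOI computation, the correlation in formula (27) followed by averaging over bands and segments, can be sketched as follows. Steps a to d (silent-frame removal, STFT, one-third octave analysis, normalization and clipping) are deliberately omitted, so the inputs here are assumed to be ready-made envelope vectors; the function names and toy data are hypothetical.

```python
import math


def correlation(est_env, ref_env):
    """Sample correlation coefficient between two temporal envelope vectors.

    Undefined (division by zero) if either envelope is constant.
    """
    m_est = sum(est_env) / len(est_env)
    m_ref = sum(ref_env) / len(ref_env)
    est_c = [v - m_est for v in est_env]
    ref_c = [v - m_ref for v in ref_env]
    num = sum(a * b for a, b in zip(est_c, ref_c))
    den = (math.sqrt(sum(a * a for a in est_c))
           * math.sqrt(sum(b * b for b in ref_c)))
    return num / den


def mean_intelligibility(est_bands, ref_bands):
    """Average the correlation over all bands b and segments n.

    est_bands[b][n] and ref_bands[b][n] are envelope vectors.
    """
    scores = [
        correlation(e, r)
        for est_band, ref_band in zip(est_bands, ref_bands)
        for e, r in zip(est_band, ref_band)
    ]
    return sum(scores) / len(scores)


# One band, two segments; the estimate is a scaled and offset copy.
ref_bands = [[[1.0, 2.0, 3.0], [2.0, 1.0, 4.0]]]
est_bands = [[[3.0, 5.0, 7.0], [5.0, 3.0, 9.0]]]
print(mean_intelligibility(est_bands, ref_bands))
```

Because the mean is subtracted and the vectors are normalized, a scaled or offset copy of the reference envelope still correlates perfectly, which is why this term rewards envelope shape rather than absolute level.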
3.4 Alignment operation and utterance-level permutation invariant training
This paper focuses on speech separation in complex sound field environments. The input mixture speech signals in the training process are heterogeneous, with different levels of reverberation. In order to make the network cope with complex environments, this paper proposes the alignment operation to reduce the time delay error caused by reverberation. Specifically, this method performs time alignment on the estimated speech \(\widehat{\mathbf{s}}\) and the original clean speech \(\mathbf{s}\), so that the network can learn the corresponding relationship between them more accurately and improve the separation performance.
After obtaining the joint loss function \(\mathcal{L}\) according to formula (11), the loss function needs to be modified in order to achieve the best separation performance. The alignment operation is proposed so that the loss function can better reflect the separation performance of the model, allowing the model to learn more effective information. The alignment operation is shown in Fig. 5.
As shown in Fig. 5, the specific method is to obtain \({\mathbf{s}}^{\uptau }\) by cyclically shifting the original clean speech \(\mathbf{s}\). Here \(\tau\) is an integer value, which represents the number of shift samples. Then, the original clean speech \(\mathbf{s}\) is replaced by the shifted original speech \({\mathbf{s}}^{\uptau }\). The joint loss function \(\mathcal{L}\) is calculated by inputting estimated speech \(\widehat{\mathbf{s}}\), and each shifted original speech \({\mathbf{s}}^{\uptau }\) into the formula (11). Comparing all the loss values with different \(\uptau\), the minimum loss function value is found as shown in formula (29):
where \(\widehat{\mathbf{s}}\in {\mathbb{R}}^{1\times S}\) and \({\mathbf{s}}^{\tau }\in {\mathbb{R}}^{1\times S}\) represent the estimated speech and the shifted original speech, respectively, and \(S\) represents the length of the tensor after sampling processing. The value of \(\tau\) ranges from 1 to \(S\).
This operation reduces the model's reliance on a priori knowledge of the sound field environment. It lets the model cope with different levels of reverberation by offsetting the time delay that reverberation causes, and it significantly reduces the workload of aligning annotations to different speech signals. Because the operation is applied in the forward pass at every training epoch, it provides timely and appropriate feedback to the model.
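The alignment operation can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the shift direction (to the right) is an assumption, and a mean squared error stands in for the joint loss of formula (11), which is defined outside this section.

```python
from typing import Callable, List, Sequence, Tuple

Signal = Sequence[float]


def cyclic_shift(s: Signal, tau: int) -> List[float]:
    """Return s shifted cyclically to the right by tau samples."""
    n = len(s)
    tau %= n
    return list(s[n - tau:]) + list(s[:n - tau])


def mse(est: Signal, ref: Signal) -> float:
    """Stand-in loss (NOT the paper's joint loss): mean squared error."""
    return sum((e - r) ** 2 for e, r in zip(est, ref)) / len(ref)


def aligned_loss(
    est: Signal,
    ref: Signal,
    loss_fn: Callable[[Signal, Signal], float] = mse,
) -> Tuple[float, int]:
    """Return the minimum of loss_fn(est, ref shifted by tau) over tau = 1..S,
    together with the shift that attains it."""
    S = len(ref)
    best_loss, best_tau = float("inf"), 0
    for tau in range(1, S + 1):  # tau = S reproduces the unshifted signal
        loss = loss_fn(est, cyclic_shift(ref, tau))
        if loss < best_loss:
            best_loss, best_tau = loss, tau
    return best_loss, best_tau


ref = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
est = cyclic_shift(ref, 3)      # an "estimate" delayed by 3 samples
print(aligned_loss(est, ref))   # (0.0, 3)
```

The exhaustive search evaluates the loss \(S\) times per utterance, so a practical implementation would likely vectorize it or restrict the range of \(\tau\); either choice is outside what the text specifies.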
Furthermore, to resolve the permutation ambiguity during training [39], utterance-level permutation invariant training (uPIT) is combined with the alignment operation, which guides the training toward a speaker-independent separation model. The updated loss function is as follows:
where \(i=1, 2,\cdots ,I\) is the speaker index and \(I\) is the number of speakers. \({\widehat{s}}_{\left(i\right)}\) is the \(i\)th estimated speaker speech. \({s}_{\gamma \left(i\right)}\) is the original clean speech of the \(\gamma \left(i\right)\)th speaker, where \(\gamma \left(i\right)\) is a candidate index of the original clean speech assigned to the \(i\)th estimated speaker speech. \(\Gamma\) is the set of all possible permutations of the \(I\) speakers.
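As a rough illustration of how uPIT and the alignment operation might be combined, the sketch below scores every (estimate, reference) pair with a shift-minimized loss and then picks the permutation \(\gamma \in \Gamma\) with the smallest total. The names are hypothetical, the mean squared error is again only a stand-in for the joint loss, and averaging over speakers with the shift search inside each pair is an assumption about the exact form of the updated loss.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Signal = Sequence[float]


def mse(est: Signal, ref: Signal) -> float:
    """Stand-in loss (NOT the paper's joint loss): mean squared error."""
    return sum((e - r) ** 2 for e, r in zip(est, ref)) / len(ref)


def min_shift_loss(est: Signal, ref: Signal) -> float:
    """Minimum stand-in loss over all cyclic shifts tau = 1..S of ref."""
    S = len(ref)
    return min(
        mse(est, list(ref[S - tau:]) + list(ref[:S - tau]))
        for tau in range(1, S + 1)
    )


def upit_aligned_loss(
    ests: List[Signal], refs: List[Signal]
) -> Tuple[float, Tuple[int, ...]]:
    """Return (loss, gamma): the speaker-averaged loss under the best
    assignment gamma, where gamma[i] is the reference matched to estimate i."""
    I = len(refs)
    # Score each (estimate, reference) pair once, then search permutations.
    pair = [[min_shift_loss(e, r) for r in refs] for e in ests]
    best = min(
        permutations(range(I)),
        key=lambda g: sum(pair[i][g[i]] for i in range(I)),
    )
    return sum(pair[i][best[i]] for i in range(I)) / I, best


refs = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 3.0]]
ests = [refs[1], refs[0]]               # outputs arrive in swapped order
print(upit_aligned_loss(ests, refs))    # (0.0, (1, 0))
```

Enumerating all \(I!\) permutations is cheap for the two-speaker case considered here.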
4 Experimental settings
4.1 Dataset
The experiment simulates a complex scenario in which the speech source of interest is simultaneously disturbed by other speakers, noise, and reflection components. We simulated RIRs using the image method [41,42,43] for a rectangular room with dimensions of 7 m × 5 m × 3 m, with the microphone placed at the center of the room (3.5 m, 2.5 m, 1.5 m). Reverberant utterances from different speakers with reverberation times (\({T}_{60}\)) of 100 ms, 200 ms, and 300 ms were randomly generated by varying the sound absorption coefficient of the walls. Speech from the si_tr_s set of the Wall Street Journal dataset (WSJ0) [12] was used as the source signal. Two reverberant utterances from different speakers were randomly selected and mixed at an SNR between −5 dB and 5 dB to generate 30 h of training data. The validation and test sets (from WSJ0 si_dt_05 and si_et_05) were generated in the same way to produce 10 h and 5 h of data, respectively. This yields a spatial version of the WSJ0-2mix dataset [12]. We then paired the spatial version of WSJ0-2mix with noise recordings (background scenes such as restaurants, cafés, and bars) from the WSJ0 Hipster Ambient Mixtures (WHAM!) dataset [44], randomly mixing the speech mixtures with noise at three SNR levels: 5 dB, 10 dB, and 15 dB. This process was designed for speech separation tasks in environments with varying levels of background noise. All speech signals were resampled to 8 kHz.
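Mixing a signal with an interferer or noise at a prescribed SNR amounts to scaling the interferer so that the power ratio matches the target. The sketch below shows that arithmetic in plain Python; the function names are hypothetical and the authors' data generation scripts may differ (for example, in how power is measured for reverberant signals).

```python
import math
from typing import List, Sequence


def power(x: Sequence[float]) -> float:
    """Average power of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(
    target: Sequence[float], interferer: Sequence[float], snr_db: float
) -> List[float]:
    """Scale `interferer` so that 10*log10(P_target / P_scaled) equals
    snr_db, then add it to `target` sample by sample."""
    gain = math.sqrt(power(target) / (power(interferer) * 10 ** (snr_db / 10)))
    return [t + gain * n for t, n in zip(target, interferer)]


speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixture = mix_at_snr(speech, noise, snr_db=10.0)

# Recover the scaled noise and confirm the achieved SNR.
scaled_noise = [m - s for m, s in zip(mixture, speech)]
achieved = 10 * math.log10(power(speech) / power(scaled_noise))
```

Here `achieved` equals the requested 10 dB up to floating-point rounding.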
4.2 Training setup
In the first layer of the deep encoder and the last layer of the deep decoder, the kernel size \(L\) is set to 2 and the hop size to \(L/2\). In the separation module, the number \(C\) of stacked DPRNN blocks is set to 6, and a BLSTM [45] with 128 hidden units in each direction is used as the intra- and inter-block RNN.
In the STOI-based loss function, the time–frequency spectrum of the speech signal is obtained by an STFT with a Hanning window of length 1024 and a hop size of 256.
The network is trained for 100 epochs on 4-s-long segments with an initial learning rate of \(2\times {10}^{-4}\). If the result on the validation set does not improve for 3 consecutive epochs, the learning rate is halved. If the best model is not updated for 10 consecutive epochs, training is stopped early. Adam [46] is used as the optimizer, and gradient clipping with a maximum \({L}_{2}\)-norm of 5 is applied during training. To ensure fairness, all models are trained using PyTorch profiler [47] on 2 NVIDIA GeForce RTX 4090 GPU devices.
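To give a feel for the sequence lengths these settings imply, the following sketch counts the frames produced for one 4-s training segment at 8 kHz. It assumes plain sliding windows with no padding, which is an assumption: many STFT and convolution implementations pad the signal and would give slightly different counts.

```python
def num_frames(num_samples: int, window: int, hop: int) -> int:
    """Frames produced by sliding `window` over the signal in steps of
    `hop`, without any padding."""
    return (num_samples - window) // hop + 1


SAMPLE_RATE = 8000                   # Hz, as stated for the dataset
SEGMENT_SAMPLES = 4 * SAMPLE_RATE    # one 4-s training segment

# First encoder layer: kernel size L = 2, hop size L / 2 = 1.
encoder_frames = num_frames(SEGMENT_SAMPLES, window=2, hop=1)     # 31999

# STFT used by the STOI-based loss: window 1024, hop 256.
stft_frames = num_frames(SEGMENT_SAMPLES, window=1024, hop=256)   # 122
stft_bins = 1024 // 2 + 1                                         # 513
```

The very long encoder output (about 32,000 frames) is what makes chunked dual-path processing attractive compared with running a single RNN over the whole sequence.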
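The learning-rate and early-stopping rules can be written as a small framework-independent controller. This is a sketch under stated assumptions: the class name is hypothetical, lower validation loss is taken to mean "better", and the learning rate is assumed to be halved again after every further 3 epochs without improvement, which the text does not spell out.

```python
class PlateauController:
    """Tracks validation loss, halves the learning rate on plateaus,
    and signals early stopping."""

    def __init__(self, lr: float = 2e-4, halve_patience: int = 3,
                 stop_patience: int = 10) -> None:
        self.lr = lr
        self.halve_patience = halve_patience
        self.stop_patience = stop_patience
        self.best = float("inf")
        self.bad_epochs = 0  # consecutive epochs without a new best

    def step(self, val_loss: float) -> bool:
        """Update state after one epoch; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        if self.bad_epochs % self.halve_patience == 0:
            self.lr *= 0.5
        return self.bad_epochs >= self.stop_patience


controller = PlateauController()
val_losses = [1.0, 0.9] + [0.95] * 10   # improvement stops after epoch 2
stops = [controller.step(v) for v in val_losses]
```

With this toy loss curve the controller halves the rate three times (after 3, 6, and 9 stagnant epochs) and requests a stop on the tenth stagnant epoch.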
4.3 Evaluation metrics
In the experiments, four objective evaluation metrics and one subjective evaluation metric were used to evaluate the separation performance of our proposed method and the baseline methods.
The objective evaluation metrics include SI-SNRi [33], signal distortion ratio improvement (SDRi) [48, 49], perceptual evaluation of speech quality (PESQ) [50], and STOI [28]. These metrics are obtained by comparing the model output speech with the original clean speech. SI-SNRi and SDRi are energy ratios that measure the similarity between signals. PESQ scores range from −0.5 to 4.5, and STOI scores range from 0 to 1. For all four metrics, a higher value indicates better quality of the separated signal.
The MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA) [51, 52] is chosen as the subjective evaluation. In a MUSHRA test, a number of experienced listeners rate the quality of the separated signals. MUSHRA scores range from 0 to 100, and the higher the score, the better the quality of the separated signal.
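For reference, SI-SNR and its improvement over the unprocessed mixture can be computed as in the sketch below, which follows the commonly used definition (zero-mean signals, projection of the estimate onto the reference). The function names and the small constant `eps` are choices made for this example, not details taken from the paper.

```python
import math
from typing import Sequence


def si_snr(est: Sequence[float], ref: Sequence[float], eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimate and a reference."""
    n = len(ref)
    mean_e, mean_r = sum(est) / n, sum(ref) / n
    e = [v - mean_e for v in est]
    r = [v - mean_r for v in ref]
    # Project the estimate onto the reference to get the target component.
    scale = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [scale * b for b in r]
    noise = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(est: Sequence[float], mix: Sequence[float],
                       ref: Sequence[float]) -> float:
    """SI-SNRi: gain of the separated estimate over the raw mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)


ref = [1.0, -1.0, 2.0, -2.0, 0.5, -0.5]
interferer = [0.5, 0.5, -0.5, 0.5, -0.5, -0.5]
mix = [a + b for a, b in zip(ref, interferer)]
est = [1.1, -0.9, 2.0, -2.1, 0.4, -0.5]   # a close but imperfect estimate
gain_db = si_snr_improvement(est, mix, ref)
```

Because the estimate is projected onto the reference, rescaling `est` by any nonzero constant leaves the score essentially unchanged, which is the property that gives the metric its name.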
5 Experimental results and analysis
The proposed method was compared with three baseline methods (Conv-TasNet, DPRNN-TasNet, and DPTNet). The three baselines were reproduced with the same structures and hyperparameter settings as in [18,19,20].
The proposed method and its ablations (the proposed method without the deep encoder/decoder, and the proposed method trained without \({\mathcal{L}}_{\mathrm{STOI}}\)) were compared with the baseline methods. In addition, we conducted a comparative experiment in which the SOSI-SNR of the proposed method was replaced with the original SI-SNR.
The sizes of the aforementioned models, their computational complexities for 4-s segments, and objective evaluation metrics are shown in Table 1.
The results show that, compared to Conv-TasNet, DPRNN-TasNet, and DPTNet, the proposed method brings a 5.0 dB, 2.1 dB, and 0.2 dB increase in SI-SNRi as well as a 4.9 dB, 1.9 dB, and 0.1 dB increase in SDRi, respectively. This demonstrates that the proposed method performs better speech separation in complex sound field environments. Meanwhile, the STOI results show that the proposed method brings 22.5%, 7.4%, and 3.6% improvements in auditory impression over the three baseline methods, which suggests that the speech separated by the proposed method is more compatible with the human auditory system.
The effect of different configurations on the separation performance of the proposed method was analyzed by means of ablation experiments (see Table 1, rows 6 and 7). Firstly, the introduction of the deep encoder/decoder increases SI-SNRi by 1.1 dB and SDRi by 1.0 dB. This illustrates the greater capacity of the deep encoder/decoder to transform complex signals, and it suggests that combining a deep encoder/decoder with more advanced separation modules could achieve even better separation performance. Secondly, the model is optimized with a joint loss function that includes \({\mathcal{L}}_{\mathrm{STOI}}\), a loss term related to human auditory perception. This effectively improves speech intelligibility and better matches the training target to the human auditory system.
Further, the comparative experiment (see Table 1, row 8) indicates that replacing the original SI-SNR with SOSI-SNR brings a 0.7 dB increase in SI-SNRi and a 0.8 dB increase in SDRi. This demonstrates that SOSI-SNR improves the performance of the speech separation system in noisy and reverberant environments compared to SI-SNR.
The proposed method and the three baseline methods were further tested in a real-world environment. In a room measuring 4.5 m × 3.5 m × 2.8 m, a subset of data (20 recordings) randomly selected from the WSJ0-2mix test set was captured using microphones (measurement condenser microphone ECM8000) paired with sound cards (Depusheng md22). The room had an estimated \({T}_{60}\) reverberation time of 400 ms and an SNR of 8.3 dB.
The background noise consisted of external noise and vibrations within the recording room, caused by incomplete sound insulation. The results of the real-world testing were averaged, and the objective evaluation metric SI-SNRi is shown in Fig. 6:
The experimental results indicate that, in a real-world environment, both the proposed method and the baseline methods inevitably exhibit some decrease in separation performance. However, the proposed method still outperforms the three baseline methods in the real-world setting.
To visualize the performance of the proposed method against the three baseline methods, the separation of a randomly sampled segment of mixed speech is presented as spectrograms in Fig. 7. This mixture (Fig. 7f) is disturbed by café background noise with an SNR of 5 dB and room reverberation with a \({T}_{60}\) of 300 ms. Figure 7a is the spectrogram of the original clean signal. Figure 7b, c, d, and e show the spectrograms of a separated source processed by the four methods, respectively.
From Fig. 7, it can be seen that all four methods can separate the sound sources. However, the separation performance of the proposed method, DPTNet, and DPRNN-TasNet is significantly better than that of Conv-TasNet. This is because Conv-TasNet uses a fixed context length, which limits its long-term tracking of the speaker and its generalization to complex sound field environments. Furthermore, as shown by the highlighted white and green dashed boxes (Fig. 7a–d), the proposed method provides a better restoration of the harmonic components. This indicates that our method achieves better separation performance.
To evaluate the proposed method subjectively, a MUSHRA listening test was conducted with 20 experienced listeners. The proposed method and the three baseline methods were used to process 18 randomly selected mixture signals from the test set. According to the MUSHRA specifications, each experiment included a hidden reference and a 3.5 kHz low-pass-filtered anchor. The results of the MUSHRA listening test with 95% confidence intervals are shown in Fig. 8:
From Fig. 8, it can be seen that the MUSHRA scores of the proposed method are higher than those of the baseline methods, which means that the proposed method provides a better auditory experience for listeners.
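The text does not state how the 95% confidence intervals were computed. One common choice, shown here purely as an illustration with made-up scores, is the normal approximation: the mean plus or minus 1.96 standard errors.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_ci95(scores: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using mean +/- z * s / sqrt(n),
    where s is the sample standard deviation."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


# Hypothetical listener ratings for one system (NOT data from the paper).
example_scores = [70.0, 80.0, 90.0]
mean, lower, upper = mean_ci95(example_scores)
```

For a panel of 20 listeners, a Student-t critical value (about 2.09 for 19 degrees of freedom) could be passed as `z` to obtain a slightly wider, more conservative interval.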
These methods were further evaluated at different noise and reverberation levels. Test sets for nine different acoustic environments were generated using the same methodology as in Section 4.1. The proposed method and the two baseline methods were tested using STOI and PESQ as objective evaluation metrics. The average results of the nine test sets are shown in Figs. 9 and 10, where “re” represents the \({T}_{60}\) in “ms” and “n” represents the SNR in “dB.”
As shown in Figs. 9 and 10, the STOI and PESQ of the proposed method are higher than those of the baseline methods in complex environments. The experimental results show an inevitable degradation of system performance as \({T}_{60}\) increases and the SNR decreases. However, even in more complex sound fields, the proposed method still achieves a robust improvement over the baseline methods.
6 Conclusion
This paper proposes a deep encoder/decoder dual-path neural network that can better model complex signals. The network can separate the clean speech of each speaker from a mixture with noise and reverberation. In addition, a new loss function, SOSI-SNR, is proposed to further improve the performance of the model. The joint loss function is extended with the STOI-based loss function to make the model more compatible with the human auditory system.
The alignment operation is proposed to reduce the sensitivity of the model to the utterance starting points and to increase the robustness of the model. Combined with the above operations, the subjective and objective evaluation metrics show that this study has better separation performance in complex sound field environments and shows superiority in various scenarios. At the same time, the model maintains a relatively small model size, which is not demanding on the recording equipment and has wide applicability.
In the future, the model can be improved by incorporating more advanced separation modules. The generalization performance of the proposed method on other unseen datasets needs to be further tested. Simultaneously, in complex scenarios, the model can be further extended to handle situations involving three or more speakers.
Availability of data and materials
Not applicable.
Abbreviations
DNNs: Deep neural networks
SOSI-SNR: Stretched optimal scale-invariant signal-to-noise ratio
SI-SNR: Scale-invariant signal-to-noise ratio
STOI: Short-time objective intelligibility
CASA: Computational Auditory Scene Analysis
ICA: Independent Component Analysis
NMF: Non-negative Matrix Factorization
PIT: Permutation invariant training
DPCL: Deep clustering
DANet: Deep attractor network
LSTM-TasNet: Long short-term memory time-domain audio separation network
Conv-TasNet: Fully-convolutional time-domain audio separation network
TCN: Temporal convolutional network
DPRNN-TasNet: Dual-path recurrent neural network time-domain separation network
DPTNet: Dual-path transformer network
SI-SNRi: Scale-invariant signal-to-noise ratio improvement
RIRs: Room impulse responses
FC: Fully-connected
LN: Layer normalization
uPIT: Utterance-level permutation invariant training
SNR: Signal-to-noise ratio
STFT: Short-time Fourier transform
WSJ0: Wall Street Journal dataset
\({T}_{60}\): Reverberation time
WHAM!: WSJ0 Hipster Ambient Mixtures
SDRi: Signal distortion ratio improvement
PESQ: Perceptual evaluation of speech quality
MUSHRA: MUlti Stimulus test with Hidden Reference and Anchor
References
A.W. Bronkhorst, The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions. Acta Acust. Acust. 86(1), 117–128 (2000)
S. Haykin, Z. Chen, The cocktail party problem. Neural Comput. 17(9), 1875–1902 (2005)
P. Comon, C. Jutten, Handbook of blind source separation: independent component analysis and applications (Academic Press, Elsevier, Burlington, 2010)
S.F. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics Speech and Signal Processing 27(2), 113–120 (1979)
K. Yoshii, R. Tomioka, D. Mochihashi, M. Goto, "Beyond NMF: time-domain audio source separation without phase reconstruction," ISMIR (2013), pp. 369–374
Y. Jia, Q. Yang, M. Jia, W. Xu, C. Bao, Multiple sound source separation via ideal ratio masking by using probability mixture model. J. Signal Process. 37(10), 1806–1815 (2021)
X. Chen, W. Wang, Y. Wang, X. Zhong, A. Alinaghi, Reverberant speech separation with probabilistic time–frequency masking for B-format recordings. Speech Commun. 68, 41–54 (2015)
M. Jia, J. Sun, C. Bao et al., Separation of multiple speech sources by recovering sparse and non-sparse components from B-format microphone recordings. Speech Commun. 96, 184–196 (2018)
P.S. Huang, M. Kim, M. Hasegawa-Johnson, P. Smaragdis, Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23(12), 2136–2147 (2015)
D.S. Williamson, Y. Wang, D. Wang, Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(3), 483–492 (2016)
D. Yu, M. Kolbæk, Z.-H. Tan, J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, New Orleans, LA, USA, 2017), pp. 241–245
J. R. Hershey, Z. Chen, J. Le Roux and S. Watanabe, "Deep clustering: discriminative embeddings for segmentation and separation," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, Shanghai, China, 2016), pp. 31–35
Z. Chen, Y. Luo and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, New Orleans, LA, USA, 2017), pp. 246–250
M. Kolbæk, D. Yu, Z.-H. Tan, J. Jensen, Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(10), 1901–1913 (2017)
Y. Luo, Z. Chen, N. Mesgarani, Speaker-independent speech separation with deep attractor network. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26(4), 787–796 (2018)
Z.-Q. Wang, J. Le Roux and J. R. Hershey, "Alternative objective functions for deep clustering," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, Calgary, AB, Canada, 2018), pp. 686–690
Y. Luo and N. Mesgarani, "TaSNet: time-domain audio separation network for real-time, single-channel speech separation," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, Calgary, AB, Canada, 2018), pp. 696–700
Y. Luo, N. Mesgarani, Conv-TasNet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27(8), 1256–1266 (2019)
Y. Luo, Z. Chen and T. Yoshioka, "Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation," 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, Barcelona, Spain, 2020), pp. 46–50
J. Chen, Q. Mao, and D. Liu, "Dual-path transformer network: direct context-aware modeling for end-to-end monaural speech separation," Interspeech 2020 (ISCA, Shanghai, China, 2020), pp. 2642–2646
N. Zeghidour, D. Grangier, Wavesplit: end-to-end speech separation by speaker clustering. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 2840–2849 (2021)
C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi and J. Zhong, "Attention is all you need in speech separation," 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, Toronto, ON, Canada, 2021), pp. 21–25
T. Cord-Landwehr, C. Boeddeker, T. Von Neumann, C. Zorilă, R. Doddipatla and R. Haeb-Umbach, "Monaural source separation: from anechoic to reverberant environments," 2022 International Workshop on Acoustic Signal Enhancement (IWAENC), Bamberg, Germany, 2022, pp. 1–5
H. Taherian, K. Tan, D. Wang, Multi-channel talker-independent speaker separation through location-based training. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, 2791–2800 (2022)
K. Tan, Y. Xu, S.-X. Zhang, M. Yu, D. Yu, Audio-visual speech separation and dereverberation with a two-stage multimodal network. IEEE Journal of Selected Topics in Signal Processing 14(3), 542–553 (2020)
D. Michelsanti et al., An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 1368–1396 (2021)
Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim and S. Watanabe, "TF-GridNet: making time-frequency domain models great again for monaural speaker separation," 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, Rhodes Island, Greece, 2023)
C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19(7), 2125–2136 (2011)
C. Xu, W. Rao, E.S. Chng, H. Li, SpEx: multi-scale time domain speaker extraction network. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 1370–1384 (2020)
M. W. Y. Lam, J. Wang, D. Su and D. Yu, "Effective low-cost time-domain audio separation using globally attentive locally recurrent networks," 2021 IEEE Spoken Language Technology Workshop (SLT), (IEEE, Shenzhen, China, 2021), pp. 801–808
B. Kadıoğlu, M. Horgan, X. Liu, J. Pons, D. Darcy and V. Kumar, "An empirical study of Conv-TasNet," 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, Barcelona, Spain, 2020), pp. 7264–7268
Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, "DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement," Interspeech 2020 (ISCA, Shanghai, China, 2020), pp. 2472–2476
S. Lv, Y. Fu, M. Xing, J. Sun, L. Xie, J. Huang, Y. Wang, and T. Yu, "S-DCCRN: super wide band DCCRN with learnable complex feature for speech enhancement," 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, Singapore, Singapore, 2022), pp. 7767–7771
J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – half-baked or well done?," 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, 2019), pp. 626–630
C. Ma, D. Li and X. Jia, "Optimal scale-invariant signal-to-noise ratio and curriculum learning for monaural multi-speaker speech separation in noisy environment," 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), (IEEE, Auckland, New Zealand, 2020), pp. 711–715
Y. Sun, L. Yang, H. Zhu, and J. Hao, "Funnel deep complex U-Net for phase-aware speech enhancement," Interspeech 2021 (ISCA, Brno, Czech Republic, 2021), pp. 161–165
A. Li, W. Liu, X. Luo, C. Zheng and X. Li, "ICASSP 2021 deep noise suppression challenge: decoupling magnitude and phase optimization with a two-stage deep network," 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, Toronto, ON, Canada, 2021), pp. 6628–6632
H. Zhang, X. Zhang and G. Gao, "Training supervised speech separation system to improve STOI and PESQ directly," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, Calgary, AB, Canada, 2018), pp. 5374–5378
S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, H. Kawai, End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26(9), 1570–1584 (2018)
Y. Zhu, X. Xu, Z. Ye, FLGCNN: a novel fully convolutional neural network for end-to-end monaural speech enhancement with utterance-based objective functions. Appl. Acoust. 170, 107511 (2020)
J.B. Allen, D.A. Berkley, Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Amer. 65, 943–950 (1979)
R. Cheng, C. Bao, Z. Cui, MASS: microphone array speech simulator in room acoustic environment for multichannel speech coding and enhancement. Appl. Sci. 10(4), 1484 (2020)
R. Scheibler, E. Bezzam and I. Dokmanić, "Pyroomacoustics: a python package for audio room simulation and array processing algorithms," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, Calgary, AB, Canada, 2018), pp. 351–355
G. Wichern et al., "WHAM!: extending speech separation to noisy environments," Interspeech 2019 (ISCA, Graz, Austria, 2019), pp. 1368–1372
S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," arXiv preprint arXiv:1412.6980 (2014)
PyTorch, "Profiler," https://pytorch.org/tutorials/recipes/recipes/profiler.html, 2020. Accessed 2020-10-21
Y. Isik, J. Le Roux, Z. Chen, S. Watanabe and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," Interspeech 2016 (ISCA, San Francisco, USA, 2016), pp. 545–549
E. Vincent, R. Gribonval, C. Fevotte, Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 14(4), 1462–1469 (2006)
Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. Rec. ITU-T P.862 (International Telecommunication Union, Geneva, Switzerland, 2001)
BS.1534, Int. Telecomm. Union, Method for the subjective assessment of intermediate quality levels of coding systems (1997)
B. Series, Method for the subjective assessment of intermediate quality level of audio systems. International Telecommunication Union Radiocommunication Assembly. (2014)
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grants (No. 61971015) and the Beijing Natural Science Foundation (No. L223033).
Funding
This work was supported by the National Natural Science Foundation of China under Grants (No. 61971015) and the Beijing Natural Science Foundation (No. L223033).
Author information
Contributions
Wang C. performed the whole research and wrote the paper. Jia M. provided support to the writing and experiments. All authors read and approved the final version of the paper.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, C., Jia, M. & Zhang, X. Deep encoder/decoder dual-path neural network for speech separation in noisy reverberation environments. J AUDIO SPEECH MUSIC PROC. 2023, 41 (2023). https://doi.org/10.1186/s13636-023-00307-5