Deep encoder/decoder dual-path neural network for speech separation in noisy reverberation environments

In recent years, speaker-independent, single-channel speech separation has made significant progress with the development of deep neural networks (DNNs). However, separating the speech of each speaker of interest from an environment that includes the speech of other speakers, background noise, and room reverberation remains challenging. To address this problem, a speech separation method for noisy reverberant environments is proposed. Firstly, a time-domain end-to-end network structure, a deep encoder/decoder dual-path neural network, is introduced for speech separation. Secondly, to keep the model from falling into a local optimum during training, a loss function named stretched optimal scale-invariant signal-to-noise ratio (SOSISNR) is proposed, inspired by the scale-invariant signal-to-noise ratio (SISNR). At the same time, to bring the training closer to the human auditory system, the joint loss function is extended based on short-time objective intelligibility (STOI). Thirdly, an alignment operation is proposed to reduce the influence of the time delay caused by reverberation on separation performance. Combining the above methods, subjective and objective evaluation metrics show that the proposed method achieves better separation performance in complex sound field environments than the baseline methods.


Introduction
Speech separation is widely known as the cocktail party problem [1,2]. Its goal is to separate the target speaker's speech from complex sound field environments (other speakers, background noise, and reverberation). Human beings have a strong speech separation capability and can recognize the target speaker's speech even in complex environments; for machine systems, however, this remains a significant challenge.
Speech separation, as an important front-end processing technique, is widely used in tasks such as hearing prostheses, mobile communication, and robust automatic speech and speaker recognition. It has received extensive attention from researchers. However, the performance of current speech separation systems still does not fully meet the requirements of human auditory perception, especially in complex sound field environments.
Speech separation has been studied for decades. In the early stages, under the assumptions that speech signals conform to a specific probability distribution (Gaussian or Laplacian) and that the background noise is stationary (its spectral characteristics do not change with time), methods such as Computational Auditory Scene Analysis (CASA) [3], Independent Component Analysis (ICA) [4], and Non-negative Matrix Factorization (NMF) [5] were adopted for speech separation. In addition, because the mask is essential in the process of speech separation, some mask estimation methods have been proposed based on probabilistic mixture models [6,7] and sparse component analysis [8]. These methods exhibit good separation performance in low-reverberation environments, but their performance decreases as the reverberation time and/or noise level increases.
With the development of deep learning, data-driven approaches have been introduced to speech separation. These methods learn features and patterns directly from the data without making any assumptions or prerequisites about the task domain. Based on this methodology, joint optimization of masking functions and deep recurrent neural networks is proposed for single-channel speech separation in the time domain [9]. Another approach involves operating in the complex domain with simultaneous enhancement of the magnitude and phase spectra to estimate the real and imaginary components of the ideal ratio mask [10].
However, two main difficulties have hindered the development of speech separation: the "permutation problem" and the "output dimension mismatch problem." To address them, the permutation invariant training (PIT) [11] method is used in the training phase of the speech separation model to resolve the uncertainty of the speaker order in the mixed signals. Specifically, PIT lists all possible permutations and uses the one with the minimum separation error to update the network. In the end, the source labels corresponding to the separated signals are obtained.
In addition, deep clustering (DPCL) [12] separates speech by calculating an embedding vector for each time-frequency bin and applying K-means clustering. Owing to its permutation-free training, it can handle multiple sound sources simultaneously, achieving speaker-independent speech separation. However, the K-means clustering used in DPCL [12] requires significant computational resources. To overcome this issue, a deep attractor network (DANet) [13] was proposed to perform mask estimation without clustering. The time-frequency bins corresponding to each source are integrated by creating attractor points in the high-dimensional space of the mixture signal. Compared with DPCL, the computational effort is significantly reduced. A variety of time-frequency mask-based separation methods have been successively proposed [14][15][16]. In order to achieve sufficient frequency resolution, phase/magnitude decoupling is inevitable for time-frequency decomposition, which results in imperfect reconstruction accuracy of the sources. In contrast, this problem can be effectively avoided when separation is performed in the time domain. Hence, a long short-term memory time-domain audio separation network (LSTM-TasNet) [17] has been proposed, where an encoder/decoder architecture is adopted to model the signal in the time domain. Furthermore, a fully convolutional time-domain audio separation network (Conv-TasNet) [18] was proposed to solve the overfitting of LSTM-TasNet by using a temporal convolutional network (TCN) structure. As an end-to-end time-domain separation network, it can model the long-term dependencies of speech signals because of its deep one-dimensional dilated convolution blocks.
However, the input mixture signal is composed of a large number of time steps. If the receptive field of a one-dimensional convolutional neural network is smaller than the sequence length, it is difficult to achieve utterance-level modeling. Therefore, a dual-path recurrent neural network time-domain separation network (DPRNN-TasNet) [19] has been proposed to model long sequences through iterative intra- and inter-chunk operations. Similarly employing the dual-path strategy, the dual-path transformer network (DPTNet) [20] utilizes a transformer module that enables long-term dependency modeling of speech signals. In addition, Wavesplit [21] achieved better speech separation performance by computing speaker vectors within a temporal window and obtaining the global vectors via clustering.
In complex sound scenarios, it is a challenge to achieve satisfactory separation performance using single-channel information. Therefore, a training model based on azimuth and distance has been proposed, using the distinct spatial locations of the speakers captured by a microphone array [24]. Moreover, multiple types of information, such as video information, have been adopted for speech separation [25,26]. These methods bring a certain improvement in separation performance, but they place higher requirements on the recording equipment.
In the most recent study, TF-GridNet [27] achieves speaker separation through complex spectral mapping, in conjunction with suitable loss functions and DNN architectures. The model once again underscores the significant potential of time-frequency-domain monaural speech separation.
In summary, motivated by the previous work, we propose a method for single-channel speech separation in complex sound field environments. End-to-end speech separation is performed using only the signal captured by a single microphone in the presence of noise and reverberant interference. The contributions of the proposed method are summarized as follows:
• A network structure of a deep encoder/decoder dual-path neural network is proposed, which enhances the model's ability to extract speech features. Experimental results demonstrate the feasibility of the method.
• A new loss function known as the stretched optimal scale-invariant signal-to-noise ratio (SOSISNR) is proposed, and experimental results show that it outperforms the scale-invariant signal-to-noise ratio (SISNR) in complex sound field environments.
• Using a multi-objective joint optimization strategy, the loss function is extended based on short-time objective intelligibility (STOI) [28] to better match the human auditory system.
• The alignment operation is proposed to reduce the model's reliance on a priori knowledge of the sound field and to increase the robustness of the model.
The rest of the paper is organized as follows. Sections II and III describe the specific implementation steps of the proposed method and the rationale. Section IV describes the experimental procedure. Finally, the conclusions and analysis of the experiments are presented in Section V.

Problem formulation
In a multi-source scenario, the signal $y(t)$ recorded by a mono microphone can be modeled in the time domain as:

$$y(t) = \sum_{i=1}^{I} s_i(t) * r_i(t) + \sigma(t), \tag{1}$$

where $i = 1, 2, \ldots, I$, and $I$ represents the number of sound sources. $t$ represents continuous time, indicating the signal's continuous variation along the time dimension. $s_i(t)$ and $r_i(t)$ denote the $i$th speaker's speech and the room impulse response (RIR) between the $i$th speaker and the microphone, respectively, and "$*$" denotes the convolution operation. $\sigma(t)$ denotes the background noise.
The aim of this paper is to estimate each speaker's speech $s_i(t)$, $i = 1, 2, \ldots, I$, from a recorded signal contaminated by noise and reverberation.
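The signal model above can be illustrated numerically. The following is a minimal sketch in discrete time, assuming random stand-ins for the speech signals and toy exponentially decaying impulse responses; the function name `simulate_mixture` and all parameter values are illustrative and not taken from the paper.

```python
import numpy as np

def simulate_mixture(sources, rirs, noise):
    """Return y = sum_i (s_i * r_i) + sigma, truncated to the source length."""
    length = len(sources[0])
    y = np.zeros(length)
    for s, r in zip(sources, rirs):
        # Full convolution has len(s) + len(r) - 1 samples; keep the first `length`.
        y += np.convolve(s, r)[:length]
    return y + noise[:length]

rng = np.random.default_rng(0)
n = 8000                                                # one second at 8 kHz
sources = [rng.standard_normal(n) for _ in range(2)]    # stand-ins for s_1, s_2

rirs = []
for _ in range(2):
    # Toy RIR (assumption): unit direct path plus an exponentially decaying tail.
    rir = 0.1 * rng.standard_normal(400) * np.exp(-np.arange(400) / 80.0)
    rir[0] = 1.0
    rirs.append(rir)

noise = 0.01 * rng.standard_normal(n)                   # stand-in for sigma(t)
y = simulate_mixture(sources, rirs, noise)
print(y.shape)
```

With unit-impulse responses and zero noise, the function reduces to a plain sum of the sources, which is a quick sanity check of the model.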
Specifically, the role of the encoder is to extract features from the mixture signal and map them to a feature space of an appropriate dimension. The features are then analyzed and processed by the separator, which separates the high-dimensional representation of each component in the mixed speech. Finally, the decoder uses the separated features to reconstruct the original speech signals. Existing works have mainly focused on the separator, with relatively less attention paid to the encoder and decoder. Linear (shallow) operators are commonly used to extract features, which can limit the expressiveness of the model and may not achieve satisfactory performance in complex sound scenarios.
In order to explore the transformation capabilities of the deep encoder/decoder structure on complex signals, this paper uses this structure to focus more on the local features of the speech signal [31]. By integrating it with an advanced dual-path neural network separation module, we aim for the model to exhibit superior separation performance. Therefore, a speech separation method for noisy and reverberant environments is proposed by combining a deep encoder/decoder and a dual-path neural network. The deep encoder/decoder structure is shown in Fig. 1.
As shown in Fig. 1, the encoder takes the single-channel signal $y(t)$ recorded by the microphone, resamples it, and arranges it as a matrix $Y \in \mathbb{R}^{1 \times S}$, where $S$ represents the number of time steps. A one-dimensional convolution operation is performed in the first layer of the encoder, expressed as follows:

$$E_1 = \mathrm{Conv}(Y, V), \tag{2}$$

where $V$ represents a kernel of size $L$ and stride $L/2$, and $\mathrm{Conv}(\cdot)$ represents a one-dimensional convolution operation. $E_1 \in \mathbb{R}^{K \times S}$ is obtained by one-dimensional convolution with the kernel $V$, where $K$ is the feature dimension of the signal. The recorded signal $y(t)$ is thus initially mapped with a linear transform. Then, starting from the second layer, the input of the $p$th layer is the output of the $(p-1)$th layer. The output of the $p$th layer can be expressed as:

$$E_p = \mathrm{PReLU}\big(\mathrm{Conv}(E_{p-1}, V_p)\big), \tag{3}$$

where $p = 2, 3, \ldots, P$ is the index of the encoder layer and $P$ is the number of deep encoder layers ($P = 4$ in this paper). The $p$th layer has a kernel $V_p$ of size 3 with stride 1 and a PReLU activation, and $E_p \in \mathbb{R}^{K \times S}$ represents the output of the $p$th layer. In this way, the recorded signal $y(t)$ is further mapped into a non-linear latent space by stacking the encoding layers. Using the output of the deep encoder, the separation module estimates the mask $M_i$ corresponding to the $i$th speaker.
The input feature $D_{0,i}$ of the decoder can be obtained from the output of the deep encoder and the mask $M_i$ as follows:

$$D_{0,i} = E \odot M_i, \tag{4}$$

where $D_{0,i} \in \mathbb{R}^{K \times S}$, and "0" denotes layer 0 of the decoder, i.e., the input of the deep decoder. $E$ is the output of the deep encoder, and $i$ represents the index of the speaker. "$\odot$" represents element-wise multiplication.
The decoder reconstructs the waveform by transforming the two-dimensional feature $D_{0,i} \in \mathbb{R}^{K \times S}$. Starting from the first layer of the decoder, the input of the $q$th layer is the output of the $(q-1)$th layer. The output of the $q$th layer can be expressed as:

$$D_{q,i} = \mathrm{PReLU}\big(\mathrm{TrConv}(D_{q-1,i}, U_q)\big), \tag{5}$$

where $q = 1, 2, \ldots, Q$ is the index of the decoder layer, and $(Q + 1)$ is the number of deep decoder layers ($Q = 3$ in this paper). $i = 1, 2, \ldots, I$, and $I$ represents the number of sound sources. $\mathrm{TrConv}(\cdot)$ represents the transposed convolution operation. The $q$th layer of the decoder has a kernel $U_q$ of size 3 with stride 1 and a PReLU activation, and $D_{q,i} \in \mathbb{R}^{K \times S}$ denotes the output of the $q$th layer. The fourth layer performs the transposed convolution with the kernel $U$ as follows:

$$\hat{s}_i = \mathrm{TrConv}(D, U), \tag{6}$$

where $U$ represents a kernel of size $L$ and stride $L/2$, $D \in \mathbb{R}^{K \times S}$ is the high-dimensional representation of the waveform reconstructed by the preceding decoder layers, and $\hat{s}_i \in \mathbb{R}^{1 \times S}$ is the reconstructed speech signal.

DPRNN separation module
The DPRNN module is used as the separator in this paper. It consists of three operations: segmentation, DPRNN block processing, and overlap-add [19]. The overall flowchart is shown in Fig. 2.
The output of the deep encoder, $E \in \mathbb{R}^{K \times S}$, is obtained according to formula (3), where $K$ and $S$ can be considered the feature dimension and the time dimension, respectively. As shown in Fig. 2, in the segmentation stage, the output of the deep encoder is segmented into $F$ chunks of equal size, each with length $H$ and hop size $\hbar_p$ ($\hbar_p = H/2$); the first and last chunks are zero-padded so that all of the encoder output can be processed. Each chunk can be represented as $T_f \in \mathbb{R}^{K \times H}$, where $f = 1, 2, \ldots, F$ is the chunk index and $F$ is the number of chunks. The $F$ chunks are combined into a three-dimensional tensor $G = [T_1, T_2, \ldots, T_F] \in \mathbb{R}^{K \times H \times F}$.
In the DPRNN block processing stage, the tensor $G$ is passed to a stack of $C$ DPRNN blocks. Each block contains an intra-block stage and an inter-block stage. Intra-block processing (local modeling) and inter-block processing (global modeling) are performed iteratively, while keeping the dimension of the tensor unchanged.
Specifically, intra-block processing starts with the intra-chunk RNN in the third dimension of the tensor, as follows:

$$W_c = \big[\,h_c(G_{c-1}[:, :, f]),\ f = 1, 2, \ldots, F\,\big], \tag{7}$$

where $c = 1, 2, \ldots, C$ is the stack index of the DPRNN block, and $C$ is the number of repetitions of the DPRNN block. $G_0 = G$, $W_c \in \mathbb{R}^{X \times H \times F}$ is the output of the intra-chunk RNN, and $h_c(\cdot)$ represents the mapping function of the intra-chunk RNN. $G_{c-1} \in \mathbb{R}^{K \times H \times F}$ is the three-dimensional speech feature tensor from the previous layer. Further, to ensure that the dimension of the tensor does not change, a linear fully-connected (FC) layer is applied, which is defined as follows:

$$\hat{W}_c = \big[\,J\,W_c[:, :, f] + a,\ f = 1, 2, \ldots, F\,\big], \tag{8}$$

where $\hat{W}_c \in \mathbb{R}^{K \times H \times F}$ is the transformed speech feature tensor, which has the same dimension as $G_{c-1}$. $J \in \mathbb{R}^{K \times X}$ is the weight of the FC layer, and $a$ is its bias.
To improve the generalization ability of the model, layer normalization (LN) is applied to the transformed speech feature tensor $\hat{W}_c$. In addition, to avoid model degradation during training, a residual connection is added between the input $G_{c-1}$ of the intra-chunk RNN and the output of the LN, as shown in formula (9):

$$\hat{G}_c = G_{c-1} + \mathrm{LN}(\hat{W}_c), \tag{9}$$

where $\hat{G}_c \in \mathbb{R}^{K \times H \times F}$ is the output of the intra-block processing as well as the input of the inter-block processing, and $\mathrm{LN}(\cdot)$ is the layer normalization operation.
The inter-block processing is performed in the second dimension of the three-dimensional tensor $\hat{G}_c$, i.e., the inter-chunk RNN is applied as follows:

$$R_c = \big[\,v_c(\hat{G}_c[:, j, :]),\ j = 1, 2, \ldots, H\,\big], \tag{10}$$

where $R_c \in \mathbb{R}^{X \times H \times F}$ is the output of the inter-chunk RNN and $v_c(\cdot)$ represents the mapping function of the inter-chunk RNN. The subsequent operations are similar to those of the intra-block processing: the inter-chunk RNN mapping is followed by the FC and LN layers. The output of the $c$th DPRNN block is represented as $G_c \in \mathbb{R}^{K \times H \times F}$. The final output of the block processing, $G_C \in \mathbb{R}^{K \times H \times F}$, is thus obtained after $C$ DPRNN blocks.
Finally, a two-dimensional convolutional layer learns the mask of each sound source, and an overlap-add operation is performed. The mask $M_i \in \mathbb{R}^{K \times S}$ corresponding to the $i$th speaker is thus obtained.
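The segmentation and overlap-add steps can be sketched as follows. This is a minimal NumPy illustration, assuming an even chunk length H, a hop of H/2, and zero-padding of half a chunk at each end (plus whatever is needed to fill the last chunk); the exact padding used in the paper is not specified, and the function names are hypothetical.

```python
import numpy as np

def segment(E, H):
    """Split E (K x S) into 50%-overlapping chunks and stack them as G (K x H x F)."""
    if H % 2:
        raise ValueError("chunk length H must be even")
    K, S = E.shape
    hop = H // 2
    # Pad `hop` zeros in front; at the end pad `hop` zeros plus enough to fill the last chunk.
    end_pad = hop + (-S) % hop
    padded = np.pad(E, ((0, 0), (hop, end_pad)))
    F = padded.shape[1] // hop - 1
    return np.stack([padded[:, f * hop:f * hop + H] for f in range(F)], axis=2)

def overlap_add(G, S):
    """Sum overlapping chunks back into a K x S matrix (inverse layout of `segment`)."""
    K, H, F = G.shape
    hop = H // 2
    out = np.zeros((K, (F + 1) * hop))
    for f in range(F):
        out[:, f * hop:f * hop + H] += G[:, :, f]
    return out[:, hop:hop + S]          # drop the padding

E = np.arange(3 * 11, dtype=float).reshape(3, 11)   # toy encoder output, K=3, S=11
G = segment(E, H=4)
restored = overlap_add(G, S=11) / 2.0               # each sample is covered by two chunks
print(G.shape, np.allclose(restored, E))
```

Because every original sample lies in exactly two chunks under this padding, a plain overlap-add of unmodified chunks returns twice the input, hence the division by 2 in the usage example.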

Joint loss function
In clean speech separation tasks, the loss function is often based on SISNR and utterance-level permutation invariant training (uPIT) [18,19], with the aim of driving the model's predictions ever closer to the original clean signal.
A new loss function, SOSISNR, is designed to cope with single-channel speech separation in complex sound field environments. Meanwhile, the loss function is extended using STOI to improve the intelligibility of the separated speech and make it more compatible with the human auditory system.

Fig. 2 The overall flowchart
Using a multi-objective joint optimization strategy, the joint loss function $L$ is expressed as follows:

$$L = L_{\mathrm{sosisnr}} + \lambda L_{\mathrm{STOI}}, \tag{11}$$

where $L_{\mathrm{sosisnr}}$ is the SOSISNR loss function, and $L_{\mathrm{STOI}}$ is the STOI-based loss function. $\lambda$ is the weight of $L_{\mathrm{STOI}}$, which is set to 2 in this paper. The derivation and analysis of the two loss functions (i.e., $L_{\mathrm{sosisnr}}$ and $L_{\mathrm{STOI}}$) are given in Sub-sections III.B and III.C, respectively.

Stretched optimal scale-invariant signal-to-noise ratio
SISNR, as a time-domain loss function, can be used to measure the similarity between the output of the model and the ground truth. It has been widely used in signal processing fields such as speech separation [18,19] and speech enhancement [32,33]. The illustration of SISNR and SOSISNR is shown in Fig. 3.
As shown in the orange part, the original clean speech $s$ is first mapped to the target signal $s_{target}$, in order to attenuate the effect of scale variations in either the original clean speech $s$ or the estimated signal $\hat{s}$. The target signal $s_{target}$ is defined in formula (12):

$$s_{target} = \frac{\langle \hat{s}, s \rangle}{\|s\|^2}\, s. \tag{12}$$

Then the distance between the estimated signal $\hat{s}$ and the target signal $s_{target}$, i.e., the noise signal $e_{noise}$, is defined as follows:

$$e_{noise} = \hat{s} - s_{target}. \tag{13}$$

Finally, the intensity ratio of the target signal $s_{target}$ to the noise signal $e_{noise}$ (SISNR) is defined as:

$$\mathrm{SISNR} = 10 \log_{10} \frac{\|s_{target}\|^2}{\|e_{noise}\|^2}. \tag{14}$$

Furthermore, $s_{target}$ and $e_{noise}$ can be rewritten as:

$$s_{target} = \|\hat{s}\| \cos\theta\, \frac{s}{\|s\|}, \tag{15}$$

$$e_{noise} = \hat{s} - \|\hat{s}\| \cos\theta\, \frac{s}{\|s\|}. \tag{16}$$

Combining formulas (14), (15), and (16), the expression of SISNR is further derived as:

$$\mathrm{SISNR} = 10 \log_{10} \frac{\cos^2\theta}{\sin^2\theta} = 10 \log_{10} \cot^2\theta, \tag{17}$$

where $\theta$ represents the angle between $\hat{s}$ and $s$. As formula (15) shows, once the original clean speech $s$ is adjusted to a suitable scale, the magnitude of $s_{target}$ no longer depends on the magnitude of $s$. Instead, $s_{target}$ is expressed in terms of the magnitude of the estimated signal $\hat{s}$ and a trigonometric function of the angle $\theta$, with no change in direction compared to $s$. As formula (17) shows, SISNR depends only on the angle $\theta$ and is independent of the magnitudes of $\hat{s}$ and $s$. The functional relationship between $\theta$ and SISNR is shown as the orange curve in Fig. 4.
In separation performance evaluation, the angle information between the separated signal and the original signal is more important than the magnitude information. Compared to the signal-to-noise ratio (SNR) [34]:

$$\mathrm{SNR} = 10 \log_{10} \frac{\|s\|^2}{\|s - \hat{s}\|^2}, \tag{18}$$

SISNR is more suitable for speech separation tasks. Specifically, by normalizing the scale of the separated speech signal $\hat{s}$, SISNR overcomes the disadvantage that SNR is susceptible to variations in the energy of the input signal. In other words, SISNR is able to evaluate the angle between the separated signal and the original signal without being affected by changes in signal energy [35,36].

Fig. 3 The illustration of SISNR and SOSISNR
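The difference between the two measures can be checked numerically. The sketch below uses the standard definitions of SNR and SI-SNR on random stand-in signals (an assumption for illustration; no speech data is involved) and rescales the estimate by a constant gain.

```python
import numpy as np

def snr(est, ref):
    """Plain SNR in dB: reference energy over error energy."""
    return 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

def sisnr(est, ref):
    """Scale-invariant SNR in dB: project est onto ref, compare target to residual."""
    s_target = np.dot(est, ref) / np.dot(ref, ref) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.sum(s_target ** 2) / np.sum(e_noise ** 2))

rng = np.random.default_rng(0)
ref = rng.standard_normal(8000)                 # stand-in for clean speech s
est = ref + 0.1 * rng.standard_normal(8000)     # stand-in for the estimate

for gain in (1.0, 0.5):
    scaled = gain * est
    print(f"gain={gain}: SNR={snr(scaled, ref):.2f} dB, SISNR={sisnr(scaled, ref):.2f} dB")
```

Halving the amplitude of the estimate lowers the SNR substantially while leaving the SISNR unchanged, which is the energy sensitivity discussed above.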
In fact, SISNR is not necessarily the optimal choice for speech separation in complex acoustic environments. As shown in the orange part of Fig. 3, the $e_{noise}$ of SISNR is not necessarily orthogonal to $\hat{s}$. Using SISNR may mislead the model in complex environments, often causing it to fall into a local optimum [35]. Meanwhile, as the orange curve in Fig. 4 shows, there are several extreme points (i.e., $0, \pm\pi$) over one period ($-\pi \sim \pi$). The closer the angle $\theta$ is to $-\pi$ or $\pi$, the larger the SISNR value, which would mean that $\hat{s}$ and $s$ are similar at such angles. Obviously, such a conclusion is incorrect [36].
In order to have only one correct extreme point in the range of $-\pi$ to $\pi$, we reconstruct a phase-corrected signal $s_\eta$. The goal is to double the period of the SISNR and thereby remove the erroneous extreme points. At the same time, in order to prevent the training from falling into a local optimum, we define a new target signal in the direction of the vector $s$. The new target signal $s'_{target}$ is obtained by scaling the original signal so that its amplitude is independent of that of the original signal. Based on the above, we propose a loss function, SOSISNR. The illustration of SOSISNR is shown in the blue part of Fig. 3. Specifically, $s_\eta$ is reconstructed based on the spatial relationship between the estimated signal $\hat{s}$ and the original speech $s$. This is achieved by keeping the magnitude of the reconstructed estimated signal $s_\eta$ equal to that of the original clean speech $s$ ($\|s_\eta\| = \|s\|$) and halving the angle between $\hat{s}$ and $s$, i.e., $\theta' = \theta/2$, where $\theta'$ represents the angle between $s_\eta$ and $s$. Meanwhile, the new target signal $s'_{target}$ is defined as follows:

$$s'_{target} = \alpha s, \tag{19}$$

where $\alpha$ represents the scale adjustment factor. Substituting formula (19) into formula (14), $\mathrm{SISNR}'$ is obtained:

$$\mathrm{SISNR}' = 10 \log_{10} \frac{\|\alpha s\|^2}{\|s_\eta - \alpha s\|^2}. \tag{20}$$

To simplify the calculation, an intermediate function $f(\cdot)$ is defined as follows:

$$f(\alpha) = \frac{\|\alpha s\|^2}{\|s_\eta - \alpha s\|^2} = \frac{\alpha^2}{\alpha^2 - 2\alpha \cos\frac{\theta}{2} + 1}. \tag{21}$$

In order to obtain the scale adjustment factor $\alpha$ corresponding to the maximum value of $f(\alpha)$, the derivative of $f(\alpha)$ is calculated and set to zero:

$$f'(\alpha) = \frac{2\alpha \left(1 - \alpha \cos\frac{\theta}{2}\right)}{\left(\alpha^2 - 2\alpha \cos\frac{\theta}{2} + 1\right)^2} = 0. \tag{22}$$

Therefore, $\alpha$ can be obtained:

$$\alpha = \frac{1}{\cos\frac{\theta}{2}}. \tag{23}$$

After adjusting the period of SISNR and recalculating the scale adjustment factor $\alpha$, with (23) and (19), the reconstructed target signal $s'_{target}$ can be rewritten as:

$$s'_{target} = \frac{s}{\cos\frac{\theta}{2}}. \tag{24}$$

Substituting formula (24) into (13), the noise $e'_{noise}$ corresponding to the reconstructed target signal $s'_{target}$ is given as follows:

$$e'_{noise} = s_\eta - \frac{s}{\cos\frac{\theta}{2}}. \tag{25}$$

Furthermore, the loss function SOSISNR is defined by combining formulas (24) and (25) as follows:

$$\mathrm{SOSISNR} = 10 \log_{10} \frac{\|s'_{target}\|^2}{\|e'_{noise}\|^2} = 10 \log_{10} \frac{1}{\sin^2\frac{\theta}{2}}. \tag{26}$$

The functional relationship between $\theta$ and the SOSISNR value is shown as the blue curve in Fig. 4. There is only one extreme point of SOSISNR for $\theta$ in the range from $-\pi$ to $\pi$. SOSISNR reaches its maximum value only when the angle $\theta$ is zero, which correctly reflects the similarity of $\hat{s}$ and $s$. Also, unlike $\mathrm{SISNR} \in (-\infty, +\infty)$, SOSISNR ranges from 0 to positive infinity.

Fig. 4 The functional relationship between θ and ratio
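The behaviour of the two measures as functions of the angle can be checked with a short numerical sketch. It assumes the closed forms SISNR = 20 log10|cot θ| and SOSISNR = -20 log10|sin(θ/2)|; the latter follows from the construction described above (a target αs with optimal α, and s_η of norm ||s|| at angle θ/2) and is a reading of the derivation rather than a formula quoted from the paper. The function names are illustrative.

```python
import numpy as np

def angle(est, ref):
    """Angle theta in [0, pi] between the estimated and reference signals."""
    c = np.dot(est, ref) / (np.linalg.norm(est) * np.linalg.norm(ref))
    return np.arccos(np.clip(c, -1.0, 1.0))

def sisnr_from_angle(theta):
    # Assumed closed form: SISNR depends on theta only through cot(theta).
    return 20 * np.log10(np.abs(np.cos(theta) / np.sin(theta)))

def sosisnr_from_angle(theta):
    # Assumed closed form: halving the angle leaves a single maximum at theta = 0.
    return -20 * np.log10(np.abs(np.sin(theta / 2)))

ref = np.array([1.0, 0.0])
est = np.array([-1.0, 0.01])        # almost exactly opposite to the reference
theta = angle(est, ref)

print(f"theta = {theta:.4f} rad")
print(f"SISNR   = {sisnr_from_angle(theta):.2f} dB")
print(f"SOSISNR = {sosisnr_from_angle(theta):.4f} dB")
```

For an estimate pointing almost opposite to the reference, the SISNR form reports a large value, whereas the SOSISNR form stays close to its minimum of 0 dB and grows only as θ approaches zero.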

STOI-based loss function
SISNR reflects the correlation between the estimated speech $\hat{s}$ and the original clean speech $s$ well. However, as a distance-based loss function, SISNR cannot directly reflect the effect of signals on human hearing [37]. In contrast, STOI is a commonly used objective evaluation metric [23] ($\mathrm{STOI} \in [0, 1]$, with a higher value representing better speech intelligibility), which is closely related to human auditory perception [38]. Moreover, it analyzes speech segments as a whole [39], which is more conducive to learning long-range context dependencies. Based on this motivation, the joint loss function is extended with the STOI-based loss function $L_{\mathrm{STOI}}$ [40], which is computed from the estimated speech $\hat{s}$ and the original clean speech $s$. Choosing an appropriate loss function is crucial for training and optimizing deep learning models. When using a joint loss function, it is necessary to ensure that the numerical range and sign of each loss term allow effective parameter updates and optimization. This avoids the problem of vanishing gradients, which occurs when the gradients of different loss functions cancel each other out. When the gradient vanishes, the network cannot effectively perform back propagation and parameter updates, and cannot be optimized in the right direction. The value of $L_{\mathrm{STOI}}$ in formula (11) ranges from 0 to 1, so it is necessary to ensure that the other loss function in the joint loss function $L$ is non-negative. SOSISNR meets exactly this condition.

Alignment operation and utterance-level permutation invariant training
This paper focuses on speech separation in complex sound field environments. The input mixture speech signals in the training process are heterogeneous, with different levels of reverberation. In order to make the network cope with complex environments, this paper proposes the alignment operation to reduce the time delay error caused by reverberation. Specifically, this method performs time alignment on the estimated speech $\hat{s}$ and the original clean speech $s$, so that the network can learn the correspondence between them more accurately and improve the separation performance.
After obtaining the joint loss function L according to formula (11), the loss function needs to be modified in order to achieve the best separation performance.The alignment operation is proposed so that the loss function can better reflect the separation performance of the model, allowing the model to learn more effective information.The alignment operation is shown in Fig. 5.
As shown in Fig. 5, the specific method is to obtain $s_\tau$ by cyclically shifting the original clean speech $s$. Here $\tau$ is an integer that represents the number of shifted samples. Then, the original clean speech $s$ is replaced by the shifted original speech $s_\tau$. The joint loss function $L$ is calculated by inputting the estimated speech $\hat{s}$ and each shifted original speech $s_\tau$ into formula (11). Comparing the loss values for all $\tau$, the minimum loss value is found as shown in formula (29):

$$L_{\min} = \min_{\tau \in \{1, 2, \ldots, S\}} L(\hat{s}, s_\tau), \tag{29}$$

where $\hat{s} \in \mathbb{R}^{1 \times S}$ and $s_\tau \in \mathbb{R}^{1 \times S}$ represent the estimated speech and the shifted original speech, respectively, and $S$ represents the length of the tensor after sampling. The value of $\tau$ ranges from 1 to $S$.
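The shift search can be sketched as follows. This is a minimal illustration, assuming `np.roll` as the cyclic shift and a negative SI-SNR as a stand-in for the joint loss L (the paper's loss also contains SOSISNR and STOI terms, which are not reproduced here); the function names are hypothetical.

```python
import numpy as np

def neg_sisnr(est, ref, eps=1e-8):
    """Stand-in loss (assumption): negative SI-SNR in dB, lower is better."""
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return -10 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))

def aligned_loss(est, ref, loss_fn=neg_sisnr):
    """Evaluate the loss for every cyclic shift tau = 1..S and return the minimum."""
    S = len(ref)
    losses = [loss_fn(est, np.roll(ref, tau)) for tau in range(1, S + 1)]
    best = int(np.argmin(losses))
    return losses[best], best + 1       # tau = S corresponds to the unshifted signal

rng = np.random.default_rng(0)
ref = rng.standard_normal(256)                              # stand-in for s
est = np.roll(ref, 5) + 0.01 * rng.standard_normal(256)     # estimate delayed by 5 samples

loss, tau = aligned_loss(est, ref)
print(f"best shift: {tau}, aligned loss: {loss:.2f}, unaligned loss: {neg_sisnr(est, ref):.2f}")
```

The exhaustive search costs S loss evaluations per signal pair, so a practical implementation would likely restrict the range of τ or use a correlation-based search; that choice is outside what the text specifies.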
The introduction of this operation enables the model to reduce its reliance on a priori knowledge of sound field environments. It enables the model to cope with different levels of reverberation (offsetting the time delay caused by reverberation) and significantly reduces the workload of aligning annotations to different speech signals. At the same time, at each epoch of the network training, the operation can be propagated forward to provide timely and appropriate feedback to the model. Furthermore, to solve the permutation ambiguity during training [39], utterance-level permutation invariant training is introduced in combination with the alignment operation. It guides the training toward a speaker-independent separation model. The updated loss function takes the minimum of the aligned joint loss over all possible assignments of the estimated signals to the original speakers.

Fig. 5 Illustration of alignment operation
Here, $i$ represents the speaker index, $i = 1, 2, \ldots, I$, and $I$ represents the number of speakers. $\hat{s}^{(i)}$ is the $i$th estimated speaker speech. $s^{\gamma(i)}$ represents the original clean speech of the $\gamma(i)$th speaker, where $\gamma(i)$ is a possible index of the original clean speech corresponding to the $i$th estimated speaker speech. $\Gamma$ is the set of all possible permutations of the $I$ speakers.
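The permutation search can be sketched as follows. This is a minimal illustration, assuming that the per-speaker losses are averaged within each permutation (the normalization is an assumption) and using a negative SI-SNR as a stand-in pairwise loss; `upit_loss` and `neg_sisnr` are hypothetical names.

```python
from itertools import permutations
import numpy as np

def neg_sisnr(est, ref, eps=1e-8):
    """Stand-in pairwise loss (assumption): negative SI-SNR in dB."""
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    return -10 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))

def upit_loss(ests, refs, pair_loss=neg_sisnr):
    """Return the minimum mean loss over all speaker permutations and the best permutation."""
    n_spk = len(ests)
    best_loss, best_gamma = float("inf"), None
    for gamma in permutations(range(n_spk)):
        # gamma[i] is the reference index assigned to the i-th estimate.
        loss = sum(pair_loss(ests[i], refs[gamma[i]]) for i in range(n_spk)) / n_spk
        if loss < best_loss:
            best_loss, best_gamma = loss, gamma
    return best_loss, best_gamma

rng = np.random.default_rng(0)
refs = [rng.standard_normal(4000) for _ in range(2)]
# The estimates come out in swapped order relative to the references.
ests = [refs[1] + 0.1 * rng.standard_normal(4000),
        refs[0] + 0.1 * rng.standard_normal(4000)]

loss, gamma = upit_loss(ests, refs)
print(gamma, round(loss, 2))
```

Any pairwise loss can be passed as `pair_loss`, including one that already performs a shift search over τ, which is how the permutation and alignment steps would combine.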

Dataset
The complex scenario in which the speech source of interest is simultaneously disturbed by other speakers, noise, and reflection components is simulated for the experiment. We simulated RIRs using an image method [41][42][43] for a rectangular room with dimensions of 7 m × 5 m × 3 m, with the microphone placed in the center of the room (3.5 m, 2.5 m, 1.5 m). Reverberant utterances from different speakers with reverberation times ($T_{60}$) of 100 ms, 200 ms, and 300 ms were randomly generated by varying the sound absorption coefficient of the walls. The speech signals from the si_tr_s set of the Wall Street Journal dataset (WSJ0) [12] are chosen as the source signals. Two reverberant utterances from different speakers were randomly selected and mixed with an SNR between -5 dB and 5 dB to generate 30 h of training data. The validation and test sets (from WSJ0 si_dt_05 and si_et_05) were generated in the same way to produce 10 h and 5 h of data, respectively. A spatial version of the WSJ0-2mix dataset [12] was generated in this way. Based on this, we paired the spatial version of WSJ0-2mix with noisy audio (including background noise scenes such as restaurants, cafés, and bars) from the WSJ0 Hipster Ambient Mixtures (WHAM!) dataset [44]. Then, we randomly mixed the speech in the WSJ0-2mix dataset with noise at three SNR levels of 5 dB, 10 dB, and 15 dB. This process was designed for speech separation tasks in environments with varying levels of background noise. All speech signals were resampled at 8 kHz.
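Mixing at a prescribed SNR amounts to rescaling the interfering signal before adding it. The sketch below shows one common way to do this, assuming SNR is defined as the ratio of mean signal power to mean interference power over the whole segment; the function name `mix_at_snr` and the random stand-in signals are illustrative and do not reproduce the dataset generation scripts.

```python
import numpy as np

def mix_at_snr(signal, interference, snr_db):
    """Scale `interference` so that the signal-to-interference power ratio equals snr_db."""
    p_sig = np.mean(signal ** 2)
    p_int = np.mean(interference ** 2)
    gain = np.sqrt(p_sig / (p_int * 10 ** (snr_db / 10)))
    return signal + gain * interference, gain

rng = np.random.default_rng(0)
speech = rng.standard_normal(8000)          # stand-in for a reverberant speech mixture
noise = 3.0 * rng.standard_normal(8000)     # stand-in for a WHAM!-style noise clip

achieved_levels = []
for snr_db in (5, 10, 15):
    mixture, gain = mix_at_snr(speech, noise, snr_db)
    achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((gain * noise) ** 2))
    achieved_levels.append(achieved)
    print(f"target {snr_db} dB -> achieved {achieved:.2f} dB (gain {gain:.3f})")
```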

Training setup
In the first layer of the deep encoder and the last layer of the deep decoder, the kernel size $L$ is set to 2, and the hop size is $L/2$. In the separation module, the number $C$ of stacked DPRNN blocks is set to 6, and a BLSTM [45] with 128 hidden units in each direction is used as the intra- and inter-block RNN.
In the STOI-based loss function, the time-frequency spectrum of the speech signal is obtained by the STFT with a Hanning window of length 1024 and a hop size of 256.
The network is trained for 100 epochs on 4-s-long segments with an initial learning rate of 2e-4. If no better results are obtained on the validation set for 3 consecutive epochs, the learning rate is halved. If the best model is not updated for 10 consecutive epochs, training is stopped early. Adam [46] is used as the optimizer, and gradient clipping with a maximum L2-norm of 5 is applied during training. To ensure fairness, all models are trained using PyTorch profiler [47] on 2 NVIDIA GeForce RTX 4090 GPU devices.
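The learning-rate and early-stopping rules can be sketched in plain Python. This is one possible reading of the schedule, assuming the learning rate is halved every third consecutive epoch without improvement and that the stale-epoch counter resets only when the validation loss improves; the function `run_schedule` and the synthetic validation losses are illustrative.

```python
def run_schedule(val_losses, lr=2e-4, lr_patience=3, stop_patience=10):
    """Replay validation losses and return the (epoch, learning rate) history."""
    best = float("inf")
    stale = 0                        # consecutive epochs without a new best loss
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale % lr_patience == 0:
                lr *= 0.5            # halve after every 3 stale epochs (assumption)
        history.append((epoch, lr))
        if stale >= stop_patience:
            break                    # early stopping after 10 stale epochs
    return history

# Synthetic run: the loss improves for two epochs and then plateaus.
val_losses = [1.0, 0.9] + [0.95] * 20
history = run_schedule(val_losses)
for epoch, lr in history:
    print(epoch, lr)
```

In this synthetic run the learning rate is halved at epochs 5, 8, and 11, and training stops at epoch 12, well before the 100-epoch limit.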

Evaluation metrics
In the experiments, four objective evaluation metrics and one subjective evaluation metric were used to evaluate the separation performance of our proposed method and the baseline methods.
The objective evaluation metrics include SISNRi [33], signal-to-distortion ratio improvement (SDRi) [48,49], perceptual evaluation of speech quality (PESQ) [50], and STOI [28]. These objective metrics are obtained by comparing the model output speech with the original clean speech. SISNRi and SDRi are energy ratios that measure the similarity between signals. PESQ scores range from -0.5 to 4.5, and STOI scores range from 0 to 1. For all four metrics, the higher the value, the better the quality of the separated signal.
The MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA) [51,52] is chosen as the subjective evaluation.The MUSHRA is conducted by asking a number of experienced listeners to rate the quality of the separated mixtures.The value of MUSHRA ranges from 0 to 100, and the higher the value, the better the quality of the separated signal.

Experimental results and analysis
The proposed method was compared with three baseline methods (Conv-TasNet, DPRNN-TasNet, and DPTNet). For the experiments, the three baseline methods were reproduced with the same structures and hyperparameter settings as in [18][19][20].
The proposed method and its ablation variants (the proposed method without the deep encoder/decoder, and the proposed method with the loss function without $L_{\mathrm{STOI}}$) were compared with the baseline methods. In addition, we conducted comparative experiments by replacing the proposed method's SOSISNR with the original SISNR.
The sizes of the aforementioned models, their computational complexities for 4-s segments, and objective evaluation metrics are shown in Table 1.
The results show that, compared to Conv-TasNet, DPRNN-TasNet, and DPTNet, the proposed method brings increases of 5.0 dB, 2.1 dB, and 0.2 dB in SISNRi and of 4.9 dB, 1.9 dB, and 0.1 dB in SDRi, respectively. This demonstrates that the proposed method achieves better speech separation in complex sound field environments. Meanwhile, the STOI results show that the proposed method brings improvements of 22.5%, 7.4%, and 3.6% over the three baseline methods, respectively. This suggests that the speech separated by the proposed method is more compatible with the human auditory system.
The effect of different configurations on the separation performance of the proposed method was analyzed by means of ablation experiments (see Table 1, rows 6 and 7). Firstly, introducing the deep encoder/decoder increases SISNRi by 1.1 dB and SDRi by 1.0 dB. This illustrates the potential of the deep encoder/decoder for better transformation of complex signals, and it suggests that combining a deep encoder/decoder with more advanced separation modules could achieve even better separation performance. Secondly, the model is optimized with a joint loss function that includes $L_{\mathrm{STOI}}$, a loss term related to human auditory perception. This effectively improves speech intelligibility and brings the training objective closer to the human auditory system.
Further, the comparative experiment (see Table 1, row 8) indicates that replacing the original SISNR with SOSISNR increases SISNRi by 0.7 dB and SDRi by 0.8 dB. This demonstrates that, compared to SISNR, SOSISNR helps to improve the performance of the speech separation system in noisy and reverberant environments.
The proposed method, along with the three baseline methods, underwent further testing in a real-world environment. Within a room measuring 4.5 m × 3.5 m × 2.8 m, a subset of data (20 recordings) was randomly selected from the WSJ0-2mix test set and captured using microphones (specifically, the measurement condenser microphone ECM8000) paired with sound cards (Depusheng md22). The room was characterized by an estimated $T_{60}$ reverberation time of 400 ms and an SNR of 8.3 dB.
The background noise consisted of external noise and vibrations within the recording room, stemming from incomplete sound insulation. The results of the real-world testing were averaged, and the objective evaluation metric SISNRi is shown in Fig. 6. The experimental results indicate that, in a real-world environment, both the proposed method and the baseline methods inevitably exhibited some decrease in separation performance. However, the proposed method still outperformed the three baseline methods in the real-world setting.
To visualize the performance of the proposed method against the three baseline methods, the separation results for a randomly sampled segment of a speech mixture are presented as spectrograms in Fig. 7. This mixture (Fig. 7f) is disturbed by café background noise with an SNR of -5 dB and room reverberation with a $T_{60}$ of 300 ms. Figure 7a is the spectrogram of the original clean signal. Figure 7b, c, d, and e show the spectrograms of a separated source processed by the four methods, respectively.
From Fig. 7, it can be seen that all four methods can separate the sound sources. However, the separation performance of the proposed method, DPTNet, and DPRNN-TasNet is significantly better than that of Conv-TasNet. This is because Conv-TasNet uses a fixed context length, which limits its long-term tracking of the speaker and its generalization to complex sound field environments. Furthermore, as shown by the highlighted white and green dashed boxes (Fig. 7a-d), the proposed method provides a better restoration of the harmonic components. This indicates that our method achieves better separation performance.
To evaluate the proposed method subjectively, a MUSHRA listening test was conducted with the participation of 20 experienced listeners. The proposed method and the three baseline methods were used to process 18 randomly selected mixture signals from the test set. According to the MUSHRA specifications, each experiment included a hidden reference and a 3.5 kHz low-pass filter anchor. The results of the MUSHRA listening test with 95% confidence intervals are shown in Fig. 8. From Fig. 8, it can be seen that the MUSHRA scores of the proposed method are higher than those of the baseline methods, which means that the proposed method provides a better auditory experience for listeners.
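The paper does not state how the 95% confidence intervals were computed. As a hedged illustration only, the sketch below uses a normal approximation to the interval of the mean; the function name `mean_with_ci` is hypothetical, and any scores passed to it here are illustrative placeholders, not data from the listening test.

```python
from statistics import NormalDist, mean, stdev


def mean_with_ci(scores, confidence=0.95):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    n = len(scores)
    m = mean(scores)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Standard error of the mean from the sample standard deviation.
    half_width = z * stdev(scores) / n ** 0.5
    return m, m - half_width, m + half_width
```

With only 20 listeners, an interval based on the Student t distribution would be slightly wider than this approximation.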
These methods were further evaluated at different noise and reverberation levels. To this end, test sets for nine different acoustic environments were generated using the same methodology as in subsection IV.A. The proposed method and the two baseline methods were tested using STOI and PESQ as objective evaluation metrics. The average results on the nine test sets are shown in Figs. 9 and 10, where "re" represents the $T_{60}$ in ms and "n" represents the SNR in dB. As shown in Figs. 9 and 10, the STOI and PESQ of the proposed method are higher than those of the three baseline methods in complex environments. The experimental results show an inevitable degradation of system performance with increasing $T_{60}$ and decreasing SNR. However, even in more complex sound fields, the proposed method still achieves a robust improvement over the two baseline methods.

Fig. 9 Results of PESQ

Conclusion
This paper proposes a deep encoder/decoder dual-path neural network that can better model complex signals. The network can separate the clean speech of each speaker from a mixture with noise and reverberation. In addition, a new loss function, SOSISNR, is proposed to further improve the performance of the model. The joint loss function is extended with the STOI-based loss function to make the model more compatible with the human auditory system.
The alignment operation is proposed to reduce the sensitivity of the model to the utterance starting points and to increase the robustness of the model. With these components combined, the subjective and objective evaluation metrics show that the proposed method achieves better separation performance in complex sound field environments and shows superiority in various scenarios. At the same time, the model maintains a relatively small model size, which is not demanding on the recording equipment and has wide applicability.
In the future, the model can be improved by incorporating more advanced separation modules. The generalization performance of the proposed method on other unseen datasets needs to be further tested. In addition, the model can be extended to handle complex scenarios involving three or more speakers.

Fig. 1 The deep encoder/decoder structure

a. Remove silent frames from the estimated speech $\hat{s}$ and the original clean speech $s$.
b. Apply the short-time Fourier transform (STFT) to obtain the corresponding representation in the time-frequency domain.
c. Perform a one-third octave band analysis.
d. Perform normalization and clipping to compensate for the global level difference and improve the stability of the STOI.
e. Measure intelligibility. The intermediate intelligibility $\zeta_{b,n}$ is defined as the correlation coefficient between the two temporal envelopes, where $b = 1, 2, \ldots, B$ and $n = 1, 2, \ldots, N$ are the indexes of the one-third octave bands and the short-time temporal envelope vectors, respectively, and $B$ and $N$ are the numbers of one-third octave bands and short-time temporal envelope vectors. $\hat{s}_{b,n}$ and $s_{b,n}$ represent the short-time spectrogram vectors of the estimated speech $\hat{s}$ and the original clean speech $s$, respectively, and $m(\cdot)$ is the sample mean of the corresponding vector. Ultimately, $L_{\mathrm{STOI}}$ is obtained by averaging the intermediate intelligibility over all bands and short-time temporal envelope vectors.
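The two equations referred to in step e are not reproduced in the text above. Assuming the standard STOI formulation, in which the intermediate intelligibility is a sample correlation coefficient, they could be written as follows. The hats on the estimated quantities and the sign convention of $L_{\mathrm{STOI}}$ are assumptions: a loss that is minimized would typically use the negative of this average, and the sign cannot be recovered from the surrounding text.

```latex
% Sketch under the standard STOI definition; not confirmed against the original equations.
\zeta_{b,n} =
  \frac{\left(\hat{s}_{b,n} - m(\hat{s}_{b,n})\right)^{\top}
        \left(s_{b,n} - m(s_{b,n})\right)}
       {\left\lVert \hat{s}_{b,n} - m(\hat{s}_{b,n}) \right\rVert_{2}
        \left\lVert s_{b,n} - m(s_{b,n}) \right\rVert_{2}}

% Average over all B one-third octave bands and N short-time envelope vectors.
L_{\mathrm{STOI}} = \frac{1}{BN} \sum_{b=1}^{B} \sum_{n=1}^{N} \zeta_{b,n}
```

Each $\zeta_{b,n}$ lies in $[-1, 1]$, so the average is bounded in the same range.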

Table 1 Performance comparison
Fig. 6 Results of SISNRi in a real-world environment