Nonlinear residual echo suppression based on dual-stream DPRNN

The acoustic echo cannot be entirely removed by linear adaptive filters due to the nonlinear relationship between the echo and the far-end signal, so a post-processing module is usually required to further suppress the residual echo. In this paper, we propose a residual echo suppression method based on a modification of the dual-path recurrent neural network (DPRNN) to improve the quality of speech communication. Both the residual signal of the linear acoustic echo cancelation and an auxiliary signal, either the far-end signal or the output of the adaptive filter, are adopted to form a dual-stream input for the DPRNN. We validate the efficacy of the proposed method in the notoriously difficult double-talk situations and discuss the impact of different auxiliary signals on performance. We also compare the performance of time-domain and time-frequency-domain processing. Furthermore, we propose an efficient and practical way to deploy our method on off-the-shelf loudspeakers by fine-tuning the pre-trained model with a small amount of recorded-echo data.


Introduction
The acoustic echo is generated by the coupling between the loudspeaker and the microphone in full-duplex hands-free telecommunication systems or smart speakers. It severely deteriorates the quality of speech communication and significantly degrades the performance of automatic speech recognition (ASR) in smart speakers. Typical linear acoustic echo cancelation (LAEC) methods use adaptive algorithms to identify the impulse response between the loudspeaker and the microphone [1]. Time-domain least mean square (LMS) algorithms [2,3] are often employed in delay-sensitive situations, while frequency-domain LMS algorithms are often utilized to guarantee both fast convergence and low computational load [2]. The frequency-domain adaptive Kalman filter (FDKF) [4] is also a commonly used method, with several efficient variations proposed recently [5,6]. The performance of LAEC methods severely degrades when nonlinear distortion in the acoustic echo path is non-negligible [7]. Usually, a residual echo suppression (RES) module is required to further suppress the echo. RES is usually conducted by estimating the spectrum of the residual echo based on the far-end signal, the filter coefficients, and the residual signal of the LAEC [8][9][10][11][12][13]. However, it is difficult for signal-processing-based RES to balance well between residual echo attenuation and near-end speech distortion.
Recently, deep neural networks (DNNs) have been introduced into RES due to their powerful capability of modeling nonlinear systems, including both time-domain and time-frequency (TF) domain methods. TF-domain methods adopt the short-time Fourier transform (STFT) to extract spectral features. The fully connected network (FCN) was employed to exploit multiple input signals in RES [14]. The bidirectional or unidirectional recurrent neural network (RNN) was also introduced to RES [15][16][17]. These methods ignore the coupling between magnitude and phase and are unable to recover the phase information, leading to limited performance [18]. Inspired by the fully convolutional time-domain audio separation network (Conv-TasNet) [18], we proposed an RES method based on the multi-stream Conv-TasNet, where both the residual signal of the LAEC system and the output of the adaptive filter are adopted to form multiple streams [19]. The benefit of introducing auxiliary signals into the network was validated by simulations. However, the model employs a complicated network structure and is not efficient enough in exploiting the information of multiple streams, resulting in a large number of parameters, which restricts its practical application. Moreover, the benefit of multiple streams has yet to be validated by experiments on off-the-shelf loudspeakers.
The dual-path recurrent neural network (DPRNN) [20] was recently proposed for the speech separation task and achieves state-of-the-art (SOTA) performance on the WSJ0-2mix dataset. It utilizes an encoder module for feature extraction and employs RNNs for time series modeling. To overcome the inefficiency of RNNs in modeling long sequences, DPRNN splits the long sequential input into smaller chunks and applies intra- and inter-chunk operations iteratively. Compared with Conv-TasNet, DPRNN shows superiority in both performance and parameter count [20]. Moreover, its RNN-based structure has advantages over Conv-TasNet in memory consumption for online processing.
In this paper, we extend our previous work on the multi-stream Conv-TasNet. We adopt the residual signal of the LAEC and an auxiliary signal to create two streams, and propose two DPRNN-style networks, in the time domain and the TF domain respectively, to effectively exploit their information. To validate the efficacy of our proposed RES methods, we compare them with several typical methods on both an artificial-echo dataset and a recorded-echo dataset. Furthermore, we regard the model trained on the artificial-echo dataset as a pre-trained model and fine-tune it on the recorded-echo dataset. Different fine-tuning strategies are investigated to achieve a balance between performance and training cost.

Problem formulation
The AEC system with RES post-filter is depicted in Fig. 1, where x(n) is the far-end signal, ŷ(n) is the output of the adaptive filter, and H(z) represents the echo path transfer function. The microphone signal d(n), consisting of the echo y(n), the near-end speech s(n), and the background noise v(n), can be expressed as

d(n) = y(n) + s(n) + v(n).

The residual signal of the LAEC, s_AEC(n), is given by subtracting the output of the adaptive filter ŷ(n) from the microphone signal d(n), with

s_AEC(n) = d(n) − ŷ(n) = d(n) − ĥ(n) * x(n),

where ĥ(n) denotes the impulse response of the adaptive filter and * represents the convolution operation. Due to the inevitable nonlinearity in the echo path, the LAEC cannot perfectly attenuate the echo, and s_AEC(n) can be regarded as a mixture of the residual echo, the background noise, and the near-end signal. The RES can be designed from the viewpoint of speech separation but, unlike the standard speech separation problem, the auxiliary information extracted from the adaptive filter can be exploited to improve the performance. In this paper, we employ s_AEC(n) together with an auxiliary signal, x(n) or ŷ(n), to construct a dual-stream DPRNN (DSDPRNN).


Model design

Figure 2 outlines the structure of our proposed DPRNN-based RES method, which consists of two encoder modules, a suppression module, and a decoder module. The two encoder modules extract features from the two input streams. The decoder transforms the output of the suppression module into masks and converts the masked feature back to the waveform. The difference between the time-domain and the TF-domain methods mainly lies in the encoder and the decoder, while the structure of the suppression module is the same.

Figure 3 shows the structure of the encoder and the decoder in the time-domain method. The encoder takes a time-domain waveform u as input and converts it into a time series of N-dimensional representations using a 1-D convolutional layer with kernel size L and 50% overlap, followed by a ReLU activation function,

W = ReLU(Conv1D(u)),

where W ∈ R^{G×N} with length G is the output of the operation. Then, W is transformed into C-dimensional representations by a fully connected layer f_FC1 and divided into T = 2G/K − 1 chunks of length K, where the overlap between chunks is 50%. All chunks are then stacked together to form a 3-D tensor W ∈ R^{T×K×C}. The decoder applies the overlap-add operation to the output of the suppression module Y_s ∈ R^{T×K×C}, followed by a PReLU activation [21], to form the output Q ∈ R^{G×C},

Q = PReLU(f_OA(Y_s)).

Then, an N-dimensional fully connected layer f_FC2 with a ReLU activation is applied to Q to obtain the mask of W, and the estimate of the clean speech representation Ŝ is obtained by

Ŝ = ReLU(f_FC2(Q)) ⊙ W,

where f_FCi, i = 1, 2, represents a fully connected layer, f_OA represents the overlap-add operation, and ⊙ denotes element-wise multiplication. A 1-D transposed convolution layer is utilized to convert the masked representation back to the waveform signal ŝ.

The intra-chunk operation of DPRNN can also be applied in the frequency domain. Figure 4 shows the structure of the encoder and the decoder in the TF-domain method. We first obtain the TF representation Z ∈ C^{T′×F} by the STFT operation with a Q-point Hamming window and 50% overlap, where F = Q/2 + 1 is the number of effective frequency bins. We concatenate the real and imaginary components of Z to form a 3-D tensor Z′ ∈ R^{T′×F×2}. The 3-D representation W′ ∈ R^{T′×K′×C′} is then obtained by a 2-D convolutional layer with C′ output channels, where the kernel size is 5 × 5, the stride is 1 × 2, and K′ is the number of down-sampled frequency bins. The frame length, the chunk size, and the feature dimension T′, K′, C′ correspond to T, K, C in the time-domain encoder respectively, and the output is further processed by the same suppression module. The decoder takes the output of the suppression module Y_s as input and successively applies two fully connected layers, followed by a PReLU and a ReLU activation respectively, to form the output Q′ ∈ R^{T′×K′×C′}.
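The chunking and overlap-add steps described above can be sketched in plain NumPy. This is an illustrative sketch, not the authors' code; the function names and the assumption that G is a multiple of K/2 are ours:

```python
import numpy as np

def split_chunks(W, K):
    """Split a (G, C) feature sequence into 50%-overlapping chunks of
    length K, stacked into a (T, K, C) tensor with T = 2G/K - 1.
    Assumes G is a multiple of K/2 (padding omitted for clarity)."""
    G, C = W.shape
    hop = K // 2
    T = 2 * G // K - 1
    chunks = [W[t * hop : t * hop + K] for t in range(T)]
    return np.stack(chunks)  # shape (T, K, C)

def overlap_add(Y):
    """Inverse layout operation: overlap-add a (T, K, C) tensor back
    to a (G, C) sequence with 50% overlap between chunks."""
    T, K, C = Y.shape
    hop = K // 2
    G = (T + 1) * hop  # follows from T = 2G/K - 1
    out = np.zeros((G, C))
    for t in range(T):
        out[t * hop : t * hop + K] += Y[t]
    return out
```

Overlapping samples are summed in `overlap_add`, so interior samples are counted twice; in the actual decoder this is absorbed by the subsequent learned layers.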
Then, Q′ is processed by two independent 2-D transposed convolutional layers, called Trans Conv_A and Trans Conv_P, with kernel size 5 × 5 and stride 1 × 2. Trans Conv_A with a ReLU activation function is utilized to estimate the mask of the TF bins,

A = ReLU(f_TC^A(Q′)),   (8)

and Trans Conv_P, followed by a normalization operation for each TF bin, is employed to estimate the real and imaginary parts of the phase information,

P = Norm(f_TC^P(Q′)),   (9)

where f_TC^A, f_TC^P, and Norm represent the functions of Trans Conv_A, Trans Conv_P, and the per-bin normalization operation respectively. We use 1_{I1×I2×...×IM}, ∘, and ×_i to denote an all-ones tensor, the outer product, and the mode-i product [22]. The outer product between the tensor H ∈ R^{I1×I2×...×IM} and the vector g ∈ R^J is defined as

(H ∘ g)_{i1,...,iM,j} = H_{i1,...,iM} g_j,

and the mode-i product between the tensor H and the matrix D ∈ R^{Ii×J} is defined as

(H ×_i D)_{i1,...,i(i−1),j,i(i+1),...,iM} = Σ_{ii=1}^{Ii} H_{i1,...,iM} D_{ii,j}.

Finally, the spectrogram of the output signal Ŝ is estimated by

Ŝ = ((A ⊙ |Z|) ∘ 1_2) ⊙ P,

where |Z| ∈ R^{T′×F} is the magnitude spectrogram of the mixture. Similar to the operation in [23], A and P in Eqs. 8 and 9 act as the amplitude mask and the phase prediction result respectively. After that, an inverse STFT operation is applied to convert Ŝ back to the waveform signal ŝ.
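As an illustration of how an amplitude mask and a per-bin normalized phase estimate can be combined with the mixture spectrogram, here is a minimal NumPy sketch. The function name and the exact combination rule are our assumptions, following the general amplitude-mask/phase-prediction scheme of [23]:

```python
import numpy as np

def apply_mask_and_phase(Z, A, P, eps=1e-8):
    """Combine an amplitude mask A (T, F) and an unnormalized phase
    estimate P (T, F, 2: real/imag) with the complex mixture
    spectrogram Z (T, F): the magnitude of Z is scaled by A and the
    phase is taken from P after per-TF-bin normalization."""
    # Normalize the (real, imag) phase estimate to unit length per bin
    norm = np.sqrt(P[..., 0] ** 2 + P[..., 1] ** 2) + eps
    phase = (P[..., 0] + 1j * P[..., 1]) / norm
    return A * np.abs(Z) * phase
```

The per-bin normalization guarantees that `phase` carries no magnitude information, so the amplitude is controlled entirely by the mask A.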
The suppression module consists of six DSDPRNN blocks, each of which contains two dual-stream RNN (DSRNN) blocks corresponding to intra-chunk and inter-chunk processing respectively. Figure 5 presents the structure of the proposed DSRNN block, where each stream is successively processed by an RNN layer, a fully connected layer, and a normalization layer. The RNN layer in each intra-chunk block is a bidirectional RNN applied along the chunk dimension with C/2 output channels per direction, while the RNN layer in each inter-chunk block is a unidirectional RNN with C output channels applied along the frame dimension. Let V^0_i ∈ R^{T×K×C} denote the input tensor of stream i, i ∈ {A, B}. The output of the RNN layer, V^1_i, can be expressed as

V^1_i = f_RNN_i(V^0_i),

where f_RNN_i represents the function of the RNN layer. The features in V^1_A and V^1_B are then mixed by

V^2_A = V^1_A + α ⊙ V^1_B,   V^2_B = V^1_B + β ⊙ V^1_A,

where α, β ∈ R^C are trainable parameters. The output V^2_i is concatenated with the corresponding raw input V^0_i and then processed by a fully connected layer with C output channels. V^3_i is obtained with a residual connection and can be formulated as

V^3_i = f_FC_i([V^2_i, V^0_i]) + V^0_i,

where [·, ·] represents the concatenation operation. The concatenation and projection are applied along the chunk dimension in intra-chunk blocks and along the feature dimension in inter-chunk blocks. The output V^4_i of the DSRNN block is then obtained by applying a normalization layer to V^3_i,

V^4_i = f_Norm_i(V^3_i),

where f_Norm_i denotes the function of the normalization layer, except in the last DSDPRNN block, where V^3_i serves as the output. The features of streams A and B are processed iteratively by the intra-chunk and the inter-chunk DSRNN blocks, and the output of stream A in the last DSDPRNN block is regarded as the output of the suppression module.
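A minimal PyTorch sketch of one intra-chunk DSRNN block, under our reading of the description above. The cross-stream mixing rule, layer names, and initialization are assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class DSRNNBlock(nn.Module):
    """Sketch of a dual-stream RNN (DSRNN) block, intra-chunk variant:
    per-stream bidirectional GRU, trainable cross-stream mixing,
    concatenation with the raw input, projection with a residual
    connection, and a 2-group normalization."""

    def __init__(self, C):
        super().__init__()
        # Bidirectional GRU with C/2 channels per direction -> C outputs
        self.rnn_a = nn.GRU(C, C // 2, batch_first=True, bidirectional=True)
        self.rnn_b = nn.GRU(C, C // 2, batch_first=True, bidirectional=True)
        # Trainable per-channel mixing gains (assumed form of the mixing)
        self.alpha = nn.Parameter(torch.zeros(C))
        self.beta = nn.Parameter(torch.zeros(C))
        self.fc_a = nn.Linear(2 * C, C)
        self.fc_b = nn.Linear(2 * C, C)
        self.norm_a = nn.GroupNorm(2, C)
        self.norm_b = nn.GroupNorm(2, C)

    def forward(self, v0_a, v0_b):
        # v0_*: (batch * T, K, C) chunked features of streams A and B
        v1_a, _ = self.rnn_a(v0_a)
        v1_b, _ = self.rnn_b(v0_b)
        # Cross-stream mixing with trainable gains
        v2_a = v1_a + self.alpha * v1_b
        v2_b = v1_b + self.beta * v1_a
        # Concatenate with the raw input, project to C, residual add
        v3_a = self.fc_a(torch.cat([v2_a, v0_a], dim=-1)) + v0_a
        v3_b = self.fc_b(torch.cat([v2_b, v0_b], dim=-1)) + v0_b
        # GroupNorm expects (N, C, L), hence the transposes
        v4_a = self.norm_a(v3_a.transpose(1, 2)).transpose(1, 2)
        v4_b = self.norm_b(v3_b.transpose(1, 2)).transpose(1, 2)
        return v4_a, v4_b
```

An inter-chunk variant would use unidirectional GRUs with C hidden units and run along the frame dimension instead.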
We use Group Normalization [24] with a group number of 2. The input feature of the normalization layer X ∈ R^{T×K×C} is first divided into two groups along the feature dimension,

X = [X_1, X_2],   X_i ∈ R^{T×K×C/2},

and the output is formulated as

Y = [Y_1, Y_2],   (Y_i)_{l,k,c} = γ_{i,c} · ((X_i)_{l,k,c} − μ_i) / √(σ_i² + ε) + β_{i,c},

with

μ_i = (2/(TKC)) Σ_{l,k,c} (X_i)_{l,k,c},   σ_i² = (2/(TKC)) Σ_{l,k,c} ((X_i)_{l,k,c} − μ_i)²,

where the subscripts l, k, c denote the indices of the 3-D tensor, γ_i, β_i ∈ R^{C/2} are trainable parameters, and ε is a small constant for numerical stability.
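The two-group normalization can be sketched in plain NumPy as follows; computing the statistics globally over each group is a simplifying assumption of this sketch:

```python
import numpy as np

def group_norm_2(X, gamma, beta, eps=1e-5):
    """Group Normalization with 2 groups over a (T, K, C) tensor: the
    channel dimension is split in half and each half is normalized by
    its own mean and variance, then scaled/shifted per channel.
    gamma, beta: arrays of shape (2, C/2)."""
    T, K, C = X.shape
    Y = np.empty_like(X)
    for i, sl in enumerate((slice(0, C // 2), slice(C // 2, C))):
        Xi = X[:, :, sl]
        mu, var = Xi.mean(), Xi.var()
        Y[:, :, sl] = gamma[i] * (Xi - mu) / np.sqrt(var + eps) + beta[i]
    return Y
```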

Dataset
Unlike telecommunication systems, where the far-end signal is usually speech, music often acts as the "far-end" signal for smart loudspeakers. Therefore, we use both speech and music as the far-end signal, and the near-end signal is speech. We choose LibriSpeech [25] as the speech dataset and MUSAN [26] as the music dataset. We randomly choose 225, 25, and 40 different speakers from LibriSpeech, and 497, 48, and 115 pieces of music from MUSAN for training, validation, and test respectively. The audio data is sampled at 16 kHz and split into 4-s segments. In total, we use 26,556, 1083, and 920 segments of 4-s speech and 101,956, 1083, and 920 segments of 4-s music for training, validation, and test respectively. The clipping function and the sigmoidal function, although not precise models of the actual nonlinearity of loudspeakers, are commonly utilized numerical models in many previous works on nonlinear acoustic echo suppression [15,17]. Thus, the clipping function, the sigmoidal function, and the convolution operation are successively applied to the far-end signal to generate the simulated echo. The clipping function is either a soft-clipping or a hard-clipping function [27],

Clip_soft(x(n)) = x_max · x(n) / √(x_max² + x²(n)),
Clip_hard(x(n)) = sign(x(n)) · min(|x(n)|, x_max),

where x_max = η · max(abs(x(n))) determines the maximum value of the clipping function. Three types of soft-clipping and three types of hard-clipping functions are utilized, with the parameter η set to 0.6, 0.8, and 0.9.
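The clipping nonlinearities can be sketched as follows. The soft-clipping expression shown here is one common form and may differ in detail from the one used in [27]; the parameter name `eta` is ours:

```python
import numpy as np

def clip_hard(x, eta=0.8):
    """Hard clipping: saturate the signal at x_max = eta * peak."""
    x_max = eta * np.max(np.abs(x))
    return np.clip(x, -x_max, x_max)

def clip_soft(x, eta=0.8):
    """Soft clipping (one common form): smoothly saturates toward
    +/- x_max instead of cutting off abruptly."""
    x_max = eta * np.max(np.abs(x))
    return x_max * x / np.sqrt(x_max ** 2 + x ** 2)
```

With `eta` in {0.6, 0.8, 0.9} this yields the six clipping variants mentioned above.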
The frequency-domain Kalman filter [4] acts as the LAEC to generate the residual echo, and the mean of its echo attenuation on the artificial-echo dataset is about 17.0 dB. To obtain the simulated s_AEC(n), we add both the clean speech signal and colored noise to the residual echo. The inverse frequency power of the colored noise [30] is randomly chosen between 0 and 2. For the training and validation sets, the signal-to-echo ratio (SER) (before LAEC processing) is randomly chosen from {−14.2, −16.2, −18.2, −20.2} dB, and the colored noise is added with a signal-to-noise ratio (SNR) randomly chosen from {30, 20, 10} dB. For the test set, the SER is −18.2 dB and the SNR is 20 dB.
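A minimal sketch of how an SER or SNR condition like the ones above can be realized when mixing two signals; `mix_at_ratio` is a hypothetical helper, not part of the authors' dataset pipeline:

```python
import numpy as np

def mix_at_ratio(target, interferer, ratio_db):
    """Scale `interferer` so that 10*log10(P_target / P_interferer)
    equals ratio_db, then return the mixture and the applied scale.
    Works for both SER (interferer = echo) and SNR (interferer = noise)."""
    p_t = np.mean(target ** 2)
    p_i = np.mean(interferer ** 2)
    scale = np.sqrt(p_t / (p_i * 10 ** (ratio_db / 10)))
    return target + scale * interferer, scale
```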
In total, we generate 106,224 segments of speech residual echo and 101,956 segments of music residual echo for training, 1083 segments of speech residual echo and 1083 segments of music residual echo for validation, and 920 segments of speech residual echo and 920 segments of music residual echo for test.
The approach to generating the artificial nonlinear echo is only a rough approximation of the nonlinearity of a real loudspeaker. To evaluate the performance of our model in practical applications, we also record echo signals from off-the-shelf loudspeakers using an AcousticSensing CHZ-221 microphone. A pair of EDIFIER R12U (ER) loudspeakers and a pair of LOYFUN LF-501 (LL) loudspeakers are used to record the echo signals in an office with room size 6 m × 6 m × 3.2 m. The recording environment is shown in Fig. 6. For each loudspeaker model, we obtain 10,800 segments of 4-s recorded-echo signals (5400 segments each of speech and music) from one loudspeaker for training and 1840 segments (920 segments each of speech and music) from another loudspeaker of the same model for test.

Experiment configuration
We control the parameter number and processing delay of the time-domain and the TF-domain methods for a fair comparison. For the time-domain method, the number of filters N, the kernel size L, the chunk size K, and the feature dimension C in the encoder are 256, 8, 100, and 128 respectively. For the TF-domain method, the frame length Q, the number of down-sampled frequency bins K′, and the feature dimension C′ in the encoder are 400, 99, and 128 respectively. Thus, the encoder tensors in the time-domain and the TF-domain methods are of dimensions T × 100 × 128 and T′ × 99 × 128 respectively. The gated recurrent unit (GRU) [31] is used as the RNN layer.
The model is trained with the Adam optimizer [32] for 80 epochs, with each epoch containing 26,556 pairs of training data and each batch containing 8 pairs. The initial learning rate is set to 0.001 and is halved every time the validation loss does not improve in two successive epochs. We apply l2-norm gradient clipping with a maximum of 5. PyTorch is employed for the model implementation, and four Nvidia GeForce GTX 1080Ti GPUs are used for training.
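The optimization setup described above can be sketched in PyTorch as follows; the model here is a stand-in placeholder, not the DSDPRNN:

```python
import torch
from torch import nn, optim

# Adam with lr 1e-3, halved when the validation loss stalls for two
# epochs, plus l2-norm gradient clipping at 5 (matching the text).
model = nn.Linear(16, 16)  # placeholder for the DSDPRNN model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2)

def train_step(batch, target, loss_fn=nn.MSELoss()):
    optimizer.zero_grad()
    loss = loss_fn(model(batch), target)
    loss.backward()
    # l2-norm gradient clipping with a maximum of 5
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    return loss.item()
```

After each epoch, `scheduler.step(val_loss)` is called with the validation loss to trigger the halving.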

Evaluation metrics
We use three metrics for performance evaluation: the perceptual evaluation of speech quality (PESQ) [33], the signal-to-distortion ratio (SDR) [34,35], and the short-time objective intelligibility (STOI) [36]. The echo return loss enhancement (ERLE) of the DNN-based methods in single-talk situations has been shown to be sufficiently high in previous work [19]. In this paper, we pay particular attention to RES performance in the most difficult low-SER double-talk situations, where the PESQ, SDR, and STOI are better choices than the ERLE since they more effectively evaluate the quality of the processed near-end speech. Furthermore, the desired signal is the near-end speech in most AEC scenarios, while the interference might be either speech from the far end, as in common communication applications, or music played by the smart speaker. Therefore, we use the PESQ instead of the perceptual evaluation of audio quality as an objective metric to measure the quality of the processed near-end speech.
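For reference, a simplified SDR can be computed as the ratio of reference energy to error energy; note that the BSS Eval SDR [34,35] used in the evaluation additionally allows certain distortion filters, so this sketch is not a drop-in replacement:

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Simplified signal-to-distortion ratio in dB: energy of the
    reference divided by the energy of the estimation error."""
    err = reference - estimate
    return 10 * np.log10(
        (np.sum(reference ** 2) + eps) / (np.sum(err ** 2) + eps))
```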

Performance comparison
We compare the proposed methods with several typical DNN-based RES methods to validate the efficacy of our model. In the following comparison, we name our proposed methods DSDPRNN. We further use the suffixes "t" and "f" to represent the time-domain and the TF-domain methods respectively, and use the suffixes "x" and "y" to distinguish between the models in which x(n) or ŷ(n) is used as the auxiliary signal. The LSTM-based model (LSTM) [17] and the multi-stream Conv-TasNet model (MSTasNet) are utilized for comparison. The models [14,15], which showed significantly inferior performance in our previous work [19], are excluded from this comparison. The total number of trainable parameters and the multiply-accumulate operations per second (MACCPs) of these models are shown in Table 1. The model size of our proposed methods is only 1/5 of that of MSTasNet, and the computational cost is also slightly lower.
The time latency of MSTasNet is set to 410 samples for a fair comparison. The performance in terms of PESQ, SDR, and STOI is shown in Table 2. The DSDPRNN methods outperform the LSTM and the MSTasNet in all artificial-echo conditions, validating that our proposed methods provide an efficient way to exploit the information of the dual stream. For recorded echo, the advantage of the DSDPRNN methods over MSTasNet is less obvious, but their generalization capability in practical applications is still validated. The comparison between the time-domain and the TF-domain methods shows that the former tends to achieve slightly better SDR scores, while the latter performs slightly better in terms of PESQ and STOI. Furthermore, we observe that the methods with the auxiliary signal ŷ(n) achieve better attenuation of the recorded echo, implying better generalization capability compared with the methods using x(n) as the auxiliary signal.

Fine-tuning for off-the-shelf loudspeakers
Though the proposed methods generalize well to real loudspeakers, better performance can be expected by training on echo recorded from the target loudspeakers. In practice, the model trained on the artificial-echo dataset can be regarded as a pre-trained model and then fine-tuned on the recorded-echo dataset. We only test the performance of the DSDPRNN with the auxiliary signal ŷ(n). The purpose of fine-tuning is to improve performance given limited supplementary training data.
We have tried fine-tuning the suppression module but found that the model overfits severely with a small amount of recorded data. Thus, we propose two strategies that fine-tune the model by mainly retraining the decoder.
(1) Train the decoder module only and freeze the other parameters.
(2) Train the decoder and the last DSDPRNN block and freeze the other parameters.

We conduct two experiments in the fine-tuning stage for cross validation.
In each experiment, we only use 12 h of echo signals from one loudspeaker as the training set. The batch size is set to 16, and an exponential-decay strategy is used to halve the learning rate every 1350 steps. The fine-tuning stage uses two Nvidia GeForce GTX 1080Ti GPUs and takes only about 3 h of training, since the partly frozen parameters reduce the computational complexity of training and the size of the recorded training data is far below that of the artificial-echo data. We use "Time" and "TF" to distinguish the time-domain and the TF-domain DSDPRNN methods, and use the suffixes "1" and "2" to represent the models using the above two fine-tuning strategies respectively. The performance of the pre-trained model is presented without a suffix as a benchmark. Compared to strategy 2, the training time in the fine-tuning stage of strategy 1 decreases by 14% and the memory cost is reduced by half. The performance of the proposed methods after fine-tuning with the ER echo dataset and the LL echo dataset is shown in Tables 3 and 4 respectively. In artificial-echo conditions, the performance degrades slightly after fine-tuning, and similar results are observed for both fine-tuning strategies. The test results of the model fine-tuned using the recorded training dataset from the same kind of loudspeaker are highlighted in blue font. The efficacy of both fine-tuning strategies can be seen, and strategy 2 performs significantly better when the model is fine-tuned with the training dataset from the same kind of loudspeaker. It should also be noted that the performance improves slightly even when the model is fine-tuned with training data from different loudspeakers, indicating the generalization capability of the fine-tuning method. Considering that only a very limited amount of data is required in the fine-tuning stage, this scheme is easy to apply to any off-the-shelf loudspeaker.
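Strategy-dependent parameter freezing can be sketched in PyTorch as follows; the module names used here are placeholders, not the authors' actual code:

```python
import torch.nn as nn

def freeze_for_finetuning(model, trainable_prefixes=("decoder",)):
    """Freeze every parameter except those whose name starts with one
    of `trainable_prefixes` -- e.g. ("decoder",) for strategy 1, or
    ("decoder", "suppression.5") for strategy 2 if the last DSDPRNN
    block were the sixth entry of a `suppression` module list.
    Returns the names of the parameters left trainable."""
    trainable = []
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(t) for t in trainable_prefixes)
        if p.requires_grad:
            trainable.append(name)
    return trainable
```

Freezing via `requires_grad = False` both prevents the pre-trained weights from drifting on the small recorded-echo set and skips their gradient computation, which accounts for the reduced training cost mentioned above.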

Conclusion
In this paper, we propose efficient RES methods in both the time domain and the TF domain based on a modification of DPRNN. We adopt the residual signal and the auxiliary signal extracted from the LAEC system to form a dual stream for the DPRNN. Experiments validate the efficacy of the proposed methods in double-talk situations compared with several typical RES methods. Furthermore, we propose an efficient and practical way to improve the performance on off-the-shelf loudspeakers by regarding the model trained on the artificial-echo dataset as a pre-trained model and fine-tuning it on the recorded-echo dataset. Two fine-tuning strategies are evaluated in experiments, showing that the strategy of training the decoder and the last DSDPRNN block achieves more effective echo suppression on the recorded-echo dataset.