Problem formulation
The AEC system with RES post-filter is depicted in Fig. 1, where x(n) is the far-end signal, \( \hat{y}(n) \) is the output of the adaptive filter, and H(z) represents the echo path transfer function. The microphone signal d(n), consisting of the echo y(n), the near-end speech s(n), and the background noise v(n), can be expressed as
$$ d(n)=s(n)+y(n)+v(n) $$
(1)
The output signal of the LAEC, \( s_{\mathrm{AEC}}(n) \), is obtained by subtracting the output of the adaptive filter \( \hat{y}(n) \) from the microphone signal d(n), with
$$ \hat{y}(n)=\hat{h}(n)\ast x(n) $$
(2)
$$ {s}_{\mathrm{AEC}}(n)=d(n)-\hat{y}(n) $$
(3)
where \( \hat{h}(n) \) denotes the adaptive filter and ∗ represents the convolution operation. Due to the inevitable nonlinearity in the echo path, the LAEC cannot perfectly attenuate the echo, and \( s_{\mathrm{AEC}}(n) \) can be regarded as a mixture of the residual echo, the background noise, and the near-end signal. The RES can therefore be designed from the viewpoint of speech separation; unlike the standard speech separation problem, however, the auxiliary information extracted from the adaptive filter can be exploited to improve performance. In this paper, we employ \( s_{\mathrm{AEC}}(n) \) together with an auxiliary signal, x(n) or \( \hat{y}(n) \), to construct a dual-stream DPRNN (DSDPRNN).
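To make Eqs. (1)-(3) concrete, the sketch below forms \( s_{\mathrm{AEC}}(n) \) from its components in NumPy. It is illustrative only: the function and variable names are ours, and the adaptive update of \( \hat{h}(n) \) (e.g., an NLMS recursion) that an actual LAEC would run is omitted.

```python
import numpy as np

def laec_residual(x, h, h_hat, s, v):
    """Form the LAEC output s_AEC(n) of Eqs. (1)-(3).

    x: far-end signal; h: true echo path; h_hat: adaptive filter estimate;
    s: near-end speech; v: background noise (s and v of the same length as x).
    """
    y = np.convolve(x, h)[: len(x)]           # echo y(n) = h(n) * x(n)
    d = s + y + v                             # Eq. (1): microphone signal
    y_hat = np.convolve(x, h_hat)[: len(x)]   # Eq. (2): adaptive filter output
    s_aec = d - y_hat                         # Eq. (3): residual signal
    return s_aec
```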
Model design
Figure 2 outlines the structure of our proposed DPRNN-based RES method, which consists of two encoder modules, a suppression module, and a decoder module. The two encoder modules are used to extract features from sAEC(n) and the auxiliary signal to form two streams, streams A and B, respectively. The suppression module suppresses the residual echo and recovers the near-end signal by exploiting the information of streams A and B. The decoder transforms the output of the suppression module into masks and converts the masked feature back to the waveform. The difference between the time-domain and the TF-domain methods mainly lies in the encoder and the decoder, while the structure of the suppression module is the same.
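The overall dataflow can be summarized by a minimal PyTorch skeleton. The four submodules are placeholders for the encoder, suppression, and decoder designs detailed below, and the class and argument names here are our own rather than the paper's.

```python
import torch.nn as nn

class DualStreamRES(nn.Module):
    """Wiring of Fig. 2: two encoders -> suppression module -> decoder."""

    def __init__(self, encoder_a, encoder_b, suppressor, decoder):
        super().__init__()
        self.encoder_a = encoder_a    # extracts stream A from s_AEC(n)
        self.encoder_b = encoder_b    # extracts stream B from x(n) or y_hat(n)
        self.suppressor = suppressor  # stack of DSDPRNN blocks
        self.decoder = decoder        # mask estimation + waveform synthesis

    def forward(self, s_aec, aux):
        stream_a, feats = self.encoder_a(s_aec)   # features to be masked
        stream_b, _ = self.encoder_b(aux)
        y_s = self.suppressor(stream_a, stream_b)
        return self.decoder(y_s, feats)           # masked features -> waveform
```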
Figure 3 shows the structure of the encoder and the decoder in the time-domain method. The encoder takes a time-domain waveform u as input and converts it into a time series of N-dimensional representations using a 1-D convolutional layer with a kernel size L and 50% overlap, followed by a ReLU activation function
$$ \boldsymbol{W}=\mathrm{ReLU}\left(\mathrm{Conv1d}\left(\boldsymbol{u}\right)\right) $$
(4)
where \( \boldsymbol{W}\in {\mathbb{R}}^{G\times N} \), of length G, is the output of this operation. Then, W is transformed into C-dimensional representations by a fully connected layer and divided into T=2G/K−1 chunks of length K, where the overlap between chunks is 50%. All chunks are then stacked together to form a 3-D tensor \( \mathcal{W}\in {\mathbb{R}}^{T\times K\times C} \). The decoder applies an overlap-add operation to the output of the suppression module \( {\mathcal{Y}}_s\in {\mathbb{R}}^{T\times K\times C} \), followed by a PReLU activation [21], to form the output \( \boldsymbol{Q}\in {\mathbb{R}}^{G\times C} \). Then, an N-dimensional fully connected layer with a ReLU activation is applied to Q to obtain the mask of W, and the estimate of the clean speech representation \( \hat{\boldsymbol{S}} \) is obtained by
$$ \hat{\boldsymbol{S}}=\mathrm{ReLU}\left({f}_{{\mathrm{FC}}_2}\left(\boldsymbol{Q}\right)\right)\odot \boldsymbol{W} $$
(5)
$$ \boldsymbol{Q}=\mathrm{PReLU}\left({f}_{\mathrm{OA}}\left({f}_{{\mathrm{FC}}_1}\left({\mathcal{Y}}_s\right)\right)\right) $$
(6)
where \( {f}_{{\mathrm{FC}}_i},\ i=1,2 \), represent the fully connected layers, \( {f}_{\mathrm{OA}} \) represents the overlap-add operation, and ⊙ denotes element-wise multiplication. A 1-D transposed convolutional layer is utilized to convert the masked representation back to the waveform signal \( \hat{\boldsymbol{s}} \).
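A hedged PyTorch sketch of this encoder/decoder pair follows. The hyperparameter values (N, L, C, K) are placeholders, the suppression module is treated as a black box returning \( {\mathcal{Y}}_s \) of shape (T, K, C), and the overlap-add assumes T = 2G/K − 1, so the reconstructed length equals G exactly.

```python
import torch
import torch.nn as nn

class TimeEncoder(nn.Module):
    def __init__(self, N=256, L=16, C=64, K=100):
        super().__init__()
        self.conv = nn.Conv1d(1, N, kernel_size=L, stride=L // 2)  # 50% overlap
        self.proj = nn.Linear(N, C)
        self.K = K

    def forward(self, u):                                  # u: (batch, samples)
        W = torch.relu(self.conv(u.unsqueeze(1)))          # Eq. (4): (batch, N, G)
        W = W.transpose(1, 2)                              # (batch, G, N)
        chunks = self.proj(W).unfold(1, self.K, self.K // 2)  # (batch, T, C, K)
        return chunks.transpose(2, 3), W                   # (batch, T, K, C), plus W

class TimeDecoder(nn.Module):
    def __init__(self, N=256, L=16, C=64, K=100):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(C, C), nn.Linear(C, N)
        self.prelu = nn.PReLU()
        self.deconv = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2)
        self.K = K

    def overlap_add(self, y):                # (batch, T, K, C) -> (batch, G, C)
        b, T, K, C = y.shape
        hop = K // 2
        out = y.new_zeros(b, (T - 1) * hop + K, C)
        for t in range(T):
            out[:, t * hop : t * hop + K] += y[:, t]
        return out

    def forward(self, y_s, W):
        Q = self.prelu(self.overlap_add(self.fc1(y_s)))    # Eq. (6): (batch, G, C)
        S_hat = torch.relu(self.fc2(Q)) * W                # Eq. (5): mask applied to W
        return self.deconv(S_hat.transpose(1, 2)).squeeze(1)  # back to the waveform
```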
The intra-chunk operation of DPRNN can also be applied in the frequency domain. Figure 4 shows the structure of the encoder and the decoder in the TF-domain method. We first obtain the TF representation \( \boldsymbol{Z}\in {\mathbb{C}}^{T^{\prime}\times F} \) by the STFT operation with a Q-point Hamming window and 50% overlap, where F=Q/2+1 is the number of effective frequency bins. We concatenate the real and imaginary components of Z to form a 3-D tensor \( \mathcal{Z}\in {\mathbb{R}}^{T^{\prime}\times F\times 2} \). The 3-D representation \( {\mathcal{W}}^{\prime}\in {\mathbb{R}}^{T^{\prime}\times {K}^{\prime}\times {C}^{\prime }} \) is then obtained by a 2-D convolutional layer with C′ output channels, a kernel size of 5×5, and a stride of 1×2, where \( {K}^{\prime }=\frac{F-3}{2} \) is the number of down-sampled frequency bins. The frame length, the chunk size, and the feature dimension T′, K′, C′ correspond to T, K, C in the time-domain encoder, respectively, and the output is further processed by the same suppression module. The decoder takes the output of the suppression module \( {\mathcal{Y}}_s^{\prime } \) as input and successively applies two fully connected layers, followed by a PReLU and a ReLU activation respectively, to form the output \( {\mathcal{Q}}^{\prime}\in {\mathbb{R}}^{T^{\prime}\times {K}^{\prime}\times {C}^{\prime }} \). Then, \( {\mathcal{Q}}^{\prime } \) is processed by two independent 2-D transposed convolutional layers, called Trans Conv_A and Trans Conv_P, each with a kernel size of 5×5 and a stride of 1×2. Trans Conv_A with a ReLU activation function is utilized to estimate the mask of the TF bins. Trans Conv_P, followed by a normalization operation for each TF bin, is employed to estimate the real and imaginary parts of the phase information. Finally, the spectrogram of the output signal \( {\hat{S}}^{\prime } \) is estimated by
$$ {\hat{S}}^{\prime}=\left(\mathrm{abs}\left(\boldsymbol{Z}\right)\circ {1}^2\right)\odot \left(\mathcal{A}{\times}_3{1}^{1\times 2}\right)\odot \mathcal{P} $$
(7)
$$ \mathcal{A}=\mathrm{ReLU}\left({f}_{\mathrm{TC}}^A\left({\mathcal{Q}}^{\prime}\right)\right)\in {\mathbb{R}}^{T^{\prime}\times F\times 1} $$
(8)
$$ \mathcal{P}=\mathrm{Norm}\left({f}_{\mathrm{TC}}^P\left({\mathcal{Q}}^{\prime}\right)\right)\in {\mathbb{R}}^{T^{\prime}\times F\times 2} $$
(9)
where \( {f}_{\mathrm{TC}}^A \), \( {f}_{\mathrm{TC}}^P \), and Norm represent the functions of Trans Conv_A, Trans Conv_P, and the per-TF-bin normalization operation, respectively. We use \( {1}^{I_1\times {I}_2\times \dots \times {I}_M} \), ∘, and \( \times_i \) to denote an all-ones tensor, the outer product, and the mode-\( i \) product [22], respectively. The outer product between a tensor \( \mathcal{H}\in {\mathbb{R}}^{I_1\times {I}_2\times \dots \times {I}_M} \) and a vector \( \boldsymbol{g}\in {\mathbb{R}}^J \) is defined as
$$ \mathcal{R}=\mathcal{H}\circ \boldsymbol{g}\in {\mathbb{R}}^{I_1\times {I}_2\times \dots \times {I}_M\times J} $$
(10)
$$ {\mathcal{R}}_{i_1,{i}_2,\dots, {i}_M,j}={\mathcal{H}}_{i_1,{i}_2,\dots, {i}_M}\cdot {\boldsymbol{g}}_j $$
(11)
The mode-\( M \) product between the tensor \( \mathcal{H} \) and a matrix \( \boldsymbol{D}\in {\mathbb{R}}^{I_M\times J} \) is defined as
$$ \mathcal{R}=\mathcal{H}{\times}_M\boldsymbol{D}\in {\mathbb{R}}^{I_1\times {I}_2\times \dots \times {I}_{M-1}\times J} $$
(12)
$$ {\mathcal{R}}_{i_1,{i}_2,\dots, {i}_{M-1},j}=\sum \limits_{i_M=1}^{I_M}{\mathcal{H}}_{i_1,{i}_2,\dots, {i}_M}\cdot {\boldsymbol{D}}_{i_M,j} $$
(13)
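To make the index conventions concrete, both definitions reduce to one-line torch.einsum calls; the example below uses a 3-D tensor (M = 3) with arbitrary illustrative shapes.

```python
import torch

H = torch.randn(4, 5, 6)   # H in R^{I1 x I2 x I3}
g = torch.randn(3)         # g in R^J
D = torch.randn(6, 3)      # D in R^{I3 x J}

R_outer = torch.einsum('abc,j->abcj', H, g)  # Eq. (11): R[a,b,c,j] = H[a,b,c] * g[j]
R_mode3 = torch.einsum('abc,cj->abj', H, D)  # Eq. (13): contraction over the last mode
assert R_outer.shape == (4, 5, 6, 3)
assert R_mode3.shape == (4, 5, 3)
```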
Similar to the operation in [23], \( \mathcal{A} \) and \( \mathcal{P} \) in Eqs. (8) and (9) act as the amplitude mask and the phase prediction result, respectively. After that, an inverse STFT operation is applied to convert \( {\hat{S}}^{\prime } \) back to the waveform signal \( {\hat{\boldsymbol{s}}}^{\prime } \).
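Since the outer product with \( {1}^2 \) and the mode-3 product with \( {1}^{1\times 2} \) merely replicate \( \mathrm{abs}(\boldsymbol{Z}) \) and \( \mathcal{A} \) along a trailing axis of size 2, Eq. (7) reduces to ordinary broadcasting. A minimal sketch, assuming the shapes given in Eqs. (8) and (9):

```python
import torch

def apply_amp_phase(Z, A, P):
    """Eq. (7) via broadcasting.

    Z: (T', F) complex STFT; A: (T', F, 1) amplitude mask;
    P: (T', F, 2) per-bin-normalized (real, imag) phase estimate.
    """
    mag = Z.abs().unsqueeze(-1)   # abs(Z) outer 1^2: (T', F, 1), broadcast to size 2
    S_hat = mag * A * P           # A x_3 1^{1x2} likewise broadcasts over the last axis
    return torch.complex(S_hat[..., 0], S_hat[..., 1])  # complex form for the iSTFT
```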
The suppression module consists of six DSDPRNN blocks, each of which contains two dual-stream RNN (DSRNN) blocks corresponding to intra-chunk and inter-chunk processing, respectively. Figure 5 presents the structure of the proposed DSRNN block, where each stream is successively processed by an RNN layer, a fully connected layer, and a normalization layer. The RNN layer in each intra-chunk block is a bidirectional RNN layer applied along the chunk dimension with C/2 output channels for each direction, while the RNN layer in each inter-chunk block is a unidirectional RNN layer with C output channels applied along the frame dimension. Let \( {\mathcal{V}}_i^0\in {\mathbb{R}}^{T\times K\times C} \) denote the input tensor of stream i; the output of the RNN layer, \( {\mathcal{V}}_i^1 \), can then be expressed as
$$ {\mathcal{V}}_i^1={f}_{{\mathrm{RNN}}_i}\left({\mathcal{V}}_i^0\right),\quad i=A\ \mathrm{or}\ B $$
(14)
where \( {f}_{{\mathrm{RNN}}_i} \) represents the function of the RNN layer. The features in \( {\mathcal{V}}_A^1 \) and \( {\mathcal{V}}_B^1 \) are then mixed by
$$ {\mathcal{V}}_A^2={\mathcal{V}}_A^1+\left({1}^T\circ {1}^K\circ \boldsymbol{\alpha} \right)\odot {\mathcal{V}}_B^1 $$
(15)
$$ {\mathcal{V}}_B^2={\mathcal{V}}_B^1+\left({1}^T\circ {1}^K\circ \boldsymbol{\beta} \right)\odot {\mathcal{V}}_A^1 $$
(16)
where \( \boldsymbol{\alpha}, \boldsymbol{\beta} \in {\mathbb{R}}^C \) are trainable parameters. The output \( {\mathcal{V}}_i^2 \) is concatenated with the corresponding raw input \( {\mathcal{V}}_i^0 \) and then processed by a fully connected layer with C output channels. \( {\mathcal{V}}_i^3 \) is obtained with a residual connection and can be formulated as
$$ {\mathcal{V}}_i^3={f}_{{\mathrm{FC}}_i}\left(\left[{\mathcal{V}}_i^2,{\mathcal{V}}_i^0\right]\right)+{\mathcal{V}}_i^0,\quad i=A\ \mathrm{or}\ B $$
(17)
where [·,·] represents the concatenation operation. The concatenation and projection are applied along the chunk dimension in intra-chunk blocks and along the feature dimension in inter-chunk blocks. The output \( {\mathcal{V}}_i^4 \) of the DSRNN block is then obtained by applying a normalization layer to \( {\mathcal{V}}_i^3 \), except in the last DSDPRNN block, where \( {\mathcal{V}}_i^3 \) serves as the output
$$ {\mathcal{V}}_i^4={f}_{{\mathrm{Norm}}_i}\left({\mathcal{V}}_i^3\right),\quad i=A\ \mathrm{or}\ B $$
(18)
where \( {f}_{{\mathrm{Norm}}_i} \) denotes the function of the normalization layer. The features of streams A and B are processed iteratively by the intra-chunk and the inter-chunk DSRNN blocks, and the output of stream A in the last DSDPRNN block is regarded as the output of the suppression module. We use Group Normalization [24] with a group number of 2. The input feature of the normalization layer \( \mathcal{X}\in {\mathbb{R}}^{T\times K\times C} \) is first divided into two groups as
$$ \mathcal{X}=\left[{\hat{\mathcal{X}}}^1,{\hat{\mathcal{X}}}^2\right],\quad {\hat{\mathcal{X}}}^1,{\hat{\mathcal{X}}}^2\in {\mathbb{R}}^{T\times K\times \frac{C}{2}}, $$
(19)
and the output is formulated as
$$ {f}_{\mathrm{Norm}}\left(\mathcal{X}\right)=\left[{\hat{\mathcal{Y}}}^1,{\hat{\mathcal{Y}}}^2\right] $$
(20)
with
$$ {\hat{\mathcal{Y}}}_{l,k,c}^i=\frac{{\hat{\mathcal{X}}}_{l,k,c}^i-\mu \left({\hat{\mathcal{X}}}_l^i\right)}{\sqrt{\sigma \left({\hat{\mathcal{X}}}_l^i\right)+\varepsilon }}\cdot {\boldsymbol{\gamma}}_c^i+{\boldsymbol{\beta}}_c^i,\quad i=1,2 $$
(21)
and
$$ \mu \left({\hat{\mathcal{X}}}_l^i\right)=\frac{2}{CK}\sum \limits_{k=1}^K\sum \limits_{c=1}^{C/2}{\hat{\mathcal{X}}}_{l,k,c}^i,\quad i=1,2 $$
(22)
$$ \sigma \left({\hat{\mathcal{X}}}_l^i\right)=\frac{2}{CK}\sum \limits_{k=1}^K\sum \limits_{c=1}^{C/2}{\left[{\hat{\mathcal{X}}}_{l,k,c}^i-\mu \left({\hat{\mathcal{X}}}_l^i\right)\right]}^2,\quad i=1,2 $$
(23)
where the subscripts l, k, c denote the indices of the 3-D tensor, \( {\boldsymbol{\gamma}}^i,{\boldsymbol{\beta}}^i\in {\mathbb{R}}^{C/2} \) are trainable parameters, and ε is a small constant for numerical stability.
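The sketch below assembles one DSRNN block, Eqs. (14)-(18), together with the per-frame grouped normalization of Eqs. (19)-(23), in the intra-chunk configuration (bidirectional RNN along the chunk dimension K). The LSTM cell type, the zero initialization of α and β, and concatenating along the feature dimension are our assumptions; an inter-chunk block would instead run a unidirectional RNN with C output channels along the frame dimension.

```python
import torch
import torch.nn as nn

class IntraDSRNNBlock(nn.Module):
    def __init__(self, C, apply_norm=True):
        super().__init__()
        # Bidirectional RNN, C/2 channels per direction -> C-dim output (Eq. 14)
        self.rnn_a = nn.LSTM(C, C // 2, batch_first=True, bidirectional=True)
        self.rnn_b = nn.LSTM(C, C // 2, batch_first=True, bidirectional=True)
        self.alpha = nn.Parameter(torch.zeros(C))   # stream mixing gains, Eq. (15)
        self.beta = nn.Parameter(torch.zeros(C))    # stream mixing gains, Eq. (16)
        self.fc_a = nn.Linear(2 * C, C)             # projection after concat, Eq. (17)
        self.fc_b = nn.Linear(2 * C, C)
        # Group Normalization with 2 groups; omitted in the last DSDPRNN block
        self.norm_a = nn.GroupNorm(2, C) if apply_norm else None
        self.norm_b = nn.GroupNorm(2, C) if apply_norm else None

    @staticmethod
    def _frame_norm(norm, v):                       # Eqs. (19)-(23): stats per frame l
        B, T, K, C = v.shape
        v = v.reshape(B * T, K, C).transpose(1, 2)  # (B*T, C, K)
        return norm(v).transpose(1, 2).reshape(B, T, K, C)

    def forward(self, v_a0, v_b0):                  # each of shape (B, T, K, C)
        B, T, K, C = v_a0.shape
        run = lambda rnn, v: rnn(v.reshape(B * T, K, C))[0].reshape(B, T, K, C)
        v_a1, v_b1 = run(self.rnn_a, v_a0), run(self.rnn_b, v_b0)    # Eq. (14)
        v_a2 = v_a1 + self.alpha * v_b1             # Eq. (15): broadcast over (T, K)
        v_b2 = v_b1 + self.beta * v_a1              # Eq. (16)
        v_a3 = self.fc_a(torch.cat([v_a2, v_a0], -1)) + v_a0         # Eq. (17)
        v_b3 = self.fc_b(torch.cat([v_b2, v_b0], -1)) + v_b0
        if self.norm_a is None:
            return v_a3, v_b3                       # last block: no normalization
        return (self._frame_norm(self.norm_a, v_a3),
                self._frame_norm(self.norm_b, v_b3))                 # Eq. (18)
```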
Training target
We choose the maximization of the scale-invariant source-to-noise ratio (SISNR) [18] as the training target
$$ {\boldsymbol{s}}_{\mathrm{target}}=\frac{\left|\left\langle \hat{\boldsymbol{s}},\boldsymbol{s}\right\rangle \right|\boldsymbol{s}}{{\left\Vert \boldsymbol{s}\right\Vert}^2} $$
(24)
$$ {\boldsymbol{e}}_{\mathrm{noise}}=\hat{\boldsymbol{s}}-{\boldsymbol{s}}_{\mathrm{target}} $$
(25)
$$ \mathrm{SISNR}=10{\log}_{10}\frac{{\left\Vert {\boldsymbol{s}}_{\mathrm{target}}\right\Vert}^2}{{\left\Vert {\boldsymbol{e}}_{\mathrm{noise}}\right\Vert}^2} $$
(26)
where \( \hat{\boldsymbol{s}} \) and \( \boldsymbol{s} \) are the estimated and the target clean sources, respectively, \( \left\langle \cdot, \cdot \right\rangle \) represents the dot product of vectors, and \( \left\Vert \boldsymbol{s}\right\Vert \) denotes the \( {\ell}_2 \) norm of \( \boldsymbol{s} \).
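As a reference, the SISNR objective of Eqs. (24)-(26) can be written as a PyTorch loss, negated so that maximizing SISNR corresponds to minimizing the loss. The small eps terms guarding against division by zero are our addition, and the zero-mean preprocessing used in some SISNR implementations is not described above and is therefore omitted.

```python
import torch

def si_snr(s_hat, s, eps=1e-8):
    """SISNR in dB per example; s_hat, s: (batch, samples)."""
    dot = torch.sum(s_hat * s, dim=-1, keepdim=True)                   # <s_hat, s>
    s_target = dot.abs() * s / (s.pow(2).sum(-1, keepdim=True) + eps)  # Eq. (24)
    e_noise = s_hat - s_target                                         # Eq. (25)
    ratio = s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps)
    return 10.0 * torch.log10(ratio + eps)                             # Eq. (26)

def loss_fn(s_hat, s):
    return -si_snr(s_hat, s).mean()   # maximize SISNR by minimizing its negative
```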