
Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition


Lately, the self-attention mechanism has marked a new milestone in the field of automatic speech recognition (ASR). Nevertheless, its performance is susceptible to environmental intrusions as the system predicts the next output symbol depending on the full input sequence and the previous predictions. A popular solution for this problem is adding an independent speech enhancement module as the front-end. Nonetheless, due to being trained separately from the ASR module, the independent enhancement front-end falls into the sub-optimum easily. Besides, the handcrafted loss function of the enhancement module tends to introduce unseen distortions, which even degrade the ASR performance. Inspired by the extensive applications of the generative adversarial networks (GANs) in speech enhancement and ASR tasks, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the ASR system. Generally, it consists of a self-attention speech enhancement GAN and a self-attention end-to-end ASR model. There are two advantages which are worth noting in this proposed framework. One is that it benefits from the advancement of both self-attention mechanism and GANs, while the other is that the discriminator of GAN plays the role of the global discriminant network in the stage of the adversarial joint training, which guides the enhancement front-end to capture more compatible structures for the subsequent ASR module and thereby offsets the limitation of the separate training and handcrafted loss functions. With the adversarial joint optimization, the proposed framework is expected to learn more robust representations suitable for the ASR task. 
We execute systematic experiments on the corpus AISHELL-1, and the experimental results show that on the artificial noisy test set, the proposed framework achieves the relative improvements of 66% compared to the ASR model trained by clean data solely, 35.1% compared to the speech enhancement and ASR scheme without joint training, and 5.3% compared to multi-condition training.


In recent years, attention-based end-to-end neural networks, which subsume the acoustic and language models into a single neural network, have triggered a revolution in the field of automatic speech recognition (ASR) [1, 2] and are challenging the dominance of hidden Markov model-based hybrid systems [3]. Furthermore, the self-attention mechanism has brought another breakthrough in attention architectures: it considers the whole sequence at once to model feature interactions that are arbitrarily distant in time, leading to faster convergence and state-of-the-art results in ASR [4–12]. The self-attention system predicts the next output symbol conditioned on the full input sequence and the previous predictions. Once a mistake occurs in one estimation step due to noise interference, all the subsequent steps are disturbed. As speech signals are inevitably corrupted by various background noises in realistic environments, it is crucial to improve the robustness of the self-attention mechanism for practical applications.

The mainstream solution to the noise robustness problem is adding an independent speech enhancement (SE) module as the front-end of ASR. Speech enhancement aims to transform the interfered speech to its original clean version, which is achieved by various approaches, e.g., statistical methods such as the Wiener filter [13], time-frequency masking [14–16], signal approximation [17, 18], and spectral mapping [19, 20]. No matter which approach the speech enhancement model adopts, it is trained separately from the ASR model with a different loss function (e.g., mean squared error [21]) and evaluated by different objective criteria (e.g., mean opinion score (MOS) prediction of the intrusiveness of background noise [22], segmental SNR [23]). This mismatch between the enhancement training and the final ASR task easily leads to a sub-optimal solution [24]. Moreover, the handcrafted loss functions tend to generate over-smoothed spectra or introduce unseen distortions, which sometimes even degrade the downstream ASR performance [25].

To obtain the optimum and circumvent introducing unnecessary distortion, the idea of a joint training framework is proposed for robust speech recognition [25–28]. The fundamental concept of the joint training is concatenating the speech enhancement front-end and a downstream ASR model to build an entire neural network and jointly adjust the parameters in each module. The goal here is that the enhancement front-end tends to produce enhanced features desired by the ASR component, and the ASR module can guide the enhancement module to a more discriminative direction. In this way, the joint framework is optimized on the final ASR objectives, i.e., word/character error rate (W/CER).

Generative adversarial networks (GANs) aim at mapping samples \(\hat {x}\) from the distribution \(\mathcal {\hat {X}}\) to samples x from another distribution \(\mathcal {X}\). There are two components within GANs. One is the generator (G), which performs the mapping, and the other is the discriminator (D), which guides the training of the generator. GANs have been applied to various speech signal processing tasks, such as speech enhancement [29, 30], robust speaker verification [31], spoken language identification [32], speech emotion recognition [33], data augmentation [34], and robust speech recognition [35].

Inspired by the advancement of self-attention mechanism and various applications of GAN in speech-related tasks, we propose an adversarial joint training framework with self-attention mechanism to boost the robustness of the self-attention ASR systems, which consists of a self-attention speech enhancement GAN (SA_SEGAN) and a self-attention end-to-end ASR model (SA_ASR), where we experiment with Transformer [36] and Conformer [37]. The discriminant component of SA_SEGAN is first utilized to distinguish the enhanced features from the original clean features, instructing the enhancement module to output the clean distribution. When it comes to the stage of the joint training, the D component acts as the global training guide, and it will shift the direction for the G component to produce more congruous features for the ASR task. As the global guide, the discriminator is expected to remedy the limitation of the separate training and handcrafted loss functions, alleviate the distortion, and lead the speech enhancement component to the global optimum. Meanwhile, the enhancement module is supposed to capture more underlying structural characteristics. With this global guide, the whole framework is expected to learn more robust representations compatible with the ASR task automatically.

In summary, the main contributions of this paper are as follows:

  • We propose a self-attention-based jointly trained adversarial framework targeting robust speech recognition. To the best of our knowledge, this is the first joint training scheme that benefits from the advantages of both the self-attention mechanism and adversarial training.

  • We conduct the local and global adversarial training simultaneously, where the discriminant component does not concentrate on the enhancement front-end exclusively, but also plays the role of the global training guide.

  • The proposed framework achieves relative improvements of up to 66% compared to the ASR model trained solely on clean data, 35.1% compared to the scheme without joint training, and 5.3% compared to multi-condition training.

Related work

GANs have been applied in speech enhancement tasks without attention [29, 38, 39] and with attention [40, 41]. These works validate the functionality of GAN in the enhancement task on diverse objective criteria; however, they lack proofs of the effectiveness of their work for the downstream ASR task.

GANs have also been employed to improve the robustness of the ASR model [35, 42–44]. A potential limitation lies in the weak matching and communication between the integrated modules. For instance, speech enhancement and speech recognition are often designed independently, and the enhancement system is tuned according to metrics that are not directly related to the final ASR performance.

To address this concern, joint training is a promising approach. An early attempt was proposed in [45], where a feature extraction front-end and a Gaussian mixture model-hidden Markov model back-end are jointly trained on maximum mutual information. Afterwards, other interesting works were published in this field [25, 26, 46–48]. Nevertheless, an effective integration between the various systems has been difficult for many years, mainly due to the different nature of the technologies involved at different steps. For example, in [25, 46], the joint training is actually performed as a fine-tuning procedure. To tackle this problem, this paper deploys the discriminant component of GAN as a global guide, leading the enhancement module to match the downstream ASR module.

Self-attention-based SE-ASR scheme


Figure 1 gives an overview of our proposed joint training framework for robust end-to-end speech recognition. The system consists of a self-attention enhancement front-end and a self-attention ASR model. Given the raw noisy speech input \(\boldsymbol {\tilde {x}}\) and the raw clean input \(\boldsymbol {x^{\ast }}\), we describe the entire joint training pipeline as follows:

$$ \boldsymbol{\hat{x}}=\text{Generator}(\boldsymbol{\tilde{x}}), $$
$$ \boldsymbol{\hat{f}}=\text{FBank}(\boldsymbol{\hat{x}}), $$
$$ P(\boldsymbol{y}|\boldsymbol{\hat{f}})=\text{SA\_ASR}(\boldsymbol{\hat{f}}), $$
$$ P(D|\boldsymbol{\hat{x}},\boldsymbol{x^{\ast}})=\text{Discriminator}(\boldsymbol{\hat{x}},\boldsymbol{x^{\ast}}). $$

Fig. 1

Overview of the SE_ASR joint training framework

Here, Generator(·) acts as a speech enhancement front-end realized by the generator component of SA_SEGAN [40], which transforms the noisy raw input \(\boldsymbol {\tilde {x}}\) to the enhanced \(\boldsymbol {\hat {x}}\). FBank(·) is a function for extracting the normalized log FBank features \(\boldsymbol {\hat {f}}\) from the enhancement outputs \(\boldsymbol {\hat {x}}\). Subsequently, SA_ASR(·) is an ASR system based on self-attention layers realized by the Transformer [36] or Conformer [37] architecture. y is the output of the whole scheme. Discriminator(·) is realized by the discriminator component of SA_SEGAN [40], which distinguishes enhanced outputs from clean data.

Self-attention mechanism

Self-attention [49] relates the information over different positions of the entire input sequence for computing the attention distribution using scaled dot-product attention:

$$ \text{Attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V})=\text{softmax}\left(\frac{\boldsymbol{QK}^{T}}{\sqrt{d_{k}}}\right)\boldsymbol{V}. $$

\(\boldsymbol {Q}\in \mathbb {R}^{t_{q} \times d_{q}}, \boldsymbol {K}\in \mathbb {R}^{t_{k} \times d_{k}}\), and \(\boldsymbol {V}\in \mathbb {R}^{t_{v} \times d_{v}}\) are the three inputs of the self-attention layer: queries, keys, and values, where \(t_{q}, t_{k}\), and \(t_{v}\) are the numbers of elements in the respective inputs, while \(d_{q}, d_{k}\), and \(d_{v}\) denote the corresponding element dimensions. The scaling factor \(\frac {1}{\sqrt {d_{k}}}\) prevents the softmax function from falling into regions with tiny gradients. The output for each query is computed as a weighted sum of the values, where the weight of each value is computed by a compatibility function of the query with the corresponding key.
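The scaled dot-product attention above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation; the function names and the toy sizes (4 queries, 6 keys/values) are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (t_q, d_k), K: (t_k, d_k), V: (t_k, d_v) -> output (t_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (t_q, t_k): query-key compatibility
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V, weights         # weighted sum of the values

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 5))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```

Each row of `weights` sums to one, so every output row is a convex combination of the value vectors.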

Self-attention speech enhancement GANs

Speech enhancement GANs (SEGAN)

Given a dataset \(\mathcal {X}=\{(\boldsymbol {x^{\ast }_{1}},\boldsymbol {\tilde {x}_{1}}),(\boldsymbol {x^{\ast }_{2}},\boldsymbol {\tilde {x}_{2}}),\cdots,(\boldsymbol {x^{\ast }_{N}},\boldsymbol {\tilde {x}_{N}})\}\) consisting of N pairs of raw signals, each comprising a clean speech signal \(\boldsymbol {x^{\ast }}\) and a noisy speech signal \(\boldsymbol {\tilde {x}}\), speech enhancement aims to find a mapping \(f_{\theta } (\boldsymbol {\tilde {x}}):\boldsymbol {\tilde {x}}\to \boldsymbol {\hat {x}}\) that transforms the raw noisy signal \(\boldsymbol {\tilde {x}}\) into the enhanced signal \(\boldsymbol {\hat {x}}\), where θ contains the parameters of the enhancement network.

Conforming to GAN’s principle [50], the generator G learns an effective mapping that can imitate the real data distribution to generate novel samples related to those of the training set. Hence, G acts as the enhancement function. In contrast, the discriminator D plays the role of a classifier which distinguishes the real samples, coming from the dataset that G is imitating, from the fake samples, made up by G. D guides θ towards the distribution of clean speech signals. To sum up, SEGAN designates the generator G for the enhancement mapping, i.e., \(\boldsymbol {\hat {x}}=G(\boldsymbol {\tilde {x}})\), and designates the discriminator D to guide the training of G by classifying \((\boldsymbol {x^{\ast }},\boldsymbol {\tilde {x}})\) as real and \((\boldsymbol {\hat {x}},\boldsymbol {\tilde {x}})\) as fake. Eventually, G learns to produce enhanced signals \(\boldsymbol {\hat {x}}\) good enough to fool D such that D classifies \((\boldsymbol {\hat {x}},\boldsymbol {\tilde {x}})\) as real.

Self-attention speech enhancement GANs (SA_SEGAN)

SA_SEGAN [40] is SEGAN with the adoption of the self-attention layer adapted from non-local attention [51, 52]. Given the feature map \(\boldsymbol {F}\in \mathbb {R}^{L\times C}\) output by the 1-dim convolutional layer, where L is the time dimension, C is the number of channels, the query matrix Q, the key matrix K, and the value matrix V are obtained via transformations:

$$ \boldsymbol{Q}=\boldsymbol{FW}^{Q},\boldsymbol{K}=\boldsymbol{FW}^{K},\boldsymbol{V}=\boldsymbol{FW}^{V}, $$

where \(\boldsymbol {W}^{Q}\in \mathbb {R}^{C\times \frac {C}{b}}, \boldsymbol {W}^{K}\in \mathbb {R}^{C\times \frac {C}{b}}\), and \(\boldsymbol {W}^{V}\in \mathbb {R}^{C\times \frac {C}{b}}\) represent the learnt weight matrices of the 1×1 convolutional layer of \(\frac {C}{b}\) filters. Furthermore, Phan et al. [40] introduce two factors, b and p, for memory efficiency. b reduces the channel dimension, while p reduces the number of keys and values by a max pooling layer with filter width and stride size of p. Therefore, the dimension of the matrices are \(\boldsymbol {Q}\in \mathbb {R}^{L\times \frac {C}{b}}, \boldsymbol {K}\in \mathbb {R}^{\frac {L}{p}\times \frac {C}{b}}\), and \(\boldsymbol {V}\in \mathbb {R}^{\frac {L}{p}\times \frac {C}{b}}\). The attention map A and the attentive output O are then computed as:

$$ \boldsymbol{A}=\text{softmax}(\boldsymbol{QK}^{T}),\quad\boldsymbol{A} \in \mathbb{R}^{L\times \frac{L}{p}}, $$
$$ \boldsymbol{O}=(\boldsymbol{AV})\boldsymbol{W}^{O},\quad \boldsymbol{W}^{O}\in \mathbb{R}^{\frac{C}{b}\times C}. $$

Each element \(a_{ij}\in \boldsymbol {A}\) indicates the extent to which the model attends to the jth row \(\boldsymbol {v}_{j}\) of V when producing the ith output \(\boldsymbol {o}_{i}\) of O. With the weight matrix \(\boldsymbol {W}^{O}\) realized by a 1×1 convolution layer of C filters, the shape of O is restored to the original shape L×C.

In the end, SA_SEGAN contains a shortcut connection to facilitate information propagation, and a learnable parameter β is employed to balance the weight between the output O and the input feature map F as:

$$ \boldsymbol{F'}=\beta \boldsymbol{O}+\boldsymbol{F}. $$

We illustrate the diagram of a simplified self-attention layer with L=9,C=6,p=3, and b=2 in Fig. 2.

Fig. 2

Illustration of the application of self-attention mechanism in speech enhancement GANs with L=9,C=6,p=3, and b=2
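To make the shapes in this layer concrete, the following NumPy sketch reproduces the toy configuration L=9, C=6, p=3, b=2. It is not the SA_SEGAN code: the 1×1 convolutions are written as plain matrix products (which is what a 1×1 convolution computes per time step), the random weights are placeholders, and applying the max pooling after the key/value projections is an assumption made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def max_pool_time(X, p):
    """Max pooling over time with width and stride p: (L, c) -> (L // p, c)."""
    L, c = X.shape
    return X[: (L // p) * p].reshape(L // p, p, c).max(axis=1)

def sa_segan_attention(F, W_q, W_k, W_v, W_o, beta, p):
    Q = F @ W_q                      # (L, C/b)
    K = max_pool_time(F @ W_k, p)    # (L/p, C/b), pooling order is an assumption
    V = max_pool_time(F @ W_v, p)    # (L/p, C/b)
    A = softmax(Q @ K.T, axis=-1)    # (L, L/p) attention map
    O = (A @ V) @ W_o                # (L, C) attentive output
    return beta * O + F, A           # shortcut connection F' = beta * O + F

L, C, p, b = 9, 6, 3, 2
rng = np.random.default_rng(0)
F = rng.standard_normal((L, C))
W_q, W_k, W_v = (rng.standard_normal((C, C // b)) for _ in range(3))
W_o = rng.standard_normal((C // b, C))
F_out, A = sa_segan_attention(F, W_q, W_k, W_v, W_o, beta=0.0, p=p)
print(F_out.shape, A.shape)
```

With β=0 the layer reduces to the identity on F, which shows how the learnable β lets the network blend in the attentive output gradually.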

Network architecture

The architectures of the generator G and the discriminator D are depicted in Fig. 3a, b. The G component makes use of an encoder-decoder architecture with fully convolutional layers [53]. The generator’s encoder comprises 11 1-dim strided convolutional layers with a common filter width of 31 and a stride length of 2, followed by parametric rectified linear units (PReLUs) [54]. The encoder receives an approximately 1-s segment of the raw signal sampled at 16 kHz, i.e., 16,384 samples, as the input. To compensate for the shrinking time dimension of the convolutional outputs, the number of filters increases along the encoder’s depth {16,32,32,64,64,128,128,256,256,512,1024}, resulting in feature maps of size {8192×16,4096×32,2048×32,1024×64,512×64,256×128,128×128,64×256,32×256,16×512,8×1024}. At the 11th layer of the encoder, the encoding vector \(\boldsymbol {c}\in \mathbb {R}^{8\times 1024}\) is stacked with the noise sample \(\boldsymbol {z}\in \mathbb {R}^{8\times 1024}\), sampled from the distribution \(\mathcal {N}(0, I)\), and presented to the decoder.
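The feature-map sizes listed above follow directly from halving the time axis at each of the 11 stride-2 layers. The short check below reproduces them; it assumes padding is chosen so that only the stride changes the length, which is consistent with the listed sizes but is not stated explicitly in the text.

```python
n_samples = 16384
filters = [16, 32, 32, 64, 64, 128, 128, 256, 256, 512, 1024]
stride = 2

shapes = []
length = n_samples
for n_filters in filters:
    # Assumed: padding keeps the length, so only the stride shrinks the time axis.
    length //= stride
    shapes.append((length, n_filters))

for layer, (t, c) in enumerate(shapes, start=1):
    print(f"layer {layer:2d}: {t} x {c}")
```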

Fig. 3

Illustration of the SA_SEGAN architecture. a The generator component. b The discriminator component [40]

The decoder component mirrors the encoder architecture with the same number of filters and the same filter width to reverse the encoding process through deconvolutions. As in the encoder, each deconvolutional layer is followed by a PReLU. Skip connections connect each encoding layer with its corresponding decoding layer to allow information to flow between the encoding stage and the decoding stage.

The discriminator has an architecture similar to the encoder component of the generator. However, it receives a two-channel input and utilizes virtual batch-norm [55] before the LeakyReLU [56] activation with α = 0.3. Moreover, the D network is topped with a 1×1 convolutional layer that reduces the dimension of the output of the last convolutional layer from 8×1024 to 8 for the subsequent classification with the softmax layer.

The self-attention layer described in Section 3.3.2 is coupled with the (de)convolutional layers of both the generator and the discriminator. Figure 3a, b show an example of the self-attention layer coupled with the lth (de)convolutional layer. As can be seen, if we add the self-attention layer to the lth convolutional layer of the encoder, the mirrored lth deconvolutional layer of the decoder and the lth layer of the discriminator are also coupled with a self-attention layer. Theoretically, the self-attention layer can be placed in any number, even all, of the (de)convolutional layers.

FBank extraction network

We extract the normalized log FBank features \(\boldsymbol {\hat {f}}\) as the input of the subsequent ASR model, which is computed from the enhanced signals \(\boldsymbol {\hat {x}}\):

$$ \boldsymbol{\hat{f}}=\text{FBank}(\boldsymbol{\hat{x}})=\text{Norm}(\text{log}(\text{Mel}(\text{STFT}(\boldsymbol{\hat{x}})))), $$

where STFT(·) is the operation of short-time Fourier transform (STFT), Mel(·) is the operation of Mel matrix multiplication, and Norm(·) normalizes the mean and variance to 0 and 1, respectively. Since each of these operations is differentiable, the FBank feature extraction layer is differentiable, so gradients can propagate through it to the enhancement front-end.
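The chain Norm(log(Mel(STFT(·)))) can be illustrated with the NumPy sketch below. It is not the authors' feature extractor: the frame length (512), hop size (160), number of Mel filters (40), Hann window, power spectrum, and per-utterance normalization are all assumptions chosen for the example, since the text does not specify them here. NumPy itself does not compute gradients; the point is only that every step is a composition of linear maps and smooth element-wise functions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def fbank(x, sr=16000, n_fft=512, hop=160, n_mels=40, eps=1e-10):
    # STFT: slice into overlapping frames, apply a window, take the power spectrum.
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=-1)) ** 2    # (n_frames, n_fft // 2 + 1)
    # Mel: multiplication with the Mel matrix.
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T    # (n_frames, n_mels)
    log_mel = np.log(mel + eps)
    # Norm: zero mean and unit variance per Mel channel (per utterance, assumed).
    return (log_mel - log_mel.mean(axis=0)) / (log_mel.std(axis=0) + eps)

rng = np.random.default_rng(0)
x_hat = rng.standard_normal(16000)   # stand-in for 1 s of enhanced speech
f_hat = fbank(x_hat)
print(f_hat.shape)
```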


Multi-head attention mechanism

Multi-head attention mechanism [49], as the terminology implies, contains more than one self-attention module. As the core module of the Transformer [36], it leverages different attending representations jointly. Before performing each attention, three linear projections transform the queries, keys, and values to more discriminated representations, respectively. Afterwards, each dot-product attention is calculated independently, and their outputs are concatenated and fed into another linear projection to obtain the final \(d_{{model}}\)-dimensional outputs:

$$\begin{array}{*{20}l} &\text{MultiHead}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) \\ &= \text{Concat}(head_{1}, head_{2},\cdots,head_{h})\boldsymbol{W}^{OUT}, \end{array} $$


$$ head_{i}=\text{Attention}(\boldsymbol{QW}^{Q}_{i}, \boldsymbol{KW}^{K}_{i}, \boldsymbol{VW}^{V}_{i}). $$

h refers to the number of heads, and Q, K, V have the same dimension \(d_{{model}}\). The four projection matrices are \(\boldsymbol {W}^{Q}_{i}\in \mathbb {R}^{d_{{model}}\times d_{q}}, \boldsymbol {W}^{K}_{i}\in \mathbb {R}^{d_{{model}}\times d_{k}}, \boldsymbol {W}^{V}_{i}\in \mathbb {R}^{d_{{model}}\times d_{v}}\), and \(\boldsymbol {W}^{OUT}\in \mathbb {R}^{hd_{v}\times d_{{model}}}\). Additionally, \(d_{q}=d_{k}=d_{v}=d_{{model}}/h\).
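A minimal NumPy sketch of the multi-head computation is given below for illustration. The sizes (d_model=16, h=4, 5 frames) and the random weights are placeholders, not the configuration used in the experiments.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_out):
    """W_q, W_k, W_v: (h, d_model, d_model // h); W_out: (d_model, d_model)."""
    heads = [
        attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])   # head_i: (t, d_model // h)
        for i in range(W_q.shape[0])
    ]
    return np.concatenate(heads, axis=-1) @ W_out       # (t, d_model)

d_model, h, t = 16, 4, 5    # illustrative sizes only
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((h, d_model, d_model // h)) for _ in range(3))
W_out = rng.standard_normal((d_model, d_model))
X = rng.standard_normal((t, d_model))
Y = multi_head_attention(X, X, X, W_q, W_k, W_v, W_out)   # self-attention: Q = K = V = X
print(Y.shape)
```

Because each head works in a d_model/h-dimensional subspace, concatenating the h heads restores the d_model dimension before the output projection.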

Positional encoding

One obvious limitation of the Transformer model is that the output is invariant to the input order permutation, i.e., the Transformer does not model the order of the input sequence. Vaswani et al. [49] solve this problem by injecting information about absolute positions into the input sequence via sinusoid positional embeddings:

$$ PE_{(pos,i)}=\left\{\begin{array}{ll} \sin(pos/10000^{i/d_{{model}}})\quad \text{if}\:i\: \text{is}\: \text{even} & \\ \cos(pos/10000^{i/d_{{model}}})\quad \text{if}\:i\: \text{is}\: \text{odd} &\end{array}\right., $$

where pos refers to the position, and i is the dimension index. The sinusoidal function allows the model to extrapolate to sequence lengths longer than those seen during training.
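The encoding can be computed as in the sketch below, which follows the formula exactly as written in this section (sine for even i, cosine for odd i, exponent i/d_model). The sequence length and model dimension are illustrative assumptions.

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    """Sinusoidal positional encoding of shape (n_pos, d_model)."""
    pos = np.arange(n_pos)[:, None]      # (n_pos, 1)
    i = np.arange(d_model)[None, :]      # (1, d_model)
    angle = pos / np.power(10000.0, i / d_model)
    # Even dimensions use the sine, odd dimensions use the cosine.
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(n_pos=50, d_model=16)   # illustrative sizes
print(pe.shape)
```

The encoding is added element-wise to the input embedding, so it must have the same dimension d_model.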

Feed-forward network

The feed-forward network (FFN) is another core module of the Transformer [36]. It is composed of two linear transformations with a ReLU activation in between. The dimensionality of the input and output is \(d_{{model}}\), and the inner layer has the dimensionality \(d_{{ff}}\). Specifically,

$$ \text{FFN}(\boldsymbol{x})=\text{max}(0, \boldsymbol{x}\boldsymbol{W}_{1}+\boldsymbol{b}_{1})\boldsymbol{W}_{2}+\boldsymbol{b}_{2}, $$

where the weights \(\boldsymbol {W}_{1}\in \mathbb {R}^{d_{{model}}\times d_{{ff}}}, \boldsymbol {W}_{2}\in \mathbb {R}^{d_{{ff}}\times d_{{model}}}\) and the biases \(\boldsymbol {b}_{1}\in \mathbb {R}^{d_{{ff}}}, \boldsymbol {b}_{2}\in \mathbb {R}^{d_{{model}}}\). The linear transformations are the same across different positions.
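The following NumPy sketch illustrates the FFN and the fact that the same transformation is applied at every position. The sizes and random weights are placeholders for illustration only.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff, t = 8, 32, 5   # illustrative sizes, not the paper's settings
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)), rng.standard_normal(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), rng.standard_normal(d_model)

X = rng.standard_normal((t, d_model))
Y = ffn(X, W1, b1, W2, b2)                                # whole sequence at once
Y_rows = np.stack([ffn(x, W1, b1, W2, b2) for x in X])    # one position at a time
print(Y.shape, np.allclose(Y, Y_rows))
```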

Network architecture

The detailed model architecture of the ASR-Transformer is as follows.

The encoder is shown in Fig. 4a. The input embedding extracts expressive representations of dimension \(d_{{model}}\). Thereafter, to enable the model to attend to the auxiliary position information, the \(d_{{model}}\)-dim positional encoding (Section 3.5.2) is added to the input encoding. Then, the sum of the encoded outputs is fed into a stack of \(N_{e}\) encoder blocks, each of which has two sub-blocks: one is the multi-head attention (Section 3.5.1), receiving queries, keys, and values from the previous block, and the other is the feed-forward network (Section 3.5.3). Meanwhile, layer normalization and a residual connection are introduced to each sub-block for effective training. Thus, the pipeline of the sub-block is:

$$ \boldsymbol{x}+\text{SubBlock}(\text{Layer\ Norm}(\boldsymbol{x})). $$
Fig. 4

Model architecture of the Transformer. a Encoder. b Decoder [36]

The decoder is shown in Fig. 4b. The output embedding converts the character sequence to dimension \(d_{{model}}\). After the positional encoding is added, the sum is fed into a stack of \(N_{d}\) decoder blocks, each of which consists of three sub-blocks. The first is a masked multi-head attention, which ensures that the prediction for position j depends only on the known outputs at positions less than j. The second is a multi-head attention whose keys and values come from the encoder outputs while queries come from the previous sub-block outputs. The third is a feed-forward network. Similar to the encoder, layer normalization and a residual connection are also applied to each sub-block of the decoder. Eventually, the output probabilities are obtained by a linear projection and a subsequent softmax function.
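The masking in the first decoder sub-block can be sketched as follows. The example assumes the common setup in which the decoder input is the output sequence shifted right by one step, so that letting input position j attend to input positions k ≤ j exposes only outputs before j; this shift is an assumption of the sketch and is not spelled out in the text.

```python
import numpy as np

def causal_mask(n):
    """mask[j, k] is True where position j may attend to position k (k <= j)."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, so they receive exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

n = 5
rng = np.random.default_rng(0)
scores = rng.standard_normal((n, n))             # stand-in for Q K^T / sqrt(d_k)
attn = masked_softmax(scores, causal_mask(n))
print(np.round(attn, 2))
```

The printed matrix is lower triangular: no weight is placed on future positions.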


Conformer [37] is a state-of-the-art ASR encoder architecture. Different from the Transformer block (Section 3.5), it is equipped with a convolution module to increase the local information modeling capability of the Transformer encoder model [49] and with a pair of FFN modules sandwiching the multi-head self-attention module and the convolution module. The Conformer model consists of the Conformer encoder proposed in [37] and a Transformer decoder [36]. The encoder first processes the input with a convolution subsampling layer and then with Conformer blocks, as illustrated in Fig. 5i. The Conformer block (Fig. 5ii) consists of a multi-head self-attention (MHSA) module and a convolution module, sandwiched by a pair of macaron-feedforward modules [57]. Layer normalization is applied before each module, and dropout followed by a residual connection is applied afterwards (pre-norm) [58, 59]. Mathematically, let \(\boldsymbol {x}_{i}\) be the input to the ith Conformer block; the output \(\boldsymbol {y}_{i}\) of this block is:

$$\begin{array}{*{20}l} \boldsymbol{x}^{\prime}_{i}&=\boldsymbol{x}_{i}+\frac{1}{2}\text{FFN}(\boldsymbol{x}_{i}), \end{array} $$
$$\begin{array}{*{20}l} \boldsymbol{x}^{\prime\prime}_{i}&=\boldsymbol{x}^{\prime}_{i}+\text{MHSA}(\boldsymbol{x}^{\prime}_{i}), \end{array} $$
$$\begin{array}{*{20}l} \boldsymbol{x}^{\prime\prime\prime}_{i}&=\boldsymbol{x}^{\prime\prime}_{i}+\text{Conv}(\boldsymbol{x}^{\prime\prime}_{i}), \end{array} $$
$$\begin{array}{*{20}l} \boldsymbol{y}_{i}&=\text{Layer\ Norm}\left(\boldsymbol{x}^{\prime\prime\prime}_{i}+\frac{1}{2}\text{FFN}(\boldsymbol{x}^{\prime\prime\prime}_{i})\right). \end{array} $$

Fig. 5

Illustration of the Conformer encoder model architecture. i Conformer encoder architecture. ii Conformer block architecture. ii-a Convolution module of the Conformer block. ii-b Multi-headed self-attention module of Conformer block. ii-c Feed forward module of Conformer block

FFN(·), MHSA(·), Conv(·), and Layer Norm(·) denote the macaron-feedforward module, the multi-head self-attention module, the convolution module, and the layer normalization module, respectively. The multi-head self-attention module is the same as in Section 3.5.1 and is shown in Fig. 5ii-b. Sections 3.6.1 and 3.6.2 introduce the convolution module and the macaron-feedforward module, respectively.
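The data flow of one Conformer block can be sketched as below. The three kinds of modules are passed in as plain callables, which are placeholders rather than real FFN, MHSA, or convolution implementations; the per-module layer normalization and dropout described in the text are assumed to live inside those callables, and the final layer normalization omits learnable scale and bias for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each frame over the feature dimension (no learnable scale/bias here).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conformer_block(x, ffn1, mhsa, conv, ffn2):
    x = x + 0.5 * ffn1(x)                  # first half-step feed-forward
    x = x + mhsa(x)                        # multi-head self-attention
    x = x + conv(x)                        # convolution module
    return layer_norm(x + 0.5 * ffn2(x))   # second half-step feed-forward, final norm

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))            # 5 frames, 8 features (illustrative)
zero_module = lambda v: np.zeros_like(v)   # stand-in module that contributes nothing
y = conformer_block(x, zero_module, zero_module, zero_module, zero_module)
print(y.shape)
```

With all modules replaced by the zero stand-in, the block reduces to a layer normalization of its input, which makes the residual structure easy to verify.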

Convolution module

Figure 5ii-a shows the details of the convolution module. The convolution module starts with a 1-dim pointwise convolution layer and a gated linear unit (GLU) activation [60]. The 1-dim pointwise convolution layer doubles the input channels, and the GLU activation splits its input in half along the channel dimension and computes the element-wise product of one half with the sigmoid of the other half. These are followed by a 1-dim depthwise convolution layer, a batch normalization layer, a Swish activation, and another 1-dim pointwise convolution layer. As mentioned before, layer normalization is applied before each module, and dropout followed by a residual connection is applied afterwards (pre-norm).
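The channel doubling and gating at the start of the convolution module, together with the Swish activation used later in it, can be illustrated as follows. The pointwise convolution is written as a per-frame matrix product, and the sizes and random weights are assumptions for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x):
    """Gated linear unit: split the channels in half, gate one half with the other."""
    a, b = np.split(x, 2, axis=-1)
    return a * sigmoid(b)

def swish(x):
    return x * sigmoid(x)

t, d = 7, 4                                # illustrative sizes
rng = np.random.default_rng(0)
x = rng.standard_normal((t, d))
W_pw = rng.standard_normal((d, 2 * d))     # pointwise conv as a per-frame linear map
gated = glu(x @ W_pw)                      # channels: d -> 2d -> d
print(gated.shape)
```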

Macaron-feedforward module

Unlike the FFN module in the Transformer encoder [36], which comprises two linear transformations with a ReLU activation in between (Eq. 14), the Conformer encoder [37] introduces a second FFN module and substitutes the ReLU activation with the Swish activation. Furthermore, inspired by Macaron-Net [57], this pair of FFN modules follows a half-step scheme and sandwiches the MHSA and the convolution modules. The details of the FFN are illustrated in Fig. 5ii-c.

Adversarial joint training

GANs aim at mapping samples \(\boldsymbol {\hat {x}}\) from the distribution \(\mathcal {\hat {X}}\) to samples \(\boldsymbol {x^{\ast }}\) from another distribution \(\mathcal {X^{\ast }}\). The generator G is tasked to learn an effective mapping that can imitate the real data distribution to generate novel samples from the manifold defined in \(\mathcal {X^{\ast }}\), by means of adversarial training exerted by the discriminator D. During back-propagation, D learns to distinguish real samples from fake samples more accurately; in return, G updates its parameters towards the real data manifold, until the mixed Nash equilibrium is reached [50]. The GAN training process can be formulated as a minimax game between G and D, with the objective:

$$ \begin{aligned} \min_{G}\: \max_{D}\: \mathcal{L}(D, G) = & \mathbb{E}_{\boldsymbol{x^{\ast}}\sim p_{{data}}(\boldsymbol{x^{\ast}})}[\text{log} D(\boldsymbol{x^{\ast}})]+ \\ & \mathbb{E}_{\boldsymbol{\hat{x}}\sim p_{\hat{x}}(\boldsymbol{\hat{x}})}[\text{log}(1-D(G(\boldsymbol{\hat{x}})))]. \end{aligned} $$

In our proposed robust end-to-end speech recognition scheme, the discriminant network first acts as the local guide for the enhancement module, where D shifts the training of G towards the distribution of clean data; thereafter, it is deployed as the global guide for the whole scheme, where D instructs G to output pertinent enhanced data for the subsequent ASR task.

We first train the enhancement module, which contains both the generator and the discriminator. To solve the problem of vanishing gradients caused by sigmoid cross-entropy loss for training, the least-squares GAN (LSGAN) with binary coding (1 for real, 0 for fake) is utilized instead of the cross-entropy loss. Consequently, the loss function of the discriminator component changes to

$$ \begin{aligned} \min_{D}\: \mathcal{L}(D)= & \frac{1}{2}\mathbb{E}_{\boldsymbol{x^{\ast}},\boldsymbol{\tilde{x}}\sim p_{{data}}(\boldsymbol{x^{\ast}}, \boldsymbol{\tilde{x}})}[D(\boldsymbol{x^{\ast}},\boldsymbol{\tilde{x}})-1]^{2}+ \\ & \frac{1}{2}\mathbb{E}_{\boldsymbol{z}\sim p_{\boldsymbol{z}}(\boldsymbol{z}), \boldsymbol{\tilde{x}}\sim p_{{data}}(\boldsymbol{\tilde{x})}}[D(G(\boldsymbol{z},\boldsymbol{\tilde{x}}),\boldsymbol{\tilde{x}})]^{2}, \end{aligned} $$

where z is a latent variable. To minimize the distance between its generations and the clean examples, it is beneficial to add a secondary component to the loss of G. Inspired by the effectiveness of L1 norm in the image manipulation domain [61, 62], we deploy it in G component to gain more fine-grained and realistic results. The magnitude of the L1 norm is controlled by a new hyper-parameter λ. Hence, the loss function of the generator component becomes:

$$ \begin{aligned} \min_{G}\: \mathcal{L}(G)= & \frac{1}{2}\mathbb{E}_{\boldsymbol{z}\sim p_{\boldsymbol{z}}(\boldsymbol{z}), \boldsymbol{\tilde{x}}\sim p_{{data}}(\boldsymbol{\tilde{x})}}[D(G(\boldsymbol{z},\boldsymbol{\tilde{x}}),\boldsymbol{\tilde{x}})-1]^{2} \\ & +\lambda\left \| G(\boldsymbol{z},\boldsymbol{\tilde{x}})-\boldsymbol{x^{\ast}}\right \|_{1}. \end{aligned} $$
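The two least-squares objectives can be written down compactly as in the sketch below. It operates on already computed discriminator scores and signals; the batch of 2 signals with 4 samples each and the value λ=100 are hypothetical choices for illustration, not the settings used in the experiments.

```python
import numpy as np

def d_loss(d_real, d_fake):
    """LSGAN discriminator loss: real pairs are pushed to 1, fake pairs to 0."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def g_loss(d_fake, enhanced, clean, lam):
    """LSGAN generator loss plus an L1 term weighted by lam."""
    adv = 0.5 * np.mean((d_fake - 1.0) ** 2)                # fool D: fake pairs scored as 1
    l1 = np.mean(np.sum(np.abs(enhanced - clean), axis=-1)) # L1 norm per signal, batch mean
    return adv + lam * l1

clean = np.zeros((2, 4))         # toy batch: 2 signals of 4 samples
enhanced = clean + 0.01          # every sample is off by 0.01
d_real = np.ones(2)              # D is certain the clean pairs are real
d_fake = np.zeros(2)             # D is certain the enhanced pairs are fake

loss_d = d_loss(d_real, d_fake)
loss_g = g_loss(d_fake, enhanced, clean, lam=100.0)   # lam is a hypothetical value
print(loss_d, loss_g)
```

In this toy case the discriminator is perfect, so its loss vanishes, while the generator pays both the adversarial penalty and the weighted L1 distance to the clean signal.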

In the joint training, the enhancement module is initialized from the trained G component, while the global discriminant module is initialized from the trained D component. The training of the ASR component is based on the cross entropy criterion, namely:

$$ \mathcal{L}_{{asr}}=-\text{ln}P(\boldsymbol{y}^{\ast}|\boldsymbol{f})=-\sum_{n}\text{ln}P(\boldsymbol{y}_{n}^{\ast}|\boldsymbol{f},\boldsymbol{y}_{1:n-1}^{\ast}), $$

where \(\boldsymbol {y}^{\ast }\) is the ground truth of a whole sequence of output labels, and \(\boldsymbol {y}_{1:n-1}^{\ast }\) is the ground truth from output step 1 to n−1. In the proposed framework, the parameters of all procedures, i.e., enhancement, feature extraction, ASR, and the discriminant network, are updated by stochastic gradient descent on the loss function of the whole scheme. It is composed of three losses: \(\mathcal {L}_{{asr}}, \mathcal {L}_{{enh}}\), and \(\mathcal {L}_{{gan}}\), which correspond to Eqs. 23, 22, and 21, respectively, i.e.:

$$ \mathcal{L}=\mathcal{L}_{{asr}}+\kappa \mathcal{L}_{{enh}}+\gamma \mathcal{L}_{{gan}}, $$

where κ and γ are two hyper-parameters weighting the magnitude of the enhancement loss and adversarial loss, respectively. Notably, the scheme targets the recognition performance, and the loss function of the discriminant network adapts the enhancement module implicitly. As a result, the discriminant network guides the enhancement module to serve the subsequent ASR task more properly. Accordingly, the unnecessary speech distortion caused by the enhancement process is alleviated.

Experimental setups

We systematically evaluate the robustness of the adversarial joint training framework, and ablation tests are conducted to validate the effects of (i) the enhancement front-end on the ASR task, (ii) the joint training on the whole scheme, and (iii) the GAN on the joint training.


All experiments are executed on the open source Mandarin speech corpus, AISHELL-1 [63]. This corpus contains 178 h of speech, and its utterances cover 11 domains, e.g., smart home, autonomous driving, and industrial production. A total of 400 speakers from different accent areas in China participated in the recording. The corpus is divided into training, development, and test sets. The training dataset contains 120,098 utterances from 340 speakers, the development dataset contains 14,326 utterances from 40 speakers, and the test dataset contains 7176 utterances from 20 speakers.

For the noisy data, we artificially contaminate clean utterances in AISHELL-1 with 9 sorts of intrusions from the NOISEX-92 dataset [64]. We create noisy training, development, and test sets in the same manner. Note that besides the “matched” noisy test set, which is contaminated by the same intrusions as the training dataset, we also corrupt the test set with the remaining 5 sorts of intrusions in the NOISEX-92 dataset as “unmatched” test materials. Table 1 lists the sorts of intrusions mixed in the “match” and “unmatch” cases. All utterances are mixed with the intrusions at SNRs randomly sampled from the range [0 dB, 20 dB]. To sum up, we have two sorts of datasets for training:

  • Clean: clean utterances from the training dataset of AISHELL-1

  • Match: contaminated clean utterances (training dataset) with “matched” noises of Table 1

Table 1 The demonstration of categories of intrusions utilized in “match” and “unmatch” cases

For test datasets, we have the following:

  • Clean: clean utterances from the test dataset of AISHELL-1

  • Match: contaminated clean utterances of the test set with the same intrusions (“matched” noises of Table 1) as “matched” training set

  • Unmatch: contaminated clean utterances of the test set with different intrusions (“unmatched” noises of Table 1) from “matched” training set
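The noisy sets described above can be produced by scaling a noise segment to a target SNR before adding it to the clean utterance. The sketch below is an illustration under assumptions: the helper names are hypothetical, the noise segment is assumed to have the same length as the utterance and non-zero power, and uniform sampling of the SNR over [0 dB, 20 dB] is an assumption about how the random sampling was done.

```python
import math
import random
from typing import List, Sequence


def mix_at_snr(clean: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Target noise power is p_clean / 10^(snr_db / 10); gain is an amplitude factor
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(clean, noise)]


def random_snr(rng: random.Random, low: float = 0.0, high: float = 20.0) -> float:
    """Draw one SNR value in dB (uniform sampling is an assumption)."""
    return rng.uniform(low, high)
```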


For the comparison purpose, we take the work from [28] as the baseline model.

In [28], the mask-based enhancement network is deployed as the front-end. It estimates a masking function to multiply the frequency domain feature of the noisy speech to form an estimate of the clean speech. For the ASR task, Liu et al. employ the ESPnet model [65]. It consists of an encoder network that maps the input feature sequence into a higher-level representation. Then, a location-based attention layer integrates the representation into a context vector with the attention weight vector. In the end, the decoder predicts the next output conditioned on the full sequence of previous predictions. Besides, there is an extra discriminant network, whose loss is weighted in the loss function of the whole scheme to optimize the joint training.

Importantly, the baseline model does not contain any self-attention layer. Furthermore, the discriminant network in the baseline model is an extra auxiliary module, which does not participate in the enhancement training directly. By contrast, our work benefits from the self-attention mechanism, and the discriminant module exists innately as a component of the enhancement front-end. It acts as the local guide for the enhancement training, leading the enhancement network to produce outputs close to the distribution of the clean samples. Simultaneously, it also plays the role of the global guide, steering the enhancement module and the ASR module to be better matched.



For the enhancement front-end, the input is the 257-dim logarithmic STFT features, and all input vectors are normalized to have zero mean and unit variance. The network is composed of a 3-layer long short-term memory (LSTM) network with 128 nodes, followed by a linear layer with the sigmoid activation function. The network outputs the masking estimate, whose size is equal to the input size and which is multiplied by the STFT feature of the noisy speech to estimate the clean speech.

For the ASR network, the input is the 80-dim normalized log FBank features transformed from the enhanced STFT features. The encoder is composed of a 4-layer bidirectional LSTM (BLSTM) with 320 cells, while the decoder is composed of a 1-layer unidirectional LSTM with 320 cells. After each BLSTM layer, a linear projection layer with 320 nodes is used to combine the forward and backward LSTM outputs. The location-based attention mechanism comprises 10 centered convolution filters of width 100. Besides, we also adopt a joint connectionist temporal classification (CTC)-attention multitask loss function [66] with a CTC loss weight of 0.1.

The discriminant network consists of a 4-layer convolution network, each of which is followed by the ReLU activation function [67].

For decoding, we use a beam search algorithm with the beam size 12. CTC rescores the hypotheses with 0.1 weight [66]. Besides, an external recurrent neural network (RNN) language model is also adopted with 0.2 weight during decoding.
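To make the role of the decoding weights concrete, the sketch below combines the three log-probabilities of a hypothesis and keeps the best candidates. It assumes the log-linear interpolation commonly used for joint CTC-attention decoding [66], with the language model score added on top; the exact combination in the baseline toolkit may differ, and the function names are hypothetical.

```python
from typing import List, Tuple


def hypothesis_score(logp_att: float, logp_ctc: float, logp_lm: float,
                     ctc_weight: float = 0.1, lm_weight: float = 0.2) -> float:
    """Interpolate attention and CTC scores, then add the weighted LM score."""
    return ((1.0 - ctc_weight) * logp_att
            + ctc_weight * logp_ctc
            + lm_weight * logp_lm)


def prune(hyps: List[Tuple[str, float]], beam_size: int = 12) -> List[Tuple[str, float]]:
    """Keep the `beam_size` highest-scoring (text, score) hypotheses."""
    return sorted(hyps, key=lambda h: h[1], reverse=True)[:beam_size]
```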

The proposed joint training scheme


The SA_SEGAN is trained for 86 epochs with RMSprop [68] and a learning rate of 0.0002. The batch size is 50. During training, we extract 1-s chunks of raw waveforms (L=16,384 samples) with a 50% overlap. During testing, we slide the window without overlap through the whole duration of each test utterance and concatenate the outputs at the end of the stream. During both training and testing, we apply a high-frequency preemphasis filter with a coefficient of 0.95 to all inputs. For the self-attention layer in SA_SEGAN, we use b=8 and p=4 for memory reduction. Phan et al. [40] report that the placement of the self-attention layer makes no clear difference to the performance, which indicates that applying the self-attention to a higher-level (de)convolutional layer is expected to be as good as applying it to a lower layer. As a compromise between computation time, memory requirement, and performance, we place the self-attention layer in the 10th layer (l=10).
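The waveform preparation described above can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than statements from the text: the first output sample of the pre-emphasis filter is passed through unchanged, and a final chunk shorter than the window is zero-padded.

```python
from typing import List, Sequence


def preemphasis(x: Sequence[float], coeff: float = 0.95) -> List[float]:
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1], y[0] = x[0]."""
    if not x:
        return []
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]


def chunk(x: Sequence[float], size: int = 16384,
          overlap: float = 0.5) -> List[List[float]]:
    """Cut `x` into windows of `size` samples with hop = size * (1 - overlap)."""
    hop = int(size * (1.0 - overlap))
    chunks = []
    start = 0
    while start < len(x):
        piece = list(x[start:start + size])
        piece += [0.0] * (size - len(piece))  # zero-pad the tail (assumption)
        chunks.append(piece)
        if start + size >= len(x):
            break
        start += hop
    return chunks
```

Calling `chunk(x, overlap=0.5)` mirrors the training setup, while `chunk(x, overlap=0.0)` mirrors the non-overlapping sliding window used at test time.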

FBank extraction network

The FBank feature extraction network is a linear layer to transform the raw outputs from the upstream SA_SEGAN to the downstream ASR procedure. We extract 80-dim filterbanks with the window size of 25 ms and the window shift of 10 ms, extended with the temporal first- and second-order differences. Thereafter, we do the logarithmic calculation and global mean and variance normalization according to Eq. 10.
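As a small worked example of the framing arithmetic, the sketch below counts the frames produced by a 25 ms window with a 10 ms shift and the feature dimension after appending first- and second-order differences. The 16 kHz sampling rate, the absence of padding at the signal edges, and the function names are assumptions made for illustration.

```python
def num_frames(num_samples: int, sample_rate: int = 16000,
               win_ms: float = 25.0, shift_ms: float = 10.0) -> int:
    """Number of full analysis windows that fit into the signal (no padding)."""
    win = int(sample_rate * win_ms / 1000)      # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // shift


def feature_dim(num_filters: int = 80, delta_orders: int = 2) -> int:
    """Static filterbank coefficients plus one block per delta order."""
    return num_filters * (1 + delta_orders)
```

Under these assumptions, a 16,384-sample chunk yields 100 frames, and each frame carries 240 coefficients.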


For training the Transformer, we adopt Adam optimizer [69] with β1=0.9, β2=0.98, and \(\epsilon=10^{-9}\), and vary the learning rate over the course of training according to the formula:

$$ \begin{aligned} lr=k^{\prime} \cdot d^{-0.5}_{{model}} \cdot \text{min}(n^{-0.5}, n \times warmup_{n}^{-1.5}), \end{aligned} $$

where n denotes the step number and k′ is a tunable scalar, which is set to 10 initially and is decreased to 1 when the model converges. The learning rate increases linearly during the first \(warmup_{n}\)=25,000 steps, and afterwards, it decreases proportionally to the inverse square root of the step number. We apply the residual dropout to each sub-block before adding the residual information, while the attention dropout is performed on the softmax activations in each attention. Both dropout rates are set to 0.1. Additionally, we guide the system to be more attentive to closer positions by penalizing the attention weights of more distant position pairs. Similar to the baseline model, we also adopt a joint CTC-attention multi-task loss function [66], with a CTC loss weight of 0.3. In the decoding, we set the beam size to 12 and the length penalty to α = 1.0 [70]. Besides, we also integrate an external RNN language model with a weight of 0.3. The training procedure is stopped after 30 epochs.
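The learning rate schedule above translates directly into a few lines of Python. This is a sketch only: the function name is hypothetical, and the model dimension must be supplied by the caller (the value 256 in the usage note is an assumption for illustration).

```python
def transformer_lr(step: int, d_model: int, k: float = 10.0,
                   warmup: int = 25000) -> float:
    """lr = k * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return k * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

For example, `transformer_lr(25000, 256)` evaluates the schedule at the end of the warm-up, where the two arguments of the minimum coincide and the learning rate reaches its maximum.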


The model hyper-parameters of the Conformer are Ne=12, Nd=6, H=4, dk=256, and dff=2048. The convolution subsampling layer consists of a 2-layer convolutional neural network (CNN) with 256 channels, a stride of 2, and a kernel size of 3. The kernel size of the convolution module is 31. We apply dropout with a rate of 0.1 in each residual unit of the Conformer. As with the Transformer, we train the network with the Adam optimizer [69] with β1=0.9, β2=0.98, and \(\epsilon=10^{-9}\) and a Transformer learning rate schedule [49] with 10,000 warm-up steps. The learning rate peaks at \(0.05/\sqrt {d}\), where d is the model dimension in the Conformer encoder. Note that we do not apply speed perturbation [71] or SpecAugment [72] for data augmentation, in order to exclude extra tricks that could cause performance improvements. The training procedure is stopped after 30 epochs.
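Assuming the Conformer uses the same schedule form as the Transformer formula above, the stated peak can be related to the scalar k′. At n equal to the number of warm-up steps, both arguments of the minimum equal \(warmup_{n}^{-0.5}\), so:

$$ \begin{aligned} lr_{peak} &= k^{\prime} \cdot d^{-0.5} \cdot warmup_{n}^{-0.5} = \frac{0.05}{\sqrt{d}} \\ \Rightarrow\; k^{\prime} &= 0.05 \cdot \sqrt{10{,}000} = 5. \end{aligned} $$

This value of k′ is derived here for illustration and is not stated in the text.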


We use the character error rate (CER) to quantify the system performance in all experiments. We report the CER on the AISHELL-1 test set under three conditions: “clean” refers to the original clean test dataset of the corpus, “match” denotes the noisy test dataset contaminated by the “matched” sorts of intrusions in Table 1, and “unmatch” means the noisy test set corrupted by the remaining “unmatched” sorts of intrusions in Table 1. To validate the efficacy of the enhancement front-end, we also introduce multi-condition training (MCT), a popular training strategy for robust speech recognition, for comparison. Different from the training data of the speech enhancement front-end, the training data of MCT contains 10% clean utterances, which are chosen randomly from the training set of AISHELL-1. The remaining 90% of the training data is generated in the same manner as that of the enhancement front-end, namely by corrupting utterances with the “matched” intrusions from Table 1 at an SNR in the range [0 dB, 20 dB].
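For reference, CER is conventionally computed as the character-level edit distance between hypothesis and reference, summed over the test set and divided by the total number of reference characters. The sketch below follows that convention; the function names are hypothetical, and any text normalization applied before scoring in the experiments is not covered.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, deletions, insertions) over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level character error rate in percent."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total
```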

Firstly, we train the ASR network on the original clean utterances and with the multi-condition training strategy. The results are shown in Table 2.

Table 2 CER [%] results of the ASR system trained by clean data and multi-condition training (MCT) without the enhancement

Ranking these three models by ASR performance, the Conformer comes first, then the Transformer, and the baseline model last, consistent with observations in [36, 37]. However, their performance deteriorates rapidly on the noisy test sets, demonstrating the necessity of the robustness investigation. The MCT training considerably improves the system’s robustness. On the “matched” test set, it outperforms the clean training by 63.2% and 62.9% relative for the Transformer and Conformer models, respectively, while on the “unmatched” test set, it outperforms the clean training by 56.3% and 56.8% relative, respectively.
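The relative improvements quoted here are assumed to follow the usual definition of relative CER reduction with respect to a reference system:

$$ \Delta_{rel}=\frac{\text{CER}_{ref}-\text{CER}_{new}}{\text{CER}_{ref}}\times 100\%. $$

As a purely hypothetical illustration (these numbers are not taken from the tables), reducing the CER from 20.0% to 8.0% would correspond to a 60% relative improvement.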

Secondly, we train SA_SEGAN with the training data contaminated by the “matched” intrusions in Table 1 to enhance the noisy speech. Then, the enhanced features are used for the downstream ASR task. Importantly, the ASR models are the same well-trained models as in Table 2, which means that the enhancement front-end and the ASR back-end are trained separately with different objectives. As exhibited in Table 3, the enhancement module tremendously improves the performance of the ASR component trained solely on clean data. Compared to Table 2, it outperforms all three ASR modules (baseline, Transformer, Conformer) without the enhancement front-end. The improvement achieved on the “matched” test set is more remarkable than that achieved on the “unmatched” test set. For instance, it outperforms the Conformer without the enhancement module by 46.0% on the “matched” test set but only by 9.1% on the “unmatched” test set. This difference is due to the fact that the SA_SEGAN is trained with the “matched” intrusions and can therefore better enhance data contaminated by the same intrusions during the test. All these improvements confirm the efficacy of the enhancement module for improving the robustness of the ASR system. Nevertheless, improving the robustness of the framework in unseen noisy environments remains a challenge. Additionally, the speech enhancement module deteriorates the performance of the ASR_MCT network, which is in accordance with the observations in [35] and [73]. Donahue et al. [35] and Narayanan and Wang [73] hypothesize that the enhancement front-end might be introducing hitherto-unseen distortions that compromise performance. Furthermore, we believe that this latent distortion is derived from the independent training of the enhancement module, demonstrating the necessity of the joint training strategy.

Table 3 The impacts of the enhancement front-end on the ASR systems trained by clean data and multi-condition training (MCT). The results are in CER [%]

To remedy the deterioration of the performance of the ASR_MCT, we retrain the network with the enhanced features. Assuming that the network may also benefit from the knowledge of the noisy features, we also experiment with ingesting both enhanced and noisy features. The results are displayed in Table 4. Both the Transformer_MCT model and the Conformer_MCT model are initialized from their respective well-trained MCT checkpoints, with the additional parameters set to zero to ensure a fair training start. As presented in Table 4, the retraining with the enhanced features improves the performance in both “matched” and “unmatched” cases, and the retraining with both enhanced and noisy features improves the performance slightly further.

Table 4 CER [%] results of the SE_ASR system retraining with and without noisy features

Lastly, we jointly train the whole scheme with and without adversarial training according to Eq. 24. In the framework, the enhancement front-end is initialized from the generator (G component) of SA_SEGAN, the ASR back-end is initialized from the ASR_MCT checkpoint (without retraining), and the adversarial module is initialized from the discriminator (D component) of SA_SEGAN. When the adversarial module does not participate in the training, we set the loss weights to κ=6.0 and γ=0; by contrast, when it participates in the training, we set κ=6.0 and γ=3.0. The results are presented in Table 5. Compared to Table 2, the joint training mitigates the distortion problem existing in the MCT strategy. Additionally, the participation of the adversarial training improves the performance further and exceeds the performance of retraining with both enhanced and noisy features in both the Transformer and Conformer cases. Taking the Conformer as an example, compared to the Conformer trained solely on clean data, the adversarial joint training yields 63.8% and 59.0% relative improvements on the “matched” and “unmatched” datasets, respectively; meanwhile, the adversarial joint training outperforms the MCT strategy by 2.5% and 5.2% relative on the “matched” and “unmatched” datasets, respectively. These results indicate the efficacy of the adversarial joint training in improving the robustness of the end-to-end ASR scheme.

Table 5 The impacts of the joint training with and without GAN on SA-ASR pipeline. The results are in CER [%]


To analyze the difference between these enhancement modules that are trained independently, jointly without GANs, and jointly with GANs, we quantify their performance on the following five objective criteria (the higher the better):

  • SSNR: segmental SNR [23] (in the range of [0, +∞))

  • CBAK: MOS prediction of the intrusiveness of background noises [22] (in the range of [1,5])

  • CSIG: MOS prediction of the signal distortion attending only to the speech signal [22] (in the range of [1,5])

  • COVL: MOS prediction of the overall effect [22] (in the range of [1,5])

  • PESQ: perceptual evaluation of speech quality, using the wide-band version recommended in ITU-T P.862.2 [74] (in the range of [−0.5, 4.5])

All criteria are computed based on the implementation in [75], available at the publisher website. We quantify the performance of the enhancement front-end that is trained independently and trained jointly with and without GANs in the cases of the baseline, Transformer, and Conformer schemes. As exhibited in Figs. 6, 7, and 8, the joint training generally degrades the enhancement module’s performance slightly on SSNR, CBAK, COVL, CSIG, and PESQ. These results suggest that these objective criteria cannot indicate the suitability of the enhanced data for the ASR task, which verifies that it is hard for independent training to lead the enhancement module to the global optimum. Another phenomenon worth noting is that the discrepancies on CBAK and SSNR suggest that there are conflicts between erasing the noise contamination and averting the speech distortion. Therefore, an equilibrium between these two goals should be sought. The experimental results in Section 6 validate the efficacy of the adversarial joint training with a global discriminant guide for reaching the equilibrium point.

Fig. 6

The performance comparison of the enhancement model trained independently and the enhancement models trained jointly with the baseline ASR model without and with GAN

Fig. 7

The performance comparison of the enhancement model trained independently and the enhancement models trained jointly with Transformer ASR model without and with GAN

Fig. 8

The performance comparison of the enhancement model trained independently and the enhancement models trained jointly with Conformer ASR model without and with GAN


In this paper, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the end-to-end ASR system. The joint scheme consists of an enhancement front-end, a recognition back-end, and the discriminant network. A highlight of the proposed framework is that the discriminant component first acts as the guide of the enhancement front-end training; afterwards, it participates in the adversarial joint training as the global instructor, which leads the enhancement front-end to output appropriate enhanced features for the downstream ASR task. Experimental results validate the efficacy of the proposed adversarial joint training strategy. In future work, we plan to investigate different framework architectures and training strategies to further improve performance.

Availability of data and materials

The dataset is the open source Mandarin speech corpus, AISHELL-1 [63], and can be found under the following link:




Automatic speech recognition


Self-attention automatic speech recognition


Generative adversarial networks


Least-squares generative adversarial networks


Speech enhancement


Speech enhancement generative adversarial networks


Self-attention speech enhancement generative adversarial networks


Character error rate


Parametric rectified linear units


Feed-forward network


Multi-head self-attention module


Gated linear units


Short-time Fourier transform


Long short-term memory


Bidirectional long short-term memory


Connectionist temporal classification


Recurrent neural network


Convolutional neural network


Multi-conditional training


  1. 1

    W. Chan, N. Jaitly, Q. V. Le, O. Vinyals, Listen, attend and spell (2015). arXiv preprint arXiv:1508.01211.

  2. 2

    J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based models for speech recognition. Adv. Neural Inf. Process Syst.28:, 577–585 (2015).

  3. 3

    G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. -r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Sig. Process Mag.29(6), 82–97 (2012).

  4. 4

    C. -C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al, in Proc. ICASSP. State-of-the-art speech recognition with sequence-to-sequence models (IEEE, 2018), pp. 4774–4778.

  5. 5

    D. Povey, H. Hadian, P. Ghahremani, K. Li, S. Khudanpur, in Proc. ICASSP. A time-restricted self-attention layer for ASR (IEEE, 2018), pp. 5874–5878.

  6. 6

    Z. Tian, J. Yi, J. Tao, Y. Bai, Z. Wen, Self-attention transducers for end-to-end speech recognition (2019). arXiv preprint arXiv:1909.13037.

  7. 7

    J. Salazar, K. Kirchhoff, Z. Huang, in Proc. ICASSP. Self-attention networks for connectionist temporal classification in speech recognition (IEEE, 2019), pp. 7115–7119.

  8. 8

    K. J. Han, J. Huang, Y. Tang, X. He, B. Zhou, in Proc. INTERSPEECH. Multi-stride self-attention for speech recognition, (2019), pp. 2788–2792.

  9. 9

    K. J. Han, R. Prieto, T. Ma, in Proc. ASRU. State-of-the-art speech recognition using multi-stream self-attention with dilated 1D convolutions (IEEE, 2019), pp. 54–61.

  10. 10

    N. -Q. Pham, T. -S. Nguyen, J. Niehues, M. Müller, S. Stüker, A. Waibel, Very deep self-attention networks for end-to-end speech recognition (2019). arXiv preprint arXiv:1904.13377.

  11. 11

    C. -F. Yeh, J. Mahadeokar, K. Kalgaonkar, Y. Wang, D. Le, M. Jain, K. Schubert, C. Fuegen, M. L. Seltzer, Transformer-transducer: End-to-end speech recognition with self-attention (2019). arXiv preprint arXiv:1910.12977.

  12. 12

    H. Luo, S. Zhang, M. Lei, L. Xie, Simplified self-attention for transformer-based end-to-end speech recognition (2020). arXiv preprint arXiv:2005.10463.

  13. 13

    J. Lim, A. Oppenheim, All-pole modeling of degraded speech. IEEE Trans. Acoust. Speech Sig. Process. 26(3), 197–210 (1978).

  14. 14

    A. Narayanan, D. Wang, in Proc. ICASSP. Ideal ratio mask estimation using deep neural networks for robust speech recognition (IEEE, 2013), pp. 7092–7096.

  15. 15

    Y. Wang, A. Narayanan, D. Wang, On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 1849–1858 (2014).

  16. 16

    S. Nie, S. Liang, W. Xue, X. Zhang, W. Liu, et al., in Proc. INTERSPEECH. Two-stage multi-target joint learning for monaural speech separation (International Speech Communication Association (ISCA)Dresden, 2015), pp. 1503–1507.

  17. 17

    F. Weninger, J. R. Hershey, J. Le Roux, B. Schuller, in Proc. GlobalSIP. Discriminatively trained recurrent neural networks for single-channel speech separation (IEEE, 2014), pp. 577–581.

  18. 18

    H. Erdogan, J. R. Hershey, S. Watanabe, J. Le Roux, in Proc. ICASSP. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks (IEEE, 2015), pp. 708–712.

  19. 19

    Y. Xu, J. Du, L. -R. Dai, C. -H. Lee, A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 23(1), 7–19 (2014).

  20. 20

    S. Nie, S. Liang, W. Liu, X. Zhang, J. Tao, Deep learning based speech separation via NMF-style reconstructions. IEEE/ACM Trans. Audio Speech Lang. Process. 26(11), 2043–2055 (2018).

  21. 21

    Y. Ephraim, in Proc. ICASSP. A minimum mean square error approach for speech enhancement (IEEE, 1990), pp. 829–832.

  22. 22

    Y. Hu, P. C. Loizou, Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 16(1), 229–238 (2007).

  23. 23

    S. R. Quackenbush, T. P. Barnwell, M. A. Clements, Objective measures of speech quality, Ellis Horwood Series in Artificial Intelligence (Prentice Hall, 1988).

  24. 24

    M. L. Seltzer, in Hands-Free Speech Communication and Microphone Arrays. Bridging the gap: towards a unified framework for hands-free speech recognition using microphone arrays (IEEE, 2008), pp. 104–107.

  25. 25

    Z. -Q. Wang, D. Wang, A joint training framework for robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24(4), 796–806 (2016).

  26. 26

    Z. -q. Wang, D. Wang, in Proc. INTERSPEECH. Joint training of speech separation, filterbank and acoustic model for robust automatic speech recognition (International Speech Communication Association (ISCA)Dresden, 2015).

  27. 27

    T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, Multichannel end-to-end speech recognition (2017). arXiv preprint arXiv:1703.04783.

  28. 28

    B. Liu, S. Nie, S. Liang, W. Liu, M. Yu, L. Chen, S. Peng, C. Li, in Proc. INTERSPEECH. Jointly adversarial enhancement training for robust end-to-end speech recognition (IEEE, 2019), pp. 491–495.

  29. 29

    S. Pascual, A. Bonafonte, J. Serra, SEGAN: speech enhancement generative adversarial network (2017). arXiv preprint arXiv:1703.09452.

  30. 30

    M. H. Soni, N. Shah, H. A. Patil, in Proc. ICASSP. Time-frequency masking-based speech enhancement using generative adversarial network (IEEE, 2018), pp. 5039–5043.

  31. 31

    D. Michelsanti, Z. -H. Tan, Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification (2017). arXiv preprint arXiv:1709.01703.

  32. 32

    P. Shen, X. Lu, S. Li, H. Kawai, in Proc. INTERSPEECH. Conditional generative adversarial nets classifier for spoken language identification (IEEE, 2017), pp. 2814–2818.

  33. 33

    S. Sahu, R. Gupta, C. Espy-Wilson, On enhancing speech emotion recognition using generative adversarial networks (2018). arXiv preprint arXiv:1806.06626.

  34. 34

    H. Hu, T. Tan, Y. Qian, in Proc. ICASSP. Generative adversarial networks based data augmentation for noise robust speech recognition (IEEE, 2018), pp. 5044–5048.

  35. 35

    C. Donahue, B. Li, R. Prabhavalkar, in Proc. ICASSP. Exploring speech enhancement with generative adversarial networks for robust speech recognition (IEEE, 2018), pp. 5024–5028.

  36. 36

    L. Dong, S. Xu, B. Xu, in Proc. ICASSP. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition (IEEE, 2018), pp. 5884–5888.

  37. 37

    A. Gulati, J. Qin, C. -C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al, Conformer: convolution-augmented transformer for speech recognition (2020). arXiv preprint arXiv:2005.08100.

  38. 38

    H. Phan, I. V. McLoughlin, L. Pham, O. Y. Chén, P. Koch, M. De Vos, A. Mertins, Improving GANs for speech enhancement. IEEE Sig. Process Lett.27:, 1700–1704 (2020).

  39. 39

    D. Baby, isegan: improved speech enhancement generative adversarial networks (2020). arXiv preprint arXiv:2002.08796.

  40. 40

    H. Phan, H. L. Nguyen, O. Y. Chén, P. Koch, N. Q. Duong, I. McLoughlin, A. Mertins, Self-attention generative adversarial network for speech enhancement (2020). arXiv preprint arXiv:2010.09132.

  41. 41

    Y. Koizumi, K. Yatabe, M. Delcroix, Y. Masuyama, D. Takeuchi, in Proc. ICASSP. Speech enhancement using self-adaptation and multi-head self-attention (IEEE, 2020), pp. 181–185.

  42. 42

    A. Sriram, H. Jun, Y. Gaur, S. Satheesh, in Proc. ICASSP. Robust speech recognition using generative adversarial networks (IEEE, 2018), pp. 5639–5643.

  43. 43

    K. Wang, J. Zhang, S. Sun, Y. Wang, F. Xiang, L. Xie, in Proc. INTERSPEECH. Investigating generative adversarial networks based speech dereverberation for robust speech recognition (IEEE, 2018), pp. 1581–1585.

  44. 44

    B. Liu, S. Nie, Y. Zhang, D. Ke, S. Liang, W. Liu, in Proc. ICASSP. Boosting noise robustness of acoustic model via deep adversarial training (IEEE, 2018), pp. 5034–5038.

  45. 45

    J. Droppo, A. Acero, in Proc. INTERSPEECH, 1. Joint discriminative front end and back end training for improved speech recognition accuracy (IEEE, 2006), pp. 281–284.

  46. 46

    T. Gao, J. Du, L. Dai, C. Lee, in Proc. ICASSP. Joint training of front-end and back-end deep neural networks for robust speech recognition (IEEE, 2015), pp. 4375–4379.

  47. 47

    M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, in Proc. SLT. Batch-normalized joint training for DNN-based distant speech recognition (IEEE, 2016), pp. 28–34.

  48. 48

    Y. Qian, T. Tan, D. Yu, Neural network based multi-factor aware joint training for robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 24(12), 2231–2240 (2016).

  49. 49

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need. Adv. Neural Inf. Process Syst.30:, 5998–6008 (2017).

  50. 50

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets. Adv. Neural Inf. Process Syst.27:, 2672–2680 (2014).

  51. 51

    X. Wang, R. Girshick, A. Gupta, K. He, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Non-local neural networks (IEEE Computer SocietySalt Lake City, 2018), pp. 7794–7803.

  52. 52

    H. Zhang, I. Goodfellow, D. Metaxas, A. Odena, in International Conference on Machine Learning. Self-attention generative adversarial networks (Association for Computing MachineryNew York, 2019), pp. 7354–7363.

  53. 53

    A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks (2015). arXiv preprint arXiv:1511.06434.

  54. 54

    K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of the IEEE International Conference on Computer Vision. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, (2015), pp. 1026–1034.

  55. 55

    T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs (2016). arXiv preprint arXiv:1606.03498.

  56. 56

    A. L. Maas, A. Y. Hannun, A. Y. Ng, in Proc. Icml, 30. Rectifier nonlinearities improve neural network acoustic models (JMLR.orgAtlanta, 2013), p. 3.

  57. 57

    Y. Lu, Z. Li, D. He, Z. Sun, B. Dong, T. Qin, L. Wang, T. -Y. Liu, Understanding and improving transformer from a multi-particle dynamic system point of view (2019). arXiv preprint arXiv:1906.02762.

  58. 58

    A. Zeyer, P. Bahar, K. Irie, R. Schlüter, H. Ney, in Proc. ASRU. A comparison of transformer and LSTM encoder decoder models for ASR (IEEE, 2019), pp. 8–15.

  59. 59

    Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, L. S. Chao, Learning deep transformer models for machine translation (2019). arXiv preprint arXiv:1906.01787.

  60. 60

    Y. N. Dauphin, A. Fan, M. Auli, D. Grangier, in International Conference on Machine Learning. Language modeling with gated convolutional networks (International Conference on Machine Learning (IOML)Sydney, 2017), pp. 933–941.

  61. 61

    P. Isola, J. -Y. Zhu, T. Zhou, A. A. Efros, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Image-to-image translation with conditional adversarial networks, (2017), pp. 1125–1134.

  62. 62

    D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A. A. Efros, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Context encoders: feature learning by inpainting, (2016), pp. 2536–2544.

  63. 63

    H. Bu, J. Du, X. Na, B. Wu, H. Zheng, in 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). Aishell-1: an open-source mandarin speech corpus and a speech recognition baseline, (2017), pp. 1–5.

  64. 64

    A. Varga, H. Steeneken, D. Jones, The noisex-92 study on the effect of additive noise on automatic speech recognition system. Speech Commun.12(3), 247–251 (1992).

  65. 65

    S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, T. Ochiai, in Proc. INTERSPEECH. ESPnet: end-to-end speech processing toolkit (IEEE, 2018), pp. 2207–2211.

  66. 66

    S. Kim, T. Hori, S. Watanabe, in Proc. ICASSP. Joint CTC-attention based end-to-end speech recognition using multi-task learning (IEEE, 2017), pp. 4835–4839.

  67.

    V. Nair, G. E. Hinton, in ICML. Rectified linear units improve restricted Boltzmann machines (Omnipress, Madison, 2010).

  68.

    T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 4(2), 26–31 (2012).

  69.

    D. P. Kingma, J. Ba, Adam: a method for stochastic optimization (2014). arXiv preprint arXiv:1412.6980.

  70.

    Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al, Google’s neural machine translation system: bridging the gap between human and machine translation (2016). arXiv preprint arXiv:1609.08144.

  71.

    T. Ko, V. Peddinti, D. Povey, S. Khudanpur, in Sixteenth Annual Conference of the International Speech Communication Association. Audio augmentation for speech recognition (International Speech Communication Association (ISCA), Dresden, 2015).

  72.

    D. S. Park, W. Chan, Y. Zhang, C. -C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le, Specaugment: a simple data augmentation method for automatic speech recognition (2019). arXiv preprint arXiv:1904.08779.

  73.

    A. Narayanan, D. Wang, in Proc. ICASSP. Joint noise adaptive training for robust automatic speech recognition (IEEE, 2014), pp. 2504–2508.

  74.

    ITU, Recommendation ITU-T P.862.2: Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs. ITU-Telecommunication Standardization Sector, 2007.

  75.

    P. C. Loizou, Speech enhancement: theory and practice (CRC Press, Boca Raton, 2013).


Open Access funding enabled and organized by Projekt DEAL.

Author information




Li, L. conceptualized the study. Li, L.; Kang, Y.; and Shi, Y. executed the experiments. All authors did the literature analysis, manuscript preparation, editing, and proofreading, and approved the final manuscript.

Corresponding author

Correspondence to Lujun Li.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Li, L., Kang, Y., Shi, Y. et al. Adversarial joint training with self-attention mechanism for robust end-to-end speech recognition. J AUDIO SPEECH MUSIC PROC. 2021, 26 (2021).


  • Self-attention mechanism
  • Generative adversarial networks
  • Speech enhancement
  • Robust speech recognition