Adversarial Joint Training with Self-Attention Mechanism for Robust End-to-End Speech Recognition

The self-attention mechanism has recently marked a new milestone in automatic speech recognition (ASR). Nevertheless, its performance is susceptible to environmental intrusions, because the system predicts the next output symbol from the full input sequence and the previous predictions. Inspired by the extensive applications of generative adversarial networks (GANs) in speech enhancement and ASR tasks, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the ASR system. It consists of a self-attention speech enhancement GAN and a self-attention end-to-end ASR model. Two highlights of the proposed framework are worth noting. First, it benefits from advances in both the self-attention mechanism and GANs. Second, the discriminator of the GAN plays the role of the global discriminant network during adversarial joint training, guiding the enhancement front-end to capture structures more compatible with the subsequent ASR module and thereby offsetting the limitations of separate training and handcrafted loss functions. With the adversarial joint optimization, the proposed framework is expected to learn more robust representations suitable for the ASR task. We conduct systematic experiments on the AISHELL-1 corpus. On the artificial noisy test set, the proposed framework achieves relative improvements of 66% over the ASR model trained on clean data only, 35.1% over the speech enhancement and ASR scheme without joint training, and 5.3% over multi-condition training.


Introduction
In recent years, attention-based end-to-end neural networks, which subsume the acoustic and language models into a single neural network, have triggered a revolution in the field of automatic speech recognition (ASR) [1,2] and are challenging the dominance of Hidden Markov Model-based hybrid systems [3]. Furthermore, the self-attention mechanism has brought another breakthrough in attention architectures: it considers the whole sequence at once to model feature interactions that are arbitrarily distant in time, leading to faster convergence and state-of-the-art results in ASR [4,5,6,7,8,9,10,11,12]. The self-attention system predicts the next output symbol conditioned on the full sequence of the previous predictions. Once a mistake occurs in one estimation step due to noise interference, all the subsequent steps are disturbed. As speech signals are inevitably corrupted by various background noises in realistic environments, it is crucial to improve the robustness of the self-attention mechanism for practical applications.
The mainstream solution to the noise robustness problem is adding an independent speech enhancement (SE) module as the front-end of ASR. Speech enhancement aims to transform the corrupted speech to its original clean version, which is achieved by various approaches, e.g., statistical methods such as the Wiener filter [13], time-frequency masking [14,15,16], signal approximation [17,18], and spectral mapping [19,20]. Whatever approach the speech enhancement model adopts, it is trained separately from the ASR model with different loss functions (e.g., mean squared error [21]) and evaluated by different objective criteria (e.g., Mean Opinion Score (MOS) prediction of the intrusiveness of background noise [22] and segmental SNR [23]). This mismatch between the enhancement training and the final ASR task easily leads to a sub-optimal solution [24]. Moreover, handcrafted loss functions tend to generate over-smoothed spectra or introduce unseen distortions, which sometimes even degrade the downstream ASR performance [25].
arXiv:2104.01471v1 [eess.AS] 3 Apr 2021
To reach the optimum and avoid introducing unnecessary distortion, a joint training framework has been proposed for robust speech recognition [26,25,27,28]. The fundamental concept of joint training is to concatenate the speech enhancement front-end and a downstream ASR model into a single neural network and jointly adjust the parameters of each module. The goal is that the enhancement front-end produces the enhanced features desired by the ASR component, while the ASR module guides the enhancement module in a more discriminative direction. In this way, the joint framework is optimized on the final ASR objective, i.e., the word/character error rate (W/CER).
Generative adversarial networks (GANs) aim at mapping samples $\tilde{x}$ from the distribution $\tilde{\mathcal{X}}$ to samples $x^*$ from another distribution $\mathcal{X}^*$. GANs have two components: the generator (G), which performs the mapping, and the discriminator (D), which guides the training of the generator. GANs have been applied to various speech signal processing tasks, such as speech enhancement [29,30], robust speaker verification [31], spoken language identification [32], speech emotion recognition [33], data augmentation [34], and robust speech recognition [35]. Inspired by the advancement of the self-attention mechanism and the various applications of GANs in speech-related tasks, we propose an adversarial joint training framework with the self-attention mechanism to boost the robustness of self-attention ASR systems. It consists of a self-attention speech enhancement GAN (SA SEGAN) and a self-attention end-to-end ASR model (SA ASR), for which we experiment with Transformer [36] and Conformer [37]. The discriminant component of SA SEGAN is first used to distinguish the enhanced features from the original clean features, instructing the enhancement module to output the clean distribution. In the joint training stage, the D component acts as the global training guide and steers the G component to produce features better suited to the ASR task. As the global guide, the discriminator is expected to remedy the limitations of separate training and handcrafted loss functions, alleviate the distortion, and lead the speech enhancement component to the global optimum. Meanwhile, the enhancement module is expected to capture more underlying structural characteristics. With this global guide, the whole framework is expected to automatically learn more robust representations compatible with the ASR task.
In summary, the main contributions of this paper are the following:
• We propose a self-attention based jointly-trained adversarial framework targeting robust speech recognition. This framework benefits from the advancement of both the self-attention mechanism and adversarial training.
• We apply global adversarial training, where the discriminant component does not concentrate exclusively on the enhancement front-end but also plays the role of the global training guide.
• The proposed framework yields remarkable results, achieving relative improvements of 66% compared to the ASR model trained on clean data only, 35.1% compared to the scheme without joint training, and 5.3% compared to multi-condition training.

Related Work
GANs have been applied to speech enhancement both without attention [29,38,39] and with attention [40,41]. These works validate the functionality of GANs in the enhancement task on diverse objective criteria; however, they do not demonstrate the effectiveness of the enhancement for the downstream ASR task. GANs have also been employed to improve the robustness of ASR models [35,42,43,44]. A potential limitation lies in the weak matching and communication between the integrated modules. For instance, speech enhancement and speech recognition are often designed independently, and the enhancement system is tuned according to metrics that are not directly related to the final ASR performance.
To address this concern, joint training is a promising approach. An early attempt was proposed in [45], where a feature extraction front-end and a Gaussian Mixture Model-Hidden Markov Model back-end are jointly trained with the maximum mutual information criterion. Other works have since been published in this field [25,26,46,47,48]. Nevertheless, effective integration of the various systems has long been difficult, mainly due to the different nature of the technologies involved at the different steps. For example, in [25,46], the joint training is actually performed as a fine-tuning procedure. To tackle this problem, this paper deploys the discriminant component of the GAN as a global guide, leading the enhancement module to match the downstream ASR module.
Self-attention Based SE-ASR Scheme
Fig. 1 gives an overview of our proposed joint training framework for robust end-to-end speech recognition. The system consists of a self-attention enhancement front-end and a self-attention ASR model. Given the raw noisy speech input $\tilde{X}$ and the raw clean input $X^*$, the joint training pipeline takes the following form:
$$\hat{X} = \mathrm{Generator}(\tilde{X}),$$
$$\hat{F} = \mathrm{FBank}(\hat{X}),$$
$$Y = \mathrm{SA\ ASR}(\hat{F}).$$
Here, Generator(·) acts as a speech enhancement front-end realized by the generator component of SA SEGAN [40], which transforms the noisy raw input $\tilde{X}$ to the enhanced $\hat{X}$. FBank(·) is a function for extracting the normalized log FBank features $\hat{F}$ from the enhancement outputs $\hat{X}$. Subsequently, SA ASR(·) is an ASR system based on self-attention layers, realized by the Transformer [36] or Conformer [37] architecture. $Y$ denotes the outputs of the whole scheme. Discriminator(·) is realized by the discriminator component of SA SEGAN [40], which distinguishes enhanced outputs from clean data.

Self-attention Mechanism
Self-attention [49] relates information over different positions of the entire input sequence to compute the attention distribution using scaled dot-product attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $Q \in \mathbb{R}^{t_q \times d_q}$, $K \in \mathbb{R}^{t_k \times d_k}$, and $V \in \mathbb{R}^{t_v \times d_v}$ are the three inputs of the self-attention layer: queries, keys, and values. Here $t_q$, $t_k$, and $t_v$ are the numbers of elements in the respective inputs, while $d_q$, $d_k$, and $d_v$ denote the corresponding element dimensions. The scalar $1/\sqrt{d_k}$ prevents the softmax function from falling into regions with tiny gradients. The output for one query is computed as a weighted sum of the values, where the weight of each value is computed by a designated function of the query with the corresponding key.
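As an illustration of the scaled dot-product attention described above, the following is a minimal sketch in plain Python. The function names and the toy matrices are hypothetical and chosen only for readability; it assumes queries and keys share the dimension $d_k$ and is not the implementation used in the experiments.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: t_q x d_k, K: t_k x d_k, V: t_k x d_v, all as lists of lists."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)  # keeps softmax out of tiny-gradient regions
    weights = []
    for q in Q:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted sum of the value rows.
    output = [
        [sum(w * v[c] for w, v in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights


# Toy example: 2 queries attending over 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to one, and the first query gives equal weight to the first and third keys because their dot products with it are identical.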

Speech Enhancement GANs (SEGAN)
Given a dataset $\mathcal{X} = \{(x_1^*, \tilde{x}_1), (x_2^*, \tilde{x}_2), \cdots, (x_N^*, \tilde{x}_N)\}$ consisting of $N$ pairs of raw signals, each comprising a clean speech signal $x^*$ and a noisy speech signal $\tilde{x}$, speech enhancement aims to find a mapping $f_\theta(\tilde{x}): \tilde{x} \mapsto \hat{x}$ that transforms the raw noisy signal $\tilde{x}$ to the enhanced signal $\hat{x}$. $\theta$ contains the parameters of the enhancement network.
Conforming to the GAN principle [50], the generator G learns an effective mapping that imitates the real data distribution to generate novel samples related to those of the training set. Hence, G acts as the enhancement function. In contrast, the discriminator D plays the role of a classifier that distinguishes real samples, coming from the dataset that G is imitating, from fake samples, made up by G. D thereby guides $\theta$ towards the distribution of clean speech signals. To sum up, SEGAN designates the generator G for the enhancement mapping, i.e., $\hat{x} = G(\tilde{x})$, and designates the discriminator D to guide the training of G by classifying $(x^*, \tilde{x})$ as real and $(\hat{x}, \tilde{x})$ as fake. Eventually, G learns to produce enhanced signals $\hat{x}$ good enough to fool D such that D classifies $(\hat{x}, \tilde{x})$ as real.

Self-attention Speech Enhancement GANs (SA SEGAN)
SA SEGAN [40] is SEGAN with the adoption of a self-attention layer adapted from non-local attention [51,52]. Given the feature map $F \in \mathbb{R}^{L \times C}$ output by a 1-dim convolutional layer, where $L$ is the time dimension and $C$ is the number of channels, the query matrix $Q$, the key matrix $K$, and the value matrix $V$ are obtained via the transformations
$$Q = F W^Q, \quad K = F W^K, \quad V = F W^V,$$
where $W^Q$, $W^K$, and $W^V$ denote the weight matrices of the convolutional layers. Furthermore, Phan et al. [40] introduce two factors, $b$ and $p$, for memory efficiency. $b$ reduces the channel dimension, while $p$ reduces the number of keys and values by a max pooling layer with filter width and stride size of $p$. Therefore, the dimensions of the matrices are
$$Q \in \mathbb{R}^{L \times \frac{C}{b}}, \quad K \in \mathbb{R}^{\frac{L}{p} \times \frac{C}{b}}, \quad V \in \mathbb{R}^{\frac{L}{p} \times \frac{C}{b}}.$$
The attention map $A$ and the attentive output $O$ are then computed as
$$A = \mathrm{softmax}(Q K^{\top}), \quad A \in \mathbb{R}^{L \times \frac{L}{p}},$$
$$O = (A V) W^O.$$
Each element $a_{ij} \in A$ indicates the extent to which the model attends to the $j$th column $v_j$ of $V$ when producing the $i$th output $o_i$ of $O$. With the weight matrix $W^O$ realized by a 1 × 1 convolution layer of $C$ filters, the shape of $O$ is restored to the original shape $L \times C$.
Finally, SA SEGAN contains a shortcut connection to facilitate information propagation, and a learnable parameter $\beta$ balances the weight between the output $O$ and the input feature map $F$ as $\beta O + F$. We illustrate the diagram of a simplified self-attention layer with $L = 9$, $C = 6$, $p = 3$, and $b = 2$ in Fig. 2.
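The effect of the two memory factors can be checked with simple shape bookkeeping. The sketch below is a hypothetical helper, not part of SA SEGAN itself. It assumes that $b$ divides the channel dimension exactly, that the max pooling with width and stride $p$ divides the time dimension exactly, and that the attention map has one row per query and one column per pooled key.

```python
def sa_layer_shapes(L, C, b, p):
    """Return the (rows, cols) shape of each matrix in the self-attention layer.

    L: time dimension, C: channels, b: channel reduction factor,
    p: pooling width/stride applied to keys and values.
    """
    if C % b != 0 or L % p != 0:
        raise ValueError("this sketch assumes b divides C and p divides L")
    return {
        "F": (L, C),            # input feature map
        "Q": (L, C // b),       # queries keep the full time resolution
        "K": (L // p, C // b),  # keys are max-pooled along time
        "V": (L // p, C // b),  # values are max-pooled along time
        "A": (L, L // p),       # attention map: queries x pooled keys
        "O": (L, C),            # restored to L x C by the 1x1 convolution
    }


# The simplified configuration used for the diagram: L = 9, C = 6, p = 3, b = 2.
shapes = sa_layer_shapes(L=9, C=6, b=2, p=3)
```

For the diagram's configuration, the attention map shrinks from 9 × 9 to 9 × 3, which is where the memory saving comes from.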

Network Architecture
The architectures of the generator G and the discriminator D are depicted in Fig. 3 (a) and (b). The G component uses an encoder-decoder architecture with fully-convolutional layers [53]. The generator's encoder comprises 11 1-dim strided convolutional layers with a common filter width of 31 and a stride length of 2, each followed by parametric rectified linear units (PReLUs) [54]. The encoder receives a one-second segment of the raw signal sampled at 16 kHz. At the 11th layer of the encoder, the encoding vector $c \in \mathbb{R}^{8 \times 1024}$ is stacked with the noise sample $z \in \mathbb{R}^{8 \times 1024}$, sampled from the distribution $\mathcal{N}(0, I)$, and presented to the decoder. The decoder mirrors the encoder architecture, with the same number of filters and the same filter width, to reverse the encoding process through deconvolutions. As in the encoder, each deconvolutional layer is followed by a PReLU. Skip connections connect each encoding layer with its corresponding decoding layer to allow information to flow between the encoding stage and the decoding stage.
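The time dimension of the encoding vector follows directly from the stride. The short sketch below is a hypothetical helper that assumes the padding is chosen so that every stride-2 layer exactly halves the sequence length, and that the one-second input holds 16,384 samples.

```python
def encoder_output_length(num_samples, num_layers, stride):
    """Length of the time axis after a stack of strided convolutions.

    Assumes each layer divides the length exactly by the stride
    (i.e. 'same'-style padding), which is an assumption of this sketch.
    """
    length = num_samples
    for _ in range(num_layers):
        if length % stride != 0:
            raise ValueError("length is not divisible by the stride")
        length //= stride
    return length


# 16,384 samples through 11 layers of stride 2: 16384 / 2**11 = 8 time steps.
time_steps = encoder_output_length(num_samples=16384, num_layers=11, stride=2)
```

The result of 8 time steps matches the first dimension of the 8 × 1024 encoding vector.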
The discriminator has an architecture similar to the encoder component of the generator. However, it receives a two-channel input and utilizes virtual batch-norm [55] before the LeakyReLU [56] activation with α = 0.3. Moreover, the D network is topped with a 1 × 1 convolutional layer that reduces the dimension of the output of the last convolutional layer from 8×1024 to 8 for the subsequent classification task with the softmax layer.
The self-attention layer described in Section 3.3.2 is coupled with the (de)convolutional layers of both the generator and the discriminator. Fig. 3 (a) and (b) show an example of the self-attention layer coupled with the lth (de)convolutional layer. If we add the self-attention layer to the lth convolutional layer of the encoder, the mirrored lth deconvolutional layer of the decoder and the lth layer of the discriminator are also coupled with a self-attention layer. Theoretically, the self-attention layer can be placed in any number, even all, of the (de)convolutional layers.

FBank Extraction Network
We extract the normalized log FBank features $\hat{f}$ as the input of the subsequent ASR model, computed from the enhanced signals $\hat{x}$:
$$\hat{f} = \mathrm{FBank}(\hat{x}) = \mathrm{Norm}(\log(\mathrm{Mel}(\mathrm{STFT}(\hat{x})))), \tag{10}$$
where STFT(·) is the short-time Fourier transform (STFT), Mel(·) is the Mel matrix multiplication, and Norm(·) normalizes the mean and variance to 0 and 1, respectively. Since each of these operations is differentiable, the FBank feature extraction layer is differentiable as well.
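The last two stages of this chain, the logarithm and the normalization, can be sketched in a few lines. The helper below is hypothetical: it takes Mel energies that are assumed to be already computed, adds a small constant `eps` before the logarithm (an assumption made here for numerical safety), and normalizes each Mel bin over the frames passed in rather than over a whole corpus.

```python
import math


def log_and_normalize(mel_energies, eps=1e-10):
    """Apply log and per-bin mean/variance normalization.

    mel_energies: list of frames, each a list of non-negative Mel energies.
    Returns features with zero mean and unit variance in every bin.
    """
    logs = [[math.log(v + eps) for v in frame] for frame in mel_energies]
    n_frames, n_bins = len(logs), len(logs[0])
    means = [sum(f[b] for f in logs) / n_frames for b in range(n_bins)]
    stds = [
        # Fall back to 1.0 for a constant bin to avoid dividing by zero.
        math.sqrt(sum((f[b] - means[b]) ** 2 for f in logs) / n_frames) or 1.0
        for b in range(n_bins)
    ]
    return [[(f[b] - means[b]) / stds[b] for b in range(n_bins)] for f in logs]


# Toy input: 3 frames with 2 Mel bins each.
features = log_and_normalize([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
```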

Multi-head Attention Mechanism
Multi-head attention [49], as the name implies, contains more than one self-attention module. As the core module of the Transformer [36], it leverages different attending representations jointly. Before each attention is performed, three linear projections transform the queries, keys, and values into more discriminative representations, respectively. Afterwards, each dot-product attention is calculated independently, and their outputs are concatenated and fed into another linear projection to obtain the final $d_{model}$-dimensional outputs:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O,$$
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V),$$
where $h$ refers to the number of heads, and $Q$, $K$, and $V$ have the same dimension $d_{model}$. Four projection ma-

Positional Encoding
One obvious limitation of the Transformer model is that the output is invariant to permutations of the input order, i.e., the Transformer does not model the order of the input sequence. Vaswani et al. [49] solve this problem by injecting information about absolute positions into the input sequence via sinusoidal positional embeddings:
$$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right),$$
$$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right),$$
where $pos$ refers to the position and $i$ is the dimension index. The sinusoidal function allows the model to extrapolate to sequence lengths longer than those seen during training.
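The sinusoidal positional encoding of Vaswani et al. [49] can be sketched as follows. The function name is hypothetical, and the toy sizes are chosen only to keep the example small: even dimensions receive a sine and odd dimensions a cosine of the same angle.

```python
import math


def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        # `even` plays the role of 2i, so the exponent is 2i / d_model.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe


# Toy table with 2 positions and a 4-dimensional model.
pe = positional_encoding(num_positions=2, d_model=4)
```

Because the table depends only on the position index, it can be computed for any sequence length at inference time without retraining.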

Feed-forward Network
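The feed-forward network (FFN) is another core module of the Transformer [36]. It is composed of two linear transformations with a ReLU activation in between. The dimensionality of the input and output is $d_{model}$, and the inner layer has the dimensionality $d_{ff}$. Specifically,
$$\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2, \tag{14}$$
where the weights $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$, and the biases $b_1 \in \mathbb{R}^{d_{ff}}$ and $b_2 \in \mathbb{R}^{d_{model}}$. The linear transformations are the same across different positions.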

Network Architecture
The detailed model architecture of the ASR-Transformer is as follows. The encoder is shown in Fig. 4 (a). The input embedding extracts expressive representations of dimension $d_{model}$. Thereafter, to enable the model to attend to the auxiliary position information, the $d_{model}$-dim positional encoding (Section 3.5.2) is added to the input encoding. The sum is then fed into a stack of $N_e$ encoder blocks, each of which has two sub-blocks: the first is the multi-head attention (Section 3.5.1), receiving queries, keys, and values from the previous block; the second is the feed-forward network (Section 3.5.3). Meanwhile, layer normalization and a residual connection are introduced to each sub-block for effective training. Thus, the pipeline of a sub-block is:
$$x + \mathrm{SubBlock}(\mathrm{LayerNorm}(x)). \tag{15}$$
The decoder is shown in Fig. 4 (b). The output embedding converts the character sequence to dimension $d_{model}$. After the positional encoding is added, the sum is fed into a stack of $N_d$ decoder blocks, each of which consists of three sub-blocks. The first is a masked multi-head attention, which ensures that the prediction for position $j$ depends only on the known outputs at positions less than $j$. The second is a multi-head attention whose keys and values come from the encoder outputs while the queries come from the previous sub-block outputs. The third is a feed-forward network. Similar to the encoder, layer normalization and a residual connection are also applied to each sub-block of the decoder. Finally, the output probabilities are obtained by a linear projection and a subsequent softmax function.
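The masking in the first decoder sub-block can be illustrated with a lower-triangular mask. The sketch below uses hypothetical helper names and assumes the usual convention that the decoder input is shifted right by one step, so that allowing row $j$ to see input positions up to and including $j$ exposes only outputs produced before position $j$. Blocked scores are replaced by a large negative constant so that they vanish after the softmax.

```python
def causal_mask(n):
    """mask[j][i] is True when decoder position j may attend to position i."""
    return [[i <= j for i in range(n)] for j in range(n)]


def apply_mask(scores, mask, blocked=-1e9):
    """Replace blocked attention scores with a large negative value."""
    return [
        [s if allowed else blocked for s, allowed in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]


# Toy example with 3 decoder positions and uniform raw scores.
mask = causal_mask(3)
masked_scores = apply_mask([[0.5, 0.5, 0.5]] * 3, mask)
```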

Conformer
Conformer [37] is a state-of-the-art ASR encoder architecture. Different from the Transformer block (as described in Section 3.5), it is equipped with a convolution layer to increase the local information modeling capability of the Transformer encoder model [49], and with a pair of FFN modules sandwiching the multi-head self-attention module and the integrated convolution module. The Conformer model consists of the Conformer encoder proposed in [37] and a Transformer decoder [36]. The encoder first processes the input with a convolution subsampling layer and then with Conformer blocks, as illustrated in Fig. 5 (i). The Conformer block (Fig. 5 (ii)) consists of a multi-head self-attention module (MHSA) and a convolution module, sandwiched by a pair of macron-feedforward modules [57]. Layer normalization is applied before each module, and dropout followed by a residual connection is applied afterwards (pre-norm) [58,59]. Mathematically, let $x_i$ be the input to the $i$th Conformer block; the output $y_i$ of this block is:
$$x'_i = x_i + \tfrac{1}{2}\mathrm{FFN}(x_i),$$
$$x''_i = x'_i + \mathrm{MHSA}(x'_i),$$
$$x'''_i = x''_i + \mathrm{Conv}(x''_i),$$
$$y_i = \mathrm{LayerNorm}\left(x'''_i + \tfrac{1}{2}\mathrm{FFN}(x'''_i)\right),$$
where FFN(·), MHSA(·), Conv(·), and LayerNorm(·) denote the macron-feedforward module, the multi-head self-attention module, the convolution module, and the layer normalization module, respectively. The multi-head self-attention module is the same as in Section 3.5.1 and is shown in Fig. 5 (ii-b). Sections 3.6.1 and 3.6.2 introduce the convolution module and the macron-feedforward module, respectively.

Convolution Module
Fig. 5 (ii-a) shows the details of the convolution module. The convolution module starts with a 1-dim pointwise convolution layer and a gated linear unit (GLU) activation [60]. The 1-dim pointwise convolution layer doubles the input channels, and the GLU activation splits the input along the channel dimension and computes an element-wise product. These are followed by a 1-dim depthwise convolution layer, a batch normalization layer, a Swish activation, and another 1-dim pointwise convolution layer.
As mentioned before, layer normalization is applied before each module, and dropout followed by a residual connection is applied afterwards (pre-norm).
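The two activations used in the convolution module can be sketched on a single frame of channel values. The helper names below are hypothetical. The GLU sketch assumes the standard form in which the first half of the channels is gated by the sigmoid of the second half, and Swish is taken as $x \cdot \mathrm{sigmoid}(x)$.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def swish(v):
    """Swish activation: the input scaled by its own sigmoid."""
    return v * sigmoid(v)


def glu(channels):
    """Gated linear unit over one frame.

    Splits the channel vector into two halves and multiplies the first
    half element-wise by the sigmoid of the second half, so the output
    has half as many channels as the input.
    """
    if len(channels) % 2 != 0:
        raise ValueError("GLU needs an even number of channels")
    half = len(channels) // 2
    values, gates = channels[:half], channels[half:]
    return [x * sigmoid(g) for x, g in zip(values, gates)]


# The pointwise convolution doubles the channels; GLU halves them again.
gated = glu([1.0, 2.0, 0.0, 0.0])
```

This shows why the pointwise convolution doubles the channel count first: after gating, the channel dimension is back to its original size.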

Macron-feedforward Module
Unlike the FFN module in the Transformer encoder [36], which comprises two linear transformations with a ReLU activation in between (Eq. 14), the Conformer encoder [37] introduces another FFN module and substitutes the ReLU activation with the Swish activation. Furthermore, inspired by Macaron-Net [57], this pair of FFN modules follows a half-step scheme and sandwiches the MHSA and the convolution modules. The details of the FFN are illustrated in Fig. 5 (ii-c).

Adversarial Joint Training
GANs aim at mapping samples $\tilde{x}$ from the distribution $\tilde{\mathcal{X}}$ to samples $x^*$ from another distribution $\mathcal{X}^*$. The generator G is tasked with learning an effective mapping that imitates the real data distribution to generate novel samples from the manifold defined by $\mathcal{X}^*$, by means of adversarial training exerted by the discriminator D. During back-propagation, D learns to separate real samples from fake samples more accurately; in return, G updates its parameters towards the real data manifold until the mixed Nash equilibria are reached [50]. The GAN training process can be formulated as a minimax game between G and D, with the objective
$$\min_G \max_D V(D, G) = \mathbb{E}_{x^* \sim \mathcal{X}^*}\left[\log D(x^*)\right] + \mathbb{E}_{\tilde{x} \sim \tilde{\mathcal{X}}}\left[\log\left(1 - D(G(\tilde{x}))\right)\right].$$
In our proposed robust end-to-end speech recognition scheme, the discriminant network first acts as the local guide for the enhancement module, where D shifts the training of G towards the distribution of clean data; thereafter, it is deployed as the global guide for the whole scheme, where D instructs G to output enhanced data pertinent to the subsequent ASR task.
We first train the enhancement module, which contains both the generator and the discriminator. To solve the problem of vanishing gradients caused by the sigmoid cross-entropy loss, the least-squares GAN (LSGAN) with binary coding (1 for real, 0 for fake) is used instead of the cross-entropy loss. Consequently, the loss function of the discriminator component changes to
$$\min_D V_{\mathrm{LSGAN}}(D) = \frac{1}{2}\mathbb{E}_{x^*, \tilde{x}}\left[(D(x^*, \tilde{x}) - 1)^2\right] + \frac{1}{2}\mathbb{E}_{z, \tilde{x}}\left[D(G(z, \tilde{x}), \tilde{x})^2\right], \tag{21}$$
where $z$ is a latent variable. To minimize the distance between its generations and the clean examples, it is beneficial to add a secondary component to the loss of G. Inspired by the effectiveness of the $L_1$ norm in the image manipulation domain [61,62], we deploy it in the G component to obtain more fine-grained and realistic results. The magnitude of the $L_1$ norm is controlled by a new hyper-parameter $\lambda$. Hence, the loss function of the generator component becomes
$$\min_G V_{\mathrm{LSGAN}}(G) = \frac{1}{2}\mathbb{E}_{z, \tilde{x}}\left[(D(G(z, \tilde{x}), \tilde{x}) - 1)^2\right] + \lambda \left\| G(z, \tilde{x}) - x^* \right\|_1. \tag{22}$$
In the joint training, the enhancement module is initialized from the trained G component, while the global discriminant module is initialized from the trained D component. The training of the ASR component is based on the cross-entropy criterion, namely
$$\mathcal{L}_{asr} = -\ln P(Y^* \mid \hat{F}) = -\sum_n \ln P(y_n^* \mid \hat{F}, y^*_{1:n-1}), \tag{23}$$
where $Y^*$ is the ground truth of a whole sequence of output labels and $y^*_{1:n-1}$ is the ground truth from output step 1 to $n-1$. In the proposed framework, the parameters of all procedures (enhancement, feature extraction, ASR, and the discriminant network) are updated by stochastic gradient descent on the loss function of the whole scheme. It is composed of three losses, $\mathcal{L}_{asr}$, $\mathcal{L}_{enh}$, and $\mathcal{L}_{gan}$, which correspond to Eqs. 23, 22, and 21, i.e.,
$$\mathcal{L} = \mathcal{L}_{asr} + \kappa \mathcal{L}_{enh} + \gamma \mathcal{L}_{gan},$$
where $\kappa$ and $\gamma$ are two hyper-parameters weighting the magnitude of the enhancement loss and the adversarial loss. Notably, the scheme targets the recognition performance, and the loss function of the discriminant network adapts the enhancement module implicitly.
As a result, the discriminant network guides the enhancement module to serve the subsequent ASR task more properly. Accordingly, the unnecessary speech distortion caused by the enhancement process is alleviated.
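The way the three losses combine can be sketched with scalar toy values. The function names below are hypothetical. The least-squares losses assume targets of 1 for real and 0 for fake as stated in the text, and they average over a small batch of discriminator outputs instead of taking a true expectation. The values of $\lambda$, $\kappa$, and $\gamma$ are placeholders, not the settings used in the experiments.

```python
def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real scores to 1, fake to 0."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term


def lsgan_g_loss(d_fake, enhanced, clean, lam):
    """Generator loss: fool the discriminator plus a weighted L1 distance."""
    adversarial = 0.5 * sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)
    l1 = sum(abs(e - c) for e, c in zip(enhanced, clean))
    return adversarial + lam * l1


def total_loss(l_asr, l_enh, l_gan, kappa, gamma):
    """Joint objective: ASR loss plus weighted enhancement and GAN losses."""
    return l_asr + kappa * l_enh + gamma * l_gan


# Toy values only; kappa, gamma and lam are illustrative placeholders.
l_gan = lsgan_d_loss(d_real=[0.0], d_fake=[1.0])
l_enh = lsgan_g_loss(d_fake=[1.0], enhanced=[0.5, 0.5], clean=[0.5, 1.0], lam=2.0)
joint = total_loss(l_asr=1.0, l_enh=l_enh, l_gan=l_gan, kappa=0.5, gamma=0.1)
```

Because the ASR loss enters unweighted, $\kappa$ and $\gamma$ directly control how strongly the enhancement and adversarial terms pull the shared parameters away from the pure recognition objective.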

Experimental Setups
We systematically evaluate the robustness of the adversarial joint training framework, and ablation tests are conducted to validate the effects of (i) the enhancement front-end on the ASR task, (ii) the joint training on the whole scheme, and (iii) the GAN on the joint training. For the noisy data, we artificially contaminate clean utterances in AISHELL-1 with 9 types of intrusions from the NOISEX-92 dataset [64]. We create the noisy training, development, and test sets in the same manner. Note that besides the "matched" noisy test set, which is contaminated by the same intrusions as the training dataset, we also corrupt the test set with the remaining 5 types of intrusions in the NOISEX-92 dataset as "unmatched" test material. Table 1 lists the types of intrusions mixed in the "match" and "unmatch" cases. All utterances are mixed with the intrusions at SNRs randomly sampled from [0 dB, 20 dB]. To sum up, we have two sorts of datasets for training:

Corpus
• clean: Clean utterances from the training dataset of AISHELL-1.
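The noisy utterances described above are obtained by mixing noise into clean speech at a sampled SNR. The following sketch shows one common way to do this and is not necessarily the exact procedure used here. The function names, the synthetic sine "speech", and the Gaussian "noise" are all hypothetical stand-ins, and the noise is assumed to be already cut to the length of the utterance.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that clean power / noise power matches snr_db."""
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB of a mixture, given the clean reference."""
    p_clean = sum(s * s for s in clean)
    p_resid = sum((y - s) ** 2 for y, s in zip(noisy, clean))
    return 10 * math.log10(p_clean / p_resid)


rng = random.Random(0)
snr_db = rng.uniform(0.0, 20.0)  # SNR sampled from [0 dB, 20 dB]
clean = [math.sin(0.1 * n) for n in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
noisy = mix_at_snr(clean, noise, snr_db)
```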

Baseline
For comparison, we take the work in [28] as the baseline model.
In [28], the mask-based enhancement network is deployed as the front-end. It estimates a masking function to multiply the frequency-domain feature of the noisy speech to form an estimate of the clean speech. For the ASR task, Liu et al. employ the ESPnet model [65]. It consists of an encoder network that maps the input feature sequence into a higher-level representation. Then a location-based attention layer integrates the representation into a context vector with the attention weight vector. In the end, the decoder predicts the next output conditioned on the full sequence of previous predictions. Besides, there is an extra discriminant network, whose loss is weighted in the loss function of the whole scheme to optimize the joint training.
Importantly, the baseline model does not contain any self-attention layer. Furthermore, the discriminant network in the baseline model is an extra auxiliary module, which does not participate in the enhancement training directly. By contrast, our work benefits from the self-attention mechanism, and the discriminant module exists natively as a component of the enhancement front-end. It acts as the local guide for the enhancement training, leading the enhancement network to output towards the distribution of the clean samples. Simultaneously, it also plays the role of the global guide, steering the enhancement module and the ASR module to be better matched.

Baseline
For the enhancement front-end, the input is the 257-dim logarithmic STFT features, and all input vectors are normalized to have zero mean and unit variance. The network is composed of a 3-layer long short-term memory (LSTM) network with 128 nodes, followed by a linear layer with the sigmoid activation function. The network outputs the masking estimate, whose size is equal to the input size, and the estimate is multiplied by the STFT feature of the noisy speech to estimate the clean speech.
For the ASR network, the input is the 80-dim normalized log FBank features transformed from the enhanced STFT features. The encoder is composed of a 4-layer bidirectional LSTM (BLSTM) with 320 cells, while the decoder is composed of a 1-layer unidirectional LSTM with 320 cells. After each BLSTM layer, a linear projection layer with 320 nodes is used to combine the forward and backward LSTM outputs. The location-based attention mechanism comprises 10 centered convolution filters of width 100. In addition, we adopt a joint connectionist temporal classification (CTC)-attention multi-task loss function [66] with a CTC loss weight of 0.1.
The discriminant network consists of a 4-layer convolution network, each of which is followed by the ReLU activation function [67].
For decoding, we use a beam search algorithm with the beam size 12. CTC rescores the hypotheses with 0.1 weight [66]. Besides, an external recurrent neural network (RNN) language model is also adopted with 0.2 weight during decoding.
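The weights quoted for the baseline can be read as interpolation coefficients. The sketch below follows the commonly used joint CTC-attention formulation of [66], in which the training loss and the beam-search score both interpolate CTC and attention terms, and the language model log-probability is added with its own weight. Treating the baseline's scoring exactly this way is an assumption, and the function names and toy values are hypothetical.

```python
def multitask_loss(l_ctc, l_att, ctc_weight):
    """Training loss: interpolate the CTC and attention losses."""
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att


def hypothesis_score(logp_att, logp_ctc, logp_lm, ctc_weight, lm_weight):
    """Beam-search score of one hypothesis (higher is better)."""
    return (
        (1.0 - ctc_weight) * logp_att
        + ctc_weight * logp_ctc
        + lm_weight * logp_lm
    )


# Baseline weights from the text: CTC weight 0.1, RNN language model weight 0.2.
loss = multitask_loss(l_ctc=10.0, l_att=20.0, ctc_weight=0.1)
score = hypothesis_score(-1.0, -2.0, -3.0, ctc_weight=0.1, lm_weight=0.2)
```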

The Proposed Joint Training Scheme
SA SEGAN
The SA SEGAN is trained for 86 epochs with RMSprop [68] and a learning rate of 0.0002. The batch size is 50. During training, we extract 1-second chunks of raw waveforms (L = 16,384 samples) with a 50% overlap. During testing, we slide the window without overlap through the whole duration of each test utterance and concatenate the outputs at the end of the stream. During both training and testing, we apply a high-frequency pre-emphasis filter with a coefficient of 0.95 to all inputs. For the self-attention layer in SA SEGAN, we use b = 8 and p = 4 for memory reduction. Phan et al. [40] suggest that the placement of the self-attention layer makes no clear difference to the performance, which indicates that applying self-attention to a higher-level (de)convolutional layer is expected to be as good as applying it to a lower layer. As a trade-off between computation time, memory requirements, and performance, we place the self-attention layer at the 10th layer (l = 10).
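The chunking and pre-emphasis steps can be sketched as follows. Both helpers are hypothetical and simplified. The pre-emphasis is assumed to be the usual first-order filter $y[n] = x[n] - 0.95\,x[n-1]$, and the chunker simply drops a tail shorter than one window, which is an assumption since the handling of the tail is not specified.

```python
def preemphasis(signal, coeff=0.95):
    """First-order high-frequency pre-emphasis filter."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


def chunk(signal, window, hop):
    """Slice a signal into windows of `window` samples, `hop` samples apart.

    A signal shorter than one window is returned as a single short chunk;
    a trailing remainder shorter than the window is dropped in this sketch.
    """
    last_start = max(len(signal) - window, 0)
    return [signal[s:s + window] for s in range(0, last_start + 1, hop)]


one_second = 16384
signal = [float(n % 7) for n in range(2 * one_second)]
emphasized = preemphasis(signal)
train_chunks = chunk(emphasized, window=one_second, hop=one_second // 2)  # 50% overlap
test_chunks = chunk(emphasized, window=one_second, hop=one_second)        # no overlap
```

With a two-second signal, the 50% overlap used in training yields three chunks, while the non-overlapping test-time window yields two.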

FBank extraction network
The FBank feature extraction network is a linear layer that transforms the raw outputs from the upstream SA SEGAN for the downstream ASR procedure. We extract 80-dim filterbanks with a window size of 25 ms and a window shift of 10 ms, extended with the temporal first- and second-order differences. Thereafter, we apply the logarithm and global mean and variance normalization according to Eq. 10.
Transformer
For training the Transformer, we adopt the Adam optimizer [69] with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$, and vary the learning rate over the course of training according to the formula
$$lrate = k \cdot d_{model}^{-0.5} \cdot \min\left(n^{-0.5},\ n \cdot warmup_n^{-1.5}\right),$$
where $n$ denotes the step number and $k$ is a tunable scalar, which is set to 10 initially and is decreased to 1 when the model converges. The learning rate increases linearly during the first $warmup_n = 25000$ steps, and afterwards it decreases proportionally to the inverse square root of the step number. We apply the residual dropout to each sub-block before adding the residual information, while the attention dropout is performed on the softmax activations in each attention. Both dropouts are set to 0.1. Additionally, we guide the system to be more attentive to closer positions by penalizing the attention weights of more distant position pairs. Similar to the baseline model, we also adopt a joint CTC-attention multi-task loss function [66], with a CTC loss weight of 0.3. In decoding, we set the beam size to 12 and the length penalty to α = 1.0 [70]. In addition, we integrate an external RNN language model with a weight of 0.3. The training procedure is stopped after 30 epochs.
Table 1 Categories of intrusions used in the "match" and "unmatch" cases.
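The warm-up behaviour of the learning rate can be checked with a short sketch. The function name is hypothetical, and the value `d_model = 256` used in the example is only an illustrative assumption. The formula implemented is the standard Transformer schedule scaled by the tunable factor $k$, with 25,000 warm-up steps.

```python
def transformer_lr(step, d_model, warmup_steps=25000, k=10.0):
    """Learning rate at a given step: linear warm-up, then inverse-sqrt decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first call
    return k * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Illustrative values only; d_model = 256 is an assumption of this sketch.
lr_half_warmup = transformer_lr(12500, d_model=256)
lr_peak = transformer_lr(25000, d_model=256)
lr_late = transformer_lr(100000, d_model=256)
```

The rate at half the warm-up is half the peak, and at four times the warm-up length it has decayed back to half the peak, reflecting the inverse square root.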
Conformer
The model hyper-parameters of the Conformer are $N_e = 12$, $N_d = 6$, $H = 4$, $d_k = 256$, and $d_{ff} = 2048$. The convolution subsampling layer is a 2-layer convolutional neural network (CNN) with 256 channels, a stride of 2, and a kernel size of 3. The kernel size of the convolution module is 31. We apply dropout with a rate of 0.1 in each residual unit of the Conformer. As with the Transformer, we train the network with the Adam optimizer [69] with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$, and a Transformer learning rate schedule [49] with 10000 warm-up steps. The learning rate peaks at $0.05/\sqrt{d}$, where $d$ is the model dimension in the Conformer encoder. Note that we do not apply speed perturbation [71] or SpecAugment [72] for data augmentation, in order to exclude extra tricks that could cause performance improvements. The training procedure is stopped after 30 epochs.

Results
We use character error rate (CER) to quantify the system performance in all experiments. We report the CER on the AISHELL-1 test set under three conditions: "clean" refers to the original clean test set of the corpus, "match" denotes the noisy test set contaminated by the "matched" intrusions in Table 1, and "unmatch" denotes the noisy test set corrupted by the remaining "unmatched" intrusions in Table 1. To validate the efficacy of the enhancement front-end, we also introduce multi-condition training (MCT) for comparison, where we artificially contaminate the AISHELL-1 training set with background noise at a given SNR. Three aspects are randomized: (i) the utterances to be corrupted are chosen randomly, and in total 90% of the training utterances are corrupted; (ii) the background noise is chosen randomly from the "matched" intrusions in Table 1; (iii) the SNR is sampled randomly from the range [0 dB, 20 dB]. Therefore, the MCT training data comprise 10% original clean data and 90% contaminated data, each contaminated utterance being corrupted by one of the "matched" noises at an SNR in the range [0 dB, 20 dB].
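The three randomizations of the MCT data preparation can be sketched as follows. This is an illustrative sketch only: it uses plain Python lists to stay dependency-free, treats the 90% figure as a per-utterance probability (an assumption; selecting a fixed 90% subset is equally plausible), and all function names are hypothetical.

```python
import math
import random


def power(x):
    """Mean power of a waveform given as a list of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio is snr_db.

    Assumes the noise segment is not silent (non-zero power).
    """
    # Repeat the noise cyclically so that it covers the whole utterance.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def make_mct_utterance(speech, matched_noises, rng,
                       corrupt_prob=0.9, snr_range=(0.0, 20.0)):
    """Return the utterance clean or corrupted by a random matched noise."""
    if rng.random() >= corrupt_prob:
        return speech                      # (i) about 10% stay clean
    noise = rng.choice(matched_noises)     # (ii) random "matched" noise
    snr_db = rng.uniform(*snr_range)       # (iii) SNR drawn from the range
    return mix_at_snr(speech, noise, snr_db)


if __name__ == "__main__":
    rng = random.Random(0)
    speech = [rng.gauss(0, 1) for _ in range(800)]
    noises = [[rng.gauss(0, 1) for _ in range(300)]]
    noisy = make_mct_utterance(speech, noises, rng)
    print(len(noisy))
```

Scaling the noise rather than the speech keeps the level of the target signal unchanged across SNRs, which is one common design choice among several.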
Firstly, we train the ASR network with the original clean utterances and with the multi-condition training strategy. The results are shown in Table 2. Ranked by ASR performance, the Conformer comes first, followed by the Transformer and then the baseline model, consistent with the observations in [36,37]. However, their performance deteriorates rapidly on the noisy test sets, demonstrating the necessity of investigating robustness. MCT considerably improves the system's robustness. On the "matched" test set, it outperforms clean training by 63.2% and 62.9% relative for the Transformer and Conformer models, respectively; on the "unmatched" test set, it outperforms clean training by 56.3% and 56.8% relative for the Transformer and Conformer models, respectively.
Secondly, we train SA SEGAN with the training data contaminated by the "matched" intrusions in Table 1 to enhance the noisy speech. The enhanced features are then used for the downstream ASR task. Importantly, the ASR models are taken from the same well-trained models as in Table 2, which means that the enhancement front-end and the ASR back-end are trained separately with different objectives. As exhibited in Table 3, the enhancement module substantially improves the performance of the ASR component trained on clean data only. Compared to Table 2, it outperforms all three ASR modules (baseline, Transformer, Conformer) without the enhancement front-end. The improvement achieved on the "matched" test set is more remarkable than that achieved on the "unmatched" test set. For instance, it outperforms the Conformer without the enhancement module by 46.0% on the "matched" test set but by only 9.1% on the "unmatched" test set. This difference arises because SA SEGAN is trained with the "matched" intrusions and can therefore better enhance data contaminated by the same intrusions at test time.
All these improvements confirm the efficacy of the enhancement module for improving the robustness of the ASR system. Nevertheless, improving the robustness of the framework in unseen noisy environments remains a challenge. Additionally, the speech enhancement module deteriorates the performance of the ASR MCT network, which is consistent with the observations in [35]. This degradation may stem from latent distortions caused by overtraining of the enhancement module.
To remedy the deterioration of the performance of the ASR MCT network, we retrain the network with the enhanced features. Assuming that the network may also benefit from the knowledge of the noisy features, we also experiment with ingesting both enhanced and noisy features. Results are displayed in Table 4. Both the Transformer MCT model and the Conformer MCT model are initialized from their respective well-trained MCT checkpoints, with the additional parameters set to zero to ensure a fair starting point. As presented in Table 4, retraining with the enhanced features improves the performance in both "matched" and "unmatched" cases, and retraining with both enhanced and noisy features improves the performance slightly further.
Figure 6 Comparisons of different enhancement models' performance under different training conditions.
Lastly, we jointly train the whole scheme with and without adversarial training according to Eq. 24. In the framework, the enhancement front-end is initialized from the generator (G component) of SA SEGAN, the ASR back-end is initialized from the ASR MCT checkpoint (without retraining), and the adversarial module is initialized from the discriminator (D component) of SA SEGAN. When the adversarial module does not participate in the training, we set the loss weights to κ=6.0 and γ=0; by contrast, when it participates in the training, we set κ=6.0 and γ=3.0. Results are presented in Table 5. Compared to Table 2, the joint training mitigates the distortion problem existing in the MCT strategy. Additionally, adversarial training improves the performance further and exceeds the performance of retraining with both enhanced and noisy features for both the Transformer and the Conformer. Taking the Conformer as an example, compared to the Conformer trained with clean data only, the adversarial joint training yields 63.8% and 59.0% relative improvements on the "matched" and "unmatched" datasets, respectively; meanwhile, the adversarial joint training outperforms the MCT strategy by 2.5% and 5.2% relative on the "matched" and "unmatched" datasets, respectively. These results indicate the efficacy of the adversarial joint training in improving the robustness of the end-to-end ASR scheme.
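For readers who wish to reproduce the kind of figures reported in this section, the sketch below shows one common way to compute CER and a relative improvement between two systems. It is an illustrative sketch: the function names are hypothetical, the strings are toy examples, and the CER values in the usage lines are made up rather than taken from Tables 2 to 5.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_improvement(baseline_cer, new_cer):
    """Relative CER reduction in percent with respect to the baseline."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


if __name__ == "__main__":
    print(cer("abcd", "abxd"))               # toy strings
    print(relative_improvement(20.0, 15.0))  # hypothetical CER values
```

A relative improvement is thus a ratio of error reductions, so a given percentage corresponds to a larger absolute CER drop when the baseline CER is high, as on the noisy test sets.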

Discussion
To analyse the differences between enhancement modules that are trained independently, jointly without GANs, and jointly with GANs, we quantify their performance on the following five objective criteria (the higher the better):
• SSNR: Segmental SNR [23] (in the range of [0, +∞))
• CBAK: MOS prediction of the intrusiveness of background noises [22] (in the range of [1, 5])
• CSIG: MOS prediction of the signal distortion attending only to the speech signal [22] (in the range of [1, 5])
• COVL: MOS prediction of the overall effect [22] (in the range of [1, 5])
• PESQ: Perceptual evaluation of speech quality, using the wide-band version recommended in ITU-T P.862.2 [73] (in the range of [-0.5, 4.5])
All criteria are computed based on the implementation in [74], available at the publisher website [1]. We quantify the performance of the enhancement front-end that is trained independently, jointly without GANs, and jointly with GANs in the case of the Transformer scheme. As exhibited in Fig. 6, the joint training degrades the enhancement module's performance on SSNR, CBAK, COVL, and PESQ, but not on CSIG. It is safe to draw two conclusions from this result. First, these objective criteria cannot indicate the suitability of the enhanced data for the ASR task, which verifies the assertion that independent training easily leads the enhancement module to a suboptimum. Second, the opposite trend on CSIG proves the assumption that the joint training strategy can mitigate the unseen distortion introduced by the handcrafted loss function. Another phenomenon worth noting is that the discrepancies on CBAK and SSNR reveal the conflict between erasing the noise contamination and averting speech distortion. Therefore, the equilibrium between these two goals is critical.
[1] https://www.crcpress.com/downloads/K14513/K14513_CD_Files.zip
The experimental results in Section 6 validate the efficacy of the adversarial joint training with a global discriminant guide for reaching the equilibrium point.
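Of the five criteria used in this section, segmental SNR is the simplest to state explicitly. The following sketch averages frame-wise SNRs between a clean reference and an enhanced signal. It is not the implementation of [74]: the frame length, the non-overlapping framing, and the clamping bounds of -10 dB and 35 dB are assumptions borrowed from common practice, and they may differ from the range quoted in the list of criteria.

```python
import math


def segmental_snr(clean, enhanced, frame_len=512,
                  min_db=-10.0, max_db=35.0, eps=1e-10):
    """Mean of per-frame SNRs (dB) between a clean and an enhanced signal.

    frame_len, min_db and max_db are illustrative assumptions.
    """
    frame_snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal = sum(v * v for v in c)
        noise = sum((a - b) ** 2 for a, b in zip(c, e))
        snr = 10 * math.log10((signal + eps) / (noise + eps))
        # Clamp each frame so that silent or perfect frames do not dominate.
        frame_snrs.append(min(max(snr, min_db), max_db))
    if not frame_snrs:
        raise ValueError("signal shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)


if __name__ == "__main__":
    clean = [1.0] * 1024
    enhanced = [0.9] * 1024
    print(segmental_snr(clean, enhanced))
```

Because the measure penalizes any sample-level deviation from the clean reference, an enhancement front-end tuned for recognition accuracy can score lower on it while still producing features that suit the ASR back-end, which is consistent with the trend discussed above.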

Conclusion
In this paper, we propose an adversarial joint training framework with the self-attention mechanism to boost the noise robustness of the end-to-end ASR system. The joint scheme consists of an enhancement front-end, a recognition back-end, and a discriminant network. A highlight of the proposed framework is that the discriminant component first guides the training of the enhancement front-end; afterwards, it participates in the adversarial joint training as the global instructor, which leads the enhancement front-end to output enhanced features appropriate for the downstream ASR task. Experimental results validate the efficacy of the proposed adversarial joint training strategy. In future work, we plan to investigate different framework architectures and training strategies to further improve performance.