Accent modification for speech recognition of non-native speakers using neural style transfer

Nowadays automatic speech recognition (ASR) systems can achieve higher and higher accuracy rates depending on the methodology applied and datasets used. The rate decreases significantly when the ASR system is used with a non-native speaker of the language to be recognized. The main reason for this is specific pronunciation and accent features related to the mother tongue of that speaker, which influence the pronunciation. At the same time, an extremely limited volume of labeled non-native speech datasets makes it difficult to train, from the ground up, sufficiently accurate ASR systems for non-native speakers. In this research, we address the problem and its influence on the accuracy of ASR systems, using the style transfer methodology. We designed a pipeline for modifying the speech of a non-native speaker so that it more closely resembles native speech. This paper covers experiments for accent modification using different setups and different approaches, including neural style transfer and an autoencoder. The experiments were conducted on the English language pronounced by Japanese speakers (UME-ERJ dataset). The results show that there is a significant relative improvement in terms of the speech recognition accuracy. Our methodology reduces the necessity of training new algorithms for non-native speech (thus overcoming the obstacle related to data scarcity) and can be used as a wrapper for any existing ASR system. The modification can be performed in real time, before a sample is passed into the speech recognition system itself.

the speech recognition system's accuracy to decrease in such cases [4][5][6][7]. The current pace of development in the global economy, education, and mobility of the workforce creates the need to properly recognize the speech of non-native speakers, who nowadays represent the vast majority of users.
Traditional approaches for training speech recognition classifiers usually tend to employ supervised learning techniques [8][9][10][11][12][13][14][15]. While perfectly fitted for recognizing speech in the most popular languages worldwide, supervised learning methodologies will not produce classifiers of decent quality for non-native speakers. The main reason is the lack of labeled datasets of non-native speech large enough to be used as a training set in a supervised learning algorithm. We have dealt with the problem of data scarcity regarding non-native speech in our previous research [16][17][18][19]. Our idea at the time was to use unlabeled datasets (e.g., Japanese people who speak English and an English corpus) in a setup called dual supervised learning.
This time we plan to tackle the problem of non-native accents using the style transfer methodology [20,21] adapted for the case of speech. The application of style transfer in the audio domain is not new. In [22], the authors investigated how to transfer the style of a reference audio signal to a target audio content. They proposed a flexible framework for the task, which uses a sound texture model to extract statistics characterizing the reference audio style, followed by an optimization-based audio texture synthesis to modify the target content. In contrast to mainstream optimization-based visual transfer methods, the process proposed by the authors is initialized by the target content instead of random noise and the optimized loss is only about texture, not structure.
In [23], the authors presented a new machine learning technique for generating music and audio signals. The focus of their work was to develop new techniques parallel to what has been proposed for artistic style transfer for images by others. They presented two cases of modifying an audio signal to generate new sounds. A feature of their method is that a single architecture can generate these different audio-style-transfer types using the same set of parameters which otherwise require complex hand-tuned diverse signal processing pipelines.
To tackle the problem of non-native speech recognition, we plan to apply and adjust style transfer to the domain of speech and sound in order to create an algorithm for real-time pronunciation and accent modification. This would make it possible to create a wrapper over already existing and trained ASR systems. Such an approach could allow the modification of a non-native speaker's voice in real time, so that the ASR system in use can recognize the speech with a higher degree of accuracy.

Methods
Within this article, we present an approach for handling the problem related to a specific, non-native accent. We created a method that modifies the accent of a non-native speaker so that it resembles the accent of a native speaker to a higher extent. The purpose of this method is to increase the accuracy of ASR systems which had already been developed and trained using a native speech dataset, without the necessity to train new ASR models adapted for a specific non-native accent.
Our idea is to modify the accent of speech using the representation of a sound wave in a graphical domain, i.e., a spectrogram.
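To make the spectrogram representation concrete, the following is a minimal sketch of a magnitude spectrogram computed with a short-time Fourier transform. It is not the conversion used in this work: the function name magnitude_spectrogram, the Hann window, and the frame and hop sizes are assumptions made for illustration, and a naive DFT from the standard library is used only to keep the example self-contained (an FFT routine would be used in practice).

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_len=64, hop=32):
    """Return a list of frames, each a list of magnitudes for bins 0..frame_len//2."""
    # Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    spectrogram = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        magnitudes = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT of one windowed frame (illustrative, O(N^2)).
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            magnitudes.append(abs(coeff))
        spectrogram.append(magnitudes)
    return spectrogram


# A pure tone completing 8 cycles per 64-sample frame should peak in bin 8.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = magnitude_spectrogram(tone)
peak_bins = [frame.index(max(frame)) for frame in spec]
```

Each row of the result is one time frame and each column one frequency bin, which is the image-like layout that the accent modification networks operate on.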
The general flow of our approach is depicted in Fig. 1.
At the beginning, we transform a sound wave file into a spectrogram (the process indicated as A on the diagram). Secondly, accent modification is performed. Within the second step, we decided to check two ways of modifying accent with spectrograms and they are described in detail in Sections 2.1 and 2.2.
Finally, the sound wave, in modified form, is fed into the ASR system in order to recognize the speech into text. In our research, we experimented with two kinds of speech recognition process. As shown in the figure, one way is to revert the modified spectrogram back to the sound wave (the process indicated as B on the diagram) in a form of WAV file and then feed it to a previously agreed ASR system. The second way indicated as C is to feed the spectrogram directly to another ASR system (adapted for recognizing the speech from spectrograms) created within this research (Section 2.3).
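The flow described above (A, accent modification, then B or C) can be sketched as a thin wrapper around an existing recognizer. This is only an outline under stated assumptions: the function recognize_with_accent_modification and all of its parameters are hypothetical names, and the stand-in steps in the usage part are placeholders rather than real models.

```python
def recognize_with_accent_modification(wave, to_spectrogram, modify_accent,
                                       asr, to_wave=None):
    """Run step A, accent modification, then path B or path C.

    All callables are placeholders supplied by the caller:
      to_spectrogram: wave -> spectrogram          (step A)
      modify_accent:  spectrogram -> spectrogram   (Section 2.1 or 2.2)
      to_wave:        spectrogram -> wave          (step B, optional)
      asr:            wave or spectrogram -> text
    """
    spectrogram = to_spectrogram(wave)
    modified = modify_accent(spectrogram)
    if to_wave is not None:
        # Path B: invert to a waveform for an ASR that expects audio.
        return asr(to_wave(modified))
    # Path C: the ASR consumes spectrograms directly.
    return asr(modified)


def trace(name, log):
    """Build a pass-through step that records its name when called."""
    def step(x):
        log.append(name)
        return x
    return step


# Toy stand-ins so the sketch runs end to end; they are not real models.
log_b, log_c = [], []
text_b = recognize_with_accent_modification(
    [0.0, 0.1], trace("A", log_b), trace("modify", log_b),
    asr=lambda x: "hello", to_wave=trace("B", log_b))
text_c = recognize_with_accent_modification(
    [0.0, 0.1], trace("A", log_c), trace("modify", log_c),
    asr=lambda x: "hello")
```

Keeping the recognizer behind a single callable reflects the wrapper idea: the ASR system itself is never retrained, only its input is changed.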
The accent modification algorithms are trained in a way that they learn how to modify the spectrogram representing non-native speech to one resembling the same utterance by a native speaker. The correctness of the modified speech is determined by the accuracy of the speech recognition system trained on a dataset containing native speaker samples, allowing us to evaluate the quality of accent modification. The criterion which decides whether or not the style-modifying algorithm can perform well is a reduction in the metrics related to the error yielded during the inference using the ASR networks, which were trained on native speaker samples.

Accent modification using autoencoder
In this approach, we came up with an autoencoder based on a convolutional neural network (CNN) [24]. Our idea is to employ such a network for the purpose of changing the pronunciation style (Fig. 2).
The autoencoder was written using the Keras library [25], and its detailed architecture is described in Table 1.
During the training phase, the autoencoder is fed spectrograms of samples of non-native speakers, whereas the autoencoder's output is compared against the spectrograms of exactly the same utterances pronounced by native speakers of the particular language. Then backpropagation causes the autoencoder to learn the conversion of the same words and sentences from the speech containing a non-native accent to the one with the modified accent. During the inference phase, the input spectrogram created in the first step of our pipeline is fed into the autoencoder as input data. The output of the autoencoder is a spectrogram which is slightly converted according to the CNN layers' weights learned after the training.

Accent modification using style transfer-based approach
Another approach we decided to experiment with employs a style transfer methodology adapted for the domain of speech and sound. Specifically, we decided to create a method that resembles the feedforward style transfer algorithm from the graphical domain.
To briefly explain the problem of the graphical style transfer, we try to modify an image in a way that its style resembles the style of another, a so-called style image. At the same time, the content of the image ideally should not be modified.
The general flow of the accent modification using style transfer is depicted in Fig. 3.
In order to utilize such a setup, we first separately train a network (here called a loss network), which is used as a speech recognizer in the style transfer approach. Its role is to separate speech spectrograms into multiple layers using a convolutional network. It is used for extracting content (related to the utterance) and style (related to the accent and pronunciation) from the images (spectrograms). The loss network is depicted on the diagram as LN. In order to properly extract style and content from the input spectrograms, the loss network must be trained using data from that domain. The data utilized in the training process is described in Section 3.1.
As the loss network model for automatic speech recognition tasks, we combined convolutional and recurrent layers, where the former are employed as feature extractors. Convolutional neural networks have been proven to give outstanding results when applied to images, here spectrograms. They are able to detect and learn local features, which are later passed on to the recurrent layers. The architecture of the neural network is depicted in Table 2. It accepts an image as the input and outputs a sequence of letters.
The main step of this approach is training the autoencoder for style modification, which performs the essence of the idea. Its architecture is the same as the one of the autoencoder used in the previous approach for accent modification and is described in detail in Table 1.
During one training step, the spectrogram of a sample with a native accent is fed into the loss network, which extracts the style matrix $S_n$ from certain layers. This is depicted in the diagram as LN (s). Next, the spectrogram of a sample containing a non-native accent is passed through the same loss network, which results in the extraction of the content matrix $C_{nn}$. This process is depicted in the diagram as LN (c). The same sample is also fed into the style-modifying autoencoder, which outputs a modified spectrogram that is in turn fed into the loss network to extract the matrices representing the style and content of the transformed sample ($S_{tnn}$ and $C_{tnn}$, respectively). This is symbolized in the figure as LN (s), (c). Having obtained $S_n$, $C_{nn}$, $S_{tnn}$, and $C_{tnn}$, we can formulate the content and style losses.
Content loss is calculated from $C_{nn}$ and $C_{tnn}$ over l, the set of convolutional layers representing the content of the sound wave.
Style loss is calculated from the Gram matrices of the style layers, where:
• $G_n^l$ is the Gram matrix of the lth layer of $S_n$ received from the loss network
• $G_{tnn}^l$ is the Gram matrix of the lth layer of $S_{tnn}$
The Gram matrix is the result of the multiplication of a matrix by its transpose.
The final loss function combines the content loss and the style loss. Having formulated it, we backpropagate the error to train the style-modifying autoencoder network for the task of accent modification. At this step, the weights of the loss network are already frozen and do not take part in the training process.
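One way to write the three losses consistently with the symbols defined above is sketched below. It follows the usual feedforward style transfer objective and should be read as an assumption rather than as the exact formulation used here: the squared norms, the absence of per-layer normalization, the feature matrix $F^l$, the layer sets $\mathcal{C}$ (content) and $\mathcal{S}$ (style), and the weights $\alpha$ and $\beta$ are all introduced for illustration.

```latex
% Content loss over the content layers (assumed squared Euclidean distance)
L_{content} = \sum_{l \in \mathcal{C}} \left\lVert C_{nn}^{l} - C_{tnn}^{l} \right\rVert_2^2

% Gram matrix of a layer's feature matrix F^l (the matrix times its transpose)
G^{l} = F^{l} \left( F^{l} \right)^{\top}

% Style loss over the style layers (assumed squared Frobenius distance)
L_{style} = \sum_{l \in \mathcal{S}} \left\lVert G_{n}^{l} - G_{tnn}^{l} \right\rVert_F^2

% Final loss as a weighted sum (alpha and beta are hypothetical weights)
L = \alpha \, L_{content} + \beta \, L_{style}
```

With $\alpha = \beta = 1$ this reduces to the plain sum of the two losses.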
Such sequences are executed repeatedly with samples drawn from native speech datasets and non-native ones, respectively. It is worth mentioning that in cases of style transfer, as opposed to the autoencoder approach, it is not necessary for both spectrograms (with native and non-native accents) to represent the same content. As mentioned in the experimental part of this article, we performed several runs of training the autoencoder in order to find the best subsets of convolutions to represent the style and content layers.
During the inference phase, we use only the trained autoencoder that modifies the accent of a new sample.
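The Gram matrix and the combined loss used during training can be illustrated with a small, framework-free sketch. It assumes squared-error distances and hypothetical weights alpha and beta; the function names and the toy feature matrices are illustrative and do not come from the implementation described in this article.

```python
def gram_matrix(features):
    """Multiply a feature matrix (rows = channels) by its transpose."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in features]
            for row_i in features]


def squared_error(x, y):
    """Sum of squared element-wise differences of two equal-shape matrices."""
    return sum((a - b) ** 2
               for row_x, row_y in zip(x, y)
               for a, b in zip(row_x, row_y))


def total_loss(content_nn, content_tnn, style_n, style_tnn,
               alpha=1.0, beta=1.0):
    """Weighted sum of content and style losses over the chosen layers.

    Each argument is a list with one feature matrix per selected layer.
    alpha and beta are hypothetical weights, not values from the article.
    """
    content = sum(squared_error(c, t)
                  for c, t in zip(content_nn, content_tnn))
    style = sum(squared_error(gram_matrix(s), gram_matrix(t))
                for s, t in zip(style_n, style_tnn))
    return alpha * content + beta * style


# Toy feature matrix with two channels and two time frames.
features = [[1.0, 2.0], [0.0, 1.0]]
gram = gram_matrix(features)
loss = total_loss(content_nn=[[[1.0, 2.0]]], content_tnn=[[[1.0, 0.0]]],
                  style_n=[features], style_tnn=[features])
```

Because the Gram matrix has one row and one column per channel, its size does not depend on the number of time frames, which is consistent with the remark that the native and non-native samples need not contain the same utterance.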

Speech recognition using spectrograms
At the end of our pipeline, the speech recognition process is performed. One of the two approaches we experimented with is using an ASR system trained on spectrograms. We decided to create a model for the speech recognition using spectrograms converted from WAV files. The architecture of the network playing the role of the ASR system is depicted in Table 3. We used a combination of convolutional and recurrent neural networks (CNN-RNN) in order to train a new speech recognition system. Similar to the loss network mentioned earlier, this network also accepts images and outputs a sequence of letters.
We trained the network using the popular, publicly available LibriSpeech dataset. The details, together with the metrics and the results of the training process, are shown in Section 3.

Cloud-based ASR
Another way of speech recognition that we decided to check is an online ASR service. In our research, we decided upon Google Cloud Speech-to-Text and used the results of recognized text to calculate accuracy metrics.

TDNN architecture-based ASR
Another network, a Time Delay Neural Network (TDNN), was used as an evaluation tool in our methodology.

Datasets used
One of the two datasets utilized within this research is a set of around 75,000 samples called English Speech Database Read by Japanese Students (UME-ERJ), in which Japanese as well as American speakers pronounce English sentences.

1. Sentences for learning phonemic pronunciation:
• 460 phonetically balanced sentences
• 32 sentences including phoneme sequences difficult for Japanese to pronounce correctly
• 100 sentences designed for test set
• 302 minimal-pair words
• 300 phonemically balanced words
2. Sentences for learning prosody of speech:
• 94 sentences with various intonation patterns
• 120 sentences with various accent and rhythm patterns
• 109 words with various accent patterns
The same dataset was used in our previous work [16]. The dataset was employed for training both the autoencoder in Section 2.1 and the style transfer network in Section 2.2, as it contains sentences and words pronounced by both native and non-native speakers. The training dataset contains around 18,662 pairs of spectrograms representing the exact same utterances from native and non-native speakers. This amount of the recordings represents around 26 h of speech. The remaining test and validation subsets did not overlap with the training subset.
Another dataset used in the research is the LibriSpeech dataset. It was used to train both the spectrogram-based ASR module (after converting samples to spectrograms) used as the last part of our pipeline (Section 2.3) and the TDNN-based ASR system (Section 2.4.2), which serves as another network evaluating the performance of our pipeline. Another application of the dataset is training the loss network for the style transfer approach in one of the accent modification variants (Section 2.2).
The summary of utilized datasets is shown in Table 4.

Experiments and metrics
The autoencoder introduced in Section 2.1 as well as the loss network (Section 2.2) and the spectrogram-based ASR system (Section 2.3) were trained using the connectionist temporal classification (CTC) loss function [26]. In our research, we designed separate experiments for several processes in our pipeline. Namely, we performed experiments and evaluated the results for:
1. Relative improvement in the speech recognition accuracy in the case of autoencoder-based accent modification, including both approaches for ASR in the final stage
2. Relative improvement in the speech recognition accuracy in the case of audio style transfer-based accent modification, including both approaches for ASR. In this approach, we performed several runs of training the style-modifying autoencoder to find the best combination of subsets of style and content layers in the loss network
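A network trained with CTC emits one label distribution per time frame, including a special blank label, and these frames must be collapsed into a letter sequence. The article does not state which decoding strategy was used; the sketch below shows greedy (best-path) decoding as an assumption chosen for brevity, with a hypothetical three-symbol label set.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Pick the best label per frame, merge repeats, then drop blanks."""
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    decoded = []
    previous = None
    for label in best:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(labels[label])
        previous = label
    return "".join(decoded)


# Hypothetical label set: index 0 is the CTC blank.
labels = ["-", "a", "b"]
frame_scores = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # a (new, separated by the blank)
    [0.1, 0.1, 0.8],    # b
    [0.2, 0.1, 0.7],    # b (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
]
text = ctc_greedy_decode(frame_scores, labels)
```

The blank between the second and fourth frames is what allows the same letter to appear twice in a row in the output.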

Metrics
We employed two different evaluation processes depending on the experiment type. As a quality metric for the speech recognition processes (loss network, ASR module, and the cloud-based service), we chose three different metric types. The first is the standard Word Error Rate (WER) and the second is the Character Error Rate (CER), which is expressed as:
$$\mathrm{CER} = \frac{i + s + d}{n}$$
where:
• i is the number of insertions
• s is the number of substitutions
• d is the number of deletions
• n is the total number of characters
Another metric type introduced is phoneme similarity [27]. It is expressed as Mean Similarity Score (MSS) in the results section of our work.
As for the evaluation of the accent modification itself, we decided to present a relative decrease in CER yielded by the ASR module from the last part of our pipeline (Google Cloud Speech-to-Text and the spectrogram-based ASR trained using LibriSpeech).
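For reference, CER and WER can both be computed from the Levenshtein edit distance, which counts the minimum number of insertions, substitutions, and deletions. The following is a minimal sketch rather than the evaluation code used in the experiments; it assumes a non-empty reference, whitespace tokenization for WER, and normalization by the reference length, and the example strings are invented.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of insertions, substitutions and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Invented example: one substituted character in a ten-character reference.
example_cer = cer("light rain", "right rain")
example_wer = wer("light rain", "right rain")
```

A single confused character gives a CER of 0.1 but a WER of 0.5 here, which illustrates why the two metrics are reported side by side.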

Results
Each respective result below represents an average over ten runs of each experiment with a particular setup.

The results without accent modification
The ASR module (Section 2.3) was trained using spectrograms converted from LibriSpeech train-clean-360 subset. It was evaluated using the test-clean dataset and achieved 15.7% CER and 19.7% WER. This model, evaluated with a 10% test subset of the spectrograms of UME-ERJ dataset, achieved only 46.3% CER and 56.8% WER.
The averaged result of speech recognition using spectrograms yielded by our loss network is 11.2% CER and 14.9% WER using the LibriSpeech test-clean dataset.
The 10% test subset of the WAV samples of the UME-ERJ dataset was also used to evaluate the performance of the Google Cloud Speech-to-Text API, and the result we obtained was 39.8% CER.
The LibriSpeech test-clean dataset was also used for evaluating the TDNN-based ASR network we trained, and the result achieved in our test was 10% CER and 12.5% WER.

Impact of the autoencoder-based accent modification
After activating the autoencoder-based accent modification in our pipeline, the same test subset of the UME-ERJ dataset gave a result of 36.1% CER (evaluation by the spectrogram-based ASR model trained only on the LibriSpeech training set). Therefore, it yielded a relative improvement in terms of CER of $\frac{46.3\% - 36.1\%}{46.3\%} \approx 22\%$.
In the case of the Google Cloud API, we first fed the data to the autoencoder and then converted the modified spectrograms back to the sound wave format. Finally, we sent the result to the cloud service and recorded the recognized text. We used the 10% subset of samples from UME-ERJ and obtained a result of 27.3% CER, which translates to a relative improvement of 31.4%.

Impact of the accent modification based on the style transfer approach
For each combination of subsets tested for style and content layers, we checked the CER value on the 10% subset of spectrograms from the UME-ERJ dataset by feeding it into the trained autoencoder that modifies the style (Section 2.2) and feeding the respective result into the ASR (Section 2.3). The best setup gave a result of 31.7% CER. Therefore, it yielded a 32% relative improvement in terms of CER.
In the case of the cloud service evaluation, we followed a process analogous to that in Section 3.4.2. The best combination of style and content layers achieved a result of 23.9% CER, which means a relative improvement of 40%.
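The relative improvements quoted in this section follow directly from the baseline and modified CER values. The short sketch below recomputes them from the figures reported in the text; the function name and the dictionary keys are illustrative.

```python
def relative_improvement(baseline_cer, modified_cer):
    """Fraction of the baseline error removed by accent modification."""
    return (baseline_cer - modified_cer) / baseline_cer


# (baseline CER, CER after accent modification), in percent, from the text.
reported = {
    "autoencoder, spectrogram ASR": (46.3, 36.1),
    "autoencoder, cloud ASR": (39.8, 27.3),
    "style transfer, spectrogram ASR": (46.3, 31.7),
    "style transfer, cloud ASR": (39.8, 23.9),
}
improvements = {
    name: round(100 * relative_improvement(baseline, modified), 1)
    for name, (baseline, modified) in reported.items()
}
```

Rounded to whole percentages, the two style transfer values agree with the 32% and 40% stated above.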
All experimental results with the best CER are presented in Tables 5 and 6. The results for experiments conducted on the style transfer approach with different style and content layers are presented in Tables 8 and 9. Tables 5, 6, 7, 8, and 9 present the results obtained when testing the setup with both the 10% subset of the UME-ERJ dataset and the specifically prepared 100-sentence test subset of UME-ERJ.
In the tables, we use the following symbols:
• WM: results of the experiment without the accent modification process
• A: the experiment with the autoencoder-based modification process
• ST: the experiment with the style transfer-based modification process
• RI(CER): relative improvement for the CER metric

Discussion
This article contains a study of audio style transfer methods used for improving the accuracy of ASR systems which had been trained using native speech datasets and are used by non-native speakers.
We found that the style transfer methodology adapted to the speech domain yields better results than an autoencoder trained in a supervised way. We think that the reason lies in the fact that we performed the training step repeatedly on randomly drawn samples with non-native and native accents, respectively, transformed into spectrograms. The samples in each such pair do not have to represent the same content. This observation suggests that, if the approach were extended, it might be possible to create a more universal autoencoder that converts the accent of non-native speakers from multiple accents into one (e.g., North American English) accent. This is going to be one of the future steps of our research.
Another observation follows from the fact that the accent modification process is not conditioned on any variables related to the speaker or speech environment. As a result, during the experiment testing phase we could detect cases where the gender of the speaker in the sample after style transformation differed from that before it, while the actual content of the pronounced word or sentence was preserved. However, we did not treat such cases as failed, as our primary goal was to increase the accuracy of the ASR system, which was eventually achieved. Nevertheless, this is a next step for our team to address. As another future step in our research, we would like to conduct more experiments, i.e., the evaluation of the style transfer approach described in Section 2.2 using more datasets. We are also planning to evaluate the style transfer-based approach and the autoencoder-based approach with longer sentences and samples.
We are planning to develop our idea for non-native speech recognition further and to constantly improve the quality of the designed methodology. Furthermore, additional experiments will be conducted, i.e., using multiple nationalities of non-native English speakers, as well as using different datasets including samples of languages other than English.

Conclusions
In this research, we explained the problem of non-native speech recognition and the reason why training ASR systems adapted for such speech may be problematic.
We described in detail the idea behind style transfer methodology and our adaptation to the speech and sound domain. We presented the method as a way to transform non-native pronunciation so that it resembles native speech to a higher extent, thus enabling the ASR system to perform better when being used by a non-native speaker.

Table 9 Relative improvement depending on the content and style layers in cases of Google cloud-based evaluation using 10% UME-ERJ subset
Style layers    Content layers    RI(CER)
1-10            6-12              40%

We performed initial experiments using the UME-ERJ dataset and tested several different pipelines for pronunciation modification. We evaluated each approach on a custom ASR trained to recognize speech from spectrograms, as well as on the publicly available Google Cloud Speech-to-Text. Our initial findings show that it is possible to augment the non-native speech samples in a way that they will be recognized with a higher accuracy by an ASR system.
We also pointed out several issues that appeared while we were training and evaluating our algorithms. They show that there is still considerable room for improvement in adapting the method to multiple speaker-dependent conditions and other non-native nationalities.