Pronunciation augmentation for Mandarin-English code-switching speech recognition

Code-switching (CS) refers to the use of more than one language within an utterance. It presents a great challenge to automatic speech recognition (ASR) because of the language switching within one utterance, the pronunciation variation of the embedding-language words, and the severe sparsity of training data. This paper focuses on the Mandarin-English CS ASR task. We aim to deal with the pronunciation variation and alleviate the data sparsity at code-switches by using pronunciation augmentation methods. An English-to-Mandarin mix-language phone mapping approach is first proposed to obtain a language-universal CS lexicon. Based on this lexicon, an acoustic data-driven lexicon learning framework is further proposed to learn new pronunciations that cover the accents, mispronunciations, and pronunciation variations of the embedded English words. Experiments are performed on real CS ASR tasks. The effectiveness of the proposed methods is examined on conventional, hybrid, and recent end-to-end speech recognition systems. Experimental results show that both the learned phone mapping and the augmented pronunciations can significantly improve the performance of code-switching speech recognition.


Introduction
Code-switching (CS) is prevalent in many multilingual communities. It is defined as the switching between two or more languages at the conversation, utterance, and sometimes even word level [1][2][3]. There are two forms of code-switching: inter-sentential switching, in which the alternation occurs between sentences, and intra-sentential switching, in which the switch occurs within one sentence or word [3].
Code-switching is common around the world. For example, in India, Bengali-English or Bengali-Hindi-English code-switching is very common in most people's daily speech [4]; in the USA and Switzerland, people can often hear Spanish-English and French-Italian code-switching speech [3]; and in Hong Kong, the combination of English and the native Cantonese is also very common [5]. In East Asia in particular, Mandarin-English code-switching is extremely popular, for example in Singapore, Malaysia, Mainland China, and Taiwan [6,7]. Code-switching is also now frequently found in daily life: in professional activities, social media, consumer goods, and entertainment, it is fairly common to hear people borrowing words from one language to use them in another [8,9]. In recent years, research on code-switching automatic speech recognition (ASR) has received increasing attention. With the rapid development of speech technology, a variety of speech-driven interfaces to smart devices and other real AI applications have become mainstream, yet most state-of-the-art monolingual ASR systems fail when they encounter code-switched speech.
*Correspondence: yanhua@shnu.edu.cn. 1 SHNU-Unisound Joint Laboratory of Natural Human-Computer Interaction, Shanghai Engineering Research Center of Intelligent Education and Bigdata, Shanghai Normal University, Shanghai, China. Full list of author information is available at the end of the article.
Compared with the recent significant success achieved in monolingual speech recognition [10,11], ASR systems still have difficulty dealing with code-switching speech, especially intra-sentential switching. To build a good code-switching ASR system, several challenges need to be handled in both acoustic and language modeling. One of the major challenges is the pronunciation variation of the embedding language at the code-switches. Unlike the matrix language, which is native to the speakers, the embedding language may be unfamiliar to most code-switching speakers in many CS scenarios; the words borrowed from the embedding language may be pronounced with a spectrum of accents and may be systematically or randomly mispronounced [8]. For example, in Mandarin-English code-switching utterances collected from Mainland China, most of the embedded English words may be Chinglish (Chinese English). There is significant pronunciation variation between Chinglish and standard British or American English. Many previous works have been proposed to deal with the discrepancy and co-articulation effects between the mixed languages in CS, such as unit merging [8,12,13,14] and language-universal acoustic modeling targets or frameworks [15][16][17]. However, works that handle the pronunciation variation of embedding words are very limited.
In this study, we concentrate on exploring pronunciation augmentation techniques for acoustic modeling. These techniques are applied and examined to improve a Mandarin-English intra-sentential code-switching ASR system. We aim to deal with the pronunciation variation and alleviate the data sparsity at code-switches by using pronunciation augmentation methods. Our contributions are as follows: (1) An English-to-Mandarin mix-language phone mapping is proposed. We first validate the effectiveness of conventional data-driven phoneme sharing. However, direct one-to-one unit mapping only alleviates the sparsity of embedding-language training data to some extent, and it ignores the acoustic discrimination between the languages. Therefore, we propose a new English-to-Mandarin phone mapping to enhance the pronunciations in the universal code-switching lexicon. (2) An acoustic data-driven lexicon learning framework is proposed to learn new pronunciations that cover the accents, mispronunciations, and pronunciation variations of the embedded English words. Phone mapping alone cannot handle the pronunciation variation of mispronounced or accented embedding words and phrases well. Because the standard pronunciations in the monolingual lexicon cannot cover these variations, such words typically need to be expressed with a new pronunciation or phone set. Therefore, motivated by the acoustic data-driven lexicon learning in [18], we propose a novel pronunciation augmentation approach to produce possible new pronunciations for the embedding words at the code-switches. Based on the merged universal code-switching phone set, this approach integrates information from both expert knowledge and acoustic evidence in the training corpus. The effectiveness of the proposed techniques is examined not only on conventional and hybrid ASR systems but also on state-of-the-art end-to-end ASR systems.
Our experiments on a real code-switching ASR task show that the proposed methods are very effective in improving the performance of CS speech recognition without any performance degradation on matrix-language recognition (the Mandarin test set). This is very important for real ASR applications, because a CS ASR system may be used not only for recognizing Mandarin-English code-switching speech but also for monolingual Mandarin speech recognition at the same time.
The rest of the paper is organized as follows. A review of previous works is presented in Section 2. Section 3 presents the framework of the proposed pronunciation augmentation method. Section 4 describes the details of the three speech recognition systems. Experimental configurations are presented in Section 5. Results and performance analysis are presented in Section 6. Finally, we conclude and discuss future work in Section 7.

Review of previous works
Code-switching is very natural in human communication; however, it poses several interesting challenges to the speech recognition community. Three major challenges have been addressed in the literature: (i) the heavy sparsity of code-switching training data, especially data at intra-sentential code-switch points, in both acoustic and language modeling; (ii) the significant language discrepancy and co-articulation effects in code-mixed utterances, which impose a big gap between the acoustic modeling units of different languages; and (iii) the above-mentioned pronunciation variation of the embedding language at the code-switches. Every stage of an ASR system, including acoustic modeling, language modeling, and decoding, can be significantly affected by any of these challenges.
To handle the code-switching data sparsity problem, the most straightforward way is to create code-switching speech corpora. However, for Mandarin-English CS ASR, only a few small publicly available code-switching corpora can be found, such as the 80 h OC16-CE80 corpus provided for the Chinese-English mixlingual speech recognition challenge (MixASR-CHEN) [19] and the SEAME corpus [20,21] with 63 h of spontaneous intra-sentential and inter-sentential code-switching speech. The matrix language in both OC16-CE80 and SEAME is Mandarin, and code-switching speech events in both corpora are extremely sparse. In addition, because of the mix-language property and the high cost in time and money, it is arduous to create a large-scale CS corpus with gold-standard manual transcription [22]. Therefore, some works have started to develop automatic CS event detection systems to extract speech utterances with language switches. For example, in [23], a latent language space model and the delta-Bayesian information criterion were proposed to detect code-switching events. Rallabandi et al. [24] proposed to use an ASR system to detect code-switching style utterances from acoustics. In [25], the authors used frame-level language posteriors generated from a CS ASR system to detect the code-switches. Most of these works were highly dependent on the performance of ASR systems and may not be practical for extracting CS utterances from large real audio archives; therefore, some recent works have started to build CS event classifiers directly with deep neural networks [26]. In addition, traditional data augmentation techniques, such as speed or volume perturbation [27,28], SpecAugment [29], and audio synthesis [30], have been shown to alleviate data sparsity to some extent in various speech and sound processing tasks; however, these techniques cannot handle the pronunciation variation problem of the embedding language in a CS ASR scenario.
There are also many works in the literature on code-switched text data augmentation for language modeling. The work in [14] used machine-translated text to augment the available code-switched text and found that the synthesized CS texts achieved significant reductions in perplexity. In [31], the authors increased the CS texts by integrating both syntactic and semantic features into the language modeling process. Generative adversarial networks with reinforcement learning were proposed in [32] to create CS text from monolingual sentences. Pratapa et al. [33] proposed to generate grammatically valid artificial CS data from parallel monolingual sentences under a linguistic equivalence constraint. In [4], a simple transliteration-based data augmentation approach was proposed to augment Bengali-English code-switched transcripts. The results showed that transliterating the code-mixed textual corpus to the matrix language and adding it to the training data significantly improved CS ASR performance. All of these previous works showed that artificially generated CS texts were effective in alleviating the data sparsity problem in language modeling to some extent. However, from our previous observations on acoustic data augmentation presented in [26], it seems more difficult to generate effective synthesized CS speech than CS text. This may be because real speech normally involves complicated acoustic environments and intra- and inter-speaker variabilities, which are very challenging for the speech synthesis community.
To alleviate the data sparsity and co-articulation problems in code-switching acoustic modeling, previous works mainly focused on (1) exploring universal acoustic modeling units, (2) developing new acoustic modeling strategies with multi-task or transfer learning, and (3) unsupervised or semi-supervised learning. For CS ASR systems with conventional architectures, such as the Gaussian mixture model-hidden Markov model (GMM-HMM) or the deep neural network-HMM (DNN-HMM)-based hybrid framework, these works mainly focused on mix-language phone mapping and unit merging. For example, in [34][35][36], unit merging at the state, senone, and Gaussian levels was proposed for Mandarin-English ASR tasks. In [8,12,14], different data-driven and knowledge-based phone merging and clustering algorithms were investigated to obtain a compact bilingual phone set. For the recently popular end-to-end (E2E) acoustic modeling, new universal acoustic modeling units for CS ASR were proposed to minimize the co-articulation and discrepancy between different languages. For example, in [7,15], Character-Subword units were constructed, with Chinese characters for Mandarin and byte-pair encoding (BPE) [37] subwords for English. Shan et al. [38] adopted Mandarin characters plus English letters and wordpieces as E2E modeling units. As for new CS ASR strategies, most works aimed at integrating individual language information to improve the final CS models. For example, two language-specific DFSMN subnets with a shared output layer were proposed to model the CS acoustic information in [15]. In [7,16,38], multi-task joint training of language identification and CS E2E ASR was investigated, and in some works, transfer learning was also used to provide a good initialization of the E2E encoders from large-scale monolingual corpora.
In addition, to exploit large-scale untranscribed code-switching data, many other efforts have been devoted to unsupervised and semi-supervised learning for improving the CS acoustic model, such as [26,39,40]. All of these previous works have proved effective in alleviating either the data sparsity or the co-articulation effects to some extent. However, they still cannot solve the embedding-language pronunciation variation problem, because most current hybrid acoustic modeling approaches still rely heavily on monolingual lexicons with standard pronunciations. Moreover, in E2E Mandarin-English ASR scenarios, most subword or wordpiece extraction approaches only consider character sequence frequencies instead of acoustics, which at times produces inferior subword segmentation that might lead to erroneous speech recognition output [41]. Therefore, a few recent works have started to deal with the pronunciation variation problem in acoustic modeling. For example, to handle the accented pronunciation problem in a conventional DNN-based hybrid Mandarin-English code-switching ASR system, [17] proposed to generate native pronunciation representations of embedding-language words in the matrix-language phoneme set, using a combination of existing acoustic phone decoders and an LSTM-based grapheme-to-phoneme (G2P) model. However, for the popular E2E architectures in CS ASR, we have not found any approach in the literature that focuses on the pronunciation variation issue. Some recent works on monolingual E2E ASR have found that high-quality pronunciation lexicons developed by linguists can potentially improve the performance of E2E systems, such as the detailed investigation of the value of pronunciation lexicons in E2E models in [42] and the pronunciation-assisted subword modeling method in [41].
In this study, we also focus on dealing with the pronunciation variation problem in Mandarin-English CS ASR tasks. By using pronunciation augmentation techniques, we aim to achieve better recognition accuracy on the embedding-language phrases without sacrificing any performance on the matrix-language speech. Unlike the work in [17], we not only consider merging the acoustic similarity between the mixed languages but also consider enhancing the discriminative information between them. New possible pronunciations for the embedding words at code-switches are generated automatically. Combined with expert information and G2P, the effectiveness of these new pronunciations is examined in both hybrid and E2E CS ASR systems.

The proposed pronunciation augmentation approach
In this section, we present the details of the proposed pronunciation augmentation approach. An English-to-Mandarin (E2M) mix-language phone mapping approach is first proposed to obtain a universal code-switching phone set. Based on this phone set, we further investigate a pronunciation augmentation strategy for embedding-language words using acoustic data-driven lexicon learning.

E2M phone mapping
It is well known that there is a big language difference between Mandarin and English; however, the co-articulation and CS data sparsity problems make it important to use a universal phone set for building a successful DNN-HMM based hybrid ASR system. Instead of mapping all the embedding-language phones to the matrix-language phones as in previous works [8,17], here we choose to balance the similarities and differences between Mandarin and English. Only phones with a high similarity measure are merged. Figure 1 illustrates the proposed framework of E2M mix-language phone mapping. In this framework, we combine two effective conventional data-driven phone clustering methods with an expert correction to generate the final universal phone set for Mandarin-English code-switching ASR. Specifically, given two monolingual lexicons and speech corpora, we first obtain a set of English-to-Mandarin phone mapping pairs, Tag, using the Tag model-based phone mapping method proposed in [8]. Then, we perform the TCM phone clustering proposed in [43] to obtain another set of phone mapping pairs, TCM. These two sets of pairs are then combined using the following rule:

$$\mathrm{Com} = \{(P_E, P_M) \mid (P_E, P_M) \in \mathrm{Tag} \cap \mathrm{TCM}\}$$

where $P_E$ is an English phone, $P_M$ is a Mandarin phone, $(P_E, P_M)$ is an E2M phone mapping pair, and Com contains those E2M pairs that lie in both the Tag and TCM sets. These pairs are taken as high-confidence phone mapping pairs, because they are derived from two different phone clustering methods. The Tag model-based method aims at sharing individual Gaussians across languages. All the Gaussians in the Tag model are clustered in a single, phone-independent, language-independent, Kullback-Leibler divergence-based vector quantization (VQ) codebook.
If any two phones in the Tag model have the majority of their Gaussians lying in common VQ clusters, these phones are assumed to be similar [8]. The TCM-based method in [43], on the other hand, is a two-pass phone clustering method based on a co-occurrence confusion matrix. In each pass, Mandarin and English take turns as the source and the target language. The co-occurrence counts between force-aligned target phone strings and the corresponding source phonetic transcriptions are then used to calculate the confusion probability between phone pairs. In other words, these two methods merge the characteristics of the two languages from entirely different perspectives. Therefore, we expect that comparing and integrating the outputs of the two methods can not only increase the confidence of the data-driven phone mapping but also provide guidance for the subsequent expert correction. Besides Com, as shown in Fig. 1, we also add an expert correction stage to improve the overall quality of the E2M phone projection. The motivation for this stage is that, in our extensive experiments, we find that Com contains most of the non-vowel phone mappings but only a few vowel phone mappings; most vowel phone mappings produced by the TCM and Tag model-based methods are very different. Therefore, we invited three linguistic experts from Shanghai Normal University to correct the vowel phone mappings. In this stage, all pairs in Com are directly fed into the final phone mapping set, Final; only the other vowel phone mapping pairs in Tag and TCM are checked and corrected by the linguists. The majority voting rule (2/3) is used to measure the reliability of the experts' corrections. This correction process not only depends on the linguistic knowledge of the experts but is also guided by the statistics of the confusion matrices obtained from the TCM and Tag model-based phone clustering processes.
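The combination rule and the 2/3 expert vote described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `common_pairs` and `majority_vote`, the toy phone labels, and the expert picks are all hypothetical.

```python
from collections import Counter

def common_pairs(tag_pairs, tcm_pairs):
    """Return the E2M pairs proposed by both clustering methods (the Com set)."""
    return set(tag_pairs) & set(tcm_pairs)

def majority_vote(expert_choices, min_votes=2):
    """Return the Mandarin phone picked by at least min_votes experts, else None."""
    phone, votes = Counter(expert_choices).most_common(1)[0]
    return phone if votes >= min_votes else None

# Toy (English phone, Mandarin phone) pairs; the labels are made up.
tag = {("B", "b"), ("M", "m"), ("AA", "a"), ("IY", "i")}
tcm = {("B", "b"), ("M", "m"), ("AA", "ao"), ("IY", "ei")}
com = common_pairs(tag, tcm)

# Vowel pairs outside Com go to three experts; each list holds their picks.
expert_votes = {"AA": ["a", "a", "ao"], "IY": ["i", "ei", "ii"]}
corrected = {}
for eng_phone, votes in expert_votes.items():
    choice = majority_vote(votes)
    if choice is not None:  # no 2/3 agreement: leave the phone unmapped
        corrected[eng_phone] = choice

# Final mapping: all Com pairs plus the expert-corrected vowel pairs.
final = {**dict(com), **corrected}
```

In this toy run, the experts agree on AA but not on IY, so IY stays unmapped and would be kept as a language-dependent phone.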
It is worth noting that we do not perform any mapping for pairs with low similarity measurements (e.g., DH and ZH in the CMU English lexicon); keeping these language-dependent phones may help to retain the large acoustic and language discriminative information during acoustic modeling. Finally, all of the corrected phone mapping pairs, the pairs in Com, and the few language-dependent English phones are combined to produce the final universal CS phone set, which is then used to map the English lexicon and obtain the final universal CS lexicon.
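As a rough sketch of the lexicon mapping step, the snippet below applies a one-to-one E2M mapping to an English lexicon and leaves unmapped (language-dependent) phones untouched. The helper names, the toy lexicon entries, and the Mandarin phone labels are illustrative assumptions only.

```python
def map_pronunciation(phones, e2m_map):
    """Map each English phone through e2m_map; unmapped phones are kept as-is."""
    return [e2m_map.get(p, p) for p in phones]

def build_cs_lexicon(english_lexicon, mandarin_lexicon, e2m_map):
    """Merge a Mandarin lexicon with an English lexicon mapped to the CS phone set."""
    lexicon = {w: [list(p) for p in prons] for w, prons in mandarin_lexicon.items()}
    for word, prons in english_lexicon.items():
        lexicon[word] = [map_pronunciation(p, e2m_map) for p in prons]
    return lexicon

# Toy data: DH has no Mandarin counterpart here, so it stays language-dependent.
e2m_map = {"AH": "e", "B": "b", "IY": "i"}
english_lexicon = {"the": [["DH", "AH"]], "be": [["B", "IY"]]}
mandarin_lexicon = {"你": [["n", "i"]]}

cs_lexicon = build_cs_lexicon(english_lexicon, mandarin_lexicon, e2m_map)
phone_set = {p for prons in cs_lexicon.values() for pron in prons for p in pron}
```

The resulting `phone_set` contains the shared Mandarin phones plus the few English phones that were deliberately left unmapped.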
In addition, according to the experts' knowledge and the different outputs of the above two phone clustering methods, we speculate that it may be useful to perform a one-to-two or one-to-many mapping for English vowels with a similarity measure larger than a threshold. This may help to handle the Mandarin accents in English words efficiently, because all the mappings from both phone clustering methods are data-driven results. A mapping pair with a very high confusion probability may still indicate similar acoustic characteristics, even if it does not lie in the Com set. Therefore, in this study, besides the one-to-one mapping, one-to-two mappings with a further expert correction are also investigated. Two one-to-two mapping examples for the English phones AA and IY are illustrated in Fig. 2.
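One way to picture the effect of a one-to-two mapping on the lexicon is that every affected English pronunciation expands into several mapped variants. The sketch below assumes this expansion; the Mandarin targets chosen for AA and IY are placeholders and are not taken from Fig. 2.

```python
from itertools import product

def expand_pronunciation(phones, one_to_many):
    """Return every mapped variant of a pronunciation under a one-to-many map."""
    options = [one_to_many.get(p, [p]) for p in phones]
    return [list(combo) for combo in product(*options)]

# Hypothetical one-to-two targets for two English vowels.
one_to_many = {"AA": ["a", "ao"], "IY": ["i", "ei"]}

variants = expand_pronunciation(["K", "AA", "R"], one_to_many)
both = expand_pronunciation(["AA", "IY"], one_to_many)
```

Because the number of variants grows multiplicatively with the number of one-to-two phones in a word, restricting such mappings to a few high-similarity vowels keeps the lexicon size manageable.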

Pronunciation augmentation using lexicon learning
Our pronunciation augmentation approach is motivated by the acoustic data-driven lexicon learning algorithm in [18]. In [18], this algorithm was proposed to automatically generate pronunciations for OOV words in a monolingual English ASR task. In this study, we borrow the lexicon learning idea to handle the pronunciation variation problem in Mandarin-English code-switching ASR. It is used to generate informative new pronunciations only for the embedding-language (English) words. These new pronunciations are then appended to the universal CS lexicon for ASR acoustic modeling.
In fact, it is also possible to produce new pronunciations for Mandarin words using lexicon learning. However, in our Mandarin-English code-switching tasks, Mandarin is the matrix language and English is the embedding language; all the Mandarin words are spoken by native speakers, and native speakers show no such pronunciation variation in their native speech. Therefore, we only consider the new pronunciations of English words and ignore those produced for Mandarin words. Figure 3 presents the whole framework of the proposed pronunciation augmentation using lexicon learning. It includes three main modules: (a) CS lexicon preparation for all the words in the acoustic training data, (b) new pronunciation candidate collection, and (c) pronunciation pruning and selection. This framework is very similar to the one proposed in [18]; however, many implementation details are specially designed for the code-switching speech recognition task.
Fig. 3 The proposed pronunciation augmentation framework of embedding language words in CS ASR using lexicon learning
In module (a), we first create a Mandarin-English CS lexicon by mapping the standard monolingual English lexicon using the phone mappings obtained in Section 3.1. Then, unlike the small seed lexicon in [18], we extend the whole CS lexicon by training a Sequitur G2P [44] model on it to produce initial pronunciations only for OOV words in the training data. Based on the G2P-extended lexicon, in module (b), we train an acoustic model using all of the monolingual Mandarin, English, and code-switching speech data to force-align the training data, build a phone language model, and construct a phonetic decoder. This decoder is then used to generate the phonetic transcription for each word w that exists in the training data. Finally, for each individual word, we obtain many new pronunciation candidates (PDs) from the phonetic decoding by aligning the phone sequences of the forced alignment and the phonetic transcription using a normalized relative frequency measurement. These PDs are combined with the G2P-extended lexicon into a large CS lexicon (Combined Lexicon) for the subsequent acoustic evidence collection in module (c). The "acoustic evidence" is defined as $\tau = p(O_u \mid w, b)$; it is the conditional acoustic data likelihood of utterance $O_u$, given that the pronunciation of word w is b. This "acoustic evidence" is derived from per-utterance lattice pronunciation-posterior statistics, which are computed using lattices of training utterances produced with the Combined Lexicon and the existing acoustic model from module (b). Given a set of pronunciation candidates for a specific word w and the acoustic evidence τ per utterance, pronunciation pruning and selection are performed using an iterative greedy pronunciation selection (IGPS) procedure with a per-utterance likelihood reduction criterion.
Finally, with this procedure, the least important pronunciation candidates are iteratively removed in an efficient greedy fashion. For details of the lexicon learning algorithm and other implementation tricks, please refer to [18].
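To make modules (b) and (c) more concrete, the following is a deliberately simplified sketch, not the IGPS procedure of [18]. It assumes that candidates are filtered by relative frequency and that pruning greedily drops the candidate whose removal costs the least total log-likelihood, where each utterance is scored by its best remaining pronunciation. The thresholds `min_rel_freq` and `max_loss`, the function names, and all numbers are hypothetical.

```python
import math
from collections import Counter, defaultdict

def collect_candidates(decoded, min_rel_freq=0.25):
    """Count decoded phone strings per word and keep the relatively frequent ones."""
    counts = defaultdict(Counter)
    for word, phones in decoded:
        counts[word][" ".join(phones)] += 1
    candidates = {}
    for word, ctr in counts.items():
        total = sum(ctr.values())
        candidates[word] = {p: c / total for p, c in ctr.items()
                            if c / total >= min_rel_freq}
    return candidates

def total_loglik(tau, kept):
    """Sum over utterances of the log-likelihood of the best kept pronunciation."""
    return sum(math.log(max(utt[b] for b in kept)) for utt in tau)

def greedy_prune(tau, candidates, max_loss=0.5):
    """Repeatedly drop the candidate whose removal reduces the likelihood least."""
    kept = list(candidates)
    while len(kept) > 1:
        base = total_loglik(tau, kept)
        losses = {b: base - total_loglik(tau, [k for k in kept if k != b])
                  for b in kept}
        victim = min(losses, key=losses.get)
        if losses[victim] > max_loss:  # every remaining candidate matters
            break
        kept.remove(victim)
    return kept

# Toy phonetic-decoding output for one English word.
decoded = [("app", ["a", "p"]), ("app", ["a", "p"]), ("app", ["ai", "p"]),
           ("app", ["a", "p"]), ("app", ["e", "p", "u"])]
cands = collect_candidates(decoded, min_rel_freq=0.1)

# Toy acoustic evidence: tau[u][b] stands in for p(O_u | w, b).
tau = [{"a p": 0.9, "ai p": 0.1, "e p u": 0.05},
       {"a p": 0.8, "ai p": 0.2, "e p u": 0.1},
       {"a p": 0.1, "ai p": 0.7, "e p u": 0.2}]
kept = greedy_prune(tau, cands["app"])
```

Here the candidate "e p u" is never the best explanation of any utterance, so it is removed first, while the two remaining pronunciations each explain at least one utterance well and are kept.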
Furthermore, unlike the motivation of generating pronunciations for OOV words in [18], lexicon learning in this study only plays a pronunciation augmentation role. Therefore, on top of the greedy pronunciation selection, we add an additional PD selection constraint to ensure a higher quality of new pronunciations:

$$\mathrm{PD}(w)_\tau \ge \rho \cdot \mathrm{Avg.R}(w)_\tau$$

where $\mathrm{PD}(w)_\tau$ is the "acoustic evidence" soft count of a pronunciation for word w derived from phonetic decoding after the last iteration of IGPS, $\mathrm{Avg.R}(w)_\tau$ is the average soft count of the word's pronunciations in the reference source lexicon (the G2P-extended CS lexicon), and $\rho \in [0, 1]$ is the statistical pruning factor. It is worth noting that only the training utterances that contain English words are taken into account during the acoustic evidence collection and PD pruning stages, because in our CS task we only intend to augment the pronunciations of English words to address their heavy pronunciation variation. After the greedy PD pruning process, only the informative subset of PDs with acoustic evidence is selected for each word. These informative pronunciations are then used to augment the source CS lexicon for acoustic modeling. For words in the target vocabulary that are not seen in the acoustic training data, or for which no pronunciation is produced during lexicon learning, we generate their pronunciations by re-training a CS G2P model on the already augmented lexicon, instead of using the initial G2P pronunciation candidates from module (a).
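The additional selection constraint can be illustrated with a few lines of code. The sketch assumes that a PD is kept when its soft count is at least ρ times the average soft count of the reference pronunciations of the same word; the direction of the comparison, the function name `select_pds`, and the soft counts below are assumptions made for illustration.

```python
def select_pds(pd_counts, ref_counts, rho=0.5):
    """Keep PDs whose soft count reaches rho times the average reference count."""
    avg_ref = sum(ref_counts.values()) / len(ref_counts)
    return {pron: c for pron, c in pd_counts.items() if c >= rho * avg_ref}

# Hypothetical soft counts for one English word after the last IGPS iteration.
pd_counts = {"a p": 12.0, "e p u": 1.5}      # pronunciations from phonetic decoding
ref_counts = {"ae p": 10.0, "ae p l": 6.0}   # pronunciations in the reference lexicon

selected = select_pds(pd_counts, ref_counts, rho=0.5)
```

With an average reference count of 8.0 and ρ = 0.5, the threshold is 4.0, so only the well-supported candidate survives. A larger ρ makes the augmentation more conservative.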

Speech recognition systems
Three types of ASR system are used to evaluate the proposed method. They are the conventional GMM-HMM system, the state-of-the-art lattice-free maximum mutual information (LF-MMI)-based hybrid system and the latest Transformer-based E2E system.

GMM-HMM system
Our GMM-HMM systems are built using the open-source Kaldi speech recognition toolkit [45]. The 13-dimensional mel-frequency cepstral coefficients (MFCC) plus one-dimensional pitch, together with their first- and second-order differential coefficients, are used as the input acoustic features to train the initial GMM-HMM acoustic models.
Based on the initial model, all of the acoustic features are then spliced over 9 frames and projected to a 40-dimensional subspace using linear discriminant analysis (LDA). A further maximum likelihood linear transform (MLLT) is applied to the projected features for better orthogonality. These transformed features are then used to refine the GMM-HMM parameters. After decision tree clustering, the final models have around 6000 context-dependent tied states with around 32 Gaussians per state (different lexicons lead to different numbers of states). Based on this LDA+MLLT model, we further perform speaker adaptive training (SAT) with constrained maximum likelihood linear regression to adapt the Gaussian mixture model parameters. After adapting the parameters, a re-alignment is performed to improve the LDA+MLLT+SAT system. The framework and implementation details of our GMM-HMM system training follow the example recipe egs/swbd/s5c in the Kaldi main branch.
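The feature dimensions involved can be checked with a small sketch of frame splicing. It assumes that the 14 static coefficients (13 MFCC plus 1 pitch) are spliced with 4 frames of context on each side and that edge frames are repeated at utterance boundaries; the text only states that features are spliced over 9 frames, so these details are assumptions. The LDA and MLLT transforms themselves are not implemented here.

```python
def splice(frames, left=4, right=4):
    """Concatenate each frame with its neighbours, repeating edge frames."""
    n = len(frames)
    spliced = []
    for t in range(n):
        context = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to the utterance range
            context.extend(frames[idx])
        spliced.append(context)
    return spliced

static_dim = 13 + 1            # MFCC plus pitch
delta_dim = static_dim * 3     # static + first- and second-order differentials

# Dummy utterance of 20 frames; frame t is filled with the value t.
frames = [[float(t)] * static_dim for t in range(20)]
spliced = splice(frames)
spliced_dim = len(spliced[0])  # this vector would then be projected to 40 dims by LDA
```

Under these assumptions the initial models see 42-dimensional features, and the LDA input is a 126-dimensional spliced vector per frame.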

LF-MMI based hybrid system
The LF-MMI based hybrid ASR framework was first proposed in [46]. Because of its good performance and excellent generalization ability, it has become a mainstream technology for speech recognition in both industry and academia. The LF-MMI based hybrid acoustic model is a time-delay neural network (TDNN) with a multi-splice sub-sampling topology. Povey et al. [46] proposed to train it in a purely sequence-discriminative way using the lattice-free version of the MMI criterion. Compared with classical TDNN training with the cross-entropy criterion, three major modifications have been introduced to LF-MMI TDNN training:
• Training from scratch without initialization from a cross-entropy system.
• The use of a threefold reduced frame rate and a simpler HMM topology.
• Limiting the range of time frames where supervision labels can appear by using finite state acceptors (FSA).
In addition, unlike classical MMI with its denominator lattices, the LF-MMI architecture builds the denominator graph from a phone-level n-gram language model, and the supervision is compiled into utterance-specific FSA graphs for TDNN training. Furthermore, to avoid over-fitting during training, the cross-entropy objective function and the leaky HMM are also applied as extra regularization techniques in this architecture. In this study, we use the TDNN-LSTM hybrid structure presented in [47] as our acoustic model because of its better performance.
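For orientation, the training criterion described above can be written in a generic form. This is a textbook-style sketch rather than the exact formulation of [46]: the symbols $\mathbb{G}^{\mathrm{num}}_u$ (utterance-specific supervision graph), $\mathbb{G}^{\mathrm{den}}$ (phone-LM-based denominator graph), and the regularization weight $\lambda$ are notation introduced here for illustration.

```latex
% MMI objective over training utterances O_u with model parameters \theta
\mathcal{F}_{\mathrm{MMI}}(\theta)
  = \sum_{u} \log
    \frac{p_\theta\left(O_u \mid \mathbb{G}^{\mathrm{num}}_u\right)}
         {p_\theta\left(O_u \mid \mathbb{G}^{\mathrm{den}}\right)}

% Total objective with cross-entropy regularization
\mathcal{F}(\theta)
  = \mathcal{F}_{\mathrm{MMI}}(\theta) + \lambda \, \mathcal{F}_{\mathrm{CE}}(\theta)
```

The numerator rewards paths consistent with the reference transcription, while the denominator sums over all phone sequences allowed by the phone-level language model, which is what removes the need for word lattices.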

Transformer-based E2E system
Following the great success of the Transformer, a recurrence-free sequence-to-sequence model proposed for machine translation [48], more and more research in the speech community has started to focus on it. Recently, the Speech-Transformer [49] was proposed by introducing the Transformer to the ASR task. With its encoder-decoder architecture and multi-head self-attention mechanism for learning context and positional dependencies, the Transformer has proven very successful in achieving competitive speech recognition performance, and it has already become the state-of-the-art E2E ASR system.
Compared with the above LF-MMI based hybrid system, which consists of separate pronunciation, acoustic, and language models, the Transformer-based ASR system is a single neural network that implicitly models all three. Because they lack a pronunciation lexicon, most E2E systems model the output text sequence in units finer than whole words, such as characters, BPE subwords, and wordpieces. Therefore, in this study, we also investigate how to improve a Transformer-based E2E system for Mandarin-English code-switching ASR by introducing the augmented CS pronunciations to assist the BPE subword modeling. This is motivated by the recent work on pronunciation-assisted subword modeling (PASM) proposed in [41]. We expect PASM to generate linguistically meaningful subwords for the embedding language, English, by analyzing the training text corpus and our augmented CS lexicon.
The PASM word segmentation recipe (footnote 1) was used in our experiments. All E2E ASR systems were built using the open-source end-to-end speech recognition toolkit ESPnet [50].
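As an illustration of the spelling-only subword segmentation that PASM is contrasted with, the following is a minimal sketch of BPE merge learning. It is a toy version of the generic BPE procedure, not the ESPnet or PASM recipe used in the experiments; the function names (`learn_bpe`, `merge_pair`, `segment`) and the example word counts are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge operations from a {word: count} dictionary."""
    # Every word starts as a sequence of single letters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += count
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken deterministically by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus statistics (not taken from the training data).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=2)
print(segment("newest", toy_merges))
```

Because the merges depend only on letter co-occurrence counts, the resulting subwords carry no pronunciation information, which is the limitation that pronunciation-assisted segmentation is meant to address.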

Datasets
The corpus used to build our code-switching ASR systems is provided by Unisound Corporation (footnote 2). It includes 186 hours (hrs) of Mandarin-English code-switching speech, 500 hrs of Mandarin monolingual speech, and 100 hrs of English monolingual speech with a Chinese accent. We use the term "Chilish" for this accented English set. All of the utterances are conversational speech or speech collected from voice search, and all speakers are from mainland China. In our training set, there is a heavy imbalance between the amounts of Mandarin and Chilish data, because in real applications it is much harder to collect Chilish speech than Mandarin. As our goal is to improve the ASR performance on the embedding language without sacrificing performance on the matrix language, we designed three test sets for evaluation: a 3 hrs pure Mandarin test set (Mandarin), a 3.6 hrs Mandarin-English code-switching test set (1.6 hrs from voice search and 2.0 hrs of general conversational speech), and a 1.6 hrs pure Chilish test set.

Neural network structures and acoustic features
The experimental configurations of the GMM-HMM systems have already been presented in Section 4.1. Unlike the typical TDNN, the LF-MMI based TDNN-LSTM hybrid system is a mixed architecture of LSTMPs and subsampled TDNNs, using 3 fast-LSTMP layers interleaved with 7 spliced TDNN layers. More details of this architecture can be found in the TDNN-LSTMP structure used for the SWBD corpus in [47]. The frame-level alignments and lattices for TDNN-LSTM model training were generated directly from the GMM-HMM system. The swbd/s5c/local/chain/run_tdnn_lstm.sh recipe in the Kaldi repository is used for our hybrid model training, but without any i-vectors or other speaker adaptation techniques. The language model used in both the GMM-HMM and hybrid systems is the same trigram LM, built on all of the training data texts.
As in most ASR example recipes in ESPnet, the Transformer-based E2E systems use 12 encoder and 6 decoder blocks with a feed-forward inner dimension of 2048. The model dimension d_model is set to 256 and the number of attention heads h is set to 4. Both the hybrid and E2E acoustic models use the same 80-dimensional filter-bank features plus 3-dimensional pitch features (pitch and its first and second derivatives). All input acoustic features are extracted using a 25-ms Hamming window with a 10-ms frame shift. In addition, to enhance model robustness, we apply both SpecAugment and speed perturbation for acoustic data augmentation during the training of all hybrid and E2E systems.
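The numbers above can be collected in a small configuration sketch, which also makes the derived quantities explicit (per-head dimension, input feature dimension, and frame count). This is only an illustration: the class and function names are hypothetical and do not correspond to ESPnet or Kaldi options, and the frame count assumes that only complete windows are kept, with no padding at the utterance edges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """Hyperparameters stated in the text; field names are illustrative only."""
    encoder_blocks: int = 12
    decoder_blocks: int = 6
    feed_forward_dim: int = 2048
    model_dim: int = 256       # d_model
    attention_heads: int = 4   # h

    @property
    def head_dim(self) -> int:
        # In multi-head attention, d_model is split evenly across the heads.
        if self.model_dim % self.attention_heads != 0:
            raise ValueError("model_dim must be divisible by attention_heads")
        return self.model_dim // self.attention_heads


FBANK_DIM = 80   # filter-bank coefficients per frame
PITCH_DIM = 3    # pitch features per frame
FEATURE_DIM = FBANK_DIM + PITCH_DIM


def num_frames(duration_ms: int, window_ms: int = 25, shift_ms: int = 10) -> int:
    """Number of complete analysis windows in an utterance (assumes no edge padding)."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // shift_ms


config = TransformerConfig()
print(config.head_dim, FEATURE_DIM, num_frames(1000))
```

Under these assumptions, each attention head works on a 64-dimensional projection, each frame is an 83-dimensional vector, and a one-second utterance yields 98 frames.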

Lexicon and performance measure
The monolingual Mandarin lexicon (MLex) used in this study is an ARPAbet-based tonal lexicon provided by Unisound Corporation. It has a phone set of 109 phonemes/tonemes and covers more than 200k Mandarin words. The monolingual English lexicon is the open-source CMU English dictionary with 39 phonemes (footnote 3). In the experiments, these two monolingual lexicons are first combined and mapped into a universal CS lexicon, which is then further augmented using the proposed E2M mix-lingual phone mapping and lexicon learning approaches to handle the pronunciation variation problem in code-switching ASR tasks. To measure system performance, we use the token error rate (TER), where a "token" is a Mandarin character or an English word.

Acoustic difference between native and non-native English speech
It is well known that there is a large acoustic difference between native and non-native English speech. We performed an experiment to validate this difference, with results shown in Table 1. In this experiment, the 1.6 hrs Chilish test set is used for evaluation. The 100 hrs of Chilish training data are used to train the GMM-HMM model for non-native monolingual English. The native English model is taken directly from the Kaldi LibriSpeech repository and is trained on LibriSpeech.
Both systems in Table 1 are trained using the CMU lexicon. The large TER gap on the Chilish test set clearly shows the significant acoustic difference between native English and Chilish, even though the native acoustic model is trained on a larger corpus. However, in the literature, most English-related ASR systems still use the CMU pronunciations, which are designed for native English. This indicates that there may be a chance to improve ASR performance by integrating the acoustic variations into the lexicon pronunciations.
Table 2 compares the effectiveness of the proposed E2M mix-lingual phone mapping with different conventional phone mapping methods. The first two systems, with only the MLex and CMU lexicons, are monolingual Mandarin and English systems respectively; they are trained separately using only the 500 hrs of Mandarin and 100 hrs of Chilish training data. The other systems in this table are all code-switching GMM-HMM systems trained on the total 786 hrs of training data. "CMU+MLex" represents directly concatenating the phone sets and lexicons of CMU and MLex without any phone sharing or mapping. "TCM" and "Tag model-based" are the conventional data-driven phone clustering methods proposed in [43] and [8] respectively. "E2M-po2o" is our proposed E2M phone mapping in which only part of the English phonemes are involved in the English-to-Mandarin phone mapping, and each of them is mapped one-to-one. From the TERs in Table 2, three observations can be made: (a) Phone mapping is very effective in improving code-switching ASR performance, whether using TCM, Tag model-based phone clustering, or the proposed E2M. Both conventional phone clustering methods achieve relative TER reductions of more than 20% over the baseline, and a further 6% TER reduction is obtained by the proposed E2M.
In the proposed E2M framework, this further gain may come from the tradeoff mechanism between language-independent acoustic commonness and language-dependent characteristics. (b) More Chilish words are mis-recognized by the universal code-switching acoustic model than by the pure English model; this indicates that phone mapping brings more acoustic confusion between Chilish and Mandarin than exists in monolingual ASR. (c) On the Mandarin test set, all of the CS models achieve better performance than the system trained on pure Mandarin data, with relative TER reductions of 1.6-3.8%. This is due to the additional Mandarin data included in the CS training set. In addition, there is almost no performance degradation on the Mandarin test set when using phone mapping or clustering methods instead of the baseline.
Figure 4 shows the TERs calculated separately on the Mandarin and English parts of the code-switching test set. "CS" is the whole code-switching test set as shown in Table 2, while "CS-Mandarin" and "CS-English" represent the Mandarin part and the English part of CS. Compared with the CMU+MLex baseline, the performance on both Mandarin and English speech is improved by the proposed E2M. Moreover, the performance gain on the English part is clearly much larger than the one on the Mandarin part. This indicates that the Chinese accent characteristics of the embedding English words are well learned and that the acoustic events of code-switches are greatly increased by the proposed E2M; the recognition errors within code-switches are significantly reduced.
Table 3 shows the performance comparison of the state-of-the-art LF-MMI based hybrid systems with different E2M phone mapping strategies. From the TER numbers in Tables 2 and 3, we can see that these hybrid systems achieve significant ASR performance gains (more than 45% relative) over the conventional GMM-HMM systems.
"E2M-po2m" is the proposed E2M phone mapping in which part of the English phones (about 38%) are mapped to two or three Mandarin phones. "E2M-po2o-T, E2M-po2m-D" means that "E2M-po2o" is used for acoustic model training, while "E2M-po2m" is used for decoding. Except for the last line of Table 3, all systems use the same lexicon for both acoustic model training and decoding. Compared with the large gains obtained from phone mapping in the GMM-HMM systems, E2M-po2o obtains only a relative 10.8% TER reduction on the code-switching test set over the CMU+MLex baseline. In addition, by comparing the results of the last two lines of Table 3 on the code-switching test set, it is interesting to find that the one-to-many mapping in E2M brings more mix-lingual acoustic confusion than the one-to-one phone mapping. Given the fixed universal phone set, using a proper one-to-many phone mapping only in the decoding stage brings a further 6.1% TER reduction over the E2M-po2o method. This may be because more pronunciation entries provide more competitive candidates in the WFST paths. Moreover, the recognition of pure Mandarin speech is clearly not significantly affected by the different phone mapping methods, while on the pure Chilish test set a relative 6.7% performance degradation is observed. This indicates that an acoustic model with a Mandarin-dominant phone set and training data is not suitable for a pure English ASR task. In addition, the TER gains on the code-switching test set indicate that the training data for the bi-phone acoustic modeling units of code-switches is also significantly increased in hybrid acoustic modeling through the phone mapping.
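The difference between the one-to-one and one-to-many mappings can be sketched as a lexicon transformation. The sketch below rests on several assumptions: the one-to-many case is read as each mapped English phone having two or three alternative Mandarin phones, so that one source pronunciation expands into several lexicon entries; the mapping tables, the Mandarin phone symbols, and all function names are hypothetical placeholders and are not the learned E2M mapping.

```python
from itertools import product

# Hypothetical mapping tables: English phone -> list of alternative Mandarin phones.
# English phones absent from a table are not involved in the mapping and are kept as-is.
E2M_PO2O = {"TH": ["s"], "V": ["w"]}            # one-to-one: a single target each
E2M_PO2M = {"TH": ["s", "f"], "V": ["w", "f"]}  # one-to-many: several alternatives


def map_pronunciation(phones, table):
    """Return every mapped variant of one pronunciation under the given table."""
    alternatives = [table.get(phone, [phone]) for phone in phones]
    return [list(variant) for variant in product(*alternatives)]


def map_lexicon(lexicon, table):
    """Apply the mapping to each pronunciation of each word, dropping duplicates."""
    mapped = {}
    for word, pronunciations in lexicon.items():
        variants = []
        for phones in pronunciations:
            for variant in map_pronunciation(phones, table):
                if variant not in variants:
                    variants.append(variant)
        mapped[word] = variants
    return mapped


# Two CMU-style entries used only for illustration.
english_lexicon = {
    "THINK": [["TH", "IH", "NG", "K"]],
    "VERY": [["V", "EH", "R", "IY"]],
}

po2o_lexicon = map_lexicon(english_lexicon, E2M_PO2O)
po2m_lexicon = map_lexicon(english_lexicon, E2M_PO2M)
print(po2o_lexicon["THINK"])
print(po2m_lexicon["THINK"])
```

Under this reading, the one-to-many lexicon has more entries per word than the one-to-one lexicon, which is consistent with the observation that it offers more competing candidates during decoding.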

Examination of lexicon learning-based pronunciation augmentation
6.3.1 Performances in conventional ASR systems
After obtaining the universal E2M-po2o mix-lingual lexicon, we augment the pronunciations of the English words in the training data set using the proposed lexicon learning-based framework in Fig. 3. The wsj/s5/steps/dict/learn_lexicon_greedy.sh recipe in the Kaldi main branch is modified to perform our CS ASR lexicon learning. During the iterative greedy pronunciation selection, we set the scaling factor α = 0.05, 0.01, 0.001 and the smoothing factor β = 10, 5, 10 to compute the likelihood reduction threshold that controls the degree of pruning for pronunciations from the phonetic decoding, G2P, and source lexicon respectively. Figure 5 shows samples of new pronunciations learned for five embedding English words. The learned pronunciations, being based on acoustic evidence, clearly tend to be more Chilish than the ones obtained from one-to-one E2M mapping alone. Augmenting the lexicon with these learned pronunciations may give the acoustic model a chance to capture the accent and mispronunciation properties of the embedding English words.
Table 4 presents the performance comparison of the proposed lexicon learning-based pronunciation augmentation on conventional ASR systems. The "Source Lexicon" is the universal CS lexicon with E2M-po2o phone mapping. "+ Lexlearn" means the "Source Lexicon" augmented with the new pronunciations of English words learned by the proposed CS lexicon learning framework. Each GMM-HMM and "LF-MMI hybrid-1" system in this table uses the same lexicon for acoustic modeling and decoding. However, the "LF-MMI hybrid-2" system uses different lexicons, with E2M-po2o and E2M-po2m phone mapping, for acoustic modeling and decoding respectively, and both lexicons are augmented with the same new pronunciations as in the "LF-MMI hybrid-1" system. System "LF-MMI hybrid" is the same as the "E2M-po2o" system in Table 3.
From the results in Table 4, we can see that by using the augmented CS lexicon, the GMM-HMM systems obtain relative TER reductions of 3.8 to 8.8% and 2.2 to 3.7% on the code-switching and Chilish test sets respectively. Taking the source pronunciations as reference, introducing the acoustic soft-count pruning factor effectively helps to select better new pronunciations with enough acoustic evidence. Based on the outputs of the standard iterative greedy pronunciation selection, we find that ρ = 0.7 achieves the best results. Furthermore, by comparing the results of the hybrid systems in Table 4 with their baselines in Table 3, we still obtain relative TER reductions of 5.2% (hybrid-1) and 3.7% (hybrid-2) on the code-switching test set by using the augmented lexicon, and consistently, the TER on Chilish is also reduced from 15.4 to 14.8%. However, the gains obtained on the LF-MMI based hybrid systems are clearly much smaller than those obtained on the GMM-HMM systems; this indicates that the hybrid system has a better acoustic modeling ability to deal with the pronunciation variations of the embedding language than the traditional GMM-HMM system.
Table 5 compares the results with different target acoustic modeling units in our Transformer-based E2E code-switching ASR systems. The targets of our baseline system are a set of Mandarin characters and English letters plus a blank symbol, which leads to an output dimension of 5257. In addition, we also adopt the widely used BPE subword segmentation to generate 2000 subwords as acoustic modeling units for English. Therefore, the "Character-BPE" system has a total of 7230 acoustic modeling targets. The systems with "Character-PASM(*)" modeling units are used to examine the effect of our augmented CS lexicon on generating better English subwords.
In these systems, the Mandarin character units are the same as in the baseline, but the English targets are produced from the universal CS lexicons using the pronunciation-assisted subword modeling (PASM) method. Three CS lexicons are investigated: the basic "CMU+MLex," the lexicon with E2M-po2m phone mapping, and the final augmented CS lexicon with acoustic lexicon learning ("Source Lexicon+Lexlearn (2.41)"). Comparing the baseline in Table 5 with the best result of the LF-MMI hybrid systems in Table 4, we can see a slight performance improvement on the pure Mandarin test set, while the performance on code-switching and Chilish speech is significantly degraded; for example, on the CS test set, the TER degrades from 10.4 to 11.6%. However, by using BPE subwords instead of simple letters as the English E2E acoustic modeling targets, relative TER gains of 4.3% and 3.8% are obtained on the code-switching and Chilish test sets respectively. This indicates that BPE subwords provide a better acoustic representation than simple letters. When we use PASM instead of BPE to produce the English targets, the TER on CS is reduced from 11.1 to 10.7% with only CMU pronunciations for English. This observation is consistent with the one in [41]. It indicates that, unlike BPE, which only takes the spelling into consideration, leveraging the pronunciation information of a word during subword segmentation can produce better E2E acoustic modeling units. Furthermore, when we use the phone-mapped universal CS lexicon, a further 1.8% relative TER reduction on CS is obtained. In addition, comparing the results on both the code-switching and Chilish test sets between the last two lines of Table 5 shows that the E2E CS ASR system can also benefit from the learned new pronunciations. These observations indicate that the learned and phone-mapped new pronunciations provide more phonetically meaningful subwords for the embedding English words.
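The relative TER changes quoted in this section follow the usual definition of relative reduction. As a worked example, the formula is applied below to the absolute CS-test-set TERs stated in the text; the symbols TER_base and TER_new are notation introduced only for this illustration.

```latex
\[
\Delta_{\mathrm{rel}}
  = \frac{\mathrm{TER}_{\mathrm{base}} - \mathrm{TER}_{\mathrm{new}}}{\mathrm{TER}_{\mathrm{base}}}
    \times 100\%
\]
\[
\text{letters} \rightarrow \text{BPE:}\quad
  \frac{11.6 - 11.1}{11.6} \times 100\% \approx 4.3\%
\]
\[
\text{BPE} \rightarrow \text{PASM (CMU pronunciations):}\quad
  \frac{11.1 - 10.7}{11.1} \times 100\% \approx 3.6\%
\]
```

The first value agrees with the relative 4.3% gain reported for BPE on the code-switching test set; the second is derived here from the two stated TERs and is not a figure reported in the text.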

Conclusion
This paper presents a pronunciation augmentation framework based on a universal Mandarin-English code-switching lexicon. The framework is proposed to handle the accented pronunciations and randomly mispronounced words of the embedding English in the code-switching speech recognition task. We first examine the proposed English-to-Mandarin phone mapping on both the conventional GMM-HMM and the state-of-the-art LF-MMI-based hybrid ASR systems. Experimental results show that we obtain a relative TER reduction of more than 10% on the code-switching test set by using the universal CS lexicon with the proposed phone mapping strategy. In addition, the comparison on the GMM-HMM systems shows that this strategy provides much larger ASR performance gains than two conventional phone mapping methods in the literature, by considering the balance between acoustic similarity and language discrimination between Mandarin and English.
Furthermore, based on the universal CS lexicon, we validate the proposed pronunciation augmentation framework from two main aspects. One is directly using the augmented pronunciations to train conventional GMM-HMM and LF-MMI hybrid systems. The other is using the augmented pronunciations to assist subword segmentation in generating better acoustic modeling targets for the end-to-end Transformer-based ASR system. The extensive results in Section 6.3.2 show that both the proposed phone mapping and the pronunciation augmentation framework are also effective solutions for improving E2E CS ASR performance.
Our future work will investigate new pronunciation-assisted methods that fully exploit phonetically meaningful information to improve E2E CS ASR systems, and will validate the proposed methods on other CS corpora.