- Research Article
- Open Access
Language Model Adaptation Using Machine-Translated Text for Resource-Deficient Languages
© Arnar Thor Jensson et al. 2008
- Received: 30 April 2008
- Accepted: 29 October 2008
- Published: 27 January 2009
Text corpus size is an important issue when building a language model (LM), particularly for languages where little data is available. This paper introduces an LM adaptation technique that improves an LM built from a small amount of task-dependent text with the help of a machine-translated text corpus. Icelandic speech recognition experiments were performed using data machine translated (MT) from English to Icelandic on a word-by-word and a sentence-by-sentence basis. Interpolating the baseline LM with an LM built from either word-by-word or sentence-by-sentence translated text reduced the word error rate significantly when the manually obtained utterances used as a baseline were very sparse.
- Speech Recognition
- Language Model
- Machine Translation
- Automatic Speech Recognition
- Speech Recognition System
State-of-the-art speech recognition has advanced greatly for several languages. Extensive databases, both acoustic and textual, have been collected in those languages in order to develop the speech recognition systems. Collecting large databases requires both time and resources for each target language. More than 6000 living languages are spoken in the world today. Developing a speech recognition system for each of these languages seems unimaginable, but since a language can quickly gain political and economic importance, a quick route to developing a speech recognition system is important.
Since data for developing a speech recognition system is sparse or nonexistent for resource-deficient languages, it may be possible to use data from other, resource-rich languages, especially when the available target language sentences are limited, which often occurs when developing prototype systems.
Development of speech recognizers for resource-deficient languages using spoken utterances in a different language has already been reported in , where phonemes are identified in several different languages and used to create or aid an acoustic model for the target language. Text for creating the language model (LM) is, on the other hand, assumed to exist in large quantity, and therefore sparseness of text is not addressed in .
Statistical language modeling is well known to be very important in large vocabulary speech recognition, but creating a robust language model typically requires a large amount of training text. It is therefore difficult to create a statistical LM for resource-deficient languages. In our case, we would like to build an Icelandic speech recognition dialogue system in the weather information domain. Since Icelandic is a resource-deficient language, no large text corpus is available for building a statistical LM, especially for spontaneous speech.
Methods have been proposed in the literature to improve statistical language modeling using machine-translated (MT) text from another source language [3, 4]. A cross-lingual information retrieval method is used to aid an LM in a different language in . News stories are translated from a resource-rich language to a resource-deficient language using a statistical MT system trained on a sentence-aligned corpus in order to improve the LM used to recognize similar or the same story in the resource-deficient language. Another method described in  uses ideas from latent semantic analysis for cross-lingual modeling to develop a single low-dimensional representation shared by words and documents in both languages. It uses automatic speech recognition transcripts and aligns each with the same or a similar story in another language. Using this parallel corpus, a statistical MT system is trained. The MT system is then used to translate a text in order to aid the LM used to recognize the same or a similar story in the original language. LM adaptation with target task machine-translated text is addressed in  but without speech recognition experiments. A system that uses automatic speech recognition for human translators is improved in  by using a statistical machine translation of the source text. It assumes that the content of the translated text is the same as that of the target text being recognized. The above-mentioned systems all use statistical machine translation, which is often expensive to obtain and unavailable for resource-deficient languages.
MT methods other than statistical MT are also available, such as rule-based MT systems. A rule-based MT system can be based on word-by-word (WBW) translation or sentence-by-sentence (SBS) translation. WBW translation only requires a dictionary, which is already available for many language pairs, whereas rule-based SBS MT needs more extensive rules and is therefore more expensive to obtain. The WBW approach is expected to be successful only for grammatically closely related languages. In this paper, we investigate the effectiveness of the WBW and SBS translation methods and show the amount of data in the resource-deficient language required to match these methods.
In Section 2, we explain the method for adapting language models. Section 3 describes the experimental corpora. Section 4 describes the experimental setups. Experimental results are reported in Section 5, followed by a discussion in Section 6, and Section 7 concludes the paper.
The final perplexity or word error rate (WER) value is calculated using an evaluation text set or speech evaluation set ( ) which is disjoint from all other datasets.
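For reference, WER is conventionally defined as the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in the experiments; the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion of a reference word
                          curr[j - 1] + 1,     # insertion of a hypothesis word
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[len(hyp)] / len(ref)


# Hypothetical example: one substituted word out of five reference words.
print(word_error_rate("hvernig er veðrið í dag", "hvernig er veður í dag"))  # 0.2
```

Averaging this value over several evaluation runs, as done later in the experiments, only requires calling the function once per run and taking the mean.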
3.1. Experimental Data: LM
A unique word list was made out of the Jupiter corpus and was machine translated using  in order to create a dictionary. This MT is a rule-based system. The dictionary consists of a one-to-one mapping, that is, an original English word has only one Icelandic translation. The word translation can consist of zero (unable to translate), one, or multiple words. Multiple words occur when a word in English cannot be expressed as one word in Icelandic; for example, the English word "today" translates to the Icelandic words "í dag." An English word is usually translated to one Icelandic word only.
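The dictionary lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the dictionary entries, the name `translate_word_by_word`, and the choice to silently drop words missing from the dictionary are all hypothetical and are not taken from the actual translation tool.

```python
# Hypothetical dictionary entries for illustration only; the real dictionary
# was produced by machine-translating the unique word list of the corpus.
DICTIONARY = {
    "today": ["í", "dag"],  # one English word -> multiple Icelandic words
    "rain": ["rigning"],    # one English word -> one Icelandic word
    "snow": ["snjór"],
    "xyzzy": [],            # zero words: the translator could not translate it
}


def translate_word_by_word(sentence, dictionary):
    """Replace each English word by its single dictionary translation.

    Assumption: words that are absent from the dictionary are dropped,
    exactly like words whose translation is empty.
    """
    translated = []
    for word in sentence.lower().split():
        translated.extend(dictionary.get(word, []))
    return " ".join(translated)


print(translate_word_by_word("Rain today", DICTIONARY))  # rigning í dag
```

Because every English word maps to exactly one entry, the translation ignores context and word order, which is why the approach is expected to work best for closely related languages.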
BLEU evaluation of the WBW and the SBS machine translators.
3.2. Experimental Data: Acoustic Model
Icelandic phonemes in IPA format.
Vowels: / i, i, , a, y, œ, u, ɔ, au, ou, ei, ai, œ y /
Consonants: / p, ph, t, th, c, ch, f, v, ð, s, ʝ, ç, , m, n, l, r /
Some attributes of the phonetically balanced Icelandic text corpus.
No. of sentences
No. of words
No. of phones
No. of unique PB units
Average no. of words/sentence
Average no. of phones/word
Some attributes of the Icelandic acoustic training corpus.
No. of male speakers
No. of female speakers
25-dimensional feature vectors consisting of 12 MFCCs, their deltas, and a delta energy were used to train a gender-independent acoustic model. Phones were represented as context-dependent, 3-state, left-to-right hidden Markov models (HMMs). The HMM states were clustered by a phonetic decision tree. The number of leaves was 1000. Each HMM state was modeled by a 16-component Gaussian mixture. No special tone information was incorporated. HTK version 3.2 was used to train the acoustic model.
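The 25 dimensions break down as 12 static MFCCs, 12 delta MFCCs, and 1 delta energy. The sketch below shows one way to assemble such vectors with the commonly used regression formula for delta coefficients. The window of 2 frames, the replication of edge frames, and the function names are assumptions made for illustration; the source does not state these settings.

```python
def delta(frames, window=2):
    """Regression-based delta coefficients for a list of equal-length frames.

    d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2), k = 1..window.
    Assumption: the first and last frames are replicated at the edges.
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        deltas.append(row)
    return deltas


def build_features(mfcc, log_energy, window=2):
    """Concatenate 12 MFCCs, their 12 deltas, and 1 delta energy per frame.

    mfcc: list of 12-dimensional frames; log_energy: one float per frame.
    The static energy itself is not included, only its delta.
    """
    d_mfcc = delta(mfcc, window)
    d_energy = delta([[e] for e in log_energy], window)
    return [m + dm + de for m, dm, de in zip(mfcc, d_mfcc, d_energy)]


# Toy input: 5 frames whose coefficients rise linearly over time.
mfcc = [[float(t)] * 12 for t in range(5)]
energy = [2.0 * t for t in range(5)]
features = build_features(mfcc, energy)
print(len(features[0]))  # 25
```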
3.3. Evaluation Speech Corpus
Some attributes of the Icelandic evaluation speech corpus.
Evaluation speech corpus
No. of utterances
No. of male speakers
No. of female speakers
Experiments 5 to 8 used SBS machine-translated data. Experiment 5 used no corpus but used the unique words found in , creating the vocabulary . This was done in order to find the impact of including only SBS translated vocabulary. Experiment 6 used as the TRT corpus without adding translated words to the vocabulary. Experiment 7 used the SBS MT along with the combined vocabulary found from the and corpora. Experiment 8 used both information from the SBS and WBW MT. Using WBW translated data along with SBS MT can be done since the dictionary used to create the WBW MT was created using the SBS MT.
The set size varied from 100 to 1500 sentences for all the experiments. In the following text corresponds to a subset of the set where is the number of sentences used. Experiments with no set included, , were also performed for Experiment 4, Experiment 7, and Experiment 8. All LMs were built using 3-grams with Kneser-Ney smoothing. The WER experiments were performed three times with different, randomly chosen sentences creating each and set, and an average WER was calculated over the three runs. This increases accuracy when comparing different experiments, especially when the set is very sparse. The vocabulary changed for each and set, and the values for words and unique words in Table 1 reflect only one of the three cases. The word counts and vocabulary sizes for the other two cases were very similar to the ones reported in Table 1. Perplexity and out-of-vocabulary (OOV) results reported in this paper also correspond only to the case with and sets found in Table 1. Each experiment had the interpolation weights optimized on the corpus.
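Linear interpolation combines two LMs as P(w|h) = λ·P_base(w|h) + (1 − λ)·P_MT(w|h), with λ tuned on held-out text. The sketch below estimates λ with the expectation-maximization update that is commonly used for this purpose; the source does not say which optimizer was used, so EM, the function names, and the toy probabilities are assumptions. The inputs are the probabilities each LM assigns to every word of the held-out text, assumed to be nonzero because both LMs are smoothed.

```python
import math


def optimize_weight(p_base, p_mt, iterations=50, lam=0.5):
    """EM estimate of the baseline LM weight from per-word held-out probabilities."""
    for _ in range(iterations):
        # E-step: posterior probability that each word came from the baseline LM.
        posteriors = [lam * a / (lam * a + (1.0 - lam) * b)
                      for a, b in zip(p_base, p_mt)]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


def perplexity(p_base, p_mt, lam):
    """Perplexity of the interpolated LM on the same held-out words."""
    log_sum = sum(math.log(lam * a + (1.0 - lam) * b)
                  for a, b in zip(p_base, p_mt))
    return math.exp(-log_sum / len(p_base))


# Toy per-word probabilities (hypothetical values, not from the experiments).
p_base = [0.2, 0.01, 0.3, 0.05]
p_mt = [0.05, 0.1, 0.1, 0.2]
best = optimize_weight(p_base, p_mt)
print(perplexity(p_base, p_mt, best) <= perplexity(p_base, p_mt, 0.5))  # True
```

Each EM iteration cannot decrease the held-out likelihood, so the tuned weight is never worse than the starting value on the text it was tuned on. Evaluation should still use a disjoint set, as the paper does.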
The speech recognition experiments were performed using Julius  version "rev.3.3p3 (fast)."
A closer look at the WER results shows how many additional sentences Experiment 1 needs in order to match Experiment 7. When Experiment 7 uses 100 sentences, Experiment 1 needs around 150 additional sentences to match its WER. When Experiment 7 uses 500 sentences, around 300 additional sentences are needed, and when it uses 1000 sentences, around 200 additional sentences are needed.
OOV rate (%) with corresponding vocabulary sizes inside parentheses.
The improvement of the Icelandic LM with translated English text was confirmed by a reduction in WER using either WBW or SBS MT. Experiment 1 should be compared with the other experiments since Experiment 1 does not assume any foreign translation. When the baseline in Experiment 1 is compared with the interpolated results using WBW MT in Experiment 4, the WER is reduced from 49.6% to 46.6%, a 6.0% relative improvement when using 100 sentences. The relative improvement decreases as more sentences are added to the system and converges to the baseline when 500 sentences are added. Neither Experiment 2 nor Experiment 3 gives any significant improvement over the baseline. This, along with the results in Experiment 4, suggests that when WBW translated data is available, both the translated corpus and its vocabulary should be added to the system when the sentences are sparse.
Experiment 8 most likely fails to outperform Experiment 7 because it uses the unique words found in the corpus in addition to the unique words found in Experiment 7. As Table 10 shows, around 1100 new words are added to the vocabulary in Experiment 8 compared to Experiment 7 for all set conditions without reducing the OOV rate significantly. The perplexity therefore increases, making the speech recognition process more difficult. The unique words found in therefore do not contribute toward better results if the vocabulary from is used.
When the baseline is compared with the interpolated results using SBS MT in Experiment 7, the WER is reduced from 49.6% to 41.9%, a 15.5% relative improvement when 100 sentences are added to the system. The improvement from merging the vocabulary from the and is confirmed by comparing Experiment 6 and Experiment 7 for all sets. The WER improvement of the SBS MT over the WBW MT is confirmed for all the sets, as the BLEU evaluation results in Section 3.1 suggest. This can be seen by comparing Experiment 4 in Figure 2 with Experiment 7 in Figure 3. The improvement is also confirmed by the perplexity results when Experiment 3 and Experiment 6 are compared in Table 9. When the vocabulary is kept the same, as in the case of Experiment 1, Experiment 3, and Experiment 6, the proposed methods always outperform the baseline perplexity results.
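The relative improvements quoted for the 100-sentence condition follow directly from the absolute WER values. A short check, with a hypothetical helper name:

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent, measured against the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


wbw = relative_reduction(49.6, 46.6)  # Experiment 1 vs. Experiment 4
sbs = relative_reduction(49.6, 41.9)  # Experiment 1 vs. Experiment 7
print(f"WBW: {wbw:.1f}%  SBS: {sbs:.1f}%")  # WBW: 6.0%  SBS: 15.5%
```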
The results presented in this paper show that an LM can be improved considerably using either WBW or SBS translation. This especially applies when developing a prototype system where the number of target domain sentences is very limited. The effectiveness of the WBW and SBS translation methods was confirmed for English to Icelandic for a weather information task. The convergence point of these methods with the baseline was around 400 and 1500 manually collected sentences for the WBW and the SBS translation methods, respectively. In order to get a significant improvement, a good (high BLEU score) MT system is needed. The WBW translation is especially important for resource-deficient languages that do not have SBS machine translation tools available. It is believed that a high BLEU score can be obtained with WBW MT for very closely related language pairs and between dialects. Confirming the effectiveness of the WBW and the SBS translation methods for other language pairs is left as future work, as is applying the rule-based WBW and SBS translation methods to a larger domain, for example, broadcast news. Future work also involves an investigation of other maximum a posteriori adaptation methods such as  and methods like the ones described in [14–16] that select a relevant subset from a large text collection such as the World Wide Web to aid a sparse target domain. These methods assume that a large text collection is available in the target language, but we would like to apply these methods to extract sentences from the corpus. Since the acoustic model is built from only 3.8 hours of acoustic data, which gives rather poor results, we would like to either collect more Icelandic acoustic data or use data from foreign languages to aid the current acoustic modeling.
The authors would like to thank Dr. J. Glass and Dr. T. Hazen at MIT and all the others who have worked on developing the Jupiter system. They would also like to thank Dr. Edward W. D. Whittaker for his valuable input. Special thanks go to Stefan Briem for his English to Icelandic machine translation tool and for allowing the authors to use his machine translation results. This work is supported in part by the 21st Century COE Large-Scale Knowledge Resources Program.
1. Adda-Decker M: Towards multilingual interoperability in automatic speech recognition. Speech Communication 2001, 35(1-2):5-20. doi:10.1016/S0167-6393(00)00092-3
2. Schultz T, Waibel A: Language-independent and language-adaptive acoustic modeling for speech recognition. Speech Communication 2001, 35(1-2):31-51. doi:10.1016/S0167-6393(00)00094-7
3. Khudanpur S, Kim W: Using cross-language cues for story-specific language modeling. Proceedings of the International Conference on Spoken Language Processing (ICSLP '02), September 2002, Denver, Colo, USA 1: 513-516.
4. Kim W, Khudanpur S: Cross-lingual latent semantic analysis for language modeling. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), May 2004, Montreal, Canada 1: 257-260.
5. Nakajima H, Yamamoto H, Watanabe T: Language model adaptation with additional text generated by machine translation. Proceedings of the 19th International Conference on Computational Linguistics (COLING '02), August 2002, Taipei, Taiwan 2: 716-722.
6. Paulik M, Stüker S, Fügen C, Schultz T, Schaaf T, Waibel A: Speech translation enhanced automatic speech recognition. Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '05), November-December 2005, San Juan, Puerto Rico 121-126.
7. Zue V, Seneff S, Glass JR, et al.: JUPITER: a telephone-based conversational interface for weather information. IEEE Transactions on Speech and Audio Processing 2000, 8(1):85-96. doi:10.1109/89.817460
8. Briem S: Machine Translation Tool for Automatic Translation from English to Icelandic. Iceland, 2007, http://www.simnet.is/stbr/
9. Papineni K, Roukos S, Ward T, Zhu W: BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Conference of the Association for Computational Linguistics (ACL '02), July 2002, Philadelphia, Pa, USA 311-318.
10. Rögnvaldsson E: Islensk hljodfraedi. Malvisindastofnun Haskola Islands, Reykjavik, Iceland; 1989.
11. Young S, Evermann G, Hain T, et al.: The HTK Book (Version 3.2.1). 2002.
12. Lee A, Kawahara T, Shikano K: Julius—an open source real-time large vocabulary recognition engine. Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH '01), September 2001, Aalborg, Denmark 1691-1694.
13. Bacchiani M, Roark B: Unsupervised language model adaptation. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003, Hong Kong 1: 224-227.
14. Sarikaya R, Gravano A, Gao Y: Rapid language model development using external resources for new spoken dialog domains. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), March 2005, Philadelphia, Pa, USA 1: 573-576.
15. Sethy A, Georgiou P, Narayanan S: Selecting relevant text subsets from web-data for building topic specific language models. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL '06), June 2006, New York, NY, USA 145-148.
16. Klakow D: Selecting articles from the language model training corpus. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '00), June 2000, Istanbul, Turkey 3: 1695-1698.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.