Experiments on Automatic Recognition of Nonnative Arabic Speech
© Yousef Ajami Alotaibi et al. 2008
Received: 11 May 2007
Accepted: 13 January 2008
Published: 24 February 2008
The automatic recognition of foreign-accented Arabic speech is a challenging task, since it involves a large number of nonnative accents and the nonnative speech data available for training are generally insufficient. Moreover, compared to other languages, the Arabic language has attracted relatively few research efforts. In this paper, we address the problem of nonnative speech in a speaker-independent, large-vocabulary speech recognition system for modern standard Arabic (MSA). We analyze some major differences at the phonetic level in order to determine which phonemes play a significant role in recognition performance for both native and nonnative speakers. Special attention is given to specific Arabic phonemes. The performance of an HMM-based Arabic speech recognition system is analyzed with respect to speaker gender and native origin. The WestPoint modern standard Arabic database from the Linguistic Data Consortium (LDC) and the Hidden Markov Model Toolkit (HTK) are used throughout all experiments. Our study shows that the best overall phoneme recognition performance is obtained when nonnative speakers are involved in both the training and testing phases. This is not the case when a language model and phonetic lattice networks are incorporated into the system. At the phonetic level, the results show that female nonnative speakers perform better than male nonnative speakers, and that emphatic phonemes yield a significant decrease in performance when uttered by nonnative speakers of either gender.
Pronunciation variability is by far the most critical issue for Arabic automatic speech recognition (AASR). This is mainly due to the large number of nonnative accents and to the fact that the nonnative speech data available for training are generally insufficient. Hence, the modeling of separate accents remains difficult and inaccurate. In addition, the Arabic language is characterized by extreme dialectal variation and nonstandardized speech representations: since it is usually written without short vowels and other diacritics, its written form carries incomplete phonetic information.
During the past few years, research initiatives have been carried out on analyzing speech from native and nonnative speakers' points of view. Byrne et al. worked on analyzing nonnative English speakers by collecting a corpus of conversational English speech from nonnative English speakers. They used an HTK-based speech recognition system, and their corpus contained both read and conversational speech recordings. They concluded that it is harder to recognize nonnative English speakers than native ones, especially for conversational speech. Another study was carried out by Livescu. Her work concentrated on analyzing and modeling nonnative speech for automatic speech recognition. She examined, among other tasks, the problem of nonnative speech in a speaker-independent, large-vocabulary, spontaneous speech recognition system for American English with native training data. She showed that interpolated native and nonnative models reduce the word error rate on a nonnative test set by 8.1% relative to her baseline recognizer using models trained on pooled native and nonnative data. She also investigated many issues related to language model (LM) differences between native and nonnative speakers. To improve the performance of speech recognition systems for nonnative speakers, Bartkova and Jouvet proposed an approach based on multiple models. They considered French as the native language and included English, Spanish, Italian, Portuguese, Turkish, and Arabic nonnative groups in their study. This approach required a huge amount of training data. Compared with research on other languages, only a very limited number of research initiatives have been carried out on the Arabic language.
The aim of this paper is to investigate the effect of foreign accents, for both male and female speakers, on the performance of automatic Arabic speech recognition. In this way, we can assess the effects of these variations on the overall accuracy of an HMM-based system using a language model (LM), and on the individual phoneme accuracy of an HMM-based system that does not use an LM.
This paper is organized as follows. Section 2 summarizes the main characteristics of the Arabic language. Section 3 describes the data and the baseline systems used in this study. Section 4 presents and discusses the obtained results. Section 5 concludes and indicates the perspective of this work.
2. Arabic Language Characteristics
Arabic is a Semitic language, and it is one of the oldest languages still in use in the world today. It is currently the fifth most widely spoken language. Arabic is the first language of the Arab world, used in Saudi Arabia, Jordan, Oman, Yemen, Egypt, Syria, Lebanon, and many other countries. The Arabic alphabet is also used in several other languages, such as Persian, Urdu, and Malay. Research on the Arabic language has mainly concentrated on modern standard Arabic, which is used throughout the media, courtrooms, and academic institutions in Arab countries. Previous work on developing ASR was dedicated to dialectal and colloquial Arabic within the 1997 NIST benchmark evaluations, and more recently to the recognition of conversational and dialectal speech.
2.1. Phonetic Features
The standard Arabic language has 34 phonemes, of which six are basic vowels and 28 are consonants. Arabic has fewer vowels than English: it has three long and three short vowels, while American English has at least twelve vowels. Standard Arabic is distinct from Indo-European languages in its consonantal nature. The allowed syllable structures in Arabic are CV, CVC, and CVCC, where V indicates a (long or short) vowel and C indicates a consonant. Arabic utterances can only start with a consonant. From an articulatory point of view, Arabic is characterized by the realization of some sounds in the rear part of the vocal tract: glottal and pharyngeal consonants. Arabic sounds can be divided into macroclasses such as stop consonants, voiceless fricatives, voiced fricatives, nasal consonants, liquid consonants, and vowels. The originality of Arabic phonetics rests mainly on the relevance of length in the vocalic system and on the presence of emphatic and geminated consonants. These particular features play a fundamental role in nominal and verbal morphological development. Pharyngeal and emphatic phonemes are characteristic of Semitic languages such as Hebrew, and also appear in languages influenced by Arabic, such as Persian and Urdu [6, 7].
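The syllable constraints above (templates CV, CVC, and CVCC, with every utterance starting with a consonant) can be checked mechanically. The following is a minimal illustrative sketch, assuming syllable patterns are given as strings over "C" and "V"; the function name is ours, not from the paper:

```python
# Check whether a consonant/vowel pattern can be parsed into the syllable
# templates allowed in Arabic: CV, CVC, and CVCC (illustrative sketch).
ALLOWED_TEMPLATES = ("CVCC", "CVC", "CV")  # try longer templates first

def is_valid_arabic_pattern(pattern: str) -> bool:
    """Recursive parse with backtracking: the pattern is valid if it can
    be split, left to right, into a sequence of allowed templates."""
    if not pattern:
        return True  # everything consumed successfully
    for tpl in ALLOWED_TEMPLATES:
        if pattern.startswith(tpl) and is_valid_arabic_pattern(pattern[len(tpl):]):
            return True
    return False
```

Since every template starts with C, the consonant-initial constraint falls out automatically: any vowel-initial pattern fails to match.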
Emphatic consonants are articulated in the rear part of the oral cavity: during their production, the root of the tongue is retracted toward the pharynx. There are four emphatic consonants in the Arabic language: two plosives, / / and / /, and two fricatives, /ðʔ/ and /ş/. In the pair /naşaba/ (imputed) and /nasaba/ (erected), an emphatic versus nonemphatic opposition is observed on /s/.
Gemination is a distinctive feature that compensates for the relative paucity of the Arabic vocalic system. A geminated consonant is produced by sustaining the consonantal closure. In the pair /faʃala/ (he failed) and /faʃ:ala/ (he thwarts), the opposition resides in the gemination of the fricative /ʃ/. This example illustrates both the importance and the difficulty of detecting this feature.
The vocalic system contains two phonological lengths for each vowel quality: for each short vowel /a/, /i/, and /u/, there is, respectively, an associated long vowel /a:/, /i:/, and /u:/. In Arabic, this temporal opposition is fundamental. For example, the two words /3amal/ "camel" and /3ama:l/ "beauty" differ only in the length of the final vowel.
2.2. Morphological Complexity
The development of accurate AASR systems faces two major issues. The first is diacritization: Arabic texts are almost never fully diacritized, meaning that the short strokes placed above or below a consonant to indicate the following vowel are usually absent. This limits the availability of training material, produces many identical-looking word forms, and thus decreases the predictability of the language model. The second problem is morphological complexity: Arabic has a rich potential of word forms, which increases the out-of-vocabulary rate [8, 9].
3. Data and Baseline Systems
3.1. The WestPoint Corpus
Table 1: Arabic phoneme list used throughout our experiments (articulatory descriptions).
- voiced pharyngeal fricative
- velarized voiced alveolar stop
- voiced velar fricative
- voiceless pharyngeal fricative
- voiceless glottal stop
- velarized voiceless alveolar fricative
- velarized voiceless alveolar stop
- velarized voiced interdental fricative
- voiced interdental fricative
- low front vowel
- low back vowel
- back upgliding diphthong
- front upgliding diphthong
- bilabial voiced stop
- voiced alveolar stop
- upper mid front vowel
- voiceless labiodental fricative
- voiced velar stop
- voiceless glottal fricative
- high front lax vowel
- high front tense vowel
- voiced palato-alveolar fricative
- voiceless velar stop
- voiced alveolar lateral
- voiced bilabial nasal
- voiced alveolar nasal
- voiceless uvular stop
- voiced alveolar flap
- voiceless alveolar fricative
- voiceless palato-alveolar fricative
- voiceless alveolar stop
- voiceless interdental fricative
- high back rounded vowel
- voiced bilabial approximant
- voiceless velar fricative
- voiced palatal approximant
- voiced alveolar fricative
Table 2: WestPoint corpus statistical summary.
- Number of speakers
- Hours of data
- Megabytes of data
- Number of speech files
3.2. The Parameterization
Table 3: Experimental conditions summary.
- Sampling: 22.05 kHz, 16 bits
- Speakers: 44 female + 66 male
- Features: MFCCs with first derivative
- Window type and size
- Window step size
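The first-derivative (delta) features listed among the experimental conditions are commonly computed from the static MFCCs by a linear regression over a few neighboring frames. The sketch below shows that standard regression formula; the window width and the edge-padding strategy are our assumptions, not values stated in the paper:

```python
import numpy as np

def delta(feats, width=2):
    """Append-ready first-order (delta) coefficients.

    feats: array of shape (num_frames, num_coeffs), e.g. static MFCCs.
    Each delta frame is the regression
        d_t = sum_{k=1..width} k * (c_{t+k} - c_{t-k}) / (2 * sum k^2),
    with edge frames padded by repetition.
    """
    feats = np.asarray(feats, dtype=float)
    n = len(feats)
    denom = 2 * sum(k * k for k in range(1, width + 1))
    # repeat first/last frame so the window is defined at the edges
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    out = np.zeros_like(feats)
    for k in range(1, width + 1):
        out += k * (padded[width + k : width + k + n]
                    - padded[width - k : width - k + n])
    return out / denom
```

On a linearly increasing coefficient track the interior delta values equal the slope, while the repeated-edge padding shrinks the estimate near the boundaries.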
3.3. The Recognizer
The Hidden Markov Model Toolkit (HTK) is used for designing and testing the speech recognition systems throughout all experiments. The baseline system was initially designed as a phoneme-level recognizer with continuous, left-to-right, no-skip HMM models having three active states and one Gaussian per state. The system was designed by considering all 37 MSA phones given in the LDC WestPoint catalog. The WestPoint corpus has three phonemes more than the number of MSA phonemes mentioned in most linguistic literature [6, 7, 9]: it adds /g/ "voiced velar stop", /aw/ "back upgliding diphthong", and /ey/ "upper mid front vowel". In fact, the phoneme /g/ does not exist in MSA at all, but we think the WestPoint corpus used it because some native and nonnative speakers commonly use it in some MSA words; we confirmed this by listening to some WestPoint audio files. The extra vowel and diphthong, on the other hand, were included because of pronunciation variations of speakers influenced by English and other European languages: these phonemes exist in those languages but not in MSA. For our study, we decided to keep the WestPoint phonemes and transcriptions without any modification. We believe that this decision will facilitate comparison with systems from other research efforts using the same corpus. Since most words consist of more than two phonemes, context-dependent triphone models were created from the monophone models. The training phase consists of re-estimating the HMM models with the Baum-Welch algorithm after aligning and tying the models using the decision-tree method. Phoneme-based models are good at capturing phonetic details, and context-dependent phoneme models can characterize formant transition information, which is very important for discriminating confusable speech units.
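The topology described above (three emitting states, left-to-right, no skips, one Gaussian per state) can be illustrated by a toy forward-algorithm pass over one-dimensional observations. This is a sketch of the model structure only, not HTK's implementation; the function names, the self-loop probability, and the use of scalar observations are our illustrative assumptions:

```python
import math

def gauss_logpdf(x, mean, var):
    # Log density of a 1-D Gaussian emission.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def logadd(a, b):
    # log(exp(a) + exp(b)), computed stably in the log domain.
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log1p(math.exp(min(a, b) - m))

def forward_log_likelihood(obs, means, variances, self_loop=0.6):
    """Forward pass for a left-to-right, no-skip HMM with one 1-D
    Gaussian per state: entry is forced into state 0, and each state
    either loops on itself or advances to the next state."""
    n = len(means)
    neg_inf = float("-inf")
    log_stay, log_move = math.log(self_loop), math.log(1.0 - self_loop)
    alpha = [neg_inf] * n
    alpha[0] = gauss_logpdf(obs[0], means[0], variances[0])
    for x in obs[1:]:
        new = [neg_inf] * n
        for s in range(n):
            stay = alpha[s] + log_stay
            move = alpha[s - 1] + log_move if s > 0 else neg_inf
            new[s] = logadd(stay, move) + gauss_logpdf(x, means[s], variances[s])
        alpha = new
    return alpha[-1]  # a token must end in the last emitting state
```

For example, with state means [0, 1, 2], the observation sequence [0, 1, 2] scores higher than its reversal, reflecting the strict left-to-right topology.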
3.4. The Language Model
The performance of any recognition system depends on many factors, but the size and the perplexity of the vocabulary are among the most critical. In this system, the vocabulary size is relatively high, since it contains more than one thousand different words, and the perplexity is very high due to the existence of many acoustically similar phonemes in Arabic.
A bigram probability is used when the bigram has been observed at least once in the training data; otherwise, the transition probability is backed off to the unigram count. These statistics are generated with the HLStats function of the HTK toolkit, which counts the occurrences of all labels in the system and then generates the back-off bigram probabilities based on the phoneme-based dictionary of the corpus, that is, the probability of occurrence of every consecutive pair of labels in all labelled words of our dictionary. A second HTK function, HBuild, takes the back-off probabilities file as input and generates the bigram language model. The dictionary used in our application includes all words, without exception, that occur in the WestPoint corpus.
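The back-off behavior described above (use a discounted bigram estimate when the pair was observed, otherwise fall back to the unigram distribution) can be sketched in a few lines. This is a toy estimator in the spirit of HLStats/HBuild, not HTK's exact smoothing; the absolute-discount scheme, the `discount` value, and the function names are our assumptions:

```python
from collections import Counter

def backoff_bigram(sequences, discount=0.5):
    """Return prob(prev, word): a back-off bigram estimate.

    Observed bigrams get a discounted relative frequency; unseen ones
    receive the freed probability mass, shared in proportion to the
    unigram distribution so that each history sums to one."""
    uni, bi = Counter(), Counter()
    for seq in sequences:
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
    total = sum(uni.values())
    p_uni = {w: c / total for w, c in uni.items()}

    def prob(prev, word):
        hist = sum(c for (a, _), c in bi.items() if a == prev)
        if hist == 0:
            return p_uni.get(word, 0.0)  # history never seen: pure unigram
        seen = bi.get((prev, word), 0)
        if seen > 0:
            return (seen - discount) / hist
        # mass freed by discounting the seen successors of `prev`
        n_seen = sum(1 for (a, _) in bi if a == prev)
        freed = discount * n_seen / hist
        unseen_mass = sum(p for w, p in p_uni.items()
                          if bi.get((prev, w), 0) == 0)
        return freed * p_uni.get(word, 0.0) / unseen_mass if unseen_mass else 0.0

    return prob
```

By construction, the discounted probabilities of the seen successors plus the redistributed mass over unseen successors sum to one for every observed history.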
4. Experiments, Results, and Discussion
Nine sets of experiments (Exp. 1, Exp. 2, ..., Exp. 9) have been carried out. In each experiment, we examined two outcomes of the system. The first outcome concerns phonemic recognition without any LM; it is referred to by the subscript (a), and the language dictionary used in this case is a simple phoneme-to-phoneme mapping. The second outcome, referred to by the subscript (b), is the system accuracy when the LM is incorporated; it uses a dictionary mapping every word in the database to its corresponding phoneme transcription. The nine experiments differ in the composition of the training and testing data sets with respect to the speakers' native language. As specified by the WestPoint database, a nonnative Arabic speaker is a native English speaker.
4.1. Experimental Setup
In the first experiment, Exp. 1, native Arabic speakers are involved in both the training and the testing. The WestPoint corpus is divided so that 61% of the corpus is used for training and 39% for testing, regardless of gender or scripts. In the second experiment, Exp. 2, native speakers are used for training and only nonnative speakers are involved in the test phase: the training uses 85% of the corpus (native speakers), while the remaining 15%, composed of nonnative speakers, is used for the test. In Exp. 3, all nonnative Arabic speakers are used for training and all native Arabic speakers for testing. In the fourth experiment, Exp. 4, only nonnative Arabic speakers are used for both training and testing; the nonnative part of the corpus is divided into 67% for training and 33% for testing. In the fifth experiment, Exp. 5, native and nonnative Arabic utterances are pooled to constitute the training data (69% of the corpus) and the testing data (31% of the corpus). Experiments Exp. 6 to Exp. 9 are set up by varying the gender of the speakers in the training and testing data. In Exp. 6, native male speakers are used for training and only nonnative male speakers are involved in the test: the training uses 82% of the corpus (male speakers only), while 18% of the corpus, composed of nonnative male speakers, is used for the test. In Exp. 7, native female speakers are used for training and only nonnative female speakers are involved in the test: the training uses 90% of the corpus (female speakers only), while 10% of the corpus, composed of nonnative female speakers, is used for the test. In Exp. 8, nonnative male speakers are used for training and only native male speakers are involved in the test: the training uses 18% of the corpus (male speakers), while 82% of the corpus, composed of native male speakers, is used for the test. Finally, in Exp. 9, nonnative female speakers are used for training and only native female speakers are involved in the test: the training uses 10% of the corpus (female speakers), while 90% of the corpus, composed of native female speakers, is used for the test. The bigram language model was always derived from the full set of prompts provided by WestPoint, which means that the same language model is used for all experiments.
4.2. Effect of the Language Model
Table 4: System overall performance with different configurations with respect to the native origin of speakers, using the Arabic LDC WestPoint corpus (training/testing: N = native, NN = nonnative, M = mixed).
- Without bigram-based language model: Exp. 1.a (N/N), Exp. 2.a (N/NN), Exp. 3.a (NN/N), Exp. 4.a (NN/NN), Exp. 5.a (M/M)
- With bigram-based language model: Exp. 1.b (N/N), Exp. 2.b (N/NN), Exp. 3.b (NN/N), Exp. 4.b (NN/NN), Exp. 5.b (M/M)
Table 5: System overall performance with different gender and native language configurations.

Configuration           Phoneme level, no LM    Word level, with LM
Male N/NN               Exp. 6a: 46.56%         Exp. 6b: 89.07%
Female N/NN             Exp. 7a: 51.42%         Exp. 7b: 83.62%
Male NN/N               Exp. 8a: 47.86%         Exp. 8b: 95.89%
Female NN/N             Exp. 9a: 50.17%         Exp. 9b: 81.34%
4.3. Effect of the Native Origin of Speakers
If we consider the phoneme recognition performance without any LM, as shown in Table 4, we notice that the system gives its best accuracy when it is trained and tested on nonnative speakers. The poorest overall accuracy is obtained when the system is trained on nonnative speakers and tested on native speakers. In other words, the nonnative-trained system gives the best and the worst accuracy when tested on nonnative and native speakers, respectively. Investigating the detailed per-phoneme results, we found that some phonemes give lower accuracy when tested with native speakers instead of nonnative ones. In fact, the best phoneme recognition rate is reached when nonnative speakers are involved in both the training and the test phases. This result can be explained by the fact that nonnative speakers tend to make an effort to be more consistent with the standard pronunciation. When the training and testing data sets in an experiment are identical with respect to the native origin of the speakers, the accuracies are higher than in the cases where they differ. When the training and testing data sets are mixed (regardless of the native origin of speakers), the accuracy decreases by almost 2% and 4% compared to the results obtained in Exp. 1(a) and Exp. 4(a), respectively. As expected, the accuracy of an AASR system is negatively influenced by changing the mother tongue between the training and testing data sets.
4.4. Effect of the Speakers' Gender
The gender of the speaker is one of the influential sources of speech variability. In the early days of speech recognition, gender was not considered a major issue. The progress made over the last decade has led to high-performance transcription systems, which permits one to ask whether ASR systems behave differently on male and female speech. An interesting study carried out by Adda-Decker and Lamel revealed that, for both French and English, female speakers had better average recognition results than males. In our experiments, considering the gender of speakers, as can be inferred from Table 5 (without an LM, i.e., the experiments numbered with subscript (a)), we notice that female speakers give better overall system accuracy. This difference is more than 2% in the case where nonnative speakers are involved in the training and native ones in the test. On the other hand, the improvement from using female speakers is almost 5% when native speakers are used in the training and nonnative speakers in the test. When an LM is incorporated and the word level is considered (i.e., in experiments Exp. 6(b) through Exp. 9(b)), the trend is reversed: the LM improves the accuracy of male speakers much more than that of female speakers.
4.5. Performance at the Phonetic Level
The Arabic phoneme /H/ is a pharyngeal, fricative, unvoiced, and nonemphatic sound. The /H/ phoneme sharply falls in accuracy whenever nonnative speakers are involved in training and/or testing data. The accuracy of this phoneme is less than 10% in experiments Exp. 6(a) (Male N/NN) and Exp. 7(a) (Female N/NN). On the other hand, this phoneme gives better performance in the other experiments.
The phoneme /G/, which is an alveo-dental, stop, voiced, and emphatic sound, behaves similarly to the /H/ phoneme: its accuracy drops sharply whenever nonnative speakers are involved in the training and/or testing data. For example, the accuracy of this phoneme is less than 20% in experiments Exp. 3(a) (NN/N) and Exp. 4(a) (NN/NN). It gives a better performance in the other experiments.
5. Conclusion
In this paper, we have presented the results obtained by an HMM-based, speaker-independent, large-vocabulary speech recognition system for modern standard Arabic, with a focus on the problem of foreign accents. We analyzed the performance of AASR at the phonetic and word levels. We have confirmed, through our experiments, that the accuracy of an AASR system is negatively influenced by changing the mother tongue between the training and testing data sets. However, the best phoneme recognition rate is reached when nonnative speakers are involved in both the training and the test phases, which was far from predictable. The obtained results show that, at the phonetic level, female nonnative speakers perform better than male nonnative speakers. This confirms that, as observed for English and French, the pronunciation of nonnative Arabic female speakers tends to be more consistent with the standard pronunciation than that of nonnative male speakers. However, the bigram-based language model improved the accuracy of nonnative male speakers much more than that of nonnative female speakers. In addition, we noticed that nonnative speakers have difficulty pronouncing the emphatic consonant /D/. Note that /D/ is a phoneme that exists only in Arabic, which is why the Arabic language is commonly known as "the /D/ language" in the Arab community. It is also worth noting that nonnative speakers have significant problems with the pronunciation of the voiceless stop consonant /t/: there was more than a 10% difference between native and nonnative accuracies. By listening to all WestPoint corpus utterances containing this phoneme, we confirmed that this is due to a difference in the place of articulation, that is, in the position of the tongue tip when /t/ is uttered by native versus nonnative speakers.
We will continue this research by investigating the best way to adapt the AASR system to foreign accents, introducing the phonetic knowledge acquired from the common errors of nonnative speakers.
References
- Kirchhoff K, Bilmes J, Das S, et al.: Novel approaches to Arabic speech recognition: report from the 2002 Johns Hopkins summer workshop. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003, Hong Kong, 1: 344-347.
- Byrne W, Knodt E, Khudanpur S, Bernstein J: Is automatic speech recognition ready for non-native speech? A data collection effort and initial experiments in modeling conversational Hispanic English. Proceedings of the ESCA Conference on Speech Technology in Language Learning (STILL '98), May 1998, Marholmen, Sweden, 37-40.
- Livescu K: Analysis and modeling of non-native speech for automatic speech recognition. M.S. thesis, MIT, Cambridge, Mass, USA; 1999.
- Bartkova K, Jouvet D: Multiple models for improved speech recognition for non-native speakers. Proceedings of the 9th Conference of Speech and Computer (SPECOM '04), September 2004, St. Petersburg, Russia.
- Al-Zabibi M: An acoustic-phonetic approach in automatic Arabic speech recognition. Ph.D. thesis, Loughborough University of Technology, Leics, UK; 1990.
- Alkhouli M: Linguistic Phonetics. Daar Alfalah, Swaileh, Jordan; 1990.
- Elshafei M: Toward an Arabic text-to-speech system. The Arabian Journal for Science and Engineering 1991, 16(4): 565-583.
- El-Imam YM: An unrestricted vocabulary Arabic speech synthesis system. IEEE Transactions on Acoustics, Speech, and Signal Processing 1989, 37(12): 1829-1845.
- Omar A: Studying Linguistic Phonetics. Aalam Alkutob, Cairo, Egypt; 1991.
- Linguistic Data Consortium (LDC) Catalog Number LDC2002S02, 2002, http://www.ldc.upenn.edu/
- Young S, Evermann G, Gales M, et al.: The HTK Book (for HTK Version 3.3). Cambridge University Engineering Department, 2005, http://htk.eng.cam.ac.uk/
- Deng L, O'Shaughnessy D: Speech Processing: A Dynamic and Optimization-Oriented Approach. Marcel Dekker, New York, NY, USA; 2003.
- Adda-Decker M, Lamel L: Do speech recognizers prefer female speakers? Proceedings of the InterSpeech Conference (Interspeech '05), September 2005, Lisbon, Portugal, 2205-2208.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.