A large vocabulary continuous speech recognition system for Persian language
© Sameti et al; licensee Springer. 2011
Received: 18 January 2011
Accepted: 5 October 2011
Published: 5 October 2011
The first large vocabulary speech recognition system for the Persian language is introduced in this paper. This continuous speech recognition system uses most standard and state-of-the-art speech and language modeling techniques. The development of the system, called Nevisa, started in 2003 with a dominant academic theme. This engine incorporates customized established components of traditional continuous speech recognizers, and its parameters have been optimized for real applications of the Persian language. For this purpose, we had to identify the computational challenges of the Persian language, especially for text processing, and extract statistical and grammatical language models for the Persian language. To achieve this, we had to either generate the necessary speech and text corpora or modify the primitive corpora available for the Persian language.
In the proposed system, acoustic modeling is based on hidden Markov models, and optimized decoding, pruning and language modeling techniques were used in the system. Both statistical and grammatical language models were incorporated in the system. MFCC representation with some modifications was used as the speech signal feature. In addition, a VAD was designed and implemented based on signal energy and zero-crossing rate. Nevisa is equipped with out-of-vocabulary capability for applications with medium or small vocabulary sizes. Powerful robustness techniques were also utilized in the system. Model-based approaches like PMC, MLLR and MAP, along with feature robustness methods such as CMS, PCA, RCC and VTLN, and speech enhancement methods like spectral subtraction and Wiener filtering, along with their modified versions, were diligently implemented and evaluated in the system. A new robustness method called PC-PMC was also proposed and incorporated in the system. To evaluate the performance and optimize the parameters of the system in noisy-environment tasks, four real noisy speech data sets were generated. The final performance of Nevisa in noisy environments is similar to the clean conditions, thanks to the various robustness methods implemented in the system. Overall recognition performance of the system in clean and noisy conditions assures us that the system is a real-world product as well as a competitive ASR engine.
Since the start of developing speech recognizers at AT&T Bell Labs in the 1950s, enormous efforts and investments have been directed towards automatic speech recognition (ASR) research and development. In the 1960s, ASR research was focused on phonemes and isolated word recognition. Later, in the 1970s and 1980s, connected words and continuous speech recognition were the major trends of ASR research. To accomplish these targets, researchers introduced linear predictive coding (LPC) and used pattern recognition and clustering methods. Hidden Markov models (HMM), cepstral analysis and neural networks were employed in the 1980s. In the next decade, robust continuous speech recognition and spoken language understanding were popular topics. In the last decade, researchers and investors introduced spoken dialogue systems and tried to implement conversational speech recognition systems capable of recognizing and understanding spontaneous speech. Machine learning techniques and artificial intelligence (AI) concepts entered the ASR research literature and contributed considerably to fulfilling human speech recognition needs. Until recent years, speech recognition systems were considered luxury tools or services and were not usually taken seriously by users. In the past 5-10 years, ASR engines have played genuinely beneficial roles in several areas, especially in telecommunication services and important enterprise applications such as customer relationship management (CRM) frameworks.
Several successful ASR systems with good performance are reported in the literature [1–3]. The most successful approaches to ASR are the ones based on pattern recognition and using statistical and AI techniques [1, 3, 4]. The front end of a speech recognizer is a feature extraction block. The most common features used for ASR are Mel-frequency cepstral coefficients (MFCC). Once the features are extracted, modeling is performed, usually based on artificial neural networks (ANN) or HMMs. Linguistic information is also used extensively in an ASR system. Statistical (n-gram) and grammatical (i.e., structural) language models [4, 5] are used for this purpose.
One essential problem with putting the speech recognition systems into practice is the variety of languages people around the world speak. ASR systems are highly dependent on the language spoken. We can categorize the research areas of speech recognition into two major classes; first, acoustic and signal processing which is very much the same for ASR in every language; second, natural language processing (NLP) which is dependent on the language. Obviously, this language dependency hinders the implementation and utilization of ASR systems for any new language.
We have focused our research on Persian speech recognition during recent years. Persian ASR systems have been addressed and developed to different extents [6–10]. There are other works on the development of Persian continuous speech recognition systems [11–14]. However, most of them present a medium vocabulary continuous speech recognition system with a high word error rate. Our large vocabulary continuous speech recognition system for Persian, called Nevisa, was first introduced in [6, 7] as the Sharif speech recognition system. It employs cepstral coefficients as the acoustic features and the continuous density hidden Markov model (CDHMM) as the acoustic model [4, 15]. A time-synchronous left-to-right Viterbi beam search, in combination with a tree-organized pronunciation lexicon, is used for decoding [16, 17]. To limit the search space, two pruning techniques are employed in the decoding process. Due to our practical approach in using this system, Nevisa is equipped with established robustness techniques for handling speaker variation and environmental noise. Various data compensation and model compensation methods are used to achieve this objective. Also, class-based n-gram language models (LM) [18, 19] together with a generalized phrase structure grammar (GPSG)-based Persian grammar are utilized as word-level and sentence-level linguistic information. The frameworks for testing and comparing the effects of the implemented methods, and also for optimizing the parameters, were gradually built up. This enabled us to move towards a practical ASR system capable of being utilized as Persian dictation software, also called Nevisa.
In the remainder of this paper, in Sect. 2, the characteristics of the Persian language, and speech and text corpora of the Persian language are reviewed. An overview of Nevisa Persian speech recognition system and overall features of this system is given in Sect. 3. This section provides a review on acoustic modeling, robustness techniques used in the system, and building statistical and grammatical language models for the Persian language. In Sect. 4 the details of the experiments and the recognition results are given. Finally, Sect. 5 gives a brief summary and conclusion of the paper.
2 Persian language and corpora
2.1 Persian language
The Persian language, also known as Farsi, is an Iranian language within the Indo-Iranian branch of Indo-European languages. It is natively spoken by about seventy million people in Iran, Afghanistan and Tajikistan as the official language. It is also widely spoken in Uzbekistan and, to some extent, in Iraq and Bahrain. This language has remained remarkably stable since the eighth century, although local environments, such as the Arabic language, have influenced it. The Arabic language has heavily influenced Persian, but has not changed its structure. In other words, Persian has only borrowed a large number of lexical words from Arabic. Therefore, in spite of this influence, Arabic has not affected the syntactic and morphological forms of Persian; as a result, the language models of Persian and Arabic are fundamentally different. Although there are several similar phonemes in Arabic and Persian, and they use similar scripts, the phonetic structures of these languages have principal differences; therefore, the acoustic models of Persian and Arabic are not the same. Consequently, the development of a speech recognition system for Persian differs from that for Arabic due to distinctions in their acoustic and language models.
Phonemes of Persian language
high front unrounded
mid front unrounded
low front unrounded
high back unrounded
mid back unrounded
low back rounded
unvoiced bilabial plosive closure
unvoiced bilabial plosive
voiced bilabial plosive closure
voiced bilabial plosive
unvoiced alveolar plosive closure
unvoiced dental plosive
voiced dental plosive closure
voiced dental plosive
unvoiced palatal plosive closure
unvoiced bilabial plosive
unvoiced velar plosive closure
unvoiced bilabial plosive
voiced palatal plosive closure
voiced palatal plosive
voiced velar plosive closure
voiced velar plosive
voiced uvular plosive closure
voiced uvular plosive
glottal stop closure
unvoiced alveopalatal affricate closure
unvoiced alveopalatal affricate
voiced alveopalatal affricate closure
voiced alveopalatal affricate
unvoiced labiodental fricative
voiced labiodental fricative
unvoiced alveolar fricative
voiced alveolar fricative
unvoiced alveopalatal fricative
voiced alveopalatal fricative
unvoiced uvular fricative
unvoiced glottal fricative
Persian uses the same alphabet as Arabic with four additional letters. Therefore, the number of letters in the Persian alphabet is 32 as compared to 28 in Arabic. Each additional Persian letter represents a phoneme not present in the Arabic phoneme set, namely /p/, /tʃ/, /ʒ/ and /g/. In addition, Persian has four other phonemes (/v/, /k/, /?/, /G/) which are pronounced differently from their Arabic counterparts. On the other hand, Arabic has its own unique phonemes (about ten) not defined in the Persian language. Persian makes extensive use of word building and combining affixes, stems, nouns and adjectives. Persian frequently uses derivational agglutination to form new words from nouns, adjectives and verbal stems. New words are extensively formed by compounding two existing words, as is common in German. Suffixes predominate in Persian morphology, though there are a small number of prefixes. Verbs can express tense and aspect, and they agree with the subject in person and number. There is no gender in Persian, nor are pronouns marked for natural gender.
2.2.1 Speech corpus
In this paper, two speech databases, small Farsdat and large Farsdat, are used. Small Farsdat is a database hand-segmented at the phoneme level which contains 6,080 Persian sentences read by 304 speakers. Each speaker has uttered 18 randomly chosen sentences (from a set of 405 sentences) plus two sentences which are common to all speakers. The sentences are formed by using over 1,000 Persian words and are designed artificially to cover the acoustic variations of the Persian language. The speakers are chosen from ten different dialect regions in Iran and the corpus contains the ten most common dialects of the Persian language. The male to female population ratio is 2:1. The database is recorded in a low-noise environment featuring an average signal to noise ratio of 31 dB, with a sampling rate of 22,050 Hz. A clean test set, called the small Farsdat test set (sFarsdat test), is selected from this database and contains 140 sentences from seven speakers. All the other sentences are used as the train set (sFarsdat train). Small Farsdat, as its name indicates, is a small speech corpus and can be used only for training and evaluating limited speech recognition systems in laboratories. This speech corpus is comparable with the TIMIT corpus in English. Large Farsdat is another Persian speech database that removes some of the deficiencies of small Farsdat.
Large Farsdat includes about 140 h of speech signals, all segmented and labeled at the word level. This corpus is uttered by 100 speakers from the most common dialects of the Persian language. Each speaker utters 20-25 pages of text from various subjects. In contrast with small Farsdat, which is recorded in a quiet and reverberation-free room, large Farsdat is recorded in an office environment. Four microphones are used to record the speech signals: a unidirectional desktop microphone, two lapel microphones and a headset microphone. All the speech signals in this corpus are recorded using two microphones simultaneously: the desktop microphone is used in all of the recording sessions, and each of the other three microphones is used in about one-third of the sessions. In total, the desktop microphone is used for about 70 h of recorded speech and the other three microphones are used for the remaining 70 h. The average SNR of the desktop microphone is about 28 dB. The sampling rate is 16 kHz for the whole corpus.
The test set contains 750 sentences from seven speakers (four male and three female) and is recorded using the desktop microphone of the large Farsdat database. We call this set gFarsdat test. The average sentence length of this test set is 7.5 s. This set includes numbers, names and some grammar-free sentences and contains about 5,000 different words. All other speech signals in large Farsdat recorded with the desktop microphone are used here as the train set, i.e. gFarsdat train. In this research, only those speech files of large Farsdat that are recorded using the desktop microphone are used in the evaluations.
Farsi noisy speech corpus
The specifications of tasks in FANOS database.
Number of files (adapt + test): 315 (175 + 140) for each of the four tasks
Number of speakers (male + female): 7 (5 + 2) for each of the four tasks
2.2.2 Text corpus
In this research, we have used the two editions of the Persian text corpus called "Peykare" [25, 26]. The first edition of this corpus consists of about ten million words, and it was increased to about 100 million words in the second edition. All words in the first edition are annotated with part-of-speech (POS) tags. The texts of this corpus are gathered from various data sources such as newspapers, magazines, journals, books, letters, hand-written texts, movie scripts and news. This corpus is a complete set of contemporary Persian texts. The texts cover different subjects including politics, arts, culture, economics, sports and stories. The tag set of the Persian Text Corpus has 882 POS tags [18, 19], which are reduced to 166 POS tags in this work.
3 Nevisa speech recognition system
The system uses context-dependent (CD) and context-independent (CI) acoustic models that are represented by continuous density hidden Markov models. These models are mixtures of Gaussian distributions in the cepstral domain. In this system, forward, skip and loop transitions between the states are allowed and the covariance matrices are assumed diagonal [6, 9, 10]. The parameters of the emission probabilities are trained using the maximum likelihood criterion and the training procedure is initialized by a linear segmentation. Each iteration of the training procedure consists of time alignment by dynamic programming (Viterbi algorithm) followed by parameter estimation, resulting in the segmental k-means training procedure [3, 4]. In the decoding phase, a Viterbi-based search with beam and histogram pruning techniques is used. In this module, the recognized acoustic units are used to make active hypotheses via the word decoder. The word decoder searches the lexicon tree simultaneously in interaction with the acoustic decoder and the pruning modules. The final active hypotheses are rescored using language models. Both statistical and grammatical language models can be used either in the word decoder or in the rescoring modules. In Nevisa, by default, the statistical LM is used in the word decoder, i.e., during the search, and the grammatical model is optionally used in the n-best re-scoring module. Dotted arrows in Figure 1 mean that the statistical LM can optionally be used in the rescorer module, and the grammatical LM can optionally be utilized during the search.
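As a rough illustration of how beam and histogram pruning can limit the set of active hypotheses at one time frame, consider the following Python sketch. The function name `prune`, the dictionary representation of hypotheses and the use of log-scores are assumptions made for this example only; the sketch does not describe the actual Nevisa decoder.

```python
import heapq


def prune(hyps, beam_width, max_active):
    """Prune the active hypotheses at one time frame.

    hyps: dict mapping a hypothesis id to its log-score (higher is better).
    beam_width: keep hypotheses whose score is within this margin of the best.
    max_active: histogram pruning limit on the number of survivors.
    """
    if not hyps:
        return {}
    best = max(hyps.values())
    # Beam pruning: drop every hypothesis too far below the best score.
    survivors = {h: s for h, s in hyps.items() if s >= best - beam_width}
    # Histogram pruning: cap the number of hypotheses that stay active.
    if len(survivors) > max_active:
        top = heapq.nlargest(max_active, survivors.items(), key=lambda kv: kv[1])
        survivors = dict(top)
    return survivors
```

Beam pruning adapts to the score distribution, while histogram pruning bounds the worst-case cost per frame, which is why the two are commonly combined.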
3.2 Acoustic modeling
For acoustic modeling we employ two approaches: context-independent (CI) and context-dependent (CD) modeling. The standard phoneme set of the Persian language contains 29 phonemes. This phoneme set and extra HMM models for silence, noise and aspiration are considered in the CI modeling. In Sect. 4, where recognition results are given, the details of the modeling process, including the number of states and Gaussian mixtures, are presented.
For context-dependent modeling, we use triphones as the phone units. The major problem in triphone modeling is the trade-off between the number of triphones and the size of the available training data. There are a large number of triphones in a language, but many of them are unseen or rarely used in speech corpora, so the amount of training data is insufficient for many triphones. To solve this problem, state tying methods are used [35, 36]. Two prevalent methods for state tying are data-driven clustering and decision tree-based state tying [36, 37]. In these methods, at the first stage, all triphones that occur in a speech corpus are trained using the available data. Then the states of similar triphones are clustered into a small number of classes (similar triphones are those that have the same middle phoneme). In the last stage, the states that lie in each cluster are tied together. The tied states are called senones.
Different numbers of senones and different numbers of Gaussian distributions were evaluated in the Nevisa system. The experimental results showed that clustering triphone states to 500 senones for small Farsdat and 4,000 senones for large Farsdat leads to the best WER. The evaluation results are given in Sect. 4.
3.2.1 Robustness methods
Like all speech recognizers, the performance of Nevisa degrades in real applications and in the presence of noise [23, 31, 39, 40]. In order to make this system robust to speaker and environment variations, many recent advanced robustness methods are incorporated. Differences between speakers, background noise characteristics and channel noises (i.e. microphones) are considered and addressed. Nevisa uses data compensation and model compensation approaches as well as their combinations. In the data compensation approach, clean data are estimated from their noisy samples so as to make them similar to the training data. Nevisa uses spectral subtraction (SS) and Wiener filtering, cepstral mean subtraction (CMS) [3, 23], principal component analysis (PCA) and vocal tract length normalization (VTLN) [27–29, 41] for this purpose. In the model-based approach, the models of various sounds used by the classifier are modified to become similar to the test data models. Maximum likelihood linear regression (MLLR) [33, 42], maximum a posteriori (MAP) [24, 34], parallel model combination (PMC) [23, 31, 33] and a novel enhanced version of PMC, the PCA and CMS based PMC (PC-PMC), are incorporated in the system. The PC-PMC algorithm combines the additive noise compensation ability of PMC with the convolutional noise removal capability of both PCA and CMS. The first problem to be solved in combining these methods is that the PMC algorithm requires invertible modules in the front-end of the system, while CMS normalization is not an invertible process. In addition, a framework has to be designed for the adaptation of the PCA transform matrix in the presence of noise. The PC-PMC method provides solutions to these problems.
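Of the data compensation methods listed above, CMS is the simplest to illustrate. A stationary convolutional distortion (such as a fixed microphone response) appears as an approximately constant offset in the cepstral domain, so subtracting the per-utterance mean of each coefficient removes it. The following Python sketch shows this under the assumption that an utterance is given as a list of equal-length cepstral vectors; the function name is hypothetical and the code is not taken from Nevisa.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean computed over one utterance.

    frames: list of equal-length lists, one cepstral vector per frame.
    Returns a new list of mean-normalized vectors.
    """
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient over all frames of the utterance.
    mean = [sum(f[k] for f in frames) / n for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in frames]
```

Because the mean is discarded, the original features cannot be recovered from the output, which is the non-invertibility that the text identifies as an obstacle when combining CMS with PMC.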
The integration of these robustness modules in Nevisa is shown in Figure 1. The modularity of the system makes it easy to remove any of the system blocks, add new blocks, or change or replace the existing ones.
3.3 Language modeling
Linguistic knowledge is as important as acoustic knowledge in recognizing natural speech. Language models depict the constraints on word sequences imposed by the syntax, semantics or pragmatics of the language. In recognizing continuous speech, the acoustic signal is too weak to narrow down the number of word candidates. Hence, speech recognizers employ a language model that prunes out acoustic alternatives by taking the previously recognized words into account. In most applications of speech recognition, it is crucial to exploit the vast information about the order of the words. For this purpose, statistical and grammatical language modeling methods are common approaches utilized in spoken human-computer interaction. These methods are used by Nevisa to improve its accuracy.
3.3.1 Statistical language modeling
In statistical approaches, we take a probabilistic viewpoint of language modeling and estimate the probability P(W) for a given word sequence W = w_1 w_2 ... w_n. The simplest and most successful statistical language models are the Markov chain (n-gram) source models, first explored by Shannon. To build statistical language models, we have used both the first edition and the second edition of the Peykare corpus. As mentioned in Sect. 2.2.2, the first edition of this corpus contains about ten million words that are annotated with POS tags. Using this corpus, we constructed different types of n-gram language models. Since the size of this edition of the corpus was not enough for making a reliable word-based n-gram language model, we built POS-based and class-based n-gram language models in addition to the word-based n-gram model. These language models are used in the intermediate version of Nevisa. The final language model of Nevisa has been constructed from the second edition of the Peykare corpus.
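For reference, the n-gram idea can be written out for the trigram case. The chain rule factors P(W) exactly, and the trigram model approximates each factor by conditioning on only the two preceding words. The count notation C(·) and the maximum likelihood estimate on the right are standard textbook forms introduced here for illustration (sentence-boundary handling and smoothing are omitted); they are not quoted from the original system description.

```latex
P(W) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
     \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1}),
\qquad
\hat{P}(w_i \mid w_{i-2}, w_{i-1})
     = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
```

The estimate is zero for any trigram not seen in the corpus, which is why a ten-million-word corpus is too sparse for reliable word-based trigrams and why smoothing is required.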
Examples of different writing styles for plural suffix "h/" and imperfective prefix "mi".
Another issue is the inconsistency of text encoding in Persian electronic texts. This problem arises from the use of different code pages by online publishers and people. As a result, some letters such as 'ye' and 'ke' have various encoding. For example, the letter 'ye' has three different encodings in Unicode, i.e., U+0649 and U+064A (Arabic letters 'ye') and U+06CC (Persian letter 'ye').
To solve these problems, we must replace the different orthographic forms of a word by a unique form. The main corrections applied to the corpus texts are as follows:
All affixes that are attached to the host word or separated from it by an intervening space are replaced with affixes separated by the final form character (zero-width non-joiner character). For example, the words "ket/b h/" (the books) and "miravand" (they are going) in the examples above are replaced by "ket/bh/ ~ " and "miravand ~ ".
Different orthographic realizations of a single word are replaced with their standard form according to the standards of APLL (Academy of the Persian Language and Literature). For example, all different forms of the words "mas]uliyat" and "majmu]eye" in the above example are replaced with their standard forms (form 1 in Table 4).
Different encodings of a specific character are changed to a unique form. For example, all letters 'ye' that are encoded by U+0649 and U+064A are changed to the letter 'ye' encoded by U+06CC.
All diacritics (Bound graphemes) appearing in texts are removed. For example, the consonant gemination marker in the word "fann/vari" (technology) is removed resulting in the word "fan/vari".
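Two of the corrections above, unifying character encodings and removing diacritics, are mechanical enough to sketch in a few lines of Python. The function name `normalize_persian` is hypothetical, and treating every Unicode combining mark (category Mn) as a diacritic to be removed is an assumption of this sketch; the affix and standard-spelling corrections need lexical knowledge and are not covered here.

```python
import unicodedata

# Arabic encodings of 'ye' (U+0649, U+064A) mapped to Persian 'ye' (U+06CC).
YE_VARIANTS = {"\u0649": "\u06CC", "\u064A": "\u06CC"}


def normalize_persian(text):
    """Unify 'ye' encodings and strip combining diacritics."""
    out = []
    for ch in text:
        ch = YE_VARIANTS.get(ch, ch)
        # Assumption: all combining marks (category Mn) count as diacritics.
        # This includes the gemination marker (shadda, U+0651).
        if unicodedata.category(ch) == "Mn":
            continue
        out.append(ch)
    return "".join(out)
```

The zero-width non-joiner (U+200C) belongs to category Cf, so it passes through unchanged and the affix separation described in the first correction is preserved.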
The multiplicity of the POS tags in the corpus was the next problem to be solved. As mentioned earlier, the tag set includes 882 POS tags. While many of them contain detailed information about the words, they are rarely used in the corpus. This results in many different tags for verbs, adjectives, nouns etc. As a solution, we decreased the number of POS tags by clustering them manually according to their syntactical similarity. In addition, for rare and syntactically insignificant POS tags, we used the IGNORE tag. A NULL tag was defined to mark the beginning of a sentence. These modifications reduced the size of the tag set to 166. Finally, the following statistics were extracted from the corpus to build the LMs [18, 19]: unigram statistics of words (the 20,000 most frequent words in the corpus were chosen as the vocabulary set); bigram statistics of words; trigram statistics of words; unigram statistics of POS tags (for 166 tags); bigram statistics of POS tags; trigram statistics of POS tags; and the number of times each POS tag is assigned to each word in the corpus (lexical generation statistics). After extracting the word-based n-gram statistics, the back-off trigram language model was built using the Katz smoothing method.
In addition to the word-based and POS-based bigram and trigram models, class-based language models can optionally be used. Class-based language modeling can tackle the sparseness of data in the corpus. In this approach, words are grouped into classes and each word is assigned to one or more classes. To determine the word classes, one can use automatic word clustering methods such as Brown's and Martin's algorithms [46, 47]. In these clustering methods, certain information theory criteria, such as average mutual information, are used to form the classes. In Nevisa, the basic idea of Martin's algorithm is used for word clustering. In this algorithm, the words are clustered initially and then moved between classes iteratively in the direction of perplexity improvement. Although POS-based and class-based n-grams reduce the sparseness of the extracted bigram and trigram models, in many cases the probabilities remain zero or close to zero. To overcome this problem, various smoothing methods such as add-one, Katz and Witten-Bell smoothing were evaluated on the POS-based and class-based n-gram probabilities.
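To make the effect of smoothing concrete, the following Python sketch implements add-one (Laplace) smoothing for a bigram model, the simplest of the methods named above. The tokens may be words, POS tags or class labels. The function name and interface are assumptions made for this example; the Katz and Witten-Bell methods are more elaborate and are not shown.

```python
from collections import Counter


def add_one_bigram(tokens, vocab):
    """Return a function giving add-one smoothed bigram probabilities.

    tokens: training sequence (words, POS tags or class labels).
    vocab: set of all symbols the model may be asked about.
    """
    history_counts = Counter(tokens[:-1])          # counts of each history
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    v = len(vocab)

    def prob(prev, word):
        # Every bigram gets one pseudo-count, so no probability is zero.
        return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + v)

    return prob
```

For any history, the smoothed probabilities over the vocabulary sum to one, and an unseen history yields a uniform distribution.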
In addition, the language model score for class-based bigram and trigram language models can be computed. As shown in Figure 1 by the dotted line, the statistical LM can be applied to the system at the end of the search by the n-best re-scorer.
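A common textbook formulation of the class-based score, given here as an assumption and not necessarily the exact form used in Nevisa, writes c_i for the class of word w_i and factors each n-gram probability into a class transition term and a word emission term:

```latex
P(w_i \mid w_{i-1}) \approx P(w_i \mid c_i)\, P(c_i \mid c_{i-1}),
\qquad
P(w_i \mid w_{i-2}, w_{i-1}) \approx P(w_i \mid c_i)\, P(c_i \mid c_{i-2}, c_{i-1})
```

This form assumes each word belongs to exactly one class. When a word may belong to several classes, as the text allows, the right-hand side would have to be summed over the possible class assignments.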
3.3.2 Grammatical language models
Grammar is a formal specification of the permissible structures of the language and is used as another important linguistic knowledge source besides the statistical language models in speech recognition systems. In Nevisa, as in most developed speech recognition systems, the output is a set of n-best hypotheses that are ordered based on their acoustic and language model scores. The output sentences do not necessarily have a correct syntactic structure. To produce high-scoring syntactic outputs, a grammatical model of the language and a syntactic parser are necessary. The grammatical model includes a set of rules and syntactic features for each word in the vocabulary. The rule set describes the syntactic structures of permissible sentences in the language. The syntactic parser analyzes the output hypotheses of the recognition system and rejects the non-grammatical hypotheses.
In this rule, N1- (a noun with possibly an adjective) must have EzafeC enclitic (GEN +) and non-pronoun (PRO -) head. N2 points to a complete Noun phrase (a noun with pre-modifiers and post-modifiers). It means that a complete Noun phrase can play the role of genitive for Noun. In addition, this rule shows that the other post-modifiers of noun (P2 and S) can be combined optionally. P2 points to the prepositional phrase and S[COMP +] points to the complement sentence (relative clause). The feature COMP with + value indicates that the sentence must have Persian complementizer "ke" (that, which). Similar to this rule, we write other rules for describing various syntactic structures of Persian. Furthermore, a 1,000-word vocabulary with syntactic features was annotated.
Analyzing a sentence and checking the compatibility of its structure with the grammar requires a parsing technique. A parsing algorithm offers a procedure that searches through various ways of combining grammatical rules to find a combination that generates a tree illustrating the structure of the input sentence. This is similar to the search problem in speech recognition. A top-down chart parser is incorporated in Nevisa. The grammatical language model integration in Nevisa is done in a loosely-coupled manner, as shown in Figure 1, at the end of the search process. The parser takes the n-best list from the word decoder, analyzes each sentence according to the grammatical rules and accepts the grammatically correct sentences as the output of the system.
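The loosely-coupled integration described above amounts to a post-processing filter over the n-best list. The Python sketch below illustrates this, with the parser represented by a caller-supplied predicate. The function name, the predicate interface and the fallback to the top hypothesis when every sentence is rejected are all assumptions of this example; the source does not state what Nevisa does in that case.

```python
def filter_nbest(nbest, is_grammatical):
    """Keep the grammatical hypotheses of a best-first n-best list.

    nbest: list of (sentence, score) pairs, already sorted best-first.
    is_grammatical: callable standing in for the syntactic parser.
    """
    accepted = [(s, score) for s, score in nbest if is_grammatical(s)]
    # Assumption: if the parser rejects everything, fall back to the
    # top-scoring hypothesis instead of returning nothing.
    return accepted if accepted else nbest[:1]
```

Because the parser only sees complete hypotheses, it cannot recover a correct sentence that was pruned during the search, which is the usual trade-off of loose coupling.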
4 Experiments and results
4.1 System parameters
In the acoustic front-end, the speech signal is blocked into 20 ms frames with 12 ms overlap when sampled at 22,050 Hz, and into 25 ms frames with 15 ms overlap when sampled at 16 kHz. A pre-emphasis filter with a factor of 0.97 is applied to each frame of speech. A Hamming window is also applied to the signal in order to reduce the effect of frame edge discontinuities. After performing the fast Fourier transform (FFT), the magnitude spectrum is warped according to the signal's warping factor if the VTLN option is used. The resulting spectral magnitude values are weighted and summed up using the coefficients of 40 triangular filters arranged on the Mel-frequency scale. The filter output is the logarithm of the sum of the weighted spectral magnitudes. The discrete cosine transform (DCT) is then applied, resulting in 13 cepstral coefficients. The first and second derivatives of the cepstral coefficients are calculated using the linear regression method over a window covering seven neighboring cepstrum vectors. This makes up vectors of 39 coefficients per speech frame. Finally, PCA and/or CMS are applied when these options are activated.
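The derivative step can be sketched as follows. A window of seven vectors corresponds to three frames on each side of the current one, and the widely used regression formula is d_t = sum_{k=1..3} k (c_{t+k} - c_{t-k}) / (2 sum_{k=1..3} k^2). In the Python sketch below, the function names and the edge handling (repeating the first and last frame) are assumptions, since the source does not specify them. Applying the regression twice and stacking the results turns 13 static coefficients into the 39-dimensional vectors mentioned above.

```python
def delta(frames, n=3):
    """Linear-regression derivatives over a window of 2*n + 1 frames."""
    num_frames = len(frames)
    dim = len(frames[0]) if frames else 0
    denom = 2 * sum(k * k for k in range(1, n + 1))
    out = []
    for t in range(num_frames):
        vec = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, n + 1):
                # Assumption: edges are padded by repeating the end frames.
                nxt = frames[min(t + k, num_frames - 1)][d]
                prv = frames[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            vec.append(acc / denom)
        out.append(vec)
    return out


def add_deltas(frames, n=3):
    """Append first and second derivatives to each static vector."""
    d1 = delta(frames, n)
    d2 = delta(d1, n)
    return [c + a + b for c, a, b in zip(frames, d1, d2)]
```

For the 16 kHz case, a 25 ms frame is 400 samples and the 15 ms overlap leaves a 10 ms (160-sample) frame shift, so one such 39-dimensional vector is produced every 10 ms.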
Nevisa uses phone (context-independent) and triphone (context-dependent) HMM modeling. All HMMs are left-to-right; forward, skip and self-loop transitions are allowed. The elements of the feature vectors are assumed uncorrelated, resulting in diagonal covariance matrices. The parameters are initialized using linear segmentation and then the segmental k-means re-estimation algorithm finalizes the parameters after ten iterations. The beam width in the decoding process is 70 and the stack size is 300.
4.2 Results of language model incorporation
In this section, the evaluation results of incorporating language models into the Nevisa system are reported. An intermediate version of Nevisa is used in the experiments of this section. The system is trained on 29 Persian phonemes with silence as the 30th phoneme. All HMMs are left-to-right and composed of six states and 16 Gaussian mixture components per state. The vocabulary size is about 1,000 words and the first edition of the text corpus is used for building the statistical language models. In these evaluations, sFarsdat train and sFarsdat test are used as train and test sets, respectively. Two different criteria were used to evaluate the efficiency of the language model variants: the perplexity and the word error rate (WER) of the system.
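Both evaluation criteria have standard definitions, which can be sketched in Python as follows. WER is the word-level edit distance (substitutions, deletions and insertions) between the reference and the recognized sentence divided by the number of reference words, and perplexity is the exponential of the negative average log-probability that the language model assigns to the test words. The function names are hypothetical, and the sketch assumes whitespace-tokenized, non-empty inputs.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the processed part of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def perplexity(log_probs):
    """log_probs: natural-log probability of each test word under the LM."""
    return math.exp(-sum(log_probs) / len(log_probs))
```

Perplexity measures the language model alone, while WER measures the complete recognizer, so a model with lower perplexity does not always yield a lower WER.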
Performance of Nevisa in clean condition (word level)
The effect of cutoffs on the size and perplexity of a back-off trigram language model
4.3 Results for robustness techniques
The recognition system described in Sect. 4.2 is used to provide the results for this section. Here, sFarsdat train is used to train phone models with six states per model and 16 Gaussian mixture components per state. The vocabulary contains about 1,000 words and the word-based trigram language model is used. The evaluation test sets of the FANOS database are used in these experiments.
Evaluation of Nevisa and the robustness methods on FANOS noisy tasks (WER% on word level)
4.4 Final results
WER% of Nevisa on small and large Farsdat using context-independent (phone) and context-dependent (triphone) modeling
As shown in Table 8, the performance of the system is generally lower on sFarsdat test than on gFarsdat test. This is due to the language model mismatch between the sentences of sFarsdat test and the text corpus. As indicated in section 4.1, the sentences of small Farsdat were designed artificially to cover Persian acoustic variations, and their language model is not compatible with that of regular Persian texts such as the Peykare corpus. Training the triphone models with small Farsdat yields a higher WER than training with large Farsdat because small Farsdat does not contain enough training data for context-dependent modeling. Because of the small size of sFarsdat train, the number of final tied states is reduced to 500. Furthermore, the acoustic mismatch between training and test conditions (training with sFarsdat train and testing with gFarsdat test, or vice versa) further increases the WER. The best performance of the system was obtained with context-dependent modeling using the large Farsdat database.
5 Summary and conclusion
The Nevisa system was introduced as the first large vocabulary speaker-independent continuous speech recognition system for the Persian language. Conventional and customized techniques were incorporated into the different modules of the system. For each module, the necessary modifications and parameter optimizations were performed. The parameter set for each part of the system was found by separately evaluating the performance of that part with different parameter values. The system was developed through academic and industrial teamwork and was intended to be an exploitable product. Therefore, the problems of noisy environments and speaker variations had to be handled. Various robustness techniques were tried and optimized for this purpose. We also customized and utilized statistical and grammatical language models for the Persian language. The general n-gram statistics of Persian were extracted and incorporated for the first time. Our evaluation results and real environmental tests show that the system performs well enough to be used by typical users.
We are now continuing our research for improved versions of Nevisa. We are using context-dependent acoustic phone units (e.g. triphones), increasing the vocabulary size and improving our language models for this purpose. We are also working on specific language models for medical, legal, banking and office automation applications.
a The binary features are the features that take only two possible values.
b The atomic features are the features that take more than two possible values.
c Ezafe is a short vowel that marks genitive constructions in Persian.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.