Using SVM as Back-End Classifier for Language Identification
© Hongbin Suo et al. 2008
Received: 31 January 2008
Accepted: 29 September 2008
Published: 10 November 2008
Robust automatic language identification (LID) is the task of identifying the language of a short utterance spoken by an unknown speaker. One of the mainstream approaches, parallel phone recognition language modeling (PPRLM), has achieved very good performance. The log-likelihood ratio (LLR) algorithm has recently been proposed to normalize the posterior probabilities output by the back-end classifiers in PPRLM systems. In this paper, a support vector machine (SVM) with a radial basis function (RBF) kernel is adopted as the back-end classifier. Because the output of a conventional SVM classifier is not a probability, we use a pair-wise posterior probability estimation (PPPE) algorithm to calibrate the output of each classifier. The proposed approaches are evaluated on the 2005 National Institute of Standards and Technology (NIST) language recognition evaluation databases, and experiments show that the systems described in this paper produce results comparable to the existing state of the art.
Automatic spoken language identification without deep knowledge of the languages involved is a challenging task. The variability of a spoken utterance arises from its content, its speaker, and its environment. Normally, the training corpus and test corpus consist of unconstrained utterances from different speakers. Therefore, the core issue is how to extract the differences between languages regardless of content, speaker, and environment information [1, 2]. The cues that humans use to identify languages are studied in [3, 4]. The sources of information used to discriminate one language from the others include phonetics, phonology, morphology, syntax, and prosody. At present, the most successful approach to LID uses phone recognizers of several languages in parallel. The analysis in  indicates that performance improves considerably as the number of front-end phone recognizers increases. More recently, sets of phone recognizers have been used to transcribe the input speech into phoneme lattices [5, 6], which are then scored by n-gram language models.
Each spoken utterance is converted into a score vector whose components represent the statistics of the acoustic units. The vector space modeling approach has been successfully applied to spoken language identification. Results from an anchor GMM system show that it achieves robust speaker-independent language identification by compensating for intralanguage and interspeaker variability. However, the identity of a target language is not sufficiently described by the score vectors generated by the language models in conventional PPRLM systems. To compensate for this insufficiency, a natural extension is to use multiple groups of similar speakers within one language to build multiple target phonotactic language models. For example, the training data set for language modeling can be divided by gender. In our proposed framework, the hierarchical clustering (HC) algorithm and the K-means clustering algorithm are used together to extract more information from the available training data. Here, the generalized likelihood ratio (GLR) distance defined in  is chosen as the pair-wise distance between two clusters.
In the PPRLM framework, back-end discriminative SVM classifiers are adopted to identify the spoken language. The SVM classifier has demonstrated superior performance over the generative language modeling framework in [7, 11, 12]. As a discriminative tool, the SVM maps the input cepstral feature vector into a high-dimensional space and then separates the classes with a maximum-margin hyperplane. In addition to its discriminative nature, its training criterion balances the reduction of errors on the training data against generalization to unseen data. This makes it perform well on small datasets and suits it to high-dimensional problems. In this paper, a back-end radial basis function (RBF) kernel SVM classifier is used to discriminate target languages based on the probability distribution in the discriminative vector space of language characterization scores. The RBF kernel is chosen because it provides a nonlinear mapping and requires relatively few parameters to tune. Furthermore, the linear kernel is a special case of the RBF kernel, and the sigmoid kernel behaves like the RBF kernel for certain parameters. Note that the training data of this back-end SVM classifier comes from development data rather than from the data used for training the n-gram language models, and cross-validation is employed to select kernel parameters and prevent over-fitting. For testing, once the discriminative language characterization score vectors of a test utterance are generated, the back-end SVM classifier can estimate the posterior probability of each target language, which is used to calibrate the final outputs. As mentioned above, the pair-wise posterior probability estimation (PPPE) algorithm is used to calibrate the output of each classifier. In fact, the multiclass classification problem refers to assigning each observation to one of k classes.
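As a minimal sketch of the kernel discussed above, the RBF kernel K(x, y) = exp(-gamma * ||x - y||^2) can be computed directly on score vectors. The function names `rbf_kernel` and `gram_matrix`, and the idea of passing `gamma` explicitly, are illustrative assumptions; the paper does not list its kernel parameter values, which it selects by cross-validation.

```python
import math


def rbf_kernel(x, y, gamma):
    """Return exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(vectors, gamma):
    """Return the symmetric kernel matrix over a list of score vectors."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]
```

The single width parameter `gamma`, together with the SVM penalty term, is what cross-validation on the development data would have to tune, which is the small parameter count the text refers to.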
As two-class problems are much easier to solve, many authors propose using two-class classifiers for multiclass classification. The PPPE algorithm is a popular multiclass classification method that combines the comparisons for every pair of classes. Furthermore, it focuses on techniques that provide a multiclass probability estimate by combining all pair-wise comparisons [15, 16].
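The idea of combining pair-wise comparisons into a multiclass probability estimate can be sketched with one of the simpler coupling rules from the literature, the rule of Price et al. [15]: p_i = 1 / (sum over j != i of 1 / r_ij - (k - 2)), where r_ij estimates P(class i | class i or j, x). This is an illustration only. It is not necessarily the rule used in this paper; LIBSVM's probability outputs follow the method of Wu et al. [14]. The function names below are hypothetical.

```python
def couple_pairwise(r):
    """Combine pairwise estimates r[i][j] ~ P(i | i or j, x) into k posteriors.

    Uses the rule of Price et al. and then renormalises so the
    posteriors sum to one. Diagonal entries of r are ignored.
    """
    k = len(r)
    raw = []
    for i in range(k):
        inv_sum = sum(1.0 / r[i][j] for j in range(k) if j != i)
        raw.append(1.0 / (inv_sum - (k - 2)))
    total = sum(raw)
    return [p / total for p in raw]


def pairwise_from_posteriors(p):
    """Build consistent pairwise probabilities from known posteriors (demo only)."""
    k = len(p)
    return [[0.5 if i == j else p[i] / (p[i] + p[j]) for j in range(k)]
            for i in range(k)]
```

When the pairwise estimates are mutually consistent, as in `pairwise_from_posteriors`, the rule recovers the underlying posteriors exactly; with real classifier outputs the estimates are only approximately consistent, which is why the final renormalisation is applied.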
The remainder of this paper is organized as follows. The proposed PPRLM LID framework is sketched in Section 2. In Section 3, the three proposed basic classifiers are described, together with a score calibration method and a probability estimation algorithm. In Section 4, the speech corpora used for this study are introduced. Experiments and results of the proposed method are given in Section 5. Finally, some conclusions are given in Section 6.
2. The PPRLM LID Framework
In feature extraction, speech data is parameterized using 25-millisecond frames with a 15-millisecond overlap between contiguous frames. For each frame, a feature vector with 39 dimensions is calculated as follows: 13 Mel Frequency Perceptual Linear Predictive (MFPLP) [19, 20] coefficients, 13 delta cepstral coefficients, and 13 double delta cepstral coefficients. All the feature vectors are processed by the cepstral mean subtraction (CMS) method.
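The framing arithmetic and the CMS step above can be sketched as follows. The sketch assumes that a 25 ms window with 15 ms overlap means a 10 ms frame shift, that the sampling rate is the 8 kHz stated for the corpora, and that a trailing partial frame is dropped rather than padded. The function names are hypothetical, and the MFPLP analysis itself is not reproduced.

```python
def frame_count(num_samples, sample_rate_hz=8000, window_ms=25, overlap_ms=15):
    """Number of full analysis frames that fit in a signal of num_samples."""
    window = sample_rate_hz * window_ms // 1000                # 200 samples at 8 kHz
    shift = sample_rate_hz * (window_ms - overlap_ms) // 1000  # 80 samples at 8 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean over the utterance from every frame."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]
```

Under these assumptions, a 30-second segment sampled at 8 kHz (240000 samples) yields `frame_count(240000)` = 2998 frames, each of which would carry the 39-dimensional vector described above.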
3. The Back-End Classifier
3.1. Gaussian Mixture Model
The back-end procedure takes discriminative language characterization scores from all available classifiers and maps them to final target language posterior probabilities. Diagonal covariance Gaussian models, which are used as the back-end classifiers, are trained from the development data. However, these models describe the distribution of high-dimensional features poorly, so linear discriminant analysis (LDA) is usually applied first to reduce the dimensionality. As a last step in the back-end procedure, the score vectors are converted to log-likelihood ratios.
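The final conversion to log-likelihood ratios can be illustrated with one commonly used normalization, in which each language's log-likelihood is compared with the log of the average likelihood of the competing languages. The exact formula used by the authors is not given in this text, so the form below and the name `log_likelihood_ratios` are assumptions.

```python
import math


def log_likelihood_ratios(log_likes):
    """Convert per-language log-likelihoods into log-likelihood ratios.

    LLR_i = log p(x | L_i) - log( mean over j != i of p(x | L_j) ),
    computed with the log-sum-exp trick for numerical stability.
    """
    if len(log_likes) < 2:
        raise ValueError("need at least two languages")
    llrs = []
    for i, target in enumerate(log_likes):
        others = [v for j, v in enumerate(log_likes) if j != i]
        m = max(others)
        log_mean_others = m + math.log(
            sum(math.exp(v - m) for v in others) / len(others))
        llrs.append(target - log_mean_others)
    return llrs
```

A positive ratio indicates that the hypothesized language is more likely than the average competitor, which makes a single detection threshold meaningful across languages.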
3.2. Feed-Forward Neural Network
For feed-forward multilayer neural network training, many algorithms are based on gradient descent, such as back propagation (BP). However, these algorithms usually have a poor convergence rate, because gradient descent methods use a linear function to approximate the objective function. Conjugate gradient (CG), as a second-derivative optimization method, has a better convergence rate than BP.
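The difference in convergence rate can be illustrated on a small quadratic objective 0.5 x^T A x - b^T x, where linear conjugate gradient reaches the minimum in at most n steps for an n-dimensional problem while steepest descent with exact line search does not. This is only a sketch: a neural network objective is not quadratic and would need a nonlinear CG variant, and the matrix, vectors, and function names below are illustrative assumptions rather than anything from the paper.

```python
def _matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]


def _dot(u, v):
    return sum(a * c for a, c in zip(u, v))


def conjugate_gradient(A, b, x0, iterations):
    """Linear CG for symmetric positive-definite A."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, _matvec(A, x))]  # negative gradient
    d = list(r)
    for _ in range(iterations):
        rr = _dot(r, r)
        if rr == 0.0:
            break
        Ad = _matvec(A, d)
        alpha = rr / _dot(d, Ad)                        # exact line search
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        beta = _dot(r, r) / rr
        d = [ri + beta * di for ri, di in zip(r, d)]    # new conjugate direction
    return x


def steepest_descent(A, b, x0, iterations):
    """Gradient descent with exact line search, for comparison."""
    x = list(x0)
    for _ in range(iterations):
        r = [bi - ai for bi, ai in zip(b, _matvec(A, x))]
        rr = _dot(r, r)
        if rr == 0.0:
            break
        alpha = rr / _dot(r, _matvec(A, r))
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    return x
```

For example, with `A = [[4.0, 1.0], [1.0, 3.0]]`, `b = [1.0, 2.0]`, and start point `[2.0, 1.0]`, two CG iterations land on the minimizer (1/11, 7/11) up to rounding, whereas two steepest-descent iterations are still some distance away.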
3.3. RBF Support Vector Machine
3.4. Score Calibration
However, the output scores of the back-end RBF SVM are not log-likelihood values; thus, linear discriminant analysis (LDA) and diagonal covariance Gaussian models are used to calculate the log-likelihoods for each target language, and an improvement in detection performance has been achieved.
Therefore, the estimated posterior probabilities are applicable to performance evaluation. The probability tools of LIBSVM are used in our approach. Experiments in Section 5 show that this multiclass pair-wise posterior probability estimation algorithm is superior to the commonly used log-likelihood ratio normalization method.
4. Speech Corpus
For the phone recognizers, the Oregon Graduate Institute Multi-Language Telephone Speech (OGI-TS) Corpus is used. It contains 90 speech messages in each of the following 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. Each message is spoken by a unique speaker and comprises responses to 10 prompts. Phonetically transcribed training data is available for six of the OGI languages (English, German, Hindi, Japanese, Mandarin, and Spanish). In addition, the labeled Hong Kong University of Science and Technology (HKUST) Mandarin Telephone Speech Part 1 is used to accurately train an acoustic model for another Mandarin phone recognizer. A telephone speech database in common use for back-end language modeling is the Linguistic Data Consortium's CallFriend corpus. The corpus comprises two-speaker, unprompted, conversational speech messages between friends. One hundred North American long-distance telephone conversations are recorded in each of twelve languages (the same 11 languages as OGI-TS plus Arabic). The corpus is divided into three sets (training, development, and test); each set consists of 20 two-sided conversations from each language, each approximately 30 minutes long.
In this paper, experiments are performed on the 2005 NIST LRE 30 s test set. Compared with the previous evaluation, the number of test utterances has increased substantially. Martin has summarized the numbers of utterances in each language from the primary evaluation data used in this task. Note that in addition to the seven target languages, NIST also collected some conversations in German that are used as evaluation test utterances, though the trials involving these are not considered part of the primary evaluation condition. Moreover, development data, which can be used to tune the parameters of the back-end classifiers, is obtained from the 2003 NIST LRE evaluation sets. Thus, the data comprises 80 development segments for each of the 7 target languages as given in . All of the training, development, and evaluation data is in standard 8-bit 8 kHz mu-law format from digital telephone channels.
5. Experiments and Results
The performance of a detection system is characterized by its miss and false alarm probabilities. The primary evaluation metric follows the 2005 NIST language recognition evaluation. The task of this evaluation is to detect the presence of a hypothesized target language, given a segment of conversational speech over the telephone. Results are reported as equal error rates (EER), where the EER is the operating point at which the miss probability and the false alarm probability are equal. The experiments are described in the following sections.
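The EER definition above can be sketched as a threshold sweep over detection scores. This is a discrete approximation that picks the threshold where miss and false alarm rates are closest and averages them; evaluation tools may interpolate along the detection error tradeoff curve instead. The function name and the toy scores are illustrative assumptions.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER from scores of target and non-target trials.

    A trial is accepted when its score is >= the threshold, so
    P_miss counts rejected targets and P_fa counts accepted non-targets.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2.0
    return eer
```

For instance, with target scores `[2.0, 3.0, 4.0, 0.5]` and non-target scores `[0.0, 1.0, 1.5, 2.5]`, a threshold of 2.0 misses one target out of four and accepts one non-target out of four, giving an EER of 0.25.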
5.1. Performance of Proposed Systems
In the PRLM system, a Mandarin phone recognizer is built from the HKUST telephone data. There are 68 mono-phones, and a three-state left-to-right hidden Markov model (HMM) is used for each tri-phone in each language, so the acoustic model can be described in more detail. The PPRLM system, in contrast, is mainly composed of six phone recognizers. The acoustic model for each phone recognizer is initialized on the OGI-TS corpus and retrained on the CallFriend training set. Since the amount of labeled data is limited, the mono-phone is chosen as the acoustic modeling unit. The outputs of all recognizers are phone sequences, which are used to build the subsequent 3-gram phone language models. For classification, the phonotactic scores consist only of the discriminative language characterization score vectors (DLCSV).
PRLM system results on the 2005 NIST 30-second tasks.
The experimental results of the phone recognition systems show that the discriminative score vector modeling method improves system performance in most cases. As mentioned above, the main reason is that multiple discriminative classifiers based on hierarchically clustered speaker groups are employed to map the speech utterance into the discriminative language characterization score vector space, which not only represents enhanced language information but also compensates for intralanguage and interspeaker variability. Moreover, by using back-end classifiers, this speaker-group-specific variability can be compensated sufficiently, making the system less speaker dependent. Furthermore, as shown in Tables 1 and 2, the proposed SVM classifier with the PPPE method adopted in the improved systems is comparable to the other classifiers. Because the output scores of back-end classifiers are not real log-likelihood values, this alternative language score calibration method performs better.
5.2. Computational Cost
The computational cost of back-end classifiers, in real time (RT), for PPRLM systems 1, 2, 3, and 5.
6. Conclusions
In this paper, we have presented our basic PPRLM system and three classifiers for processing the high-level score features. The progressive use of speaker groups' training data for building 3-gram language models is exploited to map a spoken utterance into the discriminative language characterization score vector space efficiently. The proposed method enhances language information and compensates for the disturbances caused by intralanguage and interspeaker variability. Among the back-end classification algorithms compared, the discriminative SVM classifier with pair-wise posterior probability estimation achieves the largest performance improvement. Furthermore, the log-likelihood normalization method is adopted to further improve the performance of the language identification task.
This work is partially supported by the Ministry of Science and Technology of the People's Republic of China (973 Program, 2004CB318106), the National Natural Science Foundation of China (10574140, 60535030), and the National High Technology Research and Development Program of China (863 Program, 2006AA010102, 2006AA01Z195).
- Li K-P: Automatic language identification using syllabic spectral features. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '94), April 1994, Adelaide, Australia 1: 297-300.
- Nagarajan T, Murthy HA: Language identification using acoustic log-likelihoods of syllable-like units. Speech Communication 2006, 48(8): 913-926. doi:10.1016/j.specom.2005.12.003
- Muthusamy YK, Jain N, Cole RA: Perceptual benchmarks for automatic language identification. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '94), April 1994, Adelaide, Australia 1: 333-336.
- Zissman MA: Comparison of four approaches to automatic language identification of telephone speech. IEEE Transactions on Speech and Audio Processing 1996, 4(1): 31-44. doi:10.1109/TSA.1996.481450
- Gauvain JL, Messaoudi A, Schwenk H: Language recognition using phone lattices. Proceedings of the International Conference on Spoken Language Processing (ICSLP '04), October 2004, Jeju Island, South Korea 1283-1286.
- Shen W, Campbell W, Gleason T, Reynolds D, Singer E: Experiments with lattice-based PPRLM language identification. Proceedings of IEEE Odyssey on Speaker and Language Recognition Workshop, June 2006, San Juan, Puerto Rico 1-6.
- Li H, Ma B, Lee C-H: A vector space modeling approach to spoken language identification. IEEE Transactions on Audio, Speech, and Language Processing 2006, 15(1): 271-284.
- Noor E, Aronowitz H: Efficient language identification using anchor models and support vector machines. Proceedings of IEEE Odyssey on Speaker and Language Recognition Workshop, June 2006, San Juan, Puerto Rico 1-6.
- Jin H, Kubala F, Schwartz R: Automatic speaker clustering. Proceedings of the DARPA Speech Recognition Workshop, February 1997, Chantilly, Va, USA 108-111.
- Gish H, Siu M-H, Rohlicek R: Segregation of speakers for speech recognition and speaker identification. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '91), May 1991, Toronto, Canada 2: 873-876.
- White C, Shafran I, Gauvain J-L: Discriminative classifiers for language recognition. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '06), May 2006, Toulouse, France 1: 213-216.
- Zhai L-F, Siu M-H, Yang X, Gish H: Discriminatively trained language models using support vector machines for language identification. Proceedings of IEEE Odyssey on Speaker and Language Recognition Workshop, June 2006, San Juan, Puerto Rico 1-6.
- Chang C-C, Lin C-J: LIBSVM: a library for support vector machines. 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm
- Wu T-F, Lin C-J, Weng RC: Probability estimates for multi-class classification by pairwise coupling. The Journal of Machine Learning Research 2004, 5: 975-1005.
- Price D, Knerr S, Personnaz L, Dreyfus G: Pairwise neural network classifiers with probabilistic outputs. In Neural Information Processing Systems. Volume 7. MIT Press, Cambridge, Mass, USA; 1995: 1109-1116.
- Refregier P, Vallet F: Probabilistic approach for multiclass classification with neural networks. Proceedings of International Conference on Artificial Networks, June 1991, Espoo, Finland 1003-1007.
- Yan Y, Barnard E: An approach to automatic language identification based on language-dependent phone recognition. Proceedings of the 20th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '95), May 1995, Detroit, Mich, USA 5: 3511-3514.
- Barnard E, Yan Y: Toward new language adaptation for language identification. Speech Communication 1997, 21(4): 245-254. doi:10.1016/S0167-6393(97)00009-5
- Hermansky H: Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America 1990, 87(4): 1738-1752. doi:10.1121/1.399423
- Zolnay A, Schlüter R, Ney H: Acoustic feature combination for robust speech recognition. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), March 2005, Philadelphia, Pa, USA 1: 457-460.
- Campbell WM, Campbell JP, Reynolds DA, Singer E, Torres-Carrasquillo PA: Support vector machines for speaker and language recognition. Computer Speech & Language 2006, 20(2-3): 210-229. doi:10.1016/j.csl.2005.06.003
- Barnard E, Cole RA: A neural-net training program based on conjugate gradient optimization. Department of Computer Science, Oregon Graduate Institute of Science and Technology, Portland, Ore, USA; 1989.
- Brümmer N, van Leeuwen DA: On calibration of language recognition scores. Proceedings of IEEE Odyssey on Speaker and Language Recognition Workshop, June 2006, San Juan, Puerto Rico 1-8.
- Singer E, Torres-Carrasquillo PA, Gleason TP, Campbell WM, Reynolds DA: Acoustic, phonetic and discriminative approaches to automatic language recognition. Proceedings of the European Conference on Speech Communication Technology (Eurospeech '03), September 2003, Geneva, Switzerland 1345-1348.
- Muthusamy YK, Cole RA, Oshika BT: The OGI multilanguage telephone speech corpus. Proceedings of the International Conference on Spoken Language Processing (ICSLP '92), October 1992, Banff, Canada 895-898.
- Martin AF, Le AN: The current state of language recognition: NIST 2005 evaluation results. Proceedings of IEEE Odyssey on Speaker and Language Recognition Workshop, June 2006, San Juan, Puerto Rico 1-6.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.