Homogenous ensemble phonotactic language recognition based on SVM supervector reconstruction
EURASIP Journal on Audio, Speech, and Music Processing volume 2014, Article number: 42 (2014)
Abstract
Currently, acoustic spoken language recognition (SLR) and phonotactic SLR systems are widely used language recognition systems. To achieve better performance, researchers combine multiple subsystems, and the combined results are often much better than those of a single SLR system. Phonotactic SLR subsystems may vary in their acoustic feature vectors or include multiple language-specific phone recognizers and different acoustic models. These methods achieve good performance but usually at a high computational cost. In this paper, a new diversification for phonotactic language recognition systems is proposed using vector space models by support vector machine (SVM) supervector reconstruction (SSR). In this architecture, the subsystems share the same feature extraction, decoding, and N-gram counting preprocessing steps, but model in different vector spaces by using the SSR algorithm without significant additional computation. We term this a homogeneous ensemble phonotactic language recognition (HEPLR) system. The system integrates three different SVM supervector reconstruction algorithms: relative, functional, and perturbational SVM supervector reconstruction. All of the algorithms are incorporated using a linear discriminant analysis-maximum mutual information (LDA-MMI) backend to improve language recognition evaluation (LRE) accuracy. Evaluated on the National Institute of Standards and Technology (NIST) LRE 2009 task, the proposed HEPLR system achieves better performance than a baseline phone recognition-vector space modeling (PR-VSM) system with minimal extra computational cost. The HEPLR system yields 1.39%, 3.63%, and 14.79% equal error rate (EER), representing 6.06%, 10.15%, and 10.53% relative improvements over the baseline system, for the 30-, 10-, and 3-s test conditions, respectively.
1 Introduction
Spoken language recognition (SLR) refers to the task of automatically determining the language identity of a speech utterance. It is estimated that there are about 6,000 spoken languages in the world [1]. An increasing number of multilingual speech processing applications require spoken language recognition as a front-end, with the result that SLR continues to grow in importance. Spoken language recognition is an enabling technology for a wide range of intelligence and security applications for information distillation, such as spoken document retrieval, multilingual speech recognition, and spoken language translation [2].
Language cues can be categorized according to their level of knowledge abstraction as acoustic (spectrum, phone inventory), prosodic (duration, pitch, intonation), phonotactic (sequence of sounds), lexical (vocabulary, morphology), and syntactic (phrases, grammar) [3],[4]. Language recognition systems are usually identified by the features they employ, e.g., acoustic systems, phonotactic systems, prosodic systems, and lexical systems. Currently, acoustic language recognition (LR) systems [5] and phonotactic LR systems [3] are both widely used.
Generally, the performance of SLR systems can be improved in two ways: (1) longitudinally, through the development of new techniques to perform the SLR tasks more precisely, e.g., i-vector [6]-[8], JFA [9], discriminative training [10] methods, N-gram modeling methods [3], and support vector machines (SVMs) [11]; (2) transversely, by adding variety to the SLR subsystems, which extracts and integrates more information from the utterances. State-of-the-art language recognition systems fuse multiple subsystems in parallel via a post-processing backend [12]. In the National Institute of Standards and Technology (NIST) language recognition evaluation (LRE) tasks, teams from all over the world compete to build the best SLR system and have shown that better results can be obtained by combining more subsystems, creating larger and larger SLR systems. In NIST LRE 2011, all submitted language recognition systems were stacked ensembles of at least five language recognition subsystems [13]-[15]. Much effort goes into trying different variations of subsystems. Generally, the phonotactic LR subsystems can be varied in three ways: (a) extracting various acoustic features to provide feature diversification, for example, Mel-Frequency Cepstral Coefficients (MFCC) [16], Perceptual Linear Predictive (PLP) [17], and Temporal Patterns Neural Network (TRAPs/NN) [18]; (b) training phone recognizers on multiple language-specific speech data to provide phonetic diversification [3], e.g., the Russian, Hungarian, Czech, and English phone recognizers developed by Brno University of Technology (BUT) [18] or a universal phone recognizer (UPR) [19]; and (c) training phone recognizers on the same language-specific speech data but using different acoustic models to provide acoustic diversification [20], such as the Artificial Neural Network-Hidden Markov Model (ANN-HMM) [21], Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) [22], and Deep Neural Network-Hidden Markov Model (DNN-HMM) [23].
In such phonotactic language recognition systems, each subsystem must run its own feature extraction, decoding, N-gram counting, and vector space modeling, so the computational cost is about N times that of a single subsystem, where N is the number of subsystems.
This paper demonstrates an architecture that provides a new diversification for phonotactic language recognition systems. The underlying motivation is to provide richer language-identifying information without significant additional computation. The subsystems are varied using SVM supervector reconstruction (SSR) algorithms to provide vector space modeling diversification. In this architecture, the subsystems share the same preprocessing of feature extraction, decoding, and expected counting, but model in different vector spaces, so we call it a homogeneous ensemble phonotactic language recognition (HEPLR) system. The HEPLR subsystems increase the variety of the SVM supervectors, reduce the computational cost compared with conventional multi-subsystem ensembles, and improve the SLR accuracy. There are many possible SVM supervector reconstruction algorithms, such as recurrent neural network (RNN) SVM supervector reconstruction [24]. In this paper, we present three SVM supervector reconstruction algorithms: relative SVM supervector reconstruction [25], functional SVM supervector reconstruction, and perturbational SVM supervector reconstruction.
The remainder of the paper is organized as follows. Section 2 presents the lattice-based phonotactic language recognition system used as a baseline in this paper. Section 3 describes the proposed approaches, including relative, functional, and perturbational SVM supervector reconstruction. Section 4 demonstrates the architecture of the homogeneous ensemble phonotactic language recognition system. The experimental setup to evaluate our proposed method is described in Section 5. Results obtained in language recognition experiments on the NIST LRE 2009 database are presented and discussed in Section 6. Finally, conclusions and future work are outlined in Section 8.
2 Baseline phonotactic SLR system
In this work, the phone recognition-vector space modeling (PR-VSM) [26] phonotactic language recognition system is employed as the baseline system. The motivation behind the phonotactic language recognition approach is the belief that a spoken language can be characterized by the specific probabilities of its lexical-phonological constraints. An N-gram vector space model (VSM), a stochastic model describing the probabilities of phoneme strings, is built for the language recognition task from phone transcriptions. In the PR-VSM system, each N-gram VSM produces a likelihood score for a given utterance through an SVM classifier. The languages used for training the phone recognizers need not be the same as any of those recognized.
The traditional PR-VSM language recognition system works by mapping the input utterances from the data space $\mathcal{X}$ into a high-dimensional feature space $\mathcal{F}$, $\Phi: \mathcal{X} \rightarrow \mathcal{F}$, and then building linear machines in the feature space to find a maximal-margin separation. The vectors built in the high-dimensional feature space are SVM supervectors, which consist of N-gram counts of features representing the phonotactics of an input speech sample.
In PR-VSM systems, an utterance $x$ can be mapped to the high-dimensional feature space as follows:

$$\Phi: x \rightarrow \varphi(x),$$
where $\varphi(x)$ is the SVM supervector computed as

$$\varphi(x) = \left[p(d_1|\ell_x),\; p(d_2|\ell_x),\; \ldots,\; p(d_F|\ell_x)\right],$$
where $\ell_x$ is the lattice produced from data $x$ by a phone recognizer, $d_i = s_i \ldots s_{i+n-1}$ ($n = N$) is the N-gram phoneme string [27], and $F$ is the dimension of the SVM supervector. $p(d_i|\ell_x)$ is the probability of the phoneme sequence $d_i$ in the lattice, which is computed as

$$p(d_i|\ell_x) = \frac{c(d_i|\ell_x)}{\sum_{m=1}^{F} c(d_m|\ell_x)},$$
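where $c(d_i|\ell_x)$ denotes the N-gram occurrence of $d_i$ given the lattice $\ell_x$. This is calculated over all possible hypotheses in the lattice as follows [27], with $\text{count}(\cdot|S)$ the number of occurrences of the N-gram in hypothesis $S$:

$$c(s_i \ldots s_{i+N-1}|\ell) = \sum_{S \in \ell} p(S|\ell)\, \text{count}(s_i \ldots s_{i+N-1}|S) = \sum_{s_i \ldots s_{i+N-1} \in \ell} \alpha(s_i)\, \beta(s_{i+N-1}) \prod_{j=i}^{i+N-1} \xi(s_j),$$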
where $p(S|\ell)$ denotes the probability of the sequence $S$ in the lattice $\ell$; $\alpha(s_i)$ and $\beta(s_{i+N-1})$ are the forward probability of the starting node and the backward probability of the ending node of the N-gram $s_i \ldots s_{i+N-1}$, respectively; and $\xi(s_j)$ is the posterior probability of the edge phoneme $s_j$.
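As a minimal sketch of the normalization step described above, the following Python fragment turns expected N-gram counts into the probabilities that make up the supervector. The two-phone inventory, the bigram order, and the count values are hypothetical and chosen only for illustration; in the actual system, the expected counts would come from the lattice.

```python
from typing import Dict, List


def supervector_from_counts(expected_counts: Dict[str, float],
                            inventory: List[str]) -> List[float]:
    """Normalize expected N-gram counts c(d_i|l_x) into probabilities p(d_i|l_x)."""
    counts = [expected_counts.get(d, 0.0) for d in inventory]
    total = sum(counts)
    if total == 0.0:
        # No N-gram observed at all: return an all-zero supervector.
        return [0.0] * len(inventory)
    return [c / total for c in counts]


# Hypothetical 2-phone inventory {a, b} with bigrams (N = 2), so F = 4.
inventory = ["a a", "a b", "b a", "b b"]
# Hypothetical expected counts; N-grams absent from the lattice are omitted.
counts = {"a a": 1.5, "a b": 0.5, "b a": 2.0}
phi_x = supervector_from_counts(counts, inventory)
print(phi_x)  # [0.375, 0.125, 0.5, 0.0]
```

The supervector has one entry per possible N-gram, so most entries are zero for short utterances.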
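The SVM supervector $\varphi(x)$ is sent to the SVM classifier, and a decision is made based on the most likely hypothesis score. In a PR-VSM system, the decision is made based on the SVM output score, with weights $\alpha_l$ and bias $d$ learned in training:

$$f(\varphi(x)) = \sum_{l} \alpha_l\, K_{\text{TFLLR}}\left(\varphi(x), \varphi(x_l)\right) + d,$$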
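where $\varphi(x_l)$ are support vectors obtained from the training set using the Mercer condition. The term frequency log-likelihood ratio (TFLLR) kernel $K_{\text{TFLLR}}$ is computed as [28]:

$$K_{\text{TFLLR}}\left(\varphi(x), \varphi(x_l)\right) = \sum_{i=1}^{F} \frac{p(d_i|\ell_x)}{\sqrt{p(d_i|\ell_{\text{all}})}} \cdot \frac{p(d_i|\ell_{x_l})}{\sqrt{p(d_i|\ell_{\text{all}})}},$$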
where $p(d_i|\ell_{\text{all}})$ is calculated from the observed probability of $d_i$ across all lattices. In this paper, the training stage is always carried out with a one-versus-rest strategy between the positive set (the samples in the target language) and the negative set (all other samples).
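The TFLLR kernel can be sketched as an inner product between supervectors whose entries are each divided by the square root of the background probability. The code below is an illustrative sketch only: the four-dimensional vectors are hypothetical, and the small `floor` that guards against division by zero is an assumption, not part of the description above.

```python
import math
from typing import List, Sequence


def tfllr_scale(phi: Sequence[float], p_all: Sequence[float],
                floor: float = 1e-12) -> List[float]:
    """Scale each entry by 1/sqrt(p(d_i|l_all)); `floor` is an assumed safeguard."""
    return [p / math.sqrt(max(q, floor)) for p, q in zip(phi, p_all)]


def tfllr_kernel(phi_a: Sequence[float], phi_b: Sequence[float],
                 p_all: Sequence[float]) -> float:
    """Inner product of the two TFLLR-scaled supervectors."""
    a = tfllr_scale(phi_a, p_all)
    b = tfllr_scale(phi_b, p_all)
    return sum(x * y for x, y in zip(a, b))


# Hypothetical background probabilities and two hypothetical supervectors.
p_all = [0.25, 0.25, 0.25, 0.25]
phi_a = [0.5, 0.5, 0.0, 0.0]
phi_b = [0.5, 0.0, 0.5, 0.0]
k_ab = tfllr_kernel(phi_a, phi_b, p_all)
print(k_ab)  # 1.0
```

Because the scaling is applied to each supervector separately, the kernel reduces to a linear kernel on pre-scaled vectors, which is what allows a linear SVM solver to be used.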
3 SVM supervector reconstruction algorithms
The motivation behind SVM supervector reconstruction is to provide vector space modeling diversification to improve the performance of the overall language recognition system. In the language recognition system employed in this paper, we focus on how a change in the input to the SVM affects the output.
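Given an SVM supervector $\varphi(x)$, we define a function $\phi_{\text{SSR}}$ which operates on $\varphi(x)$:

$$\phi_{\text{SSR}}: \varphi(x) \rightarrow \phi_{\text{SSR}}\left(\varphi(x)\right).$$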
We are interested in understanding how ϕ_{SSR}(φ(x)) affects the behavior of the output scores of the SVM. The goal is to define the relationship between φ(x) and ϕ_{SSR}(φ(x)) to enhance the variety of the supervector input.
Selecting SVM supervector reconstruction methods is an open question, so here we propose some typical methods as implementations. In this section, three SVM supervector reconstruction methods are proposed: relative SVM supervector reconstruction, functional SVM supervector reconstruction, and perturbational SVM supervector reconstruction. Relative SVM supervector reconstruction has been presented in [25], while functional and perturbational reconstructions are new methods. Relative reconstruction is a linear reconstruction, while functional and perturbational reconstructions are nonlinear ones.
3.1 Relative SVM supervector reconstruction
The relative SVM supervector method uses a relative feature approach. Relative features, in contrast to absolute features, which represent directly calculable information, are defined by the relationship between an utterance and a set of selected datum utterances. We have presented the concept of relative features in [25].
Calculating relative features requires a relationship measurement, such as distance or similarity. By selecting a proper relationship measurement, the identifiable characteristics can be strengthened and nuisance attributes of the utterance can be discarded. Unlike absolute features, relative features make utterances more convenient to classify by showing the relationship between the utterances and the datum database directly.
Here, we introduce a relative SVM supervector reconstruction defined using the similarity between the utterance SVM supervectors. The widely used kernel methods offer efficient similarity measurements between two SVM supervectors. In this paper, the empirical kernel [29] is introduced into language recognition and a relativized SVM supervector is developed. Kernel methods have been used for face recognition [30] and handwritten digit recognition [31] and achieved higher robustness to noise [30]. Using the SVM supervectors that are already built into a language recognition system, we can easily compose a new relativized SVM supervector with only a small increase in computation.
The architecture of the relative SVM supervector reconstruction subsystem is shown in Figure 1. To construct the SVM supervector relativization map, a database $s = [s_1, s_2, \ldots, s_m]$ containing $m$ utterances is used as the datum mark of similarity. The datum database is stochastically selected from some corpus, whose language need not be the same as the target language. $s$ is mapped into vector space:

$$\varphi(s) = \left[\varphi(s_1),\; \varphi(s_2),\; \ldots,\; \varphi(s_m)\right].$$
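The vector relativizational (VR) kernel between two supervectors $\varphi(x_a)$ and $\varphi(x_b)$ is

$$K_{\text{VR}}\left(\varphi(x_a), \varphi(x_b)\right) = \sum_{i=1}^{F} \frac{p(d_i|\ell_{x_a})\, p(d_i|\ell_{x_b})}{p(d_i|\ell_s)}.$$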
The VR kernel is similar to the TFLLR kernel, but normalized by the observed probability across all lattices of the datum dataset, $p(d_i|\ell_s)$. This kernel reflects the degree of similarity between two supervectors.
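The utterance $x$ is mapped from the input data space $\mathcal{X}$ to a relativized $m$-dimensional Euclidean space $\mathbb{R}^m$, $\Phi_{\text{REL}}: \mathcal{X} \rightarrow \mathbb{R}^m$, as follows:

$$\phi_{\text{REL}}(x) = K_{\text{VR}}\left(\varphi(x), \varphi(s)\right) = \left[K_{\text{VR}}\left(\varphi(x), \varphi(s_1)\right),\; \ldots,\; K_{\text{VR}}\left(\varphi(x), \varphi(s_m)\right)\right].$$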
In general, $K_{\text{VR}}(\varphi(x), \varphi(s))$ defines a space in which each dimension corresponds to the similarity to a prototype. Thus, $K_{\text{VR}}(\cdot, \varphi(s))$ can be viewed as a mapping onto an $m$-dimensional relativized vector space.
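The relativization map can be sketched as follows: an utterance supervector is compared against each datum supervector with the VR kernel, giving one similarity per datum utterance. This is a hedged illustration; the vectors, the datum set of size m = 2, and the `floor` safeguard are assumptions made for the example.

```python
from typing import List, Sequence


def vr_kernel(phi_a: Sequence[float], phi_b: Sequence[float],
              p_datum: Sequence[float], floor: float = 1e-12) -> float:
    """TFLLR-like similarity normalized by the datum-set probabilities p(d_i|l_s)."""
    return sum(a * b / max(q, floor) for a, b, q in zip(phi_a, phi_b, p_datum))


def relative_supervector(phi_x: Sequence[float],
                         datum: Sequence[Sequence[float]],
                         p_datum: Sequence[float]) -> List[float]:
    """Map an F-dimensional supervector to m similarities, one per datum utterance."""
    return [vr_kernel(phi_x, phi_s, p_datum) for phi_s in datum]


# Hypothetical datum set with m = 2 utterances in an F = 4 space.
datum = [[0.5, 0.5, 0.0, 0.0],
         [0.0, 0.0, 0.5, 0.5]]
p_datum = [0.25, 0.25, 0.25, 0.25]  # hypothetical datum-set probabilities
phi_x = [1.0, 0.0, 0.0, 0.0]
phi_rel = relative_supervector(phi_x, datum, p_datum)
print(phi_rel)  # [2.0, 0.0]
```

The output dimension equals the number of datum utterances rather than the number of N-grams, which is why a smaller datum set yields a lower-dimensional supervector.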
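The SVM output score is computed, with weights $\alpha_{l'}$, bias $d'$, and a kernel $K'$ on the relativized space, as

$$f'\left(\phi_{\text{REL}}(x)\right) = \sum_{l'} \alpha_{l'}\, K'\left(\phi_{\text{REL}}(x), \phi_{\text{REL}}(x_{l'})\right) + d',$$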
where $\phi_{\text{REL}}(x_{l'})$ are support vectors obtained from the training set using the Mercer condition. Selecting a radial basis function (RBF) kernel, $K'_{\text{RBF}}$ is computed as
where $D_{\text{RF}}$ is the dimension of the relativized SVM supervector. Selecting a TFLLR kernel, $K'_{\text{TFLLR}}$ is computed as
3.2 Functional SVM supervector reconstruction
In actual test conditions, the training and test data vary in speakers, background noise, and channel conditions. To achieve higher robustness to variable test conditions, kernel methods are widely used to provide efficient similarity measurements between two SVM supervectors in the PR-VSM system [30]. The geometrical structure of the SVM vector space is completely determined by the kernel, so the selection of the kernel has a crucial impact on the performance of language recognition systems. The functional SVM supervector reconstruction method defines a mixture of the functional and the original kernels, which can offer robust discriminative information about the data and yield a robust language model.
How to select a proper function is an open problem. Many functions could be used for the reconstruction, but not every function reduces the equal error rate (EER), so we need to determine which kinds of functions are suitable for feature reconstruction. The functions need to satisfy the following conditions: (1) they are monotonic, and (2) they strengthen the identifiable characteristics and discard the nuisance attributes of the utterance. The proposed functional SVM supervector reconstruction method does not rely on prior knowledge to select the function; instead, a development database is used for cross validation to select it. The three functions selected for use in this paper are

(a)
$$\phi_{\text{FUN}}\left(p(d_i|\ell_x)\right) = \sin\left(p(d_i|\ell_x)\right) + \cos\left(p(d_i|\ell_x)\right), \qquad (11)$$

(b)
$$\phi_{\text{FUN}}\left(p(d_i|\ell_x)\right) = p(d_i|\ell_x) + \left(p(d_i|\ell_x)\right)^{2}, \qquad (12)$$

(c)
$$\phi_{\text{FUN}}\left(p(d_i|\ell_x)\right) = p(d_i|\ell_x) - \left(p(d_i|\ell_x)\right)^{2} + \left(p(d_i|\ell_x)\right)^{3}. \qquad (13)$$
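The three functions can be sketched as element-wise maps on the supervector, together with a simple numerical check of the monotonicity condition. This is an illustration under stated assumptions: function (c) is read here as p - p^2 + p^3, which is an assumption about the intended signs, and the grid-based check and its `upper` bound are hypothetical helpers, not part of the method.

```python
import math
from typing import Callable, List, Sequence


def fun_a(p: float) -> float:
    return math.sin(p) + math.cos(p)


def fun_b(p: float) -> float:
    return p + p ** 2


def fun_c(p: float) -> float:
    # Assumed reading of function (c): p - p^2 + p^3.
    return p - p ** 2 + p ** 3


def reconstruct(phi: Sequence[float], fun: Callable[[float], float]) -> List[float]:
    """Apply the chosen function to every entry of the supervector."""
    return [fun(p) for p in phi]


def is_increasing(fun: Callable[[float], float], upper: float,
                  steps: int = 1000) -> bool:
    """Numerically check strict monotonic increase on the grid over [0, upper]."""
    values = [fun(upper * k / steps) for k in range(steps + 1)]
    return all(b > a for a, b in zip(values, values[1:]))


phi_fun = reconstruct([0.5, 0.25, 0.0], fun_b)
print(phi_fun)  # [0.75, 0.3125, 0.0]
```

Since the derivative of sin(p) + cos(p) is cos(p) - sin(p), function (a) is increasing only for p below pi/4; N-gram probabilities are normally far smaller than that, which is consistent with the requirement that the functions be monotonic over the amplitude range of the supervector.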
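The utterance $x$ is mapped onto a functionalized vector space:

$$\phi_{\text{FUN}}(x) = \left[\phi_{\text{FUN}}\left(p(d_1|\ell_x)\right),\; \ldots,\; \phi_{\text{FUN}}\left(p(d_F|\ell_x)\right)\right].$$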
The three functions are all monotonic in the range of amplitude of the SVM supervector. Selecting a TFLLR kernel, $K'_{\text{TFLLR}}$ is computed as
The SVM output score is computed as
3.3 Perturbational SVM supervector reconstruction
For spoken language recognition, the first and most essential step is to tokenize the running speech into sound units or lattices using a phone recognizer. The phoneme error rate is around 40% to 60% [32] when tokenizing an utterance. The decoding errors are deletion, insertion, and substitution errors, which are expressed as some discrete ‘noise’ when mapped to the high-dimensional SVM supervector space (shown in Figure 2). So, here, we introduce a perturbational denoising method for the SVM supervector. Given a supervector $\varphi(x)$ and some perturbation operator on $\varphi(x)$, we are interested in understanding how a small perturbation added to the supervector affects the behavior of the SVM [33]. This relationship can be represented using a mapping onto a perturbational vector space.
The proposed perturbational SVM supervector reconstruction method has three purposes: first, adding perturbational noise to reduce the impact of noise in the SVM supervector introduced by the decoding errors; second, generating a more robust language model to provide input variety to the SVM classifier; and third, highlighting the most discriminative information of the SVM supervector and drowning the non-discriminative information in the perturbation (shown in Figure 3).
To accomplish the above goals, the type and strength of the perturbation must be selected carefully. How to define a proper perturbation is an open problem. There is a wide variety of perturbations, which can be categorized in multiple ways, including (1) global perturbation and local perturbation, (2) stochastic perturbation and constant perturbation according to the amplitude, (3) absolute perturbation and relative perturbation according to the relationship between the SVM supervector and the perturbation, and (4) additive perturbation and multiplicative perturbation.
For feature supervectors in vector space, the perturbations are always discrete and may be random within a certain range or may change with the amplitude of the expected value of the supervector. So, we consider both the deterministic perturbation $\delta = w^{*} E_{p(d|\ell_x)}$ and the stochastic perturbation $\delta = w^{*}\, \text{uniform}[0, E_{p(d|\ell_x)}]$, where $E_{p(d|\ell_x)}$ is the mean of the SVM supervector, $\text{uniform}[0, E_{p(d|\ell_x)}]$ is the uniform distribution between 0 and $E_{p(d|\ell_x)}$, and $w^{*}$ is the perturbation weight. More details of the perturbation methods are discussed below.
3.3.1 Perturbational approach 1 (deterministic additive perturbation)
This kind of perturbation represents the assumption that the expected count of every phoneme sequence is perturbed by an equivalent additive amount.
3.3.2 Perturbational approach 2 (stochastic additive perturbation)
This perturbation represents the assumption that the expected count of every phoneme sequence is perturbed by an amount proportional to the frequency of the phoneme sequences.
3.3.3 Perturbational approach 3 (deterministic multiplicative perturbation)
This kind of perturbation represents the assumption that the expected count of every phoneme sequence is perturbed by an equivalent multiplicative amount.
3.3.4 Perturbational approach 4 (stochastic multiplicative perturbation)
This perturbation represents the assumption that the expected count of every phoneme sequence is perturbed by a factor proportional to the frequency of the phoneme sequences.
From the above, it can be seen that methods 1 and 2 implement absolute perturbation, and methods 3 and 4 implement relative perturbation. All are global perturbation algorithms, operating across the entire vector space. We can also investigate local perturbation using these same approaches. Local perturbation is more flexible and realistic, because the noise may affect only part of the expected counts. The proposed methods also do not rely on prior knowledge to add noise to the supervector; we use a development database for cross validation to select a better perturbation.
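The four global perturbation approaches can be sketched in one function, using the deterministic and stochastic definitions of the perturbation given earlier. The exact way the perturbation is combined with the supervector is an assumption made here for illustration: the additive variants are taken as p + delta and the multiplicative variants as p * (1 + delta). The weight value and the example vector are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def perturb(phi: Sequence[float], weight: float, stochastic: bool = False,
            multiplicative: bool = False,
            rng: Optional[random.Random] = None) -> List[float]:
    """Perturb every entry of a supervector (global perturbation).

    delta = weight * E            (deterministic)
    delta = weight * U[0, E]      (stochastic), with E the mean of the supervector.
    Assumed combination rules: p + delta (additive), p * (1 + delta) (multiplicative).
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    mean = sum(phi) / len(phi)
    out = []
    for p in phi:
        delta = weight * (rng.uniform(0.0, mean) if stochastic else mean)
        out.append(p * (1.0 + delta) if multiplicative else p + delta)
    return out


phi = [0.5, 0.25, 0.25, 0.0]  # hypothetical supervector, mean 0.25
det_add = perturb(phi, weight=0.1)                        # approach 1
sto_add = perturb(phi, weight=0.1, stochastic=True)       # approach 2
det_mul = perturb(phi, weight=0.1, multiplicative=True)   # approach 3
sto_mul = perturb(phi, weight=0.1, stochastic=True, multiplicative=True)  # approach 4
```

A local variant could apply the same rule to a subset of the entries only; which variant and weight to use would be chosen on development data, as described above.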
The utterance x is mapped onto a perturbational vector space:
where $\phi_{\text{PER}}(x)$ is a perturbation of $\varphi(x)$. Selecting a TFLLR kernel, $K'_{\text{TFLLR}}$ is computed as
The SVM output score is computed as
4 Homogeneous ensemble language recognition system
The architecture of the HEPLR system is shown in Figure 4. All the SVM supervectors are reconstructed into the corresponding vector space and fused at the vector level for training and testing. In this paper, we use the classical method of vector fusion, which is to group several sets of reconstructed SVM supervectors into a large composite supervector [34]. Suppose $\phi_{\text{REL}_{N_{1_1}}}(x), \ldots, \phi_{\text{REL}_{N_{1_{m'}}}}(x)$; $\phi_{\text{FUN}_{N_{2_1}}}(x), \ldots, \phi_{\text{FUN}_{N_{2_{m'}}}}(x)$; and $\phi_{\text{PER}_{N_{3_1}}}(x), \ldots, \phi_{\text{PER}_{N_{3_{m'}}}}(x)$ are the reconstructed SVM supervectors for an input utterance $x$. The concatenated SVM supervectors can be represented as $\phi_{\text{REL}}(x)$, $\phi_{\text{FUN}}(x)$, and $\phi_{\text{PER}}(x)$, respectively. Denoting each SVM supervector as $d_m$-dimensional, the concatenated SVM supervectors are $(d_1 + \ldots + d_{m'})$-dimensional. The concatenated SVM supervectors are defined by
where $w_{N_{j_i}} = \min_{\forall n}\left(E_{j_n}\right) / E_{j_i}$ ($j = 1, 2, 3$), with $E_{j_i}$ the a priori EER of the subsystem on the development data. The logistic regression optimized weighting (LROW) method is used to optimize the reconstructed SVM supervector weighting coefficients. Since not all the SVM supervector reconstruction subsystems are effective when fused, we also extend the work by formulating quantitative measures to select the subsystems for fusion. The output score of the SVM classifier is computed as follows:
where the reconstruction methods are represented by ‘*’, and $\phi^{*}(x_{l^{*}})$ are support vectors obtained from the reconstructed SVM supervectors using the Mercer condition.
As mentioned previously, in the HEPLR language recognition system, the training stage is carried out between the positive set and the negative set with a one-versus-rest strategy.
The linear discriminant analysis-maximum mutual information (LDA-MMI) method is used to maximize the posterior probabilities of all the belief score vectors [35], using the objective function [36]:
where $g(i)$ indicates the class label of $x_i$ and $P(j)$ denotes the prior probability of class $j$. Vector fusion is implemented directly as
The probability density function $p(x|\lambda)$ is a Gaussian mixture model defined on the N-dimensional vector $x$:
The proposed homogeneous ensemble language recognition system has three advantages. First, SVM supervector reconstruction provides vector space modeling diversification for richer language identification information. Second, in the HEPLR system, the subsystems share the same preprocessing steps for feature extraction, decoding, and expected counting, which minimizes additional computational cost. Third, fusing the reconstructed SVM supervector with the original supervector at the vector level means that more information can be retained than that given by score fusion.
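The EER-based initial weighting and the vector-level fusion described in this section can be sketched as below. The development EER values and the small supervectors are hypothetical placeholders, and the LROW optimization of the weights is not shown; only the initial rule, the minimum EER divided by each subsystem's EER, is illustrated.

```python
from typing import List, Sequence


def fusion_weights(dev_eers: Sequence[float]) -> List[float]:
    """Initial weight of each subsystem: min EER over subsystems / its own EER."""
    best = min(dev_eers)
    return [best / e for e in dev_eers]


def concatenate(supervectors: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Scale each reconstructed supervector by its weight and join them."""
    fused: List[float] = []
    for vec, w in zip(supervectors, weights):
        fused.extend(w * v for v in vec)
    return fused


# Hypothetical development-set EERs (in %) for three reconstructed subsystems.
dev_eers = [2.0, 2.5, 4.0]
weights = fusion_weights(dev_eers)  # best subsystem gets weight 1.0

# Hypothetical reconstructed supervectors of dimensions 2, 3, and 1.
subvectors = [[0.5, 0.5], [1.0, 0.0, 0.0], [0.2]]
fused = concatenate(subvectors, weights)
print(len(fused))  # 6
```

The fused dimension is the sum of the individual dimensions, and subsystems with a higher development EER contribute with a smaller scale.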
5 Experimental setup
5.1 Baseline language recognition system setup
The TRAPs/NN phonotactic language recognizers developed by the BUT [37] based on phone lattices, N-gram counts, and SVM scoring are used as baseline systems. An energy-based voice activity detector that splits and removes long-duration non-speech segments from the signals is applied initially. Following this, the BUT decoders for Czech (CZ), Hungarian (HU), and Russian (RU) are applied to compute phone posterior probabilities, as used in NIST LRE tasks by many groups [38],[39]. The phone inventory is 43 for Czech, 59 for Hungarian, and 50 for Russian. The posterior probabilities are passed to the HVite decoder from HTK to produce phone lattices, which encode multiple hypotheses with acoustic likelihoods. The N-gram counts are produced by lattice-tool from SRILM (SRI International, Menlo Park, CA, USA) [40]. The LIBLINEAR tool [41] for multi-class SVMs with linear kernels is applied to give SVM scores. Finally, the LDA-MMI algorithm [42] is used for score calibration and fusion.
5.2 Training, development, and test datasets
Evaluation is carried out on the NIST LRE 2009 tasks. The data include 41,793 utterances of 30-, 10-, and 3-s nominal duration under the closed-set condition. The NIST LRE 2009 core task is to recognize 23 languages: Amharic, Bosnian, Cantonese, Creole, Croatian, Dari, American English, Indian English, Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu, and Vietnamese. The evaluation involves radio broadcast and conversational telephone speech channel conditions.
The training data comes from different sources including CallHome, CallFriend, OGI, OHSU, VOA, and the development corpora for the 2003, 2005, and 2007 NIST LRE evaluations.
About 25,000 utterances are selected randomly from the VOA and the 2003, 2005, and 2007 NIST LRE datasets and used as development data.
5.3 Evaluation measures
In this work, the performance of language recognition systems is compared using (1) the EER and (2) the average cost performance $C_{\text{avg}}$ defined by NIST [43], both of which are obtained with the one-versus-rest strategy.
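For reference, the EER can be sketched as the operating point at which the miss rate on target trials equals the false-alarm rate on non-target trials. The threshold sweep below is a simple illustrative implementation, not the NIST scoring tool, and the score lists are hypothetical; $C_{\text{avg}}$ is not covered here.

```python
from typing import Sequence


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Sweep thresholds over all observed scores and return the error rate
    at the point where miss and false-alarm rates are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), 1.0
    for thr in thresholds:
        # A trial is accepted when its score is >= thr.
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, best_eer = gap, (miss + false_alarm) / 2.0
    return best_eer


# Hypothetical one-versus-rest scores for one target language.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
print(eer)  # 0.25
```

In a multi-language evaluation, target and non-target trials from all one-versus-rest detectors would be pooled before the sweep.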
6 Experimental results and discussion
We demonstrate the effectiveness of our approaches on NIST LRE 2009 tasks under the 30-, 10-, and 3-s conditions. Results are shown in Tables 1, 2, 3, 4, 5, and 6 and Figures 1, 2, 3, 4, 5, 6, 7, and 8 in the following sections. The EER and $C_{\text{avg}}$ performance of individual subsystems and fusions is also shown in the tables below for reference.
6.1 Baseline PRVSM system
Table 1 shows EER and $C_{\text{avg}}$ performance for the NIST LRE 2009 language recognition tasks using the baseline subsystems. In this work, the dimension of the possible 3-gram SVM supervector produced by the single Hungarian (HU) phone recognizer with 58 phones is 58×58×58=195112. SVM supervector dimensions for the Russian (RU) and Czech (CZ) recognizers are 117649 and 74088, respectively.
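The supervector dimensions quoted above are simply the cube of the number of phones used for N-gram counting. In the sketch below, the value 58 for HU is taken from the text; the values 49 for RU and 42 for CZ are inferred here from the cube roots of 117649 and 74088 and are therefore assumptions rather than figures stated in the paper.

```python
def ngram_supervector_dim(num_phones: int, order: int = 3) -> int:
    """Number of possible N-grams over a phone set of the given size."""
    return num_phones ** order


# HU is stated in the text; RU and CZ phone counts are inferred (assumed).
phones = {"HU": 58, "RU": 49, "CZ": 42}
dims = {name: ngram_supervector_dim(n) for name, n in phones.items()}
print(dims)  # {'HU': 195112, 'RU': 117649, 'CZ': 74088}
```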
6.2 Relative SVM supervector reconstruction
In this paper, 13,000 conversations are randomly selected from the 40 languages of the 2003, 2005, and 2007 NIST LRE and the VOA, CallHome, and CallFriend corpora. These are used as the datum dataset to build the relative SVM supervector reconstructor.
Figure 5 shows the performance of the relative SVM supervector reconstruction subsystems whose SVM classifier uses the TFLLR and RBF kernels, respectively. Figure 5 also shows the performance of the relative SVM supervector reconstruction subsystems whose SVM classifier uses a fusion of the TFLLR and RBF kernels. Table 2 shows that the performance of the relative SVM supervector reconstruction subsystem is better than the baseline and increases slowly with increasing size of the datum dataset. A lower-dimensional relative SVM supervector can be chosen to trade off performance and computational cost. The language recognition performance for short utterances is significantly improved in the relative SSR subsystem compared to the baseline. The original supervector cannot describe short utterances precisely because of insufficient phoneme data, while the relative reconstructed SVM supervectors use the relationship between a short utterance and a large set of datum utterances, which is a richer representation. The experimental results show that the relative SSR subsystem outperforms the baseline and can obtain better performance with relative features using a low-dimensional SVM supervector.
6.3 Functional SVM supervector reconstruction
The language recognition results of the functional SVM supervector reconstruction subsystems are given in Figure 6 and Table 3. The results using this approach are similar to or slightly worse than those of the baseline system in the 30-s test condition, but outperform the baseline system in the 10- and 3-s test conditions.
6.4 Perturbational SVM supervector reconstruction
Figure 7 and Table 4 describe the results of the four perturbational methods. Overall, approach 2 yielded better results (2.20%, 6.59%, 20.93% EER) than the other approaches. The perturbative SVM supervector reconstruction subsystems performed consistently better than the baseline subsystem; particularly, those based on approach 2 performed better than the others. We hypothesize that approach 2 outperforms other perturbation methods because the distribution of the perturbation better matches the distribution of the noise. The perturbation approach adds robustness to the language modeling.
6.5 SVM supervector reconstruction
From the experimental results and discussion, it can be concluded that some of the reconstruction methods (relative SSR) are better at identifying the language of the short utterance and others (functional SSR and perturbational SSR) are better at recognizing long utterances. Because these errors are not highly correlated, we can fuse these results together to harness the complementary behavior among subsystems and improve the language recognition performance.
Figure 8 shows DET curves of the baseline system versus the HEPLR system for NIST LRE 2009. Table 5 gives the corresponding performance numbers for all configurations. These results show that the SSR approaches proposed in this paper outperformed the baseline system in terms of EER and Cavg when considering complete fusions for the subsystems.
6.6 Real-time factors
Table 6 shows the real-time (RT) factors of each part of the SSR system. From Table 6, we can see that decoding is the dominant part. Compared to the PR-VSM baseline system, the computational cost increases to about 1.5 times for the relative SSR, with almost no increase for the functional SSR and perturbational SSR.
7 Computational cost
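Let $F$, $M$, and $M_{\text{datum}}$ denote the dimension of the phonotactic feature supervector of an utterance, the number of utterances in the training dataset, and the number of utterances in the datum dataset, respectively. Let $c_{\varphi}$ denote the computational cost of the mapping from $x$ to $\varphi(x)$, and let $c_{\text{modeling}}(F, M)$ denote the computational cost of modeling the languages, which depends on $F$ and $M$. Then, the computational cost of the baseline system is

$$c_{\text{baseline}} = M \cdot c_{\varphi} + c_{\text{modeling}}(F, M), \qquad c_{\varphi} = c_{\text{PreProcessing}} + c_{\text{FeatureExtract}} + c_{\text{Decoding}} + c_{\text{NgramCounting}},$$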
where c_{PreProcessing}, c_{FeatureExtract}, c_{Decoding}, and c_{NgramCounting} denote the computational cost of preprocessing, feature extracting, decoding, and Ngram counting, respectively.
7.1 Relative SVM supervector reconstruction
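Let $c_{\text{inner}}(F)$ denote the computational cost of the inner product of two $F$-dimensional supervectors. Then, the computational cost of the relative SVM supervector reconstruction system is computed as

$$c_{\text{REL}} = (M + M_{\text{datum}}) \cdot c_{\varphi} + M \cdot M_{\text{datum}} \cdot c_{\text{inner}}(F) + c_{\text{modeling}}(M_{\text{datum}}, M).$$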
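Usually, $c_{\text{modeling}}(M_{\text{datum}}, M) < c_{\text{modeling}}(F, M) \ll M \cdot c_{\varphi}$ and $M \cdot M_{\text{datum}} \cdot c_{\text{inner}}(F) \ll M \cdot c_{\varphi}$, so

$$c_{\text{REL}} \approx (M + M_{\text{datum}}) \cdot c_{\varphi}.$$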
In this paper, $M = 30996$ and, for the RU front-end, $F = 117649$. With $M_{\text{datum}} = 13000$, $(c_{\text{REL}} - c_{\text{baseline}})/c_{\text{baseline}} \approx M_{\text{datum}}/M = 41.94\%$. That is, the relative SVM supervector reconstruction system takes 41.94% extra computation and achieves 11.84%, 6.42%, and 15.92% relative improvements for the 30-, 10-, and 3-s conditions, respectively, compared to the baseline.
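The 41.94% figure can be reproduced with a short calculation. The sketch below assumes, as stated above, that the modeling and inner-product costs are negligible next to the per-utterance mapping cost, and additionally assumes that each datum utterance incurs the same mapping cost as a training utterance; the mapping cost is normalized to 1 because it cancels in the ratio.

```python
def relative_extra_cost(m_train: int, m_datum: int) -> float:
    """Extra cost of relative SSR as a fraction of the baseline cost.

    Assumed model: c_baseline ~ M * c_phi and c_REL ~ (M + M_datum) * c_phi.
    """
    c_phi = 1.0  # per-utterance mapping cost, normalized (cancels out)
    c_baseline = m_train * c_phi
    c_rel = (m_train + m_datum) * c_phi
    return (c_rel - c_baseline) / c_baseline


extra = relative_extra_cost(m_train=30996, m_datum=13000)
print(f"{100 * extra:.2f}%")  # 41.94%
```

Under this model the overhead depends only on the ratio of datum to training utterances, so a smaller datum set lowers the extra cost proportionally.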
7.2 Functional SVM supervector reconstruction
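Let $c_{\phi_{\text{FUN}}}$ denote the computational cost of mapping $\varphi(x)$ to $\phi_{\text{FUN}}(x)$. Then, the computational cost of the functional SVM supervector reconstruction system is computed as

$$c_{\text{FUN}} = M \cdot \left(c_{\varphi} + c_{\phi_{\text{FUN}}}\right) + c_{\text{modeling}}(F, M).$$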
because preprocessing, feature extracting, decoding, and Ngram counting are more complex than the functional computation in this paper, so M\xb7{c}_{{\phi}_{\text{FUN}}}\ll M\xb7{c}_{\phi}. The computational cost of modeling the languages can be considered equal to the baseline. For RU frontend, the functional SVM supervector reconstruction system takes almost no extra computation and achieves 8.84%, 13.32%, and 14.76% relative improvements, respectively, for 30, 10, and 3s compared to the baseline.
7.3 Perturbational SVM supervector reconstruction
Let c_{φ_{PER}} denote the computational cost of adding perturbation to φ(x). Then, the computational cost of the perturbational SVM supervector reconstruction system is computed as
Because preprocessing, feature extraction, decoding, and N-gram counting are far more complex than the perturbational computation in this paper, M·c_{φ_{PER}} ≪ M·c_{φ}. The computational cost of modeling the languages can be considered equal to that of the baseline. For the RU front-end, the perturbational SVM supervector reconstruction system takes almost no extra computation and achieves relative improvements of 6.63%, 5.13%, and 9.05% for the 30-, 10-, and 3-s test conditions, respectively, compared with the baseline.
8 Conclusions
In this article, we investigate SVM supervector reconstruction as a strategy for diversifying vector space modeling, with the aim of improving the performance and robustness of language recognition at very low additional computational cost. Three SVM supervector reconstruction methods are used to produce the diversified SVM supervectors: relative SSR, perturbational SSR, and functional SSR. Relative SSR uses the relationship between an utterance and a datum set to represent the utterance. Perturbational SSR reconstructs the SVM supervector as a slightly perturbed version and improves language recognition performance. Functional SSR can derive effective kernel mixtures and yield robust language models. These approaches do not involve significant additional computation compared with a baseline phonotactic system, but represent a way to extract more information from existing decodings.
Experimental results of the proposed HEPLR system on the NIST LRE 2009 evaluation set show better performance than the baseline system. When we fuse the three subsystems at the score level for further improvements, we achieve 1.39%, 3.63%, and 14.79% EER for the 30-, 10-, and 3-s closed-set test conditions, respectively. This corresponds to relative improvements of 6.06%, 10.15%, and 10.53% over the baseline system.
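To make the relationship between the reported EERs and the relative improvements concrete, the sketch below back-computes the baseline EER implied by each pair of figures. It assumes that relative improvement is defined as (baseline − system) / baseline; the function names are illustrative, and the implied baselines are derived values subject to rounding in the reported numbers, not figures taken from Table 5.

```python
def relative_improvement(baseline_eer, system_eer):
    """Relative EER reduction in percent, assuming (baseline - system) / baseline."""
    return 100.0 * (baseline_eer - system_eer) / baseline_eer


def implied_baseline(system_eer, rel_impr_pct):
    """Baseline EER implied by a system EER and a relative improvement in percent."""
    return system_eer / (1.0 - rel_impr_pct / 100.0)


# (HEPLR EER %, relative improvement %) as reported for each test condition.
reported = {
    "30 s": (1.39, 6.06),
    "10 s": (3.63, 10.15),
    "3 s": (14.79, 10.53),
}

for condition, (eer, rel) in reported.items():
    base = implied_baseline(eer, rel)
    # Round trip: recomputing the improvement from the implied baseline
    # should give back the reported percentage.
    check = relative_improvement(base, eer)
    print(f"{condition}: implied baseline EER = {base:.2f}%, "
          f"relative improvement = {check:.2f}%")
```

Because the reported EERs and percentages are each rounded to two decimals, the implied baselines are only approximate.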
Acknowledgements
This project is supported by the National Natural Science Foundation of China under grant nos. 61370034, 61273268, and 61403224.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, WW., Zhang, WQ., Johnson, M.T. et al. Homogenous ensemble phonotactic language recognition based on SVM supervector reconstruction. J AUDIO SPEECH MUSIC PROC. 2014, 42 (2014). https://doi.org/10.1186/s13636-014-0042-5
DOI: https://doi.org/10.1186/s13636-014-0042-5