Homogeneous ensemble phonotactic language recognition based on SVM supervector reconstruction

Abstract

Currently, acoustic spoken language recognition (SLR) and phonotactic SLR systems are the most widely used language recognition systems. To achieve better performance, researchers combine multiple subsystems, with results often much better than those of a single SLR system. Phonotactic SLR subsystems may vary in their acoustic feature vectors or include multiple language-specific phone recognizers and different acoustic models. These methods achieve good performance but usually at a high computational cost. In this paper, a new diversification for phonotactic language recognition systems is proposed using vector space models built by support vector machine (SVM) supervector reconstruction (SSR). In this architecture, the subsystems share the same feature extraction, decoding, and N-gram counting preprocessing steps, but model in different vector spaces by using the SSR algorithms without significant additional computation. We term this a homogeneous ensemble phonotactic language recognition (HEPLR) system. The system integrates three different SVM supervector reconstruction algorithms: relative SVM supervector reconstruction, functional SVM supervector reconstruction, and perturbational SVM supervector reconstruction. All of the algorithms are incorporated using a linear discriminant analysis-maximum mutual information (LDA-MMI) backend to improve language recognition evaluation (LRE) accuracy. Evaluated on the National Institute of Standards and Technology (NIST) LRE 2009 task, the proposed HEPLR system achieves better performance than a baseline phone recognition-vector space modeling (PR-VSM) system with minimal extra computational cost. The HEPLR system yields 1.39%, 3.63%, and 14.79% equal error rate (EER), representing 6.06%, 10.15%, and 10.53% relative improvements over the baseline system, for the 30-, 10-, and 3-s test conditions, respectively.

1 Introduction

Spoken language recognition (SLR) refers to the task of automatic determination of language identity. It is estimated that there are about 6,000 spoken languages in the world [1]. An increasing number of multilingual speech processing applications require spoken language recognition as a frontend, with the result that SLR continues to grow in importance. Spoken language recognition is an enabling technology for a wide range of intelligence and security applications for information distillation, such as spoken document retrieval, multilingual speech recognition, and spoken language translation [2].

Language cues can be categorized according to their level of knowledge abstraction as acoustic (spectrum, phone inventory), prosodic (duration, pitch, intonation), phonotactic (sequence of sounds), lexical (vocabulary, morphology), and syntax (phrases, grammar) [3],[4]. Language recognition systems are usually identified by the features they employ, e.g., acoustic systems, phonotactic systems, prosodic systems, and lexical systems. Currently, acoustic language recognition (LR) systems [5] and phonotactic LR systems [3] are both widely used.

Generally, the performance of SLR systems can be improved in two ways: (1) longitudinally, through the development of new techniques to perform the SLR tasks more precisely, e.g., i-vector [6]-[8], JFA [9], discriminative training [10] methods, N-gram modeling methods [3], and support vector machines (SVMs) [11]; (2) transversely, by adding variety to the SLR subsystems, thereby extracting and integrating more information from the utterances. State-of-the-art language recognition systems fuse multiple subsystems in parallel via a post-processing backend [12]. In the National Institute of Standards and Technology (NIST) language recognition evaluation (LRE) tasks, teams from all over the world compete to build the best SLR system and have shown that better results can be obtained by combining more subsystems, creating larger and larger SLR systems. In NIST LRE 2011, all submitted language recognition systems were stacked ensembles of at least five language recognition subsystems [13]-[15]. Much effort goes into trying different variations of subsystems. Generally, the phonotactic LR subsystems can be varied in three ways: (a) extracting various acoustic features to provide feature diversification, for example, Mel-Frequency Cepstral Coefficients (MFCC) [16], Perceptual Linear Predictive (PLP) [17], and Temporal Patterns Neural Network (TRAPs/NN) [18]; (b) training phone recognizers on multiple language-specific speech data to provide phonetic diversification [3], e.g., the Russian, Hungarian, Czech, and English phone recognizers developed by Brno University of Technology (BUT) [18] or the universal phone recognizer (UPR) [19]; and (c) training phone recognizers on the same language-specific speech data but using different acoustic models to provide acoustic diversification [20], such as the Artificial Neural Network-Hidden Markov Model (ANN-HMM) [21], Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) [22], and Deep Neural Network-Hidden Markov Model (DNN-HMM) [23].
However, in phonotactic language recognition systems, each subsystem must undergo its own process of feature extraction, decoding, N-gram counting, and vector space modeling, which means a computational cost N times that of a single subsystem, where N is the number of subsystems.

This paper demonstrates an architecture that provides a new diversification for phonotactic language recognition systems. The underlying motivation of these algorithms is to provide richer language-identifying information without significant additional computation. The subsystems are varied using SVM supervector reconstruction (SSR) algorithms to provide vector space modeling diversification. In this architecture, the subsystems share the same preprocessing of feature extraction, decoding, and expected counting, but model in different vector spaces, so we call it a homogeneous ensemble phonotactic language recognition (HEPLR) system. The HEPLR subsystems increase the variety of the SVM supervectors, decrease the computational cost, and improve SLR accuracy. There are many possible SVM supervector reconstruction algorithms, such as recurrent neural network (RNN) SVM supervector reconstruction [24]. In this paper, we present three SVM supervector reconstruction algorithms: relative SVM supervector reconstruction [25], functional SVM supervector reconstruction, and perturbational SVM supervector reconstruction.

The remainder of the paper is organized as follows. Section 2 presents the lattice-based phonotactic language recognition system used as a baseline in this paper. Section 3 describes the proposed approaches, including relative, functional, and perturbational SVM supervector reconstruction. Section 4 demonstrates the architecture of the homogeneous ensemble phonotactic language recognition system. The experimental setup used to evaluate our proposed method is described in Section 5. Results obtained in language recognition experiments on the NIST LRE 2009 database are presented and discussed in Section 6. Section 7 analyzes the computational cost of the proposed system. Finally, conclusions and future work are outlined in Section 8.

2 Baseline phonotactic SLR system

In this work, the phone recognition-vector space modeling (PR-VSM) [26] phonotactic language recognition system is employed as the baseline system. The motivation behind the phonotactic language recognition approach is the belief that a spoken language can be characterized by the characteristic probabilities of its lexical-phonological constraints. An N-gram vector space model (VSM), a stochastic model describing the probabilities of phoneme strings, is built for the language recognition task using phone transcriptions. In the PR-VSM system, each N-gram VSM produces a likelihood score for a given utterance via an SVM classifier. The languages used for training the phone recognizers need not be the same as any of those being recognized.

The traditional PR-VSM language recognition system works by mapping the input utterances from the data space $\mathcal{X}$ into a high-dimensional feature space $\mathcal{F}$, $\Phi: \mathcal{X} \to \mathcal{F}$, and then building linear machines in the feature space to find a maximal margin separation. The vectors built in the high-dimensional feature space are SVM supervectors, which consist of N-gram counts of features representing the phonotactics of an input speech sample.

In PR-VSM systems, an utterance x can be mapped to the high-dimensional feature space as follows:

$$\Phi : x \mapsto \varphi(x), \tag{1}$$

where φ(x) is the SVM supervector computed as

$$\varphi(x) = \left[\, p(d_1|\mathcal{L}_x),\; p(d_2|\mathcal{L}_x),\; \ldots,\; p(d_F|\mathcal{L}_x) \,\right], \tag{2}$$

where $\mathcal{L}_x$ is the lattice produced from the data x by a phone recognizer, $d_i$ is the N-gram phoneme string [27] $d_i = s_i \cdots s_{i+n-1}$ ($n = N$), and F is the dimension of the SVM supervector. $p(d_i|\mathcal{L}_x)$ is the probability of the phoneme sequence $d_i$ in the lattice, which is computed as

$$p(d_i|\mathcal{L}_x) = \frac{c(d_i|\mathcal{L}_x)}{\sum_i c(d_i|\mathcal{L}_x)},$$

where $c(d_i|\mathcal{L}_x)$ denotes the N-gram occurrence count of $d_i$ given the lattice $\mathcal{L}_x$. This is calculated over all possible hypotheses in the lattice as follows [27]:

$$c(s_i \cdots s_{i+N-1}|\mathcal{L}) = \sum_{S} p(S|\mathcal{L})\, c(s_i \cdots s_{i+N-1}|S) = \sum_{s_i \cdots s_{i+N-1}} \alpha(s_i)\, \beta(s_{i+N-1}) \prod_{j=i}^{i+N-1} \xi(s_j),$$

where $p(S|\mathcal{L})$ denotes the probability of the sequence S in the lattice $\mathcal{L}$, and $\alpha(s_i)$ and $\beta(s_{i+N-1})$ are the forward probability of the starting node and the backward probability of the ending node of the N-gram $s_i \cdots s_{i+N-1}$, respectively. $\xi(s_j)$ is the posterior probability of the edge phoneme $s_j$.
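As a deliberately simplified illustration of the mapping above, the following sketch computes the supervector of Eq. (2) from a single 1-best phone string rather than a lattice; in that degenerate case, the expected counts reduce to plain N-gram counts and the probabilities to relative frequencies. The function name is our own, not part of the described system.

```python
from collections import Counter

def ngram_supervector(phones, n=3):
    """Map a phone sequence to a normalized N-gram count vector.

    Simplified sketch: with a single 1-best hypothesis instead of a
    lattice, the expected counts c(d_i | L_x) reduce to plain N-gram
    counts, and p(d_i | L_x) to their relative frequencies.
    """
    counts = Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

# Toy phone sequence; in the real system the lattice forward-backward
# computation above replaces this simple count.
sv = ngram_supervector(["a", "b", "c", "a", "b", "d"], n=2)
```

In the lattice case, each count would instead be accumulated over all hypotheses, weighted by the forward, backward, and edge posterior probabilities.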

The SVM supervector φ(x) is sent to the SVM classifier and a decision is made based on the most likely hypothesis score. In a PR-VSM system, the decision is made based on the SVM output score using

$$f(\varphi(x)) = \sum_{l} \alpha_l\, K_{\mathrm{TFLLR}}(\varphi(x), \varphi(x_l)) + d, \tag{3}$$

where $\varphi(x_l)$ are support vectors obtained from the training set using the Mercer condition. The term frequency log-likelihood ratio (TFLLR) kernel $K_{\mathrm{TFLLR}}$ is computed as [28]:

$$K_{\mathrm{TFLLR}}(\varphi(x_i), \varphi(x_j)) = \sum_{q=1}^{F} \frac{p(d_q|\mathcal{L}_{x_i})}{\sqrt{p(d_q|\mathcal{L}_{\mathrm{all}})}} \cdot \frac{p(d_q|\mathcal{L}_{x_j})}{\sqrt{p(d_q|\mathcal{L}_{\mathrm{all}})}}, \tag{4}$$

where $p(d_q|\mathcal{L}_{\mathrm{all}})$ is calculated from the observed probability of $d_q$ across all lattices. In this paper, the training stage is always carried out with a one-versus-rest strategy between the positive set (the samples in the target language) and the negative set (all other samples).
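A minimal sketch of the TFLLR kernel, assuming supervectors stored as sparse dicts of N-gram probabilities (the representation and function name are ours); each component is scaled by the inverse square root of its background probability before the inner product:

```python
import math

def tfllr_kernel(p_i, p_j, p_all):
    """TFLLR kernel of Eq. (4): each N-gram probability is scaled by
    1 / sqrt(p(d_q | all)) before taking the inner product.
    N-grams absent from an utterance contribute zero."""
    total = 0.0
    for d, p_bg in p_all.items():
        if p_bg > 0.0:
            total += (p_i.get(d, 0.0) / math.sqrt(p_bg)) * \
                     (p_j.get(d, 0.0) / math.sqrt(p_bg))
    return total
```

The scaling deemphasizes frequent N-grams, which would otherwise dominate the inner product.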

3 SVM supervector reconstruction algorithms

The motivation behind SVM supervector reconstruction is to provide vector space modeling diversification to improve the performance of the overall language recognition system. In the language recognition system employed in this paper, we focus on how a change in the input to the SVM affects the output.

Given an SVM supervector φ(x), we define a function ϕSSR which operates on φ(x):

$$\Phi_{\mathrm{SSR}} : x \mapsto \phi_{\mathrm{SSR}}(\varphi(x)). \tag{5}$$

We are interested in understanding how ϕSSR(φ(x)) affects the behavior of the output scores of the SVM. The goal is to define the relationship between φ(x) and ϕSSR(φ(x)) to enhance the variety of the supervector input.

Selecting SVM supervector reconstruction methods is an open question, so here we propose several representative methods. In this section, three SVM supervector reconstruction methods are proposed: relative SVM supervector reconstruction, functional SVM supervector reconstruction, and perturbational SVM supervector reconstruction. Relative SVM supervector reconstruction has been presented in [25], while functional and perturbational reconstructions are new methods. Relative reconstruction is linear, while functional and perturbational reconstructions are non-linear.

3.1 Relative SVM supervector reconstruction

The relative SVM supervector method uses a relative feature approach. Relative features, in contrast to absolute features (which represent directly calculable information), are defined by the relationship between an utterance and a set of selected datum utterances. We presented the concept of relative features in [25].

Calculating relative features requires a relationship measurement, such as distance or similarity. By selecting a proper relationship measurement, identifiable characteristics can be strengthened and nuisance attributes of the utterance discarded. Unlike absolute features, relative features make utterances easier to classify by directly exposing the relationship between the utterances and the datum database.

Here, we introduce a relative SVM supervector reconstruction defined using the similarity between the utterance SVM supervectors. The widely used kernel methods offer efficient similarity measurements between two SVM supervectors. In this paper, the empirical kernel [29] is introduced into language recognition and a relativized SVM supervector developed. Kernel methods have been used for face recognition [30] and handwritten digit recognition [31] and achieved higher robustness to noise [30]. Using the SVM supervectors that are already built into a language recognition system, we can easily compose a new relativized SVM supervector with only a small increase in computation.

The architecture of the relative SVM supervector reconstruction subsystem is shown in Figure 1. To construct the SVM supervector relativization map, a datum database $S = [s_1, s_2, \ldots, s_m]$ containing m utterances is used as the datum mark of similarity. The datum database is stochastically selected from some corpus, whose language need not be the same as the target language. $S$ is mapped into the vector space:

$$S \mapsto \varphi(S) = \left[\, \varphi(s_1), \varphi(s_2), \ldots, \varphi(s_m) \,\right]. \tag{6}$$
Figure 1. Architecture of the relative SVM supervector reconstruction subsystem.

The vector relativizational (VR) kernel between two supervectors $\varphi(x_i)$ and $\varphi(x_j)$ is

$$K_{\mathrm{VR}}(\varphi(x_i), \varphi(x_j)) = \langle \varphi(x_i), \varphi(x_j) \rangle = \sum_{q=1}^{F} \frac{p(d_q|\mathcal{L}_{x_i})}{\sqrt{p(d_q|\mathcal{L}_S)}} \cdot \frac{p(d_q|\mathcal{L}_{x_j})}{\sqrt{p(d_q|\mathcal{L}_S)}}. \tag{7}$$

The VR kernel is similar to the TFLLR kernel but is normalized by the observed probability $p(d_q|\mathcal{L}_S)$ across all lattices of the datum dataset. This kernel reflects the degree of similarity between two supervectors.

The utterance x is mapped from the input data space $\mathcal{X}$ to a relativized m-dimensional Euclidean space $\mathbb{R}^m$, $\Phi_{\mathrm{REL}}: \mathcal{X} \to \mathbb{R}^m$, as follows:

$$\Phi_{\mathrm{REL}} : x \mapsto \varphi_{\mathrm{REL}}(x) = K_{\mathrm{VR}}(\varphi(x), \varphi(S)) = \left[\, K_{\mathrm{VR}}(\varphi(x), \varphi(s_1)), \ldots, K_{\mathrm{VR}}(\varphi(x), \varphi(s_m)) \,\right].$$

In general, $K_{\mathrm{VR}}(\varphi(x), \varphi(S))$ defines a space in which each dimension corresponds to the similarity to a prototype. Thus, $K_{\mathrm{VR}}(\cdot, \varphi(S))$ can be viewed as a mapping onto an m-dimensional relativized vector space.

The SVM output score is computed as

$$f(\varphi_{\mathrm{REL}}(x)) = \sum_l \alpha_l\, K(\varphi_{\mathrm{REL}}(x), \varphi_{\mathrm{REL}}(x_l)) + d, \tag{8}$$

where $\varphi_{\mathrm{REL}}(x_l)$ are support vectors obtained from the training set using the Mercer condition. Selecting a radial basis function (RBF) kernel, $K_{\mathrm{RBF}}$ is computed as

$$K_{\mathrm{RBF}}(\varphi_{\mathrm{REL}}(x_i), \varphi_{\mathrm{REL}}(x_j)) = \exp\!\left( -\frac{\left| \varphi_{\mathrm{REL}}(x_i) - \varphi_{\mathrm{REL}}(x_j) \right|^2}{D_{\mathrm{RF}}} \right), \tag{9}$$

where $D_{\mathrm{RF}}$ is the dimension of the relativized SVM supervector. Selecting a TFLLR kernel, $K_{\mathrm{TFLLR}}$ is computed as

$$K_{\mathrm{TFLLR}}(\varphi_{\mathrm{REL}}(x_i), \varphi_{\mathrm{REL}}(x_j)) = \sum_{q=1}^{m} \frac{K_{\mathrm{VR}}(\varphi(x_i), \varphi(s_q))\, K_{\mathrm{VR}}(\varphi(x_j), \varphi(s_q))}{K_{\mathrm{VR}}(\varphi(x_{\mathrm{all}}), \varphi(s_q))}. \tag{10}$$
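The relativization step itself can be sketched in a few lines. This is an illustrative simplification assuming dict-valued supervectors and using a plain (unscaled) normalized inner product for the VR kernel; the function names are ours:

```python
def vr_kernel(p_x, p_y, p_datum):
    """VR-style similarity: inner product of two supervectors
    normalized by the datum-set N-gram probabilities p(d_q | L_S)."""
    return sum(
        p_x.get(d, 0.0) * p_y.get(d, 0.0) / p
        for d, p in p_datum.items() if p > 0.0
    )

def relativize(p_x, datum_supervectors, p_datum):
    """Relative SSR: an F-dimensional supervector becomes an
    m-dimensional vector of similarities to the m datum utterances."""
    return [vr_kernel(p_x, p_s, p_datum) for p_s in datum_supervectors]
```

Note the output dimension is m (the datum set size), independent of the original supervector dimension F, which is what allows the low-dimension trade-off discussed in the experiments.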

3.2 Functional SVM supervector reconstruction

In actual test conditions, the training and test data vary in speakers, background noise, and channel conditions. To achieve higher robustness to variable test conditions, the widely used kernel methods offer efficient similarity measurements between two SVM supervectors in the PR-VSM system [30]. The geometrical structure of the SVM vector space is completely determined by the kernel, so the selection of the kernel has a crucial impact on the performance of language recognition systems. The functional SVM supervector reconstruction method defines a mixture between the functional and the original kernels, which can capture robust discriminative information from the data and yield a robust language model.

However, how to select a proper function is an open problem. Many functions could be used for reconstruction, but not every function reduces the equal error rate (EER). What we need to do is determine what kind of functions can be used in feature reconstruction. The functions need to satisfy two conditions: (1) they are monotonic, and (2) they strengthen the identifiable characteristics and discard nuisance attributes of the utterance. The proposed functional SVM supervector reconstruction method does not rely on prior knowledge to select the function used to reconstruct the supervector; a development database is used for cross-validation to select the function. The three functions selected for use in this paper are

  (a)
    $$\varphi_{\mathrm{FUN}}(p(d_i|\mathcal{L}_x)) = \sin(p(d_i|\mathcal{L}_x)) + \cos(p(d_i|\mathcal{L}_x)), \tag{11}$$
  (b)
    $$\varphi_{\mathrm{FUN}}(p(d_i|\mathcal{L}_x)) = p(d_i|\mathcal{L}_x) + p(d_i|\mathcal{L}_x)^2, \tag{12}$$
  (c)
    $$\varphi_{\mathrm{FUN}}(p(d_i|\mathcal{L}_x)) = p(d_i|\mathcal{L}_x) - p(d_i|\mathcal{L}_x)^2 + p(d_i|\mathcal{L}_x)^3. \tag{13}$$

The utterance x is mapped onto a functionalized vector space:

$$\Phi_{\mathrm{FUN}} : x \mapsto \varphi_{\mathrm{FUN}}(x) = \left[\, \varphi_{\mathrm{FUN}}(p(d_1|\mathcal{L}_x)), \varphi_{\mathrm{FUN}}(p(d_2|\mathcal{L}_x)), \ldots, \varphi_{\mathrm{FUN}}(p(d_F|\mathcal{L}_x)) \,\right]. \tag{14}$$

The three functions are all monotonic over the range of amplitudes of the SVM supervector. Selecting a TFLLR kernel, $K_{\mathrm{TFLLR}}$ is computed as

$$K_{\mathrm{TFLLR}}(\varphi_{\mathrm{FUN}}(x_i), \varphi_{\mathrm{FUN}}(x_j)) = \sum_{q=1}^{F} \frac{\varphi_{\mathrm{FUN}}(p(d_q|\mathcal{L}_{x_i}))\, \varphi_{\mathrm{FUN}}(p(d_q|\mathcal{L}_{x_j}))}{\varphi_{\mathrm{FUN}}(p(d_q|\mathcal{L}_{\mathrm{all}}))}. \tag{15}$$

The SVM output score is computed as

$$f(\varphi_{\mathrm{FUN}}(x)) = \sum_l \alpha_l\, K(\varphi_{\mathrm{FUN}}(x), \varphi_{\mathrm{FUN}}(x_l)) + d. \tag{16}$$

3.3 Perturbational SVM supervector reconstruction

For spoken language recognition, the first and most essential step is to tokenize the running speech into sound units or lattices using a phone recognizer. The phoneme error rate is around 40% to 60% [32] when tokenizing an utterance. The decoding errors are deletion, insertion, and substitution errors, which are expressed as some discrete ‘noise’ when mapped to the high-dimensional SVM supervector space (shown in Figure 2). So, here, we introduce a perturbational denoising method for the SVM supervector. Given a supervector φ(x) and some perturbation operator on φ(x), we are interested in understanding how a small perturbation added to the supervector affects the behavior of the SVM [33]. This relationship can be represented using a mapping onto a perturbational vector space.

Figure 2. Effects of decoding errors on the SVM supervector.

There are three purposes of proposing perturbational SVM supervector reconstruction method: first, adding perturbational noise to reduce the impact of noise in the SVM supervector introduced by the decoding errors; second, generating a more robust language model to provide input variety to the SVM classifier; and third, highlighting the most discriminative information of the SVM supervector and drowning the non-discriminative information into the perturbation (shown in Figure 3).

Figure 3. SVM supervector of an utterance (a) before and (b) after perturbation.

To accomplish the above goals, the type and strength of the perturbation must be selected carefully. How to define a proper perturbation is an open problem. There is a wide variety of perturbations, which can be categorized in multiple ways: (1) global versus local perturbation, (2) stochastic versus constant perturbation according to the amplitude, (3) absolute versus relative perturbation according to the relationship between the SVM supervector and the perturbation, and (4) additive versus multiplicative perturbation.

For feature supervectors in vector space, the perturbations are always discrete; they may be random within a certain range or change with the amplitude of the expected value of the supervector. We therefore consider both a deterministic perturbation $\delta = w \cdot E[p(d|\mathcal{L}_x)]$ and a stochastic perturbation $\delta = w \cdot \mathrm{uniform}(0, E[p(d|\mathcal{L}_x)])$, where $E[p(d|\mathcal{L}_x)]$ is the mean of the SVM supervector, $\mathrm{uniform}(0, E[p(d|\mathcal{L}_x)])$ is the uniform distribution between 0 and $E[p(d|\mathcal{L}_x)]$, and w is the perturbation weight. More details of the perturbation methods are discussed below.

3.3.1 Perturbational approach 1 (deterministic additive perturbation)

$$\delta = w \cdot E\!\left[p(d|\mathcal{L}_x)\right], \qquad \varphi_{\mathrm{PER}}(p(d_i|\mathcal{L}_x)) = p(d_i|\mathcal{L}_x) + \delta. \tag{17}$$

This kind of perturbation represents the assumption that the expected count of every phoneme sequence is perturbed by an equivalent additive amount.

3.3.2 Perturbational approach 2 (stochastic additive perturbation)

$$\delta = w \cdot \mathrm{uniform}\!\left(0, E\!\left[p(d|\mathcal{L}_x)\right]\right), \qquad \varphi_{\mathrm{PER}}(p(d_i|\mathcal{L}_x)) = p(d_i|\mathcal{L}_x) + \delta. \tag{18}$$

This perturbation represents the assumption that the expected count of every phoneme sequence is perturbed by an amount proportional to the frequency of the phoneme sequences.

3.3.3 Perturbational approach 3 (deterministic multiplicative perturbation)

$$\delta = w \cdot E\!\left[p(d|\mathcal{L}_x)\right], \qquad \varphi_{\mathrm{PER}}(p(d_i|\mathcal{L}_x)) = p(d_i|\mathcal{L}_x) \cdot \delta. \tag{19}$$

This kind of perturbation represents the assumption that the expected count of every phoneme sequence is perturbed by an equivalent multiplicative amount.

3.3.4 Perturbational approach 4 (stochastic multiplicative perturbation)

$$\delta = w \cdot \mathrm{uniform}\!\left(0, E\!\left[p(d|\mathcal{L}_x)\right]\right), \qquad \varphi_{\mathrm{PER}}(p(d_i|\mathcal{L}_x)) = p(d_i|\mathcal{L}_x) \cdot \delta. \tag{20}$$

This perturbation represents the assumption that the expected count of every phoneme sequence is perturbed by an amount proportional to the frequency of the phoneme sequences.

From the above, it can be seen that approaches 1 and 2 implement absolute perturbation, and approaches 3 and 4 implement relative perturbation. All are global perturbation algorithms, operating across the entire vector space. We can also investigate local perturbation using these same approaches; local perturbation is more flexible and realistic, since noise may affect only part of the expected counts. The proposed methods also do not rely on prior knowledge to inject noise into the supervector; we use a development database for cross-validation to select a suitable perturbation.
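The four global operators can be condensed into one small sketch. This is an illustrative implementation under our own naming and dict-based supervector representation, with a seeded generator for the stochastic variants:

```python
import random

def perturb(supervector, approach, w=0.1, rng=None):
    """Global perturbation operators of approaches 1-4.

    Approaches 1/2: additive (deterministic / stochastic);
    approaches 3/4: multiplicative (deterministic / stochastic).
    The perturbation scale is the supervector mean E[p(d | L_x)].
    """
    rng = rng or random.Random(0)  # seeded for reproducibility
    mean = sum(supervector.values()) / len(supervector)
    out = {}
    for d, p in supervector.items():
        if approach in (1, 3):            # deterministic
            delta = w * mean
        else:                             # stochastic, uniform in [0, mean]
            delta = w * rng.uniform(0.0, mean)
        out[d] = p + delta if approach in (1, 2) else p * delta
    return out
```

The weight w and the approach would be chosen by cross-validation on the development set, as described above.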

The utterance x is mapped onto a perturbational vector space:

$$\Phi_{\mathrm{PER}} : x \mapsto \varphi_{\mathrm{PER}}(x) = \left[\, \varphi_{\mathrm{PER}}(p(d_1|\mathcal{L}_x)), \varphi_{\mathrm{PER}}(p(d_2|\mathcal{L}_x)), \ldots, \varphi_{\mathrm{PER}}(p(d_F|\mathcal{L}_x)) \,\right], \tag{21}$$

where $\varphi_{\mathrm{PER}}(x)$ is a perturbation of $\varphi(x)$. Selecting a TFLLR kernel, $K_{\mathrm{TFLLR}}$ is computed as

$$K_{\mathrm{TFLLR}}(\varphi_{\mathrm{PER}}(x_i), \varphi_{\mathrm{PER}}(x_j)) = \sum_{q=1}^{F} \frac{\varphi_{\mathrm{PER}}(p(d_q|\mathcal{L}_{x_i}))\, \varphi_{\mathrm{PER}}(p(d_q|\mathcal{L}_{x_j}))}{\varphi_{\mathrm{PER}}(p(d_q|\mathcal{L}_{\mathrm{all}}))}. \tag{22}$$

The SVM output score is computed as

$$f(\varphi_{\mathrm{PER}}(x)) = \sum_l \alpha_l\, K(\varphi_{\mathrm{PER}}(x), \varphi_{\mathrm{PER}}(x_l)) + d. \tag{23}$$

4 Homogeneous ensemble language recognition system

The architecture of the HEPLR system is shown in Figure 4. All the SVM supervectors are reconstructed into the corresponding vector spaces and fused at the vector level for training and testing. In this paper, we use the classical method of vector fusion, which groups several sets of reconstructed SVM supervectors into one large composite supervector [34]. Suppose $\varphi_{\mathrm{REL}}^{N_1^1}(x), \ldots, \varphi_{\mathrm{REL}}^{N_1^m}(x)$, $\varphi_{\mathrm{FUN}}^{N_2^1}(x), \ldots, \varphi_{\mathrm{FUN}}^{N_2^m}(x)$, and $\varphi_{\mathrm{PER}}^{N_3^1}(x), \ldots, \varphi_{\mathrm{PER}}^{N_3^m}(x)$ are the reconstructed SVM supervectors for an input utterance x. The concatenated SVM supervectors can be represented as $\varphi_{\mathrm{REL}}(x)$, $\varphi_{\mathrm{FUN}}(x)$, and $\varphi_{\mathrm{PER}}(x)$, respectively. If the i-th constituent supervector is $d_i$-dimensional, the concatenated SVM supervector is $(d_1 + \cdots + d_m)$-dimensional. The concatenated SVM supervectors are defined by

$$\varphi_{\mathrm{REL}}(x) = \left[\, w_{N_1^1}\, \varphi_{\mathrm{REL}}^{N_1^1}(x), \ldots, w_{N_1^m}\, \varphi_{\mathrm{REL}}^{N_1^m}(x) \,\right], \tag{24}$$
Figure 4. Architecture of the HEPLR system.

$$\varphi_{\mathrm{FUN}}(x) = \left[\, w_{N_2^1}\, \varphi_{\mathrm{FUN}}^{N_2^1}(x), \ldots, w_{N_2^m}\, \varphi_{\mathrm{FUN}}^{N_2^m}(x) \,\right], \tag{25}$$
$$\varphi_{\mathrm{PER}}(x) = \left[\, w_{N_3^1}\, \varphi_{\mathrm{PER}}^{N_3^1}(x), \ldots, w_{N_3^m}\, \varphi_{\mathrm{PER}}^{N_3^m}(x) \,\right], \tag{26}$$

where $w_{N_j^i} = \min_n E_j^n / E_j^i$ ($j = 1, 2, 3$), with $E_j^i$ the prior knowledge of the EER performance of subsystem i on the development data, so the best-performing subsystem receives weight 1. The logistic regression optimized weighting (LROW) method is used to optimize the reconstructed SVM supervector weighting coefficients. Since not all of the SVM supervector reconstruction subsystems are effective when fused, we also extend the work by formulating quantitative measures to select the subsystems for fusion. The output score of the SVM classifier is computed as follows:

$$f(\varphi_{*}(x)) = \sum_l \alpha_l\, K(\varphi_{*}(x), \varphi_{*}(x_l)) + d, \tag{27}$$

where the subscript '∗' stands for any of the reconstruction methods, and $\varphi_{*}(x_l)$ are support vectors obtained from the reconstructed SVM supervectors using the Mercer condition.

As mentioned previously, in the HEPLR language recognition system, the training stage is carried out between the positive set and negative set with one-versus-rest strategy.
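The weighted concatenation of Eqs. (24)-(26) can be sketched as follows. This is an illustrative reading of the weighting rule, under the assumption (from the formula above) that each subsystem is scaled by the ratio of the best development EER to its own; names are ours:

```python
def concat_supervectors(subsystem_vectors, dev_eers):
    """Concatenate reconstructed supervectors into one composite
    vector (Eqs. 24-26). Each subsystem is scaled by
    (best development EER) / (its development EER), so the best
    subsystem keeps weight 1 and weaker ones are attenuated.
    """
    best = min(dev_eers)
    fused = []
    for vec, eer in zip(subsystem_vectors, dev_eers):
        w = best / eer
        fused.extend(w * v for v in vec)
    return fused
```

In the full system, these weights would then be refined by the LROW method rather than used as-is.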

The linear discriminant analysis-maximum mutual information (LDA-MMI) method is used to maximize the posterior probabilities of all the belief score vectors [35], using the objective function [36]:

$$F_{\mathrm{MMI}}(\lambda) = \sum_i \log \frac{p(x_i|\lambda_{g(i)})\, P(g(i))}{\sum_j p(x_i|\lambda_j)\, P(j)}, \tag{28}$$

where $g(i)$ indicates the class label of $x_i$ and $P(j)$ denotes the prior probability of class j. Vector fusion is implemented directly as

$$x = \left[\, w_1 f(\varphi_{\mathrm{REL}}(x)),\; w_2 f(\varphi_{\mathrm{FUN}}(x)),\; w_3 f(\varphi_{\mathrm{PER}}(x)) \,\right]. \tag{29}$$

The probability density function $p(x|\lambda)$ is a Gaussian mixture model defined on the N-dimensional vector x:

$$p(x|\lambda) = \sum_m \omega_m\, \mathcal{N}(x; \mu_m, \Sigma_m). \tag{30}$$

The proposed homogeneous ensemble language recognition system has three advantages. First, SVM supervector reconstruction provides vector space modeling diversification for richer language identification information. Second, in the HEPLR system, the subsystems share the same preprocessing steps for feature extraction, decoding, and expected counting, which minimizes additional computational cost. Third, fusing the reconstructed SVM supervector with the original supervector at the vector level means that more information can be retained than that given by score fusion.

5 Experimental setup

5.1 Baseline language recognition system setup

The TRAPs/NN phonotactic language recognizers developed by BUT [37], based on phone lattices, N-gram counts, and SVM scoring, are used as baseline systems. An energy-based voice activity detector that splits and removes long-duration non-speech segments from the signals is applied first. Following this, the BUT decoders for Czech (CZ), Hungarian (HU), and Russian (RU) are applied to compute phone posterior probabilities, as used in NIST LRE tasks by many groups [38],[39]. The phone inventory is 43 for Czech, 59 for Hungarian, and 50 for Russian. The posterior probabilities are fed into the HVite decoder from HTK to produce phone lattices, which encode multiple hypotheses with acoustic likelihoods. The N-gram counts are produced by lattice-tool from SRILM (SRI International, Menlo Park, CA, USA) [40]. The LIBLINEAR tool [41] for multiclass SVMs with linear kernels is applied to produce SVM scores. Finally, the LDA-MMI algorithm [42] is used for score calibration and fusion.

5.2 Training, development, and test datasets

Evaluation is carried out on the NIST LRE 2009 tasks. The data includes 41,793 utterances of 30-, 10-, and 3-s nominal duration, closed condition. The NIST LRE 2009 core task is to recognize 23 languages, including Amharic, Bosnian, Cantonese, Creole, Croatian, Dari, American English, Indian English, Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu, and Vietnamese. The evaluation involves radio broadcast and conversational telephone speech channel conditions.

The training data comes from different sources including CallHome, CallFriend, OGI, OHSU, VOA, and the development corpora for the 2003, 2005, and 2007 NIST LRE evaluations.

About 25,000 utterances, selected randomly from VOA and the 2003, 2005, and 2007 NIST LRE datasets, are used as development data.

5.3 Evaluation measures

In this work, the performance of language recognition systems is compared using (1) EER and (2) the average cost performance Cavg defined by NIST [43], both obtained with a one-versus-rest strategy.

6 Experimental results and discussion

We demonstrate the effectiveness of our approaches on the NIST LRE 2009 tasks under the 30-, 10-, and 3-s conditions. Results are shown in Tables 1, 2, 3, 4, 5, and 6 and Figures 5, 6, 7, and 8 in the following sections. The EER and Cavg performance of individual subsystems and fusions is also shown in the tables below for reference.

Figure 5. Performance of the relative SVM supervector reconstruction subsystem versus dimension. NIST LRE 2009, 30-s, HU frontend (EER and Cavg in percent). (a) TFLLR kernel. (b) RBF kernel. (c) TFLLR and RBF kernel.

Figure 6. Performance of functional SVM supervector reconstruction subsystems. NIST LRE 2009, 30-s, HU frontend. '+' indicates fusion (EER and Cavg in percent).

Figure 7. Performance of perturbational SVM supervector reconstruction subsystems. NIST LRE 2009, 30-s, HU frontend (EER and Cavg in percent). (a) Approach 1. (b) Approach 2. (c) Approach 3. (d) Approach 4.

Figure 8. DET curves of the baseline system and the HEPLR system for NIST LRE 2009.

Table 1 Performance of baseline language recognition system
Table 2 Performance of relative SVM supervector reconstruction subsystem, TFLLR and RBF kernel
Table 3 Performance of functional SVM supervector reconstruction subsystems
Table 4 Performance of perturbational SVM supervector reconstruction subsystems
Table 5 Comparison of baseline SLR system and HEPLR system
Table 6 Comparison of real-time factor for language recognition systems

6.1 Baseline PR-VSM system

Table 1 shows the EER and Cavg performance for the NIST LRE 2009 language recognition tasks using the baseline subsystems. In this work, the dimension of the possible 3-gram SVM supervector produced by the single Hungarian (HU) phone recognizer with 58 phones is 58×58×58=195,112. The SVM supervector dimensions for the Russian (RU) and Czech (CZ) recognizers are 117,649 and 74,088, respectively.
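These dimensions are simply the cubes of the per-frontend phone counts used for 3-grams (58 for HU, 49 for RU, and 42 for CZ; each is one fewer than the inventories listed in Section 5.1, presumably because one unit is excluded). A quick arithmetic check:

```python
# Possible 3-gram supervector dimension = (number of phones) ** 3
dims = {"HU": 58 ** 3, "RU": 49 ** 3, "CZ": 42 ** 3}
assert dims == {"HU": 195112, "RU": 117649, "CZ": 74088}
```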

6.2 Relative SVM supervector reconstruction

In this paper, 13,000 conversations are randomly selected from the 40 languages of the 2003, 2005, and 2007 NIST LRE and the VOA, CallHome, and CallFriend corpora. These are used as the datum dataset to build the relative SVM supervector reconstructor.

Figure 5 shows the performance of the relative SVM supervector reconstruction subsystems whose SVM classifiers use the TFLLR and RBF kernels, respectively, as well as their fusion. Table 2 shows that the performance of the relative SVM supervector reconstruction subsystem is better than the baseline and improves slowly with increasing size of the datum dataset. A lower-dimension relative SVM supervector can be chosen to trade off performance against computational cost. Language recognition performance for short utterances is significantly improved in the relative SSR subsystem compared to the baseline: the original supervector cannot describe short utterances precisely because of insufficient phoneme data, while the relative reconstructed SVM supervectors use the relationship between a short utterance and a large set of datum utterances, which is a richer representation. The experimental results show that the relative SSR subsystem outperforms the baseline and can obtain better performance with relative features using a low-dimension SVM supervector.

6.3 Functional SVM supervector reconstruction

The language recognition results of functional SVM supervector reconstruction subsystems are given in Figure 6 and Table 3. The results using this approach were similar to or slightly worse than the baseline system in 30-s test condition, but outperform the baseline system in the 10- and 3-s test conditions.

6.4 Perturbational SVM supervector reconstruction

Figure 7 and Table 4 describe the results of the four perturbational approaches. Overall, approach 2 yielded better results (2.20%, 6.59%, and 20.93% EER) than the other approaches. The perturbational SVM supervector reconstruction subsystems performed consistently better than the baseline subsystem, with those based on approach 2 performing best. We hypothesize that approach 2 outperforms the other perturbation methods because the distribution of the perturbation better matches the distribution of the noise. The perturbation approach adds robustness to the language modeling.

6.5 SVM supervector reconstruction

From the experimental results and discussion, it can be concluded that some of the reconstruction methods (relative SSR) are better at identifying the language of short utterances, while others (functional SSR and perturbational SSR) are better at recognizing long utterances. Because these errors are not highly correlated, we can fuse the results to harness the complementary behavior among subsystems and improve language recognition performance.
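Score-level fusion of complementary subsystems can be as simple as a weighted sum of per-language scores. The weights and scores below are purely illustrative; the paper uses an LDA-MMI backend rather than fixed hand-chosen weights.

```python
import numpy as np

# Hypothetical per-language scores from three SSR subsystems for one
# utterance (rows: subsystems, columns: candidate languages).
scores = np.array([
    [1.2, -0.3,  0.1],   # relative SSR
    [0.9,  0.2, -0.5],   # functional SSR
    [1.1, -0.1,  0.0],   # perturbational SSR
])

# Linear score fusion with illustrative weights; a trained backend
# (e.g., LDA-MMI) would learn this combination from data.
weights = np.array([0.4, 0.3, 0.3])
fused = weights @ scores

best_language = int(np.argmax(fused))
```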

Figure 8 shows DET curves of the baseline system versus the HEPLR system for NIST LRE 2009. Table 5 gives the corresponding performance numbers for all configurations. These results show that the SSR approaches proposed in this paper outperformed the baseline system in terms of EER and C_avg when all subsystems are fused.

6.6 Real-time factors

Table 6 shows the real-time (RT) factors of each part of the SSR system. From Table 6, we can see that decoding is the dominant cost. Compared to the PR-VSM baseline system, the computational cost rises to about 1.5 times the baseline for the relative SSR, while the functional SSR and perturbational SSR add almost no extra cost.

7 Computational cost

Let $F$, $M$, and $M_{\mathrm{datum}}$ denote the dimension of the phonotactic feature supervector of an utterance, the number of utterances in the training dataset, and the number of utterances in the datum dataset, respectively. Let $c_{\varphi}$ denote the computational cost of the mapping from $x$ to $\varphi(x)$, and let $c_{\mathrm{modeling}}(F, M)$ denote the computational cost of modeling the languages, which depends on $F$ and $M$. Then, the computational cost of the baseline system is

$$c_{\mathrm{baseline}} = M \cdot c_{\varphi} + c_{\mathrm{modeling}}(F, M), \tag{31}$$

$$c_{\varphi} = c_{\mathrm{PreProcessing}} + c_{\mathrm{FeatureExtract}} + c_{\mathrm{Decoding}} + c_{\mathrm{NgramCounting}}, \tag{32}$$

where $c_{\mathrm{PreProcessing}}$, $c_{\mathrm{FeatureExtract}}$, $c_{\mathrm{Decoding}}$, and $c_{\mathrm{NgramCounting}}$ denote the computational costs of preprocessing, feature extraction, decoding, and N-gram counting, respectively.

7.1 Relative SVM supervector reconstruction

Let $c_{\mathrm{inner}}(F)$ denote the computational cost of the inner product of two $F$-dimensional supervectors. Then, the computational cost of the relative SVM supervector reconstruction system is

$$c_{\mathrm{REL}} = M \cdot c_{\varphi} + M_{\mathrm{datum}} \cdot c_{\varphi} + c_{\mathrm{modeling}}(M_{\mathrm{datum}}, M) + M \cdot M_{\mathrm{datum}} \cdot c_{\mathrm{inner}}(F). \tag{33}$$

Usually, $c_{\mathrm{modeling}}(M_{\mathrm{datum}}, M) < c_{\mathrm{modeling}}(F, M) \ll M \cdot c_{\varphi}$ and $M \cdot M_{\mathrm{datum}} \cdot c_{\mathrm{inner}}(F) \ll M \cdot c_{\varphi}$, so

$$\frac{c_{\mathrm{REL}}}{c_{\mathrm{baseline}}} \approx \frac{M \cdot c_{\varphi} + M_{\mathrm{datum}} \cdot c_{\varphi}}{M \cdot c_{\varphi}} = 1 + \frac{M_{\mathrm{datum}}}{M}. \tag{34}$$

In this paper, $M = 30996$, and for the RU frontend $F = 117649$. When $M_{\mathrm{datum}} = 13000$, $c_{\mathrm{REL}}/c_{\mathrm{baseline}} \approx 1.4194$. That means the relative SVM supervector reconstruction system takes 41.94% extra computation and achieves 11.84%, 6.42%, and 15.92% relative improvements for the 30-, 10-, and 3-s test conditions, respectively, compared to the baseline.
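The cost-ratio arithmetic of Eq. (34) can be checked directly with the paper's values:

```python
# Cost ratio of Eq. (34): c_REL / c_baseline ≈ 1 + M_datum / M,
# using the paper's dataset sizes.
M = 30996         # training utterances
M_datum = 13000   # datum utterances

ratio = 1 + M_datum / M   # c_REL / c_baseline
extra = M_datum / M       # extra computation over the baseline

print(f"{extra:.2%}")     # prints "41.94%"
```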

7.2 Functional SVM supervector reconstruction

Let $c_{\varphi_{\mathrm{FUN}}}$ denote the computational cost of mapping $\varphi(x)$ to $\varphi_{\mathrm{FUN}}(x)$. Then, the computational cost of the functional SVM supervector reconstruction system is

$$c_{\mathrm{FUN}} = M \cdot c_{\varphi} + M \cdot c_{\varphi_{\mathrm{FUN}}} + c_{\mathrm{modeling}}(F, M), \tag{35}$$

Because preprocessing, feature extraction, decoding, and N-gram counting are far more expensive than the functional computation in this paper, $M \cdot c_{\varphi_{\mathrm{FUN}}} \ll M \cdot c_{\varphi}$. The computational cost of modeling the languages can be considered equal to that of the baseline. For the RU frontend, the functional SVM supervector reconstruction system therefore takes almost no extra computation and achieves 8.84%, 13.32%, and 14.76% relative improvements for the 30-, 10-, and 3-s test conditions, respectively, compared to the baseline.

7.3 Perturbational SVM supervector reconstruction

Let $c_{\varphi_{\mathrm{PER}}}$ denote the computational cost of adding the perturbation to $\varphi(x)$. Then, the computational cost of the perturbational SVM supervector reconstruction system is

$$c_{\mathrm{PER}} = M \cdot c_{\varphi} + M \cdot c_{\varphi_{\mathrm{PER}}} + c_{\mathrm{modeling}}(F, M), \tag{36}$$

Because preprocessing, feature extraction, decoding, and N-gram counting are far more expensive than the perturbational computation in this paper, $M \cdot c_{\varphi_{\mathrm{PER}}} \ll M \cdot c_{\varphi}$. The computational cost of modeling the languages can be considered equal to that of the baseline. For the RU frontend, the perturbational SVM supervector reconstruction system therefore takes almost no extra computation and achieves 6.63%, 5.13%, and 9.05% relative improvements for the 30-, 10-, and 3-s test conditions, respectively, compared to the baseline.

8 Conclusions

In this article, we investigated a strategy of SVM supervector reconstruction that provides vector space modeling diversification to improve the performance and robustness of language recognition with very low additional computational cost. A variety of SVM supervector reconstruction methods are employed to develop diversified SVM supervectors: relative SSR, perturbational SSR, and functional SSR. The relative SSR method represents an utterance through its relationship to a set of datum utterances.

Perturbational SSR reconstructs the SVM supervector into a slightly perturbed version, improving language recognition performance. Functional SSR derives effective kernel mixtures, yielding more robust language models. These approaches involve no significant additional computation compared to a baseline phonotactic system, but represent a way to extract more information from existing decodings.

Experimental results of the proposed HEPLR system on the NIST LRE 2009 evaluation set show better performance than the baseline system. When we fuse the three subsystems at the score level for further improvements, we achieve 1.39%, 3.63%, and 14.79% EER for the 30-, 10-, and 3-s closed-set test conditions, respectively. This corresponds to 6.06%, 10.15%, and 10.53% relative improvements.


Acknowledgements

This project is supported by the National Natural Science Foundation of China under grant nos. 61370034, 61273268, and 61403224.

Corresponding author

Correspondence to Wei-Qiang Zhang.

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.


Cite this article

Liu, WW., Zhang, WQ., Johnson, M.T. et al. Homogenous ensemble phonotactic language recognition based on SVM supervector reconstruction. J AUDIO SPEECH MUSIC PROC. 2014, 42 (2014). https://doi.org/10.1186/s13636-014-0042-5
