Phone lattice reconstruction for embedded language recognition in LVCSR
EURASIP Journal on Audio, Speech, and Music Processing volume 2012, Article number: 15 (2012)
Abstract
An increasing number of multilingual applications require language recognition (LRE) as a frontend, but desire low additional computational cost. This article demonstrates a novel architecture for embedding phone-based language recognition into a large vocabulary continuous speech recognition (LVCSR) decoder by sharing the same decoding process but generating separate lattices. To compensate for the prior bias introduced by the pronunciation dictionary and the language model of the LVCSR decoder, three different phone lattice reconstruction algorithms are proposed. The underlying goals of these algorithms are to override pronunciation and grammar restrictions to provide richer phonetic information. All of the new algorithms incorporate a vector space modeling backend for improved LRE accuracy. Evaluated on a Mandarin/English detection task, the proposed integrated LVCSR-LRE system using the frame-expanded N-best phone lattice achieves comparable performance to a state-of-the-art phone recognition-vector space modeling (PR-VSM) system, but with an added computational cost three times lower than that of a separate PR-VSM system.
1 Introduction
Applications such as speech-to-speech translation systems and dialogue systems often work in a multilingual environment, so it is necessary to rapidly identify the language being spoken. Even for monolingual systems, language recognition (LRE) is necessary for out-of-language (OOL) detection [1, 2] to filter out segments of non-target languages. One common approach to implementing language recognition is to make use of the automatic speech recognizers (ASRs) which already exist in the system. The user's utterance can be decoded by a recognizer for each supported language, with the language type determined by the recognizer that returns the highest ASR score. This approach typically obtains high accuracy for LRE [3], but has two obvious disadvantages: it works only with multiple ASR frontends, and it is not suitable for real-time applications, since the LRE decision can be made only after all the decoding processes complete. If there are more than two or three supported languages, the high computational cost can make this approach infeasible. Another approach is to build a dedicated language identifier which runs separately before any other processing steps take place. This approach requires less computational power, but causes a delay in the response of the system. A third approach is to perform LRE in parallel with speech recognition in an assumed core language, restarting the recognition process if the initial hypothesized language is incorrect [4]. The parallel architecture has a faster average response time, with a delayed response only when the initial hypothesized language is incorrect. However, one deficiency of this architecture is that the LRE and ASR processes are independent of each other, leading to additional computational cost. By using an ASR-based LRE method, the computational cost can be significantly reduced by tightly integrating the two.
Lattice-based LRE techniques [5–8], which model the context of high-level features such as words or phones, are effective methods that could be suitable candidates for implementing this approach. In this article, we introduce a method to improve the parallel LRE-ASR architecture by embedding a phone recognition followed by vector space modeling (PR-VSM) LRE backend into an LVCSR decoder to reduce total computational cost. This new embedded architecture has substantially lower complexity than the existing ones and thus requires fewer computational resources. The difficulty of this integration is that the VSM backend cannot directly make use of phone lattices or transcriptions generated by the LVCSR decoder because of the system's integrated pronunciation dictionary and language model (LM). In an LVCSR decoding network, the pronunciation dictionary restricts within-word phone connections and the n-gram LM constrains cross-word connections to provide high recognition accuracy. In contrast, a standard PR-VSM system uses a simple phone loop and null grammar, which guarantees that each phone has equal probability of being recognized. If the LVCSR lattices were used for LRE directly, they would introduce a heavy bias towards the initial target language and cause great performance degradation. To overcome this bias and improve LRE accuracy, we propose to use a separate lattice reconstruction algorithm using the internal backtracking information collected during LVCSR decoding. In this article, three different lattice reconstruction algorithms are evaluated, and their LRE accuracies are tested and compared.
There are, of course, other alternatives to VSM for the LRE backend, such as an LM approach [5]. In this study, we choose VSM because of its superior performance [9, 10]. There are also other LRE techniques, such as those that use long term spectral features like shifted delta cepstrum [11], which do not share the same underlying acoustic features or decoding structure with LVCSR, and would not be as well suited for direct integration as the phone recognition approach. By using the same core recognition engine, the new method proposed here allows for strong LRE accuracy accompanied by significantly reduced computational cost.
This article is outlined as follows. Section "Embedding LRE into LVCSR" introduces the proposed embedded LRE architecture, while Section "LVCSR decoding and phone lattice generation" outlines the LVCSR decoding and phone lattice generation, and Section "Phone lattice reconstruction algorithms" details the three phone lattice reconstruction algorithms. Section "Vector space modeling LRE backend" gives a brief description of the specific VSM LRE backend used in this article. Section "Experimental results" presents experimental results and some discussions.
2 Embedding LRE into LVCSR
Figure 1 illustrates the proposed embedded LRE architecture and a typical work flow for a bilingual application. In this framework, language L1 is initially predicted. As the user begins to speak, the speech is sent to L1's decoder. During L1's decoding process, phone backtracking information is collected and used to generate phone lattices, which are used by the VSM backend to identify the language type. If the recognized language is the same as L1, L1's decoding process continues, as does L1's post-processing, including tasks such as translation and keyword spotting. However, if the language recognition classifies the speech as a different language, for example L2, the current decoding process is terminated and L2's recognition decoder is activated. The final output would then be generated by L2's processing chain.
The proposed architecture differs from [4] in three aspects. First, the language recognition is no longer a separate process, but uses the existing LVCSR decoder as its lattice generation frontend, saving computational time. Second, in [4] only phone sequences are used for language recognition, whereas in the proposed architecture, phone lattices are used. Lattices provide significant additional information about non-optimal decoding paths, which may be especially useful when the recognition accuracy is low. Third, a VSM backend is used instead of a language model; the VSM approach has recently been shown to have superior performance [9].
3 LVCSR decoding and phone lattice generation
In this section, we give mathematical descriptions of each part of the embedded architecture. Given an observation sequence O, the LVCSR decoder decides the most probable word sequence W* using Bayes' rule:
\[ W^{*} = \arg\max_{W} p(W \mid O) = \arg\max_{W} p(O \mid W)\, p(W) \tag{1} \]
With phone-based acoustic models, Equation (1) can be further written as:
\[ W^{*} = \arg\max_{W} p(O \mid P)\, p(P \mid W)\, p(W) \tag{2} \]
where P is a phone sequence corresponding to word sequence W. During decoding, if we record each word hypothesis along with its phone sequence, we can obtain phone-level recognition results. This is the most common method for phone lattice generation in LVCSR, and the generated phone lattices can be viewed as exact phone alignments of the word lattice. For clarity, we refer to this conventional phone hypothesis approach as "word-based phone alignment", and the generated phone lattices as "word-based phone lattices".
In Equation (2), p(W) is the language model constraining inter-word connections, and p(P|W) is the pronunciation dictionary restricting phone connections within each word. In most cases, the relation between P and W is a one-to-one mapping. As will be shown in Section "Experimental results", the word-based phone lattices obey these constraints, resulting in a heavy bias towards the target language and causing significant LRE performance degradation. To remove these constraints and improve LRE accuracy, we propose to use an alternative construction algorithm for phone lattice generation.
4 Phone lattice reconstruction algorithms
The phone lattice reconstruction algorithms improve LRE performance by removing pronunciation and grammar constraints to provide a richer set of decoding alternatives and context. A "bag of phones" phone hypothesis approach is used to achieve this goal. As opposed to "word-based phone alignment", the "bag of phones" approach collects phone hypotheses generated at each frame during LVCSR decoding into a phone bag regardless of the word each belongs to. If a phone hypothesis of the same identity and start/end frame number already exists, for implementation simplicity only the one with the larger likelihood score is kept. Once a phone hypothesis is put into the bag, it remains, even if the path that it belongs to is pruned. In this way, we can identify a large number of prospective phone hypotheses that never appear in the word-based phone lattices. To avoid data sparseness in the VSM LRE backend, left and right triphone contexts are removed and only the underlying monophones are preserved in the phone bag. These collected phone hypotheses are then used to construct new phone lattices from scratch.
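The collection rule described above can be sketched in a few lines of Python. This is only an illustration: the class name `PhoneBag`, the method names, and the HTK-style `left-center+right` triphone label format are assumptions, not details given in the article.

```python
class PhoneBag:
    """Collects monophone hypotheses keyed by (phone, start_frame, end_frame)."""

    def __init__(self):
        # (phone, start, end) -> best acoustic log-likelihood seen so far
        self.hyps = {}

    @staticmethod
    def to_monophone(label):
        # Assumed label convention "left-center+right"; plain monophones pass through.
        return label.split("-")[-1].split("+")[0]

    def add(self, label, start, end, ac):
        # Strip triphone context, then keep only the higher-scoring duplicate.
        key = (self.to_monophone(label), start, end)
        if key not in self.hyps or ac > self.hyps[key]:
            self.hyps[key] = ac


bag = PhoneBag()
bag.add("a-b+c", 0, 5, -10.0)
bag.add("x-b+y", 0, 5, -8.0)   # same monophone and span, better score: replaces
bag.add("b", 0, 5, -12.0)      # worse score: ignored
```

Because entries are never deleted, hypotheses from paths that the decoder later prunes stay available for lattice reconstruction.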
Another reason for ignoring phone context-dependency constraints is that this allows the generated lattices to contain richer monophone hypotheses. In the following algorithms, an N-best phone list is constructed at each frame from the "bag of phones" by choosing phone hypotheses with the top-N largest likelihood scores. Considering storage and computational resource limitations, we cannot choose a very large N (N = 10 in our experiments). Since context-dependent variants of the same phone are acoustically similar and thus have correlated likelihoods, allowing multiple context-dependent variants in the N-best list has the effect of limiting phonetic diversity, particularly for smaller values of N.
To maintain clear notation, we define a lattice L = (N, A, n_{start}, n_{end}) as a directed acyclic graph (DAG) which is specified by a finite set of nodes N, a finite set of links (or arcs) A, a start node n_{start} ∈ N and an end node n_{end} ∈ N. Given an arbitrary node n ∈ N, T [n] denotes its occurrence time identified by frame number, and N [t] ∈ N is the inverse function returning the corresponding lattice node at time t. Given a link a ∈ A, S [a] denotes its start node, E [a] its end node, I [a] its hypothesis identity, ac [a] its acoustic likelihood, lm [a] its language model likelihood and pr [a] its posterior probability. |N| represents the total number of nodes, and |A| represents the number of links. For a phone hypothesis p, S [p] denotes its start frame number, E [p] its end frame number, I [p] its phone identity and ac [p] its acoustic likelihood.
4.1 Time-aligned phone lattice
In the first lattice construction algorithm, we apply a time alignment algorithm to the word-based phone lattices to remove the dictionary and LM constraints. This method is called "time-aligned" because it clusters the start and end time nodes of each lattice link, and thus aligns all the links topologically into a linear (sausage-like) structure, where each alignment corresponds to an equivalence class of phone hypotheses, and the ordering of the equivalence classes is consistent with that of the original lattice. The phone connections in the word-based phone lattice are broken, and the phone arcs are reorganized by their partial ordering relations. Therefore, the pronunciation and grammar constraints no longer exist. This approach reuses phone hypotheses that already exist in the word-based phone lattices, and does not change the conventional LVCSR decoding procedure. The proposed algorithm is similar to the lattice alignment algorithm used in building a confusion network [12]. A heuristic clustering procedure is used to group time-overlapped links into clusters based on lattice topology, time and phonetic hypotheses, and transform the lattice into a linear graph where all paths pass through all nodes. The precedence order of the links in the original lattice is preserved. Although some time information is lost, the aligned phone lattice can be used for language recognition, because the phone connection patterns characterized by their n-gram statistics are much more important. The primary difference between the proposed algorithm and the confusion network algorithm is that we do not measure phonetic similarity between phones, but only cluster based on time. Similarly to the confusion network approach, we first define an equivalence class and partial ordering. Given a link a ∈ A, [a] denotes its equivalence class of aligned phone hypotheses. Given a lattice, we can define a partial ordering ≤ on its links.
The ordering is defined by the following: for a, b ∈ A, a ≤ b iff a = b or E [a] = S [b] or ∃c ∈ A such that a ≤ c and c ≤ b. Put more simply, a ≤ b means a comes before b. For two equivalence classes [a] and [b], the partial ordering [a] ≼ [b] implies a_{1} ≤ b_{1}, ∀a_{1} ∈ [a], b_{1} ∈ [b]. Given a phone lattice, the basic alignment procedure is that of finding an ordered link equivalence that is consistent with the lattice ordering and is also a total ordering. The algorithm is summarized as follows:
Step 1. Decode the input speech utterance with an LVCSR decoder, collecting phone backtracking information using word-based phone alignment to generate a word-based phone lattice. Left and right triphone contexts in the generated lattice are removed.
Step 2. Initialize link equivalence classes by phone identity and start and end times t_{1}, t_{2}:
\[ C_{p,t_{1},t_{2}} = \{ a \in A : I[a] = p,\ T[S[a]] = t_{1},\ T[E[a]] = t_{2} \} \]
Let X = {C_{p,t_{1},t_{2}} : for all p, t_{1} and t_{2}} be the set of initial equivalence classes.
Step 3. From X, choose the two most similar unordered equivalence classes C_{1}^{*} and C_{2}^{*}, and merge them into a new equivalence class C_{new}:
\[ C_{\text{new}} = C_{1}^{*} \cup C_{2}^{*} \]
The similarity between two equivalence classes C_{1} and C_{2} is measured by:
where overlap (a_{1}, a_{2}) is defined as the time overlap between the two links normalized by the sum of their lengths.
Step 4. ∀C ∈ X, update partial ordering relations: Set C ≼ C_{new}, if C ≼ C_{1}^{*} or C ≼ C_{2}^{*}; set C_{new} ≼ C, if C_{1}^{*} ≼ C or C_{2}^{*} ≼ C.
Step 5. ∀(C_{1}, C_{2}) ∈ X × X, update partial ordering relations: Set C_{1} ≼ C_{2}, if C_{1} ≼ C_{1}^{*} and C_{2}^{*} ≼ C_{2}, or C_{1} ≼ C_{2}^{*} and C_{1}^{*} ≼ C_{2}.
Step 6. Set X = X ∪ {C_{new}} \ {C_{1}^{*}, C_{2}^{*}}.
Step 7. Repeat Steps 3 to 6 until there are no unordered equivalence classes left.
Step 8. Output the converted lattice.
This algorithm has a time complexity of O(|A|^{3}). Figure 2 gives an example of a lattice constructed using this algorithm. For readability, node and link scores are not marked. As shown, the phone hypotheses are aligned topologically, with pronunciation constraints removed and duplicated paths from different LM states merged. A side effect of the algorithm is that some phones which originally formed a sequence, for example "IAO3", "M", and "AI4" in Figure 2b, end up as a parallel set. Our experimental study showed that this usually happens under three conditions: (1) the durations of the paralleled phones are very short; for example, the durations of "M" and "AI4" are 0.03 s and 0.04 s, respectively, which are much shorter than those of other phones in the lattice; (2) the time intervals of the paralleled phones have significant overlap with other phones; for example, "M" and "AI4" are completely overlapped with "IA2" and "IA4"; (3) the paralleled phones are often misrecognized, with very low posterior probabilities. In the example of Figure 2, the log posterior probabilities of "IAO3", "M" and "AI4" are -72.57 (not marked in the figure), so these links have little impact on the expected n-gram counts and a negligible impact on language recognition performance.
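The clustering loop of Steps 2 to 7 can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the function names are hypothetical, links are reduced to `(phone, start_frame, end_frame)` tuples, lattice nodes are identified with frame numbers, and the class similarity is assumed to be the best pairwise time overlap between members, which is one reading of the time-only clustering described above.

```python
from itertools import combinations


def overlap(a, b):
    """Time overlap of two links, normalised by the sum of their lengths."""
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return max(0, shared) / ((a[2] - a[1]) + (b[2] - b[1]))


def similarity(c1, c2):
    # Assumption: best pairwise overlap between members of the two classes.
    return max(overlap(a, b) for a in c1 for b in c2)


def time_align(links):
    """links: iterable of (phone, start_frame, end_frame) with end > start."""
    links = sorted(set(links))
    # Link precedence: b follows a if b starts where a ends, closed transitively.
    succ = {a: {b for b in links if a[2] == b[1]} for a in links}
    for a in sorted(links, key=lambda l: -l[1]):   # later links first
        for b in list(succ[a]):
            succ[a] |= succ[b]
    # Step 2: one class per distinct (phone, start, end).
    classes = {i: {a} for i, a in enumerate(links)}
    index = {a: i for i, a in enumerate(links)}
    prec = {(index[a], index[b]) for a in links for b in succ[a]}
    next_id = len(links)
    while True:
        unordered = [(i, j) for i, j in combinations(sorted(classes), 2)
                     if (i, j) not in prec and (j, i) not in prec]
        if not unordered:                           # Step 7: total order reached
            break
        # Step 3: pick the most similar unordered pair.
        i, j = max(unordered,
                   key=lambda ij: similarity(classes[ij[0]], classes[ij[1]]))
        before = {c for c in classes if (c, i) in prec or (c, j) in prec}
        after = {c for c in classes if (i, c) in prec or (j, c) in prec}
        # Step 6: replace the pair by the merged class.
        classes[next_id] = classes.pop(i) | classes.pop(j)
        prec = {(x, y) for x, y in prec if not {x, y} & {i, j}}
        # Step 4: relations between the new class and the others.
        prec |= {(c, next_id) for c in before} | {(next_id, c) for c in after}
        # Step 5: relations implied through the merged class.
        prec |= {(x, y) for x in before for y in after}
        next_id += 1
    order = sorted(classes, key=lambda c: sum((x, c) in prec for x in classes))
    return [{a[0] for a in classes[c]} for c in order]


slots = time_align([("A", 0, 10), ("B", 10, 20), ("C", 0, 12), ("D", 12, 20)])
```

In the toy lattice, the two parallel paths A→B and C→D collapse into two slots, one holding A and C and the next holding B and D, so the word-internal connections A→B and C→D are no longer enforced.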
4.2 Phoneme-expanded 1-best lattice
The key idea of the time-aligned phone lattice approach is to find the best time alignment in order to remove pronunciation dictionary and LM constraints. In the second algorithm, we further enhance this approach by incorporating richer phonetic information. To reduce the complexity, we first use the 1-best phone transcription generated by the LVCSR decoder as a reference alignment to divide the whole speech utterance into time slots, and then fill each slot using N-best phone hypotheses. Connecting the time slots one by one, we can obtain an expanded phone lattice. The algorithm is summarized as follows:
Step 1. Decode the input speech utterance with an LVCSR decoder, collecting phone backtracking information with both the word-based phone alignment and bag-of-phones methods. Generate an initial phone-level transcription of the speech utterance and a bag of phone hypotheses.
Step 2. From the initial phone transcription, get a list of phone boundaries B = {t_{0}, t_{1}, ..., t_{M-1}}, where M is the number of boundaries.
Step 3. Initialize phone lattice L = (N, A, n_{start}, n_{end}) with N = {N[t]: ∀t ∈ B}, A = {},
n_{start} = N[t_{0}] and n_{end} = N[t_{M-1}].
Step 4. For i = 1 to M - 1, find the N-best phone hypotheses with start time t_{i-1} and end time t_{i}, and store them into list P. For each phone hypothesis p in P, add a new link a to A with S [a] = N [t_{i-1}], E [a] = N [t_{i}], I [a] = I [p], ac [a] = ac [p], lm [a] = 0.
Step 5. Output L.
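Steps 2 to 5 can be sketched as a short function. The names below are hypothetical, and the bag is assumed to be a plain dictionary mapping `(phone, start_frame, end_frame)` to an acoustic log-likelihood; only hypotheses whose span matches a slot of the 1-best alignment exactly are considered, as in Step 4.

```python
def phoneme_expanded_lattice(boundaries, bag, n_best=10):
    """boundaries: sorted frame numbers t_0..t_{M-1} from the 1-best transcription.
    bag: {(phone, start, end): acoustic log-likelihood}."""
    links = []
    for t_prev, t_cur in zip(boundaries, boundaries[1:]):
        # Hypotheses that span exactly this time slot.
        slot = [(ac, phone) for (phone, s, e), ac in bag.items()
                if s == t_prev and e == t_cur]
        # Keep the N highest-scoring ones; the LM score is set to zero.
        for ac, phone in sorted(slot, reverse=True)[:n_best]:
            links.append({"start": t_prev, "end": t_cur,
                          "phone": phone, "ac": ac, "lm": 0.0})
    return {"nodes": list(boundaries), "links": links,
            "n_start": boundaries[0], "n_end": boundaries[-1]}


bag = {("a", 0, 5): -3.0, ("b", 0, 5): -1.0, ("c", 0, 5): -2.0,
       ("d", 5, 9): -4.0, ("e", 2, 9): -1.0}
lattice = phoneme_expanded_lattice([0, 5, 9], bag, n_best=2)
```

Here the hypothesis "e" is dropped because it does not align with any slot of the reference transcription, which illustrates how the 1-best alignment still shapes the resulting lattice.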
Figure 3 gives an example lattice constructed using this algorithm, using the same utterance as in Figure 2. Figure 3a is the initial phone transcription, while Figure 3b is the new phone lattice constructed from the N-best phone hypotheses. Although lattices reconstructed by this algorithm have a similar topology to the time-aligned lattice, the scores attached to each node and link have different meanings, with scores identified as likelihoods in this algorithm and posterior probabilities in the time-aligned lattice. Another major difference is that the start/end times in the phoneme-expanded 1-best lattice are correct, whereas they are an adjusted approximation in the time-aligned lattice.
4.3 Frame-expanded N-best lattice
Both of the above two algorithms use optimal alignments derived from LVCSR results, which may implicitly incorporate pronunciation and grammar constraints. In the third algorithm, we try to completely eliminate the constraints of the pronunciation dictionary and language model, and reconstruct a new phone lattice from scratch. This is achieved by concatenating the N-best phone hypotheses of each frame. In contrast to the phoneme-expanded case, the N-best order is decided by duration normalized log likelihood:
where p is a phone hypothesis. The algorithm is summarized as follows:
Step 1. Decode the input speech utterance with an LVCSR decoder, collecting phone backtracking information using the bag of phones approach. Use this to generate a bag of phone hypotheses.
Step 2. Initialize phone lattice L = (N, A, n_{start}, n_{end}) with N = {N [t]: t = 0, 1, ..., M}, A = {}, n_{start} = N [0] and n_{end} = N [M], where M is the number of frames.
Step 3. For i = 1 to M, find the N-best phone hypotheses with ending frame number i, and store them into list P. For each phone hypothesis p in P, add a new link a to A with S [a] = N [S [p]], E [a] = N [i], I [a] = I [p], ac [a] = ac [p], lm [a] = 0.
Step 4. Remove unreachable nodes and links from L.
Step 5. Output L.
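A compact sketch of Steps 2 to 5 is given below. The function name is hypothetical, and the ranking score is assumed to be the acoustic log-likelihood divided by the number of frames between the start and end nodes, which is one plausible reading of "duration normalized log likelihood".

```python
def frame_expanded_lattice(bag, num_frames, n_best=10):
    """bag: {(phone, start, end): acoustic log-likelihood}, with end > start.
    Returns the surviving links as (start, end, phone, ac) tuples."""
    def norm_score(item):
        (_, s, e), ac = item
        return ac / (e - s)          # assumed form of the duration normalization

    links = []
    for i in range(1, num_frames + 1):
        ending = [item for item in bag.items() if item[0][2] == i]
        best = sorted(ending, key=norm_score, reverse=True)[:n_best]
        links += [(s, e, phone, ac) for (phone, s, e), ac in best]

    # Step 4: keep only links lying on some path from frame 0 to the last frame.
    fwd = {0}
    for s, e, _, _ in sorted(links, key=lambda l: l[0]):
        if s in fwd:
            fwd.add(e)
    bwd = {num_frames}
    for s, e, _, _ in sorted(links, key=lambda l: -l[1]):
        if e in bwd:
            bwd.add(s)
    return [l for l in links if l[0] in fwd and l[1] in bwd]


bag = {("a", 0, 2): -2.0, ("b", 2, 4): -2.0, ("c", 1, 4): -3.0, ("d", 0, 3): -3.0}
links = frame_expanded_lattice(bag, num_frames=4)
```

In the toy bag, "c" starts at a frame that no path reaches and "d" ends at a frame from which the final node cannot be reached, so both are removed and only the chain a→b remains.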
Step 4 is necessary for correct n-gram counting. Once a phone hypothesis is pushed into the bag of phones, token pruning has no effect on it. As a result, the phone bag contains many phone hypotheses from pruned paths. In this algorithm, phone hypotheses are concatenated frame by frame. In most cases, these pruned paths can be connected to other surviving paths and make their way to the end node. However, there may be occasional truncated paths, which can cause failure in forward and backward n-gram counting. With no time alignment, lattices reconstructed using this algorithm encode the richest possible phonetic information and are typically quite large, which requires larger temporary storage space. However, after n-gram counting in the VSM backend all the lattices can be deleted. In practice, a lattice is converted to an n-gram supervector immediately after it is generated, so the storage requirement does not increase significantly. Figure 4 shows an example using the same utterance as in Figures 2 and 3. Since the entire lattice is much too large to include, only the part of the first word "DONG1" is drawn.
4.4 Pruning frame-expanded N-best lattice
The use of N-best lists in the third algorithm causes two deficiencies in the generated lattices. First, due to the continuity of the speech signal and overlapping frames during LVCSR feature extraction, one frame's N-best phone candidates are very likely to still be present in the next frame's N-best phone list. This can be seen clearly from Figure 4, in which N-best phone candidates are duplicated frame by frame. Second, the use of the N-best list can force low-probability phone hypotheses to be recorded, especially during silence regions. These result in additional lattice redundancy, which if not removed may affect the n-gram statistics and decrease the discriminability of the classifier. To eliminate some of that redundancy, we introduce a beam pruning mechanism into the N-best phone list generation procedure.
Suppose P(t) denotes a list of phone hypotheses with ending frame number t, we can define the duration normalized score of the best hypothesis at time t as:
The pruning criterion then states that those phone hypotheses p ∈ P (t) are pruned for which
where the threshold T determines the width of the beam.
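The pruning rule can be illustrated with a small sketch. Two assumptions are made here and are not confirmed by the text: the duration normalized score is taken to be the acoustic log-likelihood divided by the hypothesis length in frames, and the beam is applied as a difference in the log domain, so a hypothesis is dropped when its score falls more than T below the best score at that frame. The function name is hypothetical.

```python
def beam_prune(hyps, beam):
    """hyps: list of (phone, start, end, ac) that share one ending frame.
    beam: width T of the beam (assumed log-domain difference)."""
    if not hyps:
        return []

    def score(h):
        return h[3] / (h[2] - h[1])   # assumed duration normalization

    best = max(score(h) for h in hyps)
    # Keep only hypotheses whose score lies within the beam of the best one.
    return [h for h in hyps if score(h) >= best - beam]


candidates = [("a", 0, 4, -4.0), ("b", 2, 4, -3.0), ("sil", 0, 4, -40.0)]
kept = beam_prune(candidates, beam=2.0)
```

With normalized scores of -1.0, -1.5 and -10.0, a beam of 2.0 keeps "a" and "b" and removes the low-probability "sil" hypothesis that an N-best list alone would have retained.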
5 Vector space modeling LRE backend
To do language recognition, we implement a VSM backend similar to [9]. Figure 5 shows the complete framework of phone recognition followed by vector space modeling. In a standard PR-VSM system, a phone recognizer with a null grammar is used to tokenize input utterances into phone lattices and corresponding n-gram statistics. In our proposed embedded architecture, the phone recognizer is replaced by an LVCSR decoder and lattices are generated and reconstructed using the algorithms described above. Denoting S = s_{1}, ..., s_{N} as an arbitrary phone sequence in the lattice, the expected n-gram counts of the resulting lattice L can be expressed as:
\[ c(s_{i}, \hat{s}_{i} \mid L) = \sum_{S \in L} p(S \mid L)\, \mathrm{count}(s_{i}, \hat{s}_{i} \mid S) \]
where ŝ_{i} = s_{i-(n-1)}, ..., s_{i-1} represents the n-gram history and n is the n-gram order. The count function returns the number of occurrences of a given n-gram entry (s_{i}, ŝ_{i}) in phone sequence S. The posterior probability p(S|L) can be computed efficiently by the forward-backward algorithm as described in [6]. From the counts, the joint probability of an n-gram entry in a lattice is calculated as:
\[ p(s_{i}, \hat{s}_{i} \mid L) = \frac{c(s_{i}, \hat{s}_{i} \mid L)}{\sum_{j} c(s_{j}, \hat{s}_{j} \mid L)} \]
For a given lattice, we calculate the joint probabilities for all unique n-grams in the lattice and arrange them into a supervector, which is then taken as data for SVM (support vector machine) training and testing for language recognition.
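The expected-count computation can be sketched with a small forward-backward pass. This is a minimal illustration under stated assumptions: the function names are hypothetical, link weights are plain (non-log) scores, node identifiers are integers in topological order (for example frame numbers), and n-grams are enumerated by brute force over chains of n consecutive links, which is adequate for a toy lattice but not optimized.

```python
from collections import defaultdict


def expected_ngram_counts(links, start, end, n=2):
    """links: list of (src_node, dst_node, phone, weight)."""
    out = defaultdict(list)
    for link in links:
        out[link[0]].append(link)
    nodes = sorted({l[0] for l in links} | {l[1] for l in links})

    alpha = defaultdict(float)            # forward mass reaching each node
    alpha[start] = 1.0
    for v in nodes:
        for _, d, _, w in out[v]:
            alpha[d] += alpha[v] * w
    beta = defaultdict(float)             # backward mass from each node to the end
    beta[end] = 1.0
    for v in reversed(nodes):
        for _, d, _, w in out[v]:
            beta[v] += w * beta[d]
    total = alpha[end]                    # mass of all complete paths

    counts = defaultdict(float)

    def extend(node, phones, weight):
        if len(phones) == n:
            # Posterior mass of all paths passing through this chain of links.
            counts[tuple(phones)] += weight * beta[node] / total
            return
        for _, d, ph, w in out[node]:
            extend(d, phones + [ph], weight * w)

    for v in nodes:
        extend(v, [], alpha[v])
    return dict(counts)


def ngram_probabilities(counts):
    """Normalize expected counts into joint n-gram probabilities."""
    z = sum(counts.values())
    return {g: c / z for g, c in counts.items()}


toy = [(0, 1, "A", 0.6), (0, 1, "B", 0.4), (1, 2, "C", 1.0)]
counts = expected_ngram_counts(toy, start=0, end=2, n=2)
probs = ngram_probabilities(counts)
```

In the toy lattice the bigram (A, C) lies on a path with posterior 0.6 and (B, C) on one with posterior 0.4, so those are their expected counts; after normalization they form the entries of the supervector.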
It has been shown that n-gram normalization significantly improves performance [13]. We choose the term frequency log likelihood ratio kernel [13] to weight and normalize the n-gram supervectors and form a linear kernel for classification. The kernel function between two supervectors is:
\[ K(X_{1}, X_{2}) = \sum_{i} \frac{p(s_{i}, \hat{s}_{i} \mid L_{1})}{\sqrt{p(s_{i}, \hat{s}_{i} \mid \text{all})}} \cdot \frac{p(s_{i}, \hat{s}_{i} \mid L_{2})}{\sqrt{p(s_{i}, \hat{s}_{i} \mid \text{all})}} \]
where X_{1} is the supervector for lattice L_{1}, X_{2} is the supervector for lattice L_{2}, and p(s_{i}, ŝ_{i} | all) is calculated across lattices derived from all the training data. During training, the inputs of this kernel are two arbitrary supervectors in the training data. For testing, one of the inputs is the supervector of a test utterance and the other is an SVM support vector. Figure 6 shows a diagram of a parallel PR-VSM (PPR-VSM) system, which fuses phonetic features from multiple phone recognizers. N-gram statistics are computed from each recognition lattice using the PR-VSM approach and the supervectors are concatenated to form the input for the SVM. PPR-VSM has been shown to yield better performance than a single PR-VSM system [5].
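A minimal sketch of the kernel evaluation follows, assuming the usual form of term frequency log likelihood ratio weighting in which each n-gram probability is divided by the square root of its probability over all training data before a linear inner product is taken. The function name and the sparse dictionary representation of supervectors are illustrative choices.

```python
import math


def tfllr_kernel(p1, p2, p_all):
    """p1, p2: {ngram: joint probability in one lattice} (sparse supervectors).
    p_all: {ngram: probability over all training lattices}."""
    shared = p1.keys() & p2.keys()        # other entries contribute zero
    return sum((p1[g] / math.sqrt(p_all[g])) * (p2[g] / math.sqrt(p_all[g]))
               for g in shared)


p1 = {("a", "b"): 0.5, ("b", "c"): 0.5}
p2 = {("a", "b"): 0.2, ("c", "d"): 0.8}
p_all = {("a", "b"): 0.25, ("b", "c"): 0.25, ("c", "d"): 0.5}
k = tfllr_kernel(p1, p2, p_all)
```

Since the weighting can be folded into the supervectors once, a standard linear SVM can be trained on the scaled vectors instead of evaluating a custom kernel.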
6 Experimental results
6.1 Experimental setup
To evaluate the proposed architecture and algorithms, we implemented a Mandarin/English language recognition task. Two LVCSR decoders were built as frontends, one for Mandarin and the other for English. For both decoders, PLP with delta and acceleration coefficients were used as features, with cepstral mean/variance normalization. For the Mandarin frontend, a 68k-word dictionary was used, and an acoustic model of 6,000 states and 48 mixtures was trained with about 300 h of data selected from the training sets of the HKUST, CallFriend, CallHome, and Chinese 863 corpora. For the English frontend, a 130k-word dictionary was used, and an acoustic model of 6,000 states and 48 mixtures was trained with about 220 h of data selected from the training sets of CallFriend, CallHome, and Switchboard. The Mandarin and English language models were trained with the Chinese and English Gigaword corpora, respectively, and interpolated with transcriptions of the acoustic model training data. Acoustic model training was done using the minimum phone error (MPE) [14] criterion.
For the VSM backend, trigram probability supervectors were extracted following the steps described in Section "Vector space modeling LRE backend", with a core phone set of 96 monophones for Mandarin and 39 monophones for English, excluding sil and sp. This resulted in supervectors of length 894,048 for Mandarin and 60,879 for English. The training corpus consisted of 9,475 Mandarin utterances and 10,827 English utterances selected from the CallFriend and CallHome training sets. Each training utterance was automatically segmented to have about 30 s of speech. For testing, we selected 14,467 utterances, representing approximately 7.2 h of English and 5.5 h of Mandarin from the CallHome development and test sets. Each test utterance was segmented according to the LDC-provided transcriptions. Utterances shorter than 0.5 s were discarded. Table 1 gives the utterance length distribution of the test data. In all experiments, the training and test utterances were converted into lattices using identical lattice reconstruction algorithms to make training and test conditions match.
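The quoted supervector lengths are consistent with stacking all unigram, bigram, and trigram entries over the monophone set. This reading is an inference from the numbers rather than something the text states explicitly:

\[ 96 + 96^{2} + 96^{3} = 96 + 9{,}216 + 884{,}736 = 894{,}048, \qquad 39 + 39^{2} + 39^{3} = 39 + 1{,}521 + 59{,}319 = 60{,}879. \]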
For each lattice reconstruction algorithm, a corresponding VSM language recognition system was implemented, as shown in Table 2. In the following experiments, the performance of these systems is examined. SVMTorch [15] is used for SVM training and testing. For comparison, we built a Mandarin phone recognizer and an English phone recognizer, each as part of a standard PR-VSM baseline. The PR-VSM frontends use a simple phone loop and null grammar. These share the same features with the LVCSR decoder, but use significantly more compact context-dependent acoustic models which have only about 300 states. The acoustic model used in the PR-VSM system is tuned on NIST LRE 07 data to provide better performance [16]. Alternative English and Mandarin PR-VSM frontends using the LVCSR acoustic models are also evaluated, which are named "PR-VSM-alt" in Table 2. In our implementation, the PR-VSM-alt frontend is efficiently integrated with the LVCSR decoder by sharing the same preprocessing and acoustic model evaluation. Although both the PR-VSM frontend and the PR-VSM-alt frontend generate lattices using context-dependent phone sets, the phone contexts are removed prior to the n-gram counting in the VSM backend to avoid data sparseness.
6.2 Performance of the different lattice reconstruction algorithms
Detection error tradeoff (DET) curves for the LVCSR Mandarin frontend experiments are shown in Figure 7, with corresponding equal error rates (EERs) given in Table 3. The DET plots using the English frontend have a similar profile and are not presented. From the results, we observe that all three lattice reconstruction algorithms improve the LRE performance significantly compared to direct use of the word-based phone lattice obtained from LVCSR, with the frame-based N-best method A3 outperforming the other two algorithms. Because this approach completely discards topological (LM, lexicon) constraints used in the LVCSR decoder, and resorts to rearranging phone hypotheses in the bag of phones frame by frame, the resulting phone lattices have the richest phonetic information but the fewest pronunciation and grammar constraints. The fact that PR-VSM-alt performs only slightly better than the A3 method supports this constraint-removal efficiency. However, it is interesting to notice that A1 has the advantage of containing lower false alarm and miss rate operating points.
From Table 3, we can see that the Mandarin frontend performs better than the English frontend. This is most likely due to the size of the phone set. In our systems, the English phone set has 39 monophones and the Mandarin phone set has 96, excluding sil and sp. A larger LVCSR phone set means both English and Mandarin utterances can be modeled more precisely in the n-gram vector space, leading to better language recognition performance.
Although both use a phone-loop topology, there is still a performance gap between the PR-VSM and the PR-VSM-alt frontends. We attribute this to the different characteristics of the acoustic models for PR-VSM language recognition and LVCSR. A PR-VSM system is motivated by the observation that some sequences of phones that exist in one language rarely exist in another. In such systems, the phone recognizer tokenizes input speech into phone sequences or lattices, and the VSM backend converts phone sequences or lattices into supervectors and then makes a decision. It is very common to use an English frontend, for example, to tokenize Mandarin or speech of many other languages. Therefore, the frontend's generality and robustness to multiple languages are much more important than precisely transcribing speech of a specific target language, which is the purpose of an LVCSR decoder. The acoustic models of the LVCSR decoder use many more states and Gaussian mixtures to distinguish phones in different contexts, which is more accurate for LVCSR but less robust for language recognition, and thus leads to performance degradation.
6.3 Effect of pruning frame-expanded N-best lattice
As explained in Section "Pruning frame-expanded N-best lattice", removing low-probability phone hypotheses from the N-best list can reduce redundancy and confusability of frame-expanded N-best lattices, and improve LRE performance. Figure 8 plots the EER of A3 as a function of the percentage of links retained from the original (unpruned) lattice. We observe that with 80 to 85% of the links in the original lattice we obtain about 0.3% improvement (0.32% for Mandarin and 0.29% for English) in EER over the original lattice. Figure 7 and Table 3 also give the DET plot and EERs of the pruned lattices, where the 84% beam width is used for Mandarin and 81% for English. Lattice pruning is not applied to the time-aligned phone lattice or the phoneme-expanded 1-best lattice. These two algorithms avoid some of the redundancy and confusability through the incorporation of time alignment: direct lattice alignment (time node clustering) in algorithm A1, and the use of the 1-best transcription in algorithm A2. However, this lower confusability comes at the cost of implicit incorporation of language-specific pronunciation dictionary and language model constraints, which as shown in the previous section results in lower LRE accuracy.
6.4 Performance of parallel frontends
As discussed earlier, fusion of multiple phonetic features generally improves performance. To evaluate the impact of this on the new integrated LVCSR-LRE system, we tested the system using parallel Mandarin and English LVCSR frontends. Figure 9 gives the resulting DET plots, with the corresponding EERs listed in Table 4. As expected, system performance is significantly improved in all cases. This result demonstrates the importance of having a highly diverse phone set for the language recognition task, and of using parallel frontends if computational resources permit.
6.5 Performance with different utterance lengths
In this experiment, we examined the impact of utterance length on language recognition accuracy. Figure 10 shows the EERs of the parallel frontend system for different test utterance lengths. The general trend is that EER decreases as utterances get longer, with more reliable language recognition results for test segments longer than three seconds. This gives an intuitive frame of reference for how long it may take to obtain a reliable language recognition result, and thus for the added LVCSR latency to expect when the initial language setting is incorrect.
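For readers less familiar with the metric, the following minimal sketch shows how an EER can be estimated from detection scores. It sweeps every observed score as a threshold and reports the operating point where the miss rate and the false-alarm rate are closest. The function name and the toy scores are illustrative assumptions and are unrelated to the scores behind Figure 10.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Returns the mean of miss and false-alarm rates at the threshold
    where the two rates are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Toy scores: one target trial scores below one non-target trial.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
print(equal_error_rate(targets, nontargets))
```

Repeating this computation on test sets truncated to different durations would give one EER per utterance length, which is the kind of curve the experiment reports.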
6.6 Computational cost
Computational costs for the proposed LVCSR-based lattice construction algorithms vary significantly. Both A1 and A2 require the collection of phone hypotheses using word-based phone alignment. Since phone hypotheses are propagated, merged and pruned together with word hypotheses, the decoder must spend significant time on bookkeeping. This is not needed for the A3 approach, because the bag-of-phones method for phone candidate identification is much simpler. Table 5 lists the additional computational cost of each lattice reconstruction algorithm, compared to using a separate PR-VSM LRE system. The test was conducted on a single core of an Intel Core2 Duo 2.26 GHz CPU. In the implementation of the PR-VSM system, the frontend reuses the features extracted by the LVCSR frontend, but the acoustic model evaluation is not shared because the acoustic models are different. It should be noted that A3 and A3' have the same real-time factor. This is because the most time-consuming part of the lattice reconstruction algorithm is the phone hypothesis collection, which is Step 1 of the A3 method; the computational cost of the remaining steps is trivial. Since A3' also has significantly better performance than the other approaches, it is likely the better choice for real applications.
The larger computational overhead of PR-VSM-alt may seem counterintuitive. Since PR-VSM-alt and the LVCSR decoder are tightly integrated, sharing the same preprocessing and acoustic model evaluation, the overall system should have an overhead similar to that of A3. However, we found that the PR-VSM-alt frontend evaluates a much larger number of HMM states on each frame. In our experiments, the PR-VSM-alt Mandarin frontend evaluates about 4,800 states/frame on average, while the LVCSR decoder evaluates only about 3,000 states/frame. The acoustic evaluation takes nearly 1xRT in the PR-VSM-alt system, but 0.66xRT in pure LVCSR. Beam search in the phone loop takes about another 0.1xRT, so the overall computational cost increases. This is caused by the different search strategies and decoding network topologies. The lexicon and language model constraints in the LVCSR decoder provide a variety of powerful pruning criteria (such as word-end pruning and language model look-ahead pruning) that can be applied during decoding to shrink the search space, whereas only acoustic pruning is available to the phone-loop decoder. In a pronunciation prefix tree based LVCSR decoder, such as the one used in this article, the acoustic states near word ends have fewer opportunities to be evaluated, because the active search space near word ends has normally been extensively pruned. In a phone-loop decoder, by contrast, all the phones are connected in parallel, so all the acoustic models have an equal probability of being evaluated. Of course, a narrower beam can be used for faster decoding, but this usually leads to even worse accuracy.
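The figures quoted in this paragraph can be combined into a rough back-of-the-envelope estimate of the added cost. The sketch below uses only the approximate values stated above ("nearly 1xRT", 0.66xRT, "about 0.1xRT") and assumes that the added cost is the extra shared acoustic evaluation plus the phone-loop beam search. The function name is hypothetical, and Table 5 remains the authoritative source for the measured overhead.

```python
def added_cost_xrt(shared_acoustic_xrt, lvcsr_acoustic_xrt, phone_loop_search_xrt):
    """Approximate extra real-time factor of running a phone loop next to LVCSR.

    Assumes acoustic evaluation is shared, so only the increase over the
    pure LVCSR acoustic cost counts, plus the phone-loop search itself.
    """
    extra_acoustic = shared_acoustic_xrt - lvcsr_acoustic_xrt
    return extra_acoustic + phone_loop_search_xrt


# Approximate values quoted in the text (not measured here).
overhead = added_cost_xrt(1.0, 0.66, 0.1)
state_ratio = 4800 / 3000  # states per frame, PR-VSM-alt vs. LVCSR

print(f"approximate added cost: {overhead:.2f} xRT")
print(f"states-per-frame ratio: {state_ratio:.1f}")
```

Under these assumptions the added cost comes out at roughly 0.44xRT, and the 1.6 times larger number of states per frame is broadly in line with the roughly 1.5 times higher acoustic evaluation time.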
7 Conclusion
In this article, we have successfully demonstrated an embedded LVCSR-LRE architecture that integrates language recognition capability directly into an LVCSR decoder with very low additional computational cost. The bias introduced by the LVCSR pronunciation dictionary and language model is reduced through the use of a bag-of-phones approach that allows for encoding richer decoding alternatives. Of the three proposed lattice reconstruction algorithms, the integrated LVCSR-LRE system using the frame-expanded N-best phone lattice shows the best performance, which is comparable to a state-of-the-art PR-VSM LRE system but at an added computational cost three times lower than that of a separate PR-VSM system.
Abbreviations
ASR: automatic speech recognition
DAG: directed acyclic graph
DET: detection error tradeoff
EER: equal error rate
LM: language model
LRE: language recognition
LVCSR: large vocabulary continuous speech recognition
MPE: minimum phone error
OOL: out-of-language
PPR-VSM: parallel phone recognition-vector space modeling
PR-VSM: phone recognition-vector space modeling
SVM: support vector machine
VSM: vector space modeling
References
Motlicek P: Automatic out-of-language detection based on confidence measures derived from LVCSR word and phone lattices. In Proc Interspeech 2009. Volume 1. Brighton, UK; 2009:1215-1218.
Motlicek P, Valente F, Garner PN: English spoken term detection in multilingual recordings. In Proc Interspeech 2010. Volume 1. Chiba, Japan; 2010:206-209.
Fernandez F, Cordoba RD, Ferreiros J, Sama V, D'Haro LF: Language identification techniques based on full recognition in an air traffic control task. In Proc ICSLP 2004. Volume 1. Jeju, Korea; 2004:1565-1568.
Lim DCY, Lane I: Language identification for speech-to-speech translation. In Proc Interspeech 2009. Volume 1. Brighton, UK; 2009:204-207.
Zissman M: Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans Speech Audio Process 1996, 4(1):31-44.
Gauvain JL, Messaoudi A, Schwenk H: Language recognition using phone lattices. In Proc ICSLP 2004. Volume 1. Jeju, Korea; 2004:25-28.
Campbell WM, Singer E, Torres-Carrasquillo PA, Reynolds DA: Language recognition with support vector machines. In Proc Odyssey 2004. Volume 1. Toledo, Spain; 2004:285-288.
Campbell WM, Richardson F, Reynolds DA: Language recognition with word lattices and support vector machines. In Proc ICASSP 2007. Volume 4. Honolulu, Hawaii; 2007:989-992.
Li H, Ma B, Lee CH: A vector space modeling approach to spoken language identification. IEEE Trans Audio Speech Lang Process 2007, 15(1):271-284.
Deng Y, Liu J: Automatic language identification using support vector machines and phonetic n-gram. In Proc International Conference on Audio, Language and Image Processing (ICALIP) 2008. Volume 1. Shanghai, China; 2008:71-74.
Torres-Carrasquillo PA: Language identification using Gaussian mixture models. PhD thesis, Michigan State University; 2002.
Mangu L, Brill E, Stolcke A: Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Comput Speech Lang 2000, 14(4):373-400.
Campbell WM, Campbell JP, Reynolds DA, Jones DA, Leek TR: Phonetic speaker recognition with support vector machines. In Advances in Neural Information Processing Systems. Volume 16. Edited by: Thrun S, Saul L, Schölkopf B. MIT Press, Cambridge, MA; 2004:1377-1384.
Povey D: Discriminative training for large vocabulary speech recognition. PhD thesis, Cambridge University Engineering Department; 2004.
Collobert R, Bengio S: Support vector machines for large-scale regression problems. J Mach Learn Res 2001, 1:143-160.
Deng Y: Research on the PPR-VSM system for language recognition. PhD thesis, Tsinghua University; 2011.
Acknowledgements
This research was supported by the National Natural Science Foundation of China (NSFC) (Nos. 90920302 and 61005019), by the NSFC and Research Grants Council (RGC) of Hong Kong (No. 60931160443), and in part by the National High Technology Development Program of China (863 Program) (No. 2008AA040201).
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Shan, Y., Deng, Y., Liu, J. et al. Phone lattice reconstruction for embedded language recognition in LVCSR. J AUDIO SPEECH MUSIC PROC. 2012, 15 (2012). https://doi.org/10.1186/1687-4722-2012-15
DOI: https://doi.org/10.1186/1687-4722-2012-15