Empirically combining unnormalized NNLM and back-off N-gram for fast N-best rescoring in speech recognition

Shi, Yongzhe; Zhang, Wei-Qiang; Cai, Meng; Liu, Jia

doi:10.1186/1687-4722-2014-19

Research
Open access
Published: 28 April 2014

EURASIP Journal on Audio, Speech, and Music Processing volume 2014, Article number: 19 (2014)

Abstract

Neural network language models (NNLMs), including feed-forward NNLMs (FNNLMs) and recurrent NNLMs (RNNLMs), have proved to be quite powerful for sequence modeling. One main issue with NNLMs is the heavy computational burden of the output layer, where the output needs to be probabilistically normalized and the normalizing factors require a lot of computation. How to rescore the N-best list or lattice quickly with an NNLM therefore attracts much attention for large-scale applications. In this paper, the statistical characteristics of the normalizing factors are investigated on the N-best list. Based on these statistical observations, we propose to approximate the normalizing factors for each hypothesis as a constant proportional to the number of words in the hypothesis. Then, the unnormalized NNLM is investigated and combined with a back-off N-gram for fast rescoring; it can be computed very quickly without normalization in the output layer, so the complexity is reduced significantly. We apply our proposed method to a well-tuned context-dependent deep neural network hidden Markov model (CD-DNN-HMM) speech recognition system on the English-Switchboard phone-call speech-to-text task, where both an FNNLM and an RNNLM are trained to demonstrate our method. Experimental results show that the unnormalized probability of the NNLM is quite complementary to that of the back-off N-gram, and that combining the unnormalized NNLM with the back-off N-gram further reduces the word error rate at little computational cost.

1 Introduction

The output of a speech-to-text (STT) system is usually a multi-candidate form encoded as a lattice or an N-best list. Rescoring with more accurate models, as a second pass of the STT system, has been widely used to further improve the performance. Fast rescoring with neural network language models is investigated in this paper.

Neural network language models (NNLMs), including the feed-forward NNLM (FNNLM) [1, 2] and the recurrent NNLM (RNNLM) [3–5], have achieved very good results on many tasks [6–8], especially the RNNLM. Distributed word representations and the associated probability estimates are jointly computed in a feed-forward or recurrent neural network architecture. This approach provides automatic smoothing and leads to better generalization for unseen N-grams. The main drawback of the NNLM is the great computational burden of the output layer, which contains tens of thousands of nodes corresponding to the words in the vocabulary: the output needs to be probabilistically normalized for each word with the softmax function, and this softmax normalization requires a lot of computation. Thus, the N-best list, because of its simplicity, is usually what the NNLM rescores and reranks, and the evaluation speed of the NNLM needs to be improved further for large-scale applications.

Most of the previous work focuses on speeding up the training of the NNLM via word clustering to structure the output layer [4, 9, 10]. One typical method, the class-based output layer, was recently proposed for speeding up RNNLM training [4] based on word frequency. This method divides the cumulative probability into C partitions to form C frequency bins, which correspond to C clusters, and the words are assigned to the classes proportionally. With this frequency-based clustering, the complexity of the output layer can be written in closed form as O((C + |V|/C)H), where |V| and H denote the number of nodes in the output layer and the hidden layer, respectively. Another approach [9, 11, 12] is to factorize the output layer with a tree structure that needs to be carefully constructed based on expert knowledge [13] or another clustering method [14]. Although these structure-based methods can speed up the evaluation of the NNLM, their complexity is still quite high for real-time systems.
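To make the O((C + |V|/C)H) cost concrete, the following sketch counts the multiply-accumulate operations of the output layer for a full softmax and for a class-based layer. The function names are illustrative only; the values |V| = 53,000, H = 300, and C = 400 are taken from the setup in Section 2, and the count ignores the hidden layer and all constant factors.

```python
import math


def full_softmax_cost(vocab_size: int, hidden_size: int) -> int:
    """Multiply-accumulates of a full softmax output layer: O(|V| H)."""
    return vocab_size * hidden_size


def class_layer_cost(vocab_size: int, hidden_size: int, num_classes: int) -> float:
    """Multiply-accumulates of a class-based output layer: O((C + |V|/C) H).

    Assumption: every class holds about |V|/C words, which is what the
    closed-form expression implies.
    """
    return (num_classes + vocab_size / num_classes) * hidden_size


V, H, C = 53_000, 300, 400
full = full_softmax_cost(V, H)
classed = class_layer_cost(V, H, C)
# C + |V|/C is minimized at C = sqrt(|V|).
best_c = round(math.sqrt(V))
best = class_layer_cost(V, H, best_c)
```

Under these assumptions the class layer lowers the output-layer cost by roughly two orders of magnitude, yet it still grows with |V|, which is the point made about real-time systems.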

In this paper, the statistical characteristics of the normalizing factors are investigated for the N-best hypotheses. Based on these statistical observations, we propose to approximate the normalizing factors for each hypothesis as a constant proportional to the number of words in the hypothesis, so that the normalizing factors can be easily absorbed into the word penalty. Then, the unnormalized NNLM is investigated and combined with a back-off N-gram for fast rescoring; it can be computed very quickly without normalization in the output layer, so the complexity is reduced significantly.

We apply our proposed method to a well-tuned context-dependent deep neural network hidden Markov model (CD-DNN-HMM) speech recognition system on the English-Switchboard speech-to-text task. Both a feed-forward NNLM and a recurrent NNLM are well trained to verify the effectiveness of our method. Experimental results show that the unnormalized probability of the NNLM is quite complementary to that of the back-off N-gram, and that combining the unnormalized NNLM with the back-off N-gram further improves speech recognition performance at little computational cost.

As our method is founded on statistical observations, we first introduce the experimental setup, including the speech recognizer, N-best hypotheses, NNLM structure, and NNLM training, in Section 2 for convenience. The remainder of this paper is organized as follows: The statistics of the normalizing factors on the hypotheses are investigated and the constant normalizing factor approximation is proposed in Section 3. How to combine the unnormalized NNLM and back-off N-gram is presented in Section 4, followed by complexity analysis and speed comparisons in Section 5. Detailed experimental evaluations for N-best rescoring are presented in Section 6. Discussions on the related work are given in Section 7, followed by the conclusions in Section 8.

2 Experimental setup

The experimental setup for the speech recognizer, the N-best hypotheses, the NNLM structure, and the NNLM training is introduced here, since our method is founded on statistical observations.

2.1 Speech recognizer and N-best hypotheses

The effectiveness of our proposed method is evaluated on the STT task with the 309-hour Switchboard-I training set [15]. The 13-dimensional perceptual linear prediction features (PLP) with rolling-window mean-variance normalization and up to third-order derivatives are reduced to 39 dimensions by heteroscedastic linear discriminant analysis (HLDA). The speaker-independent three-state cross-word triphones share 9,308 tied states. The GMM-HMM baseline system has 40-Gaussian mixtures per state, trained with maximum likelihood (ML), and refined discriminatively (DT) with the minimum phone error (MPE) criterion. The well-tuned CD-DNN-HMM system replaces the Gaussian mixtures with scaled likelihoods derived from DNN posteriors. The input to the DNN contains 11 (5-1-5) frames of 39-dimensional features, where the DNN uses the architecture of 429-2048 ×7-9308. The data for system development is the 1831-segment Switchboard part of the NIST 2000 Hub5 eval set (Hub5’00-SWB). The Fisher half of the 6.3h Spring 2003 NIST rich transcription set (RT03S-FSH) acts as the evaluation set.

The 2000h Fisher transcripts, containing about 23 million words, are taken as our training corpus for language modeling. Based on Kneser-Ney smoothing, a back-off trigram language model (KN3) was trained on the 2000h Fisher transcripts for decoding, where the vocabulary is limited to 53K words and unknown words are mapped to a special token <unk>. Note that no additional text is used to train LMs for interpolation, to ensure repeatability. The out-of-vocabulary rate is 0.80% for the training corpus, 0.53% for the development corpus, and 0.017% for the evaluation corpus. The pronouncing dictionary comes from CMU [16]. The HDecode^a command is used to decode each utterance with KN3 and output the lattice, and then the N-best hypotheses are extracted from the lattice using the lattice-tool^b command. In this setup, the top 100 hypotheses are rescored and reranked by other language models, such as the back-off 5-gram, FNNLM, and RNNLM, to improve the performance.

2.2 Structure and training of NNLM

The typical structures of NNLMs are shown in Figures 1 and 2, corresponding to the FNNLM and the RNNLM, respectively. We define V, H, and N as the vocabulary, the size of the hidden layer, and the order of the FNNLM, respectively. The projection matrix $E \in ℜ^{| V | \times H}$ maps each word to a feature vector, its distributed representation, which is fed into the hidden layer.

Based on the structures of the NNLMs, the hidden state $h_{t}$ of the FNNLM can be computed as $h_{t} = \tanh (\sum_{o = 1}^{N - 1} W_{iho} v_{t - o})$ , while that of the RNNLM can be computed as $h_{t} = sigmoid (W_{hh} h_{t - 1} + v_{t})$ , where tanh and sigmoid are the activation functions. The probability of the next word is computed via the softmax function in the output layer, where $W_{ho} \in ℜ^{| V | \times H} = {[θ_{1}, θ_{2}, \dots, θ_{| V |}]}^{T}$ is the prediction matrix and each $θ_{i} \in ℜ^{H \times 1}$ corresponds to one output node.
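The two hidden-state recurrences can be sketched in a few lines of plain Python. This is only an illustration of the formulas above under simplifying assumptions: vectors are lists of floats, matrices are lists of rows, there are no bias terms, and the function names and toy weights are hypothetical rather than taken from the CSLM or RNNLM toolkits.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product with the matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fnnlm_hidden(context_vectors, weight_matrices):
    """h_t = tanh(sum_o W_iho v_{t-o}) over the N-1 context words."""
    total = [0.0] * len(weight_matrices[0])
    for w_iho, v in zip(weight_matrices, context_vectors):
        total = [a + b for a, b in zip(total, matvec(w_iho, v))]
    return [math.tanh(a) for a in total]


def rnnlm_hidden(prev_hidden, word_vector, w_hh):
    """h_t = sigmoid(W_hh h_{t-1} + v_t)."""
    pre = [a + b for a, b in zip(matvec(w_hh, prev_hidden), word_vector)]
    return [1.0 / (1.0 + math.exp(-a)) for a in pre]


# Toy example with H = 2 and identity weights (made-up values).
identity = [[1.0, 0.0], [0.0, 1.0]]
h_fnn = fnnlm_hidden([[0.5, -0.5], [0.5, 0.5]], [identity, identity])
h_rnn = rnnlm_hidden([0.0, 0.0], [0.0, 0.0], identity)
```

Both hidden layers cost a number of operations that depends on H and N but not on |V|, which is why the output layer dominates the evaluation time.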

The transcripts of the Hub5’00-SWB set and the RT03S-FSH set act as the development set and the evaluation set, respectively, for NNLM training. One FNNLM and one RNNLM are well trained on the training corpus with the open source toolkits CSLM [17] and RNNLM [18], respectively, where both hidden layers contain 300 nodes.

To speed up the training of the RNNLM, a frequency-based partition method [4] is used to factorize the output layer with 400 classes. The truncated backpropagation through time (BPTT) algorithm [19] is used to train the RNNLM with 10 time steps, with the initial learning rate set to 0.1. The learning rate is halved when the perplexity decreases very slowly or increases. In contrast, the training of the FNNLM can be sped up on a GPU with mini-batches of 128 context-word pairs, so no class layer is used, as the class layer usually sacrifices the performance of the NNLM for speed. The learning rate is empirically set as $lr = lr_{0} / (1 + count \times wdecay)$ , where the initial learning rate $lr_{0}$ is set to 1.0, the weight decay ‘wdecay’ is set to 2×10^-8, and the parameter ‘count’ denotes the number of samples processed, so that the learning rate decays as training proceeds. The basic back-off 5-gram language model (KN5) is also trained with the modified Kneser-Ney smoothing algorithm.
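As a small illustration of the FNNLM schedule, the sketch below evaluates lr = lr₀/(1 + count × wdecay) with the values quoted above. The function name is hypothetical, and the sample counts are chosen only to show how slowly the rate decays.

```python
def fnnlm_learning_rate(count: int, lr0: float = 1.0, wdecay: float = 2e-8) -> float:
    """Learning rate after `count` training samples have been processed."""
    return lr0 / (1.0 + count * wdecay)


lr_start = fnnlm_learning_rate(0)
# With wdecay = 2e-8 the rate is halved after 1 / wdecay = 50 million samples.
lr_half = fnnlm_learning_rate(50_000_000)
```

With a corpus of about 23 million words, the rate therefore stays above half of its initial value for roughly the first two passes over the data.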

3 Statistics of normalizing factors on N-best hypotheses

3.1 Review of N-best rescoring

The output from the first decoding pass is usually a multi-candidate form encoded as a lattice or an N-best list. Each path in the lattice or N-best list is a candidate time-aligned transcript $W = w_{1}, w_{2}, \dots, w_{n}$ of the speech utterance X. The N-best list is widely used because of its simplicity, and N-best rescoring in LVCSR is reviewed here.

Given the acoustic model Λ, the language model L, and a speech utterance $X_{i}$ , the N-best hypotheses from the ASR decoder are denoted as $H_{i} = W_{i 1}, W_{i 2}, \dots, W_{iN}$ , where the score of each hypothesis $W_{ij}$ is computed as

\begin{aligned} g (X_{i}, W_{ij} | Λ, L) = {} & \log P (X_{i} | W_{ij}, Λ) + α \cdot \log P (W_{ij} | L) \\ & + n_{ij} \cdot wdpenalty, \end{aligned}

(1)

where the first two terms correspond to the acoustic score and the language score, respectively, and the last one denotes the word penalty that balances insertions and deletions. Also, α denotes a scaling factor for the language score, and $n_{ij}$ denotes the number of words in the hypothesis $W_{ij}$ . The global score for each hypothesis in $H_{i}$ is computed and the hypotheses are reranked. The top hypothesis is selected as the output for evaluation. Generally, better performance is expected with more accurate models.
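The reranking step of Equation 1 can be sketched as follows. The `Hypothesis` container, the function names, and the toy scores are all made up for illustration; the LM scale 12.0 and word penalty −6.0 mirror the values in the decoding commands in the Endnotes.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: list           # word sequence W_ij
    acoustic_logp: float  # log P(X_i | W_ij, Λ)
    lm_logp: float        # log P(W_ij | L)


def global_score(hyp: Hypothesis, lm_scale: float, word_penalty: float) -> float:
    """Equation 1: acoustic score + α · LM score + n_ij · word penalty."""
    return hyp.acoustic_logp + lm_scale * hyp.lm_logp + len(hyp.words) * word_penalty


def rerank(nbest, lm_scale, word_penalty):
    """Sort the N-best list by global score, best hypothesis first."""
    return sorted(nbest, key=lambda h: global_score(h, lm_scale, word_penalty), reverse=True)


# Toy 3-best list with invented scores.
nbest = [
    Hypothesis("i think so too".split(), -98.0, -9.5),
    Hypothesis("i think so".split(), -100.0, -8.0),
    Hypothesis("i thing so".split(), -99.0, -9.0),
]
ranked = rerank(nbest, lm_scale=12.0, word_penalty=-6.0)
best = ranked[0]
```

Only the relative order of the scores matters here, which is the property the approximations in the following subsections rely on.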

3.2 Normalizing factor for one word

Given a word sequence $\bar{s}$ , denote the t-th word as $w_{t}$ . The identity of word $w_{t}$ is denoted as $q (w_{t}) = y_{i} \in V$ , where the subscript i of $y_{i}$ is the word index in the vocabulary V. The structures of the FNNLM and the RNNLM are shown in Figures 1 and 2, respectively, where $W_{ho} \in ℜ^{| V | \times H} = {[θ_{1}, θ_{2}, \dots, θ_{| V |}]}^{T}$ is the prediction matrix and each $θ_{i} \in ℜ^{H \times 1}$ corresponds to one output node.

The predicted probability of NNLM is computed as

\begin{aligned} P (q (w_{t}) = y_{j} | h_{t}) & = \frac{\exp (s_{t})}{z_{t}} \\ \text{with } s_{t} & = θ_{j}^{T} h_{t} \text{ and } z_{t} = \sum_{i = 1}^{| V |} \exp (θ_{i}^{T} h_{t}), \end{aligned}

(2)

where $\exp (s_{t})$ and $z_{t}$ correspond to the unnormalized probability and the softmax-normalizing factor, respectively. Computing the factor $z_{t}$ is what makes the normalization computationally expensive.
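The difference in cost between the score s_t and the factor z_t is easy to see in code. The sketch below is a plain-Python illustration of Equation 2 with a hypothetical three-word vocabulary; the helper names are not from any toolkit, and the max-shift inside `log_normalizer` is a standard numerical-stability trick rather than something described in the text.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unnormalized_score(theta, h_t, j):
    """s_t = θ_j^T h_t: a single dot product, O(H)."""
    return dot(theta[j], h_t)


def log_normalizer(theta, h_t):
    """log z_t = log Σ_i exp(θ_i^T h_t): one dot product per word, O(|V| H)."""
    scores = [dot(row, h_t) for row in theta]
    shift = max(scores)  # subtract the maximum before exponentiating
    return shift + math.log(sum(math.exp(s - shift) for s in scores))


def log_prob(theta, h_t, j):
    """log P(y_j | h_t) = s_t - log z_t."""
    return unnormalized_score(theta, h_t, j) - log_normalizer(theta, h_t)


# Toy prediction matrix with |V| = 3 and H = 2 (made-up values).
theta = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [0.5, -0.5]
total_prob = sum(math.exp(log_prob(theta, h, j)) for j in range(len(theta)))
```

`unnormalized_score` touches one row of the prediction matrix, whereas `log_normalizer` touches all |V| rows for every predicted word.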

We evaluated our well-trained FNNLM and RNNLM on the 100-best hypotheses generated from the Hub5’00-SWB set (1,812 utterances), containing 147,454 hypotheses and 2,125,315 words. The value of log(z_t) is computed for each word, and the probability density functions (PDFs) of log(z_t) for the FNNLM and the RNNLM are plotted in Figure 3. The log normalizing factor is widely distributed, ranging from 13 to 20 for the FNNLM and from 7 to 20 for the RNNLM. The variance of log(z_t) therefore seems so large that log(z_t) cannot simply be approximated as a constant for N-best rescoring. However, several empirical observations, presented in the next two subsections, help us approximate the normalizing factor, and they also lead us to conclude that some discriminative information of the NNLM for N-best rescoring resides in the unnormalized probability.

3.3 Normalizing factor for one hypothesis

The output of the speech recognizer is usually encoded as N-best hypotheses, and a better hypothesis can be selected via rescoring with more accurate models. The language score for the hypothesis $W_{ij} = w_{ij 1}, w_{ij 2}, \dots, w_{ij n_{ij}}$ is computed as

\begin{aligned} \log P (W_{ij} | L) & = \sum_{t = 1}^{n_{ij}} \log P (w_{ijt} | h_{ijt}) \\ & = \sum_{t = 1}^{n_{ij}} s_{ijt} - \sum_{t = 1}^{n_{ij}} \log (z_{ijt}), \end{aligned}

(3)

where $n_{ij}$ denotes the number of words in hypothesis $W_{ij}$ . Each $s_{ijt}$ can be computed efficiently as the dot product of two vectors, while each $z_{ijt}$ requires a lot of computation.

Table 1 Ten hypotheses decoded from one utterance in Hub5’00-SWB set

We randomly selected one utterance from the Hub5’00-SWB set and decoded it with HDecode. The top ten hypotheses are shown in Table 1. We notice that the N-best hypotheses share many similar contexts, especially the hypotheses with a low word error rate (WER), and that the differences are usually local. The normalizing factor $z_{ijt}$ is completely determined by the context via a smooth function in Equation 2, so similar contexts result in normalizing factors that are close in value. Thus, many normalizing factors in the N-best hypotheses are the same or similar, because many of the contexts are the same or similar, and we roughly approximate $\sum_{t = 1}^{n_{ij}} log (z_{ijt})$ for hypothesis $W_{ij}$ as a constant proportional to $n_{ij}$ in this case, shown as

μ_{ij} = \frac{1}{n_{ij}} \sum_{t = 1}^{n_{ij}} log (z_{ijt}) \approx μ_{i},

(4)

where $μ_{i}$ is the constant corresponding to utterance $X_{i}$ , and $μ_{i}$ can be estimated as $μ_{i} \approx \bar{μ_{ij}} = \frac{1}{N} \sum_{j = 1}^{N} μ_{ij}$ .

This approximation for utterance $X_{i}$ can be evaluated with the variance of $μ_{ij}$ , $Var (μ_{ij}) = \frac{1}{N} \sum_{j = 1}^{N} {(μ_{ij} - \bar{μ_{ij}})}^{2}$ , where $\bar{μ_{ij}}$ denotes the mean of $μ_{ij}$ in $H_{i}$ . Since Equation 4 is applied to many utterances, the approximations as a whole can be evaluated by the mean and the variance of $Var (μ_{ij})$ over all utterances. The smaller $Var (μ_{ij})$ is statistically, the more accurate the approximations.

We evaluated the well-trained FNNLM and RNNLM on the 100-best hypotheses generated from the Hub5’00-SWB set (1,812 utterances). The $μ_{ij}$ for each hypothesis and the $Var (μ_{ij})$ for each 100-best list are computed, and the PDFs of $Var (μ_{ij})$ for the FNNLM and the RNNLM are shown in Figure 4. The PDFs are quite sharp and concentrated close to zero, much like an impulse function, so the constant approximation in Equation 4 for each utterance is accurate and reasonable to some extent.
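A minimal sketch of these per-utterance statistics is given below, assuming the log(z_ijt) values have already been computed and stored as one list per hypothesis. The function names and the toy values are hypothetical; `statistics.pvariance` is used because it divides by N, as the variance formula above does.

```python
from statistics import fmean, pvariance


def hypothesis_mu(log_z):
    """μ_ij of Equation 4: mean of log z_ijt over the words of one hypothesis."""
    return fmean(log_z)


def utterance_mu_stats(nbest_log_z):
    """Return (mean of μ_ij, Var(μ_ij)) over the N hypotheses of one utterance."""
    mus = [hypothesis_mu(log_z) for log_z in nbest_log_z]
    return fmean(mus), pvariance(mus)


# Toy 3-best list: one list of log z values per hypothesis (invented numbers).
toy_nbest_log_z = [
    [15.0, 17.0],
    [16.0, 16.0, 16.0],
    [15.0, 16.0, 17.0, 18.0],
]
mean_mu, var_mu = utterance_mu_stats(toy_nbest_log_z)
```

Collecting `var_mu` over all utterances of a development set would give the kind of distribution that Figure 4 summarizes.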

3.4 Number of words in hypothesis

We also notice that the hypotheses in an N-best list contain similar numbers of words. N-best hypotheses are rescored and reranked according to their relative scores. If all the hypotheses for utterance $X_{i}$ contain the same number of words, then, based on Equation 4, the second term in Equation 3 is the same for every hypothesis in the N-best list, shown as

\begin{aligned} n_{ij} μ_{ij} & \approx n_{ik} μ_{ik}, \quad \forall i, j, k \\ \text{s.t. } n_{ij} & \approx n_{ik} \text{ and } μ_{ij} \approx μ_{i} \approx μ_{ik}, \quad \forall i, j, k. \end{aligned}

(5)

That is to say, the normalizing factors for one hypothesis will not affect the ranking in N-best rescoring, and $μ_{i}$ for utterance $X_{i}$ can be arbitrary. We further approximate the constant $μ_{i}$ by a global constant that is independent of the utterance, shown as

μ_{ij} \approx μ_{i} \approx μ,

(6)

where μ is the global constant and can be estimated as $μ = \frac{1}{MN} \sum_{i = 1}^{M} \sum_{j = 1}^{N} μ_{ij}$ on the validation set. M and N denote the number of utterances and the number of hypotheses for each utterance, respectively.

Please note that the approximations in Equation 5 depend on the assumption that the hypotheses for each utterance are equal in length. We count the number of words $n_{ij}$ for each hypothesis $W_{ij}$ and compute the variance of $n_{ij}$ for each utterance $X_{i}$ as

Var (n_{ij}) = \frac{1}{N} \sum_{j = 1}^{N} {(n_{ij} - \bar{n_{ij}})}^{2} \quad \text{with} \quad \bar{n_{ij}} = \frac{1}{N} \sum_{j = 1}^{N} n_{ij}.

(7)

The statistically smaller the Var(n_ij), the more accurate the approximation in Equations 5 and 6. The PDF of Var(n_ij) on the 100-best hypotheses generated from Hub5’00-SWB set is shown in Figure 5. It shows that the PDF of Var(n_ij) is sharp and most of the Var(n_ij) are smaller than 1.0. The difference of N-best hypotheses in length is small, and the approximation for N-best hypotheses in Equation 5 is reasonable to some extent.
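The length statistic of Equation 7 can be sketched as below. The function names and the toy hypotheses are illustrative; `fraction_below` is a hypothetical helper for the kind of summary quoted above (the share of utterances whose Var(n_ij) is below 1.0).

```python
from statistics import pvariance


def length_variance(nbest):
    """Var(n_ij) of Equation 7: population variance of the hypothesis lengths."""
    return pvariance([float(len(words)) for words in nbest])


def fraction_below(variances, threshold=1.0):
    """Share of utterances whose length variance is below the threshold."""
    return sum(1 for v in variances if v < threshold) / len(variances)


# Toy 4-best list for one utterance (invented hypotheses).
toy_nbest = [
    "i think so".split(),
    "i think so too".split(),
    "i thing so".split(),
    "i think so".split(),
]
var_n = length_variance(toy_nbest)
share = fraction_below([var_n, 0.0, 2.5])
```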

3.5 Normalizing factor approximation

Based on the approximation in Equations 4, 5, and 6, the LM scores in Equation 3 can be simplified as

logP (W_{ij} | L) \approx \sum_{t = 1}^{n_{ij}} s_{ijt} - n_{ij} \cdot μ,

(8)

where only the first term needs to be computed at rescoring time, while the constant μ in the second term can be estimated in advance on the validation set. With the constant approximation of the normalizing factor, the complexity of the output layer is significantly reduced from O(|V|H) to O(H).
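Equation 8 and the estimate of μ from Equation 6 can be sketched as follows. The function names and numbers are illustrative; the nested-list layout (utterance, then hypothesis, then per-word log z) is an assumption, and the simple mean over hypotheses equals the 1/(MN) double sum only when every utterance contributes the same number N of hypotheses.

```python
from statistics import fmean


def estimate_global_mu(validation_log_z):
    """μ = (1/MN) Σ_i Σ_j μ_ij, where μ_ij is the per-hypothesis mean of log z."""
    mus = [fmean(hyp) for utt in validation_log_z for hyp in utt]
    return fmean(mus)


def approx_lm_logp(scores, mu):
    """Equation 8: Σ_t s_ijt − n_ij · μ, with no softmax normalization."""
    return sum(scores) - len(scores) * mu


# Toy validation data: 2 utterances with 2 hypotheses each (invented values).
toy_validation_log_z = [
    [[15.0, 17.0], [16.0, 16.0, 16.0]],
    [[14.0, 16.0], [17.0, 17.0]],
]
mu = estimate_global_mu(toy_validation_log_z)
approx = approx_lm_logp([10.0, 12.5, 11.0], mu)
```

Once μ is fixed, scoring a hypothesis needs only the unnormalized scores s_ijt, so the normalizer never has to be evaluated during rescoring.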

We also notice that the discriminative information of the NNLM for N-best rescoring resides in the unnormalized probability in Equation 8, and that the LM scores from a back-off N-gram, especially the KN3 used in decoding, are usually available for rescoring. In the next section, we investigate the discriminative information in the unnormalized NNLM (UP-NNLM), combined with a back-off N-gram, to further improve the performance of the speech recognizer.

4 Combining unnormalized NNLM and back-off N-gram

The combination of the UP-NNLM with a back-off N-gram in the logarithmic domain is presented in detail here. Generally, the performance of STT systems can be further improved by interpolating an NNLM with a back-off N-gram. Since the exact probability of the NNLM is unavailable in Equation 8, the linear interpolation is performed in the logarithmic domain for the entire hypothesis, shown as

\begin{aligned} \log \tilde{P} (W_{ij} | L) & = λ \cdot \log P (W_{ij} | L) + (1 - λ) \cdot \log P_{Ngram} (W_{ij} | L) \\ & = λ \cdot \sum_{t = 1}^{n_{ij}} (s_{ijt} - μ) + (1 - λ) \cdot \log P_{Ngram} (W_{ij} | L), \end{aligned}

(9)

where P_Ngram(W_ij|L) is the language score of back-off N-gram for hypothesis W_ij. By substituting Equation 9 into Equation 1, the global score for each hypothesis is computed as

\begin{aligned} g (X_{i}, W_{ij} | Λ, L) = {} & α \cdot λ \cdot \sum_{t = 1}^{n_{ij}} s_{ijt} + α \cdot (1 - λ) \cdot \log P_{Ngram} (W_{ij} | L) \\ & + \log P (X_{i} | W_{ij}, Λ) + n_{ij} \cdot wdpenalty' \\ \text{with } wdpenalty' = {} & wdpenalty - α \cdot λ \cdot μ, \end{aligned}

(10)

where the normalizing factor is absorbed into the word penalty. Only the unnormalized probability needs to be computed, so the computational complexity of the output layer is reduced significantly, without explicit normalization.
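The step from Equation 9 to Equation 10 is pure algebra, and a short numerical check makes it explicit. The sketch below computes the global score in two ways: by substituting Equation 9 into Equation 1, and by using the modified word penalty of Equation 10. The function names and all numeric values are hypothetical.

```python
def score_via_eq9(acoustic, s, ngram_logp, alpha, lam, mu, wdpenalty):
    """Substitute the interpolated LM score (Equation 9) into Equation 1."""
    lm = lam * sum(x - mu for x in s) + (1.0 - lam) * ngram_logp
    return acoustic + alpha * lm + len(s) * wdpenalty


def score_via_eq10(acoustic, s, ngram_logp, alpha, lam, mu, wdpenalty):
    """Equation 10: μ is folded into the word penalty."""
    wdpenalty_prime = wdpenalty - alpha * lam * mu
    return (alpha * lam * sum(s)
            + alpha * (1.0 - lam) * ngram_logp
            + acoustic
            + len(s) * wdpenalty_prime)


# Invented values for one hypothesis with three words.
args = dict(acoustic=-100.0, s=[10.0, 12.5, 11.0], ngram_logp=-8.0,
            alpha=12.0, lam=0.5, mu=16.0, wdpenalty=-6.0)
g9 = score_via_eq9(**args)
g10 = score_via_eq10(**args)
```

Because μ only appears inside the word penalty, tuning the word penalty on the development set makes an explicit estimate of μ unnecessary in practice.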

5 Complexity analysis and speed comparisons

The complexities of NNLM and UP-NNLM are analyzed, and the evaluation speeds of NNLM and UP-NNLM are also measured, shown in Table 2 for detailed comparisons.

Table 2 Complexity and speed comparisons of UP-NNLMs and NNLMs for word predictions

The class-based output layer method is based on the frequency partition [4], and the computational complexity of its output layer is O((C + |V|/C)H), as shown in Table 2. This method is usually used to speed up the training of the NNLM, and it also speeds up the evaluation. In contrast, our method needs only the unnormalized probability of the NNLM (UP-NNLM), which has complexity O(H) in the output layer. In particular, the complexity of the hidden layer in the FNNLM can be further reduced by lookup in the position-dependent projection matrices ${\hat{E}}_{k} \in ℜ^{| V | \times H}, k = 1, 2, \dots, N - 1$ , where ${\hat{E}}_{k} = E W_{ihk}$ can be computed off-line. We denote this fast version of UP-FNNLM as fast-UP-FNNLM in Table 2.
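The lookup idea behind fast-UP-FNNLM can be sketched as follows. The code assumes a row-vector convention (a row of E times W_ihk, matching Ê_k = E W_ihk) and no bias terms; the function names and the toy matrices are hypothetical and not taken from CSLM.

```python
import math


def vecmat(vector, matrix):
    """Row vector times matrix, with the matrix stored as a list of rows."""
    return [sum(v * matrix[r][c] for r, v in enumerate(vector))
            for c in range(len(matrix[0]))]


def precompute_tables(E, W_ih):
    """Ê_k = E W_ihk for each context position k, computed once off-line."""
    return [[vecmat(row, W_k) for row in E] for W_k in W_ih]


def hidden_direct(context_ids, E, W_ih):
    """Hidden state with explicit projection and matrix products."""
    total = [0.0] * len(W_ih[0][0])
    for word_id, W_k in zip(context_ids, W_ih):
        total = [a + b for a, b in zip(total, vecmat(E[word_id], W_k))]
    return [math.tanh(a) for a in total]


def hidden_fast(context_ids, tables):
    """Hidden state from table lookups and vector additions only."""
    total = [0.0] * len(tables[0][0])
    for word_id, table in zip(context_ids, tables):
        total = [a + b for a, b in zip(total, table[word_id])]
    return [math.tanh(a) for a in total]


# Toy model: |V| = 3, H = 2, N - 1 = 2 context positions (made-up weights).
E = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
W_ih = [[[1.0, 0.5], [0.0, 1.0]], [[0.5, 0.0], [-1.0, 2.0]]]
tables = precompute_tables(E, W_ih)
h_direct = hidden_direct([2, 0], E, W_ih)
h_fast = hidden_fast([2, 0], tables)
```

The price of the lookup is memory: N − 1 tables of size |V| × H replace the single projection matrix, in exchange for removing the matrix products from the hidden layer at evaluation time.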

The evaluation speed is measured as the number of words processed per second on a machine with an Intel(R) Xeon(R) 8-core CPU E5520 at 2.27 GHz and 8 GB of RAM, as shown in Table 2. The implementations are based on the open source toolkits CSLM [17] and RNNLM [18] to ensure repeatability. For a clear comparison, the speed of the NNLM without a class layer is also measured. One million words are randomly selected from the training data and evaluated by the FNNLM and the RNNLM, with the words fed into the models one by one. Experimental results show that UP-FNNLM and UP-RNNLM are about 2 to 3 times faster than ‘FNNLM + class layer’ and ‘RNNLM + class layer’ for evaluation. Note that the complexity of the hidden layer in UP-FNNLM or UP-RNNLM is comparable with that of the class-based output layer in ‘FNNLM + class layer’ or ‘RNNLM + class layer’, so this speedup factor is reasonable. It is also worth noting that the fast-UP-FNNLM is more than 25 times faster than ‘FNNLM + class layer’ and more than 1,100 times faster than FNNLM. To show the speedup clearly, the evaluation speed is also compared for different hidden layer sizes in Table 3. The larger the hidden layer, the slower the evaluation, and the fast-UP-FNNLM is the fastest of all.

Table 3 Testing speed comparisons of UP-NNLMs and NNLMs for different hidden layers

6 N-best rescoring evaluation

In this section, the NNLM and UP-NNLM are applied to N-best rescoring to demonstrate the performance of our method. Following the experimental setup described in Section 2, the perplexities of our trained language models, including KN3, KN5, FNNLM, and RNNLM-C400, are presented in Table 4 for comparison, where KN3 is used for decoding. The RNNLM-C400 interpolated with KN5 performs best of all on the Hub5’00-SWB set and the RT03S-FSH set. Also, RNNLM-C400 performs slightly better in perplexity than FNNLM with the same setup. The Hub5’00-SWB set and the RT03S-FSH set act as the validation set and the evaluation set, respectively. The 100-best hypotheses for these two sets are rescored and reranked by different language models, as shown in Table 5, where 1-best denotes the output of HDecode with KN3. The results for the 1-best hypothesis on these two sets, which serve as our baseline, are comparable with other reported results [20, 21].

Table 4 Perplexity of Hub5’00-SWB and RT03S-FSH set for different LMs

Table 5 Word error rates (WERs) of Hub5’00-SWB and RT03S-FSH for 100-best/1,000-best rescoring with NNLM and UP-NNLM

The UP-FNNLM and UP-RNNLM, combined with a back-off N-gram, are used for fast rescoring in this section. Note that the output layer of our trained RNNLM-C400 is divided into many small softmax output layers in order to speed up training on the large corpus. Thus, the unnormalized probability comes from the activations of the class layer and of the specific softmax output layer, while the entire normalizing factor is again approximated as in Equation 6. The UP-NNLM is linearly interpolated with KN5 in the logarithmic domain. The interpolation weight, the scale of the LM scores, and the word penalty are all individually tuned on the Hub5’00-SWB set, and then the final performance is evaluated on the RT03S-FSH set, as shown in Table 5. Significant reductions in WER are observed on the validation and evaluation sets. The language scores of KN3 are usually available in the lattice or N-best list, and the UP-RNNLM combined with KN3 reduces the WER by 0.8% and 1.2% absolute on the Hub5’00-SWB and RT03S-FSH sets, respectively. ‘KN5 + UP-RNNLM-C400’ further reduces the WER by 1.2% and 1.7% absolute on these two sets. Also, we notice that UP-RNNLM performs slightly better than UP-FNNLM, while UP-FNNLM can be evaluated much faster than UP-RNNLM. ‘UP-NNLM + KN5’ obtains about 1/2 to 2/3 of the gains of ‘NNLM + KN5’ with little computation. Experimental results show that the unnormalized probability of NNLMs, including the FNNLM and the RNNLM, is quite complementary to that of the back-off N-gram, and that the performance is further improved by combining the back-off N-gram with the NNLM.

7 Discussions with related work

Fast rescoring with NNLMs has attracted much attention in the field of speech recognition [9, 10, 22–24]. Many methods [9, 10, 22] for factorizing the output layer were proposed to reduce the complexity of the NNLM and to speed up training and evaluation. Other techniques [22, 24] were proposed to avoid the redundant computations in N-best or lattice rescoring. Our proposed method can be easily combined with these methods to further improve the speed of rescoring. In addition, fast training of the NNLM with noise-contrastive estimation (NCE) [25] was proposed in [26], where the normalizing factor for each context is treated as a parameter to be learned during training, so that training is sped up without explicit normalization. However, the normalizing factor for each context needs to be learned separately, and the normalizing factors for different contexts will differ, so the NNLM still has to be normalized explicitly at evaluation time. Interestingly, as mentioned in [26], the normalizing factors can be manually fixed to one instead of being learned during training. We believe this finding will help to improve our current work: if the variance of the normalizing factor can be constrained to a small range, the approximation in Equation 6 will become more accurate. In this work, the distribution of the normalizing factors on the N-best list is investigated, and the normalizing factor for each hypothesis is approximated as a constant for fast rescoring, without considering the variance of the normalizing factor. Based on the finding in [26], we will investigate in future work how to constrain the variance of the normalizing factors during training to further improve our method.

Furthermore, an alternative way to speed up rescoring is to use the word lattice instead of the N-best list. As the output of STT, the word lattice can compactly represent many more hypotheses than the N-best list. We therefore ask whether our proposed method can be extended to lattice rescoring. As the size of the N-best list increases, the N-best list approaches the lattice. Two experiments are designed to validate our method. On the one hand, we investigate whether the performance of N-best rescoring degrades as the size of the N-best list increases. A 1,000-best list instead of a 100-best list is extracted for each utterance and rescored by our proposed method, as shown in Table 5. Experimental results show that our proposed method still works well for the 1,000-best list, and similar improvements are obtained for 1,000-best rescoring. On the other hand, we directly rescore the lattice with the ‘lattice-tool’ [27] command to evaluate our proposed method. For ease of implementation and fast rescoring, ‘UP-FNNLM + KN5’ is integrated into the lattice-tool command, where the computation of the LM score is replaced with Equation 9. The experimental results in Table 6 show that lattice rescoring obtains a slightly lower WER than N-best rescoring. All these results indicate that our proposed approximations, based on our empirical observations, are reasonable and effective for fast N-best rescoring.

Table 6 Word error rates (WERs) of Hub5’00-SWB and RT03S-FSH for lattice rescoring with UP-FNNLM

8 Conclusions

Based on the observed characteristics of N-best hypotheses, the normalizing factors of the NNLM for each hypothesis are approximated as a global constant for fast evaluation. The unnormalized NNLM combined with a back-off N-gram is empirically investigated and evaluated on the English-Switchboard speech-to-text task. The computational complexity is reduced significantly, without explicit softmax normalization. Experimental results show that UP-NNLM is about 2 to 3 times faster than ‘NNLM + class layer’ for evaluation. Moreover, the fast-UP-FNNLM is more than 25 times faster than ‘FNNLM + class layer’ and more than 1,100 times faster than FNNLM. The N-best hypotheses output by the STT system are approximately rescored and reranked by the unnormalized NNLM combined with the back-off N-gram model in the logarithmic domain. Experimental results show that the unnormalized probability of the NNLM, including the FNNLM and the RNNLM, is quite complementary to that of the back-off N-gram, and that UP-NNLM is discriminative for N-best rescoring even though it is not so accurate. The performance of the STT system is improved significantly by ‘KN5 + UP-NNLM’ at little computational cost.

Endnotes

^a HDecode -A -D -T 1 -s 12.0 -p -6.0 -n 32 -t 150.0 150.0 -v 105.0 95.0 -u 10000 -l lat/ -z lat -C CF -H models/HMM -w models/LM -S xa.scp -i xa.mlf models/DICT models/LIST (http://htk.eng.cam.ac.uk/extensions/).

^b lattice-tool -nbest-decode 100 -read-htk -htk-logbase 2.718 -htk-lmscale 12.0 -htk-wdpenalty -6.0 -in-lattice-list xa.lst -out-nbest-dir nbest/ (http://www.speech.sri.com/projects/srilm/manpages/lattice-tool.1.html).

Abbreviations

ASR: automatic speech recognition
DNN: deep neural network
FNNLM: feed-forward neural network language model
KN: Kneser-Ney smoothing algorithm
KN3: back-off 3-gram based on KN smoothing
KN5: back-off 5-gram based on KN smoothing
LM: back-off N-gram language model
NNLM: neural network language model
PDF: probability density function
PPL: perplexity
RNNLM: recurrent neural network language model
STT: speech-to-text
WER: word error rate

Acknowledgements

The authors are grateful to the anonymous reviewers for their insightful and valuable comments. This work was supported by the National Natural Science Foundation of China under Grant Nos. 61273268, 61005019, and 90920302, and in part by the Beijing Natural Science Foundation Program under Grant No. KZ201110005005.

Author information

Authors and Affiliations

Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China
Yongzhe Shi, Wei-Qiang Zhang, Meng Cai & Jia Liu

Authors

Yongzhe Shi
Wei-Qiang Zhang
Meng Cai
Jia Liu

Corresponding author

Correspondence to Yongzhe Shi.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Shi, Y., Zhang, WQ., Cai, M. et al. Empirically combining unnormalized NNLM and back-off N-gram for fast N-best rescoring in speech recognition. J AUDIO SPEECH MUSIC PROC. 2014, 19 (2014). https://doi.org/10.1186/1687-4722-2014-19

Received: 07 December 2013
Accepted: 04 April 2014
Published: 28 April 2014
DOI: https://doi.org/10.1186/1687-4722-2014-19
