Empirically combining unnormalized NNLM and back-off N-gram for fast N-best rescoring in speech recognition
EURASIP Journal on Audio, Speech, and Music Processing volume 2014, Article number: 19 (2014)
Abstract
Neural network language models (NNLMs), including feedforward NNLMs (FNNLMs) and recurrent NNLMs (RNNLMs), have proved to be powerful for sequence modeling. One main issue with NNLMs is the heavy computational burden of the output layer, where the output must be probabilistically normalized and the normalizing factors are expensive to compute. How to rescore the N-best list or lattice quickly with an NNLM has therefore attracted much attention for large-scale applications. In this paper, the statistical characteristics of the normalizing factors are investigated on the N-best list. Based on these observations, we propose to approximate the normalizing factors for each hypothesis as a constant proportional to the number of words in the hypothesis. The unnormalized NNLM is then combined with a back-off N-gram for fast rescoring; it can be computed very quickly because no normalization is needed in the output layer, which reduces the complexity significantly. We apply the proposed method to a well-tuned context-dependent deep neural network hidden Markov model (CD-DNN-HMM) speech recognition system on the English-Switchboard phone-call speech-to-text task, where both an FNNLM and an RNNLM are trained to demonstrate the method. Experimental results show that the unnormalized probability of the NNLM is quite complementary to that of the back-off N-gram, and that combining the two further reduces the word error rate at little computational cost.
1 Introduction
The output of a speech-to-text (STT) system is usually a multi-candidate form encoded as a lattice or an N-best list. Rescoring with more accurate models, as a second pass of the STT system, is widely used to further improve performance. This paper investigates fast rescoring with neural network language models.
Neural network language models (NNLMs), including the feedforward NNLM (FNNLM) [1, 2] and the recurrent NNLM (RNNLM) [3–5], have achieved very good results on many tasks [6–8], the RNNLM in particular. Distributed word representations and the associated probability estimates are jointly computed in a feedforward or recurrent neural network architecture. This approach provides automatic smoothing and leads to better generalization for unseen N-grams. The main drawback of the NNLM is the heavy computational burden of the output layer, which contains tens of thousands of nodes corresponding to the words in the vocabulary: the output must be probabilistically normalized for each word with the softmax function, and this softmax normalization is expensive. For this reason, the N-best list, being simple, is usually what the NNLM rescores and reranks, and the evaluation speed of the NNLM needs to be improved further for large-scale applications.
Most previous work focuses on speeding up NNLM training via word clustering to structure the output layer [4, 9, 10]. One typical method, the class-based output layer, was recently proposed for speeding up RNNLM training [4] based on word frequency. This method divides the cumulative probability into C partitions to form C frequency bins, which correspond to C clusters, and the words are assigned to classes proportionally. With this frequency clustering, the complexity of the output layer can be written in closed form as O((C + V/C)H), where V and H denote the number of nodes in the output layer and the hidden layer, respectively. Another method [9, 11, 12] factorizes the output layer with a tree structure that must be carefully constructed from expert knowledge [13] or with another clustering method [14]. Although these structure-based methods speed up NNLM evaluation, their complexity is still quite high for real-time systems.
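As a rough illustration of these costs, the sketch below counts output-layer multiply-adds for an unfactorized softmax, O(VH), and for a class-based layer, O((C + V/C)H), using the vocabulary size (53K) and hidden layer size (300) reported in Section 2. It assumes equally sized classes, which frequency-based binning does not guarantee, and the function names are illustrative only.

```python
import math

def full_softmax_cost(V, H):
    """Multiply-adds for an unfactorized output layer: O(V * H)."""
    return V * H

def class_layer_cost(V, H, C):
    """Multiply-adds for a class-based output layer: O((C + V / C) * H).

    Assumes C equally sized classes (an idealization of frequency binning).
    """
    return (C + V / C) * H

# Values taken from the setup described in Section 2.
V, H = 53_000, 300

# Brute-force search for the class count that minimizes the idealized cost.
best_C = min(range(1, V + 1), key=lambda C: class_layer_cost(V, H, C))

if __name__ == "__main__":
    print("full softmax:", full_softmax_cost(V, H))
    print("400 classes :", class_layer_cost(V, H, 400))
    print("best C      :", best_C, "(sqrt(V) is about %.1f)" % math.sqrt(V))
```

Under the equal-size assumption, the cost is minimized near C = sqrt(V), so even the best class count still leaves a cost that grows with sqrt(V)·H per word.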
In this paper, the statistical characteristics of the normalizing factors are investigated for the N-best hypotheses. Based on these observations, we propose to approximate the normalizing factors for each hypothesis as a constant proportional to the number of words in the hypothesis, so that the normalizing factors can be easily absorbed into the word penalty. The unnormalized NNLM is then combined with a back-off N-gram for fast rescoring; it can be computed very quickly because no normalization is needed in the output layer, which reduces the complexity significantly.
We apply the proposed method to a well-tuned context-dependent deep neural network hidden Markov model (CD-DNN-HMM) speech recognition system on the English-Switchboard speech-to-text task. Both a feedforward NNLM and a recurrent NNLM are well trained to verify the effectiveness of the method. Experimental results show that the unnormalized probability of the NNLM is quite complementary to that of the back-off N-gram, and that combining the unnormalized NNLM with the back-off N-gram further improves speech recognition performance at little computational cost.
Because our method is founded on statistical observations, we first introduce the experimental setup, including the speech recognizer, the N-best hypotheses, the NNLM structure, and the NNLM training, in Section 2. The remainder of this paper is organized as follows. The statistics of the normalizing factors on the hypotheses are investigated, and the constant normalizing factor approximation is proposed, in Section 3. How to combine the unnormalized NNLM with the back-off N-gram is presented in Section 4, followed by the complexity analysis and speed comparisons in Section 5. Detailed experimental evaluations of N-best rescoring are presented in Section 6. Related work is discussed in Section 7, followed by the conclusions in Section 8.
2 Experimental setup
The experimental setup, covering the speech recognizer, the N-best hypotheses, the NNLM structure, and the NNLM training, is introduced here first, since our method is founded on statistical observations.
2.1 Speech recognizer and N-best hypotheses
The effectiveness of the proposed method is evaluated on the STT task with the 309-hour Switchboard-I training set [15]. The 13-dimensional perceptual linear prediction (PLP) features with rolling-window mean-variance normalization and up to third-order derivatives are reduced to 39 dimensions by heteroscedastic linear discriminant analysis (HLDA). The speaker-independent three-state cross-word triphones share 9,308 tied states. The GMM-HMM baseline system has 40 Gaussian mixtures per state, trained with maximum likelihood (ML) and refined discriminatively (DT) with the minimum phone error (MPE) criterion. The well-tuned CD-DNN-HMM system replaces the Gaussian mixtures with scaled likelihoods derived from DNN posteriors. The input to the DNN contains 11 (5-1-5) frames of 39-dimensional features, and the DNN uses the architecture 429-2048×7-9308. The development data is the 1831-segment Switchboard part of the NIST 2000 Hub5 eval set (Hub5’00-SWB). The Fisher half of the 6.3-h Spring 2003 NIST rich transcription set (RT03S-FSH) acts as the evaluation set.
The 2000-h Fisher transcripts, containing about 23 million words, are taken as our training corpus for language modeling. A back-off trigram language model with Kneser-Ney smoothing (KN3) was trained on the 2000-h Fisher transcripts for decoding, where the vocabulary is limited to 53K words and unknown words are mapped to a special token <unk>. Note that no additional text is used to train LMs for interpolation, to ensure repeatability. The out-of-vocabulary rate is 0.80% for the training corpus, 0.53% for the development corpus, and 0.017% for the evaluation corpus. The pronouncing dictionary comes from CMU [16]. The HDecode^{a} command is used to decode each utterance with KN3 and output the lattice, and the N-best hypotheses are then extracted from the lattice with the lattice-tool^{b} command. In this setup, the top 100-best hypotheses are rescored and reranked by other language models, such as the back-off 5-gram, the FNNLM, and the RNNLM, to improve performance.
2.2 Structure and training of NNLM
The typical structures of the NNLMs are shown in Figures 1 and 2, corresponding to the FNNLM and the RNNLM, respectively. We define V, H, and N as the vocabulary, the size of the hidden layer, and the order of the FNNLM, respectively. The projection matrix $E \in \Re^{|V| \times H}$ maps each word to a feature vector, its distributed representation, which is fed into the hidden layer.
Based on these structures, the hidden state $h_t$ of the FNNLM is computed as $h_t = \tanh\left(\sum_{o=1}^{N-1} W_{ih_o} v_{t-o}\right)$, while that of the RNNLM is computed as $h_t = \mathrm{sigmoid}(W_{hh} h_{t-1} + v_t)$, where tanh and sigmoid are the activation functions. The probability of the next word is computed via the softmax function in the output layer, where $W_{ho} = [\theta_1, \theta_2, \ldots, \theta_{|V|}]^T \in \Re^{|V| \times H}$ is the prediction matrix and each $\theta_i \in \Re^{H \times 1}$ corresponds to one output node.
The transcripts of the Hub5’00-SWB set and the RT03S-FSH set serve as the development set and the evaluation set, respectively, for NNLM training. One FNNLM and one RNNLM are trained on the training corpus with the open-source toolkits CSLM [17] and RNNLM [18], respectively, and the hidden layer of each model contains 300 nodes.
To speed up the training of the RNNLM, a frequency-based partition method [4] is used to factorize the output layer into 400 classes. The truncated backpropagation through time (BPTT) algorithm [19] is used to train the RNNLM with 10 time steps, with the initial learning rate set to 0.1. The learning rate is halved when the perplexity decreases very slowly or increases. In contrast, the training of the FNNLM is accelerated with a GPU implementation that uses mini-batches of 128 context-word pairs, so no class layer is used, as a class layer usually sacrifices some NNLM performance for speed. The learning rate is empirically set as $lr = lr_0 / (1 + \text{count} \times \text{wdecay})$, where the initial learning rate $lr_0$ is set to 1.0, the weight decay ‘wdecay’ is set to $2 \times 10^{-8}$, and the parameter ‘count’ denotes the number of samples processed, so that the learning rate decays as training proceeds. The basic back-off 5-gram language model (KN5) is also trained, with the modified Kneser-Ney smoothing algorithm.
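The two learning-rate rules described above can be sketched as follows. The sketch assumes the weight decay is 2×10^-8 (a positive exponent would drive the rate to zero after the first sample), and the threshold `min_rel_gain` for "decreases very slowly" is a hypothetical parameter, since no value is given in the text.

```python
def fnnlm_learning_rate(count, lr0=1.0, wdecay=2e-8):
    """FNNLM schedule: lr = lr0 / (1 + count * wdecay).

    count is the number of training samples processed so far.
    """
    return lr0 / (1.0 + count * wdecay)

def rnnlm_learning_rate(lr, prev_ppl, curr_ppl, min_rel_gain=0.003):
    """RNNLM rule: halve lr when perplexity improves too slowly or gets worse.

    min_rel_gain is a hypothetical threshold; the text does not state one.
    """
    if curr_ppl > prev_ppl * (1.0 - min_rel_gain):
        return lr / 2.0
    return lr

if __name__ == "__main__":
    # The FNNLM rate after 0, 23 million, and 230 million samples.
    for count in (0, 23_000_000, 230_000_000):
        print(count, fnnlm_learning_rate(count))
```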
3 Statistics of normalizing factors on N-best hypotheses
3.1 Review of N-best rescoring
The output of the first decoding pass is usually a multi-candidate form encoded as a lattice or an N-best list. Each path in the lattice or N-best list is a candidate time-aligned transcript $W = w_1, w_2, \ldots, w_n$ of the speech utterance X. The N-best list is widely used for its simplicity, and N-best rescoring in LVCSR is reviewed here.
Given the acoustic model Λ, the language model L, and a speech utterance $X_i$, the N-best hypotheses from ASR decoding are denoted as $H_i = W_{i1}, W_{i2}, \ldots, W_{iN}$. The score of each hypothesis $W_{ij}$ is computed (Equation 1) as the sum of three terms: the first two are the acoustic score and the language score, respectively, and the last is the word penalty, which balances insertions and deletions. Here α denotes a scaling factor for the language score and $n_{ij}$ denotes the number of words in the hypothesis $W_{ij}$. The global score of each hypothesis in $H_i$ is computed, the hypotheses are reranked, and the top hypothesis is selected as the output for evaluation. Generally, better performance is expected with more accurate models.
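The reranking step can be sketched as follows. The sketch assumes a global score of the form $\log P(X_i \mid W_{ij}, \Lambda) + \alpha \log P(W_{ij} \mid L) + \beta n_{ij}$, where β is a hypothetical symbol for the per-word penalty (the text names only α and $n_{ij}$); the `Hypothesis` container and its field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: list            # the word sequence of the hypothesis
    acoustic_logp: float   # log acoustic score from the first pass
    lm_logp: float         # log language-model score to be used for rescoring

def global_score(hyp, alpha, beta):
    """Acoustic score + scaled LM score + word penalty (beta per word, assumed)."""
    return hyp.acoustic_logp + alpha * hyp.lm_logp + beta * len(hyp.words)

def rerank(nbest, alpha, beta):
    """Return the hypotheses sorted from best to worst global score."""
    return sorted(nbest, key=lambda h: global_score(h, alpha, beta), reverse=True)

if __name__ == "__main__":
    nbest = [
        Hypothesis(["a", "b", "c"], acoustic_logp=-99.0, lm_logp=-6.0),
        Hypothesis(["a", "b"], acoustic_logp=-100.0, lm_logp=-5.0),
    ]
    # alpha and beta here are arbitrary illustrative values.
    best = rerank(nbest, alpha=12.0, beta=0.0)[0]
    print(" ".join(best.words))
```

Swapping in a more accurate language model changes only `lm_logp`; the acoustic scores from the first pass are reused unchanged.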
3.2 Normalizing factor for one word
Given a word sequence $\bar{s}$, denote the t-th word as $w_t$. The identity of word $w_t$ is denoted as $q(w_t) = y_i \in V$, where the subscript i of $y_i$ is the word index in the vocabulary V. The structures of the FNNLM and the RNNLM are shown in Figures 1 and 2, respectively, where $W_{ho} = [\theta_1, \theta_2, \ldots, \theta_{|V|}]^T \in \Re^{|V| \times H}$ is the prediction matrix and each $\theta_i \in \Re^{H \times 1}$ corresponds to one output node.
The predicted probability of the NNLM is computed with the softmax function (Equation 2) as the ratio $\exp(s_t)/z_t$, where $\exp(s_t)$ and $z_t$ are the unnormalized probability and the softmax-normalizing factor, respectively. Computing the factor $z_t$ requires a sum over the whole vocabulary, which is the source of the heavy computational burden of normalization.
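A minimal sketch of this decomposition is given below. It assumes the standard softmax, with $s_t = \theta_i^T h_t$ for the target word and $z_t = \sum_k \exp(\theta_k^T h_t)$ over the vocabulary, and it ignores the class-factorized output layer used for the RNNLM. The function names are illustrative.

```python
import math

def unnormalized_score(theta_i, h):
    """s_t: one dot product between an output row and the hidden state, O(H)."""
    return sum(a * b for a, b in zip(theta_i, h))

def log_normalizer(W_ho, h):
    """log(z_t): needs a dot product for every vocabulary word, O(|V| * H)."""
    scores = [unnormalized_score(theta, h) for theta in W_ho]
    m = max(scores)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_prob(W_ho, h, i):
    """log P(w_t = y_i | context) = s_t - log(z_t)."""
    return unnormalized_score(W_ho[i], h) - log_normalizer(W_ho, h)

if __name__ == "__main__":
    W_ho = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]  # toy 3-word vocabulary, H = 2
    h = [1.0, 2.0]
    print([math.exp(log_prob(W_ho, h, i)) for i in range(len(W_ho))])
```

Only `log_normalizer` touches every row of the prediction matrix, which is why dropping it removes almost all of the output-layer cost.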
We evaluated our well-trained FNNLM and RNNLM on the 100-best hypotheses generated from the Hub5’00-SWB set (1,812 utterances), which contain 147,454 hypotheses and 2,125,315 words. The $\log(z_t)$ for each word is computed, and the probability density functions (PDFs) of $\log(z_t)$ for the FNNLM and the RNNLM are plotted in Figure 3. The log normalizing factor is widely distributed, ranging from 13 to 20 for the FNNLM and from 7 to 20 for the RNNLM. The variance of $\log(z_t)$ thus seems too large for the normalizing factor to be simply approximated as a constant for N-best rescoring. However, the observations described in the next two subsections help us approximate the normalizing factor, and they also suggest that some discriminative information of the NNLM resides in the unnormalized probability for N-best rescoring.
3.3 Normalizing factor for one hypothesis
The output of the speech recognizer is usually encoded as N-best hypotheses, and a better hypothesis can be selected via rescoring with more accurate models. The language score for the hypothesis $W_{ij} = w_{ij1}, w_{ij2}, \ldots, w_{ijn_{ij}}$ is computed (Equation 3) as the sum of $s_{ijt}$ over the words of the hypothesis minus the sum of $\log(z_{ijt})$ over the same words, where $n_{ij}$ denotes the number of words in hypothesis $W_{ij}$. Each $s_{ijt}$ can be computed efficiently as the dot product of two vectors, whereas each $z_{ijt}$ requires a lot of computation.
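Written out under the definitions above, and assuming the standard softmax with $s_t = \theta_i^T h_t$ (consistent with the dot-product description), the word probability and the hypothesis-level language score plausibly take the following form. The typesetting is a reconstruction from the surrounding text rather than a verbatim copy.

```latex
P(w_t = y_i \mid \text{context}) = \frac{\exp(s_t)}{z_t},
\qquad s_t = \theta_i^{T} h_t,
\qquad z_t = \sum_{k=1}^{|V|} \exp\!\left(\theta_k^{T} h_t\right)

\log P(W_{ij} \mid L)
  = \sum_{t=1}^{n_{ij}} \log \frac{\exp(s_{ijt})}{z_{ijt}}
  = \sum_{t=1}^{n_{ij}} s_{ijt} \;-\; \sum_{t=1}^{n_{ij}} \log\left(z_{ijt}\right)
```

The first sum costs one dot product per word, while the second is the term that the rest of this section approximates.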
We randomly selected one utterance from the Hub5’00-SWB set and decoded it with HDecode. The top ten hypotheses are shown in Table 1. We notice that the N-best hypotheses share many similar contexts, especially hypotheses with a low word error rate (WER), and that the differences are usually local. The normalizing factor $z_{ijt}$ is completely determined by the context via the smooth function in Equation 2, so similar contexts yield normalizing factors that are close in value. Many normalizing factors in the N-best hypotheses are therefore the same or similar, and we roughly approximate $\sum_{t=1}^{n_{ij}} \log(z_{ijt})$ for hypothesis $W_{ij}$ as a constant proportional to $n_{ij}$, that is, as $\mu_i n_{ij}$ (Equation 4), where $\mu_i$ is the constant corresponding to utterance $X_i$. With $\mu_{ij}$ taken as the per-word average of $\log(z_{ijt})$ in hypothesis $W_{ij}$, $\mu_i$ can be estimated as $\mu_i \approx \overline{\mu_{ij}} = \frac{1}{N} \sum_{j=1}^{N} \mu_{ij}$.
This approximation for utterance $X_i$ can be evaluated with the variance of $\mu_{ij}$, $\mathrm{Var}(\mu_{ij}) = \frac{1}{N} \sum_{j=1}^{N} (\mu_{ij} - \overline{\mu_{ij}})^2$, where $\overline{\mu_{ij}}$ denotes the mean of $\mu_{ij}$ in $H_i$. Since many utterances are approximated with Equation 4, the approximation as a whole can be assessed by the mean and the variance of $\mathrm{Var}(\mu_{ij})$ over all utterances. The smaller $\mathrm{Var}(\mu_{ij})$ is statistically, the more accurate the approximation.
We evaluated the well-trained FNNLM and RNNLM on the 100-best hypotheses generated from the Hub5’00-SWB set (1,812 utterances). The $\mu_{ij}$ for each hypothesis and the $\mathrm{Var}(\mu_{ij})$ for each 100-best list are computed, and the PDFs of $\mathrm{Var}(\mu_{ij})$ for the FNNLM and the RNNLM are shown in Figure 4. The PDFs are sharp and concentrated near zero, much like an impulse function, so the constant approximation in Equation 4 for each utterance is accurate and reasonable to some extent.
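The per-utterance statistics can be computed as in the sketch below. It assumes that $\mu_{ij}$ is the per-word average of $\log(z_{ijt})$ within a hypothesis, and that the $\log(z_{ijt})$ values have already been obtained from a fully normalized NNLM; the function names and the nested-list input layout are illustrative.

```python
def mean_log_z(log_z_per_word):
    """mu_ij: average of log(z) over the words of one hypothesis."""
    return sum(log_z_per_word) / len(log_z_per_word)

def utterance_stats(nbest_log_z):
    """Return (mean of mu_ij, Var(mu_ij)) over the N-best list of one utterance.

    nbest_log_z[j] holds the log(z) values of hypothesis j, one per word.
    The variance uses the 1/N form given in the text.
    """
    mus = [mean_log_z(hyp) for hyp in nbest_log_z]
    mu_bar = sum(mus) / len(mus)
    var = sum((m - mu_bar) ** 2 for m in mus) / len(mus)
    return mu_bar, var

if __name__ == "__main__":
    # Toy 2-best list: per-word log(z) varies, but the per-hypothesis means agree.
    print(utterance_stats([[14.0, 16.0], [15.0, 15.0, 15.0]]))
```

The toy input shows the effect the text relies on: individual log(z) values can spread widely while their per-hypothesis averages remain close to each other.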
3.4 Number of words in hypothesis
We also notice that the hypotheses in an N-best list contain similar numbers of words. N-best hypotheses are rescored and reranked according to their relative scores. If all the hypotheses for utterance $X_i$ contain the same number of words, then, under Equation 4, the second term in Equation 3 is the same for every hypothesis in the N-best list (Equation 5). That is, the normalizing factors do not affect the ranking in N-best rescoring, and $\mu_i$ for utterance $X_i$ can be arbitrary. We therefore further approximate the constant $\mu_i$ by a global constant μ that is independent of the utterance, so that $\sum_{t=1}^{n_{ij}} \log(z_{ijt}) \approx \mu n_{ij}$ (Equation 6). The global constant μ can be estimated on the validation set as $\mu = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \mu_{ij}$, where M and N denote the number of utterances and the number of hypotheses for each utterance, respectively.
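Estimating the global constant and checking how well it predicts the summed log normalizer of a hypothesis might look like the sketch below. It again assumes $\mu_{ij}$ is the per-word average of $\log(z_{ijt})$; the function names and input layout are illustrative, and the toy numbers are not taken from the experiments.

```python
def estimate_global_mu(validation_nbest):
    """mu: average of mu_ij over all hypotheses of all validation utterances.

    validation_nbest[i][j] is the list of log(z) values (one per word) of
    hypothesis j of utterance i. With N hypotheses per utterance this equals
    the double sum divided by M * N.
    """
    total, count = 0.0, 0
    for utterance in validation_nbest:
        for hyp in utterance:
            total += sum(hyp) / len(hyp)
            count += 1
    return total / count

def approximation_error(log_z_per_word, mu):
    """Absolute gap between the exact sum of log(z) and the estimate mu * n."""
    return abs(sum(log_z_per_word) - mu * len(log_z_per_word))

if __name__ == "__main__":
    validation = [
        [[14.0, 16.0], [15.0, 15.0, 15.0]],
        [[13.0, 15.0], [16.0, 18.0]],
    ]
    mu = estimate_global_mu(validation)
    print(mu, approximation_error([15.0, 15.5], mu))
```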
Note that the approximation in Equation 5 depends on the assumption that the hypotheses for each utterance are equal in length. We count the number of words $n_{ij}$ for each hypothesis $W_{ij}$ and compute the variance of $n_{ij}$ for each utterance $X_i$ (Equation 7) in the same form as $\mathrm{Var}(\mu_{ij})$, that is, $\mathrm{Var}(n_{ij}) = \frac{1}{N} \sum_{j=1}^{N} (n_{ij} - \overline{n_{ij}})^2$. The smaller $\mathrm{Var}(n_{ij})$ is statistically, the more accurate the approximations in Equations 5 and 6. The PDF of $\mathrm{Var}(n_{ij})$ on the 100-best hypotheses generated from the Hub5’00-SWB set is shown in Figure 5. The PDF is sharp, and most values of $\mathrm{Var}(n_{ij})$ are smaller than 1.0. The differences in length among the N-best hypotheses are small, so the approximation in Equation 5 is reasonable to some extent.
3.5 Normalizing factor approximation
Based on the approximations in Equations 4, 5, and 6, the LM score in Equation 3 can be simplified (Equation 8) to $\sum_{t=1}^{n_{ij}} s_{ijt} - \mu n_{ij}$. Only the first term needs to be computed for each hypothesis, while the constant μ in the second term can be estimated in advance on the validation set. With the constant approximation of the normalizing factor, the complexity of the output layer is reduced significantly, from O(VH) to O(H).
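The simplification can be illustrated by comparing the exact language score with its unnormalized counterpart. This is a sketch under the assumption that the exact score is $\sum_t s_{ijt} - \sum_t \log(z_{ijt})$ and the approximate one replaces the second sum with $\mu n_{ij}$; the toy values are invented for illustration.

```python
def exact_lm_score(s, log_z):
    """Exact log-probability of a hypothesis: sum of s_t minus sum of log(z_t)."""
    return sum(s) - sum(log_z)

def unnormalized_lm_score(s, mu):
    """Approximate score: sum of s_t minus a constant mu per word."""
    return sum(s) - mu * len(s)

if __name__ == "__main__":
    mu = 15.0
    # Two toy hypotheses of equal length whose log(z) values hover around mu.
    hyp_a = ([2.0, 1.5, 3.0], [15.1, 14.9, 15.0])
    hyp_b = ([1.0, 1.2, 0.8], [15.0, 15.2, 14.8])
    for s, log_z in (hyp_a, hyp_b):
        print(exact_lm_score(s, log_z), unnormalized_lm_score(s, mu))
```

When the per-word average of log(z) stays close to μ, the two scores nearly coincide, so the ranking of the hypotheses is preserved without ever computing a normalizer.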
We also notice that, for N-best rescoring, the discriminative information of the NNLM resides in the unnormalized probability in Equation 8, and that the LM scores from the back-off N-gram, especially from the KN3 used in decoding, are usually available for rescoring. In the next section, we investigate the discriminative information in the unnormalized NNLM (UP-NNLM), combined with the back-off N-gram, to further improve the performance of the speech recognizer.
4 Combining unnormalized NNLM and back-off N-gram
This section presents in detail how the UP-NNLM is combined with the back-off N-gram in the logarithmic domain. Generally, the performance of STT systems can be further improved by interpolating the NNLM with the back-off N-gram. Since the exact probability of the NNLM is unavailable in Equation 8, the linear interpolation is performed in the logarithmic domain for the entire hypothesis (Equation 9): the language score is a weighted sum of the unnormalized NNLM score of Equation 8 and $\log P_{\text{N-gram}}(W_{ij} \mid L)$, where $P_{\text{N-gram}}(W_{ij} \mid L)$ is the language score of the back-off N-gram for hypothesis $W_{ij}$. Substituting Equation 9 into Equation 1 gives the global score of each hypothesis (Equation 10), in which the normalizing-factor term, being proportional to $n_{ij}$, is absorbed into the word penalty. Only the unnormalized probability needs to be computed, so the computational complexity of the output layer is reduced significantly, without explicit normalization.
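The absorption of the normalizing constant into the word penalty can be checked numerically. The sketch below assumes an interpolation weight λ and a per-word penalty β (both hypothetical symbols, as the text does not name them), so that the interpolated language score is $\lambda(\sum_t s_{ijt} - \mu n_{ij}) + (1 - \lambda) \log P_{\text{N-gram}}(W_{ij} \mid L)$. Expanding the global score then moves the term $-\alpha \lambda \mu n_{ij}$ into an effective penalty $\beta - \alpha \lambda \mu$.

```python
def score_explicit(ac, s_sum, ngram_logp, n, alpha, lam, beta, mu):
    """Global score with the constant normalizer mu * n kept explicitly."""
    lm = lam * (s_sum - mu * n) + (1.0 - lam) * ngram_logp
    return ac + alpha * lm + beta * n

def score_absorbed(ac, s_sum, ngram_logp, n, alpha, lam, beta_eff):
    """Global score where mu has been folded into the word penalty beta_eff."""
    lm = lam * s_sum + (1.0 - lam) * ngram_logp
    return ac + alpha * lm + beta_eff * n

if __name__ == "__main__":
    # Arbitrary illustrative values, not tuned settings from the experiments.
    alpha, lam, beta, mu = 12.0, 0.5, -6.0, 15.0
    beta_eff = beta - alpha * lam * mu
    a = score_explicit(-100.0, 6.5, -20.0, 3, alpha, lam, beta, mu)
    b = score_absorbed(-100.0, 6.5, -20.0, 3, alpha, lam, beta_eff)
    print(a, b)
```

Because the word penalty is tuned on the validation set anyway, μ never has to be estimated separately in this formulation; tuning the effective penalty covers it.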
5 Complexity analysis and speed comparisons
The complexities of the NNLM and the UP-NNLM are analyzed, and their evaluation speeds are measured; both are shown in Table 2 for detailed comparison.
The class-based output layer method is based on frequency partitioning [4], and the computational complexity of its output layer is O((C + V/C)H), as shown in Table 2. This method is usually used to speed up NNLM training, and it speeds up NNLM evaluation as well. In comparison, the unnormalized probability of the NNLM (UP-NNLM) in our method requires only O(H) in the output layer. In addition, the complexity of the hidden layer in the FNNLM can be further reduced by table lookup in the position-dependent projection matrices $\hat{E}_k \in \Re^{|V| \times H}, k = 1, 2, \cdots, N-1$, where $\hat{E}_k = E W_{ih_k}$ can be computed offline. We denote this fast version of the UP-FNNLM as fast-UP-FNNLM in Table 2.
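The lookup trick can be sketched as follows: the product of each word's feature vector with each position-specific input weight matrix is precomputed once, so that the hidden state at evaluation time needs only table lookups, additions, and the activation. The sketch uses plain Python lists, treats feature vectors as rows, and orders the context so that `context[k]` is the word k + 1 positions back; these conventions and all names are assumptions made for illustration.

```python
import math

def row_times_matrix(row, W):
    """Multiply a length-D row vector by a D x H matrix."""
    return [sum(row[d] * W[d][h] for d in range(len(row))) for h in range(len(W[0]))]

def precompute_tables(E, W_ih):
    """Offline step: tables[k][w] = E[w] * W_ih[k] for every position k and word w."""
    return [[row_times_matrix(E[w], W_k) for w in range(len(E))] for W_k in W_ih]

def hidden_slow(E, W_ih, context):
    """Online projection and matrix products, then tanh."""
    acc = [0.0] * len(W_ih[0][0])
    for k, w in enumerate(context):
        acc = [a + p for a, p in zip(acc, row_times_matrix(E[w], W_ih[k]))]
    return [math.tanh(a) for a in acc]

def hidden_fast(tables, context):
    """Lookup version: no matrix products at evaluation time."""
    acc = [0.0] * len(tables[0][0])
    for k, w in enumerate(context):
        acc = [a + p for a, p in zip(acc, tables[k][w])]
    return [math.tanh(a) for a in acc]

if __name__ == "__main__":
    E = [[1.0, 0.0], [0.5, -0.5], [0.2, 0.3]]                         # 3 words, 2 features
    W_ih = [[[0.1, 0.2], [0.3, 0.4]], [[-0.1, 0.0], [0.2, 0.5]]]      # 2 context positions
    tables = precompute_tables(E, W_ih)
    print(hidden_slow(E, W_ih, [2, 0]), hidden_fast(tables, [2, 0]))
```

The price of the speedup is memory: one table of |V| × H values is stored for each of the N − 1 context positions.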
The evaluation speed is measured as the number of words processed per second on a machine with an 8-core Intel(R) Xeon(R) E5520 CPU at 2.27 GHz and 8 GB of RAM, as shown in Table 2. The implementations are based on the open-source toolkits CSLM [17] and RNNLM [18] to ensure repeatability. For a clear comparison, the speed of the NNLM without a class layer is also measured. One million words are randomly selected from the training data and evaluated by the FNNLM and the RNNLM, with the words fed into the models one by one. Experimental results show that UP-FNNLM and UP-RNNLM are about 2 to 3 times faster than ‘FNNLM + class layer’ and ‘RNNLM + class layer’ in evaluation. Note that the complexity of the hidden layer in UP-FNNLM or UP-RNNLM is comparable with that of the class-based output layer in FNNLM + class layer or RNNLM + class layer, so this speedup factor is reasonable. It is also worth noting that the fast-UP-FNNLM is more than 25 times faster than FNNLM + class layer and more than 1,100 times faster than FNNLM. To show the speedup clearly, the evaluation speed is also compared for different hidden layer sizes in Table 3. The larger the hidden layer, the slower the evaluation, and the ‘fast-UP-FNNLM’ is the fastest of all.
6 N-best rescoring evaluation
The NNLM and the UP-NNLM are applied to N-best rescoring in this section to demonstrate the performance of our method. Following the experimental setup described in Section 2, the perplexities of our trained language models, including KN3, KN5, FNNLM, and RNNLM-C400, are presented in Table 4 for comparison, where KN3 is the model used for decoding. The RNNLM-C400 interpolated with KN5 performs best on both the Hub5’00-SWB set and the RT03S-FSH set. RNNLM-C400 also achieves a slightly better perplexity than FNNLM with the same setup. The Hub5’00-SWB set and the RT03S-FSH set serve as the validation set and the evaluation set, respectively. The 100-best hypotheses for these two sets are rescored and reranked by the different language models, as shown in Table 5, where 1-best denotes the output of HDecode with KN3. The 1-best results on these two sets, which form our baseline, are comparable with other reported results [20, 21].
The UP-FNNLM and the UP-RNNLM, combined with the back-off N-gram, are used for fast rescoring in this section. Note that the output layer of our trained RNNLM-C400 is divided into many small softmax output layers in order to speed up training on the large corpus. The unnormalized probability therefore comes from the activations of the class layer and of the specific softmax output layer, and the entire normalizing factor is again approximated as in Equation 6. The UP-NNLM is linearly interpolated with KN5 in the logarithmic domain. The interpolation weight, the scale of the LM scores, and the word penalty are all individually tuned on the Hub5’00-SWB set, and the final performance is then evaluated on the RT03S-FSH set, as shown in Table 5. Significant reductions in WER are observed on both the validation and the evaluation sets. The language scores of KN3 are usually available in the lattice or the N-best list, and the UP-RNNLM combined with KN3 reduces the WER by 0.8% and 1.2% absolute on the Hub5’00-SWB and RT03S-FSH sets, respectively. ‘KN5 + UP-RNNLM-C400’ further reduces the WER by 1.2% and 1.7% absolute on these two sets. We also notice that the UP-RNNLM performs slightly better than the UP-FNNLM, while the UP-FNNLM can be evaluated much faster than the UP-RNNLM. ‘UP-NNLM + KN5’ obtains about 1/2 to 2/3 of the gains of ‘NNLM + KN5’ with little computation. These results show that the unnormalized probability of NNLMs, including the FNNLM and the RNNLM, is quite complementary to that of the back-off N-gram, and that the performance is further improved by combining the back-off N-gram and the NNLM.
7 Discussions with related work
Fast rescoring with NNLMs has attracted much attention in the field of speech recognition [9, 10, 22–24]. Many methods [9, 10, 22] factorize the output layer to reduce the complexity of the NNLM and to speed up training and evaluation. Other techniques [22, 24] avoid the redundant computations that arise in N-best or lattice rescoring. Our proposed method can be easily combined with these methods to further improve the speed of rescoring. In addition, fast training of NNLMs with noise-contrastive estimation (NCE) [25] was proposed in [26], where the normalizing factor for each context is treated as a parameter to be learned during training, so that training is sped up without explicit normalization. However, the normalizing factor for each context has to be learned separately, and these factors differ across contexts, so the NNLM still needs to be normalized explicitly at evaluation time. Interestingly, as mentioned in [26], the normalizing factors can be fixed to one instead of being learned during training. We believe this finding will help to improve our current work: if the variance of the normalizing factor can be constrained to a small range, the approximation in Equation 6 will become more accurate. In this work, the distribution of the normalizing factors on the N-best list is investigated, and the normalizing factor for each hypothesis is approximated as a constant for fast rescoring, without considering the variance of the normalizing factor. Based on the finding in [26], we will investigate in future work how to constrain the variance of the normalizing factors during training to further improve our method.
Furthermore, an alternative way to speed up rescoring is to use the word lattice instead of the N-best list. As the output of STT, the word lattice can compactly represent many more hypotheses than the N-best list. We therefore ask whether the proposed method can be extended to lattice rescoring. The N-best list approaches the lattice as the size of the N-best list increases, and two experiments are designed to validate our method. First, we investigate whether the performance of N-best rescoring degrades as the size of the N-best list increases. A 1,000-best list instead of a 100-best list is extracted for each utterance and rescored by the proposed method, as shown in Table 5. The results show that the proposed method still works well for the 1,000-best list, with improvements similar to those for the 100-best list. Second, we directly rescore the lattice with the ‘lattice-tool’ [27] command. For ease of implementation and fast rescoring, ‘UP-FNNLM + KN5’ is integrated into the lattice-tool command, where the computation of the LM score is replaced with Equation 9. The results in Table 6 show that lattice rescoring obtains a slightly lower WER than N-best rescoring. Together, these results indicate that the proposed approximations, which are based on our first-hand observations, are reasonable and effective for fast N-best rescoring.
8 Conclusions
Based on the observed characteristics of N-best hypotheses, the normalizing factors of the NNLM for each hypothesis are approximated by a global constant for fast evaluation. The unnormalized NNLM combined with the back-off N-gram is empirically investigated and evaluated on the English-Switchboard speech-to-text task. The computational complexity is reduced significantly because no explicit softmax normalization is needed. Experimental results show that the UP-NNLM is about 2 to 3 times faster than ‘NNLM + class layer’ in evaluation. Moreover, the fast-UP-FNNLM is more than 25 times faster than FNNLM + class layer and more than 1,100 times faster than FNNLM. The N-best hypotheses output by the STT system are approximately rescored and reranked by the unnormalized NNLM combined with the back-off N-gram model in the logarithmic domain. Experimental results show that the unnormalized probability of the NNLM, including the FNNLM and the RNNLM, is quite complementary to that of the back-off N-gram, and that the UP-NNLM is discriminative for N-best rescoring even though it is only an approximation. The performance of the STT system is improved significantly by ‘KN5 + UP-NNLM’ at little computational cost.
Endnotes
^{a} HDecode -A -D -T 1 -s 12.0 -p 6.0 -n 32 -t 150.0 150.0 -v 105.0 95.0 -u 10000 -l lat/ -z lat -C CF -H models/HMM -w models/LM -S xa.scp -i xa.mlf models/DICT models/LIST (http://htk.eng.cam.ac.uk/extensions/).
^{b} lattice-tool -nbest-decode 100 -read-htk -htk-logbase 2.718 -htk-lmscale 12.0 -htk-wdpenalty 6.0 -in-lattice-list xa.lst -out-nbest-dir nbest/ (http://www.speech.sri.com/projects/srilm/manpages/lattice-tool.1.html).
Abbreviations
ASR: automatic speech recognition
DNN: deep neural network
FNNLM: feedforward neural network language model
KN: Kneser-Ney smoothing algorithm
KN3: back-off 3-gram based on KN smoothing
KN5: back-off 5-gram based on KN smoothing
LM: back-off N-gram language model
NNLM: neural network language model
PDF: probability density function
PPL: perplexity
RNNLM: recurrent neural network language model
STT: speech-to-text
WER: word error rate
References
1. Bengio Y, Ducharme R, Vincent P, Jauvin C: A neural probabilistic language model. J. Mach. Learn. Res. (JMLR) 2003, 3: 1137–1155.
2. Arisoy E, Sainath TN, Kingsbury B, Ramabhadran B: Deep neural network language models. In Proceedings of NAACL-HLT Workshop. Montreal; 2012:20–28. http://www.aclweb.org/anthology/W12-2703
3. Mikolov T, Karafiat M, Burget L, Cernocky JH, Khudanpur S: Recurrent neural network based language model. In Proceedings of InterSpeech. Makuhari; 2010:1045–1048.
4. Mikolov T, Kombrink S, Burget L, Cernocky JH, Khudanpur S: Extensions of recurrent neural network language model. In Proceedings of ICASSP. Prague; 2011.
5. Sundermeyer M, Schluter R, Ney H: LSTM neural networks for language modeling. In Proceedings of InterSpeech. Portland; 2012.
6. Mikolov T, Deoras A, Kombrink S, Burget L, Cernocky JH: Empirical evaluation and combination of advanced language modeling techniques. In Proceedings of InterSpeech. Florence; 2011.
7. Kombrink S, Mikolov T, Karafiat M, Burget L: Recurrent neural network based language modeling in meeting recognition. In Proceedings of InterSpeech. Florence; 2011.
8. Mikolov T: Statistical language models based on neural networks. PhD thesis, Brno University of Technology (BUT), 2012. http://www.fit.vutbr.cz/~imikolov/rnnlm/thesis.pdf
9. Le HS, Oparin I, Allauzen A, Gauvain JL, Yvon F: Structured output layer neural network language models for speech recognition. IEEE Trans. Audio Speech Lang. Process 2013, 21: 197–206.
10. Shi Y, Zhang WQ, Liu J, Johnson MT: RNN language model with word clustering and class-based output layer. EURASIP J. Audio Speech Music Process 2013, 22. doi:10.1186/1687-4722-2013-22
11. Morin F, Bengio Y: Hierarchical probabilistic neural network language model. In Proceedings of AISTATS. Barbados; 2005:246–252.
12. Mnih A, Hinton G: A scalable hierarchical distributed language model. Adv. Neural Inf. Process. Syst 2008, 21: 1081–1088.
13. Fellbaum C: WordNet: an Electronic Lexical Database. MIT, Cambridge; 1998.
14. Brown PF, deSouza PV, Mercer RL, Pietra VJD, Lai JC: Class-based N-gram models for natural language. Comput. Linguist 1992, 18(4):467–479.
15. Godfrey J, Holliman E: Switchboard-1 Release 2. Linguistic Data Consortium, Philadelphia; 1997.
16. The CMU pronouncing dictionary release 0.7a. 2007. http://www.speech.cs.cmu.edu/cgi-bin/cmudict
17. Schwenk H: CSLM: Continuous space language model toolkit. 2010. http://www-lium.univ-lemans.fr/cslm/
18. Mikolov T, Deoras A, Kombrink S, Burget L, Cernocky JH: RNNLM - Recurrent Neural Network Language Modeling Toolkit. In Proceedings of ASRU. Hawaii; 2011. http://www.fit.vutbr.cz/~imikolov/rnnlm/
19. Rumelhart DE, Hinton GE, Williams RJ: Learning representations by back-propagating errors. Nature 1986, 323: 533–536. doi:10.1038/323533a0
20. Seide F, Li G, Chen X, Yu D: Feature engineering in context-dependent deep neural networks for conversational speech transcription. In Proceedings of ASRU. Hawaii; 2011.
21. Cai M, Shi Y, Liu J: Deep maxout neural networks for speech recognition. In Proceedings of ASRU. Olomouc; 2013.
22. Schwenk H: Continuous space language models. Comput. Speech Lang 2007, 21(3):492–518.
23. Auli M, Galley M, Quirk C, Zweig G: Joint language and translation modeling with recurrent neural networks. In Proceedings of EMNLP. Seattle; 2013:1044–1054.
24. Si Y, Zhang Q, Li T, Pan J, Yan Y: Prefix tree based n-best list rescoring for recurrent neural network language model used in speech recognition system. In Proceedings of InterSpeech. Lyon; 2013:3419–3423.
25. Gutmann M, Hyvarinen A: Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of AISTATS. Sardinia; 2010:297–304.
26. Mnih A, Teh YW: A fast and simple algorithm for training neural probabilistic language models. In Proceedings of ICML. Edinburgh; 2012.
27. Stolcke A: SRILM – an extensible language modeling toolkit. Proceedings of ICSLP 2002, 901–904.
Acknowledgements
The authors are grateful to the anonymous reviewers for their insightful and valuable comments. This work was supported by the National Natural Science Foundation of China under Grant Nos. 61273268, 61005019, and 90920302, and in part by the Beijing Natural Science Foundation Program under Grant No. KZ201110005005.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Shi, Y., Zhang, W., Cai, M. et al. Empirically combining unnormalized NNLM and back-off N-gram for fast N-best rescoring in speech recognition. J AUDIO SPEECH MUSIC PROC. 2014, 19 (2014). https://doi.org/10.1186/1687-4722-2014-19
Keywords
 Neural network language model
 N-best rescoring
 Speech recognition