Empirically combining unnormalized NNLM and back-off N-gram for fast N-best rescoring in speech recognition

Neural network language models (NNLMs), including the feed-forward NNLM (FNNLM) and the recurrent NNLM (RNNLM), have proved to be quite powerful for sequence modeling. One main issue with NNLMs is the heavy computational burden of the output layer, where the output needs to be probabilistically normalized and the normalizing factors are expensive to compute. How to rescore the N-best list or lattice quickly with an NNLM has therefore attracted much attention for large-scale applications. In this paper, the statistical characteristics of the normalizing factors are investigated on the N-best list. Based on these statistical observations, we propose to approximate the normalizing factors for each hypothesis as a constant proportional to the number of words in the hypothesis. The unnormalized NNLM is then investigated and combined with a back-off N-gram for fast rescoring; it needs no normalization in the output layer, so the complexity is reduced significantly. We apply our proposed method to a well-tuned context-dependent deep neural network hidden Markov model (CD-DNN-HMM) speech recognition system on the English-Switchboard phone-call speech-to-text task, where both FNNLM and RNNLM are trained to demonstrate our method. Experimental results show that the unnormalized probability of the NNLM is quite complementary to that of the back-off N-gram, and that combining the unnormalized NNLM with the back-off N-gram further reduces the word error rate at little computational cost.

http://asmp.eurasipjournals.com/content/2014/1/19
[...] to C clusters. The words are assigned to classes proportionally. With this frequency-based clustering, the complexity of the output layer can be written in closed form as O((C + |V|/C)H), where |V| and H denote the number of nodes in the output layer and the hidden layer, respectively. Another method [9,11,12] is to factorize the output layer with a tree structure that needs to be carefully constructed based on expert knowledge [13] or another clustering method [14]. Although the structure-based methods can speed up the evaluation of NNLM, their complexity is still quite high for real-time systems.
In this paper, the statistical characteristics of the normalizing factors are investigated for the N-best hypotheses. Based on these statistical observations, we propose to approximate the normalizing factors for each hypothesis as a constant proportional to the number of words in the hypothesis, so that the normalizing factors can be easily absorbed into the word penalty. The unnormalized NNLM is then investigated and combined with a back-off N-gram for fast rescoring; it needs no normalization in the output layer, so the complexity is reduced significantly.
We apply our proposed method to a well-tuned context-dependent deep neural network hidden Markov model (CD-DNN-HMM) speech recognition system on the English-Switchboard speech-to-text task. Both a feed-forward NNLM and a recurrent NNLM are trained to verify the effectiveness of our method. Experimental results show that the unnormalized probability of the NNLM is quite complementary to that of the back-off N-gram, and that combining the unnormalized NNLM with the back-off N-gram further improves the performance of speech recognition at little computational cost.
As our method is founded on statistical observations, we first introduce the experimental setup, including the speech recognizer, N-best hypotheses, NNLM structure, and NNLM training, in Section 2 for convenience. The remainder of this paper is organized as follows. The statistics of the normalizing factors on the hypotheses are investigated and the constant normalizing factor approximation is proposed in Section 3. How to combine the unnormalized NNLM and back-off N-gram is presented in Section 4, followed by complexity analysis and speed comparisons in Section 5. Detailed experimental evaluations for N-best rescoring are presented in Section 6. Related work is discussed in Section 7, followed by the conclusions in Section 8.

Experimental setup
The experimental setup for the speech recognizer, the N-best hypotheses, the NNLM structure, and the NNLM training is introduced here first, since our method is founded on statistical observations made with this setup.

Speech recognizer and N-best hypotheses
The effectiveness of our proposed method is evaluated on the STT task with the 309-hour Switchboard-I training set [15]. The 13-dimensional perceptual linear prediction features (PLP) with rolling-window mean-variance normalization and up to third-order derivatives are reduced to 39 dimensions by heteroscedastic linear discriminant analysis (HLDA). The speaker-independent three-state cross-word triphones share 9,308 tied states. The GMM-HMM baseline system has 40-Gaussian mixtures per state, trained with maximum likelihood (ML), and refined discriminatively (DT) with the minimum phone error (MPE) criterion. The well-tuned CD-DNN-HMM system replaces the Gaussian mixtures with scaled likelihoods derived from DNN posteriors. The input to the DNN contains 11 (5-1-5) frames of 39-dimensional features, where the DNN uses the architecture of 429-2048×7-9308. The data for system development is the 1831-segment Switchboard part of the NIST 2000 Hub5 eval set (Hub5'00-SWB). The Fisher half of the 6.3h Spring 2003 NIST rich transcription set (RT03S-FSH) acts as the evaluation set.
The 2000h Fisher transcripts, containing about 23 million words, are taken as our training corpus for language modeling. A back-off trigram language model with Kneser-Ney smoothing (KN3) was trained on the 2000h Fisher transcripts for decoding, where the vocabulary is limited to 53K words and unknown words are mapped to a special token <unk>. Note that no additional text is used to train LMs for interpolation, to ensure repeatability. The out-of-vocabulary rate is 0.80% for the training corpus, 0.53% for the development corpus, and 0.017% for the evaluation corpus. The pronunciation dictionary comes from CMU [16]. The HDecode command is used to decode each utterance with KN3 and output a lattice, and the N-best hypotheses are then extracted from the lattice using the lattice-tool command. In this setup, the top 100 hypotheses are rescored and reranked by other language models, such as the back-off 5-gram, FNNLM, and RNNLM, to improve the performance.

Structure and training of NNLM
The typical structures of NNLMs are shown in Figures 1 and 2, corresponding to FNNLM and RNNLM, respectively. We define $V$, $H$, and $N$ as the vocabulary, the size of the hidden layer, and the order of the FNNLM, respectively. The projection matrix $E \in \mathbb{R}^{|V| \times H}$ maps each word to a feature vector, its distributional representation, which is fed into the hidden layer.
Based on the structures of NNLM, the hidden state $h_t$ of FNNLM can be computed as $h_t = \tanh$ [...] output layer, where $W_{ho} \in \mathbb{R}^{|V| \times H} = [\theta_1, \theta_2, \ldots, \theta_{|V|}]^T$ is the prediction matrix and each $\theta_i \in \mathbb{R}^{H \times 1}$ corresponds to one output node.
The transcripts of the Hub5'00-SWB set and the RT03S-FSH set act as the development set and the evaluation set, respectively, for NNLM training. One FNNLM and one RNNLM are trained on the training corpus with the open source toolkits CSLM [17] and RNNLM [18], respectively, where both hidden layers contain 300 nodes.
To speed up the training of the RNNLM, a frequency-based partition method [4] is used to factorize the output layer with 400 classes. The truncated backpropagation through time algorithm (BPTT) [19] is used to train the RNNLM with 10 time steps, with the initial learning rate set to 0.1. The learning rate is halved when the perplexity decreases very slowly or increases. In contrast, the training of the FNNLM can be sped up on a GPU with mini-batches of 128 context-word pairs, so no class layer was used, as the class layer usually sacrifices some performance of the NNLM for speed. The learning rate is empirically set as $lr = lr_0 / (1 + \mathrm{count} \times \mathrm{wdecay})$, where the initial learning rate $lr_0$ is set to 1.0, the weight decay 'wdecay' is set to $2 \times 10^{-8}$, and the parameter 'count' denotes the number of samples processed, so that the learning rate decays as training proceeds. The basic back-off 5-gram language model (KN5) is also trained with the modified Kneser-Ney smoothing algorithm.
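The FNNLM learning-rate schedule above can be sketched in a few lines. This is only an illustration of the stated formula with the stated constants; the function name `decayed_learning_rate` is hypothetical and is not part of the CSLM toolkit.

```python
def decayed_learning_rate(count: int, lr0: float = 1.0, wdecay: float = 2e-8) -> float:
    """Return lr0 / (1 + count * wdecay), where count is the number of
    training samples processed so far."""
    return lr0 / (1.0 + count * wdecay)


# The rate starts at lr0 and has roughly halved after 1 / wdecay = 50 million samples.
for count in (0, 50_000_000, 450_000_000):
    print(count, decayed_learning_rate(count))
```

With these constants the decay is slow: the rate is still close to 1.0 after the first few million samples and only reaches about one tenth of its initial value after 450 million samples.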

Review of N-best rescoring
The output of the first decoding pass is usually a set of candidates encoded as a lattice or an N-best list. Each path in the lattice or N-best list is a candidate time-aligned transcript $W = w_1, w_2, \ldots, w_n$ of the speech utterance $X$. The N-best list is widely used for its simplicity, and N-best rescoring in LVCSR is reviewed here. Given the acoustic model $\Lambda$, the language model $L$, and a speech utterance $X_i$, the N-best hypotheses from the ASR decoder are denoted as $H_i = \{W_{i1}, W_{i2}, \ldots, W_{iN}\}$, and the global score of each hypothesis is

$$g(X_i, W_{ij}) = \log P(X_i \mid W_{ij}, \Lambda) + \alpha \log P(W_{ij} \mid L) + \beta\, n_{ij}, \tag{1}$$

where the first two items correspond to acoustic scores and language scores, respectively, and the last one denotes the word penalty that balances insertions and deletions. Also, $\alpha$ denotes a scaling factor for language scores, $\beta$ the penalty per word, and $n_{ij}$ the number of words in the hypothesis $W_{ij}$. The global score of each hypothesis in $H_i$ is computed, and the hypotheses are reranked accordingly. The top hypothesis is selected as the output for evaluation. Generally, better performance is expected with more accurate models.
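As an illustration of the rescoring step reviewed above, the following minimal sketch combines an acoustic log score, a scaled language-model log score, and a per-word penalty, and then sorts the hypotheses. The class `Hypothesis`, the function names, and all numbers are hypothetical toy values, not data from the experiments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    words: List[str]
    acoustic_logp: float  # log acoustic score of the whole hypothesis
    lm_logp: float        # log language-model score of the whole hypothesis


def global_score(hyp: Hypothesis, lm_scale: float, word_penalty: float) -> float:
    """Acoustic score + scaled LM score + penalty proportional to the word count."""
    return hyp.acoustic_logp + lm_scale * hyp.lm_logp + word_penalty * len(hyp.words)


def rerank(nbest: List[Hypothesis], lm_scale: float, word_penalty: float) -> List[Hypothesis]:
    """Return the hypotheses sorted from best to worst global score."""
    return sorted(nbest, key=lambda h: global_score(h, lm_scale, word_penalty), reverse=True)


# Toy 2-best list (made-up scores).
nbest = [
    Hypothesis("i think so".split(), acoustic_logp=-120.0, lm_logp=-9.0),
    Hypothesis("i think so too".split(), acoustic_logp=-118.5, lm_logp=-12.5),
]
best = rerank(nbest, lm_scale=12.0, word_penalty=-0.5)[0]
print(" ".join(best.words))
```

Replacing the language model only changes `lm_logp`; the acoustic scores from the first pass are reused, which is why rescoring is much cheaper than decoding again.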

Normalizing factor for one word
Given a word sequence, denote the t-th word as $w_t$. The identity of word $w_t$ is denoted as $q(w_t) = y_i \in V$, where the subscript $i$ of $y_i$ is the word index in the vocabulary $V$. The structures of FNNLM and RNNLM are shown in Figures 1 and 2, respectively, where $W_{ho} \in \mathbb{R}^{|V| \times H} = [\theta_1, \theta_2, \ldots, \theta_{|V|}]^T$ is the prediction matrix and each $\theta_i \in \mathbb{R}^{H \times 1}$ corresponds to one output node. With $h_t$ denoting the hidden state, the predicted probability of NNLM is computed as

$$P(q(w_t) = y_i \mid h_t) = \frac{\exp(\theta_i^T h_t)}{\sum_{k=1}^{|V|} \exp(\theta_k^T h_t)} = \frac{\exp(s_t)}{z_t}, \tag{2}$$

where $\exp(s_t)$, with $s_t = \theta_i^T h_t$, and $z_t = \sum_{k=1}^{|V|} \exp(\theta_k^T h_t)$ respectively correspond to the unnormalized probability and the softmax-normalizing factor. Computing the factor $z_t$ is the heavy part of the normalization, since it involves every output node. We evaluated our trained FNNLM and RNNLM on the 100-best hypotheses generated from the Hub5'00-SWB set (1,812 utterances), containing 147,454 hypotheses and 2,125,315 words. The value of $\log(z_t)$ is computed for each word, and the probability density functions (PDFs) of $\log(z_t)$ for FNNLM and RNNLM are plotted in Figure 3. The log normalizing factor is widely distributed, ranging from 13 to 20 for FNNLM and from 7 to 20 for RNNLM. The variance of $\log(z_t)$ thus seems too large for $\log(z_t)$ to be simply approximated as a constant for N-best rescoring. However, the observations presented in the next two subsections make such an approximation possible, and they also indicate that the unnormalized probability of NNLM retains discriminative information for N-best rescoring.
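To make the cost difference concrete, the sketch below computes the unnormalized log score of one word (a single dot product) and the log normalizing factor (a dot product for every output node). It is a pure-Python illustration under simplifying assumptions: no bias term, a toy 3-word vocabulary, and hypothetical function names; it is not the CSLM or RNNLM toolkit code.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unnormalized_log_prob(theta, h, i):
    """s_t = theta_i . h_t: one dot product, O(H)."""
    return dot(theta[i], h)


def log_normalizer(theta, h):
    """log z_t = log sum_k exp(theta_k . h_t): |V| dot products, O(|V| H).
    Uses the log-sum-exp trick for numerical stability."""
    scores = [dot(row, h) for row in theta]
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))


def log_prob(theta, h, i):
    """Exact log probability: s_t - log z_t."""
    return unnormalized_log_prob(theta, h, i) - log_normalizer(theta, h)


# Toy prediction matrix (|V| = 3, H = 2) and hidden state; values are made up.
theta = [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.5]]
h = [1.0, 2.0]
total = sum(math.exp(log_prob(theta, h, i)) for i in range(len(theta)))
print(total)  # the normalized probabilities sum to 1 up to rounding
```

Only `log_normalizer` touches every row of the prediction matrix, which is the part that the constant approximation discussed in this section avoids.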

Normalizing factor for one hypothesis
The output of the speech recognizer is usually encoded as N-best hypotheses, and a better hypothesis can be selected via rescoring with more accurate models. The language score for the hypothesis $W_{ij} = w_{ij1}, w_{ij2}, \ldots, w_{ijn_{ij}}$ is computed as

$$\log P(W_{ij} \mid L) = \sum_{t=1}^{n_{ij}} s_{ijt} - \sum_{t=1}^{n_{ij}} \log(z_{ijt}), \tag{3}$$

where $n_{ij}$ denotes the number of words in hypothesis $W_{ij}$. Each $s_{ijt}$ can be efficiently computed as the dot product of two vectors, while each $z_{ijt}$ is expensive to compute.
We randomly selected one utterance from the Hub5'00-SWB set and decoded it with HDecode. The top ten hypotheses are shown in Table 1. We notice that the N-best hypotheses share many similar contexts, especially the hypotheses with a low word error rate (WER), and that the differences are usually local. The normalizing factor $z_{ijt}$ is completely determined by the context via the smooth function in Equation 2, so similar contexts result in normalizing factors that are close in value. Since many contexts in the N-best hypotheses are the same or similar, many of the normalizing factors are the same or similar as well, and we roughly approximate $\sum_{t=1}^{n_{ij}} \log(z_{ijt})$ for hypothesis $W_{ij}$ as a constant proportional to $n_{ij}$, shown as

$$\sum_{t=1}^{n_{ij}} \log(z_{ijt}) \approx \mu_i\, n_{ij}, \tag{4}$$

where $\mu_i$ is the constant corresponding to utterance $X_i$, and $\mu_i$ can be estimated as

$$\mu_i = \bar{\mu}_{ij} = \frac{1}{N} \sum_{j=1}^{N} \mu_{ij}, \qquad \mu_{ij} = \frac{1}{n_{ij}} \sum_{t=1}^{n_{ij}} \log(z_{ijt}),$$

where $N$ is the number of hypotheses in $H_i$ and $\bar{\mu}_{ij}$ denotes the mean of $\mu_{ij}$ in $H_i$. Since Equation 4 is applied to every utterance, the quality of the approximation can be assessed by the mean and the variance of $\mathrm{Var}(\mu_{ij})$ over all utterances, where $\mathrm{Var}(\mu_{ij})$ is the variance of $\mu_{ij}$ within $H_i$. The smaller $\mathrm{Var}(\mu_{ij})$ is statistically, the more accurate the approximation.
We evaluated the trained FNNLM and RNNLM on the 100-best hypotheses generated from the Hub5'00-SWB set; the PDFs of $\mathrm{Var}(\mu_{ij})$ are shown in Figure 4. The PDFs are quite sharp and concentrated close to zero, much like an impulse function, so the constant approximation in Equation 4 for each utterance is reasonably accurate.
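The per-utterance statistics used in this argument can be sketched as follows. The sketch takes, for one utterance, a list of hypotheses, each given as the list of its per-word log normalizing factors, and returns the estimate of the utterance constant together with the variance of the per-hypothesis means. The function names are hypothetical, the numbers are made up, and the population variance (division by N) is an assumption about the normalization.

```python
from statistics import fmean, pvariance


def mean_log_z(log_z_per_word):
    """mu_ij: average of log(z_ijt) over the words of one hypothesis."""
    return fmean(log_z_per_word)


def utterance_stats(nbest_log_z):
    """Return (mu_i, Var(mu_ij)) for the N-best list of one utterance."""
    mus = [mean_log_z(hyp) for hyp in nbest_log_z]
    return fmean(mus), pvariance(mus)


# Toy 3-best list: per-word log normalizing factors of each hypothesis.
nbest_log_z = [
    [15.0, 17.0],
    [15.5, 16.5, 17.5],
    [14.0, 16.0, 18.0, 18.0],
]
mu_i, var_mu = utterance_stats(nbest_log_z)
print(mu_i, var_mu)
```

In this toy list the individual values of log z spread over a range of 4, yet the per-hypothesis means differ by only 0.5, which is the effect the constant approximation relies on.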

Number of words in hypothesis
We also notice that the hypotheses in one N-best list contain similar numbers of words. N-best hypotheses are rescored and reranked according to their relative scores. If all the hypotheses for utterance $X_i$ contain the same number of words, then, based on Equation 4, the second item in Equation 3 is the same for every hypothesis in the N-best list, shown as

$$\sum_{t=1}^{n_{ij}} \log(z_{ijt}) \approx \mu_i\, n_{ij} = \mu_i\, n_{ik} \approx \sum_{t=1}^{n_{ik}} \log(z_{ikt}), \qquad \forall j, k \in \{1, \ldots, N\}. \tag{5}$$

That is to say, the normalizing factors do not affect the ranking in N-best rescoring, and $\mu_i$ for utterance $X_i$ can be arbitrary. We therefore further approximate the constant $\mu_i$ by a global constant that is independent of the utterance, shown as

$$\sum_{t=1}^{n_{ij}} \log(z_{ijt}) \approx \mu\, n_{ij}, \tag{6}$$

where $\mu$ is the global constant, which can be estimated as $\mu = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \mu_{ij}$ on the validation set. Here $\mu_{ij}$ is the mean of $\log(z_{ijt})$ over the words of hypothesis $W_{ij}$, and $M$ and $N$ denote the number of utterances and the number of hypotheses for each utterance, respectively.
Please note that the approximation in Equation 5 depends on the assumption that the hypotheses for each utterance are equal in length. We count the number of words $n_{ij}$ for each hypothesis $W_{ij}$ and compute the variance of $n_{ij}$ for each utterance $X_i$ as

$$\mathrm{Var}(n_{ij}) = \frac{1}{N} \sum_{j=1}^{N} (n_{ij} - \bar{n}_i)^2, \qquad \bar{n}_i = \frac{1}{N} \sum_{j=1}^{N} n_{ij}. \tag{7}$$

The smaller $\mathrm{Var}(n_{ij})$ is statistically, the more accurate the approximations in Equations 5 and 6. The PDF of $\mathrm{Var}(n_{ij})$ on the 100-best hypotheses generated from the Hub5'00-SWB set is shown in Figure 5. The PDF is sharp, and most values of $\mathrm{Var}(n_{ij})$ are smaller than 1.0. The N-best hypotheses thus differ little in length, and the approximation in Equation 5 is reasonable to that extent.
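The role of the equal-length assumption can be checked with a small sketch. It ranks the hypotheses of one utterance by the sum of unnormalized scores minus a constant times the word count, and shows that the constant has no effect on the ranking when all lengths are equal but can change it when they differ. All names and numbers are hypothetical, and the population variance is again an assumed normalization.

```python
from statistics import pvariance


def length_variance(nbest):
    """Variance of the hypothesis lengths within one N-best list."""
    return pvariance([len(words) for words in nbest])


def rank_by_lm(sum_s, lengths, mu):
    """Return hypothesis indices ordered by sum_t s_ijt - mu * n_ij, best first."""
    scores = [s - mu * n for s, n in zip(sum_s, lengths)]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


sum_s = [40.0, 42.5, 41.0]  # made-up sums of unnormalized scores

# Equal lengths: any constant mu gives the same ranking.
equal = [3, 3, 3]
print(rank_by_lm(sum_s, equal, mu=0.0), rank_by_lm(sum_s, equal, mu=16.0))

# Unequal lengths: the constant now acts like a word penalty and matters.
unequal = [3, 4, 3]
print(rank_by_lm(sum_s, unequal, mu=0.0), rank_by_lm(sum_s, unequal, mu=16.0))

print(length_variance([["a", "b", "c"], ["a", "b", "d"], ["a", "b"]]))
```

The second case is why the constant cannot simply be dropped when lengths differ: it has to be kept as a per-word term, which is the form a word penalty already has.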

Normalizing factor approximation
Based on the approximations in Equations 4, 5, and 6, the LM score in Equation 3 can be simplified as

$$\log P(W_{ij} \mid L) \approx \sum_{t=1}^{n_{ij}} s_{ijt} - \mu\, n_{ij}, \tag{8}$$

where only the first item needs to be computed for each hypothesis, while the constant $\mu$ in the second item is estimated once on the validation set. With the constant approximation of the normalizing factor, the complexity of the output layer is significantly reduced from O(|V|H) to O(H). We also notice that the discriminative information of NNLM for N-best rescoring lies in the unnormalized probability in Equation 8, and that the LM scores from a back-off N-gram, especially the KN3 used in decoding, are usually available for rescoring. In the next section, we investigate the discriminative information in the unnormalized NNLM (UP-NNLM), combined with back-off N-gram, to further improve the performance of the speech recognizer.
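A minimal sketch of the simplified LM score follows: for each word, only the output row of that word is multiplied with the hidden state, and a constant is subtracted per word. The function name `up_lm_score` and the toy vectors are hypothetical; the hidden states are assumed to be already computed by the network.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def up_lm_score(hidden_states, word_rows, mu):
    """Approximate log P(W) as sum_t theta_{w_t} . h_t - mu * n.

    hidden_states: one hidden vector h_t per word position.
    word_rows:     the prediction-matrix row theta_{w_t} of each actual word.
    mu:            global constant estimated on a validation set.
    Cost: n dot products of length H, independent of the vocabulary size.
    """
    n = len(hidden_states)
    return sum(dot(row, h) for row, h in zip(word_rows, hidden_states)) - mu * n


# Toy 2-word hypothesis with H = 2 (made-up values).
hidden_states = [[1.0, 0.0], [0.5, 0.5]]
word_rows = [[2.0, 1.0], [4.0, 2.0]]
print(up_lm_score(hidden_states, word_rows, mu=1.5))
```

The result is not a log probability in the strict sense, since nothing guarantees that the scores sum to one over the vocabulary; it is only meant to rank hypotheses.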

Combining unnormalized NNLM and back-off N-gram
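The combination of UP-NNLM and back-off N-gram in the logarithmic domain is presented here in detail. Generally, the performance of STT systems can be further improved by interpolating NNLM and back-off N-gram. Since Equation 8 does not provide the exact probability of NNLM, the linear interpolation is performed in the logarithmic domain for the entire hypothesis, shown as

$$\log P(W_{ij} \mid L) \approx \lambda \Big( \sum_{t=1}^{n_{ij}} s_{ijt} - \mu\, n_{ij} \Big) + (1 - \lambda) \log P_{Ngram}(W_{ij} \mid L), \tag{9}$$

where $\lambda$ is the interpolation weight and $P_{Ngram}(W_{ij} \mid L)$ is the language score of back-off N-gram for hypothesis $W_{ij}$. By substituting Equation 9 into Equation 1, the global score for each hypothesis is computed as

$$g(X_i, W_{ij}) \approx \log P(X_i \mid W_{ij}, \Lambda) + \alpha \Big[ \lambda \sum_{t=1}^{n_{ij}} s_{ijt} + (1 - \lambda) \log P_{Ngram}(W_{ij} \mid L) \Big] + (\beta - \alpha \lambda \mu)\, n_{ij}, \tag{10}$$

where $\Lambda$ denotes the acoustic model, $\alpha$ the scaling factor for language scores, and $\beta$ the penalty per word, so that the normalizing factor is absorbed into the word penalty $(\beta - \alpha \lambda \mu)$, which is tuned as a single parameter. Only the unnormalized probability needs to be computed, and the computational complexity of the output layer is reduced significantly without explicit normalization.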
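The absorption of the normalizing constant into the word penalty can be verified numerically. The sketch below scores one hypothesis in two ways: with the constant written out explicitly inside the interpolated LM score, and with the constant folded into a modified word penalty. The function names, the symbol names `lam` and `word_penalty`, and all numbers are hypothetical.

```python
import math


def combined_score(acoustic_logp, sum_s, ngram_logp, n_words, lm_scale, lam, word_penalty):
    """Global score with the log-domain interpolation of the unnormalized NNLM
    score (sum_s) and the N-gram log score; no normalizing constant appears."""
    lm = lam * sum_s + (1.0 - lam) * ngram_logp
    return acoustic_logp + lm_scale * lm + word_penalty * n_words


def explicit_score(acoustic_logp, sum_s, ngram_logp, n_words, lm_scale, lam, word_penalty, mu):
    """Same score, but with the constant approximation mu * n written out."""
    lm = lam * (sum_s - mu * n_words) + (1.0 - lam) * ngram_logp
    return acoustic_logp + lm_scale * lm + word_penalty * n_words


# Made-up values for one hypothesis.
args = dict(acoustic_logp=-100.0, sum_s=30.0, ngram_logp=-20.0, n_words=5,
            lm_scale=12.0, lam=0.5)
mu, penalty = 16.0, -1.0

explicit = explicit_score(word_penalty=penalty, mu=mu, **args)
absorbed = combined_score(word_penalty=penalty - args["lm_scale"] * args["lam"] * mu, **args)
print(explicit, absorbed, math.isclose(explicit, absorbed))
```

Because the word penalty is tuned on a development set anyway, the constant never has to be estimated separately in this formulation; tuning the penalty covers it.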

Complexity analysis and speed comparisons
The complexities of NNLM and UP-NNLM are analyzed, and the evaluation speeds of NNLM and UP-NNLM are also measured, as shown in Table 2 for detailed comparisons. The class-based output layer method is based on the frequency partition [4], and the computational complexity of its output layer is O((C + |V|/C)H), as shown in Table 2. This method is usually used to speed up the training of NNLM, and it speeds up the evaluation as well. Compared with the class-based method, the unnormalized probability of NNLM (UP-NNLM) in our [...] Table 2.
The evaluation speed is measured as the number of words processed per second on a machine with an 8-core Intel(R) Xeon(R) E5520 CPU at 2.27 GHz and 8 GB of RAM, as shown in Table 2. The implementations are based on the open source toolkits CSLM [17] and RNNLM [18] to ensure repeatability. For a clear comparison, the speed of NNLM without a class layer is also measured. One million words are randomly selected from the training data and evaluated by the FNNLM and RNNLM, where the words are fed into the models one by one. Experimental results show that UP-FNNLM and UP-RNNLM are about 2 to 3 times faster than 'FNNLM + class layer' and 'RNNLM + class layer' for evaluation. Note that the complexity of the hidden layer in UP-FNNLM or UP-RNNLM is comparable with that of the class-based output layer in 'FNNLM + class layer' or 'RNNLM + class layer', so this speedup factor is reasonable. It is also worth noting that the fast-UP-FNNLM is more than 25 times faster than 'FNNLM + class layer' and more than 1,100 times faster than FNNLM. To show the speedup clearly, the evaluation speed is also compared for different hidden layer sizes, as shown in Table 3. The larger the hidden layer, the slower the evaluation, and the 'fast-UP-FNNLM' is the fastest of all.
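The complexity figures can be turned into rough operation counts. The sketch below counts multiply-adds per word in the output layer only, using the sizes given in the experimental setup (a vocabulary of about 53K words, taken here as exactly 53,000, 300 hidden nodes, and 400 classes). The function name is hypothetical, and these counts are not predictions of the measured speeds in Table 2, which also include the cost of the hidden layer.

```python
def output_layer_cost(vocab, hidden, classes=None, normalized=True):
    """Approximate multiply-adds per word in the output layer.

    normalized=False     : unnormalized score, one dot product, O(H)
    classes is None      : full softmax, O(|V| H)
    classes = C          : class-based softmax, O((C + |V|/C) H)
    """
    if not normalized:
        return hidden
    if classes is None:
        return vocab * hidden
    return (classes + vocab / classes) * hidden


V, H, C = 53_000, 300, 400  # sizes assumed from the experimental setup
full = output_layer_cost(V, H)
class_based = output_layer_cost(V, H, classes=C)
unnormalized = output_layer_cost(V, H, normalized=False)
print(full, class_based, unnormalized)
```

With the output layer reduced to a single dot product, the remaining cost is dominated by the hidden layer, which is consistent with the observation above that the hidden layer and the class-based output layer are of comparable complexity.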

N-best rescoring evaluation
In this section, the NNLM and UP-NNLM are applied to N-best rescoring to demonstrate the performance of our method. According to the experimental setup described in Section 2, the perplexities of our trained language models, including KN3, KN5, FNNLM, and RNNLM-C400, are presented in Table 4 for comparison, where KN3 is used for decoding. The RNNLM-C400 interpolated with KN5 performs best of all on the Hub5'00-SWB set and the RT03S-FSH set. Also, RNNLM-C400 performs slightly better than FNNLM in perplexity with the same setup. The Hub5'00-SWB set and the RT03S-FSH set act as the validation set and the evaluation set, respectively. The 100-best hypotheses for these two sets are rescored and reranked by different language models, as shown in Table 5, where 1-best denotes the output of HDecode with KN3.
The results for the 1-best hypothesis on these two sets, which serve as our baseline, are comparable with other reported results [20,21]. The UP-FNNLM and UP-RNNLM, combined with back-off N-gram, are used for fast rescoring in this section. Note that the output layer of our trained RNNLM-C400 is divided into many small softmax output layers in order to speed up the training on the large corpus. Thus, the unnormalized probability comes from the activations of the class layer and of the specific softmax output layer, while the entire normalizing factor is again approximated as in Equation 6. The UP-NNLM is linearly interpolated with KN5 in the logarithmic domain. The interpolation weight, the scale of the LM scores, and the word penalty are all individually tuned on the Hub5'00-SWB set, and the final performance is then evaluated on the RT03S-FSH set, as shown in Table 5. Significant reductions in WER are observed on the validation and evaluation sets. The language scores of KN3 are usually available in the lattice or N-best list, and the UP-RNNLM combined with KN3 reduces WER by 0.8% and 1.2% absolute on the Hub5'00-SWB and RT03S-FSH sets, respectively. 'KN5 + UP-RNNLM-C400' further reduces the WER by 1.2% and 1.7% absolute on these two sets. We also notice that UP-RNNLM performs slightly better than UP-FNNLM, while UP-FNNLM can be evaluated much faster than UP-RNNLM. 'UP-NNLM + KN5' obtains about one half to two thirds of the gains of 'NNLM + KN5' with little computation. Experimental results show that the unnormalized probability of NNLMs, including FNNLM and RNNLM, is quite complementary to that of back-off N-gram, and that the performance is further improved by combining back-off N-gram and NNLM.

Discussions with related work
Fast rescoring with NNLM has attracted much attention in the field of speech recognition [9,10,22-24]. Many methods [9,10,22] for factorizing the output layer were proposed to reduce the complexity of NNLM and to speed up training and evaluation. Other techniques [22,24] were proposed to avoid the redundant computations in N-best or lattice rescoring. Our proposed method can be easily combined with these methods to further improve the speed of rescoring. In addition, fast training of NNLM with noise-contrastive estimation (NCE) [25] was proposed in [26], where the normalizing factor for each context was treated as a parameter to be learned during training, so that training was sped up without explicit normalization. However, the normalizing factor for each context has to be learned separately and differs from context to context, so the evaluation of NNLM still needs explicit normalization. Interestingly, as mentioned in [26], the normalizing factors can be fixed to one instead of being learned during the training of NNLM. We believe this finding will help to improve our current work, since constraining the variance of the normalizing factor to a small range would make the approximation in Equation 6 more accurate. In this work, the distribution of normalizing factors on the N-best list is investigated, and the normalizing factor for each hypothesis is approximated as a constant for fast rescoring without considering the variance of the normalizing factor. Based on the finding in [26], we will investigate how to constrain the variance of normalizing factors during training to further improve our method in future work. Furthermore, an alternative way to speed up rescoring is to use the word lattice instead of the N-best list.
The word lattice, as the output of STT, can compactly represent many more hypotheses than the N-best list. We therefore ask whether our proposed method can be extended to lattice rescoring. As the size of the N-best list increases, the N-best list approaches the lattice. Two experiments are designed to validate our method. On the one hand, we investigate whether the performance of N-best rescoring degrades as the size of the N-best list increases. A 1,000-best list instead of a 100-best list is extracted for each utterance and rescored by our proposed method, as shown in Table 5. Experimental results show that our proposed method still works well for the 1,000-best list, and similar improvements are obtained for 1,000-best rescoring. On the other hand, we directly rescore the lattice with the 'lattice-tool' [27] command to evaluate our proposed method. For ease of implementation and fast rescoring, the 'UP-FNNLM + KN5' is integrated into the lattice-tool command, where the computation of the LM score is replaced with Equation 9. The experimental results in Table 6 show that lattice rescoring obtains a slightly lower WER than N-best rescoring. All these results indicate that our proposed approximations, based on our firsthand observations, are reasonable and effective for fast N-best rescoring.

Conclusions
Based on the observed characteristics of N-best hypotheses, the normalizing factors of NNLM for each hypothesis are approximated as a global constant for fast evaluation. The unnormalized NNLM combined with back-off N-gram is empirically investigated and evaluated on the English-Switchboard speech-to-text task. The computational complexity is reduced significantly without explicit softmax normalization. Experimental results show that UP-NNLM is about 2 to 3 times faster than 'NNLM + class layer' for evaluation. Moreover, the fast-UP-FNNLM is more than 25 times faster than 'FNNLM + class layer' and more than 1,100 times faster than FNNLM. The N-best hypotheses from the STT output are approximately rescored and reranked by the unnormalized NNLM combined with the back-off N-gram model in the logarithmic domain. Experimental results show that the unnormalized probability of NNLM, including FNNLM and RNNLM, is quite complementary to that of back-off N-gram, and that UP-NNLM is discriminative for N-best rescoring even though it is not an exact probability. The performance of the STT system is improved significantly by 'KN5 + UP-NNLM' at little computational cost.