Advanced recurrent network-based hybrid acoustic models for low resource speech recognition

Kang, Jian; Zhang, Wei-Qiang; Liu, Wei-Wei; Liu, Jia; Johnson, Michael T.

doi:10.1186/s13636-018-0128-6

Research
Open access
Published: 17 July 2018


EURASIP Journal on Audio, Speech, and Music Processing volume 2018, Article number: 6 (2018)

Abstract

Recurrent neural networks (RNNs) have shown an ability to model temporal dependencies. However, the problem of exploding or vanishing gradients has limited their application. In recent years, long short-term memory RNNs (LSTM RNNs) have been proposed to solve this problem and have achieved excellent results. Bidirectional LSTM (BLSTM), which uses both preceding and following context, has shown particularly good performance. However, the computational requirements of BLSTM approaches are quite heavy, even when implemented efficiently with GPU-based high performance computers. In addition, because the output of LSTM units is bounded, there is often still a vanishing gradient issue over multiple layers. The large size of LSTM networks makes them susceptible to overfitting problems. In this work, we combine local bidirectional architecture, a new recurrent unit, gated recurrent units (GRU), and residual architectures to address the above problems. Experiments are conducted on the benchmark datasets released under the IARPA Babel Program. The proposed models achieve 3 to 10% relative improvements over their corresponding DNN or LSTM baselines across seven language collections. In addition, the new models accelerate learning speed by a factor of more than 1.6 compared to conventional BLSTM models. By using these approaches, we achieve good results in the IARPA Babel Program.

1 Introduction

Automatic speech recognition (ASR) has undergone rapid change in recent years. Deep neural networks (DNN) combined with hidden Markov models (HMM) have become the dominant approach for acoustic modeling [1, 2], replacing the traditional Gaussian mixture model-hidden Markov models (GMM-HMMs) approach. Utilizing increased availability of both computational power and training data, error rates have been reduced significantly across many speech recognition tasks [3, 4]. A wide variety of NN architectures have been introduced, each having associated advantages and disadvantages. Of all these architectures, recurrent neural networks have shown strong comparative performance.

Recurrent neural networks (RNNs) [5] are a neural network framework that includes self-connections from the previous time step as inputs. Each unit maintains a dynamic history of the input feature sequence instead of a fixed-size window. This approach exploits long-term feature dependencies across speech frames. As a result of this structure, RNNs are less affected by temporal distortion. Due to these properties, some researchers have investigated using RNNs to capture longer context [6–10], reporting improved performance compared to DNNs.

Although RNNs are well suited for sequence tasks, the time dependencies that can be learned are still limited by the vanishing and exploding gradient problem [11]. To solve this problem, the long short-term memory (LSTM) [12] has been proposed. LSTM units use gates to control information flow and effectively create shortcut paths across multiple temporal steps. This gate mechanism makes LSTM architectures well suited to sequence tasks and has improved robustness [13–15]. LSTM-based acoustic models have been successfully applied to several speech applications, such as voice search tasks [16, 17], with good performance.

Although LSTM models have achieved excellent results for large vocabulary continuous speech recognition, they still struggle when applied to certain tasks, such as training for low-resource languages.

Conventional LSTM and bidirectional LSTM (BLSTM) require complex training mechanisms which make them difficult to implement. Some of these mechanisms, such as clipping of cell activations and peephole connections, require careful tuning to a particular training set. The vanishing gradient problem across multiple layers is also a potential problem. We hope to solve these shortcomings using additional gating mechanisms that further constrain temporal dependencies and focus the training process.

In this paper, we aim at building advanced RNN structures with a hybrid acoustic model, to maximize use of prior knowledge in the speech signals. The solutions we propose include local window BLSTM, gated recurrent units, and residual architecture-based models. This work expands on our previous work [18–21]. Experiments are carried out on the Babel benchmark datasets, for low resource keyword search evaluations. By using these techniques, we achieve 3 to 10% relative improvements over the corresponding DNN or LSTM baselines. In addition, the new models improve training time by a factor of more than 1.6 compared to conventional BLSTM models.

The remainder of the paper is organized as follows. Section 2 briefly introduces the baseline LSTM model. Section 3 describes the proposed approaches. We report our experimental results in detail in Section 4, including experimental setup, hyperparameters evaluation, and comparisons between selected datasets. Finally, conclusions and future work are outlined in Section 5.

2 Baseline LSTM system

We first give a brief introduction to the DNN and LSTM network structures used as baselines. A DNN contains a series of hidden layers, which for speech applications are most commonly fully connected with sigmoid activation functions.

RNNs are a neural network framework with self-connections from the previous time step used as inputs. This structure allows the network to capture a dynamic history of information about input feature sequences and is less affected by temporal distortion. Due to these properties, RNNs have performed better than traditional DNNs in large vocabulary speech recognition tasks. Although conventional RNNs have feedback connections in the hidden layers to model temporal correlations, this structure captures short-term dependencies much better than long-term dependencies due to the vanishing and exploding gradients in the stochastic gradient descent (SGD) training process [11].

The LSTM RNN topology is an advanced network structure designed to model long-term dependencies while limiting the rate of gradient decay through a gating mechanism.

LSTM units were first introduced in [12]. A popular LSTM structure is shown in Fig. 1. The forward pass from x_t to h_t follows the equations:

$$ g_{t} = \phi\left(W_{gx}x_{t} + W_{gh}h_{t-1} + b_{g}\right) $$

(1)

$$ i_{t} = \sigma\left(W_{ix}x_{t} + W_{ih}h_{t-1} + W_{ic}c_{t-1} + b_{i}\right) $$

(2)

$$ f_{t} = \sigma\left(W_{fx}x_{t} + W_{fh}h_{t-1} + W_{fc}c_{t-1} + b_{f}\right) $$

(3)

$$ c_{t} = f_{t} \otimes c_{t-1} + i_{t} \otimes g_{t} $$

(4)

$$ o_{t} = \sigma\left(W_{ox}x_{t} + W_{oh}h_{t-1} + W_{oc}c_{t} + b_{o}\right) $$

(5)

$$ h_{t} = o_{t} \otimes \phi(c_{t}) $$

(6)

Here, x_t is the input, and h_t−1 is the previous output of the LSTM. W_∗x, W_∗h, W_∗c, and b_∗ are forward matrices, recurrent matrices, diagonal peephole connections, and biases for all gates, respectively. σ is the sigmoid function, ϕ is the hyperbolic tangent function, and ⊗ denotes element-wise multiplication. For convenience, we denote the above calculations as h_t=LSTM(x_t,h_t−1).
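As an illustration only, the forward pass in (1) to (6) can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: the dictionary keys, the helper names (`affine`, `init_params`), the random initialization range, and the representation of the diagonal peephole matrices W_∗c as vectors `w_ic`, `w_fc`, `w_oc` are all assumptions made for readability, and the projection layer used in the baseline is omitted.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, p):
    """One forward step of the peephole LSTM, following Eqs. (1)-(6)."""
    a_g = affine(p["W_gx"], x, p["W_gh"], h_prev, p["b_g"])
    a_i = affine(p["W_ix"], x, p["W_ih"], h_prev, p["b_i"])
    a_f = affine(p["W_fx"], x, p["W_fh"], h_prev, p["b_f"])
    a_o = affine(p["W_ox"], x, p["W_oh"], h_prev, p["b_o"])
    g = [math.tanh(a) for a in a_g]                                      # Eq. (1)
    i = [sigmoid(a + w * c) for a, w, c in zip(a_i, p["w_ic"], c_prev)]  # Eq. (2)
    f = [sigmoid(a + w * c) for a, w, c in zip(a_f, p["w_fc"], c_prev)]  # Eq. (3)
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]   # Eq. (4)
    # The output gate peeks at the *updated* cell state c_t, as in Eq. (5).
    o = [sigmoid(a + w * ck) for a, w, ck in zip(a_o, p["w_oc"], c)]     # Eq. (5)
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]                     # Eq. (6)
    return h, c


def init_params(n_in, n_hid, seed=0):
    """Hypothetical small random initialization, for demonstration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    def vec(n):
        return [rng.uniform(-0.1, 0.1) for _ in range(n)]

    p = {}
    for gate in "gifo":
        p["W_" + gate + "x"] = mat(n_hid, n_in)
        p["W_" + gate + "h"] = mat(n_hid, n_hid)
        p["b_" + gate] = vec(n_hid)
    for gate in "ifo":
        p["w_" + gate + "c"] = vec(n_hid)  # diagonal peephole weights
    return p


if __name__ == "__main__":
    params = init_params(n_in=3, n_hid=4)
    h, c = lstm_step([0.5, -0.2, 0.1], [0.0] * 4, [0.0] * 4, params)
    print(len(h), len(c))
```

Because h_t is a sigmoid gate multiplied by a tanh value, every component of the output lies strictly between −1 and 1, which is the boundedness referred to in the abstract.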

The LSTM uses gates to control information flow and effectively creates shortcut paths across multiple temporal steps. The key ideas behind the LSTM unit are the addition of a memory cell block to maintain temporal information and the use of non-linear activation gates to control both the information flow into the memory cell and the output of the unit. Each LSTM unit consists of one cell unit and four control gates. These gates control the input and output as well as the temporal extent of the memory cell through a forget gate. The memory cell itself can also directly control the gates. The LSTM units implement this by peephole connections from the memory cell to the gates to learn precise timing information.

As a hybrid acoustic model, the network is trained to predict HMM states using a forced alignment. For both networks, a softmax layer is added on top of the recurrent layers. The output of the softmax layer provides an estimate of the posterior probabilities P(s|o) for states s, given features o. The output of the softmax layer is computed by

$$ P(s|o) = \text{softmax}(W_{s} h_{\text{out}} + b_{s}), $$

(7)

where (W_s,b_s) is the connection weight matrix and bias vector for the softmax layer, and h_out is the output of the top recurrent layer.
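For concreteness, (7) can be sketched as follows. The function names are illustrative, and subtracting the maximum logit before exponentiating is a standard numerical-stability step that the text does not mention; it does not change the result.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def state_posteriors(W_s, b_s, h_out):
    """Eq. (7): P(s|o) = softmax(W_s h_out + b_s), one entry per HMM state."""
    logits = [
        sum(w * h for w, h in zip(row, h_out)) + b
        for row, b in zip(W_s, b_s)
    ]
    return softmax(logits)


if __name__ == "__main__":
    # Three hypothetical HMM states, two-dimensional top-layer output.
    W_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    b_s = [0.0, 0.0, 0.0]
    print(state_posteriors(W_s, b_s, [0.2, -0.1]))
```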

Inspired by DNNs, multiple LSTM layers can be stacked to build deep LSTM RNNs [9]. When input features propagate through the recurrent layer, the output features at each time step incorporate the history of temporal features from previous time steps. Compared to a shallow LSTM, the features generated by a deep LSTM are more generalizable and suitable for prediction. Thus, a deep LSTM RNN takes advantage of the merits of both DNNs and conventional LSTM.

Recently, linear recurrent projection layers have been proposed for reducing the number of parameters at no loss of accuracy [16]. Following this work, we use the term LSTM to denote such a deep LSTM-projected architecture and use this approach as our baseline.

3 Advanced recurrent architectures and training algorithms

Although LSTM models have achieved excellent performance in speech recognition tasks, they still have some shortcomings. For example, conventional unidirectional LSTMs only use preceding context information, which prevents them from exploiting future context. Furthermore, the architectures of traditional LSTM units are complex. Their large size restricts their ability to generalize and leads to a vanishing gradient problem across multiple layers. Because of the use of temporal context, training on entire sentence-length utterances requires extremely long training times, and the decoding real time factor is also poor. The proposed architectures and algorithms are designed to substantially reduce the above shortcomings.

In this section, we extend conventional LSTM and investigate in more depth the gated recurrent unit (GRU) element introduced in our previous work [18]. We also propose a new element called a residual GRU (rGRU) to alleviate the vanishing gradient problem across multiple layers.

3.1 Bidirectional LSTM

As discussed previously, the BLSTM [22, 23] is able to make use of both the preceding and following contexts within an utterance. The BLSTM does this by processing the data in both directions with two separate parameter sets, forward parameters and backward parameters.

The propagating output is calculated as follows:

$$ \overrightarrow{h_{t}} = \overrightarrow{\texttt{LSTM}}(x_{t}, h_{t-1}), $$

(8)

$$ \overleftarrow{h_{t}} = \overleftarrow{\texttt{LSTM}}(x_{t}, h_{t+1}), $$

(9)

where $h_{t} = \left [\overrightarrow {h_{t}},\overleftarrow {h_{t}}\right ], t=1:L$, and L is the length of the training sentence. Equation (8) uses the forward parameters, while (9) uses the backward parameters.

After propagating in both forward and backward directions, the individual directional outputs are concatenated and fed forward to the next hidden layer. This bidirectional approach enables the system to more fully take advantage of the time dependencies of the input features.
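The two passes and the concatenation can be sketched generically. In this minimal sketch, `fwd_step` and `bwd_step` are placeholders for the forward and backward LSTM of (8) and (9); the demonstration uses a toy running-sum recurrence instead of a real LSTM so that the data flow is easy to follow. All names are assumptions.

```python
def bidirectional(frames, fwd_step, bwd_step, h0_fwd, h0_bwd):
    """Run a forward and a backward recurrence over a whole utterance
    and concatenate the two hidden vectors at every time step."""
    fwd = []
    h = h0_fwd
    for x in frames:                            # t = 1 .. L
        h = fwd_step(x, h)
        fwd.append(h)

    bwd = [None] * len(frames)
    h = h0_bwd
    for t in range(len(frames) - 1, -1, -1):    # t = L .. 1
        h = bwd_step(frames[t], h)
        bwd[t] = h

    # h_t = [forward h_t, backward h_t]
    return [f + b for f, b in zip(fwd, bwd)]


if __name__ == "__main__":
    # Toy recurrence: hidden state is a one-element running sum.
    def running_sum(x, h):
        return [h[0] + x]

    print(bidirectional([1, 2, 3], running_sum, running_sum, [0], [0]))
```

Note that the backward pass cannot start until the last frame is available, which is the source of the latency discussed in Section 3.2.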

3.2 Local window BLSTM

Although BLSTM networks often achieve better performance in speech recognition than LSTM, the latency during training and decoding is significant because it is necessary to wait until seeing a whole sentence. This makes BLSTM inappropriate for real-time online speech recognition.

However, in the traditional backpropagation through time (BPTT) algorithm [24], the error signals are truncated based on the BPTT batch size parameter. This results in temporal isolation between blocks of the signal. The gradient is attenuated as it propagates through time steps, to a degree that also depends on the batch size. For frames that are far from the current time, the contribution of their gradient is therefore small.

From the standpoint of human speech production, the history of preceding phonemes, including the changing shapes of the vocal cords, mouth, and nose, accounts for most of the useful sequential information. Future information plays an auxiliary role, reflecting, for example, the continuous movement of the vocal organs. In addition, one key to the success of RNNs is that they model temporal relationships over phonemes. In general, the relationships between phonemes are constrained within word boundaries. Future information within word boundaries may therefore be the most useful for learning the current phoneme.

Motivated by these observations, we consider the time dependencies within a local window. Based on this concept, we introduce the local window BLSTM (LW-BLSTM) approach in this subsection.

Figure 2 illustrates the LW-BLSTM. All time dependencies are considered within a fixed local window N, whose length is Len(N). These local windows are non-overlapping chunks. Training the LW-BLSTM is the same as training the BLSTM. When a new chunk is processed during training, the initial $\overrightarrow {h_{0}}$ directly uses the final $\overrightarrow {h}_{Len(N)}$ from the previous chunk of the same utterance. In contrast to the traditional BLSTM, we do not need to wait until seeing a whole sentence, so both training and decoding are accelerated significantly. Moreover, the computational resources used for training are reduced sharply. In particular, the GPU memory requirements are reduced by a factor of about 10. This enables us to train a larger number of chunks in parallel, which accelerates training further.
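The chunking scheme can be sketched as follows. The forward state is carried across chunk boundaries as described above. How the backward state is initialized at each chunk is not spelled out in the text, so this sketch assumes it restarts from a fixed initial value `h0_bwd` in every chunk; that choice, the function names, and the toy running-sum recurrence used in the demonstration are all assumptions.

```python
def split_chunks(frames, window_len):
    """Split an utterance into non-overlapping chunks of at most window_len frames."""
    return [frames[s:s + window_len] for s in range(0, len(frames), window_len)]


def local_window_bidirectional(frames, window_len, fwd_step, bwd_step, h0_fwd, h0_bwd):
    """Bidirectional recurrence restricted to local windows.

    The forward state is carried from the end of one chunk to the start of
    the next. ASSUMPTION: the backward state restarts from h0_bwd in every
    chunk, so it only sees future frames inside the current window.
    """
    outputs = []
    h_fwd = h0_fwd
    for chunk in split_chunks(frames, window_len):
        fwd = []
        for x in chunk:
            h_fwd = fwd_step(x, h_fwd)     # continues from the previous chunk
            fwd.append(h_fwd)

        h_bwd = h0_bwd                     # assumed per-chunk reset
        bwd = [None] * len(chunk)
        for t in range(len(chunk) - 1, -1, -1):
            h_bwd = bwd_step(chunk[t], h_bwd)
            bwd[t] = h_bwd

        outputs.extend(f + b for f, b in zip(fwd, bwd))
    return outputs


if __name__ == "__main__":
    def running_sum(x, h):
        return [h[0] + x]

    print(local_window_bidirectional([1, 2, 3, 4, 5], 2, running_sum, running_sum, [0], [0]))
```

In the toy run, the first component of each output accumulates over the whole utterance, while the second only accumulates within a window, so outputs for a chunk are available as soon as that chunk has been read.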

Some prior work has investigated methods to reduce latency and speed up the training process of BLSTM. This includes context-sensitive-chunk BLSTM (CSC-BLSTM) [25] and latency-controlled BLSTM (LC-BLSTM) [26]. Figure 2 shows the differences among these approaches. Compared to the CSC-BLSTM approach, the LW-BLSTM approach introduced here incorporates the entire past history by using the final hidden states as the initial condition for the next block. This may lead to a more accurate approximation and be one of the key points behind the LW-BLSTM. Compared to LC-BLSTM, we do not distinguish truncated future context and preceding context, relative to a fixed local window. This may lead to fewer backward frames but avoids the potential issue of having the appended frames generate no output.

3.3 Gated recurrent units and LW-BGRU

Although LSTM RNNs have achieved excellent results, this architecture has some weaknesses. The architecture has a large number of parameters and can overfit relatively easily, especially for low resource tasks. In addition, training requires several complex mechanisms such as nonlinear clipping operations on cell activations and peephole connections [16] which may make it difficult to tune the parameters. To address these problems, we will adopt GRUs, another type of recurrent unit.

The GRU was recently proposed by Cho et al. [27]. Like LSTM, it was designed to adaptively reset or update memory content. As is shown in Fig. 3, each GRU has a reset gate and an update gate, which control the memory flow. The GRU fully exposes its memory content at each time step and balances output between the previous memory state and the new candidate memory state.

The GRU reset gate r_t is computed by

$$ r_{t}=\sigma(W_{rx}x_{t} + U_{rh}h_{t-1} + b_{r}), $$

(10)

where σ is the sigmoid function, and x_t and h_t−1 are the input to the GRU and the previous output of the GRU. W_rx, U_rh, and b_r are the forward matrix, recurrent matrix, and bias of the reset gate, respectively.

Similarly, the update gate z_t is computed by

$$ z_{t}=\sigma(W_{zx}x_{t} + U_{zh}h_{t-1} + b_{z}), $$

(11)

where the parameters are as above.

Next, the candidate memory state m_t is calculated by

$$ m_{t}=\phi(Wx_{t} + U(r_{t}*h_{t-1}) + b), $$

(12)

where ϕ is the hyperbolic tangent function and * denotes element-wise multiplication.

Lastly, the output of the GRU is calculated by

$$ h_{t}=z_{t}*h_{t-1} + (1 - z_{t})*m_{t}. $$

(13)

From the above propagation procedures, we can see that both GRU and LSTM use gates to control information flow and effectively create shortcut paths across multiple temporal steps. These gates and shortcuts help the unit detect and retain important features in the input sequence. In addition, they allow the error to be backpropagated easily, thus reducing the difficulty due to vanishing or exploding gradients with respect to time [11].

The update gate helps the GRU to capture long term dependencies and plays a role like that of the forget gate in LSTM. The reset gate helps the GRU to reset whenever the detected feature is not necessary anymore. So when the GRU tries to learn temporally changed features, these gates activate differently.
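The GRU forward pass in (10) to (13) can be sketched in the same style. This is an illustrative sketch rather than the authors' code; the parameter dictionary keys mirror the symbols in the equations, and the helper `affine` is an assumed name.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]


def gru_step(x, h_prev, p):
    """One forward step of the GRU, following Eqs. (10)-(13)."""
    r = [sigmoid(a) for a in affine(p["W_rx"], x, p["U_rh"], h_prev, p["b_r"])]  # Eq. (10)
    z = [sigmoid(a) for a in affine(p["W_zx"], x, p["U_zh"], h_prev, p["b_z"])]  # Eq. (11)
    # The reset gate scales the previous output before the recurrent product.
    gated = [rk * hk for rk, hk in zip(r, h_prev)]
    m = [math.tanh(a) for a in affine(p["W"], x, p["U"], gated, p["b"])]         # Eq. (12)
    # z close to 1 keeps the old state; z close to 0 takes the candidate.
    return [zk * hk + (1.0 - zk) * mk for zk, hk, mk in zip(z, h_prev, m)]       # Eq. (13)
```

With z_t near 1 the unit copies h_t−1 forward unchanged, which is the sense in which the update gate plays the role of the LSTM forget gate.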

The main difference between LSTM units and GRUs is that there is no output activation function or output gate to control the output in a GRU. Intuitively, because the output may be unbounded, this could hurt performance significantly. However, experimental results show that this is not true for GRUs, perhaps because coupling the reset gate and update gate avoids this problem and makes the use of an output gate or activation function less valuable [28–30]. Further, because an output gate is not used in GRUs, the total size of GRU layers is smaller than that of LSTM layers, which helps the GRU network avoid overfitting.
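The size difference can be made concrete by counting the parameters in (1) to (6) and (10) to (13). The sketch below counts only the weights that appear in those equations; it ignores the recurrent projection layer used in the baseline, and the layer sizes in the demonstration are hypothetical, not the configuration used in the experiments.

```python
def lstm_param_count(n_in, n_hid):
    """Parameters in Eqs. (1)-(6): four affine blocks plus three diagonal peepholes."""
    per_block = n_in * n_hid + n_hid * n_hid + n_hid
    return 4 * per_block + 3 * n_hid


def gru_param_count(n_in, n_hid):
    """Parameters in Eqs. (10)-(13): three affine blocks, no peepholes."""
    per_block = n_in * n_hid + n_hid * n_hid + n_hid
    return 3 * per_block


if __name__ == "__main__":
    n_in, n_hid = 120, 512   # hypothetical sizes, for illustration only
    lstm = lstm_param_count(n_in, n_hid)
    gru = gru_param_count(n_in, n_hid)
    print(lstm, gru, gru / lstm)
```

For equal layer sizes the GRU has three affine blocks where the LSTM has four, so under these assumptions a GRU layer carries roughly three quarters of the parameters of an LSTM layer.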

3.4 Residual BLSTM and residual BGRU

In LSTM or GRU architectures, a sigmoid function or hyperbolic tangent function is chosen as the nonlinear activation. These bounded functions may accentuate the vanishing gradient problem because it is easy for the gradients to become small when the error signal passes through multiple layers. This issue has attracted the attention of researchers in the machine learning community.

In recent years, some novel architectures, such as residual networks [31] and highway networks [32], have introduced an additional spatial shortcut path from lower layers for efficient training of deep networks with multiple layers.

The residual network approach [31] was successfully applied to train more than 100 convolutional layers for image classification and detection. The key insight in the residual network is the inclusion of a shortcut path between layers that can be used as an additional gradient path. The highway network [32] is another way of implementing a shortcut path in a feed-forward neural network. Highway LSTM [33] is a recurrent version of the highway network. This approach reuses shortcut gradient paths in the temporal direction for a highway shortcut in the spatial domain. Highway connections are used between internal memory cells instead of output layers. A new gate network was also introduced to control highway paths from the prior layer memory cells.

Inspired by these approaches, we propose the bidirectional residual LSTM and GRU (BrLSTM and BrGRU). These new architectures combine the merits of both bidirectional RNNs and residual networks. In detail, Equations (1)–(5) and (10)–(12) do not change, while (6) is updated as follows:

$$ h_{t} = o_{t} \otimes \phi(c_{t}) + W_{hx}x_{t} $$

(14)

Similarly, (13) is updated as follows:

$$ h_{t}=z_{t}*h_{t-1} + (1 - z_{t})*m_{t} + W_{hx}x_{t} $$

(15)
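The residual term is the same in (14) and (15): a linear map of the layer input is added to the unit's ordinary output. A minimal sketch, assuming the ordinary output (the right-hand side of (6) or (13)) has already been computed and is passed in as `h_core`; the function and argument names are illustrative.

```python
def add_residual(h_core, W_hx, x):
    """Eqs. (14)/(15): add the spatial shortcut W_hx x to the unit output.

    h_core is the ordinary LSTM or GRU output from Eq. (6) or Eq. (13).
    W_hx has one row per hidden unit and one column per input dimension,
    so it also maps x to the hidden size when the two sizes differ.
    """
    shortcut = [sum(w * xi for w, xi in zip(row, x)) for row in W_hx]
    return [h + s for h, s in zip(h_core, shortcut)]


if __name__ == "__main__":
    h_core = [0.1, 0.2]                          # bounded recurrent output
    W_hx = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # hypothetical shortcut weights
    print(add_residual(h_core, W_hx, [1.0, 2.0, 3.0]))
```

Because the shortcut bypasses the sigmoid and tanh nonlinearities, the gradient with respect to x_t has a path through W_hx that is not attenuated by a bounded activation.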

There has been some similar recent work in this area, notably residual LSTM [34] and highway LSTM [33]. In contrast to residual LSTM, we add the residual directly to the output of the LSTM units, while [34] adds the residual part to the hidden states before the projection functions. In addition, we apply the residual part to the new local window BLSTM and BGRU. In contrast to the highway LSTM, our methods use an output layer for the spatial shortcut connection instead of an internal memory cell, which reduces interference with the temporal gradient flow.

4 Experimental results and discussion

4.1 Data corpus

To evaluate our models, we conduct experiments on OpenKWS, a series of low resource speech recognition tasks.

Since 2013, the National Institute of Standards and Technology (NIST) has conducted a series of keyword search evaluations called OpenKWS [35]. It is a part of the IARPA Babel program. These evaluations try to build a high-performance automatic speech recognition system for keyword search tasks. The data for the IARPA Babel program consists of conversational telephone speech from 25 languages. During the evaluations every year, an unknown and resource-limited surprise language is released, and participating teams are given only a short period of time to finish the task.

For each surprise language, the amount of training data is about 40 h. In addition to the training set, there are also tuning, development, and evaluation sets for each surprise language. The duration of the tuning set is about 5 h, while the development set contains about 10 h of transcribed data. The development set is for evaluation by the participants themselves. The evaluation set contains about 90 h of conversational speech. There is no pronunciation lexicon released. However, a language-specific peculiarities (LSP) document is available, which can help participating teams build the grapheme-to-phoneme lexicon.

We included seven languages in our experiments: Cantonese, Pashto, Vietnamese, Tamil, Swahili, Kazakh, and Georgian. Figure 4 shows all Babel languages and highlights our target languages, represented by white points.

4.2 Baseline setup

For these target languages, the pronunciation lexicons are processed based on LSP document information. The language model used for each language is a trigram, trained using only the transcripts of the training data for that language and smoothed using Kneser-Ney smoothing.

We select two kinds of input features to train the individual models. The first is a vector of 40-dimensional Mel filterbank features concatenated with first and second order derivatives. For these inputs, a GMM-HMM is trained to generate targets using the Kaldi toolkit [36]. The GMM-HMM is first trained using 13-dimensional PLP features concatenated with 3-dimensional pitch features, normalized to zero mean and unit variance. After that, LDA is applied to reduce the feature dimensionality to 40. Next, the GMM-HMMs are trained by speaker adaptive training (SAT) and further enhanced by discriminative training using the boosted maximum mutual information (BMMI) criterion.

The second feature type is a vector of 128-dimensional multilingual bottleneck features [37]. Figure 5 shows the configuration of the bottleneck feature extractor. We use a six-layer TDNN [38] as the feature extractor. The splicing indexes used are {−2,−1,0,1,2} {−1,2} {−3,3} {−7,2} {0}(∗) {0}. The splicing indexes {−2,−1,0,1,2} indicate that the first layer sees five consecutive frames of input, and {−1,2} indicates that the second hidden layer sees two frames of the previous layer, separated by three frames. All layers except the bottleneck layer contain 1024 neurons. The bottleneck layer is located at the fifth layer, denoted as (∗). The dimension of the bottleneck layer is 128. The original input to the TDNN feature extractor is the 40-dimensional Mel filterbank features concatenated with 3-dimensional pitch features. To fully take advantage of available low resource languages, we include 24 languages from the IARPA Babel program dataset and use multitask learning methods [39] to train the bottleneck feature extractor. For each language in the 24 language set, a separate GMM-HMM is trained to generate the frame-level senone alignments as the targets of the feature extractor using the above methods. The details regarding corpus size and the number of senones for all 24 languages are listed in Table 1. The total duration of training data used to extract multilingual features is about 1400 h.
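As a worked check on the splicing configuration, the total temporal context seen by the TDNN can be computed by summing the most negative and most positive offsets layer by layer, since each layer's offsets are relative to the outputs of the layer below. This is a small sketch under that standard interpretation of splice indexes; the function name is illustrative.

```python
def tdnn_context(splice_indexes):
    """Return (left, right) input context in frames for stacked splice offsets."""
    left = sum(-min(layer) for layer in splice_indexes)
    right = sum(max(layer) for layer in splice_indexes)
    return left, right


if __name__ == "__main__":
    # Splicing indexes as listed in the text, one list per layer.
    splices = [[-2, -1, 0, 1, 2], [-1, 2], [-3, 3], [-7, 2], [0], [0]]
    left, right = tdnn_context(splices)
    print(left, right, left + right + 1)
```

For the listed configuration this gives 13 frames of left context and 9 frames of right context, that is, a 23-frame span of input behind each bottleneck vector. The last two layers use {0} and therefore add no further context.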

Table 1 The corpus size and number of senones of the Babel languages
