# Integrated exemplar-based template matching and statistical modeling for continuous speech recognition

Xie Sun^{1} and Yunxin Zhao^{2}

**2014**:4

https://doi.org/10.1186/1687-4722-2014-4

© Sun and Zhao; licensee Springer. 2014

**Received: **3 September 2013

**Accepted: **14 January 2014

**Published: **1 February 2014

## Abstract

We propose a novel approach of integrating exemplar-based template matching with statistical modeling to improve continuous speech recognition. We choose the template unit to be context-dependent phone segments (triphone context) and use multiple Gaussian mixture model (GMM) indices to represent each frame of speech templates. We investigate two different local distances, log likelihood ratio (LLR) and Kullback-Leibler (KL) divergence, for dynamic time warping (DTW)-based template matching. In order to reduce computation and storage complexities, we also propose two methods for template selection: minimum distance template selection (MDTS) and maximum likelihood template selection (MLTS). We further propose to fine tune the MLTS template representatives by using a GMM merging algorithm so that the GMMs can better represent the frames of the selected template representatives. Experimental results on the TIMIT phone recognition task and a large vocabulary continuous speech recognition (LVCSR) task of telehealth captioning demonstrated that the proposed approach of integrating template matching with statistical modeling significantly improved recognition accuracy over the hidden Markov modeling (HMM) baselines for both TIMIT and telehealth tasks. The template selection methods also provided significant accuracy gains over the HMM baseline while largely reducing the computation and storage complexities. When all templates or MDTS were used, using the LLR local distance gave better performance than the KL local distance. For MLTS and template compression, KL local distance gave better performance than the LLR local distance, and template compression further improved the recognition accuracy on top of MLTS while having less computational cost.

## Keywords

## 1 Introduction

In speech recognition, hidden Markov modeling (HMM) has been the dominant approach since it provides a principled way of jointly modeling speech spectral variations and time dynamics. However, HMM has the shortcoming of assuming that observations are independent within each state, which makes it ineffective in modeling the fine details of speech temporal evolution that are important in characterizing nonstationary speech sounds [1]. Time derivatives of cepstral coefficients [2] are widely used to supplement speech feature distributions with time dynamic information. The trajectory model [3] introduces time-varying covariance modeling to capture the temporal evolution of speech features. Additionally, approaches such as segment models [4, 5] and the long-contextual-span model of resonance dynamics [6] have been proposed for similar purposes.

Exemplar-based methods have the potential to address this deficiency of HMMs, and in recent years they have drawn renewed attention in the speech recognition community [7, 8], for example, sparse representations (SRs) [9] and template matching [10, 11]. Template-based methods make direct comparisons between a test pattern and the templates of training data via dynamic time warping (DTW), and they can potentially capture speech dynamics better than HMMs. Template-based methods were originally used to recognize isolated words or connected digits with good performance [12]. Until recently, template-based methods had been impractical for large speech recognition tasks, since the feature vectors of training templates need to be stored in computer memory. With today's rapid advances in computing power and memory capacity, template-based methods have been investigated for large recognition tasks and promising results have been reported [10, 11, 13–18]. However, they remain difficult to use in large vocabulary continuous speech recognition (LVCSR) due to their need for intensive computing time and storage space. Newly proposed methods, such as template pruning and filtering [19], template-like dimension reduction of speech observations [20], and template matching in second-pass decoding search [21], are beginning to address this problem. In general, there is a tradeoff between the costs in computation and space and the accuracy of recognition.

HMM-based statistical models are effective in compactly representing the speech spectral distributions of discrete states but are ineffective in representing the fine details of speech dynamics, while template matching captures speech temporal evolution well but demands much larger computational complexity and memory space. Given these complementary pros and cons, it appears plausible to integrate the two approaches so as to exploit their strengths and avoid their weaknesses. In the current work, we propose a novel approach of integrating exemplar-based template matching with statistical modeling. We construct triphone context-dependent phone templates to preserve the time dynamic information of phone units and use phonetic decision trees to generate templates of tied triphone units, which improves the reliability of triphone templates and covers unseen triphones by triphone clusters. The load on memory storage is reduced by using Gaussian mixture model (GMM) indices to represent the speech frames of the templates. It is worth noting that Gaussian indices were previously used to represent speech frames in speech segmentation [22], speech separation [23], and keyword spotting [24–26]. To facilitate comparison of the templates labeled by GMM indices, we propose the local distances of log likelihood ratio (LLR) and Kullback-Leibler (KL) divergence for DTW-based template matching. To further reduce the costs of memory space and computation, we propose template selection methods to generate template representatives based on the criteria of minimum distance (MDTS) and maximum likelihood (MLTS), and we also propose a template compression method that integrates information from training templates to obtain more informative template representatives. In the recognition stage, the GMMs and the templates are used together by DTW with the proposed local distances.
The proposed methods have been applied to lattice rescoring on the tasks of TIMIT [27] phone recognition and telehealth [28] large vocabulary continuous speech recognition, and they have led to consistent error reductions over the HMM baselines.

This paper is organized as follows. In Section 2, we discuss the related work for template-based speech recognition and provide an overview of our proposed system. In Section 3, we describe the proposed methods for template construction, matching, and clustering. In Section 4, we discuss the proposed methods for template representative selection and compression. In Section 5, we present evaluation results on the task of TIMIT phone recognition and the task of telehealth LVCSR. Finally in Section 6, we give our conclusion and discuss future work.

## 2 Related work and system overview

### 2.1 Related work

Continuous speech recognition using template-based approaches has gained significant attention over the past several years. In [10], a top-down search algorithm was combined with a data-driven selection of candidates for DTW alignment to reduce search space, together with a flexible subword unit selection mechanism and a class-sensitive distance measure. On the Resource Management task, although the performance of the template matching system fell below the best published HMM results, the word error patterns of the two types of systems were found to be different and their combination was beneficial. In [13], an episodic-HMM hybrid system was proposed to exploit the ability of HMMs in producing high-quality phone graphs as well as the capability of an episodic memory in accessing fine-grained acoustic data for rescoring, where template matching was performed by DTW using the Euclidean distance. This system was evaluated on the 5k-word *Wall Street Journal* (WSJ) task and it showed a comparable performance with state-of-the-art HMM systems. In [18], prosodic information of duration, speaking rate, loudness, pitch, and voice quality was integrated with template matching through conditional random fields to improve recognition accuracy. On the Nov92 20k-word trigram WSJ task, the proposed method improved the state-of-the-art template baseline without prosodic information and led to a relative word error rate reduction of 7%. To make the template-based approach realistic for hundreds of hours of speech training data, a data pruning method was described for template-based automatic speech recognition in [19]. The pruning strategy worked iteratively to eliminate more and more templates from an initial database, and at each iteration, the feedback for data pruning was provided by the word error rate of the current model. 
This data pruning reduced the database size or the model size by about 30%, and consequently saved the computation time and memory usage in speech recognition. In [21], exemplar-based word-level features were investigated for large-scale speech recognition. These features were combined with the acoustic and language scores of the first-pass model through a segmental conditional random field to rescore word lattices. Since the word lattices helped restrict the search space, the templates were not required to cover the full training data, and the templates were also filtered to a smaller set to reduce computation cost and improve robustness. Experimental results showed that the template-based approach obtained a slightly better performance than the baseline system in Voice Search and YouTube tasks.

Relative to the above-discussed efforts, our approach as proposed in the current work falls into the hybrid category, but our integration of statistical modeling with template representation and matching is tighter, since we not only rescore the lattices generated by the HMM baseline, but we also use the baseline phonetic decision tree (PDT) structures to define the tied triphone templates, representing the template frames by the GMMs and using the LLR and KL distances to measure the differences of speech frames represented in this way. In the aspect of reducing computation and memory costs, we absorb the training data information into template representatives through clustering and estimation, rather than selecting a subset of training data as the templates. On the TIMIT and telehealth tasks, we are able to show statistically significant improvements in phone and word accuracies, respectively, over the HMM baselines.

### 2.2 System overview

## 3 Template representation, matching, and clustering

### 3.1 Template representation

We use a set of GMMs {*m*_{1}, *m*_{2}, …, *m*_{N}} that consists of the GMMs of the phonetic-decision-tree tied triphone states in the baseline HMMs to label the template frames, where *N* is the total number of GMMs from the HMM baseline. To do so, we compute the likelihood scores of a feature vector or frame (these two terms are used interchangeably with the understanding that a feature vector is normally extracted from a frame of data), *x*_{t} ∈ *R*^{d} (*d* is the dimension of a real-valued feature vector), of a phone template by all GMMs and take the *n* GMMs that give the top *n* likelihood scores, $p\left({x}_{t}|{m}_{1\left({x}_{t}\right)}\right)\ge p\left({x}_{t}|{m}_{2\left({x}_{t}\right)}\right)\ge \dots \ge p\left({x}_{t}|{m}_{n\left({x}_{t}\right)}\right)\ge \dots$, to label *x*_{t}. Each GMM index is also associated with a weight ${w}_{k\left({x}_{t}\right)}$ that is defined to be proportional to the likelihood score $p\left({x}_{t}|{m}_{k\left({x}_{t}\right)}\right)$, with ${w}_{k\left({x}_{t}\right)}=\frac{p\left({x}_{t}|{m}_{k\left({x}_{t}\right)}\right)}{{\sum}_{l=1}^{n}p\left({x}_{t}|{m}_{l\left({x}_{t}\right)}\right)}\phantom{\rule{0.5em}{0ex}}\mathrm{and}\phantom{\rule{0.5em}{0ex}}{\sum}_{k=1}^{n}{w}_{k\left({x}_{t}\right)}=1$. A template frame is therefore represented as

${x}_{t}\to \left\{\left({m}_{k\left({x}_{t}\right)},{w}_{k\left({x}_{t}\right)}\right),\phantom{\rule{0.5em}{0ex}}k=1,\dots ,n\right\}.$ (1)

In general, *n* << *d*, and hence storing the template frames as GMM indices requires a much smaller space than storing the feature frames for the templates.
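As an illustrative sketch of this labeling scheme, the following computes the top-*n* GMM indices and likelihood-proportional weights for one frame; the diagonal-covariance GMM scorer and the two toy models are placeholders, not the paper's trained baseline GMMs:

```python
import math

def gmm_likelihood(x, gmm):
    """Likelihood of feature vector x under a diagonal-covariance GMM.
    gmm = [(weight, means, variances), ...], one tuple per mixture component."""
    total = 0.0
    for w, mu, var in gmm:
        log_p = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                    for xi, m, v in zip(x, mu, var))
        total += w * math.exp(log_p)
    return total

def label_frame(x, gmms, n=2):
    """Return the top-n GMM indices and their normalized weights for frame x."""
    scores = [(gmm_likelihood(x, g), i) for i, g in enumerate(gmms)]
    scores.sort(reverse=True)           # best-scoring GMMs first
    top = scores[:n]
    norm = sum(p for p, _ in top)       # weights are likelihood-proportional
    return [(i, p / norm) for p, i in top]

# Two toy one-component "GMMs" in 2-D; the frame lies closer to the first.
gmms = [
    [(1.0, [0.0, 0.0], [1.0, 1.0])],
    [(1.0, [3.0, 3.0], [1.0, 1.0])],
]
labels = label_frame([0.5, 0.2], gmms, n=2)
# labels[0] is GMM index 0 with the larger weight; the weights sum to 1
```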

### 3.2 Template matching

Let *d*(*i*, *j*) denote the local distance between the *i*th and the *j*th frames of two sequences under comparison and *D*(*i*, *j*) denote the cumulative distance between the two sequences up to the times *i* and *j*. The symmetric constraint that we adopt here is defined as

$D\left(i,j\right)=d\left(i,j\right)+\min \left\{D\left(i-1,j\right),\phantom{\rule{0.3em}{0ex}}D\left(i-1,j-1\right),\phantom{\rule{0.3em}{0ex}}D\left(i,j-1\right)\right\}.$ (2)

Given a sequence *S*_{x} representing a template and a sequence *S*_{y} representing a test segment, their average frame distance is calculated as

$\overline{d}\left({S}_{x},{S}_{y}\right)=\frac{1}{N}{\sum}_{k=1}^{N}d\left({i}_{k},{j}_{k}\right),$ (3)

where the warping path $\left\{\left({i}_{k},{j}_{k}\right)\right\}$ aligns *S*_{x} and *S*_{y} to a common time axis and *N* is the warping path length. Considering the fact that in HMM-based decoding search the acoustic score of a test segment is the sum of its frame log likelihood scores (the segment acoustic score is therefore the average frame score scaled by the length of the segment), we define the distance between the template *S*_{x} and the test segment *S*_{y} in a similar way as

$D\left({S}_{x},{S}_{y}\right)=L\cdot \overline{d}\left({S}_{x},{S}_{y}\right),$ (4)

where *L* is the length of the test segment *S*_{y}. Through scaling the average frame distance by the test segment length, the acoustic scores for different hypotheses of a test speech utterance (which in general consists of many segments) can be directly compared in template matching, as in HMM-based decoding search. Note that without the normalization by *N* in Equation 3, a template matching score for a speech segment would be affected by the length of the time warping path, which may vary with different templates; on the other hand, if the rescaling by *L* were not adopted, the total distance on a decoding path would depend on the number of test segments in the path but not on the lengths of these segments.

A baseline feature-based local distance between a test frame *x*_{t} and a training template frame ${y}_{{t}^{\text{'}}}$ is the Mahalanobis distance $d\left({x}_{t},{y}_{{t}^{\text{'}}}\right)={\left({x}_{t}-{y}_{{t}^{\text{'}}}\right)}^{\mathrm{T}}{\mathbf{\Sigma}}^{-1}\left({x}_{t}-{y}_{{t}^{\text{'}}}\right)$, with **Σ** as the covariance matrix estimated from training data. Alternatively, with the training frame ${y}_{{t}^{\text{'}}}$ labeled by its *n* GMMs $\left\{{m}_{1\left({y}_{{t}^{\text{'}}}\right)},\dots ,{m}_{n\left({y}_{{t}^{\text{'}}}\right)}\right\}$ with the weights $\left\{{w}_{1\left({y}_{{t}^{\text{'}}}\right)},\dots ,{w}_{n\left({y}_{{t}^{\text{'}}}\right)}\right\}$, the negative log likelihood (NLL) distance becomes

$d\left({x}_{t},{y}_{{t}^{\text{'}}}\right)=-\log {\sum}_{k=1}^{n}{w}_{k\left({y}_{{t}^{\text{'}}}\right)}\,p\left({x}_{t}|{m}_{k\left({y}_{{t}^{\text{'}}}\right)}\right).$

The NLL distance is of the feature-model type, as it does not use the information of the GMM labels on the test segment frames. The proposed log likelihood ratio and KL divergence distances make use of the GMM labels on both the test and the training frames. These two model-model distances are described below.

#### 3.2.1 Log likelihood ratio local distance

Suppose the test frame *x*_{t} is labeled by a GMM ${m}_{1\left({x}_{t}\right)}$ and the training frame ${y}_{{t}^{\text{'}}}$ is labeled by a GMM ${m}_{1\left({y}_{{t}^{\text{'}}}\right)}$. The LLR local distance between *x*_{t} and ${y}_{{t}^{\text{'}}}$ is then defined as follows:

$d\left({x}_{t},{y}_{{t}^{\text{'}}}\right)=\log \frac{p\left({x}_{t}|{m}_{1\left({x}_{t}\right)}\right)}{p\left({x}_{t}|{m}_{1\left({y}_{{t}^{\text{'}}}\right)}\right)}$

(it is worth mentioning here that although a negative log likelihood ratio is a mathematical possibility, it never occurred in the experiments described in Section 5).

#### 3.2.2 KL divergence local distance

In the LLR local distance, the test feature vector *x*_{t} is involved in the distance calculation. Here we consider measuring the local distance between two frames without using the feature vectors. KL divergence is widely used for measuring the difference between two probability distributions [29]. Since the frames are represented by GMM indices, the KL divergence between GMMs becomes a natural choice for indirectly measuring the dissimilarity of two frames. Because there is no closed-form expression for the KL distance between GMMs, we use the Monte Carlo sampling method of Hershey and Olsen [30] to compute the divergence from a GMM *m*_{x} to a GMM *m*_{y} as

${d}_{\mathrm{KL}}\left({m}_{x}\parallel {m}_{y}\right)\approx \frac{1}{{n}_{s}}{\sum}_{i=1}^{{n}_{s}}\log \frac{{m}_{x}\left({x}_{i}\right)}{{m}_{y}\left({x}_{i}\right)},$

where the *x*_{i}s are i.i.d. samples generated from the GMM *m*_{x} and *n*_{s} is the number of samples. Since the KL divergence is asymmetric, we further define a symmetric KL distance as

$d\left({m}_{x},{m}_{y}\right)={d}_{\mathrm{KL}}\left({m}_{x}\parallel {m}_{y}\right)+{d}_{\mathrm{KL}}\left({m}_{y}\parallel {m}_{x}\right).$
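A sketch of the Monte Carlo estimator on 1-D GMMs (scalar features and the toy one-component models are simplifying assumptions; the baseline GMMs are multivariate):

```python
import math
import random

def gauss_logpdf(x, mu, var):
    """Log density of scalar x under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def gmm_logpdf(x, gmm):
    """Log density of scalar x under a 1-D GMM [(weight, mean, var), ...]."""
    return math.log(sum(w * math.exp(gauss_logpdf(x, m, v)) for w, m, v in gmm))

def mc_kl(gmm_x, gmm_y, n_samples=10000, seed=0):
    """Monte Carlo estimate of KL(gmm_x || gmm_y): draw samples from gmm_x
    and average the log density ratio, as in Hershey and Olsen's method."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # pick a mixture component by its weight, then sample from it
        pick = rng.random()
        acc = 0.0
        for w, m, v in gmm_x:
            acc += w
            if pick <= acc:
                x = rng.gauss(m, math.sqrt(v))
                break
        else:  # guard against floating-point undershoot of the weight sum
            x = rng.gauss(gmm_x[-1][1], math.sqrt(gmm_x[-1][2]))
        total += gmm_logpdf(x, gmm_x) - gmm_logpdf(x, gmm_y)
    return total / n_samples

def symmetric_kl(gmm_x, gmm_y, n_samples=10000):
    """Symmetrized KL used as the model-model local distance."""
    return mc_kl(gmm_x, gmm_y, n_samples) + mc_kl(gmm_y, gmm_x, n_samples)

# Unit-variance Gaussians with means 0 and 2: the true KL is (2 - 0)**2 / 2 = 2.
g1 = [(1.0, 0.0, 1.0)]
g2 = [(1.0, 2.0, 1.0)]
est = mc_kl(g1, g2)
```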

### 3.3 PDT-based template clustering and matching score calculation

Considering the fact that certain triphone contexts may rarely occur or even be missing in a training set, we investigate tying triphone templates into clusters of equivalent contexts to improve the reliability of template matching as well as to handle unseen triphones in recognition. Among many possible clustering algorithms, we decide to utilize the PDT tying structures of the triphone states in the baseline HMMs directly to cluster triphone segments, since the tying structure of a phone state indicates partial similarities among triphone segments. We assume that each phone HMM has three emitting states as commonly used in HTK [31]. For the triphone templates of each monophone, we keep the three tying structures defined by the three emitting states of the corresponding phone HMM and use them jointly in template matching.

Specifically, in matching a test speech segment against a triphone arc in a word lattice, we first identify the three tied triphone clusters by answering the phonetic questions in the PDTs, and for an identified cluster *i* with *k*_{i} templates, we then choose the $\sqrt{{k}_{i}}$ best-matching templates and average their matching scores for the test segment, and we further average the three scores of the three clusters as the matching score between the speech segment and the triphone arc. Using the square-root rule helps compress the variation in the number of templates *k*_{i} used in computing the scores, since this number often varies largely across triphone clusters. The rule is also analogous to the *K*-nearest neighbor (KNN) method where *K* is set as the square root of the training sample size [32].
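The cluster-level scoring rule above can be sketched as follows; the distance lists are placeholders, and rounding $\sqrt{k}$ to the nearest integer is an assumption:

```python
import math

def cluster_score(template_distances):
    """Average the sqrt(k) best-matching (smallest) DTW distances among
    the k templates of one tied triphone cluster."""
    k = len(template_distances)
    n_best = max(1, round(math.sqrt(k)))
    best = sorted(template_distances)[:n_best]
    return sum(best) / len(best)

def arc_score(cluster1, cluster2, cluster3):
    """Average the three per-cluster scores (one per tied-state cluster)
    into a single matching score for the phone arc."""
    scores = [cluster_score(c) for c in (cluster1, cluster2, cluster3)]
    return sum(scores) / 3.0

# Toy DTW distances for the three tied clusters of one triphone arc.
s = arc_score([4.0, 1.0, 3.0, 2.0],
              [2.0],
              [5.0, 1.0, 3.0, 6.0, 2.0, 9.0, 4.0, 8.0, 7.0])
# clusters average their 2, 1, and 3 best distances: 1.5, 2.0, 2.0
```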

Consider a test speech segment *X* extracted from a speech utterance according to the start and end times of a phone arc *P* that has a predecessor phone *P*_{L} and a successor phone *P*_{R}. Figure 3 illustrates the way that the matching scores of *X* with the three triphone template clusters containing *P*_{L} − *P* + *P*_{R} are averaged to one matching score, which is used to replace the original acoustic score in the phone lattice for the phone arc *P*.

## 4 Template selection and compression

When the above-described template matching is used for lattice rescoring in LVCSR, the computation and storage overheads are still high. However, certain redundancies in the training templates can be reduced to improve computation and storage efficiency. We propose three methods of template selection and compression to address this problem. In template selection, the goal is to choose a small subset of templates as the representatives for the full set of training templates. In template compression, new GMMs are generated for labeling the frames of the selected template representatives so as to better capture the information in the training templates.

### 4.1 Minimum-distance-based template selection

Given a distance measure *D*(*C*_{i}, *C*_{j}) between two clusters, the following procedure describes the algorithm for clustering *m* templates {*s*_{1}, *s*_{2}, …, *s*_{m}} in a leaf node of a PDT:

- 1. Initialize the template set *Z*_{1} = {{*s*_{1}}, {*s*_{2}}, …, {*s*_{m}}}, with each template *s*_{i} being a cluster.
- 2. For *n* = 2, …, *m*: obtain the new set *Z*_{n} by merging the two clusters *C*_{i} and *C*_{j} in the set *Z*_{n−1} whose distance *D*(*C*_{i}, *C*_{j}) is the minimum among all existing distinct cluster pairs. Stop the clustering process if the number of clusters in the set *Z*_{n} drops below a threshold.

The cluster distance *D*(*C*_{i}, *C*_{j}) is commonly defined by the distances of the cluster elements *D*(*s*_{x}, *s*_{y}), and the average distance measure is adopted here [33]:

$D\left({C}_{i},{C}_{j}\right)=\frac{1}{\left|{C}_{i}\right|\left|{C}_{j}\right|}{\sum}_{{s}_{x}\in {C}_{i}}{\sum}_{{s}_{y}\in {C}_{j}}D\left({s}_{x},{s}_{y}\right).$

Note that *D*(*s*_{x}, *s*_{y}) is the DTW distance of two templates as defined in Section 3.2, and in this step, the local distance *d* is the Euclidean distance between two frames.

For a template *s*_{x} in a cluster *C*_{i}, the template-to-cluster distance is defined as follows [33]:

$D\left({s}_{x},{C}_{i}\right)=\frac{1}{\left|{C}_{i}\right|-1}{\sum}_{{s}_{y}\in {C}_{i},\phantom{\rule{0.2em}{0ex}}{s}_{y}\ne {s}_{x}}D\left({s}_{x},{s}_{y}\right).$

A template *s*^{*} is selected as the representative for the cluster *C*_{i} if its distance to the rest of the templates in the cluster is the minimum, i.e.,

${s}^{*}=\arg \underset{{s}_{x}\in {C}_{i}}{\min}\,D\left({s}_{x},{C}_{i}\right).$

The frames of the selected template representatives are subsequently indexed by their *n*-best GMMs according to Section 3.1.
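The MDTS procedure (bottom-up clustering with average linkage, then medoid selection) can be sketched as follows, with scalar "templates" and an absolute-difference distance standing in for frame sequences and the DTW distance:

```python
def average_linkage(c1, c2, dist):
    """Average distance over all cross-cluster template pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def mdts(templates, dist, n_clusters):
    """Minimum distance template selection: agglomerative clustering of the
    templates in one PDT leaf, then pick each cluster's medoid as its
    representative."""
    clusters = [[t] for t in templates]
    while len(clusters) > n_clusters:
        # find the closest pair of clusters and merge them
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    reps = []
    for c in clusters:
        # medoid: template with minimum total distance to the others
        rep = min(c, key=lambda s: sum(dist(s, t) for t in c if t is not s))
        reps.append(rep)
    return clusters, reps

# Toy scalar "templates"; two well-separated groups.
dist = lambda a, b: abs(a - b)
clusters, reps = mdts([0.0, 0.2, 0.1, 5.0, 5.1], dist, 2)
```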

### 4.2 Maximum-likelihood-based template selection

The cluster center *s*^{*} as generated by the MDTS method is relabeled by a set of GMMs that are selected by using a maximum likelihood criterion, so as to make the representative better characterize the templates in each cluster. For maximum likelihood template selection (MLTS), we use the DTW described in Section 3.2 to align the templates in a cluster *C*_{i} to the MDTS-initialized template center *s*^{*}. Figure 4 illustrates an outcome of aligning the sequences *s*_{1}, …, *s*_{N} to *s*^{*} in *C*_{i}, where the frames ${x}_{{t}_{1}}^{\left(1\right)},\dots ,{x}_{{t}_{N}}^{\left(N\right)}$ of the sequences *s*_{1}, …, *s*_{N}, respectively, are aligned to the frame ${x}_{{t}_{*}}^{\left(*\right)}$ of the cluster center *s*^{*}. Let ${m}_{i}^{j}$ denote the *j*th GMM in *M*_{i}. The following procedure describes the MLTS method that is applied to relabel ${x}_{{t}_{*}}^{\left(*\right)}$ of *s*^{*} by using the aligned frames $X=\left\{{x}_{{t}_{*}}^{\left(*\right)},\phantom{\rule{0.5em}{0ex}}{x}_{{t}_{1}}^{\left(1\right)},\dots ,{x}_{{t}_{N}}^{\left(N\right)}\right\}$:

- 1. Pool the distinct GMMs that are used to label the frames in *X* into a local GMM set *M*.
- 2. Use the *K*-medoids algorithm [33] with the KL distance to partition the GMM set *M* into *l* clusters *M*_{i}, *i* = 1, …, *l*, where each *M*_{i} defines a subset *X*_{i} of the frames, namely those labeled by the GMMs in *M*_{i}.
- 3. For *i* = 1, …, *l*: use the maximum likelihood criterion to select a GMM of *M*_{i} as the cluster center ${m}_{i}^{*}$ for *M*_{i}: ${m}_{i}^{*}=\arg \underset{{m}_{i}^{j}\in {M}_{i}}{\max}{\sum}_{x\in {X}_{i}}\log p\left(x|{m}_{i}^{j}\right).$
- 4. For *i* = 1, …, *l*: calculate the weight *w*_{i} for each GMM cluster center ${m}_{i}^{*}$, which is proportional to the likelihood of *X* evaluated by ${m}_{i}^{*}$, i.e., $p\left(X|{m}_{i}^{*}\right)$: ${w}_{i}=\frac{p\left(X|{m}_{i}^{*}\right)}{{\sum}_{k=1}^{l}p\left(X|{m}_{k}^{*}\right)}=\frac{{e}^{{\sum}_{x\in X}\log p\left(x|{m}_{i}^{*}\right)}}{{\sum}_{k=1}^{l}{e}^{{\sum}_{x\in X}\log p\left(x|{m}_{k}^{*}\right)}}.$ (18)

After the relabeling, the frame ${x}_{{t}_{*}}^{\left(*\right)}$ is represented by ${m}_{i}^{*}$ and *w*_{i}, *i* = 1, …, *l*. The MLTS procedure is applied to each frame of *s*^{*}. The resulting representation of *s*^{*} has a form similar to what is described in Section 3.1, with the difference that the best-fitting *n* GMMs of the baseline HMMs are used to label a frame in Section 3.1, whereas here the template frames that are aligned to a frame of the MDTS representative are used to select a set of *l* GMMs to relabel the frame of the representative.

### 4.3 Template compression

Instead of selecting one of the original GMMs as the center of each cluster *M*_{i} as in MLTS, here we merge the original GMMs in each cluster *M*_{i} into a new GMM and use the *l* new GMMs from the *l* clusters *M*_{i}, *i* = 1, …, *l* to relabel the frame. To reduce the negative effect of outlier templates, for each GMM ${m}_{i}^{j}$ in a cluster *M*_{i}, we calculate its distance to the cluster center ${m}_{i}^{*}$ based on the KL distance ${d}_{i}^{j}=d\left({m}_{i}^{j},{m}_{i}^{*}\right)$. From the distances ${d}_{i}^{j}$ of *M*_{i}, the mean $\overline{d}$ and the standard deviation σ are computed. If a GMM ${m}_{i}^{j}$ is more than *t* times the standard deviation away from $\overline{d}$, i.e.,

${d}_{i}^{j}-\overline{d}>t\sigma,$ (19)

it is excluded as an outlier. Suppose there are *n*_{G} GMMs left in *M*_{i}. We first pool the component Gaussian densities from the *n*_{G} GMMs and normalize the weight of each Gaussian component by *n*_{G}. We then merge the pooled Gaussian components according to the criterion of minimum entropy increment. The entropy increase due to merging two Gaussian components *f*_{i} ~ *N*(*µ*_{i}, Σ_{i}) and *f*_{j} ~ *N*(*µ*_{j}, Σ_{j}) into *N*(*µ*, Σ) is defined as [35]:

$\Delta H=\frac{{w}_{i}+{w}_{j}}{2}\log \left|\Sigma \right|-\frac{{w}_{i}}{2}\log \left|{\Sigma}_{i}\right|-\frac{{w}_{j}}{2}\log \left|{\Sigma}_{j}\right|,$

where *w*_{i} and *w*_{j} are the normalized mixture weights for *f*_{i} and *f*_{j}. The mean *μ*, covariance Σ, and mixture weight *w* of the newly generated Gaussian component are defined as

$w={w}_{i}+{w}_{j},\phantom{\rule{1em}{0ex}}\mu =\frac{{w}_{i}{\mu}_{i}+{w}_{j}{\mu}_{j}}{w},\phantom{\rule{1em}{0ex}}\Sigma =\frac{{w}_{i}}{w}\left[{\Sigma}_{i}+\left({\mu}_{i}-\mu \right){\left({\mu}_{i}-\mu \right)}^{\mathrm{T}}\right]+\frac{{w}_{j}}{w}\left[{\Sigma}_{j}+\left({\mu}_{j}-\mu \right){\left({\mu}_{j}-\mu \right)}^{\mathrm{T}}\right].$

The Gaussian components are merged iteratively until the number of components in *M*_{i} is below a preset threshold. The remaining Gaussian components are used to construct a new GMM, and the new GMM is used as one of the *l* GMMs to label the corresponding frame of the template representative.

## 5 Experimental results

We performed speaker-independent phone recognition on the task of TIMIT [27] and speaker-dependent large vocabulary speech recognition on the task of telehealth captioning [28]. The experimental outcomes were measured in phone accuracy and word accuracy, respectively, for TIMIT and telehealth through aligning each phone or word string hypothesis against its reference string by using the Levenshtein distance [31].

### 5.1 Corpora

**Datasets used in the telehealth task: speech (min)/text (no. of words)**

| Speaker | Training set | Test set |
|---|---|---|
| Dr. 1 | 210/35,348 | 19.3/3,248 |
| Dr. 2 | 200/39,398 | 29.8/5,085 |
| Dr. 3 | 145/28,700 | 12.1/3,988 |
| Dr. 4 | 180/39,148 | 14.3/2,759 |
| Dr. 5 | 250/44,967 | 27.8/6,421 |
| Total | 985/187,561 | 103.3/21,501 |

### 5.2 Experimental setup and lattice rescoring

For both the TIMIT and telehealth tasks, the speech features consisted of 13 MFCCs and their first- and second-order time derivatives, and crossword triphone acoustic models were trained by using the HTK toolkit. In calculating a KL distance between two GMMs [30], 10,000 Monte Carlo samples were generated.

For the TIMIT dataset, the set of 39 phones was defined as in [36], and a phone bigram language model (LM) trained from the TIMIT training speech transcripts was used. The HMM baseline was trained with a GMM mixture size of 24, and 1,189 GMMs were extracted for template construction. The training set contained 152,715 original triphone templates. Phone lattices were generated for each test sentence by HTK. The average number of nodes per lattice was on the order of 850, and the average number of arcs was on the order of 2,350.

For the telehealth task, speaker-dependent acoustic models were trained for five healthcare provider speakers, Dr. 1 to Dr. 5. In the baseline acoustic models, each GMM included 16 Gaussian components, and on average, 1,905 GMMs were extracted from the baseline HMMs of each of the five doctors. The average number of triphone templates was 181,601 per speaker. Trigram language models were trained on both in-domain and out-of-domain datasets, where word-class mixture trigram language models with weights obtained from a procedure of forward weight adjustment were used [37]. For each test sentence, word lattices including phone boundaries were generated by HTK. The average number of nodes per lattice was on the order of 700, and the average number of arcs was on the order of 1,950.

In rescoring a lattice, the acoustic score of each phone arc in the lattice was replaced by its corresponding triphone template matching score, where the distance score of Equation 4 was negated to become a similarity score. By using the acoustic similarity scores and the original language model scores, the best path with the largest sum of acoustic and language model log scores was searched on the lattice using dynamic programming to produce the rescored sentence hypothesis.
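The best-path search can be sketched as a dynamic program over a topologically ordered lattice; the dict-based lattice encoding and toy scores below are illustrative assumptions, not HTK's actual lattice format:

```python
def rescore_lattice(nodes, arcs, start, end):
    """Best-path search over a lattice: maximize the summed log scores.
    nodes must be in topological order; arcs[u] lists (v, score, label),
    where each arc score already combines the (negated) template-matching
    distance and the language model log probability."""
    best = {start: (0.0, None, None)}  # node -> (score, prev node, arc label)
    for u in nodes:
        if u not in best:
            continue
        for v, score, label in arcs.get(u, []):
            cand = best[u][0] + score
            if v not in best or cand > best[v][0]:
                best[v] = (cand, u, label)
    labels, node = [], end
    while best[node][1] is not None:  # backtrack from the end node
        labels.append(best[node][2])
        node = best[node][1]
    return best[end][0], labels[::-1]

# Toy lattice with two competing arcs on the first span.
arcs = {0: [(1, -2.0, "a"), (1, -1.0, "b")], 1: [(2, -0.5, "c")]}
total, hyp = rescore_lattice([0, 1, 2], arcs, 0, 2)
# best path is "b" then "c" with total log score -1.5
```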

### 5.3 TIMIT phone recognition task

On the TIMIT task, we provide a detailed account of the factors in the proposed template matching methods that affect the rescoring performance, including local distances, number of GMMs employed for frame labeling, template selection, compression methods and their interactions with the local distances, and the percentage of selected template representatives. We also examine the patterns of phone error reduction and look at the cost-performance tradeoffs.

#### 5.3.1 Local distances

#### 5.3.2 Number of GMMs for frame labeling

We evaluated the use of different numbers of GMMs (*n* = 1, *n* = 3, *n* = 5, and *n* = 7) in labeling each frame of the templates. For both the LLR and KL distances, the accuracy peaked when five GMMs were used for frame labeling, and phone accuracies of 74.51% and 74.26% were achieved for the LLR and KL distances, with absolute improvements of 1.79% and 1.54%, respectively, over the HMM baseline of 72.72%. The results confirmed the advantage of using multiple GMMs for frame labeling over using a single GMM, as the former induced smaller quantization errors than the latter. However, using too many GMMs to represent a frame could increase confusion and reduce efficiency. We conducted significance tests on the performance difference between the ‘5GMMs’ case and the HMM baseline. Let *x*_{i} and *y*_{i} be the phone recognition accuracies of the *i*th test sentence for the baseline and a proposed method, respectively. Let *t*_{i} = *y*_{i} − *x*_{i}, and denote the sample mean and sample variance of the *t*_{i} as $\overline{t}$ and *s*^{2}, with the sample size *m*. The Student’s *t* test statistic is $T=\overline{t}/\left(s/\sqrt{m}\right)$. In the TIMIT standard test set, *m* = 1,344 and *t*_{m − 1,1 − 0.05} = 1.65 for a one-sided test. For the LLR and KL local distances, we obtained *T* > *t*_{m − 1,1 − 0.05}, and therefore our proposed template matching methods using the LLR and KL distances improved TIMIT phone recognition accuracy significantly over the HMM baseline at the significance level of 0.05. We also used twofold cross-validation on the test set to automatically select the number of GMMs for frame labeling, and the 5GMM case was selected in each validation set. Therefore, the result of the 5GMM case in Figure 7 also represents an open test performance. In the subsequent experiments, five GMMs were used for labeling each frame.
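The significance test can be sketched directly from the definition; the per-sentence accuracies below are toy values (the real test used *m* = 1,344 TIMIT sentences):

```python
import math

def paired_t_statistic(baseline_acc, method_acc):
    """One-sided paired Student's t statistic T = tbar / (s / sqrt(m))
    over per-sentence accuracy differences t_i = y_i - x_i."""
    diffs = [y - x for x, y in zip(baseline_acc, method_acc)]
    m = len(diffs)
    tbar = sum(diffs) / m
    s2 = sum((d - tbar) ** 2 for d in diffs) / (m - 1)  # sample variance
    return tbar / math.sqrt(s2 / m)

# Toy per-sentence phone accuracies (percent) for baseline and new method.
base = [70.0, 72.0, 68.0, 71.0, 69.0, 73.0]
new = [72.0, 73.0, 70.0, 72.0, 70.0, 74.0]
T = paired_t_statistic(base, new)
# compare T against the one-sided critical value t_{m-1, 0.95}
```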

#### 5.3.3 Template selection and compression

The threshold *t* in Equation 19 was set to 2 for removing GMM outliers, and the number of Gaussian components in each merged GMM was 24, the same as the GMM mixture size in the baseline HMMs, with a total of 749 newly generated GMMs for the template representatives. In MDTS, the phone accuracies were 73.82% and 72.70% for the LLR and KL distances, respectively, and in MLTS, the phone accuracies were 74.05% and 73.07% for the KL and LLR distances, respectively. Relative to MLTS, template compression increased absolute phone accuracy by 0.27% with the KL distance and decreased absolute phone accuracy by 0.40% with the LLR distance. Several points worth noting in Figure 8 are discussed below.

First, MDTS worked well with the LLR distance but poorly with the KL distance, and vice versa for MLTS and template compression. In MDTS, the template representative frames were labeled in the same way as the test frames, i.e., by the best-fit GMMs of the baseline model, and in this case, a better outcome of LLR than KL is consistent with what was shown in Figure 6 for using all templates. In MLTS, however, the selected template representative frames were relabeled by GMMs to maximize the likelihood of the aligned template frames, and template compression went further by generating new GMMs from the baseline GMMs and used the new GMMs to relabel the representative frames. Because in MLTS or template compression the template representative frames were no longer labeled by the best-fit GMMs, the LLR distance that contrasted the model-frame fit became ineffective in comparison with the KL distance that measures the distance between GMMs.

Second, relative to using all of the original templates as discussed in Section 5.3.2, using 20% template representatives that were selected by MLTS with the KL distance slightly decreased phone accuracy by 0.21% (from 74.26% to 74.05%), but using the template representatives selected by MDTS with the LLR distance significantly decreased phone accuracy by 0.69% (from 74.51% to 73.82%). This difference may be explained by the fact that MDTS simply selects a cluster center as a template representative, but MLTS further refines the GMM indices of each template representative frame to maximize the likelihood of the aligned frames in the corresponding cluster. In this way, MLTS absorbs more information from the training data into the template representatives than MDTS, and so fewer template representatives are needed in MLTS than in MDTS.

Third, with the KL distance, template compression further improved the performance over MLTS, where by using 20% template representatives, phone accuracy was actually improved by 0.06% over the case of using all templates (from 74.26% to 74.32%). This indicates that the new GMMs were more effective in labeling the template representative frames, and the exclusion of the outlier GMMs was helpful, too.

In summary, MDTS worked well with the LLR distance, while MLTS and template compression worked well with the KL distance. Using each method with its compatible local distance and fixing the selection percentage at 20%, template compression performed the best, MLTS the next, and MDTS the last. Specifically, the accuracy gains over the HMM baseline were 1.6% absolute by template compression with KL, 1.33% by MLTS with KL, and 1.1% by MDTS with LLR. We also conducted the Student's *t* test on the performance differences between each of the three methods (each with its compatible distance) and the HMM baseline, and all three methods significantly improved phone accuracy over the baseline at the level of α = 0.05.

#### 5.3.4 Evaluation on the outlier threshold t

The table below shows how the threshold *t* of Equation 19 for removing the GMM outliers affected the recognition performance, where the template selection method was MLTS with the KL distance and 20% template representatives were selected. Among the four *t* values studied here, *t* = 2 gave the best phone accuracy. Also note that when *t* = ∞, all GMMs in a cluster were used to generate compressed templates, and the existence of outliers degraded the accuracy significantly. Accordingly, the threshold *t* = 2 was used in all the template compression experiments.

**Phone accuracies (percent) from using different outlier threshold values for the compressed template representatives**

| Threshold *t* | 1σ | 2σ | 3σ | ∞ |
|---|---|---|---|---|
| Accuracy (%) | 73.95 | 74.32 | 73.42 | 70.89 |
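The pruning rule can be illustrated with a small sketch. The exact form of Equation 19 is not reproduced in this section, so the sketch assumes a mean-plus-*t*-standard-deviations cutoff on each GMM's distance to its cluster center, consistent with the σ-scaled thresholds in the table above; the names are ours:

```python
def prune_outliers(items, dists, t=2.0):
    """Keep items whose distance to the cluster center is within
    mean + t * std of the cluster's distances; t = inf keeps all."""
    mu = sum(dists) / len(dists)
    sd = (sum((d - mu) ** 2 for d in dists) / len(dists)) ** 0.5
    return [x for x, d in zip(items, dists) if d <= mu + t * sd]

gmms = list("abcdefghijX")                 # ten inliers and one outlier
dists = [1.0] * 10 + [50.0]                # distances to the cluster center
kept = prune_outliers(gmms, dists, t=2.0)  # the distant GMM is dropped
all_kept = prune_outliers(gmms, dists, t=float("inf"))
```

With *t* = ∞ every GMM survives, mirroring the last column of the table where outliers were left in and accuracy suffered.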

#### 5.3.5 Evaluation on the number of template representatives in template selection methods

Figure 9 compares the phone accuracies of MDTS and MLTS as the percentage of selected template representatives varies, where *l* in MLTS was set to 5, corresponding to using five GMMs to label each frame of a template representative. It is seen from the two curves that as the percentage varied from 100% down to 1%, the phone accuracies decreased for both methods. When 100% of the templates were used, i.e., without template selection, the LLR distance performed better than the KL distance, as discussed in Sections 5.3.1 and 5.3.2. When fewer than 80% of the templates were used, MLTS performed better than MDTS since the MLTS templates generalized better than the MDTS templates, as discussed in Section 5.3.3. For MDTS, when the selection percentage was reduced from 100% to 60%, the phone accuracy dropped rapidly by 0.55% (from 74.51% to 73.96%), and when the selection percentage was reduced from 60% to 20%, the phone accuracy decreased slowly by 0.14% (from 73.96% to 73.82%). In contrast, for MLTS, with the selection percentage reduced from 100% to 20%, the phone accuracy went down gradually by 0.21% (from 74.26% to 74.05%). Moreover, both curves dropped rapidly when the selection percentage was reduced below 20%. From Figure 9, we conclude that MLTS is more robust to using a small percentage of template representatives, and a selection percentage of 20% is a reasonable compromise between accuracy and computation and storage costs.

#### 5.3.6 Phone accuracy analysis

**Phone accuracies (percent) of vowels, semivowels, stops, fricatives, nasals, and silence**

| | Vowels | Semivowels | Stops | Fricatives | Nasals | Silence |
|---|---|---|---|---|---|---|
| HMM baseline | 63.48 | 72.47 | 73.65 | 74.83 | 72.03 | 86.02 |
| KL distance | 68.30 | 76.52 | 73.45 | 72.37 | 72.53 | 85.48 |
| LLR distance | 68.32 | 76.85 | 75.65 | 71.95 | 72.66 | 85.57 |

It is not surprising that the template-based methods produced the largest positive impact on semivowels (the largest relative phone error reduction). Semivowels are transient sounds, and templates can capture their trajectory information better than HMMs. Similarly, some vowel sounds are nonstationary, such as diphthongs or vowels in strong coarticulation. Stops, with their closure-and-burst pattern, are nonstationary as well and often have short durations; they are difficult to model by HMMs but can be better represented by templates, as reflected in the accuracy gain by the LLR-based template matching. Fricatives are noise-like and without clear trajectory patterns, and their boundaries are also difficult to determine, making template-based methods not as effective as HMMs.

#### 5.3.7 Computation time and memory overhead

We first compare the storage costs of the conventional and the proposed template representation methods, assuming a 39-dimensional speech feature vector as in the baseline HMMs. In conventional template methods that use the Mahalanobis local distance, a speech frame is represented by a 39-dimensional vector of floats, while in the proposed method a frame is labeled by *n* GMM indices (short integers) and *n* − 1 weights (floats). On a 32-bit machine and with *n* = 5 in our experiments, the proposed method used 26 bytes per frame (5 × 2 + (5 − 1) × 4) versus 156 bytes per frame for the conventional method, an 83% saving in storage space. For the TIMIT dataset, there were 152,715 phone templates with an average length of eight frames (at a frame shift of 10 ms), giving a total of 1,221,720 frames and a template storage overhead of 30.3 MB. In template selection, the memory overhead was around 6.1 MB when 20% of the templates were selected as representatives. In template compression, the memory overhead for template storage was the same as in template selection; however, since there were 749 new GMMs for labeling the frames of the template representatives, there was an extra memory overhead of 5.4 MB.
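The byte counts above can be checked with a few lines of arithmetic (Python used purely as a calculator; all figures come from the text, and byte sizes assume a 32-bit machine as stated):

```python
n = 5                                    # GMM indices per frame
per_frame_new = n * 2 + (n - 1) * 4      # 5 short ints + 4 float weights = 26 bytes
per_frame_old = 39 * 4                   # 39-dim float vector = 156 bytes
saving = 1 - per_frame_new / per_frame_old   # fraction of storage saved

frames = 152_715 * 8                     # templates x average length in frames
template_mb = frames * per_frame_new / 2**20   # all-template storage in MB
selected_mb = 0.20 * template_mb               # storage for 20% representatives
```

This reproduces the 26-byte and 156-byte per-frame figures, the roughly 83% saving, and the 30.3 MB / 6.1 MB overheads quoted above.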

**Computational overhead (percent) per frame using all templates, template selection, and template compression for TIMIT phone recognition**

| | All templates | Template selection | Template compression |
|---|---|---|---|
| Test frame labeling overhead | 40.0 | 40.0 | 22.4 |
| Rescoring overhead | 22.0 | 4.4 | 4.4 |
| Overall computational overhead | 62.0 | 44.4 | 26.8 |

### 5.4 Large vocabulary speech recognition task

For the telehealth LVCSR task, we evaluated three template-based cases against the HMM baseline: all templates with the LLR distance, MLTS with the KL distance, and template compression with the KL distance. We conducted the Student's *t* test on the word accuracy gain (averaged over the five doctors) obtained by each of the three cases over the baseline and found the performance gain in every case to be statistically significant at the level of α = 0.05.

**Comparison of word accuracies (percent) between the HMM baseline and the template-based methods**

| Speakers (no. of words) | Dr. 1 (3,248) | Dr. 2 (5,085) | Dr. 3 (3,988) | Dr. 4 (2,759) | Dr. 5 (6,421) | Average |
|---|---|---|---|---|---|---|
| Baseline | 72.14 | 82.50 | 84.00 | 74.20 | 79.32 | 78.43 |
| All templates (LLR) | 73.53 | 84.22 | 85.98 | 75.74 | 80.67 | 80.03 |
| MLTS (KL) | 73.22 | 83.39 | 84.87 | 75.35 | 80.15 | 79.40 |
| Template compression (KL) | 73.55 | 83.61 | 85.21 | 75.71 | 80.39 | 79.70 |
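As an illustration of such a significance test, a paired Student's *t* statistic can be computed over the five per-speaker accuracies in the table above for the all-templates case. Whether the paper's actual test was paired per speaker in exactly this way is our assumption; the sketch only shows the mechanics:

```python
import math

def paired_t(a, b):
    """Paired Student's t statistic for two matched samples."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

baseline      = [72.14, 82.50, 84.00, 74.20, 79.32]
all_templates = [73.53, 84.22, 85.98, 75.74, 80.67]
t_stat = paired_t(all_templates, baseline)
# with df = 4, the two-sided critical value at alpha = 0.05 is 2.776
significant = t_stat > 2.776
```

Because every doctor's accuracy improved by a similar margin, the paired statistic lands well above the critical value.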

**Average computation overhead (percent) per frame of the five doctors**

| | All templates (LLR) | MLTS (KL) | Template compression (KL) |
|---|---|---|---|
| Test frame labeling overhead | 43.3 | 43.3 | 23.3 |
| Rescoring overhead | 26.7 | 5.4 | 5.4 |
| Overall computational overhead | 70.0 | 48.7 | 28.7 |

### 5.5 Discussion

So far we have shown that representing the template frames by GMMs and using the local distance measures of LLR or KL significantly improved the accuracy performance over our HMM baselines, and that the proposed methods are much more effective than conventional template matching methods in which the template frames use the original speech features. A question that naturally arises is how the proposed template-matching methods would interact with the underlying acoustic model from which the GMMs are derived and the phone or word lattices are generated; of particular interest is whether the performance gain we have observed from the proposed template matching methods still holds as a baseline HMM system improves. This is a relevant issue since a baseline HMM system can be improved by using more advanced training methods and better features. Recently, a major advance has been made in using deep neural networks (DNNs) with many hidden layers for speech acoustic modeling, where the resulting DNNs learn a hierarchy of nonlinear feature detectors that can capture complex statistical patterns in speech data. For example, context-independent, pre-trained DNN/HMM hybrid architectures have achieved competitive performance in TIMIT phone recognition [38], context-dependent DNN/HMM has led to large improvements on several public domain large speech recognition tasks [39], and using features extracted from deep convolutional neural networks to train GMM/HMM-based systems has achieved higher accuracy than DNN/HMM hybrid architectures in several tasks [40].

We have investigated this issue in [41] on the TIMIT phone recognition task by performing lattice rescoring with the proposed template-matching methods on top of progressively better HMM baselines, where the test set was the same as discussed in Section 5.1. The HMM baseline systems employed discriminative training, neural-network-derived phone posterior probability features, and ensemble acoustic models, among other techniques. We observed that as the baseline phone accuracy rose to 73.25%, 75.66%, 76.51%, and 77.97%, the template-matching-based lattice rescoring delivered consistent performance gains and gave phone accuracies of 74.74%, 77.27%, 77.96%, and 79.55%, respectively, where 79.55% was among the best reported results on the TIMIT continuous phoneme recognition task. For the sake of space, we omit the details of these baseline systems; for further information, please refer to [41]. The consistent performance gains support the notion that template matching improves recognition accuracy through a mechanism different from HMM. This agrees with the observation in [10] that the template matching system and the HMM system differ in their word error patterns. Since our template-based methods are compatible with the GMMs trained from neural-network-derived features, it is reasonable to expect that our methods can take advantage of and add value to the advancements in this research direction.

## 6 Conclusions

In this paper, we have presented a novel approach of integrating template matching with statistical modeling for continuous speech recognition. The approach inherits the GMMs and the PDT state-tying structures from the baseline HMMs and is therefore easy to implement. Generating template representatives and representing the frames by GMM indices make the approach extendable to LVCSR tasks. Based on our experimental results from the tasks of TIMIT phone recognition and telehealth LVCSR, we conclude that the proposed method of integrating template matching and statistical modeling significantly improved the recognition performance over our HMM baselines, and that the proposed template selection and compression methods also largely saved computation time and memory space over using all templates, with only small losses in accuracy. Although in the current work we used basic acoustic modeling techniques to train our HMM baselines, the proposed template matching methods can take advantage of and add value to more advanced GMM/HMM systems, and as such they are promising for further improving state-of-the-art speech recognition.

## Declarations

### Disclosures

The work described in this paper was conducted during the first author’s Ph.D. study at the University of Missouri-Columbia, USA.

## References

1. Ostendorf M, Digalakis V, Kimball OA: From HMMs to segment models: a unified view of stochastic modeling for speech recognition. *IEEE Trans on SAP* 1996, 4(5):360-378.
2. Furui S: Speaker-independent isolated word recognition using dynamic features of speech spectrum. *IEEE Trans SAP* 1986, ASSP-34(1):52-59.
3. Gish H, Ng K: Parametric trajectory models for speech recognition. In *Proceedings of ICSLP, vol. 1*. 1996:466-469.
4. Glass J: A probabilistic framework for segment-based speech recognition. *Computer Speech and Language* 2003, 17(2–3):137-152.
5. Zweig G, Nguyen P: A segmental CRF approach to large vocabulary continuous speech recognition. In *IEEE Workshop on Automatic Speech Recognition & Understanding*. Merano; 2009:152-157.
6. Deng L, Yu D, Acero A: A long-contextual-span model of resonance dynamics for speech recognition: parameter learning and recognizer evaluation. In *Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding*. San Juan; 2005:145-150.
7. Demuynck K, Seppi D, van Hamme H, van Compernolle D: Progress in example based automatic speech recognition. In *Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. Prague; 2011:4692-4695.
8. Sainath TN, Ramabhadran B, Nahamoo D, Kanevsky D, van Compernolle D, Demuynck K, Gemmeke JF, Bellegarda JR, Sundaram S: Exemplar-based processing for speech recognition. *IEEE Signal Process. Mag* 2012, 29:98-113.
9. Sainath TN, Ramabhadran B, Nahamoo D, Kanevsky D, Sethy A: Exemplar-based sparse representation features for speech recognition. In *Proceedings of INTERSPEECH 2010*. Makuhari; 2010:2254-2257.
10. de Wachter M, Matton M, Demuynck K, Wambacq P, Cools R, van Compernolle D: Template-based continuous speech recognition. *IEEE Trans ASLP* 2007, 15(4):1377-1390.
11. Sun X, Zhao Y: Integrate template matching and statistical modeling for speech recognition. In *Proceedings of INTERSPEECH 2010*. Makuhari; 2010:74-77.
12. Rabiner L, Juang B: *Fundamentals of Speech Recognition*. Englewood Cliffs: Prentice Hall; 1993.
13. Demange S, van Compernolle D: HEAR: an hybrid episodic-abstract speech recognizer. In *Proceedings of INTERSPEECH 2009*. Brighton; 2009:3067-3070.
14. Golipour L, O’Shaughnessy D: Phoneme classification and lattice rescoring based on a k-NN approach. In *Proceedings of INTERSPEECH 2010*. Makuhari; 2010:1954-1957.
15. Demuynck K, Seppi D, van Compernolle D, Nguyen P, Zweig G: Integrating meta-information into exemplar-based speech recognition with segmental conditional random fields. In *Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. Prague; 2011:5048-5051.
16. Sundaram S, Bellegarda JR: Latent perceptual mapping: a new acoustic modeling framework for speech recognition. In *Proceedings of INTERSPEECH 2010*. Makuhari; 2010:881-884.
17. Sun X, Zhao Y: New methods for template selection and compression in continuous speech recognition. In *Proceedings of INTERSPEECH 2011*. Florence; 2011:985-988.
18. Seppi D, Demuynck K, van Compernolle D: Template-based automatic speech recognition meets prosody. In *Proceedings of INTERSPEECH 2011*. Florence; 2011:545-548.
19. Seppi D, van Compernolle D: Data pruning for template-based automatic speech recognition. In *Proceedings of INTERSPEECH 2010*. Makuhari; 2010:901-904.
20. Sundaram S, Bellegarda J: Latent perceptual mapping with data driven variable-length acoustic units for template-based speech recognition. In *Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. Kyoto; 2012:4125-4128.
21. Heigold G, Nguyen P, Weintraub M, Vanhoucke V: Investigations on exemplar-based features for speech recognition towards thousands of hours of unsupervised, noisy data. In *Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. Kyoto; 2012:4437-4440.
22. Ming J: Maximizing the continuity in segmentation - a new approach to model, segment and recognize speech. In *Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. Taiwan; 2009:3849-3852.
23. Ming J, Srinivasan R, Crookes D, Jafari A: CLOSE - a data-driven approach to speech separation. *IEEE Trans ASLP* 2013, 21(7):1355-1368.
24. Garcia A, Gish H: Keyword spotting of arbitrary words using minimal speech resources. In *Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1*. Atlanta; 2006:123-127.
25. Hazen T, Shen W, White C: Query-by-example spoken term detection using phonetic posteriorgram templates. In *IEEE Workshop on Automatic Speech Recognition & Understanding*. Merano; 2009.
26. Zhang Y, Glass J: Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In *IEEE Workshop on Automatic Speech Recognition & Understanding*. Merano; 2009.
27. Lamel L, Kassel R, Seneff S: Speech database development: design and analysis of the acoustic-phonetic corpus. In *Proceedings of the DARPA Speech Recognition Workshop*. 1989.
28. Zhao Y, Zhang X, Hu R, Xue J, Li X, Che L, Hu R, Schopp L: An automatic captioning system for telemedicine. In *Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. Toulouse; 2006:957-960.
29. Kullback S: Letter to the editor: the Kullback–Leibler distance. *Am. Stat* 1987, 41(4):338-341.
30. Hershey JR, Olsen PA: Approximating the Kullback–Leibler divergence between Gaussian mixture models. In *Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4*. Hawaii; 2007:317-320.
31. Young S, Evermann G, Gales M, Hain T, Kershaw D, Liu X, Moore G, Odell J, Ollason D, Valtchev V, Woodland P: *The HTK Book*. Cambridge: Cambridge University Engineering Department; 2009.
32. Duda R, Hart P, Stork D: *Pattern Classification*. 2nd edition. New York: Wiley; 2001.
33. Theodoridis S, Koutroumbas K: *Pattern Recognition*. 3rd edition. San Diego: Academic Press; 2006.
34. Sankar A, Beaufays F, Digalakis V: Training data clustering for improved speech recognition. In *Proceedings of EUROSPEECH*. Madrid; 1995.
35. Li Y, Li L: A greedy merge learning algorithm for Gaussian mixture model. In *Third International Symposium on IITA, vol. 2*. Nanchang; 2009:506-509.
36. Lee KF, Hon HW: Speaker-independent phone recognition using hidden Markov models. *IEEE Trans ASSP* 1989, 37(11):1641-1648. doi:10.1109/29.46546
37. Zhang X, Zhao Y, Schopp L: A novel method of language modeling for automatic captioning in telemedicine. *IEEE Trans ITB* 2007, 11(3):332-337.
38. Mohamed A, Dahl G, Hinton G: Acoustic modeling using deep belief networks. *IEEE Trans ASLP* 2012, 20(1):14-22.
39. Seide F, Li G, Chen X, Yu D: Feature engineering in context-dependent deep neural networks for conversational speech transcription. In *Proceedings of IEEE Workshop on Automatic Speech Recognition & Understanding*. Hawaii; 2011:24-29.
40. Sainath TN, Mohamed A, Kingsbury B, Ramabhadran B: Deep convolutional neural networks for LVCSR. In *Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. Vancouver; 2013:8614-8618.
41. Sun X, Chen X, Zhao Y: On the effectiveness of statistical modeling based template matching approach for continuous speech recognition. In *Proceedings of INTERSPEECH*. Florence; 2011:2163-2166.

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.