Spoken term detection ALBAYZIN 2014 evaluation: overview, systems, results, and discussion

Tejedor, Javier; Toledano, Doroteo T.; Lopez-Otero, Paula; Docio-Fernandez, Laura; Garcia-Mateo, Carmen; Cardenal, Antonio; Echeverry-Correa, Julian David; Coucheiro-Limeres, Alejandro; Olcoz, Julia; Miguel, Antonio

doi:10.1186/s13636-015-0063-8

Research
Open access
Published: 07 August 2015

EURASIP Journal on Audio, Speech, and Music Processing volume 2015, Article number: 21 (2015)

Abstract

Spoken term detection (STD) aims at retrieving data from a speech repository given a textual representation of the search term. Nowadays, it is receiving much interest due to the large volume of multimedia information. STD differs from automatic speech recognition (ASR) in that ASR is interested in all the terms/words that appear in the speech data, whereas STD focuses on a selected list of search terms that must be detected within the speech data. This paper presents the systems submitted to the STD ALBAYZIN 2014 evaluation, held as a part of the ALBAYZIN 2014 evaluation campaign within the context of the IberSPEECH 2014 conference. This is the first STD evaluation that deals with the Spanish language. The evaluation consists of retrieving the speech files that contain the search terms, indicating their start and end times within the appropriate speech file, along with a score value that reflects the confidence given to the detection of the search term. The evaluation is conducted on a Spanish spontaneous speech database, which comprises a set of talks from workshops and amounts to about 7 h of speech. We present the database, the evaluation metrics, the systems submitted to the evaluation, the results, and a detailed discussion. Four different research groups took part in the evaluation. Evaluation results show reasonable performance for a moderate out-of-vocabulary term rate. This paper compares the systems submitted to the evaluation and provides an in-depth analysis based on some search term properties (term length, in-vocabulary/out-of-vocabulary terms, single-word/multi-word terms, and in-language/foreign terms).

1 Introduction

The enormous amount of information stored in audio and audiovisual repositories promotes the development of efficient methods that aim at retrieving the stored information. For audio content search, significant research has been conducted in spoken document retrieval (SDR), keyword spotting, spoken term detection (STD), and query-by-example. Spoken term detection aims at finding a list of terms (composed of individual words or multiple words) within audio archives, and has been receiving much interest for years from the likes of IBM [1–3], BBN [4], SRI and OGI [5–7], BUT [8–10], Microsoft [11], QUT [12, 13], JHU [14–16], Fraunhofer IAIS/NTNU/TUD [17], NTU [18, 19], IDIAP [20], etc. In addition, several evaluations including SDR, STD, and query-by-example STD have been recently proposed [21–31].

Given the increasing interest in STD evaluations around the world, we organized an international evaluation of STD in the context of the ALBAYZIN 2014 evaluation campaign. This campaign is an internationally open set of evaluations supported by the Spanish Network of Speech Technologies (RTTH [32]) and the ISCA Special Interest Group on Iberian Languages (SIG-IL [33]), and it has been held every 2 years since 2006. The evaluation campaigns provide an objective mechanism to compare different systems and are a powerful way to promote research on different speech technologies (e.g., speech segmentation [34], speaker diarization [35], language recognition [36], query-by-example spoken term detection [37], and speech synthesis [38] in the ALBAYZIN 2010 and 2012 evaluation campaigns). This year, the campaign has been held during the IberSPEECH 2014 conference [39].

1.1 Introduction to spoken term detection technology

Spoken term detection relies on a text-based input, commonly the orthographic transcription of the search term. Spoken term detection systems are typically composed of two main stages: indexing by an automatic speech recognition (ASR) subsystem, and then search by a detection subsystem, as depicted in Fig. 1. The ASR subsystem decodes the input speech signal in terms of word/subword lattices. The detection subsystem integrates a term detector and a decision maker. The term detector searches for putative detections of the terms in the word/subword lattices. The decision maker decides whether each detection is reliable enough to be considered as a hit or should be rejected as a false alarm (FA). Finally, a tool provided by the National Institute of Standards and Technology (NIST) is commonly used for performance evaluation [40].

There are two main approaches to STD: the word-based approach [6, 41–45] that searches for terms in the output of a large vocabulary continuous speech recognition (LVCSR) system, and the subword-based approach which searches for subword representations of search terms within the output of a subword speech recognition system. The word-based STD approach typically obtains better performance than the subword-based approach thanks to the lexical information it employs. However, the subword-based approach has the unique advantage that it can detect terms that consist of words that are not in the recognizer’s vocabulary — out-of-vocabulary (OOV) terms — whereas the word-based approach can only detect in-vocabulary (INV) terms. Several subword unit types have been employed in the subword-based approach, including word fragments [46], particles [47, 48], acoustic words [49], graphones [6, 7], multigrams [9, 50], syllables [51–53], and graphemes [54], although phonemes are the most commonly used due to their simplicity and natural relationship with spoken languages [41, 55–59]. In order to exploit the relative advantages of the word and phoneme-based approaches, it has been proposed to combine these two approaches by using the word-based approach to detect INV terms and the subword-based approach to detect OOV terms, e.g., [41, 56, 60–64]. A hybrid approach that fuses word and subword lattices and then searches for both INV terms and OOV terms in the hybrid lattices has also been proposed [11, 65]. Another hybrid approach uses word/subword mixed lexica and language models to generate hybrid lattices [7, 10, 66]. A recent hybrid approach employs word confusion networks (WCNs) during ASR decoding and next incorporates a probabilistic phonetic retrieval (PPR) framework to deal with OOV terms [67]. 
The Kaldi STD system [68–70] employs a word-based approach for term detection and a method based on proxy words (i.e., replacing the OOV term by the most similar in-vocabulary term or terms) to detect OOV terms [71].

1.2 Spoken term detection under the IARPA BABEL program and Open KWS

Significant research has been conducted on STD under the IARPA BABEL program [72]. This program started in 2011 and aims at developing fully automatic and noise-robust speech recognition systems in limited time (e.g., 1 week) and with a limited amount of transcribed training data, so that they can be applied to any language in order to process massive amounts of speech data recorded in challenging real-world situations. Spoken term detection fits well within the scope of this program, which includes keyword search algorithms and low-resource languages within its research areas. This program supports research in the following languages, corresponding to base period, option period 1, and option period 2 releases: Cantonese, Pashto, Tagalog, Turkish, Vietnamese, Assamese, Bengali, Haitian Creole, Lao, Zulu, Tamil, Kurmanji Kurdish, Tok Pisin, Cebuano, Kazakh, Telugu, Lithuanian, and Swahili. Since 2013, NIST has been organizing an annual open STD evaluation called NIST Open Keyword Search (KWS), which is closely related to the BABEL program but open to other research groups besides BABEL participants (more information in “Comparison to other evaluations” section). In this section, we review some relevant results that have arisen from research in this framework, which focuses on OOV term detection, score normalization, and system combination.

The work presented in [73] focused on OOV term detection from different recognition units (word, syllable, and word fragment) and two search strategies (whole unit fuzzy search and phone fuzzy search) from the lattices obtained during the ASR process. For the phone fuzzy search, each recognition unit is first split into phones. Experimental results showed that (1) phone-based search outperformed the whole unit-based search for OOV terms, and whole-word search performed the best for INV terms; (2) the syllable models outperformed the word fragment models for the phone search; and (3) system combination from different recognition units and search strategies performed better than each individual system.

Wang and Metze [74] focused on score normalization and proposed a term-specific threshold that uses the confidence scores assigned to all the detections of the given term to compute the final score for each detection.

Karakos et al. [75] presented a new score normalization approach based on the combination of an unsupervised linear fit method and a supervised linear model method (Powell’s method [76]) from several input features such as posterior probability, keyword length, false alarm probability, etc.

Chiu and Rudnicky [77] proposed a score normalization based on word burst (i.e., words of interest that occur near each other in the speech content) by penalizing the term detections that do not occur near other detections of the same term.

Deep neural networks (DNNs) as input for a Hidden Markov Model (HMM)-Gaussian Mixture Model (GMM) classifier have also shown their potential [78–81].

Language-independent and unsupervised training-based approaches have also been considered within this program aiming at building a system for an unknown language [82]. The limited data corresponding to some languages covered in the program (Cantonese, Pashto, Turkish, Tagalog, Vietnamese, Assamese, Bengali, Haitian Creole, Lao, and Zulu) were used for system training. The system is based on multi-lingual bottle-neck DNNs and Hidden Markov Model Toolkit (HTK) [83] for training and decoding and the IBM keyword search system for term detection [84]. Results showed that INV term performance is good for languages (e.g., Haitian Creole) whose phonetic structure is similar to that of the languages used for system training.

Various subword unit types (syllable, phone, grapheme, and automatically discovered) were investigated in [85] in the framework of lattice- and consensus network-based exact match term detection. Experimental results showed that (1) the automatically discovered units performed the best in isolation, (2) the combination of all the subword unit types for detection fusion significantly outperformed each subword unit type, and (3) fusion of the phone- and grapheme-based systems performed better than each individual system.

Lee et al. [86] investigated graph-based re-ranking techniques for scoring detection in STD systems for low-resource languages (Assamese, Bengali, and Lao). A node in the graph represents a hypothesized region of the given term, and connections are created from acoustically similar hypothesized regions. The STD system is based on fuzzy matching and different word/subword units (word, syllable, morpheme, and phoneme).

Ma et al. [87] proposed a combined approach for detection re-scoring that linearly interpolates a rule-based detection re-scoring system, a logistic regression-based detection re-scoring system, and a rank learning-based detection re-scoring system. Detection re-scoring is based on word-burst features (e.g., number, strength, and proximity of neighboring hypotheses), consensus network features (e.g., posterior probability, number of hit arcs, and average number of arcs per bin), and acoustic features (e.g., pitch, number of unvoiced frames, and jitter).

Chiu et al. [88] proposed combining finite state transducer- and confusion network-based STD systems from DNN, bottle-neck, and perceptual linear prediction (PLP) acoustic features.

A novel two-stage discriminative score normalization method was presented in [89]. The term detector employed word lattices obtained from an LVCSR system to output term detections. In the first stage, the discriminative score normalization method relies on a multi-layer perceptron (MLP)-based confidence measure computed from two novel features. The first feature is the ranking score, computed as the rank of the posterior probability of the detection among the posterior probabilities of all the arcs in the lattice where the detection resides. The second feature is the relative posterior probability of the detection with respect to the maximum posterior probability among the arcs in the lattice where the detection resides. In the second stage, the new confidence score is passed to an ATWV-oriented score normalization, which optimizes the final score for the evaluation metric.
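The two lattice-derived features described above can be illustrated with a minimal sketch. It assumes that the posterior probabilities of all the arcs in the lattice are available as a plain list; the function names and the tie-handling rule are hypothetical, and the exact definitions used in [89] may differ.

```python
def ranking_score(detection_posterior, arc_posteriors):
    """Rank of the detection among all lattice arcs (1 = highest posterior).

    Assumption: ties share the better rank, i.e. only arcs with a strictly
    higher posterior push the detection down the ranking.
    """
    return 1 + sum(1 for p in arc_posteriors if p > detection_posterior)


def relative_posterior(detection_posterior, arc_posteriors):
    """Detection posterior divided by the largest arc posterior in the lattice."""
    best = max(arc_posteriors)
    if best <= 0.0:
        raise ValueError("lattice has no arc with positive posterior")
    return detection_posterior / best


# Example: a detection with posterior 0.5 in a lattice with three arcs.
arcs = [0.9, 0.5, 0.1]
features = (ranking_score(0.5, arcs), relative_posterior(0.5, arcs))
```

In a full system, such features would be fed to the MLP-based confidence estimator; that step is not sketched here.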

Wegmann et al. [90] presented a system where detections of several ASR systems were combined. ASR systems were built from HTK [83] and Kaldi [68] tools and employed PLP and bottle-neck acoustic features. More interestingly, this work also analyzed the ATWV performance using two different approaches. The first approach consisted of setting the optimal threshold for each term from the ground-truth information. This analysis showed that there are important performance gaps in ATWV due to the thresholding algorithm employed, suggesting that a better threshold selection will produce significant performance gains. The second approach is based on bootstrapping techniques to show the ATWV results of randomly selected groups of terms. The different distribution of the ATWV performance across the different term groups showed that ATWV heavily depends on the selected terms, and even that small changes in the ASR system accuracy can cause large changes in the STD performance.

Several score normalization and system combination approaches were presented in [91]. Score normalization was based on term-dependent thresholding, rank normalization and mapping back to posteriors, sum-to-one normalization, and machine learning. The term-dependent thresholding simply re-scores the detection by considering the confidence scores of all the detections of the given term in the ATWV formulation. The rank normalization uses the false alarm rate for the given term as the score normalization value for each term detection. The mapping back to posteriors approach relies on the average posteriors of the detections of all the terms, except the one being detected, that are ranked in the same position within the detection list for the given term. The sum-to-one approach normalizes the score of the detection by the sum of all the scores of the detections of the given term. The machine learning approach is based on a linear model that applies Powell’s method [76] to maximize ATWV performance from several input features (e.g., rank normalization, mapping back to posteriors, term length, etc.). System combination merged the detections from different STD systems that rely on different approaches (e.g., GMM-based and DNN-based HMMs) and combined the detection scores with Powell’s method.
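Two of the normalization ideas summarized above can be sketched in a few lines. The sum-to-one function follows the description directly. The term-dependent threshold is an assumption: it uses the expected-value form that follows from the ATWV formulation when the scores are treated as posteriors and their sum is taken as an estimate of the number of true occurrences. It is not confirmed to be the exact procedure of [91], and the function names are hypothetical.

```python
def sum_to_one(scores):
    """Divide each detection score of a term by the sum of that term's scores."""
    total = sum(scores)
    if total <= 0.0:
        return [0.0 for _ in scores]
    return [s / total for s in scores]


def term_threshold(scores, audio_seconds, beta=999.9):
    """Term-specific decision threshold derived from the ATWV formulation.

    Assumption: scores behave like posteriors, so their sum estimates N_true.
    Accepting a detection with posterior p has positive expected value when
    p / N_true > beta * (1 - p) / (T - N_true), which rearranges to
    p > beta * N_true / (T + (beta - 1) * N_true).
    """
    n_true_est = sum(scores)
    return beta * n_true_est / (audio_seconds + (beta - 1.0) * n_true_est)


# Example: two detections of one term in one hour of audio.
scores = [0.5, 0.5]
normalized = sum_to_one(scores)
threshold = term_threshold(scores, audio_seconds=3600.0)
```

Under this formulation, the threshold grows with the estimated number of occurrences, so frequent terms need higher confidence per detection than rare terms.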

Su et al. [53] proposed syllable-weighted finite state transducer (WFST) for speech indexing and direct search on syllable- and word-WFST for term detection. The word-WFST is obtained by syllable-to-word mapping from the original syllable-WFST. Experiments showed that the system combination from word- and syllable-WFST at detection level significantly outperforms each individual system.

Chen et al. [92] presented a novel subword unit-based approach that focused on pronunciation prediction. To get the optimal set of subword units, the pronunciation prediction is first based on syllables, which are then converted to more specific subword units (similar to morphemes), according to the lexicon segmentation that obtains the highest language model score for each pronunciation in the lexicon. For OOV term detection, the phoneme transcription of the terms is obtained with the sequitur grapheme-to-phoneme tool [93] and is then mapped to subword units. The novel subword approach outperformed word-, syllable-, and phoneme-based units. In addition, system combination from word, novel subword, syllable, and phoneme units showed significant performance gains over each individual system.

Trmal et al. [94] proposed system combination from different ASR systems that employ different configurations in terms of acoustic features and acoustic models (e.g., subspace GMMs (SGMMs), DNNs, and bottle-neck features). The Kaldi STD system [68–70] was used for term detection in all the systems. A syllable-based lexicon expansion was proposed for OOV keyword search. Point process models (PPMs) were also employed for keyword search. These are based on whole-word, event-based acoustic modeling and phone-based search [95, 96]. Since they are phone based, OOV term detection is not an issue for the PPM-based STD systems. Experimental results showed that (1) the combination of PPM-based STD and Kaldi-based STD effectively improved the STD performance, and (2) the lexicon expansion generally improves the system performance.

1.3 Keyword spotting under the DARPA RATS program

The DARPA Robust Automatic Transcription of Speech (RATS) program also includes keyword spotting within its research areas. Unlike the BABEL program, the DARPA RATS program mainly focuses on speech recognition over highly noisy communication channels, where speech signals of less than 10 dB are typically specified. Two main languages have been employed in this program: Levantine Arabic and Farsi. For these languages, significant research has also been carried out in keyword spotting. In this section, we summarize the most significant research in this program, which mainly focuses on score normalization and system combination.

A keyword spotting system was presented in [97] with a score normalization approach based on the false alarm probability of the given term. In addition, a white list-based approach in the ASR system was also presented. This approach modifies the beam pruning produced at recognition, by keeping alive (using a wider beam) those states that form a detection of a term in the white list. Since the white list contains all the search terms, all the term detections are very unlikely to be pruned.

The system presented in [98] also used the white list approach presented in [97] and focused on system combination from word lattices and phone confusion networks. Both word lattices and phone confusion networks were generated from different ASR systems that employed different configurations (Mel-frequency cepstral coefficient (MFCC), PLP, GMM, SGMM, etc.). Detections of the different ASR systems were combined using logistic regression.

Deep neural networks have also been employed for developing keyword spotting systems under the DARPA RATS program [99]. In this work, several word- and subword-based systems were combined with the system combination approach presented in [91]. A similar work based on DNNs, GMMs, and convolutional neural networks (CNNs) for acoustic modeling and various signal processing features (standard cepstral and filter-bank features, noise-robust features, and MLP features) was presented in [100]. This system employs word- and phone-based ASR systems to produce a set of term detections that are then fused with the logistic regression-based approach presented in [98].

Mangu et al. [101] employed CNNs, DNNs, and GMMs for acoustic modeling, audio segmentation based on GMMs and DNNs, word lattices as ASR output, phone-WFST for keyword search, and system combination. System combination took the output of the different ASR systems and merged the detections of all the systems. Detection scores are normalized by the sum of all the scores of the detections of the given term. Experimental results showed that (1) CNNs perform very well for keyword search, (2) audio segmentation plays a very important role in keyword search, and (3) system combination yields significant performance gains.

Seigel et al. [102] employed a system combination approach based on word and grapheme ASR. Word- and grapheme-based lattices are first produced and then used for term search. Conditional random field (CRF) models are used for detection scoring in a discriminative confidence scoring framework. The input features to the CRF are related to the lattice information, contextual posterior features, and unigram prior features.

Mitra et al. [103] focused on system combination from word lattices. The word lattices are obtained from different GMM-HMM speech recognition systems that employ different sets of acoustic features (e.g., PLP, normalized modulation cepstral coefficients, and modulation of medium duration speech amplitude), along with various feature combination and dimensionality reduction techniques (principal component analysis, heteroscedastic linear discriminant analysis, and nonlinear autoencoder network). Experiments showed that the feature combination (prior combination) and the detection combination from individual ASR systems (posterior combination) yield significant performance gains.

The rest of the paper is organized as follows: the next section presents the STD evaluation and includes an evaluation description, the metric used, the database released for experimentation, a comparison with previous evaluations, and the participants involved in the evaluation. Next, we present the different systems submitted to the evaluation. Results along with discussion are presented in a separate section, and finally conclusions are presented.

2 Spoken term detection evaluation

2.1 STD evaluation overview

This evaluation involves searching a list of terms within speech content. Therefore, the evaluation is designed for research groups working on speech indexing and retrieval, as well as on speech recognition. Specifically, the STD evaluation focuses on retrieving the audio files that contain any of those terms, along with the occurrences and their timestamps.

The evaluation consists of searching a training/development term list within training/development speech data and searching a test term list within test speech data. The evaluation result ranking is based on the system performance when searching the test terms within test speech data. Participants can use the training/development data for system training and tuning, but any additional data can also be employed.

Participants could submit a primary system and up to four contrastive systems. All the systems must be fully automatic: no manual intervention is allowed in generating the final output file. Listening to the test data, or any other human interaction with the test data, is forbidden until the evaluation results on the test data (i.e., the evaluation result ranking) have been sent back to the participants. The standard XML-based format of the NIST STD 2006 evaluation [22] has been used for building the system output file.

2.2 Evaluation metric

In STD, a hypothesized occurrence is called a detection; if the detection corresponds to an actual occurrence, it is called a hit, otherwise it is a false alarm. If an actual occurrence is not detected, this is called a miss. The ATWV proposed by NIST [22] has been used as the main metric for the evaluation. This metric integrates the hit rate and false alarm rate of each term into a single metric and then averages over all the terms:

$$ \text{ATWV}=\frac{1}{|\Delta|}\sum_{K \in \Delta}{\left(\frac{N^{K}_{\text{hit}}}{N^{K}_{\text{true}}} - \beta \frac{N^{K}_{\text{FA}}}{T-N^{K}_{\text{true}}}\right)}, \tag{1} $$

where Δ denotes the set of terms and |Δ| is the number of terms in this set. $N^{K}_{\text {hit}}$ and $N^{K}_{\text {FA}}$ represent the numbers of hits and false alarms of term K, respectively, and $N^{K}_{\text {true}}$ is the number of actual occurrences of K in the audio. T denotes the audio length in seconds, and β is a weight factor set to 999.9, as in the ATWV proposed by NIST [4]. This weight factor places more emphasis on recall than on precision, in a ratio of 10:1.
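As an illustration of Eq. 1, the following minimal sketch computes ATWV from per-term counts. The `TermCounts` container and the function name are hypothetical, and skipping terms without any actual occurrence is an assumption made here because the hit-rate term is undefined for them. The official scores in the evaluation were computed with the NIST tool, not with this code.

```python
from dataclasses import dataclass


@dataclass
class TermCounts:
    """Per-term scoring counts (hypothetical container, not from the paper)."""
    n_hit: int   # detections that correspond to an actual occurrence
    n_fa: int    # detections that correspond to no actual occurrence
    n_true: int  # actual occurrences of the term in the audio


def atwv(counts, audio_seconds, beta=999.9):
    """Average over terms of (hit rate - beta * false alarm rate), as in Eq. 1."""
    # Assumption: terms with N_true = 0 are left out of the average.
    scored = [c for c in counts.values() if c.n_true > 0]
    if not scored:
        raise ValueError("no term has a reference occurrence")
    total = 0.0
    for c in scored:
        hit_rate = c.n_hit / c.n_true
        fa_rate = c.n_fa / (audio_seconds - c.n_true)
        total += hit_rate - beta * fa_rate
    return total / len(scored)


# Example: two terms searched in one hour of audio.
example = {
    "term_a": TermCounts(n_hit=8, n_fa=2, n_true=10),
    "term_b": TermCounts(n_hit=3, n_fa=0, n_true=4),
}
value = atwv(example, audio_seconds=3600.0)
```

A system that finds every occurrence with no false alarms obtains 1, a system that outputs nothing obtains 0, and a system with many false alarms can obtain a negative value.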

ATWV represents the TWV for the threshold set by the STD system (usually tuned on development data). An additional metric, called maximum term weighted value (MTWV) [22] can also be used to evaluate the performance of an STD system. This MTWV is the maximum TWV achieved by the STD system for all possible thresholds and hence does not depend on the tuned threshold. Therefore, this MTWV represents an upper-bound of the performance obtained by the STD system. Results based on this metric are also presented to evaluate the system performance with respect to threshold selection.
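The relation between ATWV and MTWV can be sketched as a threshold sweep. The data layout (a list of score and hit-flag pairs per term) and the function names are hypothetical, and the sketch assumes that each detection has already been labeled as a hit or a false alarm, with each hit matching a distinct occurrence.

```python
def twv(detections, n_true, audio_seconds, threshold, beta=999.9):
    """TWV when only detections with score >= threshold are accepted.

    detections: {term: [(score, is_hit), ...]}  (hypothetical layout)
    n_true:     {term: number of actual occurrences}
    """
    values = []
    for term, true_count in n_true.items():
        if true_count == 0:
            continue  # assumption: terms without occurrences are not scored
        accepted = [is_hit for score, is_hit in detections.get(term, [])
                    if score >= threshold]
        n_hit = sum(accepted)
        n_fa = len(accepted) - n_hit
        values.append(n_hit / true_count
                      - beta * n_fa / (audio_seconds - true_count))
    if not values:
        raise ValueError("no term has a reference occurrence")
    return sum(values) / len(values)


def mtwv(detections, n_true, audio_seconds, beta=999.9):
    """Maximum TWV over all thresholds, and a threshold that attains it."""
    # TWV only changes at observed scores; +inf covers "accept nothing".
    candidates = {score for dets in detections.values() for score, _ in dets}
    candidates.add(float("inf"))
    best = max(candidates,
               key=lambda thr: twv(detections, n_true, audio_seconds, thr, beta))
    return twv(detections, n_true, audio_seconds, best, beta), best


# Example: one true detection and one false alarm for a single term.
dets = {"term_a": [(0.9, True), (0.4, False)]}
refs = {"term_a": 1}
best_value, best_threshold = mtwv(dets, refs, audio_seconds=3600.0)
tuned_value = twv(dets, refs, audio_seconds=3600.0, threshold=0.3)
```

In this example, the sweep finds the threshold that rejects the false alarm, whereas a threshold of 0.3 tuned elsewhere accepts it and yields a lower TWV. The gap between the two values reflects the loss due to threshold selection.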

In addition to ATWV and MTWV, NIST also proposed a detection error tradeoff (DET) curve [104] to evaluate the performance of an STD system working at various miss/FA ratios. Although DET curves were not used for the evaluation itself, they are also presented in this paper for system comparison.

The NIST STD evaluation tool [40] was employed to compute MTWV, ATWV, and DET curves.

Additionally, precision, recall, and F-measure values are also presented in this paper to evaluate system performance. Whereas the original ATWV metric proposed by NIST gives more emphasis to recall than to precision (in other words, a miss is more costly than a false alarm), F-measure assigns the same cost to precision and recall values. Therefore, F-measure allows us to compare the system performance in a different way. However, it must be noted that the systems submitted to the evaluation were tuned and optimized towards ATWV.
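For reference, precision, recall, and the balanced F-measure can be computed from hit, false alarm, and occurrence counts as sketched below. The function name is hypothetical, and pooling the counts over all terms before computing the ratios is an assumption; the text does not state whether the values are pooled or averaged per term.

```python
def precision_recall_f(n_hit, n_fa, n_true):
    """Precision, recall, and balanced F-measure from detection counts.

    n_hit:  detections that correspond to an actual occurrence
    n_fa:   detections that correspond to no actual occurrence
    n_true: actual occurrences in the audio
    """
    n_detected = n_hit + n_fa
    precision = n_hit / n_detected if n_detected else 0.0
    recall = n_hit / n_true if n_true else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean: precision and recall carry the same weight.
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Example: 8 hits and 2 false alarms against 16 actual occurrences.
p, r, f = precision_recall_f(n_hit=8, n_fa=2, n_true=16)
```

Unlike ATWV, this measure does not depend on the audio length T or on β, so a false alarm and a miss affect it in a comparable way.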

2.3 Database

The database used for the evaluation consists of a set of talks extracted from the MAVIR workshops [105] held in 2006, 2007, and 2008 (corpus MAVIR 2006, 2007, and 2008) that contain speakers from Spain and Latin America (henceforth MAVIR corpus or database). The MAVIR corpus contains 3 recordings in English and 10 recordings in Spanish, but only the recordings in Spanish were used for the evaluation.

The MAVIR Spanish data consist of spontaneous speech files, each containing different speakers, which amount to about 7 h of speech and are further divided for the purpose of this evaluation into training/development and test sets. There are 20 male and 3 female speakers in the MAVIR Spanish database. The data were also manually annotated in an orthographic form, but timestamps were only set for phrase boundaries. To prepare the data for the evaluation, we manually added the timestamps for the roughly 6000 occurrences of spoken terms used in the training/development and test evaluation sets.

The speech data were originally recorded in several audio formats (PCM mono and stereo, MP3, 22.05 kHz, 48 kHz, etc.). All data were converted to PCM, 16 kHz, single channel, 16 bits per sample using the SoX tool [106]. Recordings were made with the same equipment, a Digital TASCAM DAT model DA-P1, except for one recording. Different microphones were used for the different recordings. They mainly consisted of tabletop or floor standing microphones, but in one case a lavalier microphone was used. The distance from the mouth of the speaker to the microphone varies and was not particularly controlled, but in most cases the distance was smaller than 50 cm. All the recordings contain real, spontaneous speech from the MAVIR workshops in a real setting. Thus, the recordings were made in large conference rooms with capacity for over a hundred people and with a large number of people present. This poses additional challenges including background noise (particularly babble noise) and reverberation. The realistic settings and the different nature of the spontaneous speech in this database make it appealing and challenging enough for our evaluation. Table 1 includes some database features such as the number of word occurrences, duration, and signal-to-noise ratio (SNR) [107] of each speech file in the MAVIR Spanish database.
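The format conversion described above could be reproduced with a SoX call along the following lines. This is a sketch under assumptions: the exact command used to prepare the database is not given in the text, the file names are placeholders, and reading MP3 input requires a SoX build with MP3 support.

```python
import subprocess


def sox_convert_command(src, dst, rate_hz=16000, channels=1, bits=16):
    """Build a SoX command that writes signed PCM at the target format.

    The options placed before the output file apply to the output:
    -r sample rate, -c number of channels, -b bits per sample, -e encoding.
    """
    return [
        "sox", src,
        "-r", str(rate_hz),
        "-c", str(channels),
        "-b", str(bits),
        "-e", "signed-integer",
        dst,
    ]


def convert(src, dst):
    """Run the conversion; raises CalledProcessError if SoX fails."""
    subprocess.run(sox_convert_command(src, dst), check=True)


# Example (placeholder file names): 16 kHz, mono, 16-bit PCM output.
command = sox_convert_command("talk_original.mp3", "talk_16k.wav")
```

Keeping the command construction separate from its execution makes it possible to inspect or log the exact call before any file is touched.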

Table 1 MAVIR database characteristics. “train/dev” stands for training/development, “occ.” stands for occurrences, “min” stands for minutes, “SNR” for signal-to-noise ratio, and “dB” for decibels
