Search on speech from spoken queries: the Multi-domain International ALBAYZIN 2018 Query-by-Example Spoken Term Detection Evaluation

Abstract

The huge amount of information stored in audio and video repositories makes search on speech (SoS) a priority area nowadays. Within SoS, Query-by-Example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given a spoken query. Research in this area is continuously fostered through the organization of QbE STD evaluations. This paper presents a multi-domain, internationally open evaluation for QbE STD in Spanish. The evaluation aims at retrieving the speech files that contain the queries, providing their start and end times, along with a score that reflects the confidence given to the detection. Three different Spanish speech databases that encompass different domains have been employed in the evaluation: MAVIR database, which comprises a set of talks from workshops; RTVE database, which includes broadcast television (TV) shows; and COREMAH database, which contains spontaneous two-person conversations about different topics. The evaluation has been designed carefully so that several analyses of the main results can be carried out. We present the evaluation itself, the three databases, the evaluation metrics, the systems submitted to the evaluation, the results, and the detailed post-evaluation analyses based on some query properties (in-vocabulary/out-of-vocabulary queries, single-word/multi-word queries, and native/foreign queries). Fusion results of the primary systems submitted to the evaluation are also presented. Three different teams took part in the evaluation, and ten different systems were submitted. The results suggest that the QbE STD task remains challenging and that the performance of these systems is highly sensitive to changes in the data domain. Nevertheless, QbE STD strategies are able to outperform text-based STD in unseen data domains.

Introduction

The huge amount of information stored in audio and audiovisual repositories makes it necessary to develop efficient methods for search on speech (SoS). Significant research has been carried out in this area, spanning spoken document retrieval (SDR) [1–6], keyword spotting (KWS) [7–12], spoken term detection (STD) [13–18], and Query-by-Example Spoken Term Detection (QbE STD) [19–26] tasks.

STD aims to find terms within audio archives. It is based on a text-based input, commonly the word/phone transcription of the search term, and hence, STD is also called text-based STD. Query-by-Example Spoken Term Detection also aims to search within audio archives but is based on an acoustic (spoken) input. This is a highly valuable alternative for visually impaired people or when using devices that lack a text-based input (such as smart speakers), where the query must be given in another format such as speech.

STD systems typically comprise three different stages: (1) the audio is decoded into word/subword lattices using an automatic speech recognition (ASR) subsystem trained for the target language; (2) a term detection subsystem searches the terms within those word/subword lattices to hypothesize detections; and (3) confidence measures are computed to rank detections. The STD systems are normally language-dependent and require large amounts of resources to be built.

On the other hand, QbE STD has been traditionally addressed using three different approaches: methods based on the word/subword transcription of the query, methods based on template matching of features, and hybrid approaches. These approaches are described below.

Methods based on the word/subword transcription of the spoken query

In these methods, first, the spoken query is decoded using an ASR system, and then a text-based STD approach is employed to hypothesize detections. The errors produced in the transcription of the query can lead to significant performance degradation. In [21] and [27], the authors employ a Viterbi-based search on hidden Markov models (HMMs). In other works [19, 28–30], dynamic time warping (DTW) or variants of DTW (e.g., non-segmental dynamic time warping (NS-DTW)) are applied to align phone sequences. More sophisticated approaches [20, 31–33] employ word and syllable speech recognizers. In [34], the authors employ a phone-based speech recognizer and weighted finite state transducer (WFST)-based search, whereas in [35], they apply multilingual phone-based speech recognition from supervised and unsupervised acoustic models and sequential dynamic time warping for search. The works [36–38] propose the discovery of unsupervised acoustic features (e.g., bottleneck features) and unsupervised acoustic units for query/utterance representation, and [39] and the work by (Lopez-Otero et al.: Probabilistic information retrieval models for query-by-example spoken document retrieval, submitted to Multimed. Tools Appl.) make use of information retrieval models for QbE STD employing ASR.

Methods based on template matching

In these methods, sequences of feature vectors are extracted from both the input spoken queries and the utterances, which are then used in the search stage to hypothesize detections. Regarding the features used for query/utterance representation, Gaussian posteriorgrams are employed in [22, 29, 40, 41]; an i-vector-based approach for feature extraction is proposed in [42]; phone log-likelihood ratio-based features are used in [43]; posteriorgrams derived from various unsupervised tokenizers, supervised tokenizers, and semi-supervised tokenizers are employed in [44]; and posteriorgrams derived from a Gaussian mixture model (GMM) tokenizer, phoneme recognition, and acoustic segment modeling are used in [45]. Phoneme posteriorgrams have been widely used [34, 41, 46–54] and bottleneck features as well [34, 55–60]. Posteriorgrams from non-parametric Bayesian models are used in [61], articulatory class-based posteriorgrams are employed in [62], intrinsic spectral analysis is proposed in [63], unsupervised segment-based bag of acoustic words is employed in [64], and [65] is based on the sparse subspace modeling of posteriorgrams. An exhaustive feature set is proposed in [66], which includes Mel-frequency cepstral coefficients (MFCCs), spectral entropy, and fundamental frequency, among others.

All these studies employ the standard DTW algorithm for query search, except for [40], which employs the NS-DTW algorithm; [41, 50, 51, 53, 56, 59, 61, 66], which employ the subsequence DTW (S-DTW) algorithm; [22], which presents a variant of the S-DTW algorithm; and [52], which employs the segmental DTW algorithm. An interesting alternative is [54], which proposes hashing of the phone posteriors to speed up search and to enable searching on massively large datasets.

These template matching-based methods were found to outperform subword transcription-based techniques in QbE STD [67] and can be effectively employed to build language-independent STD systems, since prior knowledge of the language involved in the speech data is not necessary.

Hybrid methods

These methods take advantage of both the text-based STD approach and the approaches based on template matching by combining them to hypothesize detections. A powerful way of enhancing the performance relies on building hybrid (fused) systems that combine the two individual methods. Logistic regression-based fusion of acoustic keyword spotting and DTW-based systems using language-dependent phoneme recognizers is presented in [68–70]. An information retrieval technique to hypothesize detections along with DTW-based detection scoring is proposed in [39]. Logistic regression-based fusion of DTW- and phone-based systems is employed in [71–74]. DTW-based search at the HMM state level from syllables obtained from a word-based speech recognizer and a deep neural network (DNN) posteriorgram-based rescoring are employed in [75], and [76] adds a logistic regression-based approach for detection rescoring. Finally, [77] employs a syllable-based speech recognizer and dynamic programming at the triphone state level to output detections, together with DNN posteriorgram-based rescoring.

Methods

Research carried out in a certain area may be difficult to compare in the absence of a common evaluation framework. In QbE STD, research also suffers from this issue since the published systems typically employ different acoustic databases and different lists of queries that make system comparison impossible. In this context, international evaluations provide a unique framework to measure the progress of any technology, such as QbE STD in this case.

ALBAYZIN evaluation campaigns comprise an internationally open set of evaluations supported by the Spanish Thematic Network on Speech Technologies (RTTH) and the ISCA Special Interest Group on Iberian Languages (SIG-IL), which have been held biennially since 2006. These evaluation campaigns provide an objective mechanism to compare different systems and are a powerful way to promote research on different speech technologies [78–87].

Spanish is a major language in the world, and significant research has been conducted on it for ASR, KWS, and STD tasks [88–94]. The increasing interest in SoS around the world and the lack of SoS evaluations dealing with Spanish encouraged us to organize a series of QbE STD evaluations starting in 2012 and held biennially until 2018, aiming to evaluate the progress in this technology for Spanish. Each evaluation has been extended by incorporating new challenges. The main novelty of the fourth ALBAYZIN QbE STD evaluation is the addition of a new data domain, namely broadcast television (TV) shows, with the inclusion of shows from the Spanish public television Radio Televisión Española (RTVE). In addition, a novel conversational speech database has also been used to assess the validity of the submitted systems in an unseen data domain. Moreover, the queries used in one of the databases (MAVIR) in the ALBAYZIN 2016 QbE STD evaluation were kept to enable a straightforward comparison of the systems submitted to both evaluations.

The main objectives of this evaluation can be summarized as follows:

  • Organize the first Spanish multi-domain QbE STD evaluation, in which systems are ranked according to their performance across different databases and domains

  • Provide an evaluation and benchmark with increasing complexity in the search queries compared to the previous ALBAYZIN QbE STD evaluations

This evaluation is suitable for research groups/companies that work in speech recognition.

This paper is organized as follows: First, Section 3 presents the evaluation and a comparison with other QbE STD evaluations. Then, Section 4 presents the different systems submitted to the evaluation, along with a text-based STD system. Evaluation results and discussion are given in Section 5, including paired t tests [95] as a statistical significance measure for system comparison. Section 6 presents a post-evaluation analysis based on some properties of the queries and the fusion of the primary systems submitted to the evaluation. The last section outlines the main conclusions of the paper.

ALBAYZIN 2018 QbE STD evaluation

Evaluation overview

This evaluation involves searching for queries given in spoken form within speech data, indicating, for each query, the audio files that contain it along with the corresponding occurrence timestamps.

The evaluation consists of searching for different query lists within different sets of speech data. The speech data comprise different domains (workshop talks, broadcast TV shows, and two-person conversations), for which individual datasets are given. The ranking of the evaluation results is based on the average system performance on the three datasets in the test experiments.

Two different types of queries are defined in this evaluation: in-vocabulary (INV) and out-of-vocabulary (OOV) queries. The OOV query set was defined to simulate the out-of-vocabulary words of a large vocabulary continuous speech recognition (LVCSR) system. If participants employed LVCSR to process the audio, these OOV words had to be removed from the system dictionary beforehand, and hence other methods had to be used to search for the OOV queries. The INV queries, on the other hand, could appear in the LVCSR system dictionary.

Participants could submit a primary system and up to four contrastive systems. No manual intervention was allowed for each developed system to generate the final output file, and hence, all the systems had to be fully automatic [96].

About 3 months were given to the participants for system development, and therefore, the evaluation focuses on building QbE STD systems in a limited period of time. The training, development, and test data were released to the participants at different times. Training and development data were released by the end of June 2018. The test data were released by the beginning of September 2018. The final system submission was due by mid-October 2018. Final results were discussed at the IberSPEECH 2018 conference by the end of November 2018.

Evaluation metrics

In QbE STD, a hypothesized occurrence is called a detection; if the detection corresponds to an actual occurrence, it is called a hit; otherwise it is called a false alarm (FA). If an actual occurrence is not detected, it is called a miss. The actual term-weighted value (ATWV) metric proposed by the National Institute of Standards and Technology (NIST) [96] has been used as the main metric for the evaluation. This metric integrates the hit rate and false alarm rate of each query into a single metric and then averages over all the queries:

$$ \text{ATWV}=\frac{1}{|\Delta|}\sum_{K \in \Delta}{\left(\frac{N^{K}_{\text{hit}}}{N^{K}_{\text{true}}} - \beta \frac{N^{K}_{\text{FA}}}{T-N^{K}_{\text{true}}}\right)}, $$
(1)

where Δ denotes the set of queries and |Δ| is the number of queries in this set. \(N^{K}_{\text{hit}}\) and \(N^{K}_{\text{FA}}\) represent the numbers of hits and false alarms of query K, respectively, and \(N^{K}_{\text{true}}\) is the number of actual occurrences of K in the audio. T denotes the audio length in seconds, and β is a weight factor set to 999.9 as in [97]. This weight factor places roughly ten times more emphasis on recall than on precision.
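As an illustration, Eq. 1 can be computed directly from per-query detection counts. The following sketch uses illustrative names (not from the NIST tooling) and assumes the hit, false-alarm, and true-occurrence counts have already been obtained:

```python
from collections import namedtuple

# Per-query detection counts; field names are illustrative, not from the paper.
QueryStats = namedtuple("QueryStats", ["n_hit", "n_fa", "n_true"])

def atwv(stats, audio_seconds, beta=999.9):
    """Eq. 1: average over queries of the hit rate minus the
    beta-weighted false-alarm rate."""
    total = 0.0
    for s in stats:
        p_hit = s.n_hit / s.n_true
        p_fa = s.n_fa / (audio_seconds - s.n_true)
        total += p_hit - beta * p_fa
    return total / len(stats)
```

In practice, the official NIST STD evaluation tool was used to compute this metric, as noted below.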

ATWV represents the term-weighted value (TWV) for a threshold given by the QbE STD system (usually tuned on development data). An additional metric, called maximum term-weighted value (MTWV) [96], can also be used to evaluate the performance of a QbE STD system. MTWV is the maximum TWV obtained by the QbE STD system for all possible thresholds, and hence does not depend on the tuned threshold. Therefore, MTWV represents an upper bound of the performance obtained by the QbE STD system. Results based on this metric are also presented to evaluate the system performance regardless of the decision threshold.
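MTWV can be obtained by sweeping the decision threshold over the observed detection scores and keeping the best TWV. A minimal sketch, assuming each detection is a (score, is-hit) pair grouped by query (the data layout is an assumption for illustration):

```python
def twv(detections, n_true, audio_seconds, threshold, beta=999.9):
    """TWV at one threshold. detections: query -> list of (score, is_hit);
    n_true: query -> number of actual occurrences of that query."""
    value = 0.0
    for q, dets in detections.items():
        hits = sum(1 for score, is_hit in dets if is_hit and score >= threshold)
        fas = sum(1 for score, is_hit in dets if not is_hit and score >= threshold)
        value += hits / n_true[q] - beta * fas / (audio_seconds - n_true[q])
    return value / len(detections)

def mtwv(detections, n_true, audio_seconds, beta=999.9):
    # Only thresholds equal to observed scores can change the decision set,
    # so sweeping the distinct scores is enough to find the maximum TWV.
    scores = {s for dets in detections.values() for s, _ in dets}
    return max(twv(detections, n_true, audio_seconds, t, beta) for t in scores)
```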

In addition to ATWV and MTWV, NIST also proposed a detection error tradeoff (DET) curve [98] to evaluate the performance of a QbE STD system working at various miss/FA ratios. Although DET curves were not used for the evaluation itself, they are also presented in this paper for system comparison.

In this work, the NIST STD evaluation tool [99] was employed to compute MTWV, ATWV, and DET curves.

Databases

Three different databases that comprise different acoustic conditions and domains have been employed for the evaluation: (1) MAVIR database, which was employed in all the previous ALBAYZIN QbE STD evaluations and is used for comparison purposes; (2) RTVE database, which consists of different programs recorded from the Spanish public television (Radio Televisión Española) and involves different broadcast TV shows; and (3) COREMAH database, which contains conversational speech with two speakers per recording. For the MAVIR and RTVE databases, three separate datasets (i.e., training, development, and test) were provided to the participants. For the COREMAH database, only test data were provided, which allowed measuring the generalization capability of the systems in an unseen data domain. Tables 1, 2, and 3 include some database features such as the division of the speech files into training, development, and test data; the number of word occurrences; duration; the number of speakers; and the average mean opinion score (MOS) [100] as an indication of the quality of each speech file in the different databases.

Table 1 Characteristics of the MAVIR database. Number of word occurrences (#occ.), duration (dur.) in minutes (min), number of speakers (#spk.), and average MOS (Ave. MOS)
Table 2 Characteristics of the RTVE database. Number of word occurrences (#occ.), duration (dur.) in minutes (min), number of speakers (#spk.), and average MOS (Ave. MOS)
Table 3 Characteristics of the COREMAH database (only for testing). Number of word occurrences (#occ.), duration (dur.) in seconds (sec), number of speakers (#spk.), and average MOS (Ave. MOS)

MAVIR

MAVIR database consists of a set of Spanish talks extracted from the MAVIR workshops held in 2006, 2007, and 2008.

The MAVIR Spanish data consist of spontaneous speech files from different speakers from Spain and Latin America, amounting to about 7 h of speech. These data were divided for the purpose of this evaluation into training, development, and test sets. The data were also manually annotated in an orthographic form, but timestamps were only set for phrase boundaries. To prepare the data for the evaluation, the organizers manually added the timestamps for the roughly 1600 occurrences of the spoken queries used in the development and test evaluation sets. The training data were made available to the participants, including the orthographic transcription and the timestamps for phrase boundaries.

The speech data were originally recorded in several audio formats (pulse-code modulation (PCM) mono and stereo, MP3, 22.05 kHz, 48 kHz, among others). The recordings were converted to PCM, 16 kHz, single channel, 16 bits per sample using the SoX tool. All the recordings except one were made with the same equipment, a TASCAM DA-P1 digital DAT recorder. Different microphones were used, mainly tabletop or floor-standing microphones, although in one case a lavalier microphone was used. The distance from the speaker's mouth to the microphone varied and was not controlled at all, but it was smaller than 50 cm in most cases. The recordings were made in large conference rooms with capacity for over a hundred people, with a large number of people present. This poses additional challenges, including background noise (particularly babble noise) and reverberation.

RTVE

RTVE database belongs to the broadcast TV domain and contains speech from different TV shows recorded from 2015 to 2018 (Millenium, La tarde en 24H, Comando actualidad, and España en comunidad, to name a few). These comprise about 570 h in total, which were further divided into training, development, and test sets for the purpose of this evaluation. To prepare the data for the evaluation, the organizers manually added the timestamps for the roughly 1400 occurrences of the spoken queries used in the development and test evaluation sets. The training data were available to participants with the corresponding subtitles (note that subtitles are not literal transcriptions of the speech data). The development data were further divided into two different sets: the dev1 dataset, which consists of about 60 h of speech material with human-revised word transcriptions without time alignment, and the dev2 dataset, the one actually employed as development data for the QbE STD evaluation, which consists of 15 h of speech data. The recordings were provided in Advanced Audio Coding (AAC) format, stereo, 44.1 kHz, with variable bit rate. As far as we know, this database represents the largest speech database employed in any SoS evaluation in the Spanish language. More information about the RTVE database can be found in [101].

COREMAH

COREMAH database contains conversations about different topics such as rejection, compliment, and apology. It was recorded in 2014 and 2015 in a university environment [102]. This database contains about 45 min of speech data from speakers with different levels of fluency in Spanish (native, intermediate B1, and advanced C1). Since the main purpose of this database is to evaluate the submitted systems in an unseen data domain, only the recordings of native Spanish speakers are employed in the evaluation, in order to recreate the same conditions as the other databases. To prepare the data for the evaluation, the organizers manually added the timestamps for the roughly 850 occurrences of the spoken queries used in the test evaluation set.

The original recordings are videos in Moving Picture Experts Group (MPEG) format. The audio of these videos was extracted and converted to PCM, 16 kHz, single channel, and 16 bits per sample using the ffmpeg tool. It is worth mentioning the large degree of overlapped speech in the recordings, which makes this database very challenging for the evaluation.

Query list selection

The selection of queries for the development and test sets aimed to build a realistic scenario for QbE STD by including high-occurrence queries, low-occurrence queries, in-language (INL) (i.e., Spanish) queries, out-of-language (OOL) (i.e., foreign) queries, single-word and multi-word queries, in-vocabulary and out-of-vocabulary queries, and queries of different lengths. A query may have no occurrences or appear one or more times in the development/test speech data. Table 4 presents some relevant features of the development and test query lists, such as the number of INL and OOL queries, the number of single-word and multi-word queries, and the number of INV and OOV queries, along with the number of occurrences of each type in the corresponding dataset. It must be noted that a multi-word query is considered OOV if any of the words that form it is OOV.

Table 4 Characteristics of the lists of development and test queries for MAVIR, RTVE, and COREMAH databases

Comparison to other QbE STD international evaluations

The QbE STD evaluations that are the most similar to ALBAYZIN are MediaEval 2011 [103], 2012 [104], and 2013 [105] Spoken Web Search (SWS) evaluations. However, these evaluations differ in several aspects:

  • The most important difference is the nature of the audio content used for the evaluations. In the SWS evaluations, the speech was typically telephone speech, either conversational or read and elicited speech, or speech recorded with in-room microphones. In the ALBAYZIN QbE STD evaluations, the audio consisted of microphone recordings of real talks in workshops that took place in large conference rooms in the presence of an audience. In addition, ALBAYZIN 2018 QbE STD evaluation also contains live-talking conversational speech and broadcast TV shows and explicitly defines different in-vocabulary and out-of-vocabulary query sets.

  • SWS evaluations dealt with Indian- and African-derived languages, as well as Albanian, Basque, Czech, non-native English, Romanian, and Slovak, while the ALBAYZIN QbE STD evaluations deal only with Spanish.

These differences make it difficult to compare the results obtained in ALBAYZIN and SWS QbE STD evaluations.

In 2014, the Query-by-Example Search on Speech Task (QUESST) held at MediaEval differed from the previous evaluations in that it was a spoken document retrieval task (i.e., no query timestamps had to be output) [106]. In 2015, the QUESST task was similar to that of 2014, but the acoustic conditions of the speech data were much more complicated (e.g., reverberation, different kinds of noise), and there were different types of queries (exact queries, queries with lexical variations, and queries with changes in the word order, to name a few) [107].

In addition to the MediaEval evaluations, other QbE STD evaluations were organized with the NTCIR-11 [108] and NTCIR-12 [109] conferences. The data used in these evaluations contained spontaneous speech in Japanese provided by the National Institute for Japanese Language, and spontaneous speech recorded during seven editions of the Spoken Document Processing Workshop. As additional information, these evaluations provided the participants with the results of a voice activity detection (VAD) system for the speech data, the manual transcription of the speech data, and the output of an LVCSR system. Although ALBAYZIN QbE STD evaluations are somehow similar in terms of speech nature to the NTCIR QbE STD evaluations (i.e., the speech was recorded in real workshops), ALBAYZIN QbE STD evaluations make use of a different language and define disjoint development and test query lists to measure the generalization capability of the systems.

Table 5 summarizes the main characteristics of SWS, NTCIR, and ALBAYZIN QbE STD evaluations.

Table 5 Comparison of the different QbE STD evaluations

Systems

Three teams submitted ten different systems to ALBAYZIN 2018 QbE STD evaluation, as listed in Table 6. The systems belong to three of the categories described above: text-based STD, template matching, and hybrid systems.

Table 6 Participants in ALBAYZIN 2018 QbE STD evaluation along with the submitted systems

A-Hybrid DTW+LVCSR system

This system (Fig. 1) consists of the fusion of four different QbE STD systems. Three of them are based on DTW, and the other on LVCSR.

Fig. 1
figure1

Architecture of A-Hybrid DTW+LVCSR system. “transcr.” denotes transcription

Feature extraction in DTW-based systems

Each DTW-based system employs a different speech representation:

  • Phoneme posteriorgrams [67], which represent the probability of each phonetic unit at every time instant. The English phone decoder developed by the Brno University of Technology (BUT) [110] is used to obtain phoneme posteriorgrams, and then a Gaussian softening is applied in order to have Gaussian-distributed probabilities [111].

  • Low-level descriptors (Table 7), obtained using the OpenSMILE feature extraction toolkit [112], are extracted every 10 ms using a 25-ms window, except for F0, probability of voicing, jitter, shimmer, and harmonics-to-noise ratio (HNR), for which a 60-ms window is used. These features are augmented with their delta coefficients.

  • Gaussian posteriorgrams [113], which represent the probability of each Gaussian in a GMM at every time instant. Feature extraction and Gaussian posteriorgram computation are performed using the Kaldi toolkit [114]. The GMM is trained on MAVIR and RTVE training data as well as RTVE dev1 data, using 19 MFCCs plus energy, and their delta and delta-delta coefficients.

Table 7 Acoustic features used in the A-Hybrid DTW+LVCSR QbE STD system

Query detection

From each feature set described above, a search procedure is followed to hypothesize query detections. The search is based on the S-DTW algorithm [115], a variant of the standard DTW search. In S-DTW, a cost matrix \(M \in \mathbb{R}^{n \times m}\) must first be defined, in which the rows and the columns correspond to the frames of the query (Q) and the utterance (U), respectively:

$$\begin{array}{@{}rcl@{}} M_{i,j} = \left\{ \begin{array}{lcl} c(q_{i},u_{j}) &\mbox{ if }& i = 0 \\ c(q_{i},u_{j}) + M_{i-1,0} &\mbox{ if }& i > 0,\; j = 0 \\ c(q_{i},u_{j}) + M^{*}(i,j) &\mbox{ otherwise}, &\\ \end{array} \right. \end{array} $$
(2)

where c(qi,uj) is a function that defines the cost between the query vector qi and the utterance vector uj, and

$$ M^{*}(i,j) = \text{min}\left(M_{i-1,j},M_{i-1,j-1},M_{i,j-1}\right), $$
(3)

which implies that only horizontal, vertical, and diagonal path movements are allowed.

Pearson’s correlation coefficient r [116] is used as a cost function by mapping it into the interval [0,1] through the following transformation:

$$ c(q_{i},u_{j}) = \frac{1-r(q_{i},u_{j})}{2}. $$
(4)

Once the matrix M is computed, the end of the best warping path between Q and U is obtained as follows:

$$ b^{*} = \underset{b \in \{1,\ldots,m\}}{\text{argmin}}\, M(n,b). $$
(5)

The starting point of the path ending at b*, namely a*, is computed by backtracking, hence obtaining the best warping path.

A query Q may appear several times in an utterance U, especially if U is a long recording. Therefore, not only must the best warping path be detected, but also other, less likely ones. One approach to this issue consists in detecting a given number of candidate matches \(n_{c}\): every time a warping path that ends at frame b is detected, M(n,b) is set to \(\infty\) to ignore this element in the future.
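The search described by Eqs. 2, 3, and 5, together with the Pearson-based cost of Eq. 4 and the masking of already-detected end frames, can be sketched as follows. This is a simplified illustration, not the participants' implementation:

```python
import numpy as np

def pearson_cost(Q, U):
    # Eq. 4: c(q_i, u_j) = (1 - r(q_i, u_j)) / 2, r = Pearson correlation
    Qc = Q - Q.mean(axis=1, keepdims=True)
    Uc = U - U.mean(axis=1, keepdims=True)
    num = Qc @ Uc.T
    den = np.sqrt((Qc ** 2).sum(axis=1))[:, None] * np.sqrt((Uc ** 2).sum(axis=1))[None, :]
    return (1.0 - num / den) / 2.0

def sdtw_search(Q, U, n_candidates=3):
    """S-DTW (Eqs. 2-3): rows are query frames, columns are utterance frames;
    row 0 carries no accumulated cost, so a match may start anywhere in U."""
    C = pearson_cost(Q, U)
    n, m = C.shape
    M = np.zeros((n, m))
    back = np.zeros((n, m), dtype=int)  # 0: left, 1: diagonal, 2: up
    M[0] = C[0]
    for i in range(1, n):
        M[i, 0] = C[i, 0] + M[i - 1, 0]
        back[i, 0] = 2
        for j in range(1, m):
            moves = (M[i, j - 1], M[i - 1, j - 1], M[i - 1, j])
            k = int(np.argmin(moves))
            M[i, j] = C[i, j] + moves[k]
            back[i, j] = k
    # Eq. 5 plus masking: repeatedly pick the best end frame and backtrack
    last = M[-1].copy()
    hits = []
    for _ in range(min(n_candidates, m)):
        b = int(np.argmin(last))
        i, j = n - 1, b
        while i > 0:  # row 0 has no predecessor, so the path starts there
            k = back[i, j]
            if k == 0:
                j -= 1
            elif k == 1:
                i -= 1
                j -= 1
            else:
                i -= 1
        hits.append((j, b, float(last[b])))
        last[b] = np.inf  # ignore this end frame in later iterations
    return hits
```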

A confidence score must be assigned to every detection of a query Q in an utterance U. Firstly, the cumulative cost of the warping path \(M_{n,b^{*}}\) is length-normalized [68], and then z-norm is applied so that the confidence scores of all the queries have the same distribution [70].
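This scoring step can be sketched as follows, assuming the path costs and path lengths of all detections are available. A single global z-norm is shown for simplicity; per-query statistics could be used instead:

```python
import numpy as np

def normalize_scores(path_costs, path_lengths):
    """Length-normalize each warping-path cost, negate so that higher means
    more confident, then z-normalize across all detections."""
    s = -np.asarray(path_costs, dtype=float) / np.asarray(path_lengths, dtype=float)
    return (s - s.mean()) / s.std()
```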

LVCSR-based QbE STD

This strategy follows a probabilistic retrieval model for information retrieval [117] that is applied in this evaluation for the QbE STD task. This model consists of the following stages:

  • Indexing: A DNN-based LVCSR system built with the Kaldi toolkit [114] is employed. The utterances are converted into phone-level n-best lists so as to store 50 alternative phone transcriptions for each utterance. These are then indexed in terms of phone n-grams of different sizes [39, 118]. The minimum and maximum sizes of the n-grams are set to 1 and 5, respectively, according to [39]. With respect to the probabilistic retrieval model, each utterance is represented by means of a language model (LM) [117]. The start time and duration of each phone are also stored in the index.

  • Search: The DNN-based LVCSR system is employed to obtain the word transcription of each query. This transcription is then converted to a phone transcription using the dictionary created with the Cotovia software [119] and searched within the different indices. Finally, a score for each utterance is computed following the query likelihood retrieval model [120]. It must be noted that this model sorts the utterances according to how likely they are to contain the query, but the start and end times of each match are required in this task. To obtain these times, the phone transcription of query Q is aligned to that of utterance U by computing the minimum edit distance MED(Q,U). This allows the recovery of the start and end times, since they are stored in the index. In addition, the MED is used to penalize the score returned by the query likelihood retrieval model (Lopez-Otero et al.: Probabilistic information retrieval models for query-by-example spoken document retrieval, submitted to Multimed. Tools Appl.):

    $$ \text{score}(Q,U) = \text{score}_{\text{LM}}(Q,U)\cdot \text{score}_{\text{MED}}(Q,U), $$
    (6)

    where \(\text{score}_{\text{MED}}(Q,U)\) is a score between 0 and 1 derived from MED(Q,U) and computed as:

    $$ \text{score}_{\text{MED}}(Q,U) = \frac{n_{Q}-\text{MED}(Q,U)}{K}, $$
    (7)

    where \(n_{Q}\) is the number of phonemes of the query, and K is the length of the best alignment path.

Indexing and search were performed using Lucene.
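Equation 7 requires both the minimum edit distance and the length K of the best alignment path. A possible dynamic-programming sketch (illustrative only, with plain string phone labels) is:

```python
def med_alignment(q, u):
    """Minimum edit distance between two phone sequences, together with the
    length of the best alignment path (matches + substitutions + ins + del)."""
    n, m = len(q), len(u)
    # D[i][j] = (edit distance, alignment length) for q[:i] vs u[:j]
    D = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = (i, i)
    for j in range(1, m + 1):
        D[0][j] = (j, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if q[i - 1] == u[j - 1] else 1
            D[i][j] = min(
                (D[i - 1][j - 1][0] + sub, D[i - 1][j - 1][1] + 1),  # match/sub
                (D[i - 1][j][0] + 1, D[i - 1][j][1] + 1),            # deletion
                (D[i][j - 1][0] + 1, D[i][j - 1][1] + 1),            # insertion
            )
    return D[n][m]

def score_med(q, u):
    # Eq. 7: score derived from the minimum edit distance
    med, k = med_alignment(q, u)
    return (len(q) - med) / k
```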

Calibration and fusion

Discriminative calibration and fusion [121] are applied in order to combine the outputs of the three DTW systems and that of the LVCSR system. The global minimum score produced by the system over all the queries is used to hypothesize the missing scores. After normalization, calibration and fusion parameters are estimated by logistic regression on the development datasets to obtain improved discriminative and well-calibrated scores [122]. Calibration and fusion training are performed using the Bosaris toolkit [123].
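A logistic-regression fusion of this kind can be sketched with scikit-learn; the actual system used the Bosaris toolkit, so the names and data layout below are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def impute_missing(scores, global_min):
    """Replace missing detections (NaN) with the system's global minimum score."""
    scores = np.asarray(scores, dtype=float)
    return np.where(np.isnan(scores), global_min, scores)

def fit_fusion(dev_scores, dev_labels):
    """dev_scores: (n_detections, n_systems) score matrix on development data;
    dev_labels: 1 for true occurrences (hits), 0 for false alarms."""
    model = LogisticRegression()
    model.fit(dev_scores, dev_labels)
    return model

def fuse(model, scores):
    # Fused log-odds, used as the final calibrated confidence score
    return model.decision_function(scores)
```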

The decision threshold, the weight of the LM in the DNN-based LVCSR system, and the number of n-best lists in the LVCSR-based QbE STD system are tuned using the combined ground truth labels of the MAVIR and RTVE development data. The rest of the parameters are set based on preliminary experiments.

B-Fusion DTW system

This system combines the DTW-based systems presented in the A-Hybrid DTW+LVCSR system.

C-Phoneme-posteriorgram DTW system (C-PhonePost DTW)

This system only employs DTW search on the phoneme posteriorgrams presented in the A-Hybrid DTW+LVCSR system, and hence does not make use of the calibration and fusion stage.

D-LVCSR system

This system only employs the LVCSR approach described in the A-Hybrid DTW+LVCSR system to hypothesize query detections.

E-DTW system

This system (Fig. 2) integrates two different stages: feature extraction and query detection, which are explained next.

Fig. 2
figure2

Architecture of E-DTW system

Feature extraction

The English phoneme recognizer developed by BUT [110] is employed to compute the phoneme posteriorgrams that represent both the queries and the utterances. These features are very similar to the posteriorgram features of the previous systems, except for the Gaussian softening stage.

Query detection

First, a cost matrix that stores the similarity between every query/utterance pair is computed. The Pearson correlation coefficient r(qn,um) [116] is employed to build the cost matrix, where qn represents the query phoneme posteriorgram frames and um represents the utterance phoneme posteriorgram frames.

The final cost used in the search stage is modified as follows: c(qn,um)=1−max(0,r(qn,um)). Therefore, for all Pearson correlation coefficient values less than or equal to 0, the cost is maximum. The S-DTW algorithm explained in the A-Hybrid DTW+LVCSR system is employed to hypothesize detections from this cost matrix. Finally, a neighborhood search is carried out so that all the paths (i.e., query detections) that overlap a previously obtained optimal path by more than 500 ms are rejected from the final system output.
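The cost-matrix computation described above can be sketched as follows. This is an illustrative pure-Python version; a real system would vectorize the computation over all frame pairs.

```python
# Sketch of the E-DTW cost matrix: c(q_n, u_m) = 1 - max(0, r(q_n, u_m)),
# with r the Pearson correlation between a query posteriorgram frame and
# an utterance posteriorgram frame. Negative correlations map to the
# maximum cost of 1.

from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def cost_matrix(query, utterance):
    """query, utterance: lists of posteriorgram frames (equal-length vectors)."""
    return [[1.0 - max(0.0, pearson(q, u)) for u in utterance] for q in query]
```

Identical frames thus get cost 0, and any uncorrelated or anti-correlated frame pair gets the maximum cost 1, which keeps the S-DTW accumulated cost bounded.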

Parameter tuning is carried out using MAVIR development data and then applied to the other datasets.

F-Combined DTW system

This system (Fig. 3) is based on the combination of different search processes, each employing a different feature set.

Fig. 3
figure3

Architecture of F-Combined DTW system

Voice activity detection

The spoken queries and the utterances are first processed with the VAD system developed by Google for the WebRTC project [124], which is based on Gaussian distributions of speech and non-speech features.

Feature extraction

The feature extraction module performs stacked bottleneck feature (sBNF) computation following the BUT/Phonexia approach [125], both for queries and utterances. To do so, three different neural networks are applied, each trained to classify a different set of acoustic units and later optimized for language recognition tasks. The first network is trained on telephone speech from the English Fisher corpus [126] with 120 monophone state targets, referred to as FisherMono. The second one is also trained on the Fisher corpus but with 2423 triphone tied-state targets and is referred to as FisherTri. The third network is trained on telephone speech in 17 languages taken from the Intelligence Advanced Research Projects Activity (IARPA) Babel program [127], with 3096 stacked monophone state targets for the 17 languages involved (BabelMulti for short). Given that the sBNF extractors are trained using 8 kHz speech signals, the queries and the utterances are downsampled to 8 kHz.

The architecture of the sBNF networks consists of two stages. The first one is a standard bottleneck network fed with low-level acoustic features spanning 10 frames (100 ms), the bottleneck size being 80. The second stage takes as input five equally spaced bottleneck features of the first stage, spanning 31 frames (310 ms), and is trained on the same targets as the first stage, with the same bottleneck size (80). The bottleneck features extracted from the second stage are known as stacked bottleneck features and comprise the output of the feature extraction module. Alternatively, instead of sBNFs, the extractor can output target posteriors.

The operation of BUT/Phonexia sBNF extractors requires an external VAD module providing speech/non-speech information. If no external VAD is provided, a simple energy-based VAD is computed internally. This system employs the WebRTC VAD module.
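The internal fallback is described only as "a simple energy-based VAD". The following is a minimal illustrative sketch of such a detector; the frame length, 8 kHz sample rate, and threshold are assumptions and do not reproduce the BUT/Phonexia implementation.

```python
# Minimal energy-based VAD sketch: a 10-ms frame (80 samples at 8 kHz)
# is marked as speech if its log energy is within |threshold_db| dB of
# the loudest frame in the signal. All parameters are illustrative.

import math

def energy_vad(samples, frame_len=80, threshold_db=-30.0):
    """Return a per-frame list of speech/non-speech booleans."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        e = sum(s * s for s in frame) / frame_len + 1e-12  # avoid log(0)
        energies.append(10.0 * math.log10(e))
    peak = max(energies)
    return [e > peak + threshold_db for e in energies]
```

A production VAD (such as the WebRTC module used here) additionally models speech and non-speech feature distributions instead of relying on a single energy threshold.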

The initial aim for the feature extraction stage was to employ the BUT/Phonexia posteriors, but the huge number of FisherTri (2423) and BabelMulti (3096) targets requires some kind of selection, clustering, or dimensionality reduction. Therefore, given that, at least theoretically, the same information is conveyed by sBNFs with a suitably low dimensionality (80 in this case), sBNFs are employed.

Dynamic time warping-based search

This system follows the DTW-based approach presented in [128]. Given the two sequences of sBNFs corresponding to a query and an utterance, a VAD system is used to discard non-speech frames while keeping the timestamp of each frame. To avoid memory issues, utterances are split into chunks of 5 min with a 5-s overlap and processed independently. This chunking process is key to the speed and feasibility of the search procedure.

Let Q=(q[1],q[2],…,q[m]) be the sequence of VAD-filtered sBNFs of length m corresponding to a query and U=(u[1],u[2],…,u[n]) be those of an utterance of length n. Since sBNFs (theoretically) range from −∞ to +∞, the distance between a pair of vectors q[i] and u[j] is defined as follows:

$$ d(q[i],u[j]) = -\log \left(1 + \frac{q[i] \cdot u[j]}{|q[i]| \cdot |u[j]|} \right) + \log 2. $$
(8)

Note that d(v,w)≥0, with d(v,w)=0 if and only if v and w are aligned and pointing in the same direction, and d(v,w)=+∞ if and only if v and w are aligned and pointing in opposite directions.

The distance matrix computed according to Eq. 8 is normalized with respect to the utterance U as follows:

$$ d_{\text{norm}}(q[i],u[j])=\frac{d(q[i],u[j])-d_{\text{min}}(i)}{d_{\text{max}}(i)-d_{\text{min}}(i)}, $$
(9)

where

$$\begin{array}{@{}rcl@{}} d_{\text{min}}(i)=\min\limits_{j=1,\ldots,n} d(q[i],u[j]) \end{array} $$
(10)
$$\begin{array}{@{}rcl@{}} d_{\text{max}}(i)=\max\limits_{j=1,\ldots,n} d(q[i],u[j]). \end{array} $$
(11)

In this way, matrix values are in the range [0,1], and a perfect match would produce a quasi-diagonal sequence of zeroes. This can be seen as test normalization since, given a query Q, distance matrices take values in the same range (and with the same relative meaning), no matter the acoustic conditions, the speaker, or other factors of the utterance U.
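Equations (8) through (11) can be sketched as follows. This is an illustrative pure-Python version; a guard is added for the degenerate case dmax(i)=dmin(i), which the equations leave undefined.

```python
# Sketch of Eqs. (8)-(11): cosine-based distance between sBNF vectors,
# followed by per-query-frame min-max normalization over the utterance.

import math

def d(q, u):
    """Eq. (8): -log(1 + cos(q, u)) + log 2; zero iff q and u point the same way."""
    dot = sum(a * b for a, b in zip(q, u))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in u))
    return -math.log(1.0 + dot / norm) + math.log(2.0)

def normalized_distance_matrix(Q, U):
    """Eq. (9): row-wise min-max normalization, values in [0, 1]."""
    raw = [[d(q, u) for u in U] for q in Q]
    out = []
    for row in raw:
        dmin, dmax = min(row), max(row)
        rng = (dmax - dmin) or 1.0  # guard for a constant row
        out.append([(v - dmin) / rng for v in row])
    return out
```

With this normalization, a perfect match traces a quasi-diagonal sequence of zeroes regardless of the absolute distance scale of the utterance, which is what makes it act as a test normalization.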

Note that the chunking process described above makes the normalization procedure differ from that applied in [128], since dmin(i) and dmax(i) are not computed for the whole utterance but for each chunk independently. On the other hand, considering chunks of 5 min might be beneficial, since normalization is performed in a more local fashion, that is, more suited to the speaker(s) and acoustic conditions of each particular chunk.

The best match of a query Q of length m in an utterance U of length n is defined as that minimizing the average distance in a crossing path of the matrix dnorm. A crossing path starts at any given frame of U, k1∈[1,n], then traverses a region of U which is optimally aligned to Q (involving L vector alignments), and ends at frame k2∈[k1,n]. The average distance in this crossing path is:

$$ d_{\text{avg}}(Q,U) = \frac{1}{L} \sum_{l=1}^{L} d_{\text{norm}}(q[i_{l}],u[j_{l}]), $$
(12)

where il and jl are the indices of the vectors of q and u in the alignment l, for l=1,2,…,L. Note that i1=1,iL=m,j1=k1, and jL=k2. The optimization procedure is O(n·m·d) in time, where d is the size of the feature vectors and O(n·m) in space. Readers are referred to [128] for more details.

The detection score is computed as 1−davg(Q,U), thus ranging from 0 to 1, being 1 only for a perfect match. The starting time and the duration of each detection are obtained by retrieving the time offsets corresponding to frames k1 and k2 in the VAD-filtered utterance.

This procedure is applied iteratively to find not only the best match, but also less likely matches in the same utterance. To that end, a queue of search intervals is defined and initialized with [1,n]. Given an interval [a,b] in which the best match is found at [k1,k2], the surrounding intervals [a,k1−1] and [k2+1,b] are added to the queue (for further processing) only if the following conditions are satisfied: (1) the score of the current match is greater than a given threshold T (T=0.85); (2) the interval is long enough (at least half the query length: m/2); and (3) the number of matches (those already found plus those waiting in the queue) does not exceed a given threshold M (M=7). An example is shown in Fig. 4. Finally, the list of matches for each query is ranked according to the scores and truncated to the N highest scores (N=1000, although this truncation was effectively applied only in a few cases).
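The iterative procedure can be sketched as the following control flow, where `best_match` stands in for the DTW search within one interval and returns a (score, k1, k2) triple or None. This is a hedged sketch; the exact bookkeeping of conditions (1) to (3) may differ in the actual system.

```python
# Control-flow sketch of the iterative match search. The default values
# T=0.85, M=7, and N=1000 follow the text; `best_match(a, b)` is an
# injected callable performing the DTW search in utterance interval [a, b].

from collections import deque

def iterative_search(best_match, n, query_len, T=0.85, M=7, N=1000):
    queue = deque([(1, n)])
    matches = []
    while queue:
        a, b = queue.popleft()
        found = best_match(a, b)
        if found is None:
            continue
        score, k1, k2 = found
        matches.append((score, k1, k2))
        # Expand only strong matches, and stop once enough matches exist.
        if score > T and len(matches) + len(queue) < M:
            for lo, hi in ((a, k1 - 1), (k2 + 1, b)):
                if hi - lo + 1 >= query_len // 2:   # interval long enough
                    queue.append((lo, hi))
    matches.sort(reverse=True)                      # rank by score
    return matches[:N]
```

The queue makes the search breadth-first over utterance regions, mirroring the example of Fig. 4 in which only the segments surrounding a strong match are re-searched.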

Fig. 4
figure4

Example of the iterative DTW procedure. (1) The best match of Q in u[1,n] is located in u[k1,k2]. (2) Since the score is greater than the established threshold T, the search continues in the surrounding segments u[1,k1−1], and u[k2+1,n]; (3) u[k2+1,n] is not searched, because it is too short. (4) The best match of Q in u[1,k1−1] is located in u[k3,k4]. (5) Its score is lower than T, so the surrounding segments u[1,k3−1] and u[k4+1,k1−1] are not searched. The search procedure outputs the segments u[k1,k2] and u[k3,k4]

Four different DTW-based searches are carried out. Three of them employ the three sBNF sets computed in the feature extraction module (FisherMono, denoted as sBNF1 in Fig. 3; FisherTri, denoted as sBNF2 in Fig. 3; and BabelMulti, denoted as sBNF3 in Fig. 3). The other DTW search employs the concatenation of the three sBNF sets (denoted as sBNF4 in Fig. 3), which leads to 240-dimensional sBNF vectors. Each DTW search produces different query detections that are next fused in the fusion stage.

Calibration and fusion

The scores produced by the different searches are transformed according to a discriminative calibration/fusion approach commonly applied in speaker and language recognition [129].

First, the so-called q-norm (query normalization) is applied, so that zero-mean and unit-variance scores are obtained per query. Then, if n different systems are fused, detections are aligned so that only those supported by k or more systems (1≤k≤n) are retained for further processing (k=2). To build the full set of trials (potential detections), a rate of one trial per second is chosen (which is consistent with the evaluation script provided by the organizers). Given a detection of a query Q supported by at least k systems, and a system A that did not provide a score for it, the missing score must be filled in: the minimum score that A has output for query Q in other trials is selected. In fact, this minimum score is hypothesized for all target and non-target trials of query Q for which system A has not output a detection score. When a single system is considered (n=1), the majority voting scheme and the filling of missing scores are skipped. In this way, a complete set of scores is prepared which, together with the ground truth (target/non-target labels) for a development set of queries, can be used to discriminatively estimate a linear transformation.
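The q-norm and missing-score filling steps can be sketched as follows. The data structures are illustrative; the subsequent discriminative calibration/fusion is performed with the Bosaris toolkit, not reproduced here.

```python
# Sketch of per-query score normalization ("q-norm") and of filling
# missing scores with the per-query minimum, as described above.
# scores_by_query: {query: {trial_id: raw_score}}.

from statistics import mean, pstdev

def q_norm(scores_by_query):
    """Zero-mean, unit-variance scores per query."""
    out = {}
    for query, scores in scores_by_query.items():
        mu, sigma = mean(scores.values()), pstdev(scores.values())
        out[query] = {t: (s - mu) / sigma for t, s in scores.items()}
    return out

def fill_missing(scores, all_trials):
    """Hypothesize the per-query minimum score for trials without a score."""
    floor = min(scores.values())
    return {t: scores.get(t, floor) for t in all_trials}
```

Using the per-query minimum for missing trials is a conservative choice: a system that produced no detection is treated as assigning its least confident score, rather than an arbitrary constant.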

The calibration/fusion model is estimated on the development set and then applied to both the development and test sets using Bosaris toolkit [123].

The calibration/fusion parameters and optimal decision threshold are obtained from the corresponding development set for each database (MAVIR and dev2 for RTVE). Since the evaluation organizers did not provide any development data for the COREMAH database, the optimal calibration/fusion parameters tuned on MAVIR data are employed, and the optimal decision threshold is chosen so that 15% of the detections with the highest scores are assigned a YES decision. The parameters involved in the feature extraction and search procedures are set based on preliminary experiments.

G-Super bottleneck feature DTW system (G-Super-BNF DTW)

This system is the same as the F-Combined DTW system, except that only DTW-based search on the concatenation of the three sBNFs as features is used to hypothesize query detections.

H-Multilingual bottleneck feature DTW system (H-Multilingual-BNF DTW)

This system is the same as the G-Super-BNF DTW system, except that DTW-based search on the BabelMulti sBNF set is used for query detection.

I-Monophone bottleneck feature DTW system (I-Monoph.-BNF DTW)

This system is the same as the G-Super-BNF DTW system, except that DTW-based search on the FisherMono sBNF set is used for query detection.

J-Triphone bottleneck feature DTW system (J-Triph.-BNF DTW)

This system is the same as the G-Super-BNF DTW system, except that DTW-based search on the FisherTri sBNF set is used for query detection.

K-Text STD system

This system (Fig. 5) does not compete in the evaluation itself; it is presented in order to examine the upper-bound performance of QbE STD technology. It employs the correct word transcription of the query to hypothesize detections and the same ASR approach as that used in the LVCSR-based QbE STD system.

Fig. 5
figure5

Architecture of K-Text STD system

The ASR subsystem is based on the Kaldi open-source toolkit [114] and employs DNN-based acoustic models.

The data used to train the acoustic models of this Kaldi-based LVCSR system are extracted from the Spanish material used in the 2006 TC-STAR automatic speech recognition evaluation campaign and the Galician broadcast news database Transcrigal [130]. All the non-speech parts, as well as the speech parts corresponding to transcriptions with pronunciation errors, incomplete sentences, and short speech utterances, are discarded, so the acoustic training material finally consists of approximately 104.5 h.

The LM employed in the LVCSR system is constructed using a text database of 150 million word occurrences, composed of material from several sources (transcriptions of the European and Spanish Parliaments from the TC-STAR database, subtitles, books, newspapers, online courses, and the transcriptions of the MAVIR sessions included in the development set provided by the evaluation organizers [131]). Four-gram LMs have been built with the SRILM toolkit [132]. The final LM is an interpolation between an LM trained on RTVE data and another trained on the rest of the text corpora. The LM vocabulary is limited to the 300K most frequent words, and for each evaluation dataset, the OOV words are removed from the LM. Grapheme-to-phoneme conversion is carried out with the Cotovia software [119].

The STD subsystem integrates the Kaldi term detector [114, 133, 134], which searches for the input terms within the word lattices obtained in the previous step [135]. The Kaldi decision-maker conducts a YES/NO decision for each detection based on the term-specific threshold approach presented in [136]. To do so, the score for each detection is computed as follows:

$$ p > \frac{N_{\text{true}}}{\frac{T}{\beta}+\frac{\beta-1}{\beta}N_{\text{true}}}, $$
(13)

where p is the confidence score of the detection, Ntrue is the sum of the confidence score of all the detections of the given term, β is set to 999.9, and T is the length of the audio in seconds.
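Equation (13) can be turned into a small decision helper, shown below as a sketch. The names are illustrative; in practice the decision is made inside the Kaldi term detector.

```python
# Sketch of the term-specific threshold of Eq. (13): a detection with
# confidence p gets a YES decision when p exceeds the threshold computed
# from N_true (summed confidence of all detections of the term), the
# audio length T in seconds, and beta = 999.9.

def term_threshold(n_true, T, beta=999.9):
    """Right-hand side of Eq. (13)."""
    return n_true / (T / beta + (beta - 1.0) / beta * n_true)

def decide(p, n_true, T, beta=999.9):
    """YES/NO decision for a single detection."""
    return p > term_threshold(n_true, T, beta)
```

Note that the threshold grows with N_true, so terms with many confident detections require individually stronger evidence, which is the point of making the threshold term-specific.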

The proxy words strategy of the Kaldi open-source toolkit [137] is employed for OOV query detection. This strategy consists in substituting each OOV word of the search query with acoustically similar INV proxy words, so that the search can then be carried out using the resulting INV query.

The decision threshold and the weight of the LM in the ASR subsystem are tuned individually for MAVIR and RTVE on the corresponding development dataset. However, for all the test data (i.e., MAVIR, RTVE, and COREMAH), these parameters are tuned on the combined ground truth labels of the MAVIR and RTVE development data, aiming to avoid overfitting issues. The rest of the parameters are set based on preliminary experiments.

Evaluation results and discussion

This section presents the overall evaluation results and the results obtained for each individual database on development and test data.

Overall results

The overall evaluation results are presented in Table 8 for development and test data, along with a comparison with the text STD system presented above. These show that the best performance for the ATWV metric on test data is for the A-Hybrid DTW+LVCSR system, which highlights the power of hybrid systems for QbE STD. However, two findings arise: (1) the ranking of the evaluation results for development and test data differs and (2) the K-Text STD system, which relies on a DNN-based ASR subsystem and the correct word transcription of the query, obtains better results than any of the QbE STD systems in development data, but its performance is similar to that of the best QbE STD system on test data. Calibration threshold issues may be causing these differences in performance, since the K-Text STD system also obtains the best MTWV in test data.

Table 8 Overall system results of the ALBAYZIN 2018 QbE STD evaluation on development and test data (average of the results on the two development and three test corpora)

Development data

MAVIR

Evaluation results for MAVIR development data are presented in Table 9. Among the QbE STD systems, the best performance is obtained with the B-Fusion DTW system. Paired t tests show that the difference in performance is statistically significant (p<0.01) compared with all the QbE STD systems except for the A-Hybrid DTW+LVCSR and D-LVCSR systems. B-Fusion DTW employs a fusion of DTW-based systems with different feature sets, which suggests that different features convey different patterns that enhance the performance. The A-Hybrid DTW+LVCSR system, which integrates an LVCSR-based system in the fusion, does not outperform the B-Fusion DTW system, probably due to some threshold calibration issues (better MTWV and worse ATWV) in medium-quality and highly spontaneous speech domains such as MAVIR. The K-Text STD system, which employs the correct word transcription of the query and an LVCSR approach, performs the best. This best performance is statistically significant for a paired t test (p<0.01) compared with the rest of the systems. It is due to the use of the correct transcription of the spoken query in the DNN-based speech recognition system, which plays an important role in query detection for highly spontaneous and medium-quality speech domains.

Table 9 System results of the ALBAYZIN 2018 QbE STD evaluation on MAVIR development data

On the other hand, the worst systems are those that employ stacked bottleneck features, which suggests that the use of the sBNFs, as proposed by the authors of those systems, is less powerful than other features for QbE STD in medium-quality and highly spontaneous speech domains.

RTVE

The evaluation results for RTVE development data are presented in Table 10. They show that the best performance among the QbE STD systems is obtained with the C-PhonePost DTW system. A paired t test shows that the difference in performance is statistically significant (p<0.01) compared with all the QbE STD systems except for the A-Hybrid DTW+LVCSR and B-Fusion DTW systems. The best-performing system differs from that on MAVIR development data, possibly because RTVE comprises higher-quality and better-pronounced speech than MAVIR. Two further notable differences can be seen on these data compared to the MAVIR development data: (1) the systems that rely on sBNFs obtain much better performance, and (2) the K-Text STD system obtains results similar to those of the A-Hybrid DTW+LVCSR, B-Fusion DTW, and C-PhonePost DTW systems. These differences highlight the power of sBNFs and QbE STD systems when addressing query detection in high-quality and well-pronounced speech domains such as RTVE.

Table 10 System results of the ALBAYZIN 2018 QbE STD evaluation on RTVE development data

Due to threshold calibration issues, the A-Hybrid DTW+LVCSR system, which obtains the best MTWV, does not perform the best in terms of ATWV, as was also the case on MAVIR development data.

The E-DTW system obtains the worst overall performance. This can be due to the fact that the optimal parameters obtained with the MAVIR development data have been applied on these data without adjustment. Since RTVE data convey many different properties (i.e., high-quality and well-pronounced speech), the parameter tuning is not effective across changes in the data domain.

Test data

MAVIR

The results corresponding to MAVIR test data are presented in Table 11. They show that the best QbE STD performance is obtained with the B-Fusion DTW system, which is consistent with the results on MAVIR development data. This best performance is statistically significant for a paired t test (p<0.01) compared to all the QbE STD systems except for the C-PhonePost DTW system, for which the difference is weakly significant (p<0.04). The small performance gap between MTWV and ATWV for the best system suggests that the threshold is well calibrated. The remaining findings observed on the development results also hold: (1) the worst systems are those that employ sBNFs for feature extraction; (2) the A-Hybrid DTW+LVCSR system, which adds an LVCSR approach to the fusion of the B-Fusion DTW system, obtains worse performance than the B-Fusion DTW system, due to the low performance of the LVCSR system; this indicates that parameter tuning on the development data does not generalize well to test data; (3) the K-Text STD system performs better than any QbE STD system, with all the performance gaps statistically significant for a paired t test (p<0.01).

Table 11 System results of the ALBAYZIN 2018 QbE STD evaluation on MAVIR test data

RTVE

Evaluation results for RTVE test data are presented in Table 12. They show that the best QbE STD performance corresponds to the A-Hybrid DTW+LVCSR system. The performance gap between MTWV and ATWV for this system indicates that the threshold presents some calibration issues. The best performance is statistically significant for a paired t test (p<0.01) compared to the rest of the QbE STD systems. This highlights the power of hybrid systems for QbE STD on high-quality and well-pronounced speech data for which a considerable amount of resources is available. On development data, the A-Hybrid DTW+LVCSR, B-Fusion DTW, and C-PhonePost DTW systems obtain equivalent performance. Nevertheless, on test data, the hybrid system generalizes better than the other systems, thanks to the complementary information it integrates.

Table 12 System results of the ALBAYZIN 2018 QbE STD evaluation on RTVE test data

As in development data, the E-DTW system performs the worst, since no additional tuning on RTVE data was carried out, whereas the systems that employ sBNFs for feature extraction improve their performance with respect to MAVIR test data.

The K-Text STD system performs better than any other QbE STD system. This best performance is statistically significant (p<0.01) compared to all the QbE STD systems.

COREMAH

Evaluation results for COREMAH test data are presented in Table 13. For the QbE STD systems, the best performance is obtained with the E-DTW system. This best performance is statistically significant for a paired t test (p<0.01) compared to the rest of the QbE STD systems. Recall that no development data were provided for COREMAH, and hence parameter tuning had to be carried out with other data. The E-DTW system was tuned with the optimal MAVIR parameters, which suggests that MAVIR data convey properties similar to the conversational speech in COREMAH data. In contrast, the rest of the systems employed RTVE data for tuning (except for the F-Combined DTW, G-Super-BNF DTW, H-Multilingual-BNF DTW, I-Monoph.-BNF DTW, and J-Triph.-BNF DTW systems, which employed MAVIR data as well), which leads to worse performance due to the higher data mismatch. For the latter systems, the low performance may also stem from the type of tuning carried out.

Table 13 System results of the ALBAYZIN 2018 QbE STD evaluation on COREMAH test data

The K-Text STD system obtains worse performance than the best QbE STD system, although the performance gap is weakly significant for a paired t test (p<0.03). This could be due to the data mismatch between COREMAH data and the RTVE data, which were used, along with MAVIR data, for parameter tuning in this case.

Analysis of development and test data DET curves

DET curves of the QbE STD systems submitted to the evaluation and the text-based STD system are presented in Figs. 6 and 7 for MAVIR and RTVE development data, respectively, and Figs. 8, 9, and 10 for MAVIR, RTVE, and COREMAH test data, respectively.

Fig. 6
figure6

DET curves of the QbE STD systems and text STD system for MAVIR development data

Fig. 7
figure7

DET curves of the QbE STD systems and text STD system for RTVE development data

Fig. 8
figure8

DET curves of the QbE STD systems and text STD system for MAVIR test data

Fig. 9
figure9

DET curves of the QbE STD systems and text STD system for RTVE test data

Fig. 10
figure10

DET curves of the QbE STD systems and text STD system for COREMAH test data

On MAVIR development data, the B-Fusion DTW system performs the best for low FA rates, and the A-Hybrid DTW+LVCSR system performs the best for moderate and low miss rates. This indicates that the hybrid system is suitable for cases in which a miss is more important than an FA. On RTVE development data, the B-Fusion DTW system performs the best for low and moderate FA rates, and the A-Hybrid DTW+LVCSR system performs the best for low miss rates. This confirms the power of hybrid systems for low miss rate scenarios.

On MAVIR test data, the C-PhonePost DTW system performs the best for very low FA rates, the B-Fusion DTW system performs the best for low and moderate FA rates, and the A-Hybrid DTW+LVCSR system performs the best for low miss rates. On RTVE test data, the B-Fusion DTW system performs the best for low FA rates, and the A-Hybrid DTW+LVCSR system performs the best for low miss rates. On COREMAH test data, the C-PhonePost DTW system performs the best for low FA rates, the B-Fusion DTW system performs the best for moderate FA rates, and the A-Hybrid DTW+LVCSR system performs the best for low miss rates. However, according to the results in Table 13, the best performance is obtained with the E-DTW system, since it outputs fewer FAs (although the number of hits is also lower) than those three systems, so more hits are ranked in the top positions, improving the MTWV/ATWV performance measures.

In summary, the B-Fusion DTW and A-Hybrid DTW+LVCSR systems obtain the best DET curves, which makes them the most appropriate for search on speech from spoken queries.

The DET curves also show that the K-Text STD system performs the best on all data, except on RTVE development data (at all operating points) and on MAVIR and COREMAH test data (at low FA rates). Results on COREMAH test data suggest that QbE STD may outperform text-based STD on unseen data domains, at least in some scenarios (low FA rates in this case).

Post-evaluation analysis

After the evaluation period, an analysis based on some query properties and fusion of the primary systems submitted from the different participants has been carried out. This section presents the results of this analysis.

Performance analysis of QbE STD systems for in-language and out-of-language queries

An analysis of the QbE STD systems and the K-Text STD system for in-language and out-of-language queries has been carried out, and the results are presented in Tables 14, 15, and 16 for the MAVIR, RTVE, and COREMAH databases, respectively. On MAVIR data, QbE STD system performance is, in general, better on OOL than on INL queries. We attribute this to the fact that all the systems employing template matching techniques rely on a language-independent approach in which English is largely used for feature extraction. Since the OOL queries are in English, this clearly yields better performance. The template matching systems that obtain better performance on INL than on OOL queries are the F-Combined DTW, G-Super-BNF DTW, I-Monoph.-BNF DTW, and J-Triph.-BNF DTW systems; for these, the better MTWV performance on OOL queries indicates some threshold calibration issues. The D-LVCSR system, which is based on subword unit search from word-based ASR, performs better on OOL queries than on INL queries. We consider this could be due to the larger OOV rate of the INL queries (18.2%) compared to that of the OOL queries (14.2%), which could affect the word-based ASR performance.

Table 14 System results of the ALBAYZIN 2018 QbE STD evaluation on MAVIR test data for in-language (INL) and out-of-language (foreign) (OOL) queries
Table 15 System results of the ALBAYZIN 2018 QbE STD evaluation on RTVE test data for in-language (INL) and out-of-language (foreign) (OOL) queries
Table 16 System results of the ALBAYZIN 2018 QbE STD evaluation on COREMAH test data for in-language (INL) and out-of-language (foreign) (OOL) queries

On RTVE data, the systems that only employ template matching approaches obtain, in general, better performance on OOL queries than on INL queries, again due to the match with the OOL query language (i.e., English). The only exceptions are the F-Combined DTW and G-Super-BNF DTW systems, for which the better MTWV performance on OOL queries indicates some threshold calibration issues, and the E-DTW system, for which the performance (ATWV < 0) is meaningless. For the systems that employ ASR (i.e., the A-Hybrid DTW+LVCSR and D-LVCSR systems), the performance is better on INL queries, since the ASR language matches that of the query.

On COREMAH data, systems obtain, in general, better MTWV performance on OOL queries than on INL queries, which is due to the use of the English language for system construction. Threshold calibration issues lead to higher ATWV for INL queries in some cases. For the D-LVCSR system, which is based on ASR, the MTWV performance is better on INL queries than on OOL queries, which is consistent with the match between the language of the ASR system and the queries. However, threshold calibration issues produce a worse ATWV on INL queries.

As expected, the K-Text STD system, which is language-dependent and relies on the search in word lattices output by a Spanish ASR system, obtains better performance on INL queries than on OOL queries, since the query language matches the ASR target language. The only exception is the COREMAH data, for which a better MTWV performance on INL queries suggests threshold calibration issues in domains for which no development data are provided.

Performance analysis of QbE STD systems for single and multi-word queries

A similar analysis has been carried out for single-word and multi-word queries, and the results are presented in Tables 17, 18, and 19 for the MAVIR, RTVE, and COREMAH databases, respectively. They show that system performance on multi-word queries is always better than on single-word queries for the MAVIR and RTVE databases. We consider this is due to the fact that multi-word queries are longer than single-word queries and hence produce fewer FAs, which improves the final performance.

Table 17 System results of the ALBAYZIN 2018 QbE STD evaluation on MAVIR test data for single-word (Single) and multi-word (Multi) queries
Table 18 System results of the ALBAYZIN 2018 QbE STD evaluation on RTVE test data for single-word (Single) and multi-word (Multi) queries
Table 19 System results of the ALBAYZIN 2018 QbE STD evaluation on COREMAH test data for single-word (Single) and multi-word (Multi) queries

On COREMAH data, for which no development data were provided, the performance drops dramatically. In addition, there is just one multi-word query, for which no system outputs any detection. Single-word query detection fails in threshold calibration in most cases, hence yielding an ATWV < 0. The only system that obtains an ATWV > 0 is the E-DTW system, thanks to a perfect threshold calibration. This calibration may be due to the fact that MAVIR development data were used for parameter tuning in the experiments with COREMAH data: MAVIR data contain highly spontaneous, medium-quality speech, which to some extent matches the speech in COREMAH data. On the other hand, RTVE data, which were used for parameter tuning in the rest of the systems (except for the F-Combined DTW, G-Super-BNF DTW, H-Multilingual-BNF DTW, I-Monoph.-BNF DTW, and J-Triph.-BNF DTW systems, which employed MAVIR data as well), contain well-pronounced, high-quality speech, which does not match COREMAH data and hence degrades the performance. For the latter systems, the low performance may be due to the type of tuning carried out.

Performance analysis of QbE STD systems for INV and OOV queries

An analysis of the QbE STD systems and the K-Text STD system for in-vocabulary and out-of-vocabulary queries has been carried out, and the results are presented in Tables 20, 21, and 22 for the MAVIR, RTVE, and COREMAH databases, respectively. They show that, for the MAVIR and RTVE databases, performance on INV queries is better than on OOV queries. Although most of the QbE STD systems presented do not rely on ASR (the only exceptions are the D-LVCSR and A-Hybrid DTW+LVCSR systems), system performance is, in principle, better on INV queries than on OOV queries, due to the different properties that INV and OOV queries convey. However, on COREMAH data, the MTWV obtained on OOV queries is, in general, better than on INV queries. Since no development data were provided for this database, INV and OOV query detection must rely on parameter tuning that does not match the data domain, which makes INV query detection more difficult. The performance gaps between the MTWV and ATWV metrics suggest threshold calibration issues on COREMAH data, due to the lack of development data for this domain.

Table 20 System results of the ALBAYZIN 2018 QbE STD evaluation on MAVIR test data for in-vocabulary (INV) and out-of-vocabulary (OOV) queries
Table 21 System results of the ALBAYZIN 2018 QbE STD evaluation on RTVE test data for in-vocabulary (INV) and out-of-vocabulary (OOV) queries
Table 22 System results of the ALBAYZIN 2018 QbE STD evaluation on COREMAH test data for in-vocabulary (INV) and out-of-vocabulary (OOV) queries

As expected, the K-Text STD system obtains better performance on INV queries for all the databases due to the match in the target language and the presence of the query terms in the vocabulary of the ASR system.

System fusion

After the evaluation, we combined the primary systems developed by the participants by fusing the scores they produced. System fusion consists of two stages: (1) pre-processing and (2) calibration and fusion. These are explained next.

Pre-processing

First, the scores for each query and system are normalized to zero mean and unit variance. All the detections given by the fused systems are taken into account to generate the output of the fusion system. Given a query detection output by some system A, if another fused system B does not detect it (so that no score exists for it), the score assigned to that detection is the minimum global score of system B.
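A minimal sketch of this pre-processing step for a single query, with hypothetical detection identifiers and scores (in the paper, normalization is applied per query and per system; one query is shown here):

```python
# Sketch of fusion pre-processing: z-normalize each system's scores, then give
# any detection a system never produced that system's minimum (normalized)
# score, so all systems score the same set of detections.
import statistics

def znorm(values):
    mu = statistics.mean(values)
    sd = statistics.pstdev(values) or 1.0   # guard against zero variance
    return [(v - mu) / sd for v in values]

# Raw scores per system: detection id -> score (systems detect different sets).
sys_a = {"det1": 2.0, "det2": 1.0, "det3": 0.5}
sys_b = {"det1": 0.3, "det3": 0.9}          # sys_b never detected det2

all_dets = set(sys_a) | set(sys_b)
filled = {}
for name, scores in (("A", sys_a), ("B", sys_b)):
    norm = dict(zip(scores, znorm(list(scores.values()))))
    floor = min(norm.values())              # the system's minimum global score
    filled[name] = {d: norm.get(d, floor) for d in all_dets}
```

After this step every system assigns a score to every detection, which is what the calibration and fusion stage described next requires.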

Calibration and fusion

Calibration and fusion are carried out with the Bosaris toolkit [123], employing a linear model based on logistic regression that is trained on the scores of the detections of the development queries. The parameters for MAVIR and RTVE data are optimized independently on their corresponding development sets and then applied to their corresponding test sets. For COREMAH data, the model trained for MAVIR data is employed.
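A minimal sketch of such linear logistic-regression fusion, in the spirit of (but much simpler than) the Bosaris toolkit: a weight per system plus an offset is learned on development detections labeled as true occurrences (1) or false alarms (0), and the fused score is the resulting linear combination. All scores and labels below are hypothetical.

```python
# Sketch of linear logistic-regression fusion: learn one weight per system and
# an offset by stochastic gradient descent on the logistic loss, then fuse.
# Development scores/labels below are hypothetical.
import math

def train_fusion(dev_scores, labels, lr=0.1, epochs=2000):
    n_sys = len(dev_scores[0])
    w = [0.0] * n_sys
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(dev_scores, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            g = p - y                       # gradient of the logistic loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def fuse(w, b, x):
    # Fused score is a calibrated log-odds: a weighted sum of system scores.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Two systems; true detections (label 1) tend to score high on both.
dev = [(1.2, 0.9), (0.8, 1.1), (-0.5, -1.0), (-1.2, -0.3)]
lab = [1, 1, 0, 0]
w, b = train_fusion(dev, lab)
```

Because the model is linear, the learned weights are directly interpretable as the relative reliability of each fused system on the development data.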

Fusion employs the three primary systems corresponding to the three participants in the evaluation (i.e., E-DTW, A-Hybrid DTW+LVCSR, and F-Combined DTW systems).

Fusion results

The results of the primary system fusion are presented in Table 23 for development data and in Table 24 for test data. They show that system fusion enhances the performance of the best individual QbE STD system on MAVIR and COREMAH data, whereas the opposite holds for RTVE data. A paired t test shows that the improvement of the Fusion system over the best QbE STD system on MAVIR test data (A-Hybrid DTW+LVCSR) is statistically significant (p < 0.01), and weakly significant (p < 0.08) over the best QbE STD system on MAVIR development data (A-Hybrid DTW+LVCSR). This highlights the power of fused systems for QbE STD in challenging domains that include medium-quality, highly spontaneous speech. The drop in performance of the Fusion system compared to the best QbE STD system on RTVE test data (A-Hybrid DTW+LVCSR) is not statistically significant under a paired t test.
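The significance claims above rely on a per-query paired t test. A minimal sketch with hypothetical per-query ATWV values (the actual per-query values are not reported here); the resulting statistic is compared against the t distribution with n − 1 degrees of freedom:

```python
# Sketch of a paired t test over per-query ATWV values for two systems.
# The per-query scores below are hypothetical.
import math

def paired_t(a, b):
    """t statistic for paired samples a and b (n - 1 degrees of freedom)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)

fusion = [0.35, 0.42, 0.28, 0.50, 0.31, 0.44]   # per-query ATWV, system 1
single = [0.30, 0.36, 0.25, 0.41, 0.29, 0.38]   # per-query ATWV, system 2
t = paired_t(fusion, single)
```

The p values quoted in the text then follow from looking up t in the t distribution (e.g., with `scipy.stats.ttest_rel`, which computes both the statistic and the p value directly).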

Table 23 Fusion system results of the ALBAYZIN 2018 QbE STD evaluation on development data
Table 24 Fusion system results of the ALBAYZIN 2018 QbE STD evaluation on test data

The K-Text STD system performs better than the Fusion system on MAVIR and RTVE data. This improvement is statistically significant under a paired t test (p < 0.01) for both the development and test sets of MAVIR data and for the test set of RTVE data. However, on COREMAH test data, the Fusion system outperforms the K-Text STD system. This improvement is statistically significant under a paired t test (p < 0.02), which indicates that fusing QbE STD systems based on different strategies can outperform text-based STD technology on unseen data domains.

DET curves of the fusion systems along with the rest of the primary systems and the K-Text STD system are presented in Figs. 11, 12, 13, 14, and 15 for MAVIR development, RTVE development, MAVIR test, RTVE test, and COREMAH test data, respectively. Comparing the QbE STD systems on MAVIR development and test data, it can be seen that (1) the Fusion system performs the best, except for very low FA rates, for which the E-DTW system performs the best, and (2) the K-Text STD system performs better than any QbE STD system at all operating points. On RTVE development data, the Fusion system performs the best, except for very low miss rates, for which the A-Hybrid DTW+LVCSR system performs the best. Comparing the QbE STD systems on RTVE test data, it can be seen that (1) the Fusion system performs the best, except for very low FA rates and low miss rates, for which the A-Hybrid DTW+LVCSR system performs the best, and (2) the K-Text STD system performs better than any QbE STD system at all operating points. Comparing the QbE STD systems on COREMAH test data, it can be seen that (1) the Fusion system performs the best at all operating points, except for very low FA rates, for which the F-Combined DTW system performs the best, and for very low miss rates, for which the A-Hybrid DTW+LVCSR system obtains the best performance, and (2) the K-Text STD system outperforms any QbE STD system at low miss rates.

Fig. 11 DET curves of the fusion, primary QbE STD systems, and text STD system for MAVIR development data

Fig. 12 DET curves of the fusion, primary QbE STD systems, and text STD system for RTVE development data

Fig. 13 DET curves of the fusion, primary QbE STD systems, and text STD system for MAVIR test data

Fig. 14 DET curves of the fusion, primary QbE STD systems, and text STD system for RTVE test data

Fig. 15 DET curves of the fusion, primary QbE STD systems, and text STD system for COREMAH test data

These results highlight the power of fusing systems in QbE STD since the Fusion system obtains, in general, the best performance across the different datasets, and in some scenarios, QbE STD outperforms text-based STD using textual queries.

Comparison to the ALBAYZIN 2016 QbE STD evaluation

The evaluations carried out in 2016 and 2018 share the MAVIR data (queries and utterances), so the best systems submitted to the two evaluations can be compared. On MAVIR test data, the best result obtained in the 2018 evaluation is ATWV = 0.2810, which is higher than that obtained in the ALBAYZIN 2016 QbE STD evaluation (ATWV = 0.2646). The best performance in 2016 corresponded to a combined system that integrated DTW search on different feature sets. In the 2018 evaluation, however, the detections obtained from the different feature sets are combined with the detections from a text STD approach, resulting in a hybrid QbE STD system. This hybrid system, which integrates two standard approaches to QbE STD, clearly gives better performance than systems that only integrate template matching approaches.

Towards a language-independent STD system

Due to the intrinsic language independence of several QbE STD systems submitted to this evaluation (see Table 6), the feasibility of language-independent STD systems can be examined. The overall evaluation results (see Table 8) show that language-independent STD systems are still far from matching, let alone outperforming, language-dependent STD systems. The best language-independent system (i.e., B-Fusion DTW) obtains ATWV = 0.3082, whereas the K-Text STD system obtains ATWV = 0.4427, which suggests that language-independent STD remains a challenge. This is clearer for domains in which training/development data are given in advance for system training and tuning (see Tables 11 and 12). When the data domain changes (as with COREMAH data in this evaluation), the performance of language-dependent STD systems drops dramatically, so that language-independent STD systems may obtain similar or even better performance than language-dependent ones (see Table 13). Therefore, it can be claimed that language-independent STD systems are feasible for out-of-domain data.

Conclusions

This paper has presented a multi-domain international QbE STD evaluation for SoS in Spanish. The number of systems submitted to the evaluation has made it possible to compare the progress of QbE STD technology under a common framework. Three different teams participated in the evaluation, and ten different systems were submitted. Additionally, a text-based STD system has been presented to compare STD and QbE STD technologies. The systems belong to three well-known categories: text-based STD, template matching, and hybrid. Among them, the A-Hybrid DTW+LVCSR and D-LVCSR systems, which include a probabilistic retrieval model for information retrieval and a query likelihood retrieval model, and the F-Combined DTW, G-Super-BNF DTW, H-Multilingual-BNF DTW, I-Monoph.-BNF DTW, and J-Triph.-BNF DTW systems, which employ stacked bottleneck features for signal representation, can be considered novel from a QbE STD perspective.

Results have shown high variability with regard to domain changes. On the one hand, systems have obtained the best performance on RTVE data, for which a large amount of training data is available for system construction and which contain high-quality, well-pronounced speech. For these data, hybrid systems are typically the best choice due to those characteristics. On the other hand, systems have obtained the worst performance on COREMAH data, for which only test data were provided. This indicates that domain changes remain quite challenging for QbE STD. On MAVIR data, which are also challenging due to the presence of spontaneous speech, system performance lay between those for RTVE and COREMAH data.

We have also shown that template matching systems for which the language of the foreign queries was employed in development (e.g., for feature extraction) obtained better performance on OOL query detection than on INL query detection. Systems have obtained better performance on multi-word query detection than on single-word query detection because lower FA rates are generally obtained on longer queries. Systems have obtained better performance on INV queries than on OOV queries for domains for which development data are provided, since OOV queries convey, in general, more diverse properties. However, for out-of-domain data, system performance on OOV queries may be better than on INV queries, since the change in the data domain is more critical, especially for the systems based on template matching.

Given the best overall result obtained in the evaluation (ATWV = 0.3260), which comes from the average over the three domains, there is still ample room for improvement. Specifically, it has been observed that QbE STD systems degrade to a great extent in unseen data domains, for which language-independent STD systems (ATWV = 0.1436) outperformed language-dependent STD systems (ATWV = −0.5828). This encourages us to maintain the QbE STD evaluation in the coming years, focusing on multi-domain QbE STD.

Availability of data and materials

Not applicable.

Notes

  1. http://www.rthabla.es/
  2. http://www.isca-speech.org/iscaweb/index.php/sigs?layout=edit&id=132
  3. http://www.mavir.net
  4. http://cartago.lllf.uam.es/mavir/index.pl?m=videos
  5. http://sox.sourceforge.net/
  6. http://www.lllf.uam.es/coremah/
  7. https://ffmpeg.org/
  8. http://lucene.apache.org
  9. http://www.tc-star.org
  10. http://cartago.lllf.uam.es/mavir/index.pl?m=descargas

Abbreviations

AAC: Advanced audio coding
ASR: Automatic speech recognition
ATWV: Actual term-weighted value
BUT: Brno University of Technology
DET: Detection error tradeoff
DNN: Deep neural network
DTW: Dynamic time warping
FA: False alarm
GMM: Gaussian mixture model
HMM: Hidden Markov model
HNR: Harmonics-to-noise ratio
IARPA: Intelligence Advanced Research Projects Activity
INL: In-language
INV: In-vocabulary
KWS: Keyword spotting
LM: Language model
LVCSR: Large vocabulary continuous speech recognition
MED: Minimum edit distance
MFCC: Mel-frequency cepstral coefficient
MOS: Mean opinion score
MPEG: Moving Picture Experts Group
MTWV: Maximum term-weighted value
NIST: National Institute of Standards and Technology
NS-DTW: Non-segmental dynamic time warping
OOL: Out-of-language
OOV: Out-of-vocabulary
PCM: Pulse code modulation
QbE STD: Query-by-Example Spoken Term Detection
QUESST: Query-by-Example Search on Speech Task
RTVE: Radio Televisión Española
S-DTW: Subsequence DTW
sBNF: Stacked bottleneck feature
SDR: Spoken document retrieval
SIG-IL: Special Interest Group on Iberian Languages
SoS: Search on speech
STD: Spoken term detection
SWS: Spoken web search
TV: Television
TWV: Term-weighted value
VAD: Voice activity detection
WFST: Weighted finite state transducer

References

  1. 1

    K. Ng, V. W. Zue, Subword-based approaches for spoken document retrieval. Speech Commun.32(3), 157–186 (2000).

  2. 2

    B. Chen, K. -Y. Chen, P. -N. Chen, Y. -W. Chen, Spoken document retrieval with unsupervised query modeling techniques. IEEE Trans. Audio Speech Lang. Process.20(9), 2602–2612 (2012).

  3. 3

    T. -H. Lo, Y. -W. Chen, K. -Y. Chen, H. -M. Wang, B. Chen, in Proc. of ASRU. Neural relevance-aware query modeling for spoken document retrieval (IEEEUSA, 2017), pp. 466–473.

  4. 4

    W. F. L. Heeren, F. M. G. de Jong, L. B. van der Werff, M. A. H. Huijbregts, R. J. F. Ordelman, in Proc. of LREC. Evaluation of spoken document retrieval for historic speech collections (ELRABelgium, 2008), pp. 2037–2041.

  5. 5

    Y. -C. Pan, H. -Y. Lee, L. -S. Lee, Interactive spoken document retrieval with suggested key terms ranked by a Markov decision process. IEEE Trans. Audio Speech Lang. Process.20(2), 632–645 (2012).

  6. 6

    Y. -W. Chen, K. -Y. Chen, H. -M. Wang, B. Chen, in Proc. of Interspeech. Exploring the use of significant words language modeling for spoken document retrieval (ISCAFrance, 2017), pp. 2889–2893.

  7. 7

    P. Gao, J. Liang, P. Ding, B. Xu, in Proc. of ICASSP. A novel phone-state matrix based vocabulary-independent keyword spotting method for spontaneous speech (IEEEUSA, 2007), pp. 425–428.

  8. 8

    B. Zhang, R. Schwartz, S. Tsakalidis, L. Nguyen, S. Matsoukas, in Proc. of Interspeech. White listing and score normalization for keyword spotting of noisy speech (ISCAFrance, 2012), pp. 1832–1835.

  9. 9

    A. Mandal, J. van Hout, Y. -C. Tam, V. Mitra, Y. Lei, J. Zheng, D. Vergyri, L. Ferrer, M. Graciarena, A. Kathol, H. Franco, in Proc. of Interspeech. Strategies for high accuracy keyword detection in noisy channels (ISCAFrance, 2013), pp. 15–19.

  10. 10

    T. Ng, R. Hsiao, L. Zhang, D. Karakos, S. H. Mallidi, M. Karafiat, K. Vesely, I. Szoke, B. Zhang, L. Nguyen, R. Schwartz, in Proc. of Interspeech. Progress in the BBN keyword search system for the DARPA RATS program (ISCAFrance, 2014), pp. 959–963.

  11. 11

    V. Mitra, J. van Hout, H. Franco, D. Vergyri, Y. Lei, M. Graciarena, Y. -C. Tam, J. Zheng, in Proc. of ICASSP. Feature fusion for high-accuracy keyword spotting (IEEEUSA, 2014), pp. 7143–7147.

  12. 12

    S. Panchapagesan, M. Sun, A. Khare, S. Matsoukas, A. Mandal, B. Hoffmeister, S. Vitaladevuni, in Proc. of Interspeech. Multi-task learning and weighted cross-entropy for DNN-based keyword spotting (ISCAFrance, 2016), pp. 760–764.

  13. 13

    J. Mamou, B. Ramabhadran, O. Siohan, in Proc. of ACM SIGIR. Vocabulary independent spoken term detection (ACMUSA, 2007), pp. 615–622.

  14. 14

    D. Schneider, T. Mertens, M. Larson, J. Kohler, in Proc. of Interspeech. Contextual verification for open vocabulary spoken term detection (ISCAFrance, 2010), pp. 697–700.

  15. 15

    C. Parada, A. Sethy, M. Dredze, F. Jelinek, in Proc. of Interspeech. A spoken term detection framework for recovering out-of-vocabulary words using the web (ISCAFrance, 2010), pp. 1269–1272.

  16. 16

    I. Szöke, M. Faps̆o, L. Burget, J. C̆ernocký, in Proc. of Speech Search Workshop at SIGIR. Hybrid word-subword decoding for spoken term detection (ACMUSA, 2008), pp. 42–48.

  17. 17

    Y. Wang, F. Metze, in Proc. of Interspeech. An in-depth comparison of keyword specific thresholding and sum-to-one score normalization (ISCAFrance, 2014), pp. 2474–2478.

  18. 18

    L. Mangu, G. Saon, M. Picheny, B. Kingsbury, in Proc. of ICASSP. Order-free spoken term detection (IEEEUSA, 2015), pp. 5331–5335.

  19. 19

    A. Buzo, H. Cucu, C. Burileanu, in Proc. of MediaEval. SpeeD@MediaEval 2014: Spoken term detection with robust multilingual phone recognition (CEURGermany, 2014), pp. 721–722.

  20. 20

    R. Konno, K. Ouchi, M. Obara, Y. Shimizu, T. Chiba, T. Hirota, Y. Itoh, in Proc. of NTCIR-12. An STD system using multiple STD results and multiple rescoring method for NTCIR-12 SpokenQuery &Doc task (Japan Society for Promotion of ScienceJapan, 2016), pp. 200–204.

  21. 21

    R. Jarina, M. Kuba, R. Gubka, M. Chmulik, M. Paralic, in Proc. of MediaEval. UNIZA system for the spoken web search task at MediaEval 2013 (CEURGermany, 2013), pp. 791–792.

  22. 22

    X. Anguera, M. Ferrarons, in Proc. of ICME. Memory efficient subsequence DTW for query-by-example spoken term detection (IEEEUSA, 2013), pp. 1–6.

  23. 23

    H. Lin, A. Stupakov, J. Bilmes, in Proc. of Interspeech. Spoken keyword spotting via multi-lattice alignment (ISCAFrance, 2008), pp. 2191–2194.

  24. 24

    C. Chan, L. Lee, in Proc. of Interspeech. Unsupervised spoken-term detection with spoken queries using segment-based dynamic time warping (ISCAFrance, 2010), pp. 693–696.

  25. 25

    S. Settle, K. Levin, H. Kamper, K. Livescu, in Proc. of Interspeech. Query-by-example search with discriminative neural acoustic word embeddings (ISCAFrance, 2017), pp. 2874–2878.

  26. 26

    R. Shankar, C. M. Vikram, S. R. M. Prasanna, in Proc. of Interspeech. Spoken keyword detection using joint DTW-CNN (ISCAFrance, 2018), pp. 117–121.

  27. 27

    A. Ali, M. A. Clements, in Proc. of MediaEval. Spoken web search using and ergodic hidden Markov model of speech (CEURGermany, 2013), pp. 861–862.

  28. 28

    A. Caranica, A. Buzo, H. Cucu, C. Burileanu, in Proc. of MediaEval. SpeeD@MediaEval 2015: Multilingual phone recognition approach to Query By Example STD (CEURGermany, 2015), pp. 781–783.

  29. 29

    S. Kesiraju, G. Mantena, K. Prahallad, in Proc. of MediaEval. IIIT-H system for MediaEval 2014 QUESST (CEURGermany, 2014), pp. 761–762.

  30. 30

    M. Ma, A. Rosenberg, in Proc. of MediaEval. CUNY systems for the Query-by-Example search on speech task at MediaEval 2015 (CEURGermany, 2015), pp. 831–833.

  31. 31

    J. Takahashi, T. Hashimoto, R. Konno, S. Sugawara, K. Ouchi, S. Oshima, T. Akyu, Y. Itoh, in Proc. of NTCIR-11. An IWAPU STD system for OOV query terms and spoken queries (Japan Society for Promotion of ScienceJapan, 2014), pp. 384–389.

  32. 32

    M. Makino, A. Kai, in Proc. of NTCIR-11. Combining subword and state-level dissimilarity measures for improved spoken term detection in NTCIR-11 SpokenQuery &Doc task (Japan Society for Promotion of ScienceJapan, 2014), pp. 413–418.

  33. 33

    N. Sakamoto, K. Yamamoto, S. Nakagawa, in Proc. of ASRU. Combination of syllable based N-gram search and word search for spoken term detection through spoken queries and IV/OOV classification (IEEEUSA, 2015), pp. 200–206.

  34. 34

    J. Hou, V. T. Pham, C. -C. Leung, L. Wang, H. Xu, H. Lv, L. Xie, Z. Fu, C. Ni, X. Xiao, H. Chen, S. Zhang, S. Sun, Y. Yuan, P. Li, T. L. Nwe, S. Sivadas, B. Ma, E. S. Chng, H. Li, in Proc. of MediaEval. The NNI Query-by-Example system for MediaEval 2015 (IEEEUSA, 2015), pp. 141–143.

  35. 35

    J. Vavrek, P. Viszlay, M. Lojka, M. Pleva, J. Juhar, M. Rusko, in Proc. of MediaEval. TUKE at MediaEval 2015 QUESST (CEURGermany, 2015), pp. 451–453.

  36. 36

    H. Wang, T. Lee, C. -C. Leung, B. Ma, H. Li, Acoustic segment modeling with spectral clustering methods. IEEE/ACM Trans. Audio Speech Lang. Process.23(2), 264–277 (2015).

  37. 37

    C. -T. Chung, L. -S. Lee, Unsupervised discovery of structured acoustic tokens with applications to spoken term detection. IEEE/ACM Trans. Audio Speech Lang. Process.26(2), 394–405 (2018).

  38. 38

    C. -T. Chung, C. -Y. Tsai, C. -H. Liu, L. -S. Lee, Unsupervised iterative deep learning of speech features and acoustic tokens with applications to spoken term detection. IEEE/ACM Trans. Audio Speech Lang. Process.25(10), 1914–1928 (2017).

  39. 39

    P. Lopez-Otero, J. Parapar, A. Barreiro, Efficient query-by-example spoken document retrieval combining phone multigram representation and dynamic time warping. Inf. Process. Manag.56(1), 43–60 (2019).

  40. 40

    G. Mantena, S. Achanta, K. Prahallad, Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping. IEEE/ACM Trans. Audio Speech Lang. Process.22(5), 946–955 (2014).

  41. 41

    H. Tulsiani, P. Rao, in Proc. of MediaEval. The IIT-B Query-by-Example system for MediaEval 2015 (CEURGermany, 2015), pp. 341–343.

  42. 42

    M. Bouallegue, G. Senay, M. Morchid, D. Matrouf, G. Linares, R. Dufour, in Proc. of MediaEval. LIA@MediaEval 2013 spoken web search task: an I-vector based approach (CEURGermany, 2013), pp. 771–772.

  43. 43

    L. J. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, M. Diez, in Proc. of MediaEval. GTTS systems for the SWS task at MediaEval 2013 (CEURGermany, 2013), pp. 831–832.

  44. 44

    H. Wang, T. Lee, C. -C. Leung, B. Ma, H. Li, in Proc. of ICASSP. Using parallel tokenizers with DTW matrix combination for low-resource spoken term detection (IEEEUSA, 2013), pp. 8545–8549.

  45. 45

    H. Wang, T. Lee, in Proc. of MediaEval. The CUHK spoken web search system for MediaEval 2013 (CEURGermany, 2013), pp. 681–682.

  46. 46

    J. Proenca, A. Veiga, F. Perdigão, in Proc. of MediaEval. The SPL-IT query by example search on speech system for MediaEval 2014 (CEURGermany, 2014), pp. 741–742.

  47. 47

    J. Proenca, A. Veiga, F. Perdigao, in Proc. of EUSIPCO. Query by example search with segmented dynamic time warping for non-exact spoken queries (SpringerGermany, 2015), pp. 1691–1695.

  48. 48

    J. Proenca, L. Castela, F. Perdigao, in Proc. of MediaEval. The SPL-IT-UC Query by Example search on speech system for MediaEval 2015 (CEURGermany, 2015), pp. 471–473.

  49. 49

    J. Proenca, F. Perdigao, in Proc. of Interspeech. Segmented dynamic time warping for spoken Query-by-Example search (ISCAFrance, 2016), pp. 750–754.

  50. 50

    P. Lopez-Otero, L. Docio-Fernandez, C. Garcia-Mateo, in Proc. of MediaEval. GTM-UVigo systems for the Query-by-Example search on speech task at MediaEval 2015 (CEURGermany, 2015), pp. 521–523.

  51. 51

    P. Lopez-Otero, L. Docio-Fernandez, C. Garcia-Mateo, in Proc. of ASRU. Phonetic unit selection for cross-lingual Query-by-Example spoken term detection (IEEEUSA, 2015), pp. 223–229.

  52. 52

    A. Saxena, B. Yegnanarayana, in Proc. of Interspeech. Distinctive feature based representation of speech for Query-by-Example spoken term detection (ISCAFrance, 2015), pp. 3680–3684.

  53. 53

    P. Lopez-Otero, L. Docio-Fernandez, C. Garcia-Mateo, in Proc. of Interspeech. Compensating gender variability in query-by-example search on speech using voice conversion (ISCAFrance, 2017), pp. 2909–2913.

  54. 54

    A. Asaei, D. Ram, H. Bourlard, in Proc. of Interspeech. Phonological posterior hashing for query by example spoken term detection (ISCAFrance, 2018), pp. 2067–2071.

  55. 55

    M. Skacel, I. Szöke, in Proc. of MediaEval. BUT QUESST 2015 system description (CEURGermany, 2015), pp. 721–723.

  56. 56

    H. Chen, C. -C. Leung, L. Xie, B. Ma, H. Li, in Proc. of Interspeech. Unsupervised bottleneck features for low-resource Query-by-Example spoken term detection (ISCAFrance, 2016), pp. 923–927.

  57. 57

    Y. Yuan, C. -C. Leung, L. Xie, H. Chen, B. Ma, H. Li, in Proc. of ICASSP. Pairwise learning using multi-lingual bottleneck features for low-resource Query-by-Example spoken term detection (IEEEUSA, 2017), pp. 5645–5649.

  58. 58

    J. van Hout, V. Mitra, H. Franco, C. Bartels, D. Vergyri, in Proc. of ASRU. Tackling unseen acoustic conditions in query-by-example search using time and frequency convolution for multilingual deep bottleneck features (IEEEUSA, 2017), pp. 48–54.

  59. 59

    E. Yilmaz, J. van Hout, H. Franco, in Proc. of ASRU. Noise-robust exemplar matching for rescoring query-by-example search (IEEEUSA, 2017), pp. 1–7.

  60. 60

    Y. Yuan, C. -C. Leung, L. Xie, H. Chen, B. Ma, H. Li, in Proc. of ICASSP. Pairwise learning using multi-lingual bottleneck features for low-resource query-by-example spoken term detection (IEEEUSA, 2017), pp. 5645–5649.

  61. 61

    A. H. H. N. Torbati, J. Picone, in Proc. of Interspeech. A nonparametric bayesian approach for spoken term detection by example query (ISCAFrance, 2016), pp. 928–932.

  62. 62

    A. Popli, A. Kumar, in Proc. of MMSP. Query-by-example spoken term detection using low dimensional posteriorgrams motivated by articulatory classes (IEEEUSA, 2015), pp. 1–6.

  63. 63

    P. Yang, C. -C. Leung, L. Xie, B. Ma, H. Li, in Proc. of Interspeech. Intrinsic spectral analysis based on temporal context features for query-by-example spoken term detection (ISCAFrance, 2014), pp. 1722–1726.

  64. 64

    B. George, A. Saxena, G. Mantena, K. Prahallad, B. Yegnanarayana, in Proc. of Interspeech. Unsupervised query-by-example spoken term detection using bag of acoustic words and non-segmental dynamic time warping (ISCAFrance, 2014), pp. 1742–1746.

  65. 65

    D. Ram, A. Asaei, H. Bourlard, Sparse subspace modeling for query by example spoken term detection. IEEE/ACM Trans. Audio Speech Lang. Process.26(6), 1126–1139 (2018).

  66. 66

    P. Lopez-Otero, L. Docio-Fernandez, C. Garcia-Mateo, Finding relevant features for zero-resource query-by-example search on speech. Speech Commun.84:, 24–35 (2016).

  67. 67

    T. J. Hazen, W. Shen, C. M. White, in Proc. of ASRU. Query-by-example spoken term detection using phonetic posteriorgram templates (IEEEUSA, 2009), pp. 421–426.

  68. 68

    A. Abad, R. F. Astudillo, I. Trancoso, in Proc. of MediaEval. The L2F spoken web search system for MediaEval 2013 (CEURGermany, 2013), pp. 851–852.

  69. 69

    I. Szöke, M. Skácel, L. Burget, in Proc. of MediaEval. BUT QUESST 2014 system description (CEURGermany, 2014), pp. 621–622.

  70. 70

    I. Szöke, L. Burget, F. Grézl, J. H. Černocký, L. Ondel, in Proc. of ICASSP. Calibration and fusion of query-by-example systems - BUT SWS 2013 (IEEEUSA, 2014), pp. 7849–7853.

  71. 71

    A. Abad, L. J. Rodríguez-Fuentes, M. Penagarikano, A. Varona, G. Bordel, in Proc. of Interspeech. On the calibration and fusion of heterogeneous spoken term detection systems (ISCAFrance, 2013), pp. 20–24.

  72. 72

    P. Yang, H. Xu, X. Xiao, L. Xie, C. -C. Leung, H. Chen, J. Yu, H. Lv, L. Wang, S. J. Leow, B. Ma, E. S. Chng, H. Li, in Proc. of MediaEval. The NNI query-by-example system for MediaEval 2014 (CEURGermany, 2014), pp. 691–692.

  73. 73

    C. -C. Leung, L. Wang, H. Xu, J. Hou, V. T. Pham, H. Lv, L. Xie, X. Xiao, C. Ni, B. Ma, E. S. Chng, H. Li, in Proc. of Interspeech. Toward high-performance language-independent Query-by-Example spoken term detection for MediaEval 2015: Post-Evaluation analysis (ISCAFrance, 2016), pp. 3703–3707.

  74. 74

    H. Xu, J. Hou, X. Xiao, V. T. Pham, C. -C. Leung, L. Wang, V. H. Do, H. Lv, L. Xie, B. Ma, E. S. Chng, H. Li, in Proc. of ICASSP. Approximate search of audio queries by using DTW with phone time boundary and data augmentation (IEEEUSA, 2016), pp. 6030–6034.

  75. S. Oishi, T. Matsuba, M. Makino, A. Kai, in Proc. of NTCIR-12. Combining state-level and DNN-based acoustic matches for efficient spoken term detection in NTCIR-12 SpokenQuery&Doc-2 task (Japan Society for Promotion of Science, Japan, 2016), pp. 205–210.

  76. S. Oishi, T. Matsuba, M. Makino, A. Kai, in Proc. of Interspeech. Combining state-level spotting and posterior-based acoustic match for improved query-by-example spoken term detection (ISCA, France, 2016), pp. 740–744.

  77. M. Obara, K. Kojima, K. Tanaka, S.-W. Lee, Y. Itoh, in Proc. of Interspeech. Rescoring by combination of posteriorgram score and subword-matching score for use in Query-by-Example (ISCA, France, 2016), pp. 1918–1922.

  78. B. Taras, C. Nadeu, Audio segmentation of broadcast news in the Albayzin-2010 evaluation: overview, results, and discussion. EURASIP J. Audio Speech Music Process. 2011(1), 1–10 (2011).

  79. M. Zelenák, H. Schulz, J. Hernando, Speaker diarization of broadcast news in Albayzin 2010 evaluation campaign. EURASIP J. Audio Speech Music Process. 2012(19), 1–9 (2012).

  80. L. J. Rodríguez-Fuentes, M. Penagarikano, A. Varona, M. Díez, G. Bordel, in Proc. of Interspeech. The Albayzin 2010 Language Recognition Evaluation (ISCA, France, 2011), pp. 1529–1532.

  81. J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio-Fernandez, C. Garcia-Mateo, A. Cardenal, J. D. Echeverry-Correa, A. Coucheiro-Limeres, J. Olcoz, A. Miguel, Spoken term detection ALBAYZIN 2014 evaluation: overview, systems, results, and discussion. EURASIP J. Audio Speech Music Process. 2015(21), 1–27 (2015).

  82. J. Tejedor, D. T. Toledano, X. Anguera, A. Varona, L. F. Hurtado, A. Miguel, J. Colás, Query-by-example spoken term detection ALBAYZIN 2012 evaluation: overview, systems, results, and discussion. EURASIP J. Audio Speech Music Process. 2013(23), 1–17 (2013).

  83. J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio-Fernandez, C. Garcia-Mateo, Comparison of ALBAYZIN query-by-example spoken term detection 2012 and 2014 evaluations. EURASIP J. Audio Speech Music Process. 2016(1), 1–19 (2016).

  84. D. Castán, D. Tavarez, P. Lopez-Otero, J. Franco-Pedroso, H. Delgado, E. Navas, L. Docio-Fernández, D. Ramos, J. Serrano, A. Ortega, E. Lleida, Albayzín-2014 evaluation: audio segmentation and classification in broadcast news domains. EURASIP J. Audio Speech Music Process. 2015(33), 1–9 (2015).

  85. F. Méndez, L. Docío, M. Arza, F. Campillo, in Proc. of FALA. The Albayzin 2010 text-to-speech evaluation (ISCA, France, 2010), pp. 317–340.

  86. J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio-Fernandez, L. Serrano, I. Hernaez, A. Coucheiro-Limeres, J. Ferreiros, J. Olcoz, J. Llombart, Albayzin 2016 spoken term detection evaluation: an international open competitive evaluation in Spanish. EURASIP J. Audio Speech Music Process. 2017(22), 1–23 (2017).

  87. J. Tejedor, D. T. Toledano, P. Lopez-Otero, L. Docio-Fernandez, J. Proença, F. Perdigão, F. García-Granada, E. Sanchis, A. Pompili, A. Abad, Albayzin query-by-example spoken term detection 2016 evaluation. EURASIP J. Audio Speech Music Process. 2018(2), 1–25 (2018).

  88. J. Billa, K. W. Ma, J. W. McDonough, G. Zavaliagkos, D. R. Miller, K. N. Ross, A. El-Jaroudi, in Proc. of Eurospeech. Multilingual speech recognition: the 1996 Byblos Callhome system (ISCA, France, 1997), pp. 363–366.

  89. H. Cuayahuitl, B. Serridge, in Proc. of MICAI. Out-of-vocabulary word modeling and rejection for Spanish keyword spotting systems (Springer, Germany, 2002), pp. 156–165.

  90. M. Killer, S. Stuker, T. Schultz, in Proc. of Eurospeech. Grapheme based speech recognition (ISCA, France, 2003), pp. 3141–3144.

  91. J. Tejedor, Contributions to keyword spotting and spoken term detection for information retrieval in audio mining. PhD thesis (Universidad Autónoma de Madrid, Madrid, 2009).

  92. L. Burget, P. Schwarz, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiat, D. Povey, A. Rastrow, R. C. Rose, S. Thomas, in Proc. of ICASSP. Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models (IEEE, USA, 2010), pp. 4334–4337.

  93. J. Tejedor, D. T. Toledano, D. Wang, S. King, J. Colás, Feature analysis for discriminative confidence estimation in spoken term detection. Comput. Speech Lang. 28(5), 1083–1114 (2014).

  94. J. Li, X. Wang, B. Xu, in Proc. of Interspeech. An empirical study of multilingual and low-resource spoken term detection using deep neural networks (ISCA, France, 2014), pp. 1747–1751.

  95. M. Hazewinkel, Student test (Kluwer Academic, Denmark, 1994).

  96. NIST, The spoken term detection (STD) 2006 Evaluation Plan. https://catalog.ldc.upenn.edu/docs/LDC2011S02/std06-evalplan-v10.pdf. Accessed Apr 2019.

  97. J. G. Fiscus, J. Ajot, J. S. Garofolo, G. Doddington, in Proc. of SSCS. Results of the 2006 spoken term detection evaluation (ACM, USA, 2007), pp. 45–50.

  98. A. Martin, G. Doddington, T. Kamm, M. Ordowski, M. Przybocki, in Proc. of Eurospeech. The DET curve in assessment of detection task performance (ISCA, France, 1997), pp. 1895–1898.

  99. NIST, Evaluation Toolkit (STDEval) Software. https://www.nist.gov/itl/iad/mig/tools. Accessed Apr 2019.

  100. International Telecommunication Union (ITU), ITU-T Recommendation P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications. http://www.itu.int/rec/T-REC-P.563/en. Accessed Apr 2019.

  101. E. Lleida, A. Ortega, A. Miguel, V. Bazán, C. Pérez, M. Zotano, A. de Prada, RTVE2018 database description. Vivolab and Corporación Radiotelevisión Española, Zaragoza. http://catedrartve.unizar.es/reto2018/RTVE2018DB.pdf. Accessed Apr 2019.

  102. M. V. Matos, Diseño y compilación de un corpus multimodal de análisis pragmático para la aplicación a la enseñanza del español. PhD thesis (Universidad Autónoma de Madrid, Madrid, 2017).

  103. N. Rajput, F. Metze, in Proc. of MediaEval. Spoken web search (CEUR, Germany, 2011), pp. 1–2.

  104. F. Metze, E. Barnard, M. Davel, C. van Heerden, X. Anguera, G. Gravier, N. Rajput, in Proc. of MediaEval. The spoken web search task (CEUR, Germany, 2012), pp. 41–42.

  105. X. Anguera, F. Metze, A. Buzo, I. Szöke, L. J. Rodriguez-Fuentes, in Proc. of MediaEval. The spoken web search task (CEUR, Germany, 2013), pp. 921–922.

  106. X. Anguera, L. J. Rodriguez-Fuentes, I. Szöke, A. Buzo, F. Metze, in Proc. of MediaEval. Query by Example Search on Speech at MediaEval 2014 (CEUR, Germany, 2014), pp. 351–352.

  107. I. Szöke, L. J. Rodriguez-Fuentes, A. Buzo, X. Anguera, F. Metze, J. Proenca, M. Lojka, X. Xiong, in Proc. of MediaEval. Query by Example Search on Speech at MediaEval 2015 (CEUR, Germany, 2015), pp. 81–82.

  108. T. Akiba, H. Nishizaki, H. Nanjo, G. J. F. Jones, in Proc. of NTCIR-11. Overview of the NTCIR-11 SpokenQuery&Doc task (Japan Society for Promotion of Science, Japan, 2014), pp. 1–15.

  109. T. Akiba, H. Nishizaki, H. Nanjo, G. J. F. Jones, in Proc. of NTCIR-12. Overview of the NTCIR-12 SpokenQuery&Doc-2 task (Japan Society for Promotion of Science, Japan, 2016), pp. 1–13.

  110. P. Schwarz, Phoneme recognition based on long temporal context. PhD thesis (FIT, BUT, Brno, Czech Republic, 2008).

  111. A. Varona, M. Penagarikano, L. J. Rodríguez-Fuentes, G. Bordel, in Proc. of Interspeech. On the use of lattices of time-synchronous cross-decoder phone co-occurrences in a SVM-phonotactic language recognition system (ISCA, France, 2011), pp. 2901–2904.

  112. F. Eyben, M. Wollmer, B. Schuller, in Proc. of ACM Multimedia (MM). OpenSMILE - the Munich versatile and fast open-source audio feature extractor (ACM, USA, 2010), pp. 1459–1462.

  113. Y. Zhang, J. R. Glass, in Proc. of ASRU. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams (IEEE, USA, 2009), pp. 398–403.

  114. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, in Proc. of ASRU. The KALDI speech recognition toolkit (IEEE, USA, 2011).

  115. M. Muller, Information retrieval for music and motion (Springer, New York, 2007).

  116. I. Szöke, M. Skacel, L. Burget, in Proc. of MediaEval. BUT QUESST 2014 system description (CEUR, Germany, 2014), pp. 621–622.

  117. J. Ponte, W. Croft, in Proc. of ACM SIGIR. A language modeling approach to information retrieval (ACM, USA, 1998), pp. 275–281.

  118. J. Parapar, A. Freire, A. Barreiro, in Proc. of ECIR. Revisiting n-gram based models for retrieval in degraded large collections (Springer, Germany, 2009), pp. 680–684.

  119. E. Rodríguez-Banga, C. Garcia-Mateo, F. Méndez-Pazó, M. González-González, C. Magariños, in Proc. of Iberspeech. Cotovía: an open source TTS for Galician and Spanish (2012), pp. 308–315.

  120. C. Manning, P. Raghavan, H. Schutze, Introduction to information retrieval (Cambridge University Press, Cambridge, 2008).

  121. A. Abad, L. J. Rodríguez-Fuentes, M. Peñagarikano, A. Varona, G. Bordel, in Proc. of Interspeech. On the calibration and fusion of heterogeneous spoken term detection systems (ISCA, France, 2013), pp. 20–24.

  122. N. Brummer, D. van Leeuwen, in Proc. of IEEE Odyssey 2006: The Speaker and Language Recognition Workshop. On calibration of language recognition scores (IEEE, USA, 2006), pp. 1–8.

  123. N. Brummer, E. de Villiers, The BOSARIS Toolkit user guide: theory, algorithms and code for binary classifier score processing (Agnitio Labs). https://sites.google.com/site/nikobrummer. Accessed Apr 2019.

  124. J. Wiseman, Python interface to the WebRTC (https://webrtc.org/) voice activity detector (VAD). https://github.com/wiseman/py-webrtcvad. Accessed Apr 2019.

  125. A. Silnova, P. Matejka, O. Glembek, O. Plchot, O. Novotny, F. Grezl, P. Schwarz, L. Burget, J. H. Cernocky, in Proc. of Odyssey. BUT/Phonexia bottleneck feature extractor (IEEE, USA, 2018), pp. 283–287.

  126. C. Cieri, D. Miller, K. Walker, in Proc. of LREC. The Fisher Corpus: a resource for the next generations of speech-to-text (ELRA, Belgium, 2004), pp. 69–71.

  127. Intelligence Advanced Research Projects Activity (IARPA), Babel Program. https://www.iarpa.gov/index.php/research-programs/babel. Accessed Apr 2019.

  128. L. J. Rodriguez-Fuentes, A. Varona, M. Penagarikano, G. Bordel, M. Diez, in Proc. of ICASSP. High-performance query-by-example spoken term detection on the SWS 2013 evaluation (IEEE, USA, 2014), pp. 7819–7823.

  129. A. Abad, L. J. Rodriguez-Fuentes, M. Penagarikano, A. Varona, M. Diez, G. Bordel, in Proc. of Interspeech. On the calibration and fusion of heterogeneous spoken term detection systems (ISCA, France, 2013), pp. 20–24.

  130. C. Garcia-Mateo, J. Dieguez-Tirado, L. Docio-Fernandez, A. Cardenal-Lopez, in Proc. of LREC. Transcrigal: a bilingual system for automatic indexing of broadcast news (ELRA, Belgium, 2004), pp. 2061–2064.

  131. A. Moreno, L. Campillos, in Proc. of Iberspeech. MAVIR: a corpus of spontaneous formal speech in Spanish and English (ISCA, France, 2004), pp. 224–230.

  132. A. Stolcke, in Proc. of Interspeech. SRILM - an extensible language modeling toolkit (ISCA, France, 2002), pp. 901–904.

  133. G. Chen, S. Khudanpur, D. Povey, J. Trmal, D. Yarowsky, O. Yilmaz, in Proc. of ICASSP. Quantifying the value of pronunciation lexicons for keyword search in low resource languages (IEEE, USA, 2013), pp. 8560–8564.

  134. V. T. Pham, N. F. Chen, S. Sivadas, H. Xu, I.-F. Chen, C. Ni, E. S. Chng, H. Li, in Proc. of SLT. System and keyword dependent fusion for spoken term detection (IEEE, USA, 2014), pp. 430–435.

  135. D. Can, M. Saraclar, Lattice indexing for spoken term detection. IEEE Trans. Audio Speech Lang. Process. 19(8), 2338–2347 (2011).

  136. D. R. H. Miller, M. Kleber, C.-L. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz, H. Gish, in Proc. of Interspeech. Rapid and accurate spoken term detection (ISCA, France, 2007), pp. 314–317.

  137. G. Chen, O. Yilmaz, J. Trmal, D. Povey, S. Khudanpur, in Proc. of ASRU. Using proxies for OOV keywords in the keyword search task (IEEE, USA, 2013), pp. 416–421.


Author information

Authors’ contributions

JT and DTT designed and prepared the QbE STD evaluation, built the E-DTW system, and carried out the post-evaluation analysis. PL-O and LD-F built the A-Hybrid DTW+LVCSR, B-Fusion DTW, C-PhonePost DTW, D-LVCSR, and K-Text STD systems. MP and LJR-F built the F-Combined DTW, G-Super-BNF DTW, H-Multilingual-BNF DTW, I-Monoph.-BNF DTW, and J-Triph.-BNF DTW systems and carried out the primary system fusion. AM-S provided the MAVIR and COREMAH databases, collaborated in labeling the new data for the evaluation, and provided linguistic support. All the authors contributed to the final discussion of the results. The main contributions of this paper are as follows: (1) Systems submitted to the fourth Query-by-Example Spoken Term Detection evaluation for the Spanish language are presented. (2) A new challenging database based on Spanish broadcast news has been used. (3) Analysis of system results and primary system fusion for the three different domains are presented. All authors read and approved the final manuscript.

Authors’ information

Not applicable.

Correspondence to Javier Tejedor.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Tejedor, J., Toledano, D.T., Lopez-Otero, P. et al. Search on speech from spoken queries: the Multi-domain International ALBAYZIN 2018 Query-by-Example Spoken Term Detection Evaluation. J AUDIO SPEECH MUSIC PROC. 2019, 13 (2019) doi:10.1186/s13636-019-0156-x


Keywords

  • Query-by-Example Spoken Term Detection
  • International evaluation
  • Spanish language
  • Search on speech