ALBAYZIN Query-by-example Spoken Term Detection 2016 evaluation

Query-by-example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given an acoustic (spoken) query containing the term of interest as the input. This paper presents the systems submitted to the ALBAYZIN QbE STD 2016 Evaluation held as a part of the ALBAYZIN 2016 Evaluation Campaign at the IberSPEECH 2016 conference. Special attention was given to the evaluation design so that a thorough post-analysis of the main results could be carried out. Two different Spanish speech databases, which cover different acoustic and language domains, were used in the evaluation: the MAVIR database, which consists of a set of talks from workshops, and the EPIC database, which consists of a set of European Parliament sessions in Spanish. We present the evaluation design, both databases, the evaluation metric, the systems submitted to the evaluation, the results, and a thorough analysis and discussion. Four different research groups participated in the evaluation, and a total of eight template matching-based systems were submitted. We compare the systems submitted to the evaluation and make an in-depth analysis based on some properties of the spoken queries, such as query length, single-word/multi-word queries, and in-language/out-of-language queries.


Introduction
The huge amount of heterogeneous speech data stored in audio and audiovisual repositories makes it necessary to develop efficient methods for speech information retrieval. There are different speech information retrieval tasks, including spoken document retrieval (SDR), keyword spotting (KWS), spoken term detection (STD), and query-by-example spoken term detection (QbE STD).
Spoken term detection aims at finding individual words or sequences of words within audio archives. It is based on a text-based input, commonly the word/phone transcription of the search term. For this reason, STD is also called text-based STD. Query-by-example spoken term detection is similar, but is based on an acoustic (spoken) input. In QbE STD, we consider the scenario in which the user has found a segment of speech which contains the terms of interest. QbE STD has been mainly addressed from three different approaches: methods based on the word/subword transcription of the query, methods based on template matching of features, and hybrid approaches. These approaches are described below.

Methods based on the word/subword transcription of the query
These methods make use of the text-based STD technology. In order to do this, they need to transcribe the query into word/subword units. The errors produced in this transcription can lead to significant performance degradation. [1,2] employ a Viterbi-based search on Hidden Markov Models (HMMs). [3][4][5][6] employ dynamic time warping (DTW) or variants of DTW, e.g., non-segmental dynamic time warping (NS-DTW) from phone recognition. [7][8][9][10] employ word and syllable speech recognizers. Hou et al. [11] employ a phone-based speech recognizer and a weighted finite-state transducer (WFST)-based search.
Vavrek et al. [12] use multilingual phone-based speech recognition, from supervised and unsupervised acoustic models, and sequential dynamic time warping for search.

Methods based on template matching of features
Methods based on template matching of features were found to outperform subword transcription-based techniques in QbE STD [34]. This approach can be employed effectively to build language-independent STD systems, since prior knowledge of the language involved in the speech data is not necessary.

Hybrid approach
A powerful way of enhancing performance relies on building hybrid (fused) systems that combine the two individual methods. [35][36][37] propose a logistic regression-based fusion of acoustic keyword spotting and DTW-based systems using language-dependent phoneme recognizers. [38][39][40][41] use a logistic regression-based fusion on DTW- and phone-based systems. Oishi et al. [42] use a DTW-based search at the HMM state-level from syllables obtained from a word-based speech recognizer and a deep neural network (DNN) posteriorgram-based rescoring, and [43] adds a logistic regression-based approach for detection rescoring. Obara et al. [44] employ a syllable-based speech recognizer and dynamic programming at the triphone-state level to output detections and DNN posteriorgram-based rescoring.

Motivation and organization of this paper
The increasing interest from within the speech research community in speech information retrieval has allowed the successful organization of several international evaluations related to SDR [45,46], STD [47,48], and QbE STD [49,50]. In 2012 and 2014, the first two QbE STD evaluations in Spanish were held in the context of the ALBAYZIN 2012 and 2014 evaluation campaigns. These campaigns are internationally open sets of evaluations supported by the Spanish Network of Speech Technologies (RTTH) and the ISCA Special Interest Group on Iberian Languages (SIG-IL), which have been held every 2 years since 2006. These evaluation campaigns provide an objective mechanism for the comparison of different systems and the promotion of research into different speech technologies such as audio segmentation [51], speaker diarization [52], language recognition [53], spoken term detection [54], query-by-example spoken term detection [55,56], and speech synthesis [57].
The Spanish language is widespread throughout the world, and significant research has been conducted into it for ASR [58][59][60], KWS [61,62], and STD [62][63][64]. This, combined with the success of the ALBAYZIN QbE STD evaluations held in 2012 and 2014, has encouraged us to organize a new QbE STD evaluation for the 2016 ALBAYZIN evaluation campaign, which aims to evaluate the progress in this technology in Spanish. Compared with the previous evaluations, the third ALBAYZIN QbE STD evaluation incorporates stricter rules regarding the evaluation queries, e.g., in-vocabulary (INV) vs. out-of-vocabulary (OOV) queries, and employs two different databases to cover different acoustic conditions and topics to provide a more comprehensive evaluation. In addition, all the queries and the database employed in the QbE STD evaluation held in 2014 are kept, thus enabling a comparison between the systems submitted to both evaluations on the common set of queries.
The remainder of the paper is organized as follows: The following section presents a description of the QbE STD evaluation. Section 3 presents the different systems submitted to the evaluation. The results and discussion are then presented, and the paper is concluded in the final section.

Evaluation description
The ALBAYZIN QbE STD 2016 evaluation involves searching for audio content within audio content using an audio content query. The input to the system is an acoustic example per query; therefore, prior knowledge of the correct word/subword transcription corresponding to each query is not available. The target participants are the research groups or companies working on speech indexing, speech retrieval, and speech recognition.
The evaluation consists of searching a development query list within development speech data, and searching two different test query lists within two different sets of test speech data (MAVIR and EPIC databases, which will be explained later). The evaluation result ranking is based on the system performance when searching the query terms within the test speech data corresponding to the MAVIR database. Any kind of data, except for the MAVIR test data and the EPIC data, can be used by the participants for system training and development. The systems could be fine-tuned for each of the two databases individually. To facilitate the system construction, the participants were provided with MAVIR data, which can only be used as defined by the training, development, and test subsets.
This evaluation defines two different sets of queries for each database: the in-vocabulary query set and the out-of-vocabulary query set. The OOV query set was defined to simulate the out-of-vocabulary words of a Large Vocabulary Continuous Speech Recognition (LVCSR) system. If the participants employed an LVCSR system for processing the audio, these OOV queries must be removed from the system dictionary. Therefore, other methods must be used for searching the OOV queries. Conversely, the INV queries can appear in the dictionary of the LVCSR system.
The evaluation participants could submit a primary system and up to two contrastive systems. No manual intervention was allowed to generate the final output file, and hence, all the systems had to be fully automatic. Listening to the test data, or any other human interaction with the test data, was forbidden before all the evaluation results had been sent to the participants. The standard XML-based format corresponding to the National Institute of Standards and Technology (NIST) STD evaluation tool [65] was used to build the system output file.
The participants were given approximately 3 months to construct the system. The training and development data were released by the end of June 2016. The test data were released at the beginning of September 2016. The final system submission was due by mid-October 2016. The evaluation results were discussed at the IberSPEECH 2016 conference at the end of November 2016.

Evaluation metric
In QbE STD, a hypothesized occurrence is called a detection; if the detection corresponds to an actual occurrence, it is called a hit; otherwise it is a false alarm. If an actual occurrence is not detected, this is called a miss. The actual term-weighted value (ATWV) proposed by NIST [65] was used as the main metric for the evaluation. This metric integrates the hit rate and the false alarm rate of each query into a single metric and is then averaged over all the queries:

$$\mathrm{ATWV} = \frac{1}{|\Delta|} \sum_{K \in \Delta} \left( \frac{N^{K}_{\mathrm{hit}}}{N^{K}_{\mathrm{true}}} - \beta \, \frac{N^{K}_{\mathrm{FA}}}{T - N^{K}_{\mathrm{true}}} \right),$$

where $\Delta$ denotes the set of queries and $|\Delta|$ is the number of queries in this set. $N^{K}_{\mathrm{hit}}$ and $N^{K}_{\mathrm{FA}}$ represent the numbers of hits and false alarms of query $K$, respectively, and $N^{K}_{\mathrm{true}}$ is the number of actual occurrences of $K$ in the audio. $T$ denotes the audio length in seconds, and $\beta$ is a weight factor set at 999.9, as in the ATWV proposed by NIST [66]. This weight factor causes an emphasis to be placed on recall compared to the precision in the ratio 10:1.
The ATWV represents the term-weighted value (TWV) for the threshold set by the system (usually tuned on development data). An additional metric, called maximum term-weighted value (MTWV) [65], can also be used to evaluate the performance of a QbE STD system. The MTWV is the ATWV the system would obtain with the optimum threshold. The MTWV results are presented to evaluate threshold selection.
In addition to the ATWV and the MTWV, NIST also proposed a detection error trade-off (DET) curve [67] to evaluate the system performance at various miss/FA ratios. Although the DET curves were not used for the evaluation, they are also presented in this paper for a comparison of the systems.
The NIST STD evaluation tool [68] was employed to compute the MTWV, the ATWV, and the DET curves.
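As an illustration of how the metric behaves, the following minimal sketch computes the ATWV from per-query counts, using the per-query value N_hit/N_true − β·N_FA/(T − N_true) averaged over the queries. It is not the NIST tool: the function name `atwv`, the input layout, and the choice of skipping queries without reference occurrences are assumptions made for this example.

```python
def atwv(per_query, audio_seconds, beta=999.9):
    """Average term-weighted value at the decision threshold chosen by the system.

    per_query maps a query id to a tuple (n_hit, n_fa, n_true).
    audio_seconds is the total audio length T in seconds.
    """
    values = []
    for n_hit, n_fa, n_true in per_query.values():
        if n_true == 0:
            # Assumption: queries with no reference occurrence are skipped,
            # because the hit rate is undefined for them.
            continue
        hit_rate = n_hit / n_true
        fa_rate = n_fa / (audio_seconds - n_true)
        values.append(hit_rate - beta * fa_rate)
    if not values:
        raise ValueError("no query has a reference occurrence")
    return sum(values) / len(values)


# Hypothetical counts for two queries searched in one hour of audio.
counts = {"query_a": (8, 2, 10), "query_b": (5, 0, 5)}
score = atwv(counts, audio_seconds=3600.0)
```

A system that finds every occurrence without false alarms obtains 1.0, whereas a system that outputs nothing obtains 0.0; false alarms can drive the value below zero.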

Database
Two different databases that comprise different acoustic conditions and domains were employed for the evaluation. For comparison, the same MAVIR database employed in the ALBAYZIN QbE STD evaluation held in 2014 was used. The second database was the EPIC database distributed by ELRA. For the MAVIR database, three separate datasets, i.e., training, development, and test, were given to the participants. For the EPIC database, only the test data were provided. The MAVIR and EPIC data could only be used for the intended purpose of the corresponding subset (training, development, and test). The use of two different domains made it possible to compare the system performance across domains and to examine the performance degradation of the systems depending on the nature of the speech data, the acoustic conditions, the training/development and testing mismatch, and the over-fitting issues.
The MAVIR database consists of a set of Spanish talks taken from the MAVIR workshops held in 2006, 2007, and 2008 that contain speakers from Spain and Latin America.
The MAVIR Spanish data consist of spontaneous speech files, each containing different speakers, amounting to approximately 7 h of speech. These data were further divided for the purpose of this evaluation into training, development, and test sets. The data were also manually annotated in an orthographic form, but the timestamps were only set for the phrase boundaries. To prepare the data for the evaluation, the organizers manually added the timestamps for the approximately 1600 occurrences of the spoken terms used in the development and test evaluation sets. The training data were made available to the participants and included the orthographic transcription and the timestamps for the phrase boundaries.
The MAVIR speech data were originally recorded in several audio formats, e.g., pulse code modulation (PCM) mono and stereo, MP3, 22.05 kHz, and 48 kHz. The data were converted to PCM, 16 kHz, single channel, 16 bits per sample using the SoX tool. Except for one, all the recordings were made with the same equipment, a Digital TASCAM DAT model DA-P1. Different microphones were used for the different recordings. In most cases, they were tabletop or floor standing microphones, but in one case, a lavalier microphone was used. The distance from the mouth of the speaker to the microphone varied and was not particularly controlled but in most cases was less than 50 cm. The recordings contain spontaneous speech from the MAVIR workshops in a real setting. The recordings were made in large conference rooms with a capacity of over a hundred people, and a large number of people were in the conference room. This poses additional challenges including background noise, in particular 'babble noise', and reverberation. The realistic settings and the different nature of the spontaneous speech in this database made it appealing and challenging enough for the evaluation. Table 1 includes some database features such as the division in training, development, and test data of the speech files, the number of word occurrences, the file duration, and the P.563 Mean Opinion Score (MOS) [69], which gives an indication of the quality of each speech file. The P.563 standard estimates the quality of the human voice without the need for a reference signal. The MOS values are in the range of 1-5, 1 representing the worst quality and 5 the best [69].
The EPIC database comprises speeches from the European Parliament recorded in 2004 in English, Spanish, and Italian, together with their corresponding simultaneous translations into other languages. Only the original Spanish speeches, which consist of more than 1.5 h of clean speech, were used for the evaluation as a test set. To evaluate the systems submitted to the evaluation, the organizers manually added the timestamps for the approximately 1100 occurrences of the spoken terms used in the test set. The original speeches in the EPIC database were recorded as video files stored in a .mpeg1 format. Therefore, the original Spanish speeches were extracted from the corresponding Spanish video files, and converted to PCM, 16 kHz, single channel, 16 bits per sample, using the ffmpeg tool. Table 2 includes the Spanish EPIC database with the same database features presented in Table 1.

Query list selection
All the queries selected for the development and test sets aimed to build a realistic scenario for QbE STD, by including high-occurrence queries, low-occurrence queries, in-language (INL) queries, out-of-language (OOL) queries, single-word and multi-word queries, in-vocabulary and out-of-vocabulary queries, and queries of different lengths. A query may not have any occurrence or may appear once or more in the speech data. Table 3 includes some features of the development and test query lists such as the number of INL and OOL queries, the number of single-word and multi-word queries, and the number of INV and OOV queries, together with the number of occurrences of each set in the corresponding speech data. It must be noted that a multi-word query was considered OOV in cases where any of the words that formed the term of the query were OOV. The test EPIC query list only contained easy terms, i.e., no OOL and multi-word queries were included, because this corpus was aimed at evaluating the systems submitted to the evaluation in a different domain.

Comparison with other QbE STD evaluations
The evaluations that are most similar to the ALBAYZIN QbE STD are the MediaEval 2011 [70], 2012 [71], and 2013 [49] Spoken Web Search evaluations. In 2014, the Query by Example Search on Speech task (QUESST) held at MediaEval differed from the previous evaluations in that it was a Spoken Document Retrieval task, i.e., no query timestamps had to be output by the systems, and only the audio files that contained the query had to be retrieved [46]. In 2015, the QUESST was similar to that of 2014, but the systems had to provide a score per query and utterance [72]. 2016 was the last year that the search-on-speech task was included in MediaEval, by means of the zero-cost speech recognition task. This consisted of building LVCSR systems from low resources [73]. The task in the MediaEval 2011, 2012, and 2013 Spoken Web Search and the ALBAYZIN evaluations was the same, i.e., searching speech content from speech queries, but they differed in several aspects. This makes it difficult to compare the results obtained in the ALBAYZIN QbE STD evaluation to the previous MediaEval evaluations. The most important difference is the nature of the audio content used for the evaluations. In the MediaEval evaluations, the speech was typically telephone speech, either conversational or read and elicited speech, or speech recorded with in-room microphones. In the ALBAYZIN evaluations, the audio consisted of microphone recordings of real talks in workshops that took place in large conference rooms in the presence of an audience. The microphones, the conference rooms, and the recording conditions changed from one recording to another. The microphones were not close-talking microphones but were mainly tabletop or floor standing microphones.
In addition, the MediaEval evaluations dealt with Indian- and African-derived languages, as well as Albanian, Basque, Czech, non-native English, Romanian, and Slovak languages, while the ALBAYZIN evaluations deal only with Spanish.
In addition to the MediaEval evaluations, a new round of QbE STD evaluations was organized with the NTCIR-11 [74] and NTCIR-12 [75] conferences. The data used in these evaluations contained spontaneous speech in Japanese provided by the National Institute for Japanese language and spontaneous speech which was recorded during seven editions of the Spoken Document Processing Workshop. As additional information, these evaluations provided participants with the results of a voice activity detection system on the input speech data, the manual transcription of the speech data, and the output of an LVCSR system. Although the ALBAYZIN QbE STD evaluation could be considered to be similar in terms of speech nature to the NTCIR QbE STD evaluations, i.e., the speech was recorded in real workshops, the ALBAYZIN evaluations make use of other languages and define disjointed development and test query lists to measure the generalization capability of the systems. Table 4 summarizes the main characteristics of the MediaEval QbE STD evaluations, the NTCIR-11 and NTCIR-12 QbE STD evaluations, the previous ALBAYZIN QbE STD evaluations, and the ALBAYZIN QbE STD 2016 evaluation.

Systems
Eight different systems were submitted to the ALBAYZIN QbE STD 2016 evaluation from four different research groups (see Table 5). Some were submitted in time for the evaluation, and some were submitted as post-evaluation systems and so were not included in the competition. All were based on a feature representation of the queries and the utterances and a DTW-based search to hypothesize detections. In addition, a text-based STD system was also presented to compare performance when using written and acoustic (spoken) queries.

A-GTM-UVigo-Three feature+DTW-based fusion QbE STD system (A-GTM-UVigo-3-fea+DTW fusion)
The architecture of this system is shown in Fig. 1; it consists of a fusion of three different DTW-based QbE STD systems that employ different approaches for feature extraction.

Feature extraction
Given a query Q with n frames (and equivalently, an utterance U with m frames), three speech representations that result in a set $Q = \{q_1, \ldots, q_n\}$ of n vectors of dimension D (and equivalently, a set $U = \{u_1, \ldots, u_m\}$ of m vectors of dimension D) are based on:
• Phoneme posteriorgram + phoneme unit selection: This speech representation relies on phoneme posteriorgrams [34]. Given a query/utterance and a phoneme recognizer with P phonetic units, the posterior probability of each phonetic unit is computed for each frame, leading to a set of vectors of dimension P that represent the probability of each phonetic unit at every frame. To construct a wide-coverage language-independent QbE STD system, the Czech, English, Hungarian, and Russian phoneme recognizers developed by the Brno University of Technology (BUT) [76] are used to obtain the phoneme posteriorgrams; in these decoders, each phonetic unit has three different states and a posterior probability is output for each of them, so they are combined to obtain one posterior probability for each unit [17]. After obtaining the posteriors, Gaussian softening is applied to obtain Gaussian-distributed probabilities [77]. Then, the phoneme unit selection strategy described in [25] is applied.
• Acoustic features + feature selection: Aiming to obtain as much information as possible from the speech signals, a large set of features, summarized in Table 6, are used to represent the queries and utterances; these features, obtained using the OpenSMILE feature extraction toolkit [78], are extracted every 10 ms using a 25-ms window, except for the F0, probability of voicing, jitter, shimmer, and harmonics-to-noise ratio (HNR), where a 60-ms window is used due to its best performance in preliminary work. Finally, the feature selection technique described in [79] is applied to obtain the most discriminative features.
• Gaussian posteriorgrams: The Gaussian posteriorgrams [80] are used to represent the queries and the utterances. Given a GMM with G Gaussians, the posterior probability of each Gaussian is computed for each time frame, leading to a set of vectors of dimension G that represent the probability of each Gaussian at every time instant. In this system, 19 Mel-frequency Cepstral Coefficients (MFCCs) are extracted from the acoustic signals, accompanied by their energy, delta, and double delta coefficients due to their best performance in previous work. The feature extraction and the Gaussian posteriorgram computation are carried out using the Kaldi toolkit [81].
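To make the last representation concrete, the sketch below computes a Gaussian posteriorgram for a sequence of feature frames in plain Python. It is only an illustration and not the Kaldi-based implementation used in the system: the function name `gaussian_posteriorgram`, the diagonal covariance matrices, and the toy one-dimensional model are assumptions made for this example.

```python
import math


def gaussian_posteriorgram(frames, weights, means, variances):
    """Return, for every frame, the posterior probability of each Gaussian.

    Assumption: the GMM has diagonal covariances, so `variances` holds one
    variance per dimension and per Gaussian.
    """
    posteriorgram = []
    for x in frames:
        log_joint = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                ll += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
            log_joint.append(ll)
        # Normalize in the log domain for numerical stability.
        peak = max(log_joint)
        exps = [math.exp(v - peak) for v in log_joint]
        total = sum(exps)
        posteriorgram.append([e / total for e in exps])
    return posteriorgram


# Toy model with G = 2 Gaussians over one-dimensional features.
post = gaussian_posteriorgram(
    frames=[[0.0], [5.0], [10.0]],
    weights=[0.5, 0.5],
    means=[[0.0], [10.0]],
    variances=[[1.0], [1.0]],
)
```

Each row of `post` sums to one, so a query with n frames becomes n vectors of dimension G that can be compared frame by frame in the search stage.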

Search
The search stage uses the S-DTW algorithm [82], which is a variant of the standard DTW. For the S-DTW, a cost matrix $M \in \mathbb{R}^{n \times m}$ must first be defined, in which the rows and the columns correspond to the query and the utterance frames, respectively:

$$M(i,j) = \begin{cases} c(q_i,u_j) & \text{if } i = 1 \\ c(q_i,u_j) + M(i-1,1) & \text{if } i > 1,\ j = 1 \\ c(q_i,u_j) + M^{*}(i,j) & \text{otherwise} \end{cases}$$

where $c(q_i,u_j)$ is a function that defines the cost between the query vector $q_i$ and the utterance vector $u_j$, and

$$M^{*}(i,j) = \min\left( M(i-1,j),\ M(i-1,j-1),\ M(i,j-1) \right),$$

which implies that only horizontal, vertical, and diagonal path movements are allowed. Pearson's correlation coefficient $r$ [83] is used as a cost function by mapping it into the interval [0,1] applying the following transformation:

$$c(q_i,u_j) = \frac{1 - r(q_i,u_j)}{2}.$$

Once the matrix M is computed, the end of the best warping path between Q and U is obtained as follows:

$$b^{*} = \arg\min_{b \in \{1,\ldots,m\}} M(n,b).$$

The starting point of the path ending at $b^{*}$, namely $a^{*}$, is computed by backtracking, hence obtaining the best warping path $P(Q,U) = \{p_1, \ldots, p_k, \ldots, p_K\}$, where $p_k = (i_k, j_k)$, i.e., the kth element of the path is formed by $q_{i_k}$ and $u_{j_k}$, and K is the length of the warping path.
A query Q may appear several times in an utterance U, especially if U is a long recording. Therefore, not only must the best warping path be detected, but also others that are less likely. One approach to overcome this issue consists of detecting a given number of candidate matches $n_c$: every time a warping path that ends at frame $b^{*}$ is detected, $M(n,b^{*})$ is set to ∞ to ignore this element in the future.
A confidence score must be assigned to every detection of a query Q in an utterance U. Firstly, the cumulative cost of the warping path $M(n,b^{*})$ is length-normalized [35], and then, z-norm is applied so that all the confidence scores of all the queries have the same distribution [37].
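The search described in this section can be sketched in a few lines of plain Python. The sketch assumes the usual subsequence-DTW recursion (free start in the first row; vertical, diagonal, and horizontal predecessors elsewhere), a Pearson-based cost (1 − r)/2, and length normalization by the number of cells on the path; the function names and the handling of constant vectors are hypothetical choices for this example, and the normalization in [35] may differ.

```python
import math


def pearson_cost(q, u):
    """Map Pearson's r between two vectors from [-1, 1] to a cost in [0, 1]."""
    n = len(q)
    mq, mu = sum(q) / n, sum(u) / n
    cov = sum((a - mq) * (b - mu) for a, b in zip(q, u))
    vq = sum((a - mq) ** 2 for a in q)
    vu = sum((b - mu) ** 2 for b in u)
    if vq == 0.0 or vu == 0.0:
        return 0.5  # assumption: constant vectors are treated as uncorrelated
    return (1.0 - cov / math.sqrt(vq * vu)) / 2.0


def sdtw_candidates(query, utterance, n_candidates=1):
    """Return (start_frame, end_frame, length_normalized_cost) tuples."""
    n, m = len(query), len(utterance)
    cost = [[0.0] * m for _ in range(n)]   # accumulated cost M
    start = [[0] * m for _ in range(n)]    # utterance frame where the path starts
    length = [[1] * m for _ in range(n)]   # number of cells on the path
    for i in range(n):
        for j in range(m):
            c = pearson_cost(query[i], utterance[j])
            if i == 0:
                cost[i][j], start[i][j] = c, j  # a path may start at any frame
                continue
            preds = [(i - 1, j)]
            if j > 0:
                preds += [(i - 1, j - 1), (i, j - 1)]
            pi, pj = min(preds, key=lambda p: cost[p[0]][p[1]])
            cost[i][j] = c + cost[pi][pj]
            start[i][j] = start[pi][pj]
            length[i][j] = length[pi][pj] + 1
    last_row = list(cost[n - 1])
    detections = []
    for _ in range(min(n_candidates, m)):
        b = min(range(m), key=last_row.__getitem__)
        detections.append((start[n - 1][b], b, last_row[b] / length[n - 1][b]))
        last_row[b] = math.inf  # ignore this end frame from now on
    return detections


def z_norm(scores):
    """Normalize the scores of one query to zero mean and unit variance."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [(s - mean) / std if std > 0 else 0.0 for s in scores]


# Toy example: the query matches utterance frames 1 and 2 up to scaling.
query = [[1, 2, 3], [3, 1, 2]]
utterance = [[3, 2, 1], [2, 4, 6], [6, 2, 4], [1, 3, 2]]
detections = sdtw_candidates(query, utterance, n_candidates=2)
```

Storing the start frame and the path length next to the accumulated cost avoids an explicit backtracking pass; a full implementation would typically also prevent overlapping candidates, which this sketch does not do.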

Fusion
Discriminative calibration and fusion are applied to combine the detections of the different systems obtained from the different feature extraction approaches [38]. The global minimum score produced by the systems for all the queries is used to hypothesize the missing confidence scores due to its good performance in previous work. The calibration and the fusion parameters are then estimated by logistic regression on the development data to obtain improved discriminative and well-calibrated likelihood ratios [84]. The calibration and the fusion training are performed using the Bosaris toolkit [85].
The fusion is carried out on the detections output by the S-DTW search from the phoneme posteriorgram + phoneme unit selection approach on the English phoneme decoder, the acoustic features + feature selection approach from a set of 90 relevant features, and the Gaussian posteriorgram approach with 128 Gaussians. This configuration proved to be the best on the development data.

B-L2F-Four phone log-likelihood ratio feature+DTW-based fusion QbE STD system (B-L2F-4-pllr fea+DTW fusion)
Four different QbE STD systems that employ DTW-based query detection and several phoneme recognizers are fused. The system architecture is shown in Fig. 2.

Speech segmentation
The set of utterances is pre-processed using the audio segmentation module presented in [86]. This performs speech/non-speech classification and speaker segmentation, as well as other tasks. The speech/non-speech segmentation is implemented using a multi-layer perceptron (MLP) based on perceptual linear prediction (PLP) features, followed by a finite state machine. This finite state machine smooths the input probabilities given by the MLP using a median filter over a small window. The smoothed signal is then thresholded and analysed using a time window ($t_{min}$). The finite state machine consists of four possible states classified as probable non-speech, non-speech, probable speech, and speech. If the input audio signal has a probability of speech above a given threshold, the finite state machine is placed into the probable speech state. If, after a given time interval ($t_{min}$), the average speech probability is above a given confidence value, the machine changes to the speech state. Otherwise, it goes to the non-speech state. The finite state machine generates segment boundaries for the non-speech segments larger than the resolution of the median window. Additionally, the non-speech segments larger than $t_{min}$ are discarded. The value of $t_{min}$ and the threshold are chosen to maximize the non-speech detection in the work presented in [86], which aims to avoid the system processing the short silence segments included in large speech segments.
With the speech segmentation module, a partition of each utterance into smaller segments is obtained. Only the resulting speech segments are given to the query search. This strategy offers two computational advantages:
(1) Because the same query may occur multiple times in an utterance, a DTW-based search should proceed sequentially or iteratively over the whole utterance, storing the candidates found during the search, and initiating a new process with the remaining audio until a certain stopping criterion is met. By splitting the utterance into smaller segments, the search can be parallelized, allowing for different searches of the same query at the same time.
(2) Because the segments classified as non-speech are discarded, the performance of the DTW algorithm benefits from an overall reduction in the search space.
On the other hand, this strategy conveys at least two drawbacks that may affect the query detection:
(1) The errors of the audio segmentation module can result in missing speech segments that may eventually prove to contain query terms that are lost.
(2) It is assumed that only a single match per query can occur in a sub-segment, which may also introduce misses in the search.
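The behaviour of the speech/non-speech finite state machine can be illustrated with the simplified sketch below. It is not the module of [86]: the function names, the default parameter values, the symmetric treatment of the two "probable" states, and the frame-level (rather than time-level) definition of t_min are all assumptions made for this example.

```python
from statistics import median


def median_smooth(probs, window=5):
    """Median-filter the frame-level speech probabilities given by the classifier."""
    half = window // 2
    return [median(probs[max(0, i - half):i + half + 1]) for i in range(len(probs))]


def speech_segments(probs, threshold=0.5, confidence=0.6, t_min=3, window=5):
    """Return (start, end) frame pairs, end exclusive, labelled as speech."""
    smoothed = median_smooth(probs, window)
    labels = []
    state = "non-speech"
    i = 0
    while i < len(smoothed):
        candidate = "speech" if smoothed[i] >= threshold else "non-speech"
        if candidate == state:
            labels.append(state)
            i += 1
            continue
        # "Probable" state: inspect the next t_min frames before committing.
        chunk = smoothed[i:i + t_min]
        avg_speech = sum(chunk) / len(chunk)
        support = avg_speech if candidate == "speech" else 1.0 - avg_speech
        if support >= confidence:
            state = candidate
        labels.extend([state] * len(chunk))
        i += len(chunk)
    # Turn the frame labels into speech segments.
    segments, start = [], None
    for idx, label in enumerate(labels):
        if label == "speech" and start is None:
            start = idx
        elif label != "speech" and start is not None:
            segments.append((start, idx))
            start = None
    if start is not None:
        segments.append((start, len(labels)))
    return segments


# Ten speech frames surrounded by non-speech frames.
probs = [0.1] * 5 + [0.9] * 10 + [0.1] * 5
segments = speech_segments(probs, t_min=3, window=3)
```

Only the returned segments would be passed to the DTW search, which is what allows the searches to run in parallel and the non-speech regions to be skipped.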

Feature extraction
Two different approaches are employed for feature extraction which aim to obtain complementary information from the speech signals. The first employs the AUDIMUS phoneme recognizers for speech representation, and the second is based on the phoneme recognizers developed by the BUT [76]. The AUDIMUS phoneme recognizers are based on hybrid connectionist methods [87]. Four phoneme recognizers that exploit four different sets of acoustic models were used. These are trained on European Portuguese, Brazilian Portuguese, Spanish, and American English. The acoustic models are based on MLPs that are part of the L2F in-house hybrid connectionist ASR system called AUDIMUS [88,89]. AUDIMUS combines four MLP outputs trained with various sets of features, as shown in Table 7. The language-dependent MLPs are trained using different amounts of annotated data. Each MLP is characterized by the input frame context size, i.e., 13
The phoneme recognizers for the Czech, Hungarian, and Russian languages developed by BUT [76] are also employed. These output phone-state level posterior probabilities and multiple non-speech units, which are reduced to single-state phone posterior probabilities, and a unique silence output unit. This results in 43-dimensional feature vectors for Czech, 59-dimensional feature vectors for Hungarian, and 50-dimensional feature vectors for Russian. The frames where the non-speech posterior probability is the highest are also discarded.
Finally, both the AUDIMUS and the BUT posterior feature vectors are converted to phone log-likelihood ratios (PLLR) as described in [90]. This representation proved to be very effective in spoken language recognition [91] and other similar tasks [92].
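As a rough illustration of this conversion, the sketch below maps a vector of phone posteriors to log-likelihood ratios using the logit, log(p / (1 − p)), which is one common formulation of PLLR features. The exact definition in [90] may include additional constant terms; the function name `pllr` and the clipping constant are assumptions made for this example.

```python
import math


def pllr(posteriors, eps=1e-10):
    """Convert one frame of phone posteriors into phone log-likelihood ratios."""
    ratios = []
    for p in posteriors:
        # Assumption: posteriors are clipped away from 0 and 1 so that the
        # logarithm stays finite.
        p = min(max(p, eps), 1.0 - eps)
        ratios.append(math.log(p / (1.0 - p)))
    return ratios


# A frame with three phone posteriors.
frame = pllr([0.5, 0.9, 0.0])
```

Unlike posteriors, which are bounded to [0, 1] and sum to one, the resulting features are unbounded, which tends to suit distance-based comparisons.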

Search
Given two sequences of feature vectors corresponding to a query Q and an utterance U, the logarithm of the cosine distance is computed between each pair of vectors (Q[i], U[j]) to build a cost matrix as follows:

$$d(Q[i],U[j]) = -\log \frac{Q[i] \cdot U[j]}{|Q[i]| \, |U[j]|}.$$

The cost matrix is then normalized with respect to the utterance, such that the matrix values range from 0 to 1 [93]. The normalization is conducted as follows:

$$d_{norm}(Q[i],U[j]) = \frac{d(Q[i],U[j]) - d_{min}(i)}{d_{max}(i) - d_{min}(i)},$$

where $d_{min}(i) = \min_{j} d(Q[i],U[j])$ and $d_{max}(i) = \max_{j} d(Q[i],U[j])$. In this way, a perfect match would produce a quasi-diagonal sequence of zeros. The DTW search looks for the best alignment of the query and a partition of the normalized cost matrix corresponding to a speech segment. The algorithm uses three additional matrices to store the accumulated distance of the optimal partial warping path found (AD), the length of the path (L), and the path itself.
The best alignment of a query in an utterance is defined as the one that minimizes the average distance in a warping path of the normalized cost matrix. A warping path may start at any given frame of U, i.e., $k_1$, then traverses a region of U, which is optimally aligned to Q, and ends at frame $k_2$. The average distance in this warping path is computed as follows:

$$d_{avg}(Q,U) = \frac{1}{L} \sum_{l=1}^{L} d_{norm}(Q[i_l],U[j_l]),$$

where $(i_l, j_l)$ are the cells traversed by the warping path and L is its length. The confidence score for each detection is computed as $1 - d_{avg}(Q,U)$, thus ranging from 0 to 1, where 1 represents a perfect match. The start time and the duration of each detection are obtained by retrieving the time offsets corresponding to the frames $k_1$ and $k_2$ in the filtered utterance. The detection results are filtered to reduce the number of detections per query to a fixed number of hypotheses. Different values, ranging from 50 to 500, were tested to determine this threshold empirically, and the value of 100 detections per hour gave the best performance on the development data.
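The cost computation and the confidence score can be sketched as follows. The sketch assumes that the cost is the negative logarithm of the cosine similarity, that each query row is min-max normalized over the utterance frames, and that the average distance is taken over the cells of a given warping path; the function names and the flooring constant `eps` are hypothetical, and the DTW search that would produce the path is omitted.

```python
import math


def log_cosine_cost(q, u, eps=1e-10):
    """Negative logarithm of the cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(q, u))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in u))
    # Assumption: the similarity is floored so that the logarithm is defined.
    return -math.log(max(dot / norm, eps))


def normalized_cost_matrix(query, utterance):
    """Min-max normalize each query row over all the utterance frames."""
    matrix = []
    for q in query:
        row = [log_cosine_cost(q, u) for u in utterance]
        d_min, d_max = min(row), max(row)
        span = d_max - d_min
        matrix.append([(d - d_min) / span if span > 0 else 0.0 for d in row])
    return matrix


def confidence(matrix, path):
    """Score a warping path as 1 minus its average normalized distance."""
    d_avg = sum(matrix[i][j] for i, j in path) / len(path)
    return 1.0 - d_avg


# Toy example: query frame 0 matches utterance frame 0, frame 1 matches frame 2.
query = [[1.0, 0.0], [0.0, 1.0]]
utterance = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
norm_cost = normalized_cost_matrix(query, utterance)
score = confidence(norm_cost, [(0, 0), (1, 2)])
```

Because every row is rescaled to the range 0 to 1, the scores of different queries become comparable, and a path that only crosses zero-cost cells obtains a confidence of 1.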

Fusion
The output detections from the Brazilian Portuguese, Spanish, and European Portuguese AUDIMUS phoneme recognizers, and the Czech BUT phoneme recognizer [76], are fused with the strategy presented for the A-GTM-UVigo-3-fea+DTW fusion system. This configuration gave the best performance on the development data.

C-L2F-Four likelihood feature+DTW-based fusion QbE STD system (C-L2F-4-likel fea+DTW fusion)
This system is the same as the B-L2F-Four phone log-likelihood ratio feature+DTW-based fusion QbE STD system with the following modifications:
• The English phoneme recognizer developed by BUT [76] is added to the feature extraction module.
• The fusion is carried out on the detections provided by the Brazilian Portuguese, Spanish, and English AUDIMUS phoneme recognizers and the English phoneme recognizer from BUT.
• The feature extractor from the AUDIMUS and the BUT phoneme recognizers [76] outputs log-likelihoods instead of PLLR features.
• The threshold in the search is set to 300 detections per hour. This value was tuned on the development data with the new configuration.

D-ELiRF-UPV-Posteriorgram+DTW-based QbE STD system (D-ELiRF-UPV-Post+DTW)
This system, whose architecture is shown in Fig. 3, is based on DTW search on phoneme posteriorgrams. For feature extraction, the phoneme recognizers developed at BUT for Czech, English, Hungarian, and Russian [76] are employed to obtain a posteriorgram-based representation of the queries and the utterances. Only the English recognizer is employed in the final submitted system, because it gave the best performance on the development data.
For the search, the system employs the S-DTW algorithm explained above. However, instead of using the usual transition set with horizontal, vertical, and diagonal path movements, the horizontal and vertical transitions are modified so that the paths found must have a length between half and twice the length of the query, as shown in Fig. 4. These path movement modifications aim to increase the query detection rate. To do so, $M^{*}(i, j)$ in the cost matrix is modified as follows:

$$M^{*}(i, j) = d(i, j) + \min_{(x, y)} M^{*}(i - x, j - y)$$

where $d(i, j)$ is the local distance between query frame $i$ and utterance frame $j$, and $x$ and $y$ represent the allowed transitions. Different cost functions such as the Kullback-Leibler divergence, the cosine distance, and the inner product were explored, but the cosine distance was finally employed because it provided the best performance on the development data. The confidence score assigned to each detection is based on the distance computed by the S-DTW algorithm.
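The three local cost functions mentioned above can be sketched for a pair of posterior vectors as follows. The exact variants used by the system are not specified in the text, so the forms below (in particular, taking the negative logarithm of the inner product and using one minus the cosine similarity) are common choices assumed here for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """Kullback-Leibler divergence D(p || q) between two posterior vectors."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))

def cosine_distance(p, q, eps=1e-10):
    """One minus the cosine similarity (assumed form of the cosine distance)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / max(norm, eps)

def inner_product_cost(p, q, eps=1e-10):
    """Negative log of the inner product (assumed form of the inner-product cost)."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(max(dot, eps))
```

All three are zero, or close to zero, when the two frames put their posterior mass on the same phone, and they grow as the distributions diverge.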

E-ELiRF-UPV-Posteriorgram+DTW-based normalized QbE STD system (E-ELiRF-UPV-Post+DTWNorm)
This system is the same as the D-ELiRF-UPV-Posteriorgram+DTW-based QbE STD system with a single modification in the S-DTW algorithm. The modification makes the S-DTW search take the length of the paths into account: at each cell $(i, j)$ of the cost matrix, the transition $(x^{*}, y^{*})$ is chosen as the arg min over the allowed transitions $(x, y)$ of a length-normalized accumulated distance, where $L(i - x, j - y)$ is the length of the best path ending in $(i - x, j - y)$. With this modification, the fact that two paths have similar distance values but differ in the length of their alignments is taken into account.

F-SPL-IT-UC-Four phoneme recognizer+DTW-based fusion QbE STD system (F-SPL-IT-UC-4-phnrec+DTW fusion)
This system, whose architecture is presented in Fig. 5, consists of the fusion of four DTW-based search systems from different phoneme recognizers.

Feature extraction
State-level phone posterior probabilities are employed as features for the query and the utterance representation. These are computed using the phoneme recognizer developed by BUT [76]. Three different phoneme recognizers are trained, for Spanish, English, and European Portuguese. Although the queries mainly contain speech, a voice activity detector is employed: the frames for which the average of the posterior probability of silence and noise is higher than 0.5 are removed before applying the query search. The Spanish recognizer was trained using the training data provided by the organizers. Because the file mavir02.wav contains low-frequency noise, high-pass filtering with a cut-off frequency of 150 Hz, followed by spectral subtraction, is applied to this file before further processing. A phoneme dictionary is built using g2p-seq2seq 8 and a Spanish dictionary from CMU 9 . The phoneme alignment of the speech data is carried out with the Kaldi speech recognition toolkit [81].
As in previous studies [22,94], the English recognizer was trained using the training subsets of TIMIT and Resource Management databases.
The European Portuguese recognizer was trained using annotated broadcast news data and a dataset of command words and sentences, as carried out in previous studies [22,94].
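The posterior-based voice activity detection described above can be sketched as follows. The text does not state what the average is taken over; the sketch assumes one plausible reading, namely that the combined silence and noise posterior mass of each frame is averaged over the available recognizers. The function name and the input layout are hypothetical.

```python
def speech_frame_mask(nonspeech_mass, threshold=0.5):
    """Return True for the frames to keep.

    nonspeech_mass: one list per recognizer, each holding the per-frame sum of
    the silence and noise posteriors (assumed layout, for illustration only).
    """
    n_rec = len(nonspeech_mass)
    n_frames = len(nonspeech_mass[0])
    mask = []
    for t in range(n_frames):
        avg = sum(rec[t] for rec in nonspeech_mass) / n_rec
        mask.append(avg <= threshold)  # frames above the threshold are removed
    return mask
```

The mask would be applied to the query frames before the cost matrix is built, so that leading or trailing silence does not take part in the alignment.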

Search
The DTW algorithm is used for query detection from the state-level phone posterior probabilities that represent each query and utterance frame. The logarithm of the cosine distance, as in the B-L2F-Four phone log-likelihood ratio feature+DTW-based fusion QbE STD system, is employed as a distance metric between a query and an utterance frame to build a cost matrix.
The DTW search considers paths that start at the first frame of the query and at any frame of the utterance and move in unitary weighted jumps diagonally, vertically, or horizontally from the lowest accumulated distance. The DTW search result corresponds to the accumulated distances (D acc ) at the last frame of the query, for every frame of the utterance. The information regarding the start frame of the path, the ending frame of the path, and the number of diagonal, horizontal, and vertical movements is stored. The DTW search is carried out for Spanish, English, and European Portuguese languages individually. An additional DTW search based on averaging all the cost matrices given by the three languages is conducted, as in [18].
Finally, the accumulated distances are normalized to obtain $D_{norm}$, which is computed from the accumulated distance $D_{acc}$ and the numbers of diagonal, vertical, and horizontal path movements ($N_D$, $N_V$, and $N_H$, respectively). A confidence score is assigned to each detection by changing the sign of $D_{norm}$, i.e., score = $-D_{norm}$.
To select the candidate hits on the final normalized path distances, the system employs two limits for peak picking. The first is a hard limit of a maximum number of peaks, which implies an average of 1 peak per 20 s of audio. The second is a threshold where only the peaks above the 90% quantile of values above the mean plus standard deviation are selected. This guarantees that a small number of peaks is always chosen. Additionally, the peaks must be separated by a distance which is at least equal to the query length. The duration of the candidate hits in the utterance is also limited to between 0.5 and 1.9 times the size of the query. These figures were optimized on the development data.
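The peak-picking rules above can be sketched as follows, taking as input one confidence score per utterance frame (higher is better). The frame shift of 10 ms, the simple order-statistic quantile, and the use of "greater than or equal to" at the threshold are assumptions of this sketch; the duration limit on the candidate hits is not included.

```python
import statistics

def pick_peaks(scores, query_len, frame_shift_s=0.01, secs_per_peak=20.0):
    """Return the frame indices of the selected peaks, in time order."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    high = sorted(s for s in scores if s > mean + std)
    if not high:
        return []
    # 90% quantile of the values that lie above mean + standard deviation.
    threshold = high[int(0.9 * (len(high) - 1))]
    # Hard limit: on average one peak per 20 s of audio (at least one).
    max_peaks = max(1, int(len(scores) * frame_shift_s / secs_per_peak))
    last = len(scores) - 1
    candidates = [
        t for t, s in enumerate(scores)
        if s >= threshold
        and (t == 0 or s >= scores[t - 1])
        and (t == last or s >= scores[t + 1])
    ]
    candidates.sort(key=lambda t: scores[t], reverse=True)  # best first
    chosen = []
    for t in candidates:
        if len(chosen) == max_peaks:
            break
        # Peaks must be separated by at least the query length (in frames).
        if all(abs(t - c) >= query_len for c in chosen):
            chosen.append(t)
    return sorted(chosen)
```

Because the threshold is taken from the scores themselves, at least one peak survives whenever any score exceeds the mean plus one standard deviation.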

Fusion
The next step is to normalize the confidence scores per query, for which z-norm is applied to the scores of each query (q-norm). At this stage, there are four outputs from the four DTW search processes, i.e., the three phoneme recognizers and the average cost matrix. The fusion scheme is similar to that presented in [38]. Firstly, all the candidate hits are aligned (expanding the start and the end times), and a default score per sub-system is assigned to the candidate hits that are not found in all the sub-systems. This default score is the mean confidence score of that sub-system, which is equal to zero due to the q-norm; this choice outperformed the other strategies tried, such as the minimum score per query. All the candidate hits are considered, since this performs better than limiting the detections to those candidate hits found by more than one sub-system. Finally, the sub-system fusion is carried out by logistic regression with the Bosaris toolkit [85] to obtain improved discriminative and well-calibrated likelihood ratios [84]. The logistic regression is trained with the development data.
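The two preparation steps that precede the logistic regression, i.e., the per-query z-norm and the assignment of a default score to the hits a sub-system missed, can be sketched as follows. The function names and the dictionary-based representation of the aligned hits are hypothetical, and the logistic regression itself (done with the Bosaris toolkit in the system) is not reproduced here.

```python
import statistics

def q_norm(scores):
    """z-normalize the detection scores of one query within one sub-system."""
    mu = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mu) / sd if sd > 0 else 0.0 for s in scores]

def fusion_inputs(hits_per_subsystem, default=0.0):
    """Build one score vector per aligned candidate hit.

    hits_per_subsystem: list of dicts mapping a hit identifier to its
    q-normalized score. A hit missing from a sub-system receives the default
    score, which is 0.0 because that is the sub-system mean after q-norm.
    """
    all_hits = sorted(set().union(*hits_per_subsystem))
    return {h: [sub.get(h, default) for sub in hits_per_subsystem] for h in all_hits}
```

Each resulting vector has one entry per sub-system and would be the input to the trained fusion model.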

G-SPL-IT-UC-Three phoneme recognizer+DTW-based fusion QbE STD system (G-SPL-IT-UC-3-phnrec+DTW fusion)
This system is the same as the F-SPL-IT-UC-Four phoneme recognizer+DTW-based fusion QbE STD system except that the detections of the sub-system that employs the DTW-search on the average cost matrix are removed in the fusion strategy. This aims to evaluate the QbE STD system performance based on the individual languages.

H-SPL-IT-UC-Two language-independent phoneme recognizer+DTW-based fusion QbE STD system (H-SPL-IT-UC-2-LIphnrec+DTW fusion)
This system is the same as the F-SPL-IT-UC-Four phoneme recognizer+DTW-based fusion QbE STD system except that only the detections of the systems which employ the English and the Portuguese phoneme recognizers are fused. This aims to evaluate the QbE STD system performance using a language-independent setup.

I-Text-based STD system
This system was employed for comparison purposes with the QbE STD systems submitted to the evaluation. It was not submitted by any participant, nor did it compete in the evaluation. Because this system employs the correct transcription of the queries for the search, the system does not follow the rules of the evaluation. Therefore, this system simulates a scenario in which the queries are perfectly decoded by an ideal ASR subsystem. The text-based STD system consists of the combination of a word-based STD system to detect the INV words and a phone-based STD system to detect the OOV words. Therefore, the correct word transcription of each query is used for the word-based STD system, and the correct phone transcription of each query is used for the phone-based STD system. Both systems are described below.

Word-based STD system
The ASR subsystem is based on the Kaldi open-source toolkit [81] and employs DNN-based acoustic models. Specifically, a DNN-based context-dependent speech recognizer is trained following the DNN training approach presented in [95]. Forty-dimensional MFCCs, which are augmented with three pitch- and voicing-related features [96] and appended with their delta and double delta coefficients, are first extracted for each speech frame. The DNN has 6 hidden layers with 2048 neurons each. Each speech frame is spliced across ± 5 frames to produce 1419-dimensional vectors that are the input to the first layer. The output layer is a soft-max layer representing the log-posteriors of the context-dependent HMM states. The Kaldi LVCSR decoder generates word lattices [97] using these DNN-based acoustic models.
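The input dimensionality quoted above follows directly from the feature configuration, as the short check below illustrates; the function and parameter names are chosen here for illustration.

```python
def dnn_input_dim(n_mfcc=40, n_pitch=3, n_derivatives=2, context=5):
    """Size of the spliced input vector fed to the first DNN layer."""
    # Static features plus their delta and double delta coefficients.
    per_frame = (n_mfcc + n_pitch) * (1 + n_derivatives)
    # Splicing across +/- context frames gives 2 * context + 1 frames.
    return per_frame * (2 * context + 1)
```

With the default values, each frame has (40 + 3) × 3 = 129 coefficients, and splicing 11 frames gives 129 × 11 = 1419.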
The data used for acoustic model (AM) training of this Kaldi-based LVCSR system have been extracted from the Spanish material in the 2006 TC-STAR automatic speech recognition evaluation campaign 10 and the Galician broadcast news database Transcrigal [98]. It must be noted that all the non-speech parts, as well as the speech parts corresponding to transcriptions with pronunciation errors, incomplete sentences, and short speech utterances, were discarded. This resulted in approximately 104.5 h of acoustic training material.
The language model (LM) of the LVCSR system is constructed using a text database of 160 million word occurrences from several sources such as the transcriptions of the European and Spanish Parliaments from the TC-STAR database, subtitles, books, newspapers, online courses, and the transcriptions of the development data provided by the organizers. Specifically, the LM is obtained by static interpolation of the trigram-based LMs which are trained using these different text databases. The LMs are built with the SRILM toolkit [99], with the Kneser-Ney discounting strategy. The final interpolated LM is obtained using the SRILM static n-gram interpolation functionality. The LM vocabulary size is limited to the most frequent 60,000 words, and for each evaluation data set, the OOV terms are removed from the LM. This word-based LVCSR system configuration was chosen due to its good performance in the STD task [100].
The STD subsystem integrates the Kaldi term detector [81,101,102], which searches for the input terms within the word lattices obtained in the previous step. These lattices are processed using the lattice indexing technique described in [103] so that the lattices of all the utterances in the search collection are converted from the individual WFSTs to a single generalized factor transducer structure in which the start-time, the end-time, and the lattice posterior probability of each word token are stored as three-dimensional costs. This factor transducer is an inverted index of all the word sequences seen in the lattices. Thus, given a list of terms, a simple finite-state machine is created such that it accepts each term and composes it with the factor transducer to obtain all the occurrences of the terms in the search collection. The Kaldi decision-maker makes a YES/NO decision for each detection, based on the term-specific threshold (TST) approach presented in [104]. A detection is assigned the YES decision if:

$$p > \frac{N_{true}}{\frac{T}{\beta} + \frac{\beta - 1}{\beta} N_{true}}$$

where $p$ is the posterior probability of the detection, $N_{true}$ is the sum of the confidence scores of all the detections of the given term, $\beta$ is set to 999.9, and $T$ is the length of the audio in seconds.
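The YES/NO decision can be illustrated with the form of the term-specific threshold that is commonly used in STD systems, N_true / (T/β + ((β − 1)/β) · N_true). The sketch below assumes that form and uses hypothetical function names; it is not taken from the Kaldi source code.

```python
def tst_threshold(n_true, audio_seconds, beta=999.9):
    """Term-specific threshold from the expected number of true occurrences."""
    return n_true / (audio_seconds / beta + (beta - 1.0) / beta * n_true)

def decide(posteriors, audio_seconds, beta=999.9):
    """YES (True) / NO (False) decision for every detection of one term."""
    n_true = sum(posteriors)  # sum of the confidence scores of the term
    threshold = tst_threshold(n_true, audio_seconds, beta)
    return [p > threshold for p in posteriors]
```

For one hour of audio and a term whose detection scores sum to 1.0, the threshold is roughly 0.22, so a detection with posterior 0.9 is accepted and one with posterior 0.1 is rejected. Terms with a larger expected count get a higher threshold.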

Phone-based STD system
The OOV terms are handled with a phone-based STD strategy. A phoneme sequence is first obtained from the 1-best word path of the word-based Kaldi LVCSR system presented above. Next, the phoneme set is reduced by merging highly confusable phonemes, which aims to increase the term detection rate; specifically, the semivowels /j/ and /w/ are represented as the vowels /i/ and /u/, respectively, and the palatal n /ɲ/ is represented as /n/. Then, the tre-agrep tool is employed to compute candidate hits, based on the Levenshtein distance between each recognized phoneme sequence and the phoneme sequence corresponding to each term. An analysis of the proposed strategy suggests that those candidate hits whose Levenshtein distance was equal to 0 were, in general, correct hits. The candidate hits with Levenshtein distance equal to 1 were frequently false alarms, although many correct hits were also found among them; since no specific criterion to assign a confidence score is implemented, only those candidate hits with Levenshtein distance equal to 0 are kept and assigned the maximum score (1). The OOV term detections found using this phone-based STD system are directly merged with the INV detections obtained using the word-based STD system.
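The phoneme reduction and the distance-0 matching can be sketched in a few lines. This is a pure-Python stand-in written for illustration, not the tre-agrep tool used by the system; the phoneme symbols, the mapping table, and the function names are assumptions.

```python
# Assumed symbol inventory: semivowels map to vowels, palatal nasal to /n/.
REDUCTION = {"j": "i", "w": "u", "ɲ": "n"}

def reduce_phonemes(seq):
    """Merge highly confusable phonemes into a single symbol."""
    return [REDUCTION.get(p, p) for p in seq]

def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def exact_hits(recognized, term):
    """Start positions where the reduced term matches with distance 0."""
    rec, trm = reduce_phonemes(recognized), reduce_phonemes(term)
    n = len(trm)
    return [s for s in range(len(rec) - n + 1)
            if levenshtein(rec[s:s + n], trm) == 0]
```

Every position returned by `exact_hits` would be reported as a detection with the maximum score of 1, in line with the decision to keep only the candidates at distance 0.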

System comparison
The systems submitted to the evaluation share some properties and differ in others, which makes them all interesting from a system comparison perspective. All the QbE STD systems employed DTW or DTW variants for the query search, for which the cost function is, in general, the cosine distance. In addition, almost all the QbE STD systems employed fusion to output the final list of query detections. Regarding the feature extraction, the systems are based, in general, on posteriorgram-derived features for the query/utterance representation. However, there are specific differences that make each system distinct. Table 8 highlights the main differences and consistencies corresponding to the feature extraction, the cost functions, the search algorithm, and the fusion of each QbE STD system.

Results and discussion
The system results are presented in Table 9 for the development data, and Tables 10 and 11 show the performance for the MAVIR and the EPIC test data, respectively. The most important findings in the results are presented in Table 12.

Development data results
• The best performance for the QbE STD task was obtained by the C-L2F-4-likel fea+DTW fusion system. This system explicitly models the target language, i.e., Spanish, using a specific phoneme recognizer and is based on the fusion of different phoneme recognizers, since these improve the system performance. Paired t tests show that this best performance was statistically significant when compared with the B-L2F-4-pllr fea+DTW (p < 0.02), D-ELiRF-UPV-Post+DTW (p < 0.01), E-ELiRF-UPV-Post+DTWNorm (p < 0.01), and H-SPL-IT-UC-2-LIphnrec+DTW fusion (p < 0.01) systems.
• The worst performance was exhibited by the D-ELiRF-UPV-Post+DTW and E-ELiRF-UPV-Post+DTWNorm systems, which did not employ any fusion strategy.
• The performance obtained by the H-SPL-IT-UC-2-LIphnrec+DTW fusion system also confirmed significant performance degradation when the target language information was not used in the system. However, the A-GTM-UVigo-3-fea+DTW fusion system was an exception; although this did not employ the target language information, it still obtained a reasonable performance. This effect is possibly due to the use of a robust feature extractor, which involves the feature selection and the phoneme unit selection.
• As expected, the I-Text-based STD system, which employed the correct transcription of the query as input and the target language information, significantly outperformed all the QbE STD systems (p < 0.01). However, it must be noted that this I-Text-based STD system did not compete in the evaluation itself, because it did not follow the rules of the evaluation.

MAVIR test data
• The system with the best performance for the QbE STD task does not match the best system on the development data. On these test data, the best performance was for the A-GTM-UVigo-3-fea+DTW fusion system. We consider this may be due to some over-adaptation to the development data of the phoneme recognizers selected for the query search and of the fusion, which caused worse generalization on unseen (test) data.
• The best performance of the A-GTM-UVigo-3-fea+DTW fusion system could be due to the robust feature extraction it employs. This system is language-independent and hence is suitable to build a language-independent STD system, which is a hot topic in search on speech. The results obtained with this system suggest that a fusion strategy combined with a robust feature extractor, which integrates a varied set of features in individual search processes, can reduce the gap between language-dependent and language-independent QbE STD systems in highly difficult domains such as spontaneous speech. This best performance was statistically significant for a paired t test compared with the D-ELiRF-UPV-Post+DTW (p < 0.01), E-ELiRF-UPV-Post+DTWNorm (p < 0.01), and H-SPL-IT-UC-2-LIphnrec+DTW fusion (p < 0.01) systems.
• The remainder of the findings observed in the development data can also be found in the test data: the worst systems employed neither the target language information nor fusion, and the I-Text-based STD system significantly outperformed the QbE STD systems (p < 0.01).

EPIC test data
• The best performance for the QbE STD task was for the language-dependent G-SPL-IT-UC-3-phnrec+DTW fusion system. We consider that the discrepancy compared with the MAVIR database is due to the change of the acoustic domain. The parameter tuning and the ATWV threshold estimation could dramatically change the system performance ranking (as in the A-GTM-UVigo-3-fea+DTW fusion system) when different domain data are used for training/development and test. The best performance of the G-SPL-IT-UC-3-phnrec+DTW fusion system was statistically significant for a paired t test compared with the A-GTM-UVigo-3-fea+DTW fusion (p < 0.01), B-L2F-4-pllr fea+DTW (p < 0.01), D-ELiRF-UPV-Post+DTW (p < 0.01), and E-ELiRF-UPV-Post+DTWNorm (p < 0.01) systems, and weakly significant compared with the
• The performance of the A-GTM-UVigo-3-fea+DTW fusion system decreases dramatically due to an issue in the estimation of the ATWV threshold.
• The results suggest that using the target language is not that beneficial when the acoustic domain of the development and the test data changes, since the performance of the language-independent QbE STD systems, e.g., H-SPL-IT-UC-2-LIphnrec+DTW fusion, is better than that of some language-dependent QbE STD systems, e.g., B-L2F-4-pllr fea+DTW fusion.
• The I-Text-based STD system, as in the other datasets, significantly outperformed the QbE STD systems (p < 0.01).

Development and test data DET curves
The DET curves are presented in Figs. 6, 7, and 8 for the development data, the MAVIR test data, and the EPIC test data, respectively. These show a similar pattern to that observed in the system ranking from the MTWV/ATWV results.

System performance analysis based on query length
An analysis of the system performance based on the length of the queries was carried out. The results are presented in Tables 13 and 14 for the MAVIR and the EPIC test data, respectively. For the MAVIR data, it can be observed that, in general, the long queries obtained the best performance. This is because longer queries are less confusable with one another: they typically differ to a greater extent, and hence a better performance is obtained. However, it can also be seen that the short queries outperformed the medium-length queries. This could be due to the fact that the short queries, which contain up to 7 phonemes, are not short enough to make the QbE STD performance worse than that of the medium-length queries, which contain between 8 and 10 phonemes. For the I-Text-based STD system, the medium-length queries obtained the best performance. These outperformed the short queries because, as described above, acoustic confusion decreases as the query length increases. In this I-Text-based STD system, the medium-length queries also performed better than the long queries, which may be related to the fact that the long queries have an OOV rate of 56%, whereas the medium-length queries have an OOV rate of 39%.
For the EPIC data, although the best performance also corresponded to the long queries, a different pattern of behaviour is observed: in general, the medium-length queries outperformed the short queries. This discrepancy with the MAVIR data may stem from the different conditions of each database, such as the different number of queries, type of speech, and acoustic conditions. For the I-Text-based STD system, the long queries performed slightly better than the short and medium-length queries, probably due to the lower acoustic confusion. For this system, the medium-length queries performed slightly worse than the short queries. Although this may be surprising, it must be noted that some of the short queries can contain up to 7 phonemes, and so are not really very short.

System performance analysis based on single-word/multi-word queries
An analysis of the system performance based on the single-word and the multi-word queries was carried out, and the results are presented in Table 15. The results show a degradation in performance from the multi-word to the single-word queries. The multi-word queries are typically longer than the single-word queries, and hence, better performance could be expected, as shown in the query length analysis. The only exception is the I-Text-based STD system, for which the ATWV performance was worse for the multi-word queries than for the single-word queries. However, it should be noted that the MTWV was much better for the multi-word queries. This indicates a problem in the threshold setting for multi-word queries.

System performance analysis based on in-language/out-of-language queries
An analysis of the system performance based on the in-language and the out-of-language queries was carried out and the results are presented in Table 16. These results show a degradation in performance from the out-of-language to the in-language queries. This is the reverse of what should be expected in the case of a language-dependent setup. However, since all the QbE STD systems rely on the fusion of search systems that employ different languages, the OOL issue becomes almost irrelevant. The OOL queries can obtain a better performance than the INL queries in a QbE STD system in the case where the OOL query language is employed to build the system. In this case, the English language was chosen for the OOL queries, and all the QbE STD systems (except the B-L2F-4-pllr fea+DTW fusion system) used English in the feature extraction module. Regarding the B-L2F-4-pllr fea+DTW fusion system, the fusion strategy still performs better on the OOL queries because four different languages are fused. On the other hand, performance degradation is observed from the INL to the OOL queries in the I-Text-based STD system. In this case, the system is language-dependent because only the Spanish language was used to build the system, and hence a worse performance was obtained for the OOL queries because they did not match the target language. However, for the INL queries, where the pronunciation matches the target language, and for which enough data are typically used to train both the AMs and LMs, the system performance improved when compared to that of the QbE STD systems.

Comparison with the ALBAYZIN QbE STD 2014 evaluation
In order to measure the progress of the QbE STD technology in Spanish, a comparison of the best results obtained in the common set of queries of the ALBAYZIN QbE STD evaluations held in 2014 and 2016 was carried out. The best performance obtained in the 2014 and 2016 evaluations in the common set of queries was ATWV = 0.2881 and ATWV = 0.2541, respectively, which shows some performance degradation. It must be noted that the system submitted to the evaluation held in 2014 fuses the results of the text-based STD and the template matching-based approaches, which resulted in a better performance. In contrast, the best system presented in the 2016 evaluation was language-independent and included only template matching approaches. The data employed for the training and development also varied from one evaluation to the other. In the 2016 evaluation, there were less training data belonging to the MAVIR domain, and the participants could not use the same data for training and development, which could have influenced the system performance gap. Nevertheless, the best result obtained in the 2016 evaluation is still remarkable, as it was obtained by a language-independent QbE STD system and did not employ text-based STD technology.

Towards a language-independent STD system
The feasibility of language-independent STD systems can be examined from the systems submitted to the ALBAYZIN QbE STD 2016 evaluation. By comparing the best language-independent QbE STD system (A-GTM-UVigo-3-fea+DTW fusion for the MAVIR data and H-SPL-IT-UC-2-LIphnrec+DTW fusion for the EPIC data) with the I-Text-based STD system, we can claim that building a language-independent STD system with a performance similar to that of a language-dependent STD system remains a challenge. This means that researchers still need to focus more on the QbE STD technology to bring the performance of language-independent STD systems closer to that of language-dependent STD systems.

Conclusions
This paper presents the systems submitted to the ALBAYZIN QbE STD 2016 evaluation together with a text-based STD system for comparison purposes. Four different research groups took part in the evaluation, and eight different systems were submitted in total. All the systems submitted allowed INV and OOV query detection, because they were based on template matching techniques. With regard to the most novel and interesting technical contributions, the feature extraction employed in the A-GTM-UVigo-3-fea+DTW fusion system is worth mentioning. It uses three feature extraction methods that integrate different information sources and two different feature selection approaches. The B-L2F-4-pllr fea+DTW fusion system also presents a valuable feature extraction approach by computing phone log-likelihood ratios from two different phoneme recognizers. The candidate hit selection proposed in the F-SPL-IT-UC-4-phnrec+DTW fusion system is also worth mentioning. The results showed that system fusion plays an important role in the QbE STD systems and that the language-independence issue can be partially compensated by using a robust feature extractor. Regarding the domain comparison, we showed that for an easy domain such as that of the EPIC data, with an easy query list, i.e., INV, INL, and single-word queries, even though the training and the development data belonged to a different domain, the performance was better (ATWV = 0.3011) than that for the MAVIR data (ATWV = 0.2646), which presented more difficult speech and a more difficult query list but the same type of training and development data. The out-of-language query detection can obtain similar or even better performance than the in-language query detection when the language of those OOL queries is used to construct the system or the system fuses several language-dependent QbE STD systems. In addition, we also showed that multi-word query detection is easier than single-word query detection because the multi-word queries are generally longer than the single-word queries and that the long queries typically perform better.
A comparison of the results of the language-independent QbE STD system with those of the language-dependent text-based STD system presented in this paper shows that there is still ample room for improvement in bringing the performance of a language-independent QbE STD system closer to that of a language-dependent text-based STD system. This encourages the organizers to maintain this evaluation in the next ALBAYZIN evaluation campaign, for which two different domains (including MAVIR data) and a cross-search, i.e., searching the development queries in the test speech data and searching the test queries in the development speech data, will be considered as a measure of the generalization capability of the systems to unseen data.