Query-by-Example Spoken Term Detection ALBAYZIN 2012 evaluation: overview, systems, results, and discussion

Query-by-Example Spoken Term Detection (QbE STD) aims at retrieving data from a speech data repository given an acoustic query containing the term of interest as input. It has been receiving much interest recently due to the high volume of information stored in audio or audiovisual format. QbE STD differs from automatic speech recognition (ASR) and keyword spotting (KWS)/spoken term detection (STD) since ASR is interested in all the terms/words that appear in the speech signal and KWS/STD relies on a textual transcription of the search term to retrieve the speech data. This paper presents the systems submitted to the ALBAYZIN 2012 QbE STD evaluation, held as part of the ALBAYZIN 2012 evaluation campaign within the context of the IberSPEECH 2012 Conference. The evaluation consists of retrieving the speech files that contain the input queries, indicating their start and end timestamps within the appropriate speech file. Evaluation is conducted on a Spanish spontaneous speech database containing a set of talks from MAVIR workshops, which amount to about 7 h of speech in total. We present the database, the metric, and the systems submitted, along with all results and some discussion. Four different research groups took part in the evaluation. Evaluation results show the difficulty of this task, and the limited performance indicates that there is still a lot of room for improvement. The best result is achieved by a dynamic time warping-based search over Gaussian posteriorgrams/posterior phoneme probabilities. This paper also compares the systems, aiming to establish the best technique for this difficult task and to identify promising directions for this relatively novel task.


Introduction
The ever-increasing volume of heterogeneous speech data stored in audio and audiovisual repositories promotes the development of efficient methods for retrieving the stored information. Much work has addressed this issue by means of spoken document retrieval (SDR), keyword spotting, spoken term detection (STD), query-by-example (QbE) or spoken query approaches.
Spoken term detection aims at finding individual words or sequences of words within audio archives. Therefore, it relies on a text-based input, commonly the phone [...] by random browsing or some other method). The user's purpose is to find similar data within the repository. In doing so, the user selects one or several speech cuts containing the term of interest (henceforth, query) and the system returns other putative hits from the repository (henceforth, utterances). Another scenario for QbE STD considers one or several user speech recordings of the term of interest. Therefore, QbE STD differs from the STD defined previously, the so-called text-based STD, in that the former uses an acoustic query as input instead of a text-based representation of the term. On the one hand, this offers a big advantage for devices without text-based capabilities, which can be effectively used under the QbE STD paradigm. On the other hand, QbE STD can also be employed for building language-independent STD systems [7,8], which is mandatory when no or very limited training data are available to build a reliable speech recognition system, since a priori knowledge of the language involved in the speech data is not necessary.
Given the large amount of information stored in speech format, automatic systems that are able to provide access to this content are necessary. In this direction, several evaluations including SDR, STD, and QbE STD have been proposed recently [30][31][32][33][34][35][36]. Taking into account the increasing interest in QbE STD evaluation around the world, we organized an international evaluation of QbE STD in the context of the ALBAYZIN 2012 evaluation campaign. This campaign is an internationally open set of evaluations supported by the Spanish Network of Speech Technologies (RTTH) and the ISCA Special Interest Group on Iberian Languages (SIG-IL), held every 2 years since 2006. The evaluation campaigns provide an objective mechanism to compare different systems and to promote research on different speech technologies, such as speech segmentation [37], speaker diarization [38], language recognition [39], and speech synthesis [40] in the ALBAYZIN 2010 evaluation campaign. This year, the campaign has been held during the IberSPEECH 2012 Conference, which integrated the 'VII Jornadas en Tecnología del Habla' and the 'III Iberian SLTech Workshop'.
The rest of the paper is organized as follows: the next section presents the QbE STD evaluation that includes an evaluation description, the metric used, the database released for experimentation, a comparison with previous evaluations, and the participants involved in the evaluation. Next, we present the different systems submitted to the evaluation. Results along with some discussion are presented in Section 'Results and discussion' and the work is concluded in the last section.

Evaluation description and metric
This evaluation involves searching for audio content within audio content using an audio content query. Therefore, this is suitable for groups working on speech indexing and retrieval and on speech recognition as well. In other words, this task focuses on retrieving the appropriate audio files, with the occurrences and timestamps, which contain any of those queries. Therefore, the input to the system is an acoustic example per query, and hence prior knowledge of the correct word/phone transcription corresponding to each query cannot be used.
Participants could submit a primary system and up to two contrastive systems. No manual intervention is allowed when a system generates its final output file; hence, all the developed systems must be fully automatic. Listening to the test data, or any other human interaction with the test data, is forbidden before all the results have been submitted. The standard XML-based format corresponding to the NIST STD 2006 evaluation [31] has been used for building the system output file.
In QbE STD, a hypothesized occurrence is called a detection; if the detection corresponds to an actual occurrence, it is called a hit, otherwise it is a false alarm (FA). If an actual occurrence is not detected, this is called a miss. The actual term weighted value (ATWV) [31] has been used as the metric for the evaluation. It integrates the hit rate and false alarm rate of each query term into a single metric and then averages over all search query terms:

$$\mathrm{ATWV} = \frac{1}{|\Delta|} \sum_{K \in \Delta} \left( \frac{N^{K}_{\mathrm{hit}}}{N^{K}_{\mathrm{true}}} - \beta \, \frac{N^{K}_{\mathrm{FA}}}{T - N^{K}_{\mathrm{true}}} \right)$$

where $\Delta$ denotes the set of search query terms and $|\Delta|$ is the number of query terms in this set. $N^{K}_{\mathrm{hit}}$ and $N^{K}_{\mathrm{FA}}$ respectively represent the numbers of hits and false alarms of query term $K$, and $N^{K}_{\mathrm{true}}$ is the number of actual occurrences of $K$ in the audio. $T$ denotes the audio length in seconds, and $\beta$ is a weight factor set to 999.9 [41].
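The metric just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the official NIST scoring tool: the function name `atwv` and the input layout (one list of hit/FA decisions per query, already thresholded) are assumptions made here, and every query is assumed to have at least one actual occurrence.

```python
def atwv(decisions, n_true, audio_seconds, beta=999.9):
    """Average term weighted value over all query terms.

    decisions: dict query -> list of bools for the detections the system
               output (True = hit, False = false alarm). Hypothetical layout.
    n_true:    dict query -> number of actual occurrences (assumed >= 1).
    audio_seconds: total audio length T in seconds.
    """
    total = 0.0
    for query, n_ref in n_true.items():
        outcome = decisions.get(query, [])
        n_hit = sum(outcome)                 # True counts as 1
        n_fa = len(outcome) - n_hit
        # Hit rate minus the beta-weighted false alarm rate for this term.
        total += n_hit / n_ref - beta * n_fa / (audio_seconds - n_ref)
    return total / len(n_true)
```

A query with no detections contributes 0 to the sum, so a system that outputs nothing scores exactly 0, and a system dominated by false alarms can score below 0.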
ATWV represents the term weighted value (TWV) for the threshold set by the QbE STD system (usually tuned on development data). An additional metric, called maximum term weighted value (MTWV) [31], can also be used to evaluate the performance of the QbE STD system. The MTWV is the maximum TWV achieved by a given QbE STD system and does not depend on the tuned threshold. Although it was not used for the evaluation, results based on this metric are also presented to evaluate the threshold selection in the submitted systems.
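The relation between the two metrics can be illustrated with a small threshold sweep. The sketch below is an assumption-laden illustration rather than the evaluation tool: it assumes that a higher score means a more confident detection, that a single global threshold is applied to all queries, and that detections are given as hypothetical `(score, is_hit)` pairs.

```python
def twv_at(threshold, scored, n_true, audio_seconds, beta=999.9):
    """TWV when only detections with score >= threshold are kept."""
    total = 0.0
    for query, n_ref in n_true.items():
        kept = [is_hit for score, is_hit in scored.get(query, [])
                if score >= threshold]
        n_hit = sum(kept)
        n_fa = len(kept) - n_hit
        total += n_hit / n_ref - beta * n_fa / (audio_seconds - n_ref)
    return total / len(n_true)


def mtwv(scored, n_true, audio_seconds, beta=999.9):
    """Return (best TWV, threshold) over all candidate thresholds.

    The starting point is "output nothing" (threshold = +inf, TWV = 0),
    so the result is never negative.
    """
    best_value, best_threshold = 0.0, float("inf")
    candidates = {score for dets in scored.values() for score, _ in dets}
    for threshold in sorted(candidates, reverse=True):
        value = twv_at(threshold, scored, n_true, audio_seconds, beta)
        if value > best_value:
            best_value, best_threshold = value, threshold
    return best_value, best_threshold
```

ATWV is simply `twv_at` evaluated at the threshold the system chose in advance, so the gap between ATWV and MTWV measures how well that threshold generalizes.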
In addition to ATWV and MTWV, NIST also proposed a detection error tradeoff (DET) curve [42] to evaluate the performance of the QbE STD system working at various miss/FA ratios. Although DET curves were not used for the evaluation itself either, they are also presented in this paper for system comparison.

Database
The database used for the evaluation consists of a set of talks extracted from the Spanish MAVIR workshops held in [...] (Corpus MAVIR 2006 [...]), which are in Spanish and contain speakers from Spain and South America (henceforth, the MAVIR database).
This MAVIR database includes ten spontaneous speech files, each containing different speakers, which amount to about 7 h of speech and are further divided into training/development and test sets. There are 20 male and 3 female speakers in the database. The data were also manually annotated in orthographic form, but timestamps were set only for phrase boundaries. To prepare the data for the evaluation, we manually added the timestamps for the roughly 2,000 occurrences used in the training/development and test parts of the database.
The speech data were originally recorded in several audio formats [pulse-code modulation (PCM) mono and stereo, MP3, 22.05 kHz, 48 kHz, etc.]. All data were converted to PCM, 16 kHz, single channel, and 16 bits per sample using the SoX tool in order to unify the format for the participants. Recordings were made with the same equipment, a digital TASCAM DAT model DA-P1 (TEAC Corporation, Tokyo, Japan), except for one recording. Different microphones were used for the different recordings. Most of them were tabletop or floor-standing microphones, but in one case a lavalier microphone was used. The distance from the mouth of the speaker to the microphone varied and was not particularly controlled, but in most cases it was smaller than 50 cm. All the recordings contain real, spontaneous speech from MAVIR workshops in a real setting: they were made in large conference rooms with capacity for over a hundred people and with a large audience present. This poses additional challenges, including background noise (particularly babble noise) and reverberation. The realistic settings and the varied nature of the spontaneous speech in this database make it appealing and challenging enough for our evaluation and for further work. Table 1 includes some database features, such as the number of words, duration, and signal-to-noise ratio (SNR) [43] of each speech file.
Training/development data amount to about 5 h of speech extracted from seven of the ten speech files of the MAVIR database and contain 15 male and 2 female speakers. However, there is no constraint on the amount of training/development data beyond the MAVIR corpus that can be employed to build the systems. The training/development list consists of 60 queries, which were chosen based on their occurrence rate in the training/development speech data. Each query is composed of a single word whose length varies between 7 and 16 graphemes. Ground truth labels and evaluation tools were provided to the participants by the date of the release. There are 1,027 occurrences of those queries in the training/development data. Table 2 includes information related to the training/development queries.
Test data amount to about 2 h of speech extracted from the other three speech files, which were not used as training/development data, and contain five male speakers and one female speaker. The test list consists of 60 queries, which were chosen based on their occurrence rate in the test speech data. Each query is composed of a single word whose length varies between 7 and 16 graphemes. No ground truth labels corresponding to the test data were given to the participants until all the systems were submitted to the evaluation. There are 892 occurrences of those queries in the test data. Table 3 includes information related to the test queries.

Comparison to other evaluations
In recent years, several evaluations in the field of spoken term detection have taken place. In this section, we review these earlier evaluations, mainly to highlight the differences with the evaluation presented in this article. The evaluations most similar to ours are the MediaEval 2011 and 2012 Search on Speech evaluations [33,34]. The task of MediaEval and of our evaluation is the same: a Query-by-Example Spoken Term Detection evaluation in which participants search for audio content within audio content using an audio content query. However, our evaluation differs from the MediaEval evaluations in several ways.
The most important difference is the nature of the audio content used for the evaluation. In the MediaEval evaluations, all speech is telephone speech, either conversational or read and elicited speech. In our evaluation, the audio contains microphone recordings of real talks in real workshops, in large conference rooms with an audience. Microphones, conference rooms, and even recording conditions change from one recording to another. The microphones are not close-talking microphones but mainly tabletop and floor-standing microphones. This difference in the evaluation conditions means that our evaluation poses different challenges and makes it difficult to compare the results obtained in our evaluation with those of the previous MediaEval evaluations.
The evaluation presented here is, to the best of our knowledge, the first QbE STD evaluation that deals with the Spanish language. This is another difference from the MediaEval 2011 and 2012 evaluations, which dealt with Indian and African languages. In addition, participants in our evaluation could make use of knowledge of the language (i.e., Spanish) when building their systems.
Besides the MediaEval Search on Speech evaluations, the National Institute of Standards and Technology (NIST) of the USA organized the NIST STD evaluation in 2006 [31]. In this case, the evaluation proposed a different task: searching for spoken terms using a textual query composed of one or several words. The data contained speech in English, Mandarin Chinese, and Modern Standard and Levantine Arabic. Again, none of these languages was Spanish. The speech included conversational telephone speech (CTS), broadcast news (BNews) speech, and speech recorded in roundtable meeting rooms (RTMeet) with distantly placed microphones (this last type was used only for English). Of the three types of speech, the last one is the most similar to the speech in our evaluation, although there are still differences: the room is larger in our case, which is very important for reverberation, and the amplification of the audio used in the conference rooms is not present in a roundtable meeting.
The NIST STD 2006 evaluation results are publicly available and provide a useful basis for analyzing the influence of the language and the nature of the speech on STD results. Table 4 presents the best results obtained for each condition by the teams participating in the evaluation. With respect to the type of speech, it is clear from Table 4 that results using microphone speech, particularly distant microphones, in less controlled settings than audiovisual studios (such as in broadcast news) or close-talking conversational telephone data are much more limited. Taking this into account, together with the very challenging nature of the database used in our evaluation (perhaps even more challenging than the roundtable meeting recordings used in the NIST STD 2006 evaluation), we should not expect very high performance in our evaluation.
With respect to the language, English is the language with the most resources and the one on which most research has been done. When similar technology is applied to languages with fewer resources or to which less specific research has been devoted, performance decreases are observed. In the case of the NIST STD 2006 evaluation, very large performance decreases are observed when moving from English to other languages. In the case of our evaluation, we should not expect large decreases due to the use of Spanish, since we are conducting a query-by-example evaluation in which language resources are less important and the technology is relatively more language independent. However, we will probably lose some performance due to the query-by-example setting. In fact, we see that this happens in the particular setting of our evaluation by comparing the results of the query-by-example systems with the performance obtained by a text-based spoken term detection system that is more comparable to the systems participating in the NIST STD 2006 evaluation.
Finally, NIST has recently conducted a new evaluation called NIST Open Keyword Search evaluation [36] that is very similar to the former NIST STD 2006 evaluation. This new evaluation was only conducted on CTS data on a surprise language that was announced only 4 weeks before the evaluation. At the time of writing this article, there are no publicly available results of this evaluation.

Participants
Four different systems (systems 1 to 4) were submitted from three different research groups to the ALBAYZIN 2012 Query-by-Example Spoken Term Detection evaluation. In addition, one additional research group submitted a system (named text-based STD system in this paper) that is capable of text-based STD. This system will be used in this paper as a reliable baseline to be compared with the systems submitted to the main QbE STD evaluation. Participants are listed in Table 5. About 3 months were given to the participants for system design. Training/development data were released at the end of June 2012; test data were released at the beginning of September 2012; and the final system submission was due at the end of September 2012.

Systems
In this section, the systems that were submitted to the evaluation are described. The systems appear in the same order in which they are ranked in Tables 6 and 7. A full description of the systems can be found in the IberSPEECH 2012 online conference proceedings [44].

System 1
The system is based on a zero-resource dynamic time warping (DTW) matching approach. The system architecture is depicted in Figure 1. First, acoustic features (13 Mel frequency cepstral coefficients (MFCCs) along with their first and second derivatives) were extracted from the speech signal for each frame. To reduce the speaker dependence that these features suffer from [8], the MFCC features are used to train a Gaussian mixture model (GMM) from which posteriors are computed. This GMM is trained with a combination of expectation-maximization and K-means algorithms, aiming at maximizing the discovery and separation of automatically derived acoustic regions in the speech signal, as described in [45]. Finally, Gaussian posteriorgram features are extracted from this model as the final features. Next, a GMM-based speech/silence detector is applied to filter out non-speech segments. The resulting features (i.e., those corresponding to speech segments) are then sent to the subsequence-DTW (SDTW) [46] matching algorithm, which hypothesizes query detections within the utterances. The negative logarithm of the cosine distance has been employed as the similarity measure between each query frame and each utterance frame. The SDTW algorithm allows any query to appear at any time within the utterance. After the matching algorithm returns all possible detections and their scores, an overlap detection algorithm is executed: among matches that overlap with each other by more than 50% of the detection time, only the detection with the highest score (i.e., the lowest distance) is kept in the output file, along with the non-overlapped detections. It must be noted that this system can be considered language independent, since it does not make use of the target language and can be effectively used for building language-independent STD systems. A full system description can be found in [47].
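The core matching step described above can be illustrated with a minimal pure-Python sketch of subsequence DTW. It is not the submitted implementation [47]: the step pattern (horizontal, vertical, diagonal), the normalization by query length, and the function names are assumptions made here, and the local distance is taken to be the negative logarithm of the cosine similarity between two posterior vectors, which is one reading of the measure named in the text.

```python
import math


def neg_log_cosine(u, v, eps=1e-10):
    """-log of the cosine similarity of two non-negative posterior vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return -math.log(max(dot / max(norm, eps), eps))


def subsequence_dtw(query, utterance):
    """Best alignment of the whole query to any stretch of the utterance.

    Returns (start_frame, end_frame, cost normalized by query length).
    """
    n, m = len(query), len(utterance)
    cost = [[neg_log_cosine(q, u) for u in utterance] for q in query]
    acc = [[0.0] * m for _ in range(n)]      # accumulated cost
    start = [[0] * m for _ in range(n)]      # start frame of the best path
    for j in range(m):                       # free start: no penalty for j
        acc[0][j] = cost[0][j]
        start[0][j] = j
    for i in range(1, n):
        acc[i][0] = acc[i - 1][0] + cost[i][0]
        for j in range(1, m):
            prev_cost, prev_start = min(
                (acc[i - 1][j - 1], start[i - 1][j - 1]),
                (acc[i - 1][j], start[i - 1][j]),
                (acc[i][j - 1], start[i][j - 1]),
            )
            acc[i][j] = cost[i][j] + prev_cost
            start[i][j] = prev_start
    end = min(range(m), key=lambda j: acc[n - 1][j])   # free end
    return start[n - 1][end], end, acc[n - 1][end] / n
```

In practice, all local minima of the last row (not only the global one) would be taken as candidate detections before the overlap post-processing is applied.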

System 2
This system looks for an exact match of the phone sequence output by a speech recognition process given a spoken query, within the phone lattices corresponding to the utterances. Brno University of Technology phone decoders for Czech, Hungarian, and Russian have been employed [48]. In this way, this system does not make use of prior knowledge of the target language (i.e., Spanish) and hence, like the previous system, is language independent and suitable for building a language-independent STD system. The system, whose architecture is depicted in Figure 2, integrates different stages as follows: first, the Czech, Hungarian, and Russian phone decoders are used to produce phone lattices both for queries and utterances. Then, the phone transcription corresponding to each query is extracted from the phone lattice by taking the highest-likelihood phone sequence using the lattice-tool of SRILM [49]. Next, the Lattice2Multigram tool [50][51][52] is used to hypothesize detections that perform an exact match of the phone transcription of each query within each utterance. In this way, three different output files that contain the detections from each phone decoder are obtained. The score given by the Lattice2Multigram tool for each detection is normalized by the length of the detection (in number of frames) and by all the detections found within the phone lattices except the current one. Overlapped detections that are hypothesized by two or more phone decoders are merged so that the most likely detection (i.e., the one with the highest score) remains along with the non-overlapped detections. As a post-process, only the best K hypotheses for each utterance are kept in the final output file. K was set to 50, which gave the best performance on training/development data. The full system description can be found in [53].
Two different configurations for this system were submitted. The first one, referred to as system 2a, combines the detections from the Hungarian and Russian phone decoders, since they gave the best performance on the training/development data. The second one, referred to as system 2b, merges the detections from all the phone decoders (i.e., Czech, Hungarian, and Russian) in the final output file.
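The merging and pruning post-process of this system can be sketched as follows. This is a hedged illustration, not the code of [53]: the tuple layout `(utterance_id, start, end, score)`, the greedy highest-score-first strategy, the rule that any time overlap counts as an overlap, and the function name are all assumptions, and the input is taken to be the pooled detections of a single query from all phone decoders.

```python
def merge_and_prune(detections, k=50):
    """Keep the best-scoring detection among overlapping ones, then the
    top-k detections per utterance.

    detections: list of (utterance_id, start, end, score) for ONE query,
                pooled over all decoders; higher score = more likely.
    """
    kept = []
    # Visit detections from most to least likely; a detection survives only
    # if it does not overlap in time with an already kept one.
    for det in sorted(detections, key=lambda d: d[3], reverse=True):
        utt, start, end, _ = det
        overlaps = any(u == utt and start < e and s < end
                       for u, s, e, _ in kept)
        if not overlaps:
            kept.append(det)
    # kept is still sorted by score, so the first k per utterance are the best.
    per_utterance = {}
    for det in kept:
        bucket = per_utterance.setdefault(det[0], [])
        if len(bucket) < k:
            bucket.append(det)
    return [det for bucket in per_utterance.values() for det in bucket]
```

Under this sketch, system 2a would pool the Hungarian and Russian detections before the call and system 2b would pool all three decoders.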

System 3
The system, whose architecture is presented in Figure 3, is based on a search on phoneme lattices generated from a posteriori phoneme probabilities. It is composed of different stages as follows: first, these probabilities are obtained by combining the acoustic class probabilities estimated from a clustering procedure on the acoustic space and the conditional probabilities of each acoustic class with respect to each phonetic unit [54]. The clustering makes use of standard GMM distributions for each acoustic class, which are estimated in an unsupervised way by maximum likelihood estimation. The conditional probabilities are obtained from a coarse segmentation procedure [55]. An acoustic class represents a phone in the target language (i.e., Spanish), and hence this system employs knowledge of the target language. Second, the phoneme lattices are obtained for each query and utterance from an ASR process that takes as input the phoneme probabilities computed in the previous stage. This ASR process checks whether each vector of phoneme probabilities contains a phoneme whose probability is above a predefined detection threshold (tuned on training/development data) in order to output a specific phoneme for each frame. Start and end time marks for each phoneme are assigned by backward/forward procedures that mark the frames before/after the current one in which the probability of that phoneme is higher than an extension threshold (also tuned on training/development data), stopping when the probability falls below this threshold. The accumulated frame phoneme probability is used as the score for each phoneme in the lattice. In the third step, every path in the lattice corresponding to the query is searched for within the phoneme lattice corresponding to the utterance in order to hypothesize detections. Substitution, deletion, and insertion errors in those query lattice paths are allowed when hypothesizing detections.
The score for each detection is computed by accumulating the individual score of each phoneme in both the query and the utterance lattice paths. Overlapped detections are discarded from the final output file by keeping the best one, and detections with a score lower than a predefined threshold (tuned on the training/development data) are also filtered out of the final output. This threshold is query dependent, since a query detection is considered a hit if its score is lower than the mean of all scores of this query minus the standard deviation of these scores, computed from all occurrences of the detected query in all speech files. The full system description can be found in [56].
Two different configurations were submitted. The first one, referred to as system 3a, tuned all the thresholds so that at least 6% of hits on training/development data are produced. The second one, referred to as system 3b, is a late submission and tuned the thresholds for ATWV performance. This second configuration allows a fair comparison with the rest of the systems submitted.
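The query-dependent decision rule stated above (accept a detection when its score is lower than the mean minus the standard deviation of all scores of that query) can be sketched as follows. The function name, the input layout, and the use of the population standard deviation are assumptions; the text does not specify which estimator was used.

```python
from statistics import mean, pstdev


def accept_detections(scores_by_query):
    """Apply the per-query rule: keep scores below mean - std.

    scores_by_query: dict query -> list of detection scores collected over
                     all speech files (hypothetical layout).
    """
    accepted = {}
    for query, scores in scores_by_query.items():
        threshold = mean(scores) - pstdev(scores)   # one threshold per query
        accepted[query] = [s for s in scores if s < threshold]
    return accepted
```

Because the threshold is recomputed from each query's own score distribution, it acts as a form of per-query score normalization.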

System 4
This system employs the same phoneme probabilities used in the first stage of system 3 as the query/utterance representation, and hence it makes use of the target language. The system architecture is shown in Figure 4. To hypothesize detections, a segmental DTW search [57] is conducted with the Kullback-Leibler (KL) divergence as the similarity measure between each query frame and each utterance frame. The segmental DTW algorithm allows any query to appear at any point within the utterance. Overlapped detections found by the segmental DTW search and detections with a score lower than a predefined threshold (tuned on the training/development data) are filtered out of the final output. As in system 3, this threshold is query dependent, and a query detection is considered a hit if its score is lower than the mean of all the scores of this query minus the standard deviation of these scores, computed from all the occurrences of the detected query in all speech files. The full system description can be found in [56].
As in the previous system, two different configurations were submitted. The first one, referred to as system 4a, optimizes the system so that at least 10% of hits on training/development data are produced. The second one, referred to as system 4b, is a late submission that optimizes the system according to the ATWV metric and hence only allows a query to have at most two detections in all the speech files. This optimization towards the ATWV metric allows a fair comparison with the rest of the systems submitted.
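The frame-level measure used by this system can be sketched as below. The direction of the divergence (query frame relative to utterance frame), the smoothing constant `eps`, and the function names are assumptions made for illustration; the resulting matrix would then be fed to the segmental DTW search.

```python
import math


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two phoneme posterior vectors.

    eps smooths both vectors so that zero posteriors do not produce
    log(0) or a division by zero.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def local_distance_matrix(query, utterance):
    """Distance of every query frame to every utterance frame."""
    return [[kl_divergence(q, u) for u in utterance] for q in query]
```

Note that the KL divergence is not symmetric, so swapping the roles of query and utterance frames changes the distances.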

Text-based Spoken Term Detection system
For comparison with the systems presented before, we present a text-based STD system, which employs the phone transcription corresponding to each query to hypothesize detections. It must be noted that the correct phone transcription corresponding to each search term has been employed. The system architecture is depicted in Figure 5.
The STD system consists of four different stages: in the first stage, phone recognition is conducted to output phone lattices based on two different speech recognizers: (1) a standard triphone context-dependent hidden Markov model (HMM) speech recognizer with mixtures of diagonal covariance Gaussians as observation density functions in the states and (2) a biphone context-dependent HMM speech recognizer where the observation probabilities are obtained from a multilayer perceptron (MLP). In the second stage, an STD subsystem hypothesizes detections from each speech recognizer. The 1-best output of each phonetic recognizer is used as source text for an edit distance search. In doing so, each putative detection can be any substring whose phonetic edit distance to the searched word is less than 50% of the word's length. Next, we take all the detections found from the different phonetic recognizers and merge them. For overlapped detections, the best detection (i.e., the one with the minimum edit distance) remains. In the third stage, two different confidence measures, based on minimum edit distance and on lattice information, are used as confidence scores for each putative detection. The former is computed from standard substitution, insertion, and deletion errors in the 1-best phone sequence given by each speech recognizer and is normalized by the length of the word. The latter is computed as follows: (1) we determinize each lattice by using HLRescore from HTK [58] so that a smaller and more useful graph is used next; (2) we run the lattice-tool from the SRILM toolkit [49] to obtain the corresponding acoustic mesh graph; (3) the confidence calculated in the acoustic mesh graph is used in a modified edit distance algorithm where, instead of all costs being equal to 1, we simply sum the confidence of the phones that match the searched word.
Then, the score of a putative detection is the sum of the confidences through the acoustic mesh of the searched word between the time limits where the detection resides. This score is also normalized by the length of the word. The fourth stage makes use of the Bosaris toolkit to fuse both scores obtained in the previous stage to compute the final confidence for each detection. A full system description can be found in [59].
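The edit distance search of the second stage can be illustrated with a standard approximate substring matching routine. This is a sketch under assumptions, not the implementation of [59]: unit costs for substitutions, insertions, and deletions are assumed, phones are plain strings, and the function name and return format are hypothetical.

```python
def approximate_matches(term, recognized, max_ratio=0.5):
    """Find where `term` (a phone list) approximately occurs in `recognized`.

    Returns a list of (end_index, edit_distance) pairs, where end_index is the
    exclusive end position in `recognized` and the distance is below
    max_ratio * len(term).
    """
    n, m = len(term), len(recognized)
    prev = [0] * (m + 1)          # row 0 is all zeros: a match may start anywhere
    for i in range(1, n + 1):
        curr = [i] + [0] * m      # aligning i term phones to nothing costs i
        for j in range(1, m + 1):
            substitution = prev[j - 1] + (term[i - 1] != recognized[j - 1])
            deletion = prev[j] + 1        # term phone missing in the output
            insertion = curr[j - 1] + 1   # extra phone in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return [(j, d) for j, d in enumerate(prev) if j > 0 and d < max_ratio * n]


# Example: the 4-phone term is found exactly, plus two near misses next to it.
hits = approximate_matches(list("kasa"), list("elkasade"))
```

Neighbouring end positions typically fire together, as in the example, which is why the overlapped detections are subsequently merged by keeping the one with the minimum edit distance.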

Results and discussion
The results of the QbE STD evaluation are presented for every system submitted by the participants along with the system applied on STD in terms of MTWV and ATWV in Tables 6 and 7 for training/development and test data, respectively.
By analyzing the systems submitted for QbE STD evaluation at due time (i.e., not considering the late submissions) on test data, system 1 achieved the best performance both in terms of MTWV and ATWV. This reflects the good threshold setting approach used. It must be noted that both the difficulty of the task itself (searching acoustic queries on spontaneous data the type and quality of the acoustic data) and the absence of prior knowledge of the target language produce this low performance. However, this system is worse than the text-based STD system. This, as expected, is due to the use of the correct phone transcription for each query and hence the knowledge of the target language employed to build the text-based STD system. Special mention requires the late submission corresponding to system 4b. Although this system performance is not the best in terms of MTWV on test data, this achieves the best ATWV. This is caused by the near MTWV and ATWV system performance which reflects the fact that the threshold tuned on the training/development data performs very well on unseen (test) data. This may be due to several factors: (1) first, the two occurrences per query limitation produces less detections in the final output, which seriously limits the MTWV system performance and (2) the query-dependent threshold plays a very important role as score normalization. The best ATWV performance of this system may be due to the similarity measure used to conduct the segmental DTW search, being the Kullback-Leibler divergence, that perfectly fits the posterior probabilities computed in the first stage. The use of the target language to estimate these posterior probabilities also contributes to this. 
However, in case of system 1, a priori knowledge of the target language was not applied, and the cosine distance may not fit the Gaussian posterior probabilities as well as the KL divergence, which may result in a less generalizable threshold setting, and hence, in a higher gap between MTWV and ATWV. Again, system 4b still underperforms the text-based STD system. Similar trends are observed on training/development and test data. The main discrepancy lies on the best MTWV performance of the late submission corresponding to system 4b, which outperforms system 1 on training/development data and underperforms system 1 on test data. We consider that this is due to the different set of queries in both sets of data and some overfitting to training/development data in parameter tuning (e.g., number of detections per query that limits MTWV performance on unseen data as explained earlier). Systems 3a and 4a achieve different MTWV and ATWV performance. This is because both systems were tuned to output a predefined number of hits (6% and 10% respectively) on training/development data. This causes a high number of FAs, leading to a negative ATWV performance. In addition, an MTWV equal to 0.0 means that the best possible performance is obtained with no output detections.
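As a rough illustration of why false alarms weigh so heavily on ATWV, consider the approximate test-set figures given in the database description: about 2 h of audio, so $T \approx 7200$ s, and 892 occurrences of 60 queries, so roughly 15 occurrences per query on average. Both values are rounded assumptions used only for this back-of-the-envelope estimate. With the weight factor $\beta = 999.9$, the gain of one hit and the cost of one false alarm in the term weighted value of such an average query are

$$\frac{1}{N_{\mathrm{true}}} \approx \frac{1}{15} \approx 0.067, \qquad \frac{\beta}{T - N_{\mathrm{true}}} \approx \frac{999.9}{7200 - 15} \approx 0.139$$

Under these assumptions, a single false alarm cancels about two hits of the same query, so a system that outputs many low-confidence detections to reach a target hit coverage can easily end up with a negative value.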
It can also be seen that system 2a underperforms system 2b on test data. This means that the addition of the Czech decoder actually helps the QbE STD system. However, the opposite occurred on the training/development data (see Table 6). This may be due to the different development and test queries provided by the organizers. Systems 1 and 2a,b do not make use of the target language, whereas systems 3a, 3b, 4a, and 4b do.
In particular, the best overall performance of system 1 in terms of MTWV is highly remarkable, since this system can be employed to build language-independent STD systems. A better threshold setting strategy is necessary for this system to bring its ATWV closer to its MTWV. DET curves are also presented in Figures 6 and 7 for training/development and test data, respectively. They show the system performance at different miss/FA ratios. System 1 clearly outperforms the rest of the QbE STD systems at almost every operating point, except at the best operating point of system 4b and when the miss rate is low, where system 4a performs the best. As expected from the ATWV results, the text-based STD system outperforms the others except when the FA rate is low, where system 1 performs the best. Training/development and test data exhibit a similar trend.
A more detailed analysis is presented in Figures 8 and 9 in terms of hit/FA performance for the different systems on training/development and test data, respectively. As expected from the ATWV results, the late submission corresponding to system 4b achieves the best tradeoff between hits and FAs among the systems submitted to the QbE STD evaluation. Systems 2a and 2b output just a few detections, which results in bad ATWV performance. It must be noted that these two systems (2a and 2b) dramatically increase the number of FAs as more detections are hypothesized, in such a way that the best ATWV result is achieved with a small number of hits and FAs. System 3b exhibits a similar behavior on test data. Systems 3a and 4a produce such a high number of FAs on test data that their ATWV performance decreases dramatically. This is because both systems were developed to produce at least 6% and 10% coverage of hits in the training/development data, respectively, which increases both the number of hits and FAs. However, the increase in the number of FAs is much higher than the increase in the number of hits, resulting in an overall worse ATWV performance. This is confirmed by the results of systems 3a and 4a on training/development data: the number of FAs is so high that the best performance in terms of MTWV is obtained when no detections are output on these data. System 3b confirms this on training/development data. Again, system 1 achieves the best result in terms of hit/FA performance when compared with the systems submitted on time to the main QbE STD evaluation. The text-based STD system (outside the main QbE STD evaluation), which conducts STD and employs the correct phone transcription of the search terms when hypothesizing detections, produces the best ATWV result, since it gets quite a high number of hits and a small number of FAs.
Figure 6 DET curves of the QbE STD ALBAYZIN evaluation systems on training/development data. The broken black curve represents system 1, the red dot curve represents system 2a, the dark blue curve represents system 2b, the green curve represents system 3a, the solid black curve represents system 3b, the light blue curve represents system 4a, the red curve represents system 4b and the pink curve represents the text-based STD system. Systems 3b and 4b represent late submissions. Systems 1 to 4 are on QbE STD and text-based STD is on STD.
Figure 7 DET curves of the QbE STD ALBAYZIN evaluation systems on test data. The broken black curve represents system 1, the red dot curve represents system 2a, the dark blue curve represents system 2b, the green curve represents system 3a, the solid black curve represents system 3b, the light blue curve represents system 4a, the red curve represents system 4b and the pink curve represents the text-based STD system. Systems 3b and 4b represent late submissions. Systems 1 to 4 are on QbE STD and text-based STD is on STD.
It should also be noted that p(FA) and p(miss) in Tables 6 and 7 do not relate to ATWV performance but to MTWV performance (i.e., with the a posteriori best decision threshold). Thus, systems with MTWV = 0.0 (i.e., those that do not generate detections at the best decision threshold) obtain p(FA) = 0.0 and p(miss) = 1.0.

Comparison to previous QbE STD evaluations
Although our evaluation results cannot be directly compared with those obtained in the MediaEval 2011 and 2012 Search on Speech evaluations [33,34] because the database used for experimentation is different, our results are clearly lower (the best performance in MediaEval 2011 is ATWV = 0.222 and that in MediaEval 2012 is ATWV = 0.740). This may be due to the generous time windows allowed in the MediaEval 2011 Search on Speech evaluation and to the equal weight given to misses and FAs when scoring the MediaEval 2012 Search on Speech evaluation systems, which obtained the higher ATWV performance. In our case, we have been fully compliant with the ATWV setup, parameters, and scoring provided by NIST. Although the time window allowance contributes to a minor extent to system performance (see Table 8), the equal weight given to misses and FAs contributes to a greater extent (see Table 9). However, these results are still far from those obtained in past MediaEval evaluations. This is due to the more complex database (different recording conditions, speakers from different countries, etc.) used in our evaluation, as explained earlier. This is confirmed by the fact that system 1, which achieves the best performance among the systems submitted on time and is equivalent to a system presented in the MediaEval 2012 Search on Speech evaluation (where it obtained ATWV = 0.294), performs clearly worse in our evaluation (ATWV = 0.0122). The time window allowance hardly improves the performance because of the small number of detections output by the systems when aiming to maximize ATWV. Therefore, increasing the time window within which a detection counts as a hit does not play an important role in ATWV performance. For systems with more detections (e.g., systems 3a, 3b, and 4b), however, an increase in the time window allowance yields the largest ATWV gains.
It must be noted that the ranking obtained in Table 9, where the same weight is given to misses and FAs, differs from that obtained in the real evaluation (see Table 7). This does not mean that the best system of the QbE STD evaluation is not actually the best, since the system tuning carried out on training/development data is greatly affected by the ATWV formulation and hence by the different weights given to misses and FAs. The fast spontaneous speech, the background noise in some test queries, and the challenging acoustic conditions may also be causing the worse system performance compared to past MediaEval evaluations. A further analysis based on query length, speaking speed, and energy is presented next.
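Why the ranking can change with the weighting may be seen in a small sketch. It assumes that the NIST setup corresponds to a false alarm weight of 999.9 in the term-weighted value and that "equal weight" can be modeled as a weight of 1.0; the latter is our simplification, and the two operating points are invented for illustration and do not correspond to any submitted system.

```python
def twv_from_rates(p_miss, p_fa, beta):
    """Term-weighted value for given miss and false alarm probabilities."""
    return 1.0 - (p_miss + beta * p_fa)


NIST_BETA = 999.9   # NIST STD 2006 weight on false alarms
EQUAL_BETA = 1.0    # assumption: one way to express equal weight for misses and FAs

# Hypothetical operating points: (p_miss, p_fa)
conservative = (0.90, 0.00001)  # few detections, very few false alarms
liberal = (0.40, 0.0008)        # many detections, many more false alarms

nist_scores = {
    "conservative": twv_from_rates(*conservative, NIST_BETA),
    "liberal": twv_from_rates(*liberal, NIST_BETA),
}
equal_scores = {
    "conservative": twv_from_rates(*conservative, EQUAL_BETA),
    "liberal": twv_from_rates(*liberal, EQUAL_BETA),
}

nist_winner = max(nist_scores, key=nist_scores.get)
equal_winner = max(equal_scores, key=equal_scores.get)
```

Under the NIST weight the conservative operating point scores higher, while under the equal weight the liberal one does, so a system tuned for one formulation is not necessarily ranked the same under the other.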

Performance analysis of the QbE STD systems based on query length
An analysis of the performance of the QbE STD systems (i.e., those with an input acoustic query) based on the length of the queries has been conducted, and results are shown in Table 10. Queries have been divided into three categories: short queries (shorter than 40 hundredths of a second), medium-length queries (between 40 and 50 hundredths of a second), and long queries (longer than 50 hundredths of a second). It can be clearly seen that, in general, longer queries exhibit the best performance while shorter queries obtain the worst. This is because short queries are naturally more confusable within speech data than long queries, as also occurs in ASR systems with short and long words.

Performance analysis of the QbE STD systems based on query speaking speed
A similar analysis based on the speaking speed of each query has been carried out for the QbE STD systems, and results are presented in Table 11. Queries have been divided into three categories: slow queries (above 5.90 hundredths of a second per phone), medium queries (between 4.82 and 5.90 hundredths of a second per phone), and fast queries (below 4.82 hundredths of a second per phone). Results show that slow queries exhibit the best performance. We consider that this is because slow queries usually possess a clearer pronunciation and less co-articulation than faster (medium and fast) queries. For faster queries, however, some degree of mispronunciation (i.e., phone deletion) could appear, which affects the final performance.
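The length and speaking speed categories can be expressed as a short sketch. The thresholds come from the text; the function names and the assignment of the exact boundary values to the medium category are assumptions made only for illustration.

```python
def length_category(duration_cs):
    """Classify a query by its duration, given in hundredths of a second."""
    if duration_cs < 40:
        return "short"
    if duration_cs <= 50:  # boundary values assigned to "medium" (assumption)
        return "medium"
    return "long"


def speed_category(duration_cs, n_phones):
    """Classify a query by its average phone duration (hundredths of a second per phone)."""
    cs_per_phone = duration_cs / n_phones
    if cs_per_phone > 5.90:
        return "slow"      # longer phones mean slower speech
    if cs_per_phone >= 4.82:  # boundary values assigned to "medium" (assumption)
        return "medium"
    return "fast"


# Hypothetical query: 60 hundredths of a second, 8 phones
example = (length_category(60), speed_category(60, 8))
```

Grouping the per-query scores by these labels and scoring each group separately would give the kind of breakdown reported in Tables 10 and 11.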

Performance analysis of the QbE STD systems based on query energy
A similar analysis based on the average energy of each query has been conducted for the QbE STD systems, and results are presented in Table 12. Energy has been obtained using the Praat program [60]. Here, the queries have been divided into three categories: low- (below 54 dB), medium- (between 54 and 65 dB), and high-energy (above 65 dB) queries. The results show that medium-energy queries exhibit the best performance in general. We consider this is because extreme values of energy tend to cause more errors than standard (medium) values of energy in the queries, as also shown in [61] for ASR systems. The only exception is system 1, for which the high-energy queries perform the best. We consider that this may be due to the voice activity detector (VAD) included within the system, which is applied to both the query and the test data. The VAD may be clipping queries with lower energy (low and medium), which may worsen the QbE STD performance for these queries.

Performance analysis of the QbE STD systems for specific queries
A more detailed analysis has been conducted to show some specific query properties and their relation with QbE STD performance, focusing on the two best QbE STD systems (system 1 and system 4b). We have set two categories: worst queries and best queries. The former are those that make a negative contribution to the final ATWV, and the latter are those that make the largest positive contribution to the final ATWV. Twelve queries belong to the worst query category and ten to the best query category. Among the worst queries, ten belong to one of the worst groups in the earlier analyses (short queries, high-energy queries for system 4b, and low-energy queries for system 1). Among the best queries, seven belong to one of the best groups in the previous analyses (medium-energy queries and long queries).

Template matching-based versus phone transcription-based QbE STD
Systems 1 and 4a,b employ a template matching-based approach for QbE STD, whereas systems 2a,b and 3a,b employ a phone transcription-based approach. The best overall performance is achieved by the template matching-based approach used in both systems 1 and 4. This result confirms the conclusion presented in [18], where a template matching-based approach outperformed a phone transcription-based approach for QbE STD. Results obtained by system 2a,b suggest that building a speech recognizer on a language different from the target language to produce phoneme lattices, and subsequently searching within these lattices, is not appropriate for the QbE STD task, since the lattices are not reliable enough to represent the speech content in an out-of-language setup. In addition, the query search algorithm employed in system 3a,b considers so many paths in the lattice that represents the query when hypothesizing detections within the utterances that many FAs are generated. A better confidence score estimation is necessary for this system to reject as many FAs as possible.
Despite the bad performance exhibited by configuration 4a of system 4, it must be noted that it was not optimized for the final metric (i.e., ATWV) but to obtain a predefined hit coverage, which greatly affects the final ATWV performance [62]; hence, a fair comparison with the rest of the systems cannot be made.

Set of features for QbE STD
Different sets of features have been employed as speech signal representation: Gaussian posteriorgrams for system 1, a posteriori phoneme probabilities for systems 3a,b and 4a,b, and three-state MLP-based phoneme probabilities for system 2a,b. Although all these features should be fed to all the search algorithms to draw a stronger conclusion, we can observe that Gaussian posteriorgram features are suitable for speech signal representation, given the best performance of system 1 when no prior knowledge of the target language is used. We can also mention that the posterior phoneme probabilities used in the language-dependent setup of the late submission of system 4b are an effective representation of the speech signal, given their best ATWV performance.

Towards a language-independent STD system
From the systems submitted to this evaluation, the feasibility of a language-independent STD system can be analyzed. By comparing the best language-independent QbE STD system (system 1) with the text-based STD system, we can claim that building a language-independent STD system is still a distant milestone. This means that more research is needed in this direction to bring language-independent STD systems closer to language-dependent ones.

Challenge of the QbE STD task
By inspecting the results of all the systems submitted to the QbE STD evaluation, we can claim that building a reliable QbE STD system is still far from being a solved problem. The low ATWV performance exhibited by the best system (ATWV = 0.0217) confirms this. There are many issues that must still be solved in the future. First, a robust feature extraction process is necessary to represent the query/utterance speech content accurately. Next, a suitable search algorithm that hypothesizes detections is also necessary to output as many hits as possible while maintaining a reasonably low number of FAs. In addition, the spontaneous speech, inherent to QbE STD systems, is an important drawback, since phenomena such as disfluencies, hesitations, and noises are very difficult to deal with. Some pre-processing steps that deal with these phenomena could enhance the final performance. From the systems submitted to this evaluation, we can claim that Gaussian posteriorgrams or, generally speaking, posterior phoneme probabilities as features, followed by a DTW-based search, are a reasonably good starting point when facing QbE STD.
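As an illustration of such a starting point, the following is a minimal sketch of a plain subsequence DTW search over posteriorgram frames, with the cosine distance and the Kullback-Leibler divergence as interchangeable local distances. It is not the implementation of any submitted system (system 4b, for instance, uses a segmental DTW variant); the function names, the probability floor, the normalization by query length, and the toy posteriorgrams are all assumptions.

```python
import math


def cosine_distance(p, q):
    """1 - cosine similarity between two posterior vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def kl_divergence(p, q, floor=1e-10):
    """KL(p || q) with a small floor to avoid log(0); the floor value is an assumption."""
    return sum(a * math.log(max(a, floor) / max(b, floor)) for a, b in zip(p, q) if a > 0.0)


def subsequence_dtw(query, utterance, dist):
    """Find the best match of `query` anywhere inside `utterance`.

    Returns (cost normalized by query length, start frame, end frame).
    """
    n, m = len(query), len(utterance)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]   # accumulated cost ending at (i, j)
    start = [[0] * m for _ in range(n)]    # utterance frame where that path began
    for j in range(m):                     # a match may start at any utterance frame
        cost[0][j] = dist(query[0], utterance[j])
        start[0][j] = j
    for i in range(1, n):
        for j in range(m):
            best, best_start = cost[i - 1][j], start[i - 1][j]          # vertical step
            if j > 0:
                if cost[i - 1][j - 1] <= best:                          # diagonal step
                    best, best_start = cost[i - 1][j - 1], start[i - 1][j - 1]
                if cost[i][j - 1] < best:                               # horizontal step
                    best, best_start = cost[i][j - 1], start[i][j - 1]
            cost[i][j] = dist(query[i], utterance[j]) + best
            start[i][j] = best_start
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end] / n, start[n - 1][end], end


# Toy posteriorgrams over three classes, invented for illustration.
A, B, C = [0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]
utterance = [A, A, B, C, A, A]
query = [B, C]
match_cos = subsequence_dtw(query, utterance, cosine_distance)
match_kl = subsequence_dtw(query, utterance, kl_divergence)
```

In a full system, the normalized cost of each candidate match would be turned into a detection score and compared against a decision threshold; both of these steps are outside the scope of this sketch.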

Conclusions
We have presented the four systems submitted to the ALBAYZIN 2012 Query-by-Example Spoken Term Detection evaluation, along with a system that conducts STD. Four different Spanish research groups (TID, GTTS, ELiRF, and VivoLab) took part in the evaluation. Two different kinds of systems were submitted: template matching-based systems and phone transcription-based systems. Systems 1 and 4a,b belong to the former group and systems 2a,b and 3a,b belong to the latter. Results show that the template matching-based systems perform better than the systems that employ the phone transcription of each query, obtained from phone decoding, followed by a text-based STD-like search to hypothesize detections. The best system employs Gaussian posteriorgram/a posteriori phoneme probability features and a DTW-like search to hypothesize detections.
We have also shown that QbE STD systems (systems 1 and 4b) are still far from systems that deal with text-based STD (the text-based STD system) and that long, medium-energy, and slowly spoken queries lead to higher QbE STD system performance.
This evaluation is the first that has been conducted for the Spanish language so far, which represents a good baseline for future research in this language. In addition, the spontaneous speech database chosen for the experimentation, and in particular its realistic and challenging acoustic conditions, make the evaluation and the database attractive for future research. Results presented in this paper indicate that there is still considerable room for improvement, which encourages us to maintain this evaluation in the next ALBAYZIN evaluation campaigns.