System for Fast Lexical and Phonetic Spoken Term Detection in a Czech Cultural Heritage Archive

The main objective of the work presented in this paper was to develop a complete system that would accomplish the original visions of the MALACH project. Those goals were to employ automatic speech recognition and information retrieval techniques to provide improved access to the large video archive containing recorded testimonies of Holocaust survivors. The system has so far been developed for the Czech part of the archive only. It takes advantage of a state-of-the-art speech recognition system tailored to the challenging properties of the recordings in the archive (elderly speakers, spontaneous speech, emotionally loaded content) and of its close coupling with the actual search engine. The design of the algorithm adopting the spoken term detection approach is focused on the speed of the retrieval. The resulting system is able to search through the 1,000 hours of video constituting the Czech portion of the archive and find query word occurrences in a matter of seconds. The phonetic search implemented alongside the search based on the lexicon words makes it possible to find even words outside the ASR system lexicon, such as names, geographic locations or Jewish slang.


Introduction
The story of the cultural heritage archive that is the focus of our research and development effort began in 1994 when, after releasing "Schindler's List", Steven Spielberg was approached by many survivors who wanted him to listen to their stories of the Holocaust. Inspired by these requests, Spielberg decided to start the Survivors of the Shoah Visual History Foundation (VHF) so that as many survivors as possible could tell their stories and have them saved. In his original vision, he wanted the VHF (which later became the USC Shoah Foundation Institute [1]) to perform several tasks, including collecting and preserving the Holocaust survivors' testimonies and cataloging those testimonies to make them accessible.
The "collecting" part of the mission has been completed, resulting in what is believed to be the largest collection of digitized oral history interviews on a single topic: almost 52,000 interviews in 32 languages, a total of 116,000 hours of video. About half of the collection is in English, and about 4,000 of the English interviews (approx. 10,000 hours, i.e. 8% of the entire archive) have been extensively annotated by subject-matter experts (subdivided into topically coherent segments, equipped with a three-sentence summary and indexed with keywords selected from a pre-defined thesaurus). This annotation effort alone required approximately 150,000 hours (75 person-years) and proved that manual cataloging of the entire archive is infeasible at this level of granularity.
This finding prompted the proposal of the MALACH project (Multilingual Access to Large Spoken Archives, 2002-2007), whose aim was to use automatic speech recognition (ASR) and information retrieval techniques for access to the archive and thus circumvent the need for manual annotation and cataloging. There were many partners involved in the project (see the project website [2]), each of them possessing expertise in a slightly different area of speech processing and information retrieval technology.
The goal of our lab was originally only to prepare the ASR training data for several Central and Eastern European languages (namely Czech, Slovak, Russian, Polish and Hungarian); over the course of the project, we gradually became involved in essentially all the research areas, at least for the Czech language. After the project had finished, we felt that although a great deal of work had been done (see for example [3] or [4]), some of the original project objectives still remained somewhat unfulfilled. Namely, there was still no complete end-to-end system that would allow any user to type a query and receive a ranked list of pointers to the relevant passages of the archived video recordings. Thus we decided to carry on with the research and fulfill the MALACH project visions at least for the Czech part of the archive.
The portion of the testimonies that was given in the Czech language is small when compared to the English part (about 550 testimonies, 1,000 hours of video material), yet the amount of data is still prohibitive for complete manual annotation (verbatim transcription) and also poses a challenge when designing a retrieval system that works in (or very near to) real time.
The big advantage that our research team had when building a system for searching the archive content was that we had complete control over all the modules employed in the cascade, from the data preparation through the ASR engine to the actual search algorithms. That way we were well aware of the inherent weaknesses of the individual components and were able to fine-tune the modules to best serve the overall system performance.
The following sections will thus describe the individual system components, concentrating mainly on the advancements that were achieved after the original MALACH project was officially finished.

Automatic Speech Recognition Data Preparation
The speech contained in the testimonies is very specific in many aspects, owing mostly to the very nature of the archive. The speakers are of course elderly (note that the recording started in 1995) and, due to the character of their stories, their speech is often very emotional and contains many disfluencies and non-speech events such as crying or whimpering. The speaking rate also varies greatly depending on the speaker, which again is frequently an issue related to age (some interviewees were so old that they struggled with the mere articulation while others were still at the top of their rhetorical abilities) and/or the language environment where the speakers spent the last decades (as those living away from the Czech Republic naturally stopped to search for the correct expression more often).
Consequently, the existing annotated speech corpora were not suitable for training the acoustic models and we first had to prepare the data by transcribing a part of the archived testimonies.
We randomly selected 400 different Czech testimonies from the archive and transcribed a 15-minute segment from each of them, starting 30 minutes from the beginning of the interview (thus getting past the biographical questions and initial awkwardness). A detailed description of the transcription format is given in [5]; let us only mention that in addition to the lexical transcription, the transcribers also marked several non-speech events. That way we obtained 100 hours of training data that should be representative of the majority of the speakers in the archive. Another 20 testimonies (10 male and 10 female speakers) were transcribed completely for ASR development and test purposes.

Acoustic Modeling
The acoustic models in our system are based on the state-of-the-art Hidden Markov Models (HMM) architecture. Standard 3-state left-to-right models with a mixture of multiple Gaussians in each state are used.
Triphone dependencies (including the cross-word ones) are taken into account. The speech data was parameterized as 15-dimensional PLP cepstral features including their delta and delta-delta derivatives (resulting in 45-dimensional feature vectors) [6]. These features were computed at a rate of 100 frames per second.
Cepstral mean subtraction was applied per speaker. The resulting triphone-based model was trained using the HTK Toolkit [7]. The number of clustered states and the number of Gaussian mixture components per state were optimized using a development test set; the final model had more than 6k states and 16 mixture components per state (almost 100k Gaussians).
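To make the feature dimensions concrete, the following minimal sketch applies per-speaker cepstral mean subtraction and appends delta and delta-delta coefficients to 15-dimensional static features, giving 45-dimensional vectors. It assumes the regression formula that HTK documents for delta coefficients, a regression window of two frames, and replication of the edge frames; the paper does not state these settings, and the function names are illustrative only.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean computed over all frames of one speaker."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def deltas(frames, window=2):
    """Regression-based time derivatives; the window size is an assumption."""
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n - 1)][d]  # replicate the last frame
                prv = frames[max(t - k, 0)][d]      # replicate the first frame
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def add_derivatives(static):
    """Concatenate static, delta and delta-delta coefficients frame by frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 10 frames of a linearly increasing 15-dimensional feature.
static = [[float(t)] * 15 for t in range(10)]
features = add_derivatives(cepstral_mean_subtraction(static))
```

At 100 frames per second, one hour of speech would yield 360,000 such vectors.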
As was already mentioned, non-speech events appearing in the spontaneous speech of survivors were also annotated. We used these annotated events to train a generalized model of silence in the following manner: we took the sets of Gaussian mixtures from all the non-speech event models, including the standard model for a long pause (silence, sil; see [7]). Then we weighted those sets according to the state occupation statistics of the corresponding models and compounded the weighted sets together in order to create a robust "silence" model with about 128 Gaussian mixture components. The resulting model was incorporated into the pronunciation lexicon so that each phonetic baseform in the lexicon is allowed to have either the short pause model (sp) or the new robust sil model at the end.
The described technique "catches" most of the standard non-speech events appearing in running speech very well, which improved the recognition accuracy by eliminating many of the insertion errors.
State-of-the-art speaker adaptive training and discriminative training [8] algorithms were employed to further improve the quality of the acoustic models. Since the speaker identities were known, we could split the training data into several clusters (male interviewees, female interviewees and interviewers) before the actual discriminative training adaptation (DT; see [9] for details) to enhance the method's effectiveness.
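The compounding of the non-speech mixtures can be sketched as follows. This is only an illustration under stated assumptions: each non-speech model is reduced to a single state, a component is a (weight, mean, variance) triple with scalar parameters, and the occupancy counts are invented for the example. The actual procedure operates on full HMM states and is not specified in more detail in the text.

```python
def merge_mixtures(models):
    """Pool the components of several mixtures into one mixture.

    models: list of (occupancy, components), where components is a list of
    (weight, mean, variance) and the weights of each model sum to one.
    Each model contributes in proportion to its state occupancy, so the
    pooled weights again sum to one.
    """
    total_occupancy = sum(occ for occ, _ in models)
    merged = []
    for occ, components in models:
        share = occ / total_occupancy
        for weight, mean, variance in components:
            merged.append((share * weight, mean, variance))
    return merged


# Hypothetical non-speech models: long pause, breath, cry.
sil = (6000.0, [(0.5, 0.0, 1.0), (0.5, 0.2, 1.5)])
breath = (3000.0, [(0.7, 1.0, 2.0), (0.3, 1.4, 2.5)])
cry = (1000.0, [(1.0, 3.0, 4.0)])
robust_sil = merge_mixtures([sil, breath, cry])
```

Frequently observed events thus dominate the pooled model, while rare events still keep a small share of the probability mass.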

Language Modeling
The language model used in the final system draws from the experience gained from the extensive experiments performed over the course of the MALACH project [10]. Those experiments revealed that even though the transcripts of the acoustic model training data constitute a rather small corpus from the language modeling point of view (approx. one million tokens), they are by far more suitable for the task than much larger, but "out-of-domain" text corpora (comprising, for example, newspaper articles). However, if a more sophisticated technique than just throwing in more data is used for extending the language model training corpus, it is possible to further improve the recognition performance (see below). We have also found that the spontaneous nature of the data brought up a need for careful handling of colloquial words, which are abundant in casual Czech speech. It turned out that the best results were achieved when the colloquial forms are employed in the acoustic modeling stage only and the standardized forms are used as the "surface" forms in the lexicon of the decoder and in the language model estimation process (see [11] for details). In other words, the recognizer produces text in the standardized word forms while the colloquial variants are treated as pronunciation variants inside the decoder lexicon.
In concordance with those findings, we trained two basic language models. The first one was estimated using only the acoustic training set transcripts and the second was trained from a selection of the Czech National Corpus (CNC). This corpus is relatively large (approx. 400M words) and extremely diverse. Therefore it was impractical to use the whole corpus and we investigated the possibility of using automatic methods to select sentences from the CNC that are in some way similar to the sentences in the training set transcriptions. The method that we used is based on [12] and employs two unigram language models: one of them ($P_{CNC}$) is estimated from the CNC collection and the other ($P_{Tr}$) is estimated from the acoustic training set transcripts. A likelihood ratio test was applied to each sentence in the CNC, using a threshold $t$: a sentence $s$ from the CNC was added to the filtered set (named CNC-S) if $P_{CNC}(s) < t \cdot P_{Tr}(s)$. This is a simple way of assessing whether sentences from the CNC are closer to the testimony transcriptions than to the bulk of the CNC corpus itself. The test threshold effectively allowed us to determine the size of the selected sub-corpus CNC-S. Gradually decreasing the threshold yields smaller and smaller sub-corpora that, ideally, are more and more similar to the testimony transcriptions. A threshold of 0.8 created a CNC-S containing about 3% of the CNC (approx. 16M tokens). Merging the lexicons from both CNC-S and the acoustic training set transcripts and consequently interpolating the corresponding language models yielded a WER improvement of 2% absolute [10]. The interpolation ratio 3:1 (transcriptions to the CNC-S) was used in the presented system as this factor gave the best recognition performance in the experiments [10].
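The sentence selection test can be illustrated with a small sketch. The add-one smoothing over the joint vocabulary and the toy English sentences are assumptions made for the example, since the text does not describe how the unigram models handle unseen words. The comparison is done in the log domain to avoid numerical underflow on long sentences. The last function reads the 3:1 interpolation ratio as linear interpolation weights of 0.75 and 0.25, which is also an assumption.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Unigram model with add-one smoothing (the smoothing is an assumption)."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def log_prob(sentence, model):
    return sum(math.log(model[w]) for w in sentence.split())


def select_sentences(general, in_domain, t=0.8):
    """Keep general-corpus sentences s with P_general(s) < t * P_in_domain(s)."""
    vocab = {w for s in general + in_domain for w in s.split()}
    p_general = train_unigram(general, vocab)
    p_in_domain = train_unigram(in_domain, vocab)
    log_t = math.log(t)
    return [s for s in general
            if log_prob(s, p_general) < log_t + log_prob(s, p_in_domain)]


def interpolate(p_transcripts, p_selected, ratio=(3, 1)):
    """Linear interpolation of two probabilities with weights given as a ratio."""
    a, b = ratio
    return (a * p_transcripts + b * p_selected) / (a + b)


general = ["the government approved the budget",
           "we were hiding in the camp",
           "the budget was approved"]
in_domain = ["we were in the camp",
             "we were hiding",
             "my mother was in the camp"]
selected = select_sentences(general, in_domain)
```

In this toy setting only the sentence that resembles the in-domain material passes the test; lowering t makes the filter stricter.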

Speech Recognition - Generation of Word and Phoneme Lattices
There was an important issue to solve even before the actual speech recognition process started, namely which speech signal should actually be recognized. The problem was that the signal extracted from the archive video recordings was stereo, one channel containing the speech of the interviewer, the other the speech of the interviewee. However, there were frequent echoes despite the fact that the speakers were wearing lapel microphones. This was particularly challenging in the event of cross-talk, when the speech of both dialogue participants was mixed together in both channels, and we had to design an algorithm for separating the speech that was based on the levels of energy. Also, to save computational power and storage, we omitted from recognition all the portions of the signal that did not contain any speech.
Then the processed signal was streamed into our in-house ASR system [15], which was used in two recognition passes. The first pass employs the trigram language model described in the Language Modeling section and clustered DT-adapted acoustic models that are automatically and gradually adapted to each individual speaker. This unsupervised iterative speaker adaptation algorithm employs both fMLLR and MAP methods (see [16] for details) and uses for adaptation only the speech segments with a confidence measure (expressed in our case in terms of posterior probabilities) exceeding 0.99, thus ensuring reliable estimates of the transformation matrices.
The speaker-adapted models are then employed in the second pass to generate the lattices to be used in the search engine. In order to help the search algorithm, the lattices were equipped with confidence scores computed as posterior probabilities using the forward-backward algorithm. Both word and phoneme lattices were generated in this manner, the important distinction being that the phoneme recognizer did not use any language model for the lattice generation.
The parameters of the ASR system were optimized on the development data (complete testimonies of 5 male and 5 female speakers). The recognition results listed in Table 1 show the (one-best) phoneme recognition accuracy as well as the recognition accuracy of the word-based system. This accuracy was computed on the test set comprising the testimonies of another 5 male and 5 female speakers. The total number of words in the test set was 63,205 with 2.39% out-of-vocabulary (OOV) words. Note that the accuracy of the Czech ASR reported just after the MALACH project completion was about 10% absolute lower (see [17]).
Using the lattices for searching is an important step away from the oversimplifying approach to speech retrieval on the same archive that was adopted by all teams participating in the CLEF campaign CL-SR tracks in 2006 and 2007 [18], where the problem of searching speech was reduced to classic document-oriented retrieval by using only the one-best ASR output and artificially creating "documents" by sliding a fixed-length window across the resulting text stream. The lattice-based approach, on the other hand, makes it possible to explore the alternative hypotheses about the actual speech content (note that the one-best error rate is still rather high). Dropping the artificial segmentation into the (quite long) fixed-length documents then enables a much more finely grained time resolution when looking for the relevant passages. This could save the users of the search engine a lot of browsing through irrelevant bits of the archive. Furthermore, the presence of phoneme lattices enables searching for out-of-vocabulary terms (see more details in the following sections).
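The posterior-based confidence scores can be illustrated on a toy lattice. The sketch below is a simplified forward-backward pass that assumes integer node identifiers in topological order (every arc goes from a lower to a higher node) and works with plain probabilities; a production decoder would typically use log-domain arithmetic and scaled acoustic and language model scores, none of which is detailed in the text.

```python
from collections import defaultdict


def arc_posteriors(arcs, start, end):
    """Return (label, posterior) for every arc of an acyclic lattice.

    arcs: list of (src, dst, label, likelihood); node ids are assumed to be
    integers in topological order.
    """
    alpha = defaultdict(float)  # probability mass of all paths start -> node
    beta = defaultdict(float)   # probability mass of all paths node -> end
    alpha[start] = 1.0
    for src, dst, _, lik in sorted(arcs, key=lambda a: a[0]):
        alpha[dst] += alpha[src] * lik
    beta[end] = 1.0
    for src, dst, _, lik in sorted(arcs, key=lambda a: a[1], reverse=True):
        beta[src] += lik * beta[dst]
    total = alpha[end]
    return [(label, alpha[src] * lik * beta[dst] / total)
            for src, dst, label, lik in arcs]


# Two competing hypotheses for the first segment, one for the second.
toy_lattice = [(0, 1, "a", 0.6), (0, 1, "b", 0.4), (1, 2, "c", 1.0)]
posteriors = dict(arc_posteriors(toy_lattice, start=0, end=2))
```

An arc that lies on every path through the lattice receives a posterior of one, while competing arcs share the probability mass between them.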

Indexing and Searching
The general goal of the search system is rather clear and well-defined. The task is to:
1. Identify appropriate replay points in the recordings, that is, the moments where the discussion about the queried topics starts.
2. Present them in some user-friendly manner to the searcher.
However, there are many ways to approach this task. One of them is essentially a standard text retrieval that was used in the aforementioned CLEF campaign. The approach adopted in the presented work conforms to the definition of spoken term detection (STD) as given for example in [19]. This method does not care about the somewhat abstract topic of the document (like traditional IR does or at least claims to) but instead just looks for the occurrences of query terms. Unlike keyword spotting methods, STD uses a prebuilt index for the actual query searching, making the search faster; it also means that the queries need not be known beforehand.

Indexing
Separate indexes were built from the word and the phoneme lattices.

Word Index
The construction of the word index was the easier task. In the word lattice, every arc represents one word and the weight of the arc denotes the confidence measure (expressed as a posterior probability) associated with the given word. In order to reduce the size of the resulting index, two stages of pruning were applied.
The first stage takes place at the beginning, when all the arcs whose posterior probability is lower than a threshold $\theta_w$ are discarded ($\theta_w = 0.05$). Each of the remaining arcs is represented by a 5-tuple (start_t, end_t, word, score, item_id), where start_t and end_t are the beginning and end time, respectively, word is the word (ASR lexicon item) associated with the arc, score is the aforementioned posterior probability and finally item_id is the identifier of the original video file (start_t and end_t represent the offset relative to the beginning of this file). The index is further pruned by removing similar items. If there are two arcs labeled with the same word that either overlap or are less than $\Delta t_w$ apart ($\Delta t_w$ is set to 0.5 seconds), only the arc with the higher score is retained. It follows from the description that the indexing procedure discards the structural properties of the original lattice but, on the other hand, yields a compact and efficient representation of the recognized data. The total number of items in the resulting word index is approximately 12M.
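The two pruning stages of the word index can be sketched as follows. The greedy left-to-right merging of neighboring arcs is an assumption about how the "similar items" rule might be applied; the text only states that the higher-scoring arc of such a pair is retained. The field names follow the 5-tuple described above and the sample data are invented.

```python
from collections import namedtuple

IndexItem = namedtuple("IndexItem", "start_t end_t word score item_id")


def build_word_index(arcs, theta_w=0.05, delta_t_w=0.5):
    """Prune lattice arcs by posterior, then merge near-duplicate arcs."""
    # Stage 1: discard arcs whose posterior is below the threshold.
    kept = [a for a in arcs if a.score >= theta_w]
    # Stage 2: among arcs with the same word that overlap or lie closer
    # than delta_t_w seconds, keep only the best-scoring one.
    kept.sort(key=lambda a: (a.item_id, a.word, a.start_t))
    index = []
    for arc in kept:
        last = index[-1] if index else None
        similar = (last is not None
                   and last.item_id == arc.item_id
                   and last.word == arc.word
                   and arc.start_t - last.end_t < delta_t_w)
        if similar:
            if arc.score > last.score:
                index[-1] = arc
        else:
            index.append(arc)
    return index


arcs = [
    IndexItem(1.0, 1.4, "valka", 0.90, "file1"),
    IndexItem(1.1, 1.5, "valka", 0.30, "file1"),  # overlaps the first arc
    IndexItem(5.0, 5.4, "valka", 0.60, "file1"),  # far enough to be kept
    IndexItem(2.0, 2.3, "a", 0.01, "file1"),      # below theta_w
]
word_index = build_word_index(arcs)
```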

Phoneme Index
The building of the phoneme index is more complicated. Having single phones as the index items was found to be ineffective as it produced a lot of false alarms. Therefore the proposed algorithm traverses the lattice and collects triplets of adjacent arcs (i.e., trigrams of subsequent phonemes) and immediately discards those trigrams that meet any one of the following conditions:
• one or more of the phonemes is a silence
In the next step of the indexing algorithm, all the trigrams whose combined score falls below a threshold $\theta_C$ ($\theta_C = 0.1$) are discarded. The remaining trigrams are then ordered on the time axis; if there are several triplets labeled with the same phoneme trigram within a window of length $\Delta t_p$ ($\Delta t_p = 0.03$ s), only the triplet with the highest score is included in the index. All the algorithm steps naturally again cause the structural properties of the lattice to be discarded. Finally, the same 5-tuples as in the case of the word index are stored in the database for each remaining trigram, only now the word is replaced with the numeric ID representing the given phoneme trigram. There are approximately 63k different phoneme trigrams in the final index and the number of items exceeds 88M.
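A minimal sketch of the trigram collection is given below. It makes two assumptions that are not taken from the text: the combined score of a trigram is computed as the product of the three arc posteriors, and the window test compares the start times of trigrams with the same label. The arc representation and the phoneme labels for silence are likewise illustrative.

```python
from collections import defaultdict

SILENCE = {"sil", "sp"}  # assumed labels of the silence models


def collect_trigrams(arcs, theta_c=0.1):
    """Collect (start, end, phones, score) for triplets of adjacent arcs."""
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc["src"]].append(arc)
    trigrams = []
    for a in arcs:
        for b in outgoing[a["dst"]]:
            for c in outgoing[b["dst"]]:
                phones = (a["phone"], b["phone"], c["phone"])
                if any(p in SILENCE for p in phones):
                    continue
                # Assumed combination rule: product of the arc posteriors.
                score = a["score"] * b["score"] * c["score"]
                if score >= theta_c:
                    trigrams.append((a["start"], c["end"], phones, score))
    return trigrams


def prune_window(trigrams, delta_t_p=0.03):
    """Keep the best-scoring trigram among equal labels within the window."""
    best = []
    for tri in sorted(trigrams, key=lambda t: (t[2], t[0])):
        if best and best[-1][2] == tri[2] and tri[0] - best[-1][0] < delta_t_p:
            if tri[3] > best[-1][3]:
                best[-1] = tri
        else:
            best.append(tri)
    return best


lattice = [
    {"src": 0, "dst": 1, "phone": "v", "start": 0.00, "end": 0.05, "score": 0.9},
    {"src": 1, "dst": 2, "phone": "aa", "start": 0.05, "end": 0.15, "score": 0.8},
    {"src": 2, "dst": 3, "phone": "l", "start": 0.15, "end": 0.20, "score": 0.9},
    {"src": 3, "dst": 4, "phone": "sil", "start": 0.20, "end": 0.40, "score": 1.0},
]
phoneme_index = prune_window(collect_trigrams(lattice))
```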

Searching
When searching the word index, all possible phonetic transcriptions (phoneme representations) of the query word are found in the lexicon. Then those phoneme sequences are mapped back to all corresponding word forms from the lexicon. This makes it possible to search simultaneously for, for example, both the English and Czech spelling variants of a word (e.g., Shoah and the Czech transliteration Šoá). The system also makes it possible to search for all inflected forms of a given word. If this feature is enabled, the lemma is also found for each of the query words. Consequently the set of query words is extended with all possible word forms found in the vocabulary for each of the lemmas (these linguistic processing steps are done using a method described in [20]).
Search in the phoneme index takes place when the query word is an OOV word or when it is forced by the user. The query word is again transcribed into a sequence of phonemes (we have a rule-based system for phonetic transcription and thus the phoneme representation can be obtained easily even for an OOV word).
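The mapping from a query word to all word forms sharing one of its pronunciations can be sketched with a simple inverted pronunciation table. The lexicon entries and phoneme strings below are hypothetical and only illustrate the mechanism; the lemma-based expansion described in [20] is not covered.

```python
from collections import defaultdict


def expand_query(word, lexicon):
    """Return all lexicon forms that share a pronunciation with `word`.

    lexicon: dict mapping a word form to a list of phoneme strings.
    """
    by_pronunciation = defaultdict(set)
    for form, pronunciations in lexicon.items():
        for pron in pronunciations:
            by_pronunciation[pron].add(form)
    expanded = set()
    for pron in lexicon.get(word, []):
        expanded |= by_pronunciation[pron]
    return expanded


# Hypothetical lexicon fragment with invented phoneme strings.
lexicon = {
    "Shoah": ["sh o aa"],
    "Šoá": ["sh o aa"],
    "válka": ["v aa l k a"],
}
forms = expand_query("Shoah", lexicon)
```

A word that is missing from the lexicon yields an empty set, which corresponds to the case where the system has to fall back to the phoneme index.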
Then for each of the phoneme strings the following steps are performed:
1. The consecutive phone trigrams are generated, e.g., the word 'válka' (the war) is decomposed into 'v aa l', 'aa l k' and 'l k a'.
2. All those trigrams are simultaneously searched for in the phoneme index and ordered according to the video file ID and the starting time.
3. For each video file ID, the found trigrams are clustered together on the time axis so that the time gap between clusters is at least equal to $\theta_{search}$ ($\theta_{search} = 0.2$ s).
4. Every such cluster is then assigned a score that is computed as
$score_{comb} = (1 - \lambda) \cdot score_{ACM} + \lambda \cdot score_{hit}$
where $score_{ACM}$ is the arithmetic mean of the scores of the phoneme index items in the cluster and $score_{hit}$ is the ratio between the number of trigrams that were correctly found in the given cluster and the number of trigrams representing the searched word. This implies that the algorithm does not strictly require the presence of all trigrams from the query. The interpolation coefficient was tuned using development data and consequently set to $\lambda = 0.6$. The $score_{comb}$ then serves as the ultimate relevance score.
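The steps above can be sketched for a single video file as follows. The hit representation (start time, trigram, score) is illustrative, the gap between clusters is measured between consecutive start times, and the hit ratio counts distinct query trigrams; the latter two choices are assumptions, since the text does not spell them out.

```python
def query_trigrams(phonemes):
    """Decompose a phoneme sequence into consecutive trigrams."""
    return [tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


def cluster_hits(hits, gap=0.2):
    """Group hits of one video file so that clusters are at least `gap` apart.

    hits: list of (start_t, trigram, score).
    """
    clusters = []
    for hit in sorted(hits, key=lambda h: h[0]):
        if clusters and hit[0] - clusters[-1][-1][0] < gap:
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    return clusters


def combined_score(cluster, query, lam=0.6):
    """Interpolate the mean index score with the share of query trigrams found."""
    score_acm = sum(h[2] for h in cluster) / len(cluster)
    wanted = set(query)
    found = {h[1] for h in cluster} & wanted
    score_hit = len(found) / len(wanted)
    return (1 - lam) * score_acm + lam * score_hit


query = query_trigrams("v aa l k a".split())
hits = [
    (10.00, ("v", "aa", "l"), 0.8),
    (10.05, ("aa", "l", "k"), 0.6),
    (50.00, ("l", "k", "a"), 0.9),
]
clusters = cluster_hits(hits)
scores = [combined_score(c, query) for c in clusters]
```

In this toy case the first cluster contains two of the three query trigrams and still receives a relevance score, which illustrates that a partial match is sufficient.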
The presented system also provides functionality for searching for phrases of several words. Every word in the query phrase can be marked as either mandatory or optional. The search algorithm then:
1. Looks for individual words and orders the results on the time axis (separately for each video file).
2. Clusters the results so that the time gap between the clusters is at least $\theta_{\text{phrase-search}} = 10$ s.
3. Discards all clusters which do not contain all mandatory words.
4. Assigns each cluster a score that is computed as the arithmetic mean of the individual word scores.
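The phrase search steps can be sketched for one video file as shown below. The hit representation (time, word, score) and the choice to report the time of the first hit in a cluster as the replay point are assumptions made for the example.

```python
def phrase_search(hits, mandatory, gap=10.0):
    """Cluster word hits of one video file and score the valid clusters.

    hits: list of (time, word, score); mandatory: words every cluster needs.
    Returns a list of (cluster start time, mean score).
    """
    clusters = []
    for hit in sorted(hits):
        if clusters and hit[0] - clusters[-1][-1][0] < gap:
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    results = []
    for cluster in clusters:
        words = {word for _, word, _ in cluster}
        if set(mandatory) <= words:
            mean_score = sum(score for _, _, score in cluster) / len(cluster)
            results.append((cluster[0][0], mean_score))
    return results


hits = [
    (100.0, "terezin", 0.9),
    (103.0, "transport", 0.7),
    (500.0, "transport", 0.8),  # isolated hit without the mandatory word
]
results = phrase_search(hits, mandatory=["terezin"])
```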

GUI Description and HW/SW Implementation Details
The graphical user interface is designed with the non-professional IT user in mind and is therefore as simple as possible (see Figure 1). In the lower left corner, it has a text box for entering the query word/phrase and check boxes for selecting the channels to be searched (interviewer and/or interviewee). The query can be modified using a set of simple operators: the plus sign is used to mark mandatory words and enclosing a word in parentheses tells the search engine that it should look for the exact word form only (i.e., the default expansion to all possible word forms is disabled). The retrieved results are shown in the right half of the GUI window. Each item shows the unique video file ID, the channel, the speaker's name, the exact form of the word or the phrase that was found, the time when the word/phrase occurs and the relevance score. The upper left corner then contains the multimedia player with the usual controls, which makes it possible to immediately replay any video file listed in the result window, starting several seconds prior to the query occurrence.
The search engine was implemented with a specific focus on the retrieval speed and on the system scalability. We also wanted to run the search algorithm on portable equipment so that we can disseminate the research results at various forums. Thus we decided to employ an SQL database server architecture for storage of both the word and the phoneme index in order to ensure fast system response (as the SQL access algorithms are well optimized for speed). The speed is further improved by storing the database on a 64 GB SSD drive instead of a conventional HDD. Other parameters of the hardware are rather moderate (HP EliteBook 8730w with an Intel Core 2 Duo processor, 2.80 GHz, 4 GB RAM). The video files with the actual testimonies are stored on two external USB hard drives (1 TB each). The system architecture supports remote access to the database, which makes it possible to run the search algorithm on different portions of the archive in parallel using several CPUs and therefore to scale the system to much larger archives rather seamlessly.
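The storage of the index in an SQL database can be illustrated with the following sketch. It uses the sqlite3 module from the Python standard library purely as a stand-in, since the text does not name the database server; the table layout, column names and index name are assumptions derived from the 5-tuple described in the indexing section.

```python
import sqlite3


def create_word_index(rows):
    """Create an in-memory table holding the word index 5-tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word_index ("
        "start_t REAL, end_t REAL, word TEXT, score REAL, item_id TEXT)"
    )
    # A database index on the word column keeps the lookup fast.
    conn.execute("CREATE INDEX idx_word ON word_index (word)")
    conn.executemany("INSERT INTO word_index VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def search_word(conn, forms):
    """Return all hits for any of the given word forms, best score first."""
    placeholders = ",".join("?" for _ in forms)
    sql = (
        "SELECT item_id, start_t, end_t, word, score FROM word_index "
        f"WHERE word IN ({placeholders}) ORDER BY score DESC"
    )
    return conn.execute(sql, list(forms)).fetchall()


rows = [
    (12.3, 12.8, "valka", 0.91, "file1"),
    (40.0, 40.6, "valky", 0.55, "file2"),
    (70.1, 70.5, "mir", 0.80, "file1"),
]
conn = create_word_index(rows)
found = search_word(conn, ["valka", "valky"])
```

All expanded forms of a query word are retrieved with a single parameterized statement, so that the ranking by score can be left to the database.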

Evaluation
The quality of the STD was evaluated using two sets of queries whose occurrences have been manually annotated in the test portion of the video data. The first set (SetIn) contains 20 words that are present in the ASR lexicon (and consequently also in the word index); the total number of occurrences of SetIn words in the test data is 374. The second set (SetOut) consists of 108 words that are not included in the ASR lexicon and thus can be found only using the phoneme-based search; those words occur in the test data 414 times in total. The detection results are shown in Figure 2 (DET plot) and Figure 3 (ROC plot). In addition, the Figure-of-Merit (FOM) and Equal-Error-Rate (EER) values are given in Table 2.
The plots reveal that the number of false alarms is essentially the same for search in the 1-best ASR output (i.e., the situation when only the best path through the lattice is retained) and search in the word lattice, up to the point where the 1-best system is not able to provide any more correct hits and the lattice search becomes the clear winner. Searching in the phoneme lattice, on the other hand, produces substantially more false alarms than both word-based algorithms, yet its big advantage lies in the ability to search for words that are missing from the ASR lexicon. Those missing words are often rare personal and place names; although they are underrepresented in the language model training data, they are very important to the searchers of the collection [21] (in fact, two-thirds of the requests specified named entities in the preliminary MALACH user studies). The phonetic search mode also allows users to type only an "approximate" spelling of the searched query, which is extremely helpful especially in the case of foreign words or words that are transliterated from different alphabets, where it is not clear what spelling variant (if any) appears in the ASR lexicon.
One of the key considerations of our STD engine design was the focus on the quick response of the system. The remainder of this section is therefore devoted to the evaluation of the retrieval speed. Figure 4 depicts the histograms of search times for both the lexical and phonetic searches (we define the search time as the period between entering the query and the moment when all the found segments are presented to the user).
It shows that the statistical mode of the lexical search time is only 0.5 seconds and the vast majority of the searches are finished in less than 5 seconds. For the search in the phonetic lattices, the statistical mode is 7.5 seconds and the majority of the searches take less than 20 seconds, which we still find very reasonable. Figure 5 further reveals that in the case of searching the word lattices, the search time grows more or less linearly with the number of retrieved results. This is because retrieving one word hit requires just a single SQL query that always takes the same time, and no further processing is necessary. On the other hand, the phonetic search time dispersion that can be observed in Figure 6 is attributed to the fact that a large number of individual phoneme trigrams is retrieved first for all queries and those trigrams are then clustered and filtered to produce the list of relevant results. The search time then depends mainly on the query length (see Figure 7) because this is the factor that influences the number of individual phoneme trigrams that are returned in the first retrieval step.
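How operating points for a detection trade-off plot can be obtained from a ranked list of detections is sketched below. This is a generic illustration, not the evaluation code used for the reported figures: it assumes that every detection has already been marked as correct or false against the manual annotation, and it expresses false alarms per hour of audio, which may differ from the axes used in the plots.

```python
def det_points(detections, n_true, hours):
    """Sweep the score threshold from high to low over ranked detections.

    detections: list of (score, is_correct); n_true: number of annotated
    occurrences; hours: duration of the searched audio.
    Returns a list of (threshold, miss_rate, false_alarms_per_hour).
    """
    points = []
    hits = 0
    false_alarms = 0
    for score, is_correct in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_correct:
            hits += 1
        else:
            false_alarms += 1
        points.append((score, 1.0 - hits / n_true, false_alarms / hours))
    return points


detections = [(0.9, True), (0.8, False), (0.7, True)]
points = det_points(detections, n_true=4, hours=2.0)
```

Lowering the threshold can only decrease the miss rate and increase the number of false alarms, which is the trade-off that the DET and ROC plots visualize.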

Conclusions
The paper introduced a system for searching spontaneous speech data that was built in an effort to advance towards the ultimate goals of the MALACH project. Both the objective evaluation presented in this paper and the positive feedback that the researchers received during several live demonstrations suggest that the work was successful and that we have created a fast and efficient system. It could make the Czech interviews more accessible to historians, filmmakers, students, and of course also to the general public. Negotiations concerning the system deployment in the Malach Center for Visual History in Prague are currently taking place, as is the preparation of a joint project with the USC Shoah Foundation Institute that would aim at the development of an English version of the system.