A cross-lingual adaptation approach for rapid development of speech recognizers for learning disabled users
© Bohac et al.; licensee Springer. 2014
Received: 1 February 2014
Accepted: 1 October 2014
Published: 18 October 2014
Building a voice-operated system for learning disabled users is a difficult task that requires a considerable amount of time and effort. Due to the wide spectrum of disabilities and their different related phonopathies, most approaches available are targeted to a specific pathology. This may improve their accuracy for some users, but makes them unsuitable for others. In this paper, we present a cross-lingual approach to adapt a general-purpose modular speech recognizer for learning disabled people. The main advantage of this approach is that it allows rapid and cost-effective development by taking the already built speech recognition engine and its modules, and utilizing existing resources for standard speech in different languages for the recognition of the users’ atypical voices. Although the recognizers built with the proposed technique obtain lower accuracy rates than those trained for specific pathologies, they can be used by a wide population and developed more rapidly, which makes it possible to design various types of speech-based applications accessible to learning disabled users.
Millions of individuals suffer from learning disabilities that also affect their speech production. These conditions result in atypical voices that are very difficult to understand even for human listeners, as they may affect one or more of the major language subsystems, including phonology, morphology, syntax and semantics. Focusing on phonology, impaired speech can affect voice timing, pitch, volume, fluency and articulation.
Different studies have focused on the nature of such mispronunciations and their impact on intelligibility. For example, one study shows that impaired speakers have good control of tone but a diminished discrimination between stressed and unstressed vowels, as well as abnormal production of extremely long or short vowels. Other authors focus on how to measure the intelligibility of atypical voices objectively along different perceptual dimensions.
Speech technology enhances the functional and affective experience of technology for many user groups, including people with reading difficulties, hearing and visually impaired users, older adults, and people with learning disabilities. One of the main applications of speech technology is voice therapy. For example, the PreLingua tool aims to train skills such as intensity, tone, vocal onset, phonation time and vocalization.
However, when the phonological disorder is severe, it can be useful to complement the voice therapy with other applications for augmentative and alternative communication. For example, VIVOCA  is a voice-input voice-output augmentative communication aid for people with severe impairment. Joode et al. present a detailed study on assistive technologies for people with cognitive deficits including different uses of speech technology , and Lancioni et al. provide a review of speech generating devices for augmented communication . Recently, there was a special issue on speech and language processing as assistive technologies , which shows the potential interest of this area.
Despite their high potential to help users, these technologies usually address specific disorders. For example, one work studied the best configuration of parametrization, feature selection and classification techniques for the recognition of stuttered events, while another studied different measures of vocal quality, articulation, nasality and prosody of spastic dysarthria. This means that most of the systems have been tailored to specific population groups, which makes them more effective for those users but less adequate for people suffering from other related disorders.
In this paper, we present an approach to develop speech recognizers for learning disabled users aimed at a general population. We were also particularly interested in defining a procedure that is rapid and cost-effective. The existence of affordable assistive technology is an effective means to ensure full and equal enjoyment of all human rights and fundamental freedoms. To make the development fast and efficient, and to speed up the transition from an experimental tool to a professional program, we decided to employ a modular automatic speech recognition (ASR) system designed at the Technical University of Liberec during the last decade. It can be easily adapted to various tasks, including on-line and off-line speech-to-text transcription and robust voice-command control based on real-time keyword spotting. The system was originally developed for the Czech language, but it was later ported to other languages such as Slovak, Polish, Croatian or Russian, using a cross-lingual adaptation approach.
It is well known that ASR depends on large amounts of transcribed speech recordings in order to estimate the parameters of the acoustic model. Recording such large speech corpora is time-consuming and expensive; as a result, there do not exist sufficient quantities of data for disabled users.
Our proposal uses a cross-lingual adaptation approach in which we exploit most of the available resources in order to recognize atypical speech. This approach is based on an idea similar to the one used to recognize poorly resourced languages: to use data from a well-resourced source language to estimate the acoustic models for a recognizer in a poorly resourced target language. In our case, we use data from typical voices in different languages, together with a small amount of training data from disabled people, in order to recognize impaired speech.
Thus, the general idea is to achieve acceptable results reducing the cost of adapting a model to atypical voices. This can contribute to fostering the development of speech applications and help disabled users to be more actively involved in choosing their assistive technology from a wider range of options.
The rest of the paper is structured as follows. ‘Related work’ section presents related works, ‘Proposed method’ section presents our proposal for cross-lingual adaptation of speech technologies, the ‘A case study for Spanish disabled users’ section shows a case study developing our proposal in which we ported a speech recognizer from Czech to Spanish and then adapted it to Spanish disabled users. The experimental results with this example are discussed in ‘Experimental evaluation’ section. Finally, in ‘Conclusions’, we present the conclusions and propose future improvements which may increase the performance of the recognizers developed following our proposal.
2 Related work
As discussed in the literature, traditional automatic speech recognition techniques are unsuitable for impaired speech for several reasons: the amount of training material is limited, the training samples are highly variable, and they are very different from voices corresponding to non-disabled users.
Because atypical voices have very reduced intelligibility, some authors have addressed the problem of automatic recognition of disabled users by carrying out in-depth studies of the most salient features of some types of atypical voices. For example, one study examined the predictability of articulatory errors and trained a Bayesian network in order to build an augmented ASR system that considered the statistical relationships between vocal tract configurations and their acoustic consequences. Similarly, another work focused on aspects of syllabic strength for moderate hypokinetic dysarthric speech.
Other work is not so much focused on the recognition itself, but on facilitating the correction of the errors that the recognizer will presumably make. This way, some authors focus on allowing the users to select between alternative word candidates using n-best lists , while others propose methods that use different approaches to compute the most probable mismatch taking into account the peculiarities of certain pathologies .
Despite their high performances, these approaches are focused on particular disabilities or demand a detailed study of the characteristics of the target users. Other authors have addressed more general-purpose approaches. For example, Hawley et al. proposed an incremental approach in which they collected an initial corpus from each user and employed it to train models of words in a reduced vocabulary of commands . Then, they re-estimated the model using the initial training examples and subsequent examples collected from the users while they were employing an application. The advantage of this proposal is that it does not require expert knowledge about the pathologies or a high amount of atypical voice samples, for which the resources are very limited.
We believe that the challenge of limited resources for atypical voices is similar to the case of under-resourced languages: it is costly and time-consuming to gather and process speech material in both cases, which is one of the major limiting factors for speech-enabled application development . Cross-language approaches allow the common exploitation of acoustical similarities between languages in order to be able to use resources available in different languages for the recognition of a less-resourced one. In fact, this approach has also been used to recognize other types of atypical voices such as non-native or accented speech . There are different ways in which existing models can be used along with new data in a target language, mainly training on multilingual data , or cross-lingual adaptation of the acoustic  and language  models.
Our recent experiments show that it is possible to make a cost-efficient cross-lingual adaptation of speech recognition technologies . The continuous speech recognizer may achieve an accuracy between 65% and 75% when using an acoustic model of related languages (e.g. recognition of Croatian using Slovak, Polish, Russian or Czech acoustic models) and between 80% and 85% when the acoustic model is enriched by (semi-automatically obtained) training data of the target language (e.g. recognition of Croatian using Czech + Croatian mixed acoustic model). In this paper, we propose a method to exploit the benefits of cross-lingual adaptation of speech technologies for the recognition of the atypical speech of learning disabled users.
Our previous work in the development of assistive speech technologies for motor handicapped users  showed that assistive speech technologies can improve the living conditions of disabled people (and even help them to find job opportunities). Thus, the availability of methods for rapid and cost-efficient prototyping of speech applications, such as the one proposed in this paper, can be of great help for people who suffer from communication disorders.
3 Proposed method
As the development of speech recognition technologies starting from scratch is a very time- and resource-consuming process, we propose to avoid these costs by means of cross-lingual adaptation. The general idea is to use the resources already created for a source language to ease and accelerate the production of the target language resources. The same idea can be used to adapt existing models for common speakers to the needs of handicapped speakers.
3.1 Cross-lingual adaptation
As can be observed, the inputs to the ASR are as follows: the language model in the target language, the (adapted) acoustic model, and a vocabulary in the target language.
G2P conversion determines how the target language words sound in terms of the source language phonetic inventory (SL - phoneme set in the figure). It is carried out in two main stages. Firstly, we convert between the target language orthographic form and target language phonetic form (this process is denoted as TL - G2P in the figure). Secondly, we decide how to map between the target language phoneme inventory (TL - phoneme set) and that of the source language. This involves determining which phoneme pairs (or maybe phoneme groups) are the most similar and possibly which phonemes remain unused.
When the G2P conversion is defined, a fitting acoustic model can be prepared. There exist three alternatives: i) to use the data already available in the SL and mix the model from one or more source languages, ii) mix the SL recordings with some amount of the TL data, and iii) use the TL data only. In the latter case, we can exploit the SL for the development of support technologies in order to lower the demands of expert work (e.g. forced alignment of the TL data using SL models).
Finally, a language model can be built. To do so, it is necessary to gather and pre-process a sufficient amount of textual data in the target language and analyse it to choose a suitable vocabulary. Part of this vocabulary is the phonetic form of the items obtained by the G2P conversion, which can be manually corrected.
3.2 Acoustic and language model
The preparation of training data for the acoustic model is the most time-consuming phase of the adaptation. To reduce the time requirements, we propose to use forced alignment, which consists in assigning time stamps to every word in the input orthographic transcription according to the corresponding audio recording. Forced alignment has many other applications, such as indexing a spoken document for searching, timing subtitles automatically or preparing test data. Processing a document with a forced alignment algorithm requires an audio recording, its corresponding text transcription, an acoustic model (fitting the phonetic inventory used in the phonetic transcription) and either a vocabulary containing all words in the document or a G2P conversion module. Additional input resources may be needed to process more complicated tasks (e.g. processing of numerals, physical units, degrees and titles, and special symbols like @, &, or %).
The forced alignment tool that we use is based on the continuous speech recognizer in the source language. The language model is very constrained, as all words must appear strictly in the correct order. Each word can either fit one of its acoustic forms or be skipped. After the recording is ‘recognized’, a special post-processing step takes place which corrects the different errors that may arise (e.g. when the textual transcript somehow differs from the audio content). As this is a complex process, we refer readers to the original description of the tool for the details.
Forced alignment is particularly advantageous for the adaptation to atypical voices, as the quality of the acoustic model required can be much lower than that demanded by the continuous speech recognizer while still obtaining very accurate results. Moreover, our approach was developed to process inaccurate transcriptions, and it can handle ‘low quality’ acoustic models, i.e. non-robust models trained with a small amount of data.
The preparation of the language model does not demand much manual work. The most sensitive step is to find a suitable source for the text in the language model. The text usually contains many characters that are not appropriate for training a language model, and there are several ways in which they can be handled: some of them may be erased (e.g. ‘.’, ‘?’…), others can be rewritten (e.g. numerals) and some of them can be unified and then rewritten (e.g. brackets). Once the text has been processed, a bigram statistical language model is automatically created.
3.3 Adaptation to atypical voices
Every application for disabled users demands a high level of adaptation and customization. Some of these enhancements can be done by the system developers, while others demand active cooperation between the final user and the developers (or a trained assistant). The main enhancements we can list include the following: i) training the acoustic model using the data recorded with a subset of final users, ii) general changes in the G2P so it covers some typical speech disorders, iii) adaptation of the acoustic model for the concrete speaker or environment, and iv) reasonable vocabulary limitation (usually context-dependent).
Some users are able to cooperate further (e.g. motor handicapped) but some are not (e.g. severe intellectual disability). If the users are able to cooperate, it is possible to enhance the described adaptation scheme by recording 10 to 60 min of additional recordings in the environment in which they will more frequently use the application and redefine the keywords so that they are easier to pronounce . Finally, users can be offered some training to help them master the assisting technology and improve the performance .
Once the cross-lingual adaptation is carried out, a second step is performed to adapt the recognizer trained with common voices to detect impaired speech. Manually sorting and transcribing the recordings is very time-consuming. There is no other choice in the case of longer utterances (sentences), but we propose an automated approach to choose and prepare the isolated word recordings (even if it may imply losing some data).
Our solution requires the audio recording, expected transcription of the recording, and a G2P tool. The solution must be robust to face different phenomena. For example, as some of the users may have reading difficulties, they can be prompted by an assistant to repeat a word that he/she has previously read aloud. This may lead to recording the annotated word more than once. Also, the observed pronunciation may strongly differ from the G2P one. All these obstacles are solved by setting the forced alignment tool properly. We suggest changing the forced aligner language model so the words may be repeated. The phonetic inventory of the forced aligner can also be enhanced by a set of rules modelling the most common pronunciation distortions observed in the data set, as well as the totally mispronounced words (e.g. when omitting groups of phonemes).
Once the recordings are aligned, we select the data suitable for training. We use a recording if the detected pronunciation differs only slightly from the presumed one, and discard it otherwise. The similarity is checked using the minimum edit distance (MED) algorithm. The MED algorithm aligns the reference and result sequences in terms of the hits, substitutions, deletions and insertions needed to transform one sequence into the other. We set the rule that a word containing N phonemes in the reference transcription (N>3) has to reach at least N−2 hits to be accepted for training, a heuristic that we found to be appropriate in our previous work. Once the data have been selected, they are added to the training set and the whole procedure is repeated in a second iteration.
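As an illustration, the acceptance rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the tool used in the experiments: the function names `med_align` and `accept_for_training` are hypothetical, each phoneme is represented by one list element, ties between equally cheap alignments are broken in favour of more hits, and words with N ≤ 3 are rejected because the text states the rule only for N>3.

```python
def med_align(ref, hyp):
    """Minimum edit distance alignment of two phoneme sequences.

    Returns the number of hits, substitutions, deletions and insertions.
    """
    # Each cell holds (edit_cost, -hits, subs, dels, ins). Tuples compare
    # lexicographically, so equal-cost alignments are resolved towards
    # the one with more hits (an assumption of this sketch).
    prev = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        cur = [(i, 0, 0, i, 0)]
        for j, h in enumerate(hyp, start=1):
            c, nh, s, d, ins = prev[j - 1]
            if r == h:
                diag = (c, nh - 1, s, d, ins)       # hit
            else:
                diag = (c + 1, nh, s + 1, d, ins)   # substitution
            c, nh, s, d, ins = prev[j]
            deletion = (c + 1, nh, s, d + 1, ins)
            c, nh, s, d, ins = cur[j - 1]
            insertion = (c + 1, nh, s, d, ins + 1)
            cur.append(min(diag, deletion, insertion))
        prev = cur
    _, neg_hits, subs, dels, ins = prev[-1]
    return {"hits": -neg_hits, "subs": subs, "dels": dels, "ins": ins}


def accept_for_training(ref, hyp):
    """Apply the 'at least N-2 hits' rule for a reference of N phonemes."""
    n = len(ref)
    if n <= 3:
        return False  # assumption: the rule is only defined for N > 3
    return med_align(ref, hyp)["hits"] >= n - 2


reference = list("giyermo")                          # 7 phonemes
print(accept_for_training(reference, list("giermo")))  # one deletion
print(accept_for_training(reference, list("gemo")))    # three deletions
```

With a reference of seven phonemes, a detected form with one deleted phoneme keeps six hits and is accepted, whereas a form with three deleted phonemes keeps only four hits and is rejected.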
4 A case study for Spanish disabled users
In previous work, we successfully ported our ASR system from Czech to different Slavic languages, such as Polish, Croatian, Slovak or Russian. In this paper, we propose to use a similar approach to build a Spanish model with our ASR system using the resources available for Czech (source language) and Spanish (target language) and adapt it for the recognition of impaired speech (adaptation of the target language).
Although Czech and Spanish are not as similar as the Slavic languages considered in our previous works, we can exploit the Czech resources to speed up the preparation of the Spanish training data (which is a very time-consuming task). Once we have a sufficient amount of Spanish data, we can leave out the original Czech data and use the Spanish resources only.
The ASR system we have used was originally proposed for processing Czech, a highly inflective Slavic language, so it supports large vocabularies (it can operate with 500,000 vocabulary items in the on-line mode). The inputs are converted via FFmpeg to the standard 16 kHz pulse-code modulation (PCM) wave format (16 bits per sample), so the system supports most audio (and video) input formats. The parametrization uses standard 39-MFCC vectors computed on 20 ms frames with 10 ms overlap, and the feature vector is processed with the cepstral mean subtraction (CMS) normalization.
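To make the framing and normalization figures concrete, the following sketch computes the frame boundaries implied by 16 kHz audio, 20 ms frames and 10 ms overlap, and applies a per-utterance cepstral mean subtraction. It is only an illustration: the function names are hypothetical, the MFCC computation itself is omitted, and the CMS variant shown (one mean over the whole utterance) is an assumption that may differ from the one used in the system.

```python
SAMPLE_RATE = 16000   # Hz, as stated in the text
FRAME_MS = 20         # frame length
OVERLAP_MS = 10       # overlap between consecutive frames


def frame_bounds(num_samples, sample_rate=SAMPLE_RATE,
                 frame_ms=FRAME_MS, overlap_ms=OVERLAP_MS):
    """Return (start, end) sample indices of every complete frame."""
    frame_len = sample_rate * frame_ms // 1000           # 320 samples
    hop = sample_rate * (frame_ms - overlap_ms) // 1000  # 160 samples
    return [(start, start + frame_len)
            for start in range(0, num_samples - frame_len + 1, hop)]


def cepstral_mean_subtraction(features):
    """Subtract the per-coefficient mean computed over all frames.

    `features` is a list of equal-length vectors, one per frame
    (39 coefficients per frame in the configuration described above).
    """
    if not features:
        return []
    dim = len(features[0])
    means = [sum(frame[k] for frame in features) / len(features)
             for k in range(dim)]
    return [[frame[k] - means[k] for k in range(dim)] for frame in features]


bounds = frame_bounds(SAMPLE_RATE)  # one second of audio
print(len(bounds), bounds[0], bounds[-1])
```

Under these settings, one second of audio yields 99 complete frames of 320 samples each, starting every 160 samples.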
We cover 42 phonemes and 8 non-speech events (e.g. click, breath, silence, hesitation). The output of the recognizer comprises the written form of the detected word, the detected phonetic form (words usually have several phonetic alternatives), and the time stamps (beginning and end of each word and noise). Additionally, it can be used on-line, in which case it can also run a post-processing step that formats the output as it was pronounced.
Continuous speech recognition is very powerful, but it can also be very demanding for disabled users who might not be able to pronounce a whole sentence correctly but still be able to say it word by word. That is why we employ a keyword spotter (KWS), which is a speech recognition technology used for the detection of isolated words of interest (keywords) from an audio stream. Typical applications include smart homes, industrial enhancements, making audio-archives accessible, or security purposes. The specific implementations and their accuracies may differ between on-line and off-line applications as shown in .
As the algorithms mostly rely on the acoustic similarity between keywords and features of the audio stream, we must pay attention when choosing the keywords. If we tried to detect acoustically similar words or words that are substrings (one word is included in the other), the system would raise many false alarms or confuse the output. This problem is particularly significant in Slavic languages, where the words often differ in the ending only.
As we demand the ability of on-line response, we use our KWS system derived from the continuous Czech speech recognition system described previously. It can be used in both on-line and off-line modes. Another advantage is that this KWS uses the same parametrization and acoustic model as the continuous speech recognizer. The difference is in the language model and the vocabulary. It employs phoneme-based n-grams to model speech and build the filler model. Such defined fillers compete with keywords from the vocabulary and with the non-speech events and noises. The performance is controlled by language model parameters and penalties so we can tune the ratio between false alarms and missing detections.
As indicated in ‘Adaptation to atypical voices’ section, we propose to constrain the minimal length of a keyword to three phonemes and the minimum difference between keywords in at least two phonemes. This ensures there will be no substrings in the vocabulary but still there can be false alarms caused by a combination of two (or more) words in the audio stream that form together one of the keywords. This problem does not appear in the case of isolated word utterances.
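A simple way to check a candidate keyword list against these two constraints is sketched below. The sketch assumes that the "difference between keywords" is measured as the edit distance between their phonetic forms and that each character of a phonetic form stands for one phoneme. Both are assumptions of this example, as are the function names and the toy vocabulary.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # hit or substitution
        prev = cur
    return prev[-1]


def vocabulary_conflicts(keywords, min_len=3, min_diff=2):
    """List keywords that are too short or too close to another keyword.

    `keywords` maps the written form to its phonetic form.
    """
    problems = []
    for word, phones in keywords.items():
        if len(phones) < min_len:
            problems.append(("too_short", word))
    for (w1, p1), (w2, p2) in combinations(keywords.items(), 2):
        if edit_distance(p1, p2) < min_diff:
            problems.append(("too_similar", w1, w2))
    return problems


# Hypothetical vocabulary; phonetic forms are illustrative only.
vocabulary = {"casa": "kasa", "cama": "kama", "mesa": "mesa",
              "pan": "pan", "el": "el"}
print(vocabulary_conflicts(vocabulary))
```

In this toy vocabulary, ‘el’ is flagged as shorter than three phonemes, and ‘casa’ and ‘cama’ are flagged because their phonetic forms differ in a single phoneme.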
4.1 Cross-lingual adaptation from Czech to Spanish
To build the Spanish acoustic model we used the Albayzin  corpus. The corpus comprises two sub-corpora with 6,800 utterances each: one based on texts extracted from novels and the other based on queries to a geography database. The utterances were recorded under good acoustic conditions (quiet offices, with the same set of professional microphones) and were pronounced by 304 speakers (152 female, 152 male), whose age varied from 18 to 55 years. The Albayzin corpus represents 12 h 52 min of annotated speech in 13,600 gender and phonetically balanced sentences.
To obtain the training data, we carried out the following preparations. The first step was the conversion of the audio data from the original .ses format to the .wav format (16 kHz, mono, 16 bits per sample PCM). From the transcriptions, we removed the punctuation marks, replaced the numbers with their word forms (e.g. 512 = ‘quinientos doce’), and processed some special symbols (mainly units of areas or distances). To annotate noises, we employed our forced aligner module (see ‘Acoustic and language model’ section), which was able to detect and annotate the noises and select the best alternative phonetic representation for each word.
For the cross-lingual adaptation, we employed the proposed G2P conversion in two stages. The first stage was the conversion from the Spanish orthographic form to the Spanish phonetic form. In order to carry out the conversion, we used several rules that varied from unigrams to trigrams. The converter sequentially parsed the orthographic form finding the longest fitting rule. Then, the output was slightly modified to reflect the voicing assimilation and some more phenomena by a set of regular expressions.
The proposed mapping between Spanish and Czech phonemes
Spanish phoneme [adapted X-SAMPA] → Czech phoneme [PAC]
ESP text: guillermo y yolanda practicaban ciclismo con Jaime
ESP phon: giyermo i yolanda praktikaban ziklismo kon jaJme
ESP PAC: giČermo i Čolanda praktikaban siklismo kon Xajme
Where the rules used for the first three items in the sentence are as follows: ‘gui’ → ‘gi’ ; ‘ll’ → ‘y’ ; ‘e’ → ‘e’ ; ‘r’ → ‘r’ ; ‘m’ → ‘m’ ; ‘o’ → ‘o’ ; ‘ y’ → ‘i’ ; ‘y’ → ‘y’ ; ‘o’ → ‘o’ ; ‘l’ → ‘l’ ; ‘a’ → ‘a’ ; ‘n’ → ‘n’ ; ‘d’ → ‘d’ ; ‘a’ → ‘a’ ; ‘ ’ → ‘ ’. As can be observed, rules may differ if the beginning or end of a word is encountered.
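The two-stage conversion can be illustrated with a short Python sketch of a longest-match rule parser. It is a sketch under stated assumptions rather than the converter used in this work. The rule table contains only the non-identity rules listed above plus those that can be read off the example sentence, the standalone word ‘y’ is handled as a whole-word rule, each character of the phonetic form is treated as one phoneme, and the regular-expression post-processing (e.g. voicing assimilation) is omitted. The Spanish-to-Czech mapping likewise includes only the four substitutions visible in the example.

```python
# Stage 1 rules (orthography -> Spanish phonetic form).
RULES = {
    "gui": "gi",  # listed in the text
    "ll": "y",    # listed in the text
    "ci": "zi",   # read off the example sentence (assumption)
    "ai": "aJ",   # read off the example sentence (assumption)
    "c": "k",     # read off the example sentence (assumption)
}
WORD_RULES = {"y": "i"}  # rule that applies only to the whole word 'y'
MAX_RULE_LEN = 3         # rules range from unigrams to trigrams

# Stage 2 mapping (Spanish phoneme -> Czech phoneme), example-derived only.
SPANISH_TO_CZECH = str.maketrans({"y": "Č", "z": "s", "j": "X", "J": "j"})


def g2p_word(word):
    """Parse one word left to right, always taking the longest fitting rule."""
    if word in WORD_RULES:
        return WORD_RULES[word]
    out, i = [], 0
    while i < len(word):
        for size in range(min(MAX_RULE_LEN, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += size
                break
        else:
            out.append(word[i])  # identity rule, e.g. 'e' -> 'e'
            i += 1
    return "".join(out)


def g2p(text):
    """Stage 1: Spanish orthographic form -> Spanish phonetic form."""
    return " ".join(g2p_word(w) for w in text.lower().split())


def to_czech_inventory(phonetic):
    """Stage 2: map Spanish phonemes onto the Czech phoneme inventory."""
    return phonetic.translate(SPANISH_TO_CZECH)


spanish = g2p("guillermo y yolanda practicaban ciclismo con Jaime")
print(spanish)
print(to_czech_inventory(spanish))
```

The second stage uses a simultaneous character translation rather than chained replacements, because the mapping sends ‘j’ to ‘X’ and ‘J’ to ‘j’, and sequential substitutions could rewrite the same symbol twice.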
To build a language model and vocabulary for continuous Spanish recognition, it was necessary to retrieve and process a large amount of Spanish texts. As we were interested in rapid development, we used daily Spanish and international news from different web pages. We downloaded 11.7 GB of texts from http://elpais.com/, http://www.20minutos.es/, and http://spanish.news.cn/.
As the text corpus was gathered from downloaded articles, we carried out a careful post-processing to prepare the corpus for training the target language model (statistical bigram model of Spanish). This way, we used different scripts to remove all HTML tags, foreign (non-Spanish) characters, English words, and other parts of the text that were not suitable for our purpose (e.g. currency rates, information from stock exchange and sports results). We also replaced all numbers with their orthographic transcription. For example, instead of ‘in 1926’ we had ‘in nineteen hundred and twenty six’ (in Spanish: ‘en 1926’ - ‘en mil novecientos veintiséis’).
Once the data was processed, we computed the bigram language model. For preparing the vocabulary, we employed all words that occurred more than 10 times in the corpus. We decided to use collocations (several words that usually go together - for example ‘Los Ángeles’) for definite and indefinite articles (e.g. ‘el pan’, ‘un profesor’) as short items are disadvantageous for speech recognition. Using this approach, we generated a vocabulary with 54,217 words and word collocations.
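The vocabulary selection and the bigram estimation can be sketched as follows. The frequency threshold (more than 10 occurrences) comes from the text. Everything else is an assumption of this illustration, including the function names, the sentence boundary symbols, the mapping of out-of-vocabulary words to `<unk>`, and the use of unsmoothed maximum-likelihood estimates. The merging of articles and nouns into collocations is not shown.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=10):
    """Keep the words that occur more than `min_count` times in the corpus."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word for word, c in counts.items() if c > min_count}


def bigram_model(sentences, vocab, unk="<unk>"):
    """Maximum-likelihood bigram probabilities P(b | a), without smoothing."""
    bigrams, history = Counter(), Counter()
    for s in sentences:
        words = (["<s>"]
                 + [w if w in vocab else unk for w in s.split()]
                 + ["</s>"])
        for a, b in zip(words, words[1:]):
            bigrams[(a, b)] += 1
            history[a] += 1
    return {(a, b): c / history[a] for (a, b), c in bigrams.items()}


# Toy corpus: 'un' occurs only 3 times and falls below the threshold.
corpus = ["el pan"] * 11 + ["un pan"] * 3
vocab = build_vocabulary(corpus)
model = bigram_model(corpus, vocab)
print(sorted(vocab))
print(model[("el", "pan")])
```

A practical language model would add smoothing or back-off so that unseen word pairs do not receive zero probability; the sketch leaves this out to keep the counting step visible.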
4.2 Acoustic models generated
As we trained and compared several acoustic models, we decided to list them here together with their features. We have quantified the amount and sources of training data, so they can be easily compared. In all the models, the Spanish data are used twice - first with floating CMS, then with CMS computed over all the recordings.
AM_CZ is the model made from 200 hours of Czech recordings already available from our previous work  (as previously discussed, we consider Czech our source language).
AM_cz&ES_cross denotes five different models. All consist of 1.45 h of Czech recordings (chosen to cover the Czech phonemes not used in Spanish phonetic inventory) and approximately 10 h of Spanish data from the Albayzin corpus (for details, see ‘Cross-lingual adaptation from Czech to Spanish’ section).
AM_CZ&es denotes the acoustic model consisting of 133 h 24 min of Czech recordings, 12 h 52 min of continuous Spanish speech (whole Albayzin corpus), and 69 min of isolated words uttered by disabled people.
AM_cz&ES consists of 1 h 27 min Czech data and all the Spanish data mentioned in AM_CZ&es. The Czech training data mostly covers phonemes unused in the Spanish vocabulary (as we experimented with the best pairing of Czech and Spanish phonemes). This can be considered equivalent to a ‘Spanish only’ model.
AM_cz&ES_2 is similar to AM_cz&ES but including more isolated words uttered by the disabled people (140 min).
AM_cz&ES_sent consists of the AM_cz&ES_2 and 28 min of continuous speech uttered by disabled people.
4.3 Adaptation to Spanish atypical voices
To gather a corpus of impaired speech, we have worked together with two associations of people with learning disabilities in Southeastern Spain: JABALCÓN and APAFA. Both associations are based in mainly rural areas and have around 100 affiliated persons. They work for the social integration of the learning disabled through different programs of professional development such as wood workshops. People in these associations are mainly adults from the towns and villages nearby who visit the centres during the day.
The participants were selected according to the following criteria:
To select subjects with a wide range of phoniatric problems.
To select only subjects for whom participation in the recordings would not impact their wellness, as certain disabilities imply that a change in the person’s agenda can be very disruptive.
To select only subjects who were willing to participate voluntarily with the consent of their families and/or caregivers.
Following this approach, 42 subjects were selected. As the subjects could not participate in long recording sessions, we had to make several visits to the associations to make the recordings. During these visits, we carried out three types of recordings: single words, sentences, and conversations. The first group of recordings were frequent words from their daily activities, taken from a vocabulary that was agreed with their caregivers and categorized into the following six scenarios: ‘street’ (street, coin, house…), ‘home’ (bed, sofa, table…), ‘food’ (apple, meat, fork…), ‘family’ (father, mother, sister…), ‘dressing’ (trousers, jersey, coat…), and ‘me’ (cold, happy, hungry…). The second group were basic sentences containing words in this vocabulary (e.g. ‘The fork is on the table’, ‘I am cold’, ‘Open the door’…). Finally, the third group comprised open conversations about their daily activities: activities inside the association, visits around the village, sports (especially football), summer activities and family.
Number of elements and impaired subjects recorded
Group of recordings | Number of recordings | Number of users
Single words | … | 42 users (30 male, …)
Sentences | 940 utterances (45 min) | 18 users (13 male, …)
Conversations | … | 18 users (13 male, …)
The recordings took place in the associations, so that the participants did not have to travel and the recordings contained the environmental noise that would also surround the users when employing the generated recognizers. Due to the high number of activities carried out by the associations, the same rooms were not always available during the recording sessions, and the recordings were made at different times of the day. As a result, the levels of acoustic noise vary, and in some cases louder events such as doors opening or closing can be heard. We believe that these situations are desirable to build acoustic models that consider the noise that will be present during the usage of the speech recognizer. State-of-the-art databases of impaired speech are usually recorded in laboratory conditions, which makes it more difficult to employ the recognizers trained with them in real settings.
With respect to the single words, the accepted data (chosen by the automatic classification) were used for training a new acoustic model, and we repeated this procedure with the adapted model in several iterations. In the first round, we used 6,156 recordings to obtain 2,268 words suitable for training. The second round (with the improved acoustic model) chose 921 more words for training. Then, from the other 6,900 recordings, we chose an additional 3,300 samples.
With respect to the sentences, they consist of 45 min of annotated speech in 940 utterances. We automatically prepared the phonetic annotations using the speech, its orthographic transcription and the G2P technique described in ‘Cross-lingual adaptation’ section, and corrected them as well as the orthographic transcriptions in the cases in which they were inaccurate (i.e. they did not correspond to what was pronounced). For adapting the acoustic model, we employed 618 utterances with 28 min of speech.
Finally, the conversations were split by speaker, and we used only the parts spoken by disabled speakers. From 47 min of conversation between disabled speakers and moderators, we set aside 16 annotated minutes for test purposes.
5 Experimental evaluation
To evaluate our proposal, we have carried out different experiments corresponding to the recognition of Spanish disabled users with Czech as the source language (the case study described in the ‘A case study for Spanish disabled users’ section). Concretely, we have studied two speech recognition technologies (keyword spotter and continuous speech recognizer) using the different acoustic models described in the ‘Acoustic models generated’ section and also varying the vocabulary and language models.
5.1 Evaluation metrics
5.2 Keyword spotter for common speakers
Comparison of acoustic models with different balance of Czech and Spanish training data evaluated over typical Spanish voices
On average, 350 items were chosen for the spotting. In this case, we spot the words in continuous speech, so we also checked whether a combination of words could substitute for a vocabulary item. The phonetic forms were generated automatically by the G2P module (only one alternative for each item). As our work is focused on disabled speakers who have trouble with spoken communication, we set the system for high DR (even if this implies a possibly higher FA). Although false alarms may cause errors, it is possible to recover from them in the application that employs the ASR by establishing interaction contexts or scenarios, and also by providing N-best lists from which the users may select the most appropriate response. This may be better suited to disabled users than a recognizer with a lower DR that gives the impression of not responding to the user's inputs.
From all points of view, the AM_cz&ES_cross achieved the best results. This was a surprise for us because when we were porting Slavic languages, it was advantageous to use more data (even from related languages other than the source or target one), as it guaranteed the robustness of the model. In this case, when porting a Romance language, the mapped phonemes were so different that it was better to use the Czech data only for uncovered phonemes (phonemes not used in Spanish). The AM_CZ model had low FA, but the DR was insufficient. The results of this experiment (with almost no Czech data) are very promising in comparison with our former work with Czech KWS , so we decided to minimize the usage of Czech acoustic data in the acoustic model training.
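The trade-off between DR and FA can be illustrated with a small sketch. The definitions used here are assumptions, since the formulas of the evaluation metrics are not reproduced in this text: DR is taken as correct detections divided by true keyword occurrences, and FA as incorrect detections divided by all accepted detections. The function name `dr_and_fa` and the toy hypotheses are hypothetical.

```python
from typing import List, Tuple

# Each hypothesis is (confidence score, is_correct). Hypothetical toy data.
Hypothesis = Tuple[float, bool]


def dr_and_fa(hyps: List[Hypothesis], n_true: int, threshold: float) -> Tuple[float, float]:
    """Return (detection rate, false alarm rate) at a given score threshold.

    Assumed definitions: DR = hits / true keyword occurrences,
    FA = false alarms / all accepted detections.
    """
    accepted = [ok for score, ok in hyps if score >= threshold]
    hits = sum(accepted)
    false_alarms = len(accepted) - hits
    dr = hits / n_true if n_true else 0.0
    fa = false_alarms / len(accepted) if accepted else 0.0
    return dr, fa


toy_hyps = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False), (0.3, True)]
strict = dr_and_fa(toy_hyps, n_true=5, threshold=0.75)
lenient = dr_and_fa(toy_hyps, n_true=5, threshold=0.25)
print("strict threshold:", strict)
print("lenient threshold:", lenient)
```

Lowering the threshold in this toy example finds more of the true keywords at the price of more false alarms, which is the operating point preferred in the text above.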
5.3 Keyword spotter for disabled speakers
Impact of alternative phonetics and recordings’ pronunciation quality of the atypical voices
Test data set
pack1_base_all
pack1_base_suit
pack1_base_unsuit
pack1_alter_all
pack1_alter_suit
pack1_alter_unsuit
pack2_base_all
pack2_alter_all
As can be observed, the results are not very encouraging, but it must be kept in mind that only 69 min of recordings (approximately 35 min of pure speech) were available to adapt the acoustic model to disabled speakers. However, we can state that the algorithm that chooses the training data worked correctly, as it was able to discriminate the unsuitable data from the suitable samples. We can also clearly see the impact of the alternative phonetics in the vocabulary (labelled alter). The only drawback is the increase of the false alarm rate. But, as stated before, in this domain it is preferable to have a higher detection rate than to lower the false alarm rate.
Impact of dividing vocabulary into scenarios - atypical voices
We also wanted to verify that increasing the amount of training data pronounced by the disabled speakers improves the KWS performance. We ran the KWS with AM_cz&ES_2 over the pack2 recordings, and the accuracy was 25.30% when using a single phonetic form and 29.60% when using the alternative phonetics. This shows the negative impact that the reduced number of utterances by disabled speakers had on the previously discussed experimental results.
Our last experiment with the KWS was the evaluation of the possible improvements gained via speaker (and simultaneously channel) adaptation, as the parameters of the recording differed between the Albayzin corpus and the devices used to record the disabled speakers. We used the pack2 data which were found suitable for training to make the adaptation and compared the results when using AM_cz&ES with and without the adaptation.
where W is the extended transformation matrix, is the extended vector of features, n is the dimension of data and ω represents a bias offset.
The matrix W has to be calculated within an iterative process, where the likelihood of adaptation data with known transcription is maximized . Note that in our case, only one global transform was estimated for all Gaussian components of the system. Hence, it was not necessary to include the Jacobian of the transformation in the likelihood calculation.
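For orientation, a global affine transform of the feature vectors of the kind these definitions describe is commonly written as below. This is a sketch under assumed notation, not the authors' exact equation: the symbols $\mathbf{o}_t$ (feature vector at frame $t$), $\mathbf{A}$, $\mathbf{b}$ and $\boldsymbol{\xi}_t$ are introduced here for illustration only, and $\omega$ is commonly set to 1.

```latex
\hat{\mathbf{o}}_t = \mathbf{W}\,\boldsymbol{\xi}_t = \mathbf{A}\,\mathbf{o}_t + \omega\,\mathbf{b},
\qquad
\mathbf{W} = [\,\mathbf{b}\ \ \mathbf{A}\,] \in \mathbb{R}^{n \times (n+1)},
\qquad
\boldsymbol{\xi}_t = [\,\omega,\ o_{t,1},\ \dots,\ o_{t,n}\,]^{\mathsf{T}}
```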
There are two basic approaches to adaptation for a target speaker: supervised and unsupervised . The former utilizes adaptation data that is annotated manually by a human expert. The latter employs a speech recognizer, which creates these transcripts automatically.
We have employed a supervised adaptation approach. First, we took the manual orthographic transcripts of the adaptation data. Then, we performed forced alignment of the utterances using a speech recognizer operating with the baseline speaker-independent model and a lexicon containing all pronunciation variants of all words occurring in the orthographic transcripts. As a result of this process, we obtained accurate phonetic transcripts with labelled noises produced by the speakers (breathing, various hesitation sounds, coughs, lip smacks, etc.). Finally, general speaker-specific transformations were estimated using this annotated adaptation data.
As the impact of adaptation differs between speakers, it is not meaningful to report an average improvement. For speakers who were badly processed by the baseline, the gain varied. For some of them, ACC decreased from 11.37% to 8.75%; for others, it increased from 2.92% to 6.07%. For better recognized speakers, the behaviour was much more predictable: ACC increased by approximately 20% (e.g. 40.19% increased to 58.52%, and 59.85% increased to 73.14%). Thus, the speaker-channel adaptation can be very useful but, on the other hand, it is necessary to identify the user.
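The text does not state whether the reported gain is absolute or relative, so the short sketch below simply computes both from the ACC pairs quoted above. The helper name `gains` is hypothetical, and the code only restates the arithmetic.

```python
from typing import Tuple

# (baseline ACC, adapted ACC) pairs in percent, as quoted in the text above.
acc_pairs = [(11.37, 8.75), (2.92, 6.07), (40.19, 58.52), (59.85, 73.14)]


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(after - before, 2)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


for before, after in acc_pairs:
    absolute, relative = gains(before, after)
    print(f"{before:6.2f}% -> {after:6.2f}%: {absolute:+.2f} points, {relative:+.1f}% relative")
```

Reporting both figures per speaker avoids the ambiguity of a single averaged number, which is in line with the remark that an average improvement is not informative here.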
5.4 Continuous speech recognition
Results for continuous speech recognition with different acoustic models - typical voices
Results for continuous speech recognition with different language models - atypical voices
However, as the accuracy of the continuous speech recognizer is higher than 60%, it is possible to use this recognizer to semi-automatically obtain new data for training. To do this, it is possible to use the approach that we described in , especially if at least partially transcribed speech is available. For example, this type of data can be gathered from radio broadcasts, as their web pages usually contain abstracts of the recordings and, in many cases, some parts are also transcribed. With our recognizer, we can automatically transcribe the recordings and compare the output with the text from the web page. If we find matches that are long enough (usually a few words), we can extract these parts and use them to build more accurate models in a short time.
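The comparison of recognizer output with web text can be sketched as follows. This is a simplified illustration that uses Python's standard `difflib` module rather than the alignment method of the cited work; the function name `matching_segments`, the `min_words` parameter and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def matching_segments(
    recognized: str, reference: str, min_words: int = 3
) -> List[Tuple[int, int, List[str]]]:
    """Find word runs shared by the recognizer output and the reference text.

    Returns (start index in recognized, start index in reference, words)
    for every common run of at least `min_words` words.
    """
    hyp = recognized.lower().split()
    ref = reference.lower().split()
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            segments.append((block.a, block.b, hyp[block.a:block.a + block.size]))
    return segments


# Hypothetical recognizer output and web-page text.
recognized_text = "hoy el gobierno ha aprobado la nueva ley de educacion en madrid"
web_text = "El gobierno ha aprobado ayer la nueva ley de educacion"
segments = matching_segments(recognized_text, web_text)
for hyp_start, ref_start, words in segments:
    print(hyp_start, ref_start, " ".join(words))
```

In a real pipeline, the word indices of each matched run would be mapped to the recognizer's word timestamps so that the corresponding audio can be cut out and added to the training set.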
Speech technology can be a very valuable help for learning disabled users, whose varied conditions result in different dysfunctions of their speech system. Different efforts have been made by the scientific community in order to propose different approaches to the implementation of automatic speech recognizers for these users. However, many of the state-of-the-art approaches focus on particular pathologies or imply an in-depth study of specific dimensions such as certain articulatory features. This makes the development of automatic speech recognizers costly, as there is a need for a large amount of specific data samples to train the recognizer.
In this paper, we have presented a cross-lingual approach for the development of speech recognizers for disabled people. Our main aim was to study mechanisms to make the most of the resources already available for average users in different languages for the recognition of atypical voices. Although the generality of the method reduces its performance compared to recognizers trained on large databases of impaired speech, it allows rapid and cost-efficient development, which makes it suitable for fast development of assistive speech applications.
We have evaluated the proposed model preparing a KWS and CSR for Spanish using Czech resources, and adapting the Spanish resources obtained with the cross-lingual approach to fit learning disabled speakers.
The experimental results show that the KWS for common Spanish speakers achieves results comparable to those reached by the source KWS when used for Czech. The CSR obtains accuracies over 60%. Although this does not seem very satisfactory, we would like to emphasize that there were more than 20% OOV words. The high OOV rate demonstrates that we have to increase the vocabulary for CSR (and concurrently retrain the language model). Taking all these aspects into account, the results for the recognition of common Spanish speakers are very promising, and when the new vocabulary is ready, we should be able to launch the almost automatic improvement procedure proposed in . This procedure will give us new data to improve the robustness of the acoustic model for Spanish.
Both technologies (KWS and CSR) obtained lower accuracy rates with the disabled speakers. In the case of CSR, we believe that limiting the vocabulary and changing the system behaviour from continuous speech recognition to recognition of isolated words would provide several benefits. On the one hand, the speakers (who are not used to speaking for long periods of time) will have the opportunity to relax their vocal tract and pronounce the words better; on the other hand, isolated word dictation is more reliable than CSR. In the case of KWS, although the results are not sufficient for real operation, we have shown the positive impact of the tested enhancement techniques. We show the importance of a careful choice of keywords as well as context-dependent limitation of the vocabulary, together with the use of proper alternative phonetics. We have also demonstrated the improvement achieved through supervised speaker (and channel) adaptation.
We have shown that, for common speakers, a cost-efficient cross-lingual adaptation can be done even with a training dataset smaller than the usual databases for training speech recognizers. In particular, the forced alignment tool has proven to be very useful. In the case of disabled speakers, the task itself is challenging. However, we succeeded in the automatic elimination of the data that was not suitable for training, and in general, we can say that the proposed method leads to a significant reduction of time-consuming expert work.
For future work, we plan to improve the quality of phonetic alternatives using a data-driven weighted finite state transducer (WFST)-based approach. The WFST can be trained directly from a vocabulary where the input data consists of (Spanish word - Czech phonetic) pairs . Along with the general-purpose approach, it is also possible to adapt the vocabularies to fit the speakers individually. This WFST-based system should be able to ‘learn’ the rules created by an expert and to propose alternatives, replacing the current G2P module.
Another promising direction comes from our ongoing work. We are currently running experiments in which the physical state decoder of the speech recognizer is replaced by a neural network. This network has the advantage that it uses seven subsequent parametrized frames to classify the middle one, so it uses more information. The experiments show promising improvement, especially for the data with a low initial recognition score. Another advantage of this approach is that it uses the same training data as the current HMM-based decoder. We want to apply this change to the existing speech recognizer and also to prepare another KWS based on these neural networks, whose output would be processed directly by the weighted finite state transducers.
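The idea of classifying the middle frame from a window of seven frames can be illustrated with a simple frame-splicing sketch. It is an assumption-laden illustration rather than the authors' code: the function name `splice_frames` is hypothetical, and the choice to repeat the first and last frames at the utterance edges is an assumption, as the text does not describe edge handling.

```python
from typing import List

Frame = List[float]


def splice_frames(frames: List[Frame], context: int = 3) -> List[Frame]:
    """Stack each frame with `context` neighbours on both sides.

    With context=3, every output vector concatenates seven frames and can
    serve as the network input for classifying the middle one.
    """
    if not frames:
        return []
    last = len(frames) - 1
    spliced: List[Frame] = []
    for t in range(len(frames)):
        window: Frame = []
        for k in range(t - context, t + context + 1):
            # Edge handling (an assumption): repeat the first or last frame.
            window.extend(frames[min(max(k, 0), last)])
        spliced.append(window)
    return spliced


# Toy one-dimensional "features" so that the windows are easy to read.
toy_frames = [[float(i)] for i in range(5)]
toy_spliced = splice_frames(toy_frames)
print(toy_spliced[2])
```

The input dimension of the network grows by a factor of seven compared with a single frame, which is the price paid for the additional context.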
Although the number of recordings in our database of atypical voices is of the same order as in other state-of-the-art corpora (see a review of existing corpora in  and ), it would be desirable to record new data and/or manually annotate the data marked as ‘not suitable for training’. The latter solution has the drawback of requiring a large amount of expert work, but we can presume it would help to recognize the more severely impaired speakers.
As the emphasis in the paper has been on taking advantage of existing resources to help develop new assistive technologies for disabled people, we offer our collaboration and resources to interested researchers. In the near future, we plan to test the adapted recognizer under real conditions by integrating it into an application for tablets. The results of this new stage of our research will also be at the disposal of the scientific community.
This research was supported by the project ‘Favoreciendo la vida autónoma de discapacitados intelectuales con problemas de comunicación oral mediante interfaces personalizados de reconocimiento automático del habla’, financed by the Centre of Initiatives for Development Cooperation (Centro de Iniciativas de Cooperación al Desarrollo, CICODE), University of Granada, Spain. This research was also supported by the Student Grant Scheme 2014 (SGS) at the Technical University of Liberec. The authors want to thank the staff from the associations JABALCON and APAFA and the participants in the recordings for their contribution in making this work possible.
- J Sigafoos, RW Schlosser, GE Lancioni, MF O’Reilly, VA Green, NN Singh, GE Lancioni, NN Singh, in Assistive Technology for People with Communication Disorders. Autism and Child Psychopathology Series (Springer, New York, 2014), pp. 77–112.
- Saz O, Simón J, Rodríguez W-R, Lleida E, Vaquero C: Analysis of acoustic features in speakers with cognitive disorders and speech impairments. EURASIP J. Adv. Signal Process 2009, 2009: 159-234. 10.1155/2009/159234
- Falk TH, Chan W-Y, Shein F: Characterization of atypical vocal source excitation, temporal dynamics and prosody for objective measurement of dysarthric word intelligibility. Speech Commun 2012, 54: 622-631. 10.1016/j.specom.2011.03.007
- Neerincx MA, Cremers AHM, Kessens JM, van Leeuwen DA, Truong KP: Attuning speech-enabled interfaces to user and context for inclusive design: technology, methodology and practice. Univers. Access. Inform. Soc 2009, 8: 109-122. 10.1007/s10209-008-0136-x
- Rodríguez WR, Saz O, Lleida E: A prelingual tool for the education of altered voices. Speech Commun 2012, 54: 583-600. 10.1016/j.specom.2011.05.006
- Hawley MS, Cunningham SP, Green PD, Enderby P, Palmer R, Sehgal S, O’Neill P: A voice-input voice-output communication aid for people with severe speech impairment. IEEE Trans. Neural Syst. Rehabil. Eng 2013, 21: 23-31. 10.1109/TNSRE.2012.2209678
- Joode Ed, Heugten Cv, Verhey F, Boxtel Mv: Efficacy and usability of assistive technology for patients with cognitive deficits: a systematic review. Clin. Rehabil 2010, 24: 701-714. 10.1177/0269215510367551
- GE Lancioni, NN Singh, MF O’Reilly, J Sigafoos, D Oliva, in Assistive Technology for People with Severe/Profound Intellectual and Multiple Disabilities. Autism and Child Psychopathology Series (Springer, New York, 2014), pp. 277–313.
- McCoy KF, Arnott JL, Ferres L, Fried-Oken M, Roark B: Speech and language processing as assistive technologies. Comput. Speech Lang 2013, 27: 1143-1146. 10.1016/j.csl.2013.04.005
- Chia Ai O, Hariharan M, Yaacob S, Sin Chee L: Classification of speech dysfluencies with MFCC and LPCC features. Expert Syst. Appl 2012, 39: 2157-2165. 10.1016/j.eswa.2011.07.065
- Borg J, Larsson S, Östergren P: The right to assistive technology: for whom, for what, and by whom? Disabil. Soc 2011, 26: 151-167. 10.1080/09687599.2011.543862
- Nouza J, Blavka K, Červa P, Zdansky J, Silovsky J, Bohac M, Prazak J: Making czech historical radio archive accessible and searchable for wide public. J. Multimed 2012, 7: 159-169. 10.4304/jmm.7.2.159-169
- P Červa, J Nouza, in Proceedings of the Conference of the International Speech Communication Association Interspeech: 27-31 August 2007; Antwerp, Belgium (ISCA, France). Design and development of voice controlled aids for motor-handicapped persons, (2007), pp. 2521–2524.
- Nouza J, Červa P, Kucharová M: Cost-efficient development of acoustic models for speech recognition of related languages. Radioengineering 2013, 22: 866-873.
- P Lal, S King, Cross-lingual automatic speech recognition using tandem features. 21, 2506–2515 (2013).
- Besacier L, Barnard E, Karpov A, Schultz T: Automatic speech recognition for under-resourced languages: a survey. Speech Commun 2014, 56: 85-100. 10.1016/j.specom.2013.07.008
- F Rudzicz, Production knowledge in the recognition of dysarthric speech. PhD thesis, University of Toronto (2011).
- Borrie SA, McAuliffe MJ, Liss JM, O’Beirne GA, Anderson TJ: A follow-up investigation into the mechanisms that underlie improved recognition of dysarthric speech. J. Acoust. Soc. Am 2012, 132: 102-108. 10.1121/1.4736952
- J-P Hosom, T Jakobs, A Baker, S Fager, in Proceedings of the 11th Conference of the International Speech Communication Association (Interspeech): 26-30 September 2010; Makuhari, Japan (International Speech Communication Association, France), ed. by T Kobayashi, K Hirose, and S Nakamura. Automatic speech recognition for assistive writing in speech supplemented word prediction, (2010), pp. 2674–2677.
- WK Seong, JH Park, HK Kim, in Dysarthric Speech Recognition Error Correction Using Weighted Finite State Transducers Based on Context-Dependent Pronunciation Variation. LNCS, ed. by K Miesenberger, A Karshmer, P Penaz, and W Zagler (Springer, Heidelberg, 2012), pp. 475–482.
- I Kraljevski, G Strecha, M Wolff, O Jokisch, S Chungurski, R Hoffmann, in Cross-Language Acoustic Modeling for Macedonian Speech Technology Applications, ed. by S Markovski, M Gusev. Advances in Intelligent Systems and Computing (Springer, Berlin, 2013), pp. 35–45.
- Imseng D, Bourlard H, Dines J, Garner PN, Magimai-Doss M: Applying multi- and cross-lingual stochastic phone space transformations to non-native speech recognition. IEEE Trans. Audio Speech Lang. Process 2013, 21: 1713-1726. 10.1109/TASL.2013.2260150
- T Schultz, K Kirchhoff, Multilingual Speech Processing (Academic Press, USA, 2006).
- D Imseng, P Motlicek, PN Garner, H Bourlard, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU): 8-12 December 2013; Olomouc, Czech Republic (IEEE, USA). Impact of deep MLP architecture on different acoustic modeling techniques for under-resourced speech recognition, (2013), pp. 332–337.
- Xu P, Fung P: Cross-lingual language modeling for low-resource speech recognition. IEEE Trans. Audio Speech Lang. Process 2013, 21: 1134-1144. 10.1109/TASL.2013.2244088
- Bohac M, Blavka K: Text-to-speech alignment for imperfect transcriptions. LNCS: Text, Speech Dialogue 2013, 8082: 536-543.
- J Zhang, F Pan, Y Yan, in Proceedings of the 4th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC): 26-27 August 2012; Nanchang, China (IEEE, USA). An LVCSR based automatic scoring method in English reading tests, (2012), pp. 34–37.
- DP Córdova Lucero, DT Toledano, in Proceedings of the Joint 7th Spanish Speech Technology Workshop and the Iberian SLTech Workshop: 21-23 November 2012; Madrid, Spain (Springer, Germany). Preliminary results of alignment of text and audio in news and songs, (2012), pp. 59–68.
- J Nouza, P Červa, J Chaloupka, in Proceedings of the International Conference on Health Informatics (HEALTHINF - BIODEVICES): 26-29 January 2011; Rome, Italy (SciTePress, UK). Rainbow bridge: Training center based on voice technology for people with physical disabilities, (2011), pp. 529–533.
- Wagner RA, Fischer MJ: String-to-string correction problem. J. ACM 1974, 21: 168-173. 10.1145/321796.321811
- M Bohac, in Proceedings of the 54th International Symposium ELMAR: 12-14 September 2012; Zadar, Croatia (IEEE, USA). Performance comparison of several techniques to detect keywords in audio streams and audio scene, (2012), pp. 215–218.
- Nouza J, Zdansky J, Červa P, Silovsky J: Challenges in speech processing of Slavic languages (case studies in speech recognition of Czech and Slovak). LNCS 2010, 5967: 225-241.
- Bohac M, Nouza J, Blavka K: Investigation on most frequent errors in large-scale speech recognition applications. LNCS: Text, Speech Dialogue 2012, 7499: 520-527.
- Albayzin corpus in the European Language Resources Association. http://catalog.elra.info/product_info.php?products_id=746. Accessed 10 October 2014.
- D-L Choi, B-W Kim, Y-W Kim, Y-J Lee, Y Um, M Chung, in Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC): 23-25 May 2012; Istanbul, Turkey (IEEE, USA), ed. by N Calzolari (Conference Chair), K Choukri, T Declerck, MU Dogan, B Maegaard, J Mariani, J Odijk, and S Piperidis. Dysarthric speech database for development of QoLT software technology, (2012), pp. 47–50.
- Gales MJF, Woodland PC: Mean and variance adaptation within the MLLR framework. Comput. Speech Lang 1996, 10: 249-264. 10.1006/csla.1996.0013
- Gales MJF: Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang 1998, 12: 75-98. 10.1006/csla.1998.0043
- Červa P, Nouza J: Supervised and unsupervised speaker adaptation in large vocabulary continuous speech recognition of Czech. LNCS (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2005, 3658: 203-210.
- M Bohac, J Malek, K Blavka, in Proceedings of the 36th International Conference on Telecommunications and Signal Processing (TSP): 2–4 July 2013; Brno, Czech Republic (IEEE, USA). Iterative grapheme-to-phoneme alignment for the training of WFST-based phonetic conversion, (2013), pp. 474–478.
- Rudzicz F: Using articulatory likelihoods in the recognition of dysarthric speech. Speech Commun 2012, 54: 430-444. 10.1016/j.specom.2011.10.006
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.