A preliminary demonstration of exemplar-based voice conversion for articulation disorders using an individuality-preserving dictionary
© Aihara et al.; licensee Springer. 2014
Received: 5 April 2013
Accepted: 25 January 2014
Published: 1 February 2014
We present a voice conversion (VC) method for a person with an articulation disorder resulting from athetoid cerebral palsy. The movement of such speakers is limited by their athetoid symptoms, and their consonants are often unstable or unclear, which makes it difficult for them to communicate. In this paper, exemplar-based spectral conversion using nonnegative matrix factorization (NMF) is applied to a voice with an articulation disorder. To preserve the speaker’s individuality, we use an individuality-preserving dictionary that is constructed from the source speaker’s vowels and the target speaker’s consonants. Using this dictionary, we can create a natural and clear voice that preserves the individuality of the speaker’s voice. Experimental results indicate that the performance of NMF-based VC is considerably better than that of conventional GMM-based VC.
Keywords: Voice conversion; NMF; Articulation disorders; Voice reconstruction; Assistive technologies
1 Introduction
In recent years, a number of assistive technologies using information processing have been proposed, for example, sign language recognition using image recognition technology [1–3], text reading systems from natural scene images [4–6], and the design of wearable speech synthesizers. In this study, we focused on a person with an articulation disorder resulting from athetoid cerebral palsy. There are about 34,000 people with speech impediments associated with an articulation disorder in Japan alone, and one of the causes of speech impediments is cerebral palsy.
Cerebral palsy is a result of damage to the central nervous system, and the damage causes movement disorders. Three general times are given for the onset of the disorder: before birth, at the time of delivery, and after birth. Cerebral palsy is classified into the following types: (1) spastic, (2) athetoid, (3) ataxic, (4) atonic, (5) rigid, and (6) a mixture of these types.
Athetoid symptoms develop in about 10% to 15% of cerebral palsy sufferers. In the case of a person with this type of articulation disorder, his/her movements are sometimes more unstable than usual. That means their utterances (especially their consonants) are often unstable or unclear due to the athetoid symptoms. Athetoid symptoms also restrict the movement of their arms and legs. Most people suffering from athetoid cerebral palsy cannot communicate by sign language or writing, so there is a great need for voice systems for them.
In previous work, we proposed robust feature extraction based on principal component analysis (PCA) with more stable utterance data instead of the discrete cosine transform (DCT). We also used multiple acoustic frames (MAF) as an acoustic dynamic feature to improve the recognition rate for a person with an articulation disorder, especially in speech recognition using dynamic features only. In spite of these efforts, the recognition rate for articulation disorders is still lower than that of physically unimpaired persons. Maier et al. proposed automatic speech recognition systems for the evaluation of speech disorders that resulted from head and neck cancer.
In this paper, we propose a voice conversion (VC) method for articulation disorders. Regarding speech recognition for articulation disorders, the recognition rate using a speaker-independent model, which is trained on well-ordered speech, is 3.5%. This result implies that the utterances of a person with an articulation disorder are difficult to understand for people who have not communicated with them before. In recent years, people with an articulation disorder have used slide shows and a previously synthesized voice when giving a lecture. However, because their movement is restricted by their athetoid symptoms, preparing slides or synthesizing their voice in advance is difficult for them. People with articulation disorders desire a VC system that converts their voice into a clear voice that preserves their voice’s individuality. However, a speech conversion method for people with articulation disorders resulting from athetoid cerebral palsy has not been successfully developed.
In approaches based on sparse representations, the observed signal is represented by a linear combination of a small number of bases:

$$\mathbf{x}_l \approx \sum_{j=1}^{J} \mathbf{a}_j h_{j,l} = \mathbf{A}\mathbf{h}_l$$

where $\mathbf{x}_l$ is the $l$th frame of the observation, and $\mathbf{a}_j$ and $h_{j,l}$ are the $j$th basis and its weight, respectively. $\mathbf{A}$ and $\mathbf{h}_l$ are the dictionary and the activity of frame $l$, respectively. In some separation approaches, a dictionary is constructed for each source, and the mixed signals are expressed with a sparse representation of these dictionaries. Using only the weights (called activity in this paper) of the bases in the target dictionary, the target signal can be reconstructed. Gemmeke et al. also used the activity of the speech dictionary as phonetic scores instead of likelihoods of hidden Markov models (HMMs) for speech recognition.
In our study, we adopt the supervised NMF approach, with a focus on VC from poorly articulated speech resulting from articulation disorders into well-ordered articulation. An input spectrum with an articulation disorder is represented by a linear combination of an articulation disorder basis and its weights using NMF. By replacing the basis produced by someone with an articulation disorder with a well-ordered basis, the original speech spectrum is replaced with a well-ordered spectrum. In the voice of a person with an articulation disorder, their consonants are often unstable and that makes their voices unclear. Hence, by replacing the articulation disorder basis of consonants only, a voice with an articulation disorder is converted into a clear voice that preserves the individuality of the speaker’s voice.
The rest of this paper is organized as follows. Section 2 discusses related works, while Section 3 describes the NMF-based VC method. Section 4 presents the experimental results, and the final section presents the conclusions.
2 Related works
VC is a technique for changing specific information in input speech while retaining the other information in the utterance, such as its linguistic information. Unlike speech synthesis, a VC application does not need text input. Many statistical approaches to VC have been studied and applied to various tasks. One of the most popular VC applications is speaker conversion. In speaker conversion, a source speaker’s voice individuality is changed to that of a specified target speaker so that the input utterance sounds as if it had been spoken by the target speaker. The other information, such as linguistic or emotional information, is retained. Emotion conversion is a technique for changing emotional information in input speech while maintaining linguistic information and speaker individuality [17, 18]. With those approaches, a mapping function is trained in advance using a small amount of training data made up of utterance pairs of source and target voices.
A Gaussian mixture model (GMM)-based approach is widely used for VC because of its flexibility and good performance. The conversion function is interpreted as the expectation value of the target spectral envelope. The conversion parameters are estimated by minimum mean-square error (MMSE) on a parallel training set. A number of improvements to this approach have been proposed. Toda et al. introduced the global variance (GV) of the converted spectra over a time sequence. Helander et al. proposed transforms based on partial least squares (PLS) in order to prevent the over-fitting problem of standard multivariate regression. However, over-smoothing and over-fitting problems in these GMM-based approaches have been reported because of statistical averaging and the large number of parameters. These problems degrade the quality of synthesized speech. There have also been approaches that do not require parallel data, using GMM adaptation techniques or eigen-voice GMM (EV-GMM).
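The MMSE conversion function described above can be illustrated with a short example. This is a minimal sketch of the standard joint-density formulation, assuming a GMM has already been trained on joint source/target vectors; the function name `gmm_convert` and the parameter layout are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Return E[y | x] under a joint GMM (hypothetical helper).

    x:       (D,) source feature vector
    weights: (M,) mixture weights
    mu_x:    (M, D) source means,  mu_y: (M, D_y) target means
    cov_xx:  (M, D, D) source covariances
    cov_yx:  (M, D_y, D) target/source cross-covariances
    """
    n_mix = len(weights)
    log_post = np.empty(n_mix)
    for m in range(n_mix):
        diff = x - mu_x[m]
        _, logdet = np.linalg.slogdet(cov_xx[m])
        maha = diff @ np.linalg.solve(cov_xx[m], diff)
        # The -D/2 * log(2*pi) term is shared by all components and cancels.
        log_post[m] = np.log(weights[m]) - 0.5 * (logdet + maha)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()  # posterior p(m | x)

    y = np.zeros(mu_y.shape[1])
    for m in range(n_mix):
        # Component-wise conditional mean of y given x.
        cond_mean = mu_y[m] + cov_yx[m] @ np.linalg.solve(cov_xx[m], x - mu_x[m])
        y += post[m] * cond_mean
    return y
```

Because the output is a posterior-weighted average of component means, fine spectral detail tends to be averaged out, which is one way to see the over-smoothing problem mentioned above.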
However, these approaches have been developed for speaker conversion. If the person with an articulation disorder is set as a source speaker and a physically unimpaired person is set as a target speaker, an articulation disorder voice may be converted into a well-ordered voice, but the source speaker’s voice individuality is also converted into the target speaker’s individuality.
In the field of assistive technology, Nakamura et al. [23, 24] proposed GMM-based VC systems that reconstruct a speaker’s individuality in electrolaryngeal speech and speech recorded by non-audible murmur (NAM) microphones. These systems are effective for electrolaryngeal speech and speech recorded by NAM microphones, but the target speaker’s individuality will be changed to the source speaker’s individuality. Veaux et al. used HMM-based speech synthesis to reconstruct the voice of individuals with degenerative speech disorders. HMM-based speech synthesis needs text input to synthesize speech. In the case of people with an articulation disorder resulting from athetoid cerebral palsy, it is difficult for them to input text because of their athetoid symptoms.
The goals of this study can be divided into the following three points: (1) convert the voice uttered by a person with an articulation disorder so that everyone can understand what he/she said, (2) preserve the individuality of the speaker’s voice, and (3) output a natural-sounding voice. Our proposed exemplar-based VC can create a natural-sounding voice because there is no statistical model in our approach, and the source speaker’s individuality can be preserved using our individuality-preserving dictionary.
3 Voice conversion based on NMF
3.1 Exemplar-based voice conversion
A dictionary is a collection of source or target bases. Our VC method needs two dictionaries that are phonemically parallel. One dictionary is a source dictionary, which is constructed from a source feature. The source feature is constructed from an articulation-disordered spectrum and its segment feature, which consists of several consecutive frames. We use a segment feature in order to take temporal information into account in activity estimation. The other dictionary is a target dictionary, which is constructed from a target feature. The target feature is mainly constructed from a well-ordered spectrum. These two dictionaries consist of the same words and are aligned with dynamic time warping (DTW). Hence, these dictionaries have the same number of bases.
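The DTW alignment that yields two dictionaries with the same number of bases can be sketched as follows. This is a minimal illustration, assuming a plain Euclidean frame distance and the basic three-way step pattern; the function names `dtw_path` and `build_parallel_dictionaries` are hypothetical and the paper does not specify these details.

```python
import numpy as np

def dtw_path(src, tgt):
    """Return the DTW warping path between two (T, D) feature sequences."""
    n, m = len(src), len(tgt)
    # Pairwise Euclidean distances between all source/target frame pairs.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    # Backtrack from the last frame pair to the first.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def build_parallel_dictionaries(src_feat, tgt_feat, src_align, tgt_align):
    """Stack aligned frames as columns so both dictionaries share J bases.

    src_align / tgt_align are the features used for alignment only
    (for example mel-cepstra); src_feat / tgt_feat supply the exemplars.
    """
    src_idx, tgt_idx = zip(*dtw_path(src_align, tgt_align))
    A_s = src_feat[list(src_idx)].T  # shape (D_s, J)
    A_t = tgt_feat[list(tgt_idx)].T  # shape (D_t, J)
    return A_s, A_t
```

Each pair on the warping path contributes one source exemplar and one target exemplar, so column `j` of both dictionaries refers to the same point in the utterance.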
Input source features, $\mathbf{X}^s$, which consist of an articulation-disordered spectrum and its segment features, are decomposed into a linear combination of bases from the source dictionary $\mathbf{A}^s$ by NMF. The weights of the bases are estimated as an activity $\mathbf{H}^s$. Therefore, the activity includes the weight information of the input features for each basis.
Then, the activity is multiplied by a target dictionary in order to obtain converted spectral features, which are represented by a linear combination of bases from the target dictionary. Because the source and target dictionaries are phonemically parallel, the bases used in the converted features are phonemically the same as those of the source features.
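The conversion step itself is a single matrix product. The toy example below is a sketch with made-up numbers, assuming the activity has already been estimated; the name `convert_spectrum` and the symbols `A_s`, `A_t`, and `H_s` are illustrative.

```python
import numpy as np

def convert_spectrum(A_t, H_s):
    """Map source activities onto the target dictionary: X_t_hat = A_t @ H_s."""
    return A_t @ H_s

# Toy parallel dictionaries with J = 3 bases (columns are exemplars).
A_s = np.array([[1.0, 0.0, 2.0],
                [0.0, 1.0, 1.0]])       # source dictionary, shape (2, 3)
A_t = np.array([[3.0, 0.0, 1.0],
                [0.0, 2.0, 0.0],
                [1.0, 1.0, 4.0]])       # target dictionary, shape (3, 3)

# Suppose NMF found that the input frame activates bases 0 and 2.
H_s = np.array([[0.5], [0.0], [1.0]])   # activity, shape (3, 1)

X_s_hat = A_s @ H_s                     # reconstruction of the source frame
X_t_hat = convert_spectrum(A_t, H_s)    # converted frame built from target bases 0 and 2
```

The same column indices are active on both sides, which is why the phonemic content carries over while the spectral shape comes from the target dictionary.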
Spectral envelopes extracted by STRAIGHT analysis are used in the source and target features. The other features extracted by STRAIGHT analysis, such as F0 and the aperiodic components, are used to synthesize the converted signal without any conversion.
3.2 Preserving the individuality of the speaker’s voice
In order to make a parallel dictionary, some pairs of parallel utterances are needed, where each pair consists of the same text. One is spoken by a person with an articulation disorder (source speaker), and the other is spoken by a physically unimpaired person (target speaker).
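The individuality-preserving dictionary, built from the source speaker’s vowels and the target speaker’s consonants, can be sketched as a column-wise selection over aligned exemplars. This is a minimal illustration that assumes a per-basis vowel/consonant label is available (for example from phoneme labels) and that both inputs are already DTW-aligned single-frame spectra; the function name and the `is_vowel` mask are hypothetical.

```python
import numpy as np

def individuality_preserving_dictionary(src_frames, tgt_frames, is_vowel):
    """Combine aligned exemplars into one output dictionary.

    src_frames, tgt_frames: (D, J) aligned spectra (columns are exemplars).
    is_vowel: (J,) boolean mask, True where the exemplar is a vowel frame.
    Vowel columns come from the source speaker, consonant columns from
    the target speaker.
    """
    src_frames = np.asarray(src_frames, dtype=float)
    tgt_frames = np.asarray(tgt_frames, dtype=float)
    is_vowel = np.asarray(is_vowel, dtype=bool)
    if src_frames.shape != tgt_frames.shape:
        raise ValueError("source and target exemplars must be aligned")
    if is_vowel.shape != (src_frames.shape[1],):
        raise ValueError("is_vowel needs one label per exemplar")
    return np.where(is_vowel[None, :], src_frames, tgt_frames)
```

Multiplying the estimated activity by this dictionary leaves vowel frames close to the source speaker’s own spectra and replaces only the consonant frames.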
3.3 Estimation of activity
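The activity $\mathbf{H}^s$ is estimated by minimizing the following cost function:

$$d(\mathbf{X}^s, \mathbf{A}^s\mathbf{H}^s) + \left\| (\boldsymbol{\lambda}\mathbf{1}^{1\times L}) \mathbin{.*} \mathbf{H}^s \right\|_1 \quad \text{s.t.} \quad \mathbf{H}^s \ge 0$$

where $\mathbf{1}$ is an all-one matrix, $L$ is the number of input frames, and $.*$ denotes element-wise multiplication. The first term is the Kullback-Leibler (KL) divergence between $\mathbf{X}^s$ and $\mathbf{A}^s\mathbf{H}^s$. The second term is the sparsity constraint, an L1-norm regularization term that causes $\mathbf{H}^s$ to be sparse. The weights of the sparsity constraint can be defined for each exemplar by defining $\boldsymbol{\lambda}^{\mathsf{T}} = [\lambda_1 \dots \lambda_J]$.
To increase the sparseness of $\mathbf{H}^s$, elements of $\mathbf{H}^s$ that are less than a threshold are rounded to zero.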
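The estimation and thresholding steps can be sketched as follows. This is a minimal illustration that assumes the widely used multiplicative update for KL-divergence NMF with an L1 penalty and a fixed dictionary; the exact update rule, iteration count, initial value, and threshold used in the paper are not given in this text, so the defaults below are placeholders.

```python
import numpy as np

def estimate_activity(X, A, lam=0.1, n_iter=200, threshold=0.0, eps=1e-12):
    """Estimate a sparse nonnegative activity H such that X ~ A @ H.

    X: (D, L) nonnegative input features, A: (D, J) fixed dictionary.
    lam: scalar or (J,) per-exemplar sparsity weights (assumed values).
    """
    D, L = X.shape
    J = A.shape[1]
    lam = np.broadcast_to(np.asarray(lam, dtype=float), (J,))[:, None]
    H = np.ones((J, L))
    # Denominator A^T 1 + lambda 1^T is constant across iterations.
    denom = A.T @ np.ones((D, L)) + lam
    for _ in range(n_iter):
        ratio = X / (A @ H + eps)            # element-wise X ./ (A H)
        H = H * (A.T @ ratio) / (denom + eps)
    H[H < threshold] = 0.0                   # round small activities to zero
    return H
```

Only `H` is updated; the dictionary `A` stays fixed, which is what makes the approach exemplar-based rather than a learned factorization.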
4 Experimental results
4.1 Experimental conditions
The proposed method was evaluated on word-based VC for one person with an articulation disorder. We recorded 432 utterances (216 words, each repeated twice) included in the ATR Japanese speech database. The speech signals were sampled at 16 kHz and windowed with a 25-ms Hamming window every 10 ms. A physically unimpaired Japanese male in the ATR Japanese speech database was chosen as a target speaker.
The source feature is a 2,565-dimensional segment spectrum, and the target feature is a 513-dimensional single spectrum. Those spectra are extracted by STRAIGHT analysis. Mel-cepstral coefficients, which are converted from the STRAIGHT spectrum, are used for DTW in order to align the temporal fluctuation.
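Since 2,565 = 5 × 513, the segment feature is consistent with stacking five consecutive 513-dimensional frames. The sketch below assumes a symmetric context of two frames on each side with edge padding; both choices, and the name `make_segment_features`, are assumptions rather than details stated in the paper.

```python
import numpy as np

def make_segment_features(spec, context=2):
    """Stack each frame with its neighbours.

    spec: (T, D) spectra, one row per frame.
    Returns a (T, D * (2 * context + 1)) array of segment features.
    """
    n_frames = spec.shape[0]
    # Repeat the first and last frames so every frame has full context.
    padded = np.pad(spec, ((context, context), (0, 0)), mode="edge")
    shifted = [padded[k:k + n_frames] for k in range(2 * context + 1)]
    return np.concatenate(shifted, axis=1)
```

With `D = 513` and `context = 2`, each output row has 2,565 dimensions, and the middle 513 entries are the original frame.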
We compared our NMF-based VC to conventional GMM-based VC. In GMM-based VC, the first through 24th cepstral coefficients extracted by STRAIGHT are used as source and target features.
4.2 Objective evaluation of exemplar-based VC
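Mel-cepstral distortion (MCD) between the converted and target mel-cepstra was used as the objective measure:

$$\mathrm{MCD} = \frac{\alpha}{T} \sum_{t=1}^{T} \sqrt{\sum_{d} \left( mc^{\mathrm{conv}}_d(t) - mc^{\mathrm{tar}}_d(t) \right)^2}$$

where $mc^{\mathrm{conv}}_d(t)$ and $mc^{\mathrm{tar}}_d(t)$ are the $d$th coefficients on frame $t$ of the converted and target mel-cepstra, respectively, $T$ is the total number of frames, and $\alpha$ is a scaling factor.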
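A computation of this measure can be sketched as follows, assuming the distortion is the frame-averaged Euclidean distance between time-aligned mel-cepstra multiplied by the scaling factor. The value α = 10√2 / ln 10 is the commonly used convention for expressing the result in dB and is an assumption here, as are the function and variable names.

```python
import numpy as np

# Assumed scaling factor (about 6.14), the usual dB convention for MCD.
ALPHA = 10.0 * np.sqrt(2.0) / np.log(10.0)

def mel_cepstral_distortion(mc_conv, mc_tar, alpha=ALPHA):
    """Frame-averaged mel-cepstral distortion.

    mc_conv, mc_tar: (T, D) time-aligned mel-cepstra of the converted and
    target speech, with the same number of frames and coefficients.
    """
    mc_conv = np.asarray(mc_conv, dtype=float)
    mc_tar = np.asarray(mc_tar, dtype=float)
    if mc_conv.shape != mc_tar.shape:
        raise ValueError("inputs must be time-aligned and of equal shape")
    per_frame = np.sqrt(np.sum((mc_conv - mc_tar) ** 2, axis=1))
    return alpha * np.mean(per_frame)
```

Lower values indicate that the converted mel-cepstra are closer to the reference.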
We selected 50 words at random as an evaluation set and used the target speaker’s utterance as a reference. An individuality-preserving dictionary is not suitable for mel-cepstral distortion because vowel frames in converted voices are different from the reference. Therefore, in this subsection, we used a target dictionary that is constructed from the target speaker’s vowels and consonants.
4.3 Subjective evaluation
We used 216 utterances for training and the other 216 utterances for the test. We conducted a subjective evaluation on three topics. A total of ten Japanese speakers took part in the test using headphones. For the ‘speech quality’ evaluation, we performed a mean opinion score (MOS) test. The opinion score was set to a five-point scale (5 excellent, 4 good, 3 fair, 2 poor, 1 bad). Thirty-two words, which are difficult for a person with an articulation disorder to utter, were evaluated. The subjects were asked about the speech quality of the articulation-disordered voice, the NMF-based converted voice, and the GMM-based converted voice. Each voice uttered by a physically unimpaired person was presented as a reference of five points on the MOS test.
Fifty words were converted using NMF-based VC and GMM-based VC for the following evaluations. For the ‘similarity’ evaluation, the XAB test was carried out. In the XAB test, each subject listened to the articulation-disordered voice. Then the subject listened to the voices converted by the two methods and selected which sample sounded more similar to the articulation-disordered voice. For the ‘naturalness’ evaluation, a paired comparison test was carried out, where each subject listened to pairs of speech converted by the two methods and selected which sample sounded more natural.
4.4 Results and discussion
5 Conclusions
We proposed a spectral conversion method based on NMF for a voice with an articulation disorder. By combining articulation-disordered vowels and well-ordered consonants, we constructed an individuality-preserving dictionary. Experimental results demonstrated that our VC method can improve the listening speech quality of the words uttered by a person with an articulation disorder.
Our proposed method has the following benefits compared to conventional GMM-based VC: (1) NMF-based VC can preserve the individuality of the source speaker’s voice using an individuality-preserving dictionary, and (2) our NMF-based VC can create a natural voice because the individuality-preserving dictionary keeps the source speaker’s vowels.
Some problems remain with this method. Articulation-disordered speech has a co-articulation effect between phonemes. As shown in Figure 4, the mispronunciation of consonants also affects the vowels that follow. Solving this problem will be the focus of our future work. In this study, there was only one subject, so in future experiments, we will increase the number of subjects and further examine the effectiveness of our method.
This research was supported in part by MIC SCOPE.
References
1. Lin J, Ying W, Huang TS: Capturing human hand motion in image sequences. In IEEE Workshop on Motion and Video Computing, Orlando, 5–6 Dec 2002. IEEE, Piscataway; 2002:99-104.
2. Starner T, Weaver J, Pentland A: Real-time American sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell 1998, 20(12):1371-1375. 10.1109/34.735811
3. Fang G, Gao W, Zhao D: Large vocabulary sign language recognition based on hierarchical decision trees. 5th International Conference on Multimodal Interfaces 2004, 34(3):125-131.
4. Ezaki N, Bulacu M, Schomaker L: Text detection from natural scene images: towards a system for visually impaired persons. Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004) 2004, 2: 683-686.
5. Bashar MK, Matsumoto T, Takeuchi Y, Kudo H, Ohnishi N: Unsupervised texture segmentation via wavelet-based locally orderless images (WLOIs) and SOM. In Computer Graphics and Imaging. IASTED/ACTA Press, Calgary; 2003:279-284.
6. Wu V, Manmatha R, Riseman EM: Textfinder: an automatic system to detect and recognize text in images. IEEE Trans. Pattern Anal. Mach. Intell 1999, 21(11):1224-1229. 10.1109/34.809116
7. Yabu K, Ifukube T, Aomura S: A basic design of wearable speech synthesizer for voice disorders [Japanese]. EIC Technical report (Institute Electron. Inf. Commun. Eng) 2006, 105(686):59-64.
8. Canale ST, Campbell WC: Campbell’s Operative Orthopaedics. Mosby-Year Book, St. Louis; 2002.
9. Matsumasa H, Takiguchi T, Ariki Y, Li I, Nakabayachi T: Integration of metamodel and acoustic model for dysarthric speech recognition. J. Multimedia 2009, 4(4):254-261.
10. Miyamoto C, Komai Y, Takiguchi T, Ariki Y, Li I: Multimodal speech recognition of a person with articulation disorders using AAM and MAF. In IEEE International Workshop on Multimedia Signal Processing (MMSP’10), St. Malo, 4–6 Oct 2010. IEEE, Piscataway; 2010:517-520.
11. Maier A, Haderlein T, Stelzle F, Noth E, Nkenke E, Rosanowski F, Schutzenberger A, Schuster M: Automatic speech recognition systems for the evaluation of voice and speech disorders in head and neck cancer. EURASIP J. Audio Speech Music Process 2010, 2010: 926951. 10.1186/1687-4722-2010-926951
12. Lee D, Seung HS: Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing 13 (NIPS 2000). MIT Press, Massachusetts; 2001:556-562.
13. Virtanen T: Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process 2007, 15(3):1066-1074.
14. Gemmeke JF, Virtanen T: Noise robust exemplar-based connected digit recognition. In 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Dallas, 14–19 March 2010. IEEE, Piscataway; 2010:4546-4549.
15. Schmidt MN, Olsson RK: Single-channel speech separation using sparse non-negative matrix factorization. In Interspeech 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, 17–21 Sept 2006. Curran Associates, Inc., New York; 2006:2614-2617.
16. Toda T, Black A, Tokuda K: Voice conversion based on maximum likelihood estimation of spectral parameter trajectory. IEEE Trans. Audio Speech Lang. Process 2007, 15(8):2222-2235.
17. Iwami Y, Toda T, Saruwatari H, Shikano K: GMM-based voice conversion applied to emotional speech synthesis. IEEE Trans. Speech Audio Process 1999, 7: 2401-2404.
18. Aihara R, Takashima R, Takiguchi T, Ariki Y: GMM-Based emotional voice conversion using spectrum and prosody features. Am. J. Signal Process 2012, 2(5):135-138.
19. Stylianou Y, Cappe O, Moulines E: Continuous probabilistic transform for voice conversion. IEEE Trans. Speech Audio Process 1998, 6(2):131-142. 10.1109/89.661472
20. Helander E, Virtanen T, Nurminen J, Gabbouj M: Voice conversion using partial least squares regression. IEEE Trans. Audio Speech Lang. Process 2010, 18(5):912-921.
21. Lee CH, Wu CH: Map-based adaptation for speech conversion using adaptation data selection and non-parallel training. In Interspeech 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, 17–21 Sept 2006. Curran Associates, Inc., New York; 2006:2254-2257.
22. Toda T, Ohtani Y, Shikano K: Eigenvoice conversion based on Gaussian mixture model. In Interspeech 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, 17–21 Sept 2006. Curran Associates, Inc., New York; 2006:2446-2449.
23. Nakamura K, Toda T, Saruwatari H, Shikano K: Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech. Speech Commun 2012, 54(1):134-146. 10.1016/j.specom.2011.07.007
24. Nakamura K, Toda T, Saruwatari H, Shikano K: Speaking aid system for total laryngectomees using voice conversion of body transmitted artificial speech. In Interspeech 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, 17–21 Sept 2006. Curran Associates, Inc., New York; 2006:1395-1398.
25. Veaux C, Yamagishi J, King S: Using HMM-based speech synthesis to reconstruct the voice of individuals with degenerative speech disorders. In 13th Annual Conference of the International Speech Communication Association 2012 (INTERSPEECH 2012), Portland, 9–13 September 2012. Curran Associates, Inc., New York; 2012:966-969.
26. Kawahara H, Masuda-Katsuse I, Cheveigne A: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun 1999, 27(3–4):187-207.
27. Gemmeke JF, Virtanen T, Hurmalainen A: Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans. Audio Speech Lang. Process 2011, 19(7):2067-2080.
28. Kurematsu A, Takeda K, Sagisaka Y, Katagiri S, Kuwabara H, Shikano K: ATR Japanese speech database as a tool of speech recognition and synthesis. Speech Commun 1990, 9: 357-363. 10.1016/0167-6393(90)90011-W
29. Kominek J, Schultz T, Black AW: Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion. In The International Workshop on Spoken Language Technology for Under-Resourced Languages (SLTU). Hanoi University of Technology, Hanoi; 5–7 May 2008.
30. International Telecommunication Union: ITU-T Recommendation P.800-P.899: Methods for Objective and Subjective Assessment of Quality. ITU, Geneva; 2003.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.