### 3.1 Exemplar-based voice conversion

Figure 1 shows the basic approach of our exemplar-based VC using NMF. *L* and *J* represent the number of frames and the number of dictionary bases, respectively. *D* represents the number of dimensions of the source feature, which consists of a segment feature of the source speaker’s spectrum, and *d* represents the number of dimensions of the target feature, which consists of a single feature of the target speaker’s spectrum.

A dictionary is a collection of source or target bases. Our VC method needs two dictionaries that are phonemically parallel. One is a source dictionary, which is constructed from source features. A source feature is constructed from an articulation-disordered spectrum and its segment feature, which consists of several consecutive frames; we use segment features in order to take temporal information into account when estimating activities. The other is a target dictionary, which is constructed from target features. A target feature is mainly constructed from a well-ordered spectrum. The two dictionaries consist of the same words and are aligned with dynamic time warping (DTW); hence, they have the same number of bases.
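As an illustration, a segment feature can be built by stacking each spectral frame with a few of its neighboring frames. The following is a minimal NumPy sketch; the function name `segment_features` and the context width of two frames are our own illustrative choices, not values taken from the paper:

```python
import numpy as np

def segment_features(spec, context=2):
    """Stack each frame (column) of a D x L spectrogram with its
    +/- `context` neighbors, padding the edges by repetition.
    Returns a D*(2*context+1) x L segment-feature matrix."""
    D, L = spec.shape
    padded = np.pad(spec, ((0, 0), (context, context)), mode="edge")
    return np.vstack([padded[:, i:i + L] for i in range(2 * context + 1)])
```

With `context=2`, each column of the result concatenates five consecutive frames, so the middle block of rows reproduces the original frame.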

Input source features, **X**^{s}, which consist of an articulation-disordered spectrum and its segment features, are decomposed by NMF into a linear combination of bases from the source dictionary **A**^{s}. The weights of the bases are estimated as an activity **H**^{s}. Therefore, the activity carries the weight of each basis for the input features.

Then, the activity is multiplied by a target dictionary in order to obtain converted spectral features {\widehat{\mathbf{X}}}^{t} which are represented by a linear combination of bases from the target dictionary. Because the source and target dictionaries are parallel phonemically, the basis used in the converted features is phonemically the same as that of the source features.

Figure 2 shows an example of the activity matrices estimated for the word *ikioi* (‘vigor’ in English), one uttered by a person with an articulation disorder and the other by a physically unimpaired person. To give an intelligible example, each dictionary was constructed from just the single word *ikioi* and aligned with DTW. As shown in Figure 2, the two activities have high energies at similar elements. For this reason, when parallel dictionaries are available, it may be possible to substitute the activity of the source features estimated with the source dictionary for that of the target features. Therefore, the target speech can be constructed using the target dictionary and the activity of the source signal, as shown in Figure 1.

Spectral envelopes extracted by STRAIGHT analysis [26] are used in the source and target features. The other features extracted by STRAIGHT analysis, such as F0 and the aperiodic components, are used to synthesize the converted signal without any conversion.

### 3.2 Preserving the individuality of the speaker’s voice

In order to make a parallel dictionary, some pairs of parallel utterances are needed, where each pair consists of the same text. One is spoken by a person with an articulation disorder (source speaker), and the other is spoken by a physically unimpaired person (target speaker).

The left side of Figure 3 shows the process for constructing a parallel dictionary. Spectral envelopes, which are extracted from parallel utterances, are phonemically aligned. In order to estimate activities of source features precisely, segment features, which consist of some consecutive frames, are constructed. Target features are constructed from consonant frames of the target’s aligned spectrum and vowel frames of the source’s aligned spectrum. Source and target dictionaries are constructed by lining up each of the features extracted from parallel utterances.
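The phonemic alignment step above relies on standard DTW. The sketch below shows a plain dynamic-programming implementation over frame-wise Euclidean distances; the helper name `dtw_path` is hypothetical, and the actual alignment features and step constraints used in the paper are not specified here:

```python
import numpy as np

def dtw_path(src, tgt):
    """DTW alignment between two feature sequences (frames in columns).
    Returns the optimal warping path as (src_frame, tgt_frame) pairs."""
    Ls, Lt = src.shape[1], tgt.shape[1]
    # local Euclidean distances between every pair of frames
    cost = np.linalg.norm(src[:, :, None] - tgt[:, None, :], axis=0)
    acc = np.full((Ls, Lt), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(Ls):
        for j in range(Lt):
            if i == 0 and j == 0:
                continue
            best = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = cost[i, j] + best
    # backtrack from the end of both sequences
    path, i, j = [(Ls - 1, Lt - 1)], Ls - 1, Lt - 1
    while i > 0 or j > 0:
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        steps = [(a, b) for a, b in steps if a >= 0 and b >= 0]
        i, j = min(steps, key=lambda ab: acc[ab])
        path.append((i, j))
    return path[::-1]
```

The warping path pairs up frames of the two utterances, so the two dictionaries built from the aligned frames end up with the same number of bases.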

Figure 3 shows how to preserve a source speaker’s voice individuality in our VC. The vowels voiced by a speaker strongly indicate that speaker’s individuality. On the other hand, the consonants of people with articulation disorders are often unstable. Figure 4 shows an example of the spectrogram for the word *ikioi* (‘vigor’ in English) of a person with an articulation disorder. The spectrogram of a physically unimpaired person speaking the same word is shown in Figure 5. In Figure 4, the area labeled ‘k’ is not clear compared to the same region in Figure 5. By combining the source speaker’s vowels and the target speaker’s consonants in the target dictionary, the individuality of the source speaker’s voice can be preserved.

### 3.3 Estimation of activity

In the NMF-based approach, the spectrum of the source signal at frame *l* is approximately expressed by a nonnegative linear combination of the source dictionary bases and their activities:

\begin{aligned}{\mathbf{x}}_{l} &= {\mathbf{x}}_{l}^{s}\\ &\approx \sum_{j=1}^{J}{\mathbf{a}}_{j}^{s}{h}_{j,l}^{s}\\ &= {\mathbf{A}}^{s}{\mathbf{h}}_{l}^{s}\quad \mathrm{s.t.}\quad {\mathbf{h}}_{l}^{s}\ge 0\end{aligned}

(2)

where {\mathbf{x}}_{l}^{s}\in {\mathbf{X}}^{s} is the magnitude spectrum of the source signal at frame *l*. Given the whole spectrogram, Equation 2 can be written as follows:

{\mathbf{X}}^{s}\approx {\mathbf{A}}^{s}{\mathbf{H}}^{s}\quad \mathrm{s.t.}\quad {\mathbf{H}}^{s}\ge 0.

(3)

The activity matrix **H**^{s} is estimated based on NMF with a sparse constraint, minimizing the following cost function:

d({\mathbf{X}}^{s},{\mathbf{A}}^{s}{\mathbf{H}}^{s})+\left\|\left(\boldsymbol{\lambda}{\mathbf{1}}^{1\times L}\right).\ast {\mathbf{H}}^{s}\right\|_{1}\quad \mathrm{s.t.}\quad {\mathbf{H}}^{s}\ge 0

(4)

where **1** is an all-one matrix and .∗ denotes element-wise multiplication. The first term is the Kullback-Leibler (KL) divergence between **X**^{s} and **A**^{s}**H**^{s}. The second term is an L1-norm regularization term that causes **H**^{s} to be sparse. The weight of the sparsity constraint can be set for each exemplar by defining **λ**^{T} = [*λ*_{1} … *λ*_{J}].
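For concreteness, the cost of Equation 4 can be evaluated directly. The following is a minimal NumPy sketch; the helper name `sparse_kl_cost` and the `eps` smoothing term (added for numerical safety) are our own assumptions:

```python
import numpy as np

def sparse_kl_cost(X, A, H, lam, eps=1e-12):
    """KL divergence d(X, AH) plus the per-exemplar L1 sparsity
    penalty ||(lam 1) .* H||_1 of Eq. 4. `lam` is a length-J vector."""
    V = A @ H
    kl = np.sum(X * np.log((X + eps) / (V + eps)) - X + V)
    penalty = np.sum(lam[:, None] * H)  # lam broadcast over the L frames
    return kl + penalty
```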

**H**^{s} minimizing Equation 4 is estimated by iteratively applying the following update rule [12, 27]:

{\mathbf{H}}_{n+1}^{s}={\mathbf{H}}_{n}^{s}.\ast \left({{\mathbf{A}}^{s}}^{T}\left({\mathbf{X}}^{s}./\left({\mathbf{A}}^{s}{\mathbf{H}}_{n}^{s}\right)\right)\right)./\left({{\mathbf{A}}^{s}}^{T}{\mathbf{1}}^{D\times L}+\boldsymbol{\lambda}{\mathbf{1}}^{1\times L}\right).

(5)

To increase the sparseness of **H**^{s}, elements of **H**^{s} that are less than a threshold are rounded to zero.
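The update rule of Equation 5, together with the thresholding step, can be sketched as follows in NumPy. The function name `estimate_activity`, the iteration count, and the threshold value are illustrative assumptions, not values reported in the paper:

```python
import numpy as np

def estimate_activity(X, A, lam, n_iter=200, threshold=1e-4, eps=1e-12):
    """Sparse KL-NMF activity estimation with a fixed dictionary A (Eq. 5),
    followed by rounding small activities to zero.
    X: D x L nonnegative features, A: D x J dictionary, lam: length-J weights."""
    D, L = X.shape
    J = A.shape[1]
    H = np.random.rand(J, L)                       # random nonnegative init
    denom = A.T @ np.ones((D, L)) + lam[:, None]   # A^T 1 + lambda 1 (fixed)
    for _ in range(n_iter):
        H *= (A.T @ (X / (A @ H + eps))) / (denom + eps)
    H[H < threshold] = 0.0                          # sparsification step
    return H
```

The multiplicative form keeps **H** nonnegative at every iteration, which is why no explicit projection is needed for the constraint **H** ≥ 0.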

Using the activity and the target dictionary, the converted spectral features are constructed:

{\widehat{\mathbf{X}}}^{t}={\mathbf{A}}^{t}{\mathbf{H}}^{s}.

(6)
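Putting Equations 5 and 6 together, the whole conversion step might look like the sketch below, assuming phonemically parallel dictionaries `A_src` and `A_tgt` with the same number of bases (all names are illustrative):

```python
import numpy as np

def convert(X_src, A_src, A_tgt, lam, n_iter=200, eps=1e-12):
    """Estimate the activity of X_src against the source dictionary (Eq. 5),
    then reuse it with the parallel target dictionary (Eq. 6)."""
    H = np.random.rand(A_src.shape[1], X_src.shape[1])
    denom = A_src.T @ np.ones_like(X_src) + lam[:, None]
    for _ in range(n_iter):
        H *= (A_src.T @ (X_src / (A_src @ H + eps))) / (denom + eps)
    return A_tgt @ H  # converted spectral features, X_hat^t
```

Because the activity is shared between the two parallel dictionaries, the converted features use the phonemically matching target bases, as described in Section 3.1.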