Nonparallel dictionary learning for voice conversion using nonnegative Tucker decomposition
EURASIP Journal on Audio, Speech, and Music Processing volume 2019, Article number: 17 (2019)
Abstract
Voice conversion (VC) is a technique of exclusively converting speaker-specific information in the source speech while preserving the associated phonemic information. Nonnegative matrix factorization (NMF)-based VC has been widely researched because of the natural-sounding voice it achieves when compared with conventional Gaussian mixture model-based VC. In conventional NMF-VC, models are trained using parallel data, which means that the speech data require elaborate preprocessing to generate the parallel data. NMF-VC also tends to be a large model, as it holds many parallel exemplars in the dictionary matrix, leading to a high computational cost. In this study, an innovative parallel dictionary-learning method using nonnegative Tucker decomposition (NTD) is proposed. The proposed method uses tensor decomposition and decomposes an input observation into a set of mode matrices and one core tensor. The proposed NTD-based dictionary-learning method estimates the dictionary matrix for NMF-VC without using parallel data. The experimental results show that the proposed method outperforms other methods in both parallel and nonparallel settings.
1 Introduction
Voice conversion (VC) is a technique used to convert speaker-specific information in the speech of a source speaker into that of a target speaker while retaining linguistic information. Lately, VC techniques have been garnering particular attention [1], and various statistical approaches to VC have been studied [2, 3] as these techniques can be applied to numerous tasks [4–8]. Of these approaches, the Gaussian mixture model (GMM)-based mapping method [9] is the most prevalent, and a number of enhancements have been proposed [10–12]. Other VC methods, such as approaches based on nonnegative matrix factorization (NMF) [13–15], neural networks [16], deep learning [17, 18], restricted Boltzmann machines [19–21], variational autoencoders [22], and generative adversarial networks [23], have also been proposed. Notably, in recent years, NMF has outperformed GMM in parallel data conditions. Exemplar-based NMF-VC retains the high naturalness of the converted speech, and many of its variants have been proposed [24, 25]. Whereas more recent deep learning methods require large amounts of training data, NMF-VC requires comparatively little. Therefore, this study focuses on NMF-VC.
NMF [26] is one of the most popular sparse representation methods. The goal of NMF is to decompose the input observation into two matrices: the basis matrix and the weight matrix. In this study, the basis matrix is referred to as the “dictionary,” and the weight matrix as the “activity.” NMF-based methods can be classified into two approaches: the dictionary-learning approach [14] and the exemplar-based approach [27]. In the dictionary-learning approach, the dictionary and activity are estimated simultaneously during training, and the estimated dictionary is used in conversion. In the exemplar-based approach, by contrast, the training data are used directly as exemplars in the conversion step. Because a learned dictionary is more compact than a set of exemplars, VC with a learned dictionary requires less computation time.
However, both NMF-based approaches require parallel data (aligned speech data from the source and the target speakers, so that each frame of the source speaker’s data corresponds to that of the target speaker’s data) for training the models, which leads to several problems. First, the data are limited to predefined statements (both speakers must utter the same statements). Second, the training data (the parallel data) are no longer the original speech data, as the speech data are stretched and modified along the time axis when aligned, and there is no certainty that each frame is aligned perfectly. As the dictionary is assembled from parallel data, alignment errors in the parallel data might adversely affect VC performance. Several other approaches have been proposed that do not use (or minimally use) parallel data of the source and the target speakers [28–30]. For example, in [28], the spectral relationships between two arbitrary speakers (reference speakers) are modeled using GMMs, and the source speaker’s speech is converted using the matrix that projects the feature space of the source speaker into that of the target speaker through that of the reference speakers. In this study, the conventional NMF-based VC method is extended into a nonparallel VC method. A previous study [30] proposed using the phone segmentation results from automatic speech recognition to construct a subdictionary for each phone for exemplar-based NMF voice conversion. This technique was applied to nonparallel VC.
To tackle the nonparallel setting, a nonnegative Tucker decomposition (NTD) [31–33]-based dictionary-learning method is proposed. NTD is a nonnegative extension of the Tucker decomposition, which decomposes the input observation into a set of matrices and one core tensor. Tucker decomposition is generally introduced to deal with high-order tensors. In recent studies, Tucker decomposition has been applied to visual question-answering systems [34] and speech recognition [35]. Because spectral features are used as the input observation, the decomposition here consists of two mode matrices, one for frequency and one for time, and a core tensor that reduces to a core matrix. It is assumed that these matrices correspond to the frequency basis matrix, the phonemic information, and a codebook between the frequency bases and each phone, respectively. In the proposed approach, the activity matrix in NMF is decomposed into the codebook and the phonemic information. When learning the dictionaries, conventional NMF-VC shares the activity matrix between speakers by using parallel data, whereas the proposed method shares only the codebook and keeps the phonemic information speaker dependent. Hence, the time-varying phonemic information can be captured for each speaker. During the conversion, only the phonemic information matrix is estimated as the activity matrix. As the proposed method can have time-dependent factors for each speaker, there is no necessity for parallel data. To the best of the authors’ knowledge, NTD-based VC has not been attempted, except in [36], where Tucker decomposition was used to represent the speaker space and the conversion mechanism was based on GMM. The present VC is based on NMF, and this approach is fundamentally different from that presented in [36].
Several methods have been proposed for tensor decomposition [37–39]. In [37], NMF is applied to variational Bayesian matrix factorization, where each observed entry is assumed to follow a beta distribution. Shi et al. [38] proposed tensor decomposition with variance maximization for feature extraction. In [39], pairwise similarity information is incorporated into Tucker tensor decomposition. While these methods have useful properties, it is difficult to adapt them directly to VC. NTD can be readily integrated with NMF-based VC, because NMF is the second-order case of the Tucker decomposition with the nonnegative constraint.
The rest of this paper is organized as follows. In Section 2, conventional NMF-based VC is described. Section 3 describes the proposed method. Section 4 details the experimental evaluation, and Section 5 details the experiments on VCC 2018. Finally, in Section 6, the conclusions are presented.
2 NMF-based voice conversion
NMF is a matrix decomposition method under nonnegative constraints. The basic idea behind decomposing a matrix \({\mathbf {X}}\in \mathbb {R}^{F\times T}\) is to find two matrices \({\mathbf {W}}\in \mathbb {R}^{F\times K}\) and \({\mathbf {H}}\in \mathbb {R}^{K\times T}\) that minimize the distance between X and WH under nonnegative constraints. F and T represent the number of dimensions and frames, respectively. In NMF, W is called a basis matrix and contains K bases in its columns. H is called an activity matrix and indicates the activity of each basis along the time index.
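As an illustration of the decomposition just described, the following is a minimal NumPy sketch of NMF under the generalized Kullback-Leibler divergence, using the standard multiplicative updates. The function names `kl_divergence` and `nmf_kl`, the random initialization, and the iteration count are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def kl_divergence(X, Y, eps=1e-12):
    """Generalized Kullback-Leibler divergence between nonnegative arrays."""
    return float(np.sum(X * np.log((X + eps) / (Y + eps)) - X + Y))

def nmf_kl(X, K, n_iter=200, seed=0, eps=1e-12):
    """Factorize X (F x T) into W (F x K) and H (K x T), all nonnegative."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    W = rng.random((F, K)) + eps   # basis matrix (dictionary)
    H = rng.random((K, T)) + eps   # activity matrix
    ones = np.ones_like(X)
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        W *= ((X / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
        H *= (W.T @ (X / (W @ H + eps))) / (W.T @ ones + eps)
    return W, H
```

Calling `nmf_kl(X, K=5)` on a nonnegative spectrogram-like array returns the two factors; `kl_divergence(X, W @ H)` can then be used to monitor the fit.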
VC approaches using NMF are divided into two categories: supervised and unsupervised approaches. The supervised approach, known as exemplar-based VC, estimates only the activity from the observation, and the dictionary must be provided. In contrast, the unsupervised approach, i.e., dictionary-learning VC, estimates both the dictionary and the activity from the observation. The proposed method is based on the latter, i.e., the dictionary-learning approach.
2.1 Dictionary learning using NMF
Figure 1 shows the basic approach of the dictionary-learning NMF-based VC [14], where F, T, and K represent the number of dimensions, frames, and bases, respectively. This VC method needs two dictionaries that are phonemically parallel. \({\mathbf {W}}^{s}\in \mathbb {R}^{F\times K}\) represents a source dictionary, and \({\mathbf {W}}^{t}\in \mathbb {R}^{F\times K}\) represents a target dictionary. In exemplar-based VC, these two dictionaries consist of the same words or sentences and are aligned with dynamic time warping (DTW), which is comparable with the conventional GMM-based VC. In dictionary-learning VC, these two dictionaries are estimated simultaneously and as a result have the same number of bases.
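For the training source speaker data \({\mathbf {X}}^{s}\in \mathbb {R}^{F\times T}\) and the training target speaker data \({\mathbf {X}}^{t}\in \mathbb {R}^{F\times T}\), two dictionaries W^{s}, W^{t}, and the activity \({\mathbf {H}}\in \mathbb {R}^{K\times T}\) are simultaneously estimated. The cost function of this joint NMF is defined as follows:
\[\min _{{\mathbf {W}}^{s},{\mathbf {W}}^{t},{\mathbf {H}}}\ d_{KL}({\mathbf {X}}^{s},{\mathbf {W}}^{s}{\mathbf {H}}) + d_{KL}({\mathbf {X}}^{t},{\mathbf {W}}^{t}{\mathbf {H}}) + \lambda \|{\mathbf {H}}\|_{1} \quad {\text {s.t.}}\ {\mathbf {W}}^{s},{\mathbf {W}}^{t},{\mathbf {H}}\ge 0 \qquad (1)\]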
where X^{s} and X^{t} represent parallel data. In Eq. (1), d_{KL}(A,B) denotes the Kullback-Leibler divergence between the two matrices A and B, and the last term is the sparsity constraint with the L1-norm regularization term that causes the activity matrix to be sparse. λ represents the weight of the sparsity constraint. This function is minimized by iteratively updating parameters, as is done in the traditional NMF.
This method assumes that when the source and the target spectra (which are from the same words but spoken by different speakers) are expressed with sparse representations of the source dictionary and the target dictionary, respectively, the obtained activity matrices are approximately equivalent to each other. In the conversion process, for the input source spectrogram X^{s}, only the activity H^{s} is estimated while fixing the source dictionary W^{s}. The estimated source activity H^{s} is multiplied with the target dictionary W^{t}, and the target spectrogram \(\hat {{\mathbf {X}}}^{t}\) is constructed as follows:
\[\hat {{\mathbf {X}}}^{t} = {\mathbf {W}}^{t}{\mathbf {H}}^{s} \qquad (2)\]
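The training and conversion steps of Section 2.1 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes that stacking the two frame-aligned spectrograms row-wise ties both dictionaries to a single activity matrix (which holds because the KL divergence sums over entries), and it adds the L1 weight to the denominator of the activity update. The names `update_activity`, `train_joint_nmf`, and `convert` are hypothetical, and no dictionary normalization is applied.

```python
import numpy as np

EPS = 1e-12

def update_activity(X, W, H, lam):
    """One multiplicative KL update of H with an L1 penalty of weight lam."""
    return H * (W.T @ (X / (W @ H + EPS))) / (W.T @ np.ones_like(X) + lam + EPS)

def train_joint_nmf(Xs, Xt, K, lam=0.2, n_iter=200, seed=0):
    """Learn W^s and W^t sharing one activity H from frame-aligned Xs and Xt."""
    rng = np.random.default_rng(seed)
    X = np.vstack([Xs, Xt])                  # (2F x T): both speakers share H
    W = rng.random((X.shape[0], K)) + EPS
    H = rng.random((K, X.shape[1])) + EPS
    for _ in range(n_iter):
        W *= ((X / (W @ H + EPS)) @ H.T) / (np.ones_like(X) @ H.T + EPS)
        H = update_activity(X, W, H, lam)
    F = Xs.shape[0]
    return W[:F], W[F:], H                   # source dictionary, target dictionary

def convert(Xs, Ws, Wt, lam=0.2, n_iter=200, seed=0):
    """Estimate H^s with W^s fixed, then rebuild the spectrogram with W^t."""
    rng = np.random.default_rng(seed)
    H = rng.random((Ws.shape[1], Xs.shape[1])) + EPS
    for _ in range(n_iter):
        H = update_activity(Xs, Ws, H, lam)
    return Wt @ H
```

Only `train_joint_nmf` needs parallel data; `convert` works on any source spectrogram with the same number of frequency bins.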
2.2 Problems
NMF-based VC has several problems. First, if the source and target utterances are aligned using DTW in advance, the estimated parameters are affected by the quality of the alignment, and some alignment mismatch appears to persist. Aihara et al. [24] have shown that this mismatch degrades the performance of exemplar-based VC. Second, it appears that the activity matrix contains other information along with the phonetic information. Aihara et al. [25, 27] assumed that the activity matrix contains both phonetic information and speaker information, and accordingly proposed frameworks to overcome this effect, thereby improving the performance of NMF-based VC. In this study, an alternative approach is proposed. The activity matrix is decomposed into a speaker-shared matrix and a speaker-dependent phonetic information matrix. This decomposition makes parallel data unnecessary. Moreover, during the conversion, estimating only the phonetic information matrix as the activity matrix is expected to improve the accuracy of activity estimation.
3 Methods
3.1 NTD
Given a nonnegative N-way tensor, NTD [40] decomposes the input tensor into a core tensor and a set of mode matrices that are restricted to have only nonnegative elements. In this study, as the spectral features are used as the input observation, the core tensor is represented as a matrix, and there are two mode matrices. Under these conditions, NTD is simply defined as follows:
\[{\mathbf {X}} \simeq {\mathbf {U}}{\mathbf {G}}{\mathbf {V}}^{\top } \qquad (3)\]
where \({\mathbf {X}}\in \mathbb {R}^{F\times T}\), \({\mathbf {U}}\in \mathbb {R}^{F\times M}\), \({\mathbf {V}}\in \mathbb {R}^{T\times L}\), and \({\mathbf {G}}\in \mathbb {R}^{M\times L}\) represent an input spectrogram, a mode matrix along the frequency axis, a mode matrix along the time axis, and a core matrix, respectively. F, T, M, and L indicate the number of frequency bins, frames, frequency bases, and time bases, respectively. The cost function of NTD is defined as follows:
NTD provides a general form of nonnegative tensor factorization that includes NMF as a special case; updating algorithms have been proposed in [40]. These updating algorithms are based on those of NMF.
3.2 Dictionary learning using NTD
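To make the relation between the general Tucker form and the two-mode case concrete, the sketch below builds both with `numpy.einsum`. The tensor sizes and the variable names for the three-way example (`G3`, `A`, `B`, `C`) are arbitrary choices for illustration; only `U`, `G`, and `V` follow the notation of this section.

```python
import numpy as np

rng = np.random.default_rng(0)

# General three-way Tucker model: a core tensor multiplied by one matrix per mode.
G3 = rng.random((2, 3, 4))                       # core tensor
A, B, C = rng.random((5, 2)), rng.random((6, 3)), rng.random((7, 4))
X3 = np.einsum("pqr,ip,jq,kr->ijk", G3, A, B, C)  # shape (5, 6, 7)

# Two-mode case used for spectrograms: X is approximated by U G V^T.
F, T, M, L = 6, 10, 3, 4
U = rng.random((F, M))    # frequency-mode matrix
V = rng.random((T, L))    # time-mode matrix
G = rng.random((M, L))    # core matrix
X_hat = np.einsum("ml,fm,tl->ft", G, U, V)

# Grouping the factors gives an NMF-style product W H.
W = U @ G                 # F x L dictionary
H = V.T                   # L x T activity
```

Here `X_hat` equals both `U @ G @ V.T` and `W @ H`, which is the sense in which NMF appears as a special case of the two-mode decomposition.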
This section describes the method of estimating a parallel dictionary between the source and target speakers by NTD. The objective function is represented as follows:
where \({\mathbf {X}}^{s}\in \mathbb {R}^{F\times T_{s}}\), \({\mathbf {X}}^{t}\in \mathbb {R}^{F\times T_{t}}\), \({\mathbf {U}}^{s}\in \mathbb {R}^{F\times M}\), \({\mathbf {U}}^{t}\in \mathbb {R}^{F\times M}\), \({\mathbf {V}}^{s}\in \mathbb {R}^{T_{s}\times L}\), \({\mathbf {V}}^{t}\in \mathbb {R}^{T_{t}\times L}\), and \({\mathbf {G}}\in \mathbb {R}^{M\times L}\) represent the source and target spectrograms, the source and target frequency basis matrices, the source and target time basis matrices, and a core matrix, respectively. α and β represent the weight of each term. F, T_{s}, T_{t}, M, and L indicate the number of frequency bins, source and target frames, frequency bases, and time bases, respectively. This function is minimized by iteratively applying the following update rules in the same manner as in NTD:
where .∗ and ./ denote elementwise multiplication and division, respectively. In this framework, only the core matrix G is shared, and the time-varying matrices V^{s} and V^{t} are dependent on each speaker, as shown in Fig. 2. Therefore, there is no necessity for parallel data.
After each matrix in the model is estimated, the source and target parallel dictionaries are calculated as U^{s}G and U^{t}G, respectively. During conversion, for the given source spectrogram X^{s}, only V^{s} is estimated as \({\mathbf {X}}^{s} = {\mathbf {U}}^{s}{\mathbf {G}}{{\mathbf {V}}^{s}}^{\top }\). Then, the target spectrogram \(\hat {{\mathbf {X}}}^{t}\) is obtained as \(\hat {{\mathbf {X}}}^{t} = {\mathbf {U}}^{t}{\mathbf {G}}{{\mathbf {V}}^{s}}^{\top }\).
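The overall procedure, learning speaker-dependent factors around a shared core and then converting, might be sketched as below. This is not a reproduction of the paper's update equations: it assumes a weighted sum of generalized KL divergences as the objective, derives standard multiplicative updates from it, omits any sparsity term, and uses random initialization instead of the initialization scheme of Section 4.1. The function names and the default weights `alpha=0.5`, `beta=0.5` are hypothetical.

```python
import numpy as np

EPS = 1e-12

def _ratio(X, U, G, V):
    return X / (U @ G @ V.T + EPS)

def update_U(X, U, G, V):
    H = G @ V.T                                   # M x T, plays the role of an activity
    return U * (_ratio(X, U, G, V) @ H.T) / (np.ones_like(X) @ H.T + EPS)

def update_V(X, U, G, V):
    D = U @ G                                     # F x L dictionary
    return V * (_ratio(X, U, G, V).T @ D) / (np.ones_like(X).T @ D + EPS)

def update_G(G, terms):
    """terms: list of (weight, X, U, V); the core G is shared by all of them."""
    num = sum(w * (U.T @ _ratio(X, U, G, V) @ V) for w, X, U, V in terms)
    den = sum(w * (U.T @ np.ones_like(X) @ V) for w, X, U, V in terms)
    return G * num / (den + EPS)

def train_ntd(Xs, Xt, M, L, alpha=0.5, beta=0.5, n_iter=200, seed=0):
    """Xs (F x Ts) and Xt (F x Tt) may have different lengths and contents."""
    rng = np.random.default_rng(seed)
    F = Xs.shape[0]
    Us, Ut = rng.random((F, M)) + EPS, rng.random((F, M)) + EPS
    Vs = rng.random((Xs.shape[1], L)) + EPS
    Vt = rng.random((Xt.shape[1], L)) + EPS
    G = rng.random((M, L)) + EPS
    for _ in range(n_iter):
        Us, Ut = update_U(Xs, Us, G, Vs), update_U(Xt, Ut, G, Vt)
        Vs, Vt = update_V(Xs, Us, G, Vs), update_V(Xt, Ut, G, Vt)
        G = update_G(G, [(alpha, Xs, Us, Vs), (beta, Xt, Ut, Vt)])
    return Us, Ut, G

def convert(Xs, Us, Ut, G, n_iter=200, seed=0):
    """Estimate only V^s with U^s and G fixed, then synthesize with U^t G."""
    rng = np.random.default_rng(seed)
    V = rng.random((Xs.shape[1], G.shape[1])) + EPS
    for _ in range(n_iter):
        V = update_V(Xs, Us, G, V)
    return Ut @ G @ V.T
```

Because `Vs` and `Vt` are separate matrices with their own numbers of rows, the two training spectrograms never need to be frame aligned; only `G` couples the speakers.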
It is assumed that U^{s} and U^{t} represent the frequency basis matrices, and V^{s} and V^{t} represent the phonemic information. As the core matrix is not dependent on either the frequency or the time, this matrix represents the codebook between the frequency bases and the phones. Based on this assumption, the core matrix establishes a correspondence between frequency bases and phones. Specifically, there are L phones, and the spectrum of each phone is constructed using M frequency bases. Conventional NMF-based approaches assume that the activity matrix contains only the phonemic information, although it actually contains other information as well. Therefore, the estimated activity is degraded. In contrast, the proposed NTD-based approach explicitly decomposes the activity matrix into the speaker-shared information and the speaker-dependent phonemic information. Therefore, it is expected that the performance of the activity estimation will be improved during conversion.
4 Experimental evaluation
4.1 Conditions
The proposed VC technique was evaluated in a speaker-conversion task using clean speech data by comparing its results with the conventional GMM-based method [10], the conventional NMF-based dictionary-learning method [14], and an adaptive restricted Boltzmann machine (ARBM)-based method [20] that does not use parallel data. For the evaluation, speech from three male and three female speakers in the ATR Japanese speech database [41] was used. The sampling rate was 16 kHz. A total of 45 sentences were used for training, and another 50 sentences were used for testing. Parallel data aligned using dynamic programming matching (DPM) were used to train the GMM-based and NMF-based methods. The proposed method and the ARBM-based method do not require parallel data. As training data, the same utterances were used for the source and the target speakers in the parallel setting, and completely different utterances were used for each speaker in the nonparallel setting.
Parameter initialization has a significant impact on the conversion performance. In this study, V^{s} and V^{t} are initialized randomly. Table 1 shows the initialization algorithm for U^{s}, U^{t}, and G. In the parallel setting, the initialization is based on the NMF framework using parallel data calculated from the source and target training data. In the nonparallel setting, the initialization is based on the NMF and NTD frameworks. This initialization method uses an adaptive matrix [42]. Finally, the initialized parameters are optimized by Eqs. (6) to (10).
In the conventional NMF-based method and the proposed method, a 513-dimensional WORLD spectrum [43] is used for spectral features. The hyperparameters α and β are used to control the length of the training data for the source and the target speaker, respectively. These parameters were set as follows:
where T_{s} and T_{t} represent the number of frames of source and target training data, respectively. The sparse constraint λ was set to 0.2. The parameters are updated until the convergence condition F_{t}−F_{t−1}<εF_{T} is fulfilled, where F_{t} indicates the value of the objective function at iteration t. ε was set to \(\exp (-9)\). The GMM experiments were implemented using sprocket [44]. In the conventional NMF-based dictionary-learning method, the number of bases is 1000. In the ARBM-based method, a 32-dimensional Mel-cepstrum that was calculated from the 513-dimensional WORLD spectra was used as an input vector. Softmax constraints were imposed on the hidden units.
In this study, a conventional linear regression based on the mean and standard deviation [10] was used to convert F0 information. Other information, such as aperiodic components, was synthesized without any conversion.
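A common way to realize a mean and standard deviation based F0 conversion is sketched below. It assumes, as is customary but not stated in the text, that the statistics are computed on log-F0 over voiced frames only and that unvoiced frames are marked by an F0 of zero. The function names are hypothetical.

```python
import numpy as np

def log_f0_stats(f0):
    """Mean and standard deviation of log-F0 over voiced frames (f0 > 0)."""
    log_f0 = np.log(f0[f0 > 0])
    return log_f0.mean(), log_f0.std()

def convert_f0(f0_src, src_stats, tgt_stats):
    """Map the source log-F0 distribution onto the target mean and variance."""
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    out = np.zeros_like(f0_src, dtype=float)      # unvoiced frames stay at 0
    voiced = f0_src > 0
    out[voiced] = np.exp((np.log(f0_src[voiced]) - mu_s) / sd_s * sd_t + mu_t)
    return out
```

In practice, `src_stats` and `tgt_stats` would be computed once from each speaker's training data and then reused for every utterance to be converted.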
The proposed method was evaluated both objectively and subjectively. Mel-cepstral distortion (MCD) [dB] was used as a measure of the objective evaluations, defined as follows:
\[{\text {MCD}} = \frac {10}{\ln 10}\sqrt {2\sum _{d}\left (mc^{{\text {conv}}}_{d}-mc^{{\text {tar}}}_{d}\right)^{2}}\]
where \(mc^{{\text {conv}}}_{d}\) and \(mc^{{\text {tar}}}_{d}\) represent the d-th dimension of the converted and target Mel-cepstral coefficients, respectively.
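For reference, the widely used form of Mel-cepstral distortion can be computed as in the sketch below. It assumes that the converted and target sequences are already time aligned, that the caller has dropped any coefficient that should not be scored (for example, the energy term), and that the per-frame values are averaged; none of these details are specified in the text. The function name is hypothetical.

```python
import numpy as np

def mel_cepstral_distortion(mc_conv, mc_tar):
    """Mean MCD in dB; inputs are (frames, dims) arrays of aligned coefficients."""
    diff = np.asarray(mc_conv, dtype=float) - np.asarray(mc_tar, dtype=float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

Identical inputs give 0 dB, and larger spectral differences give larger values, which is why a lower MCD indicates a better conversion.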
The subjective evaluation was based on “speech quality” and “similarity to the target speaker (individuality).” In the subjective evaluation, 25 sentences were evaluated by 10 native Japanese speakers. To evaluate the speech quality, a mean opinion score (MOS) test was performed. The opinion score was set to a 5-point scale (5, excellent; 4, good; 3, fair; 2, poor; 1, bad). For the similarity evaluation, an XAB test was conducted, in which each participant listened to the voice of the target speaker and then to the voices converted using the two methods. The participant was then asked to judge which sample sounded more similar to the target speaker’s voice.
4.2 Parameters
The performance of each method was evaluated using different parameters. One male speaker and one female speaker were selected, and male-to-female and female-to-male conversions were evaluated.
First, the performance of the conventional GMM-based VC was evaluated using different numbers of mixtures. The results obtained when using 4, 8, 16, 32, 64, and 128 mixtures are shown in Fig. 3. A lower value indicates a better result. As shown in Fig. 3, the optimal number was close to 8. Therefore, eight mixtures were used in the subsequent experiments.
Next, the performance of the conventional ARBM-based VC was evaluated using different numbers of hidden units. The results obtained when using 2, 4, 8, 16, 32, and 48 hidden units are shown in Fig. 4. As shown in Fig. 4, the optimal number was around 32. Therefore, 32 hidden units were used in the later experiments.
Finally, the performance of the proposed method was evaluated using a different number of frequency bases. The results are shown in Fig. 5 when the numbers of frequency bases M were 100, 200, 300, 400, and 500. The optimal number was around 200. Therefore, 200 was used as the number of frequency bases in the subsequent experiments. In the experiments, to control the number of dictionary bases during conversion, the number of time bases L was fixed to 1000.
4.3 Results
In this section, the proposed method is compared with the conventional GMM-, NMF-, and ARBM-based methods.
Initially, the proposed method is compared with the parallel methods in a parallel setting. Table 2 shows the average MCD values for male-to-female conversion, female-to-male conversion, male-to-male conversion, and female-to-female conversion. In this table, “Mi” and “Fj” indicate the i-th male speaker and j-th female speaker, respectively, and \({\text {src}}\rightarrow {\text {tar}}\) denotes the src-to-tar conversion. The rightmost column in the table indicates the mean value for each method with a 95% confidence interval. Here, a lower value indicates a better result. In these experiments, the models were trained using parallel utterances. The GMM and NMF frameworks require parallel data. For these, parallel utterances were used to calculate the parallel data. Table 2 shows that the proposed NTD-based dictionary learning yields 10.1% and 1.8% relative improvements when compared with the conventional GMM-based method and the conventional NMF-based dictionary learning, respectively, which is attributed to the proposed method not being affected by alignment errors in DTW. Moreover, the proposed method achieved a significantly better score than both comparative methods at the 0.05 significance level.
Next, the proposed method was compared with the nonparallel method in a nonparallel setting. Table 3 shows the average MCD values for male-to-female conversion, female-to-male conversion, male-to-male conversion, and female-to-female conversion. These results show that the MCD of the proposed method is close to that of the conventional nonparallel method, ARBM. However, the proposed method achieved a significantly worse score than the ARBM-based method at the 0.05 significance level. This difference is discussed in Section 4.4.
Figure 6 shows the results of the MOS test on speech quality. The error bar shows a 95% confidence interval. Here, a higher value indicates a better result. MtoF, FtoM, MtoM, and FtoF denote male-to-female conversion, female-to-male conversion, male-to-male conversion, and female-to-female conversion, respectively. “NTD (para)” and “NTD (nonpara)” denote the proposed method trained with parallel utterances and with nonparallel utterances, respectively. The proposed method achieved a significantly better score than the conventional methods. Specifically, NTD with the nonparallel setting showed the best results across all conversions.
Figures 7 and 8 show the results of the XAB test. The error bar shows a 95% confidence interval. For this test, a higher value indicates a better result. In Fig. 7, the results of the proposed method and the conventional NMF-based dictionary-learning method are compared. In the male-to-female and female-to-female conversions, the proposed method achieved a better score than NMF-based dictionary learning. In the male-to-male and female-to-male conversions, the proposed method achieved a lower score than NMF-based dictionary learning. However, the difference between the two methods is not statistically significant (p>0.3). The proposed NTD-based dictionary learning, which does not need parallel data to be calculated, showed comparable performance to the conventional NMF-based dictionary learning, which requires parallel data. In Fig. 8, the results of the proposed method and the ARBM-based VC are compared. In conversions to male, the proposed method achieved a better score than ARBM-based VC. In conversions to female, the proposed method achieved a lower score than ARBM-based VC. The difference was significant (p<0.05) only in the male-to-female conversion; in the other conversions, it was not statistically significant. These tests show that the proposed nonparallel VC approach effectively converts the individuality of the source speaker’s voice to the target speaker’s voice while preserving high speech quality.
4.4 Discussion
In the objective evaluations, the proposed method achieved a better MCD value than the conventional VC methods that use parallel data. This is attributed to the fact that the proposed method is not affected by DPM alignment mismatch. Moreover, the proposed NTD-based method yielded better performance even though it has approximately 60% fewer learned parameters than the conventional NMF-based one. This result indicates that the proposed dictionary learning has better spectral representation while keeping the number of dictionary bases constant during conversion. In addition, the average difference in MCD between the proposed method and the ARBM-based method was approximately 0.08 dB. This difference is relatively small. The ARBM-based method presumably obtains a better MCD because it uses the Mel-cepstrum as its input feature, whereas the NTD-based method uses a WORLD spectrum. In the speech quality test, the proposed method using nonparallel training data achieved a better MOS score than that using parallel utterances. This is attributed to the model’s ability to learn more diverse phonemic information from nonparallel data than from parallel utterances. For example, suppose that n sentences are used for each speaker as training data. With parallel utterances, which consist of the same context for both speakers, the frequency basis matrices U^{s} and U^{t} and the codebook G are all learned from n context patterns. In the nonparallel setting, where a different context is used for the source and target speakers, each frequency basis matrix is still learned from n context patterns, but the codebook is learned from 2n context patterns. The codebook is thus learned more effectively, which improves the generalization ability. Therefore, the method using nonparallel data outperformed that using parallel utterances.
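The figure of roughly 60% fewer parameters can be checked with a short calculation, assuming it refers to the dictionary-side parameters only (the two NMF dictionaries versus the two frequency basis matrices plus the shared core) and excludes the activities, which are re-estimated at conversion time. That interpretation is an assumption of this example; the sizes are those reported in Sections 4.1 and 4.2.

```python
F, K = 513, 1000      # spectrum dimension and number of NMF bases
M, L = 200, 1000      # number of frequency bases and time bases in NTD

nmf_params = 2 * F * K              # W^s and W^t
ntd_params = 2 * F * M + M * L      # U^s, U^t, and the shared core G
reduction = 1 - ntd_params / nmf_params

print(nmf_params, ntd_params, round(100 * reduction, 1))  # 1026000 405200 60.5
```

Under this counting, the effective dictionaries U^{s}G and U^{t}G still have L = 1000 columns, matching the 1000 bases of the NMF baseline during conversion.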
5 Experiments on voice conversion challenge 2018
The proposed method was also evaluated on the Voice Conversion Challenge (VCC) 2018 [45], which includes both parallel and nonparallel recordings from native English speakers from the USA. VCC 2018 consists of a total of 12 speakers. Each speaker has sets of 81 and 35 sentences for training and evaluation, respectively. The recordings were downsampled to 16 kHz. Conversion systems were built for 16 source-target pairs.
The results of this objective evaluation are shown in Table 4. Our proposed method did not outperform the GMM-based VC in the parallel setting, while the NTD-based method achieved a 3.89% relative improvement compared with the ARBM-based method in the nonparallel setting. These results demonstrate that our method is especially effective in nonparallel settings.
6 Conclusion
An innovative dictionary-learning method for NMF-based voice conversion was proposed. It makes NMF-VC possible with nonparallel training. While exemplar-based VC retains the naturalness of the converted speech to a high degree, the source and target dictionaries expand significantly. Although dictionary-learning VC achieves a compact dictionary representation, the parallel dictionaries of the source and target speakers are difficult to learn. These conventional NMF-VC methods require parallel utterances by the source and target speakers to construct the source and target dictionaries. In this study, a parallel dictionary-learning method for NMF-VC based on NTD was proposed that does not require parallel data during training. NTD decomposes an input observation into a set of mode matrices and one core tensor. In the proposed framework, it is assumed that NTD decomposes the spectrogram into the frequency basis matrix, phonemic information matrix, and codebook matrix. Recently, several studies have been conducted on NMF-VC, and the scope of possible applications is widening. It is expected that the proposed method will assist these applications by enabling nonparallel training. It was confirmed that the proposed method achieved an almost identical MCD to the conventional NMF-based dictionary learning that uses parallel data. Furthermore, the performance of the proposed method was comparable to that of the conventional ARBM-based method in a nonparallel setting.
In future work, we plan to apply the method to assistive technology for speakers with articulation disorders. The speech of such speakers is considerably different from that of unimpaired persons, and it is difficult to align correctly. The proposed method does not require the same texts of speech data for the source and target speakers or the frame-wise matching between acoustic features of both speakers. Furthermore, the NTD-based dictionary learning is a natural extension of the NMF-based method, and it can use both parallel and nonparallel data to learn the dictionary. Therefore, we also aim to investigate a semi-supervised dictionary-learning method that improves the performance of a model trained with a small set of parallel data using a large set of nonparallel data.
In the real world, background noise deteriorates conversion performance. However, the proposed model has not been designed with noise robustness in mind. In order to retain the quality of converted voices in a noisy environment, noise robustness is required. In our previous study [46], a noise-robust NMF-based VC was proposed, where the performance was improved by 25% compared with the GMM-based method. As the currently proposed method is based on NMF-based VC, it should be straightforward to apply the noise-robust conversion. The evaluation of our proposed method in a noisy environment will be a topic for our future work.
Availability of data and materials
All data used in this study are included in the ATR Japanese speech database [41].
Abbreviations
ARBM: Adaptive restricted Boltzmann machine
DPM: Dynamic programming matching
DTW: Dynamic time warping
GMM: Gaussian mixture model
MCD: Mel-cepstral distortion
MOS: Mean opinion score
NMF: Nonnegative matrix factorization
NTD: Nonnegative Tucker decomposition
VC: Voice conversion
VCC: Voice conversion challenge
References
T. Toda, L. H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, J. Yamagishi, in Proc. Interspeech. The voice conversion challenge 2016 (ISCA, San Francisco, 2016), pp. 1632–1636.
R. Gray, Vector quantization. IEEE ASSP Mag. 1(2), 4–29 (1984).
H. Valbret, E. Moulines, J. P. Tubach, Voice transformation using PSOLA technique. Speech Commun. 11(2–3), 175–187 (1992).
A. Kain, M. W. Macon, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Spectral voice conversion for text-to-speech synthesis (IEEE, Seattle, 1998), pp. 285–288.
C. Veaux, X. Rodet, in Proc. Interspeech. Intonation conversion from neutral to expressive speech (ISCA, Florence, 2011), pp. 2765–2768.
K. Nakamura, T. Toda, H. Saruwatari, K. Shikano, Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech. Speech Commun. 54(1), 134–146 (2012).
L. Deng, A. Acero, L. Jiang, J. Droppo, X. Huang, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. High-performance robust speech recognition using stereo training data (IEEE, Salt Lake City, 2001), pp. 301–304.
A. Kunikoshi, Y. Qiao, N. Minematsu, K. Hirose, in Proc. Interspeech. Speech generation from hand gestures based on space mapping (ISCA, Brighton, 2009), pp. 308–311.
Y. Stylianou, O. Cappé, E. Moulines, Continuous probabilistic transform for voice conversion. IEEE Trans. Speech Audio Process. 6(2), 131–142 (1998).
T. Toda, A. W. Black, K. Tokuda, Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans. Audio Speech Lang. Process. 15(8), 2222–2235 (2007).
E. Helander, T. Virtanen, J. Nurminen, M. Gabbouj, Voice conversion using partial least squares regression. IEEE Trans. Audio Speech Lang. Process. 18(5), 912–921 (2010).
D. Saito, H. Doi, N. Minematsu, K. Hirose, in Proc. Interspeech. Application of matrix variate Gaussian mixture model to statistical voice conversion (ISCA, Singapore, 2014), pp. 2504–2508.
R. Takashima, T. Takiguchi, Y. Ariki, in IEEE Workshop on Spoken Language Technology. Exemplar-based voice conversion in noisy environment (IEEE, Miami, 2012), pp. 313–317.
R. Takashima, R. Aihara, T. Takiguchi, Y. Ariki, in Speech Synthesis Workshop. Noise-robust voice conversion based on spectral mapping on sparse space (ISCA, Barcelona, 2013), pp. 71–75.
Z. Wu, T. Virtanen, E. Chng, H. Li, Exemplar-based sparse representation with residual compensation for voice conversion. IEEE/ACM Trans. Audio Speech Lang. Process. 22(10), 1506–1521 (2014).
S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, K. Prahallad, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Voice conversion using artificial neural networks (IEEE, Taipei, 2009), pp. 3893–3896.
T. Nakashika, R. Takashima, T. Takiguchi, Y. Ariki, in Proc. Interspeech. Voice conversion in high-order eigen space using deep belief nets (ISCA, Lyon, 2013), pp. 369–372.
T. Nakashika, T. Takiguchi, Y. Ariki, Voice conversion using RNN pre-trained by recurrent temporal restricted Boltzmann machines. IEEE/ACM Trans. Audio Speech Lang. Process. 23(3), 580–587 (2015).
L. H. Chen, Z. H. Ling, Y. Song, L. R. Dai, in Proc. Interspeech. Joint spectral distribution modeling using restricted Boltzmann machines for voice conversion (ISCA, Lyon, 2013), pp. 3052–3056.
T. Nakashika, T. Takiguchi, Y. Minami, Nonparallel training in voice conversion using an adaptive restricted Boltzmann machine. IEEE/ACM Trans. Audio Speech Lang. Process. 24(11), 2032–2045 (2016).
Z. Wu, E. Chng, H. Li, in ChinaSIP. Conditional restricted Boltzmann machine for voice conversion (IEEE, Beijing, 2013), pp. 104–108.
Y. Saito, Y. Ijima, K. Nishida, S. Takamichi, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors (IEEE, Calgary, 2018), pp. 5274–5278.
F. Fang, J. Yamagishi, I. Echizen, J. Lorenzo-Trueba, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. High-quality nonparallel voice conversion based on cycle-consistent adversarial network (IEEE, Calgary, 2018), pp. 5279–5283.
R. Aihara, T. Nakashika, T. Takiguchi, Y. Ariki, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Voice conversion based on nonnegative matrix factorization using phoneme-categorized dictionary (IEEE, Florence, 2014), pp. 7894–7898.
R. Aihara, T. Takiguchi, Y. Ariki, in Proc. Interspeech. Parallel dictionary learning for voice conversion using discriminative graph-embedded nonnegative matrix factorization (ISCA, San Francisco, 2016), pp. 292–296.
D. D. Lee, H. S. Seung, in NIPS. Algorithms for nonnegative matrix factorization (MIT Press, Denver, 2000), pp. 556–562.
R. Aihara, T. Takiguchi, Y. Ariki, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Activity-mapping nonnegative matrix factorization for exemplar-based voice conversion (IEEE, South Brisbane, 2015), pp. 4899–4903.
A. Mouchtaris, J. V. der Spiegel, P. Mueller, Nonparallel training for voice conversion based on a parameter adaptation approach. IEEE Trans. Audio Speech Lang. Process. 14(3), 952–963 (2006).
T. Hashimoto, H. Uchida, D. Saito, N. Minematsu, in Proc. Interspeech. Parallel-data-free many-to-many voice conversion based on DNN integrated with eigenspace using a nonparallel speech corpus (ISCA, Stockholm, 2017), pp. 1278–1282.
B. Sisman, H. Li, K. C. Tan, in ASRU. Sparse representation of phonetic features for voice conversion with and without parallel data (IEEE, Okinawa, 2017), pp. 677–684.
L. D. Lathauwer, B. D. Moor, J. Vandewalle, A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl. 21(4), 1253–1278 (2000).
P. M. Kroonenberg, J. De Leeuw, Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika 45, 69–97 (1980).
L. R. Tucker, Some mathematical notes on three-mode factor analysis. Psychometrika 31, 279–311 (1966).
H. Ben-younes, R. Cadène, M. Cord, N. Thome, in ICCV. MUTAN: Multimodal Tucker fusion for visual question answering (IEEE Computer Society, Venice, 2017), pp. 2631–2639.
J. T. Chien, C. Shen, in Proc. Interspeech. Deep neural factorization for speech recognition (2017), pp. 3682–3686.
D. Saito, K. Yamamoto, N. Minematsu, K. Hirose, in Proc. Interspeech. One-to-many voice conversion based on tensor representation of speaker space (ISCA, Florence, 2011), pp. 653–656.
Z. Ma, A. E. Teschendorff, A. Leijon, Y. Qiao, H. Zhang, J. Guo, Variational Bayesian matrix factorization for bounded support data. IEEE Trans. Pattern Anal. Mach. Intell. 37(4), 876–889 (2015).
Q. Shi, Y. M. Cheung, Q. Zhao, H. Lu, Feature extraction for incomplete data via low-rank tensor decomposition with feature regularization. IEEE Trans. Neural Netw. Learn. Syst. 30(6), 1803–1817 (2019).
B. Jiang, C. Ding, J. Tang, B. Luo, Image representation and learning with graph-Laplacian Tucker tensor decomposition. IEEE Trans. Cybernet. 49(4), 1417–1426 (2019).
Y. Kim, S. Choi, in Computer Vision and Pattern Recognition. Nonnegative Tucker decomposition (IEEE Computer Society, Minneapolis, 2007), pp. 1–8.
A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara, K. Shikano, ATR Japanese speech database as a tool of speech recognition and synthesis. Speech Commun. 9(4), 357–363 (1990).
R. Aihara, T. Fujii, T. Nakashika, T. Takiguchi, Y. Ariki, Small-parallel exemplar-based voice conversion in noisy environments using affine nonnegative matrix factorization. EURASIP J. Audio Speech Music Process. 2015, 32 (2015).
M. Morise, F. Yokomori, K. Ozawa, WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst. 99-D(7), 1877–1884 (2016).
K. Kobayashi, T. Toda, in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop. sprocket: Open-source voice conversion software (ISCA, Les Sables d’Olonne, 2018), pp. 203–210.
J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z. H. Ling, in Odyssey. The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods (ISCA, Les Sables d’Olonne, 2018), pp. 195–202.
R. Aihara, R. Takashima, T. Takiguchi, Y. Ariki, Noise-robust voice conversion based on sparse spectral mapping using nonnegative matrix factorization. IEICE Trans. Inf. Syst. 97-D(6), 1411–1418 (2014).
Acknowledgements
Not applicable.
Funding
This work was supported in part by JSPS KAKENHI (no. JP17J04380) and PRESTO, JST (no. PMJPR15D2).
Author information
Contributions
YT performed the experiments and wrote the paper. YT, TN, and TT reviewed and edited the manuscript. All of the authors discussed the final results. All of the authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Takashima, Y., Nakashika, T., Takiguchi, T. et al. Nonparallel dictionary learning for voice conversion using nonnegative Tucker decomposition. J AUDIO SPEECH MUSIC PROC. 2019, 17 (2019). https://doi.org/10.1186/s13636-019-0160-1
DOI: https://doi.org/10.1186/s13636-019-0160-1
Keywords
 Voice conversion
 Nonnegative Tucker decomposition
 Nonnegative matrix factorization
 Nonparallel training