Voice conversion using speaker-dependent conditional restricted Boltzmann machine
- Toru Nakashika†^{1},
- Tetsuya Takiguchi†^{2} and
- Yasuo Ariki†^{2}
https://doi.org/10.1186/s13636-014-0044-3
© Nakashika et al.; licensee Springer. 2015
Received: 28 February 2014
Accepted: 11 December 2014
Published: 25 February 2015
Abstract
This paper presents a voice conversion (VC) method that utilizes conditional restricted Boltzmann machines (CRBMs) for each speaker to obtain high-order speaker-independent spaces where voice features are converted more easily than those in an original acoustic feature space. The CRBM is expected to automatically discover common features lurking in time-series data. When we train two CRBMs for a source and target speaker independently using only speaker-dependent training data, it can be considered that each CRBM tries to construct subspaces where there are fewer phonemes and relatively more speaker individuality than the original acoustic space because the training data include various phonemes while keeping the speaker individuality unchanged. Each obtained high-order feature is then concatenated using a neural network (NN) from the source to the target. The entire network (the two CRBMs and the NN) can also be fine-tuned as a recurrent neural network (RNN) using the acoustic parallel data since both the CRBMs and the concatenating NN have network-based representation with time dependencies. Through voice-conversion experiments, we confirmed the high performance of our method, especially in terms of objective evaluation, in comparison with conventional GMM, NN, and RNN approaches and with our previous work, the speaker-dependent DBN approach.
Keywords
Voice conversion; Conditional restricted Boltzmann machine; Deep learning; Recurrent neural network; Speaker-specific features
1 Introduction
In recent years, voice conversion (VC), a technique used to change specific information in the speech of a source speaker to that of a target speaker while retaining linguistic information, has been garnering much attention in speech signal processing. VC techniques have been applied to various tasks, such as speech enhancement [1], emotion conversion [2], speaking assistance [3], and other applications [4,5]. Most of the related work in VC focuses not on f0 conversion but on the conversion of spectrum features, and we conform to that in this report as well.
Various statistical approaches to VC have been studied so far, including those discussed in [6,7]. Among these approaches, the Gaussian mixture model (GMM)-based mapping method [8] is widely used, and a number of improvements have been proposed. Toda et al. [9] introduced dynamic features and the global variance (GV) of the converted spectra over a time sequence. Helander et al. [10] proposed transforms based on partial least squares (PLS) to prevent the over-fitting problem encountered in standard multivariate regression. There have also been approaches that do not require parallel data since they use a GMM adaptation technique [11], eigenvoice GMM [12,13] or probabilistic integration model [14]. Other approaches based on statistical approaches have been proposed; Jian et al. [15] used canonical correlation analysis for the VC, and Takashima et al. [16] proposed a VC technique using exemplar-based non-negative matrix factorization (NMF).
However, most of the conventional VC methods, including the GMM-based approaches, rely on ‘shallow’ voice conversion based on linear (or piecewise linear) transformation. That is, source speech is converted either directly in the original feature space or in a shallow architecture with few hidden layers. To capture the characteristics of speech more precisely, it is necessary to have a deeper non-linear architecture with more hidden layers. The shape of the vocal tract is generally non-linear, so non-linear voice conversion is more compatible with human speech. Non-linear VC methods based on neural networks (NNs) have been proposed by Narendranath et al. [17] and Desai et al. [18]. In the GMM-based approaches, the conversion is achieved so as to maximize the conditional probability calculated from a joint probability of source speech and target speech, which is trained beforehand. On the other hand, NN-based approaches directly train the conditional probability, which converts the feature vector of a source speaker to that of a target speaker. It is often reported that such a discriminative approach performs better than a generative approach, such as GMM, in speech recognition and synthesis as well as in VC [19,20]. For these reasons, NN-based approaches achieve relatively high performance if the training samples are carefully prepared [18].
These approaches often suffer from over-smoothing or over-fitting problems. GMM-based approaches represent acoustic features using multiple Gaussian distributions, which are estimated by averaging observations with similar context descriptions in the training. Therefore, the outputs of the GMM are distributed near the modes (means) of the Gaussians, which causes over-smoothing. Furthermore, over-fitting arises when more Gaussian mixture components are used, because the observed distribution is then estimated too precisely. In the NN-based approaches, the model is often over-fitted due to its complexity: it exaggerates small fluctuations in unseen data if the amount of training data is insufficient relative to the number of parameters.
In order to alleviate the over-smoothing effect in a GMM-based method, some methods have been proposed so far, such as the global variance model [21], a minimizing-divergence model [22], and post-filtering [23]. An exemplar-based VC system using non-negative matrix factorization (NMF) has also been proposed to tackle the over-smoothing problems [16,24]. In our earlier work [25], we proposed a new VC technique that copes with the over-fitting problems in NN-based approaches, using a combination of speaker-dependent restricted Boltzmann machines (RBMs) [26] (or deep belief nets (DBN) [27]) that captures high-order features in an unsupervised manner and a concatenating NN. It is reported that these graphical models are better at representing the distribution of high-dimensional observations with cross-dimension correlations than GMM in speech synthesis [28] and in speech recognition [29]. Since Hinton et al. introduced an effective training algorithm for the DBN in 2006 [27], the use of deep learning rapidly spread in the field of signal processing, as well as in speech signal processing. An RBM (or DBN) has been used, for example, for hand-written character recognition [27], 3-D object recognition [30], machine transliteration [31], and so on.
In this paper, we extend our earlier work in [25] to systematically capture time information as well as latent (deep) relationships between a source speaker’s and a target speaker’s features in a single network. This is accomplished by combining speaker-dependent conditional restricted Boltzmann machines (CRBMs) and a concatenating NN.
A CRBM is a non-linear probabilistic model used to represent time series data consisting of three factors: (i) an undirected model between binary latent variables and the current visible variables, (ii) a directed model from the previous visible variables to the current visible variables, and (iii) a directed model from the previous visible variables to the latent variables. In our approach, we first train two exclusive CRBMs for the source and the target speakers independently using segmented training data prepared for each speaker, then train a NN using the projected features, and finally, fine-tune the networks as a single network for VC. Because the training data for the source speaker CRBM include various phonemes particular to the speaker, the speaker-dependent network tries to capture the abstractions to maximally express the training data that have abundant speaker individuality information and less phoneme-related information. Furthermore, the network captures time-series features with the directed models (ii) and (iii), enabling it to discover temporal correlations at the same time. Therefore, we expect that if feature conversion is conducted in such time-related individuality-emphasized high-order spaces, it is much easier to convert voice features than if the original spectrum-based space is used.
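The three factors listed above can be illustrated with a minimal sketch for a single previous frame. It assumes the usual CRBM conditionals with unit-variance visible units; the matrix names `A` (previous visible to current visible) and `B` (previous visible to latent), and the function names, are hypothetical labels chosen for this illustration and are not taken from the paper's equations.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def crbm_hidden_probs(v_t, v_prev, W, B, c):
    """Activation probability of each binary latent unit.

    W (I x J): undirected weights, factor (i).
    B (I x J): directed weights from previous visible to latent, factor (iii).
    """
    probs = []
    for j in range(len(c)):
        a = c[j]
        a += sum(W[i][j] * v_t[i] for i in range(len(v_t)))
        a += sum(B[k][j] * v_prev[k] for k in range(len(v_prev)))
        probs.append(sigmoid(a))
    return probs

def crbm_visible_mean(h, v_prev, W, A, b):
    """Mean of the current visible units given latent units and the previous frame.

    A (I x I): directed weights from previous visible to current visible, factor (ii).
    """
    means = []
    for i in range(len(b)):
        m = b[i]
        m += sum(W[i][j] * h[j] for j in range(len(h)))
        m += sum(A[i][k] * v_prev[k] for k in range(len(v_prev)))
        means.append(m)
    return means
```

Setting `A` and `B` to zero removes the temporal terms, which reduces the sketch to an ordinary RBM with unit-variance visible units.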
Similar research can be found in [32] and [33]. Wu et al. employed a CRBM to capture the linear and non-linear relationship between the source and the target features [32]. Chen et al. also used an RBM to model the joint spectral distribution instead of using the conventional joint density GMM [33]. Unlike these approaches, which are based on a joint model, our method trains two exclusive RBMs for each speaker, aiming to capture speaker-specific conversion-friendly features. We will discuss the differences between these approaches and the proposed method in Section 4.
The rest of the article is organized as follows. In Section 2, we briefly review the fundamental techniques (RBMs and CRBMs) before explaining our method. The proposed VC system is presented in Section 3, and we compare the proposed method with existing related work in Section 4. We describe the various experiments and VC results in Section 5, and we conclude the article in Section 6.
2 Preliminaries
Our voice conversion system uses CRBMs to capture high-order conversion-friendly features. An RBM, the fundamental model of the CRBM, was first introduced as a method of representing binary-valued data [34,35]. It was later extended to real-valued data (such as acoustic features) in the form known as a Gaussian-Bernoulli RBM (GBRBM) [36]. However, it has been reported that the original GBRBM had some difficulties because the training of the parameters was unstable [27,37,38]. Later, an improved learning method for the GBRBM was proposed by Cho et al. [39] to overcome these difficulties. We briefly review RBMs and CRBMs in this section, introducing their improved versions.
2.1 RBM
where \(\|\cdot\|^{2}\) denotes the squared L2 norm. \(\boldsymbol {W} \in \mathbb {R}^{I \times J}\), \(\boldsymbol {\sigma } \in \mathbb {R}^{I \times 1}\), \(\boldsymbol {b} \in \mathbb {R}^{I \times 1}\), and \(\boldsymbol {c} \in \mathbb {R}^{J \times 1}\) are model parameters of the GBRBM, indicating the weight matrix between visible units and hidden units, the standard deviations associated with Gaussian visible units, a bias vector of the visible units, and a bias vector of hidden units, respectively. The fraction bar in Equation 2 denotes element-wise division.
where \(\boldsymbol {W}_{:j}\) and \(\boldsymbol {W}_{i:}\) denote the jth column vector and the ith row vector, respectively. \(\mathcal {S}(\cdot)\) and \(\mathcal {N}(\cdot | \mu, \sigma ^{2})\) indicate an element-wise sigmoid function and a Gaussian probability density function with mean \(\mu\) and variance \(\sigma ^{2}\), respectively.
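The conditional distributions that these symbols describe can be sketched as follows. The equations themselves are not reproduced in this text, so the sketch assumes the improved GBRBM parameterization of Cho et al. [39], in which the visible vector is divided element-wise by the variances before it is multiplied by the weight matrix; the function names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def hidden_probs(v, W, c, sigma):
    """p(h_j = 1 | v) = S(c_j + (v / sigma^2)^T W_{:j}), assumed form."""
    num_visible, num_hidden = len(W), len(W[0])
    scaled = [v[i] / sigma[i] ** 2 for i in range(num_visible)]
    return [
        sigmoid(c[j] + sum(scaled[i] * W[i][j] for i in range(num_visible)))
        for j in range(num_hidden)
    ]

def visible_means(h, W, b):
    """Mean of p(v_i | h) = N(v_i | b_i + W_{i:} h, sigma_i^2), assumed form."""
    return [
        b[i] + sum(W[i][j] * h[j] for j in range(len(h)))
        for i in range(len(W))
    ]
```

In this form, `sigma` only rescales the visible input when computing the hidden activations; the conditional mean of the visible units does not depend on it.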
where 〈·〉_{data} and 〈·〉_{model} indicate expectations of input data and the inner model, respectively. However, it is generally difficult to compute the second term, so, typically, the expected reconstructed data 〈·〉_{recon} computed by Equations 4 and 5 is used instead [27].
Using Equations 6, 7, 8, and 9, each parameter can be updated by stochastic gradient descent with a fixed learning rate and a momentum term.
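A minimal sketch of such an update is given below. It assumes that the gradient of each parameter is approximated by the difference between a statistic averaged over the data and the same statistic averaged over the reconstruction, as described above; the helper names are hypothetical, and the default learning rate and momentum are the values reported in Section 5.1.

```python
def cd_gradient(stat_data, stat_recon):
    """Approximate log-likelihood gradient: <stat>_data - <stat>_recon."""
    return [d - r for d, r in zip(stat_data, stat_recon)]

def momentum_step(param, velocity, grad, lr=0.01, momentum=0.9):
    """One update with a fixed learning rate and a momentum term.

    The step is added because grad points toward higher log-likelihood.
    """
    new_velocity = [momentum * u + lr * g for u, g in zip(velocity, grad)]
    new_param = [p + u for p, u in zip(param, new_velocity)]
    return new_param, new_velocity
```

For example, calling `momentum_step` twice with the same gradient makes the second step larger than the first, because the velocity accumulates 0.9 of the previous step.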
2.2 CRBM
3 Proposed voice conversion
In general, the fewer phonological and the more individuality-emphasized features a source input includes for a speaker, the easier it is to convert the source features to target features. This paper proposes voice conversion using such features.
where \(\boldsymbol {c}_{y}\) is a bias vector for the target speaker.
where \(\bigodot _{l=0}^{L}\) denotes the composition of L+1 functions. For instance, \(\bigodot _{l=0}^{1} \eta _{l}(\boldsymbol {z}) = \mathcal {S}(\boldsymbol {W}_{1} \mathcal {S}(\boldsymbol {W}_{0} \boldsymbol {z} + \boldsymbol {d}_{0}) + \boldsymbol {d}_{1})\) for a NN with one hidden layer.
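The composition can be sketched directly from the one-hidden-layer example above, taking each \(\eta _{l}\) to be \(\mathcal {S}(\boldsymbol {W}_{l} \boldsymbol {z} + \boldsymbol {d}_{l})\). The function names are hypothetical, and each weight matrix is stored as a list of rows so that row r produces output unit r.

```python
import math

def sigmoid_vec(a):
    return [1.0 / (1.0 + math.exp(-x)) for x in a]

def affine(W, z, d):
    """Compute W z + d for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, z)) + bias for row, bias in zip(W, d)]

def compose_layers(z, weights, biases):
    """Apply eta_0, then eta_1, ..., then eta_L, where eta_l(z) = S(W_l z + d_l)."""
    for W, d in zip(weights, biases):
        z = sigmoid_vec(affine(W, z, d))
    return z
```

With two weight matrices (L=1), `compose_layers` returns exactly the nested expression given in the text.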
and \(\left \{f_{(k)}\right \}_{k=0}^{L+2} = \{ \mathcal {S}, \mathcal {S}, \cdots, \mathcal {S}, \mathcal {I} \}\), where \(\mathcal {I}\) indicates an identity function.
where \(\frac {\partial ^{+} \boldsymbol {h}^{(k)}}{\partial \theta }\) refers to the immediate partial derivative of the hidden units \(\boldsymbol {h}^{(k)}\) with respect to \(\theta\) (i.e., \(\boldsymbol {h}^{(k-1)}\) is regarded as a constant with respect to \(\theta\)).
As Equation 24 indicates, we need a current acoustic vector from a source speaker and previous vectors from both a source speaker and a target speaker to estimate the target speaker’s current acoustic vector. However, we never know the correct previous vector of the target speaker, so in practice, we use the last converted (estimated) vectors as the previous target vector iteratively, starting from a zero vector. We confirmed that this approach worked well through our preliminary experiments.
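The iterative procedure described above can be sketched as a simple loop for one previous frame (P=1). The callable `convert_frame` is a hypothetical stand-in for the trained network of Equation 24, which is not reproduced here. Initializing the previous source vector to zero at the first frame is also an assumption, since the text only specifies the zero start for the target side.

```python
def convert_sequence(source_frames, convert_frame, dim):
    """Convert frames one by one, feeding each estimate back as the previous target."""
    prev_source = [0.0] * dim  # assumption: no source context before the first frame
    prev_target = [0.0] * dim  # stated in the text: start from a zero vector
    converted = []
    for frame in source_frames:
        estimate = convert_frame(frame, prev_source, prev_target)
        converted.append(estimate)
        # The true previous target is unknown at conversion time,
        # so the last estimate takes its place.
        prev_source, prev_target = frame, estimate
    return converted
```

A toy `convert_frame` that adds half of the previous target to the current source frame shows the feedback: a constant input of 1.0 yields 1.0, 1.5, 1.75, and so on.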
where \(w^{(m)}\), \(\boldsymbol {\mu }_{\cdot }^{(m)}\), and \(\boldsymbol {\Sigma }_{\cdot }^{(m)}\) are the weight, the mean vectors, and the covariance matrices of the mth mixture for the speaker indicated by the subscript, respectively, showing it to be an additive model of piecewise linear functions. Our approach using Equation 24 is based on the composite function of multiple different non-linear functions feeding time-series data. Therefore, it is expected that our composite model can represent more complex relationships than the conventional GMM-based method and other static network approaches [18,25].
4 Related work
Since the acoustic signals we are targeting are time-series data, a model that captures time-related information is expected to provide better performance.
5 Experiments
5.1 Conditions
In our experiments, we conducted voice conversion using the ATR Japanese speech database [42], comparing our method (speaker-dependent conditional restricted Boltzmann machines; ‘SD-CRBM’) with four methods: the well-known GMM-based approach (‘GMM’), conventional NN-based voice conversion [18] (‘NN’), our previous work [25] (‘SD-RBM’) and, for reference, a recurrent neural network with randomly initialized weights (‘RNN’). In order to evaluate our method under various circumstances, we tested male-to-female (the source and the target speakers are identified with MMY and FTK in the database, respectively), female-to-female (FKN and FTK), and male-to-male (MMY and MHT) patterns.
For an input vector, we calculated 24-dimensional MFCC features from 513-dimensional STRAIGHT spectra [43], using filter theory [44] to decode the MFCC back to STRAIGHT spectra in the synthesis stage. Each speech signal was sampled at 12 kHz and windowed with a 25-ms Hamming window every 10 ms. Unlike our previous work [25], we processed the obtained MFCC with zero-phase component analysis (ZCA) whitening [38], which we confirmed worked better than no whitening, especially for ‘NN’. The parallel data of the source/target speakers processed by dynamic programming were created from 216 word utterances in the dataset and were used for the training of each method. (Note that the two CRBMs for ‘SD-CRBM’ and the two RBMs for ‘SD-RBM’ can be trained without parallel data, although we used the same parallel training data for the CRBMs and the RBMs in this research.)
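The framing arithmetic implied by these settings can be checked with a short sketch. It only converts the stated sampling rate, window length, and shift into sample counts and builds a standard Hamming window; it does not reproduce STRAIGHT analysis or MFCC extraction, and the function names are hypothetical.

```python
import math

def frame_params(sample_rate_hz=12000, window_ms=25, shift_ms=10):
    """Return (window length, frame shift) in samples."""
    window = sample_rate_hz * window_ms // 1000
    shift = sample_rate_hz * shift_ms // 1000
    return window, shift

def hamming(n):
    """Standard Hamming window with n points (n > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def num_frames(num_samples, window, shift):
    """Number of full frames that fit in a signal without padding."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift
```

At 12 kHz, a 25-ms window is 300 samples and a 10-ms shift is 120 samples, so one second of speech gives 98 full frames when no padding is applied.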
The network-based approaches (‘SD-CRBM’, ‘SD-RBM’, ‘NN’, and ‘RNN’) were trained using gradient descent with a learning rate of 0.01 and momentum of 0.9, with the number of epochs being 400. The parameters of ‘NN’ and ‘RNN’ were initialized randomly. All the network-based methods had four layers including an input layer, two hidden layers, and an output layer. Other configurations, such as the number of hidden units, will be discussed in the following section.
For the GMM-based approach, we used diagonal covariance matrices without global variance and dynamic features.
where \(c_{d}\) and \(c'_{d}\) denote the dth coefficient of the original target MFCC and of the converted MFCC, respectively. The smaller the value of MCD is, the closer the converted spectra are to the target spectra. We calculated the MCD for each frame in the training data and averaged the MCD values for the final evaluation.
For the subjective evaluation, ABX listening tests were conducted, where nine participants listened to five pairs of converted speech signals (from a development set, which was used for the determination of model parameters) produced using our approach (‘SD-CRBM’) and the converted speech signals produced by the other methods (‘SD-RBM’, ‘NN’, ‘RNN’, and ‘GMM’) along with an original target speech signal (generated from analysis-by-synthesis). The evaluated models were trained using N=5,000 or N=20,000 training frames. For each pair, the participants selected the better signal in terms of speaker identity (how well they can recognize the speaker from the converted speech) and speech quality (how clear and natural the converted speech is).
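Because the MCD equation itself is not reproduced in this text, the sketch below assumes the commonly used definition, \((10/\ln 10)\sqrt{2\sum_{d}(c_{d}-c'_{d})^{2}}\) in dB, computed per frame and then averaged over frames as described above. The function names are hypothetical.

```python
import math

def mcd_db(target, converted):
    """Mel-cepstral distortion of one frame in dB (assumed standard definition)."""
    squared = sum((c - cp) ** 2 for c, cp in zip(target, converted))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)

def mean_mcd_db(target_frames, converted_frames):
    """Average the per-frame MCD over time-aligned frame pairs."""
    values = [mcd_db(t, c) for t, c in zip(target_frames, converted_frames)]
    return sum(values) / len(values)
```

Identical frames give an MCD of 0 dB, and the value grows with the Euclidean distance between the two coefficient vectors.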
5.2 Determining appropriate parameters
In this section, we report preliminary experiments in which we tested various models with different hyperparameters to determine the appropriate ones. All models were trained using N=20,000 frames from the male-to-female training data and evaluated using a development set of five sentences (identified with SDA16 to SDA20 in the database) that were not included in either the training set or the test set.
5.2.1 Network-based methods
Here, we examine how performance changes with the number of hidden units J in each hidden layer, comparing the four network-based methods (‘SD-CRBM’, ‘SD-RBM’, ‘NN’, and ‘RNN’). In this preliminary experiment, three architectural patterns were tested, where J=24, 48, and 72. We used L=0, which forms a four-layer network for all methods (for example, when J=48 is used, the numbers of units in ‘NN’ from the input/source layer to the output/target layer become 24, 48, 48, and 24 in order). For ‘SD-CRBM’, we set P=1 (1 delay for ‘RNN’ as well), which means we take into account only one previous frame.
For the remaining experiments in this paper, the best architectures for each method were used, i.e., J=24 for ‘SD-RBM’ and ‘NN’, and J=72 for ‘SD-CRBM’ and ‘RNN’.
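As a small illustration of the tested architectures, the sketch below lists the layer sizes for a given J and counts the parameters of a plain feed-forward network with those sizes. The count applies only to the static ‘NN’ case; the temporal connections of ‘SD-CRBM’ and ‘RNN’ are not included, and the function names are hypothetical.

```python
def layer_sizes(num_hidden_units, feature_dim=24, num_hidden_layers=2):
    """Units per layer from the input/source layer to the output/target layer."""
    return [feature_dim] + [num_hidden_units] * num_hidden_layers + [feature_dim]

def feedforward_param_count(sizes):
    """Number of weights and biases in a fully connected feed-forward network."""
    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases
```

For J=48 this gives the sizes 24, 48, 48, and 24, that is, 4,608 weights and 120 biases.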
5.2.2 The number of previous frames
In the remaining experiments, we used P=1, which provided the best performance in the preliminary experiment.
5.2.3 GMM-based method
5.3 Evaluation
In this section, we evaluate our method (‘SD-CRBM’) comparing it with four methods (‘SD-RBM’, ‘NN’, ‘RNN’, and ‘GMM’) using objective and subjective criteria for each pair of speakers, by changing the number of training frames as N=5,000, 10,000, 20,000, and 40,000.
5.3.1 Results
p values between our method and each method w.r.t. speaker identity in case N=5,000
| | SD-RBM | NN | RNN | GMM |
|---|---|---|---|---|
| p | 0.2796 | 0.1636 | 0.0013 | 0.0032 |
p values between our method and each method w.r.t. speaker identity in case N=20,000
| | SD-RBM | NN | RNN | GMM |
|---|---|---|---|---|
| p | 0.0913 | 0.4417 | 0.0913 | 0.0001 |
p values between our method and each method w.r.t. speech quality in case N=5,000
| | SD-RBM | NN | RNN | GMM |
|---|---|---|---|---|
| p | 0.2796 | 0.4343 | 0.3096 | 0.0032 |
p values between our method and each method w.r.t. speech quality in case N=20,000
| | SD-RBM | NN | RNN | GMM |
|---|---|---|---|---|
| p | 0.4417 | 0.0913 | 0.3299 | 0.0000 |
5.3.2 Discussion
In objective criteria, our approach (‘SD-CRBM’) outperformed the other methods, including the popular GMM-based voice conversion method, in most cases. In subjective criteria as well, we obtained significantly better performance compared with each opponent, in terms of speaker identity and/or speech quality (to be specific, in terms of both speaker identity and speech quality for ‘GMM’, in terms of only speech quality for ‘NN’, in terms of only speaker identity for ‘SD-RBM’ and ‘RNN’). The reason for the improvement is attributed to the fact that our time-involving, high-order conversion system using CRBMs is able to capture and convert the abstractions of speaker individualities better than the other methods. In particular, as shown in Figures 6, 7, and 8, our approach achieved high performance in MCD criteria. This is because the CRBMs captured time-series data more appropriately and alleviated estimation errors.
6 Conclusion
We presented a voice conversion method that combines speaker-dependent CRBMs and an NN to extract speaker-individual information for speech conversion. Through experiments, we confirmed that our approach is effective, especially in terms of MCD, compared with the well-known conventional GMM-based approach, an NN-based approach, our own previous work (SD-RBM), and, for reference, a recurrent neural network, regardless of the gender in conversion.
We also conducted ABX experiments for subjective evaluation. The results showed that the performance of our method was not always significantly different in comparison to NN, RNN, and SD-RBM; however, it did perform significantly better than these methods in terms of either speaker identity or speech quality. In the future, we will work to improve our method so that it obtains better results in listening tests.
References
- A Kain, MW Macon, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Spectral voice conversion for text-to-speech synthesis, (1998), pp. 285–288.
- C Veaux, X Rodet, in Proceedings of Interspeech. Intonation conversion from neutral to expressive speech, (2011), pp. 2765–2768.
- K Nakamura, T Toda, H Saruwatari, K Shikano, Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech. Speech Commun. 54(1), 134–146 (2012).
- L Deng, A Acero, L Jiang, J Droppo, X Huang, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). High-performance robust speech recognition using stereo training data, (2001), pp. 301–304.
- A Kunikoshi, Y Qiao, N Minematsu, K Hirose, in Proceedings of Interspeech. Speech generation from hand gestures based on space mapping, (2009), pp. 308–311.
- R Gray, Vector quantization. IEEE ASSP Mag. 1(2), 4–29 (1984).
- H Valbret, E Moulines, J-P Tubach, Voice transformation using PSOLA technique. Speech Commun. 11(2), 175–187 (1992).
- Y Stylianou, O Cappé, E Moulines, Continuous probabilistic transform for voice conversion. IEEE Trans. Speech Audio Process. 6(2), 131–142 (1998).
- T Toda, AW Black, K Tokuda, Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans. Audio Speech Lang. Process. 15(8), 2222–2235 (2007).
- E Helander, T Virtanen, J Nurminen, M Gabbouj, Voice conversion using partial least squares regression. IEEE Trans. Audio Speech Lang. Process. 18(5), 912–921 (2010).
- C-H Lee, C-H Wu, in Proceedings of Interspeech. MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training, (2006), pp. 2254–2257.
- T Toda, Y Ohtani, K Shikano, in Proceedings of Interspeech. Eigenvoice conversion based on Gaussian mixture model, (2006), pp. 2446–2449.
- D Saito, K Yamamoto, N Minematsu, K Hirose, in Proceedings of Interspeech. One-to-many voice conversion based on tensor representation of speaker space, (2011), pp. 653–656.
- D Saito, S Watanabe, A Nakamura, N Minematsu, in Proceedings of Interspeech. Probabilistic integration of joint density model and speaker model for voice conversion, (2010), pp. 1728–1731.
- Z Jian, Z Yang, in Proceedings of International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing. Voice conversion using canonical correlation analysis based on Gaussian mixture model, (2007), pp. 210–215.
- R Takashima, T Takiguchi, Y Ariki, in IEEE Spoken Language Technology Workshop (SLT). Exemplar-based voice conversion in noisy environment, (2012), pp. 313–317.
- M Narendranath, HA Murthy, S Rajendran, B Yegnanarayana, Transformation of formants for voice conversion using artificial neural networks. Speech Commun. 16(2), 207–216 (1995).
- S Desai, EV Raghavendra, B Yegnanarayana, AW Black, K Prahallad, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Voice conversion using artificial neural networks, (2009), pp. 3893–3896.
- Y-J Wu, H Kawai, J Ni, R-H Wang, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Minimum segmentation error based discriminative training for speech synthesis application, (2004), p. 629.
- E McDermott, TJ Hazen, J Le Roux, A Nakamura, S Katagiri, Discriminative training for large-vocabulary speech recognition using minimum classification error. IEEE Trans. Audio Speech Lang. Process. 15(1), 203–223 (2007).
- T Toda, K Tokuda, A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Trans. Inform. Syst. 90(5), 816–824 (2007).
- Z-H Ling, L-R Dai, Minimum Kullback-Leibler divergence parameter generation for HMM-based speech synthesis. IEEE Trans. Audio Speech Lang. Process. 20(5), 1492–1502 (2012).
- Z-H Ling, Y-J Wu, Y-P Wang, L Qin, R-H Wang, in Blizzard Challenge Workshop. USTC system for Blizzard Challenge 2006: an improved HMM-based speech synthesis method, (2006).
- Z Wu, T Virtanen, T Kinnunen, ES Chng, H Li, in Proceedings of the 8th ISCA Speech Synthesis Workshop. Exemplar-based voice conversion using non-negative spectrogram deconvolution, (2013), pp. 221–226.
- T Nakashika, R Takashima, T Takiguchi, Y Ariki, in Proceedings of Interspeech. Voice conversion in high-order eigen space using deep belief nets, (2013), pp. 369–372.
- P Smolensky, Information processing in dynamical systems: foundations of harmony theory. Parallel Distributed Process. 1, 194–281 (1986).
- GE Hinton, S Osindero, Y-W Teh, A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006).
- Z-H Ling, L Deng, D Yu, Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process. 21(10), 2129–2139 (2013).
- A-R Mohamed, GE Dahl, G Hinton, Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20(1), 14–22 (2012).
- V Nair, G Hinton, 3-D object recognition with deep belief nets. Adv. Neural Inform. Process. Syst. 22, 1339–1347 (2009).
- T Deselaers, S Hasan, O Bender, H Ney, in Proceedings of the Fourth Workshop on Statistical Machine Translation. A deep learning approach to machine transliteration, (2009), pp. 233–241.
- Z Wu, ES Chng, H Li, in Proceedings of the IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP). Conditional restricted Boltzmann machine for voice conversion, (2013).
- L-H Chen, Z-H Ling, Y Song, L-R Dai, in Proceedings of Interspeech. Joint spectral distribution modeling using restricted Boltzmann machines for voice conversion, (2013), pp. 3052–3056.
- DH Ackley, GE Hinton, TJ Sejnowski, A learning algorithm for Boltzmann machines. Cogn. Sci. 9(1), 147–169 (1985).
- Y Freund, D Haussler, Unsupervised learning of distributions of binary vectors using two layer networks. Adv. Neural Inform. Process. Syst. 4, 912–919 (1991).
- GE Hinton, RR Salakhutdinov, Reducing the dimensionality of data with neural networks. Science. 313(5786), 504–507 (2006).
- G Hinton, in Tech. Rep. Department of Computer Science. A practical guide to training restricted Boltzmann machines (University of Toronto, 2010).
- A Krizhevsky, G Hinton, Learning multiple layers of features from tiny images (Computer Science Department, University of Toronto, Tech. Rep, 2009).
- K Cho, A Ilin, T Raiko, in Artificial Neural Networks and Machine Learning–ICANN 2011. Improved learning of Gaussian-Bernoulli restricted Boltzmann machines, (2011), pp. 10–17.
- GW Taylor, GE Hinton, ST Roweis, in Advances in Neural Information Processing Systems. Modeling human motion using binary latent variables, (2006), pp. 1345–1352.
- R Pascanu, T Mikolov, Y Bengio, On the difficulty of training recurrent neural networks. (2012).
- A Kurematsu, K Takeda, Y Sagisaka, S Katagiri, H Kuwabara, K Shikano, ATR Japanese speech database as a tool of speech recognition and synthesis. Speech Commun. 9(4), 357–363 (1990).
- H Kawahara, M Morise, T Takahashi, R Nisimura, T Irino, H Banno, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). TANDEM-STRAIGHT: a temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation, (2008), pp. 3933–3936.
- B Milner, X Shao, in Proceedings of Interspeech. Speech reconstruction from mel-frequency cepstral coefficients using a source-filter model, (2002), pp. 2421–2424.
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.