- Research
- Open Access

# Speaker-adaptive-trainable Boltzmann machine and its application to non-parallel voice conversion

- Toru Nakashika^{1} (Email author)
- Yasuhiro Minami^{1}

*EURASIP Journal on Audio, Speech, and Music Processing* **2017**:16

https://doi.org/10.1186/s13636-017-0112-6

© The Author(s) 2017

**Received: **20 October 2016

**Accepted: **12 June 2017

**Published: **29 June 2017

## Abstract

In this paper, we present a voice conversion (VC) method that does not use any parallel data while training the model. Voice conversion is a technique that converts only the speaker-specific information in the source speech while keeping the phonological information unchanged. Most existing VC methods rely on parallel data, i.e., pairs of speech data from the source and target speakers uttering the same sentences. However, the use of parallel data in training causes several problems: (1) the training data is limited to pre-defined sentences, (2) the trained model can only be applied to the speaker pair used in training, and (3) alignment mismatches may occur. Although it is generally preferable in VC not to use parallel data, non-parallel approaches are considered difficult to train. In our approach, we realize non-parallel training based on speaker-adaptive training (SAT). Speech signals are represented using a probabilistic model based on the Boltzmann machine that explicitly defines phonological information and speaker-related information. Speaker-independent (SI) and speaker-dependent (SD) parameters are simultaneously trained using SAT. In the conversion stage, a given speech signal is decomposed into phonological and speaker-related information, the speaker-related information is replaced with that of the desired speaker, and voice-converted speech is obtained by combining the two. Our experimental results showed that our approach outperformed the conventional non-parallel approach in terms of both objective and subjective criteria.

## 1 Introduction

In recent years, voice conversion (VC), which is a technique used to change speaker-specific information in the speech of a source speaker into that of a target speaker while retaining linguistic information, has been garnering much attention since VC techniques can be applied to various tasks [1–5]. Most of the existing approaches rely on statistical models [6, 7], and the approach based on the Gaussian mixture model (GMM) [8–11] is one of the mainstream methods used nowadays. Other statistical models, such as non-negative matrix factorization (NMF) [12, 13], neural networks (NNs) [14], restricted Boltzmann machines (RBMs) [15, 16], and deep learning [17, 18], are also used in VC. However, almost all of the existing VC methods require parallel data (speech data from the source and the target speakers, aligned so that each frame of the source speaker's data corresponds to a frame of the target speaker's data) for training the models, which leads to several problems. First, the data is limited to pre-defined sentences (both speakers must utter the same sentences). Second, the trained model can only be applied to the speaker pair used in the training, and it is difficult to reuse the model for the conversion of another speaker pair. Third, the training data (the parallel data) is no longer the original speech data, because the speech data is stretched and modified along the time axis when aligned. Furthermore, it is not guaranteed that each frame is aligned perfectly, and such mismatches may introduce errors in training.

Several approaches that do not use *parallel data from the source to the target speakers*^{1} have also been proposed [19–23]. In [19], for example, the authors model the spectral relationships between two arbitrary speakers (reference speakers) using GMMs and convert the source speaker's speech using a matrix that projects the feature space of the source speaker into that of the target speaker through that of the reference speakers. As a result, parallel data from the source and target speakers is not required. In [21, 22], codebooks (eigenvoices) are obtained using the parallel data of the reference speakers, and many-to-many VC is achieved by mapping the source speaker's speech into an eigenvoice and the eigenvoice into the target speaker's speech. The multistep VC [24] has also been proposed to reduce the training cost of estimating the mapping functions for each speaker pair.

In this paper, we propose a totally-parallel-data-free^{2} VC method using an energy-based probabilistic model and speaker-adaptive training (SAT). The idea is simple and intuitive. A speech signal of an arbitrary speaker is composed of neutral speech (speech with the averaged voice calculated from a collection of speech samples from multiple speakers) that directly links to phonological information belonging to no one, accompanied by speaker-specific information. Under this assumption, VC is achieved in three steps: decomposing a speech signal into neutral speech and speaker-specific information, replacing the speaker-specific information with that of the desired speaker, and composing a speech signal from the neutral speech and the replaced speaker information. The proposed model, called a speaker-adaptive-trainable Boltzmann machine (SATBM), is designed to support such a decomposition. The above VC steps can be viewed as a simplified version of a combination of automatic speech recognition (ASR) and text-to-speech (TTS) systems: text estimation from the input speech using the ASR system, followed by speech generation of the target speaker from the text using the TTS system. Although VC can be realized by such a combination approach, our VC scheme has several advantages. First, in our approach, we can reduce (or omit) the cost of training two different systems. Second, the combination approach requires a large amount of training data of the target speaker for TTS, while our approach does not. Third, the latent phonological features in our approach can be optimized for VC. Fourth, ideally, the voice-converted speech can be generated in real time by our approach due to the frame-wise conversion.
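The three-step scheme above can be sketched as a tiny pipeline. The helper functions `decompose` and `compose` are purely hypothetical stand-ins for the model developed in the following sections; here they are toy linear "speakers" used only to illustrate the data flow, not the actual SATBM.

```python
import numpy as np

def convert_voice(x_src, decompose, speaker_info_tgt, compose):
    """Sketch of the three-step VC scheme:
    decompose -> replace speaker info -> compose."""
    # 1) Decompose source speech into the neutral (phonological) part
    #    and the speaker-specific information.
    neutral, _speaker_info_src = decompose(x_src)
    # 2) Replace the speaker-specific information with the target's, and
    # 3) compose the converted speech from the two.
    return compose(neutral, speaker_info_tgt)

# Toy illustration with linear "speakers": x = a * neutral + b.
decompose = lambda x: ((x - 1.0) / 2.0, (2.0, 1.0))   # source: a=2, b=1
compose = lambda n, info: info[0] * n + info[1]
x_src = np.array([3.0, 5.0])                          # neutral part = [1, 2]
x_conv = convert_voice(x_src, decompose, (0.5, -1.0), compose)
print(x_conv)  # neutral [1, 2] re-rendered with target (a=0.5, b=-1)
```

The point of the sketch is that the conversion never needs a paired target utterance, only the target's speaker information.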

We attempted non-parallel training with another probabilistic model, named the adaptive restricted Boltzmann machine (ARBM) [25], in our previous work. Its architecture differs from the model proposed in this paper in several respects; e.g., while the ARBM is based on a model-space transformation, the SATBM is based on a constrained model-space transformation. In the following sections, we discuss this in more detail.

## 2 Formulation

We represent the acoustic feature vector^{3} \(\hat {\boldsymbol {x}}_{rt} = \left [\hat {x}_{rt}^{1}, \cdots, \hat {x}_{rt}^{D}\right ]^{\top } \in \mathbb {R}^{D}\) (*D* is the number of dimensions) of a speaker *r* at time *t* as follows:

$$\hat{\boldsymbol{x}}_{rt} = \mathbf{A}_{r} \boldsymbol{x}_{t} + \boldsymbol{b}_{r}, \tag{1}$$

where \(\boldsymbol{x}_{t}\) denotes the acoustic features of a neutral speaker, and \(\mathbf{A}_{r}\) and \(\boldsymbol{b}_{r}\) denote the adaptation matrix and the bias of speaker *r*, respectively. Note that \(\mathbf{A}_{r}\) is a global matrix for all the phonemes or kinds of speech sounds, unlike in MLLR or similar techniques. Here, we assume that \(\boldsymbol{x}_{t}\) is normally distributed with time-varying (phoneme-dependent) mean \(\boldsymbol {\mu }_{t} \in \mathbb {R}^{D}\) and time-invariant diagonal variance \(\mathbf {\Sigma } = \text {diag}{(\boldsymbol {\sigma }^{2})}, \boldsymbol {\sigma }^{2} = \left [\sigma _{1}^{2}, \cdots, \sigma _{D}^{2}\right ]^{\top } \in \mathbb {R}^{D}\) given a latent phonological vector \(\boldsymbol {h}_{t} = \left [h_{t}^{1}, \cdots, h_{t}^{H}\right ]^{\top } \in \mathbb {B}^{H}\) (\(\mathbb {B}\) is a binary space and *H* is the number of dimensions of the latent vector). At this time, \(\hat {\boldsymbol {x}}_{rt}\) is also normally distributed; that is,

$$p(\hat{\boldsymbol{x}}_{rt} \mid \boldsymbol{h}_{t}) = \mathcal{N}\left(\hat{\boldsymbol{x}}_{rt};\ \mathbf{A}_{r} \boldsymbol{\mu}_{t} + \boldsymbol{b}_{r},\ \hat{\mathbf{\Sigma}}_{r}\right), \quad \hat{\mathbf{\Sigma}}_{r} = \mathbf{A}_{r} \mathbf{\Sigma} \mathbf{A}_{r}^{\top}. \tag{2}$$
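As a numerical illustration of the affine speaker transform and the resulting Gaussian of Eq. (2), the following sketch (with arbitrary toy values for \(\mathbf{A}_r\), \(\boldsymbol{b}_r\), \(\boldsymbol{\mu}_t\), and \(\mathbf{\Sigma}\), chosen by us) checks that transforming samples from the neutral speaker reproduces the transformed mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Speaker-independent pieces: neutral mean mu_t and diagonal variance (toy).
mu_t = np.array([0.5, -1.0, 2.0])
sigma2 = np.array([0.1, 0.2, 0.3])
Sigma = np.diag(sigma2)

# Hypothetical speaker-dependent adaptation: x_hat = A_r x + b_r.
A_r = np.array([[1.0, 0.1, 0.0],
                [0.0, 0.9, 0.2],
                [0.0, 0.0, 1.1]])
b_r = np.array([0.3, -0.2, 0.1])

# Under Eq. (2), x_hat is Gaussian with these transformed parameters:
mean_hat = A_r @ mu_t + b_r              # A_r mu_t + b_r
Sigma_hat = A_r @ Sigma @ A_r.T          # A_r Sigma A_r^T (full covariance)

# Sanity check by sampling: the empirical mean approaches mean_hat.
x = rng.multivariate_normal(mu_t, Sigma, size=200_000)
x_hat = x @ A_r.T + b_r
print(np.round(mean_hat, 3), np.round(x_hat.mean(axis=0), 2))
```

Note that even though \(\mathbf{\Sigma}\) is diagonal, \(\hat{\mathbf{\Sigma}}_r\) is full, which is how the model captures speaker-dependent correlations between feature dimensions.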

The mean vector \(\boldsymbol{\mu}_{t}\) in Eq. (2) is explained as follows. The speech of the neutral speaker at a certain time is supposed to be determined by latent, phonological information that must exist behind it but is not observable. For example, if the phoneme /e/ is intended at time *t*, then the neutral speech at *t* should correspond acoustically to the phoneme /e/. Therefore, we assume that the mean vector of the neutral speaker \(\boldsymbol{\mu}_{t}\) is determined using the latent phonological vector \(\boldsymbol{h}_{t}\) as

$$\boldsymbol{\mu}_{t} = \mathbf{W} \boldsymbol{h}_{t} + \boldsymbol{b}, \tag{3}$$

where \(\mathbf{W} \in \mathbb{R}^{D \times H}\) and \(\boldsymbol{b} \in \mathbb{R}^{D}\) denote a projection matrix and a bias vector, respectively. Substituting Eq. (3) into Eq. (2), the conditional probability of \(\hat{\boldsymbol{x}}_{rt}\) can be calculated as follows:

$$p(\hat{\boldsymbol{x}}_{rt} \mid \boldsymbol{h}_{t}) = \mathcal{N}\left(\hat{\boldsymbol{x}}_{rt};\ \hat{\mathbf{W}}_{r} \boldsymbol{h}_{t} + \hat{\boldsymbol{b}}_{r},\ \hat{\mathbf{\Sigma}}_{r}\right), \tag{4}$$

where we introduce \(\hat {\boldsymbol {b}}_{r} = \mathbf {A}_{r} \boldsymbol {b} + \boldsymbol {b}_{r}\) and \(\hat {\mathbf {W}}_{r} = \mathbf {A}_{r} \mathbf {W}\).
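The substitutions \(\hat{\mathbf{W}}_r = \mathbf{A}_r\mathbf{W}\) and \(\hat{\boldsymbol{b}}_r = \mathbf{A}_r\boldsymbol{b} + \boldsymbol{b}_r\) can be verified mechanically; in this sketch the parameter values are toy choices of ours, and the assertion checks that the mean of Eq. (4) equals the speaker-transformed neutral mean of Eq. (3):

```python
import numpy as np

# Toy SI parameters for mu_t = W h_t + b (Eq. (3)).
D, H = 3, 4
W = np.arange(D * H, dtype=float).reshape(D, H) / 10.0
b = np.array([0.1, 0.2, 0.3])

# Hypothetical speaker-dependent transform.
A_r = np.eye(D) + 0.1
b_r = np.array([0.5, -0.5, 0.0])

# Derived speaker-dependent parameters used in Eq. (4).
W_hat = A_r @ W                  # \hat{W}_r = A_r W
b_hat = A_r @ b + b_r            # \hat{b}_r = A_r b + b_r

# The mean of Eq. (4) equals transforming the neutral mean of Eq. (3):
h_t = np.array([1.0, 0.0, 0.0, 0.0])
mean_direct = W_hat @ h_t + b_hat
mean_via_neutral = A_r @ (W @ h_t + b) + b_r
assert np.allclose(mean_direct, mean_via_neutral)
print(np.round(mean_direct, 3))
```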

Next, we consider the prior probability of \(\boldsymbol{h}_{t}\). In this formulation, it is beneficial, in terms of reducing the number of parameters, to reuse the already-defined parameters. We define the activation probabilities \(\boldsymbol{\pi}_{t}\) of \(\boldsymbol{h}_{t}\) as follows:

$$\boldsymbol{\pi}_{t} = \sigma\left(\hat{\boldsymbol{c}}_{r} + \hat{\mathbf{W}}_{r}^{\top} \hat{\mathbf{\Sigma}}_{r}^{-1} \hat{\boldsymbol{x}}_{rt}\right), \tag{5}$$

where \(\sigma(\cdot)\) denotes an element-wise sigmoid function and we use the replacement \(\hat {\boldsymbol {c}}_{r} = \boldsymbol {c} - \hat {\mathbf {W}}_{r}^{\top } \hat {\mathbf {\Sigma }}_{r}^{-1} \boldsymbol {b}_{r}\), so that

$$p(\boldsymbol{h}_{t} \mid \hat{\boldsymbol{x}}_{rt}) = \mathcal{B}\left(\boldsymbol{h}_{t};\ \boldsymbol{\pi}_{t}\right) \tag{6}$$

gives the conditional probability of \(\boldsymbol{h}_{t}\). From Eqs. (4) and (6), we notice that the same term \(- \hat {\boldsymbol {x}}_{rt}^{\top } \hat {\mathbf {\Sigma }}_{r}^{-1} \hat {\mathbf {W}}_{r} \boldsymbol {h}\) appears in the exponential. Consequently, the following joint probability satisfies Eqs. (4) and (6):

$$p(\hat{\boldsymbol{x}}_{rt}, \boldsymbol{h}_{t}) = \frac{1}{Z} \exp\left\{ -\frac{1}{2} (\hat{\boldsymbol{x}}_{rt} - \hat{\boldsymbol{b}}_{r})^{\top} \hat{\mathbf{\Sigma}}_{r}^{-1} (\hat{\boldsymbol{x}}_{rt} - \hat{\boldsymbol{b}}_{r}) + \hat{\boldsymbol{c}}_{r}^{\top} \boldsymbol{h}_{t} + \hat{\boldsymbol{x}}_{rt}^{\top} \hat{\mathbf{\Sigma}}_{r}^{-1} \hat{\mathbf{W}}_{r} \boldsymbol{h}_{t} \right\}, \tag{7}$$

where \(Z\) is a normalization constant.

The probabilistic model in Eq. (7) is equivalent to the model of a speaker *r* when using a GB-RBM with the visible units of acoustic features of the neutral speaker (or the averaged speaker) and the hidden units of latent phonological features, as shown in Fig. 1a. From another viewpoint, it can be regarded as a sort of semi-RBM [27], since there are shared connections \(\hat {\mathbf {W}}_{r}\) between \(\hat {\boldsymbol {x}}_{rt}\) and \(\boldsymbol{h}_{t}\) and connections \(\hat {\mathbf {\Sigma }}_{r}^{-1}\) among \(\hat {\boldsymbol {x}}_{rt}\) but no connections among \(\boldsymbol {h}_{t}\) (Fig. 1b). The difference is that the model in Eq. (7) assumes the existence of the neutral speaker and defines additional parameters that enable speaker-adaptive training. In this paper, we call the probabilistic model defined in Eq. (7) a speaker-adaptive-trainable Boltzmann machine (SATBM). In our previous work [25], we proposed another probabilistic model named the adaptive restricted Boltzmann machine (ARBM), which is an extension of an RBM where only the connection weights between the visible and hidden units are speaker-adaptive. The ARBM is based on a model-space transformation, whereas the SATBM is based on both a model-space transformation and a feature-space transformation (i.e., a constrained model-space transformation), as Eqs. (1) and (2) indicate. Specifically, in the SATBM, the speaker-dependent parameters (means and covariance matrix) of the Gaussian visible units are represented as \(\mathbf{A}_{r} \mathbf{W} \boldsymbol{h}_{t} + \mathbf{A}_{r} \boldsymbol{b} + \boldsymbol{b}_{r}\) for the means and \(\mathbf {A}_{r} \mathbf {\Sigma } \mathbf {A}_{r}^{\top }\) for the covariance matrix. In the ARBM, on the other hand, the speaker-dependent parameters of the Gaussian visible units are represented as \(\mathbf{A}_{r} \mathbf{W} \boldsymbol{h}_{t} + \boldsymbol{b} + \boldsymbol{b}_{r}\) for the means and \(\mathbf{\Sigma}\) for the covariance matrix. This indicates that the speaker-dependent Gaussian parameters in the SATBM are more strongly influenced by the speaker and adapt to the speaker more than those in the ARBM. From another perspective, the SATBM directly models the correlations between the dimensions of the observed features, while the ARBM does not. The observed features take different values every time a specific speaker pronounces the same phoneme, and the extent of this variation depends on the speaker; the SATBM also represents such characteristics of each speaker. For these reasons, we expect the SATBM to be superior to the ARBM in acoustic modeling.
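The joint model can be sketched through a GB-RBM-style energy function consistent with the conditionals above; the parameter values below are toy assumptions of ours, not trained ones. The numerical check confirms that the posterior over a hidden unit derived from energy differences matches the sigmoid form:

```python
import numpy as np

def satbm_energy(x_hat, h, W_hat, b_hat, c_hat, Sigma_hat):
    """Energy whose Boltzmann distribution has a Gaussian conditional over
    x_hat (mean W_hat h + b_hat) and a sigmoid conditional over h."""
    P = np.linalg.inv(Sigma_hat)          # precision \hat{Sigma}_r^{-1}
    diff = x_hat - b_hat
    return 0.5 * diff @ P @ diff - c_hat @ h - x_hat @ P @ W_hat @ h

# Numerical check with a single hidden unit and toy parameters.
W_hat = np.array([[0.5], [-0.3]])
b_hat = np.array([0.1, 0.2])
c_hat = np.array([0.05])
Sigma_hat = np.diag([0.4, 0.6])
x_hat = np.array([1.0, -1.0])

e0 = satbm_energy(x_hat, np.array([0.0]), W_hat, b_hat, c_hat, Sigma_hat)
e1 = satbm_energy(x_hat, np.array([1.0]), W_hat, b_hat, c_hat, Sigma_hat)
p1 = np.exp(-e1) / (np.exp(-e0) + np.exp(-e1))   # posterior from energies

P = np.linalg.inv(Sigma_hat)
p1_sigmoid = 1.0 / (1.0 + np.exp(-(c_hat + W_hat.T @ P @ x_hat)))
assert np.allclose(p1, p1_sigmoid)
print(float(p1))
```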

### 2.1 One-hot activation of h_t

We can further add the constraint \(\sum _{j=1}^{H}h_{t}^{j}=1\) to our model, resulting in a one-hot vector \(\boldsymbol{h}_{t}\), which indicates that only one phonological component is activated. In real speech, only one phoneme, such as /a/ or /e/, should be active in the background at a certain frame; therefore, this modification may give a better representation of speech. The use of the one-hot representation is inspired by this phonological reasoning. Under this constraint, the conditional probability of each hidden unit becomes

$$p\left(h_{t}^{j} = 1 \mid \hat{\boldsymbol{x}}_{rt}\right) = \psi\left(\hat{c}_{r}^{j} + \hat{\boldsymbol{w}}_{r}^{j \top} \hat{\mathbf{\Sigma}}_{r}^{-1} \hat{\boldsymbol{x}}_{rt}\right), \tag{9}$$

where \(\hat {\boldsymbol {w}}_{r}^{j}\) and \(\hat {c}_{r}^{j}\) indicate the *j*th column vector in \(\hat {\mathbf {W}}_{r}\) and the *j*th element in \(\hat {\boldsymbol {c}}_{r}\), respectively, and *ψ*(·) denotes a softmax function (we also define *ψ*(·) as an element-wise softmax function for convenience). Equation (9) is used instead of Eq. (5) when we sample \(\boldsymbol{h}_{t}\), as discussed in the following sections.
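The one-hot sampling can be sketched as follows; the activation values are hypothetical, and `softmax` plays the role of *ψ*(·):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(a):
    a = a - a.max()                       # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

# Hypothetical per-unit activations c_hat_r^j + w_hat_r^j . Sigma^{-1} x_hat.
activations = np.array([0.2, 2.5, -1.0, 0.3])
probs = softmax(activations)              # psi(.): probabilities sum to 1

# Sampling a one-hot h_t subject to the constraint sum_j h_t^j = 1:
j = rng.choice(len(probs), p=probs)
h_t = np.eye(len(probs))[j]
print(np.round(probs, 3), h_t)
```

In expectation, the unit with the largest activation dominates, which matches the intuition that a single phoneme is active per frame.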

## 3 Parameter estimation based on SAT

We write \(\mathbf{\Theta}^{SD} = \{\mathbf{A}_{r}, \boldsymbol{b}_{r}\}_{r=1}^{R}\) for the SD parameters and \(\mathbf{\Theta}^{SI} = \{\mathbf{W}, \boldsymbol{\sigma}^{2}, \boldsymbol{b}, \boldsymbol{c}\}\) for the SI parameters. Given a collection of speech data \(\boldsymbol {X} = \{\boldsymbol {X}_{r} \}_{r=1}^{R},\ \boldsymbol {X}_{r} = \{\hat {\boldsymbol {x}}_{rt} \}_{t=1}^{T_{r}}\) that is composed of *R* speakers, these parameters are simultaneously estimated so as to maximize the likelihood:

$$\left(\hat{\mathbf{\Theta}}^{SI}, \hat{\mathbf{\Theta}}^{SD}\right) = \operatorname*{argmax}_{\mathbf{\Theta}^{SI}, \mathbf{\Theta}^{SD}} \prod_{r=1}^{R} \prod_{t=1}^{T_{r}} p(\hat{\boldsymbol{x}}_{rt}).$$

According to the SAT paradigm, the SD parameters **Θ**^{SD} absorb the speaker-induced variation, and the SI parameters **Θ**^{SI} capture the remaining information, i.e., the phonetically relevant variation. Unlike the conventional SAT+MLLR (maximum likelihood linear regression), the SATBM explicitly models the relationships between the speaker-normalized acoustic features and the phonological information, which implies that the model may represent the speech data better than SAT+MLLR.

The partial derivative of the log-likelihood \(\mathcal{L}\) with respect to each parameter *θ*∈{**Θ**^{SD}, **Θ**^{SI}} is derived as follows:

$$\frac{\partial \mathcal{L}}{\partial \theta} = -\left\langle \frac{\partial E(\hat{\boldsymbol{x}}_{rt}, \boldsymbol{h}_{t})}{\partial \theta} \right\rangle_{\text{data}} + \left\langle \frac{\partial E(\hat{\boldsymbol{x}}_{rt}, \boldsymbol{h}_{t})}{\partial \theta} \right\rangle_{\text{model}},$$

where 〈·〉_{data} and 〈·〉_{model} denote expectations over the empirical data and the inner model, respectively. It is generally difficult to compute the expectations of the inner model; however, we can still use contrastive divergence (CD) [29] and efficiently approximate them with expectations over the reconstructed data. The partial gradients \(\frac {\partial E(\hat {\boldsymbol {x}}_{rt}, \boldsymbol {h}_{t})}{\partial \theta }\) can be calculated analytically for each parameter.
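The ⟨·⟩_{data} minus ⟨·⟩_{model} pattern approximated by CD can be sketched on a plain Gaussian-Bernoulli RBM with unit variance, a deliberate simplification of the SATBM; all names and values here are ours, and only the weight-matrix gradient is shown:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_gradient_W(x, W, b, c):
    """One CD-1 weight gradient for a unit-variance Gaussian-Bernoulli RBM:
    positive phase uses the data, negative phase a one-step reconstruction."""
    # Positive phase: expectations under the data.
    p_h = sigmoid(c + W.T @ x)
    pos = np.outer(x, p_h)
    # Negative phase: sample h, reconstruct x (conditional mean), re-infer h.
    h = (rng.random(p_h.shape) < p_h).astype(float)
    x_recon = W @ h + b
    p_h_recon = sigmoid(c + W.T @ x_recon)
    neg = np.outer(x_recon, p_h_recon)
    return pos - neg                      # ascend the approximate likelihood

D, H = 3, 2
W = rng.normal(scale=0.1, size=(D, H))
grad = cd1_gradient_W(rng.normal(size=D), W, np.zeros(D), np.zeros(H))
print(grad.shape)
```

The SATBM applies the same data-minus-reconstruction recipe to every parameter in **Θ**^{SD} and **Θ**^{SI}.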

## 4 Application to VC

In the training stage, the SI parameters \(\hat{\mathbf{\Theta}}^{SI}\) are estimated using the *R* reference speakers' speech (we discard the speaker-dependent parameters \(\hat {\mathbf {\Theta }}^{SD}\)). In the adaptation stage, new speaker-dependent parameters \(\mathbf {\Theta }^{SD}_{i}=\{\mathbf {A}_{i},\boldsymbol {b}_{i}\}\) and \(\mathbf {\Theta }^{SD}_{o}=\{\mathbf {A}_{o},\boldsymbol {b}_{o}\}\) are estimated using adaptation data of the source and the target speakers \(\{ \hat {\boldsymbol {x}}_{it} \}_{t=1}^{T_{i}}\), \(\{ \hat {\boldsymbol {x}}_{ot} \}_{t=1}^{T_{o}}\) while keeping \(\hat {\mathbf {\Theta }}^{SI}\) fixed. That is,

$$\mathbf{\Theta}^{SD}_{s} = \operatorname*{argmax}_{\mathbf{\Theta}^{SD}_{s}} p\left(\{\hat{\boldsymbol{x}}_{st}\}_{t=1}^{T_{s}} ;\ \hat{\mathbf{\Theta}}^{SI}, \mathbf{\Theta}^{SD}_{s}\right), \quad s \in \{i, o\}.$$

In the conversion stage, to convert the acoustic features of the source speaker \(\boldsymbol{x}_{it}\) into those of the target speaker \(\boldsymbol{x}_{ot}\), we take an ML-based approach, in which \(\boldsymbol{x}_{ot}\) is computed so as to maximize the probability given \(\boldsymbol{x}_{it}\):

$$\hat{\boldsymbol{x}}_{ot} = \operatorname*{argmax}_{\boldsymbol{x}_{ot}} p\left(\boldsymbol{x}_{ot} \mid \hat{\boldsymbol{h}}_{t}\right) = \hat{\mathbf{W}}_{o} \hat{\boldsymbol{h}}_{t} + \hat{\boldsymbol{b}}_{o},$$

where we give \(\hat {\boldsymbol {h}}_{t} \triangleq \mathbb {E} [ p(\boldsymbol {h}_{t} | \boldsymbol {x}_{it}) ]\). It is worth noting that the conversion function is based on a non-linear transformation, since \(\hat{\boldsymbol{h}}_{t}\) is a non-linear (sigmoid or softmax) function of \(\boldsymbol{x}_{it}\).
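A frame-wise conversion sketch in our notation follows; the parameter values are toy stand-ins, and we assume the sigmoid posterior of Eq. (6) for computing \(\hat{\boldsymbol{h}}_t\):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def convert_frame(x_it, W_i, Sigma_i, c_hat_i, W_o, b_hat_o):
    """Frame-wise ML conversion: infer the expected phonological vector from
    the source speaker's model, then emit the mean of the target speaker's
    conditional (the mode of Eq. (4))."""
    P_i = np.linalg.inv(Sigma_i)
    h_hat = sigmoid(c_hat_i + W_i.T @ P_i @ x_it)  # E[h_t | x_it]
    return W_o @ h_hat + b_hat_o                   # nonlinear overall mapping

# Toy usage with hypothetical parameters.
D, H = 2, 3
rng = np.random.default_rng(3)
x_conv = convert_frame(rng.normal(size=D),
                       rng.normal(size=(D, H)), np.eye(D),
                       np.zeros(H), rng.normal(size=(D, H)), np.zeros(D))
print(x_conv.shape)
```

Because each frame is converted independently, this mapping can in principle run in real time, as noted in Section 1.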

## 5 Experimental evaluation

### 5.1 System configuration

In our VC experiments, we evaluated the performance of our model, the SATBM, using the ASJ Continuous Speech Corpus for Research (ASJ-JIPDEC^{4}). In the training stage, where the SI parameters are estimated, we randomly selected and used the speech data of five sentences (approx. 160 k frames) uttered by 56 speakers (26 males and 30 females) from set A in the corpus. For adaptation and evaluation, a male speaker (identified as “ECL0001”) and a female speaker (“ECL1003”) who were not included in the training were used as the source and target speakers, respectively, unless otherwise stated. We also evaluated the proposed SATBM using other speaker pairs, which will be discussed in Section 5.4. The amount of adaptation data was five sentences for each person. As an acoustic feature vector, we used 32-dimensional mel-cepstral features calculated from the 513-dimensional WORLD [30] spectra, without dynamic features. In the training of the system, we used up to 64 softmax hidden units, a learning rate of 0.01, a momentum of 0.9, and a batch size of *R*×100(=5600), and set the number of iterations to 200. For the evaluation of the proposed method, we used parallel data (of 10 sentences different from the training and adaptation data) of the source and the target speakers, which was created using dynamic programming. Note again, however, that all speech data used for the training and the adaptation was *not* parallel.

As an objective criterion, we used the mel-distortion improvement ratio (MDIR):

$$\text{MDIR} = \frac{10\sqrt{2}}{\ln 10} \left( \left\| \boldsymbol{m}_{o} - \boldsymbol{m}_{i} \right\| - \left\| \boldsymbol{m}_{o} - \boldsymbol{m}_{c} \right\| \right)\ [\text{dB}],$$

where \(\boldsymbol{m}_{i}\), \(\boldsymbol{m}_{o}\), and \(\boldsymbol{m}_{c}\) are mel-cepstral features at a frame of the source speaker's speech, the target speaker's speech, and the converted speech, respectively. The higher the value of MDIR, the better the performance of the VC. The MDIR was calculated for each frame from the parallel data of 10 sentences and averaged.
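Assuming the common definition of MDIR as the difference of two mel-cepstral distortions, source-to-target minus converted-to-target (an assumption on our part; the paper's exact constant may differ), the per-frame computation looks like:

```python
import numpy as np

def mcd(m_a, m_b):
    """Mel-cepstral distortion [dB] between two frames (common definition)."""
    return (10.0 * np.sqrt(2.0) / np.log(10.0)) * np.linalg.norm(m_a - m_b)

def mdir(m_i, m_o, m_c):
    """Mel-distortion improvement ratio: how much closer the converted frame
    m_c is to the target m_o than the source m_i was (assumed definition)."""
    return mcd(m_o, m_i) - mcd(m_o, m_c)

# If conversion moves the frame toward the target, MDIR is positive.
m_i, m_o = np.array([1.0, 0.0]), np.array([0.0, 0.0])
m_c = np.array([0.4, 0.0])       # converted frame, closer to the target
print(mdir(m_i, m_o, m_c) > 0)   # True
```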

#### 5.1.1 Methods to be compared

As a non-parallel baseline, we compared the proposed method with a simple linear-transformation approach in which the converted feature \(\boldsymbol{x}_{ot}\) is calculated as

$$\boldsymbol{x}_{ot} = \mathbf{A}_{o} \mathbf{A}_{i}^{-1} \left( \boldsymbol{x}_{it} - \boldsymbol{b}_{i} \right) + \boldsymbol{b}_{o},$$

which is derived from the equation \(\boldsymbol {x}_{t} = \mathbf {A}_{i}^{-1} (\boldsymbol {x}_{it} - \boldsymbol {b}_{i}) = \mathbf {A}_{o}^{-1} (\boldsymbol {x}_{ot} - \boldsymbol {b}_{o})\), starting from Eq. (1). However, this is under the assumption that the *true* feature space of the neutral speaker has been obtained. The parameters \(\mathbf{A}_{r}\) and \(\boldsymbol{b}_{r}\) are estimated in SAT using gradient descent, the same as in our proposed method. Thus, the difference between the linear-transform approach and the proposed model is whether or not the latent phonological features are modeled.
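The baseline mapping can be sketched directly from the displayed relation; the parameters below are toy values of ours, and the round-trip check verifies that converting with identical source and target transforms returns the input:

```python
import numpy as np

def linear_convert(x_it, A_i, b_i, A_o, b_o):
    """Linear-transform baseline: map the source frame into the neutral
    space and back out through the target speaker's transform (Eq. (1))."""
    x_neutral = np.linalg.solve(A_i, x_it - b_i)   # A_i^{-1} (x_it - b_i)
    return A_o @ x_neutral + b_o

# Round-trip check with toy parameters: source transform == target transform
# should yield the identity mapping.
rng = np.random.default_rng(4)
A = np.eye(3) + 0.1 * rng.normal(size=(3, 3))
b = rng.normal(size=3)
x = rng.normal(size=3)
assert np.allclose(linear_convert(x, A, b, A, b), x)
print("ok")
```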

For reference, we also compared our proposed model with a popular GMM-based VC with 8, 16, 32, and 64 mixtures using the parallel data of five sentences.

#### 5.1.2 Optimal number of hidden units

We first investigated the optimal number of hidden units by comparing, for instance, *H*=16 (the optimal number of units) and *H*=64 (too large a number of units). Figures 3 and 4 show examples of the expected values of the hidden units for *H*=16 and *H*=64, respectively, comparing the distributions obtained from the source and target speakers. From Figs. 3 and 4, we observed that more hidden units were speaker-dependent when *H*=64 than when *H*=16. To objectively measure how close the distributions are to each other, we further calculated the Euclidean distances and the cosine similarities between the two distributions obtained from the source and target speakers' speech, as shown in Table 1. Table 1 clearly shows that the two hidden activations for *H*=16 are closer to each other than those for *H*=64. A similar discussion can be found in our previous work using the ARBM [25].
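The two closeness measures used for Table 1 can be computed as follows; the activation vectors below are invented for illustration only:

```python
import numpy as np

def closeness(p_src, p_tgt):
    """Euclidean distance and cosine similarity between the average hidden
    activations of two speakers (the two measures reported in Table 1)."""
    dist = np.linalg.norm(p_src - p_tgt)
    cos = p_src @ p_tgt / (np.linalg.norm(p_src) * np.linalg.norm(p_tgt))
    return dist, cos

# Toy activations: near-identical distributions give a small distance and a
# cosine similarity close to 1, i.e., speaker-independent hidden units.
p_a = np.array([0.70, 0.20, 0.10])
p_b = np.array([0.65, 0.25, 0.10])
d, c = closeness(p_a, p_b)
print(round(d, 3), round(c, 3))
```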

In Japanese speech recognition, a set of 43 phonemes is often used [31], consisting of seven short vowels, five long vowels, 28 consonants, and three special symbols. These Japanese phonemes were defined by the Acoustical Society of Japan (ASJ) committee. Comparing these numbers with the optimal number of hidden units, *H*=16, we consider this value reasonable: the static short-term acoustic features cannot represent the consonants and long vowels sufficiently, and natural speech should contain some allophones.

Considering the above, we will use 16 softmax hidden units in the following experiments unless otherwise stated.

### 5.2 Objective comparison

We also found that the SATBM and the ARBM degraded with nine or more diagonals. This is due to over-fitting caused by the large number of parameters. In the literature [32–34], it is known that warping cepstral-based features between different speakers can be achieved by a linear transformation with an adaptation matrix, and that a few diagonal elements (tridiagonal, pentadiagonal, or heptadiagonal) of the adaptation matrix are sufficient for warping the cepstral features. Therefore, it does not make sense, in terms of efficient learning, to use adaptation matrices with many diagonal elements (more than seven diagonals) for this speaker pair. For further discussion using various speaker pairs, see Section 5.4.

The average MDIR of the GMM-based approach was 3.93 with 32 mixtures, the best among the GMM configurations. Unfortunately, it outperformed our approach. Note again, however, that such an approach benefits from using parallel data and should not be compared with the non-parallel approach purely in terms of VC quality.

### 5.3 Subjective comparison

Subjective comparison of parallel VC (GMM) and non-parallel VC (SATBM) using the 5-scale tests in terms of speech quality and speaker specificity

| Method | Speech quality | Speaker specificity |
|---|---|---|
| GMM | 3.37 | 3.66 |
| SATBM | 2.11 | 1.91 |

### 5.4 Evaluation using various speaker pairs

MDIR and MCD from various speaker pairs

| Source → target | Type | MDIR [dB] | MCD [dB] |
|---|---|---|---|
| ECL0001 → ECL1003 | M2F | 3.12 | 6.87 |
| MIT0001 → ECL1003 | M2F | 4.03 | 7.00 |
| ECL1003 → CAN0001 | F2M | 2.77 | 7.32 |
| NEC1002 → MIT0002 | F2M | 3.48 | 6.98 |
| ECL0001 → MIT0001 | M2M | 1.48 | 6.09 |
| ECL0001 → MIT0002 | M2M | 2.02 | 6.46 |
| CAN1001 → ECL1003 | F2F | 1.28 | 6.85 |
| NEC1001 → ECL1003 | F2F | 1.55 | 6.14 |

## 6 Conclusions

In this paper, we presented a VC method that does not require any parallel data during training or adaptation, based on the idea of dividing a speech signal into phoneme-relevant and speaker-relevant information and replacing only the speaker-relevant information with the desired one. To model this, we assumed that the neutral speaker's acoustic features are normally distributed, with a mean that is affine-transformed from the latent phonological features, which are Bernoulli-distributed. As a result, we showed that the joint probability of the acoustic features and the phonological features forms a type of Boltzmann machine. We also presented a method for estimating the target speaker's features given the source speaker's features in a probabilistic manner. In our VC experiments, our model performed better than the other non-parallel VC approaches in terms of both objective and subjective criteria. However, the proposed approach still fell short of the GMM-based approach that uses parallel data in training. In the future, we will continue to improve the system (hopefully, to the performance level of the GMM-based approach) for non-parallel VC, because non-parallel training has several merits; e.g., we can freely use most of the existing speech data.

## 7 Endnotes

^{1} Note that they still require parallel data among the reference speakers.

^{2} It means that the method requires neither the parallel data of a source speaker and target speaker nor the parallel data of the reference speakers.

^{3} In our experiments, we used mel cepstra as the acoustic feature vector.

## Declarations

### Authors’ contributions

TN designed the speaker-adaptive-trainable Boltzmann machine, performed the experimental evaluation, and drafted the manuscript. YM reviewed the paper and provided some advice. Both authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## Authors’ Affiliations

## References

- A Kain, MW Macon, in ICASSP. Spectral voice conversion for text-to-speech synthesis (IEEE, 1998), pp. 285–288.
- C Veaux, X Robet, in INTERSPEECH. Intonation conversion from neutral to expressive speech (ISCA, 2011), pp. 2765–2768.
- K Nakamura, T Toda, H Saruwatari, K Shikano, Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech. Speech Commun. **54**(1), 134–146 (2012).
- L Deng, A Acero, L Jiang, J Droppo, X Huang, in ICASSP. High-performance robust speech recognition using stereo training data (IEEE, 2001), pp. 301–304.
- A Kunikoshi, Y Qiao, N Minematsu, K Hirose, in INTERSPEECH. Speech generation from hand gestures based on space mapping (ISCA, 2009), pp. 308–311.
- R Gray, Vector quantization. IEEE ASSP Mag. **1**(2), 4–29 (1984).
- H Valbret, E Moulines, J-P Tubach, Voice transformation using PSOLA technique. Speech Commun. **11**(2), 175–187 (1992).
- Y Stylianou, O Cappé, E Moulines, Continuous probabilistic transform for voice conversion. IEEE Trans. Speech Audio Process. **6**(2), 131–142 (1998).
- T Toda, AW Black, K Tokuda, Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans. Audio Speech Lang. Process. **15**(8), 2222–2235 (2007).
- E Helander, T Virtanen, J Nurminen, M Gabbouj, Voice conversion using partial least squares regression. IEEE Trans. Audio Speech Lang. Process. **18**(5), 912–921 (2010).
- D Saito, H Doi, N Minematsu, K Hirose, in INTERSPEECH. Application of matrix variate Gaussian mixture model to statistical voice conversion (ISCA, 2014), pp. 2504–2508.
- R Takashima, T Takiguchi, Y Ariki, in SLT. Exemplar-based voice conversion in noisy environment (IEEE, 2012), pp. 313–317.
- R Takashima, R Aihara, T Takiguchi, Y Ariki, in SSW8. Noise-robust voice conversion based on spectral mapping on sparse space (SynSIG, 2013), pp. 71–75.
- S Desai, EV Raghavendra, B Yegnanarayana, AW Black, K Prahallad, in ICASSP. Voice conversion using artificial neural networks (IEEE, 2009), pp. 3893–3896.
- LH Chen, ZH Ling, Y Song, LR Dai, in INTERSPEECH. Joint spectral distribution modeling using restricted Boltzmann machines for voice conversion (ISCA, 2013), pp. 3052–3056.
- Z Wu, ES Chng, H Li, in ChinaSIP. Conditional restricted Boltzmann machine for voice conversion (IEEE, 2013).
- T Nakashika, R Takashima, T Takiguchi, Y Ariki, in INTERSPEECH. Voice conversion in high-order eigen space using deep belief nets (ISCA, 2013), pp. 369–372.
- T Nakashika, T Takiguchi, Y Ariki, Voice conversion using RNN pre-trained by recurrent temporal restricted Boltzmann machines. IEEE/ACM Trans. Audio Speech Lang. Process. **23**(3), 580–587 (2015).
- A Mouchtaris, J Van der Spiegel, P Mueller, Nonparallel training for voice conversion based on a parameter adaptation approach. IEEE Trans. Audio Speech Lang. Process. **14**(3), 952–963 (2006).
- C-H Lee, C-H Wu, in INTERSPEECH. MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training (ISCA, 2006), pp. 2254–2257.
- T Toda, Y Ohtani, K Shikano, in INTERSPEECH. Eigenvoice conversion based on Gaussian mixture model (ISCA, 2006), pp. 2446–2449.
- Y Ohtani, T Toda, H Saruwatari, K Shikano, in INTERSPEECH. Many-to-many eigenvoice conversion with reference voice (ISCA, 2009), pp. 1623–1626.
- D Saito, K Yamamoto, N Minematsu, K Hirose, in INTERSPEECH. One-to-many voice conversion based on tensor representation of speaker space (ISCA, 2011), pp. 653–656.
- T Masuda, M Shozakai, in ICASSP. Cost reduction of training mapping function based on multistep voice conversion (IEEE, 2007), pp. 693–696.
- T Nakashika, T Takiguchi, Y Ariki, in MLSLP 2015. Parallel-data-free, many-to-many voice conversion using an adaptive restricted Boltzmann machine (ISCA, 2015), pp. 1–4.
- K Cho, A Ilin, T Raiko, in ICANN. Improved learning of Gaussian-Bernoulli restricted Boltzmann machines (ENNS, 2011), pp. 10–17.
- R Salakhutdinov, Learning and evaluating Boltzmann machines. Technical Report UTML TR 2008-002, Department of Computer Science, University of Toronto (2008).
- T Anastasakos, J McDonough, R Schwartz, J Makhoul, in ICSLP 96, vol. 2. A compact model for speaker-adaptive training (WASET, 1996), pp. 1137–1140.
- GE Hinton, S Osindero, YW Teh, A fast learning algorithm for deep belief nets. Neural Comput. **18**(7), 1527–1554 (2006).
- M Morise, in *Proc. the Stockholm Music Acoustics Conference (SMAC)*. An attempt to develop a singing synthesizer by collaborative creation (Logos Verlag, Berlin, 2013), pp. 287–292.
- T Kawahara, A Lee, T Kobayashi, K Takeda, N Minematsu, K Itou, A Ito, M Yamamoto, A Yamada, T Utsuro, K Shikano, Japanese dictation toolkit. J. Acoust. Soc. Japan (E). **20**(3), 233–239 (1999).
- M Pitz, H Ney, Vocal tract normalization equals linear transformation in cepstral space. IEEE Trans. Speech Audio Process. **13**(5), 930–944 (2005).
- E Variani, T Schaaf, in INTERSPEECH. VTLN in the MFCC domain: band-limited versus local interpolation (ISCA, 2011), pp. 1273–1276.
- T Emori, K Shinoda, Vocal tract length normalization using rapid maximum-likelihood estimation for speech recognition. Syst. Comput. Japan. **33**(5), 30–40 (2002).
- B Milner, X Shao, in INTERSPEECH. Speech reconstruction from mel-frequency cepstral coefficients using a source-filter model (ISCA, 2002), pp. 2421–2424.