# Small-parallel exemplar-based voice conversion in noisy environments using affine non-negative matrix factorization

- Ryo Aihara
^{1}Email author, - Takao Fujii
^{1}, - Toru Nakashika
^{2}, - Tetsuya Takiguchi
^{1}and - Yasuo Ariki
^{1}

**2015**:32

https://doi.org/10.1186/s13636-015-0075-4

© Aihara et al. 2015

**Received: **24 February 2015

**Accepted: **12 October 2015

**Published: **25 November 2015

## Abstract

The need to have a large amount of parallel data is a large hurdle for the practical use of voice conversion (VC). This paper presents a novel framework of exemplar-based VC that only requires a small number of parallel exemplars. In our previous work, a VC technique using non-negative matrix factorization (NMF) for noisy environments was proposed. This method requires parallel exemplars (which consist of the source exemplars and target exemplars that have the same texts uttered by the source and target speakers) for dictionary construction. In the framework of conventional Gaussian mixture model (GMM)-based VC, some approaches that do not need parallel exemplars have been proposed. However, in the framework of exemplar-based VC for noisy environments, such a method has never been proposed. In this paper, an adaptation matrix in an NMF framework is introduced to adapt the source dictionary to the target dictionary. This adaptation matrix is estimated using only a small parallel speech corpus. We refer to this method as affine NMF, and the effectiveness of this method has been confirmed by comparing its effectiveness with that of a conventional NMF-based method and a GMM-based method in noisy environments.

## Keywords

## 1 Introduction

Background noise is an unavoidable factor in speech processing. In the task of automatic speech recognition (ASR), one problem is that the recognition performance remarkably decreases under noisy environments, and this creates a significant problem in regard to the development of the practical use of ASR.

Non-negative matrix factorization (NMF) [1] is a popular approach for source separation or speech enhancement [2, 3]. In some approaches for NMF-based source separation, the exemplars, which are called “bases”, are grouped for each source, and the mixed signals are expressed with a sparse representation of these bases. By using only the weights of the atoms related to the target signal, the target signal can be reconstructed. Gemmeke et al. [4] also propose an exemplar-based method for noise-robust speech recognition using NMF.

In our previous work, we proposed an exemplar-based method for noise-robust voice conversion (VC) using NMF. VC is a technique for converting a speaker’s voice individuality while maintaining phonetic information in the utterance. In [5], we evaluated the conventional statistical VC method in a noisy environment and revealed that noise in the input signal is not only output with the converted signal but also tends to degrade the conversion performance itself due to the unexpected mapping of source features. In NMF-based VC, noise exemplars are extracted from before- and after-utterance sections and input noisy signals are decomposed into a linear combination of noise and speaker’s clean exemplars. For this reason, no training processes related to noise signals are required. Only the weights related to the source exemplars are taken, and the target signal is constructed from the target exemplars and the weights. This method showed better performances than the conventional GMM-based method in speaker conversion experiments using noise-added speech data.

Moreover, we assume that our NMF-based VC creates a natural-sounding voice compared to statistical VC. In [6], over-fitting and over-smoothing problems are reported in statistical VC. Because our NMF-based VC is not a statistical but an exemplar-based method, we assume that our approach can avoid the over-fitting problem and create a natural-sounding voice.

In spite of these efforts, VC is not used in practice. One reason for this is that conventional VC needs a large amount of parallel training data between the source and target speakers. In recent years, some statistical approaches that do not require parallel training data have been proposed [7–10]. In this paper, we propose noise-robust VC using a small parallel corpus based on an NMF-based speaker adaptation technique.

In [11], adaptation of speaker-specific bases in NMF for single-channel speech-music separation has been presented. In this framework, speaker-specific bases are adapted to the other speaker using an affine matrix. We call this method affine NMF (A-NMF) and apply it to VC. In VC, the source dictionary is constructed using sufficient source speaker data, and it is adapted using a small amount of parallel data (about ten words only) in order to obtain the target dictionary, where a linear regression transformation matrix (affine matrix) is trained based on NMF.

The contributions of this paper are summarized in two points. First, we have decreased the total amount of parallel training data required for NMF-based VC. Conventional NMF-based VC requires 216-word parallel data for dictionary construction. However, experimental results using our proposed approach, which requires only a small amount of parallel data, demonstrate a conversion quality that is almost the same as that of the conventional NMF-based VC. The second contribution is that there needs to be no concern about differences between parallel dictionaries. The parallel dictionary, as used in the conventional NMF-based VC, has a mismatched alignment, and this mismatch degrades the VC performance. In our proposed method, there is no mismatch because the target dictionary is estimated from the source dictionary using affine NMF. Details of this effect are given in Section 3.3.

The rest of this paper is organized as follows. Section 2 discusses related works, while Section 3 describes the conventional NMF-based VC method. An adaptation technique in an NMF framework is described in Section 4. Section 5 describes the results of the experiments, and the final section presents the conclusions.

## 2 Related works

A Gaussian mixture model (GMM)-based approach is widely used for VC because of its flexibility and good performance [12]. In this approach, the conversion function is interpreted as the expectation value of the target spectral envelope. The conversion parameters are evaluated using a minimum mean-square error (MMSE) on a parallel training set. A number of improvements in this approach have been proposed. Toda et al. [13] introduced dynamic features and the global variance (GV) of the converted spectra over a time sequence. Helander et al. [6] proposed transforms based on partial least squares (PLS) in order to prevent the over-fitting problem associated with standard multivariate regression. However, over-smoothing and over-fitting problems in these GMM-based approaches have been reported [6] because of statistical averages and a large number of parameters. These problems degrade the quality of synthesized speech.

The above statistical VC needs a large parallel corpus between the source and target speakers. In this paper, “parallel” means that the text of the corpus between the source and target speakers is the same. This constraint can be a difficult requirement to meet in practice. In GMM-based VC, there have been approaches that do not require parallel data. Lee et al. [7] used maximum a posteriori (MAP) in order to adapt training data. Mouchtaris et al. [8] proposed non-parallel training for GMM-based VC. Toda et al. [9] proposed eigen-voice GMM (EV-GMM) for many-to-many VC in which the source and target speech are represented by a super vector of the reference speakers. Saito et al. [10] proposed tensor representation for one-to-many GMM VC. However, these approaches do not work well in noisy environments because they are based on a statistical approach.

Our VC approach is exemplar-based, which is different from the conventional GMM-based VC. Exemplar-based VC using NMF has been proposed in [5]. In this framework, parallel training data is stored as source and target dictionaries. Input speech is decomposed into a linear combination of source exemplars from the source dictionary. Selected source exemplars are replaced with target exemplars, and the input speech is converted.

We assume that our approach using NMF has two advantages over conventional statistical VC. The first advantage is noise robustness and the second is the resultant natural-sounding converted voice. The noise robustness of this method was confirmed in [14]. In [15], we proposed multimodal NMF-based VC to enhance the noise robustness of our method. The natural-sounding converted voice in NMF-based VC was confirmed in [16]. Wu et al. [17] applied a spectrum compression factor to NMF-based VC and improved the conversion quality. The NMF-based VC has also been adapted for assistive technology for those with articulation disorders [18].

## 3 NMF-based voice conversion

### 3.1 Sparse representations for voice conversion

*D*,

*L*,

*J*, and

*K*are the numbers of feature dimensions, frames, clean exemplars, and noise exemplars, respectively. In approaches based on sparse representations, the observed signal is represented by a linear combination of a small number of bases. We call the collection of these bases a “dictionary” and the collection of its weights “activities”.

*l*is approximately expressed by a non-negative linear combination of the source dictionary, noise dictionary, and their activities.

**A**

^{ s },

**A**

^{ n }, \(\mathbf {h}_{l}^{s}\), and \(\mathbf {h}_{l}^{n}\) are the source dictionary, noise dictionary, and their activities at frame

*l*. Given the spectrogram, (1) can be written as follows:

where **H**
^{
s
} and **H**
^{
n
} denote the activity matrices of the source and noise dictionaries, respectively.

In order to consider only the shape of the spectrum, **X**, **A**
^{
s
}, and **A**
^{
n
} are first normalized for each frame or exemplar so that the sum of the magnitudes over frequency bins equals unity.

**H**is estimated based on NMF with the sparse constraint that minimizes the following cost function [4]:

**X**and

**A**

**H**. The second term is the sparse constraint with a L1-norm regularization term that causes

**H**to be sparse. The weights of the sparsity constraints can be defined for each exemplar by defining

**λ**

^{ T }=[

*λ*

_{1}…

*λ*

_{ J }…

*λ*

_{ J+K }]. In this study, the weights for source exemplars [

*λ*

_{1}…

*λ*

_{ J }] were set to 0.1, and those for noise exemplars [

*λ*

_{ J+1}…

*λ*

_{ J+K }] were set to 0 because the noise signal is less sparse compared to the speech signal.

**H**minimizing (3) is estimated iteratively applying the following update rule:

### 3.2 Target speech construction

**A**
^{
t
} in Fig. 1 represents a target dictionary that consists of the target speaker’s exemplars. **A**
^{
s
} and **A**
^{
t
} consisted of the same words and are aligned with DTW just as the conventional GMM-based VC is. This method assumes that when the source signal and the target signal (which are the same words but spoken by different speakers) are expressed with sparse representations of the source dictionary and the target dictionary, respectively, the obtained activity matrices are approximately equivalent. For this reason, we assume that when there are parallel dictionaries, the activity of the source features estimated with the source dictionary may be able to be substituted with that of the target features.

**H**, the activity of source signal

**H**

^{ s }is extracted, and by using the activity and the target dictionary, the converted spectral features \(\hat {\mathbf {X}}^{t}\) are constructed.

The converted spectral features are de-normalized so that the sum of the magnitudes over frequency bins equals input spectral features.

### 3.3 Difference between parallel dictionaries

As mentioned in Section 3.2, this method assumes that if the parallel source and target spectra are decomposed into a parallel dictionary and its activities, the activity matrices will be approximately equivalent. In this framework, we assume that each basis in the dictionary represents a phoneme part and the activity matrix represents the phonetic information of the utterance, which is independent of the speaker.

## 4 Exemplar-based voice conversion using a small-parallel corpus

In the framework of the conventional NMF-based VC which is described in Section 3, a large-parallel corpus of source and target speakers is needed for dictionary construction. In this section, we propose target dictionary estimation from a small-parallel corpus only.

**X**

^{ s }and

**X**

^{ t }show a small amount of parallel data between the source and target speakers. In the activity estimation stage, a source spectral exemplar matrix

**X**

^{ s }is decomposed into a linear combination of bases from the source dictionary

**A**

^{ s }. The source dictionary consists of the source speaker’s exemplars. It is constructed the same way the dictionary is constructed when using the conventional NMF-based VC, as explained in Section 3. The indexes and weights of the bases are estimated using (4) as source activity

**H**

^{ s }.

**W**, the target feature vector at the

*l*-th frame is obtained as follows:

where **A**
^{
s
} is the source dictionary and \(\mathbf {h}_{l}^{s}\) is the activity vector of the source signal at the *l*-th frame.

**X**

^{ t }and

**W**

**A**

^{ s }

**H**

^{ s }is used.

**W**, is estimated using

**A**

^{ s },

**H**

^{ s }, and a small amount of the parallel target speech data,

**X**

^{ t }, as follows:

The new parallel target dictionary is given by \(\hat {\mathbf {A}}^{t} = \mathbf {W} \mathbf {A}^{s}\).

**X**is decomposed into the multiplication of dictionary

**A**= [

**A**

^{ s }

**A**

^{ n }] by its activity

**H**=[

**H**

^{ s }

^{ T }

**H**

^{ n }

^{ T }]

^{ T }as follows:

**H**

^{ s }as follows:

In this method, we do not have to consider the difference between the parallel dictionaries in Section 3.3 because the parallel utterances are used as adaptation data, not as a dictionary. The activity matrix estimated from the source dictionary contains both the phoneme information and speaker information of the input utterance, as explained in Section 3.3. In this method, the adaptation matrix is estimated from the fixed source dictionary and source activity matrix, and target speaker information is extracted using the adaptation matrix in this procedure. In other words, the adaptation matrix is independent of the phoneme, and it is the conversion matrix from the source to the target speaker.

## 5 Experiments

### 5.1 Experimental conditions

The new VC technique was evaluated by comparing it with conventional techniques based on GMM [12] and NMF [5] in a speaker conversion task using noisy speech data. Speaker MMY, MAU, MNM, FTK, FYN, and FMS were selected from the ATR Japanese speech database [20], and we conducted male-to-female (MMY →FTK and MAU →FYN), male-to-male (MMY →MAU and MNM →MMY), and female-to-female (FTK →FYN and FMS →FTK) conversions. The sampling rate, frame shift, and window length are 8 kHz, 5 ms, and 25 ms, respectively.

Number of frames in each parallel dictionary

No. of training data | 216 | 50 | 25 | 10 |
---|---|---|---|---|

No. of frames | 61,168 | 13,839 | 6782 | 2826 |

The noisy speech signals were obtained by adding noise signals to clean speech data. We used three types of noise signal (restaurant, station, and exhibition), and these are randomly taken from the non-utterance section of CENSREC-1-C database [21]. The SNRs for each noise was set to 20 and 10 dB. They are added to a test word independently to each other. (A noisy speech signal includes one type of noise signal.) The average number of exemplars in the noise dictionary for each word was 104.

In the objective evaluation, all 50 test words with three types of noisy signal at two different types of SNRs were converted. Therefore, a of total 1800 words (6 pairs × 50 words × 3 noise types × 2 SNRs) were used for subjective evaluation. In subjective evaluation, the half of the test data with restaurant noise at 10 dB were used. Therefore, the total amount of test words was 150 (6 pairs × 25 words) in subjective evaluation.

where **x**
_{
t
} and \(\hat {y}_{t}\) denote the log-scaled F0 of the source speaker and the converted word at frame *t*, respectively. *μ*
^{(x)} and *σ*
^{(x)} denote the mean and standard deviation of the log-scaled F0, as calculated from the source speaker’s training data. *μ*
^{(y)} and *σ*
^{(y)} are the mean and standard deviation of the target speaker data. We made no conversions to the aperiodic components. With STRAIGHT, we used converted spectra, F0, and source aperiodic components for synthesizing the target voice.

### 5.2 Experimental results

where **X**
^{
s
}, **X**
^{
t
}, and \(\hat {\mathbf {X}}^{t}\) denote the source, target, and converted spectrum, respectively.

*p*value test of 0.05.

For the evaluation of speaker individuality, an XAB test was carried out. In the XAB test, each participant listened to the target speech. The participant then listened to the speech converted by the two methods and selected the sample that sounded more similar to the target speech. Figure 11 shows that the NMF-based VC with speaker adaptation obtained a higher score than the conventional NMF-based VC without speaker adaptation. We confirmed this result by a 0.05 *p* value test.

## 6 Conclusions

In this paper, an exemplar-based VC technique using speaker adaptation was presented. This method requires only a small amount of parallel data, where a linear regression transformation matrix is used to adapt a source dictionary to a target dictionary and it is estimated in an NMF framework. In comparison experiments between GMM-based VC, NMF without speaker adaptation, and NMF with speaker adaptation, the NMF-based VC with speaker adaptation showed better performance.

Some problems remain with this method. The proposed method requires higher computation times than the GMM-based method. In [25], we proposed a framework that reduces computational time for NMF-based VC. In future work, we will investigate the optimal number of bases and evaluate performance under other noise conditions. In addition, this method is limited to only one-to-one voice conversion because it requires a small amount of parallel data. Hence, we will research a method for many-to-many VC within this framework and apply this method to other VC applications, such as assistive technology [18] or emotional VC [26].

## Declarations

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## Authors’ Affiliations

## References

- DD Lee, HS Seung, in Proc. Neural. Inf. Process. Syst, 13. Algorithms for non-negative matrix factorization, (2001), pp. 556–562.Google Scholar
- T Virtanen, Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio, Speech and Lang. Process.
**15**(3), 1066–1074 (2007).View ArticleGoogle Scholar - MN Schmidt, RK Olsson, in Proc. INTERSPEECH. Single-channel speech separation using sparse non-negative matrix factorization, (2006), pp. 2614–2617.Google Scholar
- JF Gemmeke, T Viratnen, A Hurmalainen, Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Trans. Audio, Speech and Lang. Process.
**19**(7), 2067–2080 (2011).View ArticleGoogle Scholar - R Takashima, T Takiguchi, Y Ariki, in Proc. SLT. Exemplar-based voice conversion in noisy environment, (2012), pp. 313–317.Google Scholar
- E Helander, T Virtanen, J Nurminen, M Gabbouj, Voice conversion using partial least squares regression. IEEE Trans. On Audio, Speech, Lang. Process.
**18**(5), 912–921 (2010).View ArticleGoogle Scholar - CH Lee, CH Wu, in Proc. INTERSPEECH. MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training, (2006), pp. 2254–2257.Google Scholar
- A Mouchtaris, JV der Spiegel, P Mueller, Nonparallel training for voice conversion based on a parameter adaptation approach. IEEE Trans. Audio, Speech, and Lang. Processing.
**14**(3), 952–963 (2006).View ArticleGoogle Scholar - T Toda, Y Ohtani, K Shikano, in Proc. INTERSPEECH. Eigenvoice conversion based on Gaussian mixture model, (2006), pp. 2446–2449.Google Scholar
- D Saito, K Yamamoto, N Minematsu, K Hirose, in Proc. INTERSPEECH. One-to-many voice conversion based on tensor representation of speaker space, (2011), pp. 653–656.Google Scholar
- EM Grais, H Erdogan, in Proc. INTERSPEECH. Adaptation of speaker-specic bases in non-negative matrix factorization for single channel speech-music separation, (2011), pp. 569–572.Google Scholar
- Y Stylianou, O Cappe, E Moilines, Continuous probabilistic transform for voice conversion. IEEE. Trans. Speech and Audio Processing.
**6**(2), 131–142 (1998).View ArticleGoogle Scholar - T Toda, A Black, K Tokuda, Voice conversion based on maximum likelihood estimation of spectral parameter trajectory. IEEE Trans. Audio, Speech and Lang. Process.
**15**(8), 2222–2235 (2007).View ArticleGoogle Scholar - R Takashima, T Takiguchi, Y Ariki, Exemplar-based voice conversion using sparse representation in noisy environments. IEICE Trans. Fundam. Electron. Commun. Comp. Sci.
**E96-A**(10), 1946–1953 (2013).View ArticleGoogle Scholar - K Masaka, R Aihara, T Takiguchi, Y Ariki, in Proc. INTERSPEECH. Multimodal exemplar-based voice conversion using lip features in noisy environments, (2014), pp. 1159–1163.Google Scholar
- R Aihara, T Nakashika, T Takiguchi, Y Ariki, in Proc. ICASSP. Voice conversion based on non-negative matrix factorization using phoneme-categorized dictionary, (2014), pp. 7894–7898.Google Scholar
- Z Wu, T Virtanen, ES Chng, H Li, Exemplar-based sparse representation with residual compensation for voice conversion. IEEE Trans. Audio, Speech and Lang. Process.
**22**(10), 1506–1521 (2014).View ArticleGoogle Scholar - R Aihara, R Takashima, T Takiguchi, Y Ariki, A preliminary demonstration of exemplar-based voice conversion for articulation disorders using an individuality-preserving dictionary. EURASIP J. Audio, Speech, and Music Process.
**2014**(5) (2014). doi:http://dx.doi.org/10.1186/1687-4722-2014-5. - R Aihara, T Takiguchi, Y Ariki, in Proc. ICASSP. Activity-mapping non-negative matrix factorization for exemplar-based voice conversion, (2015), pp. 4899–4903.Google Scholar
- A Kurematsu, K Takeda, Y Sagisaka, S Katagiri, H Kuwabara, K Shikano, ATR Japanese speech database as a tool of speech recognition and synthesis. Speech Communication.
**9:**, 357–363 (1990).View ArticleGoogle Scholar - N Kitaoka, T Yamada, S Tsuge, C Miyajima, K Yamamoto, T Nishiura, M Nakayama, Y Denda, M Fujimoto, T Takiguchi, S Tamura, S Matsuda, T Ogawa, S Kuroiwa, K Takeda, S Nakamura, CENSREC-1-C: An evaluation framework for voice activity detection under noisy environments. Acoustical Science and Technology.
**30**(5), 363–371 (2009).View ArticleGoogle Scholar - H Kawahara, H Matsui, in Proc. ICASSP, I. Auditory morphing based on an elastic perceptual distance metric in an interference-free time-frequency representation, (2003), pp. 256–259.Google Scholar
- T En-Najjary, O Roec, T Chonavel, in Proc. ICSLP. A voice conversion method based on joint pitch and spectral envelope transformation, (2004), pp. 199–203.Google Scholar
- INTERNATIONAL TELECOMMUNICATION UNION, Methods for objective and subjective assessment of quality. ITU-T Recommendation, 800 (2003).Google Scholar
- R Aihara, R Takashima, T Takiguchi, Y Ariki, Noise-robust voice conversion based on sparse spectral mapping using non-negative matrix factorization. IEICE Trans. Inf. Syst.
**E97-D**(6), 1411–1418 (2014).View ArticleGoogle Scholar - C Veaux, X Robet, in Proc. INTERSPEECH. Intonation conversion from neutral to expressive speech, (2011), pp. 2765–2768.Google Scholar