Multimodal voice conversion based on non-negative matrix factorization

A multimodal voice conversion (VC) method for noisy environments is proposed. In our previous non-negative matrix factorization (NMF)-based VC method, source and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. The input source signal is then decomposed into source exemplars, noise exemplars, and their weights. The converted speech is then constructed from the target exemplars and the weights related to the source exemplars. In this study, we propose multimodal VC that improves the noise robustness of our NMF-based VC method. Furthermore, we introduce a combination weight between audio and visual features and formulate a new cost function to estimate audio-visual exemplars. Using the joint audio-visual features as source features, VC performance is improved compared with that of a previous audio-input exemplar-based VC method. The effectiveness of the proposed method is confirmed by comparing it with a conventional audio-input NMF-based method and a Gaussian mixture model-based method.


Introduction
Background noise is an unavoidable factor in speech processing. In automatic speech recognition (ASR) tasks, one problem is that recognition performance decreases significantly in noisy environments, which impedes the development of practical ASR applications.
The same problem occurs in VC, which modifies nonlinguistic information such as voice characteristics, while maintaining linguistic information in its original state. The noise in the input signal is output with the converted signal and degrades conversion performance because of unexpected mapping of source features. To address this problem, we propose a noise-robust VC method that is based on sparse representations.
In recent years, sparse representation-based approaches have gained interest in a broad range of signal processing techniques. NMF [1], which is based on the idea of sparse representations, is a well-known approach for source separation and speech enhancement [2,3]. In such approaches, the observed signal is represented by a linear combination of a small number of atoms, such as the exemplars and bases of NMF. In some source separation approaches, atoms are grouped for each source, and the mixed signals are expressed with a sparse representation of these atoms. The target signal can then be reconstructed using only the weights of the atoms related to the target signal. Gemmeke et al. [4] proposed an exemplar-based method for noise-robust speech recognition using NMF. In their method, the observed speech is decomposed into speech atoms, noise atoms, and their weights. Then, the weights of the speech atoms are used as phonetic scores (instead of the likelihoods of hidden Markov models) for speech recognition.
*Correspondence: makka@me.cs.scitec.kobe-u.ac.jp. Graduate School of System Informatics, Kobe University, 1-1 Rokkodai, Nada-ku, 657-8501 Kobe, Japan. Full list of author information is available at the end of the article.
Previously, we have discussed a noise-robust VC technique using NMF [5]. In that method, source and target exemplars are extracted from parallel training data, in which the same texts are uttered by the source and target speakers. In addition, the noise exemplars are extracted from the before- and after-utterance sections in an observed signal. Consequently, no training processes related to noise signals are required. The input source signal is expressed with a sparse representation of the source exemplars and (non-sparse) noise exemplars. Only the weights related to the source exemplars are picked up, and the target signal is constructed from the target exemplars and the picked-up weights. This method has demonstrated better performance than the conventional Gaussian mixture model (GMM)-based method [6] in VC experiments using noise-added speech data. However, the performance of the method was not sufficient for practical use.
Audio-visual speech recognition, which uses dynamic visual lip information together with audio information, has been studied as a technique for robust speech recognition in noisy environments. In audio-visual speech recognition, there are three primary integration methods: early integration [7], which connects the audio feature vector with the visual feature vector; late integration [8], which weighs the likelihoods of the results obtained by separate processes for the audio and visual signals; and synthetic integration [9], which calculates the product of the output probabilities in each state. A discrete cosine transform (DCT) is widely used as a visual feature in audio-visual speech recognition. Previously, we have proposed audio-visual speech recognition using a visual feature extracted from an active appearance model [10,11]. The feature contains shape information that expresses lip movement and texture information that expresses intensity changes, such as those caused by the teeth.
In this study, we propose a multimodal VC technique using NMF with a combination weight between audio and visual features. The visual information is extracted from videos that capture the lip movement of utterances. The extracted visual features are connected to the audio features and used as source exemplars. The input noisy audio-visual feature is represented by a linear combination of source and noise exemplars. Then, the source exemplars are replaced with related parallel target exemplars extracted from clean audio features. The effectiveness of the proposed method has been confirmed by comparing it with that of the conventional audio-input NMF-based method and the conventional GMM-based method.
The remainder of this paper is organized as follows. In Section 2, related works are introduced. In Section 3, the proposed method is described. Experimental data are evaluated in Section 4 and conclusions are presented in Section 5.

Related works
VC is a technique for converting specific information in speech while maintaining the other information within the utterance. One of the most popular VC applications is speaker conversion [6], where a source speaker's voice individuality is changed to that of a specified target speaker such that the input utterance sounds as though the specified target speaker has spoken it.
Other studies have examined several tasks that use VC. Emotion conversion is a technique that changes emotional information in input speech while maintaining linguistic information and speaker individuality [12,13]. VC has also been adopted as assistive technology that reconstructs a speaker's individuality in electrolaryngeal speech [14], disordered speech [15] or speech recorded by nonaudible murmur microphones [16]. Recently, VC has been used for ASR and speaker adaptation in text-to-speech (TTS) systems [17].
Statistical approaches to VC are the most widely studied [6,18,19]. Among these approaches, a GMM-based mapping approach [6] is the most common. In this approach, the conversion function is interpreted as the expectation value of the target spectral envelope, and the conversion parameters are evaluated using minimum mean-square error (MMSE) on a parallel training set. A number of improvements to this approach have been proposed. Toda et al. [20] introduced dynamic features and the global variance (GV) of the converted spectra over a time sequence. Helander et al. [21] proposed transforms based on partial least squares (PLS) in order to prevent the over-fitting problem associated with standard multivariate regression. In addition, other approaches that make use of GMM adaptation techniques [22] or eigenvoice GMM (EV-GMM) [23,24] do not require parallel data.
Our VC approach is exemplar-based, which differs from conventional GMM-based VC. Exemplar-based VC using NMF has been proposed previously [5]. We assume that our NMF approach is advantageous in that it results in a more natural-sounding converted voice compared to conventional statistical VC. The natural-sounding converted voice in NMF-based VC has been confirmed [25]. Wu et al. [26] applied a spectrum compression factor to NMF-based VC to improve conversion quality. However, the effectiveness of these approaches was confirmed using clean speech data; thus, their utilization in noisy environments was not considered. Noise in the input signal may degrade conversion performance due to unexpected mapping of source features.
The contributions of this paper are summarized as follows. First, we propose multimodal exemplar-based VC for noisy environments. The effectiveness of conventional VC approaches has been confirmed with clean speech data; however, their utilization in noisy environments was not considered. Noise-robust VC is therefore required for real environments, because noise in the input signal may degrade conversion performance due to unexpected mapping of source features. The main framework for exemplar-based multimodal VC has been proposed previously [27,28]. In this paper, we evaluate our multimodal VC using continuous digit utterances, which have been used in most studies related to audio-visual signal processing. Second, we have conducted a detailed objective evaluation and confirmed the effectiveness of visual data in VC. Note that we analyzed the performance of the proposed method for vowel and consonant parts separately. The evaluation revealed that visual input data improved the conversion quality of the consonant part. Third, we conducted subjective evaluations and confirmed that the proposed multimodal VC reduces noise effectively compared to conventional VC.

Basic approach
In approaches based on sparse representations, the observed signal is represented by a linear combination of a small number of bases:

$$\mathbf{x}_l \approx \sum_{j=1}^{J} \mathbf{w}_j h_{j,l} = \mathbf{W}\mathbf{h}_l \tag{1}$$

Here, $\mathbf{x}_l$ represents the $l$-th frame of the observation, and $\mathbf{w}_j$ and $h_{j,l}$ represent the $j$-th basis and its weight, respectively. $\mathbf{W} = [\mathbf{w}_1 \ldots \mathbf{w}_J]$ and $\mathbf{h}_l = [h_{1,l} \ldots h_{J,l}]^{\mathsf{T}}$ represent the collection of bases and the collection of weights, respectively. When the weight vector $\mathbf{h}_l$ is sparse, the observed signal can be represented by a linear combination of a small number of bases with non-zero weights. In this paper, each basis represents an exemplar of the speech or noise signal, and the collection of exemplars $\mathbf{W}$ and the weight vector $\mathbf{h}_l$ are referred to as the "dictionary" and the "activity", respectively. Figure 1 shows the basic approach of the proposed exemplar-based VC using NMF. Here, $D$, $d$, $L$, and $J$ represent the number of dimensions of the source features, the number of dimensions of the target features, the number of frames of the dictionary, and the number of bases of the dictionary, respectively.
The proposed VC method requires two phonemically parallel dictionaries, where one dictionary (i.e., the source dictionary) is constructed from source features and the other dictionary (i.e., the target dictionary) is constructed from target features. These dictionaries consist of the same words and are aligned with dynamic time warping (DTW); thus, they have the same number of bases.
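The alignment step can be illustrated with a small, dependency-free sketch. It assumes a plain Euclidean frame distance and the basic three-step DTW recursion. The DTW variant actually used is not specified in the text, and the names `dtw_path` and `build_parallel_dictionaries` are hypothetical. Each aligned frame pair contributes one basis to each dictionary, which is why both dictionaries end up with the same number of bases.

```python
import math


def dtw_path(a, b):
    """Return the DTW warping path between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean distance (assumption)
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the last frame pair to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def build_parallel_dictionaries(src_align, tgt_align, src_feats, tgt_feats):
    """Pair source and target frames along the DTW path.

    src_align / tgt_align: features used only for alignment (e.g. MFCCs).
    src_feats / tgt_feats: features stored in the dictionaries.
    Returns two lists of bases with identical length.
    """
    path = dtw_path(src_align, tgt_align)
    W_src = [src_feats[i] for i, _ in path]
    W_tgt = [tgt_feats[j] for _, j in path]
    return W_src, W_tgt
```

Here each dictionary is stored as a list of bases (one feature vector per aligned frame pair); transposing it gives the dimension-by-basis matrix layout used in the text.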
An input source feature matrix $\mathbf{X}^s$ is decomposed into a linear combination of bases from the source dictionary $\mathbf{W}^s$ using NMF. The weights of the bases are estimated as an activity $\mathbf{H}^s$. Therefore, the activity includes the weight information of the input features for each basis. The activity is then multiplied by the target dictionary $\mathbf{W}^t$ to obtain the converted spectral feature matrix $\hat{\mathbf{X}}^t$, which is represented by a linear combination of bases from the target dictionary. The source and target dictionaries are phonemically parallel; therefore, the bases used in the converted features are phonemically the same as those of the source features. Figure 2 shows the process for constructing a parallel dictionary. To construct a parallel dictionary, some pairs of parallel utterances are required, with each pair consisting of the same text. The source dictionary $\mathbf{W}^s$ consists of joint audio-visual features, while the target dictionary $\mathbf{W}^t$ consists of only audio features.
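The decomposition and reconstruction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard multiplicative update for KL-divergence NMF with the dictionary held fixed, omits the sparsity and noise terms, and uses plain Python lists (rows are feature dimensions, columns are frames or bases). The function names are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def estimate_activity(X, W, n_iter=300, eps=1e-12):
    """Estimate a non-negative H with X ~ W H while keeping W fixed.

    Uses the multiplicative update for the KL divergence:
    H <- H .* (W^T (X ./ (W H))) ./ (W^T 1)
    """
    D, L = len(X), len(X[0])
    J = len(W[0])
    Wt = [list(col) for col in zip(*W)]
    col_sums = [sum(row) + eps for row in Wt]  # W^T 1
    H = [[1.0] * L for _ in range(J)]
    for _ in range(n_iter):
        WH = matmul(W, H)
        ratio = [[X[d][l] / (WH[d][l] + eps) for l in range(L)] for d in range(D)]
        num = matmul(Wt, ratio)
        H = [[H[j][l] * num[j][l] / col_sums[j] for l in range(L)] for j in range(J)]
    return H


def convert(X_src, W_src, W_tgt, n_iter=300):
    """Estimate the activity on the source dictionary and reuse it on the target one."""
    H = estimate_activity(X_src, W_src, n_iter)
    return matmul(W_tgt, H), H
```

Because the two dictionaries share one activity matrix, column j of the source dictionary must correspond to column j of the target dictionary, which is what the DTW-aligned parallel construction provides.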

Multimodal dictionary construction
For audio features, a simple magnitude spectrum calculated by short-time Fourier transform (STFT) is extracted from clean parallel utterances. Mel-frequency cepstral coefficients (MFCCs) are calculated from the STRAIGHT spectrum to obtain alignment information in DTW.
For visual features, the DCT of lip motion images of the source speaker's utterance is used. We adopt the 2D-DCT for lip images and perform a zigzag scan to obtain a 1D vector of DCT coefficients. Note that DCT …

The audio feature of the noise dictionary is extracted from the before- and after-utterance sections in the input noisy audio signal. Note that the visual feature of the noise dictionary is extracted in the same way.
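The visual feature extraction can be illustrated with a small sketch: an orthonormal 2D DCT-II followed by a JPEG-style zigzag scan that keeps the first few low-frequency coefficients. The normalization, the zigzag direction, and the function names are assumptions made for illustration. In practice a library routine such as SciPy's `scipy.fft.dctn` would normally replace the hand-written transform.

```python
import math


def dct_1d(v):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(img):
    """Separable 2D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in img]
    cols = [dct_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def zigzag(mat):
    """Scan a matrix along anti-diagonals, alternating direction (JPEG order)."""
    h, w = len(mat), len(mat[0])
    out = []
    for s in range(h + w - 1):
        idx = [(i, s - i) for i in range(h) if 0 <= s - i < w]
        if s % 2 == 0:
            idx.reverse()
        out.extend(mat[i][j] for i, j in idx)
    return out


def lip_feature(img, n_coef=50):
    """Keep the first n_coef zigzag-ordered DCT coefficients of a lip image."""
    return zigzag(dct_2d(img))[:n_coef]
```

DCT coefficients can be negative, whereas NMF requires non-negative inputs; how the coefficients are made non-negative is not covered by this sketch.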

Estimation of activity from noisy source signals using NMF with a combination weight
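In the exemplar-based approach, the spectrum of the noisy source signal at a frame is approximately expressed by a non-negative linear combination of the source dictionary, noise dictionary, and their activities:

$$\mathbf{x} = \mathbf{x}^s + \mathbf{x}^n \approx \mathbf{W}^s\mathbf{h}^{av} + \mathbf{N}\mathbf{h}^n = [\mathbf{W}^s \; \mathbf{N}]\begin{bmatrix}\mathbf{h}^{av} \\ \mathbf{h}^n\end{bmatrix} = \mathbf{W}\mathbf{h} \quad \text{s.t.} \quad \mathbf{h} \geq 0 \tag{2}$$

Here, $\mathbf{x}^s$ and $\mathbf{x}^n$ represent the spectrum of the source signal and the noise, respectively. $\mathbf{W}^s$ and $\mathbf{N}$ represent the source dictionary and the noise dictionary, and $\mathbf{h}^{av}$ and $\mathbf{h}^n$ represent their activities at a frame, respectively. Note that all spectra are normalized for each frame. Figure 3 shows a flow chart describing the proposed method. Here, $\mathbf{X}^{s,A}$, $\mathbf{X}^{s,V}$, $\mathbf{W}^{s,A}$, $\mathbf{W}^{s,V}$, $\mathbf{N}^{A}$, $\mathbf{N}^{V}$, $\mathbf{H}^{av}$, $\mathbf{H}^{n}$, and $\mathbf{W}^{t,A}$ ($D' \times J$) represent the source audio signal, source visual signal, source audio dictionary, source visual dictionary, audio noise dictionary, visual noise dictionary, audio-visual activity, noise activity, and target audio dictionary, respectively.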
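The joint matrix $\mathbf{h}$ is estimated on the basis of NMF with a sparsity constraint by minimizing a cost function [4]. Previously, we used simple NMF without considering the weights of the audio and visual parameters when estimating the activity [27]. However, the relative weight of the two modalities must be adjusted depending on the signal-to-noise ratio (SNR). We therefore introduce audio-visual weights $\alpha$ and $\beta$ and define a new cost function as follows:

$$\alpha\, d(\mathbf{x}^A, \mathbf{W}^A\mathbf{h}) + \beta\, d(\mathbf{x}^V, \mathbf{W}^V\mathbf{h}) + \|\boldsymbol{\lambda} \mathbin{.*} \mathbf{h}\|_1 \quad \text{s.t.} \quad \mathbf{h} \geq 0 \tag{3}$$

where $\mathbf{x}^A$ and $\mathbf{x}^V$ are the audio and visual parts of the input frame, and $\mathbf{W}^A$ and $\mathbf{W}^V$ are the audio and visual parts of the joint source-and-noise dictionary. Here, the first and second terms are the Kullback-Leibler (KL) divergences of the audio data and the visual data, weighted by $\alpha$ and $\beta$, respectively. The third term is the sparsity constraint with the L1-norm regularization term that causes $\mathbf{h}$ to be sparse. The symbols $.*$ and $./$ denote element-wise multiplication and division, respectively. Note that the human voice has sparseness; however, noise signals do not. Therefore, we separate the human voice and the noise signals using the NMF sparseness constraint $\boldsymbol{\lambda}$. We experimentally set the value of $\lambda$ to 0.1 for speech and to 0 for noise [5,29], because the spectrum of the source signal should be expressed with a sparse representation of the source exemplars, whereas the noise spectrum should not be expressed with a sparse representation. The $\mathbf{h}$ minimizing (3) is iteratively estimated by applying the following update rule:

$$\mathbf{h} \leftarrow \mathbf{h} \mathbin{.*} \left(\alpha\,\mathbf{W}^{A\mathsf{T}}\left(\mathbf{x}^A \mathbin{./} (\mathbf{W}^A\mathbf{h})\right) + \beta\,\mathbf{W}^{V\mathsf{T}}\left(\mathbf{x}^V \mathbin{./} (\mathbf{W}^V\mathbf{h})\right)\right) \mathbin{./} \left(\alpha\,\mathbf{W}^{A\mathsf{T}}\mathbf{1}^{D \times 1} + \beta\,\mathbf{W}^{V\mathsf{T}}\mathbf{1}^{E \times 1} + \boldsymbol{\lambda}\right) \tag{4}$$

Here, $D$ and $E$ represent the dimensions of the audio and visual dictionaries, respectively.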
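A per-frame sketch of the weighted activity estimation is given below. It assumes the standard multiplicative update for a sum of weighted generalized KL divergences plus an L1 penalty; the function names and the list-based layout (rows of each dictionary are feature dimensions, columns are bases) are illustrative choices rather than the authors' code. `lam` holds one sparsity value per basis, for example 0.1 for speech bases and 0 for noise bases.

```python
import math


def reconstruct(W, h):
    """Compute W h for a dictionary W (list of rows) and an activity vector h."""
    return [sum(w * hj for w, hj in zip(row, h)) for row in W]


def kl_div(x, y, eps=1e-12):
    """Generalized KL divergence between non-negative vectors x and y."""
    return sum(a * math.log((a + eps) / (b + eps)) - a + b for a, b in zip(x, y))


def cost(x_a, x_v, W_a, W_v, h, lam, alpha, beta):
    """Weighted audio KL term + weighted visual KL term + L1 sparsity term."""
    return (alpha * kl_div(x_a, reconstruct(W_a, h))
            + beta * kl_div(x_v, reconstruct(W_v, h))
            + sum(l * hj for l, hj in zip(lam, h)))


def estimate_av_activity(x_a, x_v, W_a, W_v, lam, alpha=1.0, beta=5.0,
                         n_iter=300, eps=1e-12):
    """Multiplicative updates for the joint audio-visual activity of one frame."""
    J = len(lam)
    h = [1.0] * J
    # The denominator does not depend on h, so it is computed once.
    den = [alpha * sum(row[j] for row in W_a)
           + beta * sum(row[j] for row in W_v) + lam[j] + eps
           for j in range(J)]
    for _ in range(n_iter):
        r_a = [x / (y + eps) for x, y in zip(x_a, reconstruct(W_a, h))]
        r_v = [x / (y + eps) for x, y in zip(x_v, reconstruct(W_v, h))]
        num = [alpha * sum(row[j] * r for row, r in zip(W_a, r_a))
               + beta * sum(row[j] * r for row, r in zip(W_v, r_v))
               for j in range(J)]
        h = [hj * n / d for hj, n, d in zip(h, num, den)]
    return h
```

Raising `beta` relative to `alpha` makes the visual reconstruction error dominate the fit, which is the lever the combination weight provides when the audio is heavily corrupted.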

Target speech construction
From the estimated joint matrix $\mathbf{h}$, the activity of the source signal $\mathbf{h}^{av}$ is extracted, and using this activity and the target dictionary, the converted spectral features are constructed:

$$\hat{\mathbf{x}}^t = \mathbf{W}^{t,A}\mathbf{h}^{av} \tag{5}$$

The input source and converted spectral features are expressed as a STRAIGHT spectrum. Thus, the target speech is synthesized using a STRAIGHT synthesizer. The other features extracted by STRAIGHT analysis, such as F0 and the aperiodic components, are used to synthesize the converted signal without conversion.

Experimental conditions
The proposed multimodal VC technique was evaluated by comparing it with an exemplar-based audio-input method [5] and a conventional GMM-based method [6] in a speaker-conversion task using clean speech data and noise-added speech data.
The source speaker was a Japanese male, and the target speaker was a Japanese female. The target female audio data was taken from the CENSREC-1-AV [30] database. We recorded the source male audio-visual data with the same text as the target female utterances. Table 1 shows the content of the audio data taken from the CENSREC-1-AV database. We used a video camera (HDR-CX590, SONY) and a pin microphone (ECM-66B, SONY) for recording. We recorded audio and visual data simultaneously in a dark anechoic room. The camera was positioned 65 cm from the speaker and 130 cm from the floor. Figures 4 and 5 show the recorded audio waves and lip images, respectively. We labelled the recorded audio data manually and used the labelled data in a subsequent experiment. The sampling rate of the audio data in each database was 8 kHz, and the frame shift was 5 ms.
A total of 40 utterances of clean continuous digit speech were used to construct parallel dictionaries in the NMF-based methods. These utterances were also used to train the GMM in the GMM-based method. Ten randomly selected utterances of clean and noisy continuous digit speech were used in the evaluation. Table 1 shows the content of the database, dictionary, and test data. We used the noise data from the CENSREC-C-1 [31] database. The noisy speech was created by adding white noise or a noise signal recorded in a car, airport, restaurant, or subway to the clean speech data. The SNRs were 0, 10, and 20 dB. The noise dictionary was extracted from the before- and after-utterance sections in the evaluation sentence. The average number of noisy frames was 223.
In the NMF-based methods, a 257-dimensional magnitude spectrum was used for the source and noise dictionaries and a 512-dimensional STRAIGHT spectrum was used for the target dictionary. STRAIGHT analysis could not express the noise spectrum well. Therefore, to express the noisy source speech with a sparse representation of the source and noise dictionaries, a simple magnitude spectrum was used to construct the source and noise dictionaries. The DFT length was 40 ms. The number of iterations used to estimate the activity was 300. In the GMM-based method, the 1st through 24th MFCCs obtained from the STRAIGHT spectrum were used as feature vectors. The number of mixtures was 32.
The frame rate of the visual data was 30 fps. The lip image size was 130 × 80 pixels. For visual features, 50 dimensions of the DCT coefficients of the lip motion images of the source speaker's utterance were used. We introduced segment features for the DCT coefficients that consist of consecutive frames (the current frame, two frames before, and two frames after). Therefore, the total dimension of the visual feature was 250. For the weights of the audio-visual feature, α was fixed at 1, and β was varied from 1 to 10.
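How a noise recording is scaled to reach a given SNR is not detailed in the text. The sketch below shows one common way to do it, assuming that the SNR is defined from the average power over the whole utterance; `mix_at_snr` is a hypothetical helper name.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mixture has the requested SNR in dB.

    Assumes both inputs are lists of samples at the same sampling rate and
    that the noise is at least as long as the speech.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # Scale the noise so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling `mix_at_snr(clean, noise, 0)`, `mix_at_snr(clean, noise, 10)`, and `mix_at_snr(clean, noise, 20)` would produce the three SNR conditions used in the evaluation under this assumption.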
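The segment feature can be sketched as a simple context-stacking operation: each 50-dimensional frame is concatenated with its two preceding and two following frames, giving 5 × 50 = 250 dimensions. How the first and last frames are padded is not stated in the text; repeating the edge frame is an assumption here, and `stack_context` is a hypothetical name.

```python
def stack_context(frames, context=2):
    """Concatenate each frame with `context` frames before and after it.

    frames: list of equal-length feature vectors, one per time step.
    Edge frames are padded by repeating the first or last frame (assumption).
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        segment = []
        for k in range(t - context, t + context + 1):
            segment.extend(frames[min(max(k, 0), n - 1)])
        stacked.append(segment)
    return stacked
```

With 50 DCT coefficients per frame and `context=2`, every output vector has 250 entries, matching the visual feature dimension reported above.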

Results and discussion
The objective evaluation used the mel-cepstral distortion (MCD) between the target and converted mel-cepstra:

$$\mathrm{MCD}\,[\mathrm{dB}] = \frac{10}{\ln 10}\sqrt{2\sum_{d}\left(mc_d^{t} - \hat{mc}_d^{t}\right)^2} \tag{6}$$

Here, $mc_d^{t}$ and $\hat{mc}_d^{t}$ represent the $d$-th coefficient of the target and the converted mel-cepstra, respectively. We calculated the mel-cepstra from the converted STRAIGHT spectrum. Figure 6 shows that the distortion of the proposed method is lower than that of the conventional NMF-based and GMM-based methods in noisy environments. The proposed method obtained lower distortion than the non-weighted method (β = 1) by selecting an optimal image feature weight. These results indicate the effectiveness of the proposed method.
For an SNR of 0 dB, the best performance of the proposed method was 2.976 dB (β = 5), which is 0.338 dB lower than that of audio-input NMF. In addition, for an SNR of 20 dB, the best performance was 2.674 dB (β = 7), which is 0.271 dB lower than that of audio-input NMF. Therefore, the performance difference between the conventional NMF method and the proposed method was greater in a low-SNR environment.
In a clean environment, the proposed method did not show a significant difference when compared with audio-input NMF.
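For reference, the distortion measure can be computed as in the following sketch. It assumes the commonly used definition of mel-cepstral distortion, (10 / ln 10) · sqrt(2 · Σ_d (mc_d^t − m̂c_d^t)²), averaged over time-aligned frames; the function names are hypothetical, and the choice of which coefficients to include is left to the caller.

```python
import math


def mel_cepstral_distortion(mc_target, mc_converted):
    """MCD in dB between one target frame and one converted frame."""
    sq = sum((t - c) ** 2 for t, c in zip(mc_target, mc_converted))
    return 10.0 / math.log(10.0) * math.sqrt(2.0 * sq)


def mean_mcd(target_frames, converted_frames):
    """Average the frame-wise MCD over two time-aligned sequences."""
    if len(target_frames) != len(converted_frames):
        raise ValueError("sequences must be time-aligned and of equal length")
    total = sum(mel_cepstral_distortion(t, c)
                for t, c in zip(target_frames, converted_frames))
    return total / len(target_frames)
```

Lower values indicate that the converted mel-cepstra are closer to those of the target speaker.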
We also investigated the effectiveness of the proposed method in various noisy environments. Table 2 shows the MCD in each noisy environment, where the SNR was 10 dB and the image feature weight β was set to 5. Table 2 shows that the MCD of the proposed method is less than that of audio-input NMF in each noisy environment. Thus, we confirm the effectiveness of the proposed method in various noisy environments.
In addition, we calculated the improvement ratio for vowels and consonants using the labelled data. We used the mel-cepstral distortion ratio (MCDR) to evaluate the improvement ratio. These values are shown in Table 3. The MCDR is defined in terms of the mel-cepstra of the source, target, and converted signals, where $mc_d^{s}$ represents the $d$-th coefficient of the source signal. The SNR was 10 dB and the image feature weight β was 5 in this experiment. Table 3 shows that the proposed method outperforms audio-input NMF for both vowels and consonants. In audio-input conversion, the improvement ratio of vowels is better than that of consonants; however, in audio-visual conversion, the improvement ratio of consonants is better because image features compensate for the conversion of the consonant part, which is degraded by the noisy signal in audio-input conversion.
We also performed a mean opinion score (MOS) test [32] on the similarity and noise suppression of the converted speech. The opinion score was set to a 5-point scale (5: excellent, 4: good, 3: fair, 2: poor, 1: bad). The tests were performed with 11 subjects. For the evaluation of similarity, each subject listened to the converted speech and evaluated how similar the sample was to the target speech. For the evaluation of noise suppression, each subject listened to the converted speech and evaluated the degree of noise suppression in the sample. Figure 7 shows the results of the MOS test. The error bars show 95% confidence intervals. As can be seen, the performance of the GMM-based method degraded considerably. This may be because the noise caused unexpected mapping in the GMM-based method. Conversely, the performance degradations of the VC methods based on our proposed multimodal NMF and on audio NMF were smaller than that of the GMM-based method. In addition, in the noise suppression test, the proposed method obtained a higher score than the other two methods. This result demonstrates the noise robustness of the proposed multimodal VC method.
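The text does not state how the 95% confidence intervals were computed. A minimal sketch, assuming the interval is the sample mean plus or minus a critical value times the standard error, is shown below; the default critical value of 1.96 is the normal approximation, and `mos_with_ci` is a hypothetical helper.

```python
import math
import statistics


def mos_with_ci(scores, critical=1.96):
    """Return the mean opinion score and a confidence interval around it.

    scores: list of ratings on the 5-point scale.
    critical: critical value for the interval (1.96 = normal approximation
    for 95 %; a Student-t value can be passed in for small samples).
    """
    mean = statistics.mean(scores)
    half_width = critical * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, (mean - half_width, mean + half_width)
```

With only 11 listeners, passing the Student-t critical value for 10 degrees of freedom (about 2.228) would give a somewhat wider and more conservative interval.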

Conclusions
We have proposed multimodal VC using NMF based on the idea of sparse representation and introduced a weight between audio and visual features. In the proposed method, the joint audio-visual feature is used as the source feature. Noisy audio-visual features are then decomposed into a linear combination of the clean audio-visual feature and the noise feature. By replacing the source speaker's audio-visual feature with the target speaker's audio feature, the voice individuality of the source speaker is converted to that of the target speaker. Furthermore, we have introduced audio-visual weights and formulated a new cost function. By selecting an optimal weight for the image feature, we achieve good conversion accuracy. Our experimental results demonstrate the superior effectiveness of the proposed VC technique compared with conventional audio-input NMF-based and GMM-based VC. In addition, we have shown that the proposed method is effective in several noisy environments.