- Open Access
A novel voice conversion approach using admissible wavelet packet decomposition
EURASIP Journal on Audio, Speech, and Music Processing volume 2013, Article number: 28 (2013)
A voice conversion framework is expected to capture both the static and the dynamic characteristics of the speech signal. Conventional approaches such as Mel frequency cepstrum coefficients and linear predictive coefficients focus on spectral features limited to the lower frequency bands. This paper presents a novel wavelet packet filter bank approach to identify the non-uniformly distributed dynamic characteristics of the speaker. The contribution of this paper is threefold. First, in the feature extraction stage, the dyadic wavelet packet tree structure is optimized to involve less computation while preserving the speaker-specific features. Second, in the feature representation step, the magnitude and phase attributes are treated separately to account for the fact that raw time-frequency traits are highly correlated but still carry intelligible speech information. Finally, an RBF mapping function is established to transform the speaker-specific features from the source to the target speaker. The results obtained by the proposed filter bank-based voice conversion system are compared with the baseline multiscale voice morphing results using subjective and objective measures. The evaluation results reveal that the proposed method outperforms the baseline by incorporating the speaker-specific dynamic characteristics and the phase information of the speech signal.
A voice conversion (VC) system aims to apply various modifications to the source speaker’s voice so that the converted signal sounds like a particular target speaker’s voice[1, 2]. The VC system comprises two phases: training and transformation. The training phase includes feature extraction and uses the extracted features to formulate an appropriate mapping function. Subsequently, the source speaker characteristics are transformed to those of the target speaker using the mapping function developed in the training phase. In order to extract the speaker-specific features, several speech feature representations have been developed in the literature, such as Formant Frequencies (FF)[1, 4], Linear Predictive Coefficients (LPC)[1, 5], Line Spectral Frequency (LSF)[6–8], Mel Frequency Cepstrum Coefficient (MFCC), Mel Generalized Cepstrum (MGC), and spectral lines. The LPC features provide a good approximation model of the vocal tract characteristics, but they neglect a few significant details of the individual speaker, such as the nasal cavity, unvoiced sounds, and other side branches related to non-linguistic information. To enhance the speech quality, the STRAIGHT approach has been proposed. However, it requires enormous computation and is therefore inappropriate for real-time applications. Methods based on the vocal tract model have been developed using MFCC features, which take the nonlinear mechanism of the human auditory system into account. Most of the above approaches provide a good approximation to the source-filter model. However, these methods ignore fine temporal details during the extraction of the formant coefficients and the excitation signal[12, 15]. This gives a muffled effect in the synthesized target speech.
Further improvements in the synthesized speech quality have been reported for various multiscale approaches[16–18]. To our knowledge, the wavelet-based sub-band model proposed by Turk and Arslan was the first to produce promising results. Following the ideas of the sub-band-based approach, a multiscale approach has been proposed for voice morphing. Afterwards, an auditory sub-band-based wavelet neural network architecture has been proposed, which is widely applied in speech classification. However, VC needs to model both the speech and the speaker-specific characteristics in order to develop a transformation model[4, 19]. The features representing speaker identity are distributed non-uniformly over different frequency regions.
This paper presents a wavelet filter structure for extracting the speaker-specific features without relying on any underlying knowledge of the human auditory system. The filter bank is analysed using the admissible wavelet packet transform, as it gives the freedom to decompose both the low- and the high-frequency bands. The contributions of this paper are as follows: (1) the use of an admissible wavelet packet transform-based filter bank to extract the speaker-specific information of the speech signal, (2) the reduction of the computational complexity of the proposed features using the Discrete Cosine Transform (DCT), and (3) the incorporation of the phase of the DCT coefficients, to emphasize that the phase contributes as much as the magnitude to the naturalness of the synthesized speech signal.
A radial basis function (RBF) network is explored to establish the nonlinear mapping rules for modifying the source speaker features to those of the target speaker. The RBF model is used as the mapping model because of its fast training procedure and good generalization properties. Finally, the performance of the proposed filter bank-based VC model is compared with the state-of-the-art multiscale voice morphing using RBF analysis. This is done using various objective measures, such as the performance index ($P_{LSF}$), formant deviation[7, 20], and spectral distortion. The commonly used subjective measures, Mean Opinion Score (MOS) and ABX, are used to verify the quality and similarity of the converted speech signal.
The rest of the paper is structured as follows. The optimal filter bank is explained in Section 2. The new VC system based on the optimal filter bank, along with the state-of-the-art multiscale method, is explained in Section 3. Section 4 briefly reviews the theoretical aspects of the RBF-based transformation model. The database and the performance measures used to compare the quality and similarity of the synthesized speech are described in Section 5. Finally, the conclusions are drawn in Section 6.
2 Optimal filter bank
The voice individuality caused by different articulatory speech organs is distributed non-uniformly in some invariant parts of the vocal tract, such as the nasal cavity, piriform fossa, and laryngeal tube. The information of the glottis is encoded in the low-frequency region from 100 to 400 Hz, and the piriform fossa is positioned in the medium frequency band from 4 to 5 kHz. The information of consonant factor exists in a higher frequency region, i.e., 7 to 8 kHz[12, 14]. The first three formants are encoded in the lower and middle frequency regions from 200 Hz to 3 kHz.
The VC system needs to realize the transformation model considering the speaker-specific features. The traditional auditory filter bank is not suitable to capture the speaker individuality of the speech signal[12, 23]. Therefore, the frequency resolution in different bands is restructured considering the non-uniform distribution of the speaker-specific information in these bands. Additional details about wavelet analysis can be found in[17, 18, 24].
The ARCTIC database is used for the design of the filter bank. The input speech signal, sampled at 16 kHz, is pre-processed in several stages: pre-emphasis, framing, and windowing. Each speech frame, with a bandwidth of 8 kHz, is first decomposed up to four levels by wavelet packet decomposition. This partitions the frequency axis into 16 bands, each 500 Hz wide. The frequency bands that carry speaker-specific features are then decomposed further to obtain a finer resolution than the Mel filter bank[24, 25]. The lower frequency range of 0 to 1 kHz captures the fundamental frequency, which has the maximum energy and the most speaker-specific information. Therefore, the two lowest bands, 0 to 0.5 kHz and 0.5 to 1 kHz, are decomposed up to the seventh level. This splits the band of 0 to 1 kHz into 16 sub-bands of 62.5 Hz each, which is finer than the corresponding bandwidth of the Mel filter bank. In addition, the frequency band of 1 to 3 kHz contains speaker-specific information about the first and second harmonics of the fundamental frequency, although it carries less speaker-specific information than the lower sub-bands. Therefore, the band of 1 to 2 kHz is decomposed up to six levels (eight sub-bands of 125 Hz) and the band of 2 to 3 kHz up to five levels (four sub-bands of 250 Hz). This gives 12 sub-bands with a finer frequency resolution than the Mel sub-bands. The frequency band of 4 to 5 kHz, related to the invariant part of the vocal tract, gives information about the piriform fossa and holds features suitable for speaker identity. However, the resolution of this frequency range is coarser in the Mel filter bank. Therefore, this band is further decomposed up to the fifth level. The frequency bands 3 to 4 kHz and 5 to 8 kHz do not require further decomposition, as these bands already have a finer frequency resolution than the corresponding bands of the Mel filter bank. The decomposition of the significant bands is continued until the substantial energy of the corresponding bands is achieved.
The wavelet basis is selected using the root-mean-square error (RMSE) measure[17, 18]. Experiments are carried out following the above discussion, and the final filter structure, shown in Figure 1, gives the best results. It consists of 40 different sub-bands. The quality and naturalness of the VC system may be improved by capturing speaker-specific features in the high-frequency region.
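The band arithmetic of this section can be checked with a short sketch. It only does the frequency bookkeeping implied by the decomposition levels stated above (a node at level L of an 8-kHz tree is 8000 / 2^L Hz wide); it does not perform any wavelet filtering, and the names `REGIONS` and `sub_bands` are illustrative, not taken from the paper.

```python
from fractions import Fraction

FS_HZ = 16_000            # sampling rate stated in the text
NYQUIST_HZ = FS_HZ // 2   # 8-kHz analysis bandwidth

# (start_hz, stop_hz, decomposition_level) for each region described above.
REGIONS = [
    (0,    1000, 7),  # fundamental-frequency region
    (1000, 2000, 6),
    (2000, 3000, 5),
    (3000, 4000, 4),  # left at the initial four-level split
    (4000, 5000, 5),  # piriform fossa region
    (5000, 8000, 4),  # left at the initial four-level split
]

def sub_bands(regions, nyquist_hz=NYQUIST_HZ):
    """Return one (low_hz, high_hz) tuple per leaf of the packet tree."""
    bands = []
    for start, stop, level in regions:
        width = Fraction(nyquist_hz, 2 ** level)   # bandwidth of a node at this level
        count = Fraction(stop - start) / width
        assert count.denominator == 1, "region must hold a whole number of nodes"
        for i in range(int(count)):
            low = start + i * width
            bands.append((float(low), float(low + width)))
    return bands

bands = sub_bands(REGIONS)
print(len(bands))             # total number of sub-bands
print(bands[0], bands[-1])    # narrowest and last band edges
```

Under these levels the regions contribute 16, 8, 4, 2, 4, and 6 leaves, which agrees with the 40 sub-bands reported for the final structure.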
3 Voice conversion framework
In this section, the design of a new VC algorithm using the proposed filter bank is explained. The state-of-the-art multiscale voice morphing using RBF analysis is also described, as it serves as the baseline against which the performance of the proposed VC system is compared.
3.1 Proposed filter bank-based VC
The proposed VC system, depicted in Figure 2, consists of two phases: training and transformation. During the training phase, the normalized utterances of the source speaker are segmented into frames of 32 ms, each frame consisting of 480 samples. Thereafter, the proposed filter bank is applied to each of these frames. Then, the log and DCT transformations of the filter coefficients are carried out to reduce the computational complexity. The feature vector is formed from the phase along with the magnitude of the DCT coefficients. The same procedure is used to obtain the feature vectors of the target speaker. However, synchronized feature vectors are unlikely to be obtained even if the source and target speakers utter the same sentence. Therefore, the feature vectors of the source speaker are time aligned with those of the target speaker before training the mapping model. The alignment is carried out using the dynamic time warping technique. The aligned magnitude and phase feature vectors of the source and target speakers are used to train separate RBF-based transformation models to establish the conversion rules.
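The time-alignment step can be illustrated with a minimal dynamic time warping sketch. It assumes a Euclidean frame distance and the basic three-step recursion (match, insertion, deletion); the paper does not state which local constraints or distance it uses, so these choices and the function name `dtw_path` are assumptions made for illustration.

```python
import math

def dtw_path(source, target):
    """Align two sequences of feature vectors; return (source_idx, target_idx) pairs."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(source[i - 1], target[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path

# Toy one-dimensional "feature vectors": the target lingers on the first frame.
source = [(0.0,), (1.0,), (2.0,)]
target = [(0.0,), (0.0,), (1.0,), (2.0,)]
path = dtw_path(source, target)
print(path)  # [(0, 0), (0, 1), (1, 2), (2, 3)]
```

Each index pair in the returned path gives one aligned source/target frame pair, which is the form of training data a mapping model of this kind would consume.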
In the transformation stage, the test utterances of the source speaker are pre-processed in the same way as in the training stage to obtain separate feature vectors for the magnitude and phase information of the filtered coefficients. Then, the transformed coefficients are obtained by projecting these coefficients through the separately trained models. Afterwards, the inverse mathematical operations, Inverse Discrete Cosine Transform (IDCT) and antilog, are applied to the transformed coefficients, mirroring the operations performed in the training phase. The time domain speech frames are computed in the inverse filtering stage and combined using the overlap-add technique. Post filtering following the inverse operations ensures good quality of the converted speech signal.
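The overlap-add step that recombines the time domain frames can be sketched as follows. The window type and hop size are not given in the text, so a periodic Hann window with 50% overlap is assumed here because its shifted copies sum to one; the toy frame length of 8 samples is purely illustrative.

```python
import math

def hann(n):
    """Periodic Hann window; copies shifted by n/2 sum to exactly one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]

def overlap_add(frames, hop):
    """Sum equally spaced frames back into a single signal."""
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for idx, frame in enumerate(frames):
        start = idx * hop
        for k, value in enumerate(frame):
            out[start + k] += value
    return out

# Analysis: cut a toy signal into windowed frames with 50% overlap.
frame_len, hop = 8, 4
signal = [float(k % 5) for k in range(32)]
window = hann(frame_len)
frames = [
    [signal[s + k] * window[k] for k in range(frame_len)]
    for s in range(0, len(signal) - frame_len + 1, hop)
]

# Synthesis: overlap-add restores every sample covered by two frames.
rebuilt = overlap_add(frames, hop)
max_err = max(abs(rebuilt[k] - signal[k]) for k in range(hop, len(signal) - hop))
print(len(rebuilt), max_err)
```

In a conversion system the frames would be modified between analysis and synthesis; the sketch leaves them untouched only to show that the recombination itself introduces no distortion away from the signal edges.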
3.2 Baseline multiscale voice morphing using RBF analysis
As discussed earlier, the performance of the proposed algorithm is compared with the state-of-the-art multiscale voice morphing method. The pre-processing operations of this method are similar to those of the proposed voice conversion method. A dyadic wavelet filter bank is applied to the source and target speech frames and partitions each frame into different frequency bands. The wavelet basis with the minimum reconstruction error is chosen: coiflet5 for male-to-female conversion and bi-orthogonal6.8 for female-to-male conversion. It is important to note that the wavelet coefficients of the highest sub-band are set to zero in this filter bank. The networks are trained using frames of normalized wavelet coefficients of the remaining four levels. For each level, the network with the minimum error on the validation data is chosen and the mapping function at that level is established. The transformation phase employs the RBF-based mapping rules developed in the training stage to obtain the morphed features of the target speaker. Then, inverse mathematical operations mirroring those of the training stage are used to reconstruct the target speaker’s speech signal.
4 Radial basis function-based mapping
A radial basis function (RBF)-based transformation model is explored to capture the nonlinear dynamics of the acoustic cues between the source and the desired target speakers. The baseline method performs spectral conversion using an RBF-based transformation model, and a similar approach is used in this paper for transforming the speaker-specific features[17, 28]. The RBF neural network is a special case of a feed-forward network which maps the input space nonlinearly to a hidden space, followed by a linear mapping from the hidden space to the output space. The network represents a map from an $M_0$-dimensional input space to an $N_0$-dimensional output space, written as $F:\mathbb{R}^{M_0}\rightarrow\mathbb{R}^{N_0}$. The training dataset includes the input-output pairs $[x_k, d_k]$, $k = 1, 2, \ldots, M_0$. When the $M_0$-dimensional input $x$ is applied to the RBF model, the mapping function $F$ is computed as
$$F(x)=\sum_{j=1}^{m} w_j\,\Phi(\lVert x-d_j\rVert)$$
where $\lVert\cdot\rVert$ is a norm, usually Euclidean, which computes the distance between the applied input $x$ and the training data point $d_j$. The above equation can also be written in matrix form as
where $\Phi(\lVert x-d_j\rVert)$, $j = 1, 2, \ldots, m$ is the set of $m$ arbitrary functions known as radial basis functions and $\sigma$ is the spread factor of the basis function. The commonly considered form of $\Phi$ is the Gaussian function, defined as
$$\Phi(\lVert x-d_j\rVert)=\exp\left(-\frac{\lVert x-d_j\rVert^{2}}{2\sigma^{2}}\right)$$
The learning process of the radial basis function network includes a training phase and a generalization phase. In the first phase, the network is trained using the input dataset alone: the basis functions are optimized, usually with the k-means algorithm, in an unsupervised manner. In the second phase, the weights of the hidden-to-output layer are optimized in a least-squares sense by minimizing the squared error function
$$E=\frac{1}{2}\sum_{n}\sum_{k}\left\{F_k(x^{n})-(d_k)^{n}\right\}^{2}$$
where $(d_k)^{n}$ is the desired value for the $k$th output unit when the input to the network is $x^{n}$. The weight matrix is determined as
$$W=(\Phi^{T}\Phi)^{-1}\Phi^{T}D$$
where $\Phi$ is a matrix of size $(n \times j)$, $D$ is the target matrix of size $(n \times k)$ with elements $(d_k)^{n}$, $\Phi^{T}$ is the transpose of $\Phi$, and $(\Phi^{T}\Phi)^{-1}\Phi^{T}$ denotes the pseudo-inverse of $\Phi$. The weight matrix $W$ can thus be calculated by linear matrix inversion techniques and used for mapping between the source and target feature vectors. Exact interpolation with RBFs suffers from two serious problems, namely poor performance for noisy data and increased computational complexity. These problems can be addressed by modifying two RBF parameters. The first is the spread factor, calculated as
The optimized spread factor ensures that the individual RBFs are neither too wide nor too narrow. The second is the bias unit. A bias unit is introduced into the linear sum of activations at the output layer to compensate for the difference between the mean of the basis function activations over the data set and the corresponding mean of the targets. Hence, we obtain the RBF network with the mapping function $F_k(x)$ computed as
$$F_k(x)=\sum_{j=1}^{m} w_{kj}\,\Phi(\lVert x-d_j\rVert)+w_{k0}$$
The separate conversion models are used for mapping the magnitude and phase feature vectors of the source speaker to that of the target speaker. The optimum networks obtained through the training are used to predict the transformed parameters of the target speaker from the source speaker.
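The mapping described in this section can be sketched with NumPy as a Gaussian RBF network whose output weights, including a bias unit, are fitted by least squares. This is a minimal illustration under stated assumptions, not the authors' implementation: the centers are simply a subset of the training inputs rather than k-means centroids, the spread `sigma` is an arbitrary illustrative value rather than the paper's spread-factor formula, and the sine/cosine data merely stand in for aligned source and target feature vectors.

```python
import numpy as np

def design_matrix(x, centers, sigma):
    """Gaussian activation for every (input, center) pair, plus a bias column."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])  # column 0 acts as the bias unit

def fit_rbf(x, d, centers, sigma):
    """Least-squares output weights (the pseudo-inverse solution)."""
    weights, *_ = np.linalg.lstsq(design_matrix(x, centers, sigma), d, rcond=None)
    return weights

def predict_rbf(x, centers, sigma, weights):
    """Map input vectors through the trained network."""
    return design_matrix(x, centers, sigma) @ weights

# Toy data standing in for aligned source/target feature vectors.
x_train = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
d_train = np.hstack([np.sin(2 * np.pi * x_train), np.cos(2 * np.pi * x_train)])
centers = x_train[::2]   # hypothetical choice; the text uses k-means instead
sigma = 0.2              # illustrative spread, not the paper's formula

weights = fit_rbf(x_train, d_train, centers, sigma)
pred = predict_rbf(x_train, centers, sigma, weights)
rms_error = float(np.sqrt(np.mean((pred - d_train) ** 2)))
print(weights.shape, rms_error)
```

Using fewer centers than training points is what separates this from exact interpolation; `np.linalg.lstsq` is used instead of forming the inverse of the normal matrix explicitly because it is numerically more robust when neighbouring Gaussians overlap strongly. In the paper's setting, one such model would be trained for the magnitude features and another for the phase features.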
5 Experimental results
The training set includes phonetically balanced English utterances of seven professional narrators. The utterances in this database are sampled at 16 kHz. The corpus includes sentences of JMK (Canadian male), BDL (American male), AWB (Scottish male), RMS (American male), KSP (Indian male), CLB (American female), and SLT (American female).
The utterances of two male speakers, AWB (M1) and BDL (M2), and two female speakers, CLB (F1) and SLT (F2), are employed in the analysis. The transformation models are developed for four different speaker combinations: M1-F1, F2-M2, F1-F2, and M1-M2. A minimum of 40 parallel utterances is required to form a VC model. Our training set includes 50 parallel utterances from each of the speaker pairs, and a separate set of 25 utterances of each source speaker is used to evaluate the system. The VC system is evaluated with objective measures such as the performance index, spectral distortion, and formant deviation. Since the end user of the VC system is human, the objective evaluations are confirmed with subjective evaluations. The subjective evaluations involve rating the system performance in terms of the similarity and quality of the converted and target speech. Usually, ABX and MOS tests are employed to evaluate similarity and quality, respectively. The performance index ($P_{LSF}$) provides a normalized error that can be compared across different speaker pairs. The spectral distortion between the desired and transformed utterances, $D_{LSF}(d(n),\hat{d}(n))$, and the inter-speaker spectral distortion, $D_{LSF}(d(n),s(n))$, are used for computing the $P_{LSF}$ measure. In general, the spectral distortion between signals $u$ and $v$, $D_{LSF}(u,v)$, is defined as
$$D_{LSF}(u,v)=\frac{1}{N}\sum_{i=1}^{N}\sqrt{\frac{1}{P}\sum_{j=1}^{P}\left(LSF_{u}^{i,j}-LSF_{v}^{i,j}\right)^{2}}$$
where $N$ represents the number of frames, $P$ refers to the LSF order, and $LSF_{u}^{i,j}$ is the $j$th LSF component in frame $i$. The performance index is given by
$$P_{LSF}=1-\frac{D_{LSF}(d(n),\hat{d}(n))}{D_{LSF}(d(n),s(n))}$$
A performance index of $P_{LSF} = 1$ indicates that the converted signal is identical to the desired one, whereas $P_{LSF} = 0$ indicates that the converted signal is not at all similar to the desired one.
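A small sketch of the performance index follows. It assumes the commonly used definition in which the LSF distortion is the frame-averaged root-mean-square difference between LSF vectors and the index is one minus the ratio of converted-to-desired distortion over source-to-desired distortion; the toy LSF values and the function names are made up for illustration.

```python
import math

def d_lsf(u, v):
    """Mean over frames of the RMS difference between two LSF vectors."""
    assert len(u) == len(v), "both signals need the same number of frames"
    total = 0.0
    for frame_u, frame_v in zip(u, v):
        order = len(frame_u)
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_u, frame_v)) / order)
    return total / len(u)

def performance_index(desired, converted, source):
    """1.0 for a perfect conversion, 0.0 when no closer to the target than the source."""
    return 1.0 - d_lsf(desired, converted) / d_lsf(desired, source)

# Toy LSF tracks: two frames, LSF order 2 (real systems use higher orders).
desired   = [[0.10, 0.20], [0.30, 0.40]]
source    = [[0.20, 0.30], [0.40, 0.50]]
converted = [[0.15, 0.25], [0.35, 0.45]]  # halfway between source and desired

p_lsf = performance_index(desired, converted, source)
print(round(p_lsf, 3))  # 0.5
```

Because the index is normalized by the source-to-target distortion, speaker pairs whose voices differ by very different amounts can be compared on the same scale.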
Table 1 shows that the performance indices of the proposed VC method and of the baseline method differ for the M1-F1, F2-M2, M1-M2, and F1-F2 pairs. The results indicate that the proposed system performs significantly better than the baseline method.
Other performance measures, namely the formant deviation ($D_k$), the root-mean-square error (RMSE), and the correlation coefficient ($\gamma_{X,Y}$), are also used to analyse our system. The deviation parameter is defined as the percentage variation between the actual ($x_k$) and predicted ($y_k$) formant frequencies derived from the corresponding speech frames. It is used to report the percentage of test frames that lie within a specified deviation ($D_k$) and is calculated as
$$D_k=\frac{|x_k-y_k|}{x_k}\times 100$$
For given transformed and target signals, the root-mean-square error is calculated as a percentage of the average desired formant value obtained from the speech segments. It is computed as
$$\mu_{RMSE}=\frac{\sqrt{\frac{1}{N}\sum_{k}|e_k|^{2}}}{\bar{x}}\times 100$$
The error $e_k = x_k - y_k$ is the difference between the actual and predicted formant values, $N$ is the number of observed formant frequencies of the speech frames, and $\bar{x}$ is the average desired formant value. The parameter $d_k$ is the deviation error. The correlation coefficient $\gamma_{X,Y}$ is computed from the covariance $COV(X,Y)$ between the target ($x$) and the predicted ($y$) formant values and from the standard deviations $\sigma_X$ and $\sigma_Y$ of the target and the predicted formant values, respectively. The parameters $\gamma_{X,Y}$ and $COV(X,Y)$ are calculated as
$$\gamma_{X,Y}=\frac{COV(X,Y)}{\sigma_X\,\sigma_Y},\qquad COV(X,Y)=\frac{1}{N}\sum_{k}(x_k-\bar{x})(y_k-\bar{y})$$
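The formant-based measures can be illustrated with a short sketch. It assumes the usual forms of these quantities (percentage deviation relative to the actual formant, RMSE expressed as a percentage of the mean actual formant, and the Pearson correlation with population statistics); the exact expressions in the paper may differ in detail, and the formant values below are invented for illustration.

```python
import math

def percent_deviation(actual, predicted):
    """Per-frame deviation of the predicted formant, in percent of the actual one."""
    return [abs(x - y) / x * 100.0 for x, y in zip(actual, predicted)]

def frames_within(actual, predicted, limit_percent):
    """Percentage of frames whose deviation does not exceed the given limit."""
    dev = percent_deviation(actual, predicted)
    return 100.0 * sum(d <= limit_percent for d in dev) / len(dev)

def rmse_percent(actual, predicted):
    """RMSE expressed as a percentage of the mean actual formant value."""
    n = len(actual)
    rmse = math.sqrt(sum((x - y) ** 2 for x, y in zip(actual, predicted)) / n)
    return rmse / (sum(actual) / n) * 100.0

def correlation(actual, predicted):
    """Covariance divided by the product of the two standard deviations."""
    n = len(actual)
    mean_x, mean_y = sum(actual) / n, sum(predicted) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in actual) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in predicted) / n)
    return cov / (std_x * std_y)

# Invented formant values in Hz (target vs. predicted) for four frames.
actual = [500.0, 1500.0, 2500.0, 3500.0]
predicted = [510.0, 1450.0, 2600.0, 3500.0]

within = frames_within(actual, predicted, 2.5)
rmse = rmse_percent(actual, predicted)
gamma = correlation(actual, predicted)
print(within, round(rmse, 3), round(gamma, 4))
```

In the toy data, two of the four frames deviate by no more than 2.5%, so `within` is 50.0; a perfect prediction would give an RMSE of zero and a correlation of one.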
One can observe that the RMSE values between the desired and the predicted acoustic space parameters are lower for the proposed model than for the baseline model. However, RMSE alone does not always give reliable information about the spectral distortion. Consequently, scatter plots and spectral distortion are additionally employed as objective measures.
The scatter plots of the baseline method for the first-, second-, third-, and fourth-order formant frequencies of the M1-F1 and F2-M2 speaker pairs are shown in Figures 3 and 4, respectively. A similar analysis for the proposed method is shown in Figures 5 and 6. The clusters obtained using the proposed model are more compact and more diagonally oriented than those of the baseline model. It is observed that the higher predicted formants are oriented more closely toward the desired formants for the proposed filter bank-based method than for the baseline method. The diagonal orientation of the clusters demonstrates the good prediction ability of both methods, as a perfect prediction would place all the data points of the scatter plot on the diagonal. The compact clusters obtained for the proposed method imply its ability to capture the formant structure of the desired speaker.
Figure 7 shows the desired and predicted spectral envelopes for the proposed and the baseline methods. It can be seen in the figure that the desired and predicted spectral envelopes of the proposed method follow the same shape and have peaks and valleys at the same frequencies, confirming the similarity between them. On the other hand, the spectral envelopes of the baseline method have different shapes.
As mentioned earlier, the proposed and the baseline methods are also evaluated using the subjective tests MOS and ABX. The mean opinion score is a quality evaluation test for the synthesized speech, and ABX is a test of the similarity between the converted and the target speech signals. The quality and similarity tests are carried out using 25 synthesized speech utterances obtained from the four different speaker pairs and the corresponding target utterances. In the first part, the listeners are asked to judge the quality of the synthesized speech signal using MOS on a scale of 1 to 5, as shown in Table 4. The MOS results in Table 5 indicate that the conversion quality is higher for the proposed method than for the baseline method.
In the next part of the evaluation, the ABX similarity test (A: source, B: target, X: transformed speech signal) is carried out without considering the speech quality. The listeners are asked to grade the speaker identity on a five-point scale, giving ratings from 1 to 5 to decide whether the output X matches A or B, as shown in Table 4. A higher ABX value suggests that the mapping functions developed with the proposed and the baseline methods can convert the identity of one speaker to the other at an acceptable level. Table 5 shows that the listeners have rated the proposed method higher than the baseline method in terms of both the MOS and the ABX test.
6 Conclusions
In this article, a new feature extraction approach based on the admissible wavelet packet transform has been proposed. Earlier feature extraction methods focused only on the low-frequency bands without considering the features in the high-frequency bands, which are equally important for speaker identity. The proposed method emphasizes the frequency regions of the speech signal that are important for speaker identity. The features obtained from the proposed filter bank are modified using RBF-based conversion models. The performance of the proposed and baseline models has been assessed with different objective and subjective measures. The proposed method gives considerably better results than the baseline method in terms of both the quality and the identity of the speaker. The performance of the proposed system demonstrates the significance of combining the information from the high-frequency bands with that from the low-frequency bands for effective voice conversion.
References
1. Lee K-S: Statistical approach for voice personality transformation. Audio, Speech, Lang. Process., IEEE Trans 2007, 15(2):641-651.
2. Ye H, Young S: High quality voice morphing. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’04), vol. 1. Montreal; 17–21 May 2004:I-9–12.
3. Arslan LM: Speaker transformation algorithm using segmental code books (STASC). Speech Commun 1999, 28: 211-226. 10.1016/S0167-6393(99)00015-1
4. Kuwabara H, Sagisaka Y: Acoustics characteristics of speaker individuality: control and conversion. Speech Commun 1995, 16: 165-173. 10.1016/0167-6393(94)00053-D
5. Abe M, Nakamura S, Shikano K, Kuwabara H: Voice conversion through vector quantization. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP-88). New York, NY; 11–14 April 1988:655-658.
6. Kain A, Macon MW: Design and evaluation of a voice conversion algorithm based on spectral envelope mapping and residual prediction. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’01), vol. 2. Salt Lake City, UT; 7–11 May 2001:813-816.
7. Rao KS: Voice conversion by mapping the speaker-specific features using pitch synchronous approach. Comput. Speech & Lang 2010, 24(3):474-494. 10.1016/j.csl.2009.03.003
8. Turk O, Arslan LM: Robust processing techniques for voice conversion. Comput. Speech & Lang 2006, 20(4):441-467. 10.1016/j.csl.2005.06.001
9. Desai S, Black AW, Yegnanarayana B, Prahallad K: Spectral mapping using artificial neural networks for voice conversion. Audio, Speech, and Lang. Process., IEEE Trans 2010, 18(5):954-964.
10. Helander E, Virtanen T, Nurminen J, Gabbouj M: Voice conversion using partial least squares regression. Audio, Speech, Lang. Process., IEEE Trans 2010, 18(5):912-921.
11. Sundermann D, Hoge H, Bonafonte A, Ney H, Black AW: Residual prediction based on unit selection. In 2005 IEEE Workshop on Automatic Speech Recognition and Understanding. San Juan; 27 Nov 2005:369-374.
12. Lu X, Dang J: An investigation of dependencies between frequency components and speaker characteristics for text-independent speaker identification. Speech Commun 2008, 50(4):312-322. 10.1016/j.specom.2007.10.005
13. Kawahara H, Masuda-Katsuse I, de Cheveigné A: Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: possible role of a repetitive structure in sounds. Speech Commun 1999, 27(3):187-207.
14. Hayakawa S, Itakura F: Text-dependent speaker recognition using the information in the higher frequency band. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-94), vol. 1. Adelaide; 19–22 Apr 1994:I/137-140.
15. Imai S: Cepstral analysis synthesis on the mel frequency scale. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '83). Boston, MA; 14–16 April 1983:93-96.
16. Turk O, Arslan LM: Subband based voice conversion. In International Conference on Spoken Language Processing. Denver, CO; 16–20 Sept 2002:289-292.
17. Orphanidou C, Moroz IM, Roberts SJ: Multiscale voice morphing using radial basis function analysis. In Algorithms for Approximation. Berlin Heidelberg: Springer; 2007:61-69.
18. Guido RC, Sasso Vieira L, Barbon Júnior S, Sanchez FL, Dias Maciel C, Silva Fonseca E, Carlos Pereira J: A neural-wavelet architecture for voice conversion. Neurocomputing 2007, 71(1):174-180.
19. Furui S: Research of individuality features in speech waves and automatic speaker recognition techniques. Speech Commun 1986, 5(2):183-197. 10.1016/0167-6393(86)90007-5
20. Laskar R, Chakrabarty D, Talukdar F, Rao KS, Banerjee K: Comparing ANN and GMM in a voice conversion framework. Appl. Soft Comput 2012, 12(11):3332-3342. 10.1016/j.asoc.2012.05.027
21. Nurminen J, Popa V, Tian J, Tang Y, Kiss I: A parametric approach for voice conversion. In International (TC-STAR) Workshop on Speech-to-Speech Translation, Audio and Visual Communications, Nokia Research Center. Barcelona, Spain; June 2006:225-229.
22. Narendranath M, Murthy HA, Rajendran S, Yegnanarayana B: Transformation of formants for voice conversion using artificial neural networks. Speech Commun 1995, 16(2):207-216. 10.1016/0167-6393(94)00058-I
23. Ormanci E, Nikbay UH, Turk O, Arslan LM: Subjective assessment of frequency bands for perception of speaker identity. In Proceedings of the ICSLP 2002, INTERSPEECH. Denver, CO; 16–20 September 2002:2581-2584.
24. Farooq O, Datta S: Mel filter-like admissible wavelet packet structure for speech recognition. Signal Processing Letters, IEEE 2001, 8(7):196-198.
25. Lung S-Y: Wavelet feature selection based neural networks with application to the text independent speaker identification. Pattern Recognit 2006, 39(8):1518-1521. 10.1016/j.patcog.2006.02.004
26. Alsteris LD, Paliwal KK: Short-time phase spectrum in speech processing: a review and some experimental results. Digit. Signal Process 2007, 17(3):578-616. 10.1016/j.dsp.2006.06.007
27. Watanabe T, Murakami T, Namba M, Hoya T, Ishida Y: Transformation of spectral envelope for voice conversion based on radial basis function networks. In Seventh International Conference on Spoken Language Processing, INTERSPEECH, ISCA. Denver, CO; 16–20 September 2002.
28. Iwahashi N, Sagisaka Y: Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks. Speech Commun 1995, 16(2):139-151. 10.1016/0167-6393(94)00051-B
29. Kominek J, Black AW: The CMU ARCTIC Speech Databases. In SSW5-2004. Pittsburgh, PA; 14–16 June 2004:223-224.
The authors declare that they have no competing interests.
Cite this article
Nirmal, J.H., Zaveri, M.A., Patnaik, S. et al. A novel voice conversion approach using admissible wavelet packet decomposition. J AUDIO SPEECH MUSIC PROC. 2013, 28 (2013). https://doi.org/10.1186/1687-4722-2013-28
- Admissible wavelet packet
- Dynamic time warping
- Radial basis function
- Speaker-specific features
- Wavelet-based filter bank