A novel voice conversion approach using admissible wavelet packet decomposition
© Nirmal et al.; licensee Springer. 2013
Received: 26 June 2013
Accepted: 18 November 2013
Published: 10 December 2013
A voice conversion framework is expected to capture both the static and the dynamic characteristics of the speech signal. Conventional features such as Mel frequency cepstrum coefficients and linear predictive coefficients focus on spectral features limited to the lower frequency bands. This paper presents a novel wavelet packet filter bank approach to identify the non-uniformly distributed dynamic characteristics of the speaker. The contribution of this paper is threefold. First, in the feature extraction stage, the dyadic wavelet packet tree structure is optimized to require less computation while preserving the speaker-specific features. Second, in the feature representation step, magnitude and phase attributes are treated separately, because raw time-frequency traits are highly correlated yet carry intelligible speech information. Finally, a radial basis function (RBF) mapping function is established to transform the speaker-specific features from the source to the target speaker. The results obtained by the proposed filter bank-based voice conversion system are compared with the baseline multiscale voice morphing results using subjective and objective measures. Evaluation results reveal that the proposed method outperforms the baseline by incorporating the speaker-specific dynamic characteristics and phase information of the speech signal.
Keywords: Admissible wavelet packet; Dynamic time warping; Radial basis function; Speaker-specific features; Wavelet-based filter bank
1 Introduction
A voice conversion (VC) system modifies the source speaker’s voice so that the converted signal sounds like a particular target speaker’s voice [1, 2]. A VC system comprises two phases: training and transformation. The training phase extracts features and uses them to formulate an appropriate mapping function. In the transformation phase, the source speaker characteristics are transformed to those of the target speaker using the mapping function developed in the training phase. To extract the speaker-specific features, several speech feature representations have been developed in the literature, such as Formant Frequencies (FF) [1, 4], Linear Predictive Coefficients (LPC) [1, 5], Line Spectral Frequency (LSF) [6–8], Mel Frequency Cepstrum Coefficient (MFCC), Mel Generated Cepstrum (MGC), and spectral lines. The LPC features provide a good approximation model for the vocal tract characteristics, but they neglect a few significant details of the individual speaker, such as the nasal cavity, unvoiced sounds, and other side branches related to non-linguistic information. To enhance the speech quality, the STRAIGHT approach has been proposed. However, it requires heavy computation and is therefore unsuitable for real-time applications. Methods based on the vocal tract model have been developed using MFCC features, which account for the nonlinear mechanism of the human auditory system. Most of the above approaches provide a good approximation to the source-filter model. However, these methods ignore fine temporal details during the extraction of formant coefficients and the excitation signal [12, 15]. This gives a muffled effect in the synthesized target speech.
Further improvements in the synthesized speech quality have been reported in various multiscale approaches [16–18]. To our knowledge, the wavelet-based sub-band model proposed by Turk and Arslan was the first to produce promising results. Following the ideas of the sub-band-based approach, a multiscale approach has been proposed for voice morphing. Afterwards, an auditory sub-band-based wavelet neural network architecture has been proposed, which is widely applied to speech classification. However, VC needs to model both the speech and the speaker-specific characteristics of the signal in order to develop the transformation model [4, 19]. The features representing speaker identity are distributed non-uniformly across different frequency regions.
This paper presents a wavelet filter structure for extracting the speaker-specific features without relying on any underlying knowledge of the human auditory system. The filter bank is analysed using the admissible wavelet packet transform, as it gives the freedom to decompose both the low- and high-frequency bands. The contributions of this paper are as follows: (1) the use of an admissible wavelet packet transform-based filter bank to extract the speaker-specific information of the speech signal, (2) the reduction of the computational complexity of the proposed features using the Discrete Cosine Transform (DCT), and (3) the incorporation of the phase of the DCT coefficients, to emphasize that phase contributes to the naturalness of the synthesized speech signal as much as the magnitude does.
A Radial Basis Function (RBF) network is explored to establish the nonlinear mapping rules for modifying the source speaker features to those of the target speaker. The RBF model is used as the mapping model because of its fast training procedure and good generalization properties. Finally, the performance of the proposed filter bank-based VC model is compared with the state-of-the-art multiscale voice morphing using RBF analysis. This is done using various objective measures, such as the performance index (P_LSF), formant deviation [7, 20], and spectral distortion. The commonly used subjective measures, Mean Opinion Score (MOS) and ABX, are used to verify the quality and similarity of the converted speech signal.
The rest of the paper is structured as follows: The optimal filter bank is explained in Section 2. The new VC system based on the optimal filter bank, along with the state-of-the-art multiscale method, is explained in Section 3. Thereafter, Section 4 outlines the theoretical aspects of the RBF-based transformation model. The database and the performance measures for comparing the quality and similarity of the synthesized speech are presented in Section 5. Finally, the conclusions are drawn in Section 6.
2 Optimal filter bank
The voice individuality caused by different articulatory speech organs is distributed non-uniformly in some invariant parts of the vocal tract, such as the nasal cavity, piriform fossa, and laryngeal tube. The information of the glottis is encoded in the low-frequency region from 100 to 400 Hz, and that of the piriform fossa in the medium frequency band from 4 to 5 kHz. The information about consonants lies in a higher frequency region, i.e., 7 to 8 kHz [12, 14]. The first three formants are encoded in the lower and middle frequency regions from 200 Hz to 3 kHz.
The VC system needs to realize the transformation model considering the speaker-specific features. The traditional auditory filter bank is not suitable for capturing the speaker individuality of the speech signal [12, 23]. Therefore, the frequency resolution in different bands is restructured considering the non-uniform distribution of the speaker-specific information in these bands. Additional details about wavelet analysis can be found in [17, 18, 24].
The ARCTIC database is used for the design of the filter bank. The input speech signal, sampled at 16 kHz, is pre-processed in several stages: pre-emphasis, framing, and windowing. The speech frame, with a bandwidth of 8 kHz, is first decomposed up to four levels by wavelet packet decomposition. This partitions the frequency axis into 16 bands, each 500 Hz wide. The frequency bands that carry speaker-specific features are then decomposed further to obtain a finer resolution than the Mel filter bank [24, 25]. The lower frequency range of 0 to 1 kHz captures the fundamental frequency, which has maximum energy and most of the speaker-specific information. Therefore, the lower two bands, 0 to 0.5 kHz and 0.5 to 1 kHz, are decomposed up to the seventh level. This splits the band of 0 to 1 kHz into 16 sub-bands of 62.5 Hz each, which is finer than the corresponding bandwidth of the Mel filter bank. In addition, the frequency band of 1 to 3 kHz contains speaker-specific information about the first and second harmonics of the fundamental frequency. This frequency band carries less speaker-specific information than the lower sub-bands. Therefore, the band of 1 to 2 kHz is decomposed up to six levels and the band of 2 to 3 kHz up to five levels. This gives 12 sub-bands with finer frequency resolution than the Mel sub-bands. The frequency band of 4 to 5 kHz, related to the invariant part of the vocal tract, gives information about the piriform fossa. It holds features suitable for speaker identity. However, the resolution of this frequency range is coarser in the Mel filter bank. Therefore, this band is further decomposed up to the fifth level. The frequency bands of 3 to 4 kHz and 5 to 8 kHz do not require further decomposition, as these bands already have a finer frequency resolution than the corresponding bands of the Mel filter bank. The decomposition of the significant bands is continued until the substantial energy of the corresponding bands is captured.
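The band layout described above follows from simple arithmetic: a wavelet packet node at depth j of an 8-kHz-wide analysis covers 8000 / 2^j Hz. The following minimal sketch tabulates the terminal sub-bands implied by the stated decomposition levels. The names `REGIONS` and `band_layout` are illustrative and do not come from the paper, and the sketch only computes band edges; it performs no filtering.

```python
# Sketch of the admissible wavelet packet band layout described in the text.
FS = 16000            # sampling rate in Hz (from the text)
NYQUIST = FS / 2      # 8 kHz analysis bandwidth

# (start_hz, end_hz, decomposition_level) for each region named in the text.
REGIONS = [
    (0, 1000, 7),     # fundamental frequency region
    (1000, 2000, 6),
    (2000, 3000, 5),
    (3000, 4000, 4),  # left at the initial four-level split
    (4000, 5000, 5),  # piriform fossa region
    (5000, 8000, 4),  # left at the initial four-level split
]


def band_layout(regions=REGIONS, nyquist=NYQUIST):
    """Return (low_hz, high_hz) edges for every terminal sub-band."""
    bands = []
    for start, end, level in regions:
        width = nyquist / 2 ** level          # bandwidth of a node at this depth
        count = int(round((end - start) / width))
        for k in range(count):
            bands.append((start + k * width, start + (k + 1) * width))
    return bands


bands = band_layout()
for low, high in bands[:3]:
    print(f"{low:7.1f} Hz to {high:7.1f} Hz")
print("terminal sub-bands:", len(bands))
```

Under these levels the layout has 16 + 8 + 4 + 2 + 4 + 6 = 40 terminal sub-bands that tile 0 to 8 kHz without gaps.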
3 Voice conversion framework
In this section, the design of a new VC algorithm using the proposed filter bank is explained. To compare the performance of the proposed VC system with the state of the art, multiscale voice morphing using RBF analysis is considered as the baseline.
3.1 Proposed filter bank-based VC
In the transformation stage, the test utterances of the source speaker are pre-processed in the same way as in the training stage to obtain separate feature vectors for the magnitude and phase information of the filtered coefficients. The transformed coefficients are then obtained by projecting these features through the separately trained models. Afterwards, inverse mathematical operations, namely the Inverse Discrete Cosine Transform (IDCT) and the antilog, are applied to the transformed coefficients, mirroring the operations performed in the training phase. The time domain speech frames are computed in the inverse filtering stage and combined using the overlap-add technique. Post filtering after the inverse operations helps to ensure good quality of the converted speech signal.
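The forward and inverse operations named above can be illustrated with a small round trip. This is a sketch under stated assumptions, not the authors' implementation: it assumes the DCT is applied to a vector of real filter bank coefficients, that the magnitude feature is the logarithm of the absolute DCT value, and that the "phase" of a real DCT coefficient is its sign (0 or π). The helper names and the floor `EPS` are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix; its transpose is the inverse transform."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    mat[0, :] /= np.sqrt(2.0)
    return mat


EPS = 1e-8  # hypothetical floor that keeps the logarithm finite


def split_features(coeffs):
    """Forward path: DCT, then separate log-magnitude and sign ("phase")."""
    d = dct_matrix(len(coeffs)) @ coeffs
    log_mag = np.log(np.abs(d) + EPS)
    sign = np.where(d < 0, -1.0, 1.0)   # phase 0 or pi for real coefficients
    return log_mag, sign


def merge_features(log_mag, sign):
    """Inverse path: antilog, restore the sign, then IDCT."""
    d = sign * (np.exp(log_mag) - EPS)
    return dct_matrix(len(d)).T @ d


frame = np.random.default_rng(0).standard_normal(40)
log_mag, sign = split_features(frame)
restored = merge_features(log_mag, sign)
print("max round-trip error:", np.max(np.abs(restored - frame)))
```

In a conversion system the two feature vectors would pass through their respective trained mapping models between `split_features` and `merge_features`; here they are passed through unchanged to show that the split itself loses no information.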
3.2 Baseline multiscale voice morphing using RBF analysis
As discussed earlier, the performance of the proposed algorithm is compared with state-of-the-art multiscale voice morphing. The pre-processing operations of this method are similar to those of the proposed voice conversion method. A dyadic wavelet filter bank applied to the source and target speech frames partitions each frame into different frequency bands. The wavelet basis functions coiflet5 and bi-orthogonal6.8 are chosen for male-to-female and female-to-male conversion, respectively; in each case, the wavelet basis with the minimum reconstruction error is chosen. It is important to note that the wavelet coefficients of the highest sub-band are set to zero in this filter bank. The networks are trained using frames of normalized wavelet coefficients of the remaining four levels. For each level, the network with the minimum error on the validation data is chosen, and the mapping function at that level is established. The transformation phase employs the RBF-based mapping rules developed in the training stage to obtain the morphed features of the target speaker. Then, inverse mathematical operations analogous to those of the training stage are used to reconstruct the target speaker’s speech signal.
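The baseline pre-processing can be sketched as a dyadic decomposition followed by zeroing the highest-frequency band and normalizing the rest. The sketch below deliberately swaps in the Haar wavelet, implemented directly in NumPy, as a stand-in for the coiflet5 and bi-orthogonal6.8 bases named above. The four-level depth, the peak normalization, and all function names are assumptions made for illustration.

```python
import numpy as np


def haar_dwt(x, levels):
    """Dyadic Haar analysis: [approx, detail_coarsest, ..., detail_finest]."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return [approx] + details[::-1]


def haar_idwt(bands):
    """Inverse of haar_dwt."""
    approx = bands[0]
    for detail in bands[1:]:
        out = np.empty(2 * len(approx))
        out[0::2] = (approx + detail) / np.sqrt(2.0)
        out[1::2] = (approx - detail) / np.sqrt(2.0)
        approx = out
    return approx


def prepare_bands(frame, levels=4):
    """Decompose a frame, zero the finest detail band, normalize the rest."""
    bands = haar_dwt(frame, levels)
    bands[-1] = np.zeros_like(bands[-1])          # highest-frequency sub-band
    scales = [np.max(np.abs(b)) or 1.0 for b in bands]
    normalized = [b / s for b, s in zip(bands, scales)]
    return normalized, scales


frame = np.random.default_rng(1).standard_normal(64)  # length divisible by 2**levels
normalized, scales = prepare_bands(frame)
print([len(b) for b in normalized])
```

The stored `scales` allow the normalization to be undone before `haar_idwt` reconstructs a time domain frame from the mapped coefficients.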
4 Radial basis function-based mapping
Separate conversion models are used for mapping the magnitude and phase feature vectors of the source speaker to those of the target speaker. The optimum networks obtained through training are used to predict the transformed parameters of the target speaker from those of the source speaker.
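As an illustration of RBF-based mapping between time-aligned source and target feature vectors, the following minimal sketch fits a Gaussian RBF network by linear least squares. The centre selection (a random subset of source frames), the shared width, the bias term, and the function names are assumptions for this example; the paper's own training procedure and network selection are not reproduced here.

```python
import numpy as np


def rbf_design(x, centres, width):
    """Gaussian activations of every input row against every centre."""
    sq_dist = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * width ** 2))


def fit_rbf(src, tgt, n_centres=20, width=1.0, seed=0):
    """Pick centres from the source frames and solve for the output weights."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(src), size=min(n_centres, len(src)), replace=False)
    centres = src[idx]
    phi = rbf_design(src, centres, width)
    phi = np.hstack([phi, np.ones((len(src), 1))])    # bias column
    weights, *_ = np.linalg.lstsq(phi, tgt, rcond=None)
    return centres, width, weights


def predict_rbf(model, src):
    """Map source feature vectors through a fitted model."""
    centres, width, weights = model
    phi = rbf_design(src, centres, width)
    phi = np.hstack([phi, np.ones((len(src), 1))])
    return phi @ weights


# Synthetic stand-in for aligned source/target feature frames.
rng = np.random.default_rng(0)
source = rng.standard_normal((200, 3))
target = np.tanh(source @ rng.standard_normal((3, 2)))
model = fit_rbf(source, target, n_centres=50, width=1.5)
print("training MSE:", np.mean((predict_rbf(model, source) - target) ** 2))
```

One such model would be fitted for the magnitude features and another for the phase features, with the number of centres and the width chosen on validation data.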
5 Experimental results
The training set includes phonetically balanced English utterances of seven professional narrators. The utterances in this database are sampled at 16 kHz. The corpus includes sentences of JMK (Canadian male), BDL (American male), AWB (Scottish male), RMS (American male), KSP (Indian male), CLB (American female), and SLT (American female).
The performance index P_LSF = 1 indicates that the converted signal is identical to the desired one, whereas P_LSF = 0 specifies that the converted signal is not at all similar to the desired one.
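The definition of the index is not reproduced in this text. A normalized-distance form that is common in the voice conversion literature, assumed here purely for illustration, is P_LSF = 1 - D(target, converted) / D(target, source), where D is an average distance between time-aligned LSF vectors. The sketch below uses this assumed form with a mean Euclidean distance; both choices and the function names are hypothetical.

```python
import numpy as np


def lsf_distance(a, b):
    """Mean Euclidean distance between aligned LSF sequences (frames x order)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))


def performance_index(source, target, converted):
    """Assumed form: 1 - D(target, converted) / D(target, source)."""
    return 1.0 - lsf_distance(target, converted) / lsf_distance(target, source)


# Synthetic stand-ins for aligned LSF frames (50 frames, order 10).
rng = np.random.default_rng(0)
src_lsf = rng.random((50, 10))
tgt_lsf = rng.random((50, 10))

print(performance_index(src_lsf, tgt_lsf, tgt_lsf))   # perfect conversion
print(performance_index(src_lsf, tgt_lsf, src_lsf))   # no conversion at all
```

Under this assumed form, a value of 1 means the converted LSFs coincide with the target, and a value of 0 means the converted signal is no closer to the target than the unmodified source.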
Performance index of proposed method and baseline method for different synthesized speech samples
Columns: Type of conversion; Converted sample 1; Converted sample 2; Converted sample 3; Converted sample 4
Performance of baseline method for predicting formant frequencies within a specified percentage of deviation (column: % predicted frames within deviation)
Performance of proposed method for predicting formant frequencies within a specified percentage of deviation (column: % predicted frames within deviation)
One can observe that the RMSE values between the desired and the predicted acoustic space parameters are lower for the proposed model than for the baseline model. However, RMSE alone does not always give reliable information about the spectral distortion. Consequently, scatter plots and spectral distortion are additionally employed as objective measures.
Score used in speech quality (MOS) and speaker identity (ABX)
Columns: MOS (speech quality); ABX (speaker identity)
Scale labels: Bad (imperfect to perceive); Poor (almost impossible to perceive); Fair (sound perception is not perfect); Very good (cell phone quality); More or less the same; Outstanding (perfect to perceive)
Subjective analysis for quality (MOS) and identity (ABX)
In the next part of the evaluation, the ABX similarity test (A: source, B: target, X: transformed speech signal) is carried out without considering the speech quality. The listeners are asked to rate the speaker identity on a five-point scale from 1 to 5, deciding whether the output X matches A or B, as shown in Table 4. A higher ABX value suggests that the mapping functions developed with the proposed and the baseline methods can convert the identity of one speaker to the other at an acceptable level. Table 5 shows that the listeners rated the proposed method higher than the baseline method in both the MOS and the ABX tests.
6 Conclusions
In this article, a new feature extraction approach based on the admissible wavelet packet transform has been proposed. Earlier feature extraction methods focused only on the low-frequency bands without considering the features in the high-frequency bands, which are equally important for speaker identity. The proposed method emphasizes the frequency regions of the speech signal that are important for speaker identity. The features obtained from the proposed filter bank are modified using RBF-based conversion models. Different objective and subjective measures are used to assess the performance of the proposed and baseline models. The proposed method gives considerably better results than the baseline method in terms of both the quality and the identity of the speaker. The performance of the proposed system demonstrates the significance of combining the information from the high-frequency bands with that from the low-frequency bands for voice conversion.
References
1. Lee K-S: Statistical approach for voice personality transformation. Audio, Speech, Lang. Process., IEEE Trans 2007, 15(2):641-651.
2. Ye H, Young S: High quality voice morphing. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’04), vol. 1. Montreal; 17–21 May 2004:I-9–12.
3. Arslan LM: Speaker transformation algorithm using segmental code books (STASC). Speech Commun 1999, 28: 211-226. 10.1016/S0167-6393(99)00015-1
4. Kuwabara H, Sagisaka Y: Acoustics characteristics of speaker individuality: control and conversion. Speech Commun 1995, 16: 165-173. 10.1016/0167-6393(94)00053-D
5. Abe M, Nakamura S, Shikano K, Kuwabara H: Voice conversion through vector quantization. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP-88). New York, NY; 11–14 April 1988:655-658.
6. Kain A, Macon MW: Design and evaluation of a voice conversion algorithm based on spectral envelope mapping and residual prediction. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’01), vol. 2. Salt Lake City, UT; 7–11 May 2001:813-816.
7. Rao KS: Voice conversion by mapping the speaker-specific features using pitch synchronous approach. Comput. Speech & Lang 2010, 24(3):474-494. 10.1016/j.csl.2009.03.003
8. Turk O, Arslan LM: Robust processing techniques for voice conversion. Comput. Speech & Lang 2006, 20(4):441-467. 10.1016/j.csl.2005.06.001
9. Desai S, Black AW, Yegnanarayana B, Prahallad K: Spectral mapping using artificial neural networks for voice conversion. Audio, Speech, Lang. Process., IEEE Trans 2010, 18(5):954-964.
10. Helander E, Virtanen T, Nurminen J, Gabbouj M: Voice conversion using partial least squares regression. Audio, Speech, Lang. Process., IEEE Trans 2010, 18(5):912-921.
11. Sundermann D, Hoge H, Bonafonte A, Ney H, Black AW: Residual prediction based on unit selection. In 2005 IEEE Workshop on Automatic Speech Recognition and Understanding. San Juan; 27 Nov 2005:369-374.
12. Lu X, Dang J: An investigation of dependencies between frequency components and speaker characteristics for text-independent speaker identification. Speech Commun 2008, 50(4):312-322. 10.1016/j.specom.2007.10.005
13. Kawahara H, Masuda-Katsuse I, de Cheveigné A: Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: possible role of a repetitive structure in sounds. Speech Commun 1999, 27(3):187-207.
14. Hayakawa S, Itakura F: Text-dependent speaker recognition using the information in the higher frequency band. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-94), vol. 1. Adelaide; 19–22 Apr 1994:I/137-140.
15. Imai S: Cepstral analysis synthesis on the mel frequency scale. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '83). Boston, MA; 14–16 April 1983:93-96.
16. Turk O, Arslan LM: Subband based voice conversion. In International Conference on Spoken Language Processing. Denver, CO; 16–20 Sept 2002:289-292.
17. Orphanidou C, Moroz IM, Roberts SJ: Multiscale voice morphing using radial basis function analysis. In Algorithms for Approximation. Berlin Heidelberg: Springer; 2007:61-69.
18. Guido RC, Sasso Vieira L, Barbon Júnior S, Sanchez FL, Dias Maciel C, Silva Fonseca E, Carlos Pereira J: A neural-wavelet architecture for voice conversion. Neurocomputing 2007, 71(1):174-180.
19. Furui S: Research of individuality features in speech waves and automatic speaker recognition techniques. Speech Commun 1986, 5(2):183-197. 10.1016/0167-6393(86)90007-5
20. Laskar R, Chakrabarty D, Talukdar F, Rao KS, Banerjee K: Comparing ANN and GMM in a voice conversion framework. Appl. Soft Comput 2012, 12(11):3332-3342. 10.1016/j.asoc.2012.05.027
21. Nurminen J, Popa V, Tian J, Tang Y, Kiss I: A parametric approach for voice conversion. In International (TC-STAR) Workshop on Speech-to-Speech Translation, Audio and Visual Communications, Nokia Research Center. Barcelona, Spain; June 2006:225-229.
22. Narendranath M, Murthy HA, Rajendran S, Yegnanarayana B: Transformation of formants for voice conversion using artificial neural networks. Speech Commun 1995, 16(2):207-216. 10.1016/0167-6393(94)00058-I
23. Ormanci E, Nikbay UH, Turk O, Arslan LM: Subjective assessment of frequency bands for perception of speaker identity. In Proceedings of the ICSLP 2002, INTERSPEECH. Denver, CO; 16–20 September 2002:2581-2584.
24. Farooq O, Datta S: Mel filter-like admissible wavelet packet structure for speech recognition. Signal Processing Letters, IEEE 2001, 8(7):196-198.
25. Lung S-Y: Wavelet feature selection based neural networks with application to the text independent speaker identification. Pattern Recognit 2006, 39(8):1518-1521. 10.1016/j.patcog.2006.02.004
26. Alsteris LD, Paliwal KK: Short-time phase spectrum in speech processing: a review and some experimental results. Digit. Signal Process 2007, 17(3):578-616. 10.1016/j.dsp.2006.06.007
27. Watanabe T, Murakami T, Namba M, Hoya T, Ishida Y: Transformation of spectral envelope for voice conversion based on radial basis function networks. In Seventh International Conference on Spoken Language Processing, INTERSPEECH, ISCA. Denver, CO; 16–20 September 2002.
28. Iwahashi N, Sagisaka Y: Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks. Speech Commun 1995, 16(2):139-151. 10.1016/0167-6393(94)00051-B
29. Kominek J, Black AW: The CMU ARCTIC Speech Databases. In SSW5-2004. Pittsburgh, PA; 14–16 June 2004:223-224.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.