An audio watermark-based speech bandwidth extension method
© Chen et al.; licensee Springer. 2013
Received: 12 February 2013
Accepted: 13 May 2013
Published: 6 June 2013
A novel speech bandwidth extension method based on audio watermarking is presented in this paper. The time-domain and frequency-domain envelope parameters are extracted from the high-frequency components of the speech signal, and these parameters are then embedded in the corresponding narrowband speech bit stream by a modified least significant bit watermark method that exploits perceptual properties. At the decoder, the wideband speech is reproduced by reconstructing the high-frequency components from the parameters extracted from the narrowband speech bit stream. The proposed method reduces the poor auditory effect caused by large local distortion. The simulation results show that the synthesized wideband speech has low spectral distortion and that its perceptual quality is greatly improved.
Narrowband speech with an 8 kHz sampling frequency is widely used in many communication systems. This kind of speech sounds unnatural because the high-frequency components are missing; therefore, it cannot meet the demands of applications that require high perceptual quality, such as telephone/video conference systems. With the increase of communication network bandwidth, wideband speech transmission is strongly desired, but a large-scale upgrade of narrowband communication infrastructure is difficult and expensive. For existing communication networks, such as the public switched telephone network (PSTN) and the global system for mobile communication (GSM), the speech bandwidth extension (BWE) technique is an effective and realistic way to obtain wideband speech quality.
Speech BWE methods are mainly divided into two classes. One is based on the correlation between narrowband speech components and wideband ones; the other is based on information hiding techniques. Most methods of the former class produce wideband speech with a linear prediction (LP) model, i.e., an excitation signal and linear prediction coefficients (which stand for the spectral envelope). Nagel et al. proposed a high-frequency (HF) information generation method based on single sideband modulation, i.e., the low-frequency (LF) band signal is first modulated and then extended into the HF part; finally, the gap between LF and HF is filled with noise and the frequency-domain envelope is shaped. Fuchs and Lefebvre proposed a harmonic BWE method. This method generates HF components with a parallel phase vocoder and removes noise in the region where the spectra intersect. Pulakka et al. proposed a speech BWE method using Gaussian mixture model-based estimation of the high-band Mel spectrum. Pulakka and Alku proposed a BWE method for telephone speech using a neural network and a filter bank implementation for the high-band Mel spectrum. Pham et al. used a back-forward filter to generate the excitation signal, which greatly improves the perceptual quality of the synthesized wideband speech. Bauer and Fingscheidt used a pre-trained neural network to generate HF speech components and synthesized wideband speech by a spline interpolation method. Naofumi proposed a hidden Markov model (HMM)-based BWE method. This method can enhance the speech quality without increasing the amount of transmitted data. These methods, based on the correlation between narrowband speech components and wideband ones, have sufficiently low computational complexity, but noise is easily introduced into the frequency band between LF and HF.
The speech BWE methods based on information hiding techniques usually embed information about the HF components into the bit stream of the narrowband speech; the wideband speech is then recovered from this HF information at the receiver. Chen and Leung proposed a speech BWE method based on a least significant bit (LSB) audio watermark, which can embed more information about the HF speech components but is susceptible to noise and channel interference. Geiser and Vary proposed a speech BWE method based on a data hiding technique. They embedded the linear prediction coefficients of the HF components into the encoded narrowband speech, then recovered the data in the decoder and synthesized wideband speech. However, when the channel suffers from interference, the quality of the wideband speech synthesized by this method is poor. Esteban and Galand proposed a speech BWE method based on the GSM EFR codec, which embeds the sideband information into the narrowband speech stream as a watermark. This method can synthesize wideband speech with less noise.
In this paper, a new BWE method based on a modified LSB watermark technique is proposed. This method first extracts the necessary parameters of the HF components, including the time-domain envelopes, frequency-domain envelopes, and energy of the wideband speech; these parameters are then compressed and embedded into the narrowband speech bit stream with a modified watermark technique. At the decoder, the reverse procedure is applied to extract the HF parameters; these parameters are then used to synthesize the HF components; finally, the wideband speech is recovered from the LF and HF speech components.
2 Speech BWE method based on audio watermark
2.1 Down-sampling processing of speech signal
where the filter order ORD is equal to 64, and s_wb is the input wideband speech signal.
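As a rough illustration of this down-sampling step, the sketch below low-pass filters a 16 kHz signal with an FIR filter of order 64 and keeps every second sample to obtain an 8 kHz narrowband signal. Only the filter order and the input signal come from the text; the Hamming-windowed sinc design, the 4 kHz cut-off, and the function names `lowpass_taps` and `downsample_by_two` are assumptions made for this example and are not necessarily the filter used in the paper.

```python
import math

ORD = 64            # filter order stated in the text (ORD + 1 taps)
FS_WB = 16000       # wideband sampling rate in Hz
CUTOFF_HZ = 4000    # assumed cut-off: half the narrowband sampling rate


def lowpass_taps(order=ORD, cutoff_hz=CUTOFF_HZ, fs=FS_WB):
    """Hamming-windowed sinc low-pass FIR, normalised to unit DC gain (assumed design)."""
    fc = cutoff_hz / fs                      # cut-off in cycles per sample
    mid = order / 2
    taps = []
    for k in range(order + 1):
        t = k - mid
        if t == 0:
            ideal = 2 * fc
        else:
            ideal = math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / order)
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]


def downsample_by_two(s_wb, taps):
    """Filter s_wb causally with the FIR taps, then keep every second sample."""
    filtered = []
    for n in range(len(s_wb)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k < 0:
                break                        # samples before the signal start count as zero
            acc += h * s_wb[n - k]
        filtered.append(acc)
    return filtered[::2]
```

Calling `downsample_by_two(s_wb, lowpass_taps())` on a list of 16 kHz samples returns half as many samples at 8 kHz; components above the assumed cut-off are attenuated before decimation so that they do not alias into the narrowband signal.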
2.2 High-frequency parameters extraction
2.3 Watermark embedding and extracting
In order to further reduce the amount of data, vector quantization (VQ) is applied to both the time-domain and frequency-domain envelopes. In the VQ process, the time-domain and frequency-domain envelopes are divided into four sections and three sections, respectively, where each section is a four-dimensional vector and is quantized with 6 bits. Thus, the total amount of digital information is 12 + 12 + 6 × 4 + 6 × 3 = 66 bits, and a quantization codebook is available in the literature.
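The bit budget above can be checked with a short sketch that also packs the fields into a 66-bit list and unpacks them again. The text gives only the sum 12 + 12 + 6 × 4 + 6 × 3; treating the two 12-bit terms as two scalar fields, the field order, and the names `pack` and `unpack` are assumptions made for illustration.

```python
SCALAR_FIELDS = 2     # the two 12-bit terms of the sum (their meaning is assumed)
SCALAR_BITS = 12
VQ_BITS = 6           # bits per quantized four-dimensional vector
TIME_SECTIONS = 4     # 4 sections x 4 dimensions = 16 time-domain envelopes
FREQ_SECTIONS = 3     # 3 sections x 4 dimensions = 12 frequency-domain envelopes

WIDTHS = [SCALAR_BITS] * SCALAR_FIELDS + [VQ_BITS] * (TIME_SECTIONS + FREQ_SECTIONS)


def watermark_bits():
    """Total number of watermark bits per frame."""
    return sum(WIDTHS)


def pack(scalars, time_indices, freq_indices):
    """Serialise all fields MSB-first into a flat list of bits (assumed field order)."""
    values = list(scalars) + list(time_indices) + list(freq_indices)
    if len(values) != len(WIDTHS):
        raise ValueError("unexpected number of fields")
    bits = []
    for value, width in zip(values, WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit its bit width")
        bits.extend((value >> shift) & 1 for shift in range(width - 1, -1, -1))
    return bits


def unpack(bits):
    """Inverse of pack: returns (scalars, time_indices, freq_indices)."""
    values, pos = [], 0
    for width in WIDTHS:
        value = 0
        for bit in bits[pos:pos + width]:
            value = (value << 1) | bit
        values.append(value)
        pos += width
    split = SCALAR_FIELDS + TIME_SECTIONS
    return values[:SCALAR_FIELDS], values[SCALAR_FIELDS:split], values[split:]
```

With these widths `watermark_bits()` returns 66, matching the total stated in the text.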
Usually, an audio watermark is designed to be undetectable and imperceptible, but the hidden message it carries can be extracted by a suitable algorithm. Using this feature, we treat the 66 bits of digital information as the watermark and embed it into the LF bit stream; at the receiving terminal, the hidden HF information can then be obtained with a watermark extractor. In this paper, a modified LSB watermark method is proposed, which is based on communication protocol characteristics and human hearing perception.
According to the time-domain masking effect of the human auditory system, a large signal can mask a small signal, so changes in the small signal cannot be easily heard. Exploiting this auditory characteristic, we embed the watermark carrying the LF and HF component parameters at small-signal positions so that the watermark is better hidden.
When extracting the watermark, we decide whether a watermark bit is embedded based on the characteristics of the bit stream. If the C6 bit is 0, the watermark bit is extracted from the lowest bit position; if the C6 bit is 1, no watermark bit is present. If the end of the frame is reached and fewer than 66 watermark bits have been extracted, extraction returns to the starting point and continues at the positions where C6 = 1 until 66 watermark bits have been extracted.
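The extraction rule can be sketched as a two-pass scan over the codewords of one frame. The embedding routine shown here is simply the mirror image of that rule and is included only so that the round trip can be checked; it is an assumption, as are the 8-bit codeword layout (C0 as the least significant bit, C6 as bit 6) and the function names.

```python
WATERMARK_BITS = 66


def scan_order(codewords):
    """Positions visited by the rule: first all codewords with C6 = 0, then those with C6 = 1."""
    first_pass = [i for i, c in enumerate(codewords) if (c >> 6) & 1 == 0]
    second_pass = [i for i, c in enumerate(codewords) if (c >> 6) & 1 == 1]
    return first_pass + second_pass


def embed_watermark(codewords, bits):
    """Write the bits into the lowest bit (C0) of the codewords in scan order (assumed embedder)."""
    marked = list(codewords)
    order = scan_order(marked)
    if len(bits) > len(order):
        raise ValueError("frame too short for the watermark")
    for pos, bit in zip(order, bits):
        marked[pos] = (marked[pos] & 0xFE) | bit   # only C0 changes, so C6 is preserved
    return marked


def extract_watermark(codewords, n_bits=WATERMARK_BITS):
    """Read n_bits from the lowest bit of the codewords, visiting them in scan order."""
    order = scan_order(codewords)
    return [codewords[pos] & 1 for pos in order[:n_bits]]
```

Because only the lowest bit is modified, the C6 bit of every codeword is unchanged by embedding, so the extractor visits exactly the same positions as the embedder without any side information.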
2.4 Recovery of HF components
where a_i are the linear prediction coefficients of the LF part, p is the order of the AR model, and G is the gain.
While u(n) is obtained from the AR model, the parameters of the HF components are also extracted from the watermark in the LF bit stream, including 16 time-domain envelopes, 12 frequency-domain envelopes, the average time-domain envelope, and the average frequency-domain envelope. The HF parameters recovered from the LF bit stream are then used to shape both the time-domain and frequency-domain envelopes of u(n). Since the shaping method for the frequency-domain envelope is similar to the one for the time domain, only the time-domain shaping process is given as follows.
where are the envelope parameters of u(n) in time domain.
After the time-domain and frequency-domain envelopes have been shaped as described above, the HF speech components are reconstructed.
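The time-domain shaping step can be illustrated with a minimal sketch: the frame of u(n) is split into as many segments as there are decoded envelope values, and each segment is scaled so that its envelope matches the decoded target. Defining the envelope of a segment as its RMS value, using equal-length segments, and the names `segment_rms` and `shape_time_envelope` are assumptions made for this example; the paper's exact envelope definition may differ.

```python
import math


def segment_rms(signal, n_segments):
    """RMS value of each of n_segments equal-length segments (assumed envelope measure)."""
    seg_len = len(signal) // n_segments
    return [
        math.sqrt(sum(x * x for x in signal[i * seg_len:(i + 1) * seg_len]) / seg_len)
        for i in range(n_segments)
    ]


def shape_time_envelope(u, target_env, eps=1e-12):
    """Scale each segment of u so that its RMS equals the corresponding target value."""
    n_segments = len(target_env)
    seg_len = len(u) // n_segments
    if seg_len == 0 or seg_len * n_segments != len(u):
        raise ValueError("frame length must be a positive multiple of the envelope count")
    current_env = segment_rms(u, n_segments)
    shaped = []
    for i, (current, target) in enumerate(zip(current_env, target_env)):
        gain = target / max(current, eps)      # eps guards against silent segments
        shaped.extend(gain * x for x in u[i * seg_len:(i + 1) * seg_len])
    return shaped
```

For a frame with 16 decoded time-domain envelopes, `shape_time_envelope(u, target_env)` returns a signal whose 16 segment envelopes equal the decoded ones; the frequency-domain shaping would apply the same idea to groups of spectral coefficients.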
2.5 Synthesis of wideband speech
3 Simulation and result discussion
In order to evaluate the performance of the proposed BWE scheme, both objective and subjective experiments are carried out. Without loss of generality, the test signals are divided into five types according to their pitch and timbre characteristics: male speech, female speech, boy speech, girl speech, and song. All test signals are quantized with 16 bits and sampled at 16 kHz. They are used as the original wideband speech in the following experiments.
3.1 Objective measurements
The objective measurements, including spectral distortion and spectrograms, are used to compare the original wideband speech at the transmitting terminal with the extended wideband speech at the receiving terminal.
Table: Objective test results. Column: distortion measure (dB).
Table: Signal-to-noise ratio of narrowband speech. Column: SNR of narrowband speech after G.711 decoding (dB).
3.2 Subjective evaluation
A is much better than B
A is better than B
A is slightly better than B
A is the same with B
A is slightly worse than B
A is worse than B
A is much worse than B
Because human auditory and subjective perception depends on personal experience, background knowledge, test environment, and mental state, each person's subjective impression of the same speech will vary, although the difference is small. In order to make sure that the test truly reflects the speech quality, 32 listeners (16 females and 16 males) aged between 20 and 40 were invited to take part in the experiments in the same test environment. None of the listeners had any hearing impairment, and all were native speakers of Chinese. The listeners were familiar with communication equipment, but they were not engaged in communications or signal processing work and had not participated in any subjective speech test in the preceding 6 months.
Before the formal listening tests, the listeners were told the main idea of the experiment. Once the listeners understood the instructions, they first listened to an initial set of samples and gave their feedback. Discussion of technical matters, such as the test principle or the degree of distortion, was not allowed until all experiments were over. In order to reduce listener fatigue, the test was divided into blocks. While the test was ongoing, the listeners were not allowed to know the test results of other listeners.
A speech bandwidth extension method based on a modified audio watermark is proposed in this paper. The high-frequency speech information is embedded as a watermark in the narrowband (i.e., low-frequency) speech bit stream. A modified LSB watermark method, based on the characteristics of the communication protocol and on human hearing perception, is proposed and used in the BWE method. The objective and subjective evaluations show that the quality of the speech synthesized by the proposed method is better than that of narrowband speech and is comparable to that of the AMR-WB codec at 18.25 kbps.
This work was supported by National Natural Science Foundation of China (nos. 61172107, 61172110, and 60772161), Dalian Municipal Science and Technology Fund Scheme (no. 2008J23JH025), Specialized Research Fund for the Doctoral Program of Higher Education of China (no. 200801410015), and the Fundamental Research Funds for the Central Universities of China (no. DUT13LAB06).
- ITU-T Recommendation G.711: Pulse code modulation (PCM) of voice frequencies. (ITU-T, 1972)
- Plumpe MD, Quatieri TF, Reynolds DA: Modeling of the glottal flow derivative waveform with application to speaker identification. IEEE Trans. Speech Audio Process 1999, 7(5):569-586. 10.1109/89.784109
- Nagel F, Disch S, Wilde S: A continuous modulated single sideband bandwidth extension. IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) 357-360. Texas, 14–19 March 2010
- Fuchs G, Lefebvre R: A new post-filtering for artificially replicated high-band in speech coders. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 713-716. Toulouse, 14–19 May 2006
- Pulakka H, Remes U, Palomaki K: Speech bandwidth extension using Gaussian mixture model-based estimation of the highband Mel spectrum. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5100-5103. Prague, 22–27 May 2011
- Pulakka H, Alku P: Bandwidth extension of telephone speech using a neural network and a filter bank implementation for highband Mel spectrum. IEEE Trans. Audio, Speech, Lang. Process 2011, 19(7):2170-2183.
- Pham TV, Schaefer F, Kubin G: A novel implementation of the spectral shaping approach for artificial bandwidth extension. 3rd IEEE International Conference on Communications and Electronics (ICCE) Nha Trang 262-267. 11–13 August 2010
- Bauer P, Fingscheidt T: An HMM-based artificial bandwidth extension evaluated by cross-language training and test. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4589-4592. Las Vegas, 31 March–4 April 2008
- Naofumi: A band extension technique for G.711 speech using steganography. IEICE Trans. Commun 2006, E89-B(6):1896-1898. 10.1093/ietcom/e89-b.6.1896
- Mohan M, Karpur DB, Narayan M: Artificial bandwidth extension of narrowband speech using Gaussian mixture model. IEEE International Conference on Communications and Signal Processing (ICCSP) 410-412. Kerala, 10–12 February 2011
- Chen S, Leung H: Artificial bandwidth extension of telephony speech by data hiding. IEEE International Symposium on Circuits and Systems (ISCAS) 3151-3154. Kobe, 23–26 May 2005
- Geiser B, Vary P: Backwards compatible wideband telephony in mobile networks: CELP watermarking and bandwidth extension. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 533-536. Honolulu, Hawaii, 15–20 April 2007
- Esteban D, Galand C: Application of quadrature mirror filters to split band voice coding schemes. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 191-195. Hartford, May 1977
- ITU-T Recommendation G.729.1: G.729-based embedded variable bit-rate coder: an 8–32 kbit/s scalable wideband coder bit stream interoperable with G.729. (ITU-T, 2006)
- Nomura T, Iwadare M, Serizawa M: A bitrate and bandwidth scalable CELP coder. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 341-344. Seattle, 12–15 May 1998
- Mustiere F, Bouchard M, Bolic M: Bandwidth extension for speech enhancement. Canadian Conference on Electrical and Computer Engineering 1-4. Calgary, 2–5 May 2010
- Jax P, Vary P: An upper bound on the quality of artificial bandwidth extension of narrowband speech signals. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 237-240. Orlando, 13–17 May 2002
- Hsu HW, Liu CM: Decimation-whitening filter in spectral band replication. IEEE Trans. Audio, Speech, Lang Process 2011, 19(8):2304-2313.
- Zhang J: Bandwidth extension for China AVS-M standard. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 4149-4152. Taipei, 19–24 April 2009
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.