A blind bandwidth extension method for audio signals based on phase space reconstruction
© Bao et al.; licensee Springer. 2014
Received: 26 October 2012
Accepted: 28 November 2013
Published: 6 January 2014
Bandwidth extension is an effective technique for enhancing the quality of audio signals by reconstructing their high-frequency components. In this paper, a novel blind bandwidth extension method is proposed based on phase space reconstruction. Phase space reconstruction is introduced to convert the low-frequency modified discrete cosine transform coefficients of wideband audio to a multi-dimensional space, and the high-frequency modified discrete cosine transform coefficients of the audio signal are reconstructed by a non-linear prediction model. The performance of the proposed method was evaluated through objective and subjective tests. It is found that the proposed method achieves a better performance than the typical linear extrapolation method, and its performance is comparable to the conventional efficient high-frequency bandwidth extension method.
1 Introduction
According to formal and informal listening tests, the listeners usually prefer the bandwidth-limited audio signals over the heavily distorted full-band signals. So, the high-frequency components (HFC) of audio signals are partly or completely discarded in many audio coding methods at low bit rates in order to increase coding efficiency. If a high-frequency reconstruction module is embedded into audio codecs, the quality of the reproduced audio signals could be improved.
Bandwidth extension (BWE) plays an important role in high-frequency reconstruction of audio signals. Generally, BWE can be divided into non-blind BWE and blind BWE. In the non-blind BWE method, some side information, such as the time/frequency envelope of the high-frequency (HF) band, the noise floor, the level of inverse filtering, and additional sine signals, is extracted and coded in advance at the encoder and transmitted together with the encoded low-frequency components (LFC). At the decoder, the components in the low-frequency (LF) band are copied into the HF band based on the received side information. This method is often called spectral band replication (SBR) [2, 3]. A well-known audio codec utilizing SBR is high-efficiency advanced audio coding (HE-AAC) [4, 5], which has been applied in mobile multimedia players, mobile phones, and digital radio services. In addition, the non-blind BWE methods used in extended adaptive multi-rate wideband (AMR-WB+) and the audio/video coding standard for mobile multimedia applications (AVS-M) also exhibit good BWE performance by adopting a linear prediction model to describe the spectral envelope and extending the bandwidth of audio signals based on the time/frequency envelope. Compared to LFC coding alone, non-blind BWE helps an audio codec achieve a much higher decoding quality by means of the side information, but it takes up more channel resources for transmission. Thus, the non-blind BWE method is unusable when the transmission channel cannot afford the bit rate needed for the additional side information.
For blind BWE, the HFC of audio signals are often completely discarded at the encoder, and no side information related to the HFC is coded or transmitted. Only the LFC are coded at the encoder, and the high-frequency reconstruction depends entirely on the decoded LFC. Blind BWE can therefore be easily applied within different audio codecs. Traditional blind BWE methods have been studied extensively for narrowband speech based on the speech generation model. However, there is as yet no efficient method for blind BWE of audio signals. A straightforward method is to linearly extrapolate the HFC from the LFC under the assumption that the logarithmic amplitude spectrum of audio declines linearly as frequency increases [8, 9]. However, this assumption does not hold in most cases because of the complicated characteristics of the audio spectrum. In fact, audio signals have strongly non-linear characteristics, and an efficient non-linear prediction method is essential for reconstructing the HFC in blind BWE. Efficient high-frequency bandwidth extension (EHBE) can reproduce new HF harmonic components using non-linear filtering. In the EHBE method, the highest octave present in the decoded LF information is extracted as the fundamental by a band-pass filter, and the harmonics are created by half-wave rectification. Then, the desired part of the complete harmonic signal is extracted by another band-pass filter and scaled by a gain G. Combined with the delayed input signals, the full-band audio signals are obtained. Because of frequency mixing, auditory distortion is also perceived after the EHBE method is applied.
In this paper, a blind high-frequency reconstruction method for audio signals based on phase space reconstruction (PSR) and non-linear prediction is proposed. Here, PSR is used to convert the LF modified discrete cosine transform (MDCT) coefficients of wideband audio to a multi-dimensional space. In order to improve the performance, the energy and harmonic components of the reconstructed HF spectrum are further adjusted according to listening perception. The objective and subjective evaluations show that the proposed method achieves a better performance than the linear extrapolation (LE) method and is comparable to the EHBE method.
The outline of this paper is as follows: Section 2 describes the PSR of audio signals and discusses the calculation of the embedding dimension and embedding delay related to PSR. Section 3 describes the high-frequency reconstruction principle of audio signals, followed by a more detailed description of the prediction of the HF-MDCT coefficients and the adjustment of the energy and harmonics of the HFC. The performance of the proposed method is evaluated in Section 4. Finally, the conclusions are given in Section 5.
2 Phase space reconstruction of audio signals
Motivated by previous work [11, 12], the PSR method, which is similar to the delay reconstruction method, is used to convert the LF-MDCT coefficients into a multi-dimensional space. A non-linear prediction model is built in the phase space to capture the hidden relationship between the given phase points and the unknown MDCT coefficients. By using this model, we can restore the audio spectrum of the HF components from the LF components in the phase space. So, the primary problem is how to build a multi-dimensional phase space from a one-dimensional audio signal.
Let x(n), n = 1, 2, …, M, denote the LF-MDCT coefficients of a frame. Each phase point is formed from delayed coefficients as y_k = [x(k), x(k + τ), …, x(k + (m - 1)τ)], where y_k, k = 1, 2, …, M - (m - 1)τ, is an m-dimensional row vector that represents a phase point in the phase space, the embedding dimension m is the minimum dimension of the phase space, and the embedding delay τ is the distance between adjacent components of each phase point. Thus, once the phase space of the LF-MDCT coefficients is reconstructed, the relationship between the phase points and the MDCT coefficients can be described by a non-linear model. The HFC of audio signals can be restored within a frame from these phase points by non-linear prediction, under the assumption of a strong correlation between the HF and LF spectra of audio signals.
According to the PSR principle, we can obtain the phase points y_5, y_6, …, which contain the HF-MDCT coefficients x(7), x(8), …. The (linear or non-linear) relationship between phase point y_5 and its adjacent phase points, such as y_3 and y_4, can be used to predict the component x(7) included in phase point y_5. Once the component x(7) within phase point y_5 is determined, a similar method is used to estimate the component x(8) within phase point y_6, and so on. This procedure continues until the final component, corresponding to the cutoff frequency of the full band, is determined.
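The construction of phase points from a one-dimensional coefficient sequence can be illustrated with a minimal sketch. The function name `delay_embed` and the values m = 3 and τ = 1 are illustrative assumptions, not taken from the paper; the code simply stacks the delay vectors as rows, using 0-based array indices.

```python
import numpy as np

def delay_embed(x, m, tau):
    """Stack delay vectors [x(k), x(k+tau), ..., x(k+(m-1)tau)] as rows."""
    x = np.asarray(x, dtype=float)
    n_points = len(x) - (m - 1) * tau
    if n_points <= 0:
        raise ValueError("series too short for this m and tau")
    # Column j holds the series shifted by j * tau samples.
    return np.column_stack([x[j * tau: j * tau + n_points] for j in range(m)])

# Stand-in data: the value stored at position n is simply n, so rows are easy to read.
x = np.arange(1, 9)             # plays the role of x(1), ..., x(8)
Y = delay_embed(x, m=3, tau=1)  # illustrative m and tau
print(Y.shape)                  # (6, 3)
print(Y[4])                     # [5. 6. 7.]  -> the fifth phase point ends in x(7)
```

With these assumed parameters, the fifth phase point contains x(7) as its last component, which matches the situation described in the text where x(7) is the first coefficient to be predicted.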
In order to reconstruct the phase space, two important parameters m and τ should be calculated in advance. The details of calculating embedding dimension m and embedding delay τ will be given in the following two sub-sections.
2.1 Calculation of embedding delay τ
The embedding delay τ should be neither too small nor too large. If τ is too small, the difference between the adjacent components of each phase point in the phase space will become too small. If τ is too large, there will be no correlation between the adjacent components of each phase point in the phase space. In both cases, the true relationship between the MDCT coefficients is not reflected correctly, and the PSR is ineffective. We found that for the MDCT spectrum of audio signals, a slight change of τ will lead to the adjacent phase points becoming completely uncorrelated. In this case, BWE from LF to HF is impossible. Therefore, a trade-off in the choice of the embedding delay τ is crucial for the PSR.
where x̄ denotes the mean of the LF-MDCT coefficients, and i = 0, 1, 2, …, M - 1 is the index of R(i).
It is found that the phase space will be reconstructed as long as the embedding delay τ is limited to a certain range, even though τ has a small deviation. Here, when the first zero value or the first minimum value of R(i) appears, the corresponding index i is chosen as the embedding delay τ. This can reduce the computational complexity since it does not need to compute all R(i).
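The selection rule for τ can be sketched as follows, under the assumption that R(i) is the mean-removed autocorrelation of the LF-MDCT coefficients (the usual choice for this kind of delay selection; the exact normalization used in the paper is not reproduced here). The function name `embedding_delay` is hypothetical. The loop stops at the first lag where R(i) reaches zero or starts to rise again, so later lags are never computed.

```python
import numpy as np

def embedding_delay(x, max_lag=None):
    """Return the first lag at which the mean-removed autocorrelation
    crosses zero, or the first local minimum if that comes earlier."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    M = len(d)
    max_lag = M - 1 if max_lag is None else min(max_lag, M - 1)
    prev = float(np.dot(d, d))                 # R(0)
    for i in range(1, max_lag + 1):
        r = float(np.dot(d[:M - i], d[i:]))    # R(i), unnormalized (assumption)
        if r <= 0.0:
            return i                           # first zero value
        if r > prev:
            return i - 1                       # previous lag was the first minimum
        prev = r
    return max_lag

# Usage on a synthetic sequence with a period of 16 samples.
n = np.arange(64)
tau = embedding_delay(np.cos(2 * np.pi * n / 16))
print(tau)
```

For a sequence with a period of 16 samples, the autocorrelation first becomes non-positive near a quarter of the period, so the returned delay is small compared with the sequence length.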
2.2 Calculation of embedding dimension m
In this paper, we adopt the false nearest neighbors (FNN) method to calculate the embedding dimension m. The basic idea is to gradually identify the false nearest neighbors of each phase point by increasing the dimension of the phase space. The dimension at which the adjacent phase points are completely unfolded is chosen as the final embedding dimension m.
If f_d(k) > f_D, the phase point y_k^NN is determined to be a false nearest neighbor of the phase point y_k, where f_D is a threshold used to verify whether the nearest neighbor of a phase point is false. Here, f_D is a fixed value between 10 and 50.
In order to find the embedding dimension m, the Euclidean distances D_d(k) and D_{d+1}(k) between each phase point and its nearest neighbor are first calculated by increasing the dimension d of the phase space in steps of 1, given the embedding delay τ. Then, f_d(k) is obtained by Equation (6). By comparing f_d(k) with f_D, the false nearest neighbors of all the phase points are identified. Furthermore, the percentage of false nearest neighbors among the nearest neighbors of all phase points is defined as β(d).
In our experiment, the initial value of d is set to 1. If β(d) > β_D, where β_D is a percentage threshold between 5% and 10%, d is increased by 1. This comparison is repeated until β(d) < β_D. The value of d at which β(d) falls below β_D is chosen as the final embedding dimension m.
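The FNN search can be sketched as below. Since the defining equation for f_d(k) is not reproduced in this text, the sketch assumes the common criterion f_d(k) = sqrt(D_{d+1}(k)^2 - D_d(k)^2) / D_d(k), that is, the relative growth in distance when one more delayed component is added. The function names, the default thresholds (f_D = 15, β_D = 5%), and the cap `d_max` are illustrative assumptions within the ranges stated above.

```python
import numpy as np

def delay_embed(x, m, tau):
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[j * tau: j * tau + n] for j in range(m)])

def fnn_fraction(x, d, tau, f_D=15.0):
    """Fraction of nearest neighbours in dimension d that are false in d + 1."""
    Y_next = delay_embed(x, d + 1, tau)   # phase points in dimension d + 1
    Y = Y_next[:, :d]                     # the same points restricted to d components
    n_false = 0
    for k in range(len(Y)):
        dist = np.linalg.norm(Y - Y[k], axis=1)
        dist[k] = np.inf                  # a point is not its own neighbour
        nn = int(np.argmin(dist))
        D_d = dist[nn]
        D_next = np.linalg.norm(Y_next[nn] - Y_next[k])
        if D_d == 0.0:
            n_false += int(D_next > 0.0)  # coincident points that separate in d + 1
        else:
            f = np.sqrt(max(D_next ** 2 - D_d ** 2, 0.0)) / D_d
            n_false += int(f > f_D)
    return n_false / len(Y)

def embedding_dimension(x, tau, f_D=15.0, beta_D=0.05, d_max=10):
    """Smallest d whose false-neighbour fraction beta(d) drops below beta_D."""
    x = np.asarray(x, dtype=float)
    for d in range(1, d_max + 1):
        if len(x) - d * tau < 2:          # too few phase points left to test
            return max(d - 1, 1)
        if fnn_fraction(x, d, tau, f_D) < beta_D:
            return d
    return d_max

# Usage on a synthetic sinusoid, which unfolds in two dimensions.
m = embedding_dimension(np.sin(0.3 * np.arange(300)), tau=5)
print(m)
```

The brute-force neighbour search costs O(N^2) per dimension, which is acceptable for a few hundred coefficients per frame; a tree-based search would be a natural replacement for longer sequences.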
3 High-frequency reconstruction of audio signals
The block diagram of the proposed high-frequency reconstruction method of audio signals is shown in Figure 3. The high-frequency reconstruction includes five steps:
Step 1. Calculate LF-MDCT coefficients of audio signals;
Step 2. Reconstruct the phase space of LF-MDCT coefficients;
Step 3. Predict the HF-MDCT coefficients;
Step 4. Adjust the energy and harmonics of HFC;
Step 5. Reconstruct the time-domain audio signals by an inverse MDCT of the full-band coefficients.
Step 1 and step 5 are well known, so we will not discuss them here. Step 2 has been described in Section 2. In this section, we will discuss step 3 and step 4, respectively.
3.1 Non-linear prediction of HF MDCT coefficients
Once the phase space is reconstructed based on the LF-MDCT coefficients, the HF-MDCT coefficients can be predicted according to the hidden relationship between the given phase points and the unknown MDCT coefficients. In this paper, the non-linear prediction method is used to obtain the HF-MDCT coefficients based on the observation that adjacent phase points have similar characteristics. We use the 2τ phase points at the bottom of Equation (2) to predict the HF-MDCT coefficients because they are most closely related to the HF-MDCT coefficients. The high-frequency band is divided into L regions of 2τ coefficients each, and the cutoff frequency of the audio signals is located in the Lth region.
In Equation (7), the region index l of the high-frequency band starts from 1, and the phase points used for prediction are updated by circularly increasing k_c. Thus, we can use Equation (7) to predict the HF-MDCT coefficients within the L regions.
where D_i, i = 1, 2, …, p, is the Euclidean distance between x(k_i), i = 1, 2, …, p, and x(k_c + 2τl), k_c = M - 2τ + 1, M - 2τ + 2, …, M, l = 1, 2, …, L, and D_min is the minimum value of D_i, i = 1, 2, …, p.
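The idea of predicting an unknown coefficient from phase points with similar characteristics can be illustrated with a generic nearest-neighbour local predictor. This is a simplified stand-in rather than Equation (7) itself: it searches all known phase points instead of only the last 2τ, and the weights exp(-(D_i - D_min)) are an assumption modelled on the role of D_i and D_min described above. The names `predict_next` and `extend` are hypothetical.

```python
import numpy as np

def predict_next(x, m, tau, p=3):
    """Predict the sample following x from its p nearest phase-space neighbours."""
    x = np.asarray(x, dtype=float)
    n, span = len(x), (m - 1) * tau
    if m < 2 or n <= span:
        raise ValueError("need m >= 2 and more than (m - 1) * tau samples")
    # Known components of the phase point whose last component is the unknown x[n].
    query = x[n - span: n: tau]
    # Every earlier phase point whose last component is already known.
    starts = np.arange(0, n - span)
    known = np.stack([x[s: s + span: tau] for s in starts])
    targets = x[starts + span]
    dist = np.linalg.norm(known - query, axis=1)
    idx = np.argsort(dist)[:p]
    # Distance-based weights relative to the closest neighbour (assumed form).
    w = np.exp(-(dist[idx] - dist[idx].min()))
    return float(np.dot(w, targets[idx]) / w.sum())

def extend(x, n_new, m, tau, p=3):
    """Append n_new predicted samples, feeding each prediction back in."""
    out = list(np.asarray(x, dtype=float))
    for _ in range(n_new):
        out.append(predict_next(out, m, tau, p))
    return np.array(out[len(x):])

# Usage: a strictly periodic sequence is continued exactly.
pattern = np.tile([0.0, 1.0, 2.0, 3.0], 8)
print(extend(pattern, 4, m=3, tau=1))
```

Because each predicted value is appended before the next prediction, errors can accumulate over long extensions; this is one reason an energy adjustment stage after prediction is useful.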
3.2 Energy and harmonic component adjustment of HF-MDCT coefficients
where i represents the index of sub-bands, M_sub is the number of MDCT coefficients in the ith sub-band, and j is the index of MDCT coefficients in the ith sub-band.
where A_D is the threshold value of ξ, and here it is set to 0.95.
where i represents the index of HF sub-bands, M_sub is the number of MDCT coefficients in the ith sub-band, and j is the index of MDCT coefficients in the ith sub-band.
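The sub-band energy bookkeeping that these definitions refer to can be sketched as follows. The sketch assumes that the energy of a sub-band is the sum of its squared MDCT coefficients and that the target energy of each HF sub-band is supplied by the caller; the rule that derives the target from ξ and A_D is not reproduced here. Both function names are hypothetical.

```python
import numpy as np

def subband_energies(coeffs, m_sub):
    """Energy of each sub-band of m_sub consecutive coefficients."""
    c = np.asarray(coeffs, dtype=float)
    n_bands = len(c) // m_sub
    return np.sum(c[: n_bands * m_sub].reshape(n_bands, m_sub) ** 2, axis=1)

def scale_to_energies(coeffs, m_sub, target):
    """Scale every sub-band by one gain so that its energy matches target."""
    c = np.asarray(coeffs, dtype=float)
    n_bands = len(c) // m_sub
    bands = c[: n_bands * m_sub].reshape(n_bands, m_sub)
    current = np.sum(bands ** 2, axis=1)
    target = np.asarray(target, dtype=float)
    # Silent bands keep a zero gain instead of dividing by zero.
    ratio = np.divide(target, current, out=np.zeros(n_bands), where=current > 0)
    return (bands * np.sqrt(ratio)[:, None]).ravel()

hf = np.array([1.0, 1.0, 2.0, 2.0])
print(subband_energies(hf, 2))                 # [2. 8.]
print(scale_to_energies(hf, 2, [8.0, 2.0]))    # [2. 2. 1. 1.]
```

Applying a single gain per sub-band changes the spectral envelope while leaving the fine structure (signs and relative magnitudes within a sub-band) untouched.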
We found that the harmonic components are not sufficient in the reconstructed HF spectrum. It is necessary to adjust the high-frequency harmonic components with some LF harmonic components. Our experiments have shown that the audio quality can be improved by introducing some LF harmonic components into the HF spectrum. We use the following four steps to adjust the harmonic components.
Step 1. Apply the fast Fourier transform (FFT) to the LF-MDCT coefficients x(i), i = 1, …, M, and to the reconstructed HF-MDCT coefficients x̂(i), i = M + 1, …, 2M, respectively. The amplitude and phase of the discrete Fourier transform (DFT) coefficients derived from the LF-MDCT coefficients and the HF-MDCT coefficients are denoted as a_LF(f), φ_LF(f), a_HF(f), and φ_HF(f), f = 1, …, M;
Step 2. Select the five DFT coefficients derived from the LF-MDCT coefficients that have the largest amplitudes, and denote their frequency indexes as f_i, i = 1, …, 5. Replace a_HF(f_i), i = 1, …, 5, by a_LF(f_i), i = 1, …, 5, and keep φ_HF(f_i), i = 1, …, 5, unchanged;
Step 3. Apply the inverse FFT to the adjusted DFT coefficients, a_HF(f)exp[iφ_HF(f)];
Step 4. Modify high-frequency spectrum energy with the aforementioned method.
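Steps 1 to 3 can be sketched as follows. As a simplifying assumption, the sketch uses NumPy's real-input FFT (`rfft`/`irfft`) so that the adjusted coefficients stay real after the inverse transform; the text does not state how conjugate symmetry is handled. The function name `adjust_harmonics` is hypothetical, and Step 4 (the energy modification) is left out.

```python
import numpy as np

def adjust_harmonics(x_lf, x_hf, n_peaks=5):
    """Copy the n_peaks largest LF spectral amplitudes into the HF spectrum,
    keeping the HF phases, and return the adjusted HF coefficients."""
    x_lf = np.asarray(x_lf, dtype=float)
    x_hf = np.asarray(x_hf, dtype=float)
    if len(x_lf) != len(x_hf):
        raise ValueError("LF and HF coefficient blocks must have equal length")
    M = len(x_hf)
    spec_lf = np.fft.rfft(x_lf)                      # Step 1
    spec_hf = np.fft.rfft(x_hf)
    a_lf = np.abs(spec_lf)
    a_hf, phi_hf = np.abs(spec_hf), np.angle(spec_hf)
    peaks = np.argsort(a_lf)[-n_peaks:]              # Step 2: strongest LF bins
    a_hf[peaks] = a_lf[peaks]                        # replace amplitude, keep phase
    adjusted = np.fft.irfft(a_hf * np.exp(1j * phi_hf), n=M)   # Step 3
    return adjusted, peaks

# Usage with random stand-in coefficients.
rng = np.random.default_rng(0)
lf = rng.standard_normal(64)
hf = 0.1 * rng.standard_normal(64)
hf_adj, peak_bins = adjust_harmonics(lf, hf)
print(hf_adj.shape, np.sort(peak_bins))
```

All bins outside the selected peaks pass through unchanged, so the adjustment only adds the strongest LF harmonic structure to the predicted HF coefficients.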
Thus, we finish the reconstruction of the HF-MDCT coefficients. The full-band MDCT coefficients are obtained by connecting the LF-MDCT and HF-MDCT coefficients. The audio signals in the time domain can be reconstructed by an inverse MDCT.
4 Performance evaluation
In this section, reconstructed audio signals that are independent of any audio codec are first used for objective evaluation. According to the definition from the International Telecommunication Union-Telecommunication Standardization Sector (ITU-T), super wideband (SWB) audio signals with a 14-kHz bandwidth were sampled at 32 kHz and transformed into the frequency domain through the MDCT with a Kaiser-Bessel window. Every 20 ms, the most recent 1,280 audio samples were fed to the MDCT and transformed into a frame of 640 spectral coefficients centered at 25-Hz intervals. First, the last 360 coefficients, which lie above 7 kHz, were discarded to obtain the wideband (WB) audio signals with a 7-kHz bandwidth. Then, the proposed BWE method was employed to extend the bandwidth of these WB audio signals to 14 kHz. Finally, we compared the performance of the proposed method, the LE method, and the EHBE method. For the EHBE method, half-wave rectification was used as the non-linear device in our experiment, since it provided the best performance among the non-linear devices tested.
In the objective evaluation, 18 audio files were chosen for testing. The length of each audio signal is between 10 and 20 s, and all signals were sampled at 32 kHz. The test signals were divided into three types: simple audio, complicated audio, and singing audio. Each type includes six audio files. For the simple audio signals, fewer than four instruments are playing. For the complicated audio signals, many more than four instruments are playing. Here, singing with accompaniment is called singing audio. Moreover, the input of the BWE is the WB audio signals generated from the SWB audio signals, and its level was normalized to -26 dBov.
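The frame layout quoted above follows directly from the sampling rate and the frame period, as the short check below shows. The variable names are illustrative; the numbers are those given in the text.

```python
fs = 32_000                       # sampling rate in Hz
frame_ms = 20                     # frame advance in ms

n_coeffs = fs * frame_ms // 1000  # MDCT coefficients per frame
window_len = 2 * n_coeffs         # samples fed to the MDCT per frame
bin_hz = (fs / 2) / n_coeffs      # spacing of the MDCT coefficients in Hz

wb_bins = int(7_000 / bin_hz)     # coefficients kept for the 7-kHz WB signal
swb_bins = int(14_000 / bin_hz)   # coefficients up to the 14-kHz SWB cutoff
discarded = n_coeffs - wb_bins    # coefficients removed above 7 kHz
to_rebuild = swb_bins - wb_bins   # coefficients the BWE has to reconstruct

print(n_coeffs, window_len, bin_hz)              # 640 1280 25.0
print(wb_bins, swb_bins, discarded, to_rebuild)  # 280 560 360 280
```

So each frame retains 280 LF-MDCT coefficients, and the BWE has to rebuild the 280 coefficients between 7 and 14 kHz.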
where i denotes the index of the audio frame, and P_i and P̂_i are the power spectra of the original SWB audio signals and the extended audio signals, respectively. N_high and N_low are the indices corresponding to the upper and lower bounds of the frequency band from 7 to 14 kHz.
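A distance of this kind can be sketched as below, assuming that the objective measure is a log spectral distortion of the common form, that is, the root-mean-square difference in dB between the two power spectra over the band from N_low to N_high. This form is an assumption consistent with the symbols defined above; the function name and the small constant `eps`, which guards against taking the logarithm of zero, are hypothetical.

```python
import numpy as np

def log_spectral_distortion(P, P_hat, n_low, n_high, eps=1e-12):
    """Per-frame RMS difference in dB between two power spectra,
    evaluated on bins n_low..n_high (inclusive)."""
    P = np.asarray(P, dtype=float)[..., n_low: n_high + 1]
    P_hat = np.asarray(P_hat, dtype=float)[..., n_low: n_high + 1]
    diff_db = 10.0 * np.log10((P + eps) / (P_hat + eps))
    return np.sqrt(np.mean(diff_db ** 2, axis=-1))

# Usage: two frames of 640 bins, the second spectrum 10 dB below the first.
P = np.ones((2, 640))
print(log_spectral_distortion(P, P, 280, 559))        # zero distortion
print(log_spectral_distortion(P, 0.1 * P, 280, 559))  # about 10 dB per frame
```

Averaging the per-frame values over all frames of a file gives a single figure per test item, with lower values indicating a closer match to the original HF spectrum.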
Besides the objective evaluation, our PSR method and the reference methods were applied to the G.722.1 WB audio codec at 24 kb/s in order to further evaluate the subjective quality of the BWE methods. The extended audio signals were compared with the SWB audio signals reproduced by G.722.1C at the same bit rate by using a multiple stimuli with hidden reference and anchor (MUSHRA) listening test.
As shown in Figure 7, the subjective scores for the three different kinds of audio signals are similar in the MUSHRA test. In addition, the audio quality of G.722.1 with PSR is better than that of the original G.722.1 WB audio codec but is inferior to that of the G.722.1C SWB audio codec. The LE method performs worse than the PSR and EHBE methods. Both PSR and EHBE are preferred over the G.722.1 WB audio codec, while EHBE is better than the proposed PSR method on average. According to the informal listening tests, the tonal distortion caused by the non-linear filtering in EHBE is not readily perceived by listeners because the phase of the extended spectrum is comparatively continuous. Moreover, the PSR method can also reproduce SWB audio signals whose subjective quality is similar to that of the EHBE method. So, the proposed PSR method is able to improve the quality of the WB audio signals decoded by G.722.1.
5 Conclusions
This paper presents a blind bandwidth extension (BWE) method for audio signals. Our BWE method consists of phase space reconstruction and high-frequency reconstruction of the audio signals, and it incorporates both non-linear prediction and linear prediction. The objective tests show that the audio quality with the proposed method is slightly better than with the LE method and the EHBE method, so the method is effective in enhancing the quality of audio signals. In addition, the proposed method, the LE method, and the EHBE method were employed to extend the bandwidth of the audio signals decoded by the ITU-T G.722.1 WB codec. The subjective test results presented in this paper show that the proposed method greatly improved the auditory quality of the WB audio signals decoded by G.722.1. The bandwidth extension performance of the proposed method is higher than that of the typical LE method and is comparable to that of the conventional EHBE method.
This work was supported by the National Natural Science Foundation of China under grant numbers 61072089 and 60872027, and the Beijing Natural Science Foundation Program and Scientific Research Key Program of Beijing Municipal Commission of Education (no. KZ201110005005). The authors wish to thank Dr. Christian Ritz, an associate professor in the University of Wollongong, for polishing the English language and for commenting on our paper. The authors also wish to thank the anonymous reviewers for their valuable comments that helped in significantly improving the presentation of our paper.
1. Larsen E, Aarts RM: Audio Bandwidth Extension: Application of Psychoacoustics, Signal Processing and Loudspeaker Design. New York: Wiley; 2004.
2. Dietz M, Liljeryd L, Kjörling K, Kunz O: Spectral band replication, a novel approach in audio coding. In Proceedings of the 112th AES Convention. Munich, Germany; 2002.
3. Ekstrand P: Bandwidth extension of audio signals by spectral band replication. In Proceedings of the 1st IEEE Benelux Workshop on Model based Processing and Coding of Audio. Leuven, Belgium; 2002.
4. International Standards Organization: ISO/IEC 14496-3, Information technology - Coding of Audio-Visual Objects - Part 3: Audio. Geneva, Switzerland: International Standards Organization; 2011.
5. Meltzer S, Moser G: MPEG-4 HE-AAC v2 - audio coding for today's digital media world. EBU Technical Review 2006, 305: 37-38.
6. 3GPP: TS 26.290: Audio Codec Processing Functions - Extended Adaptive Multi-Rate-Wideband (AMR-WB+) codec; Transcoding functions. Sophia-Antipolis: 3GPP; 2004.
7. Jie Z, Kihyun C, Eunmi O: Bandwidth extension for China AVS-M standard. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'09). Taipei, Taiwan, China; 2009.
8. Liu CM, Lee WC, Hsu HW: High frequency reconstruction for band-limited audio signals. In Proceedings of the 6th International Conference on Digital Audio Effects (DAFX'03). London, UK; 2003.
9. Budsabathon C, Nishihara A: Bandwidth extension with hybrid signal extrapolation for audio coding. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 2007, 90(8):1564-1569.
10. Larsen E, Aarts M, Danessis M: Efficient high-frequency bandwidth extension of music and speech. In AES 112th Convention. Munich, Germany; 2002.
11. Sha Y-T, Bao C-C, Jia M-S, Liu X: High frequency reconstruction of audio signal based on chaotic prediction theory. In Proceedings of IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP'10). Dallas, Texas, USA; 2010.
12. Liu X, Bao C-C, Jia M-S, Sha Y-T: Nonlinear bandwidth extension based on nearest-neighbor matching. In Proceedings of the 2nd APSIPA ASC. Biopolis, Singapore; 2010.
13. Kantz H, Schreiber T: Nonlinear Time Series Analysis. 2nd edition. Cambridge: Cambridge University Press; 2004.
14. Abarbanel HDI, Brown R, Sidorowich JJ, Tsimring LS: The analysis of observed chaotic data in physical systems. Reviews of Modern Physics 1993, 65(4):1331-1392. 10.1103/RevModPhys.65.1331
15. Hayes MH: Statistical Digital Signal Processing and Modeling. New York: Wiley; 1996.
16. International Telecommunication Union: ITU-T Recommendation G.722.1 Annex C: Low Complexity Coding at 24 and 32 kb/s for Hands-Free Operation in Systems with Low Frame Loss Annex C 14 kHz Mode at 24, 32 and 48 kb/s. Geneva: International Telecommunication Union; 2009a.
17. Pulakka H, Laaksonen L, Vainio M, Pohjalainen J, Alku P: Evaluation of an artificial speech bandwidth extension method in three languages. IEEE Trans. Audio Speech Lang. Processing 2008, 16(6):1124-1137.
18. International Telecommunication Union: ITU-T Recommendation G.722.1, Low Complexity Coding at 24 and 32 kbit/s for Hands-Free Operation in Systems with Low Frame Loss. Geneva: International Telecommunication Union; 2009b.
19. International Telecommunication Union: ITU-R Recommendation BS.1534-1, Method for the Subjective Assessment of Intermediate Sound Quality (MUSHRA). Geneva: International Telecommunication Union; 2001.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.