
Audio bandwidth extension using ensemble of recurrent neural networks

Abstract

In audio communication systems, the perceptual quality of the reproduced audio signals, such as the naturalness of the sound, is limited by the available audio bandwidth. In this paper, a wideband to super-wideband audio bandwidth extension method is proposed using an ensemble of recurrent neural networks. The feature space of wideband audio is first divided into different regions through clustering. For each region in the feature space, a specific recurrent neural network with a sparsely connected hidden layer, referred to as an echo state network, is employed to dynamically model the mapping relationship between wideband audio features and the high-frequency spectral envelope. The outputs of the multiple echo state networks are then weighted and fused by means of network ensemble, in order to refine the estimate of the high-frequency spectral envelope. Finally, combined with the high-frequency fine spectrum extended by spectral translation, the proposed method can effectively extend the bandwidth of wideband audio to super wideband. Objective evaluation results show that the proposed method outperforms the hidden Markov model-based bandwidth extension method on average in terms of both static and dynamic distortions. Subjective listening tests indicate that the proposed method improves the auditory quality of the wideband audio signals and outperforms the reference method.

1 Introduction

Due to the restrictions of network transmission rate and data processing capability, the effective bandwidth of audio signals available in existing audio communication systems is commonly limited. To improve coding efficiency [1], a practical perceptual audio coding framework encodes and transmits only the low-frequency (LF) components of the audio. However, the auditory quality achieved by wideband (WB) audio reproduction, sampled at 16 kHz with a bandwidth of 50~7000 Hz, no longer satisfies state-of-the-art audio communication systems, which aim to ensure high-quality audio service with more brightness and expressiveness. Therefore, it is desirable to allow existing WB audio systems to achieve the auditory quality of super-wideband (SWB) audio systems, which are sampled at 32 kHz with a bandwidth of 50~14,000 Hz.

Among the methods to enhance the perceptual quality of WB audio signals, blind bandwidth extension (BWE) is designed to analyze the statistical relationship between the low-frequency and high-frequency components of WB audio signals and artificially restore the missing high-frequency (HF) components in the frequency range of 7000~14,000 Hz from the decoded WB signals at the decoder end. The advantage of this approach is that it avoids any modification of the source coding and the network transmission process [2]. In recent decades, many blind BWE solutions have been developed for speech and audio signals, and these typical BWE methods perform two main tasks, namely, the estimation of the spectral envelope and the extension of the fine spectrum [1, 2]. Informal listening test results indicate that the estimation accuracy of the HF spectral envelope is crucial to the improvement of auditory quality for the reproduced signals [3]. Therefore, most BWE approaches have concentrated on modelling the mapping relationship between LF and HF spectral coefficients based on statistical learning methods and further estimated the HF spectral envelope under some error criterion. In 1994, a statistical recovery method [4] was proposed to predict the HF spectrum and achieved an improved perceptual quality of the reproduced signals. In the same year, Carl and Heute proposed a spectral envelope estimation method based on codebook mapping (CBM) [5]. This method explored the joint codebook between LF features and the HF spectral envelope and modelled their bijective mapping relationship by training. Further, improved methods with interpolation, soft decision, and split codebooks were proposed to decrease the spectral distortion caused by the single codebook [6–8]. In 2000, a Gaussian mixture model-based spectral envelope estimation method was proposed [9]. The Gaussian mixture model (GMM) is adopted to mimic the joint probability density between LF and HF spectral coefficients, and the HF spectral envelope is estimated under the minimum mean square error criterion. The GMM method builds a soft clustering-based statistical model that restrains the spectral discontinuity of the audio signals reproduced by CBM and achieves better performance in both subjective and objective tests. In addition, feed-forward neural networks (FNNs) have also been adopted to extend the spectral envelope [10, 11]. Iser and Schmidt compared FNN and CBM in the context of the BWE application [12]. The results show that there was no significant difference in the auditory quality of the extended signals between these two methods, but the FNN method required lower computational complexity than the CBM method.

The abovementioned methods focus on the relationship between LF and HF spectral coefficients in the current frame and exploit only the static characteristics of the audio spectrum. Jax and Vary brought the inter-frame correlation of audio features into the extension of the spectral envelope and employed a hidden Markov model (HMM) to represent the temporal evolution of the spectral envelope for audio signals [13, 14]. Other research results also show that the HMM-based BWE method obtains better performance in terms of both subjective and objective quality [15, 16]. In our previous work [17], we adopted the HMM-based estimator to extend the bandwidth of WB audio to SWB and obtained good subjective and objective quality of the reproduced audio signals. While improved results are achieved in [17], some dynamic distortion is still perceived. This is caused by the fact that only discrete states are adopted to piecewise approximate the actual evolution of the audio spectrum over time. Thus, in this paper, we further introduce a continuous state-space model into the spectral envelope extension and propose a blind BWE method based on an ensemble of recurrent neural networks. The feature space of wideband audio is first divided into different regions through clustering. For each region in the feature space, a specific recurrent neural network with a sparsely connected hidden layer, referred to as an echo state network (ESN), is employed to dynamically model the mapping relationship between wideband audio features and the high-frequency spectral envelope. Different from traditional fully connected recurrent neural networks, an ESN adopts a sparsely connected hidden layer in which the connectivity and weights of the hidden neuron nodes are fixed and randomly assigned. The recurrent nature of the connections turns the time-varying WB audio features into specific high-dimensional temporal patterns. Then, a simple linear transformation is used to map the state of the hidden neuron nodes to the desired HF spectral envelope. Given a rich collection of recurrently connected nodes, the ESN is able to effectively fit the nonlinear mapping function between WB audio features and the HF spectral envelope. Next, the outputs of the parallel ESNs for the different regions in the feature space are weighted and fused by means of network ensemble techniques, forming an ensemble echo state network (EESN) for further estimating the HF spectral envelope. Finally, combined with the HF fine spectrum extended by spectral translation, the proposed method can effectively extend the bandwidth of WB audio to SWB and upgrade its subjective and objective quality.

2 Audio bandwidth extension using ensemble echo state network

2.1 Overview

The block diagram of the proposed blind BWE method is shown in Fig. 1. The input signals are WB audio signals sampled at 16 kHz with a bandwidth of 7 kHz. Through up-sampling and low-pass filtering, the audio signals sampled at 32 kHz with a bandwidth of 7 kHz are separated into 32-ms frames with 16-ms overlap. Then, the audio signals are windowed with the Hamming window and transformed into the frequency domain using the fast Fourier transform (FFT). The audio amplitude spectrum A(k) in the frequency range of 64~7000 Hz is mapped onto the perceptually correlated equivalent rectangular bandwidth (ERB) scale by 20-channel triangular overlapping windows, in order to extract the 20-dimensional Mel frequency cepstral coefficient (MFCC) vector F_MFCC. Next, the extracted MFCCs are normalized to the range between 0 and 1 via min-max normalization. The normalized MFCCs are fed to the pre-trained EESN for estimating the HF spectral envelope. Here, the HF spectral envelope, F_RMS(i), i = 0,…, 3, is represented by the root mean square (RMS) values of the audio amplitude spectrum in four nonuniform sub-bands in the frequency band of 7~14 kHz as follows:

$$ F_{\mathrm{RMS}}(i)=\sqrt{\frac{1}{h(i)-l(i)+1}\sum_{k=l(i)}^{h(i)}A^{2}(k)},\quad i=0,1,\dots,3 $$
(1)

where A(k) is the spectral magnitude of the audio signals and l(i) and h(i) correspond to the lower and upper boundaries of the frequency bins in the ith sub-band, respectively. Note that there is no overlap between adjacent sub-bands. The central frequencies of the HF sub-bands are arranged at 8470, 9338, 11,653, and 13,657 Hz, according to the perceptually correlated ERB scale.
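For concreteness, the sub-band RMS computation of Eq. (1) can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation; the band-edge indices below are placeholders, since the exact ERB-derived bin boundaries are not listed in the paper.

```python
import numpy as np

def subband_rms(A, band_edges):
    """Eq. (1): RMS of the amplitude spectrum A(k) in each sub-band.

    A          : 1-D array of FFT amplitude values for one frame
    band_edges : list of (l_i, h_i) inclusive bin-index pairs for the
                 non-overlapping sub-bands (ERB-derived in the paper)
    """
    return np.array([
        np.sqrt(np.mean(A[l:h + 1] ** 2))  # sqrt of (1/(h-l+1)) * sum A^2(k)
        for l, h in band_edges
    ])

# Hypothetical HF band edges: bins 224..447 of a 448-bin half-spectrum
# at 32 kHz cover 7~14 kHz; the true ERB-based edges differ.
hf_edges = [(224, 271), (272, 319), (320, 383), (384, 447)]
```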

Fig. 1. Block diagram of the proposed blind bandwidth extension method for audio signals

The extension of the fine spectrum is less critical to the subjective auditory quality of the extended audio signals than the extension of the spectral envelope [2]. Thus, in order to reduce algorithm complexity, the HF fine spectrum is extended using the simple spectral translation method [2]. The LF components are replicated into the HF band in the range of 7~14 kHz. The spectral envelope of the extended HF components is further adjusted by employing the estimated RMS values F̂_RMS(i) of the HF sub-bands. Finally, the reconstructed HF components are converted to the time domain by the inverse FFT, windowing, overlapping, and addition operations. By combining them with the appropriately delayed WB audio signals, the proposed method reconstructs SWB audio signals sampled at 32 kHz with a bandwidth of 14 kHz. Details of the proposed BWE method are discussed below.

2.2 Estimation of spectral envelope based on EESN

Let F_X(m) denote the MFCCs of the WB audio signals in the mth frame, whose dimension is set to d_X = 20. F_Y(m) is the RMS value of the HF sub-bands in the mth frame, and its dimension is set to d_Y = 4. Through an unknown mapping function H(·), the WB features F_X(m) can be converted into the HF spectral envelope F_Y(m) as follows:

$$ F_Y = H(F_X) $$
(2)

We assume that the unknown mapping function can be modelled by ensembling several different state-space models. If the kth state-space model is applied, the process of regenerating the HF spectral envelope is approximated by

$$ S_{\mathrm{hidden}}(m,k)=H_{\mathrm{state}}^{(k)}\left(S_{\mathrm{hidden}}(m-1,k),\,F_X(m-1),\,U(m,k)\right) $$
(3)
$$ \widehat{F}_Y(m|k)=H_{\mathrm{observation}}^{(k)}\left(S_{\mathrm{hidden}}(m,k),\,F_X(m),\,V(m,k)\right) $$
(4)

where S_hidden(m, k), with dimension d_S, is the hidden state vector of the kth state-space model in the mth frame, and its temporal evolution determines the dynamic characteristics of the state space; F̂_Y(m|k) is the HF spectral envelope estimated by the kth state-space model in the mth frame; H_state^(k)(·) is referred to as the state updating equation of the kth state-space model and describes the evolution of the hidden states given the input vector F_X(m − 1); H_observation^(k)(·) is referred to as the observation equation of the kth state-space model and denotes the mapping from F_X(m) to F̂_Y(m|k) controlled by the hidden state S_hidden(m, k); and U(m, k) and V(m, k) represent the errors introduced by the state updating and observation processes, respectively.

By integrating different state-space models, we can estimate the HF spectral envelope F_Y(m) from the WB audio feature F_X(m). Two important issues therefore remain, namely, how to build an effective state-space model for approximating the true mapping process and how to choose the suitable state-space models among the candidate models at each frame.

2.2.1 Model structure of ESN

H_state^(k)(·) and H_observation^(k)(·) can be separately represented by two FNNs to build a state-space-based recurrent neural network [18] for fitting the true mapping process. Motivated by this, the ESN [19, 20] is employed as the state-space model in this paper. Different from traditional fully connected recurrent neural networks, the ESN adopts a sparsely connected hidden layer in which the connectivity and weights of the hidden neuron nodes are randomly assigned and unchanged during training. The recurrent nature of the connections turns the time-varying WB audio features into specific high-dimensional temporal patterns. From the perspective of dynamic system theory, the states of the hidden neuron nodes serve as a memory of the input audio features and are termed echoes of the input history; this is where the ESN draws its name from. Given a rich collection of recurrently connected nodes with nonlinear activation functions, such a hidden layer, being an input-driven dynamical model, can provide a sufficiently rich and relevant feature space, such that the desired HF spectral envelope can be obtained by linear combination of the states of the hidden neuron nodes.

In the ESN-based state-space model, the updating process of hidden states for the kth ESN is represented by a leaky-integrated nonlinear activation unit [21] as follows:

$$ \tilde{S}_{\mathrm{hidden}}(m,k)=\tanh\left(W^{\mathrm{in}}(k)\begin{bmatrix}1\\ F_X(m-1)\end{bmatrix}+W^{\mathrm{res}}(k)\,S_{\mathrm{hidden}}(m-1,k)\right) $$
(5)
$$ S_{\mathrm{hidden}}(m,k)=(1-\alpha)\,S_{\mathrm{hidden}}(m-1,k)+\alpha\,\tilde{S}_{\mathrm{hidden}}(m,k) $$
(6)

where S̃_hidden(m, k) is the updating state of S_hidden(m, k) in the mth frame. The leaking rate α of the leaky-integrated nonlinear unit represents the dynamic updating speed of the hidden states. When the value of α is small, the dynamic change of the hidden states is slow. If α is close to 1, the nonlinear unit approximates a tanh function. W^in(k) is the input weight matrix with dimension d_S × (d_X + 1), and the elements of W^in(k) are uniformly distributed in [−a_in, a_in]. The scaling factor a_in of W^in decides the nonlinearity of the activation units. When a_in is close to 0, the leaky-integrated nonlinear activation units approach linear functions; for a larger a_in, the state updating process driven by F_X(m − 1) exhibits more nonlinearity. W^res(k) is the recurrent weight matrix with dimension d_S × d_S, and its spectral radius a_res can be adjusted to control network stability in actual applications. It is noteworthy that a_res and a_in jointly decide the relative importance of F_X(m − 1) and S_hidden(m − 1, k) in the state updating process. If a_res is larger, the previous hidden states have more influence on the current hidden states, and the recurrent network shows more long-term correlation; otherwise, the input vectors become the key factor driving the hidden states. In addition, some research works have indicated that sparsely connected hidden states can slightly improve the estimation accuracy of the ESN and accelerate model training [19, 20]. Thus, we can change the percentage of nonzero elements relative to the total number of elements in W^res to adjust the sparsity f_sparsity of the recurrent weight matrix.

The ESN takes advantage of a simple linear mapping function to represent the observation equation H_observation^(k)(·) [19] in the following way:

$$ \widehat{F}_Y(m|k)=W^{\mathrm{out}}(k)\begin{bmatrix}1\\ F_X(m)\\ S_{\mathrm{hidden}}(m,k)\end{bmatrix} $$
(7)

where W^out(k) is the output weight matrix of the kth ESN, with dimension d_Y × (1 + d_X + d_S).

From the perspective of network structure, the ESN includes an input layer (F_X), a hidden layer (S_hidden), and an output layer (F_Y), as shown in Fig. 2. Here, the hidden layer, also referred to as the reservoir of the ESN, consists of a large number of sparsely connected neuron nodes. With the help of the nonlinearity, sparsity, and feedback structure of the neurons, the reservoir expands the audio features into a nonlinear high-dimensional feature space and provides more temporal context, so as to make the recurrent neural network exhibit rich dynamic characteristics [22]. The conclusions of relevant studies on recurrent learning algorithms [23] show that the weight dynamics favor a functional partition of the recurrent network into a fast output layer and a slower dynamical reservoir, whose rates of weight change are closely coupled. Therefore, the ESN randomly generates the input and recurrent weight matrices at network initialization and fixes them during learning. The network is then optimized by adjusting only the output weight matrix to fit the unknown nonlinear mapping function H(·).

Fig. 2. Network structure of a single echo state network
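To make this structure concrete, the sketch below implements a single ESN with the state update of Eqs. (5)-(6) and the linear readout of Eq. (7) in numpy. It is a minimal illustration under our own naming, not the authors' code; the weight initialization follows the heuristic described in Section 2.3, and the default parameter values are taken from the single-ESN grid search results in Section 3.

```python
import numpy as np

class ESN:
    """Single echo state network implementing Eqs. (5)-(7)."""

    def __init__(self, d_x=20, d_y=4, d_s=500, a_in=0.1094,
                 a_res=0.625, f_sparsity=0.2188, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.alpha = alpha
        # Input weights uniform in [-a_in, a_in]; the extra column is the bias
        self.W_in = rng.uniform(-a_in, a_in, (d_s, d_x + 1))
        # Sparse random matrix rescaled to spectral radius a_res (Sec. 2.3)
        W = rng.uniform(-1.0, 1.0, (d_s, d_s))
        W[rng.random((d_s, d_s)) > f_sparsity] = 0.0  # keep f_sparsity nonzero
        lam = np.max(np.abs(np.linalg.eigvals(W)))
        self.W_res = a_res * W / lam
        self.W_out = np.zeros((d_y, 1 + d_x + d_s))   # learned by ridge regression
        self.s = np.zeros(d_s)                        # S_hidden, initialized to 0

    def update(self, f_x_prev):
        """Eqs. (5)-(6): leaky-integrated state update driven by F_X(m-1)."""
        s_tilde = np.tanh(self.W_in @ np.r_[1.0, f_x_prev]
                          + self.W_res @ self.s)
        self.s = (1.0 - self.alpha) * self.s + self.alpha * s_tilde
        return self.s

    def readout(self, f_x):
        """Eq. (7): linear mapping from [1; F_X(m); S_hidden(m)] to F_Y."""
        return self.W_out @ np.r_[1.0, f_x, self.s]
```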

2.3 Model training for ESN

The WB audio feature F_X(m) and the HF spectral envelope F_Y(m) are separately extracted frame by frame from the aligned WB and SWB audio signals available for the training procedure, so as to form the training sample data [F_X(m)^T, F_Y(m)^T]^T, where m = 0, 1,…, N_train − 1 denotes the index of the audio frame. In the model training of the ESN, the hidden state S_hidden(−1, k) of the kth ESN model is first initialized to 0. Next, F_X(m) is fed into the multiple pre-defined ESNs to drive the reservoirs. Then, by constantly updating the hidden states, the reservoirs exhibit different dynamic characteristics. Finally, the output weight matrix of the kth ESN is computed from the information collected from the sequences of F_X(m), F_Y(m), and S_hidden(m, k). A more detailed description of model training is provided below.

1. Initializing the reservoir parameters

The initialization of the reservoir parameters uses a heuristic method [19] as follows.

(a) A matrix with dimension d_S × d_S is randomly created, with element values uniformly distributed within [−1, 1]. A fraction of the small element values within the random matrix are set to 0, in order to form a sparse matrix W. The sparsity f_sparsity adjusts the percentage of nonzero elements relative to the total number of elements in W. Then, W is normalized by its spectral radius λ_max, i.e., W/λ_max.

(b) The resulting sparse matrix is scaled to obtain the recurrent weight matrix, i.e., W^res = a_res W/|λ_max|. Here, a_res < 1 is the spectral radius of W^res and can be experimentally adjusted.

(c) The input weight matrix W^in is randomly created according to a uniform distribution.

2. Driving the reservoir

The hidden state S_hidden(m, k) is updated according to the two model parameters {W^res(k), W^in(k)}. However, the update process is affected by the initial hidden state S_hidden(−1, k) [19]. Therefore, we set a network stability threshold N_T = 13 and assume that the whole network approaches stability after the first N_T state updates.

Starting with the N_T-th update, the sample data F_X(m) of the WB audio feature, the hidden states S_hidden(m, k), and the sample data F_Y(m) of the HF spectral envelope are collected to establish the state collection matrix B and the expected output matrix Q. The column vectors of the state collection matrix B(k) for the kth ESN are [1, F_X(m)^T, S_hidden(m, k)^T]^T, so the dimension of B(k) is (1 + d_X + d_S) × (N_train − N_T). The column vectors of the expected output matrix Q(k) for the kth ESN are F_Y(m), so the dimension of Q(k) is d_Y × (N_train − N_T). Through this method of data collection, the negative effect of S_hidden(−1, k) on the ESN is effectively eliminated, and the mapping relationship between the WB audio feature F_X(m) and the HF spectral envelope F_Y(m) can be properly captured by the model trained from B(k) and Q(k).

3. Computing the output weight matrix

The output weight matrix W^out(k) for the kth ESN is computed from B(k) and Q(k) to minimize the error between the network output F̂_Y(m|k) and the true HF spectral envelope F_Y(m). Informal evaluation indicates that, when the element values of W^out(k) are large, W^out(k) may amplify slight differences in F_X(m), and over-fitting may occur during network training. Therefore, the ridge regression method is employed to avoid over-fitting as follows:

$$ W^{\mathrm{out}}(k)=\underset{W^{\mathrm{out}}(k)}{\arg\min}\left(\frac{1}{N_{\mathrm{train}}-N_T}\sum_{m=0}^{N_{\mathrm{train}}-N_T-1}\left\Vert F_Y(m)-\widehat{F}_Y(m|k)\right\Vert^{2}+\beta\left\Vert W^{\mathrm{out}}(k)\right\Vert^{2}\right) $$
(8)

where β‖W^out(k)‖² is the regularization term penalizing large elements in W^out(k), and the relative importance of the two terms in the above error function can be controlled by adjusting β.

Finally, the resulting W out(k) can be given as

$$ W^{\mathrm{out}}(k)=Q(k)\,B(k)^{T}\left[B(k)\,B(k)^{T}+\beta I\right]^{-1} $$
(9)

where I is the identity matrix.

The connection weight matrices W^res(k), W^in(k), and W^out(k) for the kth ESN are obtained using the training method above. In the actual extension, the frame-wise MFCCs are fed into the ESNs, and the hidden states S_hidden(m, k) are updated with the change of F_MFCC on a frame-by-frame basis. The dynamic characteristics of the WB audio features are thus brought into the network model. Further, the HF spectral envelope F_Y(m) is effectively estimated by fusing the outputs F̂_Y(m|k) of the multiple ESNs.
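The three training steps above condense into a short routine. The sketch below assumes the ESN class from the previous listing; it is an illustrative reading of the procedure, not the authors' code, and driving the reservoir with a zero vector in place of the non-existent input F_X(−1) at m = 0 is our own assumption.

```python
def train_esn(esn, F_X, F_Y, n_t=13, beta=3.0):
    """Fit W_out by ridge regression, Eqs. (8)-(9).

    F_X : (N_train, d_x) wideband MFCC sequence
    F_Y : (N_train, d_y) target HF sub-band RMS sequence
    """
    B_cols, Q_cols = [], []
    esn.s[:] = 0.0                            # S_hidden(-1, k) = 0
    for m in range(len(F_X)):
        # F_X(-1) does not exist; a zero vector is our assumption for m = 0
        f_prev = F_X[m - 1] if m > 0 else np.zeros_like(F_X[0])
        esn.update(f_prev)                    # Eqs. (5)-(6)
        if m >= n_t:                          # discard the first N_T transients
            B_cols.append(np.r_[1.0, F_X[m], esn.s])
            Q_cols.append(F_Y[m])
    B = np.array(B_cols).T                    # (1 + d_x + d_s) x (N_train - N_T)
    Q = np.array(Q_cols).T                    # d_y x (N_train - N_T)
    # Eq. (9): W_out = Q B^T (B B^T + beta*I)^(-1)
    esn.W_out = Q @ B.T @ np.linalg.inv(B @ B.T + beta * np.eye(B.shape[0]))
    return esn
```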

2.3.1 Network ensemble fusion

According to network ensemble theory, an ensemble of models frequently performs better than any individual model, because the errors of the individual models average out [24]. In this paper, a network ensemble framework based on the statistical distribution of the input feature space is adopted to integrate multiple ESNs into an ensemble echo state network.

Figure 3 shows the block diagram of the EESN-based spectral envelope estimator. In the model training procedure of the EESN, we first selected the GMM-based clustering method to separate the feature space of the input WB audio feature F_X into N_ESN regions. The mean vector of each Gaussian component corresponds to a clustering center of the feature space. Accordingly, the training sample data [F_X(m)^T, F_Y(m)^T]^T, m = 0, 1,…, N_train − 1, can be segmented into N_ESN groups on the basis of this separation of the feature space. Assuming that the statistical properties of each group of input features are similar, N_ESN ESNs with the same network structure are built to model the local mapping relationship between F_X and F_Y within the corresponding N_ESN regions of the feature space. It is worth mentioning that the WB audio features extracted from the training data can be simultaneously fed into all the ESNs to continuously drive their state updating equations, since their input and recurrent weight matrices are unchanged during model training, while the output weight matrices of the different ESNs need to be trained separately according to their corresponding regions of the feature space. Therefore, the feature space separation-based network ensemble framework does not actually affect the temporal modelling of the ESNs.

Fig. 3. Block diagram of the EESN-based spectral envelope estimator

In the actual bandwidth extension procedure, the mth-frame audio feature F_X(m) is fed into all of these ESNs, so as to obtain the corresponding estimates F̂_Y(m|k) of the HF spectral envelope from the kth ESN. Meanwhile, according to the probability distribution of the WB audio feature modelled by the GMM, the a posteriori probability p(k|F_X(m)) of the kth region in the feature space is computed and adopted as the weight for integrating the HF spectral envelopes estimated by the different ESNs.

Within the GMM framework, p(k|F_X(m)) is computed as follows:

$$ p(k|F_X(m))=\frac{p(F_X(m)|k)\,p(k)}{\sum_{i=0}^{N_{\mathrm{ESN}}-1}p(F_X(m)|i)\,p(i)} $$
(10)

where p(k) is the a priori probability of the kth region in the feature space, which satisfies the following condition:

$$ \sum_{k=0}^{N_{\mathrm{ESN}}-1}p(k)=1,\quad 0\le p(k)\le 1 $$
(11)

p(F_X(m)|k) is the probability of F_X(m) given the kth region and is represented as

$$ p(F_X(m)|k)=N(F_X(m);m_k,C_k)=\frac{1}{(2\pi)^{d_X/2}\left|C_k\right|^{1/2}}\exp\left[-\frac{1}{2}\left(F_X(m)-m_k\right)^{T}C_k^{-1}\left(F_X(m)-m_k\right)\right] $$
(12)

where N(·) represents the Gaussian probability density function, and m_k and C_k are the mean and covariance of the kth Gaussian component, respectively, trained with the standard EM algorithm. Finally, in the GMM-based soft-decision ensemble framework, the multiple ESNs are linearly combined by taking advantage of the posterior probabilities p(k|F_X(m)), and the estimated HF spectral envelope F̂_Y(m) is given by

$$ \widehat{F}_Y(m)=\sum_{k=0}^{N_{\mathrm{ESN}}-1}p(k|F_X(m))\,\widehat{F}_Y(m|k)=\sum_{k=0}^{N_{\mathrm{ESN}}-1}p(k|F_X(m))\,W^{\mathrm{out}}(k)\begin{bmatrix}1\\ F_X(m)\\ S_{\mathrm{hidden}}(m,k)\end{bmatrix} $$
(13)
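At run time, the fusion of Eqs. (10)-(13) amounts to a posterior-weighted average of the per-region readouts. A minimal sketch, assuming the ESN class and train_esn routine above, and using scikit-learn's GaussianMixture for the clustering and the posterior weights (the paper trains the GMM with the standard EM algorithm, which GaussianMixture also uses); F_X_train is a hypothetical array of training MFCC frames:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Training time: cluster the WB feature space into N_ESN regions.
# F_X_train: hypothetical (N_train, 20) array of MFCC frames.
N_ESN = 8
gmm = GaussianMixture(n_components=N_ESN, covariance_type='full')
gmm.fit(F_X_train)

def eesn_estimate(esns, gmm, f_x_prev, f_x):
    """Eq. (13): posterior-weighted fusion of the per-region ESN outputs."""
    outputs = []
    for esn in esns:                      # every ESN is driven by every frame
        esn.update(f_x_prev)              # Eqs. (5)-(6)
        outputs.append(esn.readout(f_x))  # Eq. (7), region-specific W_out
    # Eq. (10): posterior p(k | F_X(m)) from the fitted GMM
    w = gmm.predict_proba(f_x.reshape(1, -1))[0]
    return w @ np.array(outputs)          # weighted sum over the N_ESN regions
```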

2.4 High-frequency component reconstruction

In general, the LF and HF components have a similar spectral structure. Slight differences in the fine structure of the audio spectrum do not significantly affect the auditory quality of the bandwidth-extended audio signals, provided that the HF spectral envelope is well restored. Thus, in the proposed blind BWE method, spectral translation is adopted to directly copy the fine spectrum in the frequency range of 0~7 kHz to the HF band of 7~14 kHz. Here, the LF fine spectrum is represented by the normalized spectral magnitude A_norm(k), k = 0,…, 223, as follows:

$$ A_{\mathrm{norm}}(k)=\frac{A(k)}{F_{\mathrm{RMS\_WB}}(i)},\quad i=\left\lfloor \frac{k}{N_{\mathrm{subband}}}\right\rfloor $$
(14)

where A(k) is the spectral magnitude of the audio signals and F_RMS_WB(i) is the RMS value of the ith LF sub-band, describing the LF spectral envelope of the audio signals. Here, in order to maintain the spectral flatness of A_norm(k), the LF spectrum is uniformly divided into 14 sub-bands on the linear frequency scale, and each LF sub-band contains N_subband = 16 frequency bins. Then, the HF fine spectrum extended by spectral translation is given by

$$ A_{\mathrm{norm}}(k)=A_{\mathrm{norm}}(k-224),\quad k=224,\dots,447 $$
(15)

Since the HF spectral envelope is analyzed on the ERB scale, the HF fine spectrum is also divided into four non-overlapping sub-bands. By combining the estimated HF sub-band RMS values F̂_RMS(i), i = 0,…, 3, the extended HF spectrum A_swb(k), k = 224,…, 447, can be expressed as

$$ A_{\mathrm{swb}}(k)=A_{\mathrm{norm}}(k)\,\widehat{F}_{\mathrm{RMS}}(i),\quad i=\mathrm{Subband}(k) $$
(16)

in which Subband(k) represents the index of the HF sub-band corresponding to the kth frequency bin.

The phase of the HF spectrum, θ(k), k = 224,…, 447, is also obtained by spectral translation as follows:

$$ \theta(k)=\theta(k-224),\quad k=224,\dots,447 $$
(17)

Finally, the HF spectrum is converted into the time domain by inverse FFT. The resulting HF signals are combined with the properly delayed WB signals to reconstruct the SWB audio signals.
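The whole HF reconstruction of Eqs. (14)-(17) can be sketched as a single per-frame routine. This is a minimal illustration under our own naming; hf_edges is the same placeholder band-edge list used in the first listing.

```python
def reconstruct_hf(A, theta, F_rms_hat, hf_edges):
    """Spectral translation, Eqs. (14)-(17).

    A, theta  : magnitude and phase of one 32-kHz frame (bins 0..447 used)
    F_rms_hat : estimated RMS of the four HF sub-bands (EESN output)
    hf_edges  : (l_i, h_i) inclusive bin pairs of the HF sub-bands
    """
    # Eq. (14): flatten the LF spectrum with 14 uniform sub-bands of 16 bins
    A_norm = A[:224].copy()
    for i in range(14):
        sl = slice(i * 16, (i + 1) * 16)
        A_norm[sl] /= np.sqrt(np.mean(A[sl] ** 2))  # divide by F_RMS_WB(i)
    A_swb = A.copy()
    theta_swb = theta.copy()
    for i, (l, h) in enumerate(hf_edges):
        # Eqs. (15)-(16): translate the fine spectrum and impose the envelope
        A_swb[l:h + 1] = A_norm[l - 224:h + 1 - 224] * F_rms_hat[i]
        # Eq. (17): the phase is translated in the same way
        theta_swb[l:h + 1] = theta[l - 224:h + 1 - 224]
    return A_swb, theta_swb
```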

3 Parameter selection for EESN

In this section, we conducted preliminary experiments to select appropriate model parameters according to the objective performance of the spectral envelope extension. The SWB audio data used for parameter selection came from live concert recordings of about 6 h in length and contained dialog, pop music, singing, and live background sounds. These signals were sampled at 32 kHz with a bandwidth of 14 kHz and stored as 16-bit PCM. After low-pass filtering and time alignment, the parallel WB training dataset was built. Twenty-dimensional MFCCs and four-dimensional HF sub-band RMS values were extracted from the WB and SWB datasets as the input vectors F_X and the expected output vectors F_Y for the EESN, respectively. Seventy-five percent of these sample data were randomly selected for training the EESN model, and the rest were used as the validation set for parameter selection.

In this section, the mean square error between the original and estimated spectral envelopes in the frequency range of 7~14 kHz was selected as the objective measurement for evaluating the performance of the EESN method. The estimation error of the HF spectral envelope is defined as

$$ e_{\mathrm{MS}}(i)=\frac{1}{4}\sum_{n=0}^{3}\left(F_{\mathrm{RMS}}(i,n)-\widehat{F}_{\mathrm{RMS}}(i,n)\right)^{2} $$
(18)

where e_MS(i) is the mean square error of the HF spectral envelope for the ith frame, F_RMS(i, n) and F̂_RMS(i, n) are the true and estimated RMS values of the nth HF sub-band in the ith audio frame, respectively, and n = 0, 1, 2, 3 is the index of the HF sub-band. Before computing the mean square error, all the data were resampled to 32 kHz and temporally aligned with the original SWB audio. Then, the resulting error values were averaged over all the frames of each audio signal in the validation set and used as the distortion measure of the spectral envelope estimation.

The extension performance of the EESN method is mainly affected by seven model parameters: the scaling factor of W^in (a_in), the spectral radius of W^res (a_res), the sparsity of W^res (f_sparsity), the leaking rate (α), the regularization factor for ridge regression (β), the dimension of the hidden state (d_S), and the number of ESNs (N_ESN). In the EESN, different ESNs are generally independent of each other, so we can first reduce the EESN model to a single ESN model by setting N_ESN = 1 and optimize the ESN model parameters (a_in, a_res, f_sparsity, α, β, and d_S) using grid search. Next, several ESNs are constructed with the "optimal" ESN parameters and integrated to form the EESN, and the optimal value of N_ESN is then determined manually, guided by evaluation on the validation set.

3.1 ESN model parameters

In order to improve search efficiency, a two-stage grid search was conducted to optimize the ESN model parameters, while the number of ESNs was preset to 1.

In the first-stage search, we manually set the bounds for each model parameter according to its actual characteristics in the ESN model. Then, a finite set of reasonable values for each parameter was pre-selected, as shown in Table 1, so as to perform a coarse grid search over wide parameter intervals. Next, an ESN was trained with each candidate in the Cartesian product of these six parameter sets. Finally, their performance was evaluated on the validation set in terms of the estimation error of the HF spectral envelope. According to the experimental results, the parameter setting that achieves the lowest estimation error (12.7824) in the validation procedure is {a_in = 1/8, a_res = 1/2, f_sparsity = 1/4, α = 1, β = 3, d_S = 400}. The resulting parameter setting can be used to identify the promising region for each model parameter: a_in ∈ (1/16, 1/4), a_res ∈ (1/4, 3/4), f_sparsity ∈ (1/8, 1/2), α ∈ (4/5, 1), β ∈ (2, 4), and d_S ∈ (200, 600).

Table 1 Candidates of the ESN model parameters for coarse grid search

The second-stage grid search was then conducted with smaller steps, so as to determine the "optimal" ESN model parameters. For each model parameter, five grid points were equally spaced within the corresponding promising region. Then, all 5^6 possible combinations of the six ESN model parameters were evaluated, and the lowest estimation error (12.4961) of the HF spectral envelope was obtained at {a_in = 0.1094, a_res = 0.625, f_sparsity = 0.2188, α = 1, β = 3, d_S = 500}.

In addition, we should note that, according to the evaluation results of the two-stage grid search, different leaking rates of the leaky-integrated nonlinear activation unit produce no evident differences in the estimation error of the HF spectral envelope, although α = 1 provides a smaller estimation error than the other values. This suggests that the long-term memory of the hidden states does not clearly improve the performance of the ESN model.
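The search itself is a plain exhaustive sweep; a sketch of one stage follows, assuming the ESN class, train_esn, and a hypothetical validation_error helper that averages Eq. (18) over the validation set. The candidate sets below are illustrative, not the full Table 1.

```python
import itertools

grid = {  # illustrative candidate sets for one search stage
    'a_in': [1/16, 1/8, 1/4],
    'a_res': [1/4, 1/2, 3/4],
    'f_sparsity': [1/8, 1/4, 1/2],
    'alpha': [0.8, 1.0],
    'beta': [2, 3, 4],
    'd_s': [200, 400, 600],
}

best_err, best_cfg = float('inf'), None
for values in itertools.product(*grid.values()):
    cfg = dict(zip(grid, values))
    esn = ESN(d_s=cfg['d_s'], a_in=cfg['a_in'], a_res=cfg['a_res'],
              f_sparsity=cfg['f_sparsity'], alpha=cfg['alpha'])
    train_esn(esn, F_X_train, F_Y_train, beta=cfg['beta'])
    err = validation_error(esn, F_X_val, F_Y_val)  # mean of Eq. (18), assumed
    if err < best_err:
        best_err, best_cfg = err, cfg
```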

3.2 Number of ESNs

The EESN adopts the a posteriori probabilities of each region in the feature space as weights to integrate the outputs of the different ESNs, which model the local mapping functions between the LF and HF components within the corresponding regions of the feature space. Therefore, the number of ESNs is an important factor for the proposed EESN-based BWE method. According to the results of the two-stage grid search, the model parameters of each ESN are set as a_in = 0.1094, a_res = 0.625, f_sparsity = 0.2188, α = 1, β = 3, and d_S = 50. Then, we experimented with several values of N_ESN to form different EESNs and evaluated them in terms of the estimation error of the HF spectral envelope. The mean and standard deviation of the spectral envelope estimation error are shown in Table 2. It is noticeable that the performance of the EESN-based spectral envelope estimator improves quickly with increasing N_ESN, and the mean estimation error of the HF spectral envelope is reduced to about 11.3. With further consideration of computational complexity and memory demands, the number of ESNs is set to 8.

Table 2 Estimation error of the EESN model with different N_ESN

Consequently, the EESN model parameters are finally selected as a_in = 0.1094, a_res = 0.625, f_sparsity = 0.2188, α = 1, β = 3, d_S = 50, and N_ESN = 8.

4 Performance evaluation

In this section, the log spectral distortion (LSD), the cosh measure, and the differential log spectral distortion (DLSD) were adopted as objective measurements for evaluating the performance of the proposed BWE method in comparison with the HMM-based reference method. In addition, the auditory quality improvement of the extended audio signals is further discussed in terms of subjective preference tests.

4.1 Reference method and audio data

Except for the spectral envelope estimation module, the HMM-based reference method follows a similar principle to that shown in Fig. 1. The MFCCs were extracted from the WB audio and fed into the HMM-based Bayesian estimator of the HF spectral envelope under the minimum mean square error criterion, while the HF fine spectrum was extended by spectral translation. In the HMM-based reference method, the HMM states were defined by vector quantization of the joint space of the LF and HF coefficients. For each state of the HMM, the state-dependent conditional expectation of the HF spectral envelope was calculated based on a GMM of the joint probability density of the WB audio feature and the HF spectral envelope. The a posteriori probabilities of the corresponding HMM states were used to weight the conditional expectations, in order to obtain the estimate of the HF spectral envelope. Here, the HMM parameters were learned using the hybrid approach suggested by Jax and Vary [13], and the number of HMM states and the number of mixtures of each GMM were also selected in terms of the estimation error of the HF spectral envelope. The experimental results indicated that the estimation error decreased slightly with an increasing number of hidden states and Gaussian mixtures. Finally, taking algorithm complexity and auditory quality into consideration, we set the number of hidden states to 16 and required each GMM to have 32 mixtures with full covariance matrices in the HMM-based reference method.

The audio data of the training and validation sets for the proposed and reference methods were taken from lossless audio recorded at a live concert, about 6 h in length, containing dialog, music, singing, and background sounds. Through resampling and low-pass filtering, the parallel WB and SWB audio datasets were obtained, and the levels of all the signals were normalized to −26 dBov before further processing. In addition, 15 SWB audio signals were selected from the MPEG audio quality listening test dataset as the test data, containing pop music, solo instrumental music, symphony, and speech. These signals were sampled at 32 kHz with a bandwidth of 14 kHz and limited to lengths of 10~20 s. They were then converted into WB audio through low-pass filtering and down-sampling, and the levels of the resulting signals were normalized to −26 dBov as the input of the BWE methods. Since disjoint speakers and disjoint music pieces were used in training, validation, and testing, the accuracy and generalization of the proposed EESN method could be evaluated objectively and fairly. A detailed analysis of the subjective and objective quality of the extended signals is given below.

4.2 Objective measurement

The objective quality of the SWB signals reproduced by the different methods was evaluated in terms of the LSD, cosh, and DLSD measurements.

4.2.1 LSD

The LSD [25, 26] between the original SWB audio and the extended audio signals in the frequency range of 7~14 kHz was selected as the objective measurement for comparing the extension performance of the EESN and HMM methods. The LSD measurement is computed directly from the FFT power spectra as

$$ d_{\mathrm{LSD}}(i)=\sqrt{\frac{1}{N_{\mathrm{high}}-N_{\mathrm{low}}+1}\sum_{n=N_{\mathrm{low}}}^{N_{\mathrm{high}}}\left[10\log_{10}\frac{P_i(n)}{\widehat{P}_i(n)}\right]^{2}} $$
(19)

where d_LSD(i) is the LSD value of the ith frame and P_i and P̂_i are the FFT power spectra of the original SWB audio signals and the audio signals processed by the BWE methods, respectively. N_high and N_low are the indices corresponding to the upper and lower bounds of the HF band in the frequency range of 7~14 kHz. Before computing the LSD values, all the data were resampled to 32 kHz and temporally aligned with the original SWB audio. The LSD values were computed only for the HF band in the range of 7~14 kHz and averaged over all the frames of each test signal, and the mean LSD was used as the distortion measure.
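A per-frame implementation of Eq. (19) is a one-liner in numpy; this sketch is ours, with n_low and n_high as the inclusive bin indices of the 7~14 kHz band.

```python
import numpy as np

def lsd(P, P_hat, n_low, n_high):
    """Eq. (19): per-frame log spectral distortion over the HF band."""
    r = 10.0 * np.log10(P[n_low:n_high + 1] / P_hat[n_low:n_high + 1])
    return np.sqrt(np.mean(r ** 2))
```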

Table 3 shows the LSD values of the signals extended by the proposed and reference BWE methods. On average, the EESN method outperforms the HMM method in terms of the LSD measure. For rock music, violin solo, and speech, the LSD difference between the two methods is within ±0.3 dB. For jazz music, the LSD difference is relatively large, because this type of music has rich HF components and obvious transients over time; compared with the HF spectra extended by the HMM method, the spectra extended by the proposed EESN method are more similar to the original ones and obtain a lower LSD value. For symphony and country music, the energy of the HF components extended by the EESN method is slightly too high, so the LSD value of the EESN method is higher than that of the HMM method, although the audio extended by the EESN is brighter.

Table 3 LSD of the signals reproduced by different BWE methods

4.2.2 Cosh measure

In comparison with the LSD measure, the Itakura-Saito distortion gives heavier weight to the peaks of the audio spectrum and correlates better with true subjective auditory quality [26]. The Itakura-Saito distortion is defined as

$$ d_{\mathrm{IS}}\left(P_i,\widehat{P}_i\right)=\frac{1}{N_{\mathrm{high}}-N_{\mathrm{low}}+1}\sum_{n=N_{\mathrm{low}}}^{N_{\mathrm{high}}}\left[\frac{P_i(n)}{\widehat{P}_i(n)}-\log_{10}\frac{P_i(n)}{\widehat{P}_i(n)}-1\right] $$
(20)

However, the Itakura-Saito distortion is asymmetric as a distance metric, so the cosh measure is selected as a symmetrized measurement to describe the perceptual distortion of the reproduced audio. The cosh measure [26] is defined as

$$ d_{\mathrm{COSH}}(i)=\frac{1}{2}\left[d_{\mathrm{IS}}\left(P_i,\widehat{P}_i\right)+d_{\mathrm{IS}}\left(\widehat{P}_i,P_i\right)\right] $$
(21)
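Both distortions are straightforward to compute per frame; a minimal sketch of Eqs. (20)-(21), again with inclusive HF band indices, follows.

```python
def itakura_saito(P, P_hat, n_low, n_high):
    """Eq. (20): asymmetric Itakura-Saito distortion over the HF band."""
    r = P[n_low:n_high + 1] / P_hat[n_low:n_high + 1]
    return np.mean(r - np.log10(r) - 1.0)

def cosh_measure(P, P_hat, n_low, n_high):
    """Eq. (21): cosh measure, the symmetrized Itakura-Saito distortion."""
    return 0.5 * (itakura_saito(P, P_hat, n_low, n_high)
                  + itakura_saito(P_hat, P, n_low, n_high))
```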

In this paper, the cosh measure was computed only within the frequency range of 7~14 kHz. The resulting cosh values were averaged over all the frames of each test signal, and the mean cosh value was used as the distortion measure. Table 4 compares the cosh measures of the signals extended by the two methods. For jazz music and speech, the EESN method achieves better quality than the HMM method, but for country music, the EESN method shows more distortion, since the reconstructed HF spectrum has higher energy. On average, the SWB audio reproduced by the EESN achieves better objective quality in terms of the cosh measure.

Table 4 cosh measure of the signals reproduced by different BWE methods

4.2.3 DLSD

The continuity of the audio spectrum over time is as perceptually important as the accuracy of the spectrum reconstruction. Here, the DLSD is utilized as a dynamic distortion measure for evaluating the smoothness of the temporal evolution of the spectral envelope of the extended audio signals [27]. A small DLSD value of the reproduced audio indicates that the spectrum evolves smoothly over time, which is helpful for the overall subjective auditory quality of the reconstructed audio. The DLSD is defined as

$$ d_{\mathrm{DLSD}}(i)=\sqrt{\frac{1}{2\left(N_{\mathrm{high}}-N_{\mathrm{low}}+1\right)}\sum_{n=N_{\mathrm{low}}}^{N_{\mathrm{high}}}\left[10\log_{10}\frac{P_i(n)}{P_{i-1}(n)}-10\log_{10}\frac{\widehat{P}_i(n)}{\widehat{P}_{i-1}(n)}\right]^{2}} $$
(22)

where P_{i−1} and P̂_{i−1} are the FFT power spectra of the original and extended SWB audio signals in the previous frame.
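Following Eq. (22), the DLSD compares the frame-to-frame log-power change of the extended signal with that of the original; a minimal sketch:

```python
def dlsd(P, P_prev, P_hat, P_hat_prev, n_low, n_high):
    """Eq. (22): differential LSD between consecutive frames."""
    sl = slice(n_low, n_high + 1)
    d = (10.0 * np.log10(P[sl] / P_prev[sl])
         - 10.0 * np.log10(P_hat[sl] / P_hat_prev[sl]))
    return np.sqrt(np.mean(d ** 2) / 2.0)
```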

Table 5 shows the DLSD values of the signals extended by the two BWE methods. The dynamic distortions of the violin solo and the symphony are similar, because the energy of their HF components is low and evolves smoothly over time. For country, jazz, and rock music, the EESN method can effectively restore transients and achieves an improvement in DLSD over the HMM method on average. For speech, the DLSD value of the EESN is about 3.89 dB; in some unvoiced frames, the HF energy of the extended signals differs somewhat from that of the original signals, which causes some dynamic distortion.

Table 5 DLSD of the signals reproduced by different BWE methods

4.3 Subjective preference test

The subjective quality of the two BWE methods was assessed using pair-wise subjective preference tests. The subjective tests were arranged in a quiet room, and 15 listeners were invited to take part. Five test signals were selected from the MPEG audio dataset. In each test case, the listeners were asked to choose which of the two presented test items they preferred, or to indicate no preference, and they were allowed to repeat the test items without time limitation before giving their answers. Demonstration examples are given in Additional file 1.

Six groups of tests were presented to each listener: the comparison between the original WB and SWB signals, between the signals extended by the EESN method and the WB signals, between the signals extended by the HMM method and the WB signals, between the signals extended by the EESN method and those extended by the HMM method, between the signals extended by the EESN method and the original SWB signals, and between the original SWB signals and the signals extended by the HMM method. The results of the subjective tests are shown in Table 6. The signals extended by the two BWE methods are preferred over the WB signals in terms of subjective auditory quality, and the original SWB signals achieve the best performance in comparison with both the extended signals and the WB signals. In addition, the subjective audio quality of the proposed EESN method is slightly better, with slightly fewer artifacts, than that of the HMM reference method and is subjectively comparable to the original SWB signals. According to the comments of the listeners, more dynamic content in country, jazz, and rock music is restored by the EESN method, which leads to a preference for the EESN in terms of subjective listening quality in comparison with the reference method. For symphony and violin solo, the audio spectrum changes smoothly over time, so no significant difference can be perceived between the audio signals extended by the different BWE methods. In addition, some bandwidth-extended audio signals of country and jazz music have higher energy than the original SWB audio signals, so some listeners subjectively preferred the extended audio signals to the original ones.

Table 6 Results of the subjective preference tests

5 Conclusions

A bandwidth extension method for audio signals based on the ensemble echo state network is proposed in this paper. For each region in the feature space, a specific echo state network with a recursive structure is utilized to dynamically model the mapping relationship between the LF and HF coefficients through a continuous state updating equation. The outputs of the multiple ESNs are fused by means of the GMM-based network ensemble technique, in order to further estimate the HF spectral envelope. Subjective and objective quality test results show that, on average, the EESN method achieves improvements in terms of both static and dynamic distortions in comparison with the HMM-based reference method, and the auditory quality of the reproduced signals is close to that of the original SWB audio.

References

1. P Vary, R Martin, Digital Speech Transmission—Enhancement, Coding and Error Concealment (Wiley, UK, 2006)

2. E Larsen, RM Aarts, Audio Bandwidth Extension—Application of Psychoacoustics, Signal Processing and Loudspeaker Design (Wiley, UK, 2004)

3. C Avendaño, H Hermansky, EA Wan, Beyond Nyquist: towards the recovery of broad-bandwidth speech from narrow-bandwidth speech (Proc. European Conference on Speech and Language Processing, Madrid, 1995), pp. 165–168

4. YM Cheng, D O'Shaughnessy, P Mermelstein, Statistical recovery of wideband speech from narrowband speech. IEEE Trans. Speech Audio Process. 2(4), 544–548 (1994)

5. H Carl, U Heute, Bandwidth enhancement of narrow-band speech signals (Proc. 7th European Signal Processing Conference, Edinburgh, 1994), pp. 1178–1181

6. J Epps, WH Holmes, A new technique for wideband enhancement of coded narrowband speech (Proc. IEEE Workshop on Speech Coding, Porvoo, 1999), pp. 174–176

7. IY Soon, KY Chai, Bandwidth extension of narrowband speech using soft-decision vector quantization (Proc. Fifth International Conference on Information, Communications and Signal Processing, Bangkok, 2005), pp. 734–738

8. U Kornagel, Techniques for artificial bandwidth extension of telephone speech. Signal Process. 86(6), 1296–1306 (2006)

9. KY Park, HS Kim, Narrowband to wideband conversion of speech using GMM based transformation (Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, 2000), pp. 1843–1846

10. CV Botinhao, BS Carlos, LP Caloba, MR Petraglia, Frequency extension of telephone narrowband speech signal using neural networks (Proc. IMACS Multiconference on Computational Engineering in Systems Applications (CESA), Beijing, 2006), pp. 1576–1579

11. VP Tuan, F Schaefer, G Kubin, A novel implementation of the spectral shaping approach for artificial bandwidth extension (Proc. 3rd International Conference on Communications and Electronics, Nha Trang, 2010), pp. 262–267

12. B Iser, G Schmidt, Neural networks versus codebooks in an application for bandwidth extension of speech signals (Proc. European Conference on Speech and Language Processing, Geneva, 2003), pp. 565–568

13. P Jax, P Vary, Wideband extension of telephone speech using a hidden Markov model (Proc. 7th IEEE Workshop on Speech Coding, Delavan, 2000), pp. 133–135

14. P Jax, P Vary, On artificial bandwidth extension of telephone speech. Signal Process. 83(8), 1707–1719 (2003)

15. GB Song, P Martynovich, A study of HMM-based bandwidth extension of speech signals. Signal Process. 89(10), 2036–2044 (2009)

16. C Yagli, MAT Turan, E Erzin, Artificial bandwidth extension of spectral envelope along a Viterbi path. Speech Comm. 55(1), 111–118 (2013)

17. X Liu, CC Bao, Blind bandwidth extension of audio signals based on non-linear prediction and hidden Markov model. APSIPA Transactions on Signal and Information Processing 3(e8), 1–16 (2014)

18. G Lear, A state-space-based recurrent neural network for dynamic system identification. J. Syst. Eng. 6(3), 186–193 (1996)

19. M Lukosevicius, A practical guide to applying echo state networks, in Neural Networks: Tricks of the Trade, ed. by G Montavon, GB Orr, KR Müller, 2nd edn. (Springer, Heidelberg, 2012), pp. 659–686

20. M Lukosevicius, H Jaeger, Reservoir computing approaches to recurrent neural network training. Comput. Sci. Rev. 3(3), 127–149 (2009)

21. H Jaeger, M Lukosevicius, D Popovici, U Siewert, Optimization and applications of echo state networks with leaky-integrator neurons. Neural Netw. 20(3), 335–352 (2007)

22. H Jaeger, H Haas, Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science 304(5667), 78–80 (2004)

23. UD Schiller, JJ Steil, Analyzing the weight dynamics of recurrent learning algorithms. Neurocomputing 63, 757–779 (2005)

24. LK Hansen, P Salamon, Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 12(10), 993–1001 (1990)

25. H Pulakka, L Laaksonen, M Vainio, J Pohjalainen, P Alku, Evaluation of an artificial speech bandwidth extension method in three languages. IEEE Trans. Audio Speech Lang. Process. 16(6), 1124–1137 (2008)

26. AH Gray, JD Markel, Distance measures for speech processing. IEEE Trans. Acoust. Speech Signal Process. 24(5), 380–391 (1976)

27. F Norden, T Eriksson, Time evolution in LPC spectrum coding. IEEE Trans. Speech Audio Process. 12(3), 290–301 (2004)


Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61072089 and 61471014.

Author information

Corresponding author: Chang-Chun Bao.

Additional information

Competing interests

The authors declare that they have no competing interests.

Additional file

Additional file 1:

Demonstration for subjective listening tests. (PPT 10166 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Liu, X., Bao, CC. Audio bandwidth extension using ensemble of recurrent neural networks. J AUDIO SPEECH MUSIC PROC. 2016, 12 (2016). https://doi.org/10.1186/s13636-016-0090-0
