 Empirical Research
 Open access
Efficient bandwidth extension of musical signals using a differentiable harmonic plus noise model
EURASIP Journal on Audio, Speech, and Music Processing volume 2023, Article number: 51 (2023)
Abstract
The task of bandwidth extension addresses the generation of missing high frequencies of audio signals based on knowledge of the low-frequency part of the sound. This task applies to various problems, such as audio coding or audio restoration. In this article, we focus on efficient bandwidth extension of monophonic and polyphonic musical signals using a differentiable digital signal processing (DDSP) model. Such a model is composed of a neural network part with relatively few parameters trained to infer the parameters of a differentiable digital signal processing model, which efficiently generates the output full-band audio signal.
We first address bandwidth extension of monophonic signals, and then propose two methods to explicitly handle polyphonic signals. The benefits of the proposed models are first demonstrated on monophonic and polyphonic synthetic data against a baseline and a deep-learning-based ResNet model. The models are next evaluated on recorded monophonic and polyphonic data, for a wide variety of instruments and musical genres. We show that all proposed models surpass a higher-complexity deep learning model for an objective metric computed in the frequency domain. A MUSHRA listening test confirms the superiority of the proposed approach in terms of perceptual quality.
1 Introduction
Audio bandwidth extension (BWE) is a subtask of audio enhancement [1] whose goal is to extrapolate the audio spectrum to higher frequencies, in contrast with audio inpainting, whose goal is to interpolate missing parts [2]. BWE was considered early on in telecommunication systems to overcome bandwidth limitations, especially in telephony, for which the typical sampling rate is 16 kHz, i.e., the highest frequency in the processed signal is 8 kHz. In the case of human conversations, the quality of speech can be greatly improved if the sampling rate is increased to 44.1 or 48 kHz [3]. In the same vein, another application of BWE is to improve the quality of old music recordings, possibly in addition to the removal of clicks and noise [4] or declipping [5]. In both applications, the signal enhancement is handled without access to the original signal with better quality. Informed BWE algorithms can also be useful in audio coding [6], where signals with lower sampling rates are compressed more effectively, requiring the use of a BWE module to restore the full sampling rate of the decoded signal. In most cases, low-bitrate side information is transmitted along with the compressed low-frequency signal to improve the performance of the BWE module.
Finally, BWE is also meaningful for the interoperability of audio processing tools, as many audio signal processing methods, such as source separation [1], speech synthesis [7], or voice conversion [8], focus on 16-kHz signals, hence the need for a BWE system beforehand if the acquired signal is not at the desired sampling rate.
Even though many deep-learning-based systems have been proposed to tackle BWE, most of them do not consider runtime efficiency as critical, leading to high-quality systems that can be very costly at inference. High-quality generators based on autoregressive signal models such as WaveNet [9] or on diffusion [10] have intrinsically high complexity and sequentiality, which limit their use for time- or delay-critical applications.
In this paper, we propose to consider differentiable digital signal processing (DDSP) models derived from the seminal work of [11] in order to tackle BWE in an efficient manner. Controlling a harmonic-plus-noise sound model with a deep learning architecture allows us to considerably reduce inference time. Experiments described in this paper demonstrate a large speed increase compared to a reference ResNet implementation [12], together with better perceptual quality. This is due to several factors, including the reduction of learnable parameters. Using the DDSP approach, the sound is generated using deterministic synthesizers that are controlled by several deep learning modules of relatively small sizes. For comparison, the ResNet architecture has more than 55M learnable parameters (see Section 4.3.3), while the tested DDSP approach has around 4k parameters.
The remainder of this article is organized as follows. In Section 2, we present a general overview of works related to BWE. In Section 3, we explain the proposed models designed to address BWE. The experimental protocol, whose code is available online^{Footnote 1} and which relies on publicly available datasets, is detailed in Section 4. In Section 5, we show that the proposed models behave as designed on synthetic data, and in Section 6 they are evaluated on real data. Finally, we conclude this article in Section 7.
2 Related work
Most approaches have considered speech signals with application to telephony. The literature that considers music or general audio is scarcer. When available, we focus here on literature related to musical audio.
2.1 Signal processing approaches
Early works employed pure signal processing methods for BWE. In the area of audio coders, some non-blind systems rely on spectral band replication (SBR) [6] using side information extracted during compression. The SBR algorithm is based on the replication of the low-band spectrum to the high-band region, possibly with the benefit of side information about the high frequencies to improve the overall performance. It has been extended in several works [13, 14], e.g., by replacing the replication by a stretching of the low-band content towards the high-band part, thus preserving the intrinsic harmonic relationships. Source-filter models have also been employed to extend the bandwidth using line spectral frequencies in [15]. Systems based on dictionary learning to map low-frequency patterns to high-frequency components have been proposed in [16, 17]. Classic machine learning methods have also been explored for BWE, such as Gaussian mixture models (GMMs) [18], hidden Markov models (HMMs) [19, 20], or non-negative matrix factorization (NMF) [21, 22].
2.2 Convolutional deep learning approaches
Recently, deep learning (DL) methods have shown great performance in synthesizing the upper-band spectrum. The first works applying DL techniques to BWE used deep neural networks (DNNs) with dense layers to infer the high frequencies up to 8 kHz [23,24,25]. In [23], the log short-time Fourier transform is fed into several dense layers, with the last one inferring the high-band spectrum magnitude. The waveform is reconstructed by using the flipped phase from the low band to estimate the high-band phase information. While this flipping method avoids phase discontinuities at the low/high frontier, [24] proposes to cope with this potential issue by extending the mean-squared error (MSE) loss function with a regularization term. Gaussian-Bernoulli restricted Boltzmann machines (GBRBMs) have been employed alongside dense layers in [25] in order to estimate the higher spectral envelope. Other systems make use of convolutional neural networks (CNNs) to infer the high frequencies from the low-band input features, using 1D convolutions in the time domain [12, 26,27,28] or 2D convolutions in the spectro-temporal domain [29, 30]. In [26], the authors show that a network architecture of 1D dilated convolutions and residual connections outperforms a state-of-the-art system based on long short-term memory (LSTM) on speech signals. The authors of [27] make use of 1D convolutional layers in an encoder-decoder scheme to extend the bandwidth of speech and musical signals with three upscaling ratios: 2, 4, and 6. They show the effectiveness of their system compared to [24] for objective and perceptual metrics. In the same vein, an encoder-decoder architecture in the time domain is also used in [28], but the authors opt for subpixel layers instead of classical transposed convolutional layers because the former were shown to create fewer artifacts.
Another encoder-decoder system can be found in [12], where the architecture also contains residual connections in a U-Net scheme. The authors show (1) that using a ResNet architecture outperforms the U-Net, probably because of the loss of information in the bottleneck layer of the latter, and (2) that the employed DNNs overfit on the filter shapes present in the training data. This latter problem can fortunately be alleviated with a data augmentation strategy which utilizes a wide variety of low-pass filters during training.
As the paper describes thoroughly the architecture proposed as well as its learning procedure, we choose to use this latter system as a reference “high complexity” system.
2.3 Generative adversarial networks
Generative adversarial networks (GANs) have been explored in several works for BWE. In [31], the authors show that relying on GANs can improve the generated speech quality by using a simple DNN. In [32], the generator is based on a U-Net-like architecture and the discriminator is trained to distinguish between generated and true wideband signals, with the addition of a perceptual loss expressing the distance between features learned by a pre-trained automatic speech recognition (ASR) network. A combination of two discriminators, one based on spectral features and the other based on temporal features, has been proposed in [33] to extend the bandwidth from 8 kHz to 48 kHz. In [34], the generator is also based on a U-Net architecture, yet it is proposed to employ CNNs for the three discriminators, each one being applied to a downsampled version of the generated or true waveform (downsampling factors = 1, 2, 4). The generator is then trained to generate piano signals.
While GANs do not impose strong constraints in terms of inference complexity, they are known to be notoriously difficult to train, as they require very specific choices in optimization and architectures in order to stabilize training, and they can fail to cover modes of the data distribution [35].
At the time of the design of this study, we found no pre-trained general audio BWE model learned with an adversarial procedure. We thus do not consider a GAN-trained generator as another reference method.
2.4 Diffusion models
In terms of quality of generation, diffusion models now provide very convincing performance for a wide variety of data, including audio [10, 36]. As for autoregressive architectures like WaveNet [9], this important increase in quality comes at a strong computational cost at inference. The network has to be called sequentially a large number of times (usually from 100 to 1000 times) in order to perform the inference. One can reduce the size of the network or reduce the number of steps in order to accelerate sampling [35], but those approaches are detrimental to the quality of the generated audio and the inference time remains high.
Since we find in this paper that the inference of a standard ResNet architecture is already about 1000 times real time on a standard central processing unit (CPU), and since our study focuses on efficient BWE, we choose not to consider diffusion models as a reference.
3 Differentiable sound models
In this article, we address BWE using DDSP models derived from the seminal work of [11] that focuses on the generation of audio signals with a combination of neural networks and digital signal processing models. This approach allows one to train the neural network parameters in an end-to-end fashion with backpropagation, provided that the rest of the model is differentiable. Besides several sound synthesis models [37, 38], DDSP has also been successfully applied to other tasks, such as neural audio effects [39], style transfer [40], sound matching [41], or virtual analog modeling [42].
In this section, we describe the DDSP models we propose for monophonic and polyphonic BWE.
3.1 Monophonic BWE system
To address BWE for monophonic musical signals, we adapt the model proposed in [11], which is monophonic by design. The main difference with the original DDSP model is that, in order to reconstruct the higher frequencies, the model takes as input the low-band (LB) audio signal of bandwidth \(\frac{f_N}{\alpha }\), with \(f_N\) the Nyquist frequency, and is trained to output the wideband (WB) signal of bandwidth \(f_N\). The overall architecture, illustrated in Fig. 1, is the same as in [11] and consists of two parts: a trainable encoder-decoder neural network and a harmonic-plus-noise synthesizer. The neural network is illustrated in blue, the extracted features are shown in yellow, and the differentiable synthesizer is colored in red. This monophonic model is labeled DDSPmonodec, referring to the design of the decoder to generate monophonic parameters.
3.1.1 Extracted features
The input LB signal is first analyzed to extract the fundamental frequency \(f_0(n)\) and the loudness l(n) over time. In the monophonic setting, we use CREPE [43], a state-of-the-art monophonic pitch estimator based on a convolutional neural network, to estimate \(f_0\). The loudness l is obtained with an A-weighting of the power spectrum [44].
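As an illustration of the loudness feature, the following Python sketch computes a frame-wise A-weighted loudness from the power spectrum. The A-weighting curve is the standard analytic one; the frame size, hop size, averaging over bins in dB, and the function names `a_weighting_db` and `loudness_db` are assumptions made for this example rather than the exact settings of the model.

```python
import numpy as np

def a_weighting_db(freqs):
    """Standard analytic A-weighting curve, in dB, for frequencies in Hz."""
    f2 = np.asarray(freqs, dtype=float) ** 2
    num = 12194.0 ** 2 * f2 ** 2
    den = ((f2 + 20.6 ** 2)
           * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    with np.errstate(divide="ignore"):
        return 20.0 * np.log10(num / den) + 2.0

def loudness_db(x, fs=16000, n_fft=1024, hop=256, eps=1e-10):
    """Frame-wise loudness: mean over bins of the A-weighted log power."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    power_db = 10.0 * np.log10(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + eps)
    weights = a_weighting_db(np.fft.rfftfreq(n_fft, 1.0 / fs))
    weights = np.maximum(weights, -120.0)  # the DC weight is -inf, so clamp it
    return np.mean(power_db + weights, axis=1)
```

Doubling the amplitude of the input raises every value of the contour by about 6 dB, which is the expected behavior of a log-power loudness measure.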
3.1.2 Neural network
The input LB signal waveform is processed by an encoder which creates a latent vector z. In the encoder, the first 30 mel-frequency cepstral coefficients (MFCCs) are extracted from the audio input (fast Fourier transform (FFT) size of 1024, overlap of 75 %, and 128 mel bands between 20 and 8000 Hz) and then passed into a trainable normalization layer. After that, the MFCCs go into a gated recurrent unit (GRU) with 512 units, and finally, a 512-neuron linear layer outputs the latent vector z(n).
The three vectors z(n), \(f_0(n)\) and l(n) are then fed into the decoder. Each of them first goes into a separate multilayer perceptron (MLP) with three layers, and the outputs are concatenated. The obtained vector is processed by a 512-unit GRU and then another 3-layer MLP. Finally, two separate dense layers are used: the first one outputs the harmonic amplitudes \(A_h(n)\) (see Section 3.1.3), and the second one gives the noise filter coefficients N(k). Note that, as in [11], we use a modified sigmoid function \(\sigma (x)\) at the output of these last two dense layers: \(\sigma (x) = 2 \cdot \text{sigmoid}(x)^{\log (10)} + 10^{-7}\). This architecture has around 3k learnable parameters.
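The modified sigmoid can be written in a few lines. The sketch below assumes the natural logarithm in the exponent and a small additive constant of \(10^{-7}\), as in the reference DDSP implementation of [11]; the function name `modified_sigmoid` is chosen for this example.

```python
import math

def modified_sigmoid(x: float) -> float:
    """sigma(x) = 2 * sigmoid(x) ** log(10) + 1e-7, with the natural log."""
    # Numerically stable logistic function for large |x|.
    if x >= 0.0:
        s = 1.0 / (1.0 + math.exp(-x))
    else:
        e = math.exp(x)
        s = e / (1.0 + e)
    return 2.0 * s ** math.log(10.0) + 1e-7
```

The output is strictly positive and bounded by slightly more than 2, which keeps amplitudes and filter magnitudes non-negative while avoiding exact zeros inside logarithms.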
3.1.3 Harmonic-plus-noise synthesizer
Both outputs from the neural network are used separately in the additive synthesizer and noise modules. The additive synthesizer takes the estimated \(f_0(n)\) and the inferred harmonic amplitudes \(A_h(n)\) to generate the audio signal y(n):
\[ y(n) = \sum _{h=1}^{H} A_h(n) \sin \big ( \phi _h(n) \big ), \]
where H is the number of harmonics and \(\phi _h\) is the instantaneous phase of the hth sinusoidal component. It is computed by integrating the instantaneous frequency \(f_h(n) = h f_0(n)\), normalized by the sampling frequency \(f_s\):
\[ \phi _h(n) = 2 \pi \sum _{m=0}^{n} \frac{f_h(m)}{f_s} + \phi _{0,h}, \]
where \(\phi _{0,h}\) is a random initial phase. In the filtered noise module, we obtain a time-domain finite impulse response (FIR) filter as the inverse discrete Fourier transform of the noise filter coefficients N(k) from the neural network output. The filtered noise signal is synthesized by convolving white noise with the FIR filter. The harmonic signal and filtered noise are finally summed to obtain the wideband output signal.
Although a full-band output signal is generated, only the missing high-frequency content is kept and added to the input low-band signal.
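The two synthesis modules described above can be sketched in NumPy as follows. This is a simplified illustration and not the differentiable implementation: controls are given per sample instead of per frame, harmonics at or above the Nyquist frequency are muted (an assumption borrowed from common DDSP practice), and the FIR filter is centered and windowed before convolution. The function names are hypothetical.

```python
import numpy as np

def harmonic_synth(f0, amps, fs=16000, rng=None):
    """Additive synthesis: y(n) = sum_h A_h(n) * sin(phi_h(n)).

    f0:   (N,) fundamental frequency in Hz, one value per sample
    amps: (N, H) harmonic amplitudes A_h(n), one row per sample
    """
    rng = np.random.default_rng() if rng is None else rng
    n_harm = amps.shape[1]
    h = np.arange(1, n_harm + 1)
    f_h = f0[:, None] * h[None, :]                # f_h(n) = h * f0(n)
    # Assumption: harmonics at or above Nyquist are muted to avoid aliasing.
    amps = np.where(f_h < fs / 2, amps, 0.0)
    phi0 = rng.uniform(0.0, 2.0 * np.pi, n_harm)  # random initial phases
    phase = 2.0 * np.pi * np.cumsum(f_h / fs, axis=0) + phi0
    return np.sum(amps * np.sin(phase), axis=1)

def filtered_noise(noise_mag, n_samples, rng=None):
    """Filter white noise with an FIR obtained from K magnitude coefficients."""
    rng = np.random.default_rng() if rng is None else rng
    fir = np.fft.irfft(noise_mag)                 # zero-phase, length 2 * (K - 1)
    fir = np.roll(fir, len(fir) // 2) * np.hanning(len(fir))  # center and window
    white = rng.uniform(-1.0, 1.0, n_samples)
    return np.convolve(white, fir, mode="same")
```

The wideband output would then be the sum `harmonic_synth(...) + filtered_noise(...)`, from which only the content above the cutoff frequency is retained.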
3.1.4 Noise-only synthesizer
We also consider a noise-only synthesizer in which the output of the autoencoder only contains the noise filter coefficients N(k). We label this model DDSPnoise. This will allow us, in the experimental part, to evaluate the respective contributions of the harmonic and noise parts of the synthesizer.
3.1.5 Loss function
To train our models, we use the multi-scale spectral (MSS) loss function computed on the missing high-frequency region. It is defined as \(\mathcal {L}(y, \tilde{y}) = \sum _{s} L_s(Y_s, \tilde{Y}_s)\), where \(Y_s\) and \(\tilde{Y}_s\) are the high-frequency magnitude spectrograms of the ground-truth signal y and the reconstructed signal \(\tilde{y}\), respectively, computed using an FFT size s, and:
\[ L_s(Y_s, \tilde{Y}_s) = \Vert Y_s - \tilde{Y}_s \Vert _1 + \Vert \log Y_s - \log \tilde{Y}_s \Vert _1, \]
\(\Vert \cdot \Vert _1\) being the common \(L_1\) norm. Experiments demonstrated that it is preferable to compute each loss only on the high-frequency region for solving the BWE task. We use the same set of FFT sizes as in [11], that is [2048, 1024, 512, 256, 128, 64] samples.
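A non-differentiable NumPy sketch of this loss is given below for illustration. The Hann window, the 75 % overlap, the small constant added before the logarithm, and the 2 kHz cutoff used to select the high-frequency bins are assumptions for this example; in practice the loss would be written in an automatic differentiation framework.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT with a Hann window, shape (frames, n_fft // 2 + 1)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def mss_loss_high_band(y, y_hat, fs=16000, cutoff=2000.0,
                       fft_sizes=(2048, 1024, 512, 256, 128, 64), eps=1e-7):
    """Multi-scale spectral loss restricted to bins at or above `cutoff` Hz."""
    total = 0.0
    for s in fft_sizes:
        Y = stft_mag(y, s, s // 4)
        Y_hat = stft_mag(y_hat, s, s // 4)
        k0 = int(np.ceil(cutoff * s / fs))        # first high-frequency bin
        Y, Y_hat = Y[:, k0:], Y_hat[:, k0:]
        total += np.sum(np.abs(Y - Y_hat))
        total += np.sum(np.abs(np.log(Y + eps) - np.log(Y_hat + eps)))
    return total
```

Using several FFT sizes lets the loss penalize errors at both fine temporal and fine spectral resolutions, while the bin selection ignores the low band that is copied from the input anyway.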
3.2 Polyphonic BWE methods
Because it uses a single harmonic synthesizer, the DDSP architecture can only generate high-frequency content that is harmonic with respect to a single \(f_0\). To address BWE for polyphonic musical signals, we propose two systems: a cyclic use of the monophonic BWE system detailed above, and a BWE system based on a polyphonic DDSP architecture.
3.2.1 Cyclic monophonic decoder
In the monophonic BWE system, the DDSP model generates a harmonic signal based on a single \(f_0\) estimated by a monophonic pitch estimator [43]. In the polyphonic context, we use a state-of-the-art multi-pitch estimator [45] which outputs a maximum of I different fundamental frequencies \(f_0^i\). Considering that this multi-pitch estimator has a rather good performance, we propose to iteratively use the monophonic DDSP model DDSPmonodec in a cyclic manner, as illustrated in Fig. 2. We label this model DDSPmonodeccyclic. Pseudocode of the overall algorithm is detailed in Algorithm 1.
The monophonic DDSP model is applied for I iterations on a low-band signal \(x_{LB}^i\) which corresponds to the original low-band signal minus the \(i-1\) previously estimated sources. At each iteration i, a loudness contour \(l^i\) is extracted from what we label a residual low-band input signal \(x_{LB}^i\) and passed, along with the \(i\)th estimated pitch \(f_0^i\) (obtained on x at the beginning of the algorithm) and \(x_{LB}^i\), into the DDSP model.
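The output full-band monophonic signal \(\tilde{y}^i\), which contains harmonic content from the current \(f_0^i\), is then low-pass filtered to keep only the low-frequency part \(\tilde{y}_{LB}^i\). Finally, the magnitude spectrogram of \(\tilde{y}_{LB}^i\) is subtracted from the magnitude spectrogram of the residual low-band input signal:
\[ |X_{LB}^{i+1}| = |X_{LB}^{i}| - |\tilde{Y}_{LB}^{i}|, \]
where \(X_{LB}^{i}\) and \(\tilde{Y}_{LB}^{i}\) denote the short-time Fourier transforms (STFTs) of \(x_{LB}^{i}\) and \(\tilde{y}_{LB}^{i}\), respectively.
The residual low-band input signal of the next iteration is then obtained in the time domain by applying an inverse STFT to the resulting magnitude spectrogram (the phase is kept in place).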
In that way, at each iteration i, the harmonic content generated at the previous step is removed in the spectral domain from the residual low-band input signal, so that a different \(f_0^i\) is handled each time. The residual low-band signal should contain fewer and fewer harmonics during this process.
At the beginning of the iterations, the loudness contour is estimated on the full polyphonic signal, which leads to estimation errors that should decrease at each iteration.
The noise synthesizer, which is part of the monophonic DDSP model, is run at each iteration in order to obtain a more precise estimate of the amplitudes of the harmonics of the sinusoidal part. Although the noise part is thus estimated at each iteration, we only keep the noise part of the last iteration I so as not to overestimate it.
Finally, the full-band monophonic output signals \(\tilde{y}^i\) are summed and mixed with the noise part to obtain the estimated full-band polyphonic signal \(\tilde{y}\). As in the monophonic BWE setting, the high-frequency content from this full-band signal is mixed with the low-band input signal.
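The cyclic procedure described in this section can be summarized by the following Python sketch. It is a reading of the text above, not the authors' Algorithm 1: the callables `mono_model`, `loudness_fn`, `lowpass`, `stft`, and `istft` are hypothetical placeholders to be supplied by the user, and the clamping of the subtracted magnitudes at zero is an added assumption.

```python
import numpy as np

def cyclic_bwe(x_lb, f0_tracks, mono_model, loudness_fn, lowpass, stft, istft):
    """Cyclic use of a monophonic BWE model on a polyphonic low-band signal.

    x_lb:       low-band input waveform, shape (N,)
    f0_tracks:  list of I pitch contours, estimated once on the input
    mono_model: callable (x, f0, loudness) -> (harmonic, noise) waveforms
    Returns the full-band estimate and the final residual low-band signal.
    """
    residual = x_lb.copy()
    harmonic_sum = np.zeros_like(x_lb)
    noise = np.zeros_like(x_lb)
    for f0 in f0_tracks:
        loudness = loudness_fn(residual)
        # The noise part is overwritten so that only the last one is kept.
        harmonic, noise = mono_model(residual, f0, loudness)
        harmonic_sum = harmonic_sum + harmonic
        # Remove the generated low-band harmonics from the residual.
        X = stft(residual)
        Y_lb = stft(lowpass(harmonic))
        mag = np.maximum(np.abs(X) - np.abs(Y_lb), 0.0)  # assumption: clamp at 0
        residual = istft(mag * np.exp(1j * np.angle(X)), len(x_lb))
    return harmonic_sum + noise, residual
```

The high-frequency part of the returned full-band estimate would then be added to the original low-band input, as in the monophonic setting.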
3.2.2 Polyphonic decoder
To address BWE for polyphonic signals, we propose another model adapted from the original DDSP models, illustrated in Fig. 3, which we label DDSPpolydec because the decoder outputs the parameters intended to control a polyphonic synthesizer. As before, the model is trained on polyphonic data.
In this model, I additive harmonic synthesizers are used, where I is the estimated number of fundamental frequencies \(f_0^i, i \in \{1, \ldots , I\}\) present in the input low-band signal. To estimate the parameters for each separate additive synthesizer, we extend the decoder detailed in Section 3.1.2 by using I separate MLPs, one for each \(f_0^i\) (instead of a single MLP for vector \(f_0\) in the monophonic DDSP model). The outputs of those I MLPs are then concatenated into one vector, which is itself concatenated to the outputs of the two other MLPs applied to z and l. Then, as in the monophonic model, the obtained vector goes through a GRU and another MLP. After that, \(I+1\) dense layers are used: one for estimating the noise filter coefficients N(k), and I other layers to output the H harmonic amplitudes of the I additive synthesizers.
In this model, we employ the same multi-pitch estimator [45] as in the cyclic model to estimate a maximum of I fundamental frequencies \(f_0^i\). If only \(I' < I\) fundamental frequencies are given by the estimator, we set \(f_0^i = 0\) for \(i > I'\), and all \(f_0^i\) are fed into the decoder. To prevent any adverse impact of those missing values on sound quality, only the first \(I'\) sets of H harmonic amplitudes are extracted from the decoder output and used with the first \(I'\) additive synthesizers.
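The handling of missing pitches can be illustrated with the short NumPy sketch below, in which the array layouts and the function names `pad_pitch_tracks` and `active_amplitudes` are assumptions made for this example.

```python
import numpy as np

def pad_pitch_tracks(f0_tracks, max_voices, n_frames):
    """Stack I' <= max_voices pitch contours into a (max_voices, n_frames)
    array, setting the missing contours to zero. Also returns I'."""
    n_active = len(f0_tracks)
    if n_active > max_voices:
        raise ValueError("more pitch contours than synthesizers")
    padded = np.zeros((max_voices, n_frames))
    if n_active:
        padded[:n_active] = np.stack(f0_tracks)
    return padded, n_active

def active_amplitudes(amps, n_active):
    """Keep the harmonic amplitudes of the first I' synthesizers only.

    amps: decoder output of shape (max_voices, n_frames, H)
    """
    return amps[:n_active]
```

The decoder always receives `max_voices` contours, so its input size is fixed, while the synthesis stage only sums the voices that correspond to detected pitches.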
4 Experimental protocol
In this section, we detail the datasets, metrics and baselines used to assess the performance of the proposed BWE models.
The task that we consider is a bandwidth extension task in which the input signal is sampled at 4 kHz, thus containing frequencies up to 2 kHz, and the output signal is sampled at 16 kHz, thus containing frequencies up to 8 kHz.
As our approach is quite flexible in terms of extension scenario, we also performed experiments for the task going from a sampling frequency of 8 kHz to 16 kHz. We found that the ranking between models was the same as the one obtained when extending the bandwidth from 2 to 8 kHz. We thus display and discuss results only for the latter, as the task is more challenging and leads to more salient perceptual differences, a required aspect for a successful perceptual evaluation.
4.1 Datasets
To train and evaluate our models, we used both monophonic and polyphonic datasets. Synthetic data has also been considered in order to check the expected behavior of the proposed systems. Those systems are then evaluated on uncontrolled real-world data.
4.1.1 Synthetic datasets
In order to analyze the inference capabilities of the trained models, we generated two synthetic datasets, respectively containing monophonic and polyphonic signals. These signals are generated using a harmonic-plus-noise synthesizer, as for the DDSP models, allowing for a precise analysis of the models' generation capabilities.
Each monophonic signal is generated given an \(f_0\) corresponding to a certain MIDI pitch between C3 (i.e., 130.81 Hz) and G#6 (i.e., 1661.22 Hz). A harmonic signal is generated from this \(f_0\) with H harmonics (\(H \in \{10, 15, 20\}\)), where the amplitude of the hth harmonic is \(\frac{1}{h^2}\). Pink noise is added to this harmonic signal with a signal-to-noise ratio of 10 dB. Then, an attack, sustain, decay (ASD) envelope is generated and multiplied with the harmonic-plus-noise signal. The attack duration, the sustain level, and the decay duration are randomly picked in the intervals [0, 0.3] s, [0.5, 1], and [0, 2] s, respectively. Finally, a random gain in the interval [0.75, 1] scales the final monophonic harmonic-plus-noise signal. The final monophonic synthetic dataset is obtained by generating all combinations of \(f_0\) with the three H values, giving three signals per pitch.
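The generation of one synthetic note can be sketched as follows. Several details are assumptions made for this example and are not specified above: the 4 s duration, the linear shape of the envelope, the FFT-based pink noise generator, the omission of harmonics above the Nyquist frequency, and the reading of the three intervals as attack duration, sustain level, and decay duration.

```python
import numpy as np

def midi_to_hz(midi_pitch):
    """Equal-tempered frequency of a MIDI pitch, with A4 (MIDI 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)

def pink_noise(n, rng):
    """Noise with a 1/f power spectrum, obtained by shaping white noise."""
    spec = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    spec[1:] /= np.sqrt(freqs[1:])
    spec[0] = 0.0
    return np.fft.irfft(spec, n)

def synth_note(midi_pitch, n_harm, fs=16000, dur=4.0, snr_db=10.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = int(fs * dur)
    t = np.arange(n) / fs
    f0 = midi_to_hz(midi_pitch)
    harm = np.zeros(n)
    for h in range(1, n_harm + 1):
        if h * f0 < fs / 2:                       # assumption: skip aliased harmonics
            harm += np.sin(2.0 * np.pi * h * f0 * t) / h ** 2
    noise = pink_noise(n, rng)
    # Scale the noise so that the harmonic-to-noise power ratio equals snr_db.
    noise *= np.sqrt(np.mean(harm ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    attack = rng.uniform(0.0, 0.3)                # seconds
    sustain = rng.uniform(0.5, 1.0)               # level
    decay = rng.uniform(0.0, 2.0)                 # seconds
    n_a, n_d = int(attack * fs), int(decay * fs)
    env = np.full(n, sustain)
    env[:n_a] = np.linspace(0.0, sustain, n_a, endpoint=False)
    env[n - n_d:] = np.linspace(sustain, 0.0, n_d)
    gain = rng.uniform(0.75, 1.0)
    return gain * env * (harm + noise)
```

For instance, `synth_note(48, 10)` produces a C3 note with 10 harmonics, and looping over MIDI pitches 48 to 92 and \(H \in \{10, 15, 20\}\) yields the three signals per pitch mentioned above.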
The polyphonic synthetic dataset is generated by composing chords on the diatonic scale, simply by considering multiple notes from the monophonic synthetic dataset, as follows. To generate an I-note polyphonic chord signal, we randomly pick I monophonic signals, taking care that a particular pitch (regardless of the octave) does not appear more than once among these I signals. For each note, a gain is randomly picked in [0.5, 1], and all notes are mixed with the corresponding gains. To build the full database, we generated polyphonic signals for all combinations of \(f_0\) and \(I \in \{2, 3, 4, 5\}\).
From the generated monophonic and polyphonic synthetic datasets, 90% of the signals form the train set, and the remaining signals form the test set.
4.1.2 Real-world monophonic datasets
Two real-world datasets consisting of monophonic musical signals are used to evaluate our models. The OrchideaSOL dataset [46] includes signals of single notes from many different instruments (accordion, bassoon, tuba, horn, trombone, trumpet, guitar, harp, contrabass, viola, violin, violoncello, clarinet, flute, oboe, and saxophone). In the original dataset, many different playing styles are available for each instrument; however, we only keep the ordinario one, corresponding to a natural playing style. The training set for our experiments contains \(90\%\) of the original dataset, i.e., about 5.5 h of audio, while the test set contains \(10\%\), i.e., about 42 min of audio.
Medley-solos-DB [47] is another largely monophonic dataset, which contains melodies played by one of eight different instruments (clarinet, distorted electric guitar, female singer, flute, piano, saxophone, trumpet, and violin), i.e., the \(f_0\) changes over time in those signals. In our experiments, we used the original test and train splits, which correspond to about 2.4 and 5 h of audio, respectively. As some of the instruments are polyphonic (distorted electric guitar, piano, and violin), a small part of the dataset cannot strictly be considered monophonic. In order to preserve the integrity of the train/test splits of the dataset, we chose not to discard those instruments.
4.1.3 Real-world polyphonic datasets
To assess the proposed models for polyphonic BWE, we employed two real-world datasets containing multiple multitrack mixes. The GTZAN dataset [48] has been widely used in many audio signal processing tasks. It contains 1000 music tracks of 30 s each, equally split into 10 genres (blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock). The train and test splits contain 7.5 h and 50 min of audio, respectively.
We also used the mixed version of each track of the MedleyDB dataset [49], since most of the corresponding stems are already part of the training split of the previously mentioned Medley-solos-DB dataset. The whole MedleyDB dataset is split into train and test sets in a \(90\%\)/\(10\%\) way, corresponding to approximately 6 h and 50 min of audio data, respectively.
4.2 Evaluation and metrics
To evaluate the performance of the proposed models, we first employ an objective metric computed in the frequency domain named log-spectral distance (LSD), defined as:
where Y(n, k) and \(\tilde{Y}(n,k)\) are the STFT representations of the target full-band signal and the estimated full-band signal, respectively (see Section 4.4 for STFT parameter values).
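For readers who wish to experiment with this metric, the sketch below implements the LSD in one commonly used form, namely the frame-averaged root-mean-square difference of the \(\log_{10}\) power spectra. The logarithm base and any scaling factor are assumptions, so the values it returns are not guaranteed to be on the same scale as the figures reported in this article. The window and hop sizes follow Section 4.4.

```python
import numpy as np

def stft_power(x, n_fft=1024, hop=256):
    """Power spectrogram with a Hann window, shape (frames, n_fft // 2 + 1)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def lsd(y, y_hat, n_fft=1024, hop=256, eps=1e-10):
    """Log-spectral distance, averaged over frames (common log10 variant)."""
    P = stft_power(y, n_fft, hop)
    P_hat = stft_power(y_hat, n_fft, hop)
    diff = np.log10(P + eps) - np.log10(P_hat + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))
```

With this variant, two identical signals give an LSD of 0, and scaling a signal by a factor of 10 (a factor of 100 in power) gives an LSD close to 2.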
Secondly, we ran a listening test based on the MUSHRA methodology [50] to assess the perceptual quality of the proposed models. We followed the classical MUSHRA specifications to build a listening test which was completed by 44 participants. More details are provided in Section 6.
4.3 Reference methods
4.3.1 Null baseline
The null baseline is simply the absence of addition of any content in the missing high frequency range. It provides a “ground floor” baseline to assess if the contribution of a given method is not actually worse than doing nothing.
4.3.2 Spectral band replication
To compare the performance of our models with existing approaches in the literature, a simplified version of the SBR algorithm [13] has been evaluated on the considered datasets. This algorithm has a long history in audio codec technologies and comes in various designs that often consider the use of side information, transmitted in the bitstream for the decoder to perform BWE. In this work, we implemented a simplified version that is blind, i.e. does not require any information for performing BWE.
In this algorithm, the input signal x is processed in the frequency domain, frame by frame. The upper half of the frequencies is inferred by replicating the lower half, with the idea of transposing the lower harmonics upwards. As we aim to extend the bandwidth of musical signals from 2 to 8 kHz, we extend this algorithm by replicating the lower band three times to reconstruct the full spectrum. In order to obtain a typical frequency amplitude decay, for each replication, the amplitudes of the transposed frequencies are adjusted so that there is an energy continuity at the replication frontier, i.e., for the jth replication (\(j \in \{1, 2, 3\}\) as it is a fourfold bandwidth extension), the energies of the same portion of frequencies on both sides are equal:
where K is the number of frequency bins, and \(\alpha \in [0, 1]\) the fraction of frequency bins considered for matching the energies of adjacent replicated bands. Experimentally, we found that \(\alpha = 0.5\) led to the best performance for the overall algorithm.
In our experiment, to consider the SBR at its best performance, the ground-truth phase information is used to obtain the full-band signal in the temporal domain. We acknowledge that the phase is not known in practice and would have to be estimated in a realistic production setting.
4.3.3 ResNet architecture
We also compare our models to a higher-complexity system based on deep learning [12]. We chose this system because the ResNet architecture shows better results than the other proposed model based on a U-Net architecture. The ResNet architecture takes an input signal in the temporal domain and outputs a signal of the same size, with high-frequency components. It is composed of 15 residual blocks made of two 1D convolutional layers each, with 512 convolutional filters of size 7, with a rectified linear unit (ReLU) activation after the first layer. For each layer, the input is added back to the output after being multiplied by a factor of 0.1 (for stabilizing the training) in a residual fashion. The input signal is added back to the output. Batch normalization and dropout with a factor of 0.5 are used after each convolutional layer. This model has around 55M learnable parameters.
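The order of magnitude of this parameter count can be checked with a short calculation. The sketch below assumes that every convolutional layer has 512 input and 512 output channels and a bias term, and it ignores the input and output layers as well as the batch normalization parameters, so it is only an approximation.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Number of weights and biases of a 1D convolutional layer."""
    return in_channels * out_channels * kernel_size + out_channels

n_blocks, layers_per_block, channels, kernel_size = 15, 2, 512, 7
total = n_blocks * layers_per_block * conv1d_params(channels, channels, kernel_size)
print(f"{total:,}")  # 55,065,600
```

Each layer thus holds about 1.8M parameters, and the 30 layers together account for roughly 55M.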
To train this model, we use the same strategy as in [12], i.e., using a mean square error loss with a learning rate reducing schedule.
4.4 Experimental parameters
In our work, all audio signals are sampled at \(f_s=16\) kHz. To compute the STFT of these signals, we used an analysis window of 1024 samples with a hop length of 256 samples. The input signals are of length 64,000 samples (\(=4\) s) for the DDSP models and the SBR baseline, and 8192 samples (\(\approx 0.5\) s) for the ResNet model. In the DDSP models, we considered a number of harmonics \(H=100\) and a size of \(K=65\) for the noise transfer function N(k). In the cyclic DDSP system, we use a total number of \(I=5\) iterations.
DDSP models are trained for 25,000 steps with batches of size 32. We used the Adam optimizer with an initial learning rate of 0.001 for DDSP models, and this learning rate is halved if the loss has not decreased during four plateaus of 2500 steps. We used A100 GPUs for the training, which allowed us to train the DDSP models in around 1 h for DDSPmonodec and 2 h for DDSPpolydec, while ResNet training took around 19 h.
5 Validation on synthetic data
In this section, we first study the performance of the proposed models against the baselines on the monophonic and polyphonic synthetic data. It allows for more detailed insights on the models’ ability to accurately generate the missing high frequency content.
5.1 Monophonic dataset
We first trained and evaluated our monophonic DDSP model on the monophonic synthetic dataset against the SBR method [13] and the ResNet model [12]. Table 1 summarizes the results.
The results show the benefit of the DDSP model over the two reference models. In Fig. 4, the upper band generated by the proposed monophonic DDSP model, the SBR baseline, and the ResNet model is illustrated for one frame of a particular synthetic signal with \(f_0 \approx 830\) Hz. The DDSP model is robust enough to synthesize the wanted harmonics with matching amplitudes, showing that it is capable of learning the chosen harmonic amplitude decay. The SBR baseline duplicates the low-band harmonic content with an offset because of the mismatch between the cutoff frequency and \(f_0\), and the ResNet model is apparently not capable of generating relevant high-frequency harmonics, thus minimizing its loss by adding very little energy.
5.2 Polyphonic dataset
Table 2 shows the LSD metric for all models and baselines on the polyphonic synthetic dataset. All proposed models surpass the SBR baseline and the ResNet model, and the BWE performance is improved by both the cyclic system and the polyphonic model. The polyphonic DDSP model is almost twice as good as SBR and ResNet in terms of LSD, which is a substantial improvement. Figure 5 illustrates the upper band generation for one polyphonic example for all the considered models and baselines. Both the cyclic and polyphonic methods are capable of generating precise harmonics, with a relatively good amplitude match compared to the ground truth. As in the monophonic setting, the SBR baseline generates shifted harmonics. The ResNet model seems able to focus only on some harmonics, with relatively precise amplitudes, while also generating some noise in the lowest generated frequencies.
The three DDSP-based models seem quite capable of estimating the low harmonic amplitudes, while the amplitudes of the high harmonics are overestimated, which may lead to unnatural artifacts. Possible reasons for this defect are given at the end of the next section.
6 Evaluation on realworld datasets
In this section, we report the performance of the proposed models against the reference methods, namely SBR and the ResNet model, on each of the monophonic and polyphonic recorded datasets.
6.1 Objective evaluation
The proposed models are first evaluated objectively using the LSD metric on the real-world datasets. The monophonic models DDSPmonodec and DDSPnoise are evaluated on both monophonic and polyphonic datasets, while the polyphonic models DDSPmonodeccyclic and DDSPpolydec are evaluated only on the polyphonic datasets (Gtzan and MedleyDB). Table 3 shows the results.
First, all proposed models surpass both SBR and the ResNet model in terms of LSD, except for the cyclic model, which is worse than SBR. On the OrchideaSOL, Gtzan, and MedleyDB datasets, the gain in performance of the best model over the reference ones is substantial. For example, DDSPmonodec leads to an LSD of 5.68, where SBR and ResNet achieve 9.27 and 14.04, respectively. On polyphonic signals, the ResNet model predicts high frequencies poorly (LSD = 26.84 and 16.17 on Gtzan and MedleyDB, respectively), whereas our DDSP-based models give considerably lower LSDs (less than 12 for all these models on both datasets).
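For readers who want to experiment with this kind of objective comparison, a log-spectral distance can be sketched as follows. This is a common formulation (root-mean-square difference of log-power spectra over frequency, averaged over frames) and is an assumption: the exact definition used for the tables, including the log base, the scaling, and the frequency range, is given earlier in the article and may differ, so this sketch is not expected to reproduce the reported values. The function names and the toy signals are hypothetical.

```python
import numpy as np


def stft_power(x, n_fft=1024, hop=256):
    """Power spectrogram with a Hann window; shape (frames, n_fft // 2 + 1)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Index matrix selecting one frame of n_fft samples per row.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def lsd(reference, estimate, eps=1e-10):
    """RMS log-power difference over frequency, averaged over frames."""
    p_ref = stft_power(reference)
    p_est = stft_power(estimate)
    diff = np.log10(p_ref + eps) - np.log10(p_est + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))


# Toy signals: a "fullband" tone pair and a "lowband" version missing the
# high partial (made-up data, for illustration only).
fs = 16_000
t = np.arange(fs) / fs
full = np.sin(2 * np.pi * 830 * t) + 0.3 * np.sin(2 * np.pi * 5810 * t)
low = np.sin(2 * np.pi * 830 * t)

print("LSD(full, full):", lsd(full, full))
print("LSD(full, low): ", lsd(full, low))
```

The distance is zero when the estimate equals the reference and grows as high-frequency content goes missing or is misplaced, which is why lower values are better.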
Turning to the proposed models, we first observe that the polyphonic models DDSPmonodeccyclic and DDSPpolydec do not perform better than the monophonic DDSPmonodec, whereas the noise-only model DDSPnoise is on par with it. This observation contrasts with what we obtained on the synthetic datasets. In Fig. 6, we can see that DDSPpolydec seems to generate the highest frequencies with amplitudes that are too low, whereas DDSPmonodec is somewhat more precise in the high frequencies.
By informal listening to some reconstructed signals, we were able to distinguish two types of unwanted artifacts. The first kind happens when the amplitudes of the reconstructed harmonics are too high, which leads to a very synthetic high-frequency reconstruction. One of the reasons for these wrongly inferred harmonic amplitudes is that, in both DDSPmonodeccyclic and DDSPmonodec, the loudness contour is estimated for a mixture made of several \(f_0\), making it harder for the autoencoder to estimate the harmonic amplitudes of each \(f_0\). The second type of artifact can be heard when the synthesized noise handles much of the high-frequency content, while the harmonic amplitudes are too low, or even nonexistent. This happens when the multipitch estimator fails to correctly predict the set of \(f_0\)s: the overall system then does not generate high-amplitude harmonics and compensates with noise. We therefore conjecture that the proposed models would be more effective with a more robust multipitch estimation system.
6.2 Perceptual evaluation
To assess the perceptual quality of our models, we conducted a listening test based on the MUSHRA methodology [50]. During the listening test, 42 subjects were asked to rate the quality of audio signals between 0 (poor quality) and 100 (perfect quality) against the reference (ground-truth fullband signal), which is expected to be rated 100. This behavior is expected from normal-hearing and focused subjects, as the reference sound is provided for each trial. Ten stimuli are presented in random order. For each of them, six signals are to be rated:
1 Anchor 1: lowband input signal (model Null)
2 Anchor 2: hidden reference (ground-truth fullband signal)
3 SBR reconstruction
4 ResNet output
5 DDSPmonodec output
6 DDSPnoise output
The signals are taken from the Gtzan dataset [48], one for each genre, and only 5 s are extracted from the middle of the original signal. Information about the participants is collected at the end of the survey, including gender, age, and the number of years of musical practice. Given the poorer LSD performance of our proposed polyphonic schemes compared to the monophonic one, also confirmed by informal listening by the authors, we decided not to include them in the subjective evaluation. This had the benefit of keeping the duration of the listening test within a reasonable range of about 20 min.
First, we conduct an analysis of variance (ANOVA) to check whether musical training is a significant source of variation in the rating data. We consider an individual to be a musician if they have at least 1 year of musical practice. Under this definition, the ANOVA run on the rating distributions for musician and non-musician subjects gives a p-value of \(1.61 \cdot 10^{-8}\), which tells us that being a musician or not has a significant effect on the test ratings.
Close inspection of the ratings showed that the rankings of the different methods are the same for both populations. The only difference is a bias: musicians were on average more severe than non-musicians, as can be seen in Fig. 7, which shows the distributions of ratings for musicians and non-musicians over all models.
Next, another ANOVA in which the analyzed factor is the model gives a p-value of \(2.18 \cdot 10^{-88}\), which is very small and shows that the choice of model is a significant source of variation in our collected data; the rating distributions of the models can therefore be compared. Further ANOVAs on the model factor, run separately on the musician and non-musician subsets, yield similarly low p-values, which tells us that in both cases the choice of model has a significant impact on the ratings.
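To make the ANOVA step concrete, the sketch below computes the F statistic of a one-way ANOVA from scratch. The ratings are made-up numbers for three hypothetical models and are not data from the listening test; the function name `one_way_anova_f` is also hypothetical. Turning the F statistic into a p-value requires the F distribution, for which a library routine such as `scipy.stats.f_oneway` could be used instead.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)         # number of groups (here: models)
    n = len(all_values)     # total number of ratings
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the ratings around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up ratings on a 0-100 scale, for illustration only.
ratings = {
    "model_a": [70, 65, 80, 75],
    "model_b": [30, 35, 25, 40],
    "model_c": [20, 15, 30, 25],
}
f_stat, df_b, df_w = one_way_anova_f(list(ratings.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F value means that the differences between the group means are large relative to the spread within each group, which is what a very small p-value reflects.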
The distributions of the participants’ ratings for all models and all stimuli are plotted as boxplots in Fig. 8. We can see that the outputs of DDSPmonodec are on average rated as fair (almost good), whereas the outputs from ResNet, DDSPnoise, and no processing are typically rated as poor, and SBR outputs are quite often rated as bad. An important outcome is that DDSPmonodec improves on the Null baseline by a large margin, meaning that this method is able to improve audio quality at a low computational cost.
By computing a t-test on the rating distribution of DDSPmonodec against each of the other models, we can verify that it is significantly better than the others. The p-values obtained for the t-tests are well below the typical threshold of 0.05, so the distributions are significantly different from each other. We can thus conclude that, from the data of the listening test, DDSPmonodec gives perceptually better high-frequency content than the other evaluated models.
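The text does not state which variant of the t-test was used. As an illustration only, the sketch below computes Welch's t statistic for two independent samples with possibly unequal variances; this choice is an assumption, and a paired test could be equally appropriate since the same listeners rate every system. The ratings are made-up numbers and the function name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    # Squared standard errors of the two sample means.
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Made-up ratings on a 0-100 scale, for illustration only.
ratings_a = [70, 65, 80, 75]
ratings_b = [30, 35, 25, 40]
t_stat, dof = welch_t(ratings_a, ratings_b)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

The p-value is then obtained from the Student t distribution with the computed degrees of freedom and compared to the chosen threshold, 0.05 in the text.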
Monophonic and polyphonic examples are available online (see Footnote 2). The latter have been used as stimuli in the listening test.
6.3 Inference time
One great advantage of the proposed DDSP approach is the substantial reduction in inference time compared to neural networks with many trainable parameters such as ResNet. Figure 9 shows a scatter plot of the performance against the inference time of the different proposed models and the baselines SBR and ResNet, on the Gtzan dataset. Inference was run on a laptop equipped with an Intel Core i7 CPU clocked at 2.8 GHz. We can clearly see that a neural network architecture such as ResNet takes a lot of computing time to process an input signal, well beyond what real-time operation would allow. SBR is the fastest method, but DDSP-based models such as DDSPmonodec, DDSPpolydec, and DDSPnoise are also quite efficient in terms of computation time. DDSPmonodec and DDSPpolydec take the same amount of computing time because their architectures are very similar, and DDSPnoise is a bit faster because of the smaller matrices in the decoder. On the other hand, DDSPmonodeccyclic is less computationally efficient because of its iterative nature, as a DDSPmonodec inference is computed at each iteration. These insights into the computational cost of the DDSP-based models show the advantage of such hybrid models over neural networks with a huge number of parameters such as the ResNet architecture.
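One simple way to relate inference time to real-time operation is the real-time factor, that is, the processing time divided by the duration of the processed audio; a value below 1 means the system runs faster than real time. The article does not say how its timings were collected, so the sketch below is only an assumed measurement procedure, and `dummy_model` is a hypothetical stand-in for a bandwidth extension model.

```python
import time


def real_time_factor(process, signal, fs, repeats=3):
    """Best-of-`repeats` wall-clock time divided by the audio duration."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        process(signal)
        best = min(best, time.perf_counter() - start)
    return best / (len(signal) / fs)


def dummy_model(signal):
    """Hypothetical placeholder: a trivial per-sample operation."""
    return [2.0 * s for s in signal]


# A 4 s segment at 16 kHz, matching the segment length used for DDSP models.
signal = [0.0] * 64_000
rtf = real_time_factor(dummy_model, signal, fs=16_000)
print(f"real-time factor: {rtf:.4f}")
```

Taking the best of several runs reduces the influence of transient system load on the measurement.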
7 Conclusion
In this article, we explored differentiable digital signal processing models for bandwidth extension of monophonic and polyphonic musical signals. We showed the benefit of using a monophonic DDSP model to generate the high frequencies of monophonic signals against the two baselines, including a high-complexity deep-learning-based ResNet model. Then, we designed two systems to address polyphonic BWE: a cyclic use of a monophonic DDSP model, and an adapted DDSP model with polyphonic synthesis capabilities. The proposed polyphonic systems proved more effective on polyphonic synthetic signals, but failed to surpass the monophonic DDSP model on real data. In addition, we conducted a listening test with the MUSHRA methodology, which showed that the DDSPmonodec model was more pleasant to the ear for most participants than the baselines. For future work, we think that a more advanced multipitch estimator could enable the polyphonic models to generate fewer artifacts, and that other artifacts could be avoided by further investigating the loudness estimation procedure.
Availability of data and materials
Experiments reported in this paper rely on publicly available code and on the following publicly available datasets:
1 OrchideaSOL: https://forum.ircam.fr/projects/detail/orchideasol
2 Medleysolosdb: https://zenodo.org/record/1344103
3 MedleyDB: https://medleydb.weebly.com
4 GTZAN: https://www.kaggle.com/datasets/andradaolteanu/gtzandatasetmusicgenreclassification
The synthetic datasets can be reproduced using the experimental code available at: https://github.com/mathieulagrange/ddspMusicBandwidthExtension.
Abbreviations
ANOVA: Analysis of variance
ASD: Attack, sustain, release
ASR: Automatic speech recognition
BWE: Bandwidth extension
CNN: Convolutional neural network
CPU: Central processing unit
DDSP: Differentiable digital signal processing
DL: Deep learning
DNN: Deep neural network
FFT: Fast Fourier transform
FIR: Finite impulse response
GAN: Generative adversarial network
GBRBM: Gaussian-Bernoulli restricted Boltzmann machine
GMM: Gaussian mixture model
GPU: Graphics processing unit
GRU: Gated recurrent unit
HMM: Hidden Markov model
LB: Lowband
LSD: Log-spectral distance
LSTM: Long short-term memory
MFCC: Mel frequency cepstral coefficients
MLP: Multilayer perceptron
MSE: Mean-squared error
MSS: Multi-scale spectral
MUSHRA: Multiple stimuli with hidden reference and anchor
NMF: Non-negative matrix factorization
ReLU: Rectified linear unit
SBR: Spectral band replication
STFT: Short-term Fourier transform
WB: Wideband
References
E. Vincent, T. Virtanen, S. Gannot (eds.), Audio Source Separation and Speech Enhancement (Wiley, 2018)
A. Adler, V. Emiya, M.G. Jafari, M. Elad, R. Gribonval, M.D. Plumbley, Audio inpainting. IEEE Trans. Audio Speech Lang. Process 20, 922–932 (2012)
N.R. French, J.C. Steinberg, Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am. 19, 90–119 (1947)
S.V. Vaseghi, R. Frayling-Cork, Restoration of old gramophone recordings. J. Audio Eng. Soc. 40, 791–801 (1992)
C. Gaultier, S. Kitić, R. Gribonval, N. Bertin, Sparsity-based audio declipping methods: selected overview, new algorithms, and large-scale evaluation. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 1174–1187 (2021)
M. Dietz, L. Liljeryd, K. Kjorling, O. Kunz, in Audio Engineering Society Convention, Spectral band replication, a novel approach in audio coding (2002)
Y. Ning, S. He, Z. Wu, C. Xing, L.J. Zhang, A review of deep learning based speech synthesis. Appl. Sci. 9, 4050–4066 (2019)
S.H. Mohammadi, A. Kain, in Speech Communication, An overview of voice conversion systems (2017)
A.v.d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, Wavenet: a generative model for raw audio. Proc. ISCA (2016)
E. Moliner, J. Lehtinen, V. Välimäki, in International Conference on Acoustics, Speech and Signal Processing, Solving audio inverse problems with a diffusion model (2023)
J. Engel, L. Hantrakul, C. Gu, A. Roberts, DDSP: Differentiable Digital Signal Processing. (International Conference on Learning Representations, 2020)
S. Sulun, M.E.P. Davies, On filter generalization for music bandwidth extension using deep neural networks. J. Sel. Top. Signal Process. 15, 132–142 (2021)
S. Meltzer, R. Bohm, F. Henn, in Audio Engineering Society Convention, SBR enhanced audio codecs for digital broadcasting such as “Digital Radio Mondiale” (DRM) (2002)
F. Nagel, S. Disch, in International Conference on Acoustics, Speech and Signal Processing, A harmonic bandwidth extension method for audio codecs (2009)
S. Chennoukh, A. Gerrits, G. Miet, R. Sluijter, in International Conference on Acoustics, Speech, and Signal Processing. Proceedings, Speech enhancement via frequency bandwidth extension using line spectral frequencies (2001)
J. Sadasivan, S. Mukherjee, C.S. Seelamantula, in International Conference on Acoustics, Speech and Signal Processing, Joint dictionary training for bandwidth extension of speech signals (2016)
Y. Yoshida, M. Abe, in International Conference on Spoken Language Processing, An algorithm to reconstruct wideband speech from narrowband speech based on codebook mapping (1994)
K.Y. Park, H.S. Kim, in International Conference on Acoustics, Speech, and Signal Processing. Proceedings, Narrowband to wideband conversion of speech using GMM based transformation (2000)
P. Bauer, T. Fingscheidt, in International Conference on Acoustics, Speech and Signal Processing, An HMM-based artificial bandwidth extension evaluated by cross-language training and test (2008)
G.B. Song, P. Martynovich, A study of HMM-based bandwidth extension of speech signals. Signal Process. 89, 2036–2044 (2009)
D. Bansal, B. Raj, P. Smaragdis, in Interspeech, Bandwidth expansion of narrowband speech using nonnegative matrix factorization (2005)
D.L. Sun, R. Mazumder, in International Workshop on Machine Learning for Signal Processing, Nonnegative matrix completion for bandwidth extension: A convex optimization approach (2013)
K. Li, C.H. Lee, in International Conference on Acoustics, Speech and Signal Processing, A deep neural network approach to speech bandwidth expansion (2015)
K. Li, Z. Huang, Y. Xu, C.H. Lee, in Interspeech, DNN-based speech bandwidth expansion and its application to adding high-frequency missing features for automatic speech recognition of narrowband speech (2015)
Y. Wang, S. Zhao, W. Liu, M. Li, J. Kuang, in Interspeech, Speech bandwidth expansion based on deep neural networks (2015)
Y. Gu, Z.H. Ling, in Interspeech, Waveform modeling using stacked dilated convolutional neural networks for speech bandwidth extension (2017)
V. Kuleshov, S.Z. Enam, S. Ermon, Audio super resolution using neural networks. (International Conference on Learning Representations, 2017)
H. Wang, D. Wang, in International Conference on Acoustics, Speech and Signal Processing, Time-frequency loss for CNN based speech super-resolution (2020)
G. Campos, N. Fonseca, A. Ferreira, M. Davies, in Conference on Digital Audio Effects, High frequency magnitude spectrogram reconstruction for music mixtures using convolutional autoencoders (2018)
M. Lagrange, F. Gontier, in International Conference on Acoustics, Speech and Signal Processing, Bandwidth extension of musical audio signals with no side information using dilated convolutional neural networks (2020)
S. Li, S. Villette, P. Ramadas, D.J. Sinder, in International Conference on Acoustics, Speech and Signal Processing, Speech bandwidth extension using generative adversarial networks (2018)
X. Li, V. Chebiyyam, K. Kirchhoff, in Interspeech, Speech audio super-resolution for speech recognition (2019)
J. Su, Y. Wang, A. Finkelstein, Z. Jin, in International Conference on Acoustics, Speech and Signal Processing, Bandwidth extension is all you need (2021)
E. Moliner, V. Välimäki, BEHM-GAN: Bandwidth Extension of Historical Music using Generative Adversarial Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 31, 943–956 (2022)
J. Song, C. Meng, S. Ermon, Denoising diffusion implicit models. International Conference on Learning Representations (2020)
E. Moliner, F. Elvander, V. Välimäki, Zero-shot blind audio bandwidth extension. arXiv preprint arXiv:2306.01433 (2023)
B. Hayes, C. Saitis, G. Fazekas, Neural waveshaping synthesis (International Conference on Music Information Retrieval, 2021)
S. Shan, L. Hantrakul, J. Chen, M. Avent, D. Trevelyan, in International Conference on Acoustics, Speech and Signal Processing, Differentiable wavetable synthesis (2021)
S. Lee, H.S. Choi, K. Lee, Differentiable artificial reverberation. IEEE/ACM Trans. Audio Speech Lang. Process 30, 2541–2556 (2022)
C.J. Steinmetz, N.J. Bryan, J.D. Reiss, Style transfer of audio effects with differentiable signal processing. J. Audio Eng. Soc 70, 708–721 (2022)
N. Masuda, D. Saito, in International Society for Music Information Retrieval Conference, Synthesizer sound matching with differentiable DSP (2021)
F. Esqueda, B. Kuznetsov, J.D. Parker, in International Conference on Digital Audio Effects, Differentiable white-box virtual analog modeling (2021)
J.W. Kim, J. Salamon, P. Li, J.P. Bello, in International Conference on Acoustics, Speech and Signal Processing, Crepe: a Convolutional Representation for Pitch Estimation (2018)
L. Hantrakul, J. Engel, A. Roberts, C. Gu, in International Society for Music Information Retrieval Conference, Fast and flexible neural audio synthesis (2019)
R.M. Bittner, J.J. Bosch, D. Rubinstein, G. Meseguer-Brocal, S. Ewert, in International Conference on Acoustics, Speech and Signal Processing, A lightweight instrument-agnostic model for polyphonic note transcription and multipitch estimation (2022)
C.E. Cella, D. Ghisi, V. Lostanlen, F. Lévy, J. Fineberg, Y. Maresz, OrchideaSOL: a dataset of extended instrumental techniques for computer-aided orchestration (International Computer Music Conference, 2020)
V. Lostanlen, C.E. Cella, Deep convolutional networks on the pitch spiral for musical instrument recognition (International Conference on Music Information Retrieval, 2017)
B.L. Sturm, An analysis of the GTZAN music genre dataset. In Proceedings of the second international ACM workshop on Music information retrieval with user-centered and multimodal strategies 7–12 (2012)
R.M. Bittner, J. Wilkins, H. Yip, J.P. Bello, in International Conference on Music Information Retrieval, MedleyDB 2.0: New data and a system for sustainable data collection (2016)
M. Schoeffler, S. Bartoschek, F.R. Stöter, M. Roess, S. Westphal, B. Edler, J. Herre, webMUSHRA - A Comprehensive Framework for Web-based Listening Tests. J. Open Res. Softw. 6, 18 (2018)
Acknowledgements
We would like to thank Vincent Lostanlen for fruitful discussions and suggestions.
Funding
This research has been partially funded by an RFI OIC grant.
Author information
Contributions
PAG conducted the numerical experiments and wrote the manuscript. ML provided guidance and wrote the manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Grumiaux, PA., Lagrange, M. Efficient bandwidth extension of musical signals using a differentiable harmonic plus noise model. J AUDIO SPEECH MUSIC PROC. 2023, 51 (2023). https://doi.org/10.1186/s13636023003155
DOI: https://doi.org/10.1186/s13636023003155