Efficient bandwidth extension of musical signals using a differentiable harmonic plus noise model

The task of bandwidth extension addresses the generation of missing high frequencies of audio signals based on knowledge of the low-frequency part of the sound. This task applies to various problems, such as audio coding or audio restoration. In this article, we focus on efficient bandwidth extension of monophonic and polyphonic musical signals using a differentiable digital signal processing (DDSP) model. Such a model is composed of a neural network part with relatively few parameters trained to infer the parameters of a differentiable digital signal processing model, which efficiently generates the output full-band audio signal. We first address bandwidth extension of monophonic signals, and then propose two methods to explicitly handle polyphonic signals. The benefits of the proposed models are first demonstrated on monophonic and polyphonic synthetic data against a baseline and a deep-learning-based ResNet model. The models are next evaluated on recorded monophonic and polyphonic data, for a wide variety of instruments and musical genres. We show that all proposed models surpass a higher complexity deep learning model for an objective metric computed in the frequency domain. A MUSHRA listening test confirms the superiority of the proposed approach in terms of perceptual quality.


Introduction
Audio bandwidth extension (BWE) is a subtask of audio enhancement [Vincent et al., 2018] whose goal is to extrapolate the audio spectrum to higher frequencies, in contrast with audio inpainting whose goal is to interpolate missing parts [Adler et al., 2012].BWE has been considered early in telecommunication systems to overcome bandwidth limitations, especially in telephony for which the typical sampling rate is 16 kHz, i.e., leading to the highest frequency in the processed signal be 8 kHz.In the case of human conversations, the quality of speech can be greatly improved if the sampling rate is increased to 44.1 or 48 kHz [French and Steinberg, 1947].In the same vein, another application of BWE is to improve the quality of old music recordings, possibly in addition to the removal of clicks and noise [Vaseghi and Frayling-Cork, 1992] or declipping [Gaultier et al., 2021].In both applications, the signal enhancement is handled without access to the original signal with better quality.Informed BWE algorithms can also be useful in audio coding [Dietz et al., 2002] where signals of smaller sampling rates are more effectively compressed, requiring the use of a BWE to restore the full sampling rate of the decoded signal.In most cases low bitrate side information is transmitted along the compressed low frequency signal to improve the performance of the BWE module.
Finally, BWE is also meaningful for interoperability of audio processing tools as many audio signal processing methods, such as source separation [Vincent et al., 2018], speech synthesis [Ning et al., 2019] or voice conversion [Mohammadi and Kain, 2017], focus on 16-kHz signals, hence the need for a BWE system beforehand if the acquired signal is not at the desired sampling rate.
Even-though many deep learning based systems have been proposed to tackle BWE, most of them do not consider runtime efficiency as critical, leading to high quality systems that can be very costly at inference.High quality generators based on autoregressive signal models such as Wavenet [Oord et al., 2016] or on diffusion [Moliner et al., 2023b] have intrinsic high complexity and sequentiality which limit their use for time or delay critical applications.
In this paper, we propose to consider differentiable digital signal processing (DDSP) models derived from the seminal work of [Engel et al., 2020] in order to tackle BWE in an efficient manner.Controlling an harmonic plus noise sound model with a deep learning architecture allows us to considerably reduce inference time.Experiments described in this paper demonstrate a speed increase a 100 % compared to a reference resnet implementation [Sulun and Davies, 2021], this with better resulting perceptual quality.This is due to several factors, including reduction of learnable parameters.Using the DDSP approach, the sound is generated using deterministic synthesizers that are controlled by several deep-learning modules of relatively small sizes.For comparison, the resnet architecture has more than 55k learnable parameters, while the tested DDSP approach has around 4k parameters.
The remaining of this article is organized as follows.In Section 2, we present a general overview of works related to BWE.In Section 3, we explain the proposed models designed to address BWE.The experimental protocol whose code is available online1 , and which rely on publicly avaible datasets, is detailed in Section 4. In Section 5, we show how the proposed models are well designed when considering synthetic data, and in Section 6 they are evaluated on real data.Finally, we conclude this article in Section 7.

Related work
Most approaches considered speech signals with application to telephony.The literature that consider music or general audio is more scarce.When available, we here put the focus on literature related to musical audio.

Signal processing approaches
Early works employed pure signal processing methods for BWE.In the area of audio coders, some non-blind systems rely on spectral band replication (SBR) [Dietz et al., 2002] using side information extracted during compression.The SBR algorithm is based on the replication of the low-band spectrum to the high-band region, possibly with the benefits of side information about the high frequencies to improve the overall performance.It has been extended in several works [Meltzer et al., 2002, Nagel andDisch, 2009], e.g., by replacing the replication by a stretching of the low-band content towards the high-band part, thus preserving the intrinsic harmonic relationships.Source-filter models have been also employed to extend the bandwidth using line spectral frequencies in [Chennoukh et al., 2001].Systems based on dictionary learning to map low-frequency patterns to high-frequency components has been proposed in [Sadasivan et al., 2016, Yoshida andAbe, 1994].Classic machine learning methods have also been explored for BWE, such as Gaussian mixture models (GMMs) [Park and Kim, 2000], hidden Markov models (HMM) [Bauer andFingscheidt, 2008, Song andMartynovich, 2009] or non-negative matrix factorization (NMF) [Bansal et al., 2005, Sun andMazumder, 2013].

Convolutional deep learning approaches
Recently, deep learning (DL) methods have shown great performance to synthesize the upper band spectrum.The first works that apply DL technique in BWE literature used deep neural networks (DNNs) with dense layers to infer the high frequencies up to 8 kHz [Li and Lee, 2015, Li et al., 2015, Wang et al., 2015].In [Li and Lee, 2015], the log short-time Fourier transform is fed into several dense layers with the last one inferring the high-band spectrum magnitude.The waveform is reconstructed by using the flipped phase from the low-band to estimate the high-band phase information.While this flipped method avoids having phase discontinuities at the low/high frontier, [Li et al., 2015] propose to cope with this potential issue by extended the mean-squared error (MSE) loss function with a regularization term.Gaussian-Bernoulli restricted Boltzmann machines (GBRBM) has been employed alongside dense layers in [Wang et al., 2015] in order to estimate the higher spectral envelope.Other systems make use of convolutional neural networks (CNNs) to infer the high frequencies from the low-band input features, using 1D convolutions in the time domain [Gu and Ling, 2017, Kuleshov et al., 2017, Wang and Wang, 2020, Sulun and Davies, 2021] or 2D convolutions in the spectro/temporal domain [Campos et al., 2018, Lagrange andGontier, 2020].In [Gu and Ling, 2017], the authors show that using a network architecture of 1D dilated convolutions and residual connections outperforms a state-of-the-art based on a long short-term memory (LSTM) system on speech signals.The authors of [Kuleshov et al., 2017] make use of 1D convolutional layers in an encoder-decoder scheme to extend the bandwidth of speech and musical signals in three upscaling ratios: 2, 4 and 6.They show the effectiveness of their system compared to [Li et al., 2015] for objective and perceptive metrics.In the same vein, an encoder-decoder architecture in the time domain is also used in [Wang and Wang, 2020], but the authors propose to opt for subpixel layers instead of classical transposed convolutional layers because it was shown that less artifacts are created by considering those layers.Another encoder-decoder system can be found in [Sulun and Davies, 2021] where the architecture also contains residual connections in a U-Net scheme.The authors shows 1) that using a Resnet architecture outperforms the U-Net, probably because of the loss of information in the bottleneck layer of the former, and 2).that the employed DNNs overfit on the filter shapes present in the training data.This latter problem can fortunately be alleviated with a data augmentation strategy which utilizes a wide variety of low pass filters during training.
As the paper describes thoroughly the architecture proposed as well as its learning procedure, we choose to use this latter system as a reference "high complexity" system.

Generative adversarial networks
Generative adversarial networks (GAN) have been explored in several works for BWE.In [Li et al., 2018], the authors show that relying on GANs can improve the generated speech quality by using a simple DNN.In [Li et al., 2019], the generator is based on a U-Net-like architecture and the discriminator is trained to distinguish between generated and true wide-band signals, with the addition of a perceptual loss expressing the distance between features learned by a pre-trained automatic speech recognition (ASR) network.A combination of two discriminators, one based on spectral features and the other based on temporal features, have been proposed in [Su et al., 2021] to extend the bandwidth from 8 kHz to 48 kHz.In [Moliner and Välimäki, 2022], the generator is also based on a U-Net architecture yet it is proposed to employ CNNs for the three discriminators, each one being applied on a downsampled version of generated or true waveform (downsampling factors = 1, 2, 4).The generator is then trained to generate piano signals.
While GAN do not impose strong constraints in terms of inference complexity, GANs are known to be notoriously difficult to train, as they require very specific choices in optimization and architectures in order to stabilize training and could fail to cover modes of the data distribution [Song et al., 2020].
At the time of the design of this study, we found no pretrained general audio BWE model learnt with an adversarial procedure.We thus do not consider a GAN trained generator as another reference method.

Diffusion models
In terms of quality of generation, diffusion models now provides very convincing performance for a wide variety of data, including audio [Moliner et al., 2023b, Moliner et al., 2023a].As for autoregressive architecture like Wavenet [Oord et al., 2016], this important increase of quality comes at a strong computational cost at inference.The network has to be called sequentially a large number of times (usually from 100 to 1000 times) in order to perform the inference.One can reduce the size of the network or reduce the number of steps in order to accelerate sampling [Song et al., 2020], but those approaches are detrimental to the quality of the generated audio and the inference time remains high.
In this paper, we find that the inference of a standard ResNet architecture is already about 1000 times real time on a standard central processing unit (CPU) and our study focuses on efficient BWE, we choose not to consider diffusion models as a reference.See Figure 9 for more details about computation.

Differentiable sound models
In this article, we address BWE using DDSP models derived from the seminal work of [Engel et al., 2020] that focuses on the generation of audio signals with a combination of neural networks and digital signal processing models.This approach allows one to train the neural network parameters in an end-to-end fashion with backpropagation, if the rest of the model is differentiable.Besides several sound synthesis models [Hayes et al., 2021, Shan et al., 2021], DDSP has also been successfully applied to other tasks, such as neural audio effect [Lee et al., 2022], style transfer [Steinmetz et al., 2022], sound matching [Masuda and Saito, 2021] or virtual analog [Esqueda et al., 2021].In this section, we describe the DDSP models we propose for monophonic and polyphonic BWE.

Monophonic BWE system
To address BWE for monophonic musical signal, we adapt the model proposed in [Engel et al., 2020], which is monophonic by design.The main difference with the original DDSP model is that, in order to reconstruct the higher frequencies, the model takes as input the low-band (LB) audio signal of bandwidth f N α , with f N the Nyquist frequency, and is trained to output the wide-band (WB) signal of bandwidth f N .The overall architecture, illustrated on Fig. 1 is the same as in [Engel et al., 2020], and consists in two parts: an trainable encoder-decoder neural network, and a harmonic-plus-noise synthesizer.The neural network is illustrated in blue, the extracted features are shown in yellow, and the differentiable synthesizer is colored in red.This monophonic model is labeled DDSP-mono-dec, referring to the design of the decoder to generate monophonic parameters.

Extracted features
The input LB signal is first analyzed to extract the fundamental frequency f 0 (n) and loudness l(n) over time.In the monophonic setting, we use CREPE [Kim et al., 2018], a state-of-the-art monophonic pitch estimator based on a convolutional neural network, to estimate f 0 .The loudness l is obtained with a A-weighting of the power spectrum [Hantrakul et al., 2019].

Neural network
The input LB signal waveform is processed by an encoder which creates an latent vector z.In the encoder, the first 30 mel frequency cepstrum coefficients (MFCC) are extracted from the audio input (fast Fourier transform (FFT) size of 1024, overlap of 75 % and 128 mels between 20 Hz and 8000 Hz) and then passed into a trainable normalization layer.After that, the MFCCs goes into a gated recurrent unit (GRU) with 512 units and finally a 512-neuron linear layer outputs the latent vector z(n).
The three vectors z(n), f 0 (n) and l(n) are then fed into the decoder.Each of them first goes into a separate multi-layer perceptron (MLP) with three layers, and the outputs are concatenated.The obtained vector is processed by a 512-unit GRU and then a another 3-layer MLP.Finally, two separate dense layers are used : the first one outputs the harmonic amplitudes A h (n) (see Section 3.1.3)using a softmax activation, and the second one gives the noise filter coefficients N (k).Note that, as in [Engel et al., 2020], we use a modified sigmoid function σ(x) at the output of these two last dense layers : σ(x) = 2 • sigmoid(x) log(10) + 10 −7 .This architecture has around 3k learnable parameters.

Harmonic-plus-noise synthesizer
Both outputs from the neural network are used separately in the additive synthesizer and noise modules.The additive synthesizer takes the estimated f 0 (n) and the inferred harmonic amplitudes A h (n) to generate the audio signal y(n): where ϕ h is the instantaneous phase of the h-th sinusoidal component.It is computed by integrating the instantaneous frequency where ϕ 0,h is a random initial phase.In the filtered noise module, we obtain a time-domain finite impulse response (FIR )filter as the inverse discrete Fourier transform of the noise filter coefficients N (k) from the neural network output.The filtered noise signal is synthesized by convolving a white noise with the FIR filter.The harmonic signal and filtered noise are finally summed to obtain the wide-band output signal.
Even if the full-band output signal is generated, only the missing high frequency content is kept, and added to the input low-band signal.

Noise-only synthesizer
We also consider a noise only synthesizer in which the output of the autoencoder only contains the noise filter coefficients N (k).We label this model DDSP-noise.This will allow us in the experimental part to evaluate the respective value of the harmonic and noise parts of the synthesizer.

Loss function
We use the multi-scale spectral (MSS) loss function to train our models computed on the missing high-frequency region.It is defined as L(y, ỹ) = s L s (Y s , Ỹs ), where Y s and Ỹs are the high frequency magnitude spectrograms of the ground-truth signal y and the reconstructed signal ỹ, respectively, computed using a FFT size s, and : || • || being the common L 1 norm.Indeed, experiments demonstrated that it is preferable to compute each loss only on the high frequency region for solving the BWE task.We use the same set of FFT sizes as in [Engel et al., 2020], that is [2048,1024,512,256,128,64] samples.

Polyphonic BWE methods
By the use of a single harmonic synthesizer, the DDSP architecture can only generate high frequency content harmonically from a single f 0 .To address BWE for polyphonic musical signals, we propose two systems: a cyclic use of a monophonic BWE system detailed above, and a BWE system based on a polyphonic DDSP architecture.

Cyclic monophonic decoder
In the monophonic BWE system, the DDSP model generates an harmonic signal based on a single f 0 estimated from a monophonic pitch estimator [Kim et al., 2018].Now that we are in a polyphonic context, we use a state-ofthe-art multi-pitch estimator [Bittner et al., 2022] which outputs a maximum of I different fundamental frequencies f i 0 .Considering that this multi-pitch estimator has a rather good performance, we propose to iteratively use the monophonic DDSP model DDSP-mono-dec in a cyclic manner, as illustrated in Fig. 2. We label this model DDSPmono-dec-cyclic.Pseudocode of the overall algorithm is detailed in Algorithm 1.
The monophonic DDSP model is applied for I iterations on a low-band signal x i LB which correspond to the original low-band signal minus the i − 1 estimated sources.At each iteration i, a loudness contour l i is extracted from what we label a residual low-band input signal x i LB and passed, along with the i th estimated pitch f i 0 (obtained on x at the beginning of the algorithm) and x i LB , into the DDSP model.The output full-band monophonic signal ỹi , which contains a harmonic content from the current f i 0 , is then low-pass filtered to keep only the low-frequency part ỹi LB .Finally, the magnitude spectrogram of ỹi LB is subtracted to the magnitude spectrogram of the residual low-band input signal: The low-band input signal is then obtained in the time-domain using an inverse short-term Fourier transform (STFT) on X i LB (phase is kept in place).Algorithm 1 Pseudocode algorithm of cyclic use of the monophonic DDSP model.
▷ constructing final signal step by step end for return ỹ In that way, at each iteration i, the harmonic content generated at the previous step is removed in the spectral domain from the residual low-band input signal, so that a different f i 0 should be extracted.The residual low-band signal should contain less and less harmonics during this process.
At the beginning of the iteration, the loudness contour is then estimated on the full polyphonic signal, which will lead to estimations errors, that hopefully will decrease at each iteration.
The output of the noise synthesizer, which is part of the monophonic DDSP model at each iteration in order to have a more precise estimate of the amplitude of the harmonic of the sinusoidal part.While the noise part is thus estimated at each iteration we only considered the noise part of the last iteration I in order not to overestimate the noise part.
Finally, the full-band monophonic output signals ỹi are summed and mixed with the noise part to obtain the estimated full-band polyphonic signal ỹ.As in the monophonic BWE setting, the high frequency content from this full-band signal is mixed with the low-band input signal.

Polyphonic decoder
To address BWE for polyphonic signals, we propose another model adapted from the original DDSP models, illustrated in Fig. 3, which we label DDSP-poly-dec because the decoder outputs the parameters intended to control a polyphonic synthesizer.As before, the model is trained on polyphonic data.
In this model, I additive harmonic synthesizers are used, where I is the estimated number of fundamental frequencies f i 0 , i ∈ 1, ..., I present in the input low-band signal.To estimate the parameters for each separate additive synthesizer, we extend the decoder detailed in section 3.1.2by using I separate MLPs for each f i 0 (instead of a single MLP for vector f 0 in the monophonic DDSP model).The outputs of those I MLPs are then concatenated into one vector, which is itself concatenated to the outputs of the two other MLPs applied on z and l.Then, as in the monophonic model, the obtained vector goes through a GRU and another MLP.After that, I + 1 dense layers are used: one for estimating the noise filter coefficients N (k), and I other layers to output the H harmonic amplitudes of the I additive synthesizers.
In this model, we employ the same multi-pitch estimator [Bittner et al., 2022] as in the cyclic model to estimate a maximum of I f i 0 .If only I ′ < I fundamental frequencies are given by the estimator, we set f i 0 = 0, i > I ′ , and all f i 0 are fed in the decoder.To prevent any adverse impact on sound quality of those missing values, only the I ′ first sets of H harmonic amplitudes are extracted from the decoder output and used with the first I ′ additive synthesizers.

Experimental protocol
In this section, we detail the datasets, metrics and baselines used to assess the performance of the proposed BWE models.
The task that we consider is bandwidth extension task where the input signal is sampled at 4kHz, thus with frequencies up to 2kHz and the output signal is sampled at 16kHz, thus with frequencies up to 8kHz.
As our approach is quite flexible in terms of extension scenario, we also performed experiments for the task going from a sampling frequency of 8kHz to 16kHz.We found that the ranking between models was the same as the one for upsampling from 2kHz to 8kHz.We thus display and discuss results only for the latter, as the task is more challenging and lead to more salient perceptual differences, a required aspect for a successful perceptual evaluation.

Datasets
To train and evaluate our models, we used both monophonic and polyphonic datasets.Synthetic data has been also been considered in order to check expected behaviors of proposed systems.Those systems are then evaluated on uncontrolled real-world data.

Synthetic datasets
In order to analyze the inference capabilities of the trained models, we generated two synthetic datasets, respectively containing monophonic and polyphonic signals.These signals are generated using a harmonic-plus-noise synthesizer, as for the DDSP models, allowing for precise analysis of the models generating capabilities.
Each monophonic signal is generated given a f 0 corresponding to a certain MIDI pitch between C3 (e.g., 130.82 Hz) and G#6 (e.g., 1661.22Hz).An harmonic signal is generated from this f 0 with H harmonics (H ∈ {10, 15, 20}), where the amplitude of the h-th harmonic is 1 h 2 .A pink noise is added to this harmonic signal with a signal-to-noise ratio of 10 dB.Then, an attack, sustain, decay (ASD) envelope is generated and multiplied to the harmonic-plus-noise signal.The durations of attack and decay and the sustain level are randomly picked in the interval [0, 0.3], [0.5, 1] and [0, 2] (in seconds), respectively.Finally, a random gain in interval [0.75, 1] scales the final monophonic harmonicplus-noise signal.The final monophonic synthetic dataset is obtained by generating all combinations of f 0 with the three H values, giving three signals.
The polyphonic synthetic dataset is generated by composing chords on the diatonic scale simply by considering multiple notes from the monophonic synthetic dataset, as follows.To generate a I-note polyphonic chord signal, we randomly pick I monophonic signals by taking care that a particular pitch (regardless of the octave) does not appear more than once among these I signals.For each note, a gain is randomly picked in [0.5, 1], and all notes are mixed with corresponding gains.To build the full database, we generated polyphonic signals for all combinations of f 0 and I ∈ {2, 3, 4, 5}.
From the generated monophonic and polyphonic synthetic datasets, 90% of the signals form the train set, and the remaining signals form the test set.

Real-world monophonic datasets
Two real-world datasets consisting of monophonic musical signals are used to evaluate our models.The OrchideaSOL dataset [Cella et al., 2020] includes signals of single notes from many different instruments (accordion, bassoon, tuba, horn, trombone, trumpet, guitar, harp, contrabass, viola, violin, violoncello, clarinet, flute, oboe and saxophone).In the original dataset, many different playing styles are available for each instrument, however we only keep the ordinario one, corresponding to a natural playing.The training set for our experiments contains 90% of the original dataset, i.e., about 5.5 hours of audio, while the test set contains 10%, i.e., about 42 minutes of audio.
Medley-solos-db [Lostanlen and Cella, 2017] is another largely monophonic dataset which contains melodies of one of eight different instruments (clarinet, distorted electric guitar, female singer, flute, piano, saxophone, trumpet and violin), i.e. the f 0 changes over time in those signals.In our experiments we considered the original provided test and train splits, which corresponds to about 2.4 and 5 hours of audio, respectively.As some of the instruments are polyphonic i.e. distorted electric guitar, piano and violin, a small part of the dataset cannot strictly be considered as monophonic.In order to preserve the integrity of train/test splits of the dataset, we chose not to discard those instruments.

Real-world polyphonic datasets
To assess the proposed model for polyphonic BWE, we employed two real-world datasets containing multiple multitrack mixes.Gtzan dataset [Sturm, 2013] has been widely exploited in many audio signal processing tasks.It contains 1000 30-second music track equally split into 10 genres (blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae and rock).The train and test splits contain 7.5 hours and 50 minutes of audio, respectively.
We also used the mixed version of each track of the MedleyDB dataset [Bittner et al., 2016], since most of the corresponding stems are already part of the training split of the previously mentionned Medley-solos-db dataset.The whole MedleyDB dataset is split into train and test sets in a 90%/10% way, corresponding to approximately 6 hours and 50 minutes of audio data, respectively.

Evaluation and metrics
To evaluate the performance of the proposed models, we first employ an objective metric computed in the frequency domain named log-spectral distance (LSD), defined as: where Y (n, k) and Ỹ (n, k) are the STFT representation of the target full-band signal and the estimated full-band signal, respectively (see Section 4.4 for STFT parameter values).
Secondly, we ran a listening test based on a MUSHRA methodology [Schoeffler et al., 2018] to assess the perceptive accuracy of the proposed models.We followed the classical MUSHRA specifications to build a listening test which was completed by 44 participants.More details are provided in Section 6.

Null baseline
The null baseline is simply the absence of addition of any content in the missing high frequency range.It provides a "ground floor" baseline to assess if the contribution of a given method is not actually worse than doing nothing.

Spectral band replication
To compare the performance of our models with existing approaches in the literature, a simplified version of the SBR algorithm [Meltzer et al., 2002] has been evaluated on the considered datasets.This algorithm has a long history in audio codec technologies and comes in various designs that often considers the use of side information, transmitted in the bitstream for the decoder to perform BWE.In this work we implemented a simplified version that is blind, i.e. do not require any information for performing BWE.
In this algorithm, the input signal x is treated in the frequency domain, frame by frame.The upper half frequencies are inferred by replicating the lower half frequencies, with the idea of transposing the lower harmonics upwards.As we aim to extend the bandwidth of musical signal from 2 kHz to 8 kHz, we extend this algorithm by replicating the lower band three times to reconstruct the full spectrum.In order to obtain a typical frequency amplitude decay, for each replication, the amplitudes of the transposed frequencies are adjusted so that there is an energy continuity at the replication frontier, i.e., for the j-th replication (j ∈ {1, 2, 3} as it is a fourfold bandwidth extension), the energies of the same portion of frequencies on both size are equal: where K is the number of frequency bins, and α ∈ [0, 1] the fraction of frequency bins considered for matching the energies of adjacent replicated bands.Experimentally, we found that α = 0.5 led to the best performance for the overall algorithm.
In our experiment, to consider the SBR at its best performance, the ground-truth phase information is used to obtain the full-band signal in the temporal domain.We acknowledge that the phase is not known in practice and would have to be estimated in a realistic production setting.

Resnet architecture
We also compare our models to a higher complexity system based on deep learning [Sulun and Davies, 2021].We chose this system because the Resnet architecture shows better results than the other proposed model based on a U-Net architecture.The Resnet architecture takes an input signal in the temporal domain and output a signal of the same size, with high frequency components.It is composed of 15 residual blocks made of two 1D convolutional layers each, with 512 convolutional filters of size 7, with a rectified linear unit (ReLU) activation after the first layer.For each layer, the input is added back to the output after being multiplied by a factor 0.1 (for stabilizing the training) in a residual fashion.The input signal is added back to the output.Batch normalization and dropout with a factor 0.5 are used after each convolutional layers.This model has around 55M learnable parameters.
To train this model, we use the same strategy as in [Sulun and Davies, 2021], i.e., using a mean square error loss with a learning rate reducing schedule.

Experimental parameters
In our work, all audio signals are sampled at f s = 16 kHz.To compute the STFT of these signals, we used an analysis window of 1024 samples with a hop length of 256 samples.The input signals are of length 64000 samples (= 4 s) for the DDSP models and the SBR baseline, and 8192 samples (≈ 0.5 s) for the Resnet model.In the DDSP models, we considered H = 100 number of harmonics and a size of K = 65 for the noise transfer function N (k).In the cyclic DDSP system, we use a total number of I = 5 iterations.
DDSP models are trained for 25000 steps with batches of size 32.We used the Adam optimizer with an initial learning rate of 0.001 for DDSP models, and the latter is halved if the loss has not been decreased during four plateaus of 2500 steps.We used A100 GPUs for the training, which permit us to train DDSP models for around 1 hour for DDSP-mono-dec, 2 hours for DDSP-poly-dec, while Resnet training took around 19 hours.

Validation on synthetic data
In this section, we first study the performance of the proposed models against the baselines on the monophonic and polyphonic synthetic data.It allows for more detailed insights on the models' ability to accurately generate the missing high frequency content.

Monophonic dataset
We first trained and evaluated our monophonic DDSP model on the monophonic synthetic dataset against SBR method [Meltzer et al., 2002] and Resnet model [Sulun and Davies, 2021].Table 1 summarises the results.
Table 1: Evaluation results for monophonic BWE model and baselines on the monophonic synthetic dataset.
Model LSD Null 6.15 SBR [Meltzer et al., 2002] 4.19 Resnet [Sulun and Davies, 2021] 4.34 DDSP-noise 5.04 DDSP-mono-dec 2.93 The results show the benefit of the DDSP model over the two reference models.On Fig. 4, the generated upper band from the proposed monophonic DDSP model, SBR baseline and Resnet models are illustrated for one frame of a particular synthetic signal with f 0 ≈ 830 Hz.The DDSP model is robust enough to synthesize the wanted harmonics with matching amplitudes, showing that it is capable to learn the chosen harmonic amplitude decay.The SBR baseline duplicates the low-band harmonic content with an offset because of the mismatch between the cutoff frequency and f 0 , and the Resnet model is apparently not capable of generating relevant high frequency harmonics, thus minimizing its loss by very few addition of energy.Table 2: Evaluation results for the proposed BWE models, the baseline and the reference resnet model on the polyphonic synthetic datasets.

Polyphonic dataset
Table 2 shows the LSD metric for all models and baselines on the polyphonic synthetic dataset.We can see that all proposed models surpass the SBR baseline and the Resnet model, and that the BWE performance has been improved by the design of both the cyclic system and the polyphonic model.The polyphonic DDSP model is almost twice as good as SBR and Resnet, which is an important improvement.When looking at Fig. 5, which illustrated the upper band generation for one polyphonic example from all the considered models and baselines, we notice that both cyclic and polyphonic methods are capable of generating precise harmonics, with a relatively good amplitude match compare to the ground-truth.For the monophonic setting, the SBR baseline generates shifted harmonics.The Resnet model seems to be able to focus only on some harmonics, with relatively precise amplitudes, while also generating some noise in the lowest generated frequencies.
The three DDSP-based models seems quite capable of estimating the low harmonic amplitudes, while the high harmonic content suffers from too high amplitudes, which may lead to non-natural artifacts.Possible reasons for this defect are given at the end of the next section.

Evaluation on real-world datasets
In this section, we present the performance results for each monophonic and polyphonic recorded datasets of the proposed models against the reference methods: namely SBR and Resnet model.

Objective evaluation
The proposed models are first evaluated objectively using the LSD metrics on the real-world datasets.Monophonic models DDSP-mono-dec and DDSP-noise are evaluated on both monophonic and polyphonic datasets, while polyphonic models DDSP-mono-dec-cyclic and DDSP-poly-dec are evaluated only on polyphonic datasets (Gtzan and MedleyDB.Table 3 shows the results.First, we can see that all proposed models surpass both SBR and Resnet model in terms of LSD, except for the cyclic model which is worse than SBR.On the OrchideaSOL, Gtzan and MedleyDB datasets, the gain in performance is substantial for the best model compared to the reference ones.For example, DDSP-mono-dec leads to a LSD of 5.68 where SBR and Resnet achieve 9.27 and 14.04, respectively.On polyphonic signals, the Resnet model seems to be quite bad at predicting high frequencies (LSD = 26.84 and 16.17 on Gtzan and MedleyDB, respectively), whereas our DDSP-based models give quite lower LSDs (less than 12 for all theses models on both datasets).
When looking at the performance of the proposed models, we first observe that the polyphonic models DDSPmono-dec-cyclic and DDSP-poly-dec do not achieve a better performance than the monophonic one DDSP-mono-dec, whereas the noise-only model DDSP-noise is on par with its results.This observation is quite constrasted from what we obtained on the synthetic datasets.When having a look at Fig. 6, we can see that DDSP-poly-dec seems to generate the highest frequency with too low amplitudes, whereas DDSP-mono-dec is a bit more precise in the high frequencies.
By informal listening of some reconstructed signals, we managed to distinguish two types of unwanted artifacts.The first kind happens when the amplitudes of the reconstructed harmonics are too high, which leads to a very synthetic high frequency reconstruction.One of the reasons for these wrongly inferred harmonic amplitudes is that, in both DDSP-mono-dec-cyclic and DDSP-mono-dec, the loudness contour is estimated for a mixture made of several f 0 , making it less trivial for the autoencoder to estimate each f 0 harmonic amplitudes.The second type of artifacts can be heard when the synthesized noise handles much of the high frequency content, while the harmonic amplitudes are too low, or even non-existent.This happens when the multi-pitch estimator fails to correctly predict the set of f 0 s, then the overall system do not generate high amplitude harmonics, and compensates with noise.Because of that, we conjecture that the proposed models should be more effective with a more robust multi-pitch estimation system.

Perceptual evaluation
In order to assess the perceptive value of our models, we conducted a listening test based on the MUSHRA methodology [Schoeffler et al., 2018].During the listening test, 42 subjects were asked to rate the quality of audio signals between 0 (poor quality) and 100 (perfect quality) against the reference (ground-truth full-band signal), which is expected to be rated 100.This behavior is expected by normal hearing and focused subjects, as the reference sound is provided for each trial.10 stimuli are given in a random order.For each of them, 6 signals are to be rated :  The signals are taken from Gtzan dataset [Sturm, 2013], one of each genre, and only 5 seconds are extracted in the middle of the original signal.Information about participants are asked at the end of the survey, including gender, age, and the number of years of musical practice.Given the poorer LSD performance of our proposed polyphonic schemes compared to the monophonic one, also confirmed by informal listening by the authors, it has been decided not to consider them for subjective evaluation.This had the benefit of maintaining the duration of the listening test into a reasonable range of about 20 minutes.
First, we conduct an analysis of variance (ANOVA) to check whether the factor of musical training is a significant source of variation in the rating data.We consider an individual as being a musician if it has an experience of at least one year.Considering that, the ANOVA run on the rating distributions for musician and non-musician subjects gives a p-value of 1.61 • 10 −8 , which tells us that being a musician or not has a significant effect on the test ratings.
Close inspection of the ratings showed that the rankings of the different methods are the same for both populations.
The only difference a different bias, where musicians were on average more severe than non-musicians, as can be seen on Fig. 8, that show the distributions of ratings for musicians and non-musicians the over all models.Next, conducting another ANOVA in which the analyzed factor is the model gives us a p-value of 2.18.10 −88 , which is very small and shows that the choice of model is a significant source of variation in our collected data, thus the possibility of comparing the rating distributions of all models.With another ANOVAs on the models but for the data splits in musician or non-musician subsets, we obtain similar very low-valued p-values, which tells us that in both case the choice of models has an significant impact on the ratings among the participants.
The distributions of the participants' ratings for all models and all stimuli are plotted as boxplots in Fig. 7.We can see that the outputs of model DDSP-mono-dec are in average rated to be of fair quality (almost good), whereas the outputs from Resnet, DDSP-noise and no processing are typically rated as poor, and SBR outputs are quite often rated as being of bad quality.An important outcome is that DDSP-mono-dec provides a large margin improvement compared to the Null baseline, meaning that this method is able to improve audio quality at a low computational cost.
By computing a t-test on rating distribution of DDSP-mono-dec against the other models, we can verify that it is significantly better than the other ones.The p-values obtained for the t-test are well below the typical threshold of 0.05, so the distributions are significantly different from each other.We can thus conclude that, from the data of the listening test, the DDSP-mono-dec gives perceptively better high-frequency contents than the other evaluated models.
Monophonic and polyphonic examples are available online 2 .The latter have been considered as stimuli in the listening test.

Conclusion
In this article, we explored differentiable digital signal processing models for bandwidth extension of monophonic and polyphonic musical signals.We showed the benefit of using a monophonic DDSP model to generate high frequencies of monophonic signals against the two baselines, including a high complexity deep-learning-based resnet model.Then, we designed two systems to address polyphonic BWE: a cyclic use of a monophonic DDSP model, and an adapted DDSP model with polyphonic synthesis capacities.On polyphonic signals, the proposed polyphonic systems showed to be more effective on polyphonic synthetic signals, but failed to surpass the monophonic DDSP model on real data.
In addition, we conducted a listening test with the MUSHRA methodology, which showed that the DDSP-mono-dec model was more pleasant to the ear for most participants, when compared to the baselines.For future work, we think that considering a more advanced multi-pitch estimator could enable the polyphonic models to generate less artifacts, and that other artifacts could be avoided by researching further the loudness estimation procedure.

Figure 2 :
Figure 2: Cyclic use of the monophonic DDSP model (DDSP-mono-dec-cyclic) for bandwidth extension of polyphonic signals.

Figure 4 :
Figure 4: Generated upper frequency band using the model DDSP-mono-dec, the baseline SBR and the reference Resnet model, for a synthetic signal containing harmonics based on the MIDI note G#5 (≈ 830 Hz).The vertical line shows the limit between the low and high bands.

Figure 5 :
Figure 5: Generated upper frequency band using the proposed models and the baselines, for one frame of a synthetic signal containing four notes: C#5 (≈ 554 Hz), D#5 (≈ 587 Hz), D5 (≈ 622 Hz) and F#6 (≈ 1479 Hz).The vertical line shows the limit between the low and high bands.

Figure 6 :
Figure6: Spectrograms showing the generated upper frequency band using the proposed models and the baselines for a real-world signal from a pop music track.The horizontal white line shows the limit between the low and high bands at 1000 Hz.

Figure 7 :
Figure 7: Stimuli ratings for, from top to bottom, the anchor Null corresponding to the input signal without any process, the SBR baseline, the proposed models DDSP-mono-dec and DDSP-noise, and the reference Resnet model.Boxes correspond to the interquartile range (IQR) over all participants, with the mean indicated by an orange vertical line.Lower and upper whiskers are set to 1.5 × IQR below and above Q 1 and Q 3 , respectively.

Figure 8 :
Figure 8: Stimuli ratings from subjects with (bottom) and without (top) musical training.Boxes correspond to the interquartile range (IQR) over all participants, with the mean indicated by an orange vertical line.Lower and upper whiskers are set to 1.5 × IQR below and above Q 1 and Q 3 , respectively.

Figure 9 :
Figure 9: Real-time CPU inference percentage vs.LSDs for the proposed models and the baselines over the dataset Gtzan.

Table 3 :
LSD performance of the evaluated models for monophonic and polyphonic real-world datasets.Best models are shown in bold for each dataset.The last two columns show CPU inference time expressed as real time percentage and the number of parameters of the models.