Sampling the user controls in neural modeling of audio devices

This work studies neural modeling of nonlinear parametric audio circuits, focusing on how the diversity of settings of the target device user controls seen during training affects network generalization. To study the problem, a large corpus of training datasets is synthetically generated using SPICE simulations of two distinct devices, an analog equalizer and an analog distortion pedal. A proven recurrent neural network architecture is trained using each dataset. The datasets differ in the sampling resolution of the device user controls and in their overall size. Based on objective and subjective evaluation of the trained models, a sampling resolution of five for the device parameters is found to be sufficient to capture the behavior of the target systems for the types of devices considered during the study. This result is desirable, since a dense sampling grid can be impractical to realize in the general case when no automated way of setting the device parameters is available, while collecting large amounts of data using a sparse grid only incurs small additional costs. Thus, the result provides guidance for efficient collection of training data for neural modeling of other similar audio devices.


Introduction
Virtual analog (VA) modeling is an active subdiscipline of audio processing that attempts to imitate analog and electromechanical audio hardware using software [1,2]. Within the past decades, progress in the field has allowed for digital replications of various audio hardware units, including guitar amplifiers [3][4][5] and synthesizer subcircuits, such as voltage-controlled oscillators [6,7] and filters [8][9][10], as well as studio hardware, such as dynamic range compressors [11,12] and effect processors [13,14].
The approaches used for VA modeling are traditionally divided into white-box, gray-box, and black-box methods. White-box methods use explicit knowledge of the target circuits to discover and derive the physical constraints that govern the systems, oftentimes encountered as ordinary differential equations (ODEs), which are then discretized and simulated on a computer. Examples of approaches belonging to this category include numerical ODE solvers [3,15], state-space methods [16], and wave digital filters [5,9,17]. Black-box methods use input-output relationships collected from the target circuits together with general-purpose digital models to try and match the observed behavior via optimization. Examples of black-box approaches include block-based methods [18], Volterra series expansion [19], and dynamic convolution [20]. In gray-box modeling methods, a combination of these two approaches is used [12,21,22].
Within the past decade, progress in the field of machine learning (ML), especially in deep learning (DL) [23][24][25], has ignited an exploration within the VA modeling community into the applicability of ML methods to the task of circuit modeling. As in other domains, DL approaches have been shown to be capable of achieving state-of-the-art accuracy when applied to a variety of modeling targets [26][27][28]. While early examples of DL-based modeling approaches have explored the use of end-to-end neural networks (NNs), such as convolutional neural networks and recurrent neural networks (RNNs), in a black-box manner [29][30][31], later work has utilized knowledge of the underlying circuits, steering the approaches more towards the white-box end of circuit modeling [32][33][34].
In order to be optimized for the VA modeling task, DL methods require a dataset representing the target behavior to be collected. Within the supervised learning paradigm [25], the dataset consists of input-output pairs of audio and the related circuit configurations, e.g., the settings for the user controls, in the case of parametric circuits. While the data can be collected using recordings of the target circuit or via circuit simulation, it has not been clear how the space of user controls of the target should be sampled and exposed to the NNs for them to generalize over all possible configurations. This uncertainty is exemplified in the number of differing practices adopted in the field for collecting training data for the case of parametric circuits [31,35,36,37]. More precisely, Wright et al. [31] use a uniform sampling grid with 5 points, Hawley et al. [35] use a uniform sampling grid with 10 or 21 points depending on the target, and both Nercessian et al. [36] and Juvela et al. [37] use random sampling without restrictions on the resolution. Moreover, the question becomes increasingly important for circuits with more than a few user controls, since the space of all possible configurations grows exponentially with the number of parameters.
This paper addresses this gap by synthetically generating a large number of datasets for two distinct nonlinear modeling targets and comparing the performance of NNs trained on each of these datasets. The datasets differ in the way the target circuit parameter space is sampled, also taking into account the effect the overall dataset size has on the problem. The data is collected by constructing and running SPICE simulations of the modeling targets, allowing for the collection to happen in a strictly controllable and automated manner. For modeling the circuit behavior, a proven RNN architecture is used, with the models only exposed to input-output pairs of audio and circuit configurations of the targets in a black-box manner. To evaluate the performance of the networks, both an objective evaluation, based on comparison of the loss metrics, and a subjective evaluation, based on a listening test, are given. We point the reader to the accompanying web page for additional materials.
The rest of this paper is organized as follows. Section 2 gives a mathematical description of black-box VA modeling formulated as a supervised ML task and also describes the deep NN architecture used for modeling. Section 3 presents a technical overview of the two modeling targets, an analog distortion pedal and an analog equalizer. Section 4 describes the data collection and the sampling of the user controls, forming the main body of the work. Section 5 gives an outline of the training procedure. Sections 6 and 7 report on the evaluation procedure and the results. Finally, Sect. 8 concludes the paper.

Neural modeling of audio circuits
From an ML perspective, VA modeling can be seen as a sequence modeling task usually solved as a supervised learning problem. In supervised learning problems [25], the learning algorithm, usually called the model, is trained using a dataset $D$ of input-output pairs $(\mathbf{u}^{(i)}, \mathbf{y}^{(i)})$ to give predictions $\hat{\mathbf{y}}$ for inputs $\mathbf{u} \notin D$ not seen during training. In black-box VA modeling, the inputs and the outputs are discrete-time representations of audio collected from some target device, and in the case of parametric circuits, the model also receives the device configurations $\boldsymbol{\phi}$ for each pair. In this scenario, the dataset $D$ is of the form:

$$ D = \left\{ \left( \mathbf{u}^{(i)}, \mathbf{y}^{(i)}, \boldsymbol{\phi}^{(i)} \right) \right\}_{i=1}^{N}, \tag{1} $$

where $\mathbf{u}$ is an input vector of sampled audio, $\mathbf{y}$ is the corresponding output vector, $\boldsymbol{\phi}$ is a vector storing the device configuration, and $N = |D|$ is the size of the dataset.
Most electronic circuits are stateful systems due to the energy-storing elements, like capacitors and inductors, present in them [38]. Thus, in order to understand their behavior given some inputs and the circuit configuration, one needs to also observe the evolution of the circuit output. In discrete time and for parametric circuits, this can be written as a recursion:

$$ y[n] = f\left( y[n-1], u[n], \boldsymbol{\phi}[n] \right), \tag{2} $$

where $y[n]$ and $y[n-1]$ are the circuit outputs at the current and previous time steps, $u[n]$ and $\boldsymbol{\phi}[n]$ are the input to the circuit as well as its configuration at the current time step, and $f(\cdot)$ performs the mapping between these quantities.
In black-box neural VA modeling approaches, the mapping $f$ is approximated with a neural network $f_\theta$, whose weights $\theta$ are optimized to minimize a chosen loss function $L(\hat{y}, y)$ over the dataset $D$. The loss function $L$ is used as a metric to evaluate the discrepancy between the true output $y$ and the model prediction $\hat{y}$, and the gradient of the loss w.r.t. the model weights, $\nabla_\theta L$, is used to iteratively step the weights towards the optimum.
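To make the recursion y[n] = f(y[n − 1], u[n], φ[n]) concrete, the following toy sketch implements a one-pole low-pass filter whose single user control is its cutoff frequency. The filter, the coefficient formula, and all names are illustrative assumptions and are not taken from the circuits or models studied in this work.

```python
import math


def one_pole_lowpass(u, cutoff_hz, sample_rate=44100.0):
    """Toy stateful system: y[n] depends on y[n-1], u[n], and one control."""
    # Smoothing coefficient of a one-pole low-pass (illustrative choice).
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    y_prev = 0.0  # the state carried from one time step to the next
    y = []
    for u_n in u:
        y_n = y_prev + a * (u_n - y_prev)  # y[n] = f(y[n-1], u[n], phi)
        y.append(y_n)
        y_prev = y_n
    return y


# The same input produces different outputs for different control settings.
step = [1.0] * 2000
dark = one_pole_lowpass(step, cutoff_hz=200.0)
bright = one_pole_lowpass(step, cutoff_hz=5000.0)
```

A black-box model has to learn both the dependence on the previous output and the dependence on the control setting from data alone, which is why the sampling of the controls matters.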

Model architecture
Due to the stateful form of the studied systems and following earlier research [29,31,33], a stateful NN architecture was chosen for approximating the mapping $f(\cdot)$. In practice, this meant using an RNN, which is a common choice for sequence modeling tasks due to its ability to process inputs of varying lengths [25].
The deep NN architecture used in this study, originally introduced for modeling parametric nonlinear circuits in [31], is shown in Fig. 1. The architecture consists of a gated recurrent unit (GRU) [39] and a fully connected (FC) output layer. The GRU $g_{\theta'}$ computes the state-to-state transition of the model given an input vector $\mathbf{x}[n]$ and its previous hidden state $\mathbf{h}[n-1]$ as follows:

$$ \mathbf{h}[n] = g_{\theta'}\left( \mathbf{x}[n], \mathbf{h}[n-1] \right). \tag{3} $$

The exact computations performed by a GRU are left out, as they have been described earlier elsewhere [31,33]. The GRU can be conditioned to adapt to the circuit configurations $\boldsymbol{\phi}[n]$ by concatenating them together with the input audio $u[n]$, making the input vector [31]:

$$ \mathbf{x}[n] = \left[ u[n], \boldsymbol{\phi}[n] \right]. \tag{4} $$

After computing the hidden state $\mathbf{h}[n]$ for the current time step, it is passed on to the FC output layer, which performs a memoryless mapping $\mathbf{h}[n] \to \hat{y}[n]$, producing the output prediction for the current time step. Overall, the model predicts the current output sample as:

$$ \hat{y}[n] = f_\theta\left( \mathbf{x}[n], \mathbf{h}[n-1] \right), \tag{5} $$

where $\mathbf{x}[n]$ is the concatenation of the input audio and device configurations as in Eq. (4).
The size of the hidden state $|\mathbf{h}|$, and in turn the input dimensionality of the FC output layer, determines the representational capabilities of the model and is one of its hyperparameters. In our work, a hidden size $|\mathbf{h}| = 32$ was used, which was found sufficient for modeling the chosen targets according to a hyperparameter search conducted during an early experimental phase [40]. The model was implemented in Python using the PyTorch [41] ML framework.
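As a rough illustration of how the conditioning enters the model, the sketch below spells out one time step of a GRU followed by a linear output layer in dependency-free Python. It is not the PyTorch implementation used in this work: the class name, the weight initialization, and the gate formulation (the standard one, as also used by PyTorch's GRU) are assumptions made for the example, and the weights are untrained.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, v):
    """Return W v + b for a list-of-lists matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def rand_layer(rng, n_out, n_in, scale=0.1):
    """Small random weights and biases (arbitrary, untrained)."""
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-scale, scale) for _ in range(n_out)]
    return W, b


class TinyConditionedGRU:
    """Hypothetical GRU + FC model conditioned on the user controls."""

    def __init__(self, num_controls, hidden_size=32, seed=0):
        rng = random.Random(seed)
        n_in = 1 + num_controls  # one audio sample plus the controls
        self.hidden_size = hidden_size
        # Input and hidden weights for the reset (r), update (z), and
        # candidate (n) gates.
        self.wi = {g: rand_layer(rng, hidden_size, n_in) for g in "rzn"}
        self.wh = {g: rand_layer(rng, hidden_size, hidden_size) for g in "rzn"}
        self.fc = rand_layer(rng, 1, hidden_size)

    def step(self, u_n, phi_n, h_prev):
        x = [u_n] + list(phi_n)  # concatenate audio and controls
        pi = {g: affine(*self.wi[g], x) for g in "rzn"}
        ph = {g: affine(*self.wh[g], h_prev) for g in "rzn"}
        r = [sigmoid(a + b) for a, b in zip(pi["r"], ph["r"])]
        z = [sigmoid(a + b) for a, b in zip(pi["z"], ph["z"])]
        n = [math.tanh(a + r_k * b) for a, r_k, b in zip(pi["n"], r, ph["n"])]
        h = [(1.0 - z_k) * n_k + z_k * h_k for z_k, n_k, h_k in zip(z, n, h_prev)]
        y_hat = affine(*self.fc, h)[0]  # memoryless FC output layer
        return y_hat, h

    def process(self, u, phi):
        h = [0.0] * self.hidden_size
        out = []
        for u_n in u:
            y_hat, h = self.step(u_n, phi, h)
            out.append(y_hat)
        return out


model = TinyConditionedGRU(num_controls=3)
y_a = model.process([0.0, 0.5, -0.5, 0.1], phi=[0.2, 0.5, 1.0])
y_b = model.process([0.0, 0.5, -0.5, 0.1], phi=[1.0, 0.5, 0.2])
```

With three controls the input vector has four elements per time step, and changing only the control values changes the output for the same audio input.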

Modeling targets
To study the research problem, two nonlinear circuits with short-term memory were chosen: the Pultec EQP-1A, a saturating analog equalizer (EQ) [42], and the ProCo RAT, an analog distortion effect [43]. The choice of the devices was guided by the circuits' nonlinear nature, which makes the modeling inherently harder and justifies the usage of NNs, as well as their varying number of user controls $|\boldsymbol{\phi}|$. The varying $|\boldsymbol{\phi}|$ was thought to influence the problem setting in a meaningful way by altering the dimensionality the model has to adhere to. In the following, both of the chosen circuits are given a brief technical description.

Pultec EQ
The Pultec EQ is a famous early studio program equalizer from the 1950s, originally manufactured by Pulse Techniques, Inc. [42]. It was one of the first EQs that allowed for continuous adjustment of multiple frequency bands. The sound of the Pultec is still revered due to its musical equalizing curves, as well as the warmth brought by the various transformers and vacuum tubes utilized in the circuit, as evidenced by the number of software emulations and hardware derivatives available on the market [44][45][46].
The circuit schematic of the Pultec EQ with the power conditioning circuitry removed is shown in Fig. 2. The system consists of a passive EQ stage and an active recovery/amplification stage, highlighted with colored boxes in Fig. 2. The EQ stage contains a low-shelving filter, a bell-shaped filter centered in the mid-to-high frequencies, and a high-shelving filter. The processing stages and the input and output of the circuit are connected using transformers, which produce harmonic distortion components, especially in the low frequencies [47]. Since the equalizer stage is passive, the signal level is attenuated as it passes through, which is then compensated for at the recovery stage. The recovery stage uses a balanced push-pull arrangement of vacuum tubes for the amplification, producing additional harmonic distortion components in the processed signal [48]. The user controls of the circuit consist of both switchable and continuous controls. The switchable controls are the crossover frequency of the low-shelving filter $f_{\mathrm{low}}$, the center frequency of the bell-shaped filter $f_{\mathrm{mid}}$, and the crossover frequency of the high-shelving filter $f_{\mathrm{high}}$. The continuous controls consist of, unusually, separate boosting $g_{\mathrm{low}}$ and attenuation $a_{\mathrm{low}}$ controls for the low-shelving filter, boosting $g_{\mathrm{mid}}$ and resonance $q_{\mathrm{mid}}$ controls for the bell-shaped filter, and an attenuation $a_{\mathrm{high}}$ control for the high-shelving filter. Since the controls are nonorthogonal and the effective boosting and attenuation crossover frequencies are not exactly the same, dialing in both boosting and attenuation simultaneously allows for acquiring more complicated frequency responses than would be immediately obvious.
For the scope of this work, the switchable characteristics of the Pultec EQ were set to constant values $(f_{\mathrm{low}}, f_{\mathrm{mid}}, f_{\mathrm{high}}) = (20, 3\mathrm{k}, 20\mathrm{k})$ Hz, and only the continuously variable controls were used for conditioning the Pultec EQ-based models. Thus, the vector of conditioning values for the Pultec EQ becomes $\boldsymbol{\phi}_{\mathrm{PULTEC}} = [g_{\mathrm{low}}, a_{\mathrm{low}}, g_{\mathrm{mid}}, q_{\mathrm{mid}}, a_{\mathrm{high}}]$. An example magnitude response of the Pultec EQ is shown in Fig. 3, with the user controls set to $\boldsymbol{\phi}_{\mathrm{PULTEC}} = [1.0, 0.4, 0.75, 0.0, 0.75]$, where the values from 0 to 1 represent the linear range of the potentiometer settings, from minimum to maximum.

ProCo RAT
The ProCo RAT is a popular analog distortion pedal originating from the late 1970s, originally manufactured by Pro Co Sound [43]. It is widely adopted by musicians of many disciplines to add richness and complexity to the signals driven through it, a result of the highly nonlinear operation of the circuit.
The circuit schematic of the ProCo RAT with the power conditioning circuitry removed is shown in Fig. 4. A thorough analysis of the circuit is presented in [43]. The system consists of three stages: a clipping stage, a tone control stage, and an output stage, highlighted with colored boxes in Fig. 4. The clipping stage consists of an adjustable non-inverting amplifier driving a hard-clipping circuit, producing high amounts of distortion as the signal level is brought up. The non-inverting amplifier is implemented using an operational amplifier as the active element, and the hard clipping is achieved with an anti-parallel connection of silicon diodes. The tone control stage consists of an adjustable first-order RC low-pass filter with a −6 dB/octave slope. The output stage consists of a source follower circuit driving an adjustable voltage divider, and is used to decouple the pedal electronics from the downstream circuitry. The source follower is implemented using a junction field effect transistor (JFET) as the active element, the inherent nonlinearities of which also add to the harmonic distortion components produced by the circuit.
Each of the three processing stages contains a single continuous control. These controls are the distortion amount $g_{\mathrm{dist}}$ of the clipping stage (the gain of the non-inverting amplifier), the cutoff frequency of the tone stage $f_{\mathrm{tone}}$, and the output volume $g_{\mathrm{vol}}$ (the setting of the voltage divider). All of these controls were used as conditioning for the ProCo RAT-based models, making the vector of conditioning values for the ProCo RAT consist of these three controls, $[g_{\mathrm{dist}}, f_{\mathrm{tone}}, g_{\mathrm{vol}}]$.

Data collection
To collect data for training, validation, and testing, SPICE simulations of the target circuits were utilized. The SPICE netlists were constructed using LTSpice [50], and the simulations were then invoked and controlled using the PyLTSpice [51] wrapper and library for Python. The various potentiometer laws encountered in the circuits were modeled according to the power approximations given in [52], and the individual user controls $\phi \in \boldsymbol{\phi}$ were parameterized so that their effective ranges were within the closed interval [0, 1]. The following subsections describe the source audio used for exciting the circuits, the parameter sampling procedure, and the datasets used for training the models.

Source audio
The source audio used to excite the circuits consisted of 4 min of audio sampled at 44.1 kHz. The audio was divided into 1 min of guitar and 1 min of bass passages from [53] and [54], respectively, in addition to 1 min of synthesized logarithmic sine sweeps and 1 min of synthesized white noise passages, both at various amplitudes. This choice of source audio was inspired by successful experiments found in the literature [32,34,55], and the overall length was chosen to be similar to that which was used in related work [26,33]. In order to ensure excitation of the circuits in their most nonlinear regions, all of the 1-min collections of recordings were normalized so that their maximum peak values reached −0.1 dBFS. Finally, to allow for using mini-batches later during training, the 4 min of source audio was split into 1-s segments, forming the partial dataset $D' = \{ \mathbf{u}^{(i)} \}_{i=1}^{240}$.
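The two preprocessing steps described above, peak normalization to −0.1 dBFS and splitting into 1-s segments, can be sketched as follows. The function names are hypothetical, full scale is assumed to correspond to an absolute sample value of 1.0, and a trailing partial segment is assumed to be discarded.

```python
def normalize_peak(x, peak_dbfs=-0.1):
    """Scale x so that its largest absolute sample sits at peak_dbfs."""
    peak = max(abs(v) for v in x)
    if peak == 0.0:
        return list(x)  # silence cannot be normalized
    gain = 10.0 ** (peak_dbfs / 20.0) / peak
    return [gain * v for v in x]


def split_segments(x, sample_rate=44100, seconds=1.0):
    """Split x into consecutive full-length segments, dropping any remainder."""
    n = int(round(sample_rate * seconds))
    return [x[i:i + n] for i in range(0, len(x) - n + 1, n)]


# Tiny demonstration with a low sample rate to keep the example light:
# 4 min of audio at 100 Hz yields 240 one-second segments.
demo_rate = 100
audio = [((i % 50) - 25) / 50.0 for i in range(4 * 60 * demo_rate)]
normalized = normalize_peak(audio)
segments = split_segments(normalized, sample_rate=demo_rate)
```

At 44.1 kHz the same call produces 240 segments of 44,100 samples each, matching the size of the partial dataset.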

Parameter sampling
In order to construct datasets of the form given in Eq. (1), each input $\mathbf{u}^{(i)} \in D'$ was driven through each of the targets, and for each simulation round, the user controls $\phi \in \boldsymbol{\phi}$ were sampled as:

$$ \phi \sim \mathcal{U}\{0, 1\}, \quad \phi \in \left\{ \frac{k}{\delta - 1} : k = 0, 1, \dots, \delta - 1 \right\}, \quad \delta \in \mathbb{Z}^{+}, \tag{6} $$

where $\mathcal{U}\{0, 1\}$ is the discrete uniform distribution spanning [0, 1], δ is the sampling density or the number of sampled values within the interval, and $\mathbb{Z}^{+}$ is the set of positive integers. For example, when the sampling density is δ = 3, each $\phi \in \boldsymbol{\phi}$ can take values from the set {0.0, 0.5, 1.0}, and each of these values has an equal likelihood of being selected.
For our experiments, the sampling densities δ = [3, 5, 9, 17] were considered, as illustrated in Fig. 6, on top of which a continuous sampling φ ∼ U{0, 1} was trialed, which we denote δ = "c" (continuous). The rationale in choosing the sampling densities was in halving the distances between the allowed sampling points at every increment of δ, such that for our choice, the distances became [0.5, 0.25, 0.125, 0.0625], or 1/(δ − 1), as well as the 64-bit floating point resolution for δ = "c".
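The sampling procedure can be sketched in a few lines of Python. The helper names are hypothetical, and the continuous case uses the standard library generator, which draws from the half-open interval [0, 1) rather than the closed interval.

```python
import random


def control_grid(density):
    """Allowed control values for a sampling density: k / (density - 1)."""
    if density < 2:
        raise ValueError("a grid over [0, 1] needs at least two points")
    return [k / (density - 1) for k in range(density)]


def sample_controls(num_controls, density, rng):
    """Draw one device configuration; density 'c' means continuous sampling."""
    if density == "c":
        # random() returns values in [0, 1), a close stand-in for [0, 1].
        return [rng.random() for _ in range(num_controls)]
    grid = control_grid(density)
    return [rng.choice(grid) for _ in range(num_controls)]


rng = random.Random(0)
rat_config = sample_controls(num_controls=3, density=5, rng=rng)
pultec_config = sample_controls(num_controls=5, density="c", rng=rng)
```

Calling `control_grid` with 3, 5, 9, and 17 gives grid spacings of 0.5, 0.25, 0.125, and 0.0625, so every coarser grid is a subset of every finer one.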

Training datasets
To investigate the effects of the sampling density and the dataset size on the model generalization, pools of 50 datasets were constructed for each sampling density δ = [3, 5, 9, 17, "c"] and target, denoted as:

$$ \left\{ D_{\delta,j} \right\}_{j=1}^{50}. \tag{7} $$

Due to the stochastic nature of the sampling procedure in Eq. (6), the contents of datasets $D_{\delta,j}$ generated with the same sampling density δ are different, and the number of possible device configurations $\boldsymbol{\phi}$ in $D_{\delta,j}$ grows steeply with the sampling density used (a full grid holds $\delta^{|\boldsymbol{\phi}|}$ configurations). At the same time, since the sizes of the datasets $|D_{\delta,j}| = 240$ are the same, the increasing diversity comes with the risk of $\boldsymbol{\phi} \in D_{\delta,j}$ representing only a small subspace of the different possible configurations.
Noting the connection between the sampling density and the configuration diversity, the final datasets for training the models were constructed by creating stacks of n = [1, 2, 4, 8, 16] datasets $D_{\delta,j}$ for each sampling density δ. To gain robustness against the stacked datasets possibly representing a particularly uneven distribution of the device configurations $\boldsymbol{\phi}$, five random draws from the $\binom{50}{n}$ possible subsets were created for each (δ, n) pair, with the corresponding datasets denoted as $D_{\delta,n\times,k}$, $k \in [1, \dots, 5]$. For example, when (δ, n) = (3, 4), the five datasets constructed for the configuration are denoted $D_{3,4\times,k}$. With the considered choices for δ, n, and the number of draws, the total number of datasets constructed for each target sums to 5 × 5 × 5 = 125.
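The bookkeeping of the stacked datasets can be illustrated with the short sketch below, which only draws the indices j of the pool members that would form each stack. The names and the use of index lists in place of actual audio data are assumptions made for the example.

```python
import random

DENSITIES = [3, 5, 9, 17, "c"]
STACK_SIZES = [1, 2, 4, 8, 16]
POOL_SIZE = 50      # datasets per sampling density
NUM_DRAWS = 5       # random stacks per (density, stack size) pair
SEGMENTS_PER_DATASET = 240


def draw_stacks(rng):
    """Map (delta, n, k) to the pool indices j forming the stacked dataset."""
    stacks = {}
    for delta in DENSITIES:
        for n in STACK_SIZES:
            for k in range(1, NUM_DRAWS + 1):
                # Draw n distinct pool members without replacement.
                members = rng.sample(range(1, POOL_SIZE + 1), n)
                stacks[(delta, n, k)] = sorted(members)
    return stacks


stacks = draw_stacks(random.Random(0))
num_datasets = len(stacks)
largest_stack_segments = 16 * SEGMENTS_PER_DATASET
```

The dictionary holds 125 entries per target, and the largest stacks contain 16 × 240 = 3840 one-second segments.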

Model training
To better understand the effect of dataset choice on model generalization, 125 models for each target device were trained, one for each distinct dataset $D_{\delta,n\times,k}$ introduced in Sect. 4. The following subsections give a detailed description of the training procedure.

Validation and test sets
To create common benchmarks for evaluating the models trained using diverse training sets, separate validation and test sets were constructed for each target. The source audio for both the validation and test sets consisted of 30 s of unseen guitar and 30 s of unseen bass passages from the same distributions that were used during training, and the user controls $\phi \in \boldsymbol{\phi}$ were sampled continuously from U{0, 1}. These settings were meant to mimic the model operating conditions during inference.
To gain additional robustness to unlucky draws during the final testing phase, five iterations of the test sets were constructed for each target, over which the final results are further aggregated, as explained in Sect. 6. As was done for the training datasets, the 30-s collections of recordings were normalized to maximum peak levels of −0.1 dBFS and split into 1-s segments to aid parallel processing. We denote the validation set $D_{\mathrm{val}}$ and the test sets $D_{\mathrm{test},l}$, $l \in [1, \dots, 5]$.

Loss
To optimize the model weights θ, the error of the model prediction $\hat{y}$ in comparison with the target output $y$ is first evaluated using a chosen loss $L(\hat{y}, y)$, and the gradient of the loss w.r.t. the model weights, $\nabla_\theta L$, is computed using the back-propagation algorithm [25]. In our work, the error-to-signal ratio (ESR) loss was used, defined as [30]:

$$ L_{\mathrm{ESR}} = \frac{\sum_{n=0}^{N-1} \left| y[n] - \hat{y}[n] \right|^{2}}{\sum_{n=0}^{N-1} \left| y[n] \right|^{2}}, \tag{8} $$

where $y[n]$ is the target output, $\hat{y}[n]$ the model prediction, $N$ is the sequence length, and $|\cdot|$ is the absolute value operator. The term in the denominator normalizes the loss computations w.r.t. the energy of the target output in order to prevent high-energy segments from dominating the weight optimization. For the scope of this work, no pre-emphasis filter was used before the loss computations, although it is known to be advantageous [56], since we wanted to test a basic NN model without extensions.
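For reference, the ESR can be computed for a pair of sequences as in the following dependency-free sketch. The function name is hypothetical, and the explicit check for a silent target is an added safeguard that is not part of the definition.

```python
def esr(y, y_hat):
    """Error-to-signal ratio between a target y and a prediction y_hat."""
    if len(y) != len(y_hat):
        raise ValueError("target and prediction must have the same length")
    error_energy = sum(abs(a - b) ** 2 for a, b in zip(y, y_hat))
    target_energy = sum(abs(a) ** 2 for a in y)
    if target_energy == 0.0:
        raise ValueError("ESR is undefined for an all-zero target")
    return error_energy / target_energy


target = [1.0, -1.0, 0.5, -0.5]
perfect = esr(target, target)             # identical signals
silent = esr(target, [0.0] * 4)           # predicting silence
half = esr([1.0, -1.0], [0.5, -0.5])      # prediction at half amplitude
```

A perfect prediction gives 0, predicting silence gives 1, and a prediction at half the target amplitude gives 0.25, which shows that the measure is independent of the absolute signal level.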

Training procedure
Instead of computing the gradient $\nabla_\theta L$ using the full training set $D$, the gradient is approximated using mini-batches of examples $B \sim D$, and noisy estimates of $\nabla_\theta L$ computed over $B$ are used until each example $(\mathbf{u}^{(i)}, \mathbf{y}^{(i)}, \boldsymbol{\phi}^{(i)}) \in D$ has been used [25]. This approach, commonly known as stochastic gradient descent (SGD), results in a higher number of optimizer calls for each training epoch in comparison with using the full gradient over $D$. In our implementation, we use a relatively small batch size $|B|$ of $2^5 = 32$ according to earlier experimental practice [26,28].
When applying back-propagation to RNNs, the recursive computational graph resulting from evaluating Eq. (5) over the input sequence $\mathbf{u}$ is first unfolded to a regular directed computational graph, and the approximated gradients are computed using standard back-propagation rules [25]. This approach is commonly known as back-propagation through time (BPTT). Instead of unfolding the computational graph over the whole of $\mathbf{u}$, the input sequence is further split into shorter portions and the computational graphs are unfolded sequentially until the whole input sequence is traversed, calling the optimizer at the end of each portion [57]. This approach, commonly known as truncated BPTT, further increases the number of optimizer calls and speeds up training. In our implementation, we let the model state initialize for $N_{\mathrm{INIT}} = 2^{10} = 1024$ samples before tracking the gradient, and the subsequent gradients were estimated using the same number of steps $N_{\mathrm{TBPTT}} = 1024$ [14].
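The way truncated BPTT partitions a 1-s training segment can be sketched as below. The function name is hypothetical, and the sketch assumes that the final, shorter portion is also used for an optimizer step.

```python
def tbptt_segments(num_samples, n_init=1024, n_tbptt=1024):
    """Return (start, end) sample ranges after the state initialization.

    The first n_init samples only warm up the model state; each returned
    range corresponds to one gradient computation and optimizer call.
    """
    segments = []
    start = n_init
    while start < num_samples:
        end = min(start + n_tbptt, num_samples)
        segments.append((start, end))
        start = end
    return segments


portions = tbptt_segments(44100)
num_optimizer_calls = len(portions)
```

For a 44,100-sample input this yields 43 portions, the last of which covers only the remaining 68 samples.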
After computing the gradient estimate, the model weights θ are stepped towards the negative gradient using some optimizer function. In our implementation, we used the Adam optimizer [58] with a learning rate γ of $1 \times 10^{-3}$ and betas $(\beta_1, \beta_2) = (0.9, 0.999)$, which correspond to the default values for the method as implemented in PyTorch [41].

Normalizing compute
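To ensure a fair comparison of models trained on datasets $D$ of varying sizes, the number of optimizer steps the models were trained for was kept constant. The number of optimizer steps resulting from an epoch of training using truncated BPTT can be computed as:

$$ s_{\mathrm{epoch}} = |\mathcal{B}| \left\lceil \frac{|\mathbf{u}| - N_{\mathrm{INIT}}}{N_{\mathrm{TBPTT}}} \right\rceil, \tag{9} $$

where $|\mathcal{B}|$ is the number of mini-batches in an epoch (not to be confused with the batch size $|B|$), $|\mathbf{u}|$ is the input length in samples, and $\lceil \cdot \rceil$ is the ceiling function. The total number of optimizer steps $s_{\mathrm{total}}$ is then:

$$ s_{\mathrm{total}} = N_E \, s_{\mathrm{epoch}}, \tag{10} $$

where $N_E$ is the number of epochs.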
In our experiments, all of the models were trained for $s_{\mathrm{total}} = 430{,}000$ optimizer steps, or 10,000 batches of size $|B| = 32$ (43 optimizer steps per batch of 1-s segments), after which the training was stopped. The models were trained on modern GPUs, and a typical training round took approximately 3-5 h to finish. During the progress of training, the validation loss over the validation sets $D_{\mathrm{val}}$ was seen to saturate for both targets, as exemplified in Fig. 7 with two randomly chosen models. After the training was stopped, the model weights that produced the lowest validation loss were used for final evaluation.
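The constant step budget can be checked with a few lines of arithmetic. The sketch below is illustrative only: the names are hypothetical, and the per-dataset epoch counts assume that the final, partially filled mini-batch of each epoch is kept, which the text does not state.

```python
import math

SEGMENT_LEN = 44100      # samples in a 1-s training segment
N_INIT = 1024
N_TBPTT = 1024
BATCH_SIZE = 32
TOTAL_BATCHES = 10_000
SEGMENTS_PER_DATASET = 240


def steps_per_batch(input_len, n_init=N_INIT, n_tbptt=N_TBPTT):
    """Optimizer calls triggered by one mini-batch under truncated BPTT."""
    return math.ceil((input_len - n_init) / n_tbptt)


s_total = TOTAL_BATCHES * steps_per_batch(SEGMENT_LEN)

# Epochs needed to spend the same budget for each stack size n
# (assumes the last, partially filled batch of an epoch is kept).
epochs_needed = {}
for n in (1, 2, 4, 8, 16):
    batches_per_epoch = math.ceil(SEGMENTS_PER_DATASET * n / BATCH_SIZE)
    epochs_needed[n] = TOTAL_BATCHES / batches_per_epoch
```

Under these assumptions the smallest datasets are traversed 1250 times and the largest roughly 83 times, while every model receives the same 430,000 weight updates.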

Objective evaluation
To compare the performance of the models trained using the various training sets $D_{\delta,n\times,k}$, the median ESR loss aggregated over the test sets $D_{\mathrm{test},l}$ is computed for each distinct model. In order to rule out the stochasticity of the parameter sampling procedure, the results for the models belonging to the same configuration (δ = δ′, n = n′) are further aggregated. This aggregation procedure is illustrated in Fig. 8 for an example configuration (δ = 5, n = 4), where the ProCo RAT is the modeling target. As shown in the figure, the objective metric used to assess any particular configuration is the aggregate over (#models) × (#test sets) = 5 × 5 = 25 losses.
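One possible reading of the aggregation is sketched below: the 25 per-model, per-test-set losses of a configuration are pooled and their median is taken. Whether the original aggregation pools the losses or nests the medians is not specified in the text, so the pooling is an assumption, and the loss values are made up for the demonstration.

```python
import statistics


def aggregate_configuration(losses_by_model):
    """Median over all (model, test set) losses of one (delta, n) pair.

    losses_by_model holds one list per trained model, each containing
    one loss value per test set.
    """
    pooled = [loss for model_losses in losses_by_model for loss in model_losses]
    return statistics.median(pooled)


# Placeholder losses for 5 models x 5 test sets (not measured values).
example_losses = [
    [(5 * model + test_set + 1) / 100 for test_set in range(5)]
    for model in range(5)
]
eta = aggregate_configuration(example_losses)
```

Using the median rather than the mean keeps a single unlucky model or test set from dominating the score of a configuration.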

Objective results
The aggregated medians (η) of the ESR losses for the ProCo RAT models are visualized in Fig. 9. The best performing configurations are highlighted by using a white font, while configurations within a tolerance of 0.01 from the best are highlighted with a light gray font.
As can be seen from the figure, the models trained using the sampling density δ = 3 clearly perform worse than the models trained using higher sampling densities. Generally speaking, increasing both the sampling density and the dataset size has a positive effect on the model performance, although the benefits seem to saturate along both axes and the contours are not strictly monotonic. Increasing the dataset size has the most positive effect on the model performance when using denser sampling grids for the device parameters, to the extent that the best performing models were trained using the densest sampling grids and larger dataset sizes (δ ≥ 17, n ≥ 4). Beyond the dataset configuration of (δ = 17, n = 4), no further increase in the model performance is achieved. Comparing the error metrics of the best performing models to those of related models in existing literature shows agreement in magnitudes [31,59], further validating the results.
The aggregated medians (η) of the ESR losses for the Pultec EQ models are visualized in Fig. 10. Note that the minimum of the z-axis is an order of magnitude smaller than in Fig. 9, due to the losses being much lower overall for the Pultec EQ models. We hypothesize that while the Pultec EQ has a larger number of user controls that the model has to adhere to in comparison with the ProCo RAT, it remains an easier target to model since it exhibits a greater degree of linearity. Again, the best performing configurations, and those within a tolerance of 0.01 from the best, are highlighted using white and light gray fonts, respectively.
As before, Fig. 10 shows that the models trained using the sparsest sampling grid δ = 3 perform worse than the rest. Beyond this sparsest sampling grid, the loss surface becomes noisy in its shape, and the models within a tolerance of 0.01 from the best configuration are scattered across the (δ, n) search space. Judging by the noisy shape of the loss surface and the small overall error metrics around this region, the results suggest that the models beyond the sparsest sampling grid of δ = 3 have all reached convergence. Acknowledging the uncertainties brought by the noisiness, there is evidence that the models trained using the densest and largest configurations (δ = "c", n ≥ 8) have slightly degraded performance, indicating that continuous sampling is a suboptimal choice. In light of these findings, we conclude that a sampling density of δ = 5 adequately captures the device behavior for this particular target.

Subjective evaluation
In order to gain insight into how the acquired ESR losses relate to the perceptual quality of the trained models, an additional listening test was conducted. The listening test setup was similar to a MUSHRA test (ITU-R BS.1534) [60] and was conducted using the webMUSHRA [61] framework.

Listening test setup
A single model for each sampling density was chosen for the listening test. The models were picked such that for each sampling density, the best performing configuration w.r.t. the dataset size was chosen. From the five possible […]

In order to make the listening test conditions more realistic, a new pool of audio material was collected, comprising short segments of music representing varying genres. The segments in the pool were processed using both target devices and the chosen models, randomizing the device parameters at the start of each segment. The segments processed by the target devices were used as the reference conditions. To create a low-quality anchor, the segments were additionally processed using a hyperbolic tangent function with 25× of input gain, and the resulting outputs were filtered using a high-shelving filter with −18 dB of gain at Nyquist and the corner frequency set at 5.5 kHz. The final segments for the listening test were chosen by computing the segment-wise losses from the predictions generated by each of the chosen models, and picking a set of segments that produced an even distribution of low, mean, and high average losses, in order to have a fair choice of audio for the test. Finally, each segment was normalized to −23 LUFS using the pyloudnorm library [62].
The listening test was conducted in sound-isolated listening booths at the Aalto Acoustics Lab using pairs of Sennheiser HD650 headphones. Fifteen experienced listeners without reported hearing impairments conducted the test, and no subjects were excluded during the post hoc analysis of the ratings.

Subjective results
The results of the listening test are shown in Fig. 11. The asterisk (*) is used to denote the best performing model for each sampling density, according to the selection strategy outlined earlier. Similar to what was found in the objective evaluation, the models that were trained using the sparsest sampling grid δ = 3 clearly perform worse than the rest. From δ = 5 onwards, the performance of the models is seen to saturate for both targets, although in the case of the ProCo RAT, the model representing the choice δ = 9 is seen to perform worse than would be expected. In the case of the ProCo RAT, the saturation of the performance happens at a perceptual quality between good and excellent, while for the Pultec EQ, all of the models with δ ≥ 5 are perceptually indistinguishable from the reference. While acquiring a model with excellent perceptual quality would have been desirable also for the ProCo RAT, we note the agreement between the acquired quality and the existing state-of-the-art for related devices [59] and hypothesize that given the highly nonlinear behavior of the device, reaching this level would have required further considerations such as perceptual weighting of the loss [56] or model anti-aliasing [63].
Listening to the segments processed by the δ = 9 model for the ProCo RAT confirms that the perceptual quality of the model is noticeably worse than the others. To investigate this, we compute the short-time Fourier transform (STFT) loss [64] as well as the ESR loss over the listening test segments for the chosen models, shown in Fig. 12. While the ESR loss on the left of the figure shows the expected monotonic improvement of the loss metrics as the sampling density is increased, the STFT loss on the right clearly shows how the δ = 9 model does not conform to the pattern. This finding suggests that while time-domain losses such as the ESR have been shown to be valid choices for training models of perceptually excellent quality [26,27,33], a frequency-domain loss can help in covering some aspects of the modeling problem not caught by focusing on the time domain only.
Based on the patterns seen in the loss surface for the ProCo RAT models in Fig. 9, the models trained using higher sampling densities would have been expected to outperform the models trained using the sparser sampling grid of δ = 5. However, the results of the listening tests show that, beyond a sampling density of δ = 5, no improvement in the model performance is achieved. This finding can be interpreted as showing, as was also found by analyzing the behavior of the δ = 9 model for the ProCo RAT, that the ESR is not a conclusive perceptual metric and should not be understood as a direct indicator of the model performance. Recalling the error surfaces shown in Figs. 9 and 10, and keeping in mind the saturation of the perceptual quality, we find agreement in the overall trend of the results for both of the considered targets. In light of these findings, we conclude that a sampling grid of δ = 5 is sufficient for capturing the nonlinear behavior of the types of systems considered in this study, for the application of neural VA modeling.
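To make the practical cost of a given sampling density concrete, the following Python sketch enumerates a grid of control settings. It assumes that each control is normalized to [0, 1] and sampled at δ uniformly spaced points including both endpoints, which is an illustrative assumption rather than a description of the exact grid used in this study. The helper name control_grid is hypothetical.

```python
from itertools import product


def control_grid(num_controls, delta):
    """Return all combinations of `delta` uniformly spaced settings in [0, 1]
    for `num_controls` independent controls (assumed uniform spacing)."""
    levels = [i / (delta - 1) for i in range(delta)]
    return list(product(levels, repeat=num_controls))


# Three controls, as in the ProCo RAT conditioning vector
# [g_dist, f_tone, g_vol]; the grid size grows as delta ** num_controls.
for delta in (3, 5, 9):
    grid = control_grid(3, delta)
    print(delta, len(grid))  # 3 -> 27, 5 -> 125, 9 -> 729
```

Under these assumptions, moving from δ = 5 to δ = 9 for a three-control device raises the number of distinct settings that must be dialed in from 125 to 729, which is why a sparse grid is attractive when the controls have to be set by hand.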

Conclusions
This paper studied neural VA modeling of nonlinear parametric circuits, focusing on how the diversity in exposure to varying settings of the device user controls during training affects the network generalization. The problem was studied by generating a large corpus of training datasets for two chosen modeling targets using automated SPICE simulations, and training a proven RNN model for each of the datasets. The dataset properties that were altered during the generation were the sampling resolution of the device user controls, as well as the dataset size.
Our results demonstrate that a sampling density of five for the user controls is sufficient for modeling the types of devices considered in this work, i.e., nonlinear circuits with short-term memory and up to five user controls. This result is helpful when collecting training data for other similar devices: generally, no automatic way of setting the device parameters on an arbitrary grid exists, so a sparse sampling of the parameter space is practically desirable, while collecting larger amounts of data using sparser grids only incurs a small additional cost.
In the future, the scope of the study could be extended to include, for example, multiple model architectures beyond the choice of RNNs, alternative loss functions, especially in the time-frequency domain, and other device types beyond nonlinear circuits with short-term memory. Further work is also needed to establish an explanation of why the densest possible sampling of the parameter space is not always the best choice for the considered task. The findings of this study can help reduce time and effort in collecting training data for deep NN models of audio devices.

Fig. 1 Deep NN architecture used for the modeling

Fig. 2 Pultec EQ circuit diagram showing the equalizer and recovery stages, adapted from [49]

vector of conditioning values φ_RAT = [g_dist, f_tone, g_vol]. An example time-domain response of the ProCo RAT is shown in Fig. 5, with the user controls set to φ_RAT = [1.0, 1.0, 0.75], where again the values represent possible potentiometer settings, such that 1.0 corresponds to the maximum.

Fig. 7 Typical evolution of validation loss during training for both target devices

Fig. 9 The aggregated median (η) losses for the ProCo RAT models, shown (left) as a 3-D plot and (right) in matrix form

Fig. 11 Listening test results showing both the means and 95% confidence intervals of the ratings for both targets, indicating a monotonic increase in perceptual quality until a sampling density of δ = 5, beyond which no further improvement is observed. The asterisk (*) denotes the best-performing model for each sampling density