Synthesis of Soundfields through Irregular Loudspeaker Arrays Based on Convolutional Neural Networks

Most soundfield synthesis approaches rely on extensive and regular loudspeaker arrays, which are often unsuitable for home audio systems because of physical space constraints. In this article, we propose a deep learning-based technique for soundfield synthesis through more easily deployable irregular loudspeaker arrays, i.e., arrays where the spacing between loudspeakers is not constant. The inputs are the driving signals obtained through a plane wave decomposition-based technique. While these driving signals correctly reproduce the soundfield with a regular array, their performance degrades when irregular setups are used. Through a complex-valued convolutional neural network (CNN), we modify the driving signals in order to compensate for the errors in the reproduction of the desired soundfield. Since no ground-truth driving signals are available for the compensated ones, we train the model by calculating the loss between the desired soundfield at a number of control points and the one obtained through the driving signals estimated by the network. Numerical results show better reproduction accuracy with respect to the plane wave decomposition-based technique, the pressure-matching approach, and linear optimizers for driving signal compensation.


Introduction
Soundfield synthesis methods aim at reproducing a desired pressure field in a target region of space through arrays of loudspeakers. In recent years, attention towards this research field has steadily increased due to its potential applications in virtual reality, telepresence, and gaming.
The first approaches to soundfield synthesis relied on extensive loudspeaker setups, driven so as to reproduce an accurate approximation of the desired soundfield. Wave field synthesis (WFS) [1,2] is based on the Huygens-Fresnel principle and synthesizes a desired pressure field through a large number of regularly distributed loudspeakers. Ambisonics [3] is based on the analysis of the soundfield in terms of spherical harmonics and reproduces the desired pressure field in a small listening area. In order to enlarge the area where reproduction is accurate, higher-order ambisonics (HOA) was introduced [4,5]. These physically based approaches reproduce the soundfield with satisfactory quality when regular array geometries are used, such as spherical [6,7], linear [8], or circular [9]. However, their performance severely degrades when using irregular setups. While several techniques have been proposed to adapt HOA to irregular array setups [10,11], such as projection decoding methods [12,13] and all-round ambisonic panning and decoding (AllRAD) [14], they often require the solution of ill-posed problems.
Optimization-based techniques are more easily applicable to irregular loudspeaker setups. The pressure-matching method [15,16] is based on the minimization of the reproduction error at a fixed number of positions in the listening area, denoted as control points. The desired driving signals are then obtained through a regularized least squares optimization problem. While this approach is applicable to setups having extremely irregular geometries, the achievable reproduction quality strongly depends on the selection of the control points, and high accuracy typically requires sampling the listening area with a fine grid. Its computational cost, however, increases with the number of selected control points. Mode-matching [17-19] is another optimization-based family of techniques that can be applied to loudspeaker setups having arbitrary geometries. In this case, the optimization procedure is based on matching a modal decomposition of the desired soundfield around a single expansion center. The modal decomposition can be performed using circular or spherical wavefunctions. The decomposition must be truncated to a maximum mode order, since an order that is either too high or too low degrades the synthesis quality [19]. Several approaches have been proposed to appropriately weight the modes [18,20]. Irregular loudspeaker setups have also been considered by intensity-matching methods [21,22], where the objective is the minimization of the sound intensity, i.e., particle velocity, in the spherical harmonic domain over a spatial region.
More recently, after its widespread adoption in acoustic signal processing research [23], deep learning has also been applied to soundfield synthesis problems [24] such as the reconstruction of the pressure field at unknown locations [25,26]. In [27], the authors proposed a network that is able to convert mono audio recorded using a 360° video camera into first-order ambisonics (FOA). In [28], a network is proposed in order to upscale ambisonic signals, while in [29], a learning-based model for the frequency expansion of the higher-order ambisonics (HOA) encoding process is presented. Also, in [30], the authors propose a technique for the estimation of spherical harmonic coefficients in soundfield recording, using feed-forward neural networks. In [31], the authors present a neural network that is able to calculate the optimal number of driving signals, extracted through a LASSO-based technique. In [32], a deep learning-based pressure-matching approach was presented, where a real-valued CNN extracted the driving signals from pressure measurements at control points; a very similar approach was later followed by [33]. Learning techniques have also been applied to the problem of optimizing the number and placement of sensors in soundfield control scenarios [34].
Complex-valued neural networks [35-39] can process complex data directly and have recently been applied to a variety of audio signal processing tasks such as source localization [40] and separation [41]. Adopting such networks avoids handling the real and imaginary parts separately, as done in [26].
In this manuscript, we propose a technique for 2D soundfield synthesis through irregular loudspeaker setups in a free-field environment, where the desired driving signals are obtained through a complex-valued convolutional neural network (CNN). Although the proposed method is easily extensible to 3D scenarios, this would involve dealing with 3D CNNs, which would increase the computational complexity without enhancing the conceptual reasoning behind the proposed method. For this reason, in this manuscript, we focus on 2D deployments and leave the 3D extension to future work.
Instead of deriving the driving signals from soundfield measurements, the target field is obtained from the model-based rendering (MR) method presented in [42], based on the plane wave decomposition. While this technique is able to correctly reproduce the soundfield when regular loudspeaker setups are used, irregularities in the reproduced wavefronts appear when the spacing between the loudspeakers becomes uneven.
Operatively, we generate irregular loudspeaker arrays by considering regular array setups and randomly removing a number of loudspeakers, simulating configurations where more than half of the loudspeakers are missing, thus paving the way to the use of minimal setups. Through [42], we compute the driving signals obtained using the irregular setup and feed them into a CNN, which outputs a compensated version of the driving signals. Differently from what was proposed in [31], the loss is not based on the driving signals. Instead, we compute the loss between the ground truth soundfield and the one obtained through the compensated driving signals, which are the output of the network.
The main contribution of this paper is thus to provide, to the best of our knowledge, a first application of deep learning to soundfield synthesis with irregular loudspeaker setups. Such configurations are highly desirable in real-world application scenarios, since they are more easily deployable in contexts such as home audio. The choice of removing loudspeakers from regular circular and linear setups also goes in this direction: for example, a fully regular circular loudspeaker array could hardly be deployed in a living room due to the presence of furniture, while the proposed irregularities in the setup could instead accommodate these situations by removing loudspeakers wherever needed.
In the literature, linear optimizers for loudspeaker driving functions have already been proposed, such as adaptive wave field synthesis (AWFS) [43-46], where the reproduction error is minimized in a least-mean-squares sense. In order to demonstrate the effectiveness of the technique, we compare it with AWFS, PM, and a linearly compensated MR, using both simulated and real data.
The rest of this manuscript is organized as follows. In Section 2, we introduce the notation and present the necessary background related to the MR and PM techniques. In Section 3, we describe the proposed technique for soundfield synthesis using irregular loudspeaker arrays. In Section 4, we present simulation results when considering both a circular and a linear loudspeaker array. Finally, in Section 5, we draw some conclusions.

Notation and review of pressure-matching, model-based soundfield synthesis, and adaptive wave field synthesis
In this section, we briefly review three soundfield synthesis techniques related to the proposed approach and we introduce the notation that will be used throughout the rest of the paper. We first introduce the pressure-matching technique and then the model-based soundfield synthesis method, which is used in order to derive the loudspeaker driving signals that will then be compensated through the proposed method. Finally, we present the adaptive wave field synthesis technique, which optimizes the WFS driving signals through a linear procedure and will be used as a term of comparison for the proposed approach.

Notation and preliminaries
Let us consider an arrangement of $L$ omnidirectional loudspeakers, or secondary sources, as they are often denoted in the soundfield synthesis literature, deployed at positions $\mathbf{r}_l \in \mathbb{R}^2$, $l = 1, \ldots, L$. Let us also consider a set of $A$ points $\mathbf{r}_a \in \mathbb{R}^2$, $a = 1, \ldots, A$, through which we sample the region of space $\mathcal{A}$, denoted as listening area, where we want to reproduce the soundfield. Let $\mathbf{d}(\omega) = [d_1(\omega), \ldots, d_L(\omega)]^T$ denote the vector containing the driving signals applied to the secondary sources, where $\omega \in \mathbb{R}$ is the angular frequency and the superscript $T$ denotes transposition. If $g(\mathbf{r}_a|\mathbf{r}_l, \omega)$ is the acoustic transfer function (ATF) between secondary source $l$ and point $a$, the vector $\mathbf{g}_a = [g(\mathbf{r}_a|\mathbf{r}_1, \omega), \ldots, g(\mathbf{r}_a|\mathbf{r}_L, \omega)]^T$ is the juxtaposition of all the ATFs from the secondary sources to the listening point $a$. The synthesized sound pressure can be computed as
$$\hat{p}(\mathbf{r}_a, \omega) = \mathbf{d}(\omega)^T \mathbf{g}_a = \sum_{l=1}^{L} d_l(\omega)\, g(\mathbf{r}_a|\mathbf{r}_l, \omega), \qquad (1)$$
where, in the case of 2D propagation in free-space conditions and using the $e^{j\omega t}$ convention for the Fourier transform, $g(\cdot)$ corresponds to the Green's function [47]
$$g(\mathbf{r}_a|\mathbf{r}_l, \omega) \propto H_0^{(2)}\!\left(\frac{\omega}{c}\,\|\mathbf{r}_a - \mathbf{r}_l\|\right), \qquad (2)$$
where $H_0^{(2)}$ is the Hankel function of the second kind and zero order, while $c$ is the speed of sound in air.
The objective of soundfield synthesis techniques can then be defined as retrieving the set of driving signals $\mathbf{d}(\omega)$ such that
$$\hat{p}(\mathbf{r}_a, \omega) \approx p(\mathbf{r}_a, \omega), \quad a = 1, \ldots, A, \qquad (3)$$
that is, minimizing the error between the reproduced and desired pressure field at the points contained in the listening area. The method through which the driving signals are estimated is what differentiates the various soundfield synthesis techniques.
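As a minimal numerical sketch of the quantities introduced above, the following Python snippet builds the free-field 2D transfer functions and evaluates the synthesized pressure as the product between the ATF matrix and the driving signals. The function names, the speed of sound value, and the $-j/4$ prefactor (the usual choice paired with the $e^{j\omega t}$ convention) are assumptions made for illustration and are not taken from the original formulation.

```python
import numpy as np
from scipy.special import hankel2

C_SOUND = 343.0  # assumed speed of sound in air [m/s]


def green_2d(r_points, r_sources, omega, c=C_SOUND):
    """Free-field 2D ATFs between sources and points, shape (n_points, n_sources).

    Assumption: -j/4 prefactor, consistent with the e^{j omega t} convention.
    """
    r_points = np.asarray(r_points, dtype=float)
    r_sources = np.asarray(r_sources, dtype=float)
    # Pairwise Euclidean distances between evaluation points and sources
    dist = np.linalg.norm(r_points[:, None, :] - r_sources[None, :, :], axis=-1)
    return -0.25j * hankel2(0, omega / c * dist)


def synthesized_pressure(d, r_points, r_sources, omega, c=C_SOUND):
    """Pressure at each point: sum over loudspeakers of d_l * g(r_a | r_l)."""
    return green_2d(r_points, r_sources, omega, c) @ np.asarray(d)


# Hypothetical example: 8 loudspeakers on a unit circle, one point at the origin
theta = 2 * np.pi * np.arange(8) / 8
speakers = np.stack([np.cos(theta), np.sin(theta)], axis=1)
p_hat = synthesized_pressure(np.ones(8), [[0.0, 0.0]], speakers, 2 * np.pi * 500.0)
```

Since all loudspeakers in this toy example are equidistant from the origin, the pressure there is simply eight times a single transfer function, which gives a quick sanity check.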

Pressure-matching method
The pressure-matching technique, formulated as in [15], is a method for the synthesis of soundfields based on the minimization of the reproduction error at discrete points in the environment, denoted as control points.
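Let us consider a series of control points $\mathbf{r}_i$, $i = 1, \ldots, I$, such that $\mathbf{r}_i \in \mathcal{A}$. In the following, the subscript cp will indicate that the related term refers only to values measured at the control points. The driving signals to be applied to the secondary sources are obtained by solving the minimization problem
$$\mathbf{d}_{pm}(\omega) = \arg\min_{\mathbf{d}(\omega)} \|\mathbf{G}_{cp}(\omega)\mathbf{d}(\omega) - \mathbf{p}_{cp}(\omega)\|^2 + \lambda \|\mathbf{d}(\omega)\|^2, \qquad (4)$$
where $\lambda$ is a regularization parameter, which may be determined either through techniques such as the L-curve [48] or, more often, by extracting singular values related to the propagation matrices [19]. The solution of (4) is given by
$$\mathbf{d}_{pm}(\omega) = \left(\mathbf{G}_{cp}(\omega)^H \mathbf{G}_{cp}(\omega) + \lambda \mathbf{I}_L\right)^{-1} \mathbf{G}_{cp}(\omega)^H \mathbf{p}_{cp}(\omega), \qquad (5)$$
where $H$ denotes the Hermitian transpose, $\mathbf{I}_L$ is the $L \times L$ identity matrix, the entries of $\mathbf{G}_{cp}(\omega) \in \mathbb{C}^{I \times L}$, corresponding to the transfer functions between secondary sources $\mathbf{r}_l$ and control points $\mathbf{r}_i$, are defined as in (1), and $\mathbf{p}_{cp} \in \mathbb{C}^I$ is a vector corresponding to the ground truth pressure soundfield evaluated at the control points, i.e., $\mathbf{p}_{cp}(\omega) = [p(\mathbf{r}_1, \omega), \ldots, p(\mathbf{r}_I, \omega)]^T$. While the inversion of a matrix may be computationally expensive, if we consider a single set of secondary sources (i.e., a single loudspeaker array), the pressure-matching technique can be implemented with a more convenient linear computational cost $O(IL)$ by pre-computing
$$\mathbf{C}_{cp}(\omega) = \left(\mathbf{G}_{cp}(\omega)^H \mathbf{G}_{cp}(\omega) + \lambda \mathbf{I}_L\right)^{-1} \mathbf{G}_{cp}(\omega)^H, \qquad (6)$$
where $\mathbf{C}_{cp}(\omega) \in \mathbb{C}^{L \times I}$ is independent of the soundfield. Then the filters can be calculated by rewriting (5) as
$$\mathbf{d}_{pm}(\omega) = \mathbf{C}_{cp}(\omega)\, \mathbf{p}_{cp}(\omega). \qquad (7)$$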
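The pre-computation strategy described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the ATF matrix is filled with random placeholder values rather than actual Green's functions, the function name is hypothetical, and the regularization is set to $10^{-3}$ times the largest singular value of $\mathbf{G}_{cp}^H \mathbf{G}_{cp}$, mirroring the choice reported later in the model parameters.

```python
import numpy as np


def pm_filter_matrix(G_cp, reg_scale=1e-3):
    """Return C_cp = (G^H G + lam I)^{-1} G^H and the regularization lam.

    lam = reg_scale * sigma_max(G^H G); this scaling is an assumption
    borrowed from the experimental section of the text.
    """
    GhG = G_cp.conj().T @ G_cp
    lam = reg_scale * np.linalg.norm(GhG, 2)  # spectral norm = largest singular value
    C_cp = np.linalg.solve(GhG + lam * np.eye(GhG.shape[0]), G_cp.conj().T)
    return C_cp, lam


rng = np.random.default_rng(0)
I, L = 60, 16  # control points, loudspeakers
# Placeholder complex ATFs and target pressures (not physical values)
G_cp = rng.standard_normal((I, L)) + 1j * rng.standard_normal((I, L))
p_cp = rng.standard_normal(I) + 1j * rng.standard_normal(I)

C_cp, lam = pm_filter_matrix(G_cp)  # computed once per array and frequency
d_pm = C_cp @ p_cp                  # O(IL) product for every new target field
```

Once `C_cp` is available, each new desired soundfield only costs one matrix-vector product, which is the source of the linear cost mentioned in the text.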

Model-based acoustic rendering based on plane wave decomposition
The model-based acoustic rendering ( MR) [42] technique is based on the decomposition of the soundfield into directional contributions encoded by the Herglotz density function [49], which can be converted into driving signals for arbitrary loudspeaker arrangements along a planar curve.We first summarize how the Herglotz density function is defined in the case of a point source and then how it has been used in [42] to render the soundfield through circular and linear loudspeaker arrays.

Herglotz density function
Let us define $\hat{\mathbf{k}}(\theta) = [\cos\theta, \sin\theta]^T$ as the unit vector corresponding to a plane wave propagating with direction $\theta$; then we can write the corresponding wave vector as $\mathbf{k}(\theta) = \frac{\omega}{c}\hat{\mathbf{k}}(\theta)$. The pressure soundfield at a point $\mathbf{r} = [x, y]^T$ can be modeled as a superposition of plane waves [50,51] where $\phi(\theta, \omega) \in \mathbb{C}$ is the Herglotz density function, which modulates each plane wave component in amplitude and phase [49]. In the case of an isotropic point source $\mathbf{r}' = \rho'[\cos(\theta'), \sin(\theta')]^T$, expressed in terms of polar coordinates $\rho'$ and $\theta'$, corresponding to radius and azimuth, respectively, $\phi(\theta, \omega)$ can be defined as [42] where $A(\omega)$ is the spectrum of the sound emitted by the source.

Implementation with circular arrays
Let us consider a circular array of secondary sources deployed at positions $\mathbf{r}_l$, corresponding to polar coordinates $\rho_l[\cos\theta_l, \sin\theta_l]^T$, where $\rho_l$ is the radius. Let us also consider a discrete distribution of $N(\omega)$ plane waves with directions $\theta_n$, $n = 1, \ldots, N$, uniformly sampling the $[0, 2\pi)$ interval, where each plane wave is reproduced by the same $L$ loudspeakers, in order to approximate the desired soundfield. We take advantage of the discrete plane wave distribution in order to reproduce the soundfield by approximating it as [42] where The sum in (10) is approximated through a truncation of the modal expansion to order $M$, i.e., $m = -M, \ldots, M$, where $M$ can be chosen in order to bound the reproduction error in a listening area of radius $\rho$ by selecting $M \geq \lceil e\,\frac{\omega}{c}\,\frac{\rho}{2} \rceil$ [50]. Then, according to Shannon's theorem, we can correctly reproduce the soundfield without additional errors, except for the ones due to the discretization, by using $N \geq 2M + 1$ plane waves.
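To give a feeling for the numbers involved, the short Python sketch below evaluates the truncation order bound and the corresponding minimum number of plane waves. The function names and the value of the speed of sound (343 m/s) are assumptions made for this example.

```python
import math


def truncation_order(freq_hz, radius_m, c=343.0):
    """Smallest order satisfying M >= ceil(e * (omega / c) * rho / 2)."""
    wavenumber = 2 * math.pi * freq_hz / c
    return math.ceil(math.e * wavenumber * radius_m / 2)


def min_plane_waves(order):
    """Minimum number of plane waves, N >= 2M + 1."""
    return 2 * order + 1


# Hypothetical example: 1 kHz and a listening area of radius 1 m
M = truncation_order(1000.0, 1.0)
N = min_plane_waves(M)
print(M, N)  # 25 51
```

Under these assumptions, a listening area of radius 1 m at 1 kHz requires an order of 25 and therefore at least 51 plane-wave directions; both grow linearly with frequency and radius.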
The filter corresponding to the l-th loudspeaker and the n-th plane-wave component, can then be defined as [42] The driving signal corresponding to the secondary source l rendering all the N plane-wave components is [42] Finally, the soundfield at r a is

Implementation with linear arrays
Let us now consider an array of secondary sources deployed on a line segment such that $\mathbf{r}_l = [x_0, y]^T$ with $-y_0 \leq y \leq y_0$. In this case, the allowed values for the reproduced plane wave directions belong to a subset of $[0, 2\pi)$; specifically, the allowed range is $\{\theta \in \mathbb{R} \mid \theta_{min} \leq \theta \leq \theta_{max}\}$, where $\theta_{min} = \arctan(-y_0, x_0)$ and $\theta_{max} = \arctan(y_0, x_0)$. This angular interval is sampled using $N$ components. This limitation is due to the geometrical constraints posed by the configuration of the array and of the listening region. Reproduction is performed towards the half-plane given by $x < x_0$ [8], and the linear array is not able to accommodate all the plane wave directions surrounding the listening region, as in the circular array case. Since no closed-form solutions are known for arrays that are not circular [42], the filters $\mathbf{h}(\theta_n, \omega) = [h_1(\theta_n, \omega), \ldots, h_L(\theta_n, \omega)]^T$ to be applied to the loudspeaker signals are estimated by minimizing the error due to the approximation of the plane wave soundfield through secondary sources, that is [42] which yields [42] where T is a vector containing the pressure soundfield at the control points, due to a plane wave with direction $\theta_n$.
We can then derive the driving signals in the case of the linear array as [42] and then the desired soundfield can be obtained by inserting the derived driving signals into (14).

Adaptive wave field synthesis
Wave field synthesis (WFS) [1] is a soundfield reproduction technique which assumes free-field reproduction and whose driving signals are derived from the Kirchhoff-Helmholtz integral theorem.
Let us consider a 2D free-field environment. The WFS driving signals needed to reproduce a source placed in $\mathbf{r}_s$ can be derived as [43] where $\rho$ denotes the air density, the angular term is the angle between $\mathbf{r}_s$ and the normal to the reproduction line (i.e., the contour comprising the loudspeaker array) at the secondary source $\mathbf{r}_l$, $\mathbf{r}_o$ denotes a point on the reference line, along which the amplitude error should theoretically be zero [52], and, finally, the spacing term $\|\mathbf{r}_l - \mathbf{r}_{l+1}\|$ is the distance between consecutive loudspeakers. In order to solve the reproduction inaccuracies due to the WFS free-field assumption, a compensation technique for WFS driving signals, denoted adaptive wave field synthesis (AWFS), was proposed in [43]. Let us consider the soundfield $\mathbf{p}_{cp,wfs}(\omega)$ obtained at the control points through the WFS driving signals and $\mathbf{e}_{cp}(\omega) = \mathbf{p}_{cp}(\omega) - \mathbf{p}_{cp,wfs}(\omega)$ as the reproduction error; then the driving signals $\mathbf{d}_{awfs}(\omega) \in \mathbb{C}^L$ are obtained in AWFS by solving the following minimization problem [43] where $\mathbf{e}(\omega) = \mathbf{p}_{cp}(\omega) - \hat{\mathbf{p}}_{cp,awfs}(\omega)$ is the difference between the ground truth and the estimated complex soundfields, and $\lambda$ is a regularization parameter.
The adaptive wave field synthesis driving signals that minimize the cost function are then found through [43,53] where the solution is equivalent to the WFS one for $\lambda \to \infty$ and to the optimal solution in a least-mean-square sense for $\lambda \to 0$.
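The limiting behavior described above can be checked numerically with a small sketch. The snippet assumes the commonly used AWFS cost $\|\mathbf{p}_{cp} - \mathbf{G}_{cp}\mathbf{d}\|^2 + \lambda\|\mathbf{d} - \mathbf{d}_{wfs}\|^2$, whose minimizer can be written as the WFS solution plus a regularized least-squares correction of the residual error; this specific form, the function name, and the random placeholder data are assumptions and are not copied from the equations of the paper.

```python
import numpy as np


def awfs_driving_signals(G_cp, p_cp, d_wfs, lam):
    """Minimizer of ||p_cp - G d||^2 + lam * ||d - d_wfs||^2 (assumed AWFS cost)."""
    e_cp = p_cp - G_cp @ d_wfs            # reproduction error of plain WFS
    GhG = G_cp.conj().T @ G_cp
    correction = np.linalg.solve(GhG + lam * np.eye(GhG.shape[0]),
                                 G_cp.conj().T @ e_cp)
    return d_wfs + correction


rng = np.random.default_rng(1)
I, L = 20, 8
# Placeholder ATFs, target pressures, and WFS driving signals (not physical values)
G_cp = rng.standard_normal((I, L)) + 1j * rng.standard_normal((I, L))
p_cp = rng.standard_normal(I) + 1j * rng.standard_normal(I)
d_wfs = rng.standard_normal(L) + 1j * rng.standard_normal(L)

d_large = awfs_driving_signals(G_cp, p_cp, d_wfs, lam=1e12)   # tends to d_wfs
d_small = awfs_driving_signals(G_cp, p_cp, d_wfs, lam=1e-12)  # tends to least squares
d_ls = np.linalg.lstsq(G_cp, p_cp, rcond=None)[0]
```

With a very large penalty the correction vanishes and the WFS driving signals are returned unchanged, while with a vanishing penalty the result coincides with the unregularized least-squares solution.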

Driving signals compensation through complex-valued convolutional neural networks
In this section, we present the proposed technique for soundfield synthesis through complex-valued CNNs using irregular loudspeaker arrays. We first formalize the problem as the compensation of the filters obtained through the MR technique; then, we describe the general pipeline of the method and the proposed network architecture.

Problem formulation
Let us consider a circular or linear array of secondary sources as shown in Fig. 1a and c, respectively. An irregular loudspeaker array setup is obtained by removing some secondary sources from the setup, as shown in Fig. 1b and d. More formally, we can define an irregular loudspeaker array as an array where the spacing between the secondary sources is not constant. Given the MR soundfield synthesis technique presented in Section 2.3, it is possible to obtain driving signals enabling a correct reproduction of the soundfield, as shown using a circular array in Fig. 2a. However, if we remove secondary sources and we do not take any countermeasure, the quality of the reproduced soundfield degrades considerably, as shown in

Pipeline
The pipeline of the proposed method is depicted in Fig. 3. In order to train the network, we consider a set of simulated data. More specifically, we consider a set of point sources positioned at locations $\mathbf{r}_s$ outside the listening region. For each source, we compute the corresponding driving signal matrix $\mathbf{D}_{mr}$ and, by applying (2), the corresponding ground truth pressure soundfield at control points $\mathbf{p}_{cp}$.
The matrix D mr is fed as input to the network U(•) , whose output is the matrix containing the compensated filters D cnn .
The prediction of the soundfield due to $\mathbf{r}_s$ at the selected control points $\mathbf{r}_i$ at frequency $\omega_k$ is given by the convolution in the frequency domain between the estimated filters and the point-to-point Green's function, i.e., The parameters of the network $U(\cdot)$ are optimized through the loss function where $\|\cdot\|_1$ denotes the L1-norm. The loss in (25) is defined for a single source in $\mathbf{r}_s$; in practice, however, it is computed on a batch of sources. The batch index is here omitted for the sake of compactness.
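The training objective described above can be illustrated with a NumPy sketch that predicts the pressure at the control points from a matrix of driving signals and compares it with the ground truth through an L1-type error. The array layouts, the function names, and the choice of averaging the magnitude of the complex error over control points and frequencies are assumptions made for this example; the actual model is trained in TensorFlow, and its exact reduction may differ.

```python
import numpy as np


def predicted_pressure(D, G_cp):
    """Pressure at control points for every frequency.

    D:    (L, K) driving signals (loudspeakers x frequencies)
    G_cp: (K, I, L) ATFs per frequency (control points x loudspeakers)
    Returns an (I, K) array.
    """
    return np.einsum("kil,lk->ik", G_cp, D)


def l1_soundfield_loss(D, G_cp, P_cp):
    """Mean absolute error between predicted and ground truth pressures (assumed reduction)."""
    return np.mean(np.abs(predicted_pressure(D, G_cp) - P_cp))


rng = np.random.default_rng(2)
K, I, L = 5, 7, 4
# Placeholder ATFs and driving signals (not physical values)
G_cp = rng.standard_normal((K, I, L)) + 1j * rng.standard_normal((K, I, L))
D_true = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))
P_cp = predicted_pressure(D_true, G_cp)

loss_perfect = l1_soundfield_loss(D_true, G_cp, P_cp)           # 0 for exact filters
loss_silent = l1_soundfield_loss(np.zeros((L, K)), G_cp, P_cp)  # > 0 for silent array
```

The key design point is that the error is measured on the soundfield and not on the driving signals, so no ground-truth compensated filters are required.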

Network architecture
In order to estimate the compensated driving signals from the ones obtained through the MR method with an irregular loudspeaker array, we make use of a complex-valued 2D convolutional architecture denoted as $U(\cdot)$. Since the main novelty of this manuscript lies in the application of complex-valued deep learning to soundfield synthesis using irregular loudspeaker arrays and not in the proposed deep learning techniques, we designed the network architecture by selecting standard design choices from the literature and adapting them to the particular considered scenario.
The network takes as input $\mathbf{D}_{mr}$ and outputs the matrix $\mathbf{D}_{cnn}$. As for the size of the input tensor, the proposed architecture is designed to work with an odd size along the axis corresponding to the number of frequencies $K$ and with a power of two along the axis corresponding to the number of loudspeakers $L$; only minor adjustments would be needed in order to adapt it to different scenarios. It is important to note that the network must be retrained from scratch in order to use a different frequency axis.
The chosen network architecture processes the input, by subsequently compressing it along the width and height axes, while increasing the number of filters (i.e., channels), since this procedure helps in learning higher-level features hierarchically [54] at different scales.The chosen number of filters is similar to the ones commonly used in the literature, such as in VGG16 [55].Since the proposed model is compensating the input driving signals, it is necessary that the output has the same dimensions as the input.For this reason, the architecture has a mirrored structure that first compresses the input data using 2D convolutional layers and then expands them through 2D transposed convolutional layers to generate the compensated driving signals.
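The requirement that the output matches the input size can be examined with the standard size formulas for unpadded strided convolutions and transposed convolutions. The sketch below is a generic illustration and does not reproduce the layer table of the proposed network; the helper names are hypothetical, and kernel size 3 and stride 2 are used only as example values.

```python
def conv_out(size, kernel, stride=2):
    """Output length of an unpadded ('valid') strided convolution."""
    return (size - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=2):
    """Output length of an unpadded transposed convolution."""
    return (size - 1) * stride + kernel


# Odd axis (e.g., 63 frequency bins): compression and expansion are exact mirrors
down = [63]
for _ in range(2):
    down.append(conv_out(down[-1], kernel=3))
up = [down[-1]]
for _ in range(2):
    up.append(conv_transpose_out(up[-1], kernel=3))
print(down, up)  # [63, 31, 15] [15, 31, 63]

# Even axis (e.g., 64 loudspeakers): a size-3 kernel does not return to 64
print(conv_out(64, 3), conv_transpose_out(31, 3), conv_transpose_out(31, 4))  # 31 63 64
```

In this toy setting an odd axis is restored exactly by the mirrored layers, whereas an even axis needs one expansion step with a different kernel size along that axis, which hints at why mixed kernel shapes can appear in mirrored architectures.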
All layers have a $3 \times 3$ kernel, which is a common choice among CNN-based architectures [56], with the exception of layer v), which has a $4 \times 3$ kernel. This choice is made to account for the fact that in the considered scenario the number of frequencies is not a power of two. No padding is applied, the stride is equal to $2 \times 2$, and the chosen activation is the complex parametric rectified linear unit (CPReLU), which has been proposed and used for audio-related applications [57] and is extremely powerful due to the high number of parameters contained in the activation. Similarly to the CReLU activation [37,58], CPReLU applies separate PReLUs [59] to the real and imaginary parts of a neuron. More specifically, it is defined as
$$\mathrm{CPReLU}(z) = \mathrm{PReLU}(\Re(z)) + j\,\mathrm{PReLU}(\Im(z)), \qquad (26)$$
where $z \in \mathbb{C}$ represents the value of a neuron, and $\Re$ and $\Im$ denote the operators extracting the real and imaginary parts, respectively, of a complex number. In layer vii), zero-padding is applied, the stride is equal to $1 \times 1$, and a linear activation is used. We introduce a skip connection, which has been proven to speed up training [60], by feeding as input to layer v) the sum of the outputs of layers iv) and ii). All convolutional layers, with the exception of vii), are followed by dropout with a rate of 0.5, in order to prevent overfitting [61]. The complex-valued layers of the network were implemented by means of the CVNN [62] library using TensorFlow as backend.
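The CPReLU activation defined above can be written compactly in NumPy. In this sketch the negative slopes are fixed scalars passed as arguments, whereas in the network they are learned parameters; the function names and default slope values are assumptions made for illustration and do not correspond to the CVNN library API.

```python
import numpy as np


def prelu(x, alpha):
    """Parametric ReLU: identity for x >= 0, slope alpha otherwise."""
    return np.where(x >= 0, x, alpha * x)


def cprelu(z, alpha_re=0.25, alpha_im=0.25):
    """Apply independent PReLUs to the real and imaginary parts of z."""
    z = np.asarray(z, dtype=complex)
    return prelu(z.real, alpha_re) + 1j * prelu(z.imag, alpha_im)


# Hypothetical inputs: only the negative components are scaled
print(cprelu(1 - 2j))               # (1-0.5j)
print(cprelu(-4 + 2j, 0.5, 0.25))   # (-2+2j)
```

Keeping two separate slopes lets the real and imaginary branches adapt independently, which is what distinguishes this activation from applying a single real-valued nonlinearity to the magnitude.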

Results
In this section, we present simulation and experimental results aimed at estimating the accuracy of the soundfield synthesized with the proposed method, referred to in the following as CNN, with respect to the techniques presented in Section 2, namely the model-based soundfield rendering technique [42] (MR), the pressure-matching technique [15] (PM), and the adaptive wave field synthesis (AWFS). We also consider an adaptive version of the MR technique by applying the AWFS procedure defined in (20) to the driving signals obtained via the model-based technique. We will refer to this method as AMR in the following.
The MR technique assumes setups where loudspeakers are regularly spaced; therefore, its performance is expected to be non-optimal when it is applied to an irregular array, as in the case of this manuscript. Moreover, since the CNN technique compensates the driving signals extracted via MR, the synthesis accuracy obtained through the latter can be considered as the upper bound with respect to the reproduction error.
We consider also the PM method since, similarly to CNN , it does not pose any constraint with respect to the configuration of the loudspeaker array.
We avoid a comparison with a mode matching technique, even if it is suitable to work with irregular setups, due to the inherently different optimization procedure.While the function to be minimized in the PM and CNN approaches considers the pressure obtained at a series of control points, the mode matching technique, instead, minimizes directly the difference between the modes of the desired and reproduced soundfields [63].Moreover, a mode-matching strategy is already applied in the derivation of the spatial filters used in the MR technique.
The simulation results refer to circular and linear loudspeaker deployments, while the experimental ones refer to a circular array setup only. We first present aspects of the setup that are common to all configurations. We then discuss the different scenarios separately. The setups for the simulation and experimental campaigns were chosen empirically with a twofold objective: on the one hand, to analyze the reproduction accuracy, thus using a high spatial sampling for the listening area; on the other hand, to consider a challenging setup for the control points, whose spatial sampling always corresponds to a spatial aliasing frequency well below the maximum frequency considered in the analysis. This choice is motivated by the fact that, as demonstrated in our previous work [32], learning-based soundfield synthesis techniques are able to overcome sampling issues compared to other optimization-based approaches such as PM. The code used in order to generate the data and train the model, as well as the setups and additional results, can be found at https://polimi-ispl.github.io/deep_learning_soundfield_synthesis_irregular_array/. The WFS driving functions needed to apply AWFS were computed using the Sound Field Synthesis (SFS) Toolbox for Python [64].

Model parameters
In order to train the network, we simulate a set of point sources $\mathcal{S}$, which is then separated into three sets $\mathcal{S}_{train}$, $\mathcal{S}_{val}$, and $\mathcal{S}_{test}$, used for the training, validation, and testing phases, respectively. These datasets are independent of each other, meaning more formally that they are pairwise disjoint. The network is trained using the Adam optimizer [65] with a learning rate $lr = 10^{-4}$. We set the maximum number of epochs to 5000 and saved only the model corresponding to the best validation loss value. We apply early stopping by ending the training after 10 epochs of no improvement in terms of validation loss. The network loss usually converged after around 100 to 200 epochs. The regularization constant $\lambda$ used to regularize the least squares solution in PM (see (4)), MR (see (16)), AMR, and AWFS (see (20)) was set to $10^{-3}\sigma_{max}$, where $\sigma_{max}$ is the maximum singular value of $\mathbf{G}_{cp}^H\mathbf{G}_{cp}$, similarly to [19].
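The stopping rule described above, i.e., keeping the model with the best validation loss and halting after 10 epochs without improvement, can be sketched in plain Python. The function name and the synthetic list of validation losses are hypothetical; in the actual training such logic is typically delegated to a framework callback.

```python
def best_epoch_with_early_stopping(val_losses, patience=10):
    """Return (epoch of the best validation loss, epoch at which training stops)."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch (the model would be saved here)
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Synthetic example: the loss improves for three epochs and then plateaus
losses = [5.0, 4.0, 3.0] + [3.5] * 20
print(best_epoch_with_early_stopping(losses))  # (2, 12)
```

In this toy run the best model is the one from epoch 2, and training halts at epoch 12, i.e., after ten consecutive epochs without improvement.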

Evaluation metrics
In order to evaluate the performance of the proposed method, we adopt two different metrics, the normalized reproduction error (NRE) [19] and the structural similarity index measure (SSIM) [66]. The NRE measures the reproduction accuracy and, for a single emitting source $\mathbf{r}_s$ and frequency $\omega_k$, is defined as where $\hat{p}(\mathbf{r}_a, \omega_k)$ corresponds to the pressure soundfield estimated at point $\mathbf{r}_a$ using either the MR, PM or CNN techniques, while $p(\mathbf{r}_a, \omega_k)$ is the ground truth. As already done in [25], we also evaluate the accuracy in terms of SSIM, which makes it possible to evaluate how well the considered techniques reproduce the overall shape of the pressure soundfield for each frequency point. For a single emitting source $\mathbf{r}_s$ and frequency $\omega_k$, the SSIM is given by
$$\mathrm{SSIM}(\mathbf{r}_s, \omega_k) = \frac{(2\mu_{\mathbf{p}}\mu_{\hat{\mathbf{p}}} + c_1)(2\sigma_{\mathbf{p},\hat{\mathbf{p}}} + c_2)}{(\mu_{\mathbf{p}}^2 + \mu_{\hat{\mathbf{p}}}^2 + c_1)(\sigma_{\mathbf{p}}^2 + \sigma_{\hat{\mathbf{p}}}^2 + c_2)}, \qquad (28)$$
where $\mathbf{p} \in \mathbb{R}^A$ and $\hat{\mathbf{p}} \in \mathbb{R}^A$ correspond to the absolute value of the pressure soundfield, normalized between 0 and 1, measured in the listening area $\mathcal{A}$ at frequency $\omega_k$ when the source $\mathbf{r}_s$ is active, in the ground truth case and when either CNN, PM or MR are used, respectively. The values $\mu_{(\cdot)}$ and $\sigma^2_{(\cdot)}$ are the average and variance of the vector at subscript, respectively. Finally, $\sigma_{(\cdot,\cdot)}$ is the covariance between the entries of the two vectors given as arguments. In order to stabilize the division with a weak denominator, the SSIM calculation includes the two constants $c_1 = (h_1 R)^2$ and $c_2 = (h_2 R)^2$, where $R$ is the dynamic range of the entry values (1 in the case of normalized matrices), while $h_1 = 0.01$ and $h_2 = 0.03$, following the standard recommendation [25].
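Both metrics can be prototyped with a few lines of NumPy. The NRE below assumes one common definition, namely the ratio in dB between the energy of the reproduction error and the energy of the ground truth over the listening points; this form is an assumption, since the exact expression is not reproduced here. The SSIM follows the global statistics described in the text, computed on magnitudes normalized between 0 and 1. Function names are hypothetical.

```python
import numpy as np


def nre_db(p_hat, p):
    """Assumed NRE: 10 log10( sum |p_hat - p|^2 / sum |p|^2 ) over listening points."""
    p_hat, p = np.asarray(p_hat), np.asarray(p)
    return 10 * np.log10(np.sum(np.abs(p_hat - p) ** 2) / np.sum(np.abs(p) ** 2))


def normalize01(x):
    """Magnitude of the field rescaled to the [0, 1] range."""
    x = np.abs(np.asarray(x))
    return (x - x.min()) / (x.max() - x.min())


def ssim_global(p_hat, p, h1=0.01, h2=0.03, R=1.0):
    """SSIM computed from global means, variances, and covariance."""
    a, b = normalize01(p), normalize01(p_hat)
    c1, c2 = (h1 * R) ** 2, (h2 * R) ** 2
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    num = (2 * a.mean() * b.mean() + c1) * (2 * cov + c2)
    den = (a.mean() ** 2 + b.mean() ** 2 + c1) * (a.var() + b.var() + c2)
    return num / den


rng = np.random.default_rng(3)
# Placeholder complex pressure field at 100 listening points (not physical values)
p = rng.standard_normal(100) + 1j * rng.standard_normal(100)
print(nre_db(0.9 * p, p))  # about -20 dB for a 10% amplitude error
print(ssim_global(p, p))   # 1.0 for identical fields
```

Lower NRE values indicate a more accurate reproduction, while SSIM values closer to 1 indicate that the overall shape of the field is better preserved.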

Linear array
In this section, we present results related to soundfield synthesis when considering a linear array setup.

Setup
We considered a regular linear array centered in $[0.5\,\mathrm{m}, 0\,\mathrm{m}]^T$ and consisting of $L = 64$ secondary sources with a spacing of 0.0625 m. From this configuration, we generated three irregular array setups by randomly removing 16, 32, or 48 loudspeakers, resulting in three irregular arrays with $L = 48$, $L = 32$, and $L = 16$ secondary sources, respectively. The listening area $\mathcal{A}$ considered for reproduction was a 2 m × 2 m surface located on the half plane on the left of the array, specifically, with the lowest left corner placed in $[-2\,\mathrm{m}, -2\,\mathrm{m}]^T$, sampled using $A = 25000$ points with a spacing of 0.02 m on both the x and y axes. We used $I = 60$ control points, placed on a 2 m × 2 m grid overlapping with the listening region $\mathcal{A}$ and spaced by 0.44 m on both the x and y axes, corresponding to a spatial aliasing frequency of 387 Hz, both for computing the losses during the training of the CNN model and for calculating the driving signals through PM and AWFS and the filters needed to compute MR through (16) and AMR.
In order to train the network, we considered the cardinality of $\mathcal{S}_{train}$, $\mathcal{S}_{val}$, and $\mathcal{S}_{test}$ equal to 3920, 980, and 2500, respectively. The sources in $\mathcal{S}_{train} \cup \mathcal{S}_{val}$ are placed in a 4 m × 8 m grid sampled using a spacing of 0.06 m along the y-axis and of 0.11 m along the x-axis. The split of these sources into validation and training sets is performed randomly at training time, as is common practice. Test sources are then obtained by shifting the $\mathcal{S}_{train} \cup \mathcal{S}_{val}$ sets by 0.05 m along the x-axis. The image depicting the setup is available on the accompanying website. We considered sources emitting a signal with spectrum $A(\omega_k) = 1$ at $K = 63$ frequencies spaced by 23 Hz, in the range between 46 Hz and 1500 Hz.
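The generation of the irregular linear arrays described above can be sketched as follows. The snippet places 64 loudspeakers along the y-axis at $x_0 = 0.5$ m and removes a random subset; the function names and the fixed random seed are assumptions introduced for reproducibility of the example and do not correspond to the released code.

```python
import numpy as np


def regular_linear_array(num=64, spacing=0.0625, center=(0.5, 0.0)):
    """Loudspeaker positions on a vertical line centered in `center`, shape (num, 2)."""
    y = (np.arange(num) - (num - 1) / 2) * spacing + center[1]
    return np.stack([np.full(num, center[0]), y], axis=1)


def remove_random_loudspeakers(positions, num_removed, seed=0):
    """Randomly drop `num_removed` loudspeakers; returns kept positions and their indices."""
    rng = np.random.default_rng(seed)
    keep = np.sort(rng.choice(len(positions), size=len(positions) - num_removed,
                              replace=False))
    return positions[keep], keep


full_array = regular_linear_array()
# Three irregular setups with 48, 32, and 16 active loudspeakers
irregular = {64 - n: remove_random_loudspeakers(full_array, n)[0] for n in (16, 32, 48)}
```

Because the surviving loudspeakers keep their original coordinates, the resulting spacing is in general no longer constant, which matches the definition of irregular array adopted in the text.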

Results
In Fig.  where the latter shows also a higher irregularity. When comparing the average result of CNN with respect to the linear optimizer-based AWFS and AMR methods, the former still obtains better performance in most scenarios, having slightly lower performance around 200 Hz; however, the gap in performance diminishes together with the number of active loudspeakers, being almost indistinguishable for $L = 16$. As expected, with fewer active secondary sources the error is higher. In Fig. 6b-d-f, we present results showing the SSIM averaged over all $|\mathcal{S}_{test}|$ sources, when considering an irregular array of $L = 48$, 32, and 16 sources, respectively. For $L = 48$, the results are similar for all methods; CNN is worse on average at the lowest frequencies, while slightly better at the higher ones. In the case of $L = 32$, the SSIM curves are similar for most methods except for CNN, which obtains slightly lower results below 600 Hz but performs better than the other methods for higher frequency values. Finally, in the case of $L = 16$, the SSIM is comparable for all considered methods, with CNN obtaining slightly better results above 600 Hz.

Circular array
In this section, we present results related to soundfield synthesis when considering a circular array setup.

Setup
We considered a regular circular array consisting of L = 64 secondary sources with a radius of 1 m.
The listening area considered for reproduction, surrounded by the loudspeaker array, corresponds to a circle

We considered sources emitting a signal with spectrum A(ω_k) = 1 at K = 63 frequencies spaced by 23 Hz, in the range between 46 Hz and 1500 Hz. The image depicting the setup is available on the accompanying website^2.

Results
In Fig. 7a, we show the real part of the ground truth sound pressure distribution for an emitting point source placed in r = [0.99 m, 2.88 m, 0 m]^T. In Fig. 7b, c, d, e, and f, the real part of the sound pressure obtained through MR, CNN, PM, AWFS, and AMR is shown, respectively, when 32 loudspeakers are active. The CNN model clearly performs best, reducing the number of irregularities in the wavefront with respect to the MR, AWFS, and AMR techniques and especially with respect to the PM technique, whose reproduced soundfield is extremely irregular. These considerations are also confirmed by inspecting the NRE obtained for the same scenario, shown in Fig. 8, where the NRE in the case of CNN, shown in Fig. 8b, is appreciably lower in the listening area A with respect to the ones obtained through MR, PM, AWFS, and AMR, shown in Fig. 8a, c, d, and e, respectively.
In Fig. 9a-c-e, we present results showing the NRE averaged over all |S_test| sources, when considering an irregular array of L = 48, 32, and 16 secondary sources. Similarly to the linear array case, the CNN achieves average NRE results that are on par with or better than those of the other considered techniques. This is more evident when the number of secondary sources is lower. While the mean NRE of MR is approximately constant in the considered frequency range, the average error of CNN tends to increase with the frequency, even if it remains lower than that of MR. Analogously, PM exhibits an error that increases with the frequency, becoming extremely irregular in the upper frequency range and for sparser setups, while being on par with or lower than CNN at the lower frequencies. AMR shows a behavior similar to CNN but reaches higher NRE values. When considering the AWFS technique, the CNN technique performs better on average.

In Fig. 9b-d-f, we present the SSIM metric averaged over all |S_test| sources, when considering an irregular array of L = 48, 32, and 16 sources, respectively. Differently from the linear array case, the SSIM obtained through CNN is similar to or better than that of the other considered methods for L = 16 and L = 32, especially at higher frequencies. This is probably due to both the smaller listening area considered, allowing for a smaller number of irregularities in the reproduced wavefront, and the fact that the array surrounds the listening area, enabling reproduction from a higher number of directions. However, a notable exception is L = 16, where the highest SSIM performances are obtained by the MR technique and CNN performs worse than AWFS and MR at higher frequencies.
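The averaged NRE curves discussed above can be computed along the following lines. This sketch assumes the usual pointwise definition of the normalized reproduction error, the squared magnitude of the pressure error divided by the squared magnitude of the desired pressure, expressed in dB; the function names, the array layout, and the order of averaging (listening points first, then test sources) are assumptions made for illustration.

```python
import numpy as np

def nre_db(p_true, p_est, eps=1e-12):
    """Pointwise normalized reproduction error in dB (assumed definition)."""
    num = np.abs(p_est - p_true) ** 2
    den = np.abs(p_true) ** 2 + eps   # eps avoids division by zero
    return 10.0 * np.log10(num / den + eps)

def mean_nre_per_frequency(p_true, p_est):
    """Mean and standard deviation of the NRE over the test sources.

    Both inputs are complex arrays of shape (n_sources, n_freqs, n_points),
    a hypothetical layout chosen for this sketch.
    """
    per_source = nre_db(p_true, p_est).mean(axis=2)  # average over listening points
    return per_source.mean(axis=0), per_source.std(axis=0)

# Toy check: a uniform 10% amplitude error corresponds to about -20 dB.
p_ref = np.ones((3, 4, 5), dtype=complex)
mean_nre, std_nre = mean_nre_per_frequency(p_ref, 1.1 * p_ref)
```

The standard deviation returned alongside the mean is the kind of quantity that could be drawn as error bars around the averaged curves.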
In the case of the circular array, we also computed the NRE and SSIM when varying the location of the emitting source, in particular when it moves farther from the center of the array in the range 1.5 m < ρ < 3.5 m, while keeping the frequency fixed at 1007 Hz. The results of the NRE metric are shown in Fig. 10a-c-e for the arrays with 48, 32, and 16 secondary sources, respectively. All methods present a mostly constant behavior over the whole considered radius range, with CNN and PM being the most and least accurate, respectively. As expected, the NRE worsens when decreasing the number of active secondary sources. Consistently with the NRE results, for L = 16, the CNN and AWFS average performances are extremely similar. The results for the SSIM metric are shown in Fig. 10b-d-f for the arrays with 48, 32, and 16 secondary sources, respectively. In this case, the accuracy slightly worsens as the distance of the sources increases. While CNN, MR, and AWFS are close to each other, AMR and PM turn out to be the worst.

Real data
In this section, we present results related to soundfield synthesis when considering a circular array setup and data obtained from the room impulse response (RIR) measurements contained in the dataset from [67]. It is important to stress that in this scenario the sound propagation is 3D; therefore, in order to provide a fair comparison, we implemented the AWFS method using the 2.5D version of WFS contained in the SFS toolbox [64]. While the filters obtained via MR are at a disadvantage, being computed for a 2D environment, this is not a problem for either AMR or CNN, since in these methods the MR filters are just used as input and later optimized taking into account the 3D scenario. The point sources used to generate the desired ground truth soundfields were simulated using Pyroomacoustics [68], effectively considering 3D propagation.

Setup
RIRs were measured in a hemi-anechoic room of size 4.90 m × 7.22 m × 5.29 m, with 50 mm Martini Absorb XHD50 sound absorbing material on the ground and an average reverberation time of 0.045 s, using an array of L = 60 loudspeakers (Genelec 8010A) with a radius of 1.5 m, the spacing between adjacent loudspeakers being approximately 0.157 m. From this configuration, three irregular array setups were generated by randomly removing 12, 28, or 44 loudspeakers, resulting in three irregular configurations with L = 48, L = 32, and L = 16 secondary sources, respectively. The RIRs related to the reproduction zone were measured with the square microphone (DPA 4060) array configuration corresponding to Zone E in [67], which consists of 64 microphones sampling, with a spacing of 0.04 m, a square of size 0.28 m × 0.28 m placed in the center of the area comprised by the microphone array. Both microphones and loudspeakers were placed at the same height of 1.45 m from the floor. A total of 16 control points inside the reproduction area were chosen by selecting the first (from left) and fifth columns of microphones on the listening area grid. The two columns are thus separated by approximately 0.16 m, while microphones in the same column are spaced by 0.04 m, resulting in a spatial aliasing frequency of approximately 1071 Hz. The control points were used to compute the losses for the CNN model and the driving signals through the PM, AWFS, and AMR techniques. The considered sampling frequency is Fs = 48000 Hz [67].
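The geometric quantities quoted in this setup can be checked with a short sketch. The speed of sound (343 m/s), the half-wavelength rule f = c / (2d) applied to the gap between the two control-point columns, and the choice of centering the microphone grid on the origin are assumptions made here for illustration; they are not stated in this section, although they do reproduce the quoted values.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

# Spacing between adjacent loudspeakers on the full circular array.
L_FULL, RADIUS = 60, 1.5
arc_spacing = 2 * np.pi * RADIUS / L_FULL          # about 0.157 m

# 8 x 8 microphone grid with 0.04 m spacing (0.28 m x 0.28 m square),
# centered on the origin as a modelling assumption.
MIC_SPACING, N_SIDE = 0.04, 8
coords = (np.arange(N_SIDE) - (N_SIDE - 1) / 2) * MIC_SPACING
gx, gy = np.meshgrid(coords, coords, indexing="xy")  # gx[row, col], gy[row, col]

# Control points: first and fifth columns of the grid (indices 0 and 4).
cols = [0, 4]
control_points = np.stack([gx[:, cols].ravel(), gy[:, cols].ravel()], axis=1)

# Aliasing estimate from the largest gap between control points.
column_gap = coords[4] - coords[0]                 # about 0.16 m
f_alias = C / (2 * column_gap)                     # about 1071.9 Hz
```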
In order to generate the ground truth dataset of desired soundfields, we simulated through Pyroomacoustics [68]

Results
In Fig. 11a, we show the real part of the ground truth sound pressure distribution for a point source placed in r = [−3.76 m, −1.14 m, 0 m]^T at f = 1500 Hz. In Fig. 11b, c, d, e, and f, the real part of the sound pressure obtained through MR, CNN, PM, AWFS, and AMR is shown, respectively, when 32 loudspeakers are active. We can see that the CNN technique is the one that best reproduces the soundfield, closely followed by the PM method and then by the AWFS and AMR methods; MR appears to perform worst at generating the desired ground truth soundfield. Similar considerations can be drawn by inspecting the NRE obtained for the same scenario in the listening area A, shown in Fig. 12 for CNN (Fig. 12a), MR (Fig. 12b), PM (Fig. 12c), AWFS (Fig. 12d), and AMR (Fig. 12e).
In Fig. 13a-b-c, we present results showing the NRE averaged over all |S_test| sources, when considering an irregular array of L = 48, 32, and 16 secondary sources. In the case of L = 48, the performances of CNN, AMR, and PM are similar below 700 Hz, while above this value CNN is the method that yields the lowest mean NRE over the whole test set S_test. No major difference can be observed for L = 32. Finally, in the L = 16 scenario, CNN performances are on par with AMR below 800 Hz; for higher values, the error obtained with the latter strongly increases. Conversely, while CNN performances are on par with AWFS below 600-700 Hz, the latter performs slightly better above 800 Hz. The MR method is the one working worst in all cases, except above around 1200 Hz when L = 48 and L = 32, where it performs better than PM.
We do not show the SSIM results because this metric is strongly dependent on the variance of the data and is therefore not representative of the quality of the generated data in this specific case: the ground truth soundfields are simulated, while the RIRs used for reproduction are measured, causing the data to have significantly different distributions.

Conclusion
In this manuscript, we have proposed a technique for soundfield synthesis using irregular loudspeaker arrays. The methodology is based on deep learning. More specifically, we consider the driving signals obtained through an existing soundfield synthesis method, based on the plane wave decomposition, and propose a network that is able to modify the driving signals by compensating the errors in the reproduced soundfield due to the irregularity in the loudspeaker setup. We compare the proposed method with the one used to compute the input driving signals and with pressure-matching, showing that the proposed model is able to obtain better average performances in most of the setups.
The obtained results open the possibility of adopting the combination of deep learning and model-based soundfield synthesis for addressing issues arising when irregular loudspeaker arrays are available. For example, a complex-valued CNN-based pressure matching technique can be devised, by optimizing the driving signals from the knowledge of the soundfield at prescribed control points. Moreover, we plan to move to real environments, where multiple sources are active and noise and reverberation are also present, aiming at compensating the environment and masking the noise. We also plan to consider sources emitting more realistic signals such as speech or music. In order to make the model more suited to real-world applications, we plan to make the system able to handle different loudspeaker arrangements without the need for retraining, and to systematically identify the effects of loudspeaker and control point arrangements on the model performances. Further developments could also entail the application of deep learning and irregular arrays to related problems, such as multizone soundfield reproduction in order to create personal audio systems, and conditioning the system in order to be independent of the chosen array setup.

Fig. 1 Examples of regular circular (a) and linear (c) array setups, examples of irregular circular (b) and linear (d) array setups

Fig. 3 Schematic representation of the training procedure. Note that, for simplicity, the images of p_cnn and p correspond only to the real part of the pressure soundfield obtained at a frequency f = 562 Hz and due to a source positioned in r = [−0.61 m, 1.42 m]^T

In Fig. 4, we show the real part of the reproduced sound pressure distribution at frequency f = 210 Hz for a point source located in r = [1.05 m, 1.88 m, 0 m]^T, synthesized using L = 32 loudspeakers. More specifically, Fig. 4a refers to the ground truth soundfield, while the fields for MR, CNN, PM, AWFS, and AMR are shown in Fig. 4b, c, d, e, and f, respectively. We purposely chose to show the soundfield at a lower frequency to present an example where the performances of the CNN method are slightly worse than the ones obtained with AMR, while better than those of all other considered methods. It is apparent that the CNN and AMR models obtain the best results, by reducing the number of irregularities in the wavefront, both with respect to the MR technique, whose driving signals are the input to the CNN model, and to the PM technique. The differences

Fig. 4 Amplitude (real part) of the soundfield for a source placed in r = [1.05 m, 1.88 m, 0 m]^T at f = 210 Hz; ground truth is shown in a. Reproduction through an irregular linear array of L = 32 loudspeakers using MR (b), CNN (c), PM (d), AWFS (e), and AMR (f)

of 1 m radius centered in [0 m, 0 m]^T, uniformly sampled in order to have A = 7770 listening points with a spacing of 0.02 m between consecutive points. We used I = 25 control points placed in a 1.3 m × 1.3 m square grid inside A, centered in [0 m, 0 m]^T, with a spacing of 0.3 m along both the x and y axes, resulting in 5 rows and 5 columns and corresponding to spatial aliasing above approximately 514 Hz. The control points were used to compute the losses during the training of the CNN model and to calculate the driving signals through PM, AWFS, and AMR. In order to train the network, we used |S_train| = 4096 and |S_val| = 1024 sources for training and validation, respectively. The S_train and S_val sets were generated by uniformly sampling, with 256 points each, 20 circumferences whose radius was uniformly distributed in the range [1.5 m, 3.5 m] from the center of the array. The test dataset S_test, instead, was created by uniformly sampling, with 128 points each, 20 circumferences whose radius was uniformly distributed in the range [1.55 m, 3.55 m], obtaining |S_test| = 2560 test sources, placed such that no source overlaps with the ones used for training and validating the method.
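The source placement described above can be illustrated with the sketch below. It interprets "uniformly distributed" radii as evenly spaced values over the stated range, which is an assumption (the radii could equally have been drawn at random), and the function name is hypothetical.

```python
import numpy as np

def sources_on_circles(n_circles, n_points, r_min, r_max):
    """Place n_points equally spaced sources on each of n_circles circumferences.

    Radii are evenly spaced in [r_min, r_max]; this is an assumed reading
    of the setup, not necessarily the authors' procedure.
    """
    radii = np.linspace(r_min, r_max, n_circles)
    angles = np.linspace(0.0, 2 * np.pi, n_points, endpoint=False)
    rr, aa = np.meshgrid(radii, angles, indexing="ij")
    x = rr.ravel() * np.cos(aa.ravel())
    y = rr.ravel() * np.sin(aa.ravel())
    return np.stack([x, y], axis=1)

# 20 x 256 = 5120 sources, to be split into 4096 training and 1024 validation.
train_val_sources = sources_on_circles(20, 256, 1.5, 3.5)
# 20 x 128 = 2560 test sources on radii shifted by 0.05 m.
test_sources = sources_on_circles(20, 128, 1.55, 3.55)
```

Under this reading, the 0.05 m radial shift keeps every test circumference strictly between or beyond the training ones, so no test source coincides with a training or validation source.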

Fig. 5 Normalized reproduction error (NRE) distribution in dB for a source placed in r = [1.05 m, 1.88 m, 0 m]^T at f = 210 Hz when using MR (a), CNN (b), PM (c), AWFS (d), and AMR (e). Black loudspeakers represent the geometry of the chosen array

Fig. 7 Real part of the soundfield for a source placed in r = [0.99 m, 2.88 m, 0 m]^T at f = 1007 Hz; ground truth is shown in a. Reproduction performances using the irregular circular array of L = 32 loudspeakers are shown using MR (b), CNN (c), PM (d), AWFS (e), and AMR (f). Black loudspeakers represent the geometry of the chosen array

Fig. 10 Irregular circular array soundfield synthesis performances with respect to distance from the center of the reproduction area at frequency f = 1007 Hz: NRE when L = 48 (a), NRE when L = 32 (c), NRE when L = 16 (e), SSIM when L = 48 (b), SSIM when L = 32 (d), SSIM when L = 16 (f). Error bars represent ±1 standard deviation

a total of 4264 point sources placed in an 8 m × 8 m grid surrounding the loudspeaker array. The sources were split into |S_train| = 1705, |S_val| = 427, and |S_test| = 2132 to create the training, validation, and test sets, respectively. We considered sources emitting a signal with spectrum A(ω_k) = 1 at K = 63 frequencies spaced by 23 Hz, in the range between 50 Hz and 1500 Hz. The image depicting the setup is available on the accompanying website^3.

Fig. 11 Fig. 12
Fig. 11 Amplitude (real part) of the soundfield for a source placed in r = [−3.76 m, −1.14 m, 0 m]^T at f = 1500 Hz; ground truth is shown in a. Reproduction performances using the irregular circular array of L = 32 loudspeakers using MR (b), CNN (c), PM (d), AWFS (e), and AMR (f)