 Empirical Research
 Open access
Synthesis of soundfields through irregular loudspeaker arrays based on convolutional neural networks
EURASIP Journal on Audio, Speech, and Music Processing, volume 2024, Article number: 17 (2024)
Abstract
Most soundfield synthesis approaches deal with extensive and regular loudspeaker arrays, which are often not suitable for home audio systems due to physical space constraints. In this article, we propose a deep learning-based technique for soundfield synthesis through more easily deployable irregular loudspeaker arrays, i.e., arrays where the spacing between loudspeakers is not constant. The inputs are the driving signals obtained through a plane wave decomposition-based technique. While the considered driving signals are able to correctly reproduce the soundfield with a regular array, they show degraded performance when using irregular setups. Through a complex-valued convolutional neural network (CNN), we modify the driving signals in order to compensate for the errors in the reproduction of the desired soundfield. Since no ground truth driving signals are available for the compensated ones, we train the model by computing the loss between the desired soundfield at a number of control points and the one obtained through the driving signals estimated by the network. The proposed model must be retrained for each irregular loudspeaker array configuration. Numerical results show better reproduction accuracy with respect to the plane wave decomposition-based technique, the pressure-matching approach, and linear optimizers for driving signal compensation.
1 Introduction
Soundfield synthesis methods aim at reproducing a desired pressure field in a target region of space through loudspeaker arrays. In recent years, the attention towards this field of research has consistently increased due to its potential applications in virtual reality, telepresence, and gaming.
The first approaches towards soundfield synthesis dealt with extensive loudspeaker setups, driven in order to effectively reproduce an accurate approximation of the desired soundfield. Wave field synthesis (WFS) [1, 2] is based on the Huygens-Fresnel principle and synthesizes a desired pressure field through a large number of regularly distributed loudspeakers. Ambisonics [3] is based on the analysis of the soundfield in terms of spherical harmonics and reproduces the desired pressure field in a small listening area. In order to enlarge the area where reproduction is accurate, higher order ambisonics (HOA) was introduced [4, 5]. These physically based approaches reproduce the soundfield with satisfying quality when regular array geometries are used, such as spherical [6, 7], linear [8], or circular [9]. However, their performance severely degrades when using irregular setups. While several techniques were proposed in order to adapt HOA techniques to irregular array setups [10, 11], such as projection decoding methods [12, 13] and all-round ambisonic panning and decoding (AllRAD) [14], they often require the solution of ill-posed problems.
Optimization-based techniques are more easily applicable to irregular loudspeaker setups. The pressure-matching method [15, 16] is based on the minimization of the reproduction error at a fixed number of positions in the listening area, denoted as control points. The desired driving signals are then obtained through a regularized least squares optimization problem. While this approach is applicable to setups having extremely irregular geometries, the achievable reproduction quality strongly depends on the selection of the control points, i.e., on how finely the listening area is sampled; its computational cost, however, increases with the number of selected control points. Mode-matching [17,18,19] is another optimization-based family of techniques that can be applied to loudspeaker setups having arbitrary geometries. In this case, the optimization procedure is based on matching a modal decomposition of the desired soundfield around a single expansion center. The modal decomposition can be operated using circular or spherical wavefunctions. In doing this, it is necessary to limit the decomposition to a maximum mode order, since both too high and too low an order lead to worse synthesis quality [19]. Several approaches have been proposed to appropriately weight the modes [18, 20]. Irregular loudspeaker setups have also been considered by intensity-matching methods [21, 22], where the objective is the minimization of the sound intensity, i.e., particle velocity, in the spherical harmonic domain over a spatial region.
More recently, after its widespread adoption in acoustic signal processing research [23], deep learning has also been applied to soundfield synthesis problems [24], such as the reconstruction of the pressure field at unknown locations [25, 26]. In [27], the authors proposed a network that is able to convert mono audio recorded using a \(360^{\circ }\) video camera into first-order ambisonics (FOA). In [28], a network is proposed in order to upscale ambisonic signals, while in [29], a learning-based model for frequency expanding of the higher-order ambisonics (HOA) encoding process is presented. Also, in [30], the authors propose a technique for the estimation of spherical harmonic coefficients in soundfield recording using feedforward neural networks. In [31], the authors present a neural network that is able to calculate the optimal number of driving signals, extracted through a LASSO-based technique. In [32], a deep learning-based pressure matching approach was presented, where a real-valued CNN extracted the driving signals from pressure measurements at control points; a very similar approach was subsequently followed in [33]. Learning techniques have also been applied to the problem of optimizing the number and placement of sensors in soundfield control scenarios [34].
Complex-valued neural networks [35,36,37,38,39] make it possible to directly process complex-valued data and have recently been applied to a variety of audio signal processing tasks, such as source localization [40] and separation [41]. The adoption of such networks enables us to directly treat complex data instead of handling the real and imaginary parts separately, as in [26].
In this manuscript, we propose a technique for 2D soundfield synthesis through irregular loudspeaker setups in a free field environment, where the desired driving signals are obtained through a complex-valued convolutional neural network (CNN). Although the proposed method is easily extensible to 3D scenarios, this would involve dealing with 3D CNNs, which would increase the computational complexity without enhancing the conceptual reasoning behind the proposed method. For this reason, in this manuscript, we focus on 2D deployments and leave the 3D extension to future work.
Instead of deriving the driving signals from soundfield measurements, we obtain them through the model-based rendering (MR) method presented in [42], based on the plane wave decomposition. While this technique is able to correctly reproduce the soundfield when regular loudspeaker setups are used, irregularities in the reproduced wavefronts appear when the spacing between the loudspeakers becomes uneven.
Operatively, we generate irregular loudspeaker arrays by considering regular array setups and randomly removing a number of loudspeakers, simulating configurations where more than half of the loudspeakers are missing and thus paving the way to the use of minimal setups. Through [42], we compute the driving signals obtained using the irregular setup and feed them into a CNN, which outputs a compensated version of the driving signals. Differently from what is proposed in [31], the loss is not based on the driving signals. Instead, we compute the loss between the ground truth soundfield and the one obtained through the compensated driving signals output by the network.
The main contribution of this paper is thus to provide the first, to the best of our knowledge, application of deep learning to soundfield synthesis with irregular loudspeaker setups. Such configurations are highly desirable in real-world application scenarios, since they are more easily deployable in contexts such as home audio. The choice of removing loudspeakers from regular circular and linear setups also goes in this direction: for example, a fully regular circular loudspeaker array could hardly be deployed in a living room due to the presence of furniture, while the proposed irregular setups can accommodate these situations by removing loudspeakers wherever needed.
In the literature, linear optimizers for loudspeaker driving functions have already been proposed, such as adaptive wave field synthesis (AWFS) [43,44,45,46], where the reproduction error is minimized in a least-mean-squares sense. In order to demonstrate the effectiveness of the proposed technique, we compare it with AWFS, PM, and a linearly compensated MR, using both simulated and real data.
The rest of this manuscript is organized as follows. In Section 2, we introduce the notation and present the necessary background related to the \(\text {MR}\) and \(\text {PM}\) techniques. In Section 3, we describe the proposed technique for soundfield synthesis using irregular loudspeaker arrays. In Section 4, we present simulation results both when considering a circular and linear loudspeaker array. Finally, in Section 5, we draw some conclusions.
2 Notation and review of pressure-matching, model-based soundfield synthesis, and adaptive wave field synthesis
In this section, we briefly review three soundfield synthesis techniques related to the proposed approach, and we introduce the notation that will be used throughout the rest of the paper. We first introduce the pressure-matching technique and then the model-based soundfield synthesis method, which is used to derive the loudspeaker driving signals that will then be compensated through the proposed method. Finally, we present the adaptive wave field synthesis technique, which optimizes the WFS driving signals through a linear procedure and will be used as a baseline for the proposed approach.
2.1 Notation and preliminaries
Let us consider an arrangement of L omnidirectional loudspeakers, or secondary sources, as they are often denoted in the soundfield synthesis literature, deployed at positions \(\textbf{r}_l \in \mathbb {R}^2, l=1,\ldots , L\). Let us also consider a set of A points \(\textbf{r}_a \in \mathbb {R}^2, a=1, \ldots , A\) through which we sample the region of space \(\mathcal {A}\), denoted as listening area, where we want to reproduce the soundfield. Let \(\textbf{d}(\omega ) =[d_1(\omega ), \ldots , d_L(\omega )]^T\) denote the vector containing the driving signals applied to the secondary sources, where \(\omega \in \mathbb {R}\) is the angular frequency and the superscript T denotes transposition. If \(g(\textbf{r}_{a}-\textbf{r}_l, \omega )\) is the acoustic transfer function (ATF) between secondary source l and point a, the vector \(\textbf{g}_a = [g(\textbf{r}_{a}-\textbf{r}_1, \omega ),\ldots ,g(\textbf{r}_{a}-\textbf{r}_L, \omega )]^T\) is the juxtaposition of all the ATFs from the secondary sources to the listening point a. The synthesized sound pressure can be computed as
where, in the case of 2D propagation in free space conditions and using the \(e^{j\omega t}\) convention for the Fourier transform, \(g(\cdot )\) corresponds to the Green's function [47]
where \(H_0^{(2)}\) is the Hankel function of the second kind and order zero, and c is the speed of sound in air.
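As a quick numerical illustration of this propagation model, the following is a minimal sketch of the 2D free-field Green's function built on `scipy.special.hankel2`. The function name `green_2d` is our own, and the overall sign and scaling follow the common convention for this formula and should be checked against [47]:

```python
import numpy as np
from scipy.special import hankel2

def green_2d(r, r_l, omega, c=343.0):
    """2D free-field Green's function (e^{j omega t} convention), a sketch.

    Evaluates g(r - r_l, omega) proportional to H0^(2)(omega/c * ||r - r_l||),
    with the -j/4 scaling used in the common convention.
    """
    dist = np.linalg.norm(np.asarray(r, dtype=float) - np.asarray(r_l, dtype=float))
    return -1j / 4.0 * hankel2(0, omega / c * dist)
```

By symmetry, the value depends only on the loudspeaker-listener distance, which is why the ATF notation \(g(\textbf{r}_a-\textbf{r}_l,\omega)\) is well defined.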
The objective of soundfield synthesis techniques can then be defined as retrieving the set of driving signals \(\textbf{d}\) such that
that is, minimizing the error between the reproduced and desired pressure field at the points contained in the listening area. The method through which the driving signals are estimated is what differentiates the various soundfield synthesis techniques.
2.2 Pressurematching method
The pressurematching technique, formulated as in [15], is a method for the synthesis of soundfields based on the minimization of the reproduction error at discrete points in the environment, denoted as control points.
Let us consider a series of control points \(\textbf{r}_i, i=1,\ldots , I\) such that \(\textbf{r}_i \in \mathcal {A}\). In the following, the subscript \(\text {cp}\) will indicate that the related term refers only to values measured at the control points. The driving signals to be applied to the secondary sources are obtained by solving the minimization problem
where \(\lambda\) is a regularization parameter, which may be determined either through techniques such as the L-curve [48] or, more often, from the singular values of the propagation matrices [19], and H denotes the Hermitian transpose. The solution of (4) is given by
where the entries of \(\textbf{G}_{\text {cp}}(\omega ) \in \mathbb {C}^{I \times L}\), corresponding to the transfer functions between secondary sources \(\textbf{r}_l\) and control points \(\textbf{r}_i\), are defined as
and \(\textbf{p}_{\text {cp}} \in \mathbb {C}^{I}\) is a vector corresponding to the ground truth pressure soundfield evaluated at the control points, i.e., \(\textbf{p}_{\text {cp}}(\omega )=[p(\textbf{r}_{1},\omega ), \ldots , p(\textbf{r}_{I},\omega )]^T\).
While the inversion of a matrix may be computationally expensive, if we consider a single set of secondary sources (i.e., a single loudspeaker array), the pressurematching technique can be implemented with a more convenient linear computational cost \(\mathcal {O}(IL)\) by precomputing
where \(\textbf{C}_{\text {cp}}(\omega ) \in \mathbb {C}^{L\times I}\) is independent of the soundfield. The filters can then be calculated by rewriting (5) as
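The regularized least-squares solution and its precomputed form can be sketched in a few lines of numpy. This is a minimal illustration under hypothetical dimensions (the function name and the shapes are our own choices, cf. the PM equations above):

```python
import numpy as np

def pm_driving_signals(G_cp, p_cp, lam):
    """Regularized pressure-matching solution, a numpy sketch.

    G_cp: (I, L) complex ATF matrix from the L loudspeakers to the I control points.
    p_cp: (I,) desired pressure at the control points.
    lam:  regularization parameter lambda.
    """
    L = G_cp.shape[1]
    # Precompute C_cp = (G^H G + lam I)^{-1} G^H once per array geometry;
    # it does not depend on the desired soundfield.
    C_cp = np.linalg.solve(G_cp.conj().T @ G_cp + lam * np.eye(L), G_cp.conj().T)
    # The per-soundfield cost is then only the O(I L) matrix-vector product.
    return C_cp @ p_cp
```

In practice, `C_cp` would be stored and reused across soundfields, which is exactly the linear-cost implementation described above.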
2.3 Modelbased acoustic rendering based on plane wave decomposition
The modelbased acoustic rendering (\(\text {MR}\)) [42] technique is based on the decomposition of the soundfield into directional contributions encoded by the Herglotz density function [49], which can be converted into driving signals for arbitrary loudspeaker arrangements along a planar curve.
We first summarize how the Herglotz density function is defined in the case of a point source and then how it has been used in [42] to render the soundfield through circular and linear loudspeaker arrays.
2.3.1 Herglotz density function
Let us define \(\hat{\textbf{k}}(\theta )=[\cos {\theta }\ \sin {\theta }]^T\) as the unit vector corresponding to a plane wave propagating with direction \(\theta\); then we can write the corresponding wave vector as \(\textbf{k}(\theta ) = \hat{\textbf{k}}(\theta ) \frac{\omega }{c}\).
The pressure soundfield at a point \(\textbf{r} = [x,y]^T\) can be modeled as a superposition of plane waves [50, 51]
where \(\varphi (\theta ,\omega ) \in \mathbb {C}\) is the Herglotz density function, which modulates each plane wave component in amplitude and phase [49]. In the case of an isotropic point source at \(\textbf{r}'=\rho '[\cos (\theta '),\sin (\theta ')]^T\), expressed in terms of polar coordinates \(\rho '\) and \(\theta '\), corresponding to radius and azimuth, respectively, \(\varphi (\theta ,\omega )\) can be defined as [42]
where \(A(\omega )\) is the spectrum of the sound emitted by the source.
2.3.2 Implementation with circular arrays
Let us consider a circular array of secondary sources deployed at positions \(\textbf{r}_l = \rho _l[\cos {\theta _l}\ \sin {\theta _l}]^T\), where \(\rho _l\) is the radius. Let us also consider a discrete distribution of \(N(\omega )\) plane waves with directions \(\theta _n, n=1,\ldots ,N\), uniformly sampling the \([0, 2\pi )\) interval, where each plane wave is reproduced by the same L loudspeakers, in order to approximate the desired soundfield. We take advantage of the discrete plane wave distribution in order to reproduce the soundfield by approximating it as [42]
where \(<\cdot ,\cdot>\) denotes the standard inner product in \(\mathbb {R}^2\).
The sum in (10) is approximated through a truncation of the modal expansion to order M, i.e., \(m=-M,\ldots ,M\), where M can be chosen in order to bound the reproduction error in a listening area of radius \(\rho\) by selecting \(M \ge \lceil e \frac{\omega }{c} \frac{\rho }{2} \rceil\) [50]. Then, according to Shannon's theorem, we can correctly reproduce the soundfield without additional errors, except for the ones due to the discretization, by using \(N \ge 2M+1\) plane waves.
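The truncation rule above translates directly into code. The following sketch (function name ours) computes the modal order M and the resulting minimum number N of plane-wave components for a given frequency and listening-area radius:

```python
import math

def pw_counts(omega, rho, c=343.0):
    """Modal truncation order M and plane-wave count N, a sketch.

    M >= ceil(e * (omega/c) * (rho/2)) bounds the error in a disc of
    radius rho [50]; N >= 2M + 1 plane waves then suffice per Shannon's theorem.
    """
    M = math.ceil(math.e * (omega / c) * (rho / 2.0))
    N = 2 * M + 1
    return M, N
```

For instance, at 1 kHz with a 1 m listening radius, this gives a few tens of plane-wave components, which stays computationally modest.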
The filter corresponding to the lth loudspeaker and the nth plane-wave component can then be defined as [42]
The driving signal corresponding to the secondary source l, rendering all the N plane-wave components, is [42]
Finally, the soundfield at \(\textbf{r}_a\) is
2.3.3 Implementation with linear arrays
Let us now consider an array of secondary sources deployed on a line segment such that \(\textbf{r}_l=[x_0, y_l]^T\), with \(-y_0\le y_l \le y_0\). In this case, the allowed values for the reproduced plane wave directions belong to a subset of \([0, 2\pi )\); specifically, the allowed range is \(\theta \in \{\theta \in \mathbb {R} : \theta _\text {min}\le \theta \le \theta _\text {max}\}\), where \(\theta _\text {min}=\arctan (-y_0,x_0)\) and \(\theta _\text {max}=\arctan (y_0,x_0)\). This angular interval is sampled using N components. This limitation is due to the geometrical constraints posed by the configuration of the array and of the listening region: reproduction is performed towards the half-plane given by \(x < x_0\) [8], and the linear array is not able to accommodate all the plane wave directions surrounding the listening region, as in the circular array case. Since no closed-form solutions are known for arrays that are not circular [42], the filters \(\textbf{h}(\theta _n,\omega ) =[h_1(\theta _n, \omega ),\ldots, h_L(\theta _n, \omega )]^T\) to be applied to the loudspeaker signals are estimated by minimizing the error due to the approximation of the plane wave soundfield through the secondary sources, that is [42]
which yields [42]
where \(\textbf{p}_{\text {cp},\text {pwd}}(\theta _n,\omega )=[e^{j\frac{\omega }{c}<\textbf{r}_1,\hat{\textbf{k}}(\theta _n)>},\ldots , e^{j\frac{\omega }{c}<\textbf{r}_I,\hat{\textbf{k}}(\theta _n)>}]^T\) is a vector containing the pressure soundfield at the control points, due to a plane wave with direction \(\theta _n\).
We can then derive the driving signals in the case of the linear array as [42]
and then the desired soundfield can be obtained by inserting the derived driving signals into (14).
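The per-direction least-squares filter estimation for a non-circular array can be sketched as follows. This is an illustration under hypothetical dimensions (the function name is ours; the plane-wave target is built exactly as in the vector \(\textbf{p}_{\text{cp,pwd}}\) above):

```python
import numpy as np

def pw_filters(G_cp, r_cp, theta_n, omega, lam, c=343.0):
    """Least-squares plane-wave filters for a non-circular array, a sketch.

    G_cp:    (I, L) ATFs from the L loudspeakers to the I control points.
    r_cp:    (I, 2) control point positions.
    theta_n: direction of the n-th plane-wave component.
    lam:     regularization parameter lambda.
    """
    k_hat = np.array([np.cos(theta_n), np.sin(theta_n)])
    # Target: plane wave with direction theta_n sampled at the control points.
    p_pwd = np.exp(1j * omega / c * r_cp @ k_hat)
    L = G_cp.shape[1]
    # Regularized least squares: h = (G^H G + lam I)^{-1} G^H p_pwd.
    return np.linalg.solve(G_cp.conj().T @ G_cp + lam * np.eye(L),
                           G_cp.conj().T @ p_pwd)
```

One such filter vector is computed per plane-wave component \(\theta_n\), and the driving signals then superpose the N components as in the circular-array case.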
2.4 Adaptive wave field synthesis
Wave field synthesis (WFS) [1] is a soundfield reproduction technique which assumes free-field propagation and whose driving signals are derived from the Kirchhoff-Helmholtz integral theorem.
Let us consider a 2D freefield environment. The WFS driving signals needed to reproduce a source placed in \(\textbf{r}_s\) can be derived as [43]
where \(\rho\) denotes the air density, \(\Psi\) the angle between \(\textbf{r}_s\) and the normal to the reproduction line (i.e., the contour comprising the loudspeaker array) at the secondary source \(\textbf{r}_l\), \(\textbf{r}_o\) denotes a point on the reference line, along which the amplitude error should theoretically be zero [52], and finally, \(\Delta _l = \Vert \textbf{r}_l - \textbf{r}_{l+1}\Vert\) denotes the spacing between consecutive loudspeakers.
In order to address the reproduction inaccuracies due to the WFS free-field assumption, a compensation technique for WFS driving signals, denoted adaptive wave field synthesis (\(\text {AWFS}\)), was proposed in [43]. Let us consider the soundfield \(\textbf{p}_\text {cp,wfs}(\omega )\) obtained at the control points through the WFS driving signals, and the reproduction error \(\textbf{e}_\text {cp}(\omega ) = \textbf{p}_\text {cp}(\omega ) - \textbf{p}_\text {cp,wfs}(\omega )\); the driving signals \(\textbf{d}_\text {awfs}(\omega ) \in \mathbb {C}^L\) are then obtained in \(\text {AWFS}\) by solving the following minimization problem [43]
where \(\textbf{e}(\omega ) = \textbf{p}_\text {cp}(\omega )-\hat{\textbf{p}}_{\text {cp}, \text {awfs}}(\omega )\) is the difference between the ground truth and estimated complex soundfields, and \(\lambda\) is a regularization parameter.
The adapted wavefield synthesis driving signals that minimize the cost function are then found through [43, 53]
where the solution is equivalent to the WFS one for \(\lambda \rightarrow \infty\) and to the optimal solution in a least-mean-squares sense for \(\lambda \rightarrow 0\).
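The AWFS cost above, which penalizes both the control-point error and the deviation from the WFS prior, admits the closed-form solution sketched below. This is our own minimal numpy illustration (hypothetical dimensions and function name), not the toolbox implementation used in the experiments:

```python
import numpy as np

def awfs_driving_signals(G_cp, p_cp, d_wfs, lam):
    """AWFS sketch: minimize ||p_cp - G_cp d||^2 + lam * ||d - d_wfs||^2.

    G_cp:  (I, L) ATFs to the control points.
    p_cp:  (I,) desired pressure at the control points.
    d_wfs: (L,) WFS driving signals used as prior.
    """
    L = G_cp.shape[1]
    A = G_cp.conj().T @ G_cp + lam * np.eye(L)
    # lam -> infinity recovers d_wfs; lam -> 0 the least-squares optimum.
    return np.linalg.solve(A, G_cp.conj().T @ p_cp + lam * d_wfs)
```

Both limiting behaviors stated in the text can be checked numerically by sweeping `lam`.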
3 Driving signals compensation through complexvalued convolutional neural networks
In this section, we present the proposed technique for soundfield synthesis through complexvalued CNNs using irregular loudspeaker arrays. We first formalize the problem as the compensation of the filters obtained through the MR technique; then, we describe the general pipeline of the method and the proposed network architecture.
3.1 Problem formulation
Let us consider a circular or linear array of secondary sources as shown in Fig. 1a and c, respectively. An irregular loudspeaker array setup is obtained by removing some secondary sources from the setup, as shown in Fig. 1b and d. More formally, we can define an irregular loudspeaker array as an array where the spacing between the secondary sources is not constant.
Given the MR soundfield synthesis technique presented in Section 2.3, it is possible to obtain driving signals enabling a correct reproduction of the soundfield, as shown using a circular array in Fig. 2a. However, if we remove secondary sources and do not take any countermeasure, the quality of the reproduced soundfield degrades considerably, as shown in Fig. 2b. Let us consider the driving signals \(\textbf{d}_{\text {mr}}(\omega_k) \in \mathbb {C}^{L}\), \(k=1,\ldots,K\), with K the number of frequencies, obtained through the MR technique using either a linear or circular array. If we stack the driving signals into a matrix \(\textbf{D}_{\text {mr}} \in \mathbb {C}^{L \times K}\) as follows
where \(\omega _k, k=1,\ldots ,K\) correspond to the discrete angular frequencies, we can then define the objective of the proposed method as retrieving the function \(\mathcal {U}(\cdot )\) such that
where the driving signal matrix \(\textbf{D}_{\text {cnn}} \in \mathbb {C}^{L \times K}\) is the compensated version of \(\textbf{D}_{\text {mr}}\), obtained by solving the following optimization problem
that is, corresponding to the minimization of the reproduction error at control points \(\textbf{r}_i,\ i=1,\ldots ,I\).
3.2 Pipeline
The pipeline of the proposed method is depicted in Fig. 3.
In order to train the network, we consider a set of simulated data. More specifically, we consider a set of point sources positioned at locations \(\textbf{r}_{s}\) outside the listening region. For each source, we compute the corresponding driving signal matrix \(\textbf{D}_{\text {mr}}\) and, by applying (2), the corresponding ground truth pressure soundfield at control points \(\textbf{p}_{\text {cp}}\).
The matrix \(\textbf{D}_{\text {mr}}\) is fed as input to the network \(\mathcal {U}(\cdot )\), whose output is the matrix containing the compensated filters \(\textbf{D}_{\text {cnn}}\).
The prediction of the soundfield due to \(\textbf{r}_{s}\) at the selected control points \(\textbf{r}_{i}\) at frequency \(\omega _k\) is given by the convolution in the frequency domain between the estimated filters and the point-to-point Green's function, i.e.,
The parameters of the network \(\mathcal {U}(\cdot )\) are optimized through the loss function
where \(\Vert \cdot \Vert _1\) denotes the \(\ell_1\) norm. The loss in (25) is defined for a single source in \(\textbf{r}_s\); during training, however, it is computed over a batch of sources. The batch index is omitted here for the sake of compactness.
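The frequency-domain rendering at the control points and the resulting training loss can be sketched in numpy as follows (a minimal single-source illustration with our own function name and hypothetical tensor shapes; in the actual pipeline this is computed on complex tensors inside the training graph):

```python
import numpy as np

def synthesis_loss(D_cnn, G_cp, P_cp):
    """L1 soundfield-matching loss, a numpy sketch of Eqs. (24)-(25).

    D_cnn: (L, K) compensated driving signals output by the network.
    G_cp:  (I, L, K) Green's functions from loudspeakers to control points.
    P_cp:  (I, K) ground truth pressure at the control points.
    """
    # Frequency-domain rendering at the control points: sum over loudspeakers.
    P_hat = np.einsum('ilk,lk->ik', G_cp, D_cnn)
    # L1 norm of the complex reproduction error over control points and frequencies.
    return np.abs(P_cp - P_hat).sum()
```

Note that the loss never compares driving signals directly: only the rendered pressure at the control points is supervised, which is what makes training possible without ground truth compensated driving signals.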
3.3 Network architecture
In order to estimate the compensated driving signals from the ones obtained using the \(\text {MR}\) method with an irregular loudspeaker array, we make use of a complex-valued 2D convolutional architecture denoted as \(\mathcal {U}(\cdot )\). Since the main novelty of this manuscript lies in the application of complex-valued deep learning to soundfield synthesis using irregular loudspeaker arrays, and not in the deep learning techniques themselves, we designed the network architecture by selecting standard design choices from the literature and adapting them to the particular scenario considered.
The network takes \(\textbf{D}_\text {mr}\) as input and outputs the matrix \(\textbf{D}_\text {cnn}\). Concerning the size of the input tensor, the proposed architecture is designed to work with an odd size along the axis corresponding to the number of frequencies K and a power of two along the axis corresponding to the number of loudspeakers L; only minor adjustments would be needed in order to adapt it to different scenarios. It is important to note that the network has to be retrained from scratch in order to use a different frequency axis.
The proposed network is composed of the following layers:

i) A complex convolutional layer, with 128 filters, which outputs a \(((L/2)-1)\times ((K-1)/2)\times 128\) feature map.

ii) A complex convolutional layer, with 256 filters, which outputs a \(((L/4)-1) \times ((K-3)/4)\times 256\) feature map.

iii) A complex convolutional layer, with 512 filters, which outputs a \(((L-8)/8)\times ((K-7)/8)\times 512\) feature map.

iv) A transposed complex convolutional layer, with 256 filters, which outputs a \(((L/4)-1) \times ((K-3)/4) \times 256\) feature map.

v) A transposed complex convolutional layer, with 128 filters, which outputs a \(((L/2)-1)\times ((K-1)/2) \times 128\) feature map.

vi) A transposed complex convolutional layer, with 128 filters, which outputs a \(L\times K\times 128\) feature map.

vii) A transposed complex convolutional layer, with 1 filter, which outputs a \(L\times K\times 1\) feature map.
The chosen network architecture processes the input by successively compressing it along the width and height axes while increasing the number of filters (i.e., channels), since this procedure helps in hierarchically learning higher-level features [54] at different scales. The chosen numbers of filters are similar to the ones commonly used in the literature, such as in VGG16 [55]. Since the proposed model compensates the input driving signals, the output must have the same dimensions as the input. For this reason, the architecture has a mirrored structure that first compresses the input data using 2D convolutional layers and then expands it through 2D transposed convolutional layers to generate the compensated driving signals.
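The feature-map sizes listed for the encoder follow directly from the valid (no padding), stride-2, 3×3 convolution arithmetic. The following sketch (function names ours; L and K values are illustrative, not the ones used in the experiments) reproduces that arithmetic:

```python
def conv_out(n, kernel=3, stride=2):
    """Output length of a 'valid' (no padding) convolution along one axis."""
    return (n - kernel) // stride + 1

def encoder_shapes(L, K):
    """Spatial sizes after the three 3x3/stride-2 encoder layers of Section 3.3.

    Assumes L a power of two and K odd, as required by the architecture.
    """
    shapes = []
    l, k = L, K
    for _ in range(3):
        l, k = conv_out(l), conv_out(k)
        shapes.append((l, k))
    return shapes
```

For instance, with L = 64 and K = 63 this yields (31, 31), (15, 15), and (7, 7), matching the closed-form sizes (L/2)-1, (L/4)-1, and (L-8)/8 along the loudspeaker axis.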
All layers have a \((3 \times 3)\) kernel, a common choice among CNN-based architectures [56], with the exception of layer v), which has a \((4 \times 3)\) kernel. This choice accounts for the fact that, in the considered scenario, the number of frequencies is not a power of two. No padding is applied, the stride is equal to \(2 \times 2\), and the chosen activation is the complex parametric rectified linear unit (CPReLU), which has been proposed and used for audio-related applications [57] and is highly expressive due to the number of parameters contained in the activation. Similarly to the CReLU activation [37, 58], CPReLU applies separate PReLUs [59] to the real and imaginary parts of a neuron. More specifically, it is defined as
where \(z \in \mathbb {C}\) represents the value of a neuron, and \(\Re\) and \(\Im\) denote the operators extracting the real and imaginary parts, respectively, out of a complex number.
In layer vii), zero-padding is applied, the stride is equal to \(1 \times 1\), and a linear activation is used. We introduce a skip connection, which has been proven to speed up training [60], by feeding as input to layer v) the sum of the outputs of layers iv) and ii). All convolutional layers, with the exception of vii), are followed by dropout with a rate of 0.5, in order to prevent overfitting [61]. The complex-valued layers of the network were implemented by means of the CVNN library [62] using TensorFlow as backend.
4 Results
In this section, we present simulation and experimental results aimed at estimating the accuracy of the soundfield synthesized with the proposed method, referred to in the following as \(\text {CNN}\), with respect to the techniques presented in Section 2, namely the model-based soundfield rendering technique [42] (\(\text {MR}\)), the pressure matching technique [15] (\(\text {PM}\)), and adaptive wave field synthesis (\(\text {AWFS}\)). We also consider an adaptive version of the \(\text {MR}\) technique, obtained by applying the \(\text {AWFS}\) procedure defined in (20) to the driving signals derived via the model-based technique. We will refer to this method as \(\text {AMR}\) in the following.
The \(\text {MR}\) technique assumes setups where loudspeakers are regularly spaced; therefore, its performance is expected to be suboptimal when it is applied to an irregular array, as in the case of this manuscript. Moreover, since the \(\text {CNN}\) technique compensates the driving signals extracted via \(\text {MR}\), the reproduction error obtained through the latter can be considered an upper bound for the proposed method.
We also consider the \(\text {PM}\) method since, similarly to \(\text {CNN}\), it does not pose any constraint on the configuration of the loudspeaker array.
We avoid a comparison with a mode matching technique, even though it is suitable for irregular setups, due to the inherently different optimization procedure. While the function minimized by the \(\text {PM}\) and \(\text {CNN}\) approaches considers the pressure obtained at a series of control points, the mode matching technique directly minimizes the difference between the modes of the desired and reproduced soundfields [63]. Moreover, a mode-matching strategy is already applied in the derivation of the spatial filters used in the \(\text {MR}\) technique.
The simulation results refer to circular and linear loudspeaker deployments, while the experimental ones refer to a circular array setup only. We first present the aspects of the setup that are common to both configurations and then discuss the different scenarios separately. The setups for the simulation and experimental campaigns were chosen empirically with the objective of, on the one hand, analyzing the accuracy of the reproduction, thus using a high spatial sampling of the listening area, while, on the other hand, considering a challenging setup for what concerns the control points, whose spatial sampling always corresponds to a spatial aliasing frequency well below the maximum one considered in the analysis. This choice was made since, as demonstrated in our previous work [32], learning-based soundfield synthesis techniques are able to overcome sampling issues compared to other optimization-based approaches such as \(\text {PM}\). The code used to generate the data and train the model, as well as the setups and additional results, can be found at https://polimiispl.github.io/deep_learning_soundfield_synthesis_irregular_array/. The \(\text {WFS}\) driving functions needed to apply \(\text {AWFS}\) were computed using the Sound Field Synthesis (SFS) Toolbox for Python [64].
4.1 Model parameters
In order to train the network, we simulate a set of point sources \(\mathcal {S}\), which is then separated into three sets \(\mathcal {S}_\text {train}\), \(\mathcal {S}_\text {val}\), and \(\mathcal {S}_\text {test}\), used for the training, validation, and testing phases, respectively. These datasets are independent of each other, meaning more formally that
The network is trained using the Adam optimizer [65] with a learning rate \(\text {lr}=10^{-4}\). We set the maximum number of epochs to 5000 and saved only the model corresponding to the best validation loss value. We apply early stopping by ending the training after 10 epochs with no improvement in terms of validation loss. The network loss usually converged after around \(100\)–\(200\) epochs. The regularization constant \(\lambda\) used to regularize the least squares solution in \(\text {PM}\) (see (4)), \(\text {MR}\) (see (16)), and \(\text {AMR}\) and \(\text {AWFS}\) (see (20)) was set to \(10^{-3} \sigma _\text {max}\), where \(\sigma _\text {max}\) is the maximum singular value of \(\textbf{G}_\text {cp}^H\textbf{G}_\text {cp}\), similarly to [19].
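As a concrete illustration, the regularized least-squares solve with \(\lambda\) tied to the maximum singular value of \(\textbf{G}_\text {cp}^H\textbf{G}_\text {cp}\) could be sketched as follows (a sketch under our assumptions, not the authors' exact code; the function and variable names are hypothetical):

```python
import numpy as np

def regularized_pm_drive(G_cp, p_des, scale=1e-3):
    """Pressure-matching driving signals with Tikhonov regularization.

    G_cp: (I, L) complex matrix of acoustic transfer functions from the
    L loudspeakers to the I control points; p_des: (I,) desired pressure
    at the control points. The regularization constant is
    scale * sigma_max(G^H G), as in the paper's setup.
    """
    GhG = G_cp.conj().T @ G_cp
    sigma_max = np.linalg.norm(GhG, 2)  # largest singular value of G^H G
    lam = scale * sigma_max
    # d = (G^H G + lam I)^{-1} G^H p_des
    return np.linalg.solve(GhG + lam * np.eye(GhG.shape[0]), G_cp.conj().T @ p_des)
```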
4.2 Evaluation metrics
In order to evaluate the performance of the proposed method, we adopt two different metrics: the normalized reproduction error (\(\text {NRE}\)) [19] and the structural similarity index measure (\(\text {SSIM}\)) [66]. The \(\text {NRE}\) measures the reproduction accuracy and, for a single emitting source \(\textbf{r}_s\) and frequency \(\omega _k\), is defined as
\(\text {NRE}(\textbf{r}_s,\omega _k) = 10\log _{10}\frac{\sum _{a=1}^{A}\left| \hat{p}(\textbf{r}_{a},\omega _k)-p(\textbf{r}_{a},\omega _k)\right| ^2}{\sum _{a=1}^{A}\left| p(\textbf{r}_{a},\omega _k)\right| ^2},\)
where \(\hat{p}(\textbf{r}_{a},\omega _k)\) corresponds to the pressure soundfield estimated at point \(\textbf{r}_{a}\) using either the \(\text {MR}\), \(\text {PM}\) or \(\text {CNN}\) techniques, while \(p(\textbf{r}_{a},\omega _k)\) is the ground truth.
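A minimal numpy sketch of the \(\text {NRE}\) as commonly defined [19] (our reading of the normalization; not the authors' code):

```python
import numpy as np

def nre_db(p_hat, p):
    """Normalized reproduction error in dB.

    p_hat, p: complex arrays of reproduced and ground truth pressure,
    sampled at the A points of the listening area.
    """
    return 10 * np.log10(np.sum(np.abs(p_hat - p) ** 2) / np.sum(np.abs(p) ** 2))
```

A doubled field, for instance, yields an error energy equal to the reference energy, i.e., 0 dB.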
As already done in [25], we also evaluate the accuracy in terms of \(\text {SSIM}\), which enables us to evaluate how well the considered techniques reproduce the overall shape of the pressure soundfield at each frequency point. For a single emitting source \(\textbf{r}_s\) and frequency \(\omega _k\), the \(\text {SSIM}\) is given by
\(\text {SSIM}(\textbf{p},\hat{\textbf{p}}) = \frac{(2\mu _{\textbf{p}}\mu _{\hat{\textbf{p}}}+c_1)(2\sigma _{(\textbf{p},\hat{\textbf{p}})}+c_2)}{(\mu _{\textbf{p}}^2+\mu _{\hat{\textbf{p}}}^2+c_1)(\sigma ^2_{\textbf{p}}+\sigma ^2_{\hat{\textbf{p}}}+c_2)},\)
where \(\textbf{p} \in \mathbb {R}^{A}\) and \(\hat{\textbf{p}} \in \mathbb {R}^{A}\) correspond to the absolute value of the pressure soundfield, normalized between 0 and 1, measured in the listening area \(\mathcal {A}\) at frequency \(\omega _k\) when the source \(\textbf{r}_s\) is active, in the ground truth case and when either \(\text {CNN}\), \(\text {PM}\), or \(\text {MR}\) is used, respectively. The values \(\mu _{(\cdot )}\) and \(\sigma ^2_{(\cdot )}\) are the average and variance of the vector at the subscript, respectively. Finally, \(\sigma _{(\cdot ,\cdot )}\) is the covariance between the entries of the two vectors given as arguments. In order to stabilize the division with a weak denominator, the \(\text {SSIM}\) calculation includes the two constants \(c_1=(h_1R)^2\) and \(c_2=(h_2R)^2\), where R is the dynamic range of the entry values (1 in the case of normalized vectors), while \(h_1=0.01\) and \(h_2=0.03\), following the standard recommendation [25].
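The single-window \(\text {SSIM}\) described above can be sketched as follows (a sketch assuming the whole field is treated as one window, without the sliding-window averaging of the original SSIM formulation):

```python
import numpy as np

def ssim_global(p, p_hat, h1=0.01, h2=0.03, R=1.0):
    """Single-window SSIM between two normalized soundfield magnitude vectors.

    p, p_hat: real vectors in [0, 1] (normalized absolute pressure over the
    listening area); R is the dynamic range (1 after normalization).
    """
    c1, c2 = (h1 * R) ** 2, (h2 * R) ** 2
    mu_p, mu_q = p.mean(), p_hat.mean()
    var_p, var_q = p.var(), p_hat.var()
    cov = ((p - mu_p) * (p_hat - mu_q)).mean()
    return ((2 * mu_p * mu_q + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_q ** 2 + c1) * (var_p + var_q + c2))
```

By construction the measure equals 1 for identical fields and decreases as the shapes diverge.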
4.3 Linear array
In this section, we present results related to soundfield synthesis when considering a linear array setup.
4.3.1 Setup
We considered a regular linear array centered in \([0.5\ \text {m}, 0\ \text {m}]^T\) and consisting of \(L=64\) secondary sources with a spacing of \(0.0625\ \text {m}\). From this configuration, we generated three irregular array setups by randomly removing 16, 32, or 48 loudspeakers, resulting in three irregular arrays with \(L=48\), \(L=32\), and \(L=16\) secondary sources, respectively. The listening area \(\mathcal {A}\) considered for reproduction was a \(2\ \text {m} \times 2\ \text {m}\) surface located on the half plane on the left of the array, with the lower left corner placed in \([-2\ \text {m}, -2\ \text {m}]^T\), sampled using \(A=25000\) points with a spacing of \(0.02\ \text {m}\) on both the x and y axes. We used \(I=60\) control points, placed on a \(2\ \text {m} \times 2\ \text {m}\) grid overlapping with the listening region \(\mathcal {A}\) and spaced by \(0.44\ \text {m}\) on both the x and y axes, corresponding to a spatial aliasing frequency of \(387\ \text {Hz}\), both for computing the losses during the training of the \(\text {CNN}\) model and for calculating the driving signals through \(\text {PM}\) and \(\text {AWFS}\) and the filters needed to compute \(\text {MR}\) through (16) and \(\text {AMR}\).
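A sketch of how such an irregular configuration could be generated from the regular one, together with the half-wavelength spatial aliasing frequency of a sampling grid (the speed of sound assumed in the paper is not stated; \(c=343\ \text {m/s}\) is our assumption, which yields a value close to the quoted 387 Hz for the 0.44 m control point spacing):

```python
import numpy as np

def make_irregular_linear_array(L_full=64, spacing=0.0625, keep=32, seed=0):
    """Randomly remove loudspeakers from a regular linear array.

    Returns the positions (in meters, along the array axis, centered at 0)
    of the surviving secondary sources. Hypothetical layout following the
    paper's setup description.
    """
    rng = np.random.default_rng(seed)
    pos = spacing * (np.arange(L_full) - (L_full - 1) / 2)
    idx = np.sort(rng.choice(L_full, size=keep, replace=False))
    return pos[idx]

def aliasing_frequency(spacing, c=343.0):
    """Half-wavelength spatial aliasing frequency of a sampling grid."""
    return c / (2 * spacing)
```

The same formula applied to the 0.16 m control point spacing of the real-data setup gives roughly 1072 Hz, matching the value reported there.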
In order to train the network, we set the cardinalities of \(\mathcal {S}_\text {train}\), \(\mathcal {S}_\text {val}\), and \(\mathcal {S}_\text {test}\) to 3920, 980, and 2500, respectively. The sources in \(\mathcal {S}_\text {train} \cup \mathcal {S}_\text {val}\) are placed in a \(4\ \text {m} \times 8\ \text {m}\) grid sampled with a spacing of \(0.06\ \text {m}\) along the y-axis and of \(0.11\ \text {m}\) along the x-axis. The split of these sources into training and validation sets is performed randomly at training time, as is common practice. Test sources are then obtained by shifting the \(\mathcal {S}_\text {train} \cup \mathcal {S}_\text {val}\) set by \(0.05\ \text {m}\) along the x-axis. The image depicting the setup is available on the accompanying website^{Footnote 1}. We considered sources emitting a signal with spectrum \(A(\omega _k)=1\) at \(K=63\) frequencies spaced by \(23\ \text {Hz}\), in the range between \(46\ \text {Hz}\) and \(1500\ \text {Hz}\).
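The dataset construction described above can be sketched as follows (grid extents, spacings, and the 80/20 train/validation ratio follow the text; the grid origin and the random seed are our assumptions):

```python
import numpy as np

# Grid of candidate source positions: 4 m x 8 m,
# spacing 0.11 m along x and 0.06 m along y
xs = np.arange(0.0, 4.0, 0.11)
ys = np.arange(0.0, 8.0, 0.06)
grid = np.array([(x, y) for x in xs for y in ys])

# Random 80/20 split into training and validation sources
rng = np.random.default_rng(0)
perm = rng.permutation(len(grid))
n_val = len(grid) // 5
val_src = grid[perm[:n_val]]
train_src = grid[perm[n_val:]]

# Test sources: the same grid shifted by 0.05 m along the x-axis
test_src = grid + np.array([0.05, 0.0])
```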
4.3.2 Results
In Fig. 4, we show the real part of the reproduced sound pressure distribution at frequency \(f=210\ \text {Hz}\) for a point source located in \(\textbf{r}=[1.05\ \text {m}, 1.88\ \text {m}, 0\ \text {m}]^T\), synthesized using \(L=32\) loudspeakers. More specifically, Fig. 4a refers to the ground truth soundfield, while the fields for \(\text {MR}\), \(\text {CNN}\), \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\) are shown in Fig. 4b, c, d, e, and f, respectively. We purposely show the soundfield at a lower frequency to present an example where the performances of the \(\text {CNN}\) method are slightly worse than those of \(\text {AMR}\), while better than those of all other considered methods. It is apparent that the \(\text {CNN}\) and \(\text {AMR}\) models obtain the best results, reducing the number of irregularities in the wavefront both with respect to the \(\text {MR}\) technique, whose driving signals are the input to the \(\text {CNN}\) model, and to the \(\text {PM}\) technique. The difference in performance between the \(\text {CNN}\) model and the \(\text {AWFS}\) technique is smaller, since in this scenario all models work reasonably well. These considerations are also confirmed by inspecting the \(\text {NRE}\) for the same scenario, as shown in Fig. 5. In Fig. 6a, c, and e, we present the \(\text {NRE}\) averaged over all \(\mathcal {S}_\text {test}\) sources, when considering irregular arrays of \(L=48\), 32, and 16 secondary sources, respectively. The \(\text {CNN}\) achieves the best average \(\text {NRE}\) over the whole range of considered frequencies in all cases, with respect to both the \(\text {MR}\) and \(\text {PM}\) techniques, the latter also showing higher irregularity.
When comparing the average result of \(\text {CNN}\) with the linear optimizer-based \(\text {AWFS}\) and \(\text {AMR}\) methods, the former still obtains better performances in most scenarios, with slightly lower performances around \(200\ \text {Hz}\); however, the gap diminishes together with the number of active loudspeakers, becoming almost indistinguishable for \(L=16\). As expected, the error is higher with fewer active secondary sources.
In Fig. 6b, d, and f, we present the \(\text {SSIM}\) averaged over all \(\mathcal {S}_\text {test}\) sources, when considering irregular arrays of \(L=48\), 32, and 16 sources, respectively. For \(L=48\), the results are broadly similar for all methods; \(\text {CNN}\) is worse on average at the lowest frequencies, while slightly better at the higher ones. In the case of \(L=32\), the \(\text {SSIM}\) curves are similar for most methods except for \(\text {CNN}\), which obtains slightly lower results below \(600\ \text {Hz}\) but performs better than the other methods at higher frequencies. Finally, in the case of \(L=16\), the \(\text {SSIM}\) is comparable for all considered methods, with \(\text {CNN}\) obtaining slightly better results above \(600\ \text {Hz}\).
4.4 Circular array
In this section, we present results related to soundfield synthesis when considering a circular array setup.
4.4.1 Setup
We considered a regular circular array consisting of \(L=64\) secondary sources with a radius of \(1\ \text {m}\).
The listening area considered for reproduction, surrounded by the loudspeaker array, corresponds to a circle of \(1\ \text {m}\) radius centered in \([0\ \text {m}, 0\ \text {m}]^T\), uniformly sampled in order to have \(A=7770\) listening points spaced by \(0.02\ \text {m}\).
We used \(I=25\) control points placed in a \(1.3\ \text {m} \times 1.3\ \text {m}\) square grid inside \(\mathcal {A}\), centered in \([0\ \text {m}, 0\ \text {m}]^T\), with a spacing of \(0.3\ \text {m}\) along both the x and y axes, resulting in 5 rows and 5 columns and corresponding to a spatial aliasing frequency of approximately \(514\ \text {Hz}\). The control points were used to compute the losses during the training of the \(\text {CNN}\) model and to calculate the driving signals through \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\).
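Such a \(5\times 5\) control point grid can be sketched as follows (five points at the stated 0.3 m spacing, centered at the origin; the exact grid extent is our assumption). All 25 points fall inside the 1 m radius listening area:

```python
import numpy as np

# 5 x 5 control point grid, 0.3 m spacing, centered at the origin
coords = np.arange(-0.6, 0.61, 0.3)  # [-0.6, -0.3, 0.0, 0.3, 0.6]
cp = np.array([(x, y) for x in coords for y in coords])
```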
In order to train the network, we used \(|\mathcal {S}_\text {train}|=4096\) and \(|\mathcal {S}_\text {val}|=1024\) sources, respectively. The \(\mathcal {S}_\text {train}\) and \(\mathcal {S}_\text {val}\) sets were generated by uniformly sampling, with 256 points each, 20 circumferences whose radii were uniformly distributed in the range \([1.5\ \text {m}, 3.5\ \text {m}]\) from the center of the array.
The test dataset \(\mathcal {S}_\text {test}\), instead, was created by uniformly sampling, with 128 points each, 20 circumferences whose radii were uniformly distributed in the range \([1.55\ \text {m}, 3.55\ \text {m}]\), obtaining \(|\mathcal {S}_\text {test}|=2560\) test sources, placed such that no source overlaps with the ones used for training and validating the method.
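The ring-based source sampling described above can be sketched as follows (a sketch under our assumptions; the angular offset of each ring and the random seed are not specified in the text):

```python
import numpy as np

def sample_ring_sources(n_rings=20, n_per_ring=128, r_min=1.55, r_max=3.55, seed=0):
    """Sources on concentric circumferences around the array center.

    Each ring is uniformly sampled in angle; the ring radii are drawn
    uniformly in [r_min, r_max]. Returns an (n_rings * n_per_ring, 2) array
    of (x, y) positions.
    """
    rng = np.random.default_rng(seed)
    radii = rng.uniform(r_min, r_max, size=n_rings)
    theta = np.linspace(0.0, 2 * np.pi, n_per_ring, endpoint=False)
    rings = [np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1) for r in radii]
    return np.concatenate(rings, axis=0)
```

With the test-set parameters this yields the 2560 sources mentioned above; the training/validation sets follow by changing `n_per_ring` to 256 and the radius range to [1.5 m, 3.5 m].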
We considered sources emitting a signal with spectrum \(A(\omega _k)=1\) at \(K=63\) frequencies spaced by \(23\ \text {Hz}\), in the range between \(46\ \text {Hz}\) and \(1500\ \text {Hz}\). The image depicting the setup is available on the accompanying website^{Footnote 2}.
4.4.2 Results
In Fig. 7a, we show the real part of the ground truth sound pressure distribution for an emitting point source placed in \(\textbf{r}=[0.99\ \text {m}, 2.88\ \text {m}, 0\ \text {m}]^T\). In Fig. 7b, c, d, e, and f, the real part of the sound pressure obtained through \(\text {MR}\), \(\text {CNN}\), \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\) is shown, respectively, when 32 loudspeakers are active. It is clear that the \(\text {CNN}\) model performs best, reducing the number of irregularities in the wavefront with respect to the \(\text {MR}\), \(\text {AWFS}\), and \(\text {AMR}\) techniques and especially with respect to the \(\text {PM}\) technique, whose reproduced soundfield is extremely irregular. These considerations are also confirmed by inspecting the \(\text {NRE}\) obtained for the same scenario, shown in Fig. 8, where the \(\text {NRE}\) in the case of \(\text {CNN}\), shown in Fig. 8b, is considerably lower in the listening area \(\mathcal {A}\) than the ones obtained through \(\text {MR}\), \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\), shown in Fig. 8a, c, d, and e, respectively.
In Fig. 9a, c, and e, we present the \(\text {NRE}\) averaged over all \(\mathcal {S}_\text {test}\) sources, when considering irregular arrays of \(L=48\), 32, and 16 secondary sources, respectively. Similarly to the linear array case, the \(\text {CNN}\) achieves average \(\text {NRE}\) results that are on par with or better than the other considered techniques. This is more evident when the number of secondary sources is lower. While the mean \(\text {NRE}\) of \(\text {MR}\) is approximately constant over the considered frequency range, the average error of \(\text {CNN}\) tends to increase with the frequency, even if it remains lower than that of \(\text {MR}\). Analogously, \(\text {PM}\) exhibits an error that increases with the frequency, becoming extremely irregular in the upper frequency range and for sparser setups, while being on par with or lower than \(\text {CNN}\) at the lower frequencies. \(\text {AMR}\) shows a behavior similar to \(\text {CNN}\) but reaches higher \(\text {NRE}\) values. When considering the \(\text {AWFS}\) technique, the \(\text {CNN}\) technique performs better on average in both the \(L=48\) and \(L=32\) cases, while the performances when using an array with \(L=16\) loudspeakers are practically on par.
In Fig. 9b, d, and f, we present the \(\text {SSIM}\) metric averaged over all \(\mathcal {S}_\text {test}\) sources, when considering irregular arrays of \(L=48\), 32, and 16 sources, respectively. Differently from the linear array case, the \(\text {SSIM}\) obtained through \(\text {CNN}\) is similar to or better than that of the other considered methods for \(L=48\) and \(L=32\), especially at higher frequencies. This is probably due both to the smaller listening area considered, which allows for fewer irregularities in the reproduced wavefront, and to the fact that the array surrounds the listening area, enabling reproduction from a higher number of directions. However, a notable exception is \(L=16\), where the highest \(\text {SSIM}\) performances are obtained by the \(\text {MR}\) technique and \(\text {CNN}\) performs worse than \(\text {AWFS}\) and \(\text {MR}\) at higher frequencies.
In the case of the circular array, we also computed the \(\text {NRE}\) and \(\text {SSIM}\) when varying the location of the emitting source, in particular as it moves farther from the center of the array in the range \(1.5\ \text {m}<\rho <3.5\ \text {m}\), while keeping the frequency fixed at \(1007\ \text {Hz}\). The \(\text {NRE}\) results are shown in Fig. 10a, c, and e for the arrays with 48, 32, and 16 secondary sources, respectively. All methods present a mostly constant behavior over the whole considered radius range, with \(\text {CNN}\) and \(\text {PM}\) being the most and least accurate, respectively. As expected, the \(\text {NRE}\) worsens when decreasing the number of active secondary sources. Coherently with the \(\text {NRE}\) results, for \(L=16\), the average performances of \(\text {CNN}\) and \(\text {AWFS}\) are extremely similar. The \(\text {SSIM}\) results are shown in Fig. 10b, d, and f for the arrays with 48, 32, and 16 secondary sources, respectively. In this case, the accuracy slightly worsens as the distance of the sources increases. While \(\text {CNN}\), \(\text {MR}\), and \(\text {AWFS}\) are close to each other, \(\text {AMR}\) and \(\text {PM}\) turn out to be the worst.
4.5 Real data
In this section, we present results related to soundfield synthesis when considering a circular array setup and data obtained from room impulse response (RIR) measurements contained in the dataset from [67]. It is important to stress that in this scenario the sound propagation is 3D; therefore, in order to provide a fair comparison, we used the 2.5D version of \(\text {WFS}\), contained in the SFS toolbox [64], to implement the \(\text {AWFS}\) method. While the filters obtained via \(\text {MR}\) are at a disadvantage, being computed for a 2D environment, this is not a problem for \(\text {AMR}\) or \(\text {CNN}\), since these methods use the \(\text {MR}\) filters only as an input that is later optimized taking the 3D scenario into account. The point sources used to generate the desired ground truth soundfields were simulated using Pyroomacoustics [68], effectively considering 3D propagation.
4.5.1 Setup
RIRs were measured in a hemi-anechoic room of size \(4.90\ \text {m}\times 7.22\ \text {m}\times 5.29\ \text {m}\), with 50 mm Martini Absorb XHD50 sound absorbing material on the ground and an average reverberation time of \(0.045\ \text {s}\), using an array of \(L=60\) loudspeakers (Genelec 8010A) with a radius of \(1.5\ \text {m}\), the spacing between adjacent loudspeakers being approximately \(0.157\ \text {m}\). From this configuration, three irregular array setups were generated by randomly removing 12, 28, or 44 loudspeakers, resulting in three irregular configurations with \(L=48\), \(L=32\), and \(L=16\) secondary sources, respectively. The RIRs related to the reproduction zone were measured with the square microphone (DPA 4060) array configuration corresponding to Zone E in [67], consisting of 64 microphones sampling, with a spacing of \(0.04\ \text {m}\), a square of size \(0.28\ \text {m} \times 0.28\ \text {m}\) placed at the center of the area enclosed by the loudspeaker array. Both microphones and loudspeakers were placed at the same height of \(1.45\ \text {m}\) from the floor. A total of 16 control points inside the reproduction area were chosen by selecting the first (from left) and fifth columns of microphones on the listening area grid, so that the two columns are separated by approximately \(0.16\ \text {m}\) and microphones in the same column are spaced by \(0.04\ \text {m}\), corresponding to a spatial aliasing frequency of approximately \(1071\ \text {Hz}\). The control points were used to compute the losses of the \(\text {CNN}\) model and the driving signals through the \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\) techniques. The sampling frequency is \(F_s=48000\ \text {Hz}\) [67].
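The selection of the 16 control points from the \(8\times 8\) microphone grid can be sketched as follows (a sketch under our assumptions about the grid indexing; the grid origin is hypothetical):

```python
import numpy as np

# 8 x 8 microphone grid with 0.04 m spacing (Zone E square, 0.28 m x 0.28 m)
coords = 0.04 * np.arange(8)
mics = np.array([(x, y) for y in coords for x in coords])  # 64 microphones

# Control points: first and fifth columns of the grid (16 microphones),
# i.e., the columns at x = 0 and x = 4 * 0.04 = 0.16 m
cp_mask = np.isin(np.arange(len(mics)) % 8, [0, 4])
cp = mics[cp_mask]
```

The two selected columns are separated by 0.16 m, consistent with the spacing stated above.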
In order to generate the ground truth dataset of desired soundfields, we simulated through Pyroomacoustics [68] a total of 4264 point sources placed in an \(8\ \text {m} \times 8\ \text {m}\) grid surrounding the loudspeaker array. The sources were split into \(|\mathcal {S}_\text {train}|=1705\), \(|\mathcal {S}_\text {val}|=427\), and \(|\mathcal {S}_\text {test}|=2132\) to create the training, validation, and test sets, respectively. We considered sources emitting a signal with spectrum \(A(\omega _k)=1\) at \(K=63\) frequencies spaced by \(23\ \text {Hz}\), in the range between \(50\ \text {Hz}\) and \(1500\ \text {Hz}\). The image depicting the setup is available on the accompanying website^{Footnote 3}.
4.5.2 Results
In Fig. 11a, we show the real part of the ground truth sound pressure distribution for a point source placed in \(\textbf{r}=[3.76\ \text {m}, 1.14\ \text {m}, 0\ \text {m}]^T\) at \(f=1500\ \text {Hz}\). In Fig. 11b, c, d, e, and f, the real part of the sound pressure obtained through \(\text {MR}\), \(\text {CNN}\), \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\) is shown, respectively, when 32 loudspeakers are active. We can see that the \(\text {CNN}\) technique is the one that best reproduces the soundfield, closely followed by the \(\text {PM}\) method and then by the \(\text {AWFS}\) and \(\text {AMR}\) methods; \(\text {MR}\) seems to perform worst at generating the desired ground truth soundfield. Similar considerations can be drawn by inspecting the \(\text {NRE}\) obtained for the same scenario, shown in Fig. 12, where the \(\text {NRE}\) over the listening area \(\mathcal {A}\) is reported for \(\text {CNN}\) (Fig. 12a), \(\text {MR}\) (Fig. 12b), \(\text {PM}\) (Fig. 12c), \(\text {AWFS}\) (Fig. 12d), and \(\text {AMR}\) (Fig. 12e).
In Fig. 13a, b, and c, we present the \(\text {NRE}\) averaged over all \(\mathcal {S}_\text {test}\) sources, when considering irregular arrays of \(L=48\), 32, and 16 secondary sources, respectively. In the case of \(L=48\), the \(\text {CNN}\), \(\text {AMR}\), and \(\text {PM}\) performances are similar below \(700\ \text {Hz}\), while above this value \(\text {CNN}\) is the method that most reduces the mean \(\text {NRE}\) over the whole test set \(\mathcal {S}_\text {test}\). No major difference can be observed for \(L=32\). Finally, in the \(L=16\) scenario, \(\text {CNN}\) performance is on par with \(\text {AMR}\) below \(800\ \text {Hz}\), while for higher frequencies the error obtained with the latter strongly increases. Conversely, while \(\text {CNN}\) performance is on par with \(\text {AWFS}\) below \(600\)–\(700\ \text {Hz}\), the latter performs slightly better above \(800\ \text {Hz}\). The \(\text {MR}\) method performs worst in all cases, except above approximately \(1200\ \text {Hz}\) for \(L=48\) and \(L=32\), where it performs better than \(\text {PM}\).
We do not show the \(\text {SSIM}\) results because, being strongly dependent on the variance of the data, the metric is not representative of the reproduction quality in this specific case: the ground truth soundfields are simulated, while the RIRs used for reproduction are measured, causing the two sets of data to have significantly different distributions.
5 Conclusion
In this manuscript, we have proposed a technique for soundfield synthesis using irregular loudspeaker arrays. The methodology is based on a deep learning approach. More specifically, we consider the driving signals obtained through an existing soundfield synthesis method based on the plane wave decomposition, and propose a network that modifies the driving signals by compensating the errors in the reproduced soundfield due to the irregularity of the loudspeaker setup. We compare the proposed method with the one used to compute the input driving signals and with the pressure-matching approach, showing that the proposed model obtains better average performances in most of the setups.
The obtained results open the possibility of combining deep learning and model-based soundfield synthesis to address the issues arising when only irregular loudspeaker arrays are available. For example, a complex-valued CNN-based pressure matching technique can be devised, optimizing the driving signals from the knowledge of the soundfield at prescribed control points. Moreover, we plan to move to real environments, where multiple sources are active and noise and reverberation are present, aiming at compensating for the environment and masking the noise. We also plan to consider sources emitting more realistic signals, such as speech or music. In order to make the model more suited to real-world applications, we plan to make the system able to handle different loudspeaker arrangements without retraining and to systematically identify the effects of loudspeaker and control point arrangements on the model performances. Further developments could also entail the application of deep learning and irregular arrays to related problems, such as multizone soundfield reproduction for personal audio systems, as well as conditioning the system so as to be independent of the chosen array setup.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. The code used to perform the experiments is fully available at https://github.com/polimiispl/deep_learning_soundfield_synthesis_irregular_array.
References
A.J. Berkhout, D. de Vries, P. Vogel, Acoustic control by wave field synthesis. J. Acoust. Soc. Am. 93(5), 2764–2778 (1993)
S. Spors, R. Rabenstein, J. Ahrens, in 124th AES Convention. The theory of wave field synthesis revisited (Audio Engineering Society (AES), New York, 2008), pp. 17–20
M.A. Gerzon, Periphony: With-height sound reproduction. J. Audio Eng. Soc. 21(1), 2–10 (1973)
D.B. Ward, T.D. Abhayapala, Reproduction of a plane-wave sound field using an array of loudspeakers. IEEE Trans. Speech Audio Process. 9(6), 697–707 (2001)
M.A. Poletti, Three-dimensional surround sound systems based on spherical harmonics. J. Audio Eng. Soc. 53(11), 1004–1025 (2005)
M. Poletti, F. Fazi, P. Nelson, Sound-field reproduction systems using fixed-directivity loudspeakers. J. Acoust. Soc. Am. 127(6), 3590–3601 (2010)
M. Kentgens, A. Behler, P. Jax, in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Translation of a higher order ambisonics sound scene based on parametric decomposition (IEEE, Piscataway, 2020), pp. 151–155
J. Ahrens, S. Spors, Sound field reproduction using planar and linear arrays of loudspeakers. IEEE Trans. Audio Speech Lang. Process. 18(8), 2038–2050 (2010)
P. Chen, et al., in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 3D exterior soundfield reproduction using a planar loudspeaker array (IEEE, Piscataway, 2018), pp. 471–475
J. Trevino, T. Okamoto, Y. Iwaya, Y. Suzuki, High order Ambisonic decoding method for irregular loudspeaker arrays, in Proceedings of the 20th International Congress on Acoustics, pp. 23–27
F. Zotter, M. Frank, H. Pomberger, Comparison of energy-preserving and all-round ambisonic decoders. Fortschritte der Akustik, AIA-DAGA, Meran (2013)
T. Qu, Z. Huang, Y. Qiao, X. Wu, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Matching projection decoding method for ambisonics system (IEEE, Piscataway, 2018), pp. 561–565
Z. Ge, L. Li, T. Qu, Partially matching projection decoding method evaluation under different playback conditions. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 1411–1423 (2021)
F. Zotter, M. Frank, All-round ambisonic panning and decoding. J. Audio Eng. Soc. 60(10), 807–820 (2012)
P.A. Nelson, Active control of acoustic fields and the reproduction of sound. J. Sound Vib. 177(4), 447–477 (1994)
P.A. Gauthier, A. Berry, W. Woszczyk, Sound-field reproduction in-room using optimal control techniques: Simulations in the frequency domain. J. Acoust. Soc. Am. 117(2), 662–678 (2005)
P.N. Samarasinghe, M.A. Poletti, S.A. Salehin, T.D. Abhayapala, F.M. Fazi, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 3D soundfield reproduction using higher order loudspeakers (IEEE, Piscataway, 2013), pp. 306–310
T. Betlehem, T.D. Abhayapala, Theory and design of sound field reproduction in reverberant rooms. J. Acoust. Soc. Am. 117(4), 2100–2111 (2005)
N. Ueno, S. Koyama, H. Saruwatari, Three-dimensional sound field reproduction based on weighted mode-matching method. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 1852–1867 (2019)
N. Ueno, S. Koyama, H. Saruwatari, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Sound field reproduction with exterior cancellation using analytical weighting of harmonic coefficients (IEEE, Piscataway, 2018), pp. 466–470
H. Zuo, P.N. Samarasinghe, T.D. Abhayapala, Intensity based spatial soundfield reproduction using an irregular loudspeaker array. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1356–1369 (2020)
H. Zuo, T.D. Abhayapala, P.N. Samarasinghe, in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 3D multizone soundfield reproduction in a reverberant environment using intensity matching method (IEEE, Piscataway, 2021), pp. 416–420
M.J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M.A. Roch, S. Gannot, C.A. Deledalle, Machine learning in acoustics: Theory and applications. J. Acoust. Soc. Am. 146(5), 3590–3628 (2019)
M. Cobos, J. Ahrens, K. Kowalczyk, A. Politis, An overview of machine learning and other data-based methods for spatial audio capture, processing, and reproduction. EURASIP J. Audio Speech Music Process. 2022(1), 1–21 (2022)
F. Lluis, P. Martinez-Nuevo, M. Bo Møller, S. Ewan Shepstone, Sound field reconstruction in rooms: Inpainting meets super-resolution. J. Acoust. Soc. Am. 148(2), 649–659 (2020)
M.S. Kristoffersen, M.B. Møller, P. Martínez-Nuevo, J. Østergaard, Deep sound field reconstruction in real rooms: Introducing the ISOBEL sound field dataset (2021). arXiv preprint arXiv:2102.06455
P. Morgado et al., in Proceedings of the 32nd Int. Conf. on Neural Information Processing Systems. Self-supervised generation of spatial audio for 360\(^{\circ }\) video (Curran Associates Inc., New York, 2018), pp. 360–370
G. Routray, S. Basu, P. Baldev, R.M. Hegde, in EAA Spatial Audio Signal Processing Symposium. Deep-sound field analysis for upscaling ambisonic signals (2019), pp. 1–6
S. Gao, J. Lin, W. Xihong, T. Qu, Sparse DNN model for frequency expanding of higher order ambisonics encoding process. IEEE/ACM Trans. Audio Speech Lang. Process. (2022)
L. Zhang, X. Wang, R. Hu, D. Li, W. Tu, Estimation of spherical harmonic coefficients in sound field recording using feed-forward neural networks. Multimedia Tools Appl. 80(4), 6187–6202 (2021)
H. Chen, T. Abhayapala, in Proceedings of the 23rd International Congress on Acoustics: integrating 4th EAA Euroregio 2019: 9–13 September 2019 in Aachen, Germany. Spatial sound field reproduction using deep neural networks (2019). https://doi.org/10.18154/RWTHCONV239844
L. Comanducci, F. Antonacci, A. Sarti, in 2022 International Workshop on Acoustic Signal Enhancement (IWAENC). A deep learning-based pressure matching approach to soundfield synthesis (IEEE, Piscataway, 2022), pp. 1–5
X. Hong, B. Du, S. Yang, M. Lei, X. Zeng, End-to-end sound field reproduction based on deep learning. J. Acoust. Soc. Am. 153(5), 3055–3055 (2023)
S. Koyama, G. Chardon, L. Daudet, Optimizing source and sensor placement for sound field control: An overview. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 696–714 (2020)
C. Lee, H. Hasegawa, S. Gao, Complex-valued neural networks: A comprehensive survey. IEEE/CAA J. Autom. Sin. 9(8), 1406–1426 (2022)
J. Bassey, L. Qian, X. Li, A survey of complex-valued neural networks (2021). arXiv preprint arXiv:2101.12249
C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J.F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, C.J. Pal, in International Conference on Learning Representations. Deep complex networks (2018). https://openreview.net/forum?id=H1T2hmZAb
A. Hirose, Complex-valued neural networks (Springer Science & Business Media, Berlin/Heidelberg, 2012)
M. Yang, M.Q. Ma, D. Li, Y.H.H. Tsai, R. Salakhutdinov, in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Complex transformer: A framework for modeling complex-valued sequence (IEEE, 2020), pp. 4232–4236
H. Tsuzuki, M. Kugler, S. Kuroyanagi, A. Iwata, An approach for sound source localization by complex-valued neural network. IEICE Trans. Inf. Syst. E96.D(10), 2257–2265 (2013). https://doi.org/10.1587/transinf.E96.D.2257
Y.S. Lee, C.Y. Wang, S.F. Wang, J.C. Wang, C.H. Wu, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Fully complex deep neural network for phaseincorporating monaural source separation (IEEE,Â Piscataway, 2017), pp. 281â€“285
L. Bianchi, F. Antonacci, A. Sarti, S. Tubaro, Modelbased acoustic rendering based on plane wave decomposition. Appl. Acoust. 104, 127â€“134 (2016)
P.A. Gauthier, A. Berry, Adaptive wave field synthesis with independent radiation mode control for active sound field reproduction: Theory. J. Acoust. Soc. Am. 119(5), 2721â€“2737 (2006)
P.A. Gauthier, A. Berry, in Audio Engineering Society Convention 123. Adaptive wave field synthesis for sound field reproduction: Theory, experiments, and future perspectives (Audio Engineering Society,Â New York, 2007)
P.A. Gauthier, A. Berry, Adaptive wave field synthesis for broadband active sound field reproduction: Signal processing. J. Acoust. Soc. Am. 123(4), 2003â€“2016 (2008)
P.A. Gauthier, A. Berry, Adaptive wave field synthesis for active sound field reproduction: Experimental results. J. Acoust. Soc. Am. 123(4), 1991â€“2002 (2008)
E.G. Williams, Fourier acoustics: Sound radiation and nearfield acoustical holography (Academic press, Cambridge, 1999)
P.C. Hansen, Analysis of discrete illposed problems by means of the lcurve. SIAM Rev. 34(4), 561â€“580 (1992)
D.L. Colton, R. Kress, R. Kress, Inverse acoustic and electromagnetic scattering theory, vol. 93 (Springer, New York, 1998)
D.N. Zotkin, R. Duraiswami, N.A. Gumerov, Plane-wave decomposition of acoustical scenes via spherical and cylindrical microphone arrays. IEEE Trans. Audio Speech Lang. Process. 18(1), 2–16 (2009)
E.T. Whittaker, On the partial differential equations of mathematical physics. Math. Ann. 57(3), 333–355 (1903)
E. Verheijen, Sound field reproduction by wave field synthesis. Ph.D. dissertation, Delft University of Technology (1997)
P.A. Nelson, S.J. Elliott, Active control of sound (Academic Press, Cambridge, 1991)
Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521(7553), 436–444 (2015)
K. Simonyan, A. Zisserman, in International Conference on Learning Representations. Very deep convolutional networks for large-scale image recognition (2015)
K. SongGong, W. Wang, H. Chen, Acoustic source localization in the circular harmonic domain using deep learning architecture. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 2475–2491 (2022)
A. Pandey, D. Wang, in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Exploring deep complex networks for complex spectrogram enhancement (IEEE, Piscataway, 2019), pp. 6885–6889
Y. Kuroe, M. Yoshida, T. Mori, in Artificial Neural Networks and Neural Information Processing – ICANN/ICONIP 2003, Istanbul, Turkey, June 26–29, 2003, Proceedings. On activation functions for complex-valued neural networks – existence of energy functions (Springer, New York, 2003), pp. 985–992
K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of the IEEE International Conference on Computer Vision. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification (IEEE, 2015), pp. 1026–1034
K. He, X. Zhang, S. Ren, J. Sun, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Deep residual learning for image recognition (2016), pp. 770–778. https://doi.org/10.1109/CVPR.2016.90
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
J.A. Barrachina, Negu93/cvnn: Complex-valued neural networks (2022). https://doi.org/10.5281/zenodo.7303587
S. Koyama, K. Kimura, N. Ueno, in 2021 Immersive and 3D Audio: from Architecture to Automotive (I3DA). Sound field reproduction with weighted mode matching and infinite-dimensional harmonic analysis: An experimental evaluation (IEEE, Piscataway, 2021), pp. 1–6
H. Wierstorf, S. Spors, in Audio Engineering Society Convention 132. Sound field synthesis toolbox (Audio Engineering Society, 2012). https://github.com/sfstoolbox/sfspython/releases/tag/0.6.2
D.P. Kingma, J. Ba, in 3rd Intl. Conf. on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings. Adam: A method for stochastic optimization (2015). http://arxiv.org/abs/1412.6980
Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
S. Zhao, Q. Zhu, E. Cheng, I.S. Burnett, A room impulse response database for multizone sound field reproduction (L). J. Acoust. Soc. Am. 152(4), 2505–2512 (2022). https://doi.org/10.1121/10.0014958
R. Scheibler, E. Bezzam, I. Dokmanić, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pyroomacoustics: A Python package for audio room simulation and array processing algorithms (IEEE, Piscataway, 2018), pp. 351–355
Acknowledgements
Not applicable.
Funding
Not applicable.
Contributions
LC: conceptualization, code implementation, results computation, main writing. FA: conceptualization, writing, research oversee. AS: research oversee and manuscript review. All authors read and agreed to the submitted version of the manuscript.
Ethics declarations
Ethics approval and consent to participate
The authors approve and give their consent to participate.
Consent for publication
The authors give their consent for publication.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Comanducci, L., Antonacci, F. & Sarti, A. Synthesis of soundfields through irregular loudspeaker arrays based on convolutional neural networks. J AUDIO SPEECH MUSIC PROC. 2024, 17 (2024). https://doi.org/10.1186/s13636-024-00337-7
DOI: https://doi.org/10.1186/s13636-024-00337-7