  • Empirical Research
  • Open access

Synthesis of soundfields through irregular loudspeaker arrays based on convolutional neural networks

Abstract

Most soundfield synthesis approaches deal with extensive and regular loudspeaker arrays, which are often not suitable for home audio systems due to physical space constraints. In this article, we propose a deep learning-based technique for soundfield synthesis through more easily deployable irregular loudspeaker arrays, i.e., arrays where the spacing between loudspeakers is not constant. The input consists of the driving signals obtained through a plane wave decomposition-based technique. While these driving signals are able to correctly reproduce the soundfield with a regular array, their performance degrades when irregular setups are used. Through a complex-valued convolutional neural network (CNN), we modify the driving signals in order to compensate for the errors in the reproduction of the desired soundfield. Since no ground truth is available for the compensated driving signals, we train the model by computing the loss between the desired soundfield at a number of control points and the one obtained through the driving signals estimated by the network. The proposed model must be retrained for each irregular loudspeaker array configuration. Numerical results show better reproduction accuracy with respect to the plane wave decomposition-based technique, the pressure-matching approach, and linear optimizers for driving signal compensation.

1 Introduction

Soundfield synthesis methods deal with the objective of reproducing a desired pressure field in a target region of space through arrays made of loudspeakers. In recent years, the attention towards this field of research has consistently increased due to its potential application in virtual reality, telepresence, and gaming.

The first approaches towards soundfield synthesis dealt with extensive loudspeaker setups, driven in order to effectively reproduce an accurate approximation of the desired soundfield. Wave field synthesis (WFS) [1, 2] is based on the Huygens-Fresnel principle and synthesizes a desired pressure field through a large number of regularly distributed loudspeakers. Ambisonics [3] is based on the analysis of the soundfield in terms of spherical harmonics and reproduces the desired pressure field in a small listening area. In order to enlarge the area where reproduction is accurate, higher order ambisonics (HOA) was introduced [4, 5]. These physically based approaches reproduce the soundfield with a satisfying quality when regular array geometries are used, such as spherical [6, 7], linear [8], or circular [9]. However, their performance severely degrades when irregular setups are used. Several techniques were proposed in order to adapt HOA to irregular array setups [10, 11], such as projection decoding methods [12, 13] and all-round ambisonic panning and decoding (AllRAD) [14], but they often require the solution of ill-posed problems.

Optimization-based techniques are more easily applicable to irregular loudspeaker setups. The pressure-matching method [15, 16] is based on the minimization of the reproduction error at a fixed number of positions in the listening area, denoted as control points. The desired driving signals are then obtained through a regularized least squares optimization problem. While this approach is applicable to setups having extremely irregular geometries, the achievable reproduction quality strongly depends on the selection of the control points, i.e., on sampling the listening area with a sufficiently fine grid. Its computational cost, however, increases with the number of selected control points. Mode-matching [17,18,19] is another optimization-based family of techniques that can be applied to loudspeaker setups having arbitrary geometries. In this case, the optimization procedure is based on matching a modal decomposition of the desired soundfield around a single expansion center. The modal decomposition can be operated using circular or spherical wavefunctions and must be truncated to a maximum mode order, since choosing the order either too high or too low leads to worse synthesis quality [19]. Several approaches have been proposed to appropriately weight the modes [18, 20]. Irregular loudspeaker setups have also been considered by intensity-matching methods [21, 22], where the objective is the minimization of the sound intensity, i.e., particle velocity, in the spherical harmonic domain over a spatial region.

More recently, after its widespread adoption in acoustic signal processing research [23], deep learning has also been applied to soundfield synthesis problems [24] such as the reconstruction of the pressure field at unknown locations [25, 26]. In [27], the authors proposed a network that is able to convert mono audio recorded using a \(360^{\circ }\) video camera into first-order ambisonics (FOA). In [28], a network is proposed in order to upscale ambisonic signals, while in [29], a learning-based model for frequency expanding of the higher-order ambisonics (HOA) encoding process is presented. Also, in [30], the authors propose a technique for the estimation of spherical harmonic coefficients in soundfield recording, using feed-forward neural networks. Finally, in [31], the authors present a neural network that is able to calculate the optimal number of driving signals, extracted through a LASSO-based technique. In [32], a deep learning-based pressure-matching approach was presented, where a real-valued CNN extracts the driving signals from pressure measurements at control points; a very similar approach was subsequently followed in [33]. Learning techniques have also been applied to the problem of optimizing the number and placement of sensors in soundfield control scenarios [34].

Complex-valued neural networks [35,36,37,38,39] make it possible to process complex-valued data directly and have recently been applied to a variety of audio signal processing tasks such as source localization [40] and separation [41]. The adoption of such networks enables us to treat complex data directly, instead of handling the real and imaginary parts separately as in [26].

In this manuscript, we propose a technique for 2D soundfield synthesis through irregular loudspeaker setups in a free field environment, where the desired driving signals are obtained through a complex-valued convolutional neural network (CNN). Although the proposed method is easily extensible to 3D scenarios, this would involve dealing with 3D CNNs, which would increase the computational complexity without enhancing the conceptual reasoning behind the proposed method. For this reason, in this manuscript, we decided to focus on 2D deployments and to leave the 3D extension to future works.

Instead of deriving the driving signals from soundfield measurements, we obtain them through the model-based rendering (MR) method presented in [42], based on the plane wave decomposition. While this technique is able to correctly reproduce the soundfield when regular loudspeaker setups are used, irregularities in the reproduced wavefronts appear when the spacing between the loudspeakers becomes uneven.

Operatively, we generate irregular loudspeaker arrays by considering regular array setups and randomly removing a number of loudspeakers, simulating configurations where more than half of the loudspeakers are missing, thus paving the way to the use of minimal setups. Through [42], we compute the driving signals obtained using the irregular setup and feed them into a CNN, which outputs a compensated version of the driving signals. Differently from the approach proposed in [31], the loss is not based on the driving signals. Instead, we compute the loss between the ground truth soundfield and the one obtained through the compensated driving signals output by the network.

The main contribution of this paper is thus to provide, to the best of our knowledge, a first application of deep learning to soundfield synthesis with irregular loudspeaker setups. Such configurations are highly desirable in real-world application scenarios, since they are more easily deployable in contexts such as home audio. The choice of removing loudspeakers from regular circular and linear setups also goes in this direction: for example, a fully regular circular loudspeaker array could hardly be deployed in a living room due to the presence of furniture, while the proposed irregular setups could accommodate these situations by removing loudspeakers wherever needed.

In the literature, linear optimizers for loudspeaker driving functions have already been proposed, such as adaptive wave field synthesis (AWFS) [43,44,45,46], where the reproduction error is minimized in a least-mean-squares sense. In order to demonstrate the effectiveness of the proposed technique, we compare it with AWFS, pressure matching (PM), and a linearly compensated MR, using both simulated and real data.

The rest of this manuscript is organized as follows. In Section 2, we introduce the notation and present the necessary background related to the \(\text {MR}\) and \(\text {PM}\) techniques. In Section 3, we describe the proposed technique for soundfield synthesis using irregular loudspeaker arrays. In Section 4, we present simulation results both when considering a circular and linear loudspeaker array. Finally, in Section 5, we draw some conclusions.

2 Notation and review of pressure-matching, model-based soundfield synthesis, and adaptive wave field synthesis

In this section, we briefly review three soundfield synthesis techniques related to the proposed approach, and we introduce the notation that will be used throughout the rest of the paper. We first introduce the pressure-matching technique and then the model-based soundfield synthesis method, which is used to derive the loudspeaker driving signals that will then be compensated through the proposed method. Finally, we present the adaptive wave field synthesis technique, which optimizes the WFS driving signals through a linear procedure and will be used as a baseline for comparison with the proposed approach.

2.1 Notation and preliminaries

Let us consider an arrangement of L omnidirectional loudspeakers, or secondary sources, as they are often denoted in the soundfield synthesis literature, deployed at positions \(\textbf{r}_l \in \mathbb {R}^2, l=1,\ldots , L\). Let us also consider a set of A points \(\textbf{r}_a \in \mathbb {R}^2, a=1, \ldots , A\) through which we sample the region of space \(\mathcal {A}\), denoted as listening area, where we want to reproduce the soundfield. Let \(\textbf{d}(\omega ) =[d_1(\omega ), \ldots , d_L(\omega )]^T\) denote the vector containing the driving signals applied to the secondary sources, where \(\omega \in \mathbb {R}\) is the angular frequency and the superscript T denotes transposition. If \(g(\textbf{r}_{a}|\textbf{r}_l, \omega )\) is the acoustic transfer function (ATF) between secondary source l and point a, the vector \(\textbf{g}_a = [g(\textbf{r}_{a}|\textbf{r}_1, \omega ),\ldots ,g(\textbf{r}_{a}|\textbf{r}_L, \omega )]^T\) is the juxtaposition of all the ATFs from the secondary sources to the listening point a. The synthesized sound pressure can be computed as

$$\begin{aligned} \hat{\textbf{p}}(\textbf{r}_a,\omega )= \textbf{d}^T(\omega ) \textbf{g}_a(\omega )=\sum \limits _{l=1}^{L} d_l(\omega )g(\textbf{r}_{a}|\textbf{r}_l, \omega ), \end{aligned}$$
(1)

where, in the case of 2D propagation in free-space conditions and using the \(e^{j\omega t}\) convention for the Fourier transform, \(g(\cdot )\) corresponds to the Green’s function [47]

$$\begin{aligned} g(\textbf{r}_{a}|\textbf{r}_l, \omega )= - \frac{j}{4} H_{0}^{(2)} \left( \frac{\omega }{c}\left\| \textbf{r}_a-\textbf{r}_l\right\| \right) , \end{aligned}$$
(2)

where \(H_0^{(2)}\) is the Hankel function of the second kind and order zero, while c is the speed of sound in air.
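As a numerical illustration, the Green's function in (2) can be evaluated directly with SciPy's Hankel function. This is a minimal sketch; the value c = 343 m/s for the speed of sound is an assumption.

```python
import numpy as np
from scipy.special import hankel2

C = 343.0  # assumed speed of sound in air [m/s]

def green_2d(r_a, r_l, omega):
    """2D free-field Green's function of (2): -(j/4) H0^(2)((omega/c) ||r_a - r_l||)."""
    dist = np.linalg.norm(np.asarray(r_a, dtype=float) - np.asarray(r_l, dtype=float))
    return -0.25j * hankel2(0, omega / C * dist)
```

Note that the function is symmetric in its two spatial arguments (acoustic reciprocity), which provides a quick sanity check of an implementation.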

The objective of soundfield synthesis techniques can then be defined as retrieving the set of driving signals \(\textbf{d}\) such that

$$\begin{aligned} \underset{\textbf{d}}{\text {arg}\,\text {min}} \sum \limits _{a=1}^{A} |\textbf{p}(\textbf{r}_a,\omega )-\hat{\textbf{p}}(\textbf{r}_a,\omega )|^2, \end{aligned}$$
(3)

that is, minimizing the error between the reproduced and desired pressure field at the points contained in the listening area. The method through which the driving signals are estimated is what differentiates the various soundfield synthesis techniques.

2.2 Pressure-matching method

The pressure-matching technique, formulated as in [15], is a method for the synthesis of soundfields based on the minimization of the reproduction error at discrete points in the environment, denoted as control points.

Let us consider a series of control points \(\textbf{r}_i, i=1,\ldots , I\) such that \(\textbf{r}_i \in \mathcal {A}\). In the following, the subscript \(\text {cp}\) will indicate that the related term refers only to values measured at the control points. The driving signals to be applied to the secondary sources are obtained by solving the minimization problem

$$\begin{aligned} \textbf{d}_{\text {pm}}(\omega )= \arg \min _{\textbf{d}_{\text {pm}}}&\sum \limits _{i=1}^{I}\left| \hat{p}_{\text {pm}}(\textbf{r}_{i},\omega )-p(\textbf{r}_{i},\omega )\right| ^2 +\nonumber \\&\lambda \textbf{d}_{\text {pm}}^H(\omega )\textbf{d}_{\text {pm}}(\omega ), \end{aligned}$$
(4)

where \(\lambda\) is a regularization parameter, which may be determined either through techniques such as the L-curve [48] or, more often, by inspecting the singular values of the propagation matrices [19], and H denotes the Hermitian transpose. The solution of (4) is given by

$$\begin{aligned} \textbf{d}_{\text {pm}}(\omega ) = \left( \mathbf {G_{\text {cp}}}^H(\omega )\textbf{G}_{\text {cp}}(\omega ) + \lambda \textbf{I}_L\right) ^{-1} \textbf{G}_{\text {cp}}^H(\omega ) \textbf{p}_{\text {cp}}(\omega ), \end{aligned}$$
(5)

where the entries of \(\textbf{G}_{\text {cp}}(\omega ) \in \mathbb {C}^{I \times L}\), corresponding to the transfer functions between secondary sources \(\textbf{r}_l\) and control points \(\textbf{r}_i\), are defined as

$$\begin{aligned} (\textbf{G}_{\text {cp}}(\omega ))_{i,l} = g(\textbf{r}_i|\textbf{r}_l,\omega ), \end{aligned}$$
(6)

and \(\textbf{p}_{\text {cp}} \in \mathbb {C}^{I}\) is a vector corresponding to the ground truth pressure soundfield evaluated at the control points, i.e., \(\textbf{p}_{\text {cp}}(\omega )=[p(\textbf{r}_{1},\omega ), \ldots , p(\textbf{r}_{I},\omega )]^T\).

While the inversion of a matrix may be computationally expensive, if we consider a single set of secondary sources (i.e., a single loudspeaker array), the pressure-matching technique can be implemented with a more convenient linear computational cost \(\mathcal {O}(IL)\) by pre-computing

$$\begin{aligned} \textbf{C}_{\text {cp}}(\omega ) = \left( \textbf{G}_{\text {cp}}^H(\omega )\textbf{G}_{\text {cp}}(\omega ) + \lambda \textbf{I}_L\right) ^{-1} \textbf{G}_{\text {cp}}^H(\omega ), \end{aligned}$$
(7)

where \(\textbf{C}_{\text {cp}}(\omega ) \in \mathbb {C}^{L\times I}\) is independent of the soundfield. The filters can then be calculated by rewriting (5) as

$$\begin{aligned} \textbf{d}_{\text {pm}}(\omega ) = \textbf{C}_{\text {cp}}(\omega )\textbf{p}_{\text {cp}}(\omega ). \end{aligned}$$
(8)
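For reference, (5), (7), and (8) translate into a few lines of NumPy. This is a sketch under the assumption of a generic complex ATF matrix \(\textbf{G}_{\text {cp}}\) and regularization \(\lambda\); it is not tied to any particular array geometry.

```python
import numpy as np

def pm_precompute(G_cp, lam):
    """Pre-compute C_cp of (7) for a fixed array, so that each new target field
    only costs the matrix-vector product of (8)."""
    L = G_cp.shape[1]
    return np.linalg.solve(G_cp.conj().T @ G_cp + lam * np.eye(L), G_cp.conj().T)

def pm_driving_signals(C_cp, p_cp):
    """Driving signals of (8): d_pm = C_cp p_cp."""
    return C_cp @ p_cp
```

For vanishing \(\lambda\), the solution satisfies the least-squares normal equations, i.e., the residual at the control points is orthogonal to the columns of \(\textbf{G}_{\text {cp}}\).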

2.3 Model-based acoustic rendering based on plane wave decomposition

The model-based acoustic rendering (\(\text {MR}\)) [42] technique is based on the decomposition of the soundfield into directional contributions encoded by the Herglotz density function [49], which can be converted into driving signals for arbitrary loudspeaker arrangements along a planar curve.

We first summarize how the Herglotz density function is defined in the case of a point source and then how it has been used in [42] to render the soundfield through circular and linear loudspeaker arrays.

2.3.1 Herglotz density function

Let us define \(\hat{\textbf{k}}(\theta )=[\cos {\theta }\ \sin {\theta }]^T\) as the unit vector corresponding to a plane wave propagating with direction \(\theta\); then we can write the corresponding wave vector as \(\textbf{k}(\theta ) = \hat{\textbf{k}}(\theta ) \frac{\omega }{c}\).

The pressure soundfield at a point \(\textbf{r} = [x,y]^T\) can be modeled as a superposition of plane waves [50, 51]

$$\begin{aligned} p(\textbf{r},\omega ) = \frac{1}{2\pi }\int _{0}^{2\pi } e^{j \frac{\omega }{c}(x\cos \theta + y\sin \theta )}\varphi (\theta ,\omega )d\theta , \end{aligned}$$
(9)

where \(\varphi (\theta ,\omega ) \in \mathbb {C}\) is the Herglotz density function, which modulates each plane wave component in amplitude and phase [49]. In the case of an isotropic point source in \(\textbf{r}'=\rho '[\cos (\theta '),\sin (\theta ')]^T\), expressed in terms of polar coordinates \(\rho '\) and \(\theta '\), corresponding to radius and azimuth, respectively, \(\varphi (\theta ,\omega )\) can be defined as [42]

$$\begin{aligned} \varphi (\theta ,\omega ) = A(\omega )\sum \limits _{m=-\infty }^{+\infty }j^{-m}\frac{j}{4}H_{m}^{(2)}(\frac{\omega }{c}\rho ')e^{jm(\theta -\theta ')}, \end{aligned}$$
(10)

where \(A(\omega )\) is the spectrum of the sound emitted by the source.
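A direct numerical evaluation of (10) requires truncating the modal sum; the following sketch assumes a truncation order M (discussed in Section 2.3.2) and c = 343 m/s.

```python
import numpy as np
from scipy.special import hankel2

C = 343.0  # assumed speed of sound in air [m/s]

def herglotz_point_source(theta, omega, rho_s, theta_s, M, A=1.0):
    """Herglotz density of (10) for a point source at polar coordinates (rho_s, theta_s),
    with the modal sum truncated to orders m = -M, ..., M."""
    m = np.arange(-M, M + 1)
    terms = (1j ** (-m)) * 0.25j * hankel2(m, omega / C * rho_s) \
        * np.exp(1j * m * (theta - theta_s))
    return A * np.sum(terms)
```

As (10) suggests, the density depends on the directions only through the difference \(\theta - \theta '\), a rotational invariance that can serve as a correctness check.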

2.3.2 Implementation with circular arrays

Let us consider a circular array of secondary sources deployed at positions \(\textbf{r}_l = \rho _l[\cos {\theta _l}\ \sin {\theta _l}]^T\), where \(\rho _l\) is the radius. Let us also consider a discrete distribution of \(N(\omega )\) plane waves (denoted simply N in the following) with directions \(\theta _n, n=1,\ldots ,N\), uniformly sampling the \([0, 2\pi )\) interval, where each plane wave is reproduced by the same L loudspeakers, in order to approximate the desired soundfield. We take advantage of the discrete plane wave distribution in order to reproduce the soundfield by approximating it as [42]

$$\begin{aligned} \hat{p}(\textbf{r},\omega ) = \frac{1}{N} \sum \limits _{n=1}^{N} \varphi (\theta _n,\omega ) e^{j \frac{\omega }{c}<\textbf{r},\hat{\textbf{k}}(\theta _n)>}, \end{aligned}$$
(11)

where \(<\cdot ,\cdot>\) denotes the standard inner product in \(\mathbb {R}^2\).

The sum in (10) is approximated by truncating the modal expansion to order M, i.e., \(m=-M,\ldots ,M\), where M can be chosen in order to bound the reproduction error in a listening area of radius \(\rho\) by selecting \(M \ge \lceil e \frac{\omega }{c} \frac{\rho }{2} \rceil\) [50]. Then, according to the sampling theorem, we can correctly reproduce the soundfield without additional errors, except for those due to the discretization, by using \(N \ge 2M+1\) plane waves.
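The two bounds above can be sketched in code (c = 343 m/s is an assumption):

```python
import numpy as np

C = 343.0  # assumed speed of sound in air [m/s]

def modal_order(omega, rho):
    """Truncation order M >= ceil(e * (omega/c) * rho / 2) bounding the
    reproduction error in a listening area of radius rho."""
    return int(np.ceil(np.e * (omega / C) * rho / 2))

def min_plane_waves(M):
    """Minimum number of plane waves N >= 2M + 1 avoiding additional
    discretization errors."""
    return 2 * M + 1
```

For instance, at f = 1 kHz and \(\rho = 1\) m, these bounds give M = 25 and N = 51.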

The filter corresponding to the l-th loudspeaker and the n-th plane-wave component can then be defined as [42]

$$\begin{aligned} h_l(\theta _n, \omega ) = \frac{4}{jL} \sum \limits _{m=-M}^{M} \frac{e^{jm(\theta _l-\theta _n)}}{H_{m}^{(2)}(\frac{\omega }{c} \rho _l)}. \end{aligned}$$
(12)

The driving signal corresponding to the secondary source l rendering all the N plane-wave components is [42]

$$\begin{aligned} d_{\text {mr},l}(\omega ) =\frac{1}{N}\sum \limits _{n=1}^{N}\varphi (\theta _n,\omega )h_l(\theta _n, \omega ). \end{aligned}$$
(13)

Finally, the soundfield at \(\textbf{r}_a\) is

$$\begin{aligned} \hat{p}_{\text {mr}}(\textbf{r}_a,\omega ) = \sum \limits _{l=1}^{L}d_{\text {mr},l}( \omega ) g(\textbf{r}_a|\textbf{r}_l,\omega ). \end{aligned}$$
(14)
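Equations (12) and (13) can be sketched in NumPy as follows; the array geometry, the truncation order M, and c = 343 m/s are assumptions of the example.

```python
import numpy as np
from scipy.special import hankel2

C = 343.0  # assumed speed of sound in air [m/s]

def mr_circular_driving_signals(theta_l, rho, theta_n, phi_n, omega, M):
    """Driving signals of (13) for a circular array of radius rho.
    theta_l: (L,) loudspeaker azimuths; theta_n: (N,) plane-wave directions;
    phi_n: (N,) samples of the Herglotz density."""
    L = len(theta_l)
    m = np.arange(-M, M + 1)
    # h[l, n] of (12): (4 / (jL)) sum_m exp(jm(theta_l - theta_n)) / H_m^(2)((omega/c) rho)
    num = np.exp(1j * m[None, None, :] * (theta_l[:, None, None] - theta_n[None, :, None]))
    h = (4 / (1j * L)) * np.sum(num / hankel2(m, omega / C * rho), axis=-1)
    # d_l of (13): average over the N filtered plane-wave components
    return h @ phi_n / len(theta_n)
```

The resulting vector of L driving signals can then be propagated to any point through (14).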

2.3.3 Implementation with linear arrays

Let us now consider an array of secondary sources deployed on a line segment such that \(\textbf{r}_l=[x_0,\ y_l]^T\), with \(-y_0\le y_l \le y_0\). In this case, the allowed plane wave directions belong to a subset of \([0, 2\pi )\); specifically, the allowed range is \(\theta \in [\theta _\text {min}, \theta _\text {max}]\), where \(\theta _\text {min}=\arctan (-y_0,x_0)\) and \(\theta _\text {max}=\arctan (y_0,x_0)\). This angular interval is sampled using N components. The limitation is due to the geometrical constraints posed by the configuration of the array and of the listening region: reproduction is performed towards the half-plane given by \(x < x_0\) [8], and the linear array is not able to accommodate all the plane wave directions surrounding the listening region, as in the circular array case. Since no closed-form solutions are known for arrays that are not circular [42], the filters \(\textbf{h}(\theta _n,\omega ) =[h_1(\theta _n, \omega ),\ldots, h_L(\theta _n, \omega )]^T\) to be applied to the loudspeaker signals are estimated by minimizing the error due to the approximation of the plane wave soundfield through the secondary sources, that is [42]

$$\begin{aligned} \textbf{h}(\theta _n,\omega ) = \underset{\textbf{h}}{\text {arg}\,\text {min}}&\sum \limits _{i=1}^{I} \left| e^{j\frac{\omega }{c}<\textbf{r}_i,\hat{\textbf{k}}(\theta _n)>}- \textbf{h}^T(\theta _n, \omega )\textbf{g}_i(\omega )\right| ^2 \nonumber \\&+ \lambda \textbf{h}^H(\theta _n,\omega )\textbf{h}(\theta _n,\omega ), \end{aligned}$$
(15)

which yields [42]

$$\begin{aligned} \textbf{h}(\theta _n, \omega ) = \left( \textbf{G}_{\text {cp}}^{H}(\omega )\textbf{G}_{\text {cp}}(\omega ) + \lambda \textbf{I}_L\right) ^{-1} \textbf{G}_{\text {cp}}^{H}(\omega )\textbf{p}_{\text {cp},\text {pwd}}(\theta _n,\omega ), \end{aligned}$$
(16)

where \(\textbf{p}_{\text {cp},\text {pwd}}(\theta _n,\omega )=[e^{j\frac{\omega }{c}<\textbf{r}_1,\hat{\textbf{k}}(\theta _n)>},\ldots , e^{j\frac{\omega }{c}<\textbf{r}_I,\hat{\textbf{k}}(\theta _n)>}]^T\) is a vector containing the pressure soundfield at the control points due to a plane wave with direction \(\theta _n\).

We can then derive the driving signals in the case of the linear array as [42]

$$\begin{aligned} d_{\text {mr},l}(\omega ) = \frac{\theta _\text {max}-\theta _\text {min}}{2\pi N}\sum \limits _{n=1}^{N}\varphi (\theta _n, \omega )h_l(\theta _n, \omega ), \end{aligned}$$
(17)

and then the desired soundfield can be obtained by inserting the derived driving signals into (14).
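The per-direction filters of (16) amount to the same regularized least-squares machinery used for pressure matching; a minimal sketch follows, where the control-point geometry, \(\lambda\), and c = 343 m/s are assumptions.

```python
import numpy as np

def linear_array_filters(G_cp, r_cp, theta_n, omega, c=343.0, lam=1e-3):
    """Filters of (16): regularized LS fit of the plane wave with direction
    theta_n at the control points. G_cp: (I, L) complex ATF matrix;
    r_cp: (I, 2) control-point coordinates."""
    k_hat = np.array([np.cos(theta_n), np.sin(theta_n)])
    p_pwd = np.exp(1j * omega / c * r_cp @ k_hat)  # target plane wave at the control points
    L = G_cp.shape[1]
    return np.linalg.solve(G_cp.conj().T @ G_cp + lam * np.eye(L), G_cp.conj().T @ p_pwd)
```

For a well-conditioned ATF matrix and small \(\lambda\), the filtered array output reproduces the target plane wave at the control points almost exactly.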

2.4 Adaptive wave field synthesis

Wave field synthesis (WFS) [1] is a soundfield reproduction technique which assumes free-field reproduction and whose driving signals are derived from the Kirchhoff-Helmholtz integral theorem.

Let us consider a 2D free-field environment. The WFS driving signals needed to reproduce a source placed in \(\textbf{r}_s\) can be derived as [43]

$$\begin{aligned} d_\text {WFS}(\textbf{r}_l,\omega )&= \frac{4\pi }{\omega \rho } A(\omega )j\sqrt{\frac{jk}{2\pi }}\cos \Psi \frac{e^{jk||\textbf{r}_s - \textbf{r}_l||}}{\sqrt{||\textbf{r}_s -\textbf{r}_l||}}\nonumber \\&\quad \times \sqrt{\frac{||\textbf{r}_o - \textbf{r}_l||}{||\textbf{r}_o -\textbf{r}_l||+||\textbf{r}_s -\textbf{r}_l||}} \Delta _l, \end{aligned}$$
(18)

where \(\rho\) denotes the air density, \(\Psi\) the angle between \(\textbf{r}_s\) and the normal to the reproduction line (i.e., contour comprising the loudspeaker array) at the secondary source \(\textbf{r}_l\), \(\textbf{r}_o\) denotes a point on the reference line, along which the amplitude error should theoretically be zero [52], and finally, \(\Delta _l = ||\textbf{r}_l - \textbf{r}_{l+1}||\) denotes the spacing between consecutive loudspeakers.

In order to address the reproduction inaccuracies due to the WFS free-field assumption, a compensation technique for the WFS driving signals, denoted adaptive wave field synthesis (\(\text {AWFS}\)), was proposed in [43]. Let us denote with \(\textbf{p}_\text {cp,wfs}(\omega )\) the soundfield obtained at the control points through the WFS driving signals and with \(\textbf{e}_\text {cp}(\omega ) = \textbf{p}_\text {cp}(\omega ) - \textbf{p}_\text {cp,wfs}(\omega )\) the corresponding reproduction error; the driving signals \(\textbf{d}_\text {awfs}(\omega ) \in \mathbb {C}^L\) are then obtained in \(\text {AWFS}\) by solving the following minimization problem [43]

$$\begin{aligned} \underset{\textbf{d}_\text {awfs}}{\text {arg}\,\text {min}}&\textbf{e}(\omega )^{H}\textbf{e}(\omega ) +\nonumber \\&\lambda (\textbf{d}_\text {awfs}(\omega )-\textbf{d}_\text {wfs}(\omega ))^{H} (\textbf{d}_\text {awfs}(\omega )-\textbf{d}_\text {wfs}(\omega )), \end{aligned}$$
(19)

where \(\textbf{e}(\omega ) = \textbf{p}_\text {cp}(\omega )-\hat{\textbf{p}}_{\text {cp}, \text {awfs}}(\omega )\) is the difference between the ground truth and estimated complex soundfields, and \(\lambda\) is a regularization parameter.

The adapted wave-field synthesis driving signals that minimize the cost function are then found through [43, 53]

$$\begin{aligned} \textbf{d}_\text {awfs}=[\textbf{G}_{\text {cp}}^{H}\textbf{G}_{\text {cp}} + \lambda \textbf{I}]^{-1} [\textbf{G}_{\text {cp}}^{H}\textbf{p}_\text {cp}(\omega ) + \lambda \textbf{d}_\text {wfs}], \end{aligned}$$
(20)

where the solution is equivalent to the WFS one for \(\lambda \rightarrow \infty\) and to the optimal solution in a least-mean-square sense for \(\lambda \rightarrow 0\).
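The closed-form solution (20), together with its two limiting behaviors, can be sketched as follows (generic complex matrices are assumed):

```python
import numpy as np

def awfs_driving_signals(G_cp, p_cp, d_wfs, lam):
    """AWFS driving signals of (20): a regularized trade-off between the
    least-squares optimum (lam -> 0) and the plain WFS solution (lam -> inf)."""
    L = G_cp.shape[1]
    lhs = G_cp.conj().T @ G_cp + lam * np.eye(L)
    return np.linalg.solve(lhs, G_cp.conj().T @ p_cp + lam * d_wfs)
```

A large \(\lambda\) returns (numerically) the WFS driving signals, while a vanishing \(\lambda\) satisfies the least-squares normal equations at the control points, mirroring the two limits stated above.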

3 Driving signals compensation through complex-valued convolutional neural networks

In this section, we present the proposed technique for soundfield synthesis through complex-valued CNNs using irregular loudspeaker arrays. We first formalize the problem as the compensation of the filters obtained through the MR technique; then, we describe the general pipeline of the method and the proposed network architecture.

3.1 Problem formulation

Let us consider a circular or linear array of secondary sources as shown in Fig. 1a and c, respectively. An irregular loudspeaker array setup is obtained by removing some secondary sources from the setup, as shown in Fig. 1b and d. More formally, we can define an irregular loudspeaker array as an array where the spacing between the secondary sources is not constant.

Fig. 1

Examples of regular circular (a) and linear (c) array setups, examples of irregular circular (b) and linear (d) array setups

Given the MR soundfield synthesis technique presented in Section 2.3, it is possible to obtain driving signals enabling a correct reproduction of the soundfield, as shown for a circular array in Fig. 2a. However, if we remove secondary sources without taking any countermeasure, the quality of the reproduced soundfield degrades considerably, as shown in Fig. 2b. Let us consider the driving signals obtained through the MR technique, using either a linear or circular array, and let us stack them into a matrix \(\textbf{D}_{\text {mr}} \in \mathbb {C}^{L \times K}\), where K is the number of considered frequencies, as follows

$$\begin{aligned} \textbf{D}_{\text {mr}}= \left[ \begin{array}{cccc} d_{\text {mr},1}(\omega _1) &{} d_{\text {mr},1}(\omega _2) &{} \dots &{} d_{\text {mr},1}(\omega _K) \\ d_{\text {mr},2}(\omega _1) &{} d_{\text {mr},2}(\omega _2) &{} \dots &{} d_{\text {mr},2}(\omega _K) \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ d_{\text {mr},L}(\omega _1) &{} d_{\text {mr},L}(\omega _2) &{} \dots &{} d_{\text {mr},L}(\omega _K) \\ \end{array}\right] , \end{aligned}$$
(21)

where \(\omega _k, k=1,\ldots ,K\) are the discrete angular frequencies. We can then define the objective of the proposed method as retrieving the function \(\mathcal {U}(\cdot )\) such that

$$\begin{aligned} \textbf{D}_{\text {cnn}} = \mathcal {U}(\textbf{D}_\text {mr}), \end{aligned}$$
(22)

where the driving signal matrix \(\textbf{D}_{\text {cnn}} \in \mathbb {C}^{L \times K}\) is the compensated version of \(\textbf{D}_{\text {mr}}\), obtained by solving the following optimization problem

$$\begin{aligned} \textbf{D}_{\text {cnn}}= \underset{\textbf{D}_{\text {cnn}}}{\text {arg}\,\text {min}} \sum \limits _{i=1}^{I}\sum \limits _{k=1}^{K}|p(\textbf{r}_i, \omega _k)-\sum \limits _{l=1}^{L}D_{\text {cnn},lk}g(\textbf{r}_i|\textbf{r}_l, \omega _k)|^2, \end{aligned}$$
(23)

that is, the minimization of the reproduction error at the control points \(\textbf{r}_i,\ i=1,\ldots ,I\).

Fig. 2

Real part of the complex amplitude of the soundfield for a source placed in \(\textbf{r}=[-1.2\ \text {m}, 0.96\ \text {m}, 0\ \text {m}]\) at \(f=1007\ \text {Hz}\) obtained using PWD through a regular (a) and irregular (b) array of secondary sources. Black loudspeakers represent the geometry of the chosen array

3.2 Pipeline

The pipeline of the proposed method is depicted in Fig. 3.

Fig. 3

Schematic representation of the training procedure. Note that, for simplicity, the images of \(\textbf{p}_{\text {cnn}}\) and \(\textbf{p}\) correspond only to the real part of the complex amplitude of the pressure soundfield obtained at a frequency \(f=562\ \text {Hz}\) and due to a source positioned in \(\textbf{r}=[-0.61\ \text {m}, 1.42\ \text {m}]^T\)

In order to train the network, we consider a set of simulated data. More specifically, we consider a set of point sources positioned at locations \(\textbf{r}_{s}\) outside the listening region. For each source, we compute the corresponding driving signal matrix \(\textbf{D}_{\text {mr}}\) and, by applying (2), the corresponding ground truth pressure soundfield at control points \(\textbf{p}_{\text {cp}}\).

The matrix \(\textbf{D}_{\text {mr}}\) is fed as input to the network \(\mathcal {U}(\cdot )\), whose output is the matrix containing the compensated filters \(\textbf{D}_{\text {cnn}}\).

The prediction of the soundfield due to the source in \(\textbf{r}_{s}\) at the selected control points \(\textbf{r}_{i}\) at frequency \(\omega _k\) is given by the frequency-domain product between the estimated filters and the point-to-point Green’s functions, i.e.,

$$\begin{aligned} {p}_{\text {cnn},\text {cp},i}(\omega _k) = \sum \limits _{l=1}^{L}d_{\text {cnn},l}(\omega _k)g(\textbf{r}_i|\textbf{r}_l, \omega _k). \end{aligned}$$
(24)

The parameters of the network \(\mathcal {U}(\cdot )\) are optimized through the loss function

$$\begin{aligned} \mathcal {L}(\textbf{p}_{\text {cnn},\text {cp}},\textbf{p}_{\text {cp}}) = \sum \limits _{k=1}^{K} ||\textbf{p}_{\text {cp}}(\omega _k)-\textbf{p}_{\text {cnn},\text {cp}}(\omega _k)||_1, \end{aligned}$$
(25)

where \(||\cdot ||_1\) denotes the L1-norm. The loss in (25) is defined for a single source in \(\textbf{r}_s\); in practice, however, it is computed over a batch of sources. The batch index is omitted here for the sake of compactness.
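For clarity, the forward pass from the compensated driving signals to the loss of (25) can be sketched in NumPy (the actual training of course runs inside a complex-valued CNN framework; the tensor shapes here are assumptions):

```python
import numpy as np

def soundfield_loss(D_cnn, G, p_gt):
    """Loss of (25). D_cnn: (L, K) compensated driving signals;
    G: (I, L, K) Green's functions to the control points; p_gt: (I, K) ground truth."""
    p_cnn = np.einsum('ilk,lk->ik', G, D_cnn)  # (24) for every control point and frequency
    return np.sum(np.abs(p_gt - p_cnn))        # sum over frequencies of the L1 norms
```

The loss vanishes exactly when the propagated driving signals reproduce the ground-truth field at all control points and frequencies.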

3.3 Network architecture

In order to estimate the compensated driving signals from the ones obtained using the \(\text {MR}\) method with an irregular loudspeaker array, we make use of a complex-valued 2D convolutional architecture denoted as \(\mathcal {U}(\cdot )\). Since the main novelty of this manuscript lies in the application of complex-valued deep learning to soundfield synthesis using irregular loudspeaker arrays, and not in the deep learning architecture itself, we designed the network by selecting standard design choices from the literature and adapting them to the particular scenario considered.

The network takes as input \(\textbf{D}_\text {mr}\) and outputs the matrix \(\textbf{D}_\text {cnn}\). Concerning the size of the input tensor, the proposed architecture is designed to work with an odd size along the axis corresponding to the number of frequencies K and with a power of two along the axis corresponding to the number of loudspeakers L; only minor adjustments would be needed in order to adapt it to different scenarios. It is important to note that the network must be retrained from scratch in order to use a different frequency axis.

The proposed network is composed of the following layers:

  i) A complex convolutional layer, with 128 filters, which outputs a \((L/2-1)\times (K-1)/2\times 128\) feature map.

  ii) A complex convolutional layer, with 256 filters, which outputs a \((L/4-1) \times (K-3)/4\times 256\) feature map.

  iii) A complex convolutional layer, with 512 filters, which outputs a \((L/8-1)\times (K-7)/8\times 512\) feature map.

  iv) A transposed complex convolutional layer, with 256 filters, which outputs a \((L/4-1) \times (K-3)/4 \times 256\) feature map.

  v) A transposed complex convolutional layer, with 128 filters, which outputs a \((L/2-1)\times (K-1)/2 \times 128\) feature map.

  vi) A transposed complex convolutional layer, with 128 filters, which outputs a \(L\times K\times 128\) feature map.

  vii) A transposed complex convolutional layer, with 1 filter, which outputs a \(L\times K\times 1\) feature map.

The chosen network architecture processes the input by progressively compressing it along the width and height axes while increasing the number of filters (i.e., channels), since this procedure helps in hierarchically learning higher-level features [54] at different scales. The chosen numbers of filters are similar to those commonly used in the literature, such as in VGG16 [55]. Since the proposed model compensates the input driving signals, the output must have the same dimensions as the input. For this reason, the architecture has a mirrored structure that first compresses the input data using 2D convolutional layers and then expands it through 2D transposed convolutional layers to generate the compensated driving signals.

All layers have a \((3 \times 3)\) kernel, which is a common choice among CNN-based architectures [56], with the exception of layer vi), which has a \(4 \times 3\) kernel. This choice accounts for the fact that, in the considered scenario, the number of loudspeakers L is a power of two while the number of frequencies K is odd. No padding is applied, the stride is \(2 \times 2\), and the chosen activation is the complex parametric rectified linear unit (CPReLU), which has been proposed and used for audio-related applications [57] and is highly expressive thanks to its learnable parameters. Similarly to the CReLU activation [37, 58], CPReLU applies separate PReLUs [59] to the real and imaginary parts of a neuron. More specifically, it is defined as

$$\begin{aligned} \text {CPReLU}(z) = \text {PReLU}(\Re (z)) + j\text {PReLU}(\Im (z)), \end{aligned}$$
(26)

where \(z \in \mathbb {C}\) represents the value of a neuron, and \(\Re\) and \(\Im\) denote the operators extracting the real and imaginary parts, respectively, out of a complex number.
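Equation (26) can be sketched directly; the slope parameters `alpha_re` and `alpha_im` below are illustrative scalars, whereas in the actual network they are learnable, per-channel parameters.

```python
import numpy as np

def prelu(x, alpha):
    """Parametric ReLU: identity for x >= 0, slope alpha for x < 0."""
    return np.where(x >= 0, x, alpha * x)

def cprelu(z, alpha_re, alpha_im):
    """CPReLU of Eq. (26): separate PReLUs on real and imaginary parts."""
    return prelu(z.real, alpha_re) + 1j * prelu(z.imag, alpha_im)

# Toy complex activations: one value per quadrant of the complex plane.
z = np.array([1.0 + 2.0j, -1.0 + 0.5j, 0.5 - 2.0j, -3.0 - 4.0j])
out = cprelu(z, alpha_re=0.2, alpha_im=0.1)
# Negative real parts are scaled by 0.2, negative imaginary parts by 0.1.
```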

In layer vii), zero-padding is applied, the stride is \(1 \times 1\), and a linear activation is used. We introduce a skip connection, which has been proven to speed up training [60], by feeding as input to layer v) the sum of the outputs of layers iv) and ii). All convolutional layers, with the exception of vii), are followed by dropout with a rate of 0.5 in order to prevent overfitting [61]. The complex-valued layers of the network were implemented by means of the CVNN [62] library using TensorFlow as backend.
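As a sanity check on the feature-map sizes listed above, the standard output-size formulas for unpadded strided convolutions and transposed convolutions reproduce the reported shapes for \(L=64\) and \(K=63\) (the sizes used in Section 4), assuming the wider 4-entry kernel acts along the loudspeaker axis of the last stride-2 upsampling layer.

```python
def conv_out(n, k, s):
    """Output length of a 'valid' (no padding) strided convolution."""
    return (n - k) // s + 1

def tconv_out(n, k, s):
    """Output length of a 'valid' strided transposed convolution."""
    return (n - 1) * s + k

L, K = 64, 63  # loudspeaker and frequency axis sizes used in Section 4
enc = [(L, K)]
for _ in range(3):  # layers i)-iii): kernel 3, stride 2
    l, k = enc[-1]
    enc.append((conv_out(l, 3, 2), conv_out(k, 3, 2)))

l, k = enc[-1]                                 # 7 x 7 bottleneck
l, k = tconv_out(l, 3, 2), tconv_out(k, 3, 2)  # layer iv): 15 x 15
l, k = tconv_out(l, 3, 2), tconv_out(k, 3, 2)  # layer v): 31 x 31
l, k = tconv_out(l, 4, 2), tconv_out(k, 3, 2)  # 4x3 kernel restores 64 x 63
```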

4 Results

In this section, we present simulation and experimental results aimed at estimating the accuracy of the soundfield synthesized with the proposed method, referred to in the following as \(\text {CNN}\), with respect to the techniques presented in Section 2, namely the model-based soundfield rendering technique [42] (\(\text {MR}\)), the pressure matching technique [15] (\(\text {PM}\)), and adaptive wave field synthesis (\(\text {AWFS}\)). We also consider an adaptive version of the \(\text {MR}\) technique, obtained by applying the \(\text {AWFS}\) procedure defined in (20) to the driving signals computed via the model-based technique. We will refer to this method as \(\text {AMR}\) in the following.

The \(\text {MR}\) technique assumes setups where loudspeakers are regularly spaced; therefore, its performances are expected to be suboptimal when it is applied to an irregular array, as in the case of this manuscript. Moreover, since the \(\text {CNN}\) technique compensates the driving signals extracted via \(\text {MR}\), the reproduction error obtained through the latter can be regarded as an upper bound for the proposed method.

We also consider the \(\text {PM}\) method since, similarly to \(\text {CNN}\), it does not pose any constraint on the configuration of the loudspeaker array.

We avoid a comparison with a mode matching technique, even though it is suitable for irregular setups, due to the inherently different optimization procedure. While the function minimized by the \(\text {PM}\) and \(\text {CNN}\) approaches considers the pressure obtained at a series of control points, the mode matching technique minimizes directly the difference between the modes of the desired and reproduced soundfields [63]. Moreover, a mode-matching strategy is already applied in the derivation of the spatial filters used in the \(\text {MR}\) technique.

The simulation results refer to circular and linear loudspeaker deployments, while the experimental ones refer to a circular array setup only. We first present the aspects of the setup that are common to all configurations and then discuss the different scenarios separately. The setups for the simulation and experimental campaigns were chosen empirically with two objectives: on one side, to analyze the accuracy of the reproduction, thus using a high spatial sampling for the listening area; on the other, to consider a challenging configuration of control points, whose spatial sampling always corresponds to a spatial aliasing frequency well below the maximum frequency considered in the analysis. This choice is motivated by the fact that, as demonstrated in our previous work [32], learning-based soundfield synthesis techniques are able to overcome sampling issues that affect other optimization-based approaches such as \(\text {PM}\). The code used to generate the data and train the model, as well as the setups and additional results, can be found at https://polimi-ispl.github.io/deep_learning_soundfield_synthesis_irregular_array/. The \(\text {WFS}\) driving functions needed to apply \(\text {AWFS}\) were computed using the Sound Field Synthesis (SFS) Toolbox for Python [64].

4.1 Model parameters

In order to train the network, we simulate a set of point sources \(\mathcal {S}\), which is then separated into three sets \(\mathcal {S}_\text {train}\), \(\mathcal {S}_\text {val}\), and \(\mathcal {S}_\text {test}\) used for the training, validation, and testing phases, respectively. These datasets are pairwise disjoint, meaning more formally that

$$\begin{aligned} \mathcal {S}_\text {train}\cap \mathcal {S}_\text {val} = \mathcal {S}_\text {train} \cap \mathcal {S}_\text {test} = \mathcal {S}_\text {test} \cap \mathcal {S}_\text {val} = \emptyset . \end{aligned}$$
(27)

The network is trained using the Adam optimizer [65] with a learning rate \(\text {lr}=10^{-4}\). We set the maximum number of epochs to 5000 and saved only the model corresponding to the best validation loss value. We apply early stopping, ending the training after 10 epochs with no improvement in terms of validation loss. The network loss usually converged after around \(100-200\) epochs. The regularization constant \(\lambda\) used to regularize the least squares solutions in \(\text {PM}\) (see (4)), \(\text {MR}\) (see (16)), \(\text {AMR}\), and \(\text {AWFS}\) (see (20)) was set to \(10^{-3} \sigma _\text {max}\), where \(\sigma _\text {max}\) is the maximum singular value of \(\textbf{G}_\text {cp}^H\textbf{G}_\text {cp}\), similarly to [19].
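The regularization constant can be computed as sketched below; the matrix `G_cp` is a random placeholder for the actual Green's function matrix between loudspeakers and control points, and the last line shows a generic Tikhonov-regularized least-squares solution of the pressure-matching type (a sketch, not the paper's exact implementation).

```python
import numpy as np

rng = np.random.default_rng(1)
I, L = 25, 32  # control points, loudspeakers (toy sizes)

# Hypothetical complex Green's function matrix G_cp (I x L).
G_cp = rng.standard_normal((I, L)) + 1j * rng.standard_normal((I, L))

# Regularization constant: 1e-3 times the largest singular value of G^H G.
sigma_max = np.linalg.svd(G_cp.conj().T @ G_cp, compute_uv=False)[0]
lam = 1e-3 * sigma_max

# Regularized least-squares driving signals for a desired pressure p_cp.
p_cp = rng.standard_normal(I) + 1j * rng.standard_normal(I)
d = np.linalg.solve(G_cp.conj().T @ G_cp + lam * np.eye(L),
                    G_cp.conj().T @ p_cp)
```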

4.2 Evaluation metrics

In order to evaluate the performances of the proposed method, we adopt two different metrics: the normalized reproduction error (\(\text {NRE}\)) [19] and the structural similarity index measure (SSIM) [66]. The \(\text {NRE}\) measures the reproduction accuracy and, for a single emitting source \(\textbf{r}_s\) and frequency \(\omega _k\), is defined as

$$\begin{aligned} \text {NRE}(\textbf{r}_s,\omega _k) = 10\log _{10} \frac{\sum \nolimits _{a=1}^{A}|\hat{p}(\textbf{r}_{a},\omega _k)-p(\textbf{r}_{a},\omega _k)|^2}{\sum \nolimits _{a=1}^{A}|p(\textbf{r}_{a},\omega _k)|^2}, \end{aligned}$$
(28)

where \(\hat{p}(\textbf{r}_{a},\omega _k)\) corresponds to the pressure soundfield estimated at point \(\textbf{r}_{a}\) using either the \(\text {MR}\), \(\text {PM}\) or \(\text {CNN}\) techniques, while \(p(\textbf{r}_{a},\omega _k)\) is the ground truth.
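A minimal sketch of (28), with a toy complex field as input:

```python
import numpy as np

def nre_db(p_hat, p):
    """Normalized reproduction error of Eq. (28), in dB, over A points."""
    return 10 * np.log10(
        np.sum(np.abs(p_hat - p) ** 2) / np.sum(np.abs(p) ** 2)
    )

p = np.array([1.0 + 1.0j, 0.5 - 0.5j, -1.0 + 0.0j])  # toy ground truth field
err = nre_db(2 * p, p)  # residual energy equals field energy -> 0 dB
```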

As already done in [25], we also evaluate the accuracy in terms of \(\text {SSIM}\), which enables us to evaluate how well the considered techniques reproduce the overall shape of the pressure soundfield at each frequency point. For a single emitting source \(\textbf{r}_s\) and frequency \(\omega _k\), the \(\text {SSIM}\) is given by

$$\begin{aligned} \text {SSIM}(\textbf{r}_s, \omega _k) = \frac{(2\mu _{\hat{\textbf{p}}}\mu _{\textbf{p}}+c_1)(2\sigma _{\hat{\textbf{p}}\textbf{p}}+c_2)}{(\mu ^2_{\hat{\textbf{p}}}+\mu ^2_{\textbf{p}}+c_1)(\sigma ^2_{\hat{\textbf{p}}}+\sigma ^2_{\textbf{p}}+c_2)}, \end{aligned}$$
(29)

where \(\textbf{p} \in \mathbb {R}^{A}\) and \(\hat{\textbf{p}} \in \mathbb {R}^{A}\) correspond to the absolute values of the pressure soundfield, normalized between 0 and 1, measured in the listening area \(\mathcal {A}\) at frequency \(\omega _k\) when the source \(\textbf{r}_s\) is active, in the ground truth case and when either \(\text {CNN}\), \(\text {PM}\), or \(\text {MR}\) is used, respectively. The values \(\mu _{(\cdot )}\) and \(\sigma ^2_{(\cdot )}\) are the average and variance of the vector at subscript, respectively. Finally, \(\sigma _{(\cdot ,\cdot )}\) is the covariance between the entries of the two vectors given as argument. In order to stabilize the division in the case of a weak denominator, the \(\text {SSIM}\) calculation includes the two constants \(c_1=(h_1R)^2\) and \(c_2=(h_2R)^2\), where R is the dynamic range of the entry values (1 in the case of normalized vectors), while \(h_1=0.01\) and \(h_2=0.03\), following the standard recommendation [25].
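A single-window implementation of (29) might look as follows; note that, unlike windowed SSIM implementations, the statistics here are computed once over the whole flattened listening area.

```python
import numpy as np

def global_ssim(p_hat, p, h1=0.01, h2=0.03, R=1.0):
    """Single-window SSIM of Eq. (29) over the flattened listening area."""
    c1, c2 = (h1 * R) ** 2, (h2 * R) ** 2
    mu_h, mu_p = p_hat.mean(), p.mean()
    var_h, var_p = p_hat.var(), p.var()
    cov = ((p_hat - mu_h) * (p - mu_p)).mean()
    return ((2 * mu_h * mu_p + c1) * (2 * cov + c2)) / (
        (mu_h ** 2 + mu_p ** 2 + c1) * (var_h + var_p + c2)
    )

rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=500)  # normalized magnitude field in [0, 1]
s_same = global_ssim(p, p)           # identical fields give an SSIM of 1
```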

4.3 Linear array

In this section, we present results related to soundfield synthesis when considering a linear array setup.

4.3.1 Setup

We considered a regular linear array centered in \([0.5\ \text {m}, 0\ \text {m}]^T\) and consisting of \(L=64\) secondary sources with a spacing of \(0.0625\ \text {m}\). From this configuration, we generated three irregular array setups by randomly removing 16, 32, or 48 loudspeakers, resulting in three irregular arrays with \(L=48\), \(L=32\), and \(L=16\) secondary sources, respectively. The listening area \(\mathcal {A}\) considered for reproduction was a \(2\ \text {m} \times 2\ \text {m}\) surface located on the half plane to the left of the array, with its lower-left corner placed in \([-2\ \text {m}, -2\ \text {m}]^T\), sampled using \(A=25000\) points with a spacing of \(0.02\ \text {m}\) along both the x and y axes. We used \(I=60\) control points, placed on a \(2\ \text {m} \times 2\ \text {m}\) grid overlapping with the listening region \(\mathcal {A}\) and spaced by \(0.44\ \text {m}\) along both the x and y axes, corresponding to a spatial aliasing frequency of \(387\ \text {Hz}\). The control points were used both for computing the losses during the training of the \(\text {CNN}\) model and for calculating the driving signals through \(\text {PM}\) and \(\text {AWFS}\), as well as the filters needed to compute \(\text {MR}\) through (16) and \(\text {AMR}\).

In order to train the network, we set the cardinalities of \(\mathcal {S}_\text {train}\), \(\mathcal {S}_\text {val}\), and \(\mathcal {S}_\text {test}\) to 3920, 980, and 2500, respectively. The sources in \(\mathcal {S}_\text {train} \cup \mathcal {S}_\text {val}\) are placed in a \(4\ \text {m} \times 8\ \text {m}\) grid sampled with a spacing of \(0.06\ \text {m}\) along the y-axis and of \(0.11\ \text {m}\) along the x-axis. The split of these sources into validation and training sets is performed randomly at training time, as is common practice. Test sources are then obtained by shifting the \(\mathcal {S}_\text {train} \cup \mathcal {S}_\text {val}\) sources by \(0.05\ \text {m}\) along the x-axis. An image depicting the setup is available on the accompanying website. We considered sources emitting a signal with spectrum \(A(\omega _k)=1\) at \(K=63\) frequencies spaced by \(23\ \text {Hz}\), in the range between \(46\ \text {Hz}\) and \(1500\ \text {Hz}\).

4.3.2 Results

In Fig. 4, we show the real part of the reproduced sound pressure distribution at frequency \(f=210\ \text {Hz}\) for a point source located in \(\textbf{r}=[1.05\ \text {m}, 1.88\ \text {m}, 0\ \text {m}]^T\), synthesized using \(L=32\) loudspeakers. More specifically, Fig. 4a refers to the ground truth soundfield, while the fields for \(\text {MR}\), \(\text {CNN}\), \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\) are shown in Fig. 4b, c, d, e, and f, respectively. We purposely chose to show the soundfield at a low frequency in order to present an example where the performances of the \(\text {CNN}\) method are slightly worse than those obtained with \(\text {AMR}\), while better than those of all other considered methods. It is apparent that the \(\text {CNN}\) and \(\text {AMR}\) models obtain the best results, reducing the number of irregularities in the wavefront both with respect to the \(\text {MR}\) technique, whose driving signals are the input to the \(\text {CNN}\) model, and with respect to the \(\text {PM}\) technique. The difference in performance between the \(\text {CNN}\) model and the \(\text {AWFS}\) technique is smaller, since in this scenario all models work reasonably well. These considerations are also confirmed by inspecting the \(\text {NRE}\) for the same scenario, as shown in Fig. 5. In Fig. 6a-c-e, we present the \(\text {NRE}\) averaged over all \(|\mathcal {S}_\text {test}|\) sources, when considering an irregular array of \(L=48,\ 32\), and 16 secondary sources. The \(\text {CNN}\) achieves the best average \(\text {NRE}\) over the whole range of considered frequencies in all cases, both with respect to the \(\text {MR}\) and \(\text {PM}\) techniques, with the latter also showing a higher irregularity.
When comparing the average results of \(\text {CNN}\) with those of the linear optimizer-based \(\text {AWFS}\) and \(\text {AMR}\) methods, the former still obtains better performances in most scenarios, with slightly lower performances around \(200\ \text {Hz}\); however, the gap in performances diminishes as the number of active loudspeakers decreases, becoming almost indistinguishable for \(L=16\). As expected, with fewer active secondary sources the error is higher.

Fig. 4
figure 4

Amplitude (real part) of the soundfield for a source placed in \(\textbf{r}=[1.05\ \text {m}, 1.88\ \text {m}, 0\ \text {m}]^T\) at \(f=210\ \text {Hz}\); ground truth is shown in a. Reproduction through an irregular linear array of \(L=32\) loudspeakers using \(\text {MR}\) (b), \(\text {CNN}\) (c), \(\text {PM}\) (d), \(\text {AWFS}\) (e), and \(\text {AMR}\) (f)

Fig. 5
figure 5

Normalized reproduction error (NRE) distribution in \(\text {dB}\) for a source placed in \(\textbf{r}=[1.05\ \text {m}, 1.88\ \text {m}, 0\ \text {m}]^T\) at \(f=210\ \text {Hz}\) when using \(\text {MR}\) (a), \(\text {CNN}\) (b), \(\text {PM}\) (c), \(\text {AWFS}\) (d), and \(\text {AMR}\) (e). Black loudspeakers represent the geometry of the chosen array

Fig. 6
figure 6

Irregular linear array soundfield synthesis performances with respect to frequency: NRE when \(L=48\) (a), NRE when \(L=32\) (c), NRE \(L=16\) (e). SSIM when \(L=48\) (b), SSIM when \(L=32\) (d), SSIM when \(L=16\) (f). Error bars represent \(\pm 1\) standard deviations

In Fig. 6b-d-f, we present the \(\text {SSIM}\) averaged over all \(|\mathcal {S}_\text {test}|\) sources, when considering an irregular array of \(L=48,\ 32\), and 16 sources, respectively. For \(L=48\), the results are broadly similar for all methods; \(\text {CNN}\) is worse on average at the lowest frequencies, while slightly better at the highest ones. In the case of \(L=32\), the \(\text {SSIM}\) curves are similar for most methods, except for \(\text {CNN}\), which obtains slightly lower results below \(600\ \text {Hz}\) but performs better than the other methods at higher frequencies. Finally, in the case of \(L=16\), the \(\text {SSIM}\) is comparable for all considered methods, with \(\text {CNN}\) obtaining slightly better results above \(600\ \text {Hz}\).

4.4 Circular array

In this section, we present results related to soundfield synthesis when considering a circular array setup.

4.4.1 Setup

We considered a regular circular array consisting of \(L=64\) secondary sources with a radius of \(1\ \text {m}\).

The listening area considered for reproduction, surrounded by the loudspeaker array, corresponds to a circle of \(1\ \text {m}\) radius centered in \([0\ \text {m}, 0\ \text {m}]^T\), uniformly sampled in order to obtain \(A=7770\) listening points spaced by \(0.02\ \text {m}\).

We used \(I=25\) control points placed in a \(1.3\ \text {m} \times 1.3\ \text {m}\) square grid inside \(\mathcal {A}\), centered in \([0\ \text {m}, 0\ \text {m}]^T\), with a spacing of \(0.3\ \text {m}\) along both the x and y axes, resulting in 5 rows and 5 columns and corresponding to a spatial aliasing frequency of approximately \(514\ \text {Hz}\). The control points were used to compute the losses during the training of the \(\text {CNN}\) model and to calculate the driving signals through \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\).

In order to train the network, we used \(|\mathcal {S}_\text {train}|=4096\) and \(|\mathcal {S}_\text {val}|=1024\) sources, respectively. The \(\mathcal {S}_\text {train}\) and \(\mathcal {S}_\text {val}\) sets were generated by uniformly sampling, with 256 points each, 20 circumferences whose radii were uniformly distributed in the range \([1.5\ \text {m}, 3.5\ \text {m}]\) from the center of the array.

The test dataset \(\mathcal {S}_\text {test}\), instead, was created by uniformly sampling, with 128 points each, 20 circumferences whose radii were uniformly distributed in the range \([1.55\ \text {m}, 3.55\ \text {m}]\), obtaining \(|\mathcal {S}_\text {test}|=2560\) test sources, placed such that no source overlaps with those used for training and validating the method.

We considered sources emitting a signal with spectrum \(A(\omega _k)=1\) at \(K=63\) frequencies spaced by \(23\ \text {Hz}\), in the range between \(46\ \text {Hz}\) and \(1500\ \text {Hz}\). An image depicting the setup is available on the accompanying website.

4.4.2 Results

In Fig. 7a, we show the real part of the ground truth sound pressure distribution for an emitting point source placed in \(\textbf{r}=[0.99\ \text {m}, 2.88\ \text {m}, 0\ \text {m}]^T\). In Fig. 7b, c, d, e, and f, the real part of the sound pressure obtained through \(\text {MR}\), \(\text {CNN}\), \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\) is shown, respectively, when 32 speakers are active. It is clear that the \(\text {CNN}\) model performs best, reducing the number of irregularities in the wavefront with respect to the \(\text {MR}\), \(\text {AWFS}\), and \(\text {AMR}\) techniques and especially with respect to the \(\text {PM}\) technique, whose reproduced soundfield is extremely irregular. These considerations are also confirmed by inspecting the \(\text {NRE}\) obtained for the same scenario, shown in Fig. 8, where the \(\text {NRE}\) in the case of \(\text {CNN}\), shown in Fig. 8b, is noticeably lower in the listening area \(\mathcal {A}\) than the ones obtained through \(\text {MR}\), \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\), shown in Fig. 8a, c, d, and e, respectively.

Fig. 7
figure 7

Real part of the soundfield for a source placed in \(\textbf{r}=[0.99\ \text {m}, 2.88\ \text {m}, 0\ \text {m}]^T\) at \(f=1007\ \text {Hz}\); ground truth is shown in a. Reproduction performances using the irregular circular array of \(L=32\) loudspeakers are shown using \(\text {MR}\) (b), \(\text {CNN}\) (c), \(\text {PM}\) (d), \(\text {AWFS}\) (e), and \(\text {AMR}\) (f). Black loudspeakers represent the geometry of the chosen array

Fig. 8
figure 8

Normalized reproduction error (NRE) distribution in \(\text {dB}\) for a source placed in \(\textbf{r}=[0.99\ \text {m}, 2.88\ \text {m}, 0\ \text {m}]^T\) at \(f=1007\ \text {Hz}\) when using: \(\text {MR}\) (a), \(\text {CNN}\) (b), \(\text {PM}\) (c), \(\text {AWFS}\) (d), and \(\text {AMR}\) (e). Black loudspeakers represent the geometry of the chosen array

In Fig. 9a-c-e, we present the \(\text {NRE}\) averaged over all \(|\mathcal {S}_\text {test}|\) sources, when considering an irregular array of \(L=48,\ 32\), and 16 secondary sources. Similarly to the linear array case, the \(\text {CNN}\) achieves average \(\text {NRE}\) results that are on par with or better than those of the other considered techniques; this is more evident when the number of secondary sources is lower. While the mean \(\text {NRE}\) of \(\text {MR}\) is approximately constant in the considered frequency range, the average error of \(\text {CNN}\) tends to increase with frequency, even if it remains lower than that of \(\text {MR}\). Analogously, \(\text {PM}\) exhibits an error that increases with frequency, becoming extremely irregular in the upper frequency range and for sparser setups, while being on par with or lower than \(\text {CNN}\) at the lower frequencies. \(\text {AMR}\) shows a behavior similar to \(\text {CNN}\) but reaches higher \(\text {NRE}\) values. When considering the \(\text {AWFS}\) technique, the \(\text {CNN}\) technique performs better on average in both the \(L=48\) and \(L=32\) cases, while the performances with an array of \(L=16\) loudspeakers are practically on par.

Fig. 9
figure 9

Irregular circular array soundfield synthesis performances with respect to frequency: NRE when \(L=48\) (a), NRE when \(L=32\) (c), NRE when \(L=16\) (e), SSIM when \(L=48\) (b), SSIM when \(L=32\) (d), SSIM when \(L=16\) (f). Error bars represent \(\pm 1\) standard deviations

In Fig. 9b-d-f, we present the \(\text {SSIM}\) metric averaged over all \(|\mathcal {S}_\text {test}|\) sources, when considering an irregular array of \(L=48,\ 32\), and 16 sources, respectively. In contrast to the linear array case, the \(\text {SSIM}\) obtained through \(\text {CNN}\) is similar to or better than that of the other considered methods for \(L=48\) and \(L=32\), especially at higher frequencies. This is probably due both to the smaller listening area considered, which allows for a smaller number of irregularities in the reproduced wavefront, and to the fact that the array surrounds the listening area, enabling reproduction from a higher number of directions. A notable exception is \(L=16\), where the highest \(\text {SSIM}\) performances are obtained by the \(\text {MR}\) technique and \(\text {CNN}\) performs worse than \(\text {AWFS}\) and \(\text {MR}\) at higher frequencies.

In the case of the circular array, we also computed the \(\text {NRE}\) and \(\text {SSIM}\) when varying the location of the emitting source, in particular as it moves farther from the center of the array in the range \(1.5\ \text {m}<\rho <3.5\ \text {m}\), while keeping the frequency fixed at \(1007\ \text {Hz}\). The results for the \(\text {NRE}\) metric are shown in Fig. 10a-c-e for the arrays with 48, 32, and 16 secondary sources, respectively. All methods present a mostly constant behavior over the whole considered radius range, with \(\text {CNN}\) and \(\text {PM}\) being the most and least accurate, respectively. As expected, the \(\text {NRE}\) worsens when decreasing the number of active secondary sources. Coherently with the \(\text {NRE}\) results, for \(L=16\), the average performances of \(\text {CNN}\) and \(\text {AWFS}\) are extremely similar. The results for the \(\text {SSIM}\) metric are shown in Fig. 10b-d-f for the arrays with 48, 32, and 16 secondary sources, respectively. In this case, the accuracy slightly worsens as the distance of the sources increases. While \(\text {CNN}\), \(\text {MR}\), and \(\text {AWFS}\) are close to each other, \(\text {AMR}\) and \(\text {PM}\) turn out to be the worst.

Fig. 10
figure 10

Irregular circular array soundfield synthesis performances with respect to distance from the center of the reproduction area at frequency \(f=1007\ \text {Hz}\): NRE when \(L=48\) (a), NRE when \(L=32\) (c), NRE when \(L=16\) (e), SSIM when \(L=48\) (b), SSIM when \(L=32\) (d), SSIM when \(L=16\) (f). Error bars represent \(\pm 1\) standard deviations

4.5 Real data

In this section, we present results related to soundfield synthesis when considering a circular array setup and data obtained from the room impulse response (RIR) measurements contained in the dataset from [67]. It is important to stress that in this scenario the sound propagation is 3D; therefore, in order to provide a fair comparison, we used the 2.5D version of \(\text {WFS}\) contained in the SFS toolbox [64] to implement the \(\text {AWFS}\) method. While the filters obtained via \(\text {MR}\) are at a disadvantage, being computed for a 2D environment, this is not a problem for either \(\text {AMR}\) or \(\text {CNN}\), since in these methods the \(\text {MR}\) filters are only used as input and later optimized taking into account the 3D scenario. The point sources used to generate the desired ground truth soundfields were simulated using Pyroomacoustics [68], effectively considering 3D propagation.

4.5.1 Setup

RIRs were measured in a hemi-anechoic room of size \(4.90\ \text {m}\times 7.22\ \text {m}\times 5.29\ \text {m}\), with 50 mm Martini Absorb XHD50 sound-absorbing material on the ground and an average reverberation time of \(0.045\ \text {s}\), using an array of \(L=60\) loudspeakers (Genelec 8010A) with a radius of \(1.5\ \text {m}\), the spacing between adjacent loudspeakers being approximately \(0.157\ \text {m}\). From this configuration, three irregular array setups were generated by randomly removing 12, 28, or 44 loudspeakers, resulting in three irregular configurations with \(L=48\), \(L=32\), and \(L=16\) secondary sources, respectively. The RIRs related to the reproduction zone were measured by considering the square microphone (DPA 4060) array configuration related to Zone E in [67], consisting of 64 microphones sampling, with a spacing of \(0.04\ \text {m}\), a square of size \(0.28\ \text {m} \times 0.28\ \text {m}\) placed in the center of the area surrounded by the loudspeaker array. Both microphones and loudspeakers were placed at the same height of \(1.45\ \text {m}\) from the floor. A total of 16 control points inside the reproduction area were chosen by selecting the first (from left) and fifth columns of microphones on the listening area grid; the two columns are thus separated by approximately \(0.16\ \text {m}\), with microphones in the same column spaced by \(0.04\ \text {m}\), resulting in a spatial aliasing frequency of approximately \(1071\ \text {Hz}\). The control points were used to compute the losses for the \(\text {CNN}\) model and the driving signals through the \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\) techniques. The considered sampling frequency is \(Fs=48000\ \text {Hz}\) [67].
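The quoted aliasing limit follows from the usual half-wavelength sampling condition \(f_{\text {alias}} = c/(2d)\), assuming a speed of sound of \(343\ \text {m/s}\):

```python
c = 343.0               # assumed speed of sound in m/s
d = 0.16                # spacing between the two control-point columns in m
f_alias = c / (2 * d)   # half-wavelength condition: about 1071 Hz
```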

In order to generate the ground truth dataset of desired soundfields, we simulated through Pyroomacoustics [68] a total of 4264 point sources placed in an \(8\ \text {m} \times 8\ \text {m}\) grid surrounding the loudspeaker array. The sources were split into \(|\mathcal {S}_\text {train}|=1705\), \(|\mathcal {S}_\text {val}|=427\), and \(|\mathcal {S}_\text {test}|=2132\) to create the training, validation, and test sets, respectively. We considered sources emitting a signal with spectrum \(A(\omega _k)=1\) at \(K=63\) frequencies spaced by \(23\ \text {Hz}\), in the range between \(50\ \text {Hz}\) and \(1500\ \text {Hz}\). An image depicting the setup is available on the accompanying website.

4.5.2 Results

In Fig. 11a, we show the real part of the ground truth sound pressure distribution for a point source placed in \(\textbf{r}=[-3.76\ \text {m}, -1.14\ \text {m}, 0\ \text {m}]^T\) at \(f=1500\ \text {Hz}\). In Fig. 11b, c, d, e, and f, the real part of the sound pressure obtained through \(\text {MR}\), \(\text {CNN}\), \(\text {PM}\), \(\text {AWFS}\), and \(\text {AMR}\) is shown, respectively, when 32 speakers are active. We can see that the \(\text {CNN}\) technique is the one that best reproduces the soundfield, closely followed by the \(\text {PM}\) method and then by the \(\text {AWFS}\) and \(\text {AMR}\) methods; \(\text {MR}\) is the one that performs worst at reproducing the desired ground truth soundfield. Similar considerations can be drawn by inspecting the \(\text {NRE}\) obtained for the same scenario, shown in Fig. 12, which reports the \(\text {NRE}\) over the listening area \(\mathcal {A}\) for \(\text {MR}\) (Fig. 12a), \(\text {CNN}\) (Fig. 12b), \(\text {PM}\) (Fig. 12c), \(\text {AWFS}\) (Fig. 12d), and \(\text {AMR}\) (Fig. 12e).

Fig. 11
figure 11

Amplitude (real part) of the soundfield for a source placed in \(\textbf{r}=[-3.76\ \text {m}, -1.14\ \text {m}, 0\ \text {m}]^T\) at \(f=1500\ \text {Hz}\); ground truth is shown in a. Reproduction performances using the irregular circular array of \(L=32\) using \(\text {MR}\) (b), \(\text {CNN}\) (c), \(\text {PM}\) (d), \(\text {AWFS}\) (e), and \(\text {AMR}\) (f)

Fig. 12
figure 12

Normalized reproduction error (NRE) distribution in \(\text {dB}\) for a source placed in \(\textbf{r}=[-3.76\ \text {m}, -1.14\ \text {m}, 0\ \text {m}]^T\) at \(f=1500\ \text {Hz}\) when using: \(\text {MR}\) (a), \(\text {CNN}\) (b), \(\text {PM}\) (c), \(\text {AWFS}\) (d), and \(\text {AMR}\) (e)

In Fig. 13a-b-c, we present the \(\text {NRE}\) averaged over all \(|\mathcal {S}_\text {test}|\) sources, when considering an irregular array of \(L=48,\ 32\), and 16 secondary sources. In the case of \(L=48\), the \(\text {CNN}\), \(\text {AMR}\), and \(\text {PM}\) performances are similar below \(700\ \text {Hz}\), while above this value \(\text {CNN}\) is the method that most reduces the mean \(\text {NRE}\) over the whole test set \(\mathcal {S}_\text {test}\). No major difference can be observed for \(L=32\). Finally, concerning the \(L=16\) scenario, \(\text {CNN}\) performances are on par with \(\text {AMR}\) below \(800\ \text {Hz}\), while at higher frequencies the error obtained with the latter increases strongly. Conversely, while \(\text {CNN}\) performances are on par with \(\text {AWFS}\) below \(600-700\ \text {Hz}\), the latter performs slightly better above \(800\ \text {Hz}\). The \(\text {MR}\) method performs worst in all cases, except above around \(1200\ \text {Hz}\) when \(L=48\) and \(L=32\), where it performs better than \(\text {PM}\).

Fig. 13
figure 13

Irregular circular array soundfield synthesis performances (real measurements) with respect to frequency, NRE when \(L=48\), (a), \(L=32\) (b), \(L=16\) (c). Error bars represent \(\pm 1\) standard deviations

We do not show the \(\text {SSIM}\) results: since the \(\text {SSIM}\) strongly depends on the variance of the data, it is not representative of the reproduction quality in this specific case, where the ground truth soundfields are simulated while the RIRs used for reproduction are measured, causing the two sets of data to have significantly different distributions.

5 Conclusion

In this manuscript, we have proposed a deep learning-based technique for soundfield synthesis through irregular loudspeaker arrays. More specifically, we consider the driving signals obtained through an existing soundfield synthesis method based on the plane wave decomposition and propose a network that modifies them in order to compensate the errors in the reproduced soundfield caused by the irregularity of the loudspeaker setup. We compare the proposed method with the technique used to compute the input driving signals and with pressure matching, showing that the proposed model obtains better average performances in most of the considered setups.

The obtained results open the possibility of combining deep learning with model-based soundfield synthesis to address the issues arising when only irregular loudspeaker arrays are available. For example, a complex-valued CNN-based pressure-matching technique can be devised by optimizing the driving signals from the knowledge of the soundfield at prescribed control points. Moreover, we plan to move to real environments, where multiple sources are active and noise and reverberation are present, aiming at compensating for the environment and masking the noise. We also plan to consider sources emitting more realistic signals, such as speech or music. In order to make the model better suited to real-world applications, we plan to enable the system to handle different loudspeaker arrangements without retraining and to systematically identify the effects of the loudspeaker and control point arrangements on the model performances. Further developments could also entail applying deep learning and irregular arrays to related problems, such as multizone soundfield reproduction for personal audio systems, and conditioning the system so that it is independent of the chosen array setup.
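For context on the pressure-matching baseline referred to above, a conventional (non-learned) formulation solves a Tikhonov-regularized least-squares problem for the driving signals at each frequency. The following sketch is illustrative only: the transfer matrix \(G\), the regularization weight, and the array sizes are hypothetical and not taken from the paper's experimental setup.

```python
import numpy as np

def pressure_matching(G, p_des, reg=1e-3):
    """Regularized least-squares pressure matching at one frequency:
    find driving signals d minimizing ||G d - p_des||^2 + reg ||d||^2,
    where G (control points x loudspeakers) collects the acoustic
    transfer functions from each loudspeaker to each control point."""
    L = G.shape[1]
    A = G.conj().T @ G + reg * np.eye(L)       # regularized normal matrix
    return np.linalg.solve(A, G.conj().T @ p_des)

# Hypothetical example: 64 control points, 16 loudspeakers,
# with a random complex transfer matrix standing in for the real one.
rng = np.random.default_rng(1)
G = rng.standard_normal((64, 16)) + 1j * rng.standard_normal((64, 16))
p_des = rng.standard_normal(64) + 1j * rng.standard_normal(64)
d = pressure_matching(G, p_des)
print(d.shape)  # (16,)
```

A learned variant would replace (or post-process) this closed-form solution with a network, as proposed in the paper, while the regularization weight here plays the role that the training objective and model capacity play there.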

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. The code used to perform the experiments is fully available at https://github.com/polimi-ispl/deep_learning_soundfield_synthesis_irregular_array.

Notes

  1. https://polimi-ispl.github.io/deep_learning_soundfield_synthesis_irregular_array/docs/linear.html

  2. https://polimi-ispl.github.io/deep_learning_soundfield_synthesis_irregular_array/docs/circular.html

  3. https://polimi-ispl.github.io/deep_learning_soundfield_synthesis_irregular_array/docs/real.html

References

  1. A.J. Berkhout, D. de Vries, P. Vogel, Acoustic control by wave field synthesis. J. Acoust. Soc. Am. 93(5), 2764–2778 (1993)

  2. S. Spors, R. Rabenstein, J. Ahrens, in 124th AES convention. The theory of wave field synthesis revisited (Audio Engineering Society (AES), New York, 2008), pp. 17–20

  3. M.A. Gerzon, Periphony: With-height sound reproduction. J. Audio Eng. Soc. 21(1), 2–10 (1973)

  4. D.B. Ward, T.D. Abhayapala, Reproduction of a plane-wave sound field using an array of loudspeakers. IEEE Trans. Speech Audio Process. 9(6), 697–707 (2001)

  5. M.A. Poletti, Three-dimensional surround sound systems based on spherical harmonics. J. Audio Eng. Soc. 53(11), 1004–1025 (2005)

  6. M. Poletti, F. Fazi, P. Nelson, Sound-field reproduction systems using fixed-directivity loudspeakers. J. Acoust. Soc. Am. 127(6), 3590–3601 (2010)

  7. M. Kentgens, A. Behler, P. Jax, in 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). Translation of a higher order ambisonics sound scene based on parametric decomposition (IEEE, Piscataway, 2020), pp. 151–155

  8. J. Ahrens, S. Spors, Sound field reproduction using planar and linear arrays of loudspeakers. IEEE Trans. Audio Speech Lang. Process. 18(8), 2038–2050 (2010)

  9. P. Chen, et al., in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). 3D exterior soundfield reproduction using a planar loudspeaker array (IEEE, Piscataway, 2018), pp. 471–475

  10. J. Trevino, T. Okamoto, Y. Iwaya, Y. Suzuki, in Proceedings of the 20th International Congress on Acoustics. High order ambisonic decoding method for irregular loudspeaker arrays (2010), pp. 23–27

  11. F. Zotter, M. Frank, H. Pomberger, in Fortschritte der Akustik, AIA-DAGA 2013. Comparison of energy-preserving and all-round ambisonic decoders (Meran, 2013)

  12. T. Qu, Z. Huang, Y. Qiao, X. Wu, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Matching projection decoding method for ambisonics system (IEEE, Piscataway, 2018), pp. 561–565

  13. Z. Ge, L. Li, T. Qu, Partially matching projection decoding method evaluation under different playback conditions. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 1411–1423 (2021)

  14. F. Zotter, M. Frank, All-round ambisonic panning and decoding. J. Audio Eng. Soc. 60(10), 807–820 (2012)

  15. P.A. Nelson, Active control of acoustic fields and the reproduction of sound. J. Sound Vib. 177(4), 447–477 (1994)

  16. P.A. Gauthier, A. Berry, W. Woszczyk, Sound-field reproduction in-room using optimal control techniques: Simulations in the frequency domain. J. Acoust. Soc. Am. 117(2), 662–678 (2005)

  17. P.N. Samarasinghe, M.A. Poletti, S.A. Salehin, T.D. Abhayapala, F.M. Fazi, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 3d soundfield reproduction using higher order loudspeakers (IEEE, Piscataway, 2013), pp. 306–310

  18. T. Betlehem, T.D. Abhayapala, Theory and design of sound field reproduction in reverberant rooms. J Acoust. Soc. Am. 117(4), 2100–2111 (2005)

  19. N. Ueno, S. Koyama, H. Saruwatari, Three-dimensional sound field reproduction based on weighted mode-matching method. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 1852–1867 (2019)

  20. N. Ueno, S. Koyama, H. Saruwatari, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Sound field reproduction with exterior cancellation using analytical weighting of harmonic coefficients (IEEE, Piscataway, 2018), pp. 466–470

  21. H. Zuo, P.N. Samarasinghe, T.D. Abhayapala, Intensity based spatial soundfield reproduction using an irregular loudspeaker array. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1356–1369 (2020)

  22. H. Zuo, T.D. Abhayapala, P.N. Samarasinghe, in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 3d multizone soundfield reproduction in a reverberant environment using intensity matching method (IEEE, Piscataway, 2021), pp. 416–420

  23. M.J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M.A. Roch, S. Gannot, C.A. Deledalle, Machine learning in acoustics: Theory and applications. J. Acoust. Soc. Am. 146(5), 3590–3628 (2019)

  24. M. Cobos, J. Ahrens, K. Kowalczyk, A. Politis, An overview of machine learning and other data-based methods for spatial audio capture, processing, and reproduction. EURASIP J. Audio Speech Music Process. 2022(1), 1–21 (2022)

  25. F. Lluis, P. Martinez-Nuevo, M. Bo Møller, S. Ewan Shepstone, Sound field reconstruction in rooms: Inpainting meets super-resolution. J. Acoust. Soc. Am. 148(2), 649–659 (2020)

  26. M.S. Kristoffersen, M.B. Møller, P. Martínez-Nuevo, J. Østergaard, Deep sound field reconstruction in real rooms: Introducing the ISOBEL sound field dataset (2021). arXiv preprint arXiv:2102.06455

  27. P. Morgado et al., in Proceedings of the 32nd Int. Conf. on Neural Information Processing Systems. Self-supervised generation of spatial audio for 360\(^{\circ }\) video (Curran Associates Inc., New York, 2018), pp. 360–370

  28. G. Routray, S. Basu, P. Baldev, R.M. Hegde, in EAA Spatial Audio Signal Processing Symposium. Deep-sound field analysis for upscaling ambisonic signals (2019), pp. 1–6

  29. S. Gao, J. Lin, W. Xihong, T. Qu, Sparse DNN model for frequency expanding of higher order ambisonics encoding process. IEEE/ACM Trans. Audio Speech Lang. Process. (2022)

  30. L. Zhang, X. Wang, R. Hu, D. Li, W. Tu, Estimation of spherical harmonic coefficients in sound field recording using feed-forward neural networks. Multimedia Tools Appl. 80(4), 6187–6202 (2021)

  31. H. Chen, T. Abhayapala, in Proceedings of the 23rd International Congress on Acoustics : integrating 4th EAA Euroregio 2019 : 9-13 September 2019 in Aachen, Germany. Spatial sound field reproduction using deep neural networks (2019). https://doi.org/10.18154/RWTH-CONV-239844

  32. L. Comanducci, F. Antonacci, A. Sarti, in 2022 International Workshop on Acoustic Signal Enhancement (IWAENC). A deep learning-based pressure matching approach to soundfield synthesis (IEEE, Piscataway, 2022), pp. 1–5

  33. X. Hong, B. Du, S. Yang, M. Lei, X. Zeng, End-to-end sound field reproduction based on deep learning. J. Acoust. Soc. Am. 153(5), 3055–3055 (2023)

  34. S. Koyama, G. Chardon, L. Daudet, Optimizing source and sensor placement for sound field control: An overview. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 696–714 (2020)

  35. C. Lee, H. Hasegawa, S. Gao, Complex-valued neural networks: A comprehensive survey. IEEE/CAA J. Autom. Sin. 9(8), 1406–1426 (2022)

  36. J. Bassey, L. Qian, X. Li, A survey of complex-valued neural networks. (2021). arXiv preprint arXiv:2101.12249

  37. C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J.F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, C.J. Pal, in International Conference on Learning Representations. Deep complex networks (2018). https://openreview.net/forum?id=H1T2hmZAb

  38. A. Hirose, Complex-valued neural networks (Springer Science & Business Media, Berlin/Heidelberg, 2012)

  39. M. Yang, M.Q. Ma, D. Li, Y.H.H. Tsai, R. Salakhutdinov, in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Complex transformer: A framework for modeling complex-valued sequence (IEEE, 2020), pp. 4232–4236

  40. H. Tsuzuki, M. Kugler, S. Kuroyanagi, A. Iwata, An approach for sound source localization by complex-valued neural network. IEICE Trans. Inf. Syst. E96.D(10), 2257–2265 (2013). https://doi.org/10.1587/transinf.E96.D.2257

  41. Y.S. Lee, C.Y. Wang, S.F. Wang, J.C. Wang, C.H. Wu, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Fully complex deep neural network for phase-incorporating monaural source separation (IEEE, Piscataway, 2017), pp. 281–285

  42. L. Bianchi, F. Antonacci, A. Sarti, S. Tubaro, Model-based acoustic rendering based on plane wave decomposition. Appl. Acoust. 104, 127–134 (2016)

  43. P.A. Gauthier, A. Berry, Adaptive wave field synthesis with independent radiation mode control for active sound field reproduction: Theory. J. Acoust. Soc. Am. 119(5), 2721–2737 (2006)

  44. P.A. Gauthier, A. Berry, in Audio Engineering Society Convention 123. Adaptive wave field synthesis for sound field reproduction: Theory, experiments, and future perspectives (Audio Engineering Society, New York, 2007)

  45. P.A. Gauthier, A. Berry, Adaptive wave field synthesis for broadband active sound field reproduction: Signal processing. J. Acoust. Soc. Am. 123(4), 2003–2016 (2008)

  46. P.A. Gauthier, A. Berry, Adaptive wave field synthesis for active sound field reproduction: Experimental results. J. Acoust. Soc. Am. 123(4), 1991–2002 (2008)

  47. E.G. Williams, Fourier acoustics: Sound radiation and nearfield acoustical holography (Academic press, Cambridge, 1999)

  48. P.C. Hansen, Analysis of discrete ill-posed problems by means of the l-curve. SIAM Rev. 34(4), 561–580 (1992)

  49. D.L. Colton, R. Kress, R. Kress, Inverse acoustic and electromagnetic scattering theory, vol. 93 (Springer, New York, 1998)

  50. D.N. Zotkin, R. Duraiswami, N.A. Gumerov, Plane-wave decomposition of acoustical scenes via spherical and cylindrical microphone arrays. IEEE Trans. Audio Speech Lang. Process. 18(1), 2–16 (2009)

  51. E.T. Whittaker, On the partial differential equations of mathematical physics. Math. Ann. 57(3), 333–355 (1903)

  52. E. Verheijen, Sound field reproduction by wave field synthesis. Ph.D. dissertation, Delft University of Technology (1997)

  53. P.A. Nelson, S.J. Elliott, Active control of sound (Academic press, Cambridge, 1991)

  54. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521(7553), 436–444 (2015)

  55. K. Simonyan, A. Zisserman, in International Conference on Learning Representations. Very deep convolutional networks for large-scale image recognition (2015)

  56. K. SongGong, W. Wang, H. Chen, Acoustic source localization in the circular harmonic domain using deep learning architecture. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 2475–2491 (2022)

  57. A. Pandey, D. Wang, in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Exploring deep complex networks for complex spectrogram enhancement (IEEE, Piscataway, 2019), pp. 6885–6889

  58. Y. Kuroe, M. Yoshid, T. Mori, in Artificial Neural Networks and Neural Information Processing-ICANN/ICONIP 2003 Istanbul, Turkey, June 26–29, 2003 Proceedings. On activation functions for complex-valued neural networks-existence of energy functions- (Springer, New York, 2003), pp. 985–992

  59. K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of the IEEE international conference on computer vision. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (IEEE, 2015), pp. 1026–1034

  60. K. He, X. Zhang, S. Ren, J. Sun, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Deep residual learning for image recognition (2016), pp. 770–778. https://doi.org/10.1109/CVPR.2016.90

  61. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)

  62. J.A. Barrachina. Negu93/cvnn: Complex-valued neural networks (2022). https://doi.org/10.5281/zenodo.7303587

  63. S. Koyama, K. Kimura, N. Ueno, in 2021 Immersive and 3D Audio: from Architecture to Automotive (I3DA). Sound field reproduction with weighted mode matching and infinite-dimensional harmonic analysis: An experimental evaluation (IEEE, Piscataway, 2021), pp. 1–6

  64. H. Wierstorf, S. Spors, in Audio Engineering Society Convention 132, Sound field synthesis toolbox (Audio Engineering Society, 2012). https://github.com/sfstoolbox/sfs-python/releases/tag/0.6.2

  65. D.P. Kingma, J. Ba, in 3rd Intl. Conf. on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Adam: A method for stochastic optimization (2015). http://arxiv.org/abs/1412.6980

  66. Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)

  67. S. Zhao, Q. Zhu, E. Cheng, I.S. Burnett, A room impulse response database for multizone sound field reproduction (L). J. Acoust. Soc. Am. 152(4), 2505–2512 (2022). https://doi.org/10.1121/10.0014958. https://pubs.aip.org/asa/jasa/article-pdf/152/4/2505/16657353/2505_1_online.pdf

  68. R. Scheibler, E. Bezzam, I. Dokmanić, in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). Pyroomacoustics: A python package for audio room simulation and array processing algorithms (IEEE, Piscataway, 2018), pp. 351–355

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

LC: conceptualization, code implementation, results computation, main writing. FA: conceptualization, writing, research oversee. AS: research oversee and manuscript review. All authors read and agreed to the submitted version of the manuscript.

Corresponding author

Correspondence to Luca Comanducci.

Ethics declarations

Ethics approval and consent to participate

The authors approve and consent to participate.

Consent for publication

The authors consent for publication.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Comanducci, L., Antonacci, F. & Sarti, A. Synthesis of soundfields through irregular loudspeaker arrays based on convolutional neural networks. J AUDIO SPEECH MUSIC PROC. 2024, 17 (2024). https://doi.org/10.1186/s13636-024-00337-7
