Sound field reconstruction using neural processes with dynamic kernels

Accurately representing the sound field with high spatial resolution is critical for immersive and interactive sound field reproduction technology. To minimize experimental effort, data-driven methods have been proposed to estimate sound fields from a small number of discrete observations. In particular, kernel-based methods using Gaussian Processes (GPs) with a covariance function to model spatial correlations have been used for sound field reconstruction. However, the fixed kernels used by these methods have limited expressiveness, and the optimal kernel must be identified manually for each type of sound field. In this work, we propose a new approach that parameterizes GPs using a deep neural network based on Neural Processes (NPs) to reconstruct the magnitude of the sound field. This method learns kernels dynamically from simulated data using an attention mechanism, allowing for greater flexibility and adaptability to the acoustic properties of the sound field. Numerical experiments demonstrate that the proposed approach outperforms current methods in reconstruction accuracy, providing a promising alternative for sound field reconstruction.


Introduction
Accurately describing the characteristics of a sound field, including its spatial, temporal, and spectral properties, is crucial for spatial audio applications, which aim to create realistic auditory environments through loudspeakers or headphones [1,2]. With recent advances in immersive and interactive sound field reproduction technologies, the ability to render dynamically variable sound fields that allow for listener and source movement within the audio scene has become increasingly important. While obtaining continuous spatial coverage measurements of a sound field over a large area is extremely challenging [3][4][5][6], sound field reconstruction offers a resourceful approach to estimate the sound field from a limited set of discrete observations. Such methods can help overcome the limitations of direct measurement techniques and enable realistic, immersive audio experiences in real-world applications.
General solutions for sound field reconstruction typically rely on conventional linear regression, where the sound field is measured at multiple points and represented as a linear combination of basis functions such as plane waves, cylindrical harmonics, or spherical harmonics [7][8][9][10]. However, a large number of basis functions is needed to accurately represent sound fields over a large spatial region using conventional linear regression. Under specific acoustic assumptions, it is possible to represent the sound field using sparse representations, including plane-wave [11] or spherical-wave [12] expansions, modal decomposition [13,14], and equivalent source methods [15][16][17]. Many of these techniques employ compressed sensing principles [18] to estimate undersampled data for sound field reconstruction.
Another approach, known as kernel ridge regression, is based on infinite-dimensional analysis of sound fields to address the issue of basis function truncation [19][20][21]. In this field, the hierarchical kernel was proposed [22], which requires manual adjustment of the kernel parameters to align with the specific characteristics of the sound field. More recent works [20,21] have focused on adaptive kernels, i.e., pre-defined kernels or subkernels whose parameters are tuned adaptively.
Recently, several data-driven methods utilizing neural networks (NNs) have been proposed for specific tasks within the field of sound field reconstruction [23][24][25][26][27][28]. Many of these methods are inspired primarily by image restoration and segmentation techniques in computer vision. For example, convolutional neural network (CNN) architectures including U-Net were proposed for reconstructing sound field magnitude [23], physics-informed CNNs were proposed for reconstructing sound fields generated by point sources [26], and MultiResUNet was used for microphone-array-based room impulse response interpolation [24].
Our research focuses on reconstructing various types of sound fields across an entire spatial region, using a limited number of discrete observations. The work is based on Gaussian processes (GPs), which are powerful probabilistic models that capture the spatial correlation in the field through a kernel function and also handle the uncertainty associated with the field's variations. The work in [22] presents a pioneering approach to using GPs for sound field reconstruction, demonstrating the significant potential of this technique. However, one crucial aspect that strongly influences the performance of GP models is the choice of kernel function, and several questions regarding kernel selection remain unresolved. Firstly, current work employs pre-defined kernels whose parameters are adapted solely from the observations, resulting in limited expressiveness. Secondly, current work has primarily focused on sound field reconstruction of far-field sources or sparsely distributed sources in reverberant rooms. The kernel functions used in prior work do not adequately capture near-field acoustic properties. Hence, there is potential for further exploration into other types of sound fields, such as near-field sources and standing waves.
In summary, identifying an appropriate kernel with optimal kernel function parameters for various types of sound fields can be challenging. To address this issue, this paper proposes a novel data-driven approach to reconstruct the magnitude of the sound pressure using neural processes (NPs) [29]. NPs enable us to parameterize GPs using a deep neural network. In addition, we introduce dynamic kernels that can effectively adapt to the properties of diverse sound fields by leveraging attention mechanisms. The motivations for modeling the sound field magnitude, as in [23], are as follows. (1) The human auditory system is more sensitive to changes in sound magnitude than to changes in phase. Therefore, capturing and reconstructing the magnitude can often be sufficient for achieving perceptually accurate results. (2) Reconstructing only the magnitude simplifies the training complexity.
In this paper, the primary objective is to achieve an accurate reconstruction of sound field magnitudes using minimal observations that are arbitrarily and irregularly distributed. The paper is structured as follows. Section 2 provides a review of the GP model, including commonly used kernel functions, and highlights the limitations of this model. Building on this, Section 3 presents the conceptual framework and neural network architecture details of the proposed approach using NPs. Section 4 outlines the training procedure and presents results on the reconstruction accuracy of the proposed method, in comparison with the conventional linear regression models and data-driven models.

GPs methodology
The problem is defined as reconstructing a sound field within a specific area of interest using only a finite set of observations, denoted as $\tilde{\mathbf{u}} = [\tilde{u}(\mathbf{r}_1, \omega), \ldots, \tilde{u}(\mathbf{r}_N, \omega)]^{\mathsf{T}}$, where $\mathbf{r}_n$ are the spatial locations and $\omega$ is the angular frequency. Hereafter, $\omega$ is omitted for notational simplicity. The observed pressure $\tilde{u}(\mathbf{r})$ at a location $\mathbf{r}$ is represented as

$$\tilde{u}(\mathbf{r}) = f(\mathbf{r}) + e(\mathbf{r}), \tag{1}$$

where the true sound field $f(\mathbf{r})$ cannot be directly observed or measured and $e(\mathbf{r})$ denotes the measurement noise [22].
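Assuming the sound field in the space is a zero-mean complex GP, the sound pressure within that space follows a complex Gaussian distribution

$$f(\mathbf{r}) \sim \mathcal{CGP}\big(0, \kappa(\mathbf{r}, \mathbf{r}')\big), \tag{2}$$

where the covariance function, or kernel, $\kappa(\mathbf{r}, \mathbf{r}')$ of the sound pressures between the spatial locations $\mathbf{r}$ and $\mathbf{r}'$ is written as

$$\kappa(\mathbf{r}, \mathbf{r}') = \mathbb{E}\big[f(\mathbf{r})\, f(\mathbf{r}')^{\mathsf{H}}\big]. \tag{3}$$

The measurement noise in (1) is also assumed to be complex Gaussian with zero mean,

$$e(\mathbf{r}) \sim \mathcal{CGP}\big(0, \kappa_e(\mathbf{r}, \mathbf{r}')\big). \tag{4}$$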
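To predict the sound pressure at a new location $\mathbf{r}_*$, we need to compute the posterior distribution of $u_*(\mathbf{r})$ given the observed data $\tilde{u}(\mathbf{r})$ and the kernel parameters. This can be done using the conditional distribution of a multivariate normal distribution [30],

$$u_*(\mathbf{r}) \mid \tilde{u}(\mathbf{r}) \sim \mathcal{CGP}\big(\mu_{u_*|\tilde{u}}(\mathbf{r}),\, \kappa_{u_*|\tilde{u}}(\mathbf{r}, \mathbf{r}_*)\big), \tag{5}$$

where $\mu_{u_*|\tilde{u}}(\mathbf{r})$ is the predictive mean and $\kappa_{u_*|\tilde{u}}(\mathbf{r}, \mathbf{r}_*)$ is the kernel between the observed positions and the predictive position.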
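The optimal sound field reconstruction is the posterior mean in (5), that is

$$\mu_{u_*|\tilde{u}}(\mathbf{r}) = \boldsymbol{\kappa}^{\mathsf{H}} (\mathbf{K} + \boldsymbol{\Sigma})^{-1} \tilde{\mathbf{u}}, \tag{6}$$

where the kernel vector $\boldsymbol{\kappa} = [\kappa(\mathbf{r}_1, \mathbf{r}_*), \ldots, \kappa(\mathbf{r}_N, \mathbf{r}_*)]^{\mathsf{T}}$ is the spatial correlation function between the N observed pressures and the predictive location $\mathbf{r}_*$, and the covariance matrices $\mathbf{K}$ and $\boldsymbol{\Sigma}$ are defined as [31]

$$[\mathbf{K}]_{ij} = \kappa(\mathbf{r}_i, \mathbf{r}_j), \qquad [\boldsymbol{\Sigma}]_{ij} = \kappa_e(\mathbf{r}_i, \mathbf{r}_j). \tag{7}$$

The kernel function, which models the spatial correlation between the sound pressure measurements, is therefore a crucial part of sound field reconstruction using GPs. The choice of kernel function can have a significant impact on the accuracy and efficiency of the sound field reconstruction.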
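The posterior-mean computation described above can be illustrated with a minimal sketch. It evaluates the standard GP predictor $\boldsymbol{\kappa}^{\mathsf{H}}(\mathbf{K} + \boldsymbol{\Sigma})^{-1}\tilde{\mathbf{u}}$ under the assumption of white measurement noise ($\boldsymbol{\Sigma}$ equal to `noise_var` times the identity). The function names, the noise level, and the squared-exponential kernel `sq_exp` are illustrative choices and are not taken from the paper.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (real or complex)."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior_mean(kernel, r_obs, u_obs, r_star, noise_var=1e-6):
    """Posterior mean kappa^H (K + Sigma)^{-1} u, with Sigma = noise_var * I (assumption)."""
    n = len(r_obs)
    K = [[kernel(r_obs[i], r_obs[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    weights = solve(K, u_obs)
    kappa = [kernel(r, r_star) for r in r_obs]
    return sum(complex(k).conjugate() * w for k, w in zip(kappa, weights))

def sq_exp(r1, r2, rho=0.5):
    """Illustrative squared-exponential kernel (assumed form, for demonstration only)."""
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return math.exp(-d2 / (2 * rho ** 2))

r_obs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
u_obs = [1.0, 0.5, -0.2]
mu_at_obs = gp_posterior_mean(sq_exp, r_obs, u_obs, (0.0, 0.0))
mu_new = gp_posterior_mean(sq_exp, r_obs, u_obs, (0.5, 0.5))
```

With a very small noise variance, the prediction at an observed location essentially reproduces the observation, while predictions elsewhere are weighted combinations of the observations determined by the kernel.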

Kernel functions
Kernels for sound field representation are typically categorized based on their properties of stationarity and isotropy. It is vital to choose or develop a kernel function that aligns with the characteristics of the sound field in GP methodology. For instance, a diffuse field demonstrates stationary and isotropic spatial correlation, while a plane wave field presents stationary but anisotropic spatial correlation. Below are some frequently applied kernel functions in audio and acoustics research [32].

RBF kernels
The isotropic radial basis function (RBF) kernel, Eq. (8), is defined in terms of the scaling factor α, which adjusts the kernel function to match the scale of the data, the length scale ρ, which defines the decay rate of the kernel, and the Euclidean distance δ(r − r′) between two points.
The anisotropic RBF kernel, Eq. (9), is defined in terms of the unit vectors u_l ∈ R^D, each of which defines the lth direction, and ρ_l, the length scale of the corresponding direction.
The periodic RBF kernel, Eq. (10), is derived from Eq. (9) and repeats every wavelength λ = 2π/k.

The plane waves kernels
Plane-wave expansions serve as a widely used method in sound field reconstruction. By decomposing the sound field into a sum of plane waves with varying amplitudes, directions, and frequencies, it becomes possible to reconstruct the field by determining their respective amplitudes and phases [14,17,33,34]. That is, at the wavenumber k, the field at any point in space $\mathbf{r}$ can be expressed as

$$f(\mathbf{r}) = \sum_{l=1}^{L} w_l\, e^{-j \mathbf{k}_l^{\mathsf{T}} \mathbf{r}}, \tag{11}$$

where $w_l$ are unknown weights, $e^{-j \mathbf{k}_l^{\mathsf{T}} \mathbf{r}}$ is the elementary wave function, and $\mathbf{k}_l = k\mathbf{u}_l$ is the wavenumber vector.
If the weights $w_l$ are also modeled as complex Gaussian such that

$$w_l \sim \mathcal{CN}(0, \sigma_w^2), \tag{12}$$

the kernel for the sound field that is generated by multiple sound sources [35,36] is defined as

$$\kappa(\mathbf{r}, \mathbf{r}') = \sigma_w^2 \sum_{l=1}^{L} e^{-j \mathbf{k}_l^{\mathsf{T}} (\mathbf{r} - \mathbf{r}')}, \tag{13}$$

where the weights $w_l$ share the same variance $\sigma_w^2$.
In the special case where the sound field is generated by only a few sources, which is normally characterized as sparse [37,38], the kernel, Eq. (14), is defined such that the variances of the weights $w_l$ are independent, and $\sigma_l$ are considered inverse-gamma distributed [39],

$$p(\sigma_l \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, \sigma_l^{-a-1} \exp\!\left(-\frac{b}{\sigma_l}\right), \tag{15}$$

where a > 0 is the shape parameter and b > 0 is the scale parameter of the density function. With a fixed prior a, smaller values of b promote sparser solutions.
The concept of the hierarchical kernel $\kappa_h$ is introduced in [22]. To adapt to both normal and sparse sound fields, the hierarchical kernel redefines the parameters $\sigma_l$ in (15), as given by Eq. (16).
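As an illustration of the plane-wave kernel for weights that share one variance, the following minimal sketch evaluates $\sigma_w^2 \sum_l e^{-j \mathbf{k}_l^{\mathsf{T}}(\mathbf{r} - \mathbf{r}')}$ for a set of unit direction vectors. The number of directions, the frequency, and the speed of sound (343 m/s) are assumptions chosen only for the example.

```python
import cmath
import math

def plane_wave_kernel(r1, r2, k, directions, sigma_w2=1.0):
    """kappa(r, r') = sigma_w^2 * sum_l exp(-j k u_l^T (r - r')) for unit vectors u_l."""
    total = 0j
    for u in directions:
        phase = k * sum(ui * (a - b) for ui, a, b in zip(u, r1, r2))
        total += cmath.exp(-1j * phase)
    return sigma_w2 * total

L = 8  # number of plane-wave directions (illustrative)
directions = [(math.cos(2 * math.pi * l / L), math.sin(2 * math.pi * l / L))
              for l in range(L)]
k = 2 * math.pi * 200.0 / 343.0  # wavenumber at 200 Hz, assuming c = 343 m/s

k_aa = plane_wave_kernel((0.3, 0.1), (0.3, 0.1), k, directions)
k_ab = plane_wave_kernel((0.3, 0.1), (1.0, 0.7), k, directions)
k_ba = plane_wave_kernel((1.0, 0.7), (0.3, 0.1), k, directions)
```

The kernel depends only on the difference r − r′, so it is stationary, and swapping the two arguments conjugates the value, as expected for a complex covariance function.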

The diffuse field kernel
For a diffuse field driven by a pure tone, the spatial correlation and coherence can be modeled by the superposition of an infinite number of random-phase plane waves [34]. That is, the diffuse field kernel function corresponding to (11) in the limit L → ∞ is written as follows,

$$\kappa(\mathbf{r}, \mathbf{r}') = \frac{\sin\big(k \lVert \mathbf{r} - \mathbf{r}' \rVert\big)}{k \lVert \mathbf{r} - \mathbf{r}' \rVert}. \tag{17}$$

For the two-dimensional case, the kernel in (17) is the zeroth-order Bessel function

$$\kappa(\mathbf{r}, \mathbf{r}') = J_0\big(k \lVert \mathbf{r} - \mathbf{r}' \rVert\big). \tag{18}$$

In summary, when attempting sound field reconstruction using GPs, it is necessary to understand the characteristics of the sound field and select the appropriate kernel function. Once the optimal kernel function is determined, Eq. (6) can be utilized to obtain the predictive sound field pressure. However, if there is no suitable kernel function available, a custom kernel may need to be derived. Nevertheless, developing a kernel that can effectively adapt to diverse acoustic environments can be challenging, particularly when dealing with complex sound fields. Additionally, estimating the optimal hyperparameters of the kernel through numerous experiments can be a time-consuming process.
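The link between the plane-wave superposition and the two-dimensional Bessel kernel can be checked numerically. The sketch below averages $e^{-jk\mathbf{u}^{\mathsf{T}}(\mathbf{r}-\mathbf{r}')}$ over many evenly spaced directions in the plane and compares the result with $J_0(kd)$, computed from the integral representation $J_0(x) = \frac{1}{\pi}\int_0^{\pi}\cos(x \sin t)\,dt$. The helper names and the number of directions are illustrative assumptions, and the kernel is normalized to unit variance.

```python
import math

def bessel_j0(x, steps=2000):
    """Zeroth-order Bessel function via J0(x) = (1/pi) * int_0^pi cos(x sin t) dt (midpoint rule)."""
    h = math.pi / steps
    return sum(math.cos(x * math.sin((i + 0.5) * h)) for i in range(steps)) * h / math.pi

def averaged_plane_wave_kernel(d, k, n_dirs=720):
    """Average of exp(-j k d cos(theta)) over n_dirs evenly spaced directions.

    Only the real part is accumulated because the imaginary parts cancel by symmetry.
    """
    return sum(math.cos(k * d * math.cos(2 * math.pi * l / n_dirs))
               for l in range(n_dirs)) / n_dirs

k = 3.0   # wavenumber in rad/m (illustrative)
d = 0.5   # distance between the two points in m (illustrative)
approx = averaged_plane_wave_kernel(d, k)
exact = bessel_j0(k * d)
```

For a large number of directions the two values agree closely, which is the two-dimensional counterpart of the L → ∞ limit discussed above.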

Proposed method
In this work, we propose a novel approach to automatically obtain the optimal kernel from the magnitude of the sound field data for reconstruction, using a data-driven model based on NPs with attention mechanisms.
Our proposed model generates dynamic kernels that can adapt to the unique properties of various sound fields and defines distributions over sound field functions similar to GPs.This combination provides a probabilistic, data-efficient, flexible, and computationally efficient solution for optimal kernel selection.
In this section, we first detail the overall architecture of our method in Section 3.1, and then introduce the proposed two-stream encoder and the efficient and lightweight decoder in Sections 3.2 and 3.3, respectively.

Architecture
As shown in Fig. 1, the proposed model is composed of an encoder and a decoder. Specifically, the encoder contains two paths: a GPs parameterized path, which models the global structure of the stochastic process realization, and a dynamic kernel path, which captures the spatial correlation between observations and predictions.
The encoder takes a limited set of observed sound field magnitude measurements along with their corresponding locations $(\mathbf{r}, p)_{i \in C(0,N)}$ as input, where $p = |u|$ and C denotes the set of integers from 0 to N. Within the GPs parameterized path, the encoder outputs a latent variable $\mathbf{z}$, which encodes the global structure and uncertainty of the sound field distributions in the function space. In the dynamic kernel path, given the target location $\mathbf{r}_*$, the dynamic kernel mechanism outputs a correlation-specific representation $\mathbf{v}_*$. Since the dynamic kernel models the spatial correlation between observations and predictions using differentiable attention, which cannot be analytically obtained and acts as an implicit kernel, we visualize it in Section 4.6.
Fig. 1 Schematic diagram of the neural network architecture proposed for sound field reconstruction
The decoder takes the latent variable $\mathbf{z}$, the correlation-specific representation $\mathbf{v}_*$, and the target location $\mathbf{r}_*$ as input and produces the predictive sound field magnitude $p_*$ of the target location. This process can be understood as analogous to reconstructing the sound field using an appropriate kernel, with the neural network carrying out the calculation described by (5).

Encoder
In this section, we introduce the structure and mechanism of the encoder with the two distinct paths.

GPs parameterized using NPs
The GPs parameterized path is designed to learn distributions over sound field functions from observations. To represent a GP using a neural network, we assume that $F(x) \sim \mathcal{GP}(\mu, \sigma)$ can be parameterized by a high-dimensional random vector $\mathbf{z}$, i.e., the latent variable [29]. We can then write $F(x) = g(x, \mathbf{z})$ for some fixed and learnable function g, where $\mathbf{z}$ models different realizations of the data-generating GPs [40]. The motivation for introducing $\mathbf{z}$ is to enable our model to capture different types of sound fields.
In the GPs parameterized path, the observed sound field magnitudes in the frequency-spatial domain $(\mathbf{r}, p)_i$ are embedded from the input space to the representation space using fully connected layers with Gaussian Error Linear Unit (GELU) [41] activation functions. In our approach, we incorporate a self-attention (SA) mechanism [42], denoted as $\mathbf{s}_i = \mathrm{SA}(\mathbf{r}_i, p_i)$, to model higher-order interactions within the sound field. The SA mechanism allows us to capture the interactions among the observations, enabling the learning of global structural features of the sound field and yielding richer representations of the observations. The mean aggregator is used to combine the features as $\mathbf{s} = m(\mathbf{s}_i)$, and a Multi-Layer Perceptron (MLP) then generates a single global representation, which parameterizes the latent distribution $\mathbf{z} \sim \mathcal{GP}(\mu_z, \sigma_z)$. Finally, each sample of $\mathbf{z}$ corresponds to one realization of the GPs, capturing the global uncertainty.
In summary, the GPs parameterized path learns the mapping from the observed data to the latent distribution of the GPs, representing Eq. (5) by the neural network. Following this framework, the kernel function is not explicitly defined but is learned through the neural network's parameters, which is described in detail below.

Dynamic kernel-based attention mechanism
In GPs, the kernel function captures the relationship between pairs of inputs by computing the dot product between their corresponding feature maps. Here, the kernel is defined as $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle$, where $\phi$ represents the feature map that maps the inputs into a higher-dimensional feature space. The advantage of using such a kernel is that it allows us to design algorithms based on dot-product spaces [43]. In our approach, we introduce a dynamic kernel mechanism inspired by the Scaled Dot-Product Attention (SDPA) [42]. This dynamic kernel mechanism enables us to model the spatial correlation present in diverse sound fields. More specifically, the target location $\mathbf{r}_*$ is treated as a query, while the observations $(\mathbf{r}, \mathbf{v})_i$ are treated as key-value pairs. Here, $\mathbf{v}_i$ represents the transformation of $p_i$ into a higher-dimensional space through embedding. Similarly, both $\mathbf{r}_i$ and $\mathbf{r}_*$ undergo embedding within the dynamic kernel mechanism. The SDPA mechanism allows us to calculate weights that determine the correlation of each observation with respect to the target location, enabling accurate prediction of the sound field magnitude $p_*$ at the target location.
Suppose we have n key-value pairs arranged as matrices $\mathbf{R} \in \mathbb{R}^{n \times d_r}$, $\mathbf{V} \in \mathbb{R}^{n \times d_v}$, and m queries $\mathbf{R}_* \in \mathbb{R}^{m \times d_r}$. The dynamic kernel mechanism calculates correlation weights $\kappa_d$ by taking the dot product of the queries and keys scaled by $\sqrt{d_r}$, i.e., the kernel form [44], and applies $\kappa_d$ to $\mathbf{V}$ to obtain the output $\mathbf{V}_*$, which gives

$$\mathbf{V}_* = \kappa_d \mathbf{V} = \operatorname{softmax}\!\left(\frac{\mathbf{R}_* \mathbf{R}^{\mathsf{T}}}{\sqrt{d_r}}\right) \mathbf{V}. \tag{19}$$

In addition to using a single dynamic kernel, we further propose using a multi-dynamic kernel to achieve linear smoother query values [42,44]. As expressed in Eq. (20), the multi-dynamic kernel is obtained as the sum of h kernels mapped with different weights W.
For each target location $\mathbf{r}_*$, the dynamic kernel generates an attention map between $\mathbf{r}_*$ and the observations $(\mathbf{r}, p)_i$, which is learned entirely from the data. This allows our proposed model to make more accurate predictions in environments with different acoustic properties. The visualization of this part is shown in Section 4.6.
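A single dynamic kernel of the scaled dot-product form can be sketched in a few lines. The sketch below assumes the softmax normalization of standard SDPA and works on already embedded queries, keys, and values; the embedding networks, the multi-kernel extension, and all names used here are omitted or hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax of a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def dynamic_kernel(queries, keys, values):
    """Return (weights, outputs) with weights[i][j] = softmax_j(q_i . k_j / sqrt(d_r))."""
    d_r = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d_r) for key in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * values[j][c] for j in range(len(values)))
                        for c in range(len(values[0]))])
    return weights, outputs

# Three embedded observation locations (keys) with embedded magnitudes (values),
# and one embedded target location (query). All numbers are illustrative.
keys = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
values = [[1.0], [2.0], [3.0]]
weights, outputs = dynamic_kernel([(5.0, 0.0)], keys, values)
```

Each row of `weights` sums to one and plays the role of the attention map between one target location and all observations, so the output is a weighted average of the observation embeddings.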

Decoder
The decoder takes the latent variable $\mathbf{z}$, the correlation-specific representation $\mathbf{v}_*$, and the target location $\mathbf{r}_*$ as input. We define a Gaussian likelihood to describe the decoder, that is

$$\pi(p_* \mid \mathbf{z}, \mathbf{v}_*, \mathbf{r}_*) = \mathcal{N}\big(p_* \mid g_\theta(\mathbf{r}_*, \mathbf{z}),\, \tau^{-1}\big), \tag{21}$$

where $\mathbf{z}$ is a global latent variable, $g_\theta(\mathbf{r}_*, \mathbf{z})$ is a decoder function that generates a prediction for the target sound field magnitude $p_*$ at a location $\mathbf{r}_*$ and is implemented as a deep neural network with parameters θ, and $\tau^{-1}$ is the variance of the observation noise [29,45]. Specifically, the likelihood $\pi(p_* \mid \mathbf{z}, \mathbf{v}_*, \mathbf{r}_*)$ is defined as a factorized Gaussian distribution across the predictions $(\mathbf{r}_*, p_*)$ with mean and variance determined by $\mathbf{z}$ and the correlation-specific representation $\mathbf{v}_*$.
To generate the predictive sound field magnitude $p_*$, the proposed model is defined by Eq. (22). Since the conditional prior $\pi(\mathbf{z} \mid \mathbf{s})$ in Eq. (22) is intractable, it is approximated using the variational posterior $q(\mathbf{z} \mid \mathbf{s})$ [29], Eq. (23), in which $m(\cdot)$ is a mean aggregator function, and $\mu_\omega(\cdot)$ and $\sigma_\omega(\cdot)$ parameterize a normal distribution from which $\mathbf{z}$ is sampled.

Loss function
The parameters of the encoder and decoder are learned by maximizing the evidence lower bound (ELBO), Eq. (24). The objective function consists of two terms. The first term is the reconstruction error (RE), which is equivalent to the mean squared error (MSE) [46]. We denote this term as $L_D$, and it measures the discrepancy between the predicted output $p_*$ and the corresponding ground truth $p_\circ$. The MSE is computed over all the elements, denoted as N. The second term is the Kullback-Leibler (KL) divergence [47], which is a measure of dissimilarity between two probability distributions. It quantifies the difference between the distribution of observed data $q(\mathbf{z} \mid \mathbf{s})$ and the distribution of predicted data $q(\mathbf{z} \mid \mathbf{s}_*)$ during the training process.
To achieve a balance between data reconstruction and meaningful representation learning, we assign equal weights to both terms during training.
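The two-term objective can be sketched as follows, under stated assumptions: the latent distributions are diagonal Gaussians given as (mean, standard deviation) pairs, the reconstruction term is the plain MSE, the KL term is taken in the direction KL(q(z | s_*) || q(z | s)) that is common in the NP literature, and both terms are weighted equally. The function names are hypothetical and the sketch is not the authors' implementation.

```python
import math

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ), summed over independent dimensions."""
    return sum(math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
               for mq, sq, mp, sp in zip(mu_q, sig_q, mu_p, sig_p))

def mse(pred, truth):
    """Mean squared error over all elements."""
    return sum((a - b) ** 2 for a, b in zip(pred, truth)) / len(truth)

def np_loss(pred, truth, target_post, context_post):
    """Equal-weight loss: MSE + KL(q(z|s_*) || q(z|s)); each posterior is (means, stds)."""
    return mse(pred, truth) + kl_diag_gauss(*target_post, *context_post)

pred, truth = [1.0, 2.0, 2.5], [1.0, 2.0, 3.0]
context_post = ([0.0, 0.0], [1.0, 1.0])   # q(z | s), illustrative values
target_post = ([1.0, 0.0], [1.0, 1.0])    # q(z | s_*), illustrative values
loss = np_loss(pred, truth, target_post, context_post)
```

Minimizing this quantity corresponds to maximizing an ELBO-style objective only up to constants and the assumed noise variance, so the sketch should be read as a simplified surrogate.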

Simulation experiments
We evaluated the performance of our proposed sound field reconstruction model in comparison to the GPs and data-driven models. The sound fields we reconstructed included both spatially stationary and non-stationary fields, such as a diffuse field and point sources in the near-field. Additionally, we reconstructed simulated room transfer functions (RTFs) using the image source method [48] and modal theory [34]. Our reconstruction was carried out on a two-dimensional grid composed of 32 by 32 uniformly spaced points along the relevant dimensions. The absolute distance between input points is determined by the room size. Specifically, the distance between points along the x-axis is $l_x/32$, and the distance between points along the y-axis is $l_y/32$. To ensure scale independence in the learning process, it is common to standardize the input for each frequency. This standardization involves transforming the input values such that they have a mean of 0 and a standard deviation of 1.
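The grid layout and the per-frequency standardization can be sketched as below. The room dimensions and magnitude values are made up for illustration, and the use of the population standard deviation is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, pstdev

def grid_points(lx, ly, n=32):
    """n-by-n grid with spacing lx/n along x and ly/n along y."""
    return [(i * lx / n, j * ly / n) for i in range(n) for j in range(n)]

def standardize(values):
    """Shift and scale one frequency's values to zero mean and unit standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

points = grid_points(4.0, 3.0)                    # hypothetical 4 m by 3 m room
z = standardize([0.2, 0.9, 1.4, 0.7, 1.1])        # magnitudes at one frequency (illustrative)
```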

Evaluation metrics
We use two metrics to evaluate the performance of our models. The first metric is the normalized mean square error (NMSE) between the ground truth $\mathbf{p}_\circ$ and the predictions $\mathbf{p}_*$ for each frequency point k, i.e., the squared error between $\mathbf{p}_*$ and $\mathbf{p}_\circ$ normalized by the energy of $\mathbf{p}_\circ$. The second metric is the Modal Assurance Criterion (MAC) [49] for each frequency point k, which is defined as follows,

$$\mathrm{MAC}_k = \frac{\lvert \mathbf{p}_\circ^{\mathsf{H}} \mathbf{p}_* \rvert^2}{(\mathbf{p}_\circ^{\mathsf{H}} \mathbf{p}_\circ)(\mathbf{p}_*^{\mathsf{H}} \mathbf{p}_*)}.$$

The MAC measure evaluates the level of spatial similarity by determining how well the model predicts the overall shape of the pressure distribution in the sound field for each frequency point. The MAC values range from 0 (indicating maximum dissimilarity) to 1 (representing identical shapes), providing a quantitative measure of the quality of the model's predictions.
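Both metrics are easy to compute for real-valued magnitude vectors at a single frequency, as in the sketch below. Whether the NMSE is reported as a ratio or in decibels is not recoverable from the text here, so the `in_db` switch is an assumption, as are the function names.

```python
import math

def nmse(truth, pred, in_db=True):
    """Squared error normalized by the energy of the ground truth (optionally in dB)."""
    err = sum((p - t) ** 2 for t, p in zip(truth, pred))
    ref = sum(t ** 2 for t in truth)
    ratio = err / ref
    return 10 * math.log10(ratio) if in_db else ratio

def mac(truth, pred):
    """Modal Assurance Criterion for real vectors: |t.p|^2 / ((t.t)(p.p))."""
    num = sum(t * p for t, p in zip(truth, pred)) ** 2
    return num / (sum(t * t for t in truth) * sum(p * p for p in pred))

truth = [1.0, 2.0, 3.0]
pred = [1.1, 2.2, 3.3]      # same shape as the truth, 10 % too large
nmse_ratio = nmse(truth, pred, in_db=False)
nmse_db = nmse(truth, pred)
mac_value = mac(truth, pred)
```

The example shows why the two metrics complement each other: a prediction with the correct spatial shape but the wrong overall level has a MAC of one while its NMSE is nonzero.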

Training procedure
Our proposed model can be trained end-to-end on simulated data.To optimize the model, we use the Adam optimizer [50] and train it for 300 epochs.The base learning rate is initially set to 1e−4 and decays to 1e−5 after 200 epochs.Moreover, to achieve better performance and stability during the training process, we implement an exponential warm-up strategy throughout the first 20 epochs.
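The learning-rate schedule described above could look like the sketch below. The text does not give the exact warm-up formula or its starting value, so the geometric ramp and the `start_factor` parameter are assumptions, as is the step change from 1e−4 to 1e−5 exactly at epoch 200.

```python
def learning_rate(epoch, base_lr=1e-4, final_lr=1e-5, warmup_epochs=20,
                  decay_epoch=200, start_factor=0.01):
    """Learning rate for a given epoch (0-based).

    Warm-up: geometric (exponential) ramp from base_lr * start_factor to base_lr
    over the first warmup_epochs epochs (assumed form).
    Afterwards: base_lr until decay_epoch, then final_lr.
    """
    if epoch < warmup_epochs:
        return base_lr * start_factor ** (1 - epoch / warmup_epochs)
    if epoch < decay_epoch:
        return base_lr
    return final_lr

schedule = [learning_rate(e) for e in range(300)]
```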

Spatially stationary field
In this section, we explore the reconstruction of the diffuse field, which is modeled by the superposition of an infinite number of random-phase plane waves, as shown in Eq. (17). This type of sound field is particularly relevant to the sound field present in reverberation rooms [51].
To evaluate the performance of our proposed model, we conduct experiments on simulated data. Specifically, we estimate the sound field magnitudes in the frequency band [30, 500] Hz on a 32 by 32 grid, given 10 arbitrarily placed observations. The simulated data is generated by using m plane waves with unit magnitude, random phase, i.e., $\angle u_l \sim \mathcal{U}[0, 2\pi)$, and random direction of propagation, i.e., $\mathbf{k}_l \sim \mathcal{U}[-k, k]$. Here, m is randomly sampled from the range $m \in (1000, 3000)$. To train our proposed model, we use a diverse set of 8000 diffuse fields generated according to the above parameter settings.
In order to evaluate the effectiveness of our proposed model, we compare it against GPs with different kernels, including the Bessel kernel, hierarchical kernel, and RBF kernels. The prior densities of the parameters in Eqs. (8)-(10) are defined as $\alpha \sim \mathcal{N}(0, 1)$, and $\rho$ and $\rho_l \sim \Gamma^{-1}(a_\rho, b_\rho)$, where $a_\rho = 5$ and $b_\rho = 5$. For the hierarchical kernel in Eq. (16), the parameters are set as $b = 10^{-b_{\log}}$ and $b_{\log} \sim \mathcal{N}(2, 1)$ [22]. The fields are normalized so that their mean magnitude is 1 Pa. The parameter settings and scaling align with the original work [52].
In Table 1, we present the mean performance of our proposed model compared to GPs on a diverse set of 1000 diffuse fields. The results show that our model achieves better reconstruction performance. To provide a detailed visualization of the reconstruction process, we selected a sound field from the test set. Figure 2 depicts the sound field magnitudes of the reconstructed data at various frequencies. The Bessel kernel performs relatively well because it matches the diffuse field. The hierarchical kernel exhibits a certain level of adaptability to the properties of the sound field, enabling it to capture the structure of the diffuse field. However, in regions where there are no observations, such as the upper left corner, all kernels extrapolate the sound field poorly, particularly at 500 Hz. This phenomenon highlights the limitations of the GPs method in accurately capturing the complex behavior of the sound field in sparsely sampled regions.
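The data generation described above can be sketched as follows. The sketch draws each propagation direction as a uniformly distributed angle in the plane, which is one possible reading of the direction sampling, and uses a hypothetical room size and speed of sound (343 m/s). The example call uses far fewer plane waves and a coarser grid than the text so that it runs quickly in pure Python.

```python
import cmath
import math
import random

def simulate_diffuse_field(freq, n_waves, lx=4.0, ly=4.0, n=32, c=343.0, seed=0):
    """Magnitude of a sum of unit-amplitude plane waves with random phase and direction.

    Returns an n-by-n list of magnitudes, normalized to a mean magnitude of 1.
    """
    rng = random.Random(seed)
    k = 2 * math.pi * freq / c
    # Each wave: (random phase, random propagation angle); the angle model is an assumption.
    waves = [(rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi))
             for _ in range(n_waves)]
    field = []
    for i in range(n):
        row = []
        for j in range(n):
            x, y = i * lx / n, j * ly / n
            u = sum(cmath.exp(1j * (ph - k * (math.cos(th) * x + math.sin(th) * y)))
                    for ph, th in waves)
            row.append(abs(u))
        field.append(row)
    mean_mag = sum(map(sum, field)) / (n * n)
    return [[v / mean_mag for v in row] for row in field]

field = simulate_diffuse_field(200.0, n_waves=100, n=8)
```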
In comparison, the proposed model achieves the best performance due to the proposed attention-based dynamic kernel mechanism, which enables the model to effectively capture the global sound field and obtain richer representations.This enhances the model's overall performance, enabling it to outperform other approaches.

Spatially non-stationary field
In this section, we discuss the process of reconstructing the sound field in the near-field created by multiple point sources.This type of sound field is particularly relevant to the direct component of the Room Impulse Response (RIR) [21,53].The direct component of the RIR provides critical information about the room geometry [54].
To train our model, we created a dataset consisting of 8000 simulated sound fields. Each field is composed of a random number of point sources, denoted as $j \sim \mathcal{U}[1, 6]$, which are randomly distributed. Each point source is positioned at a radial distance, represented by d ∼ U( , 3 ), from the central point of the reconstruction area. The parameters of the GPs method are set as in Section 4.3.
Table 2 shows the mean performance of our proposed model and GPs on a diverse set of 1000 near-fields. To provide a detailed reconstruction demonstration, we selected a sound field from the test set for visualization. Figure 3 shows the reconstruction of the near-field produced by five point sources evenly distributed at 2 m from the center of the reconstructed area. From the figure, we see that the GPs method with existing kernels fails to follow the inverse distance law in the reconstructed pressure amplitude. This discrepancy arises from the mismatch between the kernel functions and the properties of the sound field. Specifically, the magnitude of the reconstructed sound field is relatively small near the sources (i.e., the edge of the reconstruction area), while the magnitude is excessively large at locations further away from the sources (i.e., the center of the reconstruction area). In addition, the kernels are poor for source localization, making it difficult to distinguish the location or even the number of sound sources from Fig. 3.
As expected, the proposed model demonstrates superior performance in accurately reconstructing sources with varying numbers, orientations, and distances, particularly at 500 Hz. This outcome highlights the ability of the proposed model to generalize effectively and reconstruct diverse sound fields.

RTF magnitude reconstruction
RTFs are a crucial component for achieving immersive and interactive sound field reproduction in virtual reality applications [13]. They are the frequency-domain representation of RIRs, which typically comprise direct and reverberant components that can be modeled by spherical waves and diffuse fields, respectively [31]. In Sections 4.3 and 4.4, we demonstrated the superiority of our proposed model over the GPs method in both diffuse-field and near-field sound field reconstruction. To provide a fair comparison, we further compare our proposed model with a data-driven sound field reconstruction method based on a U-net-like neural network [23]. The training process and settings are in line with the original work [55]. We employed two simulation methods, the Image-Source Method (ISM) and Modal Theory (MT), to generate RTF datasets. We tested the ability of our model to reconstruct sound fields in simple small-sized rooms, as well as complex rooms with standing waves. Note that the trained networks are not specific to any particular room geometries or wall reflective properties but only leverage the limited set of observations within the reconstruction area of interest, demonstrating the versatility and practicality of our proposed approach.

ISM-RTFs dataset
The ISM for generating RIRs is widely used in sound field reconstruction [56,57], with the RIR generator [48] being a popular tool due to its simplicity and computational efficiency. The ISM-based approach is well-suited for small room sizes and simple geometries. In the frequency domain, the generated RTFs are represented by Eq. (28), where $\mathbf{r}_\beta$ are the vectors corresponding to the permutations of $(x_0 \pm x, y_0 \pm y, z_0 \pm z)$, γ is the integer vector triplet $(n_x, n_y, n_z)$, and $\mathbf{r}_\gamma = 2(n_x L_x, n_y L_y, n_z L_z)$ [31].
In our simulations, we investigated point source radiation in 2D rooms within the frequency range of [30, 500] Hz, where B = 4 and z = 0 as specified in Eq. (28). We conducted the simulations in 11,000 rectangular rooms with floor areas randomly sampled from 12 to 20 m². In each room, an omnidirectional source was placed in a uniformly sampled random location. We set the reverberation time to $T_{60} = 0.4$ s and the sampling frequency to $f_s = 48$ kHz, and simulated reflections up to the 3rd order. To assess the performance of our proposed model with a limited number of observations, we placed 10, 30, and 50 microphones at arbitrary positions on a 32 by 32 grid. From the dataset, we used 10,000 rooms for training and 1000 rooms for testing the model. We then analyzed the mean performance of the model across these test rooms.
As shown in Fig. 4, our proposed algorithm consistently outperforms the U-net model. Specifically, the proposed model achieves similar results with only 30 observations, while U-net requires 50 observations to achieve comparable performance. This improvement can be attributed to the dynamic kernel, which incorporates global information more comprehensively than the partial convolution employed in U-net [58]. This demonstrates the potential of our proposed model to reduce the number of required samples while maintaining its effectiveness.
Furthermore, we observe that the performance of our proposed model improves as the number of available observations increases. Although the performance slightly degrades with increasing frequency, the model still exhibits good performance in reconstructing RTFs in small rooms across most frequencies. These outcomes suggest that our algorithm is effective for reconstructing RTFs in small rooms.

MT-RTFs dataset
In order to investigate the potential of our proposed model for reconstructing complex sound field with standing waves, we generated a dataset using MT [55], i.e., the following equation where N is a triple summation across the modal order in each dimension (n x , n y , n z ) of the room, V is the room volume, ψ N (•) is the eigenfunctions (representing the mode shape), and ω N denotes eigenfrequencies (repre- senting the resonance frequency).The time constant τ N represents the characteristic time for a specific mode in a room to decay.It is a constant obtained by dividing the total sound energy in the room by the sound power absorbed by the walls related to that particular mode.Specifically, for each mode, τ N is calculated from the absorption coefficient determined using Sabine's equation [23].Here, we focus on 2D rectangular rooms ( 29) within the frequency band [30,500] Hz.We incorporate all room modes with eigenfrequencies f m below 600 Hz, and specifically set n z to 0 in Eq. ( 29).Consequently, the total number of modes can be calculated using the formula N = f m 2 /(c 2 /4n x n y ) [34].A reverberation time of T 60 = 0.4s is assumed.The training and test sets are split, and the room size and sound source location settings are the same as in Section 4.5.1.
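To make the mode selection concrete, the sketch below enumerates the modes of a 2D rectangular room with eigenfrequencies up to 600 Hz. It assumes the standard rigid-wall eigenfrequency formula $f = \frac{c}{2}\sqrt{(n_x/l_x)^2 + (n_y/l_y)^2}$, a speed of sound of 343 m/s, and a hypothetical 4 m by 3 m room; none of these values are taken from the paper.

```python
import math

def room_modes_2d(lx, ly, f_max=600.0, c=343.0):
    """List (eigenfrequency in Hz, n_x, n_y) for all 2D modes with frequency <= f_max.

    Uses the rigid-wall rectangular-room formula (assumption):
    f = (c / 2) * sqrt((n_x / lx)^2 + (n_y / ly)^2).
    """
    modes = []
    nx = 0
    while c / 2 * nx / lx <= f_max:
        ny = 0
        while True:
            f = c / 2 * math.hypot(nx / lx, ny / ly)
            if f > f_max:
                break
            modes.append((f, nx, ny))
            ny += 1
        nx += 1
    return sorted(modes)

modes = room_modes_2d(4.0, 3.0)
n_modes = len(modes)
```

The mode count grows roughly with the square of the cutoff frequency and with the floor area, which is consistent with the increasing modal complexity toward the upper end of the frequency band.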
Figure 5 depicts the mean performance of the proposed model in reconstructing the MT-RTFs dataset. Given 10, 30, and 50 observations, the proposed model consistently outperforms U-net, indicating its potential for effectively reconstructing standing waves. The advantage is particularly significant in the low-frequency range. As the reconstruction frequency approaches the highest eigenfrequency, the complexity of the modes increases, which leads to a decrease in the reconstruction performance. This phenomenon aligns with theoretical expectations, suggesting that a higher number of observations is required to improve robustness and overcome the challenges posed by undersampling [23,59].
In addition, a comparison of Figs. 4 and 5 shows that the deterioration in performance with increasing frequency is more noticeable in Fig. 5. The reason is that the ISM-RTFs dataset is more homogeneous than the MT-RTFs dataset. Specifically, the sound fields generated by ISM are produced in shoebox rooms with image reflections up to the 3rd order. This indicates a relatively sparse sound field with wavefronts in the space-time domain. Due to the transient nature of the wavefronts, this type of sound field is dense in the frequency domain. In contrast, the sound fields generated by MT are relatively sparse in the modal region of the sound field (up to the Schroeder frequency). As the frequency approaches the Schroeder frequency, the sound fields contain increasingly more modes and eventually become diffuse.

Dynamic kernel visualization
In this section, we demonstrate the spatial correlation between observations and target locations using the proposed dynamic kernel in Eq. (19). We select multiple rooms from both the ISM-RTFs and MT-RTFs datasets to visualize the sound fields and their spatial correlation at specific frequencies.
Figure 6a and b demonstrate that for the ISM-RTFs dataset, the correlation is stronger between observations in close proximity to the target location. Additionally, the dynamic kernel assigns relatively more attention to locations where the sound source is situated, i.e., the bottom left of Fig. 6a and the middle left of Fig. 6b, and less to areas where the sound field characteristics are less prominent, such as the top side of Fig. 6a and the right side of Fig. 6b. This reflects the validity of the dynamic kernel in apportioning attention across the global sound field. It also provides an explanation for the experimental results in Section 4.3, as the sound field reconstructed by the proposed method reflects the locations of the sources.
For the MT-RTFs dataset shown in Fig. 6c and d, similar conclusions can be drawn, with closer observations generally displaying a stronger correlation with the target location. Interestingly, the observations that correlate most strongly with the target point are not in proximity to it but rather at the bottom left of Fig. 6c and the bottom of Fig. 6d, where the structural features of the sound field are noticeable. This highlights the dynamic kernel's ability to learn from data. Furthermore, it is apparent that the sound field environment in the MT-RTFs dataset is more intricate than that of the ISM-RTFs dataset at the same frequency. This difference explains the proposed model's performance degradation in reconstructing the MT-RTFs dataset at higher frequencies.
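To make the idea of a data-dependent kernel concrete, the following minimal sketch computes attention weights between one target location and a set of observation locations. It uses generic scaled dot-product attention, which is an assumption: Eq. (19) is not reproduced in this section and the authors' formulation may differ. The projection matrices `w_q` and `w_k` are random stand-ins for learned parameters, and all positions and magnitudes are placeholder values.

```python
import numpy as np

def attention_weights(target_xy, context_xy, w_q, w_k):
    """Softmax-normalised scaled dot-product scores of targets against context points."""
    q = target_xy @ w_q                            # queries, shape (n_target, d)
    k = context_xy @ w_k                           # keys, shape (n_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])        # shape (n_target, n_context)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
context_xy = rng.uniform(0.0, 1.0, size=(10, 2))  # 10 hypothetical observation positions
context_mag = rng.normal(size=(10, 1))            # placeholder observed magnitudes
target_xy = np.array([[0.5, 0.5]])                # one hypothetical target position
w_q = rng.normal(size=(2, 16))                    # stand-in for a learned query projection
w_k = rng.normal(size=(2, 16))                    # stand-in for a learned key projection

weights = attention_weights(target_xy, context_xy, w_q, w_k)  # one weight per observation
prediction = weights @ context_mag                # attention-weighted combination
print(weights.round(3))
```

Each row of `weights` plays a role analogous to the dot colors in Fig. 6: one non-negative weight per observation, summing to one for a given target. Unlike a fixed covariance function, the weights depend on learned projections rather than on distance alone, which is how distant observations can receive high weight.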

Model generalization
To assess the generalization ability of our model, we combined the four datasets mentioned in Sections 4.3, 4.4, and 4.5 into a diverse dataset for both training and testing. We conducted experiments on the four types of sound fields, with 10 observations arbitrarily placed. In the comparison with U-net and GPs, we employed the best-performing hierarchical kernel for GPs.
As illustrated in Fig. 7, we observed a decline in performance for the model trained on the diverse dataset when compared to training on each individual dataset separately. This decline can be attributed to the varying data distributions present in each dataset. However, it is important to note that even with this decline, our proposed model still exhibited strong performance, particularly in terms of robustness at high frequencies.
This outcome indicates that our model can learn from diverse data and is applicable across various sound field scenarios. Although the varying data distributions affected its performance to some extent, the model remained resilient, particularly in capturing sound field characteristics at higher frequencies.
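For reference, the two metrics reported in Fig. 7 can be computed as in the sketch below. It assumes the usual definitions, namely the NMSE as the ratio of error energy to reference energy expressed in dB, and the MAC as the normalized squared inner product of the two fields. The paper's own definitions may differ in detail, and the function names are chosen for this example only.

```python
import numpy as np

def nmse_db(p_true, p_est):
    """Normalized mean square error in dB between a reference and an estimated field."""
    err = np.sum(np.abs(p_est - p_true) ** 2)
    ref = np.sum(np.abs(p_true) ** 2)
    return 10.0 * np.log10(err / ref)

def mac(p_true, p_est):
    """Modal Assurance Criterion: 1 for proportional fields, 0 for orthogonal ones."""
    num = np.abs(np.vdot(p_est, p_true)) ** 2  # vdot flattens and conjugates its first argument
    den = np.vdot(p_est, p_est).real * np.vdot(p_true, p_true).real
    return num / den

# Placeholder fields: the estimate is the reference scaled by 10 %.
p_true = np.array([1.0, 2.0, -1.0, 0.5])
p_est = 1.1 * p_true
print(nmse_db(p_true, p_est), mac(p_true, p_est))
```

The two metrics are complementary: the NMSE penalizes amplitude errors, whereas the MAC is insensitive to a global scale factor and measures only the similarity in spatial shape. This is why a uniformly scaled estimate still attains a MAC of one.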

Computational complexity analysis
Apart from enhancing the accuracy of reconstruction, the proposed model also offers a significant advantage in terms of computational complexity during inference. With a model size of 4.3 million parameters, the deterministic inference time is around 0.016 s on an NVIDIA Tesla K80 GPU. This estimate is based on predictions for 1000 different rooms.
In our experiments, we conducted model training for

Conclusion
In this work, we proposed a novel method that parameterizes GPs using a deep neural network based on Neural Processes. By introducing attention, our method learns dynamic kernels from simulated data, yielding a kernel that adapts to the acoustic properties of the sound field without many functional design restrictions. Numerical experiments demonstrate that our proposed method outperforms current methods in terms of reconstruction accuracy for a diverse range of sound fields. Future work involves validating our approach using real-world data and further developing the methodology for complex sound field reconstruction.

Fig. 2 Reconstructed diffuse-field magnitudes at different frequencies given 10 observations arbitrarily placed. The red dots indicate the observation locations used to reconstruct the predicted sound field magnitudes

Fig. 3 Reconstructed near-field magnitudes at different frequencies given 10 observations arbitrarily placed. The red dots indicate the observation locations used to reconstruct the predicted sound field magnitudes

Fig. 6 Visualization of the spatial correlation of RTFs at a specific frequency. The dots indicate the locations of the observations that were used to reconstruct the output of our model, and the white square denotes the target location whose magnitude is to be predicted. The color of the dots reflects the strength of the correlation between the observations and the target

Fig. 7 Normalized mean square error (NMSE) in dB and Modal Assurance Criterion (MAC) estimated from four datasets given 10 observations arbitrarily placed

Table 1 Mean NMSE and MAC on the diffuse-field test dataset at different frequencies given 10 observations arbitrarily placed

Table 2 Mean NMSE and MAC on the near-field test dataset at different frequencies given 10 observations arbitrarily placed