MIRACLE—a microphone array impulse response dataset for acoustic learning

This work introduces a large dataset comprising impulse responses of spatially distributed sources within a plane parallel to a planar microphone array. The dataset, named MIRACLE, encompasses 856,128 single-channel impulse responses and includes four different measurement scenarios. Three measurement scenarios were conducted under anechoic conditions. The fourth scenario includes an additional specular reflection from a reflective panel. The source positions were obtained by uniformly discretizing a rectangular source plane parallel to the microphone for each scenario. The dataset contains three scenarios with a spatial resolution of 23 mm at two different source-plane-to-array distances, as well as a scenario with a resolution of 5 mm for the shorter distance. In contrast to existing room impulse response datasets, the accuracy of the provided source location labels is assessed and additional metadata, such as the directivity of the loudspeaker used for excitation, is provided. The MIRACLE dataset can be used as a benchmark for data-driven modelling and interpolation methods as well as for various acoustic machine learning tasks, such as source separation, localization, and characterization. Two timely applications of the dataset are presented in this work: the generation of microphone array data for data-driven


Introduction
A room impulse response (RIR) characterizes the linear time-invariant acoustic propagation between a source and a receiver within a specific acoustic environment.RIRs are crucial for sound field auralization [1] as well as in the realm of room acoustics, where they are used for estimating acoustic properties of a room such as the reverberation time [2].
The emergence of data-driven methods in acoustics [3], particularly deep learning methods, has sparked increasing interest in the availability of rich, high-quality RIR datasets.These datasets play a pivotal role in the training of data-driven (interpolatory) sound field reconstruction methods [4][5][6][7], deep generative models [8,9], and augmentation methods [10].In addition, RIR datasets can be flexibly employed in order to synthesize acoustic training data for source localization and characterization [11], sound event detection, and speech separation tasks by convolving arbitrary source signals with RIRs [12][13][14].The same synthesis procedure can be employed for data-driven acoustic parameter estimation problems, such as blind reverberation time estimation [15,16] and others [17].
While data-driven methods often exhibit superior performance compared to conventional model-based methods, they require large amounts of realistic training data and are sensitive to variations of underlying probability distributions describing the data, also known as dataset shift [18].Experimental data is oftentimes not available or too time-consuming to acquire.Many data-driven methods across various application areas, such as speech enhancement and recognition [19,20], localization [11,21], sound field reconstruction [22], room acoustic parameter estimation [23,24], and acoustical engineering [25][26][27], are therefore trained with simulated data, whereby enhanced realism helps to improve generalization performance [28,29].However, without adaptation to or training with realistic data, the performance of data-driven methods can be significantly impaired [25,30], which indicates the need for experimentally measured RIR datasets.

State of the art
In contrast to the diverse analytical and data-driven frameworks for the simulation of RIRs proposed in the past [8,[31][32][33][34], the availability of openly available experimental RIR datasets is limited.Moreover, the landscape of RIR datasets is rather heterogeneous because the experimental data is usually acquired with a certain application in mind.This manifests in application-specific environments, customized source-receiver arrangements, and non-standardized dataset annotations.
In some datasets, the focus lies on environmental variability through different room geometries and reverberation times [35][36][37][38][39][40].The SoundCam dataset [35] poses an extreme example in which the environmental characteristics of four rooms are manipulated by placing humans at hundreds of locations in the rooms, leading to an enormous amount of environmental representations.These datasets are of particular interest in speech recognition, teleconferencing, and sound event detection applications.While in some of these datasets, the sensors are distributed over the room [35,36], others rely on compact microphone arrangements with only a few channels and aperture sizes that are small compared to the source distance [39][40][41].The latter is often the case in speech and sound event localization applications.Only very few datasets provide additional annotations of the environmental characteristics, such as echo annotations, which enable their use for acoustic parameter estimation problems [40].
Conversely, there are application fields where spatial diversity and accurate positional annotations are similarly or even more important than environmental variability.In acoustical engineering applications, microphone array measurements for determining source locations and individually induced sound pressure at a specific receiver position are widely used to implement noise reduction measures [42].To accurately estimate these source characteristics, free-field conditions or laboratory environments with a short reverberation time are preferred.Usually, the utilized microphone arrangements are planar and contain many channels spread over a large area to ensure accurate spatial filtering capabilities over a wide frequency range.Typical examples include the measurement of machinery noise and wind turbines.Oftentimes, multiple closely neighbouring sound sources can be expected.There exist datasets that employed mechatronic positioning devices, which offer a higher positional accuracy over manual positioning, to achieve a dense sampling of the acoustic space under reverberant [43][44][45] or anechoic conditions [46].For example, the Pyramic dataset [46] was acquired by rotating a tetrahedral microphone array with a turntable device using fixed loudspeaker positions.However, the accuracy of the acquired positional information is rarely assessed in the literature [46].A limitation of these datasets is that either the number of sources or receivers can be considered as small [39,43,44,46], which prohibits their use in engineering acoustic applications where a dense spatial sampling of the source and receiver space is required.
Another discipline with similar data requirements is the spatial interpolation of room impulse responses [5][6][7]47].Due to the large spatial variability of RIRs [47], dense sampling grids have to be employed in order to adequately capture the entire range of dynamics, especially at high frequencies.The absence of large and spatially dense RIR datasets entails that thorough assessment of interpolation methods has only been performed with synthetic data so far [5,7].
Regardless of the application, only a few large-scale datasets encompassing several thousand RIRs, the largest of which is presented in this work.These datasets are outlined in Table 1.In addition to these larger datasets, there are several popular datasets with a smaller number of RIRs, for example [38][39][40][48][49][50][51][52][53][54][55].Large and realistic RIR datasets are of great interest to many scientific communities, especially those concerned with the development of novel data-driven modelling algorithms.The mathematical field of model order reduction (MOR) [56], for example, is rapidly developing data-driven methods [57][58][59].However, in the majority of cases, new methods are validated only with synthetic data of common benchmark problems [60][61][62] and the validation of these methods in realistic scenarios with high-dimensional data is still pending.Providing well-documented high-dimensional measurement data is an essential element that can enable these communities to increase the applicability of their methods to real problems.
In this work, we introduce a large measured room impulse response (RIR) dataset, which we call "Microphone Array Impulse Response Dataset for Acoustical Learning" (MIRACLE).The dataset is tailored to the application field of sound source localization and characterization in acoustical engineering applications, and spatial interpolation of RIRs.The focus of the dataset lies on spatial rather than on environmental variability and accurate positional annotations.
Its key highlights are: 1. MIRACLE is the first RIR dataset with a dense sampling of the environment in both source and receiver space.2. Particularly accurate spatial sampling is achieved by using a mechatronic positioning device to control the loudspeaker position.In contrast to most of the previously published datasets, the accuracy of the source positions (positional labels) is statistically validated.The assessment reveals outstandingly accurate and precise source locations with positional uncertainties in the order of a few millimetres.

Experimental setup
The experimental setup is illustrated in Fig. 1.Details on the utilized hardware are given in Table 2.

Microphone array
The phased microphone setup features a planar microphone array comprising n o = 64 channels mounted in a 1.5 m × 1.5 m aluminium plate.The microphone arrange- ment follows Vogel's spiral [63].The maximum pairwise distance between the array microphones is referred to as the aperture size d a = 1.47 m .The microphone array data was acquired with a multichannel acquisition system (sampling rate = 51.2kHz).

Sound source and excitation signal
A dynamic 2" cone loudspeaker in a cylindrical 3D-printed enclosure was employed as the sound source.An exponential sine sweep was used as the excitation signal because of its favourable properties with regard to crest factor and rejection of non-linearities [64].It was designed according to [65][66][67] in the frequency range of the loudspeaker, namely from 100 Hz to 16 kHz .Because the anechoic chamber is nearly free of reflections and has very low noise levels, it was possible to choose a relatively short sweep time of 3 s for the measurement.In order to ensure that the entire system response after excitation is captured, a safety window of 250 ms was added to the recording duration, resulting in n s = 166,400 samples per measurement.The loudspeaker excitation signal was also fed back directly to the AD-converter and was synchronously recorded with the microphone signals as a reference signal for post-processing.It is referred to as the loopback excitation signal in the following.

Positioning
A high-precision motor-driven 2D positioning system was employed for loudspeaker positioning.The positioning system and the microphone array were manually aligned by using a laser distance meter and a cross-line laser, achieving only minor alignment errors of a few millimetres at worst.The loudspeaker dust cap at the membrane centre was used as reference in the manual alignment.During data post-processing, a spatial offset  correction was applied based on a statistical evaluation given in Section 3.3.The corrected positions apply to the acoustical centre of the loudspeaker rather than the centre of the membrane.

Environment
All measurements were performed in the anechoic chamber of TU Berlin (room volume V = 830 m 3 , lower cut-off frequency f c = 63 Hz ).Neither heating nor air condi- tioning was active, and the temperature was monitored at the microphone array centre throughout the experiment.
A ground plate was placed between the loudspeaker and the microphone array in one of the experimental scenarios to enable a reflective environment.The supporting grid platform and the positioning system were clad with absorptive foam to minimize reflections.

Experimental procedure
A customized and fully automated data acquisition procedure was implemented.Before each experiment, the loudspeaker was repeatedly excited with the excitation signal for a duration of 20 min (the duration was determined in a dedicated experiment).This warm-up phase accounts for the weakly non-stationary dynamics of the loudspeaker's transfer function, e.g., changes of the properties of the loudspeaker magnet related to internal temperature fluctuations; see [68].Subsequently, the actual measurement routine was started by positioning the loudspeaker at the desired source location and measuring the room temperature simultaneously.After positioning, two repetitions of background noise measurement ( 1 s each) and loudspeaker excitation measurements ( 3 s each) were performed using all n o microphones at once.Subsequently, the cross-correlation between all n i recorded channels was evaluated according to the rule of two [69].Based on the measured sweep signals and the noise signal, the rule of two defines a cross-correlation threshold at which a pair of measured sweeps can be regarded free of corruption.In case of any violations, the measurement was repeated automatically.Following the main measurement campaign, an additional measurement was conducted in the anechoic chamber to obtain the angle-dependent frequency response of the loudspeaker at discrete azimuth angles at a resolution of �θ = 2.5 • .A microphone was placed at a distance of 0.5 m from the loudspeaker centre.The latter was mounted on a motor-driven dispersion measurement turntable.A photograph of the measurement setup can be found in Fig. 2. The same excitation signal and processing parameters as in the previous measurement campaign were used to determine the loudspeaker impulse response.Due to the cylindrical enclosure enclosing the loudspeaker, rotational symmetry around the z-axis can be assumed.

Post-processing
Several post-processing steps were performed to obtain a good estimate of the system impulse response from the measurements.Firstly, the loopback excitation and microphone signals were averaged across the two measurement repetitions to obtain a single averaged excitation signal ũi,j ∈ R n s and averaged microphone signal ỹi,j ∈ R n s at the i-th source to the j-th receiver location, respectively.According to that, all signals were resampled to a sampling rate of f d = 32 kHz since the loud- speaker transmission capability and excitation sweep have an upper frequency limit of 16 kHz .We applied the polyphase method for resampling (see [70] for details).

Deconvolution
In the following, let n d = 104,000 denote the number of samples after resampling.An estimate of the frequency response was obtained by dividing the Discrete Fourier Transform (DFT) of the averaged and downsampled measurement signals Y i,j = DFT(ỹ i,j ) ∈ C n d by the correspond- ing DFT of the averaged and resampled loopback excitation signals for the angular frequency i,j ∈ C n d were obtained by regularized inversion [67,[71][72][73][74][75] where M = max k∈{1, ..., n d } {|U i,j (e ıω k )| 2 } = 1 .Regulariza- tion is necessary to avoid instabilities in the deconvolved frequency response that arise from persistently exciting only over a limited frequency range.Practical considerations for choosing the regularization term in acoustic applications can be found in [67].The regularization term ∈ R n d was chosen as such that the regularization term (e ıω k ) is equal to 0 above the cutoff frequency which is chosen according to the lower limit of the loudspeaker's frequency range of 100 Hz and equal to 1 below (1) . A cross-fade based on a Hann window (raised-cosine) is used to smoothly transition in between.The estimate of the frequency response H i,j was then transformed back to the time domain to finally obtain the impulse response

Truncation
The calculated impulse responses were subsequently truncated in order to contain the size of the final dataset.For user convenience, the impulse responses of all measurement scenarios were truncated identically.For this, the minimum cumulative energy e ∈ R n d given by was calculated for each scenario.The truncation index n t was chosen to be the smallest power of two that is larger than the time index for which 0.1 % of the energy is trun- cated at worst, namely

Impulse responses
A total of four different experimental scenarios were realized, which are summarized in Table 3.The acquisition time for each of the large-scale scenarios A1, A2, and R2 was about 20 h.The total number of single-channel impulse responses across all scenarios is 856,128.The scenarios differ regarding the environment as well as the spatial dimension ( d y = d x ), sampling resolution ( d y = d x ), and distance d z of the source plane.The two large anechoic sce- narios A1 and A2 each include 4096 measured source positions on an equidistantly spaced 64 × 64 grid at different source-plane distances d z .In addition, a densely sampled scenario D1 was acquired on a smaller 33 × 33 grid with a spacing of only 5 mm .Scenario R2 is based on the same geometric setup as scenario A2, but an aluminium plate on the floor introduces a specular reflection.
(5)   4 exemplarily show the measured impulse response and its magnitude spectrum for a single sourcereceiver combination for scenarios A1, A2, and R2, respectively.It can be readily verified that the doubling of the distance to the source is also reflected in a doubling of the delay shift and an attenuation of the magnitude spectrum by approximately −6dB .Furthermore, the specular reflec- tion for scenario R2 manifests in a prominent second peak in the impulse response and comb filtering in its magnitude spectrum.Additional reflections manifesting as spurious peaks in the impulse response are due to the structure of the positioning system and the supporting grid platform.
The mean and standard deviation of temperature and the speed of sound for each of the scenarios are given in Table 4.The speed of sound has been calculated according to [76,77] 1 .It reveals that the temperature and the speed of sound are almost identical across all scenarios with an absolute difference of �µ < 1 • C and �µ ≤ 0.6 m s , respectively, which is expected due to the fairly constant environmental conditions inside the anechoic chamber.

Loudspeaker directivity
Figure 5 shows the directivity D and the directivity index DI of the loudspeaker measured with a dispersion meas- urement turntable in the azimuthal plane.In this work, the directivity is defined as the ratio between the measured squared sound pressure p RMS (θ , f ) at an angle θ and the maximum among all angles, i.e.
The directivity index under the assumption of rotational symmetry is expressed as where p 2 RMS (0, f ) represents the squared sound pressure in front of the speaker.It is seen that the loudspeaker exhibits a radiation pattern similar to a monopole until an upper frequency of 2 kHz .Above this frequency, the directivity index increases.Still, the directivity observed by the microphone array is close to a monopole at relevant radiation angles, i.e. θ ≤ θ max = 67.3• , as indicated by the dashed line in Fig. 5.

Positional validation
Several uncertainty factors affected the spatial alignment precision regarding the microphone array centre and the centre of the observation area.These factors include measurement uncertainties with regard to the utilized cross-line laser and distance meter as well as mechanical backlash, which occurred primarily with horizontal changes of direction.Therefore, a systematic spatial offset within the range of a few millimetres can be assumed.
Due to the anechoic environment and the use of a largescale microphone array enabling an excellent spatial resolution, Conventional Frequency Domain Beamforming [42] serves as an appropriate method to obtain an estimate of the actual source location.The large number of acoustic cases also permits a statistical approach to determine the spatial offset for a measurement scenario and to quantify the uncertainty regarding the source position information.

Beamforming
] ⊂ Z and let denote the transfer function measurements from the ith source at location x s for i ∈ {1, . . ., n i } to each of the n o microphones.The cross-spectral matrix induced by a sound source with unit strength is then given by The beamforming result for an assumed source location x s ∈ R 3 is then given by the square of the C-weighted norm of the steering vector a(x s , Many formulations of the steering vector can be found in the literature.The formulations I and IV in [78] result in a coincidence of the beamformer's steered response power maximum and the actual source location for a single monopole source radiating under free-field conditions.In this work, formulation IV was used, which defines the entries of a via (10) H (e ıω k ) = H i,1 (e ıω k ) . . .
, where r j = x s − x j 2 is the distance between the assumed source location x s and the j-th microphone location x j , and r 0 = �x s − x 0 � 2 is the distance between x s and the reference position, in this case the origin of the coordinate system.Validation of each measured source position commenced with the spatial discretization of a neighbourhood around the assumed source position.A 201 × 201 equidistantly spaced focus-grid with a resolution of x = 0.5 mm was employed.The beamforming map was computed on the discretized region for every frequency in the range which was chosen such that the lower frequency limit f l = 2 kHz enabled a sufficiently large spatial resolution in the resulting beamforming map, and the upper frequency limit f u = 4 kHz ensures that the wavelength is larger than the loudspeaker diameter.The latter is important to ensure that the loudspeaker has a radiation pattern close to a monopole at relevant radiation angles in order to meet the monopole assumption needed for the steering vector formulation.As indicated by the dashed line in Fig. 5, the radiation angle from the loudspeaker to any microphone in the array is bounded by θ max = 67.3• .The global spatial maximum is then determined by where b(x s , ω) denotes the amplitude normalized beam- forming result with b(x s , ω) being the beamformer's maximum output among all source locations x s at a given frequency ω .The evaluation was conducted for different distances within a range of up to ±12 mm around the assumed source dis- tance with a sampling interval of z = 1 mm to account for a potential mismatch of the source plane distance.Finally, the positional offset between the beamformer's prediction and the assumed source position is determined by x i = xi − x i .

Statistical evaluation
The systematic positional offset between the centre of the observation area and the microphone array in the horizontal and vertical direction can be statistically determined by using the estimates x i ∈ R 2 for each individual measured source position.Thereby, each estimated positional deviation x i can be seen as a realization of the ( 14) jointly distributed random variables R x , R y with the joint Probability Density Function (PDF) f R x ,R y (�x i ) .It is assumed that the individual positional offset estimations x i are symmetrically distributed around the true positional offset due to the approximate symmetry of the microphone array and observation plane around the origin.Then, the true positional offset corresponds to the deviation associated with the greatest probability.A simple method to determine the joint PDF of jointly distributed random variables based on a finite set of samples is the kernel density estimation [79], denoted by where N refers to the sample size and K h is the so-called kernel.A bivariate Gaussian kernel with bandwidth h was used, where h was chosen according to the Silverman's rule of thumb [80].

Offset correction
The correction procedure's first step was determining the distance z between the loudspeaker and the micro- phone array plane for the experiments {A1, D1} and {A2} .The joint PDF was estimated individually for each evaluated distance z .Note that source cases from experi- ment R2 were excluded from the statistical evaluation since the ground plate reflections would introduce an additional disruptive factor in the positional estimation.
It is assumed that the true distance minimizes the variance among any direction associated with fR x ,R y (�x i ) , i.e. the spectral norm of the covariance matrix � �x i (�z) is minimized, such that Figure 6 shows the joint PDF with the smallest spectral norm for the experiments {A1, D1} and {A2} .Based on the joint PDF corresponding to the optimal distance (17) correction z , the true positional offset in vertical and horizontal direction is determined from the maximum of the corresponding marginal distributions depicted in Fig. 7. Table 5 shows the positional offset correction values for each of the experiments.
With the correction offset applied, one can conclude that the positional uncertainties regarding the true source positions are in the order of a few millimetres.Given the 2.5 and 97.5 percentiles of the marginal distributions, the positional uncertainty is in the range of

Applications
Experimental measurement of acoustical systems is rarely an end in itself.Usually, one utilizes the measurements to infer system properties or data-driven models of the system at hand.Given its substantial size and precision, the MIRACLE dataset can be applied to a variety of research applications.In the following, two timely and crucial acoustic research areas are addressed: firstly, source localization and characterization, and secondly, data-driven reduced order modelling.

Acoustic source localization and characterization
In recent years, there has been extensive investigation into data-driven methods, particularly deep learning models, to solve source localization and characterization problems [11].Localization focuses on determining the positions of one or multiple sound sources, whereas source characterization entails quantifying additional characteristics, such as their individual strength at a specific spatial reference position.Due to the scarcity of suitable experimental data, it is common to leverage synthetically generated microphone array data, with only few studies considering real-world data already during training [25].
It remains an open research question how strongly the performance of data-driven methods is affected by this oversimplification and how this performance degradation could be mitigated.

Microphone array data generation
One way to overcome the lack of experimental data is to generate microphone array data via a convolution of experimental RIRs with arbitrary source signals in a Monte Carlo simulation.In order to facilitate reproducibility in machine learning, the training data has to be published alongside the source code.Providing the immutable raw microphone array data of the Monte Carlo simulation would lead to infeasibly large storage requirements.As a lightweight alternative, a code environment of the data simulation process can be published.This enables the generation of microphone array data on demand, which not only alleviates the problem of storage but can also greatly accelerate the training process.This approach is taken by the AcouPipe framework [81], a previous work of the authors.
AcouPipe is an open-source project written in Python programming language that incorporates a default dataset generation method that constructs source cases with an arbitrary number of sound sources.As of AcouPipe v24.04 [82], the AcouPipe framework includes an additional experimental dataset generation method that utilizes the MIRACLE dataset per default.Both the synthetic and the experimental data generation method rely on similar measurement setups.A Monte Carlo simulation is used to randomly generate multi-source scenarios by superposition.By default, the number of sources and their respective positions are drawn from a Poisson distribution I ∼ P( = 3) and a bivariate normal distribution (x i , y i ) ∼ N (µ = 0, σ = 0.1688 d a ) for each individual source case.Artificial white noise is used as the source signal with a signal length of five s.The signal amplitudes are chosen according to the desired squared RMS sound pressure at the reference position.The latter is drawn from a Rayleigh distribution, such that p 2 RMS,i ∼ R(σ R = 5) .Finally, the sound pressure signals at the microphones are obtained by superposition and convolution of the respective RIRs with the source signals Uncorrelated white noise n j (t) is added to each micro- phone channel.A more detailed description of the underlying dataset generation procedure can be found in [81].
Listing 1 Python source code snippet demonstrating multi-source scenario generation based on the MIRA-CLE datasetTo demonstrate the ease of use, Listing 1 exhibits a Python source code example that generates multi-source scenarios based on the MIRACLE dataset.Here, a Dataset instance is constructed, utilizing the RIRs from scenario A1 for the generation of 1000 validation samples.With AcouPipe, it is not only possible to obtain the raw time data but also to extract several features.The features include a source mapping of the sources and their locations.By default, conventional beamforming is used to calculate the source mapping.Figure 8 shows the beamforming map related to the seventh sample of the validation dataset.
Theoretically, very large numbers of experimental scenarios can be created by superimposing the individual measurements of the MIRACLE dataset.For example, when considering unordered sampling without replacement more than eight million different source cases can be constructed when considering scenario A1 and two sources.The number of possible combinations increases significantly when sources of different strengths or even more than two sources are combined.
We believe that both the synthetic and the experimental data provide an excellent basis to address and answer (19) up-to-date research questions related to data-driven source localization and characterization tasks, such as: How strong are synthetically trained source characterization models influenced by the domain shift when evaluated with data acquired in a real environment?Which domain generalization and adaptation methods are suitable to overcome the performance degradation due to the domain shift?How do existing source localization and characterization methods generalize to real-world scenarios?

Reduced order modelling
System measurements are usually performed with a certain goal in mind, e.g.inferring system properties from the data, such as the reverberation time.In many cases, one would like to obtain a model for the dynamical system transmission, i.e. the mapping from an input signal to an output signal.A common and apparent approach to obtaining a model for the system dynamics from RIR measurements is to employ them as a convolution kernel h ∈ ℓ n o ×n i 2 via the convolution sum where u ∈ ℓ n i 2 and y ∈ ℓ n o 2 are the input and output signal, respectively, and ℓ 2 denotes the Hilbert space of square- summable sequences.This approach was also pursued in the previous section.However, it can be unfavourable with regard to computational effort and storage requirements, especially for larger systems with many inputs and/or outputs, such as densely sampled RIRs, because the computational effort scales with the product of the (21) number of inputs and outputs [83].Furthermore, systems whose dynamics are governed by a low number of weakly damped modes, e.g.small room acoustics, possess long impulse responses but are otherwise described by a simple model.In a convolution-based approach, there are no effective means to eliminate redundancies in the data, making the dynamical transmission of aforementioned systems unnecessarily expensive to compute.
As argued in [83,84], state-space models can be computationally advantageous, especially in real-time scenarios, and provide access to powerful system-theoretic MOR methods [56].MOR is a very active research field that aims to find small models that approximate the dynamics of a large and potentially infeasible so-called full order model as closely as possible.Instead of firstly identifying a large full order model from data and then applying MOR methods, a reduced order model can also be inferred directly from the data.This is oftentimes referred to as data-driven model order reduction or reduced order modelling.
In the following, discrete-time state-space models are briefly introduced, and the applicability of the Eigensystem Realization Algorithm (ERA) [85][86][87] to the MIRA-CLE dataset is demonstrated.

State-space models
A discrete-time linear time-invariant dynamical system with m ∈ N inputs, p ∈ N outputs can be described by the matrix quadruple (A, B, C, D) with A ∈ R n×n , B ∈ R n×m , C ∈ R p×m , and D ∈ R p×m , where n ∈ N is the order of the system.Its dynamics are governed by the discrete-time state equations for input u ∈ ℓ m 2 , output y ∈ ℓ p 2 , and state x ∈ ℓ n 2 .In our concrete case, u denotes the n i -dimensional input signal at the source locations, and y denotes the n o -dimensional output signal at the array microphones.Being an LTI system, the input-output behaviour of the state-space system is fully described by the convolution sum (21), where the impulse response h ∈ ℓ p×m 2 is given by h(t) = CA t−1 B in discrete-time.This can be easily verified from (22) by applying an impulse input u(t) = δ t0 with zero initial condition x(0) = 0 .A frequency domain description can be obtained by applying the z-transform to (22).The z-transform of output Y (z) ∈ C p is given by a multiplica- tion of the z-transform of the input U (z) ∈ C m with the transfer function (22) x(t + 1) = Ax(t) + Bu(t)

Eigensystem Realization Algorithm
The transfer function in ( 23) is a matrix-valued rational function that can be approximated by a Padé approximation [88].The moments of the transfer function for expansion at infinity, i.e. h i = d i dz i H (z) z=∞ , are referred to as the Markov parameters of the system.Intriguingly, the impulse response is equivalent to the sequence of Markov parameters in discrete time, i.e. h(t) = h t = CA t−1 B .This connec- tion is leveraged by ERA in the following way: Given a finite sequence h ∈ ℓ p×m 2 of 2s − 1 , s ∈ N Markov parameters, a matching state-space model can be identified via the Hankel matrix of Markov parameters For discrete-time systems, the Markov parameters can be conveniently obtained by measurements of the system impulse response.The Hankel matrix can be factored into the observability and controllability matrix O ∈ R ps×n and C ∈ R n×ms : From this factorization, a realization can be constructed via where O f and O l denote the first and last p(s − 1) rows of the observability matrix O ∈ R ps×n , respectively [86,87].Classically, the factorization is computed from a (truncated) singular value decomposition (SVD), i.e. for a truncation order r ≤ min{ms, ps} where U r ∈ R ps×r comprises the first r left singular vec- tors as columns, � r = diag(σ 1 , . . ., σ r ) the first r sin- gular values with σ 1 ≤ • • • ≤ σ r and V r ∈ R ms×r the first r right singular vectors as columns.By choosing r , a reduced order realization (A r , B r , C r ) can be constructed via (26).For a basic stability and error analysis, the reader is referred to the classical sources [85][86][87].A recent and thorough error analysis can be found in [89].( Using above theory, ERA is applied to scenario D1 of the MIRACLE dataset with m = 1089 inputs, p = 64 outputs.In order to improve the model quality, as argued in [83], the common dead time in the measurements is removed conservatively by truncating the first 68 samples.Additionally, to avoid unnecessary computations, only the first 20 ms of the impulse response are considered.Thus, after truncation, a total of 2s − 2 = 572 samples are taken as input data for ERA.It is immediately obvious from (24) that the resulting Hankel matrix of dimension 18,240 × 310,365 can lead to computational problems with the outlined algorithm.One would already need 45.3 GB to construct this matrix explicitly, and computing the SVD of such a large matrix is infeasible.There are several strategies [90][91][92] that can alleviate this computational burden in classical ERA by exploiting low-rank structures in the data and/or the Hankel matrix.A randomized SVD algorithm [93] was employed here as an approximate orthogonal decomposition instead of ( 27) by sampling the range of H with 5000 normally distributed random vectors and a single subspace iteration.For details of this randomized ERA variant, see [83,92].
It is important to emphasize that multiple-input-multiple-output reduced order models are identified for all input-output transmissions simultaneously.In order to facilitate a qualitative analysis of the model quality, Fig. 9 depicts the magnitude responses of the measurement data alongside the identified reduced order models for the single-channel transmission from the centremost source to the centremost microphone.It can be seen that the accuracy increases with increasing model order and decreases with frequency.Furthermore, we have noticed that the single-channel approximation accuracy tends to deteriorate for source or receiver locations with greater distances to the source and receiver planes, respectively.Fig. 9 Frequency response of the measured and modelled transfer functions (solid lines) and the error ε r (dash-dotted lines) of different model orders for scenario D1 at the centremost location in the source and receiver plane Therefore, an appropriate quantitative error criterion must encompass all transmission channels simultaneously.To this end, denote and as the RMS-averaged system and error impulse response, respectively, where h(t) ∈ ℓ 64×1,089 2 denotes the measured multichannel impulse response of the D1 scenario and A r , B r , C r are the reduced order system matrices obtained by ERA for model orders r = 1000, 2000, 3000 .The averaged responses are shown in Fig. 10 and it can be observed that the averaged error ε r is still substantial around 3 ms for r = 1000 , which is due to the aforemen- tioned deterioration of approximation quality towards the edges of the source and receiver planes.For the higher orders r = 2000, 3000 , this deviation is much less pronounced.
The potential benefit for model compression and realtime application is showcased in Table 6 in which the required memory and computational cost in Mega Floating Point Operations (MFLOPs) is listed alongside the relative approximation error In order to compare the cost of solving (22) with the state-space approach over the convolution sum in (21), the memory demand and computational cost of the Uniformly Partitioned Overlap-save (UPOLS) method [94] are also listed2 .The UPOLS method and the variants described in [94] are well-established and solve (21) via spectral multiplication of overlapping blocks of signals.
It can be seen that the reduced order models are more memory efficient than the UPOLS method by factors between 5 − 30 .A computational speedup for real-time simulation over the UPOLS method by factors between roughly 1.3 − 3.9 can be achieved for all models if the models are transformed into so-called modal or (quasi-) diagonal form which can be achieved via an eigenvalue decomposition of A r .Details on the transformation can be found in [83].(28)

Conclusion
A publicly available large-scale RIR dataset of 856,128 single-channel impulse responses has been presented, which is particularly suited for applications in the field of microphone array signal processing and sound field reconstruction because of the dense spatial sampling of the receiver and source spaces.To the authors' knowledge, the MIRA-CLE dataset is the largest openly available RIR dataset yet.
The measured data have been shown to be spatially accurate and therefore provide an excellent basis for the development and statistical evaluation of data-driven modelling methods in acoustical engineering.The relevance, versatility, and applicability of the MIRACLE dataset have been demonstrated by means of two timely and diverging application examples.
In contrast to existing datasets, the MIRACLE dataset exhibits a notable limitation in environmental diversity, primarily constrained to free-field sound propagation scenarios with and without specular reflection.While the characteristics of the source and receiver predominantly shape the impulse response representation, enhancing environmental variability may be feasible through a  semi-synthetic approach.This involves synthesizing corresponding RIRs and integrating them with the measured RIRs.To enrich the dataset, it is envisaged that the dataset will be extended in the near future by similar measurements of reverberant rooms.A persistent challenge lies in calibrating the source positions with comparable precision, mainly due to the diminished reliability of beamforming in reverberant environments.Addressing this challenge and broadening the dataset will be the focus of a forthcoming publication.

Fig. 1
Fig. 1 Experimental setup for the main experiment (R2) with reflective ground plate

Fig. 2
Fig. 2 Experimental setup for the directivity measurement

Figures 3 and
Figures3 and 4exemplarily show the measured impulse response and its magnitude spectrum for a single sourcereceiver combination for scenarios A1, A2, and R2, respectively.It can be readily verified that the doubling of the distance to the source is also reflected in a doubling of the delay shift and an attenuation of the magnitude spectrum by approximately −6dB .Furthermore, the specular reflec- tion for scenario R2 manifests in a prominent second peak in the impulse response and comb filtering in its magnitude spectrum.Additional reflections manifesting as spurious peaks in the impulse response are due to the structure of the positioning system and the supporting grid platform.The mean and standard deviation of temperature and the speed of sound for each of the scenarios are given in Table4.The speed of sound has been calculated

Fig. 6 Fig. 7
Fig. 6 Estimated joint PDF of the positional deviations between the beamforming results and the assumed source positions.The inner black circle corresponds to the outer rim of the loudspeaker and the outer black circle indicates the outer rim of the enclosure box (left: Experiments {A1, D1}, right: Experiment A2)

Fig. 8
Fig. 8 Source mapping at f = 2.5 kHz for the seventh sample of the experimental validation dataset.Grey circles indicate the microphone position and black crosses indicate the source positions

Fig. 10
Fig. 10 RMS-averaged impulse response of the data and error systems for different model orders

Table 2
Utilized hardware devices a Calibration was performed after the measurement campaign using a reference sensor with a temperature accuracy of ±0.1 • C

Table 3
MIRACLE experimental scenarios

Table 5
Positional correction values for each experiment

Table 6
(30)ry demand in MB, computational cost in MFLOPs, and approximation error εr according to(30)for the UPOLS method with block-size B = 64 and = 16 filter partitions and the reduced order models in different state-space representations.: standard form, : (quasi-) diagonal form