dEchorate: a calibrated room impulse response dataset for echo-aware signal processing

This paper presents a new dataset of measured multichannel room impulse responses (RIRs) named dEchorate. It includes annotations of early echo timings and 3D positions of microphones, real sources, and image sources under different wall configurations in a cuboid room. These data provide a tool for benchmarking recent methods in echo-aware speech enhancement, room geometry estimation, RIR estimation, acoustic echo retrieval, microphone calibration, echo labeling, and reflector position estimation. The dataset is provided with software utilities to easily access, manipulate, and visualize the data as well as baseline methods for echo-related tasks.


Introduction
When sound travels from a source to a microphone in an indoor space, it interacts with the environment: it is delayed and attenuated by distance, and reflected, absorbed and diffracted by surfaces. The Room Impulse Response (RIR) represents this phenomenon as a linear and causal time-domain filter. As depicted in Figure 1, RIRs are commonly subdivided into 3 parts: the direct path, corresponding to the line-of-sight propagation; the early echoes, stemming from a few disjoint reflections on the closest reflectors; and the late reverberation, comprising the dense accumulation of later reflections and scattering effects.
The late reverberation is indicative of the environment size and reverberation time, producing the so-called listener envelopment, i.e., the degree of immersion in the sound field [1]. In contrast, the direct path and the early echoes carry precise information on the scene's geometry, such as the position of the source and room surfaces relative to the receiver position [2], and on the surfaces' reflectivity. This relation is well explained by the Image Source Method (ISM) [3], in which the echoes are associated with the contribution of virtual sound sources lying outside the real room. Therefore, one may consider early echoes as "spatialized" copies of the source signal, whose Times of Arrival (TOAs) are related to the source and reflector positions.
Based on this idea, so-called echo-aware methods were introduced a few decades ago, where matched filters (or rake receivers) are used to constructively sum the sound reflections [4][5][6] and build beamformers achieving much better sound quality [7]. These methods have recently regained interest, as manifested by the European project SCENIC [8] and the UK research project S3A (see footnote 1). Later, a few studies showed that knowing the properties of a few early echoes could boost the performance of typical indoor audio inverse problems such as Speech Enhancement (SE) [9,10], sound source localization [11][12][13][14] and separation [15][16][17][18], and speaker verification [19].

arXiv:2104.13168v1 [eess.AS] 27 Apr 2021
Another active area of research cutting across the audio signal processing field is estimating the room geometry blindly from acoustic signals [20][21][22][23]. As recently reviewed by Crocco et al. in [22], end-to-end Room Geometry Estimation (RooGE) involves a number of subtasks: RIR estimation, peak picking, microphone calibration, echo labeling and reflector position estimation. As interesting applications, these methods have recently been used in active settings (i.e., knowing the transmitted signals) on unmanned aerial vehicles (UAVs, a.k.a. drones) [24,25] and on mobile phones [26]. The lowest common denominator of all these tasks is Acoustic Echo Retrieval (AER), that is, estimating the properties of early echoes, such as their TOAs and energies. The former problem is typically referred to as TOA estimation, or Time Difference of Arrival (TDOA) estimation when the direct path is taken as reference.
As listed in [27] and in [28], a number of recorded RIR corpora are freely available online, each of them meeting the demands of certain applications. Table 1 summarizes the main characteristics of some of them. One can broadly identify two main classes of echo-aware RIR datasets in the literature: SE/Automatic Speech Recognition (ASR)-oriented datasets, e.g. [27,30,31], and RooGE-oriented datasets, e.g. [21][22][23]. The former regard acoustic echoes as highly correlated interfering sources coming from close reflectors, such as a table in a meeting room or a nearby wall. This typically presents a challenge in estimating the correct source's Direction of Arrival (DOA), with further consequences for DOA-based enhancement algorithms, e.g., beamformers. Although this factor is taken into account, such datasets lack proper annotation of these echoes in the RIRs or of the absolute position of objects inside the room. The latter group typically features design choices, such as microphones scattered across the room, which are not suitable for SE applications. Indeed, these typically involve compact or ad hoc arrays. The main common drawback of these datasets is that they cannot be easily used for tasks other than the ones they were designed for.
To bypass the complexity of recording and annotating real RIR datasets, acoustic simulators based on the ISM are extensively used instead [32][33][34][35]. While such data are more versatile, simpler and quicker to obtain, they fail to fully capture the complexity and richness of real acoustic environments. Due to this, methods trained, calibrated, or validated on them may fail to generalize to real conditions, as will be shown in this paper. Interestingly, in the context of learning-based blind room volume estimation, the authors of [28] combined multiple real and synthetic RIR datasets in order to find a balance between the amount of training data and realism.

Footnote 1: http://www.s3a-spatialaudio.org/
A good echo-oriented RIR dataset should include a variety of environments (room geometries and surface materials), of microphone placements (close to or away from reflectors, scattered or forming ad hoc arrays) and, most importantly, precise annotations of the scene's geometry and echo timings in the RIRs. Moreover, in order to be versatile and usable in both SE and RooGE applications, geometry and timing annotations should be fully consistent. Such data are difficult to collect since they require precise measurements of the positions and orientations of all the acoustic emitters, receivers and reflective surfaces inside the environment with dedicated planimetric equipment.
To fill this gap, we present the dEchorate dataset: a fully calibrated multichannel RIR database with accurate annotation of the geometry and echo timings in different configurations of a cuboid room with varying wall acoustic profiles. The database currently features 1800 annotated RIRs obtained from 6 arrays of 5 microphones each, 6 sound sources and 11 different acoustic conditions. All the measurements were carried out at the acoustic lab at Bar-Ilan University following a consolidated protocol previously established for the realization of two other multichannel RIR databases: the BIU's Impulse Response Database [29], gathering RIRs of different reverberation levels sensed by uniform linear arrays (ULAs); and MIRaGE [31], providing a set of measurements for a source placed on a dense position grid. The dEchorate dataset is designed for AER with linear arrays, and is more generally aimed at analyzing and benchmarking RooGE and echo-aware signal processing methods on real data. In particular, it can be used to assess, in a controlled way, robustness against the number of reflectors, the reverberation time, additive spatially diffuse noise and non-ideal frequency and directivity characteristics of microphone-source pairs and surfaces. Given the amount of data and the variety of recording conditions, it could also be used to train machine learning models or as a reference to improve RIR simulators. The database is accompanied by a Python toolbox that can be used to process and visualize the data, perform analysis or annotate new datasets.
The remainder of the paper is organized as follows. Section 2 describes the construction and the composition of the dataset, while Section 3 provides an [...]

Table 1: Comparison between some existing RIR databases that account for early acoustic reflections. Receiver positions are indicated in terms of number of microphones per array times number of different positions of the array (∼ stands for partially available information). The reader is invited to refer to [27,28] for a more complete list of existing RIR datasets.
† The dataset in [23] was originally intended for RooGE and further extended for (binaural) SE in [18] with a similar setup.
‡ These datasets have been recorded in the same room.

[...] ground. An additional channel is used for the loopback signal, which serves to compute the time of emission and detect errors. Each loudspeaker and each array is positioned close to one of the walls in such a way that the source of the strongest echo can be easily identified. Moreover, their positioning was chosen to cover a wide distribution of source-to-receiver distances and, hence, a wide range of Direct-to-Reverberant Ratios (DRRs). Further, 2 more loudspeakers were positioned pointing towards the walls (indirect sources). This was done to study the case of early reflections being stronger than the direct path. Each linear array consists of 5 microphones with non-uniform inter-microphone spacings of 4, 5, 7.5 and 10 cm (see footnote 2). Hereinafter we will refer to these elements as non-Uniform Linear Arrays (nULAs).

Measurements
The main feature of this room is the possibility to change the acoustic profile of each of its facets by flipping double-sided panels with one reflective face (made of Formica laminate sheets) and one absorbing face (made of perforated panels filled with rock wool). A complete list of the materials of the room is available in Section 5. This makes it possible to achieve diverse values of RT60, ranging from 0.1 to almost 1 second. In this dataset, the panels of the floor were always kept absorbent.

Footnote 2: that is, [−12.25, −8.25, −3.25, 3.25, 13.25] cm w.r.t. the barycenter.

Two types of measurement sessions were considered, namely, one-hot and incremental. For the first type, a single facet was placed in reflective mode while all the others were kept absorbent. For the second type, starting from fully absorbent mode, facets were progressively switched to reflective one after the other until all but the floor were reflective, as shown in Table 3. The dataset features an extra recording session. For this session, office furniture (chairs, a coat hanger and a table) was positioned in the room to simulate a typical meeting room with chairs and tables (see Figure 2). These recordings may be used to assess the robustness of echo-aware methods in a more realistic scenario.

For each room configuration and loudspeaker, three different excitation signals were played and recorded in sequence: chirps, white noise and speech utterances. The first consists of a repetition of 3 Exponentially Swept-frequency Sine (ESS) signals of duration 10 seconds and frequency range from 100 Hz to 14 kHz, interspersed with 2 seconds of silence. This frequency range was chosen to match the characteristics of the loudspeakers. To prevent rapid phase changes and "popping" effects, the signals were linearly faded in and out over 0.2 seconds with a Tukey taper window (see footnote 3).
Second, 10-second bursts of white noise and 3 anechoic speech utterances from the Wall Street Journal (WSJ) dataset [36] were played in the room. Throughout all recordings, a sound dynamic range of at least 40 dB with respect to the room silence was ensured, and a room temperature of 24 ± 0.5 °C and a relative humidity of 80% were registered. In these conditions, the speed of sound is c_air = 346.98 m/s. In addition, 1 minute of room tone (i.e., silence) and 4 minutes of diffuse babble noise were recorded for each session. The latter was [...]

Footnote 3: The code to generate the reference signals and to process them is available together with the data. The code is based on the pyrirtools Python library.

All microphone signals were synchronously acquired and digitally converted to 48 kHz with 32 bit/sample using the equipment listed in Table 2. The polarity of each microphone was recorded by clapping a book in the middle of the room, and their gain was corrected using the room tone.
Finally, RIRs are estimated with the ESS technique [37], where an exponentially time-growing frequency sweep is used as the probe signal. The RIR is then estimated by deconvolving the microphone signal, implemented as a division in the frequency domain (using the same code mentioned in Footnote 3).
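As a rough illustration of this deconvolution step, the sketch below estimates an impulse response by regularized division in the frequency domain. It is not the code released with the dataset: the function names `exponential_sweep` and `deconvolve_ess`, the regularization constant `eps`, and the toy sweep and RIR parameters are all assumptions made for the example.

```python
import numpy as np

def exponential_sweep(f0, f1, duration, fs):
    """Exponentially swept sine from f0 to f1 Hz (toy probe signal)."""
    t = np.arange(int(duration * fs)) / fs
    rate = np.log(f1 / f0)
    phase = 2 * np.pi * f0 * duration / rate * (np.exp(t * rate / duration) - 1.0)
    return np.sin(phase)

def deconvolve_ess(recorded, probe, rir_len, eps=1e-8):
    """Estimate an RIR by regularized spectral division Y(f) / X(f).

    `eps` is an assumed regularizer that avoids dividing by near-zero bins.
    """
    n_fft = len(recorded) + len(probe)  # zero-padding avoids circular wrap-around
    Y = np.fft.rfft(recorded, n_fft)
    X = np.fft.rfft(probe, n_fft)
    H = Y * np.conj(X) / (np.abs(X) ** 2 + eps)
    return np.fft.irfft(H, n_fft)[:rir_len]

# Toy experiment: a known 2-tap "room" excited by a short sweep.
fs = 8000
probe = exponential_sweep(100.0, 3000.0, 1.0, fs)
true_rir = np.zeros(64)
true_rir[5], true_rir[20] = 1.0, 0.5     # direct path and one echo
recorded = np.convolve(probe, true_rir)  # noiseless simulated microphone signal
estimate = deconvolve_ess(recorded, probe, rir_len=64)
peak_indices = sorted(np.argsort(np.abs(estimate))[-2:].tolist())
```

In this noiseless toy case, the two largest samples of `estimate` fall on the taps of `true_rir`. On real recordings, the choice of regularization and the averaging over repeated sweeps would matter much more than they do here.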

Dataset annotation

RIRs annotation
The objective of this database is to feature annotations in the "geometrical space", namely the microphone, facet and source positions, that are fully consistent with annotations in the "signal space", namely the echo timings within the RIRs. This is achieved as follows:

(i) First, the ground-truth positions of the array and source centres are acquired via a Beacon indoor positioning system (bIPS). This system consists of 4 stationary bases positioned at the corners of the ceiling and a movable probe used for measurements, which can be located within errors of ±2 cm.

(ii) The estimated RIRs are superimposed on synthetic RIRs computed with the Image Source Method (ISM) from the geometry obtained in the previous step. A Python GUI (see footnote 4), shown in Figure 5, is used to manually tune a peak finder and label the echoes corresponding to the peaks found, that is, annotate their timings and their corresponding image source position and room facet label.

(iii) By solving a simple Multi-Dimensional Scaling (MDS) problem [38][39][40], refined microphone and source positions are computed from echo timings. The non-convexity of the problem is alleviated by using a good initialization (obtained at the previous step), by the high SNR of the measurements and, later, by including additional image sources in the formulation. The prior information about the arrays' structures reduces the number of variables of the problem, leaving the 3D positions of the sources and of the arrays' barycenters, in addition to the arrays' tilt on the azimuthal plane.

(iv) By employing a multilateration algorithm [41], where the positions of one microphone per array serve as anchors and the TOAs are converted into distances, it is possible to localize image sources alongside the real sources. This step will be further discussed in Section 4.

Knowing the geometry of the room, in step (i) we were able to initially guess the position of the echoes in the RIR. Then, by iterating through steps (ii), (iii) and (iv), the positions of the echoes are refined to be consistent under the ISM.

Footnote 4: This GUI is available in the dataset package.
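To make the link between geometry and echo timings used in step (ii) more concrete, the following sketch computes first-order image sources and their TOAs for an ideal shoebox room. The room dimensions, the source and microphone positions, and the mapping of wall names to axes are hypothetical choices for illustration; only the speed of sound is taken from the text.

```python
import math

SPEED_OF_SOUND = 346.98  # m/s, value reported for the recording conditions

def first_order_images(source, room_dims):
    """Mirror a source across the six walls of a shoebox [0,Lx] x [0,Ly] x [0,Lz].

    The wall names attached to each axis are an assumption of this sketch.
    """
    names = [("west", "east"), ("south", "north"), ("floor", "ceiling")]
    images = {}
    for axis, (low, high) in enumerate(names):
        low_img = list(source)
        low_img[axis] = -source[axis]                          # wall at coordinate 0
        high_img = list(source)
        high_img[axis] = 2.0 * room_dims[axis] - source[axis]  # wall at coordinate L
        images[low], images[high] = tuple(low_img), tuple(high_img)
    return images

def toa(point, mic, c=SPEED_OF_SOUND):
    """Time of arrival (seconds) from a real or image source to a microphone."""
    return math.dist(point, mic) / c

room_dims = (5.0, 4.0, 3.0)   # hypothetical room size in metres
source = (1.0, 2.0, 1.5)
mic = (4.0, 2.0, 1.5)

images = first_order_images(source, room_dims)
direct_toa = toa(source, mic)
echo_toas = {wall: toa(img, mic) for wall, img in images.items()}
```

Each entry of `echo_toas` is the timing at which the ideal ISM predicts a first-order echo, which is the kind of reference the manual labeling is compared against.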
The final geometrical and signal annotation was chosen as a compromise between the bIPS measurements and the MDS output. While the former are noisy but consistent with the scene's geometry, the latter match the TOAs but not necessarily the physical world. In particular, geometrical ambiguities such as global rotation, translation and up-down flips were observed. Instead of manually correcting this error, we modified the original problem from using only the direct-path distances (dMDS) to also considering the TOAs of the ceiling image sources in the cost function (dcMDS). Table 4 shows numerically the mismatch (in cm) between the geometric space (defined by the bIPS measurements) and the signal space (defined by the echo timings, converted to cm based on the speed of sound). To better quantify it, we introduce here a Goodness of Match (GoM) metric: it measures the fraction of (first-order) echo timings annotated in the RIRs that match the annotation produced by the geometry within a threshold. By including the ceiling information, dcMDS produces a geometrical configuration with a small mismatch (0.4 cm on average, 1.86 cm max) in both the signal and geometric spaces, with 98.1% of the first-order echoes matched within a 0.5 ms threshold (i.e., with the position of all the image sources within an error of about 17 cm). It is worth noting that the bIPS measurements produce a significantly less consistent annotation with respect to the signal space.
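The GoM metric can be sketched in a few lines. The function name `goodness_of_match` and the example timings below are illustrative assumptions, and the sketch supposes that the annotated and geometry-derived echo lists are already paired one to one.

```python
def goodness_of_match(annotated_toas, geometric_toas, threshold=0.5e-3):
    """Fraction of annotated echo timings within `threshold` seconds
    of the timings predicted by the geometry (lists paired one to one)."""
    if len(annotated_toas) != len(geometric_toas):
        raise ValueError("both lists must describe the same echoes")
    hits = sum(abs(a - g) <= threshold
               for a, g in zip(annotated_toas, geometric_toas))
    return hits / len(annotated_toas)

# Made-up timings in seconds; the third pair differs by 0.9 ms.
annotated = [0.0100, 0.0152, 0.0201, 0.0260]
geometric = [0.0101, 0.0150, 0.0210, 0.0262]
gom = goodness_of_match(annotated, geometric)

# The 0.5 ms threshold expressed as a distance at c = 346.98 m/s.
threshold_in_metres = 0.5e-3 * 346.98
```

Here three of the four pairs agree within 0.5 ms, and the threshold corresponds to roughly 17 cm of propagation path, in line with the figure quoted above.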

Other tools for RIRs annotation
Finally, we would like to add that the following tools and techniques were found useful in annotating the echoes.
The "skyline" visualization consists in presenting the intensity of multiple RIRs as an image, such that the wavefronts corresponding to echoes can be highlighted [42]. Let $h_n(l)$ be an RIR from the database, where $l = 0, \dots, L-1$ denotes the sample index and $n = 0, \dots, N-1$ is an arbitrary indexing of all the microphones for a fixed room configuration. Then, the skyline is the visualization of the $L \times N$ matrix $\mathbf{H}$ created by stacking column-wise $N$ normalized echograms (see footnote 5), that is

$$ H_{l,n} = \frac{|h_n(l)|}{\max_{l'} |h_n(l')|}, \qquad (1) $$

where $|\cdot|$ denotes the absolute value. Figure 6 shows an example of skyline for 120 RIRs corresponding to 4 directional sources, 30 microphones and the most reflective room configuration, stacked horizontally, preserving the order of microphones within the arrays. One can notice several clusters of 5 adjacent bins of similar color (intensity) corresponding to the arrivals at the 5 sensors of each nULA.

Footnote 5: The echogram is defined either as the absolute value or as the squared value of the RIR.
Thanks to the usage of linear arrays, this visualization allowed us to identify both TOAs and their labeling.
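A skyline matrix of this kind can be built as follows. This is a minimal sketch on synthetic data: it assumes that each echogram is the absolute value of the RIR divided by its own maximum, and the array sizes and the artificial "wavefront" are made up for the example.

```python
import numpy as np

def skyline(rirs):
    """Stack normalized echograms column-wise.

    `rirs` has shape (L, N): L samples for each of N microphones.
    Normalization by the per-column maximum is an assumption of this sketch.
    """
    echograms = np.abs(rirs)
    peaks = echograms.max(axis=0, keepdims=True)
    return echograms / np.where(peaks > 0, peaks, 1.0)  # guard all-zero columns

# Synthetic example: low-level noise plus one wavefront that reaches
# each microphone one sample later than the previous one.
rng = np.random.default_rng(0)
L, N = 200, 10
rirs = 0.01 * rng.standard_normal((L, N))
for n in range(N):
    rirs[20 + n, n] = 1.0 + 0.1 * n
H = skyline(rirs)
arrival_samples = H.argmax(axis=0).tolist()
```

Displayed as an image (for instance with an image-plotting routine of one's choice), the wavefront appears as a slanted line across adjacent columns, which is the pattern exploited when labeling echoes across the microphones of an array.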
Direct path deconvolution/equalization was used to compensate for the frequency response of the source loudspeaker and microphone [20,43]. In particular, the direct path of the RIR was manually isolated and used as an equalization filter to separate overlapping early reflections before proceeding with peak picking. Each RIR was equalized with its respective direct path. As depicted in Figure 5, in some cases this process was required for correctly identifying the underlying TOAs' peaks.
Different facet configurations for the same geometry influenced the peaks' predominance in the RIR, hence facilitating its echo annotation.An example of RIRs corresponding to 2 different facet configurations is shown in Figure 5: the reader can notice how the peak predominance changes for the different configurations.
An automatic peak finder was used on equalized echograms $\eta_n(l)$ to provide an initial guess on the peak positions. In this work, peaks are found using the Python library peakutils, whose parameters were manually tuned.
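The idea of a tunable peak finder can be illustrated with a simple stand-in. The sketch below is not the peakutils API: it is a plain thresholded local-maximum search, and its parameters `threshold` and `min_distance` are assumed names and values chosen for the example.

```python
def find_peaks_simple(echogram, threshold=0.3, min_distance=5):
    """Indices of local maxima above `threshold` * max(echogram),
    keeping the strongest peak among those closer than `min_distance`."""
    floor = threshold * max(echogram)
    candidates = [
        i for i in range(1, len(echogram) - 1)
        if echogram[i] >= floor
        and echogram[i] > echogram[i - 1]
        and echogram[i] >= echogram[i + 1]
    ]
    candidates.sort(key=lambda i: echogram[i], reverse=True)  # strongest first
    kept = []
    for i in candidates:
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    return sorted(kept)

# Toy echogram: a strong peak, a close weaker one, a distant echo, and a
# small bump below the threshold.
echogram = [0.0] * 50
echogram[10], echogram[12], echogram[30], echogram[40] = 1.0, 0.6, 0.5, 0.1
peaks = find_peaks_simple(echogram)
```

With these settings the close secondary peak at sample 12 and the weak bump at sample 40 are discarded. On measured RIRs such choices are exactly what has to be tuned by hand, since overlapping reflections can be merged or split depending on the thresholds.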

Limitations of current annotation
As stated in [44], we want to emphasize that annotating the correct TOAs of echoes, and even of the direct path, in "clean" real RIRs is far from straightforward. The peaks can be blurred by the loudspeaker characteristics or by the concurrency of multiple reflections. Nevertheless, as shown in Table 4, the proposed annotation was found to be sufficiently consistent both in the geometric and in the echo/signal space. Thus, no further refinement was done. This database can be used as a first basis to develop better AER methods, which could in turn be used to iteratively improve the annotation, for instance by including second-order reflections.

The dEchorate package
The dataset comes with both data and code to parse and process it. The data are presented in 2 modalities: the raw data, that is, the collection of recorded wave files, are organized in folders and can be retrieved by querying a simple database table; the processed data, which comprise the estimated RIRs and the geometrical and signal annotations, are organized in tensors directly importable in Matlab or Python (e.g., all the RIRs are stored in a tensor of dimension L × I × J × D, respectively corresponding to the RIR length in samples, the number of microphones, of sources and of room configurations).

[Figure caption fragment: ... and (d) is a detail of the RIR skyline (see Figure 6) on the corresponding nULA, transposed to match the time axis.]
Together with the data, a Python package is available on the same website. It includes wrappers, a GUI and examples, as well as the code to reproduce this study. In particular, all the scripts used for estimating the RIRs and annotating them are available and can be used to further improve and enrich the annotation or as baselines for future works.

Analysing the Data
In this section we illustrate some characteristics of the collected data in terms of acoustic descriptors, namely the RT60, the DRR and the Direct-to-Early Ratio (DER). While the former two are classical acoustic descriptors used to evaluate SE and ASR technologies [45], the latter is less common and is used in strongly echoic situations [46,47].

Reverberation Time
The RT60 is the time required for the sound level in a room to decrease by 60 dB after the source is turned off; thus, it measures the reverberation level. This value is one of the most common descriptors in room acoustics. Besides, as reverberation detrimentally affects the performance of speech processing technologies, robustness against RT60 has become a common evaluation metric in SE and ASR.
Table 5 reports estimated RT60(b) values per octave band b ∈ {500, 1000, 2000, 4000} (Hz) for each of the rooms in the dataset. These values were estimated using Schroeder's integration method [48][49][50] in each octave band. For the octave bands centred at 125 Hz and 250 Hz, the measured RIRs did not exhibit sufficient power for a reliable estimation. This observation was confirmed by the frequency response provided by the loudspeakers' manufacturer, which decays exponentially from 300 Hz downwards.
Ideally, for the RT60 to be reliably estimated, the Schroeder curve, i.e. the log of the square-integrated, octave-band-passed RIR, would need to feature a linear decay for 60 dB of dynamic range, which would occur in an ideal diffuse sound regime. However, such a range is never observable in practice, due to the presence of noise and possible non-diffuse effects. Hence, [...]

As can be seen in Table 5, the obtained reverberation values are consistent with the room progressions described in Section 2. Considering the 1000 Hz octave band, the RT60 ranges from 0.14 s for the fully absorbent room (000000) to 0.73 s for the most reflective room (011111). When only one surface is reflective, the RT60 values remain around 0.19 s.
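The sketch below illustrates Schroeder's backward integration on a synthetic exponential decay. Because a full 60 dB of decay is rarely usable, it fits a line over a limited range and extrapolates to −60 dB; this workaround and the chosen range of −5 to −25 dB are assumptions of the example, not necessarily the procedure followed for Table 5. Octave-band filtering is also omitted.

```python
import math

def schroeder_curve_db(rir):
    """Backward-integrated energy decay curve in dB (0 dB at the first sample)."""
    energy, total = [], 0.0
    for sample in reversed(rir):
        total += sample * sample
        energy.append(total)
    energy.reverse()
    return [10.0 * math.log10(e / energy[0]) for e in energy if e > 0.0]

def rt60_from_decay(curve_db, fs, start_db=-5.0, end_db=-25.0):
    """Least-squares line fit between start_db and end_db, extrapolated to -60 dB."""
    pts = [(i / fs, v) for i, v in enumerate(curve_db) if end_db <= v <= start_db]
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_v = sum(v for _, v in pts) / len(pts)
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in pts)
             / sum((t - mean_t) ** 2 for t, _ in pts))   # dB per second
    return -60.0 / slope

# Synthetic RIR whose amplitude drops by 60 dB in 0.3 s.
fs, true_rt60 = 8000, 0.3
rir = [math.exp(-3.0 * math.log(10.0) * n / (fs * true_rt60)) for n in range(fs)]
estimated_rt60 = rt60_from_decay(schroeder_curve_db(rir), fs)
```

On this ideal decay the estimate recovers the value used to generate the signal. On measured RIRs, the fitting range would have to be adapted to the available dynamic range in each band.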

Direct To Early and Reverberant Ratio
In order to characterize an acoustic environment, it is common to provide the ratio between the energy of the direct and the indirect propagation paths. In particular, one can compute the so-called DRR directly from a measured RIR $h(l)$ [45] as

$$ \mathrm{DRR} = 10 \log_{10} \frac{\sum_{l \in \mathcal{D}} h^2(l)}{\sum_{l \in \mathcal{R}} h^2(l)} \quad [\mathrm{dB}], \qquad (2) $$

where $\mathcal{D}$ denotes the time support comprising the direct propagation path (set to ±120 samples around its time of arrival, blue part in Figure 1), and $\mathcal{R}$ comprises the remainder of the RIR, including both echoes and late reverberation (orange and green parts in Figure 1). Similarly, the DER defines the ratio between the energy of the direct path and the early echoes only, that is,

$$ \mathrm{DER} = 10 \log_{10} \frac{\sum_{l \in \mathcal{D}} h^2(l)}{\sum_{l \in \mathcal{E}} h^2(l)} \quad [\mathrm{dB}], \qquad (3) $$

where $\mathcal{E}$ is the time support of the early echoes only (green part in Figure 1). Unlike the RT60, which mainly describes the diffuse regime, both DER and DRR are highly dependent on the position of the source and receiver in the room. Therefore, for each room, wide ranges of these parameters were registered. For the loudspeakers facing the microphones, the DER ranges from 2 dB to 6 dB in one-hot room configurations and from −2 dB to 6 dB in the most reverberant rooms. The DRR follows a similar trend with lower values, such as −2 dB in one-hot rooms and down to −7.5 dB for the most reverberant ones. A complete annotation of these metrics is available in the database.
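Both ratios can be computed from a single RIR once the direct-path sample is known. The sketch below assumes energy ratios expressed in dB, and it introduces a hypothetical parameter `early_end` for the sample index separating early echoes from late reverberation, which the text does not specify; the toy RIR is made up.

```python
import math

def energy(rir, start, stop):
    """Sum of squared samples in rir[start:stop], clipped at the array start."""
    return sum(x * x for x in rir[max(start, 0):stop])

def drr_der_db(rir, direct_idx, early_end, half_width=120):
    """Return (DRR, DER) in dB.

    The direct-path support is +/- `half_width` samples around `direct_idx`;
    `early_end` (assumed parameter) marks the end of the early echoes.
    """
    d_start, d_stop = direct_idx - half_width, direct_idx + half_width + 1
    direct = energy(rir, d_start, d_stop)
    rest = energy(rir, 0, len(rir)) - direct   # everything outside the direct path
    early = energy(rir, d_stop, early_end)     # early echoes only
    return 10.0 * math.log10(direct / rest), 10.0 * math.log10(direct / early)

# Toy RIR: direct path, two early echoes, and one late component.
rir = [0.0] * 2000
rir[200], rir[400], rir[500], rir[1500] = 1.0, 0.5, 0.5, 0.5
drr, der = drr_der_db(rir, direct_idx=200, early_end=1000)
```

In this toy case the DER is about 3 dB and the DRR about 1.25 dB. The DRR is always the lower of the two because its denominator also includes the late part.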

Using the Data
The dEchorate database is now used to investigate the performance of state-of-the-art methods on two echo-aware acoustic signal processing applications on both synthetic and measured data, namely, spatial filtering and room geometry estimation.

Application: Echo-aware Beamforming
Let $I$ microphones record a single static point sound source, contaminated by noise sources. In the short-time Fourier transform (STFT) domain, we stack the $I$ complex-valued microphone observations at frequency $f$ and time $t$ into a vector $\mathbf{x}(f, t) \in \mathbb{C}^I$. Let us denote by $s(f, t) \in \mathbb{C}$ and $\mathbf{n}(f, t) \in \mathbb{C}^I$ the source signal and the noise signals at the microphones, which are assumed to be statistically independent. By denoting $\mathbf{h}(f) \in \mathbb{C}^I$ the Fourier transforms of the RIRs, the observed microphone signals in the STFT domain can be expressed as follows:

$$ \mathbf{x}(f, t) = \mathbf{h}(f)\, s(f, t) + \mathbf{n}(f, t). \qquad (4) $$

Here, the STFT windows are assumed long enough so that the discrete convolution-to-multiplication approximation holds well.
Beamforming is one of the most widely used techniques for enhancing multichannel microphone recordings. The literature on this topic spans several decades of array processing, and a recent review can be found in [51]. In the frequency domain, the goal of beamforming is to estimate a set of coefficients $\mathbf{w}(f) \in \mathbb{C}^I$ that are applied to $\mathbf{x}(f, t)$, such that $s(f, t) \approx \mathbf{w}(f)^H \mathbf{x}(f, t)$. Hereinafter, we will consider only distortionless beamformers aiming at retrieving the clean target speech signal, as it is generated at the source position.
As mentioned throughout the paper, the knowledge of early echoes is expected to boost spatial filtering performances.However, estimating these elements is difficult in practice.To quantify this, we compare echoagnostic and echo-aware beamformers.In order to study their empirical potential, we will evaluate their performance using both synthetic and measured data, as available in the presented dataset.
Echo-agnostic beamformers do not need any echo-estimation step: they either ignore the echoes' contributions, as in the direct-path delay-and-sum beamformer (DS) [52], or they consider coupling filters between pairs of microphones, called Relative Transfer Functions (ReTFs) [7]. Note that, contrary to RIRs, there exist efficient methods to estimate ReTFs from multichannel recordings of unknown sources (see [51, Section VI.B] for a review). The ReTFs can then be naturally incorporated in powerful beamforming algorithms achieving speech dereverberation and noise reduction in static [53] and dynamic scenarios [54]. In this work, ReTFs are estimated using the Generalized Eigenvector Decomposition (GEVD) method [55], following the approach illustrated in [56].
Echo-aware beamformers fall in the category of rake receivers, borrowing the idea from telecommunication where an antenna rakes (i.e., combines) coherent signals arriving from different propagation paths [4][5][6]. To this end, they typically consider that for each RIR $i$, the delays and frequency-independent attenuation coefficients of $R$ early echoes are known, denoted here as $\tau_i^{(r)}$ and $\alpha_i^{(r)}$. In the frequency domain, this translates into the following:

$$ h_i(f) = \sum_{r=0}^{R-1} \alpha_i^{(r)} \exp\left(-2\pi j f \tau_i^{(r)}\right), \qquad (5) $$

where $r = 0, \dots, R-1$ denotes the reflection order [...]. Recently, these methods have been used for noise and interferer suppression in [9,57] and for noise and reverberation reduction in [10,58]. The main limitation of these works is that echo properties, or alternatively the position of image sources, must be known a priori. Hereafter, we will assume these properties known by using the annotations of the dEchorate dataset, as described in Section 2.3. In particular, we will assume that the RIRs follow the echo model (5) with R = 4, corresponding to the 4 strongest echoes. Knowing the echo delays, the associated attenuation coefficients are retrieved from the RIRs using a simple maximum-likelihood approach, as in [59, Eq. 10].
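The echo model used by rake beamformers, a sum of delayed and attenuated copies with frequency-independent gains, can be evaluated directly. The sketch below assumes the usual convention in which a delay of τ seconds contributes a phase term exp(−2πjfτ); the function name and the example delays and gains are illustrative.

```python
import cmath
import math

def echo_model_response(freq_hz, delays, gains):
    """Frequency response of a sum of echoes: sum_r alpha_r * exp(-2j*pi*f*tau_r).

    `delays` are in seconds and `gains` are frequency-independent attenuations.
    """
    return sum(a * cmath.exp(-2j * math.pi * freq_hz * tau)
               for tau, a in zip(delays, gains))

# Two components: a direct path at 5 ms and an echo at 7.5 ms with half the gain.
delays = [0.005, 0.0075]
gains = [1.0, 0.5]

response_dc = echo_model_response(0.0, delays, gains)       # components add up
response_200 = echo_model_response(200.0, delays, gains)    # components cancel partly
```

At 200 Hz the two components are in phase opposition, so the magnitude drops from 1.5 to 0.5. This comb-like behaviour is what a rake beamformer tries to exploit, and it is described by only two parameters per echo.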
We evaluate the performance of both types of beamformers on the task of noise and late reverberation suppression.Different Minimum Variance Distortionless Response (MVDR) beamformers are considered, assuming either spatially white noise (i.e., classical DS design), diffuse noise (i.e., the Capon filter) or diffuse noise plus the late reverberation [60].In the latter case, the late reverberation statistics are modeled by a spatial coherence matrix [61] weighted by the late reverberation power, which is estimated using the procedure described in [60].
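For a single frequency bin, the textbook MVDR weights take the form w = Σ⁻¹h / (hᴴΣ⁻¹h), where Σ is the assumed noise covariance matrix and h the steering vector. The sketch below uses random placeholders for both, so it only illustrates the algebra, not the specific designs compared in this study.

```python
import numpy as np

def mvdr_weights(steering, noise_cov):
    """MVDR weights for one frequency bin: Sigma^{-1} h / (h^H Sigma^{-1} h)."""
    numerator = np.linalg.solve(noise_cov, steering)
    return numerator / (steering.conj() @ numerator)

rng = np.random.default_rng(1)
I = 5                                                   # microphones in one array
h = rng.standard_normal(I) + 1j * rng.standard_normal(I)    # placeholder steering vector
A = rng.standard_normal((I, I)) + 1j * rng.standard_normal((I, I))
sigma = A @ A.conj().T + 0.1 * np.eye(I)                # Hermitian positive-definite

w_mvdr = mvdr_weights(h, sigma)
w_ds = mvdr_weights(h, np.eye(I))                       # white noise: matched filter

distortionless_gain = w_mvdr.conj() @ h                 # equals 1 by construction
noise_mvdr = (w_mvdr.conj() @ sigma @ w_mvdr).real
noise_ds = (w_ds.conj() @ sigma @ w_ds).real
```

With an identity covariance the weights reduce to h / ‖h‖², the delay-and-sum-like design, while a structured covariance yields lower output noise power under the same distortionless constraint.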
The different designs are compared on the task of enhancing a target speech signal in a 5-channel mixture using the nULAs in the dEchorate dataset. They are tested in scenarios featuring high reverberation and diffuse babble noise, appropriately scaled to pre-defined signal-to-noise ratios SNR ∈ {0, 10, 20} dB. Using the dEchorate data, we consider the room configuration 011111 (RT60 ≈ 600 ms) and all possible combinations of (target, array) positions. Both real and corresponding synthetic RIRs are used, which are then convolved with anechoic utterances from the WSJ corpus [36] and corrupted by recorded diffuse babble noise. The synthetic RIRs are computed with the Python library pyroomacoustics [62], based purely on the ISM. Hence, on synthetic RIRs, the known echo timings perfectly match the components in their early part (no model mismatch).
The evaluation is conducted similarly to the one in [10], where the following metrics are considered:
• the Signal-to-Noise plus Reverberation Ratio improvement (iSNRR) in dB, computed as the difference between the input SNRR at the reference microphone and the SNRR at the filter output;
• the Speech-to-Reverberation energy Modulation Ratio improvement (iSRMR) [63] to measure dereverberation;
• the Perceptual Evaluation of Speech Quality improvement (iPESQ) score [64] to assess the perceptual quality of the signal and, indirectly, the amount of artifacts.
Implementations of the SRMR and Perceptual Evaluation of Speech Quality (PESQ) metrics are available in the Python library speechmetrics. Both the Signal-to-Noise plus Reverberation Ratio (SNRR) and the PESQ are relative metrics, meaning they require a target reference signal. Here we consider the clean target signal to be the dry source signal convolved with the early part of the RIR (up to the R-th echo) of the reference (first) microphone. On the one hand, this choice numerically penalizes both direct-path-based and ReTF-based beamformers, which respectively aim at extracting the direct-path signal and the full reverberant signal in the reference microphone. On the other hand, considering only the direct path or the full reverberant signal would be equally unfair for the other beamformers. Moreover, including early echoes in the target signal is perceptually motivated since they are known to contribute to speech intelligibility [65].
Numerical results are reported in Figure 7. On synthetic data, as expected, one can see that the more information is used, the better the performance. Including late reverberation statistics considerably boosts performance in all cases. Both the ReTF-based and the echo-aware beamformers significantly outperform the simple designs based on the direct path only. While the two designs perform comparably in terms of iSNRR and iPESQ, the former has a slight edge over the latter in terms of median iSRMR. A possible explanation is that GEVD methods tend to consider the stronger and more stable components of the ReTFs, which in the considered scenarios may coincide with the earlier portion of the RIRs. Moreover, since it is not constrained by a fixed echo model, the ReTF can capture more information, e.g., frequency-dependent attenuation coefficients. Finally, one should consider the compactness of the model (5) with respect to the ReTF model in terms of the number of parameters to be estimated. In fact, when considering 4 echoes, only 8 parameters per channel are needed, as opposed to several hundreds for the ReTF (ideally, as many as the number of frequency bins per channel).
When it comes to measured RIRs, however, the trends are different. Here, the errors in echo timings due to calibration mismatch and the richness of real acoustic propagation lead to a drop in performance for echo-aware methods, both in terms of means and variances. This is clearest when considering the iPESQ metric, which also accounts for artifacts. The echo-agnostic beamformer considering late reverberation, MVDR-ReTF-Late, outperforms the other methods, maintaining the trend exhibited on simulated data. Finally, unlike MVDR-ReTF-Late, MVDR-Rake-Late yields a significant portion of negative performances. As already observed in [10], this is probably due to the tiny annotation mismatches in echo timings as well as the fact that their frequency-dependent strengths, induced by reflective surfaces, are not modeled in rake beamformers. This suggests that, in order to be applicable to real conditions, future work in echo-aware beamforming should include finer blind estimates of early echo properties from signals, as investigated in, e.g., [35,66].

Application: Room Geometry Estimation
The shape of a convex room can be estimated knowing the positions of first-order image sources. Several methods have been proposed which take into account different levels of prior information and noise (see [23,67] for a review). When the echoes' TOAs and their labeling are known for 4 non-coplanar microphones, one can perform this task using simple geometrical reasoning as in [21]. In detail, the 3D coordinates of each image source can be retrieved by solving a multilateration problem [68], namely the extension of the trilateration problem to 3D space, where the goal is to estimate the relative position of an object based on the measurement of its distance with respect to anchor points. Finally, the position and orientation of each room facet can be easily derived from the ISM equations as the plane bisecting the line joining the real source position and the position of its corresponding image (see Figure 8).
In dEchorate, the annotations of all the first-order echo timings are available, as well as the correspondences between echoes and room facets. This information can be used directly as input for the above-mentioned multilateration algorithm. We illustrate the validity of these annotations by employing the RooGE technique in [21] (with known labels) based on them.
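The two geometric steps described above can be sketched as follows. This is a minimal illustration and not the implementation of [21]: the function names `multilaterate` and `bisecting_plane`, the linearized least-squares formulation, and the speed of sound value are assumptions made for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for the example


def multilaterate(mics, dists):
    """Least-squares 3D position from distances to >= 4 non-coplanar anchors.

    Subtracting the sphere equation of the first anchor from the others
    gives a linear system:
        2 (m_i - m_0)^T x = |m_i|^2 - |m_0|^2 - d_i^2 + d_0^2
    """
    mics = np.asarray(mics, dtype=float)
    dists = np.asarray(dists, dtype=float)
    A = 2.0 * (mics[1:] - mics[0])
    b = (np.sum(mics[1:] ** 2, axis=1) - np.sum(mics[0] ** 2)
         - dists[1:] ** 2 + dists[0] ** 2)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x


def bisecting_plane(source, image):
    """Plane bisecting the segment joining a source and its image.

    Returns a unit normal and a point lying on the plane (the midpoint).
    """
    source = np.asarray(source, dtype=float)
    image = np.asarray(image, dtype=float)
    normal = image - source
    normal = normal / np.linalg.norm(normal)
    point = 0.5 * (source + image)
    return normal, point


# Toy example: a source at x = 1 m mirrored by a wall lying in the plane x = 0.
source = np.array([1.0, 2.0, 1.5])
image_true = np.array([-1.0, 2.0, 1.5])
mics = np.array([[0.5, 0.5, 1.0],
                 [2.0, 0.5, 1.0],
                 [0.5, 3.0, 1.0],
                 [0.5, 0.5, 2.2]])  # 4 non-coplanar microphones
toas = np.linalg.norm(mics - image_true, axis=1) / SPEED_OF_SOUND

image_est = multilaterate(mics, SPEED_OF_SOUND * toas)
normal, point = bisecting_plane(source, image_est)
```

With noisy TOAs and more than 4 microphones, the same least-squares system is simply overdetermined; a nonlinear refinement of the estimate would be a natural extension.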
Table 7 shows the results of the estimation of the room facet positions in terms of Distance Error (DE, in centimeters) and surface orientation error (dubbed here Angular Error (AE), in degrees) using a single source and all 30 microphones, namely the 6 arrays. Room facets are estimated using each of the sources #1 to #4 as a probe. Despite a few outliers, the majority of facets are estimated correctly in terms of their placement and orientation with respect to the coordinate system computed in Section 2.3. For instance, using source #4, all 6 surfaces were localized with 1.49 cm DE on average and their inclinations with 1.3  the ideal shoebox approximation. In the real recording room, some gaps were present between revolving panels in the room facets. In addition, it is possible that for some (image source, receiver) pairs the far-field assumption does not hold, causing inaccuracies when inverting the ISM. The 2 outliers for source #3 are due to a wrong annotation caused by the source directivity, which induced an echo mislabeling. When a wall is right behind a source, the energy of the related first-order reflection is very small and might not appear in the RIRs. This happened for the eastern wall, and a second-order image was taken instead. Finally, the contribution of multiple reflections arriving at the same time can result in large late spikes in estimated RIRs. This effect is particularly amplified when the microphones and loudspeakers exhibit long impulse responses. As a consequence, some spikes can be misclassified. This happened for the southern wall, where again a second-order image was taken instead. Note that such echo mislabelings can either be corrected manually or by using Euclidean distance matrix criteria as proposed in [21]. Overall, this experiment illustrates well the interesting challenge of estimating and exploiting acoustic echoes in RIRs when typical sources and receivers with imperfect characteristics are used.
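For reference, the two error measures could be computed as in the sketch below. The exact definitions used for Table 7 are not spelled out here, so the following are assumptions: AE is taken as the angle between the estimated and ground-truth plane normals (ignoring their sign), and DE as the distance from a point on the estimated facet to the ground-truth plane. The function names are hypothetical.

```python
import numpy as np


def angular_error_deg(n_est, n_ref):
    """Angle in degrees between two plane normals, ignoring their sign (assumed AE)."""
    n_est = np.asarray(n_est, dtype=float)
    n_ref = np.asarray(n_ref, dtype=float)
    n_est = n_est / np.linalg.norm(n_est)
    n_ref = n_ref / np.linalg.norm(n_ref)
    cos_angle = np.clip(abs(np.dot(n_est, n_ref)), 0.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))


def distance_error_cm(p_est, n_ref, p_ref):
    """Distance in cm from a point of the estimated facet to the true plane (assumed DE).

    Positions are expected in meters.
    """
    p_est = np.asarray(p_est, dtype=float)
    p_ref = np.asarray(p_ref, dtype=float)
    n_ref = np.asarray(n_ref, dtype=float)
    n_ref = n_ref / np.linalg.norm(n_ref)
    return float(100.0 * abs(np.dot(p_est - p_ref, n_ref)))


# Toy example: estimated wall shifted by 2 cm and slightly tilted
# with respect to a true wall lying in the plane x = 0.
ae = angular_error_deg([1.0, 0.02, 0.0], [1.0, 0.0, 0.0])
de = distance_error_cm([0.02, 1.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```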

Conclusions and Perspectives
This paper introduced a new database of room impulse responses featuring accurate annotations of early echo timings that are consistent with source, microphone and room facet positions. These data can be used to test methods in the room geometry estimation pipeline and in echo-aware audio signal processing. In particular, the robustness of these methods can be validated against different levels of RT60, SNR, surface reflectivity, proximity, or early echo density. This dataset paves the way to a number of interesting future research directions. By making this dataset freely available to the audio signal processing community, we hope to foster research in AER and echo-aware signal processing in order to improve the performance of existing methods on real data. Moreover, the dataset could be updated by including more robust annotations derived from more advanced algorithms for calibration and AER.
In addition, the data analysis conducted in this work draws attention to the impact of the mismatch between simulated and real RIRs on audio signal processing methods. Finally, by using the pairs of simulated vs. real RIRs available in the dataset, it should be possible to develop techniques to convert one to the other, using style transfer or domain adaptation techniques, thus opening the way to new types of learning-based acoustic simulators.

Figure 1 Depiction of a measured room impulse response from the database.

Figure 3 Illustration of the recording setup (top view).

Figure 4 Picture of the acoustic lab. From left to right: the overall setup, one microphone array, the setup with revolved panels.

Figure 5 Detail of the GUI used to manually annotate the RIRs. For a given source and a microphone in an nULA, a) and b) each show 2 RIRs for 2 different room configurations (blue and orange) before and after the direct path deconvolution. c) shows the results of the peak finder for one of the deconvolved RIRs, and d) is a detail of the RIR skyline (see Figure 6) on the corresponding nULA, transposed to match the time axis.

Figure 6 The RIR skyline annotated with observed peaks (×) together with their geometrically expected positions (•) computed with the Pyroomacoustics acoustic simulator. As specified in the legend, markers of different colors are used to indicate the room facets responsible for the reflection: direct path (d), ceiling (c), floor (f), west wall (w), . . ., north wall (n).

A common technique is to compute, e.g., the RT10 on the range [−5, −15] dB of the Schroeder curve and to extrapolate the RT60 by multiplying it by 6. We visually inspected all the RIRs of the dataset corresponding to directional sources 1, 2 and 3, i.e., 90 RIRs in each of the 10 rooms. Then, two sets were created. Set A contains all the Schroeder curves exhibiting linear log-energy decays, allowing for reliable RT10 estimates. Set B contains all the other curves. In practice, 49% of the 3600 Schroeder curves were placed in set B. These mostly correspond to the challenging measurement conditions purposefully included in our dataset, i.e., strong early echoes, loudspeakers facing towards reflectors, or receivers close to reflectors. Finally, the RT60 value of each room and octave band was calculated from the median of the RT10 values corresponding to the Schroeder curves in A only.
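The RT10-based extrapolation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce Table 5: the function names are hypothetical, the RIR is assumed to be already band-filtered and to start at the direct path, and the decay slope is obtained with a simple linear fit.

```python
import numpy as np


def schroeder_curve_db(rir):
    """Backward-integrated energy decay curve (Schroeder curve) in dB."""
    energy = np.asarray(rir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]
    edc = np.maximum(edc / edc[0], 1e-30)  # avoid log of zero in the tail
    return 10.0 * np.log10(edc)


def rt60_from_rt10(rir, fs, hi_db=-5.0, lo_db=-15.0):
    """Estimate RT60 by fitting the decay on [hi_db, lo_db] and extrapolating.

    The time needed to decay by 10 dB (RT10) is multiplied by 6.
    """
    edc_db = schroeder_curve_db(rir)
    t = np.arange(len(edc_db)) / fs
    mask = (edc_db <= hi_db) & (edc_db >= lo_db)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # slope in dB per second
    rt10 = -10.0 / slope
    return 6.0 * rt10


# Toy example: exponentially decaying noise with a nominal RT60 of 0.6 s.
fs = 16000
t = np.arange(int(1.5 * fs)) / fs
envelope = 10.0 ** (-3.0 * t / 0.6)  # amplitude drops by 60 dB after 0.6 s
rng = np.random.default_rng(0)
rir = envelope * rng.standard_normal(len(t))
rt60_est = rt60_from_rt10(rir, fs)
```

Curves whose log-energy decay is not close to linear on the fitting range (set B above) would make this linear fit unreliable, which is why they are excluded from the median.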

Figure 7 Boxplot showing the comparison of different echo-agnostic and echo-aware (*) beamformers for the room configuration 011111 (RT60 ≈ 600 ms) on measured and synthetic data for all combinations of source-array positions in the dEchorate dataset. Mean values are indicated as +, while whiskers indicate extreme values.

Figure 8 Image source estimation and reflector estimation for one of the sound sources in the dataset.

Table 2
Measurement and recording equipment.

Table 3
Surface coding in the dataset: each binary digit indicates if the surface is absorbent (0) or reflective (1).

Table 4
Mismatch between geometric measurements and signal measurements in terms of maximum (Max.), average (Avg.) and standard deviation (Std) of absolute mismatch in centimeters. The goodness of match (GoM) between the signal and geometrical measurements is reported as the fraction of matching echo timings for different thresholds in milliseconds.

Table 5
Reverberation time per octave band, RT60(b), calculated in the 10 room configurations. For each coefficient, the number of corresponding Schroeder curves in A used to compute the median estimate is given in parentheses.

Table 6
Summary of the considered beamformers. "n." and "lr." are used as shorthand for noise and late reverberation. (*) denotes echo-aware beamformers.

Table 7
Distance Error (DE) in centimeters and Angular Error (AE) in degrees between ground truth and estimated room facets using each of the sound sources (#1 to #4) as a probe.For each wall, bold font is used for the source yielding the best DE and AE, while italic highlights outliers when present.
ReTF: Relative Transfer Function
TOA: Time of Arrival
TDOA: Time Difference of Arrival
ISM: Image Source Method
SE: Speech Enhancement
SNRR: Signal-to-Noise plus Reverberation Ratio
iPESQ: Perceptual Evaluation of Speech Quality improvement
iSNRR: Signal-to-Noise plus Reverberation Ratio improvement
iSRMR: Speech-to-Reverberation energy Modulation Ratio improvement
RooGE: Room Geometry Estimation
WSJ: Wall Street Journal