An end-to-end approach for blindly rendering a virtual sound source in an audio augmented reality environment

Audio augmented reality (AAR), a prominent topic in the field of audio, requires understanding the listening environment of the user for rendering an authentic virtual auditory object. Reverberation time (RT60) is a predominant metric for the characterization of room acoustics, and numerous approaches have been proposed to estimate it blindly from a reverberant speech signal. However, a single RT60 value may not be sufficient to correctly describe and render the acoustics of a room. This contribution presents a method for the estimation of multiple room acoustic parameters required to render close-to-accurate room acoustics in an unknown environment. It is shown how these parameters can be estimated blindly using an audio transformer that can be deployed on a mobile device. Furthermore, the paper discusses the use of the estimated room acoustic parameters to find a similar room from a dataset of real binaural room impulse responses (BRIRs) that can be further used for rendering the virtual audio source. Additionally, a novel BRIR augmentation technique to overcome the limitation of inadequate data is proposed. Finally, the proposed method is validated perceptually by means of a listening test.


Introduction
To create a convincing illusion of a virtual auditory object in a given environment via headphones, two major requirements must be met. The first involves placing the object within a space, which requires considering the room acoustics; the second involves the anatomy of the individual, which changes the characteristics of the sound before it reaches the eardrum [1]. Placing the object in a space can be achieved by convolving a dry (or anechoic) signal with a room impulse response (RIR). This gives the listener a perception of sound originating from a specific location within a room, replicating the exact conditions under which the RIR was recorded. An RIR describes the behaviour of any sound source in a particular room and condition. It is typically divided into direct sound (DS), early reflections (ER), and late reverberation (LR), and it is well established that DS and ER have perceptually relevant directional properties [2]. However, a single omnidirectional microphone cannot capture this directional information, which makes it inadequate for producing an authentic auditory rendering. For a realistic rendering of a virtual auditory object, binaural room impulse responses (BRIRs) are captured, which describe how the sound travels from a sound source to the ears of a listener in a particular room [3]. An ideal BRIR measurement would require capturing the microphone signals at both eardrums of the listener for every possible direction of view (DOV), resulting in an individual BRIR dataset. Such a dataset includes the influences of both the room's characteristics and the listener's physical presence. When convolved with a dry signal and played back over headphones, it should give that specific listener a perfect reconstruction of a virtual sound source. However, this is a very time-consuming task if it has to be performed for each listener in every listening condition and for all possible DOVs.
Alternatively, to imitate an effect similar to that of an individual BRIR, a generic BRIR can be recorded using a head-and-torso simulator (HATS). The direct sound from this BRIR can then be replaced with a head-related transfer function (HRTF) of the particular listener. It is well understood that performing binaural rendering with an individual HRTF rather than a generic one leads to higher plausibility and externalization [4][5][6]. Additionally, Werner et al. have demonstrated that the perceived externalization also depends on visual cues, due to acoustic divergence between the rendered and the listening rooms [7]. Hence, for the virtual auditory object to sound plausible and external in audio augmented reality (AAR), it is crucial to render a reverberation that is similar to that of the listening environment, since an incorrect rendering of room acoustics would destroy the illusion of realism [8].
However, capturing BRIRs in every possible environment would require an impractical amount of effort and specialized equipment. To overcome this challenge, data-driven approaches have been proposed that estimate room acoustics directly from noisy reverberant speech signals using machine learning (ML)-based techniques, the earliest of which date back to 2001 [9]. In contrast to estimating the room reverberation itself, other researchers propose estimating well-known room acoustic parameters such as reverberation time (RT) to get a rough understanding of the environment [10].
Apart from wide-band RT, frequency-dependent RT, energy decay curve (EDC), clarity (C50), and direct-to-reverberant ratio (DRR) have been considered the most important parameters of a room that assist in the analysis of a given scenario [11]. Prior knowledge of these parameters could result in a more accurate representation of a virtual sound object in any given listening environment. Generally, however, these parameters can only be calculated from high-quality measured RIRs and, as mentioned earlier, measuring an RIR is not practical at the user end of the application due to the cost and effort involved. Hence, blind estimation of the parameters solely from reverberant speech signals has been of great interest to researchers in the field [10, 12–22].
RT60 is often considered a key parameter for describing the acoustics of a space. It is defined as the time taken for the sound energy to decrease by 60 dB after the source has stopped. As defined by ISO 3382-2 [23], RT60 can be calculated from the energy decay curve of an RIR by observing the time taken to reach 60 dB below the initial level. In practice, however, RT60 is usually extrapolated from a smaller dynamic range such as 30 dB (T30), i.e. by taking twice the T30 value. Another well-known method to calculate the RT60 of a space is given by Sabine's formula:

$$RT_{60} = \frac{0.161\,V}{S\,\alpha} \tag{1}$$

where RT60 is the time in seconds required for a sound to decay by 60 dB, V is the volume of the room in m³, S is the boundary surface area in m², and α is the average absorption coefficient. Numerous methods exist for estimating RT60 blindly, i.e. from audio signals without the use of an RIR, including both signal processing and ML-based methods [10, 13–22]. A review of most of these algorithms can be found in [24].

Another important room acoustic parameter is the early-to-late reverberant ratio (ELR) or clarity (C50). It is the ratio of the early sound energy (up to 50 ms) to the residual energy in an RIR and is expressed in dB. As given by [25], C50 can be defined mathematically as:

$$C_{50} = 10 \log_{10} \left( \frac{\int_{0}^{50\,\mathrm{ms}} h^{2}(t)\,dt}{\int_{50\,\mathrm{ms}}^{\infty} h^{2}(t)\,dt} \right) \mathrm{dB} \tag{2}$$

where h(t) denotes the RIR. Although C50 is often used as an indicator of speech clarity or intelligibility, it may also assist in discriminating rooms with similar RT60 [11]. Several non-intrusive approaches exist for estimating it alone [26] or jointly with RT60 [13,16].
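The definitions above can be illustrated with a minimal sketch, assuming a single-channel RIR stored as a plain list of samples. The function names, the -5 dB to -35 dB fitting range used for T30, and the noise-free synthetic RIR are illustrative choices rather than the evaluation code used in this work.

```python
import math

def schroeder_edc_db(rir):
    """Energy decay curve in dB via backward (Schroeder) integration."""
    edc, running = [0.0] * len(rir), 0.0
    for i in range(len(rir) - 1, -1, -1):
        running += rir[i] * rir[i]
        edc[i] = running
    total = edc[0]
    return [10.0 * math.log10(e / total) if e > 0.0 else float("-inf") for e in edc]

def rt60_from_t30(rir, fs):
    """Fit a line to the EDC between -5 and -35 dB and extrapolate to 60 dB."""
    pts = [(i / fs, lvl) for i, lvl in enumerate(schroeder_edc_db(rir))
           if -35.0 <= lvl <= -5.0]
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_l = sum(l for _, l in pts) / len(pts)
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in pts)
             / sum((t - mean_t) ** 2 for t, _ in pts))  # dB per second
    return -60.0 / slope

def c50_db(rir, fs):
    """Ratio of energy before and after 50 ms, in dB."""
    split = int(round(0.050 * fs))
    early = sum(v * v for v in rir[:split])
    late = sum(v * v for v in rir[split:])
    return 10.0 * math.log10(early / late)

def sabine_rt60(volume_m3, surface_m2, alpha):
    """Sabine's formula in metric units."""
    return 0.161 * volume_m3 / (surface_m2 * alpha)

def synthetic_rir(rt60, fs, duration):
    """Hypothetical test signal: noise-free exponential decay (60 dB in rt60 s)."""
    k = math.log(1000.0) / rt60  # amplitude decay rate in 1/s
    return [math.exp(-k * n / fs) for n in range(int(duration * fs))]
```

For a measured BRIR, the same functions would be applied to each ear signal (and to each band-filtered version for sub-band values); a real measurement also needs noise-floor handling, which this sketch omits.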
In 2015, the ACE Challenge [12,24] created a benchmark for the evaluation of RT60 and direct-to-reverberant ratio (DRR) estimation approaches. The benchmark allows researchers to fairly evaluate their models against the state-of-the-art models. Eaton et al. [24] compared multiple estimation approaches in terms of mean squared error (MSE), estimation bias, and Pearson correlation coefficient (ρ). MSE is defined as the average of the squared differences between the predicted and real values:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2 \tag{3}$$

where x_i is the ith observed value, y_i is the corresponding predicted value, and N is the number of observations. Bias is the mean error in the results and is given by:

$$\mathrm{Bias} = \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i) \tag{4}$$

Finally, ρ is the Pearson correlation coefficient between the estimated and the ground truth results and is defined as:

$$\rho = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}} \tag{5}$$

where x̄ and ȳ are the means of the real and predicted values, respectively. By considering the MSE, bias, and ρ together, it is possible to determine how well an estimator performs [12]. An estimator with low bias and MSE might simply return a value close to the median for every speech file. Examining ρ exposes such an algorithm, since its ρ will be close to 0. An algorithm that estimates the parameter more accurately will have ρ close to 1, with low MSE and bias close to 0. The results from the ACE Challenge showed that signal processing and ML methods had similar performance for RT estimation, while ML methods were better for DRR estimation at the time of the challenge.
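A short sketch of the three metrics shows why ρ is needed alongside MSE and bias. The function names and the example values are hypothetical; returning 0 for a constant predictor (where ρ is strictly undefined) is a convention chosen here for illustration.

```python
import math

def mse(truth, pred):
    """Mean of the squared differences between real and predicted values."""
    return sum((x - y) ** 2 for x, y in zip(truth, pred)) / len(truth)

def bias(truth, pred):
    """Mean signed error (prediction minus ground truth)."""
    return sum(y - x for x, y in zip(truth, pred)) / len(truth)

def pearson(truth, pred):
    """Pearson correlation coefficient; 0.0 is returned if either input is constant."""
    mx = sum(truth) / len(truth)
    my = sum(pred) / len(pred)
    num = sum((x - mx) * (y - my) for x, y in zip(truth, pred))
    den = math.sqrt(sum((x - mx) ** 2 for x in truth)
                    * sum((y - my) ** 2 for y in pred))
    return num / den if den > 0.0 else 0.0

# Hypothetical RT60 values in seconds.
truth = [0.3, 0.5, 0.7, 0.9, 1.1]
near_median = [0.71, 0.69, 0.70, 0.69, 0.71]   # hovers around the median
accurate = [x + 0.02 for x in truth]           # follows the ground truth

for name, pred in (("near-median", near_median), ("accurate", accurate)):
    print(name, mse(truth, pred), bias(truth, pred), pearson(truth, pred))
```

The near-median predictor has zero bias but a correlation of essentially zero, whereas the accurate predictor has a small bias and a correlation of essentially one.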

Related works
Since the ACE Challenge, a handful of publications have demonstrated the use of ML to estimate RT60 from noisy reverberant speech signals [14,15,18–20]. In 2018, Gamper and Tashev used convolutional neural networks (CNN) to predict the average RT60 of a reverberant signal from a Gammatone-filtered time-frequency spectrum [15], outperforming the best method from the ACE Challenge. In 2020, Looney and Gaubitch [17] showed promising results in the joint blind estimation of RT60, DRR, and signal-to-noise ratio (SNR). Bryan [18], on the other hand, proposed a method to generate augmented training datasets from real RIRs, which improved RT60 and DRR predictions. Ick et al. [21] introduced a series of phase-related features and demonstrated clear improvements in the context of reverberation fingerprint estimation on unseen real-world rooms. However, one common limitation of most of the aforementioned works is that they only provide broadband parameter values rather than frequency-dependent ones. This is an oversimplification of the room acoustics model that can potentially limit the rendering realism in the context of AAR. Instead of using speech inputs, a few researchers have tried estimating acoustic parameters from wide-band inputs such as music signals [27]. The results demonstrate lower estimation accuracy for music input than for speech-only input. According to the authors, one of the reasons for this is the additional reverb applied to the music content during the mixing process, which adds to the room reverb when the input signal is convolved with it.
In the study by Götz et al. [28], the authors extended the work from [19] to estimate sub-band RT60 and C50 in dynamic conditions using convolutional recurrent neural networks (CRNN). For music input, the model improves greatly in estimating RT60 under dynamic conditions compared with static conditions. This shows the ability of a model to differentiate between room reverberation and reverb from music when more than one condition is presented. The model trained with dynamic data shows improved performance even under static conditions. Although the results show an estimation error well under the just noticeable difference (JND) for RT60, the model fails to predict C50 as accurately for C50 > 15 dB. The training and evaluation are, however, only performed on noiseless data. Furthermore, as pointed out by the authors, the model only considers single-channel signals, which lack spatial information; this could be an aspect of future work.
The parameters estimated by ML methods can also be used to drive artificial reverberators to reconstruct the RIR properties of the space. For example, the method from Srivastava et al. [22] predicts not only RT60 but also room volume and total surface area, which could be useful parameters for the configuration of artificial reverberators. Alternatively, some methods aim at the selection of an RIR from a database [11,29].
On the other hand, a few researchers have focused on directly estimating the RIR by leveraging audio-visual cues [30,31]. Although this is a valuable approach for matching the acoustics of the current room in augmented reality (AR), relying on visual cues might not be practical in scenarios like headphone listening where visual input is not available. Additionally, the 10 s required to optimize the material [30] is unsuitable for real-time rendering where the listener moves between different rooms.
Steinmetz et al. proposed to estimate a time-domain RIR solely from reverberant speech using filtered noise shaping in [32]. The approach shows a nearly perfect reconstruction of the RIR from speech. However, the estimation of single-channel RIRs demonstrated in that publication may not be sufficient for the auralization of spatial audio. Ratnarajah et al. [33] proposed M3-AUDIODEC, a neural network-based multi-channel multi-speaker audio codec that overcomes the limitation of generating a single-channel RIR. The approach compresses multi-channel audio while retaining spatial information, and it can also aid in solving acoustic scene-matching problems. The method not only allows the reconstruction of reverberant speech but also provides the separation of clean speech and the BRIR, which can be utilized to auralize other signals. The approach is, however, only tested on simulated data, and the spatial coherence of other BRIRs in the same room has not yet been evaluated.
Most of the neural network approaches mentioned earlier make use of traditional CNN models with time-frequency representations (STFT, mel, etc.). In acoustic characterization, a CNN is usually applied to solve the blind acoustic parameter estimation problem as a regression task. CNNs are suitable for learning two-dimensional time-frequency signal patterns for end-to-end modelling, which is why they are widely used in the aforementioned approaches. To better capture long-range global context, hybrid models that combine a CNN with a self-attention mechanism have been proposed. These models have achieved state-of-the-art results for various tasks such as acoustic event classification and other audio pattern recognition topics [34].
Gong et al. [35] recently proposed a purely attention-based model for audio classification, called the audio spectrogram transformer (AST). They evaluated AST on different audio classification benchmarks and achieved state-of-the-art results, which shows that CNNs are not always necessary in this context. However, transformers such as AST require a huge amount of computational power due to their complexity. To tackle this, a mobile audio transformer (AudMobNet) was proposed in [13], which is not only independent of the length of the sequence but also more robust against noise and computationally less expensive. That approach was only verified for broadband RT60 and C50 values. Upon further inspection, it was found that, instead of estimating only a single broadband RT60 value, sub-band RT60 and C50 can be beneficial in understanding the tonal characteristics of the room.
In this work, we aim to focus on estimating the acoustic parameters that correlate more with perceived plausibility. We extend our model (AudMobNet) presented in [13] to jointly estimate broadband and sub-band RT60 and C50 from noisy reverberant binaural speech signals. We also propose to improve the existing model architecture and to use additional features such as phase and continuity differences to improve the performance of the model. Additionally, a novel multi-channel data augmentation technique to enhance the generalization capability of the network in sub-band RT estimation is presented. Lastly, building on prior works [11,13], an end-to-end blind spatial audio rendering setup is developed (Fig. 1) that takes a noisy speech recording as input and outputs a plausible binaural rendering of an arbitrary signal by means of BRIR selection. We use the estimated parameters to find closely related rooms from a dataset of high-quality BRIRs using linear discriminant analysis (LDA). The selected rooms are further used for rendering a virtual auditory object. Contrary to the existing approaches mentioned earlier, our method relies solely on reverberant speech to binauralize the listening room, and the selection of a 360-degree BRIR set allows head rotation without any audible artefacts.

Proposed method
Our proposed blind end-to-end spatial audio rendering setup consists of four modules, namely recorder, estimator, selector, and renderer (see Fig. 1). The estimator operates in a frame-wise manner in parallel with the recorder, and the input buffer length can be adjusted as needed. A longer buffer will lead to increased latency in the room parameter estimation (when in real time), whereas a too-short buffer might produce inaccurate or unstable predictions. The estimated parameters are then used to select a set of 360-degree BRIRs recorded in a real environment using the technique proposed in [11]. The selected set is then used for rendering the virtual object over headphones through convolution. This end-to-end setup is used to evaluate the perceptual relevance of the proposed method (Section 4.4).
The most important module in the end-to-end setup is the Estimator, which makes use of the earlier presented AudMobNet [13] and extends it to the application of AAR rendering. Exploiting the binaural recording feature of new-age consumer-end microphones, we can blindly estimate the room acoustic parameters, which can then be used to find a similar high-quality BRIR set from the database. This high-quality measured BRIR set ensures that the intra-set spatial coherence is preserved. We believe that if the parameters of the rendered BRIR are close enough to those of the listening room, the room divergence effect will be mitigated, resulting in a plausible virtual sound source [36]. Furthermore, it is believed that head tracking may also lead to a more authentic virtual sound source [37].

Fig. 1 The end-to-end blind rendering setup of the proposed method

Model and feature input
The effectiveness of mobile transformers for estimating room acoustic parameters has already been demonstrated by the AudMobNet model proposed in [13]. Here, we propose using AudMobNet with additional changes for estimating room acoustic parameters from binaural signals. Instead of using mel-spectrograms as the only input to the network, we propose using inter-channel differences, exploiting the binaural nature of the problem, in addition to the logarithmic mel-spectrogram. It is well established that, for localizing a sound source, inter-channel (aural) phase differences (IPD) are the main contributors at low frequencies (< 1500 Hz), while inter-channel level differences (ILD) contribute at frequencies above 1500 Hz [38]. Furthermore, Ick et al. demonstrate that using phase and continuity features assists in improving RT60 estimation at low frequencies [21]. Similarly, results from Srivastava et al. [22] showed that using inter-channel features such as ILD and IPD leads to better parameter estimation than networks where only single-channel features (STFT) are utilized. Hence, we believe that using IPD would help the network in understanding the low-frequency components. Furthermore, continuity features are used to track the phase variations across time, which might help in understanding the overall context of the spectrum when estimating sub-band parameters. To boost the generalization ability of the network and achieve its full potential for estimating sub-band parameters, a BRIR augmentation technique is also presented, which is discussed later in the section.
For the input data, the 2-channel raw audio, sampled at 16,000 Hz, is transformed into spectrograms using an STFT with Hann-windowed frames and 50% overlap. The STFT is further filtered with mel filterbanks, generating a mel-spectrogram of shape M × L, where M is the number of mel bins and L is the length of the resultant spectrogram. Here, L depends on the frame size F used for calculating the STFT. For faster training and ease of evaluation, we keep M fixed to 64 bins, but two different frame sizes are studied, 256 and 512. The mel-spectrograms are further used for generating phase and continuity features as in [21]. The mel features are then transformed to a logarithmic scale. The sine and cosine phase features from the left and right channels are then utilized to generate the IPD as:

$$\sin\mathrm{IPD}_{t,f} = \sin(\theta_{t,f}), \qquad \cos\mathrm{IPD}_{t,f} = \cos(\theta_{t,f})$$

where θ_{t,f} = ∠x_{t,f,l} − ∠x_{t,f,r} is the inter-channel phase difference between the mel-spectrograms x_l and x_r at time t and frequency f of the signals at microphones l and r. The second-order derivatives of the IPDs are then calculated; we call these inter-channel continuity differences (ICD).

As shown in Fig. 2, the features (sine and cosine IPDs, and sine and cosine ICDs) are stacked with the logarithmic mel-spectrograms to generate 6-channel inputs. The 6-channel inputs are then masked with a time mask of size 64 and a frequency mask of size 16. During training, the mask is applied to the input, randomly masking 64 timesteps and 16 mel sub-bands. This prevents the model from relying upon specific regions in the audio. It can be regarded as the usual and widespread dropout, but applied to the input; an example can be seen in Fig. 2a.
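As a rough illustration of the inter-channel features, the sketch below computes sine and cosine IPDs from two complex time-frequency representations and a discrete second-order difference. It assumes the inputs are nested lists indexed as [time][frequency], and it assumes that the second-order derivative is taken along the time axis with zero-valued edge frames; both are assumptions made here for illustration and are not confirmed details of the feature extraction in this work.

```python
import cmath
import math

def ipd_features(left, right):
    """Return (sin_ipd, cos_ipd) for complex bins indexed as [time][frequency]."""
    sin_ipd, cos_ipd = [], []
    for frame_l, frame_r in zip(left, right):
        # theta = angle(left) - angle(right) for every frequency bin
        theta = [cmath.phase(a) - cmath.phase(b) for a, b in zip(frame_l, frame_r)]
        sin_ipd.append([math.sin(t) for t in theta])
        cos_ipd.append([math.cos(t) for t in theta])
    return sin_ipd, cos_ipd

def second_difference_over_time(feature):
    """Discrete second derivative f[t+1] - 2 f[t] + f[t-1]; edge frames stay 0."""
    out = [[0.0] * len(frame) for frame in feature]
    for t in range(1, len(feature) - 1):
        for f in range(len(feature[t])):
            out[t][f] = feature[t + 1][f] - 2.0 * feature[t][f] + feature[t - 1][f]
    return out

def icd_features(left, right):
    """Hypothetical ICD: second difference of the sine and cosine IPDs."""
    sin_ipd, cos_ipd = ipd_features(left, right)
    return second_difference_over_time(sin_ipd), second_difference_over_time(cos_ipd)
```

Taking the sine and cosine of the phase difference avoids the 2π wrapping discontinuity that the raw angle would introduce.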
Also, since the input already has more than one channel, the spectrum could not be utilized to shorten the sequence into patches. Hence, to generate 16 patch embeddings for the model input as in [13], a 3 × 3 convolution layer is used instead. The convolution provides us with 16 representations of the 6-channel masked input, as can be noted in Fig. 2. These patch embeddings are used as inputs for AudMobNet. The output linear layer produces 27 embeddings which include a single full-band RT60, 8 sub-band RT60s, and similarly 9 C50 values for each channel.

Datasets
Multiple publicly available datasets [39,40] of measured BRIRs were used, resulting in a total of 571 real BRIRs across 45 rooms. In addition, a highly detailed internal dataset of 6 rooms presented in [11] with 1440 real BRIRs was utilized. To balance the dataset, only 200 BRIRs were chosen from the latter, resulting in a total of 771 BRIRs. Different speech corpora [41,42] and an anechoic noise dataset [43] were used to generate the final samples. Babble noise was also simulated using a different speech dataset [44]. The BRIR, speech, and noise datasets were split into training and evaluation sets to avoid any overlap. For example, all BRIRs from 5 different rooms were placed in the evaluation set, and the speech samples were taken from the ACE evaluation set [12]. Further, in order to expand the size of the dataset and to tackle over-fitting, a novel BRIR augmentation technique was also incorporated. It should be noted that the augmentation technique was only applied to the training set.

BRIR augmentation
We augment our BRIR dataset by parametrically modifying the diffuse tail (after the mixing time) of the original BRIRs in different frequency bands, which allows us to mimic various tonal absorption patterns of rooms that are not present in the dataset. In [18], Bryan proposes replacing the late reverberation tail with a synthetic version generated using Gaussian noise. Our technique is based on [18] with a few key differences. Firstly, the length of our artificially generated Gaussian reverberation tail varies across sub-bands, mimicking various tonal absorption patterns of sound in real rooms, in contrast to [18], where no frequency-domain adaptation is incorporated. Secondly, due to the convolution with the original reverberation, our augmentation method also works for multi-channel RIRs. Finally, we also incorporate different decay types, such as linear or logarithmic, to account for more absorptive or more reflective rooms.
In our augmentation approach, a circular set of BRIRs is augmented by changing the reverberation tail. This circular set consists of 360/n BRIRs, where n is the spatial resolution in degrees, which varies from room to room depending on the dataset. As shown in Fig. 3, to generate a reverberation tail, the original BRIR is filtered with a mel filterbank and the RT is calculated for each mel sub-band. Then, Gaussian noise with a length similar to that of the BRIR is generated. The noise is convolved with the diffuse part of the original BRIR to achieve a similar decorrelation between the left and right channels. The resulting binaural noise is filtered with the same mel filterbank, and sub-band noise shaping is performed using different EQ gains for each sub-band. Back in the time domain, a decay filter is applied to each band using either a linear or a logarithmic decay, with an RT within ±500 ms of the original sub-band RT. All the sub-band tails are then added up to generate the full-band BRIR tail. The old tails are then replaced in all the BRIRs from the same room after the mixing time. The augmented BRIR h_r(t) is generated from the real BRIR h(t) by replacing the late tail with h_aug(t) using a crossfade around the mixing time, where t_m is the approximate mixing time as given by [45], calculated as t_m = 80 · RT_500Hz, w_n(t) is the first half of a Hanning window of length 0.2 s (2 · t_w), and w_e(t) is the latter half of the window. Note that this augmentation technique can be applied to RIRs with any number of channels, but due to the scope of this research, only BRIR augmentation is demonstrated. Approximately 20,000 BRIRs were generated from 671 real BRIRs using a single position from 40 rooms, and the generated responses can be seen in Fig. 4.
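The tail replacement can be sketched as a Hann-shaped crossfade on one channel. This is only an illustration under stated assumptions: the crossfade is assumed to start at t_m and last t_w seconds, the falling half-window weights the original response and the rising half weights the augmented tail, and the function names are hypothetical. The exact placement of the window relative to t_m in the proposed method may differ.

```python
import math

def hann_halves(n):
    """Rising and falling halves of a Hann window, each n samples long."""
    rise = [0.5 * (1.0 - math.cos(math.pi * i / (n - 1))) for i in range(n)]
    fall = [1.0 - r for r in rise]  # the two halves always sum to one
    return rise, fall

def replace_tail(h, h_aug, fs, t_m, t_w=0.1):
    """Keep h before t_m, crossfade to h_aug over t_w seconds, then use h_aug."""
    if len(h) != len(h_aug):
        raise ValueError("h and h_aug must have the same length")
    start = int(round(t_m * fs))
    n = int(round(t_w * fs))
    if n < 2 or start + n > len(h):
        raise ValueError("crossfade does not fit inside the response")
    rise, fall = hann_halves(n)
    out = list(h[:start])
    out += [fall[i] * h[start + i] + rise[i] * h_aug[start + i] for i in range(n)]
    out += list(h_aug[start + n:])
    return out
```

For a BRIR, the same call would be made for the left and right ear signals with identical t_m and t_w, so that the transition occurs at the same instant in both channels.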

Data pre-processing
It is worth mentioning that the convolution of each augmented BRIR (20,000) with each speech sample (6000) requires an impractical amount of disk space. Hence, during the generation of the training sets, the BRIRs were convolved with 50 random speech signals. For the real BRIR training dataset, 671 BRIRs were convolved with random speech signals, resulting in 35,000 real samples, and the evaluation set consisted of 1000 real samples generated from measurements from 7 separate rooms. The total number of generated samples is given in Table 1.
The ground truth values were computed as defined by the ISO [23]. The RT60 is extrapolated by taking twice the T30 value, as suggested by the ISO. To obtain sub-band RT60s, the BRIR was first filtered with a mel-frequency filterbank, and an RT60 value for each filtered BRIR was calculated similarly. Sub-band and wide-band C50 were calculated according to Eq. 2.
To simulate acoustic signals, the BRIR was convolved with each dry speech signal s(t), and noise n(t) was then added as

$$x(t) = h(t) * s(t) + a \cdot n(t) \tag{11}$$

where * denotes the convolution operator. To generate realistic noise, all available BRIRs in one room are selected. Afterwards, 30 random speech samples are picked from the noise-generation set mentioned in Section 3.2. Random EQ gains and overall gains are applied to the samples. The resultant samples are convolved with randomly chosen BRIRs from the room. The convolution is done with only the diffuse tail of the real BRIR, allowing us to generate diffuse and spatial noise. The silent parts are removed from the signals to obtain continuous noise. Other noises, such as static ambient noise, babble noise, or office noises, are created similarly. The generated spatial noises n(t) are then multiplied by a gain constant a and added to the convolved speech signal at a random SNR ranging from 6 to 30 dB. Although the network is able to adapt to variable-length sequences, the input signals were trimmed or zero-padded to a length of 4 s to allow batch training. Another reason for choosing this length is the 4-s duration of most speech samples. Further, only BRIRs with RT60 < 2 s were considered, which covers most real scenarios, so a 4-s signal is long enough to contain all the necessary information.

Fig. 3 Brief illustration of BRIR augmentation process
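The mixing step can be sketched for one channel as follows. The helper names are hypothetical, the direct-form convolution is only practical for short signals, and computing the gain a from the mean power of the full reverberant signal and the full noise signal is an assumption made here, since the power definition used for the SNR is not stated above.

```python
import math

def convolve(h, s):
    """Direct-form linear convolution h * s."""
    out = [0.0] * (len(h) + len(s) - 1)
    for i, hv in enumerate(h):
        for j, sv in enumerate(s):
            out[i + j] += hv * sv
    return out

def power(x):
    """Mean power of a signal."""
    return sum(v * v for v in x) / len(x)

def gain_for_snr(reverberant, noise, snr_db):
    """Gain a such that 10*log10(P_speech / P_(a*noise)) equals snr_db."""
    return math.sqrt(power(reverberant) / (power(noise) * 10.0 ** (snr_db / 10.0)))

def fit_length(x, n):
    """Trim or zero-pad x to exactly n samples."""
    return list(x[:n]) + [0.0] * max(0, n - len(x))

def simulate(h, s, noise, snr_db, n_samples):
    """x(t) = h(t) * s(t) + a * n(t), trimmed or padded to n_samples."""
    rev = fit_length(convolve(h, s), n_samples)
    noise = fit_length(noise, n_samples)
    a = gain_for_snr(rev, noise, snr_db)
    return [r + a * v for r, v in zip(rev, noise)], a
```

At a 16,000 Hz sampling rate, n_samples would be 64,000 for the 4-s inputs described above; in practice an FFT-based convolution would replace the nested loops.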

Training setup
After generating the input signals as described in the previous section, the 6-channel features are extracted (as shown in Section 3.1) and used as inputs for the neural network. As seen in Table 5, 4 different configurations are presented to evaluate the effectiveness of the proposed method. In the first configuration, a single AudMobNet is used as in [13] to generate 29 embeddings (8 sub-band and a wide-band RT60 and 8 sub-band and a wide-band C50 for each channel), but instead of the single-channel mel-spectrogram input, we use 2-channel logarithmic mel-spectrogram inputs. In the second and third configurations, we use additional mel phase and continuity features for each channel, similar to Ick et al. [21], along with logarithmic mel-spectrograms, producing 4-channel and 6-channel inputs respectively. In the final configuration, we propose using sine and cosine IPDs along with ICDs, together with the logarithmic mel-spectrograms of both channels as explained earlier, to produce 6-channel inputs. Rather than treating the features as two separate channels as in the third configuration, the proposed method uses the difference of the sine and cosine phase components, which gives the network a better understanding of the behaviour of low frequencies. By stacking these features together, we can compare the performance differences of these features while keeping the same model complexity.

BRIR selection and rendering
In order to evaluate the proposed method, an end-to-end system was built as explained at the end of Section 1, which takes a noisy speech recording as input and outputs a plausible binaural rendering of an arbitrary signal through BRIR selection. As mentioned in Section 3.2.2, to allow batch training, the training samples were trimmed or padded to a length of 4 s, allowing the network to generalize better for this specific length. Furthermore, the results suggest higher accuracy for longer input samples when compared against signals shorter than 3 s (see Table 2). However, it is not efficient to use 30-s-long input sequences for making a single prediction.
Hence, the samples are chopped into shorter segments, such as 4-s-long segments, to obtain more stable estimates from multiple predictions in the Estimator module (Fig. 1). A hop size of 0.5 s was found to be a good trade-off between latency and accuracy and is therefore used as the hop size for the buffer in the further evaluation. Finally, the median values of the predictions over the full input signal length are taken as the best estimates.
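The buffering and aggregation logic can be sketched as follows, assuming a hypothetical `estimator` callable that maps one segment to a list of parameter values; the trained network itself is not shown.

```python
from statistics import median

def split_segments(signal, fs, window_s=4.0, hop_s=0.5):
    """Overlapping segments of window_s seconds, advanced by hop_s seconds."""
    win, hop = int(window_s * fs), int(hop_s * fs)
    if len(signal) <= win:
        return [signal]  # too short to split: use the signal as a single segment
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]

def aggregate_estimates(signal, fs, estimator, window_s=4.0, hop_s=0.5):
    """Median of the per-segment predictions, taken separately per parameter."""
    preds = [estimator(seg) for seg in split_segments(signal, fs, window_s, hop_s)]
    return [median(column) for column in zip(*preds)]
```

Using the median rather than the mean keeps a few outlier predictions, for example from segments that contain mostly silence, from dominating the final estimate.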
After the Estimator module (Fig. 1), the Selector module selects the two best-matching rooms based on the estimated parameters. Similar to the technique presented by Treybig et al. [11], linear discriminant analysis (LDA) is performed on the dataset of all real BRIRs. This separates all the rooms in the latent space based on the parameters provided. The latent space is then stored on disk along with the eigenvectors. At runtime, the predicted parameters are projected into the same space using the saved eigenvectors. The measurements closest to the predicted parameters are selected as the best-matching rooms using the nearest neighbour technique and are then used in the Renderer.
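The runtime part of the selection can be sketched as a projection followed by a nearest-neighbour search. The LDA fit itself is not shown; the eigenvectors, room names, and coordinates below are hypothetical placeholders, and any mean-centring or scaling applied by a real LDA transform would have to be reproduced before the projection.

```python
import math

def project(params, eigenvectors):
    """Project a parameter vector onto the stored basis vectors."""
    return [sum(p * w for p, w in zip(params, vec)) for vec in eigenvectors]

def nearest_rooms(predicted, room_points, eigenvectors, k=2):
    """Names of the k rooms whose stored latent points lie closest to the query."""
    query = project(predicted, eigenvectors)
    ranked = sorted(room_points, key=lambda name: math.dist(query, room_points[name]))
    return ranked[:k]

# Hypothetical stored latent space: two basis vectors and three rooms.
eigenvectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
room_points = {"room_a": [0.30, 5.0], "room_b": [0.90, 1.0], "room_c": [0.35, 4.0]}
selected = nearest_rooms([0.32, 4.8, 99.0], room_points, eigenvectors, k=2)
```

Because only a small matrix product and a distance ranking are needed at runtime, this step adds little latency compared with the parameter estimation.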
Since the dataset consists of BRIR measurements that have irregular spatial resolution, we employ a BRIR interpolation technique to obtain 1° resolution for all the BRIR circular sets. We follow the dynamic time warping (DTW)-based interpolation technique presented in [46,47]. The Renderer module performs binaural rendering by convolving dry audio signals with the selected BRIR pair using partitioned convolution [48]. The BRIR pair is swapped in real time according to the listener's yaw orientation, which is tracked with a MotionNode inertial-measurement-unit head tracker (5-ms latency).

Evaluation
We present a concise evaluation of the room parameter estimation and the BRIR augmentation techniques separately to understand their contribution.

Model selection and input features

Preliminary model evaluation
In Saini and Peissig [13], we presented a mobile audio transformer to estimate wide-band RT60 and C50 from single-channel noisy reverberant speech signals. We evaluated the model against the relevant baselines using the benchmark evaluation criteria and dataset provided by the ACE Challenge [12]. The results (Table 3) demonstrate that using the proposed mobile audio transformer effectively improves the estimation accuracy of both parameters. It should be noted that these models only estimate wide-band parameters from single-channel signals.
Apart from high accuracy, another advantage of this method is its ability to adapt to variable length input due to its unique hybrid transformer architecture.The model achieves higher accuracy even in low SNR levels (Fig. 5) while keeping the model complexity low (see Table 4).estimation compared to Baseline

Input features
In this work, we extend the AudMobNet L model to estimate sub-band parameters from noisy reverberant binaural speech signals (see Section 3.1). To evaluate our feature extraction method, we compared it against two approaches. The first is the previously employed technique that uses the mel-spectrogram as the only input feature, as proposed in [13]; the second uses an input feature extraction technique similar to [21]. The proposed feature extraction technique is compared against both methods in Table 5. All methods are compared for 2 different frame sizes (256 and 512 samples), resulting in a total of 8 configurations to be evaluated.
Each configuration has approximately 1.2 million trainable parameters and requires approximately 500 MFLOPS, making it suitable for deployment on mobile devices. All configurations were trained on the same 35,000 samples (with only real BRIRs) so that they could be fairly assessed. The evaluation set consisted of 1000 real samples from 7 unique rooms, generated in a similar manner to the training set but with the ACE evaluation speech dataset [12]. The distribution of wide-band parameters and an LDA using wide-band and sub-band parameters are given in Fig. 15 (Appendix). The models are trained by minimizing the mean-squared-error loss with the Adam optimizer and an initial learning rate of 0.001. Since the parameters differ in units (s and dB), the training targets were first rescaled using min-max normalization. This gives each parameter equal weight when calculating the loss. The batch size was selected manually to get the best out of each model and the available resources. Models were trained until convergence and the best-performing epoch was selected for each configuration.
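The effect of min-max normalization on targets with different units can be illustrated with a small sketch. The helper names and the three toy target vectors are hypothetical; only the idea of rescaling each parameter to a common range before computing the mean-squared error is taken from the text.

```python
def fit_min_max(rows):
    """Return (min, max) per parameter, computed over the training targets."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_min_max(row, bounds):
    """Rescale each parameter to [0, 1]; assumes max > min for every parameter."""
    return [(v - lo) / (hi - lo) for v, (lo, hi) in zip(row, bounds)]

def mse(pred, target):
    """Mean-squared error over all parameters of one sample."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Hypothetical targets: [RT60 in s, C50 in dB].
targets = [[0.2, 20.0], [0.6, 10.0], [1.0, 0.0]]
bounds = fit_min_max(targets)
scaled = [apply_min_max(row, bounds) for row in targets]

# An error of 0.08 s in RT60 and an error of 2 dB in C50 now contribute equally.
target = apply_min_max([0.6, 10.0], bounds)
rt_only = mse(apply_min_max([0.68, 10.0], bounds), target)
c50_only = mse(apply_min_max([0.6, 12.0], bounds), target)
print(rt_only, c50_only)
```

Without the rescaling, the 2 dB error would dominate the loss by several orders of magnitude simply because of its unit.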
The evaluation results presented in Table 5 demonstrate that using extra features improves the overall estimation compared with using only mel-spectrograms as input to the network. Although calculating the phase and continuity features as in [21] is straightforward and improves the wide-band parameter estimation, it may not be the best choice for RT 60 estimation, be it wide-band or in the low-frequency sub-bands (Fig. 6). The strongest correlation can be observed between 500 Hz and 2 kHz across all the models, in line with the findings from the ACE challenge [24]. This behaviour can be partially attributed to the spectral distribution of energy in speech signals. Further, the results in Fig. 6 agree with our hypothesis that the sine and cosine of the IPD and ICD help the network to capture low-frequency reflections better, resulting in a higher ρ value for parameter estimation in the low frequencies, especially below 1500 Hz. Using the inter-channel differences not only improves the low-frequency prediction but also tends to improve the wide-band estimation accuracy. The lower MSE for the sub-band as well as the wide-band parameters further supports our hypothesis. One reason for this could be the higher inter-channel decorrelation in rooms where more energy is present in the late reverberation, leading to a longer RT 60 and a smaller C 50. However, this should be investigated in independent research.
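As an illustration of why the sine and cosine of the IPD are convenient network inputs, the following sketch computes both for a single time-frequency bin. It assumes that the IPD is the phase of the left-channel STFT bin minus that of the right-channel bin; the function name and the example values are hypothetical, and the ICD feature is not covered here.

```python
import cmath
import math

def ipd_features(left_bin, right_bin):
    """Return (cos IPD, sin IPD) for one complex STFT bin per channel."""
    ipd = cmath.phase(left_bin) - cmath.phase(right_bin)
    # Cosine and sine are continuous across the +/- pi wrap of the phase.
    return math.cos(ipd), math.sin(ipd)

left = cmath.rect(1.0, 0.5)   # magnitude 1, phase 0.5 rad
right = cmath.rect(2.0, 0.2)  # magnitude 2, phase 0.2 rad
cos_ipd, sin_ipd = ipd_features(left, right)
print(cos_ipd, sin_ipd)
```

Because the pair (cos, sin) is identical for phase differences that differ by a full turn, the network does not see artificial jumps where the raw phase difference wraps.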

Model output
During the course of this research, we also examined whether estimating the sub-band parameters has any advantage over estimating only the wide-band parameters. The results in Table 6 show that, even if only mel-spectrograms are used as input features, a model that also estimates sub-band parameters tends to learn a better representation of the input and produces more accurate wide-band parameter estimates. A similar trend can be seen for estimations made with the additional inter-channel differences.

Input sample size
In many scenarios, such as real-time operation, pre-processing and prediction with a small window size (256 samples) may be too slow. Hence, using a window size of 512 samples when generating STFTs could reduce the computational complexity of the network by almost a factor of 2 compared with STFTs generated using 256 samples, and the overall prediction time by at least a factor of 1.5. A longer window also considerably reduces the computation time, since the matrix multiplication with the filterbank when computing the mel-spectrogram becomes smaller. Furthermore, the IPD and ICD calculation is considerably faster for the smaller spectrograms produced by the larger window, owing to fewer frames along the time axis. As a drawback, predictions for C 50 are less accurate with the larger window because of the energy binning involved (see Table 5). Smaller windows retain more time information, which is useful for the calculation of C 50. With larger windows, this energy is binned together when the STFT is calculated, leaving less time information available to the network. Overall, the performance of the proposed method with a window size of 512 samples is comparable to that of the model presented in [13] with a window size of 256. Furthermore, the proposed network is more accurate than the existing model for both broadband and sub-band parameters, but at the cost of a 1.2 times longer inference time, which is required to calculate the IPD and ICD features.
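The roughly twofold reduction in input size can be checked with a short frame-count calculation. This is a sketch under assumptions that are not stated above: a 4-s input at a 16 kHz sampling rate, a hop size of half the window and no padding.

```python
def stft_frames(n_samples, window, hop):
    """Number of full STFT frames that fit into the signal without padding."""
    return 1 + (n_samples - window) // hop

n_samples = 4 * 16000  # hypothetical: 4 s of audio at 16 kHz

frames_256 = stft_frames(n_samples, window=256, hop=128)
frames_512 = stft_frames(n_samples, window=512, hop=256)
print(frames_256, frames_512, frames_256 / frames_512)
```

Since the number of mel bands is fixed, the time axis alone determines the size of the network input, which is why halving the number of frames roughly halves the downstream computation.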

Data augmentation
To show the effectiveness of the proposed data augmentation technique, the best network, i.e. the configuration with IPD and ICD and a frame size of 256, was additionally trained with 200,000 samples generated using the technique presented in Section 3.2.1. The model from [13] was trained on the same data as a baseline for comparison.
Figure 7 shows the MSE between ground truth and predicted values for each sub-band from 20 to 8000 Hz. The effectiveness of incorporating noise shaping in frequency sub-bands within our augmentation technique is evident from the low mean squared error (MSE). Furthermore, the results presented in Table 7 suggest an improvement in full-band RT 60 prediction as well, compared with the models trained with real data only. Finally, employing different decay types helps the model to predict the full-band C 50, bringing the ρ value as high as 0.94.

BRIR selection
In Section 3.4, we propose a BRIR selection method based on the output of the proposed neural network model. Using an LDA of the dataset, it selects the set of 360-degree BRIRs, from a database of 45 rooms, that is closest to the estimated parameters. The selected BRIR set is further used for rendering a virtual sound source. The closest approach to extracting a BRIR solely from a speech signal was recently presented by Ratnarajah et al. (M3-AUDIODEC) [33]. Although, as mentioned in Section 2, the method is designed for neural compression of binaural signals, it may also be applied to extract a BRIR from the signal. Hence, we use it as the baseline for the evaluation of our BRIR selection.
To evaluate our selection method against [28], we generated 30,000 new samples with the script provided in the GitHub repository 1 and fine-tuned our model on these samples. The main reason for this choice is that we were unable to train M3-AUDIODEC on our data owing to its large size. The test set consisted of 752 samples generated using the VCTK Speech Corpus [50] and BRIRs simulated with Pygsound 2.
Table 8 reports the mean absolute error (MAE) between the real and estimated parameters for 4 cases. The first case is the selector only, i.e. a BRIR is selected from the dataset directly using the ground truth values in the LDA space. The second case considers the output of our neural network model. In this case, the model was fine-tuned and modified to output 4 embeddings, i.e. each parameter for each channel. The third case selects a room using LDA based on the output of the proposed model, as used in our end-to-end approach. Principal component analysis (PCA) is also performed in this case, similar to [11], to find the best-matching BRIR in the selected room so that the C 50 error is minimized. Note that the room selected (in cases 1 and 3) comes from the dataset of real BRIRs and hence might not be as accurate as a generated BRIR, resulting in larger errors (see bottom image in Fig. 8). This does not mean that our method performs worse than [33]; rather, it reflects the limited size of the dataset from which a BRIR is selected. On the other hand, the BRIR predicted by [33] shows neural network noise (top image in Fig. 8, after 40,000 samples), and hence only the first 0.25 s of the whole BRIR is used to calculate the parameters.
Table 9 presents a feature comparison of the baseline [33] and the proposed method. Our model, which has approximately 1/500th of the parameters of the baseline, provides faster estimation, resulting in a very small real-time factor (RTF). This shows that the model can be deployed on a mobile device while providing accurate real-time parameter estimation. Our model is also robust against noise, as can be seen in Fig. 5, in contrast to the baseline, which was trained on simulated data with no additional noise added to the input. Furthermore, the BRIR set selected through our approach consists of high-quality 360-degree BRIR sets that were recorded in different rooms at multiple positions. It can therefore be used for the spatial auralization of an object, providing some degrees of freedom of head rotation without introducing any audible artefacts. This also leads to a more plausible illusion of a virtual object, as shown in the next section.
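The RTF can be measured as the processing time divided by the duration of the processed audio, so that values below 1 indicate faster-than-real-time operation. The sketch below assumes this common definition; the function name and the placeholder workload are hypothetical and stand in for the actual feature extraction and model inference.

```python
import time

def real_time_factor(process, audio_seconds):
    """Run `process` once and return processing time divided by audio duration."""
    start = time.perf_counter()
    process()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Placeholder workload standing in for estimation on a 4-s input.
rtf = real_time_factor(lambda: sum(range(1000)), audio_seconds=4.0)
print(f"RTF = {rtf:.6f}")
```

For a fair comparison between models, the measurement would be averaged over many inputs on the same hardware.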

Perceptual evaluation
We measured 360-degree BRIR sets in two different rooms with non-identical acoustical characteristics to test the effectiveness of the proposed method perceptually. These measured sets were used as the hidden references in the listening test.

Listening test measurement setup
To measure the rooms, a dummy head was placed at a distance of 3 m from a Genelec 8020 speaker and circular sets of BRIRs were measured in the two rooms. The rooms differ in shape, size and reverberation pattern and are shown in Fig. 9. Room A imitates a living room environment and is dry, while Room B complements it as a reverberant meeting room. Room A has an average RT 60 of 0.29 s with room dimensions of 10.5 m × 5.25 m × 2.75 m (L × B × H), while Room B has an average RT 60 of 0.65 s with room dimensions of 12.5 m × 5.75 m × 2.75 m. The measurements from Room A have a C 50 of 17 dB, while those from Room B have a C 50 of 10 dB. A comparison of IRs and sub-band RT 60 from both rooms is given in Fig. 10. The figure reveals that Room B has strong first two reflections and a long decay time. A meeting table in the middle of the measurement setup creates a strong first reflection, which can be seen in the impulse response of Room B adjacent to the direct sound. Room A, on the other hand, has prominent early reflections but a shorter decay time owing to the materials used in the room, such as carpet and absorbing curtains. We recorded 30 s of speech played back over Genelec 8030 CP loudspeakers in both rooms using the same measurement setup described above. For best results, the recorded signal is then split into multiple 4-s-long sequences with a hop size of 1 s. The distribution of all the full-band RT 60 and C 50 predictions made by the proposed model for both rooms can be seen in Fig. 11. Furthermore, the comparison in Table 10 shows the differences between the predicted median values and the ground truth. Finally, the sub-band RT 60 predictions are shown in Fig. 12. The median values are then used to find the two best-matching rooms using the Selector described in Section 3.4.
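The segmentation of the recording and the aggregation of the per-segment predictions can be sketched as follows. The sketch assumes that trailing partial windows are dropped, so the number of segments may differ from the original pipeline if the signal edges are padded there; the function name and the prediction values are hypothetical.

```python
from statistics import median

def segment_starts(total_seconds, window_seconds=4, hop_seconds=1):
    """Start times (in s) of all full windows that fit into the recording."""
    last_start = total_seconds - window_seconds
    return list(range(0, last_start + 1, hop_seconds))

starts = segment_starts(30)
print(len(starts), starts[0], starts[-1])

# Hypothetical per-segment RT60 predictions (s); the median is robust to outliers.
rt60_predictions = [0.62, 0.66, 0.65, 0.91, 0.64]
rt60_estimate = median(rt60_predictions)
print(rt60_estimate)
```

Using the median rather than the mean keeps a single poorly estimated segment, such as one dominated by a pause in the speech, from shifting the value passed to the Selector.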
For each scenario, the two best-matching rooms were selected from the dataset of real BRIRs using the method proposed in Section 3.4 for the listening test. A comparison of the sub-band RT 60 of the best-matching rooms is given in Fig. 12, and the wide-band RT 60 and C 50 are given in Table 11. Although it is evident from Table 10 that the errors of the predicted wide-band parameters are well under the JNDs of 5% for RT 60 [51] and 1.1 dB for C 50 [52], the same may not hold for the best-matched rooms (Table 11 and Fig. 12). At 1 kHz, the RT 60 of the second best-matched room for room B is almost 24% longer, which slightly exceeds the JND of 22% defined by [53]. Similarly, the wide-band RT 60 and C 50 values for the best-matching rooms fall outside the JNDs.
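The JND comparison used above amounts to a relative check for RT 60 and an absolute check for C 50. The sketch below uses the JND values quoted in the text (5% and 22% for RT 60, 1.1 dB for C 50); the function names and the candidate room values are hypothetical.

```python
def rt60_within_jnd(reference_s, candidate_s, jnd_fraction):
    """True if the relative RT60 deviation does not exceed the JND."""
    return abs(candidate_s - reference_s) / reference_s <= jnd_fraction

def c50_within_jnd(reference_db, candidate_db, jnd_db=1.1):
    """True if the absolute C50 deviation does not exceed the JND."""
    return abs(candidate_db - reference_db) <= jnd_db

# A candidate whose RT60 at 1 kHz is 24% longer than the reference exceeds a 22% JND.
print(rt60_within_jnd(0.65, 0.65 * 1.24, jnd_fraction=0.22))
# A wide-band deviation of 0.01 s on 0.29 s stays within a 5% JND.
print(rt60_within_jnd(0.29, 0.30, jnd_fraction=0.05))
# A C50 that is 4 dB lower than the reference exceeds the 1.1 dB JND.
print(c50_within_jnd(10.0, 6.0))
```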
To validate the proposed approach, a subjective listening test similar to MUSHRA [54] was employed (see Fig. 13). The audio signal was convolved with the selected BRIR sets in real time to provide head-tracked binaural output. Participants were asked to rate the plausibility of the sound source on a scale from 1 to 5. For naive listeners, plausibility was explained as the perceived size of the room compared with the reference, where 1 was much smaller and 5 was much bigger. The listening setup had automatic switching, so that the reference loudspeaker played whenever the listener took off the headphones. Beyerdynamic DT 990 Pro headphones were used for the listening test because of their open design and relatively flat frequency response. Headphone equalization was also applied before playing back the stimuli through the headphones.
The listening test consisted of 4 parts, i.e. listening in two different rooms (room A and room B) with two different stimulus types (female speech and music). The female speech was taken from the ACE challenge dataset [12] and "Get Lucky" by Daft Punk was chosen as the pop music piece. Initially, in room A, a short training session was conducted in which the listener was presented with a speech signal convolved with BRIR sets from rooms with very different acoustics, to demonstrate the difference in reverberation between rooms (see Fig. 13, left).
In the actual test, the listeners were asked to blindly compare 5 BRIR sets in each room: the 2 best matches (of the listening room), 2 anchors (reverberant and dry) and a BRIR set of the listening room (see Table 11). The anchors remained the same in both listening rooms. Twelve listeners participated in the listening test, including 7 audio and acoustics experts, 2 with an audio background and 3 naive listeners. Note that the BRIR sets used for training were different from the ones in the test and that the order was randomized.

Results and discussion
A repeated-measures analysis of variance (RM-ANOVA) was performed on the data of each paired comparison to study the effect of the different variables and their interactions. An RM-ANOVA tests for statistically significant differences between three or more dependent samples. In our case, three such factors are considered: listening room, type of stimulus, and presented reverberant condition. Two values, the p-value and the F-value, are reported from the RM-ANOVA tests. The F-value in an RM-ANOVA represents the ratio of the variance between the groups to the variance within the groups. It tests the null hypothesis that there is no significant difference between the means of the groups, and a larger F-value indicates that the difference between the means is more likely to be significant. The p-value associated with the F-value represents the probability of observing an F-value at least this extreme by chance if the null hypothesis were true. Therefore, a smaller p-value indicates that the result is more statistically significant.
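How such an F-value is obtained can be shown with a minimal sketch of a one-way repeated-measures ANOVA. It covers a single within-subject factor only, whereas the test above involves three, and it uses the residual variance that remains after removing between-listener differences as the denominator. The function name and the ratings are hypothetical.

```python
def rm_anova_f(scores):
    """F-value of a one-way repeated-measures ANOVA; scores[subject][condition]."""
    n_subjects = len(scores)
    n_conditions = len(scores[0])
    grand = sum(map(sum, scores)) / (n_subjects * n_conditions)
    cond_means = [sum(col) / n_subjects for col in zip(*scores)]
    subj_means = [sum(row) / n_conditions for row in scores]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_cond = n_subjects * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = n_conditions * sum((m - grand) ** 2 for m in subj_means)
    # Residual variation after removing condition and subject effects.
    ss_error = ss_total - ss_cond - ss_subj
    df_cond = n_conditions - 1
    df_error = (n_conditions - 1) * (n_subjects - 1)
    return (ss_cond / df_cond) / (ss_error / df_error)

# Hypothetical ratings of three listeners for three reverberant conditions.
ratings = [[1, 2, 4], [2, 3, 4], [1, 3, 5]]
f_value = rm_anova_f(ratings)
print(round(f_value, 2))
```

The corresponding p-value would be read from the F-distribution with the two degrees of freedom computed in the function, after any sphericity correction.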
Notably, the data from the listening test did not pass Mauchly's sphericity test (p < 0.001) and the Greenhouse-Geisser epsilon was 0.69, which is smaller than 0.75, so the Greenhouse-Geisser correction was applied, following the ITU-R MUSHRA recommendation [55]. Afterwards, the data were grouped based on the results and a t-test was performed for each group. A significance level (α) of 0.05 was used.
Effect of the listening room and stimulus: The RM-ANOVA found no significant effect of the listening room on listeners' ratings. Post hoc pairwise t-tests were run on data separated by listening room and type of stimulus for each reverberant condition using a corrected significance level of α′ = 0.01. All the p-values are presented in Tables 12, 13, 14 and 15. Furthermore, for ease of understanding, all pairs with no significant differences are marked with arrows in Fig. 14. The t-test found no significant difference in listeners' ratings between the reference and best match 1 in any scenario except when listening to music in room B. It is apparent from Fig. 14 that many listeners perceived the room as bigger than the reference in this case. One reason for this is the longer reverberation time of best match 1 at low frequencies (0.8 s at 150 Hz) compared with the ground truth (0.5 s), as can be seen in Fig. 12, which is more prominent for music with drums and bass than for female speech.
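The statistic behind each post hoc comparison can be sketched with a paired t-test on per-listener rating differences. The ratings below are hypothetical. The text does not name the correction that yields α′ = 0.01; dividing α = 0.05 by five comparisons (a Bonferroni-style correction) gives the same value and is used here purely as an assumption.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """t statistic of a paired t-test on two equally long rating lists."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical ratings of five listeners for two reverberant conditions.
reference = [3, 3, 3, 3, 3]
best_match = [4, 3, 4, 5, 4]
t_stat = paired_t(best_match, reference)

alpha_corrected = 0.05 / 5  # assumed Bonferroni-style correction
print(round(t_stat, 4), alpha_corrected)
```

The p-value follows from the t-distribution with n − 1 degrees of freedom and is compared against the corrected level; where SciPy is available, `scipy.stats.ttest_rel` returns the statistic and the p-value directly.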
The t-test between the reference and best match 2 follows a similar trend, showing no significant difference in listeners' ratings. Although the results show significant differences for music in room A and speech in room B, it should be noted that the p-value (0.0095) is very close to α′ in both cases. A general trend of best match 2 being rated slightly bigger than the reference can be noted. The reason for this could be an RT 60 that is longer (by 20% for best match 2 in room A) or a C 50 that is lower (by 4 dB for best match 2 in room B) than the respective JNDs allow. Furthermore, the results show no significant difference between best match 1 and best match 2 under any circumstances. Similarities with the respective anchors in each room can be noted in Fig. 14, but significant differences were found in all cases except between the dry anchor and the reference when listening to music in room A, although the anchor was mostly rated as smaller than the reference. This could be due to the presence of reverberation in the music piece itself, which adds to the room reverberation.

Discussion
This study demonstrates the possibility of rendering a virtual sound source blindly from noisy reverberant speech signals. The presented results show that the proposed method is able to render a sound source that sounds similar to a physical sound source in the room. In Section 4, objective and subjective evaluations are presented. A thorough evaluation of each proposed method is given, which shows the effectiveness of the method.
From the objective perspective, our Estimator module shows improvements over the state-of-the-art models in overall estimation accuracy (wide-band and sub-band), inference speed and robustness against noise. Further, it was confirmed that predicting sub-band parameters along with wide-band parameters helps the network to learn a better representation of the data. We also discussed the impact of the different window sizes used to calculate the input features. Finally, the overall end-to-end setup was evaluated objectively against the relevant baseline, which shows the advantages and disadvantages of the proposed method.
From the listening test results, we found that in most cases, especially when rendering speech, listeners perceived the best-matching room rendered by the proposed method to be the same size as (or similar to) the actual listening room. However, the same cannot be said for the music signals. Each stage of the method contributes to this.
Input signal: Although the proposed network outperforms the state-of-the-art estimation techniques, its sole dependence on speech input affects the sub-band RT 60 estimation. The absence of low frequencies (< 85 Hz) in speech signals makes it difficult for the network to estimate the parameters precisely in this frequency range. This can also be seen in Fig. 7. Although the proposed method improves the estimation accuracy at low frequencies, it may still produce incorrect estimates, leading to an incorrect choice of BRIR. One solution could be to estimate the parameters using signals with a wider frequency spectrum [27,28]; however, this comes with the drawback of lower overall estimation accuracy, as described in Section 2.
Estimated parameters: As seen from the results (Fig. 14), the convergent case, i.e. where the listening room is the same as the BRIR set, was mostly rated as most similar. Next to the convergent case, the best-match BRIR sets obtained the ratings most similar to the reference room. While the estimated parameters may relate closely to the perceptual differences, they still might not be enough to describe a room accurately. For example, even though the estimated parameters are similar when listening in room B, significant differences between the reference and the best-matching rooms were found in the listener ratings. This could be due to the early reflections from the meeting table, which may have influenced the C 50; since more weight is given to the sub-band RT 60 in the LDA [11], the matched room has a C 50 that is smaller than that of the reference by more than the JND. As a result, the matched scenarios sound distant, unreal, or bigger. Hence, the perceptual relevance of other parameters such as the interaural cross-correlation (IACC) and/or the initial time delay gap (ITDG) should also be investigated.
Lack of real BRIR dataset: These perceptual differences might also have been avoided by including more BRIR sets. Currently, the dataset consists of only 45 real rooms (with only a single or a few positions per room) that may or may not be closely related to the listening room, as also discussed in Section 4.3. In the future, the estimation and augmentation techniques could be improved by considering more perceptually relevant acoustic features. However, further research is still needed to better understand the mapping between objective acoustic metrics and the perception of room similarity. An alternative approach could be to modify or optimize the selected dataset to close the gap between the predicted parameters and the selected rooms' parameters.

Conclusions
In this work, we propose a novel technique to blindly render any given listening environment from a speech signal. This is done in two steps. Firstly, the parameters are estimated from a noisy reverberant binaural speech signal using a mobile audio transformer. We propose improvements to the previously presented model AudMobNet for the estimation of the parameters, such as using additional features (phase and continuity) and supporting binaural signals. We demonstrate how input features such as the inter-channel phase difference (IPD) and its second-order derivative can improve the overall performance of the network, especially for RT 60 estimation at low frequencies. To further improve the performance of the estimation technique, we propose a BRIR augmentation technique that can also be used to augment any multi-channel RIRs. Our augmentation approach demonstrates a major improvement over the state of the art in sub-band RT 60 estimation as well as in full-band RT 60 and C 50 estimation. Secondly, the estimated parameters are used to select, via LDA, a circular set of BRIRs from a given dataset, which is then used for rendering a virtual sound source. A perceptual evaluation was also carried out, and the results demonstrate that the selected BRIRs are as plausible as a BRIR recorded in the actual listening room. Finally, we discuss the gaps in the presented method.

Fig. 2 Overview of the proposed model architecture. The top row (a) illustrates the pre-processing step where features are extracted from the binaural speech signal and stacked with the logarithmic mel-spectrogram, resulting in an input shape of 6 × 64 × 1000 (C × H × W). Further, a time and frequency mask is applied to the input spectrum and the result is convolved with 16 kernels of size 3, producing 16 representations of size 32 × 500, which are treated as input to the transformer. The middle row (b) describes the architecture of the transformer, where MV2 means MobileNetV2 block and ↓ means a reduction in the input size. N is the size of the kernel and M is the number of linear transformers in the MobileViTV3 block. The bottom row (c) shows the architecture of each MobileViTV3 block, where N and M depend on the position of the block

Fig. 5 Effects of SNR levels [dB] on the performance of the AudMobNet variations and baseline models evaluated on the ACE evaluation set. The top graph displays ρ for RT 60 estimates in comparison to the RT Baseline [17], while the lower graph illustrates the RMSE [dB] for C 50 estimates compared to the Baseline [26]. The comparison metrics were selected based on those provided in the respective baseline

Fig. 6 Sub-band RT 60 and C 50 estimation evaluation only for the models with real training data (window size of 256 samples)

Fig. 8 Comparison of BRIR output (top) and MAE for RT 60 estimates from proposed methods against M3-AUDIODEC [33]. Noise in the output BRIR from the baseline can be noted after 40,000 samples. The test data consists of 752 samples generated using the script provided in [33]

Fig. 11 Distribution of 30 RT 60 and C 50 predictions from the model for room A (left) and room B (right). Violet means no estimation was made for the value and yellow represents the highest number of estimations

Fig. 13 Schematic of graphical user interface for the listening test

Fig. 14 Violin plots for the listening test results. The top image shows the listeners' ratings in room A and the bottom one shows the ratings in room B. The respective coloured arrow means no significant difference was found in the listeners' ratings between the pair of listening conditions

Table 1
Summary of data used during training and evaluation

Table 3
Evaluation results from [13] on the ACE challenge evaluation set [12] for wide-band RT 60 predictions from single-channel noisy speech signals. On the left is a comparison against Baseline [17] and the best-performing method [49] from the ACE challenge, and on the right for wide-band C 50 estimation compared to Baseline [26]

Table 4
Number of trainable parameters and floating point operations per second (FLOPS) in millions per sample

Table 5
Evaluation results for wide-band RT 60 and C 50 (left channel only) for the models trained only with real training data.The input features describe the type of inputs and the size of the window

Table 6
Wide-band parameter ( RT 60 and C 50 ) estimation results for the proposed model (AudMobNet + IPD for sub-band parameter estimation) compared to the same model for wideband parameter estimation

Table 7
Evaluation results for the proposed model trained with real data and augmented data compared to the baseline wideband RT 60. Evaluation results of models trained with only real data compared to the models additionally trained with augmented data

Table 8
MAE for wide-band RT 60 and C 50 estimation (left channel) of each mentioned scenario and ground truth

Table 9
Comparison of baseline and proposed method on different grounds

Table 10
Calculated and estimated parameters from room A and room B

Table 11
Ground truth parameters of all the rooms used in the listening test

Table 12
p-values of paired t-tests performed on listeners' ratings of speech for all reverberant conditions in room A

Table 13
p-values of paired t-tests performed on listeners' ratings of music for all reverberant conditions in room A

Table 14
p-values of paired t-tests performed on listeners' ratings of speech for all reverberant conditions in room B

Table 15
p-values of paired t-tests performed on listeners' ratings of music for all reverberant conditions in room B