An end-to-end approach for blindly rendering a virtual sound source in an audio augmented reality environment

Saini, Shivam; Engel, Isaac; Peissig, Jürgen

doi:10.1186/s13636-024-00338-6

Empirical Research
Open access
Published: 27 March 2024

An end-to-end approach for blindly rendering a virtual sound source in an audio augmented reality environment

EURASIP Journal on Audio, Speech, and Music Processing volume 2024, Article number: 16 (2024) Cite this article

885 Accesses
Metrics details

Abstract

Audio augmented reality (AAR), a prominent topic in the field of audio, requires understanding the listening environment of the user for rendering an authentic virtual auditory object. Reverberation time ($RT_{60}$) is a predominant metric for the characterization of room acoustics and numerous approaches have been proposed to estimate it blindly from a reverberant speech signal. However, a single $RT_{60}$ value may not be sufficient to correctly describe and render the acoustics of a room. This contribution presents a method for the estimation of multiple room acoustic parameters required to render close-to-accurate room acoustics in an unknown environment. It is shown how these parameters can be estimated blindly using an audio transformer that can be deployed on a mobile device. Furthermore, the paper also discusses the use of the estimated room acoustic parameters to find a similar room from a dataset of real BRIRs that can be further used for rendering the virtual audio source. Additionally, a novel binaural room impulse response (BRIR) augmentation technique to overcome the limitation of inadequate data is proposed. Finally, the proposed method is validated perceptually by means of a listening test.

1 Introduction

To create a convincing illusion of a virtual auditory object in a given environment via headphones, two major requirements must be met. The first requirement involves placing the object within a space which considers the room acoustics and the second involves the anatomy of an individual which changes the characteristics of the sound before reaching the eardrum [1]. Placing the object in a space can be achieved by convolving a dry (or anechoic) signal with a room impulse response (RIR). This gives the listener a perception of sound originating from a specific location within a room, replicating the exact conditions under which the RIR was recorded. An RIR describes the behaviour of any sound source in a particular room and condition. It is typically divided into direct sound (DS), early reflections (ER), and late reverberations (LR) and it is well established that DS and ER have perceptually relevant directional properties [2]. However, a single omnidirectional microphone lacks the ability to capture the directional information, which makes it inadequate to produce an authentic auditory rendering. For a realistic rendering of a virtual auditory object, binaural room impulse responses (BRIRs) are captured which describe how the sound would travel from a sound source to the ears of a listener in a particular room [3]. An ideal BRIR measurement would require capturing the microphone signals at both the eardrums of the listener for every possible direction of view (DOV) resulting in an individual BRIR dataset. It includes the influences of both: the room’s characteristics and the listener’s physical presence. When convolved with a dry signal and played back over headphones, it should give that specific listener a perfect reconstruction of a virtual sound source. However, this is a very time-consuming task if it has to be performed for each listener in every listening condition and for all possible DOV.

Alternatively, to imitate the effect similar to individual BRIR, a generic BRIR can be recorded using a head-and-torso simulator (HATS). The direct sound from this BRIR can then be replaced with a head-related transfer function (HRTF) of the particular listener. It is well understood that performing binaural rendering with an individual HRTF rather than a generic one leads to higher plausibility and externalization [4,5,6]. Additionally, Werner et al. have demonstrated that the perceived externalization also depends on visual cues, due to acoustic divergence between the rendered and the listening rooms [7]. Hence, for the virtual auditory object to sound plausible and external in audio augmented reality (AAR), it is crucial to render a reverberation that is similar to that of the listening environment, since an incorrect rendering of room acoustics would destroy the illusion of realism [8].

However, capturing BRIRs in every possible environment would require an impractical amount of effort and prime apparatus. To overcome this challenge, data-driven approaches have been proposed that estimate room acoustics directly from noisy reverberant speech signals using machine learning (ML)-based techniques which can be seen dated as far back as 2001 [9]. In contrast to estimating room reverberation, other researchers propose estimating well-known room acoustic parameters such as reverberation time (RT) to get a rough understanding of the environment [10].

Apart from wide-band RT, frequency-dependent RT, energy decay curve (EDC), clarity ($C_{50}$), and direct-to-reverberant ratio (DRR) have been considered the most important parameters of a room that assist in the analysis of a given scenario [11]. Prior knowledge of these parameters could result in a more accurate representation of a virtual sound object in any given listening environment. But generally these parameters can only be calculated from high-quality measured RIRs and as mentioned earlier, measuring RIR is not practical at the user end of the applications due to the cost and effort involved. Hence, blind estimation of the parameters solely from reverberant speech signals has been of great interest to researchers in the field [10, 12,13,14,15,16,17,18,19,20,21,22].

$RT_{60}$ is often considered as a key parameter to describe the acoustics of a space. It is a measure that defines the time taken for the sound energy to decrease by 60 dB after the source has stopped. As defined by ISO 3382-2 [23], based on an energy decay curve of an RIR, $RT_{60}$ can be calculated by observing the time taken to reach 60 dB below the initial level used. But generally, from an RIR, $RT_{60}$ is extrapolated based on a smaller dynamic range such as 30 dB ($T_{30}$), i.e. by taking twice of $T_{30}$ value. Another well-known method to calculate $RT_{60}$ of a space is given by Sabine’s formula as:

$$\begin{aligned} RT_{60} = 0.161V/S \alpha \end{aligned}$$

(1)

where $RT_{60}$ is the time in seconds required for a sound to decay 60 dB, V is the volume of the room, S is the boundary surface area, and $\alpha$ is the average absorption coefficient. Numerous methods exist for its estimation blindly, i.e. without the use of an RIR, using audio signals including both signal processing and ML-based methods [10, 13,14,15,16,17,18,19,20,21,22]. A review of most of the given algorithms can be found in [24]. Another important parameter of room acoustic is the early-to-late reverberant ratio (ELR) or clarity ($C_{50}$). It is the ratio of the early sound energy (until 50 ms) and the residual energy in an RIR and is expressed in dB. As given by [25], $C_{50}$ can mathematically be defined as:

$$\begin{aligned} C_{50} = 10 \log \left( \frac{\int _{0}^{50}{p^{2}(t) dt}}{\int _{50}^{\infty }{p^{2}(t) dt}} \right) \end{aligned}$$

(2)

Although $C_{50}$ is often used as an indicator of speech clarity or intelligibility, it may assist in discriminating rooms with similar $RT_{60}$ [11]. Several non-intrusive approaches exist for its estimation [26] or jointly with $RT_{60}$ [13, 16].

In 2015, The ACE challenge [12, 24] created a benchmark for evaluation of $RT_{60}$ and direct-to-reverberant ratio (DRR) estimation approaches. The benchmark allows the researchers to fairly evaluate their models against the state-of-the-art models. Eaton et al. [24] compared multiple estimation approaches in terms of mean squared error (MSE), estimation bias, and Pearson correlation coefficient ($\rho$). MSE is defined as the average of squared differences between predicted and real values and can be mathematically understood as:

$$\begin{aligned} MSE = \frac{1}{N}\sum (x_{i} - y_{i})^2 \end{aligned}$$

(3)

where, $x_{i}$ is the ith observed value, $y_{i}$ is the corresponding predicted value and N is the number of observations. Bias is the mean error in the results and is given by:

$$\begin{aligned} Bias = \frac{1}{N}\sum (x_{i} - y_{i}) \end{aligned}$$

(4)

Finally, $\rho$ is the Pearson coefficient correlation between the estimated and the ground truth results and is defined as:

$$\begin{aligned} \rho = \frac{\sum (x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum (x_{i} - \bar{x})^2 \sum (y_{i} - \bar{y})^2}} \end{aligned}$$

(5)

where, $\bar{x}$ and $\bar{y}$ are the means of real and predicted values respectively. By considering the MSE, bias, and $\rho$ together, it is possible to determine how well an estimator performs [12]. Estimation results with a low bias and MSE might give an estimate close to the median for every speech file. But by examining $\rho$, it is possible to distinguish between such an algorithm ($\rho$ close to 0). An algorithm which is more accurately estimating the parameter will have $\rho$ closer to 1 with low MSE and bias close to 0. The results from ACE challenge showed that signal processing and ML methods had similar performance for RT estimation, while machine learning methods were better for DRR estimation at the time of the challenge.

2 Related works

Since the ACE challenge, a handful of publications have demonstrated the use of ML to estimate $RT_{60}$ from noisy reverberant speech signals [14, 15, 18,19,20]. In 2018, Gamper and Tashev used convolutional neural networks (CNN) to predict the average $RT_{60}$ of a reverberant signal using Gammatone filtered time-frequency spectrum [15] which outperformed the best method from ACE challenge. In 2020, Looney and Gaubitch [17] showed promising results in the joint blind estimation of RT60, DRR, and signal-to-noise ratio (SNR). On the other hand, Bryan [18] proposes a method to generate augmented training datasets from real RIRs which showed improvement in $RT_{60}$ and DRR predictions. Ick et al. [21] introduced a series of phase-relate features and demonstrated clear improvements in the context of reverberation fingerprint estimation on unseen real-world rooms. However, one common limitation of most of the aforementioned works is that they only provide broadband parameter values, rather than frequency-dependent ones. This is an oversimplification of the room acoustics model which can potentially limit the rendering realism in the context of AAR.

Instead of using speech inputs, a few researchers have tried estimating acoustic parameters from wide-band inputs such as music signals [27]. The results demonstrate lower estimation accuracy in signals with music input than those with speech only. According to the authors, one of the reasons for this is the additional reverb used in the music content during the mixing process, which adds up in the input signal when convolving with the room reverb. In the study by Götz et al. [28], authors extended the work from [19] to estimate sub-band $RT_{60}$ and $C_{50}$ in dynamic conditions using convolutional recurrent neural networks (CRNN). When considering estimation results for music input, the model improves greatly in estimating $RT_{60}$ under dynamic conditions over static conditions. This shows the ability of a model to differentiate between room reverberation and reverb from music when more than one condition is presented. The model trained with dynamic data shows improved performance even under static conditions. Although the results show estimation error being well under the just noticeable difference (JND) for $RT_{60}$, the model fails to predict $C_{50}$ as accurately for $C_{50}>$ 15 dB. The training and evaluation are however only performed on noiseless data. Furthermore, as pointed out by authors, the model only considers single-channel signals which lack spatial information and could be an aspect of future works.

The parameters estimated by ML methods can also be used to drive artificial reverberators to reconstruct the RIR properties of the space. For example, the method from Srivastava et al. [22] predicts not only $RT_{60}$ but also room volume and total surface area, which could be useful parameters for the configuration of artificial reverberations. Alternatively, some methods also aim at the selection of an RIR from a database [11, 29].

On the other hand, few researchers have focused on directly estimating the RIR leveraging the audio-visual cues [30, 31]. Although it is a valuable approach to match the room acoustics of the current room for the case of augmented reality (AR), relying on visual cues might not be practical in scenarios like headphone listening where visual input is not available. Additionally, 10 s required to optimize the material [30] is unsuitable for real-time rendering where the listener moves between different rooms.

Steinmetz et al. proposed to estimate time-domain RIR solely from reverberant speech using filtered noise shaping in [32]. The approach shows a nearly perfect reconstruction of RIR from speech. However, the estimation of single-channel RIRs demonstrated in the publications may not be sufficient for auralization of spatial audio. Ratnarajah et al. [33] proposed $M^3$-AUDIODEC, a neural network-based multi-channel multi-speaker audio codec which overcomes the limitation of generating single-channel RIR. The approach compresses multi-channel audio while retaining spatial information. The approach can also aid in solving acoustic scene-matching problems. The method not only allows the reconstruction of reverberant speech but also provides the separation of clean speech and the BRIR that can be utilized to auralize other signals. The approach is however only tested on simulated data and the spatial coherence of other BRIRs in the same room has not been yet evaluated.

Most of the neural network approaches mentioned earlier make use of traditional CNN models with time-frequency representations (STFT, mel, etc.). In acoustic characterization, CNN is usually applied to solve the blind acoustic parameter estimation problem as a regression task. CNNs are suitable for learning two-dimensional time-frequency signal patterns for end-to-end modelling, which is why they are widely used in aforementioned approaches. To capture long-range global context better, hybrid models that combine CNN with a self-attention mechanism have been proposed. These models have achieved state-of-the-art results for various tasks such as acoustic event classification and other audio pattern recognition topics [34].

Gong et al. [35] recently proposed a purely attention-based model for audio classification, called audio spectrogram transformer (AST). They evaluated AST against different audio classification benchmarks to achieve state-of-the-art results, which shows that CNNs are not always necessary in this context. However, transformers such as AST require a huge amount of computation power due to their complexity. To tackle this, in [13], a mobile audio transformer (AudMobNet) was proposed, which not only is independent of the length of the sequence but is also more robust against noise and computationally less expensive. The approach is only verified for broadband $RT_{60}$ and $C_{50}$ values. Upon further inspection, it was found that instead of only estimating single broadband $RT_{60}$ value, sub-band $RT_{60}$ and $C_{50}$ can be beneficial in understanding the tonal characteristics of the room.

In this work, we aim to focus on estimating the acoustic parameters that correlate more with perceived plausibility. We extend our model (AudMobNet) presented in [13] to jointly estimate broadband and sub-band $RT_{60}$ and $C_{50}$ from noisy reverberant binaural speech signals. We also propose to improve the existing model architecture and use additional features such as phase and continuity differences to improve the performance of the model. Additionally, a novel multi-channel data augmentation technique to enhance the generalization capability of the network in sub-band RT estimation is presented. Lastly, leveraging on prior works [11, 13], an end-to-end blind spatial audio rendering setup is developed (Fig. 1) that takes a noisy speech recording as input to output a plausible binaural rendering of an arbitrary signal by means of BRIR selection. We use the estimated parameters to find closely related rooms from a dataset of high-quality BRIRs using linear discriminant analysis (LDA). The selected rooms are further used for rendering a virtual auditory object. Contrary to existing approaches mentioned earlier, our method solely relies on reverberant speech to binauralize the listening room and selection of 360 degrees BRIR set allows head rotation without any audible artefacts.

3 Proposed method

Our proposed blind end-to-end spatial audio rendering setup consists of 4 modules, namely recorder, estimator, selector and renderer (see Fig. 1). The estimator operates in a frame-wise manner in parallel with the recorder, and the input buffer length can be adjusted as needed. A longer buffer will lead to increased latency in the room parameter estimation (when in real-time), whereas a too-short buffer might produce inaccurate or unstable predictions. The estimated parameters are then used to select a set of 360 degrees BRIRs recorded in a real environment using the technique proposed in [11]. The selected set is then used for rendering the virtual object over the headphones through convolution. This end-to-end setup is used to evaluate the perceptual relevance of the proposed method (Section 4.4).

The most important module in the end-to-end setup is the Estimator which makes use of earlier presented AudMobNet [13] and extends it to the application of AAR rendering. Exploiting the binaural recording feature of new-age consumer end microphones, we can blindly estimate the room acoustics parameters that can be used to find a similar high-quality BRIR set from the database. This high-quality measured BRIR set ensures that the intra-set spatial coherence is preserved. We believe if the parameters of rendered BRIR are close enough to that of the listening room, the room divergence effect will be mitigated, resulting in a plausible virtual sound source [36]. Furthermore, it is believed that the head-tracking may also lead to a more authentic virtual sound source [37].

3.1 Model and feature input

The effectiveness of mobile transformers for estimating room acoustic parameters has already been demonstrated by the AudMobNet model proposed in [13]. Here, we propose using AudMobNet with additional changes for estimating room acoustic parameters from binaural signals. Instead of using mel-spectrograms as the only input to the network, we propose using inter-channel differences, exploiting the binaural nature of the problem, in addition to the logarithmic mel-spectrogram. It has been well established that for localizing a sound source, inter-channel (aural) phase differences (IPD) are the main contributors in low frequencies (< 1500 Hz) while the Inter-channel Level Differences (ILD) contributes in frequencies above 1500 Hz [38]. Furthermore, Ick et al. demonstrates that using phase and continuity features assist in improving $RT_{60}$ estimation in low frequencies [21]. Similarly, results from Srivastava et al. [22] showed how using inter-channel features such as ILD and IPD lead to better parameter estimation over the networks where only single-channel features (STFT) are utilized. Hence, we believe that using IPD would help the network in understanding the low-frequency components. Furthermore, continuity features are used to track the phase variations across time which might help in understanding the overall context of the spectrum when estimating sub-band parameters. For boosting the generalization ability of the network and achieving full potential for estimating sub-band parameters, a BRIR augmentation technique is also presented which is discussed later in the section.

For the input data, the 2-channel raw audio is transformed into spectrograms using STFT with a sampling rate of 16,000 Hz and in frames with a 50% overlap using a Hann window. The STFT is further filtered with mel filterbanks generating a mel-spectrogram of the shape M × L, where M is the number of mel bins and L is the length of the resultant spectrogram. Here, L depends on the frame size F, used for calculating STFT. For faster training and ease of evaluation, we keep M fixed to 64 bins but two different frame sizes are studied, 256 and 512. The mel-spectrograms are further used for generating phase and continuity features as in [21]. The mel features are then transformed to a logarithmic scale. The sine and cosine phase features from the left and right channels are then utilized to generate IPD as:

$$\begin{aligned} sinIPD(t, f) = sin(\theta _{t, f}) \end{aligned}$$

(6)

$$\begin{aligned} cosIPD(t, f) = cos(\theta _{t, f}) \end{aligned}$$

(7)

where, $\theta _{t, f}= \angle x_{t,f,l} - \angle x_{t,f,r}$ is the inter-channel phase difference between the mel-spectrogram $x_{l}$ and $x_{r}$ at time t and frequency f of the signals at microphones l and r. The second-order derivatives of IPDs are then calculated which we call Inter-channel Continuity Difference (ICD) and are given by:

$$\begin{aligned} sinICD(t, f) = \lim _{i\rightarrow 0} \lim _{j\rightarrow 0} \frac{\sin (\theta _{t,f}) - \sin (\theta _{t+i,f+j})}{ \theta _{t,f} - \theta _{t+i,f+j}} \end{aligned}$$

(8)

$$\begin{aligned} cosICD(t, f) = \lim _{i\rightarrow 0} \lim _{j\rightarrow 0} \frac{\cos (\theta _{t,f}) - \cos (\theta _{t+i,f+j})}{ \theta _{t,f} - \theta _{t+i,f+j}} \end{aligned}$$

(9)

As shown in Fig. 2, the features (sine and cosine IPDs; and sine and cosine ICDs) are stacked with logarithmic mel-spectrograms to generate 6-channel inputs. The 6-channel inputs are then masked with a time and frequency mask of size 64 and 16 respectively. During training, the mask is applied to the input randomly masking 64 timesteps and 16 mel sub-bands. This allows the model not to rely upon specific regions in the audio. It can be regarded as the usual and widespread dropout, but applied to the input and an example can be seen in Fig. 2a.

Also, since the input already has more than 1 channel, the spectrum could not be utilized to shorten the sequence into patches. Hence, to generate 16 patch embeddings for the model input as in [13], a 3 × 3 convolution layer is used instead. The convolution provides us with 16 representations of the 6-channel masked input as can be noted in Fig. 2. These patch embeddings are used as inputs for AudMobNet. The output linear layer produces 27 embeddings which include a single full-band $RT_{60}$, 8 sub-band $RT_{60}$s, and similarly 9 $C_{50}$ values for each channel.

3.2 Datasets

Multiple publicly available datasets [39, 40] of measured BRIRs were used, resulting in an overall of 571 real BRIRs across 45 rooms. In addition, a highly detailed internal dataset of 6 rooms presented in [11] with 1440 real BRIRs was utilized. To balance the dataset, only 200 BRIRs were chosen from the latter, resulting in a total of 771 BRIRs. Different speech corpus [41, 42] and anechoic noise dataset [43] were used to generate final samples. Babble noise was also simulated using a different speech dataset [44]. The BRIR, Speech and Noise datasets were split into training and evaluation sets to avoid any overlapping. For example, all BRIRs from 5 different rooms were taken in the evaluation set, and the speech samples were taken from the ACE evaluation set [12]. Further, in order to expand the size of the dataset and to tackle over-fitting, a novel BRIR augmentation technique was also incorporated. It shall be noted that the augmentation technique was only applied to the training set.

3.2.1 BRIR augmentation

We augment our BRIR dataset by parametrically modifying the diffuse tail (after mixing time) of the original BRIRs in different frequency bands, which allows us to mimic various tonal absorption patterns of rooms that are not present in the dataset. In [18], Bryan proposes replacing the later reverberation tail with a synthetic version generated using a Gaussian noise. Our technique is based on [18] with a few key differences. Firstly, our artificially generated Gaussian reverberation tail length varies in different sub-bands mimicking various tonal absorption patterns of sound in real rooms, in contrast to [18] where no frequency domain adaptation is incorporated. Secondly, due to the convolution with the original reverberation, our augmentation method also works for multi-channel RIRs. Finally, we also incorporated different decay types such as linear or logarithmic, taking into consideration more absorptive or reflective rooms.

In our augmentation approach, a circular set of BRIR is augmented by changing the reverberation tail. This circular set consists of 360/n BRIRs, where n is the spatial resolution in degrees which varies from room to room depending on the dataset. As shown in Fig. 3, to generate a reverberation tail, the original BRIR is filtered with mel-filterbank and RT is calculated for each mel sub-band. Then, a Gaussian noise with a similar length to that of the BRIR is generated. The noise is then convolved with the diffused part of the original BRIR to achieve a similar decorrelation between the left and right channels. The resulting binaural noise is filtered with the same mel filterbank and the sub-band noise shaping is performed using different EQ gains for each sub-band. Back in the time domain, a decay filter is applied for each band using either a linear or a logarithmic decay with $\pm 500$ ms RT of that of the original sub-band band RT. All the sub-band tails are then added up to generate the full-band BRIR tail. The old tails are then replaced in all the BRIRs from the same room after the mixing time. The augmented BRIR $h_{r}(t)$ is generated from the real BRIR $h(t)$ by replacing the later tail with $h_{aug}(t)$ using a crossfade that can be interpreted as:

$$\begin{aligned} h_{r}(t)= \left\{ \begin{array}{ll} h(t),&{}t_{0}\le t \le t_{m} - t_{w}\\ h(t).w_e(t) + h_{aug}(t).w_n(t) ,&{}t_{m}-t_{w}< t \le t_{m}+t_{w}\\ h_{aug}(t) ,&{} t_{m}+t_{w} < t\\ \end{array}\right. \end{aligned}$$

(10)

where, $t_{m}$ is the approximate mixing time as given by [45] and calculated as $t_{m} = 80\cdot RT_{500Hz}$. $w_n(t)$ is the first half of a Hanning window of 0.2 ($2\cdot t_{w}$) s and $w_e(t)$ is the later half of the window. Note that this augmentation technique can be applied to RIRs with any number of channels but due to the scope of research, only BRIR augmentation is demonstrated. Approximately 20,000 BRIRs were generated out of 671 real BRIRs using a single position from 40 rooms and the generated responses can be seen in Fig. 4.

3.2.2 Data pre-processing

It is worth mentioning that the convolution of each augmented BRIR (20,000) with each speech sample (6000) requires an impractical amount of disk space. Hence, during the generation of the training sets, the BRIRs were convolved with 50 random speech signals. For the real BRIR training dataset, 671 BRIRs were convolved with random speech signals resulting in 35,000 real samples and for evaluation set consisted of 1000 real samples generated from measurements from 7 separate rooms. The number of total generated samples are given in Table 1. The ground truth values were computed as defined by the ISO [23]. The $RT_{60}$ is extrapolated by taking twice the $T_{30}$ value as suggested by the ISO. To obtain sub-band $RT_{60}s$, the BRIR was first filtered with a mel-frequency filterbank and a $RT_{60}$ value for each filtered BRIR was calculated similarly. Sub-band and wide-band $C_{50}$ were calculated according to Eq. 2.

To simulate acoustic signals, the BRIR was convolved to each dry speech signal, s(t), and then noise n(t) was added as,

$$\begin{aligned} x(t) = h(t) * s(t) + a.n(t) \end{aligned}$$

(11)

where, $*$ shows the convolution operator. For generating noise realistically, all available BRIRs in one room are selected. Afterwards, 30 random speech samples are picked from the noise-generation set mentioned in Section 3.2. Random EQ gains and overall gains are applied to the samples. The resultant samples are convolved with randomly chosen BRIR from the room. The convolution is done with only the diffuse tail of the real BRIR allowing us to generate diffused and spatial noise. The silent parts are removed from the signals to have continuous noise. Other noises, such as static ambient noise, babble noise, or office noises are also created similarly. The generated spatial noises ($n(t)$) are then multiplied with a gain constant ($a$) and added to the convolved speech signal at random SNR ranging from 6 to 30 dB.

Table 1 Summary of data used during training and evaluation

An end-to-end approach for blindly rendering a virtual sound source in an audio augmented reality environment

Abstract

1 Introduction

2 Related works

3 Proposed method

3.1 Model and feature input

3.2 Datasets

3.2.1 BRIR augmentation

3.2.2 Data pre-processing

3.3 Training setup

3.4 BRIR selection and rendering

4 Evaluation

4.1 Model selection and input features

4.1.1 Preliminary model evaluation

4.1.2 Input features

4.1.3 Model output

4.1.4 Input sample size

4.2 Data augmentation

4.3 BRIR selection

4.4 Perceptual evaluation

4.4.1 Listening test measurement setup

4.4.2 Results and discussion

5 Discussion

6 Conclusions

Availability of data and materials

Notes

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Authors’ information

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s Note

Appendix

Appendix

Rights and permissions

About this article

Cite this article

Share this article

Keywords