Working examples
To evaluate the proposed method, we implemented two working examples comparing binaural rendering of SMA data using the SMATBIN filters with the rendering chain implemented in the SOFiA toolbox [28]. As all spherical microphone array processing in the present work was performed using SOFiA, and the SMATBIN filters for the two examples were based on the SOFiA rendering chain, the BRTFs/BRIRs produced by the two methods should ideally be identical.
For binaural decoding, we used the sofia_binauralX function, which employs the virtual loudspeaker approach in combination with HRTF switching to account for arbitrary head orientations [2, 3]. The HRTFs used in SOFiA are from a Neumann KU100 dummy head measured on a Lebedev grid with 2702 sampling points [29]. The HRTFs are transformed to the SH domain at a sufficiently high order of N=35, allowing artifact-free SH interpolation to obtain HRTFs for arbitrary directions, in particular for the directions of the plane waves [3].
For both working examples, the radius of the rigid sphere array was r = 8.75 cm, and the radial filter gain was soft-limited to 20 dB [30]. The SMATBIN filter length was defined as K = 2048 taps at a sampling rate of fs = 48 kHz. Figure S1 in the supplementary material (Additional file 1) shows an example of SMATBIN filters with the above-mentioned array and filter parameters for a Lebedev sampling scheme of order N=1. The described implementations with functions to calculate the SMATBIN filters and generate results plots, as well as various demo implementations, are available online (Footnote 1).
Working example 1
In the first working example, we simulated a single broadband plane wave incident from the front (ϕ = 0°, θ = 90°, with ϕ the horizontal angle ranging from 0° to 360° and θ the vertical angle ranging from 0° to 180°) on three different rigid sphere arrays with Lebedev sampling schemes of orders N={1,7,35}, corresponding to 6, 86, and 1730 sampling points, respectively. Besides the more common orders N = 1 and N = 7, we decided to show the implementation with the rather high order N=35 to verify that no artifacts or instabilities occur even when processing with a very high number of SMATBIN filters. From these SMA signals, we calculated BRIRs using the SOFiA implementation employing plane wave decomposition and virtual loudspeaker rendering (see Fig. 1 (top)) as well as using the proposed SMATBIN filter method, where the SMA signals are simply filtered and then superimposed to obtain a BRIR (see Fig. 1 (bottom)).
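The SMATBIN rendering step itself reduces to a filter-and-sum operation. The following Python/NumPy sketch illustrates this structure; the function name smatbin_render and the array layout are illustrative assumptions and not part of the published implementation:

```python
import numpy as np
from scipy.signal import fftconvolve

def smatbin_render(sma_signals, smatbin_filters):
    """Filter-and-sum binaural rendering from SMA signals (illustrative sketch).

    sma_signals     : (Q, T) array, one row per microphone channel.
    smatbin_filters : (Q, 2, K) array of precomputed SMATBIN filters for one
                      head orientation (index 0: left ear, index 1: right ear).
    Returns a (2, T + K - 1) binaural signal (left, right).
    """
    Q, T = sma_signals.shape
    _, _, K = smatbin_filters.shape
    binaural = np.zeros((2, T + K - 1))
    for q in range(Q):                  # one FIR filter per microphone channel and ear
        for ear in range(2):
            binaural[ear] += fftconvolve(sma_signals[q], smatbin_filters[q, ear])
    return binaural
```

In practice, one such filter set is stored per supported head orientation and selected at runtime, so the rendering itself never involves an SHT or plane wave decomposition.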
Figure 2 compares the left-ear BRIRs/BRTFs resulting from the SOFiA and the SMATBIN filter processing, taking frontal head orientation (ϕ = 0°, θ = 90°) as an example. The broadband pressure BRIRs (left column) are nearly identical in their overall time-energy structure, with matching amplitudes and time events. Accordingly, the magnitude frequency responses of the respective BRTFs (middle column) show no considerable differences and are almost identical at all examined spatial orders. Consistent with that, the magnitude differences (right column) are minimal over the entire audible frequency range from 20 Hz to 20 kHz for all examined spatial orders, with a maximum of about ±0.5 dB at higher frequencies.
In a further analysis, we compared BRIRs for 360 head orientations in the horizontal plane (1° steps from 0° to 360°), generated from the SMA signals for a single plane wave incident from the front as described above. For a perception-related evaluation of the spectral deviations, we calculated, for each head orientation, the absolute energetic difference ΔG between the SOFiA and SMATBIN BRIRs in 40 auditory gammatone filter bands between 50 Hz and 20 kHz [31, 32], as implemented in the Auditory Toolbox [33]. Figure 3 shows the resulting left-ear differences for N=7 as an example, for all 360 head orientations (gray lines) and averaged over all head orientations (blue line). In general, the differences are minimal and well below an assumed just-noticeable difference (JND) of 1 dB [34] and can thus be considered perceptually uncritical. For certain head orientations, the differences reach a maximum of approximately 0.8 dB in the frequency range of about 2-3 kHz. These larger differences occur mainly for lateral sound incidence, i.e., for head orientations around 90° and 270°. Smaller differences with a maximum of approximately 0.3 dB in the range of 2-3 kHz occur for frontal and rear sound incidence, i.e., for head orientations around 0° and 180°. The average difference across head orientations is generally very small but increases slightly towards mid frequencies, reaching a maximum of approximately 0.3 dB at about 2 kHz.
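A per-band energetic difference of this kind can be approximated as sketched below. The evaluation in the paper used the Auditory Toolbox [33] (MATLAB); the following Python stand-in uses SciPy's gammatone filter design and center frequencies equally spaced on the ERB-rate scale after Glasberg and Moore, so band definitions, filter order, and function names are assumptions for illustration only:

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def erb_space(f_low, f_high, n_bands):
    # Center frequencies equally spaced on the ERB-rate scale (Glasberg & Moore).
    erb = lambda f: 21.4 * np.log10(4.37 * f / 1000 + 1)
    erb_inv = lambda e: (10 ** (e / 21.4) - 1) * 1000 / 4.37
    return erb_inv(np.linspace(erb(f_low), erb(f_high), n_bands))

def delta_g(brir_ref, brir_test, fs=48000, f_low=50, f_high=20000, n_bands=40):
    """Absolute energetic difference per gammatone band between two BRIRs, in dB."""
    centers = erb_space(f_low, f_high, n_bands)
    dg = np.zeros(n_bands)
    for i, fc in enumerate(centers):
        b, a = gammatone(fc, 'iir', fs=fs)          # 4th-order IIR gammatone band filter
        e_ref = np.sum(lfilter(b, a, brir_ref) ** 2)    # band energy of reference BRIR
        e_test = np.sum(lfilter(b, a, brir_test) ** 2)  # band energy of test BRIR
        dg[i] = np.abs(10 * np.log10(e_test / e_ref))
    return centers, dg
```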
Working example 2
In the second working example, we evaluated the proposed method using measured SMA data of a real, more complex sound field. Specifically, we employed data captured with the VariSphear measurement system [35] on a Lebedev grid of order N=44 in a classroom at TH Köln [22]. The shoebox-shaped classroom has a volume of 459 m3 and a mean reverberation time of about 0.9 s (0.5 - 8 kHz). The sound source was a Neumann KH420 loudspeaker, placed at a distance of about 4.50 m and a height of 1.40 m in front of the VariSphear array. We spatially resampled the measurements to Lebedev grids of orders N={1,7,35} using SH interpolation. From these (resampled) SMA data, we calculated BRIRs using the SOFiA rendering chain as well as the SMATBIN filter method.
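As a rough illustration of the spatial resampling step, the following sketch performs a least-squares SHT on the source grid and evaluates the resulting SH expansion on the target grid. It uses complex SH basis functions from SciPy and ignores quadrature weights and the specific SH conventions of the SOFiA processing, so it is a simplified stand-in rather than the actual resampling used for the measurements:

```python
import numpy as np
from scipy.special import sph_harm

def sh_matrix(order, azi, colat):
    """Complex SH basis matrix of shape (len(azi), (order + 1)**2)."""
    cols = []
    for n in range(order + 1):
        for m in range(-n, n + 1):
            # SciPy convention: sph_harm(m, n, azimuth, colatitude)
            cols.append(sph_harm(m, n, azi, colat))
    return np.stack(cols, axis=-1)

def resample_sma(pressures, azi_src, colat_src, azi_tgt, colat_tgt, order):
    """Least-squares SH interpolation of SMA signals from a source to a target grid.

    pressures : (Q_src, T) time signals (or spectra) captured on the source grid.
    Returns the interpolated signals on the target grid, shape (Q_tgt, T).
    """
    Y_src = sh_matrix(order, azi_src, colat_src)   # (Q_src, (order + 1)**2)
    Y_tgt = sh_matrix(order, azi_tgt, colat_tgt)   # (Q_tgt, (order + 1)**2)
    coeffs = np.linalg.pinv(Y_src) @ pressures     # SHT via least squares
    return (Y_tgt @ coeffs).real                   # inverse SHT on the target grid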
Figure 4 compares the left-ear BRIRs/BRTFs for frontal head orientation generated using SOFiA or the SMATBIN filter method. For the complex sound field as well, the time-energy structure of the two broadband pressure BRIRs (left column) is almost identical. Consequently, the 1/6-octave smoothed magnitude responses (middle column) are largely identical for all spatial orders examined, and the magnitude differences (right column) are minimal, with a maximum range of about ±0.5 dB over the entire audible frequency range.
The analysis of the absolute energetic difference ΔG across 360 head orientations in the horizontal plane for the selected SH order N=7 revealed differences that should be perceptually uncritical, as they are clearly below the assumed JND of 1 dB (see Fig. 5). At frequencies below 100 Hz and in the range between 500 Hz and 3 kHz, the differences for certain head orientations reach a maximum of about 0.4 dB, but decrease again above 3 kHz. The average difference across head orientations does not exceed 0.2 dB in the entire audible frequency range and even tends towards 0 dB at frequencies above 3 kHz.
Interim summary
The results of the two working examples clearly show that the presented approach can be used equivalently to the established but much more complex virtual loudspeaker approach for binaural rendering of SMA data or for generating BRIRs from SMA measurements. Theoretically, the result of the two compared methods should even be completely identical. In practice, however, minimal differences between the binaural signals can occur because of the filter design, i.e., because of the necessary further processing of the filters after sampling the rendering chain with unit pulses, such as windowing, truncation, or delay compensation.
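Conceptually, the SMATBIN filters are obtained by exciting each microphone channel of an existing rendering chain with a unit impulse and recording the resulting binaural output. The following sketch outlines this sampling procedure for a single head orientation; render_binaural stands for a hypothetical wrapper around the existing rendering chain (e.g., the SOFiA processing), and the additional windowing and delay compensation mentioned above are omitted here:

```python
import numpy as np

def compute_smatbin_filters(render_binaural, Q, K, head_orientation):
    """Sample an SMA-to-binaural rendering chain with unit impulses (sketch).

    render_binaural : callable mapping SMA signals of shape (Q, T) to a binaural
                      signal of shape (2, T') with T' >= K for the given head
                      orientation; assumed to wrap the existing rendering chain.
    Returns SMATBIN filters of shape (Q, 2, K), one FIR filter per channel and ear.
    """
    filters = np.zeros((Q, 2, K))
    for q in range(Q):
        impulses = np.zeros((Q, K))
        impulses[q, 0] = 1.0                        # unit impulse in channel q only
        brir = render_binaural(impulses, head_orientation)
        filters[q] = brir[:, :K]                    # truncate to K taps (windowing and
                                                    # delay compensation omitted here)
    return filters
```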
The supplementary material (Additional file 1) contains further BRIR/BRTF plots for Working example 1 for the more application-oriented range of orders N={1,3,7}, selected SMATBIN filter lengths, and selected head orientations. Similar to Fig. 2, the results of the SOFiA and SMATBIN renderings are nearly identical as long as the SMATBIN filters have a sufficient number of filter taps. If the number of FIR filter taps is too small (approximately below 512 taps), deviations from the reference occur in the low-frequency range (<100 Hz) because of insufficient frequency resolution. The SMATBIN filter length can thus be used to adjust the accuracy of the binaural reproduction (compared to the reference) in the low-frequency range, but it also determines the required computing power and memory, as the computational effort for the real-time convolution as well as the required memory space depend on the number of filter taps.
Computational complexity
In particular, towards higher orders N, the SHT dominates the computational complexity (Footnote 2) of the common virtual loudspeaker and SH domain approaches. As the SHT must be performed for each frequency bin, it scales linearly with the filter length or FFT size K. The SMATBIN filter approach omits the SHT and reduces the entire encoding and decoding chain to linear filtering and summation (see Fig. 1), thereby decreasing the complexity for binaural rendering of SMA data, as detailed in the following.
The conventional SHT has a complexity of \(O(N^4\,K)\) and thus, the calculation effort increases rapidly as a function of the spatial order N [36]. Optimized methods for performing the SHT with reduced complexity still require \(O(N^2 (\log N)^2\,K)\) or \(O(N^{\frac{5}{2}} (\log N)\,K)\) steps, depending on the optimization [36, 37].
All other processing steps for binaural rendering of SMA data depend on N only with \(O(N^2)\). The FFT and IFFT, required in all rendering methods to transform the SMA signals to the frequency domain and the binaural signals to the time domain, respectively, both have a complexity of \(O(N^2\,K \log K)\). Linear filtering in the frequency domain, which in the present case corresponds to either applying the radial filters to the SH signals or the SMATBIN filters to the SMA signals, has a complexity of \(O(N^2\,K)\), and summing up all channels also has a complexity of \(O(N^2\,K)\).
Thus, the SHT has the highest complexity depending on N in the entire rendering chain, and especially for large N, its calculation effort significantly exceeds that of all other processing steps. As a result, by omitting the SHT, the SMATBIN filter method allows a more efficient binaural rendering of SMA data than the conventional methods.
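For illustration, the asymptotic expressions above can be evaluated for the array orders used in Working example 1 and K = 2048 taps. Constant factors are omitted, so the numbers indicate relative rather than absolute effort:

```python
import numpy as np

# Rough operation counts per block of K samples, following the asymptotic
# expressions above (constant factors omitted).
K = 2048
for N in (1, 7, 35):
    sht       = N**4 * K                 # conventional SHT, O(N^4 K)
    fft_ifft  = N**2 * K * np.log2(K)    # FFT/IFFT of all channels, O(N^2 K log K)
    filtering = N**2 * K                 # radial/SMATBIN filtering, O(N^2 K)
    summation = N**2 * K                 # channel summation, O(N^2 K)
    print(f"N={N:2d}:  SHT {sht:.2e}  FFT {fft_ifft:.2e}  "
          f"filter {filtering:.2e}  sum {summation:.2e}")
```

Already for N = 7, the SHT term exceeds the FFT term, and for N = 35 it dominates by roughly two orders of magnitude, which illustrates the benefit of omitting the SHT.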
Memory requirements
The lower computational complexity of the SMATBIN filter method comes at the cost of higher memory requirements, as a set of filters must be precomputed and stored for each required head orientation. To estimate, by way of example, how much more memory the SMATBIN filter approach requires compared to the virtual loudspeaker or SH domain approach, we assume in the following an SMA with a Lebedev sampling scheme of order N=12, i.e., Q=230 microphones and A=169 SH channels, a bit depth of 32 bit, i.e., P=4 bytes per filter tap, and a filter length of K=2048 taps.
For the virtual loudspeaker approach, N+1 radial filters with a length of K taps and 2×D HRTF filters with a length of L taps must be stored, with D the number of directions of the HRTF set. The total memory requirement is ((N+1)×K + 2×D×L)×P. Assuming an HRTF set with D=2702 directions and L=128 taps, the virtual loudspeaker approach requires 2.9 MB.
With the SH domain approach, only 2×A HRTF filters in the SH domain need to be stored in addition to the radial filters. Here, the total memory requirement is ((N+1) + 2×A)×K×P, which also results in 2.9 MB.
In the case of the SMATBIN filter method, the required memory scales with the number of microphones Q and the number of head orientations M. The total memory requirement is 2×Q×M×K×P. Assuming that, as is often the case, only head orientations in the horizontal plane are rendered with a sufficiently high resolution of 2° [38] yields M=180 head orientations and a total memory requirement of 678 MB. Thus, the SMATBIN filter method requires significantly more memory space than the other two methods, but is computationally less demanding. Accordingly, it must be decided on a case-by-case basis, depending on the technical requirements of a rendering system, whether memory space can be sacrificed for a lower computational load.
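The three memory estimates can be reproduced directly from the expressions above, as in the following short calculation (MB here denotes 10^6 bytes):

```python
# Memory estimates for the example configuration given above.
N, Q, A = 12, 230, 169        # SH order, microphones, SH channels = (N + 1)**2
K, P    = 2048, 4             # filter taps, bytes per tap (32 bit)
D, L    = 2702, 128           # HRTF directions and HRTF length (taps)
M       = 180                 # stored head orientations (2° steps, horizontal plane)

mem_vls     = ((N + 1) * K + 2 * D * L) * P      # virtual loudspeaker approach
mem_sh      = ((N + 1) + 2 * A) * K * P          # SH domain approach
mem_smatbin = 2 * Q * M * K * P                  # SMATBIN filters, two ears

for name, mem in [("virtual loudspeaker", mem_vls),
                  ("SH domain", mem_sh),
                  ("SMATBIN", mem_smatbin)]:
    print(f"{name:>20s}: {mem / 1e6:6.1f} MB")   # 2.9 MB, 2.9 MB, 678.3 MB
```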