A neural network-supported two-stage algorithm for lightweight dereverberation on hearing devices
EURASIP Journal on Audio, Speech, and Music Processing volume 2023, Article number: 18 (2023)
Abstract
A two-stage lightweight online dereverberation algorithm for hearing devices is presented in this paper. The approach combines a multi-channel multi-frame linear filter with a single-channel single-frame post-filter. Both components rely on power spectral density (PSD) estimates provided by deep neural networks (DNNs). By deriving new metrics analyzing the dereverberation performance in various time ranges, we confirm that directly optimizing for a criterion at the output of the multi-channel linear filtering stage results in a more efficient dereverberation as compared to placing the criterion at the output of the DNN to optimize the PSD estimation. More concretely, we show that training this stage end-to-end helps further remove the reverberation in the range accessible to the filter, thus increasing the early-to-moderate reverberation ratio. We argue and demonstrate that it can then be well combined with a post-filtering stage to efficiently suppress the residual late reverberation, thereby increasing the early-to-final reverberation ratio. This proposed two-stage procedure is shown to be very effective in terms of both dereverberation performance and computational demands, as compared to, e.g., recent state-of-the-art DNN approaches. Furthermore, the proposed two-stage system can be adapted to the needs of different types of hearing-device users by controlling the amount of reduction of early reflections.
1 Introduction
Communication and hearing devices require modules that suppress undesired parts of the signal to improve speech quality and intelligibility. Reverberation is one such distortion: it is caused by room acoustics and is characterized by multiple reflections on the room enclosures. Late reflections particularly degrade the speech signal and may result in reduced intelligibility [1].
Traditional approaches were proposed for dereverberation such as spectral enhancement [2], beamforming [3], a combination of both [4], coherence weighting [5, 6], and linear-prediction-based approaches such as the well-known weighted prediction error (WPE) algorithm [7, 8]. WPE computes an autoregressive multi-channel filter in the short-time spectrum and applies it to a delayed group of reverberant speech frames. This approach is able to partially cancel late reverberation while inherently preserving parts of the early reflections, thus improving speech intelligibility for normal and hearing-supported listeners [9].
WPE and its extensions require the prior estimation of the anechoic speech PSD, which is modeled for instance through the speech periodogram [7] or a power-compressed periodogram corresponding to sparse priors [8], by an autoregressive process [10] or through non-negative matrix factorization [11]. A DNN was first introduced in [12] to model the anechoic PSD, thus avoiding the use of an iterative refinement.
Instead of providing parameters for linear prediction as in, e.g., [12, 13], DNNs were also proposed for mapping-based dereverberation in the time-frequency magnitude domain [14], complex domain [15, 16], or in the time domain [17].
As hearing devices operate in real-world scenarios in real time, the proposed techniques for dereverberation should support low-latency online processing and adapt to changing room acoustics. Such online adaptive approaches were introduced, based on either Kalman filtering [18, 19] or on a recursive least squares (RLS)-adapted WPE, which is a special case of Kalman filtering [20]. Strategies for handling the case of speakers changing positions were introduced in [19, 20]. In the RLS-WPE framework, the PSD is either estimated by recursive smoothing of the reverberant signal [20] or by a DNN [21].
In the previously cited works, the DNN is trained towards PSD estimation, although this stage is only a front-end followed by RLS-WPE-based dereverberation algorithms. So-called end-to-end techniques aim to solve this mismatch by using a criterion placed at the output of the complete algorithm to train the DNN. End-to-end techniques using an automatic speech recognition (ASR) criterion were designed to refine the front-end DNN handling, e.g., speech separation [22], denoising [23], or multiple tasks [24]. An end-to-end procedure using ASR as a training criterion was also introduced in [25] to optimize a DNN used for online dereverberation.
This journal paper is an extension of our prior work [26], where we proposed instead to use a criterion directly on the output signal rather than using ASR. We experimentally showed that it improved instrumentally predicted speech intelligibility and quality. The proposed criterion also enabled us to use different target signals and corresponding WPE parameters to make our approach adapt to the needs of different hearing-aid user categories: hearing aid (HA) users on the one hand, who benefit from early reflections like normal listeners [9], and cochlear implant (CI) users on the other hand, who do not benefit from early reflections [27].
We noticed in [26] that although the energy residing in the moderate reverberation range corresponding to the filter length was particularly suppressed when training the approach end-to-end, residual late reverberation could still be heard at the output. A further processing stage could be dedicated to removing this residual reverberation, as increasing the length of the linear filters results in rapidly increasing computational complexity. Hybrid approaches using such cascaded DNN-assisted stages have been proposed for dereverberation [28] or joint dereverberation, separation, and denoising [13, 24, 29].
The extension to our work [26] consists of the following three contributions. First, we introduce metrics to measure the energies in various reverberation ranges in order to investigate the differences between the previously cited WPE-based approaches and our proposed method. Second, we propose to use a second DNN-supported stage based on single-frame non-linear magnitude filtering and show that it significantly suppresses the residual late reverberation at the output of WPE. We show with the newly introduced metrics that this latter stage particularly benefits from strong dereverberation within the linear filter range obtained with the previous end-to-end WPE approach. Finally, we evaluate our approach and baselines on simulated reverberant data inspired by the WHAMR! dataset [30].
The rest of this paper is organized as follows. In Section 2, the online DNN-WPE dereverberation scheme is summarized. Section 3 presents the DNN-supported post-filter and describes the end-to-end training procedure used. In Section 4, we describe the experimental setup and introduce metrics in order to detail the dereverberation performance in various ranges. The results are presented and discussed in Section 5.
2 Signal model and DNN-supported WPE Dereverberation
2.1 Signal model
We use a subband-filtering approximation in the short-time Fourier transform (STFT) domain as in [7], and all computations except those involving neural networks are performed for each frequency band independently. Therefore, we omit the frequency index f when unnecessary and all vectors and matrices have an additional implicit frequency dimension of size F. The time frame index in the sequences of length T is denoted by t and is also dropped when not explicitly needed. We use lowercase normal font notation for signals having only time (and frequency) dimensions (\(a_t \in \mathbb {C}\)), lowercase bold font notation for vectors having one extra dimension (\(\varvec{a}_t \in \mathbb {C}^{d_1}\)) and reserve uppercase bold font notation for matrices having two extra dimensions (\(\varvec{A}_t \in \mathbb {C}^{d_1 \times d_2}\)).
The reverberant speech \(\varvec{x} \in \mathbb {C}^{D \times T}\) is obtained at the D-microphone array by convolution of the anechoic speech \(s \in \mathbb {C}^{T}\) and the room impulse responses (RIRs) \(\varvec{h} \in \mathbb {C}^{D \times N}\):
where \(\varvec{d}\) denotes the direct path, \(\varvec{e}\) the early reflections component, \(\varvec{r}\) the late reverberation, and \(\varvec{u}\) an error term comprising modeling errors and background noise. The early reflections component \(\varvec{e}\) was shown to contribute to speech quality and intelligibility for normal and HA listeners [9] but not for CI users, particularly in highly reverberant scenarios [27]. Therefore, we propose that the dereverberation objective is to retrieve \(\varvec{\nu } = \varvec{d + e}\) for HA listeners and \(\varvec{\nu } = \varvec{d}\) for CI listeners.
2.2 WPE dereverberation
In relation to the subband reverberant model in (1), the WPE algorithm [7] uses an autoregressive model to approximate the late reverberation \(\varvec{r}\). Based on a zero-mean time-varying Gaussian model on the STFT anechoic speech s with time (and frequency) dependent PSD \(\lambda ^{(\textrm{WPE})}\), a multi-channel filter \(\varvec{G} \in \mathbb {C}^{DK \times D}\) with K taps is estimated. This filter aims at representing the inverse of the late tail of the RIRs \(\varvec{h}\), such that the target \(\varvec{\nu }\) can be obtained through linear prediction with delay \(\Delta\). The prediction delay \(\Delta\) is originally intended to avoid undesired short-time speech cancelations in [7]; however, this also leads to preserving parts of the early reflections. As such, we propose to set \(\Delta\) larger for normal-hearing and HA users who benefit from early reflections [9] but lower for CI users who suffer from early reflections [27]. By disregarding the error term \(\varvec{u}\) in (1) in noiseless scenarios, we obtain:
where \(\mathcal {X}_{t - \Delta } = \left[ \begin{array}{c} \varvec{x}^T_{t-\Delta }, \dots , \varvec{x}^T_{t-\Delta -K+1} \end{array}\right] ^T \in \mathbb {C}^{DK}\).
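The delayed linear prediction described above can be illustrated with a minimal NumPy sketch for a single frequency band. The function names `delayed_buffer` and `wpe_predict` are hypothetical, and the prediction form \(\varvec{\nu }_t = \varvec{x}_t - \varvec{G}^H \mathcal {X}_{t-\Delta }\) is assumed from the standard WPE formulation rather than taken verbatim from this paper.

```python
import numpy as np

def delayed_buffer(x, t, delay, taps):
    """Stack frames x[:, t-delay], ..., x[:, t-delay-taps+1] into one vector.

    x has shape [D, T] (one frequency band). Frames located before the start
    of the signal are taken as zero. The result has size D * taps.
    """
    D, _ = x.shape
    buf = np.zeros((taps, D), dtype=x.dtype)
    for k in range(taps):
        idx = t - delay - k
        if idx >= 0:
            buf[k] = x[:, idx]
    return buf.reshape(-1)

def wpe_predict(x_t, G, buf):
    """Assumed WPE prediction: remove the reverberation predicted from the buffer.

    G has shape [D * taps, D]; the output has shape [D].
    """
    return x_t - G.conj().T @ buf
```

With \(D=2\), \(K=10\) and \(\Delta =5\), the buffer at frame \(t=20\) holds frames 15 down to 6, so the five most recent frames are left untouched by the filter.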
In order to obtain an adaptive and real-time capable approach, RLS-WPE was proposed in [20], where the WPE filter \(\varvec{G}\) is recursively updated along time. RLS-WPE can be seen as a special case of Kalman filtering, in which the target covariance matrix is replaced by the scaled identity matrix \(\lambda ^{(\textrm{WPE})} \varvec{I}\), and the weight state error matrix is simply updated by dividing by the recursive factor \(\alpha\) instead of following the usual Markov model [19]:
Here, \(\varvec{k} \in \mathbb {C}^{DK}\) is the Kalman gain, \(\varvec{R} \in \mathbb {C}^{DK \times DK}\) the covariance of the delayed reverberant signal buffer \(\mathcal {X}_{t - \Delta }\) weighted by the PSD estimate \(\lambda ^{(\textrm{WPE})}\), and \(\alpha\) the forgetting factor.
In non-idealistic scenarios, the term \(\varvec{u}\) is not zero. Therefore, a regularization parameter \(\epsilon > 0\) is added to the denominator of (3), which can be seen as a form of spectral flooring as used in traditional spectral enhancement schemes [4, 6, 31]. Although it is not per se a denoising solution and we still consider scenarios where noise is negligible in comparison to reverberation, adding this parameter helps increase the robustness of WPE to noise, numerical instabilities and modeling errors. On the other hand, setting \(\epsilon\) to a high value will excessively attenuate the relative variations of the Kalman denominator, which mitigates the benefits of variance-normalization as explained in [32]. A value of \(\epsilon ^*=0.001\) was picked based on the performance of the WPE algorithm using oracle PSD.
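As an illustration of the recursion, the following NumPy sketch implements one RLS-WPE update for a single frequency band, following the standard form found in the RLS-WPE literature. The exact equations of this paper are not reproduced here, so the update order, the placement of \(\epsilon\) in the gain denominator, and the function name `rls_wpe_step` are assumptions.

```python
import numpy as np

def rls_wpe_step(x_t, buf, G, R_inv, psd, alpha=0.99, eps=1e-3):
    """One assumed RLS-WPE update for a single frequency band.

    x_t:   current reverberant frame, shape [D]
    buf:   delayed buffer of past frames, shape [D*K]
    G:     current filter, shape [D*K, D]
    R_inv: inverse of the PSD-weighted buffer covariance, shape [D*K, D*K]
    psd:   anechoic speech PSD estimate for this frame (positive scalar)
    """
    Rb = R_inv @ buf
    # Kalman gain; eps regularizes the denominator (assumed placement).
    denom = alpha * psd + np.real(np.vdot(buf, Rb)) + eps
    k = Rb / denom
    # Dereverberated frame, computed with the filter from the previous step.
    nu = x_t - G.conj().T @ buf
    # Inverse covariance update, divided by the forgetting factor alpha.
    R_inv = (R_inv - np.outer(k, buf.conj() @ R_inv)) / alpha
    # Filter update driven by the prediction residual.
    G = G + np.outer(k, nu.conj())
    return nu, G, R_inv
```

In a noiseless toy setting where the current frame is an exact linear function of the buffer, repeatedly calling this step drives \(\varvec{G}\) towards the true predictor, which is a convenient sanity check before plugging in a real PSD estimate.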
2.3 DNN-based PSD estimation
The anechoic speech PSD estimate \(\lambda ^{(\textrm{WPE})}\) is obtained at each time step, either by recursive smoothing of the reverberant periodogram [20] or with the help of a DNN [21]. A block diagram of the DNN-WPE algorithm as proposed in [21] is given in Fig. 1, as the first stage up to \(\varvec{\nu }^{(\textrm{WPE})}\). In this approach, the input to the neural network is the magnitude of the reference channel \(x_0\), taken here to be the first channel. We did not observe changes in the results by changing the reference channel or computing an average of the channels to obtain the DNN input, likely because the signal model itself considers a channel-agnostic PSD. The magnitude frame is then fed to a recurrent neural network \(\mathrm {MaskNet_{WPE}}\), which outputs a real-valued mask \(\mathcal {M}^{(\textrm{WPE})}\). The PSD estimate is obtained by time-frequency masking:
where \(\odot\) represents the Hadamard element product.
In [12, 21], the DNN is optimized with a mean-squared error (\(\textrm{MSE}\)) criterion on the masked output. In contrast, we proposed to use the \(L^1\) loss:
This loss function indeed led to better results in our experiments [26]. This can be explained by the fact that the \(L^1\) loss puts more weight on low-energy bins than high-energy bins in comparison to the \(\textrm{MSE}\) loss as it is more concave, which is a good fit for dereverberation.
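The masking and loss computation can be sketched as follows. Since the equations are not reproduced here, this sketch assumes that the PSD is the squared masked magnitude of the reference channel and that the \(L^1\) loss compares the masked magnitude with the target magnitude; the function names are hypothetical.

```python
import numpy as np

def psd_from_mask(mask, ref_mag):
    """Assumed PSD estimate: squared element-wise masked reference magnitude."""
    return (mask * ref_mag) ** 2

def l1_mask_loss(mask, ref_mag, target_mag):
    """Assumed L1 criterion between masked magnitude and target magnitude."""
    return np.mean(np.abs(mask * ref_mag - target_mag))

def mse_mask_loss(mask, ref_mag, target_mag):
    """MSE counterpart, shown only for comparison."""
    return np.mean((mask * ref_mag - target_mag) ** 2)
```

For an error of 0.1 on a low-energy bin and an error of 1.0 on a high-energy bin, the low-energy bin contributes about 9 % of the summed \(L^1\) loss but only about 1 % of the summed MSE, which illustrates the relative weighting argument above.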
2.4 End-to-end training procedure
2.4.1 End-to-end criterion and objectives
We argue that the mismatch between the DNN-optimization criterion (7) and the dereverberation task may limit the overall performance. However, using ASR as an end-to-end training criterion, as is done in [25], may not necessarily be the best choice for optimizing a dereverberation algorithm for hearing-aid users. The first reason is that the resulting scheme could not be adapted to specific user categories, although these benefit from different speech cues. Namely, HA listeners are shown to benefit from early reflections [9] whereas CI listeners do not significantly benefit from those, in particular in highly reverberant scenarios where early reflections degrade intelligibility [27]. The second reason is that by nature, the dereverberation scheme will provide the best representation possible for ASR, which may not be the optimal representation in terms of quality and intelligibility for a human listener.
We therefore proposed an end-to-end training procedure where the optimization criterion is placed in the time-frequency domain at the output of the DNN-WPE algorithm, thus including the back-end WPE into DNN optimization:
2.4.2 End-to-end training procedure
An important practical aspect of this study focuses on handling the initialization period of the RLS-WPE algorithm. During this interval, the filter \(\varvec{G}\) has not yet converged to a stable value, reducing dereverberation performance. Therefore, rather than relying on a hypothetical shortening of this period through implicit PSD optimization [25], we choose to exclude this initialization period from training. The DNN is thus optimized so that the algorithm works best in its stable regime. To do so, we first craft long reverberant utterances that we cut into segments of \(L_i\) frames, where \(L_i\) is the worst-case initialization time plus some margin. We then design the training procedure so that the first segment is used only to initialize the WPE statistics \(\varvec{G}\) and \(\varvec{R^{-1}}\) and the DNN hidden states \(h(\mathrm {MaskNet_{WPE}})\). This enables training the DNN weights on the next segments, during the stable regime. The data generation procedure is detailed in Section 4.1.
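The segmentation logic can be sketched in a framework-agnostic way as below. The helper name `split_into_segments` and the returned tuple layout are illustrative assumptions; in an actual training loop, the WPE statistics and the recurrent hidden states would be carried over from one segment to the next, and the loss would be backpropagated only where the flag is set.

```python
def split_into_segments(num_frames, seg_len):
    """Cut a long sequence into consecutive segments of seg_len frames.

    Returns a list of (start, stop, use_for_loss) tuples. The first segment
    only warms up the WPE statistics and the hidden states, so its flag is
    False; trailing frames that do not fill a whole segment are dropped.
    """
    segments = []
    starts = range(0, num_frames - seg_len + 1, seg_len)
    for i, start in enumerate(starts):
        segments.append((start, start + seg_len, i > 0))
    return segments
```

For example, a sequence of 2500 frames cut into segments of 500 frames yields five segments, of which the last four contribute to the loss.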
We showed in [26] that the best performance was obtained with the E2Ep-WPE approach, where the network \(\mathrm {MaskNet_{WPE}}\) is first pretrained with (7) and fine-tuned with (8). If \(\mathrm {MaskNet_{WPE}}\) is only pretrained, the algorithm is named DNN-WPE, and corresponds to [21] with a different training loss function.
The proposed end-to-end training procedure is summarized in Algorithm 1.
3 Residual reverberation suppression
3.1 Signal model
As shown in Section 5, training the DNN-supported WPE stage in an end-to-end fashion helps suppress a large part of the reverberant signal immediately following the target range, that is, up to \(L_m\), which we refer to as the moderate reverberation range.
We thus refine the reverberant signal model (1) as:
where the undesired reverberant signal in (1) (corresponding to \(\varvec{r}\) and \(\varvec{e+r}\) in the HA and CI cases, respectively) is split into the moderate reverberant signal \(\varvec{m}\) and the final reverberant signal \(\varvec{\phi }\), defined as:
The resulting WPE estimate thus contains the target \(\varvec{\nu }\), a target estimation error \(\varvec{\tilde{\nu }}\), a residue \(\varvec{\tilde{m}}\) from this moderate reverberation and a residue stemming from the final reverberation \(\varvec{\tilde{\phi }}\) (again disregarding the error term \(\varvec{u}\) in noiseless scenarios):
The target estimation error \(\varvec{\tilde{\nu }}\) is the target component which was degraded by the algorithm. As described in [32] for the original WPE algorithm, parts of the early reflections may be destroyed because of the inner short-time speech correlations. Under some mild assumptions, the direct path is however fully preserved if the prediction delay \(\Delta\) is sufficiently large (i.e., larger than the inner speech correlation time). The target estimation error is therefore likely to be larger when using WPE-based algorithms in the HA scenario, whose target contains more early reflections, than in the CI scenario.
3.2 Post-filtering scheme
We aim at suppressing the two residues \(\varvec{\tilde{m}}\) and, more particularly, \(\varvec{\tilde{\phi }}\). Indeed, \(\varvec{\tilde{\phi }}\) is generally of higher magnitude than \(\varvec{\tilde{m}}\), as we will show in the experiments that a large amount of moderate reverberation can be canceled by efficient WPE-based dereverberation. Additionally, \(\varvec{\tilde{\phi }}\) is the more perceptually disturbing of the two residues for the following reasons.
On the one hand, \(\varvec{\tilde{\phi }}\) can be considered as speech-like noise which is very poorly correlated to the target signal in comparison to \(\varvec{\tilde{m}}\). On the other hand, as WPE cancels most of the so-called moderate reverberation, there is no preceding energy anymore to mask the late reverberation. The final reverberation residue is then clearly audible.
We thus add a post-filtering enhancement stage after the linear WPE filtering stage, which consists of a single-channel Wiener filter, the phase being left unchanged. This Wiener filter uses estimates of the target PSD \(\lambda ^{(\nu , \textrm{PF})}\) and interference PSD \(\lambda ^{(\tilde{r}, \textrm{PF})}\), which can be obtained with classical techniques such as decision-directed signal-to-noise ratio (SNR) estimation [33], cepstral smoothing [6, 34], or from a neural network [21, 35].
The resulting estimate is then given for each channel d separately by the celebrated Wiener filter, using the WPE output:
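A minimal sketch of such a real-valued Wiener gain applied to the WPE output is given below. The gain form \(\lambda ^{(\nu )} / (\lambda ^{(\nu )} + \lambda ^{(\tilde{r})})\) is the textbook Wiener filter and is assumed here; the function name and the small `floor` constant guarding against division by zero are illustrative additions.

```python
import numpy as np

def wiener_postfilter(nu_wpe, psd_target, psd_interf, floor=1e-12):
    """Apply an assumed single-channel Wiener gain to the complex WPE output.

    nu_wpe:      complex STFT coefficients of one channel
    psd_target:  target PSD estimate (same shape, non-negative)
    psd_interf:  residual reverberation PSD estimate (same shape, non-negative)
    The gain is real and non-negative, so the phase of nu_wpe is unchanged.
    """
    gain = psd_target / (psd_target + psd_interf + floor)
    return gain * nu_wpe
```

Because the same real gain can be applied to every channel, the relative phase between channels is untouched by this stage.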
3.3 DNN-based PSD estimation
We use a DNN-based masking approach to obtain the target and residual reverberation PSDs, similar to what is used to estimate the target speech PSD for WPE filtering (see (6)). At each time step, a frame of the WPE output’s magnitude taken from the reference channel \( \nu _0^{(\textrm{WPE})} \) is fed to a recurrent neural network \(\mathrm {MaskNet_{PF}}\), which outputs both a target and interference mask. The PSD estimate \(\lambda ^{(\eta )}\) is then obtained for each channel d through time-frequency masking for each signal \(\varvec{\eta } \in \{ \varvec{\nu }, \varvec{\tilde{r}} \}\):
We apply the same reference-channel mask for all channels using only one instance of the DNN, which saves some computational power and enables us to leave the interaural level differences unchanged. Also, the interaural phase differences are well estimated by WPE linear filtering and are not modified by the post-filtering scheme (see (13)). Therefore the target binaural cues are well preserved, which is important for hearing devices.
A block diagram of the complete two-stage algorithm is provided in Fig. 1.
3.4 Training procedure
We trained the post-filter DNN \(\mathrm {MaskNet_{PF}}\) with a similar mask-based objective as \(\mathrm {MaskNet_{WPE}}\):
where \(\tilde{r}_0\) is the undesired signal defined in (12) taken at the reference channel. We report results for two approaches. The first is DNN-WPE+DNN-PF, where the network \(\mathrm {MaskNet_{WPE}}\) is pretrained with (7), then frozen for the pretraining of \(\mathrm {MaskNet_{PF}}\) with (15). The second is E2Ep-WPE+DNN-PF, where the network \(\mathrm {MaskNet_{WPE}}\) is pretrained with (7) and fine-tuned with (8), then frozen for the pretraining of \(\mathrm {MaskNet_{PF}}\) with (15).
Table 1 lists the presented algorithms together with their characteristics and acronyms.
4 Experimental Setup
4.1 Dataset generation
We use clean speech material from the WSJ0 dataset [36], using the usual split of 101, 10, and 8 speakers for training, validation, and testing respectively. For each split independently, we concatenate utterances belonging to the same speaker, and construct sequences of approximately 20 s. The initialization time of WPE can go up to 2 s in the worst case when using a forgetting factor of \(\alpha = 0.99\). For end-to-end training, we do not want to learn during that period (cf. Section 2.4). Therefore, we cut these long sequences into segments of \(L_i = 4\) s and use the first segment only for initialization, thus not backpropagating the loss on it (cf. Algorithm 1). We choose \(L_i\) to fulfill both requirements of (i) being larger than the worst-case initialization time of WPE and (ii) providing a sufficient receptive field for training with LSTMs. Since the first segment is never used for optimization, permutations of the original utterances are used to create several versions of each sequence, so that we still use all speech data available for training the DNNs.
These sequences are convolved with 2-channel RIRs generated with the RAZR engine [37] and randomly picked. Each RIR is generated by uniformly sampling room acoustics parameters as in [30] and a \(\text {T}_\text {60}\) reverberation time between 0.4 and 1.0 s. Head-related transfer function (HRTF) based auralization is performed in the RAZR engine, using a KEMAR dummy head response from the MMHR-HRTF database [38].
As specified earlier, the target data for the HA case should represent the direct path and the early reflections, as normal-hearing and hearing-aided listeners benefit from early reflections [9]. Therefore, we convolve the dry utterance with the beginning of the RIR, up to a separation time often found in the dereverberation literature [1, 9, 39]. We empirically set the separation time to 40 ms instead of the usual 50 ms, as we obtained better instrumental results when comparing the resulting target data to WPE estimates using the oracle PSD.
In the CI scenario, the target data should theoretically contain the direct path only [27]. However, directly estimating the direct path from reverberant speech often provides poor instrumental results given the low input SNR. Note also that the first WPE stage uses a prediction delay \(\Delta\) supposed to protect the inner speech correlations, whose range is usually estimated to be \(\sim 10\) ms. The minimal \(\Delta\) that fulfills this requirement is \(\Delta =2\) STFT frames with the hyperparameters described below, that is, 16 ms. Therefore, we propose to match the target data with the best possible WPE estimate, by convolving the dry utterance with the first 16 ms of the RIR. This also contributes to decreasing the difficulty of the estimation task, which helps obtain reasonable estimates with the proposed algorithm. We further noticed that with this setting, very few early reflections could be heard in the target.
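The target generation for both scenarios can be sketched as a convolution with a truncated RIR. In this sketch, the separation time is counted from the direct-path peak, located as the largest absolute RIR sample; this convention, like the function name `make_target`, is an assumption and is not stated in the text.

```python
import numpy as np

def make_target(dry, rir, fs=16000, sep_ms=40.0):
    """Convolve the dry signal with the beginning of a single-channel RIR.

    The RIR is kept up to sep_ms milliseconds after the direct path, which is
    assumed to be the sample of largest magnitude. The output is cropped to
    the length of the dry signal.
    """
    onset = int(np.argmax(np.abs(rir)))
    n_keep = onset + int(round(sep_ms * 1e-3 * fs))
    return np.convolve(dry, rir[:n_keep])[: len(dry)]
```

Calling `make_target(dry, rir, sep_ms=40.0)` would give an HA-style target containing direct path and early reflections, while `sep_ms=16.0` would give the CI-style target described above.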
The original mean input direct-to-reverberant ratio (DRR) between the dry signal and reverberant mixture is \(6.0\textrm{dB}\) and the mean microphone-to-speaker distance used was estimated at 4.2 m. The resulting mean input signal-to-noise ratio (SNR) between the generated target and the reverberant mixture is \(0.9\textrm{dB}\) for the HA scenario, and \(1.4\textrm{dB}\) for the CI scenario.
Finally, independent and identically distributed Gaussian noise is added to each channel with an input SNR uniformly sampled in \([15, 25] \, \textrm{dB}\) to simulate sensor noise. Ultimately, the training, validation and testing sets contain around 55, 16 and 3 h of speech sampled at 16 kHz.
4.2 Hyperparameter settings
The STFT uses a square-rooted Hann window of 32 ms and a 75 % overlap. For training, segments of \(L_i=4\) s are constructed from each sequence (see Section 4.1). All approaches are trained using the Adam optimizer with a learning rate of \(10^{-4}\) and a batch size of 128. Training stops after a maximum of 500 epochs, or earlier if the validation loss has not decreased for 20 consecutive epochs.
The WPE filter length is set to \(K=10\) STFT frames (i.e., 80 ms), the number of channels to \(D=2\), the WPE adaptation factor to \(\alpha =0.99\), and the delays to \(\Delta _{\text {HA}}=5\) frames (i.e., 40 ms) for the HA scenario and \(\Delta _{\text {CI}}=2\) frames (i.e., 16 ms) for the CI scenario. The delay values are picked to match the amount of early reflections contained in the respective target, and they experimentally provide optimal evaluation metrics when comparing the corresponding target to the output of WPE when using the oracle PSD (see Section 4.1).
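The correspondence between frame counts and durations follows directly from the STFT settings, as the short computation below shows. The helper name `stft_frame_params` is hypothetical; the numerical settings are those given in the text (16 kHz sampling rate, 32 ms window, 75 % overlap).

```python
def stft_frame_params(fs=16000, win_ms=32, overlap=0.75):
    """Derive window length, hop size and number of frequency bins."""
    win_len = fs * win_ms // 1000               # samples per window
    hop_len = win_len - int(win_len * overlap)  # samples per hop
    hop_ms = 1000 * hop_len / fs                # hop duration in ms
    n_bins = win_len // 2 + 1                   # one-sided spectrum size
    return {"win_len": win_len, "hop_len": hop_len,
            "hop_ms": hop_ms, "n_bins": n_bins}

def frames_to_ms(n_frames, hop_ms):
    """Duration spanned by n_frames hops."""
    return n_frames * hop_ms

params = stft_frame_params()
filter_ms = frames_to_ms(10, params["hop_ms"])   # K = 10 taps
delay_ha_ms = frames_to_ms(5, params["hop_ms"])  # HA prediction delay
delay_ci_ms = frames_to_ms(2, params["hop_ms"])  # CI prediction delay
```

The hop is 8 ms (128 samples), so \(K=10\) frames span 80 ms, \(\Delta _{\text {HA}}=5\) frames span 40 ms and \(\Delta _{\text {CI}}=2\) frames span 16 ms, with 257 frequency bins per frame.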
The DNN used in [21] is composed of a single long short-term memory (LSTM) layer with 512 units followed by two linear layers with rectified linear activations (ReLU) and a linear output layer with sigmoid activation. We remove the two ReLU-activated layers in our experiments, which did not significantly degrade the dereverberation performance, while reducing the number of trainable parameters by 75 %, therefore ending with 1.6M parameters. We use the same architecture for \(\mathrm {MaskNet_{WPE}}\) and \(\mathrm {MaskNet_{PF}}\). We choose to use LSTMs rather than recent convolutional network or transformer-based architectures to develop a frugal algorithm for hearing devices with limited computing resources. Indeed, LSTMs require far fewer operations per second than the mentioned alternatives, given that they process only one input frame and perform sequence modeling using their internal memory state.
4.3 Evaluation metrics
We evaluate all approaches on the described test sets corresponding to the HA and CI scenarios.
Following the definition of the early-to-late reverberation ratio (\(\textrm{ELR}\)) [10, 40], we introduce two new instrumental measures: the early-to-moderate reverberation ratio (\(\textrm{EMR}\)) and early-to-final reverberation ratio (\(\textrm{EFR}\)). Estimated RIR coefficients \(\{ \hat{H} \}_{d, \tau , f}\) of order \(0 \le \tau \le P-1\) are computed for each channel d and frequency bin f separately, in order to minimize a minimum mean square error regression objective in the time-frequency domain between a reverberant utterance Y and the corresponding dry utterance S filtered by H [13]:
with \(\delta ^*\) being the oracle propagation delay obtained by looking for the direct path in the true RIR. This delay is used so as not to try and estimate RIR coefficients preceding the propagation delay which are supposed to be zero, therefore reducing the estimation error. The estimation error is further reduced by choosing the order P to match the \(T_{30}\) of the true RIR rather than the \(T_{60}\), as the estimation error floor was found to be close to \(30\textrm{dB}\).
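A least-squares version of this regression can be sketched for one channel and one frequency bin as follows. The function name `estimate_subband_rir` and the use of `numpy.linalg.lstsq` are illustrative choices, and the sketch assumes that the objective reduces to an ordinary least-squares fit of \(Y_t \approx \sum _{\tau } H_{\tau } S_{t-\tau -\delta ^*}\).

```python
import numpy as np

def estimate_subband_rir(Y, S, order, delay=0):
    """Least-squares estimate of subband RIR coefficients.

    Y, S:  complex STFT sequences of length T (one channel, one bin)
    order: number of coefficients P to estimate
    delay: propagation delay in frames; coefficients before it are not
           estimated (they are assumed to be zero)
    """
    T = len(S)
    A = np.zeros((T, order), dtype=complex)
    for tau in range(order):
        shift = tau + delay
        if shift < T:
            # Column tau holds the dry signal delayed by (tau + delay) frames.
            A[shift:, tau] = S[: T - shift]
    H, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return H
```

On noiseless synthetic data generated with a known subband filter, this estimator recovers the coefficients up to numerical precision, which makes it easy to validate before applying it to processed signals.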
The channelwise RIRs are then stacked and the target, moderate and final reverberation components are estimated as:
We set \(\tilde{\Delta }=5\) (i.e., 40 ms) in the hearing-aided case and \(\tilde{\Delta }=2\) (i.e., 16 ms) in the cochlear-implanted scenario as explained in the target specifications in the section above. We set the moderate range length to \(L_m=K=10\) (i.e., 80 ms).
The \(\textrm{ELR}\), \(\textrm{EMR}\) and \(\textrm{EFR}\) are then defined as:
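The three ratios can be illustrated with the sketch below for a single channel and frequency bin. It assumes that each ratio is the energy of the target component over the energy of the corresponding reverberation component, expressed in dB, with the late component being the sum of the moderate and final components; the aggregation over channels and frequencies used in the paper is not reproduced, and the function name is hypothetical.

```python
import numpy as np

def reverb_ratios(S, H, delta, L_m):
    """Return (ELR, EMR, EFR) in dB for one channel and one frequency bin.

    S:     dry STFT sequence, shape [T]
    H:     estimated subband RIR coefficients, shape [P]
    delta: number of frames counted as target
    L_m:   length of the moderate reverberation range in frames
    """
    def component(lo, hi):
        # Filter the dry signal with the RIR coefficients in [lo, hi) only.
        h = np.zeros_like(H)
        h[lo:hi] = H[lo:hi]
        return np.convolve(S, h)[: len(S)]

    def energy(z):
        return np.sum(np.abs(z) ** 2) + 1e-12

    target = component(0, delta)
    moderate = component(delta, delta + L_m)
    final = component(delta + L_m, len(H))
    late = moderate + final

    def ratio_db(a, b):
        return 10 * np.log10(energy(a) / energy(b))

    return ratio_db(target, late), ratio_db(target, moderate), ratio_db(target, final)
```

With a unit impulse as dry signal, the ratios reduce to energy ratios between the corresponding ranges of the RIR itself.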
We complete the evaluation benchmark with Perceptual Objective Listening Quality Analysis (\(\textrm{POLQA}\))^{Footnote 1}, signal-to-distortion ratio (SDR), and signal-to-noise ratio (SNR) [41].
5 Experimental results and discussion
5.1 Compared algorithms
We apply the different strategies mentioned in Sections 2 and 3 and compare their results in Figs. 2 and 3 for the HA and CI scenarios of our simulated dataset respectively.
Spectrograms are also plotted in Fig. 4. We add to the already proposed approaches (mentioned in italics):

OPSDWPE: RLS-WPE using the oracle target PSD

DNN-PF: The output of the network \(\mathrm {MaskNet_{WPE}}\) is directly used for single-channel Wiener non-linear filtering, bypassing the WPE linear filter step

GaGNet [42]: A recent CNN-based network for hybrid magnitude and complex domain enhancement. GaGNet is the successor of [43] which was ranked first in the real-time enhancement track of the DNS2021 challenge [44]. We used the available open-source implementation^{Footnote 2} but adapted the number of frequency bins to be 257 as in our implementation
Some listening examples and spectrograms are available on our dedicated webpage^{Footnote 3}. We also include there a video recording of our proposed E2Ep-WPE+DNN-PF (HA) algorithm performing in real time in both static and moving speaker scenarios. The algorithm performs with a total latency of 40 ms, determined by the 32 ms algorithmic latency due to the STFT synthesis window length and the 8 ms processing time, which is contained within an STFT hop. We show that for reasonable speaker movements, the algorithm yields high performance also in the dynamic setting.
5.2 Moderate reverberation suppression
We first validate the method used for deriving the ELR, EMR and EFR metrics, described in Section 4.3. We plot the log-energies of the true RIR, the RIR estimated with (16) and the transfer function of the concatenation of the room with the OPSDWPE algorithm in Fig. 5. We observe that in the chosen \(\textrm{T}_{\textrm{30}}\) range, the true and estimated RIRs match almost perfectly, showing the validity of this MMSE-based estimation for linear transfer function estimation in this range. We also observe a strong dereverberation performance of the OPSDWPE algorithm in the filter range as well as shortly after this range, which is the effect of recursive averaging.
The ELR metric in Figs. 2 and 3 indicates a superior dereverberation performance of E2Ep-WPE in comparison to DNN-WPE, i.e., when the DNN \(\mathrm {MaskNet_{WPE}}\) is fine-tuned end-to-end. The high EMR difference indicates that the moderate reverberation in the range \([\tilde{\Delta }, \tilde{\Delta }+L_m-1]\) is particularly well suppressed. As already mentioned in [26], this stems from the better dereverberation performance in the range which is available to the WPE linear filter, through end-to-end optimization of the neural network \(\mathrm {MaskNet_{WPE}}\).
5.3 Residual reverberation suppression
As displayed in Figs. 2 and 3, using a DNN-assisted post-filtering stage greatly improves the dereverberation performance on the basis of WPE linear filtering, and yields much superior POLQA scores. The high EFR improvement indicates that post-filtering mostly focuses on removing the final reverberation, i.e., after the range accessible to WPE filtering. In particular, the E2Ep-WPE+DNN-PF approach, which uses a pretrained network for post-filtering on top of end-to-end trained WPE filtering, outperforms all other approaches on all metrics. In comparison, using only the post-filter without WPE filtering introduces a lot of speech distortion, as shown in Fig. 4. Similarly, the DNN-WPE+DNN-PF performance indicates that using the post-filtering stage on the output of the DNN-WPE algorithm, without fine-tuning \(\mathrm {MaskNet_{WPE}}\) with our end-to-end procedure, yields poorer results (final POLQA is 0.2 lower and SNR is \(1\textrm{dB}\) lower than E2Ep-WPE+DNN-PF). This shows that removing the moderate reverberation with WPE linear filtering is an essential step before using a speech enhancement scheme like our post-filter. Since E2Ep-WPE efficiently removes the moderate reverberation, as measured by EMR, it provides a particularly good ground for enhancement-like post-filtering: only the reverberation tail remains, which yields the best EFR and POLQA performance.
5.4 Reverberation times
For a given scenario, the dereverberation task becomes increasingly difficult as the \(\textrm{T}_{\textrm{60}}\) time grows longer. We observe for example that using the oracle PSD for WPE performs well only for low \(\textrm{T}_{\textrm{60}}\) reverberation times because of the limited filter length, and the performance gap between this approach and the proposed two-stage approach increases with the \(\textrm{T}_{\textrm{60}}\) reverberation time.
Furthermore, we notice an increasing gap in SNR and EFR between DNNWPE+DNNPF and E2EpWPE+DNNPF as the \(\textrm{T}_{\textrm{60}}\) grows larger, which seems to indicate that our best performing approach E2EpWPE+DNNPF is more robust to challenging reverberation conditions.
5.5 Hearing device users categories specialization
Similar trends in performance are observed for the hearing-aided and cochlear-implanted scenarios.
Dereverberation is a more complicated task in the CI scenario as compared to the HA scenario, as the input ELR and SDR scores are lower. Yet, the POLQA and SDR score improvements stay relatively consistent across both scenarios, highlighting the robustness of our approach. However, the EMR improvements seem larger in the HA scenario than in the CI scenario. Indeed, it is more arduous in the latter scenario to remove the beginning of what is considered to be the reverberant tail, as it includes parts of the early reflections, which are complicated to attenuate without degrading the direct path. This also accounts for the smaller EMR improvement of E2EpWPE over DNNWPE, as compared to the HA scenario. Furthermore, the SNR improvements are larger in the CI scenario than in the HA scenario, especially those brought by the proposed E2EpWPE+DNNPF approach, which shows that the postfiltering stage is in this case able to remove a lot of the residual reverberation.
5.6 Computational requirements
We estimate the number of MAC operations per second of the models using the python-papi Python package, which provides CPU counters for single- and double-precision floating-point operations. We obtain an estimate of 0.13 GMAC\(\cdot \textrm{s}^{-1}\) for our proposed E2EpWPE+DNNPF algorithm running at 16 kHz. With the same estimation method, the implemented GaGNet uses 0.81 GMAC\(\cdot \textrm{s}^{-1}\). Our method also has a lower memory budget, as GaGNet has 11.8M trainable parameters while our approach has 3.2M parameters.
Our method therefore outperforms GaGNet on the proposed dataset with a significantly smaller computational load, without any special fine-tuning of the hyperparameters or optimization of the architectures used.
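The figures above can be put in proportion with a few lines of arithmetic. The sketch below only reuses the numbers quoted in this section; the per-sample MAC count follows from the stated 16 kHz rate, while the storage estimate assumes 32-bit floating-point weights, which is an assumption and not something reported here.

```python
# Figures quoted in the text of this section
gmac_per_s = {"E2EpWPE+DNNPF": 0.13, "GaGNet": 0.81}
params_millions = {"E2EpWPE+DNNPF": 3.2, "GaGNet": 11.8}
sample_rate_hz = 16_000

def ratio(table, reference="GaGNet", proposed="E2EpWPE+DNNPF"):
    """How many times larger the reference figure is than the proposed one."""
    return table[reference] / table[proposed]

compute_ratio = ratio(gmac_per_s)
param_ratio = ratio(params_millions)

# MAC operations spent per input sample by the proposed system
macs_per_sample = gmac_per_s["E2EpWPE+DNNPF"] * 1e9 / sample_rate_hz

# Assumption: weights stored as 32-bit floats (4 bytes each)
weights_megabytes = params_millions["E2EpWPE+DNNPF"] * 1e6 * 4 / 1e6

print(f"compute ratio: {compute_ratio:.1f}x, parameter ratio: {param_ratio:.1f}x")
print(f"MACs per sample: {macs_per_sample:.0f}, weights: {weights_megabytes:.1f} MB")
```

In other words, GaGNet needs about 6.2 times as many MAC operations per second and about 3.7 times as many parameters as the proposed system.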
6 Conclusions
We have proposed a lightweight two-stage DNN-assisted algorithm for frame-online adaptive multichannel dereverberation on hearing devices. The first stage consists of multiframe, multichannel linear filtering with the help of a DNN estimating the target speech PSD, optimized end-to-end. This first stage was shown to focus on accurately removing moderate reverberation up to the given filter range, in our case \(120~\textrm{ms}\). The second stage performs channel-wise, single-frame nonlinear spectral enhancement with the help of a DNN estimating the target and interference PSDs. This second stage is able to efficiently remove the residual late reverberation left by the first stage.
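The structure of the two stages can be illustrated with a deliberately simplified sketch for a single frequency bin. It is not the proposed implementation: the prediction filter `G` is taken as given and fixed rather than adapted online from DNN-estimated PSDs, a plain Wiener-type gain with a floor is swapped in for the actual postfilter gain rule, and all function and parameter names are hypothetical.

```python
import numpy as np

def delayed_linear_prediction(Y, G, delay):
    """Stage 1 (sketch): multichannel, multiframe linear filtering.

    Y     : (T, D) complex STFT frames of one frequency bin, D microphones.
    G     : (K * D, D) complex prediction filter spanning K past frames.
    delay : prediction delay in frames; frames closer than this are never
            used, which protects the direct path and early reflections.
    """
    T, D = Y.shape
    K = G.shape[0] // D
    out = np.empty_like(Y)
    for t in range(T):
        # Stack the K delayed observation frames (zeros before the signal start)
        past = np.zeros(K * D, dtype=Y.dtype)
        for k in range(K):
            idx = t - delay - k
            if idx >= 0:
                past[k * D:(k + 1) * D] = Y[idx]
        # Subtract the reverberation predicted from the past frames
        out[t] = Y[t] - G.conj().T @ past
    return out

def wiener_postfilter(X, target_psd, interference_psd, gain_floor=0.1):
    """Stage 2 (sketch): channel-wise, single-frame spectral gain.

    A Wiener-type gain stands in for the actual gain rule; the PSDs would
    come from a DNN in the proposed system.
    """
    gain = target_psd / (target_psd + interference_psd + 1e-12)
    return np.maximum(gain, gain_floor) * X

# Toy example: one microphone, one filter tap, echo arriving 2 frames later
rng = np.random.default_rng(0)
T, delay = 50, 2
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)
y = np.zeros(T, dtype=complex)
for t in range(T):
    y[t] = s[t] + (0.5 * y[t - delay] if t >= delay else 0.0)

stage1 = delayed_linear_prediction(y[:, None], np.array([[0.5 + 0j]]), delay)
stage2 = wiener_postfilter(stage1, np.full((T, 1), 3.0), np.full((T, 1), 1.0))
```

In the toy example the echo lies entirely inside the filter range, so the first stage removes it on its own. The second stage matters for the energy that falls outside that range, which a finite-length filter cannot predict.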
Our model-based approach allows the two-stage algorithm to be tailored toward different classes of hearing-impaired listeners, namely hearing-impaired listeners who benefit from early reflections on the one hand, and cochlear-implanted users who benefit from the direct path only on the other hand.
Instrumental metrics like the early-to-late reverberation ratio and its variants confirm the listening-based experiments showing the complementary aspect of the two proposed stages.
The proposed approach outperforms a state-of-the-art DNN-based enhancement scheme on the proposed dataset, with a significantly smaller computational and memory footprint.
Availability of data and materials
The data that support the findings of this study are available from the Linguistic Data Consortium, but restrictions apply to the availability of these data, which were used under license for the current study and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission of the Linguistic Data Consortium.
Notes
Wideband MOS score, following standard ITU-T P.863. The authors would like to thank Rohde & Schwarz SwissQual AG for their support with POLQA.
Abbreviations
DNN: Deep neural network
WPE: Weighted prediction error
PSD: Power spectral density
SNR: Signal-to-noise ratio
RLS: Recursive least squares
ASR: Automatic speech recognition
HA: Hearing aid
CI: Cochlear implant
DRR: Direct-to-reverberant ratio
References
P. Naylor, N. Gaubitch, Speech dereverberation. Noise Control. Eng. J. 59, 13 (2011)
E. Habets, Single- and multi-microphone speech dereverberation using spectral enhancement. Ph.D. thesis (2007)
A. Kuklasiński, S. Doclo, T. Gerkmann, S. Holdt Jensen, J. Jensen, in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), Multichannel PSD estimators for speech dereverberation - a theoretical and experimental comparison. IEEE. Brisbane, Australia (2015)
B. Cauchi, I. Kodrasi, R. Rehr, S. Gerlach, A. Jukic, T. Gerkmann, S. Doclo, S. Goetze, Combination of MVDR beamforming and single-channel spectral processing for enhancing noisy and reverberant speech. EURASIP J. Adv. Sig. Proc. 2015, 61 (2015)
A. Schwarz, W. Kellermann, Coherent-to-diffuse power ratio estimation for dereverberation. IEEE/ACM Trans. Audio, Speech, Lang. Proc. 23(6), 1006–1018 (2015)
T. Gerkmann, in Proc. Euro. Signal Proc. Conf. (EUSIPCO), Cepstral weighting for speech dereverberation without musical noise. EURASIP. Barcelona, Spain (2011)
T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B. Juang, in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), Blind speech dereverberation with multichannel linear prediction based on short time Fourier transform representation. IEEE. Las Vegas, Nevada (2008)
A. Jukić, T. van Waterschoot, T. Gerkmann, S. Doclo, Multichannel linear prediction-based speech dereverberation with sparse priors. IEEE/ACM Trans. Audio, Speech, Lang. Proc. 23(9), 2328 (2015)
J.S. Bradley, H. Sato, M. Picard, On the importance of early reflections for speech in rooms. J. Acoust. Soc. Am. 113(6), 3233–3244 (2003)
T. Yoshioka, T. Nakatani, M. Miyoshi, H.G. Okuno, Blind separation and dereverberation of speech mixtures by joint optimization. IEEE Trans. Audio, Speech, Lang. Proc. 19(1), 69–84 (2011)
H. Kagami, H. Kameoka, M. Yukawa, in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), Joint separation and dereverberation of reverberant mixtures with determined multichannel nonnegative matrix factorization. IEEE. Calgary, Canada (2018)
K. Kinoshita, M. Delcroix, H. Kwon, T. Mori, T. Nakatani, in ISCA Interspeech, Neural network-based spectrum estimation for online WPE dereverberation. ISCA. Stockholm, Sweden (2017)
Z.Q. Wang, G. Wichern, J.L. Roux, Convolutive prediction for monaural speech dereverberation and noisy-reverberant speaker separation. IEEE/ACM Trans. Audio, Speech, Lang. Proc. 29, 3476–3490 (2021)
K. Han, Y. Wang, D. Wang, W.S. Woods, I. Merks, T. Zhang, Learning spectral mapping for speech dereverberation and denoising. IEEE/ACM Trans. Audio, Speech, Lang. Proc. 23(6), 982–992 (2015)
D.S. Williamson, D. Wang, Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Trans. Audio, Speech, Lang. Proc. 25(7), 489–492 (2017)
A. Li, C. Zheng, L. Zhang, X. Li, Glance and gaze: a collaborative learning framework for single-channel speech enhancement. Appl. Acoust. 187, 108499 (2022)
Y. Luo, N. Mesgarani, in ISCA Interspeech, Real-time single-channel dereverberation and separation with time-domain audio separation network. ISCA. Hyderabad, India (2018)
B. Schwartz, S. Gannot, E.A.P. Habets, Online speech dereverberation using Kalman filter and EM algorithm. IEEE/ACM Trans. Audio, Speech, Lang. Proc. 23(2), 394–406 (2015)
S. Braun, E.A.P. Habets, Online dereverberation for dynamic scenarios using a Kalman filter with an autoregressive model. IEEE Sig. Proc. Lett. 23(12), 1741–1745 (2016)
T. Yoshioka, H. Tachibana, T. Nakatani, M. Miyoshi, in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), Adaptive dereverberation of speech signals with speaker-position change detection. IEEE. Taipei, Taiwan (2009)
J. Heymann, L. Drude, R. Haeb-Umbach, K. Kinoshita, T. Nakatani, in International Workshop on Acoustic Signal Enhancement, Frame-online DNN-WPE dereverberation. IEEE. Tokyo, Japan (2018)
X. Chang, W. Zhang, Y. Qian, J.L. Roux, S. Watanabe, in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), MIMO-speech: end-to-end multichannel multi-speaker speech recognition. IEEE. Sentosa, Singapore (2019)
T. Ochiai, S. Watanabe, T. Hori, J.R. Hershey, X. Xiao, Unified architecture for multichannel end-to-end speech recognition with neural beamforming. IEEE J. Sel. Top. Sig. Proc. 11(8), 1274–1288 (2017)
W. Zhang, C. Boeddeker, S. Watanabe, T. Nakatani, M. Delcroix, K. Kinoshita, T. Ochiai, N. Kamo, R. Haeb-Umbach, Y. Qian, in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), End-to-end dereverberation, beamforming, and speech recognition with improved numerical stability and advanced frontend. IEEE. Toronto, Canada (2021)
J. Heymann, L. Drude, R. Haeb-Umbach, K. Kinoshita, T. Nakatani, in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), Joint optimization of neural network-based WPE dereverberation and acoustic model for robust online ASR. IEEE. Brighton, United Kingdom (2019)
J.M. Lemercier, J. Thiemann, R. Koning, T. Gerkmann, in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), Customizable end-to-end optimization of online neural network-supported dereverberation for hearing devices. IEEE. Singapore, Singapore (2022)
Y. Hu, K. Kokkinakis, Effects of early and late reflections on intelligibility of reverberated speech by cochlear implant listeners. J. Acoust. Soc. Am. 135, 22–8 (2014)
Z.Q. Wang, D. Wang, Deep learning based target cancellation for speech dereverberation. IEEE/ACM Trans. Audio, Speech, Lang. Process. 28, 941–950 (2020)
L. Drude, C. Böddeker, J. Heymann, R. Haeb-Umbach, K. Kinoshita, M. Delcroix, T. Nakatani, in ISCA Interspeech, Integrating neural network based beamforming and weighted prediction error dereverberation. ISCA. Hyderabad, India (2018)
M. Maciejewski, G. Wichern, E. McQuinn, J.L. Roux, in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), WHAMR!: Noisy and reverberant single-channel speech separation. IEEE. Barcelona, Spain (2020)
I. Cohen, Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator. IEEE Sig. Process. Lett. 9(4), 113–116 (2002)
T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B.H. Juang, Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Trans. Audio, Speech, Lang. Proc. 18(7), 1717–1731 (2010)
Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Audio, Speech, Lang. Proc. 33(2), 443–445 (1985)
C. Breithaupt, M. Krawczyk, R. Martin, in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), Parameterized MMSE spectral magnitude estimation for the enhancement of noisy speech. IEEE. Las Vegas, Nevada (2008)
O. Ernst, S.E. Chazan, S. Gannot, J. Goldberger, in Proc. Euro. Signal Proc. Conf. (EUSIPCO), Speech dereverberation using fully convolutional networks. EURASIP. Coruna, Spain (2019)
D.B. Paul, J.M. Baker, in Proceedings of the Workshop on Speech and Natural Language, The design for the Wall Street Journal-based CSR corpus. ACL. Harriman, New York (1992)
T. Wendt, S. Van De Par, S.D. Ewert, A computationally-efficient and perceptually-plausible algorithm for binaural room impulse response simulation. J. Audio Eng. Soc. 62(11), 748–766 (2014)
J. Thiemann, S. van de Par, A multiple model high-resolution head-related impulse response database for aided and unaided ears. EURASIP J. Adv. Sig. Proc. 2019, 9 (2019)
H. Kuttruff, Room acoustics. (CRC Press, 2016)
G. Carbajal, R. Serizel, E. Vincent, E. Humbert, Joint NN-supported multichannel reduction of acoustic echo, reverberation and noise. IEEE/ACM Trans. Audio, Speech, Lang. Proc. 28, 2158–2173 (2020)
E. Vincent, R. Gribonval, C. Fevotte, Performance measurement in blind audio source separation. IEEE/ACM Trans. Audio, Speech, Lang. Proc. 14(4), 1462–1469 (2006)
A. Li, W. Liu, X. Luo, G. Yu, C. Zheng, X. Li, in ISCA Interspeech, A simultaneous denoising and dereverberation framework with target decoupling. ISCA. Brno, Czech Republic (2021)
A. Li, W. Liu, X. Luo, C. Zheng, X. Li, in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), Decoupling magnitude and phase optimization with a two-stage deep network. IEEE. Toronto, Canada (2021)
C.K.A. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, S. Srinivasan, in IEEE Int. Conf. Acoustics, Speech, Signal Proc. (ICASSP), ICASSP 2021 Deep Noise Suppression challenge. IEEE. Toronto, Canada (2021)
Acknowledgements
Not applicable.
Funding
Open Access funding enabled and organized by Projekt DEAL. This work has been funded by the Federal Ministry for Economic Affairs and Climate Action, project 01MK20012S, AP380. The authors are responsible for the content of this paper.
Contributions
All authors listed have contributed significantly to this work. The authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lemercier, JM., Thiemann, J., Koning, R. et al. A neural network-supported two-stage algorithm for lightweight dereverberation on hearing devices. J AUDIO SPEECH MUSIC PROC. 2023, 18 (2023). https://doi.org/10.1186/s13636-023-00285-8
DOI: https://doi.org/10.1186/s13636-023-00285-8
Keywords
 Dereverberation
 Neural network
 End-to-end learning
 Hearing devices