Multichannel speaker interference reduction using frequency domain adaptive filtering

Microphone leakage or crosstalk is a common problem in multichannel close-talk audio recordings (e.g., meetings or live music performances), which occurs when a target signal not only couples into its dedicated microphone, but also into all other microphone channels. For further signal processing such as automatic transcription of a meeting, a multichannel speaker interference reduction is required in order to eliminate the interfering speech signals in the microphone channels. The contribution of this paper is twofold: First, we consider multichannel close-talk recordings of a three-person meeting scenario with various crosstalk levels. In order to eliminate the crosstalk in the target microphone channel, we extend a multichannel Wiener filter approach, which considers all individual microphone channels. To this end, we integrate an adaptive filter method, which was originally proposed for acoustic echo cancellation (AEC), in order to obtain a well-performing interferer (noise) component estimation. This results in a speech-to-interferer ratio improved by up to 2.7 dB at constant or even better speech component quality. Second, since an AEC method typically requires clean reference channels, we investigate and report findings on why the AEC algorithm is able to successfully estimate the interfering signals and the room impulse responses between the microphones of the interferer and the target speakers, even though the reference signals are themselves disturbed by crosstalk in the considered meeting scenario.


Introduction
Meetings, here considered as face-to-face conversations of at least two persons in a meeting room (cf. Fig. 1), are among the most natural ways for humans to communicate with each other. Investigations of social behavior or interaction forms in such meetings have a long research history in psychology [1,2] and have also become a vital research topic in computer science. One reason is that an automatic meeting analysis facilitates psychological studies; moreover, the authors of [8] take the view that "nearly every problem in spoken language recognition (and understanding) can be explored in the context of meetings." Putting the focus on the automatic analysis of audio and speech signals, a lot of research based on audio recordings of meetings has already been carried out in the last two decades [9][10][11][12][13]. Thereby, there are three typical methods for data acquisition: a single table-top microphone, a microphone array, or, as illustrated in Fig. 1, personalized close-talk microphones (e.g., headsets or lapel microphones) [13,14]. In order to process and analyze recorded interactions, first of all, isolated speech of all participants is needed. For this purpose, it is obvious that recordings with a single microphone or a microphone array require blind source separation (BSS) approaches, which have received great attention in a variety of applications over many years [15,16].
Since we deal with interactive meetings and project work, in which the participants use the entire room including workstations, flip charts, and a meeting desk such as in [17], a single table-top microphone or a microphone array would not be able to deliver sufficient speech quality. Therefore, our research is based on multichannel close-talk recordings, which offer, especially in this case, the best audio quality by recording the individual speech signals robustly with a suitable signal level. Furthermore, distorting room characteristics are mostly negligible. However, even in this case, the target speaker channel is disturbed by speech portions of the interfering speakers, which couple into the target microphone with a non-negligible level. This effect is known as crosstalk [8,[18][19][20] or microphone leakage [20,21] and requires a multichannel speaker interference reduction (MSIR) in order to obtain the desired isolated speech of each person. This issue is aggravated when considering the rate of multi-talk (e.g., double-talk) situations, which occur up to 14% of the time in a professional meeting [22,23] and can easily exceed 20% in an informal get-together [24]. Since these statistical values are too high to be ignored, and since multi-talk can dramatically decrease the performance of subsequent applications such as automatic speech recognition systems [22,23], eliminating or rather separating multi-talk situations is one of the key challenges in signal preprocessing for analyzing a meeting.
For the purpose of eliminating crosstalk signals in a microphone channel of interest, we only have access to the microphone channels of all other persons, which, however, are disturbed by crosstalk as well. Even worse, the unknown room impulse responses (RIRs) from the interfering speakers to the target speaker's microphone affect the interfering signals, so that the crosstalk signals recorded by the target speaker's microphone differ to quite some extent from the recordings of the interfering persons' microphones. One possible solution for this issue is again offered by BSS approaches, and there are decades of research dealing with multichannel recordings of dedicated microphones [25][26][27][28][29], but their main focus is not on close-talk scenarios.
Another field which is very familiar with close-talk recordings and the effect of microphone leakage is signal processing for (live) music performances [21,30], where different instruments, each recorded with at least one microphone, are played at the same time. It is a big advantage for the final mix if the sound engineer has access to the undisturbed microphone signals of each instrument in order to apply reverberation or equalization. Furthermore, microphone leakage can lead to unwanted artifacts such as the comb filtering effect, which occurs when mixing a signal and a delayed version of the same signal together [31]. Using directional microphones can decrease the leakage effect, but does not completely eliminate it and, even worse, can cause other artifacts such as the proximity effect [32], which describes an energy increase at low frequencies. Directional microphones are, in addition, sensitive to microphone orientation, which is why omnidirectional microphones are still a good choice for robust and clean recordings.
In order to reduce the interfering signals in each microphone channel, several approaches have been published in recent years: adaptive filtering in the time and frequency domain [30,[33][34][35] are popular solutions to estimate the room impulse response from the interferer source microphone to the target microphone. These methods often use additional information such as a speaker activity detection or a signal-to-interferer ratio (SIR) for controlling the adaptation process. Other approaches propose nonnegative signal factorization [36], kernel additive modeling [20], or a Gaussian probabilistic framework [37]. Thereby, [20,30,34,35] use iterative or cascading schemes of individual filters for each channel to obtain crosstalk-free interferer channels for further processing. Kokkinis et al. [21] tackled this problem by a multichannel Wiener filter (MWF), thereby interpreting the problem as a noise suppression task, thus taking a big step forward by ignoring the explicit room characteristics and capturing the energy of the interfering signals by a simple gain factor. This is primarily motivated by two facts: music productions commonly use sampling rates of at least 44.1 kHz, and they deal with reverberation times, especially in live sound, that can easily exceed 1 s. In combination, this can lead to an extremely high number of filter coefficients for estimating the RIRs, which implies a slow convergence behavior and high computational costs. However, calculation power increases steadily, and w.r.t. a meeting, we typically deal with a 16 kHz sampling rate and RIR lengths of around 250 ms in a common-sized meeting room [28,38].
We already proposed a multichannel Kalman-based Wiener filter (MKWF) method [39], which is an extension of the MWF approach of Kokkinis et al. [21], taking into account the characteristics of the RIRs between the microphones of the persons in our meeting scenario. Thereby, we improved the SIR at constant speech component quality during triple-talk. Since Buchner et al. [40] understand acoustic echo compensation (AEC) with clean reference signals as a special case of BSS, we applied an adaptive filter to estimate the RIRs, similar to [30,33]. We used a multichannel AEC (MAEC) method from Enzner et al. [41], which was developed for a hands-free teleconferencing system to estimate the RIRs from the loudspeakers to the microphone on the basis of clean loudspeaker reference signals. Surprisingly, the MAEC RIR estimation performed well without any preprocessing steps to enhance the disturbed microphone channels. Considering the fact that the MAEC is based on the assumption that the reference signals are clean, some questions remained open.
In this paper, we further enhance the MKWF and precisely analyze why the MAEC RIR estimation method of Enzner et al. [41] performs well in the considered meeting scenario with crosstalk-disturbed reference channels. First, we briefly recap and extend our proposed MKWF method [39] by a control strategy for the case of interferer speech pauses (ISPs), in which the RIR estimation is lost due to a missing excitation signal. Afterwards, we compare the improved MKWF with the MWF [21] and the MAEC [41] in a more realistic and challenging meeting scenario with multiple RIR changes. Subsequently, we investigate the performance of the MAEC RIR estimation compared to an oracle MAEC, which has access to clean reference signals, and point out the main performance differences. Finally, we elaborate why the MAEC RIR estimation can be successfully applied in a meeting scenario with leaky microphone channels.
The outline of this work is as follows: We introduce the considered meeting scenario with some important notations and the problem formulation in Section 2. While Section 3 contains the algorithmic descriptions of the MWF and MAEC baseline approaches, we define our extended MKWF method in Section 4. A comparison of the baselines and the MKWF is carried out in Section 5, which is followed by a more detailed analysis of the MAEC with focus on RIR estimation in Section 6 to answer the question why the MAEC is able to work despite leaky reference channels. The paper is concluded with some remarks in Section 7.

Scenario model and data acquisition
First of all, we formulate the problem regarding the considered meeting scenario with some notations and describe the data acquisition process and preparation for the later experiments.

Considered meeting scenario and problem formulation
We consider a meeting scenario of three persons (P1, P2, P3) sitting at a table and talking to each other as depicted in Fig. 1. Note that most interesting algorithmic aspects can already be investigated with three persons. In order to analyze or transcribe the course of a meeting's conversation, all persons are equipped with a (wireless) headset for two reasons: on the one hand, a microphone channel with good close-talk audio quality is obtained for each person; on the other hand, the participants are free to stand up, go to the flip chart and workstations, or even walk around, still allowing for high-quality sound acquisition, which is hardly possible in this case with a single table-top microphone or a microphone array. However, for simulations, we consider only fixed positions of the persons at the table, which is already a challenging scenario in terms of interfering speaker levels. Furthermore, we assume the headsets to have an omnidirectional microphone characteristic, supporting robustness w.r.t. the acquisition of the target speaker's voice. However, not only the speech of the target speaker is recorded, but also the speech portions of all other persons. Depending on the position and the loudness level of the interfering speakers, low- and high-level crosstalk may occur. This undesired effect is well known from audio recordings or the mixing of live sounds and is designated as microphone leakage.
In the following, we denote m ∈ M = {1, . . . , M} as the specific target speaker currently being focused on, and μ ∈ I = {1, . . . , M} \ {m} as an interfering speaker. As shown in Fig. 2, each microphone channel y_m(n) of the associated target speaker m is modeled as

y_m(n) = s_m(n) * h_m,m(n) + Σ_{μ∈I} s_μ(n) * h_m,μ(n) + n_m(n),   (1)

with s_m(n) being the speech signal of speaker m, h_m,μ(n) the RIR from the mouth of speaker μ to the microphone of speaker m, * denoting convolution, and n_m(n) being sensor noise.

Data acquisition and signal processing
A detailed evaluation of applied MSIR methods requires the use of objective measures. Therefore, it is necessary to have access to each individual signal component in (1), more specifically s̃_m(n) = s_m(n) * h_m,m(n) and d_m,μ(n) = s_μ(n) * h_m,μ(n). In order to be able to generate realistic microphone signals of a meeting scenario under these conditions, in line with ITU-T P.1110 [42] and P.1130 [43], we record real RIRs in a typical meeting room, which are then used to generate the individual signal components of microphone channel y_m(n) by means of various speech samples.
To allow the acquisition of RIRs containing the typical characteristics of the direct path and the early reflections in a realistic meeting scenario, the participants of the considered meeting are represented by head-and-torso simulators (HATSs), which are equipped with a headset and placed around a table. Thus, we can measure the RIRs from the mouth reference point (MRP) [44] (cf. Fig. 2) of each HATS to the headset microphone of all HATSs.
Due to the considered close-talk scenario, the MRP and the headset microphone of the target speaker are located at almost the same place. Hence, we assume h_m,m(n) = δ(n) in order to simplify the acoustic channel model w.r.t. the investigated conversational group interaction, requiring only the RIRs from each speaker to all other persons' microphones to be measured. For this purpose, one Yamaha HS80M studio monitor loudspeaker, representing the MRP, and two HEAD acoustics HMS II.6 HATSs were employed as acoustic source and sinks, respectively. While later on in practice the participants shall wear a wireless headset, for data recording and the purpose of this work, we acquired the audio signals of each speaker (HATS) by a wired omnidirectional close-talk Beyerdynamic MM1 measurement microphone placed at the position of a typical headset microphone in order to exclude transmission effects.
We recorded two sets of RIRs, later denoted as RIR set I and RIR set II, in the already described three-person scenario in a meeting room of size 6.6 m × 5.75 m × 2.5 m (length × width × height) according to [45]. As excitation, a linear sweep signal (48 kHz sampling rate, 32 bit) from 0.01 Hz to 24 kHz with a length of 10 s was used. The excitation was played back with the studio monitor placed at the position of each speaker and recorded at each of the other speakers' headset positions by means of the Beyerdynamic MM1 microphone. Afterwards, the RIRs were determined with the aid of an electrical reference signal, which was recorded once for all measurements, by a linear deconvolution in the frequency domain [45] and finally downsampled to 16 kHz. The T_60 reverberation times of the measured RIRs are on average 0.24 s, which is in line with [28,38] regarding a common meeting room. As a result, the RIR signals were cut off after 4000 samples (0.25 s) for our experiments.
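The frequency-domain linear deconvolution used for the RIR measurement can be sketched as follows. This is an illustrative reconstruction, not the exact processing of [45]: the function name, the zero-padding length, and the regularization constant `eps` are our own assumptions.

```python
import numpy as np

def estimate_rir(reference, recorded, n_taps, eps=1e-12):
    """Regularized linear deconvolution in the frequency domain (sketch).

    Zero-padding to len(reference) + len(recorded) turns the circular
    division into a linear one; eps avoids division by near-zero bins.
    """
    n = len(reference) + len(recorded)
    R = np.fft.rfft(reference, n)
    Y = np.fft.rfft(recorded, n)
    H = Y * np.conj(R) / (np.abs(R) ** 2 + eps)  # regularized spectral division
    h = np.fft.irfft(H, n)
    return h[:n_taps]  # truncate to the desired RIR length (e.g., 4000 taps)
```

In the measurement above, `reference` would be the recorded electrical reference signal and `recorded` the microphone signal at a headset position; the result is then cut to 4000 taps at 16 kHz.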
By using the measured RIRs, we are able to simulate any desired dialog between the considered three persons as defined in (1). The simulation structure of all signals for the later experiments according to the acoustic channel model in Fig. 2 is illustrated in detail in Fig. 3. Due to the assumption that h_m,m(n) = δ(n), the speech signal s_m(n) of the target speaker of microphone channel y_m(n) is not convolved with any RIR. In order to obtain s_m(n), it is scaled with α_s,m to an active speech level (ASL) of −26 dBov in accordance with ITU-T P.56 [46]. In addition, two interferer signals s_μ(n) are convolved with h_m,μ(n) and also adjusted to −26 dBov ASL by α_s,μ. Afterwards, the two interferer signals are superimposed and jointly scaled with α_d to the desired crosstalk level. Finally, s_m(n) and d_m(n) are superimposed, and a white Gaussian noise floor n_m(n), adjusted to −75 dBov (using α_n), is added to the microphone channel to simulate sensor noise. The explicit signal mixtures of the respective experiments follow in the corresponding sections.
Fig. 3 Simulation diagram of the close-talk microphone signal y_1(n), employing input speech data, measured room impulse responses (RIRs), and assuming h_m,m(n) = δ(n). For computation of microphone signals y_2(n) and y_3(n), the inputs are to be changed accordingly
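A minimal sketch of this mixing procedure might look as follows. For simplicity, the level alignment here uses the plain RMS rather than the active speech level of ITU-T P.56, and all function names and the noise seed are illustrative assumptions; only the dBov targets are taken from the text.

```python
import numpy as np

def scale_to_dbov(x, level_dbov):
    # Simplified: RMS-based scaling; ITU-T P.56 measures the *active* speech level.
    rms = np.sqrt(np.mean(x ** 2))
    return x * (10 ** (level_dbov / 20.0) / rms)

def mix_channel(s_m, interferers, rirs, crosstalk_dbov=-36.0,
                noise_dbov=-75.0, seed=0):
    """Build one microphone channel y_m(n) per Fig. 3 (sketch)."""
    s_m = scale_to_dbov(s_m, -26.0)              # target: h_mm(n) = delta(n)
    d = np.zeros_like(s_m)
    for s_mu, h in zip(interferers, rirs):       # convolve each interferer
        d += np.convolve(scale_to_dbov(s_mu, -26.0), h)[:len(s_m)]
    d = scale_to_dbov(d, crosstalk_dbov)         # joint crosstalk scaling (alpha_d)
    noise = scale_to_dbov(
        np.random.default_rng(seed).standard_normal(len(s_m)), noise_dbov)
    return s_m + d + noise, s_m, d               # mic signal plus oracle components
```

Returning the oracle components `s_m` and `d` separately is what later enables the objective evaluation described in Section 5.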

Baseline approaches
In this section, we present the multichannel Wiener filter (MWF) approach according to [21] and the multichannel acoustic echo cancellation (MAEC) method adopted from [41], applied to the considered meeting scenario. Some mathematical detail is important here, since our proposed approach will later refer to it.

Multichannel Wiener filter (MWF)
Kokkinis et al. [21] consider microphone leakage effects in music recordings with several instruments, each assigned to a microphone close to the instrument, resulting in (1). Under the assumption that each microphone captures primarily the audio signal of the assigned audio source and only to some lower extent the interferer sources, a Wiener filter W_m(ℓ,k) considering all interfering signals is applied in the discrete Fourier transform (DFT) domain to each microphone channel m in order to reduce the interferer signals according to

Ŝ_m(ℓ,k) = W_m(ℓ,k) · Y_m(ℓ,k).   (2)

Here, ℓ denotes the frame index and k the frequency bin index. Assuming statistical independence of target speech and interferer signals, the Wiener filter is modeled by

W_m(ℓ,k) = Φ̂_SS,m(ℓ,k) / ( Φ̂_SS,m(ℓ,k) + Σ_{μ∈I} Φ̂_DD,m,μ(ℓ,k) )   (3)

with the estimated power spectral densities (PSDs) Φ̂_SS,m(ℓ,k) and Φ̂_DD,m,μ(ℓ,k) of the target speaker's speech signal and the interferer signals, respectively. Since these signals are not accessible, their PSDs have to be estimated.
An overview of the MWF method is illustrated in Fig. 4. The input signals y_m(n) constitute the continuation of the output signals of Fig. 2 and are transformed into the DFT domain by using an overlap-add (OLA) structure. The PSD estimation block delivers the update of the Wiener filter coefficients (3) and is utilized for the calculation of both the target speaker PSD and the interferer PSDs, whereby each has its own control unit in form of the forgetting factor and the solo detection block, respectively, which will be explained later in this section. In the following, the estimation of both the target speaker PSD and the interferer PSDs is described in accordance with [21].
Fig. 4 Overview of the MWF method; a buffer block denotes the buffering of the previous K−R samples. With the aid of a PSD estimation, a Wiener filter W_1(ℓ,k) is adapted to reduce the interferer signals in microphone channel y_1(n). As indicated by the three layers, microphone channels y_2(n) and y_3(n) are enhanced by changing the input signals accordingly
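The per-bin Wiener gain and its application to a DFT frame reduce to a few lines. The sketch below assumes the PSDs are already estimated; the small regularization constant `eps` is our own addition:

```python
import numpy as np

def wiener_gain(psd_s, psd_d_list, eps=1e-12):
    """Per-bin Wiener gain W = phi_SS / (phi_SS + sum over interferers of phi_DD)."""
    psd_d = np.sum(psd_d_list, axis=0)        # sum PSDs over interferers mu
    return psd_s / (psd_s + psd_d + eps)      # eps avoids division by zero

def apply_mwf(Y_frame, psd_s, psd_d_list):
    """Apply the gain to one DFT frame: S_hat(l, k) = W(l, k) * Y(l, k)."""
    return wiener_gain(psd_s, psd_d_list) * Y_frame
```

A bin dominated by the target (interferer PSD near zero) gets a gain near 1, while a bin dominated by crosstalk is strongly attenuated.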

Estimation of the target speaker PSD Φ̂_SS,m
It is assumed that PSD bins k without an influence of an interferer signal show almost equivalence between the PSDs of the input microphone signal Φ̂_YY,m(ℓ,k) and the enhanced output signal Φ̂_ŜŜ,m(ℓ,k) [21], which are obtained by squaring the absolute values of Y_m(ℓ,k) and Ŝ_m(ℓ−1,k), respectively. We obtain these dominant frequency bins k ∈ K_dom,m(ℓ) by first determining the active frequency bins k ∈ K_act(ℓ) of each PSD, which are calculated for channel m by thresholding each bin against the root-mean-squared frame energies E_Y,m(ℓ) and E_Ŝ,m(ℓ−1), respectively. The dominant bins are then identified as those which are active in both PSDs. By means of K_dom,m(ℓ), a binary mask can be defined, and thus both a dominant PSD component and a residual PSD component are determined. The parameters δ_dom, δ_res,Y, and δ_res,Ŝ take on values between 0 and 1 with δ_dom + δ_res,Y + δ_res,Ŝ = 1. Finally, the PSD of the target speaker results from a weighted sum of these components, recursively smoothed with γ_m(ℓ) being an energy-adaptive forgetting factor considering the time-varying energy of all microphone channels. It ensures that the PSD estimation only proceeds if the target microphone channel has more energy than the others, and hence there is high confidence of hardly any interfering signals (Fig. 4, block Forgetting Factor).
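The dominant-bin selection can be illustrated as follows. The exact thresholds of [21] differ; the RMS-based activity threshold `thr` used here is purely an assumption for illustration:

```python
import numpy as np

def dominant_bins(psd_y, psd_s_hat, thr=0.1):
    """Sketch of dominant-bin selection: a bin is 'active' in a PSD if it
    exceeds a fraction of that PSD's RMS energy; bins active in BOTH the
    microphone PSD and the enhanced-output PSD count as dominant."""
    act_y = psd_y > thr * np.sqrt(np.mean(psd_y ** 2))
    act_s = psd_s_hat > thr * np.sqrt(np.mean(psd_s_hat ** 2))
    return act_y & act_s  # binary mask over frequency bins k
```

The resulting mask splits the microphone PSD into a dominant and a residual component, which are then recombined with the δ weights.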

Estimation of the interferer PSDs Φ̂_DD,m
The estimation of Φ̂_DD,m,μ(ℓ,k) is formulated on the basis of the interferer source s_μ(n), μ ∈ I, which is once convolved with h_μ,μ(n) to provide s̃_μ(n), and once with h_m,μ(n) to provide the interferer signal d_m,μ(n) (cf. Fig. 2). Hence, by ignoring further interferences, it is assumed for reasons of simplification that the corresponding PSDs Φ̂_SS,μ(ℓ,k) and Φ̂_DD,m,μ(ℓ,k) only differ by a time-variant but full-band factor α_m,μ(ℓ). Thereby, α_m,μ(ℓ) is only updated in solo intervals (single-talk) of the considered interfering speaker. The solo parts (Fig. 4, block Solo Detection), indicated by ν_μ(ℓ), are detected by an energy function based on a sigmoid function as well as E_Y,m(ℓ) and E_Y,μ(ℓ). For further details, please refer to [21]. With the aid of α_m,μ(ℓ), the interferer PSD is estimated as

Φ̂_DD,m,μ(ℓ,k) = α_m,μ(ℓ) · Φ̂_YY,μ(ℓ,k).   (10)
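A sketch of the full-band leakage factor update during solo intervals is given below; the recursive smoothing constant `forget` is our assumption, not a value from [21]:

```python
import numpy as np

def update_alpha(alpha_prev, psd_mic_mu, psd_mic_m, solo_active, forget=0.9):
    """Full-band factor alpha_{m,mu} (sketch): during solo talk of interferer
    mu, the ratio of frame energies in the target microphone m and in the
    interferer's own microphone mu is recursively smoothed; otherwise the
    previous value is kept frozen."""
    if not solo_active:
        return alpha_prev
    ratio = np.sum(psd_mic_m) / (np.sum(psd_mic_mu) + 1e-12)
    return forget * alpha_prev + (1.0 - forget) * ratio

# Interferer PSD estimate per bin would then be: alpha * psd_mic_mu
```

During solo talk of interferer μ, the target microphone contains (almost) only leakage, so this ratio directly reflects the full-band attenuation of the crosstalk path.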

Multichannel acoustic echo cancellation (MAEC)
The frequency domain adaptive filtering-based MAEC approach by Malik and Enzner [41] was originally intended for full-duplex hands-free telephony. The basic idea of the MAEC directly applied to our speaker interference scenario is depicted in Fig. 5. Note that Fig. 5 can be seen as a continuation of Fig. 2, with the output microphone signals y_m(n) of Fig. 2 being the input signals to Fig. 5. Note further that, different from a classical MAEC scenario as assumed in [41], the reference signals y_2(n) and y_3(n) in our scenario are themselves distorted by each other and, even worse, by the target speech s_1(n) coupling into y_2(n) and y_3(n) as depicted in Fig. 3. Furthermore, each channel m ∈ M has to be enhanced and is independently processed by a basic MAEC approach, depicted by the M = 3 instances in Fig. 5.
In the following, we adapt the basic MAEC approach to our meeting scenario in accordance with [47,48]. The MAEC is implemented in an overlap-save (OLS) structure and is based on a Kalman filter consisting of an alternating prediction and correction step. After windowing with frame length K and frame shift R, we obtain frames of each interfering microphone channel y_μ(n) (i.e., of the reference signals), frame-wise packed in K × 1 vectors y_μ(ℓ), with frame index ℓ. The first frame is headed by K−R zeros, followed by the first R samples of the respective microphone signal. Each frame is transformed into the frequency domain and shaped into a main diagonal matrix, to allow for a convenient notation, by

Y_μ(ℓ) = diag{ F_K×K y_μ(ℓ) }

with F_K×K being the K-point DFT matrix and μ ∈ I. In contrast, the R × 1 target microphone channel frame y_m(ℓ) is processed with the K × R overlap-save projection matrix

Q = [ 0_{R×(K−R)}, I_{R×R} ]^T,

whereby 0 and I denote a zero and an identity matrix, respectively. The subsequent steps are done for each speaker m ∈ M, i.e., M instances of the MAEC are in operation.
Fig. 5 Basic MAEC approach applied to the considered meeting scenario; one block shows the elimination of K−R samples w.r.t. the OLS constraint, another depicts the replacement of K−R samples by zeros. The RIRs H_1,2(ℓ,k) and H_1,3(ℓ,k) of interferer signals S_2(ℓ,k) and S_3(ℓ,k) are estimated (Ĥ_1,2(ℓ,k), Ĥ_1,3(ℓ,k)). Afterwards, microphone signals Y_2(ℓ,k) and Y_3(ℓ,k) are multiplied with Ĥ_1,2(ℓ,k) and Ĥ_1,3(ℓ,k), respectively, and are then subtracted from the microphone signal Y_1(ℓ,k), resulting in Ŝ_1(ℓ,k). By means of the error signal E_1(ℓ,k), the estimation of the RIRs is adapted. To obtain Ŝ_2(ℓ,k) and Ŝ_3(ℓ,k), the inputs are changed accordingly
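The OLS framing described above (each frame keeps the last K−R samples of its predecessor and appends R new samples, the first frame being headed by K−R zeros) can be sketched as:

```python
import numpy as np

def ols_frames(x, K, R):
    """Overlap-save framing (sketch): yields length-K frames with hop R.

    The first frame is headed by K-R zeros followed by the first R samples;
    every subsequent frame reuses the last K-R samples of its predecessor.
    """
    buf = np.zeros(K)
    frames = []
    for i in range(0, len(x) - R + 1, R):
        buf = np.concatenate([buf[R:], x[i:i + R]])  # shift in R new samples
        frames.append(buf.copy())
    return np.array(frames)
```

Each frame would then be DFT-transformed and placed on the main diagonal of Y_μ(ℓ) as described in the text.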

Correction step
With the aid of the K × K overlap-save constraint matrix G = F_K×K Q Q^T F^{-1}_K×K, the preliminary error vector [48] (DFT coefficient vector of length K) is obtained as

E_m(ℓ) = F_K×K Q y_m(ℓ) − G Σ_{μ∈I} Y_μ(ℓ) Ĥ_m,μ(ℓ).   (17)

Subsequently, the preliminary error signal, weighted with the Kalman gain diagonal K × K matrix K_m,μ(ℓ), is used to correct (update) the predicted MAEC filter coefficient states Ĥ⁺_m,μ(ℓ) to obtain

Ĥ_m,μ(ℓ) = Ĥ⁺_m,μ(ℓ) + K_m,μ(ℓ) E_m(ℓ)   (18)

with the initialization Ĥ_m,μ(ℓ=0) = 0_K×1. Besides, the state error covariance prediction matrix P⁺_m,μ(ℓ), with diagonal elements P⁺_m,μ,ν(ℓ) and frequency bin index ν, is updated as well by using the Kalman gain as

P_m,μ(ℓ) = [ I_K×K − K_m,μ(ℓ) Y_μ(ℓ) ] P⁺_m,μ(ℓ).   (19)

Thereby, the Kalman gain diagonal matrix is defined by

K_m,μ(ℓ) = diag{ μ_m,μ,ν(ℓ) } · Y^H_μ(ℓ)   (20)

with the K × K diagonal matrix diag{μ_m,μ,ν(ℓ)} being the step-size for the Kalman gain (21). Furthermore, the diagonal matrix B_m(ℓ) of the target microphone channel m results in (22) and includes the covariance diagonal matrix Ψ_S,m(ℓ) of the measurement noise, which indicates the presence of near-end speech and is determined by a temporal smoothing as in (23), initialized with Ψ_S,m(ℓ=0) = 0_K×K. Finally, using (18), the error signal, namely the estimated target speaker's signal, can be computed, and the time-domain signal ŝ_m(n) is recovered by overlap-save synthesis based on

ŝ_m(ℓ) = Q^T F^{-1}_K×K E_m(ℓ).   (24)
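Because all matrices above are diagonal, the correction step decouples into independent scalar updates per frequency bin. The sketch below shows one such scalar update; the overlap-save constraint G and the exact step-size/noise-covariance expressions of [41,48] are omitted, and the state-noise floor used in the toy prediction step of the usage example is an assumption:

```python
import numpy as np

def kalman_correction(h_hat, p_plus, y_ref, e_prelim, psi_s):
    """Scalar-per-bin Kalman correction step (simplified sketch of [41]).

    h_hat:   predicted filter coefficient for this bin
    p_plus:  predicted state error covariance for this bin
    y_ref:   DFT coefficient of the reference (interferer) channel
    e_prelim: preliminary error for this bin
    psi_s:   measurement (near-end speech) noise covariance for this bin
    """
    denom = np.abs(y_ref) ** 2 * p_plus + psi_s + 1e-12
    k_gain = p_plus * np.conj(y_ref) / denom       # Kalman gain
    h_new = h_hat + k_gain * e_prelim              # state correction
    p_new = (1.0 - k_gain * y_ref) * p_plus        # covariance update
    return h_new, np.real(p_new), k_gain
```

Note how a large `psi_s` (strong near-end/target speech) shrinks the gain and effectively freezes the adaptation, which is exactly the double-talk protection discussed later.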

Multichannel Kalman-based Wiener filter (MKWF)
Investigations on the MWF and MAEC (as presented in Section 5.3) result in two observations: first, the MWF has a bigger potential for a high interferer reduction compared to the MAEC, and second, the MAEC achieves a better and more homogeneous quality of the remaining target speech signal over a wide range of different SIRs. Based on these observations, we start in this section with the MKWF approach according to [39], which is an extension of the MWF. Thereby, the MKWF considers the influence of the real RIRs on the interferer signals to obtain a better interferer PSD estimation. Thus, the quality of the remaining target speech is improved, since this allows a more precise filtering (over all frequencies) of the leaky microphone channels, instead of just using a single full-band gain factor (cf. (10)). For this purpose, we replace the MWF interferer PSD estimation (cf. Section 3.1.2) by applying the Kalman filter of the MAEC to estimate the interferer (noise) PSD Φ̂_DD,m(ℓ,k). We further improve the estimation of the target PSD Φ̂_SS,m(ℓ,k) by using the output signals of the MAEC and integrate a new extended control strategy for the RIR update of the MAEC to be able to deal with interferer speech pauses. In the following, we describe the proposed MKWF, which is depicted in Fig. 6.
In line with the MWF, the Wiener filter is modeled by

W_m(ℓ,k) = Φ̂_SS,m(ℓ,k) / ( Φ̂_SS,m(ℓ,k) + Φ̂_DD,m(ℓ,k) ).

To obtain Φ̂_DD,m(ℓ,k), the determination of the interferer signals in channel m by means of the MAEC follows from (24): the estimated interferer component is

d̂_m(ℓ) = Q^T F^{-1}_K×K Σ_{μ∈I} Y_μ(ℓ) Ĥ_m,μ(ℓ).

Since the MAEC, in contrast to the MWF, uses an OLS structure, we first have to adapt the MAEC output. Therefore, we calculate d̂_m(ℓ) and retain only its last R samples due to the OLS constraint, yielding d̂_m(n) (cf. Fig. 6). Afterwards, d̂_m(n) is transformed into the frequency domain, delivering D^WF_m(ℓ,k), by applying an OLA structure with a Hann window, frame shift R, and frame length K_WF = 2R. Subsequently, the interferer signal PSD is determined by

Φ̂_DD,m(ℓ,k) = |D^WF_m(ℓ,k)|².

The estimation of the target speaker PSD Φ̂_SS,m(ℓ,k) follows the MWF approach. In order to make the estimation of Φ̂_SS,m(ℓ,k) more robust to low-SIR input signals, we subtract the estimated interferer signals from the target microphone signal before calculating its PSD. Following Section 3.1, (4) to (8), and neglecting the forgetting factor then yields the target speaker PSD estimate.
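The OLS-to-OLA bridge for the interferer PSD (retain the last R valid samples per MAEC frame, re-frame with a Hann window of length K_WF = 2R, square the DFT magnitudes) might be sketched as:

```python
import numpy as np

class MkwfInterfererPsd:
    """Sketch of the MKWF interferer-PSD path (class name is our own):
    collects the R valid samples of each OLS-domain MAEC output frame and
    produces one Hann-windowed periodogram per hop."""
    def __init__(self, R):
        self.R = R
        self.buf = np.zeros(2 * R)          # previous hop + current hop
        self.win = np.hanning(2 * R)        # K_WF = 2R analysis window

    def step(self, d_hat_ols_frame):
        d_new = d_hat_ols_frame[-self.R:]   # OLS constraint: last R samples valid
        self.buf = np.concatenate([self.buf[self.R:], d_new])
        D = np.fft.rfft(self.win * self.buf)
        return np.abs(D) ** 2               # phi_DD,m(l, k) for this frame
```

The returned PSD then enters the Wiener gain in place of the full-band estimate (10) of the plain MWF.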

New extended RIR update control strategy
The MAEC algorithm contains an intelligent RIR update function, which is mainly based on S m ( ) (23) and P + m,μ,ν ( ) (15), both indirectly controlling the step-size μ m,μ,ν ( ) (21). Thus, the adaptation of the RIR filter coefficients depends on the presence of target speech and the state error. Nevertheless, in case of interferer speech pauses, there is no excitation for the MAEC to estimate the RIRs. In such a case, the current RIR estimate is lost and converges to zero due to the integrated update function aiming to protect the target speech signal. This in consequence leads to a permanent reconvergence of the estimation process in a meeting if a person does not speak continuously, and thus, only a suboptimal result is obtained during the beginning of a new utterance.
In order to prevent this behavior, we detect the speech activity of the interferer signals and store the corresponding latest filter coefficients Ĥ_m,μ(ℓ) (18) as well as P_m,μ,ν(ℓ) (19) during active speech of interferer μ. After a speech pause of at least 0.5 s of interferer μ, the stored filter coefficients are restored as soon as the interferer starts to speak again. Thereby, the internal RIR update function of the MAEC is not interrupted during interferer speech pauses in order to protect the target speech signal. Since RIRs can change rapidly over time, the restored RIRs may be incorrect to some extent, but as we will show in the experimental evaluation, this strategy still reduces the time required for the MAEC to reconverge in a common meeting scenario.
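The store/restore strategy described above can be sketched as a small bookkeeping class; the frame-count threshold corresponds to the 0.5 s pause, and all names are illustrative:

```python
class RirCoefficientStore:
    """Per-interferer store/restore of MAEC filter state (sketch).

    While interferer mu is active, the latest (H_hat, P) pair is stored.
    If mu resumes speaking after at least `pause_frames` inactive frames,
    the stored state is restored instead of the decayed one."""
    def __init__(self, pause_frames):
        self.pause_frames = pause_frames
        self.stored = {}        # mu -> (H_hat, P)
        self.silence = {}       # mu -> consecutive inactive frames

    def step(self, mu, active, h_hat, p):
        if active:
            was_paused = self.silence.get(mu, 0) >= self.pause_frames
            self.silence[mu] = 0
            if was_paused and mu in self.stored:
                return self.stored[mu]      # restore converged state
            self.stored[mu] = (h_hat, p)    # keep latest state while talking
        else:
            self.silence[mu] = self.silence.get(mu, 0) + 1
        return h_hat, p
```

With a frame shift R = 256 at 16 kHz, the 0.5 s pause would correspond to roughly 31 frames; the MAEC's own update keeps running in parallel, exactly as described in the text.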
To detect the speech activity in a channel, we apply a multichannel speaker activity detection (MSAD), which is inspired by [49] and briefly described in the following. The MSAD is based on a comparison of the PSDs of all considered microphone channels m ∈ M. Frames are obtained with a Hann window, a frame shift R, and a frame length K_MSAD = 2R. We determine the PSD comparison SPR_m(ℓ,k) for channel m in accordance with [49]. We further calculate a signal-to-noise ratio (SNR) as

ξ̂_m(ℓ,k) = Φ̂_YY,m(ℓ,k) / Φ̃_NN,m(ℓ,k)

with Φ̃_NN,m(ℓ,k) = λ_SNR · Φ̂_NN,m(ℓ,k) and λ_SNR = 4 being an overestimation factor to be more robust during speech pauses. By means of SPR_m(ℓ,k) and ξ̂_m(ℓ,k), we determine the set of all relevant frequency bins K⁺_m(ℓ) in channel m as those bins for which the channel dominates and ξ̂_m(ℓ,k) exceeds ϑ_SNR = 0.25. Thus, we obtain a soft full-band MSAD by

MSAD_m(ℓ) = κ⁺_m(ℓ) · w_m(ℓ)

with κ⁺_m(ℓ) = |K⁺_m(ℓ)|/K_MSAD and w_m(ℓ) being an SNR-dependent weighting function with α = 0.1. Thereby, w_m(ℓ) depends on the maximum SNR value of B = 10 averaged frequency bands with index b ∈ B = {1, 2, ..., B} and K_b being the set of frequency bin indices k in band b. Finally, we obtain the binary MSAD decision for channel m by comparing MSAD_m(ℓ) against the threshold θ_MSAD = 0.2. The values of λ_SNR, ϑ_SNR, α, and θ_MSAD were determined empirically.
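A condensed sketch of the MSAD decision is given below. The SPR comparison of [49] is abstracted into a boolean mask `spr_ok`, and the exponential SNR weighting is a simplified stand-in for the band-based weighting described above:

```python
import numpy as np

def msad_decision(psd_y, psd_n, spr_ok, lam_snr=4.0, thr_snr=0.25,
                  alpha=0.1, theta=0.2):
    """Soft full-band MSAD (sketch): relevant bins are those where the
    channel dominates (spr_ok) and the overestimated-noise SNR exceeds
    thr_snr; the soft score is the relevant-bin fraction times an
    SNR-dependent weight, thresholded with theta."""
    snr = psd_y / (lam_snr * psd_n + 1e-12)       # SNR with noise overestimation
    relevant = spr_ok & (snr > thr_snr)
    kappa = np.count_nonzero(relevant) / len(psd_y)
    w = 1.0 - np.exp(-alpha * np.max(snr))        # simplified weighting (assumption)
    soft = kappa * w
    return soft, soft > theta
```

The noise overestimation (λ_SNR = 4) makes the detector conservative: a marginal bin only counts as speech-active if it clearly rises above the estimated noise floor.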

Experiments and discussion
We first introduce the applied evaluation metrics for all upcoming experiments and define the experimental setup, before we discuss the performance comparison between the MWF, MAEC, and MKWF methods.

Quality measures
With the aid of the PEASS toolbox¹ according to [50], each estimated target signal ŝ_m(n) is decomposed into the true target signal component s_m(n), a target distortion e^target_m(n), an interference component e^interf_m(n), and an artifact component e^artif_m(n). The output SIR after applying PEASS for channel m, with sample index n ∈ N = {1, . . . , N}, is

oSIR_m = 10 log₁₀ [ Σ_{n∈N} (s_m(n) + e^target_m(n))² / Σ_{n∈N} (e^interf_m(n))² ].

We then define the improvement of the SIR as

ΔSIR_m = oSIR_m − iSIR_m,

whereby iSIR_m is the input SIR of channel m, measured according to ITU-T P.56 [46]. The further measures of the PEASS toolbox are the signal-to-distortion ratio (SDR), the source image to spatial distortion ratio (ISR), and the signal-to-artifact ratio (SAR) [50], which are defined as

SDR_m = 10 log₁₀ [ Σ_{n∈N} (s_m(n))² / Σ_{n∈N} (e^target_m(n) + e^interf_m(n) + e^artif_m(n))² ],
ISR_m = 10 log₁₀ [ Σ_{n∈N} (s_m(n))² / Σ_{n∈N} (e^target_m(n))² ],
SAR_m = 10 log₁₀ [ Σ_{n∈N} (s_m(n) + e^target_m(n) + e^interf_m(n))² / Σ_{n∈N} (e^artif_m(n))² ],

respectively. Please note, since a subjective perception of the enhanced target signals is not the primary evaluation criterion for the purpose of an automatic meeting analysis, we do not consider further perceptual measures from the PEASS toolbox for the evaluation.
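The ΔSIR computation can be illustrated with oracle signal components. Note that the paper uses the PEASS decomposition instead; this sketch only shows the level arithmetic:

```python
import numpy as np

def sir_db(target_component, interferer_component):
    """SIR in dB from oracle target and interferer components."""
    return 10.0 * np.log10(np.sum(target_component ** 2)
                           / (np.sum(interferer_component ** 2) + 1e-12))

def sir_improvement(s_in, d_in, s_out, d_out):
    """Delta-SIR = oSIR - iSIR, computed from components before and after
    processing (a sketch; the paper obtains oSIR via PEASS and iSIR via P.56)."""
    return sir_db(s_out, d_out) - sir_db(s_in, d_in)
```

Since the simulation of Section 2.2 keeps all components separate, such oracle-based measures are directly available alongside the PEASS decomposition.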

Experimental setup
During a group interaction with three persons, at any time, one out of four different states may occur: silence (nobody speaks), single-talk (only one person speaks), double-talk (two persons speak), and triple-talk. Thereby, the elimination of crosstalk during multi-talk situations is obviously the most challenging task, since both the interfering and the target speaker are talking at the same time. This requires precise filtering of the recorded microphone signals and is quite similar to the case of common (live) music performances, in which most instruments are playing at the same time. In particular, short interruptions occur very often in a meeting and depict a challenging double-talk task for MSIR methods due to the short adaptation time. For these reasons, our main focus is on multi-talk scenarios.
In order to generate a challenging meeting scenario that is as real as possible and has a focus on multi-talk situations, we created a conversation for target speaker m = 1 with explicit single-, double-, and triple-talk parts as well as short speech pauses and multiple RIR changes. We used the speech signals from the ITU-T Recommendation P.501 [51] for the implementation, which are designed for challenging double-talk scenarios in the field of echo compensation in telephonometry.
The composition of the M = 3 source signals is as follows: the target signal consists of the 10 s long female short conditioning sequence, a speech pause of around 6 s, and the single-talk sequence with a length of 35 s. Interferer signal μ = 2 begins with the male short conditioning sequence of 10 s, followed by the female short conditioning sequence, which was cut off after 6 s, and ends with the 35 s long double-talk sequence. The second interferer signal μ = 3 is generated with 10 s of speech pause, the female short conditioning sequence, where the first 4 s were cut off, again speech pause of 12 s, and the female long conditioning sequence with a duration of around 23 s (c.f. Fig. 8).
The microphone signals are obtained in accordance with Section 2.2 (c.f. Fig. 3) by means of the recorded RIRs. We insert multiple RIR changes at different points in time for each speaker by changing between RIR set I and RIR set II of the corresponding crosstalk signals of the respective speaker (for a visualization of the crosstalk dependencies w.r.t. the microphone channels, please refer to Fig. 2). The RIRs from target speaker m = 1 to channel μ = 2 and μ = 3 are changed after 7, 18, 40, and 47 s; RIRs from speaker μ = 2 to channel m = 1 and μ = 3 are changed after 3, 12, 26, and 47 s; and the RIRs from speaker μ = 3 to channel m = 1 and μ = 2 are changed after 28, 30, and 47 s. For a better overview, all changing points are marked in Fig. 8 with the corresponding speech source color. The last changing point after 47 s is colored in black, because at this point all applied RIRs are changed simultaneously. Microphone channels are mixed as described in Section 2.2 with the recorded RIR sets; we level the target speaker signals to −26 dBov and the sensor noise to −75 dBov. Furthermore, we investigate crosstalk levels between −26 and −46 dBov with a step-size of 2 dB. All applied signals are sampled at 16 kHz.
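A minimal sketch of this mixing procedure, assuming equal-length source signals, single RIRs per speaker/microphone pair, and a simple RMS-based scaling as a stand-in for the ITU-T P.56 active speech level measurement used in the paper:

```python
import numpy as np

def level_to_dbov(x, target_dbov, full_scale=1.0):
    """Scale x so that its RMS corresponds to target_dbov (RMS stand-in
    for the P.56 active speech level)."""
    rms = np.sqrt(np.mean(x ** 2)) + 1e-12
    target_rms = full_scale * 10.0 ** (target_dbov / 20.0)
    return x * (target_rms / rms)

def mix_channel(m, sources, rirs, noise,
                target_dbov=-26.0, crosstalk_dbov=-36.0):
    """Microphone signal of channel m: dedicated source plus leveled
    crosstalk of all other sources, each convolved with its RIR, plus
    sensor noise. sources: list of M signals; rirs[m][mu]: RIR from
    speaker mu to microphone m."""
    y = np.zeros(len(sources[m]) + len(rirs[m][m]) - 1)
    for mu, s in enumerate(sources):
        comp = np.convolve(s, rirs[m][mu])
        level = target_dbov if mu == m else crosstalk_dbov
        y += level_to_dbov(comp, level)
    return y + noise[: len(y)]
```

Sweeping `crosstalk_dbov` from −26 to −46 dBov in 2 dB steps reproduces the range of iSIR conditions investigated above; RIR changes correspond to swapping entries of `rirs` at the stated points in time.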
The parameter configuration of the MAEC and the MWF is in line with [48] and [21], respectively. Due to the reduced sampling frequency, we only adapt the frame length of the MWF to 512 samples and the frame shift to 256 samples. Moreover, the applied MSAD obtains an accuracy between 76.3% for iSIR = 0 dB and 85.0% for iSIR = 20 dB, with a maximum value of 86.2% for iSIR = 12 dB w.r.t. the interferer channels μ ∈ {2, 3}. We also integrated the MSAD-based extended RIR update control mechanism into the MAEC for better comparability. In contrast, replacing the energy-based solo detection of the MWF (c.f. Fig. 4) by the MSAD leads to a significant drop of the SIR performance, so that we did not apply the MSAD to the MWF approach. To ensure a fair comparison, we further apply the same parameter values for all common parameters of our MKWF method and the two other approaches. An overview of all parameters for each method is given in Table 1. Figure 7 depicts the metrics from the PEASS toolbox as a function of a wide range of crosstalk levels (iSIR) from 0 to 20 dB. Please note, the discussion of the oracle MAEC (skyblue dotted line) will be the topic of Section 6.1 in the context of a deeper analysis of the RIR estimation by means of the MAEC.

Results and discussion
Comparing the MAEC and the MWF, it is evident that the MAEC achieves better SIR results for the iSIR range of 0 to 8 dB as well as for 17 dB and higher, while the MWF outperforms the interferer reduction of the MAEC by up to 1.2 dB for 8 dB < iSIR < 17 dB and obtains a higher maximum interferer reduction than the MAEC. Furthermore, the MAEC significantly outperforms the MWF regarding SDR, ISR, and SAR for iSIR > 12 dB by up to 6.7 dB, 13.7 dB, and 4.9 dB, respectively. This is due to the fact that the MWF operates very aggressively in high iSIR conditions, whereby it also affects the target speech component negatively. Interestingly, the MWF achieves better ISR and SAR results for low iSIR conditions (iSIR < 12 dB) compared to the MAEC, which is, however, mainly due to its significantly stronger drop in SIR performance for this iSIR range. To conclude, the MAEC achieves better speech quality of the remaining target speech, while the MWF has more potential for a higher interferer reduction.
The MKWF outperforms both the MAEC and the MWF in almost all respects. It achieves better results over the whole considered iSIR range regarding SIR and SDR, improving the baselines by up to 2.7 dB and 2.3 dB, respectively. An exception is the performance of the MWF regarding ISR and SAR for iSIR < 8 dB, where the MKWF obtains a somewhat poorer performance than the MWF, while achieving an improved SIR performance of up to 4 dB at the same time. However, the MKWF significantly outperforms both the MWF and the MAEC regarding all considered measures for iSIR > 8 dB, which represents the most relevant range for a common meeting scenario. In addition, shifting the optimal operating point towards a lower iSIR is, besides the significant SIR increase while maintaining approximately equal or even better speech quality (SDR, ISR), the main advantage of the proposed MKWF method compared to the MAEC and the MWF.
The effect of the extended RIR update control strategy (c.f. Section 4.1) w.r.t. the (oracle) MAEC and the proposed MKWF approach is depicted in Table 2. All results are averaged over the iSIR conditions from 0 to 20 dB of the considered scenario, and the results are additionally compared with the MWF method. It is evident that the use of the extended RIR update control significantly improves the performance of the MAEC, MKWF, and oracle MAEC regarding both the SIR and the SDR measure. Thereby, the MKWF achieves the largest improvement of around 3.1 dB and 1.7 dB w.r.t. the SIR and SDR, respectively. This improvement is obtained since the implemented control strategy prevents the reconvergence process after an interferer speech pause by restoring the filter coefficients of the last active interferer speech frame, as already mentioned in Section 4.1. An in-depth analysis of this issue can be found in Section 6.1. As expected, due to the stronger interference reduction by means of the extended RIR update control strategy, the performance decreases w.r.t. the ISR and SAR measure for both the MAEC and the MKWF. Interestingly, the ISR and SAR performance of the oracle MAEC is improved by the extended RIR update control. However, the averaged MKWF results with ISR = 25.2 dB and SAR = 27.2 dB are still the best compared to the results of the MWF and the MAEC (with and without using the extended RIR update control).
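The restore-on-reactivation behavior of the extended RIR update control can be sketched as follows; the actual Kalman adaptation is abstracted into a callback, and the class structure is purely illustrative:

```python
import numpy as np

class RirUpdateControl:
    """Hedged sketch of the extended RIR update control: filter
    coefficients are frozen during interferer speech pauses, and the
    coefficients of the last active interferer frame are restored when
    speech resumes, avoiding a full reconvergence."""

    def __init__(self, num_taps):
        self.h = np.zeros(num_taps)         # current RIR estimate
        self.h_stored = np.zeros(num_taps)  # snapshot of last active frame
        self.was_active = False

    def step(self, interferer_active, adapt_fn):
        if interferer_active:
            if not self.was_active:
                # interferer just resumed: restore instead of readapting
                self.h = self.h_stored.copy()
            self.h = adapt_fn(self.h)        # normal adaptation update
            self.h_stored = self.h.copy()
        # during interferer pauses: no update, coefficients stay frozen
        self.was_active = interferer_active
        return self.h
```

The snapshot makes the filter pick up from its converged state after a pause, which is exactly what prevents the reconvergence losses discussed above.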
Analysis of the MAEC RIR estimation with leaky reference channels

Section 5.3 has shown that using the MAEC RIR estimation improves the MWF approach and, also, that the overall performance of the MAEC is quite good in the considered meeting scenario, which is somewhat surprising, since the MAEC was not developed to deal with leaky reference channels. In order to understand these results and verify the usability of the MAEC RIR estimation for the proposed MKWF, we analyze the MAEC in this section in more detail and answer two main questions. First: how does the MAEC actually operate with leaky instead of clean reference channels? (addressed in Section 6.1). Second: why does the target speech remain undistorted even though it is present in the reference channels? (addressed in Section 6.2).

Influence of leaky reference channels on the MAEC
To answer the first question, we compare the applied MAEC with an oracle MAEC, which processes the same target microphone channel but has access to crosstalk-free reference channels. Thus, we obtain performance results of our target channel for an echo cancellation scenario, which, considering the fact that the MAEC was developed for echo cancellation, represents a kind of upper quality limit for the MAEC in the considered meeting scenario.
Comparing the MAEC to the oracle MAEC (c.f. Fig. 7), two main observations are evident: First, the performance of the MAEC for iSIR > 16 dB is approximately equal to the oracle MAEC, which is conclusive, since the input signals of the MAEC converge towards the oracle signals due to the decreasing distortion. Only the ISR measure shows quite a gap due to the target speech occurring as crosstalk in the reference channels, which has a very slight effect on the target speech in the target microphone channel, especially when the system is still in the initial convergence process. Second, the SIR of the oracle MAEC for iSIR < 14 dB is significantly higher than that of the MAEC, where the performance achieves a maximum of about 11.9 dB for iSIR = 0 dB. An iSIR of 0 dB means that the source signal and the crosstalk signals share the same active speech level. Thus, the algorithm does not need to estimate the compensation of level mismatches within the RIR but can focus on the room characteristics without paying attention to a global attenuation or gain factor. Even though the MAEC is theoretically able to determine such a gain factor, this type of level calibration is common in practice to achieve the best possible result. In contrast, the MAEC cannot benefit from these balanced signal levels in the considered meeting scenario, since the accompanying distortion level is too high for a good performance. However, an iSIR < 5 dB is already a very challenging task and represents a lower limit for a meeting scenario in practice. Nevertheless, even if the SIR result of the MAEC decreases with an increasing crosstalk level (distortion of reference channels), it still achieves adequate results in this range compared to the MWF.
In order to get a deeper insight into the MAEC RIR estimation, we consider two MAEC metrics: the system distance in dB, which is defined by d_sys,m,μ(ℓ) = 10 log_10(‖h_m,μ − ĥ_m,μ(ℓ)‖² / ‖h_m,μ‖²), as well as the averaged absolute value of the measurement noise |Ψ̄_S,m(ℓ)|, which is averaged over all K frequency bins (main diagonal of (23)) with tr(·) being the trace of the matrix. Figure 8 depicts both metrics for target channel m = 1 of the considered meeting scenario at iSIR = 10 dB. Thereby, |Ψ̄_S,m=1(ℓ)| is smoothed over time for better illustration. Furthermore, to verify that the MAEC RIR estimation is able to deal with time-variant RIRs, we consider multiple RIR changes of different crosstalk signals in Fig. 8 (c.f. Section 5.2 and Fig. 2), each carried out in a specific meeting situation. The RIR changes are illustrated in the lower plot of Fig. 8 by abbreviations in colored circles. Thereby, circle colors mark the changed RIRs w.r.t. the corresponding speech source color, and the abbreviations for single-talk (ST), double-talk (DT), triple-talk (TT), and interferer speech pause (ISP) specify the respective meeting situation (e.g., an ST circle marks single-talk with a change of the RIRs h_1,2(n) and h_3,2(n)). The black colored circle at 47 s defines a simultaneous change of all RIRs.
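The system distance can be computed directly from the true and estimated filter coefficients; this sketch assumes the normalization by the true-RIR energy, which is the common convention for this metric:

```python
import numpy as np

def system_distance_db(h, h_hat):
    """Normalized coefficient-error energy in dB between the true RIR h
    and its (possibly longer) estimate h_hat, truncated to len(h)."""
    err = h - h_hat[: len(h)]
    return 10.0 * np.log10(np.sum(err ** 2) / np.sum(h ** 2))
```

A perfectly converged filter drives this value towards −∞; values around −11 dB, as reported below for interferer-only talk, indicate a well-adapted estimate.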
It can be seen that the plots of |Ψ̄_S,m=1(ℓ)|, which indicate the presence of target speech, are very similar for the MAEC and the oracle MAEC. The system distances of the MAEC are also quite close to those of the oracle MAEC, the latter of course still obtaining better results, which is in line with Fig. 7. By looking at the system distance, we can identify some general characteristic behaviors of the (oracle) MAEC.
First, RIR changes emanating from the target signal m = 1 (h 3,1 (n) and h 2,1 (n)) seem to have no influence on the RIR estimation of h m=1,μ (n). This is indicated by the three RIR changes at 7, 18, and 40 s, since there is no effect on the system distance d sys m=1,μ ( ), μ ∈ {2, 3} for the (oracle) MAEC.
Second, the best results w.r.t. d_sys,m=1,μ(ℓ) are obtained during (single) interferer-only talk (e.g., in the intervals 0 s . . . 5 s or 10 s . . . 16 s). This is shown for the MAEC by the fast reconvergence process from −4 dB back to −11 dB in less than 2 s. Third, the convergence during multi-talk situations is slower, since the target speech already represents a distortion to the RIR estimation process of the (oracle) MAEC. These sections are typically marked by large values of the measurement noise (indicating the presence of target speech), which leads to a small step-size for the Kalman gains (c.f. (21)). Nevertheless, the system distance of the (oracle) MAEC still decreases in all considered cases, so that the functionality of the RIR estimation is ensured. Finally, the system distance increases during speech pauses of the interferer (e.g., in the intervals 5 s . . . 10 s, 15 s . . . 19 s, or 31.5 s . . . 42 s). This is consistent, since in this case the algorithm has no excitation to estimate the RIRs, which is the main reason why we integrated the extended RIR update control strategy (c.f. Section 4.1) into both the MKWF and the (oracle) MAEC. The positive effect is illustrated in the lowest plot of Fig. 8 compared to the basic MAEC without the extended RIR update control. This indicates a certain dependency between different RIRs inside the same environment (meeting room), so that it is beneficial to store the latest RIR during active speech instead of restarting the complete initialization process of the MAEC.
We can conclude from Fig. 8 that the MAEC RIR estimation is still completely operational at a crosstalk level of iSIR = 10 dB, even if the distortion by the crosstalk leads to some minor performance limitations compared to the oracle MAEC (c.f. Fig. 7). In summary, since in a meeting scenario we typically deal with iSIR ≥ 5 dB, by considering the results of Figs. 7 and 8, we can assume that the MAEC RIR estimation is suitable for the proposed MKWF and for this kind of application.

Preservation of the target speech by the MAEC
So far, we know how the microphone leakage of the reference channels influences the adaptation process and thus also the performance of the MAEC RIR estimation. The remaining question, namely how the MAEC in general distinguishes between the target and the interferer signals, is answered in the following.
In order to understand this behavior, we have to investigate the influence of an RIR on our source signals. Therefore, we create a single-talk scenario based on speech samples of the NTT multi-lingual speech database [52] and use a greatly simplified RIR, which is represented by an impulse α · δ(n − n_0):
• Attenuation: α < 1, n_0 = 0
• Amplification: α > 1, n_0 = 0
• Delay: α = 1, n_0 > 0
By superposition and concatenation of such impulses, we can model any discrete RIR. This also includes reverberation, which corresponds to a sequence h(n), with n ∈ N_0 and h(n) ∈ R. Table 3 depicts the performance of the MAEC for this single-talk experiment, in which all speaker signals s_m(n) are convolved with the described simplified RIRs (the ASL of all speakers is adjusted to −26 dBov before the convolution) before coupling as interferer signals into the non-dedicated microphone channels y_μ(n) (c.f. Section 2.2). Thereby, we either attenuate or amplify the interfering signals by −5 and +5 dB, respectively, or choose a delay of n_0 = 100 samples (which is below the frame shift R = 256). Additionally, we use a random sequence to simulate reverberation, which is shaped by an exponentially decreasing function to impose an energy decay with a reverberation time of T_60 = 5 ms, and truncate it after 6.25 ms (i.e., after 100 samples). The RIRs from the target speakers to their dedicated microphone channels y_m(n) remain h_m,m(n) = δ(n). Even though we already know that the crosstalk level has an effect on the performance of the MAEC during multi-talk (c.f. Fig. 7), Table 3 clearly shows that this is not the reason why the MAEC is able to work with leaky reference signals.
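The simplified RIR types of this experiment can be generated as follows; the random reverberant tail is one plausible realization shaped to the stated T_60 = 5 ms decay at 16 kHz:

```python
import numpy as np

FS = 16000  # sampling rate in Hz

def impulse_rir(alpha=1.0, n0=0, length=101):
    """RIR consisting of a single impulse alpha * delta(n - n0)."""
    h = np.zeros(length)
    h[n0] = alpha
    return h

def reverb_rir(t60=0.005, length=100, seed=0):
    """Random sequence shaped by an exponential decay so that the
    amplitude has dropped by 60 dB after t60 seconds."""
    rng = np.random.default_rng(seed)
    n = np.arange(length)
    decay = 10.0 ** (-3.0 * n / (t60 * FS))
    return rng.standard_normal(length) * decay

# the four interferer RIR conditions of the experiment:
h_att = impulse_rir(alpha=10 ** (-5 / 20.0))   # attenuation by -5 dB
h_amp = impulse_rir(alpha=10 ** (5 / 20.0))    # amplification by +5 dB
h_del = impulse_rir(n0=100)                    # delay of 100 samples
h_rev = reverb_rir()                           # synthetic reverberation
```

Convolving the interferer signals with one of these RIRs before mixing reproduces the respective row of Table 3.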
To be more specific, in the case of single-talk, no RIR or an RIR that only attenuates or amplifies the crosstalk signals leads to an almost complete elimination of the target speech signal, which is evident for these three types of RIRs in Table 3 from the poor performance regarding the SDR and ISR measures. Nevertheless, the attenuating RIR obtains the best results of these three RIR types.
However, the key factor for the MAEC with leaky reference signals seems to be the delay of the interfering signals, as is evident from the best-performing result w.r.t. both the SDR and ISR measures in Table 3. This can be explained as follows: All interfering signals (especially the crosstalk of the target speech in the reference channels) are delayed w.r.t. their source signals. In consequence, the MAEC would have to estimate an RIR with negative delay in order to affect the target speech component in the desired channel m. This is physically not possible and causes the MAEC to treat the target speech component as near-end speech (as mentioned in Section 3.2), which might be the reason why the MAEC (RIR estimation) works well without degrading the target speech. Since reverberation can be seen as a combination of delay and level adjustment, it is obvious that this also has a positive effect on the MAEC (RIR estimation) in our scenario.
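This causality argument can be illustrated with a small adaptive-filter experiment; here, a plain NLMS canceller serves as a stand-in for the Kalman-based adaptation. A causal filter can cancel a disturbance that lags its reference, but not one that leads it, which mirrors why the delayed target-speech crosstalk in the reference channels cannot be used to cancel the target speech itself:

```python
import numpy as np

def nlms_cancel(x, d, num_taps=128, mu=0.5, eps=1e-6):
    """Cancel d using a causal FIR estimate driven by reference x (NLMS).
    Returns the residual error signal e = d - y."""
    w = np.zeros(num_taps)
    buf = np.zeros(num_taps)
    e = np.zeros(len(d))
    for n in range(len(d)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        y = w @ buf
        e[n] = d[n] - y
        w += mu * e[n] * buf / (buf @ buf + eps)
    return e

rng = np.random.default_rng(1)
s = rng.standard_normal(4000)
s_delayed = np.concatenate([np.zeros(100), s[:-100]])

# causal case: the disturbance in the microphone lags the reference
e_causal = nlms_cancel(s, s_delayed)
# anti-causal case: the disturbance LEADS the reference; a causal filter
# cannot realize the required negative delay, so the component survives
e_anti = nlms_cancel(s_delayed, s)
```

After convergence, the residual of the causal case is far smaller than that of the anti-causal case, whose disturbance remains essentially untouched.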
We can conclude that the MAEC (RIR estimation) can be applied to close-talk multichannel recordings with crosstalk-disturbed reference channels, if the microphones are closest to their dedicated person.

Conclusions
In this work, we investigated the applicability of a multichannel acoustic echo cancellation (MAEC) approach for speaker interference reduction in a close-talk (wireless) headset meeting scenario, which deals with crosstalk and thus reference channels disturbed by both target and interferer speech. We further showed that the characteristics of the room impulse response (RIR), especially the delay, and during multi-talk to some extent also the attenuation affecting the energy level of the crosstalk, are the reasons why the MAEC is able to operate successfully with crosstalk-disturbed reference signals in this specific scenario. Moreover, by means of the MAEC RIR estimation, we proposed a multichannel Kalman-based Wiener filter (MKWF) method, which extends a multichannel Wiener filter (MWF) approach by considering the RIRs between the microphones of the interferer and the target speakers. Thus, the MKWF estimates the interfering signals more precisely, leading to an increase of the signal-to-interferer ratio by up to 2.7 dB, while the obtained speech quality remains equal or is even better compared to the MWF and the MAEC.