Audio source separation by activity probability detection with maximum correlation and simplex geometry

Two novel methods for speaker separation of multi-microphone recordings that can also detect speakers with infrequent activity are presented. The proposed methods are based on a statistical model of the probability of activity of the speakers across time. Each method takes a different approach for estimating the activity probabilities. The first method is derived using a linear programming (LP) problem for maximizing the correlation function between different time frames. It is shown that the obtained maxima correspond to frames which contain a single active speaker. Accordingly, we propose an algorithm for successive identification of frames dominated by each speaker. The second method aggregates the correlation values associated with each frame in a correlation vector. We show that these correlation vectors lie in a simplex with vertices that correspond to frames dominated by one of the speakers. In this method, we utilize convex geometry tools to sequentially detect the simplex vertices. The correlation functions associated with single-speaker frames, which are detected by either of the two proposed methods, are used for recovering the activity probabilities. A spatial mask is estimated based on the recovered probabilities and is utilized for separation and enhancement by means of both spatial and spectral processing. Experimental results demonstrate the performance of the proposed methods in various conditions on real-life recordings with different reverberation and noise levels, outperforming a state-of-the-art separation method.


Introduction
Blind audio source separation (BASS) is a prominent task in the field of audio processing, dealing with the analysis of audio streams comprising several speakers. BASS aims at extracting the individual speech signals of each of the sources present in an audio mixture [1]. Most BASS methods assume that the speakers are concurrently active for most of the time, and little attention has been paid to the case of infrequent speakers.
BASS has been a topic of extensive research for the last decades, leading to a large variety of separation algorithms. The measured signals in an array of microphones represent convolutive mixtures of the clean source signals with the corresponding acoustic channels [2][3][4][5]. Commonly, the signals are analyzed in the short-time Fourier transform (STFT) domain, in which the convolutive mixtures are approximated by multiplicative mixtures. Various approaches for BASS exist, such as the independent component analysis (ICA) and independent vector analysis (IVA) separation methods [6][7][8][9][10], non-negative matrix factorization (NMF) [11][12][13][14][15], and, more recently, deep neural network (DNN)-based separation methods [16][17][18][19][20][21][22][23]. A related problem to acoustic source separation was recently investigated in the field of structural health monitoring based on acoustic emission, dealing with onset detection of overlapped acoustic emission waves [24] for accurate time of arrival estimation [25].
A vast number of BASS algorithms rely on the sparsity of speech signals in the STFT domain, assuming that speech components of simultaneously active speakers are non-overlapping [26]. One approach is to compute the time or phase differences at each time-frequency (TF) bin, and then to jointly cluster all TF bins [27][28][29]. Alternatively, dual-stage methods perform a frequency-wise clustering, followed by a permutation alignment of the identities of the speakers across all frequency bands [30][31][32]. The resulting algorithms consist of iterative methods, such as the well-known expectation maximization (EM) algorithm, which require careful initialization, are susceptible to convergence to local maxima, and commonly impose a high computational load.
Source separation can also be achieved by applying beamformers [1], which are multichannel spatial filters designed by certain criteria, such as the linearly constrained minimum variance (LCMV) beamformer [33]. These algorithms are not fully blind, as their design requires some knowledge on the signal statistics or the acoustic systems. In [33,34], the beamformer parameters were estimated assuming some prior knowledge on the activity of the speakers, such as assuming the existence of known time intervals in which each of the desired speakers is separately active [33], or, alternatively, assuming a scenario in which speakers become successively active [34]. In [35], a blind approach for learning the beamformer parameters was proposed using variational inference framework, which is initialized based on the speakers' time difference of arrivals (TDOAs).
In many conversations, unbalanced activity of the different speakers is common: some speakers participate frequently, while others only seldom speak. This is often the case in interviews, police interrogations, and counseling, to name a few. In such scenarios, one speaker presents short questions or comments, while the other speaker provides long answers or descriptions, which may then be followed by short expressions of agreement or disagreement from the first speaker. Identifying a speaker with low activity is extremely challenging, since the identification is based on a very limited amount of data. To the best of our knowledge, this scenario has not been considered in the literature, although it is very common and of great importance in many applications.
In this paper, we present a source separation method based on an LCMV beamformer followed by a postfilter with parameters that are learned in a completely blind manner by recovering the probability of activity of the speakers across time. This method is flexible and can be applied to a wide range of scenarios from conversational speech with only a limited amount of overlapping speech to audio mixtures of simultaneously active speakers with possibly large overlap between them. Furthermore, it can also detect speakers with low activity.
The proposed method relies on a probabilistic model that is built upon the speech sparsity in the STFT domain and describes the probability of activity of the speakers across time. Based on the assumed statistical model, it is shown that the correlation between each two time frames along the measured signal equals the product of the associated speakers' probabilities. A correlation function is defined for each frame, consisting of its correlation with all other frames. For frames dominated by one of the speakers, the value of the correlation function equals the probability of this speaker in each frame, and can be used as an estimator for the activity probability. We present two methods for detecting the set of frames dominated by a single speaker. The first method relies on the theory of linear programming (LP) that states that single-speaker frames are the maximum points of the correlation function associated with any other frame. Based on this observation, we present a sequential algorithm to detect frames dominated by each speaker. In the second method, we aggregate the correlations in a correlation matrix. We show that the columns of this matrix lie in a simplex. The vertices of the simplex correspond to frames dominated by a single speaker, and can be recovered by convex geometry tools. Based on the activity probabilities estimated by either method, we recover the spectral mask that assigns each TF bin with the dominant speaker, and exploit it for performing spatial multichannel separation followed by spectral single-channel post-processing.
The contribution of the current work is twofold: presenting novel methodologies for source separation with several advantages over existing separation methods, and achieving high-quality performance in various scenarios, as shown in the experimental part. Compared to existing separation algorithms, the proposed methods do not require any initialization process, do not include iterative search mechanisms, circumvent frequency-permutation problems, and do not require training data, resulting in efficient high-performance methods. Compared to our previously proposed simplex-based method [36,37], which relies on a similar probabilistic model, here we present new methodologies for solving the problem, yielding simplified processing and better support for infrequent speakers. In addition, one of the presented methods has lower computational complexity with respect to [36]. It is worthwhile noting that the proposed maximum correlation method was utilized in psychological research on vocal emotional dynamics during psychotherapy sessions [38]. The method was used for detecting time intervals in which the therapist and the patient are active, from which vocal features were then extracted and analyzed. An example demonstrating the capabilities of the proposed method for this task is given in the experimental part.

Problem formulation

Consider J concurrent speakers and a non-directional noise measured by an array of M microphones. In the STFT domain, with frame index l and frequency index f, the signal measured by the mth microphone is given by:

X_m(l, f) = Σ_{j=1}^J Y_m^j(l, f) + N_m(l, f),  (1)

where Y_m^j(l, f) = A_m^j(f) S_j(l, f) is the signal of the jth speaker measured by the mth microphone, A_m^j(f) is the acoustic transfer function (ATF) relating the jth speaker and the mth microphone, S_j(l, f) is the signal of the jth speaker, and N_m(l, f) is the non-directional noise signal measured by the mth microphone. Directional noises can be treated as additional sources, increasing J accordingly.
Our goal is to apply separation, namely to extract the individual source signals {Y_1^j(l, f)}_{l,j} from the mixture while reducing the noise. Note that instead of estimating the original source signals, we provide an estimate of the source signals as they are measured by the first microphone, which serves as a reference microphone.

Sparsity-based statistical model
Relying on the assumption of speech sparsity in the STFT domain (a.k.a. W-disjoint orthogonality) [26], each TF bin is dominated by either one of the speakers or consists of noise. We define the categorical spectral mask {M(l, f)}_{l,f} that assigns each TF bin with its dominating component: either one of the J speakers, 1 ≤ M(l, f) ≤ J, or the noise, M(l, f) = J + 1. When the power of the jth speaker is considerably larger than the power of the other speakers and the noise in the (l, f)th TF bin, we have M(l, f) = j, 1 ≤ j ≤ J. TF bins that are not dominated by any of the speakers are considered noise, i.e., M(l, f) = J + 1. Accordingly, we can restate (1):

X_m(l, f) ≈ Y_m^{M(l,f)}(l, f),  (2)

where Y_m^{J+1}(l, f) ≡ N_m(l, f). We assume that the index of the dominating component M(l, f) has a categorical distribution with

Pr{M(l, f) = j} = p_j(l), 1 ≤ j ≤ J,  (3)

and Σ_{j=1}^J p_j(l) ≤ 1 for each frame l, where the remainder 1 − Σ_{j=1}^J p_j(l) is the probability of a noise-dominated bin. The activity probabilities {p_j(l)}_{j,l} are independent of the frequency bin, and represent the activity patterns of the speakers across time, namely with respect to the frame index l.
We compute the following bin-wise ratio between the measurement at the mth microphone and the measurement at the reference microphone:

R_m(l, f) = X_m(l, f) / X_1(l, f), 2 ≤ m ≤ M.  (4)

According to the sparsity assumption (2), we have:

R_m(l, f) ≈ H_m^j(f) if M(l, f) = j, and R_m(l, f) ≈ η(l, f) if M(l, f) = J + 1,  (5)

where

H_m^j(f) = A_m^j(f) / A_1^j(f)  (6)

is the relative transfer function (RTF) [39,40], defined as the ratio between the ATF of the mth microphone and the ATF of the reference microphone, both associated with the jth speaker, and η(l, f) = N_m(l, f)/N_1(l, f) is a noise term that is both frequency and frame dependent. We obtain that the ratio in (5) equals the RTF of one of the speakers or a noise term. We assume that the RTFs and the noise terms are independent zero-mean random variables. The RTFs of different speakers, frequencies, or microphones are assumed to be independent, and the same holds for the noise terms of different frequencies or frames. Further discussion on the validity of these assumptions can be found in [36]. For the sake of simplicity, we assume a unit variance for the real and the imaginary parts of the RTFs and the noise terms in each TF bin. Note that the following derivation also holds for non-unit and non-constant variance by applying a proper normalization.
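As a quick numerical illustration of this relation, the following sketch draws hypothetical random ATFs and a noiseless frame dominated by a single speaker, and verifies that the bin-wise ratio reduces to the RTF; all sizes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
M, F = 4, 512                       # microphones, frequency bins (illustrative)
# hypothetical ATFs of a single speaker and its (noiseless) STFT frame
A = rng.normal(size=(M, F)) + 1j * rng.normal(size=(M, F))   # A_m^j(f)
S = rng.normal(size=F) + 1j * rng.normal(size=F)             # S_j(l, f)
X = A * S                           # frame dominated by this speaker, no noise
ratio = X[1] / X[0]                 # bin-wise ratio: second mic vs. reference mic
rtf = A[1] / A[0]                   # relative transfer function of the speaker
assert np.allclose(ratio, rtf)      # the ratio equals the RTF in every bin
```

With noise present, the ratio deviates from the RTF in noise-dominated bins, which is exactly what the noise term η(l, f) models.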

Feature extraction and correlation analysis
Based on the computed ratios, a feature vector r(l) of length D = 2 · F · (M − 1) is defined for each frame as a concatenation of the real and the imaginary parts of the ratios over the M − 1 microphone pairs and a set of F selected frequency bins:

r(l) = [Re{R_2(l, f_1)}, ..., Re{R_M(l, f_F)}, Im{R_2(l, f_1)}, ..., Im{R_M(l, f_F)}]^T,  (7)

where f_1, ..., f_F ∈ {1, ..., K}.
Based on the presented statistical assumptions, consider the expected bin-wise correlation between two different frames l ≠ n, 1 ≤ l, n ≤ L, given the identity of the dominating components:

E{[r(l)]_k [r(n)]_k | M(l, f_k) = i, M(n, f_k) = j} = 1 if i = j ≤ J, and 0 otherwise.  (8)

Equation (8) states that the conditional correlation equals "1" if the same speaker is active in both TF bins, and "0" if there are different dominating speakers or one of the TF bins is dominated by noise. Thus, according to the law of total expectation, we have:

E{[r(l)]_k [r(n)]_k} = Σ_{j=1}^J p_j(l) p_j(n),  (9)

implying that the bin-wise correlation between the features equals the multiplication of the corresponding activity probabilities. In practice, we can obtain an approximation to this expected correlation by averaging the product of the features over a large number of frequency bins. To this end, the bin-wise correlations can be treated as a sequence of uncorrelated random variables for different values of k (different frequencies or different microphones). Therefore, according to the strong law of large numbers, their sample mean converges almost surely to the mean value given in (9):

(1/D) r^T(l) r(n) → Σ_{j=1}^J p_j(l) p_j(n) a.s.,  (10)

for D → ∞. Note that for the same frame l = n, the expected value does not obey (9) and (10); instead, E{[r(l)]_k^2} = 1, and therefore (1/D) r^T(l) r(l) → 1 almost surely. In the following, we show how the relation in (10) between the feature-wise correlations and the probability products can be exploited to estimate the activity probabilities, using either linear programming theory or convex geometry tools.
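The convergence in (10) can be checked with a small simulation of the assumed model; the frame probabilities, feature dimension, and unit-variance "RTF" values below are illustrative assumptions, not taken from the experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
J, D, L = 2, 200000, 3            # speakers, feature dimension, frames (illustrative)
H = rng.standard_normal((J, D))   # unit-variance "RTF" features, fixed per speaker
# per-frame activity probabilities; the remainder 1 - sum is the noise probability
p = np.array([[1.0, 0.0],         # frame 0: dominated by speaker 1
              [0.0, 1.0],         # frame 1: dominated by speaker 2
              [0.6, 0.3]])        # frame 2: mixed activity
r = np.zeros((L, D))
for l in range(L):
    probs = np.append(p[l], 1.0 - p[l].sum())
    m = rng.choice(J + 1, size=D, p=probs)   # dominating component per bin
    noise = rng.standard_normal(D)           # independent noise features
    r[l] = np.where(m < J, H[np.minimum(m, J - 1), np.arange(D)], noise)

corr = r @ r.T / D                # sample correlations (1/D) r^T(l) r(n)
# off-diagonal entries approximate sum_j p_j(l) p_j(n), cf. Eq. (10)
assert abs(corr[0, 2] - 0.6) < 0.05 and abs(corr[1, 2] - 0.3) < 0.05
```

The sample correlation between the two single-speaker frames (frames 0 and 1) is close to zero, while each of their correlations with the mixed frame recovers the corresponding activity probability.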

Proposed methods
We present two methods for estimating the activity probabilities {p j (l)} j,l of the different speakers at each frame. The estimation is based on the statistical model presented in Section 2.1, and specifically on the correlation between frames (10). The two methods first identify frames that are dominated by one of the speakers, and then infer the probabilities associated with each speaker using the correlations with respect to the identified frames. The first method is described in Section 2.2.1 and is based on the detection of the maximum points of the correlation functions defined for each frame. The second method is described in Section 2.2.2 and is based on the detection of the vertices of the simplex that consists of the correlation vectors defined for each frame.

Maximum correlation method
For each frame l, we define the following linear function of a probability vector q = [q_1, ..., q_J]^T:

t_l(q) = Σ_{j=1}^J p_j(l) q_j,  (11)

where the probabilities {p_j(l)}_{j=1}^J associated with the lth frame are the parameters defining the function. Consider the following optimization problem:

max_q t_l(q) s.t. q_j ≥ 0, 1 ≤ j ≤ J, and Σ_{j=1}^J q_j ≤ 1.  (12)

This is a linear programming (LP) problem, where the constraints, defining the feasible region, specify the J-dimensional probability simplex. The vertices of the simplex are the standard unit vectors {e_j}_{j=1}^J, where e_j = [0, ..., 1, ..., 0]^T with one in the jth entry and zeros elsewhere, and there is an additional vertex at the origin, since the sum Σ_{j=1}^J q_j can be lower than "1" but is constrained to be non-negative. Based on the theory of LP [41], every local maximum is a global maximum, and the maximum is attained at one of the simplex vertices. Note that the function value at the origin is "0"; therefore, the maximum must be attained at one of the other vertices {e_j}_{j=1}^J. According to (10), a set of possible values of the function t_l is given by the correlations between the lth frame and all other frames, i.e.:

t_l(n) ≜ t_l(q(n)) = Σ_{j=1}^J p_j(l) p_j(n) ≈ (1/D) r^T(l) r(n), n ≠ l,  (13)

representing the function value at the point q(n) = p(n). Note that for a specific frame l, the probabilities p(l) = [p_1(l), p_2(l), ..., p_J(l)]^T are treated as fixed parameters determining the structure of the function t_l, while the probabilities of the other frames, q(n), 1 ≤ n ≤ L, n ≠ l, are treated as points at which the function is evaluated.
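The LP argument, that a linear function over the probability simplex attains its maximum at a vertex corresponding to the dominant speaker, can be illustrated numerically; the frame parameters p below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.7, 0.2, 0.1])     # hypothetical parameters p(l) of frame l (J = 3)

def t(q):                         # the linear function t_l(q) of the LP problem
    return p @ q

vertices = np.vstack([np.zeros(3), np.eye(3)])     # vertices of the feasible region
Q = rng.dirichlet(np.ones(4), size=10000)[:, :3]   # random feasible points (q >= 0, sum <= 1)
best_vertex = max(t(v) for v in vertices)
assert best_vertex >= t(Q.T).max() - 1e-9    # the maximum over the region is at a vertex
assert int(np.argmax([t(v) for v in np.eye(3)])) == 0   # namely at e_1: the dominant speaker
```

No random feasible point exceeds the value 0.7 attained at the vertex e_1, which is associated with the speaker of maximum probability in the frame.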
Here, we utilize the equivalence between the feature-wise correlations and the probability products implied by (10) to obtain samples from the function t l , where the sample t l (n) is given by the correlation between the features associated with the lth and the nth frames. Note that we only have the function value t l (n), but not the function parameters p(l) and the point q(l), in which the function was evaluated.
The formulation of the LP problem (12) can be used to detect frames dominated by a single speaker. It is important to clarify that we do not solve the optimization in (12), since the parameters defining the function t_l are unknown. Instead, we utilize the fact that the maximum is attained at a point corresponding to a frame with a single speaker, i.e., with probability 1 for one of the speakers and probability 0 for the others. According to (13), the correlations between the lth frame and all other frames provide L − 1 values of the function t_l. Therefore, we search for the maximum among the given values of t_l. The maximum value is attributed to a frame that is exclusively dominated by a single speaker. In practice, we define the set of L correlation functions {t_l}_{l=1}^L by associating with each frame its correlations with all other frames. Next, we count the number of times each frame is detected as a maximum point of the correlation functions defined by the other frames. This way, we obtain a score function conveying how often each frame serves as a maximum point. Frames that achieve a high score are assumed to be dominated by one of the speakers. To obtain only a single representative frame for each speaker, we select these frames sequentially and eliminate in each step the frames that correspond to speakers that have already been identified.
Examples of the function t_l are given in Fig. 1, for two mixtures of J = {2, 3} speakers. Further details on the generation of the mixtures are given in Section 3. Note that for J = 2, the constraint q_1 + q_2 ≤ 1 specifies a triangle, and for J = 3, the constraint q_1 + q_2 + q_3 ≤ 1 specifies a corner of a cube. The points are colored according to the function value at each point. It can be observed that the maximum is attained at a vertex, which corresponds to the speaker with maximum probability. Note that the maximum is not necessarily unique; for example, a flat function is obtained for a frame with equal speakers' probabilities. We avoid these cases by considering only correlation functions whose maximum value is above a certain threshold.
Based on these observations, we propose an algorithm for sequential recovery of J frames, each of them dominated by one of the J speakers. We assume that for each speaker, there is at least one frame, with index l j , which is entirely dominated by this speaker, i.e., p(l j ) = e j . For the simplicity of the notation, we ignore possible permutation in the order of the identified speakers.
We define a function g which assigns to any frame index l a frame index g(l) with maximum correlation to frame l, i.e.:

g(l) = arg max_{n ∈ S_1, n ≠ l} t_l(n),  (14)

where S_1 = {1, ..., L}. We denote by c(l′) the number of frames whose maximal correlation is attained at the l′th frame, i.e.:

c(l′) = |g^{-1}(l′)|,  (15)

where g^{-1}(l′) = {l ∈ S_1 | g(l) = l′} is the inverse image of g, and | · | denotes set cardinality. The frame associated with the first speaker is the one most frequently detected as a maximum point:

l_1 = arg max_{l′} c(l′).  (16)

The probabilities associated with frame l_1 satisfy q(l_1) = p(l_1) = e_1; hence, in (11), we have:

t_{l_1}(n) = p_1(n).  (17)

Next, we define a smaller subset of frames with low probability of activity of the first speaker:

S_2 = {l ∈ S_1 | t_{l_1}(l) < ε},  (18)

with ε a threshold parameter. A second frame, dominated exclusively by the second speaker, is chosen using the same criterion as in (16), l_2 = arg max_{η ∈ S_2} c(η), where the search now runs over S_2. Limiting the search to frames in S_2 prevents choosing a frame dominated by the first speaker.
Assuming that r − 1 speakers have already been detected, a frame dominated by the rth speaker is identified by:

l_r = arg max_{η ∈ S_r} c(η),  (19)

where

S_r = {l ∈ S_{r-1} | t_{l_{r-1}}(l) < ε}.  (20)

The process is stopped when r = J. The probabilities of the speakers in each frame are estimated by their correlations with the identified frames {l_j}_{j=1}^J:

p̂_j(n) = t_{l_j}(n), 1 ≤ j ≤ J, 1 ≤ n ≤ L.  (21)

In the case the set S_r is empty and r < J, we replace the search rule of (19) by:

l_r = arg min_η Σ_{j=1}^{r-1} t_{l_j}(η).  (22)

Note that according to (21), Σ_{j=1}^{r-1} t_{l_j}(η) = Σ_{j=1}^{r-1} p_j(η); hence, the criterion in (22) amounts to selecting the frame with the lowest activity probability of the speakers that were already detected. The proposed maximum correlation method is summarized in Algorithm 1.
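A minimal sketch of the resulting procedure on a toy correlation matrix is given below; the frame probabilities are hypothetical, and the empty-set fallback rule of (22) is omitted for brevity:

```python
import numpy as np

def max_corr_frames(T, J, eps=0.2):
    # T: L x L matrix of frame correlations, T[l, n] ~ t_l(n) of Eq. (13);
    # empty-set fallback of Eq. (22) omitted in this sketch
    L = T.shape[0]
    off = T - np.diag(np.diag(T))       # exclude self-correlation when maximizing
    g = off.argmax(axis=1)              # g(l): frame with maximum correlation to l
    c = np.bincount(g, minlength=L)     # c(l'): how often l' serves as a maximum point
    S = np.arange(L)                    # S_1 = {1, ..., L}
    picked = []
    for _ in range(J):
        l_r = int(S[c[S].argmax()])     # most frequent maximum point within S_r
        picked.append(l_r)
        S = S[T[l_r, S] < eps]          # S_{r+1}: frames weakly correlated with l_r
    return picked

# toy example: 6 frames, 2 speakers; rows of P are per-frame probabilities
P = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.1],
              [0.1, 0.8], [0.5, 0.4], [0.3, 0.3]])
T = P @ P.T                             # frame correlations implied by Eq. (10)
print(max_corr_frames(T, J=2))          # → [0, 1], the two single-speaker frames
```

Frames 0 and 1, which are entirely dominated by one speaker each, are correctly selected, and all frames correlated with an already identified speaker are excluded before the next selection.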

Correlation simplex method
For the second method, we aggregate the values of the correlation function defined for each frame in an L × 1 correlation vector t_l, defined as:

t_l = [t_l(1), t_l(2), ..., t_l(L)]^T.  (23)

Algorithm 1: Maximum Correlation Method
• Define g(l) by detecting the maximum point of each correlation function (14).
• Define c(l′) by counting the number of times each frame serves as a maximum point (15).
• For r = 1, ..., J: select l_r as the frame in S_r with maximal count c(l_r) (19), and update S_{r+1} = {l ∈ S_r | t_{l_r}(l) < ε} (20).
• Obtain the activity probabilities p̂_j(n) = t_{l_j}(n) (21).
Based on (13), we have:

t_l = Σ_{j=1}^J p_j(l) p_j + d_l,  (24)

where p_j = [p_j(1), p_j(2), ..., p_j(L)]^T is the probability vector of size L × 1, which consists of the probabilities of the jth speaker in each frame. Here, d_l = (1 − Σ_{j=1}^J p_j^2(l)) e_l is a small difference vector that accounts for the deviation in the lth entry due to the fact that the self-correlation of the frame with itself equals "1". This deviation has a negligible effect and is therefore ignored.
According to (24), the correlation vectors {t_l}_{l=1}^L are obtained as convex combinations of {p_j}_{j=1}^J; hence, they lie in the following simplex in R^L:

{Σ_{j=1}^J α_j p_j | α_j ≥ 0, Σ_{j=1}^J α_j ≤ 1},  (25)

where the simplex vertices are {p_j}_{j=1}^J, and there is an additional vertex at the origin, since the sum of the weights can also be lower than one. Therefore, recovering the simplex vertices with indexes {l_j} provides an estimate of the columns of the probability matrix P, i.e.:

p̂_j = t_{l_j}.  (26)

The simplex vertices are detected by means of convex geometry tools using the successive projection algorithm (SPA) [42,43]. In this algorithm, the vertices are detected sequentially by a maximum-norm criterion, where each vector is first projected onto the orthogonal complement of the subspace spanned by the already identified vertices. The proposed correlation simplex method is summarized in Algorithm 2.

Algorithm 2: Correlation Simplex Method
• Compute the correlation vectors {t_l}_{l=1}^L (23).
• Detect the simplex vertices {l_j}_{j=1}^J using the SPA [42,43].
• Obtain the activity probabilities p̂_j = t_{l_j} (26).

Note that the two proposed methods are based on iterative procedures for detecting frames dominated by a single speaker. In the maximum correlation method, the detection criterion is based on how frequently a frame serves as a maximum point, while the correlation simplex method is based on a maximum-norm criterion. These two criteria are related, since a frame that is frequently detected as a maximum has high correlation with other frames, indicating that its correlation vector has a high norm. The main difference is that in the maximum correlation method, the detection criterion is computed only once, and the set of candidate frames is reduced in each step by eliminating frames with non-negligible correlation to previously identified frames, whereas in the correlation simplex method the detection criterion is recomputed in each iteration by first projecting the correlation vectors onto the orthogonal complement of the subspace spanned by the correlation vectors of the previously identified frames. Following this difference, we show in the experimental part that the maximum correlation method is much more computationally efficient.
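A minimal sketch of vertex detection via successive projection, applied to a toy correlation matrix built from hypothetical frame probabilities:

```python
import numpy as np

def spa(T, J):
    # successive projection sketch: pick J columns of T (correlation vectors)
    # as simplex vertices, by repeated max-norm selection + orthogonal projection
    R = T.astype(float).copy()
    idx = []
    for _ in range(J):
        j = int(np.linalg.norm(R, axis=0).argmax())  # column with maximum norm
        idx.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)                      # project out the new vertex
    return idx

# toy example: 6 frames, 2 speakers; rows of P are per-frame probabilities
P = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.1],
              [0.1, 0.8], [0.5, 0.4], [0.3, 0.3]])
T = P @ P.T                    # columns t_l lie in the simplex of Eq. (25)
print(spa(T, J=2))             # → [0, 1], the single-speaker frames (vertices)
```

After the first vertex is projected out, the residual norms of frames correlated with it collapse, so the second iteration selects a frame dominated by the other speaker.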

Relation to Simplex-EVD method
In this section, we discuss the relation between the proposed methods for activity probability estimation and our previously proposed simplex method [36]. Both the simplex algorithm [36] and the proposed methods rely on the statistical model presented in Section 2.1.2, and specifically on the correlation between frames (10). In [36], we define a correlation matrix with L columns that correspond to the correlation vectors defined in (23). We apply eigenvalue decomposition (EVD) to the correlation matrix and obtain a simplex representation embedded in R^J. Next, we detect the simplex vertices and use them to transform the simplex representation, which is based on the computed eigenvectors, to the probability simplex. This method shares some similarity with the correlation simplex method derived in Section 2.2.2. The difference is that in [36], we first apply an EVD of the correlation matrix, while here we use the correlation vectors directly. Accordingly, in [36], we obtain a simplex in R^J, while here we have a simplex in R^L. The maximum correlation method, derived in Section 2.2.1, takes an entirely different approach that is based on LP theory. Note also that both proposed methods estimate the probabilities directly from the correlations, while in [36] they are estimated based on the computed eigenvectors. A diagram summarizing the three methods is depicted in Fig. 2a.
The computational complexity, performance, and computation time of the three methods are compared in Section 3.

[Fig. 2: Block diagrams of (a) the methods for activity probability detection ("Max-Corr", derived in Section 2.2.1; "Simplex-Corr", derived in Section 2.2.2; and "Simplex-EVD", proposed in [36]) and (b) the overall separation system.]

Separation based on activity probability
We present a separation scheme that relies on the estimated activity probabilities. In the first stage, we estimate the spectral mask M(l, f ) based on the activity probabilities. In the second stage, we utilize the estimated spectral mask for the actual separation by applying a multichannel beamformer followed by a single-channel postfilter.

Spectral mask estimation
The spectral mask is estimated per frequency by combining the local relations between the bin-wise ratio features r(l, f) defined in (7) with the activity probabilities p(l) estimated by one of the methods derived in Section 2.2. Specifically, the value of the spectral mask is determined for each TF bin based on the following weighted nearest-neighbor rule:

M̂(l, f) = arg max_j (1/π_j) Σ_{n=1}^L ω_ln(f) p_j(n),  (27)

where the weight ω_ln(f) of each frame n with respect to the inspected frame l is inversely proportional to distances in the feature space defined by {r(l, f)}_{l=1}^L. Particularly, we use the following Gaussian weighting:

ω_ln(f) = exp(−||r(l, f) − r(n, f)||^2 / σ^2).  (28)

In (27), π_j serves as a class normalization and is given by:

π_j = Σ_{n=1}^L p_j(n).  (29)

Note that since the local mappings are aligned using the same global probabilities, the proposed method does not suffer from a permutation ambiguity of the identity of the speakers across the different frequencies. The mask estimation procedure was adopted from [45], with a change in the definition of the local feature vectors. In [45], the feature vectors were defined based on a local simplex representation extracted from the EVD of the correlation matrix defined for each frequency, while here we directly use the ratio values of all microphones.
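A hedged sketch of such a weighted nearest-neighbor rule, with Gaussian weights and class normalization as described above; the kernel width and the toy features below are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

def estimate_mask(r_f, p, sigma=1.0):
    # r_f: (L, d) bin-wise ratio features at one frequency f (real/imag parts)
    # p:   (L, J) estimated per-frame activity probabilities
    d2 = ((r_f[:, None, :] - r_f[None, :, :]) ** 2).sum(-1)  # squared distances
    w = np.exp(-d2 / sigma ** 2)       # Gaussian weights between frames
    pi = p.sum(axis=0)                 # class normalization per speaker
    votes = (w @ p) / pi               # normalized weighted votes per speaker
    return votes.argmax(axis=1)        # dominant speaker index per frame

# toy example: two well-separated feature clusters, two speakers
r_f = np.array([[0.0], [0.1], [5.0], [5.1]])
p = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(estimate_mask(r_f, p))           # → [0 0 1 1]
```

Frames with nearby features vote for the same speaker, so the global probabilities align the local decisions without any frequency-permutation ambiguity.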

Separation and enhancement
The separation is performed based on the estimated spectral mask (27) and is carried out in two stages by applying a multichannel beamformer followed by a single-channel spectral masking. The beamformer utilizes the diversity in the spatial characteristics of each speaker location and the noise, while the spectral masking utilizes the spectral diversity of the signals in the TF domain.
In the first stage, we apply a linearly constrained minimum variance (LCMV) beamformer:

Ẑ_j(l, f) = w_j^H(f) x(l, f),  (30)

where x(l, f) = [X_1(l, f), ..., X_M(l, f)]^T, and the LCMV beamformer is defined by:

w_j(f) = Φ_nn^{-1}(f) C(f) (C^H(f) Φ_nn^{-1}(f) C(f))^{-1} g_j,  (31)

where Φ_nn(f) is the noise power spectral density (PSD) matrix of size M × M and C(f) is an M × J matrix comprising the RTFs of all speakers, i.e., [C(f)]_{m,j} = H_m^j(f). The estimation of the noise PSD matrix and the RTF matrix is summarized in Algorithm 3. The vector g_j ∈ R^J extracts the jth speaker, with one in the jth entry and zeros elsewhere.
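The LCMV weights and their defining constraints can be sketched as follows; the noise PSD and RTF matrices below are randomly generated placeholders rather than estimates from data:

```python
import numpy as np

rng = np.random.default_rng(3)
M, J = 4, 2
# placeholder noise PSD matrix (Hermitian positive definite) and RTF matrix
A = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
Phi = A @ A.conj().T + np.eye(M)                      # Phi_nn(f)
C = rng.normal(size=(M, J)) + 1j * rng.normal(size=(M, J))

PhiC = np.linalg.solve(Phi, C)                        # Phi_nn^{-1} C
W = PhiC @ np.linalg.inv(C.conj().T @ PhiC)           # columns w_1 ... w_J of Eq. (31)

# the LCMV constraints hold: each speaker is passed with unit gain, others nulled
assert np.allclose(C.conj().T @ W, np.eye(J))
```

The final assertion verifies C^H(f) w_j(f) = g_j for all j, i.e., a distortionless response towards each target speaker with nulls towards the others, while the Φ_nn^{-1} factor minimizes the residual noise power.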
In the second stage, residual noise and interference signals at the output of the LCMV beamformer can be further suppressed by applying the estimated spectral mask. Denoting by Ẑ_j(l, f) the LCMV output for the jth speaker, we obtain:

Ŝ_j(l, f) = Ẑ_j(l, f) if I_j(l, f) = 1, and Ŝ_j(l, f) = β Ẑ_j(l, f) otherwise,  (32)

where I_j(l, f) is the indicator function defined in Algorithm 3 and β is an attenuation factor. The entire separation process is summarized in Algorithm 4, and a block diagram is depicted in Fig. 2b.

Results and discussion
The algorithm performance was tested on a dataset that was self-recorded at the Bar-Ilan University (BIU) acoustic lab. We first describe the competing methods, the performance measures used for evaluation, the experimental setup, and the examined scenarios. Next, we present and discuss the results obtained by the proposed methods and the baseline methods.

Competing methods
In all experiments, we compared the proposed methods ("Max-Corr" and "Simplex-Corr") to the simplex method ("Simplex-EVD") [36]. As a baseline method, we used the independent low-rank matrix analysis (ILRMA) algorithm [15], a state-of-the-art blind source separation (BSS) method that unifies IVA and NMF. Moreover, we compared to an ideal separator, which uses an ideal mask computed from the individual signals of each of the speakers.

Algorithm 3: Beamformer Parameter Estimation
• Define the indicator function I_j(l, f), 1 ≤ j ≤ J.
• Noise PSD estimation.
• RTF estimation:
  – Speech PSD estimation.
  – Solve a generalized eigenvalue decomposition (GEVD) problem [33].

Performance measures
The separation performance was evaluated in terms of signal to interference ratio (SIR) and signal to distortion ratio (SDR) measures as defined in [46]. The SIR measure reflects the suppression of interfering speech components with respect to the desired estimated speaker. The SDR measure reflects the preservation of the original speech components of the desired estimated speaker with respect to the corresponding true reference signal. Both measures were evaluated using the BSS-Eval toolbox [46].

Experimental setup
We describe the setup for the recordings carried out at the BIU acoustic lab. The room, of size 6 × 6 × 2.4 m, is equipped with controllable panels mounted over the ceiling, the floor, and the walls, which are used to adjust the reverberation time. In this experiment, the panels were adjusted to create two reverberation levels: a low reverberation level set to T60 ≈ 150 ms and a high reverberation level set to T60 ≈ 550 ms. The recordings included 20 human participants (10 females and 10 males), sitting in 6 possible seats around a rectangular table. The participants were recorded individually, one at a time, to enable the generation of various multi-party scenarios. Each participant was recorded 12 times for each seat and for each reverberation level. In each recording, the speaker uttered five different sentences of about 5 s each, with a pause of about 5 s between two subsequent sentences. The sentences were unique for each speaker and for each seat. The signals were measured by 24 AKG CK32 omnidirectional microphones, placed in the room and mounted on the table. The room layout with the positions of the speakers and the microphones is depicted in Fig. 3a, and a photo of the room setup is presented in Fig. 3b.

Algorithm 4: Separation Process
• Compute the ratios {R_m(l, f)}_{l,f,m} (4).
• Spectral mask estimation:
  – Estimate the activity probabilities by Algorithm 1 or 2.
  – For each frequency f = 1 : K, estimate the spectral mask (27).
• Apply the LCMV beamformer (30) followed by spectral masking (32).
In addition, a babble noise was recorded, imitating a diffuse noise field that arrives evenly from all directions. To generate this noise, 8 loudspeakers were placed at each corner of the room and at the middle of each of the walls, as illustrated in Fig. 3a. The loudspeakers were pointed towards the walls and were recorded while playing babble noise signals.
In the experiments presented here, we utilized the measurements of a subset of M = 8 microphones, with indexes 25-32 in the setup depicted in Fig. 3a. The signals were acquired with 24-bit resolution and a 48 kHz sampling rate, and were then downsampled to 16 kHz for further processing. The signals were analyzed in the STFT domain with a window length of K = 2048 samples and 75% overlap between adjacent frames. The feature vectors (7) consist of F = 257 frequency bins, corresponding to 1-3 kHz. Note that we focus on a range that contains most of the speech power, and exclude low frequencies, which mostly contain noise, as well as high frequencies, in which there is typically only low speech energy. The following parameter values were used: ε = 0.2, β = 0.3. The parameters were chosen empirically to obtain good and stable results. The parameter ε (18) is a probability threshold used for excluding frames correlated with sources that have already been identified, and therefore should be low enough to exclude all the frames associated with the detected sources but high enough so that the remaining sources are not missed. The parameter β (32) controls a trade-off between speech distortion and noise and interference suppression: as β decreases, we obtain better SIR but lower SDR, and vice versa.
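A quick sanity check of the quoted STFT parameters, assuming the band edges of 1 and 3 kHz fall exactly on bin centers:

```python
fs, K = 16000, 2048                    # sampling rate [Hz] and STFT window length
df = fs / K                            # frequency resolution: 7.8125 Hz per bin
lo, hi = round(1000 / df), round(3000 / df)   # bin indexes covering 1-3 kHz
F = hi - lo + 1                        # number of selected frequency bins
print(lo, hi, F)                       # → 128 384 257, matching F = 257 above
```

The 1-3 kHz band thus maps exactly onto bins 128 through 384, yielding the F = 257 bins used in the feature vectors (7).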

Examined scenarios
The performance was examined on two scenarios, as summarized in Table 1. The first scenario consists of mixtures of 4 speakers, where the first speaker has infrequent activity while the other three have balanced activity. The first speaker is active only for a short duration, and then the other three speakers start speaking one after the other. In the second scenario, there are mixtures of J speakers with balanced activity.
At the beginning of each mixture signal, there is a 1 s-long segment containing noise only. The total length of each signal is 20 s. For each condition in each scenario, we conducted 50 Monte Carlo (MC) trials. In each trial, a random subset of speakers and seats was chosen. Representative timelines of the two scenarios are given in Fig. 4.
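The random drawing of a trial configuration can be sketched as follows. This is a hypothetical helper for illustration only; the seat and speaker labels mirror the recording setup (20 participants, 6 seats) but the sampling code is not the paper's:

```python
import random

SEATS = [1, 2, 3, 4, 5, 6]       # six possible seats around the table
SPEAKERS = list(range(1, 21))    # 20 recorded participants

def draw_trial(num_speakers, seed=None):
    """Draw one Monte Carlo trial: a random subset of speakers,
    each assigned to a distinct random seat."""
    rng = random.Random(seed)
    speakers = rng.sample(SPEAKERS, num_speakers)
    seats = rng.sample(SEATS, num_speakers)
    return list(zip(speakers, seats))

trial = draw_trial(4, seed=0)
print(trial)  # e.g. four distinct (speaker, seat) pairs
```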

Results
For the first scenario, we examined two levels of activity of the first speaker, namely 5% and 10% activity percentages with respect to the entire signal duration. The signal-to-noise ratio (SNR) was set to 20 dB. The scores of the first speaker were averaged over the 50 trials, and the scores of the three balanced speakers were averaged together over all 50 trials. The average SIR scores are given in Table 2, and the average SDR scores in Table 3, for the two reverberation levels.
We observe the superiority of the proposed methods over ILRMA in all cases, especially for the infrequent speaker. The "Simplex-EVD" and "Simplex-Corr" methods obtain similar scores for the balanced speakers, which are higher than those obtained by the "Max-Corr" method. Both the "Simplex-Corr" and "Max-Corr" methods outperform the "Simplex-EVD" method with respect to the infrequent speaker. For this speaker, we observe an advantage of the "Max-Corr" method over the "Simplex-Corr" method, except in the case of high reverberation and 10% activity percentage.
For the second scenario, we first examined the performance with respect to the noise level for mixtures of J = 3 speakers with balanced activity. We carried out 50 MC trials for each SNR level and averaged the obtained scores over all trials and over the three speakers. The average SIR and SDR scores are depicted in Fig. 5. It can be seen that the performance of the "Simplex-Corr" method is comparable to that of the "Simplex-EVD" method, and both are superior to the "Max-Corr" method. In addition, both the "Simplex-Corr" and "Max-Corr" methods achieve better results than the ILRMA algorithm.
In order to show that the obtained performance trends are not tailored to a specific array configuration, we repeated this experiment with a different array constellation. We used a uniform linear array (ULA), located on the room table, which consists of six microphones, indexed as 1-6 in Fig. 3a. The results obtained for mixtures of 3 speakers are depicted in Fig. 6. The performance trends obtained for the ULA are similar to those obtained for the distributed array in Fig. 5, and here too the proposed methods outperform ILRMA. In general, the SIR scores are lower and the SDR scores are higher for the ULA compared to the distributed configuration.
For the second scenario of balanced speakers, we also evaluated the performance with respect to the number of speakers, based on microphones 25-32. The obtained scores are depicted in Fig. 7, where each point represents an average over 50 MC trials and over all speakers. The SNR was set to 20 dB. We observe a decrease in the separation scores obtained by all methods as the number of speakers increases. Similar trends are observed here as in the evaluation with respect to the noise level, namely the comparable performance of the "Simplex-Corr" and "Simplex-EVD" methods, which is superior to that of the "Max-Corr" method. As for the ILRMA algorithm, it outperforms the "Max-Corr" method in terms of the SDR score only in high reverberation conditions with a large number of speakers, and achieves lower scores than both proposed methods in all other cases.
In addition, we conducted an experiment to evaluate the running time of the three methods for activity probability estimation. The algorithms were implemented in Matlab on a standard PC (CPU Intel Core2 Quad 3.7 GHz, RAM 8 GB). The running times of each algorithm for mixtures of J = 3 speakers are summarized in Table 4 for different recording lengths. Each running time in the table is obtained by an average over 4 trials. We observe that the "Max-Corr" method achieves the lowest running times.
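A timing harness of the kind used for this comparison can be sketched as follows. The helper `average_runtime` is our illustration, not the paper's MATLAB code:

```python
import time

def average_runtime(estimate_fn, mixtures, repeats=4):
    """Average wall-clock time (seconds) per mixture of an
    activity-probability estimation method, averaged over several
    trials as in the reported running-time comparison."""
    total = 0.0
    for _ in range(repeats):
        for mix in mixtures:
            t0 = time.perf_counter()
            estimate_fn(mix)  # run the method under test
            total += time.perf_counter() - t0
    return total / (repeats * len(mixtures))

# Toy usage with a placeholder "method"
avg = average_runtime(lambda mix: sorted(mix), [[3, 1, 2]], repeats=2)
print(f"{avg:.6f} s per mixture")
```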

Activity detection of a counseling session
Finally, we demonstrate the performance of the proposed methods on real recordings of a psychological counseling session, recorded at the BIU Psychotherapy Research Lab. For this purpose, we used a two-lapel-microphone recording of a client, who speaks most of the time, and a therapist, who is involved only during short time segments. Figure 8 depicts the two-channel measured waveforms. On top, asterisks denote the true and estimated time instances of activity of each speaker. The true annotation was determined manually with the Praat software [47]. We observe that the "Simplex-EVD" method detects only the client, while the proposed methods successfully detect the activity of both the client and the therapist almost all of the time, even when they overlap.

Discussion
We conclude that the proposed methods obtain high separation scores for both balanced and infrequent speakers and outperform the ILRMA algorithm in most cases under various noise and reverberation conditions. The robustness of the proposed methods to reverberation may be attributed to the fact that for both the activity detection and the separation, we use the RTF, which encompasses the full reflection pattern. For the activity detection, we differentiate between the speakers using RTF-based features and thus obtain robustness to reverberation compared to methods that rely on the direct path only, which may be masked by reflections in highly reverberant conditions. For the separation, we apply a beamformer constructed with a steering vector based on the RTF rather than the direct path only, which results in milder distortion of the speech signal due to the preservation of the entire speech power arriving from both the direct and reflected paths, as was shown in [39]. Note also that relying on the RTFs, which provide richer spatial information than the direct path alone, also has the potential of separating sources that are located one behind the other, as was demonstrated in [48] and [36] (see Fig. 8 therein).

Table 2 Distributed array: SIR scores for mixtures with unbalanced activity, for "Low"/"High" reverberation and for "5%"/"10%" activity of the 1st speaker

Table 3 Distributed array: SDR scores for mixtures with unbalanced activity, for "Low"/"High" reverberation and for "5%"/"10%" activity of the 1st speaker
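The RTF-steered beamforming discussed above can be illustrated with a minimal MVDR sketch. This is an assumed textbook formulation for illustration; the paper's actual beamformer and any regularization may differ:

```python
import numpy as np

def mvdr_weights(Phi_nn, rtf):
    """MVDR beamformer steered by a relative transfer function (RTF):
    w = Phi^{-1} d / (d^H Phi^{-1} d). Steering by the RTF (rather than
    a direct-path steering vector) preserves the source as received,
    direct path plus reflections, at the reference microphone."""
    d = rtf.reshape(-1, 1)
    Phi_inv_d = np.linalg.solve(Phi_nn, d)   # Phi^{-1} d
    return (Phi_inv_d / (d.conj().T @ Phi_inv_d)).ravel()

# Toy check: with an identity noise covariance and an arbitrary complex
# RTF, the distortionless constraint w^H d = 1 holds.
rtf = np.array([1.0, 0.5 + 0.2j, -0.3j])
w = mvdr_weights(np.eye(3, dtype=complex), rtf)
print(np.allclose(w.conj() @ rtf, 1.0))  # True
```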
For speakers with balanced activity, the best performance is achieved by the "Simplex-Corr" and "Simplex-EVD" methods, with a small advantage for the latter. For infrequent speakers with low activity, the "Simplex-Corr" and "Max-Corr" methods are preferable over the "Simplex-EVD" method, where the "Max-Corr" method performs best in the case of very low activity. The lowest computational time is achieved by the "Max-Corr" method. Table 5 presents several scenarios and the method that would be preferred in each case. The differences in the performance of the "Simplex-EVD," "Simplex-Corr," and "Max-Corr" methods can be explained as follows. The "Simplex-EVD" method performs global processing using an EVD and hence may have the best performance in standard cases, but it often misses low-activity speakers, which have a minor contribution to the obtained decomposition. In contrast, the local approach taken by both proposed methods is more sensitive to infrequent participants. It turns out that in most cases, the performance of the "Simplex-Corr" method is preferable; however, the "Max-Corr" method is more sensitive to speakers with very low activity and also has the advantage of lower computational complexity.

Conclusions
We presented two novel methods for multichannel speaker separation. For the first method, it is shown that the maxima of the correlation function between different frames correspond to single-speaker frames. Accordingly, we propose an algorithm for the sequential recovery of frames dominated by each speaker, and in turn, we use their correlations as estimates of the activity probabilities. In the second method, single-speaker frames correspond to vertices of the simplex defined by the correlation vectors and are detected by means of convex geometry. A spatial mask is estimated from the recovered probabilities and is utilized for the actual separation of the mixture. Both proposed methods show high separation capabilities in real-life scenarios with different reverberation and noise levels, and especially in the challenging scenario of speakers with low activity. The maximum correlation method performs better for speakers with very low activity and is also more computationally efficient, while the correlation simplex method performs better for speakers with balanced activity, especially in adverse conditions of high noise and reverberation.