Learning-based robust speaker counting and separation with the aid of spatial coherence
EURASIP Journal on Audio, Speech, and Music Processing volume 2023, Article number: 36 (2023)
Abstract
A three-stage approach is proposed for speaker counting and speech separation in noisy and reverberant environments. In the spatial feature extraction, a spatial coherence matrix (SCM) is computed using whitened relative transfer functions (wRTFs) across time frames. The global activity functions of each speaker are estimated from a simplex constructed using the eigenvectors of the SCM, while the local coherence functions are computed from the coherence between the wRTFs of a time-frequency bin and the global activity function-weighted RTF of the target speaker. In speaker counting, we use the eigenvalues of the SCM and the maximum similarity of the inter-frame global activity distributions between two speakers as the input features to the speaker counting network (SCnet). In speaker separation, a global and local activity-driven network (GLADnet) is used to extract each independent speaker signal, which is particularly useful for highly overlapping speech signals. Experimental results obtained from real meeting recordings show that the proposed system achieves superior speaker counting and speaker separation performance compared to previous publications, without prior knowledge of the array configuration.
1 Introduction
Blind speech separation (BSS) involves the extraction of individual speech sources from a mixed signal without prior knowledge of the speakers and mixing systems [1]. BSS finds application in smart voice assistants, hands-free teleconferencing, automatic meeting transcription, etc., where only mixed signals from single or multiple microphones are available. Several BSS algorithms have been developed based on different assumptions about the characteristics of the speech sources and the mixing systems [2,3,4,5,6,7,8,9]. Learning-based BSS approaches have recently received increased research attention due to advances in deep learning hardware and software. Promising results have been obtained using single-channel neural networks (NNs) [10,11,12,13,14,15]. To further improve separation performance, techniques that exploit the spatial information embedded in microphone array signals have begun to emerge [16,17,18,19]. However, most of these BSS techniques assume a known number of speakers prior to separation. As a key step prior to speaker separation, speaker counting [20] is examined next.
Some studies have assumed a maximum number of speakers during speaker separation [15, 21,22,23]. Another approach is to extract speech signals in a recursive manner [24,25,26], where the BSS problem has been tackled by a multi-pass source-extraction procedure based on a recurrent neural network (RNN). In contrast to the previous methods that use implicit speaker counting for separation, a multi-decoder DPRNN [27] uses a count-head to infer the number of speakers and multiple decoder heads to separate the signals. A speaker counting technique has been proposed using a scheme that alternates between speech enhancement and speaker separation [28]. Instead of exhaustive separation, one can selectively extract only the target speech signal with the help of auxiliary information such as video images [29, 30], pre-enrolled utterances [31,32,33], and the location of the target speaker [34,35,36,37]. Although the target speaker extraction approach leads to significant performance improvements, the auxiliary information may not always be accessible. To overcome this problem, the speaker activity-driven speech extraction neural network [38] has been proposed to facilitate target speaker extraction by monitoring speaker activity. However, this network is susceptible to adverse acoustic conditions because it relies on speaker activity information alone. In such circumstances, multichannel approaches may be more advantageous than single-channel approaches. For example, deep clustering-based speaker counting and mask estimation have been incorporated into masking-based linear beamforming for speaker separation tasks [39]. Chazan et al. presented the use of a deep neural network (DNN)-based single-microphone concurrent speaker detector for source counting, followed by beamformer coefficient estimation for speaker separation [40, 41].
Despite the promising results obtained with DNN-based approaches, most network models require a large amount of training data. Another limitation is that identical array configurations in the test and training phases are preferred. Therefore, DSP-based approaches may have certain advantages [42]. Laufer-Goldshtein et al. proposed the global and local simplex separation algorithm by exploiting the correlation matrix of relative transfer functions (RTFs) across time frames [43]. The number of speakers is determined from the eigenvalue decay of the correlation matrix. The activity probabilities of each speaker are estimated from the simplex formed by the eigenvectors. In the separation stage, a spectral mask is computed for the identified dominant speakers, followed by spatial beamforming and post-filtering. Although the simplex-based approach is very effective in most cases, it does not work well for low-activity speakers [44].
In general, DNN-based approaches show promise, but require extensive training data and may not generalize well to unseen array configurations. DSP-based approaches require no training and often allow for low-resource implementation, but their performance depends on the array configuration. While the deep clustering-based speaker counting and mask estimation methods [39,40,41] are also array-configuration-agnostic, their speaker counting relies on a single-channel input feature, which can degrade counting performance in adverse acoustic conditions. Furthermore, the separation performance of these methods depends on the array configurations used.
The goal of this study is twofold. First, we reformulate a spatial feature that significantly improves the performance and robustness of source counting and separation. Second, we seek to leverage the strengths of DSP-based and learning-based methods for improved speaker counting and speaker separation performance, with robustness to unseen room impulse responses (RIRs) and array configurations. Inspired by the work of Gannot et al. [43, 45], which is a purely DSP-based approach, we propose a robust speaker counting and activity-driven speaker separation algorithm that combines statistical preprocessing and a neural network backend. We formulate a modified spatial coherence matrix based on whitened relative transfer functions (wRTFs) as a spatial signature of directional sources. The whitening procedure provides spectrally rich phase information that proves to be a robust spatial signature for dealing with mismatched array configurations. In the speaker counting stage, our approach attempts to reliably estimate the number of active speakers in low-SNR and low-activity scenarios by incorporating eigenvalues of the spatial coherence matrix and the maximum similarity between the global activity distributions. In the speaker separation stage, the local coherence functions of each speaker are computed using the coherence between the wRTFs of each time-frequency (TF) bin and the wRTF weighted by the corresponding global activity function. The target masks for each speaker are estimated using a global and local activity-driven network (GLADnet), which remains effective for “mismatched” RIRs and array configurations not included in the training data.
We train our DNN models with RIRs simulated using the image-source method [46], while the trained models are tested using the measured RIRs recorded at Bar-Ilan University [47]. Real-life recordings from the LibriCSS meeting corpus [48] are also used to validate the proposed separation networks. In this study, the proposed speaker counting and speaker separation algorithms are compared with the simplex-based methods developed by Laufer-Goldshtein et al. [43] in terms of F1 scores and confusion matrices. Perceptual evaluation of speech quality (PESQ) [49] and word error rate (WER) are adopted as the performance measures in speaker separation tasks.
While inspired by Ref. [43], this study presents three main contributions that differ from the previous work. First, a learning-based robust speaker counting and activity-driven speaker separation algorithm is developed. Second, a modified spatial coherence matrix is formulated to effectively capture the spatial information of independent speakers. A novel idea based on the maximum similarity between the global activity distributions of two speakers over time frames is explored as an input feature for speaker counting. Third, an array-configuration-agnostic GLADnet informed by the global and local speaker activities is proposed.
The remainder of this paper is organized as follows. Section 2 presents the problem formulation and a brief review of the simplex-based approach, which is used as the baseline in this study. Section 3 presents the proposed speaker counting and speaker separation system. In Section 4, we compare the proposed system with several baselines through extensive experiments. Section 5 concludes the paper.
2 Problem formulation and the baseline approach
2.1 Problem formulation
Consider a scenario in which the utterances of J speakers are captured by M distant microphones in a reverberant room. We assume that there is no prior knowledge of the array configuration. The array signal model is described in the short-time Fourier transform (STFT) domain. The received signal at the mth microphone can be written as
\({X}^{m}\left(l,f\right)={\sum }_{j=1}^{J}{A}_{j}^{m}\left(f\right){S}_{j}\left(l,f\right)+{V}^{m}\left(l,f\right)\)
where l and f denote the time frame index and frequency bin index, respectively; \({A}_{j}^{m}\left(f\right)\) denotes the acoustic transfer function (ATF) between the mth microphone and the jth speaker; \({S}_{j}\left(l, f\right)\) denotes the signal of the jth speaker; and \({V}^{m}\left(l,f\right)\) denotes the additive sensor noise. This study aims to estimate the number of speakers J (speaker counting) and extract independent speaker signals from the microphone mixture signals without information about the sources and the mixing process.
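As a quick illustration, the mixture model above can be simulated directly in the STFT domain. The array sizes and the random ATFs and spectrograms below are arbitrary stand-ins for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
M, J, L, F = 4, 2, 50, 257  # mics, speakers, frames, frequency bins (assumed sizes)

# Frequency-domain ATFs A_j^m(f) and speaker STFTs S_j(l, f) (random stand-ins)
A = rng.standard_normal((J, M, F)) + 1j * rng.standard_normal((J, M, F))
S = rng.standard_normal((J, L, F)) + 1j * rng.standard_normal((J, L, F))
V = 0.01 * (rng.standard_normal((M, L, F)) + 1j * rng.standard_normal((M, L, F)))

# X^m(l, f) = sum_j A_j^m(f) S_j(l, f) + V^m(l, f)
X = np.einsum('jmf,jlf->mlf', A, S) + V
```

Counting and separation then operate on `X` alone, without access to `A`, `S`, or `V`.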
2.2 Baseline method: the simplexbased approach
In this section, we present the baseline by revisiting [43]. The simplex-based approach [43, 44] is based on the global and local simplex representations and relies on the assumption of speech sparsity in the STFT domain [50]. By assuming speech sparsity, each TF bin is dominated by either a single speaker or the noise. The ideal indicator selected in each TF bin is \({I}_{j}\left(l,f\right)=1\) if the \(\left(l,f\right)\) bin is dominated by speaker \(j\), and \({I}_{j}\left(l,f\right)=0\) otherwise.
If a TF bin is not dominated by any speaker, it is dominated by noise, i.e., \({\sum }_{j=1}^{J}{I}_{j}\left(l, f\right)=0\). Let \({p}_{j}^{G}\left(l\right)\) denote the global activity associated with the jth speaker in the lth frame. Note that the global activities \({\left\{{p}_{j}^{G}\left(l\right)\right\}}_{j=1}^{J}\) depend only on the frame index, not on the frequency index.
2.2.1 Spatial feature extraction
Assuming speech sparsity in the TF domain, the relative transfer function (RTF) [51], which represents the ratio between the ATF of the mth microphone and the ATF of the first (reference) microphone, can be written as follows:
In the following, a feature vector \(\mathbf{r}\left(l\right)\) for each frame l is defined to comprise \(D=2\times \left(M-1\right)\times K\) elements of the real and imaginary parts of the computed ratios (4) for \(1\le k\le K\) frequency bins in the \(\left(M-1\right)\) non-reference microphone signals:
where \({\left\{{f}_{k}\right\}}_{k=1}^{K}\) are the selected frequencies. The correlation matrix \(\mathbf{W}\in {\mathbb{R}}^{L\times L}\) is computed, where \({\left[\mathbf{W}\right]}_{ln}=\frac{1}{D}{\mathbf{r}}^{T}\left(l\right)\mathbf{r}\left(n\right)\). W can be approximated as [45]
\(\mathbf{W}\approx \mathbf{P}{\mathbf{P}}^{T},\)
where \(\mathbf{P}=\left[{\mathbf{p}}_{1}^{G} \dots {\mathbf{p}}_{J}^{G}\right]\in {\mathbb{R}}^{L\times J}\) is composed of the global activity vectors \({\mathbf{p}}_{j}^{G}={\left[{p}_{j}^{G}\left(1\right)\dots {p}_{j}^{G}\left(L\right)\right]}^{T}\in {\mathbb{R}}^{L\times 1}\) associated with the jth speaker.
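The frame-wise correlation matrix defined above can be sketched in a few lines of NumPy. The feature dimensions and the helper name `correlation_matrix` are illustrative assumptions:

```python
import numpy as np

def correlation_matrix(r):
    """Frame-wise correlation matrix [W]_{ln} = (1/D) r(l)^T r(n),
    given real RTF feature vectors r of shape (L, D)."""
    L, D = r.shape
    return (r @ r.T) / D

rng = np.random.default_rng(1)
r = rng.standard_normal((100, 64))  # L = 100 frames, D = 64 features (arbitrary)
W = correlation_matrix(r)
```

`W` is symmetric by construction, and each entry is a scaled inner product between two frames' RTF features.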
2.2.2 Speaker counting
For J independent speakers, the matrix P should have rank J. It follows that the number of speakers can be determined by counting the principal eigenvalues of the correlation matrix W. However, selecting an appropriate threshold is not straightforward due to complex acoustic conditions. To circumvent threshold selection, the speaker counting problem has been formulated as a classification problem [43], where each class corresponds to a different number of speakers. A feature vector consisting of the first \(J'\) principal eigenvalues of the correlation matrix is used as the input to the classifier:
\({\mathbf{f}}_{\mathrm{baseline}\,1}={\left[{\lambda }_{1},{\lambda }_{2},\dots ,{\lambda }_{{J}'}\right]}^{T},\)
where \(J'\) is the maximum possible number of speakers and is set to 4 in this study. The multiclass support vector machine (SVM) is used as the classifier in [43].
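The rank-J argument can be checked numerically. The sketch below builds a rank-2 matrix \(\mathbf{W}=\mathbf{P}{\mathbf{P}}^{T}\) from synthetic global activities and shows that only two principal eigenvalues survive; the sizes and the helper `eigenvalue_feature` are illustrative, and the multiclass SVM itself is omitted:

```python
import numpy as np

def eigenvalue_feature(W, j_max=4):
    """Top-j_max eigenvalues of the L x L correlation matrix, sorted in
    descending order, used as the classifier input feature."""
    eigvals = np.linalg.eigvalsh(W)[::-1]  # eigvalsh returns ascending order
    return eigvals[:j_max]

rng = np.random.default_rng(2)
P = np.abs(rng.standard_normal((200, 2)))  # synthetic activities for J = 2 speakers
W = P @ P.T                                # rank-2 by construction
feat = eigenvalue_feature(W)
```

For this rank-2 `W`, the third and fourth entries of `feat` are numerically zero, which is exactly the eigenvalue-decay cue the classifier exploits.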
2.2.3 Speaker separation
Once the number of speakers (J) is available, the eigenvectors associated with the J largest eigenvalues are selected to form, for each frame l, the global mapping vector
\({\mathbf{v}}^{G}\left(l\right)={\left[{u}_{1}\left(l\right),\dots ,{u}_{J}\left(l\right)\right]}^{T},\)
where \({u}_{j}\left(l\right)\) denotes the lth element of the jth eigenvector.
According to [43, 45], the global mapping vector \(\mathbf{v}^{G}(l)\) can be expressed as a linear transformation of the global activity vector \(\mathbf{p}^{G}(l)\), with embedded information of speaker activities. The successive projection algorithm [52] can be applied to identify the simplex vertices and construct the transformation matrix \(\mathbf{G} = [\mathbf{v}^{G}(l_{1}), \mathbf{v}^{G}(l_{2}),\ldots,\mathbf{v}^{G}(l_{J})]\), where \({\left\{{l}_{j}\right\}}_{j=1}^{J}\) represents the frame indices of the simplex vertices. Hence, the global activity can be computed by inverting this transformation.
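A minimal sketch of the successive projection algorithm, in its standard form (pick the largest-norm row, deflate, repeat). The synthetic data below, with two pure-speaker frames among mixed frames, is an illustrative assumption, not data from the paper:

```python
import numpy as np

def successive_projection(V, J):
    """Successive projection algorithm (SPA): pick J simplex vertices from the
    rows of V (L x J global mapping vectors). Returns the chosen frame indices."""
    R = V.astype(float).copy()
    idx = []
    for _ in range(J):
        l = int(np.argmax(np.sum(R**2, axis=1)))  # frame with the largest norm
        idx.append(l)
        u = R[l] / np.linalg.norm(R[l])
        R = R - np.outer(R @ u, u)                # project out the chosen direction
    return idx

# Frames are convex combinations of two vertices; frames 3 and 7 are pure.
t = np.linspace(0.2, 0.8, 10)
V = np.stack([t, 1 - t], axis=1)
V[3] = [1.0, 0.0]
V[7] = [0.0, 1.0]
vertices = successive_projection(V, 2)
```

Pure-speaker frames sit at the simplex vertices and have the largest norm, so SPA recovers frames 3 and 7 here.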
For the local mapping, each TF bin is assigned to a dominant speaker or noise. The spectral mask can be obtained by using the weighted nearest-neighbor rule.
where \({\pi }_{j}={\sum }_{n=1}^{L}{p}_{j}^{G}\left(n\right)\) denotes the class normalization factor and \({\omega }_{ln}\left(f\right)\) is a Gaussian weighting function [33]:
that is inversely related to the distance in the space defined by the local representation \({\left\{\mathbf{r}\left(l,f\right)\right\}}_{l=1}^{L}\) between frame n and frame l. The signal of the jth speaker can be estimated by applying the spectral mask in (11) to the reference microphone signal:
where \(\beta\) is the attenuation factor to avoid musical noise. In this paper, \(\beta\) is set to 0.2 as in [43].
A linearly constrained minimum variance (LCMV) beamformer can be used to extract each independent speaker signal [43, 44], with the weights given below
where \(\widehat{\mathbf{A}}\left(f\right)=\left[{\widehat{\mathbf{a}}}_{1}\left(f\right),\dots ,{\widehat{\mathbf{a}}}_{J}\left(f\right)\right]\in {\mathbb{C}}^{M\times J}\) denotes the RTF matrix, with \({\widehat{\mathbf{a}}}_{j}\left(f\right)={\left[{\widehat{A}}_{j}^{1}\left(f\right), {\widehat{A}}_{j}^{2}\left(f\right),\dots ,{\widehat{A}}_{j}^{M}\left(f\right)\right]}^{T}\) the RTF vector of the jth speaker, and \({\mathbf{R}}_{nn}\left(f\right)\) is the noise covariance matrix. In this study, only sensor noise is assumed, i.e., \({\mathbf{R}}_{nn}={\sigma }_{nn}\mathbf{I}\). As a result, (14) reduces to
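Under the white-noise assumption, the LCMV solution reduces to \({\mathbf{w}}_{j}=\widehat{\mathbf{A}}{({\widehat{\mathbf{A}}}^{H}\widehat{\mathbf{A}})}^{-1}{\mathbf{g}}_{j}\), which satisfies \({\widehat{\mathbf{A}}}^{H}{\mathbf{w}}_{j}={\mathbf{g}}_{j}\): the target speaker is passed undistorted while the other speakers are nulled. A small NumPy check, with a random complex matrix standing in for the estimated RTFs:

```python
import numpy as np

def lcmv_weights_white_noise(A_hat, j):
    """LCMV weights under white sensor noise (R_nn = sigma^2 I):
    w_j = A_hat (A_hat^H A_hat)^{-1} g_j, passing speaker j and nulling the rest."""
    M, J = A_hat.shape
    g = np.zeros(J)
    g[j] = 1.0
    return A_hat @ np.linalg.solve(A_hat.conj().T @ A_hat, g)

rng = np.random.default_rng(3)
A_hat = rng.standard_normal((6, 3)) + 1j * rng.standard_normal((6, 3))  # M = 6, J = 3
w0 = lcmv_weights_white_noise(A_hat, 0)
constraints = A_hat.conj().T @ w0  # should equal g_0 = [1, 0, 0]
```

The distortionless constraint on speaker 0 and the nulls on speakers 1 and 2 hold to machine precision.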
where the RTF of the jth speaker can be estimated by
where \({\mathcal{L}}_{j}=\left\{l \mid {p}_{j}^{G}\left(l\right)>\varepsilon ,\ l\in \left\{1,\dots ,L\right\}\right\}\) denotes the set of frames dominated by the jth speaker, and ε = 0.2 is an activity threshold.
To further attenuate the residual noise and interference, a single-channel mask is applied [43, 44], as given by
where the vector \(\mathbf{x}\left(l,f\right)={\left[{X}^{1}\left(l,f\right),\dots ,{X}^{M}\left(l,f\right)\right]}^{T}\) denotes the microphone signals, \({\mathbf{g}}_{j}\in {\mathbb{R}}^{J\times 1}\) is a one-hot vector with one in the jth entry and zeros elsewhere, and β = 0.2 is a small factor to prevent musical noise.
3 Proposed method
Inspired by the above simplexbased approach, we develop a robust speaker counting and separation system by exploiting spatial coherence features of array signals, as illustrated in Fig. 1. The system consists of three modules: the feature extraction module (Section 3.1), the speaker counting module (Section 3.2), and the speaker separation module (Section 3.3), as detailed in the sequel.
3.1 Spatial feature extraction
The simplex-based method [43] exploits the spatial information provided by the microphone array. As a result, spatial feature extraction plays a critical role in the subsequent speaker counting and separation algorithms. Instead of the RTF used in [43], in this study, we extract spatial information by whitening the RTFs with no change in phase to enhance the spatial signature of the directional source, analogous to generalized cross-correlation with phase transformation (GCC-PHAT) [53]. In light of the uncertainty principle [54], this helps to improve the time-domain resolution for the computation of the spatial coherence matrix. Instead of the real feature vector used in the simplex-based approach, a “whitened” complex feature vector \(\tilde{\mathbf{r}}(l)\) is defined as follows:
where \(R^{m}(l,f)\) is defined in (4), and \(\{f_{k}\}^{K}_{k=1}\) is the selected frequency band as in (5). Next, we construct a spatial coherence matrix \(\tilde{\mathbf{W}} \in \mathbb{R}^{L \times L}\) with the lnth entry defined as
where Re{·} is the real-part operator, \(\Vert \cdot \Vert\) denotes the \({l}_{2}\)-norm, and \(\tilde{D} = \Vert \tilde{\mathbf{r}}(l)\Vert \, \Vert \tilde{\mathbf{r}}(n)\Vert = (M-1)K\) due to the fact that the feature vectors have been whitened. Note that the complex inner product of \(\tilde{\mathbf{r}}(l)\) and \(\tilde{\mathbf{r}}(n)\) is computed, which can also be regarded as a sign-sensitive cosine similarity based on the Euclidean angle [55]. An example of the spatial correlation matrix computed using the method reported in [43,44,45] and the proposed spatial coherence matrix are compared in Fig. 2, which is generated using a 12-second clip of a three-speaker mixture captured by an eight-element uniform linear array (ULA) with inter-element spacing of 8 cm. The image in Fig. 2(b) is preferable to Fig. 2(a) because the time span of the proposed spatial coherence matrix aligns better with the ground-truth activity than the baseline, especially at the overlaps, as shown by the ground-truth activity bar at the top of the figure. This suggests that the proposed spatial coherence matrix is effective in capturing speaker activity, much like a voice activity detector. In addition, the entries of the proposed coherence matrix lie within [−1, 1], which is a desired property for network training.
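The coherence-matrix computation can be sketched as below for whitened (unit-modulus) feature vectors; random phases stand in for actual wRTFs, and the sizes are illustrative:

```python
import numpy as np

def spatial_coherence_matrix(r_tilde):
    """[W~]_{ln} = Re{ r~(l)^H r~(n) } / ((M-1)K) for whitened (unit-modulus)
    complex feature vectors r~ of shape (L, (M-1)*K). Entries lie in [-1, 1]."""
    L, D = r_tilde.shape
    return np.real(r_tilde @ r_tilde.conj().T) / D

rng = np.random.default_rng(4)
phases = rng.uniform(-np.pi, np.pi, size=(50, 7 * 257))  # L frames, (M-1)K entries
r_tilde = np.exp(1j * phases)                            # unit-modulus whitening
W_coh = spatial_coherence_matrix(r_tilde)
```

Because every entry of `r_tilde` has unit modulus, the diagonal is exactly one and the Cauchy-Schwarz inequality bounds all entries to [−1, 1].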
3.2 Speaker counting
The flowchart of the proposed speaker counting approach is detailed in Fig. 3. Two features related to the speaker count are extracted from the spatial coherence matrix \(\tilde{\mathbf{W}}\) and input to the speaker counting network (SCnet), as will be detailed next.
In this study, we propose to use the eigenvalues \(\left\{\tilde{\lambda}_{n}\right\}^{L}_{n=1}\) of the spatial coherence matrix \(\tilde{\mathbf{W}}\) as the feature for the classifier. An example scatter pattern of the eigenvalues used to discriminate between different speaker count classes, \(J \in \left\{1, 2, 3, 4\right\}\), is illustrated in Fig. 4. We generated 2000-sample speech mixtures for 1–4 speakers, with 0%, 10%, 20%, 30%, and 40% overlap ratios. Sensor noise was added at 10 dB SNR. Dry signals were convolved with measured RIRs selected from the Multichannel Impulse Responses Database [47], which was recorded using an eight-element ULA with inter-element spacing of 8 cm and T60 = 0.61 s. Each cross in the figure represents one observation used to specify the number of speakers. Figure 4 shows the ability of the eigenvalues obtained from the correlation matrix and the coherence matrix to discriminate between different numbers of speakers. The eigenvalues of the coherence matrix \(\tilde{\mathbf{W}}\) discriminate between different numbers of speakers better than those of the correlation matrix \(\mathbf{W}\). However, some of the observations cannot be classified into the correct class according to the eigenvalues alone. In this study, we evaluate the similarity between global activities as auxiliary information to address the cases where the principal-eigenvalue-based counting method does not work.
Apart from the eigenvalues of the spatial coherence matrix, another feature that can help speaker counting is introduced to deal with meeting scenarios in which the overlap ratio of conversation is often less than 20% [56]. For such scenarios, we first calculate a similarity matrix \({\widetilde{\gamma }}^{j}\in {\mathbb{R}}^{j\times j}\) of the first \(j\) global activities with the pqth entry defined as follows:
where “·” denotes the inner product, \({\tilde{\mathbf{p}}}_{p}^{G}\in {\mathbb{R}}^{L\times 1}\) and \({\tilde{\mathbf{p}}}_{q}^{G}\in {\mathbb{R}}^{L\times 1}\) denote the pth and qth global activities estimated from the spatial coherence matrix \(\tilde{\mathbf{W}}\), and \(1\le p,q\le j\). Next, we find the maximum similarity value among all off-diagonal entries.
Similarly, \({\gamma }_{\mathrm{max}}^{j}\) denotes the maximum similarity calculated using the first j global activities obtained from the spatial correlation matrix W. An example scatter pattern of the maximum similarity used to discriminate between different speaker count classes, \(J\in \left\{1, 2, 3, 4\right\}\), is illustrated in Fig. 5. The data generation is identical to that of Fig. 4. To visualize the separability afforded by the proposed feature, we project the observations onto a two-dimensional feature space. Figure 5 suggests that the observations are separable by the maximum similarity, which helps to classify the number of speakers. In Fig. 5(a), the single-speaker observations and the two- to four-speaker observations are clearly separable along the \({\tilde{\gamma }}_{\mathrm{max}}^{2}\) coordinate. The one- or two-speaker observations and the three- or four-speaker observations are clearly separable along the \({\tilde{\gamma }}_{\mathrm{max}}^{3}\) coordinate. In Fig. 5(b), the one- to three-speaker observations and the four-speaker observations are clearly separable along the \({\tilde{\gamma }}_{\mathrm{max}}^{4}\) coordinate.
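The maximum-similarity feature can be sketched as follows. Since the exact similarity expression is not reproduced here, a normalized inner product between activity vectors is assumed, and the toy activity matrix is an illustrative stand-in:

```python
import numpy as np

def max_similarity(P, j):
    """Maximum off-diagonal similarity among the first j global activity vectors
    (columns of P, shape L x J). A normalized inner product is assumed here."""
    Q = P[:, :j]
    norms = np.linalg.norm(Q, axis=0)
    G = (Q.T @ Q) / np.outer(norms, norms)  # j x j similarity matrix
    np.fill_diagonal(G, -np.inf)            # exclude the diagonal entries
    return G.max()

# Toy activities over L = 3 frames: columns 0 and 2 are identical "speakers",
# column 1 is orthogonal to both.
P = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
```

When a candidate speaker count overshoots the true count, two estimated activities tend to coincide, driving the maximum similarity toward one, which is the cue this feature provides.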
In this study, the speaker counting problem is formulated as a classification problem, as in Ref. [43], with four classes corresponding to 1 to 4 speakers. For each observation (audio clip), the number of speakers is indicated by a one-hot vector \(\mathbf{z}\in {\mathbb{R}}^{4\times 1}\). For inference, the predicted class is the one with the highest probability in the output distribution. Three different input feature vectors are defined for the assessment of speaker counting performance:
where \(J' = 4\) is the maximum possible number of speakers, and the eigenvalues are normalized by the maximum eigenvalue to improve convergence. Feature f_{baseline 2} is obtained from the spatial correlation matrix W, whereas features f_{proposal 1} and f_{proposal 2} are obtained from the proposed spatial coherence matrix \(\tilde{\mathbf{W}}\).
A DNN model termed SCnet is used as the classifier for speaker counting. Figure 6 shows an SCnet consisting of three dense layers, each followed by a rectified linear unit (ReLU) activation, with softmax activation in the output layer. In the figure, (F_{size}, 64) denotes a dense layer with input size F_{size} and output size 64. Cross-entropy is used as the loss function in network training.
3.3 Speaker separation
The simplex-based method relies solely on the spatial cue to perform the subsequent beamforming, which depends on the specific array configuration. In contrast, our learning-based approach uses global and local spatial activity features to train the model, as shown in Fig. 7. The proposed system consists of two main modules: (1) the local coherence estimation of independent speakers, which monitors the local activity of each speaker according to the global activity of the speaker, and (2) the global and local activity-driven network (GLADnet), which extracts the speaker signal with the auxiliary information about the global and local activities of the speaker.
In the local coherence estimation of a speaker, the local coherence is calculated between the wRTF of the target speaker and the wRTF of each TF bin. The wRTF of the jth speaker is calculated as follows:
where \({\widehat{A}}_{j}^{m}\left(f\right)\) is the estimated RTF. Thus, the local coherence of the jth speaker can be calculated as follows:
where \(\widetilde{\mathbf{r}}\left(l,f\right)\) is given by the equation under (14). Local coherence serves to inform the DNN about the local activity of a speaker.
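A minimal sketch of the per-bin local coherence, assuming the same normalized real inner product used for the global coherence matrix (the exact normalization in the paper's equation is not reproduced here):

```python
import numpy as np

def local_coherence(r_tf, a_j):
    """Local coherence between the whitened RTF of one TF bin, r~(l, f), and the
    speaker's whitened RTF a~_j(f), both of length M-1 with unit-modulus entries.
    The normalized real inner product is assumed, mirroring the global coherence."""
    return np.real(np.vdot(a_j, r_tf)) / len(r_tf)

rng = np.random.default_rng(5)
a = np.exp(1j * rng.uniform(-np.pi, np.pi, 7))  # speaker wRTF, M - 1 = 7 (assumed)
b = np.exp(1j * rng.uniform(-np.pi, np.pi, 7))  # wRTF of some TF bin
c_match = local_coherence(a, a)                 # bin dominated by speaker j
c_other = local_coherence(b, a)                 # bin with an unrelated wRTF
```

A bin dominated by speaker j yields coherence close to one, while unrelated bins score much lower, which is what informs GLADnet of the speaker's local activity.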
GLADnet is based on a convolutional recurrent network [57], as illustrated in Fig. 8. The network has three inputs: the magnitude spectrogram of the reference microphone signal, the global activity of the speaker, and the local activity of the speaker. GLADnet has six symmetric encoder and decoder layers with 8, 16, 32, 128, 128, and 128 filters. The convolutional blocks feature a separable convolution layer, followed by batch normalization and exponential linear unit activation. The output layer terminates with a sigmoid activation. The convolution kernel and stride are set to (3,2) and (2,1), respectively. Note that 1 × 1 pathway convolutions (PConv) are used as skip connections, which leads to considerable parameter reduction with little performance degradation. The global activity is concatenated to the output of the linear layer with 256 nodes in each time frame. The resulting vector is then fed to the following bidirectional long short-term memory layers with 256 nodes to sift out the latent features pertaining to each speaker. The soft mask estimated by the network is multiplied element-wise with the noisy magnitude spectrogram to yield an enhanced spectrogram. The complete complex spectrogram is obtained by combining the enhanced magnitude spectrogram with the phase of the noisy spectrogram. The network is trained to minimize the compressed mean square error between the masked magnitude \(\left(\widehat{\mathbf{S}}\right)\) and the ground-truth magnitude \(\left(\mathbf{S}\right)\):
where c = 0.3 is the compression factor and \({\Vert \cdot \Vert }_{F}\) denotes the Frobenius norm.
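The compressed loss can be sketched as follows; the squared Frobenius-norm form of the compressed spectral MSE is assumed, and the random spectrograms are stand-ins:

```python
import numpy as np

def compressed_mse(S_hat_mag, S_mag, c=0.3):
    """Compressed spectral MSE between magnitude spectrograms raised to the
    power c (c = 0.3 in the paper); the squared Frobenius form is assumed."""
    return np.linalg.norm(S_hat_mag**c - S_mag**c, 'fro')**2

rng = np.random.default_rng(6)
S = np.abs(rng.standard_normal((257, 100)))  # ground-truth magnitude (stand-in)
S_hat = 0.9 * S                              # under-estimated masked magnitude
loss = compressed_mse(S_hat, S)
```

Power compression with c < 1 boosts the contribution of low-energy TF bins, which would otherwise be dominated by the loudest components of the mixture.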
4 Experimental study
Experiments were performed to validate the proposed learning-based speaker counting and separation system. The networks were trained on the simulated RIRs and tested on the measured RIRs with different T60s and array configurations recorded at Bar-Ilan University [47]. For meeting scenarios, we also tested the proposed system on real meeting recordings from the LibriCSS meeting corpus [48].
4.1 Training and validation dataset
In total, 50,000 and 5000 samples were used in training and validation, respectively. Dry speech signals selected from the train-clean-360 subset of the LibriSpeech corpus [58] were used for training and validation. Noisy speech mixtures edited in 12-s clips were prepared with different numbers of speakers \(J\in \left\{1, 2, 3, 4\right\}\) in reverberant conditions and signal-to-noise ratios (SNRs) between −5 dB and 5 dB. The overlap ratio of the speech mixtures varied from 0 to 40%. Reverberant microphone signals were simulated by filtering the dry signals with RIRs simulated using the image-source method [46]. The reverberation time was within the range of [0.2, 0.6] s. Sensor noise was added with SNR = 15, 25, and 35 dB. In this study, simulated (Gaussian) noise was used to model the sensor noise. Two microphone array geometries were used for training and validation, as depicted in Fig. 9. The first microphone array is an eight-element ULA with inter-element spacing of 8 cm. The geometry of the second array is similar to that of the seven-element uniform circular array (UCA) used in the LibriCSS dataset [48], which has one microphone at the center and the other six uniformly distributed around a circle with a radius of 4.25 cm. The RIRs of rectangular rooms with randomly generated dimensions (length, width, and height) in the range of [3 × 3 × 2.5, 7 × 7 × 3] m were simulated. The ULA was placed 0.5 m from the wall, while the UCA was placed at the center of the room. Any two speakers were separated by at least 15°.
4.2 Implementation and evaluation metrics
In this study, the signal frame was 128 ms long with a 32 ms stride. A 2048point fast Fourier transform was used. The sample rate was 16 kHz. The feature vectors in (5) and (18) comprised \(K=257\) frequency bins in 1–3 kHz. We chose this frequency range because, as in Ref. [43], it performed well in all of the scenarios examined for different simulated and measured RIRs and array configurations. In the experiment, SCnet and GLADnet were trained using the Adam optimizer with a learning rate of 0.001 and a gradient norm clipping of 3. The learning rate was halved if the validation loss did not improve for three consecutive epochs.
The F1 score and the confusion matrix are used to evaluate the speaker counting performance. The F1 score is a measure of the accuracy of a test in classification problems, defined as the harmonic mean of precision and recall [59]. PESQ [49] is used as a metric for speech quality and is computed only over the periods when speech is present. In addition, we evaluate the WER achieved by the proposed system compared to the baselines, using a transformer-based pretrained model from the SpeechBrain toolkit [60]. The pretrained model was trained on the LibriSpeech dataset. The WER obtained with this model when tested on the test-clean subset is 1.9%.
4.3 Spatial feature robustness
In this section, we investigate the robustness of the algorithm with respect to the spatial correlation matrix and the spatial coherence matrix for measured RIRs and unseen array geometries. The proposed spatial coherence matrix based on wRTFs is used as a spatial signature for directional sources. The whitening process provides spectrally rich information that better accommodates unseen array configurations and measured RIRs. To see this, we compute the Modal Assurance Criterion (MAC) value on the spatial correlation matrix and the spatial coherence matrix for various unseen array configurations and RIRs. First, we vectorize the spatial matrix as \(\psi ={\left[{\mathbf{w}}_{1}\quad {\mathbf{w}}_{2}\quad \cdots\quad {\mathbf{w}}_{L}\right]}^{T}\in {\mathbb{R}}^{{L}^{2}\times 1}\), where \({\mathbf{w}}_{l}={\left[{W}_{l1}\quad {W}_{l2}\quad \cdots\quad {W}_{lL}\right]}^{T} \in {\mathbb{R}}^{L\times 1}\). Let \(\psi\) and \({\psi}'\) represent the feature vectors associated with two spatial matrices. The MAC value between \(\psi\) and \(\psi'\) is defined as follows:
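The MAC computation above can be sketched directly from its standard definition, \(\mathrm{MAC}(\psi ,{\psi }')={({\psi }^{T}{\psi }')}^{2}/\left(({\psi }^{T}\psi )({{\psi }'}^{T}{\psi }')\right)\); the random vectors below are stand-ins for vectorized spatial matrices:

```python
import numpy as np

def mac(psi1, psi2):
    """Modal Assurance Criterion between two vectorized spatial matrices:
    MAC = (psi1^T psi2)^2 / ((psi1^T psi1)(psi2^T psi2)), bounded in [0, 1]."""
    return (psi1 @ psi2)**2 / ((psi1 @ psi1) * (psi2 @ psi2))

rng = np.random.default_rng(7)
psi = rng.standard_normal(100)    # stand-in for a vectorized L x L matrix
psi_scaled = 2.0 * psi            # same "shape", different scale
psi_other = rng.standard_normal(100)
```

MAC is scale-invariant: collinear vectors score exactly one, so values near one indicate that two spatial matrices share the same structure regardless of gain.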
To evaluate the robustness of the proposed spatial feature extraction method, we generated four different test datasets, each consisting of 500 samples. The first three datasets (G1, G2, and G3) were generated using measured RIRs from the Multichannel Impulse Responses Database [47], while the last dataset (sG1) was generated using simulated RIRs. As shown in Fig. 10, the first array configuration (G1) is included in the training set, while the second and third array configurations (G2 and G3) are considered “unseen” to the trained model. Note that sG1 has the same array configuration as G1, but with simulated RIRs. Tables 1 and 2 summarize the MAC values obtained using the spatial correlation matrix and the spatial coherence matrix. The off-diagonal MAC values of the spatial coherence matrix are consistently close to one and larger than those of the spatial correlation matrix. The MAC test demonstrates that the proposed spatial coherence matrix exhibits superior robustness to different array configurations and RIRs compared to the spatial correlation matrix. This property is desirable for the subsequent learning-based speaker counting and speaker separation approaches when dealing with unseen array configurations and measured RIRs.
4.4 Speaker counting performance
In the following, we examine several speaker counting methods for different levels of sensor noise and T60s. We generated 2000-sample speech mixtures for 1–4 speakers, with 0%, 10%, 20%, 30%, and 40% overlap ratios and dry speech signals from the test-clean subset of the LibriSpeech corpus. Sensor noise was added at SNR = 10, 20, and 30 dB. The measured RIRs were selected from the Multichannel Impulse Responses Database [47], recorded at Bar-Ilan University using an eight-element ULA with an inter-element spacing of 8 cm and T60 = 0.36 and 0.61 s. The RIRs were measured at 15° intervals from −90° to 90° at distances of 1 and 2 m from the array center. Table 3 summarizes the speaker counting results in terms of F1 scores. We compare the proposed counting approaches with two baselines. Baseline 1 is the method proposed in [43], which trains an SVM classifier with f_{baseline 1} in (7) as the input feature. Baseline 2 is the SCnet trained with f_{baseline 2} in (22). For the proposed methods, proposals 1 and 2 represent the SCnet trained with f_{proposal 1} and f_{proposal 2} in (22), respectively. The results in Table 3 suggest that baseline 1 performs comparably to baseline 2 in high-SNR conditions. However, the performance of baseline 1 degrades significantly as the SNR decreases. The feature using the eigenvalues of the spatial coherence matrix (proposal 1) significantly outperforms the eigenvalues of the spatial correlation matrix (baseline 1), especially at low SNRs. In addition, training with the maximum similarity (proposal 2) further improves the counting performance over training with eigenvalues only (proposal 1). In this study, speaker counting depends heavily on the quality of the spatial information extracted from the microphone array. Since spatial features degrade as the SNR decreases, the counting performance is relatively lower at SNR = 10 dB.
Next, we investigate speaker counting in low-activity scenarios using four-speaker mixtures, where the first speaker was active for only 5% of the time. Table 4 shows a significant performance degradation for the method trained on the eigenvalues of the spatial correlation matrix (baseline 1), even in high-SNR conditions. In contrast, the SCnet trained on the eigenvalues and the maximum similarities computed from the proposed spatial coherence matrix (proposal 2) performs quite satisfactorily despite the unbalanced speaker activity.
Lastly, we investigate speaker counting using real-life recordings from the LibriCSS dataset [48]. There are ten one-hour sessions, each comprising six 10-min mini-sessions with different speaker overlap ratios (0S, 0L, 10%, 20%, 30%, and 40%). In the 0% case, 0S and 0L denote signals with short and long silence periods, where the inter-utterance silence lasts 0.1–0.5 s and 2.9–3.0 s, respectively. The test data was pre-segmented into 12-s clips containing one to four speakers in each session.
The speaker count of each audio clip was labeled using the ground-truth information. The dataset contains 511, 1119, 614, and 154 examples for one, two, three, and four speakers, respectively. The speaker counting results are summarized in the confusion matrices depicted in Fig. 11. The F1 scores for baselines 1 and 2 and proposals 1 and 2 were 88.37%, 92.44%, 96.48%, and 97.36%, respectively. From Fig. 11, we can see that the methods trained on features from the spatial coherence matrix (proposals 1 and 2) outperform those trained on features from the spatial correlation matrix (baselines 1 and 2). Figure 11(c) and (d) show that the method trained on maximum similarities (proposal 2) yields significantly lower underestimation rates than the method trained on eigenvalues only (proposal 1). For BSS problems, underestimation can undermine the subsequent separation, whereas overestimation is less critical. In summary, we extract spatial information by whitening the RTFs without changing the phase, thereby enhancing the spatial signature of the directional source, analogous to generalized cross-correlation with phase transformation (GCC-PHAT) [53]. In light of the uncertainty principle [54], this improves the time-domain resolution for the computation of the spatial coherence matrix, which in turn leads to a more accurate estimate of the spatial activity, especially in low-SNR cases. This also enables a more accurate estimate of the maximum similarity between the global activities of two speakers, without overlooking low-activity speakers.
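To make the metric concrete, the per-class and macro-averaged F1 scores can be computed directly from a confusion matrix. In the sketch below, only the row sums follow the class counts quoted above; the off-diagonal entries are hypothetical, not the values in Fig. 11:

```python
import numpy as np

def per_class_f1(cm):
    """Per-class F1 from a confusion matrix cm[true, predicted]."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # column sums: predictions
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # row sums: ground truth
    return 2 * precision * recall / np.maximum(precision + recall, 1e-12)

# Rows: true speaker count 1-4; columns: predicted count.
# Row sums match the class counts above; off-diagonal entries are hypothetical.
cm = np.array([[500,   11,    0,   0],
               [ 20, 1080,   19,   0],
               [  0,   35,  560,  19],
               [  0,    0,   18, 136]])

f1 = per_class_f1(cm)
print(f1.mean())   # macro-averaged F1, about 0.93
```

Entries below the diagonal correspond to the underestimation errors discussed above, which are the more harmful kind for the downstream separation stage.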
Furthermore, unlike most multichannel source counting methods, which typically require more microphones than sources, the simplex-based and the proposed methods are limited by the total number of frames used to compute the spatial correlation matrix and the spatial coherence matrix, not by the number of microphones. In theory, there is therefore virtually no limit to the number of speakers that can be identified; the only limit on counting accuracy is the degree of time overlap. To see this, we give two examples with different speaker activity patterns to show the maximum number of independent speakers that can be identified using ULAs with 2–5 elements and an inter-element spacing of 8 cm.
Case I represents a scenario where four speakers are active in moderately overlapping time periods, as shown in Fig. 12(a). Note that at 2–4 s, three speakers are active concurrently. Inspection of Fig. 13(a) indicates that the spatial coherence matrices associated with different numbers of microphones remain very similar. In this case, the eigenvalue distribution analysis accurately estimates the number of sources, even when the number of speakers (four) exceeds the number of microphones (e.g., for the two- and three-element arrays), as shown in Fig. 14(a).
Case II presents a scenario where the proposed source counting method fails: four independent speakers are active with 100% overlap, as shown in Fig. 12(b). In this case, the spatial coherence matrices in Fig. 13(b) show no meaningful patterns of activity, regardless of the number of microphones, and the eigenvalue distribution analysis in Fig. 14(b) incorrectly estimates a single source. In summary, methods based on simplex preprocessing are limited not by the number of microphones, but by the overlap percentage of the speaker activity time spans.
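The two cases can be reproduced schematically. The sketch below is a simplification under stated assumptions: each speaker is assigned an orthogonal stand-in for its wRTF signature (the dimension 129 mimics the frequency bins of a hypothetical two-microphone wRTF, so there are more speakers than microphones), and the speaker count is taken as the number of significant eigenvalues of the frame-by-frame coherence matrix:

```python
import numpy as np

n_frames, n_spk, feat_dim = 200, 4, 129   # feat_dim mimics the frequency bins
                                          # of a hypothetical 2-mic wRTF vector

# Stand-in wRTF signatures, one per speaker (orthogonal for clarity)
signatures = np.eye(n_spk, feat_dim)

def count_by_eigenvalues(activity, rel_thresh=0.05):
    """Count speakers as the number of significant eigenvalues of the
    frame-by-frame coherence matrix built from activity-mixed signatures."""
    frames = activity.T @ signatures                  # (n_frames, feat_dim)
    frames /= np.maximum(np.linalg.norm(frames, axis=1, keepdims=True), 1e-12)
    scm = frames @ frames.T / n_frames                # frame-by-frame SCM
    ev = np.linalg.eigvalsh(scm)
    return int((ev > rel_thresh * ev.max()).sum())

# Case I: four speakers with partially overlapping activity windows
act = np.zeros((n_spk, n_frames))
for k in range(n_spk):
    act[k, 40 * k : 40 * k + 80] = 1.0

print(count_by_eigenvalues(act))                      # -> 4, with only 2 "mics"

# Case II: 100% overlap -> every frame carries the same mixed signature
print(count_by_eigenvalues(np.ones((n_spk, n_frames))))   # -> 1
```

Because the matrix is built across frames rather than across microphones, its rank is governed by how many distinct activity patterns appear over time, which is exactly why full overlap collapses the estimate to one.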
4.5 Speaker separation performance
In the following, we compare the proposed speaker separation approach (GLADnet) with three baselines. The first baseline (mask) uses only a spectral mask (13). The second baseline (LCMV-mask) is the simplex-based approach [43, 44] with beamforming and spectral masking (17). The third baseline is the GLADnet trained only on the global activity, referred to as the global activity-driven network (GADnet). To evaluate the robustness of the proposed speaker separation approach on unseen RIRs and array configurations, we created three 2000-sample test datasets for three different array configurations (G1, G2, and G3) using the measured RIRs from the Multichannel Impulse Responses Database [47]. The array configurations G1, G2, and G3 are shown in Fig. 10.
First, we examine the separation performance for the G1 configuration with different overlap ratios and T60s. The results in Fig. 15 show that the proposed GLADnet outperforms the three baselines in terms of speech quality. The performance of the GADnet, which is not trained with spatial features, degrades drastically as the overlap ratio increases. While the LCMV-mask method achieves a WER comparable to that of GLADnet at a moderate T60 of 360 ms, its separation performance drops sharply under high reverberation.
Next, the effect of the array configuration on separation performance is investigated. Figure 16 reveals that the speech quality (PESQ) and ASR performance (WER) of the LCMV-mask method degrade as the inter-element spacing and the array aperture decrease, even for moderate T60s. In contrast, the proposed GLADnet performs quite satisfactorily despite the unseen RIRs and array geometries.
We also evaluated the proposed network on speaker separation using the more realistic LibriCSS dataset. The dataset generation for network testing is identical to that for speaker counting. Figure 17 shows that the LCMV-mask method performs comparably to the proposed GLADnet when the overlap ratio is low. However, the performance of the LCMV-mask method drops dramatically at high overlap ratios. In addition, GADnet performs satisfactorily only for non-overlapping speech mixtures. In summary, the separation performance of baselines such as mask and LCMV-mask, which rely solely on spatial information, can be significantly affected by the inter-element spacing and array aperture. On the other hand, the baseline GADnet, which relies solely on spectral information, can suffer performance degradation in adverse acoustic conditions such as strong reverberation and high overlap ratios. In contrast to these baselines, the proposed GLADnet exploits both spatial and spectral information to achieve superior performance in terms of PESQ and WER. In addition, the GLADnet is trained using the global and local activities derived from the wRTFs, making it less sensitive to unseen RIRs and array configurations.
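For reference, the core idea behind the mask-based baselines, applying a per-speaker spectral mask to the mixture spectrogram, can be sketched with oracle ideal-ratio-style masks. The spectrograms here are random stand-ins, not the masks estimated by the networks in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n_freq, n_frames = 129, 50

# Stand-in magnitude spectrograms for two fully overlapped speakers
S1 = rng.random((n_freq, n_frames))
S2 = rng.random((n_freq, n_frames))
mix = S1 + S2

# Ideal-ratio-style mask for speaker 1; its complement covers speaker 2
mask1 = S1 / np.maximum(S1 + S2, 1e-12)
est1 = mask1 * mix
est2 = (1.0 - mask1) * mix

print(np.allclose(est1, S1), np.allclose(est2, S2))   # oracle masks recover both
```

In practice the mask must be estimated from the noisy, reverberant mixture, which is where the estimation quality of the spatial and spectral features discussed above determines the separation performance.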
5 Conclusions
In this paper, a learning-based robust speaker counting and separation system has been implemented by integrating array signal processing and DNNs. In feature extraction, the spatial coherence matrix computed with wRTFs across time frames shows superior robustness to different array configurations and RIRs compared to the spatial correlation matrix. In speaker counting, the SCnet trained on the eigenvalues and the maximum similarities obtained from the spatial coherence matrix is conducive to speaker counting in adverse acoustic conditions, especially in unbalanced voice activity scenarios. In speaker separation, the GLADnet based on global and local spatial activities proves capable of effective and robust separation at different overlap ratios for unseen RIRs and array configurations, which is highly desirable for real-world applications.
Availability of data and materials
Not applicable.
Abbreviations
 SCM:

Spatial coherence matrix
 wRTF:

Whitened relative transfer function
 SCnet:

Speaker counting network
 GLADnet:

Global and local activitydriven network
 BSS:

Blind source separation
 NN:

Neural network
 RNN:

Recurrent neural network
 DNN:

Deep neural network
 RTF:

Relative transfer function
 RIRs:

Room impulse responses
 PESQ:

Perceptual evaluation of speech quality
 WER:

Word error rate
 STFT:

Short-time Fourier transform
 ATF:

Acoustic transfer function
 SVM:

Support vector machine
 LCMV:

Linearly constrained minimum variance
 GCCPHAT:

Generalized crosscorrelation with phase transformation
 ULA:

Uniform linear array
 SNR:

Signal-to-noise ratio
 UCA:

Uniform circular array
References
E. Vincent, T. Virtanen, S. Gannot, Audio source separation and speech enhancement (Wiley, USA, 2018)
M. Kawamoto, K. Matsuoka, N. Ohnishi, A method of blind separation for convolved nonstationary signals. Neurocomputing 22, 157–171 (1998)
H. Buchner, R. Aichner, W. Kellermann, A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics. IEEE Trans Audio Speech Lang Process 13(1), 120–134 (2005)
Z. Koldovsky, P. Tichavsky, Time-domain blind separation of audio sources on the basis of a complete ICA decomposition of an observation space. IEEE Trans Audio Speech Lang Process 19(2), 406–416 (2011)
T. Kim, T. Eltoft, T.W. Lee, Independent vector analysis: an extension of ICA to multivariate components, in International Conference on Independent Component Analysis and Signal Separation. (2006), pp.165–172
T. Virtanen, Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans Audio Speech Lang Process 15(3), 1066–1074 (2007)
O. Dikmen, A.T. Cemgil, Unsupervised single-channel source separation using Bayesian NMF, in Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). (2009), pp.93–96
A. Ozerov, C. Févotte, Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Trans Audio Speech Lang Process 18(3), 550–563 (2010)
Y. Mitsufuji, A. Roebel, Sound source separation based on nonnegative tensor factorization incorporating spatial cue as prior knowledge, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2013), pp.71–75
J.R. Hershey, Z. Chen, J. Le Roux, S. Watanabe, Deep clustering: discriminative embeddings for segmentation and separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2016), pp.31–35
Z. Chen, Y. Luo, N. Mesgarani, Deep attractor network for single-microphone speaker separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2017), pp.246–250
Y. Luo, N. Mesgarani, Conv-TasNet: surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Trans Audio Speech Lang Process 27(8), 1256–1266 (2019)
Y. Luo, Z. Chen, T. Yoshioka, Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2020), pp.46–50
D. Yu, M. Kolbæk, Z. Tan, J. Jensen, Permutation invariant training of deep models for speaker-independent multi-talker speech separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2017), pp.241–245
M. Kolbæk, D. Yu, Z. Tan, J. Jensen, Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Trans Audio Speech Lang Process 25(10), 1901–1913 (2017)
L. Drude, R. Haeb-Umbach, Tight integration of spatial and spectral features for BSS with deep clustering embeddings, in Interspeech. (2017), pp.2650–2654
Z.Q. Wang, J. Le Roux, J.R. Hershey, Multi-channel deep clustering: discriminative spectral and spatial embeddings for speaker-independent speech separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2018), pp.1–5
Z. Wang, D. Wang, Combining spectral and spatial features for deep learning based blind speaker separation. IEEE/ACM Trans Audio Speech Lang Process 27(2), 457–468 (2019)
Y. Luo, C. Han, N. Mesgarani, E. Ceolini, S. Liu, FaSNet: low-latency adaptive beamforming for multi-microphone audio processing, in Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding. (2019), pp.260–267
K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2020), pp.381–385
Y. Liu, D. Wang, Divide and conquer: a deep CASA approach to talker-independent monaural speaker separation. IEEE/ACM Trans Audio Speech Lang Process 27(12), 2092–2102 (2019)
E. Nachmani, Y. Adi, L. Wolf, Voice separation with an unknown number of multiple speakers, in International Conference on Machine Learning (ICML). (2020), pp.2623–2634
Y. Luo, N. Mesgarani, Separating varying numbers of sources with auxiliary autoencoding loss, in Interspeech. (2020)
K. Kinoshita, L. Drude, M. Delcroix, T. Nakatani, Listening to each speaker one by one with recurrent selective hearing networks, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2018), pp.5064–5068
T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, R. Haeb-Umbach, All-neural online source separation, counting, and diarization for meeting analysis, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2019), pp.91–95
Z. Jin, X. Hao, X. Su, Coarse-to-fine recursive speech separation for unknown number of speakers. arXiv preprint arXiv:2203.16054 (2022)
J. Zhu, R.A. Yeh, M. HasegawaJohnson, Multidecoder DPRNN: source separation for variable number of speakers, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2021), pp.3420–3424
Z.Q. Wang, D. Wang, Count and separate: incorporating speaker counting for continuous speaker separation, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2021), pp.11–15
A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W.T. Freeman, M. Rubinstein, Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Trans Graph 37(4), 1–11 (2018)
C. Li, Y. Qian, Listen, watch and understand at the cocktail party: audio-visual-contextual speech separation, in Interspeech. (2020), pp.1426–1430
K. Žmolíková, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, J. Černocký, SpeakerBeam: speaker aware neural network for target speaker extraction in speech mixtures. IEEE J Sel Top Signal Process 13(4), 800–814 (2019)
Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J.R. Hershey, R.A. Saurous, R.J. Weiss, Y. Jia, I.L. Moreno, VoiceFilter: targeted voice separation by speaker-conditioned spectrogram masking, in Interspeech. (2019), pp.2728–2732
M. Ge, C. Xu, L. Wang, E.S. Chang, H. Li, Spex+: a complete time domain speaker extraction network, in Interspeech. (2020), pp.1406–1410
R. Gu, L. Chen, S.X. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou, D. Yu, Neural spatial filter: target speaker speech separation assisted with directional information, in Interspeech. (2019), pp.4290–4294
M. Delcroix, T. Ochiai, K. Zmolikova, K. Kinoshita, N. Tawara, T. Nakatani, S. Araki, Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2020), pp.691–695
J. Han, W. Rao, Y. Wang, Y. Long, Improving channel decorrelation for multichannel target speech extraction, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2021), pp.6094–6098
Y. Hsu, Y. Lee, M.R. Bai, Learning-based personal speech enhancement for teleconferencing by exploiting spatial-spectral features, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2022), pp.8787–8791
M. Delcroix, K. Zmolikova, T. Ochiai, K. Kinoshita, T. Nakatani, Speaker activity driven neural speech extraction, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2021), pp.6099–6103
T. Higuchi, K. Kinoshita, M. Delcroix, K. Zmolkova, T. Nakatani, Deep clusteringbased beamforming for separation with unknown number of sources, in Interspeech. (2017)
S.E. Chazan, J. Goldberger, S. Gannot, DNN-based concurrent speakers detector and its application to speaker extraction with LCMV beamforming, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2018), pp.6712–6716
S.E. Chazan, S. Gannot, J. Goldberger, Attention-based neural network for joint diarization and speaker extraction, in Proc. of IEEE International Workshop on Acoustic Signal Enhancement (IWAENC). (2018), pp.301–305
C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, R. Haeb-Umbach, Front-end processing for the CHiME-5 dinner party scenario, in Proc. of CHiME-5 Workshop. (2018), pp.35–40
B. Laufer-Goldshtein, R. Talmon, S. Gannot, Global and local simplex representations for multichannel source separation. IEEE/ACM Trans Audio Speech Lang Process 28(1), 914–928 (2020)
B. Laufer-Goldshtein, R. Talmon, S. Gannot, Audio source separation by activity probability detection with maximum correlation and simplex geometry. EURASIP J Audio Speech Music Process 2021, 5 (2021)
B. Laufer-Goldshtein, R. Talmon, S. Gannot, Source counting and separation based on simplex analysis. IEEE Trans Signal Process 66(24), 6458–6473 (2018)
E. Lehmann, A. Johansson, Prediction of energy decay in room impulse responses simulated with an image-source model. J Acoust Soc Am 124(1), 269–277 (2008)
E. Hadad, F. Heese, P. Vary, S. Gannot, Multichannel audio database in various acoustic environments, in Proc. of IEEE International Workshop on Acoustic Signal Enhancement (IWAENC). (2014), pp.313–317
Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, J. Li, Continuous speech separation: dataset and analysis, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2020), pp.7284–7288
A.W. Rix, J.G. Beerends, M.P. Hollier, A.P. Hekstra, Perceptual evaluation of speech quality (PESQ), a new method for speech quality assessment of telephone networks and codecs, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2001), pp.749–752
O. Yilmaz, S. Rickard, Blind separation of speech mixtures via time-frequency masking. IEEE Trans Signal Process 52(7), 1830–1847 (2004)
S. Gannot, D. Burshtein, E. Weinstein, Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans Signal Process 49(8), 1614–1626 (2001)
W.K. Ma et al., A signal processing perspective on hyperspectral unmixing: Insights from remote sensing. IEEE Signal Process Mag 31(1), 67–81 (2014)
C. Knapp, G. Carter, The generalized correlation method for estimation of time delay. IEEE Trans Acoust Speech Signal Process 24(4), 320–327 (1976)
L. Cohen, The uncertainty principle in signal analysis, in Proc. of IEEE Time-Freq./Time-Scale Anal. (1994), pp.182–185
K. Scharnhorst, Angles in complex vector spaces. Acta Applicandae Mathematicae 69(1), 95–103 (2001)
O. Çetin, E. Shriberg, Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: insights for automatic speech recognition, in Interspeech. (2006), pp.293–296
K. Tan, D. Wang, A convolutional recurrent neural network for real-time speech enhancement, in Interspeech. (2018), pp.3229–3233
V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). (2015), pp.5206–5210
D. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. J Mach Learn Technol 2(1), 37–63 (2011)
M. Ravanelli et al., SpeechBrain: a general-purpose speech toolkit. arXiv preprint arXiv:2106.04624 (2021)
Acknowledgements
Not applicable.
Funding
This work was supported by the National Science and Technology Council (NSTC), Taiwan, under the project number 1102221E007027MY3.
Author information
Contributions
Model development: Y. Hsu and M. R. Bai. Design of the dataset and test cases: Y. Hsu. Experimental testing: Y. Hsu. Writing paper: Y. Hsu and M. R. Bai. The authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Hsu, Y., Bai, M.R. Learningbased robust speaker counting and separation with the aid of spatial coherence. J AUDIO SPEECH MUSIC PROC. 2023, 36 (2023). https://doi.org/10.1186/s13636023002983