Source ambiguity resolution of overlapped sounds in a multi-microphone room environment

When several acoustic sources are simultaneously active in a meeting room scenario, and both the position of the sources and the identity of the time-overlapped sound classes have been estimated, the problem of assigning each source position to one of the sound classes still remains. This problem is found in the real-time system implemented in our smart-room, where it is assumed that up to two acoustic events may overlap in time and the source positions are relatively well separated in space. The position assignment system proposed in this work is based on fusion of model-based log-likelihood ratios obtained after carrying out several different partial source separations in parallel. To perform the separation, frequency-invariant null-steering beamformers, which can work with a small number of microphones, are used. The experimental results using all the six microphone arrays deployed in the room show a high assignment rate in our particular scenario.


Introduction
Sound is a rich source of information. For that reason, machine audition [1] plays an important role in many applications. In particular, for meeting room scenarios, knowledge of the identity of possibly simultaneous sounds that take place in a room at a given time and their position in space is relevant to automatically describe social and human activities [2][3][4][5], to increase the robustness of speech processing systems operating in the room [6], to assist video conferencing [7], etc.
Acoustic event detection (AED) systems try to determine the identity of an occurring sound and the time interval when it is produced [2][3][4]. Acoustic source localization (ASL) systems estimate its position in space [5][8][9][10][11]. Both tasks become much more challenging when there exists sound simultaneity, i.e., several sounds overlapping in time and in a given room. For example, after the CLEAR'07 international evaluations [12], where AED was carried out with meeting room seminars, it became clear that time overlapping of acoustic events (AEs) was a major source of detection errors [13].
In the concrete scenario used for the experiments, a typical meeting room acoustic scene is considered, where only one person is speaking at a given time and other non-speech sounds may happen simultaneously with the speaker's voice. Therefore, we have to deal with the problem of detecting and localizing an acoustic event that may be temporally overlapped with speech. The detection of overlapping events may be tackled with different approaches, either at the signal level, at the model level, or at the decision level. In [13][14][15], a model-based approach was adopted for detection of events in that meeting room scenario with two sources, one of which is always speech and the other is an acoustic event from a list of 11 predefined events. Thus, besides the mono-event acoustic models, additional acoustic models were considered for each AE overlapped with speech, so the number of models was doubled (22 in that case). That approach is used in the current real-time system implemented in the smart-room of Universitat Politècnica de Catalunya (UPC), which includes both AED and ASL [14].
In that model-based approach, a permutation problem exists. In fact, the AED system gives the hypothesized identities of the overlapped sounds, but does not associate each of them to one of the available source positions that are provided by the ASL system. The same problem may be encountered by using other AED approaches, for instance, if a blind source separation technique is used prior to the detection of each of the separated events. To solve that source ambiguity problem, a position assignment (PA) system that performs a one-to-one correspondence between the set of source positions and the set of class labels is presented in this paper. The proposed PA system, a preliminary version of which was presented in [15,16], consists of three stages: beamforming-based signal separation, model-based likelihood calculations, and fusion of log-likelihood ratios over the set of beamformers. Frequency-invariant null-steering beamformers are designed for the small microphone arrays existing in the smart-room. In addition, the likelihoods coming from both a speech model and an acoustic event model are combined in order to improve the system accuracy. The work presented in this paper is an extension and improvement of the work reported in [15,16]. In contrast to that previously published work, in the PA system reported here, all six microphone arrays available in the room are employed in the experiments. For taking the decision, the scores obtained from each array are combined using either a product of likelihood ratios or a fuzzy integral-based fusion technique [17][18][19]. Experiments are carried out for the scenario described above and with a set of signals collected in the smart-room. The observed PA accuracy is higher than 95%.
The acoustic scenario is presented in Section 2 together with the signal database used for experimentation. The position assignment system is described in Section 3. Experiments are reported in Section 4, along with some practical issues about the system implementation and the metrics used. Conclusions are presented in Section 5.

Acoustic scenario and database
Figure 1 shows the smart-room of the UPC, with the position of its six T-shaped four-microphone arrays on the walls. The linear arrays of three microphones are used in the experiments. The total number of considered acoustic event classes is 12, including speech, as shown in Table 1. In the working scenario, it is assumed that speech is always produced at one side of the room (either left or right), and the other AEs are produced at the other side.
For the offline design of the system, whose real-time version is later implemented in the room, a database is needed. It was recorded using a spatial distribution of the AE sources and the speech source, as depicted in Figure 1. The position of the speaker was rather stable, but the other AEs were produced within broad areas of the room layout. There are eight recorded sessions (S01 to S08) of isolated AEs, in which six different people participated and performed each AE several times. Note that, although in a real meeting room scenario the speaker may be placed at either the left or the right side of the room, in the database the speaker's position is fixed. This does not constrain the usefulness of the results, because the system does not make use of that knowledge.
As in [14], we have used for training, development, and testing up to eight sessions of audio data with isolated acoustic events. Each session was recorded with all six T-shaped microphone arrays (24 microphones). The overlapped signals of the database were generated by adding those AE signals recorded in the room to a speech signal, also recorded in the room for a single speaker with all 24 microphones. To do that, for each AE instance, a segment with the same length was extracted from the speech signal starting from a random position and added to the AE signal. The mean power of speech was made equivalent to the mean power of the overlapping AE. That addition of signals produces an increment of the background noise level, since it is included twice in the overlapped signals; however, going from isolated to overlapped signals, the SNR reduction is slight: from 18.7 to 17.5 dB. The average duration of the events is 500 ms, and the reverberation time of the room is around 450 ms. Signals were recorded at 44.1 kHz sampling frequency and further converted to 16 kHz.
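The mixing procedure described above can be sketched for a single channel as follows. This is only an illustrative sketch: the function name `mix_with_speech`, the synthetic noise signals, and the use of NumPy are assumptions, not part of the original database tooling.

```python
import numpy as np

def mix_with_speech(ae, speech, rng):
    """Overlap one AE instance with an equally long, power-matched speech segment."""
    ae = np.asarray(ae, dtype=float)
    n = len(ae)
    # Random start position of the speech segment (upper bound is exclusive)
    start = rng.integers(0, len(speech) - n + 1)
    segment = np.asarray(speech[start:start + n], dtype=float)
    # Scale the speech so that its mean power equals that of the AE
    gain = np.sqrt(np.mean(ae ** 2) / np.mean(segment ** 2))
    scaled = gain * segment
    return ae + scaled, scaled

# Synthetic stand-ins for real recordings (hypothetical data)
rng = np.random.default_rng(0)
ae = rng.standard_normal(8000)               # 500 ms at 16 kHz
speech = 0.1 * rng.standard_normal(160000)   # 10 s of "speech"
mixed, scaled_speech = mix_with_speech(ae, speech, rng)
```

For multi-microphone data, the same start position would presumably be applied to every channel so that the inter-channel delays, which the beamformers rely on, are preserved.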

Source position assignment
The block diagram of the whole system that performs position assignment from the outputs of the acoustic event detection and localization systems is depicted in Figure 2. The model-based AED system outputs either one or two AE hypotheses. On the other hand, in the online implementation at the UPC's smart-room, the ASL system provides either one or two source positions. Hence, there are four different possibilities for mapping the one/two detected events into the one/two detected positions. As can easily be noticed, there exists an ambiguity in three out of those four possibilities. This work is focused on the most general case, where two events are detected, i.e., E (one of the 11 possible AEs) and 'sp', and also two source positions: P1 and P2. Hence, the position assignment (PA) block is actually a binary classifier that assigns E to either P1 or P2.
If the problem of assigning the two events to the two positions is solved, the other two cases with ambiguity can also be solved using the same approach. In this section, the aim is to design a system that can be deployed in real time in the room to resolve that ambiguity in the correspondence between detected AEs and acoustic source positions.

The AED and ASL subsystems
The two-source AED system included in Figure 2, which was developed in previous work [13], employs a model-based approach with one microphone. All accepted sound combinations are modeled, i.e., the AED system has a model for each class whether it is an isolated event or a combination of events. This approach does not require a prior separation of the two overlapped signals, but requires a number of models that may be too large. In our particular meeting room scenario, however, the approach is feasible because 11 AEs are considered, which may be overlapped only with one class, speech, so only 22 models are required [14,15]. The ASL system, also developed in previous works [14], is based on the steered response power with phase transform (SRP-PHAT) algorithm, which uses 24 microphones available in the room.

The position assignment system
The scheme of the PA system, which is shown in Figure 3 for one array, has at its front-end two null-steering beamformers (NSBs), which work in parallel. The main beam of each NSB is steered towards the desired source and a null is placed in the direction of the interfering source, so each NSB will nullify a different source signal. Thus, the contribution of one of the simultaneous sounds to the beamformer output is expected to be lower than its contribution to the beamformer input. Indeed, beamforming is based on the prior knowledge of the direction of the desired source and the interfering source, which can be provided by an ASL system. Thus, each NSB requires two inputs: (1) the multi-microphone signal and (2) the position coordinates or direction of arrival (DOA) of the sources.
Each of the beamformers is followed by feature extraction (FE) and likelihood computation (LC). In this work, hidden Markov model and Gaussian mixture model (HMM-GMM) are employed, for both acoustic events and speech. Given the AE class E, the model for E and the model for speech (sp) are needed for the likelihood computations. Finally, a decision block makes the assignment based on the computed log-likelihoods.

Null-steering beamforming
Null-steering beamforming is one of the earliest, but potentially very useful, beamforming techniques. It belongs to a class of very popular and widely used beamforming techniques called multiple side lobe cancellers (MSC) [20][21][22]. NSB adapts the sensor array pattern by steering the main beam towards the desired source and placing nulls in the direction of the interfering sources [23]. The solution for the weight matrix in this type of beamformer is achieved by setting the desired response to unity at the direction of the target sound and to zero at the direction of the interfering sources. In our particular scenario, only one interfering source has to be nulled, so only two microphone signals are required to get a solution for the weight matrix. However, as there are three linearly spaced microphones in each T-shaped array of the room, all three of them will be used in the experiments. Given the broadband characteristics of the audio signals, to determine the beamformer coefficients, a technique called frequency-invariant beamforming (FIB) is used [24][25][26][27]. The method, proposed in [25] and [26], uses a numerical approach to construct an optimal frequency-invariant response for an arbitrary array configuration with a very small number of microphones, and it is capable of nulling several interfering sources simultaneously. As depicted in Figure 4, the FIB method first decouples the spatial selectivity from the frequency selectivity by replacing the set of real sensors by a set of virtual ones, which are frequency invariant. Then, the same array coefficients can be used for all frequencies.
An alternative frequency-dependent approach was explored for a single array in our previous works [15,16], using either a time domain implementation or a frequency domain implementation. However, in spite of carrying out a careful frequency tuning, the obtained PA accuracy was just slightly higher than the one from the FIB-based system for the time domain implementation [16]. On the other hand, the alternative FIB technique does not require frequency tuning, and thus, it is less dependent on the concrete scenario. For those reasons, FIB was chosen in the work reported here.
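The unity/zero constraint described for the weight solution can be illustrated with a narrowband sketch for one frequency. Note that this is not the FIB design of [25,26]: it is the simpler frequency-dependent formulation, and the far-field model, the uniform linear geometry, the 0.2 m microphone spacing, and all function names are assumptions made only for illustration.

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in m/s

def steering_vector(theta_deg, freq_hz, n_mics=3, spacing_m=0.2):
    """Far-field steering vector of a uniform linear array (spacing is hypothetical)."""
    theta = np.deg2rad(theta_deg)
    delays = np.arange(n_mics) * spacing_m * np.sin(theta) / C_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)

def null_steering_weights(theta_target, theta_null, freq_hz, **array):
    """Minimum-norm weights with unit response at the target and zero at the null."""
    a_t = steering_vector(theta_target, freq_hz, **array)
    a_n = steering_vector(theta_null, freq_hz, **array)
    C = np.column_stack([a_t, a_n])   # constraint matrix, shape (n_mics, 2)
    g = np.array([1.0, 0.0])          # desired responses: target, interferer
    # Solve C^H w = g for the minimum-norm w: w = C (C^H C)^{-1} g
    return C @ np.linalg.solve(C.conj().T @ C, g)

def response(w, theta_deg, freq_hz, **array):
    """Array response w^H a(theta) at one frequency."""
    return np.vdot(w, steering_vector(theta_deg, freq_hz, **array))

# Example angles and frequency are arbitrary
w = null_steering_weights(30.0, -40.0, 1000.0)
```

With three microphones and two constraints, one degree of freedom is left over; the minimum-norm choice used here is one common way to resolve it.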

Single-array classification stage
As shown in Figure 3, the classification stage of the PA system with a single array consists of feature extraction, followed by log-likelihood calculation and a binary decision block. Features are extracted from the audio signals with a frame length of 30 ms and a frame shift of 20 ms. As features, frequency-filtered log filter-bank energies (FF-LFBEs), which were developed for speech recognition, are used [28]. These features are uncorrelated, similarly to the conventional MFCC features, but unlike the latter, the FF-LFBE features keep the frequency localization. In Figure 3, there is a set of two likelihood calculators for each parallel channel: one of them calculates the model-based log-likelihood for the AE label (E) provided by the AED system, and the other calculates it for speech. As in [14], hidden Markov models (HMMs) are employed here, where Gaussian mixture models (GMMs) are used to calculate the emission probabilities [29]. Thirty-two Gaussian components with diagonal covariance matrices are used per model. There is one left-to-right HMM with three emitting states for each AE and speech. Eleven HMMs are trained with isolated events using the Baum-Welch algorithm [29]. The HTK toolkit is used for training and testing this HMM-GMM system [30].
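As a small illustration of the frame configuration stated above (30 ms frames with a 20 ms shift, at the 16 kHz rate of the database), the following sketch splits a signal into frames. The function name `frame_signal` and the choice to discard the incomplete trailing frame are assumptions; the actual front end is the one provided by HTK.

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=30, shift_ms=20):
    """Split a 1-D signal into overlapping frames; the incomplete tail is dropped."""
    x = np.asarray(x)
    frame_len = fs * frame_ms // 1000   # 480 samples at 16 kHz
    hop = fs * shift_ms // 1000         # 320 samples at 16 kHz
    n_frames = 1 + (len(x) - frame_len) // hop
    # Index matrix: row k holds the sample indices of frame k
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

# A 500 ms event (the average duration in the database) at 16 kHz
frames = frame_signal(np.zeros(8000))
```

Under these assumptions, an event of average duration yields only a couple of dozen feature vectors per beamformer output, which is the amount of data each log-likelihood is computed from.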
To make the mapping between source positions and event identities, the decision block uses the four log-likelihood scores computed from the HMM-GMM models. Those four scores, which are indicated in Figure 3 as LLi, i = 1, 2, 3, 4, are grouped in two log-likelihood ratios LLR1 and LLR2, one for each beamformer path, and the following single-array score S is computed:

$$S = \frac{1}{2}\left(\mathrm{LLR}_1 + \mathrm{LLR}_2\right), \qquad \mathrm{LLR}_1 = \mathrm{LL}_1 - \mathrm{LL}_2, \quad \mathrm{LLR}_2 = \mathrm{LL}_3 - \mathrm{LL}_4 \qquad (1)$$

If S is positive, the AE E is associated to the position P2, and if S is negative, it is associated to P1. Let us illustrate it with a particular case. Assume that P1 truly corresponds to speech and P2 to the acoustic event E. When using the AE model, it is expected to get a comparatively higher log-likelihood from the output of NSB1 (LL1) than from the output of NSB2 (LL4). For the clean speech model, it is expected to get a comparatively higher log-likelihood from the output of NSB2 (LL3) than from the output of NSB1 (LL2). If that is the case, S is positive and the decision is taken that speech is at P1 and E is at P2, which is the correct decision. Note that with this type of combination, the decision block gives equal importance to all four likelihood calculator outputs.
In order to get the most from the available information, in the current scheme, unlike in [15], the classification stage includes a speech model besides the AE model. Indeed, the system could also work with only either the speech model-based classifier or the AE model-based classifier. To study the contribution of each one of the models, all those options have been tested, and the results are reported in Section 4. For the decision, if only either the AE or the speech-based classifier is used, just either LL1 - LL4 or LL3 - LL2, respectively, is needed.
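The single-array decision rule can be sketched as follows. The sketch assumes that LLR1 = LL1 - LL2 and LLR2 = LL3 - LL4 (one ratio per beamformer path) and that S is their average, which is consistent with the description of S as an average of the LLR scores and with the AE-only and speech-only differences given above. The function names and the numeric values are hypothetical.

```python
def single_array_score(ll1, ll2, ll3, ll4, use_ae=True, use_speech=True):
    """Score S for one array.

    ll1, ll2: AE-model and speech-model log-likelihoods at the NSB1 output.
    ll3, ll4: speech-model and AE-model log-likelihoods at the NSB2 output.
    """
    if use_ae and use_speech:
        llr1 = ll1 - ll2          # beamformer path 1
        llr2 = ll3 - ll4          # beamformer path 2
        return 0.5 * (llr1 + llr2)
    if use_ae:
        return ll1 - ll4          # AE model only
    if use_speech:
        return ll3 - ll2          # speech model only
    raise ValueError("at least one model must be used")

def assign_event(score):
    """Position assigned to the acoustic event E (ties go to P1 in this sketch)."""
    return "P2" if score > 0 else "P1"

# Made-up log-likelihoods for a case where E is truly at P2
s = single_array_score(-100.0, -120.0, -90.0, -130.0)
position = assign_event(s)
```

In the combined case, S equals half the sum of the AE-only and speech-only differences, so each of the four log-likelihoods enters with the same weight.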

Multi-array fusion
As mentioned earlier, all six three-microphone linear arrays deployed in the room are used in the position assignment system. For taking the assignment decision, the six sets of scores LLR1 and LLR2, computed as indicated in Equation 1, are combined either with a uniformly weighted average [31] of the 12 values or by fuzzy integral-based fusion [19]. In the following, the latter technique is presented.

FI-based optimized fusion
The scores at the output of the classification stage can be linearly combined by using an optimal fusion approach that assigns an individual weight to each of them. However, a more sophisticated weighting technique that considers all subsets of information sources, the fuzzy integral (FI) approach, is used in this work [19].
Let us denote by h_i, i = 1, 2, …, N, the set of output scores (LLR1 and LLR2) of the N/2 single-array systems. Assuming that the sequence h_i, i = 1, 2, …, N, is ordered in such a way that h_1 ≤ … ≤ h_N, the Choquet fuzzy integral can be computed as

$$M_{FI}(\mu, h) = \sum_{i=1}^{N} \left[\mu(i, \ldots, N) - \mu(i+1, \ldots, N)\right] h_i$$

where μ(N + 1) = 0. The value μ(Q) can be viewed as a weight related to a subset Q of the set Z of N information sources. It is called the fuzzy measure, and if Q and T are subsets of Z, it has to meet the following conditions:

$$\mu(\emptyset) = 0, \quad \mu(Z) = 1 \quad \text{(boundary)}; \qquad Q \subseteq T \Rightarrow \mu(Q) \le \mu(T) \quad \text{(monotonicity)}$$

In this work, a supervised gradient-based training algorithm for learning the fuzzy measures from the training data with cross-validation is used [18,19].
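A minimal sketch of the Choquet integral over a set of scores is given below, using the standard form in which each sorted score h_i is weighted by μ(i, …, N) − μ(i+1, …, N). The fuzzy measure is passed as a callable on subsets of source indices; the two measures shown are illustrative stand-ins, not the measures learned with the gradient-based algorithm of [18,19].

```python
import numpy as np

def choquet_integral(h, mu):
    """Choquet integral of scores h with respect to fuzzy measure mu.

    mu maps a frozenset of source indices to a weight in [0, 1],
    with mu(empty set) = 0 and mu(all sources) = 1.
    """
    h = np.asarray(h, dtype=float)
    order = np.argsort(h)                 # ascending: h_(1) <= ... <= h_(N)
    total = 0.0
    for k, idx in enumerate(order):
        upper = frozenset(order[k:].tolist())           # sources i, ..., N
        upper_next = frozenset(order[k + 1:].tolist())  # sources i+1, ..., N
        total += (mu(upper) - mu(upper_next)) * h[idx]
    return total

def uniform_measure(n):
    """Additive measure with equal weights: the integral becomes the plain mean."""
    return lambda subset: len(subset) / n

def max_measure(subset):
    """Measure that is 1 for any non-empty subset: the integral becomes the maximum."""
    return 1.0 if subset else 0.0

scores = [0.5, -1.0, 2.0, 1.5]            # hypothetical LLR scores
fused_mean = choquet_integral(scores, uniform_measure(len(scores)))
fused_max = choquet_integral(scores, max_measure)
```

The uniform additive measure reproduces the uniformly weighted average used as the baseline fusion, which shows that the FI approach generalizes it: non-additive measures allow the weight of a score to depend on which other sources rank above it.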

Experimental work
The PA experiments are done under the assumption that there is always an AE overlapped with speech. It is assumed that the identity of the AE is known, to avoid the propagation of the AED errors to the PA system. Additionally, it is assumed that the approximate positions in the room of the AE source and the speaker are known. Thus, the PA system only has to make a binary decision (the AE is from position P1 or position P2), which will be either correct or incorrect.
To design and evaluate the performance of the system, the position assignment rate (PAR) metric for a given AE class is defined as the quotient between the number of correct decisions and the total number of occurrences of that class in the testing database. Then, the PAR is averaged over the classes to obtain the final evaluation measure. For reference, a second metric is also considered, called Diff_LL, which is the value of the S score from Equation 1 provided that the assignment is correct (LL1 - LL4 for the AE-based system, or LL3 - LL2 for the speech-based one, or S when both the AE model and the speech model are used). Actually, that score can be considered as an estimate of the degree of source separation carried out by the beamformers when a correct assignment is made.
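The PAR definition above (per-class ratio of correct decisions, then an unweighted average over classes) can be sketched as follows. The function name and the class labels "E1" and "E2" are hypothetical placeholders.

```python
from collections import defaultdict

def position_assignment_rate(labels, decisions, truths):
    """Return (per-class PAR, class-averaged PAR).

    labels:    AE class of each test occurrence
    decisions: position assigned to the AE by the PA system ("P1" or "P2")
    truths:    true position of the AE
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for cls, decided, true in zip(labels, decisions, truths):
        total[cls] += 1
        correct[cls] += int(decided == true)
    per_class = {cls: correct[cls] / total[cls] for cls in total}
    # Unweighted mean over classes, so rare classes count as much as frequent ones
    average = sum(per_class.values()) / len(per_class)
    return per_class, average

labels = ["E1", "E1", "E2", "E2", "E2", "E2"]
decisions = ["P2", "P1", "P2", "P2", "P2", "P1"]
truths = ["P2", "P2", "P2", "P2", "P2", "P2"]
per_class, par = position_assignment_rate(labels, decisions, truths)
```

Averaging over classes rather than over occurrences means that the final figure is not dominated by the classes with the most test instances.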
In the PA system from Figure 3, there are two FIB-based NSBs at the front end. The design of the beamformers for each particular AE sample requires the DOA angles corresponding to the target and the null, i.e., the DOAs from the source positions P1 and P2. Two different options regarding the approximate positions of the acoustic events from which the DOAs are extracted have been considered. In the first option, the same approximate DOA is used for the whole set of acoustic events for each array. It is obtained as a DOA average over the AE source positions, which are known from visual inspection during recording. In this case, a beam pattern with a broader main lobe (as shown in the right side of Figure 5) has been designed to approximately encompass all the positions of the acoustic events. In the second option, the position of the event estimated by an ASL system based on the SRP-PHAT technique has been used. Therefore, in that case, the beam is steered to the direction of the specific event position. It is worth mentioning here that the AE source positions are estimated using a one-source ASL system instead of a two-source one, in order to avoid further propagation of errors from the ASL system to the PA system. Regarding the speech source position, the speaker's position specified during recording has been used for all experiments.

Results and discussion
To assess the performance of the PA system depicted in Figure 3, several experiments have been conducted. The testing results are obtained with all eight recording sessions (S01 to S08), using a leave-one-out criterion, and averaging over the whole testing dataset. In all the FI-based fusions, a 5-fold cross-validation on the training data is used to stop the training process and avoid over-fitting. To check the performance of the PA system when either only the AE model or only the speech model is used, the experiments for the array T6 are performed using visually inspected positions for AEs and a broad beam. The results are shown in Table 2. It is worth mentioning that the AE source positions and the speech source position are physically rather well separated from the viewpoint of the array T6.
It can be observed from the results in Table 2 that the combination of the two models with the S score, which averages the scores LLR1 and LLR2, improves the performance of the system with respect to the use of only one type of model. The improvement is much more noticeable using the FI-based fusion of the two scores. Notice also that the AE model-based system works much better than the speech model-based one. In fact, the former uses a more specific model, because the speech model is obtained from the whole set of speech sounds. In that table, the Diff_LL score is also shown. Notice that, in general, it is well correlated with the PAR. However, there is a large difference between the values of Diff_LL for the AE-based case and the LL combination case, in contrast with the very small difference in terms of PAR. It means that the use of both models allows achieving a much stronger confidence in the PA decision when it is correct.
Table 3 shows the PAR scores when all six microphone arrays are employed, either alone or in combination. The results are given for the two types of DOA settings mentioned above. Also, two types of intra-array combinations are considered, as in Table 2 (second column of Table 3): the average of LLR scores given by Equation 1 and the FI-based fusion. As before, the testing results are obtained with all eight recording sessions (S01 to S08), using a leave-one-out criterion and averaging over the whole testing dataset, and they are tabulated with the population standard deviations.
In the first columns of Table 3, we have the PAR (in %) with the standard deviations when each one of the arrays is used alone. Notice that the PAR scores of the upper half of the array numbers (T4, T5, and T6) are higher than those of the lower half. This could be expected, since for those arrays the angle between the acoustic event positions and the speech source position is larger than that for the other arrays (T1, T2, and T3).
Note from the last two columns in Table 3 that the accuracy obtained from either the average of LLR scores or the FI-based fusion of the whole set of arrays is higher than the accuracy obtained from any of the single arrays. Comparing both types of fusion, the FI one shows a noticeably better performance for both DOA setting cases, reaching a PA error of only 4.3%.
The use of intra-array FI-based fusion improves the PAR scores with respect to using a uniformly weighted average of LLR scores, especially for the upper half arrays. Therefore, although employing the FI approach has the cost of having to learn the fuzzy measures from data, it may be a good choice when the quality of the signal separation is not too low, as presumably happens with the upper half arrays. Regarding the type of DOA setting, the ASL-estimated AE position-based system always works better than the one that uses an average DOA based on visual inspection (i.e., a broad-beam nulling angle). That could be expected, since the beam pattern is specific to each event occurrence, whereas the broad beam encompasses all the angles of the AE source positions. While the latter design simplifies the overall system, as it does not require a precise source position and may avoid an additional external ASL block, it is specific to the given scenario, so it has to be redesigned when the scenario changes.

Conclusions
An attempt is made in this paper to resolve the source identification ambiguity that appears when an acoustic event that overlaps with speech is detected. A position assignment system has been proposed and tested. It first consists of a set of frequency-invariant null-steering beamformers that carry out different partial signal separations for each microphone array. The beamformers are followed by model-based likelihood calculations, using both the acoustic event model and the speech model, to obtain two likelihood ratios, whose combination gives a final score per array. Using the fuzzy integral for that intra-array combination and also for the fusion of the six array scores, the best assignment error is obtained, which is smaller than 5%. It is worth noticing that, although the position assignment system has been developed for the problem encountered in the current scenario, its scheme can be extended to more than two sources and to different types of sound overlap combinations. Future work will be devoted to that.