Source ambiguity resolution of overlapped sounds in a multi-microphone room environment
© Chakraborty et al.; licensee Springer. 2014
Received: 4 August 2013
Accepted: 21 March 2014
Published: 28 April 2014
When several acoustic sources are simultaneously active in a meeting room scenario, and both the position of the sources and the identity of the time-overlapped sound classes have been estimated, the problem of assigning each source position to one of the sound classes still remains. This problem arises in the real-time system implemented in our smart-room, where it is assumed that up to two acoustic events may overlap in time and the source positions are relatively well separated in space. The position assignment system proposed in this work is based on the fusion of model-based log-likelihood ratios obtained after carrying out several different partial source separations in parallel. To perform the separation, frequency-invariant null-steering beamformers, which can work with a small number of microphones, are used. The experimental results using all six microphone arrays deployed in the room show a high assignment rate in our particular scenario.
Sound is a rich source of information. For that reason, machine audition plays an important role in many applications. In particular, for meeting room scenarios, knowledge of the identity of possibly simultaneous sounds that take place in a room at a given time, and of their position in space, is relevant to automatically describe social and human activities [2–5], to increase the robustness of speech processing systems operating in the room, to assist video conferencing, etc.
Acoustic event detection (AED) systems try to determine the identity of an occurring sound and the time interval when it is produced [2–4]. Acoustic source localization (ASL) systems estimate its position in space [5, 8–11]. Both tasks become much more challenging when there is sound simultaneity, i.e., several sounds overlapping in time in a given room. For example, after the CLEAR'07 international evaluations, where AED was carried out with meeting room seminars, it became clear that time overlapping of acoustic events (AEs) was a major source of detection errors.
In the concrete scenario used for the experiments, a typical meeting room acoustic scene is considered, where only one person is speaking at a given time and other non-speech sounds may occur simultaneously with the speaker's voice. Therefore, we have to deal with the problem of detecting and localizing an acoustic event that may be temporally overlapped with speech. The detection of overlapping events may be tackled with different approaches, either at the signal level, at the model level, or at the decision level. In [13–15], a model-based approach was adopted for detection of events in that meeting room scenario with two sources, one of which is always speech and the other one is an acoustic event from a list of 11 predefined events. Thus, besides the mono-event acoustic models, additional acoustic models were considered for each AE overlapped with speech, so the number of models was doubled (22 in that case). That approach is used in the current real-time system implemented in the smart-room of Universitat Politècnica de Catalunya (UPC), which includes both AED and ASL.
In that model-based approach, a permutation problem exists. In fact, the AED system gives the hypothesized identities of the overlapped sounds, but does not associate each of them to one of the available source positions that are provided by the ASL system. The same problem may be encountered by using other AED approaches, for instance, if a blind source separation technique is used prior to the detection of each of the separated events. To solve that source ambiguity problem, a position assignment (PA) system that performs a one-to-one correspondence between the set of source positions and the set of class labels is presented in this paper.
The proposed PA system, a preliminary version of which was presented in [15, 16], consists of three stages: beamforming-based signal separation, model-based likelihood calculations, and fusion of log-likelihood ratios over the set of beamformers. Frequency-invariant null-steering beamformers are designed for the small microphone arrays existing in the smart-room. In addition, the likelihoods coming from both a speech model and an acoustic event model are combined in order to improve the system accuracy. The work presented in this paper is an extension and improvement of the work reported in [15, 16]. Unlike that previously published work, the PA system reported here employs all six microphone arrays available in the room. To take the decision, the scores obtained from each array are combined using either a product of likelihood ratios or a fuzzy integral-based fusion technique [17–19]. Experiments are carried out for the scenario described above and with a set of signals collected in the smart-room. The observed PA accuracy is larger than 95%.
The acoustic scenario is presented in Section 2 together with the signal database used for experimentation. The position assignment system is described in Section 3. Experiments are reported in Section 4, along with some practical issues about the system implementation and the used metrics. Conclusions are presented in Section 5.
2. Acoustic scenario and database
Table 1. Acoustic classes and their number of occurrences
For the offline design of the system, whose real-time version is later implemented in the room, a database is needed. It was recorded using a spatial distribution of the AE sources and the speech source, as depicted in Figure 1. The position of the speaker was rather stable, but the other AEs were produced within broad areas of the room layout. There are eight recorded sessions (S01 to S08) of isolated AEs, in which six different persons participated, each performing every AE several times. Note that, though in a real meeting room scenario the speaker may be placed at either the left or the right side of the room, in the database the speaker position is fixed. This does not constrain the usefulness of the results, because the system does not make use of that knowledge.
As in previous work, we have used for training, development, and testing up to eight sessions of audio data with isolated acoustic events. Each session was recorded with all six T-shaped microphone arrays (24 microphones). The overlapped signals of the database were generated by adding those AE signals recorded in the room to a speech signal, also recorded in the room from a single speaker with all 24 microphones. To do that, for each AE instance, a segment with the same length was extracted from the speech signal starting at a random position and added to the AE signal. The mean power of speech was made equivalent to the mean power of the overlapping AE. That addition of signals produces an increment of the background noise level, since it is included twice in the overlapped signals; however, going from isolated to overlapped signals, the SNR reduction is slight: from 18.7 to 17.5 dB. The average duration of the events is 500 ms, and the reverberation time of the room is around 450 ms. Signals were recorded at 44.1 kHz sampling frequency and further converted to 16 kHz.
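The overlap-generation procedure just described (extract a random speech segment of the event's length, equalize mean powers, and add) can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and the signal names and synthetic inputs are hypothetical.

```python
import numpy as np

def mix_equal_power(ae, speech, rng=None):
    """Overlap an acoustic-event signal with a randomly positioned
    speech segment of the same length, scaled so that both have the
    same mean power, as in the database generation described above."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick a random speech segment as long as the event.
    start = rng.integers(0, len(speech) - len(ae) + 1)
    seg = speech[start:start + len(ae)].astype(float)
    # Scale the segment to the mean power of the event.
    p_ae = np.mean(np.asarray(ae, dtype=float) ** 2)
    p_sp = np.mean(seg ** 2)
    seg *= np.sqrt(p_ae / p_sp)
    return ae + seg

# Example with synthetic signals (0.5 s event at 16 kHz).
rng = np.random.default_rng(0)
ae = rng.standard_normal(8000)
speech = 3.0 * rng.standard_normal(160000)
mixed = mix_equal_power(ae, speech, rng)
```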
3. Source position assignment
If the problem of assigning the two events to the two positions is solved, the other two cases with ambiguity can also be solved using the same approach. In this section, the aim is to design a system that can be deployed in real time in the room to resolve that ambiguity in the correspondence between detected AEs and acoustic source positions.
3.1. The AED and ASL subsystems
The two-source AED system included in Figure 2, which was developed in previous work, employs a model-based approach with one microphone. All accepted sound combinations are modeled, i.e., the AED system has a model for each class, whether it is an isolated event or a combination of events. This approach does not require a prior separation of the two overlapped signals, but it requires a number of models that may be too large. In our particular meeting room scenario, however, the approach is feasible because 11 AEs are considered, which may be overlapped with only one class, speech, so only 22 models are required [14, 15]. The ASL system, also developed in previous works, is based on the steered response power with phase transform (SRP-PHAT) algorithm, which uses the 24 microphones available in the room.
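As background on the localizer, the core ingredient of SRP-PHAT is the PHAT-weighted cross-correlation accumulated over microphone pairs. The following sketch shows that ingredient for a single pair; it is an illustration under simplifying assumptions, not the authors' implementation, and all names are hypothetical.

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Time delay of y relative to x via generalized cross-correlation
    with the phase transform (PHAT): the cross-spectrum is whitened so
    that only phase (i.e., delay) information remains, which is what
    SRP-PHAT accumulates over all microphone pairs."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = Y * np.conj(X)
    R /= np.abs(R) + 1e-12               # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# A 5-sample delay between two noise signals should be recovered.
rng = np.random.default_rng(1)
s = rng.standard_normal(4096)
delayed = np.concatenate((np.zeros(5), s[:-5]))   # s delayed by 5 samples
fs = 16000
tau = gcc_phat(s, delayed, fs)
```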
3.2. The position assignment system
Each of the beamformers is followed by feature extraction (FE) and likelihood computation (LC). In this work, hidden Markov models with Gaussian mixture observation densities (HMM-GMM) are employed for both acoustic events and speech. Given the AE class E, the model for E and the model for speech (sp) are needed for the likelihood computations. Finally, a decision block makes the assignment based on the computed log-likelihoods.
3.3. Null-steering beamforming
Null-steering beamforming (NSB) is one of the earliest, but potentially very useful, beamforming techniques. It belongs to a class of very popular and widely used beamforming techniques called multiple side lobe cancellers (MSC) [20–22]. NSB adapts the sensor array pattern by steering the main beam towards the desired source and placing nulls in the direction of the interfering sources. The solution for the weight matrix in this type of beamformer is obtained by setting the desired response to unity at the direction of the target sound and to zero at the directions of the interfering sources. In our particular scenario, only one interfering source has to be nulled, so only two microphone signals are required to get a solution for the weight matrix. However, since each T-shaped array of the room contains three linearly spaced microphones, all three of them will be used in the experiments.
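As an illustration of that constraint-based solution, the sketch below computes narrowband weights for a hypothetical three-microphone linear array, imposing a unity response toward the target and a null toward the interferer. The paper's frequency-invariant design is more elaborate; the geometry, frequency, and angles here are assumptions made only for the example.

```python
import numpy as np

def steering_vector(angle_deg, n_mics, d, freq, c=343.0):
    """Far-field steering vector of a uniform linear array with
    inter-microphone spacing d (meters) at the given frequency."""
    theta = np.deg2rad(angle_deg)
    delays = np.arange(n_mics) * d * np.sin(theta) / c
    return np.exp(-2j * np.pi * freq * delays)

def nsb_weights(target_deg, null_deg, n_mics=3, d=0.1, freq=1000.0):
    """Minimum-norm weights satisfying a unity response at the target
    direction and a zero response at the interferer direction."""
    A = np.vstack([steering_vector(target_deg, n_mics, d, freq),
                   steering_vector(null_deg, n_mics, d, freq)])
    g = np.array([1.0, 0.0])          # desired responses at the two angles
    return np.linalg.pinv(A) @ g      # solves A @ w = g exactly (full row rank)

w = nsb_weights(target_deg=-30.0, null_deg=40.0)
resp_target = abs(steering_vector(-30.0, 3, 0.1, 1000.0) @ w)
resp_null = abs(steering_vector(40.0, 3, 0.1, 1000.0) @ w)
```

With only one interferer, two constraints suffice, which is why two microphones already give a solution; the third microphone simply leaves a degree of freedom that the minimum-norm solution exploits.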
An alternative frequency-dependent approach was explored for a single array in our previous works [15, 16], using either a time domain implementation or a frequency domain implementation. However, in spite of careful frequency tuning, the obtained PA accuracy was only slightly higher than that of the frequency-invariant beamforming (FIB) based system for the time domain implementation. On the other hand, the alternative FIB technique does not require frequency tuning and is thus less dependent on the concrete scenario. For those reasons, FIB was chosen in the work reported here.
3.4. Single-array classification stage
As shown in Figure 3, the classification stage of the PA system with a single array consists of feature extraction, followed by log-likelihood calculation and a binary decision block. Features are extracted from the audio signals with a frame length of 30 ms and a frame shift of 20 ms. As features, frequency-filtered log filter-bank energies (FF-LFBEs), which were developed for speech recognition, are used. These features are uncorrelated, similarly to the conventional MFCC features, but unlike the latter, the FF-LFBE features keep the frequency localization property of the LFBEs. In the experiments, a 32-dimension feature vector (16 FF-LFBEs and their first temporal derivatives) is used. As shown in the scheme of Figure 3, there is a set of two likelihood calculators for each parallel channel, one of them to calculate the model-based log-likelihood for the AE label (E), provided by the AED system, and the other to calculate it for speech. As in previous work, hidden Markov models (HMMs) are employed, where Gaussian mixture models (GMMs) are used to calculate the emission probabilities. Thirty-two Gaussian components with diagonal covariance matrices are used per model. There is one left-to-right HMM with three emitting states for each AE and for speech. Eleven HMMs are trained with isolated events using the Baum-Welch algorithm. The HTK toolkit is used for training and testing this HMM-GMM system.
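The frequency filtering of the log filter-bank energies can be sketched as follows, assuming the usual first-order FF filter H(z) = z − z⁻¹ applied along the band axis; the edge handling and the simple two-frame temporal derivative are simplifications of this sketch, and the input matrix is synthetic.

```python
import numpy as np

def ff_lfbe(logfbe):
    """Frequency-filtered log filter-bank energies: apply H(z) = z - z^-1
    along the band axis, i.e. the k-th output is logfbe[k+1] - logfbe[k-1]
    (edge bands handled here by replication, a simplification)."""
    padded = np.pad(logfbe, ((0, 0), (1, 1)), mode="edge")
    return padded[:, 2:] - padded[:, :-2]

def time_deltas(feat):
    """Simple first temporal derivative: feat[t+1] - feat[t-1]."""
    padded = np.pad(feat, ((1, 1), (0, 0)), mode="edge")
    return padded[2:] - padded[:-2]

# Synthetic example: 100 frames of 16 log filter-bank energies
# yield the 32-dimension vector (16 FF-LFBEs + 16 deltas) per frame.
rng = np.random.default_rng(2)
logfbe = rng.standard_normal((100, 16))
ff = ff_lfbe(logfbe)
features = np.hstack([ff, time_deltas(ff)])
```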
If S is positive, the AE E is associated to the position P2, and if S is negative, it is associated to P1. Let us illustrate it with a particular case. Assume that P1 truly corresponds to speech and P2 to the acoustic event E. When using the AE model, it is expected to get comparatively higher log-likelihood from the output of NSB1 (LL1) than from the output of NSB2 (LL4). For the clean speech model, it is expected to get comparatively higher log-likelihood from the output of NSB2 (LL3) than from the output of NSB1 (LL2). If that is the case, the decision is taken that speech is at P1 and E is at P2, which is the correct decision. Note that with this type of combination, the decision block gives equal importance to all the four likelihood calculator outputs.
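Under that notation (LL1 and LL2: AE-model and speech-model log-likelihoods on the NSB1 output; LL4 and LL3: the same two log-likelihoods on the NSB2 output), the binary decision can be sketched as below. S is taken here as the average of the two ratios, which matches the later description of the S score; a plain sum would yield the same sign-based decision. The function name is hypothetical.

```python
def assign_position(ll1, ll2, ll3, ll4):
    """Assign the acoustic event E to position P1 or P2.
    ll1, ll2: AE-model and speech-model log-likelihoods on the NSB1 output;
    ll4, ll3: the same two log-likelihoods on the NSB2 output."""
    llr1 = ll1 - ll4           # AE-model log-likelihood ratio (LLR1)
    llr2 = ll3 - ll2           # speech-model log-likelihood ratio (LLR2)
    s = (llr1 + llr2) / 2.0    # combined score S
    # S > 0: E assigned to P2 (speech at P1); S < 0: E assigned to P1.
    return ("P2" if s > 0 else "P1"), s

# Example matching the illustrative case in the text: speech at P1, E at P2,
# so LL1 > LL4 and LL3 > LL2, and S comes out positive.
decision, s = assign_position(ll1=-100.0, ll2=-120.0, ll3=-90.0, ll4=-130.0)
```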
In order to get the most from the available information, in the current scheme, unlike in that previous work, the classification stage includes a speech model besides the AE model. Indeed, the system could also work with only either the speech model-based classifier or the AE model-based classifier. To study the contribution of each of the models, all those options have been tested, and the results are reported in Section 4. For the decision, if only the AE-based or only the speech-based classifier is used, just the difference LL1 − LL4 or LL3 − LL2, respectively, is needed.
3.5. Multi-array fusion
As mentioned earlier, all six three-microphone linear arrays deployed in the room are used in the position assignment system. To take the assignment decision, the six sets of scores LLR1 and LLR2, computed as indicated in Equation 1, are combined either with a uniformly weighted average of the 12 values or by fuzzy integral-based fusion. In the following, the latter technique is presented.
3.5.1. FI-based optimized fusion
The scores at the output of the classification stage can be linearly combined by using an optimal fusion approach that assigns an individual weight to each of them. However, a more sophisticated weighting technique that considers all subsets of information sources, the fuzzy integral (FI) approach, is adopted in this work.
where μ(N + 1) = 0. The value μ(Q) can be viewed as a weight related to a subset Q of the set Z of N information sources. It is called the fuzzy measure, and if Q and T are subsets of Z, it has to meet the following conditions: μ(∅) = 0 and μ(Z) = 1 (boundary conditions), and Q ⊆ T implies μ(Q) ≤ μ(T) (monotonicity).
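Computationally, the FI used here corresponds to the Choquet integral with respect to the fuzzy measure μ: with the scores sorted in decreasing order, each score is weighted by the increment of μ along the chain of top-i subsets. A minimal sketch follows, assuming the measure is supplied as a dictionary over subsets; learning μ from training data, as done in the paper, is omitted.

```python
from itertools import combinations

def choquet_integral(scores, mu):
    """Choquet fuzzy integral of `scores` (one value per information
    source) with respect to the fuzzy measure `mu`, a dict mapping
    frozensets of source indices to measure values, with mu(empty) = 0
    and mu(all sources) = 1 for a normalized measure."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total, prev = 0.0, 0.0
    for i in range(len(order)):
        cur = mu[frozenset(order[:i + 1])]   # measure of the top-(i+1) subset
        total += scores[order[i]] * (cur - prev)
        prev = cur
    return total

# Sanity check: for an additive measure (each subset's value is the sum
# of individual weights), the Choquet integral reduces to a weighted mean.
weights = [0.5, 0.2, 0.3]
mu = {frozenset(s): sum(weights[i] for i in s)
      for r in range(4) for s in combinations(range(3), r)}
fi = choquet_integral([0.9, 0.2, 0.5], mu)
```

For a non-additive μ, the integral can reward or penalize coalitions of sources, which is what makes the FI richer than any fixed linear weighting.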
4. Experimental work
The PA experiments are done under the assumption that there is always an AE overlapped with speech. It is assumed that the identity of the AE is known, to avoid the propagation of AED errors to the PA system. Additionally, it is assumed that the approximate positions in the room of the AE source and the speaker are known. Thus, the PA system only has to make a binary decision (the AE comes from either position P1 or position P2), which will be either correct or incorrect.
To design and evaluate the performance of the system, the position assignment rate (PAR) metric for a given AE class is defined as the quotient between the number of correct decisions and the total number of occurrences of that class in the testing database. The PAR is then averaged over the classes to obtain the final evaluation measure. For reference, a second metric is also considered, called Diff_LL, which is the value of the S score from Equation 1 provided that the assignment is correct (LL1 − LL4 for the AE-based system, LL3 − LL2 for the speech-based one, or S when both the AE model and the speech model are used). That score can be regarded as an estimate of the degree of source separation achieved by the beamformers when a correct assignment is made.
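The class-averaged PAR metric can be sketched as follows; this is an illustrative implementation, and the class labels and result format are hypothetical.

```python
from collections import defaultdict

def position_assignment_rate(results):
    """Class-averaged position assignment rate (PAR).
    `results` holds (ae_class, decided_position, true_position) tuples;
    the per-class rates of correct binary decisions are computed first
    and then averaged over the AE classes."""
    correct, total = defaultdict(int), defaultdict(int)
    for ae_class, decided, truth in results:
        total[ae_class] += 1
        correct[ae_class] += int(decided == truth)
    per_class = [correct[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)

# Hypothetical example: "cough" is right once out of twice (rate 0.5),
# "keys" once out of once (rate 1.0), so the class average is 0.75.
par = position_assignment_rate([
    ("cough", "P1", "P1"),
    ("cough", "P2", "P1"),   # incorrect decision
    ("keys",  "P2", "P2"),
])
```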
4.1. Results and discussion
Table 2. PA rate and Diff_LL for the PA system with the T6 array alone
It can be observed, from the results in Table 2, that the combination of the two models with the S score, which averages the scores LLR1 and LLR2, improves the performance of the system with respect to the use of only one type of model. The improvement is much more noticeable using the FI-based fusion of the two scores. Notice also that the AE model-based system works much better than the speech model-based one. In fact, the former uses a more specific model, because the speech model is obtained from the whole set of speech sounds. In that table, the Diff_LL score is also shown. Notice that, in general, it is well correlated with the PAR. However, there is a large difference between the values of Diff_LL for the AE-based case and the LL combination case, in contrast with the very small difference in terms of PAR. This means that the use of both models yields much stronger confidence in the PA decision when it is correct.
Table 3. PA performance (%) with standard deviation for each single array and for the two combinations

Broad-beam nulling angle
  Intra-array combination    T1          T2          T3          T4          T5          T6          Avg. fusion   FI fusion
  Average of LLR scores      83.1 ± 1.9  77.3 ± 2.6  81.3 ± 1.9  88.9 ± 2.0  88.2 ± 1.8  87.1 ± 1.8  89.8 ± 2.0    93.5 ± 1.9
  FI-based fusion            83.5 ± 1.9  77.1 ± 2.5  82.3 ± 1.8  92.8 ± 1.4  92.7 ± 1.5  91.2 ± 1.6  -             93.6 ± 1.7

ASL-estimated AE positions
  Intra-array combination    T1          T2          T3          T4          T5          T6          Avg. fusion   FI fusion
  Average of LLR scores      88.2 ± 1.8  85.6 ± 2.3  89.8 ± 1.7  91.2 ± 1.6  92.1 ± 1.7  91.0 ± 1.9  93.6 ± 1.8    95.4 ± 1.7
  FI-based fusion            88.3 ± 1.7  84.9 ± 2.1  90.2 ± 1.6  92.7 ± 1.3  93.0 ± 1.4  92.2 ± 1.4  -             95.7 ± 1.5
In the first columns of Table 3, we show the PAR (in %) with the standard deviations when each one of the arrays is used alone. Notice that the PAR scores of the upper half of the array numbers (T4, T5, and T6) are higher than those of the lower half (T1, T2, and T3). This could be expected, since for those arrays the angle between the acoustic event positions and the speech source position is larger than for the other arrays.
Note from the last two columns in Table 3 that the accuracy obtained from either the average of LLR scores or the FI-based fusion of the whole set of arrays is higher than the accuracy obtained from any of the single arrays. Comparing both types of fusion, the FI one shows noticeably better performance for both DOA settings, reaching a PA error of only 4.3%.
The use of intra-array FI-based fusion improves the PAR scores with respect to using a uniformly weighted average of LLR scores, especially for the upper half arrays. Therefore, though the FI approach has the cost of learning the fuzzy measures from data, it may be a good choice when the quality of the signal separation is not too low, as presumably happens with the upper half arrays.
Regarding the type of DOA setting, the system based on ASL-estimated AE positions always works better than the one that uses an average DOA based on visual inspection (i.e., a broad-beam nulling angle). That could be expected, since in the former case the beam pattern is specific to each event occurrence, whereas the broad beam encompasses all the angles of the AE source positions. While the latter design simplifies the overall system, as it does not require a precise source position and may avoid an additional external ASL block, it is specific to the given scenario, so it has to be redesigned when the scenario changes.
An attempt has been made in this paper to resolve the source identification ambiguity that appears when an acoustic event overlapping with speech is detected. A position assignment system has been proposed and tested. It consists, first, of a set of frequency-invariant null-steering beamformers that carry out different partial signal separations with each microphone array. The beamformers are followed by model-based likelihood calculations, using both the acoustic event model and the speech model, to obtain two likelihood ratios whose combination gives a final score per array. Using the fuzzy integral both for that intra-array combination and for the fusion of the six array scores, the best assignment error is obtained, which is smaller than 5%. It is worth noticing that, though the position assignment system has been developed for the problem encountered in the current scenario, its scheme can be extended to more than two sources and to different types of sound overlap combinations. Future work will be devoted to that.
This work has been supported by the Spanish project SARAI (TEC2010-21040-C02-01).
- Wang W: Machine Audition: Principles, Algorithms and Systems. IGI Global; 2010.
- Waibel A, Stiefelhagen R: Computers in the Human Interaction Loop. New York: Springer; 2009.
- Temko A, Nadeu C, Macho D, Malkin R, Zieger C, Omologo M: Acoustic event detection and classification. In Computers in the Human Interaction Loop. Edited by: Waibel A, Stiefelhagen R. New York: Springer; 2009:61-73.
- Zhuang X, Zhou X, Hasegawa-Johnson MA, Huang TS: Real-world acoustic event detection. Pattern Recogn. Lett. 2010, 31:1543-1551. 10.1016/j.patrec.2010.02.005
- Omologo M, Svaizer P: Acoustic event localization using crosspower-spectrum phase based technique. In ICASSP. Adelaide; 1994.
- Nishiura T, Nakamura S: Study of environmental sound source identification based on hidden Markov model for robust speech recognition. J. Acoust. Soc. Am. 2003, 114(4):2399.
- Wang H, Chu P: Voice source localization for automatic camera pointing system in videoconferencing. In ICASSP. Munich; 1997:187-190.
- DiBiase J, Silverman HF, Brandstein M: Robust localization in reverberant rooms. In Microphone Arrays: Signal Processing Techniques and Applications. Edited by: Brandstein M, Ward D. New York: Springer; 2001.
- Wang D, Brown GJ (Eds): Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE; 2006.
- Dmochowski J, Benesty J: Steered beamforming approaches for acoustic source localization. In Speech Processing in Modern Communication. Volume 12. Edited by: Cohen I, Benesty J, Gannot S. Berlin: Springer; 2010:307-337.
- Velasco J, Pizarro D, Macias-Guarasa J: Source localization with acoustic sensor arrays using generative model based fitting with sparse constraints. Sensors 2012, 12(10):13781-13812.
- CLEAR: Classification of Events, Activities and Relationships: Evaluation and Workshop. Baltimore; 2007.
- Temko A, Nadeu C: Acoustic event detection in meeting-room environments. Pattern Recogn. Lett. 2009, 30(14):1281-1288.
- Butko T, Gonzalez Pla F, Segura C, Nadeu C, Hernando J: Two-source acoustic event detection and localization: online implementation in a smart-room. In Proc. EUSIPCO. Barcelona; 2011.
- Chakraborty R, Nadeu C, Butko T: Detection and positioning of overlapped sounds in a room environment. In Proc. Interspeech. Portland; 2012.
- Chakraborty R, Nadeu C, Butko T: Binary position assignment of two known simultaneous acoustic sources. In Proc. IberSPEECH. Madrid; 2012.
- Grabisch M: Fuzzy integral in multi-criteria decision-making. Fuzzy Set Syst 1995, 69(3):279-298. 10.1016/0165-0114(94)00174-6
- Chang S, Greenberg S: Syllable-proximity evaluation in automatic speech recognition using fuzzy measures and a fuzzy integral. In Proc. IEEE Fuzzy Systems Conference. St. Louis; 2003:828-833.
- Temko A, Macho D, Nadeu C: Fuzzy integral based information fusion for classification of highly confusable non-speech sounds. Pattern Recogn. 2008, 41(5):1831-1840.
- Van Veen BD, Buckley KM: Beamforming: a versatile approach to spatial filtering. IEEE ASSP Mag. 1988, 5(2):4-24.
- Applebaum SP: Adaptive arrays. IEEE Trans. Antennas Propag. 1976, 24:585-595. 10.1109/TAP.1976.1141417
- Feng AS, Jones DL: Localization-based grouping. In Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Edited by: Wang D, Brown GJ. IEEE/Wiley-Interscience; 2006.
- Hoshuyama O, Sugiyama A: Robust adaptive beamforming. In Microphone Arrays: Signal Processing Techniques and Applications. Edited by: Brandstein M, Ward D. New York: Springer; 2001.
- Ward DB, Kennedy RA, Williamson RC: Theory and design of broadband sensor arrays with frequency invariant far-field beam patterns. J. Acoust. Soc. Am. 1995, 97:1023-1034. 10.1121/1.412215
- Parra LC: Least squares frequency invariant beamforming. In Proc. Workshop on Applications of Signal Processing to Audio and Acoustics. New York: IEEE; 2005.
- Parra LC: Steerable frequency-invariant beamforming for arbitrary arrays. J. Acoust. Soc. Am. 2006, 119(6):3839-3847. 10.1121/1.2197606
- Zhao Y, Liu W, Langley RJ: Design of frequency invariant beamformers in subbands. In Proc. IEEE/SP 15th Workshop on Statistical Signal Processing. Cardiff; 2009.
- Nadeu C, Macho D, Hernando J: Frequency and time filtering of filter-bank energies for robust HMM speech recognition. Speech Comm. 2001, 34:93-114. 10.1016/S0167-6393(00)00048-0
- Rabiner L, Juang B: Fundamentals of Speech Recognition. Prentice Hall; 1993.
- Young S, Evermann G, Kershaw D, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P: The HTK Book (for HTK Version 3.2). Cambridge University; 2002.
- Kuncheva LI: Combining Pattern Classifiers: Methods and Algorithms. New Jersey: Wiley-Interscience; 2004.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.