Speech enhancement with an acoustic vector sensor: an effective adaptive beamforming and post-filtering approach
© Zou et al.; licensee Springer. 2014
Received: 21 January 2014
Accepted: 3 April 2014
Published: 27 April 2014
Demand for speech enhancement in mobile communications is increasing, and the task remains challenging in real ambient noise. This paper develops an effective spatial-frequency domain speech enhancement method that uses a single acoustic vector sensor (AVS) in conjunction with minimum variance distortionless response (MVDR) spatial filtering and Wiener post-filtering (WPF). In remote speech applications, MVDR spatial filtering is effective in suppressing strong spatial interferences, and the Wiener post-filter is a popular and powerful estimator for further suppressing the residual noise, provided that the power spectral density (PSD) of the target speech can be estimated properly. Using the favorable directional response of the AVS together with the trigonometric relations of the steering vectors, a closed-form estimate of the signal PSDs is derived and the frequency response of the optimal Wiener post-filter is determined accordingly. Extensive computer simulations and a real experiment in an anechoic chamber have been carried out to evaluate the performance of the proposed algorithm. Simulation results show that the proposed method suppresses spatial interference well while maintaining log spectral deviation (LSD) and perceptual evaluation of speech quality (PESQ) performance comparable to that of conventional methods. Moreover, a single AVS solution is particularly attractive for hands-free speech applications due to its compact size.
Background noise significantly degrades the quality and intelligibility of speech, so speech enhancement has long been an important and challenging problem, and various methods have been proposed in the literature to tackle it. Spectral subtraction, Wiener filtering, and their variations are commonly used for suppressing additive noise, but they cannot effectively suppress spatial interference. To eliminate spatial interferences, beamforming techniques applied to microphone array recordings can be employed [2–9]. Among these, the minimum variance distortionless response (MVDR) beamformer, also known as the Capon beamformer, and its equivalent generalized sidelobe canceller (GSC) work successfully in remote speech enhancement applications. However, the performance of MVDR-type methods depends on the number of array sensors used, which limits their application. Moreover, the MVDR beamformer is not effective at suppressing additive noise, leaving residual noise in its output. As a result, the well-known Wiener post-filtering solution is normally employed to further reduce the residual noise in the beamformer output. Recently, speech enhancement using the acoustic vector sensor (AVS) array has received research attention because its microphones are spatially co-located and its signals are time-aligned [5, 10–12]. Compared with the traditional microphone array, the compact structure (occupying a volume of approximately 1 cm³) makes the AVS much more attractive in portable speech enhancement applications. Research has shown that the AVS array beamformer with the MVDR method [5, 10] successfully suppresses spatial interferences but fails to effectively suppress background noise.
The integrated MVDR and Wiener post-filtering method using an AVS array offers good performance in suppressing spatial interferences and background additive noise, but it requires more than two AVS units as well as a good voice activity detection (VAD) technique.
In this paper, we focus on developing a speech enhancement solution capable of effectively suppressing spatial interferences and additive noise at a lower computational cost using only one AVS unit. More specifically, by exploiting the unique spatial co-location property (the signal arrives at the sensors at the same time) and the trigonometric relations of the steering vectors of the AVS, a single AVS-based speech enhancement system is proposed. The norm-constrained MVDR method is employed to form the spatial filter, while the optimal Wiener post-filter is designed by using a novel closed-form power spectral density (PSD) estimation method. The proposed solution does not depend on the VAD technique (for noise estimation) and hence has the advantages of small size, lower computational cost, and the ability to suppress both spatial interferences and background noise.
The paper is organized as follows. The data model of an AVS and the frequency domain MVDR (FMV) with a single AVS are presented in Section 2. The detailed derivation of the closed-form estimation of the signal PSDs for an optimal Wiener post-filtering (WPF) using the AVS structure is given in Section 3. The proposed norm-constrained FMV-effective Wiener post-filtering (NCFMV-EWPF) algorithm for speech enhancement is presented in Section 4. Simulation results are presented in Section 5. Section 6 concludes our work.
2 Problem formulation
2.1 Data model for an AVS unit
where $[\cdot]^T$ denotes the vector/matrix transposition.
where $x_u(t)$, $x_v(t)$, and $x_o(t)$ are the received data of the u-, v-, and o-sensor, respectively, and $n_u(t)$, $n_v(t)$, and $n_o(t)$ are the captured noise at the u-, v-, and o-sensor, respectively. The task of speech enhancement with an AVS is to estimate $s(t)$ from $\mathbf{x}_{avs}(t)$.
In this study, without loss of generality, we follow the commonly used assumptions: (1) $s(t)$ and $s_i(t)$ are mutually uncorrelated; (2) $n_u(t)$, $n_v(t)$, and $n_o(t)$ are mutually uncorrelated.
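The data model described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the steering vector is taken to be the two-dimensional form $[\cos\phi, \sin\phi, 1]^T$ for the u-, v-, and o-sensor, which is a common AVS manifold but is not reproduced in the text here, and the function names `steering_vector` and `simulate_avs` are hypothetical.

```python
import math
import random


def steering_vector(phi_deg):
    """Assumed 2-D AVS manifold for the (u, v, o) channels: [cos(phi), sin(phi), 1]."""
    phi = math.radians(phi_deg)
    return [math.cos(phi), math.sin(phi), 1.0]


def simulate_avs(s, s_i, phi_s, phi_i, noise_std=0.0, seed=0):
    """Return [x_u, x_v, x_o]: target + interference + independent sensor noise.

    s, s_i      : target and interference sample sequences (same length)
    phi_s, phi_i: assumed azimuth angles of target and interferer, in degrees
    """
    rng = random.Random(seed)
    a_s = steering_vector(phi_s)
    a_i = steering_vector(phi_i)
    channels = []
    for k in range(3):  # k = 0, 1, 2 -> u-, v-, o-sensor
        channels.append([
            a_s[k] * st + a_i[k] * it + rng.gauss(0.0, noise_std)
            for st, it in zip(s, s_i)
        ])
    return channels


# Noise-free example: target at 90 degrees, silent interferer at 0 degrees.
x_u, x_v, x_o = simulate_avs([1.0, 2.0], [0.0, 0.0], phi_s=90.0, phi_i=0.0)
```

In this example the target is (numerically) absent from the u-channel and appears unattenuated in the v- and o-channels, which reflects the directional response that the later derivations rely on.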
2.2 FMV with a single AVS
2.3 The estimation of the power spectral density
As discussed above, the NCFMV is only effective in suppressing the spatial interferences. In this section, a new solution is proposed that incorporates the well-known Wiener post-filter (WPF) to further suppress the residual noise in the beamformer output $Y(f)$ in (18).
where $\psi_{YS}(f)$ is the cross-power spectral density (CSD) of $S(f)$ and $Y(f)$, and $\psi_{YY}(f)$ is the power spectral density (PSD) of $Y(f)$. Generally, the target speech and the residual interference components in $Y(f)$ are considered uncorrelated, so the second (approximate) equality in (19) follows from (18). From (19), it is clear that good estimates of $\psi_{SS}(f)$ and $\psi_{YY}(f)$ from $X(f)$ and $Y(f)$ are crucial to the performance of the WPF. Several PSD estimation algorithms have been proposed under different spatial-frequency joint estimation schemes. In single-channel applications, for example, voice activity detection (VAD) is usually applied to separate noise and speech segments, and spectral subtraction is then used to remove the noise components before estimating $\psi_{SS}(f)$. For microphone array post-filtering schemes, $\psi_{SS}(f)$ can be estimated from the available multichannel signals, which are assumed to be captured in an incoherent noise environment.
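To make the role of the two spectral quantities concrete, the following minimal sketch computes a per-bin Wiener post-filter gain as the ratio $\psi_{SS}(f)/\psi_{YY}(f)$ and applies it to the beamformer output. The function names, the small `floor` guarding against division by zero, and the limiting of the gain to [0, 1] are assumptions added for illustration; they are not taken from the paper.

```python
def wiener_gain(psd_ss, psd_yy, floor=1e-12):
    """Per-bin Wiener gain psd_ss / psd_yy, limited to [0, 1] (assumed safeguard)."""
    gains = []
    for ss, yy in zip(psd_ss, psd_yy):
        g = ss / max(yy, floor)          # avoid division by zero
        gains.append(min(max(g, 0.0), 1.0))
    return gains


def apply_post_filter(Y, gains):
    """Scale each complex spectral bin of the beamformer output by its gain."""
    return [g * y for g, y in zip(gains, Y)]


# Three frequency bins: the last estimate exceeds psd_yy and is limited to 1.
gains = wiener_gain([1.0, 0.5, 3.0], [2.0, 0.5, 1.0])
enhanced = apply_post_filter([2 + 2j, 4 + 0j, 1 - 1j], gains)
```

The sketch shows why estimation quality matters: any error in the estimate of $\psi_{SS}(f)$ or $\psi_{YY}(f)$ translates directly into a gain error in the corresponding bin.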
3 The formulation of the Wiener post-filter
3.1 Derivation of the estimate of CSD and PSD
3.2 The proposed EWPF method and some discussions
So far, we have mathematically derived the closed-form expressions of $\psi_{SS}$ in (34), $\psi_{YY}$ in (27), and $W_{pf}$ in (19). Since $Y$, $X_u$, $X_v$, and $X_o$ can be measured, the estimates of $\psi_{SS}$ and $\psi_{YY}$ can be determined accordingly. Hence, (33), (34), (27), and (19) describe the basic form of our proposed effective Wiener post-filtering algorithm for further enhancing the speech with an AVS (termed EWPF for short). In what follows, we discuss several aspects of the proposed EWPF method.
where l is the frame index and λ ∈ (0, 1] is the forgetting factor.
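A recursive spectral estimate with a forgetting factor can be sketched as follows. The exact weighting used in the paper is not reproduced here, so the first-order form below (weight `lam` on the previous estimate and `1 - lam` on the current frame) is an assumption, as is the function name `update_csd`. Note that in this assumed form a value of `lam` equal to 1 would freeze the estimate, so the paper's own recursion may weight the terms differently.

```python
def update_csd(prev, X, Y, lam):
    """One recursive update of a cross-spectral estimate per frequency bin.

    prev : estimate from frame l - 1 (list of complex values)
    X, Y : complex spectra of the current frame l
    lam  : forgetting factor (assumed weighting, see the note above)
    """
    return [
        lam * p + (1.0 - lam) * (x * y.conjugate())
        for p, x, y in zip(prev, X, Y)
    ]


# With X == Y the update tracks a power spectral density.
psd = [0j]
for _ in range(3):
    psd = update_csd(psd, [2 + 0j], [2 + 0j], lam=0.5)
```

After three frames the estimate has moved from 0 to 3.5, approaching the true value of 4; a larger `lam` would give a smoother but slower estimate.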
Thirdly, analyzing the properties of $g(\phi_i)$, we observe the following: (1) If the target source $s(t)$ is considered as short-time spatially stationary (approximately true for speech applications), $\mathbf{w}_{NC}$ in (17) can be updated every $L_u$ frames to reduce computational complexity. Therefore, from the definition of (13), the gain $g(\phi_i)$ will remain unchanged within $L_u$ frames. However, is estimated frame by frame via (35); therefore, a more accurate estimation of $g(\phi_i)$ can be achieved by averaging over $L_u$ frames. (2) From (36), it is clear that a small denominator will lead to a large variation of $g(\phi_i)$, reflecting incorrect estimates since the NCFMV is designed to suppress rather than to amplify the interference. Hence, it is reasonable to apply a clipping function $f_c(x, b)$ (see (43)) to remove the outliers in the estimate of .
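The two observations above (averaging over $L_u$ frames and clipping outliers) can be sketched together. The definition of $f_c(x, b)$ in (43) is not reproduced here, so the `clip` function below, which limits a value to the interval [0, b], is only an assumed stand-in; treating the per-frame gain estimates as real, nonnegative numbers and clipping before averaging are likewise assumptions.

```python
def clip(x, b):
    """Assumed stand-in for f_c(x, b): limit x to the interval [0, b]."""
    return min(max(x, 0.0), b)


def smoothed_gain(frame_estimates, b=1.0):
    """Clip each per-frame estimate, then average over the block of frames."""
    clipped = [clip(g, b) for g in frame_estimates]
    return sum(clipped) / len(clipped)


# Four per-frame estimates; 50.0 and -3.0 are outliers caused by a small denominator.
g_hat = smoothed_gain([0.2, 0.4, 50.0, -3.0], b=1.0)
```

Without clipping, the plain average of these four values would be 11.9, which would wrongly suggest that the beamformer amplifies the interference; with clipping the averaged estimate stays in a plausible range.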
4 The proposed NCFMV-EWPF algorithm
For presentation completeness, the proposed NCFMV-EWPF algorithm is summarized in Algorithm 1.
5 Simulation study
- 1. Output signal to interference plus noise ratio (SINR), defined in (44)
- 2. Log spectral deviation (LSD), which is used to measure the speech distortion and is defined in (45)
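The two measures listed above can be computed along the following lines. Since (44) and (45) are not reproduced here, the sketch uses widely used textbook forms, which are assumptions: the SINR is taken as the ratio of target energy to the energy of everything else in the output, and the LSD as the root-mean-square difference of log power spectra per frame, averaged over frames. The function names and the `eps` regularizer are hypothetical.

```python
import math


def sinr_db(target, output):
    """Target energy over residual (output - target) energy, in dB."""
    sig = sum(t * t for t in target)
    err = sum((o - t) ** 2 for t, o in zip(target, output))
    return 10.0 * math.log10(sig / err)


def lsd_db(ref_power, est_power, eps=1e-12):
    """Mean over frames of the RMS log-spectral difference (dB).

    ref_power, est_power: lists of frames, each a list of power-spectrum bins.
    """
    per_frame = []
    for ref, est in zip(ref_power, est_power):
        sq = [
            (10.0 * math.log10(r + eps) - 10.0 * math.log10(e + eps)) ** 2
            for r, e in zip(ref, est)
        ]
        per_frame.append(math.sqrt(sum(sq) / len(sq)))
    return sum(per_frame) / len(per_frame)


sinr = sinr_db([1.0, 1.0], [1.1, 0.9])          # residual energy is 1% of target
lsd = lsd_db([[1.0, 10.0]], [[1.0, 1.0]])       # one frame, two bins
```

A higher SINR indicates better suppression of interference and noise, while a lower LSD indicates less spectral distortion of the target speech.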
In addition, we compared the performance of the Zelinski post-filter (ZPF), NCFMV, and NCFMV-ZPF algorithms with that of our proposed algorithm under the same conditions. The setup of the single AVS unit is shown in Figure 1.
5.1 Experiments on simulated data
5.1.1 Experiment 1: the SINR performance under different noise conditions
SINR-out for different algorithms (dB)
- Trials 1–3: $n_{avs}(t) = 0$ and $s_i(t) \neq 0$
- Trials 4–6: $n_{avs}(t) \neq 0$ and $s_i(t) = 0$
- Trials 7–9: $n_{avs}(t) \neq 0$ and $s_i(t) \neq 0$
5.1.2 Experiment 2: the impact of the angle between the target and interference speakers
5.1.3 Experiment 3: SINR, LSD, and PESQ performance
5.2 Experiments on recorded data in an anechoic chamber
5.2.1 Experiment 4: the SINR-out performance with different speakers
5.2.2 Experiment 5: the impact of the angle between the target and interference speakers
From Figure 8, it is clear that the proposed NCFMV-EWPF algorithm outperforms the NCFMV algorithm for all Δϕ values. The results with recorded data lead to conclusions similar to those drawn from the simulated data in Figure 5. More specifically, with the recorded data, the proposed NCFMV-EWPF algorithm can effectively enhance the target speech when Δϕ > 15°.
5.2.3 Experiment 6: PESQ performance versus Δϕ
6 Conclusion
In this paper, a novel speech enhancement algorithm, named NCFMV-EWPF, has been derived for a single AVS unit using an efficient closed-form estimate of the power spectral densities of the signals. Computer simulation results show that the proposed NCFMV-EWPF algorithm outperforms the existing ZPF, NCFMV, and NCFMV-ZPF algorithms in suppressing the competing speaker and the noise field. Results of real experiments show that, compared with the NCFMV algorithm, the proposed NCFMV-EWPF algorithm can effectively suppress the competing speech and additive noise while maintaining good speech quality and low distortion. In addition, the NCFMV-EWPF algorithm does not require the VAD technique, which not only reduces the computational complexity but also provides more robust performance in a noisy environment, in the form of higher output SINR, less speech distortion, and better speech intelligibility. The approach developed in this paper is expected to be a suitable solution for implementation within hands-free speech recording systems.
This work is partially supported by the National Natural Science Foundation of China (No. 61271309) and the Shenzhen Science & Technology Fundamental Research Program (No. JCY201110006). It was also partially supported by the Australian Research Council Grant DP1094053.
- Boll S: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27(2):113-120. doi:10.1109/TASSP.1979.1163209
- Griffiths LJ, Jim CW: An alternative approach to linearly constrained adaptive beamforming. IEEE Trans. Antennas Propag. 1982, 30(1):27-34. doi:10.1109/TAP.1982.1142739
- Zou YX, Chan SC, Wan B, Zhao J: Recursive robust variable loading MVDR beamforming in impulsive noise environment. Volume 1–4. Macao: Paper presented at the IEEE Asia Pacific conference on circuits and systems; 2008:988-991.
- Zelinski R: A microphone array with adaptive post-filtering for noise reduction in reverberant rooms. New York: Paper presented at the IEEE international conference on acoustics, speech, and signal processing (ICASSP); 1988.
- Lockwood ME, Jones DL: Beamformer performance with acoustic vector sensors in air. J. Acoust. Soc. Am. 2006, 119: 608-619. doi:10.1121/1.2139073
- McCowan IA, Bourlard H: Microphone array post-filter based on noise field coherence. IEEE Trans. Speech Audio Process. 2003, 11(6):709-716. doi:10.1109/TSA.2003.818212
- Benesty J, Sondhi MM, Huang Y: Springer Handbook of Speech Processing. Berlin-Heidelberg: Springer; 2008.
- Vaseghi SV: Advanced Digital Signal Processing and Noise Reduction. 2nd edition. Chichester: John Wiley & Sons Ltd; 2000.
- Bitzer J, Simmer KU, Kammeyer KD: Multichannel noise reduction algorithms and theoretical limits. Rhodes: Paper presented at EURASIP European signal processing conference (EUSIPCO); 1998.
- Lockwood ME, Jones DL, Bilger RC, Lansing CR, O'Brien WD, Wheeler BC, Feng AS: Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms. J. Acoust. Soc. Am. 2004, 115: 379. doi:10.1121/1.1624064
- Shujau M, Ritz CH, Burnett IS: Speech enhancement via separation of sources from co-located microphone recordings. Dallas: Paper presented at IEEE international conference on acoustics, speech and signal processing (ICASSP); 2010.
- Wu PKT, Jin C, Kan A: A multi-microphone speech enhancement algorithm tested using acoustic vector sensor. Tel-Aviv-Jaffa: Paper presented at the 12th international workshop on acoustic echo and noise control; 2010.
- Li B, Zou YX: Improved DOA estimation with acoustic vector sensor arrays using spatial sparsity and subarray manifold. Kyoto: Paper presented at IEEE international conference on acoustics, speech and signal processing (ICASSP); 2012.
- Shi W, Zou YX, Li B, Ritz CH, Shujau M, Xi J: Multisource DOA estimation based on time-frequency sparsity and joint inter-sensor data ratio with single acoustic vector sensor. Vancouver: Paper presented at IEEE international conference on acoustics, speech and signal processing (ICASSP); 2013.
- Shujau M: In air acoustic vector sensors for capturing and processing of speech signals, Dissertation. University of Wollongong; 2011.
- Gray R, Buzo A, Gray JA, Matsuyama Y: Distortion measures for speech processing. IEEE Trans. Acoust. Speech Signal Process. 1980, 28(4):367-376. doi:10.1109/TASSP.1980.1163421
- ITU-T: Recommendation P.862 - Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs. Geneva: International Telecommunication Union - Telecommunication Standardization Sector; 2001.
- NOISEX-92. http://www.speech.cs.cmu.edu/comp.speech/Section1/Data/noisex.html
- Ritz CH, Burnett IS: Separation of speech sources using an acoustic vector sensor. Hangzhou: Paper presented at IEEE international workshop on multimedia signal processing; 2011.
- Subcommittee IEEE: IEEE recommended practice for speech quality measurements. IEEE Trans. Audio Electro-acoustics 1969, AU-17(3):225-246.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.