A Maximum Likelihood Estimation of Vocal-Tract-Related Filter Characteristics for Single Channel Speech Separation

Radfar, Mohammad H.; Dansereau, Richard M.; Sayadiyan, Abolghasem

doi:10.1155/2007/84186

Research Article
Open access
Published: 16 November 2006

Mohammad H. Radfar¹,
Richard M. Dansereau² &
Abolghasem Sayadiyan¹

EURASIP Journal on Audio, Speech, and Music Processing volume 2007, Article number: 084186 (2006)

Abstract

We present a new technique for separating two speech signals from a single recording. The proposed method bridges the gap between underdetermined blind source separation techniques and those techniques that model the human auditory system, that is, computational auditory scene analysis (CASA). For this purpose, we decompose the speech signal into the excitation signal and the vocal-tract-related filter and then estimate the components from the mixed speech using a hybrid model. We first express the probability density function (PDF) of the mixed speech's log spectral vectors in terms of the PDFs of the underlying speech signals' vocal-tract-related filters. Then, the mean vectors of PDFs of the vocal-tract-related filters are obtained using a maximum likelihood estimator given the mixed signal. Finally, the estimated vocal-tract-related filters along with the extracted fundamental frequencies are used to reconstruct estimates of the individual speech signals. The proposed technique effectively adds vocal-tract-related filter characteristics as a new cue to CASA models using a new grouping technique based on an underdetermined blind source separation. We compare our model with both an underdetermined blind source separation and a CASA method. The experimental results show that our model outperforms both techniques in terms of signal-to-noise ratio (SNR) improvement and the percentage of crosstalk suppression.
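
The decomposition into an excitation signal and a vocal-tract-related filter can be illustrated in the log spectral domain, where the product of the two components becomes a sum. The sketch below is only an illustration under stated assumptions: the abstract does not say how the decomposition is carried out, so cepstral liftering, a standard envelope estimator, is swapped in here. The function names (`real_cepstrum`, `lifter_envelope`, `decompose`), the synthetic spectra, and the cutoff value are all hypothetical and are not taken from the paper.

```python
import math


def real_cepstrum(log_spec):
    """Cosine transform of a symmetric log-magnitude spectrum (quefrency domain)."""
    n_bins = len(log_spec)
    return [
        sum(log_spec[k] * math.cos(2 * math.pi * k * n / n_bins) for k in range(n_bins))
        / n_bins
        for n in range(n_bins)
    ]


def lifter_envelope(log_spec, cutoff):
    """Keep only low-quefrency coefficients and transform back to a smooth envelope."""
    n_bins = len(log_spec)
    cep = real_cepstrum(log_spec)
    # Symmetric low-pass lifter: keep n < cutoff and the mirrored coefficients.
    low = [c if (n < cutoff or n > n_bins - cutoff) else 0.0 for n, c in enumerate(cep)]
    return [
        sum(low[n] * math.cos(2 * math.pi * k * n / n_bins) for n in range(n_bins))
        for k in range(n_bins)
    ]


def decompose(log_spec, cutoff):
    """Split a log spectrum into a smooth envelope and an excitation residual."""
    envelope = lifter_envelope(log_spec, cutoff)
    excitation = [s - h for s, h in zip(log_spec, envelope)]
    return envelope, excitation


# Synthetic data (assumption): a slowly varying envelope plus a fast harmonic ripple.
N_BINS = 128
true_envelope = [
    1.0 * math.cos(2 * math.pi * k / N_BINS) + 0.5 * math.cos(2 * math.pi * 2 * k / N_BINS)
    for k in range(N_BINS)
]
true_excitation = [0.8 * math.cos(2 * math.pi * 16 * k / N_BINS) for k in range(N_BINS)]
log_spectrum = [h + e for h, e in zip(true_envelope, true_excitation)]

envelope, excitation = decompose(log_spectrum, cutoff=8)
max_envelope_error = max(abs(a - b) for a, b in zip(envelope, true_envelope))
max_excitation_error = max(abs(a - b) for a, b in zip(excitation, true_excitation))
print("max envelope error:", max_envelope_error)
print("max excitation error:", max_excitation_error)
```

On this toy input the two components occupy disjoint quefrency ranges, so the lifter recovers them almost exactly. Real speech overlaps far more, which is one reason a statistical estimator such as the maximum likelihood approach described above may be preferable to a fixed cutoff.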

Author information

Authors and Affiliations

Department of Electrical Engineering, Amirkabir University, Tehran, 15875-4413, Iran
Mohammad H. Radfar & Abolghasem Sayadiyan
Department of Systems and Computer Engineering, Carleton University, Ottawa, ON, K1S 5B6, Canada
Richard M. Dansereau

Corresponding author

Correspondence to Mohammad H. Radfar.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Radfar, M.H., Dansereau, R.M. & Sayadiyan, A. A Maximum Likelihood Estimation of Vocal-Tract-Related Filter Characteristics for Single Channel Speech Separation. J AUDIO SPEECH MUSIC PROC. 2007, 084186 (2006). https://doi.org/10.1155/2007/84186

Received: 03 March 2006
Revised: 13 September 2006
Accepted: 27 September 2006
Published: 16 November 2006
DOI: https://doi.org/10.1155/2007/84186
