From: A Decision-Tree-Based Algorithm for Speech/Music Classification and Segmentation
Paper | Main applications | Features | Classification method | Audio material | Results |
---|---|---|---|---|---|
Saunders, 1996 [4] | Automatic real-time FM radio monitoring | Short-time energy, statistical parameters of the ZCR | Multivariate Gaussian classifier | Talk, commercials, music (different types) | 95%–96% |
Scheirer and Slaney, 1997 [6] | Speech/music discrimination for automatic speech recognition | 13 temporal, spectral and cepstral features (e.g., 4 Hz modulation energy, % of low-energy frames, spectral rolloff, spectral centroid, spectral flux, ZCR, cepstrum-based feature, "rhythmicness"), variance of features across 1 sec | Gaussian mixture model (GMM), K-nearest-neighbour (KNN), K-D trees, multidimensional Gaussian MAP estimator | FM radio (40 min): male and female speech, various conditions, different genres of music (training: 36 min, testing: 4 min) | 94.2% (frame-by-frame), 98.6% (2.4 sec segments) |
Foote, 1997 [10] | Retrieving audio documents by acoustic similarity | 12 MFCCs, short-time energy | Template matching of histograms, created using a tree-based vector quantizer trained to maximize mutual information | 409 sounds and 255 clips (7 sec each) of music | No specific accuracy rates are provided; high rate of success in retrieving simple sounds |
Liu et al., 1997 [5] | Analysis of audio for scene classification of TV programs | Silence ratio, volume std, volume dynamic range, 4 Hz freq., mean and std of pitch difference, speech and noise ratios, freq. centroid, bandwidth, energy in 4 sub-bands | A neural network using the one-class-in-one-network (OCON) structure | 70 audio clips (1 sec each) from TV programs for each scene class (training: 50, testing: 20) | Recognition of some of the classes is successful |
Zhang and Kuo, 1999 [11] | Audio segmentation/retrieval for video scene classification, indexing of raw audiovisual recordings, database browsing | Features based on short-time energy, average ZCR, short-time fundamental frequency | A rule-based heuristic procedure for the coarse stage, HMM for the second stage | Coarse stage: speech, music, env. sounds and silence. Second stage: fine-class classification of env. sounds | 90% (coarse stage) |
Williams and Ellis, 1999 [12] | Segmentation of speech versus nonspeech in automatic speech recognition tasks | Mean per-frame entropy and average probability "dynamism", background-label energy ratio, phone distribution match, all derived from posterior probabilities of phones in a hybrid connectionist-HMM framework | Gaussian likelihood ratio test | Radio recordings: speech (80 segments, 15 sec each) and music (80 segments, 15 sec each). Training: 75%, testing: 25% | 100% accuracy with 15-sec segments, 98.7% accuracy with 2.5-sec segments |
El-Maleh et al., 2000 [13] | Automatic coding and content-based audio/video retrieval | LSF, differential LSF, measures based on the ZCR of the high-pass filtered signal | KNN classifier and quadratic Gaussian classifier (QGC) | Several speakers, different genres of music (training: 9.3 min and 10.7 min, resp.) | Frame level (20 ms): music 72.7% (QGC), 79.2% (KNN); speech 74.3% (QGC), 82.5% (KNN). Segment level (1 sec): music 94%–100%, speech 80%–94% |
Buggati et al., 2002 [2] | "Table of content" description of a multimedia document | ZCR-based features, spectral flux, short-time energy, cepstrum coefficients, spectral centroids, ratio of the high-frequency power spectrum, a measure based on syllabic frequency | Multivariate Gaussian classifier, neural network (MLP) | 30 minutes of alternating sections of music and speech (5 min each) | 95%–96% (NN). Total error rate: 17.7% (Bayesian classifier), 6.0% (NN) |
Lu, Zhang, and Jiang, 2002 [9] | Audio content analysis in video parsing | High zero-crossing rate ratio (HZCRR), low short-time energy ratio (LSTER), linear spectral pairs, band periodicity, noise-frame ratio (NFR) | 3-step classification: 1. KNN and linear spectral pairs-vector quantization (LSP-VQ) for speech/nonspeech discrimination. 2. Heuristic rules for nonspeech classification into music/background noise/silence. 3. Speaker segmentation | MPEG-7 test data set, TV news, movie/audio clips. Speech: studio recordings, 4 kHz and 8 kHz bandwidths; music: songs, pop (training: 2 hours, testing: 4 hours) | Speech 97.5%, music 93.0%, env. sound 84.4%. Speech/music discrimination only: 98.0% |
Ajmera et al., 2003 [14] | Automatic transcription of broadcast news | Averaged entropy measure and "dynamism" estimated at the output of a multilayer perceptron (MLP) trained to emit posterior probabilities of phones. MLP input: first 13 cepstra of a 12th-order perceptual linear prediction filter | 2-state HMM with minimum duration constraints (threshold-free, unsupervised, no training) | 4 files (10 min each): alternate segments of speech and music, speech/music interleaved | GMM: speech 98.8%, music 93.9%. Alternating, variable-length segments (MLP): speech 98.6%, music 94.6% |
Burred and Lerch, 2004 [1] | Audio classification (speech/music/background noise), music classification into genres | Statistical measures of short-time frame features: ZCR, spectral centroid/rolloff/flux, first 5 MFCCs, audio spectrum centroid/flatness, harmonic ratio, beat strength, rhythmic regularity, RMS energy, time envelope, low energy rate, loudness, others | KNN classifier, 3-component GMM classifier | 3 classes of speech, 13 genres of music, and background noise: 50 examples per class (30 sec each), from CDs, MP3, and radio | 94.6%/96.3% (hierarchical and direct approach, resp.) |
Barbedo and Lopes, 2006 [15] | Automatic segmentation for real-time applications | Features based on ZCR, spectral rolloff, loudness and fundamental frequencies | KNN, self-organizing maps, MLP neural networks, linear combinations | Speech (5 different conditions) and music (various genres); more than 20 hours of audio data, from CDs, Internet radio streams, radio broadcasting, and coded files | Noisy speech 99.4%, clean speech 100%, music 98.8%, music without rap 99.2%. Rapid alternations: speech 94.5%, music 93.2% |
Muñoz-Expósito et al., 2006 [3] | Intelligent audio coding system | Warped-LPC-based spectral centroid | 3-component GMM, with or without fuzzy rule-based system | Speech (radio and TV news, movie dialogs, different conditions); music (various genres, different instruments/singers); 1 hour for each class | GMM: speech 95.1%, music 80.3%. GMM with fuzzy system: speech 94.2%, music 93.1% |
Alexandre et al., 2006 [16] | Speech/music classification for musical genre classification | Spectral centroid/rolloff, ZCR, short-time energy, low short-time energy ratio (LSTER), MFCC, voice-to-white | Fisher linear discriminant, K-nearest-neighbour | Speech (without background music) and music without vocals (training: 45 min, testing: 15 min) | Music 99.1%, speech 96.6%. Individual features: 95.9% (MFCC), 95.1% (voice-to-white) |
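Two features recur in nearly every row of the table: the zero-crossing rate (ZCR) and short-time energy. A minimal NumPy sketch of how they are typically computed on overlapping frames follows; the frame length, hop size, and test signals are illustrative choices, not parameters taken from any of the papers above. The intuition it demonstrates is the one these systems exploit: noise-like (unvoiced/speech-fricative) signals cross zero far more often than a low-frequency tone, while a sustained tone carries more energy than low-level noise.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample sign changes per frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def short_time_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)

# Illustrative signals at a nominal 16 kHz sample rate:
# low-level white noise vs. a steady 440 Hz tone.
rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(16000)
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)

frames_noise = frame_signal(noise, frame_len=400, hop=160)  # 25 ms / 10 ms
frames_tone = frame_signal(tone, frame_len=400, hop=160)

# Noise crosses zero roughly every other sample (ZCR ~ 0.5); a 440 Hz
# tone at 16 kHz crosses only ~880 times per second (ZCR ~ 0.055).
print(zero_crossing_rate(frames_noise).mean() > zero_crossing_rate(frames_tone).mean())
# The tone (mean square ~ 0.5) carries more energy than the 0.1-amplitude noise.
print(short_time_energy(frames_tone).mean() > short_time_energy(frames_noise).mean())
```

Derived statistics of these per-frame values, such as the HZCRR and LSTER of Lu, Zhang, and Jiang [9] or the ZCR statistics of Saunders [4], are what the classifiers in the table actually consume.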