From: A Decision-Tree-Based Algorithm for Speech/Music Classification and Segmentation
Paper | Main applications | Features | Classification method | Audio material | Results |
---|---|---|---|---|---|
Saunders, 1996 [4] | Automatic real-time FM radio monitoring | Short-time energy, statistical parameters of the ZCR | Multivariate Gaussian classifier | Talk, commercials, music (different types) | 95%–96% |
Scheirer and Slaney, 1997 [6] | Speech/music discrimination for automatic speech recognition | 13 temporal, spectral and cepstral features (e.g., 4 Hz modulation energy, % of low-energy frames, spectral rolloff, spectral centroid, spectral flux, ZCR, cepstrum-based feature, "rhythmicness"), variance of features across 1 sec | Gaussian mixture model (GMM), K-nearest-neighbour (KNN), K-D trees, multidimensional Gaussian MAP estimator | FM radio (40 min): male and female speech, various conditions, different genres of music (training: 36 min, testing: 4 min) | 94.2% (frame-by-frame), 98.6% (2.4 sec segments) |
Foote, 1997 [10] | Retrieving audio documents by acoustic similarity | 12 MFCCs, short-time energy | Template matching of histograms, created using a tree-based vector quantizer trained to maximize mutual information | 409 sounds and 255 clips (7 sec each) of music | No specific accuracy rates are provided; high rate of success in retrieving simple sounds |
Liu et al., 1997 [5] | Analysis of audio for scene classification of TV programs | Silence ratio, volume std, volume dynamic range, 4 Hz freq., mean and std of pitch difference, speech and noise ratios, freq. centroid, bandwidth, energy in 4 sub-bands | A neural network using the one-class-in-one-network (OCON) structure | 70 audio clips (1 sec each) from TV programs for each scene class (training: 50, testing: 20) | Recognition of some of the classes is successful |
Zhang and Kuo, 1999 [11] | Audio segmentation/retrieval for video scene classification, indexing of raw audiovisual recordings, database browsing | Features based on short-time energy, average ZCR, short-time fundamental frequency | A rule-based heuristic procedure for the coarse stage, HMM for the second stage | Coarse stage: speech, music, env. sounds and silence. Second stage: fine-class classification of env. sounds | 90% (coarse stage) |
Williams and Ellis, 1999 [12] | Segmentation of speech versus nonspeech in automatic speech recognition tasks | Mean per-frame entropy and average probability "dynamism", background-label energy ratio, phone distribution match, all derived from posterior probabilities of phones in a hybrid connectionist-HMM framework | Gaussian likelihood ratio test | Radio recordings: speech (80 segments, 15 sec each) and music (80 segments, 15 sec each). Training: 75%, testing: 25% | 100% accuracy with 15-sec segments, 98.7% accuracy with 2.5-sec segments |
El-Maleh et al., 2000 [13] | Automatic coding and content-based audio/video retrieval | LSF, differential LSF, measures based on the ZCR of the high-pass filtered signal | KNN classifier and quadratic Gaussian classifier (QGC) | Several speakers, different genres of music (training: 9.3 min and 10.7 min, resp.) | Frame level (20 ms): music 72.7% (QGC), 79.2% (KNN); speech 74.3% (QGC), 82.5% (KNN). Segment level (1 sec): music 94%–100%, speech 80%–94% |
Buggati et al., 2002 [2] | "Table of content" description of a multimedia document | ZCR-based features, spectral flux, short-time energy, cepstrum coefficients, spectral centroids, ratio of the high-frequency power spectrum, a measure based on syllabic frequency | Multivariate Gaussian classifier, neural network (MLP) | 30 minutes of alternating sections of music and speech (5 min each) | 95%–96% (NN). Total error rate: 17.7% (Bayesian classifier), 6.0% (NN) |
Lu, Zhang, and Jiang, 2002 [9] | Audio content analysis in video parsing | High zero-crossing rate ratio (HZCRR), low short-time energy ratio (LSTER), linear spectral pairs, band periodicity, noise-frame ratio (NFR) | 3-step classification: 1. KNN and linear spectral pairs-vector quantization (LSP-VQ) for speech/nonspeech discrimination. 2. Heuristic rules for nonspeech classification into music/background noise/silence. 3. Speaker segmentation | MPEG-7 test data set, TV news, movie/audio clips. Speech: studio recordings, 4 kHz and 8 kHz bandwidths; music: songs, pop (training: 2 hours, testing: 4 hours) | Speech 97.5%, music 93.0%, env. sound 84.4%. Speech/music discrimination only: 98.0% |
Ajmera et al., 2003 [14] | Automatic transcription of broadcast news | Averaged entropy measure and "dynamism" estimated at the output of a multilayer perceptron (MLP) trained to emit posterior probabilities of phones. MLP input: first 13 cepstra of a 12th-order perceptual linear prediction filter | 2-state HMM with minimum duration constraints (threshold-free, unsupervised, no training) | 4 files (10 min each): alternate segments of speech and music, speech/music interleaved | GMM: speech 98.8%, music 93.9%. Alternating, variable-length segments (MLP): speech 98.6%, music 94.6% |
Burred and Lerch, 2004 [1] | Audio classification (speech/music/background noise), music classification into genres | Statistical measures of short-time frame features: ZCR, spectral centroid/rolloff/flux, first 5 MFCCs, audio spectrum centroid/flatness, harmonic ratio, beat strength, rhythmic regularity, RMS energy, time envelope, low energy rate, loudness, others | KNN classifier, 3-component GMM classifier | 3 classes of speech, 13 genres of music, and background noise: 50 examples per class (30 sec each), from CDs, MP3, and radio | 94.6%/96.3% (hierarchical and direct approach, resp.) |
Barbedo and Lopes, 2006 [15] | Automatic segmentation for real-time applications | Features based on ZCR, spectral rolloff, loudness and fundamental frequencies | KNN, self-organizing maps, MLP neural networks, linear combinations | Speech (5 different conditions) and music (various genres); more than 20 hours of audio data, from CDs, Internet radio streams, radio broadcasting, and coded files | Noisy speech 99.4%, clean speech 100%, music 98.8%, music without rap 99.2%. Rapid alternations: speech 94.5%, music 93.2% |
Muñoz-Expósito et al., 2006 [3] | Intelligent audio coding system | Warped-LPC-based spectral centroid | 3-component GMM, with or without fuzzy rule-based system | Speech (radio and TV news, movie dialogs, different conditions); music (various genres, different instruments/singers); 1 hour for each class | GMM: speech 95.1%, music 80.3%. GMM with fuzzy system: speech 94.2%, music 93.1% |
Alexandre et al., 2006 [16] | Speech/music classification for musical genre classification | Spectral centroid/rolloff, ZCR, short-time energy, low short-time energy ratio (LSTER), MFCC, voice-to-white | Fisher linear discriminant, K-nearest-neighbour | Speech (without background music) and music without vocals (training: 45 min, testing: 15 min) | Music 99.1%, speech 96.6%. Individual features: 95.9% (MFCC), 95.1% (voice-to-white) |
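Two features recur in nearly every row of the table: the zero-crossing rate (ZCR) and short-time energy. A minimal NumPy sketch of how they are typically computed on overlapping frames follows; the frame length, hop size, and test signals are illustrative choices, not parameters taken from any of the papers above. The intuition it demonstrates is the one these systems exploit: noise-like (unvoiced/speech-fricative) signals cross zero far more often than a low-frequency tone, while a sustained tone carries more energy than low-level noise.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent-sample sign changes per frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def short_time_energy(frames):
    """Mean squared amplitude per frame."""
    return np.mean(frames ** 2, axis=1)

# Illustrative signals at a nominal 16 kHz sample rate:
# low-level white noise vs. a steady 440 Hz tone.
rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(16000)
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)

frames_noise = frame_signal(noise, frame_len=400, hop=160)  # 25 ms / 10 ms
frames_tone = frame_signal(tone, frame_len=400, hop=160)

# Noise crosses zero roughly every other sample (ZCR ~ 0.5); a 440 Hz
# tone at 16 kHz crosses only ~880 times per second (ZCR ~ 0.055).
print(zero_crossing_rate(frames_noise).mean() > zero_crossing_rate(frames_tone).mean())
# The tone (mean square ~ 0.5) carries more energy than the 0.1-amplitude noise.
print(short_time_energy(frames_tone).mean() > short_time_energy(frames_noise).mean())
```

Derived statistics of these per-frame values, such as the HZCRR and LSTER of Lu, Zhang, and Jiang [9] or the ZCR statistics of Saunders [4], are what the classifiers in the table actually consume.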