Table 1 Summary of former studies.

From: A Decision-Tree-Based Algorithm for Speech/Music Classification and Segmentation

| Paper | Main Applications | Features | Classification method | Audio material | Results |
|---|---|---|---|---|---|
| Saunders, 1996 [4] | Automatic real-time FM radio monitoring | Short-time energy, statistical parameters of the ZCR | Multivariate Gaussian classifier | Talk, commercials, music (different types) | 95%–96% |
| Scheirer and Slaney, 1997 [6] | Speech/music discrimination for automatic speech recognition | 13 temporal, spectral, and cepstral features (e.g., 4 Hz modulation energy, % of low-energy frames, spectral rolloff, spectral centroid, spectral flux, ZCR, cepstrum-based feature, "rhythmicness"); variance of features across 1 sec | Gaussian mixture model (GMM), K-nearest-neighbour (KNN), K-D trees, multidimensional Gaussian MAP estimator | FM radio (40 min): male and female speech, various conditions, different genres of music (training: 36 min, testing: 4 min) | 94.2% (frame-by-frame), 98.6% (2.4 sec segments) |
| Foote, 1997 [10] | Retrieving audio documents by acoustic similarity | 12 MFCCs, short-time energy | Template matching of histograms, created using a tree-based vector quantizer trained to maximize mutual information | 409 sounds and 255 clips of music (7 sec long) | No specific accuracy rates are provided; high rate of success in retrieving simple sounds |
| Liu et al., 1997 [5] | Analysis of audio for scene classification of TV programs | Silence ratio, volume std, volume dynamic range, 4 Hz freq., mean and std of pitch difference, speech and noise ratios, freq. centroid, bandwidth, energy in 4 sub-bands | Neural network using the one-class-in-one-network (OCON) structure | 70 audio clips (1 sec long) from TV programs for each scene class (training: 50, testing: 20) | Recognition of some of the classes is successful |
| Zhang and Kuo, 1999 [11] | Audio segmentation/retrieval for video scene classification, indexing of raw audiovisual recordings, database browsing | Features based on short-time energy, average ZCR, short-time fundamental frequency | Rule-based heuristic procedure for the coarse stage, HMM for the second stage | Coarse stage: speech, music, env. sounds, and silence. Second stage: fine-class classification of env. sounds | 90% (coarse stage) |
| Williams and Ellis, 1999 [12] | Segmentation of speech versus nonspeech in automatic speech recognition tasks | Mean per-frame entropy and average probability "dynamism", background-label energy ratio, phone distribution match, all derived from posterior probabilities of phones in a hybrid connectionist-HMM framework | Gaussian likelihood ratio test | Radio recordings: speech (80 segments, 15 sec each) and music (80 segments, 15 sec each). Training: 75%, testing: 25% | 100% with 15 sec segments, 98.7% with 2.5 sec segments |
| El-Maleh et al., 2000 [13] | Automatic coding and content-based audio/video retrieval | LSF, differential LSF, measures based on the ZCR of the high-pass-filtered signal | KNN classifier and quadratic Gaussian classifier (QGC) | Several speakers, different genres of music (training: 9.3 min and 10.7 min, resp.) | Frame level (20 ms): music 72.7% (QGC), 79.2% (KNN); speech 74.3% (QGC), 82.5% (KNN). Segment level (1 sec): music 94%–100%, speech 80%–94% |
| Buggati et al., 2002 [2] | "Table of content" description of a multimedia document | ZCR-based features, spectral flux, short-time energy, cepstrum coefficients, spectral centroids, ratio of the high-frequency power spectrum, a measure based on syllabic frequency | Multivariate Gaussian classifier, neural network (MLP) | 30 minutes of alternating sections of music and speech (5 min each) | 95%–96% (NN). Total error rate: 17.7% (Bayesian classifier), 6.0% (NN) |
| Lu, Zhang, and Jiang, 2002 [9] | Audio content analysis in video parsing | High zero-crossing-rate ratio (HZCRR), low short-time-energy ratio (LSTER), linear spectral pairs, band periodicity, noise-frame ratio (NFR) | 3-step classification: (1) KNN and linear spectral pairs vector quantization (LSP-VQ) for speech/nonspeech discrimination; (2) heuristic rules for nonspeech classification into music/background noise/silence; (3) speaker segmentation | MPEG-7 test data set, TV news, movie/audio clips. Speech: studio recordings, 4 kHz and 8 kHz bandwidths; music: songs, pop (training: 2 hours, testing: 4 hours) | Speech 97.5%, music 93.0%, env. sound 84.4%. Speech/music discrimination only: 98.0% |
| Ajmera et al., 2003 [14] | Automatic transcription of broadcast news | Averaged entropy measure and "dynamism" estimated at the output of a multilayer perceptron (MLP) trained to emit posterior probabilities of phones; MLP input: first 13 cepstra of a 12th-order perceptual linear prediction filter | 2-state HMM with minimum duration constraints (threshold-free, unsupervised, no training) | 4 files (10 min each): alternate segments of speech and music, speech/music interleaved | GMM: speech 98.8%, music 93.9%. Alternating, variable-length segments (MLP): speech 98.6%, music 94.6% |
| Burred and Lerch, 2004 [1] | Audio classification (speech/music/background noise), music classification into genres | Statistical measures of short-time frame features: ZCR, spectral centroid/rolloff/flux, first 5 MFCCs, audio spectrum centroid/flatness, harmonic ratio, beat strength, rhythmic regularity, RMS energy, time envelope, low-energy rate, loudness, others | KNN classifier, 3-component GMM classifier | 3 classes of speech, 13 genres of music, and background noise: 50 examples per class (30 sec each), from CDs, MP3, and radio | 94.6%/96.3% (hierarchical and direct approach, resp.) |
| Barbedo and Lopes, 2006 [15] | Automatic segmentation for real-time applications | Features based on ZCR, spectral rolloff, loudness, and fundamental frequencies | KNN, self-organizing maps, MLP neural networks, linear combinations | Speech (5 different conditions) and music (various genres); more than 20 hours of audio data from CDs, Internet radio streams, radio broadcasting, and coded files | Noisy speech 99.4%, clean speech 100%, music 98.8%, music without rap 99.2%. Rapid alternations: speech 94.5%, music 93.2% |
| Muñoz-Expósito et al., 2006 [3] | Intelligent audio coding system | Warped-LPC-based spectral centroid | 3-component GMM, with or without a fuzzy rule-based system | Speech (radio and TV news, movie dialogs, different conditions); music (various genres, different instruments/singers); 1 hour for each class | GMM: speech 95.1%, music 80.3%. GMM with fuzzy system: speech 94.2%, music 93.1% |
| Alexandre et al., 2006 [16] | Speech/music classification for musical genre classification | Spectral centroid/rolloff, ZCR, short-time energy, low short-time-energy ratio (LSTER), MFCC, voice-to-white | Fisher linear discriminant, K-nearest-neighbour | Speech (without background music) and music without vocals (training: 45 min, testing: 15 min) | Music 99.1%, speech 96.6%. Individual features: 95.9% (MFCC), 95.1% (voice-to-white) |
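Three of the frame-level features that recur across the table, the zero-crossing rate (ZCR), short-time energy, and spectral centroid, have simple textbook definitions. The sketch below is illustrative only; the function names and framing conventions are assumptions, not taken from any of the cited papers, which differ in windowing and normalization details.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude over the frame."""
    return sum(x * x for x in frame) / len(frame)


def spectral_centroid(magnitudes, sample_rate, n_fft):
    """Magnitude-weighted mean of the FFT bin frequencies (Hz).

    `magnitudes` holds the spectral magnitudes of the first bins of an
    n_fft-point FFT; bin k corresponds to k * sample_rate / n_fft Hz.
    """
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(
        k * sample_rate / n_fft * m
        for k, m in enumerate(magnitudes)
    ) / total


# Example: an alternating-sign frame crosses zero at every step,
# giving the maximum ZCR of 1.0 and unit energy.
frame = [1.0, -1.0, 1.0, -1.0]
print(zero_crossing_rate(frame), short_time_energy(frame))
```

Speech tends to alternate high-ZCR unvoiced segments with low-ZCR voiced segments, while music is typically more stationary, which is why statistics of these features over a window (variance, LSTER, HZCRR) appear in the table more often than the raw per-frame values.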