Drum Sound Detection in Polyphonic Music with Hidden Markov Models
© J. Paulus and A. Klapuri. 2009
Received: 18 August 2009
Accepted: 16 November 2009
Published: 14 December 2009
This paper proposes a method for transcribing drums from polyphonic music using a network of connected hidden Markov models (HMMs). The task is to detect the temporal locations of unpitched percussive sounds (such as bass drum or hi-hat) and recognise the instruments played. Contrary to many earlier methods, a separate sound event segmentation is not done, but connected HMMs are used to perform the segmentation and recognition jointly. Two ways of using HMMs are studied: modelling combinations of the target drums and a detector-like modelling of each target drum. Acoustic feature parametrisation is done with mel-frequency cepstral coefficients and their first-order temporal derivatives. The effect of lowering the feature dimensionality with principal component analysis and linear discriminant analysis is evaluated. Unsupervised acoustic model parameter adaptation with maximum likelihood linear regression is evaluated for compensating the differences between the training and target signals. The performance of the proposed method is evaluated on a publicly available data set containing signals with and without accompaniment, and compared with two reference methods. The results suggest that the transcription is possible using connected HMMs, and that using detector-like models for each target drum provides a better performance than modelling drum combinations.
1. Introduction
This paper applies connected hidden Markov models (HMMs) to the transcription of drums from polyphonic musical audio. For brevity, the word "drum" is used here to refer to all the unpitched percussion instruments encountered in Western pop/rock music, such as bass drum, snare drum, and cymbals. The word "transcription" refers to the process of locating drum sound onset instants and recognising the drums played. The analysis result enables several applications, such as assisting beat tracking, modifying the drum track in the audio, reusing the drum patterns from existing audio, or musical studies on the played patterns.
Several methods have been proposed in the literature to solve the drum transcription problem. Following the categorisation made in [3, 4], the majority of the methods can be viewed as either "segment and classify" or "separate and detect" approaches. The methods in the first category operate by segmenting the input audio into meaningful events and then attempting to recognise the content of the segments. The segmentation can be done by detecting candidate sound onsets or by creating an isochronous temporal grid coinciding with most of the onsets. After the segmentation, a set of features is extracted from each segment, and a classifier is employed to recognise the contents. The classification method varies from a naive Bayes classifier with Gaussian mixture models (GMMs) to support vector machines (SVMs) [4, 6] and decision trees.
The methods in the second category aim at segregating each target drum into a separate stream and detecting sound onsets within the streams. The separation can be done with unsupervised methods like sparse coding or independent subspace analysis (ISA), but these require recognising the instruments from the resulting streams. The recognition step can be avoided by utilising prior knowledge of the target drums in the form of templates and applying a supervised source separation method. Combining ISA with drum templates produces a method called prior subspace analysis (PSA). PSA represents the templates as magnitude spectrograms and estimates the gains of each template over time. The possible negative values in the gains do not have a physical interpretation and require heuristic post-processing. This problem was solved by using nonnegative matrix factorisation (NMF), which restricts the component spectra and gains to be nonnegative. This approach was shown to perform well when the target signal matches the model (signals containing only target drums).
Some methods cannot be assigned to either of the categories above. These include template matching and adaptation methods operating with time-domain signals or with a spectrogram representation.
The main weakness of the "segment and classify" methods is the segmentation. The classification phase cannot recover events missed in the segmentation unless an explicit error correction scheme is used. If a temporal grid is used instead of onset detection, most of the events will be found, but the expressivity lying in the small temporal deviations from the grid is lost, and problems with the grid generation will be propagated to subsequent analysis stages.
To avoid making any decisions in the segmentation, this paper proposes to use a network of connected HMMs in the transcription in order to locate sound onsets and recognise the contents jointly. The target classes for recognition can be either combinations of drums or detectors for each drum. In the first approach, the recognition dictionary consists of combinations of target drums with one model to serve as the background model when no combination is played, and the task is to cover the input signal with these models. In the detector approach, each individual target drum is associated with two models: a "sound" model and a "silence" model, and the input signal is covered with these two models for each target drum independently from the others.
In addition to the HMM baseline system, the use of model adaptation with maximum likelihood linear regression (MLLR) will be evaluated. MLLR adapts the acoustic models from training to better match the specific input.
The rest of this article is organised as follows: Section 2 describes the proposed HMM-based transcription method; Section 3 details the evaluation setup and presents the obtained results; and finally Section 4 presents the conclusions of the paper. Parts of this work have been published earlier in [15, 16].
2. Proposed Method
2.1. Feature Extraction and Transformation
It has been noted, for example, in [13, 18], that suppression of tonal spectral components improves the accuracy of drum transcription. This is no surprise, as the common drums in a pop/rock drum kit contain a notable stochastic component and relatively little tonal energy. The idiophones in particular (e.g., cymbals) produce a mostly noise-like signal, while the membranophones (skinned drums) may also contain tonal components. The harmonic suppression is done here with simple sinusoids-plus-residual modelling [20, 21]. The signal is subdivided into 92.9 ms frames, the spectrum is calculated with the discrete Fourier transform, and the 30 sinusoids with the largest magnitude are selected by locating the 30 largest local maxima in the magnitude spectrum. The sinusoids are then synthesised, and the resulting signal is subtracted from the original signal. The residual serves as the input to the following analysis stages. Even though the processing may remove some of the tonal components of the membranophones, the remaining ones and the stochastic components are enough for the recognition. Preliminary experiments also suggest that the exact number of removed components is not important; even doubling the number to 60 caused only an insignificant drop in performance.
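The peak-picking step of the harmonic suppression can be illustrated with a minimal sketch. It covers only the selection of the largest local maxima from one frame's magnitude spectrum; the sinusoid synthesis and subtraction are omitted. The function name `largest_local_maxima` and the toy spectrum are assumptions made for illustration, not part of the original method description.

```python
def largest_local_maxima(magnitudes, count=30):
    """Return the bin indices of the `count` largest local maxima, strongest first."""
    # A bin is a local maximum if it rises above its left neighbour
    # and is not exceeded by its right neighbour.
    peaks = [
        k for k in range(1, len(magnitudes) - 1)
        if magnitudes[k - 1] < magnitudes[k] >= magnitudes[k + 1]
    ]
    peaks.sort(key=lambda k: magnitudes[k], reverse=True)
    return peaks[:count]


# Toy magnitude spectrum (hypothetical values) with local maxima at bins 1, 4 and 7.
spectrum = [0.1, 0.9, 0.2, 0.3, 2.0, 0.4, 0.1, 1.2, 0.0]
top_two = largest_local_maxima(spectrum, count=2)
```

In a full implementation, each selected bin would be turned into a sinusoid and subtracted from the frame to obtain the residual.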
The feature extraction calculates 13 mel-frequency cepstral coefficients (MFCCs) in 46.4 ms frames with 75% overlap. In addition to the MFCCs, their first-order temporal derivatives are estimated. The zeroth coefficient, which is often discarded, is also used. MFCCs have proven to work well in a variety of acoustic signal content analysis tasks, including instrument recognition. In addition to the MFCCs and their temporal derivatives, other spectral features, such as band energy ratios, spectral kurtosis, skewness, flatness, and slope, were considered for the feature set. However, preliminary experiments suggested that their inclusion reduces the overall performance slightly, and they are not used in the presented results. The reason for this degradation is an open question to be addressed in future work, but it is assumed that these features do not contain enough additional information compared to the original set to compensate for the increased modelling requirements.
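The framing arithmetic and the derivative features can be sketched as follows. The 44.1 kHz sampling rate is an assumption (the text gives only durations; 2048 samples at that rate is about 46.4 ms), and the central-difference estimator in `deltas` is likewise an illustrative choice, since the text does not state which derivative estimator was used. The MFCC computation itself is not shown.

```python
SAMPLE_RATE = 44100   # assumed sampling rate; not stated in the text
FRAME_LENGTH = 2048   # 2048 / 44100 s is approximately 46.4 ms
OVERLAP = 0.75
HOP = int(FRAME_LENGTH * (1 - OVERLAP))   # 512 samples, about 11.6 ms


def deltas(frames):
    """Central-difference derivative of each coefficient over time.

    At the first and last frame the nearest frame is repeated.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def with_deltas(frames):
    """Append the derivatives to each frame (13 MFCCs become 26 features)."""
    return [f + d for f, d in zip(frames, deltas(frames))]


# Hypothetical two-coefficient frames, just to show the shapes.
toy = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
toy_features = with_deltas(toy)
```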
The resulting 26-dimensional feature vectors are normalised to have zero mean and unity variance in each feature dimension over the training data. Then the feature matrix is subjected to dimensionality reduction. Though unsupervised transformation with principal component analysis (PCA) has been used successfully in some earlier publications, it did not perform well in our experiments. It is assumed that this is because PCA attempts only to describe the variance of the data without class information, and it may be distracted by the amount of noise present in the data.
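The normalisation step can be sketched as below: the statistics are estimated on the training data only and then applied unchanged to any other data. The function names are hypothetical, and replacing a zero standard deviation with 1.0 is an added safeguard that the text does not mention.

```python
from statistics import fmean, pstdev


def fit_normaliser(train):
    """Per-dimension mean and standard deviation over the training vectors."""
    columns = list(zip(*train))
    means = [fmean(c) for c in columns]
    # Guard against constant dimensions (assumption, not from the text).
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def normalise(vectors, means, stds):
    """Shift and scale every vector with the training statistics."""
    return [[(x - m) / s for x, m, s in zip(v, means, stds)] for v in vectors]


# Hypothetical two-dimensional training data.
train = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_normaliser(train)
normalised = normalise(train, means, stds)
```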
The feature transformation used here is calculated with linear discriminant analysis (LDA). LDA is a class-aware transformation attempting to minimise intra-class scatter while maximising interclass separation. If there are $N$ different classes, LDA produces a transformation to $N-1$ feature dimensions.
2.2. HMM Topologies
With detector models, the training data can be utilised more efficiently than with combination models, because all combinations containing the target drum can be used to train the model. Another difference in the training phase is that each drum has a separate silence (or background) model.
As will be shown in Section 3, the detector topology generally outperforms the combination modelling, which was found to have problems with overfitting the limited amount of training data. This was indicated by the following observations: performance degraded when the number of HMM training iterations was increased and when acoustic adaptation was applied, and performance improved slightly with simpler models and reduced feature dimensions. Because of this, the results on acoustic model adaptation and feature transformations are presented only for the detector topology (a similar choice has been made in earlier work). For the sake of comparison, however, results are also reported for the combination modelling baseline.
The sound models consist of a four-state left-to-right HMM where a transition is allowed to the state itself and to the following state. The observation likelihoods are modelled with single Gaussian distributions. The silence model is a single-state HMM with a 5-component GMM for the observation likelihoods. This topology was chosen because the background sound does not have a clear sequential form. The number of states and GMM components were empirically determined.
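The four-state left-to-right topology can be written as a transition matrix in which each state may only loop to itself or advance to the next state. The sketch below uses a fixed self-transition probability of 0.5 as a placeholder, since the actual probabilities are learned in training; the extra final column for leaving the model is a representational choice made here for illustration.

```python
def left_to_right_transitions(num_states=4, stay=0.5):
    """Transition rows for a left-to-right HMM.

    Column `num_states` is an extra exit column: the probability of
    leaving the model from the last state.
    """
    rows = []
    for i in range(num_states):
        row = [0.0] * (num_states + 1)
        row[i] = stay              # self-transition
        row[i + 1] = 1.0 - stay    # move to the following state (or exit)
        rows.append(row)
    return rows


sound_model = left_to_right_transitions()
```

The single-state silence model corresponds to `left_to_right_transitions(num_states=1, stay=...)` with its own self-transition probability.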
The models are trained with the expectation maximisation algorithm using segmented training examples. The segments are extracted after annotated event onsets using a maximum duration of 10 frames. If there is another onset closer than the set limit, the segment is truncated accordingly. In detector modelling, the training instances for the "sound" model are generated from the segments containing the target drum, and the remaining frames are used to train the "silence" model. In combination modelling, the training instances for each combination are collected from the data, and the remaining frames are used to train the background model.
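The segment extraction rule can be sketched as follows, assuming that onsets are already expressed as frame indices and that segments are half-open `(start, end)` frame ranges. The function name and the example onsets are hypothetical.

```python
def training_segments(onset_frames, num_frames, max_len=10):
    """Cut a segment of at most `max_len` frames after each onset.

    A segment is truncated if the next onset, or the end of the signal,
    arrives before `max_len` frames have passed.
    """
    onsets = sorted(set(onset_frames))
    segments = []
    for i, start in enumerate(onsets):
        end = min(start + max_len, num_frames)
        if i + 1 < len(onsets):
            end = min(end, onsets[i + 1])
        segments.append((start, end))
    return segments


# Hypothetical onsets in a 45-frame signal.
segments = training_segments([5, 12, 40], num_frames=45)
```

Frames not covered by any segment of the target drum would then be assigned to the "silence" (or background) model.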
2.3. Acoustic Adaptation
Unsupervised acoustic adaptation with maximum likelihood linear regression (MLLR) has been successfully used to adapt the HMM observation density parameters, for example, in adapting speaker-independent models to speaker-dependent models in speech recognition, in language adaptation from Spanish to Valencian, and in using a recognition database trained for phone speech to recognise speech in car conditions. The motivation for using MLLR here is the assumption that the acoustic properties of the target signal always differ from those of the training data, so that the match between the model and the observations can be improved with adaptation. The adaptation is done for each target signal independently to provide models that fit the specific signal better. The adaptation is evaluated only for the detector topology, because for drum combinations the adaptation was not successful, most likely due to the limited number of observations.
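In single variable MLLR for the mean parameter, a transformation matrix $W$ that applies one offset term and one scaling term to each feature dimension is used to adapt the component means. With the $2n$ free parameters of $W$ collected into a vector $\hat{w}$, where $n$ is the feature dimension, the adapted mean of component $s$ is $\hat{\mu}_s = D_s \hat{w}$.
The value of the vector $\hat{w}$ can be calculated by

$$\hat{w} = \left[ \sum_{t} \sum_{s} \gamma_s(t)\, D_s^{\mathsf{T}} C_s^{-1} D_s \right]^{-1} \left[ \sum_{t} \sum_{s} \gamma_s(t)\, D_s^{\mathsf{T}} C_s^{-1} o_t \right],$$

where $t$ is the frame index; $o_t$ is the observation vector from frame $t$; $s$ is an index of GMM components in the HMM; $C_s$ is the covariance matrix of GMM component $s$; $\gamma_s(t)$ is the occupation probability of the $s$th component in frame $t$ (calculated, e.g., with the forward-backward algorithm); and the matrix $D_s$ is defined as a concatenation of two diagonal matrices

$$D_s = \left[\, I \;\; \operatorname{diag}(\mu_s) \,\right],$$

where $\mu_s$ is the mean vector of the $s$th component and $I$ is an $n \times n$ identity matrix. In addition to the single variable mean transformation, full matrix mean transformation and variance transformation were also tested. In the evaluations, the single variable adaptation performed better than the full matrix mean transformation, and therefore the results are presented only for it. Variance transformation reduced performance in all cases.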
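To make the single variable mean adaptation concrete, the sketch below estimates one offset and one scale per feature dimension so that the adapted mean is `offset + scale * mean`. It assumes diagonal covariance matrices, in which case the maximum likelihood solution decouples into an independent 2x2 linear system for each dimension. The function name, the data layout, and the toy statistics are assumptions made for illustration; the occupation probabilities are taken as given.

```python
def single_variable_mllr(stats, dim):
    """Estimate per-dimension offsets and scales for the component means.

    stats: iterable of (gamma, observation, mean, variance) tuples, one per
    (frame, component) pair; observation, mean and variance are lists of
    length `dim` (diagonal covariance assumed).
    """
    a11 = [0.0] * dim
    a12 = [0.0] * dim
    a22 = [0.0] * dim
    b1 = [0.0] * dim
    b2 = [0.0] * dim
    for gamma, obs, mean, var in stats:
        for i in range(dim):
            weight = gamma / var[i]
            a11[i] += weight
            a12[i] += weight * mean[i]
            a22[i] += weight * mean[i] ** 2
            b1[i] += weight * obs[i]
            b2[i] += weight * mean[i] * obs[i]
    offsets, scales = [], []
    for i in range(dim):
        det = a11[i] * a22[i] - a12[i] ** 2
        if abs(det) < 1e-12:
            raise ValueError("need at least two distinct component means")
        # Closed-form solution of the symmetric 2x2 system.
        offsets.append((a22[i] * b1[i] - a12[i] * b2[i]) / det)
        scales.append((a11[i] * b2[i] - a12[i] * b1[i]) / det)
    return offsets, scales


# Toy data: observations sit exactly at 1 + 2 * mean for three components.
toy_stats = [
    (1.0, [1.0], [0.0], [1.0]),
    (1.0, [3.0], [1.0], [1.0]),
    (1.0, [7.0], [3.0], [1.0]),
]
offsets, scales = single_variable_mllr(toy_stats, dim=1)
```

With full covariance matrices the dimensions no longer decouple, and a single linear system over all offsets and scales has to be solved instead.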
The adaptation is done so that the signal is first analysed with the original models. Then it is segmented into examples of either class ("sound"/"silence") based on the recognition result, and the segments are used to adapt the corresponding models. The adaptation can be repeated using the models from the previous adaptation iteration for segmentation. It was found in the evaluations that applying the adaptation three times produced the best result, even though the improvement obtained after the first adaptation was usually very small. Increasing the number of adaptation iterations beyond three started to degrade the results.
2.4. Recognition
In the recognition phase, the (adapted) HMMs are combined into a larger compound model; see Figures 2 and 3. This is done by concatenating the state transition matrices of the individual HMMs and incorporating the intermodel transition probabilities in the same matrix. The transition probabilities between the models are estimated from the same material that is used for training the acoustic models, and the bigram probabilities are smoothed with Witten-Bell smoothing. The compound model is then used to decode the sequence with the Viterbi algorithm. An alternative would be to use the token passing algorithm, but since the model satisfies the first-order Markov assumption (only bigrams are used), Viterbi remains a viable choice.
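The construction of the compound model and the decoding can be sketched as follows. The sketch assumes that each model's transition rows carry an extra final column for leaving the model, that a model is always entered at its first state, and that the bigram probabilities are already smoothed. All numeric values are placeholders, and the observation log-likelihoods are taken as given.

```python
import math

NEG_INF = float("-inf")


def compound_transitions(models, bigram):
    """Merge per-model transition matrices into one compound matrix.

    bigram[a][b] is the probability that model b follows model a.
    Returns the matrix and the index of each model's first state.
    """
    sizes = [len(model) for model in models]
    starts = [sum(sizes[:k]) for k in range(len(models))]
    total = sum(sizes)
    trans = [[0.0] * total for _ in range(total)]
    for a, model in enumerate(models):
        for i, row in enumerate(model):
            src = starts[a] + i
            for j in range(sizes[a]):      # transitions inside model a
                trans[src][starts[a] + j] += row[j]
            for b in range(len(models)):   # leave model a, enter model b
                trans[src][starts[b]] += row[sizes[a]] * bigram[a][b]
    return trans, starts


def viterbi(log_init, trans, log_obs):
    """Most likely state path given per-frame observation log-likelihoods."""
    n = len(trans)
    log_trans = [[math.log(p) if p > 0 else NEG_INF for p in row] for row in trans]
    score = [log_init[s] + log_obs[0][s] for s in range(n)]
    back = []
    for frame in log_obs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + frame[s])
        back.append(pointers)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Detector for one drum: four-state "sound" model and one-state "silence" model.
sound = [
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5],
]
silence = [[0.9, 0.1]]
bigram = [[0.1, 0.9], [0.5, 0.5]]   # hypothetical smoothed bigrams
trans, starts = compound_transitions([sound, silence], bigram)
```

In the decoded path, every frame in which the path enters the first state of a "sound" model would be reported as an onset of that drum.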
3. Results
The performance of the proposed method is evaluated using the publicly available data set "ENST drums". The data set allows adjusting the level of the accompaniment (everything but the drums) in relation to the drum signal, and two different levels are used in the evaluations: a balanced mix and a drums-only signal. The performance of the proposed method is compared with two reference systems: a "segment and classify" method by Tanghe et al., and a supervised "separate and detect" method using nonnegative matrix factorisation.
3.1. Acoustic Data
The data set "ENST drums" contains multichannel recordings of three drummers playing with different drum kits. In addition to the original multichannel recordings, two downmixes are also provided: "dry" with minimal effects, mainly having only the levels of different drums balanced, and "wet" resembling the drum tracks on commercial recordings, containing some effects and compression. The material in the data set ranges from individual hits to stereotypical phrases, and finally to longer tracks played along with an accompaniment. These "minus one" tracks have the synchronised accompaniment available as a separate signal, which makes it possible to create polyphonic signals with custom mixing levels. The ground truth for the data set contains the onset times for the different drums, and was provided with the data set.
The "minus one" tracks are used as the evaluation data. They are naturally split into three subsets based on the player and kit, each having approximately the same number of tracks (two with 21 tracks and one with 22). The lengths of the tracks range from 30 s to 75 s with a mean duration of 55 s. The mixing ratios of drums and accompaniment used in the evaluations are drums-only and a "balanced" mix. The former is used to obtain a baseline result for the system with no accompaniment. The latter, obtained by applying separate scaling factors to the drum signal and to the accompaniment, is then used to evaluate the system performance in the realistic conditions met in polyphonic music. (The mixing levels are based on personal communication with Gillet and determine the average drums-to-accompaniment ratio over the whole data set.)
3.2. Evaluation Setup
Evaluations are run using a three-fold cross-validation scheme. Data from two drummers are used to train the system and the data from the third are used for testing, and the division is repeated three times. This setup guarantees that the acoustic models have not seen the test data and their generalisation capability will be tested. In fact, the sounds of the corresponding drums in different kits may differ considerably (e.g., depending on the tension of the skin, the use of muffling in case of kick drum, or the instrument used to hit the drum that can be a mallet, a stick, rods, or brushes) and using only two examples of a certain drum category to recognise a third one is a difficult problem. Hence, in real applications the training should be done with as diverse data as possible.
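The cross-validation split can be sketched as a leave-one-drummer-out loop. The drummer labels and track names below are hypothetical placeholders and do not correspond to file names in the data set.

```python
def leave_one_drummer_out(tracks_by_drummer):
    """Yield (held_out_drummer, training_tracks, test_tracks) for each fold."""
    drummers = sorted(tracks_by_drummer)
    for held_out in drummers:
        train = [
            track
            for drummer in drummers
            if drummer != held_out
            for track in tracks_by_drummer[drummer]
        ]
        yield held_out, train, list(tracks_by_drummer[held_out])


# Hypothetical, shortened track lists for the three drummers.
tracks = {
    "drummer_1": ["a", "b"],
    "drummer_2": ["c"],
    "drummer_3": ["d"],
}
folds = list(leave_one_drummer_out(tracks))
```

Each fold trains on two drummers and tests on the third, so no recording from the test drummer's kit is seen in training.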
The transcription results are evaluated with precision rate, recall rate, and F-measure.
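Precision rate, recall rate, and F-measure for one drum can be computed from matched onsets as in the sketch below. It uses the standard definitions: precision is the fraction of detected onsets that are correct, recall is the fraction of reference onsets that were found, and the F-measure is their harmonic mean. The greedy one-to-one matching and the 30 ms tolerance are assumptions made for illustration; the matching rule used in the evaluations is not given in the text.

```python
def match_onsets(reference, detected, tolerance=0.03):
    """Count one-to-one matches between sorted onset times (in seconds)."""
    reference, detected = sorted(reference), sorted(detected)
    hits = i = j = 0
    while i < len(reference) and j < len(detected):
        diff = detected[j] - reference[i]
        if abs(diff) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1   # detection too early to match anything: false positive
        else:
            i += 1   # reference onset was missed
    return hits


def precision_recall_f(reference, detected, tolerance=0.03):
    """Return (precision, recall, F-measure) for one drum in one signal."""
    hits = match_onsets(reference, detected, tolerance)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical onset times in seconds.
p, r, f = precision_recall_f([0.50, 1.00, 1.50, 2.00], [0.51, 1.20, 1.49])
```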
3.3. Reference Methods
The system performance is compared with two earlier methods: a "segment and classify" method by Tanghe et al. and a "separate and detect" method by Paulus and Virtanen. The former, referred to as SVM in the results, was designed for transcribing drums from polyphonic music by detecting sound onsets and then classifying the sounds with binary SVMs for each target drum. An implementation by the original authors is used. The latter, referred to as NMF-PSA, was designed for transcribing drums from a signal without accompaniment. The method uses spectral templates for each target drum and estimates their time-varying gains using NMF. Onsets are detected from the recovered gains. Here, too, the original implementation is used. The models for the SVM method are not trained specifically for the data used; the generic models provided are used instead. The spectral templates for NMF-PSA are calculated from the individual drum hits in the data set used here. In the original publication, the mid-level representation used a spectral resolution of five bands. Here they are replaced with 24 Bark bands for improved frequency resolution.
Table 1: Evaluation results for the tested methods using the balanced drums and accompaniment mixture as input.
Table 2: Evaluation results for the tested methods using signals without any accompaniment as input.
The evaluated methods are the following:
- HMM: the proposed HMM method with detectors for each target drum, without acoustic adaptation;
- HMM + MLLR: the proposed detector-like HMM method including the acoustic model adaptation with MLLR;
- HMM comb: the proposed HMM method with drum combinations, without acoustic adaptation;
- NMF-PSA: a "separate and detect" method using NMF for the source separation, proposed by Paulus and Virtanen;
- SVM: a "segment and classify" method proposed by Tanghe et al., using SVMs for detecting the presence of each target drum in the located segments.
The results show that the proposed method performs best among the evaluated methods. In addition, it can be seen that the acoustic adaptation slightly improves the recognition result. All the evaluated methods seem to have problems in transcribing the snare drum (SD), even without the presence of accompaniment. One reason for this is that the snare drum is often played in more diverse ways than, for example, the bass drum. Examples include producing the excitation with sticks or brushes, playing with and without the snare belt, and producing barely audible "ghost hits".
When analysing the results of "segment and classify" methods, it is possible to distinguish between errors in segmentation and classification. However, since the proposed method aims to perform these tasks jointly, acting as a specialised onset detection method for each target drum, this distinction cannot be made.
An earlier evaluation with the same data set was presented in [4, Table II]. The sections of that table for the balanced accompaniment level and for the drums-only condition correspond to the results presented here in Table 1 and Table 2, respectively. In both cases, the proposed method clearly outperforms the earlier method in bass drum and hi-hat transcription accuracy. However, the performance of the proposed method on snare drum is slightly worse.
The improvement obtained using the acoustic model adaptation is relatively small. Measuring the statistical significance with a two-tailed unequal variance Welch's t-test on the F-measures for individual test signals produces a p-value of approximately 0.64 for the balanced mix test data and 0.18 for the data without accompaniment, suggesting that the difference in the results is not statistically significant. However, the adaptation seems to provide a better balance between precision and recall rates. The performance differences between the proposed detector-like HMMs and the other methods are clearly in favour of the proposed method.
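The test statistic of Welch's unequal variance t-test can be computed as in the sketch below, which returns the statistic and the Welch-Satterthwaite degrees of freedom. The two-tailed p-value additionally requires the cumulative distribution function of Student's t-distribution, which the Python standard library does not provide. The sample values are hypothetical and are not the F-measures from the evaluations.

```python
from statistics import fmean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)   # squared standard error of the mean of a
    vb = variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / (va + vb) ** 0.5
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof


# Hypothetical per-signal scores for two systems.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p-value.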
4. Conclusions
This paper has studied and evaluated different ways of using connected HMMs for transcribing drums from polyphonic music. The proposed detector-type approach is relatively simple with only two models for each target drum: a "sound" and a "silence" model. In addition, modelling of drum combinations instead of detectors for individual drums was investigated, but found not to work very well. It is likely that the problems with the combination models are caused by overfitting the training data. The acoustic front-end extracts mel-frequency cepstral coefficients (MFCCs) and their first-order derivatives to be used as the acoustic features. Comparison of feature transformations suggests that LDA provides a considerable performance increase with the proposed method. Acoustic model adaptation with MLLR is tested, but the obtained improvement is relatively small. The proposed method produces a relatively good transcription of bass drum and hi-hat, but snare drum recognition has some problems that need to be addressed in future work. The main finding is that it is not necessary to have a separate segmentation step in a drum transcriber; the segmentation and recognition can be performed jointly with an HMM even in the presence of accompaniment and with poor signal-to-noise ratios.
Acknowledgment
This work was supported by the Academy of Finland (application number 129657, Finnish Programme for Centres of Excellence in Research 2006–2011).
References
[1] Goto M: An audio-based real-time beat tracking system for music with or without drum-sounds. Journal of New Music Research 2001,30(2):159-171. 10.1076/jnmr.30.2.159.7114
[2] Yoshii K, Goto M, Okuno HG: INTER:D: a drum sound equalizer for controlling volume and timbre of drums. Proceedings of the 2nd European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies (EWIMT '05), November-December 2005, London, UK 205-212.
[3] FitzGerald D, Paulus J: Unpitched percussion transcription. In Signal Processing Methods for Music Transcription. Edited by: Klapuri A, Davy M. Springer, New York, NY, USA; 2006:131-162.
[4] Gillet O, Richard G: Transcription and separation of drum signals from polyphonic music. IEEE Transactions on Audio, Speech and Language Processing 2008,16(3):529-540.
[5] Paulus J, Klapuri AP: Conventional and periodic N-grams in the transcription of drum sequences. Proceedings of IEEE International Conference on Multimedia and Expo, July 2003, Baltimore, Md, USA 2: 737-740.
[6] Tanghe K, Degroeve S, De Baets B: An algorithm for detecting and labeling drum events in polyphonic music. Proceedings of the 1st Annual Music Information Retrieval Evaluation eXchange, September 2005, London, UK. Extended abstract.
[7] Sandvold V, Gouyon F, Herrera P: Percussion classification in polyphonic audio recordings using localized sound models. Proceedings of the 5th International Conference on Music Information Retrieval, October 2004, Barcelona, Spain 537-540.
[8] Virtanen T: Sound source separation using sparse coding with temporal continuity objective. Proceedings of International Computer Music Conference, October 2003, Singapore 231-234.
[9] Uhle C, Dittmar C, Sporer T: Extraction of drum tracks from polyphonic music using independent subspace analysis. Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation, April 2003, Nara, Japan 843-848.
[10] FitzGerald D, Lawlor B, Coyle E: Prior subspace analysis for drum transcription. Proceedings of the 114th Audio Engineering Society Convention, March 2003, Amsterdam, The Netherlands.
[11] Paulus J, Virtanen T: Drum transcription with non-negative spectrogram factorisation. Proceedings of the 13th European Signal Processing Conference, September 2005, Antalya, Turkey.
[12] Zils A, Pachet F, Delerue O, Gouyon F: Automatic extraction of drum tracks from polyphonic music signals. Proceedings of the 2nd International Conference on Web Delivering of Music, December 2002, Darmstadt, Germany 179-183.
[13] Yoshii K, Goto M, Okuno HG: Drum sound recognition for polyphonic audio signals by adaptation and matching of spectrogram templates with harmonic structure suppression. IEEE Transactions on Audio, Speech and Language Processing 2007,15(1):333-345.
[14] Yoshii K, Goto M, Komatani K, Ogata T, Okuno HG: An error correction framework based on drum pattern periodicity for improving drum sound detection. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), May 2006, Toulouse, France 5: 237-240.
[15] Paulus J: Acoustic modelling of drum sounds with hidden Markov models for music transcription. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), May 2006, Toulouse, France 5: 241-244.
[16] Paulus J, Klapuri A: Combining temporal and spectral features in HMM-based drum transcription. Proceedings of the 8th International Conference on Music Information Retrieval, September 2007, Vienna, Austria 225-228.
[17] Leggetter CJ, Woodland PC: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language 1995,9(2):171-185. 10.1006/csla.1995.0010
[18] Gillet O, Richard G: Drum track transcription of polyphonic music using noise subspace projection. Proceedings of the 6th International Conference on Music Information Retrieval, September 2005, London, UK 156-159.
[19] Fletcher NH, Rossing TD: The Physics of Musical Instruments. 2nd edition. Springer, New York, NY, USA; 1998.
[20] McAulay RJ, Quatieri TF: Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing 1986,34(4):744-754. 10.1109/TASSP.1986.1164910
[21] Serra X: Musical sound modeling with sinusoids plus noise. In Musical Signal Processing. Edited by: Roads C, Pope S, Piccialli A, De Poli G. Swets & Zeitlinger, Lisse, The Netherlands; 1997:91-122.
[22] Davis SB, Mermelstein P: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 1980,28(4):357-366. 10.1109/TASSP.1980.1163420
[23] Eronen A: Comparison of features for musical instrument recognition. Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (ASSP '01), October 2001, New Paltz, NY, USA 19-22.
[24] Somervuo P: Experiments with linear and nonlinear feature transformations in HMM based phone recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '03), 2003, Hong Kong 1: 52-55.
[25] Gillet O, Richard G: ENST-drums: an extensive audio-visual database for drum signal processing. Proceedings of the 7th International Conference on Music Information Retrieval, October 2006, Victoria, Canada 156-159.
[26] Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 1989,77(2):257-286. 10.1109/5.18626
[27] Luján M, Martínez CD, Alabau V: Evaluation of several maximum likelihood linear regression variants for language adaptation. Proceedings of the 6th International Language Resources and Evaluation Conference, May 2008, Marrakech, Morocco.
[28] Fischer A, Stahl V: Database and online adaptation for improved speech recognition in car environments. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '99), March 1999, Phoenix, Ariz, USA 1: 445-448.
[29] Gales MJF, Pye D, Woodland PC: Variance compensation within the MLLR framework for robust speech recognition and speaker adaptation. Proceedings of International Conference on Spoken Language Processing (ICSLP '96), October 1996, Philadelphia, Pa, USA 3: 1832-1835.
[30] Witten IH, Bell TC: The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory 1991,37(4):1085-1094. 10.1109/18.87000
[31] Young SJ, Russell NH, Thornton JHS: Token passing: a simple conceptual model for connected speech recognition systems. Cambridge University Engineering Department, Cambridge, UK; July 1989.
[32] MAMI: Musical audio-mining, drum detection console applications. 2005, http://www.ipem.ugent.be/MAMI/
[33] Welch BL: The generalization of "Student's" problem when several different population variances are involved. Biometrika 1947,34(1-2):28-35. 10.1093/biomet/34.1-2.28
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.