- Research Article
- Open Access
Tango or Waltz?: Putting Ballroom Dance Style into Tempo Detection
© Björn Schuller et al. 2008
- Received: 31 October 2007
- Accepted: 14 March 2008
- Published: 1 April 2008
Rhythmic information plays an important role in Music Information Retrieval. Example applications include automatically annotating large databases by genre, meter, ballroom dance style or tempo, fully automated D.J.-ing, and audio segmentation for further retrieval tasks such as automatic chord labeling. In this article, we therefore provide an introductory overview of basic and current principles of tempo detection. Subsequently, we show how to improve on these by including ballroom dance style recognition. We introduce a set of 82 rhythmic features for rhythm analysis on real audio. With this set, data-driven identification of the meter and ballroom dance style, employing support vector machines, is carried out in a first step. Next, this information is used to detect tempo more robustly. We evaluate the suggested method on a large public database containing 1.8 k titles of standard and Latin ballroom dance music. Following extensive test runs, a clear boost in performance can be reported.
- Detection Function
- Onset Detection
- Comb Filter
- Music Information Retrieval
Music Information Retrieval (MIR) has been a growing field of research over the last decade. The increasing popularity of portable music players and music distribution over the internet has made worldwide, instantaneous access to rapidly growing music archives possible. Such archives must be well structured and sorted in order to be user friendly. For example, many users face the problem of having heard a song they would like to buy but not knowing its bibliographic data, that is, title and artist, which is necessary to find the song in conventional (online) music stores. According to Downie in , almost three fourths of all MIR queries are of bibliographic nature. The querying person gives information he or she knows about the song, most likely genre, meter, tempo, lyrics, or acoustic properties, for example, tonality and demands information about title and/or artist. In order to have machines assist in building a song database queryable by features such as tempo, meter, or genre, intelligent Information Retrieval algorithms are necessary to automatically extract such high-level features from raw music data. Many works exist that describe or give overviews over basic MIR methods, for example, [2–8]. Besides tonal features, the temporal features play an important role. Tempo, meter, and beat locations form the basis for segmenting music and thus for further feature extraction such as chord change detection or higher level metrical analysis, for example, as performed in . Because of its importance, we will primarily focus on robust tempo detection within this article.
Currently existing state-of-the-art tempo detection algorithms are—generally speaking—based on methods of periodicity detection. That is, they use techniques such as autocorrelation, resonant filter banks, or onset time statistics to detect the tempo. A good comparison and overview is given in . However, very little work exists that combines various low-level detection methods, such as tempo induction, meter recognition, and beat tracking into a system that is able to use features from all these subtasks to perform robust high-level classification tasks, for example, ballroom dance style or genre recognition, and in turn use the classification results to improve the low-level detection results. Only few, such as [11, 12], present data-driven genre and meter recognition. Other methods, such as [13, 14], use rhythmic features only for specific tasks, like audio identification, and do not use rhythmic features in a multistep process to improve results themselves.
A novel approach that aims at robust, data-driven rhythm analysis primarily targeted at database applications is presented in this article. A compact set of low-level rhythmic features is described, which is highly suitable for discrimination between duple and triple meter as well as ballroom dance style classification. Based on the results of data-driven dance style and meter recognition, the quarter-note tempo can be detected very reliably, reducing errors in which half or twice the true tempo is detected. Beat tracking at the beat level for songs with an approximately constant tempo can be performed more reliably once the tempo is known; however, it will not be discussed in this article. A beat tracking method that can be used in conjunction with the new data-driven rhythm analysis approach is presented in . Although the primary aim of the presented approach is to robustly detect the quarter-note tempo, the complete procedure is referred to as rhythm analysis, because meter and ballroom dance style are also detected and used in the final tempo detection pass.
The article is structured as follows. In Section 2, an introduction to tempo detection, meter recognition, and genre classification is given along with an overview over selected related work. Section 3 describes the novel approach to improved data-driven tempo detection through prior meter and ballroom dance style classification. The results are presented in Section 4 and compared to results obtained at the ISMIR 2004 tempo induction contest before the conclusion and outlook in Section 5.
Tempo induction, beat tracking, and meter detection methods can roughly be divided into two major groups. The first group consists of those that attempt to explicitly find onsets in the first step (or use onsets obtained from a symbolic notation, e.g., MIDI), and then deduce information about tempo, beat positions, and possibly meter by analyzing the interonset intervals (IOIs) [9, 16–21]. The second group contains those that extract information about the tempo and metrical structure prior to onset detection. Correlation or resonator methods are mostly used for this task. If onset positions are required, onset detection can then be assisted by information from the tempo detection stage [2, 4–6, 8, 22].
The more robust methods, especially, for database applications, are those from the second group. However, we will first explain the concept of onset detection used in the methods of the first group, as we believe it is a very intuitive way to approach the problem of beat tracking and tempo detection.
Before we start explaining the tempo induction methods, we take a look at some music terminology regarding meter. The metrical structure of a musical piece is composed of multiple hierarchical levels , where the period of each higher level is an integer multiple of the period on the lowest level. The latter is called the tatum level. The level at which we tap along when listening to a song is the beat level. Sometimes this tempo is referred to as the quarter-note tempo. The measure level corresponds to the bar in notated music, and the period of its tempo gives the length of a measure. The relation between measure and beat level is often referred to as time signature or more generally the meter.
(1)Full-wave rectification and lowpass filtering of followed by down sampling to approximately 100 Hz.
(2)Dividing the signal into small windows having a length around 20 milliseconds with approximately 50% overlap and then calculating the RMS energy of each window by averaging over all in the window. This can be followed by an additional lowpass filter for smoothing purposes.
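The second variant can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rms_envelope`, the 8 kHz sampling rate, and the test signal are assumptions chosen for the example, while the 20 millisecond window and 50% overlap follow the description above.

```python
import math

def rms_envelope(samples, sample_rate, window_ms=20.0, overlap=0.5):
    """Return the RMS energy of overlapping windows of `samples`."""
    win = max(1, int(round(sample_rate * window_ms / 1000.0)))
    hop = max(1, int(round(win * (1.0 - overlap))))
    envelope = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        # Root of the mean squared sample value in this window
        envelope.append(math.sqrt(sum(s * s for s in frame) / win))
    return envelope

# Illustrative input: 0.1 s of silence followed by 0.1 s of a 440 Hz tone
sr = 8000  # assumed sampling rate for this example only
signal = [0.0] * 800 + [math.sin(2 * math.pi * 440 * n / sr) for n in range(800)]
env = rms_envelope(signal, sr)
```

The envelope stays at zero during the silence and rises once the tone starts, which is the kind of increase an onset detector would look for.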
In , Scheirer states that the amplitude envelope does not contain all rhythmic information. Multiple nonlinear frequency bands must be analyzed separately and the results combined at the end. To improve the simple onset detection introduced in the last paragraph, the signal can be split into six nonlinear bands using a bandpass filter bank. Onsets are still assumed to correspond to an increase in the amplitude envelope, now of each bandpass signal rather than of the full-spectrum signal. Therefore, the same onset detection procedure as described above can be performed for each bandpass signal. This results in onset data for each band. The data of the six bands must then be combined, which is done by adding the onsets of all bands and merging onsets that are sufficiently close together. Such a multiple band approach gives better results for music in which no strong beats, such as bass drums in electronic dance music, are present. A more advanced discussion of onset detection in multiple frequency bands is presented in .
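Combining the per-band onsets might look like the sketch below. The function name `merge_band_onsets`, the 50 millisecond tolerance, and the toy onset times are assumptions made for illustration; the text does not specify how close "sufficiently close" is.

```python
def merge_band_onsets(band_onsets, tolerance=0.05):
    """Pool onset times (seconds) from all bands and fuse those closer than `tolerance`.

    Returns a list of (mean onset time, number of contributing onsets).
    """
    pooled = sorted(t for onsets in band_onsets for t in onsets)
    merged = []
    cluster = []
    for t in pooled:
        # Start a new cluster once we are too far from the cluster's first onset
        if cluster and t - cluster[0] > tolerance:
            merged.append((sum(cluster) / len(cluster), len(cluster)))
            cluster = []
        cluster.append(t)
    if cluster:
        merged.append((sum(cluster) / len(cluster), len(cluster)))
    return merged

# Toy onset lists for three bands (values are made up)
bands = [[0.50, 1.00, 1.50], [0.51, 1.52], [1.01, 2.00]]
combined = merge_band_onsets(bands)
```

The count attached to each merged onset could serve as a simple weight, since an onset seen in several bands is likely to be more salient.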
All methods presented up to this point are based on detecting a sudden increase in signal energy. In recent years, phase based  or combined energy/phase approaches  introduced by Bello et al. have been shown to give better results than energy-only approaches. Basically, onset detection incorporating phase and energy, that is, operating in the complex domain, is based on the assumption that there is both a notable phase deviation and an energy increase when an onset occurs. Yet, to preserve the general and introductory nature of this overview and focus more on tempo detection, we will not go into details on these techniques.
For tempo detection from onset data mainly a histogram technique is used in the literature [2, 18]. The basic idea is the following: duration and weight of all possible IOIs are computed. Similar IOIs are grouped in clusters and the clusters are arranged in a histogram. From the weights and the centers of the clusters the tempo of several metrical levels can be determined. Dixon in  uses a simple rule-based method. Seppänen in  uses a more advanced method. He extracts only the tatum pulse level (fastest occurring tempo) directly from the IOI histogram, by picking the cluster with the center corresponding to the smallest IOI. Features in a window around each tatum pulse are extracted. Using Bayesian pattern recognition, the tatum pulses are classified with respect to their perceived accentuation. Thus, the beat level is detected by assuming that beats are more accented than offbeat pulses. Although Seppänen's work stops at the tatum level, the score level could be detected in the same way, assuming that beats at the beginning of a score are more accented than beats within.
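The histogram idea can be illustrated with a small sketch. It is a simplified stand-in for the cited methods rather than a reproduction of either: the greedy clustering rule, the cluster width, and the names `ioi_clusters`, `max_ioi`, and `width` are assumptions, and the cluster weight is simply the number of member IOIs.

```python
def ioi_clusters(onsets, max_ioi=2.0, width=0.025):
    """Cluster all pairwise inter-onset intervals up to `max_ioi` seconds.

    Returns (cluster centre in seconds, number of IOIs) sorted by centre.
    """
    iois = sorted(
        later - earlier
        for i, earlier in enumerate(onsets)
        for later in onsets[i + 1:]
        if later - earlier <= max_ioi
    )
    clusters = []  # each entry: [sum of member IOIs, member count]
    for ioi in iois:
        if clusters and abs(ioi - clusters[-1][0] / clusters[-1][1]) <= width:
            clusters[-1][0] += ioi
            clusters[-1][1] += 1
        else:
            clusters.append([ioi, 1])
    return [(total / count, count) for total, count in clusters]

# Perfectly regular toy onsets, one every 0.5 s
onsets = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
hist = ioi_clusters(onsets)
tatum_ioi = hist[0][0]          # cluster with the smallest IOI
tatum_bpm = 60.0 / tatum_ioi    # converted to pulses per minute
```

Picking the cluster with the smallest centre mirrors the tatum selection described above; real onset data would need a more careful treatment of spurious short intervals.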
We will now take a look at the second group of algorithms that attempt to find the tempo without explicitly detecting onsets. Still it is assumed that rhythmic events such as beats, percussion, or note onsets correspond to a change in signal amplitude in a few nonlinear bands. Again we start with either the envelopes or the differentials of the envelopes of the six frequency bands but omit the step of peak picking. To keep this overview general the term "detection function"  will be used in the ongoing, referring to either the envelope, its differential or any other function related to perceivable change in the signal.
The beat level tempo, which is what we are interested in at this point, can be viewed as a periodicity in the envelope function. A commonly used method to detect periodicities in a function is autocorrelation [8, 27]. The periodic autocorrelation is computed over a small window (10 seconds) of the envelope function. The index of the highest peak in the autocorrelation function (ACF) indicates the strongest periodicity. However, as findings in  suggest, the strongest periodicity in the signal may not always be the dominant periodicity perceived. The findings suggest an interval of preferred tapping linked to a supposed resonance between our perceptual and motor system. Still, as a first guess, which will work fairly well on music with strong beats in the preferred tapping range, the highest peak can be assumed to indicate the beat level tempo. We also have to combine the results from all bands. The simplest way is to add up the ACF of all bands and pick the highest peak in the summary ACF (SACF). Determining the tempo for each band and choosing the tempo that was detected in the majority of bands as the final tempo is an alternative method. Dixon describes a tempo induction method based on autocorrelation in . Uhle et al. use autocorrelation for meter detection in .
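A minimal sketch of the summary ACF approach is given below. It uses a plain (non-periodic) autocorrelation on mean-removed envelopes, which is an assumption that differs slightly from the periodic autocorrelation mentioned above; the lag search range and the toy envelopes are also illustrative.

```python
def autocorrelation(x, max_lag):
    """Unnormalised autocorrelation of a mean-removed sequence for lags 0..max_lag."""
    mean = sum(x) / len(x)
    centred = [v - mean for v in x]
    return [
        sum(centred[n] * centred[n + lag] for n in range(len(x) - lag))
        for lag in range(max_lag + 1)
    ]

def tempo_from_sacf(band_envelopes, frame_rate, min_lag, max_lag):
    """Sum the ACFs of all bands and convert the strongest lag to BPM."""
    acfs = [autocorrelation(env, max_lag) for env in band_envelopes]
    sacf = [sum(acf[lag] for acf in acfs) for lag in range(max_lag + 1)]
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: sacf[lag])
    return 60.0 * frame_rate / best_lag

# Two toy band envelopes at 100 frames/s with an impulse every 50 frames
band_a = [1.0 if n % 50 == 0 else 0.0 for n in range(1000)]
band_b = [0.5 if n % 50 == 0 else 0.0 for n in range(1000)]
bpm = tempo_from_sacf([band_a, band_b], frame_rate=100, min_lag=20, max_lag=80)
```

Restricting the lag range is one simple way to keep the search inside a plausible tapping range; it does not by itself resolve the perceptual ambiguity discussed above.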
An alternative to autocorrelation is a resonant filter bank consisting of resonators tuned to different frequencies (periodicities), first introduced for beat tracking by Scheirer in . The detection function is fed to all resonators and the total output energy of each resonator is computed. In analogy to the highest autocorrelation peak, the resonator with the highest output energy matches the songs periodicity best and thus the beat level tempo is assumed to be its resonance frequency. As explained in the last paragraph, this assumption does not fully match our perception of rhythm. This is one reason why it is so difficult, even for most of state-of-the-art systems, to reliably detect the tempo on the beat level. Octave errors, that is, where double/triple or half/third the beat level tempo is detected, are very common according to . Even human listeners in some cases do not agree on a common tapping level.
All the methods introduced so far require the extraction of a detection function. Publications exist discussing how such a detection function can be computed, considering signal processing theory  and applying psychoacoustic knowledge . In order to bypass the issue of selecting a good detection function, a different periodicity detection approach as was introduced for tempo and meter analysis by Foote and Uchihashi  can be used. This approach is based on finding self-similarities among audio features. First, the audio data is split into small (20–40 milliseconds) overlapping windows. Feature vectors containing, for example, FFT coefficients or MFCC  are extracted from these windows and a distance matrix is computed by comparing every vector with all the remaining vectors via a distance measure or cross-correlation.
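The distance matrix computation can be sketched in a few lines. Cosine distance is one possible choice of distance measure, and the diagonal averaging in `lag_profile` is an assumed way of reading periodicities off the matrix; neither is claimed to match the cited work in detail.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0.0 else 1.0

def distance_matrix(features):
    """Compare every feature vector with every other feature vector."""
    n = len(features)
    return [[cosine_distance(features[i], features[j]) for j in range(n)] for i in range(n)]

def lag_profile(matrix):
    """Mean distance along each off-diagonal; low values suggest a strong period."""
    n = len(matrix)
    return [sum(matrix[i][i + lag] for i in range(n - lag)) / (n - lag) for lag in range(n)]

# Toy two-dimensional feature vectors repeating every 3 frames
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]] * 4
profile = lag_profile(distance_matrix(frames))
```

The quadratic number of comparisons is visible directly in `distance_matrix`, which is the cost the following paragraphs refer to.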
While still the choice of the feature set might have an influence on the performance, this method has an advantage over computing the ACF of a detection function. In computing the correlation or distance of every feature vector to every other feature vector all possible relations between all features in all feature vectors are accounted for. Detection functions for separate frequency bands can only account for (temporal) relations within each band. If the detection function is a sum over all bands, for example, relations between the frequency bands are accounted for, but only in a very limited way. This case would correspond to reducing the feature vector to one dimension by summing its elements before computing the distance matrix.
However, computing distance matrices is a very time consuming task and might thus not be applicable to live applications, for example, that demand real-time algorithms. For most mainstream music, it can be assumed that the sensation of tempo corresponds to a loudness periodicity, as can be represented by a single detection function or a set of detection functions for a few subbands. Therefore, even though in our opinion the distance matrix method seems to be the theoretically most advanced method, it is not used in the rhythm analysis method presented in the following.
In the remaining part of this overview section we will give a very short overview over selected meter detection and ballroom dance style and genre recognition methods.
Various work exists on the subject of genre recognition, for example, [30, 31]. The basic approach is to extract a large number of features representing acoustic properties for each piece of music to be classified. Using a classifier trained on annotated training data, the feature vectors extracted from the songs are assigned a genre. Reference  extracts features related to timbral texture, rhythmic content and pitch content. The rhythmic features are extracted from the result of autocorrelation of subband envelopes. Gaussian mixture models (GMMs) and K-nearest-neighbour (K-NN) are investigated as classifiers, and a discrimination rate of 61% for 10 musical genres is reported. Reference  investigates the use of large open feature sets and automatic feature selection combined with support vector machines as classifiers. A success rate of 92.2% is reported for discrimination between 6 genres.
The subject of ballroom dance style recognition is relatively new. Gouyon et al. have published a data-driven approach to ballroom dance style recognition in . They test various features extracted from IOI histograms using 1-NN classification. The best result is achieved with 15 MFCC like descriptors computed from the IOI histogram. 90.1% accuracy is achieved with these descriptors plus the ground truth tempo by 1-NN classifiers. Without ground truth tempo, that is, only the 15 descriptors, 79.6% accuracy is reported.
Meter detection requires tempo information from various metrical levels. Klapuri et al. introduce an extensive method to analyze audio on the tatum, pulse, and measure level . For each level, the period is estimated based on periodicity analysis using a comb filter bank. A probabilistic model encompasses the dependencies between the metrical levels. The method is able to deal with changing metrical structures throughout the song. It proves robust for phase and tempo on the beat level, but still has some difficulties on the measure level. The method is well suited for in-depth metrical analysis of a wide range of musical genres. For a limited set of meters, for example, as in ballroom dance music, the complexity can be reduced (at the gain of accuracy) to binary decisions between duple or triple periods on the measure level. Gouyon et al. assume a given segmentation of the song on the beat level and then focus on a robust discrimination between duple and triple meter  on the measure level. For each beat segment, a set of low-level descriptors is computed from the audio. Periodic similarities of each descriptor across beats are analyzed by autocorrelation. From the output of the autocorrelation, a decision criterion is computed for each descriptor, which is used as a feature in meter classification.
A data-driven rhythm analysis approach is now introduced that is capable of extracting rhythmic features and robustly identifying duple and triple meter, quarter-note tempo, and ballroom dance style based on 82 rhythmic features, which are described in the following sections.
Robustly identifying the quarter-note or beat level tempo is a challenging task, since octave errors, that is, where double or half of the true tempo is detected, are very common. Therefore, a new tempo detection approach, based on integrated ballroom dance style recognition, is investigated.
The tatum tempo [8, 18], that is, the fastest tempo, presents the basis for extracting rhythmic features. A resonator-based approach, inspired by , is used for detecting this tatum tempo and extracting features containing information about the distribution of resonances throughout the song.
The features are used to decide whether the song is in duple or triple meter. Confining the metrical decision to a binary one was introduced in . For dance music, the discrimination between duple and triple meter has the most practical significance. Identifying various time signatures, such as 2/4, 4/4, and 6/8 is a more complicated task and of less practical relevance for ballroom dance music. The rhythmic features are further used to classify songs into 9 ballroom dance style classes. These results will be used to assist the tempo detection algorithm by providing information about tempo distributions collected from the training data for the corresponding class. For evaluation 10-fold stratified cross-validation is used. This is described in more detail in Section 3.5.
3.1. Comb Filter Tempo Analysis
The approach for tatum tempo analysis discussed in this article is based on Scheirer's multiple resonator approach  using comb filters as resonators. His approach has been adapted and improved successfully in other work for tempo and meter detection [6, 10, 32]. The main concept is to filter the envelopes or detection functions (see Section 2) of six nonlinear frequency bands through a bank of resonators. The resonance frequency of the resonator with the highest output energy is chosen as tempo. The comb filters used here are a slight variation of Scheirer's filters. In the following paragraphs, there will be a brief theoretical discussion of IIR comb filters and a description of the chosen filter parameters.
In the ongoing, the symbol will be used to denote a tempo. The tempo is specified as a frequency having the unit BPM (beats per minute). If an index IOI is appended to the symbol , it is indicated that the tempo is given as IOI period in frames.
A comb filter adds a signal to a delayed version of itself. Every comb filter is characterized by two parameters: the delay (or period, which is the inverse of the filter's resonance frequency) and the gain .
To achieve optimal tempo detection performance, an optimal value for must be determined. Scheirer's  method of constant half-energy time, which uses a variable gain depending on , did not perform well in our test runs. Instead, we use a fixed value for . When choosing this value, we have to consider small temporary tempo drifts occurring in most music performances, so the theoretically optimal gain cannot be used. We conducted test runs with multiple values for in the range from 0.2 to 0.99. Best results were obtained with .
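The resonator idea can be illustrated with a small feedback comb filter bank. The difference equation used here, y[n] = (1 - gain) * x[n] + gain * y[n - delay], is one common form and is an assumption, since the exact filter variation used in this work is not reproduced above. The gain of 0.8 is purely illustrative and is not the value found in the test runs.

```python
def comb_filter_energy(x, delay, gain):
    """Output energy of the IIR comb filter y[n] = (1 - gain) * x[n] + gain * y[n - delay]."""
    y = [0.0] * len(x)
    for n, sample in enumerate(x):
        feedback = y[n - delay] if n >= delay else 0.0
        y[n] = (1.0 - gain) * sample + gain * feedback
    return sum(v * v for v in y)

def strongest_delay(x, delays, gain):
    """Delay (in frames) of the resonator with the highest output energy."""
    return max(delays, key=lambda d: comb_filter_energy(x, d, gain))

# Toy detection function at 100 frames/s: one impulse every 25 frames
detection = [1.0 if n % 25 == 0 else 0.0 for n in range(1000)]
best = strongest_delay(detection, range(18, 75), gain=0.8)  # gain is illustrative only
pulses_per_minute = 6000.0 / best  # 60 s * 100 frames/s divided by the delay
```

With a fixed gain, a resonator at twice the true period also responds to the impulse train but builds up more slowly, so its total output energy is lower in this toy case.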
3.2. Feature Extraction
The comb filters introduced in the previous section are used to extract the necessary features for ballroom-dance style recognition, meter recognition, and tempo detection. The key concept is to set up comb filter banks over a much broader range than used by  in order to include higher metrical layers. The resulting features describe the distribution of resonances among several metrical layers, which provides qualitative information about the metrical structure.
To effectively reduce the number of comb filters required, we exploit the fact that in music performances several metrical layers are present (see Section 2). In a first step the tempo on the lowest level, the tatum tempo, is detected. It is now assumed that all possibly existing higher metrical levels can only have periods that are integer multiples of the tatum period. This is true for a wide variety of musical styles.
The input data is down sampled to and converted into monophonic by stereo-channel addition in order to reduce computation time. The input audio of length seconds is split into frames of samples with an overlap of 0.57, resulting in a final envelope frame rate of 100 fps (frames per second). A Hamming window is applied to each frame and a fast Fourier transform (FFT) of the frame is computed, resulting in 128 FFT coefficients.
By using overlapping triangular filters, equidistant on the mel-frequency scale, the 128 FFT coefficients are reduced to envelope samples of nonlinear bands. These triangular filters are the same as used in speech recognition for the computation of MFCC .
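A triangular filter bank of this kind might be sketched as follows. The mel conversion formula is the widely used 2595 * log10(1 + f / 700) variant, and the band count, bin count, and sampling rate passed in the example are placeholders chosen for illustration, not the values used in this work.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_bins, sample_rate):
    """Triangular weights over `n_bins` FFT bins (0 Hz .. Nyquist), equidistant in mel."""
    nyquist = sample_rate / 2.0
    mel_points = [hz_to_mel(nyquist) * i / (n_bands + 1) for i in range(n_bands + 2)]
    hz_points = [mel_to_hz(m) for m in mel_points]
    bin_hz = [nyquist * k / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for b in range(1, n_bands + 1):
        lo, mid, hi = hz_points[b - 1], hz_points[b], hz_points[b + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising edge
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def band_energies(spectrum, bank):
    """Reduce one frame of FFT magnitudes to one value per band."""
    return [sum(w * s for w, s in zip(row, spectrum)) for row in bank]

# All three arguments are illustrative placeholders
bank = mel_filterbank(n_bands=6, n_bins=128, sample_rate=11025)
energies = band_energies([1.0] * 128, bank)
```

Applying `band_energies` to every frame yields one envelope sample per band and frame.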
Such a small set of frequency bands, still covering the whole human auditory frequency range, contains the complete rhythmic structure of the musical excerpt, according to experiments conducted in .
This method is based on the fact that a human listener perceives note onsets as more intense if they occur after a longer time of lower sound level and thus are not affected by temporal post-masking caused by previous sounds . The weighting with the right mean incorporates the fact that note duration and total note energy play an important role in determining the perceived note accentuation .
3.2.2. Tatum Features
For detecting the tatum tempo , an IIR comb filter bank is used consisting of 57 filters, with gain and delays ranging from to envelope samples. This filter bank is able to detect tatum tempos in the range from 81 to 333 pulses per minute. The range might need adjustments when very slow music is processed, that is, music with no tempo faster than 81 pulses per minute.
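The relation between comb filter delay and tempo can be checked with a short calculation. The sketch below assumes the envelope frame rate of 100 fps stated earlier and derives integer delays from the stated tempo range; the resulting delay bounds are a consistency check on the figures of 57 filters and 81 to 333 pulses per minute, not values quoted from the text.

```python
FRAME_RATE = 100  # envelope frames per second, as stated in the text

def delay_to_ppm(delay_frames, frame_rate=FRAME_RATE):
    """Pulses per minute for a comb filter delay given in envelope frames."""
    return 60.0 * frame_rate / delay_frames

def ppm_to_delay(ppm, frame_rate=FRAME_RATE):
    """Nearest integer delay in envelope frames for a tempo in pulses per minute."""
    return round(60.0 * frame_rate / ppm)

# One filter per integer delay between the fastest and slowest tatum tempo
delays = list(range(ppm_to_delay(333), ppm_to_delay(81) + 1))
fastest = delay_to_ppm(delays[0])
slowest = delay_to_ppm(delays[-1])
```

Because the delays are integers, the tempo resolution is coarser at the fast end of the range than at the slow end.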
From three additional features are extracted that reveal the quality of the peaks.
(i) is computed by dividing the highest value by the lowest.
(ii) is the fraction of the first value over the last value.
(iii) is computed as mean of the maximum and minimum value normalized by the global mean.
The 63 tatum features consisting of , , , , , , and the tatum vector with 57 elements constitute the first part of the rhythmic feature set. A major difference to some existing work is the use of the complete tatum vector in the feature set. Reference  uses rhythmic features for genre classification. However, from a beat histogram, which is loosely comparable to the tatum vector (both contain information about the periodicities), only a small set of features is extracted, only considering the two highest peaks and the sum of the histogram.
3.2.3. Meter Features
The tatum features only contain information from a very small tempo range, hence, they are not sufficient when one is interested in the complete metrical structure and other tempi than the tatum tempo. Thus, features that contain information about tempo distributions over a broader range are required. These are referred to as meter features, although they do not contain explicit information about the meter.
A so called meter vector is introduced. This vector shows the distribution of resonances among 19 metrical levels, starting at, and including the tatum level.
The 19 elements of the meter vector , without further processing or reduction, constitute the second part of the rhythmic feature set. We would like to note at this point, that no explicit value for the meter (i.e., duple or triple) is part of the meter features. In the ongoing the reader will learn how the meter is detected in a data-driven manner using support vector machines (SVMs).
3.3. Feature Selection
Overview of all 82 rhythmic features (table content only partially recoverable): tatum vector (57 elements); tatum candidates [BPM]; final tatum tempo [BPM]; meter vector (19 elements).
In order to find relevant features for meter and ballroom dance style classification, the dataset is analyzed for each of these two cases by performing a closed-loop hill-climbing feature selection employing the target classifier's error rate as optimization criterion, namely, sequential forward floating search (SVM-SFFS) .
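The search strategy can be sketched generically. This is a simplified sequential forward floating search under stated assumptions: `score` is a hypothetical callback that would, in the closed-loop setting described above, return the classifier's cross-validated accuracy for a feature subset, and the toy score used in the example is made up.

```python
def sffs(features, score, target_size):
    """Simplified sequential forward floating search maximising `score(subset)`."""
    selected = []
    best_by_size = {}
    while len(selected) < target_size:
        # Forward step: add the single feature that helps most
        candidates = [f for f in features if f not in selected]
        best_add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [best_add]
        size = len(selected)
        best_by_size[size] = max(best_by_size.get(size, float("-inf")), score(selected))
        # Floating step: drop features while that beats the best smaller subset so far
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            if score(reduced) > best_by_size[len(reduced)]:
                selected = reduced
                best_by_size[len(reduced)] = score(reduced)
            else:
                break
    return selected

# Toy score: two informative features, a small penalty per selected feature
useful = {"tempo", "meter"}

def toy_score(subset):
    return sum(1.0 for f in subset if f in useful) - 0.1 * len(subset)

chosen = sffs(["noise1", "tempo", "noise2", "meter"], toy_score, target_size=2)
```

In practice every call to `score` means training and evaluating a classifier, so caching scores for subsets already visited would be a sensible addition.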
The feature selection reveals the following feature subset to yield the best results for meter classification: , meter vector elements 4, 6, 8, 16, and the tatum vector .
For ballroom dance style classification the feature selection reveals the following feature subset to yield the best results: meter (see Section 3.5), , , , meter vector elements 4–6, 8, 11, 12, 14, 15, 19, and the tatum vector excluding elements 21 and 29.
3.4. Song Database
Mean , standard deviation , minimum and maximum tempo in BPM for each class, and complete set .
Results obtained on dataset for meter , quarter-note tempo , and ballroom dance style (BDS).
For the dataset, the ground truth of tempo and dance style is known from . The ground truth regarding duple or triple metrical grouping is also implicitly known from the given source because it can be deduced from the dance style. All Waltzes have triple meter, all other dances have duple meter. Tempo ground truths are not manually double checked as performed in , therefore errors among the ground truths might be present. Results with manually checked ground truths might improve slightly. This is further discussed near the end of Section 4.
3.5. Data-Driven Meter and Ballroom Dance Style Recognition
From the abstract features in set (see Section 3.3) meter and quarter-note tempo have to be extracted. While data-driven meter recognition by SVM yields excellent results, data-driven tempo detection is a complicated task because tempo is a continuous variable. An SVM regression method was investigated, but did not prove successful: it was not able to correctly identify tempi within a tolerance of only a few percent relative BPM deviation. A hybrid approach is therefore used, in which the data is divided into a small number of classes representing tempo ranges. The ranges are allowed to overlap slightly. As the database described in Section 3.4 already has one of nine ballroom dance styles assigned to each instance, the dance styles are chosen as the tempo classes, since music of the same dance style is generally limited to a specific tempo range. This is confirmed by other work, which uses tempo ranges to assign a ballroom dance style [2, 37].
The meter , from the previous step, is used as a feature in feature set (see Section 3.3) for ballroom dance style classification. The same 10-fold procedure as was used for meter classification in step 1 is performed in order to assign a ballroom dance style to all instances in the dataset.
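Stratified fold assignment for such a 10-fold procedure can be sketched as below. The round-robin assignment without shuffling, the function names, and the toy label counts are assumptions for illustration; the actual fold construction used in the experiments is not specified here.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=10):
    """Assign each instance a fold index so every class is spread evenly over the folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    fold_of = [0] * len(labels)
    next_fold = 0
    for label in sorted(by_class):
        for index in by_class[label]:
            fold_of[index] = next_fold
            next_fold = (next_fold + 1) % n_folds
    return fold_of

def train_test_split(fold_of, test_fold):
    """Indices of the training and test instances for one fold."""
    train = [i for i, f in enumerate(fold_of) if f != test_fold]
    test = [i for i, f in enumerate(fold_of) if f == test_fold]
    return train, test

# Toy label distribution (counts are made up)
labels = ["Waltz"] * 30 + ["Tango"] * 20 + ["Samba"] * 50
folds = stratified_folds(labels)
train_idx, test_idx = train_test_split(folds, test_fold=0)
```

Each instance is classified exactly once, by a model trained on the other nine folds, so a predicted meter or dance style becomes available for every instance in the dataset.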
With the results of both meter and ballroom dance style classification, it is now possible to quite robustly detect the quarter-note tempo. The following section describes the novel tempo detection procedure in detail.
3.6. From Ballroom Dance Style to Tempo
For the training data of each of the 10 folds introduced in the previous section, the means and variances of the distributions of quarter-note tempi (ground truths) and tatum tempi are computed for each of the 9 ballroom dance styles. No ground truth for the tatum tempo is available, so the automatically extracted tatum tempo (see Section 3.2.2) from step (1) in Section 3.5. is used. Results might improve further if ground truth tatum information were available, since correct tatum detection is crucial for correct results.
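Collecting the per-style statistics is straightforward; a minimal sketch follows. The function name and the tempo values are invented for the example, and population variance is an assumed choice, since the text does not say which variance estimator is used.

```python
from collections import defaultdict
import statistics

def class_tempo_stats(tempi, styles):
    """Mean and (population) variance of the tempi of each dance style."""
    grouped = defaultdict(list)
    for tempo, style in zip(tempi, styles):
        grouped[style].append(tempo)
    return {
        style: (statistics.mean(values), statistics.pvariance(values))
        for style, values in grouped.items()
    }

# Made-up training tempi in BPM
stats = class_tempo_stats([84.0, 90.0, 200.0], ["Waltz", "Waltz", "Jive"])
```

The same function would be applied twice per fold, once to the ground truth quarter-note tempi and once to the automatically extracted tatum tempi.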
Now the candidate for which the function is maximal is chosen as the final tatum tempo . Based upon this new tatum, a new flattened meter vector is computed for all instances as described in Section 3.2.3.
The new meter vector is used for detection of the quarter-note tempo. Each element is multiplied by a Gaussian weighting factor . The parameters and in (11) are now set to the values and of the corresponding ballroom dance style. indicates the tempo the meter vector element belongs to (see Section 3.2.3).
Next, the index , for which the expression is maximized, is identified. The tempo belonging to index is the detected quarter-note (beat level) tempo .
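The weighting and selection step can be sketched as follows. The Gaussian is written without a normalisation constant, and the mapping from meter vector index to tempo (tatum tempo divided by the level number) is an assumption based on the integer relation between metrical levels; the meter vector values and style statistics are made up.

```python
import math

def gaussian_weight(tempo, mean, std):
    """Unnormalised Gaussian weight of a tempo under a dance style's tempo distribution."""
    return math.exp(-0.5 * ((tempo - mean) / std) ** 2)

def quarter_note_tempo(meter_vector, tatum_bpm, style_mean, style_std):
    """Pick the metrical level with the largest weighted resonance; return its tempo in BPM."""
    # Assumed mapping: element i belongs to the level whose period is (i + 1) tatum periods
    level_tempi = [tatum_bpm / (i + 1) for i in range(len(meter_vector))]
    weighted = [
        m * gaussian_weight(t, style_mean, style_std)
        for m, t in zip(meter_vector, level_tempi)
    ]
    best = max(range(len(weighted)), key=weighted.__getitem__)
    return level_tempi[best]

# Toy meter vector whose strongest raw resonance sits at twice the plausible tempo
meter_vector = [0.9, 1.0, 0.2, 0.8]
tempo = quarter_note_tempo(meter_vector, tatum_bpm=360.0, style_mean=88.0, style_std=5.0)
```

Without the weighting, the second element (180 BPM) would win; the style prior moves the decision to 90 BPM, which is the kind of octave error the approach is meant to avoid.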
Comparison of tempo detection without (w/o BDS), with incorporated ballroom dance style recognition (w BDS) and using ground truth ballroom classes to simulate optimal BDS recognition (gt BDS).
The results in Table 4 clearly show that the number of instances for which the correct tempo octave is identified increases by almost 20% absolute when incorporating the ballroom dance style recognized in step (2). When assuming an optimal ballroom dance style recognition, that is, when ground truth ballroom data is used instead of the recognition results, the tempo octave is identified correctly in almost all cases in which the tempo is identified correctly. With the new data-driven approach to tempo detection, accuracies for the quarter-note tempo are improved by approximately 5% absolute for Waltz and over 10% for Viennese Waltz, compared to previous work on the same dataset . On 88% of all instances the correct tempo octave was identified, which is remarkable, considering the wide range of tempi of the dataset.
Detailed final results, after applying all the steps from Section 3.5 through Section 3.6, are depicted in Table 3. The tolerance for tempo detection hereby is 3.5% relative BPM deviation to maintain consistency with previous publications . We would like to note that ballroom dance style recognition has been performed completely without using the quarter-note tempo as a feature.
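A tolerance-based tempo evaluation can be sketched as below. The 3.5% tolerance is taken from the text; the set of accepted octave factors and the two function names are assumptions, and the two returned rates are not claimed to correspond exactly to the accuracy measures reported in the tables.

```python
def within_tolerance(estimate, truth, tolerance=0.035):
    """True if the estimate deviates from the truth by at most `tolerance` (relative)."""
    return abs(estimate - truth) <= tolerance * truth

def correct_up_to_octave(estimate, truth, factors=(1 / 3, 0.5, 1.0, 2.0, 3.0), tolerance=0.035):
    """True if the estimate matches the truth or one of its assumed octave multiples."""
    return any(within_tolerance(estimate, truth * f, tolerance) for f in factors)

def accuracies(estimates, truths):
    """Fraction of exact matches and fraction of matches when octave errors are forgiven."""
    n = len(truths)
    exact = sum(within_tolerance(e, t) for e, t in zip(estimates, truths)) / n
    octave_tolerant = sum(correct_up_to_octave(e, t) for e, t in zip(estimates, truths)) / n
    return exact, octave_tolerant

# Made-up estimates and ground truths in BPM
exact, octave_tolerant = accuracies([120.0, 61.0, 200.0, 95.0], [118.0, 120.0, 100.0, 130.0])
```

The gap between the two rates gives a quick indication of how many of the remaining errors are octave errors rather than unrelated tempo estimates.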
In , Dixon et al. use a rule-based approach for dance style classification based on simple tempo ranges. However, results on a large dataset are not reported. In , Gouyon et al. test a data-driven approach on a subset of the dataset. They evaluate multiple feature sets and different classifiers. Using ground truth of tempo and meter from  with a K-nearest neighbour classifier, they report an accuracy of 82.3%. Using the same ground truths and SVM instead of k-NN, we achieve 84.6% of correctly classified instances. With a set of 15 MFCC-like features, comparable to our 82 rhythmic features, Gouyon et al. achieve accuracies of 79.6%. Using SVM on the rhythmic features introduced in this article, the ballroom dance style recognition results improve by almost 10% absolute to 89.1%.
Meter detection results improve by approximately 2% over those reported by Gouyon et al. However, different datasets and classifiers are used, so the results cannot be properly compared. Comparing meter detection results with those reported by Klapuri et al. is not feasible because, in this article, meter detection is restricted to a simple binary decision, the main focus being on tempo detection incorporating ballroom dance style recognition. Klapuri et al. describe a more detailed, multilevel tempo and meter analysis system.
Table: Results on set for tempo detection without ballroom dance style recognition (w/o BDS), with incorporated ballroom dance style recognition (w BDS), and using ground truth ballroom classes (gt BDS). Reported measures: Tempo (acc. 1) and Octave (acc. 2).
Within this article, an overview of basic and current approaches to rhythm analysis on real audio was given. Further, a method was presented that improves on current robustness by combining tempo detection, rhythmic feature extraction, meter recognition, and ballroom dance style recognition in a data-driven manner. As opposed to other work, ballroom dance style classification is carried out first and significantly boosts the performance of tempo detection. A set of 82 rhythmic features was described, and its usefulness for all of these tasks was demonstrated.
Further applications for these features, ranging from general genre recognition to song identification or measuring rhythmic similarity, must be investigated. Preliminary test runs for discrimination between 6 genres (Documentary, Chill, Classic, Jazz, Pop-Rock, and Electronic) on the same dataset and under the same test conditions as used in  indicate accuracies of up to 70% using only the 82 rhythmic features.
It will further be investigated whether adding other features, such as those described by [8, 12], or  can further improve results for all the presented rhythm analysis steps. Moreover, the data-driven tempo detection approach will be extended to non-ballroom music, for example, popular and rock music.
Overall, automatic tempo detection on real audio, also outside of electronic dance music, has matured to a degree where it is ready for multiple intelligent Music Information Retrieval applications in everyday life.
- Downie J: Music information retrieval. Annual Review of Information Science and Technology 2003, 37(1):295-340.
- Dixon S, Pampalk E, Widmer G: Classification of dance music by periodicity patterns. Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR '03), October 2003, Baltimore, Md, USA 159-165.
- Hu N, Dannenberg RB, Tzanetakis G: Polyphonic audio matching and alignment for music retrieval. Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '03), October 2003, New Paltz, NY, USA 185-188.
- Foote J, Uchihashi S: The beat spectrum: a new approach to rhythm analysis. Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '01), August 2001, Tokyo, Japan 881-884.
- Scheirer ED: Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America 1998, 103(1):588-601. 10.1121/1.421129
- Klapuri AP, Eronen AJ, Astola JT: Analysis of the meter of acoustic musical signals. IEEE Transactions on Speech and Audio Processing 2006, 14(1):342-355.
- Orio N: Music retrieval: a tutorial and review. Foundations and Trends in Information Retrieval 2006, 1(1):1-90. 10.1561/1500000002
- Uhle C, Rohden J, Cremer M, Herre J: Low complexity musical meter estimation from polyphonic music. Proceedings of the 25th International Conference on the Audio Engineering Society (AES '04), June 2004, London, UK 63-68.
- Goto M, Muraoka Y: Real-time rhythm tracking for drumless audio signals: chord change detection for musical decisions. Proceedings of IJCAI-97 Workshop on Computational Auditory Scene Analysis (CASA '97), August 1997, Nagoya, Japan 135-144.
- Gouyon F, Klapuri AP, Dixon S, et al.: An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech and Language Processing 2006, 14(5):1832-1844.
- Gouyon F, Herrera P: Determination of the meter of musical audio signals: seeking recurrences in beat segment descriptors. Proceedings of the 114th Convention of the Audio Engineering Society (AES '03), March 2003, Amsterdam, The Netherlands.
- Gouyon F, Dixon S, Pampalk E, Widmer G: Evaluating rhythmic descriptors for musical genre classification. Proceedings of the 25th International Conference on the Audio Engineering Society (AES '04), June 2004, London, UK 196-204.
- Kurth F, Gehrmann T, Müller M: The cyclic beat spectrum: tempo-related audio features for time-scale invariant audio identification. Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR '06), October 2006, Victoria, Canada 35-40.
- Kirovski D, Attias H: Beat-ID: identifying music with beat analysis. Proceedings of the International Workshop on Multimedia Signal Processing (MMSP '02), December 2002, St. Thomas, Virgin Islands, USA 190-173.
- Eyben F, Schuller B, Rigoll G: Wearable assistance for the ballroom-dance hobbyist: holistic rhythm analysis and dance-style classification. Proceedings of IEEE International Conference on Multimedia & Expo (ICME '07), July 2007, Beijing, China 92-95.
- Goto M, Muraoka Y: A real-time beat tracking system for audio signals. Proceedings of the International Computer Music Conference (ICMC '95), September 1995, Banff, Canada 171-174.
- Goto M: An audio-based real-time beat tracking system for music with or without drum-sounds. Journal of New Music Research 2001, 30(2):159-171. 10.1076/jnmr.30.2.159.7114
- Seppänen J: Computational models of musical meter recognition, M.S. thesis. Tampere University of Technology, Tampere, Finland; 2001.
- Dixon S: Automatic extraction of tempo and beat from expressive performances. Journal of New Music Research 2001, 30(1):39-58. 10.1076/jnmr.30.1.39.7119
- Hainsworth S, Macleod M: Beat tracking with particle filtering algorithms. Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '03), October 2003, New Paltz, NY, USA 91-94.
- Alonso M, Richard G, David B: Tempo and beat estimation of musical signals. Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR '04), October 2004, Barcelona, Spain 158-163.
- Sethares WA, Staley TW: Meter and periodicity in musical performance. Journal of New Music Research 2001, 30(2):149-158. 10.1076/jnmr.30.2.149.7111
- Klapuri AP: Musical meter estimation and music transcription. In Proceedings of the Cambridge Music Processing Colloquium, March 2003, Cambridge, UK. Cambridge University Press.
- Klapuri AP: Sound onset detection by applying psychoacoustic knowledge. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '99), March 1999, Phoenix, Ariz, USA 3089-3092.
- Bello JP, Sandler M: Phase-based note onset detection for music signals. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), April 2003, Hong Kong 5: 441-444.
- Duxbury C, Bello JP, Davies M, Sandler M: Complex domain onset detection for musical signals. Proceedings of the 6th International Conference on Digital Audio Effects (DAFx '03), September 2003, London, UK 90-93.
- Brown JC: Determination of meter of musical scores by autocorrelation. Journal of the Acoustical Society of America 1993, 94(4):1953-1957. 10.1121/1.407518
- van Noorden L, Moelants D: Resonance in the perception of musical pulse. Journal of New Music Research 1999, 28(1):43-66. 10.1076/jnmr.28.1.43.3122
- Rabiner L, Juang B-H: Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs, NJ, USA; 1993.
- Tzanetakis G, Cook P: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 2002, 10(5):293-302. 10.1109/TSA.2002.800560
- Schuller B, Wallhoff F, Arsic D, Rigoll G: Musical signal type discrimination based on large open feature sets. Proceedings of IEEE International Conference on Multimedia and Expo (ICME '06), July 2006, Toronto, Canada 1089-1092.
- Schuller B, Eyben F, Rigoll G: Fast and robust meter and tempo recognition for the automatic discrimination of ballroom dance styles. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), April 2007, Honolulu, Hawaii, USA 217-220.
- Zwicker E, Fastl H: Psychoacoustics: Facts and Models. 2nd edition. Springer, New York, NY, USA; 1999.
- Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. 2nd edition. Morgan Kaufmann, San Francisco, Calif, USA; 2005.
- Ballroomdancers.com. Preview audio examples of ballroom dance music, November 2006, https://secure.ballroomdancers.com/Music/style.asp
- Songlist brd data-set, 2008, http://www.mmk.ei.tum.de/~sch/brd.txt
- Gouyon F, Dixon S: Dance music classification: a tempo-based approach. Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR '04), October 2004, Barcelona, Spain.
- Ballroom data-set, 2004, http://mtg.upf.edu/ismir2004/contest/tempoContest/node5.html
- Paulus J, Klapuri AP: Measuring the similarity of rhythmic patterns. Proceedings of the International Conference on Music Information Retrieval (ISMIR '02), October 2002, Paris, France 150-156.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.