
A multimodal tempo and beat-tracking system based on audiovisual information from live guitar performances


The aim of this paper is to improve beat-tracking for live guitar performances. Beat-tracking estimates musical measurements such as tempo and phase, and is critical for synchronized ensemble performances such as musical robot accompaniment. Beat-tracking of a live guitar performance must deal with three challenges: tempo fluctuation, beat pattern complexity, and environmental noise. To cope with these problems, we devise an audiovisual integration method for beat-tracking. The auditory beat features, tactus (phase) and tempo (period), are estimated by Spectro-Temporal Pattern Matching (STPM), which is robust against stationary noise. The visual beat features are estimated by tracking the position of the hand relative to the guitar using optical flow, mean shift, and the Hough transform. Both estimated features are integrated using a particle filter that aggregates the multimodal information through a beat location model and a model of the hand's trajectory. Experimental results confirm that our beat-tracking improves the F-measure by 8.9 points on average over the Murata beat-tracking method, which uses STPM and rule-based beat detection. The results also show that the system is capable of real-time processing with a reduced number of particles while preserving estimation accuracy. We demonstrate an ensemble in which the humanoid HRP-2 plays the theremin with a human guitarist.

1 Introduction

Our goal is to improve beat-tracking for human guitar performances. Beat-tracking detects musical measurements such as beat timing and tempo, which performers also convey through body movements such as head nodding. In this paper, the proposed beat-tracking method estimates the tempo, in beats per minute (bpm), and the tactus, often referred to as the foot-tapping timing or simply the beat [1], of music pieces.

Toward the advancement of beat-tracking, we are motivated by an application to musical ensemble robots, which enable synchronized play with human performers, not only expressively but also interactively. Only a few attempts, however, have been made so far at interactive musical ensemble robots. For example, Weinberg et al. [2] reported a percussionist robot that imitates a co-player's playing so as to follow the co-player's timing. Murata et al. [3] addressed a musical robot ensemble with robot noise suppression using the Spectro-Temporal Pattern Matching (STPM) method. Mizumoto et al. [4] reported a thereminist robot that performs a trio with a human flutist and a human percussionist; this robot adapts to the changing tempo of the human's play, such as accelerando and fermata.

We focus on the beat-tracking of a guitar played by a human. The guitar is one of the most popular instruments for casual musical ensembles consisting of a melody and a backing part. Improved beat-tracking of guitar performances would therefore enable guitarists, from novices to experts, to enjoy applications such as a beat-tracking computer teacher or an ensemble with musical robots.

In this paper, we discuss three problems in beat-tracking of live human guitar performances: (1) tempo fluctuation, (2) complexity of beat patterns, and (3) environmental noise. The first is caused by the irregularity of human playing. The second is illustrated in Figure 1; some patterns consist of upbeats, that is, syncopation, and these patterns are often observed in guitar playing. Moreover, beat-tracking of a single instrument, especially with syncopated beat patterns, is challenging because it provides less onset information than an ensemble of many instruments. For the third, we focus on stationary noise, for example, small perturbations in the room and robot fan noise; it degrades the signal-to-noise ratio of the input signal, so we cannot disregard it.

Figure 1

Typical guitar beat patterns. The symbol × represents guitar-cutting, a percussive sound made by quickly muting the strings. The > denotes an accent, ↑ and ↓ denote the directions of strokes, and (↑) and (↓) denote air strokes.

To solve these problems, this paper presents a particle-filter-based audiovisual beat-tracking method for guitar playing. Figure 2 shows the architecture of our method. The core of our method is a particle-filter-based integration of the audio and visual information based on a strong correlation between motions and beat timings of guitar playing. We modeled their relationship in the probabilistic distribution of our particle-filter method. Our method uses the following audio and visual beat features: the audio beat features are the normalized cross-correlation and increments obtained from the audio signal using Spectro-Temporal Pattern Matching (STPM), a method robust against stationary noise, and the visual beat features are the relative hand positions from the neck of the guitar.

Figure 2

Architecture underlying our beat-tracking technique.

We implement a human-robot ensemble system as an application of our beat-tracking method. The robot plays its instrument according to the guitar's beat and tempo. The task is challenging because the robot's fan and motor noise interfere with the guitar sound. All of our experiments are conducted in this setting with the robot.

Section 2 discusses the problems in guitar beat-tracking, and Sections 3 and 4 present our audiovisual beat-tracking approach. Section 5 presents experimental results that demonstrate the superiority of our beat-tracking over Murata's method under tempo changes, varied beat structures, and real-time constraints, and then concludes this paper.

2 Assumptions and problems

2.1 Definition of the musical ensemble with guitar

Our targeted musical ensemble consists of a melody player and a guitarist, and we assume quadruple rhythm for simplicity of the system. Our beat-tracking method can accommodate other rhythms by adjusting the hand's trajectory model explained in Section 3.2.3.

At the beginning of a musical ensemble, the guitarist gives some counts to synchronize with a co-player, as in real ensembles. These counts are usually given by voice, gestures, or hit sounds from the guitar. We fix the number of counts at four and assume that the tempo of the musical ensemble deviates only moderately from the tempo implied by the counts.

Our method estimates the beat timings without prior knowledge of the co-player's score. This is because (1) many guitar scores do not specify beat patterns but only melody and chord names, and (2) our main goal focuses on improvisational sessions.

Guitar playing is mainly categorized into two styles: stroke and arpeggio. Stroke style consists of hand-waving motions, whereas in arpeggio style a guitarist plucks the strings with the fingers, mostly without moving the arm. Unlike most beat-trackers in the literature, our current system targets the more limited case where the guitar is strummed rather than finger-picked. This limitation allows our system to perform well in a noisy environment, to follow sudden tempo changes more reliably, and to address single-instrument music pieces.

Stroke motion follows two implicit rules: (1) each bar begins with a down stroke, and (2) air strokes, that is, strokes over a soundless tactus, keep the tempo stable. These can be found in the scores in Figure 1, especially pattern 4 for the air strokes. The arrows in the figure denote the stroke direction, a notation common in instruction books for guitarists. The scores show that strokes at the beginning of each bar go downward, and that the cycle of a stroke usually lasts the length of a quarter note (eight beats) or of an eighth note (sixteen beats). We assume music with eight-beat measures and model the hand's trajectory and beat locations accordingly.

No prior knowledge of hand color is assumed in our visual-tracking, because humans have various hand colors and such colors vary with the lighting conditions. The motion of the guitarist's arm, on the other hand, is modeled with prior knowledge: the stroking hand makes the largest movement in the body of a playing guitarist. The conditions and assumptions for the guitar ensemble are summarized below:

Conditions and assumptions for beat-tracking

Conditions:

  1. Stroke (guitar-playing style)

  2. Counts given at the beginning of the performance

  3. Unknown guitar-beat patterns

  4. No prior knowledge of hand color

Assumptions:

  1. Quadruple rhythm

  2. Only moderate variation from the tempo implied by the counts

  3. Hand movement and beat locations according to eight beats

  4. The stroking hand makes the largest movement in the body of a guitarist

2.2 Beat-tracking conditions

Our beat-tracking method estimates the tempo and the bar-position, that is, the location in the bar at which the performer is playing at a given time, from audio and visual beat features. We use a microphone and a camera embedded in the robot's head for the audio and visual input signals, respectively. We summarize the input and output specifications in the following box:



Input:

  • Guitar sounds captured with the robot's microphone

  • Images of the guitarist captured with the robot's camera

Output:

  • Bar-position

  • Tempo

2.3 Challenges for guitar beat-tracking

Guitar beat-tracking must overcome three problems: tempo fluctuation, beat pattern complexity, and environmental noise. The first problem arises because we do not assume a professional guitarist, so the player may play with a fluid tempo; the beat-tracking method should therefore be robust to such tempo changes.

The second problem is caused by (1) beat patterns complicated by upbeats (syncopation) and (2) the sparseness of onsets. We give eight typical beat patterns in Figure 1. Patterns 1 and 2 often appear in popular music. Pattern 3 contains triplet notes. All of the accented notes in these three patterns are down beats. However, the other patterns contain accented upbeats. Moreover, all of the accented notes of patterns 7 and 8 are upbeats. Based on these observations, we have to take into account how to estimate the tempos and bar-positions of the beat patterns with accented upbeats.

The sparseness is defined as the number of onsets per time unit. We illustrate the sparseness of onsets in Figure 3, which shows a 62-dimension mel-scaled spectrogram after Sobel filtering [5]; the Sobel filter enhances onsets, and negative values are set to zero. Darker regions correspond to stronger onsets. In this paper, guitar sounds consist of a simple strum, meaning low onset density, whereas popular music has many onsets, as shown in the figure. The left spectrogram, from popular music, has onsets at equal intervals, including some notes between the onsets. The right one, in contrast, lacks a note at the tactus. Such absences mislead a listener of the piece, as indicated by the blue marks in the figure. Worse, it is difficult to detect the tactus in a musical ensemble with few instruments because few supporting notes complement the syncopation; in a larger ensemble, for example, the drum part may supply the missing notes.

Figure 3

The strength of onsets in each frequency bin of the power spectrogram after Sobel filtering. a Popular music (120 bpm), b guitar backing performance (110 bpm). Red bullets, red triangles, and blue bullets denote the tactuses of the pieces, absent notes at tactuses, and error candidates for the tactus, respectively. In this paper, a frame is equivalent to 0.0116 sec. Detailed parameter values for the time frame are given in Section 3.1.

As for the third problem, the audio signal in beat-tracking of live performances includes two types of noise: stationary and non-stationary. In our robot application, the non-stationary noise is mainly caused by the movement of the robot's joints. This noise, however, does not affect beat-tracking because, in our experience so far, it is small (6.68 dB in signal-to-noise ratio (SNR)). If the robot makes loud noise when moving, we may apply Ince's method [6] to suppress such ego noise. The stationary noise is mainly caused by the fans of the computer in the robot and by environmental sounds, including air-conditioning. Such noise degrades the SNR of the input signal (for example, to 5.68 dB in our experiments with robots), so our method should include a stationary noise suppression method.

We have two challenges for visual hand tracking: false recognition of the moving hand and low temporal resolution compared with the audio signal. A naive application of color-histogram-based hand trackers is vulnerable to false detections because the luminance of skin color varies, causing the tracker to capture other nearly skin-colored objects. While optical-flow-based methods are considered suitable for hand tracking, flow vectors include noise from the movements of other parts of the body. Furthermore, audio and visual signals have different sampling rates; in our setting, the temporal resolution of the visual signal is about one-quarter that of the audio signal, so the two signals must be synchronized before integration.


Audio signal:

  1. Complexity of beat patterns

  2. Sparseness of onsets

  3. Fluidity of human playing tempos

  4. Noise in the input signal

Visual signal:

  1. Distinguishing the hand from other parts of the body

  2. Variations in hand color across individuals and surroundings

  3. Low temporal resolution

2.4 Related research and solution of the problems

2.4.1 Beat-tracking

Beat-tracking has been extensively studied in music processing. Some beat-tracking methods use agents [7, 8] that independently extract the inter-onset intervals of music and estimate tempos. They are robust against beat pattern complexity but vulnerable to tempo changes because their target music consists of complex beat patterns with a stable tempo. Other methods are statistical, such as particle filters applied to MIDI signals [9, 10]. Hainsworth [11] improved the particle-filter-based method to handle raw audio data.

For the adaptation to robots, Murata et al. [3] achieved a beat-tracking method using STPM, which suppresses stationary robot noise. While this STPM-based method is designed to adapt to sudden tempo changes, it is likely to mistake upbeats for down beats, partly because it fails to estimate the correct note lengths and partly because its beat-detecting rule cannot distinguish down beats from upbeats.

In order to robustly track the human's performance, Otsuka et al. [12] use a musical score. They have reported an audio-to-score alignment method based on a particle filter and revealed its effectiveness despite tempo changes.

2.4.2 Visual-tracking

We use two methods for visual-tracking, one based on optical flow and one based on color information. With the optical-flow method, we can detect the displacement of pixels between frames. For example, Pan et al. [13] use the method to extract a cue of exchanged initiatives for their musical ensemble.

With color information, we can compute the prior probabilistic distribution for tracked objects, for example, with a method based on particle filters [14]. There have been many other methods for extracting the positions of instruments. Lim et al. [15] use a Hough transform to extract the angle of a flute. Pan et al. [13] use a mean shift [16, 17] to estimate the position of the mallet's endpoint. These detected features are used as the cue for the robot movement. In Section 3.2.2, we give a detailed explanation of Hough transform and mean shift.

2.4.3 Multimodal integration

Integrating the results of elemental methods is a filtering problem, where the observations are input features extracted by preprocessing methods and the latent states are the integration results. The Kalman filter [18] estimates latent state variables under linear relationships between the observations and the states based on a Gaussian distribution. The Extended Kalman Filter [19] handles non-linear state relationships, but only for differentiable functions. These methods are, however, unsuitable for our beat-tracking because of the highly non-linear model of the guitarist's hand trajectory.

Particle filters, on the other hand, also known as Sequential Monte Carlo methods, estimate the state space of latent variables with highly non-linear relationships, for example, a non-Gaussian distribution. At frame t, z_t and x_t denote the observation and latent state variables, respectively. The probability density function (PDF) of the latent state variables, p(x_t | z_{1:t}), is approximated as follows:

p(x_t \mid z_{1:t}) \approx \sum_{i=1}^{I} w_t^{(i)} \, \delta(x_t - x_t^{(i)}),   (1)

where the sum of the weights w_t^{(i)} is 1, I is the number of particles, and w_t^{(i)} and x_t^{(i)} correspond to the weight and state variables of the i-th particle, respectively. \delta(x_t - x_t^{(i)}) is the Dirac delta function. Particle filters are commonly used for beat-tracking [9-12] and visual-tracking [14], as shown in Sections 2.4.1 and 2.4.2. Moreover, Nickel et al. [20] applied a particle filter as a method of audiovisual integration for the 3D identification of a talker. We will present our solution for these problems in the next section.
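To make the delta-mixture approximation concrete, the following sketch estimates posterior moments from weighted particles; the Gaussian toy posterior, the particle count, and the variable names are illustrative choices, not part of the original system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior: 5,000 particles x^(i) drawn from N(2, 1) with
# uniform weights w^(i) summing to 1.  The weighted empirical distribution
# sum_i w^(i) delta(x - x^(i)) then stands in for p(x_t | z_1:t).
particles = rng.normal(loc=2.0, scale=1.0, size=5000)
weights = np.full(particles.size, 1.0 / particles.size)

# Any posterior expectation reduces to a weighted sum over the particles.
posterior_mean = np.sum(weights * particles)
posterior_var = np.sum(weights * (particles - posterior_mean) ** 2)
```

With enough particles, the weighted sums recover the moments of the underlying distribution, which is exactly how the beat interval and bar-position estimates are formed later in the paper.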

3 Audio and visual beat feature extraction

3.1 Audio beat feature extraction with STPM

We apply the STPM [3] to calculate the audio beat features, that is, the inter-frame correlation R_t(k) and the normalized summation of onsets F_t, where t is the frame index. Spectra are consecutively obtained by applying a short-time Fourier transform (STFT) to an input signal sampled at 44.1 kHz. A Hamming window of 4,096 points with a shift size of 512 points is used as the window function. The 2,049 linear frequency bins are reduced to 64 mel-scaled frequency bins by a mel-scaled filter bank. Then, the Sobel filter [5] is applied to the spectra to enhance their edges and to suppress stationary noise; negative values of the result are set to zero. The resulting vector, d(t,f), is called an onset vector. Its element at the t-th time frame and f-th mel-frequency bin is defined as follows:

d(t,f) = \begin{cases} p_{\mathrm{sobel}}(t,f) & \text{if } p_{\mathrm{sobel}}(t,f) > 0 \\ 0 & \text{otherwise} \end{cases}   (2)

p_{\mathrm{sobel}}(t,f) = -p_{\mathrm{mel}}(t-1,f+1) + p_{\mathrm{mel}}(t+1,f+1) - p_{\mathrm{mel}}(t-1,f-1) + p_{\mathrm{mel}}(t+1,f-1) - 2p_{\mathrm{mel}}(t-1,f) + 2p_{\mathrm{mel}}(t+1,f),   (3)

where p_sobel is the spectrum to which the Sobel filter has been applied. R_t(k), the inter-frame correlation with the frame k frames behind, is calculated by the normalized cross-correlation (NCC) of onset vectors defined in Eq. (4); this is the output of the STPM. In addition, we define F_t in Eq. (5) as the normalized sum of the values of the onset vector at the t-th time frame; F_t indicates the peak time of onsets. R_t(k) relates to the musical tempo (period) and F_t to the tactus (phase).

R_t(k) = \frac{\sum_{j=1}^{N_F} \sum_{i=0}^{N_P-1} d(t-i,j)\, d(t-k-i,j)}{\sqrt{\sum_{j=1}^{N_F} \sum_{i=0}^{N_P-1} d(t-i,j)^2}\, \sqrt{\sum_{j=1}^{N_F} \sum_{i=0}^{N_P-1} d(t-k-i,j)^2}}   (4)

F_t = \frac{\log \sum_{f=1}^{N_F} d(t,f)}{\mathit{peak}},   (5)

where peak is a normalization variable updated at the local peaks of the onsets. N_F denotes the number of dimensions of the onset vectors used in the NCC, and N_P denotes the frame size of the pattern matching. We set these parameters to 62 dimensions and 87 frames (equivalent to 1 sec) following Murata et al. [3].
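The onset-vector computation and the inter-frame correlation R_t(k) of Eq. (4) can be sketched as follows; this is a direct, unoptimized transcription in which the function names are ours and the mel-scaled front end is assumed to be computed elsewhere.

```python
import numpy as np

def onset_vectors(p_mel):
    """Onset vectors d(t, f): a Sobel filter applied along the time axis
    of a mel spectrogram p_mel (frames x mel bins), with negative
    responses clipped to zero."""
    T, F = p_mel.shape
    d = np.zeros((T, F))
    for t in range(1, T - 1):
        for f in range(1, F - 1):
            s = (-p_mel[t - 1, f + 1] + p_mel[t + 1, f + 1]
                 - p_mel[t - 1, f - 1] + p_mel[t + 1, f - 1]
                 - 2 * p_mel[t - 1, f] + 2 * p_mel[t + 1, f])
            d[t, f] = max(s, 0.0)
    return d

def inter_frame_correlation(d, t, k, n_p=87, n_f=62):
    """R_t(k) of Eq. (4): normalized cross-correlation between the
    n_p-frame block of onset vectors ending at frame t and the block
    k frames earlier."""
    a = d[t - n_p + 1 : t + 1, :n_f]
    b = d[t - k - n_p + 1 : t - k + 1, :n_f]
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```

On a toy "spectrogram" containing an impulse every 20 frames, `inter_frame_correlation` is near 1 at lag k = 20 and near 0 at mismatched lags, which is how the beat period appears as a maximum of R_t(k) over k.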

3.2 Visual beat feature extraction with hand tracking

We extract the visual beat features, that is, the temporal sequence of hand positions, with three methods: (1) hand candidate area estimation by optical flow, (2) hand position estimation by mean shift, and (3) hand position tracking.

3.2.1 Hand candidate area estimation by optical flow

We use the Lucas-Kanade (LK) method [21] for fast optical-flow calculation. Figure 4 shows an example of the result. We define the center of the hand candidate area as the coordinate of the flow vector whose length and angle are nearest to the median values over all flow vectors. This is because the hand motion should produce the largest flow vector, following the assumption in Section 2.1 that the stroking hand makes the largest movement, and taking median values allows us to remove noise vectors.
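The median-based selection described above can be sketched as follows, assuming the LK flow vectors have already been computed (for example with an off-the-shelf implementation); the function name and the simple length-plus-angle distance are our illustrative choices.

```python
import numpy as np

def hand_candidate_center(points, flows):
    """Center of the hand candidate area: the tracked point whose flow
    vector has the length and angle nearest to the median length and
    angle over all flow vectors, discarding outlier motion from other
    body parts.

    points: (N, 2) pixel coordinates; flows: (N, 2) LK flow vectors."""
    lengths = np.hypot(flows[:, 0], flows[:, 1])
    angles = np.arctan2(flows[:, 1], flows[:, 0])
    med_len = np.median(lengths)
    med_ang = np.median(angles)
    # Wrap the angular difference into (-pi, pi] before scoring.
    ang_diff = np.angle(np.exp(1j * (angles - med_ang)))
    score = np.abs(lengths - med_len) + np.abs(ang_diff)
    return points[np.argmin(score)]
```

A vector with an extreme length (for example, a spurious flow on the guitarist's head) scores far from the median and is never selected.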

Figure 4

Optical flow. a The previous frame, b the current frame, and c the flow vectors. The horizontal axis and the vertical axis correspond to the time frame and hand position, respectively.

3.2.2 Hand position estimation by mean shift

We estimate a precise hand position using mean shift [16, 17], a local maximum detection method with two advantages: low computational cost and robustness against outliers. As the kernel function we use a hue histogram in a color space that is robust against shadows and specular reflections [22], defined by:

\begin{pmatrix} I_x \\ I_y \\ I_z \end{pmatrix} = \begin{pmatrix} 1 & -1/2 & -1/2 \\ 0 & \sqrt{3}/2 & -\sqrt{3}/2 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} \begin{pmatrix} r \\ g \\ b \end{pmatrix}   (6)

\mathit{hue} = \tan^{-1}(I_y / I_x).   (7)
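A minimal sketch of the hue computation, assuming the standard opponent-color transform with I_x and I_y as the two chromatic components (the function name is ours):

```python
import numpy as np

def hue(r, g, b):
    """Hue from an opponent-color transform of (r, g, b), followed by
    the arctangent of the two chromatic components.  The intensity
    component I_z = (r + g + b) / 3 drops out of the hue entirely,
    which is what makes it robust to shadows and highlights."""
    i_x = r - 0.5 * g - 0.5 * b
    i_y = (np.sqrt(3.0) / 2.0) * (g - b)
    return np.arctan2(i_y, i_x)
```

Pure red maps to hue 0, pure green to 2π/3, and pure blue to -2π/3; scaling all three channels by a shadow factor leaves the hue unchanged.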

3.2.3 Hand position tracking

Let (h_{x,t}, h_{y,t}) be the hand coordinates calculated by the mean shift. Since a guitarist usually moves the hand near the neck of the guitar, we define r_t, the hand position at the t-th time frame, as the relative distance between the hand and the neck:

r_t = \rho_t - (h_{x,t} \cos\theta_t + h_{y,t} \sin\theta_t),   (8)

where \rho_t and \theta_t are the parameters of the line of the neck computed with the Hough transform [23] (see Figure 5a for an example). In the Hough transform, we compute 100 candidate lines, remove outliers with RANSAC [24], and take the average of the Hough parameters. Positive values of r_t indicate that the hand is above the guitar; negative values, below. Figure 5b shows an example of the sequential hand positions.

Figure 5

Hand position from guitar. a Definition image. b Example of sequential data.

Now, let \omega_t and \theta_t be the beat interval and bar-position at the t-th time frame, where a bar is modeled as a circle, 0 \le \theta_t < 2\pi, and \omega_t is inversely proportional to the angle rate, that is, the tempo. With assumption 3 in Section 2.1, we presume that down strokes occur at \theta_t = n\pi/2 and up strokes at \theta_t = n\pi/2 + \pi/4 (n = 0, 1, 2, 3); in other words, the zero crossover points of the hand position lie at these \theta. In addition, since hand stroking is a smooth motion that keeps the tempo stable, we assume that the sequential hand position can be represented by a continuous function. Thus, the hand position r_t is modeled by

r_t = -a \sin(4\theta_t),   (9)

where a is a constant hand amplitude, set to 20 in this paper.
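Both the measured hand position and its model in Eq. (9) are simple enough to state directly; a minimal sketch (function names are ours):

```python
import numpy as np

def hand_position(h_x, h_y, rho, theta_line):
    """Relative hand position r_t: the signed distance of the hand
    (h_x, h_y) from the guitar-neck line given in Hough normal form
    (rho, theta_line).  Positive means above the guitar, negative below."""
    return rho - (h_x * np.cos(theta_line) + h_y * np.sin(theta_line))

def trajectory_model(theta, a=20.0):
    """Modelled stroke trajectory r_t = -a sin(4 theta) of Eq. (9):
    four full stroke cycles per bar, with zero crossings at the down-
    and up-stroke instants theta = n * pi / 4."""
    return -a * np.sin(4.0 * theta)
```

For a horizontal neck line (theta_line = π/2 at distance rho = 5) and a hand at height 3, the relative position is 2, i.e., the hand is above the neck; the model's extremes fall halfway between the stroke instants.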

4 Particle-filter-based audiovisual integration

4.1 Overview of the particle-filter model

The graphical representation of the particle-filter model is outlined in Figure 6. The state variables, \omega_t and \theta_t, denote the beat interval and bar-position, respectively. The observation variables, R_t(k), F_t, and r_t, denote the inter-frame correlation with k frames back, the normalized onset summation, and the hand position, respectively. \omega_t^{(i)} and \theta_t^{(i)} are the parameters of the i-th particle. We now explain the estimation process with the particle filter.

Figure 6

Graphical model. ○ denotes a state variable and □ denotes an observation variable.

4.2 State transition with sampling

The state variables at the t-th time frame, \omega_t^{(i)} and \theta_t^{(i)}, are sampled from Eqs. (10) and (11), given the state variables at the (t-1)-th time frame and the observations. We use the following proposal distributions:

\omega_t^{(i)} \sim q(\omega_t \mid \omega_{t-1}^{(i)}, R_t(\omega_t), \omega_{\mathrm{init}}) \propto R_t(\omega_t) \times \mathrm{Gauss}(\omega_t \mid \omega_{t-1}^{(i)}, \sigma_\omega^{q}) \times \mathrm{Gauss}(\omega_t \mid \omega_{\mathrm{init}}, \sigma_\omega^{\mathrm{init}})   (10)

\theta_t^{(i)} \sim q(\theta_t \mid r_t, F_t, \omega_{t-1}^{(i)}, \theta_{t-1}^{(i)}) = \mathrm{Mises}(\theta_t \mid \hat{\Theta}_t^{(i)}, \beta_\theta^{q}, 1) \times \mathrm{penalty}(\theta_t^{(i)} \mid r_t, F_t),   (11)

where Gauss(x|μ, σ) represents the PDF of a Gaussian distribution with mean μ and standard deviation σ. The \sigma_\omega^{*} terms denote the standard deviations for the sampling of the beat interval. \omega_{\mathrm{init}} denotes the beat interval estimated and fixed from the counts. Mises(θ|μ, β, τ) represents the PDF of a von Mises distribution [25], also known as the circular normal distribution, modified to have τ peaks. This PDF is defined by

\mathrm{Mises}(\theta \mid \mu, \beta, \tau) = \frac{\exp(\beta \cos(\tau(\theta - \mu)))}{2\pi I_0(\beta)},   (12)

where I_0(\beta) is the modified Bessel function of the first kind of order 0. \mu denotes the location of the peaks, and \beta denotes the concentration; 1/\beta is analogous to \sigma^2 of a normal distribution. Note that the distribution approaches a normal distribution as \beta increases. Let \hat{\Theta}_t^{(i)} be the prediction of \theta_t^{(i)}, defined by:

\hat{\Theta}_t^{(i)} = \theta_{t-1}^{(i)} + b / \omega_{t-1}^{(i)},   (13)

where b denotes a constant for transforming from beat interval into an angle rate of the bar-position.
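A minimal sketch of the multipeaked von Mises PDF and the bar-position prediction used in the sampling step (np.i0 provides the Bessel normalization; the wrap to [0, 2π) is our addition, and the function names are ours):

```python
import numpy as np

def mises(theta, mu, beta, tau):
    """PDF of the tau-peaked von Mises distribution.  np.i0 is the
    modified Bessel function of the first kind, order 0; for integer
    tau the tau-peaked variant keeps the same normalization."""
    return np.exp(beta * np.cos(tau * (theta - mu))) / (2.0 * np.pi * np.i0(beta))

def predict_bar_position(theta_prev, omega_prev, b):
    """One-step prediction of the bar-position: advance by the angle
    rate b / omega, where b converts a beat interval into an angle
    increment, then wrap onto the circular bar."""
    return (theta_prev + b / omega_prev) % (2.0 * np.pi)
```

With tau = 4 the density repeats every π/2, so the four down-stroke positions on the bar circle are equally probable a priori, exactly what the stroke model calls for.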

We will now discuss Eqs. (10) and (11). In Eq. (10), the first term R_t(\omega_t) is multiplied by two window functions with different means: the first mean is taken from the previous frame and the second from the counts. In Eq. (11), penalty(θ|r, F) is the product of five multipeaked window functions, each attached to a condition: if the condition is satisfied, the function is a von Mises distribution; otherwise it equals 1 for any θ. This penalty function pulls the peak of the θ distribution toward its own peak, bringing the distribution in line with our assumptions and models. Figure 7 shows the change in the θ distribution when it is multiplied by the penalty function.

Figure 7

Example of changes in the θ distribution when multiplied by the penalty function. From the top, we show the distribution before multiplication, an example of the penalty function, and the distribution after multiplication. This penalty function is expressed by the von Mises distribution with a cycle of π/2.

In the following, we present the conditions for each window function and the definition of the distribution.

r_{t-1} > 0 \wedge r_t < 0: \quad \mathrm{Mises}(0, 2.0, 4)   (14)

r_{t-1} < 0 \wedge r_t > 0: \quad \mathrm{Mises}(\pi/4, 1.9, 4)   (15)

r_{t-1} > r_t: \quad \mathrm{Mises}(0, 3.0, 4)   (16)

r_{t-1} < r_t: \quad \mathrm{Mises}(\pi/4, 1.5, 4)   (17)

F_t > \mathit{thresh}: \quad \mathrm{Mises}(0, 20.0, 8).   (18)

All β parameters were set experimentally through trial and error. thresh. is a threshold that determines whether F_t is stationary noise or not. Eqs. (14) and (15) follow from the assumption about the zero crossover points of stroking; Eqs. (16) and (17) follow from the stroking directions. These four equations are based on the model of the hand's trajectory in Eq. (9). Equation (18) is based on eight beats; that is, notes should lie on the peaks of the modified von Mises function with eight peaks.
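The penalty of Eqs. (14)-(18) can be sketched as a product of conditional windows. Here each window is written in unnormalized form so that its peaks equal 1 and an unsatisfied condition contributes a constant factor of 1; the threshold value is a placeholder, not the paper's tuned one.

```python
import numpy as np

def mises_window(theta, mu, beta, tau):
    # Multipeaked von Mises window, scaled so that its peaks equal 1.
    return np.exp(beta * (np.cos(tau * (theta - mu)) - 1.0))

def penalty(theta, r_prev, r_curr, F_curr, thresh=0.5):
    """Product of the five conditional windows of Eqs. (14)-(18);
    the mu and beta values follow the paper, `thresh` is a placeholder."""
    theta = np.asarray(theta, dtype=float)
    p = np.ones_like(theta)
    if r_prev > 0 and r_curr < 0:          # downward zero crossing, Eq. (14)
        p = p * mises_window(theta, 0.0, 2.0, 4)
    if r_prev < 0 and r_curr > 0:          # upward zero crossing, Eq. (15)
        p = p * mises_window(theta, np.pi / 4, 1.9, 4)
    if r_prev > r_curr:                    # downward stroke direction, Eq. (16)
        p = p * mises_window(theta, 0.0, 3.0, 4)
    if r_prev < r_curr:                    # upward stroke direction, Eq. (17)
        p = p * mises_window(theta, np.pi / 4, 1.5, 4)
    if F_curr > thresh:                    # onset present, Eq. (18)
        p = p * mises_window(theta, 0.0, 20.0, 8)
    return p
```

For a downward zero crossing with an onset, the product peaks at the down-stroke positions θ = nπ/2 and is strongly suppressed in between.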

4.3 Weight calculation

Let the weight of the i th particle at t th time frame be w t ( i ) . The weights are calculated using observations and state variables:

w_t^{(i)} = w_{t-1}^{(i)} \, \frac{p(\omega_t^{(i)}, \theta_t^{(i)} \mid \omega_{t-1}^{(i)}, \theta_{t-1}^{(i)}) \; p(R_t(\omega_t^{(i)}), F_t, r_t \mid \omega_t^{(i)}, \theta_t^{(i)})}{q(\omega_t \mid \omega_{t-1}^{(i)}, R_t(\omega_t^{(i)}), \omega_{\mathrm{init}}) \; q(\theta_t \mid r_t, F_t, \omega_{t-1}^{(i)}, \theta_{t-1}^{(i)})}.   (19)

The terms of the numerator in Eq. (19) are called the state transition model and the observation model. The better a particle's values match each model, the higher the probabilities of these functions and hence the larger the particle's weight. The denominator is the proposal distribution: when a particle with low proposal probability is sampled, its weight is increased by the small value of the denominator.

The two equations below give the derivation of the state transition model function.

\omega_t = \omega_{t-1} + n_\omega   (20)

\theta_t = \hat{\Theta}_t + n_\theta,   (21)

where n_\omega denotes the noise of the beat interval, distributed normally, and n_\theta denotes that of the bar-position, distributed according to a von Mises distribution. Therefore, the state transition model is expressed as the product of the PDFs of these distributions:

p(\omega_t^{(i)}, \theta_t^{(i)} \mid \omega_{t-1}^{(i)}, \theta_{t-1}^{(i)}) = \mathrm{Mises}(\theta_t^{(i)} \mid \hat{\Theta}_t^{(i)}, \beta_{n_\theta}, 1) \times \mathrm{Gauss}(\omega_t^{(i)} \mid \omega_{t-1}^{(i)}, \sigma_{n_\omega}).   (22)

We now give the derivation of the observation model. R_t(\omega) and r_t are distributed according to normal distributions whose means are \omega_t^{(i)} and -a \sin(4\hat{\Theta}_t^{(i)}), respectively. F_t is empirically approximated from the observations as:

F_t \approx f(\theta_{\mathrm{beat},t}, \sigma_f) = \mathrm{Gauss}(\theta_t^{(i)} \mid \theta_{\mathrm{beat},t}, \sigma_f) \times \mathit{rate} + \mathit{bias},   (23)

where \theta_{\mathrm{beat},t} is the bar-position of the beat nearest to \hat{\Theta}_t^{(i)} in the eight-beat model. rate is a constant that scales the maximum of the approximated F_t to 1, and is set to 4. bias is uniformly distributed from 0.35 to 0.5. Thus, the observation model is expressed as the product of these three functions (Eq. (27)):

p(R_t(\omega_t) \mid \omega_t^{(i)}) = \mathrm{Gauss}(\omega_t \mid \omega_t^{(i)}, \sigma_\omega)   (24)

p(F_t \mid \omega_t^{(i)}, \theta_t^{(i)}) = \mathrm{Gauss}(F_t \mid f(\theta_{\mathrm{beat},t}, \sigma_f), \sigma_f)   (25)

p(r_t \mid \omega_t^{(i)}, \theta_t^{(i)}) = \mathrm{Gauss}(r_t \mid -a \sin(4\hat{\Theta}_t^{(i)}), \sigma_r)   (26)

p(R_t(\omega_t^{(i)}), F_t, r_t \mid \omega_t^{(i)}, \theta_t^{(i)}) = p(R_t(\omega_t) \mid \omega_t^{(i)}) \; p(F_t \mid \omega_t^{(i)}, \theta_t^{(i)}) \; p(r_t \mid \omega_t^{(i)}, \theta_t^{(i)}).   (27)
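A sketch of the observation model culminating in the product of Eq. (27), under the simplifying assumption that the peak lag of R_t is available as an observed beat interval; all σ, rate, and bias values below are placeholders rather than the paper's tuned parameters.

```python
import numpy as np

def gauss(x, mu, sigma):
    # Gaussian PDF evaluated at x.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def observation_likelihood(omega_obs, F_t, r_t, omega_i, theta_i, theta_hat_i,
                           theta_beat, sigma_omega=5.0, sigma_f=0.3,
                           sigma_F=0.5, sigma_r=5.0, rate=4.0, bias=0.4, a=20.0):
    """Product of the three observation terms: beat-interval match,
    onset-strength match against the eight-beat model, and hand-position
    match against the stroke trajectory (Eq. (27))."""
    p_R = gauss(omega_obs, omega_i, sigma_omega)              # interval term
    f_t = gauss(theta_i, theta_beat, sigma_f) * rate + bias   # modelled F_t
    p_F = gauss(F_t, f_t, sigma_F)                            # onset term
    p_r = gauss(r_t, -a * np.sin(4.0 * theta_hat_i), sigma_r) # hand term
    return p_R * p_F * p_r                                    # Eq. (27)
```

A particle whose beat interval and predicted hand position agree with the observations receives a strictly larger likelihood than one that disagrees in either term.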

We finally estimate the state variables at the t-th time frame as the weighted average over particles:

\bar{\omega}_t = \sum_{i=1}^{I} w_t^{(i)} \omega_t^{(i)}   (28)

\bar{\theta}_t = \arctan\!\left( \sum_{i=1}^{I} w_t^{(i)} \sin\theta_t^{(i)} \Big/ \sum_{i=1}^{I} w_t^{(i)} \cos\theta_t^{(i)} \right).   (29)

Finally, we resample the particles to avoid degeneracy, in which almost all weights become zero except for a few. Resampling is performed when the weights satisfy the following condition:

1 \Big/ \sum_{i=1}^{I} \bigl(w_t^{(i)}\bigr)^2 < N_{\mathrm{th}},   (30)

where N th is a threshold for resampling and is set to 1.
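The degeneracy check and resampling step can be sketched with systematic resampling; the paper does not specify the resampling scheme, so that choice, like the function name, is ours.

```python
import numpy as np

def resample_if_degenerate(particles, weights, n_th, rng=None):
    """Systematic resampling, triggered when the effective sample size
    1 / sum_i (w^(i))^2 falls below the threshold n_th.  Returns
    (particles, weights), with uniform weights after resampling."""
    rng = rng if rng is not None else np.random.default_rng()
    ess = 1.0 / np.sum(weights ** 2)
    if ess >= n_th:
        return particles, weights
    n = len(weights)
    # One uniform offset, then evenly spaced positions through the CDF.
    positions = (rng.random() + np.arange(n)) / n
    indices = np.searchsorted(np.cumsum(weights), positions)
    return particles[indices], np.full(n, 1.0 / n)
```

Uniform weights give an effective sample size equal to the particle count, so no resampling occurs; a fully degenerate weight vector triggers resampling and duplicates the single surviving particle.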

5 Experiments and results

In this section, we evaluate our beat-tracking system on the following four points:

  1. Effect of audiovisual integration based on the particle filter,

  2. Effect of the number of particles in the particle filter,

  3. Differences between subjects, and

  4. Application to a human-robot ensemble.
Section 5.1 describes the experimental materials and the parameters used in our method. In Section 5.2, we compare the estimation accuracy of our method with that of Murata's method [3] to evaluate the statistical approach; since both methods share the STPM, the main difference lies between the heuristic rule-based approach and the statistical one. In addition, we evaluate the effect of adding the visual beat features by comparing against a particle filter that uses only audio beat features. In Section 5.3, we discuss the number of particles versus computational cost and estimation accuracy. In Section 5.4, we present the differences among subjects. In Section 5.5, we give an example of a musical robot ensemble with a human guitarist.

5.1 Experimental setup

We asked four guitarists to perform each of the eight beat patterns given in Figure 1 at three tempos (70, 90, and 110 bpm), for a total of 96 samples. The beat patterns are enumerated in order of complexity; a smaller index indicates a pattern with more accented down beats, which are easily tracked, while a larger index indicates more accented upbeats, which confuse the beat-tracker. A performance consists of four counts, seven repetitions of the beat pattern, one whole note, and one short note, as shown in Figure 8. The average length of each sample was 30.8 sec for 70 bpm, 24.5 sec for 90 bpm, and 20.7 sec for 110 bpm. The camera recorded frames at about 19 fps. The distance between the robot and the guitarist was about 3 m so that the entire guitar fit inside the camera frame. We use a one-channel microphone and the sampling parameters shown in Section 3.1. Our method uses 200 particles unless otherwise stated. It was implemented in C++ on a Linux system with an Intel Core2 processor. Table 1 shows the parameters of this experiment. The unit of the parameters relevant to θ is degrees, ranging from 0 to 360. All were determined experimentally through trial and error.

Figure 8
figure 8

The score used in our experiments. X denotes the counts given by the hit sound from the guitar. The white box denotes a whole note; the black box at the end of the score denotes a short note.

Table 1 Parameter settings: abbreviations are SD for standard deviation, and dist. for distribution

In order to evaluate the accuracy of beat-tracking methods, we use the following thresholds to define successful beat detections and tempo estimations against the ground truth: 150 msec for detected beats and 10 bpm for estimated tempos, respectively.

Two evaluation metrics are used: F-measure and AMLc. F-measure is the harmonic mean of the precision (r_prec) and recall (r_recall) of each pattern. They are calculated by

F-measure = 2 / (1/r_prec + 1/r_recall),
r_prec = N_e / N_d,
r_recall = N_e / N_c,

where N_e, N_d, and N_c denote the number of correct estimates, the number of all estimates, and the number of correct (ground-truth) beats, respectively. AMLc is the ratio of the longest continuously correctly tracked section to the length of the music, with beats allowed at other metrical levels; for example, a single inaccuracy in the middle of a piece limits AMLc to 50%. This metric captures the continuity of correct beat detections, which is a critical factor in the evaluation of musical ensembles.
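The two metrics can be sketched in a few lines of Python; the function names and data layout here are ours for illustration, not taken from the paper's C++ implementation.

```python
def f_measure(n_correct_est, n_detected, n_ground_truth):
    """Harmonic mean of precision and recall, as defined above.

    n_correct_est  -- N_e: detected beats matching ground truth
                      (within 150 msec and 10 bpm)
    n_detected     -- N_d: all detected beats
    n_ground_truth -- N_c: all ground-truth beats
    """
    r_prec = n_correct_est / n_detected
    r_recall = n_correct_est / n_ground_truth
    if r_prec + r_recall == 0:
        return 0.0
    return 2 * r_prec * r_recall / (r_prec + r_recall)


def amlc(correct_flags, n_ground_truth):
    """Ratio of the longest run of correctly tracked beats to the piece
    length.

    correct_flags -- per-beat booleans, True where the beat was tracked
                     correctly at an allowed metrical level
    """
    longest = run = 0
    for ok in correct_flags:
        run = run + 1 if ok else 0
        longest = max(longest, run)
    return longest / n_ground_truth
```

For instance, a single error in the middle of a ten-beat piece leaves a longest correct run of five beats, so AMLc drops to 0.5 even though the F-measure stays high; this is the continuity property discussed above.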

Beat detection errors are divided into three classes: substitution, insertion, and deletion errors. A substitution error means that a beat is estimated but with a wrong tempo or bar-position. Insertion and deletion errors are false-positive and false-negative estimations, respectively. We assume that a player does not know the other's score and therefore estimates the score position by counting beats from the beginning of the performance. Beat insertions and deletions undermine a musical ensemble because the cumulative number of beats must be correct or the performers will lose synchronization. Algorithm 1 shows how inserted and deleted beats are detected. Suppose that the beat-tracker correctly detects two beats with some false estimation between them. If exactly one incorrect beat lies between them, we regard it as a substitution error; if no beat or two beats lie there, they are counted as a deletion or an insertion, respectively.

5.2 Comparison of audiovisual particle filter, audio only particle filter, and Murata's method

Table 2 and Figure 9 summarize the precision, recall, and F-measure of each pattern for our audiovisual integrated beat-tracking (Integrated), an audio-only particle filter (Audio only), and Murata's method (Murata). Murata shows no variance in its results, and hence no error bars in the figures, because its estimation is deterministic, whereas the first two show variance due to the stochastic nature of particle filters. Our method, Integrated, stably produces moderate results and outperforms Murata for patterns 4-8. These patterns are rather complex, with syncopations and downbeat absences; this demonstrates that Integrated is more robust against beat patterns than Murata. The comparison between Integrated and Audio only confirms that the visual beat features improve beat-tracking performance: Integrated improves precision, recall, and F-measure by 24.9, 26.7, and 25.8 points on average over Audio only, respectively.

Table 2 Results of the accuracy of beat-tracking estimations
Figure 9
figure 9

Results: F-measure of each method. Exact values are shown in Table 2.

The F-measure scores for patterns 5, 6, and 8 decrease for Integrated. This degradation is caused by the following mismatch: these patterns contain sixteenth notes that make the hand move at double speed, while our method assumes that the hand moves downward only at quarter-note positions, as Eq. (9) indicates. To cope with this problem, we should allow downward arm motions at eighth notes, that is, sixteen-beat patterns. However, a naive extension of the method would degrade performance on the other patterns.

The average F-measure for Integrated is about 61%. The score deteriorates for two reasons: (1) the hand-trajectory model does not match the sixteen-beat patterns, and (2) the low resolution and errors in visual beat feature extraction prevent the penalty function from effectively modifying the θ distribution.

Table 3 and Figure 10 present the AMLc comparison among the three methods. As with the F-measure results, Integrated is superior to Murata for patterns 4-8. The AMLc scores of patterns 1 and 3 are not very high despite their high F-measure scores. Here, we define the result rate as the ratio of the AMLc score to the F-measure score. In patterns 1 and 3, the result rates are not very high, at 72.7 and 70.8. As with the F-measure results, the result rates of patterns 4 and 5 show lower scores, 48.9 and 55.8. On the other hand, the result rates of patterns 2 and 7 remain high, at 85.0 and 74.7. The hand trajectory of patterns 2 and 7 is approximately the same as our model, a sine curve. In pattern 3, however, the triplet notes delay the upward movement of the trajectory. In pattern 1, the absence of upbeats, that is, of constraints on the upward movement, allows the hand to move upward more loosely than in the other patterns. In conclusion, the result rate correlates with how closely the hand trajectory of each pattern matches our model. The model should be refined in future work to raise these scores.

Table 3 Results of AMLc
Figure 10
figure 10

Results: AMLc of each method. Exact values are shown in Table 3.

In Figure 11, Integrated produces fewer total insertion and deletion errors than Murata. A detailed analysis shows that Integrated has fewer deletion errors than Murata in some patterns. On the other hand, Integrated has more insertion errors than Murata, especially for sixteen-beat patterns; adapting the model to sixteen-beat patterns should reduce these insertions.

Figure 11
figure 11

Results: Number of inserted and deleted beats.

5.3 The influence of the number of particles

As a criterion of computational cost, we use the real-time factor to evaluate our system as a real-time system. The real-time factor is defined as computation time divided by data length; for example, when the system takes 0.5 s to process 2 s of data, the real-time factor is 0.5/2 = 0.25. The real-time factor must be less than 1 for the system to run in real time. Table 4 shows the real-time factors for various numbers of particles. The real-time factor increases in proportion to the number of particles and stays under 1 with 300 particles or fewer. We therefore conclude that our method works well as a real-time system with 300 particles or fewer.
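The real-time criterion above is a one-line computation; this small sketch (names are ours) just makes the definition concrete.

```python
def real_time_factor(computation_time_s, data_length_s):
    """Real-time factor: processing time divided by the length of the
    processed data. Values below 1 mean the system keeps up in real time."""
    return computation_time_s / data_length_s

# Example from the text: 0.5 s of computation for 2 s of audio.
rtf = real_time_factor(0.5, 2.0)   # 0.5/2 = 0.25, i.e. real-time capable
```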

Table 4 Influence of the number of particles on the estimation accuracy and computational speed

Table 4 also shows that the F-measure differs by only about 1.3% between 400 particles, which gives the best result, and 200 particles, with which the system runs in real time. This suggests that our system is capable of real-time processing with almost saturated performance.

5.4 Results with various subjects

Figure 12 indicates only small differences among the subjects except for Subject 3. In the case of Subject 3, the similarity of the subject's skin color to the guitar's color caused frequent loss of the hand trajectory. To improve estimation accuracy, we should tune the algorithm or its parameters to be more robust against such confusion.

Figure 12
figure 12

Comparison among the subjects.

5.5 Evaluation using a robot

Our system was implemented on the humanoid robot HRP-2, which plays an electronic instrument called the theremin, as in Figure 13. The video is available on YouTube [26]. HRP-2 plays the theremin with a feed-forward motion control developed by Mizumoto et al. [27]. HRP-2 captures with its microphones a mixture of its own theremin performance and the human partner's guitar performance. It first suppresses its own theremin sounds using semi-blind ICA [28] to obtain the audio signal played by the human guitarist. Then, our beat-tracker estimates the tempo of the human performance and predicts the tactus, according to which HRP-2 plays the theremin. This prediction is coordinated to absorb the delay of the robot's actual arm movement.

Figure 13
figure 13

An example image of musical robot ensemble with a human guitarist.

6 Conclusions and future works

We presented an audiovisual integration method for beat-tracking of live guitar performances using a particle filter. Beat-tracking of guitar performances faces three problems: tempo fluctuation, beat pattern complexity, and environmental noise. The auditory beat features are the autocorrelation of the onsets and the onset summation, extracted with a noise-robust beat estimation method called STPM. The visual beat feature is the distance of the hand position from the guitar neck, extracted using optical flow, mean shift, and Hough line detection. We modeled the stroke and the beat location based on an eight-beat assumption to address the single-instrument situation. Experimental results show the robustness of our method against these problems: the F-measure of beat-tracking estimation improves by 8.9 points on average over an existing beat-tracking method. Furthermore, we confirmed that our method is capable of real-time processing by suppressing the number of particles while preserving beat-tracking accuracy. In addition, we demonstrated a musical robot ensemble with a human guitarist.

Two main problems remain for improving the quality of synchronized musical ensembles: beat-tracking with higher accuracy and robustness against estimation errors. For the first problem, we have to remove the assumption of quadruple rhythm and eight beats. The hand-tracking method should also be refined. One promising direction is the use of infrared sensors, which have recently attracted much research interest; in fact, our preliminary experiments suggest that using an infrared sensor instead of an RGB camera would enable more robust hand tracking. We can therefore also expect an improvement in beat-tracking itself from this sensor.

We suggest two extensions as future work to increase robustness against estimation errors: audio-to-score alignment with reduced score information, and beat-tracking with prior knowledge of rhythm patterns. While standard audio-to-score alignment methods [12] require a full set of musical notes to be played, for example, an eighth note of F in the 4th octave and a quarter note of C in the 4th octave, guitarists use scores containing only the melody and chord names, with some ambiguity in octave and note lengths. Compared with beat-tracking alone, this melody information would let us locate the score position at the bar level and follow the music more robustly against insertion or deletion errors. A prior distribution over rhythm patterns can also alleviate the insertion and deletion problem by forming a distribution of possible beat positions in advance; such a distribution should lead to more precise sampling and state transitions in particle-filter methods. Finally, we remark that a subjective evaluation is needed to assess how much our beat-tracking improves the quality of the human-robot musical ensemble.

Algorithm 1 Detection of inserted and deleted beats

deleted ← 0 {deleted denotes the number of deleted beats}

inserted ← 0 {inserted denotes the number of inserted beats}

prev_index ← 0

error_count ← 0 {error_count denotes the number of wrong estimates since the last correct beat}

for all detected_beat do

   if |tempo(detected_beat)-tempo(ground_truth_beat)| < 10

   and | beat_time(detected_beat)-beat_time(ground_truth_beat)| < 150 then

      {detected_beat is correct estimation}

      new_index ← index(ground_truth_beat)

      N ← (new_index - prev_index - 1) - error_count

      deleted ← deleted + MAX(0, N)

      inserted ← inserted + MAX(0, -N)

      prev_index ← new_index

      error_count ← 0

   else

      error_count ← error_count + 1

   end if

end for
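Algorithm 1 can be rendered in Python as follows. This is our reading of the pseudocode, not the paper's C++ implementation; the data layout (lists of (tempo, time) pairs) and 1-based ground-truth indexing, which matches the pseudocode's prev_index ← 0 initialization, are our assumptions.

```python
def count_insertions_deletions(detected, ground_truth,
                               tempo_tol=10.0, time_tol=0.150):
    """Count inserted and deleted beats following Algorithm 1.

    detected     -- list of (tempo_bpm, beat_time_s) estimates
    ground_truth -- list of (tempo_bpm, beat_time_s) reference beats

    A detected beat is correct when it matches some ground-truth beat
    within both tolerances. Between two correct detections, surplus
    estimates count as insertions and skipped ground-truth beats as
    deletions; a one-for-one mismatch is a substitution and adds nothing.
    """
    deleted = inserted = 0
    prev_index = 0      # 1-based index of the last correctly matched beat
    error_count = 0     # wrong estimates since the last correct beat
    for tempo, time in detected:
        # 1-based index of the matching ground-truth beat, if any
        match = next((i + 1 for i, (gt_tempo, gt_time) in enumerate(ground_truth)
                      if abs(tempo - gt_tempo) < tempo_tol
                      and abs(time - gt_time) < time_tol), None)
        if match is not None:
            n = (match - prev_index - 1) - error_count
            deleted += max(0, n)    # ground-truth beats with no estimate
            inserted += max(0, -n)  # surplus estimates between correct beats
            prev_index = match
            error_count = 0
        else:
            error_count += 1        # substitution candidate
    return inserted, deleted
```

For example, with ground-truth beats at 0, 1, 2, 3, 4 s at 90 bpm, dropping the beat at 2 s yields one deletion, while adding a spurious beat at 1.5 s yields one insertion.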


References

  1. Klapuri A, Eronen A, Astola J: Analysis of the meter of acoustic musical signals. IEEE Trans Audio Speech Lang Process 2006, 14:342-355.

  2. Weinberg G, Blosser B, Mallikarjuna T, Raman A: The creation of a multi-human, multi-robot interactive jam session. Proc of Int'l Conf on New Interfaces for Musical Expression 2009, 70-73.

  3. Murata K, Nakadai K, Takeda R, Okuno HG, Torii T, Hasegawa Y, Tsujino H: A beat-tracking robot for human-robot interaction and its evaluation. Proc of IEEE/RAS Int'l Conf on Humanoids 2008, 79-84.

  4. Mizumoto T, Lim A, Otsuka T, Nakadai K, Takahashi T, Ogata T, Okuno HG: Integration of flutist gesture recognition and beat-tracking for human-robot ensemble. Proc of IEEE/RSJ-2010 Workshop on Robots and Musical Expression 2010, 159-171.

  5. Rosenfeld A, Kak A: Digital Picture Processing. Volumes 1 & 2. Academic Press, New York; 1982.

  6. Ince G, Nakadai K, Rodemann T, Hasegawa Y, Tsujino H, Imura J: A hybrid framework for ego noise cancellation of a robot. Proc of IEEE Int'l Conf on Robotics and Automation 2011, 3623-3628.

  7. Dixon S, Cambouropoulos E: Beat-tracking with musical knowledge. Proc of European Conf on Artificial Intelligence 2000, 626-630.

  8. Goto M: An audio-based real-time beat-tracking system for music with or without drum-sounds. J New Music Res 2001, 30(2):159-171.

  9. Cemgil AT, Kappen B: Integrating tempo tracking and quantization using particle filtering. Proc of Int'l Computer Music Conf 2002, 419.

  10. Whiteley N, Cemgil AT, Godsill S: Bayesian modelling of temporal structure in musical audio. Proc of Int'l Conf on Music Information Retrieval 2006, 29-34.

  11. Hainsworth S, Macleod M: Beat-tracking with particle filtering algorithms. Proc of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2003, 91-94.

  12. Otsuka T, Nakadai K, Takahashi T, Komatani K, Ogata T, Okuno HG: Design and implementation of two-level synchronization for interactive music robot. Proc of AAAI Conference on Artificial Intelligence 2010, 1238-1244.

  13. Pan Y, Kim MG, Suzuki K: A robot musician interacting with a human partner through initiative exchange. Proc of Conf on New Interfaces for Musical Expression 2010, 166-169.

  14. Petersen K, Solis J, Takanishi A: Development of a real-time instrument tracking system for enabling the musical interaction with the Waseda Flutist Robot. Proc of IEEE/RSJ Int'l Conf on Intelligent Robots and Systems 2008, 313-318.

  15. Lim A, Mizumoto T, Cahier L, Otsuka T, Takahashi T, Komatani K, Ogata T, Okuno HG: Robot musical accompaniment: integrating audio and visual cues for real-time synchronization with a human flutist. Proc of IEEE/RSJ Int'l Conf on Intelligent Robots and Systems 2010, 1964-1969.

  16. Comaniciu D, Meer P: Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24(5):603-619.

  17. Fukunaga K: Introduction to Statistical Pattern Recognition. Academic Press, New York; 1990.

  18. Kalman R: A new approach to linear filtering and prediction problems. J Basic Eng 1960, 82:35-45.

  19. Sorenson HW: Kalman Filtering: Theory and Application. IEEE Press, New York; 1985.

  20. Nickel K, Gehrig T, Stiefelhagen R, McDonough J: A joint particle filter for audio-visual speaker tracking. Proc of Int'l Conf on Multimodal Interfaces 2005, 61-68.

  21. Lucas BD, Kanade T: An iterative image registration technique with an application to stereo vision. Proc of Int'l Joint Conf on Artificial Intelligence 1981, 674-679.

  22. Miyazaki D, Tan RT, Hara K, Ikeuchi K: Polarization-based inverse rendering from a single view. Proc of IEEE Int'l Conf on Computer Vision 2003, 982-987.

  23. Ballard DH: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 1981, 13(2):111-122.

  24. Fischler M, Bolles R: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 1981, 24(6):381-395.

  25. von Mises R: Über die "Ganzzahligkeit" der Atomgewichte und verwandte Fragen. Phys Z 1918, 19:490-500.

  26. Itohara T: HRP-2 follows the guitar. []

  27. Mizumoto T, Otsuka T, Nakadai K, Takahashi T, Komatani K, Ogata T, Okuno HG: Human-robot ensemble between robot thereminist and human percussionist using coupled oscillator model. Proc of IEEE/RSJ Int'l Conf on Intelligent Robots and Systems 2010, 1957-1963.

  28. Takeda R, Nakadai K, Komatani K, Ogata T, Okuno HG: Exploiting known sound source signals to improve ICA-based robot audition in speech separation and recognition. Proc of IEEE/RSJ Int'l Conf on Intelligent Robots and Systems 2007, 1757-1762.



This research was supported in part by a JSPS Grant-in-Aid for Scientific Research (S) and in part by Kyoto University's Global COE.

Author information

Corresponding author

Correspondence to Tatsuhiko Itohara.

Additional information

Competing interests

The authors declare that they have no competing interests.


Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Itohara, T., Otsuka, T., Mizumoto, T. et al. A multimodal tempo and beat-tracking system based on audiovisual information from live guitar performances. J AUDIO SPEECH MUSIC PROC. 2012, 6 (2012).



Keywords

  • Tempo
  • Particle Filter
  • Audio Signal
  • Hand Position
  • Hand Tracking