
A multimodal tempo and beat-tracking system based on audiovisual information from live guitar performances


The aim of this paper is to improve beat-tracking for live guitar performances. Beat-tracking estimates musical measurements such as tempo and phase, and is critical for synchronized ensemble performances such as musical robot accompaniment. Beat-tracking of a live guitar performance must deal with three challenges: tempo fluctuation, beat pattern complexity, and environmental noise. To cope with these problems, we devise an audiovisual integration method for beat-tracking. The auditory beat features, tactus (phase) and tempo (period), are estimated by Spectro-Temporal Pattern Matching (STPM), which is robust against stationary noise. The visual beat features are estimated by tracking the position of the hand relative to the guitar using optical flow, mean shift, and the Hough transform. Both estimated features are integrated using a particle filter that aggregates the multimodal information through a beat location model and a model of the hand's trajectory. Experimental results confirm that our beat-tracking improves the F-measure by 8.9 points on average over the Murata beat-tracking method, which uses STPM and rule-based beat detection. The results also show that the system is capable of real-time processing with a reduced number of particles while preserving estimation accuracy. We demonstrate an ensemble in which the humanoid HRP-2 plays the theremin with a human guitarist.

1 Introduction

Our goal is to improve beat-tracking for human guitar performances. Beat-tracking detects musical measurements such as beat timing and tempo, which performers also convey through body movements such as head nodding. In this paper, the proposed beat-tracking method estimates the tempo, in beats per minute (bpm), and the tactus, often referred to as the foot-tapping timing or simply the beat [1], of music pieces.

Toward the advancement of beat-tracking, we are motivated by an application to musical ensemble robots, which enable synchronized play with human performers, not only expressively but also interactively. Only a few attempts, however, have been made so far at interactive musical ensemble robots. For example, Weinberg et al. [2] reported a percussionist robot that imitates a co-player's playing so as to follow the co-player's timing. Murata et al. [3] addressed a musical robot ensemble with robot noise suppression using the Spectro-Temporal Pattern Matching (STPM) method. Mizumoto et al. [4] reported a thereminist robot that performs a trio with a human flutist and a human percussionist; this robot adapts to the changing tempo of the human's play, such as accelerando and fermata.

We focus on the beat-tracking of a guitar played by a human. The guitar is one of the most popular instruments for casual musical ensembles consisting of a melody and a backing part. Improved beat-tracking of guitar performances would therefore enable guitarists, from novices to experts, to enjoy applications such as a beat-tracking computer teacher or an ensemble with musical robots.

In this paper, we discuss three problems in beat-tracking of live human guitar performances: (1) tempo fluctuation, (2) complexity of beat patterns, and (3) environmental noise. The first is caused by the irregularity of human playing. The second is illustrated in Figure 1; some patterns consist of upbeats, that is, syncopation, and these patterns are often observed in guitar playing. Moreover, beat-tracking of a single instrument, especially with syncopated beat patterns, is challenging because it provides less onset information than an ensemble of many instruments. For the third, we focus on stationary noise, for example, small perturbations in the room and robot fan noise; it degrades the signal-to-noise ratio of the input signal, so we cannot disregard it.

Figure 1

Typical guitar beat patterns. The symbol × represents guitar-cutting, a percussive sound made by quickly muting the strings. The > denotes an accent, ↑ and ↓ denote the directions of strokes, and (↑) and (↓) denote air strokes.

To solve these problems, this paper presents a particle-filter-based audiovisual beat-tracking method for guitar playing. Figure 2 shows the architecture of our method. The core of our method is a particle-filter-based integration of the audio and visual information based on a strong correlation between motions and beat timings of guitar playing. We modeled their relationship in the probabilistic distribution of our particle-filter method. Our method uses the following audio and visual beat features: the audio beat features are the normalized cross-correlation and increments obtained from the audio signal using Spectro-Temporal Pattern Matching (STPM), a method robust against stationary noise, and the visual beat features are the relative hand positions from the neck of the guitar.

Figure 2

Architecture underlying our beat-tracking technique.

We implement a human-robot ensemble system as an application of our beat-tracking method. The robot plays its instrument according to the guitar's beat and tempo. The task is challenging because the robot's fan and motor noise interfere with the guitar sound. All of our experiments are conducted in this setting with the robot.

Section 2 discusses the problems in guitar beat-tracking, and Sections 3 and 4 present our audiovisual beat-tracking approach. Section 5 presents experimental results that demonstrate the superiority of our beat-tracking over Murata's method under tempo changes, varied beat structures, and real-time constraints, and then concludes this paper.

2 Assumptions and problems

2.1 Definition of the musical ensemble with guitar

Our targeted musical ensemble consists of a melody player and a guitarist, and we assume quadruple rhythm for simplicity of the system. Our beat-tracking method can accommodate other rhythms by adjusting the hand's trajectory model explained in Section 3.2.3.

At the beginning of a musical ensemble, the guitarist gives some counts to synchronize with a co-player, as in real ensembles. These counts are usually given by voice, gestures, or hit sounds from the guitar. We fix the number of counts at four and assume that the tempo of the musical ensemble deviates only moderately from the tempo implied by the counts.

Our method estimates the beat timings without prior knowledge of the co-player's score. This is because (1) many guitar scores do not specify beat patterns but only melody and chord names, and (2) our main goal focuses on improvisational sessions.

Guitar playing is mainly categorized into two styles: stroke and arpeggio. Stroke style consists of hand-waving motions, whereas in arpeggio style a guitarist plucks the strings with the fingers, mostly without moving the arm. Unlike most beat-trackers in the literature, our current system targets the more limited case where the guitar is strummed rather than finger-picked. This limitation allows our system to perform well in a noisy environment, to follow sudden tempo changes more reliably, and to address single-instrument music pieces.

Stroke motion follows two implicit rules: (1) each bar begins with a down stroke, and (2) air strokes, that is, strokes over a soundless tactus, keep the tempo stable. These can be found in the scores in Figure 1, especially pattern 4 for the air strokes. The arrows in the figure denote the stroke direction, a notation common in instruction books for guitarists. The scores show that strokes at the beginning of each bar go downward, and that the cycle of a stroke usually lasts the length of a quarter note (eight beats) or of an eighth note (sixteen beats). We assume music with eight-beat measures and model the hand's trajectory and beat locations accordingly.

No prior knowledge of hand color is assumed in our visual-tracking, because humans have various hand colors and such colors vary with the lighting conditions. The motion of the guitarist's arm, on the other hand, is modeled with prior knowledge: the stroking hand makes the largest movement in the body of a playing guitarist. The conditions and assumptions for the guitar ensemble are summarized below:

Conditions and assumptions for beat-tracking

Conditions:

  1. Stroke (guitar-playing style)

  2. Counts given at the beginning of the performance

  3. Unknown guitar-beat patterns

  4. No prior knowledge of hand color

Assumptions:

  1. Quadruple rhythm

  2. Only moderate variation from the tempo implied by the counts

  3. Hand movement and beat locations according to eight beats

  4. The stroking hand makes the largest movement in the body of a guitarist

2.2 Beat-tracking conditions

Our beat-tracking method estimates the tempo and the bar-position, that is, the location in the bar at which the performer is playing at a given time, from audio and visual beat features. We use a microphone and a camera embedded in the robot's head for the audio and visual input signals, respectively. We summarize the input and output specifications in the following box:



Input:

  • Guitar sounds captured with the robot's microphone

  • Images of the guitarist captured with the robot's camera

Output:

  • Bar-position

  • Tempo

2.3 Challenges for guitar beat-tracking

Guitar beat-tracking must overcome three problems: tempo fluctuation, beat pattern complexity, and environmental noise. The first problem arises because we do not assume a professional guitarist, so the player may play with a fluid tempo; the beat-tracking method should therefore be robust to such tempo changes.

The second problem is caused by (1) beat patterns complicated by upbeats (syncopation) and (2) the sparseness of onsets. We give eight typical beat patterns in Figure 1. Patterns 1 and 2 often appear in popular music. Pattern 3 contains triplet notes. All of the accented notes in these three patterns are down beats. However, the other patterns contain accented upbeats. Moreover, all of the accented notes of patterns 7 and 8 are upbeats. Based on these observations, we have to take into account how to estimate the tempos and bar-positions of the beat patterns with accented upbeats.

The sparseness is defined as the number of onsets per time unit. We illustrate the sparseness of onsets in Figure 3, which shows a 62-dimension mel-scaled spectrogram after Sobel filtering [5]; the Sobel filter enhances onsets, and negative values are set to zero. Darker regions correspond to stronger onsets. In this paper, guitar sounds consist of a simple strum, meaning low onset density, whereas popular music has many onsets, as shown in the figure. The left spectrogram, from popular music, has onsets at equal intervals, including some notes between the onsets. The right one, in contrast, lacks a note at the tactus. Such absences mislead a listener of the piece, as indicated by the blue marks in the figure. Worse, it is difficult to detect the tactus in a musical ensemble with few instruments because few supporting notes complement the syncopation; in a larger ensemble, for example, the drum part may supply the missing notes.

Figure 3

The strength of onsets in each frequency bin of the power spectrogram after Sobel filtering. a Popular music (120 bpm), b guitar backing performance (110 bpm). Red bullets, red triangles, and blue bullets denote the tactuses of the pieces, absent notes at tactuses, and error candidates for the tactus, respectively. In this paper, a frame is equivalent to 0.0116 sec. Detailed parameter values for the time frame are given in Section 3.1.

As for the third problem, the audio signal in beat-tracking of live performances includes two types of noise: stationary and non-stationary. In our robot application, the non-stationary noise is mainly caused by the movement of the robot's joints. This noise, however, does not affect beat-tracking because, in our experience so far, it is small (6.68 dB in signal-to-noise ratio (SNR)). If the robot makes loud noise when moving, we may apply Ince's method [6] to suppress such ego noise. The stationary noise is mainly caused by the fans of the computer in the robot and by environmental sounds, including air-conditioning. Such noise degrades the SNR of the input signal (for example, to 5.68 dB in our experiments with robots), so our method should include a stationary noise suppression method.

We have two challenges for visual hand tracking: false recognition of the moving hand and low temporal resolution compared with the audio signal. A naive application of color-histogram-based hand trackers is vulnerable to false detections because the luminance of skin color varies, causing the tracker to capture other nearly skin-colored objects. While optical-flow-based methods are considered suitable for hand tracking, flow vectors include noise from the movements of other parts of the body. Furthermore, audio and visual signals have different sampling rates; in our setting, the temporal resolution of the visual signal is about one-quarter that of the audio signal, so the two signals must be synchronized before integration.


Audio signal:

  1. Complexity of beat patterns

  2. Sparseness of onsets

  3. Fluidity of human playing tempos

  4. Noise in the input signal

Visual signal:

  1. Distinguishing the hand from other parts of the body

  2. Variations in hand color across individuals and surroundings

  3. Low temporal resolution

2.4 Related research and solution of the problems

2.4.1 Beat-tracking

Beat-tracking has been extensively studied in music processing. Some beat-tracking methods use agents [7, 8] that independently extract the inter-onset intervals of music and estimate tempos. They are robust against beat pattern complexity but vulnerable to tempo changes because their target music consists of complex beat patterns with a stable tempo. Other methods are statistical, such as particle filters applied to MIDI signals [9, 10]. Hainsworth [11] improved the particle-filter-based method to handle raw audio data.

For the adaptation to robots, Murata et al. [3] achieved a beat-tracking method using STPM, which suppresses stationary robot noise. While this STPM-based method is designed to adapt to sudden tempo changes, it is likely to mistake upbeats for down beats, partly because it fails to estimate the correct note lengths and partly because its beat-detecting rule cannot distinguish down beats from upbeats.

In order to robustly track the human's performance, Otsuka et al. [12] use a musical score. They have reported an audio-to-score alignment method based on a particle filter and revealed its effectiveness despite tempo changes.

2.4.2 Visual-tracking

We use two methods for visual-tracking, one based on optical flow and one based on color information. With the optical-flow method, we can detect the displacement of pixels between frames. For example, Pan et al. [13] use the method to extract a cue of exchanged initiatives for their musical ensemble.

With color information, we can compute the prior probabilistic distribution for tracked objects, for example, with a method based on particle filters [14]. There have been many other methods for extracting the positions of instruments. Lim et al. [15] use a Hough transform to extract the angle of a flute. Pan et al. [13] use a mean shift [16, 17] to estimate the position of the mallet's endpoint. These detected features are used as the cue for the robot movement. In Section 3.2.2, we give a detailed explanation of Hough transform and mean shift.

2.4.3 Multimodal integration

Integrating the results of elemental methods is a filtering problem, where the observations are input features extracted by preprocessing methods and the latent states are the integration results. The Kalman filter [18] estimates latent state variables under linear relationships between the observations and the states based on a Gaussian distribution. The Extended Kalman Filter [19] handles non-linear state relationships, but only for differentiable functions. These methods are, however, unsuitable for our beat-tracking because of the highly non-linear model of the guitarist's hand trajectory.

Particle filters, on the other hand, also known as Sequential Monte Carlo methods, estimate the state space of latent variables with highly non-linear relationships, for example, a non-Gaussian distribution. At frame t, z_t and x_t denote the observation and latent state variables, respectively. The probability density function (PDF) of the latent state variables, p(x_t | z_{1:t}), is approximated as follows:

p(x_t \mid z_{1:t}) \approx \sum_{i=1}^{I} w_t^{(i)} \, \delta(x_t - x_t^{(i)}),   (1)

where the sum of the weights w_t^{(i)} is 1, I is the number of particles, and w_t^{(i)} and x_t^{(i)} correspond to the weight and state variables of the i-th particle, respectively. \delta(x_t - x_t^{(i)}) is the Dirac delta function. Particle filters are commonly used for beat-tracking [9-12] and visual-tracking [14], as shown in Sections 2.4.1 and 2.4.2. Moreover, Nickel et al. [20] applied a particle filter as a method of audiovisual integration for the 3D identification of a talker. We will present our solution for these problems in the next section.
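To make the delta-mixture approximation concrete, the following sketch estimates posterior moments from weighted particles; the Gaussian toy posterior, the particle count, and the variable names are illustrative choices, not part of the original system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior: 5,000 particles x^(i) drawn from N(2, 1) with
# uniform weights w^(i) summing to 1.  The weighted empirical distribution
# sum_i w^(i) delta(x - x^(i)) then stands in for p(x_t | z_1:t).
particles = rng.normal(loc=2.0, scale=1.0, size=5000)
weights = np.full(particles.size, 1.0 / particles.size)

# Any posterior expectation reduces to a weighted sum over the particles.
posterior_mean = np.sum(weights * particles)
posterior_var = np.sum(weights * (particles - posterior_mean) ** 2)
```

With enough particles, the weighted sums recover the moments of the underlying distribution, which is exactly how the beat interval and bar-position estimates are formed later in the paper.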

3 Audio and visual beat feature extraction

3.1 Audio beat feature extraction with STPM

We apply the STPM [3] to calculate the audio beat features, that is, the inter-frame correlation R_t(k) and the normalized summation of onsets F_t, where t is the frame index. Spectra are consecutively obtained by applying a short-time Fourier transform (STFT) to an input signal sampled at 44.1 kHz. A Hamming window of 4,096 points with a shift size of 512 points is used as the window function. The 2,049 linear frequency bins are reduced to 64 mel-scaled frequency bins by a mel-scaled filter bank. Then, the Sobel filter [5] is applied to the spectra to enhance their edges and to suppress stationary noise; negative values of the result are set to zero. The resulting vector, d(t,f), is called an onset vector. Its element at the t-th time frame and f-th mel-frequency bin is defined as follows:

d(t,f) = \begin{cases} p_{\mathrm{sobel}}(t,f) & \text{if } p_{\mathrm{sobel}}(t,f) > 0 \\ 0 & \text{otherwise} \end{cases}   (2)

p_{\mathrm{sobel}}(t,f) = -p_{\mathrm{mel}}(t-1,f+1) + p_{\mathrm{mel}}(t+1,f+1) - p_{\mathrm{mel}}(t-1,f-1) + p_{\mathrm{mel}}(t+1,f-1) - 2p_{\mathrm{mel}}(t-1,f) + 2p_{\mathrm{mel}}(t+1,f),   (3)

where p_sobel is the spectrum to which the Sobel filter has been applied. R_t(k), the inter-frame correlation with the frame k frames behind, is calculated by the normalized cross-correlation (NCC) of onset vectors defined in Eq. (4); this is the output of the STPM. In addition, we define F_t in Eq. (5) as the normalized sum of the values of the onset vector at the t-th time frame; F_t indicates the peak time of onsets. R_t(k) relates to the musical tempo (period) and F_t to the tactus (phase).

R_t(k) = \frac{\sum_{j=1}^{N_F} \sum_{i=0}^{N_P-1} d(t-i,j)\, d(t-k-i,j)}{\sqrt{\sum_{j=1}^{N_F} \sum_{i=0}^{N_P-1} d(t-i,j)^2}\, \sqrt{\sum_{j=1}^{N_F} \sum_{i=0}^{N_P-1} d(t-k-i,j)^2}}   (4)

F_t = \frac{\log \sum_{f=1}^{N_F} d(t,f)}{\mathit{peak}},   (5)

where peak is a normalization variable updated at the local peaks of the onsets. N_F denotes the number of dimensions of the onset vectors used in the NCC, and N_P denotes the frame size of the pattern matching. We set these parameters to 62 dimensions and 87 frames (equivalent to 1 sec) following Murata et al. [3].
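The onset-vector computation and the inter-frame correlation R_t(k) of Eq. (4) can be sketched as follows; this is a direct, unoptimized transcription in which the function names are ours and the mel-scaled front end is assumed to be computed elsewhere.

```python
import numpy as np

def onset_vectors(p_mel):
    """Onset vectors d(t, f): a Sobel filter applied along the time axis
    of a mel spectrogram p_mel (frames x mel bins), with negative
    responses clipped to zero."""
    T, F = p_mel.shape
    d = np.zeros((T, F))
    for t in range(1, T - 1):
        for f in range(1, F - 1):
            s = (-p_mel[t - 1, f + 1] + p_mel[t + 1, f + 1]
                 - p_mel[t - 1, f - 1] + p_mel[t + 1, f - 1]
                 - 2 * p_mel[t - 1, f] + 2 * p_mel[t + 1, f])
            d[t, f] = max(s, 0.0)
    return d

def inter_frame_correlation(d, t, k, n_p=87, n_f=62):
    """R_t(k) of Eq. (4): normalized cross-correlation between the
    n_p-frame block of onset vectors ending at frame t and the block
    k frames earlier."""
    a = d[t - n_p + 1 : t + 1, :n_f]
    b = d[t - k - n_p + 1 : t - k + 1, :n_f]
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```

On a toy "spectrogram" containing an impulse every 20 frames, `inter_frame_correlation` is near 1 at lag k = 20 and near 0 at mismatched lags, which is how the beat period appears as a maximum of R_t(k) over k.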

3.2 Visual beat feature extraction with hand tracking

We extract the visual beat features, that is, the temporal sequence of hand positions, with three methods: (1) hand candidate area estimation by optical flow, (2) hand position estimation by mean shift, and (3) hand position tracking.

3.2.1 Hand candidate area estimation by optical flow

We use the Lucas-Kanade (LK) method [21] for fast optical-flow calculation. Figure 4 shows an example of the result. We define the center of the hand candidate area as the coordinate of the flow vector whose length and angle are nearest to the median values over all flow vectors. This is because the hand motion should produce the largest flow vector, following the assumption in Section 2.1 that the stroking hand makes the largest movement, and taking median values allows us to remove noise vectors.
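The median-based selection described above can be sketched as follows, assuming the LK flow vectors have already been computed (for example with an off-the-shelf implementation); the function name and the simple length-plus-angle distance are our illustrative choices.

```python
import numpy as np

def hand_candidate_center(points, flows):
    """Center of the hand candidate area: the tracked point whose flow
    vector has the length and angle nearest to the median length and
    angle over all flow vectors, discarding outlier motion from other
    body parts.

    points: (N, 2) pixel coordinates; flows: (N, 2) LK flow vectors."""
    lengths = np.hypot(flows[:, 0], flows[:, 1])
    angles = np.arctan2(flows[:, 1], flows[:, 0])
    med_len = np.median(lengths)
    med_ang = np.median(angles)
    # Wrap the angular difference into (-pi, pi] before scoring.
    ang_diff = np.angle(np.exp(1j * (angles - med_ang)))
    score = np.abs(lengths - med_len) + np.abs(ang_diff)
    return points[np.argmin(score)]
```

A vector with an extreme length (for example, a spurious flow on the guitarist's head) scores far from the median and is never selected.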

Figure 4

Optical flow. a The previous frame, b the current frame, and c the flow vectors. The horizontal axis and the vertical axis correspond to the time frame and hand position, respectively.

3.2.2 Hand position estimation by mean shift

We estimate a precise hand position using mean shift [16, 17], a local maximum detection method with two advantages: low computational cost and robustness against outliers. As the kernel function we use a hue histogram in a color space that is robust against shadows and specular reflections [22], defined by:

\begin{pmatrix} I_x \\ I_y \\ I_z \end{pmatrix} = \begin{pmatrix} 1 & -1/2 & -1/2 \\ 0 & \sqrt{3}/2 & -\sqrt{3}/2 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} \begin{pmatrix} r \\ g \\ b \end{pmatrix}   (6)

\mathit{hue} = \tan^{-1}(I_y / I_x).   (7)
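A minimal sketch of the hue computation, assuming the standard opponent-color transform with I_x and I_y as the two chromatic components (the function name is ours):

```python
import numpy as np

def hue(r, g, b):
    """Hue from an opponent-color transform of (r, g, b), followed by
    the arctangent of the two chromatic components.  The intensity
    component I_z = (r + g + b) / 3 drops out of the hue entirely,
    which is what makes it robust to shadows and highlights."""
    i_x = r - 0.5 * g - 0.5 * b
    i_y = (np.sqrt(3.0) / 2.0) * (g - b)
    return np.arctan2(i_y, i_x)
```

Pure red maps to hue 0, pure green to 2π/3, and pure blue to -2π/3; scaling all three channels by a shadow factor leaves the hue unchanged.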

3.2.3 Hand position tracking

Let (h_{x,t}, h_{y,t}) be the hand coordinates calculated by the mean shift. Since a guitarist usually moves the hand near the neck of the guitar, we define r_t, the hand position at the t-th time frame, as the relative distance between the hand and the neck:

r_t = \rho_t - (h_{x,t} \cos\theta_t + h_{y,t} \sin\theta_t),   (8)

where \rho_t and \theta_t are the parameters of the line of the neck computed with the Hough transform [23] (see Figure 5a for an example). In the Hough transform, we compute 100 candidate lines, remove outliers with RANSAC [24], and take the average of the Hough parameters. Positive values of r_t indicate that the hand is above the guitar; negative values, below. Figure 5b shows an example of the sequential hand positions.

Figure 5

Hand position from guitar. a Definition image. b Example of sequential data.

Now, let \omega_t and \theta_t be the beat interval and bar-position at the t-th time frame, where a bar is modeled as a circle, 0 \le \theta_t < 2\pi, and \omega_t is inversely proportional to the angle rate, that is, the tempo. With assumption 3 in Section 2.1, we presume that down strokes occur at \theta_t = n\pi/2 and up strokes at \theta_t = n\pi/2 + \pi/4 (n = 0, 1, 2, 3); in other words, the zero crossover points of the hand position lie at these \theta. In addition, since hand stroking is a smooth motion that keeps the tempo stable, we assume that the sequential hand position can be represented by a continuous function. Thus, the hand position r_t is modeled by

r_t = -a \sin(4\theta_t),   (9)

where a is a constant hand amplitude, set to 20 in this paper.
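Both the measured hand position and its model in Eq. (9) are simple enough to state directly; a minimal sketch (function names are ours):

```python
import numpy as np

def hand_position(h_x, h_y, rho, theta_line):
    """Relative hand position r_t: the signed distance of the hand
    (h_x, h_y) from the guitar-neck line given in Hough normal form
    (rho, theta_line).  Positive means above the guitar, negative below."""
    return rho - (h_x * np.cos(theta_line) + h_y * np.sin(theta_line))

def trajectory_model(theta, a=20.0):
    """Modelled stroke trajectory r_t = -a sin(4 theta) of Eq. (9):
    four full stroke cycles per bar, with zero crossings at the down-
    and up-stroke instants theta = n * pi / 4."""
    return -a * np.sin(4.0 * theta)
```

For a horizontal neck line (theta_line = π/2 at distance rho = 5) and a hand at height 3, the relative position is 2, i.e., the hand is above the neck; the model's extremes fall halfway between the stroke instants.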

4 Particle-filter-based audiovisual integration

4.1 Overview of the particle-filter model

The graphical representation of the particle-filter model is outlined in Figure 6. The state variables, \omega_t and \theta_t, denote the beat interval and bar-position, respectively. The observation variables, R_t(k), F_t, and r_t, denote the inter-frame correlation with k frames back, the normalized onset summation, and the hand position, respectively. \omega_t^{(i)} and \theta_t^{(i)} are the parameters of the i-th particle. We now explain the estimation process with the particle filter.

Figure 6

Graphical model. ○ denotes a state variable and □ denotes an observation variable.

4.2 State transition with sampling

The state variables at the t-th time frame, \omega_t^{(i)} and \theta_t^{(i)}, are sampled from Eqs. (10) and (11), given the state variables at the (t-1)-th time frame and the observations. We use the following proposal distributions:

\omega_t^{(i)} \sim q(\omega_t \mid \omega_{t-1}^{(i)}, R_t(\omega_t), \omega_{\mathrm{init}}) \propto R_t(\omega_t) \times \mathrm{Gauss}(\omega_t \mid \omega_{t-1}^{(i)}, \sigma_\omega^{q}) \times \mathrm{Gauss}(\omega_t \mid \omega_{\mathrm{init}}, \sigma_\omega^{\mathrm{init}})   (10)

\theta_t^{(i)} \sim q(\theta_t \mid r_t, F_t, \omega_{t-1}^{(i)}, \theta_{t-1}^{(i)}) = \mathrm{Mises}(\theta_t \mid \hat{\Theta}_t^{(i)}, \beta_\theta^{q}, 1) \times \mathrm{penalty}(\theta_t^{(i)} \mid r_t, F_t),   (11)

where Gauss(x|μ, σ) represents the PDF of a Gaussian distribution with mean μ and standard deviation σ. The \sigma_\omega^{*} terms denote the standard deviations for the sampling of the beat interval. \omega_{\mathrm{init}} denotes the beat interval estimated and fixed from the counts. Mises(θ|μ, β, τ) represents the PDF of a von Mises distribution [25], also known as the circular normal distribution, modified to have τ peaks. This PDF is defined by

\mathrm{Mises}(\theta \mid \mu, \beta, \tau) = \frac{\exp(\beta \cos(\tau(\theta - \mu)))}{2\pi I_0(\beta)},   (12)

where I_0(\beta) is the modified Bessel function of the first kind of order 0. \mu denotes the location of the peaks, and \beta denotes the concentration; 1/\beta is analogous to \sigma^2 of a normal distribution. Note that the distribution approaches a normal distribution as \beta increases. Let \hat{\Theta}_t^{(i)} be the prediction of \theta_t^{(i)}, defined by:

\hat{\Theta}_t^{(i)} = \theta_{t-1}^{(i)} + b / \omega_{t-1}^{(i)},   (13)

where b denotes a constant for transforming from beat interval into an angle rate of the bar-position.
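A minimal sketch of the multipeaked von Mises PDF and the bar-position prediction used in the sampling step (np.i0 provides the Bessel normalization; the wrap to [0, 2π) is our addition, and the function names are ours):

```python
import numpy as np

def mises(theta, mu, beta, tau):
    """PDF of the tau-peaked von Mises distribution.  np.i0 is the
    modified Bessel function of the first kind, order 0; for integer
    tau the tau-peaked variant keeps the same normalization."""
    return np.exp(beta * np.cos(tau * (theta - mu))) / (2.0 * np.pi * np.i0(beta))

def predict_bar_position(theta_prev, omega_prev, b):
    """One-step prediction of the bar-position: advance by the angle
    rate b / omega, where b converts a beat interval into an angle
    increment, then wrap onto the circular bar."""
    return (theta_prev + b / omega_prev) % (2.0 * np.pi)
```

With tau = 4 the density repeats every π/2, so the four down-stroke positions on the bar circle are equally probable a priori, exactly what the stroke model calls for.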

We will now discuss Eqs. (10) and (11). In Eq. (10), the first term R_t(\omega_t) is multiplied by two window functions with different means: the first mean is taken from the previous frame and the second from the counts. In Eq. (11), penalty(θ|r, F) is the product of five multipeaked window functions, each attached to a condition: if the condition is satisfied, the function is a von Mises distribution; otherwise it equals 1 for any θ. This penalty function pulls the peak of the θ distribution toward its own peak, bringing the distribution in line with our assumptions and models. Figure 7 shows the change in the θ distribution when it is multiplied by the penalty function.

Figure 7

Example of changes in the θ distribution when multiplied by the penalty function. From the top, we show the distribution before multiplication, an example of the penalty function, and the distribution after multiplication. This penalty function is expressed by the von Mises distribution with a cycle of π/2.

In the following, we present the conditions for each window function and the definition of the distribution.

r_{t-1} > 0 \wedge r_t < 0: \quad \mathrm{Mises}(0, 2.0, 4)   (14)

r_{t-1} < 0 \wedge r_t > 0: \quad \mathrm{Mises}(\pi/4, 1.9, 4)   (15)

r_{t-1} > r_t: \quad \mathrm{Mises}(0, 3.0, 4)   (16)

r_{t-1} < r_t: \quad \mathrm{Mises}(\pi/4, 1.5, 4)   (17)

F_t > \mathit{thresh}: \quad \mathrm{Mises}(0, 20.0, 8).   (18)

All β parameters were set experimentally through trial and error. thresh. is a threshold that determines whether F_t is stationary noise or not. Eqs. (14) and (15) follow from the assumption about the zero crossover points of stroking; Eqs. (16) and (17) follow from the stroking directions. These four equations are based on the model of the hand's trajectory in Eq. (9). Equation (18) is based on eight beats; that is, notes should lie on the peaks of the modified von Mises function with eight peaks.
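The penalty of Eqs. (14)-(18) can be sketched as a product of conditional windows. Here each window is written in unnormalized form so that its peaks equal 1 and an unsatisfied condition contributes a constant factor of 1; the threshold value is a placeholder, not the paper's tuned one.

```python
import numpy as np

def mises_window(theta, mu, beta, tau):
    # Multipeaked von Mises window, scaled so that its peaks equal 1.
    return np.exp(beta * (np.cos(tau * (theta - mu)) - 1.0))

def penalty(theta, r_prev, r_curr, F_curr, thresh=0.5):
    """Product of the five conditional windows of Eqs. (14)-(18);
    the mu and beta values follow the paper, `thresh` is a placeholder."""
    theta = np.asarray(theta, dtype=float)
    p = np.ones_like(theta)
    if r_prev > 0 and r_curr < 0:          # downward zero crossing, Eq. (14)
        p = p * mises_window(theta, 0.0, 2.0, 4)
    if r_prev < 0 and r_curr > 0:          # upward zero crossing, Eq. (15)
        p = p * mises_window(theta, np.pi / 4, 1.9, 4)
    if r_prev > r_curr:                    # downward stroke direction, Eq. (16)
        p = p * mises_window(theta, 0.0, 3.0, 4)
    if r_prev < r_curr:                    # upward stroke direction, Eq. (17)
        p = p * mises_window(theta, np.pi / 4, 1.5, 4)
    if F_curr > thresh:                    # onset present, Eq. (18)
        p = p * mises_window(theta, 0.0, 20.0, 8)
    return p
```

For a downward zero crossing with an onset, the product peaks at the down-stroke positions θ = nπ/2 and is strongly suppressed in between.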

4.3 Weight calculation

Let the weight of the i th particle at t th time frame be w t ( i ) . The weights are calculated using observations and state variables:

w_t^{(i)} = w_{t-1}^{(i)} \, \frac{p(\omega_t^{(i)}, \theta_t^{(i)} \mid \omega_{t-1}^{(i)}, \theta_{t-1}^{(i)}) \; p(R_t(\omega_t^{(i)}), F_t, r_t \mid \omega_t^{(i)}, \theta_t^{(i)})}{q(\omega_t \mid \omega_{t-1}^{(i)}, R_t(\omega_t^{(i)}), \omega_{\mathrm{init}}) \; q(\theta_t \mid r_t, F_t, \omega_{t-1}^{(i)}, \theta_{t-1}^{(i)})}.   (19)

The terms of the numerator in Eq. (19) are called the state transition model and the observation model. The better a particle's values match each model, the higher the probabilities of these functions and hence the larger the particle's weight. The denominator is the proposal distribution: when a particle with low proposal probability is sampled, its weight is increased by the small value of the denominator.

The two equations below give the derivation of the state transition model function.

\omega_t = \omega_{t-1} + n_\omega   (20)

\theta_t = \hat{\Theta}_t + n_\theta,   (21)

where n_\omega denotes the noise of the beat interval, distributed normally, and n_\theta denotes that of the bar-position, distributed according to a von Mises distribution. Therefore, the state transition model is expressed as the product of the PDFs of these distributions:

p(\omega_t^{(i)}, \theta_t^{(i)} \mid \omega_{t-1}^{(i)}, \theta_{t-1}^{(i)}) = \mathrm{Mises}(\theta_t^{(i)} \mid \hat{\Theta}_t^{(i)}, \beta_{n_\theta}, 1) \times \mathrm{Gauss}(\omega_t^{(i)} \mid \omega_{t-1}^{(i)}, \sigma_{n_\omega}).   (22)

We now give the derivation of the observation model. R_t(\omega) and r_t are distributed according to normal distributions whose means are \omega_t^{(i)} and -a \sin(4\hat{\Theta}_t^{(i)}), respectively. F_t is empirically approximated from the observations as:

F_t \approx f(\theta_{\mathrm{beat},t}, \sigma_f) = \mathrm{Gauss}(\theta_t^{(i)} \mid \theta_{\mathrm{beat},t}, \sigma_f) \times \mathit{rate} + \mathit{bias},   (23)

where \theta_{\mathrm{beat},t} is the bar-position of the beat nearest to \hat{\Theta}_t^{(i)} in the eight-beat model. rate is a constant that scales the maximum of the approximated F_t to 1, and is set to 4. bias is uniformly distributed from 0.35 to 0.5. Thus, the observation model is expressed as the product of these three functions (Eq. (27)):

p(R_t(\omega_t) \mid \omega_t^{(i)}) = \mathrm{Gauss}(\omega_t \mid \omega_t^{(i)}, \sigma_\omega)   (24)

p(F_t \mid \omega_t^{(i)}, \theta_t^{(i)}) = \mathrm{Gauss}(F_t \mid f(\theta_{\mathrm{beat},t}, \sigma_f), \sigma_f)   (25)

p(r_t \mid \omega_t^{(i)}, \theta_t^{(i)}) = \mathrm{Gauss}(r_t \mid -a \sin(4\hat{\Theta}_t^{(i)}), \sigma_r)   (26)

p(R_t(\omega_t^{(i)}), F_t, r_t \mid \omega_t^{(i)}, \theta_t^{(i)}) = p(R_t(\omega_t) \mid \omega_t^{(i)}) \; p(F_t \mid \omega_t^{(i)}, \theta_t^{(i)}) \; p(r_t \mid \omega_t^{(i)}, \theta_t^{(i)}).   (27)
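A sketch of the observation model culminating in the product of Eq. (27), under the simplifying assumption that the peak lag of R_t is available as an observed beat interval; all σ, rate, and bias values below are placeholders rather than the paper's tuned parameters.

```python
import numpy as np

def gauss(x, mu, sigma):
    # Gaussian PDF evaluated at x.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def observation_likelihood(omega_obs, F_t, r_t, omega_i, theta_i, theta_hat_i,
                           theta_beat, sigma_omega=5.0, sigma_f=0.3,
                           sigma_F=0.5, sigma_r=5.0, rate=4.0, bias=0.4, a=20.0):
    """Product of the three observation terms: beat-interval match,
    onset-strength match against the eight-beat model, and hand-position
    match against the stroke trajectory (Eq. (27))."""
    p_R = gauss(omega_obs, omega_i, sigma_omega)              # interval term
    f_t = gauss(theta_i, theta_beat, sigma_f) * rate + bias   # modelled F_t
    p_F = gauss(F_t, f_t, sigma_F)                            # onset term
    p_r = gauss(r_t, -a * np.sin(4.0 * theta_hat_i), sigma_r) # hand term
    return p_R * p_F * p_r                                    # Eq. (27)
```

A particle whose beat interval and predicted hand position agree with the observations receives a strictly larger likelihood than one that disagrees in either term.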

We finally estimate the state variables at the t-th time frame as the weighted average over particles:

\bar{\omega}_t = \sum_{i=1}^{I} w_t^{(i)} \omega_t^{(i)}   (28)

\bar{\theta}_t = \arctan\!\left( \sum_{i=1}^{I} w_t^{(i)} \sin\theta_t^{(i)} \Big/ \sum_{i=1}^{I} w_t^{(i)} \cos\theta_t^{(i)} \right).   (29)

Finally, we resample the particles to avoid degeneracy, in which almost all weights become zero except for a few. Resampling is performed when the weights satisfy the following condition:

1 \Big/ \sum_{i=1}^{I} \bigl(w_t^{(i)}\bigr)^2 < N_{\mathrm{th}},   (30)

where N th is a threshold for resampling and is set to 1.
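The degeneracy check and resampling step can be sketched with systematic resampling; the paper does not specify the resampling scheme, so that choice, like the function name, is ours.

```python
import numpy as np

def resample_if_degenerate(particles, weights, n_th, rng=None):
    """Systematic resampling, triggered when the effective sample size
    1 / sum_i (w^(i))^2 falls below the threshold n_th.  Returns
    (particles, weights), with uniform weights after resampling."""
    rng = rng if rng is not None else np.random.default_rng()
    ess = 1.0 / np.sum(weights ** 2)
    if ess >= n_th:
        return particles, weights
    n = len(weights)
    # One uniform offset, then evenly spaced positions through the CDF.
    positions = (rng.random() + np.arange(n)) / n
    indices = np.searchsorted(np.cumsum(weights), positions)
    return particles[indices], np.full(n, 1.0 / n)
```

Uniform weights give an effective sample size equal to the particle count, so no resampling occurs; a fully degenerate weight vector triggers resampling and duplicates the single surviving particle.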

5 Experiments and results

In this section, we evaluate our beat-tracking system on the following four points:

  1. Effect of audiovisual integration based on the particle filter,

  2. Effect of the number of particles in the particle filter,

  3. Differences between subjects, and

  4. Application to a human-robot ensemble.
Section 5.1 describes the experimental materials and the parameters used in our method. In Section 5.2, we compare the estimation accuracy of our method with that of Murata's method [3] to evaluate the statistical approach; since both methods share the STPM, the main difference lies between the heuristic rule-based approach and the statistical one. In addition, we evaluate the effect of adding the visual beat features by comparing against a particle filter that uses only audio beat features. In Section 5.3, we discuss the number of particles versus computational cost and estimation accuracy. In Section 5.4, we present the differences among subjects. In Section 5.5, we give an example of a musical robot ensemble with a human guitarist.

5.1 Experimental setup

We asked four guitarists to perform each of the eight beat patterns given in Figure 1 at three tempos (70, 90, and 110 bpm), for a total of 96 samples. The beat patterns are enumerated in order of complexity; a smaller index indicates a pattern with more accented down beats, which are easily tracked, while a larger index indicates more accented upbeats, which confuse the beat-tracker. A performance consists of four counts, seven repetitions of the beat pattern, one whole note, and one short note, as shown in Figure 8. The average length of each sample was 30.8 sec for 70 bpm, 24.5 sec for 90 bpm, and 20.7 sec for 110 bpm. The camera recorded frames at about 19 fps. The distance between the robot and the guitarist was about 3 m so that the entire guitar fit inside the camera frame. We use a one-channel microphone and the sampling parameters shown in Section 3.1. Our method uses 200 particles unless otherwise stated. It was implemented in C++ on a Linux system with an Intel Core2 processor. Table 1 shows the parameters of this experiment. The unit of the parameters relevant to θ is degrees, ranging from 0 to 360. All were determined experimentally through trial and error.

Figure 8
figure 8

The score used in our experiments. X denotes the counts given by the hit sound from the guitar. The white box denotes a whole note; the black box at the end of the score denotes a short note.

Table 1 Parameter settings: abbreviations are SD for standard deviation, and dist. for distribution

In order to evaluate the accuracy of beat-tracking methods, we use the following thresholds to define successful beat detections and tempo estimations against the ground truth: 150 msec for detected beats and 10 bpm for estimated tempos, respectively.

Two evaluation metrics are used: F-measure and AMLc. F-measure is the harmonic mean of the precision (r_prec) and recall (r_recall) of each pattern. They are calculated by

F-measure = 2 / (1/r_prec + 1/r_recall),
r_prec = N_e / N_d,
r_recall = N_e / N_c,

where N_e, N_d, and N_c denote the number of correct estimates, the number of all estimates, and the number of correct (ground-truth) beats, respectively. AMLc is the ratio of the longest continuously correctly tracked section to the length of the music, with beats allowed at other metrical levels; for example, a single inaccuracy in the middle of a piece limits AMLc to 50%. This metric captures the continuity of correct beat detections, which is a critical factor in the evaluation of musical ensembles.
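The two metrics can be sketched in a few lines of Python; the function names and data layout here are ours for illustration, not taken from the paper's C++ implementation.

```python
def f_measure(n_correct_est, n_detected, n_ground_truth):
    """Harmonic mean of precision and recall, as defined above.

    n_correct_est  -- N_e: detected beats matching ground truth
                      (within 150 msec and 10 bpm)
    n_detected     -- N_d: all detected beats
    n_ground_truth -- N_c: all ground-truth beats
    """
    r_prec = n_correct_est / n_detected
    r_recall = n_correct_est / n_ground_truth
    if r_prec + r_recall == 0:
        return 0.0
    return 2 * r_prec * r_recall / (r_prec + r_recall)


def amlc(correct_flags, n_ground_truth):
    """Ratio of the longest run of correctly tracked beats to the piece
    length.

    correct_flags -- per-beat booleans, True where the beat was tracked
                     correctly at an allowed metrical level
    """
    longest = run = 0
    for ok in correct_flags:
        run = run + 1 if ok else 0
        longest = max(longest, run)
    return longest / n_ground_truth
```

For instance, a single error in the middle of a ten-beat piece leaves a longest correct run of five beats, so AMLc drops to 0.5 even though the F-measure stays high; this is the continuity property discussed above.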

Beat detection errors are divided into three classes: substitution, insertion, and deletion errors. A substitution error means that a beat is estimated but with a wrong tempo or bar-position. Insertion and deletion errors are false-positive and false-negative estimations, respectively. We assume that a player does not know the other's score and therefore estimates the score position by counting beats from the beginning of the performance. Beat insertions and deletions undermine a musical ensemble because the cumulative number of beats must be correct or the performers will lose synchronization. Algorithm 1 shows how inserted and deleted beats are detected. Suppose that the beat-tracker correctly detects two beats with some false estimation between them. If exactly one incorrect beat lies between them, we regard it as a substitution error; if no beat or two beats lie there, they are counted as a deletion or an insertion, respectively.

5.2 Comparison of audiovisual particle filter, audio only particle filter, and Murata's method

Table 2 and Figure 9 summarize the precision, recall, and F-measure of each pattern for our audiovisual integrated beat-tracking (Integrated), an audio-only particle filter (Audio only), and Murata's method (Murata). Murata shows no variance in its results, and hence no error bars in the figures, because its estimation is deterministic, whereas the first two show variance due to the stochastic nature of particle filters. Our method, Integrated, stably produces moderate results and outperforms Murata for patterns 4-8. These patterns are rather complex, with syncopations and downbeat absences; this demonstrates that Integrated is more robust against beat patterns than Murata. The comparison between Integrated and Audio only confirms that the visual beat features improve beat-tracking performance: Integrated improves precision, recall, and F-measure by 24.9, 26.7, and 25.8 points on average over Audio only, respectively.

Table 2 Results of the accuracy of beat-tracking estimations
Figure 9
figure 9

Results: F-measure of each method. Exact values are shown in Table 2.

The F-measure scores for patterns 5, 6, and 8 decrease for Integrated. This degradation is caused by the following mismatch: these patterns contain sixteenth notes that make the hand move at double speed, while our method assumes that the hand moves downward only at quarter-note positions, as Eq. (9) indicates. To cope with this problem, we should allow downward arm motions at eighth notes, that is, sixteen-beat patterns. However, a naive extension of the method would degrade performance on the other patterns.

The average F-measure for Integrated is about 61%. The score deteriorates for two reasons: (1) the hand-trajectory model does not match the sixteen-beat patterns, and (2) the low resolution and errors in visual beat feature extraction prevent the penalty function from effectively modifying the θ distribution.

Table 3 and Figure 10 present the AMLc comparison among the three methods. As with the F-measure results, Integrated is superior to Murata for patterns 4-8. The AMLc scores of patterns 1 and 3 are not very high despite their high F-measure scores. Here, we define the result rate as the ratio of the AMLc score to the F-measure score. In patterns 1 and 3, the result rates are not very high, at 72.7 and 70.8. As with the F-measure results, the result rates of patterns 4 and 5 show lower scores, 48.9 and 55.8. On the other hand, the result rates of patterns 2 and 7 remain high, at 85.0 and 74.7. The hand trajectory of patterns 2 and 7 is approximately the same as our model, a sine curve. In pattern 3, however, the triplet notes delay the upward movement of the trajectory. In pattern 1, the absence of upbeats, that is, of constraints on the upward movement, allows the hand to move upward more loosely than in the other patterns. In conclusion, the result rate correlates with how closely the hand trajectory of each pattern matches our model. The model should be refined in future work to raise these scores.

Table 3 Results of AMLc
Figure 10
figure 10

Results: AMLc of each method. Exact values are shown in Table 3.

In Figure 11, Integrated produces fewer total insertion and deletion errors than Murata. A detailed analysis shows that Integrated has fewer deletion errors than Murata in some patterns. On the other hand, Integrated has more insertion errors than Murata, especially for sixteen-beat patterns; adapting the model to sixteen-beat patterns should reduce these insertions.

Figure 11
figure 11

Results: Number of inserted and deleted beats.

5.3 The influence of the number of particles

As a criterion of computational cost, we use the real-time factor to evaluate our system as a real-time system. The real-time factor is defined as computation time divided by data length; for example, when the system takes 0.5 s to process 2 s of data, the real-time factor is 0.5/2 = 0.25. The real-time factor must be less than 1 for the system to run in real time. Table 4 shows the real-time factors for various numbers of particles. The real-time factor increases in proportion to the number of particles and stays under 1 with 300 particles or fewer. We therefore conclude that our method works well as a real-time system with 300 particles or fewer.
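The real-time criterion above is a one-line computation; this small sketch (names are ours) just makes the definition concrete.

```python
def real_time_factor(computation_time_s, data_length_s):
    """Real-time factor: processing time divided by the length of the
    processed data. Values below 1 mean the system keeps up in real time."""
    return computation_time_s / data_length_s

# Example from the text: 0.5 s of computation for 2 s of audio.
rtf = real_time_factor(0.5, 2.0)   # 0.5/2 = 0.25, i.e. real-time capable
```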

Table 4 Influence of the number of particles on the estimation accuracy and computational speed

Table 4 also shows that the F-measure differs by only about 1.3% between 400 particles, which gives the best result, and 200 particles, with which the system runs in real time. This suggests that our system is capable of real-time processing with almost saturated performance.

5.4 Results with various subjects

Figure 12 indicates only small differences among the subjects except for Subject 3. In the case of Subject 3, the similarity of the subject's skin color to the guitar's color caused frequent loss of the hand trajectory. To improve estimation accuracy, we should tune the algorithm or its parameters to be more robust against such confusion.

Figure 12
figure 12

Comparison among the subjects.

5.5 Evaluation using a robot

Our system was implemented on the humanoid robot HRP-2, which plays an electronic instrument called the theremin, as in Figure 13. The video is available on YouTube [26]. HRP-2 plays the theremin with a feed-forward motion control developed by Mizumoto et al. [27]. HRP-2 captures with its microphones a mixture of its own theremin performance and the human partner's guitar performance. It first suppresses its own theremin sounds using semi-blind ICA [28] to obtain the audio signal played by the human guitarist. Then, our beat-tracker estimates the tempo of the human performance and predicts the tactus, according to which HRP-2 plays the theremin. This prediction is coordinated to absorb the delay of the robot's actual arm movement.

Figure 13
figure 13

An example image of musical robot ensemble with a human guitarist.

6 Conclusions and future works

We presented an audiovisual integration method for beat-tracking of live guitar performances using a particle filter. Beat-tracking of guitar performances faces three problems: tempo fluctuation, beat pattern complexity, and environmental noise. The auditory beat features are the autocorrelation of the onsets and the onset summation, extracted with a noise-robust beat estimation method called STPM. The visual beat feature is the distance of the hand position from the guitar neck, extracted using optical flow, mean shift, and Hough line detection. We modeled the stroke and the beat location based on an eight-beat assumption to address the single-instrument situation. Experimental results show the robustness of our method against these problems: the F-measure of beat-tracking estimation improves by 8.9 points on average over an existing beat-tracking method. Furthermore, we confirmed that our method is capable of real-time processing by suppressing the number of particles while preserving beat-tracking accuracy. In addition, we demonstrated a musical robot ensemble with a human guitarist.

Two main problems remain for improving the quality of synchronized musical ensembles: beat-tracking with higher accuracy and robustness against estimation errors. For the first problem, we have to remove the assumption of quadruple rhythm and eight beats. The hand-tracking method should also be refined. One promising direction is the use of infrared sensors, which have recently attracted much research interest; in fact, our preliminary experiments suggest that using an infrared sensor instead of an RGB camera would enable more robust hand tracking. We can therefore also expect an improvement in beat-tracking itself from this sensor.

We suggest two extensions as future work to increase robustness against estimation errors: audio-to-score alignment with reduced score information, and beat-tracking with prior knowledge of rhythm patterns. While standard audio-to-score alignment methods [12] require a full set of musical notes to be played, for example, an eighth note of F in the 4th octave and a quarter note of C in the 4th octave, guitarists use scores containing only the melody and chord names, with some ambiguity in octave and note lengths. Compared with beat-tracking alone, this melody information would let us locate the score position at the bar level and follow the music more robustly against insertion or deletion errors. A prior distribution over rhythm patterns can also alleviate the insertion and deletion problem by forming a distribution of possible beat positions in advance; such a distribution should lead to more precise sampling and state transitions in particle-filter methods. Finally, we remark that a subjective evaluation is needed to assess how much our beat-tracking improves the quality of the human-robot musical ensemble.

Algorithm 1 Detection of inserted and deleted beats

deleted ← 0 {deleted denotes the number of deleted beats}

inserted ← 0 {inserted denotes the number of inserted beats}

prev_index ← 0

error_count ← 0 {error_count denotes the number of wrong estimates since the last correct beat}

for all detected_beat do

   if |tempo(detected_beat)-tempo(ground_truth_beat)| < 10

   and | beat_time(detected_beat)-beat_time(ground_truth_beat)| < 150 then

      {detected_beat is correct estimation}

      new_index ← index(ground_truth_beat)

      N ← (new_index - prev_index - 1) - error_count

      deleted ← deleted + MAX(0, N)

      inserted ← inserted + MAX(0, -N)

      prev_index ← new_index

      error_count ← 0

   else

      error_count ← error_count + 1

   end if

end for
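Algorithm 1 can be rendered in Python as follows. This is our reading of the pseudocode, not the paper's C++ implementation; the data layout (lists of (tempo, time) pairs) and 1-based ground-truth indexing, which matches the pseudocode's prev_index ← 0 initialization, are our assumptions.

```python
def count_insertions_deletions(detected, ground_truth,
                               tempo_tol=10.0, time_tol=0.150):
    """Count inserted and deleted beats following Algorithm 1.

    detected     -- list of (tempo_bpm, beat_time_s) estimates
    ground_truth -- list of (tempo_bpm, beat_time_s) reference beats

    A detected beat is correct when it matches some ground-truth beat
    within both tolerances. Between two correct detections, surplus
    estimates count as insertions and skipped ground-truth beats as
    deletions; a one-for-one mismatch is a substitution and adds nothing.
    """
    deleted = inserted = 0
    prev_index = 0      # 1-based index of the last correctly matched beat
    error_count = 0     # wrong estimates since the last correct beat
    for tempo, time in detected:
        # 1-based index of the matching ground-truth beat, if any
        match = next((i + 1 for i, (gt_tempo, gt_time) in enumerate(ground_truth)
                      if abs(tempo - gt_tempo) < tempo_tol
                      and abs(time - gt_time) < time_tol), None)
        if match is not None:
            n = (match - prev_index - 1) - error_count
            deleted += max(0, n)    # ground-truth beats with no estimate
            inserted += max(0, -n)  # surplus estimates between correct beats
            prev_index = match
            error_count = 0
        else:
            error_count += 1        # substitution candidate
    return inserted, deleted
```

For example, with ground-truth beats at 0, 1, 2, 3, 4 s at 90 bpm, dropping the beat at 2 s yields one deletion, while adding a spurious beat at 1.5 s yields one insertion.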


References

  1. Klapuri A, Eronen A, Astola J: Analysis of the meter of acoustic musical signals. IEEE Trans Audio Speech Lang Process 2006, 14:342-355.

  2. Weinberg G, Blosser B, Mallikarjuna T, Raman A: The creation of a multi-human, multi-robot interactive jam session. Proc of Int'l Conf on New Interfaces for Musical Expression 2009, 70-73.

  3. Murata K, Nakadai K, Takeda R, Okuno HG, Torii T, Hasegawa Y, Tsujino H: A beat-tracking robot for human-robot interaction and its evaluation. Proc of IEEE/RAS Int'l Conf on Humanoids 2008, 79-84.

  4. Mizumoto T, Lim A, Otsuka T, Nakadai K, Takahashi T, Ogata T, Okuno HG: Integration of flutist gesture recognition and beat-tracking for human-robot ensemble. Proc of IEEE/RSJ-2010 Workshop on Robots and Musical Expression 2010, 159-171.

  5. Rosenfeld A, Kak A: Digital Picture Processing. Volumes 1 & 2. Academic Press, New York; 1982.

  6. Ince G, Nakadai K, Rodemann T, Hasegawa Y, Tsujino H, Imura J: A hybrid framework for ego noise cancellation of a robot. Proc of IEEE Int'l Conf on Robotics and Automation 2011, 3623-3628.

  7. Dixon S, Cambouropoulos E: Beat-tracking with musical knowledge. Proc of European Conf on Artificial Intelligence 2000, 626-630.

  8. Goto M: An audio-based real-time beat-tracking system for music with or without drum-sounds. J New Music Res 2001, 30(2):159-171.

  9. Cemgil AT, Kappen B: Integrating tempo tracking and quantization using particle filtering. Proc of Int'l Computer Music Conf 2002, 419.

  10. Whiteley N, Cemgil AT, Godsill S: Bayesian modelling of temporal structure in musical audio. Proc of Int'l Conf on Music Information Retrieval 2006, 29-34.

  11. Hainsworth S, Macleod M: Beat-tracking with particle filtering algorithms. Proc of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2003, 91-94.

  12. Otsuka T, Nakadai K, Takahashi T, Komatani K, Ogata T, Okuno HG: Design and implementation of two-level synchronization for interactive music robot. Proc of AAAI Conference on Artificial Intelligence 2010, 1238-1244.

  13. Pan Y, Kim MG, Suzuki K: A robot musician interacting with a human partner through initiative exchange. Proc of Conf on New Interfaces for Musical Expression 2010, 166-169.

  14. Petersen K, Solis J, Takanishi A: Development of a real-time instrument tracking system for enabling the musical interaction with the Waseda Flutist Robot. Proc of IEEE/RSJ Int'l Conf on Intelligent Robots and Systems 2008, 313-318.

  15. Lim A, Mizumoto T, Cahier L, Otsuka T, Takahashi T, Komatani K, Ogata T, Okuno HG: Robot musical accompaniment: integrating audio and visual cues for real-time synchronization with a human flutist. Proc of IEEE/RSJ Int'l Conf on Intelligent Robots and Systems 2010, 1964-1969.

  16. Comaniciu D, Meer P: Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 24(5):603-619.

  17. Fukunaga K: Introduction to Statistical Pattern Recognition. Academic Press, New York; 1990.

  18. Kalman R: A new approach to linear filtering and prediction problems. J Basic Eng 1960, 82:35-45.

  19. Sorenson HW: Kalman Filtering: Theory and Application. IEEE Press, New York; 1985.

  20. Nickel K, Gehrig T, Stiefelhagen R, McDonough J: A joint particle filter for audio-visual speaker tracking. Proc of Int'l Conf on Multimodal Interfaces 2005, 61-68.

  21. Lucas BD, Kanade T: An iterative image registration technique with an application to stereo vision. Proc of Int'l Joint Conf on Artificial Intelligence 1981, 674-679.

  22. Miyazaki D, Tan RT, Hara K, Ikeuchi K: Polarization-based inverse rendering from a single view. Proc of IEEE Int'l Conf on Computer Vision 2003, 982-987.

  23. Ballard DH: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 1981, 13(2):111-122.

  24. Fischler M, Bolles R: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun ACM 1981, 24(6):381-395.

  25. von Mises R: Über die "Ganzzahligkeit" der Atomgewichte und verwandte Fragen. Phys Z 1918, 19:490-500.

  26. Itohara T: HRP-2 follows the guitar. []

  27. Mizumoto T, Otsuka T, Nakadai K, Takahashi T, Komatani K, Ogata T, Okuno HG: Human-robot ensemble between robot thereminist and human percussionist using coupled oscillator model. Proc of IEEE/RSJ Int'l Conf on Intelligent Robots and Systems 2010, 1957-1963.

  28. Takeda R, Nakadai K, Komatani K, Ogata T, Okuno HG: Exploiting known sound source signals to improve ICA-based robot audition in speech separation and recognition. Proc of IEEE/RSJ Int'l Conf on Intelligent Robots and Systems 2007, 1757-1762.



This research was supported in part by a JSPS Grant-in-Aid for Scientific Research (S) and in part by Kyoto University's Global COE.

Author information

Corresponding author

Correspondence to Tatsuhiko Itohara.

Additional information

Competing interests

The authors declare that they have no competing interests.


Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Itohara, T., Otsuka, T., Mizumoto, T. et al. A multimodal tempo and beat-tracking system based on audiovisual information from live guitar performances. J AUDIO SPEECH MUSIC PROC. 2012, 6 (2012).



Keywords

  • Tempo
  • Particle Filter
  • Audio Signal
  • Hand Position
  • Hand Tracking