
Dance2Music-Diffusion: leveraging latent diffusion models for music generation from dance videos

Abstract

With the rapid development of social networks, short videos have become a popular form of content, especially dance videos. In this context, research on automatically generating music for dance videos shows significant practical value. However, existing studies face challenges such as limited richness in music timbre and lack of synchronization with dance movements. In this paper, we propose Dance2Music-Diffusion, a novel framework for music generation from dance videos using latent diffusion models. Our approach includes a motion encoder module for extracting motion features and a music diffusion generation module for generating latent music representations. By integrating dance type monitoring and latent diffusion techniques, our framework outperforms existing methods in generating complex and rich dance music. We conducted objective and subjective evaluations of the results produced by various existing models on the AIST++ dataset. Our framework shows outstanding performance in terms of beat recall rate, consistency with GT beats, and coordination with dance movements. This work represents the state of the art in automatic music generation from dance videos, is easy to train, and has implications for enhancing entertainment experiences and inspiring innovative dance productions. Sample videos of our generated music and dance can be viewed at https://youtu.be/eCvLdLdkX-Y. The code is available at https://github.com/hellto/dance2music-diffusion.

1 Introduction

“When music and dance seamlessly intertwine, their enchantment conquers the mind and spirit”. For millennia, dance and music, as a natural form of expressive art, have enriched our daily lives through the harmonious interaction of melody, rhythm, and movement. In practice, it is often necessary to configure music for videos to enhance the entertainment experience. However, creating music requires a high level of musical literacy and selecting suitable music would be laborious since there are thousands of music clips in libraries. Consequently, automatically analyzing input dance videos to generate fitting output music has become a practical task.

Ongoing research is actively exploring the interplay of dance motion and music [1,2,3,4,5] in multimodal generation tasks. We focus on automatically generating creative music for dance videos. This study strengthens the ability to craft captivating soundtracks for short videos on social media platforms, especially meeting the unique needs of nonprofessional dancers. For professional dancers, automatically generated music could be a valuable tool to explore the intricate interplay between dance and music and to inspire innovative dance compositions. Moreover, it could also foster advancements in the creation of soundtracks for animation and games.

The current research on generating audio from dance videos is still insufficient. For audio generation, there are two predominant approaches: symbolic music generation models and audio generation models. Symbolic music generation models synthesize representations such as musical notes and MIDI. In contrast, audio generation models synthesize audio representations and use vocoders to convert these representations back into audio. Dance2Music [3] generates monophonic notes by taking the local history of the dance similarity matrix as input. The handcrafted features it uses may discard much useful information in dance videos, and monophonic music is not suitable for practical applications. Dance2MIDI [6], on the other hand, takes the motion features of dance videos as input, utilizes temporal convolution for feature extraction, and generates MIDI [7] event sequences with a transformer decoder. The MIDI event sequence it uses is a symbolic music representation, resulting in limited flexibility in the generated audio and a lack of coordination with dance movements. D2M-GAN [8] takes dance video frames and human body motions as input, employs a convolutional network for feature extraction, utilizes a GAN to generate vector-quantized audio representations, and then decodes them into audio. Although it can generate continuous complex music, it is limited by the slow decoding speed of the VQ audio representation, resulting in music lengths of only 2 to 3 s. Recent research has started leveraging latent diffusion models [9, 10] to generate music for dance, which can significantly improve the richness of the generated music and drastically reduce the time required for generation.

Overall, previous research mainly faces two issues. Firstly, during the extraction of dance motion features, the convolutional network used to extract input information overlooks the temporal characteristics and inter-frame correlations of dance movements, while lacking discrimination for different dance types. Secondly, in the music generation process, the richness of music generated using MIDI or musical notes is low, and the method of synthesizing audio using VQ audio representations leads to short and noisy music. These two issues collectively result in low coherence between the generated music and dance.

Recently, with the ascendancy of diffusion models and their achievements in image generation [11,12,13,14,15], there is a growing optimism about the potential advantages that diffusion models can offer in the field of audio generation. Methods for audio generation employing the diffusion model have been proposed [16,17,18,19]. In contrast to adversarial generative networks [20,21,22], autoencoders [23, 24] and transformers [25,26,27], the diffusion model exhibits superior capabilities in generating high-quality music. However, it is noteworthy that these emerging diffusion generation methods are primarily proposed for translating text description into music. The study of diffusion models in the field of text-to-music generation has now spilled over into the realm of dance-to-music generation [9, 10].

In this work, we design a novel multimodal framework, Dance2Music-Diffusion, shown in Fig. 1, consisting of a Motion Encoder Module and a Music Diffusion Generation Module, to learn the generation of complex musical samples from dance videos. Specifically, our framework takes a sequence of dance motions as input, which is encoded by the Motion Encoder Module with a transformer-based architecture to extract motion features. To enhance the encoder’s ability to recognize different dance types, we additionally introduce the dance type as supervision during the training of the Motion Encoder Module. The extracted motion features serve as a prompt in the Music Diffusion Generation Module to guide music generation. In the Music Diffusion Generation Module, we employ a latent dance-to-audio diffusion generator that produces an audio representation from the motion features by iteratively “denoising” data in a latent space, which is a compressed music space obtained from the pre-trained DMAE (Diffusion Magnitude Autoencoder) [16]. Subsequently, the latent music representation is decoded into a music waveform by the decompression part of DMAE.

Fig. 1

An overview of our proposed framework Dance2Music-Diffusion for music generation from dance videos, which takes human body motions as input, and generates suitable music accordingly. Our Dance2Music-Diffusion model consists of a Motion Encoder Module and a Music Diffusion Generation Module

Compared to the recurrent or convolutional neural networks [6, 8, 28] used in previous studies, our Motion Encoder Module enables superior extraction of temporal information and relationships between actions in the motion sequence. In contrast to existing conditional music generation approaches, which rely on symbolic music representations to synthesize music [3, 6], our method excels in generating complex and rich dance music. Furthermore, compared to direct diffusion of audio, our model adopts the structure of latent diffusion [11] by using pre-trained DMAE [16], leading to a significant reduction in training time and resource requirements for the model, while improving the quality of the generated music. In summary, our work has the following contributions:

  1. We propose Dance2Music-Diffusion, a framework that employs a transformer architecture to extract motion features and a latent diffusion generation methodology to perform dance-to-music generation.

  2. During the training process of the Motion Encoder Module, we introduce an additional dance category loss, enabling the module to extract discriminative dance motion features.

  3. We introduce the Beat Alignment Score from AI Choreographer [2] as an evaluation metric to measure the beat consistency between dance and the generated music.

  4. Our method significantly surpasses previous state-of-the-art methods in dance-to-music generation in terms of both the length and the quality of the music.

2 Method

An overview of the proposed Dance2Music-Diffusion architecture is shown in Fig. 2. Our model comprises two modules: the Motion Encoder Module and the Music Diffusion Generation Module.

Fig. 2

Overview of the architecture of Dance2Music-Diffusion. Our model takes the human body motions of dance performers as input and processes them with the Motion Encoder Module. Its output, which encodes the dance movements, is regarded as a prompt to the Music Diffusion Generation Module. The prompt guides the U-Net in the diffusion generation stage to reconstruct latent music representations from pure noise. Finally, the latent music representations are decoded into raw music samples by a pre-trained DMAE (Diffusion Magnitude Autoencoder). During the training process, the DMAE is employed as an encoder to compress the GT (ground truth) audio into latent music representations, which are then used for the supervised training of the Latent Dance-to-Audio Diffusion Generator. In the inference process, the DMAE acts as a decoder, reconstructing the latent music representation back into audio. The classification head is used only during training to compute the dance category loss on the class token produced by the Motion Encoder Module

During the training process, the input motion sequences enter the Motion Encoder Module, where they are initially augmented with a class token. Subsequently, motion embedding is applied to add temporal encoding, and the sequences are processed by the Transformer architecture to obtain motion representations. The class token is used to compute a classification loss against the ground truth dance genre through the Classification Head. Simultaneously, the motion representations are fed into the Latent Dance-to-Audio Diffusion Generator within the Music Diffusion Generation Module, where they serve as the prompt for the U-Net and guide the “denoising” process that generates the latent music representation.

During the inference process, the Latent Dance-to-Audio Diffusion Generator utilizes the encoded motion representations as guidance to generate latent music representations. Subsequently, the decoder part of DMAE is used to reconstruct the final music.

2.1 Motion encoder module

The purpose of this module is to extract information from dance and guide the subsequent music generation. In previous research, recurrent or convolutional neural networks were commonly used to extract dance motion features [1, 8], but there has been less emphasis on capturing the temporal information and inter-frame relationships in dance movements. Therefore, the Motion Encoder Module proposed in our work adopts the structure of Transformer encoder [29] to extract dance motion features and generate motion representations.

As shown in Fig. 2, the Motion Encoder Module takes a sequence of dance motion as input. In our framework, the input dance motion sequence is represented by the 3D Skinned Multi-Person Linear model (SMPL) [30], denoted as \(x_{1:T} = \{x_1, \ldots , x_T\} \in \mathbb {R}^{T \times D}\), where \(x_t\) represents the SMPL features within the tth frame, T is the number of frames, and D is the dimension of the SMPL features. Consequently, the Transformer encoder in the module also maintains a constant latent vector size D across all of its layers.

To enhance the Motion Encoder Module’s ability to extract motion features and discriminate dance types, we introduce a classification head and a class token during the training phase. The class token, embedded as \(z_0^0=x_{\text {class}}\), is similar to BERT’s [31] and is a learnable embedding added to the sequence; the state of this token at the output of the Transformer encoder serves as the representation of the dance genre. Furthermore, temporal embeddings \({temp}\) are added to the motion embeddings to retain the temporal information of the dance motion sequence. The input sequence to the Transformer encoder is formulated as follows:

$$\begin{aligned} \mathbf{Z}^0 & = \left[ z_0^0,z_1^0,...,z_T^0 \right] \nonumber \\ & = [x_{\text {class}};x_{1:T}] + {temp} \end{aligned}$$
(1)

Here, \(x_{\text {class}} \in \mathbb {R}^{1 \times D}\), \(x_{1:T} \in \mathbb {R}^{T \times D}\), \({temp} \in \mathbb {R}^{(T+1) \times D}\), and \(z_i^0\) represents the ith token in the input sequence.
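To make Eq. 1 concrete, the following PyTorch sketch assembles the encoder input from a motion sequence; the class name and the parameter initializations are illustrative assumptions rather than details taken from the released code.

```python
import torch
import torch.nn as nn

class MotionInputEmbedding(nn.Module):
    """Builds Z^0 of Eq. 1: prepend a learnable class token, then add temporal embeddings."""
    def __init__(self, num_frames: int, feat_dim: int):
        super().__init__()
        self.class_token = nn.Parameter(torch.zeros(1, 1, feat_dim))                 # x_class
        self.temporal_emb = nn.Parameter(torch.zeros(1, num_frames + 1, feat_dim))   # temp

    def forward(self, motion: torch.Tensor) -> torch.Tensor:
        # motion: (batch, T, D) SMPL features x_{1:T}
        cls = self.class_token.expand(motion.shape[0], -1, -1)   # (batch, 1, D)
        z0 = torch.cat([cls, motion], dim=1)                     # [x_class; x_{1:T}]
        return z0 + self.temporal_emb                            # + temp

# Example with the paper's dimensions: T = 356 frames, D = 219 SMPL features per frame
embed = MotionInputEmbedding(num_frames=356, feat_dim=219)
print(embed(torch.randn(2, 356, 219)).shape)   # torch.Size([2, 357, 219])
```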

The Encoder part of the Transformer model consists of multiple identical layers, each composed of two sub-layers: Multi-Head Self-Attention Att and Feedforward Neural Network FFN. The computation of Self-Attention is as follows:

$$\begin{aligned} \mathbf{Z}_{\text {att}}^l & = \text {Att}(\mathbf{Z}^l) \nonumber \\ & = \text {Att}(Q,K,V) \nonumber \\ & = \text {softmax}\left( \frac{(QK^T)}{\sqrt{D}}\right) V \end{aligned}$$
(2)

where Q, K, and V are the query, key, and value obtained from linear transformations of the input sequence \(\mathbf{Z}^l\), and D is the dimension of the query. Here, \(l=0,1,...,L-1\) denotes the layer index. For the output \(\mathbf{Z}_{\text {att}}^l\) of Self-Attention, apply Layer Normalization and a Residual Connection to obtain the output \(\mathbf{Z}_{\text {att}\_\text{out}}^l\):

$$\begin{aligned} \mathbf{Z}_{\text {att}\_\text{out}}^l = \text {LayerNorm}(\mathbf{Z}_{\text {att}}^l + \mathbf{Z}^l) \end{aligned}$$
(3)

where \(\text {LayerNorm}(\cdot )\) denotes the Layer Normalization operation. For \(\mathbf{Z}_{\text {att}\_\text{out}}^l\), transform it through a feedforward neural network to get the output \(\mathbf{Z}_{\text {ffn}\_\text{out}}^l\):

$$\begin{aligned} \mathbf{Z}_{\text {ffn}\_\text{out}}^l = \text {FFN}(\mathbf{Z}_{\text {att}\_\text{out}}^l) \end{aligned}$$
(4)

where \(\text {FFN}(\cdot )\) represents the operation of the feedforward neural network. For \(\mathbf{Z}_{\text {ffn}\_\text{out}}^l\), apply Layer Normalization and Residual Connection once again to obtain the output \(\mathbf{Z}\) of the Transformer encoder:

$$\begin{aligned} \mathbf{Z}^{(l+1)} = \text {LayerNorm}(\mathbf{Z}_{\text {att}\_\text{out}}^l + \mathbf{Z}_{\text {ffn}\_\text{out}}^l) \end{aligned}$$
(5)

The expression for the Transformer encoder is thus represented as follows:

$$\begin{aligned} \mathbf{Z}^{(l+1)} & = \text {LayerNorm}\left[ \text {LayerNorm}(\mathbf{Z}_{\text {att}}^l + \mathbf{Z}^l) \right. \nonumber \\ & \quad \left. + \text {FFN}(\mathbf{Z}_{\text {att}\_\text{out}}^l)\right] \end{aligned}$$
(6)

where \(l=0,1,...,L-1\) denotes the layer index, L is the depth of the Transformer encoder, and the entire process is repeated L times.
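A minimal PyTorch sketch of one post-norm encoder layer implementing Eqs. 2–5 (and, stacked L times, Eq. 6). The layer sizes are illustrative; the paper reports 12 attention heads with a 219-dimensional hidden size, which PyTorch’s MultiheadAttention cannot split evenly, so this sketch uses 3 heads purely for runnability.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One post-norm Transformer encoder layer following Eqs. 2-5."""
    def __init__(self, dim: int = 219, heads: int = 3, ffn_mult: int = 4, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.GELU(),
                                 nn.Linear(ffn_mult * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z_att, _ = self.attn(z, z, z)                 # Eq. 2: softmax(QK^T / sqrt(D)) V
        z_att_out = self.norm1(z_att + z)             # Eq. 3: residual + LayerNorm
        z_ffn_out = self.ffn(z_att_out)               # Eq. 4: feedforward network
        return self.norm2(z_att_out + z_ffn_out)      # Eq. 5: residual + LayerNorm

# Eq. 6: the layer is applied L times (L = 6 in Sect. 3.1)
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
z_L = encoder(torch.randn(2, 357, 219))               # (batch, T + 1, D)
```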

For the output of the Transformer encoder \(\mathbf{Z}^L= \left[z_0^L,z_1^L,...,z_T^L\right]\), apply the Feature Fusion Block (FFB) to integrate information and obtain the video motion representation \(\mathbf{M}\):

$$\begin{aligned} \mathbf{M} = \text {FFB} \left(\mathbf{Z}^L\right) = \text {Linear} \left(\text {conv1d}\left(\mathbf{Z}^L\right)\right) \end{aligned}$$
(7)

where \(\text {FFB}(\cdot )\) is primarily composed of linear layers \(\text {Linear}(\cdot )\) and 1-dimensional convolution \(\text {conv1d}(\cdot )\).

During the training process of the module, a classification head \(\text {class}(\cdot )\) is added to classify \(z_0^L\) in \(\mathbf{Z}^L= \left[z_0^L,z_1^L,...,z_T^L\right]\):

$$\begin{aligned} \mathbf{Z}_{\text {class}} = \text {class} \left(z_0^L\right) \end{aligned}$$
(8)

where \(\mathbf{Z}_{\text {class}}\) represents the output of the classification head. \(\mathbf{Z}_{\text {class}}\) will help the Motion Encoder Module to recognize dance categories through training.
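A possible realization of the Feature Fusion Block (Eq. 7) and the classification head (Eq. 8) in PyTorch; the convolution kernel size, output width, and the number of genre classes (10, matching AIST++) are illustrative choices, not specifications from the paper.

```python
import torch
import torch.nn as nn

class FeatureFusionBlock(nn.Module):
    """Eq. 7: M = Linear(conv1d(Z^L)); kernel size and output width are illustrative."""
    def __init__(self, dim: int = 219):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # conv over the token axis
        self.linear = nn.Linear(dim, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.conv(z.transpose(1, 2)).transpose(1, 2)           # conv1d(Z^L)
        return self.linear(h)                                      # Linear(conv1d(Z^L)) = M

class ClassificationHead(nn.Module):
    """Eq. 8: classify the class token z_0^L into one of the dance genres."""
    def __init__(self, dim: int = 219, num_classes: int = 10):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.fc(z[:, 0])                                    # use only the class token

z_L = torch.randn(2, 357, 219)                  # encoder output Z^L
motion_repr = FeatureFusionBlock()(z_L)         # M, the prompt for the diffusion generator
genre_logits = ClassificationHead()(z_L)        # Z_class, used for the classification loss
```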

2.2 Music diffusion generation module

The Music Diffusion Generation Module comprises two components: the Latent Dance-to-Audio Diffusion Generator and the pre-trained DMAE (Diffusion Magnitude Autoencoder).

The DMAE from Moûsai [16] was pre-trained on a 25,000-h music dataset and performs well as both a compressor and a vocoder.

During the training process, the pre-trained DMAE is used to compress music into latent form. Specifically, the music waveform \(w \in \mathbb {R}^{c \times t }\), with c channels and t timesteps, undergoes a short-time Fourier transform to obtain the magnitude spectrum \(m_w\). DMAE encodes the magnitude \(m_w\) into the latent music representation \(L = \text {DMAE}_{\text {enc}}(m_w) \in \mathbb {R}^{H \times W}\). Compared to the original waveform, this process compresses the audio signal to \(\frac{1}{64}\) of its original size, i.e., \(\frac{H \times W}{c \times t} = \frac{1}{64}\).

During the inference process, DMAE is responsible for decoding the latent music representation \(\hat{L}\) generated by the Latent Dance-to-Audio Diffusion Generator and reconstructing it to the waveform audio \(\hat{w} = \text {DMAE}_{\text {dec}}(\hat{L}; \epsilon _d, s_d)\), where \(\text {DMAE}_{\text {dec}}(\cdot )\) denotes DMAE’s decoding operation, and \(s_d\) indicates the number of times that the U-Net is called to generate \(\hat{w}\) from the latent music representation \(\hat{L}\) and starting noise \(\epsilon _d\). DMAE employs the DDIM (Denoising Diffusion Implicit Models) method [32] to restore the audio, using a distinct U-Net architecture compared to the one utilized in the subsequent Latent Dance-to-Audio Diffusion Generator. It is worth noting that the DMAE encoder and decoder are frozen in our model training.
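The two roles of the frozen DMAE can be sketched as follows. The `dmae` object and its `encode`/`decode` methods are placeholders standing in for the pre-trained model’s interface, not the actual released API.

```python
import torch

def compress_gt_music(dmae, magnitude: torch.Tensor) -> torch.Tensor:
    """Training-time role: encode the GT magnitude spectrum m_w into the latent L."""
    with torch.no_grad():                 # the DMAE encoder stays frozen during training
        return dmae.encode(magnitude)     # ~1/64 the size of the original audio

def reconstruct_music(dmae, latent_hat: torch.Tensor, s_d: int = 50) -> torch.Tensor:
    """Inference-time role: DDIM-decode a generated latent L_hat back to a waveform."""
    with torch.no_grad():                 # the DMAE decoder is likewise frozen
        return dmae.decode(latent_hat, num_steps=s_d)   # s_d decoder U-Net calls
```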

The Latent Dance-to-Audio Diffusion Generator is trained to generate the latent music representation \(\hat{L}\) that follows a similar distribution to the ones \(L = \text {DMAE}_{\text {enc}}(m_w)\) generated by the DMAE.

During the training process, the Latent Dance-to-Audio Diffusion Generator uses the v-objective diffusion [33] with a 1D U-Net architecture [16]. The GT latent music representation L is corrupted with a random amount of noise, and the U-Net is trained to remove the noise. The noising process is represented as \(L_{\sigma _t} = \alpha _{\sigma _t} L_0 + \beta _{\sigma _t} \epsilon\), where \(L_{\sigma _t}\) represents the data distribution at a given noise level, \(\epsilon\) is random noise, \(L_0\) is drawn from the distribution of \(L = \text {DMAE}_{\text {enc}}(m_w)\), and \(\sigma _t \in [0,1]\) is the noise schedule. As the U-Net denoises the signal, the motion representation M of the dance is provided as a prompt. Specifically, the U-Net model estimates \(\hat{v}_{\sigma _t} = f(L_{\sigma _t}, \sigma _t, M)\), where \(f(\cdot )\) represents the U-Net taking the noise level \(\sigma _t\), the noisy data \(L_{\sigma _t}\), and the previously obtained dance motion representation M as inputs. \(v_{\sigma _t}\) is defined as the derivative of \(L_{\sigma _t}\) with respect to \(\sigma _t\):

$$\begin{aligned} v_{\sigma _t} = \frac{\partial L_{\sigma _t}}{\partial \sigma _t} = \alpha _{\sigma _t} \epsilon - \beta _{\sigma _t} L_0 \end{aligned}$$
(9)

where \(\alpha _{\sigma _t}\) is defined as the cosine of the angle \(\alpha _{\sigma _t} = \cos (\phi _t)\), and \(\beta _{\sigma _t}\) is the sine of the angle \(\beta _{\sigma _t} = \sin (\phi _t)\), with \(\phi _t\) representing the radian measure of the current noise level \(\sigma _t\) as \(\phi _t = \frac{\pi }{2} \sigma _t\). Through \(v_{\sigma _t}\) we can predict the clean data distribution \(\hat{L}_0\) and the current noise level \(\epsilon _{\sigma _t}\), which in turn allows us to predict \(L_{\sigma _{t-1}}\), thereby achieving the denoising process of reducing the noise level from \(\sigma _t\) to \(\sigma _{t-1}\):

$$\begin{aligned} \hat{v}_{\sigma _t} = f(L_{\sigma _t}, \sigma _t, M) \end{aligned}$$
(10)
$$\begin{aligned} \hat{L}_0 = \alpha _{\sigma _t} L_{\sigma _t} - \beta _{\sigma _t} \hat{v}_{\sigma _t} \end{aligned}$$
(11)
$$\begin{aligned} \hat{\epsilon }_{\sigma _t} = \beta _{\sigma _t} L_{\sigma _t} + \alpha _{\sigma _t} \hat{v}_{\sigma _t} \end{aligned}$$
(12)
$$\begin{aligned} \hat{L}_{\sigma _{t-1}} = \alpha _{\sigma _{t-1}} \hat{L}_0 + \beta _{\sigma _{t-1}} \hat{\epsilon }_{\sigma _t} \end{aligned}$$
(13)

Equations 10, 11, and 12 illustrate how the model uses the predicted \(\hat{v}_{\sigma _t}\) to recover the clean data distribution \(\hat{L}_0\) and the current level of added noise \(\hat{\epsilon }_{\sigma _t}\). Equations 12 and 13 combine the predicted noise \(\hat{\epsilon }_{\sigma _t}\) with the recovered clean data \(\hat{L}_0\) to obtain \(\hat{L}_{\sigma _{t-1}}\). The goal of model training is to enable the U-Net to predict the true \(v_{\sigma _t} = \alpha _{\sigma _t} \epsilon - \beta _{\sigma _t} L_0\) as accurately as possible using \(f(L_{\sigma _t}, \sigma _t, M)\), that is, to minimize the L2 norm between the model-predicted \(\hat{v}_{\sigma _t}\) and the true \(v_{\sigma _t}\). The diffusion process loss function \(\mathcal {L}_{\text {diff}}\) is shown in Eq. 16.
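The corruption of Eq. 9 and the single denoising step of Eqs. 10–13 can be sketched as below; `unet` is a placeholder for the conditional 1D U-Net, assumed to be callable with the noisy latent, the noise level, and the motion prompt. Note that the last line re-noises with the plus sign, consistent with the forward corruption \(L_{\sigma _t} = \alpha _{\sigma _t} L_0 + \beta _{\sigma _t} \epsilon\).

```python
import math
import torch

def corrupt(L0: torch.Tensor, sigma_t: float, eps: torch.Tensor):
    """Corrupt a clean latent L0 to noise level sigma_t and return the v-target (Eq. 9)."""
    phi = 0.5 * math.pi * sigma_t                        # phi_t = (pi / 2) * sigma_t
    alpha, beta = math.cos(phi), math.sin(phi)
    L_sigma = alpha * L0 + beta * eps                    # forward noising
    v_target = alpha * eps - beta * L0                   # Eq. 9
    return L_sigma, v_target

def denoise_step(unet, L_sigma, sigma_t: float, sigma_prev: float, motion_repr):
    """One reverse step from sigma_t to sigma_prev using the predicted v (Eqs. 10-13)."""
    a_t, b_t = math.cos(0.5 * math.pi * sigma_t), math.sin(0.5 * math.pi * sigma_t)
    a_p, b_p = math.cos(0.5 * math.pi * sigma_prev), math.sin(0.5 * math.pi * sigma_prev)
    v_hat = unet(L_sigma, sigma_t, motion_repr)          # Eq. 10
    L0_hat = a_t * L_sigma - b_t * v_hat                 # Eq. 11: recovered clean latent
    eps_hat = b_t * L_sigma + a_t * v_hat                # Eq. 12: recovered noise
    return a_p * L0_hat + b_p * eps_hat                  # Eq. 13: re-noise to the lower level
```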

During the inference process, the Latent Dance-to-Audio Diffusion Generator generates the latent music representation \(\hat{L}\) under the guidance of the dance motion representation M. The generation function \(\hat{L} = \text {gen}(M, \epsilon , s)\) uses DDIM [32] sampling and calls U-Net s times to generate an approximate latent \(\hat{L}\) from the dance motion representation M and starting noise \(\epsilon\).

The complete process to generate a matching music waveform \(\hat{w}\) from the motion representation M is formulated as follows:

$$\begin{aligned} \hat{w} = \text {DMAE}_{\text {dec}}\left( \text {gen}(M, \epsilon , s), \epsilon _d, s_{d}\right) \end{aligned}$$
(14)

During the training process, we use both the GT (Ground Truth) latent music representation compressed from the GT music and the GT dance genre as supervision to train our model. Therefore, we only need to use the compression part of DMAE to encode the audio into the latent music representation, without using its decoding part. During the inference process, the model only requires the input of dance motion information to generate matching music.
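Putting the pieces together, a rough sketch of the inference path of Eq. 14: gen(M, ε, s) iterates the denoising step of Eqs. 10–13 over a decreasing noise schedule, and the result is handed to the DMAE decoder. The `unet` callable, the linear σ schedule, and the `dmae.decode` name are assumptions made for illustration.

```python
import math
import torch

@torch.no_grad()
def generate_latent(unet, motion_repr, latent_shape, s: int = 50) -> torch.Tensor:
    """gen(M, eps, s): start from pure noise and call the U-Net s times (DDIM-style)."""
    L_hat = torch.randn(latent_shape)                        # starting noise epsilon
    sigmas = torch.linspace(1.0, 0.0, s + 1).tolist()        # assumed linear noise schedule
    for sigma_t, sigma_prev in zip(sigmas[:-1], sigmas[1:]):
        a_t, b_t = math.cos(0.5 * math.pi * sigma_t), math.sin(0.5 * math.pi * sigma_t)
        a_p, b_p = math.cos(0.5 * math.pi * sigma_prev), math.sin(0.5 * math.pi * sigma_prev)
        v_hat = unet(L_hat, sigma_t, motion_repr)            # Eq. 10, conditioned on M
        L0_hat = a_t * L_hat - b_t * v_hat                   # Eq. 11
        eps_hat = b_t * L_hat + a_t * v_hat                  # Eq. 12
        L_hat = a_p * L0_hat + b_p * eps_hat                 # Eq. 13
    return L_hat

# Eq. 14, end to end (dmae.decode stands in for DMAE_dec):
# w_hat = dmae.decode(generate_latent(unet, M, latent_shape, s), num_steps=s_d)
```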

2.3 Training objective

The Latent Dance-to-Audio Diffusion Generator in the Music Diffusion Generation Module employs v-objective diffusion [33, 34]. The U-Net estimates \(\hat{v}_{\sigma _t} = f(L_{\sigma _t}; \sigma _t, M)\) by minimizing the following objective:

$$\begin{aligned} \mathcal {L}_{\text {diff}} = E_{t \sim [0,1], \sigma _t} [\Vert \hat{v}_{\sigma _t} - v_{\sigma _t} \Vert _2^2] \end{aligned}$$
(15)
$$\begin{aligned} = \mathbb {E}_{(t \sim [0,1], \sigma _t, L_{\sigma _t})} \left[ \Vert f(L_{\sigma _t}; \sigma _t, M) - v_{\sigma _t} \Vert _2^2\right] \end{aligned}$$
(16)

where \(v_{\sigma _t} = \frac{\partial L_{\sigma _t}}{\partial \sigma _t} = \alpha _{\sigma _t} \epsilon - \beta _{\sigma _t} L_0\) with \(\alpha _{\sigma _t} = \cos (\phi _t)\), \(\beta _{\sigma _t} = \sin (\phi _t)\), \(\phi _t = \frac{\pi }{2} \sigma _t\), and M denotes the dance motion representation. \(\mathcal {L}_{\text {diff}}\) serves as the training loss for the Music Diffusion Generation Module.

In addition to the generation loss, we propose a classification loss for the Motion Encoder Module, enhancing the module’s ability to differentiate dance types when extracting motion features. The training loss \(\mathcal {L}_{\text {motion}}\) for the Motion Encoder Module is as follows:

$$\begin{aligned} \mathcal {L}_{\text {cls}} = -\sum \limits _{i=1}^{C} C_i \log \mathbf{Z}_{\text {class},i} \end{aligned}$$
(17)
$$\begin{aligned} \mathcal {L}_{\text {motion}} = \alpha \mathcal {L}_{\text {cls}} + (1-\alpha ) \mathcal {L}_{\text {diff}} \end{aligned}$$
(18)

where \(\mathcal {L}_{\text {cls}}\) is the cross-entropy loss between the classification head output \(\mathbf{Z}_{\text {class}}\) and the ground-truth dance category, with \(C_i\) the one-hot ground-truth label and \(\mathbf{Z}_{\text {class},i}\) the predicted probability of the ith of the C dance categories, and \(\alpha\) is used to adjust the ratio between the classification loss \(\mathcal {L}_{\text {cls}}\) and the generative loss \(\mathcal {L}_{\text {diff}}\).
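A sketch of how the losses of Eqs. 15–18 can be computed in one training step; the tensors and the `unet` callable are placeholders, and the weight α = 0.1 follows the setting reported in Sect. 3.1.

```python
import math
import torch
import torch.nn.functional as F

def training_losses(unet, motion_repr, genre_logits, genre_target, L0, alpha: float = 0.1):
    """Eqs. 15-18: v-objective diffusion loss plus the dance-genre cross-entropy."""
    # Diffusion loss (Eqs. 15-16): sample a noise level, corrupt L0, regress v with MSE.
    sigma_t = torch.rand(L0.shape[0], 1, 1)                  # one noise level per sample
    phi = 0.5 * math.pi * sigma_t
    eps = torch.randn_like(L0)
    L_sigma = torch.cos(phi) * L0 + torch.sin(phi) * eps
    v_target = torch.cos(phi) * eps - torch.sin(phi) * L0
    loss_diff = F.mse_loss(unet(L_sigma, sigma_t, motion_repr), v_target)

    # Classification loss (Eq. 17): cross-entropy against the GT dance genre.
    loss_cls = F.cross_entropy(genre_logits, genre_target)

    # Eq. 18: weighted combination used to train the Motion Encoder Module.
    loss_motion = alpha * loss_cls + (1 - alpha) * loss_diff
    return loss_motion, loss_diff
```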

3 Experiments

3.1 Model architecture and hyperparameters

For the Motion Encoder Module, we utilize a Transformer with 6 layers, 12 attention heads, and a hidden dimension of 219, incorporating a GELU activation function and a dropout rate of 0.1. The scaling factor \(\alpha\) between the classification loss \(\mathcal {L}_{\text {cls}}\) and the music generative loss \(\mathcal {L}_{\text {diff}}\) is set to 0.1.

The Latent Dance-to-Audio Diffusion Generator is configured with a 6-layer nested U-Net structure for denoising, with channel counts increasing at each layer (128, 256, 512, 512, 1024, 1024). Each layer undergoes 2× downsampling, except for the first layer. Attention blocks are applied at the 3rd, 4th, 5th, and 6th layers, skipping the first two blocks for further downsampling before sharing information over the entire latent space. Cross-attention blocks are used at all resolutions. For both attention and cross-attention, 64 head features and 12 heads per layer are employed.

In our experiment, the input of the model contains a dance motion sequence with 356 frames (5.9 s) and a music sequence with length \(2^{18}\) (5.9 s at 44.1 kHz), where the two sequences are aligned on the first frame. We combine the 9-dimensional rotation matrix representation for all 24 joints, along with a 3-dimensional global translation vector, resulting in a 219-dimensional motion feature. We train the Motion Encoder Module using AdamW [35] with \(\beta _1 = 0.5\), \(\beta _2 = 0.9\), and a learning rate of \(1 \times 10^{-3}\). Simultaneously, we use another AdamW to train the Latent Dance-to-Audio Diffusion Generator, where \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), the learning rate is \(1 \times 10^{-4}\), weight decay is \(1 \times 10^{-3}\), and the batch size is 6. The Motion Encoder Module has 682,748 trainable parameters, while the Latent Dance-to-Audio Diffusion Generator has 246,215,014 trainable parameters. The model was trained for 1800 epochs on an NVIDIA RTX 2080ti GPU (11GB), with a total training time of 25 h.
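For reference, a short sketch of the 219-dimensional motion feature construction (24 joint rotation matrices plus a global translation) and the two AdamW optimizers with the settings above; the module objects are mere placeholders.

```python
import torch
import torch.nn as nn

# Per-frame motion feature: 24 joint rotation matrices (9 values each) + a 3D global
# translation = 24 * 9 + 3 = 219 dimensions, over 356 frames (5.9 s at 60 FPS).
rotations = torch.randn(356, 24, 3, 3)
translation = torch.randn(356, 3)
motion = torch.cat([rotations.reshape(356, -1), translation], dim=-1)
print(motion.shape)                               # torch.Size([356, 219])

# The two optimizers with the reported settings; the modules here stand in for the
# Motion Encoder Module and the Latent Dance-to-Audio Diffusion Generator.
motion_encoder, generator = nn.Linear(219, 219), nn.Linear(219, 219)
opt_motion = torch.optim.AdamW(motion_encoder.parameters(), lr=1e-3, betas=(0.5, 0.9))
opt_generator = torch.optim.AdamW(generator.parameters(), lr=1e-4,
                                  betas=(0.9, 0.999), weight_decay=1e-3)
```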

3.2 Dataset

We demonstrate the effectiveness of our method through experiments conducted on AIST++ dataset [2]. AIST++ dataset is a subset of the AIST dataset, enriched with 3D motion annotations. We adhere to the official cross-modality data splits for training, validation, and testing, with 980, 20, and 20 videos allocated to each, respectively. This ensures the exclusion of duplicated music segments. The database includes 10 street dance genres: “Break” (BR), “Pop” (PO), “Lock” (LO), “Waack” (WA), “Middle Hip-Hop” (MH), “LA-style HipHop” (LH), “House” (HO), “Krump” (KR), “Street Jazz” (JS), and “Ballet Jazz” (JB).

3.3 Evaluation criteria

We evaluate the generated music from both objective and subjective perspectives using publicly available metrics [2, 8]. In addition to comparing with previous studies [3, 6, 8, 28, 36], we also compare with the paired GT music from the test set and with randomly selected GT music (GT random) from the test set as references.

Beat Coverage Scores (BCS) and Beats Hit Scores (BHS) [8] are commonly used to assess the alignment between the beats of the generated music and the ground truth. We utilize the beat tracking function from the librosa library [37] to detect beats in both the generated music and the GT. By comparing these beats, we evaluate the rhythmic consistency of the music. Specifically, we denote the number of detected beats in the generated music as \(B_g\), the total number of beats in the GT as \(B_t\), and the number of aligned beats between the generated music and the GT as \(B_a\). BCS is defined as \(B_g / B_t\), and BHS is defined as \(B_a / B_t\). The tolerance threshold for alignment is set to 5 ms.
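A sketch of how BCS and BHS can be computed with librosa’s default beat tracker, assuming both clips contain at least one detected beat; the exact beat-tracking settings used in the paper may differ.

```python
import numpy as np
import librosa

def beat_scores(gen_audio: np.ndarray, gt_audio: np.ndarray, sr: int = 44100, tol: float = 0.005):
    """Compute BCS = B_g / B_t and BHS = B_a / B_t with a 5 ms alignment tolerance."""
    _, beats_gen = librosa.beat.beat_track(y=gen_audio, sr=sr, units='time')
    _, beats_gt = librosa.beat.beat_track(y=gt_audio, sr=sr, units='time')
    b_g, b_t = len(beats_gen), len(beats_gt)
    # A generated beat counts as aligned if some GT beat lies within the tolerance window.
    b_a = int(sum(np.min(np.abs(beats_gt - b)) <= tol for b in beats_gen))
    return b_g / b_t, b_a / b_t   # (BCS, BHS)
```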

Beat Alignment Score (BAS) can be used to assess the correlation between the generated music and 3D dance motions. While BCS and BHS evaluate the relationship between the generated music and the GT music, the correlation between the generated music and the 3D dance motions is equally important for evaluating music generation performance. We therefore adopt the Beat Alignment Score (BAS), proposed in the music-conditioned 3D dance generation task [2], to evaluate the coordination between motion beats and music beats. Music beats are extracted using the librosa library [37], and motion beats are computed as the local minima of the dynamic velocity, as shown in Fig. 4. BAS averages, over all motion beats, a Gaussian-weighted score of the distance between each motion beat and its nearest music beat. Specifically, our Beat Alignment Score is defined as:

$$\begin{aligned} \text {BAS} = \frac{1}{m} \sum \limits _{i=1}^m \exp \left( -\frac{{\min _{\forall t_j^y \in B^y} \Vert t_i^x-t_j^y \Vert ^2}}{2\sigma ^2}\right) \end{aligned}$$
(19)

where \(B^x=\{t_i^x\}\) are the motion beats, \(B^y=\{t_j^y\}\) are the music beats, and \(\sigma\) is a parameter to normalize sequences with different FPS. We set \(\sigma =3\) in our experiments because the FPS of all our experimental sequences is 60. To calibrate the results, we compute the BAS between the 3D motion and the paired GT music, as well as the BAS between the 3D motion and a randomly selected GT music clip from the test data. A BAS closer to that of the GT indicates better alignment between the generated music beats and the dance beats. The GT random serves only as a reference value in this context.
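The BAS of Eq. 19 reduces to a few lines of NumPy; the beat times in the toy call below are made-up frame indices at 60 FPS, used only to show the usage.

```python
import numpy as np

def beat_alignment_score(motion_beats, music_beats, sigma: float = 3.0) -> float:
    """Eq. 19: mean Gaussian-weighted distance from each motion beat to its
    nearest music beat (beat times in frames at 60 FPS, sigma = 3)."""
    motion_beats = np.asarray(motion_beats, dtype=float)
    music_beats = np.asarray(music_beats, dtype=float)
    dists = np.abs(motion_beats[:, None] - music_beats[None, :]).min(axis=1)
    return float(np.mean(np.exp(-dists ** 2 / (2 * sigma ** 2))))

print(beat_alignment_score([30, 90, 150], [32, 88, 160]))   # toy beat frames
```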

The subjective evaluation follows the method of [6] and assesses the generated music videos in terms of coherence, richness, and noise. Specifically, we sequentially presented human evaluators with videos composed of the same dance but with music generated using different methods. The human evaluators were asked to give a score between 1 and 5; a higher score indicates that the evaluators found the music more appropriate for the given dance video. The specific metrics are:

  • Coh (coherence): higher scores indicate better coherence between dance and music;

  • Noi (noise): higher scores indicate less noise in the video’s music; and

  • Ric (richness): higher scores indicate richer timbres in the video’s music.

A total of 15 human evaluators were involved in the evaluation process. Each evaluator watched music-dance videos generated by 6 different models, as well as videos consisting of dance paired with GT and GT random music. Three different dance videos from each model were presented sequentially, after which the participants gave a score. The order of the videos from different models was randomized. The final score for each model was calculated as the average score across all participants.

3.4 Comparison with state-of-the-art methods

We compare the proposed Dance2Music-Diffusion method with state-of-the-art methods: Controllable Music Transformer (CMT) [36], Dance2Music [3], Dance2MIDI [6], D2M-GAN [8], CDCD [28], and DMD [10]. CMT is a transformer-based model that utilizes MIDI representation to generate background music for videos. Dance2Music utilizes dance similarity matrices as input to generate five types of note sequences as the soundtrack for dance. Similar to CMT, Dance2MIDI takes motion keypoints as input and utilizes a Transformer decoder to generate MIDI event sequences with a conditional probability model. Both D2M-GAN and CDCD utilize vector quantization as the audio representation for audio synthesis; D2M-GAN employs the network structure of VQ-GAN, while CDCD employs a conditional discrete contrastive diffusion model. DMD, similar to our method, adopts a latent diffusion model and a pre-trained audio vocoder to generate music. However, our approach differs in several key aspects: we use a transformer-based Motion Encoder Module, which enhances the model’s understanding of dance movements, and we employ different audio representations and decoders for reconstructing the audio.

3.4.1 Objective evaluation

The objective evaluation results in Table 1 show that the proposed D2M-Diffusion method achieves the highest BHS among the compared state-of-the-art methods, indicating that our method recalls more music beats than the others. In terms of Beat Coverage Score (BCS), our method outperforms CMT, Dance2Music, and Dance2MIDI, which shows that the number of beats in our generated music is closer to the number of GT beats. D2M-GAN (BCS 0.7) and CDCD (BCS 0.93) generate fewer beats than the GT, whereas our method (BCS 1.25) generates more beats than the GT. Our D2M-Diffusion model also obtains the BAS value closest to that of the ground truth, indicating that, compared to D2M-GAN, CDCD, and DMD, our method generates the music best matched to the motion.

Table 1 Objective evaluation results on the AIST++ dataset

The model parameters and training times are shown in Table 2; the training time is measured on the same NVIDIA RTX 2080 Ti. Because the music generated by symbolic music generation models (CMT, Dance2MIDI, Dance2Music) is less rich and their model sizes are significantly smaller than those of audio generation models (D2M-GAN, CDCD, DMD, and our model), we only compare the training time and model size of the audio generation models. It can be observed that the D2M-GAN and CDCD models have long training times due to their use of VQ audio representations. Compared to DMD, which also uses latent diffusion to generate music, our model has a larger number of parameters because the Motion Encoder Module we designed is more complex. Nevertheless, our model requires less training time because it uses the DMAE encoder to obtain the ground-truth latent music representations before training, whereas DMD has to convert music representations into audio during training. Additionally, the classification loss employed in our model makes training more efficient. Therefore, even though our model has more parameters than DMD, it still trains faster.

Table 2 Details of the audio generation models

In Fig. 3, we visualize the spectrogram and rhythm points of the audio. It can be observed that our method produces audio closer to real audio compared to D2M-GAN and CDCD.

Fig. 3

Visualization of the audio spectrogram and beat events. Compared to D2M-GAN and CDCD, the audio generated by our model contains more high-frequency information, indicating a wider range of frequencies and a richer timbre. Additionally, our method’s beats align more closely with those of the ground truth music

In Fig. 4, we show an example of beats alignment between the generated music and dance. It can be observed that the audio beats generated by our method are more aligned with dance motions compared to D2M-GAN, CDCD, and DMD. The D2M-GAN and CDCD methods use VQ audio representations, resulting in less alignment between the generated music beats and dance motions. Both DMD and our method utilize latent audio representations, but the experimental results show that our method achieves closer beat alignment with the ground truth (GT). This demonstrates that the music generated by our method is more harmonious with the dance.

Fig. 4

Visualization of the BAS between the generated music and dance movements. The blue curve illustrates the movement speed of the dance actions, the green dashed lines represent the rhythm of the movements, and the red dashed lines indicate the rhythm of the generated music. Orange boxes highlight instances where the motion rhythm aligns with the music rhythm. Clearly, the synchronization between the music generated by our Dance2Music-Diffusion model and the motion rhythm is closer to that of the original audio

3.4.2 Subjective evaluation

The subjective evaluation results with error bars, shown in Table 3, indicate that our model achieves a music Coh score of 4.14, surpassing all other generative models. This suggests a significant advantage of our method in ensuring a harmonious match between the music and video content.

Table 3 Subjective evaluation results on the AIST++ dataset

Regarding the Noi metric, our model scores 3.79, trailing behind CMT, Dance2Music, and Dance2MIDI, whose MIDI- or note-based generation approaches introduce no additional noise beyond synthesizing music from notes. However, our model outperforms D2M-GAN and CDCD, which use VQ audio representations, and also surpasses DMD, which uses latent diffusion. The GT random music, randomly selected from the test set, may match the dance poorly; however, as real music, it still outperforms the model-generated audio in terms of noise quality.

In terms of Ric, Dance2Music and CMT lag behind due to their monotonous audio timbre, while D2M-GAN and CDCD improve the music richness somewhat but still fall short. The DMD model, which uses latent diffusion, achieves a commendable score of 4.14. Our method achieves the highest richness score of 4.29, demonstrating a higher level of richness and appeal than the other generated music. Combined with the Noi score, the subjective results indicate that the Music Diffusion Generation Module we designed produces high-quality music.

4 Ablation study

We conduct the following ablation experiments to study the effectiveness of key components of our model: the Motion Encoder Module, the classification head, and the latent diffusion. The performances are measured quantitatively using objective metrics.

To verify the Transformer-encoder-based Motion Encoder Module in our designed architecture, we replace this encoder with a convolutional neural network, a commonly used model in previous studies [8, 28]. The results are shown in Table 4: our method obtains 0.56 BHS and 0.225 BAS, while the CNN-based variant obtains 0.47 BHS and 0.215 BAS. The results indicate that our Motion Encoder Module extracts dance information from the motion sequences more effectively, thereby promoting harmony between the generated music and the dance movements.

Table 4 Ablation study on the Motion Encoder Module

We compare the results of removing the classification head and the corresponding classification loss \(L_{cls}\) from the Motion Encoder Module. The results in Table 5 show that the model without the classification head performs worse (0.54 BHS and 0.217 BAS) than the full model (0.56 BHS and 0.225 BAS). The classification head helps the Motion Encoder Module understand dance information, resulting in better alignment between the generated music and the original dance types.

Table 5 Ablation study on the classification head

To verify the latent representation of audio in the Music Diffusion Generation Module, we compare with a model that directly applies the audio waveform in the diffusion process. The results are shown in Table 6: the waveform model obtains 0.42 BHS and 0.196 BAS, while our architecture with latent diffusion performs better. Furthermore, the waveform model requires 118 h of training, whereas the model with latent representations requires only 25 h under the same hardware conditions, since the dimension of the latent representation is 1/64 of that of the audio waveform.

Table 6 Ablation study on the latent diffusion

5 Conclusion

In this paper, we propose Dance2Music-Diffusion, which integrates a diffusion method for generating music from dance. Our method uses a transformer-based architecture to extract dance features and combines them with a latent diffusion module to generate music that matches the dance movements. Experimental results show that our method excels in the coordination between music and dance, with significant advantages in music richness. However, a limitation of this study is that the experiments were conducted only on laboratory datasets. Future research could include collecting online dance videos to expand the dataset and validate the effectiveness of the model across a wider range of dance and music genres.

Availability of data and materials

The datasets are available in AIST++ [2, 38], repository: https://google.github.io/aistplusplus_datase.

References

  1. H.Y. Lee, X. Yang, M.Y. Liu, T.C. Wang, Y.D. Lu, M.H. Yang, J. Kautz, Dancing to music. Adv. Neural Inform. Process. Syst. 32, 3586–3596 (2019)

  2. R. Li, S. Yang, D.A. Ross, A. Kanazawa, in Proceedings of the IEEE/CVF International Conference on Computer Vision. Ai choreographer: Music conditioned 3d dance generation with aist++ (IEEE, Piscataway, NJ, 2021), pp. 13401–13412

  3. G. Aggarwal, D. Parikh, Dance2music: Automatic dance-driven music generation. arXiv preprint arXiv:2107.06252 (2021)

  4. C. Gan, D. Huang, P. Chen, J.B. Tenenbaum, A. Torralba, in European Conference on Computer Vision. Foley music: Learning to generate music from videos (Springer, Cham, 2020), pp. 758–775

  5. H.K. Kao, L. Su, in Proceedings of the 28th ACM International Conference on Multimedia. Temporally guided music-to-body-movement generation (ACM, New York, 2020), pp. 147–155

  6. B. Han, Y. Ren, Y. Li, Dance2midi: Dance-driven multi-instruments music generation. arXiv preprint arXiv:2301.09080 (2023)

  7. G. Loy, Musicians make a standard: The midi phenomenon. Comput. Music. J. 9(4), 8–26 (1985)


  8. Y. Zhu, K. Olszewski, Y. Wu, P. Achlioptas, M. Chai, Y. Yan, S. Tulyakov, in European Conference on Computer Vision. Quantized gan for complex music generation from dance videos (Springer, Cham, 2022), pp. 182–199

  9. S. Li, W. Dong, Y. Zhang, F. Tang, C. Ma, O. Deussen, T.Y. Lee, C. Xu, Dance-to-music generation with encoder-based textual inversion of diffusion models. arXiv preprint arXiv:2401.17800 (2024)

  10. V. Tan, J. Nam, J. Nam, J. Noh, in SIGGRAPH Asia 2023 Technical Communications. Motion to dance music generation using latent diffusion model (ACM, New York, 2023), pp. 1–4

  11. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. High-resolution image synthesis with latent diffusion models (IEEE, Piscataway, NJ, 2022), pp. 10684–10695

  12. P. Dhariwal, A. Nichol, Diffusion models beat gans on image synthesis. Adv. Neural Inform. Process. Syst. 34, 8780–8794 (2021)


  13. L. Zhang, A. Rao, M. Agrawala, in Proceedings of the IEEE/CVF International Conference on Computer Vision. Adding conditional control to text-to-image diffusion models (IEEE, Piscataway, NJ, 2022), pp. 3836–3847

  14. C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E.L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inform. Process. Syst. 35, 36479–36494 (2022)


  15. A. Razavi, A. Van den Oord, O. Vinyals, Generating diverse high-fidelity images with vq-vae-2. Adv. Neural Inform. Process. Syst. 32, 14866–14876 (2019)

  16. F. Schneider, O. Kamal, Z. Jin, B. Schölkopf, Moûsai: Text-to-music generation with long-context latent diffusion. arXiv preprint arXiv:2301.11757 (2023)

  17. Q. Huang, D.S. Park, T. Wang, T.I. Denk, A. Ly, N. Chen, Z. Zhang, Z. Zhang, J. Yu, C. Frank et al., Noise2music: Text-conditioned music generation with diffusion models. arXiv preprint arXiv:2302.03917 (2023)

  18. A. Agostinelli, T.I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325 (2023)

  19. H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, M.D. Plumbley, AudioLDM: Text-to-audio generation with latent diffusion models, in Proceedings of the 40th International Conference on Machine Learning (PMLR, Brookline, MA, 2023), pp. 21450–21474

  20. M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. Courville, Y. Bengio, Chunked autoregressive gan for conditional waveform synthesis. arXiv preprint arXiv:2110.10139 (2021)

  21. C. Donahue, J. McAuley, M. Puckette, Synthesizing audio with generative adversarial networks. arXiv preprint arXiv:1802.04208 (2018)

  22. L.C. Yang, S.Y. Chou, Y.H. Yang, Midinet: A convolutional generative adversarial network for symbolic-domain music generation. arXiv preprint arXiv:1703.10847 (2017)

  23. K. Deng, A. Bansal, D. Ramanan, Unsupervised audiovisual synthesis via exemplar autoencoders. arXiv preprint arXiv:2001.04463 (2020)

  24. P. Dhariwal, H. Jun, C. Payne, J.W. Kim, A. Radford, I. Sutskever, Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341 (2020)

  25. B. Yu, P. Lu, R. Wang, W. Hu, X. Tan, W. Ye, S. Zhang, T. Qin, T.Y. Liu, Museformer: Transformer with fine-and coarse-grained attention for music generation. Adv. Neural Inform. Process. Syst. 35, 1376–1388 (2022)


  26. J. Ens, P. Pasquier, Mmm: Exploring conditional multi-track music generation with the transformer. arXiv preprint arXiv:2008.06048 (2020)

  27. Y.J. Shih, S.L. Wu, F. Zalkow, M. Müller, Y.H. Yang, Theme transformer: Symbolic music generation with theme-conditioned transformer. IEEE Trans. Multimedia 25, 3495–3508 (2022)

  28. Y. Zhu, Y. Wu, K. Olszewski, J. Ren, S. Tulyakov, Y. Yan, Discrete contrastive diffusion for cross-modal music and image generation. arXiv preprint arXiv:2206.07771 (2022)

  29. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need. Adv. Neural Inform. Process. Syst. 30, 5998–6008 (2017)

  30. M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, M.J. Black, in Seminal Graphics Papers: Pushing the Boundaries. Smpl: A skinned multi-person linear model, vol. 2 (ACM, New York, 2023), pp. 851–866

  31. J. Devlin, M.W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  32. J. Song, C. Meng, S. Ermon, Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  33. T. Salimans, J. Ho, Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 (2022)

  34. F. Schneider, Archisound: Audio generation with diffusion. arXiv preprint arXiv:2301.13267 (2023)

  35. I. Loshchilov, F. Hutter, Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  36. S. Di, Z. Jiang, S. Liu, Z. Wang, L. Zhu, Z. He, H. Liu, S. Yan, in Proceedings of the 29th ACM International Conference on Multimedia. Video background music generation with controllable music transformer (ACM, New York, 2021), pp. 2037–2045

  37. B. McFee, C. Raffel, D. Liang, D.P. Ellis, M. McVicar, E. Battenberg, O. Nieto, in SciPy. librosa: Audio and music signal analysis in python (SciPy, Austin, TX, 2015), pp. 18–24

  38. S. Tsuchida, S. Fukayama, M. Hamasaki, M. Goto, in ISMIR. Aist dance video database: Multi-genre, multi-dancer, and multi-camera database for dance information processing, vol. 1 (ISMIR, Delft, Netherlands, 2019), p. 6


Acknowledgements

The authors would like to express their gratitude to the teams responsible for the open-source DMAE model and the AIST++ dataset, which have been instrumental in facilitating this research. Their contributions have significantly enhanced the quality and depth of this study.

Funding

The authors would like to thank the National Natural Science Foundation of China (Grant 61702466) for funding. This work is also supported by the Fundamental Research Funds for the Central Universities under grants CUC230B028 and CUCAI24002.

Author information


Contributions

CYZ conceived of the study, participated in its design and coordination, performed the statistical analysis, and helped to draft the manuscript. YH participated in the design of the study, interpretation of data, and review of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yan Hua.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Zhang, C., Hua, Y. Dance2Music-Diffusion: leveraging latent diffusion models for music generation from dance videos. J AUDIO SPEECH MUSIC PROC. 2024, 48 (2024). https://doi.org/10.1186/s13636-024-00370-6

