MetaMGC: a music generation framework for concerts in metaverse

In recent years, there has been a national craze for metaverse concerts. However, existing metaverse concert efforts often focus on immersive visual experiences and neglect the musical and aural experience. Yet for concerts, it is the music itself and the immersive listening experience that deserve the most attention. Therefore, enhancing intelligent and immersive musical experiences is essential for the further development of the metaverse. With this in mind, we propose a metaverse concert generation framework that spans intelligent music generation, stereo conversion, and sound field design for virtual concert stages. First, combining the ideas of reinforcement learning and value functions, we improve the Transformer-XL music generation network and train it on all the music in the POP909 dataset. Experiments show that both improved algorithms have advantages over the original method in terms of objective and subjective evaluation metrics. In addition, this paper validates a neural rendering method that generates spatial audio with a fully convolutional binaural neural network. This purely data-driven end-to-end model proves more reliable than traditional spatial audio generation methods such as HRTF-based rendering. Finally, we propose a metadata-based audio rendering algorithm to simulate real-world acoustic environments.


Introduction
With the global explosion of metaverse discussions and the huge impact of the COVID-19 pandemic on the offline music performance industry, the virtual world has emerged as an ideal stage for music concerts and festivals. However, current metaverse concerts focus heavily on embodying the concepts of virtual reality and the digital twin [1,2]. In addition, holding a metaverse concert still requires a long production pipeline from initial planning to performance, which is contrary to the real-time nature of the metaverse.
The metaverse concert also involves technologies such as virtual stages, motion capture, and digital humans, but these virtual performance-related fields are already very mature. By the nature of concerts, it is the music and the immersive online listening experience that are more worthy of our attention.
In 2019, Danowski et al. proposed Connexion [3] which surrounds the audience with an eight-channel sound system that immerses the audience from all directions. More recently, PatchXR allows artists to turn a place into an instant music studio by providing spatial equivalents of a visual programming engine to create and perform music on a spatial level with sound building blocks [4]. The virtual reality concert of the Philharmonic Orchestra conducted by Esa-Pekka allowed the audience to move between each instrument group and even freely through the church [5], enabling an interactive experience.
However, most existing research on metaverse-based sound field design for music composition has considered either real-time online music performance or immersive online experience, without addressing both real-time operation and intelligence in metaverse concerts. Therefore, we propose a framework to efficiently and intelligently implement music generation and immersive sound field twinning for concerts in the metaverse, namely MetaMGC (Music Generation Framework for Concerts in Metaverse). It consists of three main parts: (1) a music generation part that enables improvised accompaniment of a virtual orchestra for metaverse concerts; (2) a digital audio twin part that enables virtual sound field reconstruction for metaverse concerts [6]; and (3) an audio rendering part that realizes the virtual soundstage production of the metaverse concert. For these three parts, we investigate the Transformer-XL music generation model combined with two methods: Monte Carlo search and deep reinforcement learning. We also propose a reward function that controls music generation and strengthens the music theory constraints on the generation network. In addition, this paper investigates an end-to-end neural network synthesis method that models the subtle effects shaping the final output signal through a temporal convolutional neural network module. We conducted extensive experiments on the POP909 dataset [7] and HRTF data [8] to evaluate the effectiveness and generality of the proposed method.
To summarize, the main contributions are as follows:
• Optimization of Transformer-XL by Monte Carlo and DQN methods based on reinforcement learning.
• A value function-based music generation system to intelligently generate accompaniment.
• An end-to-end neural synthesis method capable of synthesizing natural and accurate binaural audio.
• A metadata-based audio rendering system and sound field reconstruction in UE4 [9]. Our tests indicate that the audio rendering system is effective.

Automatic music generation
Eck et al. first used LSTM for music production, improvising well-paced and structured blues music based on short recordings. In 2012, Boulanger et al. [10] proposed an RNN-RBM model that outperformed traditional models in generating polyphonic music from different datasets. In 2016, Google Brain's Magenta team further improved the RNN's ability to learn long-term structure by proposing the MelodyRNN model [11]. Hadjeres et al. proposed Anticipation-RNN, which allows user-defined positional constraints to be enforced [12]. Johnson et al. proposed TP-LSTM-NADE and BALSTM, which incorporate a parallel set of weighted recurrent networks for polyphonic music prediction and composition [13].
With the development of deep learning techniques, powerful deep generative models such as VAE, GAN, and Transformer have gradually emerged. In 2015, Samuel et al. first proposed the Variational Autoencoder (VAE) [14]. Roberts et al. proposed MusicVAE [15], a hierarchical VAE model that captures the long-term structure of polyphonic music with good interpolation and reconstruction performance. Jia et al. [16] proposed a coupled latent variable model with a binary regularizer to implement improvised accompaniment generation. Yang et al. proposed MidiNet [17], a GAN-based network that generates music bar by bar, together with a new conditional mechanism for generating chord-based music. Yu et al. proposed a sequence generation framework, SeqGAN [18], which, by combining reinforcement learning techniques, successfully applied RNN-based GANs to music generation for the first time. In 2018, Dong et al. proposed the MuseGAN model [19], which is considered to be the first model to generate multi-track polyphonic music.
More recently, Anna Huang et al. successfully applied the Transformer to music generation for the first time [20]. Donahue et al. generated multi-instrument music using the Transformer and proposed a pre-training technique based on transfer learning [21]. Moreover, Huang et al. proposed a new music representation called REMI [22] and used the Transformer-XL sequence model [23] to generate popular piano music. Transformer-XL further improves on the original Transformer model.

Binaural audio generation
The majority of existing binaural audio generation techniques are based primarily on conventional digital signal processing (DSP). Head-related transfer functions (HRTFs) are measured in an anechoic chamber [24], and high-quality spatialization requires binaural recordings at nearly 10k different spatial locations [25]. To generate binaural audio, DSP-based renderers typically perform a series of convolutions with the component impulse responses [26].
More recently, neural network techniques have gained attention in audio generation as a result of the success of neural networks in speech synthesis [27]. Current neural network approaches focus primarily on frequency-domain models [28,29], and the greater difficulty of modeling the long-term dependencies of high-frequency audio signals has led to raw waveform models being neglected for a long time. With the success of WaveNet proposed by Van Den Oord et al. [30], direct waveform-to-waveform modeling has received tremendous attention and led to significant improvements in speech enhancement [31], denoising [32], speech synthesis [33] and music style translation [34].
The spatialization of audio with neural networks is now under way. A study by Gebru et al. [35] showed that HRTFs can be learned implicitly by neural networks trained on raw waveforms. Morgado et al. [36] worked on predicting spatial sound conditioned on visual information, but their work was limited to first-order ambisonics and did not exhaustively model binaural effects. More closely related is a series of papers by Gao and Grauman [37] on 2.5D visual sound, in which binaural audio is generated conditioned on video frame embeddings so that the location of the sound source can be effectively determined.

Approach
Our MetaMGC consists of three main systems (as shown in Fig. 1): a value function-based music generation system, an end-to-end neural synthesis-based system, and an audio rendering system. First, we train the Transformer-XL music generation network on the music MIDI dataset. The value function in reinforcement learning (Monte Carlo or DQN) is then combined with the network to improve it and to generate the mono MIDI events of the music. To create a spatially immersive experience, we convert the mono audio into spatial stereo audio using a neural network-based spatial audio twin system. Finally, we feed the generated stereo audio into an audio rendering system and present it on a virtual reality stage by building a digital twin sound field. The result is a metaverse concert in which intelligently generated music is rendered as immersive binaural audio.

Music generation based on value functions
In this subsection, we present our music generation network based on Transformer-XL, which transforms music theory rules into multiple reward functions to control the music generation process. The approach solves for the optimal value function in reinforcement learning, which makes the generated music more musical.

Music theory reward mechanism
To enable the network model to learn music theory, we quantify music theory knowledge given as textual descriptions and express it as a reward function that controls music generation during reinforcement learning. The music theory reward mechanism has two parts: a basic music theory reward (see Table 1) and a melody writing reward (see Table 2). In the basic music theory reward, $R_{m1}(s_{t:1}, a_t)$ checks whether the 4 generated pitches are the same; $R_{m2}(s_{t:1}, a_t)$ checks that the interval between two adjacent notes in a piece of music is no greater than an octave; and $R_{m3}(s_{t:1}, a_t)$ checks that the note lies within a predetermined pitch range. In this mechanism, $R$ denotes the total reward for the current time step $t$, and $a_{\max}$ and $a_{\min}$ are the highest and lowest notes set according to demand.
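The three basic music theory checks can be sketched as a single reward function. This is a minimal illustration only: the function name, the MIDI-number encoding of pitches, the default range, and the reward magnitudes `r_good` and `r_bad` are all placeholders chosen for the example, not the values used in the paper's tables.

```python
def basic_theory_reward(history, action, a_min=48, a_max=84,
                        r_good=0.1, r_bad=-0.8):
    """Sum three rule-based rewards for appending `action` to `history`.

    Pitches are MIDI note numbers. All numeric defaults are hypothetical.
    """
    notes = list(history) + [action]
    reward = 0.0

    # R_m1: penalize four identical pitches in a row.
    if len(notes) >= 4 and len(set(notes[-4:])) == 1:
        reward += r_bad

    # R_m2: adjacent notes should be no more than an octave (12 semitones) apart.
    if len(notes) >= 2:
        interval = abs(notes[-1] - notes[-2])
        reward += r_good if interval <= 12 else r_bad

    # R_m3: the new note should lie inside the configured range.
    reward += r_good if a_min <= action <= a_max else r_bad

    return reward
```

A stepwise continuation inside the range collects both positive terms, whereas a large leap outside the range collects both penalties.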

Transformer-XL with the Monte Carlo method
The Monte Carlo method uses complete, time-step-limited empirical trajectories and the information they contain to derive the average reward for each state. When the environment is unknown, the agent samples according to the policy $\pi$: from the starting state, it executes this policy for $T$ steps until it reaches the terminal state, thus obtaining an empirical trajectory, and then calculates the cumulative future discounted reward [38,39]. The Monte Carlo method uses the average future discounted cumulative reward $G$ over the empirical trajectories as the expectation of the state value. If a large enough sample of empirical trajectories is processed, the expected return of following the policy from state $s$, known as the state value function $v_\pi(s)$, can be estimated accurately. Since the pitch probability $P_{pitch}$ (the probability of each note) differs at generation time, a note that is selected with a greater probability in the distribution is taken to have a higher value. Thus, in combination with the music theory reward function, a pitch reward value $G$ is defined for each time step, and the optimal pitch trajectory is found from the reward value $G_t$ returned by the state at a given moment.
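The Monte Carlo estimate described here can be illustrated with the textbook every-visit form, in which the return is $G_t = r_t + \gamma G_{t+1}$ and the value of a state is the average return observed from it. The sketch below assumes that standard form; the discount factor, the trajectory encoding as `(state, reward)` pairs, and the function names are choices made for the example, and the paper's pitch-specific reward is not reproduced.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t for every step, computed backwards as G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def mc_state_values(trajectories, gamma=0.9):
    """Every-visit Monte Carlo: average the returns observed from each state.

    Each trajectory is a list of (state, reward) pairs, where `reward` is the
    reward received after acting in `state`.
    """
    totals, counts = {}, {}
    for traj in trajectories:
        states = [s for s, _ in traj]
        rewards = [r for _, r in traj]
        for s, g in zip(states, discounted_returns(rewards, gamma)):
            totals[s] = totals.get(s, 0.0) + g
            counts[s] = counts.get(s, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}
```

With more sampled trajectories the per-state averages converge toward $v_\pi(s)$, which is the property the method relies on.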

Transformer-XL with DQN method
The DQN model consists of four parts (Fig. 2): (1) data preprocessing, which sends the processed data to the generative network for training; (2) output, which, after training the Transformer-XL network (12 layers, each containing 8 Multihead Attention layers and 12 Feedforward layers), produces the probability distribution over all event indexes through the Softmax layer and selects the event index with the highest probability as the currently generated value; (3) the reward network, which combines the probability of the currently generated value with the sum of multiple music theory rewards; and (4) the DQN, which extracts all the indexes representing pitch events from the generated sequence, uses the reward network to calculate the reward for training the Q-network, controls the generated note sequence, and finally decodes the sequence to generate music.
Table 1 Basic music theory rewards (partial)
Intervals less than an octave: − 0.8
Within the set vocal range: 0.1
Not in the set vocal range: − 0.8

Table 2 Melodic writing incentives: (a) melodic interval reward; and (b) melodic tone towards reward (partial)
Step-in ascending: 0.5

The DQN method is a basic algorithm in deep reinforcement learning that makes the action value function $Q(s, a, \theta)$ converge to the optimal action value function $Q^*(s, a)$ by training the parameters $\theta$ [40].
The probability of note generation is combined with the rewards to obtain the current total reward value, whose two reward terms are the basic music theory reward and the melody writing reward, respectively.
The gradient of the objective function with respect to the parameters of the Q-network is computed and converted to an unbiased estimate, after which the Q-network parameters are updated. Finally, the model is updated by training the Q-network so that the Transformer-XL music generation network learns the basic music theory rules and the melody writing rules that constrain the notes, and thereby obtains the optimal policy.
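The update step can be illustrated with the standard one-step temporal-difference target used by DQN, $y = r + \gamma \max_{a'} Q(s', a')$. The sketch below is not the paper's network: a small table stands in for the LSTM Q-network, the target is held fixed during the step (a semi-gradient update), and the learning rate, discount factor, and function names are assumptions made for the example.

```python
import numpy as np


def td_target(reward, q_next, gamma=0.99, done=False):
    """One-step target: r if the episode ended, else r + gamma * max_a' Q(s', a')."""
    return reward if done else reward + gamma * float(np.max(q_next))


def q_update(Q, s, a, reward, s_next, lr=0.1, gamma=0.99, done=False):
    """Apply one update to the table Q in place and return the squared-error loss.

    The step follows the negative gradient of 0.5 * (y - Q[s, a])**2 with
    respect to Q[s, a], treating the target y as a constant.
    """
    y = td_target(reward, Q[s_next], gamma, done)
    td_error = y - Q[s, a]
    Q[s, a] += lr * td_error
    return 0.5 * td_error ** 2
```

In the full method the table lookup would be replaced by a forward pass of the Q-network, and the same error would be backpropagated through its parameters.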

End-to-end binaural audio synthesis
Given the source and listener positions and orientations $c_{1:T}$ for each time step, the single-channel input signal $x_{1:T}$ is converted into a binaural signal. The final system is shown in Fig. 3.
We convert a single-channel signal $x_{1:T}$ of length $T$ into a binaural signal $(y^{(l)}_{1:T}, y^{(r)}_{1:T})$, with the former representing the left-ear signal and the latter representing the right-ear signal. $x_t$, $y^{(l)}_t$, and $y^{(r)}_t$ each denote the scalar audio sample at time $t$. The conditioning signal $c_{1:T}$ represents the positions and orientations of the source and the listener. Our goal is to obtain a function that produces the binaural output from the input and conditioning signals within a temporal receptive field, with $c_t \in \mathbb{R}^{14}$ containing the three-dimensional positions of
WarpNet is a shallow temporal convolutional network with four layers and 64 channels per layer. We define the warped signal $x'_{1:T}$ as a linear interpolation of the source signal $x_{1:T}$ at $\lfloor\rho_t\rfloor$ and $\lceil\rho_t\rceil$. In practice two warp fields are generated, one for each of the two ears. We enforce the physical constraints using $\sigma$(warp): the term $\min(t, \rho_t)$ forces the $t$-th element of the warp field to be no larger than $t$ itself, which ensures causality, while $\max(\rho_{t-1}, \cdot)$ means that if an element was warped from $\rho_{t-1}$ to $(t-1)$, then the next element at position $t$ must be warped from $\rho_{t-1}$ or a subsequent position, which ensures monotonicity. Therefore, compared with related methods such as deformable convolution and spatial transformer networks, the neural time warping in this paper performs constrained warping for input signals of arbitrary length and directly simulates the physical phenomena of sound.
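The two constraints and the interpolation step can be sketched in a few lines of NumPy. This is an illustration under assumptions: the raw warp field is supplied as an array rather than predicted by WarpNet, time indices are 0-based, the recursion is read as $\rho_t = \max(\rho_{t-1}, \min(t, \rho_t))$ from the description, and the function names are hypothetical.

```python
import numpy as np


def constrain_warpfield(rho_raw):
    """Make a raw warp field causal (rho_t <= t) and monotone (rho_t >= rho_{t-1})."""
    rho = np.empty(len(rho_raw), dtype=float)
    prev = 0.0
    for t, r in enumerate(rho_raw):
        prev = max(prev, min(float(t), float(r)))
        rho[t] = prev
    return rho


def warp_signal(x, rho):
    """Linearly interpolate x at the fractional positions rho."""
    lo = np.floor(rho).astype(int)
    hi = np.ceil(rho).astype(int)
    frac = rho - lo
    return (1.0 - frac) * x[lo] + frac * x[hi]


x = np.arange(8, dtype=float)
rho = constrain_warpfield(np.array([0.0, 2.5, 0.3, 1.5, 10.0, 3.2, 3.0, 6.5]))
warped = warp_signal(x, rho)
```

Because the example signal is the identity ramp, the warped output equals the constrained warp field itself, which makes it easy to see that values never look ahead of the current sample and never move backwards.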

Conditional hyperconvolution
Inspired by the DSP formulation, we predict the convolutional weights and biases applied to the input $x_{1:T}$ of a given layer as a function of the conditional input $c_{1:T}$. The weights are generated from $c_{1:T}$, which contains physical information about the relationship between the sound source and the listener; $H^{(W)}$ and $H^{(b)}$ are small convolutional hypernetworks that receive $c_{1:T}$ as input and predict the convolutional weights and biases as output, respectively. Thus, the convolutional layer does not merely take a time series as input: its weights and biases themselves change over time.
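A time-varying convolution of this kind can be sketched as follows. To keep the example short, the hypernetworks are replaced by plain linear maps `A_w` and `A_b` from the conditioning vector to a kernel and a bias; this substitution, the single input and output channel, and the function name are assumptions for illustration, not the architecture of the paper.

```python
import numpy as np


def hyperconv1d(x, c, A_w, A_b):
    """Causal 1-D convolution whose kernel and bias depend on the conditioning.

    x:   (T,)   input signal
    c:   (T, D) conditioning vectors, one per time step
    A_w: (D, K) linear stand-in for the weight hypernetwork
    A_b: (D,)   linear stand-in for the bias hypernetwork
    """
    T = x.shape[0]
    K = A_w.shape[1]
    W = c @ A_w                      # (T, K): a separate kernel for each time step
    b = c @ A_b                      # (T,):   a separate bias for each time step
    x_pad = np.concatenate([np.zeros(K - 1), x])
    y = np.empty(T)
    for t in range(T):
        window = x_pad[t:t + K][::-1]   # x_t, x_{t-1}, ..., x_{t-K+1}
        y[t] = W[t] @ window + b[t]
    return y
```

With constant conditioning the layer reduces to an ordinary causal convolution; once the conditioning changes over time, so does the filter applied at each sample.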

Phase reconstruction using L2-loss
Training a generative audio model with an L2-loss on the raw waveform can lead to poor sound quality and signal distortion. We therefore explain analytically a fundamental problem that the L2-loss on waveforms has with phase estimation, and show that a simple additional loss term can mitigate it. Define the time-domain L2-loss between the predicted audio signal $y_{1:T}$ and the target signal $\hat{y}_{1:T}$, and let $Y_k, \hat{Y}_k \in \mathbb{C}$ denote the $k$th frequency components of $y_{1:T}$ and $\hat{y}_{1:T}$ in the Fourier domain. The amplitude error and the angular phase error of the $k$th frequency component are defined from $Y_k$ and $\hat{Y}_k$, where $|\cdot|$ denotes the modulus of a complex number.
According to Parseval's theorem, the L2-loss in the time domain can be written as an L2-loss in the complex frequency domain. In what follows, the distance $|Y_k - \hat{Y}_k|$ is denoted as $\varepsilon$.
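The Parseval step can be checked numerically. The sketch below assumes an unnormalized sum-of-squares L2-loss and NumPy's unnormalized DFT, for which $\sum_t (y_t - \hat{y}_t)^2 = \frac{1}{T}\sum_k |Y_k - \hat{Y}_k|^2$; the paper's own normalization may differ by a constant factor. The per-bin amplitude and phase errors are written in their most common form, $\big||Y_k| - |\hat{Y}_k|\big|$ and the absolute angle between $Y_k$ and $\hat{Y}_k$, which is likewise an assumption.

```python
import numpy as np


def l2_time(y, y_hat):
    """Sum of squared sample differences in the time domain."""
    return float(np.sum((y - y_hat) ** 2))


def l2_freq(y, y_hat):
    """The same quantity computed from the DFT coefficients (Parseval)."""
    diff = np.fft.fft(y) - np.fft.fft(y_hat)
    return float(np.sum(np.abs(diff) ** 2) / len(y))


def per_bin_errors(y, y_hat):
    """Amplitude error and angular phase error for every frequency bin."""
    Y, Y_hat = np.fft.fft(y), np.fft.fft(y_hat)
    amp_err = np.abs(np.abs(Y) - np.abs(Y_hat))
    phase_err = np.abs(np.angle(Y * np.conj(Y_hat)))   # values in [0, pi]
    return amp_err, phase_err


rng = np.random.default_rng(0)
y = rng.standard_normal(256)
y_hat = rng.standard_normal(256)
amp_err, phase_err = per_bin_errors(y, y_hat)
```

The two loss values agree up to floating-point error, which is what allows the analysis to move from samples to individual frequency components.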
Theorem 1 Define $\hat{Y} \in \mathbb{C}$ as a specified complex number and $Y \in B_{\varepsilon,\hat{Y}} = \{Y \in \mathbb{C} : |Y - \hat{Y}| = \varepsilon\}$ as any complex number at distance $\varepsilon$ from $\hat{Y}$. The expected amplitude error and the expected angular phase error with respect to $\hat{Y}$ are then functions of $\varepsilon$ and $|\hat{Y}|$.

According to Theorem 1, we can analyze the expected amplitude error and phase error along the $k$th frequency component. First, in the early stage of training, the expected amplitude error is low for higher-energy signals, even with large L2 values. In contrast, the phase is hardly optimized when the L2-loss is large. Second, the expected amplitude error decreases for all target energies as the L2-loss decreases during training. The expected phase error, however, improves primarily for the high-energy components, while phase accuracy remains poor for the medium- and low-energy components. Therefore, optimizing the raw waveform with the L2-loss in the time domain is not sufficient to achieve accurate phase reconstruction.
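The qualitative behavior discussed here can be reproduced by averaging the two errors over points spread evenly on the circle of radius $\varepsilon$ around $\hat{Y}$. This is a numerical illustration, not the closed-form result of the theorem; the uniform distribution on the circle, the error definitions, and the function name are assumptions.

```python
import numpy as np


def expected_errors(target_mag, eps, n=100_000):
    """Average amplitude and phase error over a circle of radius eps around Y_hat.

    Y_hat is placed on the positive real axis, which loses no generality
    because both errors are invariant under a common rotation.
    """
    theta = 2 * np.pi * (np.arange(n) + 0.5) / n
    Y_hat = complex(target_mag, 0.0)
    Y = Y_hat + eps * np.exp(1j * theta)
    amp = float(np.mean(np.abs(np.abs(Y) - abs(Y_hat))))
    phase = float(np.mean(np.abs(np.angle(Y / Y_hat))))
    return amp, phase


# Same distance eps, two target energies: a strong and a weak component.
amp_strong, phase_strong = expected_errors(target_mag=10.0, eps=1.0)
amp_weak, phase_weak = expected_errors(target_mag=0.1, eps=1.0)
```

For the strong component the average phase error stays close to zero, while for the weak component it approaches $\pi/2$, i.e. the phase is essentially unconstrained at the same L2 distance.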
Due to the limited capacity of the model, the training data can usually only be fitted up to a minimum L2-loss $\varepsilon_{\min}$. If this $\varepsilon_{\min}$ is too large, the amplitude of the signal is modeled well but the phase error remains large. To overcome the shortcomings of the time-domain L2-loss in phase optimization, we add an explicit phase term to the loss function, computed on $\mathrm{STFT}(y_{1:T})$, the short-time Fourier transform of the audio signal $y_{1:T}$.
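One plausible shape for such a combined loss is sketched below. Every specific is an assumption made for illustration: the Hann-windowed framing used as the STFT, the frame and hop sizes, the mean absolute angle as the phase term, and the weight `lam`. The paper's exact phase term is not reproduced.

```python
import numpy as np


def stft(x, frame=64, hop=32):
    """Minimal STFT: Hann-windowed frames followed by a real FFT per frame."""
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def l2_plus_phase_loss(y, y_hat, lam=0.01):
    """Time-domain L2 term plus a weighted angular phase term on the STFT."""
    l2 = np.mean((y - y_hat) ** 2)
    Y, Y_hat = stft(y), stft(y_hat)
    phase = np.mean(np.abs(np.angle(Y * np.conj(Y_hat))))
    return float(l2 + lam * phase)


t = np.arange(512)
tone = np.sin(2 * np.pi * t / 16)
shifted = np.sin(2 * np.pi * t / 16 + np.pi / 2)
```

The phase term does not depend on the magnitude of a component, so low-energy bins contribute to it as much as high-energy ones, which is the gap the plain L2-loss leaves open.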

Audio rendering system
The sound quality produced at a concert, whether it comes directly from the instruments or from the amplifiers, can determine the success or failure of a real concert. To improve this aspect in a virtual concert, we divide the metadata-based audio rendering into two independent parts: ambient (environment) sound rendering and location-dependent sound (object) rendering. This is only an abstract logical distinction. In the actual rendering algorithm, ambient sound rendering usually performs convolution with a late reverberation tail, while location-dependent sound rendering handles direct sound and early reflections. The difference between the two components is whether their operation depends on the locations of the sound source and the listener. For ambient sound rendering (i.e., the part that does not depend on the source location), our method sums the source signals and performs a single rendering. An overview of the algorithm is shown in Fig. 4. The elements of the audio are classified so that they correspond to the metadata: each audioFormatExtended in the metadata is regarded as a scene, each scene includes sub-scenes that correspond to an audio programme, and the audioContent in the scene is used as the scene audio library (SAL). The SAL is classified into ancient and modern scenes, industrial scenes, nature-based scenes, and urban scenes.
Some rendering algorithms do not use the distinction described above. One method follows the image source approach, in which the reverberation tail is modeled with higher-order reflections, so there is no ambient sound rendering and only position-dependent rendering (i.e., every part of the rendering algorithm depends on the source and listener positions). Another method employs an air volume simulation-based approach, which needs the source and listener locations only to inject the signal into, and retrieve information from, a separate system to which those locations are unknown. Under our distinction, the air volume modeling-based approach resembles an algorithm that performs only ambient sound rendering, which suggests that a co-processing mechanism shared between different sound sources is most suitable.
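The split between the two rendering paths in this subsection can be sketched with plain convolutions. The example assumes that each source has its own short impulse response for direct sound and early reflections, that all sources share one reverberation tail, and that signals within each group have equal length; the function names and data layout are hypothetical.

```python
import numpy as np


def ambient_render(sources, reverb_tail):
    """Location-independent path: sum all sources, then convolve once."""
    mix = np.sum(np.asarray(sources, dtype=float), axis=0)
    return np.convolve(mix, reverb_tail)


def object_render(sources, early_irs):
    """Location-dependent path: one convolution per source, then sum."""
    return sum(np.convolve(s, h) for s, h in zip(sources, early_irs))


def render(sources, early_irs, reverb_tail):
    """Combine both paths, zero-padding the shorter one."""
    direct = object_render(sources, early_irs)
    ambient = ambient_render(sources, reverb_tail)
    out = np.zeros(max(len(direct), len(ambient)))
    out[:len(direct)] += direct
    out[:len(ambient)] += ambient
    return out
```

Because convolution is linear, summing first and convolving once gives the same ambient signal as convolving every source separately, so the cost of the ambient path does not grow with the number of sources.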

Experiments
To evaluate our approach, we performed subjective and objective experiments for each of the two major models of metaverse concerts [41,42]. In Section 4.1, we present our experiments and results on the effectiveness of our Transformer-XL music generation network controlled with the Monte Carlo and DQN methods, using the POP909 dataset [7]. The dataset was assembled by the Music X Lab team at New York University Shanghai in 2020 and contains a total of 909 popular music tracks from 462 artists with a total duration of about 60 h. The experimentally generated music was compared with the music generated by the original Transformer-XL network and by two baseline algorithms (Melody_LSTM and RL-tuner) to obtain objective and subjective scores. In Section 4.2, we present experiments that use HRTF data as a control against a dataset collected for neural network-based synthesis, in order to verify the reliability of the purely data-driven end-to-end model; the HRTF data are those of KEMAR No. 21 (Subject_003) from the CIPIC HRTF database. The CIPIC HRTF database includes high-spatial-resolution HRTF measurements for 45 different subjects, including KEMAR mannequins with small and large pinnae. The HRTF data for each subject include 2500 head-related impulse response measurements. These "standard" measurements were recorded at 25 different interaural-polar azimuths and 50 different interaural-polar elevations. Section 4.3 presents experiments using a sound field designed in UE4 that combines digital twin and virtual reality technologies to achieve a virtual concert sound field simulation.

Training environment
The Transformer-XL generative network was trained for 120 epochs. The pitches generated by the generative network were controlled by the Monte Carlo and DQN methods. The Q-network was a 3-layer, 256-cell LSTM network with dropout = 0.5, trained for 120 epochs with the Adam optimizer and, because of the softmax output, a cross-entropy loss function.
In the experiment, the Transformer-XL+MC method sampled 50 pitch event trajectories and generated 100 music tracks over 100 cycles. The average reward value was 0.142. The Transformer-XL+DQN method sampled 100 pitch event trajectories, with a final loss value of 0.157 and an average reward value of 0.185.

Objective evaluation
To demonstrate the effectiveness of the Transformer-XL+MC and Transformer-XL+DQN methods relative to Transformer-XL, the results generated by the three methods were compared with the original dataset and subsequently with the results generated by the Melody_LSTM and RL-tuner methods. Each of the three methods was used to generate a set of 100 16-bar music clips (3 sets in total), from which seven objective metrics were calculated and compared against the music in the POP909 dataset [43]. The results are given in Table 3. In addition, we randomly selected part of the generated results for visualization: as shown in Fig. 5, our generated music consists of three instrument tracks.
The first four metrics reflect the complexity of the generated music. The range of the music generated by all three methods was controlled at about 3 octaves, a relatively stable range. The mean number of different pitches was similar for the three methods but slightly higher for the Transformer-XL+DQN method. The mean number of pitches played and the polyphony time step ratio hardly differed. The latter three metrics reflect the quality of the generated results. For the ratio of in-key tones, both improved methods scored significantly higher than the original method, Melody_LSTM, and RL-tuner. For pitch information entropy, both improved algorithms had lower values than the original method. For the rhythmic consistency of the first two bars, the two improved methods were roughly equal, improved on the original method, and were close to the original dataset.
Overall, the Transformer-XL+MC and Transformer-XL+DQN methods generated music with better tonality, richer melodies, and more stable rhythms, a clear objective improvement over the Transformer-XL generation network as well as the Melody_LSTM and RL-tuner methods.
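Several of these metrics are simple statistics over the generated pitches. The sketch below shows one common way to compute four of them from a list of MIDI note numbers; the exact definitions used for Table 3 are not given in the text, so the formulas, the C major scale as the reference key, and the function names are assumptions.

```python
import math
from collections import Counter

# Pitch classes of C major, used here as a hypothetical reference key.
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}


def pitch_range(pitches):
    """Distance in semitones between the highest and lowest pitch."""
    return max(pitches) - min(pitches)


def n_distinct_pitches(pitches):
    """Number of different pitches used."""
    return len(set(pitches))


def in_key_ratio(pitches, scale=C_MAJOR):
    """Fraction of notes whose pitch class belongs to the given scale."""
    return sum(p % 12 in scale for p in pitches) / len(pitches)


def pitch_entropy(pitches):
    """Shannon entropy (in bits) of the pitch histogram."""
    n = len(pitches)
    return -sum(c / n * math.log2(c / n) for c in Counter(pitches).values())
```

For the four notes `[60, 62, 64, 61]` this gives a range of 4 semitones, 4 distinct pitches, an in-key ratio of 0.75, and an entropy of 2 bits. A lower entropy together with a higher in-key ratio corresponds to the more tonal output reported for the two improved methods.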

Subjective evaluation
Beautiful music possesses diversity, innovation, and flexibility while taking into account theoretical support and artistic aesthetics [44]. In this experiment, a subjective assessment was conducted on music generated using the three methods examined so far. The subjects were divided into two groups: a non-professional group of 25 people consisting of music lovers who were not music students, and a professional group of 5 people consisting of testers with advanced musical education. Five pieces of music generated by the three methods were scored, with the professional group scoring on music-related knowledge and the non-professional group scoring on listening impression. The test results are shown in Table 4.
The results in the table suggest that the Transformer-XL+MC model and the Transformer-XL+DQN model outperformed the original model and two basic algorithms in the subjective evaluation. In particular, the

Spatial audio generation experiments
Data setup
A total of 2 h of paired mono and binaural data at 48 kHz were recorded from eight different speakers (4 male and 4 female). The listener was a mannequin with binaural microphones in the ears. Participants were asked to walk around the mannequin within a radius of 1.5 meters and to engage in an unscripted conversation with it. The location and orientation of the sound source and listener were tracked throughout the recording using an OptiTrack system. A validation sequence and the last 2 min of each participant's recording were used as test data and the remainder as training data; our model was trained for 100 epochs with the Adam optimizer.

Objective evaluation
The mono audio, the HRTF-based audio, and the neural network-generated audio were analyzed separately in terms of their waveforms, spectra, and spectrograms.
For the mono audio waveform, the intensities of the audio in the left and right ears were the same at every time point, so both ears received the same energy and the brain perceived no stereo sensation. In contrast, the audio waveforms generated by HRTF and those generated by neural networks had different intensities in the left and right channels at the same point in time. Therefore, when the audio was received by both ears, the energy received by the left and right ears was different, and the brain perceived a stereo sensation from the difference in energy (Figs. 6 and 7).
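The left/right intensity comparison can be made quantitative with a simple level difference in decibels. This is a minimal sketch of one such measure, not the analysis used for the figures; the function name and the small stabilizing constant are assumptions.

```python
import numpy as np


def level_difference_db(left, right, floor=1e-12):
    """Energy ratio of the left to the right channel, in dB.

    `floor` avoids division by zero for silent channels.
    """
    e_left = np.mean(np.square(left))
    e_right = np.mean(np.square(right))
    return float(10.0 * np.log10((e_left + floor) / (e_right + floor)))


t = np.arange(4800) / 48000.0
mono = np.sin(2 * np.pi * 440.0 * t)
```

A mono signal copied to both channels yields 0 dB, while any rendering that attenuates one ear relative to the other yields a nonzero value.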
The sound spectrum and the corresponding frequency spectrum (Fig. 8) show that the energy of the mono audio is concentrated between 0 and 2 kHz, with lower energy in the middle and high-frequency parts. The energy of the audio generated using HRTF was concentrated between 0 and 2 kHz, with lower energy in the high-frequency part and enhanced energy in the middle frequency part (4 kHz to 5 kHz) as compared with mono audio. The energy in the low-frequency part decreased compared to the mono audio energy. The energy of the neural network-based generated audio was mainly concentrated between 0 and 2 kHz, lower in the high-frequency part, and enhanced in the middle frequency part (4 kHz to 5 kHz) as compared to the mono audio. The overall energy of each frequency band was slightly increased; the energy of the low-frequency bands was decreased compared with the mono audio, and the energy of the low-frequency bands was decreased more than that of the HRTF-based audio.

Subjective evaluation
We used 27 testers with stereo music experience in this experiment to obtain the statistical results shown in Table 5 and visualized in Fig. 9.
The generated music received higher scores overall. The binaural_neural audio generated by the neural network was significantly more comfortable than the binaural_hrtf_003 audio generated by the traditional head-related transfer function (HRTF) method, with significant improvement in the four measures of fullness, intimacy, roundness, and brightness. However, the neural network method showed no significant improvement in clarity: it failed to retain more information about the original song, and some phase and energy errors remain to be improved. Combining the other four dimensions, the audio clips generated by the neural network were considered to have high clarity, good naturalness, wide range, low distortion in the pathway, low noise, good transient response, and sufficient reverberation.

Virtual concert soundstage design
The designed virtual soundstage included 12 speakers: four top speakers (left and right, front and rear), two rear speakers (right rear and left rear), two surround sound field speakers (left and right), three front speakers (left, center, and right), and one subwoofer, as shown in Fig. 10. The four top speakers used the same full-range design and were placed according to the main listening seat. The right rear and left rear speakers increased the intensity of the listening experience by further localizing the sound; they were placed behind the seating area at an angle of 135° to 150° to the center. The left and right surround sound field speakers created a realistic sense of space and provided ambient sound. The two are arranged in the seat position slightly behind the area and form a certain angle, preferably just above ear height. The left, center, and right speakers assisted the music with the change of stage lighting. The subwoofer emitted the strongest bass, thus adding power to the music. The 12 speaker placements designed in the UE4 virtual stage allowed each speaker to emit a different sound, each with its independent source, forming new front, surround, and ceiling sound channels. The resulting surround sound provided an immersive sound experience.

Conclusion and future work
In this paper, we have proposed a framework for metaverse music generation that spans intelligent music generation and spatial audio twinning. The results of subjective evaluation and objective experiments on the POP909 and HRTF datasets show that MetaMGC achieves superior results in both music generation and digital audio twinning.
However, although the model makes a good contribution to generating musical compositions, it is still not perfect. An important characteristic of live concerts is that listeners can feel and immerse themselves in the emotion and atmosphere conveyed by the music at close range [45], while our model only improves the musicality of the music. Therefore, a music generation model that generates emotionally rich music would be a better choice [46]. In subsequent experiments, we will also consider adding emotional expression factors to the digital audio twin system to bring the metaverse concert music generation framework closer to realistic, emotion-rich live concert scenarios.