YuYin: a multi-task learning model of multi-modal e-commerce background music recommendation

Appropriate background music in e-commerce advertisements can help stimulate consumption and build product image. However, many factors such as emotion and product category must be taken into account, which makes manual music selection time-consuming and dependent on professional knowledge, so automatically recommending music for video becomes crucial. Because no e-commerce advertisement dataset exists, we first establish a large-scale e-commerce advertisement dataset, Commercial-98K, which covers major e-commerce categories. Then, we propose a video-music retrieval model, YuYin, to learn the correlation between video and music. We introduce a weighted fusion module (WFM) to fuse emotion features and audio features from music into a more fine-grained music representation. Considering the similarity of music within the same product category, YuYin is trained by multi-task learning to explore the correlation between video and music by cross-matching video, music, and tag, together with a category prediction task. Extensive experiments show that YuYin achieves a remarkable improvement in video-music retrieval on Commercial-98K.


Introduction
Background music (BGM) plays a vital role in advertisements, where it can help build brand image and stimulate consumption [1][2][3]. Many studies in psychology and brain science have examined the factors behind the effect of BGM. By observing the brain, these studies have shown that happy music is associated with faster response times and greater activation of frontoparietal areas, whereas sad music is associated with slower responses and greater occipital recruitment. When the emotion of the BGM is in line with the advertisement, it can help catch the attention of customers [4] and make the advertisement more memorable [5,6]. However, with the expanding demand for e-commerce advertisements, manually selecting music piece by piece from an ever-growing music pool and clipping it not only requires professional knowledge but is also time-consuming, which makes automatically selecting suitable BGM a crucial task.
Recommending appropriate music for a video can be considered a cross-modal retrieval task, which aims to search for relevant data in different formats [7]. Previous studies have mainly focused on retrieval between visual and textual modalities, such as retrieving images or videos corresponding to a given textual description [8][9][10] or generating textual descriptions for a given image or video [11][12][13].
Among existing video-audio retrieval research, some studies focus on sound event localization [14,15], which aims to localize the object in the video that produces the sound. Other studies concentrate on face-speech retrieval, which seeks the corresponding person for a given voice [16][17][18]. However, there are several challenges in video-music retrieval. First, there are limited public datasets for video-music retrieval. The datasets used in existing studies are mainly music videos from YouTube. Second, there is no explicit correlation between the video and music.
Considering that music primarily depends on the video's emotion, many studies have relied on emotion tags [19,20], which are time-consuming to annotate and may introduce subjective bias. Other studies use content-based models to directly learn the correlation between video and music through deep neural networks (DNN) by calculating the Euclidean distance or cosine similarity between video and music features [21][22][23]. However, the music features in these studies are coarse, and emotion features may be ignored.
In this paper, to address these challenges, we first establish a large-scale multi-modal dataset, Commercial-98K, from Alibaba, covering major product categories. Moreover, we propose a content-based video-music retrieval model, YuYin. Instead of emotion tags, we extract emotion-related features from music, which avoids the subjective bias of annotation. We also introduce a weighted fusion module (WFM) to fuse the emotion features and the audio features into a more fine-grained music representation. The WFM dynamically weighs the different features, thus reducing information redundancy and enhancing the robustness of the model.
Because background music is also related to the product category [24][25][26], we apply multi-task learning to train YuYin, including a cross-matching task and a category prediction task. Specifically, for the cross-matching task, the text features of the category tag are used to help align video and music. For the category prediction task, a weight-shared classifier is used to predict the category of videos, music, and text.
The main contributions of this paper can be summarized as follows:
• We establish a large-scale dataset, Commercial-98K, containing a large number of advertisements from Alibaba and covering major product categories.
• We propose a novel video-music retrieval model, YuYin, trained by multi-task learning, with categories included both as labels to be predicted and as a supportive modality to align related video and music.
• We introduce a weighted fusion module to fuse emotion features and audio features of music for fine-grained music representations, learning to dynamically balance different features through training.
The rest of the paper is organized as follows. We discuss related work on multi-modal datasets and video-music retrieval in Section 2. Then, we explain the process of building our dataset Commercial-98K, including the data sources and the details of the data processing, in Section 3. Our proposed model YuYin is described in detail in Section 4. We then introduce the experiment setup and analyze the results in Section 5. Finally, we present the conclusion in Section 6.

Multi-modal Dataset
Compared with single-modal datasets, multi-modal datasets contain more than one form of data and have been shown to offer advantages. DEAP [27] includes self-assessment scores, audio, videos, facial expressions, and physiological data for analyzing human emotional states, and it improves the effectiveness of human emotion analysis. VQA [28] contains 25,000 images, 7600 questions, and 100,000 answers. Many studies on VQA achieve better results in the tasks of free-form and open-ended visual question answering (Table 1).
In the field of cross-modal retrieval, researchers have built multi-modal datasets with different scales, modalities, and sources for specific tasks (Table 1). YouTube-8M [35], released by Google, is one of the largest multi-modal datasets. It contains 8,000,000 videos and text annotations from YouTube, divided into 4800 categories, and each video has an average of 1.8 tags. Many subsets are based on YouTube-8M, such as HIMV-200k [21], which contains 205,000 music video-audio pairs. UGV [30] is a video dataset with emotion tags and is used for music recommendation. HoK400 and CFM400 [29] are two game video datasets built from a short video platform, which include voice-over in addition to video and background music. TT-150K [34] is a large-scale dataset built from TikTok for background music recommendation, which contains 150,000 user-generated short videos corresponding to 3000 pieces of background music. However, the datasets for video-music retrieval consist mainly of music videos that are deliberately made for specific music, which makes it difficult to meet diversified demands and hardly fits the e-commerce scenario. These datasets also suffer from uneven length and quality of audio and video. Therefore, we establish a multi-modal dataset, Commercial-98K, containing videos, music, and tags from top stores on Alibaba, the largest Chinese e-commerce platform.

Video-music retrieval
Cross-modal retrieval (CMR) aims to retrieve data across multiple modalities [7]. There have been many studies on visual-text retrieval, along with public datasets such as Flickr [36], HowTo100M [37], and YouCook2 [31]. However, only a limited number of studies have focused on video-music retrieval (VMR).
Compared to visual-text retrieval, VMR is more challenging because both video and music contain rich information, which makes the "modality gap" rather large. To bridge the modality gap between video and music, the authors of [19] observe that videos have a strong connection with the emotion of the BGM and perform music retrieval by calculating the textual similarity of their emotion tags. However, these emotion tags are mainly annotated through crowd-sourcing, so the quality of the labels can hardly be guaranteed and subjective bias may be introduced.
Other studies use content-based models to directly learn the correlation between video and music. Shallow models such as CCA [38] have been used for correlation analysis between different modalities, in which a linear projection is learned to map different modalities into the same space and maximize their correlation. Based on CCA, DCCA [39] extends CCA with non-linear projections learned by a DNN. In [40], a CCA-based model is used to match two modalities by maximizing the correlation between image and music features.
Normally, features of different modalities are extracted first and projected into a common space before calculating the metric [21,23,46,47]. For video, the features are usually extracted by a pre-trained convolutional neural network (CNN) [22,48]. However, music representations are more complex. In [21], handcrafted features such as the Mel-spectrogram and MFCC are designed to represent music. Some other studies use pre-trained networks such as VGGish [49] to extract audio features [23,29,34]. To capture the emotion of music, the authors of [22] improve the feature extractor by pre-training it on an emotion classification task. CMVAE [34] extracts emotion features with OpenSmile [50] and audio features with VGGish, then applies concatenation and principal component analysis (PCA) to obtain the final music features.
After feature extraction, existing studies employ pair-wise metric learning to explore the correlation between different modalities [37,51,52]. Specifically, positive and negative pairs are constructed for video and music, followed by pair-wise loss functions, such as noise-contrastive estimation (NCE) [53] and triplet loss [54], to minimize the distance between positive pairs while pushing negative pairs apart. Additionally, CEVAR [23] constrains the distance between the video and music from the same clip by a cosine similarity loss. In the work of Liu et al. [30], videos and music are first categorized into positive and negative pairs based on whether they have the same emotion label, and a DNN is used to align the two modalities. Moreover, CBVMR [21] uses intra- and inter-modality constraints to obtain more fine-grained information between video-music pairs. Generative methods are also applied in the field of CMR. Zhou et al. [55] propose an end-to-end model to generate sound for videos. CMVAE [34] is based on a variational auto-encoder (VAE) that cross-generates music and short videos, in addition to learning the correlation between different modalities in the latent space. However, most studies only consider videos and music, although text also contains valuable information for video-music retrieval. Although CMVAE [34] fuses text features with video features, its performance is likely to degrade when the text is absent.
Therefore, we propose a content-based model, YuYin, which learns the correlation between video and music by multi-task learning. For better music representation, YuYin extracts and fuses the emotion features and the audio features from music. In addition, text features are included in multi-task learning, which helps to gather video and music of the same category in the common space. The text is used only as a supportive modality in the training phase and is not involved in video-music retrieval, so missing text has no impact on the model at retrieval time.

Dataset
The main aim of the Commercial-98K dataset is to fill the gap left by the absence of a dataset that associates advertisements with background music, and to facilitate research on discovering matching patterns between music and advertisements. In this section, we discuss in detail the steps we took to collect advertisements and background music from the e-commerce platform, as well as the data preprocessing methods. Then, we present the statistics of the collected advertisements and music.

Data collection
We collect advertisements from Taobao, one of the largest e-commerce platforms in China, which belongs to Alibaba and allows customers to buy and sell numerous products. Unlike many other e-commerce platforms, Taobao is a customer-to-customer platform, where both enterprises and individuals can open online stores to sell their products. Stores on Taobao upload their products with basic information as well as images or videos and categorize them into different sections, which produces a vast number of advertisements in different categories. However, advertisement quality on Taobao varies widely: enterprises upload advertisements made by professionals, while many individuals casually make videos introducing their products. Also, many advertisements only contain voice-over, which is not helpful and may introduce noise into video-music correlation learning. To ensure the quality of our data, we primarily collect advertisements from brand stores. These advertisements are designed, shot, and edited by professionals, with careful attention paid to the tight correlation between the video and the background music. We ultimately gather 115,000 advertisements from 15 categories on Taobao: food, children's clothing, tablets, wedding dresses, women's t-shirts, men's t-shirts, men's suits, video games, women's suits, baby products, daily necessities, sports, down jackets, cosmetics, and mobile phones.

Data preprocess
As shown in Fig. 1, we separate the audio and visual content of the 115,000 collected e-commerce advertisements with moviepy and find that some audio tracks are primarily voice-over or muted. To filter out these tracks, we use a pre-trained time-domain convolution network [56] to compute the onset and duration proportion of music in each audio track and exclude data where music accounts for less than 50% of the total time. Finally, as shown in Fig. 2a, we retain 98,071 advertisements in 15 categories and find that the number of advertisements varies significantly across categories. Hence, we further manually merge advertisements from similar categories; for example, tablets, mobile phones, and video games are merged into electronic products.
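The filtering rule above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the helper names and the (onset, duration) segment format are hypothetical stand-ins for the output of the pre-trained music detector, and only the 50% threshold comes from the text.

```python
# Hypothetical sketch of the music-proportion filter.
# The segment format (onset_seconds, duration_seconds) is an assumption.


def music_ratio(segments, total_seconds):
    """Return the fraction of a clip covered by detected music.

    segments: list of (onset_seconds, duration_seconds) pairs,
              assumed not to overlap.
    """
    if total_seconds <= 0:
        return 0.0
    covered = sum(duration for _, duration in segments)
    return min(covered / total_seconds, 1.0)


def keep_advertisement(segments, total_seconds, threshold=0.5):
    """Keep a clip unless music accounts for less than `threshold` of it."""
    return music_ratio(segments, total_seconds) >= threshold


# Toy clips of 30 seconds each (made-up values).
mostly_music = [(0.0, 12.0), (15.0, 8.0)]  # 20 s of music
mostly_voice = [(5.0, 10.0)]               # 10 s of music
kept = keep_advertisement(mostly_music, 30.0)
dropped = not keep_advertisement(mostly_voice, 30.0)
```

A clip with exactly half of its duration covered by music is kept here, since the text excludes only clips where music accounts for less than 50% of the total time.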

Data statistics
As depicted in Fig. 2b, the 98,071 advertisements fall into 4 categories: 43,841 on clothing, 25,860 on baby products, 26,824 on daily necessities, and 1546 on electronic products. Commercial-98K is still unbalanced, since the amount of data on electronic products is notably lower. The reason may be that brand stores for electronic products on Taobao mainly belong to well-known domestic and international brands, whose number is limited. Due to copyright restrictions, we cannot release the raw data, but the processed Commercial-98K can be downloaded from https://github.com/Venatoral/Commercial-98K.

Problem definition
Let $M$ stand for a collection of music and $V$ for a collection of advertisements. The video features $v \in V$ are extracted from the frame image sequence. Thus, it is possible to define a function $f: M \times V \to S$, where $S$ stands for the similarity matrix and each $s_{ij} \in S$ denotes the similarity between the $i$th piece of music ($m_i \in M$) and the $j$th advertisement ($v_j \in V$). Given a new advertisement $v$ and the function $f$, the candidate music set $C(m)$ can be selected from $M$ by computing and ranking the similarities between $v$ and each music clip $m \in M$.
Fig. 2 The data distribution of Commercial-98K before (left panel) and after (right panel) merging the categories

Overall framework
The framework of our proposed video-music retrieval model YuYin is shown in Fig. 3. We extract emotion features and audio features from the music separately, while the video features and text features are extracted from the sampled image sequences and the tags. Through the WFM, the emotion features and audio features are fused into the music features. Then, the different features are projected into the common space, and multi-task learning is applied to compute the cross-matching loss and the prediction loss in order to learn the correlation between advertisements and music. Finally, by computing the cosine similarities between video and music in the common space and ranking them, the candidate music for a given advertisement can be selected.
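The final retrieval step, cosine similarity followed by ranking, can be sketched in a few lines. This is an illustrative example with made-up three-dimensional vectors; the function names are hypothetical, and in YuYin the vectors would be the projections of video and music in the common space.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend_music(video_vec, music_vecs, top_k=3):
    """Return indices of the top_k music clips, most similar first."""
    scores = [cosine_similarity(video_vec, m) for m in music_vecs]
    ranked = sorted(range(len(music_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Made-up projections, for illustration only.
video = [1.0, 0.0, 1.0]
music_pool = [
    [0.9, 0.1, 1.1],     # close to the video direction
    [-1.0, 0.0, -1.0],   # opposite direction
    [0.0, 1.0, 0.0],     # orthogonal
]
candidates = recommend_music(video, music_pool, top_k=2)
```

Because cosine similarity ignores vector length, only the direction of each projection in the common space affects the ranking.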

Feature extraction
We use multiple pre-trained networks as feature extractors for different modalities. Furthermore, for stability, all feature extractors are frozen during training.
For music, we extract the frequency-domain features from the music clip with torchaudio [57] as the input of the pre-trained AST [58] to obtain the audio features. In addition, we apply OpenSmile [50] to extract emotion features from music clips with its emobase feature set.
For video, we sample frames from the videos at a fixed rate, and the sampled frames are fed into the pre-trained Inception network [48] to obtain the frame-level features. Finally, inspired by [35], we use temporal global average pooling to obtain the video-level features.
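Temporal global average pooling simply averages the frame-level features over time. A minimal sketch, assuming each frame has already been mapped to a fixed-length feature vector (the toy two-dimensional vectors below are made up):

```python
def temporal_average_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    if not frame_features:
        raise ValueError("at least one frame is required")
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    return [
        sum(frame[d] for frame in frame_features) / num_frames
        for d in range(dim)
    ]


# Three sampled frames with two-dimensional features (illustrative values).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
video_feature = temporal_average_pool(frames)
```

The pooled vector has the same dimension as a single frame feature, regardless of how many frames were sampled.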
For text, since the advertisements are from a Chinese e-commerce platform, we use BERT-wwm [59], which is pre-trained on Chinese Wikipedia, to extract the text features from the tag of the advertisement.

Weighted fusion module
As shown in Fig. 4, we introduce a flexible fusion method, the weighted fusion module, to obtain the music features $m$ from the audio features $a$ and the emotion features $e$. Dynamic weights ranging from 0 to 1 are learned for the concatenated features through a linear layer followed by a sigmoid layer. Finally, to reduce the dimension of the weighted features, a linear layer is applied to output the music feature.
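The forward pass described above can be sketched in plain Python. This is one reading of the description, not the authors' implementation: the class and helper names are hypothetical, the layer sizes are toy values, and the weights are random rather than trained. It assumes the gate has the same width as the concatenated features and is applied element-wise.

```python
import math
import random


def linear(x, weight, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-xi)) for xi in x]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias (untrained, illustration only)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class WeightedFusion:
    """Hypothetical sketch: gate = sigmoid(linear(concat)), output = linear(gate * concat)."""

    def __init__(self, audio_dim, emotion_dim, out_dim, seed=0):
        rng = random.Random(seed)
        concat_dim = audio_dim + emotion_dim
        self.gate_w, self.gate_b = init_linear(concat_dim, concat_dim, rng)
        self.out_w, self.out_b = init_linear(concat_dim, out_dim, rng)

    def gate(self, audio, emotion):
        """Per-dimension weights in (0, 1) for the concatenated features."""
        return sigmoid(linear(audio + emotion, self.gate_w, self.gate_b))

    def forward(self, audio, emotion):
        concat = audio + emotion
        weighted = [w * c for w, c in zip(self.gate(audio, emotion), concat)]
        # The final linear layer reduces the dimension to the music feature size.
        return linear(weighted, self.out_w, self.out_b)


wfm = WeightedFusion(audio_dim=4, emotion_dim=3, out_dim=2)
audio = [0.5, -1.0, 0.2, 0.0]
emotion = [1.5, 0.3, -0.7]
weights = wfm.gate(audio, emotion)
music_feature = wfm.forward(audio, emotion)
```

In a trained model the gate and output parameters would be learned jointly with the rest of the network, so the weights would adapt to each input rather than staying near 0.5 as they do with small random initial values.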

Multi-task learning
YuYin is trained through multi-task learning. First, YuYin uses pair-wise metric learning to learn the direct relationship between different modalities. We treat videos and music clips from the same advertisement as positive pairs and all others as negative pairs. With the positive and negative video-music pairs constructed, NCE loss is applied to learn the correlation between the video-music pairs, as described in Eq. (1), where $x$ and $y$ stand for two different modalities, $P(x)$ denotes the positive data of $x$, $B$ is the batch size, and $\tau$ is a hyper-parameter. In addition, the text features are included to align video and music. Specifically, the text features of the tag are extracted and projected into the common space to match the corresponding videos and music by Eq. (1).
Fig. 3 The framework of our proposed YuYin for background music recommendation of e-commerce advertisements. In detail, a WFM fuses emotion features and audio features as music features. Then the extracted features are projected into the common space for multi-task learning. The video $z_v$, music $z_m$, and text $z_t$ projections in the common space are pair-wise cross-matched to compute the NCE loss and passed through a weight-shared classifier to get the prediction probabilities $p_v$, $p_m$, and $p_t$, which are further used to compute the cross-entropy loss with the true label as the prediction loss
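A batch-wise NCE loss of the kind described above can be sketched as follows. The exact form of Eq. (1) is not reproduced in this text, so this InfoNCE-style version is an assumption: it uses cosine similarity, treats the item at the same batch index as the positive, and uses every other item in the batch as a negative. The value of `tau` is arbitrary.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]


def nce_loss(xs, ys, tau=0.1):
    """InfoNCE-style loss where ys[i] is the positive for xs[i].

    Assumed form; tau is a temperature hyper-parameter (value chosen arbitrarily).
    """
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    total = 0.0
    for i, x in enumerate(xs):
        logits = [dot(x, y) / tau for y in ys]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        total += log_denominator - logits[i]
    return total / len(xs)


# Toy batch of two video and two music projections (made-up values).
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_music = [[0.9, 0.1], [0.1, 0.9]]
swapped_music = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = nce_loss(videos, aligned_music)
loss_swapped = nce_loss(videos, swapped_music)
```

The loss is small when each video is closest to its own music clip and large when the pairing is swapped, which is the behaviour the cross-matching objective relies on.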
Equation (2) illustrates how video projections $z_v$, music projections $z_m$, and text projections $z_t$ yield the cross-matching loss $L_{cm}$, where $\beta$ is a hyper-parameter used to weight the video-music matching loss. Through optimization, the distance between the positive video-music pairs in the common space steadily decreases, while the distance between the negative pairs keeps growing.
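For orientation, a cross-matching loss of the kind described above is often written as a weighted sum of pair-wise terms. The form below is an assumption, since Eq. (2) is not reproduced in this text: the symbol $L_{nce}$ stands for the NCE loss of Eq. (1), and the equal weighting of the two text-matching terms is an illustrative choice. Only the role of $\beta$ as the weight of the video-music term comes from the description.

```latex
L_{cm} = \beta \, L_{nce}(z_v, z_m) + L_{nce}(z_v, z_t) + L_{nce}(z_m, z_t)
```

Under this reading, a larger $\beta$ makes the direct video-music alignment dominate, while the two text terms pull video and music with the same tag toward a shared region of the common space.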
Additionally, we add a category prediction task to help YuYin learn the relationship between video and music in the same product category. The prediction loss $L_{pre}$ is computed as in Eq. (4), where CE is the cross-entropy loss and $y$ is the ground-truth label. Specifically, a weight-shared classifier predicts the label of each modality in the common space separately. By optimizing $L_{pre}$, the correlation between videos and music with the same label is better exploited, reducing the distance between positive video-music pairs. Finally, the total loss $L$ consists of the cross-matching loss $L_{cm}$ and the prediction loss $L_{pre}$, as in Eq. (5), where $\alpha$ is a hyper-parameter that controls the impact of the prediction loss.
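The prediction task and the combined loss can be sketched as below. This is a hedged illustration: Eq. (4) is not reproduced in this text, so summing the cross-entropy over the three modalities is an assumption, and the classifier weights and projections are made-up values. The value alpha = 0.1 is the setting reported in the experiment setup.

```python
import math


def softmax(logits):
    max_logit = max(logits)
    exps = [math.exp(value - max_logit) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the ground-truth label."""
    return -math.log(softmax(logits)[label])


def classify(z, weight, bias):
    """Shared linear classifier: the same weight and bias serve every modality."""
    return [sum(w * zi for w, zi in zip(row, z)) + b for row, b in zip(weight, bias)]


def prediction_loss(z_v, z_m, z_t, label, weight, bias):
    """Assumed form of L_pre: sum of cross-entropy over video, music, and text."""
    return sum(
        cross_entropy(classify(z, weight, bias), label) for z in (z_v, z_m, z_t)
    )


def total_loss(l_cm, l_pre, alpha=0.1):
    """L = L_cm + alpha * L_pre."""
    return l_cm + alpha * l_pre


# Toy setup: 2-dimensional common space, 4 categories (illustrative values).
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
bias = [0.0, 0.0, 0.0, 0.0]
z_v, z_m, z_t = [2.0, 0.1], [1.5, -0.2], [1.8, 0.0]
l_pre = prediction_loss(z_v, z_m, z_t, label=0, weight=weight, bias=bias)
loss = total_loss(l_cm=1.0, l_pre=l_pre)
```

Because the classifier parameters are shared, lowering the prediction loss for one category pushes the video, music, and text projections of that category toward the same region of the common space.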

Experiment setup
Because video-music retrieval methods can only retrieve music from the music pool without editing it, subjective evaluation may be misleading: listeners can hardly know how the music will be used as the BGM of a given video. We therefore conduct only objective experiments on Commercial-98K, with 95,607 samples serving as the training set, 1464 as the testing set, and the remaining 1000 as the evaluation set. Each set includes all of the dataset's categories.
YuYin is implemented in PyTorch with an embedding dimension of 1024, and the common space projection uses an MLP with two layers of dimensions {512, 256} and the ReLU activation function. $\alpha$ in Eq. (5) is set to 0.1, while $\beta$ in Eq. (2) is set to 3.0. YuYin is trained on an RTX 3090 GPU for 30 epochs using the Adam optimizer, with a batch size of 1024 and a learning rate of 0.0001. After each epoch, the model is evaluated on the evaluation set, and the evaluation loss is monitored to prevent overfitting.

Evaluation metrics
As the standard cross-modal retrieval metric, Recall@K is used to validate the performance of YuYin on the video-music retrieval task [60]. As shown in Eq. (6), for each query the similarity list retrieved by the model is sorted in descending order and the top K retrievals S[:K] are kept. Recall@K is then the ratio of the number of hits, that is, queries whose ground-truth item appears among the top K retrievals, to the number of queries $N_{query}$.
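A minimal sketch of Recall@K follows, assuming each query has exactly one ground-truth item (the music from the same advertisement). The function name and the toy similarity matrix are illustrative.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose ground-truth candidate is in the top k.

    similarity[q][c] is the score of candidate c for query q;
    ground_truth[q] is the index of the correct candidate for query q.
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Three queries, three candidates (made-up scores).
similarity = [
    [0.9, 0.2, 0.1],  # query 0: correct candidate 0 is ranked first
    [0.8, 0.5, 0.3],  # query 1: correct candidate 1 is ranked second
    [0.7, 0.6, 0.1],  # query 2: correct candidate 2 is ranked third
]
ground_truth = [0, 1, 2]
r1 = recall_at_k(similarity, ground_truth, 1)
r2 = recall_at_k(similarity, ground_truth, 2)
r3 = recall_at_k(similarity, ground_truth, 3)
```

Recall@K can only stay the same or increase as K grows, which is why results are usually reported for several values of K.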
$$L = L_{cm} + \alpha \cdot L_{pre} \qquad (5)$$
Fig. 4 The weighted fusion module (WFM) in YuYin, which learns to apply dynamic weights to the audio and emotion features and outputs the music feature

Performance comparison
In this study, we compare YuYin with the following video-music retrieval methods:
• CCA [38]: CCA uses a linear projection and maximizes the correlation between the latent variables of video and music during training.
• DCCA [61]: DCCA learns the projection for each modality and maximizes their correlation through deep learning.
• CEVAR [23]: CEVAR uses two sets of fully-connected networks (FC) to extract video features and audio features in YouTube-8M to calculate the cosine loss, and predicts the label of the video as the prediction loss. We keep this strategy and use the tags in Commercial-98K for its prediction loss.
• CBVMR [21]: CBVMR is a content-based video-music retrieval model, which introduces intra- and inter-modality constraints on the audio features and the video features.
• CMVAE [34]: CMVAE is based on the VAE architecture. It fuses the video features and text features through a Product-of-Experts (PoE) module and projects the fused video and music features into a latent space to compute the reconstruction loss and cross-matching loss for training. For comparison, we retrain CMVAE on Commercial-98K and use the tags in Commercial-98K as the text features for fusion.
• MRCMV [29]: MRCMV fuses voice-over with video features through a multi-head attention module and uses two separate self-attention modules for the video and music features. Because there is no voice-over in Commercial-98K, we replace the voice-over features with our text features in our comparison.
• Random: randomly recommend music for the given video.
The results are shown in Table 2, which indicates that YuYin outperforms the other methods on Commercial-98K. The performance of CCA indicates that the correlation between videos and music is difficult to learn with a linear projection. CBVMR has inter- and intra-modality constraints, but its hand-crafted audio features can represent only limited information. Although CEVAR introduces labels for prediction in addition to computing the cosine loss between videos and music, fine-grained information in video and music can hardly be exploited through two sets of fully connected networks. CMVAE performs equally well on video-music retrieval and music-video retrieval, which may be attributed to the cross-matching and reconstruction losses used in the training stage, which help it capture more of the correlation between videos and music. We also find that MRCMV achieves considerable results even though we replace its voice-over features with our text features. The reason may be that the multi-head attention module and the self-attention modules in MRCMV refine the video features and music features, allowing MRCMV to exploit more task-related information.

Ablation study
In this section, we explore the impact of each component in YuYin. First, we investigate the effect of text and emotion in YuYin by removing them and retraining. Then, we replace the WFM in YuYin with traditional fusion approaches. Finally, we investigate the effect of multi-task learning on the video and music features in the common space through feature visualization.

Effect of emotion and text
We verify the effect of each modality by eliminating the text features (YuYin w/o T) and the emotion features (YuYin w/o E), while YuYin pure uses only audio and video features. When eliminating emotion features, the related WFM is also removed. Furthermore, when eliminating text features, we also remove the related cross-matching loss while keeping the prediction loss. As the results in Table 3 show, removing either emotion or text decreases the performance of YuYin, and the emotion features have the greater impact. This may be because the emotion features extracted by OpenSmile have a more intuitive meaning than the audio features extracted by AST. We also find that the text features have less impact on the performance of YuYin.
To explore why the text features can hardly improve the performance when the emotion features are eliminated, we randomly extract features from 1000 pieces of music in each category in Commercial-98K and reduce the dimension of the music features to 2 by t-distributed stochastic neighbor embedding (t-SNE) for visualization. As shown in Fig. 5, the music features become more discriminative with the emotion features, while the music features in YuYin w/o E are sparse. A likely reason is that the text features only play a supportive role in aligning video and music in the training phase, and the sparse music features make it hard for the text features to align the other modalities, so the performance of YuYin pure is comparable to that of YuYin w/o E. In addition, the results further support the effect of emotion features. The audio features may be less discriminative than the emotion features because AST is frozen during the training phase, whereas OpenSmile extracts emotion features with fixed rules.
Fig. 6 The KDE analysis of the similarity between the positive and negative samples

Effect of WFM
To study the impact of the WFM, we compare it with two traditional fusion methods, Concat and Add, by replacing the WFM with each of them in turn. For YuYin (Concat), the multi-modal features are concatenated and fed into the subsequent network. For YuYin (Add), because the feature dimensions differ, the features are first transformed to the same dimension by a linear projection and then summed for fusion.
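The two baseline fusion methods can be sketched as follows. This is an illustration under stated assumptions: the helper names, feature sizes, shared dimension, and random projection weights are all hypothetical, and in the actual model the projections would be learned.

```python
import random


def linear(x, weight, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias (untrained, illustration only)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def fuse_concat(audio, emotion):
    """Concat: join the two feature vectors end to end."""
    return audio + emotion


def fuse_add(audio, emotion, audio_proj, emotion_proj):
    """Add: project both vectors to a shared dimension, then sum element-wise."""
    projected_audio = linear(audio, *audio_proj)
    projected_emotion = linear(emotion, *emotion_proj)
    return [a + e for a, e in zip(projected_audio, projected_emotion)]


rng = random.Random(0)
audio = [0.5, -1.0, 0.2, 0.0]   # toy 4-dimensional audio features
emotion = [1.5, 0.3, -0.7]      # toy 3-dimensional emotion features
shared_dim = 5                  # arbitrary shared dimension
audio_proj = init_linear(len(audio), shared_dim, rng)
emotion_proj = init_linear(len(emotion), shared_dim, rng)

concat_feature = fuse_concat(audio, emotion)
add_feature = fuse_add(audio, emotion, audio_proj, emotion_proj)
```

Concat keeps every input dimension and leaves the weighting to later layers, whereas Add mixes the two modalities into a single vector before any later layer can treat them separately.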
As shown in Table 4, YuYin (WFM) performs the best. This may be because the WFM learns to weight the different modalities during training and thus processes the features at a finer granularity, while the Concat and Add methods cannot selectively extract information, which leaves more interfering information in the fused features and limits model performance. Furthermore, the lower results of YuYin (Add) compared with YuYin (Concat) may result from information lost in the linear projection and summation, which direct concatenation avoids.

Effect of multi-task learning
To investigate the effect of multi-task learning, as shown in Fig. 6, kernel density estimation (KDE) is applied to visualize the similarity of video-music pairs and demonstrate how YuYin distinguishes between positive and negative video-music pairs in cross-matching. The results show that the similarity between positive video-music pairs is significantly higher than that of negative pairs in the common space. Furthermore, to explore the effect of the prediction task, as shown in Fig. 7, we observe the video and music features from YuYin with and without the prediction task. In detail, we use t-SNE to reduce the dimension and visualize the video and music features from each category in Commercial-98K. The video projections in the common space have a more distinct distribution with the prediction task. However, we also find no clear pattern in the distribution of the music features. This may be due to the lack of a fixed paradigm for selecting background music: even for the same advertisement, the choice of music can be influenced by personal preferences, music popularity, and other factors.
Fig. 7 Visualization of the video and music projections in the common space, with the dimension reduced to 2 by t-SNE
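The KDE comparison of positive and negative pairs can be sketched with a hand-written Gaussian kernel. The similarity values, the bandwidth, and the evaluation grid below are made up for illustration and are not results from the paper.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up cosine similarities for positive and negative video-music pairs.
positive_sims = [0.82, 0.75, 0.90, 0.68, 0.79]
negative_sims = [0.10, -0.05, 0.20, 0.15, 0.02]

pos_density = gaussian_kde(positive_sims, bandwidth=0.1)
neg_density = gaussian_kde(negative_sims, bandwidth=0.1)

# Evaluate both densities on a grid over the cosine-similarity range [-1, 1].
grid = [i / 10 for i in range(-10, 11)]
pos_peak = max(grid, key=pos_density)
neg_peak = max(grid, key=neg_density)
```

When the two density curves peak at clearly different similarity values and overlap little, the model separates positive from negative pairs well, which is the property a plot of this kind is meant to show.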

Conclusion
To reduce the labor of manually selecting background music for e-commerce advertisements, we first establish a large-scale dataset, Commercial-98K, from Alibaba, containing the background music, videos, and product category tags of 98,071 advertisements. Then, we propose a video-music retrieval model, YuYin, which uses a novel WFM to fuse audio and emotion features and is trained by multi-task learning to cross-match video, music, and text as well as to predict the category of video and music through a weight-shared classifier. Our experiments show that YuYin outperforms other models in video-music retrieval and demonstrate the effect of the multiple modalities and the WFM in YuYin. Moreover, through visualization, we investigate the data distribution of each modality and show that YuYin can distinguish positive and negative video-music pairs in the common space. In the future, based on Commercial-98K, we will continue to study factors other than emotion that affect video-music retrieval and replace our multi-modal feature extractors with newer networks.

Table 2
The results of YuYin and other compared methods on Commercial-98K dataset

Table 3
The results of YuYin when eliminating modalities on Commercial-98K
Fig. 5 The visualization of music features from YuYin (the left panel) and YuYin w/o E (the right panel), of which the dimension is reduced to 2 by t-distributed stochastic neighbor embedding (t-SNE)

Table 4
The results of YuYin with different fusion approaches on Commercial-98K