Beyond the Big Five personality traits for music recommendation systems

The aim of this paper is to investigate the influence of personality traits, characterized by the BFI (Big Five Inventory) and its significant revision called BFI-2, on music recommendation error. The BFI-2 describes the lower-order facets of the Big Five personality traits. We performed experiments with 279 participants, using an application (called Music Master) we developed for music listening and ranking, and for collecting personality profiles of the users. Additionally, 29-dimensional vectors of audio features were extracted to describe the music files. The data obtained from our experiments were used to test several hypotheses about the influence of personality traits and the audio features on music recommendation error. The performed analyses take into account three types of ratings that refer to the cognitive-emotional, motivational, and social components of the attitude towards the song. The experiments showed that every combination of Big Five personality traits produces worse results than using lower-order personality facets. Additionally, we found a small subset of personality facets that yielded the lowest recommendation error. This finding can condense the personality questionnaire to only the most essential questions. The collected data set is publicly available and ready to be used by other researchers.


Introduction
The volume of music data uploaded to the Internet has increased radically. The expanding number of music collections, mobile access to audio files, and streaming services pose challenges to finding appropriate songs. Today, thanks to the popularity of streaming services such as Spotify, Last.fm, Tidal, Pandora, or Qobuz, music discovery and recommendation systems have become much more popular than they were several years ago. Most of these services are hybrid systems (HS) that combine collaborative filtering (CF) and content-based (CB) approaches.
CF analyzes the community's ratings to infer one's musical preference. The underlying assumption is that if person A highly rates the same music as person B, then the system is more likely to recommend to A songs that A has not yet heard from the music pool of B than songs from the pool of a randomly chosen user [1,2]. Although this approach is widely adopted and computationally fast, it has limitations. First, CF assumes that musical taste is fixed and does not change over time, which is not always true [3]. Another limitation is the tendency to recommend popular music over pieces that have few ratings, so individual and unique preferences have little chance of being discovered by this algorithm. The most critical obstacle is the Cold-Start (CS) problem [4]: the system has not yet gathered sufficient information about the user or item to infer a precise recommendation. One of the strategies for tackling this problem is to resort to the user's contextual data (e.g., social network) in order to enrich rating profiles. The enhanced information about the user can be further used for clustering "similar" users and personalizing the recommendation [5][6][7]. The user's personality is a special case of such contextual data. The assumption is that people with similar personalities have similar interests and behavioral patterns [8], so they also rate music in a similar way. Personality can be derived implicitly from social networks [9] or explicitly from users [10]. The latter relies on asking the user to answer a list of personality questions. However, personality questionnaires are well established in psychology but not in recommendation systems. Additionally, they may be very long (some of them contain as many as 240 items [11]). Therefore, in this paper we also want to address this problem and select only the most relevant personality traits for making recommendations. This approach makes it possible to reduce the number of personality questions and presumably increase the satisfaction from using the system.
*Correspondence: Mariusz Kleć, mklec@pjwstk.edu.pl. 1 Multimedia Dept., Polish-Japanese Academy of Information Technology, Warsaw, Poland. 2 Institute of Psychology, Cardinal Stefan Wyszyński University, Warsaw, Poland.
The CB approach can also alleviate the CS problem. It focuses on the content of items, which can be the metadata or audio features. In this case, a single rating of one song by the user is enough to compute the similarity of that song's features to those of other songs and make a recommendation. However, this leads to recommendations that are "too similar", without a chance to surprise the user (low serendipity). Hybridizing these two approaches (i.e., CF and CB) can give satisfactory results. The hybrid approach is used today by large companies like Spotify or Pandora. A significant contribution to this field comes from adopting Deep Learning (DL) [12][13][14], which allows automatic feature extraction from audio signals [15] or learning latent factors from user-item rating data [16,17].
However, the factors that influence musical taste vary among individuals. Therefore, music information retrieval systems need to go beyond these approaches to deliver better recommendations. The type of music that one wants to listen to depends not only on listening history but also on one's current disposition, activity, as well as health condition, education, gender, and musical training [18][19][20][21].

Factors underpinning musical preferences
A positive correlation between a specific situation (context) and the preference for the music exists [18,21]. It is possible to track the listener's context (e.g., time [22], weather [23], location [24]) and derive the musical taste in that context implicitly [25][26][27]. In the works [3,28,29], the authors utilized the surrounding environment (e.g., noise, time, light, and weather) to suggest music.
Other essential factors that influence musical preferences are emotions [20]. While listening to music, people want to relieve stress, change or match their current emotions with those expressed by the music. Descriptions exist on how to communicate emotions via musical structure and how our emotions are influenced by listening to music [30]. Tracking the listener's emotions can help to improve the quality of a recommendation [31]. It is usually achieved implicitly by tracking the context, such as keywords from an extensive collection of documents written by users [32], or extracting the users' texts from social networks [33,34]. Another approach is to derive emotions from the user's face using the inbuilt camera of a mobile phone [31,35] or from the signals obtained via wearable physiological sensors [36]. Consequently, research on Context-Aware Music Recommendation Systems (CA-MRS) has gained importance in recent years [37].
However, musical preferences depend not only on the way people regulate their emotions, and current situation, but also on their personality [38]. For example, people who are neurotic (i.e., have low emotional stability) are more likely to use music to foster emotions [20]. Conversely, people who are conscientious and low in creativity (low open-mindedness) are more likely to use music for emotional change and emotional regulation [39].
The systems that incorporate the user's personality into the recommendation process are called Personality-Aware Music Recommendation Systems (PA-MRS) and are a branch of the CA-MRS [40].
In 2003, Rentfrow and Gosling [41] empirically found the relationships between personalities and musical preferences. Namely, reflective, complex music (e.g., blues, jazz, or folk) and intense and rebellious music (e.g., rock, alternative or heavy metal) are positively related to Openness to experience. On the other hand, upbeat and conventional music (e.g., country or pop) negatively correlates with Openness, but it positively correlates with Extraversion, Agreeableness and Conscientiousness. Finally, energetic and rhythmic music (e.g., hip-hop, dance or electronic) is positively correlated with Extraversion and Agreeableness. Classical music positively correlates with Neuroticism [42]. In 2011, Rentfrow et al. in [43] provided an improved description of musical preferences. Their findings demonstrate a latent five-factor structure underlying music preferences (further called MUSIC factors): Mellow (comprising smooth and relaxing styles), Urban (defined largely by rhythmic and percussive music), Sophisticated (includes classical, operatic, world music, and jazz), Intense (defined by loud, forceful, and energetic music) and Campestral (comprising a variety of various styles of direct and rootsy music, often found in country and singer-songwriter genres).
In [44] Bansal and co-authors confirmed that the music genre relates to the Big Five personality traits. They analyzed a global music-download database consisting of millions of entries with music metadata describing people downloading songs onto Nokia mobile phones. They showed that many genres in people's music collections are positively associated with Openness and (unexpectedly) Agreeableness, suggesting that individuals with high Openness and Agreeableness have broader musical tastes than those with high levels of other personality traits. The outcomes also aligned with literature showing that individuals who prefer jazz and folk score highly in Openness [45]. Such persons also tend to avoid genres like pop [46]. Since the level of Openness is related with the level of IQ [47], the findings above also find confirmation in the work of [48]. The authors indicate that people with higher IQ tend to prefer reflective and complex (e.g., jazz, classical, folk, blues) to upbeat and conventional music (e.g., pop). It is because the complex and reflective music is more likely to suit those who seek intellectually stimulating experiences. These people use music in rational or intellectual rather than emotional ways, implying higher levels of cognitive processing.
In [42] the authors indicate strong positive correlations between Neuroticism and classical music preference. Interestingly, they did not find Conscientiousness, Extraversion, or Neuroticism to be predictors of genre exclusivity. In [49], however, a large dataset consisting of music listening histories and personality scores of 1415 Last.fm users was analyzed. The results agree with prior work but also show a negative relation between Conscientiousness and folk music. They also indicate positive relations between Extraversion and such genres as R&B or rap, between Agreeableness and country or folk, and between Neuroticism and alternative music. However, musical genre is a conventional term, and the border between different musical genres is often quite blurry. The authors in [50] investigated how different musical taxonomies (e.g., mood, activity, genre) influence the user experience and satisfaction of using music streaming services. Their findings are correlated with the Big Five personality traits. They also describe the link between the musical expertise of the listener and the number of categories within the given taxonomy. Their outcomes show that musically sophisticated users (e.g., experts) enjoy using the system more when exposed to a broader set of categories. This is also confirmed in [51], where experts enjoyed the music more when offered a more diverse choice. Still, there is a need for describing the link between personality and music in a more quantitative way. Such an approach is described in [52], where audio features such as dynamics, mode, register, and tempo were correlated with the Big Five. The authors showed, among other findings, that slow tempo is rated higher by listeners high in Conscientiousness, that major mode is preferred by listeners low in Conscientiousness but high in Extraversion, and that piano (soft) dynamics are rated higher by listeners high in Openness.
In general, audio features are expressed in a quantitative way and can be used together with personality traits in PA-MRS. An interesting approach is described in [53], where the authors try to predict a personality trait (Extraversion or Introversion) from the audio features of an excerpt by employing several classification algorithms.
The authors of [54] showed that the recommendation accuracy could be improved by integrating personality traits. They also demonstrated that the accuracy depends on the recommendation domain: higher accuracy can be achieved in the movie domain than in the music domain.
In another paper [55], the authors analyze the influence of personality traits and emotional states (among others) on ratings. They found that the users with a high degree of Agreeableness rate at least 0.5 stars higher compared to the users with low Agreeableness (on a rating scale from 1 to 5) [56]. In [57] the authors compared the contribution of personality features and physiological signals (recorded by a wearable device) to the accuracy of their recommendation system. They found that the physiological features contributed less than the personality features.
It is also worth mentioning that users with different personalities show different preferences, regarding not only the recommendation accuracy, but also such properties of recommendation as diversity, popularity, and serendipity [58,59]. The personalization of diversity is described in [60] and used in [61]. The authors demonstrated increased user satisfaction and recommendation diversity when they personalized the system according to the user's personality.

Personality acquisition
Developing an efficient method of personality acquisition for music recommendation systems (MRS) is a challenge. A review of personality assessment questionnaires can be found in [40]. The most popular one is the Big Five Inventory (BFI) questionnaire, used for Big Five personality acquisition [62,63]. The Ten Item Personality Inventory (TIPI) is another common option [64]. Generally, the questionnaires vary in the number of questions that the user has to answer. The TIPI is a very short questionnaire containing only 10 items. However, most questionnaires contain more than 50 items (some even 100, 200, or more). Longer questionnaires provide higher reliability but, at the same time, require more effort from the user. Therefore, researchers try to acquire personality factors implicitly, e.g., using machine learning techniques with features extracted from social media streams [9]. Implicit acquisition does not require any action from the user, but its performance is much worse than that of explicit methods. For example, in [65], the authors were able to predict personality parameters from Twitter within 11-18% of their actual value by looking at the content of the user's tweets. The accuracy of such predictions is thus limited, which was also confirmed in [66].

Contribution
We hypothesize that selecting only the most relevant personality traits for making recommendations reduces the recommendation error and limits the number of questions the user needs to answer. To verify this hypothesis, we aimed at selecting the most relevant personality traits. In our study, we decided to use an explicit method for personality acquisition. Since long questionnaires may be fatiguing for users, we wanted to find a trade-off between the reliability of the user personality representation and the length of the questionnaire. We used the revised version of the BFI (i.e., BFI-2) [67], as it contains 60 items (questions) and makes it possible to go beyond the Big Five personality traits by also measuring their lower-order level (i.e., facets). We developed an application (called Music Master) for gathering users' personality information, listening to music, and rating it. Based on the data collected from the listening sessions, a memory-based hybrid music recommendation system was developed and evaluated in an offline manner. The memory-based approach allows us to measure (among others) the similarities between users in terms of various subsets of their personality traits, and to interpret the recommendation process clearly. Based on the results, we selected only those traits (and their corresponding questions from the BFI-2 questionnaire) that contributed most to the system's performance. To the best of our knowledge, the BFI-2 has not been used before in any recommendation system. Additionally, we have published the collected data with ratings, features, and personality traits to make them available for further investigations by other researchers.

Personality
Personality describes how individuals differ in their permanent emotional, interpersonal, experiential, attitudinal and motivational styles [68]. Over the past quarter-century, personality psychology has been dominated by theories of traits. There are several established and at the same time competing models of personality trait structure, such as the so-called Giant Three model by Eysenck [69], six-factor HEXACO model [70], or Two-Factor Model of higher-order personality factors [71,72]. However, the Five-Factor Model, which is also known as the Big Five [11,62,73], is the prevailing conceptualization of personality structure and its basic dimensions. According to the Big Five model, most of the significant individual differences in people's patterns of thinking, feeling, and behaving are embraced by five personality domains: Extraversion, Agreeableness, Conscientiousness, Neuroticism (or Negative Emotionality) and Openness to experience (alternatively labeled Intellect or Open-Mindedness) [11,62,67]. These domains are basic personality dimensions, and each of them is a quantitative variable with a positive and negative pole (e.g., the negative pole of Extraversion is introversion, and the negative pole of Neuroticism is emotional stability).
Most papers focus on the Big Five model [40], possibly because of the ease of its interpretation and because the results can be expressed quantitatively [74]. A discussion on the usability of this model in recommendation systems can be found in [75]. However, in our study, the revised version of the BFI (i.e., BFI-2) [67] was used. This psychometric model contains scales for the 5 primary domains and 15 subscales, nested within the primary ones (in total 20 personality dimensions, further referred to as traits). Brief characteristics of primary personality domains, and a list of their lower-order subscales (further referred to as facets) are given below:
• Extraversion: characterizes the activity (energy) level, the number of social interactions and social self-confidence, as well as positive emotionality.
- Sociability, Assertiveness, Energy Level;
• Agreeableness: general disposition toward other people: positive, trustful, polite, empathic and altruistic vs. negative, antagonistic, and egocentric.
- Compassion, Respectfulness, Trust;
• Conscientiousness: revealed in relation to work, rules and obligations and characterizes the level of orderliness, dutifulness, as well as perseverance and diligence.
- Organization, Productiveness, Responsibility;
• Neuroticism: contains negative emotionality, over-sensitivity, volatility and irritability, as well as vulnerability, lack of resistance to stress, and low self-esteem.
- Anxiety, Depression, Emotional Volatility;
• Openness: positive (cognitive) attitude towards novelty and both intellectual stimuli (abstract ideas), as well as aesthetic (or artistic) experiences; vivid imagination and complex thinking.
- Aesthetic Sensitivity, Intellectual Curiosity, Creative Imagination.

Music Master application
We developed an application to gather the listener's personality profiles and musical ratings. The application communicates with the server using the TCP/IP protocol. The client part is called Music Master (MM). Its User Interface (UI) is divided into three main views: personality registering, personality visualization, and music player (see Fig. 1).
First, the user creates an account by choosing a username and password and then rates phrases describing themselves. The phrases come from the BFI-2, e.g., "I am someone who is outgoing" or "I am someone who is compassionate" [67]. Next, the personality profile is calculated and presented visually. When saving, the data is encrypted to ensure anonymity. Setting up a new account allows the user to start listening to music. The application is prepared to propose one song at a time or to generate a set of songs as a playlist. However, only the first option was used for gathering the data described in this paper. The client part was written in ActionScript 3.0 using the Adobe Animate CC software, which allowed easy deployment on various platforms, such as PCs or mobile devices running Android or iOS. The server part was written in Java. Its role is to communicate with the client and stream audio files. It saves the music metadata, audio features, user profiles, ratings, and user actions. The recommendation engine was written in Matlab. Each song is represented by a 29-dimensional feature vector. The features are described below.

Audio features
There were 29 features calculated from each of the songs: 11 amplitude-based features, 6 spectrum-based features, 4 high-level features, and 8 emotion-based features. They were calculated using 50 ms frame length with Hamming windowing and half-frame overlapping by means of the MIRtoolbox in Matlab [76,77]. The values of the audio features were averaged across all the frames within the length of the audio file. Some features are based on the statistics of occurring sudden bursts of signal energy that usually corresponds to such events as notes, chords and rhythm beats. Additional information about each feature can be found in [76][77][78].

Amplitude-based features
• Attack time: the mean, standard deviation, slope, and entropy of the duration of events' attack phase, detected in the amplitude of the signal (AttackTimeMean, AttackTimeStd, AttackTimeSlope, AttackTimeEntropy).
• Attack slope: the mean, standard deviation, slope, and entropy of the average slope of events' attack phase, detected in the amplitude of the signal (AttackSlopeMean, AttackSlopeStd, AttackSlopeSlope, AttackSlopeEntropy).
• Zero crossing rate (Zerocross) is a simple indicator of the noisiness of the signal. It counts the average number of times that the signal changes sign in the frame.
• RMS measures the global energy of the signal. It is computed as the root mean square of the amplitude.
• Lowenergy is the percentage of frames that show less than average energy [79].
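As a rough illustration of the frame-based computation described above, the following Python sketch (standard library only) splits a signal into 50 ms frames with half-frame overlap and averages three of the amplitude-based features across frames. It is not the MIRtoolbox implementation: the function names are hypothetical, the Hamming window is omitted for brevity, and the exact normalization used by MIRtoolbox may differ.

```python
import math

def split_frames(signal, sample_rate, frame_sec=0.05, overlap=0.5):
    """Split a list of samples into fixed-length frames with the given overlap."""
    size = int(round(frame_sec * sample_rate))
    hop = max(1, int(round(size * (1.0 - overlap))))
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def zero_cross_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (frame needs >= 2 samples)."""
    changes = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return changes / (len(frame) - 1)

def rms(frame):
    """Root mean square of the sample amplitudes in one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def amplitude_features(signal, sample_rate):
    """Average Zerocross and RMS over frames; Lowenergy is a fraction of frames."""
    frames = split_frames(signal, sample_rate)
    frame_rms = [rms(f) for f in frames]
    mean_rms = sum(frame_rms) / len(frame_rms)
    return {
        "Zerocross": sum(zero_cross_rate(f) for f in frames) / len(frames),
        "RMS": mean_rms,
        # Share of frames whose energy is below the average frame energy.
        "Lowenergy": sum(1 for e in frame_rms if e < mean_rms) / len(frame_rms),
    }
```

A signal that is loud in its first half and quiet in its second half should give a Lowenergy value close to 0.5, since roughly half of the frames fall below the average frame energy.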

Spectrum-based features
• Centroid, spread, skewness, kurtosis, flatness, entropy are statistical descriptions of spectral distribution and are described by statistical moments.
- Centroid indicates the center of mass of the spectrum. It has a connection with the impression of the brightness of a sound. A higher value of centroid corresponds to a brighter sound (i.e., with more energy of the signal being concentrated within higher frequencies).
- Spread is the indicator of how a spectrum is spread in the frequency domain. Noises have a high spectral spread, whereas sounds with isolated peaks in the spectrum have a low spectral spread. Noisy signals are more challenging to interpret. Spectral spread is used as an indication of the dominance of a tone because the spread is low in this case; pitched sounds have low spectral spread. For complex sounds, the spread increases as the tones diverge and decreases as the tones converge.
- Skewness measures the symmetry of the distribution. A distribution can be positively skewed in the case when it has a long tail to the right, while a negatively skewed distribution has a longer tail to the left. Symmetrical distribution has a skewness of zero. For harmonic signals, the spectral skewness indicates the relative strength of higher and lower harmonics.
- Kurtosis measures the flatness or non-Gaussianity of the spectrum around its centroid. It is used to indicate the "peakiness" of a spectrum. For example, if the white noise is occurring within the signal, then the kurtosis decreases.
- Flatness can be used to distinguish between a harmonic (flatness close to zero) and a noisy signal (flatness close to one for white noise).
- Entropy is low for a spectrum with many distinct spectral peaks and high for a flat spectrum. Spectral entropy is a measure of signal irregularity.
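The spectral descriptors listed above can be sketched in a few lines of Python by treating the normalized magnitude spectrum as a probability distribution over frequency. This is an illustrative sketch under stated assumptions, not the MIRtoolbox code: the function name is hypothetical, kurtosis is the plain (non-excess) fourth standardized moment, flatness is taken as the ratio of the geometric to the arithmetic mean, and entropy is the Shannon entropy divided by the logarithm of the number of bins.

```python
import math

def spectral_moments(freqs, mags):
    """Descriptors of one magnitude spectrum (needs >= 2 bins and a non-zero sum)."""
    total = sum(mags)
    p = [m / total for m in mags]  # spectrum as a probability distribution
    centroid = sum(f * w for f, w in zip(freqs, p))
    variance = sum((f - centroid) ** 2 * w for f, w in zip(freqs, p))
    spread = math.sqrt(variance)
    if spread > 0:
        skewness = sum((f - centroid) ** 3 * w for f, w in zip(freqs, p)) / spread ** 3
        kurtosis = sum((f - centroid) ** 4 * w for f, w in zip(freqs, p)) / spread ** 4
    else:
        skewness = kurtosis = 0.0  # degenerate case: all energy in one bin
    if all(m > 0 for m in mags):
        geometric = math.exp(sum(math.log(m) for m in mags) / len(mags))
        flatness = geometric / (total / len(mags))
    else:
        flatness = 0.0  # the geometric mean vanishes if any bin is empty
    entropy = -sum(w * math.log(w) for w in p if w > 0) / math.log(len(p))
    return {"centroid": centroid, "spread": spread, "skewness": skewness,
            "kurtosis": kurtosis, "flatness": flatness, "entropy": entropy}
```

In this sketch a perfectly flat spectrum gives flatness and entropy equal to one and zero skewness, while a spectrum dominated by a single peak gives lower flatness and entropy and a higher kurtosis, in line with the descriptions above.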

Higher-level features
• EventDensity estimates the average frequency of events per second.
• PulseClarity estimates the rhythmic clarity, indicating the strength of the beats [80].
• Inharmonicity estimates the number of partials that are not multiples of the fundamental frequency. It takes into account the amount of energy outside the ideal harmonic series.
• Brightness. Although spectral centroid can be used as brightness predictor, we decided to use an improvement to it, namely to calculate centroid only for signal energy above a particular frequency; we chose 1500 Hz [81,82]. This feature might be used to quantify the sensation of sharpness, related to the high frequency content of a sound.

Emotion-based features
• Activity, Valence, Tension, Happy, Sad, Tender, Anger, Fear: emotions evoked in music can be described using two paradigms: in terms of five basic emotions (i.e., happy, sad, tender, anger, and fear) and in terms of three dimensions: activity (or energetic arousal), valence (a pleasure-displeasure continuum) and tension (or tense arousal). The output of the predictive model of emotions, found on the basis of parameters from musical signal [77,83] gives the localization of emotional content within the five basic classes and within the three dimensions.

The experiment setup
In the presented work, 279 participants were invited to take part in the experiment. They were mainly students from the Faculty of Information Technology and the Faculty of New Media Arts of the Polish-Japanese Academy of Information Technology. The listening sessions were organized only for volunteers in classrooms with few students. Each participant was asked to set up an account with their personality profile in the Music Master (MM) application. It was preceded by a short presentation about the data encryption in the code because it was necessary to convince the participants that the research was entirely anonymous. Over-ear semi-open headphones AKG K-240 were used in the experiments. The participants were informed that they can listen to as many songs as they want for at least 10 min and they should not perform any other tasks on the computer.
The songs were released under a Creative Commons license and were randomly chosen from a pool of 745 songs downloaded from the magnatune.com website. The details about the pool of songs used in our experiments are given in Table 1.
The participants were informed that they could skip a song only after at least 20 seconds of continued listening (with the option to pause or skip to any desired point) and once the song had been rated. The three types of ratings we gathered denote answers, using a five-point Likert scale, to the following three questions:
• Q1: How much do you like this song?
- (1) "I definitely don't like it," (2) "I rather do not like it," (3) "I have no opinion," (4) "I rather like it," and (5) "I definitely like it."
• Q2: Would you like to listen to similar songs in the future?
- (1) "I definitely would not want to," (2) "I rather would not want to," (3) "I have no opinion," (4) "I rather would want to," and (5) "I definitely would want to."
• Q3: Would you recommend this song to your friend?
- (1) "I definitely would not recommend," (2) "I rather would not recommend," (3) "I have no opinion," (4) "I would rather recommend," and (5) "I would definitely recommend."
The majority of music recommendation systems ask users "how much do you like this song?" (Q1 rating type) and try to predict the same for unknown songs. This question refers to the cognitive-emotional component of the attitude towards a particular song (i.e., simply to the actual opinion and belief concerning the reaction to music). The question Q2, "would you like to listen to similar songs in the future?", refers to the motivational component of the attitude, reflecting possible engagement in future contacts with the song. It is worth noting that the prediction of future engagement with similar songs is something the recommendation systems try to do. Finally, the question Q3, "would you recommend this song to your friend?", refers to the social component of the attitude, reflecting a willingness to share a given song with the user's friends.
To summarize, we can say that while Q1 refers to just intrapsychic elements of the song preference, Q2 and Q3 are markers of its more extrinsic and behavioral aspects.

Collected data
In total, 5278 data items have been recorded. Each item represents the ratings of a particular song by one user, according to the three questions (Q1, Q2, and Q3). The answers to these three questions are further referred to as three rating types. The collected data set contains the values of 20 personality traits, the ratings for Q1, Q2, and Q3, and audio features extracted from musical files. The data set is publicly available.
Afterwards, three user-item matrices (each containing a different rating type) with 279 rows (users) and 745 columns (songs) were created. The sparsity of the matrices is equal to 0.9764. The global averages of the ratings are 2.85 for Q1, 2.58 for Q2, and 2.28 for Q3. We also studied the relationships between personality traits and ratings. We used Pearson's correlation coefficient to measure the strength and direction of each relationship (see Fig. 2). The correlation between Q1 and Q2 equals 0.871, between Q1 and Q3 0.764 and between Q2 and Q3 0.806. Figure 3 presents the distribution of each rating type across the Likert scale.

Proposed methodology
The role of MRS is to predict the user's rating value for an unknown song. The prediction is perfect when it is equal to the rating value that the user would give. More formally, having the group of users U and the set of songs S, the system's task is to learn a function f, which predicts the recommendation value r ∈ R for a song s to user u:
$$f: U \times S \rightarrow R$$
Model-based approaches, especially those incorporating DL techniques, can learn the recommendation function f to predict ratings with high accuracy [84]. This requires an amount of data large enough to prevent the models from over-fitting during training. However, the size of our data set is not sufficient for DL models. Moreover, we wanted to obtain high interpretability of the learning process and to analyze the results and interactions between variables in the prediction phase from a psychological point of view. This would be cumbersome or impossible in the case of DL. Therefore, we decided to implement an easy-to-interpret memory-based Collaborative Filtering (CF) algorithm. It utilizes the k most similar users (user-based) or similar items (item-based) for predicting the rating for a given item, i.e., song [1,2]. Cosine similarity is one of the most common measures used to calculate the similarity of two vectors of ratings [2], and we decided to use this measure. For rating similarity calculations, the item and user ratings were first normalized using $r^{norm}_{u,i} = \mu + b_i + b_u$ to remove user and item bias. Here $r^{norm}_{u,i}$ represents the normalized rating for user u and item i, $\mu$ denotes the global rating average, and $b_i$ and $b_u$ are the item and user bias, respectively. The biases are calculated as the difference between the global average and the average item or user ratings.
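The bias terms and the cosine measure described above can be illustrated with the short Python sketch below. It rests on assumptions that the text does not spell out: each bias is taken as the user (or item) mean minus the global mean, and a rating is normalized by subtracting the baseline µ + b_i + b_u from it. All function names are hypothetical.

```python
import math

def baseline_terms(ratings):
    """ratings: dict mapping (user, item) -> rating. Returns (mu, b_u, b_i)."""
    mu = sum(ratings.values()) / len(ratings)
    by_user, by_item = {}, {}
    for (u, i), r in ratings.items():
        by_user.setdefault(u, []).append(r)
        by_item.setdefault(i, []).append(r)
    # Assumed sign convention: bias = group mean minus global mean.
    b_u = {u: sum(v) / len(v) - mu for u, v in by_user.items()}
    b_i = {i: sum(v) / len(v) - mu for i, v in by_item.items()}
    return mu, b_u, b_i

def remove_bias(ratings):
    """Assumed normalization: subtract the baseline mu + b_i + b_u from each rating."""
    mu, b_u, b_i = baseline_terms(ratings)
    return {(u, i): r - (mu + b_i[i] + b_u[u]) for (u, i), r in ratings.items()}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under this reading, ratings that are fully explained by user and item offsets reduce to zero after normalization, so the similarity reflects only deviations from the baseline.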
In order to determine the set of k most similar users (user-based) or items (item-based), we first calculate a similarity matrix for each approach, using cosine similarity, and the k most similar users/items are found in the corresponding similarity matrix. In an item-based approach, the rating prediction for a song s and a user u is determined according to the following formula:
$$\mathrm{predItemBased}(u, s) = \frac{\sum_{n \in K} \mathrm{sim}(n, s) \cdot r_{u,n}}{\sum_{n \in K} \mathrm{sim}(n, s)} \qquad (1)$$
where sim(n, s) denotes the similarity between the song s and its n-th most similar neighbor, and $r_{u,n}$ is the rating for the n-th item given by the user u.
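The item-based prediction in Eq. (1) amounts to a similarity-weighted average over the k nearest neighbors. The Python sketch below illustrates it; the function name and data layout are hypothetical, and the choice to draw neighbors only from items the user has already rated is an assumption of this sketch.

```python
def predict_item_based(user_ratings, sims_to_target, k):
    """Similarity-weighted average over the k most similar rated items.

    user_ratings:   dict item -> rating given by user u
    sims_to_target: dict item -> sim(n, s) for the target song s
    Returns None when no neighbor is available or the weights sum to zero.
    """
    # Assumption: only items the user has rated can act as neighbors.
    candidates = [(sims_to_target[n], r)
                  for n, r in user_ratings.items() if n in sims_to_target]
    neighbours = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    denominator = sum(sim for sim, _ in neighbours)
    if denominator == 0:
        return None
    return sum(sim * r for sim, r in neighbours) / denominator
```

For example, with ratings `{"a": 5, "b": 1, "c": 3}`, similarities `{"a": 0.9, "b": 0.1, "c": 0.5}` and k = 2, the neighbors are a and c, and the prediction is (0.9 * 5 + 0.5 * 3) / (0.9 + 0.5). The user-based variant has the same shape with users in place of items.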
In a user-based approach, the rating prediction can be defined according to the following formula: where sim(n, u) denotes the similarity between the user u and its n ′ th most similar neighbor. The r s,n is the rating for the item s given by the n ′ th user.
Next, the user and item based approach were combined in the following formula of the hybrid rating prediction: Besides the similarity of ratings, we also used the similarity of audio features (instead of sim(n, s)) and personality (2) predUserBased(u, s) = n∈K sim(n, u) * (r s,n ) n∈K sim(n, u) predHybrid(u, s) = ∑ n∈K sim(n, s) * (r u,n ) + ∑ n∈K sim(n, u) * (r s,n ) ∑ n∈K sim(n, s) + ∑ n∈K sim(n, u) domains (instead of sim(n, u)) in our experiments. These data were normalized to have zero mean and standard deviation equal to one (z-score normalization), and cosine similarity was applied. We evaluated the experiments by calculating Root Means Square Error (RMSE), using the 10-fold cross validation approach (10-CV) to evaluate the predictions of ratings. Therefore, the "recommendation quality" in the further text refers to the quality measured by RMSE obtained via the 10-CV procedure, and the lower the RMSE, the higher the recommendation quality. We will report the results for all three rating types: Q1, Q2, and Q3.
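The hybrid prediction rule and the RMSE measure described in this section can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the default k, the fallback to the global mean when no neighbor is available, and the assumption that similarities are non-negative are all choices made for this sketch. The similarity matrices may come from ratings, audio features, or personality data.

```python
import numpy as np

def top_k(sim_row, candidates, k):
    """Return up to k candidate indices with the highest similarity."""
    return sorted(candidates, key=lambda n: sim_row[n], reverse=True)[:k]

def predict_hybrid(R, user_sim, item_sim, u, s, k=2):
    """Similarity-weighted average over the k nearest items rated by user u
    and the k nearest users who rated song s (missing ratings are np.nan).
    Assumes non-negative similarities."""
    rated_items = [n for n in range(R.shape[1])
                   if n != s and not np.isnan(R[u, n])]
    rating_users = [n for n in range(R.shape[0])
                    if n != u and not np.isnan(R[n, s])]
    item_nb = top_k(item_sim[s], rated_items, k)
    user_nb = top_k(user_sim[u], rating_users, k)
    num = (sum(item_sim[s, n] * R[u, n] for n in item_nb)
           + sum(user_sim[u, n] * R[n, s] for n in user_nb))
    den = (sum(item_sim[s, n] for n in item_nb)
           + sum(user_sim[u, n] for n in user_nb))
    if den == 0:
        return float(np.nanmean(R))  # hypothetical fallback: global average
    return float(num / den)

def rmse(y_true, y_pred):
    """Root Mean Square Error between true and predicted ratings."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Dropping the user-neighbor terms from both sums gives the item-based rule, and dropping the item-neighbor terms gives the user-based rule. In a 10-fold cross-validation setting, `rmse` would be computed on the held-out ratings of each fold and then averaged.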

Experiments
In our experiments we studied two main hypotheses:
1. The recommendation quality differs when employing various personality domains (user-based approach) or audio features (item-based approach).
2. There is a difference in recommendation quality when using solely Big Five personality traits, or their low-level facets (using a hybrid approach).
In order to examine these hypotheses, baseline recommendation quality values (in terms of RMSE) were calculated first for various settings. First of all, a global average value of ratings was calculated as a baseline prediction. Next, we calculated baseline RMSE values for simple user-based and item-based CF. Subsequently, the similarity of ratings (sim(n, u) and sim(n, s)) were replaced with the similarity of all personality traits and the similarity of all the audio features. Finally, we calculated baselines for the hybridized approaches. The results of the baseline RMSE values are presented in Table 2.
In order to study the influence of individual personality traits on the quality of music recommendations, we used a user-based CF. The influence of individual audio features was examined using an item-based approach. In order to measure the similarity between individual personality domains and individual audio features, which are 1-dimensional vectors (scalars), 1 − d was applied as a similarity measure (instead of cosine similarity), where d denotes Euclidean distance. The results are presented in Figs. 4 and 5.
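The reason for switching to 1 − d can be seen from a small example: the cosine similarity of two non-zero scalars a and b equals the sign of their product, so it is always +1 or −1 and carries no information about how close the values are. The sketch below computes the 1 − d similarity matrix for a single trait or feature; the function name and the sample values are illustrative assumptions. Note that for z-scored inputs d can exceed 1, in which case this similarity becomes negative.

```python
import numpy as np

def scalar_similarity_matrix(values):
    """Pairwise similarity 1 - d for one-dimensional data, where d is the
    Euclidean distance, which for scalars is the absolute difference."""
    v = np.asarray(values, dtype=float)
    d = np.abs(v[:, None] - v[None, :])
    return 1.0 - d

def scalar_cosine(a, b):
    """Cosine similarity of two non-zero scalars; always +1.0 or -1.0."""
    return (a * b) / (abs(a) * abs(b))

# Illustrative z-scored values of one trait for three users.
trait = [0.0, 0.5, 2.0]
sim = scalar_similarity_matrix(trait)
```

With these sample values, the users at 0.5 and 2.0 have a cosine similarity of 1.0 despite being far apart, whereas 1 − d distinguishes them from the closer pair at 0.0 and 0.5.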
For studying the differences in recommendation quality between Big Five and their low-level personality facets, we used the hybrid model for rating prediction (see Eq. 3). First, we calculated the similarity values for simplified models, namely for each pair consisting of one personality trait and one audio feature, and these values were applied to calculate predictions, for each rating type (see Fig. 6). Next, from all performed experiments, two minimum RMSE values were chosen for each rating type: (1) belonging to one of the Big Five traits and (2) belonging to one of the personality facets. Together with their corresponding audio feature, these results were saved for further experiments. The pair (personality dimension and audio feature) that gave the lowest RMSE for each rating type, will be further called the "best pair". Therefore, we obtained 6 best pairs, i.e., two pairs for each of the three ratings Q1, Q2, and Q3.
In the next steps, we gradually improved the results. We started with the two pairs for which the minimal RMSE values for Q1 were obtained (see Fig. 6), i.e., (curiosity, tender) and (Openness, tender). Next, for each of the two previously selected best pairs, the next best pair was added, selected in the same manner as the first one. Namely, we added the one personality trait (domain or facet) and the one audio feature that, together with the previously selected pairs, yielded the minimal RMSE for Q1. The difference was that the first selection used Euclidean distances (as we had one-dimensional vectors, for which cosine similarity would not work), and in the next steps, cosine similarities were applied (as in this case we had multi-dimensional vectors). Every selection was performed in two ways: selecting only among Big Five domains and only among low-level facets. This process was repeated step by step until the RMSE started to grow. In each step, we reported the minimum RMSE results. The same procedure was also performed for Q2 and Q3. The results are presented in Fig. 7.
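The step-by-step procedure above is a form of greedy forward selection over (personality dimension, audio feature) pairs. A minimal sketch follows, assuming a caller-supplied `evaluate` function that returns the cross-validated RMSE of the hybrid model for given lists of traits and features; that function, the names `t1`..`f3`, and the toy error values are hypothetical and only serve to make the sketch runnable. The switch from 1 − d in the first step to cosine similarity in later steps would live inside `evaluate`.

```python
def greedy_pair_selection(traits, features, evaluate):
    """Add one (trait, feature) pair per step, always the pair that gives
    the lowest error together with the pairs chosen so far; stop as soon
    as the error no longer decreases."""
    chosen_traits, chosen_features = [], []
    best_error = float("inf")
    while True:
        candidates = [
            (evaluate(chosen_traits + [t], chosen_features + [f]), t, f)
            for t in traits if t not in chosen_traits
            for f in features if f not in chosen_features
        ]
        if not candidates:
            break  # nothing left to add
        error, t, f = min(candidates, key=lambda c: c[0])
        if error >= best_error:
            break  # the error started to grow
        chosen_traits.append(t)
        chosen_features.append(f)
        best_error = error
    return chosen_traits, chosen_features, best_error

# Toy stand-in for the real evaluation (made-up numbers, not paper results):
# each selected dimension lowers or raises a starting error of 2.0.
TOY_GAIN = {"t1": 0.3, "t2": 0.1, "t3": -0.2, "f1": 0.2, "f2": 0.1, "f3": -0.3}

def toy_evaluate(ts, fs):
    return 2.0 - sum(TOY_GAIN[x] for x in ts + fs)

selected = greedy_pair_selection(["t1", "t2", "t3"], ["f1", "f2", "f3"], toy_evaluate)
```

Running the selection twice, once with only the Big Five domains as candidates and once with only the facets, mirrors the two selection modes described in the text.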

Results and discussion
Every comparative analysis of results in the description below will concern the values of Q1, unless indicated otherwise. Aesthetic Sensitivity (a facet of Openness) has the highest positive correlation with all rating types (see Fig. 2). Interestingly, Assertiveness (a facet of Extraversion) negatively correlates with all ratings. We believe that this can be explained by the genres used in our experiment (classical, world, jazz, hard rock, alternative rock, electronic rock, and electronica). Rentfrow et al. [85] show that people with high Openness usually prefer more complex music, such as blues, jazz, folk, and rock, whereas extraverted people usually appreciate upbeat music such as hip-hop, funk, and electronic [53]. Our experiments corroborate these findings.
As shown in Fig. 2, persons of high Openness usually give higher ratings, in contrast to the persons of high Extraversion, who usually give lower ratings. Additionally, the genres used in the experiments seem to be preferred by persons high in the trait of Openness. However, as described in [44], people of high Openness have broader musical tastes (and enjoy more genres) than Extraverted people. Therefore, they may rate the music higher because they generally like to listen to it, not only because of the preferred genres. To summarize, even though we found statistically significant correlations between ratings and personality domains, these correlations are relatively weak. The performed meta-analysis described in [86] confirms weak connections between personality and five-dimensional MUSIC factors for music preferences. The authors in [52] also confirm that associations between personality and acoustic features exist, though this association is relatively weak. Nevertheless, it is worth noting that some lower-order facets show higher correlation with ratings than their main personality domains (see Fig. 2). Looking at the baseline results presented in Table 2, we can conclude that predicting Q2 and Q3 gives lower RMSE than predicting Q1 in all performed experiments. This means that our models make more accurate predictions for Q2 (how much the user wants to listen to similar songs in the future) than for Q1 (how much the user likes the song). Furthermore, the models perform even better when predicting Q3 (how much the user would like to share the song with friends). We believe these differences can be explained by the different distribution of these rating types (see Fig. 3). In the case of Q2 and Q3, we can see that participants tended to give lower ratings more often than for Q1. Therefore, a system that predicts lower ratings for Q2 and Q3 will achieve lower RMSE. 
Q1 refers to the opinion or belief concerning a particular song that the listener has just heard, and therefore it could be treated as a somewhat superficial aspect of the attitude. In contrast, Q2 and Q3 reflect socio-motivational and therefore more behavioral aspects of the attitude towards the music, requiring more engagement. Therefore, Q2 and Q3 can be seen as concerning more profound psychological characteristics, more strongly related to stable personality dispositions (traits).
When comparing all the rating-based CF results, we can see that the user-based approach performs much worse than the item-based one (1.326 vs 1.192). It is not surprising as the user-item matrix had only 275 users but 745 songs. Thus, the model had a more limited number of user neighbors for making the prediction, compared to songs. Regarding the item-based approach, replacing rating with audio feature-based similarities improves the . This result may suggest that people with similar personalities might not share similar musical tastes with the same strength as people with similar song ratings. However, in [10] the authors have shown that combining the personality similarity with a rating-based CF can bring improvement in rating prediction, compared to predictions based on rating data only. Therefore, we think we could not get a lower error because we used personality similarity alone, without the similarity of rating data. To confirm whether this combination will improve the results, as stated in [10], there is a need to combine personality and ratings similarity in the future work. It is worth noting that only the similarity of Intellectual Curiosity gives a lower error than the similarity of all personality traits together (1.359 vs 1.365). It confirms the findings of Braunhofer et al. [87] who have shown that exploiting even a single personality trait may lead to a considerable improvement in recommendation accuracy. Still, even if the improvement was observed for a single personality trait (Curiosity), the error (1.359) is still higher than user-based CF with rating similarity (1.326). Therefore, additional experiments with more data that combine the similarity of ratings and personalities are needed in the future.
When analyzing the recommendation quality of the hybrid model, and using the similarity of a single personality trait and a single audio feature, we can see that Intellectual Curiosity and Tender (emotion-based audio feature of music) result in the lowest error, see Fig. 6. Furthermore, this hybrid model slightly outperforms the item-based CF that considers all the personality and audio features dimensions (RMSE=1.1630 for CF vs 1.1628 for this particular hybrid model).
Prediction using the similarity of personality facets yields a lower error for all Qs than prediction based on the similarity calculated for any combination of the main Big Five personality domains (see Fig. 7). The error reduction is relatively small, but it always exists. However, the main gain results from the reduction of the set of personality facets (together with the appropriate set of audio features) applied in the similarity calculations. We found that Intellectual Curiosity, Responsibility, Aesthetic Sensitivity and Trust yielded the lowest recommendation error for Q1 (see Fig. 7). In this case, the RMSE was reduced from 1.1628 to 1.1349. The characteristics of the above set of personality facets are as follows: people of high Intellectual Curiosity desire to acquire general knowledge about the world, such as how systems work, mathematical relationships, what objects are composed of, etc. Responsibility involves being accountable or answerable for something; responsible people therefore feel a moral obligation to behave correctly, so other people usually perceive them as reliable. Aesthetic Sensitivity describes the ability to detect and appreciate beauty wherever it exists.

Fig. 7 The recommendation quality when gradually adding consecutive best pairs to the previous ones. The pairs were selected in two ways: only the best results belonging to the low-level personality facets (blue) and only the best results belonging to the Big Five personality domains (red). This gradual improvement method revealed the moment (the subset of personality domains and audio features), marked by dots on each graph, after which the errors started to grow.
We used the miremotion library [83] to calculate all the audio features, including the description of music-evoked emotions, based on the analysis of the audio signal of the recordings. These emotions have been described using two representations: 1) a discrete model with five basic emotions: happy, sad, tender, anger, and fear; 2) a three-dimensional model, in which these five basic emotions can also be placed, with the following dimensions: activity (energetic arousal), valence (a pleasure-displeasure continuum), and tension (or tense arousal). From Fig. 7, we can see that the similarity of activity of the tender and anger emotions evoked in music contributed most to the reduction of the recommendation error. This conclusion can also be drawn from an item-based approach, with single audio features used in similarity calculations (see Fig. 5).
It is also worth noting that, among other features, the indicator of how a spectrum is spread in the frequency (spread) contributed to reaching the minimum RMSE for Q1 and Q3. In addition, the global energy of the signal (rms) and its noisiness (zero-crossing rate) also contributed to reaching the minimum RMSE for Q2 and Q3.
Since the prediction highly depends on the similarity measure, further experiments may incorporate dimensionality reduction techniques (such as Singular Value Decomposition (SVD) or Principal Component Analysis (PCA)) together with clustering algorithms (such as k-means or Self-Organizing Maps (SOM)) to infer similar users or items [88]. SOMs produce clusters in an unsupervised manner from multi-dimensional data. Since the prediction could also depend on Q2 and Q3 values, a SOM can be used to group similar users or items based on the three rating types together with the other available observations (personality and audio features). The clusters obtained in this way can bring improvements in rating prediction [89]. Additionally, further analysis is required to investigate how the motivational (Q2) and social (Q3) components of the attitude towards the song influence the cognitive-emotional (Q1) component, which may depend on the personality. This can be inferred from the dataset by employing appropriate statistical analysis. Another interesting hypothesis to check, similar to the one described in [52], is the existence of a difference between features rated low and high, which may depend on the level of personality traits. The authors of this paper leave this hypothesis (and others) to be investigated by other researchers.
The recommendation quality does not depend on the prediction accuracy only. The prediction is needed for the recommender systems to build the list of songs for which the highest prediction of Q1 is obtained, indicating that the user would probably like to listen to these songs. Therefore, the songs are added to the recommendation list in the order of increasing RMSE for Q1 prediction. However, the user may actually prefer listening to other songs at the moment.
In our opinion, it seems reasonable to select the songs for which the predictions of Q1 and Q2 were both high. This means that the system could recommend songs similar to those the user would like to listen to in the future (Q2) and reject those for which Q2 was low. It would presumably increase user satisfaction with the recommended items. As far as the Q3 rating is concerned, the system could favor the songs that received high Q3 ratings from the user's friends. However, the link between "being the nearest neighbor" (used in the recommendation algorithm) and "being a friend" is unclear. Another idea is to formulate a confidence measure that will tell the system how trustworthy a particular prediction is. This measure would need to incorporate additional knowledge about the interactions between the three rating types, the number of ratings among neighbors, and maybe other factors. The authors leave these issues to be investigated in the future, as open research questions to be discussed by other researchers.
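The idea of requiring both a high Q1 and a high Q2 prediction could be sketched as a simple filter-then-rank step. This is only one possible reading of the proposal: the function name, the threshold `q2_min`, the list length `top_n`, and the toy predictions are hypothetical, and a suitable threshold would depend on the rating scale used.

```python
def recommend(pred_q1, pred_q2, q2_min=3.0, top_n=10):
    """Keep songs whose predicted Q2 is at least q2_min, then order them
    by predicted Q1 from highest to lowest and return the first top_n."""
    eligible = [s for s in pred_q1
                if pred_q2.get(s, float("-inf")) >= q2_min]
    return sorted(eligible, key=lambda s: pred_q1[s], reverse=True)[:top_n]

# Toy predictions for three songs (illustrative values only).
toy_q1 = {"song_a": 4.5, "song_b": 4.8, "song_c": 3.0}
toy_q2 = {"song_a": 4.0, "song_b": 2.0, "song_c": 3.5}
playlist = recommend(toy_q1, toy_q2)
```

In the toy data, the song with the highest predicted Q1 is rejected because its predicted Q2 falls below the threshold, which is the behavior the paragraph above argues for.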
One of the limitations of our experiments is that we used a random selection of music from the magnatune.com website. It offers both Eastern and Western music to download. As stated in [90], in terms of BFI, only the preferences for Western music are universal across 53 countries, but we do not know whether this is true for Eastern music as well. Additionally, the range of music genres was limited, and a more elaborate genre taxonomy would allow us to compare the results with other researchers [49,50,53], in terms of the preferences for genres by personality traits. In a future study, there may also be a control question that verifies answers related to personality types. Additionally, there is still an open question of how our findings correlate with real-world scenarios and how the preferred use of music (relaxing or jogging) influences the way participants rate the music in a controlled experimental environment with the use of headphones. However, the most important conclusion from the experiments performed in this paper is that utilizing BFI-2 (instead of BFI) is worth considering with every rating type.

Conclusions
This article describes the effect of utilizing BFI-2 personality domains in the music recommendation systems on the recommendation error. The BFI-2 allowed performing the analysis with more granularity due to the availability of low-order facets of Big Five personality domains. We collected the personality profiles and three music rating types (related to cognitive, motivational and social components of the attitude towards the music) from 279 users of the newly developed Music Master application. In addition, 29-dimensional vectors of audio features were incorporated into the analysis. To the best of our knowledge, a dataset with BFI-2 personality profiles, three rating types, and audio features has never been published before.
The experiments with our hybrid recommendation model showed interesting interactions between personality domains and audio features. It turned out that only a few low-order personality facets were enough to obtain the lowest recommendation error. Intellectual Curiosity, Responsibility and Aesthetic Sensitivity decreased the error significantly for predicting all three rating types. It is essential to note that, when using memory-based methods, any combination of Big Five personality traits produced a higher error than lower-order personality facets. However, there is still an open question whether the results scale to real-world scenarios or to model-based methods.
The experiment also revealed the subset of audio features that contributed most to obtaining the lowest error. These features refer to the activity of tender and anger emotions (i.e., two basic emotions, tender and anger, as represented along the activity axis in 3-dimensional space) evoked in music. These features were calculated based on the analysis of the audio contents of the recordings. More details about the predictive models of emotions can be found in [83].
We performed our experiments on a small dataset (5278 ratings from 279 users) and a relatively simple recommendation model based on user or item similarity. Unfortunately, our initial trials with training Singular Value Decomposition (SVD) resulted in overfitting due to the relatively small size of the dataset. Therefore, a more extensive setup, and even a live system working in real time, are required to prove that the reported subset of personality domains scales well with different recommendation algorithms. Nevertheless, the proposed simple hybrid model allowed a detailed analysis, based on the similarity of users and the similarity of songs.
An additional conclusion is that, instead of implementing the complete BFI-2 questionnaire, it is more practical and more effective to implement only a small subset of its questions. We observed that the best trade-off between the performance and the number of questions is obtained with the following three personality traits: Intellectual Curiosity, Aesthetic Sensitivity and Responsibility, and the following three audio features: tender, anger, and activity (see Fig. 7). When adding further ones, the error improvement is negligible. Therefore, instead of 60 questions (4 questions per personality facet), only 12 of them would result in a better recommendation performance and higher user satisfaction than a full questionnaire.
The authors hope that other researchers will find the data set practical and stimulating for designing other experiments and for testing other hypotheses that relate to the three aspects of ratings (Q1, Q2, and Q3), recommendation models, and personalities.