Automatic Recognition of Lyrics in Singing
© A. Mesaros and T. Virtanen. 2010
Received: 5 June 2009
Accepted: 23 November 2009
Published: 23 February 2010
The paper considers the task of recognizing phonemes and words from a singing input by using a phonetic hidden Markov model recognizer. The system is targeted to both monophonic singing and singing in polyphonic music. A vocal separation algorithm is applied to separate the singing from polyphonic music. Due to the lack of annotated singing databases, the recognizer is trained using speech and linearly adapted to singing. Global adaptation to singing is found to improve singing recognition performance. Further improvement is obtained by gender-specific adaptation. We also study adaptation with multiple base classes defined by either phonetic or acoustic similarity. We test phoneme-level and word-level n-gram language models. The phoneme language models are trained on the speech database text. The large-vocabulary word-level language model is trained on a database of textual lyrics. Two applications are presented. The recognizer is used to align textual lyrics to vocals in polyphonic music, obtaining an average error of 0.94 seconds for line-level alignment. A query-by-singing retrieval application based on the recognized words is also constructed; in 57% of the cases, the first retrieved song is the correct one.
Singing is used to produce musically relevant sounds by the human voice, and it is employed in most cultures for entertainment or self-expression. It consists of two main aspects: melodic (represented by the time-varying pitch) and verbal (represented by the lyrics). The sung lyrics convey the semantic information, and both the melody and the lyrics allow us to identify the song.
Thanks to the increased number of music playing devices and the available storage and transmission capacity, consumers are able to find plenty of music in the form of downloadable music, internet radio, personal music collections, and recommendation systems. There is a need for automatic music information retrieval techniques, for example, for efficiently finding a particular piece of music in a database or for automatically organizing a database.
The retrieval may be based not only on the genre of the music, but also on the artist identity (artist does not necessarily mean singer). Studies on singing voice have developed methods for detecting singing segments in polyphonic music, detecting solo singing segments in music, identifying the singer [5, 6], or identifying two simultaneous singers.
As humans can recognize a song by its lyrics, information retrieval based on the lyrics has a significant potential. There are online lyrics databases which can be used for finding the lyrics of a particular piece of music, knowing the title and artist. Also, knowing part of the textual lyrics of a song can help identify the song and its author by searching in lyrics databases. Lyrics recognition from singing would allow searching in audio databases, ideally automatically transcribing the lyrics of a song being played. Lyrics recognition can also be used for automatic indexing of music according to automatically transcribed keywords. Another application for lyrics recognition is finding songs using query-by-singing, based on the recognized sung words to find a match in the database. Most of the previous audio-based approaches to query-by-singing have used only the melodic information from singing queries  in the retrieval. In , a music retrieval algorithm was proposed which used both the lyrics and melodic information. The lyrics recognition grammar was a finite state automaton constructed from the lyrics in the queried database. In  the lyrics grammar was constructed for each tested song. Other works related to retrieval using lyrics include [11, 12].
Recognition of phonetic information in polyphonic music is a barely touched domain. Phoneme recognition in individual frames in polyphonic music was studied in , but there has been no work done using large vocabulary recognition of lyrics in English.
Because of the difficulty of lyrics recognition, many studies have focused on a simpler task of audio and lyrics alignment [14–18], where the textual lyrics to be synchronized with the singing are known. The authors of  present a system based on Viterbi forced alignment. A language model is created by retaining only vowels for Japanese lyrics converted to phonemes. In , the authors present LyricAlly, a system that aligns first the higher-level structure of a song and then within the boundaries of the detected sections performs a line-level alignment. The line-level alignment uses only a uniform estimated phoneme duration, rather than a phoneme-recognition-based method. The system works by finding vocal segments but not recognizing their content. Such systems have applications in automatic production of material for entertainment purposes, such as karaoke.
This paper deals with recognition of the lyrics, meaning recognition of the phonemes and words from singing voice, both in monophonic singing and polyphonic music, where other instruments are used together with singing. We aim at developing transcription methods for query-by-singing systems where the input of the system is a singing phrase and the system uses the recognized words to retrieve the song in a database.
The basis for the techniques in this paper is in automatic speech recognition. Even though there are differences between singing voice and spoken voice (see Section 2.1), experiments show that it is possible to use speech recognition techniques on singing. Section 2.2 presents the speech recognition system. Due to the lack of large enough singing databases to train a recognizer for singing, we use a phonetic recognizer trained on speech and adapt it to singing, as presented in Section 2.3. We use different settings: we adapt the models to singing, to gender-dependent models, and to singer-specific models, using different numbers of base classes in the adaptation. Section 2.4 presents the phoneme- and word-level n-gram language models that will be used in the recognition. Experimental evaluation and results are presented in Section 3. Section 4 presents two applications: automatic alignment of audio and lyrics in polyphonic music and a small-scale query-by-singing application. The conclusions and future work are presented in Section 5.
2. Singing Recognition
This section describes the speech recognition techniques used in the singing recognition system. We first review the main differences between speech and singing voice and present the basic structure of the phonetic recognizer architecture, and then the proposed singing adaptation methods and language models.
2.1. Singing Voice
Speech and singing convey the same kind of semantic information and originate from the same production physiology. In singing, however, the intelligibility is often secondary to the intonation and musical qualities of the voice. Vowels are sustained much longer in singing than in speech and independent control of pitch and loudness over a large range is required. The properties of the singing voice have been studied in .
In normal speech, the spectrum is allowed to vary freely, and the pitch or loudness changes are used to express emotions. In singing, the singer is required to control the pitch, loudness, and timbre.
The timbral properties of a sung note depend on the frequencies at which there are strong and weak partials. In vowels this depends on the formant frequencies, which can be controlled by the length and shape of the vocal tract and the articulators. Different individuals tune their formant frequencies a bit differently for each vowel, but skilled singers can control the pitch and the formant frequencies more accurately.
Another important part of the timbre difference between male and female voices seems to be the voice source: the major difference is in the amplitude of the fundamental. The spectrum of a male voice has a weaker fundamental than the spectrum of a female voice.
2.2. HMM Phonetic Recognizer
Despite the above-mentioned differences between speech and singing, the two still have many properties in common, and it is plausible that singing recognition can be done using the standard technique in automatic speech recognition, a phonetic hidden Markov model (HMM) recognizer. In HMM-based speech recognition it is assumed that the observed sequence of speech feature vectors is generated by a hidden Markov model. An HMM consists of a number of states with associated observation probability distributions and a transition matrix defining transition probabilities between the states. The emission probability density function of each state is modeled by a Gaussian mixture model (GMM).
In the training process, the transition matrix and the means and variances of the Gaussian components in each state are estimated to maximize the likelihood of the observation vectors in the training data. Speaker-independent models can be adapted to the characteristics of a target speaker, and similar techniques can be used for adapting acoustic models trained on speech to singing, as described further in Section 2.3.
Linguistic information about the speech or singing to be recognized can be used to develop language models. They define the set of possible phoneme or word sequences to be recognized and associate probabilities for each of them, improving the robustness of the recognizer.
The singing recognition system used in this work consists of 39 monophone HMMs plus silence and short-pause models. Each phoneme is represented by a left-to-right HMM with three states. The silence model is a fully connected HMM with three states and the short pause is a one-state HMM tied to the middle state of the silence model. The system was implemented using HTK (The Hidden Markov Model Toolkit (HTK), http://htk.eng.cam.ac.uk/).
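As an illustration of the phoneme model topology, the following minimal sketch builds the transition matrix of a three-state left-to-right HMM. It assumes no skip transitions, and the self-loop probability is an arbitrary initial value chosen for the example; neither detail is stated in the text. The function name is hypothetical.

```python
import numpy as np

def left_to_right_transitions(n_states=3, self_loop=0.6):
    """Transition matrix of a left-to-right HMM without skip transitions.

    Row i holds P(next | current state i). The extra last column is the
    exit transition that leaves the phoneme model. `self_loop` is an
    illustrative initial value, not a value taken from the paper.
    """
    A = np.zeros((n_states, n_states + 1))
    for i in range(n_states):
        A[i, i] = self_loop            # stay in the same state
        A[i, i + 1] = 1.0 - self_loop  # move on to the next state (or exit)
    return A

A = left_to_right_transitions()
```

Every entry below the diagonal is zero, so a path can never return to an earlier state; training would re-estimate the nonzero entries.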
As features we use 13 mel-frequency cepstral coefficients (MFCCs) plus delta and acceleration coefficients, calculated in 25 ms frames with a 10 ms hop between adjacent frames.
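The framing and feature dimensionality can be checked with a short sketch. The 16 kHz sample rate is an assumption made for the example, random numbers stand in for real MFCCs, and `np.gradient` is only a simple stand-in for the regression-based delta computation a toolkit would use.

```python
import numpy as np

def frame_count(n_samples, sample_rate, frame_ms=25, hop_ms=10):
    """Number of full analysis frames that fit into the signal."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop_len

def add_dynamic_features(static):
    """Append delta and acceleration coefficients to (frames, 13) static MFCCs."""
    delta = np.gradient(static, axis=0)  # first-order time difference
    accel = np.gradient(delta, axis=0)   # second-order time difference
    return np.hstack([static, delta, accel])

# One second of audio at an assumed 16 kHz sample rate.
n_frames = frame_count(16000, 16000)
mfcc = np.random.default_rng(0).normal(size=(n_frames, 13))  # placeholder values
features = add_dynamic_features(mfcc)
```

With 13 static coefficients, the delta and acceleration coefficients bring each feature vector to 39 dimensions.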
2.3. Adaptation to Singing
Due to the lack of a large enough singing database for training the acoustic models of the recognizer, we first train models for speech and then adapt them linearly to singing.
The acoustic material used for the adaptation is called the adaptation data. In speech recognition the data is typically a small amount of speech from the target speaker. The adaptation is done by finding a set of transforms for the model parameters in order to maximize the likelihood that the adapted models have produced the adaptation data. When the phoneme sequence is known, the adaptation can be done in a supervised manner.
The transform matrix and bias vector are estimated using the EM algorithm. It has been observed that MLLR can compensate for the difference in the lengths of the vocal tract.
The same transform and bias vector can be shared between all the Gaussians of all the states (global adaptation). If enough adaptation data is available, multiple transforms can be estimated separately for sets of states or Gaussians. The states or Gaussians can be grouped by either their phonetic similarity or their acoustic similarity. The groups are called base classes.
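The role of base classes can be sketched as follows. The example assumes the standard MLLR mean transform, in which each Gaussian mean mu is replaced by A mu + b; the estimation of A and b with EM is not shown, and the function name and the two class labels are hypothetical.

```python
import numpy as np

def adapt_means(means, base_class, transforms):
    """Apply mu_hat = A @ mu + b to every Gaussian mean.

    means:      (n_gaussians, dim) array of mean vectors
    base_class: one class label per Gaussian
    transforms: dict mapping a class label to its (A, b) pair; using the
                same label for every Gaussian gives global adaptation
    """
    adapted = np.empty_like(means)
    for i, (mu, label) in enumerate(zip(means, base_class)):
        A, b = transforms[label]
        adapted[i] = A @ mu + b
    return adapted

means = np.array([[1.0, 2.0], [3.0, 4.0]])
labels = ["vowel", "consonant"]
transforms = {
    "vowel": (np.eye(2), np.array([0.5, 0.5])),    # shift only
    "consonant": (2.0 * np.eye(2), np.zeros(2)),   # scale only
}
adapted = adapt_means(means, labels, transforms)
```

With more base classes each transform is estimated from less data, which is why multiple transforms are only useful when enough adaptation data is available.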
We do the singing adaptation using a two-pass procedure. The usual scenario in speaker adaptation for speech recognition is to use a global transform followed by a second transform with more classes constructed in a data-driven manner, by means of a regression tree. A single global adaptation is best suited when we have a small amount of training data or when we need a robust transform.
Divisions of phonemes into classes by phonetic similarity.
Number of classes | Classes
3 | vowels, consonants, silence/noise
8 | monophthongs, diphthongs, approximants, nasals, fricatives, plosives, affricates, silence/noise
one per vowel plus 6 | one class/vowel, approximants, nasals, fricatives, plosives, affricates, silence/noise
A second pass of the adaptation uses classes determined by acoustic similarity, by clustering the Gaussians of the states. We use clusters formed from the speech models and from the models after one-pass adaptation to singing.
In the initial adaptation experiments  we observed that CMLLR performs better than MLLR, and therefore we restrict ourselves to CMLLR in this paper.
2.4. N-Gram Language Models
The linguistic information in the speech or singing to be recognized can be modeled using language models. A language model restricts the possible unit sequences into a set defined by the model. The language model can also provide probabilities for different sequences, which can be used together with the likelihoods of the acoustic HMM model to find the most likely phonetic sequence for an input signal. The language model consists of a vocabulary and a set of rules describing how the units in the vocabulary can be connected into sequences. The units in the vocabulary can be defined at different abstraction levels, such as phonemes, syllables, letters, or words.
An n-gram language model can be used to model probabilities of unit sequences in the language model. It associates a probability with each subsequence of length $n$: given the $n-1$ previous units $w_{i-n+1}, \ldots, w_{i-1}$, it defines the conditional probability $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$. The probability of a whole sequence can be obtained as the product of the above conditional probabilities over all units in the sequence. An n-gram of size one is referred to as a unigram; size two is a bigram; size three is a trigram, while those of higher order are referred to as n-grams. Bigrams and trigrams are commonly used in automatic speech recognition.
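A minimal sketch of the bigram case follows. It uses unsmoothed maximum-likelihood estimates and hypothetical sentence-boundary symbols `<s>` and `</s>`; a practical language model would add smoothing so that unseen pairs do not get zero probability.

```python
from collections import Counter

def train_bigram(sequences):
    """Maximum-likelihood bigram probabilities P(cur | prev), no smoothing."""
    context_counts = Counter()
    bigram_counts = Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {bg: n / context_counts[bg[0]] for bg, n in bigram_counts.items()}

def sequence_probability(model, seq):
    """Product of the conditional probabilities over the whole sequence."""
    padded = ["<s>"] + list(seq) + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= model.get((prev, cur), 0.0)  # unseen bigram: probability zero
    return p

model = train_bigram([["i", "can", "fly"], ["i", "can", "touch"]])
p_fly = sequence_probability(model, ["i", "can", "fly"])
```

In this toy corpus "can" is followed by "fly" in one of two cases, so the sequence probability is 0.5; all other factors equal 1.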
It is not possible to include all possible words in a language model. The percentage of out of vocabulary (OOV) words affects the performance of the language model, since the recognition system cannot output them. Instead, the system will output one or more words from the vocabulary that are acoustically close to the word being recognized, resulting in recognition errors. While the vocabulary of the speech recognizer should be as large as possible to ensure low OOV rates, increasing the vocabulary size increases the acoustic confusions and does not always improve the recognition results.
A language model can be assessed by its perplexity, which measures the uncertainty in each word based on the language model. It can be viewed as the average size of the word set from which each word recognized by the system is chosen [28, pages 449–450]. The lower the perplexity, the better the language model is able to represent the text. Ideally, a language model should have a small perplexity and a small OOV rate on an unseen text.
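The interpretation of perplexity as an average word-set size can be illustrated with a short sketch, assuming the usual definition as the exponential of the negative mean log probability that the model assigns to each unit of the test text. The function name is hypothetical.

```python
import math

def perplexity(probabilities):
    """Perplexity from the model probability of each unit in a test text."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))

# A model that always chooses uniformly among 4 words has perplexity 4.
uniform_four = perplexity([0.25] * 10)
```

A model that assigns higher probabilities to the words that actually occur obtains a lower perplexity.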
The actual recognition is based on finding the most likely sequence of units that has produced the acoustic input signal. The likelihood consists of the contribution of the language model likelihood and the acoustic model likelihood. The influence of the language model can be controlled by the grammar factor, which multiplies the log likelihood of the language model. The number of words output by the recognizer can be controlled by the word insertion penalty which penalizes the log likelihood by adding a cost for each word. The values of these parameters have to be tuned experimentally to optimize the recognizer performance.
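The way the two control parameters enter the decoding score can be sketched as below. The additive combination follows the description above; the hypothesis log likelihoods and parameter values are invented for the illustration and are not taken from the experiments.

```python
def hypothesis_score(acoustic_loglik, lm_loglik, n_words,
                     grammar_factor, insertion_penalty):
    """Combined log score of one recognition hypothesis.

    The grammar factor multiplies the language model log likelihood, and
    the insertion penalty is added once per output word (a negative value
    discourages long word sequences).
    """
    return acoustic_loglik + grammar_factor * lm_loglik + insertion_penalty * n_words

# Two made-up hypotheses: (acoustic log likelihood, LM log likelihood, words).
short_hyp = (-100.0, -10.0, 3)
long_hyp = (-98.0, -14.0, 5)

def best(grammar_factor, insertion_penalty):
    scores = {
        "short": hypothesis_score(*short_hyp, grammar_factor, insertion_penalty),
        "long": hypothesis_score(*long_hyp, grammar_factor, insertion_penalty),
    }
    return max(scores, key=scores.get)

acoustic_only = best(grammar_factor=0.0, insertion_penalty=0.0)
with_language_model = best(grammar_factor=5.0, insertion_penalty=-2.0)
```

Without the language model the acoustically better but longer hypothesis wins; with a nonzero grammar factor and a negative penalty the shorter hypothesis is preferred.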
In order to test phoneme recognition with a language model, we built unigram, bigram, and trigram language models. As n-grams are used for language recognition, we assume that a phoneme-level language model is characteristic of the English language and does not differ significantly whether it is estimated from general text or from lyrics text. For this reason, as training data for the phoneme-level language models we used the phonetic transcriptions of the speech database that was used for training the acoustic models.
To construct word language models for speech recognition we have to establish a vocabulary, chosen as the most frequent words in the training text data. In large vocabulary recognition it is important to choose training text on a similar topic in order to have good coverage of the vocabulary and word combinations. For our work we chose to use song lyrics text, with the goal of keeping a vocabulary of about 5k words.
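Selecting a vocabulary by word frequency and measuring the resulting OOV rate can be sketched as follows. The tiny token lists and the threshold are illustrative only, and the function names are hypothetical.

```python
from collections import Counter

def build_vocabulary(tokens, min_count):
    """Keep the words that occur at least `min_count` times in the training text."""
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n >= min_count}

def oov_rate(tokens, vocabulary):
    """Percentage of word instances in `tokens` that are not in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for word in tokens if word not in vocabulary)
    return 100.0 * missing / len(tokens)

train = "i believe i can fly i believe i can touch the sky".split()
vocab = build_vocabulary(train, min_count=2)
rate = oov_rate("i can fly away".split(), vocab)
```

Raising the threshold shrinks the vocabulary and reduces acoustic confusions, at the cost of a higher OOV rate.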
2.5. Separation of Vocals from Polyphonic Music
The vocal separation algorithm consists of the following processing steps.
(1) Estimate the notes of the main vocal line using a melody transcription algorithm. The algorithm has been trained using singing material, and it is able to distinguish between singing and solo instruments, at least to some degree.
(2) Estimate the time-varying pitch of the main vocal line by picking the prominent local maxima in the pitch salience spectrogram near the estimated notes and interpolating between the peaks.
(3) Predict the frequencies of the overtones by assuming perfect harmonicity and generate a binary mask which indicates the predicted vocal regions in the spectrogram. We use a harmonic mask where a ?Hz bandwidth around each predicted partial in each frame is marked as vocal.
(4) Learn a time-varying background model by using the nonvocal regions and nonnegative spectrogram factorization (NMF) on the magnitude spectrogram of the original signal. A Hamming window and the absolute values of the frame-wise DFT are used to calculate the magnitude spectrogram. We use a specific NMF algorithm which estimates the NMF model parameters using the nonvocal regions only. The resulting model represents the background accompaniment but not the vocals.
(5) Use the estimated NMF parameters to predict the amplitude of the accompaniment in the vocal regions, and subtract it from the mixture spectrogram.
(6) Synthesize the separated vocals by assigning the phases of the mixture spectrogram to the estimated magnitude spectrogram of the vocals and generating a time-domain signal by inverse DFT and overlap-add.
More detailed description of the algorithm is given in .
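The harmonic-mask step can be sketched as below. This is only an illustration of the idea: the bandwidth, sample rate, and DFT size are placeholder values chosen for the example (the bandwidth used in the actual system is not given here), and the function name is hypothetical.

```python
import numpy as np

def harmonic_mask(f0_per_frame, n_bins, sample_rate, n_fft, bandwidth_hz):
    """Binary mask, shape (n_bins, n_frames), marking predicted vocal regions.

    A bin is marked in a frame if it lies within bandwidth_hz / 2 of a
    multiple of that frame's pitch (perfect harmonicity is assumed).
    Frames with f0 <= 0 are treated as containing no vocals.
    """
    bin_freqs = np.arange(n_bins) * sample_rate / n_fft
    mask = np.zeros((n_bins, len(f0_per_frame)), dtype=bool)
    for t, f0 in enumerate(f0_per_frame):
        if f0 <= 0:
            continue
        # Predicted partials below the Nyquist frequency.
        for partial in np.arange(f0, sample_rate / 2, f0):
            mask[:, t] |= np.abs(bin_freqs - partial) <= bandwidth_hz / 2
    return mask

# Placeholder setup: 100 Hz bin spacing, one silent frame and one 200 Hz frame.
mask = harmonic_mask([0.0, 200.0], n_bins=41, sample_rate=8000,
                     n_fft=80, bandwidth_hz=50.0)
```

The complement of the mask gives the nonvocal regions from which the background model is learned.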
The algorithm has been found to produce robust results on realistic music material. It improved the signal-to-noise ratio of the vocals on average by 4.9 dB on material extracted from Karaoke DVDs and on average by 2.1 dB on material synthesized by mixing vocals and MIDI background. The algorithm causes some separation artifacts because of erroneously estimated singing notes (insertions or deletions), and some interference from other instruments. The spectrogram decomposition was found to perform well in learning the background model, since the sounds produced by musical instruments are well represented with the model .
Even though the harmonic model for singing used in the algorithm does not directly allow representing unvoiced sounds, the use of mixture phases carries some information about them. Furthermore, the rate of unvoiced sounds in singing is low, so that the effect of voiced sounds dominates the recognition.
3. Recognition Experiments
We study the effect of the above techniques on the recognition of phonemes and words in both clean singing and singing separated from polyphonic music. Different adaptation approaches and both phoneme-level and word-level language models are studied.
3.1. Acoustic Data
The acoustic models of the recognizer were trained using the CMU Arctic speech database (CMU ARCTIC databases for speech synthesis: http://festvox.org/cmuarctic/). For testing and adaptation of the models we used a database containing monophonic singing recordings, 49 fragments (19 male and 30 female) of popular songs, which we denote as . The sung phrases are between 20 and 30 seconds long and usually consist of a full verse of a song. For the adaptation and testing on clean singing we used k-fold cross-validation, with k depending on the test case. The total amount of singing material is 30 minutes, and it consists of 4770 phoneme instances.
To test the recognition system on polyphonic music we chose 17 songs from commercial music collections. The songs were manually segmented into structurally meaningful units (verse, chorus) to obtain sung phrases having approximately the same duration as the fragments in the monophonic database. We obtained 100 fragments of polyphonic music containing singing and instrumental accompaniment. We denote this database as . For this testing case, the HMMs were adapted using the entire database. In order to suppress the effect of the instrumental accompaniment, we applied the vocal separation algorithm described in Section 2.5.
The lyrics of both singing databases were manually annotated for reference. The transcriptions are used in the supervised adaptation procedure and in the evaluation of the automatic lyrics recognition.
3.3. Adaptation of Models to Singing Voice
For adapting the models to singing voice, we use a 5-fold setting for the database, with one fifth of the data used as test data at a time and the rest for adaptation. As each song was sung by multiple singers, splitting into folds was done so that the same song appeared either in the test or in the adaptation set, not in both. The same singer was allowed in both adaptation and testing sets. We adapt the models using a supervised adaptation procedure, providing the correct transcription to the system in the adaptation process.
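A song-disjoint split of this kind can be sketched as follows. The greedy balancing strategy and the fragment identifiers are assumptions made for the example; the text only requires that all fragments of a song end up in the same fold.

```python
from collections import defaultdict

def song_disjoint_folds(fragments, n_folds=5):
    """Split (fragment_id, song_id) pairs into folds without sharing songs.

    Songs are placed whole, largest first, into the currently smallest
    fold, so that the fold sizes stay roughly balanced.
    """
    by_song = defaultdict(list)
    for fragment_id, song_id in fragments:
        by_song[song_id].append(fragment_id)
    folds = [[] for _ in range(n_folds)]
    for song_id in sorted(by_song, key=lambda s: (-len(by_song[s]), s)):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_song[song_id])
    return folds

# Hypothetical data: 7 songs, each sung by 3 singers.
fragments = [(f"{song}_{singer}", song)
             for song in ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
             for singer in range(3)]
folds = song_disjoint_folds(fragments, n_folds=5)
```

Each fold in turn serves as test data while the remaining folds are used for adaptation; singers may still appear on both sides, as in the setup above.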
Evaluation of the recognition performance was done without a language model; the number of insertion errors was controlled by the insertion penalty parameter with value fixed to (for reasons explained in Section 3.6).
Table 2 also presents the recognition performance of systems adapted with different numbers of classes in the first pass and 8 classes in the second pass, using clustering to form these 8 classes. The second pass slightly improves the correct rate of systems where multiple classes were used in the first adaptation pass.
3.4. Gender-Dependent Adaptation
The gender differences in singing voices are much more evident than in speech, because of different singing techniques explained in Section 2.1. Gender-dependent models are used in many cases in speech recognition .
Adaptation to male singing voice is tested on four different male voices. The fragments belonging to the same voice were kept as test data, while the rest of the male singing in the monophonic database was used to adapt the speech models using a one-pass global adaptation. We do the same for adaptation to female singing voice, using as adaptation data all the female singing fragments except those of the tested voice.
Phoneme recognition rates (correct % / accuracy %) for nonadapted and gender-adapted models for 4 different male and female sets.
The gender-specific adaptation improves the recognition performance for all the singers. The improvement is especially large for female singers, whose accuracy was negative with the nonadapted system. Negative accuracy means an error rate of over 100%, which renders the recognition results unusable. The testing in this case was also done without a language model, using the fixed insertion penalty (see Section 3.6 for explanation).
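How accuracy can become negative is easiest to see from the error counts. The sketch below assumes the definitions commonly used with HTK-style scoring: the correct rate ignores insertions, while accuracy also subtracts them. The counts are invented for the illustration.

```python
def recognition_rates(n_reference, substitutions, deletions, insertions):
    """Correct % and accuracy % from alignment error counts.

    correct  = (N - S - D) / N * 100
    accuracy = (N - S - D - I) / N * 100
    """
    hits = n_reference - substitutions - deletions
    correct = 100.0 * hits / n_reference
    accuracy = 100.0 * (hits - insertions) / n_reference
    return correct, accuracy

# Made-up counts: many insertions push the accuracy below zero.
correct, accuracy = recognition_rates(n_reference=100, substitutions=30,
                                      deletions=10, insertions=70)
```

Here 60% of the reference phonemes are recognized correctly, but the 70 inserted phonemes give an accuracy of -10%, that is, an error rate of 110%.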
3.5. Singer-Specific Adaptation
The models adapted to singing voice can be further adapted to a target singer. We tested singer adaptation for three male and three female voices. The adaptation to singing was carried out in a one-pass global step, using all the singing material from the monophonic database except the target voice. After this, the adapted models were adapted using another one-pass global adaptation and tested in 3-fold (for male 1, male 3, Fem 1, and Fem 3) or 5-fold (for male 2 and Fem 2) cross-validation, so that one fragment at a time was used in testing and the rest as adaptation data.
Phoneme recognition rates (correct % / accuracy %) for 3 male and 3 female voice adapted systems.
There are significant differences between male and female singers, which explains the fact that a gender-dependent recognizer performs better than a gender-independent recognizer. The gender-adapted systems have lower accuracy than the systems adapted to singing, but a higher correct rate. The two situations may need different tuning of the recognition step parameters (here only the insertion penalty) in order to maximize the accuracy of the recognition, but we kept the same value for comparison purposes.
3.6. Language Models and Recognition Results
The phoneme-level language models were trained using the phonetic transcriptions of the speech database that was used for training the acoustic models. The database contains 1132 phonetically balanced sentences, over 48000 phoneme instances.
Perplexities of bigram and trigram phoneme and word level language models on the training text (speech database transcriptions for phonemes, lyrics text for words) and on test lyrics texts.
For constructing a word language model we used the lyrics text of 4470 songs, containing over 1.2 million word instances, retrieved from http://www.azlyrics.com/. From a total of approximately 26000 unique words, a vocabulary of 5167 words was chosen by keeping the words that appeared at least 5 times. The perplexities of bigram and trigram word level language models evaluated on the training data and on the lyrics text of and databases are also presented in Table 6. The percentage of OOV words on the training text represents mostly words in languages other than English, also the words that appeared too few times and were removed when choosing the vocabulary. The perplexities of the language models on are not much higher than the ones on the training text, meaning that the texts are similar regarding the used words. The text is less well modeled by this language model. Nonetheless, in almost 4500 songs we could only find slightly over 5000 words that appear more than 5 times; thus, the language model for lyrics of mainstream songs is quite restricted vocabulary-wise.
In the recognition process, the acoustic models provide a number of hypotheses as output. The language model provides complementary knowledge about how likely those hypotheses are. The balance of the two components in the Viterbi decoding can be controlled using the grammar scale factor and the insertion penalty parameter . These parameters are usually set experimentally to values where the number of insertion errors and deletion errors in recognition is nearly equal. We fixed the values to and , where deletion and insertion errors for phoneme recognition using bigrams were approximately equal. No tuning was done for the other language models to maximize accuracy of the recognition results.
3.6.1. Phoneme Recognition
Phoneme recognition rates (correct % / accuracy %) for monophonic singing with no language model, unigram, bigram or trigram, using gender-adapted models.
When there is no prior information about the phoneme probabilities (no language model), the rate of recognized phonemes is quite high, but with a low accuracy. Including phoneme probabilities in the recognition process (unigram language model) improves the accuracy of the recognition significantly. The bigram language model gives more control over the output of the recognizer, yielding better rates than the unigram. For the trigram language model we obtained a higher recognition rate but a lower accuracy. This case might also need different tuning of the language model control parameters to maximize the accuracy of the recognition, but we kept the same values for comparison purposes.
The 2000 NIST evaluation of Switchboard corpus automatic speech recognition systems  reports error rates of 39%–55% for phoneme recognition in speech, while the lowest error rate (100-accuracy) in Table 7 is approximately 65%. Even though our singing recognition performance results are clearly lower, we find our results encouraging, considering that singing recognition has not been studied before.
Phoneme recognition rates (correct / accuracy %) for vocal line extracted from polyphonic music, with no language model, unigram, bigram or trigram, using singing-adapted models.
3.6.2. Word Recognition
If a closed vocabulary language model can be constructed from the lyrics of the songs in the database, then such knowledge gives an important advantage for recognition . For example, in the case of the database, a bigram language model constructed from the lyrics text of database has a vocabulary of only 185 words (compared to the vocabulary size of 5167 of the previously used language model) and a perplexity of 2.9 on the same text, offering a recognition rate of 55% with 40% accuracy for the singing-adapted models in the 5-fold test case.
The word recognition rates are low, with even lower accuracy, and as a speech recognition tool this system fails. Still, for information retrieval purposes, even highly imperfect transcriptions of the lyrics can be useful: the rate of correctly recognized words can be maximized at the cost of producing many insertion errors. In the next section we present two applications for lyrics recognition.
4.1. Automatic Singing-to-Lyrics Alignment
Alignment of singing to lyrics refers to finding the temporal relationship between a possibly polyphonic music audio and the corresponding textual lyrics. We further present the system developed in .
A straightforward way to do alignment is by creating a phonetic transcription of the word sequence comprising the text in the lyrics and aligning the corresponding phoneme sequence with the audio using the HMM recognizer. For alignment, the possible paths in the Viterbi search algorithm are restricted to just one string of phonemes, representing the input text.
The polyphonic audio from the database was preprocessed to separate the singing voice. The text files are processed to obtain a sequence of words with optional silence, pause, and noise between them. An optional short pause is inserted between every two words in the lyrics. At the end of each line we insert an optional silence or noise event, to account for the voice rest and possible background accompaniment. An example of the resulting grammar for one of the test songs is
[sil | noise] I [sp] BELIEVE [sp] I [sp] CAN [sp] FLY [sil | noise] I [sp] BELIEVE [sp] I [sp] CAN [sp] TOUCH [sp] THE [sp] SKY [sil | noise] I [sp] THINK [sp] ABOUT [sp] IT [sp] EVERY [sp] NIGHT [sp] AND [sp] DAY [sil | noise] SPREAD [sp] MY [sp] WINGS [sp] AND [sp] FLY [sp] AWAY [sil | noise]
where [ ] encloses options and | denotes alternatives. This way, the alignment algorithm can choose to include pauses and noise where needed. The noise model was separately trained on instrumental sections from different songs, other than the ones in the test database.
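Building such a grammar string from the lines of a lyrics file can be sketched as follows. The sketch writes optional items in square brackets and separates alternatives with a vertical bar; the function name is hypothetical, and real lyrics would also need punctuation handling, which is omitted here.

```python
def lyrics_to_grammar(lines):
    """Word-level alignment grammar for a list of lyrics lines.

    An optional short pause [sp] is placed between consecutive words of a
    line, and an optional silence-or-noise event before the first line
    and after every line.
    """
    rest = "[sil | noise]"
    parts = [rest]
    for line in lines:
        words = line.upper().split()
        if not words:  # skip empty lines in the lyrics file
            continue
        parts.append(" [sp] ".join(words))
        parts.append(rest)
    return " ".join(parts)

grammar = lyrics_to_grammar(["I believe I can fly",
                             "I believe I can touch the sky"])
```

Because every pause, silence, and noise event is optional, the forced alignment remains restricted to a single word sequence while still absorbing rests and accompaniment-only passages.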
The text input contains a number of lines of text, each line corresponding roughly to one singing phrase.
Average alignment errors for different sets of singing-adapted models.
One main reason for misalignments is a faulty output of the vocal separation stage. Some of the songs are from pop-rock genre, featuring loud instruments as an accompaniment, and the melody transcription (step 1 in Section 2.5) fails to pick the voice signal. In this case, the output contains a mixture of the vocals with some instrumental sounds, but the voice is usually too distorted to be recognizable. In other cases, the errors appear when the transcribed lyrics do not have the lines corresponding to singing phrases, so there are breathing pauses in the middle of a text line. In these cases even the manual annotation of the lyrics can have ambiguity.
The relatively low average alignment error indicates that this approach could be used to produce automatic alignment of lyrics for various applications such as automatic lyrics display in karaoke.
4.2. Query-by-Singing Based on Word Recognition
In query-by-humming/singing, the aim is to identify a piece of music from its melody and lyrics. In a query-by-humming application, the search algorithm will transcribe the melody sung by the user and will try to find a match of the sung query with a melody from the database. For large databases, the search time can be significantly long. Assuming that we also have the lyrics of the songs we are searching through, the words output from a phonetic recognizer can be searched for in the lyrics text files. This will provide additional information and narrow down the melody search space. Furthermore, lyrics will be more reliable than the melody in the case of less skilled singers.
Examples of errors in recognition.
seemed so far away
seem to find away
finding the answer
fighting the answer
the distance in your eyes
from this is in your eyes
all the way
cause it's a bittersweet symphony
cause I said bittersweet symphony
this our life
trying to make ends meet
trying to maintain sweetest
you're a slave to the money
ain't gettin' money
then you die
then you down
I heard you crying loud
I heard you crying alone
all the way across town
all away across the sign
you've been searching for that someone
you been searching for someone
and it's me out on the prowl
I miss me I don't apologize
as you sit around
you see the rhyme
feeling sorry for yourself
feelin' so free yourself
We built a retrieval system based on sung queries recognized by the system presented in Table 9 (23.93% correct recognition rate), which uses a bigram language model to recognize the clean singing voice in the presented 5-fold experiment. For this purpose, we constructed a lyrics database consisting of the text lyrics of and databases. We used as test queries the 49 singing fragments of the database. The recognized words for each query are matched against the content of the lyrics database to identify the queried song.
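The matching step can be illustrated with a small Python sketch. The scoring rule is an assumption, since it is not spelled out in this passage: each song is scored by the number of distinct recognized words that also occur in its lyrics, and songs are ranked by that score. The song identifiers `song_a` and `song_b` are hypothetical, and the toy lyrics reuse phrases from the recognition examples above.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())


def rank_songs(recognized_text, lyrics_db):
    """Rank songs by how many distinct recognized words occur in their lyrics.

    Assumed scoring rule: a simple word-overlap count, with ties broken
    alphabetically by song identifier.
    """
    query_words = set(tokenize(recognized_text))
    scores = {
        title: len(query_words & set(tokenize(lyrics)))
        for title, lyrics in lyrics_db.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical two-song lyrics database.
lyrics_db = {
    "song_a": "cause it's a bittersweet symphony trying to make ends meet "
              "you're a slave to the money then you die",
    "song_b": "I heard you crying loud all the way across town "
              "you've been searching for that someone",
}
ranking = rank_songs("cause I said bittersweet symphony", lyrics_db)
print(ranking)  # [('song_a', 3), ('song_b', 1)]
```

Even though the recognized query contains errors ("I said" instead of "it's a"), the correctly recognized words are enough to rank the right song first in this toy case.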
Query-by-singing retrieval results.
This paper applied speech recognition methods to recognize lyrics from singing voice. We attempted to recognize phonemes and words in singing voice from monophonic singing input and from polyphonic music. To suppress the effect of the instrumental accompaniment, a vocal separation algorithm was applied.
Because no singing database large enough to train a singing recognizer was available, we used a phonetic recognizer trained on speech and applied speaker adaptation techniques to adapt the models to singing voice. Different adaptation setups were considered. A general adaptation to singing using a global transform was found to give much higher performance in recognizing sung phonemes than the nonadapted system. The number of base classes used in the adaptation had little effect on system performance. Separate adaptation using male and female data led to gender-dependent models, which produced the best performance in phoneme recognition. More specific speaker-dependent adaptation did not improve the results, but this may be due to the limited amount of speaker-specific adaptation data in comparison with the data used to adapt the speech models to singing in general.
The recognition results are also influenced by the language models used in the recognition process. A phoneme bigram language model built on phonetically balanced speech data was found to increase recognition accuracy by up to 13%, both for clean singing test cases and for the vocal line extracted from polyphonic music. Word recognition using a bigram language model built from lyrics text recognizes approximately one fifth of the sung words in clean singing. In polyphonic music, the results are lower.
Even though the results are far from being perfect, they have potential in music information retrieval. Our query-by-singing experiment indicates that a song might be retrieved based on words that are correctly recognized from a user query. We also demonstrated the capability of the recognition methods in automatic alignment of singing from polyphonic audio and text, where an average alignment error of 0.94 seconds was obtained.
The constructed applications show that even such low recognition rates can be useful in particular tasks. Still, it is important to find methods for improving the recognition rates. Ideally, a lyrics recognition system should be trained on singing material. We lack a large enough database of monophonic recordings, but we do have plenty of polyphonic material at our disposal. One approach could be to train the models on vocals separated from polyphonic music. Also, considering that millions of songs exist, we used only a small amount of text to build the word language model. A better selection of the vocabulary could be obtained by using more text in the construction of the language model.
This research has been funded by the Academy of Finland.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.