J. Pons, J. Janer, T. Rode, W. Nogueira, Remixing music using source separation algorithms to improve the musical experience of cochlear implant users. J. Acoust. Soc. Am. 140(6), 4338–4349 (2016)
A.J. Simpson, G. Roma, M.D. Plumbley, in International Conference on Latent Variable Analysis and Signal Separation. Deep karaoke: extracting vocals from musical mixtures using a convolutional deep neural network (Springer, 2015), pp. 429–436
A. Rosner, B. Kostek, Automatic music genre classification based on musical instrument track separation. J. Intell. Inf. Syst. 50(2), 363–384 (2018)
A. Rosner, B. Kostek, in International Symposium on Methodologies for Intelligent Systems. Musical instrument separation applied to music genre classification (Springer, 2015), pp. 420–430
J.S. Gómez, J. Abeßer, E. Cano, in ISMIR. Jazz solo instrument classification with convolutional neural networks, source separation, and transfer learning (ISMIR, Paris, 2018), pp. 577–584
B. Sharma, R.K. Das, H. Li, in INTERSPEECH. On the importance of audio-source separation for singer identification in polyphonic music (ISCA, Graz, 2019), pp. 2020–2024
Y. Gao, X. Zhang, W. Li, Vocal melody extraction via HRNet-based singing voice separation and encoder-decoder-based F0 estimation. Electronics 10(3), 298 (2021)
J. Xu, X. Li, Y. Hao, G. Yang, in Proceedings of international conference on multimedia retrieval. Source separation improves music emotion recognition (ACM, Glasgow, 2014), pp. 423–426
E. Alfaro-Paredes, L. Alfaro-Carrasco, W. Ugarte, in International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Query by humming for song identification using voice isolation (Springer, 2021), pp. 323–334
R. Kumar, Y. Luo, N. Mesgarani, in Interspeech. Music source activity detection and separation using deep attractor network (ISCA, Hyderabad, 2018), pp. 347–351
D. Stoller, S. Ewert, S. Dixon, in International Conference on Latent Variable Analysis and Signal Separation. Jointly detecting and separating singing voice: a multi-task approach (Springer, 2018), pp. 329–339
Y. Hung, A. Lerch, in ISMIR. Multitask learning for instrument activation aware music source separation (ISMIR, Montréal, 2020), pp. 748–755
T. Nakano, K. Yoshii, Y. Wu, R. Nishikimi, K.W.E. Lin, M. Goto, in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Joint singing pitch estimation and voice separation based on a neural harmonic structure renderer (IEEE, 2019), pp. 160–164
A. Jansson, R.M. Bittner, S. Ewert, T. Weyde, in 2019 27th European Signal Processing Conference (EUSIPCO). Joint singing voice separation and F0 estimation with deep U-Net architectures (IEEE, 2019), pp. 1–5
D. Stoller, S. Ewert, S. Dixon, in ISMIR. Wave-U-Net: a multi-scale neural network for end-to-end audio source separation (ISMIR, Paris, 2018), pp. 334–340
F. Lluís, J. Pons, X. Serra, in INTERSPEECH. End-to-end music source separation: is it possible in the waveform domain? (ISCA, Graz, 2019), pp. 4619–4623
A. Défossez, N. Usunier, L. Bottou, F. Bach, Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254 (2019)
D. Samuel, A. Ganeshan, J. Naradowsky, in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Meta-learning extractors for music source separation (IEEE, 2020), pp. 816–820
B. Zhu, W. Li, R. Li, X. Xue, Multi-stage non-negative matrix factorization for monaural singing voice separation. IEEE Trans. Audio Speech Lang. Process. 21(10), 2096–2107 (2013)
B. Rathnayake, K. Weerakoon, G. Godaliyadda, M. Ekanayake, in 2018 IEEE Symposium Series on Computational Intelligence (SSCI). Toward finding optimal source dictionaries for single channel music source separation using nonnegative matrix factorization (IEEE, 2018), pp. 1493–1500
T. Virtanen, A. Mesaros, M. Ryynänen, in SAPA@INTERSPEECH. Combining pitch-based inference and non-negative spectrogram factorization in separating vocals from polyphonic music (ISCA, Brisbane, 2008), pp. 17–22
X. Zhang, W. Li, B. Zhu, in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Latent time-frequency component analysis: a novel pitch-based approach for singing voice separation (IEEE, 2015), pp. 131–135
C.L. Hsu, D. Wang, J.S.R. Jang, K. Hu, A tandem algorithm for singing pitch extraction and voice separation from music accompaniment. IEEE Trans. Audio Speech Lang. Process. 20(5), 1482–1491 (2012)
Z. Rafii, B. Pardo, in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). A simple music/voice separation method based on the extraction of the repeating musical structure (IEEE, 2011), pp. 221–224
Y. Zhang, X. Ma, in International Symposium on Neural Networks. A singing voice/music separation method based on non-negative tensor factorization and repeat pattern extraction (Springer, 2015), pp. 287–296
Z. Rafii, B. Pardo, Repeating pattern extraction technique (REPET): a simple method for music/voice separation. IEEE Trans. Audio Speech Lang. Process. 21(1), 73–84 (2012)
N. Takahashi, N. Goswami, Y. Mitsufuji, in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC). MMDenseLSTM: an efficient combination of convolutional and recurrent neural networks for audio source separation (IEEE, 2018), pp. 106–110
A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, T. Weyde, in ISMIR. Singing voice separation with deep U-Net convolutional networks (ISMIR, Suzhou, 2017), pp. 745–751
Q. Kong, Y. Cao, H. Liu, K. Choi, Y. Wang, in ISMIR. Decoupling magnitude and phase estimation with deep ResUNet for music source separation (ISMIR, 2021), pp. 342–349
T. Li, J. Chen, H. Hou, M. Li, in 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). SAMS-Net: a sliced attention-based neural network for music source separation (IEEE, 2021), pp. 1–5
N. Takahashi, Y. Mitsufuji, D3Net: densely connected multidilated DenseNet for music source separation. arXiv preprint arXiv:2010.01733 (2020)
A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)
K. He, X. Zhang, S. Ren, J. Sun, in Proceedings of the IEEE conference on computer vision and pattern recognition. Deep residual learning for image recognition (IEEE, Las Vegas, 2016), pp. 770–778
G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, in Proceedings of the IEEE conference on computer vision and pattern recognition. Densely connected convolutional networks (IEEE, Honolulu, 2017), pp. 4700–4708
S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, in EMNLP. Learning phrase representations using RNN encoder-decoder for statistical machine translation (ACL, Doha, 2014), pp. 1724–1734
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017)
J. Devlin, M. Chang, K. Lee, K. Toutanova, in NAACL-HLT. BERT: pre-training of deep bidirectional transformers for language understanding, vol. 1 (ACL, Minneapolis, 2019), pp. 4171–4186
R.P. Bingham, Harmonics: understanding the facts (Dranetz Technologies, Edison, 1994)
P. Chandna, M. Miron, J. Janer, E. Gómez, in International Conference on Latent Variable Analysis and Signal Separation. Monoaural audio source separation using deep convolutional neural networks (Springer, 2017), pp. 258–266
N. Takahashi, Y. Mitsufuji, in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Multi-scale multi-band DenseNets for audio source separation (IEEE, 2017), pp. 21–25
S. Park, T. Kim, K. Lee, N. Kwak, in ISMIR. Music source separation using stacked hourglass networks (ISMIR, Paris, 2018), pp. 289–296
O. Ronneberger, P. Fischer, T. Brox, in International Conference on Medical Image Computing and Computer-Assisted Intervention. U-Net: convolutional networks for biomedical image segmentation (Springer, 2015), pp. 234–241
V.S. Kadandale, J.F. Montesinos, G. Haro, E. Gómez, in 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP). Multi-channel U-Net for music source separation (IEEE, 2020), pp. 1–6
R. Hennequin, A. Khlif, F. Voituret, M. Moussallam, Spleeter: a fast and efficient music source separation tool with pre-trained models. J. Open Source Softw. 5(50), 2154 (2020)
W. Choi, M. Kim, J. Chung, D. Lee, S. Jung, Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation. arXiv preprint arXiv:1912.02591 (2019)
J. Perez-Lapillo, O. Galkin, T. Weyde, in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Improving singing voice separation with the Wave-U-Net using minimum hyperspherical energy (IEEE, 2020), pp. 3272–3276
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, in Proceedings of the IEEE/CVF International Conference on Computer Vision. Swin transformer: hierarchical vision transformer using shifted windows (IEEE, Piscataway, 2021), pp. 10012–10022
L. Dong, S. Xu, B. Xu, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition (IEEE, 2018), pp. 5884–5888
N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, in Proceedings of the AAAI Conference on Artificial Intelligence. Neural speech synthesis with transformer network, vol. 33 (AAAI, Honolulu, 2019), pp. 6706–6713
S. Yu, C. Li, F. Deng, X. Wang, in 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). Rethinking singing voice separation with spectral-temporal transformer (IEEE, 2021), pp. 884–889
Y. Luo, Z. Chen, C. Han, C. Li, T. Zhou, N. Mesgarani, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Rethinking the separation layers in speech separation networks (IEEE, 2021), pp. 1–5
J. Hu, L. Shen, G. Sun, in Proceedings of the IEEE conference on computer vision and pattern recognition. Squeeze-and-excitation networks (IEEE, Salt Lake City, 2018), pp. 7132–7141
J. Gu, H. Kwon, D. Wang, W. Ye, M. Li, Y.H. Chen, L. Lai, V. Chandra, D.Z. Pan, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Multi-scale high-resolution vision transformer for semantic segmentation (IEEE, New Orleans, 2022), pp. 12094–12103
Q. Hou, L. Zhang, M.M. Cheng, J. Feng, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Strip pooling: rethinking spatial pooling for scene parsing (IEEE, Piscataway, 2020), pp. 4003–4012
Z. Rafii, A. Liutkus, F.R. Stöter, S.I. Mimilakis, R. Bittner, The MUSDB18 corpus for music separation (2017). https://doi.org/10.5281/zenodo.1117372.
S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, Y. Mitsufuji, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Improving music source separation based on deep neural networks through data augmentation and network blending (IEEE, 2017), pp. 261–265
E. Vincent, R. Gribonval, C. Févotte, Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 14(4), 1462–1469 (2006)
F.R. Stöter, A. Liutkus, N. Ito, in International Conference on Latent Variable Analysis and Signal Separation. The 2018 signal separation evaluation campaign (Springer, 2018), pp. 293–305
W. Wang, E. Xie, X. Li, D.P. Fan, K. Song, D. Liang, T. Lu, P. Luo, L. Shao, in Proceedings of the IEEE/CVF International Conference on Computer Vision. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions (IEEE, Piscataway, 2021), pp. 568–578
T. Hori, N. Moritz, C. Hori, J. Le Roux, in Interspeech. Transformer-based long-context end-to-end speech recognition (ISCA, Shanghai, 2020), pp. 5011–5015
F.R. Stöter, S. Uhlich, A. Liutkus, Y. Mitsufuji, Open-Unmix: a reference implementation for music source separation. J. Open Source Softw. 4(41), 1667 (2019)
Z. Wang, R. Giri, U. Isik, J.M. Valin, A. Krishnaswamy, in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Semi-supervised singing voice separation with noisy self-training (IEEE, 2021), pp. 31–35