Bayesian group sparse learning for music source separation

Chien, Jen-Tzung; Hsieh, Hsin-Lung

doi:10.1186/1687-4722-2013-18

Research
Open access
Published: 05 July 2013


EURASIP Journal on Audio, Speech, and Music Processing volume 2013, Article number: 18 (2013)


Abstract

Nonnegative matrix factorization (NMF) is developed for parts-based representation of nonnegative signals with the sparseness constraint. The signals are adequately represented by a set of basis vectors and the corresponding weight parameters. NMF has been successfully applied for blind source separation and many other signal processing systems. Typically, controlling the degree of sparseness and characterizing the uncertainty of model parameters are two critical issues for model regularization using NMF. This paper presents the Bayesian group sparse learning for NMF and applies it for single-channel music source separation. This method reconstructs the rhythmic or repetitive signal from a common subspace spanned by the shared bases for the whole signal and simultaneously decodes the harmonic or residual signal from an individual subspace consisting of separate bases for different signal segments. A Laplacian scale mixture distribution is introduced for sparse coding given a sparseness control parameter. The relevance of basis vectors for reconstructing two groups of music signals is automatically determined. A Markov chain Monte Carlo procedure is presented to infer two sets of model parameters and hyperparameters through a sampling procedure based on the conditional posterior distributions. Experiments on separating single-channel audio signals into rhythmic and harmonic source signals show that the proposed method outperforms baseline NMF, Bayesian NMF, and other group-based NMF in terms of signal-to-interference ratio.

1 Introduction

Many problems in audio, speech and music processing can be tackled through matrix factorization. Different cost functions and constraints may lead to different factorized matrices. This procedure can identify underlying sources from the mixed signals through blind source separation [1]. Nonnegative matrix factorization (NMF) is designed to find an approximate factorization X≈A S for a data matrix X into a basis matrix A and a weight matrix S which are all nonnegative [2]. Some divergence measures have been proposed to derive solutions to NMF [3, 4]. NMF provides a useful learning tool for clustering as well as for classification. When a portion of labeled data are available, the semi-supervised NMF was developed for an improved classification system [5]. Different from standard principal component analysis (PCA) and independent component analysis (ICA), NMF only allows additive combination due to the nonnegative constraints on matrices A and S. Nevertheless, nonnegative PCA and nonnegative ICA were proposed for blind source separation in the presence of nonnegative image and music sources [6].

On the other hand, NMF conducts a parts-based sparse representation where only a few components or bases are relevant for representation of input nonnegative matrix X. The sparseness constraint is imposed in objective function [2]. An automatic relevance determination (ARD) scheme [7–9] is developed to determine relevant bases for sparse representation. Such sparse coding is efficient and robust. However, controlling the sparseness or smoothness is influential for system performance. Bayesian learning is beneficial to deal with sparse representation [9] and model regularization [7]. In [10], Bayesian learning was performed for sparse representation of image data where Laplacian distribution was used as prior density. The ℓ₁-regularized optimization was comparably performed. In addition, the group-based NMF [11] was proposed to capture the intra-subject variations and the inter-subject variations in EEG signals. In [12], the group sparse NMF was proposed by minimizing the Itakura-Saito divergence between X and AS. In [13], NMF was applied for drum source separation where the factorized components were partitioned into rhythmic sources and harmonic sources. No Bayesian learning was performed in [11–13].

More recently, a Bayesian NMF approach [14] was proposed for model selection and image reconstruction. This approach inferred the NMF model by a variational Bayes method and a Markov chain Monte Carlo (MCMC) algorithm. In [15], a Bayesian NMF with gamma priors for source signals and mixture weights was implemented through an MCMC algorithm. In [16], the Bayesian NMF with Gaussian likelihood and exponential prior was constructed for image feature extraction where the posterior distribution was approximated by a Gibbs sampling procedure. In [17], a Bayesian approach for blind separation of linear mixtures of sources was developed. The Student t distribution for mixture weights was introduced to achieve sparse basis representation, and underdetermined noisy mixtures were separated. However, the case of nonnegative sources was not considered. Besides, single-channel source separation is known to be an underdetermined problem. In [18], harmonic structure information was adopted to estimate the demixed instrumental sources. In [19], NMF was applied to single-channel speech separation where the speech of the target speaker was enhanced over that of the masking speaker by using sparse dictionaries learned on a phoneme level for individual speakers.

This paper addresses the problem of underdetermined source separation based on NMF for an application to music source separation [20]. The use of NMF and Bayesian theory for source separation is not new, since there have been many papers on the topic [11–13, 15]. However, to the best of our knowledge, the novelty of this paper is to propose Bayesian group sparse (BGS) learning using the Laplacian distribution and the Laplacian scale mixture (LSM) distribution and to apply it to single-channel music signal separation. We present a group-based NMF where the groups of common bases and individual bases are estimated for blind separation of rhythmic sources and harmonic sources, respectively. Bayesian sparse learning is developed by introducing LSM distributions as the priors for two groups of reconstruction weights. Gamma priors are used to represent two groups of nonnegative basis components. The BGS-NMF algorithm is accordingly established. An MCMC algorithm is derived to infer BGS-NMF parameters and hyperparameters according to full Bayesian theory. The rhythmic sources and harmonic sources are reconstructed through the relevant bases in the common subspace and the individual subspace, respectively. In the experiments, the proposed BGS-NMF is evaluated and compared with the other NMF methods for single-channel separation of audio signals into rhythmic signals and harmonic signals. From the comparative study, we find that the improvement of separation performance benefits from Bayesian modeling, group basis representation, and sparse signal reconstruction. Sparser priors identify fewer but more relevant bases and correspondingly lead to better performance in terms of signal-to-interference ratio.

The remainder of this paper is organized as follows. In the next section, the related studies on NMF and group basis representation are surveyed, and some Bayesian learning approaches are addressed. Section 3 describes the construction of the BGS-NMF model as well as the inference procedure based on the MCMC algorithm. The conditional posterior distributions of different parameters and hyperparameters are derived in the sampling procedure. Section 4 reports a series of experiments on underdetermined music source separation with different music sources. The convergence condition in MCMC sampling is investigated. The evaluation of demixed signals in terms of signal-to-interference ratio is reported. Finally, the conclusions drawn from this study are provided in Section 5.

2 Background survey

In what follows, nonnegative matrix factorization (NMF) and its extensions to different regularization functions are introduced. Several approaches to group basis representation are addressed. Group sparse coding is surveyed. Then Bayesian learning methods for matrix factorization and other related tasks are introduced.

2.1 Nonnegative matrix factorization

NMF is a linear model where the observed signals, factorized signals, and source signals are all assumed to be nonnegative. Given a data matrix X={X_{ik}}, NMF estimates two factorized matrices A={A_{ij}} and S={S_{jk}} by minimizing the reconstruction error between X and AS. In [2], the sparseness constraint was imposed on minimization of an objective function $F$ which is based on a regularized error function

{∥X - AS∥}^{2} + η_{a} \sum_{i} \sum_{j} f (A_{ij}) + η_{s} \sum_{j} \sum_{k} f (S_{jk})

(1)

where η_a≥0 and η_s≥0 are regularization parameters and different sparseness measures could be used, e.g., f(S_{jk})=|S_{jk}|, f(S_{jk})=S_{jk}, f(S_{jk})=S_{jk} ln(S_{jk}), etc. Several extensions of NMF have been proposed. In [21], the nonnegative matrix partial co-factorization (NMPCF) was proposed for rhythmic source separation. Given the magnitude spectrogram as input data matrix X, NMPCF decomposes the music signal into a drum or rhythmic part and a residual or harmonic part X≈A_rS_r+A_hS_h with the factorized matrices including basis matrix and weight matrix for rhythmic source {A_r,S_r} and for harmonic source {A_h,S_h}. The prior knowledge from drum-only signal Y≈A_rS_r given the same rhythmic bases A_r is incorporated in joint minimization of two Euclidean error functions

{∥X - A_{r} S_{r} - A_{h} S_{h}∥}^{2} + η {∥Y - A_{r} S_{r}∥}^{2}

(2)

where η is a trade-off between the first and the second reconstruction errors due to X and Y, respectively. In [22], the mixed signals were divided into L segments. Each segment X^(l) is decomposed into common and individual parts which reflect the rhythmic and harmonic sources, respectively. The common bases A_r are shared for different segments due to high temporal repeatability in rhythmic sources. The individual bases $A_{h}^{(l)}$ are separate for individual segment l due to the changing frequency and low temporal repeatability. The resulting objective function consists of a weighted Euclidean error function and the regularization terms due to bases A_r and $A_{h}^{(l)}$ which are expressed by

\sum_{l = 1}^{L} ω^{(l)} ∥ X^{(l)} - A_{r} S_{r}^{(l)} - A_{h}^{(l)} S_{h}^{(l)} ∥^{2} + ηL ∥ A_{r} ∥^{2} + η \sum_{l = 1}^{L} ∥ A_{h}^{(l)} ∥^{2}

(3)

where ${ω^{(l)}, S_{r}^{(l)}, S_{h}^{(l)}}$ denote the segment-dependent weights and the weight matrices for the common bases and the individual bases, respectively. This is an NMPCF for L segments. The solutions to these NMFs are derived and implemented by multiplicative update rules so that the nonnegative constraints are met for individual model parameters. For example, the terms in the gradient of objective function $F$ with respect to a nonnegative parameter A are divided into positive terms and negative terms $\frac{\partial F}{∂A} = {[\frac{\partial F}{∂A}]}^{+} - {[\frac{\partial F}{∂A}]}^{-}$ where ${[\frac{\partial F}{∂A}]}^{+} > 0$ and ${[\frac{\partial F}{∂A}]}^{-} > 0$ . The multiplicative update rule is given by

A \leftarrow A \otimes {[\frac{\partial F}{∂A}]}^{-} ⊘ {[\frac{\partial F}{∂A}]}^{+}

(4)

where ⊗ and ⊘ denote element-wise multiplication and division, respectively.
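
As a rough illustration of rule (4), the following Python sketch applies multiplicative updates to objective (1) with the linear sparseness measure f(z)=z. It is a minimal sketch under our own assumptions, not the implementation of [2]: the function names `objective` and `sparse_nmf`, the random initialization, the fixed iteration count, and the small constant `eps` that guards against division by zero are illustrative choices.

```python
import numpy as np


def objective(X, A, S, eta_a, eta_s):
    """Objective (1) with the linear sparseness measure f(z) = z."""
    return np.sum((X - A @ S) ** 2) + eta_a * A.sum() + eta_s * S.sum()


def sparse_nmf(X, n_bases, eta_a=0.0, eta_s=0.1, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates in the spirit of (4); illustrative only."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    A = rng.random((n_rows, n_bases)) + eps
    S = rng.random((n_bases, n_cols)) + eps
    history = []
    for _ in range(n_iter):
        # dF/dA = 2 (A S S^T - X S^T) + eta_a: the negative part is 2 X S^T and
        # the positive part is 2 A S S^T + eta_a (the common factor 2 is divided out).
        A *= (X @ S.T) / (A @ S @ S.T + 0.5 * eta_a + eps)
        # dF/dS = 2 (A^T A S - A^T X) + eta_s, split in the same way.
        S *= (A.T @ X) / (A.T @ A @ S + 0.5 * eta_s + eps)
        history.append(objective(X, A, S, eta_a, eta_s))
    return A, S, history


# Toy usage on a random nonnegative matrix (synthetic data, not from the paper).
X_demo = np.random.default_rng(1).random((20, 30))
A_demo, S_demo, history_demo = sparse_nmf(X_demo, n_bases=4)
```

Because every factor in the update is nonnegative, A and S stay nonnegative whenever they are initialized that way, which is why the multiplicative form is convenient compared with a plain gradient step.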

2.2 Group basis representation

The signal reconstruction methods in (2) and (3) correspond to the group basis representation where two groups of bases A_r and $A_{h}^{(l)}$ are applied. The separation of a single-channel mixed signal into two source signals is achieved, and the issue of underdetermined source separation is resolved. In [11], the group-based NMF (GNMF) was developed by conducting group analysis and constructing two groups of bases. The intra-subject variations for a subject in different trials and the inter-subject variations for different subjects could be compensated. Given the L subjects or segments, the l th segment is generated by $X^{(l)} \approx A_{r}^{(l)} S_{r}^{(l)} + A_{h}^{(l)} S_{h}^{(l)}$ where $A_{r}^{(l)}$ denotes the common bases which capture the intra- and inter-subject variations and $A_{h}^{(l)}$ denotes the individual bases which reflect the residual information. In general, different common bases $A_{r}^{(l)}$ should be close together since these bases represent the shared information in the mixed signal. In contrast, individual bases $A_{h}^{(l)}$ characterize individual features which should be discriminated and mutually far apart [11]. The objective function of GNMF is formed by

\begin{array}{l} \sum_{l = 1}^{L} ∥ X^{(l)} - A_{r}^{(l)} S_{r}^{(l)} - A_{h}^{(l)} S_{h}^{(l)} ∥^{2} + η_{a} \sum_{l = 1}^{L} ∥ A_{r}^{(l)} ∥^{2} \\ + η_{a} \sum_{l = 1}^{L} ∥ A_{h}^{(l)} ∥^{2} \\ + η_{a_{r}} \sum_{l = 1}^{L} \sum_{m = 1}^{L} ∥ A_{r}^{(l)} - A_{r}^{(m)} ∥^{2} \\ - η_{a_{h}} \sum_{l = 1}^{L} \sum_{m = 1}^{L} ∥ A_{h}^{(l)} - A_{h}^{(m)} ∥^{2} . \end{array}

(5)

In (5), the second and third terms are seen as the ℓ₂ regularization functions, the fourth term enforces the distance between different common bases to be small, and the fifth term enforces the distance between different individual bases to be large. Regularization parameters ${η_{a}, η_{a_{r}}, η_{a_{h}}}$ are used. The NMPCFs in [21, 22] and GNMF in [11] did not consider sparsity in group basis representation.
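
To make the five terms of (5) concrete, the sketch below evaluates the GNMF objective in Python for lists of per-segment matrices. The function name `gnmf_objective`, the argument layout, and the toy data are our own assumptions for illustration; no optimization is performed.

```python
import numpy as np


def gnmf_objective(X, A_r, S_r, A_h, S_h, eta_a, eta_ar, eta_ah):
    """Evaluate (5); every argument before eta_a is a list with one matrix per segment."""
    n_seg = len(X)
    pairs = [(l, m) for l in range(n_seg) for m in range(n_seg)]
    # First term: reconstruction error summed over segments.
    recon = sum(np.sum((X[l] - A_r[l] @ S_r[l] - A_h[l] @ S_h[l]) ** 2) for l in range(n_seg))
    # Second and third terms: l2 regularization of both groups of bases.
    l2 = eta_a * sum(np.sum(A_r[l] ** 2) + np.sum(A_h[l] ** 2) for l in range(n_seg))
    # Fourth term: common bases of different segments are pulled together.
    close = eta_ar * sum(np.sum((A_r[l] - A_r[m]) ** 2) for l, m in pairs)
    # Fifth term (negative sign): individual bases are pushed apart.
    apart = eta_ah * sum(np.sum((A_h[l] - A_h[m]) ** 2) for l, m in pairs)
    return recon + l2 + close - apart


# Toy segments that share one common basis matrix exactly (synthetic data).
rng = np.random.default_rng(0)
n_seg, n_rows, n_cols, d_r, d_h = 3, 6, 10, 2, 2
shared = rng.random((n_rows, d_r))
A_r = [shared.copy() for _ in range(n_seg)]
A_h = [rng.random((n_rows, d_h)) for _ in range(n_seg)]
S_r = [rng.random((d_r, n_cols)) for _ in range(n_seg)]
S_h = [rng.random((d_h, n_cols)) for _ in range(n_seg)]
X = [A_r[l] @ S_r[l] + A_h[l] @ S_h[l] for l in range(n_seg)]

value_identical = gnmf_objective(X, A_r, S_r, A_h, S_h, 0.0, 1.0, 0.0)
A_r_shifted = [A + 0.1 * l for l, A in enumerate(A_r)]
value_shifted = gnmf_objective(X, A_r_shifted, S_r, A_h, S_h, 0.0, 1.0, 0.0)
```

With identical common bases the fourth term vanishes, and shifting the common bases of individual segments raises both the reconstruction error and the closeness penalty.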

More generally, a group sparse coding algorithm [23] was proposed for basis representation of group instances ${X_{k}, k \in G}$ where objective function is defined by

\sum_{k \in G} {∥X_{k} - \sum_{j = 1}^{| D |} S_{j}^{k} A_{j}∥}^{2} + η \sum_{j = 1}^{| D |} ∥ S_{j} ∥ .

(6)

All the instances within a group $G$ share the same dictionary D with basis vectors ${A_{j}}_{j = 1}^{| D |}$ . The weight matrix ${S_{j}}_{j = 1}^{| D |}$ consists of nonnegative vectors $S_{j} = {[S_{j}^{1}, \dots, S_{j}^{| G |}]}^{T}$ . The weight parameters ${S_{j}^{k}}$ are estimated for different group instances $k \in G$ using different bases $j \in D$ . In (6), ℓ₁ regularization term is incorporated to carry out group sparse coding. The group sparsity was further extended to structural sparsity for dictionary learning and basis representation. Nevertheless, nonnegative constraints were not imposed on bases {A_j} and observed signals {X_k}. Basically, all the above-mentioned methods [2, 11, 21–24] did not apply probabilistic framework. No Bayesian learning was considered.

2.3 Bayesian learning approaches

Model regularization is critical for improving the generalization of a learning machine to new data [7]. Conducting Bayesian learning compensates for the variations of the estimated parameters and accordingly improves model regularization. Typically, NMF and group basis representation are viewed as learning machines which are based on a set of bases. Following the perspective of relevance vector machines [8, 9], Bayesian sparse learning is beneficial for identifying relevant bases for regularized basis representation. To do so, sparse priors based on the Student t distribution [17] and the Laplacian distribution [10, 25] could act as regularization functions and be merged with the likelihood function to obtain the a posteriori probability. Maximizing the logarithm of the a posteriori probability is equivalent to minimizing the ℓ₁-regularized error function if a Laplacian prior is applied. Hyperparameters of sparse priors are then used as the regularization parameter which controls the trade-off between a reconstruction error function and a sparsity-favorable penalty function.
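
The equivalence between a Laplacian prior and ℓ₁ regularization can be checked on a scalar toy case. The Python sketch below assumes a single observation x = s + noise with Gaussian noise of variance `noise_var` and a Laplacian prior on s with inverse scale `lam`; this toy model and the function names are our own assumptions. Under these assumptions the maximum a posteriori estimate is the soft-thresholded observation, which the sketch compares against a brute-force grid search.

```python
def neg_log_posterior(s, x, noise_var, lam):
    """Negative log posterior up to a constant: squared error plus an l1 penalty."""
    return (x - s) ** 2 / (2.0 * noise_var) + lam * abs(s)


def map_estimate(x, noise_var, lam):
    """Closed-form minimizer of neg_log_posterior (soft thresholding)."""
    shrink = max(abs(x) - lam * noise_var, 0.0)
    return shrink if x >= 0 else -shrink


def grid_minimizer(x, noise_var, lam, lo=-5.0, hi=5.0, n=10001):
    """Brute-force check of the closed form on a uniform grid."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(grid, key=lambda s: neg_log_posterior(s, x, noise_var, lam))


large_obs = map_estimate(1.3, noise_var=0.5, lam=1.0)   # shrunk toward zero
small_obs = map_estimate(0.2, noise_var=0.5, lam=1.0)   # set exactly to zero
```

A larger `lam` widens the interval of observations that are mapped exactly to zero, which is the sense in which the hyperparameter of the sparse prior acts as a regularization parameter.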

In the literature, a probabilistic matrix factorization (PMF) [26] for X=A^TS was proposed by assuming Gaussian noise for each independent entry of data matrix X={X_{ik}} by $p (X | A, S, α) = \prod_{i = 1}^{N} \prod_{k = 1}^{M} N (X_{ik} | A_{i}^{T} S_{k}, α^{- 1})$ and assuming Gaussian priors $p (A | α_{a}) = \prod_{i = 1}^{N} N (A_{i} | 0, α_{a}^{- 1} I)$ and $p (S | α_{s}) = \prod_{k = 1}^{M} N (S_{k} | 0, α_{s}^{- 1} I)$ where {α,α_a,α_s} is a set of precision parameters of Gaussians. Here, A_i denotes the i th column of A and S_k denotes the k th column of S. Learning for PMF is equivalent to maximizing the log posterior likelihood

\begin{array}{l} ln p (A, S | X, α, α_{a}, α_{s}) = ln p (X | A, S, α) \\ + ln p (A | α_{a}) + ln p (S | α_{s}) + C \end{array}

(7)

with respect to A and S. In (7), C is a constant. This optimization is equivalent to minimizing the sum-of-squares error function with quadratic regularization terms

\sum_{i = 1}^{N} \sum_{k = 1}^{M} {(X_{ik} - A_{i}^{T} S_{k})}^{2} + η_{a} \sum_{i = 1}^{N} ∥ A_{i} ∥^{2} + η_{s} \sum_{k = 1}^{M} ∥ S_{k} ∥^{2} .

(8)

The regularization terms are determined from hyperparameters by η_a=α_a/α and η_s=α_s/α. Bayesian learning of PMF was performed through MCMC algorithm where Gaussian-Wishart priors for Gaussian mean vectors and precision matrices were assumed. There was no constraint on nonnegative matrices by using PMF. No sparse learning was considered.
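
To make the step from (7) to (8) explicit, the Gaussian densities can be written out as follows. This is a short worked derivation added for clarity; C' is our own symbol for the terms that do not depend on A and S.

```latex
\begin{aligned}
- \ln p (A, S | X, α, α_{a}, α_{s})
&= \frac{α}{2} \sum_{i = 1}^{N} \sum_{k = 1}^{M} (X_{ik} - A_{i}^{T} S_{k})^{2}
 + \frac{α_{a}}{2} \sum_{i = 1}^{N} ∥ A_{i} ∥^{2}
 + \frac{α_{s}}{2} \sum_{k = 1}^{M} ∥ S_{k} ∥^{2} + C' \\
&= \frac{α}{2} \Big[ \sum_{i = 1}^{N} \sum_{k = 1}^{M} (X_{ik} - A_{i}^{T} S_{k})^{2}
 + \frac{α_{a}}{α} \sum_{i = 1}^{N} ∥ A_{i} ∥^{2}
 + \frac{α_{s}}{α} \sum_{k = 1}^{M} ∥ S_{k} ∥^{2} \Big] + C'
\end{aligned}
```

Since α/2 is a positive constant, minimizing the bracketed expression is the same as minimizing (8) with η_a=α_a/α and η_s=α_s/α.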

In [27], a full Bayesian NMF was implemented to determine the number of bases according to the marginal likelihood. Furthermore, Bayesian nonparametric approach to NMF was proposed in [28] where model structure was determined through Gamma process NMF. This method was applied to find both latent sources in spectrograms and their number. In [25], the group sparse coding [23] was upgraded with Bayesian interpretation. Bayesian sparse learning was only developed for single-sample basis representation. In [29], the group sparse priors were presented for maximum a posteriori estimation of covariance matrix which was used in Gaussian graphical model. More recently, the group sparse hidden Markov models (HMMs) [30] were proposed to represent a sequence of observations and have been successfully applied for speech recognition. A set of common bases were shared for representation of speech samples across HMM states, while a set of individual bases were employed to represent speech samples within individual HMM states. Bayesian group sparse learning was performed for speech recognition [30] and signal separation [20] by using Laplacian scale mixture distribution.

3 Bayesian group sparse matrix factorization

Previous NMF methods [11, 13, 21] were developed to extract task-specific nonnegative factors, but they did not simultaneously consider the uncertainty of model parameters and control the sparsity of weight parameters. In [23, 25], the group sparse coding and its Bayesian extension did not impose nonnegative constraints on data matrix X and factorized matrices A and S. This paper presents a new Bayesian group sparse learning for NMF (denoted by BGS-NMF) and applies it to single-channel music source separation.

3.1 Model construction

In this study, magnitude spectrogram X={X^(l)} of a mixed audio signal is calculated and chopped into L segments for implementation of BGS-NMF algorithm. The audio signal is assumed to be mixed from two kinds of source signals. One is rhythmic or repetitive source signal and the other is harmonic or residual source signal. As illustrated in Figure 1, BGS-NMF aims to decompose a nonnegative matrix $X^{(l)} \in R_{+}^{N \times M}$ of the l th segment into a product of two nonnegative matrices A^(l)S^(l). A linear decomposition model is constructed in a form of

X^{(l)} = A_{r} S_{r}^{(l)} + A_{h}^{(l)} S_{h}^{(l)} + E^{(l)}

(9)

where $A_{r} \in R_{+}^{N \times D_{r}}$ denotes the shared basis matrix for all segments {X^(l),l=1,…,L}; $A_{h}^{(l)} \in R_{+}^{N \times D_{h}}$ and E^(l) denote the individual basis matrix and the noise matrix for a given segment l, respectively. Typically, common bases capture the repetitive patterns which occur continuously in different segments of a whole signal. Individual bases are used to compensate for the residual information that common bases cannot handle. Without loss of generality, common bases and individual bases are applied to recover the rhythmic signal and the harmonic signal, respectively, from a mixed audio signal. Such a signal recovery problem could be interpreted from the perspective of a subspace approach. Namely, an observed signal is demixed into one signal from the principal subspace spanned by common bases and the other signal from the minor subspace spanned by individual bases [31]. Moreover, the sparseness constraint is imposed on two groups of reconstruction weights $S_{r}^{(l)} \in R_{+}^{D_{r} \times M}$ and $S_{h}^{(l)} \in R_{+}^{D_{h} \times M}$ . It is assumed that the reconstruction weights of rhythmic sources $S_{r}^{(l)}$ and harmonic sources $S_{h}^{(l)}$ are independent, but the dependencies between reconstruction weights within each group are allowed. Assuming that the k th noise vector $E_{k}^{(l)}$ is Gaussian distributed with zero mean and N×N diagonal covariance matrix Σ^(l)=diag{[Σ^(l)]_{ii}} which is shared for all samples within a segment l, the likelihood function of an audio signal segment X^(l) is expressed by

\begin{matrix} p (X^{(l)} | Θ^{(l)}) = \prod_{i = 1}^{N} \prod_{k = 1}^{M} N (X_{ik}^{(l)} ∣ {[A_{r} S_{r}^{(l)}]}_{ik} \\ + {[A_{h}^{(l)} S_{h}^{(l)}]}_{ik}, {[Σ^{(l)}]}_{ii}) . \end{matrix}

(10)

BGS-NMF model is therefore constructed with parameters $Θ^{(l)} = {A_{r}, A_{h}^{(l)}, S_{r}^{(l)}, S_{h}^{(l)}, Σ^{(l)}}$ .
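
The generative reading of (9) and (10) can be sketched in Python as follows. The matrix sizes, the gamma and exponential draws used to create toy bases and weights, the noise variance, and the function name `log_likelihood` are all our own assumptions for illustration; they are not the settings used in the experiments.

```python
import numpy as np


def log_likelihood(X, A_r, S_r, A_h, S_h, noise_var):
    """Logarithm of (10) for one segment; noise_var holds the N diagonal entries of Sigma."""
    mean = A_r @ S_r + A_h @ S_h
    var = noise_var[:, None]                       # the same variance along each row i
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (X - mean) ** 2 / var))


# Synthetic segments drawn from model (9).
rng = np.random.default_rng(0)
n_rows, n_cols, d_r, d_h, n_seg = 8, 50, 2, 3, 4
A_r = rng.gamma(2.0, 1.0, size=(n_rows, d_r))      # common bases shared by all segments
noise_var = np.full(n_rows, 0.01)
segments = []
for _ in range(n_seg):
    A_h = rng.gamma(2.0, 1.0, size=(n_rows, d_h))  # individual bases of this segment
    S_r = rng.exponential(1.0, size=(d_r, n_cols))
    S_h = rng.exponential(1.0, size=(d_h, n_cols))
    E = rng.normal(0.0, np.sqrt(noise_var)[:, None], size=(n_rows, n_cols))
    segments.append((A_r @ S_r + A_h @ S_h + E, A_h, S_r, S_h))

X0, A_h0, S_r0, S_h0 = segments[0]
ll_full = log_likelihood(X0, A_r, S_r0, A_h0, S_h0, noise_var)
ll_common_only = log_likelihood(X0, A_r, S_r0, A_h0, np.zeros_like(S_h0), noise_var)
```

Dropping the individual part lowers the likelihood sharply in this toy case, which reflects the role of the individual bases in explaining what the common bases leave over. Because the Gaussian noise in (9) is unbounded, a synthetic X drawn this way can contain small negative entries, unlike a real magnitude spectrogram.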

3.2 Priors for Bayesian group sparse learning

From Bayesian perspective, the uncertainties of BGS-NMF parameters, expressed by prior densities, are considered to assure model regularization. Using BGS-NMF model, the common bases A_r are constructed to represent the characteristics of repetitive patterns for different data segments, while the individual bases $A_{h}^{(l)}$ are estimated to reflect unique information in each segment l. Sparsity control is enforced in the corresponding reconstruction weights $S_{r}^{(l)}$ and $S_{h}^{(l)}$ so that relevant bases are retrieved for group basis representation. In accordance with [15], the nonnegative basis parameters are assumed to be gamma distributed by

p (A_{r}) = \prod_{i = 1}^{N} \prod_{j = 1}^{D_{r}} G ({[A_{r}]}_{ij} | α_{rj}, β_{rj})

(11)

p (A_{h}^{(l)}) = \prod_{i = 1}^{N} \prod_{j = 1}^{D_{h}} G ({[A_{h}^{(l)}]}_{ij} | α_{hj}^{(l)}, β_{hj}^{(l)})

(12)

where $Φ_{a}^{(l)} = {{α_{rj}, β_{rj}}, {α_{hj}^{(l)}, β_{hj}^{(l)}}}$ denotes the hyperparameters of gamma distributions and {D_r,D_h} denote the numbers of common bases and individual bases, respectively. Gamma distribution is an exponential family distribution for nonnegative data. Its two parameters {α,β} can be adjusted to fit different shapes of distributions. In (11) and (12), all entries in matrices A_r and $A_{h}^{(l)}$ are assumed to be independent.

Importantly, we control the sparsity of reconstruction weights by using prior density based on the Laplacian scale mixture (LSM) distribution [25]. The LSM of a reconstruction weight of common basis is constructed by ${[S_{r}^{(l)}]}_{jk} = {(λ_{rj}^{(l)})}^{- 1} u_{rj}^{(l)}$ where $u_{rj}^{(l)}$ is a Laplacian distribution $p (u_{rj}^{(l)}) = \frac{1}{2} exp {- | u_{rj}^{(l)} |}$ with scale 1 and $λ_{rj}^{(l)}$ is an inverse scale parameter. Accordingly, the parameter ${[S_{r}^{(l)}]}_{jk}$ has a Laplacian distribution

p ({[S_{r}^{(l)}]}_{jk} | λ_{rj}^{(l)}) = \frac{λ_{rj}^{(l)}}{2} exp {- λ_{rj}^{(l)} {[S_{r}^{(l)}]}_{jk}}

(13)

which is controlled by a positive continuous mixture parameter $λ_{rj}^{(l)} \geq 0$ . Considering a gamma distribution for inverse scale parameter, i.e., $p (λ_{rj}^{(l)}) = G (λ_{rj}^{(l)} | γ_{rj}^{(l)}, δ_{rj}^{(l)})$ , the marginal distribution of a reconstruction weight can be calculated by [25]

\begin{matrix} p ({[S_{r}^{(l)}]}_{jk}) = \int_{0}^{\infty} p ({[S_{r}^{(l)}]}_{jk} | λ_{rj}^{(l)}) p (λ_{rj}^{(l)}) d λ_{rj}^{(l)} \\ = \frac{γ_{rj}^{(l)} {(δ_{rj}^{(l)})}^{γ_{rj}^{(l)}}}{2 {(δ_{rj}^{(l)} + {[S_{r}^{(l)}]}_{jk})}^{γ_{rj}^{(l)} + 1}} . \end{matrix}

(14)

In (13) and (14), the constraint ${[S_{r}^{(l)}]}_{jk} \geq 0$ has been considered. This LSM distribution is obtained by adopting the property that the gamma distribution is the conjugate prior for the Laplacian distribution. In an application to image coding, the LSM distribution was estimated and measured to be sparser than the Laplacian distribution by approximately a factor of 2 [25]. Figure 2 compares Gaussian, Laplacian, and LSM distributions with specific parameters. In this example, LSM is the sharpest distribution among these distributions. In addition, a truncated LSM prior for nonnegative parameter ${[S_{r}^{(l)}]}_{jk} \in R_{+}$ is adopted, namely, the distribution of a negative parameter is forced to be zero. The sparse prior for the reconstruction weight of an individual basis ${[S_{h}^{(l)}]}_{jk}$ is also expressed by an LSM distribution with hyperparameters ${γ_{hj}^{(l)}, δ_{hj}^{(l)}}$ . The hyperparameters of BGS-NMF are formed by $Φ^{(l)} = {Φ_{a}^{(l)}, Φ_{s}^{(l)} = {γ_{rj}^{(l)}, δ_{rj}^{(l)}, γ_{hj}^{(l)}, δ_{hj}^{(l)}}}$ . Figure 3 displays a graphical representation for construction of BGS-NMF with different parameters Θ^(l) and hyperparameters Φ^(l).
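
The closed form in (14) can be checked numerically. The Python sketch below integrates the product of the Laplacian density and the gamma density over the inverse scale with a midpoint rule and compares the result with (14). It assumes the shape-rate parameterization of the gamma distribution, which is the reading that reproduces (14); the function names, the integration range, and the test point are our own choices.

```python
import math


def laplacian_pdf(s, lam):
    """Laplacian density with inverse scale lam (the two-sided form behind (13))."""
    return 0.5 * lam * math.exp(-lam * abs(s))


def gamma_pdf(lam, shape, rate):
    """Gamma density in the shape-rate parameterization (assumed here)."""
    return rate ** shape * lam ** (shape - 1.0) * math.exp(-rate * lam) / math.gamma(shape)


def lsm_pdf_closed(s, shape, rate):
    """Closed-form marginal (14)."""
    return shape * rate ** shape / (2.0 * (rate + abs(s)) ** (shape + 1.0))


def lsm_pdf_numeric(s, shape, rate, upper=60.0, n=60000):
    """Midpoint-rule approximation of the integral over the inverse scale in (14)."""
    h = upper / n
    total = 0.0
    for i in range(n):
        lam = (i + 0.5) * h
        total += laplacian_pdf(s, lam) * gamma_pdf(lam, shape, rate)
    return total * h


closed = lsm_pdf_closed(0.5, shape=2.0, rate=1.0)
numeric = lsm_pdf_numeric(0.5, shape=2.0, rate=1.0)
```

The closed form decays polynomially in the weight, whereas the Laplacian density in (13) decays exponentially, so the LSM prior keeps more mass in the tails while remaining peaked at zero.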

By combining the likelihood function in (10) and the prior densities in (11) to (13), the negative logarithm of posterior distribution $- ln p (A_{r}, A_{h}^{(l)}, S_{r}^{(l)}, S_{h}^{(l)} | X)$ can be calculated and arranged as a new objective function expressed by

\begin{array}{l} \sum_{l = 1}^{L} \sum_{i = 1}^{N} \sum_{k = 1}^{M} (X_{ik}^{(l)} - {[A_{r} S_{r}^{(l)}]}_{ik} - {[A_{h}^{(l)} S_{h}^{(l)}]}_{ik})^{2} \\ + η_{a} L \sum_{i = 1}^{N} \sum_{j = 1}^{D_{r}} ((1 - α_{rj}) ln {[A_{r}]}_{ij} \\ + β_{rj} {[A_{r}]}_{ij}) \\ + η_{a} \sum_{l = 1}^{L} \sum_{i = 1}^{N} \sum_{j = 1}^{D_{h}} ((1 - α_{hj}^{(l)}) ln {[A_{h}^{(l)}]}_{ij} \\ + β_{hj}^{(l)} {[A_{h}^{(l)}]}_{ij}) + η_{s_{r}} \sum_{l = 1}^{L} \sum_{j = 1}^{D_{r}} \sum_{k = 1}^{M} {[S_{r}^{(l)}]}_{jk} \\ + η_{s_{h}} \sum_{l = 1}^{L} \sum_{j = 1}^{D_{h}} \sum_{k = 1}^{M} {[S_{h}^{(l)}]}_{jk} \end{array}

(15)

where ${η_{a}, η_{s_{r}}, η_{s_{h}}}$ denote the regularization parameters for two groups of bases and reconstruction weights. Some BGS-NMF parameters or hyperparameters have been absorbed in these regularization parameters. Comparing with the objective functions (3) for NMPCF, (5) for GNMF, and (8) for PMF, the optimization of (15) for BGS-NMF shall lead to two groups of signals which are reconstructed from the sparse common bases A_r and sparse individual bases $A_{h}^{(l)}$ . The regularization terms due to two gamma bases are additionally considered. Different from the Bayesian NMF (BNMF) [15], BGS-NMF conducts group sparse learning which does not only characterize the within-segment harmonic information but also represent the across-segment rhythmic regularity. Sparse sets of basis vectors are further determined for sparse representation. Basically, BGS-NMF follows a general objective function. By applying different hyperparameter values ${α_{rj}, β_{rj}, α_{hj}^{(l)}, β_{hj}^{(l)}}$ , probability structures, and prior distributions for ${A_{r}, A_{h}^{(l)}, S_{r}^{(l)}, S_{h}^{(l)}}$ , BGS-NMF can be realized to find solutions to NMF [2], NMPCF [21], GNMF [11], PMF [26], and BNMF [15]. Notably, the objective function in (15) is written for comparative study among different methods. This function only considers BGS-NMF based on Laplacian prior. BGS-NMF algorithms with Laplacian prior and LSM prior shall be both implemented in the experiments. Nevertheless, in what follows, we address the model inference procedure for BGS-NMF with LSM prior.

3.3 Model inference

The full Bayesian framework for the BGS-NMF model based on the posterior distribution of parameters and hyperparameters p(Θ,Φ|X) is not analytically tractable. A stochastic optimization scheme is adopted. We develop an MCMC sampling algorithm for approximate inference through iteratively generating samples of parameters Θ and hyperparameters Φ according to the posterior distribution. Convergence of the algorithm is assessed from these samples. The key idea of MCMC sampling is to simulate a stationary ergodic Markov chain whose samples asymptotically follow the posterior distribution p(Θ,Φ|X). The estimates of parameters Θ and hyperparameters Φ are then computed via Monte Carlo integrations on the simulated Markov chains. For simplicity, the segment index l is omitted in the derivation of the MCMC algorithm for BGS-NMF. At each new iteration t+1, the BGS-NMF parameters Θ^(t+1) and hyperparameters Φ^(t+1) are sequentially sampled in the order {A_r,S_r,A_h,S_h,Σ,α_r,β_r,α_h,β_h,λ_r,λ_h,γ_r,δ_r,γ_h,δ_h} according to their corresponding conditional posterior distributions. In this subsection, we describe the calculation of conditional posterior distributions for the BGS-NMF parameters {A_r,S_r,A_h,S_h,Σ}. The conditional posterior distributions for hyperparameters {α_r,β_r,α_h,β_h,λ_r,λ_h,γ_r,δ_r,γ_h,δ_h} are derived in the Appendix.

1. Sampling of [A_r]_{ij}. First of all, the common basis parameter ${[A_{r}^{(t + 1)}]}_{ij}$ is sampled by the conditional posterior distribution

p ({[A_{r}]}_{ij} | X_{i}^{T}, Θ_{A_{rij}}^{(t)}, Φ_{A_{rij}}^{(t)}) \propto p (X_{i}^{T} | Θ_{A_{rij}}^{(t)}) p ({[A_{r}]}_{ij} | Φ_{A_{rij}}^{(t)})

(16)

where $Θ_{A_{rij}}^{(t)} = {{[A_{r}^{(t + 1)}]}_{i (1 : j - 1)}, {[A_{r}^{(t)}]}_{i (j + 1 : D_{r})}, S_{r}^{(t)}, A_{h}^{(t)}, S_{h}^{(t)}, Σ^{(t)}}$ and $Φ_{A_{rij}}^{(t)} = {α_{rj}^{(t)}, β_{rj}^{(t)}}$ . Here, X_i denotes the i th row vector of X. Notably, for each sampling, we use the preceding bases ${[A_{r}^{(t + 1)}]}_{i (1 : j - 1)}$ at new iteration t+1 and subsequent bases ${[A_{r}^{(t)}]}_{i (j + 1 : D_{r})}$ at current iteration t. The likelihood function can be arranged as a Gaussian distribution of [A_r]_{ij}

p (X_{i}^{T} | Θ_{A_{rij}}^{(t)}) \propto exp \{- \frac{{({[A_{r}]}_{ij} - μ_{A_{rij}}^{likel})}^{2}}{2 {[σ_{A_{rij}}^{likel}]}^{2}}\}

(17)

where $μ_{A_{rij}}^{likel} = {[σ_{A_{rij}}^{likel}]}^{2} {[Σ^{(t)}]}_{ii}^{- 1} \sum_{k = 1}^{M} ({[S_{r}^{(t)}]}_{jk} ε_{ik}^{(- j)})$ , $ε_{ik}^{(- j)} = X_{ik} - (\sum_{m = 1}^{j - 1} {[A_{r}^{(t + 1)}]}_{im} {[S_{r}^{(t)}]}_{mk} + \sum_{m = j + 1}^{D_{r}} {[A_{r}^{(t)}]}_{im} {[S_{r}^{(t)}]}_{mk}) - \sum_{m = 1}^{D_{h}} {[A_{h}^{(t)}]}_{im} {[S_{h}^{(t)}]}_{mk}$ and ${[σ_{A_{rij}}^{likel}]}^{2} = {[Σ^{(t)}]}_{ii} {(\sum_{k = 1}^{M} {({[S_{r}^{(t)}]}_{jk})}^{2})}^{- 1}$ . By combining the likelihood function of (17) and the gamma prior $p ({[A_{r}]}_{ij} | Φ_{A_{rij}}^{(t)})$ of (11), the conditional posterior distribution in (16) is derived in the form of

{[A_{r}]}_{ij}^{α_{rj}^{(t)} - 1} exp \{- \frac{{({[A_{r}]}_{ij} - μ_{A_{rij}}^{post})}^{2}}{2 {[σ_{A_{rij}}^{post}]}^{2}}\} I_{[0, + \infty [} ({[A_{r}]}_{ij})

(18)

where $μ_{A_{rij}}^{post} = μ_{A_{rij}}^{likel} - β_{rj}^{(t)} {[σ_{A_{rij}}^{likel}]}^{2}$ , ${[σ_{A_{rij}}^{post}]}^{2} = {[σ_{A_{rij}}^{likel}]}^{2}$ , and $I_{[0, + \infty [} (z)$ denotes an indicator function which takes the value 1 if z∈[0,+∞[ and 0 otherwise. In (18), the posterior distribution for negative [A_r]_{ij} is forced to be zero. Derivations of (17) and (18) are detailed in the Appendix. However, (18) is not a standard distribution; its sampling therefore requires an accept-reject sampling method, such as the Metropolis-Hastings algorithm [32]. Using this algorithm, an instrumental distribution q([A_r]_{ij}) is chosen to fit the target distribution (18) as closely as possible so that a high rejection rate is avoided or, equivalently, rapid convergence toward the true parameter can be achieved. In case of rejection, the previous parameter sample is used, namely, ${[A_{r}^{(t + 1)}]}_{ij} \leftarrow {[A_{r}^{(t)}]}_{ij}$ . Generally, the shape of the target distribution is characterized by its mode and width. The instrumental distribution is constructed as a truncated Gaussian distribution which is calculated by

q ({[A_{r}]}_{ij}) = N_{+} ({[A_{r}]}_{ij} | μ_{A_{rij}}^{inst}, {[σ_{A_{rij}}^{inst}]}^{2}) .

(19)

In (19), the mode $μ_{A_{rij}}^{inst}$ is obtained by finding the roots of a quadratic equation in ${[A_{r}]}_{ij}$ that arises from the posterior distribution in (18). The derivation of the mode $μ_{A_{rij}}^{inst}$ is detailed in the Appendix. If the root is complex-valued or negative, the mode is set to $μ_{A_{rij}}^{inst} = 0$ . The width of the instrumental distribution is controlled by ${[σ_{A_{rij}}^{inst}]}^{2} = {[σ_{A_{rij}}^{post}]}^{2}$ .
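
The Metropolis-Hastings step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the function names are hypothetical, the truncated Gaussian is drawn by simple resampling, and the current sample is assumed to be strictly positive.

```python
import math
import numpy as np

def log_target(a, alpha, mu_post, var_post):
    """Unnormalized log density of (18); zero density outside a > 0."""
    if a <= 0.0:
        return -math.inf
    return (alpha - 1.0) * math.log(a) - (a - mu_post) ** 2 / (2.0 * var_post)

def instrumental_mode(alpha, mu_post, var_post):
    """Mode of (18) from the quadratic equation; forced to 0 for complex or negative roots."""
    disc = mu_post ** 2 + 4.0 * (alpha - 1.0) * var_post
    if disc < 0.0:
        return 0.0
    return max(0.5 * (mu_post + math.sqrt(disc)), 0.0)

def sample_truncated_normal(rng, mode, std):
    """Draw from N_+(mode, std^2) by resampling; mode >= 0 keeps at least half the draws."""
    while True:
        x = rng.normal(mode, std)
        if x > 0.0:
            return x

def mh_step(rng, a_cur, alpha, mu_post, var_post):
    """One independence Metropolis-Hastings step with the truncated Gaussian proposal (19)."""
    mode = instrumental_mode(alpha, mu_post, var_post)
    a_new = sample_truncated_normal(rng, mode, math.sqrt(var_post))

    def log_q(a):
        # Proposal log density up to a constant that cancels in the ratio.
        return -(a - mode) ** 2 / (2.0 * var_post)

    log_ratio = (log_target(a_new, alpha, mu_post, var_post) - log_q(a_new)) \
        - (log_target(a_cur, alpha, mu_post, var_post) - log_q(a_cur))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return a_new   # accept the candidate
    return a_cur       # reject: keep the previous sample
```

Calling `mh_step` once per entry and per iteration reproduces the keep-on-rejection rule stated in the text; the values passed for `alpha`, `mu_post`, and `var_post` would come from the current Gibbs state.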

2. Sampling of ${[S_{r}]}_{jk}$ . The sampling of the reconstruction weight of the common basis ${[S_{r}^{(t + 1)}]}_{jk}$ depends on the conditional posterior distribution

\begin{matrix} p ({[S_{r}]}_{jk} | X_{k}, Θ_{S_{rjk}}^{(t)}, Φ_{S_{rjk}}^{(t)}) \propto p (X_{k} ∣ {[S_{r}]}_{jk}, Θ_{S_{rjk}}^{(t)}) \\ p ({[S_{r}]}_{jk} | Φ_{S_{rjk}}^{(t)}) \end{matrix}

(20)

where $Θ_{S_{rjk}}^{(t)} = \{A_{r}^{(t + 1)}, {[S_{r}^{(t + 1)}]}_{(1 : j - 1) k}, {[S_{r}^{(t)}]}_{(j + 1 : D_{r}) k}, A_{h}^{(t)}, S_{h}^{(t)}, Σ^{(t)}\}$ and $Φ_{S_{rjk}}^{(t)} = λ_{rj}^{(t)}$ . X_k is the kth column of X. Again, the preceding weights ${[S_{r}^{(t + 1)}]}_{(1 : j - 1) k}$ at the new iteration t+1 and the subsequent weights ${[S_{r}^{(t)}]}_{(j + 1 : D_{r}) k}$ at the current iteration t are used. The likelihood function is rewritten as a Gaussian distribution of ${[S_{r}]}_{jk}$ given by

p (X_{k} ∣ {[S_{r}]}_{jk}, Θ_{S_{rjk}}^{(t)}) \propto exp \{- \frac{{({[S_{r}]}_{jk} - μ_{S_{rjk}}^{likel})}^{2}}{2 {[σ_{S_{rjk}}^{likel}]}^{2}}\} .

(21)

The Gaussian parameters are obtained by $μ_{S_{rjk}}^{likel} = {[σ_{S_{rjk}}^{likel}]}^{2} \sum_{i = 1}^{N} ({[Σ^{(t)}]}_{ii}^{- 1} {[A_{r}^{(t + 1)}]}_{ij} ε_{ik}^{(- j)})$ , $ε_{ik}^{(- j)} = X_{ik} - (\sum_{m = 1}^{j - 1} {[A_{r}^{(t + 1)}]}_{im} {[S_{r}^{(t + 1)}]}_{mk} + \sum_{m = j + 1}^{D_{r}} {[A_{r}^{(t + 1)}]}_{im} {[S_{r}^{(t)}]}_{mk}) - \sum_{m = 1}^{D_{h}} {[A_{h}^{(t)}]}_{im} {[S_{h}^{(t)}]}_{mk}$ , and ${[σ_{S_{rjk}}^{likel}]}^{2} = {(\sum_{i = 1}^{N} {[Σ^{(t)}]}_{ii}^{- 1} {({[A_{r}^{(t + 1)}]}_{ij})}^{2})}^{- 1}$ . Given the Gaussian likelihood and the Laplacian prior, the conditional posterior distribution is calculated by

λ_{rj}^{(t)} exp \{- \frac{{({[S_{r}]}_{jk} - μ_{S_{rjk}}^{post})}^{2}}{2 {[σ_{S_{rjk}}^{post}]}^{2}}\} I_{[0, + \infty [} ({[S_{r}]}_{jk})

(22)

where $μ_{S_{rjk}}^{post} = μ_{S_{rjk}}^{likel} - λ_{rj}^{(t)} {[σ_{S_{rjk}}^{likel}]}^{2}$ and ${[σ_{S_{rjk}}^{post}]}^{2} = {[σ_{S_{rjk}}^{likel}]}^{2}$ . Notably, the hyperparameters $\{γ_{rj}^{(t + 1)}, δ_{rj}^{(t + 1)}\}$ of the LSM prior are also sampled and are used to sample the LSM parameter $λ_{rj}^{(t + 1)}$ from a gamma distribution. Here, the Metropolis-Hastings algorithm is applied again. The instrumental distribution $q ({[S_{r}]}_{jk})$ is selected to best fit (22). It is derived as a truncated Gaussian distribution $N_{+} ({[S_{r}]}_{jk} | μ_{S_{rjk}}^{inst}, {[σ_{S_{rjk}}^{inst}]}^{2})$ , where the mode $μ_{S_{rjk}}^{inst}$ is derived by finding the root of a quadratic equation in ${[S_{r}]}_{jk}$ and the width is given by ${[σ_{S_{rjk}}^{inst}]}^{2} = {[σ_{S_{rjk}}^{post}]}^{2}$ . The conditional posterior distributions for sampling the individual basis parameter ${[A_{h}^{(t + 1)}]}_{ij}$ and its reconstruction weight ${[S_{h}^{(t + 1)}]}_{jk}$ are similar to those for sampling ${[A_{r}^{(t + 1)}]}_{ij}$ and ${[S_{r}^{(t + 1)}]}_{jk}$ , respectively, and are therefore omitted here.
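
To make the posterior parameters for a single weight concrete, the sketch below completes the square in the Gaussian likelihood and then applies the shift caused by the sparse prior. It is an illustration under assumptions: the function and argument names are hypothetical, the noise covariance is taken as diagonal, and the prior on a nonnegative weight is taken as proportional to exp(−λs).

```python
import numpy as np

def weight_posterior_params(residual, basis_col, noise_var, lam):
    """Return (mu_post, var_post) of the truncated Gaussian posterior of one weight.

    residual  : eps_{ik}^{(-j)} for i = 1..N (data minus all other components)
    basis_col : column j of the common basis matrix, length N
    noise_var : diagonal entries of the noise covariance, length N
    lam       : sparsity parameter of the prior, assumed proportional to exp(-lam * s)
    """
    precision = np.sum(basis_col ** 2 / noise_var)     # inverse likelihood variance
    var_likel = 1.0 / precision
    mu_likel = var_likel * np.sum(basis_col * residual / noise_var)
    mu_post = mu_likel - lam * var_likel               # prior shifts the mean toward zero
    return mu_post, var_likel

def log_unnormalized_posterior(s, residual, basis_col, noise_var, lam):
    """Gaussian log likelihood plus log prior, up to an additive constant, for s >= 0."""
    err = residual - basis_col * s
    return -0.5 * np.sum(err ** 2 / noise_var) - lam * s
```

A larger `lam` moves `mu_post` further toward zero, which is how the prior encourages sparse reconstruction weights.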

3. Sampling of ${[Σ]}_{ii}^{- 1}$ . The sampling of the inverse of noise variance ${({[Σ]}_{ii}^{(t + 1)})}^{- 1}$ is performed according to the conditional posterior distribution

\begin{matrix} p ({[Σ]}_{ii}^{- 1} | X_{i}^{T}, Θ_{Σ_{ii}}^{(t)}, Φ_{Σ_{ii}}^{(t)}) \propto p (X_{i}^{T} ∣ {[Σ]}_{ii}^{- 1}, Θ_{Σ_{ii}}^{(t)}) p ({[Σ]}_{ii}^{- 1} | Φ_{Σ_{ii}}^{(t)}) \end{matrix}

(23)

where $Θ_{Σ_{ii}}^{(t)} = \{A_{r}^{(t + 1)}, S_{r}^{(t + 1)}, A_{h}^{(t + 1)}, S_{h}^{(t + 1)}\}$ and $p ({[Σ]}_{ii}^{- 1} | Φ_{Σ_{ii}}^{(t)}) = G ({[Σ]}_{ii}^{- 1} | α_{Σ_{ii}}, β_{Σ_{ii}})$ . The resulting posterior distribution is again a gamma distribution with updated hyperparameters $α_{Σ_{ii}}^{post} = \frac{M}{2} + α_{Σ_{ii}}$ and $β_{Σ_{ii}}^{post} = \frac{1}{2} \sum_{k = 1}^{M} {(X_{ik} - \sum_{m = 1}^{D_{r}} {[A_{r}^{(t + 1)}]}_{im} {[S_{r}^{(t + 1)}]}_{mk} - \sum_{m = 1}^{D_{h}} {[A_{h}^{(t + 1)}]}_{im} {[S_{h}^{(t + 1)}]}_{mk})}^{2} + β_{Σ_{ii}}$ .

In the experiments, we run the MCMC sampling procedure for t_max iterations. Because the first t_min iterations are not yet stable, these burn-in samples are discarded. The marginal posterior estimates of the common basis ${[Â_{r}]}_{ij}$ , the individual basis ${[Â_{h}]}_{ij}$ , and their reconstruction weights ${[Ŝ_{r}]}_{jk}$ and ${[Ŝ_{h}]}_{jk}$ are calculated as sample means over the remaining iterations, e.g.,

{[Â_{r}]}_{ij} = \frac{1}{t_{max} - t_{min}} \sum_{t = t_{min} + 1}^{t_{max}} {[A_{r}]}_{ij}^{(t)} .

(24)

With these posterior estimates, the rhythmic source and the harmonic source are calculated by $Â_{r} Ŝ_{r}$ and $Â_{h} Ŝ_{h}$ , respectively. This completes the BGS-NMF algorithm. Unlike BNMF [15], the proposed BGS-NMF conducts group sparse learning based on the LSM distribution, and the common bases A_r are shared across the data segments l. The benefit of group sparse learning is evaluated in the experiments of Section 4.
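
The noise-precision update and the burn-in averaging in (24) can be sketched as below. This is a hedged illustration: the function names are hypothetical, the gamma distribution G(·|α, β) is assumed to use the shape-rate convention, and sample arrays are assumed to store iteration t = 1 at index 0.

```python
import numpy as np

def sample_noise_precision(rng, X, A_r, S_r, A_h, S_h, alpha0, beta0):
    """Draw one precision per row i from its gamma conditional posterior."""
    M = X.shape[1]
    resid = X - A_r @ S_r - A_h @ S_h                  # reconstruction residual
    shape = 0.5 * M + alpha0                           # updated shape hyperparameter
    rate = 0.5 * np.sum(resid ** 2, axis=1) + beta0    # updated rate hyperparameter, one per row
    return rng.gamma(shape, 1.0 / rate)                # NumPy expects shape and scale = 1 / rate

def posterior_mean(samples, t_min):
    """Average iterations t_min+1 .. t_max as in (24); samples has iterations on axis 0."""
    return np.mean(samples[t_min:], axis=0)
```

With stacked samples of each factor, the rhythmic estimate would then be `posterior_mean(A_r_samples, t_min) @ posterior_mean(S_r_samples, t_min)`, and likewise for the harmonic source.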

4 Experiments

In this study, BGS-NMF is implemented to estimate two audio source signals from a single-channel mixed signal. One source signal contains the rhythmic pattern, which is constructed from the bases shared by all audio segments, while the other source contains the harmonic information, which is represented by the bases of individual segments. Bayesian sparse learning is performed to conduct probabilistic reconstruction based on the relevant group bases. Experiments are reported to evaluate the performance of model inference and signal reconstruction.

4.1 Experimental setup

In the experiments, we sampled six rhythmic signals and six harmonic signals from http://www.free-scores.com/index_uk.php3 and http://www.freesound.org/. Six mixed music signals were collected as follows: ‘music 1,’ bass+piano; ‘music 2,’ drum+guitar; ‘music 3,’ drum+violin; ‘music 4,’ cymbal+organ; ‘music 5,’ drum+saxophone; and ‘music 6,’ cymbal+singing, which contained combinations of different rhythmic and harmonic source signals. Three different drum signals and two different cymbal signals were included. For each set of experimental data, we applied a different mixing matrix, namely music 1 (1.2667 −1.9136), music 2 (1.1667 −1.9136), music 3 (−1.2667 1.6136), music 4 (1.8667 1.1136), music 5 (−1.1667 2.8136), and music 6 (1.9617 1.1510), to simulate the corresponding single-channel mixed signal. Each audio signal was 21 s long. Readers may access http://chien.cm.nctu.edu.tw/bgs-nmf to listen to the twelve source signals and the corresponding six mixed signals. The collected audio signals had a 44,100-Hz sampling rate and 16-bit resolution.

In our implementation, the magnitude of the fast Fourier transform of the audio signal was extracted every 1,024 samples with an overlap of 512 samples between frames. Each mixed signal was chopped into L equal segments for music source separation. Each segment was 3 s long, which was long enough to contain sufficient rhythmic signal. The numbers of common bases and individual bases were empirically set to 15 and 10, respectively, i.e., D_r=15 and D_h=10. Enough common bases were allocated to capture the basis information shared by different segments. The initial common bases $A_{r}^{(0)}$ and individual bases $A_{h}^{(0)}$ were estimated by applying k-means clustering to the automatically detected rhythmic and harmonic segments, respectively. The detection was based on a classifier using a Gaussian mixture model. We performed 1,000 Gibbs sampling iterations (t_max=1,000).

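
The front end described above can be sketched in a few lines. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, the Hann window is an assumed choice (the text does not name a window), the frame length is read as 1,024 samples with a 512-sample hop, and each mixing matrix is read as a pair of weights applied to the rhythmic and harmonic sources.

```python
import numpy as np

FS = 44100           # sampling rate in Hz (from the text)
FRAME = 1024         # frame length in samples (from the text)
HOP = FRAME - 512    # a 512-sample overlap gives a 512-sample hop
SEG_SECONDS = 3      # segment length in seconds (from the text)

def mix(rhythmic, harmonic, weights):
    """Single-channel mixture as a weighted sum of the two sources."""
    return weights[0] * rhythmic + weights[1] * harmonic

def split_segments(x):
    """Chop a signal into consecutive, equally long segments."""
    seg_len = SEG_SECONDS * FS
    n_seg = len(x) // seg_len
    return [x[l * seg_len:(l + 1) * seg_len] for l in range(n_seg)]

def magnitude_spectrogram(x):
    """Magnitude STFT of one segment; rows are frequency bins, columns are frames."""
    n_frames = 1 + (len(x) - FRAME) // HOP
    window = np.hanning(FRAME)   # assumed window
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

Under these assumptions, a 21-s signal yields seven 3-s segments, and each segment gives a nonnegative matrix with 513 frequency bins that can serve as the input X of the factorization.
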
The separation performance was evaluated according to the signal-to-interference ratio (SIR) in decibels

SIR (dB) = 10 \log_{10} [\frac{\sum_{l = 1}^{L} \sum_{k = 1}^{M} ∥ X_{k}^{(l)} ∥^{2}}{\sum_{l = 1}^{L} \sum_{k = 1}^{M} ∥ {\hat{X}}_{k}^{(l)} - X_{k}^{(l)} ∥^{2}}] .

(25)

The interference was measured by the Euclidean distance between original signal ${X_{k}^{(l)}}$ and reconstructed signal ${{\hat{X}}_{k}^{(l)}}$ for different samples k in different segments l. These signals include rhythmic signals ${{[Â_{r} Ŝ_{r}^{(l)}]}_{k}}$ and harmonic signals ${{[Â_{h}^{(l)} Ŝ_{h}^{(l)}]}_{k}}$ .
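
The SIR in (25) is straightforward to compute. The sketch below assumes that the original and reconstructed signals are available as lists of per-segment matrices whose columns are the samples k; the function name is hypothetical.

```python
import numpy as np

def sir_db(reference_segments, estimate_segments):
    """Signal-to-interference ratio in dB, accumulated over all segments and columns."""
    signal = sum(np.sum(X ** 2) for X in reference_segments)
    interference = sum(
        np.sum((X_hat - X) ** 2)
        for X, X_hat in zip(reference_segments, estimate_segments)
    )
    return 10.0 * np.log10(signal / interference)
```

As a sanity check, an estimate that equals the reference scaled by 0.9 leaves 1% of the signal energy as interference, which corresponds to 20 dB.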

For system initialization at t=0, we detected two short segments, one containing only the rhythmic signal and one containing only the harmonic signal, and used them to find the rhythmic parameters $\{A_{r}^{(0)}, S_{r}^{(0)}\}$ and the harmonic parameters $\{A_{h}^{(0)}, S_{h}^{(0)}\}$ , respectively. This prior information was used to implement five NMF methods for single-channel source separation. We carried out baseline NMF [2], Bayesian NMF (BNMF) [15], group-based NMF (GNMF) [11] (or NMPCF [22]), and the proposed BGS-NMF under consistent experimental conditions. To evaluate the effect of sparse priors in BGS-NMF for music source separation, we additionally realized BGS-NMF with a Laplacian distribution. For this realization, the sampling steps for the LSM hyperparameters $\{γ_{r_{j}}, δ_{r_{j}}, γ_{h_{j}}, δ_{h_{j}}\}$ were skipped. The BGS-NMFs with Laplacian distribution (denoted by BGS-NMF-LP) and LSM distribution (BGS-NMF-LSM) were compared. All these NMFs were implemented for different segments l.

The baseline NMF model [2] was realized by using the multiplicative updating algorithm in (4). BNMF [15] conducted Bayesian learning of the NMF model, where MCMC sampling was performed and gamma distributions were assumed for bases and reconstruction weights. No group sparse learning was considered in NMF and BNMF. Using NMPCF [22] or GNMF [11], the common bases and individual bases were constructed by applying a multiplicative updating algorithm. No probabilistic framework was involved. The ℓ₂-norm regularization for basis parameters A_r and $A_{h}^{(l)}$ was considered. There was no sparseness constraint imposed on the reconstruction weight parameters $S_{r}^{(l)}$ and $S_{h}^{(l)}$ . Only the result of the GNMF method is reported. Using GNMF, the regularization parameters in (5) were empirically determined as $\{η_{a} = 0.35, η_{a_{r}} = 0.2, η_{a_{h}} = 0.2\}$ .

In contrast, Bayesian group sparse learning is realized in the BGS-NMF-LP and BGS-NMF-LSM algorithms. In these algorithms, the uncertainties of bases and reconstruction weights are represented by gamma distributions and LSM distributions, respectively. The MCMC algorithm is developed to sample the BGS-NMF parameters Θ^(t+1) and hyperparameters Φ^(t+1). The groups of common bases A_r and individual bases A_h are estimated to capture between-segment repetitive patterns and within-segment residual information, respectively. The relevant bases are detected via sparse priors in accordance with Laplacian or LSM distributions. Using BGS-NMF-LP, we sampled the parameters and hyperparameters by using different frames from the six music signals and automatically calculated the averaged values of the regularization parameters in (15) as $\{η_{a} = 0.41, η_{s_{r}} = 0.31, η_{s_{h}} = 0.26\}$ . The regularization parameters in (5) and (15) have different physical meanings in their objective functions.

The computational cost and the model size were also examined. The computation times of running MATLAB code were measured on a personal computer with an Intel Core 2 Duo 2.4-GHz CPU and 4 GB of RAM. The computation times of demixing a 21-s audio signal were 3.1, 12.1, 16.2, 20.9, and 21.2 min using NMF, BNMF, GNMF, BGS-NMF-LP, and BGS-NMF-LSM, respectively. In addition, the model sizes of BNMF, GNMF, BGS-NMF-LP, and BGS-NMF-LSM were measured to be 2.5, 4.5, 5.2, and 5.3 times that of the baseline NMF, respectively.
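
For orientation, the baseline NMF used in the comparison can be sketched as follows. Equation (4) is not reproduced in this section, so the code assumes the standard multiplicative updates for the squared Euclidean cost, which may differ in detail from the rule actually used; the function name, the random initialization, and the small constant `eps` are illustrative choices.

```python
import numpy as np

def nmf_multiplicative(X, rank, n_iter=200, eps=1e-12, seed=0):
    """Factorize nonnegative X (N x M) as A @ S with multiplicative updates."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    A = rng.random((N, rank)) + eps    # nonnegative random initialization (assumption)
    S = rng.random((rank, M)) + eps
    errors = []
    for _ in range(n_iter):
        S *= (A.T @ X) / (A.T @ A @ S + eps)    # update reconstruction weights
        A *= (X @ S.T) / (A @ S @ S.T + eps)    # update basis vectors
        errors.append(np.sum((X - A @ S) ** 2))
    return A, S, errors
```

Because every update multiplies by a nonnegative ratio, A and S stay nonnegative without any explicit projection. In the experiments above, the initial factors would come from the detected rhythmic and harmonic segments rather than from random values.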

4.2 Evaluation for MCMC iterative procedure

In this set of experiments, the sampling process of the BGS-NMF algorithm is evaluated. The sparsity control parameter $λ_{rj}$ and its hyperparameters $γ_{rj}$ and $δ_{rj}$ for the common bases are investigated. Figure 4 displays an example of the MCMC iterative sampling process for the LSM parameter $λ_{rj}^{(t + 1)}$ . The sample values converge after 200 iterations. Figure 5 shows an example of the iterative sampling process for the LSM hyperparameters $γ_{rj}^{(t + 1)}$ and $δ_{rj}^{(t + 1)}$ , which also converge after 200 iterations. Accordingly, the parameter $t_{min}$ is empirically set to 200 when calculating the posterior estimates of the BGS-NMF parameters as given in (24). In addition, Figure 6 shows an estimated distribution of the reconstruction weight of the common basis $p ({[S_{r}]}_{jk} | γ_{rj}, δ_{rj})$ , where only nonnegative ${[S_{r}]}_{jk}$ is valid. This distribution has the shape of an LSM distribution and is estimated from the second segment of “music 2”.
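
The burn-in length here was chosen by inspecting the sample traces. One crude way to automate a similar check is sketched below; it is a heuristic added for illustration, not the procedure used in the paper, and the function name, window length, and tolerance are arbitrary assumptions.

```python
import numpy as np

def first_stable_iteration(trace, window=50, tol=0.05):
    """Return the first index whose windowed mean is within tol of the final windowed mean."""
    trace = np.asarray(trace, dtype=float)
    target = trace[-window:].mean()            # reference level at the end of the chain
    for t in range(len(trace) - window + 1):
        if abs(trace[t:t + window].mean() - target) < tol:
            return t
    return len(trace)
```

Applied to a trace of sampled values of one hyperparameter, the returned index gives a rough candidate for t_min, which should still be confirmed by looking at the trace itself.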

4.3 Evaluation for single-channel music source separation

A quantitative comparison of the different NMFs is conducted by measuring the SIRs of the reconstructed rhythmic signal and the reconstructed harmonic signal. Table 1 shows the experimental results on the six mixed music signals, which come from twelve different source signals. The averaged SIRs are reported in the last row. Comparing NMF and BNMF, we find that BNMF obtains higher SIRs on the reconstructed signals. Further, BNMF is more robust to different combinations of rhythmic and harmonic signals, whereas the variation of SIRs using NMF is relatively high. Bayesian learning provides model regularization for NMF. On the other hand, GNMF (or NMPCF) performs better than BNMF in terms of the averaged SIR of the reconstructed signals. The key difference between BNMF and GNMF is the reconstruction of the rhythmic signal. BNMF estimates the rhythmic bases for individual segments, while GNMF (or NMPCF) calculates the rhythmic bases shared by different segments. The prior information $\{A_{r}^{(0)}, S_{r}^{(0)}, A_{h}^{(0)}, S_{h}^{(0)}\}$ is applied in these methods. These results confirm the importance of basis grouping in signal reconstruction based on NMF. In particular, BGS-NMF-LP and BGS-NMF-LSM perform better than the other NMF methods, and BGS-NMF-LSM outperforms BGS-NMF-LP in terms of SIRs. Reconstruction weights modeled by LSM distributions are better than those modeled by Laplacian distributions. Sparser reconstruction weights identify fewer but more relevant basis vectors for signal separation. Overall, among these five related NMFs, the highest SIRs of the reconstructed signals are achieved by BGS-NMF-LSM. Its SIRs for the reconstructed rhythmic and harmonic signals are 8.13 dB and 8.40 dB, which are higher than 3.71 dB and 3.38 dB using NMF, 4.87 dB and 4.61 dB using BNMF, 5.63 dB and 5.71 dB using GNMF, and 7.91 dB and 8.11 dB using BGS-NMF-LP, respectively. The superiority of BGS-NMF-LSM over the other NMFs is threefold: Bayesian probabilistic modeling, group basis representation, and sparse reconstruction weights. Again, compared with GNMF, the proposed BGS-NMF-LP and BGS-NMF-LSM achieve more robust SIRs across different music source signals. Figure 7 shows the waveforms of a drum signal, a saxophone signal, and the resulting mixed signal in “music 5”. Figure 8 displays the spectrograms of these three signals. Figure 9 shows the spectrograms of the drum signal and the saxophone signal reconstructed using BGS-NMF-LSM. For the other five mixed signals, the reconstructed signals are available at http://chien.cm.nctu.edu.tw/bgs-nmf.
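
The averaged SIRs quoted above can be tabulated to read off the gains between methods. The values are those reported in the text; the dictionary layout and the helper name are illustrative.

```python
# Averaged SIR in dB as (rhythmic, harmonic), taken from the reported results.
avg_sir_db = {
    "NMF": (3.71, 3.38),
    "BNMF": (4.87, 4.61),
    "GNMF": (5.63, 5.71),
    "BGS-NMF-LP": (7.91, 8.11),
    "BGS-NMF-LSM": (8.13, 8.40),
}

def gain_over(baseline, method):
    """SIR gain in dB of `method` over `baseline` for (rhythmic, harmonic)."""
    (r0, h0), (r1, h1) = avg_sir_db[baseline], avg_sir_db[method]
    return round(r1 - r0, 2), round(h1 - h0, 2)
```

For instance, `gain_over("NMF", "BGS-NMF-LSM")` gives gains of 4.42 dB and 5.02 dB, while `gain_over("BGS-NMF-LP", "BGS-NMF-LSM")` gives 0.22 dB and 0.29 dB, so most of the improvement comes from Bayesian group sparse learning itself and a smaller part from replacing the Laplacian prior with the LSM prior.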

Table 1 Comparison of SIR (in dB) of the reconstructed rhythmic signal and harmonic signal based on NMF, BNMF, GNMF, BGS-NMF-LP and BGS-NMF-LSM

5 Conclusions

This paper has presented Bayesian group sparse learning and applied it to single-channel nonnegative source separation. The basis vectors in NMF were grouped into two partitions. The first group consisted of the common bases, which were used to explore the inter-segment repetitive characteristics, while the second consisted of the individual bases, which were applied to represent the intra-segment harmonic information. The LSM distribution was introduced to express sparse reconstruction weights for the two groups of basis vectors. Bayesian learning was incorporated into group basis representation with model regularization. An MCMC algorithm with Metropolis-Hastings steps was developed to conduct approximate inference of model parameters and hyperparameters. Model parameters were used to find the decomposed rhythmic and harmonic signals. Hyperparameters were used to control the sparsity of reconstruction weights and the generation of basis parameters. In the experiments, we implemented the proposed BGS-NMFs for underdetermined source separation. The convergence of the sampling procedure for approximate inference was investigated. The performance of BGS-NMF-LP and BGS-NMF-LSM was shown to be robust to different kinds of rhythmic and harmonic sources and mixing conditions. BGS-NMF-LSM outperformed the other NMFs in terms of SIRs. The BGS-NMF controlled by the LSM distribution performed better than that controlled by the Laplacian distribution. In the future, the performance of BGS-NMF may be further improved in several ways. For example, the numbers of common bases and individual bases could be automatically selected within the Bayesian framework by using the marginal likelihood. Group sparse learning could be extended to construct a hierarchical NMF in which hierarchical grouping of basis vectors is examined. Underdetermined separation with different numbers of sources and sensors could be tackled. Also, online learning could be used to update segment-based parameters and hyperparameters [33, 34]. Such evolutionary BGS-NMFs should work for nonstationary single-channel blind source separation. In addition, more evaluations should be conducted by using realistic data with larger amounts of mixed speech signals from different application domains, such as meetings and call centers.

Appendix

Derivations for inference of BGS-NMF parameters and hyperparameters

We address some derivations for model inference of BGS-NMF parameters and hyperparameters. First, the exponent of the likelihood function $p (X_{i}^{T} | {[A_{r}^{(t + 1)}]}_{i (1 : j - 1)}, {[A_{r}^{(t)}]}_{i (j + 1 : D_{r})}, S_{r}^{(t)}, A_{h}^{(t)}, S_{h}^{(t)}, Σ^{(t)})$ in (16) is expressed by

\begin{matrix} - \frac{1}{2 {[Σ^{(t)}]}_{ii}} \sum_{k = 1}^{M} [X_{ik} - \sum_{m = 1}^{j - 1} {[A_{r}^{(t + 1)}]}_{im} {[S_{r}^{(t)}]}_{mk} \\ - {[A_{r}]}_{ij} {[S_{r}^{(t)}]}_{jk} - \sum_{m = j + 1}^{D_{r}} {[A_{r}^{(t)}]}_{im} {[S_{r}^{(t)}]}_{mk} \\ {- \sum_{m = 1}^{D_{h}} {[A_{h}^{(t)}]}_{im} {[S_{h}^{(t)}]}_{mk}]}^{2} \end{matrix}

(26)

which can be manipulated as a quadratic function of the parameter ${[A_{r}]}_{ij}$ and leads to (17). The conditional posterior distribution $p ({[A_{r}]}_{ij} | X_{i}^{T}, Θ_{A_{rij}}^{(t)}, Φ_{A_{rij}}^{(t)})$ is then derived by combining (17) and (11) and turns out to be

\begin{array}{l} {[A_{r}]}_{ij}^{α_{rj}^{(t)} - 1} \\ exp \{- \frac{{[A_{r}]}_{ij}^{2} - 2 (μ_{A_{rij}}^{likel} - β_{rj}^{(t)} {[σ_{A_{rij}}^{likel}]}^{2}) {[A_{r}]}_{ij} + {[μ_{A_{rij}}^{likel}]}^{2}}{2 {[σ_{A_{rij}}^{likel}]}^{2}}\} \\ I_{[0, + \infty [} ({[A_{r}]}_{ij}) \end{array}

(27)

which is proportional to (18). In addition, when finding the mode of (18), we take the logarithm of (18) and solve the corresponding quadratic equation in ${[A_{r}]}_{ij}$ as

\begin{matrix} \frac{\partial}{\partial {[A_{r}]}_{ij}} \{(α_{rj}^{(t)} - 1) ln {[A_{r}]}_{ij} - \frac{{({[A_{r}]}_{ij} - μ_{A_{rij}}^{post})}^{2}}{2 {[σ_{A_{rij}}^{post}]}^{2}}\} = 0 \\ \Rightarrow {[A_{r}]}_{ij}^{2} - μ_{A_{rij}}^{post} {[A_{r}]}_{ij} - (α_{rj}^{(t)} - 1) {[σ_{A_{rij}}^{post}]}^{2} = 0 . \end{matrix}

(28)

By defining $△ = {(μ_{A_{rij}}^{post})}^{2} + 4 (α_{rj}^{(t)} - 1) {[σ_{A_{rij}}^{post}]}^{2}$ , the mode is determined by

μ_{A_{rij}}^{inst} = \begin{cases} 0, & if △ < 0 \\ max \{\frac{1}{2} (μ_{A_{rij}}^{post} + \sqrt{△}), 0\}, & else. \end{cases}

(29)
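
As a check on the steps above, the following worked lines complete the square in (26) explicitly and then evaluate the mode in (29) for one set of values. The numbers α = 2, μ^post = 1, and [σ^post]² = 0.25 are arbitrary illustrative choices, not values from the experiments.

```latex
% Completing the square in (26), with \varepsilon_{ik}^{(-j)} as defined for (17)
-\frac{1}{2 [\Sigma^{(t)}]_{ii}} \sum_{k=1}^{M}
  \bigl( \varepsilon_{ik}^{(-j)} - [A_r]_{ij} [S_r^{(t)}]_{jk} \bigr)^{2}
= -\frac{\sum_{k=1}^{M} [S_r^{(t)}]_{jk}^{2}}{2 [\Sigma^{(t)}]_{ii}}
  \Bigl( [A_r]_{ij}
    - \frac{\sum_{k=1}^{M} [S_r^{(t)}]_{jk}\, \varepsilon_{ik}^{(-j)}}
           {\sum_{k=1}^{M} [S_r^{(t)}]_{jk}^{2}} \Bigr)^{2}
  + \text{const}

% Mode from (28)-(29) for \alpha = 2, \mu^{post} = 1, [\sigma^{post}]^{2} = 0.25
\triangle = 1^{2} + 4 (2 - 1)(0.25) = 2, \qquad
\mu^{inst} = \max \bigl\{ \tfrac{1}{2} (1 + \sqrt{2}),\ 0 \bigr\} \approx 1.207
```

The first line shows that the likelihood is Gaussian in the basis parameter, with the mean given by the ratio inside the parentheses and the variance given by the noise variance divided by the sum of squared weights. The second line shows that, for α > 1, the mode lies to the right of μ^post because the power term pushes probability mass away from zero.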

On the other hand, following the model inference in Section 3.3, we continue to describe the MCMC sampling algorithm and the calculation of conditional posterior distributions for the remaining BGS-NMF hyperparameters {α_r,β_r,α_h,β_h,λ_r,λ_h,γ_r,δ_r,γ_h,δ_h}.

4. Sampling of $α_{rj}$ . The hyperparameter $α_{rj}^{(t + 1)}$ is sampled according to a conditional posterior distribution which is obtained by combining a likelihood function of ${[A_{r}]}_{ij}$ and an exponential prior density of $α_{rj}$ with parameter $λ_{α_{rj}}$ . The resulting distribution is written as

\begin{matrix} p (α_{rj} ∣ {[A_{r}^{(t + 1)}]}_{ij}, β_{rj}^{(t)}) \propto {(\frac{1}{Γ (α_{rj})} exp {λ_{α_{rj}}^{post} α_{rj}})}^{D_{r}} I_{[0, + \infty [} (α_{rj}) \end{matrix}

(30)

where $λ_{α_{rj}}^{post} = ln β_{rj}^{(t)} + (1 / D_{r}) \sum_{j = 1}^{D_{r}} ln {[A_{r}^{(t + 1)}]}_{ij} - (1 / D_{r}) λ_{α_{rj}}$ . This distribution does not belong to a known family, so the Metropolis-Hastings algorithm is applied. An instrumental distribution $q (α_{rj})$ is obtained by fitting the term within the brackets of (30) with a gamma distribution, as detailed in [15].

5. Sampling of $β_{rj}$ . The hyperparameter $β_{rj}^{(t + 1)}$ is sampled according to a conditional posterior distribution which is obtained by combining a likelihood function of ${[A_{r}]}_{ij}$ and a gamma prior density of $β_{rj}$ with parameters $\{α_{β_{rj}}, β_{β_{rj}}\}$ , i.e.,

\begin{array}{l} p (β_{rj} ∣ {[A_{r}^{(t + 1)}]}_{ij}, α_{rj}^{(t + 1)}) \propto {(β_{rj})}^{D_{r} α_{rj}^{(t + 1)}} \\ \times exp \{- β_{rj} \sum_{j = 1}^{D_{r}} {[A_{r}^{(t + 1)}]}_{ij}\} G (β_{rj} | α_{β_{rj}}, β_{β_{rj}}) . \end{array}

(31)

The resulting distribution is arranged as a new gamma distribution $G (β_{rj} | α_{β_{rj}}^{post}, β_{β_{rj}}^{post})$ where $α_{β_{rj}}^{post} = 1 + D_{r} α_{rj}^{(t + 1)} + α_{β_{rj}}$ and $β_{β_{rj}}^{post} = \sum_{j = 1}^{D_{r}} {[A_{r}^{(t + 1)}]}_{ij} + β_{β_{rj}}$ . Here, we do not describe the sampling of $α_{hj}^{(t + 1)}$ and $β_{hj}^{(t + 1)}$ since the conditional posterior distributions for sampling these two hyperparameters are similar to those for sampling of $α_{rj}^{(t + 1)}$ and $β_{rj}^{(t + 1)}$ .

6. Sampling of $λ_{rj}$ or $λ_{hj}$ . For sampling the scaling parameter $λ_{rj}^{(t + 1)}$ , the conditional posterior distribution is obtained by

\begin{matrix} p (λ_{rj} ∣ {[S_{r}^{(t + 1)}]}_{j (k = 1 : M)}, γ_{rj}^{(t)}, δ_{rj}^{(t)}) \propto \prod_{k = 1}^{M} p ({[S_{r}^{(t + 1)}]}_{jk} | λ_{rj}) p (λ_{rj} | γ_{rj}^{(t)}, δ_{rj}^{(t)}) \\ \propto {(λ_{rj})}^{M γ_{rj}^{(t)}} exp \{- M λ_{rj} (δ_{rj}^{(t)} + \sum_{k = 1}^{M} {[S_{r}^{(t + 1)}]}_{jk})\} . \end{matrix}

(32)

7. Sampling of $γ_{rj}$ . The sampling of the LSM parameter $γ_{rj}^{(t + 1)}$ is performed by using the conditional posterior distribution which is derived by combining a likelihood function of $λ_{rj}$ and an exponential prior density of $γ_{rj}$ with parameter $λ_{γ_{rj}}$ . The resulting distribution is expressed as

p (γ_{rj} | λ_{rj}^{(t + 1)}, δ_{rj}^{(t)}) \propto \frac{1}{Γ (γ_{rj})} exp {λ_{γ_{rj}}^{post} γ_{rj}} I_{[0, + \infty [} (γ_{rj}),

(33)

where $λ_{γ_{rj}}^{post} = ln δ_{rj}^{(t)} + \frac{γ_{rj} - 1}{γ_{rj}} ln λ_{rj}^{(t + 1)} - λ_{γ_{rj}}$ . Again, we need to find an instrumental distribution $q (γ_{rj})$ which optimally fits the conditional posterior distribution $p (γ_{rj} | λ_{rj}^{(t + 1)}, δ_{rj}^{(t)})$ . An approximate gamma distribution is found accordingly. The Metropolis-Hastings algorithm is then applied.

8. Sampling of $δ_{rj}$ . The sampling of the other LSM parameter $δ_{rj}^{(t + 1)}$ is performed by using the conditional posterior distribution which is derived from a likelihood function of $λ_{rj}$ and a gamma prior density of $δ_{rj}$ with parameters $\{α_{δ_{rj}}, β_{δ_{rj}}\}$ , i.e.,

\begin{matrix} p (δ_{rj} | λ_{rj}^{(t + 1)}, γ_{rj}^{(t + 1)}) \propto {(δ_{rj})}^{γ_{rj}^{(t + 1)}} \\ exp {- δ_{rj} λ_{rj}^{(t + 1)}} G (δ_{rj} | α_{δ_{rj}}, β_{δ_{rj}}) . \end{matrix}

(34)

This distribution can be arranged as a new gamma distribution $G (δ_{rj} | α_{δ_{rj}}^{post}, β_{δ_{rj}}^{post})$ where $α_{δ_{rj}}^{post} = D_{r} γ_{rj}^{(t + 1)} + α_{δ_{rj}}$ and $β_{δ_{rj}}^{post} = λ_{rj}^{(t + 1)} + β_{δ_{rj}}$ . Similarly, the conditional posterior distributions for sampling $γ_{hj}^{(t + 1)}$ and $δ_{hj}^{(t + 1)}$ can be formulated by referring to those for sampling $γ_{rj}^{(t + 1)}$ and $δ_{rj}^{(t + 1)}$ , respectively.

References

1. Cichocki A, Zdunek R, Amari S: New algorithms for non-negative matrix factorization in applications to blind source separation. In Proceedings of International Conference on Acoustic, Speech and Signal Processing (ICASSP). Piscataway: IEEE; 2006:621-624.
2. Hoyer PO: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res 2004, 5: 1457-1469.
3. Chien J-T, Hsieh H-L: Convex divergence ICA for blind source separation. IEEE Trans. Audio, Speech, Language Process 2012, 20(1):290-301.
4. Kompass R: A generalized divergence measure for nonnegative matrix factorization. Neural Comput 2007, 19: 780-791. 10.1162/neco.2007.19.3.780
5. Lee H, Yoo J, Choi S: Semi-supervised nonnegative matrix factorization. IEEE Signal Process. Lett 2010, 17(1):4-7.
6. Plumbley MD: Algorithms for nonnegative independent component analysis. IEEE Trans. Neural Netw 2003, 14(3):534-543. 10.1109/TNN.2003.810616
7. Bishop CM: Pattern Recognition and Machine Learning. New York: Springer Science; 2006.
8. Saon G, Chien J-T: Bayesian sensing hidden Markov models. IEEE Trans. Audio, Speech, Language Process 2012, 20(1):43-54.
9. Tipping ME: Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res 2001, 1: 211-244.
10. Babacan SD, Molina R, Katsaggelos AK: Bayesian compressive sensing using Laplace priors. IEEE Trans. Image Process 2010, 19(1):53-63.
11. Lee H, Choi S: Group nonnegative matrix factorization for EEG classification. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). JMLR; 2009:320-327.
12. Lefevre A, Bach F, Fevotte C: Itakura-Saito nonnegative matrix factorization with group sparsity. In Proceedings of the International Conference on Acoustic, Speech and Signal Processing (ICASSP). Prague Congress Center; 22–27 May 2011:21-24.
13. Kim M, Yoo J, Kang K, Choi S: Blind rhythmic source separation: nonnegativity and repeatability. In Proceedings of the International Conference on Acoustic, Speech and Signal Processing (ICASSP). Piscataway: IEEE; 2010:2006-2009.
14. Cemgil AT: Bayesian inference for nonnegative matrix factorization models. University of Cambridge, Technical Report CUED/F-INFENG/TR.609, 2008.
15. Moussaoui S, Brie D, Mohammad-Djafari A, Carteret C: Separation of non-negative mixture of non-negative sources using a Bayesian approach and MCMC sampling. IEEE Trans. Signal Process 2006, 54(11):4133-4145.
16. Schmidt MN, Winther O, Hansen LK: Bayesian non-negative matrix factorization. In Proceedings of the International Conference on Independent Component Analysis and Signal Separation, Paraty, March 2009. Lecture Notes in Computer Science. Heidelberg: Springer; 2009:540-547.
17. Fevotte C, Godsill SJ: A Bayesian approach for blind separation of sparse sources. IEEE Trans. Audio, Speech, Language Process 2006, 14(6):2174-2188.
18. Duan Z, Zhang Y, Zhang C, Shi Z: Unsupervised single-channel music source separation by average harmonic structure modeling. IEEE Trans. Audio, Speech, Language Process 2008, 16(4):766-778.
19. Schmidt MN, Olsson RK: Single-channel speech separation using sparse non-negative matrix factorization. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH). Pittsburgh; 17–21 September 2006:2614-2617.
20. Chien J-T, Hsieh H-L: Bayesian group sparse learning for nonnegative matrix factorization. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH). Portland; 9–13 September 2012:1552-1555.
21. Yoo J, Kim M, Kang K, Choi S: Nonnegative matrix partial co-factorization for drum source separation. In Proceedings of the International Conference on Acoustic, Speech and Signal Processing (ICASSP). Piscataway: IEEE; 2010:1942-1945.
22. Kim M, Yoo J, Kang K, Choi S: Nonnegative matrix partial co-factorization for spectral and temporal drum source separation. IEEE J. Sel. Top. Signal Process 2011, 5(6):1192-1204.
23. Bengio S, Pereira F, Singer Y, Strelow D: Group sparse coding. In Advances in Neural Information Processing Systems (NIPS). La Jolla: NIPS; 2009:82-89.
24. Jenatton R, Mairal J, Obozinski G, Bach F: Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML). Haifa; 21–25 June 2010.
25. Garrigues PJ, Olshausen BA: Group sparse coding with a Laplacian scale mixture prior. In Advances in Neural Information Processing Systems (NIPS). La Jolla: NIPS; 2010:676-684.
26. Salakhutdinov R, Mnih A: Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the International Conference on Machine Learning (ICML). Helsinki; 5–9 July 2008:880-887.
27. Zhong M, Girolami M: Reversible jump MCMC for non-negative matrix factorization. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). Clearwater Beach; 16–18 April 2009:663-670.
28. Hoffman MD, Blei DM, Cook PR: Bayesian nonparametric matrix factorization for recorded music. In Proceedings of the International Conference on Machine Learning (ICML). Haifa; 21–24 June 2010.
29. Marlin BM, Schmidt M, Murphy KP: Group sparse priors for covariance estimation. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI). Montreal; 18–21 June 2009:383-392.
30. Chien J-T, Chiang C-C: Group sparse hidden Markov models for speech recognition. In Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH). Portland; 9–13 September 2012:2646-2649.
31. Chien J-T, Ting C-W: Factor analyzed subspace modeling and selection. IEEE Trans. Audio, Speech, Language Process 2008, 16(1):239-248.
32. Chib S, Greenberg E: Understanding the Metropolis-Hastings algorithm. Am. Statistician 1995, 49(4):327-335.
33. Chien J-T, Hsieh H-L: Nonstationary source separation using sequential and variational Bayesian learning. IEEE Trans. Neural Netw. Learn. Syst 2013, 24(5):681-694.
34. Hsieh H-L, Chien J-T: Nonstationary and temporally-correlated source separation using Gaussian process. In Proceedings of the International Conference on Acoustic, Speech and Signal Processing (ICASSP). Prague Congress Center; 22–27 May 2011:2120-2123.

Acknowledgments

The authors acknowledge anonymous reviewers for their constructive feedback and helpful suggestions. This work has been partially supported by the National Science Council, Taiwan, Republic of China, under contract NSC 100-2628-E-009-028-MY3.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, National Chiao Tung University, Taiwan, 30010, Republic of China
Jen-Tzung Chien & Hsin-Lung Hsieh

Corresponding author

Correspondence to Jen-Tzung Chien.

Additional information

Competing interests

Both authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Chien, JT., Hsieh, HL. Bayesian group sparse learning for music source separation. J AUDIO SPEECH MUSIC PROC. 2013, 18 (2013). https://doi.org/10.1186/1687-4722-2013-18

Received: 28 October 2012
Accepted: 13 May 2013
Published: 05 July 2013
DOI: https://doi.org/10.1186/1687-4722-2013-18

Keywords

Bayesian sparse learning; Signal reconstruction; Subspace approach; Group sparsity; Nonnegative matrix factorization; Single-channel source separation
