In the following, we consider the bin-wise squared magnitude of a single-channel microphone signal in the STFT domain, represented as a spectrogram denoted by \(\boldsymbol {Y} = \left [\boldsymbol {y}_{1},\dots,\boldsymbol {y}_{L}\right ] \in \mathbb {R}^{F\times L}_{+}\), where F is the number of frequency bins and L is the number of considered time frames.
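For concreteness, the following minimal Python sketch computes such a spectrogram; the STFT parameters (frame length, hop size) are illustrative assumptions and not values prescribed by this paper.

```python
import numpy as np
from scipy.signal import stft

def power_spectrogram(x, fs, n_fft=1024, hop=512):
    """Bin-wise squared magnitude spectrogram Y of shape (F, L)."""
    # scipy's stft returns frequencies, frame times, and complex STFT coefficients
    _, _, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return np.abs(X) ** 2  # squared magnitudes are nonnegative by construction
```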
Motor data descriptions and definitions
The physical state of a robot can be described by motor data, collected by proprioceptors providing angular position information of the joints driven by the motors. In the following, we consider a robot equipped with M proprioceptors, indexed by \(m=1,\dots,M\), each capturing one joint angle. We denote the s-th observed angular position in STFT frame ℓ for proprioceptor m by \(\alpha _{\ell,m}^{(s)}\in \mathbb {R}\). Within frame ℓ, a total of Sℓ motor data samples is observed, i.e., \(s=1,\dots, S_{\ell }\). In this paper, we account for the fact that the motor data is not necessarily synchronized with the audio recording so that, for a fixed observation interval of the audio data, the number of motor data samples may vary, i.e., Sℓ may change with ℓ. This is specifically the case for the NAO robot used for the experiments in this paper.
Depending on the kind of ego-noise, only a subset of proprioceptors is relevant for ego-noise suppression. For example, if only ego-noise caused by arm movements is present, only motor data of the arm joints are required. In the following, we denote the index set of relevant proprioceptors for these joints by \(\mathcal {M}\).
From proprioceptor data collected for proprioceptor m, the instantaneous angular velocity can be estimated by
$$\begin{array}{*{20}l} \dot{\alpha}^{(s)}_{\ell,m} = \frac{\alpha^{(s)}_{\ell,m}-\alpha^{(s-1)}_{\ell,m}}{\Delta T_{\ell}^{(s)}},~~~~ m\in\mathcal{M}, \end{array} $$
(1)
where \(\Delta T_{\ell }^{(s)}\) denotes the time difference between the adjacent observations \(\alpha ^{(s)}_{\ell,m}\) and \(\alpha ^{(s-1)}_{\ell,m}\). Note that for s=1, \(\alpha ^{(s-1)}_{\ell,m}\) is chosen to be the last angular sample of the previous frame ℓ−1. Analogously, the angular acceleration \(\ddot {\alpha }^{(s)}_{\ell,m}\) can be computed from successive angular velocity estimates \(\dot {\alpha }^{(s)}_{\ell,m}\) and \(\dot {\alpha }^{(s-1)}_{\ell,m}\).
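A minimal sketch of Eq. 1 under the assumption that the angles and their timestamps are available as per-frame arrays; the function name and data layout are illustrative.

```python
import numpy as np

def finite_difference(values, times, prev_value, prev_time):
    """Finite-difference derivative estimates per Eq. 1.

    values, times : the S_l samples of frame l and their timestamps
    prev_value, prev_time : last sample/timestamp of frame l-1 (used for s=1)
    """
    v = np.concatenate(([prev_value], values))
    t = np.concatenate(([prev_time], times))
    return np.diff(v) / np.diff(t)  # one estimate per sample s = 1, ..., S_l
```

Applying the same function to the resulting velocities yields the acceleration estimates \(\ddot {\alpha }^{(s)}_{\ell,m}\).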
To associate each spectrogram frame yℓ with a single motor data sample, we propose first to compute the arithmetic average of all Sℓ angular positions in STFT frame ℓ
$$\begin{array}{*{20}l} \bar{\alpha}_{\ell,m}=\frac{1}{S_{\ell}}\sum_{s=1}^{S_{\ell}} \alpha^{(s)}_{\ell,m},~~~ m\in\mathcal{M}. \end{array} $$
(2)
We proceed analogously for angular velocity and acceleration and obtain \(\bar {\dot {\alpha }}_{\ell,m}, \bar {\ddot {\alpha }}_{\ell,m}\), respectively. We then concatenate the averaged angular data for all considered proprioceptors in a feature vector
$$\begin{array}{*{20}l} \bar{\boldsymbol{\alpha}}_{\ell} = \left[\bar{\alpha}_{\ell,1},\dots,\bar{\alpha}_{\ell,m},\bar{\dot{\alpha}}_{\ell,m},\bar{\ddot{\alpha}}_{\ell,m},\dots,\bar{\ddot{\alpha}}_{\ell,M}\right]^{\mathrm{T}}, \end{array} $$
(3)
which we will refer to as the motor data vector for frame ℓ in the following. The left part of Fig. 2 illustrates an example of the described data preprocessing.
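The averaging of Eq. 2 and the stacking of Eq. 3 could be implemented as follows; the assumed input layout (one triple of sample arrays per relevant proprioceptor) is an illustrative choice.

```python
import numpy as np

def motor_data_vector(per_joint_samples):
    """Build the motor data vector for one frame (Eqs. 2 and 3).

    per_joint_samples : list over m in M of (angles, velocities, accelerations),
                        each a 1-D array holding the S_l samples of frame l.
    """
    feats = []
    for angles, velocities, accelerations in per_joint_samples:
        # arithmetic averages per Eq. 2, stacked per proprioceptor per Eq. 3
        feats += [angles.mean(), velocities.mean(), accelerations.mean()]
    return np.asarray(feats)
```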
NMF for ego-noise suppression
In the following, we briefly summarize NMF. We succinctly introduce how semi-supervised NMF can be used for ego-noise suppression and explain the main drawback of the known approach before we introduce the proposed motor data-based regularization.
The objective of NMF is to approximate the nonnegative matrix Y, i.e., a matrix whose elements are all greater than or equal to zero, by a product of two nonnegative matrices D and H
$$\begin{array}{*{20}l} \boldsymbol{Y} \approx\hat{\boldsymbol{Y}}=\boldsymbol{D}\boldsymbol{H}=\left[\boldsymbol{D}\boldsymbol{h}_{1},\dots,\boldsymbol{D}\boldsymbol{h}_{L}\right], \end{array} $$
(4)
where \(\boldsymbol {D}\in \mathbb {R}^{F\times K}_{+}\) is the so-called dictionary of size F×K and \(\boldsymbol {H}=\left [\boldsymbol {h}_{1},\dots,\boldsymbol {h}_{L}\right ]\in \mathbb {R}^{K\times L}_{+}\) is referred to as the activation matrix [8, 21]. This approach can be interpreted as approximating each column of Y by a weighted sum of the columns of D (the so-called atoms or bases), where the weights are given by the corresponding column entries of H. K is referred to as the size of the dictionary and describes the number of atoms in D. Typically, K≪F,L holds, i.e., NMF can be considered a compact representation of the data.
The factorization is achieved by minimizing a cost function which penalizes the dissimilarity between Y and \(\hat {\boldsymbol {Y}}\) as defined by the model parameters D,H. Typically, the cost function is applied element-wise to the matrices Y and \(\hat {\boldsymbol {Y}}\). In this paper, we consider the Euclidean distance between Y and \(\hat {\boldsymbol {Y}}\) as the cost function, yielding the optimization problem
$$ \begin{aligned} &\underset{\boldsymbol{D},\boldsymbol{H}}{\min}~\left\lVert \boldsymbol{Y}-\boldsymbol{D}\boldsymbol{H} \right\rVert_{\mathrm{F}}^{2}\\ & \text{s.t.}~~~~~~ \boldsymbol{D}, \boldsymbol{H} \succeq 0, \end{aligned} $$
(5)
where ∥·∥F denotes the Frobenius norm and D,H≽0 means that all elements of D,H are greater than or equal to zero, ensuring nonnegativity. The optimization problem in Eq. 5 is typically solved using iterative updates alternating between D and H such that the nonnegativity of D,H is implicitly guaranteed if they are initialized with positive values. The update rules can be derived based on, e.g., the Majorization-Minimization principle or heuristic approaches [7, 8].
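For reference, a compact sketch of such multiplicative updates for Eq. 5; the small constant eps, which guards against division by zero, is a common implementation detail and not part of the formulation above.

```python
import numpy as np

def nmf(Y, K, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates for min ||Y - DH||_F^2 s.t. D, H >= 0."""
    rng = np.random.default_rng(seed)
    F, L = Y.shape
    D = rng.random((F, K)) + eps  # positive initialization keeps the
    H = rng.random((K, L)) + eps  # factors nonnegative under the updates
    for _ in range(n_iter):
        H *= (D.T @ Y) / (D.T @ (D @ H) + eps)
        D *= (Y @ H.T) / ((D @ H) @ H.T + eps)
    return D, H
```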
For ego-noise suppression, we apply a semi-supervised, two-stage strategy [21], cf. Section 2.4: first, we use audio data containing ego-noise only and train an ego-noise dictionary. Then, given a mixture of ego-noise and speech, the dictionary elements remain constant and only their activations are estimated. For this, again, the same iterative update rules are used, which have been shown to be sensitive to the additional speech signal. As a consequence, the atom activations are no longer estimated correctly. For improved robustness, we therefore propose to extend this purely audio-based estimation of the activations by also taking the physical state of the robot, measured in terms of motor data, into account. Thus, the estimation of the activations is additionally guided by reference information which is completely unaffected by the speech signal.
Motor data-regularized NMF
The basic idea of our approach is that the activations should be similar if the physical state of the robot is similar. For this, we measure the similarity between the robot states in frames ℓ and j by comparing the motor data vectors \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) and enforce similar activations hℓ and hj if \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) are close. This is achieved by imprinting the intrinsic geometry of the motor data space onto the NMF cost function. Results from spectral graph theory [22, 23] and manifold learning theory [24] have shown that the local geometric structure of given data points can be modeled using an undirected graph. Based on these results, we first introduce a motor data-based graph structure and subsequently summarize how a regularization term, enforcing similar activations for similar motor data, can be derived. We then reformulate the NMF optimization problem Eq. 5 and present corresponding update rules for its minimization.
Motor data graph structure
In the following, we define a graph where the motor data vectors \(\bar {\boldsymbol {\alpha }}_{1},\dots,\bar {\boldsymbol {\alpha }}_{L}\) constitute the nodes. The edges connecting the nodes are assumed to be bidirectional, i.e., we obtain an undirected graph. A part of an exemplary graph is illustrated in Fig. 3. The edge connecting nodes \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) has weight Wℓj=Wjℓ, which should reflect the affinity between the two motor data points. Depending on the considered scenario, numerous measures have been proposed to quantify the affinity between \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) [22], e.g., nearest-neighbor or dot-product weighting. In this paper, we determine the weight Wℓj using a Gaussian kernel
$$\begin{array}{*{20}l} W_{\ell j}= W_{j\ell}= \exp\left(-\frac{\lVert\bar{\boldsymbol{\alpha}}_{\ell}-\bar{\boldsymbol{\alpha}}_{j}\rVert^{2}_{2}}{2\epsilon^{2}}\right) \in (0,1], \end{array} $$
(6)
with scale parameter \(\epsilon \in \mathbb {R}_{+}\). The larger Wℓj, the higher the affinity between the two motor data samples; we obtain Wℓj=1 if \(\bar {\boldsymbol {\alpha }}_{\ell }=\bar {\boldsymbol {\alpha }}_{j}\). Note that the connectivity of the graph can be controlled by adjusting ε, e.g., for larger ε, the neighbors of a node are connected with larger weights. Therefore, ε can be used to control the reach of the local neighborhood of a node. Based on the affinity weights, we define the affinity matrix \(\boldsymbol{W}=\boldsymbol{W}^{\mathrm{T}} \in (0,1]^{L\times L}\) with [W]ℓj=Wℓj. Furthermore, we introduce the diagonal matrix Z of size L×L with \(Z_{\ell \ell }=\sum _{j}^{}W_{\ell j}=\sum _{j}^{}W_{j\ell }\) and zeros elsewhere.
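A sketch of the construction of W and Z per Eq. 6, assuming the motor data vectors are stacked as rows of a matrix:

```python
import numpy as np

def affinity_and_degree(A, epsilon):
    """Gaussian-kernel affinity matrix W (Eq. 6) and degree matrix Z.

    A : array of shape (L, P), row l holding the motor data vector of frame l
    """
    sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * epsilon ** 2))  # symmetric, entries in (0, 1]
    Z = np.diag(W.sum(axis=1))                    # Z_ll = sum_j W_lj
    return W, Z
```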
Motor data-based regularization term
The derivation of the regularization term is based on results from [24, 25]. It is assumed that the considered motor data lie on a Riemannian manifold \(\mathcal {A}\). We are looking for a mapping \(f:\mathcal {A}\rightarrow \mathbb {R}\), which can be interpreted as a mapping from the manifold to a line. f should preserve the local geometry of the manifold, i.e., close points on the manifold should be mapped to close points on the line. This implies that f is allowed to vary only smoothly for similar arguments. Appropriate mappings f can be obtained by an optimization on the manifold which can be discretely approximated on the motor data graph by searching for an f which minimizes
$$\begin{array}{*{20}l} \frac{1}{2} \sum_{\ell=1}^{L}\sum_{j=1}^{L}\left(f(\bar{\boldsymbol{\alpha}}_{\ell})-f(\bar{\boldsymbol{\alpha}}_{j})\right)^{2} W_{\ell j}, \end{array} $$
(7)
where f is a function of the nodes of the graph [24, 25].
To exploit the geometric information of the motor data manifold for the estimation of the activation vectors, we manipulate Eq. 7 and replace the abstract mapping f by the activation of atom k
$$\begin{array}{*{20}l} \mathcal{R}_{k} &= \frac{1}{2} \sum_{\ell=1}^{L}\sum_{j=1}^{L}\left(h_{k\ell}-h_{kj}\right)^{2} W_{\ell j}, \end{array} $$
(8)
where hkℓ denotes the ℓ-th element of \(\boldsymbol{h}_{k}\), the k-th row of H written as a column vector, i.e., hkℓ is the activation of atom k in time frame ℓ. The regularization term \(\mathcal {R}_{k}\) needs to be minimized jointly with Eq. 5 with respect to the activations for every atom k, cf. Section 2.3.3. Note that the motor data-based regularization \(\mathcal {R}_{k}\) implicitly also influences the structure of the dictionary elements since the optimized activations directly affect the update of D.
Note that in Eq. 8, the affinities Wℓj can be interpreted as weighting parameters: if two motor data vectors \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) are similar, Wℓj is close to one according to Eq. 6 and the minimization of Eq. 8 enforces similar hkℓ and hkj. Using the parameters defined in Section 2.3.1, Eq. 8 can be directly related to the so-called graph Laplacian L=Z−W [22]
$$ \begin{aligned} \mathcal{R}_{k} &=\boldsymbol{h}_{k}^{\mathrm{T}}\mathbf{Z}\boldsymbol{h}_{k} - \boldsymbol{h}_{k}^{\mathrm{T}}\mathbf{W}\boldsymbol{h}_{k}\\ &=\boldsymbol{h}_{k}^{\mathrm{T}}\mathbf{L}\boldsymbol{h}_{k}. \end{aligned} $$
(9)
Summing over all atoms results in the final regularization term
$$\begin{array}{*{20}l} \mathcal{R}=\sum_{k=1}^{K}\mathcal{R}_{k}=\text{tr}\left(\boldsymbol{H}\boldsymbol{L}\boldsymbol{H}^{\mathrm{T}}\right), \end{array} $$
(10)
where tr(·) denotes the trace operator.
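The equivalence of the pairwise form (Eq. 8, summed over all atoms) and the trace form (Eq. 10) is easily verified numerically; the following sketch uses random data purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L_frames = 3, 6
H = rng.random((K, L_frames))          # activations; row k is h_k^T
W = rng.random((L_frames, L_frames))
W = 0.5 * (W + W.T)                    # symmetric affinity matrix
Z = np.diag(W.sum(axis=1))             # degree matrix
Lap = Z - W                            # graph Laplacian L = Z - W

# Eq. 8 summed over atoms k
R_pair = 0.5 * sum(((H[k][:, None] - H[k][None, :]) ** 2 * W).sum()
                   for k in range(K))
# Eq. 10
R_trace = np.trace(H @ Lap @ H.T)
assert np.allclose(R_pair, R_trace)
```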
Motor data-regularized NMF
The derived regularization term Eq. 10 can be directly included into Eq. 5. We obtain the modified optimization problem
$$ \begin{aligned} &\underset{\boldsymbol{D},\boldsymbol{H}}{\min}~\left\lVert \boldsymbol{Y}-\boldsymbol{D}\boldsymbol{H} \right\rVert_{\mathrm{F}}^{2}+\lambda\,\text{tr}\left(\boldsymbol{H}\boldsymbol{L}\boldsymbol{H}^{\mathrm{T}}\right)\\&\text{s.t.}~~~~~~ \boldsymbol{D}, \boldsymbol{H} \succeq 0, \end{aligned} $$
(11)
where λ≥0 controls the influence of the motor data-based regularization.
For minimization, we form the partial derivatives of Eq. 11 with respect to D and H and obtain the iterative update rules [19, 20]
$$\begin{array}{*{20}l} \left[\boldsymbol{D}\right]_{fk}&\leftarrow\left[\boldsymbol{D}\right]_{fk}\cdot\frac{\left[\boldsymbol{Y}\boldsymbol{H}^{\mathrm{T}}\right]_{fk}}{\left[\hat{\boldsymbol{Y}}\boldsymbol{H}^{\mathrm{T}}\right]_{fk}}, \end{array} $$
(12)
$$\begin{array}{*{20}l} \left[\boldsymbol{H}\right]_{k\ell}&\leftarrow\left[\boldsymbol{H}\right]_{k\ell}\cdot\frac{\left[\boldsymbol{D}^{\mathrm{T}}\boldsymbol{Y}+\lambda\boldsymbol{H}\boldsymbol{W}\right]_{k\ell}}{\left[\boldsymbol{D}^{\mathrm{T}}\hat{\boldsymbol{Y}}+\lambda\boldsymbol{H}\boldsymbol{Z}\right]_{k\ell}}, \end{array} $$
(13)
where [D]fk denotes the (f,k)-th element of D. As for conventional NMF, the iterative updates can be stopped, e.g., after a fixed number of iterations. In this paper, we additionally compute the cost according to Eq. 11 in each iteration and terminate the updates Eqs. 12 and 13 upon convergence.
Eqs. 12 and 13 reduce to the conventional update rules for NMF if λ=0 [8]. Note that since the proposed method aims at enforcing similar activations for close motor data vectors, the regularization has an effect on the update rule for H only, while the update for D is unaffected.
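A sketch of one iteration of Eqs. 12 and 13; the eps safeguard against division by zero and the update order (H before D) are implementation assumptions.

```python
import numpy as np

def mr_nmf_step(Y, D, H, W, Z, lam, eps=1e-12, update_D=True):
    """One motor data-regularized update of H (Eq. 13) and D (Eq. 12)."""
    Y_hat = D @ H
    H *= (D.T @ Y + lam * (H @ W)) / (D.T @ Y_hat + lam * (H @ Z) + eps)
    if update_D:  # D is kept fixed during the suppression stage
        Y_hat = D @ H
        D *= (Y @ H.T) / (Y_hat @ H.T + eps)
    return D, H
```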
Proposed algorithm for ego-noise suppression
As mentioned in Section 2.2, we apply a semi-supervised, two-stage strategy for ego-noise suppression [21]. We first employ audio data containing ego-noise only and train D, imprinting the intrinsic geometry of the motor data space onto the model using the proposed regularization. Given a mixture of ego-noise and speech, we then use D to model and suppress the current ego-noise and to obtain a speech estimate. In the following, we describe the proposed algorithm for ego-noise suppression in detail, cf. Fig. 4 for an overview.
- Learning D: As input, spectrograms \(\boldsymbol {Y}=\big [\boldsymbol {y}_{1},\dots,\boldsymbol {y}_{L}\big ]\) containing ego-noise only are given. For each spectrogram frame yℓ, a motor data vector \(\bar {\boldsymbol {\alpha }}_{\ell }\) is computed. The vectors \(\bar {\boldsymbol {\alpha }}_{\ell }, \ell =1,\dots,L,\) are used to construct the affinity matrix W and the degree matrix Z. Subsequently, the update rules Eqs. 12 and 13 are used to compute the dictionary D, where the introduced regularization term is weighted by λT.
- Ego-noise suppression: Another dictionary DS of size KS with corresponding activation matrix HS is initialized to model the additional speech signal in the considered mixture Y; a code sketch of this stage is given at the end of this section. Analogously to the preceding learning step, W and Z are constructed from the new motor data vectors, which possibly represent different movements. Using the same update rules as before, DS, H, and HS are updated while D remains constant. The motor data-based regularization term is weighted by λE. Note that for optimizing the activations of the speech model HS, we set λE=0 since the motor data-based regularization should affect only the estimation of the ego-noise activations. After identifying the optimal model parameters DS, H, and HS, we apply a spectral enhancement filter to obtain an estimate of the desired speech signal, \(\big [\hat {\boldsymbol {Y}}_{\mathrm {S}}\big ]_{f\ell }=\big [\boldsymbol {F}\big ]_{f\ell }\cdot \big [\boldsymbol {Y}\big ]_{f\ell }\) for the fℓ-th bin, where the enhancement filter is given by
$$\begin{array}{*{20}l} \big[\boldsymbol{F}\big]_{f\ell}=\frac{\big[\boldsymbol{D}_{\mathrm{S}}\boldsymbol{H}_{\mathrm{S}}\big]_{f\ell}}{\big[\boldsymbol{D}\boldsymbol{H}\big]_{f\ell}+\big[\boldsymbol{D}_{\mathrm{S}}\boldsymbol{H}_{\mathrm{S}}\big]_{f\ell}}. \end{array} $$
(14)
Note that typically λE≠λT holds, i.e., the regularization terms in the two stages have different weights. This is further detailed in the following section.
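A sketch of the complete suppression stage under the assumptions stated above; initialization, update scheduling, and the eps safeguard are illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def suppress_ego_noise(Y, D, D_s, H, H_s, W, Z, lam_E, n_iter=100, eps=1e-12):
    """Suppression stage: D fixed, H/H_s/D_s adapted, filter of Eq. 14 applied."""
    for _ in range(n_iter):
        Y_hat = D @ H + D_s @ H_s
        # ego-noise activations with motor data regularization (weight lam_E)
        H *= (D.T @ Y + lam_E * (H @ W)) / (D.T @ Y_hat + lam_E * (H @ Z) + eps)
        Y_hat = D @ H + D_s @ H_s
        H_s *= (D_s.T @ Y) / (D_s.T @ Y_hat + eps)  # speech: lam_E = 0
        Y_hat = D @ H + D_s @ H_s
        D_s *= (Y @ H_s.T) / (Y_hat @ H_s.T + eps)  # ego-noise dictionary D fixed
    # spectral enhancement filter, Eq. 14, applied bin-wise
    F_filt = (D_s @ H_s) / (D @ H + D_s @ H_s + eps)
    return F_filt * Y  # estimate of the speech spectrogram
```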