In the following, we consider the binwise squared magnitude of a single-channel microphone signal in the STFT domain, represented in spectrograms denoted as \(\boldsymbol {Y} = \left [\boldsymbol {y}_{1},\dots,\boldsymbol {y}_{L}\right ] \in \mathbb {R}^{F\times L}_{+}\), where F is the number of frequency bins and L is the number of considered time frames.
Motor data descriptions and definitions
The physical state of a robot can be described by motor data, collected by proprioceptors which provide the angular positions of the joints driven by the motors. In the following, we consider a robot equipped with M proprioceptors, indexed by \(m=1,\dots,M\), each capturing one joint angle. We denote the sth observed angular position in STFT frame ℓ for proprioceptor m by \(\alpha _{\ell,m}^{(s)}\in \mathbb {R}\). Within frame ℓ, a total number of S_{ℓ} motor data samples is observed, i.e., \(s=1,\dots, S_{\ell }\). In this paper, we account for the fact that the motor data is not necessarily synchronized with the audio recording, so that for a fixed observation interval of the audio data, the number of motor data samples may vary, i.e., S_{ℓ} may change with ℓ. This is specifically the case for the NAO robot used for the experiments in this paper.
Depending on the kind of ego-noise, only a subset of proprioceptors is relevant for ego-noise suppression. For example, if only ego-noise caused by arm movements is present, only motor data of the arm joints are required. In the following, we denote the index set of relevant proprioceptors for these joints by \(\mathcal {M}\).
From proprioceptor data collected for proprioceptor m, the instantaneous angular velocity can be estimated by
$$\begin{array}{*{20}l} \dot{\alpha}^{(s)}_{\ell,m} = \frac{\alpha^{(s)}_{\ell,m}-\alpha^{(s-1)}_{\ell,m}}{\Delta T_{\ell}^{(s)}},~~~~ m\in\mathcal{M}, \end{array} $$
(1)
where \(\Delta T_{\ell }^{(s)}\) denotes the time difference between the adjacent observations \(\alpha ^{(s)}_{\ell,m}\) and \(\alpha ^{(s-1)}_{\ell,m}\). Note that for s=1, \(\alpha ^{(s-1)}_{\ell,m}\) is chosen to be the last angular sample of the previous frame ℓ−1. Analogously, the angular acceleration \(\ddot {\alpha }^{(s)}_{\ell,m}\) can be computed from successive angular velocity estimates \(\dot {\alpha }^{(s)}_{\ell,m}\) and \(\dot {\alpha }^{(s-1)}_{\ell,m}\).
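As an illustration, the finite-difference estimate of Eq. 1 can be sketched as follows; the function name and array layout are illustrative assumptions, not part of the paper:

```python
import numpy as np

def angular_velocity(alpha, alpha_prev_last, dt):
    """Finite-difference velocity estimate per Eq. 1 for one proprioceptor m.

    alpha:           angular positions alpha_{ell,m}^{(s)} of frame ell, shape (S_ell,)
    alpha_prev_last: last angular sample of the previous frame ell-1 (used for s=1)
    dt:              time differences Delta T_ell^{(s)}, shape (S_ell,)
    """
    padded = np.concatenate(([alpha_prev_last], alpha))
    return np.diff(padded) / dt  # (alpha^{(s)} - alpha^{(s-1)}) / Delta T^{(s)}
```

Applying the same function to successive velocity estimates yields the accelerations.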
To associate each spectrogram frame y_{ℓ} with a single motor data sample, we propose first to compute the arithmetic average of all S_{ℓ} angular positions in STFT frame ℓ
$$\begin{array}{*{20}l} \bar{\alpha}_{\ell,m}=\frac{1}{S_{\ell}}\sum_{s=1}^{S_{\ell}} \alpha^{(s)}_{\ell,m},~~~ m\in\mathcal{M}. \end{array} $$
(2)
We proceed analogously for angular velocity and acceleration and obtain \(\bar {\dot {\alpha }}_{\ell,m}, \bar {\ddot {\alpha }}_{\ell,m}\), respectively. We then concatenate the averaged angular data for all considered proprioceptors in a feature vector
$$\begin{array}{*{20}l} \bar{\boldsymbol{\alpha}}_{\ell} = \left[\bar{\alpha}_{\ell,1},\dots,\bar{\alpha}_{\ell,m},\bar{\dot{\alpha}}_{\ell,m},\bar{\ddot{\alpha}}_{\ell,m},\dots,\bar{\ddot{\alpha}}_{\ell,M}\right]^{\mathrm{T}}, \end{array} $$
(3)
which we will refer to as the motor data vector for frame ℓ in the following. The left part of Fig. 2 illustrates the described preprocessing of the data with an example.
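Eqs. 2 and 3 amount to a per-frame average followed by a concatenation; a minimal sketch, where the assumed data layout (one array per proprioceptor) is for illustration only:

```python
import numpy as np

def motor_data_vector(samples):
    """Build the motor data vector for one frame (Eqs. 2 and 3).

    samples: list over the proprioceptors m in M; each entry is a (3, S_ell)
             array holding the angle, velocity, and acceleration samples of
             frame ell for that proprioceptor.
    Returns the concatenated vector of per-quantity averages, length 3*|M|.
    """
    return np.concatenate([x.mean(axis=1) for x in samples])
```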
NMF for ego-noise suppression
In the following, we briefly summarize NMF. We succinctly introduce how semi-supervised NMF can be used for ego-noise suppression and explain the main drawback of the known approach before we introduce the proposed motor data-based regularization.
The objective of NMF is to approximate the nonnegative matrix Y, i.e., a matrix whose elements are all greater than or equal to zero, by a product of two nonnegative matrices D and H
$$\begin{array}{*{20}l} \boldsymbol{Y} \approx\hat{\boldsymbol{Y}}=\boldsymbol{D}\boldsymbol{H}=\left[\boldsymbol{D}\boldsymbol{h}_{1},\dots,\boldsymbol{D}\boldsymbol{h}_{L}\right], \end{array} $$
(4)
where \(\boldsymbol {D}\in \mathbb {R}^{F\times K}_{+}\) is the so-called dictionary of size F×K and \(\boldsymbol {H}=\left [\boldsymbol {h}_{1},\dots,\boldsymbol {h}_{L}\right ]\in \mathbb {R}^{K\times L}_{+}\) is referred to as activation matrix [8, 21]. This approach can be interpreted as approximating each column of Y by a weighted sum of columns of D (the so-called atoms or bases), where the weights are given by the corresponding column entries of H. K is referred to as the size of the dictionary and describes the number of atoms in D. Typically, K≪F,L holds, i.e., NMF can be considered a compact representation of the data.
The factorization is achieved by minimizing a cost function which penalizes the dissimilarity between Y and \(\hat {\boldsymbol {Y}}\), the latter being defined by the model parameters D,H. Typically, the cost function is applied element-wise to the entries of the matrices Y and \(\hat {\boldsymbol {Y}}\). In this paper, we consider the Euclidean distance between Y and \(\hat {\boldsymbol {Y}}\) as cost function, yielding the optimization problem
$$ \begin{aligned} &\underset{\boldsymbol{D},\boldsymbol{H}}{\min}~\left\lVert \boldsymbol{Y}-\boldsymbol{D}\boldsymbol{H} \right\rVert_{\mathrm{F}}^{2}\\ & \text{s.t.}~~~~~~ \boldsymbol{D}, \boldsymbol{H} \succeq 0, \end{aligned} $$
(5)
where ∥·∥_{F} denotes the Frobenius norm and D,H≽0 means that all elements of D,H are greater than or equal to zero, ensuring nonnegativity. The optimization problem in Eq. 5 is typically solved using iterative updates alternating between D and H, such that the nonnegativity of D,H is implicitly guaranteed if they are initialized with positive values. The update rules can be derived based on, e.g., the Majorization-Minimization principle or heuristic approaches [7, 8].
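For reference, the conventional multiplicative updates solving Eq. 5 (the λ=0 special case of the later Eqs. 12 and 13) can be sketched as follows; the small constant added to the denominators is a common numerical safeguard, not part of the derivation:

```python
import numpy as np

def nmf_euclidean(Y, K, n_iter=200, seed=0):
    """Multiplicative updates minimizing ||Y - DH||_F^2 subject to D, H >= 0.

    Positive initialization keeps D and H nonnegative throughout the updates.
    """
    rng = np.random.default_rng(seed)
    F, L = Y.shape
    D = rng.random((F, K)) + 1e-3
    H = rng.random((K, L)) + 1e-3
    eps = 1e-12  # avoids division by zero
    for _ in range(n_iter):
        H *= (D.T @ Y) / (D.T @ D @ H + eps)
        D *= (Y @ H.T) / (D @ H @ H.T + eps)
    return D, H
```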
For ego-noise suppression, we apply a semi-supervised, two-stage strategy [21], cf. Section 2.4: first, we use audio data containing ego-noise only and train an ego-noise dictionary. Then, given a mixture of ego-noise and speech, these dictionary elements remain constant and only their activations are estimated. For this, again, the same iterative update rules are used, which have been shown to be sensitive to the additional speech signal. As a consequence, the atom activations are no longer estimated correctly. For improved robustness, we therefore propose to extend this audio-only estimation of the activations by also taking the physical state of the robot, measured in terms of motor data, into account. Thus, the estimation of the activations is additionally guided by reference information which is completely unaffected by the speech signal.
Motor data-regularized NMF
The basic idea of our approach is that activations should be similar if the physical state of the robot is similar. For this, we measure the similarity between the robot states in frames ℓ and j by comparing the motor data vectors \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) and enforce similar activations h_{ℓ} and h_{j} if \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) are close. This is achieved by imprinting the intrinsic geometry of the motor data space onto the NMF cost function. Results from spectral graph theory [22, 23] and manifold learning theory [24] have shown that the local geometric structure of given data points can be modeled using an undirected graph. Based on these results, we first introduce a motor data-based graph structure and subsequently summarize how a regularization term, enforcing similar activations for similar motor data, can be derived. We then reformulate the NMF optimization problem Eq. 5 and present corresponding update rules for its minimization.
Motor data graph structure
In the following, we define a graph where the motor data vectors \(\bar {\boldsymbol {\alpha }}_{1},\dots,\bar {\boldsymbol {\alpha }}_{L}\) constitute the nodes. The edges connecting the nodes are assumed to be bidirectional, i.e., we obtain an undirected graph. A part of an exemplary graph is illustrated in Fig. 3. The edge which connects nodes \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) has weight W_{ℓj}=W_{jℓ} and should reflect the affinity between the two motor data points. Depending on the considered scenario, numerous measures have been proposed to quantify the affinity between \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) [22], e.g., nearest-neighbor or dot-product weighting. In this paper, we determine the weight W_{ℓj} using a Gaussian kernel
$$\begin{array}{*{20}l} W_{\ell j}= W_{j\ell}= \exp\left(-\frac{\lVert\bar{\boldsymbol{\alpha}}_{\ell}-\bar{\boldsymbol{\alpha}}_{j}\rVert^{2}_{2}}{2\epsilon^{2}}\right) \in (0,1], \end{array} $$
(6)
with scale parameter \(\epsilon \in \mathbb {R}_{+}\). The larger W_{ℓj}, the higher the affinity between the two motor data samples; we obtain W_{ℓj}=1 if \(\bar {\boldsymbol {\alpha }}_{\ell }=\bar {\boldsymbol {\alpha }}_{j}\). Note that by adjusting ε, the connectivity of the graph can be controlled, e.g., for larger ε, the neighbors of a node are connected with larger weights. Therefore, ε can be used to control the reach of the local neighborhood of a node. Based on the affinity weights, we define the affinity matrix W=W^{T} ∈[0,1]^{L×L} with [W]_{ℓj}=W_{ℓj}. Furthermore, we introduce the diagonal matrix Z of size L×L with \(Z_{\ell \ell }=\sum _{j}^{}W_{\ell j}=\sum _{j}^{}W_{j\ell }\) and zeros elsewhere.
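The construction of the affinity matrix W (Eq. 6) and the degree matrix Z can be sketched as follows; the function name and the column-wise data layout are illustrative assumptions:

```python
import numpy as np

def affinity_and_degree(A, eps_scale):
    """Gaussian-kernel affinity matrix W (Eq. 6) and diagonal degree matrix Z.

    A:         motor data vectors as columns, shape (dim, L)
    eps_scale: scale parameter epsilon controlling the neighborhood reach
    """
    # squared Euclidean distances between all pairs of columns
    d2 = ((A[:, :, None] - A[:, None, :]) ** 2).sum(axis=0)
    W = np.exp(-d2 / (2.0 * eps_scale ** 2))  # symmetric, entries in (0, 1]
    Z = np.diag(W.sum(axis=1))                # Z_ll = sum_j W_lj
    return W, Z
```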
Motor data-based regularization term
The derivation of the regularization term is based on results from [24, 25]. It is assumed that the considered motor data lie on a Riemannian manifold \(\mathcal {A}\). We are looking for a mapping \(f:\mathcal {A}\rightarrow \mathbb {R}\), which can be interpreted as a mapping from the manifold to a line. f should preserve the local geometry of the manifold, i.e., close points on the manifold should be mapped to close points on the line. This implies that f is allowed to vary only smoothly for similar arguments. Appropriate mappings f can be obtained by an optimization on the manifold which can be discretely approximated on the motor data graph by searching for an f which minimizes
$$\begin{array}{*{20}l} \frac{1}{2} \sum_{\ell=1}^{L}\sum_{j=1}^{L}\left(f(\bar{\boldsymbol{\alpha}}_{\ell})-f(\bar{\boldsymbol{\alpha}}_{j})\right)^{2} W_{\ell j}, \end{array} $$
(7)
where f is a function of the nodes of the graph [24, 25].
To exploit the geometric information of the motor data manifold for the estimation of the activation vectors, we manipulate Eq. 7 and replace the abstract mapping f by the activation of atom k
$$\begin{array}{*{20}l} \mathcal{R}_{k} &= \frac{1}{2} \sum_{\ell=1}^{L}\sum_{j=1}^{L}\left(h_{k\ell}-h_{kj}\right)^{2} W_{\ell j}, \end{array} $$
(8)
where h_{kℓ} denotes the ℓth element of the kth row h_{k} of H, i.e., h_{kℓ} is the scaling of atom k in time frame ℓ. The regularization term \(\mathcal {R}_{k}\) needs to be minimized jointly with Eq. 5 with respect to the activations for every atom k, cf. Section 2.3.3. Note that the motor data-based regularization \(\mathcal {R}_{k}\) implicitly also influences the structure of the dictionary elements since the optimized activations directly affect the update of D.
Note that in Eq. 8, the affinities W_{ℓj} can be interpreted as weighting parameters: if two motor data vectors \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) are similar, W_{ℓj} is close to one according to Eq. 6 and the minimization of Eq. 8 enforces similar h_{kℓ} and h_{kj}. Using the parameters defined in Section 2.3.1, Eq. 8 can be directly related to the so-called graph Laplacian L=Z−W [22]
$$ \begin{aligned} \mathcal{R}_{k} &=\boldsymbol{h}_{k}^{\mathrm{T}}\mathbf{Z}\boldsymbol{h}_{k} - \boldsymbol{h}_{k}^{\mathrm{T}}\mathbf{W}\boldsymbol{h}_{k}\\ &=\boldsymbol{h}_{k}^{\mathrm{T}}\mathbf{L}\boldsymbol{h}_{k}. \end{aligned} $$
(9)
Summing over all atoms results in the final regularization term
$$\begin{array}{*{20}l} \mathcal{R}=\sum_{k=1}^{K}\mathcal{R}_{k}=\text{tr}\left(\boldsymbol{H}\boldsymbol{L}\boldsymbol{H}^{\mathrm{T}}\right), \end{array} $$
(10)
where tr(·) denotes the trace operator.
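The equivalence of the double-sum form of Eq. 8, summed over all atoms, and the trace form can be verified numerically; with h_k denoting the rows of the K×L matrix H, the sum equals tr(HLHᵀ). A small NumPy check with random data:

```python
import numpy as np

# Numerical check: sum of Eq. 8 over all atoms equals the trace form (Eq. 10).
rng = np.random.default_rng(1)
K, L_frames = 3, 6
H = rng.random((K, L_frames))
W = rng.random((L_frames, L_frames))
W = 0.5 * (W + W.T)               # symmetric affinities
Z = np.diag(W.sum(axis=1))        # degree matrix
Lap = Z - W                       # graph Laplacian L = Z - W

R_sum = 0.5 * sum(
    (H[k, l] - H[k, j]) ** 2 * W[l, j]
    for k in range(K) for l in range(L_frames) for j in range(L_frames)
)
R_tr = np.trace(H @ Lap @ H.T)    # trace form; rows of H are the h_k
assert np.isclose(R_sum, R_tr)
```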
Motor data-regularized NMF
The derived regularization term Eq. 10 can be directly included in Eq. 5. We obtain the modified optimization problem
$$ \begin{aligned} &\underset{\boldsymbol{D},\boldsymbol{H}}{\min}~\left\lVert \boldsymbol{Y}-\boldsymbol{D}\boldsymbol{H} \right\rVert_{\mathrm{F}}^{2}+\lambda\text{tr}\left(\boldsymbol{H}\boldsymbol{L}\boldsymbol{H}^{\mathrm{T}}\right)\\&\text{s.t.}~~~~~~ \boldsymbol{D}, \boldsymbol{H} \succeq 0, \end{aligned} $$
(11)
where λ≥0 controls the influence of the motor data-based regularization.
For the minimization, we form the partial derivatives of the cost function in Eq. 11 with respect to D and H and obtain the iterative update rules [19, 20]
$$\begin{array}{*{20}l} \left[\boldsymbol{D}\right]_{fk}&\leftarrow\left[\boldsymbol{D}\right]_{fk}\cdot\frac{\left[\boldsymbol{Y}\boldsymbol{H}^{\mathrm{T}}\right]_{fk}}{\left[\hat{\boldsymbol{Y}}\boldsymbol{H}^{\mathrm{T}}\right]_{fk}}, \end{array} $$
(12)
$$\begin{array}{*{20}l} \left[\boldsymbol{H}\right]_{k\ell}&\leftarrow\left[\boldsymbol{H}\right]_{k\ell}\cdot\frac{\left[\boldsymbol{D}^{\mathrm{T}}\boldsymbol{Y}+\lambda\boldsymbol{H}\boldsymbol{W}\right]_{k\ell}}{\left[\boldsymbol{D}^{\mathrm{T}}\hat{\boldsymbol{Y}}+\lambda\boldsymbol{H}\boldsymbol{Z}\right]_{k\ell}}, \end{array} $$
(13)
where [D]_{fk} denotes the element of D in row f and column k. Similar to conventional NMF, the iterative updates can be stopped, e.g., after a fixed number of iterations. In this paper, we additionally compute the cost according to Eq. 11 in each iteration and terminate the updates of Eqs. 12 and 13 after convergence.
Eqs. 12 and 13 reduce to the conventional update rules for NMF if λ=0 [8]. Note that since the proposed method aims at enforcing similar activations for close motor data vectors, the regularization has an effect on the update rule for H only, while the update for D is unaffected.
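The regularized updates of Eqs. 12 and 13 can be sketched as follows; the cost helper and the small constant eps added to the denominators are illustrative additions, not part of the derivation:

```python
import numpy as np

def update_D(Y, D, H, eps=1e-12):
    """Multiplicative dictionary update per Eq. 12 (unaffected by lambda)."""
    return D * (Y @ H.T) / (D @ H @ H.T + eps)

def update_H(Y, D, H, W, Z, lam, eps=1e-12):
    """Multiplicative activation update per Eq. 13 with regularization weight lam."""
    Y_hat = D @ H
    return H * (D.T @ Y + lam * (H @ W)) / (D.T @ Y_hat + lam * (H @ Z) + eps)

def cost(Y, D, H, Lap, lam):
    """Regularized cost of Eq. 11, with Lap the graph Laplacian Z - W."""
    return np.linalg.norm(Y - D @ H, 'fro') ** 2 + lam * np.trace(H @ Lap @ H.T)
```

With λ=0, update_H reduces to the conventional activation update, as noted above.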
Proposed algorithm for ego-noise suppression
As mentioned in Section 2.2, we apply a semi-supervised, two-stage strategy for ego-noise suppression [21]. We first employ audio data containing ego-noise only and train D, imprinting the intrinsic geometry of the motor data space onto the model using the proposed regularization. Given a mixture of ego-noise and speech, we then use D to model and suppress the current ego-noise and to obtain a speech estimate. In the following, we describe the proposed algorithm for ego-noise suppression in detail, cf. Fig. 4 for an overview.

Learning D: As input, spectrograms \(\boldsymbol {Y}=\big [\boldsymbol {y}_{1},\dots,\boldsymbol {y}_{L}\big ]\) containing ego-noise only are given. Per spectrogram frame y_{ℓ}, a motor data vector \(\bar {\boldsymbol {\alpha }}_{\ell }\) is computed. The vectors \(\bar {\boldsymbol {\alpha }}_{\ell }, \ell =1,\dots,L\), are used to construct the affinity matrix W and the degree matrix Z. Subsequently, the update rules Eqs. 12 and 13 are used to compute the dictionary D, where the introduced regularization term is weighted by λ_{T}.

Ego-noise suppression: Another dictionary D_{S} of size K_{S} and a corresponding activation matrix H_{S} are initialized to model the additional speech signal in the considered mixture Y. Analogously to the learning step before, W and Z are constructed from the new motor data vectors, possibly representing different movements. Using the same update rules as before, D_{S}, H and H_{S} are updated while D remains constant. The motor data-based regularization term is weighted by λ_{E}. Note that for optimizing the activations of the speech model H_{S}, we set λ_{E}=0 since the motor data-based regularization should affect only the estimation of the ego-noise activations. After identifying the optimum model parameters captured by D_{S}, H and H_{S}, we use a spectral enhancement filter to obtain an estimate of the desired speech signal, \(\big [\hat {\boldsymbol {Y}}_{\mathrm {S}}\big ]_{f\ell }=\big [\boldsymbol {F}\big ]_{f\ell }\cdot \big [\boldsymbol {Y}\big ]_{f\ell }\) for the fℓth bin, where the enhancement filter is given by
$$\begin{array}{*{20}l} \big[\boldsymbol{F}\big]_{f\ell}=\frac{\big[\boldsymbol{D}_{\mathrm{S}}\boldsymbol{H}_{\mathrm{S}}\big]_{f\ell}}{\big[\boldsymbol{D}\boldsymbol{H}\big]_{f\ell}+\big[\boldsymbol{D}_{\mathrm{S}}\boldsymbol{H}_{\mathrm{S}}\big]_{f\ell}}. \end{array} $$
(14)
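The enhancement filter of Eq. 14 and its bin-wise application to the noisy spectrogram can be sketched as follows; the function names and the small constant eps guarding against division by zero are illustrative assumptions:

```python
import numpy as np

def enhancement_filter(D, H, D_s, H_s, eps=1e-12):
    """Element-wise gain per Eq. 14: speech model over total (noise + speech) model."""
    speech = D_s @ H_s   # speech spectrogram model
    noise = D @ H        # ego-noise spectrogram model
    return speech / (noise + speech + eps)

def suppress(Y, D, H, D_s, H_s):
    """Apply the gain bin-wise to the noisy spectrogram Y (speech estimate)."""
    return enhancement_filter(D, H, D_s, H_s) * Y
```

As a sanity check, a zero ego-noise model yields a gain of (approximately) one, and a zero speech model yields a gain of zero.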
Note that typically λ_{E}≠λ_{T} holds, i.e., the regularization terms in the two steps have different weights. This is further detailed in the following section.