In the following, we consider the bin-wise squared magnitude of a single-channel microphone signal in the STFT domain, represented as a spectrogram denoted by \(\boldsymbol {Y} = \left [\boldsymbol {y}_{1},\dots,\boldsymbol {y}_{L}\right ] \in \mathbb {R}^{F\times L}_{+}\), where F is the number of frequency bins and L is the number of considered time frames.
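For concreteness, the following minimal Python sketch computes such a spectrogram; the STFT parameters (frame length, hop size) are illustrative assumptions and not values prescribed by this paper.

```python
import numpy as np
from scipy.signal import stft

def power_spectrogram(x, fs, n_fft=1024, hop=512):
    """Bin-wise squared magnitude spectrogram Y of shape (F, L)."""
    # scipy's stft returns frequencies, frame times, and complex STFT coefficients
    _, _, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return np.abs(X) ** 2  # squared magnitudes are nonnegative by construction
```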
Motor data descriptions and definitions
The physical state of a robot can be described by motor data, collected by proprioceptors providing angular position information of the joints driven by the motors. In the following, we consider a robot equipped with M proprioceptors, indexed by \(m=1,\dots,M\), each capturing one joint angle. We denote the s-th observed angular position in STFT frame ℓ for proprioceptor m by \(\alpha _{\ell,m}^{(s)}\in \mathbb {R}\). Within frame ℓ, a total of Sℓ motor data samples is observed, i.e., \(s=1,\dots, S_{\ell }\). In this paper, we account for the fact that the motor data is not necessarily synchronized with the audio recording so that, for a fixed observation interval of the audio data, the number of motor data samples may vary, i.e., Sℓ may change with ℓ. This is specifically the case for the NAO robot used for the experiments in this paper.
Depending on the kind of ego-noise, only a subset of proprioceptors is relevant for ego-noise suppression. For example, if only ego-noise caused by arm movements is present, only motor data of the arm joints are required. In the following, we denote the index set of relevant proprioceptors for these joints by \(\mathcal {M}\).
From proprioceptor data collected for proprioceptor m, the instantaneous angular velocity can be estimated by
$$\begin{array}{*{20}l} \dot{\alpha}^{(s)}_{\ell,m} = \frac{\alpha^{(s)}_{\ell,m}-\alpha^{(s-1)}_{\ell,m}}{\Delta T_{\ell}^{(s)}},~~~~ m\in\mathcal{M}, \end{array} $$
(1)
where \(\Delta T_{\ell }^{(s)}\) denotes the time difference between the adjacent observations \(\alpha ^{(s)}_{\ell,m}\) and \(\alpha ^{(s-1)}_{\ell,m}\). Note that for s=1, \(\alpha ^{(s-1)}_{\ell,m}\) is chosen to be the last angular sample of the previous frame ℓ−1. Analogously, the angular acceleration \(\ddot {\alpha }^{(s)}_{\ell,m}\) can be computed from successive angular velocity estimates \(\dot {\alpha }^{(s)}_{\ell,m}\) and \(\dot {\alpha }^{(s-1)}_{\ell,m}\).
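A minimal sketch of Eq. 1 under the assumption that the angles and their timestamps are available as per-frame arrays; the function name and data layout are illustrative.

```python
import numpy as np

def finite_difference(values, times, prev_value, prev_time):
    """Finite-difference derivative estimates per Eq. 1.

    values, times : the S_l samples of frame l and their timestamps
    prev_value, prev_time : last sample/timestamp of frame l-1 (used for s=1)
    """
    v = np.concatenate(([prev_value], values))
    t = np.concatenate(([prev_time], times))
    return np.diff(v) / np.diff(t)  # one estimate per sample s = 1, ..., S_l
```

Applying the same function to the resulting velocities yields the acceleration estimates \(\ddot {\alpha }^{(s)}_{\ell,m}\).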
To associate each spectrogram frame yℓ with a single motor data sample, we propose first to compute the arithmetic average of all Sℓ angular positions in STFT frame ℓ
$$\begin{array}{*{20}l} \bar{\alpha}_{\ell,m}=\frac{1}{S_{\ell}}\sum_{s=1}^{S_{\ell}} \alpha^{(s)}_{\ell,m},~~~ m\in\mathcal{M}. \end{array} $$
(2)
We proceed analogously for angular velocity and acceleration and obtain \(\bar {\dot {\alpha }}_{\ell,m}, \bar {\ddot {\alpha }}_{\ell,m}\), respectively. We then concatenate the averaged angular data for all considered proprioceptors in a feature vector
$$\begin{array}{*{20}l} \bar{\boldsymbol{\alpha}}_{\ell} = \left[\bar{\alpha}_{\ell,1},\dots,\bar{\alpha}_{\ell,m},\bar{\dot{\alpha}}_{\ell,m},\bar{\ddot{\alpha}}_{\ell,m},\dots,\bar{\ddot{\alpha}}_{\ell,M}\right]^{\mathrm{T}}, \end{array} $$
(3)
which we will refer to as the motor data vector for frame ℓ in the following. The left part of Fig. 2 illustrates an example of the described data preprocessing.
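The averaging of Eq. 2 and the stacking of Eq. 3 could be implemented as follows; the assumed input layout (one triple of sample arrays per relevant proprioceptor) is an illustrative choice.

```python
import numpy as np

def motor_data_vector(per_joint_samples):
    """Build the motor data vector for one frame (Eqs. 2 and 3).

    per_joint_samples : list over m in M of (angles, velocities, accelerations),
                        each a 1-D array holding the S_l samples of frame l.
    """
    feats = []
    for angles, velocities, accelerations in per_joint_samples:
        # arithmetic averages per Eq. 2, stacked per proprioceptor per Eq. 3
        feats += [angles.mean(), velocities.mean(), accelerations.mean()]
    return np.asarray(feats)
```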
NMF for ego-noise suppression
In the following, we briefly summarize NMF. We succinctly introduce how semi-supervised NMF can be used for ego-noise suppression and explain the main drawback of the known approach before we introduce the proposed motor data-based regularization.
The objective of NMF is to approximate the nonnegative matrix Y, i.e., a matrix whose elements are all greater than or equal to zero, by a product of two nonnegative matrices D and H
$$\begin{array}{*{20}l} \boldsymbol{Y} \approx\hat{\boldsymbol{Y}}=\boldsymbol{D}\boldsymbol{H}=\left[\boldsymbol{D}\boldsymbol{h}_{1},\dots,\boldsymbol{D}\boldsymbol{h}_{L}\right], \end{array} $$
(4)
where \(\boldsymbol {D}\in \mathbb {R}^{F\times K}_{+}\) is the so-called dictionary of size F×K and \(\boldsymbol {H}=\left [\boldsymbol {h}_{1},\dots,\boldsymbol {h}_{L}\right ]\in \mathbb {R}^{K\times L}_{+}\) is referred to as the activation matrix [8, 21]. This approach can be interpreted as approximating each column of Y by a weighted sum of the columns of D (the so-called atoms or bases), where the weights are given by the corresponding column entries of H. K is referred to as the size of the dictionary and describes the number of atoms in D. Typically, K≪F,L holds, i.e., NMF can be considered a compact representation of the data.
The factorization is achieved by minimizing a cost function which penalizes the dissimilarity between Y and \(\hat {\boldsymbol {Y}}\) as defined by the model parameters D,H. Typically, the cost function is applied element-wise to the matrices Y and \(\hat {\boldsymbol {Y}}\). In this paper, we consider the Euclidean distance between Y and \(\hat {\boldsymbol {Y}}\) as the cost function, yielding the optimization problem
$$ \begin{aligned} &\underset{\boldsymbol{D},\boldsymbol{H}}{\min}~\left\lVert \boldsymbol{Y}-\boldsymbol{D}\boldsymbol{H} \right\rVert_{\mathrm{F}}^{2}\\ & \text{s.t.}~~~~~~ \boldsymbol{D}, \boldsymbol{H} \succeq 0, \end{aligned} $$
(5)
where ∥·∥F denotes the Frobenius norm and D,H≽0 means that all elements of D,H are greater than or equal to zero, ensuring nonnegativity. The optimization problem in Eq. 5 is typically solved using iterative updates alternating between D and H such that the nonnegativity of D,H is implicitly guaranteed if they are initialized with positive values. The update rules can be derived based on, e.g., the Majorization-Minimization principle or heuristic approaches [7, 8].
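For reference, a compact sketch of such multiplicative updates for Eq. 5; the small constant eps, which guards against division by zero, is a common implementation detail and not part of the formulation above.

```python
import numpy as np

def nmf(Y, K, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative updates for min ||Y - DH||_F^2 s.t. D, H >= 0."""
    rng = np.random.default_rng(seed)
    F, L = Y.shape
    D = rng.random((F, K)) + eps  # positive initialization keeps the
    H = rng.random((K, L)) + eps  # factors nonnegative under the updates
    for _ in range(n_iter):
        H *= (D.T @ Y) / (D.T @ (D @ H) + eps)
        D *= (Y @ H.T) / ((D @ H) @ H.T + eps)
    return D, H
```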
For ego-noise suppression, we apply a semi-supervised, two-stage strategy [21], cf. Section 2.4: first, we use audio data containing ego-noise only and train an ego-noise dictionary. Then, given a mixture of ego-noise and speech, the dictionary elements remain constant and only their activations are estimated. For this, again, the same iterative update rules are used, which have been shown to be sensitive to the additional speech signal. As a consequence, the atom activations are no longer estimated correctly. For improved robustness, we therefore propose to extend this purely audio-based estimation of the activations by also taking the physical state of the robot, measured in terms of motor data, into account. Thus, the estimation of the activations is additionally guided by reference information which is completely unaffected by the speech signal.
Motor data-regularized NMF
The basic idea of our approach is that the activations should be similar if the physical state of the robot is similar. For this, we measure the similarity between the robot states in frames ℓ and j by comparing the motor data vectors \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) and enforce similar activations hℓ and hj if \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) are close. This is achieved by imprinting the intrinsic geometry of the motor data space onto the NMF cost function. Results from spectral graph theory [22, 23] and manifold learning theory [24] have shown that the local geometric structure of given data points can be modeled using an undirected graph. Based on these results, we first introduce a motor data-based graph structure and subsequently summarize how a regularization term, enforcing similar activations for similar motor data, can be derived. We then reformulate the NMF optimization problem Eq. 5 and present corresponding update rules for its minimization.
Motor data graph structure
In the following, we define a graph where the motor data vectors \(\bar {\boldsymbol {\alpha }}_{1},\dots,\bar {\boldsymbol {\alpha }}_{L}\) constitute the nodes. The edges connecting the nodes are assumed to be bidirectional, i.e., we obtain an undirected graph. A part of an exemplary graph is illustrated in Fig. 3. The edge connecting nodes \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) has weight Wℓj=Wjℓ, which should reflect the affinity between the two motor data points. Depending on the considered scenario, numerous measures have been proposed to quantify the affinity between \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) [22], e.g., nearest-neighbor or dot-product weighting. In this paper, we determine the weight Wℓj using a Gaussian kernel
$$\begin{array}{*{20}l} W_{\ell j}= W_{j\ell}= \exp\left(-\frac{\lVert\bar{\boldsymbol{\alpha}}_{\ell}-\bar{\boldsymbol{\alpha}}_{j}\rVert^{2}_{2}}{2\epsilon^{2}}\right) \in (0,1], \end{array} $$
(6)
with scale parameter \(\epsilon \in \mathbb {R}_{+}\). The larger Wℓj, the higher the affinity between the two motor data samples; we obtain Wℓj=1 if \(\bar {\boldsymbol {\alpha }}_{\ell }=\bar {\boldsymbol {\alpha }}_{j}\). Note that the connectivity of the graph can be controlled by adjusting ε, e.g., for larger ε, the neighbors of a node are connected with larger weights. Therefore, ε can be used to control the reach of the local neighborhood of a node. Based on the affinity weights, we define the affinity matrix \(\boldsymbol{W}=\boldsymbol{W}^{\mathrm{T}} \in (0,1]^{L\times L}\) with [W]ℓj=Wℓj. Furthermore, we introduce the diagonal matrix Z of size L×L with \(Z_{\ell \ell }=\sum _{j}^{}W_{\ell j}=\sum _{j}^{}W_{j\ell }\) and zeros elsewhere.
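A sketch of the construction of W and Z per Eq. 6, assuming the motor data vectors are stacked as rows of a matrix:

```python
import numpy as np

def affinity_and_degree(A, epsilon):
    """Gaussian-kernel affinity matrix W (Eq. 6) and degree matrix Z.

    A : array of shape (L, P), row l holding the motor data vector of frame l
    """
    sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * epsilon ** 2))  # symmetric, entries in (0, 1]
    Z = np.diag(W.sum(axis=1))                    # Z_ll = sum_j W_lj
    return W, Z
```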
Motor data-based regularization term
The derivation of the regularization term is based on results from [24, 25]. It is assumed that the considered motor data lie on a Riemannian manifold \(\mathcal {A}\). We are looking for a mapping \(f:\mathcal {A}\rightarrow \mathbb {R}\), which can be interpreted as a mapping from the manifold to a line. f should preserve the local geometry of the manifold, i.e., close points on the manifold should be mapped to close points on the line. This implies that f is allowed to vary only smoothly for similar arguments. Appropriate mappings f can be obtained by an optimization on the manifold which can be discretely approximated on the motor data graph by searching for an f which minimizes
$$\begin{array}{*{20}l} \frac{1}{2} \sum_{\ell=1}^{L}\sum_{j=1}^{L}\left(f(\bar{\boldsymbol{\alpha}}_{\ell})-f(\bar{\boldsymbol{\alpha}}_{j})\right)^{2} W_{\ell j}, \end{array} $$
(7)
where f is a function of the nodes of the graph [24, 25].
To exploit the geometric information of the motor data manifold for the estimation of the activation vectors, we manipulate Eq. 7 and replace the abstract mapping f by the activation of atom k
$$\begin{array}{*{20}l} \mathcal{R}_{k} &= \frac{1}{2} \sum_{\ell=1}^{L}\sum_{j=1}^{L}\left(h_{k\ell}-h_{kj}\right)^{2} W_{\ell j}, \end{array} $$
(8)
where hkℓ denotes the ℓ-th element of \(\boldsymbol{h}_{k}\), the k-th row of H written as a column vector, i.e., hkℓ is the activation of atom k in time frame ℓ. The regularization term \(\mathcal {R}_{k}\) needs to be minimized jointly with Eq. 5 with respect to the activations for every atom k, cf. Section 2.3.3. Note that the motor data-based regularization \(\mathcal {R}_{k}\) implicitly also influences the structure of the dictionary elements since the optimized activations directly affect the update of D.
Note that in Eq. 8, the affinities Wℓj can be interpreted as weighting parameters: if two motor data vectors \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) are similar, Wℓj is close to one according to Eq. 6 and the minimization of Eq. 8 enforces similar hkℓ and hkj. Using the parameters defined in Section 2.3.1, Eq. 8 can be directly related to the so-called graph Laplacian L=Z−W [22]
$$ \begin{aligned} \mathcal{R}_{k} &=\boldsymbol{h}_{k}^{\mathrm{T}}\mathbf{Z}\boldsymbol{h}_{k} - \boldsymbol{h}_{k}^{\mathrm{T}}\mathbf{W}\boldsymbol{h}_{k}\\ &=\boldsymbol{h}_{k}^{\mathrm{T}}\mathbf{L}\boldsymbol{h}_{k}. \end{aligned} $$
(9)
Summing over all atoms results in the final regularization term
$$\begin{array}{*{20}l} \mathcal{R}=\sum_{k=1}^{K}\mathcal{R}_{k}=\text{tr}\left(\boldsymbol{H}\boldsymbol{L}\boldsymbol{H}^{\mathrm{T}}\right), \end{array} $$
(10)
where tr(·) denotes the trace operator.
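The equivalence of the pairwise form (Eq. 8, summed over all atoms) and the trace form (Eq. 10) is easily verified numerically; the following sketch uses random data purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L_frames = 3, 6
H = rng.random((K, L_frames))          # activations; row k is h_k^T
W = rng.random((L_frames, L_frames))
W = 0.5 * (W + W.T)                    # symmetric affinity matrix
Z = np.diag(W.sum(axis=1))             # degree matrix
Lap = Z - W                            # graph Laplacian L = Z - W

# Eq. 8 summed over atoms k
R_pair = 0.5 * sum(((H[k][:, None] - H[k][None, :]) ** 2 * W).sum()
                   for k in range(K))
# Eq. 10
R_trace = np.trace(H @ Lap @ H.T)
assert np.allclose(R_pair, R_trace)
```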
Motor data-regularized NMF
The derived regularization term Eq. 10 can be directly included into Eq. 5. We obtain the modified optimization problem
$$ \begin{aligned} &\underset{\boldsymbol{D},\boldsymbol{H}}{\min}~\left\lVert \boldsymbol{Y}-\boldsymbol{D}\boldsymbol{H} \right\rVert_{\mathrm{F}}^{2}+\lambda\,\text{tr}\left(\boldsymbol{H}\boldsymbol{L}\boldsymbol{H}^{\mathrm{T}}\right)\\&\text{s.t.}~~~~~~ \boldsymbol{D}, \boldsymbol{H} \succeq 0, \end{aligned} $$
(11)
where λ≥0 controls the influence of the motor data-based regularization.
For minimization, we form the partial derivatives of Eq. 11 with respect to D and H and obtain the iterative update rules [19, 20]
$$\begin{array}{*{20}l} \left[\boldsymbol{D}\right]_{fk}&\leftarrow\left[\boldsymbol{D}\right]_{fk}\cdot\frac{\left[\boldsymbol{Y}\boldsymbol{H}^{\mathrm{T}}\right]_{fk}}{\left[\hat{\boldsymbol{Y}}\boldsymbol{H}^{\mathrm{T}}\right]_{fk}}, \end{array} $$
(12)
$$\begin{array}{*{20}l} \left[\boldsymbol{H}\right]_{k\ell}&\leftarrow\left[\boldsymbol{H}\right]_{k\ell}\cdot\frac{\left[\boldsymbol{D}^{\mathrm{T}}\boldsymbol{Y}+\lambda\boldsymbol{H}\boldsymbol{W}\right]_{k\ell}}{\left[\boldsymbol{D}^{\mathrm{T}}\hat{\boldsymbol{Y}}+\lambda\boldsymbol{H}\boldsymbol{Z}\right]_{k\ell}}, \end{array} $$
(13)
where [D]fk denotes the (f,k)-th element of D. As for conventional NMF, the iterative updates can be stopped, e.g., after a fixed number of iterations. In this paper, we additionally compute the cost according to Eq. 11 in each iteration and terminate the updates Eqs. 12 and 13 upon convergence.
Eqs. 12 and 13 reduce to the conventional update rules for NMF if λ=0 [8]. Note that since the proposed method aims at enforcing similar activations for close motor data vectors, the regularization has an effect on the update rule for H only, while the update for D is unaffected.
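A sketch of one iteration of Eqs. 12 and 13; the eps safeguard against division by zero and the update order (H before D) are implementation assumptions.

```python
import numpy as np

def mr_nmf_step(Y, D, H, W, Z, lam, eps=1e-12, update_D=True):
    """One motor data-regularized update of H (Eq. 13) and D (Eq. 12)."""
    Y_hat = D @ H
    H *= (D.T @ Y + lam * (H @ W)) / (D.T @ Y_hat + lam * (H @ Z) + eps)
    if update_D:  # D is kept fixed during the suppression stage
        Y_hat = D @ H
        D *= (Y @ H.T) / (Y_hat @ H.T + eps)
    return D, H
```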
Proposed algorithm for ego-noise suppression
As mentioned in Section 2.2, we apply a semi-supervised, two-stage strategy for ego-noise suppression [21]. We first employ audio data containing ego-noise only and train D, imprinting the intrinsic geometry of the motor data space onto the model using the proposed regularization. Given a mixture of ego-noise and speech, we then use D to model and suppress the current ego-noise and to obtain a speech estimate. In the following, we describe the proposed algorithm for ego-noise suppression in detail, cf. Fig. 4 for an overview.
- Learning D: As input, spectrograms \(\boldsymbol {Y}=\big [\boldsymbol {y}_{1},\dots,\boldsymbol {y}_{L}\big ]\) containing ego-noise only are given. For each spectrogram frame yℓ, a motor data vector \(\bar {\boldsymbol {\alpha }}_{\ell }\) is computed. The vectors \(\bar {\boldsymbol {\alpha }}_{\ell }, \ell =1,\dots,L,\) are used to construct the affinity matrix W and the degree matrix Z. Subsequently, the update rules Eqs. 12 and 13 are used to compute the dictionary D, where the introduced regularization term is weighted by λT.
- Ego-noise suppression: Another dictionary DS of size KS with corresponding activation matrix HS is initialized to model the additional speech signal in the considered mixture Y; a code sketch of this stage is given at the end of this section. Analogously to the preceding learning step, W and Z are constructed from the new motor data vectors, which possibly represent different movements. Using the same update rules as before, DS, H, and HS are updated while D remains constant. The motor data-based regularization term is weighted by λE. Note that for optimizing the activations of the speech model HS, we set λE=0 since the motor data-based regularization should affect only the estimation of the ego-noise activations. After identifying the optimal model parameters DS, H, and HS, we apply a spectral enhancement filter to obtain an estimate of the desired speech signal, \(\big [\hat {\boldsymbol {Y}}_{\mathrm {S}}\big ]_{f\ell }=\big [\boldsymbol {F}\big ]_{f\ell }\cdot \big [\boldsymbol {Y}\big ]_{f\ell }\) for the fℓ-th bin, where the enhancement filter is given by
$$\begin{array}{*{20}l} \big[\boldsymbol{F}\big]_{f\ell}=\frac{\big[\boldsymbol{D}_{\mathrm{S}}\boldsymbol{H}_{\mathrm{S}}\big]_{f\ell}}{\big[\boldsymbol{D}\boldsymbol{H}\big]_{f\ell}+\big[\boldsymbol{D}_{\mathrm{S}}\boldsymbol{H}_{\mathrm{S}}\big]_{f\ell}}. \end{array} $$
(14)
Note that typically λE≠λT holds, i.e., the regularization terms in the two stages have different weights. This is further detailed in the following section.
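A sketch of the complete suppression stage under the assumptions stated above; initialization, update scheduling, and the eps safeguard are illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def suppress_ego_noise(Y, D, D_s, H, H_s, W, Z, lam_E, n_iter=100, eps=1e-12):
    """Suppression stage: D fixed, H/H_s/D_s adapted, filter of Eq. 14 applied."""
    for _ in range(n_iter):
        Y_hat = D @ H + D_s @ H_s
        # ego-noise activations with motor data regularization (weight lam_E)
        H *= (D.T @ Y + lam_E * (H @ W)) / (D.T @ Y_hat + lam_E * (H @ Z) + eps)
        Y_hat = D @ H + D_s @ H_s
        H_s *= (D_s.T @ Y) / (D_s.T @ Y_hat + eps)  # speech: lam_E = 0
        Y_hat = D @ H + D_s @ H_s
        D_s *= (Y @ H_s.T) / (Y_hat @ H_s.T + eps)  # ego-noise dictionary D fixed
    # spectral enhancement filter, Eq. 14, applied bin-wise
    F_filt = (D_s @ H_s) / (D @ H + D_s @ H_s + eps)
    return F_filt * Y  # estimate of the speech spectrogram
```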