Motor data-regularized nonnegative matrix factorization for egonoise suppression
EURASIP Journal on Audio, Speech, and Music Processing volume 2020, Article number: 11 (2020)
Abstract
Egonoise, i.e., the noise a robot causes by its own motions, significantly corrupts the microphone signal and severely impairs the robot’s capability to interact seamlessly with its environment. Therefore, suitable egonoise suppression techniques are required. For this, it is intuitive to also use motor data collected by proprioceptors mounted to the joints of the robot since it describes the physical state of the robot and provides additional information about the egonoise sources. In this paper, we use a dictionary-based approach for egonoise suppression in a semi-supervised manner: first, an egonoise dictionary is learned and subsequently used to estimate the egonoise components of a mixture by computing a weighted sum of dictionary entries. The estimation of the weights is very sensitive to signals other than egonoise contained in the mixture. For increased robustness, we therefore propose to incorporate knowledge about the physical state of the robot into the estimation of the weights. This is achieved by introducing a motor data-based regularization term to the estimation problem which promotes similar weights for similar physical states. The regularization is derived by representing the motor data as a graph and imprints the intrinsic structure of the motor data space onto the dictionary model. We analyze the proposed method, evaluate its egonoise suppression performance for a large variety of different movements, and demonstrate the superiority of the proposed method compared to an approach without using motor data.
Introduction
Microphone-equipped robots are exposed to various kinds of noise, specifically to self-created noise, which is referred to as egonoise in the following. It is caused by the robot’s electrical and mechanical components such as rotating motors and joints as well as the moving body parts. Egonoise is a crucial problem in robot audition [1, 2] since it severely corrupts the recorded microphone signals and impairs the robot’s capability to react to unanticipated acoustic events. For this reason, egonoise suppression is a crucial preprocessing step in robot audition.
Egonoise suppression is particularly challenging for several reasons. First, egonoise is usually louder than other signals of interest, e.g., a desired speech signal (“target”), since the egonoise sources are typically located in immediate proximity of the microphones. For example, for the humanoid robot NAO^{TM}, which we will use as experimental platform in this paper, the microphones are mounted to the head of the robot. Thereby, they are only a few centimeters away from the shoulder motors and joints, cf. Fig. 1. Another challenging aspect of egonoise is that it cannot be modeled as a single static point interferer as the joints are located all over the body of the robot and the resulting structure-borne sound is transduced to air not just at isolated points. Furthermore, egonoise is highly nonstationary since typically different movements are performed successively with varying speeds and accelerations.
One of the first approaches for egonoise suppression goes back to the SIG humanoid robot [1] which was equipped with microphones mounted inside the robot’s housing near the motors in order to record egonoise reference signals. These signals were subsequently used as reference for adaptive filtering-based egonoise cancelation. Interestingly, these reference signals were interpreted as additional auditory perception channels of the robot. Internal microphones were also used in [3] for speech enhancement for a human-robot dialog system. In this approach, the recorded reference signals are incorporated into a frequency-domain semi-blind source separation algorithm with subsequent multichannel Wiener filtering.
In many robot designs, it is not possible to mount additional reference microphones inside the robot due to space and hardware constraints. Furthermore, a potentially large number of internal microphones are required to obtain reference signals for each egonoise source. This drawback motivates approaches which operate on the external microphone signals only. Here, it can be exploited that egonoise exhibits a characteristic structure in the Short-Time Fourier Transform (STFT) domain. Due to the limited number of degrees of freedom for the movements of the robot, those spectral patterns cannot be arbitrarily diverse. These two properties motivate the use of learning-based dictionary methods where the egonoise signals are approximated by prototype signals, so-called atoms, which are collected in a dictionary. Then, for each time frame, a linear combination of atoms has to be found which optimally fits the current egonoise signal with respect to the chosen criterion. An example of such a dictionary learning algorithm is K-SVD [4], which has been applied for multichannel egonoise suppression in [5]. Another widely used approach to train a dictionary is nonnegative matrix factorization (NMF) [6–8]. For NMF, the dictionary is restricted to nonnegative elements only, which is well-suited to model power spectral densities (PSDs) of acoustic sources. A corresponding approach for egonoise suppression has been investigated, e.g., in [9]. The concept of NMF has been extended to multichannel recordings [10] by complementing the (nonnegative) source model with an additional spatial model. This has been applied for egonoise suppression in [11].
Besides methods using the audio modality only (referred to as audio only-based methods in the following), other egonoise suppression approaches use knowledge about motor information given by, e.g., motor commands or motor data such as engine rotation frequency, joint angles, or angular velocities collected by proprioceptors. The advantage of using motor data compared to motor commands is that the emitted egonoise is directly related to the instantaneous internal state of the robot, measured by motor data. Since a robot is not a fully deterministic system, this measured state may be significantly different from the target state defined by the motor command.
A sufficiently accurate analytical model of the dependency between motor data and emitted egonoise can usually not be obtained since the mechanical dependencies and interactions between structure-borne and airborne sounds are highly complex. Therefore, current egonoise suppression approaches model these dependencies entirely or partly by learning-based strategies. For example in [13], a neural network-based approach is used to predict the PSDs of egonoise caused by the Aibo^{TM} robot. The feedforward neural network, consisting of two hidden layers containing thirty nodes each, is fed with angular position and velocity data of current and past time frames. The PSD estimates are subsequently used for spectral subtraction which was shown to result in a significant improvement of speech recognition rates. In [14], it is demonstrated that the harmonic structure of egonoise can be estimated using motor data. This prior knowledge is included in a single-channel NMF-based egonoise model. It is proposed to approximate the currently observed egonoise spectrogram by combining elements from a dictionary D_{H} which models the harmonic structure and another dictionary D_{R} which captures the residual part of the egonoise. The benefit of this approach is that only D_{R} requires a prior learning step while D_{H} is completely motor data-driven. It is shown that this approach significantly outperforms an audio only-based method for the suppression of egonoise that is not well represented in the training data. Although the approach of [14] is methodically close to the method proposed in this paper, it has a different aim since it explicitly addresses the suppression of egonoise if training and test data are unbalanced. This is not the case in the scenario considered in this paper.
Other popular methods for egonoise suppression combining audio and motor information are template-based approaches. Here, the key idea is to save the characteristic spectral shape of the egonoise as PSD templates in a database. In [15], each template is associated with a motor command which triggered the current movement. Based on this, during application, matching templates are identified and temporally aligned to the recorded signal. An alternative template-based approach was presented in [16, 17], where motor data instead of motor commands are used to identify the templates in the database. For a current motor data sample, the nearest neighbor in the motor data space is searched and the associated template is used as egonoise estimate.
The concept of associating motor data with egonoise templates was adopted in [18]. However, there, motor data samples are linked to a set of atoms from a learned dictionary-based egonoise model. Nonlinear classifiers in the motor data space are used to associate a motor data sample with a set of atoms, whose elements are subsequently combined to approximate the current egonoise recording. Thereby, the classifiers replace the expensive iterative search for atoms in the dictionary.
In this paper, the idea of choosing atoms depending on motor data is adopted from [18] and we propose to expand the conventional, audio only-based NMF model by a motor data-dependent regularization term, which promotes similar atom activations in those time frames in which similar motor data is measured. The proposed regularization term is derived from a graph structure which encodes the similarity between the motor data samples. While the main benefit of the method in [18] was a reduction of computational complexity, the approach presented in this paper results in a significant performance improvement. The proposed method is inspired by graph-regularized NMF [19, 20], which was proposed in the context of clustering and classification of text documents. There, the NMF model and the regularization operate in the same data space. In this work, however, we learn an NMF model on acoustic data while the regularization encodes the geometry of the motor data space. Thus, we combine an acoustic model with non-acoustic reference information.
This paper is structured as follows. In Section 2.1, we describe the used motor data. After succinctly introducing NMF in Section 2.2, we present the novel motor data-regularized NMF in Section 2.3. Thereby, we first describe the construction of the motor data graph structure in Subsection 2.3.1 and derive the proposed regularization term in Subsection 2.3.2. Then, the modified NMF optimization problem is formulated and corresponding update rules are presented in Subsection 2.3.3. The resulting novel egonoise suppression algorithm is summarized in Section 2.4 and its efficacy is demonstrated in Section 3.
Motor data-regularized NMF for egonoise suppression
In the following, we consider the bin-wise squared magnitude of a single-channel microphone signal in the STFT domain, represented in spectrograms denoted as \(\boldsymbol {Y} = \left [\boldsymbol {y}_{1},\dots,\boldsymbol {y}_{L}\right ] \in \mathbb {R}^{F\times L}_{+}\), where F is the number of frequency bins and L is the number of considered time frames.
Motor data descriptions and definitions
The physical state of a robot can be described by motor data, collected by proprioceptors providing angular position information of the joints driven by the motors. In the following, we consider a robot which is equipped with M proprioceptors, indexed by \(m=1,\dots,M\), each capturing one angle of a joint. We denote the s-th observed angular position in STFT frame ℓ for proprioceptor m by \(\alpha _{\ell,m}^{(s)}\in \mathbb {R}\). Within frame ℓ, a total number of S_{ℓ} motor data samples is observed, i.e., \(s=1,\dots, S_{\ell }\). In this paper, we account for the fact that the motor data is not necessarily synchronized with the audio data recording so that, for a fixed observation interval for the audio data, the number of motor data samples may vary, i.e., S_{ℓ} may change with ℓ. This is specifically the case for the NAO robot used for the experiments in this paper.
Depending on the kind of egonoise, only a subset of proprioceptors is relevant for egonoise suppression. For example, if only egonoise caused by arm movements is present, only motor data of the arm joints are required. In the following, we denote the index set of relevant proprioceptors for these joints by \(\mathcal {M}\).
From proprioceptor data collected for proprioceptor m, the instantaneous angular velocity can be estimated by
$$\begin{array}{*{20}l} \dot{\alpha}^{(s)}_{\ell,m}=\frac{\alpha^{(s)}_{\ell,m}-\alpha^{(s-1)}_{\ell,m}}{\Delta T_{\ell}^{(s)}}, \end{array} $$(1)
where \(\Delta T_{\ell }^{(s)}\) denotes the time difference between adjacent observations \(\alpha ^{(s)}_{\ell,m}\) and \(\alpha ^{(s-1)}_{\ell,m}\). Note that for s=1, \(\alpha ^{(s-1)}_{\ell,m}\) is chosen to be the last angular sample of the previous frame ℓ−1. Analogously, angular acceleration \(\ddot {\alpha }^{(s)}_{\ell,m}\) can be computed from successive angular velocity estimates \(\dot {\alpha }^{(s)}_{\ell,m}\) and \(\dot {\alpha }^{(s-1)}_{\ell,m}\).
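To associate each spectrogram frame y_{ℓ} with a single motor data sample, we propose first to compute the arithmetic average of all S_{ℓ} angular positions in STFT frame ℓ
$$\begin{array}{*{20}l} \bar{\alpha}_{\ell,m}=\frac{1}{S_{\ell}}\sum_{s=1}^{S_{\ell}}\alpha^{(s)}_{\ell,m}. \end{array} $$(2)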
We proceed analogously for angular velocity and acceleration and obtain \(\bar {\dot {\alpha }}_{\ell,m}\) and \(\bar {\ddot {\alpha }}_{\ell,m}\), respectively. We then concatenate the averaged angular data for all considered proprioceptors \(m\in \mathcal {M}\) in a feature vector \(\bar {\boldsymbol {\alpha }}_{\ell }\), which we will refer to as the motor data vector for frame ℓ in the following. The left part of Fig. 2 illustrates the described preprocessing of the data.
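The preprocessing described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions that are not taken from the paper: the function name `motor_data_vector`, the argument layout, and the ordering of the output (all averaged positions first, then all averaged velocities) are hypothetical, and the acceleration features, which would be computed analogously from successive velocities, are omitted for brevity.

```python
import numpy as np

def motor_data_vector(angles, times, prev_angle, prev_time):
    """Average position and velocity features for one STFT frame.

    angles:     (S, M) angular positions of M joints at the S sample instants
    times:      (S,) time stamps in seconds (need not be uniformly spaced)
    prev_angle: (M,) last angular sample of the previous frame
    prev_time:  time stamp of that sample
    """
    angles = np.asarray(angles, dtype=float)
    times = np.asarray(times, dtype=float)
    # Prepend the last sample of the previous frame so that s=1 has a predecessor.
    a = np.vstack([np.asarray(prev_angle, dtype=float)[None, :], angles])
    t = np.concatenate([[prev_time], times])
    dt = np.diff(t)[:, None]               # time differences, shape (S, 1)
    velocity = np.diff(a, axis=0) / dt     # finite-difference velocities, shape (S, M)
    # Arithmetic averages over the S samples of the frame.
    mean_pos = angles.mean(axis=0)
    mean_vel = velocity.mean(axis=0)
    # Assumed ordering of the feature vector: positions, then velocities.
    return np.concatenate([mean_pos, mean_vel])
```

Because the time stamps are passed explicitly, the sketch also covers frames with a varying number of motor data samples S_{ℓ}.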
NMF for egonoise suppression
In the following, we briefly summarize NMF. We succinctly introduce how semi-supervised NMF can be used for egonoise suppression and explain the main drawback of the known approach before we introduce the proposed motor data-based regularization.
The objective of NMF is to approximate the nonnegative matrix Y, i.e., a matrix whose elements are all greater than or equal to zero, by a product of two nonnegative matrices D and H
$$\begin{array}{*{20}l} \boldsymbol{Y}\approx\hat{\boldsymbol{Y}}=\boldsymbol{D}\boldsymbol{H}, \end{array} $$(4)
where \(\boldsymbol {D}\in \mathbb {R}^{F\times K}_{+}\) is the so-called dictionary of size F×K and \(\boldsymbol {H}=\left [\boldsymbol {h}_{1},\dots,\boldsymbol {h}_{L}\right ]\in \mathbb {R}^{K\times L}_{+}\) is referred to as activation matrix [8, 21]. This approach can be interpreted as approximating each column of Y by a weighted sum of columns of D (the so-called atoms or bases), where the weights are given by the entries of the corresponding column of H. K is referred to as the size of the dictionary and describes the number of atoms in D. Typically, K≪F,L holds, i.e., NMF can be considered as a compact representation of data.
The factorization is achieved by minimizing a cost function which penalizes the dissimilarity between Y and \(\hat {\boldsymbol {Y}}\) defined by the model parameters D,H. Typically, the cost function is applied element-wise on the elements of the matrices Y and \(\hat {\boldsymbol {Y}}\). In this paper, we consider the Euclidean distance between Y and \(\hat {\boldsymbol {Y}}\) as cost function yielding the optimization problem
$$\begin{array}{*{20}l} \min_{\boldsymbol{D},\boldsymbol{H}}\ \left\|\boldsymbol{Y}-\boldsymbol{D}\boldsymbol{H}\right\|_{\mathrm{F}}^{2}\quad \text{s.t.}\quad \boldsymbol{D},\boldsymbol{H}\succcurlyeq 0, \end{array} $$(5)
where ∥·∥_{F} denotes the Frobenius norm and D,H≽0 means that all elements of D,H are greater than or equal to zero, ensuring nonnegativity. The optimization problem in Eq. 5 is typically solved using iterative updates alternating between D and H such that the nonnegativity of D,H is implicitly guaranteed if they are initialized with positive values. The update rules can be derived based on, e.g., the Majorization-Minimization principle or heuristic approaches [7, 8].
For egonoise suppression, we apply a semi-supervised, two-stage strategy [21], cf. Section 2.4: first, we use audio data containing egonoise only and train an egonoise dictionary. Then, given a mixture of egonoise and speech, the dictionary elements are kept fixed and only their activations are estimated. For this, again, the same iterative update rules are used, which have been shown to be sensitive to the additional speech signal. As a consequence, the atom activations are no longer estimated correctly. For improved robustness, we therefore propose to extend this audio only-based estimation of the activations by also taking the physical state of the robot, measured in terms of motor data, into account. Thus, the estimation of the activations is additionally guided by reference information which is completely unaffected by the speech signal.
Motor data-regularized NMF
The basic idea of our approach is that activations should be similar if the physical state of the robot is similar. For this, we measure the similarity between robot states in frames ℓ and j by comparing motor data vectors \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) and enforce similar activations h_{ℓ} and h_{j} if \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) are close. This will be achieved by imprinting the intrinsic geometry of the motor data space onto the NMF cost function. Results from spectral graph theory [22, 23] and manifold learning theory [24] have shown that the local geometric structure of given data points can be modeled using an undirected graph. Based on these results, we first introduce a motor data-based graph structure and subsequently summarize how a regularization term, enforcing similar activations for similar motor data, can be derived. We then reformulate the NMF optimization problem Eq. 5 and present corresponding update rules for its minimization.
Motor data graph structure
In the following, we define a graph where the motor data vectors \(\bar {\boldsymbol {\alpha }}_{1},\dots,\bar {\boldsymbol {\alpha }}_{L}\) constitute the nodes. The edges connecting the nodes are assumed to be bidirectional, i.e., we obtain an undirected graph. A part of an exemplary graph is illustrated in Fig. 3. The edge which connects nodes \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) has weight W_{ℓj}=W_{jℓ} and should reflect the affinity between the two motor data points. Depending on the considered scenario, numerous measures have been proposed to quantify the affinity between \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) [22], e.g., a nearest-neighbor or dot-product weighting. In this paper, we determine the weight W_{ℓj} using a Gaussian kernel
with scale parameter \(\epsilon \in \mathbb {R}_{+}\). The larger W_{ℓj}, the higher the affinity between two motor data samples, and we obtain W_{ℓj}=1 if \(\bar {\boldsymbol {\alpha }}_{\ell }=\bar {\boldsymbol {\alpha }}_{j}\). Note that by adjusting ε, the connectivity of the graph can be controlled, e.g., for larger ε, the neighbors of a node are connected with a larger weight. Therefore, ε can be used to control the reach of the local neighborhood of a node. Based on the affinity weights, we define the affinity matrix W=W^{T} ∈[0,1]^{L×L}, where [W]_{ℓj}=W_{ℓj}. Furthermore, we introduce the diagonal matrix Z of size L×L with diagonal entries \(Z_{\ell \ell }=\sum _{j}^{}W_{\ell j}=\sum _{j}^{}W_{j\ell }\) and zeros elsewhere.
Motor data-based regularization term
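The construction of the affinity matrix W and the degree matrix Z can be sketched as follows. The exact kernel normalization is an assumption here: the sketch uses exp(−‖ᾱ_ℓ − ᾱ_j‖² / ε), which satisfies the properties stated in the text (W_{ℓj}=1 for identical motor data vectors, larger weights for larger ε), but other Gaussian parameterizations would satisfy them as well. The function name `motor_graph` is hypothetical.

```python
import numpy as np

def motor_graph(A, eps):
    """Build affinity, degree, and Laplacian matrices from motor data vectors.

    A:   (L, P) array with one motor data vector per row (one per time frame)
    eps: positive scale parameter of the Gaussian kernel
    """
    A = np.asarray(A, dtype=float)
    # Pairwise squared Euclidean distances between all motor data vectors.
    diff = A[:, None, :] - A[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Assumed kernel form; the normalization by eps is an illustrative choice.
    W = np.exp(-sq_dist / eps)
    # Diagonal degree matrix holding the row sums of W.
    Z = np.diag(W.sum(axis=1))
    # Graph Laplacian; the diagonal of W cancels here, so self-affinities have no effect.
    return W, Z, Z - W
```

For L frames this builds dense L×L matrices, which is adequate for short recordings; a sparse nearest-neighbor graph would be one option for longer sequences.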
The derivation of the regularization term is based on results from [24, 25]. It is assumed that the considered motor data lie on a Riemannian manifold \(\mathcal {A}\). We are looking for a mapping \(f:\mathcal {A}\rightarrow \mathbb {R}\), which can be interpreted as a mapping from the manifold to a line. f should preserve the local geometry of the manifold, i.e., close points on the manifold should be mapped to close points on the line. This implies that f is allowed to vary only smoothly for similar arguments. Appropriate mappings f can be obtained by an optimization on the manifold which can be discretely approximated on the motor data graph by searching for an f which minimizes
where f is a function of the nodes of the graph [24, 25].
To exploit the geometric information of the motor data manifold for the estimation of the activation vectors, we manipulate Eq. 7 and replace the abstract mapping f by the activation of atom k
where h_{kℓ} denotes the ℓth element of h_{k}, the kth row of H, i.e., h_{kℓ} is the scaling of atom k in time frame ℓ. The regularization term \(\mathcal {R}_{k}\) needs to be minimized jointly with Eq. 5 with respect to the activations for every atom k, cf. Section 2.3.3. Note that the motor data-based regularization \(\mathcal {R}_{k}\) implicitly also influences the structure of the dictionary elements since the optimized activations directly affect the update of D.
Note that in Eq. 8, the affinities W_{ℓj} can be interpreted as weighting parameters: if two motor data vectors \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) are similar, W_{ℓj} is close to one according to Eq. 6 and the minimization of Eq. 8 enforces similar h_{kℓ} and h_{kj}. Using the parameters defined in Section 2.3.1, Eq. 8 can be directly related to the so-called graph Laplacian L=Z−W [22]
Summing over all atoms results in the final regularization term
$$\begin{array}{*{20}l} \mathcal{R}=\sum_{k=1}^{K}\mathcal{R}_{k}=\text{tr}\left(\boldsymbol{H}\boldsymbol{L}\boldsymbol{H}^{\mathrm{T}}\right), \end{array} $$(10)
where tr(·) denotes the trace operator.
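The relation between the pairwise penalty and the graph Laplacian can be checked numerically. The sketch below assumes the common convention in which the pairwise sum carries a factor of 1/2; under that convention the sum over all atoms equals tr(H L H^T) exactly, and without the factor the two expressions differ by a constant factor of two. All sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L = 3, 5                      # number of atoms and time frames (illustrative)
H = rng.random((K, L))           # nonnegative activations
W = rng.random((L, L))
W = 0.5 * (W + W.T)              # symmetric affinity matrix
Z = np.diag(W.sum(axis=1))       # degree matrix
Lap = Z - W                      # graph Laplacian

# Pairwise form: weighted squared differences of activations, summed over atoms.
pairwise = 0.0
for k in range(K):
    for l in range(L):
        for j in range(L):
            pairwise += 0.5 * W[l, j] * (H[k, l] - H[k, j]) ** 2

# Compact trace form over all atoms.
trace_form = np.trace(H @ Lap @ H.T)
```

Both quantities agree up to floating-point error, and the trace form is nonnegative whenever all affinities are nonnegative.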
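Motor data-regularized NMF
The derived regularization term Eq. 10 can be directly included into Eq. 5. We obtain as modified optimization problem
$$\begin{array}{*{20}l} \min_{\boldsymbol{D},\boldsymbol{H}}\ \left\|\boldsymbol{Y}-\boldsymbol{D}\boldsymbol{H}\right\|_{\mathrm{F}}^{2}+\lambda\,\text{tr}\left(\boldsymbol{H}\boldsymbol{L}\boldsymbol{H}^{\mathrm{T}}\right)\quad \text{s.t.}\quad \boldsymbol{D},\boldsymbol{H}\succcurlyeq 0, \end{array} $$(11)
where λ≥0 controls the influence of the motor data-based regularization.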
For minimization, we form the partial derivatives with respect to D and H in Eq. 11 and obtain iterative update rules [19, 20]
where [D]_{fk} selects the (f,k)th element from D. Similar to conventional NMF, the iterative update can be stopped, e.g., after a fixed number of iterations. In this paper, in each iteration we additionally compute the cost according to Eq. 11 and terminate the updates Eqs. 12 and 13 after convergence.
Eqs. 12 and 13 reduce to the conventional update rules for NMF if λ=0 [8]. Note that since the proposed method aims at enforcing similar activations for close motor data vectors, the regularization has an effect on the update rule for H only, while the update for D is unaffected.
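A minimal sketch of such multiplicative updates is given below. It follows the form known from graph-regularized NMF [19, 20], adapted to the notation used here, and is an assumption about how the update rules look rather than a transcription of the paper's equations. The function names, the fixed iteration count, and the small constant `tiny` that guards against division by zero are illustrative choices.

```python
import numpy as np

def motor_regularized_nmf(Y, W, K, lam, n_iter=200, tiny=1e-12, seed=0):
    """Factorize Y (F x L) as D @ H with a graph penalty on the activations H."""
    rng = np.random.default_rng(seed)
    F, L = Y.shape
    Z = np.diag(W.sum(axis=1))            # degree matrix of the motor data graph
    D = rng.random((F, K)) + tiny         # positive initialization keeps D, H nonnegative
    H = rng.random((K, L)) + tiny
    for _ in range(n_iter):
        # Dictionary update: identical to conventional Euclidean NMF.
        D *= (Y @ H.T) / (D @ H @ H.T + tiny)
        # Activation update: W enters the numerator, Z the denominator.
        H *= (D.T @ Y + lam * H @ W) / (D.T @ D @ H + lam * H @ Z + tiny)
    return D, H

def cost(Y, D, H, W, lam):
    """Squared Frobenius error plus the weighted graph penalty tr(H L H^T)."""
    Lap = np.diag(W.sum(axis=1)) - W
    return np.linalg.norm(Y - D @ H) ** 2 + lam * np.trace(H @ Lap @ H.T)
```

Setting `lam=0` removes both additional terms and leaves the conventional multiplicative NMF updates. In practice, the `cost` helper could be evaluated in each iteration to implement the convergence-based stopping rule instead of the fixed iteration count used in this sketch.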
Proposed algorithm for egonoise suppression
As mentioned in Section 2.2, we apply a semi-supervised, two-stage strategy for egonoise suppression [21]. We first employ audio data containing egonoise only and train D, imprinting the intrinsic geometry of the motor data space onto the model using the proposed regularization. Given a mixture of egonoise and speech, we use D to model and suppress the current egonoise and to obtain a speech estimate. In the following, we describe the proposed algorithm for egonoise suppression in detail, cf. Fig. 4 for an overview.

Learning D: As input, spectrograms \(\boldsymbol {Y}=\big [\boldsymbol {y}_{1},\dots,\boldsymbol {y}_{L}\big ]\) are given containing egonoise only. Per spectrogram frame y_{ℓ}, a motor data vector \(\bar {\boldsymbol {\alpha }}_{\ell }\) is computed. The vectors \(\bar {\boldsymbol {\alpha }}_{\ell }, \ell =1,\dots,L\), are used to construct the affinity and degree matrices, W and Z, respectively. Subsequently, the update rules Eqs. 12 and 13 are used to compute the dictionary D, where the introduced regularization term is weighted by λ_{T}.

Egonoise suppression: Another dictionary D_{S} of size K_{S} and a corresponding activation matrix H_{S} are initialized to model the additional speech signal in the considered mixture Y. Analogously to the learning step before, W and Z are constructed from the new motor data vectors, which possibly represent different movements. Using the same update rules as before, D_{S}, H and H_{S} are updated while D remains constant. The motor data-based regularization term is weighted by λ_{E}. Note that for optimizing the activations of the speech model H_{S}, we set λ_{E}=0 since the motor data-based regularization should affect only the estimation of the egonoise activations. After identifying the optimum model parameters captured by D_{S}, H and H_{S}, we use a spectral enhancement filter to obtain an estimate for the desired speech signal \(\big [\hat {\boldsymbol {Y}}_{\mathrm {S}}\big ]_{f\ell }=\big [\boldsymbol {F}\big ]_{f\ell }\cdot \big [\boldsymbol {Y}\big ]_{f\ell }\) for the (f,ℓ)th bin where the enhancement filter is given by
$$\begin{array}{*{20}l} \big[\boldsymbol{F}\big]_{f\ell}=\frac{\big[\boldsymbol{D}_{\mathrm{S}}\boldsymbol{H}_{\mathrm{S}}\big]_{f\ell}}{\big[\boldsymbol{D}\boldsymbol{H}\big]_{f\ell}+\big[\boldsymbol{D}_{\mathrm{S}}\boldsymbol{H}_{\mathrm{S}}\big]_{f\ell}}. \end{array} $$(14)
Note that typically λ_{E}≠λ_{T} holds, i.e., the regularization terms in both steps have different weights. This is further detailed in the following section.
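The suppression stage can be sketched as follows. The sketch assumes that the mixture is modeled as the sum D H + D_{S} H_{S} and that multiplicative updates of the graph-regularized form are applied to H, while H_{S} and D_{S} are updated without regularization; this joint-model formulation, the function name `suppress`, and the fixed iteration count are illustrative assumptions. The final step applies the enhancement filter of Eq. 14.

```python
import numpy as np

def suppress(Y, D, W, K_S, lam_E, n_iter=100, tiny=1e-12, seed=0):
    """Estimate the speech spectrogram from a mixture Y using a fixed egonoise dictionary D."""
    rng = np.random.default_rng(seed)
    F, L = Y.shape
    Z = np.diag(W.sum(axis=1))                 # degree matrix of the new motor data graph
    H = rng.random((D.shape[1], L)) + tiny     # egonoise activations
    D_S = rng.random((F, K_S)) + tiny          # speech dictionary, learned on the mixture
    H_S = rng.random((K_S, L)) + tiny          # speech activations
    for _ in range(n_iter):
        # Egonoise activations: regularized by the motor data graph, D stays fixed.
        Y_hat = D @ H + D_S @ H_S
        H *= (D.T @ Y + lam_E * H @ W) / (D.T @ Y_hat + lam_E * H @ Z + tiny)
        # Speech activations and dictionary: no motor data regularization.
        Y_hat = D @ H + D_S @ H_S
        H_S *= (D_S.T @ Y) / (D_S.T @ Y_hat + tiny)
        Y_hat = D @ H + D_S @ H_S
        D_S *= (Y @ H_S.T) / (Y_hat @ H_S.T + tiny)
    noise = D @ H
    speech = D_S @ H_S
    gain = speech / (noise + speech + tiny)    # enhancement filter, values in [0, 1]
    return gain * Y, gain
```

Since the gain never exceeds one, the speech estimate is bounded by the mixture spectrogram in every time-frequency bin.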
Experimental evaluation
In the following, we evaluate the proposed method using real microphone recordings. We first describe the hardware setup, the synchronization of audio and motor data and the recording scenarios, and introduce the evaluation metrics. Then, we present suppression results for egonoise of different movements and discuss the influence of crucial parameters.
Recording setup
We conducted our experiments with a commercially available NAO H25 robot [12]. For the audio recordings, we used a self-constructed head [26] with a microphone array of 12 sensors. For all following experiments, we used the frontmost microphone. Since the NAO platform does not provide a built-in synchronization on audio sample level, we developed a synchronization scheme which is illustrated in Fig. 5: the microphone signals are fed into an external analog-to-digital (A/D) converter using conventional phone connectors (IEC 60603-11). The sampled data is forwarded to the robot’s internal CPU via USB, where it is synchronized with the motor data collected by the proprioceptors of the robot. The resulting data stream, containing audio and motor data, is finally transmitted to an external PC via Ethernet, which is used for recording.
Scenario description
The recordings were conducted in a room with moderate reverberation (T_{60}=200 ms). We investigate egonoise of different right arm movements of the robot. Compared to movement noise of other body parts, egonoise of the arms has the most severe effect on the microphone signals due to the immediate closeness of the active joints to the microphones. In total, we recorded egonoise of three motion sequences:

Sequence I consists of repeating right arm waving movements, activating all six joints of the arm. The robot lifts the arm using the right shoulder pitch motor, while performing waving movements with the remaining five motors of the right arm.

Sequence II resembles Sequence I; however, the lifting of the arm is performed with randomly varying velocity and acceleration of the right shoulder pitch motor. The number of employed joints is identical to Sequence I.

Sequence III is a mixture of left and right arm movements where both left and right joints are controlled independently with varying speeds. Since movements of the left and right arm are considered, 12 joints are used in total, i.e., compared to Sequence I/II the number of joints is doubled.
While Sequence I is a relatively simple scenario due to its repetitive character, Sequence II and Sequence III are more challenging for a description by a dictionary. For Sequence II, the random accelerations of the right shoulder pitch motor result in a large variety of spectral patterns which must be captured by the dictionary. The same holds for egonoise of Sequence III, where the doubling of employed joints causes more spectral diversity.
The recorded egonoise was used for training the dictionary and evaluation, where the data for evaluation was not contained in the training data. In total, we recorded 60 s for each motion sequence and split the egonoise data such that approximately 30 s egonoise for the learning of D is available.
To evaluate the suppression performance, we consider a scenario in which a target source is talking to the robot. The robot is standing on the floor level while it performs different waving movements of the right arm. The microphones of the robot are at a height of 55 cm. For the speech signal, utterances from male and female speakers of the GRID corpus [27] were used. The loudspeaker was positioned at 1 m distance from the robot, at a height of 1 m. The recorded reverberant utterances were added to the egonoise with varying signal-to-noise ratios (SNRs) (see Section 3.3).
The audio signals are sampled at f_{S}=16 kHz and transformed to the STFT domain using a Hamming window of length 64 ms with overlap of 50 %. The internal operating system of the NAO robot saves motor data samples of all joints into an internal cache which can be accessed by the user. This cache is typically updated every 10 ms. Consequently, the sampling frequency of the motor data is given by f_{S}≈100 Hz, i.e., typically S_{ℓ}=6 motor data samples are available per time frame.
We evaluated the overall performance of the egonoise suppression in terms of signal-to-distortion ratio (SDR in dB) and signal-to-artifacts ratio (SAR in dB). For the computation of both, Matlab functions provided by [28] are used. In practice, it must be expected that the egonoise and speech estimates, i.e., DH and D_{S}H_{S}, contain estimation errors resulting in an imperfect enhancement filter F, cf. Eq. 14. As a consequence, the egonoise cannot be removed entirely from the mixture and/or the desired speech is distorted. The severity of both effects is reflected in the selected performance criteria SDR and SAR, respectively. While SDR measures the amount of remaining egonoise and speech distortion after processing, SAR considers introduced speech distortion only. For unprocessed data, the SDR corresponds to the SNR of the input mixture while SAR is infinite. Besides SDR and SAR, we also evaluate PESQ (perceptual evaluation of speech quality [29]). To obtain representative results, we averaged over 100 runs with random initialization of the matrices in NMF. Standard deviations for all results are given in brackets.
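The frame parameters implied by these settings can be worked out in a short calculation. The number of frequency bins assumes a one-sided spectrum with an FFT length equal to the window length, which is not stated in the text; the variable names are illustrative.

```python
fs_audio = 16_000          # audio sampling frequency in Hz
win_ms = 64                # STFT window length in ms
overlap = 0.5              # 50 % overlap between adjacent windows
motor_period_ms = 10       # typical update interval of the motor data cache in ms

win_len = fs_audio * win_ms // 1000        # 1024 samples per window
hop = int(win_len * (1 - overlap))         # 512 samples, i.e., a frame shift of 32 ms
n_bins = win_len // 2 + 1                  # 513 bins (assumes FFT length = window length)
motor_per_window = win_ms / motor_period_ms  # 6.4 motor data samples per window on average
```

An average of 6.4 updates per 64 ms window is consistent with typically six motor data samples per time frame, with an occasional seventh.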
Evaluation and discussion of the results
For evaluating the proposed method for egonoise of motion Sequences I, II, and III, the sizes of the egonoise dictionary and the speech dictionary were set to K=K_{S}=20 for Sequence I and K=30, K_{S}=20 for Sequences II and III. These parameters showed the best suppression performance in terms of SDR for audio only-based NMF on the respective egonoise recordings. We first discuss the choice of λ_{T} and λ_{E} and illustrate the effect of the regularization term \(\mathcal {R}\). Subsequently, we evaluate the suppression performance for different SNRs and finally discuss alternative choices for the motor data vector \(\bar {\boldsymbol {\alpha }}_{\ell }\).
Impact and choice of λ_{T} and λ_{E}
In Table 1, the suppression results for particular choices of λ_{T} and λ_{E} are given. First, we incorporate the motor data information only into the training (λ_{T}=0.9) and leave the suppression step unchanged λ_{E}=0. Compared to audio onlybased NMF (denoted as “NMF” in the following), this already shows a slight improvement of the results. For λ_{T}=0 and λ_{E}=19, we obtain significantly better results than for NMF, which shows that enforcing similar activations for similar physical states of the robot helps even if this constraint has not be learnt during the learning of the dictionary. Best results are obtained if the regularization term is included to both learning and suppression. Note that λ_{T} and λ_{E} are of different orders of magnitude, what will be further investigated and interpreted in Section 3.3.2.
The effect of the proposed regularization term is illustrated in Fig. 6. We consider two time frames ℓ and j of a mixture of egonoise (Sequence I) and speech. Frames ℓ and j are chosen such that W_{ℓj} is large, i.e., \(\bar {\boldsymbol {\alpha }}_{\ell }\) and \(\bar {\boldsymbol {\alpha }}_{j}\) are close indicating that the robot has similar physical states. Hence, similar activations are desired. Figure 6a shows elements of the activation vectors h_{ℓ} and h_{j} obtained if audio onlybased NMF is used. It is obvious that the activations differ significantly, which can be explained by the additional speech signal present in frames ℓ and j which affects the estimation of the egonoise activations. Figure 6 b illustrates the elements of h_{ℓ} and h_{j} estimated by proposed motor dataregularized NMF. Here, in contrast to audio onlybased NMF, the activations coincide even if additional speech is present.
For further illustration, Fig. 7 shows spectrograms of an egonoise extract and its estimates using audio-only NMF and the proposed method. Without motor data regularization, the speech signal leads to additional, undesired components in the egonoise estimate. With the proposed method, this effect is absent or only weakly pronounced.
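The behavior described above can be reproduced in a toy setting. The following is a minimal sketch, not the authors' implementation: it assumes a Euclidean NMF cost and the standard multiplicative update of graph-regularized NMF for a fixed dictionary, and it omits the speech dictionary of the semisupervised setup. The function names `estimate_activations` and `smoothness`, the binary affinity matrix, and all numerical values are hypothetical choices made for illustration.

```python
import numpy as np

def estimate_activations(V, D, W, lam, n_iter=200, eps=1e-12, seed=0):
    """Estimate nonnegative activations H with V ~ D @ H for a fixed dictionary D.

    Assumed cost: ||V - D H||_F^2 + lam * Tr(H L H^T), with graph Laplacian
    L = Deg - W, minimized by multiplicative updates (Euclidean cost assumed).
    """
    rng = np.random.default_rng(seed)
    H = rng.random((D.shape[1], V.shape[1])) + 0.1
    Deg = np.diag(W.sum(axis=1))          # degree matrix of the motor data graph
    DtV, DtD = D.T @ V, D.T @ D
    for _ in range(n_iter):
        numer = DtV + lam * (H @ W)
        denom = DtD @ H + lam * (H @ Deg) + eps
        H *= numer / denom                # keeps H nonnegative
    return H

def smoothness(H, W):
    """Tr(H L H^T): small if frames with large affinity have similar activations."""
    L = np.diag(W.sum(axis=1)) - W
    return float(np.trace(H @ L @ H.T))

# Toy data: two physical states, each with its own egonoise activation pattern.
rng = np.random.default_rng(1)
F, K, n_frames = 16, 4, 40
D = rng.random((F, K))
states = np.repeat([0, 1], n_frames // 2)
H_true = np.where(states == 0,
                  np.array([[1.0], [0.2], [0.0], [0.5]]),
                  np.array([[0.1], [1.0], [0.7], [0.0]]))
# Random interference stands in for the additional speech signal.
V = D @ H_true + 0.5 * rng.random((F, n_frames))

# Hypothetical binary affinity: frames in the same state are connected.
W = (states[:, None] == states[None, :]).astype(float)
np.fill_diagonal(W, 0.0)

H_audio = estimate_activations(V, D, W, lam=0.0)   # audio-only estimate
H_motor = estimate_activations(V, D, W, lam=5.0)   # motor data-regularized estimate
print(smoothness(H_audio, W), smoothness(H_motor, W))
```

With the regularization active, frames belonging to the same state are pulled toward a common activation pattern, so the interference has less influence on the individual activation vectors.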
The effect of varying λ_{E} and ε on the suppression result is illustrated in Fig. 8. For λ_{E}=0, the regularization is ineffective and the proposed method reduces to audio-only-based NMF (λ_{T}=0 holds during the learning of D). If λ_{E} is chosen too large, the effect of the motor data dominates and the suppression performance degrades. λ_{E}=19 appears to give the best result for the considered mixture. However, note that the optimal choice of λ_{E} depends on the SNR of the mixture, as will be discussed in more detail in Section 3.3.2. We now consider the suppression performance for a varying scale parameter ε, cf. Eq. 6. For ε→0, we obtain according to Eq. 6
\(W_{\ell j} \rightarrow 0,\)
i.e., all connections in the graph are set to zero. Accordingly, the regularization term in Eq. 10 equals zero and the results of the proposed method and audio-only-based NMF coincide. For increasing ε, Eq. 6 becomes less selective and the number of neighbors of a node with large affinity increases. For the setup in Fig. 8, the maximum SDR is obtained for ε=5·10^{−3}, which turned out to give robust performance even for egonoise of other movements. For larger ε, the suppression performance deteriorates since more and more connections between nodes obtain large weights and the discriminative nature of the graph is reduced.
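The role of the scale parameter can be illustrated numerically. The sketch below assumes a Gaussian (heat) kernel of the form W_{ℓj} = exp(−‖ᾱ_ℓ − ᾱ_j‖² / ε) as a stand-in for Eq. 6; the exact form used in the paper may differ. The function name `affinity_matrix` and the three example motor data vectors are hypothetical.

```python
import numpy as np

def affinity_matrix(alpha, eps):
    """Pairwise affinities for motor data vectors alpha of shape (n_frames, dim).

    Assumed kernel: W[l, j] = exp(-||alpha_l - alpha_j||^2 / eps), no self-loops.
    """
    diff = alpha[:, None, :] - alpha[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    W = np.exp(-sq_dist / eps)
    np.fill_diagonal(W, 0.0)
    return W

# Frames 0 and 1 are close in motor data space, frame 2 is far away.
alpha = np.array([[0.00, 0.0],
                  [0.05, 0.0],
                  [1.00, 0.0]])

for eps in (1e-6, 5e-3, 1e2):
    print(f"eps = {eps:g}")
    print(np.round(affinity_matrix(alpha, eps), 3))
```

For a very small ε all affinities vanish, for an intermediate ε only the two nearby frames are connected, and for a very large ε every pair of frames is connected with a weight close to one, so the graph no longer discriminates between physical states.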
Varying SNRs
So far, we only considered mixtures with constant SNR. In a typical human-robot interaction, however, the SNR changes due to, e.g., varying distances between the desired source and the robot or different power levels of the signal of interest. Therefore, robust egonoise suppression at different SNRs is of high importance.
In the following, we evaluate the proposed approach for SNR ∈ {±10, ±5, ±2, ±1, 0} dB of the input mixture. For this, we added scaled versions of the speech signal to the egonoise. Note that for the considered NAO robot, SNR = 10 dB is an unlikely scenario since it corresponds to a human-robot distance of only a couple of centimeters or a very loud human voice. We acknowledge, however, that for robots which emit quieter egonoise, such a high SNR could be realistic.
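Creating such mixtures amounts to choosing a gain for the speech signal. The following sketch shows one way to do this, assuming the SNR is defined as the ratio of the average speech power to the average egonoise power over the whole signal; the function name `mix_at_snr` and the white-noise stand-ins for speech and egonoise are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, egonoise, snr_db):
    """Scale `speech` so that the speech-to-egonoise power ratio equals snr_db.

    Returns the mixture and the scaled speech signal.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(egonoise ** 2)
    gain = np.sqrt(p_noise / p_speech * 10.0 ** (snr_db / 10.0))
    scaled = gain * speech
    return scaled + egonoise, scaled

# Synthetic stand-ins for the recorded signals.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
egonoise = 0.3 * rng.standard_normal(16000)

for snr_db in (-10, -5, -2, -1, 0, 1, 2, 5, 10):
    mixture, scaled = mix_at_snr(speech, egonoise, snr_db)
    achieved = 10 * np.log10(np.mean(scaled ** 2) / np.mean(egonoise ** 2))
    print(f"target {snr_db:+3d} dB, achieved {achieved:+.2f} dB")
```

The egonoise is left untouched and only the speech is rescaled, which mirrors the procedure described above.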
Results for egonoise Sequence I are given in Table 2. The right part of Table 2 summarizes the parameters λ_{T}, λ_{E}, and ε used for the proposed method. Interestingly, for SNR = −10 dB, the proposed method performs best if the regularization is ineffective. Consequently, it does not show any benefit compared to audio-only-based NMF at this SNR. For larger SNRs, SDR and SAR increase both for audio-only-based NMF and for the proposed method. Motor data-regularized NMF consistently shows superior performance, and its relative improvement over NMF increases with growing SNR, e.g., for SNR = +2 dB, a gain of 2 dB in SDR and 2.5 dB in SAR is achieved. This can be explained by the fact that for audio-only-based NMF, the estimation of the activations is more severely impaired by the additional speech signal, and this impairment becomes more pronounced for increasing SNR. Since motor data is a nonacoustic reference signal, the regularization term is not affected by the increasing power of the additional speech component. This also explains why almost no benefit could be observed for low SNR, where the additional speech signal has hardly any impact on the egonoise estimation.
While ε is constant for all SNRs, λ_{E} in particular has to be increased continuously for larger SNRs. In this way, the motor data-dependent regularization becomes more aggressive, compensating for the increasingly negative impact of the speech on the estimation of the egonoise activations.
We conducted the same experiments for egonoise caused by the motions of Sequence II and Sequence III. The results are summarized in Tables 3 and 4. In principle, the results obtained for Sequence I are confirmed: the proposed method consistently outperforms audio-only-based NMF, especially for high SNRs, and λ_{E} shows a significant dependence on the SNR. Interestingly, for Sequence II, the absolute values of λ_{E} have to be chosen slightly smaller than for Sequence I for optimum performance. Overall, the suppression results are ≈ 1 dB (Sequence II) and ≈ 0.5 dB (Sequence III) worse than for Sequence I, which can be explained by the more complex movements, cf. the movement description in Section 3.2.
Note that an optimal choice of λ_{E} requires knowledge of the SNR, for which a single-channel or multichannel SNR estimator can be employed. However, it must be expected that the SNR estimation is imperfect, leading to a suboptimal choice of λ_{E}. The resulting effect on the suppression performance is shown in Table 5. We chose λ_{E}=8.0 and λ_{T}=0.5, i.e., the optimal parameters for SNR = 1 dB, and evaluated the proposed method for SNRs of −1 dB, \(\dots \), 3 dB, simulating imperfect SNR estimates. Overall, a suboptimal parameter choice leads to a degradation of the suppression performance. However, the proposed method still shows superior results compared to audio-only-based NMF.
Alternative choices for \(\bar {\alpha }_{\ell }\)
In the previous experiments, the motor data vector \(\bar {\boldsymbol {\alpha }}_{\ell }\) was composed of the angular position and its first- and second-order temporal derivatives, i.e., angular velocity and acceleration. By complementing the angular position with these derivatives, we implicitly added temporal information to our model, since not only current but also past motor data samples are taken into account for the construction of \(\bar {\boldsymbol {\alpha }}_{\ell }\), cf. Eq. 1.
In the following, we evaluate how the performance of the proposed method depends on the amount of temporal context included into \(\bar {\boldsymbol {\alpha }}_{\ell }\).
Results are given in Table 6. First, we consider a motor data vector \(\bar {\boldsymbol {\alpha }}_{\ell }\) which contains only angular positions, i.e., no derivatives are used. The suppression result lags significantly behind audio-only-based NMF. This drop in performance is not surprising: from the angular position alone, it cannot be distinguished whether, e.g., the robot raises or lowers its arm if the upward and downward movements follow the same trajectory. Consequently, the egonoise caused by these two movements is assessed as similar. The results improve drastically if angular velocity is added to the motor data vector \(\bar {\boldsymbol {\alpha }}_{\ell }\). If angular acceleration is also included, the results improve further; however, the additional gain is clearly smaller than that of adding the first derivative. Adding higher-order derivatives to \(\bar {\boldsymbol {\alpha }}_{\ell }\) does not offer further benefit, as the results get slightly worse with an increasing number of derivatives incorporated into \(\bar {\boldsymbol {\alpha }}_{\ell }\).
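The ambiguity of position-only motor data can be made concrete with a small example. The sketch below assumes that the temporal derivatives are approximated by backward differences of the sampled joint angles, so that only current and past samples enter each vector; Eq. 1 of the paper may define them differently. The function name `motor_feature_vectors` and the single-joint trajectories are hypothetical.

```python
import numpy as np

def motor_feature_vectors(angles, n_derivatives=2, dt=1.0):
    """Stack joint angles with backward-difference approximations of their derivatives.

    angles: array of shape (n_frames, n_joints).
    Returns an array of shape (n_frames, n_joints * (n_derivatives + 1)).
    The first frame has no past sample, so its differences are set to zero.
    """
    feats = [angles]
    current = angles
    for _ in range(n_derivatives):
        current = np.diff(current, axis=0, prepend=current[:1]) / dt
        feats.append(current)
    return np.concatenate(feats, axis=1)

# One joint: raising the arm (0 -> 40 degrees) and lowering it along the same path.
t = np.arange(5, dtype=float)
up = (10.0 * t)[:, None]
down = up[::-1]

f_up = motor_feature_vectors(up, n_derivatives=1)
f_down = motor_feature_vectors(down, n_derivatives=1)

# At frame 2 both movements pass through the same angle but in opposite directions.
print(f_up[2], f_down[2])
```

With the position alone, the two frames are identical and would be connected by a large affinity in the motor data graph; adding the velocity separates them.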
Summary and outlook
In this paper, we proposed motor data-regularized NMF and used it in a semi-supervised manner for egonoise suppression.
The basic idea of the presented method is to improve the approximation of the egonoise by taking into account motor data describing the physical state of the robot. We propose to construct a motor data graph which encodes the similarities between motor data samples. Based on this, a regularization term can be derived and added to the conventional, audio-only-based NMF cost function. It enforces the activation of similar dictionary entries when the robot is in similar physical states. We evaluated the proposed method for mixtures of desired speech signals and egonoise of different movements and considered various SNRs of the mixture. The presented approach showed superior performance in all scenarios, especially for high SNR, when the power of the additional speech signal is large and the estimation of the egonoise activations based on audio data only is challenging. Consequently, the weighting of the motor data-dependent regularization term has to be increased for larger SNR.
For future work, we plan to evaluate the proposed method for other NMF cost functions, such as the Itakura-Saito and Kullback-Leibler divergences. Furthermore, we plan to evaluate the presented concept for multichannel NMF, where the dictionary-activation modeling of single-channel NMF is extended by a spatial covariance matrix for each atom and frequency bin.
References
K. Nakadai, T. Lourens, H. G. Okuno, H. Kitano, in Proc. 17th Nat. Conf. Artificial Intell. (AAAI). Active audition for humanoid (AAAI, Austin, TX, 2000), pp. 832–839.
H. G. Okuno, K. Nakadai, in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process. (ICASSP). Robot audition: its rise and perspectives (IEEE, South Brisbane, QL, Australia, 2015), pp. 5610–5614.
J. Even, H. Saruwatari, K. Shikano, T. Takatani, in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS). Semi-blind suppression of internal noise for hands-free robot spoken dialog system (IEEE, St. Louis, MO, 2009), pp. 658–663.
M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006).
A. Deleforge, W. Kellermann, in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process. (ICASSP). Phase-optimized K-SVD for signal extraction from underdetermined multichannel sparse mixtures (IEEE, South Brisbane, QL, Australia, 2015), pp. 355–359.
D. D. Lee, H. S. Seung, Learning the parts of objects by nonnegative matrix factorization. Nature. 401(6755), 788–791 (1999).
D. D. Lee, H. S. Seung, in Proc. 13th Int. Conf. Neural Inform. Process. Syst. (NIPS). Algorithms for nonnegative matrix factorization (NeurIPS, Denver, CO, 2000), pp. 535–541.
C. Févotte, J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence. Neural Comput. 23(9), 2421–2456 (2011).
T. Tezuka, T. Yoshida, K. Nakadai, in Proc. IEEE Int. Conf. Robotics and Automation (ICRA). Ego-motion noise suppression for robots based on semi-blind infinite nonnegative matrix factorization (IEEE, Florence, Italy, 2014), pp. 6293–6298.
H. Sawada, H. Kameoka, S. Araki, N. Ueda, Multichannel extensions of nonnegative matrix factorization with complex-valued data. IEEE/ACM Trans. Audio, Speech, Language Process. 21(5), 971–982 (2013).
T. Haubner, A. Schmidt, W. Kellermann, in Proc. ITG Fachtagung Sprachkommunikation. Multichannel nonnegative matrix factorization for egonoise suppression (VDE Verlag, Oldenburg, Germany, 2018), pp. 136–140.
Clean PNG, NAO, der humanoide Roboter. https://de.cleanpng.com/pngm5r7ur/. Accessed 20 May 2020.
A. Ito, T. Kanayama, M. Suzuki, S. Makino, in Proc. European Conf. Speech Communication and Technology (INTERSPEECH - Eurospeech). Internal noise suppression for speech recognition by small robots (ISCA, Lisbon, Portugal, 2005), pp. 2685–2688.
A. Schmidt, W. Kellermann, in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Process. (ICASSP). Informed egonoise suppression using motor data-driven dictionaries (IEEE, Brighton, UK, 2019), pp. 116–120.
Y. Nishimura, M. Ishizuka, K. Nakadai, M. Nakano, H. Tsujino, in Proc. IEEE/RAS Int. Conf. Humanoid Robots (Humanoids). Speech recognition for a humanoid with motor noise utilizing missing feature theory (IEEE, Cancun, Mexico, 2006), pp. 26–33.
G. Ince, K. Nakadai, T. Rodemann, Y. Hasegawa, H. Tsujino, J. Imura, in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS). Egonoise suppression of a robot using template subtraction (IEEE, St. Louis, MO, 2009), pp. 199–204.
G. Ince, K. Nakadai, T. Rodemann, Y. Hasegawa, H. Tsujino, J. Imura, in Proc. IEEE Int. Conf. Robotics and Automation (ICRA). A hybrid framework for ego noise cancellation of a robot (IEEE, Anchorage, AK, 2010), pp. 3623–3628.
A. Schmidt, A. Deleforge, W. Kellermann, in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS). Egonoise reduction using a motor data-guided multichannel dictionary (IEEE, Daejeon, South Korea, 2016), pp. 1281–1286.
D. Cai, X. He, X. Wu, J. Han, in Proc. 8th IEEE Int. Conf. on Data Mining. Nonnegative matrix factorization on manifold (IEEE, Pisa, Italy, 2008), pp. 63–72.
D. Cai, X. He, J. Han, T. S. Huang, Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. and Mach. Intell. 33(8), 1548–1560 (2011).
M. N. Schmidt, J. Larsen, F. T. Hsiao, in Proc. IEEE Workshop Mach. Learning Signal Process. Wind noise reduction using nonnegative sparse coding (IEEE, Thessaloniki, Greece, 2007), pp. 431–436.
U. von Luxburg, A tutorial on spectral clustering. Statistics and Computing. 17(4), 395–416 (2007).
F. R. K. Chung, Spectral graph theory, 1st edn, vol. 1 (American Mathematical Soc., Providence, RI, 1997).
M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Research. 7, 2399–2434 (2006).
M. Belkin, Problems of learning on manifolds. PhD Thesis (The University of Chicago, Chicago, 2003).
Seventh Framework Programme, ‘Embodied Audition for RobotS’ (EARS). https://robotears.eu/. Accessed 25 Sept 2018.
M. Cooke, J. Barker, An audiovisual corpus for speech perception and automatic speech recognition. J. Acoustical Society of America. 120(5), 2421–2424 (2006).
C. Févotte, R. Gribonval, E. Vincent, in Technical Report 1706. BSS EVAL toolbox user guide (IRISA, Rennes, France, 2005). Software available at http://www.irisa.fr/metiss/bsseval/.
ITU-T Recommendation P.862.2: Wideband extension to recommendation P.862 for the assessment of wideband telephone networks and speech codecs. Recommendation, ITU (November 2007).
Funding
This work was partially supported by the DFG under contract no <Ke890/102> within the Research Unit FOR2457 “Acoustic Sensor Networks”.
Author information
Contributions
AS has conducted the research on this paper. AB, TH, and WK contributed valuable feedback on the conceptual idea and assisted the work intensively. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Schmidt, A., Brendel, A., Haubner, T. et al. Motor dataregularized nonnegative matrix factorization for egonoise suppression. J AUDIO SPEECH MUSIC PROC. 2020, 11 (2020). https://doi.org/10.1186/s13636020001780
DOI: https://doi.org/10.1186/s13636020001780
Keywords
 Egonoise
 Motor data
 Robot audition
 Humanoïd robot