Motor data-regularized nonnegative matrix factorization for ego-noise suppression

Ego-noise, i.e., the noise a robot causes by its own motions, significantly corrupts the microphone signal and severely impairs the robot's capability to interact seamlessly with its environment. Therefore, suitable ego-noise suppression techniques are required. For this, it is intuitive to also use motor data collected by proprioceptors mounted to the joints of the robot, since it describes the physical state of the robot and provides additional information about the ego-noise sources. In this paper, we use a dictionary-based approach for ego-noise suppression in a semi-supervised manner: first, an ego-noise dictionary is learned and subsequently used to estimate the ego-noise components of a mixture by computing a weighted sum of dictionary entries. The estimation of the weights is very sensitive to other signals besides ego-noise contained in the mixture. For increased robustness, we therefore propose to incorporate knowledge about the physical state of the robot into the estimation of the weights. This is achieved by introducing a motor data-based regularization term into the estimation problem which promotes similar weights for similar physical states. The regularization is derived by representing the motor data as a graph and imprints the intrinsic structure of the motor data space onto the dictionary model. We analyze the proposed method, evaluate its ego-noise suppression performance for a large variety of different movements, and demonstrate the superiority of the proposed method compared to an approach without using motor data.


Introduction
Microphone-equipped robots are exposed to various kinds of noise, specifically to self-created noise, which is referred to as ego-noise in the following. It is caused by the robot's electrical and mechanical components such as rotating motors and joints as well as the moving body parts. Ego-noise is a crucial problem in robot audition [1,2] since it severely corrupts the recorded microphone signals and impairs the robot's capability to react to unanticipated acoustic events. For this reason, ego-noise suppression is a crucial preprocessing step in robot audition. Ego-noise suppression is particularly challenging for several reasons. First, ego-noise is usually louder than other signals of interest, e.g., a desired speech signal ("target"), since the ego-noise sources are typically located in immediate proximity of the microphones. For example, for the humanoid robot NAO™, which we will use as experimental platform in this paper, the microphones are mounted to the head of the robot. Thereby, they are only a few centimeters away from the shoulder motors and joints, cf. Fig. 1. Another challenging aspect of ego-noise is that it cannot be modeled as a single static point interferer, as the joints are located all over the body of the robot and the resulting structure-borne sound is transduced to air not just at isolated points. Furthermore, ego-noise is highly non-stationary since typically different movements are performed successively with varying speeds and accelerations.

Fig. 1 Illustration of a typical human-robot interaction scenario. The desired source ("target") is shown on the left. Two exemplary, spatially distributed ego-noise sources are shown in blue. Besides, typical distances between the sources and the microphones are given. Obviously, the ego-noise sources are located close to the microphones, while the distance between target and microphones is typically larger. Image of robot taken from [12]
One of the first approaches for ego-noise suppression goes back to the SIG humanoid robot [1], which was equipped with microphones mounted inside the robot's housing near the motors in order to record ego-noise reference signals. These signals were subsequently used as reference for adaptive filtering-based ego-noise cancelation. Interestingly, these reference signals were interpreted as additional auditory perception channels of the robot. Internal microphones were also used in [3] for speech enhancement for a human-robot dialog system. In this approach, the recorded reference signals are incorporated into a frequency-domain semi-blind source separation algorithm with subsequent multichannel Wiener filtering.
In many robot designs, it is not possible to mount additional reference microphones inside the robot due to space and hardware constraints. Furthermore, a potentially large number of internal microphones are required to obtain reference signals for each ego-noise source. This drawback motivates approaches which operate on the external microphone signals only. Here, it can be exploited that ego-noise exhibits a characteristic structure in the Short-Time Fourier Transform (STFT) domain. Due to the limited number of degrees of freedom for the movements of the robot, those spectral patterns cannot be arbitrarily diverse. These two properties motivate the use of learning-based dictionary methods where the ego-noise signals are approximated by prototype signals, so-called atoms, which are collected in a dictionary. Then, for each time frame, a linear combination of atoms has to be found which optimally fits the current ego-noise signal with respect to the chosen criterion. An example for such a dictionary learning algorithm is K-SVD [4], which has been applied for multichannel ego-noise suppression in [5]. Another widely used approach to train a dictionary is nonnegative matrix factorization (NMF) [6][7][8]. For NMF, the dictionary is restricted to nonnegative elements only, which is well-suited to model power spectral densities (PSDs) of acoustic sources. A corresponding approach for ego-noise suppression has been investigated, e.g., in [9]. The concept of NMF has been extended for multichannel recordings [10] by augmenting the (nonnegative) source model with an additional spatial model. This has been applied for ego-noise suppression in [11].
Besides methods using the audio modality only (referred to as audio only-based methods in the following), other ego-noise suppression approaches use knowledge about motor information given by, e.g., motor commands or motor data such as engine rotation frequency, joint angles, or angular velocities collected by proprioceptors. The advantage of using motor data compared to motor commands is that the emitted ego-noise is directly related to the instantaneous internal state of the robot, measured by motor data. Since a robot is not a fully deterministic system, this measured state may be significantly different from the target state defined by the motor command.
A sufficiently accurate analytical model of the dependency between motor data and emitted ego-noise can usually not be obtained since the mechanical dependencies and interactions between structure-borne and airborne sounds are highly complex. Therefore, current ego-noise suppression approaches model these dependencies entirely or partly by learning-based strategies. For example, in [13], a neural network-based approach is used to predict the PSDs of ego-noise caused by the Aibo™ robot. The feedforward neural network, consisting of two hidden layers containing thirty nodes each, is fed with angular position and velocity data of current and past time frames. The PSD estimates are subsequently used for spectral subtraction, which was shown to result in a significant improvement of speech recognition rates. In [14], it is demonstrated that the harmonic structure of ego-noise can be estimated using motor data. This prior knowledge is incorporated into a single-channel NMF-based ego-noise model. It is proposed to approximate the currently observed ego-noise spectrogram by combining elements from a dictionary D_H which models the harmonic structure and another dictionary D_R which captures the residual part of the ego-noise. The benefit of this approach is that only D_R requires a prior learning step while D_H is completely motor data-driven. It is shown that the proposed approach significantly outperforms an audio only-based method for the suppression of ego-noise that is not well represented in the training data. Although this approach is close to the proposed method from a methodical point of view, it aims in a different direction since it explicitly addresses the suppression of ego-noise if training and test data are unbalanced. This is not the case in the scenario considered in this paper.
Other popular methods for ego-noise suppression combining audio and motor information are template-based approaches. Here, the key idea is to save the characteristic spectral shape of the ego-noise as PSD templates in a database. In [15], each template is associated with a motor command which triggered the current movement. Based on this, during application, matching templates are identified and temporally aligned to the recorded signal. An alternative template-based approach was presented in [16,17], where motor data instead of motor commands are used to identify the templates in the database. For a current motor data sample, the nearest neighbor in the motor data space is searched and the associated template is used as the ego-noise estimate.
The concept to associate motor data with ego-noise templates was adopted in [18]. However, there, motor data samples are linked to a set of atoms from a learned dictionary-based ego-noise model. Nonlinear classifiers in the motor data space are used to associate a motor data sample to a set of atoms, whose elements are subsequently combined to approximate the current ego-noise recording. Thereby, the classifiers replace the expensive iterative search for atoms in the dictionary.
In this paper, the idea of choosing atoms depending on motor data is adopted from [18], and we propose to expand the conventional, audio only-based NMF model by a motor data-dependent regularization term, which promotes similar atom activations in those time frames in which similar motor data is measured. The proposed regularization term is derived from a graph structure which encodes the similarity between the motor data samples. While the main benefit of the method in [18] was a reduction of computational complexity, the approach presented in this paper results in a significant performance improvement. The proposed method is inspired by graph-regularized NMF [19,20], which was proposed in the context of clustering and classification of text documents. There, the NMF model and the regularization operate in the same data space. In this work, however, we learn an NMF model on acoustic data while the regularization encodes the geometry of the motor data space. Thus, we combine an acoustic model with non-acoustic reference information.
This paper is structured as follows. In Section 2.1, we describe the used motor data. After succinctly introducing NMF in Section 2.2, we present the novel motor data-regularized NMF in Section 2.3. There, we first describe the construction of the motor data graph structure in Subsection 2.3.1 and derive the proposed regularization term in Subsection 2.3.2. Then, the modified NMF optimization problem is formulated and corresponding update rules are presented in Subsection 2.3.3. The resulting novel ego-noise suppression algorithm is summarized in Section 2.4 and its efficacy is demonstrated in Section 3.

Motor data-regularized NMF for ego-noise suppression
In the following, we consider the bin-wise squared magnitude of a single-channel microphone signal in the STFT domain, represented as a spectrogram Y = [y_1, . . . , y_L] ∈ R_+^(F×L), where F is the number of frequency bins and L is the number of considered time frames.

Motor data descriptions and definitions
The physical state of a robot can be described by motor data, collected by proprioceptors providing angular position information of the joints driven by the motors. In the following, we consider a robot which is equipped with M proprioceptors, indexed by m = 1, . . . , M, each capturing one angle of a joint. We denote the s-th observed angular position in STFT frame ℓ for proprioceptor m by α_(ℓ,m)^(s) ∈ R. Within frame ℓ, a total number of S_ℓ motor data samples is observed, i.e., s = 1, . . . , S_ℓ. In this paper, we account for the fact that the motor data is not necessarily synchronized with the audio data recording, so that for a fixed observation interval for the audio data, the number of motor data samples may vary, i.e., S_ℓ may change with ℓ. This is specifically the case for the NAO robot used for the experiments in this paper.
Depending on the kind of ego-noise, only a subset of proprioceptors is relevant for ego-noise suppression. For example, if only ego-noise caused by arm movements is present, only motor data of the arm joints are required. In the following, we denote the index set of relevant proprioceptors for these joints by M.

Fig. 2 Illustration of data collection and processing. We consider a robot for which motor data of 23 joints and a single-channel microphone signal are recorded. The ℓ-th STFT frame is processed. Note that the total number of motor data samples can vary from frame to frame, c.f. Section 2.1. Right: Example spectrogram for right arm shoulder ego-noise and corresponding normalized motor data samples (angular position and angular velocity) of this joint
From the data collected by proprioceptor m, the instantaneous angular velocity can be estimated by

α̇_(ℓ,m)^(s) = ( α_(ℓ,m)^(s) − α_(ℓ,m)^(s−1) ) / T_ℓ^(s),

where T_ℓ^(s) denotes the time difference between the adjacent observations α_(ℓ,m)^(s) and α_(ℓ,m)^(s−1). Note that for s = 1, α_(ℓ,m)^(0) is chosen to be the last angular sample of the previous frame ℓ − 1. Analogously, the angular acceleration α̈_(ℓ,m)^(s) is estimated from adjacent angular velocities. To associate each spectrogram frame y_ℓ with a single motor data sample, we propose to first compute the arithmetic average of all S_ℓ angular positions in STFT frame ℓ,

ᾱ_(ℓ,m) = (1/S_ℓ) Σ_(s=1)^(S_ℓ) α_(ℓ,m)^(s).

We proceed analogously for angular velocity and acceleration and obtain the corresponding frame averages. We then concatenate the averaged angular data for all considered proprioceptors in a feature vector ᾱ_ℓ, which we will refer to as the motor data vector for frame ℓ in the following. The left part of Fig. 2 exemplarily illustrates the described preprocessing of the data.
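For illustration, the preprocessing of one frame might look as follows. All names are our own, and as a simplification the accelerations are formed from the velocities within the frame only (the paper handles the first velocity sample analogously to the first position sample):

```python
import numpy as np

def motor_data_vector(alpha, t, alpha_prev, t_prev):
    """Compute the motor data vector for one STFT frame (sketch).

    alpha      : (S, M) angular positions of the S samples in this frame
    t          : (S,) corresponding time stamps in seconds
    alpha_prev : (M,) last angular sample of the previous frame
    t_prev     : time stamp of alpha_prev
    """
    # Prepend the last sample of the previous frame so the first
    # finite difference (s = 1) is well-defined.
    a = np.vstack([alpha_prev, alpha])            # (S+1, M)
    tt = np.concatenate([[t_prev], t])            # (S+1,)
    dt = np.diff(tt)[:, None]                     # time differences T^(s), (S, 1)
    vel = np.diff(a, axis=0) / dt                 # angular velocities, (S, M)
    acc = np.diff(vel, axis=0) / dt[1:]           # angular accelerations, (S-1, M)
    # Arithmetic averages over the frame, concatenated into one feature vector.
    return np.concatenate([alpha.mean(0), vel.mean(0), acc.mean(0)])
```

For M joints this yields a vector of length 3M per frame, combining averaged position, velocity, and acceleration.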

NMF for ego-noise suppression
In the following, we briefly summarize NMF. We introduce succinctly how semi-supervised NMF can be used for ego-noise suppression and explain the main drawback of the known approach before we introduce the proposed motor data-based regularization. The objective of NMF is to approximate the nonnegative matrix Y, i.e., a matrix whose elements are all larger than or equal to zero, by a product of two nonnegative matrices D ∈ R_+^(F×K) and H ∈ R_+^(K×L), where D is referred to as dictionary and H as activation matrix [8,21]. This approach can be interpreted as approximating each column of Y by a weighted sum of the columns of D (the so-called atoms or bases), where the weights are given by the corresponding column entries of H. K is referred to as the size of the dictionary and describes the number of atoms in D. Typically, K ≪ F, L holds, i.e., NMF can be considered as a compact representation of the data.
The factorization is achieved by minimizing a cost function which penalizes the dissimilarity between Y and the approximation Ŷ = DH defined by the model parameters D, H. Typically, the cost function is applied element-wise to the elements of the matrices Y and Ŷ. In this paper, we consider the Euclidean distance between Y and Ŷ as cost function,

J(D, H) = ‖Y − DH‖²_F,      (4)

yielding the optimization problem

min_(D,H) J(D, H)  subject to  D, H ≥ 0,      (5)

where ‖·‖_F denotes the Frobenius norm and D, H ≥ 0 means that all elements of D and H are larger than or equal to zero, ensuring nonnegativity. The optimization problem in Eq. 5 is typically solved using iterative updates alternating between D and H such that the nonnegativity of D, H is implicitly guaranteed if they are initialized with positive values. The update rules can be derived based on, e.g., the majorization-minimization principle or heuristic approaches [7,8].
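As a reference point, a minimal sketch of conventional audio only-based NMF with the standard multiplicative updates for the Euclidean cost [7,8]; the function name and the small constant guarding the divisions are our own implementation choices:

```python
import numpy as np

def nmf(Y, K, n_iter=200, eps=1e-12, rng=None):
    """Multiplicative-update NMF minimizing ||Y - DH||_F^2 (sketch).

    Y : (F, L) nonnegative spectrogram, K : dictionary size.
    D and H are initialized with positive values, so the multiplicative
    updates implicitly preserve nonnegativity.
    """
    rng = np.random.default_rng(rng)
    F, L = Y.shape
    D = rng.random((F, K)) + eps
    H = rng.random((K, L)) + eps
    for _ in range(n_iter):
        H *= (D.T @ Y) / (D.T @ D @ H + eps)   # activation update
        D *= (Y @ H.T) / (D @ H @ H.T + eps)   # dictionary update
    return D, H
```

In practice one would additionally monitor the cost to decide when to stop iterating.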
For ego-noise suppression, we apply a semi-supervised, two-stage strategy [21], c.f. Section 2.4: first, we use audio data containing ego-noise only and train an ego-noise dictionary. Then, given a mixture of ego-noise and speech, the dictionary elements remain constant and only their activations are estimated. For this, again, the same iterative update rules are used, which have been shown to be sensitive to the additional speech signal. As a consequence, the atom activations are no longer estimated correctly. For improved robustness, we therefore propose to extend this audio only-based estimation of the activations by also taking the physical state of the robot, measured in terms of motor data, into account. Thus, the estimation of the activations is additionally guided by reference information which is completely unaffected by the speech signal.

Motor data-regularized NMF
The basic idea of our approach is that activations should be similar if the physical state of the robot is similar. For this, we measure the similarity between robot states in frames ℓ and j by comparing the motor data vectors ᾱ_ℓ and ᾱ_j and enforce similar activations h_ℓ and h_j if ᾱ_ℓ and ᾱ_j are close. This will be achieved by imprinting the intrinsic geometry of the motor data space onto the NMF cost function. Results from spectral graph theory [22,23] and manifold learning theory [24] have shown that the local geometric structure of given data points can be modeled using an undirected graph. Based on these results, we first introduce a motor data-based graph structure and subsequently summarize how a regularization term, enforcing similar activations for similar motor data, can be derived. We then reformulate the NMF optimization problem Eq. 5 and present corresponding update rules for its minimization.

Motor data graph structure
In the following, we define a graph where the motor data vectors ᾱ_1, . . . , ᾱ_L constitute the nodes. The edges connecting the nodes are assumed to be bidirectional, i.e., we obtain an undirected graph. A part of an exemplary graph is illustrated in Fig. 3. The edge which connects the nodes ᾱ_ℓ and ᾱ_j has weight W_ℓj = W_jℓ and should reflect the affinity between the two motor data points. Depending on the considered scenario, numerous measures have been proposed to quantify the affinity between ᾱ_ℓ and ᾱ_j [22], e.g., nearest-neighbor or dot-product weighting. In this paper, we determine the weight W_ℓj using a Gaussian kernel with scale parameter ε ∈ R_+,

W_ℓj = exp( −‖ᾱ_ℓ − ᾱ_j‖² / ε ).      (6)

The larger W_ℓj, the higher the affinity between the two motor data samples, and we obtain W_ℓj = 1 if ᾱ_ℓ = ᾱ_j. Note that by adjusting ε, the connectivity of the graph can be controlled, e.g., for larger ε, the neighbors of a node are connected with a larger weight. Therefore, ε can be used to control the reach of the local neighborhood of a node. Based on the affinity weights, we define the affinity matrix W ∈ R_+^(L×L) with elements W_ℓj and the diagonal degree matrix Z with entries Z_ℓℓ = Σ_j W_ℓj.
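The graph construction can be sketched as follows, assuming the Gaussian kernel of Eq. 6; the function name and the joint return of the affinity and degree matrices are our own choices:

```python
import numpy as np

def affinity_matrix(A, eps_scale):
    """Gaussian-kernel affinities between motor data vectors (sketch).

    A : (L, d) matrix whose rows are the motor data vectors.
    Returns the symmetric affinity matrix W (Eq. 6) and the diagonal
    degree matrix Z with Z_ll = sum_j W_lj.
    """
    # Pairwise squared Euclidean distances via broadcasting.
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / eps_scale)
    Z = np.diag(W.sum(axis=1))
    return W, Z
```

By construction, W is symmetric, has unit diagonal, and identical motor data vectors obtain affinity 1.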

Motor data-based regularization term
The derivation of the regularization term is based on results from [24,25]. It is assumed that the considered motor data lie on a Riemannian manifold A. We are looking for a mapping f : A → R, which can be interpreted as a mapping from the manifold to a line. f should preserve the local geometry of the manifold, i.e., close points on the manifold should be mapped to close points on the line. This implies that f is allowed to vary only smoothly for similar arguments. Appropriate mappings f can be obtained by an optimization on the manifold, which can be discretely approximated on the motor data graph by searching for an f which minimizes

(1/2) Σ_ℓ Σ_j ( f(ᾱ_ℓ) − f(ᾱ_j) )² W_ℓj,      (7)

where f is a function of the nodes of the graph [24,25].
To exploit the geometric information of the motor data manifold for the estimation of the activation vectors, we manipulate Eq. 7 and replace the abstract mapping f by the activation of atom k,

R_k = (1/2) Σ_ℓ Σ_j ( h_kℓ − h_kj )² W_ℓj,      (8)

where h_kℓ denotes the ℓ-th element of h_k, i.e., h_kℓ is the scaling of atom k in time frame ℓ. The regularization term R_k needs to be minimized jointly with Eq. 5 with respect to the activations for every atom k, c.f. Section 2.3.3. Note that the motor data-based regularization R_k implicitly influences also the structure of the dictionary elements since the optimized activations directly affect the update of D.
Note that in Eq. 8, the affinities W_ℓj can be interpreted as weighting parameters: if two motor data vectors ᾱ_ℓ and ᾱ_j are similar, W_ℓj is close to one according to Eq. 6, and the minimization of Eq. 8 enforces similar h_kℓ and h_kj. Using the parameters defined in Section 2.3.1, Eq. 8 can be directly related to the so-called graph Laplacian L = Z − W [22],

R_k = h_k L h_k^T,      (9)

where h_k denotes the k-th row of H. Summing over all atoms results in the final regularization term

R = Σ_k R_k = tr(H L H^T),      (10)

where tr(·) denotes the trace operator.
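The equivalence between the pairwise form of Eq. 8 and the Laplacian/trace form of Eq. 10 can be checked numerically (with the conventional factor 1/2 in the pairwise sum); this is merely a sanity check, not part of the algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8                                     # number of frames in this toy example
W = rng.random((n, n))
W = 0.5 * (W + W.T)                       # symmetric affinities
Lap = np.diag(W.sum(1)) - W               # graph Laplacian L = Z - W
h = rng.random(n)                         # activations of one atom over frames

# Pairwise smoothness penalty (Eq. 8) equals the quadratic Laplacian form.
pairwise = 0.5 * sum((h[l] - h[j]) ** 2 * W[l, j]
                     for l in range(n) for j in range(n))
assert np.isclose(pairwise, h @ Lap @ h)

# Summing the per-atom terms equals the trace form over the full H.
H = rng.random((4, n))                    # activation matrix (K x n)
assert np.isclose(np.trace(H @ Lap @ H.T),
                  sum(hk @ Lap @ hk for hk in H))
```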

Motor data-regularized NMF
The derived regularization term Eq. 10 can be directly included into Eq. 4. We obtain the modified optimization problem

min_(D,H ≥ 0)  ‖Y − DH‖²_F + λ tr(H L H^T),      (11)

where λ ≥ 0 controls the influence of the motor data-based regularization. For minimization, we form the partial derivatives with respect to D and H in Eq. 11 and obtain the iterative update rules [19,20]

[D]_fk ← [D]_fk · [Y H^T]_fk / [D H H^T]_fk,      (12)

[H]_kℓ ← [H]_kℓ · [D^T Y + λ H W]_kℓ / [D^T D H + λ H Z]_kℓ,      (13)

where [D]_fk selects the fk-th element from D. Similar to conventional NMF, the iterative updates can be stopped, e.g., after a fixed number of iterations. In this paper, we additionally compute the cost according to Eq. 11 in each iteration and terminate the updates of Eqs. 12 and 13 after convergence.
Eqs. 12 and 13 reduce to the conventional update rules for NMF if λ = 0 [8]. Note that since the proposed method aims at enforcing similar activations for close motor data vectors, the regularization has an effect on the update rule for H only, while the update for D is unaffected.
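Putting Eqs. 11–13 together, the following sketch alternates the two multiplicative updates. The function name, the fixed iteration count (the paper instead monitors the cost of Eq. 11), and the small division guard are our own choices:

```python
import numpy as np

def md_reg_nmf(Y, K, W, lam, n_iter=200, D=None, eps=1e-12, rng=None):
    """Motor data-regularized NMF (sketch): minimize
    ||Y - DH||_F^2 + lam * tr(H L H^T), with L = Z - W built from the
    motor data affinities W. If a dictionary D is passed in, it is kept
    fixed, which realizes the semi-supervised suppression step."""
    rng = np.random.default_rng(rng)
    F, n_frames = Y.shape
    Z = np.diag(W.sum(axis=1))
    update_D = D is None
    if update_D:
        D = rng.random((F, K)) + eps
    H = rng.random((K, n_frames)) + eps
    for _ in range(n_iter):
        # The regularization enters the activation update only ...
        H *= (D.T @ Y + lam * H @ W) / (D.T @ D @ H + lam * H @ Z + eps)
        if update_D:
            # ... while the dictionary update is the conventional NMF rule.
            D *= (Y @ H.T) / (D @ H @ H.T + eps)
    return D, H
```

Calling it with D = None and λ = λ_T realizes the learning step; calling it with a pretrained, fixed D and λ = λ_E realizes the suppression step of Section 2.4.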

Proposed algorithm for ego-noise suppression
As mentioned in Section 2.2, we apply a semi-supervised, two-stage strategy for ego-noise suppression [21]. We first employ audio data containing ego-noise only and train D, imprinting the intrinsic geometry of the motor data space onto the model using the proposed regularization. Given a mixture of ego-noise and speech, we use D to model and suppress the current ego-noise and to obtain a speech estimate. In the following, we describe the proposed algorithm for ego-noise suppression in detail, c.f. Fig. 4 for an overview.
• Learning D: As input, spectrograms Y = [y_1, . . . , y_L] containing ego-noise only are given. For each spectrogram frame y_ℓ, a motor data vector ᾱ_ℓ is computed. ᾱ_ℓ, ℓ = 1, . . . , L, is used to construct the affinity and degree matrices W and Z, respectively. Subsequently, the update rules Eqs. 12 and 13 are used to compute the dictionary D, where the introduced regularization term is weighted by λ_T.
• Ego-noise suppression: Another dictionary D_S of size K_S with corresponding activation matrix H_S is initialized to model the additional speech signal in the considered mixture Y. Analogously to the learning step before, W and Z are constructed from the new motor data vectors, possibly representing different movements.
Using the same update rules as before, D_S, H, and H_S are updated while D remains constant. The motor data-based regularization term is weighted by λ_E. Note that for optimizing the activations of the speech model H_S, we set λ_E = 0 since the motor data-based regularization should affect only the estimation of the ego-noise activations. After identifying the optimum model parameters captured by D_S, H, and H_S, we use a spectral enhancement filter to obtain an estimate for the desired speech signal, Ŷ_(S,f) = F_f · Y_f for the f-th bin, where the enhancement filter is given by

F_f = [D_S H_S]_f / ( [D H]_f + [D_S H_S]_f ),      (14)

with [·]_f denoting the f-th row of the respective PSD estimate and the division performed element-wise. Note that typically λ_E ≠ λ_T holds, i.e., the regularization terms in both steps have different weights. This is further detailed in the following section.
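The enhancement step can be sketched as follows. We assume the common Wiener-type construction of the filter from the two model PSDs, consistent with the description of Eq. 14; the function name and the division guard are our own:

```python
import numpy as np

def speech_estimate(Y, D, H, D_S, H_S, eps=1e-12):
    """Wiener-type spectral enhancement (sketch).

    Y           : (F, L) mixture spectrogram (squared magnitudes)
    D, H        : ego-noise dictionary and activations
    D_S, H_S    : speech dictionary and activations
    Returns the enhanced speech spectrogram estimate.
    """
    N_hat = D @ H          # ego-noise PSD estimate
    S_hat = D_S @ H_S      # speech PSD estimate
    F_mask = S_hat / (S_hat + N_hat + eps)   # bin-wise enhancement filter
    return F_mask * Y
```

When the ego-noise model explains no energy, the mask approaches one and the mixture passes unchanged; when the speech model explains no energy, the output is driven to zero.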

Experimental evaluation
In the following, we evaluate the proposed method using real microphone recordings. We first describe the hardware setup, the synchronization of audio and motor data and the recording scenarios, and introduce the evaluation metrics. Then, we present suppression results for egonoise of different movements and discuss the influence of crucial parameters.

Recording setup
We conducted our experiments with a commercially available NAO H25 robot [12]. For the audio recordings, we used a self-constructed head [26] with a microphone array of 12 sensors. For all following experiments, we used the frontmost microphone. Since the NAO platform does not provide built-in synchronization at the audio sample level, we developed a synchronization scheme which is illustrated in Fig. 5: the microphone signals are fed into an external analog-to-digital (A/D) converter using conventional phone connectors (IEC 60603-11). The sampled data is forwarded to the robot's internal CPU via USB, where it is synchronized with the motor data collected by the proprioceptors of the robot. The resulting data stream, containing audio and motor data, is finally transmitted to an external PC via Ethernet, which is used for recording.

Scenario description
The recordings were conducted in a room with moderate reverberation (T_60 = 200 ms). We investigate ego-noise of different right arm movements of the robot. Compared to movement noise of other body parts, ego-noise of the arms has the most severe effect on the microphone signals due to the immediate closeness of the active joints to the microphones. In total, we recorded ego-noise of three motion sequences:

• Sequence I consists of repeating right arm waving movements, activating all six joints of the arm. The robot lifts the arm using the right shoulder pitch motor, while performing waving movements with the remaining five motors of the right arm.
• Sequence II resembles Sequence I; however, the lifting of the arm is performed with randomly varying velocity and acceleration of the right shoulder pitch motor. The number of employed joints is identical to Sequence I.
• Sequence III is a mixture of left and right arm movements where both left and right joints are controlled independently with varying speeds. Since movements of the left and right arm are considered, 12 joints are used in total, i.e., compared to Sequence I/II, the number of joints is doubled.
While Sequence I is a relatively simple scenario due to its repetitive character, Sequence II and Sequence III are more challenging for a description by a dictionary. For Sequence II, the random accelerations of the right shoulder pitch motor result in a large variety of spectral patterns which must be captured by the dictionary. The same holds for ego-noise of Sequence III, where the doubling of employed joints causes more spectral diversity.
The recorded ego-noise was used for training the dictionary and evaluation, where the data for evaluation was not contained in the training data. In total, we recorded 60 s for each motion sequence and split the ego-noise data such that approximately 30 s ego-noise for the learning of D is available.
To evaluate the suppression performance, we consider a scenario in which a target source is talking to the robot. The robot is standing on the floor while it performs different waving movements of the right arm. The microphones of the robot are at a height of 55 cm. For the speech signal, utterances from male and female speakers of the GRID corpus [27] were used. The loudspeaker was positioned at 1 m distance from the robot, at a height of 1 m. The recorded reverberant utterances were added to the ego-noise at varying signal-to-noise ratios (SNRs) (see Section 3.3).
The audio signals are sampled at f_S = 16 kHz and transformed to the STFT domain using a Hamming window of length 64 ms with an overlap of 50%. The internal operating system of the NAO robot saves motor data samples of all joints into an internal cache which can be accessed by the user. This cache is typically updated every 10 ms. Consequently, the sampling frequency of the motor data is approximately 100 Hz, i.e., typically S_ℓ = 6 motor data samples are available per time frame.
We evaluated the overall performance of the ego-noise suppression in terms of signal-to-distortion ratio (SDR in dB) and signal-to-artifacts ratio (SAR in dB). For the computation of both, the Matlab functions provided by [28] are used. In practice, it must be expected that the ego-noise and speech estimates, i.e., DH and D_S H_S, contain estimation errors resulting in an imperfect enhancement filter F, cf. Eq. 14. As a consequence, the ego-noise cannot be removed entirely from the mixture and/or the desired speech is distorted. The severity of these two effects is reflected in the selected performance criteria SDR and SAR, respectively. While SDR measures the amount of remaining ego-noise and speech distortion after processing, SAR considers introduced speech distortion only. For unprocessed data, the SDR corresponds to the SNR of the input mixture while SAR is infinite. Besides SDR and SAR, we also evaluate PESQ (perceptual evaluation of speech quality [29]). To obtain representative results, we averaged over 100 runs with random initialization of the matrices in NMF. Standard deviations for all results are given in brackets.

Evaluation and discussion of the results
For evaluating the proposed method on ego-noise of motion Sequences I, II, and III, the sizes of the ego-noise dictionary and the speech dictionary have been chosen as K = K_S = 20 for Sequence I and K = 30, K_S = 20 for Sequences II and III. These parameters showed the best suppression performance in terms of SDR for audio only-based NMF on the respective ego-noise recordings. We first discuss the choice of λ_T and λ_E and illustrate the effect of the regularization term R. Subsequently, we evaluate the suppression performance for different SNRs and finally discuss alternative choices for the motor data vector ᾱ_ℓ.

Impact and choice of λ T and λ E
In Table 1, the suppression results for particular choices of λ_T and λ_E are given. First, we incorporate the motor data information only into the training (λ_T = 0.9) and leave the suppression step unchanged (λ_E = 0). Compared to audio only-based NMF (denoted as "NMF" in the following), this already shows a slight improvement of the results. For λ_T = 0 and λ_E = 19, we obtain significantly better results than for NMF, which shows that enforcing similar activations for similar physical states of the robot helps even if this constraint has not been imposed during the learning of the dictionary. The best results are obtained if the regularization term is included in both learning and suppression. Note that λ_T and λ_E are of different orders of magnitude, which will be further investigated and interpreted in Section 3.3.2.
The effect of the proposed regularization term is illustrated in Fig. 6. We consider two time frames ℓ and j of a mixture of ego-noise (Sequence I) and speech. Frames ℓ and j are chosen such that W_ℓj is large, i.e., ᾱ_ℓ and ᾱ_j are close, indicating that the robot has similar physical states. Hence, similar activations are desired. Figure 6a shows elements of the activation vectors h_ℓ and h_j obtained if audio only-based NMF is used. It is obvious that the activations differ significantly, which can be explained by the additional speech signal present in frames ℓ and j, which affects the estimation of the ego-noise activations. Figure 6b illustrates the elements of h_ℓ and h_j estimated by the proposed motor data-regularized NMF. Here, in contrast to audio only-based NMF, the activations coincide even if additional speech is present. For further illustration, Fig. 7 shows spectrograms of an ego-noise extract and its estimates using audio only-based NMF and the proposed method. Without motor data regularization, the speech signal leads to additional, undesired components in the ego-noise estimate. In contrast, this effect is not or only weakly pronounced for the proposed method.
The effect of varying λ_E and ε on the suppression result is illustrated in Fig. 8. For λ_E = 0, the regularization is ineffective and the proposed method reduces to audio only-based NMF (λ_T = 0 holds during the learning of D).
If λ_E is chosen too large, the effect of the motor data dominates and the suppression performance degrades. λ_E = 19 appears to yield the best result for the considered mixture. However, note that the optimal choice of λ_E depends on the SNR of the mixture, as will be discussed in more detail in Section 3.3.2. We now consider the suppression performance for a varying scale parameter ε, c.f. Eq. 6. For ε → 0, we obtain W_ℓj → 0 for ℓ ≠ j according to Eq. 6, i.e., all connections in the graph are set to zero. Accordingly, the regularization term in Eq. 10 equals zero and the results of the proposed method and audio only-based NMF coincide. For increasing ε, Eq. 6 becomes less selective and the number of neighbors of a node with large affinity increases. For the setup in Fig. 8, the maximum SDR is obtained for ε = 5 · 10^−3, which turned out to result in robust performance even for ego-noise of other movements. For larger ε, the suppression performance deteriorates since more and more connections between nodes obtain large weights and the discriminative nature of the graph is reduced.
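For intuition, the behavior of the kernel of Eq. 6 for one fixed pair of motor data vectors can be tabulated over the scale parameter; the squared distance used here is a hypothetical value chosen for illustration only:

```python
import numpy as np

# Affinity for a fixed squared motor-data distance under varying scale eps:
d2 = 0.01  # hypothetical squared distance ||a_l - a_j||^2
weights = {eps: np.exp(-d2 / eps) for eps in (1e-4, 5e-3, 1e-1)}
# Small eps: the pair is effectively disconnected (weight ~ 0);
# large eps: almost fully connected (weight ~ 1), so the graph loses
# its discriminative, local-neighborhood character.
```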

Varying SNRs
So far, we only considered mixtures with a constant SNR. In a typical human-robot interaction, however, the SNR changes due to, e.g., varying distances between the desired source and the robot or different power levels of the signal of interest. Therefore, robust ego-noise suppression at different SNRs is of high importance. In the following, we evaluate the proposed approach for input-mixture SNRs of {±10, ±5, ±2, ±1, 0} dB. For this, we added scaled versions of the speech signal to the ego-noise. Note that for the considered NAO robot, SNR = 10 dB is an unlikely scenario since it corresponds to a human-robot distance of only a couple of centimeters or a very loud human voice. We acknowledge, however, that such a high SNR could be realistic for robots whose ego-noise is quieter.
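The mixing procedure, i.e., scaling the speech signal so that the mixture attains a desired SNR relative to the ego-noise, can be sketched as follows (a minimal sketch of the setup described in the text; signal lengths and power estimation details are assumptions):

```python
import numpy as np

def mix_at_snr(ego_noise, speech, snr_db):
    """Add a scaled speech signal to the ego-noise so that the mixture
    has the desired SNR, treating speech as the signal of interest and
    ego-noise as the noise. Both inputs are 1-D arrays of equal length."""
    p_n = np.mean(ego_noise ** 2)             # ego-noise power
    p_s = np.mean(speech ** 2)                # speech power
    gain = np.sqrt(p_n / p_s * 10 ** (snr_db / 10.0))
    return ego_noise + gain * speech
```

Negative SNRs (speech weaker than ego-noise), as dominate the evaluated range, simply yield gains below the power-matched value.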
Results for ego-noise Sequence I are given in Table 2. In the right part of Table 2, the parameters λ T , λ E , and the scale parameter used for the proposed method are summarized. Interestingly, for SNR = −10 dB, the proposed method shows its best result if the regularization is ineffective. Consequently, it does not show any benefit compared to audio only-based NMF. For larger SNRs, SDR and SAR increase both for audio only-based NMF and for the proposed method. Motor data-regularized NMF consistently shows superior performance, and the relative improvement over audio only-based NMF increases with growing SNR, e.g., for SNR = +2 dB, a gain of +2 dB in SDR and 2.5 dB in SAR is achieved. This effect can be explained by the fact that for audio only-based NMF, the estimation of the activations is more severely impaired by the additional speech signal, an effect that becomes more pronounced with increasing SNR. Since motor data is a non-acoustic reference signal, the regularization term is not affected by the increasing power of the additional speech component. This also explains why almost no benefit could be observed for low SNRs, where the additional speech signal does not have an impact on the ego-noise estimation.
While the scale parameter is constant for all SNRs, λ E in particular has to be increased continuously for larger SNRs. Thereby, the influence of the motor data-dependent regularization becomes stronger, compensating the increasingly negative impact of the speech on the estimation of the ego-noise activations. We conducted the same experiments for ego-noise caused by motions of Sequence II and Sequence III. The results are summarized in Tables 3 and 4. In principle, the results obtained for Sequence I are confirmed: the proposed method outperforms audio only-based NMF consistently, especially for high SNRs, and λ E shows a significant dependence on the SNR. Interestingly, for Sequence II, the absolute values of λ E have to be chosen slightly smaller for optimum performance than for Sequence I. Overall, the suppression results are ≈ 1 dB (Sequence II) and ≈ 0.5 dB (Sequence III) worse than for Sequence I, which can be explained by the more complex movements, cf. the movement descriptions in Section 3.2.
Note that an optimal choice of λ E requires knowledge of the SNR, for which a single-channel or multichannel SNR estimator can be employed. However, it must be expected that the SNR estimate is imperfect, leading to a suboptimal choice of λ E . The resulting effect on the suppression performance is shown in Table 5. We chose λ E = 8.0 and λ T = 0.5, i.e., the optimal parameters for SNR = 1 dB, and evaluated the proposed method for SNRs of −1 dB, . . . , 3 dB, simulating imperfect SNR estimates. Overall, the suboptimal parameter choice leads to a degradation of the suppression performance. However, the proposed method still shows superior results compared to audio only-based NMF.

Alternative choices for ᾱ
In the previous experiments, the motor data vector ᾱ was composed of the angular positions and their first- and second-order temporal derivatives, i.e., angular velocity and acceleration. By complementing the angular positions with their first- and second-order derivatives, we implicitly added temporal information to our model since not only current, but also past motor data samples are taken into account for the construction of ᾱ, cf. Eq. 1.
In the following, we evaluate how the performance of the proposed method depends on the amount of temporal context included in ᾱ.
Results are given in Table 6. First, we consider a motor data vector ᾱ which contains only angular positions, i.e., no derivatives are used. The suppression result lags significantly behind audio only-based NMF. This drop in performance is not surprising since from the angular positions alone it cannot be distinguished whether, e.g., the robot raises or lowers its arm if upward and downward movements have the same trajectory. Consequently, the ego-noise caused by these two movements is assessed as similar. The results improve drastically if the angular velocity is added to the motor data vector ᾱ. If the angular acceleration is also included, the results further improve; however, the additional gain is clearly smaller compared to that of adding the first derivative. Adding higher-order derivatives to ᾱ does not offer further benefit, as results get slightly worse with an increasing number of derivatives incorporated into ᾱ.
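The construction of ᾱ from positions and their temporal derivatives can be sketched as follows. The sketch uses backward finite differences with zero-padding at the sequence start so that each feature vector depends only on current and past samples; the paper's Eq. 1 may use a different differencing scheme, so the exact form here is an assumption.

```python
import numpy as np

def motor_features(positions, num_derivs=2):
    """Stack joint angular positions with their first num_derivs
    temporal derivatives (velocity, acceleration, ...).

    positions: array of shape (T, J), J joint angles over T frames.
    Returns an array of shape (T, J * (num_derivs + 1)); row t is the
    motor-data vector for frame t. Derivatives are backward finite
    differences, zero-padded at the start of the sequence."""
    T, J = positions.shape
    feats = [positions]
    d = positions
    for _ in range(num_derivs):
        d = np.diff(d, axis=0)               # one more finite difference
        pad = np.zeros((T - d.shape[0], J))  # keep temporal alignment
        feats.append(np.vstack([pad, d]))
    return np.hstack(feats)
```

With num_derivs = 0 this reproduces the position-only variant that performed worst in Table 6, since, e.g., an upward and a downward arm movement along the same trajectory yield identical feature vectors.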