
Improving sign-algorithm convergence rate using natural gradient for lossless audio compression


In lossless audio compression, the predictive residuals must remain sparse when entropy coding is applied. The sign algorithm (SA) is a conventional method for minimizing the magnitudes of residuals; however, this approach yields poor convergence performance compared with the least mean square algorithm. To overcome this convergence performance degradation, we propose novel adaptive algorithms based on a natural gradient: the natural-gradient sign algorithm (NGSA) and normalized NGSA. We also propose an efficient natural-gradient update method based on the AR(p) model, which requires \(\mathcal {O}(p)\) multiply–add operations at every adaptation step. In experiments conducted using toy and real music data, the proposed algorithms achieve superior convergence performance to the SA. Furthermore, we propose a novel lossless audio codec based on the NGSA, called the natural-gradient autoregressive unlossy audio compressor (NARU), which is open-source and implemented in C. In a comparative experiment with existing, well-known codecs, NARU exhibits superior compression performance. These results suggest that the proposed methods are appropriate for practical applications.


Introduction

Greater storage capacity is required to further enrich digital audio content [1]. Therefore, lossless audio coding, which allows audio data compression without information loss, is vital for various applications, such as lossless music delivery, editing, and recording [2]. Figure 1 depicts the general structure of a lossless audio codec [3]. First, the codec converts the audio signal to a residual via prediction using a mathematical model. Second, it compresses the residual through entropy coding. If the model provides an accurate prediction, the residual signal is sparse and, thus, high compression performance is achieved. The Shorten lossless codec [4] was one of the first codecs with the structure shown in Fig. 1, and several codecs that follow the same structure have been implemented since. For example, MPEG4-ALS [5], ALAC [6], and FLAC [7] use linear predictive coding (LPC) as the predictive model, whereas WavPack [8], TTA [9], and Monkey’s Audio [10] use adaptive filters. LPC is generally formulated based on the assumption that the residual follows a Gaussian distribution; hence, FLAC and MPEG4-ALS implicitly assume a Gaussian residual distribution. In contrast, WavPack, TTA, and Monkey’s Audio assume a Laplacian residual distribution through their adaptive algorithms.

Fig. 1 General structure of lossless audio codec

In entropy coding, the Golomb–Rice code [11] is generally employed, as this code is optimal when the residual follows a Laplace distribution. Therefore, the Gaussian residual assumption underlying LPC is mismatched with the entropy coder. To overcome this problem, Kameoka et al. [12] improved the compression rate by formulating an LPC under a Laplace distribution. The sign algorithm (SA) [13] is a practical choice for the adaptive algorithm when the residual follows a Laplace distribution; however, the SA converges considerably more slowly than the least mean square (LMS) algorithm [14].

To overcome this performance gap, several SA variants such as the convex combination [15] and the logarithmic cost function [16] have been proposed. However, these attempts have not yielded superior convergence performance to the normalized LMS (NLMS). Notably, the algorithm proposed by Gay and Douglas [17] outperforms the NLMS through the use of a natural gradient [18].

In this study, we improve the SA convergence performance using a natural gradient. We propose two novel adaptive algorithms: the natural-gradient sign algorithm (NGSA) and normalized NGSA (NNGSA) [19]. These algorithms employ \(\mathcal {O}(p)\) multiply–add operations to calculate the natural gradient at every step based on the p-th order autoregressive model assumption for the input data. The proposed algorithms achieve superior convergence performance to the SA. Furthermore, we propose a novel lossless audio codec based on the NGSA, called the natural-gradient autoregressive unlossy audio compressor (NARU) (Taiyo Mineo, Hayaru Shouno: NARU: Natural-gradient AutoRegressive Unlossy Audio Compressor, submitted), which is implemented and published under the MIT license. NARU exhibits superior compression performance to existing codecs such as FLAC, WavPack, TTA, and MPEG4-ALS. Moreover, its decoding speed is faster than that of Monkey’s Audio without strict optimization.

The remainder of this paper is organized as follows: Section 2 provides an overview of the relevant mathematical theories; Section 3 presents the proposed methods and the NARU codec structure; Section 4 reports computer-based experiments to demonstrate the performance of the proposed algorithms and codec; and Sections 5 and 6 present the discussion and conclusion, respectively.

Theoretical background

Adaptive filter

An overview of an adaptive filter is shown in Fig. 2. The input signal x[n] and observation noise v[n] are discrete-time signal sequences, where v[n] is the noise added to the output of the unknown system. In this study, x[n] is assumed to be weakly stationary and ergodic. Let \(\boldsymbol{h}[n] = [h_{1}[n],..., h_{N}[n]]^{\mathsf{T}}\) be the adaptive filter coefficients, where \(\mathsf{T}\) denotes matrix transposition. This study employs a finite impulse response (FIR) filter. Hence, the filter output is \(\boldsymbol{h}[n]^{\mathsf{T}}\boldsymbol{x}[n]\), where \(\boldsymbol{x}[n] = [x[n-N+1],..., x[n]]^{\mathsf{T}}\) represents the input vector. We denote the coefficient vector of the unknown system as \(\boldsymbol{h}^{\ast}\). Filter adaptation is performed by updating the coefficients \(\boldsymbol{h}[n]\) based on the observed signal

$$\begin{array}{*{20}l} d[n] := \boldsymbol{h}^{\ast\mathsf{T}}\boldsymbol{x}[n] + v[n], \end{array} $$

and the residual

$$\begin{array}{*{20}l} \varepsilon[n] := d[n] - \boldsymbol{h}[n]^{\mathsf{T}} \boldsymbol{x}[n]. \end{array} $$

Fig. 2 Adaptive filter

Sign algorithm (SA)

The SA is derived using the maximum likelihood method under the assumption that ε[n] follows a Laplace distribution. The probability density function of the Laplace distribution is

$$\begin{array}{*{20}l} p(\varepsilon[n] \mid \boldsymbol{h}) = \frac{1}{2\sigma} \exp \left[ -\frac{|\varepsilon[n]|}{\sigma} \right], \end{array} $$

where σ>0 represents the deviation. The likelihood L(h) and log-likelihood logL(h) functions for M independent and identically distributed (i.i.d.) samples are expressed as

$$\begin{array}{*{20}l} L(\boldsymbol{h}) &= \frac{1}{(2\sigma)^{M}} \prod_{k=1}^{M} \exp \left[ -\frac{|\varepsilon[k]|}{\sigma} \right], \end{array} $$
$$\begin{array}{*{20}l} \log L(\boldsymbol{h}) &= -M \log(2\sigma) - \frac{1}{\sigma} \sum_{k=1}^{M} |\varepsilon[k]|. \end{array} $$

We let M=1 because the SA adapts at each step. To maximize the likelihood, we partially differentiate logL(h) with respect to h

$$\begin{array}{*{20}l} \frac{\partial \log L(\boldsymbol{h})}{\partial\boldsymbol{h}} = \frac{1}{\sigma} \text{sgn}(\varepsilon[n]) \boldsymbol{x}[n], \end{array} $$

where sgn(·) denotes the sign function, which is defined as

$$\begin{array}{*{20}l} \text{sgn}(x) = \left\{ \begin{array}{cc} 1 & (x > 0) \\ 0 & (x = 0) \\ -1 & (x < 0) \end{array}\right.. \end{array} $$
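As a brief worked step (added here for clarity, and valid for ε[n]≠0), the gradient in Eq. (6) follows from the chain rule applied to the residual in Eq. (2) and the log-likelihood with M=1:

$$ \begin{aligned} \frac{\partial |\varepsilon[n]|}{\partial \boldsymbol{h}} &= \text{sgn}(\varepsilon[n]) \frac{\partial \varepsilon[n]}{\partial \boldsymbol{h}} = -\text{sgn}(\varepsilon[n]) \boldsymbol{x}[n], \\ \frac{\partial \log L(\boldsymbol{h})}{\partial \boldsymbol{h}} &= -\frac{1}{\sigma} \frac{\partial |\varepsilon[n]|}{\partial \boldsymbol{h}} = \frac{1}{\sigma} \text{sgn}(\varepsilon[n]) \boldsymbol{x}[n]. \end{aligned} $$

A gradient-ascent step, with the factor 1/σ absorbed into the step size, then yields the adaptation rule.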

The SA adaptation rule is expressed as

$$\begin{array}{*{20}l} \boldsymbol{h}[n+1] = \boldsymbol{h}[n] + \mu \text{sgn}(\varepsilon[n]) \boldsymbol{x}[n], \end{array} $$

where μ>0 denotes the step-size parameter.
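As a minimal sketch of the SA update in Eq. (8), the following Python code identifies a short FIR system from noiseless data. The function name `sign_algorithm`, the step size, the 3-tap system, and the test signal are illustrative assumptions and are not taken from the authors' implementation.

```python
import random


def sign_algorithm(x, d, order, mu):
    """Run the SA update h <- h + mu * sgn(eps) * x and return the final coefficients.

    x, d  : input and observed sequences (lists of floats)
    order : filter order N (illustrative parameter name)
    mu    : step size
    """
    h = [0.0] * order
    for n in range(order - 1, len(x)):
        # Input vector [x[n-N+1], ..., x[n]]
        xv = x[n - order + 1:n + 1]
        eps = d[n] - sum(hi * xi for hi, xi in zip(h, xv))
        sgn = (eps > 0) - (eps < 0)  # sign function with sgn(0) = 0
        h = [hi + mu * sgn * xi for hi, xi in zip(h, xv)]
    return h


# Hypothetical toy setup: unknown 3-tap system, white Gaussian input, no noise.
rng = random.Random(0)
h_true = [0.5, -0.3, 0.2]
x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
d = [0.0] * len(x)
for n in range(2, len(x)):
    d[n] = sum(c * v for c, v in zip(h_true, x[n - 2:n + 1]))

h_est = sign_algorithm(x, d, order=3, mu=0.001)
# Squared deviation between the true and estimated coefficients
msd = sum((a - b) ** 2 for a, b in zip(h_true, h_est))
```

Because every update has a fixed magnitude proportional to μ, the estimate keeps fluctuating around the true coefficients; a smaller μ reduces this steady-state error at the cost of slower convergence.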

Autoregressive model

To simplify the inverse calculation for an autocorrelation matrix for the input signal, we introduce an autoregressive model. Here, AR(p) indicates the autoregressive model with order p that satisfies the following equation for signal s:

$$\begin{array}{*{20}l} {}s[n] = \sum_{i = 1}^{p} \psi_{i} s[n - i] + \nu[n], \ \psi_{i} \in \mathbb{R} \ (i = 1,...,p), \end{array} $$

where ν[n] is a sample from an independent standard normal distribution. The ith row and jth column element of the inverse autocovariance matrix for the AR(p) process \(\boldsymbol {K}_{p}^{-1}\) is calculated explicitly as [20]

$$\begin{array}{*{20}l} {}\left(\boldsymbol{K}_{p}^{-1}\right)_{ij} = \left\{\begin{array}{ll} \sum_{k=1}^{j} \psi_{i-k}\psi_{j-k}, & 1 \leq i \leq p+1 \\ \sum_{k=1}^{L-i+1} \psi_{L-i+1-k}\psi_{L-j+1-k}, & L-p \leq j \leq L \\ 0, & i \geq j + p + 1 \\ \sum_{k=i-p}^{j} \psi_{i-k}\psi_{j-k}, & \text{otherwise} \end{array} \right., \end{array} $$

where \(i \geq j\), \(\psi_{0} = -1\), and L is the matrix size satisfying L>2p.
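For p=1, the inverse autocovariance matrix is tridiagonal. The following sketch builds it and checks numerically that it inverts the autocovariance matrix of a stationary AR(1) process with unit innovation variance, whose entries are the standard result ψ^{|i−j|}/(1−ψ²). The function names and the values ψ=0.8 and L=5 are assumptions chosen for illustration.

```python
def ar1_inverse_autocov(psi, size):
    """Tridiagonal inverse autocovariance of a stationary AR(1) process
    with unit innovation variance."""
    k_inv = [[0.0] * size for _ in range(size)]
    for i in range(size):
        # Corner diagonal entries are 1; interior ones are 1 + psi^2.
        k_inv[i][i] = 1.0 if i in (0, size - 1) else 1.0 + psi * psi
        if i + 1 < size:
            k_inv[i][i + 1] = k_inv[i + 1][i] = -psi
    return k_inv


def ar1_autocov(psi, size):
    """Autocovariance matrix with entries psi^|i-j| / (1 - psi^2)."""
    return [[psi ** abs(i - j) / (1.0 - psi * psi) for j in range(size)]
            for i in range(size)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# The product should be the identity matrix up to rounding error.
product = matmul(ar1_inverse_autocov(0.8, 5), ar1_autocov(0.8, 5))
```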

Proposed methods

Natural-gradient sign algorithm (NGSA)

The natural gradient is derived from the multiplication of the inverse of a Fisher information matrix F−1 and a gradient of the cost function [18]. The matrix F is calculated using the covariance of the gradient for the log-likelihood function (Eq. (6)), as follows:

$$\begin{array}{*{20}l} \boldsymbol{F} &:= \mathrm{E}\left[{\left\{ \frac{\partial\log L(\boldsymbol{h})}{\partial\boldsymbol{h}} \right\} \left\{ \frac{\partial\log L(\boldsymbol{h})}{\partial\boldsymbol{h}} \right\}^{\mathsf{T}}}\right] \end{array} $$
$$\begin{array}{*{20}l} &= \mathrm{E}\left[{\left\{ \frac{\text{sgn}(\varepsilon[n])}{\sigma} \right\}^{2} \boldsymbol{x}[n]\boldsymbol{x}[n]^{\mathsf{T}}}\right] \end{array} $$
$$\begin{array}{*{20}l} &= \frac{1}{\sigma^{2}} \mathrm{E}\left[{\boldsymbol{x}[n]\boldsymbol{x}[n]^{\mathsf{T}}}\right] \quad (\mathrm{a.s.}) \end{array} $$
$$\begin{array}{*{20}l} &= \frac{1}{\sigma^{2}} \boldsymbol{R}, \end{array} $$

where R is the autocorrelation matrix of the input signal. Note that Eq. (13) holds because \(\{\text{sgn}(x)\}^{2} = 1\) is satisfied if x≠0. Using Eq. (14), we obtain the NGSA as follows:

$$\begin{array}{*{20}l} \boldsymbol{h}[n+1] = \boldsymbol{h}[n] + \mu_{\text{NGSA}} \text{sgn}(\varepsilon[n]) \boldsymbol{R}^{-1} \boldsymbol{x}[n], \end{array} $$

where \(\mu_{\text{NGSA}}\) denotes the step-size parameter and R is assumed to be a regular (nonsingular) matrix. In addition, the NGSA can be derived by replacing ε[n] with sgn(ε[n]) in the LMS/Newton algorithm [21], which is an approximation of the Newton method for the LMS algorithm.

The NGSA adaptation rule (Eq. (15)) satisfies the following inequality:

$$\begin{array}{*{20}l} {\lim}_{n\to \infty} \frac{1}{n} \sum_{k=1}^{n} \mathrm{E}\left[{|\varepsilon[k]|}\right] \leq \varepsilon_{\text{min}} + \mu_{\text{NGSA}} \frac{h}{\lambda_{\text{min}}}, \end{array} $$

where \(\varepsilon _{\text {min}}=\mathrm {E}\left [{|v[n]|}\right ], h=(1/2)\mathrm {E}\left [{\|{\boldsymbol {x}[n]}\|_{2}^{2}}\right ]\), and λmin denotes the minimum eigenvalue of R. The proof of Eq. (16) follows that provided in [14] (see Appendix 1: “NGSA inequality”).
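A single NGSA update (Eq. (15)) can be sketched as follows, assuming that the inverse autocorrelation matrix is already available as a list of rows. The helper names `matvec` and `ngsa_step` and the numerical values are hypothetical. With the identity matrix in place of the inverse autocorrelation matrix, the step reduces to the SA update of Eq. (8).

```python
def matvec(m, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def ngsa_step(h, x_vec, d, r_inv, mu):
    """One NGSA update: h + mu * sgn(eps) * R^{-1} x. Returns (new_h, eps)."""
    eps = d - sum(hi * xi for hi, xi in zip(h, x_vec))
    sgn = (eps > 0) - (eps < 0)
    nat_grad = matvec(r_inv, x_vec)  # natural-gradient direction R^{-1} x
    return [hi + mu * sgn * gi for hi, gi in zip(h, nat_grad)], eps


# With R^{-1} = I the update coincides with the plain sign algorithm.
identity = [[1.0, 0.0], [0.0, 1.0]]
h_new, eps = ngsa_step([0.0, 0.0], [1.0, -2.0], d=3.0, r_inv=identity, mu=0.1)
```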

Normalized natural-gradient sign algorithm (NNGSA)

The NGSA encounters difficulties in determining μNGSA because its optimal settings vary according to the input signal. To overcome this difficulty, we introduce a variable step-size adaptation method that minimizes the posterior residual criterion; this approach is identical to that of the NLMS [22].

Let μ[n] and ε+[n] be the adaptive step size and the posterior residual at time n, respectively. Then, ε+[n] is calculated as

$$\begin{array}{*{20}l} & \varepsilon^{+}[n] := d[n] - \boldsymbol{h}[n+1]^{\mathsf{T}}\boldsymbol{x}[n] \end{array} $$
$$\begin{array}{*{20}l} &= d[n] - \left\{ \boldsymbol{h}[n] + \mu[n]\text{sgn}(\varepsilon[n]) \boldsymbol{R}^{-1} \boldsymbol{x}[n] \right\}^{\mathsf{T}} \boldsymbol{x}[n] \end{array} $$
$$\begin{array}{*{20}l} &= \varepsilon[n] - \mu[n]\text{sgn}(\varepsilon[n])\boldsymbol{x}[n]^{\mathsf{T}}\boldsymbol{R}^{-1}\boldsymbol{x}[n]. \end{array} $$

We let ε+[n]=0; then, solving Eq. (19) for μ[n], we obtain

$$\begin{array}{*{20}l} \mu[n] = \frac{|\varepsilon[n]|}{\boldsymbol{x}[n]^{\mathsf{T}} \boldsymbol{R}^{-1} \boldsymbol{x}[n]}. \end{array} $$

Substituting Eq. (20) into Eq. (15), we obtain the NNGSA as follows:

$$\begin{array}{*{20}l} \boldsymbol{h}[n+1] = \boldsymbol{h}[n] + {\frac{\mu_{\text{NNGSA}} \varepsilon[n]}{\boldsymbol{x}[n]^{\mathsf{T}} \boldsymbol{R}^{-1} \boldsymbol{x}[n]} }\boldsymbol{R}^{-1} \boldsymbol{x}[n], \end{array} $$

where μNNGSA>0 denotes the scale parameter. If μNNGSA<2 holds and h[n] and x[n] are statistically independent, this adaptation rule achieves a first-order convergence rate. The proof of this proposition follows that of the NLMS provided in [22] (see Appendix 1: “NNGSA convergence condition”).

The NNGSA can be interpreted as a variable step-size modification of the LMS/Newton algorithm [23]. In [24], the authors state that [23] is a generalization of the recursive least squares (RLS) algorithm. Furthermore, it is evident that Eq. (21) is identical to the NLMS if R=I, where I denotes the identity matrix.
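The NNGSA update of Eq. (21) can be sketched in the same way. The names `dot`, `matvec`, and `nngsa_step` and the example matrix are illustrative assumptions; the sketch also assumes a nonzero input vector so that the normalization term is positive. With a scale parameter of 1, the posterior residual of Eq. (19) vanishes, which the last lines check.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def matvec(m, v):
    return [dot(row, v) for row in m]


def nngsa_step(h, x_vec, d, r_inv, mu):
    """One NNGSA update: h + mu * eps / (x^T R^{-1} x) * R^{-1} x."""
    eps = d - dot(h, x_vec)
    nat_grad = matvec(r_inv, x_vec)   # R^{-1} x
    norm = dot(x_vec, nat_grad)       # x^T R^{-1} x, assumed > 0
    scale = mu * eps / norm
    return [hi + scale * gi for hi, gi in zip(h, nat_grad)]


# Hypothetical positive-definite stand-in for R^{-1}.
r_inv = [[2.0, -1.0], [-1.0, 2.0]]
x_vec, d = [1.0, 0.5], 1.0
h_new = nngsa_step([0.2, -0.1], x_vec, d, r_inv, mu=1.0)
posterior_residual = d - dot(h_new, x_vec)
```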

Geometric interpretation of NNGSA

The adaptation rule in Eq. (21) is used to solve the following optimization problem:

$$ \begin{aligned} & \underset{\boldsymbol{h}}{\text{argmin}}\ (\boldsymbol{h} - \boldsymbol{h}[n])^{\mathsf{T}} \boldsymbol{R} (\boldsymbol{h} - \boldsymbol{h}[n]), \\ & \text{subject to}\ d[n] = \boldsymbol{h}^{\mathsf{T}} \boldsymbol{x}[n]. \end{aligned} $$

This problem can be solved using a Lagrange multiplier. Therefore, Eq. (21) projects h[n] onto the hyperplane \(W = \{\boldsymbol{h} \mid d[n] = \boldsymbol{h}^{\mathsf{T}}\boldsymbol{x}[n]\}\), the metric of which is defined as R (see Fig. 3). Moreover, according to information geometry [25], the Kullback–Leibler divergence \(\text{KL}[\cdot \| \cdot]\) for models associated with the neighborhoods of parameter h[n] can be calculated as

$$\begin{array}{*{20}l} & \text{KL}[{p(\varepsilon[n] \mid \boldsymbol{h}[n])}\|{p(\varepsilon[n] \mid \boldsymbol{h})}] \\ & \approx \frac{1}{2}(\boldsymbol{h} - \boldsymbol{h}[n])^{\mathsf{T}} \boldsymbol{F} (\boldsymbol{h} - \boldsymbol{h}[n]) \end{array} $$
$$\begin{array}{*{20}l} &= \frac{1}{2\sigma^{2}} (\boldsymbol{h} - \boldsymbol{h}[n])^{\mathsf{T}} \boldsymbol{R} (\boldsymbol{h} - \boldsymbol{h}[n]). \end{array} $$

Fig. 3 Geometric interpretation of NNGSA. The NNGSA update procedure (Eq. (21)) projects h[n] onto hyperplane W, having the metric R

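Thus, Eq. (21) can be considered the m-projection from model \(p(\varepsilon[n] \mid \boldsymbol{h}[n])\) to the statistical manifold \(S = \{p(\varepsilon[n] \mid \boldsymbol{h}) \mid d[n] = \boldsymbol{h}^{\mathsf{T}}\boldsymbol{x}[n]\}\), the elements of which have the minimum posterior residual.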
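As a short check of this interpretation, the constrained problem in Eq. (22) can be worked through with a Lagrange multiplier. The symbols \(\mathcal{J}\) and λ are introduced here for illustration only, and R is assumed to be symmetric and nonsingular:

$$ \begin{aligned} \mathcal{J}(\boldsymbol{h}, \lambda) &= (\boldsymbol{h} - \boldsymbol{h}[n])^{\mathsf{T}} \boldsymbol{R} (\boldsymbol{h} - \boldsymbol{h}[n]) + \lambda \left( d[n] - \boldsymbol{h}^{\mathsf{T}} \boldsymbol{x}[n] \right), \\ \frac{\partial \mathcal{J}}{\partial \boldsymbol{h}} &= 2\boldsymbol{R}(\boldsymbol{h} - \boldsymbol{h}[n]) - \lambda \boldsymbol{x}[n] = \boldsymbol{0} \ \Rightarrow \ \boldsymbol{h} = \boldsymbol{h}[n] + \frac{\lambda}{2} \boldsymbol{R}^{-1} \boldsymbol{x}[n], \\ d[n] &= \boldsymbol{h}^{\mathsf{T}} \boldsymbol{x}[n] \ \Rightarrow \ \frac{\lambda}{2} = \frac{\varepsilon[n]}{\boldsymbol{x}[n]^{\mathsf{T}} \boldsymbol{R}^{-1} \boldsymbol{x}[n]}. \end{aligned} $$

Substituting λ/2 back into the expression for h recovers Eq. (21) with a scale parameter of 1.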

Efficient natural-gradient update method

The natural gradient R−1x[n] must be calculated at every step. The Sherman–Morrison formula is typically used to reduce RLS complexity; however, this algorithm involves \(\mathcal {O}(N^{2})\) operations, which generate high cost in practical applications [26]. Therefore, we propose an efficient method to solve this problem.

We assume that the input signals follow the AR(p) process. The natural gradient at time n, i.e., \(\boldsymbol {m}[n] = [m_{1}[n],..., m_{N}[n]]^{\mathsf {T}} := \boldsymbol {K}_{p}^{-1} \boldsymbol {x}[n]\), can be updated as

$$ \begin{aligned} \boldsymbol{K}_{p}^{-1}\boldsymbol{x}[n+1] &= \left[ \begin{array}{c} m_{2}[n] \\ m_{3}[n] \\ \vdots \\ m_{N}[n] \\ 0 \end{array} \right] + m_{1}[n] \left[ \begin{array}{c} \psi_{1} \\ \psi_{2} \\ \vdots \\ \psi_{p} \\ \boldsymbol{0}_{N-p} \end{array} \right] \\ &\quad - m_{N}[n+1] \left[ \begin{array}{c} \boldsymbol{0}_{N-p-1} \\ \psi_{p} \\ \vdots \\ \psi_{1} \\ -1 \end{array} \right], \\ m_{N}[n+1] &= x[n+1] - \sum_{i=1}^{p} \psi_{i} x[n+1-i], \end{aligned} $$

where \(\boldsymbol{0}_{N}\) is an N×1 zero vector. Equation (25) follows from a direct calculation (see Appendix 1: “Derivation of efficient natural-gradient update method”). Furthermore, the Mahalanobis norm \(\boldsymbol {x}[n]^{\mathsf {T}} \boldsymbol {K}_{p}^{-1} \boldsymbol {x}[n]\) can be updated as follows:

$$\begin{array}{*{20}l} & \boldsymbol{x}[n+1]^{\mathsf{T}} \boldsymbol{K}_{p}^{-1} \boldsymbol{x}[n+1] \\ & = \boldsymbol{x}[n]^{\mathsf{T}} \boldsymbol{K}_{p}^{-1} \boldsymbol{x}[n] - m_{1}[n]^{2} + m_{N}[n+1]^{2}. \end{array} $$

Equation (25) requires 3p multiply–add (subtract) operations, and Eq. (26) requires 2. Hence, we can update the natural gradient in \(\mathcal {O}(p)\) operations. In addition, Eq. (25) requires \(\mathcal {O}(N)\) space because it refers to the gradient m[n] of the previous step. Equation (25) is essentially the same as that of [27], in which a lattice filter (with partial autocorrelation coefficients) is used for gradient updating. The present method is suitable for norm updating.
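The recursions in Eqs. (25) and (26) can be checked numerically for p=1. The sketch below compares the recursive update against a direct product with the tridiagonal AR(1) inverse autocovariance matrix (unit innovation variance). The function names, the value ψ=0.8, the filter order, and the test signal are assumptions made for this illustration.

```python
def direct_natural_gradient(psi, x_vec):
    """Reference product K_1^{-1} x using the tridiagonal AR(1) structure."""
    n = len(x_vec)
    m = []
    for i in range(n):
        diag = 1.0 if i in (0, n - 1) else 1.0 + psi * psi
        val = diag * x_vec[i]
        if i > 0:
            val -= psi * x_vec[i - 1]
        if i < n - 1:
            val -= psi * x_vec[i + 1]
        m.append(val)
    return m


def update_natural_gradient(psi, m_prev, x_prev_last, x_new):
    """Eq. (25) for p = 1: shift m[n], then correct the first and last two entries."""
    m_last = x_new - psi * x_prev_last   # m_N[n+1]
    m_next = m_prev[1:] + [0.0]          # shifted previous gradient
    m_next[0] += psi * m_prev[0]         # + m_1[n] * [psi_1, 0, ...]
    m_next[-2] -= psi * m_last           # - m_N[n+1] * [..., psi_1, -1]
    m_next[-1] += m_last
    return m_next


psi = 0.8
signal = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1, 0.4]
order = 4
x_vec = signal[:order]
m = direct_natural_gradient(psi, x_vec)
norm = sum(xi * mi for xi, mi in zip(x_vec, m))
max_err = 0.0
for n in range(order, len(signal)):
    m_last = signal[n] - psi * signal[n - 1]
    norm += m_last ** 2 - m[0] ** 2      # Eq. (26), using m_1[n] before the update
    m = update_natural_gradient(psi, m, signal[n - 1], signal[n])
    x_vec = signal[n - order + 1:n + 1]
    ref = direct_natural_gradient(psi, x_vec)
    ref_norm = sum(xi * ri for xi, ri in zip(x_vec, ref))
    errors = [abs(a - b) for a, b in zip(m, ref)] + [abs(norm - ref_norm)]
    max_err = max(max_err, max(errors))
```

Only the first entry and the last two entries of the shifted vector are modified, which is why the per-step cost does not grow with the filter order N.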

Algorithm 1 describes the NNGSA coding procedure under the AR(p) assumption.

Application to LMS/Newton algorithm

We can apply the proposed procedure to the LMS/Newton algorithm:

$$\begin{array}{*{20}l} \boldsymbol{h}[n+1] &= \boldsymbol{h}[n] + \mu_{\text{LMSN}} \varepsilon[n] \boldsymbol{R}_{p}^{-1} \boldsymbol{x}[n], \end{array} $$
$$\begin{array}{*{20}l} \boldsymbol{R}_{p}^{-1} &:= \sigma_{p}^{-1} \boldsymbol{K}_{p}^{-1}, \end{array} $$

where μLMSN>0 denotes the step-size parameter and σp is a constant that depends on p. For p=1, Eq. (27) achieves first-order convergence if

$$\begin{array}{*{20}l} \mu_{\text{LMSN}} < \frac{2(1 - \psi_{1})}{N(1 + \psi_{1})}. \end{array} $$

The proof of this proposition follows that for the LMS provided in [21], and employs the eigenvalue range of \(\boldsymbol{R}_{1}\) [29] (see Appendix 1: “Convergence condition for LMS/Newton algorithm”).

Codec structure

This section describes the NARU encoder and decoder.


The NARU encoding procedure is illustrated in Fig. 4. Below, we describe each component of the encoding procedure.

Fig. 4 NARU encoder structure

Mid-side conversion

The mid-side conversion eliminates the inter-channel correlation from the stereo signal. This conversion is expressed as follows:

$$\begin{array}{*{20}l} M &= \frac{L + R}{2}, \end{array} $$
$$\begin{array}{*{20}l} S &= L - R, \end{array} $$

where L, R, M, and S are the signals of the left, right, mid, and side channels, respectively.

Pre-emphasis

The pre-emphasis is a first-order FIR filter with a fixed coefficient, which is expressed as follows:

$$\begin{array}{*{20}l} y[n] = x[n] - \eta\ x[n-1], \end{array} $$

where η denotes a constant that satisfies η≈1, and x[n] and y[n] are the filter input and output at time n, respectively. This filter reduces the static offset of the input signal. Hence, we can prevent R from being ill-conditioned [28]. Here, we choose η=31/32=0.96875 because the division by 32 can be implemented as a 5-bit arithmetic right shift.

NGSA filter

The NGSA filter is the core predictive model of this codec and has the highest order (N≤64) of the FIR filters in the codec. We adopt the adaptation rule of Algorithm 1 and set d[n]:=x[n+1] in Eq. (2) so that the filter predicts the next input sample.

SA filter

We cascade the SA filter after the NGSA filter, as this cascaded filter scheme [30] exhibits superior compression performance. This filter has a lower order (N≤8) than the NGSA filter and follows the SA adaptation rule (Eq. (8)).

Recursive Golomb coding

This stage converts the residual signal to a compressed bitstream. We employ recursive Golomb coding [31] as the entropy coder; this is a refinement of the Golomb–Rice code and has exhibited acceptable performance in WavPack and TTA.
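To illustrate how a sparse residual turns into a short bitstream, the following sketch implements the plain Golomb–Rice code with parameter k. This is the basic code that recursive Golomb coding refines, not the recursive variant of [31] used by NARU. The zigzag mapping of signed residuals and all function names are assumptions made for this example.

```python
def zigzag(value):
    """Map a signed residual to a non-negative integer: 0,-1,1,-2,... -> 0,1,2,3,..."""
    return 2 * value if value >= 0 else -2 * value - 1


def rice_encode(value, k):
    """Return the Rice code of a signed integer as a string of '0'/'1' bits."""
    u = zigzag(value)
    quotient, remainder = u >> k, u & ((1 << k) - 1)
    bits = "1" * quotient + "0"  # unary quotient, terminated by a zero bit
    if k > 0:
        bits += format(remainder, "0{}b".format(k))  # k-bit binary remainder
    return bits


def rice_decode(bits, k):
    """Invert rice_encode for a single code word."""
    quotient = bits.index("0")
    remainder = int(bits[quotient + 1:quotient + 1 + k], 2) if k > 0 else 0
    u = (quotient << k) | remainder
    return u // 2 if u % 2 == 0 else -(u + 1) // 2


code = rice_encode(-3, 2)  # zigzag(-3) = 5 -> quotient 1, remainder 1
```

Small magnitudes yield short unary prefixes, so the code length grows roughly linearly with the residual magnitude for a fixed k.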


The decoder structure is shown in Fig. 5. As apparent from the figure, the decoding procedure is simply the inverse of the encoding procedure: the SA filter and NGSA filter produce the same predictions as for encoding at each instance and, hence, the input signal is perfectly reconstructed. Additionally, the de-emphasis follows

$$\begin{array}{*{20}l} x[n] &= y[n] + \eta\ x[n-1], \end{array} $$

and the left–right conversion is expressed as

$$\begin{array}{*{20}l} L &= M + \frac{S}{2}, \end{array} $$
$$\begin{array}{*{20}l} R &= M - \frac{S}{2}. \end{array} $$

Fig. 5 NARU decoder structure
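The equations above are written for real-valued signals. With integer samples, the halving in the mid channel and the multiplication by η discard low-order bits, so a lossless codec must make both steps exactly invertible. The sketch below shows one possible way to do this; the parity-based mid-side reconstruction and the shift-based emphasis filters are assumptions for illustration and may differ from the actual NARU implementation.

```python
def mid_side_encode(left, right):
    """Integer mid-side conversion: M = floor((L + R) / 2), S = L - R."""
    return (left + right) >> 1, left - right


def mid_side_decode(mid, side):
    # (L + R) and (L - R) share the same parity, so the bit dropped by >> 1
    # can be recovered from the side channel.
    total = 2 * mid + (side & 1)
    left = (total + side) >> 1
    return left, left - side


def pre_emphasis(x):
    """y[n] = x[n] - (31/32) x[n-1], with the product floored by a 5-bit right shift."""
    y, prev = [], 0
    for sample in x:
        y.append(sample - ((prev * 31) >> 5))
        prev = sample
    return y


def de_emphasis(y):
    """Inverse filter: the decoder repeats the same floored product, so it cancels."""
    x, prev = [], 0
    for sample in y:
        prev = sample + ((prev * 31) >> 5)
        x.append(prev)
    return x


samples = [0, 100, -32768, 32767, -5, 7, 12345]
restored = de_emphasis(pre_emphasis(samples))
```

Both steps round identically in the encoder and the decoder, which is the property that makes the reconstruction exact.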

Codec implementation

To ensure speed and portability, we implemented the codec in the C programming language [32]. All encoding/decoding procedures were implemented via fixed-point operations so that the decoder reconstructed the input signal perfectly. We published this implementation under the MIT license.

The fixed-point numbers were represented by 32-bit signed integers with 15 fractional bits. Note that, at present, the codec supports 16-bit linear pulse code modulation (PCM) Wav (Waveform Audio File Format) files only, to prevent multiplication overflow and to maintain implementation simplicity. We assume that the appropriate bit-width rounding is available for 24-bit Wav.
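A fixed-point format with 15 fractional bits can be sketched as follows. The helper names and the round-to-nearest choice are assumptions for illustration. Python integers do not overflow, so the 32-bit range limit that motivates the 16-bit input restriction is not modelled here.

```python
FRACTION_BITS = 15
ONE = 1 << FRACTION_BITS  # fixed-point representation of 1.0


def to_fixed(value):
    """Convert a float to a fixed-point integer with 15 fractional bits."""
    return int(round(value * ONE))


def fixed_mul(a, b):
    """Multiply two fixed-point values, rounding the product to nearest."""
    return (a * b + (ONE >> 1)) >> FRACTION_BITS


def to_float(value):
    return value / ONE


product = to_float(fixed_mul(to_fixed(0.5), to_fixed(-0.25)))
```

Because every operation is an integer operation, the encoder and decoder produce bit-identical predictions regardless of the floating-point behavior of the platform.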

Experiment results

This section reports the evaluation results for the proposed algorithms and codec.

Adaptive algorithm comparison

Toy-data experiments

We observed the convergence performance under the following artificial settings. The elements of the unknown parameter \(\boldsymbol{h}^{\ast}\) were randomly drawn from a uniform distribution on [−1,1], the filter order N was set to 5, and the observation noise v[n] was white Gaussian noise with −20, −40, and −60 dB variances. These settings were adopted from [16]. We calculated the mean square deviation (MSD) criterion \(\|\boldsymbol{h}^{\ast} - \boldsymbol{h}[n]\|_{2}^{2}\) from 200 independent trials. In addition, we set p=1 and the following step sizes for the proposed algorithms: \(\mu_{\text{NGSA}} = 0.01\), \(\mu_{\text{NNGSA}} = 0.1\), and \(\mu_{\text{LMSN}} = 0.01\). We implemented the algorithms in Python 3.8.1 and performed simulations using an Intel® Core i7 2.8 GHz dual-core CPU with 16 GB RAM.

First, we tracked the MSD learning curves for x[n] with a variance of 0 dB. Figure 6 shows a comparison between the results obtained for the proposed algorithms and the SA, NLMS, and RLS (see Fig. 9 in Appendix 2 for −20 and −60 dB results). We set various step sizes for the SA and NLMS and employed various forgetting factors λ for the RLS. Figure 6 shows that the NGSA and NNGSA achieved almost the same performance as the SA and NLMS, respectively. This is because \(\boldsymbol {R}^{-1}_{1} \approx \boldsymbol {I}\) holds for i.i.d. noise input.

Fig. 6 Learning curves for white-Gaussian-noise input

Second, we observed the case in which the Gaussian noise input is correlated through x[n]←x[n]+0.8x[n−1]. Figure 7 shows the results for this correlated input (see Fig. 10 in Appendix 2 for −20 and −60 dB results). The SA and NLMS exhibited poorer convergence performance than for the non-correlated noise input (Fig. 6). Moreover, the steady-state errors for the proposed algorithms also deteriorated. This is because R was close to being ill-conditioned, and the right-hand side of Eq. (16) was large.

Fig. 7 Learning curves for correlated noise input

Real-data experiments

We observed the absolute error (AE) for filter prediction using real music data from the Real World Computing (RWC) music dataset [33]. In this experiment, we assumed that the input data was composed of an audio data signal only and that the reference output and observation noise were zero (silence). We set the same configurations for the proposed algorithms as in the toy-data experiments. Figure 8 shows the AE curves obtained for the first second (at a 44100 Hz sampling rate) for the left channel of the tune “When the Saints Go Marching In.” As shown in Fig. 8, the NNGSA and LMS/Newton exhibited superior performance to the NLMS and approximately the same performance as the RLS. However, the NGSA with AR(1) exhibited considerably poorer performance. We assume that this poor performance stemmed from a greater steady-state error for the NGSA, which arose from long-term (≈ 10000 samples) signal stationarity.

Fig. 8 AE comparison for real music data

Codec evaluation

We observed the compression performance under the following settings, treating the following existing codecs as competitors:

FLAC: version 1.3.2 with the “highest compression” option (-8).

WavPack: version 5.4.0 with the “very high quality” option (-hh).

TTA: version 2.3 with the default setting.

Monkey’s Audio: version 6.14 with the “extra high” option (-c4000).

MPEG4-ALS: RM23 with the default setting. We did not use the optimum compression option (-7) as the required encoding time was unrealistic.

NARU: the NGSA filter order was set to 64, the AR order was 1, and the SA filter order was 8.

There were two evaluation criteria:

$$\begin{array}{*{20}l}& \text{compression ratio} \\ &= {\frac{\text{compressed size (byte)}}{\text{original wav file size (byte)}}} \times 100 \ \text{[\%]} \end{array} $$
$$\begin{array}{*{20}l} & \text{decoding speed} \\ &= {\frac{\text{decoding time (sec)}}{\text{wav file length (sec)}}} \times 100 \ \text{[\%]} \end{array} $$
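Both criteria are percentages for which lower values are better. A direct transcription of Eqs. (36) and (37) is sketched below; the function names and the example numbers are hypothetical and are not measurements from this study.

```python
def compression_ratio(compressed_bytes, original_bytes):
    """Eq. (36): compressed size relative to the original Wav size, in percent."""
    return 100.0 * compressed_bytes / original_bytes


def decoding_speed(decode_seconds, audio_seconds):
    """Eq. (37): decoding time relative to the audio duration, in percent."""
    return 100.0 * decode_seconds / audio_seconds


# Hypothetical example: a 1000-byte file compressed to 550 bytes,
# and 200 s of audio decoded in 0.5 s.
ratio = compression_ratio(550, 1000)
speed = decoding_speed(0.5, 200.0)
```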

We employed the RWC music dataset [33] detailed in Table 1 and measured the root mean square (RMS) amplitude for each music data element. All the music data elements were formatted as Wav files, with 16-bit/sample, a stereo-channel setting, and a 44100 Hz sampling rate. The experiments were conducted on a Windows 10 OS PC having an Intel® Core i7-9750H 2.6 GHz CPU with 32 GB RAM.

Table 1 Dataset description

The compression ratio and decoding speed results are presented in Tables 2 and 3, respectively.

Table 2 Compression ratio (Eq. (36)) comparison
Table 3 Decoding speed (Eq. (37)) comparison


Discussion

The proposed algorithms clearly achieved superior convergence performance to the SA and NLMS for correlated signal inputs. Furthermore, the NNGSA and LMS/Newton algorithms exhibited similar performance to the RLS, as indicated in [24]. The NNGSA is not superior to the NLMS and RLS in terms of both convergence speed and steady-state error. However, the NNGSA outperformed the NLMS for highly correlated signals (Fig. 7). In general, digital audio signals exhibit high autocorrelation at low orders. Hence, we suggest that the NNGSA converges faster than the NLMS for empirical data. Furthermore, the NNGSA requires only \(\mathcal{O}(p)\) operations per adaptation to update the gradient; hence, its complexity is lower than that of the RLS, which employs the Sherman–Morrison formula (\(\mathcal{O}(N^{2})\)). Therefore, we conclude that the NNGSA is a more accurate predictive algorithm than the SA and is suitable for practical application to a lossless audio codec.

However, the proposed algorithms suffer from two major problems with regard to practical applications. First, the matrix R can be singular, depending on the input signal. For example, pre-emphasis can reduce a static offset to a signal with zero mean, variance, and autocorrelation. One approach to resolving this problem is to introduce regularization, which would involve calculation of the inverse matrix for R+γI (γ>0) instead of R. Second, the AR coefficients \(\psi_{i}\) (i=1,...,p) must be calculated before the adaptation process, which can generate difficulties for streaming data processing.

As apparent from Tables 2 and 3, although Monkey’s Audio yielded the best average compression performance, it also exhibited the lowest decoding speed. This is because Monkey’s Audio uses a rich prediction/coding scheme, with a convolutional neural network for prediction and arithmetic coding. In addition, FLAC yielded an inverse trend, i.e., it exhibited the highest decoding speed and poorest compression performance.

NARU exhibited superior compression performance to FLAC, WavPack, TTA, and MPEG4-ALS. This method showed strength in the classical and jazz categories, whereas WavPack exhibited superior performance for popular music. We believe that NARU excels for quieter music, as classical and jazz music tends to have lower signal amplitudes than popular music (see Table 1).


Conclusion

We proposed two novel adaptive algorithms that introduce a natural gradient to the SA. The adaptive step-size algorithm, NNGSA, exhibits certain similarities with well-known algorithms such as NLMS and RLS. Furthermore, we demonstrated the superior performance of the proposed algorithms compared with the SA via toy-data and real-music-data experiments. In a future study, we will introduce an iterative method for estimating the AR coefficients and extensions to affine projection algorithms [22].

We also proposed a novel lossless audio codec scheme based on the NGSA, namely NARU, which exhibited superior compression performance to existing codecs such as FLAC, WavPack, TTA, and MPEG4-ALS. The NARU decoding speed was lower than those of the other codecs, excluding Monkey’s Audio. We found that the filter prediction and coefficient updating processes occupied the majority of the CPU time. Thus, we expect that this process can be accelerated through optimization, e.g., through loop unrolling and explicit use of SIMD instructions. Finally, it is remarkable that NARU achieves competitive performance compared to other state-of-the-art codecs despite its simple implementation.

In future work, we will add support for high-resolution (24-bit or higher) Wav files and perform further optimization for practical applications, including hardware support. We also plan to employ multichannel decorrelation methods [34] to improve the compression rate for multichannel audio.

We believe that the proposed methods are applicable to other signal processing tasks, e.g., noise cancellation, audio enhancement, and system identification.

Appendix 1: Proposition proofs

For convenience in the following proofs, we employ the residual vector θ[n] between the unknown parameter \(\boldsymbol{h}^{\ast}\) and the current parameter h[n], as

$$\begin{array}{*{20}l} \boldsymbol{\theta}[n] := \boldsymbol{h}^{\ast} - \boldsymbol{h}[n], \end{array} $$

and we define an exponent for the autocorrelation matrix R as

$$\begin{array}{*{20}l} \boldsymbol{R}^{\alpha} := \boldsymbol{Q} \boldsymbol{\Lambda}^{\alpha} \boldsymbol{Q}^{\mathsf{T}}, \quad \alpha \in \mathbb{Q}, \end{array} $$

where Q is an orthogonal matrix and Λ is a diagonal matrix for which the diagonal elements are eigenvalues of R.

NGSA inequality

When Eq. (15) is employed,

$$\begin{array}{*{20}l} \boldsymbol{\theta}[n+1] &= \boldsymbol{\theta}[n] - \mu \text{sgn}(\varepsilon[n]) \boldsymbol{R}^{-1} \boldsymbol{x}[n], \end{array} $$

where μ:=μNGSA. Multiplying both sides by \(\boldsymbol {R}^{\frac {1}{2}}\) from the left, and taking the square of the L2 norm \(\|{\cdot }\|_{2}^{2}\), we obtain

$$\begin{array}{*{20}l} & \|{\boldsymbol{R}^{\frac{1}{2}}\boldsymbol{\theta}[n+1]}\|_{2}^{2} \\ &= \|{\boldsymbol{R}^{\frac{1}{2}} \boldsymbol{\theta}[n]}\|_{2}^{2} - 2\mu\text{sgn}(\varepsilon[n])\boldsymbol{\theta}[n]^{\mathsf{T}} \boldsymbol{x}[n] \\ &\quad + \mu^{2} \|{\boldsymbol{R}^{-\frac{1}{2}}\boldsymbol{x}[n]}\|_{2}^{2} \end{array} $$
$$\begin{array}{*{20}l} &\leq \|{\boldsymbol{R}^{\frac{1}{2}} \boldsymbol{\theta}[n]}\|_{2}^{2} - 2\mu|\varepsilon[n]| + 2\mu|v[n]| + \mu^{2} \frac{\|{\boldsymbol{x}[n]}\|_{2}^{2}}{\lambda_{\text{min}}}. \end{array} $$

Taking the mean of Eq. (42) yields

$$\begin{array}{*{20}l} & \mathrm{E}\left[{\|{\boldsymbol{R}^{\frac{1}{2}}\boldsymbol{\theta}[n+1]}\|_{2}^{2}}\right] \\ & \leq \mathrm{E}\left[{\|{\boldsymbol{R}^{\frac{1}{2}} \boldsymbol{\theta}[n]}\|_{2}^{2}}\right] - 2\mu\mathrm{E}\left[{|\varepsilon[n]|}\right] + 2\mu\varepsilon_{\text{min}} \\ &\quad + 2\mu^{2} \frac{h}{\lambda_{\text{min}}} \end{array} $$
$$\begin{array}{*{20}l} & \leq \dots \\ & \leq r - 2\mu \sum_{k=1}^{n}\mathrm{E}\left[{|\varepsilon[k]|}\right] + 2n\mu\varepsilon_{\text{min}} + 2n\mu^{2} \frac{h}{\lambda_{\text{min}}}, \end{array} $$

where \(r = \mathrm {E}\left [{\|{\boldsymbol {R}^{\frac {1}{2}} \boldsymbol {\theta }[1]}\|_{2}^{2}}\right ]\). Dividing both sides by 2nμ and rearranging, we obtain

$$\begin{array}{*{20}l} \frac{1}{n} \sum_{k=1}^{n} \mathrm{E}\left[{|\varepsilon[k]|}\right] \leq \varepsilon_{\text{min}} + \mu\frac{h}{\lambda_{\text{min}}} + \frac{r}{2n\mu}. \end{array} $$

Hence, we obtain Eq. (16) by taking the limit n→∞, in which the last term on the right-hand side vanishes.

NNGSA convergence condition

In the case where Eq. (21) is employed,

$$\begin{array}{*{20}l} & \boldsymbol{\theta}[n+1] = \boldsymbol{h}^{\ast} - \boldsymbol{h}[n] - {\frac{\mu\varepsilon[n]}{\boldsymbol{x}[n]^{\mathsf{T}} \boldsymbol{R}^{-1} \boldsymbol{x}[n]} }\boldsymbol{R}^{-1} \boldsymbol{x}[n] \end{array} $$
$$\begin{array}{*{20}l} &= (\boldsymbol{I} - \mu \boldsymbol{P}[n]) \boldsymbol{\theta}[n] - {\frac{\mu\varepsilon^{\ast}[n]}{\boldsymbol{x}[n]^{\mathsf{T}} \boldsymbol{R}^{-1} \boldsymbol{x}[n]}} \boldsymbol{R}^{-1}\boldsymbol{x}[n], \end{array} $$

where \(\varepsilon ^{\ast }[n] = d[n] - \boldsymbol {h}^{\ast \mathsf {T}}\boldsymbol {x}[n]\), \(\boldsymbol {P}[n] = \frac {\boldsymbol {R}^{-1}\boldsymbol {x}[n]\boldsymbol {x}[n]^{\mathsf {T}}}{\boldsymbol {x}[n]^{\mathsf {T}} \boldsymbol {R}^{-1} \boldsymbol {x}[n]}\), and \(\mu := \mu_{\text{NNGSA}}\). Taking the mean of Eq. (47), we obtain

$$\begin{array}{*{20}l} \mathrm{E}\left[{\boldsymbol{\theta}[n+1]}\right] = \mathrm{E}\left[{(\boldsymbol{I} - \mu \boldsymbol{P}[n])\boldsymbol{\theta}[n]}\right], \end{array} $$

as the mean gradient for the unknown parameter is 0. Furthermore, \(\boldsymbol{h}[n]\) and \(\boldsymbol{x}[n]\) are statistically independent, such that

$$\begin{array}{*{20}l} \mathrm{E}\left[{\boldsymbol{\theta}[n+1]}\right] = \mathrm{E}\left[{\boldsymbol{I} - \mu \boldsymbol{P}[n]}\right] \mathrm{E}\left[{\boldsymbol{\theta}[n]}\right]. \end{array} $$

We can express \(\mathrm{E}\left[{\boldsymbol{P}[n]}\right]\) as

$$\begin{array}{*{20}l} \mathrm{E}\left[{\boldsymbol{P}[n]}\right] = \boldsymbol{Q}\boldsymbol{\Lambda}^{-\frac{1}{2}} \boldsymbol{R}_{\boldsymbol{q}} \boldsymbol{\Lambda}^{\frac{1}{2}}\boldsymbol{Q}^{\mathsf{T}}, \end{array} $$

where \(\boldsymbol {q}[n] = \boldsymbol {\Lambda }^{-\frac {1}{2}} \boldsymbol {Q}^{\mathsf {T}} \boldsymbol {x}[n], \boldsymbol {R}_{\boldsymbol {q}} = \mathrm {E}\left [{\frac {\boldsymbol {q}[n]\boldsymbol {q}[n]^{\mathsf {T}}}{\boldsymbol {q}[n]^{\mathsf {T}}\boldsymbol {q}[n]}}\right ]\). Furthermore,

$$\begin{array}{*{20}l} \boldsymbol{R}_{\boldsymbol{q}} = \boldsymbol{Q}_{\boldsymbol{q}}\boldsymbol{\Lambda}_{\boldsymbol{q}}\boldsymbol{Q}_{\boldsymbol{q}}^{\mathsf{T}} \end{array} $$

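holds as \(\boldsymbol{R}_{\boldsymbol{q}}\) is symmetric, where \(\boldsymbol{Q}_{\boldsymbol{q}}\) is an orthogonal matrix and \(\boldsymbol{\Lambda}_{\boldsymbol{q}}\) is the diagonal matrix whose elements are the eigenvalues of \(\boldsymbol{R}_{\boldsymbol{q}}\). Hence, Eq. (49) is rewritten as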

$$\begin{array}{*{20}l} \mathrm{E}\left[{\boldsymbol{I} - \mu \boldsymbol{P}[n]}\right] = \boldsymbol{Q}\boldsymbol{\Lambda}^{-\frac{1}{2}}\boldsymbol{Q}_{\boldsymbol{q}}(\boldsymbol{I} - \mu\boldsymbol{\Lambda}_{\boldsymbol{q}})\boldsymbol{Q}_{\boldsymbol{q}}^{\mathsf{T}}\boldsymbol{\Lambda}^{\frac{1}{2}}\boldsymbol{Q}^{\mathsf{T}}. \end{array} $$

Therefore, to satisfy \({\lim }_{n\to \infty }\mathrm {E}\left [{\boldsymbol {\theta }[n]}\right ] = \boldsymbol {0}\),

$$\begin{array}{*{20}l} |1 - \mu \lambda_{\boldsymbol{q}i}| < 1 \quad (i = 1,...,N) \end{array} $$

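is required, where \(\lambda_{\boldsymbol{q}i}\) is the \(i\)th eigenvalue of \(\boldsymbol{R}_{\boldsymbol{q}}\). Here, \(\boldsymbol{R}_{\boldsymbol{q}}\) is a positive semi-definite matrix and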

$$\begin{array}{*{20}l} \text{tr}[{\boldsymbol{\Lambda}_{\boldsymbol{q}}}] = \text{tr}[{\boldsymbol{R}_{\boldsymbol{q}}}] = 1 \end{array} $$

holds. Hence, the eigenvalue range is

$$\begin{array}{*{20}l} 0 \leq \lambda_{\boldsymbol{q}i} \leq 1 \quad (i = 1,...,N). \end{array} $$

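Therefore, the convergence condition is obtained by considering the worst case \(\max_{i \in \{1,...,N\}} \lambda_{\boldsymbol{q}i} = 1\), for which the requirement \(|1 - \mu \lambda_{\boldsymbol{q}i}| < 1\) reduces to \(0 < \mu < 2\).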
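The trace property and the resulting eigenvalue range can be checked numerically. The following sketch is restricted to \(N = 2\), where the eigenvalues of a symmetric matrix have a closed form, and draws \(\boldsymbol{q}\) from an arbitrarily chosen distribution; both choices, as well as the function names, are assumptions made only for illustration. The trace identity holds for any nonzero \(\boldsymbol{q}\), so the particular distribution does not matter for that check.

```python
import math
import random


def estimate_rq_2d(num_samples=10000, seed=0):
    """Sample estimate of R_q = E[q q^T / (q^T q)] for 2-D vectors q.

    The distribution of q is an arbitrary choice for illustration.
    Returns (a, b, c) with R_q = [[a, b], [b, c]].
    """
    rng = random.Random(seed)
    a = b = c = 0.0
    for _ in range(num_samples):
        q1 = rng.gauss(0.0, 1.0)
        q2 = rng.gauss(0.0, 2.0)
        s = q1 * q1 + q2 * q2  # q^T q (zero only with probability 0)
        a += q1 * q1 / s
        b += q1 * q2 / s
        c += q2 * q2 / s
    return a / num_samples, b / num_samples, c / num_samples


def sym2x2_eigenvalues(a, b, c):
    """Closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius


a, b, c = estimate_rq_2d()
lam_lo, lam_hi = sym2x2_eigenvalues(a, b, c)
print("trace:", a + c)
print("eigenvalues:", lam_lo, lam_hi)
```

Because every sample term \(\boldsymbol{q}\boldsymbol{q}^{\mathsf{T}} / \boldsymbol{q}^{\mathsf{T}}\boldsymbol{q}\) has unit trace, the estimated trace equals 1 up to rounding, and both eigenvalues fall in \([0, 1]\).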

Derivation of efficient natural-gradient update method

Employing Eq. (10), the elements of \(\boldsymbol{m}[n]\) can be calculated as

$$ \begin{aligned} & m_{1}[n] = x_{n-N+1} - \psi_{1} x_{n-N+2} -... - \psi_{p} x_{n - N + p + 1}, \\ & m_{2}[n] = -\psi_{1} x_{n-N+1} + (1 + \psi_{1}^{2}) x_{n-N+2} \\ & \quad +... + (\psi_{1}\psi_{p} - \psi_{p-1}) x_{n - N + p + 1} - \psi_{p} x_{n - N + p + 2}, \\ & \vdots \\ & m_{p+1}[n] = -\psi_{p} x_{n-N+1} + (\psi_{1}\psi_{p} - \psi_{p-1}) x_{n-N+2} \\ & \quad +... + (1 + \psi_{1}^{2} +... + \psi_{p}^{2})x_{n-N+p+1} \\ & \quad +... + (\psi_{1}\psi_{p} - \psi_{p-1}) x_{n - N + 2p} - \psi_{p} x_{n - N + 2p + 1}, \\ & m_{p+2}[n] = -\psi_{p} x_{n-N+2} + (\psi_{1}\psi_{p} - \psi_{p-1}) x_{n-N+3} \\ & \quad +... + (1 + \psi_{1}^{2} +... + \psi_{p}^{2})x_{n-N+p+2} \\ & \quad +... + (\psi_{1}\psi_{p} - \psi_{p-1}) x_{n - N + 2p + 1} - \psi_{p} x_{n - N + 2p + 2}, \\ & \vdots \\ & m_{N-p}[n] = -\psi_{p} x_{n-2p} + (\psi_{1}\psi_{p} - \psi_{p-1}) x_{n-2p+1} \\ & \quad +... + (1 + \psi_{1}^{2} +... + \psi_{p}^{2})x_{n-p} \\ & \quad +... + (\psi_{1}\psi_{p} - \psi_{p-1}) x_{n - 1} - \psi_{p} x_{n}, \\ & m_{N-p+1}[n] = -\psi_{p} x_{n-2p+1} + (\psi_{1}\psi_{p} - \psi_{p-1}) x_{n-2p+2} \\ & \quad +... + (1 + \psi_{1}^{2} +... + \psi_{p-1}^{2})x_{n-p+1} \\ & \quad +... + (\psi_{1}\psi_{p-1} - \psi_{p-2}) x_{n - 1} - \psi_{p-1} x_{n}, \\ & \vdots \\ & m_{N-1}[n] = -\psi_{p} x_{n-p-1} + (\psi_{1}\psi_{p} - \psi_{p-1}) x_{n-p} \\ & \quad +... + (1 + \psi_{1}^{2})x_{n-1} - \psi_{1} x_{n}, \\ & m_{N}[n] = -\psi_{p} x_{n-p} - \psi_{p-1} x_{n-p+1} -... - \psi_{1}x_{n-1} + x_{n}. \end{aligned} $$

Hence, we can express \(\boldsymbol{m}[n+1]\) as follows:

$$ \begin{aligned} & m_{1}[n+1] = m_{2}[n] + \psi_{1} m_{1}[n], \\ & m_{2}[n+1] = m_{3}[n] + \psi_{2} m_{1}[n], \\ & \vdots \\ & m_{p}[n+1] = m_{p+1}[n] + \psi_{p} m_{1}[n], \\ & m_{p+1}[n+1] = m_{p+2}[n], \\ & \vdots \\ & m_{N-p-1}[n+1] = m_{N-p}[n], \\ & m_{N-p}[n+1] = m_{N-p+1}[n] - \psi_{p} m_{N}[n+1], \\ & \vdots \\ & m_{N-1}[n+1] = m_{N}[n] - \psi_{1} m_{N}[n+1], \\ & m_{N}[n+1] = -\psi_{p} x_{n-p+1} -... - \psi_{1}x_{n} + x_{n+1}, \end{aligned} $$

and the Mahalanobis norm \(\boldsymbol {x}[n]^{\mathsf {T}} \boldsymbol {K}_{p}^{-1} \boldsymbol {x}[n] = \boldsymbol {m}[n]^{\mathsf {T}} \boldsymbol {x}[n]\) can be updated as follows:

$$\begin{array}{*{20}l} \boldsymbol{x}&[n+1]^{\mathsf{T}} \boldsymbol{K}_{p}^{-1} \boldsymbol{x}[n+1] = \boldsymbol{m} [n+1]^{\mathsf{T}} \boldsymbol{x}[n+1] \end{array} $$
$$\begin{array}{*{20}l} &= \sum_{i = 2}^{N} m_{i}[n] x_{n-N+i} + m_{1}[n] \sum_{i = 1}^{p} \psi_{i} x_{n-N+1+i} \\ &\quad - m_{N}[n+1] \sum_{i = 1}^{p} \psi_{i} x_{n+1-i} + m_{N}[n+1]x_{n+1} \end{array} $$
$$\begin{array}{*{20}l} &= \boldsymbol{m}[n]^{\mathsf{T}} \boldsymbol{x}[n] \\ &\quad + m_{1}[n] \left(-x_{n-N+1} + \sum_{i = 1}^{p} \psi_{i} x_{n-N+1+i} \right) \\ &\quad - m_{N}[n+1] \left(\sum_{i = 1}^{p} \psi_{i} x_{n+1-i} - x_{n+1} \right) \end{array} $$
$$\begin{array}{*{20}l} &= \boldsymbol{x}[n]^{\mathsf{T}} \boldsymbol{K}_{p}^{-1} \boldsymbol{x}[n] - m_{1}[n]^{2} + m_{N}[n+1]^{2}. \end{array} $$
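The recursion for \(\boldsymbol{m}[n+1]\) and the norm update above can be sketched as follows. This is a minimal illustration in Python, not the NARU implementation: it assumes \(N > 2p\), normalizes the innovation variance to 1, and uses the banded structure of \(\boldsymbol{K}_{p}^{-1}\) for an AR(p) model as a slow reference. The names `direct_m` and `step` are hypothetical.

```python
from collections import deque


def direct_m(window, psi):
    """Reference O(N*p) evaluation of m = K_p^{-1} x for one window."""
    n, p = len(window), len(psi)
    a = [1.0] + [-c for c in psi]  # a_0 = 1, a_k = -psi_k

    def kinv(i, j):  # 0-based entry of the banded inverse
        if i > j:
            i, j = j, i
        lag = j - i
        if lag > p:
            return 0.0
        top = min(i, n - 1 - j, p - lag)
        return sum(a[k] * a[k + lag] for k in range(top + 1))

    return [sum(kinv(i, j) * window[j] for j in range(n)) for i in range(n)]


def step(m, norm, recent, psi, x_new):
    """Advance m[n] -> m[n+1] in place with O(p) multiply-adds.

    m: deque of length N; recent: deque (maxlen=p) of the last p samples,
    newest last. Returns the updated Mahalanobis norm.
    """
    p = len(psi)
    m_first = m.popleft()  # m_1[n]; the remaining entries shift by one
    for i in range(p):  # head: m_i[n+1] = m_{i+1}[n] + psi_i m_1[n]
        m[i] += psi[i] * m_first
    # m_N[n+1] = x_{n+1} - psi_1 x_n - ... - psi_p x_{n-p+1}
    m_last = x_new - sum(psi[i] * recent[-1 - i] for i in range(p))
    for k in range(1, p + 1):  # tail: m_{N-k}[n+1] = m_{N-k+1}[n] - psi_k m_N[n+1]
        m[-k] -= psi[k - 1] * m_last
    m.append(m_last)
    recent.append(x_new)
    return norm - m_first ** 2 + m_last ** 2


# Illustrative data: AR(2) coefficients and a short input sequence.
psi = [0.5, -0.3]
samples = [0.3, -1.2, 0.8, 2.0, -0.5, 1.1, 0.4, -0.9, 1.7, 0.2]
N = 6
window = samples[:N]
m = deque(direct_m(window, psi))
norm = sum(mi * xi for mi, xi in zip(m, window))
recent = deque(window[-len(psi):], maxlen=len(psi))

max_err = 0.0
for x_new in samples[N:]:
    norm = step(m, norm, recent, psi, x_new)
    window = window[1:] + [x_new]
    ref = direct_m(window, psi)
    ref_norm = sum(u * v for u, v in zip(ref, window))
    max_err = max(max_err, abs(norm - ref_norm),
                  max(abs(u - v) for u, v in zip(m, ref)))
print("max deviation from direct computation:", max_err)
```

A double-ended queue is used so that discarding \(m_{1}[n]\) and appending \(m_{N}[n+1]\) do not require moving the remaining entries; only the first and last \(p\) entries are modified at each step.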

Convergence condition for LMS/Newton algorithm

In the case that Eq. (27) is used,

$$\begin{array}{*{20}l} & \boldsymbol{h}[n+1] \\ & = \boldsymbol{h}[n] + \mu\boldsymbol{R}_{1}^{-1}(d[n]\boldsymbol{x}[n] - \boldsymbol{x}[n]\boldsymbol{x}[n]^{\mathsf{T}}\boldsymbol{h}[n]), \end{array} $$

where \(\mu := \mu_{\text{LMSN}}\). Taking the mean of both sides, we obtain

$$\begin{array}{*{20}l} & \mathrm{E}\left[{\boldsymbol{h}[n+1]}\right] \\ &= \mathrm{E}\left[{\boldsymbol{h}[n]}\right] \\ &\quad + \mu\boldsymbol{R}_{1}^{-1}\left(\mathrm{E}\left[{d[n]\boldsymbol{x}[n]}\right] - \mathrm{E}\left[{\boldsymbol{x}[n]\boldsymbol{x}[n]^{\mathsf{T}}\boldsymbol{h}[n]}\right]\right) \end{array} $$
$$\begin{array}{*{20}l} &= \mathrm{E}\left[{\boldsymbol{h}[n]}\right] \\ &\quad + \mu\boldsymbol{R}_{1}^{-1}(\mathrm{E}\left[{d[n]\boldsymbol{x}[n]}\right] - \mathrm{E}\left[{\boldsymbol{x}[n]\boldsymbol{x}[n]^{\mathsf{T}}}\right]\mathrm{E}\left[{\boldsymbol{h}[n]}\right]) \end{array} $$
$$\begin{array}{*{20}l} &= \mathrm{E}\left[{\boldsymbol{h}[n]}\right] + \mu\boldsymbol{R}_{1}^{-1}(\boldsymbol{R}\boldsymbol{h}^{\ast} - \boldsymbol{R}\mathrm{E}\left[{\boldsymbol{h}[n]}\right]) \end{array} $$
$$\begin{array}{*{20}l} &= \mathrm{E}\left[{\boldsymbol{h}[n]}\right] + \mu\boldsymbol{R}_{1}^{-1}\boldsymbol{R}\mathrm{E}\left[{\boldsymbol{\theta}[n]}\right]. \end{array} $$

Here, Eq. (64) exploits the statistical independence between \(\boldsymbol{x}[n]\) and \(\boldsymbol{h}[n]\), and Eq. (65) utilizes the Wiener–Hopf solution. Subtracting both sides from \(\boldsymbol{h}^{\ast}\), we have

$$\begin{array}{*{20}l} \mathrm{E}\left[{\boldsymbol{\theta}[n+1]}\right] = (\boldsymbol{I} - \mu\boldsymbol{R}_{1}^{-1}\boldsymbol{R})\mathrm{E}\left[{\boldsymbol{\theta}[n]}\right]. \end{array} $$

Hence, for \(\boldsymbol{h}[n]\) to converge to \(\boldsymbol{h}^{\ast}\) in the mean,

$$\begin{array}{*{20}l} 0 < \mu < \frac{2}{\eta_{\max}} \end{array} $$

is required [21], where \(\eta_{\max}\) is the maximum eigenvalue of \(\boldsymbol {R}_{1}^{-1}\boldsymbol {R}\). Furthermore, the eigenvalues \(\lambda_{k}\) of \(\boldsymbol{R}_{1}\) satisfy [29]

$$ \begin{aligned} \lambda_{k} &= \frac{\sigma^{2}(1 - \psi_{1}^{2})}{1 - 2\psi_{1} \cos\theta_{k} + \psi_{1}^{2}}, \\ & \frac{(k-1) \pi}{N+1} < \theta_{k} < \frac{k \pi}{N+1} \end{aligned} \quad (k = 1,..., N). $$

More roughly, the eigenvalues \(\lambda_{k}\) \((k = 1,...,N)\) satisfy

$$\begin{array}{*{20}l} \frac{\sigma^{2}(1 - \psi_{1})}{1 + \psi_{1}} < \lambda_{k} < \frac{\sigma^{2}(1 + \psi_{1})}{1 - \psi_{1}}. \end{array} $$

Therefore, employing the Rayleigh quotient,

$$\begin{array}{*{20}l} \eta_{\max} &= \max_{\boldsymbol{x} \neq \boldsymbol{0}} \frac{\boldsymbol{x}^{\mathsf{T}}\boldsymbol{R}\boldsymbol{x}}{\boldsymbol{x}^{\mathsf{T}}\boldsymbol{R}_{1}\boldsymbol{x}} = \max_{\boldsymbol{x} \neq \boldsymbol{0}} \frac{\boldsymbol{x}^{\mathsf{T}}\boldsymbol{R}\boldsymbol{x}}{\boldsymbol{x}^{\mathsf{T}}\boldsymbol{x}} \frac{\boldsymbol{x}^{\mathsf{T}}\boldsymbol{x}}{\boldsymbol{x}^{\mathsf{T}}\boldsymbol{R}_{1}\boldsymbol{x}} \end{array} $$
$$\begin{array}{*{20}l} &\leq \left(\max_{\boldsymbol{x} \neq \boldsymbol{0}} \frac{\boldsymbol{x}^{\mathsf{T}}\boldsymbol{R}\boldsymbol{x}}{\boldsymbol{x}^{\mathsf{T}}\boldsymbol{x}} \right) \left(\min_{\boldsymbol{x} \neq \boldsymbol{0}} \frac{\boldsymbol{x}^{\mathsf{T}}\boldsymbol{R}_{1}\boldsymbol{x}}{\boldsymbol{x}^{\mathsf{T}}\boldsymbol{x}} \right)^{-1} \end{array} $$
$$\begin{array}{*{20}l} &< N\sigma^{2} \left\{ \frac{\sigma^{2}(1 - \psi_{1})}{1 + \psi_{1}} \right\}^{-1} \end{array} $$
$$\begin{array}{*{20}l} &= \frac{N(1 + \psi_{1})}{1 - \psi_{1}}. \end{array} $$

Here, Eq. (73) exploits the fact that the maximum eigenvalue of \(\boldsymbol{R}\) is smaller than \(\text{tr}[\boldsymbol{R}] = N\sigma^{2}\).
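Combining the bound on \(\eta_{\max}\) with the condition \(0 < \mu < 2/\eta_{\max}\) shows that any step size with \(0 < \mu \leq \frac{2(1 - \psi_{1})}{N(1 + \psi_{1})}\) is sufficient for convergence in the mean. The following sketch evaluates this conservative limit and checks the rough eigenvalue bounds against the eigenvalue expression given above. It assumes \(0 < \psi_{1} < 1\); the function names and numerical values are hypothetical.

```python
import math


def ar1_eigen_curve(sigma2, psi1, theta):
    """sigma^2 (1 - psi1^2) / (1 - 2 psi1 cos(theta) + psi1^2)."""
    return sigma2 * (1 - psi1 ** 2) / (1 - 2 * psi1 * math.cos(theta) + psi1 ** 2)


def ar1_eigen_bounds(sigma2, psi1):
    """Rough lower and upper eigenvalue bounds (assumes 0 < psi1 < 1)."""
    return (sigma2 * (1 - psi1) / (1 + psi1),
            sigma2 * (1 + psi1) / (1 - psi1))


def lmsn_step_size_limit(num_taps, psi1):
    """Conservative limit 2 (1 - psi1) / (N (1 + psi1)) on the step size."""
    return 2.0 * (1 - psi1) / (num_taps * (1 + psi1))


# Illustrative values only.
sigma2, psi1, num_taps = 1.0, 0.9, 32
lower, upper = ar1_eigen_bounds(sigma2, psi1)
grid = [math.pi * k / 1000 for k in range(1, 1000)]  # angles inside (0, pi)
inside = all(lower < ar1_eigen_curve(sigma2, psi1, t) < upper for t in grid)
print("eigenvalue curve within rough bounds:", inside)
print("conservative step-size limit:", lmsn_step_size_limit(num_taps, psi1))
```

The limit shrinks as \(\psi_{1}\) approaches 1, that is, as the input becomes more strongly correlated, and it is inversely proportional to the number of taps \(N\).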

Appendix 2: Toy-data experiment results for other configurations

Figures 9 and 10 show learning curves for toy-data experiments for −20 and −60 dB variance configurations.

Fig. 9 Learning curves for white-Gaussian-noise input (above: −20 dB, below: −60 dB)

Fig. 10 Learning curves for correlated noise input (above: −20 dB, below: −60 dB)

Availability of data and materials

The NARU codec implementation is available at



Abbreviations

SA: Sign algorithm

NGSA: Natural-gradient sign algorithm

LMS: Least mean square

NLMS: Normalized least mean square

NNGSA: Normalized natural-gradient sign algorithm

FIR: Finite impulse response

RLS: Recursive least squares

PCM: Pulse code modulation

MSD: Mean square deviation

AE: Absolute error

RWC: Real-world computing

RMS: Root mean square


References

  1. K. Konstantinides, An introduction to super audio CD and DVD-audio. IEEE Signal Proc. Mag. 20(4), 71–82 (2003).

  2. T. Moriya, N. Harada, Y. Kamamoto, H. Sekigawa, MPEG-4 ALS international standard for lossless audio coding. NTT Tech. Rev. 4(8), 40–45 (2006).

  3. M. Hans, R. W. Schafer, Lossless compression of digital audio. IEEE Signal Proc. Mag. 18(4), 21–32 (2001).

  4. T. Robinson, Shorten: simple lossless and near-lossless waveform compression. Technical Report, Cambridge Univ., Eng. Dept. (1994).

  5. T. Liebchen, MPEG-4 ALS-the standard for lossless audio coding. J. Acoust. Soc. Korea. 28(7), 618–629 (2009).

  6. Apple Lossless Audio Codec (2011). Accessed 23 Apr 2022.

  7. FLAC - free lossless audio codec (2011). Accessed 23 Apr 2022.

  8. WavPack audio compression (2004). Accessed 23 Apr 2022.

  9. TTA lossless audio codec - true audio compressor algorithms (2005). Accessed 23 Apr 2022.

  10. Monkey’s Audio - a fast and powerful lossless audio compressor (2000). Accessed 23 Apr 2022.

  11. R. F. Rice, in Appl. Digit. Image Process. III, 207. Practical universal noiseless coding, (1979), pp. 247–267.

  12. H. Kameoka, Y. Kamamoto, N. Harada, T. Moriya, A linear predictive coding algorithm minimizing the Golomb-Rice code length of the residual signal. Trans. Inst. Electron. Inf. Commun. Eng. A. 91, 1017–1025 (2008).

  13. P. S. Diniz, et al., Adaptive Filtering, vol. 4 (Springer, Massachusetts, 1997).

  14. A. Gersho, Adaptive filtering with binary reinforcement. IEEE Trans. Inf. Theory. 30(2), 191–199 (1984).

  15. L. Lu, H. Zhao, K. Li, B. Chen, A novel normalized sign algorithm for system identification under impulsive noise interference. Circ. Syst. Signal Proc. 35(9), 3244–3265 (2016).

  16. M. O. Sayin, N. D. Vanli, S. S. Kozat, A novel family of adaptive filtering algorithms based on the logarithmic cost. IEEE Trans. Signal Process. 62(17), 4411–4424 (2014).

  17. S. L. Gay, S. C. Douglas, in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2. Normalized natural gradient adaptive filtering for sparse and non-sparse systems (IEEE, New York, 2002), p. 1405.

  18. S. -I. Amari, Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998).

  19. T. Mineo, H. Shouno, in 2021 29th European Signal Processing Conference (EUSIPCO). Improving convergence rate of sign algorithm using natural gradient method (IEEE, New York, 2021), pp. 51–55.

  20. M. Siddiqui, On the inversion of the sample covariance matrix in a stationary autoregressive process. Ann. Math. Stat. 29(2), 585–588 (1958).

  21. B. Widrow, S. D. Stearns, Adaptive Signal Processing (Prentice Hall, Englewood Cliffs, 1985).

  22. S. S. Haykin, Adaptive Filter Theory (Pearson Education India, 2005).

  23. P. S. Diniz, L. W. Biscainho, Optimal variable step size for the LMS/Newton algorithm with application to subband adaptive filtering. IEEE Trans. Signal Process. 40(11), 2825–2829 (1992).

  24. P. S. Diniz, M. L. de Campos, A. Antoniou, Analysis of LMS-Newton adaptive filtering algorithms with variable convergence factor. IEEE Trans. Signal Process. 43(3), 617–627 (1995).

  25. S. -I. Amari, Differential-Geometrical Methods in Statistics, vol. 28 (Springer, New York, 2012).

  26. T. Petillon, A. Gilloire, S. Theodoridis, The fast newton transversal filter: an efficient scheme for acoustic echo cancellation in mobile radio. IEEE Trans. Signal Process. 42(3), 509–518 (1994).

  27. B. Farhang-Boroujeny, Fast LMS/Newton algorithms based on autoregressive modeling and their application to acoustic echo cancellation. IEEE Trans. Signal Process. 45(8), 1987–2000 (1997).

  28. J. E. Markel, A. H. Gray, Linear Prediction of Speech (Springer, Berlin, 1982).

  29. U. Grenander, G. Szegö, Toeplitz Forms and Their Applications (Univ of California Press, California, 1958).

  30. H. Huang, P. Franti, D. Huang, S. Rahardja, Cascaded RLS–LMS prediction in MPEG-4 lossless audio coding. IEEE Trans. Audio Speech Lang. Process. 16(3), 554–562 (2008).

  31. D. Salomon, Data compression: the complete reference (Springer Science & Business Media, Berlin/Heidelberg, 2007).

  32. (International Organization for Standardization, Geneva, 1990).

  33. M. Goto, H. Hashiguchi, T. Nishimura, R. Oka, RWC music database: popular, classical and jazz music databases. Ismir. 2, 287–288 (2002).

  34. Y. Kamamoto, N. Harada, T. Moriya, N. Ito, N. Ono, T. Nishimoto, S. Sagayama, in 2009 IEEE 13th International Symposium on Consumer Electronics. An efficient lossless compression of multichannel time-series signals by MPEG-4 ALS (IEEE, New York, 2009), pp. 159–163.


Acknowledgements

The authors thank the associate editor and the anonymous reviewers for their constructive comments and useful suggestions.


Not applicable.

Author information


Authors’ contributions

Taiyo Mineo: software and writing—original draft. Hayaru Shouno: writing, review, and editing. Both authors read and approved the final manuscript.

Authors’ information

Taiyo Mineo received a B. Eng. from the University of Electro-Communications, Tokyo, in 2014, and received an M. Eng. from the Tokyo Institute of Technology in 2016. He was employed by CRI Middleware Co., Ltd., from 2016 to 2020, and is currently pursuing a Ph.D. in information engineering at the University of Electro-Communications. His research interests include signal processing and machine learning.

Hayaru Shouno received a Ph.D. in Engineering from Osaka University, Osaka, in 1999. He is currently a Professor at the Graduate School of Informatics and Engineering, the University of Electro-Communications, Tokyo. His research interests include computer vision and machine learning involving neural networks. He is an Action Editor of Neural Networks and an elected governor of the Asia Pacific Neural Network Society (APNNS).

Corresponding author

Correspondence to Taiyo Mineo.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Mineo, T., Shouno, H. Improving sign-algorithm convergence rate using natural gradient for lossless audio compression. J AUDIO SPEECH MUSIC PROC. 2022, 12 (2022).


  • Adaptive algorithm
  • Autoregressive model
  • Lossless audio codec
  • Natural gradient method
  • Sign algorithm