As explained in Section 2, the original adaptive muting method in Appendix IV of ITU-T G.722 is applied linearly between successive frames according to a pre-determined curve. In the proposed algorithm, we present an improved adaptive muting algorithm which is applied non-linearly using two parametric shaping functions: the sigmoid function, an exponential-based function commonly used as the logistic function, and the raised-cosine function. We first compare the performance of the two parametric shaping functions as the muting curve according to various error criteria. Then, optimal values of the parameters of the two parametric shaping functions are selected by a grid search [23], which is an exhaustive search method that finds the optimal point in a manually specified subset of the parameter space of the learning algorithm, under the given error criteria: MSE, WB-PESQ, segSNR, and fwSNRseg. First, the sigmoid function is employed by incorporating three parameters, as given by
$$\begin{array}{@{}rcl@{}} G_{s}(n)= \frac{1+0.1 \cdot \alpha_{s} e^{-3 \beta_{s} \gamma_{s}}}{1+0.1 \cdot \alpha_{s} e^{\beta_{s} (n-3 \gamma_{s})}} \end{array} $$
((3))
where \(\alpha_{s}\) and \(\beta_{s}\) denote slope parameters that control the shape of the sigmoid function, and \(\gamma_{s}\) denotes a shift parameter. Second, we consider the raised-cosine type function, which is commonly used for pulse shaping:
$$\begin{array}{@{}rcl@{}} F(x) &=&\left\{\begin{array}{lllll} -1 &,& x<-\frac{1+\beta_{r}}{2\alpha_{r}}\\ \alpha_{r}x-\frac{1-\beta_{r}}{2} \\ \quad -\frac{\beta_{r}}{\pi}\cos\left[\frac{2\alpha_{r}x\pi+\pi}{2\beta_{r}}\right] &,& -\frac{1+\beta_{r}}{2\alpha_{r}}\leq x\leq -\frac{1-\beta_{r}}{2\alpha_{r}} \\ 2\alpha_{r}x &,& -\frac{1-\beta_{r}}{2\alpha_{r}} \leq x \leq \frac{1-\beta_{r}}{2\alpha_{r}} \\ \alpha_{r}x+\frac{1-\beta_{r}}{2} \\ \quad +\frac{\beta_{r}}{\pi}\cos\left[\frac{2\alpha_{r}x\pi-\pi}{2\beta_{r}}\right] &,& \frac{1-\beta_{r}}{2\alpha_{r}} \leq x \leq \frac{1+\beta_{r}}{2\alpha_{r}} \\ 1 &,& x>\frac{1+\beta_{r}}{2\alpha_{r}} \end{array}\right. \end{array} $$
((4))
where \(\alpha_{r}\) and \(\beta_{r}\) determine the shape and the dynamic range of the function. We modify this equation into a three-parameter raised-cosine type function, which is a time-scaled and shifted version of (4) in accordance with the muting curve:
$$\begin{array}{@{}rcl@{}} G_{r}(n) = \frac{1}{2}\left[F\left(\frac{-n+\gamma_{r}}{2 \gamma_{r}}\right)+1\right] \end{array} $$
((5))
where \(\gamma_{r}\) denotes a shift parameter. It is worth noting that the muting curve G(n) can be controlled by the core parameters (i.e., α, β, and γ) of each function. Also, in contrast with the original method, the muting factor G(n) does not become zero after 320 samples for either the sigmoid or the raised-cosine function, which offers greater flexibility in the muting curve.
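As a minimal sketch (not the implementation used in this work), the muting curves in (3) and (5) could be evaluated as follows. The function names are hypothetical, and the code assumes that n is the sample index counted from the start of the muted segment.

```python
import math


def sigmoid_gain(n, alpha_s, beta_s, gamma_s):
    """Eq. (3): sigmoid-type muting factor, normalized so that G_s(0) = 1."""
    num = 1.0 + 0.1 * alpha_s * math.exp(-3.0 * beta_s * gamma_s)
    den = 1.0 + 0.1 * alpha_s * math.exp(beta_s * (n - 3.0 * gamma_s))
    return num / den


def raised_cosine(x, alpha_r, beta_r):
    """Eq. (4): piecewise raised-cosine type function F(x) with range [-1, 1]."""
    lo = (1.0 - beta_r) / (2.0 * alpha_r)  # edge of the linear region
    hi = (1.0 + beta_r) / (2.0 * alpha_r)  # edge of the saturated region
    if x < -hi:
        return -1.0
    if x < -lo:
        phase = (2.0 * alpha_r * x * math.pi + math.pi) / (2.0 * beta_r)
        return alpha_r * x - (1.0 - beta_r) / 2.0 - beta_r / math.pi * math.cos(phase)
    if x <= lo:
        return 2.0 * alpha_r * x
    if x <= hi:
        phase = (2.0 * alpha_r * x * math.pi - math.pi) / (2.0 * beta_r)
        return alpha_r * x + (1.0 - beta_r) / 2.0 + beta_r / math.pi * math.cos(phase)
    return 1.0


def raised_cosine_gain(n, alpha_r, beta_r, gamma_r):
    """Eq. (5): time-scaled and shifted F, mapped from [-1, 1] to [0, 1]."""
    x = (-n + gamma_r) / (2.0 * gamma_r)
    return 0.5 * (raised_cosine(x, alpha_r, beta_r) + 1.0)
```

For example, `sigmoid_gain(n, 2.0, 0.05, 40.0)` and `raised_cosine_gain(n, 1.0, 0.5, 160.0)` both give monotonically decreasing curves over n; these parameter values are illustrative only and are not the trained values.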
As a consequence, these monotonically decreasing parametric shaping functions can offer more flexibility in the shape of the muting curve than the reference muting curves [21, 22]. Since it is well known that the Other cases class, which includes unvoiced, weakly voiced, and voiced signals, plays a dominant role in the perceived quality of reconstructed speech, we apply the parametric shaping functions as the muting curve to the Other cases class only, while (2) is applied to the Transient and UV transition classes.
To estimate the difference between the desired speech signal and the reconstructed speech signal, we adopt various error criteria: MSE, WB-PESQ, segSNR, and fwSNRseg. Considering the MSE criterion first, we use (1) so that the error between the desired signal and the reconstructed signal can be written as
$$\begin{array}{@{}rcl@{}} \varepsilon(n)&=&dl(n)-yl(n) \\ &=&dl(n)-G(n)\cdot yl_{\text{pre}}(n), \end{array} $$
((6))
where \(d_{l}(n)\) denotes the desired lower-band signal, which is equal to the lower-band decoded signal without any packet losses, and G(n) is defined by (3) or (5). Thus, the MSE can be expressed as
$$\begin{array}{@{}rcl@{}} J(\alpha, \beta, \gamma) = \sum\limits^{N}_{n=1}E\left[\varepsilon(n)\right]^{2}, \end{array} $$
((7))
where N denotes the total number of samples in a training data file. Note that the cost function in (7) contains three unknown parameters, α, β, and γ, for both the sigmoid (\(\alpha_{s}\), \(\beta_{s}\), \(\gamma_{s}\)) and raised-cosine (\(\alpha_{r}\), \(\beta_{r}\), \(\gamma_{r}\)) type functions, so that it can be expressed as a function of α, β, and γ. From (7), the average of the MSEs over the training data files is expressed as
$$\begin{array}{@{}rcl@{}} \xi(\alpha, \beta, \gamma)=\frac{1}{L}\sum\limits^{L}_{l=1} J_{l}(\alpha, \beta, \gamma) \end{array} $$
((8))
where l and L denote the index of the training file and the total number of training files used for the grid search, respectively, each file being processed by the proposed PLC algorithm. To find the optimal parameters, we compute the average in (8) over all training data in the speech materials by varying α, β, and γ:
$$\begin{array}{@{}rcl@{}} ({\alpha^ \ast}, {\beta^ \ast}, {\gamma^ \ast})=\arg\min_{\alpha, \beta, \gamma} \xi(\alpha, \beta, \gamma) \end{array} $$
((9))
and taking the optimal parameters \(\alpha^{\ast}\), \(\beta^{\ast}\), and \(\gamma^{\ast}\) to be those that minimize ξ(α,β,γ).
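The search in (6)-(9) can be sketched as below, under several stated assumptions: the expectation in (7) is replaced by the plain sum over the samples of each file, n is counted from 0 at the start of the muted segment, and each training file is represented by a hypothetical pair of lists (desired signal, signal before muting). The function names and the grid layout are illustrative, and a full experiment would re-run the PLC decoder rather than work on precomputed lists.

```python
import itertools
import math


def sigmoid_gain(n, alpha, beta, gamma):
    """Eq. (3): sigmoid-type muting factor."""
    num = 1.0 + 0.1 * alpha * math.exp(-3.0 * beta * gamma)
    den = 1.0 + 0.1 * alpha * math.exp(beta * (n - 3.0 * gamma))
    return num / den


def mse_cost(desired, pre_muting, gain, params):
    """Eqs. (6)-(7) for one file: sum over n of (d_l(n) - G(n) * yl_pre(n))^2."""
    return sum(
        (d - gain(n, *params) * y) ** 2
        for n, (d, y) in enumerate(zip(desired, pre_muting))
    )


def grid_search_mse(files, gain, grid):
    """Eqs. (8)-(9): average the per-file cost and keep the minimizing triple.

    files: list of (desired, pre_muting) pairs (hypothetical data layout).
    grid:  three lists of candidate values for alpha, beta, and gamma.
    """
    best_params, best_cost = None, math.inf
    for params in itertools.product(*grid):
        avg = sum(mse_cost(d, y, gain, params) for d, y in files) / len(files)
        if avg < best_cost:
            best_params, best_cost = params, avg
    return best_params, best_cost
```

Because the search is exhaustive, its cost grows with the product of the three grid sizes, which is acceptable here only because the training is performed off-line.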
Since packet losses actually affect the signal quality during speech periods, we also adopt well-known objective speech quality measures, namely WB-PESQ, segSNR, and fwSNRseg, as error criteria.
First, the segSNR, instead of working on the whole signal, is calculated as the average of the SNR values over short frames, as given by
$$\begin{array}{@{}rcl@{}} segSNR = \frac{10}{M}\sum\limits^{M-1}_{m=0} \log_{10} \frac{\sum^{Tm+T-1}_{n=Tm} x^{2}(n)}{\sum^{Tm+T-1}_{n=Tm} \{x(n)-\hat{x}(n)\}^{2}}, \end{array} $$
((10))
where x(n) and \(\hat{x}(n)\) denote the desired and reconstructed signals, and T and M indicate the frame length (10 ms) and the number of frames in the signal, respectively. The upper and lower limits of the ratio are 35 and −10 dB, respectively.
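A minimal sketch of the segSNR computation in (10) is given below. It assumes that the 35 dB and −10 dB limits are applied to each frame before averaging, which is the usual practice but is not spelled out in the text, and that the frame length is passed in samples. The function name is hypothetical.

```python
import math


def seg_snr(clean, processed, frame_len, lo_db=-10.0, hi_db=35.0):
    """Eq. (10): mean of per-frame SNRs in dB, each limited to [lo_db, hi_db]."""
    n_frames = min(len(clean), len(processed)) // frame_len
    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")
    total = 0.0
    for m in range(n_frames):
        start, stop = m * frame_len, (m + 1) * frame_len
        x = clean[start:stop]
        x_hat = processed[start:stop]
        sig = sum(a * a for a in x)
        err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        if err == 0.0:
            snr = hi_db  # perfect reconstruction: use the upper limit
        elif sig == 0.0:
            snr = lo_db  # silent reference frame: use the lower limit
        else:
            snr = 10.0 * math.log10(sig / err)
        total += min(max(snr, lo_db), hi_db)
    return total / n_frames
```

With 10 ms frames, `frame_len` would be 160 samples for a 16 kHz signal; any trailing samples that do not fill a whole frame are ignored in this sketch.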
Next, fwSNRseg is a frequency-weighted segSNR computed over frequency bands proportional to the critical bands, which can be defined as follows:
$$\begin{array}{@{}rcl@{}} fwSNRseg = \frac{10}{M}\sum\limits^{M-1}_{m=0}\frac{\sum^{K-1}_{j=0} W(j,m) \log_{10} \frac{X(j,m)^{2}}{\{X(j,m)-\hat{X}(j,m) \}^{2}} }{\sum^{K-1}_{j=0} W(j,m) }, \end{array} $$
((11))
where W(j,m) is the weight on the jth subband in the mth frame, which is taken from the ANSI SII standard, K is the number of subbands, X(j,m) is the spectrum magnitude of the jth subband in the mth frame, and \(\hat {X}({j,m})\) is the corresponding distorted spectrum magnitude. Previous studies have shown that segSNR and fwSNRseg show significantly higher correlation with subjective quality than the classical SNR.
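The measure in (11) can be sketched as follows, assuming that the subband magnitudes and weights have already been computed and are passed as lists indexed as `[m][j]`. The small constant `eps` is an assumption added only to avoid division by zero and log of zero; it is not part of (11). No ratio limits are applied, since (11) does not state any. The function name is hypothetical, and the SII weights must be supplied by the caller.

```python
import math


def fw_snr_seg(clean_mag, dist_mag, weights, eps=1e-12):
    """Eq. (11): frequency-weighted segmental SNR in dB.

    clean_mag[m][j]: X(j, m), spectrum magnitude of subband j in frame m.
    dist_mag[m][j]:  distorted spectrum magnitude for the same subband and frame.
    weights[m][j]:   W(j, m), weight of subband j in frame m.
    """
    if not clean_mag:
        raise ValueError("at least one frame is required")
    total = 0.0
    for x_m, xh_m, w_m in zip(clean_mag, dist_mag, weights):
        num = sum(
            w * math.log10((x * x + eps) / ((x - xh) ** 2 + eps))
            for x, xh, w in zip(x_m, xh_m, w_m)
        )
        total += num / sum(w_m)  # weighted average over the K subbands
    return 10.0 * total / len(clean_mag)
```

Subbands with larger weights contribute more to the per-frame score, so distortion in perceptually important bands lowers the result more than the same distortion elsewhere.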
In a manner similar to the MSE criterion, the score F of each of the above-mentioned speech quality measures is calculated for every training data file, and its average value is computed by
$$\begin{array}{@{}rcl@{}} \xi(\alpha, \beta, \gamma)=\frac{1}{L}\sum\limits^{L}_{l=1} F_{l}(k). \end{array} $$
((12))
Each file, lasting 8 s, consists of two different sentences and is then processed by the proposed PLC algorithm, in which \(G_{s}(n)\) and \(G_{r}(n)\) are applied to (1) as an adaptive muting mechanism. Since F depends on the reconstructed signal, ξ can be determined as a function of α, β, and γ. To find the optimal parameters, we compute the average of F for each objective measure over all training data in the speech materials by varying α, β, and γ and taking the optimal parameters \(\alpha^{\ast}\), \(\beta^{\ast}\), and \(\gamma^{\ast}\) to be those that satisfy the following:
$$\begin{array}{@{}rcl@{}} ({\alpha}^{\ast}, {\beta}^{\ast}, {\gamma}^{\ast})=\arg\max_{\alpha, \beta, \gamma} \xi(\alpha, \beta, \gamma). \end{array} $$
((13))
For parameter training of the parametric shaping functions, we used a number of speech materials from the TIMIT database, as will be described further in Section 4. The parameters obtained for each objective measure are then applied to (3) and (5), respectively, in the test phase. Finally, the optimal parameters are chosen, based on the MOS test, from among the parameters obtained by the objective measures. Note that the proposed muting method does not cause any additional algorithmic delay or storage, since the optimal parameters are found in an off-line training process. Also, as in [21] and [22], the worst-case complexity of bad frame processing is still lower than that of good frame processing, so the overall worst-case complexity is unchanged. Finally, the proposed adaptive muting method is applied to the higher band in the same way as to the lower band.