Effective blind speech watermarking via adaptive mean modulation and package synchronization in DWT domain
EURASIP Journal on Audio, Speech, and Music Processing volume 2017, Article number: 10 (2017)
Abstract
This paper outlines a package synchronization scheme for blind speech watermarking in the discrete wavelet transform (DWT) domain. Following two-level DWT decomposition, watermark bits and synchronization codes are embedded within selected frames in the second-level approximation and detail subbands, respectively. The embedded synchronization code is used for frame alignment and as a location indicator. Tagging voice-active frames with sufficient intensity makes it possible to avoid ineffective watermarking during the silence segments commonly associated with speech utterances. We introduce a novel method referred to as adaptive mean modulation (AMM) to perform binary embedding of packaged information. The quantization steps used in mean modulation are recursively derived from previous DWT coefficients. The proposed formulation allows for the direct assignment of embedding strength. Experiment results show that the proposed DWT-AMM is able to preserve speech quality at a level comparable to that of two other DWT-based methods, which also operate at a payload capacity of 200 bits per second. DWT-AMM exhibits superior robustness in terms of bit error rates, as long as the recovery of adaptive quantization steps is secured.
Introduction
In the digital era, copyright protection of multimedia data (e.g., images, audios, and videos) is an important issue for content owners and service providers. Digital watermarking technology has received considerable attention owing to its potential application to the protection of intellectual property rights, content authentication, fingerprinting, and covert communications. Watermarking technology generally takes four factors (i.e., imperceptibility, security, robustness, and capacity) into consideration [1, 2]. An ideal watermarking algorithm minimizes perceptual distortion due to signal alteration while embedding a sufficient quantity of information within a host signal to ensure resistance against malicious attacks. The fact that the requirements of capacity, robustness, and imperceptibility are contradictory necessitates a tradeoff in the design of various watermarking schemes. For example, medical information systems are primarily intended to provide security and ensure the integrity of information, whereas payload capacity is of paramount importance in air traffic control systems. The annotation watermarking of music products emphasizes imperceptibility and robustness.
Robust watermarks are strongly resistant to attacks, whereas fragile watermarks are supposed to crumble under any attempt at tampering. There are numerous ways to categorize watermarking techniques. Depending on the requirement of the source material, watermarking schemes can be classified as blind, semiblind, and nonblind. Blind watermarking is designed to recover an embedded watermark without the presence of the original source, while the nonblind approach can only be carried out using the original source. Semiblind watermarking involves situations where information other than the source itself is required for watermark extraction.
Over the past two decades, numerous watermarking methods have been developed for images, audios, and videos. Far less attention has been paid to the watermarking of speech signals. Speech is a specific form of audio signal; therefore, the techniques developed for audio watermarking are presumed to be applicable to speech watermarking. However, speech differs from typical audio signals with regard to spectral bandwidth, intensity distribution, signal continuity, and production modeling [3, 4]. The techniques developed for audio watermarking are not necessarily suitable for speech watermarking [5].
In [6], Hofbauer et al. exploited the fact that the ear is insensitive to the phase of a signal in nonvoiced speech. They performed speech watermarking by replacing the excitation signal of an autoregressive representation in nonvoiced segments. Chen and Liu [7] modified the position indices of selected excitation pulses in a watermarking scheme based on the codebook-excited linear prediction (CELP) speech codec. Coumou and Sharma [8] embedded data via pitch modification in voiced segments. The fact that multiple voiced segments may coalesce into a single voiced segment (or vice versa) in a communication channel means that mismatches in voiced segments can lead to insertion, deletion, and substitution errors in the estimates of embedded data. They resorted to a concatenated coding scheme for synchronization and error recovery.
The vocal tract transfer function modeled by linear prediction (LP) has also been employed as an embedding target. Chen and Zhu [9] achieved robust watermarking by embedding watermark bits into codebook indices, while applying multistage vector quantization (MSVQ) to the derived LP coefficients. Yan and Guo [10] converted the LP coefficients to reflection coefficients, which were then transformed to inverse sine (IS) parameters. Watermark embedding was achieved by modifying the IS parameters using odd-even modulation [11].
Many watermarking algorithms applied to audio signals are implemented in the transform domain, such as discrete Fourier transform (DFT) [12,13,14], discrete cosine transform (DCT) [15,16,17,18,19], discrete wavelet transform (DWT) [15, 20,21,22,23,24], and cepstrum [25,26,27]. The objective is to take advantage of signal characteristics and/or auditory properties [28]. Among the transforms used to perform audio watermarking, DWT is currently the most popular due to its perfect reconstruction and good multiresolution characteristics. The effectiveness of this approach in audio watermarking suggests that it may also work for speech watermarking, provided that speech characteristics are adequately taken into account.
In this study, we introduce a robust blind watermarking scheme for hiding two types of information (watermark bits and synchronization codes) within embeddable regions of DWT subbands designated as information packages. The position of the synchronization codes is used in frame alignment to indicate the start of packaged binary data. This scheme allows the watermark to be disassembled into parts during the embedding phase and reassembled during extraction.
The remainder of this paper is organized as follows. Section 2 describes the watermarking framework, whereby information bits and synchronization codes are embedded in selected DWT subbands. Section 3 discusses configuring the watermark to cope with speech signals. We also outline a package strategy used for information grouping and synchronization and delineate the complete watermarking process. Section 4 presents experiment results aimed at evaluating speech quality and watermark robustness against commonly encountered attacks. Conclusions are drawn in Section 5.
Mean modulation in the DWT domain
In this study, we embedded two types of binary data (watermark bits and synchronization codes) within the same time frame under the same framework. This was achieved using DWT to conduct signal decomposition, thus allowing the embedding of different types of binary data within separate DWT subbands. However, the detectability of the embedded binary information differs somewhat between the watermark bits and synchronization codes. Unlike watermarks, where each bit conveys individual information, the bit sequence of a synchronization code can be considered a distinct entity. The existence of the synchronization code depends on a certain number of bits being recognizable, which means that the synchronization code may have some tolerance for faults. Thus, the watermark bits in our design are inserted within the lowest subband, wherein the coefficient magnitudes are larger than those observed in the subband used for the insertion of synchronization codes. Larger coefficients enable the use of greater embedding strength for the watermark bits. This is conducive to the robustness of the watermark and helps to keep it imperceptible.
Watermarking by adaptive mean modulation in DWT domain
Assuming that a speech signal is sampled at a rate of 16 kHz with 16-bit resolution, a two-level one-dimensional (1D) DWT is employed to decompose the speech signal into a single approximation subband and two detail subbands. Here, the second-level DWT is performed on the approximation coefficients obtained from the first-level DWT of the host speech signal. The Daubechies-8 basis [29] is used as a wavelet function. Thus, the resulting second-level approximation subband occupies a frequency range roughly between 0 and 2000 Hz, while the second-level detail subband spans 2000 to 4000 Hz. The spectral density of speech signals is normally concentrated below 4 kHz; therefore, these two subbands are considered suitable candidates for watermarking applications. Analogous to most watermarking methods, we divide the selected DWT coefficients into frames in order to facilitate the embedding and detection of watermarks. Within each frame, l adjacent coefficients, termed c(i)’s, drawn from the selected subband are gathered as a subgroup for the implementation of binary embedding.
In this study, the embedding of a binary bit w _{ k } within the kth subgroup G _{ k } is achieved by modulating the coefficient mean m _{ k }, which is defined as follows, using quantization index modulation (QIM) [30]:
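The band edges quoted above follow from halving the bandwidth at each decomposition level. The short sketch below reproduces that arithmetic under the assumption of ideal half-band filters; the function name `dwt_band_edges` is introduced here purely for illustration, and real wavelet filters leak somewhat across these nominal edges.

```python
def dwt_band_edges(fs_hz, levels):
    """Return nominal (low, high) band edges in Hz for a dyadic DWT.

    Keys are 'd<j>' for the detail subband at level j and
    'a<levels>' for the final approximation subband.
    Ideal half-band filters are assumed.
    """
    bands = {}
    high = fs_hz / 2.0  # Nyquist frequency of the input signal
    for j in range(1, levels + 1):
        low = high / 2.0
        bands[f"d{j}"] = (low, high)  # upper half goes to the detail subband
        high = low                    # lower half is decomposed further
    bands[f"a{levels}"] = (0.0, high)
    return bands


bands = dwt_band_edges(16000, 2)
for name in ("a2", "d2", "d1"):
    print(name, bands[name])
```

For a 16 kHz signal this gives 0 to 2000 Hz for the second-level approximation subband and 2000 to 4000 Hz for the second-level detail subband.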
The formulation of the QIM can be expressed as
where ⌊ • ⌋ denotes the floor function and ∆_{ k } represents a quantization step. In essence, Eq. (3) changes m _{ k } to the nearest integer multiple of ∆_{ k } if w _{ k } = 0 and to the nearest midpoint between two adjacent integer multiples of ∆_{ k } whenever w _{ k } = 1. We note that QIM can be regarded as a special case of dither modulation [30, 31], wherein dither noise is added first and then quantized. The distortion compensation technique introduced in [30, 32] may also be incorporated into the quantization realization. Distortion-compensated QIM allows the adjustment of the quantization steps without introducing extra distortion while pursuing robustness. In our former studies [33, 34], we have shown that the incorporation of distortion compensation into QIM can successfully enhance the robustness of the watermark while still maintaining imperceptibility.
In this study, the mean value of the coefficients in each subgroup is selected as the embedding target because this statistical property is less susceptible to intentional attacks and/or unintentional modifications. Besides this resilience to perturbation, employing the mean value as the embedding target makes it fairly easy to control the signal-to-watermark ratio when using mean modulation for binary embedding. The robustness of the embedded watermark generally depends on the number of involved coefficients (i.e., l) and the embedding strength characterized by the quantization step size (i.e., ∆_{ k }). Our scheme shares some similarities with those in [35, 36], where the one in [35] changed the DWT coefficients based on the average of a relevant frame and the one in [36] took into account the average of the linear regression of fast Fourier transform (FFT) values.
With QIM embedding, embedding strength is reflected in the quantization step size. The use of a large step size tends to increase robustness but impair quality by imposing more alterations to the speech signal. Making the embedded watermarks inaudible would require suppressing the distortion below the auditory masking threshold. Because the maximum tolerable noise level in each critical band is generally proportional to the short-time energy of the host speech, a sensible strategy involves adapting the quantization step according to the intensity of the speech segment. In other words, the quantization step is augmented when the energy of the DWT coefficients climbs, and it is reduced when the energy drops. By referring to the approach presented in [37], we derive the local energy level from previous coefficients in a recursive manner.
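The quantization rule described above can be sketched in a few lines. The code assumes the usual two-lattice form of QIM, namely \( \widehat{m}_k=\lfloor m_k/\Delta_k+0.5\rfloor \Delta_k \) for w _{ k } = 0 and \( \widehat{m}_k=\lfloor m_k/\Delta_k\rfloor \Delta_k+\Delta_k/2 \) for w _{ k } = 1, and it assumes that the mean is moved by adding the same offset to every coefficient of the subgroup. Both forms, as well as the function names, are illustrative assumptions rather than a transcription of the original equations.

```python
import math


def qim_embed_mean(m, bit, step):
    """Quantize the mean m so that it encodes one bit with step size `step`."""
    if bit == 0:
        # nearest integer multiple of the step
        return math.floor(m / step + 0.5) * step
    # nearest midpoint between two adjacent multiples of the step
    return math.floor(m / step) * step + step / 2.0


def qim_extract_bit(m, step):
    """Decide whether m lies closer to a multiple (0) or to a midpoint (1)."""
    r = (m / step) % 1.0  # fractional position inside one step, in [0, 1)
    return 1 if 0.25 <= r < 0.75 else 0


def embed_in_subgroup(coeffs, bit, step):
    """Shift all coefficients by one common offset so the mean hits its target."""
    m = sum(coeffs) / len(coeffs)
    offset = qim_embed_mean(m, bit, step) - m
    return [c + offset for c in coeffs]


group = [0.31, -0.12, 0.07, 0.22]  # toy subgroup, mean 0.12
for bit in (0, 1):
    marked = embed_in_subgroup(group, bit, step=0.1)
    mean = sum(marked) / len(marked)
    print(bit, round(mean, 6), qim_extract_bit(mean, 0.1))
```

A perturbation of the mean smaller than a quarter of the step leaves the extracted bit unchanged, which is why a larger step buys robustness at the cost of a larger modification.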
where \( {\widehat{\rho}}_{k-1}={\displaystyle {\sum}_{i=1}^l{\widehat{c}}^2\left({\kappa}_{k-1}+ i\right)} \) is the energy computed from the modified coefficients in the (k − 1)th subgroup and \( {\overline{\rho}}_k \) is the output of the first-order recursive lowpass filter. α is a positive controlling parameter deliberately rendering a unity DC gain. It should be pointed out that the coefficients in the current subgroup cannot be used for estimation, due to the fact that they are about to be modified by watermarking. Consequently, \( {\overline{\rho}}_k \) can be regarded as an estimate of the short-time energy derivable from previous coefficients. Owing to the short-time stability of speech signals, the resulting \( {\overline{\rho}}_k \) can be regarded as a smoothed version of \( {\widehat{\rho}}_k \).
The acquisition of short-time energy \( {\overline{\rho}}_k \) makes it possible to regulate the signal-to-watermark ratio η, which is defined as the energy ratio between the signal and watermarking perturbation measured in decibels. The relationship among \( {\overline{\rho}}_k \), η, and ∆_{ k } is expressed mathematically as follows:
where E[•] denotes the expectation operator. The term on the left-hand side of Eq. (5) is meant to convert a specific decibel value η to its linear magnitude, while the numerator and denominator of the fractional expression on the right-hand side of Eq. (5) denote the energy levels of the signal and noise, respectively. The alteration due to the QIM in Eq. (3) is presumably distributed uniformly over [−Δ _{ k }/2, Δ _{ k }/2]. As a result, ∆_{ k } can be computed directly once η is specified, as follows:
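A derivation consistent with this description is sketched below. It assumes that each of the l coefficients in a subgroup is shifted by the same offset \( \widehat{m}_k-m_k \), so that the perturbation energy of the subgroup is l times the squared offset; this modeling choice is an assumption made here for illustration, and only the uniform-error premise is taken from the text.

```latex
\begin{align}
E\!\left[(\widehat{m}_k - m_k)^2\right]
  &= \frac{1}{\Delta_k}\int_{-\Delta_k/2}^{\Delta_k/2} x^2\,dx
   = \frac{\Delta_k^2}{12}, \\
10^{\eta/10}
  &= \frac{\overline{\rho}_k}{l\,E\!\left[(\widehat{m}_k - m_k)^2\right]}
   = \frac{12\,\overline{\rho}_k}{l\,\Delta_k^2}, \\
\Delta_k
  &= \sqrt{\frac{12\,\overline{\rho}_k}{l\cdot 10^{\eta/10}}}.
\end{align}
```

Under these assumptions, a setting of η = 20 dB with l = 20 would give \( \Delta_k\approx 0.0775\sqrt{\overline{\rho}_k} \), so the step scales with the square root of the estimated short-time energy.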
Following the acquisition of \( {\widehat{m}}_k \) as in Eq. (3), binary embedding in each subgroup is accomplished by modifying the corresponding DWT coefficients.
After all of the watermark bits are embedded within designated DWT subbands, we take inverse DWT to obtain the watermarked speech signal using the modified subband coefficients.
Watermark extraction follows the basic procedure used in embedding. After applying two-level DWT to the watermarked file, the coefficients in the selected subband are divided into subgroups in order to derive the coefficient mean \( {\tilde{m}}_k \) and quantization step \( {\tilde{\varDelta}}_k \) using Eqs. (2) and (6). The watermark bit \( {\tilde{w}}_k \) residing in each subgroup is extracted based on standard QIM:
where the tilde atop participating variables implies the effect of possible attacks.
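The reason the decoder can recover the same quantization steps without side information is that the energy estimate is always updated from coefficients that have already been modified. The following sketch illustrates this round trip on synthetic data. It assumes a recursion of the form \( {\overline{\rho}}_k=\alpha {\overline{\rho}}_{k-1}+\left(1-\alpha \right){\widehat{\rho}}_{k-1} \) (which has unity DC gain), the step-size relation \( \Delta_k=\sqrt{12{\overline{\rho}}_k/\left(l\cdot {10}^{\eta /10}\right)} \), and a shared initial value `rho0`; all three, along with the function names, are assumptions made for this illustration.

```python
import math
import random


def step_size(rho_bar, l, eta_db):
    """Step giving a signal-to-watermark ratio of eta_db (uniform-error model)."""
    return math.sqrt(12.0 * rho_bar / (l * 10.0 ** (eta_db / 10.0)))


def amm_embed(coeffs, bits, l=20, eta_db=20.0, alpha=0.8, rho0=1.0):
    out = list(coeffs)
    rho_bar = rho0  # initial energy estimate, assumed known to both sides
    for k, bit in enumerate(bits):
        seg = out[k * l:(k + 1) * l]
        delta = step_size(rho_bar, l, eta_db)
        m = sum(seg) / l
        if bit == 0:
            target = math.floor(m / delta + 0.5) * delta
        else:
            target = math.floor(m / delta) * delta + delta / 2.0
        seg = [c + (target - m) for c in seg]
        out[k * l:(k + 1) * l] = seg
        # update from the *modified* subgroup; used by subgroup k + 1
        rho_hat = sum(c * c for c in seg)
        rho_bar = alpha * rho_bar + (1.0 - alpha) * rho_hat
    return out


def amm_extract(coeffs, n_bits, l=20, eta_db=20.0, alpha=0.8, rho0=1.0):
    bits = []
    rho_bar = rho0
    for k in range(n_bits):
        seg = coeffs[k * l:(k + 1) * l]
        delta = step_size(rho_bar, l, eta_db)
        r = ((sum(seg) / l) / delta) % 1.0
        bits.append(1 if 0.25 <= r < 0.75 else 0)
        rho_hat = sum(c * c for c in seg)
        rho_bar = alpha * rho_bar + (1.0 - alpha) * rho_hat
    return bits


random.seed(0)
host = [random.gauss(0.0, 1.0) for _ in range(400)]  # stand-in for DWT coefficients
payload = [random.randint(0, 1) for _ in range(20)]
marked = amm_embed(host, payload)
recovered = amm_extract(marked, len(payload))
print(recovered == payload)
```

Because both sides run the identical recursion on identical (modified) coefficients, the steps agree exactly in the absence of attacks; any attack that perturbs the energy estimate also perturbs the recovered step, which is the dependence noted in the abstract.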
Frame synchronization via DWT-AMM
The prerequisite for accurate watermark extraction using the above-mentioned adaptive mean modulation (AMM) scheme is the perfect alignment of the boundary of each subgroup. A simple strategy by which to synchronize the locations used in watermark insertion and detection is to insert synchronization codes within the host signal. Actual watermark extraction begins after identifying the locations of the synchronization codes.
In fact, mean modulation with a fixed quantization step has previously been explored for the embedding of synchronization codes in the time domain for many audio watermarking algorithms [14, 15, 17, 23, 38]. In principle, watermark bits and synchronization codes are hidden in different segments of the audio file to avoid mutual interference. In this study, we propose embedding watermark bits and synchronization codes within separate DWT subbands. This arrangement offers additional advantages other than an increase in payload capacity. For example, the successful detection of synchronization codes in one subband can signify the presence of a data sequence in another subband.
Assume that the secondlevel detail subband has been selected as the embedding target. The derivation of a detail coefficient sequence can be imagined as a process of highpass filtering and subsequent downsampling; hence, the spectral orientation is reversed for the detail coefficients. To flip the spectrum back to its normal direction, we simply alter the sign of the odd index coefficients, as follows:
where \( {c}_d^{(2)}\left({\kappa}_k+ i\right) \) denotes the ith coefficient in the kth subgroup of the second-level detail subband. The DWT level is specified in the superscript alongside the coefficient variable. The subscript “d” denotes the initial of the word “detail.” Note that the spectral energy of speech signals is normally concentrated in low frequencies, and AMM tends to track low-frequency variations. Once Eq. (9) restores the energy distribution in the low frequencies, spectral flipping yields larger quantization steps, which eventually enhances robustness.
In our design, the synchronization code is a random binary sequence φ(i) ∈ {0, 1} of length L _{code}. Each φ(i) bit is inserted into l _{sync} coefficients in the second-level detail subband. We tentatively choose L _{code} = 120, l _{sync} = 4, and η = 10 to provide adequate resistance against possible attacks. The overall length of the synchronization code covers an interval of l _{sync} × L _{code} (=480) coefficients. Variable α used in the recursive filter is set to 0.9 in order to render a slowly varying estimate of the short-time energy. The search for subgroup demarcation is conducted on a sample-by-sample basis. Because each second-level detail coefficient corresponds to four samples of the speech signal, we perform the DWT decomposition four times, starting from the first through fourth sample positions, and associate the resulting coefficient sequences with the corresponding sample locations. While detecting the presence of synchronization codes, we cycle through the four sequences as the process proceeds from sample to sample. Specifically, whenever a new coefficient enters one of the four sequences, we regroup l _{sync} coefficients and recompute the short-time energy. After obtaining the coefficient mean and quantization step for each subgroup, we acquire binary bit b _{ w }(i) using Eq. (8). The synchronization code is then detected using a matched filter. The entire computation proceeds through three steps. First, the information bit sequence and synchronization code are both converted to bipolar form. Second, the extracted bipolar stream is convolved with the reversed version of the bipolar-converted synchronization code. Third, the presence of the synchronization code is presumed whenever the filter output y(i) exceeds threshold T, which is set as 0.45L _{code}. The following inequality summarizes the aforementioned three steps.
As shown on the left side of Eq. (10), the bipolar stream is decimated by l _{sync} during convolution because each bit is derived from l _{sync} coefficients. There can be two types of error during the search for synchronization codes. A false-positive error (FPE) involves declaring a nonembedded speech signal as an embedded one, whereas a false-negative error (FNE) involves classifying an embedded speech signal as a nonembedded one.
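The sign alternation amounts to multiplying the sequence by (−1)^i, which shifts the spectrum by half the sampling rate and thereby mirrors it. A minimal sketch follows; it uses zero-based indexing, so which parity is negated is a convention chosen here (the other choice differs only by a global sign, provided embedding and extraction agree).

```python
def flip_spectrum(coeffs):
    """Negate the odd-indexed coefficients (zero-based), i.e. multiply by (-1)^i.

    Applying the function twice restores the original sequence.
    """
    return [-c if i % 2 else c for i, c in enumerate(coeffs)]


# A sequence oscillating at the highest representable frequency ...
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
# ... becomes a constant (zero-frequency) sequence after flipping.
print(flip_spectrum(alternating))
```

In this toy case the mean of the sequence changes from 0 to 1, which illustrates why mean modulation has more to work with once the dominant energy has been moved to low frequencies.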
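The three detection steps can be sketched as a sliding bipolar correlation. For simplicity, the code below operates on a stream that already holds one extracted bit per position, so the decimation by l _{sync} is not modeled. The function name `detect_sync` and the use of `>=` at the threshold boundary are assumptions of this sketch.

```python
import random


def detect_sync(bits, code, ratio=0.45):
    """Return start indices where the bipolar correlation reaches ratio * len(code)."""
    n = len(code)
    threshold = ratio * n
    bipolar_code = [2 * b - 1 for b in code]  # step 1: {0, 1} -> {-1, +1}
    bipolar_bits = [2 * b - 1 for b in bits]
    hits = []
    for start in range(len(bits) - n + 1):
        # step 2: correlation = (number of matches) - (number of mismatches)
        window = bipolar_bits[start:start + n]
        y = sum(s * c for s, c in zip(window, bipolar_code))
        if y >= threshold:  # step 3: compare with T = 0.45 * L_code
            hits.append(start)
    return hits


random.seed(1)
code = [random.randint(0, 1) for _ in range(120)]
stream = ([random.randint(0, 1) for _ in range(300)] + code
          + [random.randint(0, 1) for _ in range(300)])
for i in range(0, 120, 12):  # corrupt 10 of the 120 code bits
    stream[300 + i] ^= 1
print(300 in detect_sync(stream, code))
```

With 10 corrupted bits the correlation at the true position is 110 − 10 = 100, well above the threshold of 54, so the code is still located.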
Assuming that the extracted watermark bits are independent random variables with probability P _{ e }, then FPE P _{ fp } can be computed as follows:
where k denotes the number of matched bits in a total of L _{code} bits and \( \binom{L_{\mathrm{code}}}{k} \) represents the binomial coefficient. The threshold T and the required number of matched bits T' are related by T' = (L _{code} + T)/2, because T is the summed result of T' matched bits and L _{code} − T' unmatched bits.
where a matched bit corresponds to +1 and an unmatched bit corresponds to −1. Since nonembedded bits are either 0 or 1 with pure randomness, P _{ e } is assumed to be 0.5. Thus, Eq. (11) can be further simplified as
Given that L _{code} = 120 and T = 0.45L _{code} =54, P _{ fp } turns out to be 4.34 × 10^{−7}, which implies that FPE rarely happens while using the presumed parameter setting.
Analogous to the discussion on the derivation of FPE, the FNE P _{ fn } can be computed as
where P _{BER} denotes the error rate for each bit. According to Eq. (14), P _{ fn } is only about 0.018 (i.e., the detection probability remains near 0.982) even if P _{BER} is as high as 0.2.
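Both error probabilities are binomial tail sums and can be checked numerically. The sketch below assumes that detection requires at least T' = (L _{code} + T)/2 = 87 matched bits out of 120, which follows from T = T' − (L _{code} − T'); the function names are introduced here for illustration only.

```python
from math import comb


def p_false_positive(l_code=120, t=54):
    """P(at least T' of l_code random bits match), each matching with probability 0.5."""
    t_match = (l_code + t) // 2
    favourable = sum(comb(l_code, k) for k in range(t_match, l_code + 1))
    return favourable / 2 ** l_code


def p_false_negative(p_ber, l_code=120, t=54):
    """P(fewer than T' bits match) when each bit is wrong with probability p_ber."""
    t_match = (l_code + t) // 2
    return sum(comb(l_code, k) * (1.0 - p_ber) ** k * p_ber ** (l_code - k)
               for k in range(t_match))


print(f"P_fp = {p_false_positive():.3e}")
print(f"P_fn at BER 0.2 = {p_false_negative(0.2):.4f}")
```

The false-positive value should come out on the order of 4 × 10^{−7}, in line with the figure quoted above, while the false-negative value at a bit error rate of 0.2 should be close to 0.018.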
Watermarking with package synchronization
In Sections 2.1 and 2.2, we discuss the DWT-AMM framework and its application to watermark synchronization. Further considerations must be taken into account in the development of a practical speech watermarking system. As pointed out in the introduction, a speech signal exhibits several distinct acoustic characteristics that differentiate it from other types of audio. Unlike most music signals, silent segments commonly occur in speech utterances. The insertion of watermark bits into silent segments would render them vulnerable to noise perturbation and susceptible to attacks through the simple removal of silence. Thus, we developed an energy-based scheme to enable the selection of frames for the embedding of watermarks and synchronization codes. One general principle in watermarking is to hide information among large coefficients in the transformed domain, because this enables the employment of stronger embedding to resist attacks with less concern for imperceptibility.
The energy of a speech signal is normally concentrated below 4 kHz. To make the best use of DWT decomposition, we selected the second-level approximation subband for the embedding of binary information and reserved the second-level detail subband for frame synchronization, on condition that the speech is sampled at 16 kHz. After taking two-level DWT of the host signal, the coefficients in the second-level approximation and detail subbands are both partitioned into nonoverlapping frames of size L _{ f }. In this study, L _{ f } is tentatively set to 160 to facilitate subsequent scheme development. Then, we calculate the root-mean-square (RMS) values, termed σ _{ a }(t) and σ _{ d }(t), respectively, for the second-level approximation and detail subbands.
where \( {c}_a^{(2)}\left( i;\kern0.5em t\right) \) and \( {c}_d^{(2)}\left( i;\kern0.5em t\right) \) are respectively the ith secondlevel approximation and detail coefficients in the tth frame. Let ψ _{ a } and ψ _{ d } be the corresponding thresholds, which are assigned as ratios proportional to the maximum values. The frames with RMS values exceeding prespecified thresholds are selected for watermarking. This type of frame selection can be expressed as follows:
with
The frame attribute Λ(t) is categorized as “embeddable” if both σ _{ a }(t) and σ _{ d }(t) surpass their respective thresholds.
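The frame selection rule can be sketched as follows. The threshold ratios (0.1 here) are hypothetical placeholders, and taking the maximum over the per-frame RMS values is one reading of "ratios proportional to the maximum values"; both are assumptions of this sketch.

```python
import math


def frame_rms(coeffs, frame_len=160):
    """RMS of each nonoverlapping frame (a trailing partial frame is dropped)."""
    n_frames = len(coeffs) // frame_len
    return [
        math.sqrt(sum(c * c for c in coeffs[t * frame_len:(t + 1) * frame_len]) / frame_len)
        for t in range(n_frames)
    ]


def embeddable_frames(approx, detail, ratio_a=0.1, ratio_d=0.1, frame_len=160):
    """Flag frames whose RMS exceeds the threshold in BOTH subbands."""
    sigma_a = frame_rms(approx, frame_len)
    sigma_d = frame_rms(detail, frame_len)
    psi_a = ratio_a * max(sigma_a)  # threshold for the approximation subband
    psi_d = ratio_d * max(sigma_d)  # threshold for the detail subband
    return [a > psi_a and d > psi_d for a, d in zip(sigma_a, sigma_d)]


# Three toy frames: loud in both subbands, near-silent, loud in one subband only.
approx = [1.0] * 160 + [0.001] * 160 + [0.5] * 160
detail = [0.2] * 160 + [0.2] * 160 + [0.0] * 160
print(embeddable_frames(approx, detail))
```

Only the first toy frame qualifies, since the second fails the approximation threshold and the third fails the detail threshold.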
Figure 1 illustrates the process of searching for embeddable frames within a speech signal. In Fig. 1e, the frame is categorized as embeddable, as long as the corresponding approximation and detail coefficients are of sufficiently large magnitude to allow the embedding of watermark bits and synchronization codes. The insertion and detection of synchronization codes are illustrated in Fig. 2. In this example, the synchronization code is embedded in four places, each spanning an interval of three frames. The four embedding segments are rendered in red in (b). As indicated by the four sharp peaks precisely at the ends of the red areas, the output of the matched filter is sufficient to identify the synchronization code. For a segment comprising consecutive embeddable frames, the synchronization code is inserted only within the first three frames in the second-level detail subband, whereas binary embedding is applied to every embeddable frame in the approximation subband.
Depending on the number of frames available for data hiding, we divide the watermark bits into several packages, each containing a header in conjunction with a series of watermark bytes. The implantation of a complete synchronization code requires an interval stretching 480 coefficients; therefore, only the speech segments extending beyond three consecutive frames are used as watermark packages. During the embedding phase, the DWT-AMM settings are \( {l}_a^{(2)}=20 \) and \( {\eta}_a^{(2)}=20 \) for watermark embedding in the approximation subband and \( {l}_d^{(2)}=4 \) and \( {\eta}_d^{(2)}=10 \) for synchronization in the detail subband. The subscripts “a” and “d” alongside the variables represent subband attributes. In accordance with these specifications, each frame in the approximation subband carries 8 (=160/20) bits of information. Thus, for a wideband speech signal sampled at 16 kHz, the maximum payload capacity would be 200 (=16000/(2^{2} × 20)) bits per second (bps).
The header of each package consists of a 15-bit codeword produced by a (15, 11) BCH encoder [39]. The 11-bit message contains information in two parts: 7 bits indicating the allocated position and 4 bits specifying the total length. This means that there are 2^{7} starting positions that could be assigned. The length of data allowable in each package stretches from 1 to 16 bytes. The maximum size of watermark bits that can be accommodated is 8 × 2^{7} (=1024). Through the BCH encoder, the 11-bit message is appended with four parity bits to form a code of length 15. The resulting BCH code is capable of correcting 1 bit error.
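A header of this kind can be illustrated with a systematic (15, 11) BCH code, which is the single-error-correcting code generated by x^4 + x + 1. In the sketch below, the ordering of the two fields inside the 11-bit message and the storage of the length as "length minus one" are assumptions made for illustration; the source only specifies the field widths.

```python
G = 0b10011  # generator polynomial x^4 + x + 1 of the (15, 11) BCH code


def poly_mod(value, gen=G):
    """Remainder of GF(2) polynomial division; polynomials are stored as ints."""
    glen = gen.bit_length()
    while value.bit_length() >= glen:
        value ^= gen << (value.bit_length() - glen)
    return value


def bch_encode(msg11):
    """Systematic encoding: 11 message bits followed by 4 parity bits."""
    shifted = msg11 << 4
    return shifted | poly_mod(shifted)


# Syndrome produced by a single-bit error at bit position e of the codeword.
SYNDROME_TO_POS = {poly_mod(1 << e): e for e in range(15)}


def bch_decode(word15):
    """Correct up to one bit error and return the 11 message bits."""
    syndrome = poly_mod(word15)
    if syndrome:
        word15 ^= 1 << SYNDROME_TO_POS[syndrome]
    return word15 >> 4


def pack_header(position, length_bytes):
    """7-bit start position and 4-bit (length - 1); the layout is hypothetical."""
    assert 0 <= position < 128 and 1 <= length_bytes <= 16
    return bch_encode((position << 4) | (length_bytes - 1))


def unpack_header(word15):
    msg = bch_decode(word15)
    return msg >> 4, (msg & 0xF) + 1


header = pack_header(position=37, length_bytes=12)
corrupted = header ^ (1 << 9)  # flip one of the 15 bits
print(unpack_header(corrupted))
```

Any single flipped bit in the 15-bit header is corrected, whereas two or more errors would be decoded to a wrong position or length, which is one reason a retrieved package may need cross-checking against other copies.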
Figure 3 illustrates the means by which embeddable frames are configured for various lengths of data. The start locations of embeddable segments implicitly synchronize the time windows for the embedding and extraction of the watermark. The watermark is tentatively selected as a binary image logo of size 32 × 32 with an equal number of “1s” and “0s”. To reinforce security, we scrambled the watermark using the Arnold transform [40] and then converted it to a 1D bit sequence. The bit sequence was then divided into packages of various size matching the lengths of the embeddable segments in different locations. Multiple watermarks can be embedded as long as the speech file is of sufficient length. When reconstructing the watermark, we employ a majority voting scheme to verify each retrieved bit in cases where multiple copies are received.
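The scrambling step can be sketched with the classical Arnold cat map, (x, y) → (x + y, x + 2y) mod N, whose inverse is (x, y) → (2x − y, y − x) mod N. Treating the number of iterations as the secret key is an assumption of this sketch; the source does not state how the key is defined.

```python
def arnold(img, iterations=1):
    """Scramble a square N x N matrix with the Arnold cat map."""
    n = len(img)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[(x + y) % n][(x + 2 * y) % n] = img[x][y]
        img = out
    return img


def arnold_inverse(img, iterations=1):
    """Undo `iterations` rounds of the Arnold cat map."""
    n = len(img)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[(2 * x - y) % n][(y - x) % n] = img[x][y]
        img = out
    return img


# Toy 32 x 32 logo with equal numbers of 1s and 0s (left half ones).
logo = [[1 if col < 16 else 0 for col in range(32)] for _ in range(32)]
scrambled = arnold(logo, iterations=5)
bit_sequence = [b for row in scrambled for b in row]  # 1D stream of 1024 bits
restored = arnold_inverse(scrambled, iterations=5)
print(restored == logo, len(bit_sequence))
```

The map only permutes pixel positions, so the numbers of 1s and 0s are preserved and the 1024-bit stream matches the 8 × 2^{7} capacity addressed by the package headers.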
To conclude this section, Fig. 4 outlines the processing flow of the proposed watermarking method. The required steps are summarized as follows:

Step 0
Scramble the watermark logo using an encryption key and convert the results to a bit stream.

Step 1
Decompose the host speech signal using two-level DWT.

Step 2
Seek embeddable segments.

Step 3
Implant the synchronization code into the first three frames of an embeddable segment in the second-level detail subband.

Step 4
Partition the watermark bit sequence into packages in accordance with the size of the embeddable segment. The location and size of each watermark package are encoded as a 15-bit header using a (15, 11) BCH encoder, and the resulting BCH code is combined with scrambled watermark bits to form a package.

Step 5
For each embeddable segment, the packaged bits are embedded within the approximation coefficients using AMM.

Step 6
Repeat steps 3 to 5 until the end of the file is reached; then perform a two-level inverse DWT to attain a watermarked speech signal with synchronization information inside.
Watermark extraction follows the same procedure as that used in embedding. Figure 5 provides an illustrative depiction of the process, as briefly outlined in the following:

Step 1
Decompose the watermarked speech signal using two-level DWT. To take every sample shift into account, we need to perform two-level DWT four times, starting from the first through fourth sample positions.

Step 2
Inspect the segment beginning with current sample i. Detect the synchronization code using the technique developed in Section 2.2. If the synchronization code is present, go to step 3; otherwise, move one sample forward (i ← i + 1) and repeat step 2.

Step 3
Extract the bits residing in each package using AMM. The located position of the retrieved watermark bits is resolved from the BCH decoder. Update the current index i to the new position.

Step 4
If the index reaches the end, go to step 5. Otherwise, go to step 2.

Step 5
Adopt the majority vote strategy to determine the ultimate value of each bit.

Step 6
Convert the 1D bit sequence to a matrix and apply the inverse Arnold transform to descramble the matrix using the correct key.
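The majority vote in step 5 can be sketched as follows. Representing a bit that was not recovered in a given copy as `None`, and resolving ties or never-seen bits to 0, are arbitrary choices made for this illustration.

```python
def majority_vote(copies):
    """Combine several retrieved copies of the same bit sequence.

    `copies` is a list of equal-length lists whose entries are 0, 1,
    or None (None marks a bit that was not recovered in that copy).
    """
    n = len(copies[0])
    result = []
    for i in range(n):
        votes = [c[i] for c in copies if c[i] is not None]
        # a strict majority of ones yields 1; ties and empty votes yield 0
        result.append(1 if 2 * sum(votes) > len(votes) else 0)
    return result


copies = [
    [1, 0, 1, None],
    [1, 1, 0, None],
    [0, 0, 1, 1],
]
print(majority_vote(copies))  # [1, 0, 1, 1]
```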
Performance evaluation
The test materials consisted of 192 sentences uttered by 24 speakers (16 males and 8 females) drawn from the core set of the TIMIT database [41]. Speech utterances were recorded at 16 kHz with 16-bit resolution. For the convenience of computer simulation, speech files belonging to the same dialect region were concatenated to form a longer file. Since each speech utterance was recorded separately, the maximum amplitude of each file was uniformly rescaled to an identical level to maintain consistent intensity. The watermark bits for the test were a series of alternating 1s and 0s of sufficient length to cover the entire host signal.
Smoothing factor for the recursive filter
Our initial concern lies in the choice of an appropriate value for variable α used in the recursive filtering (i.e., Eq. (4)) of the DWT-AMM framework. The recursive filter is meant to render a smooth estimate of short-time energy. To understand the influence of variable α, we conducted a pilot test examining the watermarked speech in the presence of white Gaussian noise with the signal-to-noise ratio (SNR) set at 20 dB. The tested values of α formed an arithmetic sequence ranging from 0.3 to 0.975 in increments of 0.025. We measured the variations in SNR, mean opinion score of listening quality objective (MOS-LQO), and bit error rate (BER) under changes in α. Among these three measures, SNR and MOS-LQO reflect the impairment of quality due to watermarking, while BER indicates the robustness of the embedded watermark against possible attacks. The definition of SNR is given as follows:
where s(n) and ŝ(n) denote the original and watermarked speech signals, respectively. MOS-LQO is derived from the perceptual evaluation of speech quality (PESQ) metric [42], which was developed to model subjective tests commonly used in telecommunications. The PESQ assesses speech quality on a −0.5 to 4.5 scale. A mapping function to MOS-LQO is described under ITU-T Recommendation P.862.1, covering a range from 1 (bad) to 5 (excellent). Table 1 specifies the MOS-LQO scale. In this study, we adopted the implementation released on the ITU-T website [43].
To determine the effect on robustness, we examined the BER between the recovered watermark \( \tilde{W}=\left\{{\tilde{w}}_n\right\} \) and the original watermark W = {w _{ n }}:
where N _{ w } denotes the number of watermark bits.
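The two objective measures can be computed as sketched below. The SNR is taken as the ratio of the original signal energy to the energy of the difference between watermarked and original signals, expressed in decibels, and the BER as the fraction of mismatched bits; whether the BER is reported as a fraction or a percentage is not fixed by the text, and a fraction is assumed here.

```python
import math


def snr_db(original, watermarked):
    """10*log10 of signal energy over distortion energy (undefined for identical inputs)."""
    signal = sum(s * s for s in original)
    noise = sum((w - s) ** 2 for s, w in zip(original, watermarked))
    return 10.0 * math.log10(signal / noise)


def bit_error_rate(sent, received):
    """Fraction of positions at which the recovered bit differs from the original."""
    errors = sum(a != b for a, b in zip(sent, received))
    return errors / len(sent)


s = [1.0, -1.0, 1.0, -1.0]
s_hat = [1.1, -1.1, 1.1, -1.1]  # every sample perturbed by 10 %
print(round(snr_db(s, s_hat), 3))              # 20.0
print(bit_error_rate([1, 0, 1, 0], [1, 1, 1, 0]))  # 0.25
```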
Figure 6 presents the average BER, SNR, and MOS-LQO obtained from the test set with the parameters \( {l}_a^{(2)}=20 \) and \( {\eta}_a^{(2)}=20 \). In this experiment, nonembeddable frames were not excluded from average calculation. The obtained MOS-LQOs were therefore slightly lower than those attained by the actual watermarking scheme, and the resulting BERs were somewhat higher than the outcomes involving merely the embeddable frames. As shown in Fig. 6, the average BER, SNR, and MOS-LQO remain roughly steady when α < 0.5 and gradually descend with an increase in α. The decreasing tendency becomes increasingly obvious once α exceeds 0.8. The lower SNR values at α > 0.9 can be attributed to the fact that the computation of \( {\overline{\rho}}_k \) refers more to previous data than recent data. This often results in large quantization steps at the end of a speech segment where the volume drops abruptly. A lower SNR also implies a more pronounced modification to the speech signal; therefore, the MOS-LQO presents a downward trend. In subsequent experiments, we set α to 0.8 for embedding watermark bits, as this achieves suitable BER and SNR values without moving the MOS-LQO too far from a desirable score.
Detection rate of synchronization codes
In Section 2.2, we discuss how to embed and detect the synchronization codes in the second-level detail subband. The theoretical analysis in that section is derived from a probabilistic perspective. Here, we present the experimental results obtained from the test materials. In accordance with the rule given in Section 3, 781 speech segments were selected for embedding synchronization codes over a total length of 9,229,695 samples (or equivalently, 576.86 s). The embedding of synchronization codes as per the specifications in Section 2.2 led to an SNR of 27.73 dB and a MOS-LQO score of 4.35. The competence of the proposed method was verified by inspecting the counts of misses and false alarms in the presence of various attacks. The attack types in this study involved resampling, requantization, amplitude scaling, noise corruption, lowpass and highpass filtering, DA/AD conversion, echo addition, jittering, and compression. Table 2 lists the details of these attacks. The time-shifting attack considered in case N is intended to reveal the consequences of slightly misaligned frames. This particular attack is not designed for the synchronization test but will be examined in the evaluation of watermarking performance. For the other attack types, ranging from A to M, the test results are tabulated in Table 3.
As revealed by the results in Table 3, false alarms seldom occurred because of the relatively high detection threshold, i.e., \( T=0.45{L}_{\mathrm{code}}=54 \). The proposed synchronization technique survived most attacks except for lowpass filtering with a cutoff frequency of 1 kHz. This can be attributed to the fact that the synchronization codes are inserted into the second-level detail subband, whose spectrum is primarily distributed from 2 to 4 kHz. Consequently, obliterating the frequency components above 1 kHz ruins the synchronization. Apart from this lowpass filtering, noise corruption with SNR = 20 dB was another attack that caused obvious damage. Nonetheless, the miss rate of 66/781 is still considered acceptable, since over 91.5% of the package locations remain recoverable.
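The figures quoted above can be checked with a few lines of arithmetic. In this sketch the code length of 120 is inferred from the relation T = 0.45 L_code = 54 rather than stated explicitly in this section, and the variable names are illustrative.

```python
# Code length inferred from T = 0.45 * L_code = 54 (an inference, not a quoted value).
L_CODE = 120
threshold = round(0.45 * L_CODE)

# Misses under noise corruption at SNR = 20 dB, out of all embedded segments.
misses, segments = 66, 781
recoverable_fraction = 1.0 - misses / segments

print(threshold)                          # detection threshold T
print(f"{recoverable_fraction:.2%}")      # share of package locations recovered
```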
Comparison with other WT-based watermarking methods
This study compared the performance of three wavelet transform (WT)-based speech watermarking methods, namely, DWT-SVD [4], LWT-DCT-SVD [3], and the proposed DWT-AMM. For the sake of a fair comparison, the watermark bits were embedded in the second-level approximation subband using an identical payload capacity of 200 bps for all three methods. It should also be noted that the idea of embedding the watermark bits and synchronization codes within different subbands is applicable to any wavelet-based method. We assumed that the second-level detail subband was reserved for the embedding of synchronization codes in all cases to ensure that each method was equally capable of resisting cropping and/or time-shifting attacks. Only frames satisfying the conditions in (17) were used to embed binary information. Furthermore, to provide more insight into the proposed DWT-AMM approach, we also implemented watermark embedding at a rate of 100 bps in the third-level approximation and detail subbands, both of which were obtained by splitting the second-level approximation subband. The parametric settings followed those in the second-level approximation subband. That is, \( {l}_a^{(3)}={l}_d^{(3)}=20 \) and \( {\eta}_a^{(3)}={\eta}_d^{(3)}=20 \).
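The relation between decomposition level, subband frequency range, and payload can be sketched as follows. Several assumptions are made here: the 16 kHz sampling rate is inferred from 9,229,695 samples lasting 576.86 s, the band edges are the nominal ones for ideal half-band filters (real wavelet filters overlap), and the parameter l = 20 is interpreted as the number of coefficients carrying one bit, which is consistent with the 200 bps and 100 bps figures but is not stated in this section. The function names are illustrative.

```python
def subband_range_hz(fs, level, kind):
    """Nominal band edges of a dyadic DWT subband, assuming ideal half-band filters."""
    if kind == "approx":
        return (0.0, fs / 2 ** (level + 1))
    if kind == "detail":
        return (fs / 2 ** (level + 1), fs / 2 ** level)
    raise ValueError("kind must be 'approx' or 'detail'")

def payload_bps(fs, level, coeffs_per_bit):
    """Bits per second when each bit occupies `coeffs_per_bit` subband coefficients."""
    coeff_rate = fs / 2 ** level  # each level halves the coefficient rate
    return coeff_rate / coeffs_per_bit

FS = 16000  # Hz, inferred sampling rate (assumption)
for level, kind in [(2, "approx"), (2, "detail"), (3, "approx"), (3, "detail")]:
    print(level, kind, subband_range_hz(FS, level, kind), payload_bps(FS, level, 20))
```

Under these assumptions the second-level detail subband spans 2 to 4 kHz, while the third-level approximation and detail subbands cover 0 to 1 kHz and 1 to 2 kHz, respectively, at half the coefficient rate.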
The quality of the watermarked speech signal obtained using the abovementioned methods was evaluated based on SNR and PESQ. We intentionally adjusted the parameters of the three methods to obtain SNR values near 22 dB, which is above the level (20 dB) recommended by the International Federation of the Phonographic Industry (IFPI) [28]. The commensurate SNR values also imply the use of comparable embedding strengths for all three methods. As shown in Table 4, the MOS-LQO values for DWT-SVD, LWT-DCT-SVD, and DWT-AMM were distributed over a range just above 3.2. These outcomes suggest that the three methods render comparable quality. Nonetheless, a score of 3.2 merely reflects a fair auditory perception. The cause is conceivably connected with the embedding strength and payload capacity. For the DWT-AMM implemented in the third-level approximation subband with a payload capacity of 100 bps, the average MOS-LQO value rose above 4.0. The MOS-LQO score was lifted beyond 4.2 when the DWT-AMM was applied to the third-level detail subband with the same capacity.
We examined the BER defined in Eq. (21) to evaluate the robustness of the algorithms against the attacks specified in Table 2. Table 5 presents the average BERs obtained by each of the methods in the presence of these attacks. All three methods successfully retrieved the watermark when no attack was present, and all of them demonstrated comparably satisfactory resistance against the G.722 and G.726 codecs. They also survived lowpass filtering (I) and resampling, because these two attacks do not have a severe effect on coefficients in the second-level approximation subband. Conversely, because the watermark resides in this low-frequency subband, none of the three methods passed the highpass filtering attack, by which the low-frequency components below 1 kHz were destroyed. Lowpass filtering with a cutoff frequency of 1 kHz inflicted obvious damage on DWT-SVD and LWT-DCT-SVD; however, only minor damage was observed in the results of DWT-AMM. This can be ascribed to the use of the statistical mean for watermarking.
In cases involving echo addition (Attack J) and slight time-shift (Attack N), DWT-AMM outperformed DWT-SVD and LWT-DCT-SVD, due primarily to its adaptability to signal intensity. Adaptively adjusting the quantization steps also enables DWT-AMM to withstand amplitude scaling attacks. By contrast, both DWT-SVD and LWT-DCT-SVD failed in the case of amplitude scaling, due to the use of a fixed quantization step.
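Why an adaptive quantization step withstands amplitude scaling while a fixed one does not can be seen in a small simulation. The sketch below is a generic quantization-index-modulation of a group mean, not the authors' AMM formulation: the step rule `adaptive_step` (a fixed fraction of the RMS of earlier, unmodified coefficients), the group sizes, and the Gaussian test data are all hypothetical choices made for illustration.

```python
import math
import random

def embed_bit(group, bit, step):
    """Shift a group of coefficients so its mean lands on the lattice for `bit`."""
    mean = sum(group) / len(group)
    offset = bit * step / 2.0  # bit 1 uses the lattice shifted by half a step
    target = round((mean - offset) / step) * step + offset
    return [c + (target - mean) for c in group]

def extract_bit(group, step):
    """Recover the bit from the parity of the mean's half-step index."""
    mean = sum(group) / len(group)
    return round(mean / (step / 2.0)) % 2

def adaptive_step(history, ratio=0.5):
    """Hypothetical rule: step proportional to the RMS of earlier coefficients."""
    rms = math.sqrt(sum(c * c for c in history) / len(history))
    return ratio * rms

def count_errors(gain, adaptive, n_bits=200, seed=1):
    """Embed random bits, scale everything by `gain`, and count decoding errors."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_bits):
        history = [rng.gauss(0.0, 4.0) for _ in range(20)]  # left unmodified
        group = [rng.gauss(0.0, 4.0) for _ in range(4)]
        bit = rng.randint(0, 1)
        step = adaptive_step(history) if adaptive else 2.0
        marked = embed_bit(group, bit, step)
        # Amplitude-scaling attack applied to everything the decoder sees.
        attacked = [gain * c for c in marked]
        attacked_history = [gain * c for c in history]
        step_rx = adaptive_step(attacked_history) if adaptive else 2.0
        errors += extract_bit(attacked, step_rx) != bit
    return errors

print("fixed step, gain 0.7:   ", count_errors(0.7, adaptive=False))
print("adaptive step, gain 0.7:", count_errors(0.7, adaptive=True))
```

Because the adaptive step is re-derived from the attacked signal, it scales by the same gain as the group mean, so the lattice index is unchanged. A fixed step leaves the scaled mean on the wrong lattice point for most groups whose mean exceeds about one step.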
The addition of Gaussian white noise with SNR controlled at 30 and 20 dB did not appear to cause any problems for DWT-SVD or LWT-DCT-SVD; however, DWT-AMM suffered minor deterioration. This is conceivably due to the imperfect acquisition of quantization steps from noise-corrupted speech. Requantization can be regarded as a type of noise corruption [44]; therefore, DWT-AMM is also subject to performance degradation in that case. The same explanation applies to the results obtained under DA/AD conversion attacks, which lead to composite impairment in time scaling, amplitude scaling, and noise corruption [44]. DWT-AMM was unable to entirely avoid damage under these conditions; however, DWT-SVD and LWT-DCT-SVD suffered even more due to their lack of resistance to amplitude scaling.
For the two 100-bps versions of DWT-AMM, the one implemented in the third-level approximation subband, termed \( \mathrm{DWT}\text{-}{\mathrm{AMM}}_a^{(3)} \), generally exhibited superior robustness in terms of BER while still yielding a MOS-LQO above 4.0. The reduction in BER is ascribed to the fact that the watermark embedding is performed over a subband with higher intensity, while the imperceptibility is improved at the cost of payload capacity. By contrast, the DWT-AMM implemented in the third-level detail subband, termed \( \mathrm{DWT}\text{-}{\mathrm{AMM}}_d^{(3)} \), offered an average MOS-LQO of 4.238. The associated SNR of 30.611 dB reflects a weaker embedding strength, thus resulting in a worse BER than that obtained from \( \mathrm{DWT}\text{-}{\mathrm{AMM}}_a^{(3)} \).
Conclusions
This paper proposes a novel DWT-based speech watermarking scheme. In the proposed scheme, information bits and synchronization codes are embedded within the second-level approximation and detail subbands, respectively. The synchronization code serves for frame alignment and indicates the start position of an enciphered bit sequence referred to as a package. The watermarking process is executed on a frame-by-frame basis to facilitate detectability. Binary embedding in the second-level approximation subband is performed by adaptively modifying the mean value of the coefficients gathered in each subgroup. During watermark extraction, all fragments of binary bits are retrieved with the assistance of a synchronization scheme and repacked according to the header content of each package. The robustness of the embedded watermark is reinforced through the selection of frames with sufficient intensity. The proposed formulation makes it possible to specify the embedding strength in terms of the SNR of the intended subband. Specifically, the quantization steps can be acquired from the speech signal by referring to the energy level of the passing coefficients in a recursive manner.
The watermarking scheme outlined in this paper has a maximum rate of 200 bps. PESQ test results indicate that the proposed DWT-AMM renders speech quality comparable to that obtained using two existing wavelet-based methods. With the exception of attacks that compromise the retrieval of quantization steps, the proposed DWT-AMM generally outperforms the compared methods. Overall, the proposed DWT-AMM demonstrates satisfactory performance. The incorporation of the package synchronization scheme allows the splitting of the watermark to cope with the intermittent characteristic of speech signals.
References
 1.
N Cvejic, T Seppänen, Digital audio watermarking techniques and technologies: applications and benchmarks (Information Science Reference, Hershey, 2008)
 2.
X He, Watermarking in audio: key techniques and technologies (Cambria Press, Youngstown, 2008)
 3.
B Lei, I Song, SA Rahman, Robust and secure watermarking scheme for breath sound. J Syst Softw 86(6), 1638–1649 (2013)
 4.
MA Nematollahi, SAR Al-Haddad, F Zarafshan, Blind digital speech watermarking based on eigenvalue quantization in DWT. J King Saud Univ Comp Inf Sci 27(1), 58–67 (2015)
 5.
MA Nematollahi, SAR Al-Haddad, An overview of digital speech watermarking. Int J Speech Tech 16(4), 471–488 (2013)
 6.
K Hofbauer, G Kubin, WB Kleijn, Speech watermarking for analog flat-fading bandpass channels. IEEE Trans on Audio Speech and Language Processing 17(8), 1624–1637 (2009)
 7.
OTC Chen, CH Liu, Content-dependent watermarking scheme in compressed speech with identifying manner and location of attacks. IEEE Trans on Audio Speech Language Processing 15(5), 1605–1616 (2007)
 8.
DJ Coumou, G Sharma, Insertion, deletion codes with feature-based embedding: a new paradigm for watermark synchronization with applications to speech watermarking. IEEE Trans Inf Forensics Secur 3(2), 153–165 (2008)
 9.
N Chen, J Zhu, Multipurpose speech watermarking based on multistage vector quantization of linear prediction coefficients. J China Univ Posts Telecom 14(4), 64–69 (2007)
 10.
B Yan, YJ Guo, Speech authentication by semi-fragile speech watermarking utilizing analysis by synthesis and spectral distortion optimization. Multimed Tools Appl 67(2), 383–405 (2013)
 11.
D Kundur, Multiresolution digital watermarking: algorithms and implications for multimedia signals, Ph.D. Thesis, University of Toronto, Ontario, Canada (1999)
 12.
W Li, X Xue, P Lu, Localized audio watermarking technique robust against time-scale modification. IEEE Trans Multimedia 8(1), 60–69 (2006)
 13.
R Tachibana, S Shimizu, S Kobayashi, T Nakamura, An audio watermarking method using a two-dimensional pseudo-random array. Signal Process 82(10), 1455–1469 (2002)
 14.
D Megías, J Serra-Ruiz, M Fallahpour, Efficient self-synchronised blind audio watermarking system based on time domain and FFT amplitude modification. Signal Process 90(12), 3078–3092 (2010)
 15.
XY Wang, H Zhao, A novel synchronization invariant audio watermarking scheme based on DWT and DCT. IEEE Trans Signal Processing 54(12), 4835–4840 (2006)
 16.
IK Yeo, HJ Kim, Modified patchwork algorithm: a novel audio watermarking scheme. IEEE Trans Speech Audio Processing 11(4), 381–386 (2003)
 17.
BY Lei, IY Soon, Z Li, Blind and robust audio watermarking scheme based on SVD–DCT. Signal Process 91(8), 1973–1984 (2011)
 18.
B Lei, IY Soon, F Zhou, Z Li, H Lei, A robust audio watermarking scheme based on lifting wavelet transform and singular value decomposition. Signal Process 92(9), 1985–2001 (2012)
 19.
HT Hu, LY Hsu, Robust, transparent and high-capacity audio watermarking in DCT domain. Signal Process 109, 226–235 (2015)
 20.
XY Wang, PP Niu, HY Yang, A robust digital audio watermarking based on statistics characteristics. Pattern Recogn 42(11), 3057–3064 (2009)
 21.
S Wu, J Huang, D Huang, YQ Shi, Efficiently self-synchronized audio watermarking for assured audio data transmission. IEEE Trans Broadcasting 51(1), 69–76 (2005)
 22.
X Wang, P Wang, P Zhang, S Xu, H Yang, A norm-space, adaptive, and blind audio watermarking algorithm by discrete wavelet transform. Signal Process 93(4), 913–922 (2013)
 23.
HT Hu, LY Hsu, HH Chou, Variable-dimensional vector modulation for perceptual-based DWT blind audio watermarking with adjustable payload capacity. Digital Signal Processing 31, 115–123 (2014)
 24.
A Al-Haj, An imperceptible and robust audio watermarking algorithm. EURASIP J Audio Speech Music Processing 2014, 37 (2014). doi:10.1186/s13636-014-0037-2
 25.
X Li, HH Yu, Transparent and robust audio data hiding in cepstrum domain, in IEEE Int Conf Multimedia and Expo (2000), 397–400
 26.
SC Liu, SD Lin, BCH code-based robust audio watermarking in cepstrum domain. J Inf Sci Eng 22(3), 535–543 (2006)
 27.
HT Hu, WH Chen, A dual cepstrum-based watermarking scheme with self-synchronization. Signal Process 92(4), 1109–1116 (2012)
 28.
S Katzenbeisser, FAP Petitcolas, in Information hiding techniques for steganography and digital watermarking, ed. by FAP Petitcolas (Artech House, Boston, 2000)
 29.
I Daubechies, Ten lectures on wavelets (SIAM, Philadelphia, 1992)
 30.
B Chen, GW Wornell, Quantization index modulation: a class of provably good methods for digital watermarking and information embedding. IEEE Trans Inf Theory 47(4), 1423–1443 (2001)
 31.
B Chen, GW Wornell, Quantization index modulation methods for digital watermarking and information embedding of multimedia. J VLSI Signal Processing Systems Signal Image Video Technol 27(1), 7–33 (2001)
 32.
P Moulin, R Koetter, Data-hiding codes. Proc IEEE 93(12), 2083–2126 (2005)
 33.
HT Hu, JR Chang, LY Hsu, Windowed and distortion-compensated vector modulation for blind audio watermarking in DWT domain. Multimed Tools Appl, 1–21 (2016). doi:10.1007/s11042-016-4202-8
 34.
HT Hu, LY Hsu, Supplementary schemes to enhance the performance of DWT-RDM-based blind audio watermarking. Circuits Syst Signal Process 36(5), 1890–1911 (2016)
 35.
M Fallahpour, D Megias, DWT-based high capacity audio watermarking. IEICE Trans Fundam Electron Commun Comput Sci E93-A(1), 331–335 (2010)
 36.
M Fallahpour, D Megias, High capacity robust audio watermarking scheme based on FFT and linear regression. Int J Innovative Comput Inf Control 8(4), 2477–2489 (2012)
 37.
HT Hu, LY Hsu, A DWT-based rational dither modulation scheme for effective blind audio watermarking. Circuits Syst Signal Process 35(2), 553–572 (2016)
 38.
HT Hu, LY Hsu, Incorporating spectral shaping filtering into DWT-based vector modulation to improve blind audio watermarking. Wireless Personal Communications 94(2), 221–240 (2017)
 39.
G Forney Jr, On decoding BCH codes. IEEE Trans Inf Theory 11(4), 549–557 (1965)
 40.
VI Arnold, A Avez, Ergodic problems of classical mechanics (Benjamin, New York, 1968)
 41.
W Fisher, G Doddington, K Goudie-Marshall, The DARPA speech recognition research database: Specifications and status, in Proceedings of DARPA Workshop on Speech Recognition (1986), pp. 93–99
 42.
P Kabal, An examination and interpretation of ITU-R BS.1387: Perceptual evaluation of audio quality, TSP Lab Technical Report, Dept. Electrical & Computer Engineering, McGill University (2002)
 43.
ITU-T Recommendation P.862 Amendment 1, Source code for reference implementation and conformance tests, [Online]. (2003) Available: http://www.itu.int/rec/T-REC-P.862-200303-S!Amd1/en
 44.
S Xiang, Audio watermarking robust against D/A and A/D conversions. EURASIP J Adv Signal Process 2011(1), 1–14 (2011)
Acknowledgements
This research work was supported by the Ministry of Science and Technology, Taiwan, Republic of China, under grants MOST 104-2221-E-197-023 & MOST 105-2221-E-197-019.
Authors’ contributions
In this research work, HTH and LYH jointly developed the algorithms and conducted the experiments. HTH was responsible for drafting the manuscript. SJL provided valuable comments and helped to improve the manuscript. All authors have read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Hu, H., Lin, S. & Hsu, L. Effective blind speech watermarking via adaptive mean modulation and package synchronization in DWT domain. J AUDIO SPEECH MUSIC PROC. 2017, 10 (2017). https://doi.org/10.1186/s13636-017-0106-4
Keywords
 Blind speech watermarking
 Adaptive mean modulation
 Package synchronization
 Discrete wavelet transform