An imperceptible and robust audio watermarking algorithm

In this paper, we propose a semi-blind, imperceptible, and robust digital audio watermarking algorithm. The proposed algorithm is based on cascading two well-known transforms: the discrete wavelet transform and the singular value decomposition. The two transforms provide different, but complementary, levels of robustness against watermarking attacks. The uniqueness of the proposed algorithm is twofold: the distributed formation of the wavelet coefficient matrix and the selection of the off-diagonal positions of the singular value matrix for embedding watermark bits. Imperceptibility, robustness, and high data payload of the proposed algorithm are demonstrated using different musical clips.


Introduction
The recent advancements of digital audio technology have increased the ease with which audio files are stored, transmitted, and reproduced. However, along with such conveniences come new risks such as copyright violation. Conventional encryption algorithms permit only authorized users to access encrypted digital data; however, once the data are decrypted, there is no way to prohibit their illegal copying and distribution [1]. A promising solution to the copyright violation problem is audio watermarking, in which audio files are marked with secret, robust, and imperceptible watermarks to achieve copyright protection [2][3][4][5]. Indeed, a digital watermark is a good deterrent to illicit copying and dissemination of copyrighted audio since it can provide evidence of infringement after a violation has occurred.
Audio watermarking techniques used for copyright protection of digital audio signals must satisfy two main requirements: imperceptibility and robustness [6]. Imperceptibility refers to the condition that the embedded watermark should not produce audible distortion of the sound quality of the original audio. That is, the watermarked version of the audio signal must be indistinguishable from the original audio signal. Robustness, on the other hand, ensures the resistance of the watermark against removal or degradation; the watermark should survive malicious attacks such as random cropping and noise addition. Some watermarking applications may demand additional requirements such as high data payload and low computational time [3]. In practice, there exists a fundamental trade-off among the different watermarking requirements.
Audio watermarking can be carried out in the time domain or the transform domain of the audio signal. Time-domain techniques based on least-significant-bit substitution and echo hiding are found extensively in the literature [7][8][9][10][11][12]. In general, time-domain audio watermarking techniques are relatively easy to implement and require few computing resources. However, they are less robust than transform-domain techniques, which exploit the perceptual properties and frequency-masking characteristics of the human auditory system [13]. Popular transforms that have been widely used in digital watermarking include the discrete Fourier transform (DFT), the discrete cosine transform (DCT), the discrete wavelet transform (DWT), and the singular value decomposition (SVD) [14][15][16][17][18][19][20].
It has been reported recently that imperceptible and robust audio watermarking can be achieved by applying a cascade of two different transforms to the original audio signal. Being different, the cascaded transforms may provide different, but complementary, levels of robustness against the same attack. Many audio watermarking techniques based on hybrid transforms have been proposed in the literature, including but not limited to DWT-DCT [21], DWT-SVD [22], and SVD-STFT [23].
Several hybrid algorithms based on the SVD transform have recently been proposed in the literature. In the algorithm proposed in [23], the audio signal is first converted into matrix form using the short-time Fourier transform (STFT), the SVD transform is then applied to the matrix, and embedding is finally carried out by adaptively modifying the SVD coefficients with watermark bits. In the hybrid algorithm proposed in [24], the audio signal is partitioned into blocks, and the watermark bits are embedded using dither modulation quantization of the singular values of the blocks. In [22], an audio watermarking algorithm is proposed in which the watermark embedding and extraction procedures are based on quantization of the norms of the singular values of audio blocks. The same authors proposed in [25] a hybrid algorithm in which watermark bits are embedded by applying quantization index modulation (QIM) on the singular values of wavelet-domain blocks. All of the abovementioned SVD-based hybrid algorithms employ some sort of quantization to embed watermark bits. Although quantization is simple, an acceptable level of robustness against noise and filtering attacks may not always be achieved.
In this paper, we propose a semi-blind hybrid audio watermarking algorithm based on the DWT and SVD transforms. In the proposed algorithm, the audio signal is sampled and partitioned into short audio segments called frames, and a four-level DWT decomposition is applied to each frame. A matrix is then formed by arranging the wavelet coefficients of all detail sub-bands in a unique distributed pattern which scatters the watermark bits throughout the transformed frame to provide a high degree of robustness. The SVD operator is then applied to the matrix, and the watermark bits are embedded into the off-diagonal zero elements of the S matrix produced by the SVD transform. Unlike other SVD-based algorithms, the proposed algorithm leaves the non-zero singular values of the S matrix unchanged to ensure high watermarking imperceptibility.
The rest of the paper is organized as follows. In the next section, the DWT and SVD transforms are described, and their unique utilization in the proposed algorithm is outlined. The proposed DWT-SVD audio watermarking algorithm is described in detail in Section 3 and evaluated with respect to imperceptibility, robustness, and data payload in Section 4. Concluding remarks are given in Section 5.

Related work and contribution
The proposed algorithm is based on cascading two transforms: the DWT and the SVD. The uniqueness of the proposed algorithm is twofold: the distributed formation of the DWT coefficient matrix and the selection of the off-diagonal positions of the SVD's singular value matrix for embedding watermark bits. A description of the two transforms and their exact utilization in the proposed algorithm is given in this section.

DWT-based audio watermarking
The DWT is a frequency transform capable of giving a time-frequency representation of any given signal [26]. Starting from an audio signal S, the DWT produces two sets of coefficients: the approximation coefficients A1, produced by passing S through a low-pass filter, and the detail coefficients D1, produced by passing S through a high-pass filter. Depending on the application and the length of S, A1 can be further decomposed into more levels. Figure 1 illustrates a three-level DWT decomposition of the audio signal S. Many DWT-based audio watermarking algorithms can be found in the literature. Many variations among the different algorithms exist; however, the main variation is in the sub-band chosen for embedding the watermark bits. In [27][28][29], the approximation sub-band is used for embedding the watermark bits, while in most algorithms, only one detail sub-band is used [30][31][32][33][34][35][36]. Claims of good imperceptibility and robustness have been reported for both embedding approaches.
In this paper, watermark bits are not embedded in one sub-band only; rather, the bits are distributed among all multi-resolution detail sub-bands. For a three-level DWT decomposition, this is done by forming a matrix of the detail sub-bands (D1, D2, and D3) as shown in Figure 2. This matrix formation allows for better scattering of the watermark bits throughout the sub-bands, leading to a higher degree of robustness. The resultant DWT matrix is processed by the SVD transform to embed the watermark bits, as explained in the next subsection.
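The level-by-level filtering described above can be sketched in a few lines of NumPy. This is an illustrative hand-rolled Haar (db1) filter pair, not the paper's implementation; the function name haar_dwt_levels and the 16-sample toy frame are ours:

```python
import numpy as np

def haar_dwt_levels(x, levels):
    """Multi-level 1-D Haar (db1) DWT.
    Returns (approximation of the last level, [D1, D2, ..., D_levels])."""
    details = []
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        even, odd = a[0::2], a[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # high-pass: detail coefficients
        a = (even + odd) / np.sqrt(2.0)              # low-pass: approximation
    return a, details

# A toy frame of length L = 16 decomposed over three levels:
frame = np.arange(16, dtype=float)
a3, (d1, d2, d3) = haar_dwt_levels(frame, 3)
# Sub-band lengths halve at each level: len(d1) = 8, len(d2) = 4, len(d3) = 2
```

Because the Haar pair is orthonormal, the frame's total energy is preserved across the sub-bands, which is what allows watermark bits to be distributed over all detail sub-bands without any re-normalization.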

SVD-based audio watermarking
The SVD of a matrix A is defined by the operation A = U Σ V^T, as shown in Figure 3. The diagonal entries of Σ are called the singular values of A and are arranged in decreasing order, σi ≥ σi+1. The columns of the U matrix are called the left singular vectors, while the columns of the V matrix are called the right singular vectors of A.
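The decomposition and the ordering property can be verified directly with NumPy's SVD routine; the small matrix A below is an arbitrary example of ours:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=True)  # A = U @ Sigma @ Vt
# numpy returns the singular values as a 1-D vector in non-increasing
# order; rebuild the rectangular Sigma matrix explicitly.
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
assert np.allclose(U @ Sigma @ Vt, A)   # the decomposition reconstructs A
assert np.all(s[:-1] >= s[1:])          # sigma_i >= sigma_{i+1}
```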
The SVD transform has been used in several audio watermarking algorithms [22][23][24][25][37][38][39]. The algorithms vary in the way the singular values are used in the watermarking process. For example, in [37], the single largest singular value, σ11, was quantized and used to embed the watermark, whereas in [38], the encrypted watermark signal was added to all singular values of matrix Σ. In [22,24,25], the norms of the singular values were quantized and used in the watermark embedding process.
In our proposed algorithm, matrix A represents the detail sub-band matrix shown in Figure 2, which is produced after applying the DWT on the original audio signal. After applying the SVD operator on the DWT matrix, watermark bits are embedded into the off-diagonal zero elements of the singular value matrix, while the diagonal singular values remain unchanged. This embedding procedure eliminates any distortion of the singular values, which could otherwise affect imperceptibility and watermarking quality. Related preliminary works have been published by the author and others in [40,41]. The algorithms reported in those papers have low capacity, as they embed the watermark bits in the single largest singular value, σ11, and not in the off-diagonal zero elements of the Σ matrix, as is the case in the proposed algorithm.

Proposed DWT-SVD audio watermarking algorithm
In this section, we describe the proposed DWT-SVD algorithm. The algorithm consists of two procedures: a watermark embedding procedure and a watermark extraction procedure. Figure 4 illustrates the watermark embedding procedure.

Watermark embedding procedure
The watermark embedding procedure transforms the audio signal using DWT and SVD, embeds the bits of a binary image watermark in appropriate locations in the transformed signal, and finally produces a watermarked audio signal by performing inverse SVD and DWT operations. The procedure is illustrated in the block diagram shown in Figure 4 and described thereafter.
Step 1: Convert the binary image watermark into a one-dimensional vector b of length M × N, where M × N are the dimensions of the image. A watermark bit bi takes one of two values: 0 or 1.
Step 2: Sample the original audio signal at a rate of 44,100 samples per second and partition the sampled signal into frames. The optimal frame length is determined experimentally so as to maximize data payload.
Step 3: Perform a four-level DWT transformation on each frame. This operation produces five multi-resolution sub-bands: D1, D2, D3, D4, and A4. The D sub-bands are called the 'detail sub-bands', and the A4 sub-band is called the 'approximation sub-band'. The five sub-bands are arranged in the vector shown in Figure 5.
Step 4: Arrange the four detail sub-bands D1, D2, D3, and D4 in a matrix D as shown in Figure 6. This formation distributes the watermark bits throughout the multi-resolution sub-bands D1 through D4 and, unlike embedding in a single sub-band, produces a matrix on which the matrix-based SVD operator can subsequently be applied. The size of matrix D is 4 × (L/2), where L refers to the length of the frame.
Step 5: Decompose matrix D using the SVD operator. This operation produces an orthonormal matrix U, a diagonal matrix Σ, and an orthonormal matrix V^T:

D = U Σ V^T

where the diagonal matrix Σ has the same size as the D matrix, and its diagonal entries σii correspond to the singular values of D. However, for embedding purposes, only the leading 4 × 4 sub-matrix of Σ, named S hereafter, is used:

S = [ σ11  0    0    0
      0    σ22  0    0
      0    0    σ33  0
      0    0    0    σ44 ]

This is a trade-off between imperceptibility (inaudibility) and payload (embedding capacity). That is, using the whole Σ matrix for embedding would increase embedding capacity but would lead to severe distortion of the watermarked audio signal.

Step 6: Arrange 12 bits of the original watermark bit vector b into a 4 × 4 watermark matrix W. The watermark bits occupy only the non-diagonal positions within the matrix, filled in row-major order:

W = [ 0    w1   w2   w3
      w4   0    w5   w6
      w7   w8   0    w9
      w10  w11  w12  0  ]
As an example, the 12-bit watermark pattern 1010 0011 0101 is converted to the following matrix form (bits filling the non-diagonal positions in row-major order) before the actual embedding is carried out:

W = [ 0  1  0  1
      0  0  0  0
      1  1  0  0
      1  0  1  0 ]

Step 7: Embed the bits of watermark matrix W into matrix S according to the following 'additive-embedding' formula:

Sw = S + αW

where Sw is the watermarked S matrix, and α is the watermark intensity, which should be chosen to tune the trade-off between robustness and imperceptibility. With this type of embedding, the singular values of D remain unchanged, and thus audible distortion caused by modifying the singular values is avoided.
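A minimal sketch of Steps 6 and 7 follows. The row-major fill order of the off-diagonal positions and the illustrative singular values are our assumptions; the additive formula Sw = S + αW and the intensity α = 3 match the description above:

```python
import numpy as np

OFF_DIAG = [(i, j) for i in range(4) for j in range(4) if i != j]  # 12 positions

def embed_bits(S, bits, alpha=3.0):
    """Additive embedding Sw = S + alpha*W: the 12 watermark bits occupy
    only the off-diagonal entries of W, so the diagonal singular values
    of S are left untouched."""
    W = np.zeros((4, 4))
    for (i, j), b in zip(OFF_DIAG, bits):
        W[i, j] = float(b)
    return S + alpha * W

S = np.diag([40.0, 25.0, 10.0, 4.0])       # illustrative singular values (ours)
bits = [1,0,1,0, 0,0,1,1, 0,1,0,1]         # the example pattern 1010 0011 0101
S_w = embed_bits(S, bits, alpha=3.0)
```

Note that the diagonal of S_w is identical to the diagonal of S, which is exactly the property the algorithm relies on for imperceptibility.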
Step 8: Decompose the new watermarked matrix Sw using the SVD operator. This operation produces two new orthonormal matrices, U1 and V1^T, and a new diagonal matrix S1:

Sw = U1 S1 V1^T

The matrices U1 and V1^T are stored for later use in the extraction process. This makes the proposed watermarking algorithm semi-blind, as the whole original audio frame is not required for extraction.
Step 9: Apply the inverse SVD operation using the U and V^T matrices, which were unchanged, and the S1 matrix, which has been modified according to Equation (6):

Dw = U Σ′ V^T

where matrix Σ′ is the original Σ matrix with the S sub-matrix replaced by the S1 sub-matrix. The resulting Dw matrix is the watermarked version of the D matrix given in Equation (2).
Step 10: Apply the inverse DWT operation on the Dw matrix, together with the unmodified approximation sub-band A4, to obtain the watermarked audio frame.
Step 11: Repeat the previous steps for each frame. The overall watermarked audio signal is obtained by concatenating the watermarked frames.

Watermark extraction procedure
Given the watermarked audio signal and the corresponding U1 and V1 matrices that were computed in Equation (7) and stored for each frame, the embedded watermark can be extracted according to the procedure outlined in Figure 7 and described in detail in the following steps:
Step 1: Obtain the matrix S1′ from each frame of the watermarked audio signal by following the general steps presented in Figure 7.
Step 2: Multiply matrix S1′ by the matrices U1 and V1^T, which were computed in the watermark embedding procedure and stored for use in the extraction process. This results in the following matrix:

Sw′ = U1 S1′ V1^T
Step 3: Extract the 12 watermark bits from each frame by examining the non-diagonal values of matrix Sw′. It has been experimentally observed that the non-diagonal values fall into two clearly distinct groups: the values at positions where a 0 bit was embedded tend to be much smaller than the values at positions where a 1 bit was embedded. Thus, to determine a watermark bit W(n), the average of the non-diagonal values, denoted avg, is first computed; then, for each non-diagonal value Sw′(i, j), W(n) is extracted according to the following rule:

W(n) = 1 if Sw′(i, j) ≥ avg, and W(n) = 0 otherwise.

Step 4: Construct the watermark image by assembling the bits extracted from all frames.
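The threshold rule of Step 3 can be sketched as follows. The synthetic recovered matrix below simulates a clean recovery (α = 3, and the same row-major position ordering we assumed for embedding); a real recovery would also contain attack noise:

```python
import numpy as np

OFF_DIAG = [(i, j) for i in range(4) for j in range(4) if i != j]

def extract_bits(S_rec):
    """Read the 12 off-diagonal values and compare each against their
    average: values at or above the average decode as 1, below as 0."""
    vals = np.array([S_rec[i, j] for (i, j) in OFF_DIAG])
    avg = vals.mean()
    return [1 if v >= avg else 0 for v in vals]

# Simulate a cleanly recovered matrix: singular values on the diagonal,
# alpha * bit at each off-diagonal position.
bits = [1,0,1,0, 0,0,1,1, 0,1,0,1]
S_rec = np.diag([40.0, 25.0, 10.0, 4.0])
for (i, j), b in zip(OFF_DIAG, bits):
    S_rec[i, j] = 3.0 * b
```

The rule degrades gracefully under noise: as long as the two groups of off-diagonal values stay on opposite sides of their common average, every bit is still decoded correctly.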

Experimental results
Different types of audio signals have different perceptual properties, and therefore, watermarking performance may vary from one type to another. Accordingly, we evaluated the performance of the proposed algorithm using three mono audio signals representing pop music, instrumental music, and speech. Each signal has a duration of 11 s and was sampled at 44.1 kHz and quantized to 16 bits per sample. The watermark used for experimentation is the 12 × 10 binary image shown in Figure 8. The watermark is embedded repeatedly throughout the sampled signal, such that one watermark image is embedded across a sequence of ten frames.
Four-level DWT decomposition is applied on each frame using the Daubechies wavelet (db1). Using other wavelet types has little effect on the performance, as was observed experimentally. Values ranging from 1 to 5 were used for the watermark intensity α; however, the results reported in this paper were obtained with the intensity value set to 3. In what follows, we present performance results of the proposed algorithm with respect to three metrics: imperceptibility, robustness, and data payload [42,43].

Imperceptibility results
Imperceptibility ensures that the quality of the signal is not perceivably distorted and the watermark is imperceptible to listeners. To measure imperceptibility, different authors use different metrics; however, the most commonly used metrics are signal-to-noise ratio (SNR) and listening tests.

Signal-to-noise ratio
SNR is a statistical difference metric used to measure the similarity between the undistorted original audio signal and the distorted watermarked audio signal. The SNR is computed according to Equation (11), where A corresponds to the original signal and A′ corresponds to the watermarked signal:

SNR = 10 log10 [ Σn A²(n) / Σn (A(n) − A′(n))² ]
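A direct implementation of this metric, assuming the conventional power-ratio definition stated above (the helper name snr_db and the constant-offset example are ours):

```python
import numpy as np

def snr_db(original, watermarked):
    """SNR in dB between the original signal A and the watermarked
    signal A': 10*log10(signal power / embedding-noise power)."""
    a = np.asarray(original, dtype=float)
    noise = np.asarray(watermarked, dtype=float) - a
    return 10.0 * np.log10(np.sum(a ** 2) / np.sum(noise ** 2))

# A constant offset of 1% of the amplitude gives a power ratio of 10^4,
# i.e. exactly 40 dB:
clean = np.ones(1000)
noisy = clean + 0.01
```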
We obtained the SNR values (in dB) given in Table 1. As shown in the table, the values are much higher than the 20 dB minimum requirement set by the International Federation of the Phonographic Industry (IFPI) [13]. Although SNR is a simple metric for measuring the noise introduced by the embedded watermark and can give a general idea of imperceptibility, it does not take into account the specific characteristics of the human auditory system.

Listening tests
For a better evaluation of imperceptibility, subjective and objective listening tests are used. Subjective difference grade (SDG) listening tests are carried out by human listeners, while objective difference grade (ODG) listening tests are implemented by software packages incorporating a model of the human auditory system. Both listening tests use the 5-grade scale shown in Table 2.
We employed a blind subjective listening test to estimate the audio quality of the watermarked signals. The listening test was performed repeatedly with five adults in a listening room equipped with audio testing and recording devices. A computer system running special software was used for computer-controlled presentation of the watermarked signals to the listeners and for recording their responses. Each person was presented with ten pairs of signals (original and watermarked) and asked to give performance scores using the 5-grade impairment scale given in Table 2. The five persons listened to each pair of signals ten times and gave an average SDG value for each pair. The average grade submitted by all persons for each pair is considered the final grade for that particular pair of signals. The SDG averages obtained for the subjective listening tests are 4.67, 4.72, and 4.81 for the pop, instrumental, and speech signals, respectively. These values clearly indicate that imperceptibility has been achieved by the proposed audio watermarking algorithm.
The ODG scores were also computed using the Perceptual Evaluation of Audio Quality (PEAQ) standard. The standard is specified in ITU-R BS.1387 [44] and implemented by the software tool EAQUAL [45]. The ODG values we obtained are −0.67, −0.71, and −0.91 for the pop, instrumental, and speech signals, respectively. These results agree with those obtained by the subjective listening tests. The measured SDG and ODG values are given in Table 3.
Comparing imperceptibility results with results achieved by other algorithms is not straightforward, since different authors use different evaluation metrics. Moreover, subjective evaluation is relative and may differ from one listener to another. This may explain why imperceptibility results are rarely compared in the literature. Nonetheless, and for the sake of completeness, we present in Table 4 some imperceptibility results achieved by recently proposed algorithms. It is important to note that the values in the table are average values taken over different audio types.

Robustness results
Watermarked audio signals may undergo signal processing operations such as linear filtering and lossy compression, among many other operations [46,47]. Although these operations may not affect the perceived quality of the host signal, they may corrupt the watermark embedded within the signal. Two sets of attacks were performed to test the robustness of our proposed algorithm. The first set includes the following common signal processing operations: Gaussian noise addition, re-quantization, re-sampling, MP3 compression, low-pass filtering, and echo addition. The second set is the Stirmark® audio watermarking benchmark, which includes a whole range of add, modify, and filter attacks [48,49]. Robustness is measured using the bit error rate (BER) metric since the watermark used in the simulation is a binary image. BER is defined as the ratio of incorrectly extracted bits to the total number of embedded bits, as expressed in Equation (12).
BER = (1/l) Σ(n=1..l) (Wn ⊕ W′n)

where l is the watermark length, Wn is the nth bit of the embedded watermark, W′n is the nth bit of the extracted watermark, and ⊕ denotes the exclusive-OR operation.
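In code, the BER metric is simply the fraction of mismatched bits over the watermark length l (the helper name ber is ours):

```python
def ber(embedded, extracted):
    """Bit error rate: number of extracted bits that differ from the
    embedded bits (the XOR count), divided by the watermark length."""
    assert len(embedded) == len(extracted)
    errors = sum(w != w_prime for w, w_prime in zip(embedded, extracted))
    return errors / len(embedded)

# Identical sequences give 0.0; one flipped bit out of four gives 0.25.
```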

Common signal processing operations
The common signal processing attacks listed above were applied to test the robustness of the proposed algorithm. The BER values we obtained after applying these operations are listed in Table 5. As shown in the table, the BER values, which have been computed over the whole duration of the test signals, are very small in magnitude and thus reflect the robustness of the proposed algorithm against common signal processing operations. Maximum robustness has been achieved against the Gaussian noise, re-quantization, and MP3 compression (128 kbps) attacks. BER values due to re-sampling increased as the watermarked signal was down-sampled to lower frequencies.
The same observation is made for the MP3 compression attack, where higher BER values were obtained as the compression rate of the watermarked signal was increased. The watermarked signal is also robust against filtering operations, as shown by the corresponding small BER values. The least robustness is seen against the echo addition operation, as indicated by the relatively higher BER values.
Finally, we compared the robustness of the proposed algorithm with that of recently published transform-based algorithms. It is clear from Table 6 that the proposed algorithm performs better than the other algorithms. It is important to note that the values in Table 6 represent average values taken over different audio types.

Stirmark® attacks
To further evaluate the robustness of the proposed algorithm, we implemented a set of attacks defined by the Stirmark® benchmark for audio [48,49]. The attacks are comprehensive as they include add, filter, and modification attacks. The results are recorded in Table 7 alongside snapshots of the watermarks extracted from the watermarked signals. It is noted in Table 7 that the BER values due to most of the attacks are zero. It is also noted that the proposed algorithm performs comparably well across the three audio signal types.
The Stirmark® attacks have been used by several transform-based algorithms. Table 8 compares the BER results we obtained with the BER results reported in four relevant references. As shown in the table, the results are comparable among the different transform-based references with regard to most of the Stirmark® attacks. It is instructive to note that the Stirmark® package can also simulate composite attacks, where two or more attacks are tested in one run. Such composite attacks may give a better comparison between the different algorithms; however, they are rarely reported in the literature.

Data payload results
Data payload is defined as the data embedding capacity of the algorithm and is measured as the number of bits embedded within one second of the audio signal (bps).
In the proposed algorithm, the audio signal is segmented into frames, with each frame having a fixed embedding capacity of 12 watermark bits, as shown in matrix W given in (5). Therefore, the payload is computed by multiplying the number of frames per second by the bit capacity of the frame. The number of frames per second depends on the frame length and is computed by dividing the 44.1 kHz sampling rate by the frame length. Table 9 shows the data payload as a function of the frame length. As shown in the table, the payload increases as the frame length decreases. However, short frames degrade performance and result in unacceptable imperceptibility and robustness. A frame length of 2,048 samples was therefore fixed and used to evaluate the imperceptibility and robustness of the proposed algorithm.
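The payload computation described above reduces to a single expression. The 44,100 Hz sampling rate and the fixed 12-bit frame capacity are taken from the description above; payload_bps is our name for the helper:

```python
def payload_bps(sample_rate=44100, frame_length=2048, bits_per_frame=12):
    """Data payload in bits per second: frames per second multiplied by
    the fixed 12-bit embedding capacity of each frame."""
    return sample_rate / frame_length * bits_per_frame

# At the frame length used in the evaluation, 2,048 samples:
# 44100 / 2048 * 12 ~ 258.4 bps, and halving the frame length doubles the payload.
```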
The data payload we obtained is higher than payload rates obtained by other recently proposed algorithms. Table 10 lists the payload of different transform-based audio watermarking algorithms.

Conclusions
In this paper, we proposed an imperceptible and robust audio watermarking technique based on cascading two well-known transforms: the discrete wavelet transform and the singular value decomposition. The two transforms were used in a unique way that scatters the watermark bits throughout the transformed frame in order to achieve high degrees of imperceptibility and robustness. High data payloads were also achieved. The simulation results obtained were in full agreement with the requirements set by the IFPI for audio watermarking, thus demonstrating the effectiveness of the proposed algorithm.
Future research will focus on enhancing the proposed algorithm to resist de-synchronization attacks such as random cropping, pitch shifting, amplitude variation, time-scale modification, and jittering. Methods proposed in the literature to counter de-synchronization attacks include the all-list-search method, the combination of spread spectrum and spread spectrum code methods, the self-synchronization strategy, and the synchronization code method. Our approach will be based on embedding synchronization codes with the watermark bits so that the hidden data have self-synchronization capability.