Speech steganography using wavelet and Fourier transforms
EURASIP Journal on Audio, Speech, and Music Processing volume 2012, Article number: 20 (2012)
Abstract
A new method to secure speech communication using the discrete wavelet transform (DWT) and the fast Fourier transform (FFT) is presented in this article. In the first phase of the hiding technique, we separate the high-frequency components of the speech from the low-frequency components using the DWT. In the second phase, we exploit the low-pass spectral properties of the speech spectrum to hide another secret speech signal in the low-amplitude high-frequency regions of the cover speech signal. The proposed method allows hiding a large amount of secret information while making steganalysis more complex. Experimental results demonstrate the efficiency of the proposed hiding technique: the stego signals are perceptually indistinguishable from the corresponding cover signals, and the secret speech message can be recovered with only slight degradation in quality.
Introduction
Information security is one of the central concerns in the field of secure communication. Communication between two parties over long distances has always been subject to interception, and this remains true today. The need for secure communication has driven researchers to develop several cryptography schemes. Cryptographic methods achieve security by making the information unintelligible, so that only authenticated recipients can access it; to unauthorized people, the signal looks garbled. However, the garbled signal itself reveals that a cryptographic communication is in progress, which makes eavesdroppers suspect the existence of valuable data. They are thus incited to intercept the transmitted message and to attempt to decipher the secret information. This may be seen as a weakness of cryptography schemes. In contrast to cryptography, steganography allows secret communication by camouflaging the secret signal in another signal (named the cover signal) to avoid suspicion. This quality has motivated researchers to work in this active field to develop schemes ensuring better resistance to hostile attackers.
The word steganography is derived from two Greek words: Stego (meaning cover) and graphy (meaning writing). Combined, the two words form steganography, which means covert writing: the art of hiding written communications. Several steganography techniques were used to send messages secretly during wars through enemy territories. The use of steganography dates back to ancient times, when it was used by the Romans and ancient Egyptians [1]. One technique, according to the Greek historian Herodotus, was to shave the head of a slave, tattoo the message on the slave’s scalp, and send him after his hair grew back. Another technique was to write the secret message underneath the wax of a writing tablet. A third was to use invisible ink to write secret messages within covert letters [2].
Many techniques have been developed for hiding secret signals in other cover signals. Sridevi et al. [3] presented a method for audio steganography. It consists of substituting the least significant bit (LSB) of each sample of the cover speech signal with the secret data. While this method is easy to implement and can be used to hide larger secret messages, it cannot protect the hidden message from small modifications that can happen as a result of format conversion or compression. Hiding data in the LSBs of audio samples in the time domain is one of the simplest algorithms enabling a very high data rate of inserted information. However, several steganalysis algorithms have been developed to challenge the robustness of this method. Bender et al. [4] have presented a technique for data hiding based on phase coding. This method consists of substituting the phase of the first part of an audio segment by a reference phase that represents the data. In order to conserve the relative phase between segments, an adjustment must be made in the phase of the succeeding segment. The series of steps of phase coding is as follows: (i) The original audio signal is decomposed into smaller segments such that their length is equal to the size of the message to be encoded; (ii) A discrete Fourier transform (DFT) is applied to each segment, leading to a phase matrix; (iii) The differences between the phases of each pair of consecutive segments are computed; (iv) The phase shifts between consecutive segments are identified. Although the absolute phases of the segments may change, the relative phase differences between consecutive segments must remain unchanged; (v) The new phase of the first segment and the set of original phase differences are used to create a new phase matrix; (vi) The audio signal is regenerated with an inverse DFT, based on the original magnitude matrix and the newly created phase matrix, and the audio segments are then connected together.
The receiver determines the length of the secret message, then applies a DFT and extracts the hidden message from the cover signal. A distinctive characteristic of phase coding is its low data transmission rate, due to the fact that the secret data are encoded only in the first segment of the audio signal. Conversely, any increase in the length of the segment may shift the phase relations among the frequency components of the segment, thereby making the existence of a secret message easier to detect. Thus, the phase coding algorithm is more efficient when hiding a small amount of data. Kirovski and Malvar [5] have proposed a steganographic scheme called the spread spectrum (SS) coding method. This method randomly spreads the bits of the secret data message across the frequency spectrum of the audio signal. In contrast to LSB coding, the SS coding scheme spreads the secret message using a code that is independent of the actual cover signal. The SS coding technique may outperform the LSB coding and phase coding techniques by offering good quality for medium data transmission rates while ensuring a high level of robustness against steganalysis. However, similarly to the LSB coding technique, the SS method may introduce noise into the audio file. This is a weakness since it facilitates detection by steganalysis systems.
Huang and Yeo [6] have presented an information hiding method based on echo hiding. An echo is introduced into the discrete audio signal in order to embed secret information. Similar to the SS coding method, echo hiding provides a better data transmission rate and higher robustness compared to the noise-inducing techniques. To accomplish the hiding process successfully, three fundamental parameters of the echo are varied relative to the original signal: decay rate, offset (time delay), and amplitude. These three parameters are set below the human audibility threshold so that the echo is not easily perceived. The offset is altered to represent the binary message to be hidden: the first offset represents a binary one and the second a binary zero. Shirali and Shahreza [7] present an approach for hiding information in a speech signal. This method consists of detecting the silence intervals of the speech and the length of these intervals (number of samples), and changing them according to the secret information. Hiding data in the silence intervals of the audio samples is one of the simplest algorithms enabling a very high data rate of inserted information. However, this method is already well known and several steganalysis algorithms have been developed to defeat it.
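As a point of reference for the time-domain LSB substitution reviewed above, the following minimal sketch embeds a bit sequence in the least significant bits of integer audio samples and reads it back. The function names and the sample values are illustrative assumptions and are not taken from [3].

```python
def lsb_embed(samples, bits):
    """Replace the least significant bit of the leading samples with message bits."""
    if len(bits) > len(samples):
        raise ValueError("message is longer than the cover signal")
    stego = list(samples)
    for i, bit in enumerate(bits):
        # Clear the LSB, then write the message bit into it.
        stego[i] = (stego[i] & ~1) | bit
    return stego


def lsb_extract(stego, n_bits):
    """Read the message back from the LSB of the first n_bits samples."""
    return [s & 1 for s in stego[:n_bits]]


# Hypothetical 16-bit PCM samples and an 8-bit message.
cover = [1000, -2001, 37, 4096, -5, 12, 255, -32768]
message = [1, 0, 1, 1, 0, 0, 1, 0]

stego = lsb_embed(cover, message)
recovered = lsb_extract(stego, len(message))
max_change = max(abs(c - s) for c, s in zip(cover, stego))
```

Each sample changes by at most one quantization step, which is why the method is hard to hear but also fragile under requantization or compression.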
Speech steganography takes advantage of the recent advancements in speech compression and data hiding. Speech is a low-pass signal; its intelligibility is retained when at least the first three formants of the magnitude spectrum are preserved. In this article, we take advantage of these speech characteristics to propose an efficient speech-in-speech hiding method. Our speech steganography system embeds the secret speech parameters in the high-frequency regions of the magnitude spectrum of the cover speech. Our aim is to ensure that the stego signal obtained by combining the original phase spectrum and the modified magnitude spectrum has a subjective quality similar to that of the cover signal. Theoretically, the resultant stego speech is expected to be perceptually indistinguishable from the cover speech since the pertinent low-frequency components remain intact.
Potential applications of our speech hiding scheme include the reduction of speech storage and transmission overhead in electronic voice mail applications and audio streaming, speech translation, data communication secrecy, and many other web-based applications.
Objectives
Our objective is to develop a high-performance speech steganography system. The design of such a system consists principally of optimizing the following attributes:
The hiding capacity, defined by the amount of the secret information (speech, text, or image) to be hidden in the cover speech signal.
The impact of the hiding process on the cover speech quality. We hope to produce a stego signal that is perceptually indistinguishable from the cover signal.
The complexity of the steganography system. Our aim is to render the steganalysis (the attempt to discover the existence of the secret message from the stego signal) by the opponent more complex.
The accuracy with which the hidden message can be recovered at the receiver. Efficient techniques are to be developed to minimize the impact of the compression on the stego signal.
We choose a speech signal as the secret information to be hidden in the cover speech. Since our objective in the discrete wavelet transform-fast Fourier transform (DWT-FFT)-based hiding approach is secrecy, we propose to hide the secret information within the high-frequency wavelet components.
The rest of the article is organized as follows: in the following section, we introduce our DWT-FFT-based approach to the steganography task. Section “Secret speech parameterization” describes the secret speech analysis, including the linear predictive coding (LPC) analysis and the line spectral frequencies (LSF) extraction procedure. In Section “Speech hiding algorithm”, we describe the speech hiding algorithm, together with the general steps to retrieve the secret speech signal. The speech database used for our simulations, the parameters of our experiments, and the evaluation and discussion of the results of the proposed DWT-FFT hiding approach are presented in Section “Evaluation”. Finally, we conclude and suggest directions for further research in Section “Conclusions”.
DWT-FFT-based approach
Speech DWT
The wavelet transform can be considered as transforming the signal from the time domain to the wavelet domain. This new domain contains more complicated basis functions called wavelets, mother wavelets, or analyzing wavelets [8]. The fundamental idea behind wavelets is to analyze according to scale. Any signal can then be represented by translated and scaled versions of the mother wavelet. Wavelet analysis is capable of revealing aspects of data that other signal analysis techniques miss, such as trends, breakdown points, discontinuities in higher derivatives, and self-similarity.
The basic idea of the DWT for one-dimensional signals is briefly described here. Wavelet analysis splits a signal into two parts, usually the high- and the low-frequency parts. This process is called decomposition. The edge components of the signal are largely confined to the high-frequency part. The signal is passed through a series of high-pass filters to analyze the high frequencies, and through a series of low-pass filters to analyze the low frequencies. Filters of different cutoff frequencies are used to analyze the signal at different resolutions [9, 10].
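The DWT involves choosing scales and positions based on powers of two, the so-called dyadic scales and positions. The mother wavelet is rescaled by powers of two and translated by integers. Specifically, a function f(t) ∈ L^{2}(R) (the space of square-integrable functions) can be represented as:
$f\left(t\right)=\sum_{k\in Z}a\left(L,k\right)\sqrt{{2}^{-L}}\varphi \left({2}^{-L}t-k\right)+\sum_{j\le L}\sum_{k\in Z}d\left(j,k\right)\sqrt{{2}^{-j}}\psi \left({2}^{-j}t-k\right)$
The function ψ(t) is known as the mother wavelet, while ϕ(t) is known as the scaling function. The set of functions $\left\{\sqrt{{2}^{-L}}\varphi \left({2}^{-L}t-k\right),\sqrt{{2}^{-j}}\psi \left({2}^{-j}t-k\right)\mid j\le L,\ j,k,L\in Z\right\}$, where Z is the set of integers, is an orthonormal basis for L^{2}(R). The numbers a(L, k) are known as the approximation coefficients at scale L, while d(j, k) are identified as the detail coefficients at scale j. Because the basis is orthonormal, the approximation and detail coefficients can be expressed, respectively, as the inner products:
$a\left(L,k\right)=\sqrt{{2}^{-L}}\int_{-\infty }^{\infty }f\left(t\right)\varphi \left({2}^{-L}t-k\right)dt$ and $d\left(j,k\right)=\sqrt{{2}^{-j}}\int_{-\infty }^{\infty }f\left(t\right)\psi \left({2}^{-j}t-k\right)dt$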
For a better understanding of the above coefficients, let us consider a projection f_{ l }(t) of the function f(t) that provides the best approximation (in the sense of minimum error energy) to f(t) at a scale l. This projection can be constructed from the approximation coefficients a(l, k), using the equation: $f_{l}\left(t\right)=\sum_{k\in Z}a\left(l,k\right)\sqrt{{2}^{-l}}\varphi \left({2}^{-l}t-k\right)$
As the scale l decreases, the approximation becomes finer, converging to f(t) as l → 0. The difference between the approximation at scale l + 1 and that at l, f_{l+1}(t) − f_{ l }(t), is totally defined by the coefficients d(j, k) using the equation of decomposition and can mathematically be expressed as follows:
The coefficients a(L, k) and {d(j, k) | j ≤ L} are useful for building the approximation at any scale. Hence, the wavelet transform breaks the signal up into a coarse approximation f_{ L }(t) (given by a(L, k)) and a number of layers of detail {f_{j+1}(t) − f_{ j }(t) | j < L} (given by {d(j, k) | j ≤ L}). As each layer of detail is added, the approximation at the next higher scale is achieved. The original signal can be reconstructed using the inverse DWT (IDWT), following the above procedures in the reverse order. The synthesis starts with the approximation and detail coefficients cA_{ j } and cD_{ j }, and then reconstructs cA_{j−1} by upsampling and filtering with the reconstruction filters [11, 12].
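The decomposition and reconstruction steps described above can be illustrated with a one-level Haar transform, the simplest wavelet family. This is a minimal sketch in plain Python: the function names `haar_dwt` and `haar_idwt` and the sample frame are assumptions made for illustration, and a practical system would rely on a wavelet library and possibly a different wavelet family.

```python
import math


def haar_dwt(x):
    """One-level Haar analysis: returns (cA, cD), each half the input length."""
    if len(x) % 2:
        raise ValueError("an even number of samples is expected")
    r = math.sqrt(2.0)
    # Low-pass branch (pairwise sums) gives the approximation coefficients.
    cA = [(x[2 * i] + x[2 * i + 1]) / r for i in range(len(x) // 2)]
    # High-pass branch (pairwise differences) gives the detail coefficients.
    cD = [(x[2 * i] - x[2 * i + 1]) / r for i in range(len(x) // 2)]
    return cA, cD


def haar_idwt(cA, cD):
    """One-level Haar synthesis: upsample and recombine the two branches."""
    r = math.sqrt(2.0)
    x = []
    for a, d in zip(cA, cD):
        x.append((a + d) / r)
        x.append((a - d) / r)
    return x


frame = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
cA, cD = haar_dwt(frame)
rebuilt = haar_idwt(cA, cD)
```

Because the Haar basis is orthonormal, the energy of `frame` equals the combined energy of `cA` and `cD`, and `rebuilt` matches `frame` up to rounding error.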
Speech Fourier transform
Since speech is processed on a timeframe basis, the speech spectrum is evaluated using the DFT. The DFT of a signal s(n) defined for 0 ≤ n ≤ M − 1 is given by
$S\left(k\right)=\sum_{n=0}^{M-1}s\left(n\right){e}^{-j2\pi kn/M},\quad 0\le k\le M-1$
In general, S(k) is a complex function of the variable k and can be expressed in polar coordinates as:
$S\left(k\right)=\left|S\left(k\right)\right|{e}^{j\varphi \left(k\right)}$
The sequence S(k) has the same number of elements as s(n). However, for a real-valued signal, the last M/2 elements of the DFT are conjugates of the first M/2 elements, in inverse order. Consequently, the magnitude spectrum |S(k)| is defined uniquely by the first M/2 frequency components since it satisfies the following symmetry:
$\left|S\left(k\right)\right|=\left|S\left(M-k\right)\right|,\quad 1\le k\le M-1$
This equation represents one of the DFT properties that must be maintained when hiding a message in the magnitudes. This feature is used in the fast Fourier transform (FFT) algorithm to reduce the DFT computational complexity [13]. For simplicity, we will adopt in the subsequent sections the following notations:
$S\left(k\right)=\mathrm{fft}\left(s\left(n\right)\right)$
and
$s\left(n\right)=\mathrm{ifft}\left(S\left(k\right)\right)$,
where fft calculates the DFT and ifft, the inverse FFT, calculates the inverse DFT.
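The magnitude symmetry that the hiding process must preserve can be checked numerically. The sketch below uses a direct O(M²) DFT written with the standard library rather than an FFT routine; the helper names `dft` and `idft` and the eight-sample frame are assumptions made for illustration.

```python
import cmath
import math


def dft(s):
    """Direct DFT: S(k) = sum_n s(n) * exp(-j*2*pi*k*n/M)."""
    M = len(s)
    return [sum(s[n] * cmath.exp(-2j * math.pi * k * n / M) for n in range(M))
            for k in range(M)]


def idft(S):
    """Direct inverse DFT, including the 1/M factor."""
    M = len(S)
    return [sum(S[k] * cmath.exp(2j * math.pi * k * n / M) for k in range(M)) / M
            for n in range(M)]


s = [0.5, 1.0, -0.25, 0.75, 0.0, -1.0, 0.3, 0.2]   # real-valued frame
M = len(s)
S = dft(s)
mag = [abs(v) for v in S]             # magnitude spectrum |S(k)|
phase = [cmath.phase(v) for v in S]   # phase spectrum

# For a real signal, |S(k)| = |S(M - k)| for k = 1, ..., M - 1.
symmetric = all(abs(mag[k] - mag[M - k]) < 1e-9 for k in range(1, M))

# Recombining magnitude and phase and inverting returns the original frame.
back = idft([m * cmath.exp(1j * p) for m, p in zip(mag, phase)])
```

Any modification of a magnitude bin k therefore has to be mirrored at bin M − k if the inverse transform is to remain real-valued.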
Speech spectrum characteristics
Speech is a baseband signal with most of the pertinent intelligibility-preserving frequency components confined to a bandwidth of 4 and 7 kHz for narrowband and wideband speech, respectively [14]. The distribution of the first three speech formants represents the primary cues to the English vowels. Most of the vowel energy is concentrated below 1 kHz and decays at about −6 dB/oct with frequency [15]. Figure 1 shows the wideband speech spectrum for both a liquid frame and an unvoiced fricative frame. In all vowels and most of the voiced consonants, the magnitude spectrum shows very weak components at high frequencies. Even though a few unvoiced fricative consonants, such as /s/, present large magnitudes at high frequencies, the intelligibility of the speech signal is negligibly affected if these frequency components are not modeled accurately [14]. Moreover, even for wideband unvoiced fricative consonants, frequencies above 7 kHz do not contribute considerably to the speech spectrum content. These two facts have motivated us to embed a separate signal in the low-amplitude high-frequency region of the cover signal.
Secret speech parameterization
Many factors require the parameterization of the secret speech message before the hiding process. Among these factors is the restricted number of hiding locations in the narrowband cover speech. Speech parameterization, also termed speech analysis, is widely used in research areas such as automatic speech recognition and speech coding. In speech coding, the original signal is subjected to a speech analysis algorithm to extract the pertinent speech parameters. To recreate a copy of the original signal, an inverse algorithm known as speech synthesis is used. Most speech analysis schemes are based on the human speech production model [15]. In this model, a speech signal is produced by the sequential excitation of two filters: a linear prediction (LP) filter, which models the vocal tract and reproduces the short-term correlation present in all types of speech, and a pitch filter, which represents the periodicity created by the vibration of the vocal cords in voiced segments. A basic diagram of the speech production model is shown in Figure 2. The LPC is based on this diagram. LPC schemes are commonly used in the field of speech coding. In transmission, for example, the speech frames are represented with a restricted number of parameters, which the receiver uses to reconstruct a synthetic-quality speech signal. The speech analysis algorithm is based on two phases: an LP analysis to obtain p LP coefficients, a_{ i }(i = 1, …, p), and a pitch analysis to extract the pitch gain g and the pitch delay d. The LP filter and the pitch filter are constructed using the LP parameters and the pitch parameters, respectively. In the LPC model, only the LP filter is used for unvoiced speech, since there is no periodicity in this class of speech; the pitch filter is added for voiced frames. Details about the speech analysis procedure are given in [16].
The LP coefficients (LPC) must be transformed to a more robust representation before any processing, since they are very susceptible to errors and their direct quantization might produce an unstable LP filter. One of the most widely used representations is the LSF [17]. In this study, we adopted this representation: in the hiding process, p magnitude locations are replaced by the p LSF coefficients of the secret speech.
Secret speech analysis
To perform the secret speech analysis, we will use the LP speech production model. In this model, the speech signal is subject to an LP analysis followed by pitch analysis.
LP analysis
The LP analysis is performed every L ms (that is, on M = L × Fs samples for a sampling frequency of Fs kHz) to extract p LP coefficients. These coefficients represent the vocal-tract poles (or formants). To smooth the interframe variation of the spectral parameters, the analysis window contains more samples than the analysis frame. In addition to the current speech frame, the analysis window contains 5 ms of past speech and 5 ms of future speech. In the LP analysis, we adopt a tapered rectangular window with three parts [18]. The first part is the first half of a Hamming window, the second part is a rectangular window, and the third part is the second half of a Hamming window. This window produces a narrower main lobe than the asymmetric window used in the G.729 and G.722.2 codec standards.
The existence of a short-term correlation in speech signals motivates us to adopt the LP analysis. This correlation makes it possible to predict a speech sample s_{2}(n) at time n from its previous p samples s_{2}(n − i). For each speech frame, a 10th-order predictor (p = 10) is applied to the windowed speech, s_{2}(n), to estimate the spectral envelope. The predicted signal ŝ_{2}(n) is given by
The LP coefficients a_{ i }(i = 1, …, p) are estimated by minimizing (with the autocorrelation method) the error between the windowed sample s_{2}(n) and the predicted sample ŝ_{2}(n). Since the pitch and excitation analysis phases are completed in a closed-loop manner, the LP synthesis filter is required in order to reduce the error between the original speech and the synthesized speech candidates. The LP synthesis filter in the z-domain, H(z), is related to the LPC vector by
The filter H(z) is represented in the time domain by the impulse response function h(n).
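The autocorrelation method mentioned above is commonly solved with the Levinson-Durbin recursion. The sketch below is an illustration under stated assumptions: it takes the predictor in the form ŝ(n) = Σ a_i s(n − i), which is one common sign convention and may differ from the one used in this article; it omits the tapered analysis window; and the function names and the synthetic test signal are hypothetical.

```python
def autocorr(x, p):
    """Autocorrelation values r(0), ..., r(p) of one (already windowed) frame."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(p + 1)]


def levinson(r, p):
    """Levinson-Durbin recursion.

    Returns ([a_1, ..., a_p], residual_energy) for the ASSUMED predictor
    s_hat(n) = sum_i a_i * s(n - i).
    """
    a = [0.0] * (p + 1)          # a[0] is unused
    err = r[0]
    for m in range(1, p + 1):
        # Reflection coefficient of order m.
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= (1.0 - k * k)     # prediction error energy shrinks at each order
    return a[1:], err


# Synthetic first-order signal s(n) = 0.9 * s(n - 1): the estimated predictor
# should be close to a_1 = 0.9 with a negligible second coefficient.
signal = [0.9 ** n for n in range(400)]
r = autocorr(signal, 2)
coeffs, residual = levinson(r, 2)
```

For a 10th-order analysis as described in the text, the same functions would be called with `p = 10` on each windowed frame.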
Pitch analysis
Due to the vibration of the vocal cords, voiced speech segments show some long-term correlation. The vibration frequency, named pitch, is reflected in the quasi-periodic behavior of the time-domain speech waveform. An autocorrelation scheme is used to calculate the pitch lag (the inverse of the pitch frequency). Since the LP analysis frame may contain more than one pitch period, the pitch analysis is performed on a subframe basis to extract one pitch gain and one pitch delay. One pitch gain and one pitch lag are then used to represent the periodicity in each speech frame [19]. In the pitch analysis algorithm, an open-loop analysis is first applied to each speech frame to estimate the pitch period. Open-loop pitch estimation is based on the weighted speech signal s_{ w }(n), which is obtained by filtering the input speech signal through the perceptual weighting filter; s_{ w }(n) is given by:
That is, in a frame of size L, the weighted speech is given by:
Residual excitation
After the long-term and short-term redundancies are removed, the signal e(n) has a noise-like shape with a flat spectrum. Figure 3 shows the residual signal after removing the long-term and short-term correlations. This signal can be modeled by a random signal. Since a random signal has no correlation, this residual is generated at the receiver side using a random signal generator. In this way, we reduce the amount of information to be hidden in the cover signal. As mentioned above, the speech analysis algorithm is based on two phases: an LP analysis to obtain p LP coefficients, a_{ i }(i = 1, …, p), and a pitch analysis phase to extract the pitch gain g and the pitch delay d. Table 1 shows the parameters of the LP model used for narrowband speech.
LP-model parameters adjustment
The spectral amplitudes must always be non-negative, since they result from the absolute value applied to the speech spectrum. Direct embedding of the LP coefficients C in the magnitude spectrum would severely distort the cover signal, since the LP parameters can have negative values. To overcome this problem, we propose to convert the LP coefficients C to one of their frequency representations, such as the LSF. As shown in the following relation, the LSF parameters w_{ i } are ordered and are all positive: $0<{w}_{1}<{w}_{2}<\cdots <{w}_{p}<\pi$
Since the pitch delay varies from 20 to 147 samples, direct embedding of the pitch delay in the cover speech spectrum would disturb the small-amplitude high-frequency components of the cover spectrum. Hence, the pitch delay is normalized by 147, the maximum pitch delay, before the hiding process. The normalized pitch delay then has a value between 0 and 1. For this reason, the best location to hide these parameters is the last cover speech spectrum location, since the amplitude of this last component is very small.
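The pitch-lag search and the normalization by 147 can be sketched as follows. This is a simplified illustration: it runs a plain autocorrelation search directly on the frame and does not include the perceptual weighting filter or the subframe processing described earlier. The function name `pitch_lag` and the synthetic pulse-train frame are assumptions.

```python
MIN_LAG, MAX_LAG = 20, 147   # pitch delay range in samples, as stated in the text


def pitch_lag(frame, lo=MIN_LAG, hi=MAX_LAG):
    """Return the lag in [lo, hi] that maximizes the frame autocorrelation."""
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, hi + 1):
        val = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


# Synthetic voiced frame: one glottal pulse every 40 samples (200 Hz at 8 kHz).
frame = [1.0 if n % 40 == 0 else 0.0 for n in range(320)]

lag = pitch_lag(frame)
normalized = lag / MAX_LAG               # value embedded in the cover spectrum
restored = round(normalized * MAX_LAG)   # value recovered at the receiver
```

After normalization the embedded value stays at or below 1, so it remains small compared with the low-frequency magnitudes of the cover spectrum.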
LSF cues
Itakura [20] proposed the LSF as a representation of the LPC. LSFs have been shown to possess several advantageous properties, such as bounded range, sequential ordering, and ease of stability verification [21]. Moreover, the LSF coefficients facilitate the integration of properties of the human perception system in the frequency-domain representation. ITU-T Recommendation G.723.1 describes the extraction of the LSF parameters for cases where the LPC parameters need to be converted to LSFs [22]. In LPC, the mean squared error between the original and the predicted speech is minimized over a short time interval to produce a unique set of LP coefficients. The transfer function of the LPC filter is given by
where P is the prediction order, G is the gain, and a_{ k } are the LPC filter coefficients. The poles of this transfer function contain the poles of the vocal tract as well as those of the voice source. Solving for the roots of the denominator of the transfer function gives both the formant frequencies and the poles corresponding to the voice source. Two polynomials, Q_{p+1}(z) and P_{p+1}(z), called the difference and sum polynomials, respectively, can be derived from H(z). The difference polynomial is given by:
and the sum polynomial is given by:
where A_{ p }(z) is the denominator of H(z). The polynomials contain trivial zeros for even values of p at z = − 1 and at z = 1. These roots can be removed in order to obtain the following quantities:
and
The LSFs are the roots of $\widehat{Q}\left(z\right)$ and $\widehat{P}\left(z\right)$ and alternate with each other on the unit circle. Note that Q_{p+1}(z) is an antisymmetric polynomial and P_{p+1}(z) is a symmetric polynomial. The polynomials $\widehat{Q}\left(z\right)$ and $\widehat{P}\left(z\right)$ derived from Q_{p+1}(z) and P_{p+1}(z) are symmetric. Therefore, for even values of p we can derive the following property:
Consequently (20) and (21) can be written as follows:
and
By putting z = e^{jw}, so that z + z^{−1} = 2 cos (w), we obtain the equations to be solved in order to find the LSFs according to the real-root scheme of ITU-T Recommendation G.723.1:
and
The input speech is segmented into frames, and each frame is subdivided into four subframes. The LPC analysis is performed on these subframes. The conversion of the p LPC coefficients into their p corresponding LSFs is performed in the last subframe. For the other three subframes, the LSFs are obtained by linear interpolation between the LSFs of the current and the previous frame.
To find the LSFs, the unit circle is divided into 512 equal intervals, each of length π/256. The roots (LSFs) of the Q(z) and P(z) polynomials are searched along the unit circle from 0 to π. A linear interpolation is performed on intervals where a sign change is observed in order to find the zeros of the polynomials. According to [20], if a sign change appears between intervals l and l − 1, a first-order interpolation is executed as follows:
where $\widehat{l}$ is the interpolated solution index and P(z)_{ l } is the absolute magnitude of the result of the sum polynomial evaluation at interval l (similarly for l − 1). Since the LSFs are interlaced in the region from 0 to π, only one zero is evaluated on P(z) at each step. The search for the next solution is performed by evaluating the difference polynomial Q(z), starting from the current solution [23, 24]. Two main reasons motivated our choice of the LSF representation. The first is that LP coefficients are very sensitive to errors: the direct quantization of these coefficients might produce an unstable LP filter. The second is that LSFs are widely used in conventional coding schemes, which avoids the incorporation of new parameters that might require significant and costly modifications to current devices and codecs.
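The grid search with sign-change detection and first-order interpolation can be sketched as below. Several assumptions are made and should be read as such: the inverse filter is taken as A(z) = 1 − Σ a_i z^{−i}, which may differ in sign from the convention of this article; the sum and difference polynomials are evaluated through equivalent real-valued cosine and sine sums; each polynomial is scanned over the whole grid instead of alternating between P and Q as in G.723.1; and the grid endpoints are skipped to avoid the trivial zeros at 0 and π. The function name `lsf_from_lpc` is hypothetical.

```python
import math


def lsf_from_lpc(a, grid=256):
    """LSFs in radians, assuming A(z) = 1 - sum_i a[i-1] * z**-i with even order."""
    p = len(a)
    N = p + 1
    A = [1.0] + [-ai for ai in a] + [0.0]   # A(z) padded to degree p + 1
    rev = A[::-1]                            # coefficients of z^-(p+1) * A(1/z)
    P = [x + y for x, y in zip(A, rev)]      # sum polynomial (symmetric)
    Q = [x - y for x, y in zip(A, rev)]      # difference polynomial (antisymmetric)

    def f_sum(w):
        # Real function with the same zeros as P on the unit circle.
        return sum(c * math.cos((k - N / 2) * w) for k, c in enumerate(P))

    def f_diff(w):
        # Real function with the same zeros as Q on the unit circle.
        return sum(c * math.sin((k - N / 2) * w) for k, c in enumerate(Q))

    step = math.pi / grid                    # 512 intervals on the full circle
    roots = []
    for f in (f_sum, f_diff):
        prev = f(step)
        for l in range(2, grid):             # interior grid points only
            cur = f(l * step)
            if cur == 0.0:
                roots.append(l * step)
            elif prev * cur < 0.0:
                # First-order interpolation between grid points l - 1 and l.
                frac = abs(prev) / (abs(prev) + abs(cur))
                roots.append((l - 1 + frac) * step)
            prev = cur
    return sorted(roots)


# Second-order example with poles at 0.9 * exp(+/- j*pi/4).
a1 = 2 * 0.9 * math.cos(math.pi / 4)
a2 = -0.81
lsf = lsf_from_lpc([a1, a2])
```

For this second-order example the two roots have closed forms, acos((1 + a1 + a2)/2) for the sum polynomial and acos((a1 − a2 − 1)/2) for the difference polynomial, which gives a direct check of the interpolated values.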
Speech hiding algorithm
We propose a new method for speech signal steganography in which the secret speech signal is embedded into the coefficients in the wavelet domain. The DWT decomposes the cover speech signal into low- and high-frequency components. For speech signals, the low-frequency component is the most significant part for speech perception. The high-frequency component, on the other hand, imparts flavor or nuance (noise) to the signal. Consider the human voice. If we remove the high-frequency components, the voice sounds different, but we can still tell what is being said. However, if we remove a sufficient amount of the low-frequency components, we hear gibberish and cannot understand what is being said. For this reason, we decided to hide information in the high-frequency part of the wavelet domain. Furthermore, in wavelet analysis, we can divide the speech signal into approximations and details. The approximations are the high-scale, low-frequency components of the signal. The details are the low-scale, high-frequency components. As shown in Figure 4, after passing through two complementary filters, two signals emerge from the original signal.
A variety of wavelets can be used depending on the expected results. Each family of wavelets (such as the Haar or Daubechies family) contains wavelet subclasses distinguished by the number of filter coefficients and the level of iteration. In steganography, whatever the algorithm used for hiding data, we need to reconstruct the speech signal after embedding the message in the original signal. After that, a performance measure can be used to compare the original speech signal and the stego speech. In our method, after using the DWT to decompose the speech signal for hiding a secret speech message, we use the IDWT to reconstruct the signal. The speech-in-speech hiding algorithm is illustrated in Figure 5. Both the secret and the cover speech must be preprocessed in order to facilitate the hiding process. The cover speech is partitioned into frames of L ms. The DFT of each timeframe s_{1}(m), defined for 0 ≤ m ≤ M − 1, is computed using the DWT-FFT method. The obtained speech spectrum is decomposed into magnitude and phase spectra. Each L ms of the secret message s_{2}(m) is embedded in the low-amplitude high-frequency region of the magnitude spectrum of the cover signal.
Secret speech hiding
In order to hide the secret speech, the DWT is applied to the cover speech frame to separate the high- and low-frequency regions. Then the FFT is applied to the high-frequency wavelet part, producing a spectrum S_{1}(k) (k = 0, …, M − 1). The spectrum is decomposed into the magnitude spectrum |S_{1}(k)| and the phase spectrum ϕ_{1}(k).
The magnitude spectrum is symmetric. The hiding process consists of replacing the L last elements of the first half of |S_{1}(k)| by the LP parameters V_{2} of the secret speech s_{2}(m).
The resulting magnitude spectrum, denoted by |S_{3}(k)|, is defined by the following expressions:
The third right-hand term in the above equation is included to preserve the DFT symmetry. These modifications lead to a new speech signal s_{3}. Its spectrum is a simple combination of the magnitude spectrum |S_{3}(k)| and the cover phase spectrum ϕ_{1}(k),
The timeframe composite (stego) signal s_{3}(m), m = 0, …, M − 1, is obtained by the IDWT,
The stego signal s_{3}(m) is a composite signal since it contains the L ms of cover speech s_{1}(m) and the L ms of secret signal s_{2}(m).
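The magnitude substitution at the core of the hiding step can be sketched as follows. To stay short, the example operates on the DFT of a plain frame and leaves out the DWT stage, the energy normalization, and the pitch parameters, so it illustrates the principle only and is not the complete algorithm. The names `embed` and `extract`, the decaying test frame, and the parameter values are assumptions.

```python
import cmath
import math


def dft(s):
    M = len(s)
    return [sum(s[n] * cmath.exp(-2j * math.pi * k * n / M) for n in range(M))
            for k in range(M)]


def idft(S):
    M = len(S)
    return [sum(S[k] * cmath.exp(2j * math.pi * k * n / M) for k in range(M)) / M
            for n in range(M)]


def embed(cover_frame, params):
    """Write positive parameters into the last bins of the first half-spectrum."""
    M, L = len(cover_frame), len(params)
    S1 = dft(cover_frame)
    mag = [abs(v) for v in S1]
    phase = [cmath.phase(v) for v in S1]   # cover phase is left untouched
    for i, value in enumerate(params):
        k = M // 2 - L + i                 # last L bins of the first half
        mag[k] = value
        mag[M - k] = value                 # mirrored bin preserves DFT symmetry
    S3 = [m * cmath.exp(1j * ph) for m, ph in zip(mag, phase)]
    return [v.real for v in idft(S3)]      # imaginary parts are rounding noise


def extract(stego_frame, L):
    """Read the parameters back from the same magnitude locations."""
    M = len(stego_frame)
    mag = [abs(v) for v in dft(stego_frame)]
    return mag[M // 2 - L: M // 2]


cover = [0.8 ** n for n in range(32)]      # simple low-pass test frame
secret = [0.10, 0.25, 0.40, 0.55]          # e.g. positive, LSF-like values
stego = embed(cover, secret)
recovered = extract(stego, len(secret))
```

The parameters must be non-negative for this to work, which is consistent with the choice of LSFs rather than raw LP coefficients.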
Energy normalization
In order to improve the speech quality, we preserved the speech energy by normalizing all the hidden parameters by the total energy of the original spectrum magnitudes. However, the energy preservation requires the hiding of the energy as side information. At the receiver, this energy will be used to rescale the hidden information to its original values. The scaling coefficient a is given by
where E_{ c } is the energy of the cover speech spectrum and E_{LSF} is the energy of the LSF vector.
Secret speech reconstruction
The secret speech is reconstructed from the stego speech by following the hiding algorithm in reverse order. Figure 6 illustrates the steps followed to extract the hidden information and reconstruct the secret speech message. The first step consists of performing the DWT. The high-frequency part obtained with the DWT is then transformed by the FFT to its corresponding spectrum, from which the magnitude spectrum is obtained. The secret speech parameters are extracted from the same locations in which they were embedded in the spectral magnitude of the stego speech signal. The LSF vector is converted back to a pth-order LPC vector (a_{1}, …, a_{ p }) to build the LP synthesis filter H(z).
A random excitation signal e(n) is applied to the cascade of the pitch and LP synthesis filters. The signal ŝ(n), at the output of the LP synthesis filter, is a reproduction of the original secret message s(n). Since the LPC-model parameter values extracted from the stego speech are approximately equal to the embedded parameters, the reconstructed secret speech signal is essentially unaffected by the hiding process. The minor degradations noticed in this signal, when compared with the original secret signal, result from the LPC model and the LSF conversion.
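The reconstruction stage can be illustrated with the following Python sketch. It is an assumption-laden example rather than the authors' code: it uses NumPy, assumes an even prediction order, adopts the convention A(z) = 1 + a_{1}z^{−1} + … + a_{P}z^{−P} with H(z) = 1/A(z), and omits the pitch synthesis filter. The function names `lsf_to_lpc` and `lp_synthesize` are hypothetical.

```python
import numpy as np

def lsf_to_lpc(lsf):
    """Convert sorted LSFs (radians, even count) to LPC coefficients [1, a_1, ..., a_P]."""
    lsf = np.asarray(lsf, dtype=float)
    if len(lsf) % 2 != 0:
        raise ValueError("this sketch assumes an even prediction order")
    p_poly = np.array([1.0, 1.0])    # symmetric polynomial, fixed root at z = -1
    q_poly = np.array([1.0, -1.0])   # antisymmetric polynomial, fixed root at z = +1
    for i, w in enumerate(lsf):
        factor = np.array([1.0, -2.0 * np.cos(w), 1.0])  # conjugate root pair on the unit circle
        if i % 2 == 0:
            p_poly = np.convolve(p_poly, factor)   # 1st, 3rd, ... LSFs belong to P(z)
        else:
            q_poly = np.convolve(q_poly, factor)   # 2nd, 4th, ... LSFs belong to Q(z)
    a = 0.5 * (p_poly + q_poly)      # A(z) = (P(z) + Q(z)) / 2
    return a[:-1]                    # the highest-order term cancels to zero

def lp_synthesize(a, excitation):
    """All-pole synthesis: s(n) = e(n) - sum_k a_k * s(n - k)."""
    order = len(a) - 1
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, min(order, n) + 1):
            acc -= a[k] * out[n - k]
        out[n] = acc
    return out

# Example: rebuild a frame from two LSFs and a random excitation.
lpc = lsf_to_lpc([np.arccos(0.7), np.arccos(0.2)])
frame = lp_synthesize(lpc, np.random.default_rng(0).standard_normal(160))
```

A second-order example keeps the arithmetic easy to check by hand: the LSF pair (arccos 0.7, arccos 0.2) corresponds to A(z) = 1 − 0.9z^{−1} + 0.5z^{−2}, whose roots lie inside the unit circle, so the synthesis filter is stable.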
Evaluation
Experimental setup
To evaluate the performance of the proposed hiding technique, we conducted several simulations using the NOIZEUS database [25–27]. This corpus contains 30 sentences from the IEEE sentence database, recorded in a sound-proof booth using Tucker Davis Technologies recording equipment. The sentences are produced by three male and three female speakers. The 30 sentences (15 spoken by male speakers and 15 by female speakers) include all phonemes in the American English language. The sentences were originally sampled at 25 kHz and downsampled to 8 kHz. The length of the speech file varies between 0.02 and 0.03 ms. In the comparative evaluation, we conducted four sets of tests. In the first set, we embedded each of the 15 male speech files in each of the 15 female speech files. In the second set, we hid each of the 15 female speech files in each of the 15 male speech files. In the third set, we embedded each of the 15 male speech signals in the remaining 14 male speech files. In the last set, we hid each of the 15 female speech signals in the remaining female speech files. Each set was repeated for five different wavelet families (Haar, Daubechies, Symlets, Coiflets, and BiorSplines). In total, we conducted 4,210 computer simulations ((15*15*2 + 14*14*2)*5).
In order to evaluate the impact of the DWT-FFT technique, we conducted two different comparative experiments, first using the DWT-FFT method and then using the FFT only.
Evaluation outcomes
One of the performance measures of any steganographic system is the comparison between the cover and the stego signals. In this study, we used subjective and objective performance measures. For the subjective measures, we conducted several informal comparative listening tests. In these tests, we played the cover speech s_{1}(m) and the stego signal s_{3}(m) in a random order to several listeners. Each listener had to identify the better-quality speech file among the cover and the stego signals. The majority of listeners could not distinguish between the two speech files. As objective measures, we used the segmental signal-to-noise ratio (SegSNR) and the perceptual evaluation of speech quality (PESQ). PESQ measurement provides an objective and automated method for speech quality assessment. The SegSNR is defined by
where s_{1} and s_{3} are the cover and the stego speech files, respectively. In this study, we segmented the speech files into frames of 20 ms (L = 20), or 160 samples (M = 160). In Table 2, we present the average SegSNR values for each of the four sets of tests using the DWT-FFT algorithm. In Table 3, we present the average SegSNR of the same sets of tests using the FFT only. The quality of the stego signal produced by the FFT is better than the one produced by the DWT-FFT. However, the DWT-FFT increases the robustness of the hiding algorithm against steganalysis techniques.
We also compared the impact of several existing wavelets on the speech quality. All wavelets were applied with a one-level decomposition. Table 4 shows the results of the different wavelets for the four sets of tests. The SegSNR values are nearly the same across wavelets; therefore, the method does not depend on a particular type of wavelet.
The SegSNR is only an indicative performance measure. The PESQ is a more reliable method to assess the performance of our hiding technique. The PESQ algorithm predicts the subjective opinion score of a degraded speech sample. In general, the PESQ returns a score from −0.5 to 4.5, with higher scores signifying better quality [28, 29]. The PESQ method is used in our experiments to evaluate the stego speech. The reference signal is the original (cover) signal and the degraded signal is the stego signal carrying the hidden secret message. In Table 5, we present the average PESQ values for male and female speakers obtained by the two hiding techniques (using the DWT-FFT and the FFT only). Figure 7 shows the variations of the PESQ for 20 speech signals under the two hiding approaches.
The hiding method achieves average PESQ scores of 3.68 and 4.14 for the DWT-FFT and FFT algorithms, respectively. Figure 8 shows the magnitude spectrum of the cover signal and that of the corresponding stego speech after hiding the LPC parameters of the secret signal. The PESQ analysis shows that the stego and cover speech provide similar subjective quality. This result is supported by the resemblance between the cover and stego speech spectrograms in Figure 9. The objective and subjective performance measures show that the proposed hiding technique attracts no suspicion about the existence of a hidden message in the stego speech, while allowing an intelligible copy of the original secret message to be recovered at the receiver side. Informal listening tests comparing the original and the reconstructed secret speech messages support the results of the objective performance measures. The reconstructed secret speech ŝ(n) (from both the DWT-FFT and FFT hiding approaches) remains completely comprehensible, even though some perceptual distortions are noticeable. What concerns us is the speech intelligibility, since the objective is to convey the secret message to the intended receiver. Table 6 shows the impact of the hiding algorithms on the secret speech in terms of the SegSNR.
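For readers who wish to reproduce the SegSNR figures, the measure can be sketched as follows. This example assumes the commonly used definition, namely the mean over frames of the per-frame SNR in decibels between the cover signal and the cover-minus-stego difference; it is not necessarily the exact expression used in this study. The function name `segmental_snr` and the small constant `eps`, which avoids division by zero on identical frames, are illustrative choices.

```python
import numpy as np

def segmental_snr(cover, stego, frame_len=160, eps=1e-12):
    """Mean per-frame SNR (dB) of the cover signal relative to the embedding distortion."""
    cover = np.asarray(cover, dtype=float)
    stego = np.asarray(stego, dtype=float)
    n_frames = min(len(cover), len(stego)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    frame_snrs = []
    for i in range(n_frames):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        signal_energy = np.sum(cover[seg] ** 2)
        noise_energy = np.sum((cover[seg] - stego[seg]) ** 2)
        frame_snrs.append(10.0 * np.log10((signal_energy + eps) / (noise_energy + eps)))
    return float(np.mean(frame_snrs))

# Two 160-sample frames (20 ms each at 8 kHz) with a constant offset of 0.1.
cover = np.ones(320)
stego = cover + 0.1
print(round(segmental_snr(cover, stego), 2))  # 20.0
```

The default frame length of 160 samples matches the 20-ms frames at 8 kHz described above. Many implementations additionally clamp each frame's SNR to a fixed range before averaging; this sketch does not.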
Conclusions
In this article, we presented a new steganography system for secrecy applications. The proposed hiding method produces stego speech files that are indistinguishable from their equivalent cover speech files. Moreover, our aim is to render steganalysis (the attempt by an opponent to extract the secret message from the stego signal) more complex, so that an eavesdropper would find it difficult to extract the hidden information even after suspecting the existence of a secret message. Our method first isolates the high-frequency components using a DWT, then exploits the low-pass spectral properties of the speech magnitude spectrum to hide another speech signal in the low-amplitude high-frequency region of the cover speech signal. Experimental simulations on both female and male speakers showed that our approach is capable of producing a stego speech that is indistinguishable from the cover speech. The receiver is still able to recover an intelligible copy of the secret speech message. In future work, we will endeavor to extend our approach to applications involving Voice over IP speech secrecy, which requires compressing the stego speech before transmission. This raises the issue of preserving the secret speech after decoding the compressed stego speech.
Abbreviations
DFT: Discrete Fourier transform
DWT: Discrete wavelet transform
DWT-FFT: Discrete wavelet transform-fast Fourier transform
FFT: Fast Fourier transform
IDWT: Inverse discrete wavelet transform
IFFT: Inverse fast Fourier transform
LP: Linear prediction
LPC: Linear predictive coding
LSB: Least significant bit
LSF: Line spectral frequencies
PESQ: Perceptual evaluation of speech quality
SegSNR: Segmental signal-to-noise ratio
SS: Spread spectrum
References
 1.
Kahn D: The History of Steganography. Lecture Notes in Computer Science. 1174 edition. Springer, New York; 1996:11023.
 2.
Johnson NF, Jajodia S: Exploring steganography: seeing the unseen. IEEE Comput. 1998, 31(2):26-34.
 3.
Sridevi R, Damodaram A, Narasimham SVL: Efficient method of audio steganography by modified LSB algorithm and strong encryption key with enhanced security. J. Theor. Appl. Inf. Technol. 2009, 5(6):768-771.
 4.
Bender W, Gruhl D, Morimoto N: Techniques for data hiding. IBM Syst. J. 1996, 35(3):313-336.
 5.
Kirovski D, Malvar H: Spread-spectrum watermarking of audio signals. IEEE Trans. Signal Process. 2003, 51(4):1020-1033. 10.1109/TSP.2003.809384
 6.
Huang D, Yeo T: Robust and Inaudible Multi-Echo Audio Watermarking, in Proceedings of the Third IEEE Pacific-Rim Conference on Multimedia, Advances in Multimedia Information Processing. Taipei, China; 2002:615-622.
 7.
Shirali-Shahreza S, Shirali-Shahreza M: Steganography in Silence Intervals of Speech, Proceedings of the Fourth IEEE International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2008). Harbin, China; August 15–17, 2008:605-607.
 8.
Misiti M, Misiti Y, Oppenheim G, Poggi JM: Matlab Wavelet Toolbox (Version 4.0): Tutorial and Reference Guide. The MathWorks, Natick, USA; January 2007.
 9.
Lin B, Nguyen B, Olsen ET: Orthogonal Wavelets and Signal Processing. In Signal Processing Methods for Audio, Images and Telecommunications. Edited by: Clarkson PM, Stark H. Academic, London; 1995:1-70.
 10.
Mallat S: A Wavelet Tour of Signal Processing. Academic, San Diego, CA; 1998.
 11.
Nievergelt Y: Wavelets Made Easy. Birkhäuser, Boston; 1999.
 12.
Ooi J, Viswanathan V: Applications of Wavelets to Speech Processing. In Modern Methods of Speech Processing. Edited by: Ramachandran RP, Mammone R. Kluwer Academic Publishers, Boston; 1995:449-464.
 13.
Elliott DF, Rao KR: Fast Transforms: Algorithms, Analyses, Applications. Academic, New York; 1982.
 14.
Andreas S, Ed PT, Venkatraman A: Audio Signal Processing and Coding. Wiley-Interscience Publication, USA; 2006. ISBN 9780471791478, TK5102.92.S73
 15.
Strange W, Edman TR, Jenkins JJ: Acoustic and phonological factors in vowel identification. J. Exp. Psychol. Hum. Percept. Perform. 1979, 5(4):643-656.
 16.
Espy-Wilson CY: Acoustic measures for linguistic features distinguishing the semivowels in American English. J. Acoust. Soc. Am. 1992, 92:736-757. 10.1121/1.403998
 17.
Childers DG, Hahn M, Larar JN: Silent and voiced/unvoiced/mixed excitation (four-way) classification of speech. IEEE Trans. ASSP 1989, 37(11):1771-1774. 10.1109/29.46561
 18.
O’Shaughnessy D: Speech Communications: Human and Machine. 2nd edition. Wiley-IEEE Press, New York, NY; 1999.
 19.
Makhoul J: Linear prediction: a tutorial review. Proc. IEEE 1975, 63(5):561-580.
 20.
Itakura F: Line spectrum representation of linear predictive coefficients of speech signals. J. Acoust. Soc. Am 1975, 57(1):S35. 10.1121/1.380398
 21.
Oppenheim AV, Schafer WR, Buck AJ: Discrete-Time Signal Processing. Prentice Hall, Upper Saddle River, NJ; 1999:468-471. ISBN 0137549202
 22.
Hess W: Pitch Determination of Speech Signals. Springer, Berlin; 1983.
 23.
Soong F, Juang B: Line spectrum pair (LSP) and speech data compression. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’84). San Diego, Calif, USA 9; March 1984:3740.
 24.
ITU-T: Recommendation G.723.1. Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s. 1996.
 25.
Hu Y, Loizou P: Subjective evaluation and comparison of speech enhancement algorithms. Speech Commun. 2007, 49:588-601. 10.1016/j.specom.2006.12.006
 26.
Hu Y, Loizou P: Evaluation of objective quality measures for speech enhancement. IEEE Trans. Speech Audio Process. 2008, 16(1):229-238.
 27.
Ma J, Hu Y, Loizou P: Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions. J. Acoust. Soc. Am. 2009, 125(5):3387-3405. 10.1121/1.3097493
 28.
ITU: Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs, ITU-T Recommendation P.862. 2000.
 29.
ITU-T Recommendation P.800: Methods for Subjective Determination of Speech Quality. International Telecommunication Union, Geneva; 2003.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Rekik, S., Guerchi, D., Selouani, S. et al. Speech steganography using wavelet and Fourier transforms. J AUDIO SPEECH MUSIC PROC. 2012, 20 (2012). https://doi.org/10.1186/1687-4722-2012-20
Keywords
 Audio steganography
 Discrete wavelet transform
 Fast Fourier transform
 Data hiding
 Speech steganography