In this section, we first introduce the notation and summarize the basic preprocessing steps of perceptual masking. We then present the mathematical formulation of each proposed feature.
5.1. Preprocessing
In the preprocessing step, time signals sampled at 16 kHz are divided into frames of 43 ms with 50% overlap; over such short frames the emotional audio signal behaves in a stationary manner, so its statistics can be modeled more reliably. Let s[kt, n] denote the time-domain signal, where n is the time-frame index and kt is the time-sample index. In order to apply the short-time Fourier transform (STFT), the windowed audio frame can be represented as
(5)
by using the Hann window
(6)
where NFT is the size of the DFT, which is equal to 2048 in this study.
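For concreteness, the framing, windowing, and DFT steps can be sketched in Python as follows; the 688-sample frame length follows from the 43 ms frames at 16 kHz, while any PEAQ-specific window scaling is an implementation detail omitted here.

```python
import numpy as np

def stft_frames(signal, fs=16000, frame_ms=43, overlap=0.5, n_fft=2048):
    """Frame the signal with 50% overlap, apply a Hann window, and take the DFT.

    Minimal sketch of the framing described above; any PEAQ-specific window
    scaling is omitted.
    """
    frame_len = int(fs * frame_ms / 1000)             # 43 ms -> 688 samples at 16 kHz
    hop = int(frame_len * (1 - overlap))              # 50% overlap
    window = np.hanning(frame_len)                    # Hann window of Equation (6)
    n_frames = 1 + (len(signal) - frame_len) // hop   # assumes len(signal) >= frame_len
    spectra = np.empty((n_frames, n_fft // 2 + 1), dtype=complex)
    for n in range(n_frames):
        frame = signal[n * hop:n * hop + frame_len] * window
        spectra[n] = np.fft.rfft(frame, n=n_fft)      # zero-padded to NFT = 2048
    return spectra
```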
Successive frames of the time-domain signal are transformed to a basilar membrane representation based on the PEAQ psycho-acoustic model [24, 25]. Hence, each windowed frame is first transformed to the frequency domain by taking the STFT. Let the windowed and transformed audio frame be expressed as
(7)
where kf is the frequency bin index. In order to extract the perceptual components of the audio spectrum, a mapping reflecting the outer and middle ear frequency responses is applied to the spectral components, yielding the "Outer ear weighted DFT outputs" given as
(8)
The weighting function W[kf] shown in Equation (8) represents the effect of the ear canal and the middle ear frequency response [24, 25]. The outer and middle ear frequency response is formulated as
(9)
where kf denotes the frequency bin index.
Note that these weights enable us to filter the spectral components according to the human auditory system, because both the outer and middle ears act as band pass filters. Hence, the outer and middle ear transfer functions limit the ability to detect low-amplitude audio signals and affect the absolute threshold of hearing [24, 25]. According to the frequency response of the outer and middle ears, the absolute threshold of hearing tends to be lowest in the 2-3 kHz band and increases as the frequency moves above or below this band. Frequency components below 1 kHz are drastically attenuated, while components between 3 and 4 kHz are emphasized and perceived better. The frequency borders of the band pass filter range from 80 Hz to 18 kHz.
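As an illustrative sketch, the weighting step of Equation (8) can be implemented as below; the weighting curve used here is the standard PEAQ outer and middle ear response (ITU-R BS.1387), which we assume matches the form of Equation (9).

```python
import numpy as np

def outer_middle_ear_weighting(spectrum, fs=16000, n_fft=2048):
    """Weight DFT outputs by an outer and middle ear frequency response.

    The curve below is the standard PEAQ weighting (ITU-R BS.1387); the exact
    constants of Equation (9) are assumed to be of this form.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)       # bin center frequencies in Hz
    f_khz = np.maximum(freqs, 1e-3) / 1000.0         # avoid the singularity at DC
    w_db = (-0.6 * 3.64 * f_khz ** -0.8
            + 6.5 * np.exp(-0.6 * (f_khz - 3.3) ** 2)
            - 1e-3 * f_khz ** 3.6)
    weights = 10.0 ** (w_db / 20.0)                  # dB -> linear amplitude weights
    return spectrum * weights                        # "Outer ear weighted DFT outputs"
```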
Unlike conventional audio feature extraction modules, which mostly operate on the Mel scale, where speech content rather than emotion is modeled efficiently, we propose working on perceptual spectra derived on the Bark scale. Hence, a mapping from the frequency domain to the Bark scale is performed. This approximation leads to the notion of critical bands, also referred to as perceptual scales. The frequency bins of the attenuated spectral energy values are therefore grouped into z = 109 critical bands, as in the basic version of the PEAQ model. The attenuated spectral energy values are mapped from the frequency domain to the pitch (Bark) scale by the following approximation:
(10)
It can be seen from Equation (10) that the Bark scale frequency bands are almost linear below 1 kHz, while they grow exponentially above 1 kHz, yielding a perceptual filter bank.
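A possible implementation of the mapping and grouping step is sketched below; the Schroeder-style Hz-to-Bark approximation, the equal-width band edges, and the band limits are assumptions, since the exact form of Equation (10) is not reproduced here.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Schroeder-style Hz -> Bark approximation (assumed form of Equation (10))."""
    return 7.0 * np.arcsinh(np.asarray(f_hz, dtype=float) / 650.0)

def group_to_critical_bands(power_spectrum, fs=16000, n_fft=2048,
                            n_bands=109, f_min=80.0, f_max=8000.0):
    """Sum weighted spectral energies of one frame into Z = 109 Bark bands.

    The upper band limit is clipped to the Nyquist frequency of the 16 kHz
    material; equal-width Bark bands are an implementation assumption.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bark = hz_to_bark(freqs)
    edges = np.linspace(hz_to_bark(f_min), hz_to_bark(f_max), n_bands + 1)
    pitch_pattern = np.zeros(n_bands)
    for k in range(n_bands):
        in_band = (bark >= edges[k]) & (bark < edges[k + 1])
        pitch_pattern[k] = power_spectrum[in_band].sum()   # Pe[k, n] for this frame
    return pitch_pattern
```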
Let Fp[kf, n] = |Fe[kf, n]|² be the energy representation of the "Outer ear weighted DFT outputs" and let Pe[k, n] be its representation on the Bark scale. Note that the frequency index kf in Hz is replaced by k after mapping to the Bark scale. The energy components transformed to the Bark domain are convolved with a spreading function SdB(.) to simulate the dispersion of energy along the basilar membrane and to model the spectral masking effects in the Bark domain. The pitch patterns Pe[k, n] are smeared out over frequency using the level-dependent spreading function. Conventionally, the spreading function SdB(i, k, n, Pe) of band i for an energy component at band k is defined as a two-sided exponential
(11)
where Δz = i - k = 1/4 for the basic version of PEAQ. Smearing the spectral energy over frequency gives the frequency domain spreading output Es[k, n], which is called the "unsmeared excitation pattern" [24, 25],
(12)
where Bs[i, k, n] is a normalizing factor which is calculated for a reference level of 0 dB and can be pre-computed since it does not depend on the data.
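The spreading step can be sketched as follows; the two slopes are fixed here for simplicity, whereas the PEAQ spreading function of Equation (11) makes the upper slope level dependent and applies the normalization Bs[i, k, n].

```python
import numpy as np

def spread_over_frequency(pitch_pattern, dz=0.25,
                          lower_slope_db=27.0, upper_slope_db=24.0):
    """Smear Pe[k, n] along the Bark axis with a two-sided exponential.

    Simplified, level-independent sketch of Equations (11)-(12): the PEAQ
    upper slope is actually level and frequency dependent, and the
    normalization Bs[i, k, n] is not applied here.
    """
    n_bands = len(pitch_pattern)
    es = np.zeros(n_bands)                      # unsmeared excitation pattern Es[k, n]
    for i in range(n_bands):
        total = 0.0
        for k in range(n_bands):
            delta = (i - k) * dz                # signed distance in Bark
            slope = lower_slope_db if delta < 0 else upper_slope_db
            total += pitch_pattern[k] * 10.0 ** (-abs(delta) * slope / 10.0)
        es[i] = total
    return es
```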
The feature extraction process then continues with a time domain spreading that accounts for forward masking effects. Instead of the conventional time masking functions commonly used in audio compression, we prefer the one introduced in PEAQ, which enables us to track emotional variations across successive frames. Hence, to model forward masking, the energy levels in each critical band are smeared out over time according to Equation (13) as
(13)
where a is a time constant depending on the center frequency of each critical band. The excitation pattern, E[k, n], shown in Equation (13) is calculated as
(14)
where n is the actual frame number and k is the band index.
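A sketch of the forward-masking recursion is given below, following the first-order smoothing used in the PEAQ basic model; the time-constant formula and the hop size are assumptions consistent with the frame setup described above.

```python
import numpy as np

def time_domain_spreading(es_frames, center_freqs_hz, fs=16000, hop=344,
                          tau_min=0.008, tau_100=0.030):
    """Forward-masking smearing of the unsmeared excitation patterns over time.

    Sketch of the recursion behind Equations (13)-(14), following the PEAQ
    basic model: a first-order smoothing per band with a center-frequency
    dependent time constant a; the constants are assumptions.
    """
    tau = tau_min + (100.0 / np.asarray(center_freqs_hz)) * (tau_100 - tau_min)
    a = np.exp(-hop / (fs * tau))               # time constant per critical band
    smeared = np.zeros_like(es_frames)
    excitation = np.zeros_like(es_frames)
    for n in range(es_frames.shape[0]):
        prev = smeared[n - 1] if n > 0 else np.zeros(es_frames.shape[1])
        smeared[n] = a * prev + (1.0 - a) * es_frames[n]
        excitation[n] = np.maximum(smeared[n], es_frames[n])   # E[k, n]
    return excitation
```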
Briefly, we observed the perceptual effect of our method in terms of both the proposed features and the perceptual intermediate steps applied prior to feature extraction. In Figure 5, the impact of the preprocessing steps is shown comparatively for two audio records taken from VAM. Note that perceptual masking on the Bark scale highlights the emotional differences.
5.2. Perceptual features
The preprocessing stages detailed in the previous section are employed in the system in order to model the physiological and perceptual effects of the human ear. The preprocessing is followed by the feature extraction process. The proposed perceptual feature set consists of seven low-level descriptors and two statistical descriptors. These features are formulated in the following sections.
5.2.1. Average harmonic structure magnitude of the emotional difference
Our motivation for using the AHSM of the emotional difference as a representative feature is to highlight the harmonic structure of emotional speech, which resembles a periodic signal with stable harmonics much more closely than unemotional speech does. Since emotion does not change as fast as the phonemes, we prefer to use the correlation of the logarithmic power spectrum rather than the spectrum of the signal itself. Thus, AHSM emphasizes the harmonic pattern and reflects the variations in the fundamental frequency. AHSM is a useful feature for discriminating speech on the valence dimension; a sample of q2-q3 pairwise discrimination is shown in Figure 6.
The harmonic structure of the signal is evaluated on the linear frequency spectrum rather than the Bark scale, since a nonlinear frequency transformation would smear the harmonic structure. Extraction of the AHSM feature can briefly be summarized as follows. First, we compute the emotional differences for each critical band. Then, the autocorrelation function of the emotional differences across the critical bands is obtained. The fundamental frequency is estimated from the log-spectrum of the autocorrelation function. The average of the fundamental frequencies estimated over Y successive audio frames is reported as the AHSM. Conventionally, the fundamental frequency is estimated from the log-spectrum of the autocorrelation function of the audio signal itself [6]. Unlike these methods, we use the correlation of the emotional differences across critical bands instead of the time domain audio signal.
To formulate the AHSM, let the emotional difference PEDiff[kf, n] of frame n at spectral index kf refer to the variation from the reference set within that band. PEDiff[kf, n] in Equation (15) is calculated in the frequency domain as the log spectrum of the ratio of the spectral energies of the emotional and reference audio signals, FeE[kf, n] and FeR[kf, n], respectively. Note that the spectral energy obtained after outer and middle ear filtering of the STFT spectrum is given by Equation (8).
(15)
Let a row vector be formed by grouping the energies of the emotional differences at critical band kf over 256 successive frames. The normalized autocorrelation function C[l] of the emotional differences, gathered in M groups within a defined neighborhood of l = 256, is calculated as
(16)
The power spectrum S[kf] of the normalized autocorrelation function is calculated by
(17)
and its maximum peak specifies the harmonic magnitude EHmax[n]. AHSM is the average of the magnitudes estimated over Y successive audio frames and is calculated as
(18)
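The AHSM computation can be sketched as follows; the blocks of 256 frames and the peak search over the power spectrum follow the description above, while the exact grouping over critical bands in Equations (16)-(18) is simplified to a per-frame mean in this illustration.

```python
import numpy as np

def ahsm(fe_emotional, fe_reference, block_len=256, eps=1e-12):
    """Average harmonic structure magnitude of the emotional difference.

    Hedged sketch: fe_emotional and fe_reference are outer-ear-weighted spectra
    (Equation (8)) of shape (frames, bins); the grouping over critical bands in
    Equations (16)-(18) is collapsed to a per-frame mean here.
    """
    # Emotional difference PEDiff[kf, n] as a log spectral-energy ratio (Eq. (15))
    pe_diff = 10.0 * np.log10((np.abs(fe_emotional) ** 2 + eps) /
                              (np.abs(fe_reference) ** 2 + eps))
    peaks = []
    for start in range(0, pe_diff.shape[0] - block_len + 1, block_len):
        x = pe_diff[start:start + block_len].mean(axis=1)  # 256 successive frames
        x = x - x.mean()
        c = np.correlate(x, x, mode="full")[block_len - 1:]
        c = c / (abs(c[0]) + eps)                          # normalized autocorrelation C[l]
        s = np.abs(np.fft.rfft(c)) ** 2                    # power spectrum S[kf] (Eq. (17))
        peaks.append(s[1:].max())                          # harmonic magnitude EHmax[n]
    return float(np.mean(peaks)) if peaks else 0.0         # AHSM as the average (Eq. (18))
```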
5.2.2. Average number of emotional blocks
The excitation patterns of different emotional audio signals are processed and stored in the brain. The brain keeps the brief initial audio information in a short-term memory, and subjective evaluation of emotional signals depends on this short-term memory [25]. Hence, the feature AEB provides a measure of the occurrence of high excitation levels through Y successive frames analyzed on the Bark scale. The specification of Y is directly related to the granularity of the system, and it is set to Y = 70 in this study. In order to calculate the expected number of emotional blocks within a time interval, we apply a probabilistic approach that estimates the number of excitation patterns remaining above a loudness threshold [24].
Let e[k, n] denote the difference between the excitation levels of the reference and emotional audio computed at Bark band k for audio frame n, expressed in dB, as
(19)
Our aim is to specify the frames in which the excitation level difference remains above a threshold [24, 25]. The probability of an excitation pattern remaining above a loudness threshold can be modeled by [24]
(20)
where b is a constant equal to 6 and s[k, n] is a normalizing coefficient. Hence, assuming that the observed frames are uncorrelated, the total probability of declaring frame n as emotional can be calculated by
(21)
Basically, the feature AEB is computed as the average number of blocks declared as emotional within 1 s. It can be shown that P[n] becomes greater than 0.5 for these frames. Since both the probability of detection and the number of steps remaining above the loudness threshold depend on the excitation patterns, we can expect the excitation pattern of audio in the happy mode to have higher peaks than that of the bored mode; these two modes are located on the positive and negative sides of the arousal scale, respectively. The discrimination capability of the feature AEB is promising, as can be seen in Figure 7 for the pairwise training set of the bored and happy modes.
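A sketch of the AEB computation is given below; the detection-probability model follows the form of Equations (20)-(21), with the normalizing coefficient s[k, n] left as a placeholder and the frame rate derived from the 43 ms frames with 50% overlap.

```python
import numpy as np

def aeb(e_emotional_db, e_reference_db, frames_per_second=46, b=6.0,
        s_norm=None, eps=1e-12):
    """Average number of emotional blocks.

    Hedged sketch: inputs are excitation patterns in dB of shape (frames,
    bands); the normalizing coefficient s[k, n] of Equation (20) is replaced by
    a unit placeholder, and the frame rate follows from 43 ms frames with 50%
    overlap.
    """
    e_diff = e_emotional_db - e_reference_db             # e[k, n] (Eq. (19))
    if s_norm is None:
        s_norm = np.ones_like(e_diff)                    # placeholder for s[k, n]
    # Per-band probability of staying above the loudness threshold (Eq. (20))
    p_band = 1.0 - 0.5 ** ((np.abs(e_diff) / (s_norm + eps)) ** b)
    # Total per-frame probability, assuming uncorrelated bands (Eq. (21))
    p_frame = 1.0 - np.prod(1.0 - p_band, axis=1)
    emotional = p_frame > 0.5                            # frame declared emotional
    n_seconds = max(1, len(p_frame) // frames_per_second)
    return float(emotional.sum()) / n_seconds            # emotional blocks per second
```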
5.2.3. Perceptual bandwidth
The perceptual bandwidth of emotional audio varies according to the perceived timbre, dullness, or muffling effects. To measure this effect, the maximum of the frequency spectrum in the upper frequency range is obtained and used as an estimate of the noise floor. Then, scanning from higher toward lower frequencies, the highest frequency component that exceeds the noise floor by at least 10 dB is defined as the estimated perceptual bandwidth. This feature aims to classify emotional states based on the variations in signal bandwidth. Hence, a rough estimate of the observed emotional signal bandwidth is computed for each 43 ms audio frame. To do this, first the maximum W1[n] of the spectrum of the emotional audio signal within the frequency band from 14.4 to 16 kHz is specified as the noise floor by using
(22)
The first frequency component at which the spectral energy of the reference audio signal exceeds the noise floor by at least 10 dB is reported as the bandwidth of the emotional audio for the nth frame; it is denoted by W2[n] and calculated as
(23)
Furthermore, searching downward from W2[n], the first spectral value of the emotional audio signal that exceeds W1[n] by 5 dB is recorded as W3[n]
(24)
The perceptual bandwidth of the emotional audio is extracted by calculating the mean value over Y successive frames as
(25)
The discrimination capability of the feature WE can be seen in Figure 8, which plots the distribution of the bandwidth estimates for the samples taken from the q4 and q1 modes. The perceptual bandwidth of the reference audio is extracted by calculating the mean value over Y successive frames as
(26)
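The per-frame bandwidth search of Equations (22)-(23) can be sketched as follows; the 5 dB refinement of Equation (24) and the averaging of Equations (25)-(26) are omitted, and the noise band limits are taken from the description above.

```python
import numpy as np

def perceptual_bandwidth(spectrum_db, freqs_hz, noise_band=(14400.0, 16000.0),
                         threshold_db=10.0):
    """Per-frame bandwidth estimate via the noise-floor search described above.

    Sketch of Equations (22)-(23): W1 is the spectral maximum inside the upper
    noise band and the bandwidth is the highest frequency exceeding W1 by at
    least threshold_db; the 5 dB refinement (Eq. (24)) and the averaging over Y
    frames (Eqs. (25)-(26)) are omitted.
    """
    in_noise_band = (freqs_hz >= noise_band[0]) & (freqs_hz <= noise_band[1])
    if not in_noise_band.any():                  # fall back to the top 10% of bins
        in_noise_band = freqs_hz >= 0.9 * freqs_hz.max()
    w1 = spectrum_db[in_noise_band].max()        # noise floor estimate W1[n]
    above = np.where(spectrum_db > w1 + threshold_db)[0]
    if len(above) == 0:
        return 0.0
    return float(freqs_hz[above.max()])          # bandwidth estimate W2[n]
```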
5.2.4. Normalized spectral envelope
The term spectral envelope refers to the normalized amplitude variations of loudness that arise from the emotional differences of successive frames. The NSE, NSE[k, n], formulated in Equation (27)
(27)
and the NSE difference, NSEDiff[k, n], given as
(28)
quantify the local variations in energy through time. The envelope term shown in Equation (27) is calculated by
(29)
where ES[k, n] is the unsmeared excitation pattern formulated in Equation (12).
As can be seen, this term models the envelope changes through successive frames. The parameter a (0 < a < 1) shown in Equation (29) reflects the impact of past frames on the current nth frame in a gradually decaying manner; among the past frames, the previous (n - 1)th frame has the largest effect on the nth frame. The scalar a behaves like an attenuation parameter that both reflects the impact of the past and gradually reduces its effect. This precisely models the time-domain energy spreading effect, which is based on the perceptual effect of sequential speech tones. A sample case is given in Figure 9 for the angry and bored modes.
The average loudness term shown in Equation (27) acts as an adaptive normalization on the loudness and is calculated according to
(30)
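Since Equations (27)-(30) are not reproduced here, the following sketch only illustrates the idea: a recursive envelope with attenuation parameter a, normalized by an adaptive average loudness, and a frame-to-frame difference.

```python
import numpy as np

def normalized_spectral_envelope(es_frames, a=0.9, eps=1e-12):
    """Normalized spectral envelope and its frame-to-frame difference.

    Hedged sketch only: the recursive envelope with attenuation parameter a and
    the adaptive average-loudness normalization mirror the description of
    Equations (27)-(30); the exact formulas are not reproduced here.
    """
    smoothed = np.zeros_like(es_frames)
    for n in range(es_frames.shape[0]):
        prev = smoothed[n - 1] if n > 0 else es_frames[0]
        # a gradually attenuates the influence of past frames (cf. Eq. (29))
        smoothed[n] = a * prev + (1.0 - a) * es_frames[n]
    # Adaptive normalization by the per-frame average loudness (cf. Eq. (30))
    avg_loudness = smoothed.mean(axis=1, keepdims=True) + eps
    nse = smoothed / avg_loudness                        # NSE[k, n] (cf. Eq. (27))
    nse_diff = np.diff(nse, axis=0, prepend=nse[:1])     # NSEDiff[k, n] (cf. Eq. (28))
    return nse, nse_diff
```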
5.2.5. Normalized emotional difference
The perceived loudness of an emotional audio signal depends on its duration and on its temporal and spectral structure. The local loudness of an emotional signal is the loudness perceived after it has been reduced by a masker [25]. The masker causes the loudness to be perceived differently at different frequency bands; since it is effective at low frequencies, locality is established by adaptively masking the low-frequency components. Masking describes the effect by which an audible signal becomes inaudible when a louder signal masks it. We refer to the reference audio signal as the masker and compute a local loudness rather than conventional loudness values. In conclusion, we evaluate a localized loudness with respect to a reference set.
The NED is formulated as the ratio of the emotional difference PEDiff[kf, n] given by Equation (15) to the masking threshold M[k, n]. We use the total NED, which is calculated as the average (expressed in dB) of the NED values computed at the Bark scale bands,
(31)
where Z = 109 is the number of critical bands, k denotes the critical band index, and n refers to the frame number.
The masking threshold M[k, n] formulated below
(32)
is calculated by weighting the excitation patterns E[k, n] with the masking offset m[k] as given in
(33)
The masking offset is plotted in Figure 10. Since the masking offset appears in the denominator of both the masking threshold M[k, n] and the NED, it acts as a high pass filter. Hence, the feature NED is expected to emphasize the distinction between emotional categories at higher frequencies, as shown in Figure 11.
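A hedged sketch of the total NED of Equation (31) is given below; forming the masking threshold by dividing the excitation pattern by the masking offset is an assumption consistent with the offset acting in the denominator as noted above.

```python
import numpy as np

def normalized_emotional_difference(pe_diff_bark, excitation, masking_offset,
                                    eps=1e-12):
    """Total normalized emotional difference per frame.

    Hedged sketch: pe_diff_bark is the emotional difference mapped to the Bark
    bands, excitation is E[k, n], and masking_offset is m[k]. Forming the
    masking threshold as E[k, n] / m[k] is an assumption consistent with the
    offset acting in the denominator (Equations (32)-(33)).
    """
    masking_threshold = excitation / (masking_offset[np.newaxis, :] + eps)  # M[k, n]
    ned = pe_diff_bark / (masking_threshold + eps)       # per-band NED
    # Average over the Z = 109 critical bands, expressed in dB (Eq. (31))
    return 10.0 * np.log10(np.abs(ned).mean(axis=1) + eps)
```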
5.2.6. Emotional loudness
We propose using the overall loudness of the emotional differences as a representative feature of the emotional modes. The specific loudness pattern for a signal can be formulated as
(34)
where EIN is the internal noise of the ear. The threshold index s[k] is calculated according to
(35)
The overall loudness of the signal, Ltotal, is calculated as the sum of all specific loudness values above zero across all filter channels, as
(36)
An example of the effect captured with this feature is shown in Figure 12.
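The loudness computation can be sketched as follows; the Zwicker-style specific-loudness rule and its constants are assumptions in the spirit of the PEAQ model rather than a reproduction of Equations (34)-(36).

```python
import numpy as np

def emotional_loudness(excitation, internal_noise, threshold_index,
                       exponent=0.23):
    """Overall loudness Ltotal from excitation patterns.

    Hedged sketch of a Zwicker-style specific-loudness rule as used in PEAQ;
    the constants of Equations (34)-(36) are assumptions. excitation is
    E[k, n] of shape (frames, bands), internal_noise is EIN[k], and
    threshold_index is s[k].
    """
    s = np.asarray(threshold_index)[np.newaxis, :]
    ein = np.asarray(internal_noise)[np.newaxis, :]
    base = np.maximum(1.0 - s + s * excitation / ein, 0.0)
    # Specific loudness per band (cf. Eq. (34)), clipped at zero
    specific = (ein / s) ** exponent * (base ** exponent - 1.0)
    specific = np.maximum(specific, 0.0)
    # Overall loudness: sum of positive specific-loudness values (cf. Eq. (36))
    return specific.sum(axis=1)                          # Ltotal per frame
```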