2.1. The Cochlear Model and the Motivation for Its Use
The performance of PESQ can be largely attributed to its use of a PAM. The PAM, however, is a functional model that approximates simultaneous masking, and it can be treated only as an approximate estimate of the Basilar Membrane (BM) response. A primary failing of the PAM in the context of the current work, which requires isolating and distinguishing temporal distortions, is its lack of temporal resolution. To achieve high temporal resolution, the analysis frames used in the PAM would need compact time support. This, however, would yield inadequate frequency resolution, which the PAM inherently requires to produce accurate results. Moreover, the spreading functions (or filters) typically used in the PAM to model BM functionality reduce frequency resolution and are not reflective of the BM travelling waves, which carry far more spatiotemporal detail than can be observed by the PAM. It may be argued that such detail is not necessary to predict human perception. However, not all of the loss in resolution introduced by the PAM is attributable to cochlear mechanics; some of it is likely to occur at higher stages of the auditory neurophysiological pathway. The methodology adopted in this work uses the full resolution of the cochlear response to first identify salient features and, only once the features have been detected and tracked, reduces the resolution to a level representative of human perception. This strategy is impossible if a PAM is used as a front-end acoustic model that reduces time-frequency resolution at the outset.
Further, the linear character of the PAM means that it cannot predict a number of nonlinear characteristics of the true physiological response of the cochlea [7], such as two-tone suppression and cochlear emissions, and corresponding psychophysical phenomena such as the Upward Spread of Masking (USM) and loudness [8]. An explicit physiological model of the cochlea, on the other hand, is not burdened by these drawbacks and provides a precise, high-resolution spatiotemporal response of the cochlea to auditory stimuli. In a later section of this paper we discuss and compare the characteristics of the PAM, as well as spectrograms, with the cochlear model output in the context of the current work. In particular, we show that the PAM output lacks the resolution needed to carry out the analysis described in this paper, and that a consistent gain in prediction accuracy is achieved by using a nonlinear cochlear model rather than a linear one.
The cochlear model (CM) used in this paper is a spatially 2D hydromechanical model, which computes various electrical and mechanical responses in the cochlea. In particular, the model can be used to calculate the BM and Inner Hair Cell (IHC) response as a function of time and space. A block diagram of the cochlear model, depicting the path from the acoustic stimulus to its eventual transduction, is shown in Figure 3. While detailed aspects of the cochlear model are beyond the scope of this paper, they may be found in various publications [6, 7, 9, 10]. The cochlear model can be broadly divided into three components: the macromechanical model, the micromechanical model, and the nonlinear elements. The ear canal and ossicles are modelled as a linear filter, shown simply as "Middle Ear" in Figure 3. Various benchmarks comparing the model output to physiological and psychophysical data have been carried out to verify the performance of the model [7–9].
The macromechanical model is concerned with the dynamics of the fluid-filled scalae and the Organ of Corti along the length of the cochlea. Of particular relevance is the travelling-wave mechanical response of the basilar membrane (BM). A Green's function [10] is used to numerically solve, in the time domain, the differential equations that result from the assumptions of continuity (conservation of fluid mass) and of an inviscid, incompressible cochlear fluid loaded by the mass, stiffness, and damping of the structures along the length of the cochlea. Spatial sampling is achieved by linearly discretizing the cochlea at 512 points along its 3.5 cm length.
The micromechanical model is concerned with the cilia (submerged between the tectorial membrane and the BM) and the associated Inner (IHC) and Outer (OHC) Hair Cells. The movement of the cilia is modelled as the direct result of the shear force created within the subtectorial space by the relative movement of the BM and the tectorial membrane (TM). The TM is modelled as a transmission line terminated by the cilia [9]. The phenomenological result of the micromechanical model is a cilia response that reflects an attenuated BM response basal to the Characteristic Place (CP). The cilia displacements are rectified and low-pass filtered to derive the OHC and IHC receptor potentials. The IHC and OHC models are thus alike, except for a high-pass filter that precedes the IHC model to account for the fact that the IHC cilia are not attached to the TM but are driven by viscous fluid drag [11]. The IHC responses from the model are reflective of receptor potentials; however, no attempt is made to normalize them to units of Volts. Throughout the paper, it is these IHC responses that are used as the output of the cochlear model and referred to as the CM response.
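As an illustration of the transduction chain just described (high-pass coupling, rectification, and low-pass filtering), the following minimal sketch processes a cilia-displacement signal at a single place. The first-order Butterworth filters and the cutoff frequencies are our illustrative assumptions, not parameters of the published model [9, 11].

```python
import numpy as np
from scipy.signal import butter, lfilter

def ihc_stage(cilia_disp, fs, hp_cutoff=100.0, lp_cutoff=1000.0):
    """Sketch of the IHC transduction chain described above.

    cilia_disp : cilia displacement at one place (arbitrary units)
    fs         : sampling rate in Hz
    The high-pass stage stands in for the viscous fluid-drag coupling of
    the IHC cilia; filter orders and cutoffs here are assumptions.
    """
    b, a = butter(1, hp_cutoff / (fs / 2), btype="high")
    coupled = lfilter(b, a, cilia_disp)           # IHC-only high-pass stage
    rectified = np.maximum(coupled, 0.0)          # rectification of cilia displacement
    b, a = butter(1, lp_cutoff / (fs / 2), btype="low")
    return lfilter(b, a, rectified)               # low-passed receptor-potential proxy
```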
Cochlear nonlinearity imposed by OHC motility is modelled as mechanical feedback from the OHCs, which modifies the macromechanical impedance. This is shown in Figure 3 as "Mechanical feedback." It is a cycle-by-cycle effect, meaning an almost instantaneous feedback path in the model. A second, slower feedback loop due to efferent nerve fibres is not modelled.
The model is implemented completely in the time domain. Due to the discretization methods used in the model, as well as noise considerations inherent in nonlinear feedback systems, stability of the model is guaranteed only when it is run at a sampling rate considerably above the Nyquist rate [10]. To meet this requirement, the 8 kHz sampled acoustic stimuli used in this work were upsampled by a factor of six before being processed by the cochlear model. Input to the cochlear model is on a sample-by-sample basis; thus, for every input sample there is effectively a frame of 512 points of spatial data at the output. We discard five of every six output frames, which has the effect of temporally downsampling back to 8 kHz.
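The input/output rate handling can be summarized with a short sketch. Here `cochlea_step` is a hypothetical placeholder for one time step of the actual model implementation; the factor of six and the 512-point output frames are as described above.

```python
import numpy as np
from scipy.signal import resample_poly

UPSAMPLE = 6          # run the model well above the Nyquist rate
N_PLACES = 512        # spatial points along the 3.5 cm cochlea

def run_cochlear_model(stimulus_8k, cochlea_step):
    """Drive the model sample by sample and downsample its output.

    stimulus_8k  : acoustic stimulus sampled at 8 kHz
    cochlea_step : callable advancing the model one sample and returning
                   a 512-point IHC response frame (placeholder here)
    """
    x = resample_poly(stimulus_8k, UPSAMPLE, 1)   # 8 kHz -> 48 kHz
    frames = np.empty((len(x), N_PLACES))
    for n, sample in enumerate(x):
        frames[n] = cochlea_step(sample)          # one spatial frame per input sample
    return frames[::UPSAMPLE]                     # keep 1 of 6 frames -> back to 8 kHz
```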
A drawback of the CM is that its output is highly redundant: it is 512 times oversampled relative to the input stimulus. This necessitates dimensionality reduction, and our strategy has been to extract distinct features from the model response. In particular, we isolate features that correspond to the perception of temporally localized distortions, the focus of this paper.
2.2. 2D Evolution Tracking
The 2D cochlear model response across time $t$, at a single discrete place $x_0$ (of arbitrary units), is a quasiperiodic waveform with primary period $\tau_c$, dictated by the characteristic frequency $f_c$ at place $x_0$. For voiced speech, a second mode of periodicity, $\tau_p$, can also be observed on the smooth low-passed envelope of the signal. This periodicity is due to the pitch of the speaker and is independent of place (except for a slow evolution across space). Both periods, $\tau_c$ and $\tau_p$, are shown for a typical voiced section in Figure 4.
Due to causality, at a neighbouring place $x_0 + \Delta x$, the envelope of the cochlear model response will have evolved, albeit slowly for voiced sections. The rate of evolution is a function of the amount of voicing: for highly voiced sections this evolution is slow, whereas for unvoiced sections it is fast. Exactly the same argument can be made in the alternate dimension, looking at the cochlear response as a function of place at a discrete time $t_0$ and its evolution at $t_0 + \Delta t$. It is necessary to track this evolution in both the space and time dimensions, since the envelope evolves in both. A peak tracking algorithm is used in Figure 5 to illustrate this evolution for a voiced section of speech.
We hypothesize that these peak tracks of the cochlear model response are essential features that represent the rate of evolution of the response. It can be observed that the peak tracks are almost parallel when the rate of evolution is slow, as is the case for voiced speech. This parallel structure is lost for unvoiced sections of speech, as shown in Figure 6.
The output of the cochlear model is 2D data across time and space. The spatial sampling interval is $\Delta x = 35/512 \approx 0.068$ mm, such that there are 512 discrete points across the approximately 3.5 cm length of the human BM. The relationship between place and frequency can be approximated using Greenwood's map [12]. This mapping is, however, only valid at threshold levels. To provide an indication to the reader, 24 mm along the cochlear length represents the characteristic place for a 600 Hz sinusoid (at threshold), as can be seen in Figure 7.
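For illustration, Greenwood's map can be evaluated with his published human constants ($A = 165.4$, $a = 2.1$, $k = 0.88$, with place expressed as a proportion of BM length from the apex); this reproduces the 24 mm / 600 Hz correspondence quoted above. The sketch below assumes a 35 mm BM.

```python
def greenwood_frequency(dist_from_base_mm, bm_length_mm=35.0):
    """Characteristic frequency (Hz) at a place on the human BM.

    Greenwood's map [12]: F = A * (10**(a*x) - k), where x is the
    proportion of BM length measured from the apex.
    """
    x = (bm_length_mm - dist_from_base_mm) / bm_length_mm
    return 165.4 * (10.0 ** (2.1 * x) - 0.88)

# 24 mm from the base maps to roughly 600 Hz, as in Figure 7:
print(round(greenwood_frequency(24.0)))  # ~610
```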
The steps below describe an algorithm to track the 2D evolution of the cochlear response $r(t, x)$ on a closed spatial region $[x_l, x_u]$ along the BM, where $x_l$ and $x_u$ are the lower and upper bounds along the place axis, with $x_l < x_u$.
(1) We start at the lowest boundary place $x_l$, which corresponds to the highest frequency in the region. All local maxima along the time axis are found, such that there are $N$ peaks, at times $t_n$, $n = 1, \ldots, N$. The peaks are chosen such that at time $t_n$ the cochlear response $r(t_n, x_l)$ is larger than the $K$ neighbouring time samples on either side of it, as follows: $r(t_n, x_l) > r(t_n - k\Delta t, x_l)$ and $r(t_n, x_l) > r(t_n + k\Delta t, x_l)$ for $k = 1, \ldots, K$. The value of $K$ is a function of the temporal sampling rate and is empirically determined to ensure the capture of salient features.
(2) The process in Step (1) is repeated for each spatial point in the range $(x_l, x_u]$. The positions of the peaks are stored in a matrix $\mathbf{P}$, such that $\mathbf{P}(n, x)$ holds the time $t_n$ of the $n$th peak at place $x$. The number of rows of the matrix is given by the maximum number of peaks at any place (i.e., $\max_x N(x)$).
(3) The next step is to associate each peak with a track across time and place. To do this we look in a restricted neighbourhood (i.e., $t_n \pm \delta$) of each peak position from the previous place, $x - \Delta x$. Due to causality, the peak tracks always move towards increasing time and place; for this reason, $\delta$ can be small. If a peak is found within this range, it is considered part of the same track as the peak at $(t_n, x - \Delta x)$. If more than one peak is found within the range, the one closest to $t_n$ is chosen. If no peak is found, the track is terminated at place $x - \Delta x$ and no further search along this track is performed. It is important to account for any new tracks that originate at a higher place (i.e., peaks not present at place $x - \Delta x$) by ensuring that new peaks not associated with the previous place are not discarded but are stored for future tracking until they terminate.
(4) Further postprocessing involves connecting broken tracks that are possibly part of the same track, and checking that the track lengths exceed a certain threshold; tracks shorter than the threshold are discarded.
(5) The final tracks are stored in a matrix $\mathbf{T}$, where each column describes a single track.
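For concreteness, steps (1) through (5) can be sketched as follows, assuming the CM response is available as a 2D array `r` of shape (time, place). The window sizes `K` and `delta` and the minimum track length are illustrative assumptions; reconnecting broken tracks (step (4)) is omitted for brevity.

```python
import numpy as np

def find_peaks_1d(signal, K):
    """Step (1): indices n where signal[n] exceeds its K neighbours on each side."""
    peaks = []
    for n in range(K, len(signal) - K):
        if signal[n] > signal[n - K : n].max() and signal[n] > signal[n + 1 : n + K + 1].max():
            peaks.append(n)
    return peaks

def track_peaks(r, K=5, delta=3, min_len=10):
    """Track peaks of the CM response r (time x place) across place.

    r       : 2D numpy array, rows = time samples, columns = place samples
    K       : neighbourhood half-width for peak picking (assumption)
    delta   : time tolerance for continuing a track to the next place (assumption)
    min_len : minimum track length kept by postprocessing (assumption)
    """
    n_time, n_place = r.shape
    tracks = []                 # each track is a list of (time_index, place_index)
    active = {}                 # peak time at previous place -> index into tracks
    for x in range(n_place):
        new_active = {}
        for t in find_peaks_1d(r[:, x], K):
            # step (3): search +/- delta around peaks of the previous place
            cands = [tp for tp in active if abs(tp - t) <= delta]
            if cands:
                tp = min(cands, key=lambda c: abs(c - t))   # closest peak wins
                idx = active.pop(tp)                        # continue that track
                tracks[idx].append((t, x))
            else:
                tracks.append([(t, x)])                     # new track born at this place
                idx = len(tracks) - 1
            new_active[t] = idx
        active = new_active     # tracks with no continuation terminate here
    # step (5): discard short tracks; step (4) reconnection omitted for brevity
    return [tr for tr in tracks if len(tr) >= min_len]
```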
An example of the result of the above steps is illustrated in Figures 5 and 6. The continuous lines capture information related to the evolution of the spectrum over time and space. During voiced speech, this evolution is slow and is characterised by peak tracks that do not change drastically over time and space, resulting in almost parallel tracks.
2.3. Locating Perceptually Relevant Regions
Articulatory features such as vocal tract resonances (formants) and pitch harmonics are easily distinguishable in the 2D rendering of the CM response. During voiced speech, these features appear as distinct "peaks" or high-energy regions in the CM response, as can be observed in Figure 5. In the figure, three pitch harmonics in the first formant region can clearly be tracked over time and place. They appear at three distinct places from the base of the BM, their positions changing slightly with time, and each place corresponds to a particular characteristic frequency. Instead of referring to these in terms of articulatory features, it is more appropriate to refer to them as Perceptually Relevant Regions (PRR), reflecting the association between each place along the length of the cochlea and a characteristic frequency.
The peak tracking algorithm described in the previous section tracks the PRRs extremely accurately over time and place. What is actually being tracked is the effect of the articulatory features as processed by the cochlea. This is one of the main reasons that the use of the CM response is far superior to the use of a spectrogram or a PAM: the CM response reflects only the information that remains after nonlinear cochlear processing.
One of the important features of the PRRs is their stationary nature over time and place. This can be observed in the CM response by the fact that the number of peaks remains unchanged for the duration of the voiced speech, and by the fact that the peak tracks are approximately parallel to each other (in the 2D projection across time and place), especially in the regions of the PRRs. This is demonstrated in Figure 4.
The next step in our feature extraction is to focus on just the PRRs. This is facilitated by the observation that the average time difference between adjacent peak tracks, $\overline{\Delta t}_j$ (averaged over the duration of the voiced section), is almost constant across the region of each PRR, where $j = 1, \ldots, J$ is the track index and $J$ is the total number of tracks. This is shown in Figure 8: in each of the three PRRs, the $\overline{\Delta t}$, shown by the blue line, is almost constant along the width of each of the three formant places. The standard deviation of the time difference, shown in red, is also low. Further, there is a conspicuous increase in the average time difference with increasing distance from the base, such that the $\overline{\Delta t}$ of a PRR closer to the base is lower than that of a PRR farther along the BM. This is a direct consequence of the fact that the number of peaks at any one place decreases with increasing distance, reflecting the fact that the characteristic frequency decreases with distance.
To focus on the PRRs, we use a two-pronged strategy. First, we impose an energy threshold such that only sections of the CM response above the threshold are kept. In addition, we use the characteristic of $\overline{\Delta t}$ whereby it increases in (almost) discrete steps (as shown by the blue line in Figure 8). The boundaries of these plateaus further delineate the relevant regions. These regions are shown in Figure 9 as areas between horizontal lines ("PRR1", "PRR2", and "PRR3"). The three regions correspond to the three dominant pitch harmonics in the vicinity of the first formant.
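The two-pronged selection can be sketched as follows, assuming the per-place mean time differences of Figure 8 are already available; the energy threshold and the plateau tolerance `tol` are illustrative placeholders, not values from this work.

```python
import numpy as np

def locate_prrs(mean_dt, energy, energy_thresh, tol=0.1):
    """Split the place axis into PRRs using the two-pronged strategy.

    mean_dt       : average time difference between adjacent peak tracks,
                    one value per place sample (the blue line of Figure 8)
    energy        : CM response energy per place sample
    energy_thresh : keep only places above this energy (assumption)
    tol           : relative jump in mean_dt that ends a plateau (assumption)
    """
    mean_dt = np.asarray(mean_dt)
    keep = np.asarray(energy) > energy_thresh        # prong 1: energy threshold
    regions, start = [], None
    for i in range(1, len(mean_dt)):
        jump = abs(mean_dt[i] - mean_dt[i - 1]) > tol * mean_dt[i - 1]
        if keep[i] and not jump:
            start = i - 1 if start is None else start
        elif start is not None:                      # plateau boundary reached
            regions.append((start, i - 1))           # prong 2: plateau extent
            start = None
    if start is not None:
        regions.append((start, len(mean_dt) - 1))
    return regions                                   # e.g., [PRR1, PRR2, PRR3]
```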
2.4. Center of Mass for Each Formant Region
A characteristic of the peak tracks within each PRR is that they are quasiparallel on the time-place plane (much more so than in other regions). To reduce the dimensionality and computational complexity, the "center of mass" of each track segment (restricted to a PRR) is computed. Each new point is characterised by a time, place, and amplitude, $(t_{cm}, x_{cm}, a_{cm})$. We call these points Track Center Points (TCP). The amplitude is simply the average of the IHC responses constrained by the boundaries of a track. The time ($t_{cm}$) and place ($x_{cm}$) values are amplitude-weighted averages, calculated using the following three equations:
$$ t_{cm} = \frac{\sum_{i=1}^{N} a_i t_i}{\sum_{i=1}^{N} a_i}, \qquad x_{cm} = \frac{\sum_{i=1}^{N} a_i x_i}{\sum_{i=1}^{N} a_i}, \qquad a_{cm} = \frac{1}{N} \sum_{i=1}^{N} a_i. \quad (1) $$
Here $a_i$ is the IHC amplitude, $t_i$ the time position, and $x_i$ the place position of point $i$, and $N$ is the number of points in one track. A typical set of consecutive TCPs (in one formant region) is plotted in Figure 10, which is derived from one of the PRRs in Figure 9. The plot reveals a swirling 3D curve. The period of the swirl corresponds to the periodicity of the underlying (time domain) speech signal and is given by $\tau_p$ in Figure 4.
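Equation (1) translates directly into code; the sketch below assumes each track is supplied as an array of (time, place, amplitude) triples produced by the tracking step.

```python
import numpy as np

def track_center_point(track):
    """Compute the TCP (t_cm, x_cm, a_cm) of one track via (1).

    track : array of shape (N, 3) with columns (time, place, IHC amplitude)
    """
    t, x, a = np.asarray(track, dtype=float).T
    w = a.sum()
    return (t @ a) / w, (x @ a) / w, a.mean()   # amplitude-weighted centroid
```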
Corresponding TCPs one period $\tau_p$ apart are also similar in intensity and place, more so than neighbouring TCPs. In a further attempt at reducing dimensionality, each set of TCPs within a single period $\tau_p$ is reduced to a single "center of mass", again as given by (1). We call these points Salient Formant Points (SFP), reflecting the fact that they are indicative of formant energy as a function of time and place. Time periodicity has been removed as a result of this final process. The corresponding SFPs of the original and distorted speech signals are highly synchronized in time. This is of great benefit, as most intrusive objective speech quality measures, such as PESQ [2], require fairly complex preprocessing to synchronize the two signals accurately, a step for which our system can afford to be less precise due to this automatic SFP synchronization. Figure 11 indicates the final result of this process and shows the extracted Salient Formant Points (SFP) in the 3D space of time, place, and IHC response. Figure 12 plots the extraction times of the points for the original and the distorted signals, respectively. A most notable feature is that the points extracted in this manner for the two different systems are automatically synchronized, without an explicit requirement for the signals to be synchronized accurately at the input.
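Reducing TCPs to SFPs reuses the same centroid computation; the sketch below assumes the TCPs are sorted in time and that the pitch period $\tau_p$ is known (e.g., estimated from the low-passed envelope), both assumptions on our part.

```python
import numpy as np

def salient_formant_points(tcps, tau_p):
    """Collapse the TCPs within each pitch period to one SFP via (1).

    tcps  : array (M, 3) of (time, place, amplitude) TCPs, sorted by time
    tau_p : pitch period in the same time units (assumed known)
    """
    tcps = np.asarray(tcps, dtype=float)
    bins = np.floor(tcps[:, 0] / tau_p).astype(int)   # pitch-period index
    sfps = []
    for b in np.unique(bins):
        t, x, a = tcps[bins == b].T
        w = a.sum()
        sfps.append(((t @ a) / w, (x @ a) / w, a.mean()))
    return np.asarray(sfps)
```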
Figure 14 shows that the points are slightly dispersed over place due to the different coding systems, as should be expected. Finally, Figure 13 shows the IHC response at each of the extracted points. Note the significant amplitude difference between the original and distorted signals. In our intrusive prediction of speech quality, the original signal is used as a reference of "smoothness". A perceptual formant distance $D_{pf}$ is defined as
$$ D_{pf}(n) = \log a^{o}(n) - \log a^{d}(n), \quad (2) $$
where $a^{o}(n)$ and $a^{d}(n)$ are the IHC amplitudes of the $n$th SFP of the original and distorted signals, respectively. The $D_{pf}$ is used to predict temporal distortions, as described in the next section. Note that, in an extreme situation, if the original and distorted SFP amplitudes are parallel to each other, the $D_{pf}$ is flat or constant, reflecting only a multiplicative constant between the two signals. It is the deviation of $D_{pf}$ along the time axis that carries information on temporal distortions.
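Under the log-difference reading of $D_{pf}$ in (2) (our reconstruction from the surrounding description), the measure is a one-liner; the guard `eps` against zero amplitudes is our addition.

```python
import numpy as np

def perceptual_formant_distance(a_orig, a_dist, eps=1e-12):
    """D_pf over matched SFPs: constant when the signals differ only by a gain.

    a_orig, a_dist : IHC amplitudes of time-synchronized SFPs
    eps            : numerical guard (assumption, not part of the measure)
    """
    d = np.log(np.asarray(a_orig) + eps) - np.log(np.asarray(a_dist) + eps)
    return d  # the deviation of d along time carries the temporal distortion cue

# A pure gain between the two signals yields a flat D_pf:
# perceptual_formant_distance([1, 2, 4], [0.5, 1, 2]) -> [0.693, 0.693, 0.693]
```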