Music can be described as a series of complex acoustic sounds composed of tones with fundamentals and overtones that are harmonically related to each other [1]. The majority of musical instruments generate fundamental frequencies below 1 kHz [2]. An important aspect of music is melody [3], which can be defined as a sequence of individual tones that are perceived as a single entity [4]. Preserving the harmonic structure of individual tones is therefore important for preserving melody perception.
Cochlear implants (CIs) were originally designed to restore speech perception for patients with profound hearing loss [5, 6]. The standard ACE (advanced combination encoder) speech coding strategy used with the Nucleus CI typically encodes signals between 188 and 7980 Hz onto a maximum of 22 intracochlear electrodes. The frequency range up to 1 kHz is represented by no more than eight electrodes in the standard (Std) ACE frequency to electrode mapping. This is insufficient to preserve the representation of the harmonic structure of musical tones, because the fundamental frequencies as well as the overtones of adjacent musical tones will often be mapped onto the same electrode, especially for frequencies below 500 Hz. It can therefore be hypothesized that this coding strategy will not be optimal for musical melody representation.
One way to improve tonotopic melody representation would be to ensure that the fundamental frequencies of adjacent tones on the musical scale are assigned to separate electrodes. Such an approach involves mapping fundamental frequencies of musical tones to electrodes based on a semitone scale. The idea was initially investigated in a study by Kasturi and Loizou [7], using the 12 electrode Clarion CII (Advanced Bionics) implant with a limited range of semitone frequencies. They concluded that semitone spacing improved melody recognition in CI recipients. Additionally, music could be further enhanced by increasing the frequency representation in CIs. This may be possible by using virtual channels (VCs) formed by stimulating two adjacent electrodes simultaneously with the same current level. Busby and Plant reported that VCs invoked the perception of an intermediate pitch [8]. VCs on an array of 22 electrodes would yield a total of 43 channels, which would allow covering three and a half octaves with semitone (Smt) mapping with one-semitone intervals between the characteristic frequencies of successive channels. Note that the middle VC between two adjacent electrodes is the only VC which can be created in present Nucleus CI devices, because they are equipped with only one current source.
In this study, we extend Kasturi and Loizou's idea and propose a Smt mapping algorithm incorporating VCs. Two different Smt mapping ranges were considered. The first one, Smt-LF, is restricted to the low and mid frequency range [130 to 1502 Hz] and the second, Smt-MF, maps frequencies in the mid and high frequency range [440 to 5009 Hz]. The ranges of Smt-LF and Smt-MF mappings in relation to a piano keyboard are illustrated in Figure 1. Note that at the lower end of the piano scale, the fundamental frequencies of successive tones differ by as little as approx. 1.6 Hz at A0 (f0 = 27.5 Hz) and approx. 8 Hz at C3 (f0 = 130 Hz). This difference increases with the fundamental frequency of the tone.
The frequency range of the Smt-LF mapping covers low and mid frequencies. These frequencies are common to most musical instruments [2]. The Smt-LF mapping has a band-pass filter that filters out the fundamental frequencies and the lower partials lying within the first and second piano octaves (less than 130 Hz) as well as partials in the sixth piano octave and above (greater than 1502 Hz). The range of the Smt-MF mapping covers part of the mid and high frequencies that are common between music and the comprehensible speech bands used in telephone lines [9]. The Smt-MF mapping band-pass filters out frequencies lower than 440 Hz (A4) and higher than 5009 Hz. Thus fundamental frequencies and partials in most of the fourth piano octave (261 to 493 Hz) and below will not be represented. Smt-MF mapping also allocates frequency bands of audible sounds to electrodes with similar characteristic frequencies according to Greenwood's formula [10, 11], assuming an average cochlear length of 33 mm and an electrode insertion depth of 22 mm.
This article is organized as follows: 'Theoretical basis of semitone mapping' section describes the theoretical basis of semitone mapping and explains why Smt mapping was chosen. 'Processing and implementation' section presents a brief description of the processing technique. 'Semitone mapping frequency ranges' section describes in detail the two ranges under consideration, Smt-LF and Smt-MF. 'Frequency time matrix' section describes how the resolution at low frequencies was improved. 'Channel time matrix' section describes how the frequency bands were then mapped to their corresponding channels. 'Nucleus Matlab Toolbox' section describes how an acoustic model could be implemented to resynthesize channel activities into an acoustic sound that will be used in psychoacoustic tests with normal hearing (NH) listeners. 'Analysis' section presents an analysis of the resynthesized sounds. 'Pilot test' section describes a pilot pitch ranking test using acoustic simulations of CI sounds with NH subjects to investigate the perceptual difference between 43 and 22 channels using the Std ACE mapping. The hypothesis for this test was that the 43 channel mode would increase the frequency representation and, as a result, enhance synthetic tone discrimination and improve pitch ranking at smaller semitone intervals than with 22 channels using the Std ACE mapping. 'Procedure' and 'Results' sections describe the experimental procedures and results. This is followed by a discussion and a conclusion section.
Theoretical basis of semitone mapping
Smt mapping assigns fundamental frequencies of successive semitones on the musical scale to individual channels. Note that the harmonic overtones, which are integer multiples of the fundamental frequency, of each musical tone will also be mapped to the center frequency of separate channels with Smt mapping. Therefore, different musical tones will correspond to different sets of channels.
The relationship between the fundamental frequencies $f_n$ and $f_r$ of two musical tones k semitones apart is described by Equation 1a below.
$$f_n = f_r \cdot 2^{k/12} \tag{1a}$$
$$\log_{10}\left(\frac{f_n}{f_r}\right) = \frac{k}{12}\,\log_{10} 2 \approx 0.025\,k \tag{1b}$$
where $f_r$ is the fundamental frequency of the lower tone. Equation 1b represents the ratio of characteristic frequencies of channels in Smt mapping. Substituting k = 1 gives frequency ratios for one-semitone steps.
The characteristic frequencies of Smt mapping for 43 channels each 1 semitone apart (k = 1) for the two Smt-LF and Smt-MF ranges (squares and filled circles, respectively) are plotted in Figure 2. Note that higher channel numbers correspond to lower frequencies, to be consistent with the numbering used for Nucleus CIs. The characteristic frequencies of the Std mapping with 43 channels (i.e., including VCs) are also shown in Figure 2 (open circles).
The two Smt mapping functions yield straight lines with a slope corresponding to the value of 0.025 as given in Equation 1b. This value is required to map consecutive semitones to consecutive individual channels. Shallower slopes would result in more than one semitone being mapped to the same channel, distorting the original harmonic structure of the overtones. This would be the case for the Std mapping function, particularly with the first eight channels in the lower frequency range. This distortion decreases at higher frequencies as the slope approaches a value corresponding to 0.025.
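The semitone relationship and the resulting slope can be checked numerically. The following is a minimal sketch rather than the implementation used in the study: the function name `semitone_centers` is hypothetical, the starting frequencies (C3 = 130.81 Hz, A4 = 440 Hz) are equal-tempered values assumed here, and the list is ordered from low to high frequency, which is the reverse of the Nucleus channel numbering.

```python
import math

def semitone_centers(f_start, n_channels=43, k=1):
    """Characteristic frequencies spaced k semitones apart, lowest first."""
    return [f_start * 2.0 ** (k * i / 12.0) for i in range(n_channels)]

# Assumed starting tones: C3 for Smt-LF and A4 for Smt-MF.
smt_lf = semitone_centers(130.81)
smt_mf = semitone_centers(440.0)

# Slope of log10(frequency) per channel for one-semitone steps.
slope = math.log10(smt_lf[1] / smt_lf[0])

print(f"Smt-LF centers: {smt_lf[0]:.1f} Hz to {smt_lf[-1]:.1f} Hz")
print(f"Smt-MF centers: {smt_mf[0]:.1f} Hz to {smt_mf[-1]:.1f} Hz")
print(f"slope per channel: {slope:.4f}")
```

Under these assumptions the 43 center frequencies run from C3 to about 1480 Hz (F6#) and from A4 to about 4978 Hz (D8#), and the slope is log10(2)/12, or roughly 0.025 per channel.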
Since the inner ear resolves frequencies mainly based on a logarithmic function, harmonic overtones with the Smt mapping will be regularly spaced along the basilar membrane as described by the following equations.
Equation 2 below describes the characteristic frequencies at distance x mm from the cochlea's apex according to Greenwood's empirically derived function which was verified against data that correspond to a range of x from 1 to 26 mm [12].
$$f = A\left(10^{a x} - K\right) \tag{2}$$
where A, a, and K are empirical constants. Solving Equation 2 for x gives Equation 3.
$$x = \frac{1}{a}\,\log_{10}\left(\frac{f}{A} + K\right) \tag{3}$$
The distance (in mm) between two locations with different characteristic frequencies $f_1$ and $f_2$ is given by Equation 4
$$\Delta x = x_2 - x_1 = \frac{1}{a}\,\log_{10}\left(\frac{f_2 + A K}{f_1 + A K}\right) \tag{4}$$
Substituting $f_2$ and $f_1$ by $f_n$ and $f_r$, respectively, from Equation 1 yields:
$$\Delta x = \frac{1}{a}\,\log_{10}\left(\frac{f_r\,2^{k/12} + A K}{f_r + A K}\right) \tag{5}$$
Equation 5 shows that the spacing along the basilar membrane between two successive semitones (substitute k = 1 and $f_r$ with the fundamental frequency of the lower tone) will vary depending on the frequency range, and is smaller at low frequencies, asymptotically approaching 0.4 mm with higher frequencies. For C3 (f0 = 130.8 Hz), the spacing would be about 0.19 mm, whereas at C8 (f0 = 4186 Hz) about 0.4 mm. The electrode spacing between successive electrodes in the Nucleus 24 implant straight array has a center to center distance of 0.75 mm [13]. For VCs, assuming that the center of stimulation is halfway between the two physical electrodes, the channel spacing would be about 0.38 mm. This corresponds roughly to the tonotopical spacing for the tones involved in the Smt-MF mapping.
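The quoted semitone spacings can be reproduced with a short calculation. This sketch assumes Greenwood's function in the form f = A(10^(ax) - K) with x in mm from the apex. The constants A = 165.4 and a = 0.06 are the commonly cited human values, and K = 1 is an assumption chosen because it reproduces the spacings given in the text; none of these values is stated in this section. The function names are hypothetical.

```python
import math

# Assumed Greenwood constants (x in mm from the apex); K = 1 is an assumption.
A, a, K = 165.4, 0.06, 1.0

def place_mm(f):
    """Distance from the apex (mm) for characteristic frequency f (Hz)."""
    return math.log10(f / A + K) / a

def semitone_spacing_mm(f_r, k=1):
    """Basilar membrane distance between f_r and the tone k semitones above."""
    return place_mm(f_r * 2.0 ** (k / 12.0)) - place_mm(f_r)

spacing_c3 = semitone_spacing_mm(130.8)    # lower end of Smt-LF
spacing_c8 = semitone_spacing_mm(4186.0)   # top of the piano keyboard
asymptote = math.log10(2.0) / (12.0 * a)   # limit for f_r much larger than A*K

print(f"C3: {spacing_c3:.2f} mm, C8: {spacing_c8:.2f} mm, limit: {asymptote:.2f} mm")
```

With these assumed constants the spacing is close to 0.19 mm at C3 and 0.40 mm at C8, and the high-frequency limit is about 0.42 mm, which is of the same order as the 0.38 mm channel spacing obtained with VCs.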
Processing and implementation
The block diagram in Figure 3 shows the Std ACE processing algorithm. An acoustic signal undergoes a fast Fourier transform (FFT), from which the power spectral density (PSD) is calculated. The frequency range of the PSD is divided into different bands. The n bands with the highest energies (maxima) are then selected for presentation, where n is a parameter that can be defined for each CI recipient's map. The resulting frequency time matrix (FTM) is then processed as follows: The energy within each selected band is used to determine the corresponding stimulation level according to a loudness growth function (LGF). Using a mapping function, the respective bands are then assigned to channels, which can be physical electrodes or VCs, to produce the channel time matrix (CTM).
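The maxima selection step can be illustrated for a single time frame. This is only a sketch of the general n-of-m idea, not the ACE implementation: the function name `select_maxima` and the example energies are hypothetical, and the loudness growth function and channel mapping are left out.

```python
def select_maxima(band_energies, n):
    """Keep the n largest band energies of one frame and zero the rest."""
    # Band indices ranked from highest to lowest energy.
    ranked = sorted(range(len(band_energies)),
                    key=lambda i: band_energies[i], reverse=True)
    keep = set(ranked[:n])
    return [e if i in keep else 0.0 for i, e in enumerate(band_energies)]

# Hypothetical band energies for one frame of the FTM.
frame = [0.1, 0.9, 0.3, 0.7, 0.05, 0.6]
selected = select_maxima(frame, n=3)
print(selected)  # [0.0, 0.9, 0.0, 0.7, 0.0, 0.6]
```

Applying the same function to every column of the FTM leaves at most n active bands per frame, which are then passed to the LGF and the mapping function.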
Semitone mapping frequency ranges
The fundamental frequencies of the musical tones from the piano keyboard vary between 27.5 Hz (A0) and 4186 Hz (C8) [2]. Two ranges were investigated for the Smt mapping:
Smt-LF [130 to 1502 Hz] (C3 to F6#)
The minimum required frequency resolution for the Smt-LF mapping is approx. 8 Hz at C3 (f0 = 130 Hz). Analyzing a signal that has a sampling rate of 16 kHz with 2048 FFT points provides a 7.8-Hz resolution between successive frequency bins. The lowest acoustic frequency of 130 Hz for Smt-LF will be mapped to the most apical electrode location, which would correspond to a characteristic tonotopical frequency of approximately 571 Hz estimated according to Greenwood's equation [10], assuming an average cochlear length of 33 mm and an electrode array insertion depth of 22 mm. This will cause sounds to be perceived higher in pitch. However, as the frequency shift is expected to be the same for all partials with Smt mapping, this would be equivalent to a transposition.
Smt-MF [440 to 5009 Hz] (A4 to D8#)
A small frequency bandwidth is enough to maintain speech comprehension as in telephone transmission, where the bandwidth used is [300 to 3000 Hz] [9]. The Smt-MF mapping covers part of the bandwidth that is common between speech and music [440 to 5009 Hz]. Note that transposing the Smt-MF range three semitones higher to cover a range from 523 Hz (C5) to 5919 Hz (F8#) would minimize the difference between characteristic and tonotopical frequencies of electrodes according to Greenwood [10] (see Figure 2) assuming an average cochlear length of 33 mm and an insertion depth of 22 mm. However, it is impossible to precisely match the tonotopical characteristic frequencies for any given individual through Greenwood's function as the latter is empirical in nature and is also supposed to only represent the average NH listener. Also cochlear length and electrode insertion depth vary among patients. Thus, some discrepancy is always to be expected.
Frequency time matrix
Frequency components at different time frames are analyzed using FFT and are organized into a FTM. A typical CI processor like the Nucleus Freedom uses a sampling rate fs of 16 kHz to produce the FTM with a 128 points FFT [14, 15], giving a frequency resolution Δf of 125 Hz [16–18]. However, Smt-LF mapping needs a higher resolution (Δf of approx. 8 Hz) at low frequencies (approx. 130 Hz). Increasing the number of points N = fs/Δf increases the frequency resolution but at the same time will produce smearing in the time domain due to the larger processing window. In order to increase the frequency resolution at low frequencies and retain some of the time resolution at higher frequencies, frequency subband decomposition [19–21] is used to generate the FTM.
Frequency resolution and subband decomposition
First, the input signal is sampled at 16 kHz. Then the sampled signal is processed in two frequency subbands (see Figure 4) to yield two different frequency resolutions.
Figure 4 shows how the input signal flows into two parallel pathways: one for the low frequencies and the other for the high frequencies. The low frequency pathway uses 512 (N) samples which are split into overlapping time frames and analyzed. The amount of overlap depends on the stimulation rate such that at the end of each stimulation period, as much new data (sampled at 16 kHz) as needed is added to the data buffer. For instance, with a stimulation rate of 500 Hz, 32 new samples are added every stimulation period to the data buffer of length 512 samples, resulting in an overlap of 480 samples. The signal is first filtered using a Kaiser low-pass filter (LPF) with a cutoff at 4 kHz, and then decimated by a factor of two (d = 2), which increases the frequency resolution by the same factor while keeping the buffer length at 512 points. Each time frame with 512 points is multiplied by a Hanning window and zero padded before undergoing a 2048 (m) point FFT. Notice that after zero padding each bin represents a frequency band of 3.9 Hz (8 kHz/2048, where 8 kHz is the sampling rate after decimation). Every two successive bins are then summed to preserve the power, so that the minimum detectable frequency difference between successive bands in the low frequency branch is Δf = fs/(d·m/2) = 7.8 Hz.
The high frequency pathway uses the same number of points (N = 512) as the low frequency pathway, producing a frequency resolution of 31.25 Hz. The signal is split into overlapping time frames in the same manner as in the low frequency pathway. Each time frame is multiplied by a Hanning window of the same number of points, zero padded, and undergoes a 2048 point FFT.
The output bins from both pathways are combined to form the FTM, which has a bin resolution of 7.8 Hz. The boundary between the low and the high frequency pathways was set to 1054 Hz (between C6 and C6#), where the difference in frequency between successive semitones starts to exceed the HF resolution. This ensures that any successive semitones will at least lie on successive electrodes. The lower 134 bins are from the low frequency pathway, while the higher bins are from the high frequency pathway.
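The numbers quoted for the two pathways follow from simple arithmetic on the parameters given in the text. The sketch below only reproduces that arithmetic; the variable names are chosen here for illustration and do not come from the processing software.

```python
FS = 16000          # input sampling rate (Hz)
N = 512             # buffer length (samples)
M = 2048            # FFT length after zero padding
D = 2               # decimation factor in the low frequency pathway
STIM_RATE = 500     # example stimulation rate (Hz)

hop = FS // STIM_RATE          # new samples added per stimulation period
overlap = N - hop              # samples shared by successive frames

lf_bin = (FS / D) / M          # bin width after decimation and zero padding
lf_res = 2 * lf_bin            # two successive bins are summed per band
hf_res = FS / N                # resolution of the high frequency pathway

crossover_hz = 1054
n_lf_bands = int(crossover_hz // lf_res)   # bands taken from the LF pathway

print(hop, overlap, lf_bin, lf_res, hf_res, n_lf_bands)
```

This gives a hop of 32 samples with 480 samples of overlap, a 7.8125 Hz band width below the boundary, 31.25 Hz above it, and 134 low frequency bands below 1054 Hz.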
An example of a FTM produced using frequency subband decomposition for a signal with four sinusoidal components with 900, 936, 1200, and 1295 Hz is shown in Figure 5. The difference in frequency resolution can be clearly seen in the narrower bands at lower frequencies and wider bands at higher frequencies.
The above frequency subband decomposition only applies to the Smt-LF mapping. For Smt-MF mapping, the minimum frequency resolution required is approx. 26.2 Hz at A4 (f0 = 440 Hz). Using N = 512 provides a resolution of 31.25 Hz, which is slightly coarser than the resolution required for the lowest semitone frequencies. Note that in the present implementation of the Smt-MF, the first two tones (A4 and A4#) are not adequately resolved because the difference between them is less than the FFT resolution (31.25 Hz); they fall within a single FFT bin, which is in turn mapped to two adjacent channels. To preserve the starting frequency and keep the CFs close to the Greenwood frequencies without leaving an empty channel, while the A4 and A4# semitones lie in the same bin, it is suggested to activate the first two electrodes. The starting frequency could have been made slightly lower, but the drawback would be a larger difference between the CFs of the electrodes and the Greenwood frequencies. The remaining tones are adequately resolved. Subband decomposition is not used with Smt-MF mapping. The processing block diagram for Smt-MF is similar to the one described for the high frequency pathway in Figure 4, with N = 512, without zero padding and without the frequency scaling block. Note that the frequency resolution of Smt-MF is 31.25 Hz, compared to 7.8 Hz (for frequencies below 1054 Hz) and 31.25 Hz (for frequencies above 1054 Hz) for Smt-LF.
Channel time matrix
Depending on the frequency range of interest (e.g. Smt-LF [130 to 1502 Hz] and Smt-MF [440 to 5009 Hz]), different bins in the FTM are combined into frequency channels to produce a CTM. A mapping matrix (M) is introduced to define which FFT bins should be mapped to which corresponding channels. The mapping matrix attempts to map the center frequencies of the channels and FFT bins as close as possible to the fundamental frequencies of each corresponding semitone.
Figure 6a, b illustrates the mapping matrices for both the Smt-MF and the Smt-LF mapping, respectively, for 43 channels. The Smt-LF mapping covers frequency band [130 to 1502 Hz] which corresponds to bin numbers [17 to 200], with a frequency resolution of 7.8 Hz. The Smt-MF mapping covers frequency band [440 to 5009 Hz] which corresponds to bin numbers [15 to 169] where the frequency resolution is 31.25 Hz. Smt-MF mapping does not incorporate subbands and accordingly bin 15 (corresponding to 440 Hz) is mapped to channels 1 and 2.
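The shared bin at the bottom of the Smt-MF range can be checked by assigning each semitone frequency to an FFT bin. The sketch below assumes that zero-based bin k spans [k·Δf, (k+1)·Δf) with Δf = 31.25 Hz; this bin-edge convention and the function name `bin_index` are assumptions made for illustration, and the bin numbers in the text appear to be one-based.

```python
FS, N = 16000, 512
BIN_HZ = FS / N                      # 31.25 Hz

def bin_index(f):
    """Zero-based FFT bin, assuming bin k spans [k*BIN_HZ, (k+1)*BIN_HZ)."""
    return int(f // BIN_HZ)

# 43 semitone frequencies starting at A4 (440 Hz), lowest first.
tones = [440.0 * 2.0 ** (i / 12.0) for i in range(43)]
bins = [bin_index(f) for f in tones]

# Zero-based pairs of successive tones that fall into the same bin.
shared = [(i, i + 1) for i in range(len(bins) - 1) if bins[i] == bins[i + 1]]

print("first bin (one-based):", bins[0] + 1)
print("tone pairs sharing a bin:", shared)
```

Under this convention only the first two tones (A4 and A4#) share a bin, which corresponds to one-based bin 15, in line with the statement that bin 15 is mapped to channels 1 and 2.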
Nucleus Matlab Toolbox
The Smt mapping follows the ACE strategy in selecting the highest n channels. It was implemented in Matlab and incorporated into the Nucleus Matlab Toolbox (NMT) framework [15]. The acoustic model (AMO) was based on noise band vocoders [22]. The activity in each channel is simulated as a white noise convolved with an exponentially decaying filter, where its center frequency is the characteristic frequency of the channel. Channel interactions arising from the spread of the electric field from its center at the stimulation site can be set by a "width of stimulation" parameter. The resulting stimulation of the auditory nerve, causing also the perception of adjacent pitches, is simulated with the AMO. In the Smt-LF mapping, the AMO simulated the frequency transposition.
Analysis
Following the definition of melody in [4], a better representation of individual musical tones is expected to improve melody recognition; in other words, melody is poorly resolved if individual musical tones are poorly represented. Musical tones are characterized by their harmonic structure. To compare the harmonic structure representation for the three different mappings (Std, Smt-MF, and Smt-LF), a sound sequence consisting of 36 consecutive synthetic musical tones was constructed. Each tone consisted of five partials with a successive 20% decrease in amplitude and lasted for 150 ms. The fundamental frequency of the tones increased from 130 Hz (C3) to 987 Hz (B5) in 1-semitone steps.
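A test sequence of this kind can be generated in a few lines. The sketch below is one possible reading of the description, not the stimulus code used in the study: it assumes a 16 kHz sampling rate, sine-phase partials, and that the "20% decrease" means each partial has 0.8 times the amplitude of the partial below it. The function name `make_tone` is hypothetical.

```python
import math

FS = 16000            # sampling rate in Hz (assumed)
TONE_DUR = 0.150      # duration of each tone in seconds
N_PARTIALS = 5
DECAY = 0.8           # assumed: each partial is 20% weaker than the previous one

def make_tone(f0):
    """One harmonic complex tone, returned as a list of samples."""
    n = int(round(TONE_DUR * FS))
    amps = [DECAY ** h for h in range(N_PARTIALS)]
    norm = sum(amps)  # scaling keeps every sample within [-1, 1]
    return [sum(amp * math.sin(2 * math.pi * (h + 1) * f0 * t / FS)
                for h, amp in enumerate(amps)) / norm
            for t in range(n)]

# 36 fundamentals in one-semitone steps from C3 (130.81 Hz) up to B5.
f0s = [130.81 * 2.0 ** (i / 12.0) for i in range(36)]
sequence = [sample for f0 in f0s for sample in make_tone(f0)]

print(len(f0s), round(f0s[-1], 1), len(sequence))
```

The highest partial in this sequence lies near 4.9 kHz, below the 8 kHz Nyquist limit of the assumed sampling rate.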
Figure 7 shows the harmonic structure being preserved with both Smt-MF and Smt-LF mappings, where the spacing of the partials remains uniform across tones. With the Std mapping at low frequencies, partials are not resolved. With Smt-MF, frequency components below 440 Hz are filtered out (as indicated by arrows in Figure 7), while with Smt-LF, the high frequency partials greater than 1.6 kHz are filtered out.
Pilot test
A pitch difference limen test was conducted using pairs of single harmonic pure tones with 1, 3, and 6 semitone intervals. Tones were preprocessed with a CI acoustic model [23] that uses a noise band vocoder in the resynthesis algorithm. The CI acoustic resynthesis model was used to simulate the sound CI patients may perceive and to present it to NH subjects. The model assumes that there is no change in the effective spread of excitation width between 43 and 22 channels. Pure tones were used that corresponded to fundamental frequencies of musical tones. All tones were modified to have the same temporal envelope and a duration of 0.5 s. The onset and offset of all tones were faded with 30 ms attack and release times. The reference note (D) was used for all tone groups (1, 3, and 6 semitone intervals).
Procedure
All tones were processed using 22 and 43 channels with the acoustic model using a stimulation width of 1 mm, because the studies in [22, 24] found that a width of stimulation of around 1 mm produced electrode discrimination similar to that of average Nucleus CI24 recipients. Sound samples were then normalized to have equal loudness. NH subjects were seated in front of a loudspeaker at a distance of 1.5 m and sounds were presented at a level of 70 dBA. MACarena [25] software was used to randomly select and play a pair of tones from three octave groups: octaves 3, 4, and 5. The tone pairs from each octave group were D-D#, D-F, and D-G#, with 1, 3, and 6 semitone intervals, respectively. The randomization was intended to minimize learning effects of tone sequences. For each group the same number of repetitions was presented. The two tones of a pair were presented sequentially with a pause of 0.5 s in between. Levels were roved by ± 6 dB to prevent loudness cues from being used. Eight NH subjects aged between 27 and 55 years took part in this experiment.
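The stimulus set described above can be laid out as follows. This is an illustrative sketch and not the MACarena configuration: it assumes equal-tempered tuning with A4 = 440 Hz, MIDI note numbers for the reference note D, a uniform distribution for the level rove, linear fade ramps, and a 16 kHz sampling rate. All function and variable names are hypothetical.

```python
import math
import random

def midi_to_hz(m):
    """Equal-tempered frequency with A4 (MIDI note 69) at 440 Hz."""
    return 440.0 * 2.0 ** ((m - 69) / 12.0)

D_MIDI = {3: 50, 4: 62, 5: 74}         # reference note D in octaves 3 to 5
INTERVALS = (1, 3, 6)                  # D-D#, D-F, D-G#

# (octave, interval in semitones, reference frequency, comparison frequency)
pairs = [(octave, k, midi_to_hz(m), midi_to_hz(m + k))
         for octave, m in D_MIDI.items() for k in INTERVALS]

def rove_gain(rng, max_db=6.0):
    """Linear gain for a level rove drawn uniformly from +/- max_db (assumed)."""
    return 10.0 ** (rng.uniform(-max_db, max_db) / 20.0)

def fade_envelope(n_samples, fs=16000, ramp_s=0.030):
    """Attack and release ramps of 30 ms; the linear shape is an assumption."""
    r = int(round(ramp_s * fs))
    return [min(1.0, i / r, (n_samples - 1 - i) / r) for i in range(n_samples)]

rng = random.Random(0)
order = pairs[:]
rng.shuffle(order)                     # randomized presentation order
gains = [rove_gain(rng) for _ in order]
envelope = fade_envelope(8000)         # 0.5 s tone at the assumed 16 kHz
```

The nine pairs cover D3, D4, and D5 (about 146.8, 293.7, and 587.3 Hz) as reference tones, and each roved gain lies between roughly 0.5 and 2.0 in linear terms.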