Our proposed system is illustrated in Figure 2. The input to the system is a monaural polyphonic mixture consisting of two instrument sounds (see Section 3 for details). In the TF decomposition stage, the system decomposes the input into its frequency components using an auditory filterbank and divides the output of each filter into overlapping frames, resulting in a matrix of TF units. The next stage computes a correlogram from the filter outputs. At the same time, the pitch contours of different instrument sounds are detected in the multipitch detection module. Multipitch detection for musical mixtures is a difficult problem because of the harmonic relationship of notes and huge variations of spectral shapes in instrument sounds [11]. Since the main focus of this study is to investigate the performance of pitch-based separation in music, we do not perform multiple pitch detection (indicated by the dashed box); instead we supply the system with pitch contours detected from premixed instrument sounds. In the pitch-based labeling stage, pitch points, that is, pitch values at each frame, are used to determine which instrument each TF unit should be assigned to. This creates a temporary binary mask for each instrument. After that, each T-segment, to be explained in Section 2.3, is classified as overlapped or nonoverlapped. Nonoverlapped T-segments are directly passed to the resynthesis stage. For overlapped T-segments, the system exploits the information obtained from nonoverlapped T-segments to decide which source is stronger and relabel accordingly. The system outputs instrument sounds resynthesized from the corresponding binary masks. The details of each stage are explained in the following subsections.
2.1. Time-Frequency Decomposition
In this stage, the input is first decomposed into its frequency components with a filterbank consisting of 128 gammatone filters (also called channels). The impulse response of a gammatone filter is
\[
g(t) = t^{N-1} e^{-2\pi b t} \cos(2\pi f_c t), \quad t \ge 0,
\tag{1}
\]
where $N$ is the order of the gammatone filter, $f_c$ is the center frequency of the filter, and $b$ is related to the bandwidth of the filter [12] (see also [3]).
The center frequencies of the filters are linearly distributed on the so-called "ERB-rate" scale, $E(f)$, which is related to frequency by
\[
E(f) = 21.4 \log_{10}\!\left(\frac{4.37 f}{1000} + 1\right).
\tag{2}
\]
It can be seen from the above equation that the center frequencies of the filters are approximately linearly spaced in the low frequency range and logarithmically spaced in the high frequency range. Therefore more filters are placed in the low frequency range, where speech energy is concentrated.
In most speech separation tasks, the parameter $b$ of a fourth-order gammatone filter is usually set to
\[
b = 1.019\,\mathrm{ERB}(f_c), \qquad \mathrm{ERB}(f_c) = 24.7\!\left(\frac{4.37 f_c}{1000} + 1\right),
\tag{3}
\]
where $\mathrm{ERB}(f_c)$ is the equivalent rectangular bandwidth of the filter with the center frequency $f_c$. This bandwidth is adequate when the intelligibility of separated speech is the main concern. However, for musical sound separation, the 1-ERB bandwidth appears too wide for analysis and resynthesis, especially in the high frequency range. We have found that using narrower bandwidths, which provide better frequency resolution, can significantly improve the quality of separated sounds. In this study we set the bandwidth to a quarter ERB. The center frequencies of the channels are spaced from 50 to 8000 Hz. Hu [13] showed that a 128-channel gammatone filterbank with a bandwidth of 1 ERB per filter has a flat frequency response within the passband from 50 to 8000 Hz. Similarly, it can be shown that a gammatone filterbank with the same number of channels but a bandwidth of 1/4 ERB per filter still provides a fairly flat frequency response over the same passband. By a flat response we mean that the summed responses of all the gammatone filters do not vary with frequency.
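To make the filterbank configuration concrete, the following Python sketch computes 128 center frequencies equally spaced on the ERB-rate scale of (2) and the corresponding quarter-ERB bandwidth parameters from (3). It is a minimal sketch: the function names are ours, and the 50–8000 Hz passband follows the description above.

    import numpy as np

    def erb_rate(f):
        """ERB-rate scale E(f), per (2)."""
        return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

    def inverse_erb_rate(e):
        """Frequency in Hz corresponding to ERB-rate value e."""
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    def erb(fc):
        """Equivalent rectangular bandwidth (Hz) at center frequency fc, per (3)."""
        return 24.7 * (4.37 * fc / 1000.0 + 1.0)

    # 128 channels with center frequencies equally spaced on the ERB-rate scale
    n_channels = 128
    f_lo, f_hi = 50.0, 8000.0  # passband as described in the text
    center_freqs = inverse_erb_rate(
        np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_channels))

    # Quarter-ERB bandwidth parameter b for each fourth-order gammatone filter
    b = 1.019 * 0.25 * erb(center_freqs)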
After auditory filtering, the output of each channel is divided into overlapping time frames; the frame length and frame shift are denoted $N$ and $T$ in the next subsection.
2.2. Correlogram
After TF decomposition, the system computes a correlogram, $A(c, m, \tau)$, a well-known mid-level auditory representation [3, Chapter 1]. Specifically, $A(c, m, \tau)$ is computed as
\[
A(c, m, \tau) = \sum_{n=0}^{N-1} x(c, mT + n)\, x(c, mT + n + \tau),
\tag{4}
\]
where $x(c, \cdot)$ is the output of filter $c$. Here $c$ is the channel index, $m$ is the time frame index, $N$ is the frame length, $T$ is the frame shift, and $\tau$ is the time lag. Similarly, a normalized correlogram, $\hat{A}(c, m, \tau)$, can be computed for TF unit $u_{cm}$ as
\[
\hat{A}(c, m, \tau) = \frac{\sum_{n=0}^{N-1} x(c, mT + n)\, x(c, mT + n + \tau)}
{\sqrt{\sum_{n=0}^{N-1} x^2(c, mT + n)\,\sum_{n=0}^{N-1} x^2(c, mT + n + \tau)}}.
\tag{5}
\]
The normalization converts correlogram values to the range $[-1, 1]$, with a value of 1 at the zero time lag.
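As an illustration, the following sketch computes the normalized correlogram of one TF unit directly from a filter output, per (5). Variable names are ours and boundary handling is simplified.

    import numpy as np

    def normalized_correlogram(x_c, m, frame_len, frame_shift, max_lag):
        """Normalized autocorrelation A_hat(c, m, tau) of one filter output.

        x_c         : 1-D array, output of gammatone channel c
        m           : time frame index
        frame_len   : N, frame length in samples
        frame_shift : T, frame shift in samples
        max_lag     : largest time lag (in samples) to evaluate
        Assumes x_c extends at least max_lag samples past the frame.
        """
        start = m * frame_shift
        frame = x_c[start:start + frame_len]
        a_hat = np.empty(max_lag + 1)
        for tau in range(max_lag + 1):
            shifted = x_c[start + tau:start + tau + frame_len]
            num = np.dot(frame, shifted)
            den = np.sqrt(np.dot(frame, frame) * np.dot(shifted, shifted))
            a_hat[tau] = num / den if den > 0 else 0.0
        return a_hat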
Several existing CASA systems for speech separation have used the envelope of filter outputs for autocorrelation calculation in the high frequency range, with the intention of encoding the beating phenomenon resulting from unresolved harmonics (e.g., [8]). A harmonic is called resolved if there exists a frequency channel that primarily responds to it; otherwise it is unresolved [8]. However, due to the narrower bandwidth used in this study, different harmonics from the same source are unlikely to activate the same frequency channel. Figure 3 plots the bandwidths corresponding to 1 ERB and 1/4 ERB with respect to the channel number. From Figure 3 we can see that the bandwidths of most filter channels are smaller than the lowest pitches most instruments can produce. As a result, an extracted envelope would correspond to either the fluctuation of a harmonic's amplitude or the beating created by harmonics from different sources. In both cases, the envelope information would be misleading. Therefore we do not extract envelope autocorrelation.
2.3. Pitch-Based Labeling
After the correlogram is computed, we label each TF unit $u_{cm}$ using single-source pitch points detected from the premixed sound sources. Since we are concerned only with 2-source separation, we consider at each TF unit the values of $\hat{A}(c, m, \tau)$ at the time lags that correspond to the pitch periods, $\tau_1(m)$ and $\tau_2(m)$, of the two sources. Because the correlogram provides a measure of pitch strength, a natural choice is to compare $\hat{A}(c, m, \tau_1(m))$ and $\hat{A}(c, m, \tau_2(m))$ and assign the TF unit accordingly, that is,
\[
L(c, m) =
\begin{cases}
1 & \text{if } \hat{A}(c, m, \tau_1(m)) \ge \hat{A}(c, m, \tau_2(m)), \\
0 & \text{otherwise.}
\end{cases}
\tag{6}
\]
Intuitively, if source 1 has stronger energy at $u_{cm}$ than source 2, the correlogram reflects the contribution of source 1 more than that of source 2, and the autocorrelation value at $\tau_1(m)$ is expected to be higher than that at $\tau_2(m)$. Due to the nonlinearity of the autocorrelation function and its sensitivity to the relative phases of harmonics, this intuition may not always hold. Nonetheless, empirical evidence shows that this labeling is reasonably accurate. It has been reported for cochannel speech separation that using both pitch points for labeling, as in (6), gives better results than using only one pitch point [13]. Figure 4 shows the percentage of correctly labeled TF units for each channel. We consider a TF unit correctly labeled if the label based on (6) agrees with the IBM. The plot is generated by comparing pitch-based labeling using (6) to the IBM for all the musical pieces in our database (see Section 3). It can be seen that labeling is well above the chance level for most channels. The poor labeling accuracy in the low-numbered channels is due to the fact that the pitches of the instrument sounds in our database lie above the center frequencies of these channels; the low-numbered channels therefore contain little energy, and labeling in them is not reliable.
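A minimal sketch of the decision in (6), assuming the normalized autocorrelation (e.g., from normalized_correlogram above) and the two per-frame pitch periods in samples are available:

    def label_tf_unit(a_hat, tau1, tau2):
        """Label a TF unit per (6): 1 if source 1's pitch period has the
        stronger normalized autocorrelation, else 0.

        a_hat      : normalized autocorrelation values for this TF unit
        tau1, tau2 : pitch periods (in samples) of sources 1 and 2
        """
        return 1 if a_hat[tau1] >= a_hat[tau2] else 0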
Figure 5 plots the percentage of correctly labeled TF units according to (6) with respect to the local energy ratio, obtained from the same pieces as in Figure 4. The local energy ratio is calculated as
\[
R(c, m) = \left| 10 \log_{10} \frac{E_1(c, m)}{E_2(c, m)} \right|,
\]
where $E_1(c, m)$ and $E_2(c, m)$ are the energies of the two sources at $u_{cm}$. The local energy ratio is calculated using premixed signals. Note that the local energy ratio is measured in decibels and $R(c, m) \ge 0$; hence the definition is symmetric with respect to the two sources. When the local energy ratio is high, one source is dominant and pitch-based labeling gives excellent results. A low local energy ratio indicates that the two sources have close energies at $u_{cm}$. Since harmonics with sufficiently different frequencies will not have close energy in the same frequency channel, a low local energy ratio also implies that at $u_{cm}$ harmonics from two different sources have close (or the same) frequencies. As a result, the autocorrelation function will likely have close values at both pitch periods. In this case, the decision becomes unreliable and therefore the percentage of correct labeling is low.
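For concreteness, a one-line implementation of this symmetric ratio (function name ours; both energies assumed positive):

    import numpy as np

    def local_energy_ratio_db(e1, e2):
        """Symmetric local energy ratio |10*log10(E1/E2)| in dB."""
        return abs(10.0 * np.log10(e1 / e2))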
Although pitch-based labeling (see (6)) works well, it has two problems. The first is that the decision is made locally: the labeling of each TF unit is independent of the labeling of its neighboring TF units. Studies have shown that labeling a larger auditory entity, such as a TF segment, can often improve performance. In fact, the emphasis on segmentation is considered a unique aspect of CASA systems [3, Chapter 1]. The second problem is overlapping harmonics. As mentioned before, in TF units where two harmonics from different sources overlap spectrally, unit labeling breaks down and the decision becomes unreliable. To address the first problem, we construct T-segments and make decisions based on T-segments instead of individual TF units. For the second problem, we exploit the observation that sounds from the same source tend to have similar spectral envelopes.
The concept of a T-segment was introduced in [13] (see also [14]). A segment is a set of contiguous TF units that are supposed to originate mainly from the same source. A T-segment is a segment in which all the TF units have the same center frequency. Hu noted that using T-segments gives a better balance between rejecting energy from a target source and accepting energy from the interference than TF segments do [13]. In other words, compared to TF segments, T-segments achieve a good compromise between false rejection and false acceptance. Since musical sounds tend to be stable, a T-segment naturally corresponds to a frequency component from its onset to its offset. To obtain T-segments, we use pitch information to determine onset times. If the difference between two consecutive pitch points is more than one semitone, it is considered an offset occurrence for the first pitch point and an onset occurrence for the second. The set of all the TF units between an onset/offset pair in the same channel defines a T-segment, as sketched below.
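The following sketch of this onset/offset detection assumes a continuous, voiced pitch contour for a single source (function name and data layout are ours):

    def onset_offset_frames(pitch_hz):
        """Split a pitch contour into note regions at jumps of more than
        one semitone between consecutive pitch points.

        pitch_hz : 1-D sequence of per-frame pitch values (Hz), all > 0
        Returns a list of (onset_frame, offset_frame) pairs.
        """
        semitone = 2.0 ** (1.0 / 12.0)
        boundaries = [0]
        for m in range(1, len(pitch_hz)):
            hi = max(pitch_hz[m], pitch_hz[m - 1])
            lo = min(pitch_hz[m], pitch_hz[m - 1])
            if hi / lo > semitone:  # jump of more than one semitone
                boundaries.append(m)
        boundaries.append(len(pitch_hz))
        return [(boundaries[i], boundaries[i + 1] - 1)
                for i in range(len(boundaries) - 1)]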
For each T-segment, we first determine whether it is overlapped or nonoverlapped. If harmonics from two sources overlap at channel $c$, then $\hat{A}(c, m, \tau_1(m)) \approx \hat{A}(c, m, \tau_2(m))$. A TF unit is considered overlapped if at that unit $|\hat{A}(c, m, \tau_1(m)) - \hat{A}(c, m, \tau_2(m))| < \theta$, where $\theta$ is a small threshold chosen empirically. If at least half of the TF units in a T-segment are overlapped, the T-segment is considered overlapped; otherwise, it is considered nonoverlapped. With overlapped T-segments, we can also determine which harmonics of each source are overlapped. Given an overlapped T-segment at channel $c$, the frequency of the overlapping harmonics can be roughly approximated by the center frequency of the channel. Using the pitch contour of each source, we can identify the harmonic number of each overlapped harmonic. All other harmonics are considered nonoverlapped.
Since each T-segment supposedly comes from a single source, all the TF units within a T-segment should have the same labeling. For the TF units within a nonoverlapped T-segment, we perform labeling as follows:
\[
L(c, m) =
\begin{cases}
1 & \text{if } \sum_{u_{cm} \in U_1} A(c, m, 0) > \sum_{u_{cm} \in U_0} A(c, m, 0), \\
0 & \text{otherwise,}
\end{cases}
\tag{7}
\]
where $U_1$ and $U_0$ are the sets of TF units in the T-segment previously labeled as 1 and 0 (see (6)), respectively. The zero-lag value $A(c, m, 0)$ gives the energy of $u_{cm}$. Equation (7) means that, in a T-segment, if the total energy of the TF units labeled as the first source is greater than that of the TF units labeled as the second source, all the TF units in the T-segment are labeled as the first source; otherwise, they are labeled as the second source. Although this labeling scheme works for nonoverlapped T-segments, it cannot be extended to overlapped T-segments because the labeling of TF units in an overlapped T-segment is not reliable.
We summarize the above pitch-based labeling in pseudocode as Algorithm 1.
Algorithm 1: Pitch-based labeling.
for each T-segment between an onset/offset pair at frequency channel $c$ do
    for each TF unit $u_{cm}$ indexed by $c$ and $m$ do
        increase TotalTFUnitCount by 1
        if $|\hat{A}(c, m, \tau_1(m)) - \hat{A}(c, m, \tau_2(m))| < \theta$ then
            increase OverlapTFUnitCount by 1
        else
            increase NonOverlapTFUnitCount by 1
        end if
    end for
    if OverlapTFUnitCount $\ge$ TotalTFUnitCount$/2$ then
        the T-segment is overlapped
    else
        the T-segment is nonoverlapped
    end if
    if the T-segment is nonoverlapped then
        for each TF unit $u_{cm}$ indexed by $c$ and $m$ do
            if $\hat{A}(c, m, \tau_1(m)) \ge \hat{A}(c, m, \tau_2(m))$ then
                add $u_{cm}$ to $U_1$
            else
                add $u_{cm}$ to $U_0$
            end if
        end for
        if $\sum_{u_{cm} \in U_1} A(c, m, 0) > \sum_{u_{cm} \in U_0} A(c, m, 0)$ then
            label all the TF units in the T-segment as source 1
        else
            label all the TF units in the T-segment as source 2
        end if
    end if
end for
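For readers who prefer running code, the following Python sketch mirrors Algorithm 1 for a single T-segment. The data layout and the idea of passing the threshold in as a parameter are our assumptions.

    def process_t_segment(a_hat, a_zero_lag, tau1, tau2, theta):
        """Classify one T-segment and, if nonoverlapped, label its TF units.

        a_hat      : list of per-frame normalized autocorrelations (indexable by lag)
        a_zero_lag : per-frame unnormalized zero-lag energies A(c, m, 0)
        tau1, tau2 : per-frame pitch periods (in samples) of the two sources
        theta      : overlap threshold (value not specified here; an assumption)
        Returns ('overlapped', None) or ('nonoverlapped', labels).
        """
        diffs = [abs(a[t1] - a[t2]) for a, t1, t2 in zip(a_hat, tau1, tau2)]
        if sum(d < theta for d in diffs) >= len(diffs) / 2.0:
            return 'overlapped', None

        # Per-unit labels per (6), then one segment-level label per (7)
        unit_labels = [1 if a[t1] >= a[t2] else 0
                       for a, t1, t2 in zip(a_hat, tau1, tau2)]
        e1 = sum(e for e, l in zip(a_zero_lag, unit_labels) if l == 1)
        e0 = sum(e for e, l in zip(a_zero_lag, unit_labels) if l == 0)
        label = 1 if e1 > e0 else 0
        return 'nonoverlapped', [label] * len(unit_labels)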
2.4. Relabeling
To make binary decisions for an overlapped T-segment, it is helpful to know the energies of the two sources in that T-segment. One possibility is to use the spectral smoothness principle [15] to estimate the amplitude of an overlapped harmonic by interpolating from its neighboring nonoverlapped harmonics. However, the spectral smoothness principle does not hold well for many real instrument sounds. Another way to estimate the amplitude of an overlapped harmonic is to use an instrument model, which may consist of templates of spectral envelopes of an instrument [16]. However, instrument models of this nature are unlikely to work due to the enormous intra-instrument variations of musical sounds. When training and test conditions differ, instrument models would be ineffective.
Intra-instrument variations of musical sounds result from many factors, such as different makers of the same instrument, different players, and different playing styles. However, in the same musical recording, the sound from a given source is played by the same player using the same instrument, typically with the same playing style. Therefore we can reasonably assume that sounds from the same source in a musical recording share similar spectral envelopes. As a result, it is possible to use the spectral envelope of other sound components of the same source to estimate overlapped harmonics. Concretely speaking, consider an instrument playing notes $n_1$ and $n_2$ consecutively. Let the $h$th harmonic of note $n_1$ be overlapped by some other instrument sound. If the spectral envelopes of note $n_1$ and note $n_2$ are similar and harmonic $h$ of $n_2$ is reliable, the overlapped harmonic of $n_1$ can be estimated. By similar spectral envelopes we mean
\[
\frac{a_k^{n_1}}{a_1^{n_1}} \approx \frac{a_k^{n_2}}{a_1^{n_2}}, \quad k = 1, 2, \ldots,
\tag{8}
\]
where $a_k^{n_1}$ and $a_k^{n_2}$ are the amplitudes of the $k$th harmonics of note $n_1$ and note $n_2$, respectively. In other words, the amplitudes of corresponding harmonics of the two notes are approximately proportional. Figure 6 shows the log-amplitude average spectra of eight notes played by a clarinet. The note samples are extracted from the RWC instrument database [17]. The average spectrum of a note is obtained by averaging the entire spectrogram over the note duration. The note frequencies range from D (293 Hz) to A (440 Hz). As can be seen, the relative amplitudes of these notes are similar. In this example, the average correlation of the amplitudes of the first ten harmonics between two neighboring notes is 0.956.
If the $h$th harmonic of $n_1$ is overlapped while the same-numbered harmonic of $n_2$ is not, then using (8) we can estimate the amplitude of harmonic $h$ of $n_1$ as
\[
\hat{a}_h^{n_1} = \frac{a_1^{n_1}}{a_1^{n_2}}\, a_h^{n_2}.
\tag{9}
\]
In the above equation, we assume that the first harmonics of both notes are not overlapped. If the first harmonic of $n_1$ is also overlapped, then all the harmonics of $n_1$ will be overlapped. Currently our system is not able to handle this extreme situation. If the first harmonic of note $n_2$ is overlapped, we try to find some other note in which both the first harmonic and harmonic $h$ are reliable. Note from (9) that, with an appropriate note, the overlapped harmonic can be recovered from the overlapped region without knowledge of the other overlapped harmonic. In other words, using temporal contextual information, it is possible to extract the energy of only one source.
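A direct transcription of (9) as a helper (names ours):

    def estimate_overlapped_amplitude(a1_n1, a1_n2, ah_n2):
        """Estimate the overlapped h-th harmonic of note n1 from a similar
        note n2, per (9): a_h(n1) ~= (a_1(n1) / a_1(n2)) * a_h(n2).

        a1_n1 : amplitude of the first harmonic of note n1 (nonoverlapped)
        a1_n2 : amplitude of the first harmonic of note n2 (nonoverlapped)
        ah_n2 : amplitude of the h-th harmonic of note n2 (nonoverlapped)
        """
        return (a1_n1 / a1_n2) * ah_n2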
It can be seen from (9) that the key to estimating overlapped harmonics is to find a note with a similar spectral envelope. Given an overlapped harmonic $h$ of note $n_1$, one approach to finding an appropriate note is to search the neighboring notes from the same source. If harmonic $h$ of a note is nonoverlapped, then that note is chosen for estimation. However, it has been shown that spectral envelopes are pitch dependent [18] and related to the dynamics of an instrument nonlinearly. To minimize the variations introduced by pitch as well as dynamics and to improve the accuracy of binary decisions, we search notes within a temporal window and choose the one with the closest spectral envelope. Specifically, consider again note $n_1$ with harmonic $h$ overlapped. Within a temporal window, we first identify the set of nonoverlapped harmonics, denoted $H^{n}$, for each note $n$ from the same instrument as note $n_1$. We then check every $n$ and find the harmonics that are nonoverlapped in both notes $n_1$ and $n$, that is, the intersection of $H^{n_1}$ and $H^{n}$. After that, we calculate the correlation of the two notes, $r(n_1, n)$, based on the amplitudes of these nonoverlapped harmonics. The correlation is obtained by
\[
r(n_1, n) = \frac{\sum_{k \in H^{n_1} \cap H^{n}} a_k^{n_1} a_k^{n}}
{\sqrt{\sum_{k \in H^{n_1} \cap H^{n}} \left(a_k^{n_1}\right)^2 \sum_{k \in H^{n_1} \cap H^{n}} \left(a_k^{n}\right)^2}},
\tag{10}
\]
where $k$ ranges over the common harmonic numbers of the nonoverlapped harmonics of both notes. After this is done for each such note $n$, we choose the note $n_2$ that has the highest correlation with note $n_1$ and whose $h$th harmonic is nonoverlapped. The temporal window in general should be centered on the note being considered and long enough to include multiple notes from the same source. However, in this study, since each test recording is 5 seconds long (see Section 3), the temporal window is set to the duration of the recording. Note that, for this procedure to work, we assume that the playing style does not change much within the search window.
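A sketch of this search, combining (10) with the requirement that the candidate's $h$th harmonic be nonoverlapped; the data layout is our assumption:

    import numpy as np

    def best_matching_note(n1_amps, n1_nonoverlap, candidates, h):
        """Choose the note with the highest envelope correlation (10) whose
        h-th harmonic is nonoverlapped.

        n1_amps       : dict {harmonic number: amplitude} for note n1
        n1_nonoverlap : set of nonoverlapped harmonic numbers of n1
        candidates    : list of (amps, nonoverlap_set) for other notes of the source
        h             : harmonic number of n1 that is overlapped
        """
        best, best_r = None, -np.inf
        for amps, nonoverlap in candidates:
            if h not in nonoverlap:
                continue  # its h-th harmonic must be reliable
            common = sorted(n1_nonoverlap & nonoverlap)
            if not common:
                continue
            a = np.array([n1_amps[k] for k in common])
            b = np.array([amps[k] for k in common])
            r = np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))
            if r > best_r:
                best, best_r = (amps, nonoverlap), r
        return best, best_r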
The above procedure is illustrated in Figure 7. In the figure, the note under consideration, $n_1$, has its fourth harmonic (indicated by an open arrowhead) overlapped with a harmonic (indicated by a dashed line with an open square) from the other source. To uncover the amplitude of the overlapped harmonic, the nonoverlapped harmonics (indicated by filled arrowheads) of note $n_1$ are compared to the same harmonics of the other notes of the same source in a temporal window using (10). In this case, note $n_2$ has the highest correlation with note $n_1$.
After the appropriate note is identified, the amplitude of harmonic $h$ of note $n_1$ is estimated according to (9). Similarly, the amplitude of the other overlapped harmonic (i.e., the dashed line in Figure 7) can be estimated. As mentioned before, the labeling of the overlapped T-segment depends on the relative overall energies of the two overlapping harmonics. If the overall energy of the harmonic from source 1 in the T-segment is greater than that of the harmonic from source 2, all the TF units in the T-segment will be labeled as source 1; otherwise, they will be labeled as source 2. Since the amplitude of a harmonic is calculated as the square root of the harmonic's overall energy (see the next paragraph), we label all the TF units in the T-segment based on the relative estimated amplitudes of the two harmonics; that is, all the TF units are labeled as 1 if the estimated amplitude for source 1 is greater than that for source 2, and 0 otherwise.
The above procedure requires the amplitude of each nonoverlapped harmonic. This can be obtained using single-source pitch points and the activation pattern of the gammatone filters. For a given harmonic, we use the median pitch points of each note over the time period of a T-segment to determine the frequency of the harmonic. We then identify the frequency channel that is most strongly activated. If the T-segment in that channel is not overlapped, the harmonic amplitude is taken as the square root of the overall energy of that T-segment. Note that the harmonic amplitude refers to the strength of a harmonic over the entire duration of a note.
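A sketch of this measurement, approximating the most strongly activated channel by the channel whose center frequency is nearest to the harmonic frequency (an approximation of ours; names ours):

    import numpy as np

    def harmonic_amplitude(k, median_pitch_hz, center_freqs,
                           segment_energy, is_overlapped):
        """Measure the amplitude of harmonic k of a note.

        median_pitch_hz : median pitch of the note over the T-segment (Hz)
        center_freqs    : gammatone channel center frequencies (Hz)
        segment_energy  : per-channel overall energy of the note's T-segments
        is_overlapped   : per-channel overlap flags for those T-segments
        Returns the amplitude, or None if the harmonic is overlapped.
        """
        f_k = k * median_pitch_hz  # harmonic frequency
        c = int(np.argmin(np.abs(center_freqs - f_k)))  # proxy for the most
        if is_overlapped[c]:                            # activated channel
            return None
        return np.sqrt(segment_energy[c])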
We summarize the above relabeling in Algorithm 2.
Algorithm 2: Relabeling.
for each overlapped T-segment do
    for each source overlapping at the T-segment do
        get the harmonic number $h$ of the overlapped note $n_1$
        get the set of nonoverlapped harmonics, $H^{n_1}$, for $n_1$
        for each note $n$ from the same source do
            get the set of nonoverlapped harmonics, $H^{n}$, for $n$
            get the correlation of $n_1$ and $n$ using (10)
        end for
        find the note, $n_2$, with the highest correlation and with harmonic $h$ nonoverlapped
        find $\hat{a}_h^{n_1}$ based on (9)
    end for
    if the estimated amplitude $\hat{a}$ for source 1 $>$ the estimated amplitude $\hat{a}$ for source 2 then
        label all the TF units in the T-segment as source 1
    else
        label all the TF units in the T-segment as source 2
    end if
end for
2.5. Resynthesis
The resynthesis is performed using a technique introduced by Weintraub [19] (see also [3, Chapter 1]). During resynthesis, the output of each filter is first phase-corrected and then divided into time frames using a raised-cosine window with the same frame size used in TF decomposition. The responses of individual TF units are weighted according to the obtained binary mask and summed over all frequency channels and time frames to produce a reconstructed audio signal. The resynthesis pathway allows the quality of the separated lines to be assessed quantitatively.
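A simplified sketch of the masked overlap-add step follows. It assumes the per-channel phase correction has already been applied to the filter outputs and uses a raised-cosine (Hann) window; names are ours.

    import numpy as np

    def resynthesize(filter_outputs, mask, frame_shift, frame_len):
        """Masked overlap-add resynthesis from gammatone filter outputs.

        filter_outputs : array (n_channels, n_samples), phase-corrected outputs
        mask           : binary array (n_channels, n_frames)
        frame_shift    : frame shift in samples
        frame_len      : frame length in samples (raised-cosine window)
        """
        n_channels, n_samples = filter_outputs.shape
        window = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(frame_len) / frame_len))
        out = np.zeros(n_samples)
        for c in range(n_channels):
            for m in range(mask.shape[1]):
                if not mask[c, m]:
                    continue  # TF unit assigned to the other source
                start = m * frame_shift
                stop = min(start + frame_len, n_samples)
                out[start:stop] += window[:stop - start] * filter_outputs[c, start:stop]
        return out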