Towards multidimensional attentive voice tracking—estimating voice state from auditory glimpses with regression neural networks and Monte Carlo sampling

Selective attention is a crucial ability of the auditory system. Computationally, following an auditory object can be illustrated as tracking its acoustic properties, e.g., pitch, timbre, or location in space. The difficulty is related to the fact that in a complex auditory scene, the information about the tracked object is not available in a clean form. The more cluttered the sound mixture, the more time and frequency regions where the object of interest is masked by other sound sources. How does the auditory system recognize and follow acoustic objects based on this fragmentary information? Numerous studies highlight the crucial role of top-down processing in this task. Having in mind both auditory modeling and signal processing applications, we investigated how computational methods with and without top-down processing deal with increasing sparsity of the auditory features in the task of estimating instantaneous voice states, defined as a combination of three parameters: fundamental frequency F0 and formant frequencies F1 and F2. We found that the benefit from top-down processing grows with increasing sparseness of the auditory data.


Introduction
Selective auditory attention is essential in most real-life acoustic environments. Human listeners without hearing impairment tune out the irrelevant acoustic clutter and attentively follow sound objects with ease, but from the machine listening perspective, selective attention is a challenging task. For example, despite many technological advances, hearing aids still tend to amplify the background sounds together with the signal of interest. Other than filtering the sound based on specific properties like the spectral range, direction of arrival, or degree of interaural correlation, hearing devices have no ability to follow a specific sound object. The goal of this study is to contribute to understanding which computational strategies of human audition are still missing in audio algorithms.
Computationally, following an auditory object can be illustrated as tracking its acoustic properties, e.g., pitch, timbre, or location in space. Results of previous studies indicate that to track voices in a crowded acoustic space, a fusion of several dimensions representing different acoustic properties is required. These properties can be estimated based on the auditory features extracted from the acoustic signal. The difficulty is related to the fact that in a complex auditory scene, the information about the tracked object is not available in a clean form. The more cluttered the sound mixture, the more time and frequency regions where the object of interest is masked by other sound sources [1,2]. Previous studies suggest that the remaining sparse time-frequency regions where the voice of interest dominates over other sound objects (auditory glimpses) are essential in decoding auditory scenes. Glimpses provide robust information about the voice of interest and, hence, can be used as reliable cues for tracking.
However, incomplete glimpsed information taken out of context may be ambiguous at times: sparse glimpses alone may not provide enough evidence to be linked with a unique underlying cause. Many scholars believe that solving this ill-posed problem is possible due to the top-down processing involved in perception [3]. In contrast to a feed-forward processing system, in which the information travels directly from input to output, perception is frequently described as a top-down processing architecture, where the input features are confronted with the expectations formed at a higher level of abstraction. In a top-down system, the output depends both on the input and on some prior beliefs, which are a set of constraints restricting the possible outcomes of the task. By allocating the neural resources to the regions of expected high importance, the brain simplifies the task of making sense of fragmentary information available in a complex auditory scene.
In our previous work [4], we proposed a computational model of attentive tracking of competing voices [5], which combined top-down and bottom-up processing. The model was used to track F0 of two simultaneous voices based on sparse periodicity-based auditory features (sPAF) [6][7][8] extracted from the mixture of voices. It was realized with sequential Monte Carlo sampling (particle filters) [9], coupled with simple analytically designed probabilistic F0-models (described in detail in [4]). This process simulated attentive tracking in humans. We found that although the information carried by sPAF extracted from the mixture of two voices is sufficient to simultaneously track both F0s, the knowledge of F0 alone is not sufficient to correctly segregate the features. Our results confirmed that more voice properties need to be estimated to solve the attentive tracking task.
In this study, we extend the previously used system as follows: (1) instead of tracking only F0, we track voice states consisting of three parameters (fundamental frequency F0 and formant frequencies F1 and F2); (2) in the feature extraction, we include energy-based features instead of using solely periodicity-based features; (3) instead of a likelihood model of F0 for the periodicity-based features, we propose a joint F0, F1, F2-likelihood model for combined periodicity- and energy-based features. F0, F1, and F2 are related to speech production and perception and provide critical cues for the identification of speech sounds. F0 corresponds to the rate of vibration of the vocal cords, which determines the perceived pitch of the sound. F1 and F2, the first and the second lowest resonant frequencies of the vocal tract, are influenced primarily by the position of the tongue during speech production. They are both found to be critical for distinguishing between different vowels.
We investigate the potential of a new approach including all three parameters. Firstly, we test how much the performance is affected by the increasing sparsity of the auditory features. We bypass the segregation problem and let the model estimate the three-dimensional state based on already segregated target-related sPAF. We generate continuously voiced signals with defined state trajectories, extract sPAF features, and simulate a varying degree of difficulty in the sPAF features. Secondly, to investigate the benefit of top-down processing, we compare the proposed model with two classes of methods without top-down processing: non-sequential Monte Carlo sampling and regression neural networks. In the first class, a straightforward Monte Carlo simulation is used [10]: competing hypotheses are distributed across the possible range of parameter values and evaluated with the same likelihood model as used in the particle filter. This can be understood as a particle filter without a continuity model. In the second class, the regression neural network [11] learns the mapping between the sPAF and the voice state and applies it to predict the most likely instantaneous voice states. We know from [12] that this approach is successful for estimating the state of a single voice. However, it was not clear whether this purely bottom-up approach would be able to deal with sPAF extracted from more difficult auditory scenes, in which fewer target-related glimpses are available.
The main contributions of this study are:
1. Extension of the previously published voice state likelihood model (from F0 estimation to F0-F1-F2 estimation)
2. Comparison of the sampling-based voice tracking approach with a regression neural network
3. Introduction of a novel, perceptually motivated F0 tracking error measure
This paper is organized as follows: In Section 2, we introduce implementation details of the auditory feature extraction, four state estimation methods, and F0, F1, F2-data likelihood models. Section 3 reviews conditions in which the methods were evaluated and describes the performance measures. Section 4 guides the reader through the results, and Section 5 discusses the results in a broader context.

Sparse periodicity-based feature extraction
Feature extraction in this study is motivated by human auditory processing. We used the approach developed in [6][7][8] called sparse periodicity-based auditory feature (sPAF) extraction. We adopted this approach in [4] to track F0 of two competing voices. The method was designed to blindly extract auditory glimpses from the sound mixture. In particular, the auditory glimpses are here defined as salient tonal components across frequency. The sPAF extraction consists of three main stages:
1. Auditory pre-processing: This pre-processing stage simulates the sound processing in the peripheral auditory system, including:
(a) Acoustic modification due to the middle ear, implemented as a band-pass filter (0.5-2 kHz)
(b) Spectral analysis in the cochlea, implemented as a filterbank of 23 gammatone band-pass filters
(c) Dynamic range compression in the cochlea, implemented as a power law with an exponent of 0.4
(d) Neural transduction from the vibrations of the basilar membrane in the cochlea to electrical stimulation of the auditory nerve, implemented as a half-wave rectification, followed by a 5th-order low-pass filter at 770 Hz and a 40 Hz high-pass filter
The auditory pre-processing stage yields 23 time-domain signals, which are forwarded to the periodicity analysis stage.

Periodicity analysis:
In each frequency channel c, the periodicity analysis [13] is performed every 20 ms to reveal the dominant periods in the analyzed time instance of a signal. Around each considered time step n, eight signal segments of duration P′ are formed, as depicted in Fig. 1A. The number of salient glimpses in a set depends on the content of the analyzed signal. Glimpses obtained with this method in one instance of time can be visualized as glimpse patterns, shown in Fig. 1B. They represent all the salient periods found across 23 frequency channels and their corresponding total energy values. For more implementation details about the sPAF extraction, the reader is referred to [4,[6][7][8]14].

Observation vector for the neural network
The number of glimpses varies depending on the content of the acoustic signal: in general, the more complex the auditory scene, the fewer salient glimpses are available.

$$\mathrm{synch}_{cn}(P') = \frac{E_{P,cn}(P')}{E_{\mathrm{tot},cn}(P')}$$

Above, the sPAF were defined as a set O(n) with a varying number of salient components. While such a representation can be used as an input to an analytically designed model from Section 2.3, it is not suitable for training a neural network that requires a fixed dimensionality of the input features. To present the features to the neural network, sPAF can be treated as a 2-dimensional matrix of size 112 × 23, representing 112 tested period values in 23 frequency channels. Such a matrix can be reshaped into an observation vector with a dimensionality D = 112 × 23 = 2576. Entries of the matrix for which a glimpse was found are set to the glimpse energy value $E_{cnm}$. All remaining entries are set to 0.
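The reshaping described above can be sketched in a few lines of Python. This is only an illustration: the function name `glimpses_to_vector` and the representation of a glimpse as a `(period_index, channel_index, energy)` tuple are assumptions of this sketch, not the data structures used in the study.

```python
N_PERIODS = 112   # number of tested period values (from the text)
N_CHANNELS = 23   # number of frequency channels (from the text)


def glimpses_to_vector(glimpses):
    """Flatten a variable-size glimpse set into a fixed-length vector.

    `glimpses` is a hypothetical list of (period_index, channel_index,
    energy) tuples; this index convention is an assumption of the sketch.
    """
    matrix = [[0.0] * N_CHANNELS for _ in range(N_PERIODS)]
    for period_idx, channel_idx, energy in glimpses:
        # Entries with a glimpse carry the glimpse energy, all others stay 0.
        matrix[period_idx][channel_idx] = energy
    # Row-major reshape: 112 x 23 matrix -> vector with 2576 entries.
    return [value for row in matrix for value in row]


example_glimpses = [(0, 0, 0.8), (10, 5, 0.3), (111, 22, 0.5)]
vector = glimpses_to_vector(example_glimpses)
```

With fewer glimpses, the vector keeps its length and simply contains more zeros, which is what allows a network with a fixed input size to consume it.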

State estimation methods
Within a time frame of 20-ms duration, speech is usually considered to be stationary. For every frame number n, a continuously voiced signal can be characterized by a state vector containing the values of the instantaneous fundamental frequency F0 and first two formant frequencies F1 and F2:

$$\vec{s}(n) = [F0(n), F1(n), F2(n)]^{T}$$

In this study, we considered voiced signals, for which continuous three-dimensional state trajectories can be defined (see Fig. 6).
The objective of the models presented in this study is to infer voice state parameters s(n) based on the observation O(n) containing sparse periodicity-based auditory features (sPAF). Note that while the state is defined as a vector with a fixed dimensionality, the observation is a set [4] in which the number of elements depends on the acoustic signal (for more details, see Section 2.1). The sections below outline the state estimation methods compared in this study. For a theoretical introduction to state-space methods and Monte Carlo sampling, the reader is referred to [9,10,15].

Non-sequential Monte Carlo sampling
In a Bayesian framework, the state estimation can be formulated as finding the posterior probability of the current state given the current observation. According to Bayes' rule, this probability can be computed as:

$$p(\vec{s}(n) \mid O(n)) = \frac{p(O(n) \mid \vec{s}(n))\, p(\vec{s}(n))}{\int p(O(n) \mid \vec{s}(n))\, p(\vec{s}(n))\, d\vec{s}(n)}$$

where the likelihood $p(O(n) \mid \vec{s}(n))$ describes variation in the data for a fixed state and the prior $p(\vec{s}(n))$ describes prior beliefs about the possible state. Even if the likelihood and prior can be evaluated to give an unnormalized posterior, the integral in the denominator of the equation above is usually intractable. An analytic closed-form expression is, therefore, not available. However, it is possible to approximate posteriors or expectation values with respect to posteriors. We here use a Monte Carlo approach based on samples with normalized importance weights (e.g., [10]) in order to represent the posterior and to compute the corresponding expected values. Concretely, in each time step, we draw 2000 three-dimensional samples (hypothetical states) from a uniform prior distribution in 3 dimensions (U(100, 400) for F0, U(300, 800) for F1, and U(800, 2500) for F2). Next, the weight for each of the three-dimensional hypothetical states is computed by evaluating the likelihood $p(O(n) \mid \vec{s}(n))$, which is designed to capture the relationship between the voice parameters F0, F1, and F2 and the observed sPAF (see Section 2.3.2). $p(O(n) \mid \vec{s}(n))$ is a probabilistic function that takes sPAF data and hypothetical voice parameters as input arguments and outputs a likelihood weight value. Weights for the 2000 state samples are normalized so that they sum to 1. The final estimate $\hat{\vec{s}}(n)$ is the three-dimensional state sample that maximizes the posterior, i.e., the sample with the largest normalized weight.
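The sampling procedure of this section can be sketched as follows. The prior ranges and the number of samples are taken from the text; `toy_likelihood` is a hypothetical stand-in for the sPAF likelihood of Section 2.3.2 (a Gaussian bump around an arbitrary target state), introduced only so that the sketch runs on its own.

```python
import math
import random

# Uniform prior ranges in Hz for F0, F1, F2 (from the text).
PRIOR_RANGES = [(100.0, 400.0), (300.0, 800.0), (800.0, 2500.0)]


def toy_likelihood(state, target=(200.0, 500.0, 1500.0),
                   widths=(20.0, 50.0, 150.0)):
    """Hypothetical placeholder for p(O(n)|s(n)); NOT the sPAF model."""
    exponent = sum(((s - t) / w) ** 2 for s, t, w in zip(state, target, widths))
    return math.exp(-0.5 * exponent)


def non_sequential_mc(likelihood, n_samples=2000, rng=None):
    """Draw states from the uniform prior, weight them, return the best one."""
    rng = rng or random.Random(0)
    samples = [tuple(rng.uniform(lo, hi) for lo, hi in PRIOR_RANGES)
               for _ in range(n_samples)]
    raw = [likelihood(s) for s in samples]
    total = sum(raw)
    weights = [w / total for w in raw]          # normalized to sum to 1
    best = max(range(n_samples), key=weights.__getitem__)
    return samples[best], weights


estimate, weights = non_sequential_mc(toy_likelihood)
```

Because the prior is uniform, the normalized weight of each sample is proportional to its likelihood, so picking the sample with the largest weight corresponds to picking the posterior maximum among the drawn hypotheses.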

Sequential Monte Carlo sampling
The state estimation method introduced in this section was previously used in [4,14] to simulate the process of attentive voice tracking. The sequential Bayesian state estimation is formulated as finding the posterior probability of the current state given the current and all previous observations:

$$p(\vec{s}(n) \mid O(1{:}n)) = \frac{p(O(n) \mid \vec{s}(n))\, p(\vec{s}(n) \mid O(1{:}n-1))}{p(O(n) \mid O(1{:}n-1))}$$

where

$$p(\vec{s}(n) \mid O(1{:}n-1)) = \int p(\vec{s}(n) \mid \vec{s}(n-1))\, p(\vec{s}(n-1) \mid O(1{:}n-1))\, d\vec{s}(n-1)$$

Similarly to the non-sequential case, analytical evaluation is usually not possible, and an approximation is needed. Sequential Monte Carlo sampling, also called particle filtering, is a broadly used method for sequentially estimating the posterior density [9]. The key idea is to represent the required posterior density function by a set of random samples with associated weights and to compute estimates based on these samples and weights. The main difference from the non-sequential case is that the new samples depend on the previous samples and that the weights propagate across time steps. The relationship between two subsequent states is described by the state transition model $p(\vec{s}(n) \mid \vec{s}(n-1))$.
Specifically, the particle filter iteratively executes the following steps: with the old weight.
4. Estimation: The final estimate $\hat{\vec{s}}(n)$ is computed as the expected value of the approximated posterior, i.e., the discrete distribution of state samples with corresponding weights.
5. Resampling: To direct the hypothesis set into the region of high importance, the samples with small weights are eliminated, and the samples with large weights are duplicated. After this step, the new iteration begins.
Likelihood models used in the above procedure are detailed in Section 2.3.
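The weight update, estimation, and resampling operations can be sketched as below. The multinomial resampling scheme and the reset to uniform weights after resampling are assumptions of this sketch (a common textbook variant); the text does not specify which resampling scheme was used. All function names are hypothetical.

```python
import random


def update_weights(old_weights, likelihoods):
    """Multiply each likelihood with the old weight, then normalize."""
    raw = [w * l for w, l in zip(old_weights, likelihoods)]
    total = sum(raw)
    return [r / total for r in raw]


def expected_state(particles, weights):
    """Estimate: expected value of the weighted discrete distribution."""
    dims = len(particles[0])
    return tuple(sum(w * p[d] for p, w in zip(particles, weights))
                 for d in range(dims))


def resample(particles, weights, rng):
    """Multinomial resampling (assumed scheme): samples with large weights
    tend to be duplicated, samples with small weights tend to disappear."""
    n = len(particles)
    new_particles = rng.choices(particles, weights=weights, k=n)
    return new_particles, [1.0 / n] * n


rng = random.Random(1)
particles = [(150.0, 400.0, 1000.0), (200.0, 500.0, 1500.0), (250.0, 600.0, 2000.0)]
weights = update_weights([1 / 3, 1 / 3, 1 / 3], [0.1, 0.8, 0.1])
estimate = expected_state(particles, weights)
resampled, uniform_weights = resample(particles, weights, rng)
```

In a full filter, the resampled particles would next be propagated through the state transition model before the likelihood is evaluated again.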

Regression neural network
Target parameters (F0, F1, and F2) were scaled using the global mean and standard deviation, so that each parameter can only take values between 0 and 1. After training, the inferred parameter values were scaled back to the original value ranges.
To generate the training pairs, we created random 3-dimensional state trajectories with a sampling rate of 50 Hz. They were used as an input to the Klatt formant synthesizer [17], yielding synthetic voice signals. The instantaneous sPAF were extracted from the acoustic signal with the same sampling rate as the trajectory.
Two different training data sets were used in the study:
1. Clean sPAF: data set generated based on sPAF extracted from 1000 voice signals of 10 s each, in total 501,000 state-observation pairs.
2. Clean and fragmentary sPAF: data set generated based on clean voice sPAF, artificially removed sPAF, and segregated sPAF (for details, see Section 3). In each category, 1000 trajectories of 2 s each, in total 404,000 state-observation pairs.

F0, F1, F2-models
This section reviews probabilistic models required for the Bayesian Monte Carlo state estimation methods (Sections 2.2.2 and 2.2.1), specifically for estimating instantaneous voice parameters (F0, F1, and F2) based on sPAF.

F0, F1, F2-transition model $p(\vec{s}(n) \mid \vec{s}(n-1))$
The state transition model describes the temporal evolution of parameters F0, F1, and F2, which are naturally limited due to physical constraints of speech production. For simplicity, we assume independence of the individual dimensions; therefore, subsequent values in each dimension are predicted individually.
To predict the next value for the i-th state dimension, the difference $\Delta s_i(n)$ between the two previous estimates is calculated, the next value according to that trend, $s_i(n) + \Delta s_i(n)$, is predicted, and finally, Gaussian noise with standard deviation $\sigma_i$ is added to the predicted value, where $\sigma_i$ is 0.5 Hz for F0, 1 Hz for F1, and 5 Hz for F2. In addition, we make sure that the difference between the two previous estimates $\Delta s_i(n)$ does not exceed the largest allowed step (10 Hz for F0, 50 Hz for F1, and 100 Hz for F2) and that the extrapolated value $s_i(n) + \Delta s_i(n)$ does not exceed the possible value range ([100, 400] Hz for F0, [300, 800] Hz for F1, and [800, 2500] Hz for F2). Figure 2 depicts the procedure, which is repeated for every state sample.
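A minimal sketch of this prediction step is given below. The noise levels, step limits, and value ranges are taken from the text; enforcing the limits by clipping, and the names `predict_next` and `clip`, are assumptions of this sketch.

```python
import random

SIGMA = (0.5, 1.0, 5.0)          # noise std in Hz for F0, F1, F2 (from the text)
MAX_STEP = (10.0, 50.0, 100.0)   # largest allowed step in Hz (from the text)
RANGES = ((100.0, 400.0), (300.0, 800.0), (800.0, 2500.0))  # value ranges in Hz


def clip(value, lo, hi):
    return max(lo, min(hi, value))


def predict_next(prev, prev_prev, rng, sigma=SIGMA):
    """Extrapolate the trend between two previous estimates and add noise.

    Each dimension is handled independently. Limits are enforced by
    clipping, which is an assumption of this sketch.
    """
    next_state = []
    for i in range(3):
        # Trend between the two previous estimates, limited to the max step.
        delta = clip(prev[i] - prev_prev[i], -MAX_STEP[i], MAX_STEP[i])
        lo, hi = RANGES[i]
        # Extrapolated value, limited to the possible value range.
        extrapolated = clip(prev[i] + delta, lo, hi)
        # Gaussian noise is added to the predicted value.
        next_state.append(extrapolated + rng.gauss(0.0, sigma[i]))
    return tuple(next_state)


rng = random.Random(0)
noisy = predict_next((200.0, 500.0, 1500.0), (195.0, 480.0, 1300.0), rng)
```

In the example call, the F2 trend of 200 Hz exceeds the 100 Hz limit, so the extrapolation for F2 starts from 1600 Hz rather than 1700 Hz before noise is added.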

F0, F1, F2-observation model $p(O(n) \mid \vec{s}(n))$
The F0, F1, F2-observation model is a probabilistic function that relates the observed sPAF and the underlying voice parameters. It quantifies the likelihood that the sPAF O(n) extracted in a given time frame come from a hypothetical three-dimensional voice state $\vec{s}(n)$. There are two major assumptions that influence the design of this function. First, we assume that the glimpses in one channel $G_{cn}$ originate from a single voice, even if the acoustic signal contains a mixture of voices. This saliency assumption is based on the fact that we use glimpsing thresholds, which ensure that the glimpses are extracted only if one voice dominates in the signal. This is demonstrated in Fig. 3.
Secondly, we assume that the state dimensions are independent and that the period $P_{cnm}$ is solely the evidence of F0 and the energy $E_{cnm}$ is solely the evidence of F1 and F2.
Hence, the likelihood of a single glimpse $p(g_{cnm} \mid \vec{s}(n))$ can be approximated as follows:

$$p(g_{cnm} \mid \vec{s}(n)) \approx p(P_{cnm} \mid F0(n))\; p(E_{cnm} \mid F1(n), F2(n))$$

where the approximation is motivated by the assumption stated above. See Fig. 4 for a scheme demonstrating the procedure to evaluate this function. We call $p(P_{cnm} \mid F0(n))$ the period likelihood and $p(E_{cnm} \mid F1(n), F2(n))$ the energy likelihood.
Energy likelihood $p(E_{cnm} \mid F1(n), F2(n))$
For the energy likelihood, we used a codebook approach. The codebook entries were computed from simulated data: First, a list of F1 values logarithmically spaced between 350 and 700 Hz and a list of F2 values logarithmically spaced between 800 and 2500 Hz were created. Next, for each F1 value, a 10-s signal with fixed F1 and varying F0 and F2 was generated. Likewise, for each F2 value, a 10-s signal with fixed F2 and varying F0 and F1 was generated. sPAF were extracted from each signal, and for every F1 value, a mean glimpse energy $\mu_E(F1, c)$ and a standard deviation $\sigma_E(F1, c)$ were computed and stored. Similarly, for every F2 value, a mean energy $\mu_E(F2, c)$ and a standard deviation $\sigma_E(F2, c)$ were computed and stored.
The likelihood that an observed glimpse energy $E_{cnm}$ originates from a hypothetical F1 and F2 is modeled using two normal distributions with mean and standard deviation defined by the two codebooks for F1 and F2 as follows:

$$p(E_{cnm} \mid F1(n), F2(n)) = \mathcal{N}\big(E_{cnm};\, \mu_E(F1, c),\, \sigma_E(F1, c)\big) \cdot \mathcal{N}\big(E_{cnm};\, \mu_E(F2, c),\, \sigma_E(F2, c)\big)$$

We also experimented with other models for the energy likelihood, including non-factorizing likelihoods or differently parameterized codebooks. The above modeling was finally chosen based on its relative performance, stability, and efficiency benefits. It should be noted that the codebook summarizes the energy distribution for only one input signal level. In order to account for voice signal level fluctuations, the procedure would need to be repeated for multiple input levels.
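A codebook lookup of this kind might be sketched as follows. Combining the two normal distributions as a product, the nearest-neighbor lookup, and the toy codebook values are all assumptions made for illustration; they are not the codebooks or the lookup rule used in the study.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def nearest_entry(codebook, frequency, channel):
    """Return (mean, std) of the codebook entry closest to `frequency`.
    Nearest-neighbor lookup is an assumption of this sketch."""
    key = min(codebook, key=lambda f: abs(f - frequency))
    return codebook[key][channel]


def energy_likelihood(energy, f1, f2, channel, codebook_f1, codebook_f2):
    """Product of two normal densities, one per formant codebook (assumed)."""
    mu1, sd1 = nearest_entry(codebook_f1, f1, channel)
    mu2, sd2 = nearest_entry(codebook_f2, f2, channel)
    return normal_pdf(energy, mu1, sd1) * normal_pdf(energy, mu2, sd2)


# Toy codebooks: {formant frequency in Hz: [(mean, std) for each channel]}.
# The values are invented for illustration only.
codebook_f1 = {350.0: [(0.2, 0.1), (0.6, 0.1)], 700.0: [(0.7, 0.1), (0.3, 0.1)]}
codebook_f2 = {800.0: [(0.2, 0.1), (0.6, 0.1)], 2500.0: [(0.7, 0.1), (0.3, 0.1)]}

matching = energy_likelihood(0.2, 360.0, 900.0, 0, codebook_f1, codebook_f2)
mismatching = energy_likelihood(0.2, 690.0, 2400.0, 0, codebook_f1, codebook_f2)
```

A glimpse energy close to the stored channel means for a hypothesized formant pair yields a high likelihood, while a hypothesis whose codebook entries predict a different energy is strongly penalized.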
Period likelihood $p(P_{cnm} \mid F0(n))$
The likelihood that an observed glimpse period value $P_{cnm}$ originates from a given F0 is modeled using a mixture of 11 circular von Mises distributions (see [4]). The number 11 comes from the highest reported number of resolved harmonics [18]. Each element of the sum represents a different harmonic j of F0:

$$p(P_{cnm} \mid F0(n)) = \sum_{j=1}^{11} C_j\, \mathcal{M}\big(R_{cnm}(j \cdot F0) \mid \mu, \kappa\big)$$

where F0 is the hypothetical fundamental frequency and $\mathcal{M}$ denotes the von Mises distribution with mean $\mu = 0$ and concentration parameter $\kappa = 5$. $C_j = j^{-1} / \sum_{j'=1}^{11} j'^{-1}$ is the normalizing constant for the j-th harmonic. It is inversely proportional to the harmonic number: the higher the harmonic number, the lower the probability of the period glimpse originating from that harmonic. $R_{cnm}(j \cdot F0)$ is the relative period value with respect to the j-th harmonic of the hypothetical F0. It is computed from $P_{cnm}$ using $P0 = F0^{-1}$, the period of F0, and rem(·), the remainder of the division.
For a detailed explanation of the period likelihood function, the reader is referred to our recent study [4].
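The mixture over harmonics can be illustrated with the sketch below. The number of harmonics, the harmonic weights, and the von Mises parameters follow the text. The mapping from a period to a phase angle, `2π · rem(P, P0/j) / (P0/j)`, is an assumption of this sketch and may differ from the expression in [4]; the function names are hypothetical.

```python
import math

N_HARMONICS = 11   # number of mixture components (from the text)
KAPPA = 5.0        # concentration parameter (from the text)


def bessel_i0(x, terms=30):
    """Modified Bessel function of the first kind, order 0 (power series)."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def von_mises_pdf(angle, mu=0.0, kappa=KAPPA):
    return math.exp(kappa * math.cos(angle - mu)) / (2.0 * math.pi * bessel_i0(kappa))


# C_j = j^-1 / sum_j' j'^-1: lower harmonics receive larger weights.
_NORM = sum(1.0 / k for k in range(1, N_HARMONICS + 1))
HARMONIC_WEIGHTS = [(1.0 / j) / _NORM for j in range(1, N_HARMONICS + 1)]


def period_likelihood(period, f0):
    """Mixture of von Mises densities over the first 11 harmonics of f0."""
    p0 = 1.0 / f0
    total = 0.0
    for j, c_j in enumerate(HARMONIC_WEIGHTS, start=1):
        harmonic_period = p0 / j
        # ASSUMPTION: the relative period is mapped to a phase angle in [0, 2*pi).
        angle = 2.0 * math.pi * math.fmod(period, harmonic_period) / harmonic_period
        total += c_j * von_mises_pdf(angle)
    return total


matching = period_likelihood(1.0 / 200.0, 200.0)
mismatching = period_likelihood(1.0 / 200.0, 230.0)
```

Because the density is circular, a glimpse period that is an integer multiple of a harmonic period maps to an angle of (approximately) 0 or 2π and therefore receives the maximum density for that component.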

Likelihood integration
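In each non-empty channel set $G_{cn}$, the likelihood is integrated by computing a mean across the likelihoods of the elements of the channel set:

$$p(G_{cn} \mid \vec{s}(n)) = \frac{1}{|G_{cn}|} \sum_{g_{cnm} \in G_{cn}} p(g_{cnm} \mid \vec{s}(n))$$

Finally, the likelihood is integrated as a product across the non-empty frequency channels:

$$p(O(n) \mid \vec{s}(n)) = \prod_{c} p(G_{cn} \mid \vec{s}(n))$$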
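The two integration steps, a mean within each non-empty channel followed by a product across channels, can be written compactly. The function name and the nested-list input format are assumptions of this sketch.

```python
def integrate_likelihood(channel_likelihoods):
    """Combine glimpse likelihoods into one observation likelihood.

    `channel_likelihoods` is a hypothetical list with one entry per
    frequency channel; each entry lists the likelihoods of the glimpses
    found in that channel (empty list if the channel has no glimpse).
    """
    total = 1.0
    for glimpse_likelihoods in channel_likelihoods:
        if not glimpse_likelihoods:
            continue  # empty channels do not contribute
        # Mean across the glimpses of one channel ...
        channel_mean = sum(glimpse_likelihoods) / len(glimpse_likelihoods)
        # ... then product across channels.
        total *= channel_mean
    return total


observation_likelihood = integrate_likelihood([[0.2, 0.4], [], [0.5]])
```

For many channels with small likelihood values, an implementation would typically sum logarithms instead of multiplying to avoid numerical underflow; that variant is omitted here for brevity.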

Evaluation
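Four different state estimation methods were compared:
1. Non-sequential Monte Carlo sampling (Section 2.2.1), denoted non-seqMC
2. Sequential Monte Carlo sampling (Section 2.2.2), denoted seqMC
3. Regression neural network trained with clean sPAF (Section 2.2.3), denoted regNN
4. Regression neural network trained with fragmentary sPAF (Section 2.2.3), denoted regNN+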
Figure 5 illustrates the above listed methods. The performance of each method was evaluated under the following conditions:
a) Clean voice sPAF: features were extracted from the acoustic signal containing one voice.
b) 40% artificially removed sPAF: 40% of glimpses were artificially removed from the clean voice sPAF, with uniform removal probability for all channels and time steps.
c) 80% artificially removed sPAF: 80% of glimpses were artificially removed from the clean voice sPAF, with uniform removal probability for all channels and time steps.
d) Optimally segregated: features were extracted from the acoustic signal containing two voices and segregated using an F0-based feature segregation method (see the Appendix "Optimal feature segregation method" section for details). sPAF assigned to the considered voice were used in the state estimation.
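The artificial removal in conditions b) and c) can be sketched as an independent random decision per glimpse. Treating each glimpse independently (rather than, for example, whole channels) and the function name `remove_glimpses` are assumptions of this sketch.

```python
import random


def remove_glimpses(glimpses, removal_probability, rng):
    """Drop each glimpse independently with the same probability,
    regardless of its channel or time step (assumed granularity)."""
    return [g for g in glimpses if rng.random() >= removal_probability]


rng = random.Random(0)
clean = list(range(10000))               # placeholder glimpse identifiers
kept_40 = remove_glimpses(clean, 0.4, rng)
kept_80 = remove_glimpses(clean, 0.8, rng)
```

With a uniform removal probability, the expected fraction of surviving glimpses is 60% and 20%, respectively, independent of where in the time-frequency plane the glimpses were found.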
Figure 6 depicts the above listed test conditions. Test data were obtained using the same procedure as in [4,5]: First, 100 random 3-dimensional state trajectories, each of length L = 100, were created. The trajectory of each parameter (F0, F1, F2) was generated independently by picking a random excerpt of Gaussian noise (500 Hz sampling rate), filtering it between 0.05 and 0.6 Hz, and adjusting the value range to 100 to 400 Hz for F0, 300 to 800 Hz for F1, and 700 to 2200 Hz for F2. The trajectories with a sampling rate of 50 Hz were used as an input to the Klatt formant synthesizer [17], yielding 2-s-long synthetic voice signals, from which the sPAF were extracted with the same sampling rate. Conditions with artificially removed glimpses approximate adding noise to the signal before extracting sPAF. For example, 40% and 80% glimpse loss correspond to adding white noise to a voiced signal at 15 to 20 dB SNR and 0 to 5 dB SNR, respectively. In the condition using a mixture of 2 voices (optimally segregated), a second set of 100 acoustic signals was synthesized and mixed with the single voice signals before feature extraction. In the case of state estimation with a regression neural network, the sPAF features were additionally transformed to a suitable format with a fixed dimensionality (for details, see Section 2.1). For test conditions with fragmentary data (b-d), this resulted in more zero-entries in the observation vector.
The following performance measures were used to compare the state estimation with different methods:
1. Root-mean-square error (RMSE), computed for each state dimension i as

$$\mathrm{RMSE}_i = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \big(\hat{s}_i(n) - s_i(n)\big)^2}$$

where N is the cumulative length of all trajectories.
2. Gross error: percentage of time steps for which the estimate lies outside the allowed interval (5 Hz for F0, 25 Hz for F1, and 100 Hz for F2).
3. Harmonic error: model-based similarity measure between tracks of fundamental frequency (for details, see the Appendix "Harmonic error" section).
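The first two measures are straightforward to compute; a minimal sketch follows. Interpreting the allowed interval as a symmetric tolerance around the true value is an assumption of this sketch, and the harmonic error is omitted because its definition is given only in the Appendix.

```python
import math


def rmse(estimates, truth):
    """Root-mean-square error over all time steps."""
    n = len(truth)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / n)


def gross_error(estimates, truth, tolerance_hz):
    """Percentage of time steps whose estimate deviates from the true
    value by more than `tolerance_hz` (assumed symmetric interval)."""
    n = len(truth)
    misses = sum(1 for e, t in zip(estimates, truth) if abs(e - t) > tolerance_hz)
    return 100.0 * misses / n


true_f0 = [200.0, 210.0, 220.0, 230.0]
estimated_f0 = [201.0, 210.0, 230.0, 230.0]
f0_rmse = rmse(estimated_f0, true_f0)
f0_gross = gross_error(estimated_f0, true_f0, tolerance_hz=5.0)
```

In this toy example, a single estimate that is off by 10 Hz dominates the RMSE and accounts for one gross error in four time steps.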

Results
This section presents the results for four state estimation methods in four conditions with different types of input features.
Figure 7 shows examples of ground truth F0, F1, and F2 trajectories together with the estimated trajectories for all methods and conditions.
The results are analyzed and discussed for each feature dimension. Figure 8 shows performance measures computed for the first dimension of the state space: F0.
The non-seqMC method results in the highest F0 RMSE of all methods and in all conditions. The values are similar across different feature conditions and reach 92.8 Hz, which exceeds the RMSE computed for white noise with mean 250 Hz and standard deviation 50 Hz. This indicates a systematic error leading to very low performance in terms of absolute estimation accuracy. Examples in Fig. 7 (1.a-1.d) demonstrate that, while the estimated values are indeed far off from the underlying values, the errors are not random and are caused by the F0 harmonic confusions typical for pitch tracking. This observation is confirmed by the harmonic error, which is close to that of seqMC and lower than the harmonic error for regNN+. This suggests that the errors of non-seqMC are mainly caused by harmonic confusion. A similar observation can be made for the seqMC method in the segregated sPAF condition: the increased RMSE is caused mostly by harmonic confusions, which can be seen in the example in Fig. 7 (2.d).
As seqMC can avoid harmonic confusion, it reaches the best performance of all methods in most conditions (see examples in Fig. 7 (2.a and 2.d)). This confirms that limiting the possible outcomes by adding the expectation component in the state estimation is crucial for its performance. The only exception is the clean sPAF condition, where regNN, trained with the clean data, achieves the lowest errors.
A much different relationship between the error values and feature types can be observed for the regNN method. The lowest errors are observed for the clean sPAF, with an RMSE of 6.6 Hz. The error increases to 16.1 Hz after removing 40% of sPAF and to 37.6 Hz after removing 80%. The difficulty in the feature conditions has a significant effect on the regNN performance. Although for clean sPAF the model outperforms all other methods, the benefit from regNN decreases for all types of sPAF features that were not included in the training. This shows that this model is not capable of generalizing well to fragmentary information.
The regNN+ method achieves similar results across all feature conditions. The RMSE increases slowly from 10.7 Hz in the clean sPAF condition to 39.9 Hz in the segregated condition. This shows that the network trained with various types of fragmentary information achieves on average good results for all feature types, at the cost of precise results in the clean sPAF.
Figure 9 shows performance measures computed for the second dimension of the state space: F1.
The best overall F1 estimation performance is achieved by the regNN+ method, a method that leads to only average performance for the F0 estimation. RegNN+ obtains an RMSE of 37.8 to 76.7 Hz and outperforms seqMC in most conditions. It can approximate the relationship between sPAF and F1 more precisely than the seqMC method. The energy likelihood model used in the Monte Carlo methods assumes that the glimpsed energy is only the evidence of F1 and F2 and that the period values are only the evidence of F0. This assumption might not be valid, especially in the low frequency channels, where most evidence for F1 can be found: on the one hand, the energy in those channels is influenced by F0, and on the other hand, the observed period harmonics depend on the spectral filtering dictated by the formants. This simplistic energy likelihood model is most likely valid only for the higher frequency channels that provide less evidence for F1. This can explain the poor F1 estimation performance of the Monte Carlo methods. regNN+ trained with the fragmentary information is not bound by such assumptions and can capture a more complex relationship between the sPAF patterns and F1.

Fig. 6 Testing conditions. sPAF are depicted together with the hidden state trajectories. a sPAF extracted from a clean synthetic voiced signal. b 40% of non-empty glimpse channels removed from the sPAF extracted from a clean synthetic voiced signal. c 80% of non-empty glimpse channels removed from the sPAF extracted from a clean synthetic voiced signal. d Features of one voice segregated from the sPAF extracted from a mixture of two synthetic voiced signals
A similar problem with the energy likelihood model is manifested as a large error difference between the 80% removed sPAF and the segregated sPAF. In the first condition, the glimpses are removed from clean sPAF with equal probability for all channels. In the second condition, glimpses are removed depending on the second voice in the mixture. If the segregation removes the only frequency channels that can be interpreted by the energy likelihood model, the model will fail to estimate F1.

Fig. 7 Example of a chosen excerpt of ground truth parameter trajectories (red) plotted together with the estimated parameter trajectories (black) for all methods and feature conditions
For the clean sPAF, the best performance is again achieved by the regNN, which was trained on the clean data. Here again, the performance drops drastically after removing some information from the sPAF, indicating that the model is overfitted to the clean sPAF.
All methods besides regNN+ exceed, for some conditions, the RMSE level computed for the noise (normally distributed with mean 550 Hz and standard deviation 83 Hz). Above this level, we can consider the method unable to estimate F1. The failure of the Monte Carlo methods is most likely caused by the over-simplified energy model, and the failure of the regNN is caused by the inability to generalize to unseen data.

Fig. 8 Performance measures for F0 estimation. y-axis: error measure, x-axis: different input features, solid lines: different state estimation methods, dashed line: error computed for the artificially generated white noise with mean 250 Hz and standard deviation 50 Hz

Fig. 9 Performance measures for F1 estimation. y-axis: error measure, x-axis: different input features, solid lines: different state estimation methods, dashed line: error computed for the artificially generated white noise with mean 550 Hz and standard deviation 83 Hz
Figure 10 shows performance measures computed for the third dimension of the state space: F2.
As in the previous two dimensions, regNN is overfitted to the clean sPAF. It achieves the best results of all methods for clean sPAF (RMSE of 99.98 Hz), but the worst results in all other conditions. The best overall performance is achieved by the seqMC method; however, unlike in the F0 estimation, the errors increase with the difficulty in the feature condition. SeqMC is significantly better than the remaining methods (non-seqMC and regNN+), which do not use expectation in the estimation process.

Discussion
In this study, we compared the voice state estimation performance of two classes of state estimation methods: Bayesian sampling and deep learning. The first class uses analytically formulated probability models. They evaluate the data likelihood for a finite hypothesis set, thus approximating the state posterior distribution, based on which the most likely state can be estimated. The second class approximates the mapping between the hidden state and the data in a supervised learning procedure. A trained model allows for predicting the most likely state for a given data vector. As presented in this work, both approaches can be used for voice parameter estimation. However, there are interesting differences in the performance of these methods for different types of input sPAF.
To understand these differences, it is useful to briefly review the main objectives of the approaches. Probability models used in Monte Carlo simulations are designed to describe specific properties of the sound. They provide a concise explanation of the relationship between the observed data and the hidden parameters. They are interpretable but use assumptions that limit their complexity. In contrast, the objective of deep learning is to precisely approximate this relationship. Deep networks provide limited interpretability but can model complex non-linear dependencies. This demonstrates the different nature of these approaches. The question we posed in this study is how these two (to a certain extent contradictory) strengths can be used to interpret fragmentary data.
F0 estimation performance with Monte Carlo methods, which used the period likelihood formulation from [4], is least influenced by the changes in sPAF, which indicates that the observation model generalizes well across several conditions with fragmentary information. It suggests that the period likelihood model accurately describes the properties of a single voice, and it might be better suited to model human performance.
In contrast to F0 estimation, formant estimation does not benefit as much from the analytical modeling approach. Especially for F1, the simplistic model of the energy likelihood does not seem to sufficiently capture the relationship between the observed energy glimpses and the parameters. F1 and F2 estimation performance of Monte Carlo methods is sensitive to the feature types, meaning that the model is not capable of inferring the formants based on fragmentary information without losing accuracy.
While Monte Carlo methods depend on the quality of the probabilistic model, the neural networks are highly dependent on the training data. As expected, the network trained using clean sPAF alone leads to a model overfitted to this condition, i.e., the best results are observed in the clean condition, but the model does not generalize well to other conditions. The network trained using fragmentary sPAF is much less sensitive to the changes in sPAF. It leads to only average performance in the F0 and F2 dimensions but is particularly beneficial for estimating F1: the network can learn dependencies between the state parameters, which are oversimplified in the probabilistic model used in MC methods.
The Sequential Monte Carlo method outperforms other methods in the ability to generalize across conditions with fragmentary information. The method is particularly capable because the range of possible state outcomes is limited by the finite number of hypotheses that are updated in every time step. This property, in combination with a valid observation model, is what makes the performance largely independent of the degree of sparseness in the data. From the perspective of computational auditory scene analysis, this result confirms that top-down processing is essential in complex auditory scenes. Because of sound superposition, only a limited amount of robust information about a specific sound object is available in such conditions. Inference based on this fragmentary information is ambiguous, and some form of top-down expectation is required to resolve this ambiguity. Our results show that the benefit from the top-down approach grows with the difficulty in the input features, which indicates that top-down processing is increasingly important for incomplete input features. A similar conclusion was reached by [19] in a machine learning study that investigated network architectures with attention mechanisms and showed that the more random modifications in the input data, the more the model relied on the top-level information.
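The interplay between expectation and fragmentary observations can be sketched with a one-dimensional particle filter for F0. This is a minimal illustration under simplifying assumptions, not the model used in this study: the continuity model is a plain Gaussian random walk, the likelihood `toy_likelihood` is a hypothetical Gaussian around a single observed F0 value, and a missing glimpse is represented by NaN. All parameter values are chosen for illustration.

```python
import numpy as np

def toy_likelihood(observation, states, sigma=15.0):
    """Hypothetical Gaussian likelihood; a missing glimpse (NaN) is uninformative."""
    if np.isnan(observation):
        return np.ones_like(states)
    return np.exp(-0.5 * ((observation - states) / sigma) ** 2) + 1e-12

def particle_filter_step(particles, observation, rng,
                         step_std=10.0, bounds=(100.0, 400.0)):
    # 1. Predict: move each hypothesis according to the continuity model
    #    (assumed here to be a Gaussian random walk within the allowed range).
    predicted = particles + rng.normal(0.0, step_std, size=particles.shape)
    predicted = np.clip(predicted, *bounds)
    # 2. Update: weight each hypothesis by the likelihood of the observation.
    weights = toy_likelihood(observation, predicted)
    weights = weights / weights.sum()
    estimate = float(np.sum(weights * predicted))
    # 3. Resample: focus the hypotheses on regions of high importance.
    indices = rng.choice(predicted.size, size=predicted.size, p=weights)
    return predicted[indices], estimate

rng = np.random.default_rng(0)
particles = rng.uniform(100.0, 400.0, size=500)
observations = np.full(30, 200.0)   # hypothetical constant F0 of 200 Hz
observations[10:20] = np.nan        # ten frames without any glimpse
estimates = []
for obs in observations:
    particles, est = particle_filter_step(particles, obs, rng)
    estimates.append(est)
```

During the frames without glimpses, the estimate is carried by the expectation alone and the hypotheses spread out gradually; once observations return, the resampling step concentrates them again.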
In [4], we argued that a particle filter with a resampling step, which allows for focusing the distribution of the top-down expectation on the regions of high importance, is a plausible model of attentive tracking of one of two competing voices. In this study, we took a closer look at the features of a single voice separated from a mixture. The results of both studies lead us to the conclusion that to effectively model selective attention in the auditory system, we need optimal feature segregation followed by a model that confronts the incomplete input features with the top-down expectation to infer the current state of the auditory object. Results presented here indicate that simple feed-forward networks are not well suited for this task in comparison to Monte Carlo (MC) methods coupled with analytical probability models.
Before favoring the MC approach, it should be highlighted that the simple non-recursive network architecture, which we chose for our experiments, represents only a small fraction of the ways deep networks could be used in the context of attention modeling. On the one hand, our experiments do not show how deep learning could be harnessed to solve even smaller sub-tasks of the auditory scene. For example, a neural network could replace the likelihood models in the Monte Carlo framework. One could imagine a model that predicts the likelihood of co-occurrence for the input pair of state and observation [20]. On the other hand, the current work did not consider substituting the whole Monte Carlo framework with more complex recurrent architectures allowing for modeling the sequential dependencies in the data [21] and top-down processing [19, 22–24].
Possible future research is likely to profit from combining the positive aspects of deep networks and probabilistic methods. Developments in this direction are the integration of deep networks and probabilistic models as, e.g., represented by variational autoencoders (VAEs) [25, 26]. Standard VAEs do not, however, model time dependence, which is, of course, crucial for data as considered in this study (as evident here, e.g., from the differences between sequential and non-sequential Monte Carlo). Recent developments have, therefore, extended VAE approaches by including, e.g., Gaussian processes as priors for VAEs [27–30]. The research direction is new, and applications to acoustic data including complex acoustic scenes still have to be investigated. But the positive aspects observed for deep neural networks and probabilistic approaches as studied here could in principle be combined based on such novel developments.
For applications, however, any approach also has to consider efficiency alongside other performance measures. Monte Carlo approaches such as particle filtering are known for their considerable computational costs related to testing a large number of hypotheses, and VAE approaches also rely on sampling, which becomes more challenging if complex priors such as Gaussian processes are used. An appropriate balance between performance and efficiency is, therefore, likely to determine which setup is finally the most appropriate for complex acoustic scenes.
Our work builds upon the previously developed auditory model of attentive tracking. Some of the choices in the model's components were driven by the intention to mimic the auditory system, without immediate consideration for applied signal processing. Furthermore, in this study, we prioritized assessing the model's behavior within highly controlled scenarios over evaluating its performance in more realistic conditions. As a consequence, from the results presented here, it is difficult to conclude how close the proposed model is to tracking voices in realistic scenes with more elements of natural speech, ambient noise, or reverberation. Nevertheless, it is important to note that removing information from the input features can be just as challenging to the model as adding noise. While some audio processing systems have already been tested with fragmentary speech information in the past [1, 2, 6, 31], this is, to our knowledge, the first attempt to present such data in a recursive voice tracking paradigm.
In our study, we have shown that even relatively simple analytical likelihood models describing sound properties, when coupled with top-down expectation, can deal with fragmentary observations better than a standard feed-forward neural network. In general, we believe this highlights the importance of incorporating top-down processing in models of selective listening.

Harmonic error
Harmonic error is an auditory-inspired F0 estimation performance measure. The task of this measure is to evaluate the distance between two compared F0 trajectories in a perceptually relevant way.
Various studies demonstrate the ambiguity of pitch perception [32–35]. Any tonal sounds other than a pure tone, especially complex tones lacking some harmonics, are more or less ambiguous in pitch. Computational models of pitch and F0 tracking algorithms also reflect this property of sound and suffer from ambiguous F0 estimates [33, 36, 37]. Ambiguity does not mean randomness: the pitches evoked by a stimulus are in systematic relationships to each other (they lie at the harmonics and their submultiples). Based on this, we can conclude that some of the F0 estimation errors are caused by the inherent nature of sound; hence, the performance measure should penalize those types of errors less than errors due to lack of precision in the algorithm.
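The harmonic error computes the error between the ground truth $F0_{GT}$ and the estimated $F0_{EST}$ using likelihood ratios computed with the F0 observation model from [4]. Specifically, the following procedure is used:

1. For both compared F0 values, generate a set $\tilde{O}$ of 100 hypothetical period glimpses by sampling from the mixture of circular von-Mises distributions (see Sec. 2.3.2):
$F0_{GT} \rightarrow \tilde{O}_{GT} = \{P_{GT}^{1}, P_{GT}^{2}, \ldots, P_{GT}^{100}\}$, $F0_{EST} \rightarrow \tilde{O}_{EST} = \{P_{EST}^{1}, P_{EST}^{2}, \ldots, P_{EST}^{100}\}$
2. Compute the likelihoods $p(\tilde{O}_{GT}|F0_{GT})$, $p(\tilde{O}_{GT}|F0_{EST})$, $p(\tilde{O}_{EST}|F0_{GT})$, and $p(\tilde{O}_{EST}|F0_{EST})$.
3. Compute the harmonic error $E_{harm}$ from the two likelihood ratios:
$\frac{p(\tilde{O}_{GT}|F0_{EST})}{p(\tilde{O}_{GT}|F0_{GT})} + \frac{p(\tilde{O}_{EST}|F0_{GT})}{p(\tilde{O}_{EST}|F0_{EST})}$

The measure can be interpreted as reflecting how likely it is that the period values generated by the two F0 values would lead to similar F0 estimation results. $E_{harm}$ is computed for each point on the trajectory and averaged to obtain a cumulative measure. Figure 11 demonstrates the output of the measure for several estimated trajectories. For more details about the period likelihood function and the motivation behind the circular von-Mises distribution, the reader is referred to [4].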
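The three steps above can be sketched in Python. The sketch rests on several assumptions that are not taken from this article or from [4]: the harmonic numbers are 1 to 11, the mixture weights are equal, the concentration `KAPPA` is an arbitrary value, the likelihood of a set is the average over its glimpses, and the two likelihood ratios are combined as one minus half their sum so that identical F0 values give an error of zero. The names `sample_periods` and `period_likelihood` are hypothetical.

```python
import numpy as np

HARMONICS = np.arange(1, 12)   # 11 harmonic numbers j (assumed to be 1..11)
KAPPA = 20.0                   # von Mises concentration (assumed value)

def period_likelihood(periods, f0):
    """Average mixture-of-von-Mises likelihood of period glimpses (s) given f0 (Hz)."""
    periods = np.asarray(periods, dtype=float)
    # Relative period for each harmonic, mapped to an angle on the circle.
    theta = 2.0 * np.pi * periods[:, None] * HARMONICS[None, :] * f0
    von_mises = np.exp(KAPPA * np.cos(theta)) / (2.0 * np.pi * np.i0(KAPPA))
    per_glimpse = von_mises.mean(axis=1)   # equal mixture weights (assumption)
    return float(per_glimpse.mean())       # averaged over the set (assumption)

def sample_periods(f0, n, rng):
    """Step 1: draw n hypothetical period glimpses for a given f0."""
    j = rng.choice(HARMONICS, size=n)              # pick a mixture component
    jitter = rng.vonmises(0.0, KAPPA, size=n)      # angular deviation around 0
    return (1.0 + jitter / (2.0 * np.pi)) / (j * f0)

def harmonic_error(f0_gt, f0_est, n=100, seed=0):
    rng = np.random.default_rng(seed)
    o_gt = sample_periods(f0_gt, n, rng)
    o_est = sample_periods(f0_est, n, rng)
    # Steps 2 and 3: cross-likelihood divided by self-likelihood for each set.
    ratio_gt = period_likelihood(o_gt, f0_est) / period_likelihood(o_gt, f0_gt)
    ratio_est = period_likelihood(o_est, f0_gt) / period_likelihood(o_est, f0_est)
    # Assumed combination: zero error when both ratios equal one.
    return 1.0 - 0.5 * (ratio_gt + ratio_est)

error_same = harmonic_error(200.0, 200.0)
error_octave = harmonic_error(200.0, 400.0)
error_unrelated = harmonic_error(200.0, 263.0)
```

With this toy observation model, the contrast between harmonically related and unrelated estimates depends strongly on the assumed concentration and on how the likelihood is integrated over the set, so the numerical values should not be compared with those reported in this study.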

Optimal feature segregation method
If the acoustic signal contains a mixture of two voices, then the observation $O(n)$ can be segregated into the foreground observation $O_F(n)$ and the background observation $O_B(n)$. Following the assumption that each channel set represents only one voice, each set $G_{cn}$ is assigned to either the foreground or the background. We use the approach from [4], which was shown to provide segregation sufficient to simultaneously track the fundamental frequency of two competing voices. The likelihood that sPAF belong to the foreground voice is compared with the likelihood that they belong to the background voice.
Specifically, the integrated (across elements in the set and frequency channels) period likelihood given the true F0 of the first voice was compared with the integrated period likelihood given the true F0 of the second voice.

Fig. 11 Examples of trajectories in a different relation to the ground truth trajectory and the corresponding harmonic errors. The lowest error is found for the ground truth (GT) trajectory itself (in black), next for 2× the GT trajectory (in red), GT trajectory with additional noise (in blue), 2/3 × the GT trajectory (in magenta), and the highest error is found for noise uncorrelated with the GT trajectory. Note that the error in terms of absolute deviation is the highest for 2× the GT trajectory (in red). The harmonic error is low because it is likely to obtain similar period values for those trajectories

Fig. 1 Extraction of sparse periodicity-based auditory features (sPAF). A Three main processing stages. B Glimpsed observation in one time instance. C Two ways of representing sPAF for different state estimation methods. Image based on [4]

Fig. 2 State transition probability model predicting the next state value. $s_i(n)$: hypothetical value of the state dimension i, $s_i(n-1)$: estimate in the last time step, $s_i(n-2)$: estimate in the second-to-last time step, $\Delta s_i(n)$: difference between the last two estimates. Allowed step size was 10 Hz for F0, 50 Hz for F1, and 100 Hz for F2. Possible value range was [100, 400] for F0, [300, 800] for F1, and [800, 2500] for F2. Image based on [4]

Fig. 4 Evaluating glimpse likelihood. (1) Procedure for evaluating period likelihood given F0. Based on a single observed period value (1.a), 11 relative period values $R_{cnm}(jF0)$ are computed, where j is the harmonic number (1.b). Next, they are multiplied by 2π to obtain a circular variable (1.c). The resulting 11 values are evaluated with the circular von-Mises distribution centered at 0 (1.d). Each likelihood is multiplied with a normalizing constant, which depends on the harmonic number (1.e). The values are added (1.f) and the final result is the likelihood of a period value given hypothetical F0 (1.g). (2) Procedure for evaluating energy likelihood given F1, F2. To obtain energy likelihood given F1, the observed energy value (2.a) is evaluated with a normal distribution (2.b) whose parameters are taken from a codebook storing mean and standard deviation for different F1 values. The same is repeated for F2 to obtain the energy likelihood given F2. The energy likelihoods for F1 and F2 (2.c) are multiplied (2.d) to obtain joint likelihood (2.e). The final glimpse likelihood given a hypothetical state (F0, F1, F2) is the product of period and energy likelihoods
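The procedure summarized in Fig. 4 can be sketched as follows. This is an illustration under assumptions, not the implementation from [4]: the harmonic numbers are taken to be 1 to 11, the relative period is computed as the product of period, harmonic number, and F0, the harmonic-dependent constants are replaced by equal weights, and the concentration `kappa` as well as the codebook entries are hypothetical placeholder values.

```python
import numpy as np

def period_likelihood(period, f0, kappa=20.0, n_harmonics=11):
    """Likelihood of one observed period (s) given a hypothetical F0 (Hz)."""
    j = np.arange(1, n_harmonics + 1)           # harmonic numbers (assumed 1..11)
    relative = period * j * f0                  # (1.b) relative period values
    theta = 2.0 * np.pi * relative              # (1.c) circular variable
    # (1.d) circular von-Mises density centered at 0
    von_mises = np.exp(kappa * np.cos(theta)) / (2.0 * np.pi * np.i0(kappa))
    # (1.e) placeholder constants: equal weights instead of harmonic-dependent ones
    weights = np.full(n_harmonics, 1.0 / n_harmonics)
    return float(np.sum(weights * von_mises))   # (1.f), (1.g)

def normal_pdf(x, mean, std):
    return float(np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi)))

def energy_likelihood(energy, f1, f2, codebook_f1, codebook_f2):
    """(2) Product of the energy likelihoods given F1 and given F2."""
    mean1, std1 = codebook_f1[f1]               # (2.b) parameters from the codebook
    mean2, std2 = codebook_f2[f2]
    return normal_pdf(energy, mean1, std1) * normal_pdf(energy, mean2, std2)

def glimpse_likelihood(period, energy, state, codebook_f1, codebook_f2):
    """Glimpse likelihood given a hypothetical state (F0, F1, F2)."""
    f0, f1, f2 = state
    return (period_likelihood(period, f0)
            * energy_likelihood(energy, f1, f2, codebook_f1, codebook_f2))

# Hypothetical codebooks: formant value (Hz) -> (mean, std) of the glimpse energy.
codebook_f1 = {500: (0.6, 0.2), 550: (0.7, 0.2)}
codebook_f2 = {1500: (0.5, 0.3), 1600: (0.4, 0.3)}

matching = period_likelihood(1.0 / 200.0, 200.0)     # period fits the hypothesis
mismatching = period_likelihood(1.0 / 200.0, 230.0)  # period does not fit
joint = glimpse_likelihood(1.0 / 200.0, 0.65, (200.0, 550, 1600),
                           codebook_f1, codebook_f2)
```

In a real codebook, the lookup would need one entry per formant hypothesis and frequency channel; the dictionary above only shows the mechanics of the lookup.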

Fig. 5 State estimation methods. (1) Non-sequential Monte Carlo simulation: 3 plots show the state estimation procedure in each of the 3 dimensions: F0, F1, F2. Dots represent hypothetical states. Color intensity and the size represent the likelihood $p(O(n)|\vec{s}(n))$ computed for each state sample, given the input sPAF data $O(n)$. The state is estimated independently in each time step. (2) Sequential Monte Carlo simulation: 3 plots show the state estimation procedure in each of the 3 dimensions: F0, F1, F2. Dots represent hypothetical states (particles). Color intensity and the size represent the weight computed via the likelihood $p(O(n)|\vec{s}(n))$ for each state sample, given the input sPAF data $O(n)$. The continuity model $p(\vec{s}(n)|\vec{s}(n-1))$ defines how the samples evolve between consecutive time steps. Due to resampling, particles are focused on the region of the highest importance. (3) Regression neural network trained with the clean sPAF: the trained network generates the most likely output state $s(n)$, given the observation $o(n)$. (4) Regression neural network trained with the fragmentary sPAF (40% removed, 80% removed, and optimally segregated): the trained network generates the most likely output state $s(n)$, given the observation $o(n)$. $o(n)$ is the transformed version of $O(n)$

Fig. 10 Performance measures for F2 estimation. y-axis: error measure, x-axis: different input features, solid lines: different state estimation methods, dashed line: error computed for the artificially generated white noise with mean 1600 Hz, and standard deviation 300 Hz

We assume the states $s(n)$ given an observation $o(n)$ to be well approximated by a deterministic function $f(\cdot)$.

Network and training
We used a regression neural network to learn the relationship between sPAF patterns $o(n)$ and the corresponding voice parameters $s(n)$. The network is a standard feed-forward regression neural network with an input layer of 2576 neurons, matched with the dimensionality of an observation vector. It has two fully connected hidden layers with 1000 and 100 neurons, respectively. The activations for all layers apart from the output are sigmoid functions. The output layer has a dimensionality matched with the state vector and a linear activation function. The network was trained using the Nadam optimizer [16] and the log-cosh regression loss function, with 100 training epochs. The training data set consisted of pairs of data points (observation vectors $o$) and corresponding labels (state vectors $s$). Input feature vectors $o(n)$ were normalized to values between 0 and 1.
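The described architecture can be sketched with plain NumPy. The sketch shows only the forward pass, the log-cosh loss, and a min-max normalization; training with the Nadam optimizer is not shown. The weight initialization, the global min-max normalization over a batch, and the random input data are assumptions made for illustration.

```python
import numpy as np

LAYER_SIZES = [2576, 1000, 100, 3]   # input, two hidden layers, state (F0, F1, F2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(rng, sizes=LAYER_SIZES):
    """Random weights and zero biases (assumed initialization, untrained)."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, o):
    """Map observation vectors o to state estimates s = f(o)."""
    h = o
    for weights, bias in params[:-1]:
        h = sigmoid(h @ weights + bias)      # sigmoid activation in hidden layers
    weights, bias = params[-1]
    return h @ weights + bias                # linear output layer

def log_cosh_loss(prediction, target):
    """Mean of log(cosh(error)), written in a numerically stable form."""
    a = np.abs(prediction - target)
    return float(np.mean(a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)))

def min_max_normalize(x):
    """Scale values to the range [0, 1] (assumed: one global range per batch)."""
    low, high = x.min(), x.max()
    return (x - low) / (high - low) if high > low else np.zeros_like(x)

rng = np.random.default_rng(0)
params = init_params(rng)
observations = min_max_normalize(rng.random((4, 2576)))   # hypothetical sPAF batch
states = forward(params, observations)
loss = log_cosh_loss(states, np.zeros_like(states))
```

The log-cosh loss behaves like a squared error for small deviations and like an absolute error for large ones, which makes it less sensitive to outliers than the mean squared error.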