Automatic classification of the physical surface in sound uroflowmetry using machine learning methods

This work constitutes the first approach for automatically classifying the surface that the voiding flow impacts in non-invasive sound uroflowmetry tests using machine learning. Often, the voiding flow impacts the toilet walls (traditionally made of ceramic) instead of the water in the toilet. This may cause a reduction in the strength of the recorded audio signal, leading to a decrease in the amplitude of the extracted envelope. As a result, just from analysing the envelope, it is impossible to tell if that reduction in the envelope amplitude is due to a reduction in the voiding flow or an impact on the toilet wall. In this work, we study the classification of sound uroflowmetry data in male subjects depending on the surface that the urine impacts within the toilet: the three classes are water, ceramic and silence (where silence refers to an interruption of the voiding flow). We explore three frequency bands to study the feasibility of removing the human-speech band (below 8 kHz) to preserve user privacy. Regarding the classification task, three machine learning algorithms were evaluated: the support vector machine, random forest and k-nearest neighbours. These algorithms obtained accuracies of 96%, 99.46% and 99.05%, respectively. The algorithms were trained on a novel dataset consisting of audio signals recorded in four standard Spanish toilets. The dataset consists of 6481 1-s audio signals labelled as silence, voiding on ceramics and voiding on water. The obtained results represent a step forward in evaluating sound uroflowmetry tests without requiring patients to always aim the voiding flow at the water. We open the door for future studies that attempt to estimate the flow parameters and reconstruct the signal envelope based on the surface that the urine hits in the toilet.


Introduction
The growing interest in information and communication technologies is generating a paradigm shift in current health care systems, which are transitioning from face-to-face and reactive systems to remote and proactive systems. This has mutual benefits: it helps patients living in rural and hard-to-reach areas who have difficulty accessing such services, and it allows healthcare providers to access up-to-date medical information and resources quickly and efficiently. As a result, the quality of medical care is improved and the associated costs are reduced.
One of the problems currently affecting the ageing population is lower urinary tract symptoms (LUTS). LUTS affect bladder storage, emptying and post-voiding; they mostly affect the ageing male population and are caused by benign prostatic hyperplasia (BPH) [1]. LUTS lead to a decreased quality of life and a significant expenditure of health care resources [2].
It is estimated that more than 60% of men over 60 years of age suffer from LUTS [3]. Uroflowmetry (UF) is a non-invasive clinical test that is widely used to assess urinary tract function [4]. UF is used to provide objective evidence to evaluate the degree of prostate enlargement, overactive bladder, urinary incontinence and neurogenic bladder [5]. UF is performed with a uroflowmeter, a device that measures the bladder emptying rate as a function of time, the total volume voided and the duration of the process. With these values, urologists can obtain criteria to determine how well the urinary tract is functioning and thus obtain a diagnosis. A limitation associated with this test is that it generates situational stress in patients; this is known as shy bladder syndrome [6]. The patient is asked to void on demand in an unnatural environment, often with a very full bladder. This situation generates significant variability from test to test. As a result, it is recommended that more than one test be performed, which requires several visits to a clinic and can be time-consuming and costly [7].
Sound uroflowmetry (SU) has emerged as an alternative that allows flow parameters to be measured remotely and in a natural environment; it attempts to estimate the flow parameters from the sound generated by the impact of urine on the water in the toilet. It has been shown that there is a good correlation between the flow parameters obtained by UF and those obtained from SU, as well as between the shapes of the visual flow traces [8,9]. Multiple platforms have been developed to perform SU using various hardware configurations. These platforms make use of dedicated microphones [10] or general-purpose devices such as smartphones [9,11,12]; recently, the first platform for performing SU using smart watches was developed and validated [13,14].
One of the limitations associated with the SU test is that the person must aim the voiding flow at the water in the toilet at all times. If the voiding flow impacts the toilet walls (made of ceramic) instead of the water base, the strength of the sound captured by the recording device decreases as a result of the change in the physical surface. This results in prediction errors in the flow and envelope parameters of the signal: the sound produced by the impact of the voiding flow on ceramic could be wrongly interpreted as a flow interruption or a decrease in the flow rate. This limitation can become a serious problem if we consider the fact that the majority of the target population undergoing the test are elderly people. To address this limitation, in this work, we explore SU audio signals from male subjects and extract patterns from their frequency-domain characteristics, obtained using the fast Fourier transform (FFT), to identify the voiding flow impact surface. We develop a three-class classification algorithm with a high accuracy (ACC) for the detection of time intervals in which there is voiding against ceramic, voiding against water and silence in SU tests.
For this purpose, this work makes use of a set of machine learning (ML) algorithms developed from the mixed voiding event data obtained during this study. This work aims to provide an essential step forward in improving the performance of SU tests and to contribute to increasing their reliability by removing the requirement that patients always target the water in SU tests; instead, this method detects the physical impact surface and acts accordingly.
The paper is organised as follows: Section 2 briefly reviews the state of the art for audio feature extraction and classification using ML; Section 3 presents the materials and methods proposed in this research, where the specific characteristics of the dataset, feature selection and the theoretical foundations of the ML algorithms used in the classification process are described; Section 4 shows the results obtained from the proposed methodology; and, finally, Section 5 provides some concluding remarks.

Feature extraction in audio signals
Feature extraction is the process of identifying the distinctive properties of a signal [15], which are subsequently used as inputs for classification methods. Features can be extracted from signals in one of three domains: the frequency domain, the time domain and the time-frequency domain. In the frequency domain, spectral components obtained using the FFT [16], mel spectrograms [17] and mel frequency cepstral coefficients (MFCCs) [18] are conventionally used. In the time domain, several statistics have been used to characterise the discriminant information, such as the zero crossing rate (ZCR) and kurtosis [19]. Finally, in [20,21], novel approaches for the computational analysis of auditory scenes using time-frequency representations and discriminative content extraction are presented. Among these domains, the frequency domain includes a wide variety of representations [22], and MFCCs have been used extensively with both classical and deep learning approaches to obtain a high ACC [23].

Audio signal classification using ML
Audio classification has become a focus of attention in audio processing and pattern recognition research. It is difficult to find an optimal classifier and to select the optimal features from the many features that can be extracted from an audio fragment. Several methods have been proposed; they range from traditional signal processing techniques to more recent techniques using deep learning approaches. In [24,25], support vector machine (SVM)-based classifiers were proposed for audio signal classification. Other works made use of the SVM and random forest (RF), and a comparison of the behaviour of both classifiers showed that better results are obtained using RF [23,26].
With the advent of deep learning, more advanced techniques have been developed that can learn sound tagging tasks exceptionally well; they have become the standard in mobile and embedded applications [27]. These techniques include the convolutional neural network (CNN), recurrent neural network (RNN) and their variants, such as convolutional recurrent neural networks (CRNNs). In [28], an extensive study investigating CNN sets for audio classification is carried out, and in [29], a study using an RNN to classify environmental sound signals is carried out; very satisfactory results were obtained in both cases.
In summary, automatic audio classification is an active area of research, and there have been significant advances in both traditional and deep learning-based approaches. In this paper, we develop a classification algorithm that determines the impact surface in SU tests, that is, whether voiding against water or against ceramic is occurring or whether there is silence (absence of voiding). To the best of our knowledge, there are no previous works that use ML for surface classification in SU tests. As a result, there are no datasets of voiding sounds that include the three sound labels, so we have created a dataset of labelled sounds that was used to train our ML algorithms.

Dataset description
For the classification task, we have created a dataset of 6481 1-s audio clips by segmenting real voiding event audio recordings into 1-s chunks. The recordings were made with a professional microphone, the Ultramic384. This device has a highly sensitive audio sensor that allows a sampling rate (SR) of 384 kHz, allowing the study of a wide frequency spectrum. All the voiding audio clips were obtained from 15 male subjects voiding in a standing position.
The audio recordings were carried out in four Spanish domestic bathrooms, where the height of the toilet water from the floor was approximately 15 cm. The recording device was placed above the water tank of the toilet, at an approximate height of 90 cm above the floor. The audio clips are composed of three classes: voiding against ceramic (ceramic class), voiding against water (water class) and silence (silence class). The ceramic, water and silence classes represent 32.5%, 34% and 33.5% of the total recordings, respectively. The experimental procedures conform to the provisions of the Declaration of Helsinki (as revised in Edinburgh in 2000). Table 1 shows the proportions of the audio clips recorded in each of the bathrooms according to the class and the dimensions of each of the bathrooms. The procedure for the collection of the audio recordings of each class is detailed below:
• Ceramic class: It is composed of 2108 1-s audio clips that correspond to 104 voiding events of 15 different subjects who were aiming at the toilet wall. Based on the validation of the participants, we took the time intervals of the recordings in which only ceramic surface sounds were present and fragmented the audio recordings into 1-s frames.
• Water class: It is composed of 2203 1-s audio clips corresponding to 96 audio recordings of 12 subjects aiming at the water base. The audio recordings were fragmented using the same procedure that was used for the ceramic class.
• Silence class: This class does not represent a physical surface as such but is associated with an interruption of the voiding flow. It is composed of 2170 silent audio recordings made while a person was present in the bathroom, with the objective of recording the characteristics of the breathing process when there is an absence of voiding.

Feature selection
The first step in audio classification is to select the best procedure for characterising each audio sample in the dataset. First, we perform a spectral analysis over the entire frequency band recorded by our specialised microphone (0-192 kHz) to determine where the components that provide the most information for the classification process are located. For this purpose, we extract 1000 linear-binned FFT samples for each 1-s audio clip: the frequency range (0-192 kHz) is divided into 1000 equally spaced intervals, and for each interval, we sum the absolute values of the amplitudes of the components it contains, finally obtaining a vector with 1000 values that characterises each audio clip. Then, we perform supervised feature selection and classification using RF and build a model using a Gini impurity-based metric [30]. By using the Gini impurity to measure the quality of our split criterion, we can quantify the weighted impurity decrease contributed by each feature in the trees, indicating its importance. Figure 1 shows the predictive power of each frequency component based on Gini impurities for the ceramic, water and silence audio clips in our dataset. This figure shows that the bins around 1 kHz, 17 kHz and 30 kHz contain the greatest predictive power for the task of distinguishing among the three classes. To develop the ML models, we selected the band from 0 to 22.05 kHz because it is the frequency band captured by the vast majority of commercial microphones (SR = 44.1 kHz). This represents a compromise between the model performance and the cost and availability of the microphone being used.
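The linear binning described above can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function name `linear_binned_fft`, its parameters and the synthetic test tone are assumptions introduced here, and details such as windowing or normalisation are not specified in the text.

```python
import numpy as np

def linear_binned_fft(clip, sample_rate, n_bins, f_min=0.0, f_max=None):
    """Sum FFT magnitudes inside n_bins equally spaced frequency intervals.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if f_max is None:
        f_max = sample_rate / 2.0  # Nyquist frequency
    spectrum = np.abs(np.fft.rfft(clip))
    freqs = np.fft.rfftfreq(len(clip), d=1.0 / sample_rate)
    edges = np.linspace(f_min, f_max, n_bins + 1)
    in_band = (freqs >= f_min) & (freqs <= f_max)
    # Index of the interval each retained frequency component falls into.
    idx = np.digitize(freqs[in_band], edges) - 1
    idx = np.minimum(idx, n_bins - 1)  # a component at f_max joins the last bin
    return np.bincount(idx, weights=spectrum[in_band], minlength=n_bins)

# Synthetic 1-s clip (assumption): a 1.5 kHz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1500 * t)
features = linear_binned_fft(tone, sample_rate, n_bins=4)
print(features.argmax())  # index of the bin with the most spectral energy
```

Under these assumptions, a 1-s clip recorded at 384 kHz would be characterised with `linear_binned_fft(clip, 384000, n_bins=1000)`, giving the 1000-value vector described in the text.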
For the study of the 0-22.05 kHz band, we extract a 20-linear-bin FFT. Next, to visualise the degree of separability between the three classes, we apply the dimensionality reduction technique t-distributed stochastic neighbour embedding (t-SNE) [31], which converts similarities between data points into joint probabilities. The results are shown in Fig. 2; they demonstrate a high degree of separability between the three classes.
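A two-dimensional t-SNE embedding of 20-dimensional feature vectors can be obtained, for example, with scikit-learn. The sketch below uses synthetic Gaussian clusters as a stand-in for the real features; the perplexity, initialisation and random seed are assumptions, since the text does not report the t-SNE settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): three clusters of 20-dimensional vectors,
# one per class, in place of the real 20-bin FFT features.
centers = rng.normal(scale=5.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(50, 20)) for c in centers])
labels = np.repeat(["ceramic", "water", "silence"], 50)

# Project to two dimensions; these hyperparameters are illustrative choices.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)
```

Each row of `embedding` can then be drawn as a point coloured by its entry in `labels` to produce a plot in the style of Fig. 2.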

Sound classification model
In this subsection, we build three supervised ML algorithms to classify the physical void impact surface in an SU test. We have selected three models for our study: an SVM, an RF and a k-nearest neighbours (k-NN) classifier. We applied the stratified k-fold cross-validation method with k = 10 to divide our data into training and testing sets for each of the algorithms used. This validation method provides a robust and reliable estimate of a model's performance on unseen data and ensures that each split maintains a class distribution similar to that of the original dataset. These models have been selected because our dataset is too small to apply deep learning techniques. Below, we detail why we chose each model:
• SVM: It is a supervised learning algorithm used mostly for classification purposes. This algorithm is easy to use and performs well even with limited-size training datasets [23]. The only data-dependent step is the choice of the kernel and the corresponding feature space [32]. In our case, we have used the polynomial kernel since it generally performs better in classifying high-dimensional data when the data are not linearly separable, which is the case for the data in this paper.
• RF: This method is widely used for regression and classification tasks in many domains. RF works by constructing a large number of decision trees. Random decision forests reduce the tendency of individual decision trees to overfit the training data [23]. For the number of estimators, a parameter that indicates the number of trees in the forest, we experimentally tested different values and selected a value of 10 trees.
• k-NN: It is one of the simplest and most common classifiers, yet it can compete with the most complex classifiers in the literature [33]. k-NN is based on the idea of grouping data of the same nature. In other words, objects of the same category should be closer in terms of distance [34]. The core of this classifier depends mainly on measuring the distance or similarity between the tested examples and the training examples. To use the classifier, it is necessary to determine the number of neighbours; in our case, it is three.
Fig. 1 Predictive power (importance) of each frequency component in the classification task with three classes: ceramic, water and silence. The frequency band selected in our algorithms is shown in blue. The importance is calculated using the Gini impurity with a random forest model
Fig. 2 t-SNE plot that shows that the ceramic (blue), water (green) and silent (red) classes can be distinguished well
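The three classifiers and the stratified 10-fold cross-validation described above could be set up with scikit-learn roughly as follows. This is a sketch under stated assumptions: the data are synthetic, and every hyperparameter not given in the text (polynomial degree, regularisation, distance metric, shuffling and random seeds) is left at the library default or chosen arbitrarily.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 3 classes x 100 clips x 20 features.
centers = rng.normal(scale=5.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(100, 20)) for c in centers])
y = np.repeat([0, 1, 2], 100)  # 0 = ceramic, 1 = water, 2 = silence

models = {
    "SVM": SVC(kernel="poly"),                                      # polynomial kernel
    "RF": RandomForestClassifier(n_estimators=10, random_state=0),  # 10 trees
    "k-NN": KNeighborsClassifier(n_neighbors=3),                    # 3 neighbours
}

# Stratified folds keep the class proportions of the full dataset in each split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}
for name, fold_acc in scores.items():
    print(f"{name}: ACC = {fold_acc.mean():.4f} (SD = {fold_acc.std():.4f})")
```

The per-fold accuracies in `scores` give both a mean ACC and an SD for each model; the values printed here describe only the synthetic data, not the dataset of this study.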
Figure 3 shows a graphical pipeline diagram of the proposed methodology. Our input data are the SU audio signals. First, the audio signal is segmented into 1-s frames. Next, the FFT is applied to each of the frames to process the data in the frequency domain, and 20 linear bins are extracted. These bins are the input features of the classification algorithms. Finally, the algorithm outputs the classification results: the signal is predicted to be in the ceramic class, water class or silence class.
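The front end of this pipeline (segmentation into 1-s frames followed by a 20-bin FFT) can be sketched as below. The function name `frames_to_features`, the decision to drop a trailing partial frame and the synthetic test signal are assumptions made for illustration.

```python
import numpy as np

def frames_to_features(signal, sample_rate, n_bins=20, f_max=22050.0):
    """Split a recording into 1-s frames and return one feature row per frame.

    Hypothetical helper; a trailing partial frame is dropped (assumption).
    """
    n_frames = len(signal) // sample_rate
    frames = np.reshape(signal[: n_frames * sample_rate], (n_frames, sample_rate))
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(sample_rate, d=1.0 / sample_rate)
    edges = np.linspace(0.0, f_max, n_bins + 1)
    features = np.empty((n_frames, n_bins))
    for b in range(n_bins):
        lower = freqs >= edges[b]
        # The last bin is closed on the right so that f_max is included.
        upper = freqs <= edges[b + 1] if b == n_bins - 1 else freqs < edges[b + 1]
        features[:, b] = magnitudes[:, lower & upper].sum(axis=1)
    return features

# Synthetic 3.5-s signal at 44.1 kHz (assumption): 500 Hz tone, 10 kHz tone, silence.
sr = 44100
t = np.arange(sr) / sr
signal = np.concatenate([
    np.sin(2 * np.pi * 500 * t),
    np.sin(2 * np.pi * 10000 * t),
    np.zeros(sr),
    np.zeros(sr // 2),
])
features = frames_to_features(signal, sr)
print(features.shape)  # one row per complete 1-s frame
```

Each row of `features` would then be passed to a trained classifier, which assigns the corresponding second to the ceramic, water or silence class.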

Results and discussion
We next evaluate the three different ML algorithms using three different frequency bands. The first band, 0-22.05 kHz, covers the entire band available for the vast majority of commercial recording devices (SR = 44.1 kHz); this includes devices integrated into smartphones and smartwatches and dedicated devices. The second one corresponds to 0-8 kHz, which includes only information within the human speech band. Finally, the third one, from 8 to 22.05 kHz, is selected to evaluate the algorithms for the case in which it is necessary to preserve the users' privacy by eliminating human speech components.
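As a small worked example, the width of each of the 20 linear bins differs between the three bands because the same number of bins covers a different frequency span. The band names in the dictionary below are labels introduced here for illustration.

```python
import numpy as np

N_BINS = 20

# Band limits in Hz as given in the text; the dictionary keys are hypothetical.
BANDS_HZ = {
    "full": (0.0, 22050.0),
    "speech": (0.0, 8000.0),
    "privacy": (8000.0, 22050.0),
}

def bin_edges(f_min, f_max, n_bins=N_BINS):
    """Edges of n_bins equally spaced frequency intervals between f_min and f_max."""
    return np.linspace(f_min, f_max, n_bins + 1)

widths = {name: (hi - lo) / N_BINS for name, (lo, hi) in BANDS_HZ.items()}
for name, width in widths.items():
    print(f"{name}: {width:.1f} Hz per bin")
```

With 20 bins, each feature spans 1102.5 Hz in the 0-22.05 kHz band, 400 Hz in the 0-8 kHz band and 702.5 Hz in the 8-22.05 kHz band.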
For each of the three bands, we used 20 linear-binned FFT features. We used stratified 10-fold cross-validation to ensure that each fold preserves the class distribution of the dataset. For each model, we report the following performance metrics: the F1-score, ACC, standard deviation (SD), false positive rate (FPR) and false negative rate (FNR). Figure 4 shows the confusion matrices for each of the three models in the three frequency bands analysed. Table 2 shows the results obtained. This table shows that similar results are obtained for the three models, with the values of the ACC and F1-score ranging from 89.38 to 99.46% for the three models across the three frequency bands. Overall, the RF model presents the best performance results for each frequency band for the task of classifying the physical surface in SU tests. Furthermore, we can safely remove the human speech frequency band and consider the range 8-22.05 kHz, since the RF model maintains a high ACC (93.29%) and F1-score (93.30%). We believe that the removal of human speech could be a requirement for some users who want privacy in their SU test.
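For a three-class problem, the ACC, F1-score, FPR and FNR can all be derived from a confusion matrix. The sketch below treats each class in a one-vs-rest fashion, which is an assumption: the text does not state how the multi-class FPR and FNR were aggregated. The confusion matrix is made up for illustration and is not taken from Fig. 4.

```python
import numpy as np

def one_vs_rest_metrics(cm):
    """Return (accuracy, per-class F1, per-class FPR, per-class FNR).

    cm[i, j] counts clips of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # true class i, predicted as another class
    fp = cm.sum(axis=0) - tp   # another class, predicted as class i
    tn = cm.sum() - tp - fn - fp
    accuracy = tp.sum() / cm.sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    return accuracy, f1, fpr, fnr

# Made-up confusion matrix; rows and columns ordered ceramic, water, silence.
cm = [[48, 1, 1],
      [2, 47, 1],
      [0, 0, 50]]
accuracy, f1, fpr, fnr = one_vs_rest_metrics(cm)
print(round(accuracy, 4), fpr, fnr)
```

Averaging the per-class values (a macro average) would give a single figure per metric; whether the reported values are macro or weighted averages is not specified in the text.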
These positive results reinforce the decision in this work to consider frequencies below 22.05 kHz, eliminating the need for specialised microphones. This demonstrates that the surface can be classified accurately in SU tests using commercial recording devices. Therefore, it is not necessary to use specialised and expensive recording equipment with sampling rates above 44.1 kHz.

Surface classification in mixed-surface SU audio clips
Next, we need to validate our models for the typical voiding event in which, within the same voiding event, the urine impacts both the water and the ceramic surface. We collected 15 voiding events in two bathrooms corresponding to bathrooms 2 and 3 in Table 1. The audio recordings for these tests were not used in the training phases of our models. The participants were asked to aim the voiding flow at the toilet ceramic and water within the same voiding event. Table 3 summarises the characteristics of the voiding events performed.
During the tests, there were time intervals, especially at the end of some tests, in which the flow gradually decreased until it became a dribble. We considered these intervals indeterminate and did not take them into account in the evaluation of the algorithm (see Fig. 5, where the indeterminate time is marked with grey dots). This is because it was impossible for the volunteers who performed the test to determine accurately whether these seconds corresponded to voiding against ceramic or water. It is important to note that these time intervals contain a mixture of dribbling against water and ceramic. These intervals generate some uncertainty in the classification task, but this is of little consequence if we consider that, according to urologists' criteria, the final seconds of the voiding event do not provide relevant information for screening or diagnosis.
In the 15 audio recordings processed, 700 s were analysed, corresponding to 258, 222 and 220 s of the ceramic, water and silence classes, respectively. To evaluate the automatic classification of the impact surface, we used the RF classifier with the features extracted for the 0-22.05 kHz band. We selected this configuration because it provided the best overall classification results. Additionally, most commercial recording devices allow recording in this band, which facilitates its implementation.
Figures 5, 6 and 7 show the results obtained by the algorithm for three selected voiding events. Red, blue and green represent the silence, ceramic and water classes, respectively, for each 1-s interval. The circles represent the ground truth, while the diamonds represent the inference made by the RF algorithm. By comparing the ground truth and the output of the RF model, we obtained a classification ACC of 98.17%.
Fig. 5 Results for signal four, repetition one (see Table 3)
Fig. 6 Results for signal six, repetition two (see Table 3)
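The per-second comparison between the ground truth and the model output, with indeterminate seconds excluded, can be sketched as follows. The function name, the `"indeterminate"` label and the toy sequences are assumptions for illustration; they are not data from the 15 recordings.

```python
import numpy as np

def masked_accuracy(truth, predicted, ignore_label="indeterminate"):
    """Fraction of 1-s intervals classified correctly, skipping ignored seconds."""
    truth = np.asarray(truth)
    predicted = np.asarray(predicted)
    keep = truth != ignore_label  # drop seconds with no reliable ground truth
    return float(np.mean(truth[keep] == predicted[keep]))

# Toy voiding event (made up): one label per second.
truth = ["silence", "ceramic", "ceramic", "water", "water",
         "indeterminate", "indeterminate"]
predicted = ["silence", "ceramic", "water", "water", "water",
             "water", "ceramic"]
acc = masked_accuracy(truth, predicted)
print(acc)  # 4 of the 5 labelled seconds match
```

Only the five labelled seconds enter the calculation, so the predictions made during the final dribble do not affect the score.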

Conclusions
This work addresses the problem of the automatic classification of the physical voiding flow impact surface in SU tests. One of the SU requirements is that the voiding flow must always impact the water in the bowl of the toilet. However, in a real-world scenario, the voiding flow often impacts the toilet wall. This requirement represents a constraint, especially for elderly people and children. If this requirement is not met, the estimation of the flow parameters will be negatively affected.
We built a dataset of 6481 1-s audio clips labelled as silent (no voiding), ceramic (voiding against ceramic) and water (voiding against water) to train three automatic classification models. Three algorithms were trained to automatically evaluate the classification of the surface in three frequency bands within the 0-22.05 kHz commercial band: the SVM, RF and k-NN. The results show that the RF classifier using the FFT-based features in the frequency range of 0-22.05 kHz obtains a classification ACC of 99.46% for distinguishing among voiding events against ceramic or water and silence (absence of voiding flow). Furthermore, we can safely remove the human speech frequency band and consider the range 8-22.05 kHz, since the RF model maintains a high ACC (93.29%) and F1-score (93.30%).
Next, we collected data from 15 real SU tests performed by three male subjects in three different bathrooms. The subjects were instructed to change the impact surface during the voiding event. We validated the positive inference performance of the model for differentiating among the three surfaces. With this work, we open the door for new studies that will allow the analysis of the voiding flow and the extraction of the envelope parameters as a function of the surface that the urine impacts. The results will allow SU tests to be performed without the existing limitation of always targeting the water in the toilet.

Future work
For future work, our goal is to study the estimation of the voiding parameters (flow rate and volume) as a function of the surface that the voiding flow impacts (water or ceramic) and thereby eliminate the requirement in current SU tests to always aim at the water in the toilet bowl. Additionally, we will analyse the reconstruction of the signal envelope in the time intervals in which the voiding flow impacts a ceramic surface, as if it had impacted water. This will allow us to automatically classify the voiding patterns, regardless of the voiding impact surface, according to the four existing patterns in the literature (normal, intermittent, fluctuating and plateau), each of which represents a set of underlying dysfunctions.

Fig. 3 Diagram showing the pipeline of the proposed methodology

Table 1
Proportion of audio clips of each class recorded in each bathroom

Table 2
Evaluation of models by frequency range in terms of the classification ACC, F1-score, SD, FPR and FNR

Table 3
Voiding characteristics