Depression-level assessment from multi-lingual conversational speech data using acoustic and text features

Depression is a widespread mental health problem around the world with a significant burden on economies. Its early diagnosis and treatment are critical to reduce costs and even save lives. One key aspect of achieving that goal is to use technology to monitor depression remotely and relatively inexpensively using automated agents. There have been numerous efforts to automatically assess depression levels using audiovisual features as well as text analysis of conversational speech transcriptions. However, difficulty in data collection and the limited amounts of data available for research present challenges that hamper the success of the algorithms. One of the two novel contributions in this paper is to exploit databases from multiple languages for acoustic feature selection. Since a large number of features can be extracted from speech, given the small amounts of training data available, effective feature selection is critical for success. Our proposed multi-lingual method was effective at selecting better features than the baseline algorithms, which significantly improved the depression assessment accuracy. The second contribution of the paper is to extract text-based features for depression assessment and to use a novel algorithm to fuse the text- and speech-based classifiers, which further boosted the performance.


Introduction
Depression is a serious problem that affects a large percentage of the population around the world. It not only affects the well-being and productivity of individuals but also places a heavy economic burden on society [1]. In fact, with more than 300 million depression patients, the World Health Organization (WHO) declared depression the leading cause of ill health and disability worldwide [2]. Because access to diagnosis and treatment is expensive and sometimes not possible, inexpensive and accurate diagnosis with the help of technology has become an increasingly important research challenge [3].
The speech signal has been investigated for detecting depression since it carries significant information about the mental health of the speaker [4][5][6][7]. Combined with the pervasive use of smartphones in our daily lives, and hence relatively easy and non-intrusive access to good-quality speech data, remote monitoring of patients through acoustic analysis has become a promising research area [8].

*Correspondence: cenk.demiroglu@ozyegin.edu.tr. †Equal contributors. ¹Department of Electrical and Electronics Engineering, Ozyegin University, Istanbul, Turkey. Full list of author information is available at the end of the article.
In [9], phase distortion deviation, which is used for voice quality examinations, was found to be helpful for detecting depression. In [10], distortions in formant trajectories were used to detect depression. In [11], degradation in spectral variability was used. In [12], gender-dependent feature extraction was found to improve detection performance. In [13], i-vectors and MFCC features, which are commonly used for speaker verification, were found to be helpful for depression detection even when the utterances were only 10 s long.

Feature selection
A large number of acoustic features can be derived from conversational speech to detect depression. However, building models with those features is challenging because of the curse of dimensionality and the typically small amounts of training data available in depression studies.
One way of reducing the dimensionality of features is to use feature selection, where the features that are most relevant for the classification task and least correlated among themselves are selected for classification. To that end, the minimum redundancy maximum relevance (MRMR) algorithm is commonly used [25][26][27].
In [28], a two-step feature selection algorithm was proposed. The conversation is segmented into topics and features are extracted for each topic. As a first step, correlation-based feature subset selection was applied regardless of the topics [29]. In the second step, the selected features for each topic were further refined by first ranking them based on relevance and selecting subsets using regression tests. In [30], a simple t test was used to select features from a set of 504 acoustic features.
Besides selecting features automatically, there are knowledge-based feature sets that are designed for emotion detection. One of the more popular examples of that approach is the Geneva feature set (GeMAPS) [31], which was developed by assembling a minimal set of acoustic features that have been shown in the literature to be reliable indicators of emotional state and that have the highest theoretical significance.

Fusion of text and audio features
Transcriptions of the speech signal have also been used as another mode of information [3] for depression detection. In [32], transcription-derived features were used in addition to the speech features. Furthermore, sentiment analysis was performed on text and sentiment features were used to build an independent detector. Then, score fusion was used to combine acoustic and text-based system scores. Syntactic and semantic features were derived from transcriptions in [33] and shown to be effective indicators of depression.
Conversations with patients can be designed to elicit data that is more indicative of depression than a regular conversation. In [34], the type of questions (positive and negative stimulus) asked during conversations was shown to impact voice quality parameters in psychologically distressed subjects. Speech segments with higher articulation effort were found to be more informative for depression detection in [6].
In [35], biomarkers that are derived from facial coordination and timing features were used together with vocal cues and semantic features from dialogue content using a sparse-coded lexical embedding space. In [36], depressed individuals were shown to use less social words and more anxiety-related words.
A depression-detection algorithm is presented in [37] where interactions between subjects and the computer agent were modeled without explicit topic modeling. Long short-term memory (LSTM) neural networks were used with audio and text features. The results in [37] suggested that minimal knowledge of the conversation is required for depression detection.
In [38], both conversation-level (number of sentences, number of words used, etc.) and content-level (feeling good/bad, extrovert/introvert personality, etc.) information derived from the transcripts of the dialogs were used to extract features and then scores from both audio and text features were fused via a DNN model.
There have also been attempts to extract both audio and text features using deep networks, as well as to fuse those features with a deep network. For example, deep spectrum features [14] for audio were fused with a BERT-based text representation in [39] using a fully connected layer.

Cross-lingual depression detection
In depression detection, a less studied research challenge is to use speech data from other languages to train models. This approach is not only important for understanding universal cues of depression across different cultures/languages, but it also allows the use of data from other languages, which is important given the typically small amounts of data available in the public databases for each language. In [40], prediction models built with a German database were shown to produce prediction scores in English that were correlated with the self-assessment scores. In [30], combining datasets in different languages was shown to yield high accuracy, whereas performance was lower when the training and test data were in different languages.
In [41], transfer of models developed for the resource-rich English language to other languages with limited datasets was investigated. The method was shown to improve aphasia detection and to hold promise for Alzheimer's disease detection.

Minimum redundancy maximum relevance (MRMR) feature selection
In the MRMR approach, the F-statistic is used for computing the relevance of a selected feature set S for a K-class classification task. The F-statistic for feature g_i is defined as

$$F(g_i) = \frac{\sum_{k=1}^{K} n_k\,(\bar{g}_{i,k} - \bar{g}_i)^2 / (K-1)}{\sigma_i^2}, \tag{1}$$

where ḡ_{i,k} is the mean of g_i for the training samples in class k, ḡ_i is the global mean of g_i over all samples in all classes, and n_k is the total number of samples in class k. The pooled variance σ_i² is

$$\sigma_i^2 = \frac{\sum_{k=1}^{K} (n_k - 1)\,\sigma_{i,k}^2}{N_s - K}, \tag{2}$$

where σ_{i,k}² is the variance of g_i in class k, and N_s is the total number of samples.
The relevance of a feature set S is then defined as

$$V_S = \frac{1}{|S|} \sum_{i \in S} F(g_i). \tag{3}$$

The redundancy of the feature set S is defined using Pearson's correlation for every possible feature combination:

$$W_S = \frac{1}{|S|^2} \sum_{i,j \in S} |c(i,j)|, \tag{4}$$

where |c(i,j)| is the absolute value of the correlation c(i,j) between feature i and feature j. Finally, the MRMR algorithm selects the feature set S using

$$\max_{S} \frac{V_S}{W_S}. \tag{5}$$
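For concreteness, the relevance and redundancy computations above can be sketched in Python; the greedy forward search and the toy data are illustrative choices, not taken from the paper:

```python
import numpy as np

def f_statistic(x, y):
    """One-way ANOVA F-statistic of feature column x against class labels y (Eqs. 1-2)."""
    classes = np.unique(y)
    K, n = len(classes), len(x)
    grand_mean = x.mean()
    between = sum(
        x[y == k].size * (x[y == k].mean() - grand_mean) ** 2 for k in classes
    ) / (K - 1)
    pooled_var = sum(
        (x[y == k].size - 1) * x[y == k].var(ddof=1) for k in classes
    ) / (n - K)
    return between / pooled_var

def mrmr_select(X, y, n_select):
    """Greedy MRMR: repeatedly pick the feature maximizing relevance over
    mean redundancy against the features already selected."""
    n_feat = X.shape[1]
    relevance = np.array([f_statistic(X[:, i], y) for i in range(n_feat)])
    corr = np.abs(np.corrcoef(X, rowvar=False))  # |c(i, j)| of Eq. (4)
    selected = [int(np.argmax(relevance))]       # start from the most relevant feature
    while len(selected) < n_select:
        best, best_score = None, -np.inf
        for i in range(n_feat):
            if i in selected:
                continue
            score = relevance[i] / corr[i, selected].mean()
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

The greedy search is the usual incremental approximation to the set-level criterion, since evaluating every subset S is intractable.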

Proposed feature selection algorithms
The MRMR algorithm works well in many machine learning problems. However, for the depression detection problem, training data is typically limited, and therefore, the computation of the F-statistic and feature correlations is often unreliable. Here, we propose three algorithms to more reliably compute the statistics required for the MRMR algorithm, as described below.

Multi-lingual computation of relevance
The F-statistic computation in Eq. (1) requires estimation of the global variance σ_i², the global mean ḡ_i, and the class means ḡ_{i,k} for each class k and feature i. Even though the global mean and variance can usually be estimated relatively reliably, estimating the class means is more challenging when the number of classes is large and the data is limited, as is often the case in depression screening tests.
The publicly available databases used in depression studies typically have fewer than 200 subjects. Moreover, the commonly used depression evaluation tests BDI-II and PHQ-8 have 64 and 25 classes, respectively. Thus, the number of subjects available per class is usually not enough to compute the relevance reliably. In the multi-lingual MRMR (ml-MRMR) approach, to increase the number of available samples for each class and improve the computation of the F-statistic, we propose populating each class using samples available in a different language for that same class. For example, if there is only one subject with a PHQ-8 score of 10 in the Turkish dataset, then feature vectors of subjects with a PHQ-8 score of 10 from the German dataset can be used to populate the Turkish dataset.
In some cases, the number of samples is still low after cross-lingual population of classes. In that case, samples from the neighboring classes in a different language are used to further increase the sample size. This approach takes advantage of the fact that subjects in neighboring classes (PHQ-8 levels 9 and 10, for example) are expected to be similar to each other. That assumption, though, becomes less valid as the neighboring class gets further away from the target class. Thus, while populating classes with cross-lingual data, each sample borrowed from a neighboring class is weighted by a parameter γ_j that decreases monotonically with j = |c_tar − c_nb|, the distance of the target class c_tar to the neighboring class c_nb. After cross-lingual population of class k, the number of samples n_k is computed using the weight parameter γ as

$$n_k = \tilde{n}_k + \sum_{j=0}^{J_k} \gamma_j\, \tilde{n}_{k+j}, \tag{8}$$

where ñ_k is the number of native samples in class k and ñ_{k+j} is the number of samples borrowed from the jth neighbor of class k. J_k is set such that n_k > N_min. Thus, by including data from the same and neighboring classes in a different database, we ensure that there are at least N_min samples for each class in the target database. The adjusted mean of each class k, ḡ_{i,k}, is then

$$\bar{g}_{i,k} = \frac{1}{n_k}\left(\sum_{s=1}^{\tilde{n}_k} g_{i,k}(s) + \sum_{j=0}^{J_k} \gamma_j \sum_{s=1}^{\tilde{n}_{k+j}} g_{i,k,j}(s)\right), \tag{9}$$

where the second term is the cross-lingual component and g_{i,k,j}(s) is the ith feature of sample s borrowed from the jth neighbor of class k. The new F-score is obtained by substituting n_k and ḡ_{i,k} into Eq. (1).

An example of how sparse classes are populated is shown in Fig. 1, where the Turkish dataset is populated with samples from the German dataset. Note that the classes that do not have samples are not populated in the ml-MRMR algorithm, as shown in Fig. 1. Those classes are ignored in the MRMR computations. Figure 2 shows the histograms of ḡ_{i,k} for all features i and classes k using the baseline MRMR and the proposed ml-MRMR algorithms with N_min = 3. Samples from the German database are used to populate the Turkish database.
The distribution gets closer to a Gaussian with the ml-MRMR algorithm compared to the baseline MRMR algorithm. Moreover, heavy-tails generated with the baseline MRMR algorithm are suppressed, which indicates that the ml-MRMR algorithm can effectively reduce the outliers in the data.
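A minimal sketch of the cross-lingual class population for a single feature follows. The decay γ_j = 1/(1+j) is a hypothetical stand-in (the paper defines its own weight parameter γ), and the dictionary layout of the donor database is assumed for illustration:

```python
def populated_stats(target_x, donors, k, n_min, j_max=5,
                    gamma=lambda j: 1.0 / (1 + j)):
    """Weighted sample count n_k and class mean for one feature of class k
    after cross-lingual population (cf. Eq. (8)).  `target_x` holds the native
    feature values of class k; `donors` maps donor-language class labels to
    lists of feature values.  gamma is an illustrative distance decay."""
    pool = [(x, 1.0) for x in target_x]  # native samples enter with weight 1
    n_k = float(len(pool))
    j = 0
    while n_k < n_min and j <= j_max:    # widen the neighborhood until n_k is large enough
        w = gamma(j)
        for c in ({k} if j == 0 else {k - j, k + j}):
            for x in donors.get(c, []):
                pool.append((x, w))      # borrowed sample, down-weighted by distance
                n_k += w
        j += 1
    mean = sum(w * x for x, w in pool) / sum(w for _, w in pool)
    return n_k, mean
```

With one native Turkish sample at score 10 and German donors at scores 10 and 11, the class mean is pulled toward the donors, with the score-11 sample contributing at half weight.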

Clustering approach
Depression screening tests often have a large number of classes. For example, PHQ-8 has 25 classes and BDI-II has 64 classes. However, in diagnosis, a level of depression (severe, moderate, etc.) corresponds to a range of classes. For example, all subjects with PHQ-8 scores between 20 and 24 are diagnosed as severely depressed. Thus, the distinction between classes with similar scores is likely not represented in conversational speech. For instance, the difference between two subjects with scores of s and s + 1 may not be significant enough to warrant different classes for those two cases. Given the limited training data available, we propose clustering samples that have similar scores and reducing the number of distinct classes to increase the number of samples per class.
In the clustering approach, the depression classes are clustered so that the number of classes in the MRMR training process is reduced, which improves the feature selection performance by increasing the data available for each class. In this approach, the data is split uniformly into N_clus classes. Cluster centroids are found by first uniformly dividing the score scale. If a centroid class has no samples, then the nearest non-empty class is assigned as the centroid. After setting the centroids, each class is assigned to the nearest centroid. Figure 3 shows the sample distribution after the clustering approach is applied to the Turkish database with N_clus = 14. Compared to the original distribution in Fig. 1, the distribution of samples per class becomes more uniform after clustering, which enables more robust computation of the relevance required for the MRMR algorithm.
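The centroid initialization and nearest-centroid assignment can be sketched as below; the score counts are invented for illustration, and N_clus ≥ 2 with at least one non-empty score class is assumed:

```python
def cluster_classes(counts, score_max, n_clus):
    """Map each non-empty depression score to one of n_clus cluster centroids.
    counts: dict mapping score -> number of samples (hypothetical layout).
    Assumes n_clus >= 2 and at least one non-empty score."""
    nonempty = sorted(s for s, n in counts.items() if n > 0)
    centroids = []
    for m in range(n_clus):
        c = round(m * score_max / (n_clus - 1))      # uniform split of the score scale
        c = min(nonempty, key=lambda s: abs(s - c))  # snap empty centroid classes
        if c not in centroids:
            centroids.append(c)
    # assign every non-empty class to its nearest centroid
    return {s: min(centroids, key=lambda c: abs(c - s)) for s in nonempty}
```

Samples then inherit the cluster label of their score class before the MRMR statistics are computed.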

Robust computation of redundancy (RCR)
Class labels are not required for the computation of redundancy, as shown in Eq. (4). Thus, large amounts of speech data without depression scores can be exploited for computing the redundancy. In the RCR approach, we propose using such unlabeled speech databases to compute redundancy for feature selection. Figure 4 shows the distribution of correlations between features. Enriching the English database with unlabeled data had a significant effect on the distribution, with a sharper peak around zero and slightly suppressed tails. Thus, the distribution has lower variance after applying the RCR approach, which is expected to improve the feature selection performance.
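Because Eq. (4) needs no labels, the RCR idea reduces to pooling labeled and unlabeled recordings before estimating the correlation matrix; a minimal sketch:

```python
import numpy as np

def redundancy_matrix(X_labeled, X_unlabeled):
    """Pairwise |correlation| of Eq. (4), estimated from labeled *and*
    unlabeled recordings, since redundancy needs no depression scores.
    Rows are recordings, columns are acoustic features."""
    X = np.vstack([X_labeled, X_unlabeled])
    return np.abs(np.corrcoef(X, rowvar=False))
```

The larger pooled sample reduces the variance of the correlation estimates, which is the effect visible in the sharper distribution of Fig. 4.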

Description of text-based features
Sentiments in questions and patient responses in the Turkish database were manually classified as positive, negative, and neutral. Examples of questions and answers with their sentiment tags are shown in Table 1. Feature vectors were generated from the sentiment tags where each dimension holds the frequency of question-answer sentiment pairs. Because there are three sentiments for questions and three sentiments for answers, a 9-dimensional sentiment feature vector was generated for each conversation.

Speech characteristics such as rate of speech and duration of responses can also be informative in depression studies. For example, given two positive responses from the subject, longer ones with elaboration are preferable to short ones. Similarly, short negative answers may indicate deeper depression than longer complaints. Thus, for each sentiment type, the average rate of speech and the average duration of responses were extracted using the timing information in the transcriptions. Because those two features were derived for each of the three sentiment types, 6-dimensional features were obtained for each conversation. Concatenating them with the 9 features described above, a total of 15 features were derived from the transcriptions. A summary of those features is shown in Table 2.

Table 2 Descriptions of the text features derived from the transcripts of the conversations

Average length of the utterances
Average length of subjects' negative, positive, and neutral answers separately. Three-dimensional feature.

Rate of speech
Rate of speech for negative, positive, and neutral answers separately. Three-dimensional feature.

Sentiment features
Sentiments of the question-answer pairs. All possible combinations of sentiments are considered. Nine-dimensional feature.
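Assuming each conversation is available as a list of (question sentiment, answer sentiment, answer duration, word count) tuples — a hypothetical layout, not the paper's transcription format — the 15-dimensional vector of Table 2 can be sketched as:

```python
SENTS = ["negative", "neutral", "positive"]

def text_features(turns):
    """15-dim text feature vector of Table 2.  `turns` is an assumed layout:
    (question sentiment, answer sentiment, answer duration in s, word count)."""
    pair_counts = [0.0] * 9                  # 3 question x 3 answer sentiments
    dur = {s: [] for s in SENTS}
    rate = {s: [] for s in SENTS}
    for q, a, d, n in turns:
        pair_counts[SENTS.index(q) * 3 + SENTS.index(a)] += 1
        dur[a].append(d)                     # response duration per answer sentiment
        rate[a].append(n / d)                # words per second, assumes d > 0
    total = sum(pair_counts) or 1.0
    avg = lambda v: sum(v) / len(v) if v else 0.0
    return ([c / total for c in pair_counts]  # 9 pair frequencies
            + [avg(dur[s]) for s in SENTS]    # 3 average response durations
            + [avg(rate[s]) for s in SENTS])  # 3 average speech rates
```

A sentiment type with no answers contributes zeros, so the vector length is fixed at 15 regardless of the conversation.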

Fusion of acoustic- and text-based features
The fusion algorithm is designed based on the observation that the acoustics-only system sometimes makes large errors, particularly when the subjects are very depressed or not depressed, as shown in Fig. 5. Those large errors significantly impact the overall performance of the system and reduce its reliability. In our approach, instead of using a typical score or feature fusion method, we propose a novel algorithm to adjust the acoustic-based scores using the text-based scores. In this approach, the data is first divided into two classes: patients with BDI-II scores above 30 are tagged as class 1, and patients with scores below 18 are tagged as class 2.
If the acoustic-only system generates a depression level estimate that is above 30 or below 18 and the text-only system also produces a score in the same range (agreement case), then the score from the acoustics-only system is used. If they are in disagreement, i.e., one of the systems produces an estimate that is in class 1 and the other produces an estimate that is in class 2, the final estimate is computed by fine-tuning the acoustics-only prediction, moving it closer to the opposite class.
In the case of disagreement, the following algorithm is used to adjust the estimate produced by the acoustics-based system. If the prediction of the acoustics-based system is p_acou, the final prediction p_final is computed by the linear model

$$p_{\text{final}} = \alpha\, p_{\text{acou}} + \beta,$$

where α and β are constant parameters. Because the training data is limited, to avoid overfitting, the linear regression parameters were learned using a maximum a posteriori (MAP) approach where the prior distribution of α was modeled with a Gaussian distribution whose variance and mean were both set to 1. The mean is set to 1 so that p_final does not deviate significantly from p_acou. The prior distribution of β was also modeled with a Gaussian whose variance is set to 1 and whose mean is set to μ_g. The mean of the hyper-parameter is learned from the data by setting α to 1 and learning the optimal β using leave-one-out.
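A sketch of the disagreement rule and the linear adjustment, using the class thresholds from the text; the α and β arguments stand in for the MAP-learned parameters, and the defaults shown are only the prior means:

```python
def fuse(p_acou, p_text, alpha=1.0, beta=0.0, hi=30, lo=18):
    """Adjust the acoustic prediction with the text prediction only when the
    two systems disagree about the extreme classes.  alpha/beta are stand-ins
    for the MAP-learned linear parameters (prior means 1 and mu_g)."""
    def cls(p):
        if p > hi:
            return 1   # severely depressed range (BDI-II > 30)
        if p < lo:
            return 2   # non-depressed range (BDI-II < 18)
        return 0       # outside both classes: no adjustment
    if cls(p_acou) and cls(p_text) and cls(p_acou) != cls(p_text):
        return alpha * p_acou + beta   # nudge toward the opposite class
    return p_acou
```

In the agreement case, and for predictions between the two thresholds, the acoustic score passes through unchanged.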
Experimental setup

Databases
Three speech databases, in Turkish, German, and English, were used in this study. The databases are described below.

Turkish database: The Turkish database was collected at a hospital in Istanbul. It consists of 70 subjects. The mean age of the patients is 34. Fourteen of them are males and 56 are females. Depression scores of all subjects are available via the Beck Depression Inventory-II (BDI-II) questionnaire [42]. The average BDI-II score of the patients is 23.45 with a standard deviation of 11.01. The Turkish database consists of interviews with the patients. Three types of questions were directed to the patients: neutral, positive, and negative. Each question type refers to the sentiment that we expect to invoke in the patient. Sentiments of the responses from the patients were manually tagged by three independent evaluators. Majority voting was used for the final sentiment label of each response. Examples of sentiment labels for the questions and answers are shown in Table 1.
Interviews consist of 16 questions. Mean length of the conversations is approximately 5 min. Total length of the recordings is 6 h. Recordings were done using a headphone microphone connected to the built-in sound card of a laptop with a sampling rate of 48 kHz.
German database: The German database, distributed as part of the AVEC 2014 challenge [43], consists of conversations with 84 patients. Some of the patients have multiple recordings, taken 2 weeks apart. Even though the Beck scores of the 100 recordings in the training and development data are available, the scores of the 50 recordings in the test data are not. The mean age of the German database subjects is 31.5. The duration of the recordings ranges from 6 s to 4 min. All recordings shorter than 20 s were removed from the experiments, which left 98 recordings for processing.
English database: The English database is part of the Distress Analysis Interview Corpus (DAIC) [44]. It contains clinical interviews designed to help diagnose psychological distress conditions such as anxiety, depression, or post-traumatic stress disorder. The depression part of the corpus consists of Wizard-of-Oz interviews conducted by a virtual interviewer. The depression scores of the patients were calculated using the PHQ-8 depression inventory [45], which differs from the German and Turkish databases. The average depression severity of the training and development data is 6.67, and the standard deviation is 5.75. A total of 189 recordings from 189 patients is available.

Depression scores
We performed both regression and classification experiments in this study. For the classification task, the scores were split into two classes. For the BDI-II scores that were available in the Turkish and German databases, subjects that have scores below 18 were classified as non-depressed and other patients were classified as depressed. For the PHQ-8 scores available in the English database, subjects that have scores below 10 were classified as non-depressed and other patients were classified as depressed.
For regression, the ml-MRMR algorithm requires databases to use the same depression scale for computing within-class statistics. However, in our experimental setup, the English database has PHQ-8 scores that range from 0 to 24 and the German and Turkish databases have Beck scores ranging from 0 to 63. Thus, a mapping function between those two scales was needed to carry out the multi-lingual regression experiments.
The BDI-II and the PHQ-8 are both widely used as self-rating scales to measure depression symptoms and severity in psychiatric and normal populations [46]. The recall period for the items of each scale is the last 2 weeks. There are 21 items in the BDI-II and 8 items in the PHQ-8. For the PHQ-8, each item is scored on a four-point scale (0-3) where 0 corresponds to "not at all" and 3 corresponds to "nearly every day". BDI-II items also have four-point scales (0-3), but those do not measure the frequency of occurrence but rather the general presence of a feeling/behavior.
The BDI-II was designed to correspond to the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV) criteria for diagnosing depressive disorders and includes items measuring cognitive, affective, somatic, and vegetative symptoms of depression. Similarly, the PHQ-8 consists of 8 of the DSM-IV criteria and covers all of the DSM-IV criteria except self-harm.
Even though they have differences, the PHQ-8 and BDI-II scores are strongly correlated [47]. For PHQ-8, scores of 5, 10, 15, and 20 are cut-off points for mild, moderate, moderately severe, and severe depression respectively [45,48]. For the BDI-II, the cut-off points for mild, moderate, and severe depression are 14, 20, and 29 respectively. Thus, the cut-off scores of the two measures have an approximately linear relationship.
Considering the strong correlation between the BDI-II and PHQ-8 scores, we mapped a given BDI-II score s_b to the corresponding PHQ-8 score s_p by rounding (24 s_b)/63 to the nearest integer.
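The mapping can be written directly:

```python
def bdi_to_phq8(s_b):
    """Map a BDI-II score (0-63) to the PHQ-8 scale (0-24) by linear scaling
    and rounding to the nearest integer."""
    return round(24 * s_b / 63)
```

The endpoints map exactly (0 → 0, 63 → 24), so the two scales align at their extremes.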

Acoustic features extraction
The open-source toolkit openSMILE [49] was used for acoustic feature extraction. The AVEC 2013 [43] and GeMAPS [15] feature extraction protocols were used. Feature vectors for AVEC 2013 include 32 energy- and spectral-related low-level descriptors (LLDs) and their functionals, such as statistical functionals (maximum, mean, skewness, flatness, etc.), regression functionals (linear regression slope, quadratic regression coefficient a, etc.), and local minima/maxima-related functionals (mean and standard deviation of rising and falling slopes, etc.). A 2268-dimensional feature vector was extracted per speaker. Functionals were computed over 20-s time windows and averaged over the recording.
GeMAPS [15] has 18 low-level descriptors. Only the first 4 MFCC features are used in GeMAPS because those are more crucial for affect and paralinguistic voice analysis studies [15]. In addition, jitter, shimmer, loudness, and spectral slope were used. Similar to AVEC 2013, functionals of those low-level descriptors were also computed. The dimensionality of the final feature set is 62. Because GeMAPS is a hand-crafted feature set with reduced dimensionality, it is used for comparison with the proposed feature selection techniques here.

Baseline system
In the baseline system, the MRMR feature selection method was first applied [50] to reduce the number of acoustic features. Support vector regression (SVR) was used for regression and SVMs were used for classification. Because the amount of training data is small, the leave-one-out method was used for the Turkish and German experiments. For the English tasks, the training set has 107 subjects and the test set has 35 subjects. Because there is enough data both for training and testing, the leave-one-out method was not used for the English tasks. The evaluation criterion for all regression experiments was root mean square error (RMSE), which is also used in the AVEC challenges [3,43,51,52]. Statistical significance of the results was tested using the t test with p < 0.05.
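The leave-one-out evaluation protocol can be sketched as follows; a trivial mean predictor stands in for the SVR used in the paper:

```python
from math import sqrt

def loo_rmse(X, y, fit, predict):
    """Leave-one-out RMSE: train on all subjects but one, predict the
    held-out subject, and repeat for every subject."""
    sq_errors = []
    for i in range(len(y)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        sq_errors.append((predict(model, X[i]) - y[i]) ** 2)
    return sqrt(sum(sq_errors) / len(sq_errors))

# A trivial mean predictor stands in for SVR training/prediction:
fit_mean = lambda X_tr, y_tr: sum(y_tr) / len(y_tr)
predict_mean = lambda model, x: model
```

Any fit/predict pair with the same signature (e.g., an SVR wrapper) can be plugged in without changing the protocol.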
The evaluation criteria for all classification experiments were the F1-score, precision, and recall for both depressed and non-depressed subjects. For the classification tasks, statistical significance of the results was measured with McNemar's test with p < 0.05.
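For reference, the per-class metrics reduce to counts of true positives, false positives, and false negatives; a minimal sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class classification metrics from true-positive, false-positive,
    and false-negative counts (computed separately for the depressed and
    non-depressed classes)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```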

Results and discussion
Two sets of experiments were conducted. In the first set, the proposed feature selection algorithms were tested and compared with the baseline MRMR algorithm on the German, Turkish, and English regression and classification tasks. The RCR algorithm proposed for redundancy computation in Section 4.3 was used only for the German and English tasks since unlabeled data is not available for the Turkish database. In the second set, text-based features were extracted and fused with the acoustic features for the Turkish database. The second set was performed only for the Turkish database because transcriptions were not available for the German database, and, for the English database, the interviews were not in a question/answer format but rather free-form talk between a human and a computer.

Performance of the ml-MRMR feature selection and clustering algorithms

Turkish task
Regression: Table 3 shows the regression test results with the baseline MRMR and the ml-MRMR algorithms for the Turkish task. The lowest RMSE was 9.36 with the ml-MRMR (N_min = 3) algorithm using the Turkish-English data, and the improvement compared to the baseline was statistically significant. Similarly, the ml-MRMR algorithm using the Turkish-German data outperformed the baseline system, and the difference was statistically significant. Moreover, the ml-MRMR algorithms performed better than the GeMAPS feature set. Performance was better when N_min was set to 3 compared to setting it to 5.

For regression, the clustering algorithm described in Section 4.2 was used with 2, 9, and 15 clusters instead of the 45 distinct classes available in the Beck scores. Results are shown in Table 4. Even though the system with 15 clusters significantly outperformed the baseline system, the improvement was still below what was obtained with the multi-lingual MRMR approach.

Classification: The ml-MRMR algorithm was not applied directly in the classification case because there are only two classes and each class has enough samples (27 non-depressed and 50 depressed subjects). However, it is still possible to use the ml-MRMR algorithm in the binary classification case by populating each class from a cross-lingual dataset before dividing the data into two classes. After each regression class (1 to 45) is populated with the cross-lingual samples, the training data is split into two classes for the classification task.
Classification results are shown in Table 5. Even though the ml-MRMR algorithm improves the performance, the improvement was not found to be statistically significant. Thus, in the classification case, the ml-MRMR algorithm was not as effective because enough Turkish data was available in each class. The system trained with only the text-based features significantly outperformed the other systems.

English task
Regression: Table 6 shows the regression results for the English task. The best result was obtained by using the ml-MRMR algorithm with Turkish (N_min = 5). Even though ml-MRMR using the German database performed better than the baseline, the improvement was not significant. Note that the English database uses the PHQ-8 scores, which are coarser than the Beck scores used in the German and Turkish databases. Thus, there are more samples for each class, and using the ml-MRMR algorithm with N_min = 3 was not possible for the English case.
Classification: Table 7 shows the classification results for the English task. Similar to the regression task, the ml-MRMR algorithm with Turkish using N_min = 5 outperformed the baseline system, and, when German data was used, performance did not significantly change. The ml-MRMR algorithm using the English-Turkish datasets improved the F1 scores of both depressed and non-depressed subjects. The improvement for the depressed subjects was higher compared to the non-depressed subjects.

German task
Since the German dataset contains unlabeled data, the RCR algorithm was used to compute feature correlations in addition to the ml-MRMR algorithms. Results using those two algorithms for the regression and classification tasks are discussed below.

Regression: Regression performance of the baseline and the proposed ml-MRMR (with Turkish) and RCR feature selection algorithms for the German task are shown in Table 8. The ml-MRMR algorithm was not effective when German was used with English. When German was used with Turkish, performance improved for N_min = 3; however, the improvement was not significant. Improvement with N_min = 5 was found to be significant only when the RCR algorithm was also used. The RCR algorithm was not effective when it was used without ml-MRMR.

Classification: Classification performance of the baseline and the proposed ml-MRMR and RCR algorithms for the German task are shown in Table 9. The ml-MRMR algorithm with Turkish using N_min = 3 outperformed the rest of the systems when RCR was used.

Performance of score fusion
Text-based features were only available for the Turkish dataset. Therefore, results for the score fusion algorithm are reported only for the Turkish dataset. Table 10 shows results when speech-based features were fused with text-based features using the proposed approach described in Section 5.2. The fusion algorithm significantly improved the performance (p = 0.00006) compared to the baseline case by reducing the error by more than 25% using the ml-MRMR algorithm with the English and Turkish (N_min = 3) datasets. The spread of the prediction errors is substantially reduced after fusion, as shown in Fig. 6.

Regression task
Comparisons of the real and predicted scores are shown in Figs. 5 and 7 for the baseline and the best ml-MRMR algorithm with fusion. Predictions get closer to the true scores and errors decrease significantly with the proposed fusion method, which can be seen when Figs. 4 and 7 are compared. The best RMSE is 8.30, which, interestingly, is obtained with only 5 features.
Three of the 5 selected acoustic features are MFCC related: the peak standard deviation of MFCC-5, the amplitude mean of maxima for MFCC-5, and the mean segment length of MFCC-14. The other two are the mean of the rising slope for spectral harmonicity and the up-level time (25) of spectral flatness.
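The RMSE figures quoted throughout (e.g., the 8.30 above) follow the usual root-mean-square error definition over the predicted and true depression scores:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root-mean-square error between true and predicted scores.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```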
The ml-MRMR algorithm with German and Turkish datasets (N min = 3) worked well compared to the baseline as shown in the fourth column in Table 10. Still, it did not perform as well as the Turkish and English case. Moreover, its performance was not significantly different from the base-fusion. These results are in agreement with the results reported in Table 3 where performance with Turkish and English datasets was better compared to the Turkish and German datasets.
The sixth column in Table 10 shows the results obtained with the clustering approach together with the fusion method. That algorithm not only outperformed the baseline but also significantly outperformed the base-fusion algorithm. However, it performed worse than the best-performing ml-MRMR system.

Classification task
For the classification task, text-only model predictions outperformed the acoustic system predictions, as shown in Table 5. When the text-only model predictions were fused with the acoustic predictions, the F1-scores outperformed both modalities when English (N min = 3) was used to supplement the Turkish samples, as shown in Table 11. The result is statistically significant (p = 0.02). When German (N min = 3) was used with Turkish, the F1-score for the depressed case slightly improved. However, the F1-score for the non-depressed case did not improve. Performance of the Turkish-German and Turkish-English systems was not significantly different for the classification task.
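The per-class F1-scores reported above (one for the depressed class and one for the non-depressed class) follow the standard definition; a minimal sketch:

```python
def per_class_f1(y_true, y_pred, positive):
    """F1-score treating `positive` as the positive class, so it can be
    computed separately for the depressed and non-depressed classes."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```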

Discussion
The ml-MRMR algorithm was the best performing feature selection algorithm in regression tasks. Populating the Turkish dataset with English, the English dataset with Turkish, and the German dataset with Turkish generated the best results. The ml-MRMR algorithm was effective for German only when it was used together with the RCR algorithm. Thus, the cross-lingual population of depression classes was not as effective with German as with the other two languages. The acoustic characteristics of Turkish and English seem to be closer to each other, and those two languages complement each other better than they do with German. Note that the English and German datasets have significantly more training data compared to the Turkish dataset. Thus, because the Turkish dataset is sparser across depression classes, N min = 3 performed better than N min = 5. Using N min > 5 caused overly aggressive population of the Turkish dataset with cross-lingual data, which degraded the performance. Similarly, the English and German datasets performed better when N min = 5, since using a lower N min caused insufficient population of their depression classes with cross-lingual data.
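The N min mechanism discussed above can be sketched as follows: any depression class with fewer than N min in-language samples is topped up with same-class samples from the auxiliary language. The data layout (a dict mapping class label to sample list) is a hypothetical simplification for illustration.

```python
def populate_classes(primary, auxiliary, n_min):
    """For each depression class with fewer than n_min primary-language
    samples, add same-class samples from the auxiliary language until the
    class reaches n_min (or the auxiliary pool is exhausted).
    `primary` and `auxiliary` map class label -> list of samples."""
    populated = {c: list(samples) for c, samples in primary.items()}
    for c, samples in populated.items():
        if len(samples) < n_min:
            needed = n_min - len(samples)
            samples.extend(auxiliary.get(c, [])[:needed])
    return populated
```

A larger N min pulls in more cross-lingual data per class, which helps sparse datasets up to a point but, as observed above, degrades performance when the population becomes overly aggressive.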
The proposed feature selection algorithms were designed for cases when the number of classes are large. Thus, they were not as effective for binary classification tasks as they were in regression tasks. Still, some improvement was observed for the English classification task when ml-MRMR was used with Turkish.
The text features were tested on the Turkish dataset and found to outperform all acoustic feature sets for the classification task. Thus, the sentiment-based text features were found to be effective for binary classification of depression. Similarly, when the text features were fused with the acoustic features using the proposed fusion algorithm, performance on the Turkish dataset significantly improved both for regression and classification tasks. Fusion of text and acoustic features outperformed both of the feature sets. The clustering algorithm helped improve the performance of the Turkish dataset with and without fusion of text features. However, it was not as effective as the multi-lingual approach. Thus, cross-lingual population of depression classes was found to be more effective than simply reducing the number of classes through clustering. The multi-lingual approach allows computation of relevance with more data per class without reducing the resolution of the depression scale. If the languages have similar acoustic representations of depression, as found for Turkish and English in our experiments, then the multi-lingual approach outperforms the within-language clustering algorithm.
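The within-language clustering baseline mentioned above reduces the number of depression classes so that each coarse class has more samples when feature relevance is computed. A sketch under an assumed quantile-binning scheme (the paper's exact clustering procedure may differ):

```python
import numpy as np

def cluster_scores(scores, n_clusters):
    """Map fine-grained depression scores to n_clusters coarse classes
    using quantile bin edges, trading depression-scale resolution for
    more samples per class."""
    scores = np.asarray(scores, dtype=float)
    # Interior quantile edges; e.g. n_clusters=3 uses the 1/3 and 2/3 quantiles.
    edges = np.quantile(scores, np.linspace(0, 1, n_clusters + 1)[1:-1])
    return np.digitize(scores, edges)
```

This makes the trade-off explicit: clustering gains samples per class only by coarsening the label space, whereas the multi-lingual approach gains samples while keeping the full depression scale.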

Common selected features among languages
In this study, we explored the features that are most effective at predicting depression for three different languages. In addition, we further analyzed our results to find features that are common across those three languages. Table 12 shows overlapping features between the Turkish-English, Turkish-German, and English-German pairs within the top 150 MRMR-selected features. The overlapping features and their functionals are described in Tables 13 and 14, respectively. Features based on spectral harmonicity and energy in the 1000-4000 Hz band are dominant in the Turkish-English comparison. The 1000-4000 Hz band typically contains the second and third formants. Thus, the rate of change of those formants, measured with up-level time, and the distance between them appear to be strong indicators of depression for Turkish and English. Similarly, change in spectral harmonicity is a strong indicator for both languages.
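The pairwise overlaps reported in Table 12 amount to set intersections of the per-language top-150 ranked feature lists; a minimal sketch (the language keys and feature names are illustrative placeholders):

```python
def overlapping_features(ranked_lists, top_k=150):
    """Pairwise overlap of the top-k ranked feature names across languages.
    `ranked_lists` maps language name -> feature names in MRMR rank order."""
    overlaps = {}
    languages = list(ranked_lists)
    for i, a in enumerate(languages):
        for b in languages[i + 1:]:
            overlaps[(a, b)] = set(ranked_lists[a][:top_k]) & set(ranked_lists[b][:top_k])
    return overlaps
```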
Interestingly, MFCC features are dominant in the Turkish-German comparison. MFCC features are related to the envelope of the spectrum. Thus, changes in the locations of all formants and in their bandwidths during speech are important indicators of depression both for Turkish and German.
Overlapping features between English and German include a mix of energy, spectral harmonicity, jitter, and MFCC features. As opposed to the Turkish-English comparison, where the second and third formants are important, energy in the 250-650 Hz band that typically contains the first formant appears to be important and overlapping for English and German. Jitter, which quantifies pitch variations, is also important both for English and German but not for Turkish.

Table 13 Overlapping low-level descriptors in language pairs that are in the top 150 MRMR-selected features

MFCC 1-16
Mel-frequency cepstral coefficients are commonly used automatic speech recognition (ASR) features; in the AVEC 2013 feature set, 16 dimensions were used.

Energy
Sum of the squared amplitudes of a signal.

Spectral harmonicity
Number of harmonics in a signal.

Spectral skewness
The third-order moment of the power spectrum.

Jitter (local)
Variation of the fundamental period from one period to the next.

Jitter (DDP)
Delta period-to-period jitter, which can be described as the "jitter of the jitter": the change between two successive period-to-period jitter values.

Table 14 Overlapping functionals in language pairs that are in the top 150 MRMR-selected features

Relative mean of peaks
Ratio of the mean of the peak amplitudes to the mean of the windowed feature.

Kurtosis
Fourth-order moment.

Skewness
Third-order moment.

Up-level time
Number of frames in which the feature is above a threshold. The threshold percentiles are set to 25, 50, 75, and 90.

Minimum segment length
Minimum length of a particular segment.

Mean segment length
Arithmetic mean of the segment lengths.

Mean distance between peaks
The mean of the distances between peaks.

Rise-time
The time during which the feature contour is rising.

Percentile 1.0
The minimum value of a feature.
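Two of the functionals defined above can be sketched over a frame-level feature contour. The percentile-based threshold mirrors the up-level time (25) feature mentioned earlier; counting frames (rather than seconds) is an assumption consistent with the definitions in the table.

```python
import numpy as np

def up_level_time(contour, percentile=25):
    """Number of frames where the contour exceeds the given
    percentile of its own values."""
    threshold = np.percentile(contour, percentile)
    return int(np.sum(np.asarray(contour) > threshold))

def rise_time(contour):
    """Number of frames where the feature contour is rising."""
    return int(np.sum(np.diff(contour) > 0))
```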

Conclusion and future work
We investigated exploiting multi-lingual databases for feature selection in the context of depression assessment. The proposed algorithms were effective especially for the regression tasks, where there is a limited amount of data for each class. As a second contribution, we proposed novel features derived from transcriptions and fused them with the acoustic features, which significantly improved the performance. The results are significant because they indicate that there are similarities between entirely different languages in the way they manifest depression. Thus, our findings are a step towards using larger multi-lingual databases for depression detection.
The focus of this work was multi-lingual feature selection algorithms, not the classification algorithms. Thus, even though the SVM and SVR algorithms are solid baselines when the amount of training data is limited, in future work we will experiment with other types of classification/regression algorithms, such as gradient boosting and random forests.
Even though the Turkish database used here is unique to this work, the English and German databases are publicly available and have been used together in the literature as discussed in Section 2.3. A comparison of our