
Evaluation of linguistic and prosodic features for detection of Alzheimer’s disease in Turkish conversational speech


Automatic diagnosis and monitoring of Alzheimer’s disease can have a significant impact on society as well as the well-being of patients. The part of the brain cortex that processes language abilities is one of the earliest parts to be affected by the disease. Therefore, detection of Alzheimer’s disease using speech-based features is gaining increasing attention. Here, we investigated an extensive set of features based on speech prosody as well as linguistic features derived from transcriptions of Turkish conversations with subjects with and without Alzheimer’s disease. Unlike most standardized tests that focus on memory recall or structured conversations, spontaneous unstructured conversations are conducted with the subjects in informal settings. Age-, education-, and gender-controlled experiments are performed to eliminate the effects of those three variables. Experimental results show that the proposed features extracted from the speech signal can be used to discriminate between the control group and the patients with Alzheimer’s disease. Prosodic features performed significantly better than the linguistic features. Classification accuracy over 80% was obtained with three of the prosodic features, but experiments with feature fusion did not further improve the classification performance.

1 Introduction

As the worldwide elderly population increases, the incidence of Alzheimer’s disease is becoming more widespread. It is estimated that 7% of the world’s population over 65 years old has Alzheimer’s or a related dementia [1]. Moreover, only one in four patients has been diagnosed [1]. Because there is no treatment to cure the disease, years of healthcare costs are becoming a significant economic burden on governments as well as patients and their families. The global cost of Alzheimer’s and dementia is estimated to be $605 billion, which is equivalent to 1% of the entire world’s gross domestic product [2]. The problem intensifies each year with the aging world population. Thus, simplifying healthcare processes and reducing costs through the use of automated systems can make a significant socio-economic impact.

Diagnosis of the disease is costly and difficult. Moreover, even if the disease is diagnosed correctly, monitoring the progression of the disease by a clinician over time further increases the cost. Thus, patients cannot visit clinicians frequently and what happens between the visits is largely unknown to clinicians.

Typically, clinicians use tests such as the Mini-Mental State Examination (MMSE) and linguistic memory tests [3]. Linguistic memory tests are based on the recall rates of word lists and narratives, and they are typically more effective than the MMSE. Moreover, an individual’s medical and family histories are used, along with MRI scans to test for other brain-related conditions, such as stroke. Biomarkers showing the level of beta-amyloid accumulation in the brain or the neurons that are injured or actually degenerating can also be used in combination with the other tests [4].

None of those typical practices consider the speech signal in diagnosing the disease even though the part of the brain cortex that processes language abilities is one of the earliest parts to be affected by the disease [5]. For example, narrative retelling ability is found to be strongly correlated with the disease [6]. Similarly, linguistic features derived from the transcription of a narrative retelling task were found to be significantly correlated with primary progressive aphasia, which is a type of dementia [7]. Analysis of the speech signal has also been shown to be useful for Alzheimer’s detection in [8,9]. However, in those works, the speech signal is recorded during the administration of standard clinical tests. Moreover, most of the focus is on the high-level structural processing of spoken language for a specific language. For example, in [10], features such as moderate word finding difficulty, reduced phrase length, and reduced comprehension are manually tagged by humans and shown to contain information complementary to standardized tests. Correlation of linguistic capability with Alzheimer’s disease was also shown in [11]. Speech-based features are investigated in [5] to detect fronto-temporal lobar degeneration with promising results. Speech is recorded in a semi-structured interview setting in [5]. The frequency and ratio of syntactic categories such as pronouns and adverbs are found to be markers of the disease.

In addition to natural language processing (NLP) features, speech acoustics have also been studied and reported in the literature. A limited study with one patient with a focus on the prosodic features of speech such as stress, intonation, and emotion is reported in [12]. Problems of speech production that are related to central nervous system problems are also noted in [13]. In [14], dysfluency cycles in speech are measured using the length and frequency of hesitations in speech. Subjects with dementia were found to have patterns different from those of control subjects.

Here, we focus on extracting an extensive set of acoustic and linguistic features from spoken language to detect Alzheimer’s disease. Because the patients with Alzheimer’s are usually not able to take automated tests or to carry on a structured conversation, data collection is done during unstructured conversational speech. In this way, a subject’s speech can be recorded in the most natural and effortless way by a person with minimal technical or clinical skills. Semi-structured conversational data has been investigated in [15,16], but only linguistic features are analyzed, and speech features are not considered. Similarly, conversational data has been investigated for limited sets of linguistic and speech dysfluency features by [5,17], who measured the correlation of those features with the disease and attempted to use the features for diagnosis.

In our work, we focused on evaluating the effectiveness of a large set of features for detecting Alzheimer’s disease in unstructured conversations. The data was collected in Turkish, which has not been studied to the extent of languages such as English. We propose 20 prosodic features extracted automatically from the recordings and 18 linguistic features derived from the transcriptions of patients and control subjects. We have investigated the predictive power of each feature as well as combinations of features using support vector machines (SVM), nearest neighbor (NN) classifiers, naive Bayesian classifiers, and classification trees (CTree). Our results indicate that some of the investigated features are strong predictors of the disease with high statistical significance independent of the age, education, and gender of the subjects. Prosodic features were more successful than the linguistic features. In fact, only two of the linguistic features were found to be significant. Accuracy of greater than 80% was obtained with three of the prosodic features. Silence ratio, which is defined as the rate of silences in speech regardless of their durations, was found to be the most useful feature. Feature fusion did not improve the performance, which indicates that the features are not complementary to one another.

2 Linguistic features

A list of the linguistic features that are extracted from the transcriptions of the recorded conversations with the test subjects is shown in Table 1. The features are geared towards detecting problems with the flow of the conversation and measuring how well the subject can understand the question or carry on the conversation without getting confused. Recordings are manually transcribed. Each recording is first split into conversation turns. Then, the turns where the subjects speak are further split into utterances that are segments where the subjects talk without interruption by the interviewer and without long silences. The splitting mechanism is shown in Figure 1 where a voice activity detector is used for detecting silences. Only the turns of subjects are used in feature extraction.

Figure 1

Conversation turns between a patient and the interviewer are shown. Each conversation turn contains one or more utterances. Voice activity detector (VAD) is used to detect the long and short silences.

Table 1 Lists of linguistic and prosodic features and their IDs

2.1 Hesitation and confusion features

During recording, we found that patients tend to hesitate more, forget what they were talking about, and have a harder time finding the right words or remembering details about their pasts. They also sometimes get confused about why they cannot remember the details or forget the context of the conversation. Those observations led us to propose features that will be able to capture those patterns in transcriptions.

2.1.1 Question ratio

Patients are more likely to forget details in the middle of conversation, to not understand the questions, or to forget the context of the question. In those cases, they tend to ask the interviewer to repeat the question or they get confused, talk to themselves, and ask further questions about the details. The question words such as ‘which,’ ‘what,’ etc. are tagged automatically in each conversation. The full list of question tags that were used here is shown in Table 2. The question ratio of a subject is computed by dividing the total number of question words by the number of utterances spoken by the subject.

Table 2 Question tags that were used in computing the question ratio

2.1.2 Filler ratio

Filler sounds such as ‘ahm’ and ‘ehm’ are used by people in spoken language when they think about what to say next. We hypothesize that they may be used more frequently by the patients because of slow thinking and memory recall processes. Patients tend to forget what they are talking about and to use fillers more often than the control subjects. The filler ratio is computed by dividing the total number of filler words by the total number of utterances spoken by the subject.
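The computation described above can be sketched as follows. This is only an illustration: the whitespace tokenization, the punctuation stripping, and the function name `filler_ratio` are assumptions, and the filler list contains only the two examples named in the text.

```python
def filler_ratio(utterances, fillers=("ahm", "ehm")):
    """Total number of filler tokens divided by the number of utterances."""
    if not utterances:
        raise ValueError("at least one utterance is required")
    filler_set = set(fillers)
    count = sum(
        1
        for utterance in utterances
        for token in utterance.lower().split()
        if token.strip(".,!?") in filler_set  # ignore attached punctuation
    )
    return count / len(utterances)


# Hypothetical utterances of one subject (English used for readability).
utterances = ["ahm I was born in ehm Ankara", "yes", "ehm I do not remember"]
print(filler_ratio(utterances))  # 3 fillers over 3 utterances
```

The question ratio can be computed in the same way by passing the question tags of Table 2 instead of the filler list.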

2.1.3 Incomplete sentence ratio

One of our observations of the patients is their inability to complete sentences. They seem to either forget what they were going to say or to completely change the context and start talking about a different topic. Incomplete sentences are manually labeled for each conversation. To compute this feature, the ratio of incomplete sentences to the total number of the sentences is calculated.

2.2 POS-based features

Part of speech (POS) tags can be used to extract markers for detecting the disease. For example, frequent adjectives can indicate more colorful and descriptive use of language, while frequent adverbs can indicate the ability to relate different utterances to each other. The frequency of each POS tag can also be a useful identifier of patients with Alzheimer’s disease.

POS tags are added automatically to each word using a Turkish stemmer [18]. In cases where a word can have multiple alternative POS tags, equal weights are given to all possibilities. For instance, if a word can be either a noun or an adverb, depending on the sentence, that word is counted as half adverb and half noun in computation. The following POS tag frequencies are used as features:

  • Verb frequency

  • Noun frequency

  • Pronoun frequency

  • Adverb frequency

  • Adjective frequency

  • Particle frequency

  • Conjunction frequency

  • Pronoun-to-noun ratio

Frequency of a POS tag is computed by dividing the total number of words with that tag by the total number of words spoken by the subject in the recording. Pronoun-to-noun ratio is the ratio of the total number of pronouns to the total number of nouns.
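A minimal sketch of the fractional POS counting is given below. The input format (each word paired with its list of candidate tags) and the function names are assumptions made for illustration; they are not the output format of the stemmer in [18].

```python
from collections import defaultdict


def pos_mass(tagged_words):
    """Sum fractional tag counts: a word with k candidate tags adds 1/k to each."""
    mass = defaultdict(float)
    for _word, candidates in tagged_words:
        weight = 1.0 / len(candidates)
        for tag in candidates:
            mass[tag] += weight
    return mass


def pos_frequencies(tagged_words):
    """Fractional count of each tag divided by the total number of words."""
    if not tagged_words:
        raise ValueError("at least one word is required")
    total = len(tagged_words)
    return {tag: m / total for tag, m in pos_mass(tagged_words).items()}


def pronoun_to_noun_ratio(tagged_words):
    """Total (fractional) pronoun count divided by total noun count."""
    mass = pos_mass(tagged_words)
    if mass["noun"] == 0:
        raise ValueError("no nouns in the input")
    return mass["pronoun"] / mass["noun"]


# Hypothetical tagging; the third word is ambiguous between two tags.
words = [
    ("w1", ["pronoun"]),
    ("w2", ["noun"]),
    ("w3", ["adjective", "adverb"]),
    ("w4", ["verb"]),
]
print(pos_frequencies(words))
print(pronoun_to_noun_ratio(words))
```

In this example the ambiguous word contributes half a count to the adjective frequency and half a count to the adverb frequency.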

2.3 Unintelligible word ratio

During the conversations, some of the words spoken by the patients were unintelligible. These are mostly because patients could not produce the words correctly, they mumbled, or they were thinking while talking, which reduced intelligibility. Unintelligible word ratio is the ratio of unintelligible words to all words spoken by the subject.

Annotation of unintelligible words was done manually by three listeners for each conversation. A word was marked as unintelligible only when at least two of the three listeners could not understand it.

2.4 Complexity features

2.4.1 Standardized word entropy

One of the earliest parts of the brain to be damaged by Alzheimer’s disease is the part of the brain that deals with language ability [5]. We hypothesize that this may cause a degradation in the variety of words and word combinations that a patient uses. Standardized word entropy, i.e., word entropy divided by the log of the total word count, is used to model this phenomenon. Because the aim is to compute the variety of word choice, stemming is done, and only the stems of the words are considered.
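As an illustration, standardized word entropy can be sketched as below. The sketch assumes that the input is the list of stems produced by the stemming step, that probabilities are estimated by relative frequency, and that natural logarithms are used (the base cancels in the ratio). Returning 0 for fewer than two words is a choice made here to avoid division by zero.

```python
import math
from collections import Counter


def standardized_word_entropy(stems):
    """Entropy of the stem distribution divided by log(total word count)."""
    n = len(stems)
    if n < 2:
        return 0.0  # log(1) = 0, so the ratio is undefined; treated as 0 here
    counts = Counter(stems)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)


print(standardized_word_entropy(["a", "b", "c", "d"]))  # all stems distinct
print(standardized_word_entropy(["a", "a", "a", "b"]))  # repetitive speech
```

With this normalization the value lies between 0 (a single stem repeated) and 1 (every stem used exactly once).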

2.4.2 Suffix ratio

The standardized word entropy feature focuses on the variety of the stem words while ignoring the suffixes. However, suffixes can also be strong indicators of the complexity of a sentence. Turkish, in particular, has a rich and complex morphological structure [19]. Hundreds of different words can be generated from the same stem word by appending suffixes to it. Thus, we investigated whether the patients tend to construct simpler words than the control subjects by analyzing the suffixes they used. The suffix ratio of a subject is calculated by dividing the total number of suffixes by the total number of words spoken by the subject.

2.4.3 Number ratio

During conversations, subjects give details about their birth dates, how many kids they have, and other numerical information. Such use of numbers in a sentence can be a measure of recall ability. The number ratio feature is calculated by dividing the total count of numbers by the total count of words the subject used in the conversation.

2.4.4 Brunet’s index

Brunet’s index (W) quantifies lexical richness [20]. It is calculated as \(W=N^{V^{-0.165}}\), where N is the total text length and V is the total vocabulary. Lower values of W correspond to richer texts. As with standardized word entropy, stemming is done on words and only the stems are considered.
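The formula can be evaluated directly from a list of stems, as in the following sketch. The function name and the toy inputs are illustrative assumptions; N is taken as the number of stem tokens and V as the number of distinct stems.

```python
def brunet_index(stems):
    """Brunet's index W = N ** (V ** -0.165) computed over word stems."""
    n = len(stems)       # total text length N
    v = len(set(stems))  # vocabulary size V
    if n == 0:
        raise ValueError("at least one word is required")
    return n ** (v ** -0.165)


rich = ["a", "b", "c", "d"]  # four distinct stems
poor = ["a", "a", "a", "b"]  # same length, two distinct stems
print(brunet_index(rich), brunet_index(poor))
```

For two texts of the same length, the one with the larger vocabulary yields the lower W, in line with the interpretation given above.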

2.4.5 Honore’s statistic

Honore’s statistic [21] is based on the notion that the more words a speaker uses only once, the richer the speaker’s overall lexicon is. Words spoken only once (\(V_1\)) and the total vocabulary used (V) have been shown to be linearly associated. Honore’s statistic generates a lexical richness measure according to \(R=\frac{100\,\log N}{1-V_1/V}\), where N is the total text length. Higher values correspond to a richer vocabulary. As with standardized word entropy, stemming is done on words and only the stems are considered.
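A short sketch of this measure follows. It assumes the commonly cited form of Honore’s statistic, R = 100 log N / (1 − V_1/V), with the natural logarithm; the base of the logarithm and the handling of the case where every stem occurs exactly once are assumptions of this sketch.

```python
import math
from collections import Counter


def honore_statistic(stems):
    """Honore's R = 100 * log(N) / (1 - V1 / V) computed over word stems."""
    n = len(stems)
    if n == 0:
        raise ValueError("at least one word is required")
    counts = Counter(stems)
    v = len(counts)                                  # vocabulary size V
    v1 = sum(1 for c in counts.values() if c == 1)   # stems used exactly once
    if v1 == v:
        return math.inf  # denominator is zero when every stem occurs once
    return 100.0 * math.log(n) / (1.0 - v1 / v)


print(honore_statistic(["a", "a", "b", "c"]))  # N = 4, V = 3, V1 = 2
```

Because the statistic diverges when all stems are unique, very short transcripts may need special handling before the feature is passed to a classifier.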

2.4.6 Type-token ratio

A pattern that we noticed in the recordings of the Alzheimer’s patients is the frequency of repetitions in conversation. Patients tend to forget what they have said and to repeat it elsewhere in the conversation. The metric that we used to measure this phenomenon is type-token ratio [22]. Type-token ratio is defined as the ratio of the number of unique words to the total number of words. In order to better assess the repetitions, only the stems of the words are considered in calculations.

3 Prosodic features

A total of 20 prosodic features were extracted and evaluated for detecting Alzheimer’s disease. A list of all prosodic features used here is shown in Table 1. Descriptions of the prosodic features are given below. All prosodic feature computations are performed over the locution of the subject. Locution is the total response period of the subject which is the sum of all of the subject’s speech turns. Each speech turn includes utterances, long silences, and short silences, as shown in Figure 1.

3.1 Voice activity-related features

Silence and speech segments are automatically labeled in each conversation with a voice activity detector (VAD). The VAD used here is based on the distribution of the short-time frame energy of the speech signal. Because there is both silence and speech in the recordings, the energy distribution has two modes, both of which can be modeled with a Gaussian distribution. The bimodal distribution of silence and speech is trained using the expectation-maximization (EM) algorithm. The mode that has a lower mean is used to represent silence, and the mode that has a higher mean is used to represent speech.

Each short-time frame in the recording is classified as either speech or silence by applying the likelihood ratio test (LRT) to its energy. Because the test treats each frame independently, a second processing step is used in which silence and speech segments that are shorter than four frames are removed.
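The VAD steps described above can be sketched as follows. This is an illustrative reconstruction under several assumptions, not the implementation used in the experiments: the input is a list of per-frame (log-)energies, the mixture is initialized from the energy range, the LRT threshold is 0, and a segment shorter than four frames is "removed" by relabeling it with the label of the preceding segment.

```python
import math


def log_gauss(x, mean, var):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * ((x - mean) ** 2 / var + math.log(2.0 * math.pi * var))


def fit_two_gaussians(energies, iterations=50, var_floor=1e-6):
    """EM for a two-component 1-D Gaussian mixture.

    Returns [(weight, mean, var), ...] sorted by mean: index 0 is the
    silence mode (lower mean) and index 1 is the speech mode.
    """
    lo, hi = min(energies), max(energies)
    init_var = max((hi - lo) ** 2 / 16.0, var_floor)
    params = [(0.5, lo + 0.25 * (hi - lo), init_var),
              (0.5, lo + 0.75 * (hi - lo), init_var)]
    for _ in range(iterations):
        # E-step: posterior probability of each mode for every frame.
        resp = []
        for x in energies:
            logs = [math.log(max(w, 1e-12)) + log_gauss(x, m, v)
                    for w, m, v in params]
            top = max(logs)
            probs = [math.exp(value - top) for value in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weight, mean, and variance of each mode.
        new_params = []
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:  # degenerate mode: keep the previous estimate
                new_params.append(params[k])
                continue
            mean = sum(r[k] * x for r, x in zip(resp, energies)) / nk
            var = sum(r[k] * (x - mean) ** 2
                      for r, x in zip(resp, energies)) / nk
            new_params.append((nk / len(energies), mean, max(var, var_floor)))
        params = new_params
    return sorted(params, key=lambda p: p[1])


def classify_frames(energies, params, threshold=0.0):
    """Likelihood ratio test per frame; True means speech."""
    (_, m_sil, v_sil), (_, m_sp, v_sp) = params
    return [log_gauss(x, m_sp, v_sp) - log_gauss(x, m_sil, v_sil) > threshold
            for x in energies]


def remove_short_segments(labels, min_frames=4):
    """Relabel runs shorter than min_frames with the preceding run's label."""
    runs = []
    for label in labels:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1
        else:
            runs.append([label, 1])
    smoothed = []
    for label, length in runs:
        if length < min_frames and smoothed:
            label = smoothed[-1]
        smoothed.extend([label] * length)
    return smoothed


# Synthetic frame energies: silence near 1.0, speech near 5.0, with a
# one-frame dip inside the speech region.
silence = [1.0, 1.1, 0.9, 1.05, 0.95]
speech = [5.0, 5.2, 4.8, 5.1, 4.9]
energies = silence * 2 + speech * 2 + [1.0] + speech + silence * 2
raw = classify_frames(energies, fit_two_gaussians(energies))
labels = remove_short_segments(raw)
print(sum(raw), sum(labels))  # speech frames before and after smoothing
```

In this synthetic example the isolated low-energy frame is first labeled as silence by the frame-wise test and is then absorbed into the surrounding speech segment by the second step.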

The transcriptions of recordings were available and could be used for VAD through forced alignment using an automatic speech recognition system. However, the VAD described above worked well and more sophisticated VAD techniques were not required.

3.1.1 Response time

When the interviewer asks a question, it often takes some time before the subject gives a response. It is hypothesized that this time can be an indicator of the disease since it is expected to be related to cognitive processes such as attention and memory. The time it takes the subject to answer a question is calculated in each segment as the response time measure.

3.1.2 Response length

Response length is the average length of a subject’s response in seconds to the interviewer’s question. Beginning and trailing silences are removed.

3.1.3 Silence ratio

The plan-and-execute cycle in speech production was found to be distinctly different in patients compared to control subjects as noted in [14]. In our data, we also observed that patients tend to stop more in the middle of sentences to think about what to say next. The silence ratio is computed by dividing the total number of silences in the whole locution by the total number of words in the locution. Dividing by the number of words, we reduce the variability that arises from different speaking rates.

3.1.4 Silence-to-utterance ratio

Silence-to-utterance ratio is the ratio of the total number of silences to the total number of utterances. Similar to silence ratio, it is a measure of the hesitation rate of the subject.

3.1.5 Long silence ratio

Patients sometimes pause for a long time while answering a question. They do not use fillers during these long periods, and the interviewers did not interrupt these periods of silence. We hypothesized that these pauses may correspond to moments when the subject is retrieving information which is expected to be longer for the Alzheimer’s patients. Similarly, confusion may also lead to long silences. The rate of such long hesitation events, defined as silences longer than approximately one second, is used to detect the disease. This feature is computed as the ratio of the total count of long silences to the total number of words.

3.1.6 Average silence count

This feature specifies the average number of silences produced by a speaker in one second of speech. It is calculated by dividing the total number of silences by the duration of the locution.

3.1.7 Silence rate

The silence rate measures the silence as a proportion of the whole locution. It is computed by dividing the total duration of all silence segments by the duration of the locution.
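The count-based and duration-based silence features defined in Sections 3.1.3 to 3.1.7 can be computed together once the VAD has produced a list of silence durations. The sketch below assumes that durations are given in seconds, that a long silence is one longer than exactly 1.0 s (the text says "approximately one second"), and that the word and utterance counts come from the transcription; the function and key names are hypothetical.

```python
def silence_features(silence_durations, n_words, n_utterances,
                     locution_seconds, long_threshold=1.0):
    """Silence features of one subject, computed over the whole locution."""
    n_silences = len(silence_durations)
    n_long = sum(1 for d in silence_durations if d > long_threshold)
    return {
        # number of silences per word, regardless of duration
        "silence_ratio": n_silences / n_words,
        # number of silences per utterance
        "silence_to_utterance_ratio": n_silences / n_utterances,
        # number of long silences per word
        "long_silence_ratio": n_long / n_words,
        # number of silences per second of locution
        "average_silence_count": n_silences / locution_seconds,
        # fraction of the locution spent in silence
        "silence_rate": sum(silence_durations) / locution_seconds,
    }


features = silence_features(
    silence_durations=[0.3, 1.5, 0.4, 2.0],  # hypothetical values
    n_words=40,
    n_utterances=5,
    locution_seconds=20.0,
)
print(features)
```

Note that the first four features depend only on how many silences occur, whereas the silence rate depends on how long they last.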

3.1.8 Continuous speech rate

This feature measures how long the subject speaks until the next long silence, which is considered to be a thinking or recalling state. It is defined as the average duration of continuous speech segments over the whole locution.

3.1.9 Average continuous word count

As mentioned above, the thinking process is longer for patients than for the control subjects. The silence rate features discussed above try to exploit this long thinking process. Another way to measure it is to compute the average number of consecutive words that are spoken without intervening silences. First, the number of words for each continuous segment is computed. Then, the mean of these counts is used as the feature.
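The two steps can be sketched as follows, assuming that the word sequence has been aligned with the VAD output so that each detected silence appears as a marker token. The marker `<sil>` and the function name are hypothetical.

```python
def average_continuous_word_count(tokens, silence_marker="<sil>"):
    """Mean number of consecutive words between silence markers."""
    counts, current = [], 0
    for token in tokens:
        if token == silence_marker:
            if current:  # close the running segment, skip repeated silences
                counts.append(current)
            current = 0
        else:
            current += 1
    if current:
        counts.append(current)
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical token sequence with segments of 2, 1, and 3 words.
tokens = ["w1", "w2", "<sil>", "w3", "<sil>", "<sil>", "w4", "w5", "w6"]
print(average_continuous_word_count(tokens))
```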

3.2 Articulation-related features

The voice activity-related features discussed above are related to cognitive thought processes. However, it is also important to measure how the subject uses his or her voice articulations during speech. For example, if the subject becomes emotional, significant changes in the fundamental frequency (pitch) can be expected. Similarly, changes in the resonant frequencies (formants) of speech can be a strong indicator of the subject’s health. If the formants do not change fast enough or are not distinct enough, sounds may become harder for listeners to identify, leading to the perception of mumbling. In order to see the impact of these effects on classification of the disease, pitch and formant trajectories are extracted, and the following features are derived over the whole locution.

3.2.1 Average absolute delta energy

Energy variations can convey information about the mood of the subject. Changing energy significantly during speech may indicate a conscious effort to stress words that are semantically important or a change in mood related to the content of the speech. The absolute value of each frame-to-frame energy change is measured, and the average of these changes over the whole locution is computed.

3.2.2 Deviation of absolute delta energy

In addition to changes in energy, changes in the delta energy, which is the acceleration of energy, can be used. The standard deviation of the absolute delta energy is used to further investigate the possible impacts of the disease on the energy change rate.
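Both energy features can be derived from the same sequence of frame-to-frame differences, as in the sketch below. The use of the population standard deviation and the function name are assumptions; the same computation applies to pitch and formant trajectories.

```python
from statistics import fmean, pstdev


def delta_features(frame_values):
    """Mean and standard deviation of absolute frame-to-frame changes."""
    if len(frame_values) < 2:
        raise ValueError("at least two frames are required")
    abs_deltas = [abs(b - a) for a, b in zip(frame_values, frame_values[1:])]
    return fmean(abs_deltas), pstdev(abs_deltas)


# Hypothetical frame energies; absolute deltas are 2, 1, 0, and 3.
mean_delta, std_delta = delta_features([1.0, 3.0, 2.0, 2.0, 5.0])
print(mean_delta, std_delta)
```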

3.2.3 Average absolute delta pitch

The average of the absolute delta pitch shows the rate of variations in pitch. This feature is highly correlated with the emotions carried through the speech signal.

3.2.4 Deviation of absolute delta pitch

The standard deviation of the absolute delta pitch is also used as a feature to further analyze the possible impacts of the disease on the pitch change rate. A monotonic increase or decrease in the pitch may simply be related to routine changes in sentence structure. However, acceleration of pitch, measured with the standard deviation of absolute delta pitch, can capture unusual pitch events in speech.

3.2.5 Average absolute delta formants

The average of the absolute delta formant frequencies indicates the rate of change in the formant features. Formants are related to the positions of the vocal organs such as the tongue and lips. Reduction of control over these organs related to damage in the brain caused by Alzheimer’s disease can create speech impairments such as mumbling. In this case, formants do not change quickly and speech becomes less intelligible [23]. Changes in the first four formants are used as features in this research.

3.2.6 Voicing ratio

Another speech impairment is the loss of voicing in speech. In this case, the subject loses the ability to control the vibrations of the vocal cords, which results in breathy and noisy speech. The ratio of the total duration of voiced speech to the total duration of speech in the locution is used to detect any potential impairment in the vocal cords.

3.3 Rate of speech-related features

3.3.1 Phoneme rate

A basic identifier of rate of speech is the average number of phonemes spoken per second. The phoneme rate of a subject is computed by dividing the number of phonemes by the duration of the locution.

3.3.2 Word rate

Similar to phoneme rate, word rate is used to measure the rate of speech at the word level. Word rate is computed by dividing the number of words by the duration of the locution.

4 Experiments

Conversational speech recordings of 32 patients and 51 age- and education-matched control subjects were collected and manually transcribed. Recordings from four patients were excluded because they were either unintelligible for the most part or the patients did not talk much. Thus, recordings from a total of 28 patients were used in experiments. The Alzheimer’s patients and the control subjects were recruited from the same healthcare facility, but the control subjects were receiving treatment for injuries or illnesses other than Alzheimer’s disease. Gender, age, and education details of the subjects are shown in Table 3. The age range is between 60 and 90 in both control subjects and patients.

Table 3 Gender, age, and years of education for the patient and control subjects

Unstructured conversations were carried out with the subjects, and questions were asked depending on the flow of the conversation. Thus, different topics and different questions were used to make the subjects feel comfortable. The transcriptions were produced by one person and then reviewed by another person. The transcribers and the subjects were native Turkish speakers. In order to annotate unintelligible words, a third person also listened to the recordings.

The data was collected at elderly healthcare facilities in Istanbul. For each subject, approximately 10 min of conversation was recorded using a high-quality microphone. The recording was then manually segmented into speech turns between the interviewer and subject. In each speech turn, only the subject or the interviewer speaks. Segments of speech where both the subject and the interviewer talk were not used in the analysis.

After linguistic and prosodic features are extracted, SVM, NN classifiers, naive Bayesian classifiers, and CTree are used for classification. A linear kernel is used for the SVM. For the NN classifier, Euclidean distance is used. For the CTree, nodes are split to minimize within-node impurity. Impure nodes that contain samples both from patients and control subjects are split only if they have more than nine samples.

There is more data available for the control subjects than for the Alzheimer’s patients because the number of subjects that were in the healthcare facilities and willing to provide data was larger. Even though equal amounts of data from both groups could be used in the experiments to have a balance, all of the available data was used with special care while training the classifiers as discussed below.

Data imbalance can become a problem for the SVM, NN, and decision tree algorithms, where the data points are used directly, as opposed to the naive Bayes approach, where the distribution of the data is used. For the NN, SVM, and decision tree classifiers, a random subsampling approach is used, in which a subset of the control subjects is randomly selected such that there is an equal number of data points for the control subjects and the patients. For each test case, the subsampling procedure is repeated ten times, and the average performance is reported.
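The random subsampling procedure can be sketched as follows. The subject identifiers, the fixed seed, and the function name are assumptions made for illustration; sampling is done without replacement.

```python
import random


def balanced_subsamples(patients, controls, repeats=10, seed=0):
    """Yield (patients, control subset) pairs with equal group sizes."""
    if len(controls) < len(patients):
        raise ValueError("need at least as many controls as patients")
    rng = random.Random(seed)  # fixed seed only for reproducibility
    for _ in range(repeats):
        # draw as many controls as there are patients, without replacement
        yield patients, rng.sample(controls, len(patients))


patients = [f"patient_{i}" for i in range(28)]
controls = [f"control_{i}" for i in range(51)]
for patient_group, control_group in balanced_subsamples(patients, controls):
    # train and evaluate a classifier on the balanced set here,
    # then average the performance over the ten repetitions
    assert len(patient_group) == len(control_group)
```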

In the first phase of testing, each feature is tested separately to assess the classification power of individual features. Then, in the second phase, combinations of features are used to increase the classification power of the algorithms. Features are normalized to have zero-mean and unit variance.
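The normalization step can be sketched as below. The population standard deviation and the treatment of constant features (left at zero rather than divided by zero) are assumptions of this sketch.

```python
from statistics import fmean, pstdev


def zscore_columns(rows):
    """Normalize each feature (column) to zero mean and unit variance.

    rows: one feature vector per subject, all of equal length.
    """
    columns = list(zip(*rows))
    means = [fmean(column) for column in columns]
    # a constant feature has zero deviation; divide by 1 to keep it at 0
    stds = [pstdev(column) or 1.0 for column in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in rows]


# Hypothetical feature matrix: three subjects, two features.
normalized = zscore_columns([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
print(normalized)
```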

Because there is a limited number of subjects in the data set, a leave-one-out evaluation strategy is used, in which one of the subjects is left out and the classifier is trained with the rest of the subjects and tested on the left-out subject.
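The leave-one-out strategy can be sketched with a nearest neighbor classifier using Euclidean distance. The choice of a single nearest neighbor, the toy feature values, and the label coding (1 for patient, 0 for control) are assumptions made for illustration.

```python
import math


def nearest_neighbor_predict(train_x, train_y, x):
    """Label of the training point closest to x in Euclidean distance."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[best]


def leave_one_out_accuracy(features, labels):
    """Hold out each subject once, train on the rest, test on the held-out."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbor_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Hypothetical one-dimensional feature (e.g., a normalized silence ratio).
features = [[0.10], [0.20], [0.15], [0.80], [0.90], [0.45]]
labels = [0, 0, 0, 1, 1, 1]
print(leave_one_out_accuracy(features, labels))  # 5 of 6 subjects correct
```

In this toy example only the last subject is misclassified, because its nearest neighbor belongs to the other group.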

5 Results and discussion

The age, education level, and gender of the subjects can significantly affect performance in classification tests. Therefore, initial testing is done to control for the effect of age, education, and gender on the performance. Only features that have significant performance in all three control tests are reported as significant markers. Significance is measured using the paired t-test. A given feature may not have significant performance with all classifiers. In this case, the feature is reported as significant if it can pass the significance test with at least one of the classifiers.

Age-, education-, and gender-controlled experimental results are discussed below. Analysis and discussion of the features and combination of features with statistically significant discriminative power are reported in Section 5.4.

5.1 Age-controlled experiments

The age-controlled linguistic features are shown in Table 4. All features related to POS tags other than nouns and pronouns were found to be insignificant with this control variable. Incomplete sentences and unintelligible word ratios were found to be age related and not disease related. Similarly, all features that are related to the complexity of the language also became insignificant when age was used as a control variable.

Table 4 Accuracy (%) of each classifier using the linguistic features in age-controlled experiments

The age-controlled prosodic features are shown in Table 5. The significance of these features was found to be less related to age compared to the linguistic features. In particular, formant and voicing features that are related to articulation were found to be age related and not disease related. Patients in the older age group had a harder time controlling their vocal cords and other articulatory organs, but this was not a significant discriminative factor for the younger group of subjects.

Table 5 Accuracy (%) of each classifier using the prosodic features in age-controlled experiments

5.2 Education-controlled experiments

As in the age-controlled experiments, linguistic features performed poorly in the education-controlled experiments, as shown in Table 6. The notable exceptions are pronoun frequency and pronoun-to-noun ratio, which were significant indicators of the disease independent of the education level. Patients use pronouns more often than nouns. This is surprising because we hypothesized that patients would use pronouns less often: pronouns refer to nouns mentioned earlier in the conversation, which we assumed would require more cognitive effort. Analyzing the transcripts in more detail, we have found that patients use pronouns without necessarily referring to a specific noun. Sometimes it is hard, even impossible, for the interviewer to understand what a pronoun is referring to. Patients seem to prefer using pronouns instead of actual nouns, which are not always specified in the conversation.

Table 6 Accuracy (%) of each classifier using the linguistic features in education-controlled experiments

Prosodic features were less dependent on the education level compared to age, as shown in Table 7. Response length was found to be insignificant in education-controlled experiments. Some of the patients with higher education either do not talk much or talk significantly more than the control group. However, such speakers do not exist in the lower education group. Hence, on average, response length was not found to be significant.

Table 7 Accuracy (%) of each classifier using the prosodic features in education-controlled experiments

The average absolute delta pitch and average absolute delta formant-2 features were also found to be dependent on the education level. These two features are particularly interesting since they also have high correlation with the display of mood and depression [23]. Patients in the high education group sometimes displayed exaggerated emotions which increased the pitch variability. Interestingly, some of the subjects in the same group tend to have lower second formant deviations, which can be a sign of depression. Those two patterns, however, were not observed in the younger patients. Thus, they were not found to be significant markers of the disease.

5.3 Gender-controlled experiments

Gender-controlled experiments were performed to measure the performance of features for each gender separately. Results are shown in Tables 8 and 9. There are three features that performed well in the age-controlled and education-controlled experiments but not in the gender-controlled experiments. Those features are: deviation of absolute delta energy, average absolute delta energy, and long silence ratio.

Table 8 Accuracy (%) of each classifier using the linguistic features in gender-controlled experiments
Table 9 Accuracy (%) of each classifier using the prosodic features in gender-controlled experiments

All three features performed well for males but not for females. In the recordings, we found that male patients displayed less emotion, which resulted in less expressive speech compared to male subjects in the control group. However, that pattern was not as strong among the female speakers. Moreover, the number of males is significantly larger than the number of females, which makes it easier to obtain statistically significant results in classification experiments for the male subjects.

5.4 Analysis of significant features

Features that have significant performance in education-, age-, and gender-controlled tests are shown in Table 10, along with missed detection and false alarm rates. SVM and naive Bayes classifiers always outperform the CTree and NN classifiers. SVM has the best accuracy among all classifiers. In particular, the SVM classifier with the silence ratio feature has the highest accuracy among all features and classifiers.

Table 10 Overall accuracy (%), missed detection (%), and false alarm (%) rates of features with statistically significant performance

Missed detection rates are significantly higher than the false alarm rates in the best performing classifiers, as shown in Table 10. Even though more data from the control group is available, the subsampling method is used to ensure an equal number of patient and control subjects in the training datasets, as discussed in Section 4. Thus, the results show that significantly more patients were classified as healthy compared to control subjects classified as patients. It also indicates that features extracted from some patients are significantly different from some of the other patients and most of the control subjects, which helps in classification.
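
For readers who want to reproduce this kind of analysis, the sketch below shows one way to compute accuracy, missed detection rate, and false alarm rate, together with a simple balanced subsampling step. The label coding (1 for patient, 0 for control), the helper names, and the random subsampling strategy are assumptions for illustration; the procedure actually used is the one described in Section 4.

```python
import random

def error_rates(y_true, y_pred):
    """Return (accuracy, missed_detection_rate, false_alarm_rate).

    Labels: 1 = patient, 0 = control. A missed detection is a patient
    classified as control; a false alarm is a control classified as patient.
    """
    pairs = list(zip(y_true, y_pred))
    patients = sum(1 for t, _ in pairs if t == 1)
    controls = sum(1 for t, _ in pairs if t == 0)
    missed = sum(1 for t, p in pairs if t == 1 and p == 0)
    false_alarms = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return accuracy, missed / patients, false_alarms / controls

def balanced_subsample(patients, controls, rng):
    """Keep all patients and draw an equally sized random set of controls.

    Assumes there are at least as many controls as patients.
    """
    return patients, rng.sample(controls, len(patients))

# Made-up labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
acc, md, fa = error_rates(y_true, y_pred)
print(acc, md, fa)

rng = random.Random(0)
train_patients, train_controls = balanced_subsample(
    ["p1", "p2"], ["c1", "c2", "c3", "c4", "c5"], rng)
```

In this toy example the missed detection rate (0.5) exceeds the false alarm rate (0.25) even though both classes are equally represented, which is the same kind of asymmetry reported in Table 10.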

Voice activity-related features are particularly good at identifying the disease as shown in Table 10. Interestingly, features 5.3 and 5.6, which are related to the rate of silences, independent of the silence duration, were found to be more powerful discriminators than 5.4 and 5.7, which are related to long silences and duration of silences. Thus, frequency of silences during speech was found to be more important than the duration of silences. Features 5.8 and 5.9 indicate how long the subject can talk without a long silence. These two features are also highly correlated with the silence rate features, as shown in Table 11, and they had good performance in the classification experiments. Similarly, phoneme rate was strongly correlated with the rate of silences, and it performed well in experiments. Even though word rate is strongly correlated with the phoneme rate, it is not as strongly correlated with the silence rate as the phoneme rate, and it was not as successful in prediction of the disease.

Table 11 Correlation of features with statistically significant accuracies in age-, education-, and gender-controlled experiments
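
A pairwise correlation of the kind reported in Table 11 could be computed as in the sketch below. It assumes the Pearson correlation coefficient, which the text does not confirm, and the per-subject values are invented purely to exercise the function.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Invented per-subject feature values, for illustration only.
silence_rate = [0.10, 0.20, 0.30, 0.40, 0.50]
phoneme_rate = [12.0, 11.0, 9.5, 8.0, 7.5]
r = pearson(silence_rate, phoneme_rate)
print(round(r, 3))
```

Two features whose coefficient has a magnitude close to 1 carry largely redundant information, which is relevant when deciding which features to combine.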

Linguistic features did not perform as well as the prosodic features, as discussed in the previous section. Only pronoun rate and pronoun-to-noun ratio were strong indicators of the disease, but their prediction powers are not as strong as the prosodic features. However, their missed detection and false alarm rates are more balanced compared to prosodic features.

Features within each feature category are strongly correlated with each other, as shown in Table 11. Performances of the features with highest accuracy from each category are compared with confidence intervals in Table 12.

Table 12 Accuracy, missed detection, and false alarm rates of the best performing features
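
The text does not state how the confidence intervals in Table 12 were obtained. One common choice for a classification accuracy is the Wilson score interval for a binomial proportion, sketched below under that assumption; the counts are hypothetical and are not results from the study.

```python
from math import sqrt

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Hypothetical example: 40 of 50 test decisions correct.
low, high = wilson_interval(correct=40, total=50)
print(round(low, 3), round(high, 3))
```

With small test sets the interval is wide, which is one reason why small differences in accuracy between features or classifiers may not be statistically significant.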

Feature fusion was used in an attempt to further boost the performance. In that approach, classifiers were trained with two features instead of a single feature. Because of the high within-category correlations, feature fusion experiments were done using the best performing feature in each category. Results are shown in Table 12. Feature fusion not only failed to yield a statistically significant improvement over the best single feature but also slightly degraded performance.
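
The comparison between a single feature and a two-feature fusion can be sketched with a small Gaussian naive Bayes classifier. Everything below is an assumption for illustration: the data are synthetic, leave-one-out evaluation is used only for simplicity, and the helper names are hypothetical. It does not reproduce the classifiers or the evaluation protocol of the study.

```python
import random
from math import log, pi
from statistics import mean, pvariance

def fit_gnb(X, y):
    """Fit Gaussian naive Bayes: per-class prior, feature means, variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        model[label] = (
            len(rows) / len(X),
            [mean(c) for c in cols],
            [pvariance(c) + 1e-9 for c in cols],  # floor avoids zero variance
        )
    return model

def predict_gnb(model, x):
    """Return the label with the highest log-posterior for feature vector x."""
    def score(params):
        prior, mus, variances = params
        s = log(prior)
        for v, mu, var in zip(x, mus, variances):
            s += -0.5 * log(2 * pi * var) - (v - mu) ** 2 / (2 * var)
        return s
    return max(model, key=lambda label: score(model[label]))

def loo_accuracy(X, y):
    """Leave-one-out accuracy: train on all samples but one, test on it."""
    hits = 0
    for i in range(len(X)):
        model = fit_gnb(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict_gnb(model, X[i]) == y[i]
    return hits / len(X)

# Synthetic data: feature 0 separates the classes, feature 1 is mostly noise.
rng = random.Random(0)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(2.0 * t, 1.0), rng.gauss(0.2 * t, 1.0)] for t in y]

single = loo_accuracy([[row[0]] for row in X], y)
fused = loo_accuracy(X, y)
print(f"single feature: {single:.2f}, two features: {fused:.2f}")
```

When the second feature adds little independent information, the fused classifier has more parameters to estimate from the same data and may do no better than the single-feature one.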

Scatter diagrams of the features and decision boundaries for classification are shown in Figure 2. Silence ratio had better discrimination power than the other features. Unfortunately, the information in the other features failed to correct the errors made with silence ratio. Similarly, phoneme rate was found to be a powerful feature, but the pronoun frequency could not correct the errors it generates, as shown in Figure 2C.

Figure 2

Scatter diagram of features in Table 12. Decision boundary of the best performing classifier is shown in each figure. Note that even though the features were normalized before classification, they are not normalized in these figures to make interpretation easier. (A) Scatter diagram for silence ratio vs. phoneme rate features. (B) Scatter diagram for pronoun frequency vs. silence ratio features. (C) Scatter diagram for pronoun frequency vs. phoneme rate features.

Note that increasing the size of the feature vectors can in fact degrade classifier performance because of the curse of dimensionality: when there is not enough training data, the classifier cannot generalize and performs poorly on test data. That effect may be partly responsible for the lack of improvement with the feature fusion approach. For the same reason, feature fusion with a larger number of features was not investigated.

6 Conclusions

We have investigated an extensive set of features derived from the speech signal and transcriptions of Alzheimer’s patients and control subjects. It is already known that the part of the brain cortex that deals with linguistic abilities is one of the first to deteriorate with the onset of the disease. Our work explored how that deterioration is reflected in the patient’s speech prosody and spoken language, and whether there are markers that can be effectively detected using machine learning techniques. Our results indicate that a prediction accuracy higher than 80% can be obtained with high confidence using the proposed features, independent of the age, education level, and gender of the subjects. Prosodic features were substantially better than the linguistic features. In fact, only two of the linguistic features were found to be strong markers of the disease.

Classification experiments were also done with combinations of features. However, using more than one feature did not outperform the best single feature. This may be a result of the limited amount of training data, which causes generalization problems when the number of features increases.

Our experiments are with late-stage patients, and the effectiveness of the markers that we have found should be measured with early-stage patients, where the signals are more subtle and more subjects may be needed to reach statistically significant results. However, our experimental results and manual observations from the data are encouraging, and we will start collecting data from early-stage patients in the near future.

Another topic that we will investigate in future work is a cross-lingual study of the proposed features. Features that are independent of language can provide important clues about the neural degeneration process during the disease, or perhaps enable a deeper understanding of the neural networks in the brain that are responsible for language cognition and speech production.

References


  1. M Prince, M Guerchet, M Prina, World Alzheimer report 2013: Journey of caring: an analysis of long-term care for dementia. Accessed 2015-03-13.

  2. R Schmelzer, Roche Joins The Global CEO Initiative on Alzheimer’s Disease. Accessed 2014-08-20.

  3. MF Folstein, SE Folstein, PR McHugh, “Mini-mental state”: a practical method for grading the cognitive state of patients for the clinician. J. Psychiat. Res. 12(3), 189–198 (1975).

  4. LM Bloudek, DE Spackman, M Blankenburg, SD Sullivan, Review and meta-analysis of biomarkers and diagnostic imaging in Alzheimer’s disease. J. Alzheimer’s Dis. 26(4), 627–645 (2011).

  5. RS Bucks, S Singh, JM Cuerden, GK Wilcock, Analysis of spontaneous, conversational speech in dementia of Alzheimer type: evaluation of an objective technique for analysing lexical performance. Aphasiology. 14(1), 71–91 (2000).

  6. ET Prud’hommeaux, B Roark, Extraction of narrative recall patterns for neuropsychological assessment. Proceedings of the 12th Annual Conference of the International Speech Communication Association (Interspeech) (Florence, Italy, 2011), pp. 3021–3024.

  7. KC Fraser, JA Meltzer, NL Graham, C Leonard, G Hirst, SE Black, E Rochon, Automated classification of primary progressive aphasia subtypes from narrative speech transcripts. Cortex. 55, 43–60 (2014).

  8. B Roark, M Mitchell, JP Hosom, K Hollingshead, J Kaye, Spoken language derived measures for detecting mild cognitive impairment. IEEE Trans. Audio Speech Lang. Process. 19(7), 2081–2090 (2011).

  9. G Tosto, M Gasparini, GL Lenzi, G Bruno, Prosodic impairment in Alzheimer’s disease: assessment and clinical relevance. J. Neuropsychiat. Clin. Neurosci. 23(2), 21–23 (2011).

  10. DS Knopman, S Weintraub, VS Pankratz, Language and behavior domains enhance the value of the clinical dementia rating scale. Alzheimers Dement. 7(3), 293–299 (2011).

  11. SV Pakhomov, GE Smith, S Marino, A Birnbaum, N Graff-Radford, R Caselli, B Boeve, DS Knopman, A computerized technique to assess language use patterns in patients with frontotemporal dementia. J. Neurolinguistics. 23(2), 127–144 (2010).

  12. V Iliadou, S Kaprinis, Clinical psychoacoustics in Alzheimer’s disease central auditory processing disorders and speech deterioration. Ann. Gen. Hosp. Psychiat. 2(1), 12 (2003).

  13. I Hoffmann, D Nemeth, CD Dye, M Pakaski, T Irinyi, J Kalman, Temporal parameters of spontaneous speech in Alzheimer’s disease. Int. J. Speech Lang. Pathol. 12(1), 29–34 (2010).

  14. SV Pakhomov, EA Kaiser, DL Boley, SE Marino, DS Knopman, AK Birnbaum, Effects of age and dementia on temporal cycles in spontaneous speech fluency. J. Neurolinguistics. 24(6), 619–635 (2011).

  15. C Thomas, V Keselj, N Cercone, K Rockwood, E Asp, in Mechatronics and Automation 2005 IEEE International Conference, 3. Automatic detection and rating of dementia of Alzheimer type through lexical analysis of spontaneous speech (Niagara Falls, Canada, 2005), pp. 1569–1574.

  16. H Lee, F Gayraud, F Hirsch, M Barkat-Defradas, Speech dysfluencies in normal and pathological aging: a comparison between Alzheimer patients and healthy elderly subjects. Proceedings of the 17th International Congress of Phonetic Sciences (ICPhS) (Hong Kong, China, 2011), pp. 1174–1177.

  17. DA Snowdon, SJ Kemper, JA Mortimer, LH Greiner, DR Wekstein, WR Markesbery, Linguistic ability in early life and cognitive function and Alzheimer’s disease in late life. Findings from the Nun Study. JAMA. 275(7), 528–532 (1996).

  18. K Oflazer, S Inkelas, in Proceedings of the EACL Workshop on Finite State Methods in NLP, 82. A finite state pronunciation lexicon for Turkish (Budapest, Hungary, 2003), pp. 900–918.

  19. K Oflazer, Two-level description of Turkish morphology. Literary Linguist. Comput. 9(2), 137–148 (1994).

  20. E Brunet, Le Vocabulaire De Jean Giraudoux: Structure Et évolution: Statistique Et Informatique Appliquées à L’étude Des Textes à Partir Des Données Du Trésor De La Langue Française. Le Vocabulaire des grands écrivains français (Genève, Slatkine, 1978). ASIN: B0000E99PZ.

  21. A Honore, Some simple measures of richness of vocabulary. Assoc. Literary Linguistic Comput. Bull. 7 (1979).

  22. D Biber, S Conrad, G Leech, The Longman student grammar of spoken and written English, (Harlow: Longman, 2002). ISBN: 0 582 237262.

  23. E Moore, MA Clements, JW Peifer, L Weisser, Critical analysis of the impact of glottal features in the classification of clinical depression in speech. Biomed. Eng. IEEE Trans. 55(1), 96–107 (2008).


The authors would like to thank the anonymous reviewers for their invaluable comments and suggestions to improve the quality of the paper. They are also grateful to Prof. Emily T. Prud’hommeaux of Rochester Institute of Technology for her invaluable comments during the revision of this paper.

Author information


Corresponding author

Correspondence to Ali Khodabakhsh.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Khodabakhsh, A., Yesil, F., Guner, E. et al. Evaluation of linguistic and prosodic features for detection of Alzheimer’s disease in Turkish conversational speech. J AUDIO SPEECH MUSIC PROC. 2015, 9 (2015).
