Impact of acoustic similarity on efficiency of verbal information transmission via subtle prosodic cues
© The Author(s) 2016
Received: 29 February 2016
Accepted: 15 November 2016
Published: 13 December 2016
In this study, we investigate the effect of tiny acoustic differences on the efficiency of prosodic information transmission. Study participants listened to textually ambiguous sentences, which could be disambiguated using prosodic cues such as syllable length and pause length. Sentences were uttered in voices similar to the participant’s own voice and in voices dissimilar to their own voice. The participants then identified which of four pictures the speaker was referring to. Both the eye movements and the response times of the participants were recorded. Eye tracking and response time results both showed that participants understood the textually ambiguous sentences faster when listening to voices similar to their own. The results also suggest that tiny acoustic features, which do not contain verbal meaning, can influence the processing of verbal information.
Language comprehension involves a complex interaction between the transmitted message and the receiver’s background knowledge and experiences. As a result of this complexity, differences in representation styles can clearly influence the efficiency of our language comprehension process. For example, the inversion of subject and object in passive sentences makes these sentences more difficult for listeners to understand than sentences with the same meaning expressed using active voice, for both positive and negative sentences. Listeners also have difficulty interpreting “garden path” sentences, i.e., grammatically correct sentences whose meanings differ from those that a listener would normally expect, such as “The dog that I had really loved bones” and “I told her children are noisy.” Such sentences are considered to be evidence of our sequential reading process (i.e., one word read at a time).
Schema theory suggests that presenting messages in a style that is familiar to the recipient improves comprehension efficiency, because when a receiver has relevant background knowledge, he or she can free up more working memory for analysis and interpretation of the message [4, 5]. Researchers have found evidence that both lexical and prosodic familiarity increase the efficiency of language comprehension. Use of familiar topics has been found to help foreign language learners improve their performance on reading comprehension tasks, no matter which second language they are learning or what their native language is. Moreover, this facilitative effect on language-related tasks appears even in simple nativization drills, such as changing character and location names into native ones (e.g., when a Japanese learner of English replaces “Barack Obama lives in Washington D.C.” with “Shinzo Abe lives in Tokyo”). Studies also show that familiarity with the speaker’s speech characteristics, such as the speaker’s accent, has a positive influence on listening comprehension for both native and non-native listeners [9, 10].
In most of the cases mentioned above, familiarity also involves self-similarity (i.e., we are familiar with our own accent, capital, president, etc.). Thus, it seems that self-similarity is a factor related to high-efficiency communication. However, most of these studies employed second language learners as participants, and there is still a lack of evidence showing whether subtle prosodic cues significantly influence listening comprehension. This matters to us because we aim to find a way to predict and achieve (through speech synthesis) high-efficiency speech communication; if subtle prosodic cues cannot significantly influence comprehension, the idea can hardly be applied. Thus, in previous research we tried to use speaker self-similarity as a predictor of information transmission quality in dialogues. We investigated the relationship between the similarity of speakers and listeners in spectral envelope features, prosodic features, and lexical features and the quality of information transmission during map task dialogues. Prosodic and lexical similarity were found to be correlated with information transmission quality, and spectral envelope similarity was also found to have a weak but significant correlation with map task performance. These results surprised us, because it is well known that the perception of one’s own voice involves a mixture of air conduction and bone conduction, meaning that our perception of our recorded voice differs from our daily perception of our own voice. In fact, we rarely perceive our own voice to be familiar when heard on a recording. Our previous research thus suggests that it is reasonable to assume that we find our own recorded voices more familiar than the recorded voices of others. However, it is still unclear whether the familiarity of subtle prosodic cues, such as fundamental frequency, has a facilitative effect on comprehension efficiency.
It is also unclear whether self-similarity influences communication efficiency when subjects hear synthesized voices as it does when communicating face-to-face with real people. Since the correlation is too weak to reach a definitive conclusion, we decided to design an experiment to investigate the effect of voice similarity on comprehension efficiency by observing comprehension when messages are presented at different levels of voice similarity.
Does similarity in the speech characteristics of the information sender and information receiver result in higher information transmission efficiency?
Do subtle acoustic cues, such as the spectral envelope, have any influence on the efficiency of information transmission?
This paper is organized as follows. After a description of our experimental method, we describe our experimental procedure, report our experimental results, and discuss their implications. We then end the paper with our conclusions and a discussion of our future research.
We employed lexically ambiguous material in our experiment to control the influence of lexical and prosodic features on comprehension. To vary similarity of the speakers’ voices, we used morphing technology. This allowed us to present information at different levels of self-similarity. We also used objective similarity measures for further similarity analysis. To measure transmission efficiency, we used both response time during the target selection task and the proportion of the time participants were visually fixated on the appropriate target during the task.
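As a rough illustration of the second efficiency measure, the proportion of fixation time on the appropriate target can be estimated directly from raw gaze samples. The sketch below assumes that gaze samples arrive as (x, y) pixel coordinates at a fixed sampling rate, that lost samples are stored as `None`, and that each picture occupies a rectangular area of interest. The function name, the data layout, and the example coordinates are hypothetical and are not taken from the experiment software.

```python
from typing import Iterable, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def fixation_proportion(samples: Iterable[Optional[Point]], target: Rect) -> float:
    """Fraction of valid gaze samples that fall inside the target rectangle.

    Samples stored as None (tracking loss) are skipped, so the result is
    relative to the valid samples only.
    """
    left, top, right, bottom = target
    valid = 0
    on_target = 0
    for sample in samples:
        if sample is None:
            continue
        valid += 1
        x, y = sample
        if left <= x <= right and top <= y <= bottom:
            on_target += 1
    return on_target / valid if valid else 0.0


# Hypothetical trial: the correct picture fills the top-left quadrant
# of a 1024x768 display, and one of five samples was lost.
gaze = [(100, 100), (300, 200), None, (800, 600), (500, 380)]
proportion = fixation_proportion(gaze, (0, 0, 512, 384))
```

Because the sampling rate is fixed, the proportion of samples on the target equals the proportion of (valid) time spent looking at it.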
2.2 Voice morphing
After the participants’ voices were recorded reading the Japanese RB vs. LB ambiguous phrases, we randomly paired each participant with a stranger (footnote 5) and used the TANDEM-STRAIGHT toolbox to morph their original voices into four transitional levels of similarity. The starting point and ending point of each syllable were anchored and aligned manually (see Fig. 3 b; the white circles are the anchored points). The morphing conditions were as follows: 100% speaker A’s voice, 67% speaker A’s voice mixed with 33% speaker B’s voice, 33% speaker A’s voice mixed with 67% speaker B’s voice, and 100% speaker B’s voice. Because synthesized voices still sound somewhat artificial, voices were re-synthesized using TANDEM-STRAIGHT even for the 100% and 0% similarity conditions, so that all conditions were affected equally. Figure 2 c, d show the morphed waveforms and spectrum based on the waveforms shown in Fig. 2 a, respectively; they are very similar to each other in timing and intensity.
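The four morphing conditions can be pictured as a weighted average of two time-aligned parameter sequences. The following minimal sketch is only an illustration of that idea: it does not call TANDEM-STRAIGHT, the function name and the toy parameter frames are hypothetical, and real morphing tools may interpolate some parameters on a different (e.g., logarithmic) scale.

```python
from typing import List, Sequence

# Proportion of speaker A's voice in each of the four morphing conditions.
MORPH_WEIGHTS = (1.0, 0.67, 0.33, 0.0)


def morph_frames(frames_a: Sequence[Sequence[float]],
                 frames_b: Sequence[Sequence[float]],
                 weight_a: float) -> List[List[float]]:
    """Weighted average of two time-aligned parameter sequences.

    weight_a is the share of speaker A; speaker B receives 1 - weight_a.
    """
    if len(frames_a) != len(frames_b):
        raise ValueError("frames must be time-aligned before morphing")
    return [
        [weight_a * a + (1.0 - weight_a) * b for a, b in zip(fa, fb)]
        for fa, fb in zip(frames_a, frames_b)
    ]


# Toy example: two frames with two parameters each (hypothetical values).
speaker_a = [[1.0, 2.0], [3.0, 4.0]]
speaker_b = [[3.0, 6.0], [5.0, 0.0]]
conditions = {w: morph_frames(speaker_a, speaker_b, w) for w in MORPH_WEIGHTS}
```

With a weight of 1.0 or 0.0 the output equals one of the source sequences, which mirrors the fact that the 100% and 0% conditions still pass through the same processing chain as the mixed conditions.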
2.3 Objective similarity measures
Although we used morphing technology to artificially create voices with different levels of similarity, the original dissimilarity of the speakers’ voices varied, i.e., for some participants, even in the 0% “own voice” condition (100% other person’s voice), their partner’s voice was still very similar to their own. Hence, we introduced objective similarity measures, which included spectrum, pitch contour, and duration, to allow further analysis. The spectrum is assumed to contain one’s personal characteristics, which partially define the acoustic features of an individual’s speech. Meanwhile, prosodic cues, such as intonation and duration, are relevant to one’s speaking style, which also influences the acoustic features of one’s speech. For convenience, all of these features are called “acoustic features” in this paper.
2.3.1 Spectrum similarity measure
as its other elements (footnote 6). D(i,j) is an entry of the local distance matrix and C(i,j) is an entry of the cost matrix. Thus, the final entry in the cost matrix, C(I,J), is the optimum global alignment cost. The optimum mapping path between the two input vectors can also be found by backtracking the optimum path of each node. In this paper, MFCC distance is used to compute the distance between each pair of spectra (one for partner A and one for partner B) for a given phrase (e.g., “red star necktie”) so that we can obtain a distance matrix.
After fixing the manually anchored points together, DTW is used to align the rest of the frames with each other. Spectrum information is extracted using TANDEM-STRAIGHT. MFCC distance, which is the logarithm of the Euclidean distance between two MFCC vectors normalized by the maximum value of the total Euclidean distance, is the default distance measurement for spectrum sequences employed by TANDEM-STRAIGHT (and the distance measurement recommended by its creators).
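The alignment described above can be sketched with a generic dynamic time warping routine. This is a minimal sketch under stated assumptions, not the procedure used in the study: it uses the common three-neighbour recurrence C(i,j) = D(i,j) + min(C(i-1,j), C(i,j-1), C(i-1,j-1)), a plain Euclidean local distance rather than the normalized logarithmic MFCC distance, and it does not pin the manually anchored points. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Local distance D(i, j) between two feature vectors (assumed Euclidean)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw(seq_a: Sequence[Sequence[float]],
        seq_b: Sequence[Sequence[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the global alignment cost C(I, J) and the optimum mapping path."""
    n_a, n_b = len(seq_a), len(seq_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j]: minimum accumulated cost of aligning seq_a[:i+1] with seq_b[:j+1]
    cost = [[inf] * n_b for _ in range(n_a)]
    for i in range(n_a):
        for j in range(n_b):
            if i == 0 and j == 0:
                best_prev = 0.0
            else:
                best_prev = min(
                    cost[i - 1][j] if i > 0 else inf,
                    cost[i][j - 1] if j > 0 else inf,
                    cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            cost[i][j] = euclidean(seq_a[i], seq_b[j]) + best_prev

    # Backtrack from the final entry to (0, 0), always moving to the
    # cheapest admissible predecessor.
    i, j = n_a - 1, n_b - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return cost[n_a - 1][n_b - 1], path


# Toy one-dimensional "spectra": the second sequence repeats one frame.
total_cost, mapping = dtw([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
```

In the toy example the repeated frame of the second sequence is mapped onto a single frame of the first, so the accumulated cost stays at zero.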
2.3.2 Pitch contour similarity measure
where f_A(i) and f_B(i) represent the log F0 (footnote 7) values of speakers A and B in the i-th aligned frame, respectively, and m_A and m_B represent the mean log F0 of speakers A and B in the current speech segment, respectively. I represents the number of frames in the aligned sequence, and w(i) is the weighting factor, based on the frame signal power (footnote 8).
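Given the symbols defined above, a power-weighted correlation between two aligned log F0 contours can be sketched as follows. The exact form of the measure is an assumption: the code uses a standard weighted correlation coefficient and takes m_A and m_B as power-weighted means, which may differ from the definition used in the study. The function name and the example values are hypothetical.

```python
import math
from typing import Sequence


def weighted_pitch_correlation(f_a: Sequence[float],
                               f_b: Sequence[float],
                               w: Sequence[float]) -> float:
    """Weighted correlation of two aligned log-F0 contours.

    f_a, f_b: log F0 per aligned frame; w: per-frame weights (signal power).
    """
    if not f_a or not (len(f_a) == len(f_b) == len(w)):
        raise ValueError("inputs must be non-empty and of equal length")
    total_w = sum(w)
    if total_w <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumed here: the segment means are weighted by the same factors.
    m_a = sum(wi * a for wi, a in zip(w, f_a)) / total_w
    m_b = sum(wi * b for wi, b in zip(w, f_b)) / total_w
    cov = sum(wi * (a - m_a) * (b - m_b) for wi, a, b in zip(w, f_a, f_b))
    var_a = sum(wi * (a - m_a) ** 2 for wi, a in zip(w, f_a))
    var_b = sum(wi * (b - m_b) ** 2 for wi, b in zip(w, f_b))
    if var_a == 0 or var_b == 0:
        raise ValueError("a flat contour has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical contours: speaker B has the same shape at a higher register.
contour_a = [5.0, 5.1, 5.3, 5.2]
contour_b = [x + 0.4 for x in contour_a]
power = [0.2, 1.0, 0.8, 0.5]
r_same_shape = weighted_pitch_correlation(contour_a, contour_b, power)
```

Because the means are subtracted, a constant offset in register does not lower the score; only differences in contour shape do.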
2.3.3 Duration similarity measure
where S_A(s) and S_B(s) are the s-th intervals of speakers A and B computed from the anchored points, respectively, and N is the number of anchored points.
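One plausible reading of the duration measure is the mean absolute difference between corresponding intervals of the two speakers, where each interval is the time between two consecutive anchored points. The sketch below follows that reading; the normalization and the function names are assumptions, and the anchor times are invented for illustration.

```python
from typing import List, Sequence


def intervals(anchors: Sequence[float]) -> List[float]:
    """Durations between consecutive anchored points (N anchors give N - 1 intervals)."""
    return [later - earlier for earlier, later in zip(anchors, anchors[1:])]


def mean_duration_difference(anchors_a: Sequence[float],
                             anchors_b: Sequence[float]) -> float:
    """Mean absolute difference between corresponding intervals, in the input unit."""
    if len(anchors_a) != len(anchors_b) or len(anchors_a) < 2:
        raise ValueError("both speakers need the same number (>= 2) of anchors")
    s_a, s_b = intervals(anchors_a), intervals(anchors_b)
    return sum(abs(a - b) for a, b in zip(s_a, s_b)) / len(s_a)


# Hypothetical anchor times in milliseconds for one phrase.
anchors_speaker_a = [0, 120, 300, 450]   # intervals: 120, 180, 150
anchors_speaker_b = [0, 100, 310, 480]   # intervals: 100, 210, 170
difference_ms = mean_duration_difference(anchors_speaker_a, anchors_speaker_b)
```

A smaller value indicates that the two speakers timed their syllables and pauses more alike.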
Our experiment was divided into two phases. In the recording phase, participants were shown 13 pairs of pictures. The two pictures in each pair were different, but could be described using the same lexically ambiguous phrase, depending on whether the RB or LB reading was used. Participants were asked to describe each picture in Japanese twice, using their own natural speaking style, by reading the supplied ambiguous phrase. Example pictures and an example description are shown in Fig. 1 a. The recordings were made in a sound-proof booth at a sampling rate of 48,000 Hz with 20-bit quantization. Each participant was then randomly paired with another participant who was a stranger to them, and TANDEM-STRAIGHT was used to morph their voices with the voices of their partners.
In the second phase of the experiment, a listening comprehension experiment was performed about 1 week later. After completing two unambiguous warm-up trials, the only aim of which was to make sure that the participants understood what they should do during the experiment, participants listened to the previously recorded ambiguous phrases (in which their voices had been re-synthesized and morphed) while viewing pictures (1024×768 pixels) shown on a visual display (see Fig. 1 b). Participants were asked to identify which target/image they heard described as quickly as possible by pressing one of four arrow keys on the keyboard. Note that participants listened to exactly the same phrases as their randomly paired stranger partner, the only difference being that the self-similarity conditions differed (i.e., one participant’s voice was the “other person’s voice” for their partner, and vice versa).
During the experiment, the eye movements of the participants were tracked with a Tobii X2-30 eye tracker at a sampling frequency of 30 Hz. The targets were pictures of pairs of items, all of which had been seen by the participants during the recording phase of the experiment. Participants were shown a target, which was a set of four pairs of pictures. We called the item on the left of each pair the “first item” (e.g., the necktie in Fig. 1 b), and the item on the right of each pair the “second item.” The first item in each pair was the subject of the ambiguous phrase, while the second item was unique and was described without ambiguity. We included these “second items” because the prosodic differences between the descriptions of pairs of ambiguous options are very subtle. Based on previous research, even when listeners hear their own recorded voices, they can only achieve a comprehension accuracy of about 70%. By adding a unique “second item”, we are able to better distinguish between confused responses (when the listener does not know which target is being described) and incorrect responses (when the listener presses the wrong key by mistake). Each set of four pairs of pictures included two pairs with correct first items and two pairs with first items that could be easily mistaken for the correct items due to RB vs. LB ambiguity.
Twenty-eight male, native Japanese-speaking college students were recruited as participants (footnote 10). Data collected from four of the participants were removed from analysis, either because of experimental error (the participants misunderstood the task) or because of data recording error (50% of their eye movement data was lost). Thus, the study was conducted using data collected from 24 participants (footnote 11).
In this paper, we analyzed our results using ANOVA, which assumes that, under the null hypothesis, the ratio (i.e., F value) of between-group variability to within-group variability follows an F-distribution. The p value, i.e., the probability of observing an F value at least this large if the means of the experimental groups were all equal, becomes smaller as the F value increases. When the p value is smaller than the alpha level (which was set to 0.05 for this paper), the null hypothesis is rejected (i.e., there is a significant difference between the means of the experimental performances of the groups being compared). Further, as we used four morphing levels in our experiment, Tukey’s test was applied for pairwise comparisons whenever ANOVA showed a significant difference in experimental performance.
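To make the F value concrete, the following minimal sketch computes it for a simple one-way, between-groups layout. This is a simplification for illustration only: the analysis in this paper also involves within-subjects factors, which this sketch does not model, and the function name and the response times are hypothetical. Converting F into a p value additionally requires the F-distribution with the returned degrees of freedom, which a statistics library can provide.

```python
from typing import Sequence, Tuple


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way between-groups ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need at least two non-empty groups and n_total > k")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group variability: how far group means lie from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variability: spread of observations around their group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variance")
    df_between, df_within = k - 1, n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical response times (seconds) for three conditions.
response_times = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, df_between, df_within = one_way_anova_f(response_times)
```

For these toy data the between-group mean square is 13 and the within-group mean square is 1, so F is 13 with (2, 6) degrees of freedom.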
We further divided the “stranger’s voice” data into “strangers with voices similar to the listener’s own voice” and “strangers with voices dissimilar to the listener’s own voice” based on the objective similarity measures, which can be considered an extension of the original morphing experiment. We used the 33rd and 67th percentiles of all the data as the thresholds for “similar stranger” and “dissimilar stranger,” respectively. Participant pairs whose average objective similarity measure was higher or lower than these thresholds were considered to be a “similar stranger” or “dissimilar stranger,” respectively. Further, ANOVA was applied using the “similar stranger” and “dissimilar stranger” categories as an additional “between subjects” factor. Because we were concerned that the similarity of pitch and duration could vary across utterances within a participant pair (i.e., some utterances could sound similar while other utterances sounded dissimilar), both pitch and duration similarities were treated as both a “between subjects” factor and a “within subjects” factor (i.e., they were analyzed twice; see footnote 12). Also note that there were only tiny differences in prosodic expression between paired participants. The mean and variance of the mean differences in syllable and pause duration were 44.4 ms and 378.04 ms², respectively. The mean and variance of the weighted correlation of pitch contours were 0.78 (footnote 13) and 0.04, respectively.
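The threshold-based grouping can be sketched as a split at the lower and upper thirds of the data. The sketch assumes that the measure is a distance (smaller means more similar), that the thresholds are the two tercile cut points of all values, and that pairs falling between the thresholds are left unlabelled; the function name, pair identifiers, and values are hypothetical.

```python
import statistics
from typing import Dict, Optional


def label_pairs(distance_by_pair: Dict[str, float]) -> Dict[str, Optional[str]]:
    """Label each pair as 'similar', 'dissimilar', or None (between thresholds)."""
    # statistics.quantiles with n=3 returns the two cut points that split
    # the data into three equally sized groups.
    lower, upper = statistics.quantiles(distance_by_pair.values(), n=3)
    labels: Dict[str, Optional[str]] = {}
    for pair, distance in distance_by_pair.items():
        if distance < lower:
            labels[pair] = "similar"
        elif distance > upper:
            labels[pair] = "dissimilar"
        else:
            labels[pair] = None  # excluded from the pairing analysis
    return labels


# Hypothetical average distances for nine participant pairs.
distances = {f"pair{i}": float(i) for i in range(1, 10)}
labels = label_pairs(distances)
```

If the measure is a similarity score rather than a distance (as with the weighted pitch correlation), the two labels would simply be swapped.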
3.2 Response time
3.2.1 Response time under different morphing conditions
3.2.2 Response time under different pairing conditions
There is still a significant difference between response times when the duration similarity measure is used to divide the “stranger” data (F = 7.754, p < 0.05 as a between-subjects factor; F = 3.37, p < 0.05 as a within-subjects factor). However, there is no significant difference in response time between trials divided by the spectrum similarity measure (F = 2.10, p = 0.16) or the pitch similarity measure (F = 1.55, p = 0.23 as a within-subjects factor; F = 1.1, p = 0.34 as a between-subjects factor). One possible explanation is that differences in prosodic information comprehension are difficult to capture using response time as an indicator, and that the difference in duration itself causes different response times (e.g., one’s response would probably be slower when the stimulus lasts longer).
3.3 Degree of visual fixation
3.3.1 Visual fixation under different voice morphing conditions
3.3.2 Visual fixation under different pairing conditions
In summary, since the audio stimuli used in these experiments were verbally identical, the results of our experiment indicate that similarity in subtle prosodic cues does indeed positively influence the efficiency of prosodic information transmission. Additionally, there are significant differences in response times at different morphing levels and under different duration-based pairing conditions, but no significant difference in response times between MFCC-based pairing conditions or pitch-based pairing conditions. In contrast, the visual fixation results show no significant differences at different morphing levels or different duration-based pairing conditions, but show significant differences between different MFCC-based pairing conditions and pitch-based pairing conditions. We cannot explain this contrastive result, except to suggest that perhaps this experiment revealed a “boundary” of human speech perception ability. Investigation of a possible boundary of this type would be an interesting topic of future research. Also note that the utterances of some pairs of participants may have sounded more artificial than others, and that even within the same pair of participants some sentences sounded more artificial than others since nasal sounds usually sound slightly more artificial than plosive sounds. This research does not investigate the influence of the naturalness of the synthesized voices, which should also be examined in future research.
We designed and conducted experiments to investigate the effect of subtle prosodic similarity on the efficiency of prosodic information transmission. We used sentences with RB vs. LB ambiguity as our experimental material, and voice morphing technology to control voice similarity levels during the experiments. Objective similarity measurements were also used for analysis. Participants’ response times and visual fixation behaviour were recorded. Analysis of the response time data showed that participants identified ambiguous target images more quickly when they heard voices similar to their own. Analysis of the visual fixation data also showed that participants understood more of the prosodically conveyed information when the target images were described in voices similar to their own. To address the questions raised in the “Introduction” section, our results support the hypotheses that similarity in the speech characteristics of the information sender and information receiver results in higher information transmission efficiency, and that subtle acoustic cues, such as the spectral envelope, influence the efficiency of information transmission.
These findings were consistent with one another and imply that acoustic feature similarity is relevant to prosodic information transmission efficiency. In contrast to previous research, the subjects of this study were all male undergraduate students who were native speakers of standard Japanese. Our results suggest that human processing of speech information is so sensitive that even subtle prosodic cues influence our information transmission efficiency and language processing ability. It should also be noted, however, that only half of our experimental results were statistically significant; additional experiments that can verify our findings and investigate the “boundary” of human speech perception ability are therefore needed. Finally, as spectrum similarity (MFCC distance) is considered to contain information on the condition of the vocal tract, our results suggest that physiological similarity is likely to be an additional dimension which needs to be considered when discussing speech communication and information transmission between speakers.
Regarding future work, the current experiment is unbalanced with respect to participants’ gender and the appearance of the different morphing conditions, so a stricter experiment that includes female participants ought to be conducted. Also, as mentioned above, synthesized voices still sound somewhat artificial. Therefore, further investigation of the naturalness of morphed stimuli and their impact on information transmission is a potential area of research. Moreover, the morphing conditions should be redesigned to show significant differences in experimental performance. Furthermore, instead of using morphed stimuli, information transmission efficiency when using “similar” or “dissimilar” participants’ voices, as determined through the use of an objective similarity measure, should also be investigated. The combination of these two research projects might help us to verify that the slower listener reactions are not merely due to lower-quality stimuli or the amount of morphing, or due to the possibility that participants can identify their own voices and therefore exert extra effort.
1 The other ambiguous material we used can be found in the appendix.
2 A mechanism whereby the pitch register for marking accentual prominences is lowered with each successive occurrence of a pitch accent within a phrase.
3 Considered to be the main prosodic cue.
4 Although TANDEM-STRAIGHT allows users to modify the parameters independently (some of the parameters are fixed), in our experiment all of the parameters were modified together (i.e., replaced by a weighted average of the two source voices). This was because the main question we wanted to investigate was whether the similarity of interlocutors’ voices influences information transmission.
5 Before being paired-up with a partner, participants were shown a list of the names of all of the participants to make sure they did not know their partner.
6 There are numerous ways to calculate the cost matrix, and here we only explain the method used in this paper (for more details see ).
7 F0 was tracked using TANDEM-STRAIGHT. Unvoiced intervals were interpolated based on a cost function aimed at minimizing discontinuities in the resulting trajectories and maximizing plausibility, based on the side information associated with F0 candidates.
8 In this paper, the signal power stands for the mean square of the input waveform.
9 Participants can respond at any time during a trial; therefore, the fourth stage is absent in some trials due to situations such as mistaken responses, etc.
10 We did not believe that gender would affect performance in this sort of comprehension experiment, and as a result there is an obvious imbalance in the genders of our participants. Future research should include more female participants, and should investigate the effect of a mixed-gender voice.
11 Trials in which the participant gave an incorrect response or which had more than a 50% loss of eye movement data were also removed from analysis, which excluded 10% of the remaining data.
12 We ignored participants/trials which did not meet both of the thresholds. For our analysis of spectrum similarity, we ignored two participants. For pitch similarity, we ignored three participants. For duration similarity, we ignored four participants.
13 A value that has been considered to indicate a high level of perceptual prosodic similarity in previous research.
14 Here we only show the proportion of visual fixation on the “correct” areas for simplicity.
Figure 13 shows the other 12 ambiguous materials we used in our experiment.
This research is supported by the Center of Innovation Program (Nagoya-COI; Mobility Society leading to an Active and Joyful Life for Elderly) from Japan Science and Technology Agency.
The authors declare that they have no competing interests.
Ethics approval and consent to participate
This research was granted approval by the ethics committee of the Graduate School of Information Science, Nagoya University. The reference number of the ethics approval is 329. Participants took part in our experiment only after we had received their informed consent by e-mail.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
1. N Anderson, Exploring Second Language Reading: Issues and Strategies (Heinle and Heinle Publishers, Boston, MA, 1999).
2. DI Slobin, Grammatical transformations and sentence comprehension in childhood and adulthood. J. Verbal Learn. Verbal Behav. 5(3), 219–227 (1966).
3. L Frazier, K Rayner, Making and correcting errors during sentence comprehension: eye movements in the analysis of structurally ambiguous sentences. Cogn. Psychol. 14(2), 178–210 (1982).
4. FC Bartlett, Remembering: A Study in Experimental and Social Psychology (Cambridge University Press, Cambridge, 1995).
5. H Nassaji, Schema theory and knowledge based processes in second language reading comprehension: A need for alternative perspectives. Lang. Learn. 52(2), 439–481 (2002).
6. MJ Leeser, Learner based factors in L2 reading comprehension and processing grammatical form: topic familiarity and working memory. Lang. Learn. 57(2), 229–270 (2007).
7. SK Lee, Effects of textual enhancement and topic familiarity on Korean EFL students’ reading comprehension and learning of passive form. Lang. Learn. 57(1), 87–118 (2007).
8. IH Erten, S Razi, The effects of cultural familiarity on reading comprehension. Read. Foreign Lang. 21(1), 60–77 (2009).
9. P Adank, BG Evans, J Stuart-Smith, SK Scott, Comprehension of familiar and unfamiliar native accents under adverse listening conditions. J. Exp. Psychol. Hum. Percept. Perform. 35(2), 520–529 (2009).
10. RC Major, SF Fitzmaurice, F Bunta, C Balasubramanian, The effects of nonnative accents on listening comprehension: implications for ESL assessment. TESOL Q. 36(2), 173–190 (2002).
11. B Chen, N Kitaoka, K Takeda, Relationship between speaker/listener similarity and information transmission quality in speech communication (Asia-Pacific Signal and Information Processing Association, Hong Kong, 2015).
12. D Maurer, T Landis, Role of bone conduction in the self-perception of speech. Folia Phoniatr. Logop. 42(5), 226–229 (1990).
13. Y Hirose, Cognitive mechanisms for sentence comprehension speaker’s intention and hearer’s comprehension: a latent function of lexical accent in syntax. Cogn. Sci. 13(3), 428–442 (2006).
14. JJ Barton, N Radcliffe, MV Cherkasova, J Edelman, JM Intriligator, Information processing during face recognition: the effects of familiarity, inversion, and morphing on scanning fixations. Perception 35, 1089–1105 (2006).
15. T Valentine, S Darling, M Donnelly, Why are average faces attractive? The effect of view and averageness on the attractiveness of female faces. Psychon. Bull. Rev. 11(3), 482–487 (2004).
16. H Kawahara, T Takahashi, M Morise, H Banno, Development of exploratory research tools based on TANDEM-STRAIGHT (Asia-Pacific Signal and Information Processing Association, Sapporo, 2009).
17. VG Skuk, SR Schweinberger, Influences of fundamental frequency, formant frequencies, aperiodicity, and spectrum level on the perception of voice gender. J. Speech Lang. Hear. Res. 57(1), 285–296 (2014).
18. R Zaske, SR Schweinberger, H Kawahara, Voice aftereffects of adaptation to speaker identity. Hear. Res. 268(1), 38–45 (2010).
19. DJ Hermes, Measuring the perceptual similarity of pitch contours. J. Speech Lang. Hear. Res. 41(1), 73–82 (1998).
20. H Sakoe, S Chiba, Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 26(1), 43–49 (1978).
21. H Kawahara, A de Cheveigne, H Banno, T Takahashi, T Irino, Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT (Interspeech, Lisbon, 2005).