- Open Access
Improvement of multimodal gesture and speech recognition performance using time intervals between gestures and accompanying speech
© Miki et al.; licensee Springer. 2014
- Received: 28 February 2013
- Accepted: 3 January 2014
- Published: 13 January 2014
We propose an integrative method for recognizing gestures, such as pointing, together with the speech that accompanies them. Speech generated simultaneously with a gesture can assist in recognizing that gesture, and, since the two modalities are complementary, gestures can likewise assist in the recognition of speech. Our integrative recognition method uses a probability distribution of the time interval between the starting time of a gesture and that of the corresponding utterance. We evaluate the improvement obtained by the proposed method on a task involving the solution of a geometry problem.
- Speech Recognition
- Recognition Performance
- Gesture Recognition
- Prosodic Feature
- Candidate Pair
Multimodal interaction, where multiple modalities sometimes play complementary roles with one another, is likely to become more widespread in human-machine communication. The semantics expressed in a modality may be ambiguous, but another modality might be able to remove these ambiguities. Combining gestures and speech is a typical example of such multimodality.
When completing a task using an interface, users often prefer multimodal interaction over unimodal interaction as task difficulty increases, for example, when entering data into a system with speech and pen modalities. This implies that multimodality facilitates the smooth completion of complex transactions, in particular by letting users select a method capable of expressing complex intentions.
In this paper, we propose a method for improving gesture and speech recognition and use a task involving the solution of a geometry problem to test it. When performing such tasks, verbal utterances are often accompanied by pointing because individual modalities are often ambiguous. For an automated system to understand such bimodal input, this kind of problem is generally divided into three sub-problems: independent recognition of speech and fingertip movements, matching up the utterances and fingertip movements, and simultaneous recognition (and understanding) of this bimodal input, taking into account both modalities. In this paper, we focus on the second and third issues, which, if successfully resolved, will result in what is known as ‘modality fusion’, which can be defined as the integration of the analysis of multiple modalities.
Although multiple feature streams from multiple modalities may be integrated and recognized simultaneously (using ‘early integration’ or ‘data-level fusion’), as in bimodal audio-visual speech recognition, this approach succeeds only when the modalities are tightly synchronized with each other, so it cannot be applied to the integration of speech and gestures. Instead, ‘late integration’ (or ‘decision-level fusion’) is usually used, and consequently all three of the sub-problems above need to be resolved.
To address the first issue, gesture recognition, methods using image processing have been proposed to recognize gestures, including fingertip movements. Head and hand positions have been tracked using video, fingertip positions have been tracked in captured images, position sensors have been used to acquire the position of a fingertip, and touch pens and panels have been used to interpret pointing [6, 7]. In this paper, we used derivatives of position sensor data to recognize gestures. Other methods might further improve performance, but this is outside the scope of our current investigation.
After independently recognizing speech and gestures, correspondences must be found between them: utterances and gestures which express identical meanings are paired. For such pairing, temporal order and inclusion, semantic compatibility, and the relationship between prosodic features in speech and the speed of hand/finger movements have been used. Utilizing prosodic features is an interesting approach, but extracting F0 features is not easy, and prosodic features show wide individual variation, so results using this method tend to vary widely in accuracy. Constraints based on temporal order or inclusion (overlap of the modalities' time spans) are effective; however, the order constraint is relatively weak compared to the overlap constraint, while the overlap constraint makes correspondences difficult to determine, resulting in a lack of flexibility. We therefore propose a soft decision method based on the statistics of the overlaps.
Finally, the information from the speech and gestures is used to construct an integrated representation. Integration/fusion methods for multimodal inputs have been well categorized, and the use of frame-based fusion has been proposed. The concepts obtained from the individual recognizers are put into semantic slots to represent an integrated meaning. Methods of this type cannot consider temporal constraints directly, so temporal constraints are often combined with them, as in the method referred to above. The following schemes have also been proposed: a graph-based optimization method, a finite-state parsing method, a unification-based parsing method, the integration of multimodal posterior probabilities, and hidden Markov model-based multimodal fusion. Some of these methods can take temporal constraints into account to some extent; however, none of them is intended to improve single-mode recognition performance as a result of the fusion.
Qu and Chai proposed the use of information obtained from gestures to improve speech recognition performance. Our goal is to improve both speech and gesture recognition performance simultaneously through the modality fusion process.
In a previous study, we used the time interval between digit utterances in connected digits and the accompanying finger tapping to improve digit recognition. The synchronicity of speech and pen input has also been used for continuous speech recognition.
The rest of this paper is organized as follows. We first introduce the experimental task and explain the method of recording the multimodal inputs in Section 2. We then explain our gesture and speech recognition methods in Sections 3 and 4, respectively, and propose an integrative recognition method using multimodal time alignment in Section 5. We discuss our experimental results in Section 6 and conclude the paper in Section 7.
In order to recognize gestures, the automated system must be able to determine, from the movement of the subjects' fingertips, when they are pointing at items such as angles, segments, and vertices.
A subject’s finger position along the z-axis is also important for recognizing gestures because meaningful movements can occur when a fingertip is resting on a desk, for example, so we also used the absolute position along the z-axis as a feature. Additionally, we used the first derivatives of these features, resulting in six-dimensional feature vectors consisting of Δx, Δy, z, ΔΔx, ΔΔy, and Δz.
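As a rough sketch (our own illustration, not code from the paper), the six-dimensional feature vectors above can be computed from a stream of recorded (x, y, z) fingertip positions by frame-to-frame differencing:

```python
import numpy as np

def gesture_features(positions):
    """Compute [dx, dy, z, ddx, ddy, dz] per frame from (x, y, z) samples.

    `positions` is a (T, 3) array of fingertip coordinates. The first
    derivatives of x and y, the absolute z position, and the derivatives
    of those three quantities form the six-dimensional feature vector.
    """
    positions = np.asarray(positions, dtype=float)
    dx = np.diff(positions[:, 0])        # delta-x
    dy = np.diff(positions[:, 1])        # delta-y
    z = positions[1:, 2]                 # absolute z, aligned with the deltas
    base = np.column_stack([dx, dy, z])  # (T-1, 3): [dx, dy, z]
    deriv = np.diff(base, axis=0)        # (T-2, 3): [ddx, ddy, dz]
    return np.hstack([base[1:], deriv])  # (T-2, 6) feature matrix

feats = gesture_features([[0, 0, 1], [1, 0, 1], [3, 1, 0], [6, 3, 0]])
print(feats.shape)  # (2, 6)
```

Two frames are consumed by the two levels of differencing, so a sequence of T positions yields T−2 feature vectors.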
We used three-state HMMs with a single mixture to model 21 finger movements. Of these 21 gestures, 18 corresponded to pointing at one of the 11 segments, 4 vertices, or 3 arcs between segments shown in Figure 1. The three remaining finger movements consisted of gestures occurring during intervals between pointing gestures, pushing the start/stop switch, and touching the desk without pointing at any of the items.
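A minimal sketch of how such HMM-based classification works, scoring an observation sequence against each gesture model with the forward algorithm and picking the best. All model names and parameters here are illustrative (two toy left-to-right three-state models with one-dimensional observations, rather than the 21 models over six-dimensional features used in the paper):

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian (vectorized over states)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def forward_loglik(obs, means, variances, trans, init):
    """Log-likelihood of an observation sequence under a Gaussian HMM,
    computed with the forward algorithm in log space."""
    eps = 1e-300  # avoid log(0) on forbidden transitions
    log_trans = np.log(np.asarray(trans) + eps)
    alpha = np.log(np.asarray(init) + eps) + log_gauss(obs[0], means, variances)
    for x in obs[1:]:
        alpha = log_gauss(x, means, variances) + np.array(
            [np.logaddexp.reduce(alpha + log_trans[:, s]) for s in range(len(means))]
        )
    return np.logaddexp.reduce(alpha)

# Two toy three-state, left-to-right gesture models (parameters illustrative).
means = {"point_A": np.array([0.0, 1.0, 2.0]), "rest": np.array([0.0, 0.0, 0.0])}
variances = np.ones(3)
trans = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
init = [1.0, 0.0, 0.0]

obs = [0.1, 1.0, 2.1]  # observation trajectory sweeping through the states
best = max(means, key=lambda g: forward_loglik(obs, means[g], variances, trans, init))
print(best)  # point_A
```

The rising trajectory matches the state means of the left-to-right "point_A" model better than the stationary "rest" model, so its forward log-likelihood is higher.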
The correct rate and accuracy are defined as

Corr = C_g / N_g × 100 [%], Acc = (C_g − (R_g − C_g)) / N_g × 100 [%],

where Corr and Acc denote the correct rate and accuracy, respectively, and C_g, N_g, and R_g represent the number of correctly recognized gestures, the number of gestures included in the test data, and the number of recognized gestures, respectively.
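As a hypothetical helper (our own illustration, assuming the accuracy penalizes the R_g − C_g extra recognized gestures as insertion-style errors; the paper's exact definition may differ), the two rates could be computed as:

```python
def gesture_rates(correct, n_ref, n_recognized):
    """Correct rate and accuracy from gesture counts.

    correct      -- C_g, number of correctly recognized gestures
    n_ref        -- N_g, number of gestures in the test data
    n_recognized -- R_g, number of recognized gestures
    Acc subtracts the surplus recognized gestures (R_g - C_g) from C_g.
    """
    corr = 100.0 * correct / n_ref
    acc = 100.0 * (correct - (n_recognized - correct)) / n_ref
    return corr, acc

print(gesture_rates(45, 50, 48))  # (90.0, 84.0)
```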
As mentioned in the introduction, we could have adopted other features and/or methods to improve recognition performance. We understand that using HMMs with the features Δx, Δy, and z may not be the best choice for gesture recognition. We did so, however, because our proposed method involves an integration step, described in Section 5, and each recognition method should be kept separate from the integration [a]. Improving the recognition performance of the individual modalities is a subject for future work, and we believe that improved individual recognition methods will increase the benefits of our integrative method.
We also performed speech recognition experiments using the recorded explanation utterances. The Julius decoder was used for speech recognition. We used a network grammar that accepted a sequence of elements, such as the expression ‘angle ADB equals angle ACB’. Since subjects were often explaining how to solve the problem while still thinking about the solution, they often used fillers and disfluencies; therefore, the grammar was set up to accept fillers between any words. No other methods were used to deal with out-of-vocabulary words. The vocabulary contained 77 words. These words and the grammar were predefined empirically and could thus be used for all of the test data. Triphone HMMs were used as the acoustic models; they were trained on the Corpus of Spontaneous Japanese (CSJ), which is suitable for spontaneous speech, and each HMM had three states with output probabilities. The sampling frequency was 16 kHz; the frame length and shift were 25 and 10 ms, respectively; and a 12-dimensional MFCC, its deltas, and the delta log power were used as features. These acoustic models were also trained in advance without using any part of the test set; thus, we could use the same models for all of the test data, and no n-fold cross validation was needed. We obtained a 75.0% speech recognition rate with 66.7% accuracy.
5.1 Relationship between speech and gestures
We model the difference τ between the starting time of an utterance and that of the corresponding gesture with a Gaussian distribution:

p(τ) = (1 / √(2πσ_τ²)) exp(−(τ − μ_τ)² / (2σ_τ²)), (5)

where μ_τ and σ_τ² are the mean and the variance of the time difference τ, respectively. Utterances are paired with the gestures for which the probability of the corresponding starting-time difference is maximal. We could have used the discrete distribution derived directly from the histogram, but we decided to fit a parametric distribution to the histogram instead, for the purpose of generalization [b].
To verify the effectiveness of this method, we performed a preliminary experiment in which utterances and gestures were manually segmented a priori. Each gesture was then associated with the utterance containing a key phrase that had the maximum probability calculated using Equation 5. Key phrases included demonstratives (‘here’, ‘this’, etc.) and parts of the figure (‘angle ADB’, ‘70 degrees’, etc.). Some utterances were not associated with any gestures. The eight trials described in Section 2 were used as the test set, and μ_τ and σ_τ² were estimated from the data of the seven other trials, excluding each test trial (that is, using eightfold cross validation). Matches were considered correct when utterances were associated with the correct gestures, and utterances without any accompanying gestures were considered correct when no gestures were associated with them. Of the utterances, 93.8% were correctly associated with gestures. The nearest-matching-starting-time strategy and the longest-overlapping-time strategy obtained 89.7% and 83.5% association rates, respectively; thus, our method was shown to function effectively.
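The association procedure can be sketched as follows, assuming per-trial lists of utterance and gesture start times (all names and values here are illustrative, not data from the paper):

```python
import math

def fit_gaussian(diffs):
    """Estimate mean and variance of start-time differences tau (in seconds)."""
    mu = sum(diffs) / len(diffs)
    var = sum((d - mu) ** 2 for d in diffs) / len(diffs)
    return mu, var

def log_density(tau, mu, var):
    """Gaussian log density of a start-time difference."""
    return -0.5 * (math.log(2 * math.pi * var) + (tau - mu) ** 2 / var)

def associate(utterance_starts, gesture_starts, mu, var):
    """Pair each utterance with the gesture whose start-time difference
    has maximal probability under the fitted Gaussian."""
    pairs = {}
    for utt, u_start in utterance_starts.items():
        pairs[utt] = max(gesture_starts,
                         key=lambda g: log_density(u_start - gesture_starts[g], mu, var))
    return pairs

# Training start-time differences (utterance start minus gesture start);
# here gestures tend to precede the speech slightly.
mu, var = fit_gaussian([0.30, 0.25, 0.40, 0.35, 0.20])
pairs = associate({"utt1": 1.3, "utt2": 4.6},
                  {"g1": 1.0, "g2": 4.3, "g3": 8.0}, mu, var)
print(pairs)  # {'utt1': 'g1', 'utt2': 'g2'}
```

In a full implementation, μ_τ and σ_τ² would be estimated with the eightfold cross-validation scheme described above rather than from a single pooled list.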
5.2 Integration algorithm
Table: possible associations between utterances and gestures (examples excerpted); entries include pointing at an angle, a segment, or a vertex, and pointing at a specific angle.
6.1 Experimental setup
We conducted an experiment to evaluate the improvement in speech and gesture recognition using the proposed integrative recognition method. We used the eight trials described in Section 2 as the test set and obtained the N-best results using the speech and gesture recognition methods introduced in Sections 3 and 4. Each candidate included a word sequence and its time alignment data, i.e., the start and end time of each word, which were used to determine the correspondence between utterances and gestures. We set both values of N (of the N-best candidates for speech and gestures) to 20, which means that the system compared a maximum of N×N (400) pairs of speech and gesture recognition candidates per trial [d]. As for the gaps between corresponding utterances and gestures, we approximated their statistics using a Gaussian distribution with the same μ_τ and σ_τ² used in Equation 5 in Section 5.1. To allow for the different dynamic ranges of the likelihoods of speech, gestures, and the time gaps between utterances and gestures, we set α, β, and γ in Equation 6 appropriately, based on a preliminary experiment.
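A simplified sketch of this N-best rescoring idea, scoring each candidate pair by a weighted sum of the two recognizer log-likelihoods and the timing log-probability. This omits the DP alignment and the semantic compatibility term M(u_i, g_j) of Equation 6, and all names, values, and weights are illustrative:

```python
import itertools
import math

def timing_logprob(tau, mu=0.3, var=0.005):
    """Gaussian log density of the utterance/gesture start-time gap
    (mu and var are illustrative stand-ins for the fitted statistics)."""
    return -0.5 * (math.log(2 * math.pi * var) + (tau - mu) ** 2 / var)

def rescore(speech_nbest, gesture_nbest, alpha=1.0, beta=1.0, gamma=0.5):
    """Pick the speech/gesture candidate pair maximizing the weighted sum
    of recognizer log-likelihoods and the timing log-probability."""
    def score(s, g):
        return (alpha * s["loglik"] + beta * g["loglik"]
                + gamma * timing_logprob(s["start"] - g["start"]))
    return max(itertools.product(speech_nbest, gesture_nbest),
               key=lambda pair: score(*pair))

speech_nbest = [
    {"words": "angle ACB", "loglik": -9.8, "start": 0.20},   # acoustically best
    {"words": "angle ADB", "loglik": -10.0, "start": 1.30},  # correct hypothesis
]
gesture_nbest = [
    {"label": "point_angle_ADB", "loglik": -5.0, "start": 1.00},
    {"label": "touch_desk", "loglik": -4.5, "start": 3.00},
]
best_s, best_g = rescore(speech_nbest, gesture_nbest)
print(best_s["words"], best_g["label"])  # angle ADB point_angle_ADB
```

Here the timing term rescues the correct speech hypothesis: ‘angle ADB’ starts 0.30 s after the pointing gesture, close to the mean gap, while the acoustically better candidate starts far from any gesture.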
Table: recognition results using an integration of multiple modalities (speech and gesture), recognition rate [%].
The proposed integration method achieved a 3.4 percentage point improvement in speech recognition performance and a 3.7 percentage point improvement in gesture recognition performance. The speech recognition performance of the proposed method was near its upper bound, and its gesture recognition performance reached its upper bound.
There were many speech and gesture recognition errors. By aligning the corresponding words in utterances with gestures using dynamic programming (DP), we were able to reject pairs with low DP scores. Although this strategy was effective in our proposed method, it only aligned speech and gestures in order, and thus its rejection ability was weak. Semantically inconsistent alignments were rejected by setting M(u_i, g_j) = 0 in Equation 6. This was a strong constraint and rejected some incorrect alignments, but because the utterances contained many ambiguous words, such as ‘here’ and ‘this’, which had many possible corresponding gestures, it was not highly effective. The distribution of time differences, however, was an effective constraint on the DP path. The start times of corresponding speech and gesture pairs should not differ greatly, and correspondences were better identified using this strategy than by the ‘nearest matching’ and ‘longest overlapping time’ strategies described in Section 5.1. The distribution in Equation 6 acted as a ‘soft’ path limitation, which may be why this strategy worked so well.
Overall, this is how we obtained the improvements described above, although a simpler framework based only on Equation 6 might have achieved similar performance.
The identification rate is defined as I = (C / C_s) × 100 [%], where I is the identification rate, and C and C_s are the number of utterances with correctly identified referents and the total number of utterances accompanied by gestures, respectively. The identification rate using the integrated recognition results was 91.7%, while the identification rate using only the speech portion of the integrated recognition results was 20.0%; thus, a 71.7 percentage point improvement was achieved through integration.
In this paper, we introduced an integrative recognition method that uses accompanying speech to recognize gestures. First, we proposed a probability density of the differences between the starting times of speech and the corresponding gestures to align the two modalities. Then, we incorporated this probability into an integrative recognition method, which scored sequenced pairs of utterances and gestures using dynamic programming. This multimodal recognition method achieved more than 3 percentage points of improvement in both speech and gesture recognition.
Note that our method could also possibly be used with other types of multimodalities, although currently, this method is specialized to the task which we have selected. A speaker-dependent, large-vocabulary, continuous speech recognizer could be used without any specific training, but a task-specific gesture recognizer would need to be constructed because there are no universal primitive units for gesture recognition corresponding to the phonemes and syllables used for speech recognition. The correspondence between modalities should also be defined for the task a priori. Even so, we believe that we can apply our method to any task which meets the following conditions: each of the modalities can be recognized using methods such as HMMs, the relationship between modalities can be described by constraint rules, and the timing difference between modalities can be described as a probability density. The larger the task becomes, the more difficult it is to construct such a framework, but once this is achieved, our proposed method can be applied. Application of this method to larger scale tasks is one of our future goals.
Although so far, we have only used N-best lists as intermediate expressions for our integrative method, other expressions with less information loss could also be used, such as word graphs or HMM trellises.
[a] Another reason to use HMMs is that the score obtained from an HMM is based on a probability, and thus the integration explained in Section 5 becomes theoretically sound.
[b] Of course, other parametric discrete/continuous distributions could be used, and one of them might achieve better performance, but pursuing such distributions is a task for future work.
[c] The N values for utterances and gestures can differ. In this paper, however, we used the same N (20) for both, as described in Section 6. This was decided through preliminary experiments.
[d] We used K-fold cross validation because we tested the HMM parameters for gesture recognition under an open-data condition. This setting differs from that used for speech recognition, in which we prepared training and test data separately. Under both conditions, however, no data were used for both training and testing, and thus the difference in the experimental setup for gestures and speech did not affect the results.
- Oviatt S, Coulston R, Lundsford R: When do we interact multimodally? Cognitive load and multimodal communication patterns. In Proceedings of ICMI. New York: ACM; 2004:129-136.
- Dumas B, Signer B, Lalanne D: Fusion in multimodal interactive systems: an HMM-based algorithm for user-induced adaptation. In Proceedings of 4th ACM SIGCHI Symposium on Engineering Interactive Computing Systems. New York: ACM; 2012:15-24.
- Kettebekov S, Yeasin M, Sharma R: Prosody based co-analysis for continuous recognition of co-verbal gestures. In Proceedings of ICME. Washington DC: IEEE Computer Society; 2002:161-166.
- Fukumoto M, Suenaga Y, Mase K: Finger-pointer: pointing interface by image processing. ACM Comput. Graph 1994, 18(5):633-642. doi:10.1016/0097-8493(94)90157-0
- Bolt RA: Put-that-there: voice and gesture at the graphics interface. ACM Comput. Graph 1980, 14(3):262-270. doi:10.1145/965105.807503
- Hui P, Meng H: Joint interpretation of input speech and pen gestures for multimodal human computer interaction. In INTERSPEECH. Pittsburgh: ISCA; 2006:1197-1200.
- Qu S, Chai JY: Salience modeling based on non-verbal modalities for spoken language understanding. In Proceedings of ICMI. New York: ACM; 2006:193-200.
- Krahnstoever N, Kettebekov S, Yeasin M, Sharma R: A real-time framework for natural multimodal interaction with large screen displays. In Proceedings of ICMI. Piscataway: IEEE; 2002:349-354.
- Lalanne D, Nigay L, Palanque P, Robinson P, Vanderdonckt J, Ladry J: Fusion engines for multimodal input: a survey. In Proceedings of ICMI-MLMI. New York: ACM; 2009:153-160.
- Chai J, Hong P, Zhou M, Prasov Z: Optimization in multimodal interpretation. In Proceedings of ACL. Stroudsburg: Association for Computational Linguistics; 2004:1-8.
- Johnston M: Finite-state multimodal parsing and understanding. In Proceedings of COLING. Stroudsburg: Association for Computational Linguistics; 2000:369-375.
- Johnston M: Unification-based multimodal parsing. In Proceedings of COLING-ACL. Stroudsburg: Association for Computational Linguistics; 1998:624-630.
- Wu L, Oviatt L, Cohen PR: Multimodal integration - a statistical view. Trans. Multimedia 1999, 1(4):334-341. doi:10.1109/6046.807953
- Ban H, Miyajima C, Itou K, Takeda K, Itakura F: Speech recognition using synchronization between speech and finger tapping. In Proceedings of ICSLP. Pittsburgh: ISCA; 2004:943-946.
- Shinoda K, Watanabe Y, Iwata K, Liang Y, Nakagawa R, Furui S: Semi-synchronous speech and pen input for mobile user interfaces. Speech Commun 2011, 53(3):283-291. doi:10.1016/j.specom.2010.10.001
- Miki M, Miyajima C, Nishino T, Kitaoka N, Takeda K: An integrative recognition method for speech and gestures. In Proceedings of ICMI. New York: ACM; 2008:93-96.
- Lee A, Kawahara T, Shikano K: Julius — an open source real-time large vocabulary recognition engine. In Proceedings of EUROSPEECH. Aalborg: ISCA; 2001:1691-1694.
- Maekawa K: Corpus of spontaneous Japanese: its design and evaluation. In Proceedings of SSPR. Tokyo: ISCA and IEEE; 2003:7-12.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.