Robust dialogue act detection based on partial sentence tree, derivation rule, and spectral clustering algorithm

Chen, Chia-Ping; Wu, Chung-Hsien; Liang, Wei-Bin

doi:10.1186/1687-4722-2012-13

Research
Open access
Published: 03 March 2012

Chia-Ping Chen¹,
Chung-Hsien Wu² &
Wei-Bin Liang²

EURASIP Journal on Audio, Speech, and Music Processing volume 2012, Article number: 13 (2012)

Abstract

A novel approach for robust dialogue act detection in a spoken dialogue system is proposed. A shallow representation called the partial sentence tree is employed to represent automatic speech recognition outputs. Parsing results of partial sentences can be decomposed into derivation rules, which turn out to be salient features for dialogue act detection. Data-driven dialogue acts are learned via an unsupervised learning algorithm called spectral clustering, in a vector space whose axes correspond to derivation rules. The proposed method is evaluated in a Mandarin spoken dialogue system for tourist-information services. Combined with information obtained from the automatic speech recognition module and from a Markov model on the dialogue act sequence, the proposed method achieves a detection accuracy of 85.1%, which is significantly better than the baseline performance of 62.3% using a naïve Bayes classifier. Furthermore, the average number of turns per dialogue session also decreases significantly with the improved detection accuracy.

1 Introduction

Spoken dialogue systems (SDS) are computer systems with which a user interacts through natural speech [1]. Services based on SDS have been deployed in a wide range of domains, from simple goal-oriented applications, such as DARPA Airline Travel Information System project for flight information [2], AT&T "How May I Help You?" for call routing [3], and systems for trip planning [4–6], to complex conversational applications, such as chatbot A.L.I.C.E. [7] and a variety of conversational agents using avatars [8].

The designer of an SDS often faces the following critical issues. First, with noisy speech or spontaneous speech with disfluency [9, 10], abundant errors made by automatic speech recognition (ASR) can lead to misunderstanding or even premature termination of a dialogue session (i.e., task failure). Second, the spoken language understanding (SLU) unit is often very expensive to develop, due to the manual annotation of certain features for semantic content. Examples of semantic features are part-of-speech tags [11], semantic roles [12, 13], prosodic features [14], and keywords [15]. Third, the dialogue manager (DM) requires a sound dialogue strategy for management based on the state of a dialogue. Such a strategy could be quite complex in order to deal with all sorts of uncertainty, such as errors in ASR.

A dialogue act (DA) describes the purpose or effect of an utterance in a dialogue [16, 17]. It is a succinct representation of the current intention of the speaker. In principle, an utterance can convey multiple DAs. DAs are closely related to speech acts (SA) [18], but they are specialized to dialogue systems [19]. While SAs are generic, DAs often vary from SDS to SDS. Since we are building an SDS, the notion of DA is more appropriate than SA to our study.

In this article, we describe an SDS with robust DA detection. Knowledge sources exploited include ASR confidence, semantic representation of ASR output, and the history of DA. First, the detrimental effects caused by ASR errors are abated by using partial sentence trees. Second, an unsupervised learning approach can determine data-driven DAs automatically, reducing annotation costs. Third, when DA can be reliably detected, the complexity of DM strategy can be significantly reduced. The motivation for focusing on robust DA detection is that the issues with ASR error, SLU cost, and DM complexity can be greatly alleviated.

A wealth of methods for DA detection have been introduced in the literature. The simplest and strongest assumption, "memorylessness", is that the current DA is independent of the past. Thus, DA detection is based on a set of features derived from the current utterance. In this case, classification-based methods have been studied, including support vector machines (SVM) [12, 20], naïve Bayes classifiers (NBC) [21–23], and multi-layer perceptrons (MLP) [14, 21]. When the memorylessness assumption is relaxed, the dependence between past and current DAs has been modeled by n-grams [24, 25], hidden Markov models [26, 27], and Bayesian networks [28]. Recently, methods based on weighted finite state transducers (WFST) [4, 29–31] or partially observable Markov decision processes (POMDP) [32–34] have been studied for DM.

Our method for DA detection is completely different. First, DAs are data-driven: utterances are clustered by the spectral clustering algorithm, and each cluster is identified as a DA. The clustering happens in a space defined by derivation rules (DR). Second, classification of DA for unseen utterances is based on a novel derivation rule-dialogue act (DRDA) matrix, which is created by counting the occurrences of each DR in each utterance cluster. As a result, a column in the DRDA matrix represents a DA in the vector space spanned by DRs. As an example, in our system, the utterance How can I go to Anping-Fort by car? is mapped to DA-33 (Car_Destination, as listed in Table 1), and the system takes an action which leads to the generation of the system response "The suggested line is that... ".

Table 1 List of dialogue acts

The rest of this article is organized as follows. The basic framework of SDS is introduced in Section 2. The proposed robust DA detection method is stated in Section 3. Details of the implementation are described in Section 4. Experiments and discussion on the results are presented in Section 5. Lastly, concluding remarks are given in Section 6.

2 Spoken dialogue system

A dialogue session between a user and a statistical SDS consists of a chain of interleaving user turns and system turns, as illustrated in Figure 1. ASR outputs a string of words (or an N-best list) W based on utterance U. SLU parses W and outputs a semantic representation. DM updates the belief on dialogue states, and accordingly decides the system's action based on a policy. Natural language generation (NLG) converts the system's action to a surface representation in textual form, which is passed to the text-to-speech (TTS) module for speech waveform generation. The cycle repeats when the user responds with the next utterance.

The ASR module turns the user's utterance into word hypotheses. A telephone-based SDS inevitably needs to deal with noisy speech and spontaneous speech, rendering the job of the ASR module difficult. Furthermore, errors made by ASR may propagate through the system, making the jobs of other modules difficult. As a result, ASR accuracy is critical to the performance of an SDS.

The SLU module, as depicted in Figure 2, converts ASR output into a semantic representation. In the proposed system, the ASR output is first converted to a partial sentence tree (PST) [35] in the PST Construction block. The basic idea of PST is to replace unreliable word hypotheses by fillers. As a result, PST is less vulnerable to recognition errors. From PST, partial sentences are formed and parsed. The parse results contain derivation rules (DR), which are extracted in the DR Generation block. The NEC (named entity class) inventory is referenced and certain words are replaced by word classes.

The core of an SDS is the dialogue manager. DM adopts a sound strategy to keep dialogue sessions alive until they are successfully finished. An optimal action is taken at each turn based on the dialogue state, including the user's goal, the user's DA, and the dialogue history. To cope with uncertainty, a belief on the states can be maintained, and the policy for taking action can be based on the belief.

3 Dialogue act detection

To infer the dialogue act, a statistical model involving DA is required. The model assumption of the generation process for the user's utterance is described as follows. Based on the user's goal and the dialogue history, a user decides on a DA, converts it into words, and produces an utterance. This is depicted in Figure 3. Note that each variable in the figure is indexed by turn t. However, to keep the notation and graph from being cluttered, we drop the subscript t. It is not difficult to see that the critical evidence to infer the current dialogue act should depend on ASR output, lexical items, and dialogue history. Thus, we can write

A_{u}^{*} = arg max_{A_{u} \in Ω} max_{W} f (W, U) g (A_{u}, W) h (A_{u}, H),

(1)

where Ω = {A¹, . . ., A^q} is the set of DAs. In (1), f (W, U) is called the ASR score, g(A_u, W) is called the lexical score, and h(A_u, H) is called the history score.

These scores are related to conditional probability functions. For the ASR score, we use the acoustic model and the language model in the ASR system. Specifically,

f (W, U) \approx p_{AM} (U | W) P_{LM}^{α} (W),

(2)

where p_AM(·) is the acoustic model probability, P_LM(·) is the language model probability, and α is the language model scale factor. For the history score, a back-off bi-gram model for DA sequence is used [4, 30, 31]. That is,

h (A_{u}, H) \approx \Pr (A_{t} = A_{u} | A_{t - 1}) .

(3)

Essentially, equation (3) models DA sequence as a Markov chain. We assume that the current user's DA depends on the history only through the previous user's DA. For the lexical score, a novel measure is proposed and the details are described in the following section.
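The history score in equation (3) can be sketched as a bi-gram model estimated from DA sequences. The sketch below is only illustrative: the function name `train_bigram`, the toy sessions, and the use of additive smoothing in place of the back-off scheme mentioned above are all assumptions, since the back-off details are not given here.

```python
from collections import Counter

def train_bigram(sessions, num_acts, smoothing=1.0):
    """Estimate Pr(A_t = cur | A_{t-1} = prev) from DA index sequences.

    Additive smoothing stands in for back-off here (an assumption).
    """
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sessions:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1

    def history_score(cur, prev):
        numerator = pair_counts[(prev, cur)] + smoothing
        denominator = prev_counts[prev] + smoothing * num_acts
        return numerator / denominator

    return history_score

# Hypothetical DA sequences of three dialogue sessions (DA indices 0..2).
sessions = [[0, 1, 2], [0, 1, 1, 2], [0, 2]]
h = train_bigram(sessions, num_acts=3)
row_total = sum(h(cur, 0) for cur in range(3))  # scores given prev=0 sum to 1
```

With smoothing, a DA pair never seen in training still receives a small non-zero score, which plays a role similar to back-off for unseen transitions.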

4 Method for lexical score

One main contribution of this research is to demonstrate that a novel method for estimating the lexical score g(A_u, W) works quite well. The proposed method incorporates several steps, including partial sentence tree construction, derivation rule extraction, utterance representation in a vector space, dialogue act set generation via spectral clustering, dialogue act representation using relative frequency weighted by normalized entropy, and finally a cosine similarity measure between dialogue act and utterance. At the risk of being tedious, we describe the details of these steps in the following sections in order to make the overall procedure clear.

4.1 Construction of partial sentence tree

In an SDS, it is often beneficial to partition the vocabulary into a set of keywords $K$ and a set of non-keywords $Q$ . Each word $w \in K$ should be quite indicative of DA. Using $K$ and $Q$ , the set of sentences with at least one keyword can be represented as

S = Q^{*} {(K Q^{*})}^{+}

(4)

where $A^{*}$ is the Kleene star (a.k.a. Kleene closure) of $A$ , and $A^{+}$ is the Kleene plus of $A$ .

Given a sentence $s \in S$ , a partial sentence (PS) of s contains all keywords in s, while replacing some non-keywords in s by tokens called Filler. For a sentence with n non-keywords, there are 2ⁿ PSs. These PSs can be compiled in a tree called the partial sentence tree (PST). A path in the PST from the root to a leaf corresponds to a PS. The PST of sentence s is henceforth denoted by $T_{s}$ . For example, Figure 4 gives the PST for the sentence

s : Where is the Anping-Fort

(5)

In this example, Where and Anping-Fort are keywords, while is and the are non-keywords. The 2² = 4 PSs are embedded in the PST.

PST is a robust representation of ASR output. That is, even if some words are not recognized correctly, the semantics of an utterance can still be conveyed with the recognized keywords.
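The enumeration of partial sentences described above can be sketched as follows. This is a minimal illustration that lists the 2ⁿ partial sentences as a flat list rather than building the tree itself; the function name `partial_sentences` is an assumption introduced for this sketch.

```python
from itertools import product

FILLER = "Filler"

def partial_sentences(words, keywords):
    """Return every partial sentence of `words`.

    Keywords are always kept; each non-keyword is either kept
    or replaced by the Filler token.
    """
    non_keyword_positions = [i for i, w in enumerate(words) if w not in keywords]
    results = []
    for mask in product([False, True], repeat=len(non_keyword_positions)):
        ps = list(words)
        for position, replace in zip(non_keyword_positions, mask):
            if replace:
                ps[position] = FILLER
        results.append(tuple(ps))
    return results

sentence = ["Where", "is", "the", "Anping-Fort"]
ps_list = partial_sentences(sentence, {"Where", "Anping-Fort"})
# Two non-keywords ("is", "the") give 2**2 = 4 partial sentences.
```

Sharing common prefixes among these sequences would yield the tree form, in which each root-to-leaf path is one partial sentence.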

In the actual implementation, the ASR output is post-processed before PST construction. First, a word hypothesis, say w, is replaced by a Filler if the z-score [36] is below a threshold

z (w) = \frac{f (w) - μ}{σ},

(6)

where f (w) is the recognition probability for word w, μ is the mean and σ² is the variance computed from all samples. In addition, recognized keywords are replaced by the named entity classes (NEC) or the greeting/ending classes, to have a compact representation.
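The post-processing step in equation (6) can be sketched as below. The function name `filter_by_zscore`, the recognition probabilities, and the values of μ, σ, and the threshold are all illustrative assumptions, not figures from the system.

```python
def filter_by_zscore(hypotheses, mu, sigma, threshold):
    """Replace word hypotheses whose z-score is below `threshold` by Filler.

    hypotheses: list of (word, recognition_probability) pairs.
    mu, sigma: mean and standard deviation estimated from all samples.
    """
    filtered = []
    for word, score in hypotheses:
        z = (score - mu) / sigma
        filtered.append(word if z >= threshold else "Filler")
    return filtered

# Hypothetical one-best ASR output with per-word recognition probabilities.
hypotheses = [("Where", 0.9), ("is", 0.2), ("Anping-Fort", 0.8)]
kept = filter_by_zscore(hypotheses, mu=0.6, sigma=0.2, threshold=-1.0)
# "is" has z-score of about -2 and is replaced by Filler.
```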

4.2 Extraction of derivation rules

After PST construction, each PS in the PST is parsed by the Stanford parser (S-parser) [11]. Let the grammar of the S-parser be denoted as a 5-tuple [37]

G = (V, Σ, P, S, D),

(7)

where $V$ is the set of variables, Σ is the set of terminals, $P$ is the set of production rules, S is the sentence symbol, and D is a function defined on $P$ for rule probability. In our implementation, a derivation rule (DR) is defined to be a derivation of the form

A \to B \to w,

(8)

where $A, B \in V$ and w ∈ Σ. Note that equation (8) is a lexicalized rule. For illustration, parse results of the partial sentences are shown in Table 2. One can see that a lexical word in a PS produces a DR. Given a text corpus, a set of DRs $R = \{R^{1}, R^{2}, \dots, R^{l}\}$ can be extracted and compacted.

Table 2 Examples of the parse result (left) and the extracted derivation rules (right) corresponding to the four partial sentences in Figure 4

The motivation for using DR is to exploit the part-of-speech (POS) information. In particular, POS tags help to disambiguate noun-verb homonyms that occur quite often in Chinese.
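The rule form in equation (8) can be illustrated by walking a parse tree and emitting one (A, B, w) triple per word. In this sketch the nested-tuple tree encoding, the function name `derivation_rules`, and the English Penn Treebank-style labels are assumptions for illustration; the actual parser output and tag set may differ.

```python
def derivation_rules(tree, parent_label=None):
    """Collect lexicalized rules A -> B -> w from a parse tree.

    tree: (label, child, ...) where a pre-terminal node is (tag, word).
    Returns a list of (A, B, w) triples, one per word.
    """
    label, children = tree[0], tree[1:]
    # A pre-terminal node has exactly one child, and that child is a word.
    if len(children) == 1 and isinstance(children[0], str):
        return [(parent_label, label, children[0])]
    rules = []
    for child in children:
        if not isinstance(child, str):
            rules.extend(derivation_rules(child, label))
    return rules

# Hypothetical parse of "Where is the Anping-Fort".
tree = ("SBARQ",
        ("WHADVP", ("WRB", "Where")),
        ("SQ", ("VBZ", "is"),
               ("NP", ("DT", "the"), ("NN", "Anping-Fort"))))
rules = derivation_rules(tree)
```

Each triple keeps the POS tag B next to the word w, which is the information used to separate noun and verb readings of the same word.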

4.3 Vector representation of sentences

Using each DR as a feature, we can represent a sentence s as a binary vector v_s , where

v_{s} (i) = \begin{cases} 1, & \text{if } R^{i} \in T_{s}, \\ 0, & \text{otherwise}, \end{cases}

(9)

where $T_{s}$ is the PST for s. For example, the representative vector

v_{s} = {[1 0 1 0]}^{T}

(10)

means that R¹ and R³ are used in $T_{s}$ , and that there are l = 4 derivation rules.
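Equations (9) and (10) amount to a membership test against the DR inventory. A minimal sketch, in which the rule identifiers and the function name `sentence_vector` are placeholders introduced here:

```python
def sentence_vector(rules_in_pst, rule_inventory):
    """Binary vector v_s: entry i is 1 if rule R^i occurs in the PST of s."""
    return [1 if rule in rules_in_pst else 0 for rule in rule_inventory]

inventory = ["R1", "R2", "R3", "R4"]          # l = 4 derivation rules
v_s = sentence_vector({"R1", "R3"}, inventory)  # -> [1, 0, 1, 0]
```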

4.4 Generation of dialogue acts

We use a set of data-driven DAs to save the prohibitive cost of manual annotation. We apply the recently proposed spectral clustering algorithm [38] to cluster utterances in the training set. The spectral clustering algorithm is chosen because a conventional clustering algorithm (e.g., k-means) is often sensitive to the selection of initial centroids. After clustering, each cluster found is identified as a DA.

Our implementation of spectral clustering is outlined as follows. Suppose there are n utterances in the training set

D = {s_{1}, s_{2}, \dots, s_{n}} .

(11)

Each utterance is represented by a vector according to equation (9). From $D$ , we construct an n × n similarity matrix M, where the similarity M_kk' between two utterances s_k and s_k' is defined as the cosine measure between $v_{s_{k}}$ and $v_{s_{k'}}$ . The normalized Laplacian matrix of M is defined as

L ≜ I - D^{- \frac{1}{2}} M D^{- \frac{1}{2}},

(12)

where D is a diagonal matrix (the degree matrix of M, not to be confused with the training set in equation (11)) with entries

D_{k k'} = δ_{k k'} \sum_{j = 1}^{n} M_{k j} .

(13)

We find the eigenvectors corresponding to the q smallest eigenvalues of L. Note that the eigenvectors can be made orthonormal since L is real-symmetric. We put these eigenvectors as columns in an n × q matrix Q with orthonormal columns, and cluster its row vectors into q clusters. Each cluster is identified as a data-driven DA.
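The construction in equations (12) and (13) can be sketched in plain Python as below. The sketch stops at the Laplacian; the eigen-decomposition and the clustering of rows would normally be delegated to a numerical library and are not shown. Function names and the toy vectors are assumptions, and every utterance vector is assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine measure between two vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def normalized_laplacian(vectors):
    """Return (M, degrees, L) with L = I - D^(-1/2) M D^(-1/2)."""
    n = len(vectors)
    M = [[cosine(vectors[k], vectors[j]) for j in range(n)] for k in range(n)]
    degrees = [sum(row) for row in M]  # diagonal entries of D
    L = [[(1.0 if k == j else 0.0) - M[k][j] / math.sqrt(degrees[k] * degrees[j])
          for j in range(n)] for k in range(n)]
    return M, degrees, L

# Four hypothetical utterances over l = 4 derivation rules.
vectors = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1], [0, 0, 0, 1]]
M, degrees, L = normalized_laplacian(vectors)
# Sanity check: D^(1/2) 1 is an eigenvector of L with eigenvalue 0.
residual = [sum(L[k][j] * math.sqrt(degrees[j]) for j in range(len(L)))
            for k in range(len(L))]
```

The zero eigenvalue checked at the end always exists for a normalized Laplacian, which is why the smallest eigenvalues carry the cluster structure.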

On a theoretical side, consider the conversion of M into a binary-valued matrix $\hat{M}$ via a threshold τ, i.e.,

{\hat{M}}_{k k'} = \begin{cases} 1, & M_{k k'} \geq τ, \\ 0, & \text{otherwise}. \end{cases}

(14)

$\hat{M}$ can be regarded as the adjacency matrix of a graph $G = (N, E)$ , where node set $N$ corresponds to $D$ , and edge set E corresponds to the non-zero entries in $\hat{M}$ . It can be shown [38] that the multiplicity of the eigenvalue 0 for $\hat{L}$ , the normalized Laplacian matrix of $\hat{M}$ , equals the number of disjoint connected components in G, which can be identified as clusters in $D$ .
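The graph view can be made concrete by counting connected components directly. The sketch below treats a pair of utterances as connected when their similarity reaches τ, which is an assumption about the intended direction of the threshold; the function name and the toy similarity matrix are also illustrative.

```python
def connected_components(M, tau):
    """Connected components of the graph whose edges are pairs (k, j)
    with similarity M[k][j] >= tau (self-loops ignored)."""
    n = len(M)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [start]
        component = []
        while stack:  # depth-first traversal from `start`
            k = stack.pop()
            component.append(k)
            for j in range(n):
                if j != k and not seen[j] and M[k][j] >= tau:
                    seen[j] = True
                    stack.append(j)
        components.append(sorted(component))
    return components

# Hypothetical similarity matrix with two groups of similar utterances.
M_toy = [[1.0, 0.7, 0.0, 0.0],
         [0.7, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.7],
         [0.0, 0.0, 0.7, 1.0]]
components = connected_components(M_toy, tau=0.5)  # -> [[0, 1], [2, 3]]
```

In this idealized case the number of components (two) matches the multiplicity of the eigenvalue 0 of the corresponding normalized Laplacian.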

4.5 Derivation rule-dialogue act matrix

A cluster of utterances found via spectral clustering algorithm is identified as a DA. In our implementation, we use an entropy-based representation for DA. The representation of DA is described as follows. Let n_ij be the accumulated count that DR Rⁱ occurs in the utterance cluster of A^j . From n_ij , a probability function of DA conditional on DR is defined as follows

γ_{i j} = \hat{P} (DA = A^{j} | DR = R^{i}) ≜ \frac{n_{i j}}{\sum_{j' = 1}^{q} n_{i j'}}, i = 1, \dots, l, j = 1, \dots, q .

(15)

The normalized entropy for the probability conditional on DR Rⁱ is

ε_{i} = - \frac{1}{log q} \sum_{j = 1}^{q} γ_{i j} log γ_{i j}, i = 1, \dots, l .

(16)

Note that 0 ≤ ε_i ≤ 1, and a DR Rⁱ with a lower ε_i is more discriminative for DA. From equations (15) and (16), a matrix Γ of size l × q can be constructed with entries

Γ_{i j} = (1 - ε_{i}) γ_{i j} .

(17)

We call Γ the derivation rule-dialogue act (DRDA) matrix. The j^th column in Γ is a vector representation for a DA A^j in the vector space spanned by DRs.
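Equations (15) to (17) can be traced with a small count table. In this sketch the function name `drda_matrix` and the counts are made up for illustration, and terms with zero probability are taken to contribute zero to the entropy (the usual convention, assumed here).

```python
import math

def drda_matrix(counts):
    """Build the DRDA matrix from counts[i][j] = n_ij.

    Rows index derivation rules, columns index dialogue acts (q >= 2).
    """
    q = len(counts[0])
    matrix = []
    for row in counts:
        total = sum(row)
        gamma = [c / total if total > 0 else 0.0 for c in row]      # eq. (15)
        entropy = -sum(g * math.log(g) for g in gamma if g > 0) / math.log(q)  # eq. (16)
        matrix.append([(1.0 - entropy) * g for g in gamma])         # eq. (17)
    return matrix

# Hypothetical counts for l = 3 rules and q = 2 dialogue acts.
counts = [[4, 0],   # rule seen only in DA 1: entropy 0, full weight
          [2, 2],   # rule spread evenly: entropy 1, weight 0
          [1, 3]]   # in between
Gamma = drda_matrix(counts)
```

The second row shows the effect of the weighting: a rule that is equally frequent in every DA carries no evidence and is zeroed out.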

4.6 Similarity between utterance and dialogue act

In our implementation, the lexical score g(A_u , W) in equation (1) is decomposed into two terms

g (A_{u}, W) \approx g_{R} (A_{u}, s) g_{N} (A_{u}, W),

(18)

where g_R (A_u, s) is called the DR score and g_N (A_u, W) is called the named entity score. For the DR score, the following similarity measure is used

g_{R} (A_{u} = A^{j}, s) = max_{σ \in T_{s}} \frac{b_{σ}^{T} a_{j}}{| b_{σ} | | a_{j} |},

(19)

where b_σ is the vector representation for PS σ in $T_{s}$ , and a_j is the vector representation for DA A^j (i.e., column j in DRDA matrix Γ). For named entity score, we use the naïve Bayes approximation

g_{N} (A_{u} = A^{j}, W) = \prod_{α \in W} ν (A^{j}, α)

(20)

where α is a named entity. Note that ν(A^j, α) is estimated from a training corpus by the relative frequency of α occurring in A^j .
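Equations (18) to (20) can be combined in a few lines. This is a sketch under stated assumptions: the function names, the toy vectors, and the relative-frequency table are illustrative, and an unseen named entity is given a score of zero here, whereas a deployed system would likely need some smoothing.

```python
import math

def cosine(u, v):
    """Cosine measure between two vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def dr_score(ps_vectors, da_vector):
    """Eq. (19): best cosine match between any partial sentence and the DA."""
    return max(cosine(b, da_vector) for b in ps_vectors)

def ne_score(named_entities, nu_for_da):
    """Eq. (20): product of relative frequencies of the named entities."""
    score = 1.0
    for entity in named_entities:
        score *= nu_for_da.get(entity, 0.0)
    return score

def lexical_score(ps_vectors, da_vector, named_entities, nu_for_da):
    """Eq. (18): product of the DR score and the named entity score."""
    return dr_score(ps_vectors, da_vector) * ne_score(named_entities, nu_for_da)

ps_vectors = [[1, 0, 1], [1, 0, 0]]   # two partial sentences of one utterance
a_j = [1.0, 0.0, 0.0]                 # hypothetical column j of the DRDA matrix
nu_j = {"Spot": 0.5}                  # hypothetical relative frequencies for A^j
g = lexical_score(ps_vectors, a_j, ["Spot"], nu_j)
```

Taking the maximum over partial sentences lets the best-matching path in the PST decide the score, so a misrecognized non-keyword that was replaced by a Filler does not lower it.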

5 Experiments and discussion

We evaluate the proposed method for dialogue act detection on an SDS for Tainan city tourist-information services.

5.1 Data collection

We adopt the setup utilized in [4, 6] to collect the dialogue speech data. The data collection setup is shown in Figure 5, and an example from the collected dialogue data is shown in Table 3. An operator plays the role of the SDS, which helps users to plan trips in Tainan. Twenty-six male and eleven female subjects play the role of users. For our prototype system, users are asked to use utterances with a single DA. Dialogue speech data is recorded in a lab environment, using a 16,000-Hz sampling rate and 16-bit PCM format. There are 294 dialogues.

Table 3 The beginning part of a collected dialogue

Two types of speech data are collected. The first type, called S-data, is from the operator playing the role of the SDS. S-data contains travel information collected from on-line resources, such as Wikipedia and Google Maps. The S-data set consists of 2,653 utterances, with 317 different words. The second type, called U-data, is from subjects playing the role of users. U-data consists of 2,636 utterances, with 297 different words. The vocabulary size is small as we have a domain-specific task. From U-data, 87 keywords corresponding to 28 named entity classes/semantic classes and 796 derivation rules are obtained from the S-parser. Examples of the selected NECs and semantic classes are given in Table 4. The collected data contains sightseeing information, queries for the time schedules of two railway systems (Taiwan Railways Administration (TRA) and Taiwan High Speed Rail (THSR)), and greeting/ending words in dialogues.

Table 4 Examples of named entity classes (NEC) and semantic classes

We use fivefold cross-validation for system development. That is, the data is divided into five parts. In a round-robin fashion, four parts are used as training data, and one part is used as test data. We develop our system such that the average accuracy of DA detection over the five test sets is optimal.
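The round-robin procedure can be sketched as follows. How the data was actually partitioned is not stated, so the striding used to form the five parts, the function name `five_fold_splits`, and the toy item list are assumptions.

```python
def five_fold_splits(items, k=5):
    """Yield (train, test) pairs; each of the k parts is the test set once."""
    folds = [items[i::k] for i in range(k)]  # part i takes every k-th item
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

utterance_ids = list(range(10))  # hypothetical utterance identifiers
splits = list(five_fold_splits(utterance_ids))
# The reported figure would be the mean test accuracy over the five rounds.
```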

5.2 ASR module

The ASR module is an HTK-based Mandarin speech recognizer [39]. A syllable in Mandarin is modeled as the concatenation of an initial model and a final model. The acoustic model set includes 115 right-context-dependent initial models, 38 context-independent final models, 37 particle models (e.g., EN, MA, OU), 47 syllable-level models for hyper-articulated speech, and 14 filler models (e.g., short pause, breathing, and footfall). Each initial model is a three-state HMM, while each final model is a five-state HMM. The observation probability density of a state is a Gaussian mixture model (GMM) with no more than 32 components. The speech feature vector is composed of 39 components, including 12 mel-frequency cepstral coefficients (MFCCs), log energy, and the velocity and acceleration features. For real-world data with a variety of speakers, a reliable acoustic model is needed. Thus, an acoustic model set trained on the TCC-300 Mandarin corpus is adapted with the collected dialogue speech data via maximum-likelihood linear regression (MLLR). The lexicon contains 297 words. The bi-gram language model is estimated with the SRILM toolkit [40].

Table 5 shows the performance of ASR module with clean and simulated noisy speech. Note that the real-world scenario of noise corruption is applied in the collection of the noisy speech (footfall noise, human speech, or both). That is, a speaker stands in front of a microphone and the noise is played behind the speaker. From the results, we can see that the recognition accuracy does not severely degrade in the presence of noises behind a user.

Table 5 Word accuracy rates of automatic speech recognition in clean and three noisy conditions

5.3 The z-score threshold

Ideally, an effective threshold for z-score strikes a good balance between reliable recognition and keeping sufficient keywords for subsequent semantic representation in SLU. We analyze the rejected word hypotheses with z-scores below the threshold of 2 (which corresponds to the confidence level of 0.95), and find that only 3.4% of the keywords are incorrectly rejected. The threshold of 2 is therefore used. Such performance can be attributed to the fact that users often pause naturally before or after a keyword.

5.4 The number of dialogue acts

While generic speech acts are relatively well-defined, DAs are often specialized to particular domains and they need to be specified. In this research, since we adopt data-driven DAs by clustering, the number of DAs (clusters) becomes a critical parameter in system design. In order to decide this number, we investigate the system performance when it is varied. The detection accuracies are shown in Table 6. We can see that 38 DAs (q = 38) achieve the best performance (see Endnote a). Therefore, we use 38 DAs. For readability, each cluster is given an artificial but meaningful label (tag, name), as shown in Table 1. For example, Query_Introduction_Spot is assigned to the cluster formed by "queries of introduction to sightseeing spot".

Table 6 Accuracy rates of dialogue act detection with various numbers of DAs

5.5 Evaluation of feature sets

Just like the set of DRs, an alternative set of semantic features can be used as the coordinate axes to construct the corresponding vector set for $D$ . Applying spectral clustering, a matrix analogous to the DRDA matrix can be constructed according to the steps described in Section 4.

Including the proposed DR, 5 sets of features are investigated. In baseline, keywords are used as features. In NEC, named entity classes are used. In PS, partial sentences are used. In uwDR, derivation rules without normalized entropy weighting are used.

DA detection accuracies are summarized in Table 7. In this table, the column 40%-SIM means 40% of the words in the reference transcripts are retained, and similarly for 60%-SIM and 100%-REF. The recognition accuracy of ASR is 84.8% (15.2% word error rate), so we have a column of 84.8%-ASR. The middle columns are the results with simulated noisy speech, corrupted by footfall noise (footfall), human speech (human), and both noises (both).

Table 7 Accuracy rates of dialogue act detection with various feature sets in various noisy conditions

In the case of 84.8%-ASR, we can see that NEC (56.8%) is better than baseline (49.6%), and that PS (76.2%) is better than NEC. The incorporation of uwDR (81.6%) and DR (82.9%) leads to further improvements. Thus, the difference between baseline and the proposed DR is large (49.6% versus 82.9%). We notice that an ambiguous Chinese word may correspond to different DAs with its different meanings. For instance, in open door and drive car, the words open and drive are the same word in Chinese. Using DRs helps disambiguation. For the cases of 40%-SIM and 60%-SIM, the results show clear improvement of NEC and PS over the baseline. Using DRs, however, does not yield further improvement in these scenarios as the keywords are randomly discarded. We can see that recognizing the keywords is particularly important in highly adverse acoustic conditions. We also evaluate using the simulated noisy speech data in the SDS. One can observe an interesting result: the performance of DR with the simulated noisy data is very close to that with the clean data. In PS, non-keywords are removed or replaced by Fillers. Thus, most of the partial sentences of simulated noisy speech are almost the same as those obtained from the clean speech.

5.6 Evaluation of the history score

The above results on DA detection are obtained without considering the dependency between DAs. Next, we evaluate the effectiveness of the history score. In order to balance the contribution of the lexical score and the history score, we generalize equation (1) to the following form,

A_{u}^{*} = arg max_{A_{u} \in Ω} {[g (A_{u}, W)]}^{β_{g}} {[h (A_{u}, H)]}^{β_{h}},

(21)

where 0 ≤ β_g ≤ 1 is the weight of the lexical score, and β_h = 1 - β_g is the weight of the history score.

A few comments on using equation (21) are in order. First, we note that when the ASR module outputs only one-best hypothesis, the maximization over W in equation (1) becomes trivial. It follows that the term f (W, U) can be dropped since it does not depend on A_u . In addition, as the values of g(A_u , W) and h(A_u, H) are in different ranges, simple linear combination may not work as one score can easily be dominated by the other. Therefore, we use the linear combination in the log domain, which is equivalent to the product in equation (21). In fact, a similar case based on the same consideration is the language model scale factor commonly used in ASR.
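The weighted combination in equation (21) is conveniently computed in the log domain. In the sketch below the function names, the small floor that guards against log(0), and the toy scores and weights are assumptions made for illustration.

```python
import math

def combined_log_score(g, h, beta_g, floor=1e-12):
    """log of g^beta_g * h^beta_h with beta_h = 1 - beta_g.

    `floor` avoids log(0) for zero scores (an assumption of this sketch).
    """
    beta_h = 1.0 - beta_g
    return beta_g * math.log(max(g, floor)) + beta_h * math.log(max(h, floor))

def detect_da(lexical, history, beta_g):
    """Return the DA that maximizes the combined score of eq. (21)."""
    return max(lexical, key=lambda a: combined_log_score(lexical[a], history[a], beta_g))

# Hypothetical scores for two candidate DAs.
lexical = {"A": 0.9, "B": 0.6}   # g(A_u, W)
history = {"A": 0.1, "B": 0.5}   # h(A_u, H)
lexical_only = detect_da(lexical, history, beta_g=1.0)   # history ignored
with_history = detect_da(lexical, history, beta_g=0.3)   # history weighted 0.7
```

In this toy case the lexical score alone favors "A", while a strong history weight flips the decision to "B", showing how the two weights trade off the evidence sources.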

Table 8 shows the results of different β_h , and the best performance is achieved when β_h = 0.7. The evaluation results demonstrate that the dialogue history is informative.

Table 8 Accuracy rates of dialogue act detection with various history score weights

5.7 Comparison with other methods

The performance of the proposed approach for DA detection is compared with other methods. In the NBC method, the keywords are used as the semantic features, and they are used in calculating DA probabilities. In the co-occurrence (co-oc) method, the Apriori algorithm [41] is used to calculate the co-occurrence of keywords in each DA. In the SVM and maximum entropy (ME) methods, a DA classifier is trained using keywords. In latent semantic analysis (LSA), the keyword-DA matrix is treated as a conventional word-document matrix, and then LSA is applied. The results are listed in Table 9. We can see that the proposed approach achieves the best accuracy.

Table 9 Accuracy rates of dialogue act detection with five feature sets

5.8 Evaluation on end-to-end measure

In addition to DA detection accuracy, we also conduct evaluation on end-to-end measures, i.e., from the start of a session to the end of the session. End-to-end measures are arguably better for performance evaluation as the ultimate goal of an SDS is to enable a user to complete a session correctly and quickly.

Three systems are evaluated: the NBC, the proposed system (proposed), and the proposed system without the history information (no history). Five subjects are recruited. Each subject performs exactly the same task on every system without knowing the order of the systems, which is randomized for each subject. A task is considered completed as soon as a subject acquires the appointed information. Table 10 shows the average number of dialogue turns per task for the evaluated systems. The proposed approach achieves the lowest average number of turns.

Table 10 End-to-end measure of system performance evaluation.

6 Conclusion

In this article, a robust dialogue act detection method using named entity classes, partial sentence trees, derivation rules, and entropy-based dialogue act-derivation rule matrix is investigated. Data-driven dialogue acts are created by the spectral clustering algorithm. Our implementation of a spoken dialogue system for tourist-information services incorporating the proposed method achieves 85.1% detection accuracy, outperforming a naïve Bayes classification based method (62.3%). It also reduces the number of dialogue turns per dialogue session on average. The results show that partial sentence tree and derivation rules are indeed succinct and informative features for dialogue act detection. Furthermore, spectral clustering is a successful method for automatic and unsupervised learning of dialogue acts from in-domain training data.

Endnote

^aQueries to 3 kinds of vehicles - bus, TRA, and THSR, are in different clusters when q = 38, but in the same cluster when q = 36. This partially explains the difference in performance between using 36 DAs and 38 DAs.

References

Fraser N: Handbook of Standards and Resources for Spoken Language Systems. Volume chap. 6. Edited by: Gibbon D, Moore R, Winski R. Mouton de Gruyter, Berlin; 1997:564-564.
Google Scholar
Price PJ: Evaluation of spoken language systems: the ATIS domain. In Proc the workshop on Speech and Natural Language. Hidden Valley, Pennsylvania; 1990:91-95.
Chapter Google Scholar
Gorin A, Riccardi G, Wright JH: How may i help you? Speech Commun 1997, 23: 113-127. 10.1016/S0167-6393(97)00040-X
Article Google Scholar
Hori C, Ohtake K, Misu T, Kashioka H, Nakamura S: Dialog management using weighted finite-state transducers. In Proc INTERSPEECH-2008. Brisbane, Australia; 2008:211-214.
Google Scholar
Liu J, Xu Y, Seneff S, Zue V: CITYBROWSER II: a multimodal restaurant guide in Mandarin. In Proc International Symposium on Chinese Spoken Language Processing. Kunming, China; 2008:1-4.
Google Scholar
Misu T, Kawahara T: Bayes risk-based dialogue management for document retrieval system with speech interface. Speech Commun 2010, 52: 61-71. 10.1016/j.specom.2009.08.007
Article Google Scholar
Wallace R:The Artificial Linguistic Internet Computer Entity (A. L. I. C. E.). 2001. [http://www.alicebot.org]
Google Scholar
Cassell J, Bickmore T, Billinghurst M, Campbell L, Chang K, Vilhjálmsson H, Yan H: Embodiment in conversational interfaces: rea. In Proc the SIGCHI Conference on Human Factors in Computing Systems: the CHI is the limit. Pittsburgh, Pennsylvania; 1999:520-527.
Chapter Google Scholar
Yeh JF, Wu CH: Edit Disfluency detection and correction using a cleanup language model and an alignment model. IEEE Trans Speech Audio Process 2006, 14(5):1574-1583.
Article Google Scholar
Wu CH, Liang WB, Yeh JF: Interruption point detection of spontaneous speech using inter-syllable boundary-based prosodic features. ACM Trans Asian Lang Inf Process 2010. 10, 6:16:21
Google Scholar
Levy R, Manning C: Is it harder to parse Chinese, or the Chinese Treebank? In Proc 41st Annual Meeting on Association for Computational Linguistics (ACL). Sapporo, Japan; 2003:439-446.
Google Scholar
Liu CH, Wu CH: Semantic role labeling with discriminative feature selection for spoken language understanding. In Proc INTERSPEECH. Brighton, United Kingdom; 2009:1043-1046.
Google Scholar
Coppola B, Moschitti A, Riccardi G: Shallow semantic parsing for spoken language understanding. In Proc Annual Conference of the North American Chapter of the Association for Computational Linguistics-Human Language Technologies. Boulder, Colorado; 2009:85-88.
Wright H: Automatic utterance type detection using suprasegmental features. In Proc International Conference on Spoken Language Processing. Volume 4. Sydney, Australia; 1998:1403-1406.
Kawahara T, Lee CH, Juang BH: Flexible speech understanding based on combined key-phrase detection and verification. IEEE Trans Speech Audio Process 1998, 6(6):558-568. 10.1109/89.725322
Bunt H: Context and dialogue control. THINK Quarterly 1994, 3: 19-31.
Prasad R, Walker M: Training a dialogue act tagger for human-human and human-computer travel dialogues. In Proc Annual Meeting of the Association for Computational Linguistics. Volume 2. Philadelphia, Pennsylvania; 2002:162-173.
Austin JL: How to Do Things with Words. Edited by: Urmson JO, Sbisà M. Harvard University Press, Cambridge, MA; 1962.
Stolcke A, Ries K, Coccaro N, Shriberg E, Bates R, Jurafsky D, Taylor P, Martin R: Dialogue act modeling for automatic tagging and recognition of conversational speech. Comput Linguist 2000, 26(3):339-373. 10.1162/089120100561737
Tur G, Hakkani-Tür D, Heck L: What is left to be understood in ATIS. In Proc IEEE Workshop on Spoken Language Technologies. Berkeley, California; 2010:19-24.
Levin L, Langley C, Lavie A, Gates D, Wallace D, Peterson K: Domain specific speech acts for spoken language translation. In Proc the 4th SIGdial Workshop on Discourse and Dialogue. Sapporo, Japan; 2003.
Grau S, Sanchis E, Castro MJ, Vilar D: Dialogue act classification using a Bayesian approach. In Proc Conference on Speech and Computer. St Petersburg; 2004:495-499.
Ivanovic E: Dialogue act tagging for instant messaging chat sessions. In Proc the ACL Student Research Workshop, Association for Computational Linguistics. Ann Arbor, Michigan; 2005:79-84.
Seneff S, Wang C, Hazen TJ: Automatic induction of N-gram language models from a natural language grammar. In Proc EUROSPEECH-2003. Geneva, Switzerland; 2003:641-644.
Hara S, Kitaoka N, Takeda K: Automatic detection of task-incompleted dialog for spoken dialog system based on dialog act N-gram. In Proc INTERSPEECH-2010. Makuhari, Japan; 2010:3034-3037.
Ries K: HMM and neural network based speech act detection. In Proc IEEE International Conference on Acoustics, Speech, and Signal Processing. Volume 1. Phoenix, Arizona; 1999:497-500.
Wu CH, Yan GL: Speech act modeling and verification of spontaneous speech with disfluency in a spoken dialogue system. IEEE Trans Speech Audio Process 2005, 13(3):330-344.
Keizer S, Nijholt A: Dialogue act recognition with Bayesian networks for Dutch dialogues. In Proc the 3rd SIGdial workshop on Discourse and dialogue. Volume 2. Philadelphia, Pennsylvania; 2002:88-94.
Hori C, Ohtake K, Misu T, Kashioka H, Nakamura S: Recent advances in WFST-based dialog system. In Proc INTERSPEECH. Brighton, United Kingdom; 2009:268-271.
Hori C, Ohtake K, Misu T, Kashioka H, Nakamura S: Statistical dialog management applied to WFST-based dialog systems. In Proc IEEE International Conference on Acoustics Speech and Signal Processing. Taipei, Taiwan; 2009:4793-4796.
Ohtake K, Misu T, Hori C, Kashioka H, Nakamura S: Dialogue acts annotation for NICT Kyoto tour dialogue corpus to construct statistical dialogue systems. In Proc LREC2010. Valletta, Malta; 2010:2123-2130.
Williams JD, Young S: Partially observable Markov decision processes for spoken dialog systems. Comput Speech Lang 2007, 21: 393-422. 10.1016/j.csl.2006.06.008
Young S: Still talking to machines (cognitively speaking). In Proc INTERSPEECH2010. Makuhari, Japan; 2010:1-10.
Williams JD, Young S: Scaling POMDPs for spoken dialog management. IEEE Trans Audio Speech Lang Process 2007, 15(7):2116-2129.
Wu CH, Chen YJ: Recovery from false rejection using statistical partial pattern trees for sentence verification. Speech Commun 2004, 43(1-2):71-88. 10.1016/j.specom.2004.02.003
Larsen RJ, Marx ML: An Introduction to Mathematical Statistics and Its Applications. 3rd edition. Prentice Hall, Lebanon, Indiana, USA; 2000. ISBN: 0139223037
Jurafsky D, Martin JH: Speech and Language Processing. 2nd edition. Pearson Prentice Hall, New Jersey; 2009.
von Luxburg U: A tutorial on spectral clustering. Stat Comput 2007, 17(4):395-416. 10.1007/s11222-007-9033-z
Young SJ, Kershaw D, Odell J, Ollason D, Valtchev V, Woodland P: The HTK Book Version 3.4. Cambridge University Press, Cambridge; 2006.
Stolcke A: SRILM - an extensible language modeling toolkit. In Proc International Conference on Spoken Language Processing. Denver, Colorado; 2002:901-904.
Agrawal R, Imielinski T, Swami AN: Mining association rules between sets of items in large databases. In ACM SIGMOD. Washington, D.C.; 1993:207-216.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, National Sun Yat-sen University, 70 Lien-Hai Road, Kaohsiung, Taiwan
Chia-Ping Chen
Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan
Chung-Hsien Wu & Wei-Bin Liang

Corresponding author

Correspondence to Chung-Hsien Wu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Chen, CP., Wu, CH. & Liang, WB. Robust dialogue act detection based on partial sentence tree, derivation rule, and spectral clustering algorithm. J AUDIO SPEECH MUSIC PROC. 2012, 13 (2012). https://doi.org/10.1186/1687-4722-2012-13

Received: 10 December 2011
Accepted: 03 March 2012
Published: 03 March 2012
DOI: https://doi.org/10.1186/1687-4722-2012-13
