Robust dialogue act detection based on partial sentence tree, derivation rule, and spectral clustering algorithm

A novel approach for robust dialogue act detection in a spoken dialogue system is proposed. A shallow representation, the partial sentence tree, is employed to represent automatic speech recognition outputs. Parsing results of partial sentences can be decomposed into derivation rules, which turn out to be salient features for dialogue act detection. Data-driven dialogue acts are learned via an unsupervised learning algorithm called spectral clustering, in a vector space whose axes correspond to derivation rules. The proposed method is evaluated in a Mandarin spoken dialogue system for tourist-information services. Combined with information obtained from the automatic speech recognition module and from a Markov model on dialogue act sequence, the proposed method achieves a detection accuracy of 85.1%, which is significantly better than the baseline performance of 62.3% using a naïve Bayes classifier. Furthermore, the average number of turns per dialogue session also decreases significantly with the improved detection accuracy.


Introduction
Spoken dialogue systems (SDS) are computer systems with which a user interacts through natural speech [1]. Services based on SDS have been deployed in a wide range of domains, from simple goal-oriented applications, such as the DARPA Airline Travel Information System project for flight information [2], AT&T "How May I Help You?" for call routing [3], and systems for trip planning [4][5][6], to complex conversational applications, such as the chatbot A.L.I.C.E. [7] and a variety of conversational agents using avatars [8].
The designer of an SDS often faces the following critical issues. First, with noisy speech or spontaneous speech with disfluency [9,10], abundant errors made by automatic speech recognition (ASR) can lead to misunderstanding or even premature termination of a dialogue session (i.e., task failure). Second, the spoken language understanding (SLU) unit is often very expensive to develop, due to the manual annotation of certain features for semantic content. Examples of semantic features are part-of-speech tags [11], semantic roles [12,13], prosodic features [14], and keywords [15]. Third, the dialogue manager (DM) requires a sound dialogue strategy for management based on the state of a dialogue. Such a strategy could be quite complex in order to deal with all sorts of uncertainty, such as errors in ASR.
A dialogue act (DA) describes the purposes or effects of an utterance in a dialogue [16,17]. A DA is a succinct representation of the current intention of the speaker. In principle, an utterance can convey multiple DAs. DAs are closely related to speech acts (SA) [18], but they are specialized to dialogue systems [19]. While SAs are generic, DAs often vary from SDS to SDS. Since we are building an SDS, the notion of DA is more appropriate than SA to our study.
In this article, we describe an SDS with robust DA detection. Knowledge sources exploited include ASR confidence, semantic representation of ASR output, and the history of DA. First, the detrimental effects caused by ASR errors are abated by using partial sentence trees. Second, an unsupervised learning approach can determine data-driven DAs automatically, reducing annotation costs. Third, when DA can be reliably detected, the complexity of DM strategy can be significantly reduced. The motivation for focusing on robust DA detection is that the issues with ASR error, SLU cost, and DM complexity can be greatly alleviated.
Our method for DA detection differs from existing approaches in two respects. First, DAs are data-driven: utterances are clustered via the spectral clustering algorithm, and each cluster is identified as a DA. The clustering happens in a space defined by derivation rules (DR). Second, classification of DA for unseen utterances is based on a novel derivation rule-dialogue act (DRDA) matrix, which is created by counting the occurrences of each DR in each utterance cluster. As a result, a column in the DRDA matrix represents a DA in the vector space spanned by DRs. As an example, in our system, the utterance "How can I go to Anping-Fort by car?" is mapped to DA-33 (Car_Destination, as listed in Table 1), and the system takes an action which leads to the generation of the system response "The suggested line is that...".
The rest of this article is organized as follows. The basic framework of SDS is introduced in Section 2. The proposed robust DA detection method is stated in Section 3. Details of the implementation are described in Section 4. Experiments and discussion on the results are presented in Section 5. Lastly, concluding remarks are given in Section 6.

Spoken dialogue system
A dialogue session between a user and a statistical SDS consists of a chain of interleaving user turns and system turns, as illustrated in Figure 1. ASR outputs a string of words (or N-best list) W based on utterance U. SLU parses W and outputs a semantic representation. DM updates the belief on dialogue states, and accordingly decides the system's action based on a policy. Natural language generation (NLG) converts the system's action to a surface representation in textual form, which is passed to the text-to-speech (TTS) module for speech waveform generation. The cycle repeats when the user responds with the next utterance.
The ASR module turns the user's utterance into word hypotheses. A telephone-based SDS inevitably needs to deal with noisy speech and spontaneous speech, rendering the job of the ASR module difficult. Furthermore, errors made by ASR may propagate along the system, making the jobs of other modules difficult. As a result, ASR accuracy is critical to the performance of SDS.
The SLU module, as depicted in Figure 2, converts ASR output into semantic representation. In the proposed system, the ASR output is first converted to a partial sentence tree (PST) [35] in the PST Construction block. The basic idea of PST is to replace unreliable word hypotheses by fillers. As a result, PST is less vulnerable to recognition errors. From PST, partial sentences are formed and parsed. The parse results contain derivation rules (DR), which are extracted in the DR Generation block. The NEC (named entity class) inventory is referenced and certain words are replaced by word classes.
The core of an SDS is the dialogue manager. DM adopts a sound strategy to keep dialogue sessions alive until they are successfully finished. An optimal action is taken at each turn based on the dialogue state, including the user's goal, the user's DA, and the dialogue history. To cope with uncertainty, a belief on the states can be maintained, and the policy for taking action can be based on the belief.
Figure 1 Block diagram of a spoken dialogue system. At turn t, the user utters U, which is recognized by ASR to be W. ν is a semantic representation of the user's intended dialogue act. q is the dialogue state, where $A_u$ is the hypothesized user's dialogue act, g is the user's goal, and H is the dialogue history. b is a distribution over dialogue states. $A_s$ is the system's action. The function π is called the policy and it encodes the strategy of the dialogue manager.

Dialogue act detection
To infer the dialogue act, a statistical model involving DA is required. The assumed generation process for a user's utterance is as follows. Based on the user's goal and the dialogue history, a user decides on a DA, converts it into words, and produces an utterance. This is depicted in Figure 3. Note that each variable in the figure is indexed by turn t. However, to keep the notation and graph from being cluttered, we drop the subscript t. The critical evidence for inferring the current dialogue act comes from the ASR output, the lexical items, and the dialogue history. Thus, we can write
$$A_u^{*} = \arg\max_{A_u \in \Omega} \max_{W} f(W, U)\, g(A_u, W)\, h(A_u, H), \tag{1}$$
where $\Omega = \{A_1, \ldots, A_q\}$ is the set of DAs. In (1), $f(W, U)$ is called the ASR score, $g(A_u, W)$ is called the lexical score, and $h(A_u, H)$ is called the history score.
These scores are related to conditional probability functions. For the ASR score, we use the acoustic model and the language model in the ASR system. Specifically,
$$f(W, U) = p_{AM}(U \mid W)\, P_{LM}(W)^{\alpha}, \tag{2}$$
where $p_{AM}(\cdot)$ is the acoustic model probability, $P_{LM}(\cdot)$ is the language model probability, and $\alpha$ is the language model scale factor. For the history score, a back-off bigram model for DA sequence is used [4,30,31]. That is,
$$h(A_u, H) = P(A_{u,t} \mid A_{u,t-1}), \tag{3}$$
where $A_{u,t}$ and $A_{u,t-1}$ are the user's DAs at the current and the previous turn. Essentially, equation (3) models the DA sequence as a Markov chain. We assume that the current user's DA depends on the history only through the previous user's DA. For the lexical score, a novel measure is proposed and the details are described in the following section.
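As an illustration of the history score, the following minimal Python sketch estimates a bigram score over DA sequences. The back-off rule used here (a fixed weight of 0.4 applied to the unigram relative frequency, yielding a score rather than a normalized probability) is an assumption for illustration only; the article does not specify its back-off scheme, and the function name and toy sessions are hypothetical.

```python
from collections import Counter

def train_da_bigram(sessions, backoff_weight=0.4):
    """Return h(prev_da, da) estimated from lists of DA labels.

    backoff_weight is a hypothetical constant, not taken from the article.
    """
    unigram = Counter()
    bigram = Counter()
    context = Counter()
    for seq in sessions:
        unigram.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigram[(prev, cur)] += 1
            context[prev] += 1
    total = sum(unigram.values())

    def h(prev_da, da):
        if bigram[(prev_da, da)] > 0:
            # Seen bigram: relative frequency of da following prev_da.
            return bigram[(prev_da, da)] / context[prev_da]
        # Unseen bigram: back off to a scaled unigram relative frequency.
        return backoff_weight * unigram[da] / total

    return h

# Toy DA sequences (hypothetical labels in the style of Table 1).
sessions = [
    ["Greeting", "Car_Destination", "Ending"],
    ["Greeting", "Car_Destination", "Car_Destination", "Ending"],
]
h = train_da_bigram(sessions)
print(h("Greeting", "Car_Destination"))
print(h("Greeting", "Ending"))
```

A properly discounted back-off model (as produced by a language modeling toolkit) would redistribute probability mass so that the scores sum to one; the sketch omits this for brevity.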

Method for lexical score
One main contribution of this research is to demonstrate that a novel method for estimating the lexical score $g(A_u, W)$ works quite well. The proposed method incorporates several steps, including partial sentence tree construction, derivation rule extraction, utterance representation in a vector space, dialogue act set generation via spectral clustering, dialogue act representation using relative frequency weighted by normalized entropy, and finally a cosine measure between dialogue act and utterance. At the risk of being tedious, we describe the details of these steps in the following sections in order to make the overall procedure clear.

Construction of partial sentence tree
In an SDS, it is often beneficial to partition the vocabulary into a set of keywords $\mathcal{K}$ and a set of non-keywords $\mathcal{Q}$. Each word $w \in \mathcal{K}$ should be quite indicative of DA. Using $\mathcal{K}$ and $\mathcal{Q}$, the set of sentences with at least one keyword can be represented as
$$\mathcal{S} = (\mathcal{Q}^{*} \mathcal{K} \mathcal{Q}^{*})^{+},$$
where $\mathcal{A}^{*}$ is the Kleene star (a.k.a. Kleene closure) of $\mathcal{A}$, and $\mathcal{A}^{+}$ is the Kleene plus of $\mathcal{A}$.
Given a sentence $s \in \mathcal{S}$, a partial sentence (PS) of s contains all keywords in s, while replacing some non-keywords in s by tokens called Filler. For a sentence with n non-keywords, there are $2^n$ PS's. These PS's can be compiled in a tree called the partial sentence tree (PST). A path in the PST from the root to a leaf corresponds to a PS. The PST of sentence s is henceforth denoted by $T_s$. For example, Figure 4 gives the PST for the sentence s: Where is the Anping-Fort. In this example, Where and Anping-Fort are keywords, while is and the are non-keywords. The $2^2 = 4$ PS's are embedded in the PST.
Figure 3 The generation process of user's utterance. In this graph, g is the user's goal, H is the dialogue history, $A_u$ is the user's intended dialogue act, S is the uttered sentence, and U is the acoustical observation. Note that the uttered sentence S and the recognition hypothesis of ASR W are different.
PST is a robust representation of ASR output. That is, even if some words are not recognized correctly, the semantics of an utterance can still be conveyed with the recognized keywords.
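The enumeration of partial sentences can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it lists the partial sentences as flat word sequences (each one corresponding to a root-to-leaf path of the PST) rather than building an explicit tree, and the function name is hypothetical.

```python
from itertools import product

FILLER = "Filler"

def partial_sentences(words, keywords):
    """Enumerate all partial sentences of a word sequence.

    Keywords are always kept; each non-keyword is either kept
    or replaced by a Filler token, giving 2**n combinations.
    """
    options = [(w,) if w in keywords else (w, FILLER) for w in words]
    return [list(choice) for choice in product(*options)]

ps = partial_sentences(
    ["Where", "is", "the", "Anping-Fort"],
    keywords={"Where", "Anping-Fort"},
)
for p in ps:
    print(" ".join(p))
```

With two non-keywords, the loop prints four partial sentences, ranging from the full sentence to the one in which both non-keywords are replaced by Filler.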
In the actual implementation, the ASR output is post-processed before PST construction. First, a word hypothesis, say w, is replaced by a Filler if its z-score [36],
$$z(w) = \frac{f(w) - \mu}{\sigma},$$
is below a threshold, where $f(w)$ is the recognition probability for word w, $\mu$ is the mean, and $\sigma^2$ is the variance computed from all samples. In addition, recognized keywords are replaced by the named entity classes (NEC) or the greeting/ending classes, to have a compact representation.
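The filtering step can be sketched as follows, assuming the standard z-score definition (recognition probability minus the mean, divided by the standard deviation). The function name, the scores, and the values of the mean, standard deviation, and threshold below are hypothetical and chosen only to make the example easy to follow.

```python
def replace_unreliable(words, scores, mu, sigma, threshold):
    """Replace word hypotheses whose z-score is below the threshold by Filler."""
    result = []
    for w, f in zip(words, scores):
        z = (f - mu) / sigma
        result.append(w if z >= threshold else "Filler")
    return result

# Hypothetical recognition probabilities for a four-word hypothesis.
words = ["where", "is", "the", "Anping-Fort"]
scores = [0.90, 0.35, 0.70, 0.95]
filtered = replace_unreliable(words, scores, mu=0.75, sigma=0.2, threshold=-1.5)
print(filtered)
```

In this toy case only the second word falls below the threshold, so it is the only one replaced by a Filler.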

Extraction of derivation rules
After PST construction, each PS in the PST is parsed by the Stanford parser (S-parser) [11]. Let the grammar of the S-parser be denoted as a 5-tuple [37]
$$G = (V, \Sigma, P, S, D), \tag{7}$$
where V is the set of variables, $\Sigma$ is the set of terminals, P is the set of production rules, S is the sentence symbol, and D is a function defined on P for rule probability. In our implementation, a derivation rule (DR) is defined to be a derivation of the form
$$A \rightarrow B \rightarrow w, \tag{8}$$
where $A, B \in V$ and $w \in \Sigma$. Note that equation (8) is a lexicalized rule. For illustration, parse results of the partial sentences are shown in Table 2. One can see that a lexical word in a PS produces a DR. Given a text corpus, a set of DRs $\mathcal{R} = \{R_1, R_2, \ldots, R_l\}$ can be extracted and compacted.
The motivation for using DR is to exploit the part-of-speech (POS) information. In particular, POS tags help to disambiguate noun-verb homonyms that occur quite often in Chinese.
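A minimal sketch of DR extraction is given below, assuming that a DR pairs each word with its POS tag and with the parent of that tag in the parse tree. The parse is written by hand as nested tuples with Penn Treebank style labels; it is not actual S-parser output, and the function name is hypothetical.

```python
def derivation_rules(tree, parent=None):
    """Yield (A, B, w) triples: parent label A, POS tag B, word w.

    A tree is a tuple (label, child, ...); a POS node is (tag, word).
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        # Preterminal node: emit one lexicalized rule for this word.
        yield (parent, label, children[0])
    else:
        for child in children:
            yield from derivation_rules(child, label)

# Hand-written parse of the partial sentence "where is the spot".
parse = (
    "SBARQ",
    ("WHADVP", ("WRB", "where")),
    ("SQ", ("VBZ", "is"), ("NP", ("DT", "the"), ("NN", "spot"))),
)
rules = list(derivation_rules(parse))
for a, b, w in rules:
    print(f"{a} -> {b} -> {w}")
```

Each word yields exactly one rule, so a four-word partial sentence produces four DRs.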

Vector representation of sentences
Using each DR as a feature, we can represent a sentence s as a binary vector $v_s$ with components
$$v_s[i] = \begin{cases} 1, & \text{if } R_i \text{ is used in } T_s, \\ 0, & \text{otherwise}, \end{cases} \tag{9}$$
where $T_s$ is the PST for s. For example, the representative vector
$$v_s = [1\ 0\ 1\ 0]^T \tag{10}$$
means that $R_1$ and $R_3$ are used in $T_s$, and that there are l = 4 derivation rules.
Table 2 Examples of the parse result (left) and the extracted derivation rules (right) corresponding to the four partial sentences in Figure 4
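The binary representation can be sketched as follows. The rule names R1 to R4 stand in for actual derivation rules, and the function name is hypothetical.

```python
def sentence_vector(rules_in_tree, inventory):
    """Return the binary vector marking which inventory rules occur in the PST."""
    used = set(rules_in_tree)
    return [1 if rule in used else 0 for rule in inventory]

inventory = ["R1", "R2", "R3", "R4"]   # l = 4 derivation rules
v_s = sentence_vector(["R3", "R1", "R1"], inventory)
print(v_s)
```

Repeated occurrences of a rule do not change the vector, since only presence or absence is recorded.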

Generation of dialogue acts
We use a set of data-driven DAs to save the prohibitive cost of manual annotation. We apply the recently proposed spectral clustering algorithm [38] to cluster utterances in the training set. The spectral clustering algorithm is chosen because a conventional clustering algorithm (e.g., k-means) is often sensitive to centroid selection (for initialization). After clustering, each cluster found is identified as a DA.
Our implementation of spectral clustering is outlined as follows. Suppose there are n utterances in the training set
$$\mathcal{D} = \{s_1, s_2, \ldots, s_n\}.$$
Each utterance is represented by a vector according to equation (9). From $\mathcal{D}$, we construct an n × n similarity matrix M, where the similarity $M_{kk'}$ between two utterances $s_k$ and $s_{k'}$ is defined as the cosine measure between $v_{s_k}$ and $v_{s_{k'}}$. The normalized Laplacian matrix of M is defined as
$$L = I - D^{-1/2} M D^{-1/2},$$
where D is a diagonal matrix with entries
$$D_{kk} = \sum_{k'=1}^{n} M_{kk'}.$$
We find the eigenvectors of the q smallest eigenvalues of L. Note that the eigenvectors can be made orthonormal since L is real-symmetric. We put these eigenvectors in an n × q matrix Q with orthonormal columns, and cluster the row vectors into q clusters. Each cluster is identified as a data-driven DA.
On the theoretical side, consider the conversion of M into a binary-valued matrix $\bar{M}$ via a threshold τ, i.e.,
$$\bar{M}_{kk'} = \begin{cases} 1, & \text{if } M_{kk'} \ge \tau, \\ 0, & \text{otherwise}. \end{cases}$$
$\bar{M}$ can be regarded as the adjacency matrix of a graph $G = (\mathcal{N}, \mathcal{E})$, where the node set $\mathcal{N}$ corresponds to $\mathcal{D}$, and the edge set $\mathcal{E}$ corresponds to the non-zero entries in $\bar{M}$. It can be shown [38] that the multiplicity of the eigenvalue 0 for $\bar{L}$, the normalized Laplacian matrix of $\bar{M}$, equals the number of disjoint connected components in G, which can be identified as clusters in $\mathcal{D}$.
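The graph view can be illustrated without any eigen-decomposition. The sketch below is not the spectral clustering procedure itself: it thresholds the cosine similarities and labels the connected components of the resulting graph, which is the structure that the zero eigenvalues of the normalized Laplacian count. The threshold value, the toy vectors, and the function names are assumptions for illustration.

```python
from math import sqrt

def cosine(u, v):
    """Cosine measure between two vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def connected_components(vectors, tau):
    """Label each utterance vector by its component in the thresholded graph."""
    n = len(vectors)
    labels = [None] * n
    current = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:                      # depth-first traversal
            k = stack.pop()
            for j in range(n):
                if labels[j] is None and cosine(vectors[k], vectors[j]) >= tau:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Four toy binary utterance vectors over l = 4 derivation rules.
vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels = connected_components(vectors, tau=0.5)
print(labels)
```

On real data the thresholded graph is rarely so cleanly separated, which is why the eigenvectors of the q smallest eigenvalues are clustered instead of reading components off directly.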

Derivation rule-dialogue act matrix
A cluster of utterances found via the spectral clustering algorithm is identified as a DA. In our implementation, we use an entropy-based representation for DA, described as follows. Let $n_{ij}$ be the accumulated count that DR $R_i$ occurs in the utterance cluster of $A_j$. From $n_{ij}$, a probability function of DA conditional on DR is defined as
$$p(A_j \mid R_i) = \frac{n_{ij}}{\sum_{j'=1}^{q} n_{ij'}}. \tag{15}$$
The normalized entropy for the probability conditional on DR $R_i$ is
$$\hat{I}_i = -\frac{1}{\log q} \sum_{j=1}^{q} p(A_j \mid R_i) \log p(A_j \mid R_i). \tag{16}$$
Note that $0 \le \hat{I}_i \le 1$, and a DR $R_i$ with a lower $\hat{I}_i$ is more discriminative for DA. From equations (15) and (16), a matrix Γ of size l × q can be constructed with entries
$$\gamma_{ij} = (1 - \hat{I}_i)\, p(A_j \mid R_i).$$
We call Γ the derivation rule-dialogue act (DRDA) matrix. The j-th column in Γ is a vector representation for a DA $A_j$ in the vector space spanned by DRs.
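A minimal sketch of the DRDA matrix construction follows. It assumes that each entry is the conditional probability of the DA given the DR, scaled by one minus the normalized entropy of that DR, so that rules spread evenly over all DAs receive zero weight; this weighting form is an assumption, and the counts and function name are hypothetical.

```python
from math import log

def drda_matrix(counts):
    """Build the DRDA matrix from counts[i][j] = n_ij (requires q >= 2)."""
    q = len(counts[0])
    gamma = []
    for row in counts:
        total = sum(row)
        # Conditional probability of each DA given this derivation rule.
        p = [n / total if total else 0.0 for n in row]
        # Normalized entropy in [0, 1]; terms with p = 0 contribute nothing.
        entropy = -sum(x * log(x) for x in p if x > 0) / log(q)
        gamma.append([(1.0 - entropy) * x for x in p])
    return gamma

# Two derivation rules, two dialogue acts (toy counts).
gamma = drda_matrix([[4, 0], [2, 2]])
for row in gamma:
    print(row)
```

The first rule occurs in only one cluster and keeps its full weight, while the second rule is split evenly and is weighted down to zero.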

Similarity between utterance and dialogue act
In our implementation, the lexical score $g(A_u, W)$ in equation (1) is decomposed into two terms: $g_R(A_u, s)$, called the DR score, and $g_N(A_u, W)$, called the named entity score. For the DR score, the following cosine similarity measure is used
$$g_R(A_j, s) = \frac{b_s^{T} a_j}{\|b_s\|\,\|a_j\|},$$
where $b_s$ is the vector representation for PS s in $T_s$, and $a_j$ is the vector representation for DA $A_j$ (i.e., column j in the DRDA matrix Γ). For the named entity score, we use the naïve Bayes approximation
$$g_N(A_j, W) = \prod_{\alpha \in W} \nu(A_j, \alpha),$$
where $\alpha$ is a named entity. Note that $\nu(A_j, \alpha)$ is estimated from a training corpus by the relative frequency of $\alpha$ occurring in $A_j$.
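The DR score can be sketched as a cosine comparison between an utterance vector and each column of the DRDA matrix. The sketch covers only the DR score; the named entity score and the way the two terms are combined are left out. The matrix values and function names are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine measure between two vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_dialogue_act(b_s, gamma):
    """Return (index of the best-matching DA column, all DR scores)."""
    q = len(gamma[0])
    columns = [[row[j] for row in gamma] for j in range(q)]
    scores = [cosine(b_s, a_j) for a_j in columns]
    best = max(range(q), key=scores.__getitem__)
    return best, scores

# Toy DRDA matrix: 3 derivation rules (rows) by 2 dialogue acts (columns).
gamma = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
best, scores = best_dialogue_act([1, 0, 1], gamma)
print(best, scores)
```

The utterance uses the first and third rules, so it lies closer to the first DA column than to the second.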

Experiments and discussion
We evaluate the proposed method for dialogue act detection on an SDS for Tainan city tourist-information services.

Data collection
We adopt the setup utilized in [4,6] to collect the dialogue speech data. The data collection setup is shown in Figure 5, and an example from the collected dialogue data is shown in Table 3. An operator plays the role of SDS, which helps users to plan trips in Tainan. Twenty-six male and eleven female subjects play the role of users. For our prototype system, users are asked to use utterances with a single DA. Dialogue speech data is recorded in a lab environment, using a 16,000-Hz sampling rate and 16-bit PCM format. There are 294 dialogues.
Two types of speech data are collected. The first type, called S-data, is from the operator playing the role of SDS. S-data contains travel information collected from on-line resources, such as Wikipedia and Google Maps. The S-data set consists of 2,653 utterances, with 317 different words. The second type, called U-data, is from subjects playing the role of users. U-data consists of 2,636 utterances, with 297 different words. The vocabulary size is small as we have a domain-specific task. From U-data, 87 keywords corresponding to 28 named entity classes/semantic classes and 796 derivation rules are obtained from the S-parser. Examples of the selected NECs and semantic classes are given in Table 4. The collected data contains sightseeing information, queries for the time schedules of two railway systems (Taiwan Railways Administration (TRA) and Taiwan High Speed Rail (THSR)), and greeting/ending words in dialogues.
We use the fivefold cross-validation method for system development. That is, the data is divided into five parts. In a round-robin fashion, four parts are used as training data, and one part is used as test data. We develop our system such that the average accuracy of DA detection over the five test sets is optimal.
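The round-robin protocol can be sketched as follows. How the data is divided into five parts is not stated in the article; the interleaved partition used here is an assumption, and the function name is hypothetical.

```python
def five_fold_splits(items, k=5):
    """Yield (train, test) pairs, using each of the k parts once as test data."""
    folds = [items[i::k] for i in range(k)]   # interleaved partition (assumed)
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Ten toy utterance indices split into five folds of two.
splits = list(five_fold_splits(list(range(10))))
for train, test in splits:
    print(len(train), test)
```

The detection accuracy would then be computed on each test part and averaged over the five rounds.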

ASR module
The ASR module is an HTK-based Mandarin speech recognizer [39]. A syllable in Mandarin is modeled as the concatenation of an initial model and a final model. The acoustic model set includes 115 right-context-dependent initial models, 38 context-independent final models, 37 particle models (e.g., EN, MA, OU), 47 syllable-level models for hyper-articulated speech, and 14 filler models (e.g., short pause, breathing, and footfall). Each initial model is a three-state HMM, while each final model is a five-state HMM. The observation probability density of a state is a Gaussian mixture model (GMM) with no more than 32 components. The speech feature vector is composed of 39 components, including 12 mel-frequency cepstral coefficients (MFCCs), log energy, and the velocity and acceleration features. For real-world data with a variety of speakers, a reliable acoustic model is needed. Thus, an acoustic model set trained on the TCC-300 Mandarin corpus is adapted with the collected dialogue speech data via maximum-likelihood linear regression (MLLR). The lexicon contains 297 words. The bigram language model is estimated with the SRILM toolkit [40].
Table 5 shows the performance of the ASR module with clean and simulated noisy speech. Note that a realistic noise-corruption scenario is used in collecting the noisy speech (footfall noise, human speech, or both). That is, a speaker stands in front of a microphone and the noise is played behind the speaker. From the results, we can see that the recognition accuracy does not severely degrade in the presence of noise behind a user.

The z-score threshold
Ideally, an effective threshold for z-score strikes a good balance between reliable recognition and keeping sufficient keywords for subsequent semantic representation in SLU. We analyze the rejected word hypotheses with z-scores below the threshold of 2 (which corresponds to the confidence level of 0.95), and find that only 3.4% of the keywords are incorrectly rejected. The threshold of 2 is therefore used. Such performance can be attributed to the fact that users often pause naturally before or after a keyword.

The number of dialogue acts
While generic speech acts are relatively well-defined, DAs are often specialized to particular domains and they need to be specified. In this research, since we adopt data-driven DAs by clustering, the number of DAs (clusters) becomes a critical parameter in system design. In order to decide this number, we investigate the system performance when it is varied. The detection accuracies are shown in Table 6. We can see that 38 DAs (q = 38) achieves the best performance (see endnote a). Therefore, we use 38 DAs. For interpretability, each cluster is given an artificial but meaningful label (tag, name), as shown in Table 1. For example, Query_Introduction_Spot is assigned to the cluster formed by "queries of introduction to sightseeing spot".

Evaluation of feature sets
Just like the set of DRs, an alternative set of semantic features can be used as the coordinate axes to construct the corresponding vector set for $\mathcal{D}$. Applying spectral clustering, a matrix analogous to the DRDA matrix can be constructed according to the steps described in Section 4.
Including the proposed DR, 5 sets of features are investigated. In baseline, keywords are used as features. In NEC, named entity classes are used. In PS, partial sentences are used. In uwDR, derivation rules without normalized entropy weighting are used.
DA detection accuracies are summarized in Table 7. In this table, the column 40%-SIM means 40% of the words in the reference transcripts are retained, and similarly for 60%-SIM and 100%-REF. The recognition accuracy of ASR is 84.8% (15.2% word error rate), so we have a column of 84.8%-ASR. The middle columns are the results with simulated noisy speech, corrupted by footfall (footfall), human speech (human), and both noises (both).
In the case of 84.8%-ASR, we can see that NEC (56.8%) is better than baseline (49.6%), and that PS (76.2%) is better than NEC. The incorporation of uwDR (81.6%) and DR (82.9%) leads to further improvements. Thus, the difference between baseline and the proposed DR is very significant. We notice that an ambiguous Chinese word may correspond to different DAs with its different meanings. For instance, in open door and drive car, the words open and drive are the same word in Chinese. Using DRs helps disambiguation. For the cases of 40%-SIM and 60%-SIM, the results show clear improvement of NEC and PS over the baseline. Using DRs, however, does not further improve performance in these scenarios, as the keywords are randomly discarded. We can see that recognizing the keywords is particularly important in highly adverse acoustic conditions. We also evaluate using the simulated noisy speech data in SDS. One can observe an interesting result: the performance of DR with the simulated noisy data is very close to that with the clean data. In PS, non-keywords are removed or replaced by Fillers. Thus, most of the partial sentences of simulated noisy speech are almost the same as those obtained from the clean speech.

Evaluation of the history score
The above results on DA detection are obtained without considering the dependency between DAs. Next, we evaluate the effectiveness of the history score. In order to balance the contribution of the lexical score and the history score, we generalize equation (1) to the following form,
$$A_u^{*} = \arg\max_{A_u \in \Omega} g(A_u, W)^{b_g}\, h(A_u, H)^{b_h}, \tag{21}$$
where $0 \le b_g \le 1$ is the weight of the lexical score, and $b_h = 1 - b_g$ is the weight of the history score.
A few comments on using equation (21) are in order. First, we note that when the ASR module outputs only the one-best hypothesis, the maximization over W in equation (1) becomes trivial. It follows that the term $f(W, U)$ can be dropped since it does not depend on $A_u$. In addition, as the values of $g(A_u, W)$ and $h(A_u, H)$ are in different ranges, simple linear combination may not work as one score can easily be dominated by the other. Therefore, we use the linear combination in the log domain, which is equivalent to the product in equation (21). In fact, a similar case based on the same consideration is the language model scale factor commonly used in ASR. Table 8 shows the results of different $b_h$, and the best performance is achieved when $b_h = 0.7$. The evaluation results demonstrate that the dialogue history is informative.
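The weighted combination in the log domain can be sketched as follows, assuming a one-best ASR hypothesis so that the ASR score is dropped. The score values, the small floor used to avoid taking the logarithm of zero, and the function name are hypothetical.

```python
from math import log

def detect_da(lexical, history, b_h=0.7, floor=1e-12):
    """Pick the DA maximizing b_g*log(g) + b_h*log(h), with b_g = 1 - b_h.

    lexical[da] and history[da] are non-negative scores; floor is a
    hypothetical guard against log(0).
    """
    b_g = 1.0 - b_h

    def score(da):
        g = max(lexical[da], floor)
        h = max(history[da], floor)
        return b_g * log(g) + b_h * log(h)

    return max(lexical, key=score)

# Toy scores for two candidate dialogue acts.
lexical = {"A": 0.6, "B": 0.5}
history = {"A": 0.1, "B": 0.6}
print(detect_da(lexical, history, b_h=0.7))
print(detect_da(lexical, history, b_h=0.0))
```

With the history weight set to zero, the decision follows the lexical score alone; with a weight of 0.7, the DA favored by the dialogue history wins in this toy case.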

Comparison with other methods
The performance of the proposed approach for DA detection is compared with other methods. In the NBC method, the keywords are used as the semantic features, and they are used in calculating DA probabilities. In the co-occurrence (co-oc) method, the Apriori algorithm [41] is used to calculate the co-occurrence of keywords in each DA. In the SVM and maximum entropy (ME) methods, a DA classifier is trained using keywords. In latent semantic analysis (LSA), the keyword-DA matrix is treated as a conventional word-document matrix, and then LSA is applied. The results are listed in Table 9. We can see that the proposed approach achieves the best accuracy.

Evaluation on end-to-end measure
In addition to DA detection accuracy, we also conduct evaluation on end-to-end measures, i.e., from the start of a session to the end of the session. End-to-end measures are arguably better for performance evaluation as the ultimate goal of an SDS is to enable a user to complete a session correctly and quickly.
Three systems are evaluated, including the NBC, the proposed system (proposed), and the proposed system without using the history information (no history). Five subjects are recruited. Subjects perform exactly the same task without knowing the order of the systems. The order is randomized for each test subject. A task is considered completed as soon as a subject acquires the appointed information. Table 10 shows the average dialogue turns per task of the evaluated systems. The proposed approach achieves the lowest average number of turns.

Conclusion
In this article, a robust dialogue act detection method using named entity classes, partial sentence trees, derivation rules, and an entropy-based derivation rule-dialogue act matrix is investigated. Data-driven dialogue acts are created by the spectral clustering algorithm. Our implementation of a spoken dialogue system for tourist-information services incorporating the proposed method achieves 85.1% detection accuracy, outperforming a naïve Bayes classification based method (62.3%). It also reduces the number of dialogue turns per dialogue session on average. The results show that partial sentence trees and derivation rules are indeed succinct and informative features for dialogue act detection. Furthermore, spectral clustering is a successful method for automatic and unsupervised learning of dialogue acts from in-domain training data.

Endnote
a Queries about three kinds of vehicles (bus, TRA, and THSR) are in different clusters when q = 38, but in the same cluster when q = 36. This partially explains the difference in performance between using 36 DAs and 38 DAs.

Figure 2 The spoken language understanding (SLU) module. During training, derivation rules are extracted based on partial sentence tree construction, and a derivation rule-dialogue act (DRDA) matrix is constructed. During testing, the trained DRDA matrix is used in DM for DA detection. The named entity class (NEC) inventory is referenced to convert certain words to word classes.

Figure 4 Construction of the partial sentence tree for the sentence where is the Anping-Fort. With 2 non-keywords, there are 4 partial sentences.

Figure 5 The environmental setting of data collection. The operator acts like an SDS, and the user acts like s/he is interacting with an SDS.

Table 3
The beginning part of a collected dialogue

Table 4
Examples of named entity classes (NEC) and semantic classes

Table 5
Word accuracy rates of automatic speech recognition in clean and three noisy conditions

Table 6
Accuracy rates of dialogue act detection with various numbers of DAs

Table 7
Accuracy rates of dialogue act detection with various feature sets in various noisy conditions

Table 8
Accuracy rates of dialogue act detection with various history score weights

Table 9
Accuracy rates of dialogue act detection with five feature sets

Table 10
End-to-end measure of system performance evaluation.