
Mapping Utterances onto Dialogue Acts with LSA and Naïve Bayes
Thomas K Harris
Dialogs on Dialogs: May 6, 2005

Today's Talk
A quick, super high-level overview of the problem
A mini-problem
– SGPUC
– Data collected
– Six classes of dialog acts
A naïve Bayes classification approach and results
Latent Semantic Analysis (LSA)
– What is LSA (algorithmically)?
– What is LSA (theoretically)?
– What are the major LSA-language findings?
– How I used it to condition the data
Applicability
Research
Please give me feedback

A Super High-Level View of the SDS Input Pass
Words -> Speech Acts and Concepts is usually a knowledge-engineered "white-box" function.
Problematic because the input (words) is
– Huge (How large is NL? I don't think anyone knows.)
– Noisy and probabilistic (that's what ASR gives us.)
– Dynamic and situational
Problematic because the output (concepts) is difficult to share/generalize from one domain/system to another.

What do we do?
A lot of design iterations!
Restrict the domain
Share components
Control the speaker through
– Training and entrainment
– Domain-related expectations
– Influencing or outright directing the dialog

Use the Data, Luke.
Words -> Speech Acts and Concepts can also be a data-driven "black-box" function, or a hybrid.
This has its own set of problems
– Labeling data is costly
– The catch-22 (data collection requires a working system). Iterate starting with seed data, which can be: nothing; designer-hypothesized data; WoZ data; data from a similar or previous-version SDS; or data from some human-human analog
– The performance often seems nice at first, but then asymptotes quickly
I'm only going to address the labeling-cost issue here.

A Mini-Problem
Let's look at a small part of the words -> speech acts and concepts problem in a real system, the Speech Graffiti Personal Universal Controller (SGPUC).
Hopefully this small, concrete system and its mini-problem will facilitate manageable experimentation with approaches.
But first, a little about the system itself.

Speech Graffiti Personal Universal Controller
Protocol-based appliance communication architecture
SDS automatically built from an appliance description language
Speech Graffiti-style user interface

Appliance Communication Architecture
[Diagram: the Speech Graffiti Personal Universal Controller communicates with appliances through protocol adapters, e.g. a HAVi adapter and an X10 adapter.]

James
[Diagram: function tree for the devices James controls, e.g. the stereo (mode: tuner with AM/FM radio band, frequency, station, seek forward/backward; auxiliary; CD with play/stop/pause status, repeat off/single track/single disc/all discs, disc # and track # selection, next/last track, random on/off; plus x-bass on/off, volume up/down, and on/off), digital camera control (play mode: play, stop, fast fwd, rewind, record, pause, step forward/backward, info; device mode: camera/VCR; media type: digital video, VHS, unknown, none), and other devices.]

Speech Graffiti Dialog
Artificial subset language
Tree-structured functions
Universal conversational primitives
User-directed
Great for recognition
Entirely declarative (and automatic)

Six Dialog Acts in SGPUC
Session Management
Help
Execution
Query
Response Related
Other

Session Management
User: blah blah blah...
System: ignoring user
User: hello james
System: stereo, digital camera
User: stereo
System: stereo here
User: goodbye
System: goodbye
User: blah blah blah...
System: ignoring user

Help/Exploration/Orientation
User: alarm clock options
System: alarm, clock, radio, sleep, and more…
User: help
System: to hear what options you have, say options; to repeat an utterance, say repeat…
User: stereo options
System: while turning stereo on: off, am, fm, auxiliary, cd, and more…
User: where am I
System: stereo options

Execution/Specification
User: stereo auxiliary
System: while turning the stereo on and switching to auxiliary: auxiliary
User: cd
System: while switching to cd mode: cd
User: play
System: while playing a cd: playing cd
User: switch disc to four
System: while switching discs: switching cd disc to four

Query
User: what is the am frequency
System: the am frequency is five hundred thirty
User: random status
System: random is off
User: status
System: tuner am, station wabc, volume low, and more…

Response Related
User: radio band am options
System: am options are frequency, kabc, k001, k002, and more…
User: more
System: k003, k004
User: more
System: band options are fm
User: repeat
System: band options are fm

Back to the Mini-Problem
The language is explicit and regular in classifying dialog acts, so a grammar will accurately classify dialog acts.
Users are taught the SG language, but they learn it incompletely and have faulty memories.
Utterances have false starts, spurious repetitions, etc.
ASR is error-prone.
As a result, 37.5% of utterances' dialog acts were misclassified.

Data
Listening to the actual speech, I labeled 2010 utterances (from 10 participants).
Each utterance is labeled with one of the six dialog acts.
Note that this labeling is much faster than transcription or much other labeling; the utterances were labeled in 2½ hours, close to real time.
Each utterance is represented by a boolean vector, where each element represents whether or not that word appears in the utterance (i.e. word order is ignored!).
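As an illustrative sketch (mine, not from the talk), the boolean bag-of-words representation could be built as follows; the example utterances and vocabulary are made up.

```python
import numpy as np

utterances = ["hello james", "stereo options", "what is the am frequency"]  # made-up examples

# Vocabulary over the training utterances; each word gets one vector position.
vocab = sorted({w for u in utterances for w in u.split()})
index = {w: i for i, w in enumerate(vocab)}

def to_boolean_vector(utterance):
    """1 if the word appears anywhere in the utterance, else 0 (word order is ignored)."""
    x = np.zeros(len(vocab), dtype=int)
    for w in utterance.split():
        if w in index:
            x[index[w]] = 1
    return x

X = np.vstack([to_boolean_vector(u) for u in utterances])  # one row per utterance
```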

A Naïve Bayes Classifier
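The slide's formulas are not preserved in the transcript. As a hedged sketch of what a naïve Bayes classifier over these boolean word vectors could look like (my code, not the author's; integer labels 0–5 stand for the six dialog acts):

```python
import numpy as np

def train_nb(X, y, n_classes=6):
    """Estimate P(class) and P(word present | class) from boolean utterance vectors X
    (n_utterances x n_words) and integer dialog-act labels y."""
    y = np.asarray(y)
    priors = np.array([(y == c).mean() for c in range(n_classes)])
    # Unsmoothed MLE word-presence probabilities per class; a word never seen with a
    # class therefore zeroes that class out (the singleton problem discussed below).
    likelihoods = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])
    return priors, likelihoods

def classify_nb(x, priors, likelihoods):
    """Pick argmax over classes of P(class) * prod_w P(x_w | class), assuming independence."""
    class_probs = priors * np.prod(np.where(x == 1, likelihoods, 1.0 - likelihoods), axis=1)
    return int(np.argmax(class_probs))
```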

Classifier Results

Problems with Naïve Bayes
Independence assumption
– Word existence in an utterance contributes a fixed amount to class distinction regardless of context.
– e.g. "bank" contributes the same thing to the classifier in the context of "world bank" and "river bank"
Estimates a high-dimensional model
– The model estimates 5 parameters (#classes − 1) for each word. Words that occur infrequently will be severely over-fitted.
Problems with singleton words
– If a word appears in an utterance that hasn't occurred in the training data for a particular class, the probability assigned to that class is zero.

Latent Semantic Analysis to the Rescue
Independence assumption
– LSA models both synonymy and polysemy.
– Polysemy: words that occur in different contexts (e.g. "bank" in "world bank" vs. "river bank") tend to become distinguished.
– Synonymy: words that occur in similar contexts (e.g. the "white" and "black" of "white sheep" and "black sheep") tend to become undistinguished.
Estimates a high-dimensional model
– The effective dimension is arbitrarily fixed.
Problems with singleton words
– The dimensionality reduction serves as a smoothing function.

How Does LSA Work?
C1: Human machine interface for ABC computer applications.
C2: A survey of user opinion of computer system response time.
C3: The EPS user interface management system.
C4: System and human system engineering testing of EPS.
C…:

{X} =
              C1  C2  C3  C4  …
  Human        1   0   0   1  …
  Interface    1   0   1   0  …
  Computer     1   1   0   0  …
  User         0   1   1   0  …
  System       0   1   1   2  …
  Response     0   1   0   0  …
  Time         0   1   0   0  …
  EPS          0   0   1   1  …
  Survey       0   1   0   0  …
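For concreteness (my sketch, not from the talk), the count matrix above can be built directly from the passages and the term list shown:

```python
import numpy as np

passages = [
    "human machine interface for abc computer applications",
    "a survey of user opinion of computer system response time",
    "the eps user interface management system",
    "system and human system engineering testing of eps",
]
terms = ["human", "interface", "computer", "user", "system",
         "response", "time", "eps", "survey"]

# X[i, j] = number of times term i occurs in passage j (rows = terms, columns = passages).
X = np.array([[p.split().count(t) for p in passages] for t in terms], dtype=float)
```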

Singular Value Decomposition
Any m×n matrix X where m > n can be decomposed into the product of three matrices, U D V^T, where:
– U is an m×n matrix and V is an n×n matrix, both with orthogonal columns.
– D is an n×n diagonal matrix.
D is a sort-of basis in n dimensions for X.
In Matlab, [U, D, V] = svd(X, 'econ');
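For reference, the equivalent thin SVD in Python/NumPy (my addition, not part of the talk) would be:

```python
import numpy as np

X = np.random.rand(9, 4)  # placeholder m x n matrix with m > n

# Thin SVD: X = U @ diag(d) @ Vt, with U of shape (m, n), d of length n, Vt of shape (n, n).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(X, U @ np.diag(d) @ Vt)  # reconstruction up to floating-point error
```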

LSA Algorithm in 4 Easy Steps
1. Build your feature-passage matrix X. (Here I chose word-utterance.)
2. [U, D, V] = svd(X)
3. Zero out all but the highest g values of D to form a new, reduced D.
4. Recompose a reduced X as U D V^T.
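A minimal Python/NumPy sketch of these four steps (my illustration, not code from the talk; the word-utterance matrix and the number of retained dimensions g are placeholders):

```python
import numpy as np

def lsa_reconstruct(X, g):
    """Steps 2-4: thin SVD, keep only the g largest singular values, recompose a reduced X."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    d_reduced = np.zeros_like(d)
    d_reduced[:g] = d[:g]               # zero out all but the highest g values of D
    return U @ np.diag(d_reduced) @ Vt  # rank-g approximation of X

# Step 1: a placeholder word-by-utterance matrix (rows = words, columns = utterances).
X = np.random.randint(0, 2, size=(50, 200)).astype(float)
X_reduced = lsa_reconstruct(X, g=10)
```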

The Recomposed Matrix
C1: Human machine interface for ABC computer applications.
C2: A survey of user opinion of computer system response time.
C3: The EPS user interface management system.
C4: System and human system engineering testing of EPS.
C…:

{X} =  (original counts shown in parentheses)
              C1         C2         C3         C4       …
  Human      (1) 0.16   (0) 0.40   (0) 0.38   (1) 0.47  …
  Interface  (1) 0.14   (0) 0.37   (1) 0.33   (0) 0.40  …
  Computer   (1) 0.15   (1) 0.51   (0) 0.36   (0) 0.41  …
  User       (0) 0.26   (1) 0.84   (1) 0.61   (0) 0.70  …
  System     (0) 0.45   (1) 1.23   (1) 1.05   (2) 1.27  …
  Response   (0) 0.16   (1) 0.58   (0) 0.38   (0) 0.42  …
  Time       (0) 0.16   (1) 0.58   (0) 0.38   (0) 0.42  …
  EPS        (0) 0.22   (0) 0.55   (1) 0.51   (1) 0.63  …
  Survey     (0) 0.10   (1) 0.53   (0) 0.23   (0) 0.21  …

And This Means?
Cosine distances between words show patterns of similarity, as do cosine distances between passages.
Clustering with these distances makes clusters that feel "semantic" and mimic human choices in standardized tests for word sorting and lexical priming so well that people have suggested that LSA may be an actual psycholinguistic mechanism.
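As a small illustration (mine, not from the talk), the cosine similarity between two word rows, or two passage columns, of the recomposed matrix can be computed as:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 indicate similar usage."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rows of the recomposed matrix are word vectors, columns are passage vectors, e.g.:
# cosine_similarity(X_reduced[0], X_reduced[3])        # word-word similarity
# cosine_similarity(X_reduced[:, 0], X_reduced[:, 1])  # passage-passage similarity
```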

LSA-Discounted NB Estimators
Why don't we try to use an LSA-reconstructed matrix to train the NB classifier?
Used various amounts of labeled data, discounted by various amounts of unlabeled LSA data.
Unlabeled decoder output boosts classification!
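The transcript does not spell out how the discounting was done. One hedged reading (my sketch, not the author's method) is to run LSA over the word-by-utterance matrix of labeled plus unlabeled utterances and then estimate the naïve Bayes word statistics from the reconstructed labeled columns:

```python
import numpy as np

def nb_from_lsa(X_all, labeled_cols, y, n_classes=6, g=10):
    """X_all: word-by-utterance matrix over labeled and unlabeled utterances.
    labeled_cols: column indices of the labeled utterances; y: their dialog-act labels."""
    U, d, Vt = np.linalg.svd(X_all, full_matrices=False)
    d[g:] = 0.0                   # LSA: keep only the g largest singular values
    X_hat = U @ np.diag(d) @ Vt   # reconstructed (smoothed) matrix

    # Read the reconstructed labeled columns as soft word-presence counts in [0, 1].
    X_soft = np.clip(X_hat[:, labeled_cols], 0.0, 1.0)
    y = np.asarray(y)
    priors = np.array([(y == c).mean() for c in range(n_classes)])
    likelihoods = np.array([X_soft[:, y == c].mean(axis=1) for c in range(n_classes)])
    return priors, likelihoods
```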

Results

Applications for Coarse Classification
"How may I help you?" systems
More directed error correction, e.g. "You said you wanted to go where?" instead of "Can you repeat that?"
Perhaps even self-correction: a coarse classifier could re-weight the language model or re-order hypotheses to elicit a corrected best hypothesis.

Extensions and Further Research
How to integrate this into a system?
Would this work for non-SG systems?
Does it scale further, esp. w.r.t. unlabeled data?
Are there better features than Boolean bags-of-words?
Can we go to finer-grained classification, perhaps even classifying concepts as well as speech acts?