InfoMagnets: Making Sense of Corpus Data. Jaime Arguello, Language Technologies Institute.


Outline InfoMagnets Applications Topic Segmentation Conclusions Q/A

Defining Exploratory Corpus Analysis Getting a “sense” of your data How does it relate to: –Information retrieval Need to understand the whole corpus –Data mining Need rich interface to support serendipitous search –Text classification Need to find the “interesting” classes

InfoMagnets

InfoMagnets Applications Behavioral Research –2 publishable results (submitted to CHI) CycleTalk Project, LTI –New findings on mechanisms at work in guided exploratory learning Robert Kraut’s Netscan Group, HCII Conversational Interfaces Corpus organization makes authoring conversational agents less intimidating. Rosé, Pai, & Arguello (2005); Gweon et al. (2005)

Authoring Conversational Interfaces Goal: Make authoring CIs easier Solution: –Guide development with pre-processed sample human-human conversations Addresses different issues –Accessible to non-computational linguists –Developers ≠ domain experts –Consistent with user-centered design: “The user is not like me!”

Authoring Conversational Interfaces [Diagram: transcribed human-human conversations; topic segmentation into segments labeled A, B, C; constructing a master template]

Topic Segmentation Preprocess for InfoMagnets But, an important computational linguistics problem in its own right! Previous Work –Marti Hearst’s TextTiling (1994) –Beeferman, Berger, and Lafferty (1997) –Barzilay and Lee (2004) NAACL best paper award! – ….. But, should it all fall under “topic segmentation”?

Topic Segmentation of Dialogue Dialogue is Different: –Very little training data –Linguistic Phenomena Ellipsis Telegraphic Content –Coherence is organized around a shared task, not primarily around a single flow of information

Coherence Defined Over Shared Task Lots of places where there is no overlap in “meaningful” content

Multiple topic shifts in regions w/ zero lexical cohesion

Experimental Condition 22 student-tutor pairs Conversation captured through mainstream chat client Thermodynamics domain Training and test data coded by one coder Results shown in terms of P_k (Lafferty & Beeferman, 1999) Significance tests: two-tailed t-tests
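
As a rough illustration of the P_k metric named above, the sketch below counts how often a reference and a hypothesized segmentation disagree about whether two utterances k positions apart fall in the same segment. The function names, the per-utterance segment-id representation, and the default choice of k (half the average reference segment length) are assumptions made for this example, not details taken from the slides.

```python
def labels_from_boundaries(n, boundaries):
    """Return one segment id per utterance.

    A boundary b means a new segment starts at utterance index b
    (hypothetical representation chosen for this sketch).
    """
    starts = set(boundaries)
    labels, seg = [], 0
    for i in range(n):
        if i in starts and i > 0:
            seg += 1
        labels.append(seg)
    return labels


def p_k(reference, hypothesis, k=None):
    """P_k: fraction of probe pairs (i, i + k) on which the two
    segmentations disagree about 'same segment' vs 'different segment'."""
    n = len(reference)
    if k is None:
        # Assumed convention: half the average reference segment length.
        k = max(1, round(n / len(set(reference)) / 2))
    comparisons = n - k
    if comparisons <= 0:
        raise ValueError("text too short for window size k")
    errors = 0
    for i in range(comparisons):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        errors += same_ref != same_hyp
    return errors / comparisons


# Toy data: 12 utterances, reference topic shifts at utterances 4 and 8.
ref = labels_from_boundaries(12, [4, 8])
none_baseline = labels_from_boundaries(12, [])           # no boundaries at all
all_baseline = labels_from_boundaries(12, range(1, 12))  # boundary everywhere
print(p_k(ref, none_baseline), p_k(ref, all_baseline))
```

Lower P_k is better. The two toy hypotheses follow one plausible reading of the NONE and ALL degenerate baselines (no boundaries, and a boundary after every utterance); the slides do not define them.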

1st Attempt: TextTiling TextTiling (Hearst, 1997) –Slide two adjacent “windows” (w1, w2) down the text –At each step calculate the cosine correlation between the two windows –Use correlation values to calculate “depth” –“Depth” values higher than a threshold correspond to topic shifts
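
The four steps above can be sketched in a few lines. This is a simplified illustration, not Hearst's implementation: whitespace tokenization, a window measured in utterances, the fixed `threshold` value, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine correlation between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def gap_similarities(utterances, window=2):
    """Similarity of the `window` utterances before and after each gap."""
    tokens = [u.lower().split() for u in utterances]
    sims = []
    for gap in range(1, len(tokens)):
        left = Counter(w for u in tokens[max(0, gap - window):gap] for w in u)
        right = Counter(w for u in tokens[gap:gap + window] for w in u)
        sims.append(cosine(left, right))
    return sims


def depth_scores(sims):
    """Depth of each valley: climb to the nearest peak on both sides."""
    depths = []
    for i, s in enumerate(sims):
        left_peak = s
        for j in range(i - 1, -1, -1):
            if sims[j] < left_peak:
                break
            left_peak = sims[j]
        right_peak = s
        for j in range(i + 1, len(sims)):
            if sims[j] < right_peak:
                break
            right_peak = sims[j]
        depths.append((left_peak - s) + (right_peak - s))
    return depths


def boundaries(utterances, window=2, threshold=0.5):
    """Indices of utterances that start a new topic segment."""
    depths = depth_scores(gap_similarities(utterances, window))
    return [i + 1 for i, d in enumerate(depths) if d > threshold]


chat = [
    "the pump raises pressure",
    "pressure at the pump outlet",
    "turbine work output",
    "the turbine output drops",
]
print(boundaries(chat))
```

On the toy chat, the similarity dips at the gap between the pump utterances and the turbine utterances, so that gap gets the largest depth score.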

TextTiling Results [Table: average P_k for NONE, ALL, EVEN, and TextTiling; t-test p-values for TT vs. NONE, TT vs. ALL, and TT vs. EVEN] Trend for TextTiling to perform worse than degenerate baselines Difference not statistically significant Why doesn’t it work?

TextTiling Results Lots of gaps where the correlation = 0 Must select the boundary heuristically And that is still a heuristic improvement on the original

TextTiling Results But, topic shifts tend NOT to occur where corr > 0.

2nd Attempt: Barzilay and Lee (2004) Cluster utterances Treat each cluster as a “state” Construct HMM –Emission probabilities: state-specific language models –Transition probabilities: based on location and cluster-membership of the utterances Viterbi re-estimation until convergence
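
To make the HMM step concrete, here is a minimal sketch of one Viterbi decoding pass in which each state's emission model is a smoothed unigram language model trained on one cluster of utterances. The add-one smoothing, the fixed stay/switch transition probabilities, the toy clusters, and every name in the code are assumptions for illustration; the slides describe transitions derived from utterance location and cluster membership, and repeated re-estimation, neither of which is shown here.

```python
import math
from collections import Counter


def train_unigram(utterances, vocab, alpha=1.0):
    """Log-probabilities of a smoothed unigram model for one cluster."""
    counts = Counter(w for u in utterances for w in u.split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}


def viterbi(utterances, lms, log_trans, log_init):
    """Most likely state sequence, one state per utterance."""
    states = list(lms)

    def emit(state, utterance):
        # Out-of-vocabulary words are skipped in this sketch.
        return sum(lms[state][w] for w in utterance.split() if w in lms[state])

    best = [{s: log_init[s] + emit(s, utterances[0]) for s in states}]
    back = []
    for u in utterances[1:]:
        scores, ptrs = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + emit(s, u)
            ptrs[s] = prev
        best.append(scores)
        back.append(ptrs)
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return path[::-1]


# Hypothetical clusters standing in for the output of the clustering step.
clusters = {
    "pump": ["the pump raises pressure", "pump pressure"],
    "turbine": ["turbine work output", "the turbine output"],
}
vocab = {w for us in clusters.values() for u in us for w in u.split()}
lms = {s: train_unigram(us, vocab) for s, us in clusters.items()}
log_init = {s: math.log(0.5) for s in clusters}
log_trans = {p: {s: math.log(0.8 if p == s else 0.2) for s in clusters}
             for p in clusters}

dialogue = ["pump pressure", "the pump", "turbine output", "turbine work"]
print(viterbi(dialogue, lms, log_trans, log_init))
```

A change of state between consecutive utterances would be read as a topic boundary. A re-estimation loop would regroup utterances by their decoded state, retrain the language models, and decode again until the path stops changing.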

B&L Results [Table: average P_k for NONE, ALL, EVEN, TextTiling, and B&L; t-test p-values for B&L vs. NONE, ALL, EVEN, and TextTiling] B&L statistically better than TT, but not better than degenerate algorithms

B&L Results Topic boundaries too fine-grained Most clusters based on “fixed expressions” (e.g. “ok”, “yeah”, “sure”) Remember: cohesion based on shared task Are state-based language models sufficiently different?

Incorporating Dialogue Dynamics Dialogue Act coding scheme Not originally developed for segmentation, but for discourse analysis of human-tutor dialogues 4 main dimensions: –Action: open question, closed question, negation, etc. –Depth: (yes/no) is the utterance accompanied by explanation or elaboration –Focus: (binary) is the focus on the speaker or the other agent –Control: Initiation, Response, Feedback Dialogue Exchange (Sinclair and Coulthard, 1975)

3rd Attempt: Cross-Dimensional Learning (Donmez, 2004) Use estimated labels on some dimensions to learn other dimensions 3 types of features: –Text (discourse cues) –Lexical coherence (binary) –Dialogue act labels 10-fold cross-validation Topic boundaries learned on estimated labels, not hand-coded ones!

X-Dimensional Learning Results [Table: average P_k for NONE, ALL, EVEN, TextTiling, B&L, and X-DIM; t-test p-values for X-DIM vs. NONE, ALL, EVEN, and TextTiling] X-DIM statistically better than TT and degenerate algorithms!

Statistically Significant Improvement TT B&L X-DIM NONE NON-SIG SIG ALL NON-SIG SIG EVEN NON-SIG SIG TT SIG B&L SIG

Future Directions Merge cross-dimensional learning (w/ dialogue act features) with B&L content modeling HMM approach. Explore other work in topic segmentation of dialogue

Recap InfoMagnets and applications Corpus exploration and authoring of CIs Challenges of topic segmentation of dialogue Description of TextTiling, Barzilay & Lee, X-DIM vs. degenerate methods and each other

Q/A Thank you!