Discourse Analysis Abhijit Mishra (114056002) Samir Janardan Sohoni (114086001) A statistical approach to coreference resolution of noun phrases.

Presentation transcript:

Discourse Analysis Abhijit Mishra (114056002) Samir Janardan Sohoni (114086001) A statistical approach to coreference resolution of noun phrases

What is 'Discourse'?
"A mode of organizing knowledge, ideas or experience that is rooted in language and its concrete contexts" - Merriam-Webster Dictionary
"A continuous stretch of (especially spoken) language larger than a sentence, often constituting a coherent unit such as a sermon, argument, joke or narrative" - Crystal (1992:25)

What happens in Discourse Analysis?
- Analyze the formal and contextual links within a discourse.
- Formal links are built into the language itself.
- Contextual links rely upon world knowledge.

Formal Links in language
- Substitution: use of one, do, so. E.g. Tom produced a nice painting. I told you so long ago.
- Ellipsis: omission of words or clauses. E.g. Prosperity is a great teacher; adversity [is] a greater [one].
- Conjunction: addition, temporal and causal links. E.g. The travellers had lunch, then they rested because they were tired.
- Reference: pronouns and articles draw meaning from other words/contexts. E.g. The neighbours bought a new car; it is nice. It is raining / It is day / It is night.

Motivation
- The ability to link coreferring NPs within and across sentences is important.
- Coreference resolution is important because MT, IE, QA and summarization systems all need it.
- In a natural environment, students learning a new language need to understand the phenomenon.

Problem statement
Coreference is not always limited to pronouns like they, it etc.; it can be a chain of non-pronominals.
Mahatma Gandhi insisted on non-violent means for freedom. He is a key figure in Indian history. Gandhi is also known as the 'father of the nation'.
Coreference chain = (Mahatma Gandhi)-(He)-(Gandhi)
Can we identify coreference chains of noun phrases?

Why a Statistical Approach?
- Rule-based approaches take time, money and trained personnel to write and test the rules.
- Coreference resolution is a semantic-level task which requires a lot of time and effort.
- Statistical methods may not be highly accurate, but they save a lot of time and money.
- The availability of monolingual corpora motivates us to try out quick statistical systems.

Glossary
- Markables: NPs, nested NPs, pronouns etc. that can take part in coreference, e.g. ((Bill Gates)_B, (the chairman)_C of (Microsoft Corp)_D)_A.
- MUC: Message Understanding Conference, an initiative of the US Government and agencies such as DARPA to standardize the data used by participants.

Methodology
1) The training data is a standardized corpus containing chains of coreference-annotated markables.
2) From each annotated chain, every adjacent (bigram) pair of markables is obtained.
3) Based on the features possessed by such pairs, a decision tree is learnt.
4) For testing, markables are extracted from the test data.
5) Marker pairs are presented to the classifier and coreference chains are extracted.

Processing Pipeline
Free text
-> Tokenization & sentence segmentation
-> Morphological processing
-> POS tagging (HMM-based)
-> Noun phrase identification
-> Named entity recognition
-> Nested noun phrase extraction
-> Markables
-> Feature extraction
-> Classifier
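The pipeline can be sketched as a chain of processing stages. The snippet below is a minimal, hypothetical illustration: every stage is a stub (tokenize, pos_tag and extract_markables are names chosen here, not components named on the slide), and the point is only the order in which free text is turned into markables ready for feature extraction.

    # Hypothetical sketch of the markable-extraction pipeline. Each stage is a
    # stub standing in for a real component (tokenizer, HMM POS tagger, NP
    # chunker, NER); only the order of processing is meant to be illustrative.

    def tokenize(text):
        # Stand-in for tokenization and sentence segmentation.
        return [sent.split() for sent in text.split(".") if sent.strip()]

    def pos_tag(sentences):
        # Stand-in for an HMM-based POS tagger: pair each token with a dummy tag.
        return [[(tok, "NN") for tok in sent] for sent in sentences]

    def extract_markables(tagged_sentences):
        # Stand-in for NP identification, NER and nested-NP extraction.
        # Here every capitalised token becomes a "markable" with its sentence index.
        markables = []
        for s_idx, sent in enumerate(tagged_sentences):
            for tok, _tag in sent:
                if tok[:1].isupper():
                    markables.append({"text": tok, "sentence": s_idx})
        return markables

    def pipeline(free_text):
        # Free text -> tokens -> POS tags -> markables (input to feature extraction).
        return extract_markables(pos_tag(tokenize(free_text)))

    print(pipeline("Frank Newman is a banker. He joined BankAmerica Corp."))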

Features
Properties of a discourse which help decide whether two markables corefer or not.
- Should be domain independent.
- Should not be too difficult to compute.
For a marker pair we consider 12 different kinds of features. Consider this example:
Separately, Clinton transition officials said that Frank Newman, 50, vice chairman and chief financial officer of BankAmerica Corp., is expected to be nominated as assistant Treasury secretary for domestic finance.
Marker i = "Frank Newman" and Marker j = "Vice chairman"
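For concreteness, the feature values that the next few slides work out for the pair i = "Frank Newman", j = "Vice chairman" can be collected into a single vector. The sketch below is only an illustrative representation (a Python dict with informal names for the features the slides describe); values not stated explicitly on the slides are marked as assumptions in the comments.

    # Illustrative feature vector for the marker pair
    #   i = "Frank Newman", j = "Vice chairman".
    # Values follow the worked examples on the following slides; entries marked
    # "assumption" are not stated on the slides.
    example_pair_features = {
        "dist": 0,                  # same sentence
        "i_pron": False,            # "Frank Newman" is not a pronoun (assumption by inspection)
        "j_pron": False,            # "Vice chairman" is not a pronoun
        "def_np": False,            # "Vice chairman" is not a definite NP
        "dem_np": False,            # nor a demonstrative NP (assumption by inspection)
        "num": True,                # both markables are singular
        "gender": "unknown",        # no designator such as "Mr"/"Mrs" in the pair (assumption)
        "both_proper_noun": False,  # "Vice chairman" is not a proper noun (assumption by inspection)
        "alias": False,             # neither is an alias of the other (assumption by inspection)
        "appositive": True,         # "Vice chairman" is in apposition to "Frank Newman"
        "semclass": True,           # both head words denote persons
    }
    print(example_pair_features)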

Features (contd.)
Distance feature: f_dist
- Possible values: 0, 1, 2, 3, ...
- Captures the distance between i and j: if i and j are in the same sentence, f_dist(i,j) = 0; if they are one sentence apart, f_dist(i,j) = 1, and so on.
- E.g. f_dist(Frank Newman, Vice chairman) = 0
I-Pronoun and J-Pronoun: f_i_pron, f_j_pron
- Possible values: true, false
- If i is a pronoun then f_i_pron(i,j) = true; similarly, if j is a pronoun then f_j_pron(i,j) = true.
- Pronouns include reflexive, personal and possessive pronouns.
- E.g. f_j_pron(Frank Newman, Vice chairman) = false
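A minimal sketch of these two kinds of features, assuming each markable is represented simply as a (text, sentence_index) pair; the pronoun list and the single-argument f_pron helper are simplifications introduced here for illustration.

    # Sketch of the distance and pronoun features. Markables are represented as
    # (text, sentence_index) tuples; the pronoun set is an illustrative subset.

    PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "hers",
                "its", "their", "himself", "herself", "itself", "themselves"}

    def f_dist(i, j):
        # Number of sentences separating the two markables (0 = same sentence).
        return abs(j[1] - i[1])

    def f_pron(m):
        # True if the markable is a personal, possessive or reflexive pronoun;
        # f_i_pron and f_j_pron apply this test to i and j respectively.
        return m[0].lower() in PRONOUNS

    i = ("Frank Newman", 0)
    j = ("Vice chairman", 0)
    print(f_dist(i, j))  # 0
    print(f_pron(j))     # False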

Features (contd.)
Definite and demonstrative NP: f_def, f_dem
- If j is a definite NP (e.g. "the car") then f_def(i,j) = true; if j is a demonstrative NP (e.g. "that boy") then f_dem(i,j) = true.
- E.g. f_def(Frank Newman, Vice chairman) = false
Number and Gender: f_num and f_gender
- If i and j agree in number then f_num(i,j) = true.
- If i and j agree in gender then f_gender(i,j) = true; f_gender can take three values.
- Designators and pronouns such as "Mr", "Mrs", "she", "he" are used to determine gender.
- E.g. f_num(Frank Newman, Vice chairman) = true
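These checks lend themselves to simple string and tag tests. The sketch below uses crude surface heuristics (article and demonstrative word lists, a final "s" as the plural test) purely for illustration; a real system would rely on the POS tags and designator lists described above.

    # Sketch of the definite-NP, demonstrative-NP and number-agreement features.
    # Markables are plain strings here; the heuristics are illustrative only.

    DEMONSTRATIVES = {"this", "that", "these", "those"}

    def f_def(j):
        # True if j starts with the definite article, e.g. "the car".
        return j.lower().startswith("the ")

    def f_dem(j):
        # True if j starts with a demonstrative, e.g. "that boy".
        return j.split()[0].lower() in DEMONSTRATIVES

    def is_plural(np):
        # Crude plural test: head noun ends in "s" (a real system would use
        # the POS tag, NNS/NNPS vs. NN/NNP).
        return np.split()[-1].lower().endswith("s")

    def f_num(i, j):
        # True if the two markables agree in number.
        return is_plural(i) == is_plural(j)

    print(f_def("the carrier"))                    # True
    print(f_dem("that boy"))                       # True
    print(f_num("Frank Newman", "Vice chairman"))  # True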

Features (contd.)
Both-Proper-Noun: if both i and j are proper nouns, return true.
Alias feature: if i is an alias of j, return true.
Appositive feature: if j is an apposition to i, return true.
- E.g. f_appositive(Frank Newman, Vice chairman) = true
Semantic class agreement feature: f_semclass
- Each marker's head word is assigned one of a predefined set of semantic classes; semantic class labelling is done by finding the class label closest to the first sense of the head word in a marker.
- E.g. f_semclass(Frank Newman, Vice chairman) = true, since both i and j correspond to persons.
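The alias feature, for example, can be approximated with acronym and word-overlap tests. The sketch below is a rough approximation for illustration only, not the exact rules used by Soon, Ng and Lim (2001).

    # Rough sketch of the alias feature: one markable is an acronym or a
    # shortened form of the other.

    def acronym(np):
        # Build an acronym from the capitalised words, e.g. "Eastern Airlines" -> "EA".
        return "".join(w[0] for w in np.split() if w[0].isupper())

    def f_alias(i, j):
        shorter, longer = sorted((i, j), key=len)
        return (shorter == acronym(longer)      # acronym match
                or shorter in longer.split())   # shortened-form match, e.g. "Eastern" / "Eastern Airlines"

    print(f_alias("Eastern Airlines", "Eastern"))    # True
    print(f_alias("Frank Newman", "Vice chairman"))  # False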

Training Data
(Eastern Air)_1 proposes (date)_2 for (talks)_3 on (pay-cut plan)_4. ((Eastern Airlines)_5 executives)_6 notified ((union)_7 leaders)_8 that (the carrier)_9 wishes to discuss (selective (wage)_10 reductions)_11 on (Feb. 3)_12. ((Union)_13 representatives)_14 who could be reached said (they)_15 hadn't decided whether (they)_16 would respond. By proposing (a meeting date)_17, (Eastern)_18 moved (one step)_19 closer towards reopening (current high-cost contract agreements)_20 with ((its)_21 unions)_22)_23.
Pairs such as ((union)_7, (Union)_13) and ((Union)_13, (its unions)_22) belong to one coreference chain, while pairs such as ((the carrier)_9, (Union)_13) and ((wage)_10, (Union)_13) are not in any chain.
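A minimal sketch of how such annotations become training pairs, following the bigram scheme on the Methodology slide: adjacent markables within a chain become positive examples, and pairs the slide marks as "not in any chain" serve as negative examples. The exact negative-sampling scheme is an assumption made here for illustration.

    # Sketch: deriving training pairs from an annotated chain. The indices
    # follow the slide: markables 7, 13 and 22 form the "union" chain.

    union_chain = [7, 13, 22]              # (union)_7, (Union)_13, (its unions)_22

    positive_pairs = list(zip(union_chain, union_chain[1:]))  # adjacent (bigram) pairs
    negative_pairs = [(9, 13), (10, 13)]   # (the carrier)_9 / (wage)_10 vs. (Union)_13

    print(positive_pairs)  # [(7, 13), (13, 22)]
    print(negative_pairs)  # [(9, 13), (10, 13)]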

Training
The C5 decision-tree algorithm is used to learn a decision tree from the training data. It is an updated version of the ID3 algorithm, in which the feature selected at each node is the one that provides maximum information gain:
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Entropy(S) = - Σ_i Pr(x_i) · log Pr(x_i)
where S_v is the subset of S for which attribute A takes value v. C5 has a better pruning mechanism and also handles training data with missing attribute values.
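The gain computation itself is a few lines of code. The sketch below is a generic, textbook implementation of the two formulas above on a toy data set; it is not the C5 implementation, which additionally handles pruning and missing attribute values.

    # Generic information-gain computation matching the formulas above.
    from collections import Counter
    from math import log2

    def entropy(labels):
        # Entropy(S) = - sum_i p_i * log2(p_i)
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, attribute):
        # Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
        labels = [label for _, label in examples]
        split = {}
        for features, label in examples:
            split.setdefault(features[attribute], []).append(label)
        remainder = sum(len(subset) / len(examples) * entropy(subset)
                        for subset in split.values())
        return entropy(labels) - remainder

    # Toy data: (feature dict, coreferent?) pairs.
    data = [({"appositive": True}, True), ({"appositive": False}, False),
            ({"appositive": True}, True), ({"appositive": False}, True)]
    print(information_gain(data, "appositive"))  # about 0.31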

Testing
Algorithm (Document D, Decision_Tree T):
    M = get_markers_from_document(D)
    chains = empty list
    for j = 2 to |M|:
        for i = 1 to j - 1:
            F = get_feature_vector(i, j)
            # get the class from the decision tree
            corefer = get_corefer(F, T)
            if corefer:
                j.antecedent = i
    for j = |M| down to 2:
        chain = back_track(j)
        chains.add(chain)
    return chains
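The back_track step can be made concrete as follows. This sketch assumes each markable stores the index of the antecedent chosen for it by the classifier (or nothing if no candidate was accepted), a detail not spelled out on the slide; the indices mirror the testing example on the next slide.

    # Sketch of chain extraction by following antecedent links backwards.
    # antecedent[j] is the antecedent chosen for markable j (absent if none).

    def back_track(j, antecedent):
        chain = [j]
        while antecedent.get(chain[-1]) is not None:
            chain.append(antecedent[chain[-1]])
        return list(reversed(chain))

    antecedent = {76: 73, 80: 76}      # (her)_76 -> (Ms. Washington)_73, (She)_80 -> (her)_76
    print(back_track(80, antecedent))  # [73, 76, 80]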

Testing Example
(Ms. Washington)_73's candidacy is being championed by (several lawmakers)_74 including ((her)_76 boss)_75, (chairman John Dingell)_77 (D., (Mich.)_78) of (the House Energy and Commerce Committee)_79. (She)_80 currently is (a counsel)_81 to (the committee)_82. (Ms. Washington)_83 and (Mr. Dingell)_84 have been considered (allies)_85 of (the (securities)_87 exchanges)_86, while (banks)_88 and ((futures)_90 exchanges)_89 have often fought with them.

Testing (contd.)
Courtesy: Soon, Ng, Lim (2001)
The coreference chain is (Ms. Washington)_73 - (her)_76 - (She)_80.

Evaluation Courtesy: Soon, Ng, Lim (2001)

Evaluation (contd..) Courtesy: Soon, Ng, Lim (2001)

Precision Errors (false +ve)

Recall Errors (false -ve)

Conclusion and Further Improvements
- Works on a small annotated corpus.
- Domain and language independent.
- Resolves noun phrase coreference in general; not limited to pronominal coreference resolution.
- Verb suffixes could be used to determine gender in morphologically rich languages; similarly, other language-specific properties can be taken into consideration.
- This is a sequence labelling problem, so techniques like HMMs and CRFs could be applied instead of pairwise classifiers like decision trees.

References
Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
Crystal, David. 1992. An encyclopedic dictionary of language and languages. Cambridge, MA: Blackwell.
Kamil Wiśniewski. 2006. angielski.info/linguistics/discourse.htm
Quinlan, John Ross. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA.

Thank you