Bo Lin, Kevin Dela Rosa, Rushin Shah

• As part of our research, we are working on a cross-document co-reference resolution system.
• Co-reference resolution: extract all noun phrases from a document (names, descriptions, pronouns) and cluster them according to the real-world entity they describe.
• Each such cluster → a chain.
• Within-doc: cluster NPs from a single document.
• Cross-doc: cluster NPs from different documents.
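
To make the chain/entity terminology concrete, here is a minimal sketch of the data structures involved (class and field names are illustrative, not the system's actual code):

# Illustrative data structures for mentions, within-document chains, and
# cross-document entity clusters (hypothetical names, for explanation only).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Mention:
    text: str       # surface form of the NP, e.g. a name, description, or pronoun
    doc_id: str     # document the mention came from
    sentence: str   # sentence containing the mention

@dataclass
class Chain:
    """Within-doc cluster: all mentions of one entity inside a single document."""
    doc_id: str
    mentions: List[Mention] = field(default_factory=list)

@dataclass
class Entity:
    """Cross-doc cluster: chains from different documents describing the same real-world entity."""
    chains: List[Chain] = field(default_factory=list)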

• Run a WDC system on a document and extract chains corresponding to real-world entities. For each chain, track all sentences from which its mentions are obtained.
• Features over pairs of such chains:
  ◦ SoftTFIDF similarity between names
  ◦ All-words TFIDF cosine similarity (over sentences)
  ◦ Named Entity (NE) TFIDF cosine similarity (over sentences)
  ◦ NE SoftTFIDF cosine similarity (over sentences)
  ◦ Semantic similarity between the NPs of each chain
• Train an SVM that classifies pairs of chains as co-referent or not.
• Use the SVM to cluster all chains from all documents in the corpus.
• Store this clustering in a database. Each entity → a list of chains.
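
For illustration, a sketch of how one of these pairwise features could be computed with scikit-learn (a hypothetical helper, not the project's actual feature code):

# Sketch of the all-words TFIDF cosine feature over the sentences tracked for two chains.
# In practice the vectorizer would be fit on the whole corpus rather than on two chains.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def all_words_tfidf_similarity(chain_a_sentences, chain_b_sentences):
    """TFIDF cosine similarity between the concatenated sentences of two chains."""
    docs = [" ".join(chain_a_sentences), " ".join(chain_b_sentences)]
    tfidf = TfidfVectorizer().fit_transform(docs)
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# Each pair of chains, scored with this and the other features above, becomes one
# training instance for the SVM that decides co-referent vs. not co-referent.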

Augment our CDC system with the following:
• Attribute Extraction
  ◦ For each chain, extract attributes such as gender, occupation, nationality, and birthdate, if they exist
  ◦ Use attributes to enable the SVM to do better co-reference
• Relationship Extraction
  ◦ For each pair of chains, extract a relationship (e.g. part-of, role, etc.), if one exists
  ◦ Use relationships for better visualization of clusters

• Patterns: take seed examples of (entity, attribute) pairs and learn extraction patterns
• Position: use typical document positions of attributes
• Transitive: use attributes of neighboring entities
• Latent: use document-level topic models to infer attributes
• References: Ravichandran & Hovy '02, Garera & Yarowsky '09

• Idea: attribute values should appear with linguistic clues around them, i.e. extraction can be framed as a probabilistic language model describing the chance of a word being an attribute value given its context. This idea is essentially the same as in KnowItAll or Brin '98.
• Example:
  ◦ Marked text: "(born ) is a former American, active …"
  ◦ Generated context: "(born X)" for X = , "a former American X" for X = …
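
As an illustration of the pattern idea, a small sketch of generating contexts around a marked attribute value (the sentence, window size, and function name are made up for this example):

# Context-generation sketch: find the marked attribute value in a tokenized sentence
# and record the surrounding words with the value replaced by X (illustrative only).
def generate_contexts(sentence, attribute_value, window=2):
    tokens = sentence.split()
    value = attribute_value.split()
    contexts = []
    for i in range(len(tokens) - len(value) + 1):
        if tokens[i:i + len(value)] == value:
            left = tokens[max(0, i - window):i]
            right = tokens[i + len(value):i + len(value) + window]
            contexts.append(" ".join(left + ["X"] + right))
    return contexts

# generate_contexts("Smith ( born 1970 ) is a former American guitarist", "1970")
# -> ["( born X ) is"]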

• Idea: certain biographical attributes tend to appear in characteristic positions, often near the top of an article. The relative position/rank between attributes can be helpful information as well.
• Example: the birthdate is often the first date in a biography text, or at least near the beginning of the article, and the deathdate, when present, almost always occurs somewhere after it.
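
A minimal sketch of the absolute-position heuristic (the regex and the "first date wins" rule are simplifications for illustration):

# Position-based sketch: treat the first year-like token in the article as the birthdate
# candidate and a later one as the deathdate candidate (crude, for illustration only).
import re

YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

def position_based_dates(article_text):
    years = YEAR.findall(article_text)
    birth = years[0] if years else None
    death = years[1] if len(years) > 1 else None   # deathdate tends to come after birthdate
    return birth, death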

• Idea: the intuition is that a named entity is more likely to be mentioned together with entities that have similar attribute values (the most applicable attributes seem to be "occupation" and "religion").
• Example: "Michael Jordan" (the player) is mostly mentioned together with "Wilt Chamberlain", "Dennis Rodman", and fellow players.
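
A sketch of how the transitive signal could be used: vote over the attribute values of co-mentioned entities (the co-occurring entities and their attributes are assumed to be computed elsewhere):

# Transitivity sketch: predict an entity's attribute by majority vote over the attribute
# values of the entities it is mentioned together with (inputs assumed precomputed).
from collections import Counter

def transitive_attribute(neighbor_attribute_values):
    counts = Counter(v for v in neighbor_attribute_values if v is not None)
    return counts.most_common(1)[0][0] if counts else None

# transitive_attribute(["basketball player", "basketball player", "coach"])
# -> "basketball player"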

• Idea: use "latent wide-document-context" models to detect attributes that may not have been mentioned directly in the article.
• Example: words such as "songs", "album", and "recorded" can collectively indicate an occupation of singer or musician.
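
A very rough stand-in for the latent model: score occupations by indicative vocabulary anywhere in the document. The indicator word lists below are invented for illustration; the actual approach would learn such associations with a document-level topic model.

# Latent-context sketch: score candidate occupations by how many of their indicator
# words occur anywhere in the document (indicator sets are invented for illustration).
INDICATORS = {
    "singer":     {"songs", "album", "recorded", "tour", "vocals"},
    "politician": {"elected", "senate", "campaign", "governor", "vote"},
}

def latent_occupation(document_tokens):
    tokens = {t.lower() for t in document_tokens}
    scores = {occ: len(words & tokens) for occ, words in INDICATORS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None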

• Three options:
  ◦ Extract a variety of features and train YFCL (your favorite classification learner)
  ◦ Define kernels to measure similarity between instances and plug them into an SVM
  ◦ Semi-supervised approach: start with seed examples, learn patterns, iterate, and grow a KB (e.g. KnowItAll, TextRunner)
• Ideally we would prefer the semi-supervised approach, especially since it allows open-domain IE, but it is very time- and labor-intensive.
• The kernel approach is more elegant and works better than YFCL.
• Therefore, we proposed to use kernels.

• Attributes: NNDB seed pairs, Wikipedia pages
• Relations: ACE Phase 2 top-level relations
• For our CDC system:
  ◦ John Smith corpus
  ◦ WePS (person name disambiguation) corpora

• Data preparation
  ◦ Collected an attribute extraction data set
    ▪ Currently our seed pairs have occupation, date-of-birth, and date-of-death values
    ▪ We plan on collecting birthplace, gender, and nationality
  ◦ Implemented the program to parse simple Wikipedia pages
• Framework (a sketch of this pipeline follows below)
  ◦ Implemented a pipeline process which consists of:
    ▪ Loading the list of names and pages
    ▪ Parsing Wikipedia pages
    ▪ Sentence segmentation
    ▪ POS tagging / NE tagging (if needed)
• Modeling and extraction
  ◦ Provide the capability to support different models for different attributes
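
A sketch of how that pipeline could be wired together (the stage functions are placeholders supplied by the caller, not the project's actual modules):

# Pipeline sketch: each stage is a callable so optional stages (POS/NE tagging) can be
# switched in per attribute model. Stage implementations are assumed to exist elsewhere.
def run_pipeline(name_page_pairs, parse_page, segment_sentences, taggers=()):
    """name_page_pairs: iterable of (entity name, raw Wikipedia page) tuples."""
    for name, raw_page in name_page_pairs:
        text = parse_page(raw_page)           # strip markup, keep article text
        sentences = segment_sentences(text)   # sentence segmentation
        for tag in taggers:                   # optional POS / NE tagging stages
            sentences = tag(sentences)
        yield name, sentences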

• Implementations
  ◦ Implemented the context/pattern-based model, currently focusing on occupations
  ◦ Implemented position-based models (absolute & relative) for occupation and date-of-birth extraction
  ◦ In progress: transitivity-based model for occupation extraction
  ◦ In progress: latent wide-document-context models for extracting implicit attributes

• Issues
  ◦ The actual implementation raises many problems missed in the paper, such as whether POS tagging should be included in the model; the pipeline framework is intended to solve this issue.
• Plan
  ◦ Finish the implementations and compare against two baseline models for each attribute
  ◦ Extend the code to work on chains and consolidate the attributes
  ◦ Use the extracted attributes as features in the CDC system's SVM classifier to help co-reference resolution

• Evaluated different kinds of kernels (subsequence, dependency tree, etc.).
• Chose subsequence kernels, as these perform well and are robust to ill-formed documents.
• As a template, decided to follow the Bunescu and Mooney paper discussed in class.
• Observation: although the idea in the paper is easy to understand, the implementation is quite complex.
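
For reference, a compact sketch of the gap-weighted subsequence kernel (Lodhi et al. 2002) that the Bunescu & Mooney relation kernel builds on; this is the generic algorithm, not the project's code. Here s and t can be lists of words and lam is the gap penalty.

# Gap-weighted subsequence kernel: counts common subsequences of length n in s and t,
# each weighted by lam ** (total span covered in both sequences).
def subsequence_kernel(s, t, n, lam=0.5):
    # Kp[i][a][b] = K'_i over prefixes s[:a], t[:b] (auxiliary quantity from the paper)
    Kp = [[[0.0] * (len(t) + 1) for _ in range(len(s) + 1)] for _ in range(n)]
    for a in range(len(s) + 1):
        for b in range(len(t) + 1):
            Kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(1, len(s) + 1):
            for b in range(1, len(t) + 1):
                matches = 0.0
                for j in range(1, b + 1):
                    if t[j - 1] == s[a - 1]:
                        matches += Kp[i - 1][a - 1][j - 1] * lam ** (b - j + 2)
                Kp[i][a][b] = lam * Kp[i][a - 1][b] + matches
    k = 0.0
    for a in range(1, len(s) + 1):
        for b in range(1, len(t) + 1):
            if s[a - 1] == t[b - 1]:
                k += lam ** 2 * Kp[n - 1][a - 1][b - 1]
    return k

# Example: similarity of two tokenized sentences via their common 3-word subsequences.
# subsequence_kernel("he was born in Boston".split(), "she was raised in Boston".split(), 3)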

• For the SVM, decided to use LibSVM. Bunescu has a modified version that can take custom kernels, but it is based on an older release.
• Made updates to LibSVM to reconcile the different versions:
  ◦ Now accepts custom/pre-computed kernels
  ◦ Allows richer representations of instances than numerical values (e.g. sentence fragments, named entities, etc.)
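
The same precomputed-kernel idea, shown with scikit-learn rather than the modified LibSVM used in the project (the kernel function and instance representation are whatever the relation kernel defines):

# Precomputed-kernel sketch: build the Gram matrix from a custom kernel over rich
# instances, then fit an SVM on it. Not the project's modified LibSVM.
import numpy as np
from sklearn.svm import SVC

def train_with_custom_kernel(instances, labels, kernel_fn):
    n = len(instances)
    gram = np.array([[kernel_fn(instances[i], instances[j]) for j in range(n)]
                     for i in range(n)])
    clf = SVC(kernel="precomputed")
    clf.fit(gram, labels)
    return clf

# Prediction works the same way: pass the kernel values between test and training
# instances as a (num_test x num_train) matrix to clf.predict(...).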

• As we know, the hardest part of coding machine learning applications isn't the classifier (plenty of libraries exist), but feature extraction.
• Wrote feature extraction code that extracts sentences from XML and produces instances for the SVM, where each instance consists of:
  ◦ The 2 named entities of interest
  ◦ Sentence fragments before, between, and after the NEs
• If a sentence contains more than 2 NEs, make C(N, 2) copies, one per pair of NEs.
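
A sketch of that instance construction (entity spans are assumed to be given as token offsets by the NE tagger; the function name is hypothetical):

# Instance construction sketch: for every pair of named entities in a sentence, keep the
# two entities plus the token fragments before, between, and after them.
from itertools import combinations

def make_instances(tokens, entity_spans):
    """tokens: tokenized sentence; entity_spans: list of (start, end) token offsets, one per NE."""
    instances = []
    for (s1, e1), (s2, e2) in combinations(sorted(entity_spans), 2):  # C(N, 2) copies per sentence
        instances.append({
            "entity_1": " ".join(tokens[s1:e1]),
            "entity_2": " ".join(tokens[s2:e2]),
            "before":   tokens[:s1],
            "between":  tokens[e1:s2],
            "after":    tokens[e2:],
        })
    return instances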

• Implemented the general subsequence kernel algorithm.
• Currently working on adapting this to the specific case of relation kernels (as explained in Bunescu & Mooney).
• Once done, will extend this code (which works on sentences) to chains, so it can be plugged into the CDC system.