Information Extraction with Unlabeled Data Rayid Ghani Joint work with: Rosie Jones (CMU) Tom Mitchell (CMU & WhizBang! Labs) Ellen Riloff (University of Utah)

The Vision: [system diagram: a training program learns extractor models from training sentences and answers; the extractor produces entities, relations, and events, feeding a database, time line, geo display, link analysis, and tables]

What is IE?
Analyze unrestricted text in order to extract information about pre-specified types of events, entities, or relationships.

Practical / Commercial Applications
- Database of job postings extracted from corporate web pages (flipdog.com)
- Extracting specific fields from resumes to populate HR databases (mohomine.com)
- Information integration (fetch.com)
- Shopping portals

Where is the world now?
- MUC helped drive information extraction research, but most systems were fine-tuned for terrorist activities
- Commercial systems can detect names of people, locations, and companies (only for proper nouns)
- Very costly to train and port to new domains:
  - 3-6 months to port to a new domain (Cardie 98)
  - 20,000 words to learn named entity extraction (Seymore et al. 99)
  - 7,000 labeled examples to learn MUC extraction rules (Soderland 99)

IE Approaches
- Hand-Constructed Rules
- Supervised Learning
- Semi-Supervised Learning

Goal
Can you start with 5-10 seeds and learn to extract other instances?
Example tasks:
- Locations
- Products
- Organizations
- People

Aren’t you missing the obvious?
Not really! The obvious approach is to acquire lists of proper nouns:
- Locations: countries, states, cities
- Organizations: online databases
- People: names
But not all instances are proper nouns: *by the river*, *customer*, *client*

Use context to disambiguate
- A lot of NPs are unambiguous: “The corporation”
- A lot of contexts are also unambiguous: “subsidiary of”
- But as always, there are exceptions... and a LOT of them in this case: “customer”, John Hancock, Washington

Bootstrapping Approaches
Utilize redundancy in text:
- Noun phrases: New York, China, place we met last time
- Contexts: located in, traveled to
Learn two models:
- Use NPs to label contexts
- Use contexts to label NPs
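The two-view labeling loop described on this slide can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation; the count-based scoring and top-k promotion heuristic are assumptions for the sketch.

```python
def bootstrap(pairs, seed_nps, iterations=3, top_k=3):
    """Mutual bootstrapping over (noun-phrase, context) co-occurrence pairs.
    Known NPs promote the contexts they occur in; trusted contexts then
    promote new NPs, and the loop repeats."""
    nps, contexts = set(seed_nps), set()
    for _ in range(iterations):
        # Score each context by how many known NPs it co-occurs with.
        ctx_scores = {}
        for np, ctx in pairs:
            if np in nps:
                ctx_scores[ctx] = ctx_scores.get(ctx, 0) + 1
        contexts |= {c for c, _ in sorted(ctx_scores.items(),
                                          key=lambda kv: -kv[1])[:top_k]}
        # Score each NP by how many trusted contexts it appears in.
        np_scores = {}
        for np, ctx in pairs:
            if ctx in contexts:
                np_scores[np] = np_scores.get(np, 0) + 1
        nps |= {n for n, _ in sorted(np_scores.items(),
                                     key=lambda kv: -kv[1])[:top_k]}
    return nps, contexts

# Toy corpus of (NP, context) co-occurrences.
pairs = [("New York", "located in"), ("China", "located in"),
         ("Paris", "located in"), ("Paris", "traveled to"),
         ("London", "traveled to")]
nps, ctxs = bootstrap(pairs, {"New York", "China"})
```

Starting from the seeds "New York" and "China", the trusted context "located in" is learned first, which in turn promotes "Paris" as a new location NP.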

Algorithms for Bootstrapping
- Meta-Bootstrapping (Riloff & Jones, 1999)
- Co-Training (Blum & Mitchell, 1998)
- Co-EM (Nigam & Ghani, 2000)
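In contrast to hard-label bootstrapping, co-EM keeps soft labels and re-estimates each view from the other on every pass. The sketch below is a heavily simplified single-class version; the clamping of seeds to probability 1 and the uniform 0.5 prior are assumptions for illustration, not details from the talk.

```python
def co_em(pairs, seed_nps, iterations=5):
    """Simplified co-EM over two views (NPs and contexts).
    pairs: (np, context) co-occurrences.
    Returns P(class | np) for each NP, with seeds clamped to 1.0."""
    np_prob = {np: (1.0 if np in seed_nps else 0.5) for np, _ in pairs}
    for _ in range(iterations):
        # Estimate context labels as the mean label of co-occurring NPs.
        totals, counts = {}, {}
        for np, ctx in pairs:
            totals[ctx] = totals.get(ctx, 0.0) + np_prob[np]
            counts[ctx] = counts.get(ctx, 0) + 1
        ctx_prob = {c: totals[c] / counts[c] for c in totals}
        # Re-estimate NP labels from contexts, keeping seeds clamped.
        totals, counts = {}, {}
        for np, ctx in pairs:
            totals[np] = totals.get(np, 0.0) + ctx_prob[ctx]
            counts[np] = counts.get(np, 0) + 1
        np_prob = {n: (1.0 if n in seed_nps else totals[n] / counts[n])
                   for n in totals}
    return np_prob

pairs = [("New York", "located in"), ("China", "located in"),
         ("Paris", "located in"), ("Paris", "traveled to"),
         ("London", "traveled to")]
probs = co_em(pairs, {"New York", "China"})
```

After a few iterations, "Paris" (which shares the "located in" context with the seeds) ends up with a higher location probability than "London" (which only shares the weaker "traveled to" context).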

Data Set
- ~5000 corporate web pages (4000 for training)
- Test data marked up manually by labeling every NP as one or more of the following semantic categories: location, organization, person, product, none
- Preprocessed (parsed) to generate extraction patterns using AutoSlog (Riloff, 1996)

Evaluation Criteria
- Every test NP is labeled with a confidence score by the learned model
- Calculate precision and recall at different thresholds:
  - Precision = Correct / Found
  - Recall = Correct / Max that can be found
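These definitions translate directly into code. The sketch below computes precision and recall at a given confidence threshold; the scored NPs and gold labels are made-up data for illustration.

```python
def precision_recall_at(scored, gold, threshold):
    """scored: dict mapping NP -> confidence; gold: set of correct NPs.
    Precision = correct / found; recall = correct / max that can be found."""
    found = {np for np, score in scored.items() if score >= threshold}
    correct = found & gold
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical model output and gold labels for the "location" class.
scored = {"New York": 0.9, "Paris": 0.8, "customer": 0.4, "IBM": 0.3}
gold = {"New York", "Paris", "Tokyo"}
p, r = precision_recall_at(scored, gold, threshold=0.5)
```

At threshold 0.5 the model finds two NPs, both correct (precision 1.0), but misses "Tokyo" (recall 2/3); lowering the threshold trades precision for recall.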

Seeds

Results

Active Learning
Can we do better by keeping the user in the loop? If we can ask the user to label some examples, which examples should they be?
- Selected randomly
- Selected according to their density/frequency
- Selected according to disagreement between NP and context (KL divergence to the mean, weighted by density)

NP–Context Disagreement: KL Divergence
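The disagreement measure from the previous slide (KL divergence to the mean, weighted by density) can be sketched as follows. The two-class distribution representation and the exact form of the density weighting are assumptions for this illustration.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions (lists summing to 1)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def disagreement(p_np, p_ctx, density):
    """KL-to-the-mean disagreement between the NP view's and the context
    view's class distributions for one example, weighted by its density."""
    mean = [(a + b) / 2 for a, b in zip(p_np, p_ctx)]
    return density * (kl(p_np, mean) + kl(p_ctx, mean)) / 2
```

Examples on which the two views agree score zero, while frequent examples on which the NP model and the context model make opposite predictions score highest, and so are the best candidates to show the user.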

Results

What if you’re really lazy?
- Previous experiments assumed a training set was available
- What if you don’t have a set of documents that can be used to train?
- Can we start from only the seeds?

Collecting Training Data from the Web
Use the seed words to generate web queries.
Simple approaches:
- For each seed word, fetch all documents returned
- Only fetch documents where N or more seed words appear
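The second simple approach (keep only documents containing N or more seed words) might look like the sketch below; the whitespace tokenization and exact matching are simplifications for illustration.

```python
def filter_documents(docs, seeds, min_seeds=2):
    """Keep only documents mentioning at least min_seeds distinct seed words."""
    kept = []
    for doc in docs:
        words = set(doc.lower().split())
        if len({s for s in seeds if s.lower() in words}) >= min_seeds:
            kept.append(doc)
    return kept

docs = ["Flights to Paris and London daily",
        "Paris fashion week schedule",
        "nothing relevant here"]
seeds = {"Paris", "London", "Tokyo"}
kept = filter_documents(docs, seeds, min_seeds=2)
```

Requiring multiple seed hits filters out documents that merely mention one seed in an unrelated sense, at the cost of fetching fewer pages.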

Collecting Training Data from the Web: [pipeline diagram: seeds feed a query generator, queries go out to the WWW, and returned documents pass through a text filter]

Interleaved Data Collection
1. Select a seed word with uniform probability
2. Get documents containing that seed word
3. Run bootstrapping on the new documents
4. Select new seed words that are learned with high confidence
5. Repeat
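The steps above can be sketched with stubbed-out fetch and bootstrap hooks. Both hooks, the confidence threshold, and the round count are assumptions for illustration; a real system would call a search engine and the bootstrapping learner here.

```python
import random

def interleaved_collection(seeds, fetch, bootstrap_step, rounds=5, conf=0.8):
    """Alternate between fetching documents for a uniformly chosen seed
    and a bootstrapping step that may promote new high-confidence seeds.
    fetch(word) and bootstrap_step(corpus) are caller-supplied hooks."""
    pool = list(seeds)
    corpus = []
    for _ in range(rounds):
        word = random.choice(pool)          # uniform over current seed pool
        corpus.extend(fetch(word))          # get docs containing that seed
        for np, score in bootstrap_step(corpus):
            if score >= conf and np not in pool:
                pool.append(np)             # promote confident NPs to seeds
    return pool, corpus

# Stub hooks: each fetch returns one page; bootstrapping always
# proposes "Paris" with high confidence once any documents exist.
fetch = lambda word: ["page mentioning " + word]
bootstrap_step = lambda corpus: [("Paris", 0.9)] if corpus else []
pool, corpus = interleaved_collection(["Tokyo"], fetch, bootstrap_step)
```

Because newly learned seeds immediately become candidate query words, the collected corpus broadens over rounds instead of staying anchored to the initial seed list.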

Seed-Word Density

Summary
- Starting with 10 seed words, extract NPs matching specific semantic classes
- Probabilistic bootstrapping is an effective technique
- Asking the user helps only if done intelligently
- The Web is an excellent resource for training data that can be collected automatically => Personal Information Extraction Systems