Open IE and Universal Schema Discovery Heng Ji Acknowledgement: some slides from Daniel Weld and Dan Roth.


Traditional, Supervised I.E.  Raw Data → Labeled Training Data → Learning Algorithm → Extractor  Kirkland-based Microsoft is the largest software company. Boeing moved its headquarters to Chicago. Hank Levy was named chair of Computer Science & Engr. … → HeadquarterOf(_, _)

Solutions  Open IE  Universal Schema Discovery  Concept Typing

Open Information Extraction  The challenge of Web extraction is to be able to do Open Information Extraction:  the number of relations is unbounded, and  the Web corpus contains billions of documents.

How open IE systems work  Learn a general model of how relations are expressed (in a particular language), based on unlexicalized features such as part-of-speech tags (e.g., identify a verb).  Learn domain-independent regular expressions (e.g., over punctuation and commas).
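The verb-centered idea above can be sketched as a toy, learning-free extractor (not TextRunner's actual CRF model, just an illustration of unlexicalized POS patterns): take the first verb group as the relation phrase and the adjacent noun tokens as arguments. The tagged input is supplied by hand here; a real system would run a POS tagger.

```python
def extract_triple(tagged):
    """tagged: list of (word, Penn Treebank POS) pairs.
    Returns (arg1, relation, arg2) or None."""
    # find the first verb
    v = next((i for i, (_, p) in enumerate(tagged) if p.startswith("VB")), None)
    if v is None:
        return None
    # arg2 begins at the first noun after the verb
    a2 = next((i for i in range(v + 1, len(tagged))
               if tagged[i][1].startswith("NN")), None)
    if a2 is None:
        return None
    # arg1: contiguous noun tokens immediately left of the verb
    i = v - 1
    while i >= 0 and tagged[i][1].startswith("NN"):
        i -= 1
    arg1 = " ".join(w for w, _ in tagged[i + 1:v])
    if not arg1:
        return None
    # relation phrase: the verb plus everything up to arg2 (particles, preps)
    relation = " ".join(w for w, _ in tagged[v:a2])
    # arg2: contiguous noun tokens starting at a2
    j = a2
    while j < len(tagged) and tagged[j][1].startswith("NN"):
        j += 1
    arg2 = " ".join(w for w, _ in tagged[a2:j])
    return (arg1, relation, arg2)

sent = [("Boeing", "NNP"), ("moved", "VBD"), ("to", "TO"), ("Chicago", "NNP")]
print(extract_triple(sent))  # ('Boeing', 'moved to', 'Chicago')
```

The point of the sketch is that nothing in it mentions a specific relation name: any verb between two noun groups yields a candidate tuple, which is what makes the approach open.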

Methods for Open IE  Self supervision: Kylin (Wikipedia); shrinkage & retraining  Hearst patterns: PMI validation; subclass extraction; pattern learning  Structural extraction: list extraction & WebTables  TextRunner

Kylin: Self-Supervised Information Extraction from Wikipedia [Wu & Weld CIKM 2007]  From infoboxes to a training set: "Its county seat is Clearfield. As of 2005, the population density was 28.2/km². Clearfield County was created in 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until […]. 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water."

Kylin Architecture

Long Tail: Incomplete Articles  Desired information missing from Wikipedia: 800,000/1,800,000 (44.2%) stub pages [July 2007 snapshot of Wikipedia]  [figure: article length vs. article ID]

Schema Mapping  Heuristics: edit history, string similarity  Experiments: precision 94%, recall 87%  Future: integrated joint inference  Example mapping: Person {birth_date, birth_place, name, other_names, …} ↔ Performer {birthdate, location, name, othername, …}

Main Lesson: Self-Supervision  Find a structured data source.  Use heuristics to generate training data, e.g. infobox attributes & matching sentences.
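The heuristic above can be sketched in a few lines (hypothetical data and function names, not Kylin's actual matcher): an infobox attribute whose value occurs verbatim in an article sentence yields an auto-labeled positive example for that attribute.

```python
# A minimal sketch of Kylin-style self-supervision: pair each infobox
# attribute with every article sentence that contains its value.
def label_sentences(infobox, sentences):
    """infobox: {attribute: value}; returns (attribute, sentence) pairs."""
    examples = []
    for attr, value in infobox.items():
        for sent in sentences:
            if value.lower() in sent.lower():   # naive substring match
                examples.append((attr, sent))
    return examples

infobox = {"county_seat": "Clearfield", "created": "1804"}
article = ["Its county seat is Clearfield.",
           "Clearfield County was created in 1804 from parts of "
           "Huntingdon and Lycoming Counties."]
for attr, sent in label_sentences(infobox, article):
    print(attr, "->", sent)
```

Note that the match is noisy by design ("Clearfield" also appears in the second sentence), which is why the learned extractor, not the heuristic itself, does the final extraction.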

The KnowItAll System  Predicates, e.g. Country(X), plus domain-independent rule templates, e.g. "such as" NP  Bootstrapping yields extraction rules, e.g. "countries such as" NP, and discriminators, e.g. "country X"  Extractor (over the World Wide Web) produces extractions, e.g. Country("France")  Assessor produces validated extractions, e.g. Country("France"), prob = 0.999

Unary predicates: instances of a class  instanceOf(City), instanceOf(Film), instanceOf(Company), …  Good recall and precision from generic patterns: "such as" X and X "and other"  Instantiated rules: "cities such as" X / X "and other cities"; "films such as" X / X "and other films"; "companies such as" X / X "and other companies"

Recall – Precision Tradeoff  High-precision rules apply to only a small percentage of sentences on the Web.

Web hits for:   "X"           "cities such as X"   "X and other cities"
Boston          365,000,000   15,600,000           12,000
Tukwila         1,300,000     73,…                 …
Gjatsk          …             …                    …
Hadaslav        …             …                    …

"Redundancy-based extraction" ignores all but the unambiguous references.

Limited Recall with Binary Rules  Relatively high recall for unary rules: "companies such as" X (2,800,000 Web hits); X "and other companies" (500,000 Web hits)  Low recall for binary rules: X "is the CEO of Microsoft" (160 Web hits); X "is the CEO of Wal-mart" (19 Web hits); X "is the CEO of Continental Grain" (0 Web hits); X ", CEO of Microsoft" (6,700 Web hits); X ", CEO of Wal-mart" (700 Web hits); X ", CEO of Continental Grain" (2 Web hits)

Examples of Extraction Errors  Rule: "countries such as X" ⇒ instanceOf(Country, X). "We have 31 offices in 15 countries such as London and France." ⇒ instanceOf(Country, London), instanceOf(Country, France)  Rule: "X and other cities" ⇒ instanceOf(City, X). "A comparative breakdown of the cost of living in Klamath County and other cities follows." ⇒ instanceOf(City, Klamath County)

"Generate and Test" Paradigm  1. Find extractions from generic rules.  2. Validate each extraction: assign a probability that the extraction is correct, using search-engine hit counts to compute PMI (pointwise mutual information) between the extraction and "discriminator" phrases for the target concept.  PMI-IR: P. D. Turney, "Mining the Web for synonyms: PMI-IR versus LSA on TOEFL". In Proceedings of ECML, 2001.

Computing PMI Scores  Measures mutual information between the extraction and the target concept.  I = an instance of a target concept, e.g. instanceOf(Country, "France")  D = a discriminator phrase for the concept, e.g. "ambassador to X"  D+I = insert the instance into the discriminator phrase: "ambassador to France"

Example of PMI  Discriminator: "countries such as X"; instance: "France" vs. "London"  "countries such as France": 27,800 hits; "France": 14,300,000 hits  "countries such as London": 71 hits; "London": 12,600,000 hits  PMI for France >> PMI for London (2 orders of magnitude)  Need features for the probability update that distinguish "high" from "low" PMI for a given discriminator
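The PMI score here reduces to a hit-count ratio, Hits(D+I) / Hits(I); plugging in the slide's figures reproduces the two-orders-of-magnitude gap:

```python
# KnowItAll-style PMI from search-engine hit counts:
# PMI(I, D) = Hits(discriminator with instance inserted) / Hits(instance).
def pmi(hits_d_plus_i, hits_i):
    return hits_d_plus_i / hits_i

pmi_france = pmi(27_800, 14_300_000)   # "countries such as France" / "France"
pmi_london = pmi(71, 12_600_000)       # "countries such as London" / "London"
print(f"France: {pmi_france:.2e}  London: {pmi_london:.2e}  "
      f"ratio: {pmi_france / pmi_london:.0f}x")
```

The ratio comes out around 345, i.e. roughly two orders of magnitude, which is why raw PMI (plus learned thresholds per discriminator) separates true instances from false ones.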

Chicago Unmasked  [figure: "Chicago" in its city sense vs. its movie sense]

Impact of Unmasking on PMI

Name          Recessive sense   Original PMI   Unmasked PMI   Boost
Washington    city              …              …              …%
Casablanca    city              …              …              …%
Chevy Chase   actor             …              …              …%
Chicago       movie             …              …              …%

How to Increase Recall?  RL: learn class-specific patterns, e.g. "headquartered in"  SE: recursively extract subclasses, e.g. "scientists such as physicists and chemists"  LE: extract lists of items (~ Google Sets)

List Extraction (LE)  1. Query the engine with known items.  2. Learn a wrapper for each result page.  3. Collect a large number of lists.  4. Sort items by number of list "votes".  LE+A = sort the list according to the Assessor.  Evaluation: Web recall, at precision = 0.9.
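Step 4 of the recipe above is just frequency voting across independently extracted lists; a minimal sketch (with made-up lists):

```python
# Items that appear on many independently extracted lists accumulate
# "votes" and rank higher; singletons from one noisy list sink.
from collections import Counter

def rank_by_votes(lists):
    # set() per list so an item can vote at most once per source list
    votes = Counter(item for lst in lists for item in set(lst))
    return votes.most_common()

lists = [["France", "Spain", "Italy"],
         ["France", "Germany", "London"],   # a noisy list
         ["Spain", "France", "Portugal"]]
print(rank_by_votes(lists))  # France tops with 3 votes
```

In the full system the vote count is then recalibrated by the Assessor (the LE+A variant), which this sketch omits.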

TextRunner

TextRunner works in two phases:  1. Using a conditional random field, the extractor learns to assign labels to each of the words in a sentence.  2. It extracts one or more textual triples that aim to capture (some of) the relationships in each sentence.

Information Redundancy Assumption (Banko et al., 2008)

Performance Comparison: Traditional IE vs. Open IE

General Remarks  Open IE exploits data-driven methods (e.g., domain-independent patterns) to extract relation or event tuples, which dramatically enhanced the scalability of IE.  But it obtains substantially lower recall than traditional IE: it relies heavily on information redundancy to validate extraction results, and thus suffers from the "long-tail" knowledge-sparsity problem, and it cannot generalize lexical contexts in order to name new fact types.  It also over-simplifies the IE problem: IE != sentence-level relation extraction; what about entity and event types?  These systems focused on making the types of relations and events unrestricted, while still using other IE components, such as name tagging and semantic role labeling, for pre-defined types.

Universal Schema  Discover a wide range of domain-independent fact types:  manually (AMR, extended NE);  or automatically, by clustering relations based on coreferential arguments, e.g., Go_off represented by "plant", "set_off" and "injure", because they share coreferential arguments;  or by combining patterns with external knowledge sources such as Freebase or query logs.  Some name tagging work focused on extracting more fine-grained types beyond the traditional entity types (person, organization and geo-political entities).
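The automatic route can be sketched as follows (hypothetical data; a toy stand-in for the actual clustering): trigger/relation phrases observed with the same argument pair are merged into one candidate schema type, which is how "go_off", "set_off" and "injure" end up together.

```python
# Cluster relation/trigger phrases by shared (coreferential) argument pairs:
# phrases seen with the same pair become candidate paraphrases of one type.
from collections import defaultdict

def cluster_by_args(observations):
    """observations: list of ((arg1, arg2), phrase) pairs."""
    by_pair = defaultdict(set)
    for pair, phrase in observations:
        by_pair[pair].add(phrase)
    # transitively merge clusters that share any phrase
    clusters = [set(p) for p in by_pair.values()]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if clusters[i] & clusters[j]:
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

obs = [(("bomb", "market"), "go_off"), (("bomb", "market"), "set_off"),
       (("bomb", "crowd"), "set_off"), (("bomb", "crowd"), "injure"),
       (("rain", "city"), "fall")]
print(cluster_by_args(obs))  # {'go_off','set_off','injure'} vs {'fall'}
```

Real systems score these merges with learned embeddings or factorized matrices (as in Riedel et al.'s universal schemas) rather than exact overlap, but the grouping signal is the same.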

Open Questions  Is there a closed set of universal schemas?  How many knowledge sources should we use for discovering universal schemas? What are they?  Many knowledge bases are available; how can we unify them (automatically)?

Concept Typing  Chunking  Bootstrapping  Clustering with Contexts

Concept Mention Extraction (Wang et al., 2013)

Concept Mention Extraction (Tsai et al., 2013)  Identify and categorize mentions of concepts (Gupta and Manning, 2011), e.g. TECHNIQUE and APPLICATION: "We apply support vector machines on text classification."  Unsupervised bootstrapping algorithm (Yarowsky, 1995; Collins and Singer, 1999). The proposed algorithm: 1. Extract noun phrases (Punyakanok and Roth, 2001). 2. For each category, initialize a decision list with seeds. 3. For several rounds: (a) annotate NPs using the decision lists; (b) extract top features from newly annotated phrases and add them to the decision lists.
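The bootstrapping loop can be sketched compactly (a toy version: the seed sets and NPs are made up, and the only feature used is the phrase's non-head words, far simpler than the real feature set):

```python
# Yarowsky-style bootstrapping of per-category decision lists:
# annotate with the current list, then promote frequent features.
from collections import Counter

def bootstrap(noun_phrases, seeds, rounds=3, top_k=2):
    """seeds: {category: set of seed head words}; grows each decision list."""
    decision_lists = {cat: set(s) for cat, s in seeds.items()}
    for _ in range(rounds):
        for cat, dlist in decision_lists.items():
            # (a) annotate NPs whose head word is on the decision list
            matched = [np for np in noun_phrases if np.split()[-1] in dlist]
            # (b) promote the most frequent modifier words as new features
            feats = Counter(w for np in matched for w in np.split()[:-1])
            dlist.update(w for w, _ in feats.most_common(top_k))
    return decision_lists

nps = ["support vector machines", "vector machines", "decision machines",
       "text classification", "image classification"]
seeds = {"TECHNIQUE": {"machines"}, "APPLICATION": {"classification"}}
print(bootstrap(nps, seeds))
```

The essential property, as in Yarowsky (1995), is that each round's newly labeled phrases supply the evidence for the next round's list growth, so a handful of seeds can cover a much larger vocabulary.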

Seeds

Citation-Context Based Concept Clustering (CitClus)  Cluster mentions into semantically coherent concepts: 1. Group concept mentions by citation context. 2. Merge clusters based on lexical similarity between the mentions in the clusters.  Example: across papers, the mentions "support vector machine", "svm-based classification", "svm" and "maximal margin classifiers" co-occur with citations (Vapnik, 1995) and (Cortes, 1995), while "c4.5" and "decision trees" co-occur with (Quinlan, 1993); so the first group clusters into one concept and the second into another.
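The two CitClus steps can be sketched as follows (hypothetical mention data; substring overlap stands in for the paper's actual lexical-similarity measure):

```python
# CitClus sketch: (1) group mentions sharing a citation context,
# (2) merge groups whose mentions are lexically similar.
from collections import defaultdict

def citclus(mentions):
    """mentions: list of (mention_string, citation_key) pairs."""
    # Step 1: group mentions by citation context
    by_cite = defaultdict(set)
    for m, cite in mentions:
        by_cite[cite].add(m)
    clusters = list(by_cite.values())

    # Step 2: merge clusters whose mentions overlap lexically
    def similar(a, b):
        return any(x in y or y in x for x in a for y in b)

    i = 0
    while i < len(clusters):
        j = i + 1
        while j < len(clusters):
            if similar(clusters[i], clusters[j]):
                clusters[i] |= clusters.pop(j)
            else:
                j += 1
        i += 1
    return clusters

mentions = [("support vector machine", "Vapnik,1995"),
            ("svm", "Vapnik,1995"),
            ("maximal margin classifiers", "Vapnik,1995"),
            ("svm-based classification", "Cortes,1995"),
            ("c4.5", "Quinlan,1993"),
            ("decision trees", "Quinlan,1993")]
print(citclus(mentions))  # two clusters: the SVM family and the tree family
```

Step 1 pulls "maximal margin classifiers" into the SVM cluster even though it shares no words with "svm", which is exactly what citation context adds over purely lexical clustering.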

Remarks  Where do we draw the line between different granularities? Maybe the type of each phrase should be a path in the ontology structure?  Is an absolute cold-start extraction possible?  How can we use other resources, such as scenario models and social cognitive theories?

Paper presentations

Lifu's presentation + Phrase Typing Exercise