Learning to Extract a Broad-Coverage Knowledge Base from the Web William W. Cohen Carnegie Mellon University Machine Learning Dept and Language Technology Dept.



Learning to Extract a Broad-Coverage Knowledge Base from the Web William W. Cohen joint work with: Tom Mitchell, Richard Wang, Frank Lin, Ni Lao, Estevam Hruschka, Jr., Burr Settles, Derry Wijaya, Edith Law, Justin Betteridge, Jayant Krishnamurthy, Bryan Kisiel, Andrew Carlson, Weam Abu Zaki

Outline Web-scale information extraction: –discovering facts by automatically reading language on the Web NELL: A Never-Ending Language Learner –Goals, current scope, and examples Key ideas: –Redundancy of information on the Web –Constraining the task by scaling up –Learning by propagating labels through graphs Current and future directions: –Additional types of learning and input sources

Information Extraction Goal: –Extract facts about the world automatically by reading text –IE systems are usually based on learning how to recognize facts in text… and then (sometimes) aggregating the results Latest-generation IE systems need not require large amounts of training data… and IE does not necessarily require subtle analysis of any particular piece of text

Never Ending Language Learning (NELL) NELL is a large-scale IE system –Simultaneously learning concepts and relations (person, celebrity, emotion, acquiredBy, locatedIn, capitalCityOf, ..) –Starting point: containment/disjointness relations between concepts, types for relations, and O(10) examples per concept/relation –Uses a 500M web page corpus + live queries –Running (almost) continuously for over a year –Has learned more than 3.2M low-confidence “beliefs” and more than 500K high-confidence beliefs; about 85% of high-confidence beliefs are correct

More details on corpus size 500 M English web pages –25 TB uncompressed –2.5 B sentences POS/NP-chunked Noun phrase/context graph –2.2 B noun phrases, –3.2 B contexts, –100 GB uncompressed; –hundreds of billions of edges After thresholding: –9.8 M noun phrases, 8.6 M contexts

Examples of what NELL knows

learned extraction patterns: playsSport(arg1,arg2) arg1_was_playing_arg2 arg2_megastar_arg1 arg2_icons_arg1 arg2_player_named_arg1 arg2_prodigy_arg1 arg1_is_the_tiger_woods_of_arg2 arg2_career_of_arg1 arg2_greats_as_arg1 arg1_plays_arg2 arg2_player_is_arg1 arg2_legends_arg1 arg1_announced_his_retirement_from_arg2 arg2_operations_chief_arg1 arg2_player_like_arg1 arg2_and_golfing_personalities_including_arg1 arg2_players_like_arg1 arg2_greats_like_arg1 arg2_players_are_steffi_graf_and_arg1 arg2_great_arg1 arg2_champ_arg1 arg2_greats_such_as_arg1 …
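To make the idea concrete, here is a minimal sketch of how context patterns like those above can be applied as extractors. The patterns are drawn from the list on this slide; the regex conversion, the capitalized-noun-phrase heuristic, and the toy corpus are illustrative assumptions, not NELL's actual matcher.

```python
import re

# A few of the learned playsSport(arg1, arg2) patterns, written as templates.
PATTERNS = [
    "arg1 was playing arg2",
    "arg1 plays arg2",
    "arg2 player named arg1",
]

def pattern_to_regex(pattern):
    """Turn a NELL-style context pattern into a regex: arg1 is matched by a
    capitalized noun-phrase heuristic, arg2 by a single word (assumptions)."""
    body = re.escape(pattern)
    body = body.replace("arg1", r"(?P<arg1>[A-Z][\w.]*(?:\s[A-Z][\w.]*)*)")
    body = body.replace("arg2", r"(?P<arg2>\w+)")
    return re.compile(body)

def extract(sentences):
    facts = set()
    for s in sentences:
        for p in PATTERNS:
            m = pattern_to_regex(p).search(s)
            if m:
                facts.add((m.group("arg1"), m.group("arg2")))
    return facts

corpus = [
    "Tiger Woods was playing golf yesterday.",
    "A tennis player named Steffi Graf retired.",
]
# Extracts ('Tiger Woods', 'golf') and ('Steffi Graf', 'tennis').
print(extract(corpus))
```

Real patterns operate over POS/NP-chunked text rather than raw strings, but the template-matching idea is the same.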

Outline Web-scale information extraction: –discovering facts by automatically reading language on the Web NELL: A Never-Ending Language Learner –Goals, current scope, and examples Key ideas: –Redundancy of information on the Web –Constraining the task by scaling up –Learning by propagating labels through graphs Current and future directions: –Additional types of learning and input sources

Semi-Supervised Bootstrapped Learning Extract cities: given four seed examples of the class “city” (Paris, Pittsburgh, Seattle, Cupertino), bootstrap contexts such as “mayor of arg1”, “live in arg1”, “arg1 is home of”, and “traits such as arg1”; these extract correct instances (San Francisco, Austin, Berlin) but also errors (denial, anxiety, selfishness): it’s underconstrained!!

One Key to Accurate Semi-Supervised Learning 1. It is easier to learn many interrelated tasks than one isolated task. 2. It is also easier to learn using many different types of information. Example: given “Krzyzewski coaches the Blue Devils.”, learning coach(NP) in isolation is a hard (underconstrained) semi-supervised learning problem; jointly learning the coupled predicates person, coach, sport, athlete, team, coachesTeam(c,t), playsForTeam(a,t), teamPlaysSport(t,s), and playsSport(a,s) over NP1 and NP2 is a much easier (more constrained) semi-supervised learning problem.

SEAL: Set Expander for Any Language Another key: use lists and tables as well as text. From the seeds “ford, toyota, nissan”, SEAL learns single-page patterns (wrappers) and produces extractions such as “honda”. *Richard C. Wang and William W. Cohen: Language-Independent Set Expansion of Named Entities using the Web. In Proceedings of IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA

Extrapolating user-provided seeds Set expansion (SEAL): –Given seeds (kdd, icml, icdm), formulate a query to a search engine and collect semi-structured web pages –Detect lists on these pages –Merge the results, ranking items “frequently” occurring on “good” lists highest –Details: Wang & Cohen ICDM 2007, 2008; EMNLP 2008, 2009
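The list-merging step can be sketched in a few lines. This toy version assumes the lists have already been detected on the fetched pages, and the goodness heuristic (fraction of seeds a list contains) is a simplification of SEAL's actual ranking.

```python
from collections import defaultdict

# Toy sketch of SEAL-style set expansion.
seeds = {"kdd", "icml", "icdm"}

# Hypothetical lists detected on fetched web pages.
lists = [
    ["kdd", "icml", "icdm", "ecml", "sdm"],
    ["kdd", "icml", "nips", "ecml"],
    ["apple", "banana", "kdd"],        # a mostly-irrelevant list
]

def expand(seeds, lists):
    # A list is "good" in proportion to how many seeds it contains;
    # a candidate item is ranked by the total goodness of the lists
    # it appears on, so items on many good lists rank highest.
    scores = defaultdict(float)
    for lst in lists:
        goodness = len(seeds & set(lst)) / len(seeds)
        for item in set(lst) - seeds:
            scores[item] += goodness
    return sorted(scores, key=scores.get, reverse=True)

print(expand(seeds, lists))  # "ecml" ranks highest; the fruit ranks last
```

In the full system the “lists” come from learned per-page wrappers, and ranking is done by a random walk on the seeds/documents/wrappers/mentions graph described later.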

NELL architecture: the ontology and populated KB is built from the Web by several coupled components: CBL (text extraction patterns), SEAL (HTML extraction patterns), Morph (a morphology-based extractor), and RL (learned inference rules), combined through evidence integration and self-reflection.

Outline Web-scale information extraction: –discovering facts by automatically reading language on the Web NELL: A Never-Ending Language Learner –Goals, current scope, and examples Key ideas: –Redundancy of information on the Web –Constraining the task by scaling up –Learning by propagating labels through graphs Current and future directions: –Additional types of learning and input sources

Semi-Supervised Bootstrapped Learning Extract cities: again, the seeds (Paris, Pittsburgh, Seattle, Cupertino) and bootstrapped contexts (“mayor of arg1”, “live in arg1”, “arg1 is home of”, “traits such as arg1”) yield both correct extractions (San Francisco, Austin, Berlin) and errors (denial, anxiety, selfishness).

Semi-Supervised Bootstrapped Learning vs Label Propagation The same seeds (Paris, Pittsburgh, Seattle), contexts (“mayor of arg1”, “live in arg1”, “arg1 is home of”, “traits such as arg1”), and extractions (San Francisco, Austin, denial, anxiety, selfishness), now drawn as a bipartite graph connecting NPs to the contexts they occur in.

Semi-Supervised Bootstrapped Learning as Label Propagation Nodes “near” the seeds (Paris, San Francisco, Austin, Pittsburgh, Seattle, via contexts like “mayor of arg1”, “live in arg1”, “arg1 is home of”) get the label; nodes “far from” the seeds (denial, anxiety, selfishness, arrogance, via “traits such as arg1”) do not. Information from other categories tells you “how far” (when to stop propagating).

Semi-Supervised Learning as Label Propagation on a (Bipartite) Graph Propagate labels to nearby nodes. X is “near” Y if there is a high probability of reaching X from Y with a random walk where each step either (a) moves to a random neighbor or (b) jumps back to the start node Y, if you’re at an NP node. This rewards multiple paths, penalizes long paths, and penalizes high-fanout paths (e.g. the generic context “I like arg1”, which links to beer and much else). Propagation methods: “personalized PageRank” (aka damped PageRank, random-walk-with-reset).
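A minimal random-walk-with-restart sketch on a toy version of the bipartite NP/context graph. For simplicity it restarts from every node with probability alpha, rather than only at NP nodes as the slide describes; the graph and scores are illustrative.

```python
def personalized_pagerank(neighbors, start, alpha=0.15, iters=50):
    """Score nodes by random-walk proximity to `start`;
    alpha is the restart (reset) probability."""
    scores = {n: 0.0 for n in neighbors}
    scores[start] = 1.0
    for _ in range(iters):
        new = {n: 0.0 for n in neighbors}
        new[start] = alpha  # restart mass goes back to the seed
        for node, s in scores.items():
            # spread the remaining mass uniformly over outgoing edges
            for nb in neighbors[node]:
                new[nb] += (1 - alpha) * s / len(neighbors[node])
        scores = new
    return scores

# NPs on the left, contexts on the right (edges listed in both directions).
edges = {
    "Paris": ["mayor of X", "live in X"],
    "Pittsburgh": ["mayor of X", "live in X"],
    "denial": ["traits such as X"],
    "mayor of X": ["Paris", "Pittsburgh"],
    "live in X": ["Paris", "Pittsburgh"],
    "traits such as X": ["denial"],
}
scores = personalized_pagerank(edges, start="Paris")
# Pittsburgh, sharing contexts with the seed, gets positive score;
# denial, unreachable from "Paris", stays at 0.
print(scores["Pittsburgh"], scores["denial"])
```

Multiple shared contexts between Paris and Pittsburgh give multiple short paths, which is exactly what the walk rewards.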

Semi-Supervised Bootstrapped Learning as Label Propagation Co-EM (the semi-supervised method used in NELL) is equivalent to label propagation using harmonic functions –Seeds have score 1; the score of any other node X is the weighted average of its neighbors’ scores –The edge weight between NP node X and NP node Y is the inner product of their context features, weighted by inverse frequency This is similar to, but different from, Personalized PageRank/RWR. Edge weights are computed on-the-fly from features (a huge reduction in cost), and both methods are very easy to parallelize.
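A toy sketch of the harmonic-function view: positive seeds are clamped to 1, a competing category's seed is clamped to 0 (the “information from other categories” mentioned earlier), and every other node relaxes to the average of its neighbors' scores. Uniform edge weights are assumed here, whereas Co-EM derives them from context features.

```python
def harmonic_scores(neighbors, seeds, iters=200):
    """Label propagation with harmonic functions: `seeds` maps node -> clamped
    score (1.0 for the target category, 0.0 for a competing one); every other
    node is repeatedly set to the average of its neighbors' scores."""
    scores = {n: seeds.get(n, 0.0) for n in neighbors}
    for _ in range(iters):
        for node in neighbors:
            if node in seeds:
                continue  # seed scores stay clamped
            nbs = neighbors[node]
            scores[node] = sum(scores[nb] for nb in nbs) / len(nbs)
    return scores

# A chain through shared contexts: Paris .. Pittsburgh .. denial .. anxiety.
graph = {
    "Paris": ["mayor of X"],
    "Pittsburgh": ["mayor of X", "X is home of"],
    "denial": ["X is home of", "traits such as X"],
    "anxiety": ["traits such as X"],
    "mayor of X": ["Paris", "Pittsburgh"],
    "X is home of": ["Pittsburgh", "denial"],
    "traits such as X": ["denial", "anxiety"],
}
s = harmonic_scores(graph, seeds={"Paris": 1.0, "anxiety": 0.0})
# Pittsburgh (near the city seed) converges to ~0.67; denial to ~0.33.
print(s["Pittsburgh"], s["denial"])
```

On this chain the scores interpolate linearly between the two clamped endpoints, which is the characteristic behavior of a harmonic function.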

Comparison on “City” data Start with a city lexicon and hand-label entries based on typical contexts (is this really a city? Boston, Split, Drug, ..); evaluate using this as a gold standard. The comparison covers coEM (the current method), a PageRank-based method with 21 seeds, and a supervised learner with 21 examples. [Frank Lin & Cohen, current work]

Another example of propagation: Extrapolating seeds in SEAL Set expansion (SEAL): –Given seeds (kdd, icml, icdm), formulate a query to a search engine and collect semi-structured web pages –Detect lists on these pages –Merge the results, ranking items “frequently” occurring on “good” lists highest –Details: Wang & Cohen ICDM 2007, 2008; EMNLP 2008, 2009

List-merging using propagation on a graph The graph consists of a fixed set of node types {seeds, document, wrapper, mention} and labeled directed edges {find, derive, extract}. Each edge asserts that a binary relation r holds, and each edge has an inverse relation r^-1 (the graph is cyclic). –Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions –Good ranking scheme: find mentions “near” the seeds Example: the seeds “ford”, “nissan”, “toyota” find documents (curryauto.com, northpointcars.com), which derive wrappers (#1 through #4), which extract mentions: “honda” 26.1%, “acura” 34.6%, “chevrolet” 22.5%, “bmw pittsburgh” 8.4%, “volvo chicago” 8.4%.

Outline Web-scale information extraction: –discovering facts by automatically reading language on the Web NELL: A Never-Ending Language Learner –Goals, current scope, and examples Key ideas: –Redundancy of information on the Web –Constraining the task by scaling up –Learning by propagating labels through graphs Current and future directions: –Additional types of learning and input sources

Learning to reason from the KB The learned KB is noisy, so chains of logical inference may be unreliable. How can you decide which inferences are safe? Approach: –Combine graph proximity with learning –Learn which sequences of edge labels usually lead to good inferences [Ni Lao, Cohen, Mitchell – current work]
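The edge-label-sequence idea can be illustrated on a toy KB: follow a fixed path of relation labels, distributing random-walk probability mass at each step, and treat where the mass lands as evidence for an inferred fact. The relations and facts below are made up for illustration; in the real system, many such path features are combined by a learned model.

```python
from collections import defaultdict

# Toy KB: (source node, relation label) -> list of target nodes.
KB = {
    ("PIT", "teamPlaysIn"): ["Pittsburgh"],
    ("Pittsburgh", "cityInState"): ["Pennsylvania"],
    ("HBG", "teamPlaysIn"): ["Harrisburg"],
    ("Harrisburg", "cityInState"): ["Pennsylvania"],
}

def follow_path(start, edge_labels):
    """Distribute probability mass along a sequence of edge labels."""
    dist = {start: 1.0}
    for label in edge_labels:
        nxt = defaultdict(float)
        for node, p in dist.items():
            targets = KB.get((node, label), [])
            for t in targets:
                nxt[t] += p / len(targets)  # split mass over targets
        dist = dict(nxt)
    return dist

# "team plays in a city, that city is in a state" is a path that tends to
# support teamPlaysInState inferences:
print(follow_path("PIT", ["teamPlaysIn", "cityInState"]))
```

A path type that concentrates mass on correct conclusions across many training pairs gets a high learned weight; unreliable paths get low weight.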

Results

Semi-Supervised Bootstrapped Learning vs Label Propagation The same seeds (Paris, Pittsburgh, Seattle), contexts (“mayor of arg1”, “live in arg1”, “arg1 is home of”, “traits such as arg1”), and extractions (San Francisco, Austin, denial, anxiety, selfishness), now drawn as a bipartite graph connecting NPs to the contexts they occur in.

Semi-Supervised Bootstrapped Learning vs Label Propagation Basic idea: propagate labels from context-NP pairs (“mayor of Paris”, “mayor of Pittsburgh”, “mayor of San Francisco”, “live in Paris”, “live in Pittsburgh”, “Paris’s new show”) and classify NPs in context, not NPs out of context. Challenge: much larger (and sparser) data

Looking forward Huge value in mining/organizing/making accessible publicly available information Information is more than just facts –It’s also how people write about the facts, how facts are presented (in tables, …), how facts structure our discourse and communities, … –IE is the science of all these things NELL is based on the premise that doing it right means scaling –From small to large datasets –From fewer extraction problems to many interrelated problems –From one view to many different views of the same data

Thanks to: Tom Mitchell and other collaborators –Frank Lin, Ni Lao, (alumni) Richard Wang DARPA, NSF, Google, and the Brazilian agency CNPq (project funding) Yahoo! and Microsoft Research (fellowships)