Graph-Based Methods for “Open Domain” Information Extraction William W. Cohen Machine Learning Dept. and Language Technologies Institute School of Computer Science, Carnegie Mellon University.
Graph-Based Methods for “Open Domain” Information Extraction William W. Cohen Machine Learning Dept. and Language Technologies Institute School of Computer Science Carnegie Mellon University Joint work with Richard Wang

Traditional IE vs. Open-Domain IE
Traditional IE:
–Goal: recognize people, places, companies, times, dates, … in NL text
–Supervised learning from a corpus completely annotated with the target entity class (e.g. “people”)
–Linear-chain CRFs
–Language- and genre-specific extractors
Open-domain IE:
–Goal: recognize arbitrary entity sets in text, with minimal info about the entity class
–Example 1: “ICML, NIPS”; Example 2: “machine learning conferences”
–Semi-supervised learning from very large corpora (WWW)
–Graph-based learning methods
–Techniques are largely language-independent (!): the graph abstraction fits many languages

Examples with three seeds

Outline History –Open-domain IE by pattern-matching The bootstrapping-with-noise problem –Bootstrapping as a graph walk Open-domain IE as finding nodes “near” seeds on a graph –Set expansion - from a few clean seeds –Iterative set expansion – from many noisy seeds –Relational set expansion –Multilingual set expansion –Iterative set expansion – from a concept name alone

History: Open-domain IE by pattern-matching (Hearst, 92) Start with seeds: “NIPS”, “ICML” Look through a corpus for certain patterns: … “at NIPS, AISTATS, KDD and other learning conferences…” Expand from seeds to new instances Repeat… until ___ –“on PC of KDD, SIGIR, … and…”
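The loop on this slide can be sketched in a few lines of Python. This is a hedged illustration on a toy corpus with a crude acronym matcher (not Hearst's actual patterns); note how the noisy token “PC” gets promoted, which is exactly the bootstrapping-with-noise problem raised in the outline.

```python
import re

def bootstrap(corpus, seeds, rounds=2):
    """Toy Hearst-style bootstrapping: any all-caps token that co-occurs
    in a sentence with a known seed is promoted to a new seed."""
    known = set(seeds)
    for _ in range(rounds):
        for sent in corpus:
            items = set(re.findall(r"\b[A-Z]{2,}\b", sent))
            if items & known:      # sentence mentions a current seed
                known |= items     # promote everything listed with it
    return known

corpus = ["at NIPS, AISTATS, KDD and other learning conferences",
          "on PC of KDD, SIGIR, and others"]
# "PC" is promoted too: bootstrapping picks up noise
print(sorted(bootstrap(corpus, {"NIPS", "ICML"})))
# -> ['AISTATS', 'ICML', 'KDD', 'NIPS', 'PC', 'SIGIR']
```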

Bootstrapping as graph proximity [Diagram: contexts “…at NIPS, AISTATS, KDD and other learning conferences…”, “on PC of KDD, SIGIR, … and…”, “For skiers, NIPS, SNOWBIRD,… and…”, “…AISTATS, KDD,…” linked to the entities NIPS, AISTATS, KDD, SNOWBIRD, SIGIR] shorter paths ~ earlier iterations; many paths ~ additional evidence

Set Expansion for Any Language (SEAL) – (Wang & Cohen, ICDM 07) Basic ideas –Dynamically build the graph using queries to the web –Constrain the graph to be as useful as possible Be smart about queries Be smart about “patterns”: use clever methods for finding meaningful structure on web pages

System Architecture Fetcher: download web pages from the Web that contain all the seeds Extractor: learn wrappers from web pages Ranker: rank entities extracted by wrappers Example output: 1. Canon 2. Nikon 3. Olympus 4. Pentax 5. Sony 6. Kodak 7. Minolta 8. Panasonic 9. Casio 10. Leica 11. Fuji 12. Samsung 13. …

The Extractor Learn wrappers from web documents and seeds on the fly –Utilize semi-structured documents –Wrappers defined at character level Very fast No tokenization required; thus language independent Wrappers derived from doc d applied to d only –See ICDM 2007 paper for details

… Generally Ford sales … compared to Honda while General Motors and Bentley …
1. Find the prefix of each seed occurrence and put it in reverse order:
ford1: /ecnanif”=fer a> yllareneG …
Ford2: >”drof/ /ecnanif”=fer a> yllareneG …
honda1: /ecnanif”=fer a> ot derapmoc …
Honda2: >”adnoh/ /ecnanif”=fer a> ot …
2. Organize these into a trie, tagging each node with the set of seed occurrences it covers: [trie over the reversed strings, with nodes tagged {f1,f2,h1,h2}, {f1,h1}, {f2,h2}, {f1}, {h1}, {f2}, {h2}]

… Generally Ford sales … compared to Honda while General Motors and Bentley …
1. Find the prefix of each seed occurrence and put it in reverse order.
2. Organize these into a trie, tagging each node with a set of seed occurrences.
3. A left context for a valid wrapper is a node tagged with one instance of each seed. [Trie diagram as on the previous slide]

… Generally Ford sales … compared to Honda while General Motors and Bentley …
1. Find the prefix of each seed occurrence and put it in reverse order.
2. Organize these into a trie, tagging each node with a set of seed occurrences.
3. A left context for a valid wrapper is a node tagged with one instance of each seed.
4. The corresponding right context is the longest common suffix of the corresponding seed instances. [Trie diagram; the resulting right contexts: “>, ”>Ford sales …, ”>Honda while …]

Nice properties: There are relatively few nodes in the trie: O(#seeds × document length). You can tag every node with the complete set of seed occurrences that it covers. You can rank or filter nodes by any predicate over this set of seeds you want, e.g.: covers all seed instances that appear on the page? covers at least one instance of each seed? covers at least k instances, or instances with weight > w, … [Trie diagram as on the previous slides]
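A compressed sketch of the character-level wrapper idea on a toy page (hedged: it skips the trie and directly computes the common reversed left context and common right context over one occurrence of each seed; page markup and seeds are invented):

```python
def common_prefix(strings):
    """Longest common prefix of a list of strings."""
    p = strings[0]
    for s in strings[1:]:
        while not s.startswith(p):
            p = p[:-1]
    return p

def learn_wrapper(doc, seeds):
    """Left/right contexts bracketing one occurrence of each seed.
    Left contexts are compared reversed, as in the trie construction."""
    lefts, rights = [], []
    for seed in seeds:
        i = doc.index(seed)
        lefts.append(doc[:i][::-1])
        rights.append(doc[i + len(seed):])
    return common_prefix(lefts)[::-1], common_prefix(rights)

def apply_wrapper(doc, left, right):
    """Extract every string found between the two learned contexts."""
    out, pos = [], 0
    while True:
        i = doc.find(left, pos)
        if i < 0:
            return out
        start = i + len(left)
        j = doc.find(right, start)
        if j < 0:
            return out
        out.append(doc[start:j])
        pos = j

doc = ('<li><a href="/ford">Ford</a></li><li><a href="/honda">Honda</a>'
       '</li><li><a href="/bmw">BMW</a></li><li><a href="/audi">Audi</a></li>')
left, right = learn_wrapper(doc, ["Ford", "Honda"])
print(apply_wrapper(doc, left, right))  # -> ['Ford', 'Honda', 'BMW']
```

BMW is extracted as a new instance; Audi is missed because the longest common context overreaches past the final list item, a real quirk of common-context wrappers.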

[Screenshot: a ranked extraction list with noisy entries called out “I am noise” and “Me too!”]

Differences from prior work Fast character-level wrapper learning –Language-independent –Trie structure allows flexibility in goals Cover one copy of each seed, cover all instances of seeds, … –Works well for semi-structured pages Lists and tables, pull-down menus, javascript data structures, word documents, … High-precision, low-recall data integration vs. High-precision, low-recall information extraction

The Ranker Rank candidate entity mentions based on “ similarity ” to seeds –Noisy mentions should be ranked lower Random Walk with Restart (GW) …?

Google’s PageRank [Diagram: web sites linking to one another] Inlinks are “good” (recommendations); inlinks from a “good” site are better than inlinks from a “bad” site, but inlinks from sites with many outlinks are not as “good”… “Good” and “bad” are relative.

Google’s PageRank [Diagram: web sites linking to one another] Imagine a “pagehopper” that always either follows a random link, or jumps to a random page.

Google’s PageRank (Brin & Page, 1998) [Diagram: web sites linking to one another] Imagine a “pagehopper” that always either follows a random link, or jumps to a random page. PageRank ranks pages by the amount of time the pagehopper spends on a page; or, if there were many pagehoppers, PageRank is the expected “crowd size”.

Personalized PageRank (aka Random Walk with Restart) [Diagram: web sites linking to one another] Imagine a “pagehopper” that always either follows a random link, or jumps to a particular page.

Personalized PageRank (Random Walk with Restart) [Diagram: web sites linking to one another] Imagine a “pagehopper” that always either follows a random link, or jumps to a particular page P0. This ranks pages by the total number of paths connecting them to P0, with each path downweighted exponentially with length.
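A minimal power-iteration sketch of random walk with restart (the graph and node names are illustrative; `alpha` is the restart probability):

```python
def rwr(adj, restart, alpha=0.15, iters=100):
    """Personalized PageRank scores by power iteration.
    adj: node -> list of out-neighbours (every node needs an entry)."""
    r = {n: (1.0 / len(restart) if n in restart else 0.0) for n in adj}
    p = dict(r)                       # start at the restart distribution
    for _ in range(iters):
        nxt = {n: alpha * r[n] for n in adj}
        for n, outs in adj.items():
            for m in outs:            # spread (1-alpha) mass over outlinks
                nxt[m] += (1 - alpha) * p[n] / len(outs)
        p = nxt
    return p

# toy seed/wrapper/mention-style graph, edges in both directions
adj = {"seed": ["w1", "w2"], "w1": ["seed", "x"], "w2": ["seed", "y"],
       "x": ["w1"], "y": ["w2"]}
scores = rwr(adj, {"seed"})
print(max(scores, key=scores.get))  # nodes nearer the seed score higher
```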

The Ranker Rank candidate entity mentions based on “ similarity ” to seeds –Noisy mentions should be ranked lower Random Walk with Restart (GW) On what graph?

Building a Graph A graph consists of a fixed set of… –Node types: {seeds, document, wrapper, mention} –Labeled directed edges: {find, derive, extract}. Each edge asserts that a binary relation r holds; each edge has an inverse relation r⁻¹ (the graph is cyclic). –Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions. [Diagram: seeds “ford”, “nissan”, “toyota” find documents curryauto.com and northpointcars.com, which derive Wrappers #1–#4, which extract mentions “honda” 26.1%, “acura” 34.6%, “chevrolet” 22.5%, “bmw pittsburgh” 8.4%, “volvo chicago” 8.4%]
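The slide's node/edge scheme can be written down directly; storing each edge together with its inverse is what lets a walk travel both from seeds to mentions and back (all names here are illustrative):

```python
def add_edge(graph, src, rel, dst):
    """Store a labeled edge and its inverse, so walks can go both ways."""
    graph.setdefault(src, {}).setdefault(rel, []).append(dst)
    graph.setdefault(dst, {}).setdefault(rel + "^-1", []).append(src)

g = {}
add_edge(g, "seeds", "find", "curryauto.com")        # seeds -> document
add_edge(g, "curryauto.com", "derive", "wrapper#1")  # document -> wrapper
add_edge(g, "wrapper#1", "extract", "honda")         # wrapper -> mention
add_edge(g, "wrapper#1", "extract", "acura")

# a mention is reachable from the seeds, and from a mention the walk can
# return through extract^-1 to the wrappers that produced it
print(g["honda"]["extract^-1"])
```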

Differences from prior work Graph-based distances vs. bootstrapping –Graph constructed on-the-fly So it’s not different? –But there is a clear principle about how to combine results from earlier/later rounds of bootstrapping i.e., graph proximity Fewer parameters to consider Robust to “bad wrappers”

Evaluation Datasets: closed sets

Evaluation Method Mean Average Precision (MAP) –Commonly used for evaluating ranked lists in IR –Contains recall- and precision-oriented aspects –Sensitive to the entire ranking –Mean of the average precisions for each ranked list. Evaluation procedure (per dataset): 1. Randomly select three true entities and use their first listed mentions as seeds. 2. Expand the three seeds obtained from step 1. 3. Repeat steps 1 and 2 five times. 4. Compute MAP over the five ranked lists, where for each list AvgPrec(L) = (1 / #TrueEntities) × Σ_r Prec(r) × new(r), with L = ranked list of extracted mentions, r = rank, Prec(r) = precision at rank r, and new(r) = 1 iff (a) the extracted mention at rank r matches a true mention and (b) no extracted mention at a rank less than r is of the same entity as the one at r; #TrueEntities = total number of true entities in this dataset.
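The procedure on this slide, in code (a sketch: mention-to-entity resolution is given as a dict, and conditions (a) and (b) are enforced by only counting the first mention of each entity):

```python
def average_precision(ranked, entity_of, num_true_entities):
    """AP per the slide: rank r contributes only if the mention matches a
    true mention (a) and its entity was not already seen earlier (b)."""
    seen, ap = set(), 0.0
    for r, mention in enumerate(ranked, start=1):
        ent = entity_of.get(mention)        # None => not a true mention
        if ent is not None and ent not in seen:
            seen.add(ent)
            ap += len(seen) / r             # Prec(r) at each new entity
    return ap / num_true_entities

def mean_average_precision(runs, entity_of, num_true_entities):
    return sum(average_precision(rl, entity_of, num_true_entities)
               for rl in runs) / len(runs)

# two mentions of the same entity: the second one does not count
entity_of = {"Canon": "canon", "Canon Inc": "canon", "Nikon": "nikon"}
ranked = ["Canon", "noise", "Canon Inc", "Nikon"]
print(average_precision(ranked, entity_of, 2))  # (1/1 + 2/4) / 2 = 0.75
```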

Experimental Results: 3 seeds Vary: [Extractor] + [Ranker] + [Top N URLs] Extractor: E1: Baseline Extractor (longest common context for all seed occurrences) E2: Smarter Extractor (longest common context for 1 occurrence of each seed) Ranker: { EF: Baseline (Most Frequent), GW: Graph Walk } N URLs: { 100, 200, 300 }

Side by side comparisons Talukdar, Brants, Liberman, Pereira, CoNLL 06

Side by side comparisons Ghahramani & Heller, NIPS 2005: EachMovie vs. WWW; NIPS vs. WWW

Why does SEAL do so well? Hypotheses: –More information appears in semi-structured documents than in free text –More semi-structured documents can be (partially) understood with character-level wrappers than with HTML-level wrappers. Free-text wrappers are only 10-15% of all wrappers learned: “Used [...] Van Pricing", “Used [...] Engines", “Bell Road [...]", “Alaska [...] dealership", “engine [...] used engines", “accessories, [...] parts", “is better [...] or"

Comparing character tries to HTML-based structures

Outline History –Open-domain IE by pattern-matching The bootstrapping-with-noise problem –Bootstrapping as a graph walk Open-domain IE as finding nodes “near” seeds on a graph –Set expansion - from a few clean seeds –Iterative set expansion – from many noisy seeds –Iterative set expansion – from a concept name alone –Multilingual set expansion –Relational set expansion

A limitation of the original SEAL

Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008) Makes several calls to SEAL, each call … –Expands a couple of seeds –Aggregates statistics Evaluate iSEAL using … –Two iterative processes Supervised vs. Unsupervised (Bootstrapping) –Two seeding strategies Fixed Seed Size vs. Increasing Seed Size –Five ranking methods

iSEAL (Fixed Seed Size, Supervised) Initial seeds … Finally rank nodes by proximity to seeds in the full graph. Refinement (ISS): increase the size of the seed set for each expansion over time: 2, 3, 4, 4, … Variant (Bootstrap): use high-confidence extractions when seeds run out.
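The seeding schedule can be sketched as follows. Hedged: `expand` stands in for one SEAL call; the 2, 3, 4, 4, … sizes and the bootstrap fallback follow the slide, everything else (batching by position, dedup) is illustrative:

```python
import itertools

def iseal(expand, user_seeds, calls=3):
    """Iterative SEAL sketch: each call expands a small seed batch; when
    user seeds run out, fall back on prior extractions (bootstrapping)."""
    sizes = itertools.chain([2, 3], itertools.repeat(4))  # ISS schedule
    pool, ranked, used = list(user_seeds), [], 0
    for _, k in zip(range(calls), sizes):
        if used + k > len(pool):                       # seeds exhausted
            pool += [m for m in ranked if m not in pool]
        batch = pool[used:used + k]
        used += k
        for m in expand(batch):                        # one SEAL call
            if m not in ranked:
                ranked.append(m)
    return ranked

# toy expand: each call just echoes its batch in upper case
print(iseal(lambda batch: [s.upper() for s in batch],
            ["a", "b", "c", "d", "e"]))
```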

Ranking Methods Random Walk with Restart –H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006. PageRank –L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998. Bayesian Sets (over flattened graph) –Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005. Wrapper Length –Weights each item based on the length of the common contextual string of that item and the seeds. Wrapper Frequency –Weights each item based on the number of wrappers that extract the item.

Little difference between ranking methods for supervised case (all seeds correct); large differences when bootstrapping Increasing seed size {2,3,4,4,…} makes all ranking methods improve steadily in bootstrapping case

Outline History –Open-domain IE by pattern-matching The bootstrapping-with-noise problem –Bootstrapping as a graph walk Open-domain IE as finding nodes “near” seeds on a graph –Set expansion - from a few clean seeds –Iterative set expansion – from many noisy seeds –Relational set expansion –Multilingual set expansion –Iterative set expansion – from a concept name alone

Relational Set Expansion [Wang & Cohen, EMNLP 2009] Seed examples are pairs: –E.g., audi::germany, acura::japan, Extension: find wrappers in which pairs of seeds occur –With specific left & right contexts –In specific order (audi before germany, …) –With specific string between them Variant of trie-based algorithm
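Applying such a pair wrapper is straightforward once its three contexts are known (learning them would use the same trie machinery as before; the page, contexts, and pairs below are toy examples):

```python
import re

def apply_pair_wrapper(doc, left, mid, right):
    """Extract (x, y) pairs bracketed as LEFT x MID y RIGHT."""
    pat = (re.escape(left) + r"(.+?)" + re.escape(mid)
           + r"(.+?)" + re.escape(right))
    return re.findall(pat, doc)

doc = "<li>audi - germany</li><li>acura - japan</li><li>saab - sweden</li>"
print(apply_pair_wrapper(doc, "<li>", " - ", "</li>"))
# -> [('audi', 'germany'), ('acura', 'japan'), ('saab', 'sweden')]
```

The middle context enforces both the order of the pair and the specific string between its members, as the slide requires.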

Results: first iteration vs. tenth iteration

Outline History –Open-domain IE by pattern-matching The bootstrapping-with-noise problem –Bootstrapping as a graph walk Open-domain IE as finding nodes “near” seeds on a graph –Set expansion - from a few clean seeds –Iterative set expansion – from many noisy seeds –Relational set expansion –Multilingual set expansion –Iterative set expansion – from a concept name alone

Multilingual Set Expansion

Basic idea: –Expand in language 1 (English) with seeds s1, s2 to S1. –Expand in language 2 (Spanish) with seeds t1, t2 to T1. –Find the first seed s3 in S1 that has a translation t3 in T1. –Expand in language 1 (English) with seeds s1, s2, s3 to S2. –Find the first seed t4 in T1 that has a translation s4 in S2. –Expand in language 2 (Spanish) with seeds t1, t2, t3 to T2. –Continue…
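One way to code the alternating loop (hedged: `expand` and `translations` are stand-ins for SEAL and the snippet-based translator, and a candidate is kept as a new seed once the other language's expansion contains its translation):

```python
def bilingual_expand(expand, translations, seeds1, seeds2, rounds=2):
    """Alternate expansions in two languages; an expansion result becomes
    a new seed when the other language's expansion confirms it."""
    s, t = list(seeds1), list(seeds2)
    S, T = expand(1, s), expand(2, t)
    for _ in range(rounds):
        for x in S:                    # promote first cross-validated item
            if x not in s and translations.get(x) in T:
                s.append(x)
                break
        S = expand(1, s)
        for y in T:                    # and symmetrically for language 2
            if y not in t and translations.get(y) in S:
                t.append(y)
                break
        T = expand(2, t)
    return s, t

# toy stand-ins for SEAL and the translator
def expand(lang, seeds):
    extras = ["france", "spain"] if lang == 1 else ["francia", "españa"]
    return seeds + [e for e in extras if e not in seeds]

translations = {"france": "francia", "francia": "france",
                "spain": "españa", "españa": "spain"}
s, t = bilingual_expand(expand, translations,
                        ["germany", "italy"], ["alemania", "italia"])
print(s, t)
```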

Multilingual Set Expansion What’s needed: –Set expansion in two languages –A way to decide if s is a translation of t

Multilingual Set Expansion Submit s as a query and ask for results in language T. Find chunks in language T in the snippets that frequently co-occur with s, bounded by a change in character set (e.g., English to Chinese) or punctuation. Rank chunks by a combination of proximity and frequency. Consider the top 3 chunks t1, t2, t3 as likely translations of s.
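A sketch of this chunk-ranking heuristic on toy snippets (assumptions: the target language is Chinese so a chunk is a maximal run of CJK characters, and the score combines frequency with inverse distance to the query term):

```python
import re
from collections import Counter

def translation_candidates(term, snippets, top=3):
    scores = Counter()
    for sn in snippets:
        for m in re.finditer(re.escape(term), sn):
            # candidate chunks: maximal runs of Chinese characters; the
            # character-set change bounds each chunk, as on the slide
            for c in re.finditer(r"[\u4e00-\u9fff]+", sn):
                scores[c.group()] += 1.0 / (1 + abs(c.start() - m.start()))
    return [w for w, _ in scores.most_common(top)]

snippets = ["Ford 福特汽车 official site", "福特 Ford cars and trucks"]
print(translation_candidates("Ford", snippets))
```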

Multilingual Set Expansion

Outline History –Open-domain IE by pattern-matching The bootstrapping-with-noise problem –Bootstrapping as a graph walk Open-domain IE as finding nodes “near” seeds on a graph –Set expansion - from a few clean seeds –Iterative set expansion – from many noisy seeds –Relational set expansion –Multilingual set expansion –Iterative set expansion – from a concept name alone

ASIA: Automatic Set Instance Acquisition [Wang & Cohen, ACL 2009] Start with the name of a concept (e.g., “NFL teams”) Look for instances using (language-dependent) patterns: –“… for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, …)” Take the most frequent answers as seeds Run bootstrapping iSEAL –with seed sizes 2, 3, 4, 4, … –and extended for noise-resistance: wrappers should cover as many distinct seeds as possible (not all seeds), subject to a limit on size; a modified trie method
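ASIA's first step can be sketched as follows (hedged: the patterns, snippets, and splitting rules are illustrative, not the system's actual pattern set):

```python
import re
from collections import Counter

def harvest_seeds(concept, snippets, k=3):
    """Instantiate simple textual patterns with the class name and keep
    the most frequent matches as seeds."""
    pat = re.escape(concept) + r"\s*(?:\(e\.g\.|such as|including)\s*([^).]+)"
    counts = Counter()
    for sn in snippets:
        for m in re.finditer(pat, sn, flags=re.I):
            # split an enumeration like "A, B and C" into items
            for item in re.split(r",| and ", m.group(1)):
                if item.strip():
                    counts[item.strip()] += 1
    return [x for x, _ in counts.most_common(k)]

snippets = ["for successful NFL teams (e.g. Pittsburgh Steelers, New York Giants)",
            "NFL teams such as Pittsburgh Steelers and Dallas Cowboys"]
print(harvest_seeds("NFL teams", snippets))
```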

Datasets with concept names

Experimental results Direct use of text patterns

Comparison to Kozareva, Riloff & Hovy (which uses the concept name plus a single instance as seed); here no seed instance is used.

Comparison to Pasca (using web search queries, CIKM 07)

Comparison to WordNet + 30k Snow et al., ACL 2006: a series of experiments learning hyper/hyponyms –Bootstrap from WordNet examples –Use dependency-parsed free text –E.g., added 30k new instances with fairly high precision –Many are concepts + named-entity instances. Experiments with ASIA on concepts from WordNet show a fairly common problem: –E.g., “movies” gives as “instances”: “comedy”, “action/adventure”, “family”, “drama”, … –I.e., ASIA finds a lower level in a hierarchy, maybe not the one you want.

Comparison to WordNet + 30k Filter: a simulated sanity check: –Consider only concepts expanded in WordNet + 30k that seem to have named entities as instances and have at least ___ instances –Run ASIA on each concept –Discard the result if less than 50% of the WordNet instances are in ASIA’s output

Summary: Some are good. Some of Snow’s concepts are low-precision relative to ASIA (4.7% vs. 100%); for the rest, ASIA has 2x to 100x the coverage (in number of instances).

Two More Systems to Compare to Van Durme & Pasca, 2008 –Requires an English part-of-speech tagger. –Analyzed 100 million cached Web documents in English (for many classes). Talukdar et al, 2008 –Requires 5 seed instances as input (for each class). –Utilizes output from Van Durme’s system and 154 million tables from the WebTables database (for many classes). ASIA –Does not require any part-of-speech tagger (nearly language-independent). –Supports multiple languages such as English, Chinese, and Japanese. –Analyzes around 200~400 Web documents (for each class). –Requires only the class name as input. –Given a class name, extraction usually finishes within a minute (including network latency of fetching web pages).

Precisions of Talukdar and Van Durme’s systems were obtained from Figure 2 in Talukdar et al, 2008.

(for your reference)

Top 10 Instances from ASIA

Joint work with Tom Mitchell, Weam AbuZaki, Justin Betteridge, Andrew Carlson, Estevam R. Hruschka Jr., Bryan Kisiel, Burr Settles. Learn a large number of concepts at once. [Diagram: for “Krzyzewski coaches the Blue Devils.”, NP1 is typed person/coach and NP2 is typed team, coupled through relations coachesTeam(c,t), playsForTeam(a,t), teamPlaysSport(t,s), playsSport(a,s) and categories athlete, sport]

Coupled learning of text and HTML patterns [Diagram: an ontology and populated KB sits between two learners operating over the Web, CBL (free-text extraction patterns) and SEAL (HTML extraction patterns), combined by evidence integration]

Summary/Conclusions Open-domain IE as finding nodes “near” seeds on a graph [Diagram as before: contexts “…at NIPS, AISTATS, KDD and other learning conferences…”, “on PC of KDD, SIGIR, … and…”, “For skiers, NIPS, SNOWBIRD,… and…”, “…AISTATS, KDD,…” linked to NIPS, AISTATS, KDD, SNOWBIRD, SIGIR] RWR as a robust proximity measure. Character tries as a flexible pattern language: high-coverage, modifiable to handle expectations of noise.

Summary/Conclusions Open-domain IE as finding nodes “near” seeds on a graph: –Graph built on-the-fly with web queries A good graph matters! A big graph matters! –character-level tries >> HTML heuristics –Rank the whole graph Don’t confuse iteratively building the graph with ranking! –Off-the-shelf distance metrics work Differences are minimal for clean seeds Much bigger differences with noisy seeds Bootstrapping (especially from free-text patterns) is noisy

Thanks to DARPA PAL program –Cohen, Wang Google Research Grant program –Wang Sponsored links: (Richard’s demo)