Graph-Based Methods for “Open Domain” Information Extraction
William W. Cohen
Machine Learning Dept. and Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Traditional IE vs. Open Domain IE

Traditional IE:
– Goal: recognize people, places, companies, times, dates, … in NL text
– Supervised learning from a corpus completely annotated with the target entity class (e.g. “people”)
– Linear-chain CRFs
– Language- and genre-specific extractors

Open Domain IE:
– Goal: recognize arbitrary entity sets in text, with minimal info about the entity class
  – Example 1: “ICML, NIPS”
  – Example 2: “machine learning conferences”
– Semi-supervised learning from very large corpora (the WWW)
– Graph-based learning methods
– Techniques are largely language-independent (!): the graph abstraction fits many languages

Examples with three seeds

Outline
History
– Open-domain IE by pattern-matching
The bootstrapping-with-noise problem
– Bootstrapping as a graph walk
Open-domain IE as finding nodes “near” seeds on a graph
– Approach 1: a “natural” graph derived from a smaller corpus + learned similarity
– Approach 2: a carefully-engineered graph derived from a huge corpus (the examples above)

History: Open-domain IE by pattern-matching (Hearst, 1992)
– Start with seeds: “NIPS”, “ICML”
– Look through a corpus for certain patterns: “…at NIPS, AISTATS, KDD and other learning conferences…”
– Expand from the seeds to new instances
– Repeat… until ___ (a later iteration matches, e.g., “…on PC of KDD, SIGIR, … and…”)
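
A minimal sketch of this bootstrapping loop, assuming a toy corpus and a crude list-matching regex standing in for Hearst's hand-built lexico-syntactic patterns:

```python
import re

# Toy stand-in for a large corpus (each entry is one sentence).
CORPUS = [
    "at NIPS, AISTATS, KDD and other learning conferences",
    "on the PC of KDD, SIGIR, and ECIR",
    "for skiers, NIPS, SNOWBIRD, and ASPEN are popular",
]

# Crude pattern: a run of capitalized, comma-separated names.
LIST_RE = re.compile(r'((?:[A-Z][A-Za-z]*,\s+)+(?:and\s+)?[A-Z][A-Za-z]*)')

def list_items(sentence):
    m = LIST_RE.search(sentence)
    if not m:
        return set()
    return set(re.split(r',\s*(?:and\s+)?|\s+and\s+', m.group(1)))

def bootstrap(seeds, max_iterations=5):
    known = set(seeds)
    for _ in range(max_iterations):
        new = set()
        for sentence in CORPUS:
            items = list_items(sentence)
            # Trust a list context if it contains a known instance,
            # and harvest the remaining items as new instances.
            if items & known:
                new |= items - known
        if not new:   # one possible "until ___": until nothing new appears
            break
        known |= new
    return known

print(bootstrap({"NIPS", "ICML"}))
# The first pass adds AISTATS, KDD, SNOWBIRD, and ASPEN; the second pass
# reaches SIGIR and ECIR via the newly trusted KDD. Note that SNOWBIRD and
# ASPEN are noise: the bootstrapping-with-noise problem.
```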

Bootstrapping as graph proximity
[Figure: a bipartite graph linking contexts to the entities mentioned in them. The contexts “…at NIPS, AISTATS, KDD and other learning conferences…”, “…on PC of KDD, SIGIR, … and…”, “For skiers, NIPS, SNOWBIRD,… and…”, and “… AISTATS, KDD, …” connect the entity nodes NIPS, AISTATS, KDD, SNOWBIRD, and SIGIR]
– Shorter paths ~ earlier iterations
– Many paths ~ additional evidence
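
A toy sketch of this view (the node naming is mine): harvested contexts and entities become the two sides of a graph, so iteration depth turns into path length and repeated evidence turns into path multiplicity.

```python
from collections import defaultdict

# Bipartite graph: context nodes on one side, entity nodes on the other.
graph = defaultdict(set)

def add_context(context, entities):
    for e in entities:
        graph[context].add(e)   # context -> entity
        graph[e].add(context)   # entity -> context

add_context("c1 (...and other learning conferences)", {"NIPS", "AISTATS", "KDD"})
add_context("c2 (on PC of ...)", {"KDD", "SIGIR"})
add_context("c3 (for skiers ...)", {"NIPS", "SNOWBIRD"})

def path_counts(seed, max_hops=4):
    """Count walks of even length <= max_hops from the seed to each node.
    Shorter walks ~ earlier bootstrapping iterations; more walks ~ more evidence."""
    counts = defaultdict(int)
    stack = [(seed, 0)]
    while stack:
        node, hops = stack.pop()
        if hops > 0 and hops % 2 == 0:   # even hop counts land on entity nodes
            counts[node] += 1
        if hops < max_hops:
            stack.extend((nbr, hops + 1) for nbr in graph[node])
    return counts

# KDD is two hops from NIPS (one "iteration"); SIGIR only appears at four
# hops, and entities occurring in several contexts accumulate more walks.
print(dict(path_counts("NIPS")))
```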

Outline
Open-domain IE as finding nodes “near” seeds on a graph
– Approach 1: a “natural” graph derived from a smaller corpus + learned similarity (with Einat Minkov, CMU → Nokia)
– Approach 2: a carefully-engineered graph derived from a huge corpus (with Richard Wang, CMU → ?)

Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
[Figure: a dependency parse with words boys, like, playing, cars, all, kinds; POS tags NN, VB, DT, NN; and edges nsubj, partmod, prep.with, det, prep.of]
A dependency-parsed sentence is naturally represented as a tree.

Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
A dependency-parsed corpus is “naturally” represented as a graph.

Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008)
Open IE goal: find “coordinate terms” (e.g., girl/boy, dolls/cars) in the graph; that is, find a similarity measure S such that S(girl, boy) is high.
What about off-the-shelf similarity measures?
– Random Walk with Restart (RWR)
– Hitting time
– Commute time
– …

Personalized PageRank / RWR
Graph walk parameters: edge weights Θ, walk length K, and reset probability γ.
M[x,y] = probability of reaching y from x in one step: the edge weight from x to y, divided by the total outgoing weight from x.
“Personalized PageRank”: the reset probability is biased towards the initial distribution rather than the uniform one.
The graph has typed nodes and labeled, weighted edges. A query language: given an initial node distribution and a target node type, return a list of nodes of that type, ranked by the graph walk probabilities.
Approximate with power iteration, cut off after a fixed number of iterations K.
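
A minimal power-iteration sketch of this walk (the toy graph and parameter values are illustrative assumptions):

```python
import numpy as np

def personalized_pagerank(M, p0, gamma=0.5, K=6):
    """Power iteration for Personalized PageRank / RWR.

    M[x, y] = probability of stepping from x to y (rows sum to 1)
    p0      = initial (query) distribution, which the walk resets to
    gamma   = reset probability; K = number of iterations (walk length)
    Update rule: p <- gamma * p0 + (1 - gamma) * M^T p
    """
    p = p0.copy()
    for _ in range(K):
        p = gamma * p0 + (1 - gamma) * (M.T @ p)
    return p

# Toy 3-node graph: row-normalize the adjacency matrix to get M.
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
M = A / A.sum(axis=1, keepdims=True)
p0 = np.array([1., 0., 0.])          # query: start (and reset) at node 0
print(personalized_pagerank(M, p0))  # graph-walk probabilities for ranking
```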

[Figure: an example walk connecting “girls” to “boys”: girls → girls_1 → like_1 → like → like_2 → boys_2 → boys, following mention and nsubj edges and their inverses]

[Figure: two walks connecting “girls” to “boys”: the walk above, and girls → girls_1 → like_1 → playing_1 → playing → … → boys, which additionally follows a partmod edge]

[Figure: a walk connecting “girls” to “dolls”: girls → girls_1 → like_1 → playing_1 → dolls_1 → dolls, following mention, nsubj, prep.with, and inverse-mention edges]
Useful, but not our goal here…

Learning a better similarity metric
[Figure: each query a, b, …, q in task T (the query class) is run through the graph walk, producing a ranked list of nodes (rank 1 … rank 50) that is paired with the relevant answers for that query]
Seed words (“girl”, “boy”, …) → potential new instances of the target concept (“doll”, “child”, “toddler”, …)

Learning methods
– Weight tuning: weights learned per edge type [Diligenti et al., 2005]
– Reranking: re-order the retrieved list using global features of all paths from source to destination [Minkov et al., 2006]
Features:
– Edge label sequences, e.g. for boys → dolls: nsubj → nsubj-inv; nsubj → partmod → partmod-inv → nsubj-inv; nsubj → partmod → prep.in
– Lexical unigrams, e.g. “like”, “playing”
– …

Learning methods: Path-Constrained Graph Walk
PCW (summary): for each node x, learn
  P(x → z : relevant(z) | history(Vq, x))
where history(Vq, x) is the sequence of edge labels leading from Vq to x, and all histories are stored in a tree.
[Figure: a path tree rooted at Vq = “girls”, branching on edge labels such as nsubj, nsubj-inv, partmod, and partmod-inv, and leading through nodes x1, x2, x3 to terms like “boys” and “dolls”]
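
A rough sketch of the path-tree bookkeeping (the data structure and smoothing below are my assumptions, not the paper's exact algorithm):

```python
from collections import defaultdict

class PathTree:
    """Estimates P(relevant | edge-label history), keyed by the sequence
    of edge labels leading from the query nodes Vq to a node."""
    def __init__(self):
        self.pos = defaultdict(int)   # history -> relevant endpoints seen
        self.neg = defaultdict(int)   # history -> irrelevant endpoints seen

    def update(self, history, relevant):
        (self.pos if relevant else self.neg)[history] += 1

    def prob(self, history):
        p, n = self.pos[history], self.neg[history]
        return (p + 1) / (p + n + 2)   # Laplace-smoothed estimate

tree = PathTree()
# Training: labeled walks (edge-label sequence, was the endpoint relevant?)
tree.update(("mention", "nsubj", "nsubj-inv", "mention-inv"), relevant=True)
tree.update(("mention", "nsubj", "partmod", "mention-inv"), relevant=False)

# At walk time, branches are pruned or reweighted by these estimates,
# e.g. only histories with prob >= 0.5 (the threshold used below) survive.
print(tree.prob(("mention", "nsubj", "nsubj-inv", "mention-inv")))
```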

City and person name extraction
City names: Vq = {sydney, stamford, greenville, los_angeles}
Person names: Vq = {carter, dave_kingman, pedro_ramos, florio}

Corpus    Words     Nodes     Edges     NEs
MUC       140K      82K       244K      3K (true; complete labeling)
MUC+AP    2,440K    1,030K    3,550K    36K (auto; partial/noisy labeling)

– 10 (×4) queries for each task; train queries q1-q5 / test queries q6-q10
– Extract nodes of type NE
– Graph walk: 6 steps, uniform/learned weights
– Reranking: top 200 nodes (using learned weights)
– Path trees: 20 correct / 20 incorrect; threshold 0.5

[Figure: precision vs. rank for city-name and person-name extraction on MUC. Successive slides highlight the informative features: edge types (city names: conj-and, prep-in, nn, appos, …; person names: subj, obj, poss, nn, …), edge-label paths (city names: prep-in-inv → conj-and, nn-inv → nn; person names: nsubj → nsubj-inv, appos → nn-inv), and lexical unigrams (city names: “based”, “downtown”; person names: “mr”, “president”)]

Vector-space models
– Co-occurrence vectors (counts; window of ±2)
– Dependency vectors [Padó & Lapata, Comp. Ling. 2007]
  – Path value function: length-based value (1 / length(path)) or relation-based value (subj: 5, obj: 4, obl: 3, gen: 2, else: 1)
  – Context selection function: minimal (verbal predicate-argument, length 1), medium (adds coordination, genitive constructions, and noun compounds, length ≤ 3), or maximal (combinations of the above, length ≤ 4)
  – Similarity function: cosine or Lin
Only the top nodes retrieved with reranking are scored (~1000 overall).
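
For concreteness, a toy version of the simplest baseline above: co-occurrence vectors with a ±2 window, compared by cosine (the corpus and word pair are invented):

```python
from collections import Counter
import math

def cooccurrence_vector(corpus, target, window=2):
    """Counts of words appearing within +/- window tokens of the target."""
    vec = Counter()
    for sentence in corpus:
        toks = sentence.split()
        for i, tok in enumerate(toks):
            if tok == target:
                lo, hi = max(0, i - window), min(len(toks), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[toks[j]] += 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = ["girls like playing with dolls", "boys like playing with cars"]
print(cosine(cooccurrence_vector(corpus, "girls"),
             cooccurrence_vector(corpus, "boys")))  # high: shared contexts
```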

Graph walks vs. vector models: MUC
[Figure: precision vs. rank for city names and person names]
– The graph-based methods are best (syntactic structure + learning).

Graph walks vs. vector models: MUC + AP
[Figure: precision vs. rank for city names and person names]
– The advantage of the graph-based models diminishes with the amount of data.
– This is hard to evaluate at high ranks.

Outline
Open-domain IE as finding nodes “near” seeds on a graph
– Approach 1: a “natural” graph derived from a smaller corpus + learned similarity (with Einat Minkov, CMU → Nokia)
– Approach 2: a carefully-engineered graph derived from a huge corpus (with Richard Wang, CMU → ?)

Set Expansion for Any Language (SEAL) (Wang & Cohen, ICDM 2007)
Basic ideas:
– Dynamically build the graph using queries to the web
– Constrain the graph to be as useful as possible: be smart about queries, and be smart about “patterns” (use clever methods for finding meaningful structure on web pages)

System Architecture
– Fetcher: downloads web pages from the Web that contain all the seeds
– Extractor: learns wrappers from the web pages
– Ranker: ranks the entities extracted by the wrappers
Example ranked output: 1. Canon, 2. Nikon, 3. Olympus, 4. Pentax, 5. Sony, 6. Kodak, 7. Minolta, 8. Panasonic, 9. Casio, 10. Leica, 11. Fuji, 12. Samsung, …

The Extractor
Learns wrappers from web documents and seeds on the fly:
– Utilizes semi-structured documents
– Wrappers are defined at the character level: very fast, and no tokenization required, thus language-independent
– Wrappers derived from a document d are applied to d only
– See the ICDM 2007 paper for details
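
A simplified sketch of the character-level wrapper idea (the real extractor is more sophisticated; see the ICDM 2007 paper): take the longest left and right character contexts shared by one occurrence of each seed, then extract whatever falls between those contexts elsewhere in the same document.

```python
def longest_common_suffix(strings):
    s = strings[0]
    for t in strings[1:]:
        while not t.endswith(s):
            s = s[1:]
    return s

def longest_common_prefix(strings):
    s = strings[0]
    for t in strings[1:]:
        while not t.startswith(s):
            s = s[:-1]
    return s

def learn_wrapper(page, seeds):
    """Left/right character contexts shared by one occurrence of each seed."""
    lefts, rights = [], []
    for seed in seeds:
        i = page.find(seed)
        if i < 0:
            return None
        lefts.append(page[:i])
        rights.append(page[i + len(seed):])
    return longest_common_suffix(lefts), longest_common_prefix(rights)

def apply_wrapper(page, left, right):
    """Extract every string bracketed by the learned contexts in this page."""
    out, start = [], 0
    while True:
        i = page.find(left, start)
        if i < 0:
            return out
        j = page.find(right, i + len(left))
        if j < 0:
            return out
        out.append(page[i + len(left):j])
        start = j

page = "<li>canon</li><li>nikon</li><li>olympus</li><li>sony</li>"
left, right = learn_wrapper(page, ["canon", "nikon"])
print((left, right))                  # ('<li>', '</li><li>')
print(apply_wrapper(page, left, right))
# ['canon', 'nikon', 'olympus']; 'sony' is missed because no '</li><li>'
# follows it. No tokenization anywhere, so nothing is language-specific.
```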

[Figure: a web page with candidate mentions highlighted; some of the extractions are noise]

The Ranker
Ranks candidate entity mentions based on “similarity” to the seeds; noisy mentions should be ranked lower.
Random Walk with Restart (GW), as before… but what’s the graph?

Building a Graph
A graph consists of a fixed set of:
– Node types: {seeds, document, wrapper, mention}
– Labeled directed edges: {find, derive, extract}. Each edge asserts that a binary relation r holds, and each edge has an inverse relation r⁻¹ (the graph is cyclic).
Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions.
[Figure: seeds “ford”, “nissan”, “toyota” find documents curryauto.com and northpointcars.com, which derive wrappers #1 to #4, which extract mentions such as “honda” (26.1%), “acura” (34.6%), “chevrolet” (22.5%), “bmw pittsburgh” (8.4%), “volvo chicago” (8.4%)]
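
A hedged sketch of this typed graph (the node and edge names follow the slide; the example wrappers and weights are invented):

```python
from collections import defaultdict

# Typed graph: adjacency lists of (neighbor, edge_label) pairs.
# Node types: seed, document, wrapper, mention.
# Edge labels: find, derive, extract, plus an inverse for each.
graph = defaultdict(list)

def add_edge(src, rel, dst):
    graph[src].append((dst, rel))
    graph[dst].append((src, rel + "-inv"))   # inverse relation r^-1

# Seeds "find" documents, documents "derive" wrappers,
# wrappers "extract" mentions (example nodes from the slide).
for seed in ["ford", "nissan", "toyota"]:
    add_edge(("seed", seed), "find", ("doc", "curryauto.com"))
add_edge(("doc", "curryauto.com"), "derive", ("wrapper", 1))
add_edge(("doc", "curryauto.com"), "derive", ("wrapper", 2))
for w, mention in [(1, "honda"), (1, "acura"), (2, "acura"), (2, "chevrolet")]:
    add_edge(("wrapper", w), "extract", ("mention", mention))

# The mutual-reinforcement intuition: a random walk with restart from the
# seed nodes assigns high probability to mentions reached through many good
# wrappers, and to wrappers that extract many high-probability mentions.
for node, nbrs in graph.items():
    if node[0] == "mention":
        n = sum(1 for _, rel in nbrs if rel == "extract-inv")
        print(node[1], "extracted by", n, "wrapper(s)")
```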

Evaluation Datasets: closed sets

Evaluation Method
Mean Average Precision (MAP):
– Commonly used for evaluating ranked lists in IR
– Contains both recall- and precision-oriented aspects
– Sensitive to the entire ranking
– The mean of the average precisions for each ranked list

Evaluation procedure (per dataset):
1. Randomly select three true entities and use their first listed mentions as seeds
2. Expand the three seeds obtained in step 1
3. Repeat steps 1 and 2 five times
4. Compute MAP over the five ranked lists

AvgPrec(L) = ( Σ_{r : correct(r)} Prec(r) ) / #TrueEntities

where L is a ranked list of extracted mentions, r is a rank, and Prec(r) is the precision at rank r. The mention at rank r is correct if (a) it matches some true mention and (b) no extracted mention at a rank less than r is of the same entity as the one at r. #TrueEntities is the total number of true entities in the dataset.
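
A small sketch of the average-precision computation above (assuming each extracted mention has been mapped to its true entity, or None for noise):

```python
def average_precision(ranked_entities, num_true_entities):
    """ranked_entities: the entity behind each extracted mention, in rank
    order (None = no matching true mention). A rank r counts as correct
    only if it matches a true mention (a) and no earlier rank already
    covered the same entity (b)."""
    seen, hits, ap = set(), 0, 0.0
    for r, entity in enumerate(ranked_entities, start=1):
        if entity is not None and entity not in seen:
            seen.add(entity)
            hits += 1
            ap += hits / r          # Prec(r) at each newly correct rank
    return ap / num_true_entities

# One ranked list over a dataset with 4 true entities:
# the duplicate at rank 3 and the noise at rank 4 earn no credit.
print(average_precision(["canon", "nikon", "canon", None, "sony"], 4))

# MAP for the dataset = mean of average_precision over the five lists.
```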

Experimental Results: 3 seeds
Varying [Extractor] + [Ranker] + [Top N URLs]:
– Extractor: E1 = baseline extractor (longest common context for all seed occurrences); E2 = smarter extractor (longest common context for one occurrence of each seed)
– Ranker: EF = baseline (most frequent); GW = graph walk
– N URLs: {100, 200, 300}

Side-by-side comparisons: Talukdar, Brants, Liberman, Pereira, CoNLL 2006

Side-by-side comparisons: Ghahramani & Heller (Bayesian Sets), NIPS 2005
[Figure: EachMovie vs. WWW; NIPS vs. WWW]

A limitation of the original SEAL

Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008)
Makes several calls to SEAL; each call:
– Expands a couple of seeds
– Aggregates statistics
Evaluate iSEAL using:
– Two iterative processes: supervised vs. unsupervised (bootstrapping)
– Two seeding strategies: fixed seed size vs. increasing seed size
– Five ranking methods

iSEAL (Fixed Seed Size, Supervised)
[Figure: starting from the initial seeds, repeated SEAL calls each expand a small sample of seeds; finally, nodes are ranked by proximity to the seeds in the full accumulated graph]
– Refinement (ISS, Increasing Seed Size): increase the size of the seed set for each expansion over time: 2, 3, 4, 4, …
– Variant (Bootstrap): use high-confidence extractions when seeds run out
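
A schematic of the iSEAL loop under this setting (everything here is a stand-in: a real SEAL call fetches pages, learns wrappers, and extracts mentions, and the final ranking uses proximity in the accumulated graph rather than the simple evidence counts below):

```python
import random

def seal_expand(seeds):
    """Stub for one SEAL call. Returns mention -> evidence-count statistics
    from a tiny fake web, in place of fetching and wrapping real pages."""
    fake_web = {"canon": {"nikon": 3, "sony": 2},
                "nikon": {"canon": 3, "kodak": 1}}
    stats = {}
    for s in seeds:
        for mention, count in fake_web.get(s, {}).items():
            stats[mention] = stats.get(mention, 0) + count
    return stats

def iseal(true_seeds, calls=3, seeds_per_call=2):
    """Each call expands a small random sample of seeds (supervised, fixed
    seed size) and aggregates the statistics across calls."""
    totals = {}
    for _ in range(calls):
        sample = random.sample(sorted(true_seeds),
                               min(seeds_per_call, len(true_seeds)))
        for mention, count in seal_expand(sample).items():
            totals[mention] = totals.get(mention, 0) + count
    return sorted(totals, key=totals.get, reverse=True)

print(iseal({"canon", "nikon"}))   # e.g. ['canon', 'nikon', 'sony', 'kodak']
```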

Ranking Methods
– Random graph walk with restart: H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006.
– PageRank: L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford, 1999.
– Bayesian Sets (over the flattened graph): Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.
– Wrapper length: weights each item based on the length of the common contextual string shared by that item and the seeds
– Wrapper frequency: weights each item based on the number of wrappers that extract the item

– Little difference between ranking methods in the supervised case (all seeds correct); large differences when bootstrapping.
– Increasing the seed size {2, 3, 4, 4, …} makes all ranking methods improve steadily in the bootstrapping case.

Current work
– Start with the name of a concept (e.g., “NFL teams”)
– Look for (language-dependent) patterns: “… for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, …)”
– Take the most frequent answers as seeds
– Run bootstrapping iSEAL with seed sizes 2, 3, 4, 4, …

Datasets with concept names

Experimental results
[Figure: results comparing against direct use of text patterns]

Summary/Conclusions
Open-domain IE as finding nodes “near” seeds on a graph.
[Figure (repeated from earlier): contexts such as “…at NIPS, AISTATS, KDD and other learning conferences…”, “…on PC of KDD, SIGIR, … and…”, and “For skiers, NIPS, SNOWBIRD,… and…” linked to the entity nodes NIPS, AISTATS, KDD, SNOWBIRD, SIGIR]
– Shorter paths ~ earlier iterations
– Many paths ~ additional evidence

Summary/Conclusions
Open-domain IE as finding nodes “near” seeds on a graph, approach 1 (Minkov & Cohen, EMNLP 2008):
– Graph: a dependency-parsed corpus
– Off-the-shelf distance metrics are not great
– With learning: results significantly better than the state of the art on small corpora (e.g. a personal corpus), and competitive on 2M+ word corpora

Summary/Conclusions
Open-domain IE as finding nodes “near” seeds on a graph, approach 2 (Wang & Cohen, ICDM 2007, 2008):
– Graph built on the fly with web queries; a good graph matters!
– Off-the-shelf distance metrics work: differences are minimal for clean seeds
– Modest improvements from learning with clean seeds, e.g. reranking (not described here)
– Bigger differences between similarity measures with noisy seeds

Thanks to:
– DARPA PAL program (Minkov, Cohen, Wang)
– Yahoo! Research Labs (Minkov)
– Google Research Grant program (Wang)
– The organizers for inviting me!
Sponsored links: (Richard’s demo)