Learning 5000 Relational Extractors. Raphael Hoffmann, Congle Zhang, Daniel S. Weld, University of Washington. Talk at ACL 2010, 07/12/10.

“Which Russian-born writers publish in the U.K.?” Answering questions like this requires Information Extraction.

Types of Information Extraction
Traditional, Supervised IE: Input: corpus + manual labels. Relations: specified in advance. Complexity: O(D*R).
Open IE (TextRunner & WOE): Input: corpus + Wikipedia/PennTB + domain-independent methods. Relations: discovered automatically, as uninterpreted text strings. Complexity: O(D).

Types of Information Extraction
Traditional, Supervised IE: Input: corpus + manual labels. Relations: specified in advance. Complexity: O(D*R).
Weak Supervision (Kylin & LUCHS): Input: corpus + Wikipedia + domain-independent methods. Relations: learned. Complexity: O(D*R).
Open IE (TextRunner & WOE): Input: corpus + Wikipedia/PennTB + domain-independent methods. Relations: discovered automatically, as uninterpreted text strings. Complexity: O(D).

Weak Supervision (Kylin) [Wu and Weld, 2007]: heuristically match Wikipedia infobox values to the article text.
Article: "Jerome Allen “Jerry” Seinfeld is an American stand-up comedian, actor and writer, best known for playing a semi-fictional version of himself in the situation comedy Seinfeld, which he co-created and co-wrote with Larry David, and, in the show's final two seasons, co-executive-produced. Seinfeld was born in Brooklyn, New York. His father, Kalman Seinfeld, was of Galician Jewish background and owned a sign-making company; his mother, Betty, is of Syrian Jewish descent."
Infobox (Jerry Seinfeld): birth-date: April 29, 1954; birth-place: Brooklyn; nationality: American; genre: comedy, satire; height: 5 ft 11 in.
Heuristic matches in the text: "American" (nationality), "Brooklyn" (birth-place).
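To make the matching step concrete, here is a minimal sketch of Kylin-style weak supervision, assuming a simple exact-string match between infobox values and article sentences; the real matcher uses richer heuristics, and all names below are illustrative:

```python
# Sketch of Kylin-style weak supervision: align infobox values with article
# sentences to produce noisy token-span labels for training. Illustrative only;
# the real matcher uses richer heuristics (normalization, dates, units, etc.).

def weak_labels(infobox, sentences):
    """Yield (sentence, attribute, char_span) wherever an infobox value
    appears verbatim in an article sentence."""
    for attribute, value in infobox.items():
        for sentence in sentences:
            start = sentence.find(value)
            if start != -1:
                yield sentence, attribute, (start, start + len(value))

infobox = {"birth-place": "Brooklyn", "nationality": "American"}
sentences = ["Seinfeld was born in Brooklyn, New York.",
             "Jerome Allen \"Jerry\" Seinfeld is an American stand-up comedian."]

for sent, attr, span in weak_labels(infobox, sentences):
    print(attr, "->", sent[span[0]:span[1]])
```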

Wikipedia Infoboxes
Thousands of relations encoded in infoboxes. Infoboxes are an interesting target:
– By-product of thousands of contributors
– Broad in coverage and growing quickly
– Schema noisy and sparse, so extraction is challenging

Existing work on Kylin
Kylin performs well on popular classes (precision: mid 70% to high 90%; recall: low 50% to mid 90%) … but flounders on sparse classes, where there is too little training data. Is this a big problem? Only 18% of classes have >= 100 instances; only 60% have >= 10 instances.

Contributions
LUCHS: an autonomous, weakly supervised system which learns 5025 relational extractors. LUCHS introduces dynamic lexicon features, a new technique which dramatically improves performance on sparse data and thereby enables scalability. LUCHS reaches an average F1 score of 61%.

Outline Motivation Learning Extractors Extraction with Dynamic Lexicons Experiments Next Steps

Overview of LUCHS (system architecture): a Harvester collects filtered lists from the WWW; a Matcher aligns Wikipedia infoboxes with article text to produce training data; in the learning phase, a Lexicon Learner, a Classifier Learner, and a CRF Learner train an article classifier and per-attribute extractors; in the extraction phase, the Article Classifier routes classified articles to the appropriate Attribute Extractors, which emit tuples.

Learning Extractors
Article classifier: multi-class classifier using features such as words in the title and words in the first sentence.
CRF extractor: linear-chain CRF predicting a label for each word, using features: words, state transitions, capitalization, word contextualization, digits, dependencies, first sentence, lexicons, Gaussians.
Trained using the Voted Perceptron algorithm [Collins 2002; Freund and Schapire 1999].
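As a rough illustration of the training regime, the sketch below implements an averaged structured perceptron for a toy linear-chain tagger with only word-identity and label-transition features; it is a stand-in for, not a reproduction of, the LUCHS CRF and its Voted Perceptron training:

```python
# Averaged structured perceptron for a toy linear-chain tagger, in the spirit
# of Collins (2002); a sketch, not the LUCHS implementation. Features are only
# word identity and label transitions, and plain averaging stands in for the
# voted variant.
from collections import defaultdict

LABELS = ["O", "ATTR"]

def features(words, labels):
    # Feature vector of a full (sentence, label sequence) pair.
    f = defaultdict(float)
    prev = "<s>"
    for word, y in zip(words, labels):
        f[("emit", word, y)] += 1
        f[("trans", prev, y)] += 1
        prev = y
    return f

def viterbi(words, w):
    # Best-scoring label sequence under emission + transition weights.
    V = [{y: w[("emit", words[0], y)] + w[("trans", "<s>", y)] for y in LABELS}]
    back = [{}]
    for i, word in enumerate(words[1:], start=1):
        V.append({}); back.append({})
        for y in LABELS:
            scores = {yp: V[i - 1][yp] + w[("trans", yp, y)] for yp in LABELS}
            best = max(scores, key=scores.get)
            V[i][y] = scores[best] + w[("emit", word, y)]
            back[i][y] = best
    y = max(V[-1], key=V[-1].get)
    path = [y]
    for i in range(len(words) - 1, 0, -1):
        y = back[i][y]
        path.append(y)
    return path[::-1]

def train(data, epochs=5):
    w, total = defaultdict(float), defaultdict(float)
    for _ in range(epochs):
        for words, gold in data:
            pred = viterbi(words, w)
            if pred != gold:  # additive update toward gold, away from prediction
                for k, v in features(words, gold).items():
                    w[k] += v
                for k, v in features(words, pred).items():
                    w[k] -= v
            for k, v in w.items():
                total[k] += v
    return total  # averaged weights (up to a constant factor)

data = [(["born", "in", "Brooklyn"], ["O", "O", "ATTR"])]
weights = train(data)
print(viterbi(["born", "in", "Brooklyn"], weights))
```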

Outline Motivation Learning Extractors Extraction with Dynamic Lexicons Experiments Next Steps

Harvesting Lists from the Web
Must extract and index lists prior to learning. Lists are extremely noisy (navigation bars, tag sets, spam links, long text), so filtering steps are necessary. From 5B web pages on the WWW, the harvester retains 49M lists containing 56M unique phrases (e.g., lists of composers such as Mozart, Beethoven, Vivaldi; first names such as John, Paul, Simon, Ed, Nick; cities such as Boston, Seattle).
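The slide does not spell out the filtering rules; the following sketch shows the kind of heuristics one might apply, with all rules and thresholds hypothetical:

```python
# Sketch of the list-filtering step; the actual LUCHS filters are not detailed
# on the slide, so these rules and thresholds are illustrative guesses.

def keep_list(items):
    if not (3 <= len(items) <= 300):            # drop tiny and huge lists
        return False
    if any(len(x) > 60 for x in items):         # drop lists of long text snippets
        return False
    if len(set(items)) < 0.8 * len(items):      # drop lists dominated by repeats
        return False
    return True

print(keep_list(["Mozart", "Beethoven", "Vivaldi"]))   # True
print(keep_list(["Home", "Home", "Home", "Login"]))    # False: mostly repeats
```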

Semi-Supervised Learning of Lexicons
Generate lexicons specific to a relation in 3 steps:
1. Extract seed phrases from the training set (e.g., from "Seinfeld was born in Brooklyn, New York…", "Born in Omaha, Tony later developed…", "His birthplace was Boston." the seeds are Brooklyn, Omaha, Boston).
2. Expand the seed phrases into a set of lexicons using the harvested lists.
3. Add the lexicons as features to the CRF.

From Seeds to Lexicons
Similarity between lists is computed with a vector-space model. Intuition: two lists are similar if they share many phrases, the shared phrases are not too common, and the lists are not too long. Example lists: {Yokohama, Tokyo, Osaka} and {Tokyo, London, Moscow, Redmond}.
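The exact weighting is not given on the slide; an IDF-weighted cosine is one standard way to realize the stated intuition (shared phrases count more when they are rare, and long lists are down-weighted by the norm). The sketch below assumes that formulation, which is not necessarily the paper's:

```python
# IDF-weighted cosine similarity between two harvested lists: an assumed
# realization of the slide's intuition, not necessarily the exact weighting
# used in LUCHS. doc_freq maps a phrase to the number of lists containing it.
import math

def similarity(list_a, list_b, doc_freq, num_lists):
    def vec(lst):
        return {p: math.log(num_lists / (1 + doc_freq.get(p, 0))) for p in set(lst)}
    a, b = vec(list_a), vec(list_b)
    dot = sum(a[p] * b[p] for p in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc_freq = {"Tokyo": 900, "Yokohama": 50, "Osaka": 120, "London": 800,
            "Moscow": 300, "Redmond": 40}
print(similarity(["Yokohama", "Tokyo", "Osaka"],
                 ["Tokyo", "London", "Moscow", "Redmond"],
                 doc_freq, num_lists=49_000_000))
```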

From Seeds to Lexicons
Produce lexicons at different precision/recall compromises: sort the harvested lists by similarity to the seed phrases (e.g., the seeds Brooklyn, Omaha), then take the union of the phrases on the top-ranked lists. Including only the top lists yields a small, high-precision lexicon (e.g., Omaha, Boston, Denver, Brooklyn); including more lists yields larger, higher-recall lexicons that also admit noise (e.g., London, George, John).
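A sketch of how lexicons at different operating points can be derived from the ranked lists; the cutoffs and the toy similarity function are illustrative:

```python
# Sketch: rank harvested lists by similarity to the seed set, then take the
# union of phrases on the top-k lists for growing k. Small k yields a small,
# high-precision lexicon; larger k trades precision for recall. Cutoffs are
# illustrative, and any seeds-vs-list similarity function can be plugged in.

def build_lexicons(seeds, lists, similarity, cutoffs=(10, 100, 1000)):
    ranked = sorted(lists, key=lambda lst: similarity(seeds, lst), reverse=True)
    lexicons = []
    for k in cutoffs:
        lexicon = set()
        for lst in ranked[:k]:
            lexicon.update(lst)
        lexicons.append(lexicon)  # one lexicon per precision/recall point
    return lexicons

# Toy usage with a plain overlap-count similarity:
lists = [["Brooklyn", "Omaha", "Boston", "Denver"],
         ["London", "Miami", "Boston"],
         ["George", "John", "Paul"]]
print([sorted(lx) for lx in build_lexicons(
    {"Brooklyn", "Omaha"}, lists,
    lambda seeds, lst: len(seeds & set(lst)), cutoffs=(1, 2, 3))])
```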

Preventing Lexicon Overfitting
Lexicons are created from seeds in the training set, so the CRF may overfit if it is trained on the same examples that generated the lexicon features. For example, if the training sentences mention Brooklyn, Omaha, Boston, Denver, Redmond, Seattle, Spokane, Portland, and Austin ("Seinfeld was born in Brooklyn, New York…", "Born in Omaha, Tony later developed…", "His birthplace was Boston.", …), all nine cities become seeds.

… With Cross-Training
Lexicons are created from seeds in the training set, and the CRF may overfit if trained on the same examples that generated the lexicon features. Remedy: split the training set into k partitions and use different partitions for lexicon creation and for feature generation, so the lexicons added as features to one partition are generated from seeds found in the other partitions.
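A minimal sketch of the cross-training split, with hypothetical helper names standing in for seed extraction, lexicon construction, and feature generation:

```python
# Sketch of cross-training: partition the weakly labeled articles into k folds;
# the lexicons used to featurize a fold are built only from seed phrases found
# in the *other* folds, so the CRF never sees lexicon features derived from its
# own training matches. All helper names here are hypothetical placeholders.

def cross_train_featurize(articles, k, extract_seeds, expand_to_lexicons, add_features):
    folds = [articles[i::k] for i in range(k)]
    featurized = []
    for i, fold in enumerate(folds):
        others = [a for j, f in enumerate(folds) if j != i for a in f]
        lexicons = expand_to_lexicons(extract_seeds(others))
        featurized.extend(add_features(article, lexicons) for article in fold)
    return featurized
```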

Outline Motivation Learning Extractors Extraction with Dynamic Lexicons Experiments Next Steps

Impact of Lexicons
100 random attributes, heuristic matches as gold: lexicons substantially improve F1, and cross-training is essential.
Text attributes (F1): baseline .491; baseline + lexicons w/o cross-training .367; baseline + lexicons w/ cross-training .545.
Numeric attributes (F1): baseline .586; baseline + Gaussians w/o cross-training .623; baseline + Gaussians w/ cross-training .627.

Scaling to all of Wikipedia
Extract all 5025 attributes (matches as gold). 1138 attributes reach an F1 score of .80 or higher. Average F1 is .56 for text and .60 for numeric attributes; weighted by #instances, .64 and .78 respectively.

Towards an Attribute Ontology
The true promise of relation-specific extraction emerges when an ontology ties the system together. "Sloppiness" in infoboxes: identify duplicate relations. (Heatmap: the i,j-th pixel indicates the F1 of training on attribute i and testing on attribute j, for the 1000 attributes in the largest clusters.)

Next Steps
LUCHS' performance may benefit substantially from an ontology, and LUCHS may also facilitate ontology learning: thus, learn both jointly. Enhance robustness by performing deeper linguistic analysis, and combine with Open IE extraction techniques.

Related Work
– YAGO [Suchanek et al., WWW 2007]
– Bayesian Knowledge Corroboration [Kasneci et al., MSR 2010]
– PORE [Wang et al., 2007]
– TextRunner [Banko et al., IJCAI 2007]
– Distant Supervision [Mintz et al., ACL 2009]
– Kylin [Wu et al., CIKM 2007; Wu et al., KDD 2008]

Conclusions
Weakly-supervised learning of relation-specific extractors does scale. Introduced dynamic lexicon features, which enable hyper-lexicalized extractors.

Thank You!

Experiments
English Wikipedia 10/2008 dump. Classes with at least 10 instances: 1,583, comprising 981,387 articles and 5025 attributes. Consider the first 10 sentences of each article. Evaluate extraction at the token level.

Overall Extraction Performance
Tested the pipeline of classification and extraction against manually created gold labels on 100 articles not used for training: P 0.55, R 0.68, F1 0.61. Observations: many remaining errors stem from "ontology" sloppiness and from the low recall of the heuristic matches.

Article Classification
Take all 981,387 articles which have infoboxes; 4/5 for training, 1/5 for testing; use the existing infobox as the gold standard. Accuracy: 92%. Again, many errors are due to "ontology" sloppiness, e.g., Infobox Minor Planet vs. Infobox Planet.

Attribute Extraction
For each of 100 attributes, sample 100 articles for training and 100 articles for testing; use heuristic matches as gold labels. Baseline extractor: iteratively add the feature with the largest improvement (excluding the lexicon & Gaussian features).
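The baseline's feature selection can be read as greedy forward selection; a sketch follows, where the evaluation function stands in for training a CRF with the candidate templates and measuring F1 (assumed procedure, not the authors' exact code):

```python
# Sketch of greedy forward feature selection for the baseline extractor:
# repeatedly add the candidate feature template whose inclusion yields the
# largest score improvement. `evaluate` is a stand-in for training a CRF with
# the chosen templates and measuring F1 against the heuristic matches.

def greedy_select(candidates, evaluate):
    selected, best = [], evaluate([])
    remaining = list(candidates)
    while remaining:
        scores = {f: evaluate(selected + [f]) for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:
            break  # no remaining template improves the score
        selected.append(f_best)
        best = scores[f_best]
        remaining.remove(f_best)
    return selected, best
```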

Impact on Sparse Attributes
Lexicons are very effective for sparse attributes; gains are mostly in recall.
Text attributes (# training articles: ∆F1 / ∆Precision / ∆Recall): 10: +16% / +10% / +20%; 25: +13% / +7% / +20%; 100: … / +5% / +17%.
Numeric attributes: 10: +10% / +4% / +13%; 25: +8% / +4% / +10%; 100: +7% / +5% / +8%.
