KnowItAll and TextRunner

KnowItAll and TextRunner 10-25-2010

Key Ideas: So Far
High-precision low-coverage extractors and large redundant corpora (macro-reading)
  Hearst patterns (“cities such as Pittsburgh, Cleveland, and …”)
  Regular structure in tables, etc. (Brin, …)
Semi-supervised learning
  Self-training/bootstrapping or co-training
  Other semi-supervised methods:
    Expectation-maximization
    Transductive margin-based methods (e.g., transductive SVM, logistic regression with entropic regularization, …)
    Graph-based methods
      Label propagation
      Label propagation via random walk with reset

Bootstrapping
  Lin & Pantel ’02: clustering by distributional similarity…
  Hearst ’92: deeper linguistic features, free text…
  Blum & Mitchell ’98: learning, semi-supervised learning, dual feature spaces…
  Brin ’98: scalability, surface patterns, use of web crawlers…

Bootstrapping
  Lin & Pantel ’02: clustering by distributional similarity…
  Hearst ’92: deeper linguistic features, free text…
  Blum & Mitchell ’98: learning, semi-supervised learning, dual feature spaces…
  Brin ’98: scalability, surface patterns, use of web crawlers…
  Collins & Singer ’99: boosting-based co-training method using content and context features; context based on Collins’ parser; learns to classify three types of named entity

Bootstrapping
  Lin & Pantel ’02: clustering by distributional similarity…
  Hearst ’92: deeper linguistic features, free text…
  Blum & Mitchell ’98: learning, semi-supervised learning, dual feature spaces…
  Brin ’98: scalability, surface patterns, use of web crawlers…
  Collins & Singer ’99
  Riloff & Jones ’99: Hearst-like patterns, Brin-like bootstrapping (plus “meta-level” bootstrapping) on MUC data

Bootstrapping
  Lin & Pantel ’02: clustering by distributional similarity…
  Hearst ’92: deeper linguistic features, free text…
  Blum & Mitchell ’98: learning, semi-supervised learning, dual feature spaces…
  Brin ’98: scalability, surface patterns, use of web crawlers…
  Collins & Singer ’99
  Riloff & Jones ’99
  Cucerzan & Yarowsky ’99: EM-like co-training method with context and content both defined by character-level tries

Bootstrapping
  Lin & Pantel ’02: clustering by distributional similarity…
  Hearst ’92: deeper linguistic features, free text…
  Blum & Mitchell ’98: learning, semi-supervised learning, dual feature spaces…
  Brin ’98: scalability, surface patterns, use of web crawlers…
  Collins & Singer ’99
  Riloff & Jones ’99
  Cucerzan & Yarowsky ’99
  …
  Etzioni et al. 2005

Bootstrapping
  Lin & Pantel ’02: clustering by distributional similarity…
  Hearst ’92: deeper linguistic features, free text…
  Blum & Mitchell ’98: learning, semi-supervised learning, dual feature spaces…
  Brin ’98: scalability, surface patterns, use of web crawlers…
  Collins & Singer ’99
  Riloff & Jones ’99
  Cucerzan & Yarowsky ’99
  …
  Etzioni et al. 2005
  TextRunner

Bootstrapping
  Lin & Pantel ’02: clustering by distributional similarity…
  Hearst ’92: deeper linguistic features, free text…
  Blum & Mitchell ’98: learning, semi-supervised learning, dual feature spaces…
  Brin ’98: scalability, surface patterns, use of web crawlers…
  Collins & Singer ’99
  Riloff & Jones ’99
  Cucerzan & Yarowsky ’99
  …
  Etzioni et al. 2005
  TextRunner
  ReadTheWeb

Today’s paper: the KnowItAll system

Architecture
  Set of [disjoint?] predicates to consider + two names for each
  Context: keywords from user to filter out non-domain pages
  … ? ~= [H92]

Architecture

Bootstrapping - 1 template rule “city” query

Bootstrapping - 2
  Each discriminator U is a function: f_U(x) = hits(“city x”) / hits(“x”)
  e.g., f_U(“Pittsburgh”) = hits(“city Pittsburgh”) / hits(“Pittsburgh”)
  These are then used to create features: f_U(x) > θ and f_U(x) < θ
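A minimal sketch of one discriminator and its thresholded feature, assuming a hit-count lookup is available. The names `make_discriminator`, `fake_hits`, and the counts in `FAKE_HITS` are made up for illustration; a real system would query a search engine instead.

```python
from typing import Callable, Dict

# Hypothetical hit counts standing in for search-engine queries.
FAKE_HITS: Dict[str, int] = {
    "Pittsburgh": 1000,
    "city Pittsburgh": 50,
    "Banana": 2000,
    "city Banana": 2,
}


def fake_hits(query: str) -> int:
    """Return the (made-up) number of hits for a query string."""
    return FAKE_HITS.get(query, 0)


def make_discriminator(template: str, hits: Callable[[str], int]) -> Callable[[str], float]:
    """Build f_U(x) = hits(template filled with x) / hits(x)."""
    def f_u(x: str) -> float:
        denominator = hits(x)
        if denominator == 0:
            return 0.0  # no evidence at all for x
        return hits(template.format(x=x)) / denominator
    return f_u


def threshold_feature(score: float, theta: float) -> bool:
    """Boolean feature f_U(x) > theta (its negation covers f_U(x) < theta)."""
    return score > theta


f_city = make_discriminator("city {x}", fake_hits)
pittsburgh_score = f_city("Pittsburgh")
banana_score = f_city("Banana")
```

With these toy counts, `threshold_feature(pittsburgh_score, 0.01)` is true and `threshold_feature(banana_score, 0.01)` is false; the threshold 0.01 is arbitrary here.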

Bootstrapping - 3
  Submit the queries and apply the rules to produce initial seeds.
  Evaluate each seed with each discriminator U: e.g., compute PMI stats like |hits(“city Boston”)| / |hits(“Boston”)|
  Take the top seeds from each class and call them POSITIVE, then use disjointness of classes to find NEGATIVE seeds.
  Train a Naive Bayes classifier using the thresholded U’s as features.
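The last step can be sketched as a small Naive Bayes classifier over boolean features, where each feature says whether one discriminator's score exceeded its threshold. This is an illustrative sketch, not KnowItAll's implementation: the add-one smoothing, the function names, and the toy training rows are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Model = Tuple[Dict[bool, float], Dict[bool, List[float]]]


def train_naive_bayes(examples: Sequence[Tuple[Sequence[bool], bool]]) -> Model:
    """Fit class priors and per-feature likelihoods with add-one smoothing."""
    n_features = len(examples[0][0])
    counts = {True: [0] * n_features, False: [0] * n_features}
    totals = {True: 0, False: 0}
    for features, label in examples:
        totals[label] += 1
        for i, value in enumerate(features):
            if value:
                counts[label][i] += 1
    n = len(examples)
    prior = {c: (totals[c] + 1) / (n + 2) for c in (True, False)}
    likelihood = {
        c: [(counts[c][i] + 1) / (totals[c] + 2) for i in range(n_features)]
        for c in (True, False)
    }
    return prior, likelihood


def prob_positive(model: Model, features: Sequence[bool]) -> float:
    """Posterior probability that an item with these features is POSITIVE."""
    prior, likelihood = model
    log_score = {}
    for c in (True, False):
        total = math.log(prior[c])
        for i, value in enumerate(features):
            p = likelihood[c][i]
            total += math.log(p if value else 1.0 - p)
        log_score[c] = total
    top = max(log_score.values())  # subtract the max for numerical stability
    pos = math.exp(log_score[True] - top)
    neg = math.exp(log_score[False] - top)
    return pos / (pos + neg)


# Toy training set: (thresholded discriminator features, is POSITIVE seed).
training = [
    ([True, True], True),
    ([True, False], True),
    ([False, False], False),
    ([False, True], False),
]
model = train_naive_bayes(training)
```

In this toy data the first feature separates the classes, so an item with both features on scores above 0.5 and an item with both off scores below 0.5.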

Bootstrapping - 4 Estimate using the classifier based on the previously-trained discriminators Some ad hoc stopping conditions… (“signal to noise” ratio)

Architecture - 2

Extensions to KnowItAll
Problem: unsupervised learning finds clusters; what if the text doesn’t support the clustering we want?
  E.g., target is “scientist”, but natural clusters are “biologist”, “physicist”, “chemist”
Solution: subclass extraction
  Modify template/rule system to extract subclasses of target class (e.g., scientist → chemist, biologist, …)
  Check extracted subclasses with WordNet and/or PMI-like method (as for instances)
  Extract from each subclass recursively
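The recursive structure of subclass extraction can be sketched as follows. Everything concrete here is hypothetical: the lookup tables stand in for web extraction, `is_valid_subclass` stands in for the WordNet or PMI-like check, and the depth limit is an added safeguard that the slides do not mention.

```python
from typing import Callable, Dict, Iterable, List, Set

# Toy stand-ins for what the extraction rules would return from the web.
SUBCLASSES: Dict[str, List[str]] = {"scientist": ["chemist", "biologist", "plumber"]}
INSTANCES: Dict[str, List[str]] = {
    "scientist": ["Albert Einstein"],
    "chemist": ["Marie Curie"],
    "biologist": ["Charles Darwin"],
    "plumber": ["Mario"],
}


def find_subclasses(cls: str) -> Iterable[str]:
    return SUBCLASSES.get(cls, [])


def find_instances(cls: str) -> Iterable[str]:
    return INSTANCES.get(cls, [])


def is_valid_subclass(sub: str, parent: str) -> bool:
    """Placeholder for the WordNet and/or PMI-like check."""
    return sub != "plumber"


def extract_recursively(
    target: str,
    subclasses_of: Callable[[str], Iterable[str]],
    instances_of: Callable[[str], Iterable[str]],
    valid: Callable[[str, str], bool],
    max_depth: int = 2,
) -> Set[str]:
    """Collect instances of the target class and of every verified subclass."""
    seen: Set[str] = set()
    found: Set[str] = set()

    def visit(cls: str, depth: int) -> None:
        if cls in seen or depth > max_depth:
            return
        seen.add(cls)
        found.update(instances_of(cls))
        for sub in subclasses_of(cls):
            if valid(sub, cls):
                visit(sub, depth + 1)

    visit(target, 0)
    return found


scientists = extract_recursively(
    "scientist", find_subclasses, find_instances, is_valid_subclass
)
```

The rejected subclass contributes no instances, which is the point of checking subclasses before recursing into them.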

Extensions to KnowItAll
Problem: set of rules is limited: derived from a fixed set of “templates” (general patterns ~ from H92)
Solution 1: pattern learning: augment the initial set of rules derivable from templates
  Search for instances I on the web
  Generate patterns: some substring of I in context: “b1 … b4 I a1 … a4”
  Assume classes are disjoint and estimate recall/precision of each pattern P
  Exclude patterns that cover only one seed (very low recall)
  Take the top 200 remaining patterns and
    Evaluate them as extractors “using PMI” (?)
    Evaluate them as discriminators (in usual way?)
  Examples: “headquartered in <city>”, “<city> hotels”, …
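The pattern-generation and filtering steps can be sketched roughly as below. This is a simplified illustration under stated assumptions: instances are single whitespace-delimited tokens, patterns are left-only or right-only contexts of up to four tokens, and precision and recall are estimated by counting distinct seeds, with seeds of other (disjoint) classes serving as negatives. The function names and toy sentences are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

SLOT = "<class>"


def contexts(tokens: List[str], i: int, window: int = 4) -> Iterable[str]:
    """Yield left and right context patterns of 1..window tokens around position i."""
    for k in range(1, window + 1):
        if i - k >= 0:
            yield " ".join(tokens[i - k:i] + [SLOT])
        if i + k < len(tokens):
            yield " ".join([SLOT] + tokens[i + 1:i + 1 + k])


def score_patterns(
    sentences: Iterable[str],
    positive: Set[str],
    negative: Set[str],
    window: int = 4,
) -> Dict[str, Tuple[float, float]]:
    """Map each pattern to (estimated precision, estimated recall)."""
    pos_hits: Dict[str, Set[str]] = defaultdict(set)
    neg_hits: Dict[str, Set[str]] = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token in positive:
                bucket = pos_hits
            elif token in negative:
                bucket = neg_hits
            else:
                continue
            for pattern in contexts(tokens, i, window):
                bucket[pattern].add(token)
    scores: Dict[str, Tuple[float, float]] = {}
    for pattern, seeds in pos_hits.items():
        if len(seeds) < 2:
            continue  # exclude patterns that cover only one seed
        n_pos = len(seeds)
        n_neg = len(neg_hits.get(pattern, ()))
        scores[pattern] = (n_pos / (n_pos + n_neg), n_pos / len(positive))
    return scores


toy_sentences = [
    "headquartered in Pittsburgh",
    "headquartered in Boston",
    "Boston hotels",
    "Pittsburgh hotels",
    "starring in Casablanca",
]
pattern_scores = score_patterns(
    toy_sentences, positive={"Pittsburgh", "Boston"}, negative={"Casablanca"}
)
```

On the toy sentences, “headquartered in <class>” keeps perfect estimated precision while the shorter “in <class>” is penalized because it also matches a seed from the disjoint class.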

Extensions to KnowItAll
Solution 2: list extraction: augment the initial set of rules with rules that are local to a specific web page
  Search for pages containing small sets of instances (e.g., “London Paris Rome Pittsburgh”)
  For each page P:
    Find subtrees T of the DOM tree that contain >k seeds
    Find longest common prefix/suffix of the seeds in T
      [Some heuristics added to generalize this further]
    Find all other strings inside T with the same prefix/suffix
  Heuristically select the “best” wrapper for a page
  Wrapper = P, T, prefix, suffix
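The prefix/suffix step can be sketched as below, assuming a subtree T has already been located and flattened into a list of raw markup strings. That flattening, the `min_seeds` parameter, and the toy markup are assumptions made for illustration; the DOM search and the generalization heuristics are not covered.

```python
import os
import re
from typing import List, Optional, Sequence, Set, Tuple


def common_prefix(strings: Sequence[str]) -> str:
    """Longest common leading substring (character-wise)."""
    return os.path.commonprefix(list(strings))


def common_suffix(strings: Sequence[str]) -> str:
    """Longest common trailing substring, via reversed strings."""
    return common_prefix([s[::-1] for s in strings])[::-1]


def learn_wrapper(
    items: Sequence[str], seeds: Set[str], min_seeds: int = 2
) -> Optional[Tuple[str, str]]:
    """Return (prefix, suffix) shared by the markup around the seeds, if any."""
    befores: List[str] = []
    afters: List[str] = []
    found: Set[str] = set()
    for item in items:
        for seed in seeds:
            pos = item.find(seed)
            if pos >= 0:
                found.add(seed)
                befores.append(item[:pos])
                afters.append(item[pos + len(seed):])
    if len(found) < min_seeds:
        return None
    prefix, suffix = common_suffix(befores), common_prefix(afters)
    if not prefix or not suffix:
        return None  # nothing distinctive enough to anchor on
    return prefix, suffix


def apply_wrapper(items: Sequence[str], wrapper: Tuple[str, str]) -> List[str]:
    """Extract every string that sits between the learned prefix and suffix."""
    prefix, suffix = wrapper
    pattern = re.compile(re.escape(prefix) + r"(.+?)" + re.escape(suffix))
    extracted: List[str] = []
    for item in items:
        extracted.extend(match.group(1) for match in pattern.finditer(item))
    return extracted


subtree_items = [
    "<li><b>London</b> (UK)</li>",
    "<li><b>Paris</b> (France)</li>",
    "<li><b>Lima</b> (Peru)</li>",
    "<p>About this page</p>",
]
wrapper = learn_wrapper(subtree_items, {"London", "Paris"})
cities = apply_wrapper(subtree_items, wrapper) if wrapper else []
```

Here the two seeds are enough to learn the surrounding markup, and applying the wrapper also picks up the third list item, which was not a seed.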

Results - City

Results - Film

Results - Scientist