Bootstrapping April 3, 2007 William Cohen

Prehistory Karl Friedrich Hieronymus, Freiherr von Münchhausen (11 May 1720 – 22 February 1797) was a German baron who in his youth was sent to serve as page to Anthony Ulrich II, Duke of Brunswick-Lüneburg and later joined the Russian military. He served until 1750, in particular taking part in two campaigns against the Turks. Returning home, Münchhausen supposedly told a number of outrageous tall tales about his adventures. The Baron was born in Bodenwerder and died there as well. According to the stories, as retold by others, the Baron's astounding feats included riding cannonballs, travelling to the Moon, and escaping from a swamp by pulling himself up by his own hair. … In later versions he was using his own boot straps to pull himself out of the sea. [Wikipedia]

Prehistory “Bob Wilson is desperately trying to finish his doctoral thesis and has locked himself in his room in a marathon attempt to do so. His typewriter jams, and as he unjams it he hears someone say "Don't bother, it's hogwash anyway." The thesis, in fact, deals with time travel. The interloper is a man who seems strangely familiar, and might be recognizable without the two-day growth of beard and the black eye. …” “In computing, bootstrapping refers to a process where a simple system activates another more complicated system that serves the same purpose. It is a solution to the chicken-and-egg problem of starting a certain system without the system already functioning. The term is most often applied to the process of starting up a computer, in which a mechanism is needed to execute the software program that is responsible for executing software programs …” [Wikipedia]

Some more recent history - 1 Idea: write some specific patterns that indicate A is a kind of B:
1. … such NP as NP (“at such schools as CMU, students rarely need extensions”)
2. NP, NP, or other NP (“William, Carlos or other machine learning professors”)
3. NP including NP (“struggling teams including the Pirates”)
4. NP, especially NP (“prestigious conferences, especially NIPS”)
[Hearst, Coling 1992]
Results: 8.6M words of Grolier’s encyclopedia → 7067 pattern instances → 152 relations. Many were not in WordNet.
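As a concrete illustration (not part of the original slides), such surface patterns can be approximated with regular expressions. The NP definition, the pattern set, and the example sentence below are simplifying assumptions; a real system would use a noun-phrase chunker rather than a regex.

```python
import re

# Crude stand-in for an NP: one or two word tokens. (Illustrative assumption;
# Hearst's patterns operate over real noun-phrase chunks.)
NP = r"\w+(?:\s\w+)?"

PATTERNS = [
    re.compile(rf"such (?P<hyper>{NP}) as (?P<hypo>{NP})", re.I),
    re.compile(rf"(?P<hyper>{NP}), especially (?P<hypo>{NP})", re.I),
    re.compile(rf"(?P<hyper>{NP}) including (?P<hypo>{NP})", re.I),
]

def extract_isa(sentence):
    """Yield (hyponym, hypernym) candidates matched by the toy patterns."""
    for pat in PATTERNS:
        for m in pat.finditer(sentence):
            yield m.group("hypo"), m.group("hyper")

print(list(extract_isa("At such schools as CMU, students rarely need extensions.")))
# -> [('CMU', 'schools')]
```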

Some history – 2a Idea: exploit “pattern/relation duality”:
1. Start with some seed instances of (author, title) pairs (“Isaac Asimov”, “The Robots of Dawn”).
2. Look for occurrences of these pairs on the web.
3. Generate patterns that match the seeds: (URLprefix, prefix, middle, suffix).
4. Extract new (author, title) pairs that match the patterns.
5. Go to 2.
[some workshop, 1998]
Unlike Hearst, Brin learned the patterns, and learned very high-precision, easy-to-match patterns.
Result: 24M web pages + 5 books → 199 occurrences → 3 patterns → 4047 occurrences + 5M pages → 3947 occurrences → 105 patterns → … 15,257 books*
* with some manual tweaks

Some history – 2b Idea: exploit “pattern/relation duality”:
1. Start with some seed instances of (author, title) pairs (“Isaac Asimov”, “The Robots of Dawn”).
2. Look for occurrences of these pairs on the web.
3. Generate patterns that match the seeds: (URLprefix, prefix, middle, suffix).
4. Extract new (author, title) pairs that match the patterns.
5. Go to 2.
Result: 24M web pages + 5 books → 199 occurrences → 3 patterns → 4047 occurrences + 5M pages → 3947 occurrences → 105 patterns → … 15,257 books*
* with some manual tweaks
[Diagram: Instances – Patterns – Occurrences]
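As a deliberately toy sketch of that loop (my own illustration, not Brin's DIPRE implementation): the snippets below stand in for web pages, titles are assumed to appear in double quotes, and only the "middle" context is kept as a pattern, whereas DIPRE also used the URL prefix, a left prefix, a right suffix, and specificity checks to keep precision high.

```python
import re

def dipre_bootstrap(seeds, corpus, rounds=2):
    """Toy sketch of the instances -> occurrences -> patterns -> instances loop."""
    pairs = set(seeds)
    for _ in range(rounds):
        # Steps 2-3: find occurrences of known (author, title) pairs and keep
        # the text between author and title as a candidate "middle" pattern.
        middles = set()
        for snip in corpus:
            for author, title in pairs:
                i, j = snip.find(author), snip.find('"' + title + '"')
                if 0 <= i < j:
                    middles.add(snip[i + len(author):j])
        # Step 4: apply each middle pattern to extract new (author, title) pairs.
        for middle in middles:
            pat = re.compile(r"([A-Z]\w+(?: [A-Z]\w+)*)" + re.escape(middle)
                             + r'"([^"]+)"')
            for snip in corpus:
                pairs.update(pat.findall(snip))
    return pairs

seeds = {("Isaac Asimov", "The Robots of Dawn")}
corpus = ['Isaac Asimov, author of "The Robots of Dawn", also wrote ...',
          'Frank Herbert, author of "Dune", needs no introduction.']
print(dipre_bootstrap(seeds, corpus))
# -> also finds ("Frank Herbert", "Dune")
```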

Some history – 3 [Blum & Mitchell, COLT 98]

Some history – 3b [Diagrams: Instances – Patterns – Occurrences; Instances/Occurrences – Patterns] How to filter out “bad” instances, occurrences, patterns?

Bootstrapping [Diagram: a map of bootstrapping work relating Hearst ‘92, Brin ’98, and BM ’98 along three dimensions: scalability, surface patterns, use of web crawlers; learning, semi-supervised learning, dual feature spaces; deeper linguistic features, free text]

Bootstrapping [same diagram, adding Collins & Singer ‘99]: boosting-based co-training method using content & context features; context based on Collins’ parser; learns to classify three types of NE.

Bootstrapping [same diagram, adding Riloff & Jones ‘99]: Hearst-like patterns, Brin-like bootstrapping (+ “meta-level” bootstrapping) on MUC data.

Bootstrapping [same diagram, adding Cucerzan & Yarowsky ‘99]: EM-like co-training method with context & content both defined by character-level tries.

Bootstrapping [same diagram, adding Etzioni et al 2005, Rosenfeld and Feldman 2006, …, and Stevenson & Greenwood 2005]: Stevenson & Greenwood de-emphasize duality and focus on distance between patterns.

Stevenson & Greenwood [Diagram: Instances/Occurrences – Patterns] Pattern-to-pattern similarity here is semantic similarity (WordNet); flow from pattern to pattern depends on empirical similarity (i.e. overlapping occurrences in the corpus).
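One simple way to make the empirical side of "distance between patterns" concrete is Jaccard overlap between the sets of corpus occurrences two patterns match. This is my own minimal illustration (the function and the example IDs are assumptions); the semantic variant would instead compare the patterns' content words in WordNet.

```python
def empirical_similarity(occurrences_a, occurrences_b):
    """Jaccard overlap of the occurrence sets matched by two patterns."""
    a, b = set(occurrences_a), set(occurrences_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical sentence IDs matched by two extraction patterns.
print(empirical_similarity({1, 2, 3, 7}, {2, 3, 9}))  # 0.4
```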

Bootstrapping [same diagram, highlighting Rosenfeld and Feldman 2006]: clever idea for learning relation patterns & strong experimental results.

Rosenfeld & Feldman Instances → Occurrences as before. Vary “positive” occurrences to get near-miss “negative” occurrences, using asymmetry, disjointness, etc. Learn patterns in a (moderately) expressive but easy-to-match language (NPs from OpenNLP).
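A rough sketch of the near-miss idea, under illustrative assumptions (the relation, the entities, and the exact perturbations are invented here, not taken from their paper): for an asymmetric relation, reversing the argument order gives a likely negative, and so does substituting an argument from a disjoint class.

```python
def near_miss_negatives(positive_pairs, disjoint_entities):
    """Derive likely-negative (arg1, arg2) occurrences from positive ones."""
    negatives = set()
    for a, b in positive_pairs:
        negatives.add((b, a))            # asymmetry: reversed argument order
        for d in disjoint_entities:
            negatives.add((a, d))        # disjointness: arg2 from the wrong class
    return negatives - set(positive_pairs)

# E.g. acquisition(acquirer, acquired), with "Paris" known to be a city, not a company.
print(near_miss_negatives({("Google", "YouTube")}, {"Paris"}))
```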

KnowItAll

Architecture Set of predicates to consider + two names for each ~= [Hearst ‘92]

Architecture

Bootstrapping
1. Submit the queries & apply the rules → initial seeds.
2. Evaluate each seed with each discriminator U: e.g., compute PMI stats like |hits(“city Boston”)| / |hits(“Boston”)|.
3. Take the top seeds from each class and call them POSITIVE, and use disjointness, etc. to find NEGATIVE seeds.
4. Train a Naive Bayes classifier using thresholded U’s as features.
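A small sketch of step 2 and the thresholding feeding step 4: the hit counts are made-up stand-ins for real search-engine queries, and the 0.01 cutoff is an arbitrary illustrative threshold. The resulting binary features are what the Naive Bayes classifier would be trained on, using the automatically chosen POSITIVE/NEGATIVE seeds as labels.

```python
def pmi_score(hits, discriminator, candidate):
    """PMI-style statistic from the slide: |hits(discriminator candidate)| / |hits(candidate)|."""
    return hits.get(f"{discriminator} {candidate}", 0) / max(hits.get(candidate, 1), 1)

# Hypothetical hit counts in place of web queries.
hits = {"Boston": 1_000_000, "city Boston": 40_000,
        "Asimov": 200_000, "city Asimov": 5}

for cand in ["Boston", "Asimov"]:
    score = pmi_score(hits, "city", cand)
    feature = int(score > 0.01)   # thresholded discriminator -> binary NB feature
    print(cand, round(score, 6), feature)
# Boston 0.04 1
# Asimov 2.5e-05 0
```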

Bootstrapping - 2