Web-scale Information Extraction in KnowItAll. Oren Etzioni et al., U. of Washington, WWW 2004. Presented by Zheng Shao, CS591CXZ.


Outline

- Motivation
- System Architecture
- Detailed Techniques
  - Search Engine Interface
  - Extractor
  - Probabilistic Assessment
- Experimental Results
- Future Work
- Conclusion

Motivation

- Why Web-scale information extraction?
  - The Web is the largest knowledge base.
  - Extracting information by searching the Web is not easy: e.g., list the cities in the world whose population is above 400,000, or the humans who have visited space.
  - Unless we find the "right" document, this becomes a tedious, error-prone process of piecemeal search.

Motivation (2)

- Previous information extraction work
  - Supervised learning
    - Difficult to scale to the Web:
      - the diversity of the Web
      - the prohibitive cost of creating an equally diverse set of hand-tagged documents
  - Weakly supervised and bootstrapped
    - Needs domain-specific seeds
    - Learns rules from seeds, then new seeds from rules, and so on
- KnowItAll
  - Domain-independent
  - Uses bootstrapping

System Architecture

- Four components, connected by data flow:
  - Extractor
  - Search Engine Interface
  - Assessor
  - Database

System Architecture - System Work Flow

- A rule template plus a class name yields a rule and its search keywords:
  - Rule template:
    NP1 "such as" NPList2
    & head(NP1) = plural(name(Class1))
    & properNoun(head(each(NPList2)))
    => instanceOf(Class1, head(each(NPList2)))
  - Instantiated rule (Class1 = Country):
    NP1 "such as" NPList2
    & head(NP1) = "countries"
    & properNoun(head(each(NPList2)))
    => instanceOf(Country, head(each(NPList2)))
  - Keywords: "countries such as"
- The Search Engine Interface sends the keywords to search engines and retrieves the matching Web pages for the Extractor.

System Architecture - System Work Flow (2)

- The Extractor applies the rule to the retrieved Web pages:
  - Matched noun-phrase lists: "the United Kingdom and Canada"; "India"; "North Korea, Iran, India and Pakistan"; "Japan"; "Iraq, Italy and Spain"; ...
  - Extracted candidates: the United Kingdom; Canada; India; North Korea; Iran; ...
- The Assessor builds a discriminator phrase for each candidate and checks its search-engine frequency:
  - Country AND X → "countries such as X"
  - Country AND the United Kingdom → "countries such as the United Kingdom"
- Validated extractions become knowledge in the database.

System Architecture

- Search Engine Interface
  - Distributes queries across different search engines
- Extractor
  - Rule instantiation
  - Information extraction
- Assessor
  - Discriminator phrase construction
  - Assessment of extracted information

Search Engine Interface

- Metaphor: the information food chain
  - Search engine → herbivore
  - KnowItAll → carnivore
- Why build on top of search engines?
  - No need to duplicate existing work
  - Low cost, time, and effort
- Query distribution
  - Make sure not to overload the search engines

Extractor

- Extraction template examples:
  - NP1 {","} "such as" NPList2
  - NPList1 {","} "and other" NP2
  - NP1 {","} "is a" NP2
- All are domain-independent!
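As a rough illustration of the "such as" template, the pattern can be approximated in plain Python. This is only a sketch: KnowItAll runs a noun-phrase chunker over retrieved pages, whereas the regex below crudely stands in for proper-noun NPs with capitalized-word sequences.

```python
import re

# Crude stand-in for a proper-noun noun phrase: capitalized words.
PROPER_NP = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Approximates the template: NP1 {","} "such as" NPList2
SUCH_AS = re.compile(
    r"\b([a-z]+),?\s+such as\s+("
    + PROPER_NP
    + r"(?:\s*,\s*" + PROPER_NP + r")*"
    + r"(?:\s*,?\s+and\s+" + PROPER_NP + r")?)"
)

def extract_instances(text, class_plural):
    """Return candidate class instances from 'X such as A, B and C'."""
    instances = []
    for match in SUCH_AS.finditer(text):
        if match.group(1).lower() != class_plural:
            continue  # NP1's head must match the class's plural name
        np_list = re.split(r"\s*,\s*|\s+and\s+", match.group(2))
        instances.extend(np for np in np_list if np)
    return instances

print(extract_instances("He visited countries such as France, Spain and Japan",
                        "countries"))
# → ['France', 'Spain', 'Japan']
```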

Extractor (2)

- Noun phrase analysis
  - A. "China is a country in Asia"
  - B. "Garth Brooks is a country singer"
  - In A, the word "country" is the head of a simple noun phrase.
  - In B, the word "country" is not the head (the head is "singer").
  - So China is indeed a country, while Garth Brooks is not.
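The head check can be sketched with the usual heuristic that the head of a simple English noun phrase is its last word. The helper below is hypothetical; KnowItAll's extractor relies on a real NP chunker.

```python
def head_of_np(simple_np):
    """Crude heuristic: the head of a simple English noun phrase is its
    last word. (KnowItAll uses a proper noun-phrase chunker instead.)"""
    return simple_np.split()[-1].lower()

# "China is a country in Asia"       -> simple NP "a country", head "country"
# "Garth Brooks is a country singer" -> simple NP "a country singer", head "singer"
print(head_of_np("a country"))         # country
print(head_of_np("a country singer"))  # singer
```

Only sentence A, where "country" is the head, supports instanceOf(Country, China); sentence B is rejected.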

Extractor (3)

- Rule template:
  NP1 "such as" NPList2
  & head(NP1) = plural(name(Class1))
  & properNoun(head(each(NPList2)))
  => instanceOf(Class1, head(each(NPList2)))
- The Extractor generates a rule for "Country" from this template by substituting "Country" for Class1.
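The substitution step can be sketched as follows. The function names and the dictionary encoding of a rule are illustrative, not KnowItAll's actual code, and plural() is a naive stand-in.

```python
def plural(name):
    """Naive pluralization for this sketch only; a real system would
    also need irregular forms."""
    name = name.lower()
    return name[:-1] + "ies" if name.endswith("y") else name + "s"

def instantiate_rule(class_name):
    """Substitute class_name for Class1 in the 'such as' rule template."""
    head = plural(class_name)  # e.g. "countries"
    return {
        "pattern": f'NP1 "such as" NPList2 & head(NP1) = "{head}" '
                   f"& properNoun(head(each(NPList2)))",
        "conclusion": f"instanceOf({class_name}, head(each(NPList2)))",
        "keywords": f"{head} such as",  # query sent to the search engines
    }

rule = instantiate_rule("Country")
print(rule["keywords"])    # countries such as
print(rule["conclusion"])  # instanceOf(Country, head(each(NPList2)))
```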

Assessor

- Naïve Bayesian model
  - Features: hit counts returned by the search engine
  - Class variable: whether the extracted information is a fact
- Adjusting the threshold trades precision against recall

Assessor (2)

- Uses bootstrapping to learn P(f_i | Φ) and P(f_i | ¬Φ)
- Define PMI(I, D) = |Hits(D + I)| / |Hits(I)|
  - I: the extracted noun phrase
  - D: a discriminator phrase
- Four ways to model P(f_i | Φ) and P(f_i | ¬Φ):
  - Hits-Thresh: P(hits > Hits(D + I) | Φ)
  - Hits-Density: p(hits = Hits(D + I) | Φ)
  - PMI-Thresh: P(pmi > PMI(I, D) | Φ)
  - PMI-Density: p(pmi = PMI(I, D) | Φ)
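The PMI feature and the naïve Bayes combination can be sketched as below. The hit counts, threshold, prior, and conditional probabilities are invented for illustration; in KnowItAll the conditionals are learned by bootstrapping.

```python
def pmi(hits_d_plus_i, hits_i):
    """PMI(I, D) = |Hits(D + I)| / |Hits(I)|."""
    return hits_d_plus_i / hits_i if hits_i else 0.0

def naive_bayes_posterior(features, p_f_given_fact, p_f_given_not, prior=0.5):
    """P(fact | f1..fn) under the conditional-independence assumption."""
    p_fact, p_not = prior, 1.0 - prior
    for f, p_t, p_f in zip(features, p_f_given_fact, p_f_given_not):
        p_fact *= p_t if f else (1.0 - p_t)
        p_not *= p_f if f else (1.0 - p_f)
    return p_fact / (p_fact + p_not)

# Hypothetical numbers: 1,200 hits for "countries such as France",
# 50,000 hits for "France" alone.
score = pmi(hits_d_plus_i=1200, hits_i=50000)  # 0.024
features = [score > 0.01]                      # a PMI-Thresh-style binary feature
print(naive_bayes_posterior(features,
                            p_f_given_fact=[0.8],
                            p_f_given_not=[0.1]))  # ≈ 0.889
```

Raising the probability threshold for accepting an extraction trades recall for precision, as the previous slide notes.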

Experimental Results

- Precision vs. recall:
  - Thresh performs better than Density
  - PMI performs better than Hits

Experimental Results (2)

- Total run time: 4 days
- Web pages retrieved vs. time: about 3,000 pages/hour
- New facts vs. Web pages retrieved: declines from 1 new fact per 3 pages to 1 new fact per 7 pages

Conclusion & Future Work

- Conclusion:
  - Domain-independent rule templates
  - Rules generated from rule templates
  - Built on top of search engines
  - Assessor model: more data yields more accurate assessment
- Future work:
  - Learn domain-specific rules to improve recall
  - Automatically extend the ontology

Q & A

Thanks!