Structured Querying of Web Text: A Technical Challenge Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko University of Washington.

Presentation transcript:

Structured Querying of Web Text: A Technical Challenge Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, Michele Banko University of Washington Asilomar, CA January 9, 2007

2 Structured Queries, Unstructured Data

“Show me some people, what they invented, and the years they died”

q(?a, ?b, ?c) :- invented(?a, ?b), died-in(?a, ?c)

a           b                  c    prob
Kepler      log books
Heisenberg  matrix mechanics
Galileo     telescope
Newton      calculus
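
As a rough illustration of how the query on this slide could be evaluated, the sketch below joins the invented and died-in facts on the shared variable ?a. The in-memory list, the died-in row, and every probability except Edison's 0.97 are hypothetical, and scoring each answer by the product of its fact probabilities assumes the facts are independent.

```python
# Toy fact table: (obj1, pred, obj2, prob).
# Only the (Edison, invented, light bulb, 0.97) row comes from the slides;
# the other rows and their probabilities are made up for illustration.
FACTS = [
    ("Edison", "invented", "light bulb", 0.97),
    ("Edison", "died-in", "1931", 0.90),
    ("Newton", "invented", "calculus", 0.80),
]


def query(facts):
    """Evaluate q(?a, ?b, ?c) :- invented(?a, ?b), died-in(?a, ?c)."""
    invented = [(a, b, p) for a, pred, b, p in facts if pred == "invented"]
    died_in = [(a, c, p) for a, pred, c, p in facts if pred == "died-in"]
    results = []
    for a1, b, p1 in invented:
        for a2, c, p2 in died_in:
            if a1 == a2:  # join on the shared variable ?a
                # Score the answer by the product of the two fact
                # probabilities (independence assumption).
                results.append((a1, b, c, p1 * p2))
    # Most probable answers first.
    return sorted(results, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for row in query(FACTS):
        print(row)
```

Newton produces no answer here because the toy table has no died-in fact for him; a conjunctive query only returns bindings that satisfy every atom.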

3 ExDB

1. Run extractors   2. Populate data model   3. Queries

Web:
…no one could surprising. In 1877, Edison invented the phonograph. Although he…
…didnt surprising. In 1877, Edison invented the phonograph. Although he…
…was surprising. In 1877, Edison invented the phonograph. Although he…

RDBMS:

Facts
Obj1    Pred      Obj2        prob
Edison  invented  light bulb  0.97
Morgan  born-in

Types
Type       Instance  prob
scientist  Einstein  0.99
city       Seattle   0.92

Synonyms
Pred1     Pred2       prob
invented  did-invent  0.85
invented  created     0.72

Query middleware: invented(Edison ?e, ?i)

5 Information Extraction

Each concept has an IE mechanism

Example                                 Description       IE technique
invented(Edison, phonograph)            Arity-2 fact      TextRunner
Einstein                                Type (hypernymy)  KnowItAll
has-invented = invented                 Synonymy          DIRT
invented ⊆ discovered                   ID (troponymy)    ?
FD: has-capital(x, y) → has-capital(y)  FD (rule)         ?

7 Populate Data Model

Use extractions to fill tables

Source text:
It was big news when Edison invented the light bulb.
He visited cities such as Boston and New York.
We all know that Edison invented the light bulb.
… In 1877 Edison created the light bulb.

Facts
Obj1    Pred      Obj2        prob
Edison  invented  light bulb  0.97
Morgan  born-in

Types
Type       Instance  prob
scientist  Einstein  0.99
city       Boston    0.92

Synonyms
Pred1     Pred2       prob
invented  did-invent  0.85
invented  created     0.72

IDs
Inclusion  Includer    prob
invented   discovered  0.81
Seattle    Washington  0.65

FDs
LHS            RHS         prob
capital(x, y)  capital(y)  0.77
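
One way to picture the five tables is as an ordinary relational schema. The sketch below loads the complete rows shown on this slide into an in-memory SQLite database. The column names, the choice of SQLite, and the helper names build_db and table_counts are assumptions for illustration; the slides do not show the prototype's actual schema.

```python
import sqlite3

# Hypothetical schema mirroring the five tables on the slide.
SCHEMA = """
CREATE TABLE facts    (obj1 TEXT, pred TEXT, obj2 TEXT, prob REAL);
CREATE TABLE types    (type_name TEXT, instance TEXT, prob REAL);
CREATE TABLE synonyms (pred1 TEXT, pred2 TEXT, prob REAL);
CREATE TABLE ids      (inclusion TEXT, includer TEXT, prob REAL);
CREATE TABLE fds      (lhs TEXT, rhs TEXT, prob REAL);
"""

TABLES = ("facts", "types", "synonyms", "ids", "fds")


def build_db():
    """Create the tables and insert the complete rows from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO facts VALUES (?, ?, ?, ?)",
        ("Edison", "invented", "light bulb", 0.97),
    )
    conn.executemany(
        "INSERT INTO types VALUES (?, ?, ?)",
        [("scientist", "Einstein", 0.99), ("city", "Boston", 0.92)],
    )
    conn.executemany(
        "INSERT INTO synonyms VALUES (?, ?, ?)",
        [("invented", "did-invent", 0.85), ("invented", "created", 0.72)],
    )
    conn.executemany(
        "INSERT INTO ids VALUES (?, ?, ?)",
        [("invented", "discovered", 0.81), ("Seattle", "Washington", 0.65)],
    )
    conn.execute(
        "INSERT INTO fds VALUES (?, ?, ?)",
        ("capital(x, y)", "capital(y)", 0.77),
    )
    conn.commit()
    return conn


def table_counts(conn):
    """Return the number of rows in each table."""
    return {
        name: conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
        for name in TABLES
    }


if __name__ == "__main__":
    print(table_counts(build_db()))
```

The incomplete Morgan row is left out because its Obj2 and prob values are not given.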

9 Query Processing

For non-projecting queries, we can compute top-k queries. The combination function is the product of probabilities.

For projecting queries, we compute the disjunction of m probabilistic events. In general this is NP-hard, so we approximate using the panel of experts.
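
The two combination rules can be sketched as follows, under the simplifying assumption that all events are independent. That assumption is what makes the disjunction a one-line formula; it does not hold in general, since derivations of the same answer can share tuples, and the sketch does not implement the panel-of-experts approximation mentioned on the slide. The function names and sample numbers are hypothetical.

```python
import heapq
from math import prod


def conjunction(probs):
    """Probability that all facts in one derivation hold (product rule)."""
    return prod(probs)


def disjunction(probs):
    """Probability that at least one of m independent events holds."""
    return 1.0 - prod(1.0 - p for p in probs)


def top_k(scored_answers, k):
    """Return the k answers with the highest probability.

    scored_answers is a list of (answer, probability) pairs.
    """
    return heapq.nlargest(k, scored_answers, key=lambda pair: pair[1])


if __name__ == "__main__":
    # A projected answer supported by two independent derivations.
    print(disjunction([0.5, 0.5]))
    # Rank three hypothetical answers and keep the best two.
    print(top_k([("x", 0.2), ("y", 0.9), ("z", 0.6)], 2))
```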

10 Related Work

Query systems: CIMple (CIDR07), AVATAR (DEBul06), Liu, Dong, Halevy (WebDB06), Gubanov and Bernstein (WebDB06)
Extraction: Sarawagi (VLDB06 and others), Etzioni (WWW04), …
Probabilistic DBs: MYSTIQ, Trio, …
Deep web, reference reconciliation, …

11 Our prototype

Web crawl: 90M pages
Facts: 338M tuples, 102M objects
Types: 6.6M instances
Synonyms: 17k pairs
No IDs or FDs yet
Most queries in ~30 seconds
Built on DB2 with custom middleware; we want to try a compressed C-store