Managing The Structured Web Michael J. Cafarella University of Michigan Michigan CSE April 23, 2010.

1 Managing The Structured Web Michael J. Cafarella University of Michigan Michigan CSE April 23, 2010

2 The Structured Web
  Web pages contain structure that is obvious to humans, though not to machines
  Search engines are largely blind to it
  Databases need data that is perfectly structured

4 Different Approaches
  Extraction Techniques
    Tables: WebTables [WebDB’08, VLDB’08]
    Large-scale entity extraction: Structurepedia [ongoing]
  Applications
    Web data integration: Octopus [VLDB’09]
    Structure-aware Web search: Meez [ongoing]
  Tools
    MapReduce Optimizer: Manimal [ongoing]
  Progress in one reinforces others

5 Different Approaches
  Extraction Techniques
    Tables: WebTables [WebDB’08, VLDB’08] (w/ Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu)
    Large-scale entity extraction: Structurepedia [ongoing]
  Applications
    Web data integration: Octopus [VLDB’09]
    Structure-aware Web search: Meez [ongoing]
  Tools
    MapReduce Optimizer: Manimal [ongoing] (w/ Chris Re)

8 WebTables
  The WebTables system automatically extracts databases from a web crawl
    [WebDB08, “Uncovering…”, Cafarella et al]
    [VLDB08, “WebTables: Exploring…”, Cafarella et al]
  An extracted relation is one table plus labeled columns
  We estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational databases
  [Figure labels: Raw crawled pages, Raw HTML tables, Recovered relations, Applications, Schema statistics]
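The idea of "one table plus labeled columns" can be illustrated with a small sketch. This is not the WebTables extractor: the `Relation` class, the `recover_relation` function, and the rule "first row is the header, all rows have equal width" are assumptions made for illustration, standing in for whatever filtering the real system applies to separate relational tables from layout tables.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Relation:
    """Hypothetical container: one table plus its column labels."""
    columns: list
    rows: list = field(default_factory=list)


class TableParser(HTMLParser):
    """Collects the cell text of every <tr> row it sees (no nested-table handling)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def recover_relation(html):
    """Assumed heuristic: first row is the header and every row has the same width."""
    parser = TableParser()
    parser.feed(html)
    rows = [r for r in parser.rows if r]
    if len(rows) < 2:
        return None  # too small to be a relation; likely a layout table
    header, body = rows[0], rows[1:]
    if any(len(r) != len(header) for r in body):
        return None  # ragged rows; treat as non-relational
    return Relation(columns=header, rows=body)


CAR_TABLE = (
    "<table><tr><th>make</th><th>model</th></tr>"
    "<tr><td>Honda</td><td>Civic</td></tr></table>"
)
relation = recover_relation(CAR_TABLE)
```

A real crawl would need a far more careful filter than this width check, since the slide's own numbers imply that only a small fraction of raw HTML tables are good relations.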

9 Schema stats are useful for computing attribute probabilities
    p(“make”), p(“model”), p(“zipcode”)
    p(“make” | “model”), p(“make” | “zipcode”)
  They allow many applications
    Schema “tab-complete”
    Synonym discovery
    Others
  Progress in extraction technique enables new data applications
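The attribute probabilities on this slide can be estimated by counting over a corpus of schemas. The sketch below assumes each schema is just a list of attribute names and that probabilities are simple relative frequencies; the function names (`schema_stats`, `prob`, `prob_given`, `tab_complete`), the toy corpus, and the summed-conditional ranking used for "tab-complete" are illustrative assumptions, not the method from the WebTables papers.

```python
from collections import Counter
from itertools import permutations


def schema_stats(schemas):
    """Count how many schemas contain each attribute and each ordered attribute pair."""
    single = Counter()
    pair = Counter()
    for schema in schemas:
        attrs = set(schema)  # count each attribute once per schema
        single.update(attrs)
        pair.update(permutations(sorted(attrs), 2))
    return single, pair, len(schemas)


def prob(attr, stats):
    """p(attr): fraction of schemas that contain attr."""
    single, _, n = stats
    return single[attr] / n


def prob_given(attr, given, stats):
    """p(attr | given): among schemas containing `given`, the fraction that also contain attr."""
    single, pair, _ = stats
    if single[given] == 0:
        return 0.0
    return pair[(attr, given)] / single[given]


def tab_complete(partial, stats, k=3):
    """Suggest up to k attributes, ranked by summed p(candidate | a) over a in partial."""
    single, _, _ = stats
    candidates = set(single) - set(partial)
    scored = [(sum(prob_given(c, a, stats) for a in partial), c) for c in candidates]
    scored.sort(key=lambda sc: (-sc[0], sc[1]))  # ties broken alphabetically
    return [c for _, c in scored[:k]]


# Toy corpus (made up for illustration).
SCHEMAS = [
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["make", "model", "year", "color"],
    ["name", "zipcode"],
]
STATS = schema_stats(SCHEMAS)
suggestions = tab_complete(["make"], STATS, k=2)
```

On this toy corpus, "make" co-occurs with "model" in every schema that has "model" and never with "zipcode", which is the kind of contrast the slide's p(“make” | “model”) versus p(“make” | “zipcode”) example points at.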

10 Manimal (ongoing)
  MapReduce is very popular for “big data”
    Easy for non-database programmers
    Parallelizable, but inefficient
  RDBMSes are challenging for “big data”
    Programming and admin relatively difficult
    When well-used, very efficient
  Manimal is a hybrid MapReduce/RDBMS execution system
    Static analysis to extract code semantics
    if(score > 5)… → database selection
    Extractions enable RDBMS-style optimizations
  Progress in extraction enables new data tools
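The step from `if(score > 5)…` to a database selection can be sketched with a toy static analysis. The example below uses Python's standard `ast` module on a hypothetical Python map function; it is a stand-in for the idea, not Manimal's analyzer, and the names `extract_selection`, `map_fn`, and `emit` are assumptions. It recognizes only the simplest shape: a function whose whole body is a single `if <name> <op> <constant>:` guard.

```python
import ast

# Comparison operators the sketch knows how to express as a selection.
_OPS = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=", ast.Eq: "="}


def extract_selection(source):
    """Return (field, op, constant) if the function body is one simple guard, else None."""
    tree = ast.parse(source)
    if not tree.body or not isinstance(tree.body[0], ast.FunctionDef):
        return None
    func = tree.body[0]
    if len(func.body) != 1:
        return None  # other statements could have effects outside the guard
    stmt = func.body[0]
    if not isinstance(stmt, ast.If) or stmt.orelse:
        return None  # an else branch means rejected records still matter
    test = stmt.test
    if not (isinstance(test, ast.Compare) and len(test.ops) == 1):
        return None
    left, right = test.left, test.comparators[0]
    if not (isinstance(left, ast.Name) and isinstance(right, ast.Constant)):
        return None
    op = _OPS.get(type(test.ops[0]))
    if op is None:
        return None
    return (left.id, op, right.value)


# Hypothetical map function; it is only parsed, never executed.
MAP_SOURCE = '''
def map_fn(key, score):
    if score > 5:
        emit(key, score)
'''
predicate = extract_selection(MAP_SOURCE)
```

Once a predicate like this is known, an execution system is free to apply RDBMS-style techniques such as skipping non-matching records with an index, which is the kind of optimization the slide says extraction enables.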

