1 Managing The Structured Web
  Michael J. Cafarella
  University of Michigan
  Michigan CSE
  April 23, 2010
2 The Structured Web
  Web pages contain structure that is obvious to humans, though not to machines
  Search engines are largely blind to it
  Databases need data that is perfectly structured
4 Different Approaches
  Extraction Techniques
    Tables: WebTables [WebDB’08, VLDB’08]
    Large-scale entity extraction: Structurepedia [ongoing]
  Applications
    Web data integration: Octopus [VLDB’09]
    Structure-aware Web search: Meez [ongoing]
  Tools
    MapReduce Optimizer: Manimal [ongoing]
  Progress in one reinforces others
5 Different Approaches
  Extraction Techniques
    Tables: WebTables [WebDB’08, VLDB’08] (w/ Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu)
    Large-scale entity extraction: Structurepedia [ongoing]
  Applications
    Web data integration: Octopus [VLDB’09]
    Structure-aware Web search: Meez [ongoing]
  Tools
    MapReduce Optimizer: Manimal [ongoing] (w/ Chris Re)
8 WebTables
  WebTables system automatically extracts databases from a web crawl
    [WebDB08, “Uncovering…”, Cafarella et al]
    [VLDB08, “WebTables: Exploring…”, Cafarella et al]
  An extracted relation is one table plus labeled columns
  We estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational databases
  [Figure labels: Raw crawled pages, Raw HTML Tables, Recovered Relations, Applications, Schema Statistics]
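To make "one table plus labeled columns" concrete, here is a minimal sketch in Python. It is not the WebTables extractor: it simply assumes that a leading row of <th> cells supplies the column labels and discards any table without such a row. The names TableExtractor and extract_relations and the sample HTML are all hypothetical, and nested tables or omitted closing tags are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as (column labels, data rows).

    Assumption for this sketch: labels come from a row made only of <th> cells.
    """

    def __init__(self):
        super().__init__()
        self.relations = []    # list of (columns, rows) pairs
        self._columns = None   # labels of the table being parsed
        self._rows = None      # data rows of the table being parsed
        self._row = None       # cells of the row being parsed
        self._row_tags = None  # "th"/"td" tag of each cell in the row
        self._cell = None      # text fragments of the cell being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._columns, self._rows = [], []
        elif tag == "tr" and self._rows is not None:
            self._row, self._row_tags = [], []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []
            self._row_tags.append(tag)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            is_header = all(t == "th" for t in self._row_tags)
            if self._row and is_header and not self._columns:
                self._columns = self._row
            elif self._row:
                self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            # Keep only tables with labels and rows of matching width.
            if self._columns and self._rows and all(
                len(r) == len(self._columns) for r in self._rows
            ):
                self.relations.append((self._columns, self._rows))
            self._columns = self._rows = None


def extract_relations(html):
    parser = TableExtractor()
    parser.feed(html)
    return parser.relations


# Hypothetical input: one relational table and one layout table.
HTML = """
<table><tr><th>make</th><th>model</th></tr>
<tr><td>Toyota</td><td>Camry</td></tr>
<tr><td>Honda</td><td>Civic</td></tr></table>
<table><tr><td>nav</td><td>links</td></tr></table>
"""
RELATIONS = extract_relations(HTML)
```

Only the first table survives here, since the second has no labeled columns. A filter this crude would not be enough to separate the ~154M good relations from 14.1B raw tables; it only illustrates the shape of the output.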
9 Schema stats useful for computing attribute probabilities
    p(“make”), p(“model”), p(“zipcode”)
    p(“make” | “model”), p(“make” | “zipcode”)
  Allows many applications
    Schema “tab-complete”
    Synonym discovery
    Others
  Progress in extraction technique enables new data applications
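The attribute probabilities above can be estimated by counting over a corpus of schemas. The Python sketch below does this on a hypothetical four-schema corpus. The function names and the tab-complete scoring (a product of conditional probabilities) are assumptions made for illustration, not the method from the WebTables papers.

```python
import math
from collections import Counter
from itertools import permutations

# Hypothetical toy corpus: one set of column labels per recovered relation.
SCHEMAS = [
    {"make", "model", "year"},
    {"make", "model", "price"},
    {"name", "zipcode"},
    {"make", "zipcode"},
]


def schema_stats(schemas):
    """Count schemas containing each attribute and each ordered attribute pair."""
    attr_counts = Counter()
    pair_counts = Counter()
    for schema in schemas:
        attrs = sorted(set(schema))
        attr_counts.update(attrs)
        pair_counts.update(permutations(attrs, 2))
    return attr_counts, pair_counts


def p(attr, attr_counts, n_schemas):
    """p(attr): fraction of schemas that contain attr."""
    return attr_counts[attr] / n_schemas


def p_given(a, b, attr_counts, pair_counts):
    """p(a | b): fraction of schemas containing b that also contain a."""
    if attr_counts[b] == 0:
        return 0.0
    return pair_counts[(a, b)] / attr_counts[b]


def tab_complete(partial, attr_counts, pair_counts, k=3):
    """Suggest up to k attributes to add to a partial schema.

    Assumed scoring: product of p(candidate | a) over each a in partial.
    """
    candidates = [a for a in attr_counts if a not in partial]
    scored = [
        (c, math.prod(p_given(c, a, attr_counts, pair_counts) for a in partial))
        for c in candidates
    ]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


ATTR_COUNTS, PAIR_COUNTS = schema_stats(SCHEMAS)
SUGGESTIONS = tab_complete(["make"], ATTR_COUNTS, PAIR_COUNTS)
```

On this toy corpus, p(“make”) is 0.75, p(“make” | “model”) is 1.0, and p(“make” | “zipcode”) is 0.5, so a schema that starts with “make” gets “model” as its top suggestion.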
10 Manimal (ongoing)
  MapReduce very popular for “big data”
    Easy for non-database programmers
    Parallelizable, but inefficient
  RDBMSes challenging for “big data”
    Programming and admin relatively difficult
    When well-used, very efficient
  Manimal is a hybrid MapReduce/RDBMS execution system
    Static analysis to extract code semantics
    if(score > 5)… database selection
    Extractions enable RDBMS-style optimizations
  Progress in extraction enables new data tools
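As a rough illustration of reading a selection out of map code by static analysis, the sketch below uses Python's standard ast module (Python 3.9+ for ast.unparse). It is not Manimal's analyzer: the map function, the emit call, and the rule "the whole body is a single if with no else" are all assumptions chosen to keep the example small.

```python
import ast
import textwrap

# Hypothetical map function, kept as source text so it is analyzed, never run.
MAP_SOURCE = textwrap.dedent("""
    def map_fn(url, score):
        if score > 5:
            emit(url, score)
""")

# A second hypothetical function whose body does more than filter.
UNSAFE_SOURCE = textwrap.dedent("""
    def map_fn(url, score):
        log(url)
        if score > 5:
            emit(url, score)
""")


def find_selection_predicate(source):
    """Return the guard expression as text, or None if no safe guard is found.

    Assumed rule: the function body must be exactly one `if` with no `else`,
    so records failing the test can never reach an emit.
    """
    tree = ast.parse(source)
    if not tree.body or not isinstance(tree.body[0], ast.FunctionDef):
        return None
    func = tree.body[0]
    if len(func.body) != 1:
        return None
    stmt = func.body[0]
    if isinstance(stmt, ast.If) and not stmt.orelse:
        return ast.unparse(stmt.test)
    return None


PREDICATE = find_selection_predicate(MAP_SOURCE)
UNSAFE_PREDICATE = find_selection_predicate(UNSAFE_SOURCE)
```

Here PREDICATE is the text `score > 5`, which an optimizer could in principle push down as a selection (for example, by using an index on score) instead of scanning every record. UNSAFE_PREDICATE is None because the extra statement means skipping records could change behavior.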