Managing The Structured Web Michael J. Cafarella University of Michigan Michigan CSE April 23, 2010.

Presentation transcript:


2 The Structured Web
- Web pages contain structure that is obvious to humans, though not machines
- Search engines are largely blind to it
- Databases need data that is perfectly structured

4 Different Approaches
Extraction Techniques
- Tables: WebTables [WebDB’08, VLDB’08]
- Large-scale entity extraction: Structurepedia [ongoing]
Applications
- Web data integration: Octopus [VLDB’09]
- Structure-aware Web search: Meez [ongoing]
Tools
- MapReduce Optimizer: Manimal [ongoing]
Progress in one reinforces others

5 Different Approaches
Extraction Techniques
- Tables: WebTables [WebDB’08, VLDB’08] (w/ Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu)
- Large-scale entity extraction: Structurepedia [ongoing]
Applications
- Web data integration: Octopus [VLDB’09]
- Structure-aware Web search: Meez [ongoing]
Tools
- MapReduce Optimizer: Manimal [ongoing] (w/ Chris Re)


8 WebTables
- The WebTables system automatically extracts databases from a web crawl
  [WebDB08, “Uncovering…”, Cafarella et al]
  [VLDB08, “WebTables: Exploring…”, Cafarella et al]
- An extracted relation is one table plus labeled columns
- We estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational databases
[Figure labels: Raw crawled pages, Raw HTML tables, Recovered relations, Applications, Schema statistics]
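
To make "one table plus labeled columns" concrete, here is a minimal sketch of pulling a single relation out of an HTML table with Python's standard `html.parser`. It is an illustration only, not the WebTables extractor: the `Relation` class, `TableParser`, and `extract_relation` are hypothetical names, and the rule "a row containing `<th>` cells is the header" is a simplifying assumption.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Relation:
    """One extracted table: column labels plus data rows (hypothetical type)."""
    columns: list
    rows: list = field(default_factory=list)


class TableParser(HTMLParser):
    """Collects header and data rows from a simple, non-nested HTML table."""

    def __init__(self):
        super().__init__()
        self.header, self.rows = [], []
        self._row, self._cell, self._is_header = None, None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._is_header = [], False
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            # Assumption: any <th> cell marks the row as the header row.
            self._is_header = self._is_header or tag == "th"

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._is_header and not self.header:
                self.header = self._row
            elif self._row:
                self.rows.append(self._row)
            self._row = None


def extract_relation(html):
    """Parse the HTML and return the table as a Relation."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return Relation(columns=parser.header, rows=parser.rows)


if __name__ == "__main__":
    page = ("<table><tr><th>make</th><th>model</th></tr>"
            "<tr><td>Toyota</td><td>Camry</td></tr></table>")
    print(extract_relation(page))
```

A real crawl would also need to decide which raw tables are relational at all (most are layout tables, per the 14.1B versus ~154M estimate); that filtering step is not attempted here.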

9 Schema stats useful for computing attribute probabilities
- p(“make”), p(“model”), p(“zipcode”)
- p(“make” | “model”), p(“make” | “zipcode”)
Allows many applications
- Schema “tab-complete”
- Synonym discovery
- Others
Progress in extraction technique enables new data applications
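
The attribute probabilities above can be estimated by simple counting over a corpus of schemas. The sketch below assumes each schema is a list of column labels; the function names (`schema_stats`, `p`, `p_given`, `tab_complete`) and the toy corpus are hypothetical, and the tab-complete scoring rule (sum of conditional probabilities) is one plausible choice rather than the one used in WebTables.

```python
from collections import Counter
from itertools import permutations


def schema_stats(schemas):
    """Count how many schemas contain each attribute and each ordered pair."""
    attr_counts, pair_counts = Counter(), Counter()
    for schema in schemas:
        attrs = set(schema)  # count each attribute once per schema
        attr_counts.update(attrs)
        pair_counts.update(permutations(attrs, 2))
    return attr_counts, pair_counts, len(schemas)


def p(attr, stats):
    """p(attr): fraction of schemas that contain attr."""
    attr_counts, _, n = stats
    return attr_counts[attr] / n if n else 0.0


def p_given(a, b, stats):
    """p(a | b): fraction of schemas containing b that also contain a."""
    attr_counts, pair_counts, _ = stats
    return pair_counts[(a, b)] / attr_counts[b] if attr_counts[b] else 0.0


def tab_complete(partial, stats, k=3):
    """Suggest up to k attributes that tend to co-occur with those in partial.

    Assumed scoring rule: sum of p(candidate | given) over the given attributes.
    """
    candidates = [a for a in stats[0] if a not in partial]
    scored = [(sum(p_given(c, g, stats) for g in partial), c) for c in candidates]
    scored.sort(key=lambda sc: (-sc[0], sc[1]))  # best score first, then by name
    return [c for _, c in scored[:k]]


if __name__ == "__main__":
    corpus = [  # toy schemas, invented for illustration
        ["make", "model", "year"],
        ["make", "model", "price"],
        ["name", "zipcode"],
        ["make", "zipcode"],
    ]
    stats = schema_stats(corpus)
    print(p("make", stats))                   # 0.75
    print(p_given("make", "model", stats))    # 1.0
    print(p_given("make", "zipcode", stats))  # 0.5
    print(tab_complete(["make"], stats, k=1)) # ['model']
```

Counting each attribute once per schema keeps the estimates interpretable as "fraction of schemas", which is what a schema auto-completion feature needs.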

10 Manimal (ongoing)
MapReduce is very popular for “big data”
- Easy for non-database programmers
- Parallelizable, but inefficient
RDBMSes are challenging for “big data”
- Programming and admin relatively difficult
- When well-used, very efficient
Manimal is a hybrid MapReduce/RDBMS execution system
- Static analysis to extract code semantics
- if(score > 5)… → database selection
- Extractions enable RDBMS-style optimizations
Progress in extraction enables new data tools
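
The idea of recognizing an `if(score > 5)` guard as a database selection can be sketched with Python's standard `ast` module (Python 3.9+ for `ast.unparse`). This is a toy illustration under strong assumptions, not Manimal's analysis: the map function, the `emit` call, and `find_selection` are hypothetical, and only the simplest pattern (a body consisting of a single `if` with no `else`) is detected.

```python
import ast

# A hypothetical map function, kept as source text because we only analyze it.
MAP_SOURCE = '''
def map_fn(key, score):
    if score > 5:
        emit(key, score)
'''


def find_selection(source):
    """Return the guard predicate as text if the function body is a single
    `if` without an `else`; otherwise return None.

    Such a guard behaves like a selection (a WHERE clause): records that fail
    the test produce no output, so they could be skipped before the map runs,
    for example by using an index on the tested field.
    """
    tree = ast.parse(source)
    if not tree.body:
        return None
    func = tree.body[0]
    if not isinstance(func, ast.FunctionDef) or len(func.body) != 1:
        return None
    stmt = func.body[0]
    if isinstance(stmt, ast.If) and not stmt.orelse:
        return ast.unparse(stmt.test)
    return None


if __name__ == "__main__":
    print(find_selection(MAP_SOURCE))  # score > 5
```

The analysis is purely static: the source is parsed but never executed, so the undefined `emit` is not a problem. A production analyzer would also have to prove that the guarded code has no side effects outside the branch.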