Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.

Without (too much) Loss of Generality Web Enterprise, Science projects, … Information integration ≅ data management

A Few Principles
Data management “in situ”
– Data meaning is derived from its context
– Manipulate data in its natural location
Pay-as-you-go data management
– Provide services before modeling is done
– Data can be about any domain
Collaboration should be built in
– Query answering is only the first step

Alex Facebook

Structured Data & The Web

Discover; Manage, Analyze, Combine; Extract; Publish
Hard to query, visualize, combine data across organizations
Requires infrastructure, concerns about losing control
Hard to find structured data via search engines
Data is embedded in web page, behind forms

Outline
Surfacing the Deep Web
Searching tables on the surface Web
Fusion Tables: a platform for data management on the Web.

What is the Deep Web?
store locations, used cars, radio stations, patents, recipes
Deep = not accessible through general-purpose search engines
– Major gap in the coverage of search engines.

Tree Search Amish quilts Parking tickets in India Horses

Solution Constraints
Can’t design a solution that requires domain engineering
– (unless you can make money in that domain!)
Boundaries between domains are fuzzy
Solution needs to be integrated into general web search
– Can’t assume special query syntax

Surfacing the Deep Web [Madhavan et al. VLDB 2008]
Surfacing:
– Find high-quality forms
– Guess good queries to submit
– Put the resulting HTML pages in the index
~3M sites, 50 languages, 700 domains queries per-second get results from the deep web.
400K forms served per day, 800K per week
Impact mostly on the long and heavy tail of queries
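The "guess good queries to submit" step can be pictured with a small sketch. This is only an illustration under simple assumptions, not the method of the cited paper: the form uses GET, the candidate values per input are already known, and the function name `candidate_submissions`, the example URL, and the input names are all hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_submissions(action_url, inputs, max_urls=100):
    """Enumerate GET URLs for a form by combining candidate values per input.

    `inputs` maps an input name to a list of candidate values. For select
    menus these could be the listed options; for text boxes, guessed keywords.
    `max_urls` caps the number of generated URLs.
    """
    names = sorted(inputs)
    urls = []
    for combo in product(*(inputs[n] for n in names)):
        urls.append(action_url + "?" + urlencode(dict(zip(names, combo))))
        if len(urls) >= max_urls:
            break
    return urls


# Hypothetical used-car search form with one select menu and one text box.
urls = candidate_submissions(
    "http://example.com/search",
    {"make": ["honda", "ford"], "zip": ["94043"]},
)
for url in urls:
    print(url)
```

Each generated URL would then be fetched and the resulting HTML page considered for the index. The cap matters because the number of combinations grows multiplicatively with the number of inputs.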

Deep Web: The Future
Still an opportunity to go deeper into the deep web:
– E.g., map the user query into a form submission.
Key challenge: given a keyword query, map it to forms in any domain
Understanding the meaning of forms is still hard (e.g. content, geo constraints).

Outline
Surfacing the Deep Web
→ Searching tables on the surface Web
Fusion Tables: a platform for data management on the Web.

Bad table

Vertical Tables

Sub-Header Rows

Winners of the Boston Marathon (but that’s nowhere in the table)

Schema Ok, but context is subtle (year = 2006)

WebTables: Exploring the Relational Web [Cafarella et al., VLDB 2008, WebDB 08]
In a corpus of 14B raw tables, we estimate 154M are “good” relations
– Single-table databases; Schema = attr labels + types
– Largest corpus of databases & schemas we know of
The WebTables system:
– Recovers good relations from crawl and enables search
– Builds novel apps on the recovered data

(Web-scale) Schema Collection
With 2.6 million schemas you can do some very interesting things.
Synonym discovery (context attribute: synonym pairs)
name: |, phone|telephone, _address|_address, date|last_modified
instructor: course-title|title, day|days, course|course-#, course-name|course-title
elected: candidate|name, presiding-officer|speaker
ab: k|so, h|hits, avg|ba, name|player
sqft: bath|baths, list|list-price, bed|beds, price|rent
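One way to read the synonym pairs on this slide (e.g. phone|telephone in schemas that contain name): two labels are candidates when they appear alongside the same context attribute and share neighbouring attributes, yet never occur in the same schema. The sketch below is a simplified illustration of that intuition on a toy corpus, not the scoring used by the WebTables system; the function name and data are made up.

```python
from collections import defaultdict
from itertools import combinations


def synonym_candidates(schemas, context):
    """Return attribute pairs that appear with `context` but never together.

    Pairs are ordered by how many co-occurring attributes they share.
    """
    with_context = [set(s) - {context} for s in schemas if context in s]

    neighbours = defaultdict(set)  # attribute -> attributes seen in the same schema
    for schema in with_context:
        for attr in schema:
            neighbours[attr] |= schema - {attr}

    pairs = []
    for a, b in combinations(sorted(neighbours), 2):
        if b in neighbours[a]:
            continue  # seen in the same schema, so unlikely to be synonyms
        shared = len(neighbours[a] & neighbours[b])
        if shared:
            pairs.append((shared, a, b))

    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))
    return [(a, b) for _, a, b in pairs]


# Toy corpus of schemas (attribute label lists), all containing "name".
schemas = [
    ["name", "phone", "date"],
    ["name", "telephone", "date"],
    ["name", "phone", "last_modified"],
]
print(synonym_candidates(schemas, "name"))
```

With millions of schemas the same counts become statistically meaningful, which is what makes the web-scale collection useful for this task.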

“KR”-Based Table Search [Wu, Madhavan, Miao, Pasca, Shen]
Ideally, we describe every table:
– Class of entities it contains
– Properties being modeled
– Context, quality, …
Use Web-extracted knowledge bases
– Extract isa-hierarchy using patterns:
– “cities such as Paris and London”
– “chemical elements including hydrogen and oxygen”
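The two example phrases on this slide can be matched with a simple textual pattern. The following is a naive sketch of pattern-based isa extraction using a regular expression; it assumes the class phrase starts the text and that instances are single words, and the function name `extract_isa` is hypothetical. A real extractor would need noun-phrase detection and filtering over many occurrences.

```python
import re

# "<class> such as A, B and C" or "<class> including A and B".
PATTERN = re.compile(
    r"(?P<cls>\w[\w ]*?) (?:such as|including) "
    r"(?P<items>\w+(?:(?:, and |, | and )\w+)*)"
)


def extract_isa(text):
    """Return (instance, class) pairs found by the pattern in `text`."""
    pairs = []
    for m in PATTERN.finditer(text):
        items = re.split(r", and |, | and ", m.group("items"))
        pairs.extend((item, m.group("cls")) for item in items)
    return pairs


print(extract_isa("cities such as Paris and London"))
print(extract_isa("chemical elements including hydrogen and oxygen"))
```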

Step 1: Find “Subject” of Table Not always the leftmost (or first non-number) column

Step 2: Associate classes with subject (e.g., Chemical elements) Most of the time, the class labels are not in the attribute names

Leveraging Web-extracted Ontologies
Given a query, e.g., (country, GDP)
– Rank tables about countries that have GDP somewhere in the schema.
– Very high precision (~90%)
Next challenge: understand binary properties and binary relationships.
Domain specialization:
– System should improve if given ontologies in a particular domain.
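A query such as (country, GDP) can be pictured as a filter-and-rank over tables that have already been annotated with subject classes. The sketch below assumes each table carries a hypothetical `subject_classes` map from class label to a confidence score, along with its attribute labels; the data, field names, and scoring are illustrative only and do not describe the actual system.

```python
def search_tables(tables, cls, prop):
    """Return titles of tables about `cls` whose schema mentions `prop`.

    Results are ordered by the confidence of the class annotation.
    """
    cls, prop = cls.lower(), prop.lower()
    hits = []
    for table in tables:
        about_class = cls in table["subject_classes"]
        has_prop = any(prop in attr.lower() for attr in table["schema"])
        if about_class and has_prop:
            hits.append((table["subject_classes"][cls], table["title"]))
    hits.sort(reverse=True)
    return [title for _, title in hits]


# Made-up annotated tables.
tables = [
    {"title": "GDP by country",
     "subject_classes": {"country": 0.9},
     "schema": ["Country", "GDP (USD)"]},
    {"title": "Country populations",
     "subject_classes": {"country": 0.8},
     "schema": ["Country", "Population"]},
    {"title": "Company GDP contribution",
     "subject_classes": {"company": 0.7},
     "schema": ["Company", "GDP share"]},
]
print(search_tables(tables, "country", "GDP"))
```

The third table mentions GDP but is about companies, so the class annotation on the subject column is what keeps it out of the results.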

Combine Search, Extraction, Cleaning and Integration [Cafarella, Khoussainova, Halevy, VLDB 2009] Try to create a database of all “VLDB program committee members”

Outline
Surfacing the Deep Web
Searching tables on the surface Web
→ Fusion Tables: a platform for data management on the Web.

Data Management for the Web Era
Integrate seamlessly with the Web:
– Search, maps, …
Easy to use:
– Much broader user base, pay-as-you-go
– Very simple data integration
Provide incentives for sharing data
Facilitate collaboration
Fusion Tables – our current attempt [Madhavan, Gonzalez, Langen, Shapley, Shen]

We store and leverage a large collection of tables. Incentive

Incentive, Pay-..-Go

Coffee Production

Coffee Consumption

Seamless integration with other web tools

Toilet heat map…

Database functionality on map

Collaboration Table Search

Show up in search results!

Data Integration

Merged Table Carries attribution from both base tables. Owners maintain control of their own data.

Fine Grained Discussions

Example Uses of Fusion Tables
Tracking potholes in Spain
Displaying bike routes (MTBGuru)
State of California statistics
Government data from data.gov
Data about voting locations in the USA
Brazilian beaches
Chicago homicides
Most requested pop songs by year

Conclusions
Information integration “in situ”
– Blur the boundary between structured and unstructured data
Combine search, extraction, cleaning and integration into a single experience
Pay-as-you-go: introduce complexity as needed
– Serve enterprises without IT depth
OpenII – an open-source platform for information integration.

References
Fusion Tables:
– tables.googlelabs.com
– SIGMOD, SOCC, 2010
Deep-web crawling:
– [Madhavan et al., VLDB 08]
WebTables:
– [Cafarella et al., VLDB 08]
Octopus:
– [Cafarella et al., VLDB 09]
– [Elmeleegy et al., VLDB 09]