DataSpaces: A New Abstraction for Data Management Alon Halevy* DASFAA, 2006 Singapore *Joint work with Mike Franklin and David Maier.

Outline
Dataspaces:
– collections of heterogeneous (un)structured data
– examples and characteristics
Dataspace Support Platforms:
– “Pay-as-you-go” data management
Getting there:
– Recent progress and research challenges

Shrapnels in Baghdad Story courtesy of Phil Bernstein

OriginatedFrom PublishedIn ConfHomePage ExperimentOf ArticleAbout BudgetOf CourseGradeIn AddressOf Cites CoAuthor Frequent er HomePage Sender EarlyVersion Recipient AttachedTo PresentationFor Personal Information Management [Semex: Dong et al.]

Google Base

What do they have in common? All dataspaces contain >20% porn. The rest is spam.

What do they have in common?
Must manage all the data in the space.
Need best-effort services with no setup time.
Data is heterogeneous, possibly unstructured.
Do not have control over the data, just access.

Isn’t this Data Integration?
[Figure: sources OMIM, Swiss-Prot, HUGO, GO, GeneClinics, Entrez, LocusLink, GEO; schema elements Sequenceable Entity, Gene, Phenotype, Structured Vocabulary, Experiment, Protein, Nucleotide Sequence, Microarray Experiment]

No, it’s Data Co-existence Data integration systems require semantic mappings.

Semantic Mappings
Inventory Database A:
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)

No, It’s Data Co-existence
Data integration systems require semantic mappings.
Dataspaces are “pay-as-you-go”:
– Provide some services immediately
– Create tighter integrations as needed.

The Cost of Semantics
[Plot: benefit vs. investment (time, cost), with curves for dataspaces and data integration solutions]
Schema first vs. schema last

Why Now?
Data management is moving towards dataspace-like applications.
Prediction:
– Data management is about people, not enterprises.
– In 5 years our community will figure it out.
We’ve made relevant progress:
– Combining DB & IR
– Creation and management of semantic mappings
– Uncertainty, lineage, inconsistency

Dataspaces Fundamentals: Participants and Relationships
[Diagram: participants XML, WSDL, SDB, sensor, java, RDB; relationships schema mapping (manually created), snapshot, view, replica, 1hr updates]

DSSP Components
[Diagram: participants XML, WSDL, SDB, sensor, java, RDB; relationships schema mapping (manually created), snapshot, view, replica, 1hr updates]
Catalog: maintain catalog of participants’ capabilities, relationships, quality of both
Search & query: seamless flow from search to query; query about participants, locate data; lineage, uncertainty, completeness
Discovery & Enhancement: find participants; discover and refine relationships
Local Store & Index: heterogeneous index; reference reconciliation; additional associations; cache for performance & availability
Administration: set up workflows; relax

DSSP Components XML WSDL SDB sensor java RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Discovery & Enhancement Local Store & Index Administration

Querying a Dataspace
Best effort, based on:
– Approximate semantic mappings
– Other mechanisms
Example: searching for Beng Chin’s phone number:
– Keyword search for “beng chin ooi”
– Examine attributes of tuples/XML elements
– Match attributes to ‘address’
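
The phone-number example can be sketched in a few lines of Python. This is a minimal sketch, not the query processor of any actual DSSP: the toy records, the `source` field, the list of hint names for the target concept, and the 0.8 similarity threshold are all assumptions made for illustration. It runs a keyword search first, then examines the attributes of each hit and keeps those whose names resemble the target concept.

```python
from difflib import SequenceMatcher

# Hypothetical toy records from three different participants.
RECORDS = [
    {"source": "contacts.xml", "name": "Beng Chin Ooi", "tel": "555-0100"},
    {"source": "papers.db", "title": "One Table Stores All", "author": "Beng Chin Ooi"},
    {"source": "staff.csv", "name": "Alon Halevy", "phone": "555-0199"},
]


def keyword_search(records, query):
    """Return records whose values contain every query token (case-insensitive)."""
    tokens = query.lower().split()
    hits = []
    for rec in records:
        text = " ".join(str(v) for v in rec.values()).lower()
        if all(tok in text for tok in tokens):
            hits.append(rec)
    return hits


def matching_attributes(record, hints, threshold=0.8):
    """Keep attributes whose name is close to one of the hint names."""
    matches = []
    for attr, value in record.items():
        if attr == "source":  # bookkeeping field, not data
            continue
        score = max(SequenceMatcher(None, attr.lower(), h).ratio() for h in hints)
        if score >= threshold:
            matches.append((attr, value, score))
    return matches


def best_effort_lookup(records, query, hints):
    """Keyword search, then approximate attribute matching, best score first."""
    answers = []
    for rec in keyword_search(records, query):
        for attr, value, score in matching_attributes(rec, hints):
            answers.append((rec["source"], attr, value, score))
    return sorted(answers, key=lambda a: a[3], reverse=True)


answers = best_effort_lookup(RECORDS, "beng chin ooi", ["phone", "tel", "telephone"])
for source, attr, value, score in answers:
    print(source, attr, value, round(score, 2))
```

Nothing here requires a schema mapping to be set up in advance, which is the point of the best-effort approach; the price is that a poorly named attribute is simply missed.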

Querying a Dataspace Best effort Combine structured and unstructured data

Two Kinds of Data Structured Data (or XML) Unstructured Data [Dong, Liu, Halevy]

Querying a Dataspace Best effort Combine structured and unstructured data Rank: answers, sources

Volvo Palo alto

Acura integra Palo alto

Querying a Dataspace
Best effort
Combine structured and unstructured data
Rank: answers, sources
Iterative (sequences of queries):
– Used cars palo alto
– Saab for sale
– Classified ads
→ Classified ads for Saabs near Palo Alto

Querying a Dataspace Best effort Combine structured and unstructured data Rank: answers, sources Iterative: sequences of queries Reflective: need to explain and expose confidence in answers.

Dataspace Reflection
Sources of uncertainty:
– Unreliable sources
– Data obtained by imprecise extractions
– Inconsistent data
– Approximate query answering (mappings)
A DSSP must:
– Expose the lineage of an answer (Web search engines already do this)
– Reason about relationship between sources

LUI Introspection
Currently, three distinct formalisms:
– Uncertainty
– Lineage
– Inconsistency
A single formalism should do it all:
– Uncertainty and lineage (Stanford)
– Lineage & inconsistency (U. Penn)

ULDBs [Benjelloun, Das Sarma, Halevy, Widom]
Combine uncertainty and lineage
– Based on x-tuples: {(t1 | t2 | t3)}
Queries can be answered with no additional complexity
– You can do even better than uncertain DBs: because of lineage, you can sometimes obtain completeness.
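
To make the x-tuple idea concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the ULDB implementation: the relation names (`saw`, `drives`), the people and cars, and the helper functions are all hypothetical. An x-tuple is modelled as a list of mutually exclusive alternatives, and each derived tuple carries lineage pointing at the base alternative it came from.

```python
from itertools import product


def possible_worlds(xtuples):
    """Enumerate worlds: each world picks exactly one alternative per x-tuple."""
    ids = sorted(xtuples)
    for choice in product(*(range(len(xtuples[i])) for i in ids)):
        yield dict(zip(ids, choice))


def derived_in_world(derived, world):
    """A derived tuple exists only in worlds that chose all of its lineage."""
    return [value for value, lineage in derived
            if all(world[x] == alt for x, alt in lineage)]


# Hypothetical uncertain base data: one x-tuple with two alternatives.
saw = {"saw1": [("Amy", "Honda"), ("Amy", "Toyota")]}
# Hypothetical certain data: car -> driver.
drives = {"Honda": "Jimmy", "Toyota": "Billy"}

# Join saw with drives; each result records which alternative produced it.
derived = [
    (drives[car], {("saw1", alt)})
    for alt, (_witness, car) in enumerate(saw["saw1"])
]

worlds = list(possible_worlds(saw))
results = [derived_in_world(derived, w) for w in worlds]
print(results)
```

Without lineage, the two derived suspects would look like independent uncertain tuples, which would wrongly admit a world containing both. With lineage, each suspect is tied to one alternative of `saw1`, so no world contains both.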

DSS Components XML WSDL SDB sensor java RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Discovery & Enhancement Local Store & Index Administration

Personal Information Space
[Figure: instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by authoredPaper / author associations]
Departmental Database: (StuID, LastName, FirstName, …) with tuple (…, Xin, Dong, …)
Inverted List: Alon; Dong; Halevy; Luna; Semex; Xin

Personal Information Space
[Figure: instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by authoredPaper / author associations]
Departmental Database: (StuID, LastName, FirstName, …) with tuple (…, Xin, Dong, …)
Inverted List: Alon 1; Dong 1 1; Halevy 1; Luna 1; Semex 1; Xin 1
Query: Dong

Personal Information Space
[Figure: instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by authoredPaper / author associations]
Departmental Database: (StuID, LastName, FirstName, …) with tuple (…, Xin, Dong, …)
Inverted List: Alon 1; Dong 1 1; Halevy 1; Luna 1; Semex 1; Xin 1
Query: FirstName “Dong”

Personal Information Space
[Figure: instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by authoredPaper / author associations]
Departmental Database: (StuID, LastName, FirstName, …) with tuple (…, Xin, Dong, …)
Inverted List: Alon&&name&& 1; Dong&&FirstName&& 1; Dong&&name&& 1; Halevy&&name&& 1; Luna&&name&& 1; Semex&&title&& 1; Xin&&LastName&& 1
Query: FirstName “Dong”

Personal Information Space
[Figure: instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by authoredPaper / author associations]
Departmental Database: (StuID, LastName, FirstName, …) with tuple (…, Xin, Dong, …)
Inverted List: Alon&&name&& 1; Dong&&FirstName&& 1; Dong&&name&& 1; Halevy&&name&& 1; Luna&&name&& 1; Semex&&title&& 1; Xin&&LastName&& 1
Query: name “Dong”

Personal Information Space
[Figure: instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by authoredPaper / author associations]
Departmental Database: (StuID, LastName, FirstName, …) with tuple (…, Xin, Dong, …)
Inverted List: Alon&&name&& 1; Dong&&name&&FirstName&& 1; Dong&&name&& 1; Halevy&&name&& 1; Luna&&name&& 1; Semex&&title&& 1; Xin&&name&&LastName&& 1
Query: name “Dong”

Personal Information Space
[Figure: instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by authoredPaper / author associations]
Departmental Database: (StuID, LastName, FirstName, …) with tuple (…, Xin, Dong, …)
Inverted List: Alon&&name&& 1; Dong&&name&&FirstName&& 1; Dong&&name&& 1; Halevy&&name&& 1; Luna&&name&& 1; Semex&&title&& 1; Xin&&name&&LastName&& 1
Query: Paper author “Dong”

Personal Information Space
[Figure: instances “Alon Halevy”, “Luna Dong” and “Semex: …”, linked by authoredPaper / author associations]
Departmental Database: (StuID, LastName, FirstName, …) with tuple (…, Xin, Dong, …)
Inverted List: Alon&&author&& 1; Alon&&name&& 1; Dong&&author&& 1; Dong&&name&&FirstName&& 1; Dong&&name&& 1; Halevy&&name&& 1; Luna&&name&& 1; Semex&&authoredPaper&& 1 1; Semex&&title&& 1; Xin&&name&&LastName&& 1
Query: Paper author “Dong”
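
The progression across these slides (plain keywords, then `keyword&&attribute&&` keys, then hierarchy paths such as `Dong&&name&&FirstName&&`, then association keys such as `Dong&&author&&`) can be sketched in Python. This is a hedged illustration of the key layout visible on the slides, not the authors' index implementation: the instance ids (`p1`, `p2`, `a1`, `t1`), the `PARENT` hierarchy table, and the function names are assumptions, and the linear scan over keys stands in for the sorted prefix lookup a real index would use.

```python
from collections import defaultdict

# Assumed attribute hierarchy: FirstName and LastName are sub-attributes of name.
PARENT = {"FirstName": "name", "LastName": "name"}


def attr_path(attr):
    """Expand an attribute to its full hierarchy path, e.g. name&&FirstName."""
    path = [attr]
    while path[0] in PARENT:
        path.insert(0, PARENT[path[0]])
    return "&&".join(path)


def build_index(instances, associations):
    """Map keyword&&path&& keys to the ids of the instances they describe."""
    index = defaultdict(set)
    for inst_id, attrs in instances.items():
        for attr, value in attrs.items():
            for word in value.lower().split():
                index[f"{word}&&{attr_path(attr)}&&"].add(inst_id)
    # An association (src, label, dst) indexes dst's keywords under label for src.
    for src, label, dst in associations:
        for value in instances[dst].values():
            for word in value.lower().split():
                index[f"{word}&&{label}&&"].add(src)
    return index


def lookup(index, keyword, label):
    """Prefix match, so querying a parent attribute also finds its children."""
    prefix = f"{keyword.lower()}&&{attr_path(label)}&&"
    hits = set()
    for key, ids in index.items():
        if key.startswith(prefix):
            hits |= ids
    return hits


# Hypothetical ids for the instances pictured on the slides.
instances = {
    "p1": {"name": "Alon Halevy"},
    "p2": {"name": "Luna Dong"},
    "a1": {"title": "Semex"},
    "t1": {"LastName": "Xin", "FirstName": "Dong"},
}
associations = [
    ("a1", "author", "p1"), ("a1", "author", "p2"),
    ("p1", "authoredPaper", "a1"), ("p2", "authoredPaper", "a1"),
]
index = build_index(instances, associations)

print(lookup(index, "Dong", "name"))
print(lookup(index, "Dong", "FirstName"))
print(lookup(index, "Dong", "author"))
```

Putting the hierarchy path inside the key is what lets a query on `name` find the departmental tuple whose `FirstName` is “Dong” without storing that keyword twice.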

DSS Components XML WSDL SDB sensor java RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Discovery & Enhancement Local Store & Index Administration

Enhancing a Dataspace
Creating associations [Dong & Halevy, CIDR 2005]
Reference reconciliation [Dong et al., SIGMOD 2005]
Very active field.
[Figure labels: OriginatedFrom PublishedIn ConfHomePage ExperimentOf ArticleAbout BudgetOf CourseGradeIn AddressOf Cites CoAuthor Frequent er HomePage Sender EarlyVersion Recipient AttachedTo PresentationFor ComeFrom]

DSS Components XML WSDL SDB sensor java RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Discovery & Enhancement Local Store & Index Administration

Reusing Human Attention
Human attention is most expensive. Reuse whenever possible. E.g.:
– Manual schema mapping
– Annotations
– Queries written on data
– Temporary collections of items
– Operations on the data (cut & paste)
Solicit semantic information selectively
– The ESP game [von Ahn et al.]

Learning from Past Matches Every manual map is a learning example. Learn models for elements in mediated schema. Use multi-strategy learning. Thousands of maps in very little time. Reuse for a very related task. [Doan et al., Transformic]

Corpus-based Matching [Madhavan et al.]
Product(productID, name, price, salePrice): (0X7630AB12, The Concert in Central Park, $13.99, $11.99)
Music(ASIN, title, artists, recordLabel, discountPrice): (no tuples)

Obtaining More Evidence
Product(productID, name, price, salePrice): (0X7630AB12, The Concert in Central Park, $13.99, $11.99)
Corpus (tables labeled MusicCD and CD on the slide):
(prodID, albumName, artists, recordCompany, price, salePrice): (9R4374FG56, Saturday Night Fever, The Bee Gees, Columbia, $14.99, $9.99)
(ASIN, album, artistName, price, discountPrice): (4Y3026DF23, The Best of the Doors, The Doors, $16.99, $12.99)
[Figure labels: CD, prodID, albumName]

Comparing with More Evidence
Product(productID, name, price, salePrice): (0X7630AB12, The Concert in Central Park, $13.99, $11.99)
[Figure: elements of Music shown alongside corpus elements from MusicCD and CD. Labels: CD, prodID, albumName; ASIN (4Y6DF23); artists, artistName (The Doors); recordLabel, recordCompany (Columbia); discount, price ($12.99); Title, album (The Best of the Doors)]

Challenges
Learn from other kinds of user activities.
Create other kinds of relationships between participants.
Identify higher-level goals from user actions.
Develop a formal framework for reusing human attention.

Conclusion
“Dataspaces: because that’s the size of the problem”
Pay-as-you-go data integration
Data management for the masses
Key: reuse of human attention
Need to be very data driven in our research

Some References
SIGMOD Record, December 2005:
– Original dataspace vision paper
PODS 2006:
– Specific technical challenges for dataspace research
Semex: CIDR 2005, SIGMOD 2005
Teaching integration to undergraduates:
– SIGMOD Record, September 2003.

1. Build initial models M_s and M_t for element s of schema S and element t of schema T (Name, Instances, Type, …)
2. Find similar elements in the corpus of schemas and mappings
3. Build augmented models M’_s and M’_t
4. Match using augmented models
5. Use additional statistics (ICs) to refine match
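
Steps 1 to 4 above can be sketched in Python as follows. This is a toy illustration under loud assumptions, not the matcher from [Madhavan et al.]: a "model" here is just a set of name tokens plus a set of instance values, similarity is plain Jaccard overlap, the 0.3 threshold is arbitrary, and the corpus elements (`MusicCD.albumName`, `CD.albumTitle`) and their known mapping are hypothetical. Step 5 (refining with integrity constraints) is left out.

```python
import re


def tokens(name):
    """Split a camelCase attribute name into lowercase tokens."""
    return {t.lower() for t in re.findall(r"[A-Za-z][a-z]*", name)}


def model(name, instances=()):
    """Step 1: an initial model built from an element's name and instances."""
    return {"names": tokens(name), "values": {v.lower() for v in instances}}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity(m1, m2):
    """Best of name overlap and instance overlap."""
    return max(jaccard(m1["names"], m2["names"]),
               jaccard(m1["values"], m2["values"]))


def augment(m, corpus, mappings, threshold=0.3):
    """Steps 2-3: fold in evidence from similar corpus elements and their mapped partners."""
    names, values = set(m["names"]), set(m["values"])
    for key, corpus_model in corpus.items():
        if similarity(m, corpus_model) >= threshold:
            for k in {key} | mappings.get(key, set()):
                names |= corpus[k]["names"]
                values |= corpus[k]["values"]
    return {"names": names, "values": values}


# Hypothetical corpus of previously seen schema elements and a known mapping.
corpus = {
    "MusicCD.albumName": model(
        "albumName", ["Saturday Night Fever", "The Concert in Central Park"]),
    "CD.albumTitle": model("albumTitle", ["The Best of the Doors"]),
}
mappings = {
    "MusicCD.albumName": {"CD.albumTitle"},
    "CD.albumTitle": {"MusicCD.albumName"},
}

s = model("name", ["The Concert in Central Park"])  # Product.name
t = model("title")                                  # Music.title, no tuples

before = similarity(s, t)
after = similarity(augment(s, corpus, mappings), augment(t, corpus, mappings))  # step 4
print(before, after)
```

On their own, `name` and an empty `title` column share no evidence. Once each is augmented with what the corpus already knows about album-title columns, the two models overlap, which is the effect the slides illustrate with the Product and Music tables.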