Principles of Dataspace Systems Alon Halevy PODS June 26, 2006.

Slides:



Advertisements
Similar presentations
Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc. E-Libraries E204 May, 2003.
Advertisements

CSE 636 Data Integration Data Integration Approaches.
Semex: A Platform for Personal Information Management and Integration Xin (Luna) Dong University of Washington June 24, 2005.
Corpus-based Schema Matching Jayant Madhavan Philip Bernstein AnHai Doan Alon Halevy Microsoft Research UIUC University of Washington.
Systems Engineering in a System of Systems Context
New England Database Society (NEDS) Friday, April 23, 2004 Volen 101, Brandeis University Sponsored by Sun Microsystems.
DataSpaces: A New Abstraction for Data Management Alon Halevy* DASFAA, 2006 Singapore *Joint work with Mike Franklin and David Maier.
INDEXING DATASPACES by Xin Dong & Alon Halevy ITCS 6010 FALL 2008 Presented by: VISHAL SHETH.
CSE 636 Data Integration Overview. 2 Data Warehouse Architecture Data Source Data Source Relational Database (Warehouse) Data Source Users   Applications.
The RDF meta model: a closer look Basic ideas of the RDF Resource instance descriptions in the RDF format Application-specific RDF schemas Limitations.
Dataspaces: A New Abstraction for Data Management Mike Franklin, Alon Halevy, David Maier, Jennifer Widom.
Dataspaces: Co-Existence with Heterogeneity Alon Halevy KR 2006.
Chapter 4 Relational Databases Copyright © 2012 Pearson Education, Inc. publishing as Prentice Hall 4-1.
What Can Databases Do for Peer-to-Peer Steven Gribble, Alon Halevy, Zachary Ives, Maya Rodrig, Dan Suciu Presented by: Ryan Huebsch CS294-4 P2P Systems.
Chapter 4 Relational Databases Copyright © 2012 Pearson Education 4-1.
Building Data Integration Systems for the Web Alon Halevy Google NSF Information Integration Workshop April 22, 2010.
BTM 382 Database Management Chapter 4: Entity Relationship (ER) Modeling Chitu Okoli Associate Professor in Business Technology Management John Molson.
09/12/2003 Peer-to-Peer Information Systems – WS 03/04 1 Piazza: Data Management Infrastructure for Semantic Web Applications Alon Y. Halevy, Zachary G.
A Platform for Personal Information Management and Integration Xin (Luna) Dong and Alon Halevy University of Washington.
IS432: Semi-Structured Data Dr. Azeddine Chikh. 1. Semi Structured Data Object Exchange Model.
Databases From A to Boyce Codd. What is a database? It depends on your point of view. For Manovich, a database is a means of structuring information in.
Semantic Web outlook and trends May The Past 24 Odd Years 1984 Lenat’s Cyc vision 1989 TBL’s Web vision 1991 DARPA Knowledge Sharing Effort 1996.
Ontology Development Kenneth Baclawski Northeastern University Harvard Medical School.
Search Engines and Information Retrieval Chapter 1.
SICoP Presentation A story about communication Michael Lang BEARevelytix May 2, 2007.
1 The BT Digital Library A case study in intelligent content management Paul Warren
EXCS Sept Knowledge Engineering Meets Software Engineering Hele-Mai Haav Institute of Cybernetics at TUT Software department.
Information Systems: Databases Define the role of general information systems Describe the elements of a database management system (DBMS) Describe the.
CSE 636 Data Integration Overview Fall What is Data Integration? The problem of providing uniform (sources transparent to user) access to (query,
SWETO: Large-Scale Semantic Web Test-bed Ontology In Action Workshop (Banff Alberta, Canada June 21 st 2004) Boanerges Aleman-MezaBoanerges Aleman-Meza,
Meta Tagging / Metadata Lindsay Berard Assisted by: Li Li.
Databases From A to Boyce Codd. What is a database? It depends on your point of view. For Manovich, a database is a means of structuring information in.
1 Ontology-based Semantic Annotatoin of Process Template for Reuse Yun Lin, Darijus Strasunskas Depart. Of Computer and Information Science Norwegian Univ.
Copyright 2002 Prentice-Hall, Inc. Chapter 2 Object-Oriented Analysis and Design Modern Systems Analysis and Design Third Edition Jeffrey A. Hoffer Joey.
The Data Ring: Community Content Sharing Serge Abiteboul (INRIA) Alkis Polyzotis (UC Santa Cruz)
1 Schema Registries Steven Hughes, Lou Reich, Dan Crichton NASA 21 October 2015.
updated CmpE 583 Fall 2008 Ontology Integration- 1 CmpE 583- Web Semantics: Theory and Practice ONTOLOGY INTEGRATION Atilla ELÇİ Computer.
P RINCIPLES OF D ATASPACE S YSTEMS By Alon Halevy (Google Inc.) Michael Franklin (University of California, Berkeley) David Maier (Portland State University)
1 Metadata –Information about information – Different objects, different forms – e.g. Library catalogue record Property:Value: Author Ian Beardwell Publisher.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
Data Staging Data Loading and Cleaning Marakas pg. 25 BCIS 4660 Spring 2012.
A radiologist analyzes an X-ray image, and writes his observations on papers  Image Tagging improves the quality, consistency.  Usefulness of the data.
Ontology Mapping in Pervasive Computing Environment C.Y. Kong, C.L. Wang, F.C.M. Lau The University of Hong Kong.
Oreste Signore- Quality/1 Amman, December 2006 Standards for quality of cultural websites Ministerial NEtwoRk for Valorising Activities in digitisation.
Integrating Structured & Unstructured Data. Goals  Identify some applications that have crucial requirement for integration of unstructured and structured.
Introduction to the Semantic Web and Linked Data
Trustworthy Semantic Webs Dr. Bhavani Thuraisingham The University of Texas at Dallas Lecture #4 Vision for Semantic Web.
Data Integration Hanna Zhong Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009.
Data Integration: Achievements and Perspectives in the Last Ten Years AiJing.
Issues in Ontology-based Information integration By Zhan Cui, Dean Jones and Paul O’Brien.
Web Information Retrieval Prof. Alessandro Agostini 1 Context in Web Search Steve Lawrence Speaker: Antonella Delmestri IEEE Data Engineering Bulletin.
Metadata and Meta tag. What is metadata? What does metadata do? Metadata schemes What is meta tag? Meta tag example Table of Content.
Knowledge Modeling and Discovery. About Thetus Thetus develops knowledge modeling and discovery infrastructure software for customers who: Have high-value.
Managing Semi-Structured Data. Is the web a database?
SICoP Presentation A story about communication Michael Lang BEARevelytix April 25, 2007.
Achieving Semantic Interoperability at the World Bank Designing the Information Architecture and Programmatically Processing Information Denise Bedford.
A Portrait of the Semantic Web in Action Jeff Heflin and James Hendler IEEE Intelligent Systems December 6, 2010 Hyewon Lim.
Data Integration Approaches
Yu, et al.’s “A Model-Driven Development Framework for Enterprise Web Services” In proceedings of the 10 th IEEE Intl Enterprise Distributed Object Computing.
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
Harnessing the Deep Web : Present and Future -Tushar Mhaskar Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy January 7,
March 8, 2007 From Personal Desktops to Personal Dataspaces: A Report on Building the iMeMex Personal Dataspace Management System Jens Dittrich Lukas Blunschi.
Linking Ontologies to Spatial Databases
Entity- Relationship (ER) Model
StYLiD: Structured Information Sharing with User-defined Concepts
Chapter 4 Relational Databases
A Platform for Personal Information Management and Integration
Entity – Relationship Model
MIS2502: Data Analytics Relational Data Modeling
Semi-structured Data In many applications, data does not have a rigidly and predefined schema: e.g., structured files, scientific data, XML. Managing such.
Presentation transcript:

Principles of Dataspace Systems Alon Halevy PODS June 26, 2006

Outline Example data management challenges  Denote by: “dataspaces” [Franklin, H., Maier] Dataspace Support Platforms: –“Pay-as-you-go” data management Putting meat to the bones: –Specific research problems, recent progress –Querying, dataspace evolution, reflection (Possibly) some predictions and subliminal messages.

Shrapnels in Baghdad Story courtesy of Phil Bernstein

OriginatedFrom PublishedIn ConfHomePage ExperimentOf ArticleAbout BudgetOf CourseGradeIn AddressOf Cites CoAuthor Frequent er HomePage Sender EarlyVersion Recipient AttachedTo PresentationFor Personal Information Management [Semex: Dong et al.]

Google Base

The Web is Getting Semantic Forms (millions) Vertical search engines (hundreds) Annotation schemes: Flickr, ESP Game Google Coop –DB search engine coming soon! “A little semantics goes a long way” [See Madhavan talk on Wednesday afternoon]

“Data is the plural of anecdote” Digital libraries, enterprises, “smart homes” Corie Environmental Observation System – Talk to Maier Circle of Blue – Data about the world’s water sources The Boeing 777 – Stanford]

Requirements A system that: –Is defined by boundaries (organizational, physical, logical), not explicit entry. –Provides best-effort services Little or no setup time –Leverages semantics when possible Manage dataspaces

Other Dataspace Characteristics All dataspaces contain >20% porn. The rest has >50% spam.

Dataspaces vs. Data Integration BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords Books Title ISBN Price DiscountPrice Edition CDs Album ASIN Price DiscountPrice Studio BookCategories ISBN Category CDCategories ASIN Category Artists ASIN ArtistName GroupName Authors ISBN FirstName LastName Inventory Database A Inventory Database B Data integration requires semantic mappings

Dataspaces vs. Data Integration Dataspaces are “pay as you go” Benefit Investment (time, cost) Dataspaces Data integration solutions Artist: Mike Franklin

Dataspaces vs. Data Integration Data integration systems require semantic mappings. –Dataspaces are “pay-as-you-go” Dataspaces capture a broader class of semantic relationships: –DerivedFrom, SnapshotOf, HighlyCorrelatedWith, …

Why Do This Now? Fact: –Data management is about people, not enterprises. Prediction: –In the next few years, DB conferences will be about data management for the masses. “CMA”: –If not, our community will become largely irrelevant. Observation: –We’re doing it anyway: e.g., DB&IR, information extraction, uncertainty, …

Outline Example data management challenges Denote by: “dataspaces”  Dataspace Support Platforms: –“Pay-as-you-go” data management Putting meat to the bones: –Specific research problems, recent progress –Querying, dataspace evolution, reflection (Possibly) some predictions and subliminal messages.

Logical Model: Participants and Relationships XML WSDL SDB sensor java RDF RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates

Relationships General form: (Obj 1, Rel, Obj 2, p) Obj 1, Obj 2 : instances or sources, p: degree of certainty Language for describing relationships? –[Rosati]

Dataspace Support Platforms (DSSP) XML WSDL SDB sensor java RDF RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Local Store & Index Administration Discover & Enhance

Outline Example data management challenges Denote by: “dataspaces” Dataspace Support Platforms: –“Pay-as-you-go” data management  Putting meat to the bones: –Specific research problems, recent progress –Querying, dataspace evolution, reflection (Possibly) some predictions and subliminal messages.

Technical Outline Query Evolve Reflect Queries Answers Query processing

Dataspace Queries Keyword queries as starting point –Later may be refined to add structure –Formulated in terms of user’s “schema” Mostly of the form –Instance*: “britany spears” –P (instance) “chicago weather” “PC chair PODS” Query Evolve Reflect

Semantics of Answers 1.The actual answers: –P(instance), P*(instance) Query Evolve Reflect

Weather Seattle

Semantics of Answers 1.The actual answers: –P(instance), P*(instance) 2.Sources where answer can be found: –Partially specify the query to the source –Help the user clean the query Query Evolve Reflect

Toyota Corolla Palo alto

Volvo Palo alto Toyota Palo alto

Semantics of Answers 1.The actual answers: –P(instance), P*(instance) 2.Sources where answer can be found: –Partially specify the query to the source –Help the user clean the query 3.Supporting facts or sources: –Facts that can be used to derive P(instance) –Rest of derivation may be obvious to user Query Evolve Reflect

Related or Partial Answers In which country was Dan Suciu born? –Bucharest Latest edition of software X: –2004 edition Is the Space Needle higher than the Eiffel Tower? –Height of Seattle Space Needle –Height of Eiffel Tower Query Evolve Reflect 184m 324m Rank all types of answers

Query Processing: Data Integration Q1Q1 Q2Q2 Q3Q3 Q4Q4 Q 41 Q 42 Q 43 Weather(Chicago) Query Evolve Reflect PDMS Active XML Data exchange

Query Processing: DSSPs Query: Jan Van den Bussche address First name: Jan Middle name: Last name: Van den Bussche Address: ? Query Evolve Reflect

Query Processing: DSSPs First name: Jan Middle name: Last name: Van den Bussche Address: ? address address: City required Keyword query: J. vd Bussche ………… City? StreetAdr … … (t 1,p 1 ) city, zip (t 2,p 2 ) ………… Companies address Query Evolve Reflect (t 1,p 3 )

Two Principles Mappings are approximate at best –What do approximate mappings mean? –Answering queries with them? –Composition? Answering queries = Finding evidence + combining evidence Query Evolve Reflect

Technical Outline Query Evolve Reflect Reuse human attention

The Cost of Semantics Benefit Investment (time, cost) Dataspaces Data integration solutions Artist: Mike Franklin ? Semantic integration modeled by: {(Obj 1, rel, Obj 2, p),…} Query Evolve Reflect

Reusing Human Attention Principle:  User action = statement of semantic relationship  Leverage actions to infer other semantic relationships Examples –Providing a semantic mapping Infer other mappings –Writing a query Infer content of sources, relationships between sources –Creating a “digital workspace” Infer “relatedness” of documents/sources Infer co-reference between objects in the dataspace –Annotating, cutting & pasting, browsing among docs Query Evolve Reflect

Past, Present and Future Leverage past actions & existing structure: –[Dong et al., 2004, 2005], [He & Chang, 2003] Generalize from current actions –Queries, schema mappings Beg for extra attention: –ESP [von Ahn], mass collaboration [Doan+], active learning [Sarawagi et al.] Query Evolve Reflect

Reuse: Learning Schema Mappings [Doan et al.] Classifiers for mediated schema  Thousands of web forms mapped in little time  Transformic Inc: deep web search.  [Madhavan et al.]: infer mappings for any schemas in the domain Mediated schema Action Query Evolve Reflect ( S 1, M, S, p)

Technical Outline Query Evolve Reflect Unify lineage, uncertainty, and inconsistency Model them on views

Dataspace Reflection Answers are uncertain in dataspaces: –data, –mappings, –query answering techniques Data may be inconsistent Tracking lineage is crucial Query Evolve Reflect

A DSSP Needs to Process uncertain data Update uncertainty with new evidence Be proactive about reducing uncertainty Live with inconsistency Leverage lineage to reduce uncertainty ( Obj 1, Rel, Obj 2, p) Query Evolve Reflect

Two Principles Need a single formalism for modeling: –uncertainty, –inconsistency, and –lineage Model them on views Query Evolve Reflect

Israel population Query Evolve Reflect Uncertainty & Lineage Stanford

Uncertainty and Inconsistency Inconsistency = uncertainty about the truth –Salary (John Doe, $120,000) –Salary (John Doe, $135,000)  Salary (John Doe, $120,000 | $135,000) Orchestra, U. Penn. Query Evolve Reflect

Uncertainty Formalisms 101 Represent a set of possible worlds A-tuples - uncertainty on attribute values: –(PODS, 2006, {Chicago, Baltimore}) –Tuples can be optional Not closed under relational operators Query Evolve Reflect

Uncertainty Formalisms 101 (2) X-tuples – uncertainty on entire tuples: –{(PODS, 2006, Chicago), (PODC, 2006, Baltimore)} Still not closed under relational operators Query Evolve Reflect

Uncertainty 101: C-Tables { (PODC, 2006, Chicago, X=1), (PODS, 2006, Chicago, X <>1), (SIGMOD, 2006, Chicago, X<>1) } Closed and Complete! Understandable? –See [Das Sarma et al., ICDE 2006]: working models for uncertain data Query Evolve Reflect

Adding Lineage to X-tuples [ULDBs, Benjelloun et al., VLDB 06] {(PODS, 2006, Ch), (PODC, 2006, Ba)} l1l1 l2l2 ! (l 1 & l 2 ) { (…) (…) } (t) { (…) (…) } (t’) Query Evolve Reflect No effect on complexity of relational operators

Adding Lineage to A-tuples (PODS, {2005,2006}, {Chicago, Baltimore}) l1l1 l2l2 l3l3 l4l4 Lineage on views (projections)  Uncertainty should also be on views! (Halevy, Los Altos, CA, professor},  Answering queries using uncertain views See [Dalvi & Suciu, 2005] for a great start Query Evolve Reflect

Putting it all Together Query Evolve Reflect Approximate mappings Evidence combination Reuse human attention Create approximate semantic relationships Foundation for reasoning about uncertainty, inconsistency and lineage

Conclusion and Outlook Data management moving to consumers Dataspaces: key element in this agenda –Pay as you go data management –Reuse human attention The role of theory: –Reflect, generalize and explain –People, people, people

Some References SIGMOD Record, December 2005: –Original dataspace vision paper Maier EDBT 2006 tak PODS 2006 proceedings: challenges Data Integration: the Teenage Years –VLDB 2006 Teaching integration to undergraduates: –SIGMOD Record, September, 2003.