Download presentation
Presentation is loading. Please wait.
1
DataSpaces: A New Abstraction for Data Management Alon Halevy* DASFAA, 2006 Singapore *Joint work with Mike Franklin and David Maier
2
Outline Dataspaces: –collections of heterogeneous (un)-structured data. –examples and characteristics Dataspace Support Platforms: –“Pay-as-you-go” data management Getting there: –Recent progress and research challenges
3
Outline Dataspaces: –collections of heterogeneous (un)-structured data. –examples and characteristics Dataspace Support Platforms: –“Pay-as-you-go” data management Getting there: –Recent progress and research challenges
4
Shrapnels in Baghdad Story courtesy of Phil Bernstein
5
OriginatedFrom PublishedIn ConfHomePage ExperimentOf ArticleAbout BudgetOf CourseGradeIn AddressOf Cites CoAuthor FrequentEmailer HomePage Sender EarlyVersion Recipient AttachedTo PresentationFor Personal Information Management [Semex: Dong et al.]
6
Google Base
7
What do they have in common? All dataspaces contain >20% porn. The rest is spam.
8
What do they have in common? Must manage all the data in the space Need best-effort services with no setup time. Data is heterogeneous, –possibly unstructured Do not have control over the data, just access.
9
Isn ’ t this Data Integration? OMIM Swiss- Prot HUGOGO Gene- Clinics Entrez Locus- Link GEO Sequenceable Entity GenePhenotype Structured Vocabulary Experiment Protein Nucleotide Sequence Microarray Experiment
10
No, it’s Data Co-existence Data integration systems require semantic mappings.
11
Semantic Mappings BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords Books Title ISBN Price DiscountPrice Edition CDs Album ASIN Price DiscountPrice Studio BookCategories ISBN Category CDCategories ASIN Category Artists ASIN ArtistName GroupName Authors ISBN FirstName LastName Inventory Database A Inventory Database B
12
No, It’s Data Co-existence Data integration systems require semantic mappings. Dataspaces are “pay-as-you-go”: –Provide some services immediately –Create more tight integrations as needed.
13
The Cost of Semantics Benefit Investment (time, cost) Dataspaces Data integration solutions Schema first vs. schema last
14
Why Now? Data management is moving towards dataspace-like applications. Prediction: –Data management is about people, not enterprises. –In 5 years our community will figure it out. We’ve made relevant progress: –Combining DB & IR –Creation and management of semantic mappings. –Uncertainty, lineage, inconsistency
15
Dataspaces Fundamentals: Participants and Relationships XML WSDL SDB sensor java RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates
16
DSSP Components XML WSDL SDB sensor java RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Discovery & Enhancement Local Store & Index Administration Heterogeneous index Reference reconciliation Additional associations Cache for performance & availability Seamless flow from search to query Query about participants, locate data Lineage, uncertainty, completeness Set up workflows Relax Find participants Discover and refine relationships Maintain catalog participants’ capabilities Relationships Quality of both
17
DSSP Components XML WSDL SDB sensor java RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Discovery & Enhancement Local Store & Index Administration
18
Querying a Dataspace Best effort, based on: –Approximate semantic mappings –Other mechanisms Example -- searching for Beng Chin’s phone number: –Keyword search for “beng chin ooi” –Examine attributes of tuples/XML elements –Match attributes to ‘address’
19
Querying a Dataspace Best effort Combine structured and unstructured data
20
Two Kinds of Data Structured Data (or XML) Unstructured Data [Dong, Liu, Halevy]
21
Querying a Dataspace Best effort Combine structured and unstructured data Rank: answers, sources
22
Volvo Palo alto
23
Acura integra Palo alto
24
Querying a Dataspace Best effort Combine structured and unstructured data Rank: answers, sources Iterative -- sequences of queries: –Used cars palo alto –Saab for sale –Classified ads Classified ads for Saabs near Palo Alto
25
Querying a Dataspace Best effort Combine structured and unstructured data Rank: answers, sources Iterative: sequences of queries Reflective: need to explain and expose confidence in answers.
26
Dataspace Reflection Sources of uncertainty: –Unreliable sources –Data obtained by imprecise extractions –Inconsistent data –Approximate query answering (mappings) A DSSP must: –Expose the lineage of an answer Web search engines already do this –Reason about relationship between sources
27
LUI Introspection Currently, three distinct formalisms: –Uncertainty –Lineage –Inconsistency Single formalism should do it all: –Uncertainty and lineage (Trio @ Stanford) –Lineage & inconsistency (Orchestra @ U.Penn)
28
ULDB’s [Benjelloun, Das Sarma, Halevy, Widom] Combine uncertainty and lineage –Based on x-tuples: {(t1 | t2 | t3)} Queries can be answered with no additional complexity –You can do even better than uncertain DBs. Because of lineage, you can sometimes obtain completeness.
29
DSS Components XML WSDL SDB sensor java RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Discovery & Enhancement Local Store & Index Administration
30
Personal Information Space Alon Halevy Luna Dong Semex: … authoredPaper author authoredPaper author StuIDLastNameFirstName… 1000001XinDong… ………… Departmental Database Alon Dong Halevy Luna Semex Xin Inverted List
31
Personal Information Space Alon Halevy Semex: … authoredPaper author authoredPaper author StuIDLastNameFirstName… 1000001XinDong… ………… Departmental Database Alon1 Dong11 Halevy1 Luna1 Semex1 Xin1 Inverted List Query: Dong Luna Dong
32
Personal Information Space Alon Halevy Semex: … authoredPaper author authoredPaper author StuIDLastNameFirstName… 1000001XinDong… ………… Departmental Database Alon1 Dong11 Halevy1 Luna1 Semex1 Xin1 Inverted List Luna Dong Query: FirstName “Dong”
33
Personal Information Space Alon Halevy Semex: … authoredPaper author authoredPaper author StuIDLastNameFirstName… 1000001XinDong… ………… Departmental Database Alon&&name&&1 Dong&&FirstName&&1 Dong&&name&&1 Halevy&&name&&1 Luna&&name&&1 Semex&&title&&1 Xin&&LastName&&1 Inverted List Luna Dong Query: FirstName “Dong”
34
Personal Information Space Alon Halevy Semex: … authoredPaper author authoredPaper author StuIDLastNameFirstName… 1000001XinDong… ………… Departmental Database Alon&&name&&1 Dong&&FirstName&&1 Dong&&name&&1 Halevy&&name&&1 Luna&&name&&1 Semex&&title&&1 Xin&&LastName&&1 Inverted List Query: name “Dong” Luna Dong
35
Personal Information Space Alon Halevy Semex: … authoredPaper author authoredPaper author StuIDLastNameFirstName… 1000001XinDong… ………… Departmental Database Alon&&name&&1 Dong&&name&&FirstName&&1 Dong&&name&&1 Halevy&&name&&1 Luna&&name&&1 Semex&&title&&1 Xin&&name&&LastName&&1 Inverted List Query: name “Dong” Luna Dong
36
Personal Information Space Alon Halevy authoredPaper author authoredPaper author StuIDLastNameFirstName… 1000001XinDong… ………… Departmental Database Inverted List Luna Dong Semex: … Query: Paper author “Dong” Alon&&name&&1 Dong&&name&&FirstName&&1 Dong&&name&&1 Halevy&&name&&1 Luna&&name&&1 Semex&&title&&1 Xin&&name&&LastName&&1
37
Personal Information Space Alon Halevy authoredPaper author authoredPaper author StuIDLastNameFirstName… 1000001XinDong… ………… Departmental Database Inverted List Query: Paper author “Dong” Luna Dong Semex: … Alon&&author&&1 Alon&&name&&1 Dong&&author&&1 Dong&&name&&FirstName&&1 Dong&&name&&1 Halevy&&name&&1 Luna&&name&&1 Semex&&authoredPaper&&11 Semex&&title&&1 Xin&&name&&LastName&&1
38
DSS Components XML WSDL SDB sensor java RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Discovery & Enhancement Local Store & Index Administration
39
OriginitatedFrom PublishedIn ConfHomePage ExperimentOf ArticleAbout BudgetOf CourseGradeIn AddressOf Cites CoAuthor FrequentEmailer HomePage Sender EarlyVersion Recipient AttachedTo PresentationFor ComeFrom Enhancing a Dataspace Creating associations [Dong & Halevy, CIDR 2005] Reference reconciliation [Dong et al., SIGMOD 2005] Very active field.
40
DSS Components XML WSDL SDB sensor java RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Discovery & Enhancement Local Store & Index Administration
41
Reusing Human Attention Human attention is most expensive. Reuse whenever possible. E.g.,: –Manual schema mapping –Annotations –Queries written on data –Temporary collections of items –Operations on the data (cut & paste) Solicit semantic information selectively –The ESP game: [von Ahn et al.]
42
Learning from Past Matches Every manual map is a learning example. Learn models for elements in mediated schema. Use multi-strategy learning. Thousands of maps in very little time. Reuse for a very related task. [Doan et. al, Transformic]
43
Corpus-based Matching Product Music ASINtitleartistsrecordLabeldiscountPrice productIDnamepricesalePrice 0X7630AB12 The Concert in Central Park $13.99$11.99 (no tuples) [Madhavan et al.]
44
Obtaining More Evidence Product productIDnamepricesalePrice 0X7630AB12 The Concert in Central Park $13.99$11.99 prodIDalbumNameartistsrecordCompanypricesalePrice 9R4374FG56Saturday Night FeverThe Bee GeesColumbia$14.99$9.99 ASINalbumartistNamepricediscountPrice 4Y3026DF23The Best of the DoorsThe Doors$16.99$12.99 MusicCD CD Corpus CD prodID albumName
45
Comparing with More Evidence Product productIDnamepricesalePrice 0X7630AB12 The Concert in Central Park $13.99$11.99 CD prodID albumName Music MusicCD 4Y6DF23 ASIN The Doors artists artistName Columbia recordLabel recordCompany $12.99 The Best of the Doors discount price Title album
46
Challenges Learn from other kinds of user activities Create other kinds of relationships between participants Identify higher-level goals from user actions Develop a formal framework for reusing human attention.
47
Conclusion “Dataspaces: because that’s the size of the problem” Pay-as-you-go data integration Data management for the masses Key: reuse of human attention Need to be very data driven in our research
48
Some References www.cs.washington.edu/homes/alon SIGMOD Record, December 2005: –Original dataspace vision paper PODS 2006: –Specific technical challenges for dataspace research Semex: CIDR 2005, SIGMOD 2005 Teaching integration to undergraduates: –SIGMOD Record, September, 2003.
49
1. Build initial models s S Name: Instances: Type: … MsMs t T Name: Instances: Type: … MtMt Corpus of schemas and mappings 2. Find similar elements in corpus s S Name: Instances: Type: … M’ s t T Name: Instances: Type: … M’ t 3. Build augmented models 4. Match using augmented models 5. Use additional statistics (IC’s) to refine match
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.