1
Principles of Dataspace Systems Alon Halevy PODS June 26, 2006
2
Outline Example data management challenges Denote by: “dataspaces” [Franklin, H., Maier] Dataspace Support Platforms: –“Pay-as-you-go” data management Putting meat to the bones: –Specific research problems, recent progress –Querying, dataspace evolution, reflection (Possibly) some predictions and subliminal messages.
3
Shrapnels in Baghdad Story courtesy of Phil Bernstein
4
OriginatedFrom PublishedIn ConfHomePage ExperimentOf ArticleAbout BudgetOf CourseGradeIn AddressOf Cites CoAuthor FrequentEmailer HomePage Sender EarlyVersion Recipient AttachedTo PresentationFor Personal Information Management [Semex: Dong et al.]
5
Google Base
6
The Web is Getting Semantic Forms (millions) Vertical search engines (hundreds) Annotation schemes: Flickr, ESP Game Google Coop –DB search engine coming soon! “A little semantics goes a long way” [See Madhavan talk on Wednesday afternoon]
7
“Data is the plural of anecdote” Digital libraries, enterprises, “smart homes” Corie Environmental Observation System – Talk to Maier Circle of Blue – Data about the world’s water sources The Boeing 777 – [Hanrahan @ Stanford]
8
Requirements A system that: –Is defined by boundaries (organizational, physical, logical), not explicit entry. –Provides best-effort services Little or no setup time –Leverages semantics when possible Manage dataspaces
9
Other Dataspace Characteristics All dataspaces contain >20% porn. The rest has >50% spam.
10
Dataspaces vs. Data Integration BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords Books Title ISBN Price DiscountPrice Edition CDs Album ASIN Price DiscountPrice Studio BookCategories ISBN Category CDCategories ASIN Category Artists ASIN ArtistName GroupName Authors ISBN FirstName LastName Inventory Database A Inventory Database B Data integration requires semantic mappings
11
Dataspaces vs. Data Integration Dataspaces are “pay as you go” Benefit Investment (time, cost) Dataspaces Data integration solutions Artist: Mike Franklin
12
Dataspaces vs. Data Integration Data integration systems require semantic mappings. –Dataspaces are “pay-as-you-go” Dataspaces capture a broader class of semantic relationships: –DerivedFrom, SnapshotOf, HighlyCorrelatedWith, …
13
Why Do This Now? Fact: –Data management is about people, not enterprises. Prediction: –In the next few years, DB conferences will be about data management for the masses. “CMA”: –If not, our community will become largely irrelevant. Observation: –We’re doing it anyway: e.g., DB&IR, information extraction, uncertainty, …
14
Outline Example data management challenges Denote by: “dataspaces” Dataspace Support Platforms: –“Pay-as-you-go” data management Putting meat to the bones: –Specific research problems, recent progress –Querying, dataspace evolution, reflection (Possibly) some predictions and subliminal messages.
15
Logical Model: Participants and Relationships XML WSDL SDB sensor java RDF RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates
16
Relationships General form: (Obj1, Rel, Obj2, p) Obj1, Obj2: instances or sources, p: degree of certainty Language for describing relationships? –[Rosati]
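The general form on this slide can be sketched as a small record type. This is only an illustration: the class name `Relationship`, the helper `related`, and the sample catalog entries are hypothetical, and treating p as a number in [0, 1] is an assumption rather than something the slide states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One dataspace relationship of the form (obj1, rel, obj2, p)."""
    obj1: str   # an instance or a source
    rel: str    # relationship name, e.g. "SnapshotOf"
    obj2: str   # an instance or a source
    p: float    # degree of certainty, assumed here to lie in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.p <= 1.0:
            raise ValueError("p must be in [0, 1]")


def related(catalog, obj, min_p=0.0):
    """Relationships touching obj with certainty >= min_p, most certain first."""
    hits = [r for r in catalog if obj in (r.obj1, r.obj2) and r.p >= min_p]
    return sorted(hits, key=lambda r: r.p, reverse=True)


# Hypothetical catalog entries, made up for illustration only.
catalog = [
    Relationship("rdb_copy", "SnapshotOf", "rdb_main", 0.9),
    Relationship("xml_feed", "DerivedFrom", "rdb_main", 0.6),
    Relationship("sensor_log", "HighlyCorrelatedWith", "xml_feed", 0.4),
]
```

Calling `related(catalog, "rdb_main", min_p=0.5)` would return the snapshot and derivation entries and skip the weaker correlation.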
17
Dataspace Support Platforms (DSSP) XML WSDL SDB sensor java RDF RDB WSDL XML java schema mapping manually created RDB snapshot view replica 1hr updates Catalog Search & query Local Store & Index Administration Discover & Enhance
18
Outline Example data management challenges Denote by: “dataspaces” Dataspace Support Platforms: –“Pay-as-you-go” data management Putting meat to the bones: –Specific research problems, recent progress –Querying, dataspace evolution, reflection (Possibly) some predictions and subliminal messages.
19
Technical Outline Query Evolve Reflect Queries Answers Query processing
20
Dataspace Queries Keyword queries as starting point –Later may be refined to add structure –Formulated in terms of user’s “schema” Mostly of the form –Instance*: “britany spears” –P (instance) “chicago weather” “PC chair PODS” Query Evolve Reflect
21
Semantics of Answers 1. The actual answers: –P(instance), P*(instance) Query Evolve Reflect
22
Weather Seattle
23
Semantics of Answers 1. The actual answers: –P(instance), P*(instance) 2. Sources where answer can be found: –Partially specify the query to the source –Help the user clean the query Query Evolve Reflect
24
Toyota Corolla Palo alto
25
Volvo Palo alto Toyota Palo alto
26
Semantics of Answers 1. The actual answers: –P(instance), P*(instance) 2. Sources where answer can be found: –Partially specify the query to the source –Help the user clean the query 3. Supporting facts or sources: –Facts that can be used to derive P(instance) –Rest of derivation may be obvious to user Query Evolve Reflect
27
Related or Partial Answers In which country was Dan Suciu born? –Bucharest Latest edition of software X: –2004 edition Is the Space Needle higher than the Eiffel Tower? –Height of Seattle Space Needle –Height of Eiffel Tower Query Evolve Reflect 184m 324m Rank all types of answers
28
Query Processing: Data Integration Q1 Q2 Q3 Q4 Q41 Q42 Q43 Weather(Chicago) Query Evolve Reflect PDMS Active XML Data exchange
29
Query Processing: DSSPs Query: Jan Van den Bussche address First name: Jan Middle name: Last name: Van den Bussche Address: ? Query Evolve Reflect
30
Query Processing: DSSPs First name: Jan Middle name: Last name: Van den Bussche Address: ? address address: City required Keyword query: J. vd Bussche ………… City? StreetAdr … … (t1, p1) city, zip (t2, p2) ………… Companies address Query Evolve Reflect (t1, p3)
31
Two Principles Mappings are approximate at best –What do approximate mappings mean? –Answering queries with them? –Composition? Answering queries = Finding evidence + combining evidence Query Evolve Reflect
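As one possible reading of "finding evidence + combining evidence", the sketch below merges several uncertain pieces of evidence for the same candidate answer. The slide does not name a combination rule; noisy-or under an independence assumption is swapped in here purely for illustration, and the function name `combine_evidence` and the sample values are hypothetical.

```python
from collections import defaultdict


def combine_evidence(evidence):
    """Combine (answer, p) pairs with noisy-or and rank the answers.

    Assumes the pieces of evidence are independent, which is an
    illustrative simplification and not a claim from the talk.
    """
    miss = defaultdict(lambda: 1.0)  # probability that every piece is wrong
    for answer, p in evidence:
        miss[answer] *= 1.0 - p
    combined = {answer: 1.0 - m for answer, m in miss.items()}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical evidence: two sources weakly support addr_1, one supports addr_2.
evidence = [("addr_1", 0.5), ("addr_1", 0.5), ("addr_2", 0.6)]
ranked = combine_evidence(evidence)
```

Two weak sources that agree can outrank a single stronger source, which is the kind of behavior a ranking over combined evidence is meant to capture.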
32
Technical Outline Query Evolve Reflect Reuse human attention
33
The Cost of Semantics Benefit Investment (time, cost) Dataspaces Data integration solutions Artist: Mike Franklin ? Semantic integration modeled by: {(Obj1, rel, Obj2, p), …} Query Evolve Reflect
34
Reusing Human Attention Principle: User action = statement of semantic relationship Leverage actions to infer other semantic relationships Examples –Providing a semantic mapping Infer other mappings –Writing a query Infer content of sources, relationships between sources –Creating a “digital workspace” Infer “relatedness” of documents/sources Infer co-reference between objects in the dataspace –Annotating, cutting & pasting, browsing among docs Query Evolve Reflect
35
Past, Present and Future Leverage past actions & existing structure: –[Dong et al., 2004, 2005], [He & Chang, 2003] Generalize from current actions –Queries, schema mappings Beg for extra attention: –ESP [von Ahn], mass collaboration [Doan+], active learning [Sarawagi et al.] Query Evolve Reflect
36
Reuse: Learning Schema Mappings [Doan et al.] Classifiers for mediated schema Thousands of web forms mapped in little time Transformic Inc: deep web search. [Madhavan et al.]: infer mappings for any schemas in the domain Mediated schema Action Query Evolve Reflect (S1, M, S, p)
37
Technical Outline Query Evolve Reflect Unify lineage, uncertainty, and inconsistency Model them on views
38
Dataspace Reflection Answers are uncertain in dataspaces: –data, –mappings, –query answering techniques Data may be inconsistent Tracking lineage is crucial Query Evolve Reflect
39
A DSSP Needs to Process uncertain data Update uncertainty with new evidence Be proactive about reducing uncertainty Live with inconsistency Leverage lineage to reduce uncertainty (Obj1, Rel, Obj2, p) Query Evolve Reflect
40
Two Principles Need a single formalism for modeling: –uncertainty, –inconsistency, and –lineage Model them on views Query Evolve Reflect
41
Israel population Query Evolve Reflect Uncertainty & Lineage Trio @ Stanford
42
Uncertainty and Inconsistency Inconsistency = uncertainty about the truth –Salary (John Doe, $120,000) –Salary (John Doe, $135,000) Salary (John Doe, $120,000 | $135,000) Orchestra, Ives @ U. Penn. Query Evolve Reflect
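The salary example can be sketched as a merge step that turns conflicting facts into a single fact with a set of alternatives. The function name `merge_conflicts` and the (relation, key, value) encoding are assumptions made for this illustration.

```python
from collections import defaultdict


def merge_conflicts(facts):
    """Group values asserted for the same (relation, key) into alternatives.

    A key with one value is consistent; a key with several values is
    treated as uncertain between those values.
    """
    merged = defaultdict(list)
    for relation, key, value in facts:
        if value not in merged[(relation, key)]:
            merged[(relation, key)].append(value)
    return dict(merged)


facts = [
    ("Salary", "John Doe", 120000),
    ("Salary", "John Doe", 135000),
]
merged = merge_conflicts(facts)
```

Here `merged[("Salary", "John Doe")]` holds both values, mirroring Salary (John Doe, $120,000 | $135,000).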
43
Uncertainty Formalisms 101 Represent a set of possible worlds A-tuples - uncertainty on attribute values: –(PODS, 2006, {Chicago, Baltimore}) –Tuples can be optional Not closed under relational operators Query Evolve Reflect
44
Uncertainty Formalisms 101 (2) X-tuples – uncertainty on entire tuples: –{(PODS, 2006, Chicago), (PODC, 2006, Baltimore)} Still not closed under relational operators Query Evolve Reflect
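A minimal sketch of the possible worlds behind a-tuples and x-tuples, using the conference examples from the slides. The list-based encodings and the function names are assumptions for illustration only.

```python
from itertools import product


def a_tuple_worlds(a_tuple):
    """Each attribute lists its alternatives; attributes vary independently."""
    return [tuple(world) for world in product(*a_tuple)]


def x_tuple_worlds(x_tuple, optional=False):
    """Exactly one whole-tuple alternative holds; None means the tuple is absent."""
    worlds = list(x_tuple)
    if optional:
        worlds.append(None)
    return worlds


a_tuple = [["PODS"], [2006], ["Chicago", "Baltimore"]]
x_tuple = [("PODS", 2006, "Chicago"), ("PODC", 2006, "Baltimore")]

# An a-tuple wide enough to cover both x-tuple alternatives also admits
# combinations the x-tuple rules out, so the correlation is lost.
wide = a_tuple_worlds([["PODS", "PODC"], [2006], ["Chicago", "Baltimore"]])
extra = [w for w in wide if w not in x_tuple_worlds(x_tuple)]
```

The `extra` list contains the pairings (PODS, Baltimore) and (PODC, Chicago), which shows why uncertainty on whole tuples can express correlations that per-attribute uncertainty cannot.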
45
Uncertainty 101: C-Tables { (PODC, 2006, Chicago, X = 1), (PODS, 2006, Chicago, X <> 1), (SIGMOD, 2006, Chicago, X <> 1) } Closed and Complete! Understandable? –See [Das Sarma et al., ICDE 2006]: working models for uncertain data Query Evolve Reflect
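The c-table on this slide can be sketched by attaching a condition on the variable X to each tuple and evaluating it under each valuation. Encoding conditions as Python lambdas and restricting X to the values 1 and 2 are assumptions for this illustration.

```python
# Each row pairs a tuple with a condition on the variable X.
c_table = [
    (("PODC", 2006, "Chicago"), lambda x: x == 1),
    (("PODS", 2006, "Chicago"), lambda x: x != 1),
    (("SIGMOD", 2006, "Chicago"), lambda x: x != 1),
]


def world(table, x):
    """Tuples whose condition holds under the valuation X = x."""
    return [row for row, condition in table if condition(x)]


# Assumed domain for X; any value other than 1 yields the same second world.
worlds = {x: world(c_table, x) for x in (1, 2)}
```

Under X = 1 only the PODC tuple is present, while any other value yields the PODS and SIGMOD tuples together, a correlation across tuples that a single condition variable expresses directly.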
46
Adding Lineage to X-tuples [ULDBs, Benjelloun et al., VLDB 06] {(PODS, 2006, Ch), (PODC, 2006, Ba)} l1 l2 !(l1 & l2) { (…) (…) } (t) { (…) (…) } (t’) Query Evolve Reflect No effect on complexity of relational operators
47
Adding Lineage to A-tuples (PODS, {2005,2006}, {Chicago, Baltimore}) l1 l2 l3 l4 Lineage on views (projections) Uncertainty should also be on views! (Halevy, Los Altos, CA, professor), 0.8 0.6 Answering queries using uncertain views See [Dalvi & Suciu, 2005] for a great start Query Evolve Reflect
48
Putting it all Together Query Evolve Reflect Approximate mappings Evidence combination Reuse human attention Create approximate semantic relationships Foundation for reasoning about uncertainty, inconsistency and lineage
49
Conclusion and Outlook Data management moving to consumers Dataspaces: key element in this agenda –Pay as you go data management –Reuse human attention The role of theory: –Reflect, generalize and explain –People, people, people
50
Some References SIGMOD Record, December 2005: –Original dataspace vision paper Maier EDBT 2006 talk PODS 2006 proceedings: challenges Data Integration: the Teenage Years –VLDB 2006 Teaching integration to undergraduates: –SIGMOD Record, September 2003.