
1  Piazza: Data Management Infrastructure for the Semantic Web
Zachary G. Ives, University of Pennsylvania
CIS 700 – Internet-Scale Distributed Computing, February 3, 2004
Joint work with Alon Halevy, Peter Mork, Dan Suciu, and Igor Tatarinov (University of Washington)

2  The Big Question in P2P
Why use a P2P system vs. a centralized one?
- PRO: P2P offers greater flexibility and resource utilization
- CON: P2P often sacrifices reliability guarantees, accountability, and sometimes even performance
There are a few simple cases where P2P wins:
- Avoiding the law/RIAA/MPAA: copying music, videos, etc.
- Anonymity (FreeNet, etc.)
- Exploiting idle cycles
But are there applications that are inherently P2P?

3  Most P2P Work is “Bottom-up”
- The basis of P2P: algorithms/data structures papers
  - Chord, CAN, Pastry
  - Focus on providing a robust DHT – not on what to do with it
- Several systems build functionality over the DHT:
  - Tang et al.’s information retrieval work: maps LSI space into CAN’s multidimensional space; interesting but uncertain benefits
  - Berkeley’s PIER: a DB query engine that uses the distributed hash table to do distributed joins
  - Sophia: Prolog rules in a distributed environment for network monitoring
- None of these apps are inherently (or perhaps even best) based on P2P architectures

4  Thinking Top-Down
Find an application that has needs matching the properties of P2P:
- No central authority (and no logical owner of a central server)
- Loose, relatively ad hoc membership
- Capabilities of the system grow as new members join
- Participants are generally cooperative

5  One Possible Answer: Data Integration/Interchange Applications
- Multiple parties have proprietary data + sources
- They are not willing to relinquish control or change their data representation, but are willing to share
- Examples:
  - The UPenn hospital system is looking to modernize information sharing among departments (trauma, neurology, etc.)
  - Many bioinformatics warehouses (e.g., Penn’s GUS, NCBI’s GenBank) have related info they would like to share
  - The W3C’s vision of the “Semantic Web”: a web where all pages are annotated with meaning, meanings are well-defined, and complex questions can be answered

6  The “Old” Model: Centralization
- Get all parties to hash out a standard, global schema or ontology
  - Different classes of objects to be represented
  - Constraints + relationships between them
- Relate all of the data sources to that schema
  - Relationships are specified as named queries – views
  - Efficient techniques exist for using these views to answer future queries posed over the mediated schema

7  Centralized Data Integration Architecture
[diagram: a query is posed over the mediated schema; the data integration system/mediator consults query-based schema mappings stored in a source catalog, sends subqueries through wrappers to each source’s data, and returns results]

8  Centralization Doesn’t Scale
- Difficult to arrive at one standard schema
  - … and when we do, it’s slow to evolve to new needs
  - This is a human factor, but it is also a scalability issue
- Hard to leverage mappings well:
  - If we map source A → mediated schema, does this help us map source B, even if source B is “almost” like source A?
  - Can we prevent mappings from “breaking” when we update the central schema?
- Users often prefer their familiar schema, not the central one
  - More schemas → more users forced to change schemas

9  The Piazza System: Infrastructure for Relating & Querying Structured Data
- Recasts data integration as a decentralized confederation of peers and mappings
- Our initial focus is on the logical aspects:
  1. Mediating between different types of XML-encoded data
     - Based on extensions of formalisms & techniques from data integration
     - Schemas are related via directional pairwise mappings
  2. Making maximal use of a limited number of mappings
     - Translates queries over the transitive closure of mappings
     - Uses mappings “in reverse”

10  Mediated Query Answering in the Piazza System
[diagram: peers UW, Stanford, DBLP, Oxford, Leipzig, CiteSeer, and Penn connected by pairwise mappings; a query Q posed at one peer is rewritten into Q′, Q″, … as it propagates along mapping paths]
Mappings are typically directional and pairwise.

11  Data in Piazza
- Each participant may have its own schema + data
- Unordered XML, with pre-specified schemas
- In general, we’ll identify it with XPath expressions:
  - Similar syntax to Unix paths, but over trees
  - e.g., /rootelement/subelement/*
[diagram: example XML tree – a db root containing a book element with attributes mdate (2002…) and key (“Brown92”), and children author “Kurt Brown”, title “PRPL…”, year 1992, pub “MKP”]
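As a small illustration (not from the talk) of evaluating an XPath-style path over a document shaped like the bibliography example above, here is a sketch using Python’s standard library; the element names and values simply mirror the slide’s tree.

```python
import xml.etree.ElementTree as ET

# A small document shaped like the bibliography tree above.
doc = ET.fromstring("""
<db>
  <book mdate="2002-01-01" key="Brown92">
    <author>Kurt Brown</author>
    <title>PRPL…</title>
    <year>1992</year>
    <pub>MKP</pub>
  </book>
</db>
""")

# ElementTree supports a limited XPath subset; /db/book/* becomes:
for element in doc.findall("./book/*"):
    print(element.tag, "->", element.text)
```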

12  Mappings in Data Integration
- Express value or class equivalence:
  - DollarCost = EuroToDollar(EuroCost)
  - “ID#0123456” ↔ “Catalog#98324”
  - S1/book/author = S2/author
- Also containment: S2/book ⊆ S3/publication
- Ability to use value(s) as IDs
  - Collect all entries related to the ID into one object
- Convert between edge labels and values (e.g., a pub-type value “book” becomes a book element)
- Concatenation: S2/author/fullname = S3/author/first + S3/author/last
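To make two of these primitives concrete – value equivalence and concatenation – here is a hypothetical Python sketch; the record shapes and the exchange rate are invented for illustration.

```python
# Hypothetical source records following the S3-style schema above.
s3_author = {"first": "Kurt", "last": "Brown"}
euro_record = {"EuroCost": 40.0}

EURO_TO_DOLLAR = 1.1  # assumed rate, for illustration only

def euro_to_dollar(euros: float) -> float:
    """Value equivalence: DollarCost = EuroToDollar(EuroCost)."""
    return euros * EURO_TO_DOLLAR

# Concatenation: S2/author/fullname = S3/author/first + S3/author/last
s2_author = {"fullname": s3_author["first"] + " " + s3_author["last"]}

print(euro_to_dollar(euro_record["EuroCost"]))  # 44.0
print(s2_author)                                # {'fullname': 'Kurt Brown'}
```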

13  Piazza’s Mapping Language
Goals:
- Build on XQuery and XML
- Remain computationally inexpensive
- Capture the common mapping types
A directional XML mapping language based on templates:
  {: $var IN document(“doc”)/path WHERE condition :} $var
- Translates between parts of data instances
- Restricted subset of XQuery that’s decidable to reason about
- Supports special annotations and object fusion

14  Mapping Example between XML Schemas
Target schema:
  pubs
    book*
      title
      author*
        name
Source schema:
  authors
    author*
      full-name
      publication*
        title
        pub-type
[diagram: correspondences drawn between source and target schema elements]

15  Example Piazza Mapping
<pubs>
  <book> {: $a IN document(“…”)/authors/author,
            $an IN $a/full-name,
            $t IN $a/publication/title,
            $typ IN $a/publication/pub-type
          WHERE $typ = “book”
          PROPERTY $t >= ‘A’ AND $t < … :}
    <title> {$t} </title>
    <author>
      <name> {$an} </name>
    </author>
  </book>
</pubs>
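To make the semantics concrete, here is a rough Python sketch – an assumption about what evaluating such a mapping does, not Piazza’s actual implementation: bind the variables over the source, filter on the WHERE condition, and emit the target template.

```python
import xml.etree.ElementTree as ET

source = ET.fromstring("""
<authors>
  <author>
    <full-name>Kurt Brown</full-name>
    <publication><title>A Sample Book</title><pub-type>book</pub-type></publication>
    <publication><title>Some Article</title><pub-type>article</pub-type></publication>
  </author>
</authors>
""")

# For each author/publication binding where pub-type = "book",
# emit a <book> element in the target (pubs) schema.
pubs = ET.Element("pubs")
for a in source.findall("author"):
    an = a.findtext("full-name")
    for p in a.findall("publication"):
        if p.findtext("pub-type") == "book":        # the WHERE clause
            book = ET.SubElement(pubs, "book")
            ET.SubElement(book, "title").text = p.findtext("title")
            author = ET.SubElement(book, "author")
            ET.SubElement(author, "name").text = an

print(ET.tostring(pubs, encoding="unicode"))
```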

16  Query Answering in Piazza
Given an XQuery over a schema, iteratively expand and translate it to capture neighbors at distance i
- Requires sophisticated reasoning to avoid cycles and redundant expansions
- See the paper for details
How does this work?
- A mapping defines constraints on pairs of source & target instances
  - Constrains the possible pairs of matched interpretations
- Easy to use a mapping in the “forward” direction: query composition with a view (or chain of views)
- Also have algorithms to rewrite a query over the source in terms of the target
  - Need to invert the mapping and compose that with the query
  - The answer set is defined by “certain” answers
  - May lose some information in the inversion
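A toy sketch of the expansion loop (my own simplification, not the Piazza algorithm): treat peers as nodes and mappings as directed edges, then breadth-first rewrite the query outward while tracking visited peers to avoid cycles. The peer names follow slide 10; the rewrite functions are purely symbolic stand-ins.

```python
from collections import deque

# Directed mapping graph: mappings[(src, dst)] rewrites a query over
# src's schema into one over dst's schema (symbolic here).
mappings = {
    ("Penn", "UW"):     lambda q: f"rewrite[Penn->UW]({q})",
    ("UW", "Stanford"): lambda q: f"rewrite[UW->Stanford]({q})",
    ("UW", "DBLP"):     lambda q: f"rewrite[UW->DBLP]({q})",
    ("DBLP", "Penn"):   lambda q: f"rewrite[DBLP->Penn]({q})",  # a cycle
}

def expand(query: str, start: str) -> dict:
    """Rewrite `query` over every peer reachable via mapping chains."""
    rewritten = {start: query}
    frontier = deque([start])
    while frontier:
        peer = frontier.popleft()
        for (src, dst), rewrite in mappings.items():
            if src == peer and dst not in rewritten:  # skip cycles/repeats
                rewritten[dst] = rewrite(rewritten[peer])
                frontier.append(dst)
    return rewritten

for peer, q in expand("Q", "Penn").items():
    print(peer, ":", q)
```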

17  Piazza Is One of Several Similar Efforts
- Peer-to-peer databases: PIER, PeerDB, Hyperion, [Bernstein et al. WebDB02], [Aberer et al. WWW03]
- RDF engines and mediators for the Semantic Web: EDUTELLA, Sesame
- Makes use of semi-automated mapping construction techniques from the database/machine learning communities: Clio, LSD, GLUE, Cupid, many others

18  Summary: Infrastructure for Decentralized Mediation
- Powerful XML mappings and transformations
- Extensible, scalable architecture, thanks to sophisticated reasoning techniques for mappings
- The model itself is peer-to-peer at a logical level – functionality that is best suited to a P2P architecture

19  Where from Here? Ongoing Work
- The Piazza effort at U. Washington continues to focus on problems relating to mappings
- Orchestra at Penn follows up with a focus on two questions:
  1. What does a true DHT-based P2P integration system look like?
     - Covers a variety of query processing stages, including mapping reformulation and query optimization, not just execution (as in PIER)
     - Where should we materialize or replicate data? (the “data placement” problem)
  2. What issues arise when we want to consider updates and synchronization at web scale?

20  Data Management and P2P
- We’ve now seen a number of approaches:
  - Information retrieval
  - Network monitoring
  - Query execution
  - Decentralized data integration
- Common themes:
  - Declarative query languages separate the logical + physical levels
  - Large amounts of data with semantic info, distributed across many sites
- Which ideas hold the most promise?
- Is data management well-suited to P2P and DHTs? Does data management need P2P?

21  Backup slides…

22  Challenges with Mappings
- Information may be lost in one direction of a mapping:
  - Name := concat(FirstName, LastName)
  - Faculty := Professors ∪ Lecturers
- Correspondences may be hard to specify precisely:
  - Bug ≈ Insect
- Data may be dirty or incomplete
- Exact mappings may be computationally expensive

23  RDF vs. XML
- RDF explicitly names relationships:
  (book, title, “ABC”) (book, writtenBy, author) (author, name, “John Smith”)
- XML does not always:
  1. <book><title>ABC</title><author><name>John Smith</name></author></book>
  2. <book><title>ABC</title><writtenBy><author><name>John Smith</name></author></writtenBy></book>
[diagram: graph with edges title, writtenBy, name linking book, author, and the literal values]
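A small illustrative sketch (my own, under the assumption that nesting encodes the relationship) of why the XML forms above are ambiguous: extracting triples from XML forces us to choose a predicate name for each parent-child edge, whereas RDF states it outright.

```python
import xml.etree.ElementTree as ET

xml_doc = ET.fromstring(
    "<book><title>ABC</title><author><name>John Smith</name></author></book>"
)

def to_triples(element, subject):
    """Naively read each child edge as a predicate named after its tag."""
    triples = []
    for child in element:
        if len(child) == 0:                  # leaf: predicate -> literal
            triples.append((subject, child.tag, child.text))
        else:                                # nested object: recurse
            triples.extend(to_triples(child, child.tag))
            triples.append((subject, child.tag, child.tag))
    return triples

# Note the guessed predicate "author" where RDF says "writtenBy".
for t in to_triples(xml_doc, "book"):
    print(t)
```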

24  RDF vs. XML 2
- RDF is subject-neutral (a graph)
- XML centers around a subject (a tree):
  1. <book><title>ABC</title><author>John Smith</author></book>
  2. <author><name>John Smith</name><book>ABC</book></author>
- This may result in duplication of contained objects

25  Mapping XML to OWL
- We can map from XML to XML; thus we can go from XML to an XML serialization of RDF
- Caveat: this doesn’t give us the full power of the KR-based Semantic Web!
  - We can only create OWL individuals that can be expressed in an XQuery-style view definition
  - To go any further, we may need to supplement these with additional OWL class definitions
- But it gets us 80% of the way there and makes the rest much easier – and it supplies mapping capabilities missing from OWL itself

26  Implementing the Semantic Web
Early emphasis on languages and tools for one (or a few) ontologies
- Very powerful solutions in OWL and its tools!
- Initial assumption: data will have to be created in RDF
Important problems remain: sharing at scale and legacy data
1. Global representations/ontologies are hard to agree on!
   - Not just due to preference: different representations are better suited to certain usage models – differences are inevitable
   - Need infrastructure that allows users to choose & query in their ontology and get results from all related (mapped) data
2. Must be able to import relevant structured data
   - Most data is in existing, non-RDF formats (XML, relations, legacy sources, etc.)

27  Impossible to Capture & Normalize All Semantics (1/2)
Even RDF/OWL regularity can’t enforce a single conceptual model:
- May use different names for the same items
- May use different levels of granularity: book vs. publication
- Metadata + data may be interchanged:
  (Car4, hasWheel, Wheel1)
  vs. (Car5, contains, Obj2), (Obj2, hasPurpose, wheel)

28  Impossible to Capture & Normalize All Semantics (2/2)
Even collections may be described differently:
1. (Person, eatsForBreakfast, Meal1) (Person, eatsForLunch, Meal2) (Person, eatsForDinner, Meal3)
2. (Person, eatsMeals, TodaysMeals) (TodaysMeals, breakfast, Meal1) (TodaysMeals, lunch, Meal2) (TodaysMeals, dinner, Meal3)
3. (Person, eatsMeals, list of Meal) (list of Meal := {Meal1, Meal2, Meal3})

29  Observations
- Even formalisms like RDF and OWL capture only part of the semantics
  - Still need some interpretation
  - (This shouldn’t be surprising, but it’s important!)
- Very hard to get many contributors to agree on the same representation or ontology
- Simple equivalences (owl:equivalentProperty, owl:equivalentClass) aren’t enough to map between different ontologies
- Need infrastructure for relating data in many different representations, at different levels of granularity!
  - This is the core strength of database techniques

30  Benefits of Piazza’s DB Heritage
- Terabytes of existing data are in XML (or easily translatable to XML)
  - Hierarchical and relational data, spreadsheets, Java objects, …
  - XML files, RDF itself!
- Sophisticated reasoning about mappings is possible by extending existing data integration work
  - Achieves schema/concept mapping at different granularities
  - Chaining of mappings, using mappings in the reverse direction, …
- Can map between data in different structures (including RDF serializations and XML)

31  Key Problem: Coordinating Efforts between Collaborators
- Today, to collaboratively edit structured data, we centralize
- For many applications, this isn’t a good model, e.g.:
  - Bioinformatics groups have multiple standard schemas and warehouses for genomic information – each group wants to incorporate the info of the others, but have it in their own format, with their own unique information preserved and the ability to override info from elsewhere
  - Different neuroscientists may have data from measuring electrical activity in the same part of the brain – they may want to share common information but maintain their specific local information; each scientist wants the ability to control when their updates are propagated
Work in progress with Nitin Khandelwal; other contributors: Murat Cakir, Charuta Joshi, Ivan Terziev

32  The Orchestra System: Infrastructure for Collaborative Data Sharing
- Each participant is a logical peer, with some XML schema that is mapped to at least one other peer’s schema
- Schemas’ contents are logically synchronized initially and then on demand
[diagram: Participants 1–3, each with its own schema (Schema 1–3), connected by mappings between XML schemas; Participant 3 publishes updates (+ XML tree A, − XML tree B), which reach Participants 1 and 2 as translated updates (+ XML tree A′, − XML tree B′ and + XML tree A″, − XML tree B″)]

33  Some Challenges in Orchestra
Logical & semantics-level:
- Mappings
  - How to express them
  - Using them to translate updates and queries
- Inconsistency
  - How to represent conflicts
  - How to resolve them
Implementation-level (P2P-based):
- Update propagation
  - Consistency with intermittent connectivity
- Scaling
  - To many updates
  - To many queries

34  Mappings
- Some peers may be replicas
- Others need mappings, expressed as “views”
  - Views: functions from one schema to another
  - Can be inverted (may lose some information)
  - Can be “chained” when there is no direct connection
  - (Much research on generating these automatically [DDH00][MB01], …)
- Prior work on propagating updates through relational views [BD82][K85][C+96]…
  - Ensuring the mapping specifies a deterministic, side-effect-free translation
  - Algorithmically applying the translation
- Ongoing work with Nitin Khandelwal:
  - Extending the model to handle (unordered) XML
  - Challenge: dealing with XML’s nesting and its repercussions
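A minimal sketch (hypothetical schemas and functions, not Orchestra’s design) of views as composable functions, including an inverse that loses information.

```python
# View from schema S3 {first, last} to schema S2 {fullname}.
def s3_to_s2(record: dict) -> dict:
    return {"fullname": record["first"] + " " + record["last"]}

# Inverting the view loses information: we must guess where the
# first name ends, so round-tripping is not guaranteed.
def s2_to_s3(record: dict) -> dict:
    first, _, last = record["fullname"].partition(" ")
    return {"first": first, "last": last}

# Chaining: if only S3->S2 and S2->S1 mappings exist, compose them.
def s2_to_s1(record: dict) -> dict:
    return {"name": record["fullname"].upper()}  # toy S1 convention

s3_record = {"first": "Mary Anne", "last": "Smith"}
s2_record = s3_to_s2(s3_record)
print(s2_to_s1(s2_record))   # chained mapping S3 -> S1
print(s2_to_s3(s2_record))   # {'first': 'Mary', 'last': 'Anne Smith'} – lossy!
```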

35  A Globally Consistent Model that Encodes Conflicts
- Even in the presence of conflicts, we want a “global state” (from the perspective of some schema) when we synchronize
  - Allows us to determine what’s agreed upon and what’s conflicting
  - Can define conflict resolution strategies
- Goal: a “union of all states” with a way of specifying conflicts
  - Define a conditional XML tree based on a subset of c-tables [IM84]
  - Each peer p_i has a boolean flag P_i representing “perspective i”
[diagram: a tree whose root has an auth child “Smith” annotated “if P_1” and an auth child “Lee” annotated “if P_2”]
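An illustrative sketch (my own rough encoding, not the paper’s formalism) of such a conditional tree: each node carries a condition over peer flags, and instantiating one peer’s perspective keeps only the nodes whose condition holds.

```python
from dataclasses import dataclass, field

@dataclass
class CNode:
    label: str
    value: str = None
    condition: frozenset = frozenset()   # peer flags under which node exists
    children: list = field(default_factory=list)

# Global state: both peers agree on the root; the author name conflicts.
root = CNode("root", condition=frozenset({"P1", "P2"}), children=[
    CNode("auth", "Smith", condition=frozenset({"P1"})),
    CNode("auth", "Lee",   condition=frozenset({"P2"})),
])

def instantiate(node, flag):
    """Project the global conditional tree onto one peer's perspective."""
    if flag not in node.condition:
        return None
    kids = [c for c in (instantiate(ch, flag) for ch in node.children) if c]
    return (node.label, node.value, kids)

print(instantiate(root, "P1"))  # sees auth = Smith
print(instantiate(root, "P2"))  # sees auth = Lee
```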

36  Propagating Updates with Intermittent Connectivity
- How do we synchronize among n peers (even assuming the same schema)?
  - Not all are connected simultaneously
- Usual approaches:
  - Locking (doesn’t scale)
  - Epidemic algorithms (only eventually consistent)
- Our approach:
  - A “shadow instance” of the schema, replicated within the other peers of the network
  - Everyone syncs with the shadow instance
  - Benefit: state is deterministic after each sync
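A toy sketch of the shadow-instance idea (an assumption about the mechanics, not the actual protocol): peers push and pull against one logically shared instance whenever they come online, so the post-sync state is deterministic.

```python
# The shadow instance: logically one copy, physically replicated
# across the network (replication is elided in this sketch).
shadow = {}

class Peer:
    def __init__(self, name):
        self.name = name
        self.local = {}
        self.pending = {}   # edits made while offline

    def edit(self, key, value):
        self.local[key] = value
        self.pending[key] = value

    def sync(self):
        """Push pending edits, then adopt the shadow's state."""
        shadow.update(self.pending)
        self.pending.clear()
        self.local = dict(shadow)   # deterministic post-sync state

a, b = Peer("a"), Peer("b")
a.edit("title", "Piazza")           # a edits while b is offline
a.sync()
b.sync()                            # b now sees a's edit
print(b.local)                      # {'title': 'Piazza'}
```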

37  Scaling, Using P2P Techniques
- Update synchronization
  - Key problem: finding values that conflict with the “shadow instance”
  - Partition the “shadow instance” across the network
- Query execution
  - Partition computation across multiple peers (PIER does this)
- Query optimization
  - Optimization breaks the query into sub-problems and uses dynamic programming to build up estimates of the costs of applying operators
  - Can recast this as recursion + memoization (see the sketch below)
  - Use the P2P overlay to distribute each recursive step
  - Memoize results at every node
  - Why is this useful? Suppose 2 peers ask the same query!
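A compact sketch of the recursion + memoization view of dynamic-programming join optimization; a local dict stands in for the DHT-distributed memo table, and the cost numbers and cost model are invented for illustration.

```python
from itertools import combinations

# Invented per-table scan costs, for illustration only.
scan_cost = {"R": 100, "S": 400, "T": 50}

def join_cost(left_cost, right_cost):
    # Toy cost model for joining two subplans.
    return left_cost + right_cost + (left_cost * right_cost) // 1000

# In the P2P setting this memo table would live in the DHT, keyed by
# the subproblem, so different peers can reuse each other's results.
memo = {}

def best_plan(tables: frozenset) -> int:
    """Cheapest cost to join `tables`, computed by recursive DP."""
    if tables in memo:                      # memo hit (possibly remote)
        return memo[tables]
    if len(tables) == 1:
        cost = scan_cost[next(iter(tables))]
    else:
        cost = min(
            join_cost(best_plan(frozenset(left)),
                      best_plan(tables - frozenset(left)))
            for size in range(1, len(tables))
            for left in combinations(tables, size)
        )
    memo[tables] = cost                     # publish result for reuse
    return cost

print(best_plan(frozenset({"R", "S", "T"})))
```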

38  Current Status
- Have a basic strategy for addressing many of the problems in collaborative data sharing
- Initial sketches of the core algorithms
  - Need to develop them further
  - … and to implement (and validate) them in a real system!

