Introduction Zachary G. Ives University of Pennsylvania CIS 700 – Internet-Scale Distributed Computing January 13, 2004
2 Welcome! To the initial version of the Penn Systems Seminar First of an ongoing series, focusing on systems research topics of general interest Format: reading and discussion (no homework or exams) Independent Study encouraged to supplement the seminar Our focus: P2P and distributed ad hoc systems
3 What Is the Vision of Peer-to-Peer Computing? Loose coupling, auto configuration: No central administration Scalability Adaptability/resiliency Nodes contribute as well as consume resources System continues as peers join and leave
4 How Does P2P Work? P2P infrastructure forms an overlay network over the real Internet, which supports: Schemes for distributing resources (data, computation) without a directory structure Unstructured: query by flooding or over advertisements Structured: query according to an algorithm that organizes the peers into a consistent structure (hash table, tree, …) Graceful handling of loss or gain of nodes Replication “where appropriate” Provides reliability/availability Improves performance (self-tuning) More on this later, from Honghui
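As a rough illustration of the "structured" case, the sketch below uses consistent hashing, one common way to organize peers into a hash-table-like structure. The slide does not say which scheme is meant, so the class `HashRing`, the helper `ring_hash`, and the peer names are hypothetical and chosen only for illustration.

```python
import bisect
import hashlib


def ring_hash(key: str, bits: int = 32) -> int:
    """Map a string onto a position of a 2**bits identifier ring (assumed scheme)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class HashRing:
    """Toy structured overlay: each key belongs to the next peer clockwise on the ring."""

    def __init__(self, peers):
        # Sorted list of (ring position, peer name) pairs.
        self._points = sorted((ring_hash(p), p) for p in peers)

    def join(self, peer: str) -> None:
        bisect.insort(self._points, (ring_hash(peer), peer))

    def leave(self, peer: str) -> None:
        self._points.remove((ring_hash(peer), peer))

    def owner(self, key: str) -> str:
        """Return the peer responsible for key, with no directory lookup involved."""
        if not self._points:
            raise LookupError("no peers in the ring")
        positions = [pos for pos, _ in self._points]
        # First peer at or after the key's position, wrapping around the ring.
        index = bisect.bisect_left(positions, ring_hash(key)) % len(self._points)
        return self._points[index][1]
```

In this sketch, when a peer leaves, only the keys it owned move to a different peer, which is one way to read "graceful handling of loss or gain of nodes".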
5 The Promise of P2P Major challenge for applications is generally scalability Traditional systems definition: Scalability of systems to numbers of requests, clients, etc. But we need “human” scalability as well: Avoid human administration, tuning, oversight, custom code Self-administering; auto-tuning Providing the “right” abstractions Human contributors often create heterogeneity among components, data, participation levels, etc. Aspects of P2P should help with all of these
6 The Central Questions: Goals of this Seminar 1. “What is the killer app for a P2P substrate?” Is there more to this P2P idea than pirating music and searching for little green men (and women)? What applications can benefit from P2P-like techniques? What are their key properties? 2. What programming models are most appropriate for building such applications? 3. How can P2P techniques be improved to better support the applications we want to build? Security, trust, reliability, consistency, …
7 Some P2P Applications Early in the semester: examining apps built over P2P overlay networks We’ll start with two projects here at Penn We’d like to talk with you if you’re interested in working or collaborating on these projects! BRIEF overviews of the issues – more detailed talks later in the semester Later: P2P games First: Orchestra – P2P meets data integration…
8 Key Problem: Coordinating Efforts between Collaborators Today, to collaboratively edit structured data, we centralize For many applications, this isn’t a good model, e.g.: Bioinformatics groups have multiple standard schemas and warehouses for genomic information – each group wants to incorporate the info of the others, but have it in their format, with their own unique information preserved, and the ability to override info from elsewhere Different neuroscientists may have data from measuring electrical activity in the same part of the brain – they may want to share common information but maintain their specific local information; each scientist wants the ability to control when their updates are propagated Work-in-progress with Nitin Khandelwal; other contributors: Murat Cakir, Charuta Joshi, Ivan Terziev
9 The Orchestra System: Infrastructure for Collaborative Data Sharing Each participant is a logical peer, with some XML schema that is mapped to at least one other peer’s schema Schemas’ contents are logically synchronized initially and then on demand [Figure: three peers (Part 1, Part 2, Part 3) with Schema 1, Schema 2, and Schema 3, connected by mappings between XML schemas. Updates: + XML tree A, - XML tree B. Translated updates from 3: + XML tree A’, - XML tree B’. Translated updates from 3: + XML tree A’’, - XML tree B’’.]
10 Some Challenges in Orchestra Mappings How to express them Using them to translate updates, queries Inconsistency How to represent conflicts How to resolve them Update propagation Consistency with intermittent connectivity Scaling To many updates To many queries [Side labels: Logical & semantics-level; Implementation-level (P2P-based)]
11 Mappings Some peers may be replicas Others need mappings, expressed as “views” Views: functions from one schema to another Can be inverted (may lose some information) Can be “chained” when there is no direct connection (Much research in generating these automatically [DDH00][MB01], …) Prior work on propagating updates through relational views [BD82][K85][C+96]… Ensuring the mapping specifies a deterministic, side-effect-free translation Algorithmically applying the translation Ongoing work with Nitin Khandelwal: Extending the model to handle (unordered) XML Challenge: dealing with XML’s nesting and its repercussions
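To make "views as functions" concrete, here is a minimal sketch over flat records rather than XML. The `View` class, the `rename` helper, and the field names are all hypothetical; the slide's actual mapping language is not shown, so this only illustrates chaining and lossy inversion.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Record = Dict[str, str]


@dataclass(frozen=True)
class View:
    """A mapping between two schemas, with a (possibly lossy) backward direction."""

    forward: Callable[[Record], Record]
    backward: Callable[[Record], Record]

    def invert(self) -> "View":
        # Swapping directions gives the inverse view; lost fields stay lost.
        return View(self.backward, self.forward)

    def then(self, other: "View") -> "View":
        # Chain two views when there is no direct mapping between the end schemas.
        return View(
            forward=lambda rec: other.forward(self.forward(rec)),
            backward=lambda rec: self.backward(other.backward(rec)),
        )


def rename(mapping: Dict[str, str]) -> View:
    """Hypothetical view that renames fields and drops any field it does not mention."""
    reverse = {new: old for old, new in mapping.items()}

    def apply(table: Dict[str, str]) -> Callable[[Record], Record]:
        return lambda rec: {table[k]: v for k, v in rec.items() if k in table}

    return View(apply(mapping), apply(reverse))


# Assumed example schemas: schema 1 -> schema 2 -> schema 3.
v12 = rename({"gene": "locus", "seq": "sequence"})
v23 = rename({"locus": "id"})
v13 = v12.then(v23)
```

Applying `v13.forward` and then `v13.invert().forward` returns only the `gene` field of the original record, which is the "may lose some information" caveat in miniature.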
12 A Globally Consistent Model that Encodes Conflicts Even in the presence of conflicts, want a “global state” (from perspective of some schema) when we synchronize Allows us to determine what’s agreed-upon, what’s conflicting Can define conflict resolution strategies Goal: “union of all states” with a way of specifying conflicts Define conditional XML tree based on a subset of c-tables [IM84] Each peer p_i has a boolean flag P_i representing “perspective i” [Figure: conditional XML tree with nodes root, auth, Smith, and Lee; branches labeled “If P_1” and “If P_2”]
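A conditional tree of this kind can be sketched as follows. This is a simplification: real c-tables allow general boolean conditions, while the hypothetical `CNode` below only records the set of perspective flags under which a node exists. The example tree is one reading of the slide's figure (Smith under P1, Lee under P2), which is an assumption.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class CNode:
    """Tree node that exists only under some perspectives (None means all)."""

    label: str
    visible_to: Optional[FrozenSet[str]] = None
    children: List["CNode"] = field(default_factory=list)


def project(node: CNode, flag: str):
    """Return the ordinary tree seen from one perspective, as nested tuples."""
    if node.visible_to is not None and flag not in node.visible_to:
        return None
    kids = [p for p in (project(c, flag) for c in node.children) if p is not None]
    return (node.label, kids)


def conditional_labels(node: CNode) -> List[str]:
    """List the nodes that are not agreed upon by every perspective."""
    found = [node.label] if node.visible_to is not None else []
    for child in node.children:
        found.extend(conditional_labels(child))
    return found


# Assumed reading of the slide's figure: peers 1 and 2 disagree on the author.
example = CNode("root", children=[
    CNode("auth", children=[
        CNode("Smith", frozenset({"P1"})),
        CNode("Lee", frozenset({"P2"})),
    ]),
])
```

The single tree holds the "union of all states"; `project` recovers one peer's state, and `conditional_labels` points at the places where a conflict resolution strategy would have to decide.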
13 Propagating Updates with Intermittent Connectivity How to synchronize among n peers (even assuming the same schema)? Not all are connected simultaneously Usual approaches: Locking (doesn’t scale) Epidemic algorithms (only eventually consistent) Approach: “Shadow instance” of the schema, replicated within the other peers of the network Everyone syncs with the shadow instance Benefits: state is deterministic after each sync
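The idea of everyone syncing against one shadow instance might look like the sketch below. The slide gives no protocol details, so the per-key version check used to detect conflicts, and the `ShadowInstance` and `Peer` classes themselves, are assumptions made purely for illustration. Replication of the shadow instance across peers is not modeled.

```python
class Peer:
    """A participant that edits locally while disconnected (hypothetical)."""

    def __init__(self):
        self.values = {}    # local copy: key -> value
        self.versions = {}  # version of each key as of the last sync
        self.pending = {}   # key -> (base version, new value)

    def write(self, key, value):
        # Remember which shadow version this edit was based on.
        base = self.pending[key][0] if key in self.pending else self.versions.get(key, 0)
        self.pending[key] = (base, value)
        self.values[key] = value


class ShadowInstance:
    """Single logical copy of the data that every peer synchronizes with."""

    def __init__(self):
        self.values = {}
        self.versions = {}

    def sync(self, peer):
        """Apply the peer's pending edits, then refresh the peer; return conflicting keys."""
        conflicts = []
        for key, (base, value) in sorted(peer.pending.items()):
            if self.versions.get(key, 0) == base:
                self.values[key] = value
                self.versions[key] = base + 1
            else:
                # Someone else changed the key since this peer last synced.
                conflicts.append(key)
        peer.pending.clear()
        peer.values = dict(self.values)
        peer.versions = dict(self.versions)
        return conflicts
```

After each call to `sync`, the peer's state equals the shadow state, so the outcome depends only on the order of syncs. Conflicting edits are reported rather than silently merged; how to resolve them is left open here.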
14 Scaling, Using P2P Techniques Update synchronization Key problem: find values conflicting with “shadow instance” Partition the “shadow instance” across the network Query execution Partition computation across multiple peers (PIER does this) Query optimization Optimization breaks the query into sub-problems, uses dynamic programming to build up estimates of the costs of applying operators Can recast as recursion + memoization Use P2P overlay to distribute each recursive step Memoize results at every node Why is this useful? Suppose 2 peers ask the same query!
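The recasting of dynamic-programming optimization as recursion plus memoization can be sketched on a single machine as below. The cost model (sum of intermediate result sizes), the function `make_optimizer`, and the relation names are assumptions for illustration; distributing each recursive step over a P2P overlay is not shown.

```python
from functools import lru_cache
from itertools import combinations
from typing import Dict, FrozenSet, Tuple


def make_optimizer(sizes: Dict[str, float], selectivity: Dict[Tuple[str, str], float]):
    """Build a memoized join-order optimizer.

    selectivity is keyed by alphabetically sorted relation pairs (an assumed convention).
    """

    def cardinality(rels: FrozenSet[str]) -> float:
        rows = 1.0
        for r in rels:
            rows *= sizes[r]
        for pair in combinations(sorted(rels), 2):
            rows *= selectivity.get(pair, 1.0)
        return rows

    @lru_cache(maxsize=None)  # the memo table: one entry per sub-problem
    def best_plan(rels: FrozenSet[str]):
        if len(rels) == 1:
            (name,) = rels
            return 0.0, name
        members = sorted(rels)
        anchor, rest = members[0], members[1:]
        best = None
        # Enumerate splits; keeping anchor on the left avoids mirrored duplicates.
        for k in range(len(rest)):
            for extra in combinations(rest, k):
                left = frozenset((anchor,) + extra)
                right = rels - left
                left_cost, left_plan = best_plan(left)
                right_cost, right_plan = best_plan(right)
                cost = left_cost + right_cost + cardinality(rels)
                if best is None or cost < best[0]:
                    best = (cost, (left_plan, right_plan))
        return best

    return best_plan


optimize = make_optimizer(
    sizes={"A": 1000, "B": 10, "C": 100},
    selectivity={("A", "B"): 0.001, ("B", "C"): 0.1},
)
cost, plan = optimize(frozenset({"A", "B", "C"}))
```

Because results are cached by the set of relations, a second request for the same query (or any sub-query) is answered from the memo table. In a P2P setting, one could imagine hashing that key to pick the peer that stores the entry, which is why two peers asking the same query can share work.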
15 Current Status Have a basic strategy for addressing many of the problems in collaborative data sharing Initial sketches of the core algorithms Need to develop them further … And to implement (and validate) them in a real system!