Peer-To-Peer Data Management Hector Garcia-Molina ICDE Conference, February 28, 2002
What is P2P? pastry, can, jxta, fiorana, napster, freenet, united devices, open cola, aim, ocean store, netmeeting, gnutella, farsite, icq, morpheus, ebay, limewire, bearshare, seti@home, uddi, grove, jabber, popular power, kazaa, folding@home, tapestry, process tree, mojo nation, chord
Napster join query answer get file central index ...
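The Napster flow above (join, query, answer, get file) can be sketched as a central index mapping file names to the peers that hold them. This is a minimal illustration, not Napster's actual protocol; the names `CentralIndex`, `join`, `leave`, and `query` are hypothetical.

```python
from collections import defaultdict


class CentralIndex:
    """Toy central directory: file name -> set of peers sharing it."""

    def __init__(self):
        self._holders = defaultdict(set)

    def join(self, peer, files):
        # A joining peer announces the files it is willing to share.
        for name in files:
            self._holders[name].add(peer)

    def leave(self, peer):
        # Drop every entry that points at a departing peer.
        for peers in self._holders.values():
            peers.discard(peer)

    def query(self, name):
        # The answer is a list of peers. The file itself is then fetched
        # directly from one of them, never from the index.
        return sorted(self._holders.get(name, ()))


index = CentralIndex()
index.join("peer1", ["song.mp3", "talk.pdf"])
index.join("peer2", ["song.mp3"])
print(index.query("song.mp3"))
```

Only the lookup is centralized in this sketch; the transfer stays between peers, which is why the index is a single point of failure for search but not for storage.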
Gnutella query
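Gnutella-style search, with no central index, can be sketched as a flood with a hop limit. The sketch below is an assumption-laden toy: the overlay is a dictionary, `ttl` is the hop limit, and `flood_query` is a hypothetical name rather than part of any Gnutella implementation.

```python
from collections import deque


def flood_query(graph, content, start, keyword, ttl):
    """Breadth-first flood from `start`; returns (hits, messages_sent)."""
    seen = {start}
    frontier = deque([(start, ttl)])
    hits, messages = [], 0
    while frontier:
        node, remaining = frontier.popleft()
        if keyword in content.get(node, ()):
            hits.append(node)
        if remaining == 0:
            continue  # hop limit reached, do not forward further
        for neighbor in graph[node]:
            messages += 1  # every forwarded copy costs one message
            if neighbor not in seen:  # duplicate copies are dropped on arrival
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return hits, messages


# Hypothetical four-node overlay; only node "d" holds item "x".
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
content = {"d": {"x"}}
print(flood_query(graph, content, "a", "x", ttl=2))
```

Even in this tiny overlay, six messages are sent to find one result, and a hop limit of 1 would miss it entirely. That trade-off between message cost and coverage is what makes flooding expensive.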
Morpheus ... ... ... ... super peer ... ...
Seti@Home satellite dish raw data chunk analyzed data central site ...
Lockss D3 D1 library D library A D2 library C library B library E
PeerCast Stanford source after: before: Stanford source
What is a P2P System? Multiple sites (at edge) Distributed resources Sites are autonomous (different owners) Sites are both clients and servers Sites have equal functionality P2P Purity
P2P is BAD IDEA!! Distribution is expensive! Specialized functionality is good!
Example: Distributed Data Management Distribution is expensive If you must distribute: build centralized directory, index use backups for reliability for replicated data, use primary copy
Computational Efficiency is NOT Main Goal Main driving force in a P2P system: exploiting existing (often free) resources sharing costs among many legal protection autonomy anonymity
Should We Do P2P Research? Should we help people break the law? Analogy: Should we develop pillows, knives, hammers, drugs, bath tubs, cars, airplanes, ... ??
Should We Do P2P Research? YES: P2P not exclusively for breaking law Remember the VCR YES: P2P can liberate us from culture “plantation owners” (Lessig)
Is “Free Culture” Feasible? Example: Legal texts Can we afford it? economic activity rules of the game today
Should DB community work on P2P? YES
P2P Challenges Easier to list NON-Research-Topics: Color schemes for P2P Nodes Impact of P2P on Moroccan 15th Century Literature
P2P Challenges Search Resource Management Security & Privacy
Search Taxonomy lookup freenet can partial replicated SP content queries search gnutella morpheus napster routing single site regional global scope of index
Index Implementation Taxonomy routing replicated SP freenet yes gnutella morpheus index location correlated with content location partial no napster can centralized distributed P2P nature of index
Content Addressable Network (CAN) Nodes 1 Data 2
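As a rough illustration of the CAN idea, the sketch below hashes a key to a point in a coordinate space and looks up the node whose zone contains that point. It assumes a two-dimensional unit square and four fixed zones. `key_to_point`, `owner`, and the zone layout are hypothetical, and real CAN routing between neighboring zones is not shown.

```python
import hashlib


def key_to_point(key):
    """Hash a key to a point in the unit square [0, 1) x [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Keep 53 bits per coordinate so the division is exact and stays below 1.
    x = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    y = (int.from_bytes(digest[8:16], "big") >> 11) / 2**53
    return x, y


def owner(zones, point):
    """Return the node whose zone (x0, y0, x1, y1) contains the point."""
    x, y = point
    for node, (x0, y0, x1, y1) in zones.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return node
    raise LookupError("zones do not cover the point")


# Four hypothetical nodes, each owning one quadrant of the space.
zones = {
    "n1": (0.0, 0.0, 0.5, 0.5),
    "n2": (0.5, 0.0, 1.0, 0.5),
    "n3": (0.0, 0.5, 0.5, 1.0),
    "n4": (0.5, 0.5, 1.0, 1.0),
}
print(owner(zones, key_to_point("song.mp3")))
```

Because the hash alone decides where a key lives, any node can compute the target point without consulting a directory.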
Can We Improve Flooding? routing replicated SP freenet yes gnutella morpheus index location correlated with content location partial no napster can centralized distributed P2P nature of index
Directed BFS in Gnutella Heuristics for Selecting Direction: >RES: Returned most results; <TIME: Shortest satisfaction time; <HOPS: Min hops for results; >MSG: Sent us most messages (all types); <QLEN: Shortest queue; <LAT: Shortest latency; >DEG: Highest degree
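The heuristics above can be read as "pick the neighbor that maximizes or minimizes one locally observed statistic". A minimal sketch follows, assuming each node keeps a small record per neighbor. The field names and `pick_neighbor` are hypothetical and not taken from the talk.

```python
# Slide notation: ">" prefers the largest value, "<" the smallest.
HEURISTICS = {
    ">RES": ("results", max),
    "<TIME": ("satisfaction_time", min),
    "<HOPS": ("hops", min),
    ">MSG": ("messages", max),
    "<QLEN": ("queue_length", min),
    "<LAT": ("latency", min),
    ">DEG": ("degree", max),
}


def pick_neighbor(stats, heuristic):
    """Return the neighbor that is best under the named heuristic."""
    field, choose = HEURISTICS[heuristic]
    return choose(stats, key=lambda neighbor: stats[neighbor][field])


# Hypothetical statistics gathered from past queries.
stats = {
    "n1": {"results": 12, "satisfaction_time": 4.0, "hops": 3,
           "messages": 90, "queue_length": 5, "latency": 0.20, "degree": 4},
    "n2": {"results": 7, "satisfaction_time": 2.5, "hops": 2,
           "messages": 140, "queue_length": 2, "latency": 0.05, "degree": 9},
}
print(pick_neighbor(stats, ">RES"), pick_neighbor(stats, "<LAT"))
```

A node would forward the query only to the selected neighbor (or a small subset) instead of to all of them, trading some coverage for far fewer messages.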
How Does One Evaluate? Live Gnutella? Use real Gnutella as “laboratory”
Time to Satisfaction for Directed BFS
Routing Index (figure: nodes A, B, C, D; query Q(DB); per-node routing-index tables with columns AI and DB)
Types of Routing Indexes Compound Hop Count Exponential Decay Strategies for Cycles Ignore (for Hop-Count, exponential) Avoid Update Cycles Detect Update Cycles and Recover
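One way to see why hop counts matter is to score each neighbor by the matching documents reachable through it, discounted by distance. The sketch below is an assumption, not the talk's exact formula: it divides the count at each hop by a fan-out factor raised to the hop distance, in the spirit of the hop-count and exponential-decay variants. `hop_count_goodness`, `best_neighbor`, and the sample numbers are hypothetical.

```python
def hop_count_goodness(counts_by_hop, fanout):
    """Score a neighbor from per-hop counts of matching documents.

    counts_by_hop[0] is the count one hop away, counts_by_hop[1] two hops
    away, and so on. Dividing by fanout**i discounts documents that cost
    more messages to reach.
    """
    return sum(count / fanout**i for i, count in enumerate(counts_by_hop))


def best_neighbor(routing_index, fanout):
    """Pick the neighbor with the highest discounted score."""
    return max(
        routing_index,
        key=lambda n: hop_count_goodness(routing_index[n], fanout),
    )


# Hypothetical index at one node: neighbor -> matching documents per hop.
routing_index = {"B": [10, 40], "C": [30, 4]}
print(best_neighbor(routing_index, fanout=4))
```

With a fan-out of 4, B scores 10 + 40/4 = 20 and C scores 30 + 4/4 = 31, so C is preferred even though B leads to more documents in total (50 versus 34). An exponential-decay index would store only the discounted sum per neighbor rather than the per-hop counts.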
Effect of Index Compression
Effect of Network Topology
Resource Management Resources: storage (lockss), CPU processing (seti@home), bandwidth (PeerCast) Issues: fairness, load balancing
A1 B1 C1 A2 B2 C2 B1 A1 B2 A2 Example: Data Trading site 1 site 2 trade B2 A2 trade
A1 B1 C1 A2 B2 C2 B1 A1 C1 A2 C2 B2 Example: Data Trading site 1 trade C1 A2 trade C2 B2 trade
Data Trading Order of trades impacts reliability Issues: Swaps vs. Deeds Fixed price vs. bids Preference to sites with a lot of space? reliable sites? “desperate” sites?
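A swap-style trade can be sketched as two sites each agreeing to host one copy of the other's collection, subject to free space. The model below is a deliberately simplified assumption: every collection has unit size, and `Site`, `free_space`, and `swap` are hypothetical names.

```python
class Site:
    """Toy archive site with a fixed number of unit-size storage slots."""

    def __init__(self, name, capacity, own):
        self.name = name
        self.capacity = capacity
        self.own = set(own)      # collections this site originated
        self.hosted = set()      # copies held on behalf of other sites

    def free_space(self):
        return self.capacity - len(self.own) - len(self.hosted)


def swap(a, item_a, b, item_b):
    """Site b hosts a copy of a's item_a in exchange for a hosting item_b."""
    if item_a not in a.own or item_b not in b.own:
        raise ValueError("each site can only trade its own collections")
    if a.free_space() < 1 or b.free_space() < 1:
        return False  # at least one side has no room left
    a.hosted.add(item_b)
    b.hosted.add(item_a)
    return True


site1 = Site("site 1", capacity=4, own=["A1", "B1", "C1"])
site2 = Site("site 2", capacity=4, own=["A2", "B2", "C2"])
print(swap(site1, "A1", site2, "A2"), swap(site1, "B1", site2, "B2"))
```

With one spare slot per site, the first swap succeeds and the second fails, so whichever collection is traded first is the only one that gets replicated. This is one concrete way in which the order of trades affects reliability.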
Effect of Bid Policies bid more (ask more in return) when I have less free space bid more (ask more in return) when I have more free space
Effect of One Maverick Site always bids high
Security & Privacy Issues: Anonymity Reputation Accountability Information Preservation Information Quality Trust Denial of service attacks
Information Preservation Example Policy: make 3 copies of documents A1 make copies What can go wrong?
A1 A1 A’1 What Can Go Wrong? “Bad” sites make copies “Bad” site alters copy “Bad” site publishes fake “Bad” site makes many copies of other docs ... A1 A1 make copies A’1
Conclusion P2P systems popular today P2P systems vulnerable and inefficient Many challenges ahead Search Resource Management Security and Privacy
For Additional Information Google: “Stanford Peers” http://www-db.stanford.edu/peers/