Published by Posy Cooper. Modified over 9 years ago.
1
ICS 214B: Transaction Processing and Distributed Data Management
Lecture 18: Data Management in Peer-to-Peer Systems
Professor Chen Li
Based on slides developed by Beverley Yang and Hector Garcia-Molina
2
ICS214B Notes 18
What is P2P?
napster, gnutella, morpheus, kazaa, bearshare, seti@home, folding@home, ebay, limewire, icq, fiorana, mojo nation, jxta, united devices, open cola, uddi, process tree, can, chord, ocean store, farsite, pastry, tapestry, ?, grove, netmeeting, freenet, popular power, aim, jabber
3
Napster
[Diagram labels: central server, join, query, response, get file...]
4
Gnutella
[Diagram label: query]
5
PeerCast
[Diagram: "before" and "after" views, each labeled with UCI and source]
6
What is Peer-to-Peer?
Definition: Nodes of equal roles exchanging information and services directly
Is this a new idea?
– IP routing (1970’s)
– Mariposa (1980’s)
– Distributed Databases!
What are people really thinking?
7
Implicit Definition of P2P
Scale: millions (billions?) of peers
Nature of peers: PCs
Application: lightweight semantics (e.g., file-sharing)
8
P2P vs. Distributed DBMS
Traditional DDBMS issues:
– Transactions
– Network partitions
– Distributed query optimization
– Interoperation of heterogeneous data sources
– Reliability/failure of nodes
Complex features do not scale
9
P2P vs. Distributed DBMS
Example application: file-sharing
Simple data model and query language
– No complex query optimization
– Easy interoperation
No guarantee on quality of results
– Individual site availability unimportant
Local updates
– No transactions
– Network partitions OK
Simple
Amenable to large-scale network of PCs
10
Potential Benefits
Efficiency: harnessing unused resources
Self-organizing
Effectively sharing cost of ownership
Robustness and availability through replication
Anonymity/legal protection
11
Challenges
No authority to enforce behavior
Cooperation
Unreliability of individual peers
Efficiency of distributed operations (absolute resources)
12
Research Areas
Resource Management
Security
Efficient Search
13
Resource Management
Resource:
– Storage/information
– CPU processing
– Bandwidth
Issues:
– Fairness
– Load balancing
14
Example: Data Trading
[Diagram: three sites (site 1, site 2, site 3) holding items A1, B1, C1, A2, B2, C2; the trade step involves B1, A1, B2, and A2]
15
Example: Data Trading
[Diagram: three sites (site 1, site 2, site 3) holding items A1, B1, C1, A2, B2, C2; the trade step involves B1, A1, C1, A2, C2, and B2]
16
Data Trading
Order of trades impacts availability
Issues:
– Swaps vs. deeds
– Fixed price vs. bids
– Preference to sites with a lot of space? Reliable sites? “Desperate” sites?
17
Security
Issues:
– Reputation
– Trust
– Accountability
– Information preservation
– Information quality
– Denial of service attacks
Problem: Detecting and punishing bad behavior
18
Information Preservation
Example policy: make 3 copies of documents
[Diagram: document A1, labeled “make copies”]
What can go wrong?
19
What Can Go Wrong?
“Bad” site deletes copies
“Bad” site alters copy
“Bad” site publishes fake
“Bad” site makes many copies at other sites
...
[Diagram: document A1, labeled “make copies”, with an altered copy A’1 alongside A1]
20
Reputation Systems
Peers evaluate each other
Good reviews -> Good reputation
Bad reviews -> Bad reputation
No reviews -> ?
Problems:
– Trustworthiness of reviews
– Permanence of identity
21
Efficiency of Search
Problem: finding needle in haystack
Efficiency measured in terms of absolute resources consumed
22
Architecture
Hybrid
– Centralized index, P2P file storage and transfer
Super-peer
– A “pure” network of “hybrid” clusters
Pure
– Functionality completely distributed
23
Goal
Develop search techniques for “loose” systems that are:
– Efficient
– Simple (easy to implement, no hidden costs)
– Realistically and thoroughly evaluated
24
Current Techniques: Gnutella
Breadth-First Search (BFS)
[Diagram legend: source, forward query, processed query, found result, forward response]
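The breadth-first flooding on this slide can be sketched as a depth-limited traversal that counts every forwarded query, including redundant ones that reach a node which has already seen the query. This is a minimal sketch, not Gnutella's actual implementation: the adjacency-dict graph, the `has_result` callback, and the name `bfs_flood` are assumptions made for illustration.

```python
from collections import deque


def bfs_flood(graph, source, has_result, max_depth):
    """Flood a query breadth-first, up to max_depth hops from source.

    graph: dict mapping each node to a list of neighbors (hypothetical model).
    has_result: callable(node) -> bool, True if the node holds a match.
    Returns (nodes that processed the query, nodes with results, messages sent).
    """
    seen = {source}
    frontier = deque([(source, 0, None)])  # (node, depth, node it came from)
    hits = []
    messages = 0
    while frontier:
        node, depth, parent = frontier.popleft()
        if node != source and has_result(node):
            hits.append(node)
        if depth == max_depth:
            continue  # depth limit reached: process but do not forward
        for neighbor in graph[node]:
            if neighbor == parent:
                continue  # do not send the query back where it came from
            messages += 1  # a forward costs a message even if it is redundant
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1, node))
    return seen - {source}, hits, messages
```

Raising `max_depth` by one hop can add both new results and extra redundant messages, which is the cost that the later techniques try to reduce.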
25
Metrics
Cost (aggregate)
– Bandwidth
– Processing power
Quality of results
– Number of results
– Satisfaction (true if # results >= X, false otherwise)
– Time to satisfaction
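The two satisfaction metrics can be written down directly. The sketch below assumes that result arrival times are available as a list of numbers; that input format and both function names are hypothetical.

```python
def is_satisfied(num_results, x):
    """Satisfaction: true if the number of results is at least x."""
    return num_results >= x


def time_to_satisfaction(arrival_times, x):
    """Arrival time of the x-th result, or None if fewer than x results arrive.

    arrival_times: time at which each result was received (assumed input).
    """
    if x <= 0:
        return 0.0  # nothing is required, so the query is satisfied at once
    ordered = sorted(arrival_times)
    if len(ordered) < x:
        return None
    return ordered[x - 1]
```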
26
Iterative Deepening
Interested in satisfaction, not # of results
BFS returns “too many” results, which is expensive
Iterative deepening: common technique to reduce the cost of BFS
– Intuition: a search at a small depth is much cheaper than at a larger depth
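A minimal sketch of the idea: try a sequence of increasing depths and stop as soon as the query is satisfied. The `policy` list of depths, the threshold `x`, and the function names are assumptions for illustration. For clarity the sketch restarts the depth-limited search at each iteration; it does not model how a deployed system would avoid reprocessing nodes already visited.

```python
from collections import deque


def results_within(graph, source, has_result, max_depth):
    """Nodes holding a result that lie within max_depth hops of source."""
    seen = {source}
    frontier = deque([(source, 0)])
    hits = []
    while frontier:
        node, depth = frontier.popleft()
        if node != source and has_result(node):
            hits.append(node)
        if depth < max_depth:
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return hits


def iterative_deepening(graph, source, has_result, policy, x):
    """Search at each depth in policy, in order, until at least x results arrive.

    Returns (depth at which the query was satisfied or None, results found).
    """
    hits = []
    for depth in policy:
        hits = results_within(graph, source, has_result, depth)
        if len(hits) >= x:
            return depth, hits  # satisfied: deeper, costlier searches are skipped
    return None, hits
```

With a policy such as `[1, 2, 4]`, a query that can be satisfied by nearby peers never pays for the depth-4 search.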
27
Iterative Deepening
[Diagram legend: source, forward query, processed query, found result, forward response]
28
Directed BFS
Sends query to a subset of neighbors
Maintains statistics on neighbors
– E.g., ping latency, history of number of results
Chooses subset intelligently (via heuristics), to maximize quality of results
– E.g., neighbors with shortest message queue, since a long message queue implies the neighbor is saturated/dead
29
Directed BFS
[Diagram legend: source, forward query, processed query, found result, forward response]
30
Directed BFS: Heuristics
RAND: (Random)
RES: Returned greatest # results in past
TIME: Had shortest avg. time to satisfaction in past
HOPS: Had smallest avg. # hops for response messages in past
MSG: Sent our client greatest # of messages
QLEN: Shortest message queue
DEG: Highest degree
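Each heuristic above amounts to ranking neighbors by one statistic and forwarding the query only to the top few. The sketch below shows one way to express that; the `NeighborStats` record, its field names, and the parameter `k` (how many neighbors to pick) are hypothetical and not taken from the slides.

```python
import random
from dataclasses import dataclass


@dataclass
class NeighborStats:
    """Per-neighbor statistics a node might keep (hypothetical field names)."""
    name: str
    past_results: int                 # used by RES
    avg_time_to_satisfaction: float   # used by TIME
    avg_response_hops: float          # used by HOPS
    messages_received: int            # used by MSG
    queue_length: int                 # used by QLEN
    degree: int                       # used by DEG


# Each heuristic is a sort key; smaller key means more preferred,
# so "greatest/highest" statistics are negated.
HEURISTICS = {
    "RES": lambda s: -s.past_results,
    "TIME": lambda s: s.avg_time_to_satisfaction,
    "HOPS": lambda s: s.avg_response_hops,
    "MSG": lambda s: -s.messages_received,
    "QLEN": lambda s: s.queue_length,
    "DEG": lambda s: -s.degree,
}


def choose_neighbors(stats, heuristic, k=1, rng=random):
    """Pick the k neighbors to forward a query to under the given heuristic."""
    stats = list(stats)
    k = min(k, len(stats))
    if heuristic == "RAND":
        return rng.sample(stats, k)
    return sorted(stats, key=HEURISTICS[heuristic])[:k]
```

Keeping the heuristics in a table makes it straightforward to compare them on the same neighbor statistics.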
31
Local Indices
Each node maintains index over other nodes’ collections
– r is the radius of the index
– Index covers all nodes within r hops
Can process query at fewer nodes, but get just as many results back
[Diagram label: r]
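A local index of radius r can be sketched as a map from items to the peers within r hops that hold them, so that a single node can answer on behalf of its neighborhood. The adjacency-dict graph, the `holdings` dict of per-peer items, and the function names are assumptions; the sketch also assumes the index includes the node's own collection.

```python
from collections import deque


def nodes_within(graph, start, radius):
    """All nodes at most radius hops from start, including start itself."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand beyond the index radius
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return set(dist)


def build_local_index(graph, holdings, node, radius):
    """Index mapping each item to the peers within radius hops that hold it."""
    index = {}
    for peer in nodes_within(graph, node, radius):
        for item in holdings.get(peer, ()):
            index.setdefault(item, set()).add(peer)
    return index


def lookup(index, item):
    """Peers known to hold item; empty set if the index has no entry."""
    return index.get(item, set())
```

With r=1, a query processed at one node also covers that node's direct neighbors, so those neighbors do not need to process the query themselves.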
32
Local Indices (r=1)
[Diagram legend: source, forward query, processed query, found result, forward response]
33
Evaluation
Goal: realistic evaluation of techniques
Cannot directly evaluate techniques in a real environment
Simulation of large-scale distributed systems is hard
Use Gnutella as a “laboratory” for gathering data
Use analysis driven by query “traces” to project cost
34
Passive Observation
1. Statistics
– Size of collection
– % redundant messages
2. Sample queries (Q_rep)
[Diagram labels: Gnutella Network, Pong, Query]
35
Gathering Data
[Diagram labels: Query (Q_rep), Ping; recorded fields: # hops traveled, IP address, timestamp, individual result records]
36
Gathering Data
For each query Q:
L(Q): Length of query string
M(Q,n): # response messages from n hops away
R(Q,n): # results from n hops away
S(Q,n,Z): True if >= Z results received from n hops away
T(Q,Z,W,P): Time to satisfaction
N(Q,n): # nodes n hops away
C(Q,n): # redundant edges n hops away
37
Example: Trace-driven Cost Projection
[Diagram legend: source, forward query, processed query, found result, forward response]
38
Example: Calculating Message Size
Use the Gnutella protocol, trace data
E.g., Query message consists of:
– Gnutella header (22 B)
– Options field (2 B)
– Query string (L(Q))
– TCP/IP and Ethernet headers (58 B)
Total size of Query message for query Q: 82 + L(Q) bytes
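The size formula is a sum of three fixed fields (22 + 2 + 58 = 82 bytes) plus the query string length. A small sketch, using the byte counts from this slide and assuming that L(Q) is the length of the query string encoded as UTF-8 (the encoding is an assumption, as are the constant and function names):

```python
# Field sizes in bytes, as listed on the slide.
GNUTELLA_HEADER_BYTES = 22
OPTIONS_FIELD_BYTES = 2
TCPIP_ETHERNET_HEADER_BYTES = 58


def query_message_size(query_string):
    """Total size in bytes of a Query message: 82 + L(Q)."""
    fixed = (GNUTELLA_HEADER_BYTES
             + OPTIONS_FIELD_BYTES
             + TCPIP_ETHERNET_HEADER_BYTES)
    # Assumption: L(Q) is the byte length of the UTF-8 encoded query string.
    return fixed + len(query_string.encode("utf-8"))
```

For example, a 10-character ASCII query such as `"free music"` gives 82 + 10 = 92 bytes.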
39
Calculating Cost
We know the sizes of each type of message
We know # messages sent, for each type of message, for query Q
Put together: aggregate bandwidth for Q
Similar process to compute aggregate processing power
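Putting the two known quantities together is a weighted sum: for each message type, size times count. The sketch below assumes both are supplied as dicts keyed by message type; the type names and the function name are illustrative only.

```python
def aggregate_bandwidth(message_sizes, message_counts):
    """Aggregate bandwidth in bytes for one query.

    message_sizes: bytes per message, keyed by message type (assumed input).
    message_counts: number of messages sent, keyed by message type.
    """
    return sum(message_sizes[kind] * count
               for kind, count in message_counts.items())
```

The same shape of computation applies to aggregate processing power, with a per-message processing cost in place of the message size.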
40
Overall Comparison
[Chart comparing BFS (B), Iterative Deepening (I; d=5, W=6), Directed BFS (D; >RES), and Local Indices (L; r=1). Axis labels: Time to Satisfy, Prob. of Satisfying, # results, Bandwidth Cost]
41
Summary: Efficient Search
What we’ve done:
– Proposed techniques to improve performance
  Kept simple
– Evaluated techniques using extensive real data
– Improved performance, with tradeoffs
Open issues:
– More efficient!
– Make intelligent use of topology, replication
– Take advantage of heterogeneity (e.g., super-peers)