Comparing Hybrid Peer-to-Peer Systems
Beverly Yang and Hector Garcia-Molina
Presented by Marco Barreno
November 3, 2003
CS 294-4: Peer-to-peer systems

Hybrid peer-to-peer systems
  Pure peer-to-peer systems are hard to scale
    Example: Gnutella
  Look at hybrids between p2p and client-server
    Servers index files; clients download from each other directly
    Searching can be done more efficiently on a server
    Example: Napster (but Napster had its own problems...)
  Several other architectures

Questions for hybrid systems
  Best way to organize servers?
  Index replication policy?
  What queries are submitted often?
  How do we deal with churn?
  How do query patterns affect performance?

Contributions of this paper
  Presents several architectures for hybrid systems
  Presents and evaluates a probabilistic model for queries
  Compares architectures quantitatively, based on their models and the music-sharing domain
  Compares strategies in non-music-sharing domains (a bit)

General concepts: basic actions
  Login
    A client connects to a server and uploads metadata about the files it offers
    It is a local user to that server, a remote user to others
  Query
    A list of words to search on
    Satisfied if a preset maximum number of results is found
  Download
    Contact peer directly after getting info from server

Goal
  The goal of this study is to maximize UsersPerServer
  What do you think of this goal?

Batch vs. incremental logins
  Batch: on login/logout, the user's entire metadata set is added/removed
    Allows index to remain small, but login/logout is expensive
  Incremental: metadata kept in index at all times, and only deltas are sent at login
    Saves much effort on login/logout
    Queries become more expensive, as server must filter for online users
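The trade-off between the two login policies can be sketched with two toy in-memory indexes. This is only an illustration: the class names, method names, and the word-matching rule are hypothetical and are not taken from the paper or from any Napster/OpenNap code.

```python
from collections import defaultdict


class BatchIndex:
    """Hypothetical batch-login index: holds metadata for online users only."""

    def __init__(self):
        self.files = defaultdict(set)  # user -> set of file titles

    def login(self, user, all_files):
        # The whole metadata set is uploaded on every login (expensive).
        self.files[user] = set(all_files)

    def logout(self, user):
        # The whole metadata set is dropped, so the index stays small.
        self.files.pop(user, None)

    def query(self, word):
        # Everything in the index belongs to an online user: no filtering.
        return sorted((u, f) for u, fs in self.files.items()
                      for f in fs if word in f.lower().split())


class IncrementalIndex:
    """Hypothetical incremental-login index: metadata persists across sessions."""

    def __init__(self):
        self.files = defaultdict(set)  # user -> set of file titles
        self.online = set()

    def login(self, user, added=(), removed=()):
        # Only the deltas since the last session are sent (cheap).
        self.files[user] |= set(added)
        self.files[user] -= set(removed)
        self.online.add(user)

    def logout(self, user):
        # Metadata stays in the index; only the online flag changes.
        self.online.discard(user)

    def query(self, word):
        # The server must filter out offline users on every query.
        return sorted((u, f) for u, fs in self.files.items() if u in self.online
                      for f in fs if word in f.lower().split())
```

In the incremental sketch the extra `u in self.online` test on every query is the filtering cost mentioned on the slide, while the batch sketch pays instead by re-uploading `all_files` at each login.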

Architectures (1)
  Chained architecture
    Servers are arranged in a linear chain (ring?)
    Each server keeps metadata for local users
    Unsatisfied queries are sent along the chain
    Logins and downloads scalable; queries potentially expensive
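A minimal sketch of chained query forwarding, assuming the chain wraps around like a ring (the slide itself leaves this open) and that each server's local index is just a list of file names. The function name and arguments are hypothetical.

```python
def chained_query(servers, start, matches, max_results):
    """Forward a query along the chain until enough results are found.

    servers     -- list of local indexes, one list of file names per server
    start       -- position of the server the user is logged in to
    matches     -- predicate deciding whether a file satisfies the query
    max_results -- the preset maximum number of results (R)
    Returns (results, number_of_servers_contacted).
    """
    results = []
    contacted = 0
    k = len(servers)
    for step in range(k):
        # Assumption: the chain is treated as a ring, so we wrap around.
        local_index = servers[(start + step) % k]
        contacted += 1
        results.extend(f for f in local_index if matches(f))
        if len(results) >= max_results:
            break  # query satisfied; no need to forward further
    return results[:max_results], contacted
```

A popular query is usually satisfied by the first server, while a rare one may visit every server in the chain, which is why query cost is the concern for this architecture.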

Architectures (2)
  Full replication architecture
    Each server keeps metadata about all users
    Logins expensive
    Queries cheap

Architectures (3)
  Hash architecture
    Metadata words are hashed so that a particular server is responsible for a particular subset of them
    Queries are sent to the relevant servers
    On login, metadata is sent to all relevant servers
    A limited number of servers need to see each query, but sending the lists may be expensive
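The word-to-server mapping can be sketched as follows. The choice of SHA-1 and the helper names are assumptions made for illustration; the slides do not say which hash function the paper uses.

```python
import hashlib


def server_for(word, num_servers):
    """Map a metadata word to the server responsible for it (hypothetical scheme)."""
    digest = hashlib.sha1(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def servers_for_words(words, num_servers):
    """Servers that must see a query, or a login's metadata, containing these words."""
    return sorted({server_for(w, num_servers) for w in words})
```

A query with a handful of words touches at most that many servers, whereas a login touches every server responsible for any word in the user's metadata, which is where the cost of sending the lists comes from.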

Architectures (4)
  Unchained architecture
    Servers are independent and don't communicate
    A user can only search files on the server he/she connects to
    Example: Napster
    Disadvantage: users' views are limited
    Advantage: scales very well (as servers and users increase together)

Query model
  Universe of queries: q_1, q_2, q_3, ...; densities f, g
  g(i) is the probability that a submitted query is query q_i (query popularity)
  f(i) is the probability that any given file will match query q_i (selection power)
  g tells us what queries users like to submit, while f tells us which files users like to store
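To make the roles of f and g concrete: if each of n files matches q_i independently with probability f(i), query q_i is expected to return n * f(i) results, and averaging over submitted queries weights that by g(i). The sketch below assumes independence between files and uses made-up toy values; the function name is hypothetical.

```python
def expected_results(g, f, n_files):
    """Average number of matching files per submitted query.

    g[i] -- probability that a submitted query is q_i (popularity)
    f[i] -- probability that a given file matches q_i (selection power)
    Assumes each file matches independently of the others.
    """
    return n_files * sum(gi * fi for gi, fi in zip(g, f))


# Toy values, not measurements: three queries over 1000 files.
g = [0.5, 0.3, 0.2]
f = [0.10, 0.05, 0.01]
average = expected_results(g, f, 1000)
```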

Expected results for chained
  ExServ = expected number of servers needed to obtain R results (R = MaxResults)
  If P(s) is the probability that exactly s servers are needed to return R or more results, ExServ is the expected value of s under P
  ExLocalResults based on (UsersPerServer * FilesPerUser) files
  ExTotalResults based on (ExLocalResults * k) files
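One way to compute such an expectation is sketched below for a single query with selection power p. It is not the paper's formula: it assumes every server holds the same number of files, that files match independently, and that the query stops at the last of the k servers even if fewer than R results were found. It uses the identity that the expectation of s equals the sum over s of the probability that at least s servers are contacted.

```python
from math import comb


def prob_fewer_than(r, n, p):
    """P(Binomial(n, p) < r): chance that n files yield fewer than r matches."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(min(r, n + 1)))


def expected_servers(k, files_per_server, p, max_results):
    """Expected number of servers contacted by one query in a chain of k servers.

    Server s+1 is contacted exactly when the first s servers together returned
    fewer than max_results matches, so the expectation is the sum of those
    probabilities for s = 0 .. k-1.
    """
    return sum(prob_fewer_than(max_results, s * files_per_server, p)
               for s in range(k))
```

Here `files_per_server` plays the role of UsersPerServer * FilesPerUser; averaging the result over queries with weights g(i) and p = f(i) would give a model-wide figure.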

Expected values for others
  ExServ is trivially 1 for full replication and unchained
  ExServ is equivalent to balls-in-bins for hash
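The balls-in-bins view can be illustrated with the standard occupancy result: throwing w balls (query words) uniformly and independently into k bins (servers) leaves an expected k * (1 - (1 - 1/k)^w) bins non-empty. Uniform, independent hashing is an assumption here, and the paper's exact formulation may differ.

```python
def expected_servers_hash(num_words, num_servers):
    """Expected number of distinct servers hit by a query of num_words words.

    Assumes each word hashes uniformly and independently to one of
    num_servers servers (classic balls-in-bins occupancy).
    """
    return num_servers * (1 - (1 - 1 / num_servers) ** num_words)
```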

Distributions for f() and g()
  Exponential distributions work well for the music domain
    Monotonically decreasing
    Popularity and selection power are correlated
    Most popular has highest selection power, and so on

Validation of query model
  M(n) = expected # of results from n files
  Q(n) = probability we don't get R results
  These data were gathered from OpenNap
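Under the query model, a quantity like Q(n) could be estimated as sketched below: for each query q_i, the number of matches among n files is treated as Binomial(n, f(i)), and the chance of falling short of R is averaged with weights g(i). The independence assumption and the function names are mine, not the paper's.

```python
from math import comb


def prob_fewer_than(r, n, p):
    """P(Binomial(n, p) < r)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(min(r, n + 1)))


def q_of_n(g, f, n_files, max_results):
    """Probability that a random submitted query finds fewer than max_results
    matches among n_files files, assuming files match independently."""
    return sum(gi * prob_fewer_than(max_results, n_files, fi)
               for gi, fi in zip(g, f))
```

Comparing values computed this way against measured ones for several n is the kind of check the validation slide describes.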

Performance model
  CPU cycles
    Cost estimates based on examination and guesswork, plus some experiments
    Matched OpenNap relatively well for batch logins
  Inter-server bandwidth
    Varies among architectures
  Server-client bandwidth
    Napster protocol: Login, AddFile, RemoveFile
  Take min over resources (iterative estimation)
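The "take min over resources" step can be sketched as finding the largest user count that fits within every resource at once. The slide mentions iterative estimation; the version below swaps in a plain binary search, which assumes each load function is non-decreasing in the number of users. The load functions and capacities are hypothetical placeholders.

```python
def max_users_per_server(load_fns, capacities, upper=10_000_000):
    """Largest user count whose load fits within every resource capacity.

    load_fns   -- dict: resource name -> function(users) giving the load
    capacities -- dict: resource name -> available capacity
    Assumes every load function is non-decreasing in users.
    """
    def fits(users):
        return all(load_fns[name](users) <= capacities[name]
                   for name in capacities)

    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


# Toy example: CPU load grows linearly, inter-server bandwidth quadratically.
loads = {"cpu": lambda u: 2 * u, "bandwidth": lambda u: u * u}
limits = {"cpu": 100, "bandwidth": 400}
bottleneck_users = max_users_per_server(loads, limits)
```

In the toy example the CPU alone would allow 50 users and bandwidth alone 20, so the search returns the smaller figure.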

Evaluation
  Metric: max users per server (throughput, not latency)

Memory requirements

Beyond music
  f() and g() could be different
  May be no or negative correlation
    e.g. adding “price > 0” to a query makes it less popular but doesn't change the size of the result set
    e.g. an archive system will return more results from farther in the past (queries presumably rarer)
  No or negative correlation can be modeled by adjusting the ratio r of the parameters to f and g
    No correlation: r = 1
    Negative correlation: r >> 1

CPU performance vs. r

Conclusion
  Chained is the best architecture for the music domain
  Full replication might be good with lots of cheap memory and stable network connections
  Incremental logins do best when there is negative correlation between f and g, and they perform best in short, bandwidth-limited sessions
