Understanding Churn in Peer-to-Peer Networks
Daniel Stutzbach – University of Oregon
Reza Rejaie – University of Oregon
Internet Measurement Conference
Rio de Janeiro, Brazil
October 26th, 2006

Daniel Stutzbach The ION P2P Project http://mirage.cs.uoregon.edu/P2P
Slide 2/21: Motivation
P2P systems are very popular in practice.
    Several million simultaneous users collectively
    60% of all Internet traffic [CacheLogic Research '05]
Churn is an important property to model.
    Outside the control of the designer
    Needed for simulation or analysis
Churn is hard to measure.
    Requires continuous monitoring
    Many potential pitfalls
Prior results are contradictory.

Slide 3/21: Talk Outline
Datasets
    Gnutella
    Kad
    BitTorrent
Pitfalls
    Lots!
Characterizations
    Inter-arrival distribution
    Session-length distribution
    Others

Slide 4/21: Datasets
Data necessities
    Arrival time of peers
    Departure time of peers
Quality of data
    The precision of timestamps
    How representative the sessions are
We use data from 3 different P2P systems:
    Gnutella (unstructured file-sharing)
    Kad (DHT file-sharing)
    BitTorrent (unstructured content-delivery)

Slide 5/21: Datasets: Gnutella
More than 1 million simultaneous users
Complete snapshots gathered with Cruiser [Stutzbach 05 Global Internet]
Using many back-to-back snapshots, we can determine arrival & departure times.
    Approximately 7-minute granularity
Five sets of data, each spanning 48 hours
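The step from back-to-back snapshots to arrival and departure times can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name, the `(time, set of peer IDs)` snapshot shape, and the choice to timestamp an event at the first crawl that shows the change are all assumptions. Peers already present in the first snapshot have an unknown arrival time, so their sessions are not reported.

```python
def sessions_from_snapshots(snapshots):
    """Infer sessions from time-ordered snapshots [(time, set_of_peer_ids), ...].

    Returns (complete, still_open):
      complete   - list of (peer, arrival_time, departure_time)
      still_open - dict peer -> arrival_time for sessions active at the last crawl
    """
    if not snapshots:
        return [], {}
    known_start = {}             # peer -> arrival time observed inside the trace
    complete = []
    previous = snapshots[0][1]   # peers whose arrival we never saw
    for t, peers in snapshots[1:]:
        # A peer present in the previous crawl but absent now has departed.
        for peer in previous - peers:
            start = known_start.pop(peer, None)
            if start is not None:        # skip sessions with unknown arrival
                complete.append((peer, start, t))
        # A peer absent from the previous crawl but present now has arrived.
        for peer in peers - previous:
            known_start[peer] = t
        previous = peers
    return complete, known_start
```

The resolution of each timestamp is limited by the spacing between crawls, which is why the granularity of the Gnutella data is about 7 minutes.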

Slide 6/21: Datasets: Kad
Kad is a DHT used by eMule
Approximately 1 million simultaneous users
We monitor a zone of the DHT address space.
    Each peer selects its ID uniformly at random, so any zone is representative
Four sets of data, each spanning 48 hours

Slide 7/21: Datasets: BitTorrent
BitTorrent is used for transferring files.
    A collection of P2P overlays, rather than one big network
Each peer periodically contacts a centralized point (the tracker).
Tracker logs reveal arrival and departure information with 1-second granularity.
Three sets of data:
    Red Hat ISO image
    Debian ISO images
    A demo of the game FlatOut, from 3dgamers.com

Slide 8/21: Pitfalls
Problems can occur:
    When gathering data
    When cleaning data
    When analyzing data
Specific pitfalls:
    Missing Data
    False Negatives
    NAT
    Long Sessions
    Biased Peer Selection
    Handling Brief Events

Slide 9/21: Pitfalls: Missing Data
No significant gaps in Gnutella or Kad data
BitTorrent logs have significant gaps!
    Nearly all events are followed by another within 4 minutes.
    A gap of several hours is highly suspect.
    We use the longest continuous segment.
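Selecting the longest continuous segment of a log can be sketched as below. The function name and the `max_gap` threshold are hypothetical; the slides say only that gaps of several hours are suspect, and measuring "longest" by duration rather than by event count is also an assumption.

```python
def longest_continuous_segment(times, max_gap):
    """Return (start, end) of the longest stretch of sorted event timestamps
    in which no two consecutive events are more than max_gap apart.

    Returns None for an empty log.
    """
    if not times:
        return None
    best = (times[0], times[0])
    seg_start = times[0]
    for prev, cur in zip(times, times[1:]):
        if cur - prev > max_gap:
            seg_start = cur              # suspicious gap: start a new segment
        if cur - seg_start > best[1] - best[0]:
            best = (seg_start, cur)      # current segment is now the longest
    return best
```

Sessions would then be extracted only from events that fall inside the returned interval.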

Slide 10/21: Pitfalls: False Negatives
What if a peer is missing from a crawl, but did not really depart?
    Small chance per crawl (p)
    Compounded after n crawls → high chance
If p=10%, we would observe at most 1 in 3.1 trillion sessions lasting longer than 1 day.
    Since we observe many more than that, p must be lower.
We can compute an upper bound of p=1.8%.
    But this would only occur if all sessions were longer than 1 day.
    In practice, p is likely much lower.
[Figure: Worst-case impact of false negatives, based on the upper bound of p]
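The compounding argument can be checked numerically. The sketch below assumes each crawl misses a present peer independently with probability p, so a session is observed unbroken across n crawls with probability (1 - p)^n. The value n = 273 crawls per day is hypothetical: the slides do not state n, and it is used here only because it roughly reproduces the 1-in-3.1-trillion figure for p = 10%.

```python
def prob_seen_in_all_crawls(p, n):
    """Probability that a peer that never departs is captured by all n crawls,
    assuming each crawl independently misses it with probability p."""
    return (1.0 - p) ** n


def prob_at_least_one_miss(p, n):
    """Probability that at least one of n crawls falsely reports a departure."""
    return 1.0 - prob_seen_in_all_crawls(p, n)


# Hypothetical crawl count per day (not given in the slides).
CRAWLS_PER_DAY = 273
one_in = 1.0 / prob_seen_in_all_crawls(0.10, CRAWLS_PER_DAY)
```

Under these assumptions `one_in` comes out on the order of trillions, so observing many day-long sessions is incompatible with a per-crawl miss rate as high as 10%.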

Slide 11/21: Pitfalls: NAT
NAT presents two obstacles:
    Some peers may not be visible at all
    Multiple peers may look like one peer (large NATs only)
BitTorrent
    Peers contact the tracker and present a unique ID
Kad
    DHT peers must be able to receive unsolicited incoming packets
    No NATed peers permitted in the DHT overlay
Gnutella
    NATed peers are discovered through their neighbors
    No good way to resolve multiple peers behind one NAT

Slide 12/21: Pitfalls: Long Sessions
Some sessions are longer than the measurement window.
    Truncating sessions leads to bias.
    Ignoring sessions also leads to bias.
The “create-based method”:
    Divide the measurement window into two halves
    Only consider sessions that begin in the first half
Every session beginning in the first half counts
    Equal opportunity to observe sessions shorter than half a window
    For longer sessions, we count them, but do not record a particular value.
[Figure: Measurement Window]
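A minimal sketch of the create-based method follows. The function name and the input shape are assumptions: each session is a `(start, end)` pair with an observed start time, and `end` is `None` when the peer was still present at the end of the window.

```python
def create_based_sample(sessions, window_start, window_end):
    """Apply the create-based method to sessions given as (start, end) pairs.

    Only sessions that begin in the first half of the window are considered.
    Returns (lengths, n_longer):
      lengths  - exact lengths of sessions no longer than half the window
      n_longer - number of sessions longer than half the window, which are
                 counted but given no particular value
    """
    half = (window_end - window_start) / 2.0
    cutoff = window_start + half
    lengths = []
    n_longer = 0
    for start, end in sessions:
        if not (window_start <= start < cutoff):
            continue                     # began in the second half: ignored
        if end is None or end - start > half:
            n_longer += 1                # known only to exceed half a window
        else:
            lengths.append(end - start)
    return lengths, n_longer
```

With a 48-hour window, every session shorter than 24 hours that starts in the first 24 hours is guaranteed to end inside the window, which is what gives all such sessions an equal chance of being observed in full.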

Slide 13/21: Pitfalls: Miscellaneous
Biased Peer Selection
    Monitoring a fixed set of peers causes bias because their sessions are correlated.
    Selecting peers must be done carefully to avoid correlations between uptime and the selection process.
    In Gnutella and BitTorrent, we use all peers in the overlay.
    In Kad, we use all peers in a zone.
Handling Brief Events
    What if a peer departs briefly and returns?
    Not a problem with BitTorrent.
    Most peers do not depart and return within a day, so this is probably not a large problem.

Slide 14/21: Pitfalls: Summary
Problems can occur:
    When gathering data
    When cleaning data
    When analyzing data
See paper for more details
Specific pitfalls:
    Missing Data
    False Negatives
    NAT
    Long Sessions
    Biased Peer Selection
    Handling Brief Events

Slide 15/21: Characterizations
Properties critical for simulations
    Inter-arrival distribution
    Session length distribution
Properties providing design insight
    Uptime distribution
    Correlations of consecutive sessions
Additional properties in the paper

Slide 16/21: Inter-arrival Time
Inter-arrival time: from the arrival of one peer until the arrival of any other peer
Gnutella data is too dense to examine.
Exponential is the simplest assumption (commonly used in simulation & analysis).
Weibull provides a better fit.
    However, this may be due to a time-varying exponential (future work).
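The two quantities discussed on this slide can be written down directly. This is an illustrative sketch with hypothetical function names, not a fitting procedure: it computes inter-arrival times from arrival timestamps and evaluates the Weibull complementary CDF, exp(-(x/scale)^shape), which reduces to the exponential distribution when the shape parameter is 1.

```python
import math


def inter_arrival_times(arrivals):
    """Gaps between consecutive peer arrivals (any peer, not the same peer)."""
    ordered = sorted(arrivals)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def weibull_ccdf(x, shape, scale):
    """P(X > x) for a Weibull distribution; shape == 1 gives the exponential
    distribution with mean `scale`."""
    return math.exp(-((x / scale) ** shape))


def exponential_ccdf(x, mean):
    """P(X > x) for an exponential distribution with the given mean."""
    return math.exp(-x / mean)
```

Because the exponential is a special case of the Weibull, a fitted shape parameter that differs clearly from 1 is one way to see that the plain exponential assumption does not match the data.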

Slide 17/21: Session-Length
Some prior studies report peer session lengths are heavy-tailed or Pareto (linear on log-log plots).
Our data exhibits downward curvature on a log-log scale.
For the BitTorrent data, the curvature is dramatic for sessions longer than 1 day.
Weibull and log-normal distributions provide decent fits.
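The log-log view referred to here can be produced from raw session lengths as sketched below. The function name is hypothetical, and the CCDF is taken as the fraction of sessions with length greater than or equal to x so that the largest sample still has a finite logarithm.

```python
import math


def loglog_ccdf(samples):
    """Return [(log10(x), log10(P(X >= x)))] for each distinct positive sample.

    A Pareto tail appears as a roughly straight line through these points;
    downward curvature means the tail decays faster than a power law.
    """
    ordered = sorted(samples)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        if x <= 0:
            continue                     # log10 is undefined for x <= 0
        if i > 0 and x == ordered[i - 1]:
            continue                     # keep one point per distinct value
        # ordered[i:] are exactly the samples >= x, so there are n - i of them.
        points.append((math.log10(x), math.log10((n - i) / n)))
    return points
```

Plotting these points with any plotting tool gives the diagnostic described on the slide.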

Slide 18/21: Lingering after Download Completion (in BitTorrent)
How long do peers remain after download completion?
Many peers linger for a few hours.
A few peers linger for days or weeks.
This explains the high seed percentage observed in other studies.

Slide 19/21: Peer Uptime
The uptime of active peers is related to, but different from, the session length distribution.
40 to 60% of peers have an uptime longer than 5 hours.
10 to 20% of peers have an uptime longer than one day.
Conclusion: the typical session is short, but the typical peer has been up a long time.
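The difference between session length and the uptime of active peers can be seen on a toy example. The data below is invented for illustration (nine back-to-back 1-hour sessions plus one 10-hour session), and the function names are hypothetical.

```python
import statistics


def session_lengths(sessions):
    """Length of every session given as (start, end)."""
    return [end - start for start, end in sessions]


def uptimes_at(sessions, t):
    """Uptime so far (t - start) of every session that is active at time t."""
    return [t - start for start, end in sessions if start <= t < end]


# Toy data (hypothetical): many short sessions and a single long one.
toy_sessions = [(float(i), float(i + 1)) for i in range(9)] + [(0.0, 10.0)]
median_length = statistics.median(session_lengths(toy_sessions))
active_uptimes = uptimes_at(toy_sessions, 8.5)
```

Although nine of the ten sessions are short, a snapshot at any instant catches the long session together with only one short one, so long-lived peers are over-represented among the peers that are currently up.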

Slide 20/21: Correlations in Session Length
Is uptime a good predictor of remaining uptime?
Yes and no.
    Uptime is a good predictor of the median remaining uptime.
    But many predictions are wrong.
Conclusion: predictions are useful if the cost of a bad prediction is low
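One simple way to turn current uptime into a prediction is to look at the median remaining time among sessions that lasted at least that long. This is a sketch under the assumption that a list of complete session lengths is available; the function name is hypothetical and the slides do not specify how the predictor was computed.

```python
import statistics


def median_remaining_uptime(lengths, uptime):
    """Median remaining time among sessions that outlasted `uptime`.

    lengths - complete session lengths
    uptime  - how long the peer under consideration has been up so far
    Returns None if no recorded session lasted longer than `uptime`.
    """
    remaining = [length - uptime for length in lengths if length > uptime]
    if not remaining:
        return None
    return statistics.median(remaining)
```

The median summarizes the whole group, so an individual peer can still depart much sooner or later than predicted, which matches the caution that many predictions are wrong.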

Slide 21/21: Summary & Future Work
Many pitfalls when studying churn
    We enumerate and address them.
Characterizations
    Neither exponential nor Pareto distributions are consistent with our session length data.
    Session lengths can be modeled by Weibull or log-normal distributions.
    The typical session is short, but the typical peer is up for a long time.
    Past session length predicts the next session length, on average
        But is often wrong
Future Work: longer studies of Kad and Gnutella
    Reduce the chance of False Negatives using heuristics
    Look for fingerprints indicating whether a peer is new or not
    Use uniform sampling [Stutzbach 06 IMC] to closely monitor a manageable set of sessions

Slide 22/21: Appearances per Day
Most peers appear only once in a day.
A very small number return often
    Up to 60 times per day
