1 CS 525 Advanced Distributed Systems, Spring 2013. Indranil Gupta (Indy). Measurement Studies. March 26, 2013. All Slides © IG. Acknowledgments: Some slides by Long Vu, Jay Patel.


2 We’ve seen a variety of distributed systems so far: P2P file-sharing systems (Kazaa, Gnutella, etc.), clouds (AWS, Hadoop, etc.), and P2P streaming systems (PPLive, etc.). Often, the behavior and characteristics of these systems, when deployed in the wild, are surprising. It is important to understand these characteristics in order to build better distributed systems for deployment.

3 How do you find the characteristics of these systems in real-life settings? Write a crawler to crawl a real working system. Collect traces from the crawler. Tabulate the results. The papers contain plenty of information on how the data was collected, plus the caveats and qualifications of the interpretation. These are important, but we will ignore them for this lecture and concentrate on the raw data and conclusions.

4 Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload. Gummadi et al., Department of Computer Science, University of Washington.

5 What They Did. This 2003 paper analyzed a 200-day trace of Kazaa traffic. It considered only traffic going from U. Washington to the outside, and developed a model of multimedia workloads.

6 Results Summary. 1. Users are patient. 2. Users slow down as they age. 3. Kazaa is not one workload. 4. Kazaa clients fetch objects at most once. 5. Popularity of objects is often short-lived. 6. Kazaa is not Zipf.

7 User characteristics (1) Users are patient

8 User characteristics (2). Users slow down as they age: clients “die”, and older clients ask for less each time they use the system.

9 User characteristics (3): Client activity. The tracing used could only detect users when their clients transfer data. Thus, they only report statistics on client activity, which is a lower bound on availability. Average session lengths are typically small (median: 2.4 minutes). Many transactions fail. Periods of inactivity may occur during a request if the client cannot find an available peer with the object.

10 Object characteristics (1). Kazaa is not one workload. (This does not account for connection overhead.)

11 Object characteristics (2): Kazaa object dynamics. Kazaa clients fetch objects at most once. Popularity of objects is often short-lived. The most popular objects tend to be recently born. Most requests are for old objects (> 1 month old): 72% old vs. 28% new for large objects; 52% old vs. 48% new for small objects.

12 Object characteristics (3). Kazaa is not Zipf. Zipf's law: the popularity of the i-th most popular object is proportional to i^(-α) (α is the Zipf coefficient). Web access patterns are Zipf. The authors conclude that Kazaa is not Zipf because of the at-most-once fetch characteristic. Caveat: what is an “object” in Kazaa?
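To make “Zipf” concrete: on a log-log plot of popularity against rank, a Zipf(α) workload is a straight line of slope -α. A small self-contained sketch (synthetic data, not the paper's trace) that recovers α by least squares on that plot:

```python
import math

def zipf_popularity(n, alpha):
    # Ideal Zipf counts: popularity of rank i is proportional to i ** (-alpha).
    return [i ** -alpha for i in range(1, n + 1)]

def fit_zipf_alpha(counts):
    # Least-squares slope of the log-log rank/popularity plot;
    # for a Zipf workload that slope is -alpha.
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    return -slope

print(fit_zipf_alpha(zipf_popularity(1000, 1.0)))  # recovers alpha = 1.0
```

Run against a measured popularity-vs-rank curve, a flattened (sub-Zipf) head like Kazaa's shows up as a poor straight-line fit near rank 1, which is the “not Zipf” signature the paper reports.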

13 Model of P2P file-sharing workloads. Why a model? On average, a client requests 2 objects/day. P(x): probability that a user requests an object of popularity rank x ~ Zipf(1), adjusted so that objects are requested at most once. A(x): probability that a newly arrived object is inserted at popularity rank x ~ Zipf(1). All objects are assumed to have the same size. Use caching to observe performance changes (effectiveness = hit rate).

14 Model – Simulation results. File-sharing effectiveness diminishes with client age: the system evolves towards one with no locality, with objects chosen at random from a large space. New object arrivals improve performance: arrivals replenish the supply of popular objects. New clients cannot stabilize performance: they cannot compensate for the increasing number of old clients, and overall bandwidth increases in proportion to population size. By tweaking the arrival rate of new objects, the authors were able to match the trace results (with 5475 new arrivals per year).

15 Some Questions for You. “Unique object”: when do we say two objects A and B are “different”? When they have different file names (fogonthetyne.mp3 and fogtyne.mp3)? When they have exactly the same content (2 mp3 copies of the same song, one at 64 kbps and the other at 128 kbps)? When A (and not B) is returned by a keyword search, and vice versa? …? Based on this, does “caching” have a limit? Should caching look into file content? Is there a limit to such intelligent caching then? Should there be separate overlays for small objects and large objects? For new objects and old objects? Or should there be separate caching strategies? Most requests are for old objects, while the most popular objects are new ones: is there a contradiction?

16 Understanding Availability R. Bhagwan, S. Savage, G. Voelker University of California, San Diego

17 What They Did. A measurement study of a peer-to-peer (P2P) file-sharing application: Overnet (January 2003), which is based on Kademlia, a DHT that uses an XOR routing metric. Each node uses a random self-generated ID; the ID remains constant (unlike the IP address), and was used to collect availability traces. Overnet is closed-source. They analyzed the collected data to characterize availability. Availability = % of time a node is online (node = user, or machine).

18 What They Did. Crawler: takes a snapshot of all the active hosts by repeatedly requesting 50 randomly generated IDs. The requests lead to discovery of some hosts (through routing requests), which are sent the same 50 IDs, and the process is repeated. Run once every 4 hours to minimize impact. Prober: probes the list of available IDs to check for availability, by sending a request to ID I; the request succeeds only if I replies. Does not use TCP, which potentially avoids problems with NAT and DHCP. Used on only 2400 randomly selected hosts from the initial list. Run every 20 minutes. All Crawler and Prober trace data from this study is available for your project (ask Indy if you want access).
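The availability metric itself is simple; the subtlety this study exposes is IP aliasing. A small sketch with hypothetical trace data: a host that answers every probe but switches IP mid-trace (DHCP) looks like two half-available hosts when keyed by IP, but one fully available host when keyed by its constant Overnet ID:

```python
def availability(up_flags):
    # Fraction of probes a host answered: a lower bound on the
    # fraction of time it was actually online.
    return sum(up_flags) / len(up_flags)

# One host, probed 6 times, renumbered by DHCP halfway through.
probes_ip_a = [True, True, True, False, False, False]
probes_ip_b = [False, False, False, True, True, True]
by_id = [a or b for a, b in zip(probes_ip_a, probes_ip_b)]

print(availability(probes_ip_a), availability(probes_ip_b), availability(by_id))
# -> 0.5 0.5 1.0
```

This is exactly why the crawler keys hosts by Overnet ID rather than by IP address.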

19 Scale of Data. Ran for 15 days, from January 14 to January 28, 2003 (with problems on January 21). Each pass of the crawler yielded about 40,000 hosts; a single day (6 crawls) yielded between 70,000 and 90,000 unique hosts. Some of the 2400 randomly selected hosts responded to probes at least once.

20 Results Summary. 1. Overall availability is low, but IP aliasing must be taken into account in measurement studies. 2. Diurnal patterns exist in availability. 3. Availabilities are uncorrelated across nodes. 4. High churn exists.

21 Multiple IP Hosts

22 Availability

23 Host Availability. As the time interval considered increases, average availability decreases.

24 Diurnal Patterns. 6.4 joins/host/day; 32 hosts/day lost; N changes by only about 100/day. Normalized to “local time” at each peer, not EST.

25 Are Node Failures Interdependent? 30% of pairs show 0 difference, and 80% fall within a small difference. The distributions should be the same if X and Y are independent.

26 Arrival and Departure. 20% of nodes each day are new; the total number of nodes stays at about 85,000.

27 Conclusions and Discussion. Each host uses an average of 4 different IP addresses within just 15 days: keeping track of assumptions is important for trace-collection studies. Strong diurnal patterns: design p2p systems that adapt to time-of-day? The value of N is stable in spite of churn: this can be leveraged in building estimation protocols, etc., for p2p systems.

28 Measurement and Modeling of a Large-scale Overlay for Multimedia Streaming. Long Vu, Indranil Gupta, Jin Liang, Klara Nahrstedt, UIUC. This was a CS525 project (Spring 2006), published at the QShine 2007 conference and in ACM TOMCCAP.

29 Motivation. IPTV applications have flourished (SopCast, PPLive, PeerCast, CoolStreaming, TVUPlayer, etc.). IPTV growth (MRG Inc., April 2007): subscriptions of 14.3 million in 2007, projected to reach 63.6 million in 2011; revenue of $3.6 billion in 2007, projected to reach $20.3 billion in 2011. The largest IPTV systems in the world today are P2P streaming systems. A few years ago, this system was PPLive: 500K users at peak, multiple channels with a per-channel overlay, and nodes may be recruited as relays for other channels (data from 2006). Do peer-to-peer IPTV systems have the same overlay characteristics as peer-to-peer file-sharing systems?

30 Summary of Results. P2P streaming overlays differ from file-sharing P2P overlays in a few ways: 1. Users are impatient: session times are small, and exponentially distributed (think of TV channel flipping!). 2. Smaller overlays are random (and not power-law or clustered). 3. Availability is highly correlated across nodes within the same channel. 4. Channel population varies by 9x over a day.

31 Results

32 PPLive Channels. A Program Segment (PS) is an episode of a channel.

33 PPLive Membership Protocol: Challenges. PPLive is a closed-source system, which makes measurement challenging: metrics have to be selected carefully!

34 Channel Size Varies over a Day. Used 10 geographically distributed PlanetLab nodes to crawl peers. The popular channel varies by 9x; a less popular channel varies by 2x.

35 Channel Size Varies over Days. The same channel, same program: the peaks drift (first day vs. second day).

36 Operations. Snapshot collects the peers in one channel; PartnerDiscovery collects the partners of responsive peers. (Table of studied channels.)

37 K-degree. Problem: when a PPLive node is queried for its membership list, it sends back a fixed-size list, and subsequent queries return slightly different lists. One option: figure out why (are the lists changing? are the answers random? …). Our option: define K-degree = the union of the answers received when K consecutive membership queries are sent to the PPLive node. K = 5-10 gives half as many entries as K = 20.
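The K-degree definition is easy to state in code. A sketch with a mock node (the `make_mock_node` helper is hypothetical; a real PPLive node would be queried over the network): each query returns a fixed-size subset of the node's partners, and the union grows, then saturates, as K increases:

```python
import random

def k_degree(query_node, k):
    # Union of the membership lists returned by k consecutive
    # queries to the same node.
    members = set()
    for _ in range(k):
        members.update(query_node())
    return len(members)

def make_mock_node(n_partners, list_size, seed):
    # Hypothetical stand-in for a PPLive node: each membership
    # query returns a fixed-size random subset of its partner list.
    rng = random.Random(seed)
    return lambda: rng.sample(range(n_partners), list_size)

print(k_degree(make_mock_node(60, 10, seed=1), 5),
      k_degree(make_mock_node(60, 10, seed=1), 20))
```

Picking K where this curve flattens is precisely the metric-selection care the previous slide calls for.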

38 Node Degree is Independent of Channel Size. Similar to P2P file sharing [Ripeanu 02]: the average node degree is scale-free.

39 Overlay Randomness. Clustering Coefficient (CC) [Watts 98]: for a random node x with two neighbors y and z, the CC is the probability that y is a neighbor of z (or vice versa). D, the probability that two random nodes are neighbors = average node degree / channel size. A graph is more clustered if its CC is far from D [well-known results from the theory of networks and graphs].
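Both quantities are easy to compute on a crawled snapshot. A self-contained sketch (the adjacency data below is a toy example, not PPLive data):

```python
def clustering_coefficient(adj):
    # Mean over nodes of the fraction of neighbor pairs that are
    # themselves neighbors (adj: node -> set of neighbors).
    per_node = []
    for x, neighbors in adj.items():
        nbrs = sorted(neighbors)
        if len(nbrs) < 2:
            continue
        linked = sum(1 for i in range(len(nbrs))
                       for j in range(i + 1, len(nbrs))
                       if nbrs[j] in adj[nbrs[i]])
        pairs = len(nbrs) * (len(nbrs) - 1) // 2
        per_node.append(linked / pairs)
    return sum(per_node) / len(per_node)

def random_baseline(adj):
    # D = average degree / number of nodes: the probability that
    # two random nodes are neighbors in an equivalent random graph.
    n = len(adj)
    return sum(len(v) for v in adj.values()) / n / n

triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
print(clustering_coefficient(triangle), random_baseline(triangle))
# CC = 1.0 vs. D = 2/3: CC far above D, i.e. clustered
```

On a large random overlay CC stays close to D, which is how the next slide distinguishes the small (random) PPLive overlays from the large (clustered) ones.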

40 Smaller Overlay, More Random. A small overlay is more random; a large overlay is more clustered. P2P file-sharing overlays are clustered [Ripeanu 02, Saroiu 03].

41 Nodes in One Snapshot Have Correlated Availability. In P2P file sharing, nodes are uncorrelated [Bhagwan 03]. Here, availability is correlated: nodes that appear together are likely to appear together again.

42 Random Node Pairs (across snapshots) Have Independent Availabilities. Similar to P2P file sharing [Bhagwan 03].

43 PPLive Peers are Impatient. 90% of sessions are shorter than 70 minutes. In P2P file sharing, peers are patient [Saroiu 03].

44 Feasible Directions/Discussion. Nodes are homogeneous due to their memoryless session lengths: would a protocol that treats all nodes equally be simple and work more effectively? As PPLive overlay characteristics depend on application behavior, a deeper study of user behavior may yield better design principles. Designing “generic” P2P substrates for a wide variety of applications is challenging. Node availability correlations could be used to create sub-overlays of correlated nodes, or to route media streams? Simulation of multimedia streaming needs to take this correlated availability into account? Geometrically distributed session lengths can be used to better simulate node arrival/departure behavior.

45 An Evaluation of Amazon’s Grid Computing Services: EC2, S3, and SQS Simson L. Garfinkel SEAS, Harvard University

46 What They Did. Performed bandwidth measurements: from various sites to S3 (Simple Storage Service), and between S3, EC2 (Elastic Compute Cloud), and SQS (Simple Queuing Service).

47 Results Summary. 1. Effective bandwidth varies heavily based on geography! 2. Throughput is relatively stable, except when the internal network was reconfigured. 3. Read and write throughputs: larger is better, since it decreases per-request overhead. 4. Consecutive requests receive performance that is highly correlated. 5. The QoS received by requests falls into multiple “classes”.

48 Effective Bandwidth varies heavily based on geography!

49 100 MB GET ops from EC2 to S3. Throughput is relatively stable, except when the internal network was reconfigured.

50 Read and Write throughputs: larger is better (but beyond some block size, it makes little difference).
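A toy model (my assumption, not the paper's analysis) explains the shape of this curve: each request pays a roughly fixed per-transaction overhead plus the transfer time, so effective throughput is size / (overhead + size / bandwidth). Small requests are overhead-dominated; beyond some size the curve flattens near the raw bandwidth:

```python
def effective_throughput(size_bytes, overhead_s=0.1, bandwidth_bps=10e6):
    # Bytes/s actually achieved for one request of the given size.
    # overhead_s and bandwidth_bps are illustrative numbers, not
    # measured AWS values.
    return size_bytes / (overhead_s + size_bytes / bandwidth_bps)

for size in (10_000, 1_000_000, 100_000_000):
    print(size, round(effective_throughput(size)))
```

With these numbers a 10 KB request achieves about 1% of the raw bandwidth while a 100 MB request achieves about 99%, matching the slide's "larger is better, but with diminishing returns".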

51 Concurrency: Consecutive requests receive performance that are highly correlated.

52 The QoS received by requests falls into multiple “classes”: 100 MB transfers fall into 2 classes.

53 Feasible Directions. 1. Effective bandwidth varies heavily based on geography: wide-area network transfer algorithms! 2. Throughput is relatively stable, except when the internal network was reconfigured: guess the structure of an internal datacenter (like AWS)? Datacenter tomography. 3. Read and write throughputs: larger is better. Can these be made better? 4. Consecutive requests receive performance that is highly correlated: really concurrent? Improve? 5. The QoS received by requests falls into multiple “classes”: make QoS explicitly visible? Adapt SLAs? SLOs?

54 What Do Real-Life Hadoop Workloads Look Like? (Cloudera) (Slide Acknowledgments: Brian Cho)

55 What They Did. Hadoop workloads from 5 Cloudera customers in diverse industries (“in e-commerce, telecommunications, media, and retail”), collected in 2011. Hadoop workloads from Facebook, collected in 2009 and 2010 across the same cluster.

56 The Workloads. (Table: TB/day, jobs/day, and GB/job for each cluster.)

57 Data access patterns (1/2). The skew in access frequency across (HDFS) files is Zipf. 90% of jobs access files of less than a few GBs; these files account for only 16% of bytes stored.

58 Data access patterns (2/2). There is temporal locality in data accesses: 70% of re-accesses occur within 10 minutes (sequences of Hadoop jobs?); compare with accesses within 6 hours. Can you make Hadoop/HDFS better, now that you know these characteristics?

59 Burstiness. Plotted: the sum of task-time (map + reduce) over each hour-long interval, and the n-th percentile / median ratio. Facebook: from 2009 to 2010, the peak-to-median ratio dropped from 31:1 to 9:1. Claim: consolidating resources decreases the effect of burstiness.
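The burstiness metric on this slide is easy to reproduce from an hourly trace. A sketch with made-up numbers (the percentile level and the nearest-rank method are my choices; the slide does not fix n):

```python
def peak_to_median(hourly_task_seconds, pct=99):
    # n-th percentile over median of per-hour total task-time,
    # using the nearest-rank percentile definition.
    xs = sorted(hourly_task_seconds)
    n = len(xs)
    def nearest_rank(p):
        return xs[max(0, min(n - 1, round(p / 100 * n) - 1))]
    return nearest_rank(pct) / nearest_rank(50)

# A flat cluster vs. a bursty one (10 spike hours out of 100):
flat = [100] * 100
bursty = [10] * 90 + [1000] * 10
print(peak_to_median(flat), peak_to_median(bursty))  # -> 1.0 100.0
```

A ratio near 1 means capacity can be provisioned for the median hour; a ratio like Facebook's 31:1 means most capacity sits idle most of the time, which is the consolidation argument.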

60 High-level Processing Frameworks. Each cluster prefers 1-2 data processing frameworks.

61 Classification by multi-dimensional clustering.

62 Results Summary. Workloads differ across industries, yet there are commonalities: a Zipf distribution for file access frequency, with the same slope across all industries; 90% of all jobs access small files, while the other 10% account for 84% of the file accesses (this parallels p2p systems and the mp3-mpeg split); and a few frameworks are popular for each cluster.

63 Backup slides

64 Recommendations for P2P IPTV designers. Node availability correlations can be used to create sub-overlays of correlated nodes or to route media streams. Simulation of multimedia streaming needs to take this bimodal availability into account. Geometrically distributed session lengths can be used to simulate node arrival/departure behavior. Nodes are homogeneous due to their memoryless session lengths; a protocol that treats all nodes equally is simple and works effectively. As PPLive overlay characteristics depend on application behavior, a deeper study of user behavior may yield better design principles. Designing “generic” P2P substrates for a wide variety of applications is challenging.