An Approach to Measuring Large-Scale Distributed Systems
Jun Li, Peter Reiher, Gerald Popek, and Mark Yarvis (UCLA)
Geoffrey H. Kuenning (Harvey Mudd College)
2 How to Measure Internet-Scale Systems?
- Distributed systems have complex performance at large sizes
- Would like to measure & tune before deployment
- Biggest research testbeds are tiny relative to Internet
- Only Internet-scale testbed is Internet itself
3 Live Internet Measurement
- Difficult or impossible to get cooperation
- Difficult to control remote sites
- Extraneous noise in measurements
4 The Simulation Option
- Usually requires models of real software
  - Expensive to develop
  - Possible inaccuracy or bugs
  - Must be validated against real system
- Simulation usually much slower than reality
5 Measuring Big Distributed Systems is Tough
- Only one really big testbed: the Internet
  - Can’t get enough participants
  - Too much noise for repeatable measurements
- Simulations don’t use the real software
  - Hard to validate
- Small testbeds don’t reveal scaling problems
6 Testbed Overloading
- Use real software
- Run multiple instances on one machine
- Virtual topology to simulate connectivity
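A minimal sketch of how overloading could be organized, assuming a round-robin placement of logical nodes onto physical machines and a ring-shaped virtual topology. The names `place_nodes`, `build_ring_topology`, and `can_send`, the host names, and the ring shape are all illustrative assumptions; the slides do not say how the actual testbed assigns nodes or represents its topology.

```python
from collections import defaultdict

def place_nodes(num_logical, physical_hosts):
    """Assign logical node ids to physical hosts round-robin (assumed policy)."""
    return {node_id: physical_hosts[node_id % len(physical_hosts)]
            for node_id in range(num_logical)}

def build_ring_topology(num_logical):
    """Virtual topology: each logical node is connected only to its ring neighbors."""
    return {n: {(n - 1) % num_logical, (n + 1) % num_logical}
            for n in range(num_logical)}

def can_send(topology, src, dst):
    """Enforce virtual connectivity, even when src and dst share a physical host."""
    return dst in topology[src]

# Hypothetical setup: 12 logical nodes overloaded onto 3 physical machines.
placement = place_nodes(12, ["hostA", "hostB", "hostC"])
topology = build_ring_topology(12)

# Group logical nodes by the machine that runs them.
per_host = defaultdict(list)
for node, host in placement.items():
    per_host[host].append(node)
```

The point of the virtual topology is that two instances on the same machine still communicate only along virtual links, so metrics such as hop count reflect the simulated network rather than the physical one.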
7 Characteristics of Overloading
- Allows greatly increased scale
- Works best when applications are lightweight
- Some (not all) measurements will differ
8 Effects of Overloading
- Some metrics unaffected
  - Hop count
  - Bytes transferred per (virtual) node
  - Storage cost
- Other metrics must be adjusted due to resource competition
  - CPU processing times
  - Latencies
9 Eliminating Interference
- Locking to avoid contention
- Characterize slowdown
- Divide and conquer
10 Locking to Avoid Contention
- Use central coordinator
- One process at a time initiates operation x
- Measure latency, bytes transferred, messages exchanged
- No contention because of serialization
- Works well for operations that are one-at-a-time in real world (e.g., join multicast group)
- Total run time increases
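The serialization idea can be sketched as follows, assuming all logical nodes are threads in one process and the coordinator is a single lock. The `Coordinator` class and the `join_group` stand-in are hypothetical; a real deployment spanning several machines would need a network-visible coordinator instead of `threading.Lock`.

```python
import threading
import time

class Coordinator:
    """Central coordinator: lets one logical node at a time run the measured operation."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0        # operations currently in progress
        self.max_active = 0    # highest concurrency ever observed
        self.latencies = {}    # node id -> measured latency in seconds

    def run_serialized(self, node_id, operation):
        with self._lock:       # only the lock holder may initiate the operation
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            start = time.perf_counter()
            operation(node_id)
            self.latencies[node_id] = time.perf_counter() - start
            self.active -= 1

def join_group(node_id):
    """Stand-in for the measured operation (e.g. joining a multicast group)."""
    time.sleep(0.001)

coordinator = Coordinator()
threads = [threading.Thread(target=coordinator.run_serialized, args=(i, join_group))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every operation runs while holding the lock, `max_active` never exceeds 1, so each recorded latency is free of contention from sibling nodes. The cost is the one named on the slide: total run time grows with the number of nodes.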
11 Slowdown Analysis
- Measure time for one logical node on a physical node
- Measure time for n logical nodes
- Develop slowdown factor as function of n
- Apply to measured results
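One way to read these steps, as a sketch under stated assumptions: the slowdown factor is taken to be the ratio of the time with n logical nodes to the time with one, and values of n between calibration points are linearly interpolated. The slides do not give the actual functional form, and the calibration numbers below are made up purely for illustration.

```python
def slowdown_factors(calibration):
    """calibration maps n (logical nodes per machine) to the time of the same operation."""
    base = calibration[1]  # time with a single logical node on the machine
    return {n: t / base for n, t in calibration.items()}

def slowdown_at(factors, n):
    """Linearly interpolate the slowdown factor between calibrated n (assumed model)."""
    points = sorted(factors.items())
    if n <= points[0][0]:
        return points[0][1]
    if n >= points[-1][0]:
        return points[-1][1]   # clamp: no extrapolation beyond the calibrated range
    for (n0, s0), (n1, s1) in zip(points, points[1:]):
        if n0 <= n <= n1:
            return s0 + (s1 - s0) * (n - n0) / (n1 - n0)

def adjust(measured_time, factors, n):
    """Remove the overloading slowdown from a time measured with n logical nodes."""
    return measured_time / slowdown_at(factors, n)

# Hypothetical calibration (milliseconds), not data from the talk.
calibration = {1: 2.0, 10: 5.0, 20: 9.0}
factors = slowdown_factors(calibration)
adjusted = adjust(7.0, factors, 15)  # a 7.0 ms measurement taken with 15 nodes per machine
```

With these illustrative numbers the factor at n = 15 is interpolated between 2.5 and 4.5, giving 3.5, so the 7.0 ms measurement is adjusted to 2.0 ms.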
12 Divide and Conquer
- Divide task into components
  - Must be independent
  - No parallelism
  - Contention only at component boundaries
- Measure components individually in isolation
- Measure occurrences in full system & sum
- Resource contention now omitted from total
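A minimal sketch of the bookkeeping this implies, assuming each component can be invoked as a function and that the full-system run only needs to count how often each component occurs. The helper names and the component names `parse` and `store` are hypothetical.

```python
import time

def measure_in_isolation(component, repetitions=1000):
    """Mean time of one call to `component`, run alone so nothing competes with it."""
    start = time.perf_counter()
    for _ in range(repetitions):
        component()
    return (time.perf_counter() - start) / repetitions

def estimate_total(isolated_times, occurrence_counts):
    """Sum isolated cost times occurrence count over all components."""
    return sum(isolated_times[name] * count
               for name, count in occurrence_counts.items())

# Illustrative values: isolated per-call times in seconds, and the number of
# occurrences counted during the overloaded full-system run.
isolated_times = {"parse": 0.002, "store": 0.005}
occurrence_counts = {"parse": 3, "store": 2}
total = estimate_total(isolated_times, occurrence_counts)
```

The overloaded run contributes only the counts, which contention does not distort, while the times come from uncontended runs. That is why resource contention drops out of the total.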
13 Divide-and-Conquer Example
- Components of dissemination latency in Revere
  - Local processing time
  - Kernel-space crossing
  - Transmission delay (per hop)
- Each component measured in isolation
- Sum multiplied by observed hop count
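The calculation described here can be written out as a short sketch. It assumes one local-processing cost, one kernel-crossing cost, and one transmission delay per hop; the slides do not state how many kernel crossings a hop involves, so that is an assumption. All numbers are hypothetical and in milliseconds.

```python
def dissemination_latency(local_processing, kernel_crossing,
                          per_hop_transmission, hop_count):
    """Estimated latency: (sum of per-hop components) x observed hop count."""
    per_hop = local_processing + kernel_crossing + per_hop_transmission
    return per_hop * hop_count

# Hypothetical inputs: the first two would be measured in isolation,
# the hop count observed on the overloaded testbed.
local_ms = 0.5
kernel_ms = 0.1
hops = 4

# The transmission delay is supplied as a value rather than measured,
# so it can be swept to see how the estimate responds.
estimates = {tx_ms: dissemination_latency(local_ms, kernel_ms, tx_ms, hops)
             for tx_ms in (1.0, 10.0, 100.0)}
```

Supplying the transmission delay as an input means the same measured components can be reused to estimate latency for different assumed network conditions.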
14 Dissemination Latency
[Figure: a Revere node running on Java over the OS, between the previous hop and the next hop. Labeled components: local processing time (measured), kernel-crossing time (measured), per-hop transmission latency (parameter).]
15 Measurement Environment
[Figure: Revere running on Java over the OS, with user space and kernel space labeled.]
Delays
- Sum known times
- Multiply by hop count
16 Open Issues
- Measurement framework for arbitrary applications
- Scalability of locking approach
17 Conclusions
- Method for measuring much larger systems
- Used to measure Revere on 3000 virtual nodes
- Avoids drawbacks of other approaches
An Approach to Measuring Large-Scale Distributed Systems
Jun Li, Peter Reiher, Gerald Popek, and Mark Yarvis (UCLA)
Geoffrey H. Kuenning (Harvey Mudd College)
lijun@lasr.cs.ucla.edu
geoff@cs.hmc.edu
http://lasr.cs.ucla.edu/revere