Measuring a (MapReduce) Data Center
Srikanth Kandula, Sudipta Sengupta, Albert Greenberg, Parveen Patel, Ronnie Chaiken

Typical Data Center Network
[Diagram: servers, top-of-rack (ToR) switches, aggregation (Agg) switches, IP routers]
Top-of-rack switch: 24- or 48-port, 1G to server, 10 Gbps up, ~$7K
Aggregation switch: modular switch, chassis + up to 10 blades, >140 10G ports, $150K-$200K
Less bandwidth up the hierarchy
Clunky routing
e.g., VL2, BCube, FatTree, Portland, DCell

Goal
What does traffic in a datacenter look like?
  A realistic model of data center traffic
  Compare proposals
How to measure a datacenter?
  (Macro-) Who talks to whom? Congestion, its impact
  (Micro-) Flow details: sizes, durations, inter-arrivals, flux

How to measure?
[Diagram: servers, ToR switches, aggregation switches, router]
1. SNMP reports
  per port: in/out octets
  sampled every few minutes
  miss server- or flow-level info
2. Packet traces
  not native on most switches
  hard to set up (port-spans)
3. Sampled NetFlow
  tradeoff: CPU overhead on switch for detailed traces
Use the end-hosts to share load
  auto managed already: MapReduce scripts + distributed FS
Measured 1500 servers for several months

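To make option 1 concrete: a minimal sketch of turning two per-port octet-counter samples into an average link utilization. The function name, parameters, and the 64-bit counter default are assumptions for illustration, not the tooling used in the study. The result is an average over the whole polling interval, which is why this kind of data misses server- and flow-level detail.

```python
def link_utilization(octets_t0, octets_t1, interval_s, link_bps, counter_bits=64):
    """Average utilization (0..1) of a link between two octet-counter samples.

    octets_t0, octets_t1: counter readings at the start and end of the interval
    interval_s: seconds between the two readings
    link_bps: link capacity in bits per second
    counter_bits: counter width (assumed; 32-bit counters wrap much sooner)
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Counters wrap at 2**counter_bits; the modulo recovers the delta
    # as long as at most one wrap happened between the samples.
    delta_octets = (octets_t1 - octets_t0) % (2 ** counter_bits)
    return (delta_octets * 8) / (interval_s * link_bps)
```

For example, a 1 Gbps port whose counter advances by 18,750,000,000 octets in a 5-minute poll averaged 50% utilization, regardless of how bursty the traffic was inside those 5 minutes.
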
Who Talks To Whom?
[Figure: server-to-server traffic matrix; axes: From Server, To Server; color scale: 0, 0.2 Kbps, 20 Kbps, 3 Mbps, 0.4 Gbps, 1 Gbps]
Two patterns dominate
  Most of the communication happens within racks
  Scatter, Gather

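A rough sketch of how a "who talks to whom" matrix could be built from end-host flow records, and how the within-rack share could be computed from it. The record layout `(src, dst, nbytes)` and the `rack_of` mapping are hypothetical; the talk does not specify its data format.

```python
from collections import defaultdict

def traffic_matrix(flows):
    """Sum bytes per (src, dst) server pair from (src, dst, nbytes) records."""
    tm = defaultdict(int)
    for src, dst, nbytes in flows:
        tm[(src, dst)] += nbytes
    return dict(tm)

def within_rack_fraction(tm, rack_of):
    """Fraction of all bytes exchanged between servers in the same rack.

    rack_of: dict mapping server name -> rack name (assumed to be known).
    """
    total = sum(tm.values())
    if total == 0:
        return 0.0
    local = sum(b for (s, d), b in tm.items() if rack_of[s] == rack_of[d])
    return local / total

# Toy input: two racks with two servers each.
flows = [("s1", "s2", 600), ("s1", "s3", 100), ("s2", "s1", 200), ("s3", "s4", 100)]
rack_of = {"s1": "r1", "s2": "r1", "s3": "r2", "s4": "r2"}
tm = traffic_matrix(flows)
print(within_rack_fraction(tm, rack_of))  # 0.9
```
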
Flows
  are small: 80% of bytes are in flows < 200 MB
  are short-lived: 50% of bytes are in flows < 25 s
  turn over quickly: median inter-arrival at ToR = … s
which lead to…
  Traffic engineering schemes should react faster, few elephants
  Localized traffic: additional bandwidth alleviates hotspots

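The flow statistics above are byte-weighted: they ask what share of all bytes travels in flows below a size or duration cutoff. A minimal sketch of that computation, plus a median inter-arrival helper, follows. The flow tuple layout `(size_bytes, duration_s)` and the function names are assumptions for illustration.

```python
import statistics

def byte_fraction(flows, predicate):
    """Fraction of all bytes carried by flows for which predicate(flow) is true.

    flows: iterable of (size_bytes, duration_s) tuples (assumed layout).
    """
    flows = list(flows)
    total = sum(size for size, _ in flows)
    if total == 0:
        return 0.0
    selected = sum(size for size, dur in flows if predicate((size, dur)))
    return selected / total

def median_interarrival(start_times):
    """Median gap between consecutive flow start times (same unit as input)."""
    ts = sorted(start_times)
    if len(ts) < 2:
        raise ValueError("need at least two flow start times")
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return statistics.median(gaps)

# Toy input: three flows as (size_bytes, duration_s).
flows = [(100, 1.0), (300, 10.0), (600, 30.0)]
short_lived_share = byte_fraction(flows, lambda f: f[1] < 25)  # bytes in flows under 25 s
small_share = byte_fraction(flows, lambda f: f[0] < 500)       # bytes in flows under 500 B
```

Weighting by bytes rather than counting flows matters: a handful of large flows can carry most of the traffic even when nearly all flows are tiny.
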
Congestion, its Impact
Are links busy? Often!
  [Figure: contiguous duration of >70% link utilization (seconds)]
Who are the culprits? Apps (Extract, Reduce)
Are apps impacted? Marginally

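The congestion metric on this slide, contiguous duration of >70% link utilization, can be sketched as a run-length computation over a utilization time series. The function name and the fixed sampling period are assumptions; only the 70% threshold comes from the slide.

```python
def congestion_episodes(utilization, sample_period_s, threshold=0.7):
    """Durations (seconds) of maximal runs of samples above the threshold.

    utilization: per-sample link utilization values in [0, 1]
    sample_period_s: time covered by each sample (assumed constant)
    """
    episodes = []
    run = 0  # number of consecutive samples above the threshold so far
    for u in utilization:
        if u > threshold:
            run += 1
        elif run:
            episodes.append(run * sample_period_s)
            run = 0
    if run:  # the series ended while still congested
        episodes.append(run * sample_period_s)
    return episodes

# Toy input: one sample per second on a single link.
series = [0.2, 0.8, 0.9, 0.5, 0.75, 0.71, 0.95, 0.1]
print(congestion_episodes(series, 1))  # [2, 3]
```
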
Measurement Alternatives
Tomography: link utilizations (e.g., from SNMP) → server-to-server traffic matrix
  + make do with easier-to-measure data
  - under-constrained problem → heuristics
    a) gravity
    b) max sparse
    c) tomography + job information

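As an illustration of heuristic (a), here is a minimal sketch of the textbook gravity model: traffic from node i to node j is estimated as proportional to i's total outbound volume times j's total inbound volume. This is the generic form, not necessarily the variant evaluated in the talk, and it assumes per-node totals have already been derived from link counters (for example, from each ToR's uplink).

```python
def gravity_estimate(out_totals, in_totals):
    """Gravity-model traffic matrix: tm[i][j] = out_totals[i] * in_totals[j] / total.

    out_totals[i]: total volume sent by node i
    in_totals[j]: total volume received by node j
    Assumes sum(out_totals) == sum(in_totals); otherwise column sums will not match.
    """
    total = sum(out_totals)
    if total == 0:
        return [[0.0 for _ in in_totals] for _ in out_totals]
    return [[o * i / total for i in in_totals] for o in out_totals]

# Toy input: two nodes.
tm = gravity_estimate([60, 40], [50, 50])
print(tm)  # [[30.0, 30.0], [20.0, 20.0]]
```

The estimate reproduces the measured row and column totals but spreads traffic across all pairs, so it cannot capture a matrix where most pairs exchange nothing. That is the sense in which the problem is under-constrained and extra information is needed.
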
A first look at traffic in a (map-reduce) data center
Some insights
  traffic stays mostly within high-bandwidth regions
  flows are small, short-lived, and turn over quickly
  the network is often highly utilized, with moderate impact on apps
  measuring at end-hosts is feasible, necessary (?)
→ a model for data center traffic