Scalable Integrated Performance Analysis of Multi-Gigabit Networks
Ezra Kissel, U. Delaware
Ahmed El-Hassany, Guilherme Fernandes, Martin Swany, Indiana U.
Dan Gunter, Taghrid Samak, LBNL
Jen Schopf, WHOI

What I hope you learn
1. Why we care about bulk data transfer at multi-gigabit rates
2. Why and how detailed monitoring is helpful
3. How dynamic control of monitoring is related to Session Layer protocols
4/16/12 1

Bulk data transfer needs
Some domains of interest:
–Climate simulation (Earth System Grid)
–Genomics (JGI)
–High-energy physics (Large Hadron Collider)
–Astronomy (Large Synoptic Survey Telescope)
–Astrophysics (FLASH)
Diagram labels: Huge data; Analysis sites

Multi-gigabit rates
Networks connecting national labs and universities have 10 Gb/s and soon 100 Gb/s capability.
One PB ≈ one day at 100 Gb/s
These rates are rarely achieved due to bottlenecks:
–Host: application or disks
–Campus/local networks
–Wide area networks
Hard to tell why, where, or even if there is a problem

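The "one PB per day" figure can be checked with a little arithmetic. The sketch below assumes a decimal petabyte (10^15 bytes) and an ideal link with no protocol overhead; the function name and constants are illustrative and not part of the original slides.

```python
def transfer_time_seconds(num_bytes: float, rate_bits_per_s: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and any bottleneck."""
    return num_bytes * 8 / rate_bits_per_s


PB = 10**15               # decimal petabyte (assumption; the slide does not say which)
SECONDS_PER_DAY = 86_400

for gbps in (10, 100):
    days = transfer_time_seconds(PB, gbps * 10**9) / SECONDS_PER_DAY
    print(f"1 PB at {gbps} Gb/s: {days:.2f} days")
```

At 100 Gb/s the ideal time comes out just under one day, and at 10 Gb/s it is ten times longer, which is why a bottleneck anywhere on the path matters so much.
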
Solution
Monitor all the time
Analyze all the time, but much more when something interesting is happening
Use analysis results as feedback

System components
eXtensible Session Protocol (XSP)
–Associate multiple TCP connections and L2 circuits as a "session"
–Provide channels for bi-directional metadata
NL-Calipers
–Summarize in situ timings of every read/write
BLiPP
–Host and TCP stack info, using XSP channels
perfSONAR
–Standard information formats and exchange protocols

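To make "summarize in situ timings of every read/write" concrete, here is a minimal sketch of the idea in Python. It is not the NL-Calipers API: `RunningSummary` and `timed_read` are hypothetical names, and the running variance uses Welford's algorithm as one reasonable choice for summarizing without storing every sample.

```python
import io
import math
import time


class RunningSummary:
    """Running count, mean, min, max and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0          # sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def timed_read(stream, size, durations, sizes):
    """Read from stream, folding the elapsed time and byte count into summaries."""
    start = time.perf_counter()
    data = stream.read(size)
    elapsed = time.perf_counter() - start
    if data:                    # zero-length reads are not counted
        durations.add(elapsed)
        sizes.add(len(data))
    return data


# Usage: an in-memory stream stands in for a real file or socket.
durations, sizes = RunningSummary(), RunningSummary()
stream = io.BytesIO(b"x" * 10_000)
while timed_read(stream, 4096, durations, sizes):
    pass
print(sizes.n, sizes.min, sizes.max, round(sizes.mean, 1))
```

Keeping only running statistics means the per-call overhead stays constant, which is the point of summarizing in place rather than logging every read and write.
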
Dynamic Session Monitoring
Diagram labels: User; (1) Start xfer; (2) Open session; (3) NL-Calipers data; (4) Signal; TCP; (5) data; Network engineer: look at the performance

Bottleneck detection
Instrumentation:
–Triangles give "instantaneous" throughput
–On fixed intervals, summarize all measurements into mean, min, max, variance for both rate and #bytes
Analysis: pick lowest mean value as bottleneck, apply t-test

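The analysis step can be sketched as follows. The slides say only "pick lowest mean value as bottleneck, apply t-test"; this sketch assumes Welch's unequal-variance t-test between the lowest-mean component and the next lowest, computed from the per-interval summaries. The `Summary` type, the component names, the numbers, and the threshold of 2.0 are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Summary:
    name: str        # instrumented component, e.g. "disk_write"
    n: int           # number of measurements in the interval
    mean: float      # mean rate
    variance: float  # sample variance of the rate


def welch_t(a: Summary, b: Summary):
    """Welch's t statistic and approximate degrees of freedom."""
    va, vb = a.variance / a.n, b.variance / b.n
    se2 = va + vb
    if se2 == 0:
        raise ValueError("both variances are zero; t statistic is undefined")
    t = (a.mean - b.mean) / math.sqrt(se2)
    df = se2 ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1))
    return t, df


def find_bottleneck(summaries, t_threshold=2.0):
    """Pick the lowest-mean component and test it against the runner-up."""
    ranked = sorted(summaries, key=lambda s: s.mean)
    lowest, runner_up = ranked[0], ranked[1]
    t, _df = welch_t(lowest, runner_up)
    # Assumed rule of thumb: |t| >= 2 means the difference is unlikely to be noise.
    return lowest.name, abs(t) >= t_threshold, t


# Hypothetical interval summaries (rates in Gb/s).
interval = [
    Summary("net_read", n=30, mean=4.0, variance=0.09),
    Summary("disk_write", n=30, mean=2.1, variance=0.04),
    Summary("net_write", n=30, mean=4.2, variance=0.10),
]
name, significant, t = find_bottleneck(interval)
print(name, significant, round(t, 2))
```

The test guards against naming a bottleneck when the lowest mean differs from the others only by noise; a production version would compare against a proper critical value for the computed degrees of freedom.
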
TCP throughput
Time series of throughput* for representative TCP experiments:
(a) 1 stream memory-to-disk with 100 ms latency
(b) 1 stream memory-to-memory with no latency
(c) 1 stream disk-to-disk with no latency
(d) 4 streams memory-to-disk with 100 ms latency and 1% loss added at 60 seconds

UDT throughput
Time series of throughput* for representative UDT experiments:
(a) 4 streams memory-to-disk with 100 ms latency
(b) 4 streams memory-to-disk with 100 ms latency and 1% loss added at 60 seconds
(c) 4 streams disk-to-disk with 100 ms latency
(d) 4 streams memory-to-memory with 100 ms latency

Wait, what?

Half as many read()s; the others return zero and are not counted
Variance
Less work being done

Review
Why we care about bulk data transfer at multi-gigabit rates
Why and how detailed monitoring is helpful
How monitoring is related to Session Layer protocols
–and how that might integrate with a management framework
Questions?

Related projects
NetLogger: netlogger.lbl.gov
perfSONAR: perfsonar.org
XSP: damsl.cis.udel.edu/
GENI: geni.net
CEDPS: cedps-scidac.org

Topology-aware Monitoring