1
Dark and Panic Lab, Computer Science, Rutgers University
Evaluating the Impact of Communication Architecture on Performability of Cluster-Based Services
Kiran Nagaraja, Neeraj Krishnan, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen
Rutgers University
2
Motivation
- Network services often use clusters of commodity components
- Various design choices
  - Incl. communication architecture
- Numerous performance studies
- TCP is perceived to be more robust
- Performance vs. availability tradeoff not well understood
3
Our Study
- Evaluate impact of 2 different communication architectures on service performance and availability in presence of faults
  - TCP vs. VIA
  - Kernel-level comm. vs. user-level
  - Mature vs. new technology
  - Differ in fault-model
- Quantify performability (performance and availability)
  - Study systems under various fault scenarios
  - Sensitivity to fault rates and fault classes
- Case study: High performance cluster-based Web server
- Understand tradeoff between high performance and high availability design choices
4
Computing Average Availability
- Assumptions: Faults are non-overlapping and independent
- Parameters: MTTF, MTTR
  - Sources: [Sullivan91, Chillarege95, Iyer99, Talagala99, Trivedi00, Heath02]
- Measure throughput under single fault
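A minimal sketch of how an average availability could be computed under the stated assumptions (non-overlapping, independent faults). The names `FaultClass` and `average_availability`, the per-stage representation, and the weighting of each stage by duration / MTTF are illustrative assumptions, not taken from the slides; the weighting is only a reasonable approximation when repair time is much shorter than MTTF.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FaultClass:
    """Hypothetical description of one fault class."""
    name: str
    mttf_s: float                      # mean time to failure, in seconds
    # (duration_s, throughput) for each stage measured under a single fault;
    # the stage durations together play the role of MTTR.
    stages: List[Tuple[float, float]]


def average_availability(normal_throughput: float, faults: List[FaultClass]):
    """Return (average throughput, average availability).

    Assumes faults never overlap, so the time fractions spent in each
    degraded stage can simply be added up.
    """
    degraded_fraction = 0.0   # fraction of time spent in any degraded stage
    degraded_work = 0.0       # throughput contributed during degraded stages
    for fault in faults:
        for duration_s, throughput in fault.stages:
            weight = duration_s / fault.mttf_s
            degraded_fraction += weight
            degraded_work += weight * throughput
    if degraded_fraction > 1.0:
        raise ValueError("fault load too high for the non-overlapping assumption")
    average_throughput = (1.0 - degraded_fraction) * normal_throughput + degraded_work
    return average_throughput, average_throughput / normal_throughput
```

Calling it with one entry per fault class, each carrying the stage durations and throughputs measured in a single-fault experiment, yields the average throughput and its ratio to the normal throughput.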
5
Effect of Single Fault: Seven Stage Model
- Various phases map the behavior of system under single fault
- All phases may not be necessary
6
Performability Metric
Performability (P) = Tn x log(AI) / log(AA)
- Tn: Throughput under normal execution (normal performance)
- log(AI) / log(AA): Penalty component
  - AI: Availability of “Ideal” system, e.g., 0.99999
  - AA: Average Availability
- Log scale allows linearization with unavailability and reduces the range
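As a small worked sketch, reading the metric as Tn x log(AI) / log(AA): the function name and argument names below are hypothetical, and the default ideal availability is the 0.99999 example from the slide.

```python
import math


def performability(normal_throughput: float,
                   average_availability: float,
                   ideal_availability: float = 0.99999) -> float:
    """P = Tn * log(AI) / log(AA), with both availabilities strictly in (0, 1)."""
    for value in (average_availability, ideal_availability):
        if not 0.0 < value < 1.0:
            raise ValueError("availability must lie strictly between 0 and 1")
    # Both logarithms are negative, so the penalty ratio is positive.
    penalty = math.log(ideal_availability) / math.log(average_availability)
    return normal_throughput * penalty
```

When AA equals AI the penalty is 1 and P equals Tn. Because log(1 - u) is approximately -u for small u, the penalty is roughly the ratio of ideal to actual unavailability, so a system with ten times the unavailability of another scores about one tenth of its performability at equal Tn.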
7
Case Study: PRESS Web Server
- Cluster-based, locality-conscious web server
  - Serve requests out of globally coordinated memory pool
- Several versions developed over time
  - Differ in performance and fault-tolerance
  - Internal communication architecture
- TCP versions: TCP-PRESS, TCP-PRESS-HB
- VIA versions: VIA-PRESS-0, VIA-PRESS-3, VIA-PRESS-5
- Names consistent with previous performance study [HPCA02]
8
PRESS Versions Comparison
PRESS versions (description; fault detection):
- TCP-PRESS: Base version; connection based
- TCP-PRESS-HB: Periodic heartbeats
- VIA-PRESS-0: Base version; connection based
- VIA-PRESS-3: RDMA for comm.; same
- VIA-PRESS-5: RDMA and zero-copy (dynamic pinning); same
General protocol characteristics:
- TCP
  - Assumes: Very few h/w permanent faults, transient faults are common
  - Robust to transient faults
  - OK to lose packets
- VIA
  - Assumes: Faults indicate serious problems
  - Fail-stop model
  - Lost packets are bad
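To illustrate the general idea behind the periodic heartbeats listed for TCP-PRESS-HB, here is a minimal timeout-based detector. It is a generic sketch, not PRESS's implementation: the class name, the heartbeat period, and the missed-heartbeat limit are all assumptions.

```python
class HeartbeatMonitor:
    """Suspect a peer once it has been silent for several heartbeat periods."""

    def __init__(self, peers, period_s=1.0, missed_limit=3):
        # A peer is suspected after `missed_limit` consecutive missed heartbeats.
        self.timeout_s = period_s * missed_limit
        # Time of the last heartbeat seen from each peer (all start at t = 0).
        self.last_seen = {peer: 0.0 for peer in peers}

    def record(self, peer, now_s):
        """Note that a heartbeat from `peer` arrived at time `now_s`."""
        self.last_seen[peer] = now_s

    def suspected(self, now_s):
        """Return the peers whose last heartbeat is older than the timeout."""
        return sorted(peer for peer, seen in self.last_seen.items()
                      if now_s - seen > self.timeout_s)
```

Unlike connection-based detection, which only notices a fault when the connection itself reports an error, a detector of this kind also flags peers that stay connected but stop responding, at the cost of a detection delay bounded by the timeout.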
9
Single-Fault Experiments
Setup:
- 4-PC cluster running at 90% load
  - 800 MHz, 2 SCSI disks, 1 Gbps network
  - TCP over VIA, bare VIA
- 4 client nodes make HTTP requests
  - Rutgers trace
  - Poisson arrival process
Fault Set:
- Link down, switch down
- OS: memory exhaustion, no pin-able memory
- Null pointer, off-by-N pointer value, off-by-N size [Sullivan91]
- Application crash, hang
- Node crash, freeze
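For readers unfamiliar with a Poisson arrival process, the sketch below shows one common way a load generator can schedule request times: by drawing exponentially distributed gaps between requests. The function name, rate, and seed are hypothetical; the slides do not describe how the clients were actually implemented.

```python
import random


def poisson_arrival_times(rate_per_s, duration_s, seed=None):
    """Return request send times (seconds) for a Poisson process.

    Inter-arrival gaps of a Poisson process with rate `rate_per_s` are
    exponentially distributed with mean 1 / rate_per_s.
    """
    rng = random.Random(seed)
    now = 0.0
    times = []
    while True:
        now += rng.expovariate(rate_per_s)
        if now >= duration_s:
            break
        times.append(now)
    return times
```

A client would then replay trace entries at these times; the average number of requests over the run is close to `rate_per_s * duration_s`, while individual gaps vary.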
10
Single-Fault Results
[Figure: Link down]
11
Performance
- VIA-based communication enables higher performance
  - Low latency, less software overhead
12
Performability Results
- Identical fault load for all versions
- Application fault rate 1/month
- All versions of VIA do better than TCP
13
TCP vs. VIA: Program Robustness
- VIA application fault rates: 1/day, 1/week, 1/month
- Programming complexity
- TCP application fault rate: 1/month
- Crossover point
14
VIA under Stressful Fault Load
- Additional fault load
  - Transient packet drops 1/month, system failure 1/month
  - Application faults -> 2/month
- TCP-HB performs slightly better than 2 VIA versions
15
Observations – Cluster Communication
- Match fault-model of network stack to fabric
  - Non-fatal behavior on transient faults
    - TCP is robust to packet drops
  - Fail-stop behavior on permanent faults
- Protocol level fault-avoidance
  - Preserve message boundaries
  - Reduce number of copies
  - Pre-allocate communication resources
- Explicit fault reporting by all components in “path”
  - End-to-End necessary, but may not be sufficient
  - Reduces detection latency
  - Allows more accurate recovery actions
16
Related Work
- Impact of faults on systems
  - Robustness and availability studies
- Protocol performance studies
  - Congestion avoidance and control in WAN
  - Back-off based algorithms
- Interconnects in cluster environment
  - SAN context: Packet drops Serious failures
  - Evidence of faults due to immature technology
  - Fault tolerant interconnects: Myrinet
17
Summary & Conclusion
- Studied impact of communication architecture on service performability
  - Surprisingly, VIA versions delivered better availability
- Comparison under varying fault loads
  - Evaluated architecture maturity and complexity
- Desirable cluster-based protocol characteristics
  - Messaging, single-copy transfers, pre-allocated resources
18
Thank you. Questions?
http://dark-panic.rutgers.edu/Research/vivo