Dark and Panic Lab, Computer Science, Rutgers University: Evaluating the Impact of Communication Architecture on Performability of Cluster-Based Services


1 Evaluating the Impact of Communication Architecture on Performability of Cluster-Based Services
Kiran Nagaraja, Neeraj Krishnan, Ricardo Bianchini, Richard P Martin, Thu D Nguyen
Dark and Panic Lab, Computer Science, Rutgers University

2 Motivation
- Network services often use clusters of commodity components
- Various design choices
  - Incl. communication architecture
- Numerous performance studies
- TCP is perceived to be more robust
- Performance vs. availability tradeoff not well understood

3 Our Study
- Evaluate impact of 2 different communication architectures on service performance and availability in the presence of faults
  - TCP vs. VIA
  - Kernel-level comm. vs. user-level
  - Mature vs. new technology
  - Differ in fault model
- Quantify performability (performance and availability)
- Study systems under various fault scenarios
  - Sensitivity to fault rates and fault classes
- Case study: high-performance cluster-based Web server
- Understand tradeoff between high-performance and high-availability design choices

4 Computing Average Availability
- Assumptions:
  - Faults are non-overlapping and independent
- Parameters: MTTF, MTTR
  - Sources: [Sullivan91, Chillarege95, Iyer99, Talagala99, Trivedi00, Heath02]
- Measure throughput under single fault
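The slide lists the inputs (MTTF, MTTR, throughput measured under a single fault) but not the formula that combines them. The sketch below is one plausible way to do it, not the authors' stated method. It assumes that each fault class occupies a fraction MTTR/MTTF of total time (reasonable only when faults are rare and non-overlapping, as the slide assumes) and that average availability is average throughput divided by normal throughput. The names `FaultClass` and `average_availability` are hypothetical, and the numbers in the example are illustrative, not measurements.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FaultClass:
    """One fault class; all field names are hypothetical."""
    name: str
    mttf_s: float         # mean time to failure, in seconds
    mttr_s: float         # mean time spent degraded per fault, in seconds
    degraded_tput: float  # average throughput measured while the fault is active


def average_availability(normal_tput: float, faults: Iterable[FaultClass]) -> float:
    """Average throughput over normal throughput, assuming rare,
    non-overlapping, independent faults (assumption, see lead-in)."""
    fault_time = 0.0      # fraction of time spent under some fault
    degraded_work = 0.0   # throughput contributed while degraded
    for f in faults:
        fraction = f.mttr_s / f.mttf_s  # approximate time share of this fault class
        fault_time += fraction
        degraded_work += fraction * f.degraded_tput
    if fault_time > 1.0:
        raise ValueError("fault fractions exceed 100% of time; "
                         "the non-overlapping assumption does not hold")
    avg_tput = (1.0 - fault_time) * normal_tput + degraded_work
    return avg_tput / normal_tput


if __name__ == "__main__":
    # Illustrative values only.
    faults = [
        FaultClass("link down", mttf_s=180 * 86400, mttr_s=180, degraded_tput=3000.0),
        FaultClass("node crash", mttf_s=14 * 86400, mttr_s=300, degraded_tput=2500.0),
    ]
    print(average_availability(4000.0, faults))
```

Because the time share of each fault class is added up independently, the result is only meaningful while the total stays well below 1; the function raises an error rather than returning a negative normal-time share.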

5 Effect of Single Fault: Seven Stage Model
- Various phases map the behavior of the system under a single fault
- Not all phases are necessarily present for every fault

6 Performability Metric
$P = T_n \times \dfrac{\log(A_I)}{\log(A_A)}$
(normal performance: $T_n$; penalty component: $\log(A_I)/\log(A_A)$)
- $T_n$: throughput under normal execution
- $A_I$: availability of an "ideal" system, e.g., 0.99999
- $A_A$: average availability
- Log scale allows linearization with unavailability and reduces the range
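As a minimal sketch, the metric on this slide can be computed directly, reading the flattened formula as P = Tn x log(AI) / log(AA). The function name `performability` and its parameter names are hypothetical; the default ideal availability of 0.99999 is the example value given on the slide.

```python
import math


def performability(normal_tput: float,
                   avg_availability: float,
                   ideal_availability: float = 0.99999) -> float:
    """P = Tn * log(AI) / log(AA), as read from the slide.

    Both availabilities must lie strictly between 0 and 1 so that the
    logarithms are defined and non-zero.
    """
    for a in (avg_availability, ideal_availability):
        if not 0.0 < a < 1.0:
            raise ValueError("availability must be strictly between 0 and 1")
    penalty = math.log(ideal_availability) / math.log(avg_availability)
    return normal_tput * penalty


if __name__ == "__main__":
    # Illustrative throughput value, not a measurement from the study.
    for aa in (0.999, 0.9999, 0.99999):
        print(aa, performability(4000.0, aa))
```

When the average availability equals the ideal availability, the penalty is 1 and P equals the normal throughput. Since log(A) is close to -(1 - A) for A near 1, the penalty is roughly the ratio of ideal to actual unavailability, which is the linearization the slide refers to.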

7 Case Study: PRESS Web Server
- Cluster-based, locality-conscious web server
  - Serve requests out of globally coordinated memory pool
- Several versions developed over time
  - Differ in performance and fault tolerance
  - Internal communication architecture
- TCP versions
  - TCP-PRESS, TCP-PRESS-HB
- VIA versions
  - VIA-PRESS-0, VIA-PRESS-3, VIA-PRESS-5
- Names consistent with previous performance study [HPCA02]

8 PRESS Versions Comparison
PRESS versions (description; fault detection):
- TCP-PRESS: base version; connection based
- TCP-PRESS-HB: periodic heartbeats
- VIA-PRESS-0: base version; connection based
- VIA-PRESS-3: RDMA for comm.; same
- VIA-PRESS-5: RDMA and zero-copy (dynamic pinning); same
General protocol characteristics:
- TCP
  - Assumes: very few permanent h/w faults, transient faults are common
  - Robust to transient faults
  - OK to lose packets
- VIA
  - Assumes: faults indicate serious problems
  - Fail-stop model
  - Lost packets are bad

9 Single-Fault Experiments
- Setup: 4-PC cluster running at 90% load
  - 800 MHz, 2 SCSI disks, 1 Gbps network
  - TCP over VIA, bare VIA
- 4 client nodes make HTTP requests
  - Rutgers trace
  - Poisson arrival process
- Fault set
  - Link down, switch down
  - OS: memory exhaustion; OS: no pin-able memory
  - Null pointer, off-by-N pointer value, off-by-N size [Sullivan91]
  - Application crash, hang
  - Node crash, freeze

10 Single-Fault Results
Figure: link down

11 Performance
- VIA-based communication enables higher performance
  - Low latency, less software overhead

12 Performability Results
- Identical fault load for all versions
  - Application fault rate: 1/month
- All versions of VIA do better than TCP

13 TCP vs. VIA: Program Robustness
- VIA application fault rates: 1/day, 1/week, 1/month
  - Programming complexity
- TCP application fault rate: 1/month
Figure label: crossover point

14 VIA under Stressful Fault Load
- Additional fault load
  - Transient packet drops: 1/month; system failure: 1/month
  - Application faults -> 2/month
- TCP-HB performs slightly better than 2 VIA versions

15 Observations: Cluster Communication
- Match fault model of network stack to fabric
  - Non-fatal behavior on transient faults
    - TCP is robust to packet drops
  - Fail-stop behavior on permanent faults
- Protocol-level fault avoidance
  - Preserve message boundaries
  - Reduce number of copies
  - Pre-allocate communication resources
- Explicit fault reporting by all components in "path"
  - End-to-end necessary, but may not be sufficient
  - Reduces detection latency
  - Allows more accurate recovery actions

16 Related Work
- Impact of faults on systems
  - Robustness and availability studies
- Protocol performance studies
  - Congestion avoidance and control in WAN
    - Back-off based algorithms
- Interconnects in cluster environment
  - SAN context: packet drops -> serious failures
  - Evidence of faults due to immature technology
  - Fault-tolerant interconnects: Myrinet

17 Summary & Conclusion
- Studied impact of communication architecture on service performability
  - Surprisingly, VIA versions delivered better availability
- Comparison under varying fault loads
  - Evaluated architecture maturity and complexity
- Desirable cluster-based protocol characteristics
  - Messaging, single-copy transfers, pre-allocated resources

18 Thank you. Questions?
