Evaluating the Impact of Communication Architecture on Performability of Cluster-Based Services
Dark and Panic Lab, Computer Science, Rutgers University

Slide 1: Evaluating the Impact of Communication Architecture on Performability of Cluster-Based Services
Kiran Nagaraja, Neeraj Krishnan, Ricardo Bianchini, Richard P. Martin, Thu D. Nguyen
Rutgers University

Slide 2: Motivation
- Network services often use clusters of commodity components
- Various design choices
  - Including communication architecture
- Numerous performance studies
- TCP is perceived to be more robust
- Performance vs. availability tradeoff not well understood

Slide 3: Our Study
- Evaluate impact of 2 different communication architectures on service performance and availability in presence of faults
  - TCP vs. VIA
    - Kernel-level comm. vs. user-level
    - Mature vs. new technology
    - Differ in fault model
- Quantify performability (performance and availability)
- Study systems under various fault scenarios
  - Sensitivity to fault rates and fault classes
- Case study: high-performance cluster-based Web server
- Understand tradeoff between high performance and high availability design choices

Slide 4: Computing Average Availability
- Assumptions:
  - Faults are non-overlapping and independent
- Parameters: MTTF, MTTR
  - Sources: [Sullivan91, Chillarege95, Iyer99, Talagala99, Trivedi00, Heath02]
- Measure throughput under single fault
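As a rough illustration of how the MTTF and MTTR parameters could be combined with the measured single-fault throughput, the sketch below assumes two things. Each fault class is active for a fraction MTTR/MTTF of the time, which leans on the non-overlapping, independent faults assumption. Average availability is average throughput divided by normal throughput. The slide does not spell out this formula, and the `FaultClass` type, the function name, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FaultClass:
    """One fault type; all field values used below are made-up examples."""
    name: str
    mttf_s: float            # mean time to failure, seconds
    mttr_s: float            # mean time to repair, seconds
    fault_throughput: float  # average throughput while the fault is active


def average_availability(normal_throughput, faults):
    """Assumed model: AA = average throughput / normal throughput.

    Each fault is taken to be active for a fraction mttr/mttf of the time,
    which only makes sense if faults never overlap.
    """
    fault_fraction = sum(f.mttr_s / f.mttf_s for f in faults)
    if fault_fraction > 1.0:
        raise ValueError("fault time exceeds total time; faults would overlap")
    degraded = sum((f.mttr_s / f.mttf_s) * f.fault_throughput for f in faults)
    average = (1.0 - fault_fraction) * normal_throughput + degraded
    return average / normal_throughput


MONTH_S = 30 * 24 * 3600  # seconds in a 30-day month
example_faults = [
    FaultClass("link down", mttf_s=6 * MONTH_S, mttr_s=180, fault_throughput=4000),
    FaultClass("application crash", mttf_s=MONTH_S, mttr_s=60, fault_throughput=3000),
]
aa = average_availability(5000, example_faults)
print(f"average availability = {aa:.6f}")
```

With an empty fault list the function returns 1.0, and a fault that drops throughput to zero for 1% of the time yields 0.99, which matches the intuitive reading of availability.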

Slide 5: Effect of Single Fault: Seven-Stage Model
- Various phases map the behavior of the system under a single fault
- Not every phase is necessarily present for a given fault
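One way to turn the stage model into a single number is to record, for each stage that actually occurs, how long it lasts and what throughput the service delivers, then take the time-weighted mean. The sketch below does that. The stage descriptions and values are hypothetical placeholders, since the slide text does not name the seven stages.

```python
def fault_average_throughput(stages):
    """Time-weighted mean throughput over the stages of one fault.

    `stages` is a list of (duration_seconds, throughput) pairs. Stages that
    do not occur for a given fault are simply left out of the list.
    """
    total_time = sum(duration for duration, _ in stages)
    if total_time <= 0:
        raise ValueError("at least one stage must have a positive duration")
    work = sum(duration * throughput for duration, throughput in stages)
    return work / total_time


# Hypothetical measurements for one fault (not taken from the study).
observed_stages = [
    (15.0, 0.0),      # fault present but not yet detected
    (45.0, 3500.0),   # running degraded after reconfiguration
    (120.0, 4200.0),  # component repaired, service warming back up
]
print(fault_average_throughput(observed_stages))
```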

Slide 6: Performability Metric

$$P = T_n \times \frac{\log(A_I)}{\log(A_A)}$$

- $T_n$ is the normal-performance component; $\log(A_I)/\log(A_A)$ is the penalty component
- $T_n$: throughput under normal execution
- $A_I$: availability of an "ideal" system, e.g., 0.99999
- $A_A$: average availability
- Log scale allows linearization with unavailability and reduces the range
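The metric is easy to evaluate directly. The following minimal sketch computes it from the three quantities defined on the slide. The function name and the example inputs (5000 requests/s, availabilities of 0.999 and 0.998) are illustrative assumptions, not results from the study.

```python
import math


def performability(tn, aa, ai=0.99999):
    """P = Tn * log(AI) / log(AA), as defined on the slide.

    tn: throughput under normal execution
    aa: average availability, strictly between 0 and 1
    ai: availability of the "ideal" system (the slide's example is 0.99999)
    """
    if not (0.0 < aa < 1.0 and 0.0 < ai < 1.0):
        raise ValueError("availabilities must lie strictly between 0 and 1")
    return tn * math.log(ai) / math.log(aa)


# Hypothetical inputs: 5000 requests/s at two availability levels.
for example_aa in (0.999, 0.998):
    print(example_aa, round(performability(5000, example_aa), 2))
```

For availabilities close to 1, log(A) is approximately -(1 - A), so P is roughly Tn × (1 - AI) / (1 - AA). Doubling the unavailability therefore roughly halves P, which is the linearization the slide refers to. A system that reaches the ideal availability scores exactly Tn.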

Slide 7: Case Study: PRESS Web Server
- Cluster-based, locality-conscious web server
  - Serves requests out of a globally coordinated memory pool
- Several versions developed over time
  - Differ in performance and fault tolerance
  - Internal communication architecture
- TCP versions
  - TCP-PRESS, TCP-PRESS-HB
- VIA versions
  - VIA-PRESS-0, VIA-PRESS-3, VIA-PRESS-5
- Names consistent with previous performance study [HPCA02]

Slide 8: PRESS Versions Comparison
PRESS versions (description; fault detection):
- TCP-PRESS: base version; connection based
- TCP-PRESS-HB: periodic heartbeats
- VIA-PRESS-0: base version; connection based
- VIA-PRESS-3: RDMA for comm.; same
- VIA-PRESS-5: RDMA and zero-copy (dynamic pinning); same
General protocol characteristics:
- TCP
  - Assumes: very few permanent h/w faults, transient faults are common
  - Robust to transient faults
  - OK to lose packets
- VIA
  - Assumes: faults indicate serious problems
  - Fail-stop model
  - Lost packets are bad

Slide 9: Single-Fault Experiments
- Setup: 4-PC cluster running at 90% load
  - 800 MHz, 2 SCSI disks, 1 Gbps network
  - TCP over VIA, bare VIA
- 4 client nodes make HTTP requests
  - Rutgers trace
  - Poisson arrival process
- Fault set
  - Link down, switch down
  - OS: memory exhaustion; OS: no pin-able memory
  - Null pointer, off-by-N pointer value, off-by-N size [Sullivan91]
  - Application crash, hang
  - Node crash, freeze
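The client load generator is not described in the slides. As a minimal sketch of what a Poisson arrival process means for the request schedule, the code below draws exponentially distributed gaps between requests. The function name, request rate, and duration are hypothetical.

```python
import random


def poisson_arrival_times(rate_per_s, duration_s, seed=None):
    """Return request send times (seconds) for a Poisson arrival process.

    Inter-arrival gaps of a Poisson process are exponentially distributed
    with mean 1 / rate_per_s.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate_per_s)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times


# Hypothetical load: 100 requests/s for 60 s from one client node.
arrivals = poisson_arrival_times(rate_per_s=100.0, duration_s=60.0, seed=1)
print(len(arrivals), "requests scheduled")
```

A client would sleep until each scheduled time and then issue the next request from the trace. Fixing the seed makes a run repeatable across server versions.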

Slide 10: Single-Fault Results
- Link down

Slide 11: Performance
- VIA-based communication enables higher performance
  - Low latency, less software overhead

Slide 12: Performability Results
- Identical fault load for all versions
  - Application fault rate: 1/month
- All versions of VIA do better than TCP

Slide 13: TCP vs. VIA: Program Robustness
- VIA application fault rates: 1/day, 1/week, 1/month
  - Programming complexity
- TCP application fault rate: 1/month
- Cross-over point

Slide 14: VIA under Stressful Fault Load
- Additional fault load
  - Transient packet drops: 1/month; system failure: 1/month
  - Application faults -> 2/month
- TCP-HB performs slightly better than 2 VIA versions

Slide 15: Observations: Cluster Communication
- Match fault model of network stack to fabric
  - Non-fatal behavior on transient faults
    - TCP is robust to packet drops
  - Fail-stop behavior on permanent faults
- Protocol-level fault avoidance
  - Preserve message boundaries
  - Reduce number of copies
  - Pre-allocate communication resources
- Explicit fault reporting by all components in "path"
  - End-to-end necessary, but may not be sufficient
  - Reduces detection latency
  - Allows more accurate recovery actions

Slide 16: Related Work
- Impact of faults on systems
  - Robustness and availability studies
- Protocol performance studies
  - Congestion avoidance and control in WAN
    - Back-off based algorithms
- Interconnects in cluster environment
  - SAN context: packet drops -> serious failures
  - Evidence of faults due to immature technology
  - Fault-tolerant interconnects: Myrinet

Slide 17: Summary & Conclusion
- Studied impact of communication architecture on service performability
  - Surprisingly, VIA versions delivered better availability
- Comparison under varying fault loads
  - Evaluated architecture maturity and complexity
- Desirable cluster-based protocol characteristics
  - Messaging, single-copy transfers, pre-allocated resources

Slide 18: Thank you. Questions?
http://dark-panic.rutgers.edu/Research/vivo
