Performance and Fault Tolerance

Reasoning Systems: Performance and Fault Tolerance
Dan Reed
Chancellor’s Eminent Professor
Vice Chancellor for IT and CIO
University of North Carolina at Chapel Hill
Director, Renaissance Computing Institute (RENCI)
Duke University
North Carolina State University

VGrADS Vision
Renaissance Computing Institute

Some GrADS Lessons
Dynamic adaptation is really hard
  especially if one adapts at very low levels
Performance model decomposition is also really hard
  separating application behavior and hardware attributes
Grid application development should not require a Ph.D.
  nor should having a Ph.D. be a liability
Multivariate optimization is challenging
  conflicting needs and goals
Implications
  Thoreau was right: simplify, simplify, simplify
  consider higher level specifications for adaptation
    qualitative specifications
    fault tolerance and performance -> performability

Challenges
Overall goal
  raise the level of discourse about classification
Large-scale fault tolerance
  prediction: observed failures and measured predictors
  adaptation: resilience and recovery
Application susceptibility
  sensitivity to faults and performance changes
  implication for resource selection
Behavioral specification and variability
  qualitative specifications
  variability detection
Multivariate specification and validation

Behavioral Classification and Adaptation
Performance variability
  temporal classification
    frequently, usually, seldom, …
  spatial classification (processors, tasks, …)
    all, most, some, a few, …
  temporal/spatial relative to behavioral specification
    close, far, lowBW, highBW, reliable, unreliable
  mechanisms
    multivariate statistical characterization
    bounding regions and isosurfaces/clustering
Fault tolerance
  predictive techniques for probability of failure
  resource classes and capabilities
    coupled to application usage modes
  resilience implementation mechanisms
    adaptive checkpoint frequency
    in-memory checkpoints
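One way to picture the spatial classification above is as a mapping from the fraction of processors that violate a behavioral specification to a qualitative label. The sketch below is illustrative only: the function name `spatial_label` and the numeric cut points (1.0, 0.75, 0.25) are assumptions, since the slides give the labels but not the boundaries between them.

```python
# Assumed cut points: the slides list the labels only, not numeric boundaries.
SPATIAL_CUT_POINTS = [(1.0, "all"), (0.75, "most"), (0.25, "some")]


def spatial_label(violating, total):
    """Map a count of out-of-spec processors to a qualitative label."""
    if total <= 0 or not 0 <= violating <= total:
        raise ValueError("need 0 <= violating <= total and total > 0")
    fraction = violating / total
    if fraction == 0:
        return "none"
    for lower_bound, label in SPATIAL_CUT_POINTS:
        if fraction >= lower_bound:
            return label
    # Any nonzero fraction below the smallest cut point.
    return "a few"


if __name__ == "__main__":
    # Hypothetical measurements: processors (out of 64) running below spec.
    for count in (64, 50, 20, 3, 0):
        print(count, "of 64 ->", spatial_label(count, 64))
```

The same table-driven approach could serve the temporal labels (frequently, usually, seldom), given an agreed set of boundaries.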

Monitoring, Prediction and Classification
Fault tolerance
  predictive techniques for probability of failure
  resource classes and capabilities
  resilience implementation mechanisms
    adaptive checkpoint frequency
    in-memory checkpoints
Batch scheduling and checkpointing
  checkpoint frequency based on predictors
  adaptive triggering based on measurement
  batch queue selection based on failure probability
In-memory checkpointing for resilience
  collaborative approach with Jack
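The slides do not say how checkpoint frequency would be derived from failure predictors. As one possible illustration, the sketch below uses Young's first-order approximation, in which the checkpoint interval is sqrt(2 x checkpoint cost x mean time between failures). This is a swapped-in, well-known technique rather than the method used in the project, and the function name and sample numbers are assumptions.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order checkpoint interval, in seconds.

    checkpoint_cost_s: time to write one checkpoint (assumed known).
    mtbf_s: predicted mean time between failures (assumed to come
            from whatever failure predictor is in use).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("both arguments must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    # Hypothetical numbers: as the predicted MTBF drops, checkpoint more often.
    for mtbf in (86400.0, 21600.0, 3600.0):
        print(f"MTBF {mtbf:>8.0f} s -> interval {young_interval(60.0, mtbf):7.1f} s")
```

Recomputing the interval whenever the predictor revises its MTBF estimate gives a simple form of adaptive checkpoint frequency.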

SMART Disks
SMART
  Self-Monitoring, Analysis and Reporting Technology
  on-disk monitoring and data analysis
  ATA/IDE and SCSI support
Typical SMART capabilities
  head flying height, data throughput, spin up time
  reallocated sector count, seek error rate, seek time performance
  spin retry count, drive calibration retry count, temperature
Drive spin up time (for example)
  indicative of motor or bearing failure
By monitoring, one can identify
  performance problems
  failure probability
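A minimal sketch of turning SMART readings into a coarse failure classification follows. It relies on the SMART convention that each attribute has a vendor-normalized value and a failure threshold, with the attribute considered failed once the value is at or below the threshold. The `SmartAttribute` type, the `warn_margin` parameter, and the sample readings are hypothetical; no real disk-query API is used here.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    value: int      # vendor-normalized current value (higher is healthier)
    threshold: int  # vendor failure threshold


def classify_drive(attributes, warn_margin=10):
    """Return 'failing', 'at risk', or 'healthy' for a set of attributes.

    warn_margin is an assumed safety band above each threshold.
    """
    if any(a.value <= a.threshold for a in attributes):
        return "failing"
    if any(a.value <= a.threshold + warn_margin for a in attributes):
        return "at risk"
    return "healthy"


if __name__ == "__main__":
    # Hypothetical readings, not taken from a real drive.
    drive = [
        SmartAttribute("spin_up_time", 95, 21),
        SmartAttribute("reallocated_sector_count", 40, 36),
        SmartAttribute("seek_error_rate", 80, 30),
    ]
    print(classify_drive(drive))
```

A classification like this could then feed resource selection, for example by steering jobs away from nodes whose disks are "at risk".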

ACPI Power Control
ACPI
  Advanced Configuration and Power Interface
  HP, Intel, Microsoft, Phoenix and Toshiba
  OS management of system power consumption
  originally targeted at laptop/mobile device market
ACPI defines the following
  hardware registers on chip
  BIOS interfaces
Thermal failure is a big issue for HPC systems
  monitor and react in many ways
    processor clock speed based on code
    disk spin down to conserve power/reduce heat

Power Consumption and Failure Modes
Arrhenius equation: temperature implications
  mean time to catastrophic failure of commercial silicon
  halves for every 10 degrees C rise above 70 degrees C
Autopilot tagged sensors
  SMART (disk temperature)
  ACPI (thermal zone)
    active cooling policy and throttling
  lm_tools (CPU and board temperature)
  Accessible via Autopilot Manager
Simple example
  floating point computation launched at T=12 seconds
  immediate CPU temperature increase
  motherboard temperature increase at T=100 seconds
  computation terminated at T=135 seconds
Source: Kevin Gamiel
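The temperature rule of thumb can be made concrete with a small calculation. The sketch below reads the rule as mean time to failure halving for each 10 degrees C above a 70 degrees C reference, the usual engineering reading of the Arrhenius relationship. The function name is an assumption, as is the choice to return 1.0 at or below the reference temperature, where the slide states no rule.

```python
def relative_mttf(temp_c, ref_temp_c=70.0, halving_step_c=10.0):
    """MTTF at temp_c relative to MTTF at the reference temperature.

    Assumption: at or below the reference temperature the factor is
    1.0, because the rule of thumb is only stated for temperatures
    above it.
    """
    if temp_c <= ref_temp_c:
        return 1.0
    return 0.5 ** ((temp_c - ref_temp_c) / halving_step_c)


if __name__ == "__main__":
    for temp in (70, 75, 80, 90, 100):
        print(f"{temp:>3} C -> {relative_mttf(temp):.3f} x reference MTTF")
```

Under this reading, a node held at 90 degrees C would be expected to last about a quarter as long as one held at 70 degrees C.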

Power Consumption and Failure Modes
Load Induced
Source: Kevin Gamiel

Dynamic Behavioral Adaptation

Diskless Checkpointing
Diagram labels: Application; MPI Interface; UNIX I/O; Diskless Checkpoint; Fault Tolerant MPI; Space Optimization; MPI Fault Detection & Automatic Recovery; Trigger Recovery; Storage Choice; Data Recovery; Redundancy Encoding; User messages; Heartbeat; High Speed Interconnect
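The diagram names a redundancy encoding step but does not say which code is used. One common choice in diskless checkpointing is XOR parity: each process keeps its checkpoint in memory, a parity block is held elsewhere, and the checkpoint of a single failed process is rebuilt from the survivors plus the parity. The sketch below simulates that idea with byte strings in one process. The function names are hypothetical, and no MPI calls are involved.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints):
    """Compute the XOR parity block over all in-memory checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    length = len(checkpoints[0])
    if any(len(c) != length for c in checkpoints):
        raise ValueError("checkpoints must have equal length")
    return reduce(xor_bytes, checkpoints)


def recover_lost(surviving, parity):
    """Rebuild the one missing checkpoint from survivors and parity."""
    return reduce(xor_bytes, surviving, parity)


if __name__ == "__main__":
    # Hypothetical per-process state, padded to equal length.
    checkpoints = [b"rank0-state.", b"rank1-state.", b"rank2-state."]
    parity = encode_parity(checkpoints)
    # Simulate losing rank 1 and rebuilding it from ranks 0 and 2.
    rebuilt = recover_lost([checkpoints[0], checkpoints[2]], parity)
    print(rebuilt == checkpoints[1])
```

A single parity block tolerates one failure per group; tolerating more simultaneous failures would need a stronger code.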
