
Innovation and information technology June 20, 2005 Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box Leangsuksun, Associate.


1 Innovation and information technology, June 20, 2005
Research Efforts toward Non-Stop Services in High End and Enterprise Computing
Box Leangsuksun, Associate Professor, Computer Science; Director, eXtreme Computing Research (XCR)
HA-OSCAR: unleashing HA Beowulf

2 Research Collaborators
–National, Academic and Industry Labs
ORNL
Intel, Dell, Ericsson
Lucent, CRAY
IU, NCSA, OSU, NCSU, UNM, TTU
Systran
OSDL (Linus is here)
ANL, LLNL

3 Service Unavailability Impacts
No performance and no functionality
Losses of $195K to $58M with 3.5 hrs of downtime (Meta Group report, 2000) (enterprise)
Enterprise/shared major computing resources are needed 24/7/365 (enterprise/HPC-HEC)
Critical HPC apps such as national security (homeland defense) (HPC-HEC)
Service provider regulation/mandate
–FCC mandate (Class 5 local switch = 5 9’s)
Lost time and opportunities
Life-threatening consequences

4 RASS Definitions
Reliability (MTTF, mean time to failure)
–How quickly does it fail?
Availability
–What is the total uptime?
–Availability = MTTF / (MTTF + MTTR), where MTTR is the mean time to repair
Serviceability
–How fast can the system be built, managed and upgraded?
–Planned outages account for 60% of total outages
Security
–Will impact availability
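The availability formula above can be turned into yearly downtime with a few lines of arithmetic. The following is a minimal sketch, not part of the original slides; the function names are illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def availability(mttf_hours, mttr_hours):
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected downtime per year, in minutes, for a given availability (0..1)."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical node: fails every 5000 hours on average, 5 hours to repair.
    a = availability(5000.0, 5.0)
    print(f"availability = {a:.6f}")
    # "Five 9's" (99.999%) availability target.
    print(f"five nines -> {downtime_minutes_per_year(0.99999):.2f} min/year")
```

Under these assumptions, a five 9's target leaves a little over five minutes of downtime per year.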

5 High Availability Open Source Cluster Application Resources (HA-OSCAR)
HA-OSCAR: unleashing HA Beowulf

6 HA-OSCAR overview
Production-quality open source Linux cluster project
HA and HPC clustering techniques to enable critical HPC infrastructure
Self-configuring multi-head Beowulf system
HA-enabled HPC services: active/hot standby
Self-healing with 3-5 sec automatic failover time
The first known field-grade open source HA Beowulf cluster release

7 Monitoring & Self-healing cores
Service Monitor: PBS, MAUI, NFS, HTTP services are monitored
Resource Monitor: load_average, disk_usage, free_memory are monitored
Health channel Monitor: eth0, eth0:1 interfaces are monitored
Self-Healing Daemon
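To make the resource-monitoring idea concrete, here is a minimal sketch of a threshold check over two of the resources named above. It is not HA-OSCAR's implementation: the threshold values, function names and the choice of Python standard-library calls are assumptions for illustration, and free_memory is left out because the standard library has no portable call for it.

```python
import os
import shutil

# Hypothetical limits; real values would be site-specific.
THRESHOLDS = {"load_average": 8.0, "disk_usage": 0.90}


def collect_readings(path="/"):
    """Read the 1-minute load average and the used fraction of one filesystem."""
    load_1min, _, _ = os.getloadavg()  # available on Unix-like systems only
    usage = shutil.disk_usage(path)
    return {"load_average": load_1min, "disk_usage": usage.used / usage.total}


def find_alerts(readings, thresholds=THRESHOLDS):
    """Return the sorted names of resources whose reading exceeds its threshold."""
    return sorted(
        name
        for name, value in readings.items()
        if name in thresholds and value > thresholds[name]
    )


if __name__ == "__main__":
    print(find_alerts(collect_readings()))
```

A self-healing daemon would call something like `find_alerts` periodically and hand any alerts to its recovery logic.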

8 Monitoring and recovery
Enhancements based on the kernel.org MON, IPMI, and net-SNMP frameworks
Recovery
–Associative Response
Local recovery, e.g. restart, checkpoint
Failover (simple or impersonate/clone)
Admin-defined actions
–Adaptive Response
Previous state and number of retries
Acceleration (time series)
E.g. if maui dies, restart it; after 3 retries within 3 mins, fail over
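The adaptive response example (restart first, fail over after 3 retries within 3 minutes) can be sketched as a small sliding-window policy. This is an illustrative sketch only; the class and method names are hypothetical and do not come from HA-OSCAR.

```python
from collections import deque


class AdaptiveResponse:
    """Restart a failed service, but fail over once it has already been
    restarted max_retries times within the last window_seconds."""

    def __init__(self, max_retries=3, window_seconds=180.0):
        self.max_retries = max_retries
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of recent restarts

    def on_failure(self, now):
        """Return "restart" or "failover" for a failure observed at time now (seconds)."""
        # Forget restarts that fall outside the sliding window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_retries:
            return "failover"
        self._restarts.append(now)
        return "restart"


if __name__ == "__main__":
    policy = AdaptiveResponse()
    for t in (0, 30, 60, 90):  # maui dies four times in 90 seconds
        print(t, policy.on_failure(t))
```

Failures that are spread further apart than the window never accumulate, so an occasionally crashing service keeps being restarted locally.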

9 Appeared on the front cover of two major Linux magazines, and in various technical papers and research exhibitions.
Web site: http://xcr.cenit.latech.edu/ha-oscar
HA-OSCAR beta was released to the open source community in March 2004

10 Ongoing R&D work (lab-grade enhancements)

11 Reliability Modeling for Dummies

12 UML based Approach
UML Representation of System Architecture
XMI Representation with Embedded Dependability Information
Extracting Dependability Parameters and Building Logical Representation
Results showing Reliability and Availability of System
Semantic Mapping and Dependability Modeling

13 An example of UML tools

14 Examples in UML diagrams

15 Example of HA-OSCAR
A single head cluster
Component   λ         µ
Node        21E-05    12E-04
Switch      32E-05    1E-03
Client1     67E-05    32E-04
Client2     89E-05    15E-04
Client3     76E-05    16E-04
Client4     54E-05    19E-04
System Unreliability: 1.7804E-01
MTTF: 300 days
System Instantaneous Unavailability: 2.87771E-03
Availability Percentage: 99.71
System Downtime Per Year: 25.2 hrs
HA-OSCAR
Component   λ         µ
Node 1      3.4E-05   2E-05
Node 2      8.6E-05   12E-04
Switch 1    1E-05     2E-04
Switch 2    1.3E-05   2.1E-04
Client 1    2.5E-05   32E-04
Client 2    9.8E-05   4E-04
Client 3    6.7E-05   5E-04
Client 4    3.5E-05   21E-05
System Unreliability: 92.1138E-03
MTTF: 331 days
System Instantaneous Unavailability: 2.10727E-05
Availability Percentage: 99.997
System Downtime Per Year: 11 min
[Diagram: the 16 Node-Switch-Client paths (Node1 or Node2, Switch1 or Switch2, Client1 to Client4), labeled id=0 to id=15]

16 Policy-based Fault Prediction, Hardware Management abstraction

17 Policy-based Fault Prediction, Hardware Management abstraction

18 Hardware Management abstraction
Ability to access and control detailed status for better management (CPU temp, baseboard, power status, system ID, up/down, etc.)
IPMI (Intelligent Platform Management Interface)
OpenIPMI and OpenHPI (SA Forum)
HW abstraction hides vendor-specific details
–CPU
–Power
–Memory
–Baseboard
–Fan (cooling)

19 Our early observations
01/25/2004 | 00:31:19 | Sys Fan 1 | critical
01/25/2004 | 00:31:19 | Sys Fan 3 | critical
01/25/2004 | 00:31:19 | Sys Fan 4 | critical
01/25/2004 | 00:31:19 | Processor 1 Fan | ok
01/25/2004 | 00:31:20 | Processor 2 Fan | ok
Thresholds can be set in managed elements to trigger events with severity levels
Automatic failure trend analysis -> prediction
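As a small illustration of how such sensor events could feed automatic analysis, the sketch below parses the pipe-separated lines shown on this slide and picks out the critical sensors. The four-field layout (date, time, sensor, status) is taken from the lines above; the function names are hypothetical.

```python
from datetime import datetime

SAMPLE_LOG = """\
01/25/2004 | 00:31:19 | Sys Fan 1 | critical
01/25/2004 | 00:31:19 | Sys Fan 3 | critical
01/25/2004 | 00:31:19 | Sys Fan 4 | critical
01/25/2004 | 00:31:19 | Processor 1 Fan | ok
01/25/2004 | 00:31:20 | Processor 2 Fan | ok
"""


def parse_event(line):
    """Split one 'date | time | sensor | status' line into a dict."""
    date_s, time_s, sensor, status = [field.strip() for field in line.split("|")]
    when = datetime.strptime(f"{date_s} {time_s}", "%m/%d/%Y %H:%M:%S")
    return {"time": when, "sensor": sensor, "status": status}


def critical_sensors(log_text):
    """Return the sensor names reported as critical, in log order."""
    events = [parse_event(line) for line in log_text.splitlines() if line.strip()]
    return [event["sensor"] for event in events if event["status"] == "critical"]


if __name__ == "__main__":
    print(critical_sensors(SAMPLE_LOG))
```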

20 Failure prediction & policy-based recovery
Cluster management
Detection: the damage is already done!
Prediction
–Trend analysis
–Anticipate imminent failures
–Better handling
–Correlating multiple events/nodes is more difficult
Examples of IPMI events and trend analysis
–E.g. CPU temp rising too fast within 5 min -> prepare to checkpoint, fail over and restart
–Memory bit error detected -> take the node out
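The CPU temperature example can be sketched as a simple trend check: fit a least-squares slope to the readings from the last 5 minutes and trigger the policy action when the temperature rises too quickly. This is an assumed illustration, not the authors' algorithm; the 2 °C/min limit, the action string and the function names are hypothetical.

```python
def slope_per_minute(samples):
    """Least-squares slope of (time_seconds, temp_celsius) samples, in deg C per minute."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    return cov / var_t * 60.0


def policy_action(samples, now, window_seconds=300.0, max_rise_c_per_min=2.0):
    """Return a recovery action if the temperature trend in the window is too steep."""
    recent = [(t, y) for t, y in samples if now - t <= window_seconds]
    if slope_per_minute(recent) > max_rise_c_per_min:
        return "prepare_checkpoint_and_failover"
    return "none"


if __name__ == "__main__":
    # Hypothetical readings: one per minute, rising 3 deg C per minute.
    readings = [(0, 50.0), (60, 53.0), (120, 56.0), (180, 59.0), (240, 62.0)]
    print(policy_action(readings, now=240))
```

A fitted slope is less sensitive to a single noisy reading than comparing only the first and last samples, which matters when the action is as costly as a failover.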

21 HA-OSCAR monitoring, Fault prediction and recovery Restructure

22 Cluster Power Management (IPMI)

23 Reliability-aware Runtime

24 Reliability-Aware Runtime
Programming paradigm and scalability impact “Reliability”, especially in HPC environments
“AND Survivability” analysis based on
–All nodes (10, 100, 1000) have to survive
–Each node has an MTTF of 5000 hours
–N=10, MTTF = 492.424242 hours
–N=100, MTTF = 49.9902931 hours
–N=1000, MTTF = 4.99999003 hours
–N=10000, MTTF = ½ hour
Reliability and availability information enables better job execution (checkpointing, resource management)
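As a rough cross-check of these figures, the sketch below uses the simplest "all nodes must survive" model: independent nodes with exponentially distributed lifetimes, for which the system MTTF is the node MTTF divided by N. This model is an assumption; the slide does not state how its numbers were derived, and the function name is illustrative.

```python
def system_mttf_hours(node_mttf_hours, n_nodes):
    """MTTF of a system that fails as soon as any one of n_nodes fails,
    assuming independent nodes with exponentially distributed lifetimes."""
    return node_mttf_hours / n_nodes


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(f"N={n}: MTTF = {system_mttf_hours(5000.0, n)} hours")
```

Under this assumption the results are 500, 50, 5 and 0.5 hours. The last matches the slide exactly, while the first three are close to, but not identical with, the slide's values, which presumably come from a slightly different model.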

25 MTTF 1000-5000
The more and the faster the processors, the higher the failure rate
E.g. each node has a failure rate of 2/year
N=10, MTTF = 492.424242
N=100, MTTF = 49.9902931
N=1000, MTTF = 4.99999003

26 Reliability-aware Checkpointing
–Consideration of scalability vs. reliability at runtime
–MTTF vs. application execution time
–HA-OSCAR monitoring -> failure prediction and detection
–System-initiated (transparent) and reliability-aware checkpointing in MPI environments
–Developed smart checkpointing based on the above
–Reduces unnecessary overhead while remaining reliability-aware
–Detailed reports in HAPCW2004 and submitted to IEEE Cluster 2005
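To illustrate the trade-off between MTTF and checkpoint overhead, the sketch below uses Young's first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTTF). This is a well-known textbook rule swapped in for illustration; it is not the smart checkpointing scheme these slides refer to, and the function name and example values are hypothetical.

```python
import math


def checkpoint_interval_hours(checkpoint_cost_hours, system_mttf_hours):
    """Young's approximation of the checkpoint interval that balances
    checkpoint overhead against work lost to failures."""
    return math.sqrt(2.0 * checkpoint_cost_hours * system_mttf_hours)


if __name__ == "__main__":
    # Hypothetical: a checkpoint takes 6 minutes (0.1 h), system MTTF is 5 hours.
    print(checkpoint_interval_hours(0.1, 5.0))
```

With these example values the suggested interval is about one hour; a longer system MTTF allows less frequent checkpoints, which is the kind of overhead reduction a reliability-aware scheme aims for.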

27 Federated System Architecture (DOE fastOS)

28 Summary
Problems in large-scale computing are similar to those in wireless sensor networks
–Computing node = sensor node (SN)
–Head node = gateway
Reliability issues are similar
–Depends on applications
Self-configuration, self-awareness, self-healing
Routing algorithm = location-aware

