
1 MOLAR: MOdular Linux and Adaptive Runtime support
Project Team: David Bernholdt 1, Christian Engelmann 1, Stephen L. Scott 1, Jeffrey Vetter 1, Arthur B. Maccabe 2, Patrick G. Bridges 2, Frank Mueller 3, Ponnuswamy Sadayappan 4, Chokchai Leangsuksun 5
1 Oak Ridge National Laboratory, 2 University of New Mexico, 3 North Carolina State University, 4 Ohio State University, 5 Louisiana Tech University
Briefing at: Scalable Systems Software meeting, Argonne National Laboratory - August 26, 2004

2 Research Plan
- Create a modular and configurable Linux system that allows customized changes based on the requirements of applications, runtime systems, and cluster management software.
- Build runtime systems that leverage this OS modularity and configurability to improve efficiency, reliability, scalability, and ease of use, and to support legacy and emerging programming models.
- Advance RAS management systems to work cooperatively with the OS/runtime to identify and preemptively resolve system issues.
- Explore advanced monitoring and adaptation to improve application performance and the predictability of system interruptions.

3 MOLAR map

4 MOLAR: Modular Linux and Adaptive Runtime support (project map)
- HEC Linux OS (modular, custom, light-weight): Kernel design [UNM, ORNL, LLNL]
- RAS (reliability, availability, serviceability): High availability [LaTech, ORNL, NCSU]; Process state saving [LLNL]; Message logging [NCSU]; Root cause analysis [ORNL, LaTech]
- Monitoring: Extend/adapt runtime/OS [ORNL, OSU]
- Programming models: Evaluation [ORNL, OSU]
- Testbeds: Provided [Cray, ORNL]

5 RAS for Scientific and Engineering Applications
- High mean time between interrupts (MTBI) for hardware, system software, and storage devices.
- High mean time between errors/failures that affect users.
- Automatic recovery without human intervention.
- Minimal work lost to the recovery process.
Computation – Storage – Network
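The MTBI and recovery-time goals above combine into the standard steady-state availability formula, A = MTBI / (MTBI + MTTR). A minimal sketch in Python; the function name and the example numbers are illustrative, not taken from the slides:

```python
def availability(mtbi_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up,
    given mean time between interrupts and mean time to repair."""
    return mtbi_hours / (mtbi_hours + mttr_hours)

# A node interrupted every 1000 hours that takes 2 hours to recover:
a = availability(1000.0, 2.0)
print(round(a, 5))  # -> 0.998, i.e. roughly "three nines" territory
```

The formula makes the two RAS levers explicit: raise MTBI (better hardware and software) or cut MTTR (automatic recovery), and availability improves either way.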

6 Case for RAS in HEC
- Today’s systems need a reboot to recover.
- The entire system is often down for any maintenance or repair.
- Compute nodes sit idle if their head (service) node is down.
- Availability and MTBI typically decrease as the system grows.
- The “hidden” costs of failures: researchers’ lost work-in-progress, researchers on hold, additional system staff, checkpoint & restart time.
- Why do we accept such significant system outages due to failures, maintenance, or repair?
- With the expected investment in HEC we simply cannot afford low availability!
- We need to drastically increase the availability of HEC computing resources now!
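The claim that MTBI decreases as the system grows follows directly if node failures are independent and exponentially distributed: the machine is interrupted whenever any one node fails, so system MTBI is the per-node MTBI divided by the node count. A sketch under that idealized assumption (the node MTBI figure is illustrative):

```python
def system_mtbi(node_mtbi_hours, num_nodes):
    """System-wide MTBI, assuming independent, exponentially
    distributed node failures: any single node failure interrupts
    the whole machine."""
    return node_mtbi_hours / num_nodes

# Nodes that individually fail once every ~5 years (43,800 hours):
for n in (128, 1024, 8192):
    print(n, round(system_mtbi(43_800, n), 1))
    # 128 nodes -> ~342 h; 1024 -> ~43 h; 8192 -> ~5 h between interrupts
```

At ultrascale node counts the interval between interrupts shrinks below typical checkpoint intervals, which is exactly the case the slide makes for RAS.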

7 High-availability in Industry
- Industry has shown for years that 99.999% (“five nines”) high availability is feasible for computing services.
- Used in corporate web servers, distributed databases, business accounting, and stock exchange services.
- OS-level high availability has not been a priority in the past: implementation involves complex algorithms; development and distribution licensing issues exist; most solutions are proprietary and do not perform well.
- HA-OSCAR is the first freely available open-source HA cluster implementation.
- If we don’t step up, do it as an open-source proof-of-concept implementation, and set the standard, no one will.
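Industry typically reaches five nines through redundancy rather than through a single ultra-reliable server. Under the idealized assumption that replicas fail independently and fail-over is instantaneous, a redundant service is down only when every replica is down at once. A hedged sketch (function name and numbers are illustrative):

```python
def redundant_availability(a_single, replicas):
    """Availability of a redundant service, idealized: down only if
    all replicas are simultaneously down. Ignores fail-over delay
    and common-mode failures, so it is an upper bound."""
    return 1.0 - (1.0 - a_single) ** replicas

# Two modestly reliable 99% servers yield roughly four nines:
print(round(redundant_availability(0.99, 2), 6))  # -> 0.9999
```

This is why the slide can pair "good HA package + substandard hardware" with multiple nines: the redundancy scheme, not the individual box, carries the availability.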

8 Availability by the Nines*

9's  Availability  Downtime/Year       Examples
1    90.0%         36 days, 12 hours   Personal Computers
2    99.0%         87 hours, 36 min    Entry-Level Business
3    99.9%         8 hours, 45.6 min   ISPs, Mainstream Business
4    99.99%        52 min, 33.6 sec    Data Centers
5    99.999%       5 min, 15.4 sec     Banking, Medical
6    99.9999%      31.5 seconds        Military Defense

* “Highly-Affordable High Availability” by Alan Robertson, Linux Magazine, November
- Service measured by “9’s of availability”: 90% has one 9, 99% has two 9s, etc.
- Good HA package + substandard hardware = up to 3 nines
- Enterprise-class hardware + stable Linux kernel = 5+ nines
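The downtime column is just the unavailable fraction of a 365-day year, which can be verified directly; a small sketch reproducing the table's figures:

```python
def downtime_per_year(availability_pct):
    """Seconds of downtime in a 365-day year at a given availability."""
    year_seconds = 365 * 24 * 3600  # 31,536,000 s
    return (1.0 - availability_pct / 100.0) * year_seconds

for pct in (90.0, 99.0, 99.9, 99.99, 99.999, 99.9999):
    print(pct, round(downtime_per_year(pct), 1))
# 99.999% yields ~315.4 s (about 5 min 15 s), matching the table's
# "five nines" row; 90% yields 3,153,600 s = 36 days 12 hours.
```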

9 Federated System Management

10 High-availability Methods

Active/Hot-Standby:
- Single head node.
- Idle standby head node(s).
- Backup to shared storage.
- Service interruption for the time of the fail-over.
- Rollback to backup.
- Simple checkpoint/restart.
- Service interruption for the time of restore-over.

Active/Active:
- Many active head nodes.
- Work load distribution.
- Symmetric replication between head nodes.
- Continuous service.
- Always up-to-date.
- Complex distributed control algorithms.
- No restore-over necessary.
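The hot-standby side of the comparison boils down to a heartbeat timeout: the standby promotes itself when the active head node goes quiet. A toy sketch of that decision; the class, timeout, and method names are hypothetical simplifications (a real package such as HA-OSCAR adds fencing, shared-storage restore, and more):

```python
import time

class HotStandby:
    """Toy hot-standby monitor: the standby promotes itself to active
    when the active head node misses heartbeats past the timeout.
    All names here are hypothetical, for illustration only."""

    def __init__(self, timeout_s=3.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = time.monotonic()
        self.role = "standby"

    def on_heartbeat(self):
        # Called whenever the active head node checks in.
        self.last_heartbeat = time.monotonic()

    def poll(self, now=None):
        now = time.monotonic() if now is None else now
        if self.role == "standby" and now - self.last_heartbeat > self.timeout_s:
            # Fail-over: this is where the standby would restore the
            # last backup from shared storage, hence the interruption.
            self.role = "active"
        return self.role

node = HotStandby(timeout_s=3.0)
print(node.poll(node.last_heartbeat + 1.0))  # "standby": heartbeats fresh
print(node.poll(node.last_heartbeat + 5.0))  # "active": timeout exceeded
```

The service interruption the slide mentions corresponds to the window between the missed heartbeat and the completed restore from shared storage.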

11 High-availability Technology

Active/Hot-Standby:
- HA-OSCAR with active/hot-standby head node.
- Cluster system software.
- No support for multiple active/active head nodes.
- No middleware support.
- No support for compute nodes.

Active/Active:
- HARNESS with symmetric distributed virtual machine.
- Heterogeneous, adaptable distributed middleware.
- No system-level support.

- System-level data replication and distributed control service needed for an active/active head node solution.
- A reconfigurable framework similar to HARNESS is needed to adapt to system properties and application needs.

12 Modular RAS Framework for Terascale Computing
[Architecture diagram] A layered stack running on high-available service nodes, which front the compute nodes:
- Reliable Services (job scheduling, user management, etc.) — reliable server groups
- Distributed Control Service and Data Replication Service — symmetric replication
- Group Communication Service — virtual synchrony
- Communication Methods — TCP/IP, shared memory, etc.
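The symmetric replication idea in this framework — every service node applying the same totally ordered update stream delivered by the group communication service, so all replicas stay identical — can be sketched as a toy replicated state machine. All class and field names are hypothetical; a real implementation would sit atop a virtual-synchrony group communication layer:

```python
class Replica:
    """Toy symmetrically replicated service state. Correctness rests on
    the group communication service delivering every update to every
    replica in the same total order."""

    def __init__(self):
        self.state = {}
        self.applied = 0  # sequence number of the last applied update

    def apply(self, seq, key, value):
        # In-order delivery is the group communication layer's job.
        assert seq == self.applied + 1, "updates must arrive in total order"
        self.state[key] = value
        self.applied = seq

# The same ordered update stream is delivered to three service nodes:
updates = [(1, "job42", "queued"), (2, "job42", "running")]
nodes = [Replica(), Replica(), Replica()]
for seq, key, value in updates:
    for node in nodes:
        node.apply(seq, key, value)

print(all(node.state == nodes[0].state for node in nodes))  # True
```

Because every replica is always up to date, any node can answer requests and no restore-over is needed after a failure, which is the active/active advantage claimed on the previous slides.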