Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc.

1 Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager Ralph H. Castain, Ph.D. Cisco Systems, Inc.

2 Outline
- Overview
- Key pieces
  - OpenRTE
  - uPNP
- ORCM
  - Architecture
  - Fault behavior
- Future directions

3 System Software Requirements
1) Turn on once, with remote access thereafter
2) Non-stop == max 20 events/day lasting < 200 ms each
3) Hitless SW upgrades and downgrades
4) Upgrade/downgrade SW components across delta versions
5) Field patchable
6) Beta test new features in situ
7) Extensive trace facilities: on routes, tunnels, subscribers, ...
8) Configuration
9) Clear APIs; minimize application awareness
10) Extensive remote capabilities for fault management, software maintenance, and software installations

4 Our Approach
- Distributed redundancy
  - NO master
  - Multiple copies of everything
  - Running in tracking mode: parallel, seeing identical input
  - Multiple ways of selecting a leader
- Utilize component architecture
  - Multiple ways to do something => framework!
  - Create an initial working base
  - Encourage experimentation

5 Methodology
- Exploit open source software
  - Reduce development time
  - Encourage outside participation
  - Cross-fertilize with HPC community
- Write new cluster manager (ORCM)
  - Exploit new capabilities
  - Potential dual-use for HPC clusters
  - Encourage outside contributions

6 Open Source ≠ Free
Pro:
- Widespread exposure
  - ORTE on thousands of systems around the world
  - Surface & address problems
- Community support
  - Others can help solve problems
  - Expanded access to tools (e.g., debuggers)
- Energy
  - Other ideas, methods
Con:
- Your timeline ≠ my timeline
  - No penalty for late contributions
  - Academic contributors have other priorities
- Compromise: a required art
  - Code must be designed to support multiple approaches
  - Nobody wins all the time
  - Adds time to implementation

7 Outline
- Overview
- Key pieces
  - OpenRTE
  - uPNP
- ORCM
  - Architecture
  - Fault behavior
- Future directions
(3-day workshop)

8 A Convergence of Ideas
[Diagram: PACX-MPI (HLRS), LAM/MPI (IU), LA-MPI (LANL), and FT-MPI (U of TN) converging into Open MPI; robustness (CSU), fault detection (LANL, industry), Grid (many), autonomous computing (many), and FDDP (semiconductor mfg. industry) feeding into OpenRTE and resilient computing systems]

9 Program Objective
[Diagram]
*Cell = one or more computers sharing a common launch environment/point

10 Participants
Developers:
- DOE/NNSA*
  - Los Alamos Nat Lab
  - Sandia Nat Lab
  - Oak Ridge Nat Lab
- Universities
  - Indiana University
  - Univ of Tennessee
  - Univ of Houston
  - HLRS, Stuttgart
Support:
- Industry
  - Cisco
  - Oracle
  - IBM
  - Microsoft*
  - Apple*
  - Multiple interconnect vendors
- Open source teams
  - OFED, autotools, Mercurial
*Providing funding

11 Reliance on Components
- Formalized interfaces
  - Specify a "black box" implementation
  - Different implementations available at run-time
  - Can compose different systems on the fly
[Diagram: a caller selecting among Interface 1, Interface 2, Interface 3]
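
A minimal sketch in C of the component idea: each implementation fills in the same table of function pointers (the formalized interface), and the caller selects one by name at run time. The module names and the interface itself are invented for illustration; they are not the actual OpenRTE framework API.

```c
/* Component-style selection: same interface, interchangeable implementations. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;
    int  (*init)(void);
    int  (*do_work)(const char *input);
    void (*finalize)(void);
} example_module_t;

static int  noisy_init(void)           { puts("noisy: init");  return 0; }
static int  noisy_work(const char *in) { printf("noisy: handling %s\n", in); return 0; }
static void noisy_fini(void)           { puts("noisy: finalize"); }

static int  quiet_init(void)           { return 0; }
static int  quiet_work(const char *in) { (void)in; return 0; }
static void quiet_fini(void)           { }

static example_module_t modules[] = {
    { "noisy", noisy_init, noisy_work, noisy_fini },
    { "quiet", quiet_init, quiet_work, quiet_fini },
};

/* Select a module by name at run time, e.g. from a configuration parameter. */
static example_module_t *select_module(const char *wanted)
{
    for (size_t i = 0; i < sizeof(modules)/sizeof(modules[0]); i++) {
        if (0 == strcmp(modules[i].name, wanted)) return &modules[i];
    }
    return NULL;
}

int main(void)
{
    example_module_t *mod = select_module("noisy");
    if (mod && 0 == mod->init()) {
        mod->do_work("one request");
        mod->finalize();
    }
    return 0;
}
```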

12 OpenRTE and Components
- Components are shared libraries
  - Central set of components in the installation tree
  - Users can also have components under $HOME
- Can add / remove components after install
  - No need to recompile / re-link apps
  - Download / install new components
  - Develop new components safely
- Update "on-the-fly"
  - Add, update components while running
  - Frameworks "pause" during update
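
The sketch below shows, under assumed paths and symbol names, how components packaged as shared libraries can be picked up with dlopen() from either a user's $HOME area or a central install tree without relinking the application. It illustrates the mechanism only; it is not the real OpenRTE component loader. Build with -ldl on Linux.

```c
/* Load a component shared library from the user area or the install tree. */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef int (*component_query_fn)(void);

static void *try_open(const char *dir, const char *file)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/%s", dir, file);
    return dlopen(path, RTLD_NOW | RTLD_LOCAL);
}

int main(void)
{
    const char *file = "example_component.so";   /* hypothetical component */
    char userdir[4096];
    snprintf(userdir, sizeof(userdir), "%s/.example/components",
             getenv("HOME") ? getenv("HOME") : ".");

    /* Prefer the user's private components, then the central install tree. */
    void *handle = try_open(userdir, file);
    if (!handle) handle = try_open("/opt/example/lib/components", file);
    if (!handle) {
        fprintf(stderr, "no such component: %s\n", dlerror());
        return 1;
    }

    /* Each component exports a known entry point (the formalized interface). */
    component_query_fn query = (component_query_fn)dlsym(handle, "component_query");
    if (query) printf("component priority: %d\n", query());

    dlclose(handle);
    return 0;
}
```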

13 Component Benefits
- Stable, production-quality environment for 3rd-party researchers
  - Can experiment inside the system without rebuilding everything else
  - Small learning curve (learn a few components, not the entire implementation)
  - Allow wide use, experience before exposing work
- Vendors can quickly roll out support for new platforms
  - Write only the components you want/need to change
  - Protect intellectual property

14 ORTE: Resiliency*
- Fault
  - Events that hinder the correct operation of a process. May not actually be a "failure" of a component, but can cause system-level failure or performance degradation below the specified level.
  - Effect may be immediate or some time in the future.
  - Usually rare; may not have many data examples.
- Fault prediction
  - Estimate the probability of an incipient fault within some time period in the future
- Fault tolerance (reactive, static)
  - Ability to recover from a fault
- Robustness (metric)
  - How much the system can absorb without catastrophic consequences
- Resilience (proactive, dynamic)
  - Dynamically configure the system to minimize the impact of potential faults
*standalone presentation

15 Key Frameworks
Error Manager (Errmgr):
- Receives all process state updates
  - Sensor, waitpid
  - Includes predictions
- Determines response strategy
  - Restart locally, globally, or abort
- Executes recovery
  - Accounts for fault groups to avoid repeated failover
Sensor:
- Monitors software and hardware state-of-health
  - Sentinel file size, mod & access times
  - Memory footprint
  - Temperature
  - Heartbeat
  - ECC errors
- Predicts incipient faults
  - Trend, fingerprint
  - AI-based algos coming
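
As a rough illustration of the sensor side, the sketch below keeps a short history of a monitored metric and flags an incipient fault when a simple linear trend projects the value to cross a limit. The window size, thresholds, and metric are assumptions; ORCM's actual sensor and prediction code is more involved.

```c
/* Trend-based sensor check: flag when a metric is projected to cross a limit. */
#include <stdio.h>

#define HISTORY 8

typedef struct {
    double samples[HISTORY];
    int    count;
} trend_sensor_t;

/* Returns 1 if 'sample' already exceeds 'limit', or if the average slope
 * over the retained history projects it past 'limit' within 'lookahead'
 * future samples; 0 otherwise. */
static int sensor_check(trend_sensor_t *s, double sample,
                        double limit, int lookahead)
{
    if (s->count < HISTORY) {
        s->samples[s->count++] = sample;
    } else {
        for (int i = 1; i < HISTORY; i++) s->samples[i - 1] = s->samples[i];
        s->samples[HISTORY - 1] = sample;
    }
    if (s->count < 2) return sample > limit;

    double slope = (s->samples[s->count - 1] - s->samples[0]) / (s->count - 1);
    double projected = sample + slope * lookahead;
    return sample > limit || projected > limit;
}

int main(void)
{
    trend_sensor_t mem = { {0}, 0 };
    double readings[] = { 100, 120, 150, 190, 240 };   /* e.g. MB of resident memory */

    for (int i = 0; i < 5; i++) {
        if (sensor_check(&mem, readings[i], 400.0, 6)) {
            printf("sample %d: incipient fault predicted, notify errmgr\n", i);
        }
    }
    return 0;
}
```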

16 Outline
- Overview
- Key pieces
  - OpenRTE
  - uPNP
- ORCM
  - Architecture
  - Fault behavior
- Future directions

17 Universal PNP
- Widely adopted standard; ORCM uses only a part
- PNP discovery via announcement on a standard multicast channel
  - Includes application id and contact info
  - All applications respond
  - Wireup "storm" limits scalability; various algorithms for storm reduction
- Each application is assigned its own "channel"
  - All output from members of that application
  - Input sent to that application is given to all members
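
The following sketch conveys the flavor of PNP-style discovery: a process sends an announcement carrying its application id and contact info on a well-known multicast group. The group address, port, and message format are invented for illustration and are not the ORCM wire protocol.

```c
/* Announce this process on a multicast discovery channel. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PNP_GROUP "239.255.42.1"   /* hypothetical "system" channel */
#define PNP_PORT  4242

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in group = {0};
    group.sin_family = AF_INET;
    group.sin_port = htons(PNP_PORT);
    inet_pton(AF_INET, PNP_GROUP, &group.sin_addr);

    /* The announcement carries the application id and how to reach us. */
    const char *announce = "ANNOUNCE app=my-app contact=192.168.1.10:5555";
    if (sendto(fd, announce, strlen(announce), 0,
               (struct sockaddr *)&group, sizeof(group)) < 0) {
        perror("sendto");
    }

    /* Peers that joined the group (IP_ADD_MEMBERSHIP on a bound socket)
     * receive this datagram and reply with their own announcements, which
     * is what produces the wireup "storm" on large systems. */
    close(fd);
    return 0;
}
```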

18 Outline
- Overview
- Key pieces
  - OpenRTE
  - uPNP
- ORCM
  - Architecture
  - Fault behavior
- Future directions

19 ORCM DVM
- One orcmd per node
  - Started at node boot or launched by tool
  - Locally spawns and monitors processes and system health sensors
  - Small footprint (≤ 1 MB)
- Each daemon tracks the existence of the others
  - PNP wireup
  - Know where all processes are located
[Diagram: orcmd daemons connected via the predefined "System" multicast channel]
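
A small sketch of the daemon's local-spawn role, assuming ordinary fork/exec and a pid table kept for later monitoring; this stands in for orcmd internals rather than reproducing them.

```c
/* Spawn a local process and remember its pid for monitoring. */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_LOCAL_PROCS 32

static pid_t local_procs[MAX_LOCAL_PROCS];
static int   num_local_procs = 0;

static pid_t spawn_local(char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {                 /* child: become the application */
        execvp(argv[0], argv);
        _exit(127);                 /* exec failed */
    }
    if (pid > 0 && num_local_procs < MAX_LOCAL_PROCS) {
        local_procs[num_local_procs++] = pid;   /* track for later waitpid() checks */
    }
    return pid;
}

int main(void)
{
    char *app[] = { "sleep", "5", NULL };       /* stand-in application */
    pid_t pid = spawn_local(app);
    printf("spawned local process %d, now monitored by this daemon\n", (int)pid);
    return 0;
}
```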

20 Parallel DVMs
- Allows
  - Concurrent development and testing in a production environment
  - Sharing of development resources
- Unique identifier (ORTE jobid)
  - Maintains separation between orcmd's
  - Each application belongs to its respective DVM
  - No cross-DVM communication allowed

21 Configuration Mgmt
[Diagram: orcmd opens the cfgi framework with confd, tool, and file components; subscribes to the confd daemon, checks for an orcm-start file, and accepts tool connections; the lowest-vpid daemon receives the configuration]

22 Configuration Mgmt
[Diagram: same cfgi flow as the previous slide; the daemon updates any missing config info and assumes "leader" duties]

23 Application Launch
[Diagram: a config change (number of procs, location) arrives via the cfgi framework; the resulting launch message is sent on the predefined "System" multicast channel]

24 Resilient Mapper
- Fault groups
  - Nodes with a common failure mode
  - A node can belong to multiple fault groups
  - Defined in the system file
- Map instances across fault groups
  - Minimize probability of cascading failures
  - One instance per fault group
  - Pick the lightest-loaded node in the group
  - Randomly map extras
- Next-generation algorithms
  - Failure mode probability => fault group selection
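
The mapping policy sketched below follows the bullets above: one instance per fault group on the lightest-loaded node, with extras placed randomly. The fault-group and node data are made up for the example.

```c
/* Resilient-mapper-style placement across fault groups. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct { const char *name; int load; } node_t;
typedef struct { node_t *nodes; int num_nodes; } fault_group_t;

static node_t *lightest(fault_group_t *fg)
{
    node_t *best = &fg->nodes[0];
    for (int i = 1; i < fg->num_nodes; i++)
        if (fg->nodes[i].load < best->load) best = &fg->nodes[i];
    return best;
}

int main(void)
{
    node_t rack1[] = { {"n01", 3}, {"n02", 1} };
    node_t rack2[] = { {"n03", 0}, {"n04", 2} };
    fault_group_t groups[] = { {rack1, 2}, {rack2, 2} };
    int num_groups = 2;
    int num_instances = 3;                 /* one more instance than fault groups */

    srand((unsigned)time(NULL));

    /* One instance per fault group, lightest-loaded node first. */
    int placed = 0;
    for (int g = 0; g < num_groups && placed < num_instances; g++, placed++) {
        node_t *n = lightest(&groups[g]);
        n->load++;
        printf("instance %d -> %s (fault group %d)\n", placed, n->name, g);
    }

    /* Extras: random fault group, then random node within it. */
    for (; placed < num_instances; placed++) {
        fault_group_t *fg = &groups[rand() % num_groups];
        node_t *n = &fg->nodes[rand() % fg->num_nodes];
        n->load++;
        printf("instance %d -> %s (random extra)\n", placed, n->name);
    }
    return 0;
}
```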

25 Multiple Replicas
- Multiple copies of each executable
  - Run on separate fault groups
  - Async, independent
- Shared PNP channel
  - Input: received by all
  - Output: broadcast to all, received by those who registered for the input
- Leader determined by the receiver

26 Leader Selection
- Two forms of leader selection
  - Internal to ORCM DVM
  - External facing
- Internal: framework
  - App-specific module
  - Configuration specified
  - Lowest rank
  - First contact
  - None
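
Two of the internal policies, lowest rank and first contact, are easy to illustrate; the sketch below does so over an invented replica table. In ORCM these policies live in interchangeable framework modules rather than hard-coded functions.

```c
/* Leader selection over a set of known replicas: lowest rank or first contact. */
#include <stdio.h>

typedef struct {
    int    rank;
    double first_heard;    /* seconds since the receiver started */
    int    alive;
} replica_t;

static const replica_t *leader_lowest_rank(const replica_t *r, int n)
{
    const replica_t *best = NULL;
    for (int i = 0; i < n; i++)
        if (r[i].alive && (!best || r[i].rank < best->rank)) best = &r[i];
    return best;
}

static const replica_t *leader_first_contact(const replica_t *r, int n)
{
    const replica_t *best = NULL;
    for (int i = 0; i < n; i++)
        if (r[i].alive && (!best || r[i].first_heard < best->first_heard)) best = &r[i];
    return best;
}

int main(void)
{
    replica_t replicas[] = {
        { 2, 0.8, 1 },
        { 0, 1.9, 1 },   /* lowest rank, but announced later */
        { 1, 0.3, 1 },   /* heard from first */
    };
    printf("lowest-rank leader:   rank %d\n", leader_lowest_rank(replicas, 3)->rank);
    printf("first-contact leader: rank %d\n", leader_first_contact(replicas, 3)->rank);
    return 0;
}
```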

27 External Connections
- orcm-connector
  - Input: broadcast on the respective PNP channel
  - Output: determines the "leader" that supplies output to the rest of the world; can utilize any leader method in the framework

28 Testing in Production
[Diagram: orcm-logger with a logger framework offering db, file, syslog, and console components]

29 Software Maintenance
- On-the-fly module activation
  - Configuration manager can select new modules to load, reload, activate
  - Change priorities of active modules
- Full replacement
  - When more than a module needs updating
  - Start the replacement version
  - Configuration manager switches the "leader"
  - Stop the old version
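
A sketch of priority-driven module activation, assuming a simple "highest priority wins" rule and a configuration manager that can change priorities at run time; the module names and priorities are illustrative.

```c
/* Priority-based selection of the active module. */
#include <stdio.h>

typedef struct { const char *name; int priority; int loaded; } module_t;

static module_t modules[] = {
    { "stable-v1",  50, 1 },
    { "patched-v2", 10, 1 },
};
#define NUM_MODULES (sizeof(modules)/sizeof(modules[0]))

static module_t *active_module(void)
{
    module_t *best = NULL;
    for (size_t i = 0; i < NUM_MODULES; i++)
        if (modules[i].loaded && (!best || modules[i].priority > best->priority))
            best = &modules[i];
    return best;
}

int main(void)
{
    printf("active: %s\n", active_module()->name);   /* stable-v1 */

    /* Configuration manager raises the priority of the patched module. */
    modules[1].priority = 90;
    printf("active: %s\n", active_module()->name);   /* patched-v2 */
    return 0;
}
```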

30 Detecting Failures
- Application failures: detected by the local daemon
  - Monitors for self-induced problems
    - Memory and CPU usage
    - Orders termination if limits are exceeded or trending to exceed
  - Detects unexpected failures via waitpid
- Hardware failures
  - Local hardware sensors continuously report status
    - Read by the local daemon
    - Projects potential failure modes to pre-order relocation of processes or node shutdown
  - Detected by the DVM when a daemon misses heartbeats
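
The sketch below illustrates both detection paths with standard POSIX calls: waitpid() for local process exits and a timestamp comparison for missed heartbeats. The timeout value and bookkeeping are assumptions, not ORCM's.

```c
/* Two failure-detection paths: local waitpid() and heartbeat timeout. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

#define HEARTBEAT_TIMEOUT 5.0   /* seconds without a heartbeat => node presumed failed */

/* Poll for any local child that has terminated. */
static void check_local_processes(void)
{
    int status;
    pid_t pid;
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (WIFSIGNALED(status) || (WIFEXITED(status) && WEXITSTATUS(status) != 0)) {
            printf("process %d failed unexpectedly: report to errmgr\n", (int)pid);
        } else {
            printf("process %d exited normally\n", (int)pid);
        }
    }
}

/* Decide whether a remote daemon has missed too many heartbeats. */
static int daemon_failed(time_t last_heartbeat, time_t now)
{
    return difftime(now, last_heartbeat) > HEARTBEAT_TIMEOUT;
}

int main(void)
{
    check_local_processes();                 /* nothing spawned here, so a no-op */

    time_t now = time(NULL);
    time_t stale = now - 10;                 /* pretend the last heartbeat was 10 s ago */
    if (daemon_failed(stale, now)) {
        printf("node missed heartbeats: mark down, relocate its processes\n");
    }
    return 0;
}
```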

31 Application Failure
- Local daemon
  - Detects (or predicts) the failure
  - Locally restarts up to a specified max #local-restarts
  - Utilizes the resilient mapper to determine re-location
  - Sends the launch message to all daemons
- Replacement app
  - Announces itself on the application public address channel
  - Receives responses; registers its own inputs
  - Begins operation
- Connected applications
  - Select a new "leader" based on the current module
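
A compact sketch of the restart policy: retry locally until a maximum count is reached, then fall back to relocation. The counter, limit, and stub functions are hypothetical stand-ins for the errmgr and resilient mapper.

```c
/* Restart locally up to a limit, then relocate via the mapper. */
#include <stdio.h>

#define MAX_LOCAL_RESTARTS 3

typedef struct { const char *app; int local_restarts; } proc_state_t;

static void restart_locally(proc_state_t *p)     { printf("restarting %s locally\n", p->app); }
static void relocate_via_mapper(proc_state_t *p) { printf("relocating %s to another fault group\n", p->app); }

static void handle_failure(proc_state_t *p)
{
    if (p->local_restarts < MAX_LOCAL_RESTARTS) {
        p->local_restarts++;
        restart_locally(p);
    } else {
        relocate_via_mapper(p);     /* launch message then goes to all daemons */
    }
}

int main(void)
{
    proc_state_t proc = { "router-app", 0 };
    for (int failure = 0; failure < 5; failure++) {
        handle_failure(&proc);      /* 3 local restarts, then relocation */
    }
    return 0;
}
```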

32 Node Failure
[Diagram: cfgi framework with confd, tool, and file components]
- Next-higher orcmd becomes leader
  - Opens/inits the cfgi framework
  - Updates any missing config info
- Marks the failed node as "down"
- Relocates application processes from the failed node
- Connected apps fail over their leader per the active leader module
- Attempts to restart

33 Node Replacement/Addition
- Auto-boot of the local daemon on power up
  - Daemon announces itself to the DVM
  - All DVM members add the node to the available resources
- Reboot/restart
  - Relocate original procs back, up to some max number of times (need a smarter algo here)
  - Leadership remains unaffected to avoid "bounce"
- Processes will map to the new resource as start/restart demands
  - Future: rebalance existing load upon node availability

34 Outline
- Overview
- Key pieces
  - OpenRTE
  - uPNP
- ORCM
  - Architecture
  - Fault behavior
- Future directions

35 System Software Requirements
1) Turn on once, with remote access thereafter
2) Non-stop == max 20 events/day lasting < 200 ms each
3) Hitless SW upgrades and downgrades
4) Upgrade/downgrade SW components across delta versions
5) Field patchable
6) Beta test new features in situ
7) Extensive trace facilities: on routes, tunnels, subscribers, ...
8) Configuration
9) Clear APIs; minimize application awareness
10) Extensive remote capabilities for fault management, software maintenance, and software installations
[Slide annotations (checkmarks against the requirements): ~5 ms recovery; start new app triplet, kill old one; new app triplet, register for production input; boot-level startup; start/stop triplets, leader selection]

36 Still A Ways To Go
- Security
  - Who can order ORCM to launch/stop apps?
  - Who can "log" output from which apps?
  - Network extent of communications?
- Communications
  - Message size, fragmentation support
  - Speed of the underlying transport
  - Truly reliable multicast
  - Asynchronous messaging

37 Still A Ways To Go
- Transfer of state
  - How does a restarted application replica regain the state of its prior existence?
  - How do we re-sync state across replicas so outputs track?
- Deterministic outputs
  - Same output from replicas tracking the same inputs (assumes deterministic algorithms)
  - Can we support non-deterministic algorithms?
    - Random channel selection to balance loads
    - Decisions based on instantaneous traffic sampling

38 Still A Ways To Go
- Enhanced algorithms
  - Mapping
  - Leader selection
- Fault prediction
  - Implementation and algorithms
  - Expanded sensors
- Replication vs. rapid restart
  - If we can restart in a few milliseconds, do we really need replication?

39 Concluding Remarks
http://www.open-mpi.org
http://www.open-mpi.org/projects/orcm

