Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager
Ralph H. Castain, Ph.D.
Cisco Systems, Inc.
Outline
- Overview
- Key pieces
  - OpenRTE
  - uPNP
- ORCM
  - Architecture
  - Fault behavior
- Future directions
System Software Requirements
1) Turn on once, with remote access thereafter
2) Non-stop == max 20 events/day, each lasting < 200 ms
3) Hitless SW upgrades and downgrades
4) Upgrade/downgrade SW components across delta versions
5) Field patchable
6) Beta-test new features in situ
7) Extensive trace facilities: on routes, tunnels, subscribers, ...
8) Configuration
9) Clear APIs; minimize application awareness
10) Extensive remote capabilities for fault management, software maintenance, and software installations
Our Approach
- Distributed redundancy
  - NO master
  - Multiple copies of everything, running in tracking mode (parallel, seeing identical input)
  - Multiple ways of selecting a leader
- Utilize a component architecture
  - Multiple ways to do something => framework!
- Create an initial working base
- Encourage experimentation
Methodology
- Exploit open source software
  - Reduce development time
  - Encourage outside participation
  - Cross-fertilize with the HPC community
- Write a new cluster manager (ORCM)
  - Exploit new capabilities
  - Potential dual use for HPC clusters
  - Encourage outside contributions
Open Source ≠ Free
Pro
- Widespread exposure: ORTE runs on thousands of systems around the world, so problems get surfaced and addressed
- Community support: others can help solve problems; expanded access to tools (e.g., debuggers); energy, other ideas and methods
Con
- Your timeline ≠ my timeline: there is no penalty for late contributions, and academic contributors have other priorities
- Compromise is a required art: code must be designed to support multiple approaches, nobody wins all the time, and it adds time to the implementation
Outline
- Overview
- Key pieces
  - OpenRTE (3-day workshop)
  - uPNP
- ORCM
  - Architecture
  - Fault behavior
- Future directions
A Convergence of Ideas (diagram)
- MPI lineage feeding Open MPI: PACX-MPI (HLRS), LAM/MPI (IU), LA-MPI (LANL), FT-MPI (U of TN)
- Additional threads: Robustness (CSU), Fault Detection (LANL, Industry), Grid (many), Autonomous Computing (many), FDDP (Semi. Mfg. Industry)
- Together these converge into OpenRTE and resilient computing systems
Program Objective
*Cell = one or more computers sharing a common launch environment/point
Participants
Developers
- DOE/NNSA*: Los Alamos Nat Lab, Sandia Nat Lab, Oak Ridge Nat Lab
- Universities: Indiana University, Univ of Tennessee, Univ of Houston, HLRS Stuttgart
Support
- Industry: Cisco, Oracle, IBM, Microsoft*, Apple*, multiple interconnect vendors
- Open source teams: OFED, autotools, Mercurial
*Providing funding
Reliance on Components
- Formalized interfaces
  - Specify a "black box" implementation
  - Different implementations available at run-time
  - Can compose different systems on the fly
(diagram: a caller invoking Interface 1, Interface 2, Interface 3; a minimal interface sketch follows below)
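To make the idea concrete, here is a minimal sketch, assuming nothing about the real OpenRTE headers, of a formalized interface that two interchangeable components implement behind a "black box" boundary. The names (example_module_t, component_a, the USE_B switch) are illustrative only.

```c
/* Minimal sketch of a formalized framework interface (illustrative only --
 * not the actual OpenRTE headers).  A framework defines a table of function
 * pointers; each component supplies its own implementation behind that
 * "black box" boundary, and the caller never needs to know which one it got. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    const char *name;
    int (*init)(void);
    int (*do_work)(const char *input);
    int (*finalize)(void);
} example_module_t;

/* One possible implementation ("component A") */
static int a_init(void)            { return 0; }
static int a_work(const char *in)  { printf("A handled %s\n", in); return 0; }
static int a_fini(void)            { return 0; }
static example_module_t component_a = { "A", a_init, a_work, a_fini };

/* A second, interchangeable implementation ("component B") */
static int b_init(void)            { return 0; }
static int b_work(const char *in)  { printf("B handled %s\n", in); return 0; }
static int b_fini(void)            { return 0; }
static example_module_t component_b = { "B", b_init, b_work, b_fini };

/* The caller composes the system at run time by selecting a module. */
int main(void)
{
    example_module_t *active = (getenv("USE_B") != NULL) ? &component_b : &component_a;
    active->init();
    active->do_work("request-42");
    return active->finalize();
}
```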
OpenRTE and Components
- Components are shared libraries
  - Central set of components in the installation tree
  - Users can also have components under $HOME
- Components can be added/removed after install
  - No need to recompile/re-link apps
  - Download/install new components
  - Develop new components safely
- Update "on-the-fly"
  - Add or update components while running
  - Frameworks "pause" during the update
(a dlopen-based loading sketch follows below)
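Because components are ordinary shared libraries, they can be discovered and opened at run time. The sketch below illustrates that mechanism with plain dlopen(); the library path and exported symbol name are made-up assumptions and do not reflect OpenRTE's real component naming scheme.

```c
/* Illustrative sketch of run-time component loading via dlopen().
 * The library path and the exported symbol name are hypothetical;
 * OpenRTE's real component discovery and naming rules are more involved. */
#include <dlfcn.h>
#include <stdio.h>

typedef struct {
    const char *name;
    int (*init)(void);
} component_t;

int main(void)
{
    /* Open a component dropped into the install tree (or under $HOME). */
    void *handle = dlopen("/opt/orcm/lib/example_component.so", RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Assume each component exports a well-known descriptor symbol. */
    component_t *comp = (component_t *)dlsym(handle, "example_component_descriptor");
    if (comp == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    printf("loaded component: %s\n", comp->name);
    comp->init();
    dlclose(handle);
    return 0;
}
```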
Component Benefits
- Stable, production-quality environment for 3rd-party researchers
  - Can experiment inside the system without rebuilding everything else
  - Small learning curve (learn a few components, not the entire implementation)
  - Allows wide use and experience before exposing new work
- Vendors can quickly roll out support for new platforms
  - Write only the components you want/need to change
  - Protect intellectual property
ORTE: Resiliency*
- Fault: an event that hinders the correct operation of a process
  - May not actually be a "failure" of a component, but can cause system-level failure or performance degradation below a specified level
  - The effect may be immediate or some time in the future
  - Faults are usually rare, so there may not be many data examples
- Fault prediction: estimate the probability of an incipient fault within some time period in the future
- Fault tolerance (reactive, static): the ability to recover from a fault
- Robustness (metric): how much the system can absorb without catastrophic consequences
- Resilience (proactive, dynamic): dynamically configure the system to minimize the impact of potential faults
*standalone presentation
Key Frameworks
Error Manager (Errmgr)
- Receives all process state updates (sensor, waitpid), including predictions
- Determines the response strategy: restart locally, restart globally, or abort
- Executes recovery, accounting for fault groups to avoid repeated failover
Sensor
- Monitors software and hardware state-of-health: sentinel file size, modification & access times; memory footprint; temperature; heartbeat; ECC errors (see the sketch below)
- Predicts incipient faults: trend and fingerprint methods, with AI-based algorithms coming
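A hedged sketch of one such sensor check: watching a sentinel file's size and modification time and flagging it when it grows too large or stops changing. The path limits and the report function are assumptions, not ORCM's actual sensor API.

```c
/* Hedged sketch of a "file" sensor check: watch a sentinel file's size and
 * modification time and report a potential fault when limits are exceeded.
 * The thresholds and report function are illustrative, not ORCM's API. */
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

#define MAX_SIZE_BYTES  (10 * 1024 * 1024)  /* assumed size limit              */
#define MAX_STALE_SECS  30                  /* assumed "stopped updating" window */

static void report_fault(const char *why)
{
    fprintf(stderr, "sensor: potential fault: %s\n", why);
    /* In ORCM this state update would be handed to the error manager (Errmgr). */
}

int check_sentinel(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0) {
        report_fault("sentinel file missing");
        return -1;
    }
    if (st.st_size > MAX_SIZE_BYTES) {
        report_fault("sentinel file exceeded size limit");
        return -1;
    }
    if (time(NULL) - st.st_mtime > MAX_STALE_SECS) {
        report_fault("sentinel file has not been updated recently");
        return -1;
    }
    return 0;  /* healthy */
}
```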
Outline
- Overview
- Key pieces
  - OpenRTE
  - uPNP
- ORCM
  - Architecture
  - Fault behavior
- Future directions
Universal PNP (uPNP)
- Widely adopted standard; ORCM uses only a part of it
- PNP discovery via an announcement on a standard multicast channel (see the sketch below)
  - The announcement includes the application id and contact info
  - All applications respond
  - The wireup "storm" limits scalability; various algorithms exist for storm reduction
- Each application is assigned its own "channel"
  - Carries all output from members of that application
  - Input sent to that application is given to all members
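The sketch below shows a PNP-style discovery announcement sent on a multicast channel. The group address, port, and message format are illustrative assumptions and do not describe ORCM's actual wire protocol.

```c
/* Hedged sketch of a PNP-style discovery announcement on a multicast channel.
 * The group address, port, and message format are assumptions for
 * illustration, not ORCM's real protocol. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int announce(const char *app_id, const char *contact_uri)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return -1; }

    /* Keep announcements on the local segment. */
    unsigned char ttl = 1;
    setsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

    struct sockaddr_in group;
    memset(&group, 0, sizeof(group));
    group.sin_family = AF_INET;
    group.sin_port = htons(5001);                       /* assumed channel port */
    inet_pton(AF_INET, "239.255.0.1", &group.sin_addr); /* assumed group addr   */

    /* The announcement carries the application id and contact info. */
    char msg[256];
    snprintf(msg, sizeof(msg), "ANNOUNCE %s %s", app_id, contact_uri);

    ssize_t n = sendto(fd, msg, strlen(msg), 0,
                       (struct sockaddr *)&group, sizeof(group));
    close(fd);
    return (n < 0) ? -1 : 0;
}
```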
Outline
- Overview
- Key pieces
  - OpenRTE
  - uPNP
- ORCM
  - Architecture
  - Fault behavior
- Future directions
ORCM DVM
- One orcmd daemon per node
  - Started at node boot, or launched by a tool
  - Locally spawns and monitors processes and system health sensors
  - Small footprint (≤ 1 MB)
- Each daemon tracks the existence of the others (see the heartbeat sketch below)
  - PNP wireup over the predefined "System" multicast channel
  - Knows where all processes are located
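A hedged sketch of how a daemon might track its peers and notice when one goes silent; the table size, timeout, and reporting are assumptions rather than ORCM's implementation (a later slide notes that node failure is detected when a daemon misses heartbeats).

```c
/* Hedged sketch of heartbeat tracking between daemons: each orcmd records
 * when it last heard from every peer and flags peers that go silent too long.
 * The table, timeout, and reporting are illustrative assumptions. */
#include <stdio.h>
#include <time.h>

#define MAX_DAEMONS       16
#define HEARTBEAT_TIMEOUT 5   /* seconds without a heartbeat => suspect failure */

static time_t last_heard[MAX_DAEMONS];

/* Called whenever a heartbeat arrives on the "System" channel. */
void heartbeat_received(int vpid)
{
    if (vpid >= 0 && vpid < MAX_DAEMONS) {
        last_heard[vpid] = time(NULL);
    }
}

/* Called periodically to look for daemons that have gone quiet. */
void check_heartbeats(void)
{
    time_t now = time(NULL);
    for (int vpid = 0; vpid < MAX_DAEMONS; vpid++) {
        if (last_heard[vpid] != 0 && now - last_heard[vpid] > HEARTBEAT_TIMEOUT) {
            fprintf(stderr, "daemon %d missed heartbeats -- reporting node down\n", vpid);
            last_heard[vpid] = 0;  /* report once; recovery is handled elsewhere */
        }
    }
}
```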
Parallel DVMs
- Allows concurrent development and testing in the production environment, and sharing of development resources
- A unique identifier (the ORTE jobid) maintains separation between orcmd's
  - Each application belongs to its respective DVM
  - No cross-DVM communication is allowed (see the sketch below)
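One way to picture the separation rule: a daemon simply drops any message whose jobid does not match its own DVM. The message struct and field names below are illustrative stand-ins, not ORTE's actual process-name types.

```c
/* Hedged sketch of DVM separation by job id: messages carrying a foreign
 * jobid are ignored.  The struct and fields are illustrative only. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t jobid;      /* identifies the DVM the sender belongs to */
    uint32_t vpid;       /* sender's rank within that DVM            */
    const char *payload;
} message_t;

static uint32_t my_jobid = 42;  /* assigned when this DVM was created */

void handle_message(const message_t *msg)
{
    if (msg->jobid != my_jobid) {
        /* Cross-DVM traffic is not allowed -- silently drop it. */
        return;
    }
    printf("daemon in job %u: accepted message from vpid %u: %s\n",
           my_jobid, msg->vpid, msg->payload);
}
```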
Configuration Mgmt (diagram)
Elements shown: orcmd; cfgi framework with confd, tool, and file components; confd daemon subscribe; lowest vpid receives the config; open framework; set recv config; file? / connect?; orcm-start file
Configuration Mgmt (diagram, continued)
Elements shown: orcmd; cfgi framework (confd, tool, file); confd daemon subscribe; recv config; open framework; set recv config; file? / connect?; orcm-start file
- Update any missing config info
- Assume "leader" duties
Application Launch (diagram)
Elements shown: orcmd; cfgi framework (confd, tool, file); confd daemon subscribe; recv config; set recv config; file? / connect?; orcm-start file
- Config change: #procs, location
- Launch message sent on the predefined "System" multicast channel
Resilient Mapper
- Fault groups
  - Nodes with a common failure mode
  - A node can belong to multiple fault groups
  - Defined in the system file
- Map instances across fault groups to minimize the probability of cascading failures (see the sketch below)
  - One instance per fault group
  - Pick the lightest-loaded node in the group
  - Randomly map any extras
- Next-generation algorithms: failure-mode probability => fault group selection
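A hedged sketch of the mapping policy described above: place one replica per fault group, choosing the lightest-loaded node in each group. The data structures and load metric are simplified assumptions (in particular, each node here belongs to a single fault group), not ORCM's actual mapper.

```c
/* Hedged sketch of the resilient mapping policy: one replica per fault group,
 * placed on the lightest-loaded node in that group.  Structures and the load
 * metric are illustrative assumptions. */
#include <stdio.h>

typedef struct {
    const char *name;
    int fault_group;   /* simplified: one group per node           */
    int load;          /* number of processes already mapped to it */
} node_t;

/* Pick the lightest-loaded node in a given fault group; returns index or -1. */
static int lightest_in_group(node_t *nodes, int n, int group)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (nodes[i].fault_group == group &&
            (best < 0 || nodes[i].load < nodes[best].load)) {
            best = i;
        }
    }
    return best;
}

/* Map 'replicas' instances of one app, one per fault group 0..replicas-1. */
void map_replicas(node_t *nodes, int n, int replicas)
{
    for (int g = 0; g < replicas; g++) {
        int idx = lightest_in_group(nodes, n, g);
        if (idx < 0) continue;   /* no node available in this fault group */
        nodes[idx].load++;
        printf("replica %d -> node %s (fault group %d)\n", g, nodes[idx].name, g);
    }
}
```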
Multiple Replicas
- Multiple copies of each executable
  - Run on separate fault groups
  - Asynchronous and independent
- Shared PNP channel
  - Input: received by all replicas
  - Output: broadcast to all, received by those who registered for the input
- Leader determined by the receiver
Leader Selection
- Two forms of leader selection: internal to the ORCM DVM, and external-facing
- Internal: a framework with several modules (see the lowest-rank sketch below)
  - App-specific module
  - Configuration-specified
  - Lowest rank
  - First contact
  - None
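A hedged sketch of a "lowest rank" selection rule: of the replicas currently known to be alive, the one with the smallest rank becomes the leader. The replica table is an illustrative stand-in for ORCM's internal state.

```c
/* Hedged sketch of a "lowest rank" leader-selection module.  The replica
 * table and its fields are illustrative, not ORCM's internal structures. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  rank;
    bool alive;
} replica_t;

/* Returns the rank of the new leader, or -1 if no replica is alive. */
int select_leader_lowest_rank(const replica_t *replicas, int count)
{
    int leader = -1;
    for (int i = 0; i < count; i++) {
        if (replicas[i].alive && (leader < 0 || replicas[i].rank < leader)) {
            leader = replicas[i].rank;
        }
    }
    return leader;
}

int main(void)
{
    replica_t reps[] = { {0, false}, {1, true}, {2, true} };  /* rank 0 failed */
    printf("leader is rank %d\n", select_leader_lowest_rank(reps, 3));
    return 0;
}
```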
External Connections
- orcm-connector
  - Input: broadcast on the respective PNP channel
  - Output: determines a "leader" to supply output to the rest of the world
  - Can utilize any leader method in the framework
Testing in Production (diagram)
Elements shown: orcm-logger feeding logger components for db, file, syslog, and console
Software Maintenance
- On-the-fly module activation (see the sketch below)
  - The configuration manager can select new modules to load, reload, or activate
  - Change priorities of active modules
- Full replacement, when more than a module needs updating
  - Start the replacement version
  - The configuration manager switches the "leader"
  - Stop the old version
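A hedged sketch of priority-based module activation: the framework keeps a table of loaded modules and routes calls to the active one with the highest priority, so changing a priority (or activating a newly loaded module) redirects behavior without relinking anything. Types and fields are illustrative assumptions.

```c
/* Hedged sketch of priority-based module activation.  The module table and
 * its fields are illustrative, not ORCM's actual framework code. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *name;
    int         priority;   /* higher wins */
    bool        active;
    int       (*handle)(const char *input);
} module_t;

/* Re-run selection after the configuration manager changes priorities
 * or activates a newly loaded module. */
module_t *select_active_module(module_t *modules, int count)
{
    module_t *best = NULL;
    for (int i = 0; i < count; i++) {
        if (modules[i].active &&
            (best == NULL || modules[i].priority > best->priority)) {
            best = &modules[i];
        }
    }
    return best;   /* NULL if nothing is active */
}
```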
Detecting Failures
- Application failures: detected by the local daemon
  - Monitors for self-induced problems: memory and CPU usage
  - Orders termination if limits are exceeded or are trending toward being exceeded
  - Detects unexpected failures via waitpid (see the sketch below)
- Hardware failures
  - Local hardware sensors continuously report status, read by the local daemon
  - Projects potential failure modes to pre-order relocation of processes or shutdown of the node
  - Detected by the DVM when a daemon misses heartbeats
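A hedged sketch of how a local daemon can detect unexpected child exits with waitpid() and hand the event on; the report function is a stand-in for the real errmgr interface.

```c
/* Hedged sketch of waitpid()-based detection of unexpected child exits.
 * The report function is a stand-in for ORCM's errmgr interface. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

static void report_process_failure(pid_t pid, int exit_code, int signo)
{
    /* In ORCM this state update would go to the errmgr framework. */
    fprintf(stderr, "daemon: pid %d terminated (exit=%d, signal=%d)\n",
            (int)pid, exit_code, signo);
}

/* Called periodically (or from a SIGCHLD handler) to reap dead children. */
void check_children(void)
{
    int status;
    pid_t pid;
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (WIFEXITED(status)) {
            report_process_failure(pid, WEXITSTATUS(status), 0);
        } else if (WIFSIGNALED(status)) {
            report_process_failure(pid, -1, WTERMSIG(status));
        }
    }
}
```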
Application Failure
- Local daemon
  - Detects (or predicts) the failure
  - Restarts locally, up to the specified max #local-restarts
  - Utilizes the resilient mapper to determine the re-location
  - Sends a launch message to all daemons
- Replacement app
  - Announces itself on the application's public address channel
  - Receives responses and registers its own inputs
  - Begins operation
- Connected applications select a new "leader" based on the current module
Node Failure (diagram elements: orcmd; cfgi framework with confd, tool, and file components)
- The next-higher orcmd becomes the leader
- Opens/inits the cfgi framework and updates any missing config info
- Marks the node as "down"
- Relocates application processes from the failed node
- Connected apps fail over their leader per the active leader module
- Attempts to restart
Node Replacement/Addition
- Auto-boot of the local daemon on power-up
  - The daemon announces itself to the DVM
  - All DVM members add the node to their available resources
- Reboot/restart
  - Relocate the original procs back, up to some max number of times (a smarter algorithm is needed here)
  - Leadership remains unaffected, to avoid "bounce"
  - Processes will map to the new resource as start/restart demands
- Future: rebalance the existing load upon node availability
Outline
- Overview
- Key pieces
  - OpenRTE
  - uPNP
- ORCM
  - Architecture
  - Fault behavior
- Future directions
System Software Requirements (revisited)
1) Turn on once, with remote access thereafter
2) Non-stop == max 20 events/day, each lasting < 200 ms
3) Hitless SW upgrades and downgrades
4) Upgrade/downgrade SW components across delta versions
5) Field patchable
6) Beta-test new features in situ
7) Extensive trace facilities: on routes, tunnels, subscribers, ...
8) Configuration
9) Clear APIs; minimize application awareness
10) Extensive remote capabilities for fault management, software maintenance, and software installations
Callouts on this slide: ~5 ms recovery; start new app triplet, kill old one; new app triplet, register for production input; boot-level startup; start/stop triplets, leader selection
Still A Ways To Go
- Security
  - Who can order ORCM to launch/stop apps?
  - Who can "log" output from which apps?
  - What is the network extent of communications?
- Communications
  - Message size and fragmentation support
  - Speed of the underlying transport
  - Truly reliable multicast
  - Asynchronous messaging
Still A Ways To Go
- Transfer of state
  - How does a restarted application replica regain the state of its prior existence?
  - How do we re-sync state across replicas so that outputs track?
- Deterministic outputs
  - Same output from replicas tracking the same inputs assumes deterministic algorithms
  - Can we support non-deterministic algorithms?
    - Random channel selection to balance loads
    - Decisions based on instantaneous traffic sampling
Still A Ways To Go
- Enhanced algorithms
  - Mapping
  - Leader selection
  - Fault prediction: implementation and algorithms
  - Expanded sensors
- Replication vs. rapid restart
  - If we can restart in a few milliseconds, do we really need replication?
Concluding Remarks
http://www.open-mpi.org
http://www.open-mpi.org/projects/orcm