Argonne National Laboratory is managed by The University of Chicago for the U.S. Department of Energy
ILC Controls: High Availability Software
2 Outline
Opening comments
ILC software architecture refresher
The HA stack
Primary and management protocols
HPI (Hardware Platform Interface) summary
AIS (Application Interface Specification) summary
Bottom-up, are these a good fit?
  – HPI and HPI-ATCA
  – AIS
Conclusions
A proposed “stack” for ILC HA research
Tasks
3 Opening Comments
  – Don’t build any critical-path software infrastructure without access to source code
  – HA software is a hard problem
  – SAF specifications are an impressive unification of known techniques
  – SAF implementations won’t “solve” the HA problem
      You still have to determine what you want to do and encode it in the framework – this is where the work lies
      1. What counts as a failure
      2. How to identify a failure
      3. How to compensate (redundancy, reconfiguration, or both)
  – How long until known-reliable, SAF-compliant products come out? Compare to the time between the OMG CORBA spec and good implementations…
  – Is the resultant software complexity manageable? The potential fix may be worse than the problem
4 Architecture Refresher
5 SAF and ILC Controls Architecture
[Diagram] Tiers: Real-Time Tier, Services Tier (middleware), Client Tier
Diagram labels: SM, CPU1, CPU2, I/O 1, I/O 2, checkpoints, CLM, GUI, sensor, container, object, HPI, AIS, Cluster Membership Service, Shelf Manager
Failure-handling annotations:
  – Failed I/O card or power supply: fix locally (localization)
  – Hung task: escalate; report upwards
  – Crashed middleware container: escalate; report upwards
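The failure-handling annotations on this slide can be read as a small policy table. The sketch below is illustrative only: the fault labels, the `Fault` and `handle` names, and the rules that unknown faults escalate and that every fault is reported to the next tier up are assumptions, not part of SAF or of any ILC design.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    REAL_TIME = 1
    SERVICES = 2
    CLIENT = 3


class Action(Enum):
    FIX_LOCALLY = "fix locally"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Fault:
    source: str  # e.g. "I/O 1" or "CPU1"
    kind: str    # hypothetical fault label, see POLICY
    tier: Tier   # tier in which the fault was detected


# Hypothetical policy table built from the three cases on the slide.
POLICY = {
    "failed_io_card": Action.FIX_LOCALLY,
    "failed_power_supply": Action.FIX_LOCALLY,
    "hung_task": Action.ESCALATE,
    "crashed_container": Action.ESCALATE,
}


def report_target(tier):
    """Tier that receives the report; None when already at the top tier."""
    return Tier(tier.value + 1) if tier is not Tier.CLIENT else None


def handle(fault):
    """Return (action, tier to report to) for a detected fault."""
    # Assumption: faults not listed in the table are escalated.
    action = POLICY.get(fault.kind, Action.ESCALATE)
    return action, report_target(fault.tier)


if __name__ == "__main__":
    print(handle(Fault("I/O 1", "failed_io_card", Tier.REAL_TIME)))
    print(handle(Fault("CPU1", "hung_task", Tier.REAL_TIME)))
```

Keeping the policy in a table rather than in code paths makes the open questions from the opening comments (what is a failure, how to compensate) explicit and reviewable.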
6 Primary and Management Protocols
How do they interact?
  – Primary connection management is informed by the management protocol
  – Specific actions are carried out over the primary protocol, based on information from the management protocol
[Diagram] Labels: State Info, Primary Controls Protocol, HA Management Protocol, Level N, Level N+1
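One way to picture this interaction is a connection manager at Level N+1 that picks its primary-protocol endpoint using state information delivered by the management protocol. This is a minimal sketch under stated assumptions: `ConnectionManager`, `on_management_event`, and the endpoint names are hypothetical, `send` is a stand-in for a real primary-protocol call, and the no-failback behavior is a design choice made here, not something the slide specifies.

```python
class ConnectionManager:
    """Chooses the primary-protocol endpoint from management-protocol state."""

    def __init__(self, endpoints):
        self._endpoints = list(endpoints)  # in order of preference
        self._healthy = {e: True for e in self._endpoints}
        self.active = self._endpoints[0]

    def on_management_event(self, endpoint, healthy):
        """Called whenever the management protocol reports endpoint state."""
        self._healthy[endpoint] = healthy
        self._reselect()

    def _reselect(self):
        # Stay on the current endpoint while it is healthy (no failback).
        if self._healthy.get(self.active):
            return
        for endpoint in self._endpoints:
            if self._healthy[endpoint]:
                self.active = endpoint
                return
        self.active = None  # nothing healthy at the moment

    def send(self, request):
        """Stand-in for sending a request over the primary protocol."""
        if self.active is None:
            raise ConnectionError("no healthy endpoint")
        return (self.active, request)


if __name__ == "__main__":
    cm = ConnectionManager(["cpu1", "cpu2"])
    print(cm.send("read"))                 # routed to cpu1
    cm.on_management_event("cpu1", False)  # management protocol reports a fault
    print(cm.send("read"))                 # routed to cpu2
```

The primary protocol never has to discover failures by timing out on its own; it only acts on what the management protocol reports.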
7 HPI (Hardware Platform Interface) Summary
HPI subsumes IPMI (established), SNMP, and others
[Diagram] Concepts: Sessions, Domains, Entities, Resources
[Diagram] Annotations:
  - Client access to manage events
  - RDR repository (SNMP OIDs)
  - Physical components
HPI passes info as IPMI packets over RMCP
HPI-ATCA
  – Expose ATCA entities through HPI (hot swap LEDs, etc.)
8 AIS (Application Interface Specification) Summary
C-code interface specification
No protocols or other language bindings given
AMF (Availability Management Framework) – the tie that binds
  – Object lifecycle state diagrams (behavior)
Services
  – Message – similar to JMS, MQSeries, Tuxedo
  – Log, Notification, Events
  – Cluster Membership – redundant instances within a “group”
  – Checkpoint – save my state so a standby can take over
  – Distributed Lock – basic need of a distributed, coordinated system
  – IMMS – what is out there, configured and deployed
LDAP-like DNs (Distinguished Names) identify resources
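The checkpoint idea ("save my state so a standby can take over") can be illustrated without any SAF implementation. The sketch below is not the AIS Checkpoint Service API: `CheckpointStore` is an in-memory stand-in, and the `Component` class, its counter state, and the checkpoint name are hypothetical.

```python
import copy


class CheckpointStore:
    """In-memory stand-in for a replicated checkpoint service."""

    def __init__(self):
        self._sections = {}

    def write(self, name, state):
        # Store a copy so later mutation by the writer does not leak in.
        self._sections[name] = copy.deepcopy(state)

    def read(self, name):
        return copy.deepcopy(self._sections.get(name))


class Component:
    """Toy redundant component; only the active instance does work."""

    def __init__(self, instance, checkpoint_name, store):
        self.instance = instance
        self.checkpoint_name = checkpoint_name
        self.store = store
        self.role = "standby"
        self.state = {"count": 0}

    def activate(self):
        """Take over: restore the last checkpoint, then become active."""
        saved = self.store.read(self.checkpoint_name)
        if saved is not None:
            self.state = saved
        self.role = "active"

    def increment(self):
        if self.role != "active":
            raise RuntimeError("standby instances do not process requests")
        self.state["count"] += 1
        # Checkpoint after every state change so a standby can resume here.
        self.store.write(self.checkpoint_name, self.state)
        return self.state["count"]


if __name__ == "__main__":
    store = CheckpointStore()
    primary = Component("cpu1", "counter", store)
    standby = Component("cpu2", "counter", store)
    primary.activate()
    primary.increment()
    primary.increment()
    standby.activate()          # simulated failover
    print(standby.increment())  # continues from the checkpointed count
```

In a real deployment the store itself would have to be replicated across nodes; here it is a single dictionary so that the takeover logic stays visible.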
9 Bottom-up, Are these a good fit?
HPI and HPI-ATCA
  – Yes! – IPMI and SNMP implementations are all gravitating to HPI
  – Interoperability is very useful to us here
  – Unified view of hardware resources
      Front-end CPUs and I/O cards
      Servers (database and application)
      NADs (network attached devices)
AIS
  – Hard problem
  – Anyone claiming to have produced a solid, 100% compliant AIS product is probably exaggerating
  – C-code interface only so far
  – Not clear that components will be interoperable
      Are we really going to be shopping for COTS control system middleware components?
10 HA Middleware: The Contenders (SAF presentation dated 4/26/05) (note: not a good story…)
  – Commercial Cluster SW
      Pro: Transparent to application; ISV support
      Con: Failover too slow; Proprietary
  – FT OS Single System Image
      Pro: Transparent
      Con: Scalability; Very complex to implement
  – FT CORBA
      Pro: Reasonably Transparent; Industry Standard
      Con: Failover times; Heterogeneity; Management
  – Telco HA Middleware
      Pro: Fast Fail-over; Extensible; Management
      Con: Intrusive; Non-Intuitive Model
11 FT-CORBA (fault tolerant)
12 FT-CORBA
No existing CORBA-based control system is HA
  – Tango – uses open-source JacORB
  – ACS – uses open-source ORBacus
  – NIF – uses Visibroker with custom connection management
No commercial FT-CORBA ORB as of the beginning of 2004
  – Spec out since 2001 – not a good sign
Very little open-source FT-CORBA exists (mostly academic)
  – GroupPAC
  – OCI (Object Computing Inc.) TAO
13 CORBA Alternative - ZeroC ICE
ICE (Internet Communications Engine)
  – High-performance middleware
  – Open source, GPL licensed
  – Multiple language bindings (C++, Java, PHP, Python, C# so far)
  – Used by Hewlett Packard and FCS (Future Combat Systems)
  – Very much like CORBA, but addresses CORBA’s substantial complexity and performance issues (not designed by committee)
HA Features
  – Has explicit support for storing object state to a database
  – Coarse-grained failover only so far (server to server)
Could possibly even use this to unify RTP (Real Time Protocol) and DOP (Distributed Object Protocol)
14 Options from the world of Java Web Development
JBoss
  – Open source middleware container
  – Lots of sophisticated, solid features for redundant deployment
JINI
  – Java RMI service lookup/discovery protocol
  – Very useful for connection management
Spring Framework
  – Lightweight middleware container
  – Alternative to EJB 2.0
EJB 3.0
  – Response to Spring and flaws in EJB 2.0
15 Middleware HA – my conclusions
This is a hard problem to solve
It’s OK if this part of our effort takes longer to solidify
OS-based clustering is too slow and complex
SAF AIS specification is great on paper, but…
  – No implementations yet that offer full compliance
  – No bindings other than C, so far as I can tell
FT-CORBA is not looking good
Proprietary Telco solutions – need I say more
Success stories seem to use non-HA standards to build HA systems
  – Use the set of standards that matches your culture
      i.e. Java (JINI/RMI) or non-FT CORBA
  – Build the needed HA behavior custom to your requirements
      Add in checkpointing, active/standby, connection mgmt, etc.
16 Middleware HA – conclusions (2)
My inclination is to look at ICE and/or standard CORBA
Build basic HA features following the model of SAF AIS where reasonable
Need more knowledge to even evaluate SAF AIS-compliant products
Wait for commercial and open-source implementations of AIS…
In the meantime, build a la carte from known stable frameworks
17 Proposed Stack for ILC HA Research
[Diagram] Labels: SM, CPU1, CPU2, COTS, Custom
Arrow ATCA Starter Kit
  - Pigeon Point shelf manager - need SM SDK?
  - Dual (Quad) x86 processors - we need the board developer’s kit
Run EPICS iocCore on dual CPUs
ICE Middleware Tier
  - Examine suitability - build prototype HA features
Java GUI Applications
[Diagram] Protocol labels: IPMI V1.5 over RMCP, Channel Access, ICE protocol
18 Tasks
1. Study and document points of failure (look at FNAL project…)
  – How to identify failure
  – How to recover (redundancy and/or reconfiguration)
2. Port EPICS iocCore to ATCA CPUs
  – RTOS?
  – Explore redundancy and checkpointing within iocCore
3. Establish middleware server
  – Explore HA feature development within ICE
  – RMCP to ATCA shelf manager
  – Channel Access to ATCA CPUs
4. Look at custom hardware development in ATCA, including potential associated additions to shelf manager software
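For the "how to identify failure" item in task 1, one common starting point is heartbeat timeouts. The sketch below is an assumption about what a first prototype might look like, not a technique prescribed by these slides: the `HeartbeatMonitor` class, node names, and timeout value are hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
class HeartbeatMonitor:
    """Flags nodes whose last heartbeat is older than a timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self._last_seen = {}  # node name -> time of last heartbeat

    def beat(self, node, now):
        """Record a heartbeat from a node at time `now` (seconds)."""
        self._last_seen[node] = now

    def failed(self, now):
        """Return the sorted list of nodes considered failed at time `now`."""
        return sorted(
            node
            for node, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=2.0)
    monitor.beat("cpu1", now=0.0)
    monitor.beat("cpu2", now=0.0)
    monitor.beat("cpu1", now=2.5)   # cpu2 has gone silent
    print(monitor.failed(now=3.0))  # cpu2 exceeds the 2.0 s timeout
```

A timeout only says that a node is silent, not why; distinguishing a hung task from a failed network link is part of the failure study in task 1.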