ILC Global Control System
John Carwardine, ANL
Fermilab ILC School, July 2007

ILC Accelerator overview

Major accelerator systems:
- Polarized PC-gun electron source and undulator-based positron source.
- 5-GeV electron and positron damping rings, 6.7 km circumference.
- Beam transport from the damping rings to the bunch compressors.
- Two 11-km-long, 250-GeV linacs with 15,000+ cavities and ~600 RF units.
- A 4.5-km beam delivery system with a single interaction point.

(J. Bagger)

Control System Requirements and Challenges

General requirements are largely similar to those of any large-scale experimental physics machine, but there are some challenges.

Scalability:
- 100,000 devices, several million control points.
- Large geographic scale: 31 km end to end.
- Multi-region, multi-lab development team.

Support ILC accelerator availability goals of 85%:
- Intrinsic control system availability of 99% by design.
- Cannot rely on an approach of 'fix in place'; may require 99.999% (five nines) availability from each crate.
- Functionality to help minimize overall accelerator downtime.

Requirements and Challenges (2)

Precision timing & synchronization:
- Distribute precision timing and RF phase references to many technical systems throughout the accelerator complex.
- Requirements consistent with LLRF requirements of 0.1% amplitude and 0.1 degree phase stability.

Support remote operations / remote access (GAN/GDN):
- Allow collaborators to participate in machine commissioning, operation, optimization, and troubleshooting.
- At the technical equipment level there is little difference between on-site and off-site access; the control room is already 'remote.'
- There are both technical and sociological challenges.

Requirements and Challenges (3)

Extensive reliance on machine automation:
- Manage the operations of the many accelerator systems, e.g. 15,000+ cavities, 600+ RF units.
- Automate machine startup, cavity conditioning, tuning, etc.

Extensive reliance on beam-based feedback:
- Multiple beam-based feedback loops at 5 Hz, e.g.:
  - Trajectory control, orbit control
  - Dispersion measurement & control
  - Beam energies
  - Emittance correction
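As a rough illustration of a pulse-by-pulse (5 Hz) feedback loop of the kind listed above, the sketch below applies a proportional-integral correction to a simulated orbit offset. The plant model, gains, and function name are invented for illustration and are not from the ILC design:

```python
# Illustrative pulse-by-pulse (5 Hz) beam-based feedback loop using a
# proportional-integral correction. Plant model and gains are invented
# for illustration only; they are not ILC parameters.

def run_feedback(initial_offset, kp=0.5, ki=0.1, pulses=50):
    """Drive a measured orbit offset toward zero, one correction per pulse."""
    integral = 0.0
    corrector = 0.0
    for _ in range(pulses):
        # "Measure" this pulse (toy plant: offset = initial + corrector kick).
        measured = initial_offset + corrector
        integral += measured
        # Update the corrector before the next pulse arrives (200 ms later).
        corrector -= kp * measured + ki * integral
    return initial_offset + corrector  # residual offset after the last pulse

residual = run_feedback(1.0)  # settles close to zero within ~50 pulses
```

With these gains the closed loop is stable (the error shrinks by roughly 30% per pulse), so an initial offset is corrected within a few seconds of 5 Hz operation.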

Control System Functional Model

Client Tier:
- GUIs
- Scripting

Services Tier ('business logic'):
- Device abstraction
- Feedback engine
- State machines
- Online models
- ...

Front-End Tier:
- Technical systems interfaces
- Control-point level

Physical Model as applied to main linac (front end)

Some representative component counts

Component                 | Description                                                          | Quantity
--------------------------|----------------------------------------------------------------------|---------
1U Switch                 | Initial aggregator of network connections from technical systems     | 8356
Controls Shelf            | Standard chassis for front-end processing and instrumentation cards  | 1195
Aggregator Switch         | High-density connection aggregator for 2 sectors of equipment        | 71
Controls Backbone Switch  | Backbone networking switch for controls network                      | 126
Phase Ref. Link           | Redundant fiber transmission of 1.3-GHz phase reference              | 68
Controls Rack             | Standard rack populated with one to three controls shelves           | 753
LLRF Station              | Two racks per station for signal processing and motor/piezo drives   | 668

Which Control System?

- Established accelerator control system? EPICS, DOOCS, TANGO, ACNET, ...
- Development from scratch?
- Commercial solution?

Too early to down-select for the ILC, and there are benefits to not down-selecting during the R&D phase.

Availability Design Philosophy for the ILC

- Design for availability up front.
- Budget 15% downtime total; keep an extra 10% as contingency.
- Try to get the high availability for the minimum cost.
- Will need to iterate as the design progresses:
  - Quantities are not final.
  - Engineering studies may show that the cost minimum would be attained by moving some of the unavailability budget from one item to another.
  - This means some MTBFs may be allowed to go down, but others will have to go up.
- Availability/reliability modeling (Availsim).

Availability budgets by system (percentage total downtime)

MTBF/MTTR requirements from Availsim

High Availability primer

Availability A = MTBF / (MTBF + MTTR)
- MTBF = Mean Time Between Failures
- MTTR = Mean Time To Repair
- If MTBF approaches infinity, A approaches 1.
- If MTTR approaches zero, A approaches 1.
- Both are impossible on a unit basis; both are possible on a system basis.

Key features for HA, i.e. A approaching 1:
- Modular design
- Built-in 1/n redundancy
- Hot standby systems
- Hot-swap capability at the subsystem unit or subunit level
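To make the arithmetic above concrete, here is a minimal sketch (the crate MTBF/MTTR numbers are illustrative, not ILC figures) of how MTBF and MTTR combine into availability, and why redundancy raises availability at the system level even when it is impractical per unit:

```python
def availability(mtbf_hours, mttr_hours):
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series(*units):
    """Units that must all work (a chain): availabilities multiply."""
    a = 1.0
    for u in units:
        a *= u
    return a

def parallel(*units):
    """Redundant units where any one suffices: unavailabilities multiply."""
    bad = 1.0
    for u in units:
        bad *= 1.0 - u
    return 1.0 - bad

# One crate with an assumed 10,000 h MTBF and 4 h repair time:
a_crate = availability(10_000, 4)   # about 99.96%
# Duplicating that crate (1/n redundancy with hot swap) gets past five nines:
a_pair = parallel(a_crate, a_crate)
```

The series/parallel helpers show the asymmetry: chaining two 99.96% crates drops availability, while pairing them redundantly squares the *unavailability* instead.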

Systems That Never Shut Down

Any large telecom system will have a few redundant shelves, so loss of a whole unit does not bring down the system, like the RF system in the linac.
- Load is auto-rerouted to a hot spare, again like the linac.

Key: all equipment is always accessible for hot swap.

Other features:
- Open, non-proprietary system: very important for non-telecom customers like the ILC.
- Developed by an industry consortium¹ of major companies sharing in a $100B market.
- A 20x larger market than any of the old standards, including VME, leads to competitive prices.

¹ PICMG: PCI Industrial Computer Manufacturers Group

Controls Cluster

Features:
- Dual-star 1/N redundant backplanes
- Redundant fabric switches
- Dual-star / loop / mesh serial links
- Dual-star serial links to/from Level 2 sector nodes

[Diagram: applications modules connected through dual fabric switches, dual-star to/from the sector nodes]

HA Concept: DR Kicker Systems

- Approx. 50 unit drivers.
- n/N redundancy at the system level (extra kickers).
- n/N redundancy at the unit level (extra cards).
- Diagnostics on each card, networked, local wireless.
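The benefit of n/N redundancy (installing more units than the n actually needed) can be quantified with a simple binomial model, assuming independent failures. The ~50-driver count matches the slide, but the 99% unit availability and two-spare example below are illustrative assumptions:

```python
from math import comb

def n_of_N_availability(n_needed, N_installed, unit_availability):
    """Probability that at least n_needed of N_installed units are up,
    assuming independent unit failures (binomial model)."""
    p = unit_availability
    return sum(
        comb(N_installed, k) * p**k * (1 - p) ** (N_installed - k)
        for k in range(n_needed, N_installed + 1)
    )

# 50 drivers needed, each assumed 99% available:
a_no_spares = n_of_N_availability(50, 50, 0.99)  # all 50 must work: ~60%
a_two_spares = n_of_N_availability(50, 52, 0.99) # any 50 of 52: ~98%
```

Even two spares lift the system from roughly 60% to roughly 98%, which is why n/N redundancy is attractive for a system with this many identical units.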

Physical Model as applied to main linac (front end)

High Availability Control System

The control system itself must be highly available:
- Redundant and hot-swap hardware platform (baseline ATCA).
- Redundancy functionality in control system software.

In many cases, redundancy and hot-swap/hot-reconfigure can only be implemented at the accelerator system level, e.g.:
- Rebalance RF systems if a klystron fails.
- Modify the control algorithm on loss of a critical sensor.

The control system will provide high-availability functionality at the accelerator system level. Technical systems must provide a high level of diagnostics to support remote troubleshooting and re-configuration.

ATCA as a reference platform

[Photo: 16-slot crate with shelf manager, fabric switch, and dual IOC processors; rear view showing the 16-slot dual-star backplane, 4 hot-swappable fans, shelf manager, dual IOCs, fabric switch, and dual 48 VDC power interface]

(R. Larsen)


ATCA as reference platform for front-end electronics

Representative of the breadth of high-availability functions needed:
- Hot-swappable components: circuit boards, fans, power supplies, ...
- Remote power management: power on/off each circuit board.
- Supports redundancy: processors, comms links, power supplies, ...
- Remote resource management through the Shelf Manager.

µTCA offers lower cost but with a reduced feature set.

There is growing interest in the physics community in exploring ATCA for instrumentation and DAQ applications. As a candidate technology for the ILC, ATCA/µTCA have strong potential, though they are currently an emerging standard.

Read-Out evolution: LHC --> ILC

[Diagram labels: Subdetector; Read-Out Crate (VME 9U); Read-Out Driver; Read-Out Buffer (3 ROBins); ROS (150 PCs); Gbit link to GbE switch (60 PCs); 92 S-Links; 400 ROBins (PCI); AMC; ATCA module; ATCA crate; digital buffer; µTCA]

Cost/Benefit Analysis of HA Techniques

Cost (some effort-laden, some materials-laden) vs. availability (benefit):
1. Good administrative practices
2. Disk volume management
3. Automation (supporting RF tune-up, magnet conditioning, etc.)
4. COTS redundancy (switches, routers, NFS, RAID disks, database, etc.)
5. Extensive monitoring (hardware and software)
6. Model-based configuration management (change management)
7. Adaptive machine control (detect failed BPM, modify feedback)
8. Development methodology (testing, standards, patterns)
9. Application design (error code checking, etc.)
10. Hot-swap hardware
11. Manual failover (e.g. bad memory, live patching)
12. Model-based automated diagnosis
13. Automatic failover

HA R&D objectives

- Learn about HA (high availability) in the context of accelerator controls:
  - Bring in expertise (RTES, training, NASA, military, ...).
- Develop (adopt) a methodology for examining control system failures:
  - Fault tree analysis
  - FMEA or scenario-based FMEA
  - Supporting software (CAFTA, SAPPHIRE, ...)
  - Others?
- Develop policies for detecting and managing identified failure modes:
  - Development and testing methodology
  - Workarounds
  - Redundancy
- Develop a full 'vertical' prototype implementation, i.e. how we might implement the above policies.
- Integrate portions of the 'vertical' prototype with test stands (LLRF).
- Feed some software-oriented data to the SLAC availability simulation?

Fermilab ILC School, July Configuration Management Infrastructure Monitoring Software Runtime Lifecycle Management Shelf Management Automation Software QA Development Methodology Conflict Avoidance High Availability Software What are the most common and critical failure modes in control system software? Mis-configuration Network buffer overruns Application logic bugs Task deadlock Accepting conflicting commands Ungraceful handling of failed sensors/actuators Flying blind (lack of monitoring) Introduction of untested features More… How do we mitigate these, and what is the cost/benefit? Availability = MTBF MTBF + MTTR

Sample of Techniques: Shelf Management

[Diagram: the client tier speaks the controls protocol to a custom services tier, which uses IPMI, HPI, SNMP, and other protocols to reach the shelf manager (SM) and sensors alongside the CPU and I/O boards in the front-end tier]

The Shelf Manager can:
- Identify all boards on the shelf
- Power-cycle boards (individually)
- Reset boards
- Monitor voltages/temperatures
- Manage hot-swap LED state
- Switch to a backup flash memory bank
- More...

SAF – Availability Management Framework

A simple example of software component runtime lifecycle management.

AMF logical entities: a Service Group contains Service Units (each a Component on a Node, e.g. Node U active, Node V standby); a Service Instance is work assigned to a Service Unit.

Service Unit administrative states (shutting-down, locked-instantiation, locked, unlocked):
1. The service unit starts out un-instantiated.
2. State changed to locked: software is instantiated on the node, but not assigned work.
3. State changed to unlocked: software is assigned work (a Service Instance).
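The lifecycle steps above can be sketched as a small state machine. This is an illustrative Python model of the admin-state transitions, not the real SAF AMF API (which is a C specification); the transition table and the service-unit name are made up for the sketch:

```python
# Illustrative model of AMF service-unit administrative states.
# Not the SAF AMF C API: a toy state machine following the slide's steps.

TRANSITIONS = {
    ("uninstantiated", "instantiate"): "locked",  # software running, no work yet
    ("locked", "unlock"): "unlocked",             # Service Instance assigned
    ("unlocked", "lock"): "locked",               # work withdrawn, software stays up
    ("unlocked", "shutdown"): "shutting-down",
}

class ServiceUnit:
    def __init__(self, name):
        self.name = name
        self.state = "uninstantiated"

    def admin_op(self, op):
        """Apply an administrative operation; reject illegal transitions."""
        key = (self.state, op)
        if key not in TRANSITIONS:
            raise ValueError(f"{op!r} not allowed in state {self.state!r}")
        self.state = TRANSITIONS[key]
        return self.state

su = ServiceUnit("llrf-ioc-1")   # hypothetical component name
su.admin_op("instantiate")       # step 2: instantiated but idle (locked)
su.admin_op("unlock")            # step 3: assigned its Service Instance
```

Rejecting transitions not in the table is the point of the exercise: lifecycle state is always known, and illegal operations fail loudly instead of leaving a component in an undefined state.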

SAF – Service Availability Forum Specifications

[Diagram, courtesy of the Service Availability Forum: HA applications sit above other middleware and application services, the AIS middleware (AMF, CLM, IMM, CKPT, LOG, NTF, LCK, EVT, MSG), and the HPI middleware (Control, Sensor, Annunciator, Hotswap, Watchdog, Inventory, Event, Reset, Power, Config), all running on a carrier-grade operating system over a managed hardware platform]

HPI = Hardware Platform Interface; AIS = Application Interface Specification.

SAF – Availability Management Framework

AMF (Availability Management Framework) manages software runtime lifecycle, fault reporting, failover policies, etc. It works in combination with a collection of well-defined services to provide a powerful environment for application software components:
- CLM: Cluster Membership Service
- LOG: Log Service
- CKPT: Checkpoint Service
- EVT: Event Service
- LCK: Lock Service
- More...

An open standard from the telecom industry geared towards supporting highly available, highly distributed systems. Potential applications to critical core control system software such as IOCs, device servers, gateways, nameservers, data reduction, etc.:
- Know exactly what software is running where.
- Gracefully restart components, or manage state while hot-swapping the underlying hardware.
- Uniform diagnostics to troubleshoot problems.

An HA software framework is just the start...

- SAF (Service Availability Forum) implementations won't 'solve' the HA problem. You still have to determine what you want to do and encode it in the framework; this is where the work lies:
  1. What are the failures?
  2. How do we identify a failure?
  3. How do we compensate (failover, adaptation, hot-swap)?
- Is the resultant software complexity manageable? A potential fix can be worse than the problem. Always evaluate: 'am I actually improving availability?'
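A toy illustration of steps 2 and 3 (identify a failure, then compensate): a heartbeat monitor that promotes a standby when the active component misses its deadline. The class, names, and timeout are invented for this sketch; real AMF failover policies are far richer:

```python
import time

class HeartbeatMonitor:
    """Fail over to the standby when the active unit misses its heartbeat."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.active, self.standby = "primary", "backup"
        self.last_beat = time.monotonic()

    def heartbeat(self):
        """Called by the active unit to prove it is alive."""
        self.last_beat = time.monotonic()

    def check(self, now=None):
        """Identify a missed deadline and compensate by swapping roles."""
        now = time.monotonic() if now is None else now
        if now - self.last_beat > self.timeout_s:
            self.active, self.standby = self.standby, self.active
            self.last_beat = now  # give the new active a fresh deadline
        return self.active

mon = HeartbeatMonitor(timeout_s=0.5)
mon.heartbeat()
mon.check()                          # fresh heartbeat: stays on "primary"
mon.check(now=mon.last_beat + 1.0)   # deadline missed: fails over to "backup"
```

Even this toy shows the framework's caveat from the slide: the failover logic itself is new code whose failure modes must be evaluated against the availability it buys.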

R&D: Engineering Design (EDR) Phase

The main focus of R&D efforts is on high availability:
- Gain experience with high-availability tools & techniques to be able to make value-based judgments of cost versus benefit.

Four broad categories:
- Control system failure mode analysis.
- High-availability electronics platforms (ATCA).
- High-availability integrated control systems: conflict avoidance & failover, model-based resource monitoring.
- Control system as a tool for implementing system-level HA: fault detection methods, failure modes & effects.

HA means doing things differently...

The ILC must apply techniques not typically used at an accelerator, particularly in software:
- The development culture must be different this time.
- Cannot build ad hoc with in-situ testing.
- Build modeling, simulation, testing, and monitoring into the hardware and software methodology up front.

Reliable hardware:
- From instrumentation electronics to servers and disks.
- Redundancy where feasible; otherwise adapt in software.
- Modeling and simulation (T. Himel).

Reliable software:
- Equally important.
- Software has many more internal states, which are difficult to predict.
- Modeling and simulation are needed here for networking and software.

Controls topic areas

- LLRF algorithms
- RF phase & timing distribution, synchronization
- Machine automation, beam-based feedback
- ATCA evaluation as a front-end instrumentation platform
- ATCA evaluation for control system integration
- HA integrated control system
- Integrated control system as a tool for system-level HA
- Remote access, remote operations (GAN/GDN)
- Failure modes analysis

Lots of opportunities to get involved...