Control of large scale distributed DAQ/trigger systems in the networked PC era
Toby Burnett, Kareem Kazkaz, Gordon Watts – DAQ2000 Workshop, Nuclear Science Symposium and Medical Imaging Conference

Presentation transcript:

Control of large scale distributed DAQ/trigger systems in the networked PC era
Toby Burnett, Kareem Kazkaz, Gordon Watts
Gennady Briskin, Michel Clements, Dave Cutts, Sean Mattingly
DAQ2000 Workshop, Nuclear Science Symposium and Medical Imaging Conference, October 20, 2000

The Trigger Environment
Pressure on the trigger:
– Shorter decision times required
– More complex algorithms
– More algorithms
– More data
Hardware triggers:
– Inflexible, dedicated
– Hard to change as accelerator conditions change
– FPGAs: hardware moving into firmware
– (but) Well understood
Cheap commodity components:
– High-speed networks
– Cheap memory
– Cheap CPU power as never before
The era of the PC based trigger:
– Farms of triggers where traditional hardware exists (BTeV has very little in the way of a hardware trigger)
[Slide graphic: hardware based triggers vs. software based triggers – Speed!]

Farm Based Triggers/DAQ
Control code:
– Management
– Configuration
– Monitoring
– Error recovery
DAQ code:
– Controls DAQ components
– Network control & routing for the farm
Physics code:
– Performs the actual trigger decision
– Frequently interfaces with others
[Slide graphic: a control node and a DAQ node connected to many farm nodes, each running physics code; raw data flows in and passed raw data flows out.]
All code, except the control code, is independent of the rest.

Control Configuration Tasks
– Management
  – Interface with other parts of the online system
  – Resource management
  – Controls other components
– Configuration (a configuration record of this kind is sketched below)
  – Event routing
  – Farm partitions (multiple runs)
  – Trigger code
  – DAQ parameters
– Monitoring
  – System performance
  – Resource usage
  – Real-time debugging (hardware & software)
– Error recovery
  – Asynchronous and synchronous errors
  – Crashed nodes!
  – On-the-fly reconfiguration – must decide if it is possible!
  – Diagnostic information for shifters & experts
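The configuration items above can be collected into a single record that the supervisor receives from Run Control. The following is a minimal Python sketch, not the actual DØ data model; the field names (partition, trigger_list, routing, daq_params) are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FarmConfiguration:
    """Illustrative configuration record a farm supervisor might receive from Run Control."""
    run_number: int                       # which run this configuration belongs to
    partition: str                        # farm partition, so multiple runs can coexist
    trigger_list: List[str]               # trigger/filter code to load on each node
    farm_nodes: List[str]                 # nodes assigned to this partition
    routing: Dict[str, List[str]]         # readout component -> nodes receiving its data
    daq_params: Dict[str, str] = field(default_factory=dict)  # miscellaneous DAQ parameters


# Example: a small two-node partition (values are invented).
cfg = FarmConfiguration(
    run_number=142857,
    partition="physics",
    trigger_list=["JET_HIGH", "MU_LOW"],
    farm_nodes=["node01", "node02"],
    routing={"readout_crate_1": ["node01", "node02"]},
)
```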

The Environment: Distributed Run Control
Run Control – farm interaction is changing:
– A simple RC program used to handle the whole DAQ/trigger system (FECs, trigger, DAQ, etc.)
– DØ Run 1: RC was complex
– DØ Run 2: too complex for a single RC; the L3 (farm) Supervisor has more responsibility
– Hides farm implementation details from RC: a black box for the rest of the online system
– More flexibility in how the trigger/DAQ system is managed
Command hierarchy (see the sketch below):
– Run Control (run #1, trigger list, 10 farm nodes)
– Level 3 Supervisor (trigger list, specific nodes, readout components, routing)
– Readout Component (routing), Farm Node (trigger list)
Distributed:
– Components can fail in a way that doesn't bring down the run (e.g. a node crash)
– Allows the supervisor to recover from errors in the farm and DAQ without having to involve RC when possible
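One way to realize this hierarchy is for the supervisor to expand the Run Control request into per-component commands, handing each layer only the fields it needs. The sketch below is illustrative only; the dictionary keys and function names are assumptions, not the DØ interfaces.

```python
# Illustrative configuration handed to the L3 Supervisor by Run Control.
config = {
    "run_number": 142857,
    "trigger_list": ["JET_HIGH", "MU_LOW"],
    "farm_nodes": ["node01", "node02"],
    "routing": {"readout_crate_1": ["node01", "node02"]},
}


def supervisor_commands(cfg):
    """Expand the Run Control request into per-component commands.

    Readout components only receive routing; farm nodes only receive the trigger list.
    """
    commands = []
    for component, nodes in cfg["routing"].items():
        commands.append(("readout", component, {"route_to": nodes}))
    for node in cfg["farm_nodes"]:
        commands.append(("farm_node", node, {"triggers": cfg["trigger_list"]}))
    return commands


for target, name, payload in supervisor_commands(config):
    print(f"configure {target} {name}: {payload}")
```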

Design of the Supervisor
Goals:
– Configuration
  – Effectively manage transitions from one state to another (running, paused, new run, etc.)
  – A large number of separate entities must be controlled
  – Speed issues: recovery, short accelerator times between shots, etc.
– Recovery
  – Physics code will change throughout the life of the experiment: test, test, test!
  – Will still have crashes & explosions
  – One of the few systems that can suffer a major component failure and not terminate a run!
State diagrams – the standard approach (a minimal transition table is sketched below):
– Use a state definition language to define each transition in the farm
– Reconfiguration requests that don't change the global state
– Difficult to deal with exception conditions and asynchronous errors, especially if they require a farm reconfiguration
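A state definition approach can be as simple as a table of allowed transitions that the supervisor consults before acting. The state and command names below are hypothetical, chosen only to illustrate the idea.

```python
# Allowed global farm states and the transitions between them (illustrative names).
TRANSITIONS = {
    ("configured", "start_run"): "running",
    ("running", "pause"):        "paused",
    ("paused", "resume"):        "running",
    ("running", "stop_run"):     "configured",
    ("paused", "stop_run"):      "configured",
}


class FarmStateMachine:
    def __init__(self, state="configured"):
        self.state = state

    def request(self, command):
        """Apply a Run Control command if the transition is defined; reject otherwise."""
        key = (self.state, command)
        if key not in TRANSITIONS:
            raise ValueError(f"illegal transition: {command!r} while {self.state!r}")
        self.state = TRANSITIONS[key]
        return self.state


sm = FarmStateMachine()
print(sm.request("start_run"))  # -> running
print(sm.request("pause"))      # -> paused
```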

The Ideal State
Maintain an ideal state:
– The ideal state is built only if RC commands make sense
– Eliminates erroneous input
– Prevents leaving the farm in an unknown state due to an RC failure
– There is a mapping from an ideal state to a hardware state/configuration
Maintain an actual state:
– Comparison between the ideal and actual states generates commands (see the reconciliation sketch below)
Did an RC configuration change succeed?
– Are the actual and ideal states close enough to report success?
[Slide graphic: RC commands and specification build the ideal run control state, which maps to an ideal L3 hardware state; a component command generator compares it against the actual L3 hardware state, and the components execute the commands (or fail).]
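The "compare ideal with actual and generate commands" idea is essentially a reconciliation loop. A minimal sketch follows, with hypothetical component names and command formats; the real system's notion of "close enough" would be richer than this equality test.

```python
def reconcile(ideal, actual):
    """Generate the commands needed to drive the actual state toward the ideal state.

    Both states are maps of component name -> desired/observed configuration.
    """
    commands = []
    for component, wanted in ideal.items():
        if actual.get(component) != wanted:
            commands.append((component, "configure", wanted))
    for component in actual:
        if component not in ideal:
            commands.append((component, "release", None))
    return commands


def close_enough(ideal, actual, required):
    """Report success if every *required* component already matches the ideal state."""
    return all(actual.get(c) == ideal.get(c) for c in required)


ideal = {"node01": {"triggers": ["JET_HIGH"]}, "node02": {"triggers": ["JET_HIGH"]}}
actual = {"node01": {"triggers": ["JET_HIGH"]}, "node02": None}  # node02 lost its configuration
print(reconcile(ideal, actual))                  # re-issues the configure command for node02 only
print(close_enough(ideal, actual, ["node01"]))   # True: the required minimum is satisfied
```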

Faults & Errors
The ultimate goal is 100% uptime and 0% trigger deadtime!
Faults are rarely predictable:
– Hardware: every input and output was always known
– In software, this is no longer always true
Two classes:
– Catastrophes
  – The whole farm crashes, on every event
  – A critical readout component fails to configure correctly
– Minor errors
  – Low-grade crashes (in Run 1 we lost the use of our build system due to a licensing issue for a short while and had to deal with a farm crash once per 20 minutes)
  – A non-critical component fails to configure
Failure mode analysis:
– Will tell you the most likely failures
– And the places where a failure will do the most damage
– Helps to focus error recovery efforts
Still need humans!

Farm Node Failures
Separate the physics code from the framework and control code:
– Physics code changes over the life of the experiment
– Separate memory space
– Ensure that data coming back from the filter process is not interpreted by the framework at all: it is blindly handed off to tape
– Automatically restart a dead filter process, or start a debugger on it without interrupting the other filter processes (a restart watchdog is sketched below)
[Slide graphic: a farm node runs a framework with crash detection & recovery code and roughly one filter process per CPU; the filter process is a special case whose recovery is indicated by the failure mode analysis, and it can crash and recover.]
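The "automatically restart a dead filter process" piece can be prototyped as a small watchdog around one child process per CPU. The sketch below uses Python's subprocess module with a stand-in command; the real filter executable and its arguments are whatever the experiment uses.

```python
import subprocess
import time

# Stand-in for the real filter executable; each filter runs in its own address space,
# so a crash cannot corrupt the framework.
FILTER_CMD = ["python3", "-c", "import time; time.sleep(3600)"]
N_FILTERS = 2  # roughly one filter process per CPU


def spawn():
    return subprocess.Popen(FILTER_CMD)


filters = [spawn() for _ in range(N_FILTERS)]

while True:
    for i, proc in enumerate(filters):
        if proc.poll() is not None:          # process has exited (crash or otherwise)
            print(f"filter {i} died with code {proc.returncode}; restarting")
            filters[i] = spawn()             # restart without touching the other filters
    time.sleep(1.0)
```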

General Recovery
At failure:
– The supervisor is notified of the failure
– The actual state is updated to reflect the failure
– There is an understanding of what constitutes a minimal working L3 system; Stop Run is issued if this test isn't satisfied
Recovery (see the sketch below):
– Notify the shifter of the problem, or
– Measure the current state against the ideal state and re-issue commands
  – Will repair the broken device/node
  – Reconfigure the farm (expensive)
– The type of failure can affect the hardware state, which means special recovery commands can be automatically generated
– Repeated failure detection should be implemented
– No need to notify RC
– Accounting (trigger information)
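The recovery policy above (record the failure in the actual state, check the minimal working system, re-issue commands, escalate on repeated failures) can be sketched as follows. The thresholds and component names are illustrative assumptions, not DØ values.

```python
from collections import Counter

failure_counts = Counter()        # how often each component has failed recently
MAX_REPEATS = 3                   # after this many repeats, stop retrying and call the shifter
MIN_WORKING_NODES = 4             # illustrative definition of a minimal working L3 system


def on_failure(component, ideal, actual, working_nodes):
    """Decide what to do when a component failure is reported."""
    actual[component] = None                      # record the failure in the actual state
    failure_counts[component] += 1

    if working_nodes < MIN_WORKING_NODES:
        return "stop_run"                         # below the minimal working system

    if failure_counts[component] > MAX_REPEATS:
        return "notify_shifter"                   # repeated failure: stop automatic recovery

    # Otherwise measure actual against ideal and re-issue only the needed commands.
    return [("configure", component, ideal[component])]


ideal = {"node05": {"triggers": ["MU_LOW"]}}
actual = {"node05": {"triggers": ["MU_LOW"]}}
print(on_failure("node05", ideal, actual, working_nodes=10))
```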

Local/Global Fault Detection
Local:
– Timeout from a command, etc. (a timeout wrapper is sketched below)
– Explicitly designed detection in the control code
– Must be careful to track the source of the error and record it in the actual state
– Real-time detection
– Similar to old-time hardware faults: you know exactly where the fault occurred, there is a limited number of fault types, and the faults are specific – simpler for computers to handle directly
Global:
– A single monitor point can't determine sickness; global monitoring is crucial for this
– Useful to help diagnose problems for the shifters (recommend solutions? one day it will go farther!)
– Input to a rule-based system??
– Detection isn't on an event-by-event basis
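Local detection of the "timeout from a command" kind can be wrapped around every command the supervisor issues, so the fault source is recorded in the actual state at the moment it is detected. The transport (`send`, `reply_queue`) below is a placeholder for whatever the control system actually uses.

```python
import queue


def issue_with_timeout(send, component, command, reply_queue, actual, timeout=5.0):
    """Send a command and wait for the acknowledgement; record a local fault on timeout."""
    send(component, command)
    try:
        reply = reply_queue.get(timeout=timeout)
        actual[component] = reply                         # success: actual state reflects the reply
        return True
    except queue.Empty:
        # Local fault: we know exactly which component and which command failed.
        actual[component] = {"fault": "timeout", "command": command}
        return False


# Tiny demonstration with a component that never answers.
replies = queue.Queue()
actual_state = {}
ok = issue_with_timeout(lambda c, cmd: None, "readout_crate_1", "configure",
                        replies, actual_state, timeout=0.1)
print(ok, actual_state)
```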

Flexible Monitoring
Monitoring:
– XML based
– Generic web front end: take advantage of the commodity software market!
– Drop-in aggregate filters
– Displays query via HTTP requests (cross-platform; SOAP?) – a minimal server is sketched below
– Custom-designed displays (graphics) or simple static web pages
– Use an actual program in limited circumstances
– Interface for a rule-based system??
[Slide graphic: sources feed a monitor process every 5 seconds; the monitor process writes XML databases, drop-in filters aggregate them, and displays (Java?) query the results.]
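The XML-over-HTTP path can be prototyped with the Python standard library alone: build an XML snapshot of the current counters and let any display fetch it with a plain HTTP GET. The element and counter names below are invented for illustration.

```python
import http.server
import xml.etree.ElementTree as ET

# Counters that monitoring sources would update (illustrative values).
counters = {"events_in": 123456, "events_passed": 2345, "nodes_alive": 47}


def snapshot_xml():
    """Serialize the current counters into an XML document for the displays."""
    root = ET.Element("monitor")
    for name, value in counters.items():
        ET.SubElement(root, "item", name=name).text = str(value)
    return ET.tostring(root, encoding="utf-8")


class MonitorHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = snapshot_xml()
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Any cross-platform consumer (browser, custom display, rule engine) can poll this URL.
    http.server.HTTPServer(("", 8080), MonitorHandler).serve_forever()
```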

Putting It Together
No reason for a display to be the only consumer of monitor information.
An intelligent process can watch for known patterns (a simple rule watcher is sketched below):
– Good: look for correct running patterns, warning when they aren't present
– Bad: look for known bad patterns, warning when they are present, perhaps even taking automatic action
Rule-based system:
– Difficult to maintain
– Would like a snapshot feature
[Slide graphic: the monitor process, its XML databases, and filters feed diagnostic displays in the control room and a monitor that reports to the farm supervisor, Run Control, and the DAQ.]
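A first pass at such an intelligent consumer is a small set of rules evaluated against each monitoring snapshot: "good" patterns that must hold and "bad" patterns that must never appear. The rule contents and thresholds below are invented for illustration.

```python
# "Good" patterns should always hold; "bad" patterns should never be observed.
GOOD_RULES = [
    ("events are flowing",  lambda s: s["events_in_rate"] > 0),
    ("enough nodes alive",  lambda s: s["nodes_alive"] >= 40),
]
BAD_RULES = [
    ("a node is crash-looping", lambda s: s["restarts_last_5min"] > 3),
]


def evaluate(snapshot):
    """Warn when a good pattern is absent or a bad pattern is present."""
    warnings = []
    for description, holds in GOOD_RULES:
        if not holds(snapshot):
            warnings.append(f"expected pattern missing: {description}")
    for description, present in BAD_RULES:
        if present(snapshot):
            warnings.append(f"known bad pattern present: {description}")
    return warnings


print(evaluate({"events_in_rate": 0, "nodes_alive": 47, "restarts_last_5min": 5}))
```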

Conclusions
PC based DAQ/trigger:
– Large component count
– Uptime must be almost 100% for each component
– Software bugs are a reality
– Approach system design in a unified way
A general approach means:
– A flexible system
– Ease of extension: hooks for adding monitoring and other capabilities, and hooks to deal with the many error situations one can expect to encounter over 10 years of running
– Failure mode analysis
This requires:
– Robust, extensible monitoring
– An extensible control system that can handle the addition of error conditions at a later time
– Software infrastructure to tie the two together
– Being aware of shifters!