The Last DZERO DAQ Talk? G. Watts (UW/Seattle) for the DZERO Collaboration. “More reliable than an airline*” (* or the GRID)

Standard tiered HEP DAQ/trigger design: Level 1 (firmware trigger algorithms, specialized hardware, 1.7 MHz), Level 2 (2 kHz), then the DAQ and L3 trigger farm (software trigger algorithms, commodity hardware, 1 kHz). Readout at 1 kHz is about 300 MB/sec; the data rate out of the trigger farm is 60 MB/sec. Full readout occurs at 200 Hz; the lower levels use fast, coarse-resolution readout.

A standard scatter-gather architecture. ROC: a VME-based readout crate with a Single Board Computer (SBC). L3 Node: a commodity Linux multicore computer. The two are about 20 m apart, electrically isolated, and connected through a large Cisco switch.

As the Tevatron increased luminosity, the number of multiple interactions increased, noise hits increased, physicists got clever, the trigger list changed, CPU time per event and data size grew, and trigger hardware improvements moved rejection to L1/L2. The architecture lasted 10 years!
- Trigger software written by a large collection of physicists who are not real-time programmers.
- CPU time/event has more than tripled.
- Continuous upgrades since operations started: about 10 new crates added since the start.
- Started with 90 nodes, ended with almost 200, peaked around 330; all have been replaced.
- Single core at the start; the last purchase was dual 4-core.
- No major unplanned outages.

Constant pressure: L3 deadtime shows up in this gap!

An overwhelming success, this DAQ system! 10,347,683 K events served!! Papers! (The previous DAQ system was VAX/VMS based.)

Data Flow: unidirectional; directed (destination known); buffered at origin and destination, not buffered while in flight; minimize copying of data. Flow Control: 100% TCP/IP! Small messages are bundled to decrease network overhead; messages are compressed via pre-computed lookup tables. Configuration: automatic configuration; configuration at boot time; auto-reconfiguration when hardware changes; an extensive monitoring system with minimal impact on running.
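To make the message-bundling idea concrete, here is a minimal sketch (the class and method names are hypothetical, not the DZERO code) of queuing small control messages and shipping them as one length-prefixed TCP write:

```python
import socket
import struct

class MessageBundler:
    """Minimal sketch: bundle small control messages into one TCP send
    to reduce per-message network overhead (illustrative only)."""

    def __init__(self, sock: socket.socket, max_bundle: int = 10):
        self.sock = sock
        self.max_bundle = max_bundle
        self.pending: list[bytes] = []

    def queue(self, payload: bytes) -> None:
        self.pending.append(payload)
        if len(self.pending) >= self.max_bundle:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        # Length-prefix each message so the receiver can split the bundle,
        # then prefix the whole bundle with its own length.
        body = b"".join(struct.pack("!I", len(m)) + m for m in self.pending)
        self.sock.sendall(struct.pack("!I", len(body)) + body)
        self.pending.clear()
```

The pre-computed lookup tables mentioned above presumably serve a similar purpose: configuration that both ends already share does not need to travel at run time, only short indices into it.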

The Read Out Crate: VME-based crates; all DZERO readout ends here; the interface between the detector and the DAQ. One slot holds the SBC: VMIC 7750s (PIII, 933 MHz, 128 MB RAM, Universe II chip, dual 100 Mb Ethernet; 1 or 2 links used depending on data load, 4 upgraded to Gb late in the run to deal with increased occupancy). The SBC has a custom-built extension: control lines on the J3 backplane that trigger readout, so readout starts when detector control signals that an event is ready. A home-brew device driver (Fermilab CD) provides efficient VME readout.
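A minimal sketch of the readout loop on an SBC, assuming hypothetical driver and queue objects (the real system uses a custom Fermilab-CD kernel driver for the VME access):

```python
def sbc_readout_loop(event_ready, vme, fragment_queue):
    """Illustrative only: wait for the detector's event-ready signal, DMA the
    crate data over VME, and buffer the fragment until a destination is known."""
    while True:
        event_ready.wait()               # control lines on the J3 backplane fire
        fragment = vme.dma_block_read()  # Universe II block transfer (DMA)
        fragment_queue.put(fragment)     # held until the Routing Master picks a node
```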

Cisco 6509 switch: 16 Gb/s backplane; 9 module slots, all full; 8-port Gb modules; 112 MB of shared output buffer per 48 ports, not enough to do full event buffering; the connections between the SBCs and the switch are optically isolated; static configuration, maintained by the Fermilab Computing Division. Farm nodes: n-core 1U AMDs and Xeons with varying amounts of memory; 1 trigger process per core; a single 100 Mb Ethernet connection; peaked around 330 1U boxes, but later in the run a smaller number with faster CPUs; 1 local disk (which failed rarely); fans (which failed a lot).

(1) A Level 2 trigger fires, and the Trigger Framework sends an accept message to the DAQ Routing Master as well as to the detector readout control. The Routing Master is the point of entry into the DAQ system: a single SBC sitting in a special crate with an interface to the Trigger Framework. The L2 Accept message contains an event number and a 128-bit trigger mask. The readout crates start pushing data over VME to their SBCs; the data is buffered in the SBCs, and the SBC makes the data uniform.

(2) The Routing Master decides which node should build the event and run the Level 3 trigger code for it; this information is sent to all SBCs and to the node. The node decision is based on the trigger bit mask and cached configuration information. Each node's busy fraction is also taken into account using a token-counting system, which properly accounts for the node's processing ability. Ten decisions are accumulated before the nodes and SBCs are notified.
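The token-counting idea can be sketched as follows (the trigger-bit table and the 10-decision bundling follow the slide; the class, names, and details are illustrative):

```python
from collections import defaultdict

class RoutingMasterSketch:
    """Illustrative node selection: a node holds one token per event it can still
    accept, so picking the node with the most tokens naturally favours faster nodes."""

    def __init__(self, nodes_for_bit: dict[int, list[str]], initial_tokens: int):
        self.nodes_for_bit = nodes_for_bit               # trigger bit -> capable nodes
        self.tokens = defaultdict(lambda: initial_tokens)
        self.pending: list[tuple[int, str]] = []         # (event number, chosen node)

    def on_l2_accept(self, event_number: int, trigger_mask: int) -> None:
        candidates = {node
                      for bit, nodes in self.nodes_for_bit.items()
                      if trigger_mask & (1 << bit)
                      for node in nodes}
        node = max(candidates, key=lambda n: self.tokens[n])
        self.tokens[node] -= 1
        self.pending.append((event_number, node))
        if len(self.pending) >= 10:                      # bundle 10 decisions per broadcast
            self.broadcast()

    def on_token_returned(self, node: str) -> None:
        self.tokens[node] += 1                           # node has room for another event

    def broadcast(self) -> None:
        # Send the bundled routing decisions to all SBCs and the chosen nodes (omitted).
        self.pending.clear()
```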

(3) The SBCs send their data fragments to the farm nodes. Once all the event fragments have arrived, a token is sent back to the Routing Master, as long as the node has room to accept another event.
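A sketch of the fragment collection on a farm node (the class and its fields are illustrative only):

```python
class EventBuilderSketch:
    """Illustrative only: collect one fragment per expected crate, then return a
    token to the Routing Master if there is room for another event."""

    def __init__(self, expected_crates: set[int], send_token, max_events: int):
        self.expected_crates = expected_crates
        self.send_token = send_token                      # callback to the Routing Master
        self.max_events = max_events
        self.partial: dict[int, dict[int, bytes]] = {}    # event -> crate -> fragment
        self.complete: dict[int, dict[int, bytes]] = {}

    def on_fragment(self, event_number: int, crate_id: int, data: bytes) -> None:
        frags = self.partial.setdefault(event_number, {})
        frags[crate_id] = data
        if set(frags) == self.expected_crates:            # all expected crates arrived
            self.complete[event_number] = self.partial.pop(event_number)
            if len(self.partial) + len(self.complete) < self.max_events:
                self.send_token()                         # room for one more event
```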

(4) The node builds the event, runs the trigger on it, and, if it passes, sends it off to the online system. The event fragments are built into the official RAW event chunk in shared memory (which was designed to do this without copying in this case). The trigger processes (~1 per core) can only read from this shared memory. The trigger result chunk is returned and sent to the online system.

Flow control: if there is no room in an SBC's internal buffers, it refuses to read out its crate, which quickly causes the DAQ to stop with a “front-end busy” status. If the number of tokens for free nodes gets too low, the Routing Master disables the triggers in the Trigger Framework. If a farm node cannot send an event to the online system, its internal buffers fill and it refuses to receive the SBCs' data. There are many minutes of buffer space in the farm!
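The back-pressure chain can be summarized in a short sketch (the watermark and the object methods are made up; only the overall behaviour follows the slide):

```python
LOW_TOKEN_WATERMARK = 20   # illustrative threshold

def routing_master_tick(free_tokens: int, trigger_framework) -> None:
    """If too few nodes can accept events, stop triggers at the source."""
    if free_tokens < LOW_TOKEN_WATERMARK:
        trigger_framework.disable_triggers()
    else:
        trigger_framework.enable_triggers()

def sbc_on_event_ready(buffers, crate) -> None:
    """An SBC with full buffers refuses to read out its crate; the detector then
    reports a front-end-busy condition and data taking stops upstream."""
    if buffers.full():
        crate.refuse_readout()
    else:
        buffers.put(crate.dma_block_read())
```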

This is not a no-single-point-of-failure system. DZERO made the policy decision that all crates are required, so each readout crate becomes a single point of failure. The Routing Master and the networking infrastructure are single points of failure, as are some of the configuration machines. The farm nodes are not. We spent extra money on the hardware that had to be reliable: out of about 70 SBCs, only 5 failed in 10 years, and 2 of those were due to user error (we also had spares on hand); the Cisco switch never hiccuped, and we bought more capacity than we required. The farm nodes failed a lot: mostly fans, some disks. And then there was the software…

Run-control states: Start, Configured, Running, Paused. Configure (26 sec) takes the system from Start to Configured; Start Run #nnnnn (1 sec) moves it to Running and Stop Run #nnnnn (1 sec) moves it back; Pause (1 sec) and Resume (1 sec) switch between Running and Paused; Unconfigure (4 sec) returns to Start.
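A minimal sketch of this state machine (the transition times come from the slide; the exact transition graph is an assumption, not taken from the real code):

```python
# (state, command) -> (new state, approximate transition time in seconds)
TRANSITIONS = {
    ("Start",      "Configure"):   ("Configured", 26.0),
    ("Configured", "StartRun"):    ("Running",     1.0),
    ("Running",    "Pause"):       ("Paused",      1.0),
    ("Paused",     "Resume"):      ("Running",     1.0),
    ("Running",    "StopRun"):     ("Configured",  1.0),
    ("Configured", "Unconfigure"): ("Start",       4.0),
}

class RunControlSketch:
    def __init__(self):
        self.state = "Start"

    def handle(self, command: str) -> float:
        new_state, seconds = TRANSITIONS[(self.state, command)]
        self.state = new_state
        return seconds   # roughly how long the transition took in the real system
```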

Over the course of the 10 years we had the full class of software-related crashes: in Linux (slowdowns or hard crashes), in our custom kernel driver for efficient VME readout, and in our user-level code that was part of the DAQ. Our configuration strategy turned out to be immensely helpful here: hard-reset any machine and, when it finishes booting, it should come back into the DAQ without any shifter intervention, with minimal interruption of data flow. This strategy helped keep L3/DAQ's dead time to less than a few minutes per week.

Single Board Computers: the most expensive single item in the system (besides the switch) and the most reliable (they went out of production!); replaced at less than one per year, with user error often at fault; run a reduced version of Linux, with updates sometimes required by Fermilab security policies; the code was not modified in the last 7 years of the experiment. Problems: a large number of TCP/IP connections caused intermittent Linux flakiness, and a TFW 16-bit run-number rollover in coincidence with another problem caused trouble. Farm nodes: standard rack-mount servers, cheap (too cheap sometimes), managed jointly with the Fermilab CD, who have a lot of experience with farms. Problems: reliability issues (about 1/month required warranty service); quality was a function of the purchase batch, so we modified the bidding to account for this; computers coming back from service had to be cleaned of old versions of the trigger; special hacks were needed for partitions with 1 node; and we did have to debug the Linux kernel driver…

At 1 kHz the CPU is about 80% busy. Data transfer is VME block transfer (DMA) via the Universe II module.

CPU time per event shows different behavior vs. luminosity: dual-core nodes seem to do better at high luminosity (more modern systems with better memory bandwidth).

SBC buffering: event fragments are buffered until the RM sends a decision; the RM buffers up to 10 decisions before sending them out; we never had an SBC queue overflow. Farm node buffering: the RM bases its node decision on the size of each node's internal queue, which provides a large amount of buffering space and automatically accounts for node speed differences without having to make measurements; the occasional infinite loop does not cause one node to accumulate an unusually large number of events. Cisco switch buffering: the switch drops buffers when it cannot send; the RM token scheme means we never saw this unless some other problem had occurred.

This DØ DAQ/L3 trigger has taken every single physics event for DØ since it started taking data: more than 10 billion physics events, plus calibration runs, test runs, etc. 63 VME sources powered by Single Board Computers sent data to ~200 off-the-shelf commodity CPUs at the end of the run. The data flow architecture is push, and it is crash and glitch resistant. It survived all the hardware, trigger, and luminosity upgrades smoothly; the farm grew from 90 to ~330 and back to 200 nodes with no major change in architecture. The evolved control architecture means minimal dead time is incurred during normal operations. Primary responsibility is carried by 3 people (who also work on physics analysis); this works only because the Fermilab Computing Division takes care of the farm nodes...

A subset of the people who worked on L3, gathered for the Tevatron End-of-Run Party.

The calorimeter readout has a maximum speed of 1.2 kHz and is cost-prohibitive to upgrade. Two choices: make a pre-decision based on the other detector readouts and read the calorimeter only if it succeeds (so the calorimeter is read at a lower rate), a regions-of-interest-like solution; or add a Level 2.

Run multiple copies of the trigger decision software: hyper-threaded dual-processor nodes run 3 copies, for example; the new 8-core machines will run 7-9 copies (only preliminary testing done); we designed a special mode to stress-test nodes in situ by force-feeding them all data; this works better than any predictions we have done. Software: IOProcess lands all data from the DAQ and does the event building; FilterShell (multiple copies) runs the decision software. All levels of the system are crash insensitive: if a FilterShell crashes, a new one is started and reprogrammed by IOProcess, and the rest of the system is none the wiser. Software distribution is 300 MB, which takes too long to copy!
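A hedged sketch of the crash-insensitive filter handling (the process name and command-line flag are invented): an IOProcess-like parent keeps one filter per core and restarts any copy that dies, reloading the cached trigger configuration.

```python
import subprocess
import time

def supervise_filters(n_cores: int, trigger_config_path: str) -> None:
    """Illustrative only: restart crashed filter processes so the rest of the
    system never notices."""
    def start() -> subprocess.Popen:
        # "filtershell" and "--config" are placeholders, not the real program.
        return subprocess.Popen(["filtershell", "--config", trigger_config_path])

    filters = [start() for _ in range(n_cores)]
    while True:
        for i, proc in enumerate(filters):
            if proc.poll() is not None:   # the filter exited or crashed
                filters[i] = start()      # restart and reprogram it
        time.sleep(1.0)
```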

Farm nodes are told: what readout crates to expect on every event; what the trigger list programming is; where to send the accepted events and how to tag them. The Routing Master is told: what readout crates to expect for a given set of trigger bits; what nodes are configured to handle particular trigger bits. The front-end crates (SBCs) are told: the list of all nodes that data can be sent to. Much of this configuration information is cached in lookup tables for efficient communication at run time.
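The cached configuration can be pictured as three small tables, one per component (the field names below are illustrative, not the real schema):

```python
from dataclasses import dataclass, field

@dataclass
class FarmNodeConfig:
    expected_crates: set[int]            # readout crates to expect on every event
    trigger_list: str                    # trigger list programming
    output_destination: str              # where to send accepted events
    output_tag: str = "physics"          # how to tag them

@dataclass
class RoutingMasterConfig:
    crates_for_bit: dict[int, set[int]] = field(default_factory=dict)  # trigger bit -> crates
    nodes_for_bit: dict[int, set[str]] = field(default_factory=dict)   # trigger bit -> nodes

@dataclass
class SBCConfig:
    known_nodes: list[str] = field(default_factory=list)  # every node data can be sent to
```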

Standard hierarchical layout: the DØ shifter drives COOR (“COOR”dinate, the master run control), which talks to the Calorimeter, Level 2, Level 3, and Online Examines systems and to the configuration database; within Level 3, commands fan out to the Routing Master, the SBCs, and the farm nodes. Intent flows from the shifter down to the lowest levels of the system; errors flow in reverse.

COOR sends a state description down to Level 3: the trigger list, the minimum number of nodes, where to send the output data, and which ROCs should be used. On L3 component failures, crashes, etc., the supervisor updates its state and reconfigures if necessary. The supervisor's inputs are the current state and the desired state; its output is the set of commands sent to each L3 component to change that component's state. The state configuration step calculates the minimum number of steps to get from the current state to the desired state; the complete sequence is calculated before the first command is issued, and this calculation takes ~9 seconds.
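The reconfiguration step amounts to a diff between the desired and current per-component states; a toy version (with invented component and state names) looks like this:

```python
def plan_commands(current: dict[str, str], desired: dict[str, str]) -> list[tuple[str, str]]:
    """Compute the full command sequence before issuing anything: one command
    per component whose current state differs from the desired one."""
    return [(component, want)
            for component, want in desired.items()
            if current.get(component) != want]

# Example: only node042 needs a command.
print(plan_commands(
    current={"node041": "running:triglist_v12", "node042": "idle"},
    desired={"node041": "running:triglist_v12", "node042": "running:triglist_v12"},
))  # -> [('node042', 'running:triglist_v12')]
```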

Boundary conditions: dead time means beam in the machine and no data flowing, and run changes contribute to downtime! The current operating efficiency is 90-94%, including 3-4% trigger deadtime; that translates to less than 1-2 minutes per day of downtime. Any configuration change means a new run (prescales for luminosity changes, for example), as does a system failing and needing to be reset. Clearly, run transitions have to be very quick! The approach: push responsibility down a level, and don't touch configuration unless it must be changed. We didn't start out this way!

COOR caches the complete state of the system. It is a single point of failure for the whole system, but it never crashes (recovery would take about 45 minutes), and it can re-issue commands if L3 crashes. The L3 supervisor caches the desired state of the system (COOR's view) and the actual state, so COOR is never aware of any difficulties unless L3 cannot take data due to a failure; some reconfigurations require a minor interruption of data flow (1-2 seconds). The farm nodes cache the trigger filter programming, so if a filter process goes bad no one outside the node has to know, and the event that caused the problem is saved locally. Fast response to problems == less downtime.

Farm node: configuration was cached locally and replayed after a crash; if not available locally, the global L3/DAQ configuration cache was used. The run paused for less than a second when a node either left or returned to the run, which happened relatively frequently. SBC in ROC: configuration is cached in the Routing Master; data flow stops during a crash, taking as little as 20 seconds or up to a minute to come back. The code was very robust, so this was rare.