1
The Last DZERO DAQ Talk?
"More reliable than an airline*" (* or the GRID)
G. Watts (UW/Seattle) for the DZERO Collaboration
2
Standard tiered HEP DAQ/trigger design:
- Level 1: firmware trigger algorithms on specialized hardware; 1.7 MHz in, 2 kHz out; fast, coarse-resolution readout.
- Level 2: software trigger algorithms on specialized hardware; 2 kHz in, 1 kHz out.
- DAQ / L3 Trigger Farm: software trigger algorithms on commodity hardware. Full readout occurs at 1 kHz, which is about 300 MB/sec; the data rate out of the trigger farm is 60 MB/sec (about 200 Hz).
G. Watts (UW/Seattle) 2
3
A standard scatter-gather architecture:
- ROC: VME-based Read-Out Crate with a Single Board Computer.
- L3 Node: commodity Linux multicore computer, about 20 m distant, electrically isolated, connected through a large CISCO switch…
G. Watts (UW/Seattle) 3
4
G. Watts (UW/Seattle) 4
The pressure cycle: the Tevatron increases luminosity, the number of multiple interactions increases, noise hits and data size increase, physicists get clever, the trigger list changes, CPU time per event grows, and trigger hardware improvements move rejection to L1/L2.
The architecture lasted 10 years!
- Trigger software was written by a large collection of physicists who are not real-time programmers.
- CPU time per event has more than tripled.
- Continuous upgrades since operation started: about 10 new crates have been added since the start.
- Started with 90 nodes, ended with almost 200 (the peak was about 330), and all have been replaced: single-core at the start, with the last purchase dual 4-core.
- No major unplanned outages.
5
G. Watts (UW/Seattle) 5
Constant pressure: L3 deadtime shows up in this gap!
6
G. Watts (UW/Seattle) 6
An overwhelming success, this DAQ system: 10,347,683 K events served!! Papers!
(The previous DAQ system was VAX/VMS based.)
7
G. Watts (UW/Seattle) 7
8
Data flow:
- Unidirectional and directed (destination known).
- Buffered at origin and destination; not buffered while in flight.
- Copying of data is minimized.
Flow control:
- 100% TCP/IP!
- Small messages are bundled to decrease network overhead.
- Messages are compressed via pre-computed lookup tables.
Configuration:
- Automatic configuration, done at boot time.
- Auto-reconfiguration when hardware changes.
- Extensive monitoring system with minimal impact on running.
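To make the message-bundling idea concrete, here is a minimal C++ sketch, not the DZERO code: several small control messages are length-prefixed into one buffer and handed to TCP in a single send. The class name, buffer size, and `sendOnSocket` hook are all invented for illustration.

```cpp
// Hypothetical sketch: bundle several small control messages into one
// TCP send() to reduce per-packet overhead. Names and sizes are illustrative.
#include <cstdint>
#include <vector>

class BundledSender {
public:
    explicit BundledSender(std::size_t maxBundleBytes = 1400)
        : maxBytes_(maxBundleBytes) {}

    // Queue one small message; flush when the bundle would exceed one chunk.
    void queue(const void* msg, std::uint16_t len) {
        if (buffer_.size() + sizeof(len) + len > maxBytes_) flush();
        const auto* p = reinterpret_cast<const std::uint8_t*>(&len);
        buffer_.insert(buffer_.end(), p, p + sizeof(len));   // length prefix
        const auto* m = reinterpret_cast<const std::uint8_t*>(msg);
        buffer_.insert(buffer_.end(), m, m + len);           // payload
    }

    // Hand the whole bundle to the transport in a single write.
    void flush() {
        if (buffer_.empty()) return;
        sendOnSocket(buffer_.data(), buffer_.size());        // one TCP send
        buffer_.clear();
    }

private:
    void sendOnSocket(const std::uint8_t*, std::size_t) { /* ::send(fd, ...) in real code */ }
    std::size_t maxBytes_;
    std::vector<std::uint8_t> buffer_;
};
```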
9
G. Watts (UW/Seattle) 9
The Read-Out Crate:
- VME-based crates; all DZERO readout ends here. It is the interface between the detector and the DAQ.
- One slot holds the SBC: a VMIC 7750 (PIII, 933 MHz, 128 MB RAM, Universe II chip).
- Dual 100 Mb Ethernet; 1 or 2 ports used depending on data load. 4 were upgraded to Gb late in the run to deal with increased occupancy.
- The SBC has a custom-built extension: control lines on the J3 backplane that trigger readout. Readout starts when detector control signals that there is an event ready.
- Home-brew device driver (Fermilab CD) for efficient VME readout.
10
G. Watts (UW/Seattle) 10
CISCO 6509 switch:
- 16 Gb/s backplane; 9 module slots, all full, with 8-port Gb modules.
- 112 MB of shared output buffer per 48 ports: not enough to do full event buffering.
- The connection between the SBCs and the switch is optically isolated.
- Static configuration, maintained by the Fermilab Computing Division.
Farm nodes:
- n-core 1U AMDs and Xeons with varying amounts of memory; 1 trigger process per core.
- Single 100 Mb Ethernet connection.
- Peaked around 330 1U boxes, but later in the run a smaller number with faster CPUs.
- 1 local disk (which failed rarely) and fans (which failed a lot).
11
G. Watts (UW/Seattle) 11
1. A Level 2 trigger fires, and an accept message is sent by the Trigger Framework to the DAQ Routing Master as well as to the detector readout control.
- The Routing Master is the point of entry into the DAQ system: a single SBC, sitting in a special crate with an interface to the Trigger Framework.
- The L2 Accept message contains an event number and a 128-bit trigger mask.
- The readout crates start pushing data over VME to their SBCs; the data is buffered in the SBCs, and the SBC makes the data uniform.
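The slide says the L2 Accept message carries an event number and a 128-bit trigger mask; a possible C++ representation, with the field names and the 32-bit event-number width assumed rather than taken from the real message format, might look like this:

```cpp
#include <array>
#include <cstdint>

// Hypothetical layout for the L2 Accept message described above:
// an event number plus a 128-bit trigger-bit mask (stored as two 64-bit words).
struct L2AcceptMessage {
    std::uint32_t eventNumber;                 // event identifier from the Trigger Framework
    std::array<std::uint64_t, 2> triggerMask;  // 128 trigger bits

    // True if the given trigger bit (0..127) fired for this event.
    bool triggerFired(unsigned bit) const {
        return (triggerMask[bit / 64] >> (bit % 64)) & 1ULL;
    }
};
```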
12
G. Watts (UW/Seattle) 12
2. The Routing Master decides which node should build the event and run the Level 3 trigger code for it. This information is sent to all the SBCs and to the node.
- The node decision is based on the trigger bit mask and cached configuration information.
- Each node's busy fraction is also taken into account, using a token-counting system; this properly accounts for the node's processing ability.
- 10 decisions are accumulated before the nodes and SBCs are notified.
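A hedged sketch of the token-counting node selection described above; every class and member name is invented, and the "most free tokens wins" tie-break is an assumption, not the actual Routing Master policy:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: each farm node advertises free event slots as "tokens";
// a node is a candidate if it is configured for at least one fired trigger
// bit; decisions are accumulated and broadcast in batches (10 in the talk).
struct NodeState {
    int           id;              // node identifier
    int           tokens;          // free event buffers the node has reported
    std::uint64_t triggerBits[2];  // which of the 128 trigger bits it handles
};

class RoutingMaster {
public:
    static constexpr std::size_t kBatchSize = 10;

    // Returns the chosen node index, or -1 if no configured node has a free token.
    int route(const std::uint64_t fired[2], std::vector<NodeState>& nodes) {
        int best = -1;
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            const NodeState& n = nodes[i];
            const bool eligible = (n.triggerBits[0] & fired[0]) != 0 ||
                                  (n.triggerBits[1] & fired[1]) != 0;
            if (eligible && n.tokens > 0 &&
                (best < 0 || n.tokens > nodes[best].tokens))
                best = static_cast<int>(i);
        }
        if (best >= 0) {
            --nodes[best].tokens;                 // consume one slot on that node
            pending_.push_back(nodes[best].id);   // remember the decision
            if (pending_.size() >= kBatchSize)
                broadcastDecisions();             // notify SBCs and nodes in one go
        }
        return best;  // -1 means no free node: triggers must be disabled upstream
    }

private:
    void broadcastDecisions() { pending_.clear(); /* send to SBCs and nodes here */ }
    std::vector<int> pending_;
};
```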
13
G. Watts (UW/Seattle) 13
3. The SBCs send their data fragments to the farm nodes. Once all the event fragments have arrived, a token is sent back to the Routing Master, as long as the node has room to accept another event.
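One way to picture the fragment bookkeeping on a farm node, purely as an illustration (the crate IDs, buffer counting, and single-event scope are simplifications, not the DZERO implementation):

```cpp
#include <cstdint>
#include <set>
#include <utility>

// Illustrative sketch: a farm node tracks which readout crates have delivered
// their fragment for an event; once the expected set is complete and there is
// still buffer space, a token goes back to the Routing Master.
class EventAssembly {
public:
    EventAssembly(std::set<int> expectedCrates, int freeBuffers)
        : expected_(std::move(expectedCrates)), freeBuffers_(freeBuffers) {}

    // Called when a fragment from 'crateId' arrives.
    // Returns true when the event is complete and a token should be sent.
    bool fragmentArrived(std::uint32_t /*eventNumber*/, int crateId) {
        arrived_.insert(crateId);
        if (arrived_ == expected_ && freeBuffers_ > 0) {
            arrived_.clear();      // ready for the next event
            --freeBuffers_;        // this event now occupies a buffer
            return true;           // send a token back to the Routing Master
        }
        return false;
    }

private:
    std::set<int> expected_;
    std::set<int> arrived_;
    int freeBuffers_;
};
```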
14
G. Watts (UW/Seattle) 14
4. The node builds the event, runs the trigger on it, and, if it passes, sends it off to the online system.
- The event fragments are built into the official RAW event chunk in shared memory (which was designed to do this without copying in this case).
- The trigger processes (~1 per core) can only read from this shared memory.
- The trigger result chunk is returned and sent to the online system.
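The zero-copy event building could look roughly like the following POSIX shared-memory sketch; the segment name, size, and layout are invented, and only the idea of writing each fragment once at its final offset while trigger processes map the same segment read-only comes from the slide.

```cpp
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical segment name and size, for illustration only.
constexpr const char* kShmName = "/l3_event_buffer";
constexpr std::size_t kShmSize = 16 * 1024 * 1024;   // 16 MB

// Create and map a shared-memory segment for event building.
void* createEventBuffer() {
    int fd = shm_open(kShmName, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, kShmSize) != 0) { close(fd); return nullptr; }
    void* base = mmap(nullptr, kShmSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return base == MAP_FAILED ? nullptr : base;
}

// The builder appends each fragment directly at its final offset in the
// RAW-event layout, so no second copy is needed when the event is handed to a
// trigger process (which would mmap the same segment with PROT_READ only).
std::size_t appendFragment(void* base, std::size_t offset,
                           const void* fragment, std::size_t len) {
    std::memcpy(static_cast<char*>(base) + offset, fragment, len);
    return offset + len;
}
```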
15
G. Watts (UW/Seattle) 15
Flow control:
- If there is no room in an SBC's internal buffers, it will refuse to read out the crate; this quickly causes the DAQ to stop with a "front-end busy" status.
- If the number of tokens for free nodes gets too low, the Routing Master will disable the triggers in the Trigger Framework.
- If a farm node can't send an event to the online system, its internal buffers will fill and it will refuse to receive the SBCs' data.
- There are many minutes of buffer space in the farm!
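The three backpressure rules above, written out as simple predicates; the token threshold of 5 is an invented placeholder, not the real Routing Master setting:

```cpp
// Illustrative backpressure rules distilled from the slide; thresholds are invented.
struct FlowControl {
    int sbcFreeBuffers;      // free event buffers in an SBC
    int rmFreeNodeTokens;    // tokens for nodes with room, held by the Routing Master
    int nodeFreeBuffers;     // free buffers on a farm node

    bool sbcAcceptsReadout() const    { return sbcFreeBuffers > 0; }   // else: front-end busy
    bool rmKeepsTriggersOn() const    { return rmFreeNodeTokens > 5; } // else: disable triggers
    bool nodeAcceptsFragments() const { return nodeFreeBuffers > 0; }  // else: refuse SBC data
};
```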
16
G. Watts (UW/Seattle) 16
17
This is not a no-single-point-of-failure system:
- DZERO made the policy decision that all crates are required, so each readout crate becomes a single point of failure.
- The Routing Master and the networking infrastructure are single points of failure, as are some of the configuration machines. The nodes are not.
- We spent extra money on that hardware: out of about 70 SBCs, only 5 failed in 10 years, and 2 of those were due to user error. We also had spares on hand.
- The CISCO switch never hiccupped, and we bought more capacity than we required.
- The farm nodes failed a lot: mostly fans, some disks.
- And then there was the software…
18
G. Watts (UW/Seattle) 18
Run-control states: Start, Configured, Running, Paused.
- Configure (Start → Configured): 26 sec
- Start Run #nnnnn (Configured → Running): 1 sec
- Stop Run #nnnnn (Running → Configured): 1 sec
- Pause (Running → Paused): 1 sec
- Resume (Paused → Running): 1 sec
- Unconfigure (Configured → Start): 4 sec
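The same state diagram, captured as a small transition table in C++; the state and command names follow the slide, while the table representation itself is just illustrative:

```cpp
#include <map>
#include <string>
#include <utility>

// Run-control states and the approximate transition times quoted on the slide.
enum class L3State { Start, Configured, Running, Paused };

struct Transition { L3State to; int seconds; };

const std::map<std::pair<L3State, std::string>, Transition> kTransitions = {
    {{L3State::Start,      "Configure"},   {L3State::Configured, 26}},
    {{L3State::Configured, "Start Run"},   {L3State::Running,     1}},
    {{L3State::Running,    "Stop Run"},    {L3State::Configured,  1}},
    {{L3State::Running,    "Pause"},       {L3State::Paused,      1}},
    {{L3State::Paused,     "Resume"},      {L3State::Running,     1}},
    {{L3State::Configured, "Unconfigure"}, {L3State::Start,       4}},
};
```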
19
G. Watts (UW/Seattle) 19
Over the course of the 10 years we had the full class of software-related crashes:
- In Linux: slowdowns or hard crashes.
- In our custom kernel driver for efficient VME readout.
- In our user-level code that was part of the DAQ.
Our strategy for configuration turned out to be immensely helpful here:
- Hard-reset any machine and, when it finishes booting, it should come back into the DAQ without any shifter intervention.
- Endeavor to support this with minimal interruption of data flow.
This strategy helped keep L3/DAQ dead time to less than a few minutes per week.
20
G. Watts (UW/Seattle) 20
21
Single Board Computer:
- The most expensive single item in the system (besides the switch).
- The most reliable; it went out of production! Replaced at less than one per year, with user error often at fault.
- Runs a reduced version of Linux; updates were sometimes required due to Fermilab security policies.
- The code was not modified in the last 7 years of the experiment.
- Problems: a large number of TCP/IP connections caused intermittent Linux flakiness, and the TFW 16-bit run number rolling over in coincidence with another problem caused trouble.
Farm Nodes:
- Standard rack-mount servers. Cheap. Too cheap sometimes.
- Managed jointly with the Fermilab CD, who have a lot of experience with farms.
- Problems: reliability issues (about one per month required warranty service); quality was a function of purchase batch, so we modified bidding to account for this; computers coming back from service had to be cleaned of old versions of the trigger; special hacks were needed for partitions with 1 node; and we did have to debug the Linux kernel driver…
22
G. Watts (UW/Seattle) 22
At 1 kHz the CPU is about 80% busy. Data transfer is a VME block transfer (DMA) via the Universe II module.
23
G. Watts (UW/Seattle) 23
CPU time per event: different behavior vs. luminosity. Dual-core seems to do better at high luminosity (more modern systems with better memory bandwidth).
24
G. Watts (UW/Seattle) 24
SBC buffering:
- Event fragments are buffered until the RM sends a decision; the RM buffers up to 10 decisions before sending them out.
- We've never had an SBC queue overflow.
Farm node buffering:
- The RM bases the node decision for an event on the size of each node's internal queue.
- This provides a large amount of buffering space and automatically accounts for node speed differences without having to make measurements.
- The occasional infinite loop does not cause one node to accumulate an unusually large number of events.
Cisco switch buffering:
- The switch drops buffers when it can't send; the RM token scheme means we never saw this unless some other problem had occurred.
25
G. Watts (UW/Seattle) 25
26
27
G. Watts (UW/Seattle) 27
This DØ DAQ/L3 Trigger has taken every single physics event for DØ since it started taking data in 2002: more than 10 billion physics events, plus calibration runs, test runs, etc.
- 63 VME sources powered by Single Board Computers send data to ~200 off-the-shelf commodity CPUs at the end of the run.
- The data flow architecture is push, and is crash and glitch resistant.
- It has survived all the hardware, trigger, and luminosity upgrades smoothly: the farm went from 90 to ~330 and back to 200 nodes with no major change in architecture.
- The evolved control architecture means minimal dead time is incurred during normal operations.
- Primary responsibility is carried by 3 people (who also work on physics analysis); this works only because the Fermilab Computing Division takes care of the farm nodes.
28
G. Watts (UW/Seattle) 28
A subset of the people who worked on L3 who were able to gather for the Tevatron End-of-Run Party.
29
G. Watts (UW/Seattle) 29
30
The calorimeter readout has a max speed of 1.2 kHz, and it was cost prohibitive to upgrade. Two choices:
- Make a pre-decision based on the other detector readouts and read the CAL only if it succeeds (and so read it at a lower rate): a Regions-of-Interest-like solution.
- Add a Level 2.
G. Watts (UW/Seattle) 30
31
G. Watts (UW/Seattle) 31
Run multiple copies of the trigger decision software:
- Hyper-threaded dual-processor nodes run 3 copies, for example; the new 8-core machines will run 7-9 copies (only preliminary testing done).
- We designed a special mode to stress-test nodes in situ by force-feeding them all data. Better than any predictions we've done.
Software:
- IOProcess lands all the data from the DAQ and does the event building.
- FilterShell (multiple copies) runs the decision software.
- All levels of the system are crash insensitive: if a FilterShell crashes, a new one is started and reprogrammed by IOProcess, and the rest of the system is none the wiser.
- The software distribution is 300 MB, which takes too long to copy!
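A hedged sketch of the crash-insensitive filter scheme: a parent process watches its filter children and forks a replacement when one dies. This is not the IOProcess/FilterShell code, just the standard fork/waitpid pattern it implies; `runFilterShell` is a placeholder.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Placeholder for the filter child: load the trigger-list programming,
// then loop over events from shared memory. Not the DZERO code.
static void runFilterShell() {
    _exit(0);
}

static pid_t spawnFilterShell() {
    pid_t pid = fork();
    if (pid == 0) runFilterShell();   // child never returns
    return pid;                        // parent keeps the pid to watch
}

int main() {
    pid_t filter = spawnFilterShell();
    for (;;) {
        int status = 0;
        pid_t done = waitpid(filter, &status, 0);
        if (done == filter) {
            // A filter crashed or exited: start and reprogram a new one;
            // the rest of the system never notices.
            filter = spawnFilterShell();
        }
    }
}
```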
32
G. Watts (UW/Seattle) 32
Farm Nodes:
- What readout crates to expect on every event.
- What the trigger list programming is.
- Where to send the accepted events and how to tag them.
Routing Master:
- What readout crates to expect for a given set of trigger bits.
- What nodes are configured to handle particular trigger bits.
Front End Crates (SBCs):
- The list of all nodes that data can be sent to.
Much of this configuration information is cached in lookup tables for efficient communication at run time.
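To make the cached-configuration idea concrete, here is one possible shape for those lookup tables; all struct and field names are invented for illustration and are not the DZERO data structures:

```cpp
#include <map>
#include <string>
#include <vector>

// What a farm node needs to cache at run time.
struct FarmNodeConfig {
    std::vector<int> expectedCrates;     // which ROCs contribute to every event
    std::string      triggerList;        // trigger-list programming (by name here)
    std::string      outputDestination;  // where accepted events go, and their tag
};

// What the Routing Master needs: trigger bit -> crates to read out and nodes that can filter it.
struct RoutingMasterConfig {
    std::map<unsigned, std::vector<int>> cratesForTrigger;
    std::map<unsigned, std::vector<int>> nodesForTrigger;
};

// What an SBC needs: every node it may be asked to send data to.
struct SbcConfig {
    std::vector<std::string> destinationNodes;
};
```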
33
G. Watts (UW/Seattle) 33
Standard hierarchical layout: the DØ shifter talks to "COOR"dinate (master Run Control), which is backed by the configuration database and controls the subsystems (Calorimeter, Level 2, Level 3, Online Examines); Level 3 in turn controls the Routing Master, the SBCs, and the farm nodes. Intent flows from the shifter down to the lowest levels of the system; errors flow in reverse.
34
COOR sends a state description down to Level 3:
- The trigger list
- The minimum number of nodes
- Where to send the output data
- Which ROCs should be used
The L3 Supervisor holds the current state and the desired state, and calculates the minimum number of steps to get from one to the other; the complete sequence is calculated before the first command is issued (the calculation takes ~9 seconds). Commands are then sent to each L3 component to change the state of that component.
On L3 component failures, crashes, etc., the Supervisor updates its state and reconfigures if necessary.
G. Watts (UW/Seattle) 34
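A sketch of the reconcile step: diff the desired state against the current state and emit only the commands needed to close the gap, computing the whole plan before sending anything. The key/value state representation and command strings are invented; only the "minimum number of steps, complete sequence first" behavior comes from the slide.

```cpp
#include <map>
#include <string>
#include <vector>

// Invented representation: each component's state as setting -> value pairs.
using ComponentState = std::map<std::string, std::string>;

// Build the full command plan before issuing anything, touching only what changed.
std::vector<std::string> planCommands(const ComponentState& current,
                                      const ComponentState& desired) {
    std::vector<std::string> commands;
    for (const auto& [setting, value] : desired) {
        auto it = current.find(setting);
        if (it == current.end() || it->second != value)
            commands.push_back("set " + setting + " = " + value);   // only what changed
    }
    for (const auto& entry : current) {
        if (!desired.count(entry.first))
            commands.push_back("unset " + entry.first);             // drop stale settings
    }
    return commands;   // issued in order only after the full plan is known
}
```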
35
G. Watts (UW/Seattle) 35
Boundary conditions:
- Dead time: beam in the machine and no data flowing. Run changes contribute to downtime!
- The current operating efficiency is 90-94%, which includes 3-4% trigger deadtime; that translates to less than 1-2 minutes per day of downtime.
- Any configuration change means a new run (prescales for luminosity changes, for example), as does a system failure that needs a reset. Clearly, run transitions have to be very quick!
Guiding principles: push responsibility down a level, and don't touch configuration unless it must be changed. We didn't start out this way!
36
G. Watts (UW/Seattle) 36
Fast response to problems == less downtime:
- COOR caches the complete state of the system. It is a single point of failure for the whole system, but it never crashes! Recovery is about 45 minutes. It can re-issue commands if L3 crashes.
- The L3 Supervisor caches the desired state of the system (COOR's view) and the actual state. COOR is never aware of any difficulties unless L3 can't take data due to a failure. Some reconfigurations require a minor interruption of data flow (1-2 seconds).
- The farm nodes cache the trigger filter programming. If a filter process goes bad, no one outside the node has to know; the event that caused the problem is saved locally.
37
G. Watts (UW/Seattle) 37
Farm node: configuration was cached locally and replayed after a crash; if it was not available locally, the global L3/DAQ configuration cache was used. The run pause was less than a second when the node either left or returned to the run. This happened relatively frequently.
SBC in ROC: configuration is cached in the Routing Master. Data flow stops during a crash; it could be as short as 20 seconds to come back, or up to a minute. The code was very robust, so this was rare.