Run II DZERO DAQ / Level 3 Trigger System. Ron Rechenmacher, Fermilab. The DØ DAQ Group (Brown/FNAL-CD/U. of Washington). CHEP03.


1 Run II DZERO DAQ / Level 3 Trigger System. Ron Rechenmacher, Fermilab. The DØ DAQ Group (Brown/FNAL-CD/U. of Washington). CHEP03.

2 3/24/2003, Ron Rechenmacher, Fermilab, Slide 2. What to Expect: DAQ/trigger system overview; hardware and software description; performance; lessons learned; scaling to the future.

3 D0 at Fermilab [figure, labeled DØ]

4 D0 DAQ at Fermilab: collision rate of 7.6 MHz; 3-level trigger system; L1/L2 reduce the rate to 1 kHz into L3; 250 MB/s average event data rate into L3; 50 Hz and 12.5 MB/s output from L3. Diagram labels: 7.5 MHz into L1/L2 (custom hardware); 1 kHz into L3/DAQ (commodity HW/SW); 50 Hz of data to tape.

5 Commodity DAQ/L3 System: We chose a good mix of hardware and software and built a system that easily met the 250 KB at 1 kHz (= 250 MB/s) requirement. Great depth of software development tools and methodologies. Commodity software development environment. Commodity hardware.
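The rate figures quoted on slides 4 and 5 can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 MB = 1000 KB) and assuming the 250 KB event size also applies to the L3 output; the function name is illustrative and does not come from the talk.

```python
def data_rate_mb_per_s(event_rate_hz: float, event_size_kb: float) -> float:
    """Throughput in MB/s for a given event rate and event size (decimal units)."""
    return event_rate_hz * event_size_kb / 1000.0


l3_input = data_rate_mb_per_s(1000, 250)   # 1 kHz of 250 KB events into L3
l3_output = data_rate_mb_per_s(50, 250)    # 50 Hz out of L3, same event size assumed
l3_rejection = 1000 / 50                   # rate reduction factor across L3

print(l3_input, l3_output, l3_rejection)
```

Both results agree with the slide: 250 MB/s into L3 and 12.5 MB/s out, a factor of 20 reduction in rate.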

6 DAQ/trigger System Overview

7 Commodity Single Board Computer (“SBC”): mechanically supported in the crate by a custom 9U “Extender” board; 933 MHz CPU; 128 MB Flash ROM; 128 MB RAM; “PMC” slot (filled with BVM I/O module); VME to PCI (Universe II); dual 100 Mb/s Ethernet. Photo labels: J3 connector, SBC, front-panel connections, status lights.

8 Hardware Description - Switches. 6509 (single central switch): 16 GB/s backplane; 9 module slots; 8-port Gb (fiber or copper); 48-port 100 Mb/s; 112 MB/48 ports for output buffering. 2948G (currently 5 of these in the system): 48 100 Mb/s ports and 2 Gb ports; “concentrator” switch; combines data from up to 20 100 Mb/s inputs into 2 Gb outputs; no packet loss possible if limited to 20 inputs. Capacity well exceeds D0 requirements.
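The 20-input limit on the concentrator switch matches a simple bandwidth balance: 20 inputs at 100 Mb/s add up to exactly the 2 Gb/s of uplink capacity. The sketch below checks only that aggregate balance; it ignores buffering and burst behaviour, and the function name is hypothetical.

```python
def is_oversubscribed(n_inputs: int, input_mbps: int,
                      n_uplinks: int, uplink_mbps: int) -> bool:
    """True if the inputs can offer more traffic than the uplinks can carry."""
    return n_inputs * input_mbps > n_uplinks * uplink_mbps


limited = is_oversubscribed(20, 100, 2, 1000)    # 20 inputs, as on the slide
all_ports = is_oversubscribed(48, 100, 2, 1000)  # every 100 Mb/s port in use

print(limited, all_ports)
```

With 20 inputs the uplinks are not oversubscribed; with all 48 ports feeding two Gb uplinks they would be.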

9 Hardware Description - Nodes: 82 nodes in total, currently; dual CPU; 1 GB RAM; 1 GHz PIII / AMD Athlon 2000’s; dual Ethernet; cost effective.

10 Developed Software Description: multiple runs can be configured simultaneously; connections to the monitor server (talk on Thursday); all connections are TCP, with auto re-connection and application buffer tracking; components of the system can be restarted ‘on the fly’. Diagram labels: Event Data, Routing, Crate-lists, Buffer-info, Configuration, Selected Events, Routing Master, D0 Run Control, Nodes, SBCs, Runs, To Tape, DAQ supervisor, Monitoring.
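The slide lists automatic TCP re-connection as a feature but does not show how it is done. The following is a generic retry loop, not the DØ code: the function name, retry count, and delay are assumptions, and the `connect` argument is injectable so the loop can be exercised without a network.

```python
import socket
import time


def connect_with_retry(host, port, retries=5, delay_s=0.5,
                       connect=socket.create_connection):
    """Open a TCP connection, retrying after each failure.

    Returns the connected object, or raises ConnectionError once
    `retries` attempts have all failed.
    """
    last_error = None
    for _ in range(retries):
        try:
            return connect((host, port))
        except OSError as err:
            last_error = err
            time.sleep(delay_s)  # wait before trying again
    raise ConnectionError(f"gave up after {retries} attempts") from last_error
```

A component restarted ‘on the fly’ would call such a loop on startup and again whenever an established connection drops.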

11 Software Infrastructure. Linux 2.4 kernel: modifiable (one ARP patch); easy development; kernel debugging with KGDB; tcpdump. Fermi Linux. Trace: complete system, kernel and user space interaction. Rgang: single executable; parallel remote execution and file copy for “farms”.
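The idea behind parallel remote execution on a farm can be illustrated as follows. This sketch is not Rgang and does not reproduce its command line or implementation: `run_on_farm` and `ssh_runner` are hypothetical names, and it assumes password-less `ssh` access to each node.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_runner(node, command):
    """Run `command` on `node` over ssh; return (exit code, stdout)."""
    done = subprocess.run(["ssh", node, command],
                          capture_output=True, text=True)
    return done.returncode, done.stdout


def run_on_farm(nodes, command, runner=ssh_runner, max_workers=16):
    """Run the same command on every node in parallel.

    `nodes` is a list of host names; the result maps each node
    to whatever `runner` returned for it.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda node: runner(node, command), nodes)
        return dict(zip(nodes, results))
```

Calling `run_on_farm(["node-a", "node-b"], "uptime")` (node names invented here) would issue both commands concurrently and collect the results per node.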

12 Software TRACE
root@d0sbc001b:/proc/trace>cat buffer | head -20
count        timeStamp   PID TraceName CPU lvl message
------------------------------------------------------------------------
    1 1048198375378446  1620 KERNEL      0  30 exit do_softirq
    2 1048198375378425  1620 KERNEL      0  30 enter do_softirq
    3 1048198375378422  1620 KERNEL      0  30 exit handle_IRQ_event irq=5
    4 1048198375378411  1620 KERNEL      0  30 enter handle_IRQ_event irq=5
    5 1048198375377887  1339 KERNEL      0  31 sched: prev=1339 next=1620
    6 1048198375377788  1339 KERNEL      0  30 exit do_softirq
    7 1048198375377780  1339 KERNEL      0  30 enter do_softirq
    8 1048198375377779  1339 KERNEL      0  30 exit handle_IRQ_event irq=5
    9 1048198375377771  1339 KERNEL      0  30 enter handle_IRQ_event irq=5
   10 1048198375377716  1339 l3xetg      0   8 Node idx 55: total=3 delta=1 latency=0
   11 1048198375377688  1339 KERNEL      0  30 exit do_softirq
   12 1048198375377686  1339 KERNEL      0  30 enter do_softirq
   13 1048198375377684  1339 KERNEL      0  30 exit handle_IRQ_event irq=5
   14 1048198375377680  1339 KERNEL      0  30 enter handle_IRQ_event irq=5
   15 1048198375377668  1620 KERNEL      0  31 sched: prev=1620 next=1339
   16 1048198375377646  1339 KERNEL      0  31 sched: prev=1339 next=1620
   17 1048198375377615  1339 KERNEL      0  30 exit do_softirq
root@d0sbc001b:/proc/trace>echo KERNEL=0x0fffffff >|level
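Trace listings in this column layout are easy to post-process. The sketch below parses two rows copied from the slide and measures the time spent in one interrupt handler call. It assumes the timeStamp column is in microseconds, which fits the values shown but is not stated on the slide; the class and function names are illustrative.

```python
from typing import NamedTuple


class TraceEntry(NamedTuple):
    count: int
    timestamp_us: int   # assumed to be microseconds since the Unix epoch
    pid: int
    name: str
    cpu: int
    level: int
    message: str


def parse_trace_line(line: str) -> TraceEntry:
    """Split one row of the trace listing into typed fields."""
    count, stamp, pid, name, cpu, level, message = line.split(None, 6)
    return TraceEntry(int(count), int(stamp), int(pid), name,
                      int(cpu), int(level), message)


# Rows 3 and 4 of the listing; the buffer is printed newest first.
sample = [
    "3 1048198375378422 1620 KERNEL 0 30 exit handle_IRQ_event irq=5",
    "4 1048198375378411 1620 KERNEL 0 30 enter handle_IRQ_event irq=5",
]
exit_entry, enter_entry = (parse_trace_line(line) for line in sample)
elapsed_us = exit_entry.timestamp_us - enter_entry.timestamp_us
print(elapsed_us)
```

Under the microsecond assumption, this handler call took 11 µs.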

13 Software Logic Analyzer

14 Control: DØ Run Control sends configuration commands to Level 3. Level 3 is a black box to the rest of DØ. The Level 3 Supervisor configures the L3/DAQ system and allows configuration of multiple runs. All components can crash or reboot at any time; the system will automatically reconfigure without contacting run control. Diagram labels: DZERO Run Control, Level 3/DAQ, Supervisor, Node, Routing Master.

15 Monitoring. Example use: a status display in the Control Room (or your living room!). All components of the DAQ are clients. The server caches recent queries to limit the load on clients. There are many displays, each serving a specialized purpose (uMon, l3xqt, history, systray, and web pages). Based on TCP/IP, ACE, and XML. Talk on this topic on Thursday. Diagram labels: XML, Server, Clients, Displays, Web Pages.
Real-time “Trace”. Example use: see the event numbers that were in an SBC’s buffers just before some glitch occurs. Combines low-level debugging information and log-file entries in a single real-time circular buffer. The buffer can be “frozen” by either software or hardware triggers. A system-wide display has been demonstrated and is under development.
Log-files. Example use: understand why a node was not connected to an SBC yesterday. Centralized, accessible, and time-stamped. Errors go to SES. The amount written to log files can be controlled.
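The “frozen” circular buffer described for the real-time trace can be modelled in a few lines. This is a user-space illustration of the concept only, not the TRACE implementation; the class name and methods are invented for the example.

```python
from collections import deque


class FreezableTraceBuffer:
    """Fixed-size circular buffer that stops recording once frozen."""

    def __init__(self, capacity: int):
        self._entries = deque(maxlen=capacity)  # oldest entries drop off first
        self.frozen = False

    def record(self, message):
        if not self.frozen:
            self._entries.append(message)

    def freeze(self):
        """Called by a software or hardware trigger to preserve the history."""
        self.frozen = True

    def snapshot(self):
        return list(self._entries)


buf = FreezableTraceBuffer(capacity=3)
for event_number in range(1, 6):
    buf.record(event_number)
buf.freeze()          # the "glitch" trigger fires here
buf.record(6)         # ignored: the buffer is frozen
history = buf.snapshot()
print(history)
```

After the freeze, the buffer still holds the three event numbers seen just before the trigger, which is the use case given on the slide.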

16 Performance: current rate is 400 Hz with 300 KB events (120 MB/s); subset at 2 kHz with smaller events; subset utilizing dual Ethernet using large events.

17 Performance: ‘Yearly’ graph of percent backplane utilization

18 Lessons Learned: R&D (systems analysis) goes a long way. (VME) systems integration expertise goes a long way; transcend sub-system boundaries. TCP expertise is needed: the 200 ms dropped-packet problem; TCP is not tuned for ‘real-time’ applications by default; the TCP_RTO_MIN parameter and others need tuning; understanding of the Linux kernel and TCP tools. Track all software/configuration.
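To see why a 200 ms stall matters at these rates, consider how many events arrive while one connection waits for a retransmission. The estimate below assumes the 200 ms on the slide refers to the Linux minimum retransmission timeout and uses the design figures of 1 kHz and 250 KB per event; the function name is illustrative.

```python
def events_during_stall(event_rate_hz: int, stall_ms: int) -> int:
    """Number of events that arrive while one connection is stalled."""
    return event_rate_hz * stall_ms // 1000


backlog_events = events_during_stall(1000, 200)   # 1 kHz, 200 ms stall
backlog_mb = backlog_events * 250 / 1000          # total size of those events, MB

print(backlog_events, backlog_mb)
```

About 200 events (50 MB of event data in total across the system) pass through during a single stall. A stalled source holds only its own fragment of each event, but every one of those events is incomplete until the connection recovers.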

19 R&D: significant upfront analysis/investigation; “to the metal” understanding/expertise; basis for smooth integration.

20 R&D - VME SBCs: Universe II; VMETRO studies; Linux interrupt latency measurements.
VMETRO VBT-325C VME trace (sampling: STATE at middle), VME#1:
       Time      BgL   AM  Address   Data      Size  Cycle  Stat  IRQ7:1*  Iack
 -17   1.35us    ----  39  ..21A004  ....0001  WORD  RD     OK    .......  ----
 -16   47.41us   ----  39  ..224012  ....0500  WORD  RD     OK    .......  ----
 -15   49.97us   ----  39  ..22C012  ....0500  WORD  RD     OK    .......  ----
 -14   46.61us   ----  39  ..234012  ....0500  WORD  RD     OK    .......  ----
 -13   165.04ms  ----  39  ..21A004  ....0001  WORD  RD     OK    .......  ----
 -12   1.15us    ----  39  ..21A004  ....0001  WORD  RD     OK    .......  ----
 -11   1.15us    ----  39  ..21A004  ....0001  WORD  RD     OK    .......  ----
 -10   1.35us    ----  39  ..21A004  ....0011  WORD  RD     OK    .......  ----
  -9   46.93us   ----  39  ..224012  ....0500  WORD  RD     OK    .......  ----
  -8   49.81us   ----  39  ..22C012  ....0500  WORD  RD     OK    .......  ----
  -7   46.61us   ----  39  ..234012  ....0500  WORD  RD     OK    .......  ----
  -6   166.35ms  ----  39  ..21A004  ....0001  WORD  RD     OK    .......  ----
  -5   1.15us    ----  39  ..21A004  ....0001  WORD  RD     OK    .......  ----
  -4   1.15us    ----  39  ..21A004  ....0001  WORD  RD     OK    .......  ----
  -3   1.33us    ----  39  ..21A004  ....0001  WORD  RD     OK    .......  ----
  -2   46.45us   ----  39  ..224012  ....0500  WORD  RD     OK    .......  ----
  -1   50.77us   ----  39  ..22C012  ....0500  WORD  RD     OK    .......  ----
HALT   47.09us   ----  39  ..234012  ....0500  WORD  RD     OK    .......  ----
Efficient OS access: writev; limit memory copying; Linux RT scheduling.
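The slide names writev as one way to limit memory copying: several separate buffers go out in a single system call without first being joined into one. The sketch below demonstrates the call on a pipe using Python's `os.writev` (Unix only). It is an illustration of the technique, not the SBC readout code, and `send_event` and the buffer contents are invented.

```python
import os


def send_event(fd, header: bytes, fragments):
    """Write a header and its payload fragments with one writev call."""
    return os.writev(fd, [header, *fragments])


read_fd, write_fd = os.pipe()
written = send_event(write_fd, b"HDR:", [b"crate1", b"crate2"])
os.close(write_fd)
received = os.read(read_fd, 1024)
os.close(read_fd)

print(written, received)
```

The reader sees one contiguous byte stream even though the writer never concatenated the three buffers.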

21 R&D - Switches/Ethernet. VME to Ethernet: rate and CPU. We analyzed the architecture of the 6509; buffering was increased at the last minute and turned out not to be an issue. Prepared to control required buffering via control of TCP window size. Round-trip message-passing timings. Tests done under Linux.
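One common way to bound the TCP window from an application is to limit the socket receive buffer before connecting. Whether the DØ software would have used this exact mechanism is not stated on the slide; the sketch below only shows the standard socket option, with a hypothetical helper name and an arbitrary 16 KB request.

```python
import socket


def make_bounded_socket(recv_buffer_bytes: int) -> socket.socket:
    """Create a TCP socket whose receive buffer, and hence window, is capped."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set before connect() so the value can influence the window
    # negotiated during the TCP handshake.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_buffer_bytes)
    return sock


sock = make_bounded_socket(16 * 1024)
# The kernel may adjust the request; Linux reports about twice the value asked for.
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sock.close()

print(effective)
```

A smaller receive buffer limits how much data each sender can have in flight, which in turn limits how much the switch must buffer per connection.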

22 Expanding: room to grow; this system could easily double; utilization indicator.

23 Scaling. Diagram labels: Routing Emperor, Routing Master, Node Master, SBCs, DAQ Nodes, Event Nodes, Event Node Groups. Routing Master (RM): get info from the Emperor and pass it to the SBCs. Node Master (NM): tell DAQ nodes which event node to use; advertise total free buffers to the Emperor. Emperor, for each event: pick an Event Node Group (ENG) with the most free buffers; inform the NM and RMs of the choice.
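The Emperor's per-event choice, as described on the slide, amounts to picking the group that advertises the most free buffers. A minimal sketch of that rule follows; the function name, the group names, and the decision to return `None` when no group has a free buffer are assumptions made for the example.

```python
def pick_event_node_group(free_buffers_by_group):
    """Return the Event Node Group with the most free buffers.

    Returns None when no group has any free buffer.
    """
    candidates = {group: free
                  for group, free in free_buffers_by_group.items()
                  if free > 0}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


choice = pick_event_node_group({"eng-a": 3, "eng-b": 7, "eng-c": 5})
nothing_free = pick_event_node_group({"eng-a": 0, "eng-b": 0})

print(choice, nothing_free)
```

In the scheme on the slide, the chosen group would then be sent to the Node Master and the Routing Masters.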

24 Summary: commodity-based Ethernet DAQ built for D0; 250 MB/s: 1 kHz of 250 KB events; 63 sources and >80 targets. Commodity (Ethernet) systems: wow, a lot of stuff can show up! You need a TCP/IP expert or two; people that can transcend boundaries; “to the metal” understanding; infrastructure.

25 References / Additional Information
Fermitools: http://fermitools.fnal.gov
Buffering: http://www-d0.fnal.gov/cgi-bin/cvsweb.cgi/~checkout~/l3xsbc/doc/buffering/index.html?rev=HEAD&content-type=text/html
DAQ Scaling, DAQ Overview, sci2002: http://d0.phys.washington.edu/~haas/d0/L3/
L3DAQ Homepage: http://www-d0online.fnal.gov/www/groups/l3daq/default.html
L3 switch backplane load: http://m-d0-mrtg.fnal.gov/s-d0-dab2cr-l3/s-d0-dab2cr-l3.backp2.html
The D0 Experiment: http://www-d0.fnal.gov/
D0 Run II Operations: http://www-d0.fnal.gov/runcoor/

26 Extra Slides Follow

27 Monitoring: centralized, caching monitor server; based on TCP/IP, ACE, and XML; supports many displays and clients (40 displays simultaneously, 200 data sources). Talk and poster on this topic Thursday.

28 Performance: ‘Weekly’ graph (2-hour average). Legend values, first set: max 5-min 17.0%, average 5-min 4.0%, current 5-min 0.0%. Second set: max 5-min 1.0%, average 5-min 0.0%, current 5-min 0.0%.

29 Software Description - RM. Event routing (Routing Master): receives “run” information from the supervisor (farm node list and crate list per bit); gets bit fired by event# from TFW; receives no. of free buffers from each farm node; decides which nodes receive which events; sends routing info by event# to SBCs; sends crate list by event# to farm nodes; disables triggers when necessary.
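The Routing Master responsibilities listed above can be tied together in a toy model. This is a sketch under stated assumptions, not the DØ implementation: the class, its method names, the "most free buffers" selection rule, and the rule for re-enabling triggers are all invented for illustration.

```python
class RoutingMaster:
    """Toy model: route each event to a farm node that has a free buffer."""

    def __init__(self, nodes_by_bit, crates_by_bit):
        self.nodes_by_bit = nodes_by_bit      # trigger bit -> farm node list
        self.crates_by_bit = crates_by_bit    # trigger bit -> crate list
        self.free_buffers = {}                # node -> free buffer count
        self.triggers_enabled = True

    def report_free_buffers(self, node, count):
        """Record a farm node's free-buffer report."""
        self.free_buffers[node] = count
        if count > 0:
            self.triggers_enabled = True      # assumed re-enable rule

    def route(self, event_number, bit):
        """Choose a destination node, or disable triggers if none is free."""
        ready = [n for n in self.nodes_by_bit[bit]
                 if self.free_buffers.get(n, 0) > 0]
        if not ready:
            self.triggers_enabled = False
            return None
        node = max(ready, key=lambda n: self.free_buffers[n])
        self.free_buffers[node] -= 1          # one buffer is now in use
        return {"event": event_number, "node": node,
                "crates": self.crates_by_bit[bit]}
```

In this model, the returned node would be sent to the SBCs as routing info and the crate list to the farm node, both keyed by event number, as described on the slide.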

