ALICE Data Challenges: on the way to recording @ 1 GB/s
What is ALICE?
ALICE Data Acquisition architecture
[Architecture diagram: data from the Inner Tracking System, Time Projection Chamber, Particle Identification, Photon Spectrometer, Muon arm and trigger detectors flows from the Front-End Electronics over the Detector Data Links to the Local Data Concentrators (with Readout Receiver Cards), through the Event Building switch (3 GB/s) to the Global Data Collectors, and through the Storage switch (1.25 GB/s) to Permanent Data Storage; Trigger Levels 0/1/2 deliver trigger decisions and trigger data, and the Event Destination Manager steers event building.]
ALICE running parameters
Two different running modes:
  Heavy Ion (HI): 10^6 seconds/year
  Proton: 10^7 seconds/year
One Data Acquisition system (DAQ): Data Acquisition and Test Environment (DATE)
Many trigger classes, each providing events at different rates, sizes and sources
HI data rates: 3 GB/s into event building, 1.25 GB/s to storage, ~1 PB/year to mass storage
Proton run: ~0.5 PB/year to mass storage (a rough volume check follows below)
Staged DAQ installation plan (20% -> 30% -> 100%): 85 -> 300 LDCs, 10 -> 40 Global Data Collectors (GDC)
Different recording options:
  local/remote disks
  Permanent Data Storage (PDS): CERN Advanced Storage Manager (CASTOR)
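As a rough consistency check on the quoted yearly volumes (my arithmetic, not taken from the slides, assuming the nominal storage rate is sustained over the effective running time):

```latex
1.25~\mathrm{GB/s} \times 10^{6}~\mathrm{s/year} \approx 1.25~\mathrm{PB/year}
\quad (\text{HI, consistent with } \sim 1~\mathrm{PB/year});
\qquad
0.5~\mathrm{PB} / 10^{7}~\mathrm{s} \approx 50~\mathrm{MB/s}
\quad (\text{implied average proton-run rate}).
```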
History of ALICE Data Challenges
Started in 1998 to put together a high-bandwidth DAQ/recording chain
Continued as a periodic activity to:
  validate the interoperability of all existing components
  assess and validate developments, trends and options (commercial products and in-house developments)
  provide guidelines for ALICE & IT development and installation
  continuously expand up to the ALICE requirements at LHC startup
Performance goals
[Chart: target throughput (MBytes/s) per Data Challenge.]
Data volume goals
[Chart: target data volume to Mass Storage (TBytes) per Data Challenge.]
The ALICE Data Challenge IV
Components & modes
[Diagram: an LDC emulator feeds the ALICE DAQ; raw events are recorded either as RAW DATA or, through the Objectifier, as OBJECTS to the CASTOR front-end and on to the CASTOR PDS; traffic crosses a private network and the CERN backbone; AFFAIR and the CASTOR monitor provide monitoring.]
Targets
DAQ system scalability tests
Single peer-to-peer tests:
  evaluate the behavior of the DAQ system components with the available HW
  preliminary tuning
Multiple LDC/GDC tests:
  add the full Data Acquisition (DAQ) functionality
  verify the objectification process
  validate & benchmark the CASTOR I/F
Evaluate the performance of new hardware components:
  new generation of tapes
  10 Gb Ethernet
Achieve a stable production period:
  minimum 200 MB/s sustained
  7 days non-stop
  200 TB of data to PDS
Software components
Configurable LDC Emulator (COLE)
Data Acquisition and Test Environment (DATE) 4.2
A Fine Fabric and Application Information Recorder (AFFAIR) V1
ALICE Mock Data Challenge objectifier (ALIMDC)
ROOT (Object-Oriented Data Analysis Framework) v3.03
Permanent Data Storage (PDS): CASTOR V1.4.1.7
Linux RedHat 7.2, kernel 2.2 and 2.4
Physical pinned memory driver (PHYSMEM)
Standard TCP/IP library
Hardware setup
ALICE DAQ: infrastructure & benchmarking
  NFS & DAQ servers
  SMP HP Netserver (4 CPUs): setup & benchmarking
LCG testbed (lxshare): setup & production
  78 CPU servers on GE: dual ~1 GHz Pentium III, 512 MB RAM
  Linux kernel 2.2 and 2.4
  NFS (installation, distribution) and AFS (unused)
  [8 .. 30] DISK servers (IDE-based) on GE
Network: mixed FE/GE/trunked GE, private & CERN backbone
  2 * Extreme Networks Summit 7i switches (32 GE ports)
  12 * 3COM 4900 switches (16 GE ports)
  CERN backbone: Enterasys SSR8600 routers (28 GE ports)
PDS: 16 * 9940B tape drives in two different buildings
  STK linear tapes, 30 MB/s, 200 GB/cartridge
Networking
[Network diagram: LDCs & GDCs and DISK servers spread over the satellite switches (link groups of 3, 3, 3, 3, 2, 2 and 2), 6 CPU servers on FE, the 4 Gbps backbone, and 16 TAPE servers (distributed).]
Scalability test
Put together as many hosts as possible to verify the scalability of:
  run control state machines
  control and data channels
  DAQ services
  system services
  hardware infrastructure
Connect/control/disconnect cycles plus simple data transfers (a toy sketch of such a cycle follows below)
Data patterns, payloads and throughputs uninteresting
Keywords: usable, reliable, scalable, responsive
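A minimal, purely illustrative finite-state machine (not the DATE run control, whose states and commands are not reproduced here) showing the kind of connect / configure / start / stop / disconnect cycle the test repeated across ~80 hosts:

```c
/* Illustrative sketch only: generic run-control-like state machine. */
#include <stdio.h>
#include <string.h>

typedef enum { DISCONNECTED, CONNECTED, READY, RUNNING } rc_state;

static rc_state apply(rc_state s, const char *cmd)
{
    if (s == DISCONNECTED && !strcmp(cmd, "connect"))    return CONNECTED;
    if (s == CONNECTED    && !strcmp(cmd, "configure"))  return READY;
    if (s == READY        && !strcmp(cmd, "start"))      return RUNNING;
    if (s == RUNNING      && !strcmp(cmd, "stop"))       return READY;
    if (s != DISCONNECTED && !strcmp(cmd, "disconnect")) return DISCONNECTED;
    return s;  /* illegal transition: stay in the current state */
}

int main(void)
{
    const char *cycle[] = { "connect", "configure", "start", "stop", "disconnect" };
    rc_state s = DISCONNECTED;
    for (unsigned i = 0; i < sizeof cycle / sizeof cycle[0]; i++) {
        s = apply(s, cycle[i]);
        printf("%-10s -> state %d\n", cycle[i], (int)s);
    }
    return 0;
}
```

In the real test the interesting quantity is not the cycle itself but how long it stays usable and responsive when driven for 160+ roles from a single run control.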
Single peer-to-peer
Compare:
  architectures
  network configurations
  system and DAQ parameters
Exercise:
  DAQ system network modules
  DAQ system clients and daemons
  Linux system calls, system libraries and network drivers
Benchmark and tune:
  Linux parameters
  DAQ processes, libraries and network components
  DAQ data flow
(An illustrative socket-tuning snippet follows below.)
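A sketch of the kind of per-socket tuning such tests sweep over (send/receive buffer sizes, Nagle's algorithm); this is not DATE code, and the 4 MB buffer value is only a placeholder, bounded in practice by the kernel's net.core.wmem_max / rmem_max settings:

```c
/* Illustrative TCP socket tuning for bulk GE transfers (not DATE code). */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

static int tune_data_socket(int fd, int bufbytes)
{
    int one = 1;

    /* Enlarge kernel send/receive buffers for bulk transfers. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufbytes, sizeof bufbytes) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufbytes, sizeof bufbytes) < 0)
        return -1;

    /* Disable Nagle so small control records are not delayed. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one) < 0)
        return -1;

    return 0;
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0 || tune_data_socket(fd, 4 * 1024 * 1024) < 0) {
        perror("socket tuning");
        return 1;
    }
    printf("socket tuned\n");
    close(fd);
    return 0;
}
```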
Single peer-to-peer
[Plot: transfer speed (MB/s) and LDC/GDC CPU usage (% CPU per MB) as a function of event size (KB, 200 to 2000). Test platform: HP Netserver LH6000, 4 * Xeon 700 MHz, 1.5 GB RAM, 3COM996 1000BaseT, tg3 driver, RedHat 7.3.1, kernel 2.4.18-19.7.x.cernsmp.]
Full test runtime options
Different trigger classes for different traffic patterns
Several recording options:
  NULL
  GDC disk
  CASTOR disks
  CASTOR tapes
Raw data vs. ROOT objects
We concentrated on two major traffic patterns (see the sketch below):
  flat traffic: all LDCs send the same event
  ALICE-like traffic: periodic sequence of different events, distributed according to the forecasted ALICE raw data
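A hypothetical sketch contrasting the two traffic patterns (this is not the actual COLE implementation, and the sizes and trigger-class sequence below are placeholders, not the forecasted ALICE distribution):

```c
/* Illustrative event-size generator for the two traffic patterns. */
#include <stddef.h>
#include <stdio.h>

enum pattern { FLAT, ALICE_LIKE };

/* Placeholder per-LDC sizes (KB) for a periodic trigger-class sequence. */
static const size_t alice_sequence_kb[] = { 60, 300, 60, 1000, 60, 300 };
static const size_t flat_kb = 1024;  /* flat traffic: 1 MB/event/LDC */

static size_t next_event_kb(enum pattern p, unsigned long event_nb)
{
    if (p == FLAT)
        return flat_kb;
    return alice_sequence_kb[event_nb %
        (sizeof alice_sequence_kb / sizeof alice_sequence_kb[0])];
}

int main(void)
{
    for (unsigned long i = 0; i < 6; i++)
        printf("event %lu: flat=%zu KB  alice-like=%zu KB\n",
               i, next_event_kb(FLAT, i), next_event_kb(ALICE_LIKE, i));
    return 0;
}
```

The point of the distinction is that flat traffic loads every LDC, GDC and trunk identically, while the ALICE-like pattern produces the asymmetric per-detector loads seen later in the challenge.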
Performance goals
[Chart: throughput goals (MBytes/s) per Data Challenge, with 650 MB/s highlighted.]
Flat data traffic
40 LDCs * 38 GDCs, 1 MB/event/LDC, recording to NULL
[Chart: occupancies]
Critical item: load balancing over the GE trunks (2/3 of nominal)
Load distribution on trunks
[Plot: aggregate throughput (MB/s) vs. number of LDCs, with LDCs distributed over switches vs. on the same switch.]
3 GE trunks should give ~330 MB/s -> why only ~220 MB/s? (see the note below)
Challenge: 200 MB/s sustained; each switch used for both DATE and CASTOR -> contention; separate switches -> insufficient number of switches
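One plausible reading of these numbers (my arithmetic, not stated on the slide): a single GE link carries roughly 110 MB/s, so three trunked links give about 330 MB/s; if the trunk load balancing effectively spreads the flows over only two of the three links, the aggregate drops to roughly 220 MB/s, i.e. the 2/3 of nominal quoted above:

```latex
3 \times 110~\mathrm{MB/s} \approx 330~\mathrm{MB/s},
\qquad
\tfrac{2}{3} \times 330~\mathrm{MB/s} \approx 220~\mathrm{MB/s}
```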
ALICE-like traffic
LDCs: rather realistic simulation
  partitioned in detectors
  no hardware trigger
  simulated readout, no "real" input channels
GDCs acting as:
  event builder
  CASTOR front-end
Data traffic:
  realistic event sizes and trigger classes
  partial detector readout
Networking & node distribution scaled down & adapted
Challenge setup & outcomes
~25 LDCs (TPC: 10 LDCs, other detectors: [1 .. 3] LDCs each)
~50 GDCs
Each satellite switch: 12 LDCs/GDCs (distributed)
[8 .. 16] tape servers on the CERN backbone, [8 .. 16] tape drives, each attached to a tape server
No objectification: named pipes too slow and too heavy; even after upgrading to avoid named pipes, ALIMDC/CASTOR did not perform well
LDCs per detector: ITS Pix: 2, ITS Drift: 3, ITS Strips: 1, TPC: 10, TRD: 2, TOF: 1, PHOS: 1, HMPID: 1, MUON: 2, PMD: 2, TRIGGER: 1
Impact of traffic pattern
[Chart: achieved throughput for the FLAT/CASTOR, ALICE/NULL and ALICE/CASTOR configurations.]
Performance goals
[Chart: throughput goals (MBytes/s), with the 200 MB/s sustained production target highlighted.]
Production run
8 LDCs * 16 GDCs, 1 MB/event/LDC (FLAT traffic)
[8 .. 16] tape servers and tape units
7 days at ~300 MB/s sustained, > 350 MB/s peak, ~180 TB to tape (a volume check follows below)
9 Dec: too much input data
10 Dec: HW failures on tape drives & reconfiguration
Despite the failures, the performance goals were always exceeded
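The quoted tape volume is consistent with the sustained rate (a simple check, not on the slide):

```latex
300~\mathrm{MB/s} \times 7~\mathrm{days} \times 86\,400~\mathrm{s/day}
\approx 1.8 \times 10^{8}~\mathrm{MB} \approx 180~\mathrm{TB}
```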
System reliability
Hosts:
  ~10% Dead On Installation (DOI)
  ~25% Failed On Installation (FOI)
Long period of short runs (tuning): occasional problems (recovered) with:
  name server
  network & O.S.
  on average [1 .. 2] O.S. failures per week (on 77 hosts)
  unrecoverable occasional failures on GE interfaces
Production run: one tape unit failed and had to be excluded
DOI: power supply, hard disk, BOOT failure, bad BIOS, installation failed
FOI: no AFS, not all products, NIC working at FE or at simplex duplex, single CPU; usually solved by a complete re-installation, sometimes repeated twice
Outcomes
DATE
  80 hosts / 160+ roles with one run control
  Excellent reliability and performance
  Scalable and efficient architecture
Linux
  A few hiccups here and there, but rather stable and fast
  Excellent network performance/CPU usage
  Some components are too slow (e.g. named pipes)
  More reliability needed from the GE interfaces
GDCs: 70% of each CPU free for extra activities
  CPU user: 1%; CPU system (interrupts, libs): [10 .. 30]%
  Input rate: [6 .. 10] MB/s (average: 7 MB/s)
LDCs: [15 .. 30]% CPU busy
  CPU user: [3 .. 6]%; CPU system: [10 .. 25]%
  In reality: higher CPU user, same CPU system (pRORC: no interrupts)
MEMORY:
  LDC: the TPC can buffer 300/3.79 ~ 80 events: no problem
  GDC: every central event needs ~43 MB; the event builder has to allocate maxEventSize to each LDC (to ensure forward progress); in reality events are allocated on the fly according to their real size, but the minimum has to be ensured statically (see the sketch below)
NETWORK: configured as a mesh with the same resources on each node; the DAQ needs a tree with asymmetric assignment of branches (to absorb all the traffic)
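A back-of-the-envelope sketch of the buffer budgets quoted above (not DATE code; the 300 MB LDC buffer, 3.79 MB TPC fragment and 43 MB central event are taken from the slide, the LDC count from the ALICE-like setup):

```c
/* Buffer-budget arithmetic behind the MEMORY outcomes. */
#include <stdio.h>

int main(void)
{
    const double ldc_buffer_mb   = 300.0;  /* pinned buffer on a TPC LDC    */
    const double tpc_fragment_mb = 3.79;   /* central-event TPC fragment    */
    const double central_evt_mb  = 43.0;   /* full central event at the GDC */
    const int    nb_ldcs         = 25;     /* LDCs in the ALICE-like setup  */

    /* LDC side: how many TPC fragments fit in the pinned buffer. */
    printf("TPC LDC can buffer ~%.0f events\n",
           ldc_buffer_mb / tpc_fragment_mb);

    /* GDC side: fragments arrive at their real size, but to guarantee
     * forward progress the event builder must be able to reserve up to the
     * maximum event size for every one of the contributing LDCs. */
    printf("GDC: ~%.0f MB per central event, built from %d LDCs\n",
           central_evt_mb, nb_ldcs);
    return 0;
}
```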
Outcomes
FARM installation and operation: not to be underestimated!
CASTOR
  Reliable and effective
  Improvements needed on:
    overloading
    parallelizing tape resources
Tapes: one DOA and one DOO
Network: silent but very effective partner
  Layout made for a farm, not optimized for the ALICE DAQ
  10 Gb Ethernet tests: failure at first; problem "fixed" too late for the Data Challenge
  Reconfiguration: transparent to DAQ and CASTOR
Future ALICE Data Challenges
Continue the planned progression
ALICE-like pattern
Record ROOT objects
New technologies:
  CPUs
  servers
  network (NICs, infrastructure, beyond 1 GbE)
Insert online algorithms
Provide some "real" input channels
Get ready to record at 1.25 GB/s
[Chart: MBytes/s targets for the coming Data Challenges.]