Slide 1: Challenging the challenge / Handling data in the Gigabit/s range
R.Divià, CERN/ALICE, 24 March 2003, CHEP03, La Jolla

Slide 4: ALICE Data Acquisition architecture (diagram)
- Detectors: Inner Tracking System, Time Projection Chamber, Muon, Photon Spectrometer, Particle Identification
- Trigger: Trigger Detectors supply trigger data to Trigger Levels 0, 1 and 2, which issue trigger decisions
- Front-End Electronics are read out over the Detector Data Link into Readout Receiver Cards hosted by Local Data Concentrators (LDC)
- Event Building switch (3 GB/s) connects the LDCs to the Global Data Collectors (GDC), steered by the Event Destination Manager
- Storage switch (1.25 GB/s) connects the GDCs to Permanent Data Storage
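
The data path on this slide can be pictured in a few lines of code. The sketch below is only an illustration of the flow (sub-events produced by an LDC being routed so that all pieces of one event reach the same GDC); the structure fields, the round-robin policy and all names are assumptions, not the actual DATE implementation.

```c
/* Minimal sketch of the LDC -> GDC event-building path.  All names,
 * fields and the round-robin destination policy are illustrative
 * assumptions standing in for the real Event Destination Manager. */
#include <stdint.h>
#include <stdio.h>

#define N_GDC 40                  /* Global Data Collectors at full scale */

typedef struct {
    uint32_t event_id;            /* trigger/event number */
    uint32_t ldc_id;              /* Local Data Concentrator that built it */
    uint32_t detector_id;         /* e.g. TPC, ITS, Muon, ... */
    uint32_t size;                /* sub-event payload size in bytes */
} SubEventHeader;

/* Every sub-event of a given event must reach the same GDC, so the
 * destination depends only on the event number. */
static int choose_gdc(uint32_t event_id)
{
    return (int)(event_id % N_GDC);
}

int main(void)
{
    SubEventHeader sub = { .event_id = 1234, .ldc_id = 7,
                           .detector_id = 1, .size = 1u << 20 };
    printf("event %u from LDC %u is sent to GDC %d\n",
           sub.event_id, sub.ldc_id, choose_gdc(sub.event_id));
    return 0;
}
```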

Slide 5: ALICE running parameters
- Two different running modes:
  - Heavy Ion (HI): 10^6 seconds/year
  - Proton: 10^7 seconds/year
- One Data Acquisition system (DAQ): Data Acquisition and Test Environment (DATE)
- Many trigger classes, each providing events at different rates, sizes and sources
- HI data rates: 3 GB/s into event building, 1.25 GB/s to storage, ~1 PB/year to mass storage
- Proton run: ~0.5 PB/year to mass storage
- Staged DAQ installation plan (20% → 30% → 100%):
  - 85 → 300 LDCs, 10 → 40 Global Data Collectors (GDC)
- Different recording options:
  - Local/remote disks
  - Permanent Data Storage (PDS): CERN Advanced Storage Manager (CASTOR)
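
As a quick consistency check on these volumes, a worked estimate under the assumption that the quoted storage rate is sustained over the whole effective running time:

```latex
% Heavy Ion: storage rate times effective running time per year
1.25~\mathrm{GB/s} \times 10^{6}~\mathrm{s} = 1.25 \times 10^{6}~\mathrm{GB} \approx 1~\mathrm{PB}
% Proton: ~0.5 PB spread over 10^7 s corresponds to an average of
\frac{0.5~\mathrm{PB}}{10^{7}~\mathrm{s}} = 50~\mathrm{MB/s}~\text{to mass storage}
```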

Slide 6: History of ALICE Data Challenges
- Started in 1998 to put together a high-bandwidth DAQ/recording chain
- Continued as a periodic activity to:
  - Validate the interoperability of all existing components
  - Assess and validate developments, trends and options:
    - commercial products
    - in-house developments
  - Provide guidelines for ALICE & IT development and installation
- Continuously expand up to the ALICE requirement at LHC startup

Slide 7: Performance goals (chart, MBytes/s)

Slide 8: Data volume goals (chart, TBytes to mass storage)

Slide 9: The ALICE Data Challenge IV

Slide 10: Components & modes (diagram)
- LDC emulator feeding the ALICE DAQ with raw events
- Objectifier turning raw events into raw data objects
- CASTOR front-end and CASTOR PDS for permanent storage
- AFFAIR and the CASTOR monitor for performance monitoring
- Private network and CERN backbone interconnecting the components

Slide 11: Targets
- DAQ system scalability tests
- Single peer-to-peer tests:
  - Evaluate the behavior of the DAQ system components with the available hardware
  - Preliminary tuning
- Multiple LDC/GDC tests:
  - Add the full Data Acquisition (DAQ) functionality
  - Verify the objectification process
  - Validate & benchmark the CASTOR interface
- Evaluate the performance of new hardware components:
  - New generation of tapes
  - 10 Gb Ethernet
- Achieve a stable production period:
  - Minimum 200 MB/s sustained
  - 7 days non-stop
  - 200 TB of data to PDS

Slide 12: Software components
- Configurable LDC Emulator (COLE)
- Data Acquisition and Test Environment (DATE) 4.2
- A Fine Fabric and Application Information Recorder (AFFAIR) V1
- ALICE Mock Data Challenge objectifier (ALIMDC)
- ROOT (Object-Oriented Data Analysis Framework) v3.03
- Permanent Data Storage (PDS): CASTOR V1.4.1.7
- Linux RedHat 7.2, kernel 2.2 and 2.4
  - Physical pinned memory driver (PHYSMEM)
  - Standard TCP/IP library
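
PHYSMEM gives the DAQ access to reserved, pinned physical memory through a dedicated kernel driver. As a rough user-space analogue of the pinning aspect only, the sketch below uses mlock(2) to keep an event buffer resident in RAM; it is an illustration, not the PHYSMEM interface, and plain mlock() cannot provide the physically contiguous memory a DMA-capable readout card needs.

```c
/* User-space illustration of memory pinning with mlock(2).  The real
 * PHYSMEM driver works in the kernel and deals with physical memory
 * suitable for device DMA, which mlock() alone does not provide. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 64 * 1024 * 1024;        /* 64 MB event buffer */
    void *buf = malloc(size);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    if (mlock(buf, size) != 0) {           /* keep the pages resident */
        perror("mlock");                   /* may need a raised RLIMIT_MEMLOCK */
        free(buf);
        return 1;
    }
    printf("pinned %zu bytes\n", size);
    munlock(buf, size);
    free(buf);
    return 0;
}
```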

Slide 13: Hardware setup
- ALICE DAQ: infrastructure & benchmarking
  - NFS & DAQ servers
  - SMP HP Netserver (4 CPUs): setup & benchmarking
- LCG testbed (lxshare): setup & production
  - 78 CPU servers on Gigabit Ethernet (GE)
    - Dual ~1 GHz Pentium III, 512 MB RAM
    - Linux kernel 2.2 and 2.4
    - NFS (installation, distribution) and AFS (unused)
  - [8..30] disk servers (IDE-based) on GE
- Mixed FE/GE/trunked GE, private & CERN backbone
  - 2 × Extreme Networks Summit 7i switches (32 GE ports)
  - 12 × 3COM 4900 switches (16 GE ports)
  - CERN backbone: Enterasys SSR8600 routers (28 GE ports)
- PDS: 16 × 9940B tape drives in two different buildings
  - STK linear tapes, 30 MB/s, 200 GB/cartridge

Slide 14: Networking (diagram): LDCs, GDCs and disk servers on satellite switches with 2- and 3-link GE trunks; 16 tape servers (distributed) reached over the backbone (4 Gbps); 6 CPU servers on FE

Slide 15: Scalability test
- Put together as many hosts as possible to verify the scalability of:
  - run control
  - state machines
  - control and data channels
  - DAQ services
  - system services
  - hardware infrastructure
- Connect/control/disconnect cycles plus simple data transfers (see the sketch after this list)
- Data patterns, payloads and throughputs were not of interest here
- Keywords: usable, reliable, scalable, responsive
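
A minimal sketch of the kind of per-host state machine such a test exercises; the states, commands and transitions below are assumptions for illustration, not the actual DATE run-control states.

```c
/* Toy run-control state machine: connect -> configure -> start ->
 * stop -> disconnect.  Purely illustrative; the real DATE run control
 * drives many more states across all hosts at once. */
#include <stdio.h>
#include <string.h>

typedef enum { DISCONNECTED, CONNECTED, CONFIGURED, RUNNING } State;

static State handle(State s, const char *cmd)
{
    if (s == DISCONNECTED && strcmp(cmd, "connect") == 0)    return CONNECTED;
    if (s == CONNECTED    && strcmp(cmd, "configure") == 0)  return CONFIGURED;
    if (s == CONFIGURED   && strcmp(cmd, "start") == 0)      return RUNNING;
    if (s == RUNNING      && strcmp(cmd, "stop") == 0)       return CONFIGURED;
    if (s != RUNNING      && strcmp(cmd, "disconnect") == 0) return DISCONNECTED;
    fprintf(stderr, "command '%s' refused in state %d\n", cmd, (int)s);
    return s;                               /* illegal transition: stay put */
}

int main(void)
{
    const char *sequence[] = { "connect", "configure", "start",
                               "stop", "disconnect" };
    State s = DISCONNECTED;
    for (int i = 0; i < 5; i++)
        s = handle(s, sequence[i]);
    printf("final state: %d\n", (int)s);    /* 0 == DISCONNECTED */
    return 0;
}
```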

Slide 16: Scalability test (results chart)

Slide 17: Single peer-to-peer
- Compare:
  - Architectures
  - Network configurations
  - System and DAQ parameters
- Exercise:
  - DAQ system network modules
  - DAQ system clients and daemons
  - Linux system calls, system libraries and network drivers
- Benchmark and tune (a throughput sketch follows this list):
  - Linux parameters
  - DAQ processes, libraries and network components
  - DAQ data flow
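
The kind of point-to-point measurement this involves can be sketched as a small TCP sender that streams fixed-size blocks after enlarging the socket send buffer, one of the tuning knobs mentioned above. The receiver address, port, block count and buffer sizes are placeholders; a real benchmark would also time the transfer and handle partial writes.

```c
/* Skeleton of a peer-to-peer throughput test sender.  Addresses,
 * ports and sizes are placeholders chosen for illustration. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    static char block[1 << 20];             /* 1 MB payload, like one event */
    memset(block, 0xAB, sizeof block);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int sndbuf = 4 * 1024 * 1024;           /* socket buffer: a tuning knob */
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf);

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9000);             /* placeholder port */
    inet_pton(AF_INET, "192.168.0.2", &dst.sin_addr);   /* placeholder peer */

    if (connect(fd, (struct sockaddr *)&dst, sizeof dst) != 0) {
        perror("connect");
        return 1;
    }
    for (int sent = 0; sent < 1000; sent++)  /* stream 1000 blocks (~1 GB) */
        if (write(fd, block, sizeof block) < 0) {
            perror("write");
            break;
        }
    close(fd);
    return 0;
}
```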

Slide 18: Single peer-to-peer (results chart)

Slide 19: Full test runtime options
- Different trigger classes for different traffic patterns
- Several recording options:
  - NULL
  - GDC disk
  - CASTOR disks
  - CASTOR tapes
- Raw data vs. ROOT objects
- We concentrated on two major traffic patterns (sketched after this list):
  - Flat traffic: all LDCs send the same event
  - ALICE-like traffic: a periodic sequence of different events, distributed according to the forecasted ALICE raw data
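
The two patterns can be pictured with a small generator. The event sizes in the ALICE-like cycle below are invented placeholders; the real challenge used the forecasted ALICE raw-data sizes per trigger class.

```c
/* Sketch of the two traffic patterns.  The ALICE-like sizes are
 * placeholders, not the forecasted ALICE trigger-class sizes. */
#include <stdio.h>

/* Flat traffic: every LDC contributes the same fixed-size event. */
static size_t flat_event_size(void)
{
    return 1u << 20;                        /* 1 MB/event/LDC */
}

/* ALICE-like traffic: a periodic sequence of different event sizes,
 * standing in for the mix of trigger classes and detectors. */
static size_t alice_like_event_size(unsigned long event_number)
{
    static const size_t cycle[] = {         /* placeholder sizes in bytes */
        8u << 20, 1u << 20, 1u << 20, 256u << 10, 64u << 10
    };
    return cycle[event_number % (sizeof cycle / sizeof cycle[0])];
}

int main(void)
{
    for (unsigned long n = 0; n < 10; n++)
        printf("event %lu: flat %zu B, ALICE-like %zu B\n",
               n, flat_event_size(), alice_like_event_size(n));
    return 0;
}
```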

Slide 20: Performance goals (chart, MBytes/s; 650 MB/s marked)

Slide 21: Flat data traffic
- 40 LDCs × 38 GDCs
- 1 MB/event/LDC → NULL recording
- Occupancies:
  - LDCs: 75%
  - GDCs: 50%
- Critical item: load balancing over the GE trunks (2/3 of nominal; see the illustration below)
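
One plausible reason trunking stays below nominal: link aggregation typically pins each LDC-to-GDC stream to one physical link via a hash, so a small number of streams fills the links unevenly. The hash function and the 3-link trunk below are assumptions chosen only to illustrate the effect.

```c
/* Toy illustration of per-flow load balancing on a GE trunk: each
 * stream is hashed onto one physical link, so a few concurrent
 * streams can leave one link overloaded while others sit idle.
 * The hash and the trunk width are assumptions for the example. */
#include <stdio.h>

#define TRUNK_LINKS 3

static int link_for_flow(int src_id, int dst_id)
{
    return (src_id ^ dst_id) % TRUNK_LINKS; /* simplistic per-flow hash */
}

int main(void)
{
    int load[TRUNK_LINKS] = { 0 };
    /* four LDCs all sending to the same GDC (id 0) over one trunk */
    for (int ldc = 1; ldc <= 4; ldc++)
        load[link_for_flow(ldc, 0)]++;
    for (int l = 0; l < TRUNK_LINKS; l++)
        printf("link %d carries %d stream(s)\n", l, load[l]);
    return 0;
}
```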

Slide 22: Load distribution on trunks (chart: MB/s vs. number of LDCs [1-7], comparing LDCs distributed over several switches with LDCs on the same switch)

Slide 23: ALICE-like traffic
- LDCs:
  - rather realistic simulation
  - partitioned in detectors
  - no hardware trigger
  - simulated readout, no “real” input channels
- GDCs acting as:
  - event builder
  - CASTOR front-end
- Data traffic:
  - realistic event sizes and trigger classes
  - partial detector readout
- Networking & nodes’ distribution scaled down & adapted

Slide 24: Challenge setup & outcomes
- ~25 LDCs
  - TPC: 10 LDCs
  - other detectors: [1..3] LDCs
- ~50 GDCs
- Each satellite switch: 12 LDCs/GDCs (distributed)
- [8..16] (+1) tape servers on the CERN backbone
- [8..16] (+1) tape drives attached to a tape server
- No objectification:
  - named pipes too slow and too heavy
  - upgraded to avoid named pipes:
    - ALIMDC/CASTOR still not performing well

Slide 25: Impact of traffic pattern (chart comparing FLAT/CASTOR, ALICE/NULL and ALICE/CASTOR)

Slide 26: Performance goals (chart, MBytes/s; 200 MB/s marked)

Slide 27: Production run
- 8 LDCs × 16 GDCs, 1 MB/event/LDC (FLAT traffic)
- [8..16] tape servers and tape units
- 7 days at ~300 MB/s sustained, > 350 MB/s peak, ~180 TB to tape
- 9 Dec: too much input data
- 10 Dec: hardware failures on tape drives & reconfiguration
- Despite the failures, the performance goals were always exceeded
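
The quoted volume and the sustained rate are mutually consistent:

```latex
% seven days at roughly 300 MB/s sustained
300~\mathrm{MB/s} \times 7 \times 86\,400~\mathrm{s}
  \approx 1.8 \times 10^{8}~\mathrm{MB} \approx 180~\mathrm{TB}
```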

Slide 28: System reliability
- Hosts:
  - ~10% Dead On Installation
  - ~25% Failed On Installation
- Long period of short runs (tuning):
  - occasional problems (recovered) with:
    - name server
    - network & O.S.
  - on average [1..2] O.S. failures per week (on 77 hosts)
    - unrecoverable occasional failures on GE interfaces
- Production run:
  - one tape unit failed and had to be excluded

Slide 29: Outcomes
- DATE:
  - 80 hosts / 160+ roles with one run control
  - Excellent reliability and performance
  - Scalable and efficient architecture
- Linux:
  - A few hiccups here and there, but rather stable and fast
  - Excellent network performance/CPU usage
  - Some components are too slow (e.g. named pipes)
  - More reliability needed from the GE interfaces

Slide 30: Outcomes (continued)
- Farm installation and operation: not to be underestimated!
- CASTOR:
  - Reliable and effective
  - Improvements needed on:
    - overload handling
    - parallelizing tape resources
- Tapes: one DOA and one DOO
- Network: silent but very effective partner
  - Layout made for a farm, not optimized for the ALICE DAQ
- 10 Gb Ethernet tests:
  - failure at first
  - problem “fixed” too late for the Data Challenge
  - reconfiguration: transparent to DAQ and CASTOR

Slide 31: Future ALICE Data Challenges
- Continue the planned progression (performance chart, MBytes/s)
- ALICE-like pattern
- Record ROOT objects
- New technologies:
  - CPUs
  - Servers
  - Network:
    - NICs
    - Infrastructure
    - Beyond 1 GbE
- Insert online algorithms
- Provide some “real” input channels
- Challenge the challenge!

Slide 32: http://bulletin.cern.ch/eng/ebulletin.php

