1 LHC-Era Data Rates in 2004 and 2005: Experiences of the PHENIX Experiment with a PetaByte of Data
Martin L. Purschke, Brookhaven National Laboratory, for the PHENIX Collaboration
[Image: RHIC seen from space, Long Island, NY]

2 RHIC/PHENIX at a glance
RHIC: 2 independent rings, one beam clockwise, the other counterclockwise
√s_NN = 500 GeV × Z/A: ~200 GeV for heavy ions, ~500 GeV for (polarized) proton-proton
PHENIX: 4 spectrometer arms
12 detector subsystems
400,000 detector channels, lots of readout electronics
Uncompressed event size typically … KB for AuAu, CuCu, pp
Data rate 2.4 kHz (CuCu), 5 kHz (pp)
Front-end data rate … GB/s
Data logging rate ~… MB/s, 600 MB/s max

3 Where we are w.r.t. others
[Chart: approximate data-logging rates of the LHC experiments ATLAS, CMS, LHCb and ALICE, ranging from roughly 25 MB/s to several hundred MB/s, shown alongside PHENIX; all numbers in MB/s, all approximate]
Rates of ~100 MB/s and above are not so Sci-Fi these days

4 Outline
What was planned, and what we did
Key technologies
Online data handling
How offline / the users cope with the data volumes
Related talks:
Danton Yu, "BNL Wide Area Data Transfer for RHIC and ATLAS: Experience and Plan", this (Tuesday) afternoon, 16:20
Chris Pinkenburg, "Online monitoring, calibration, and reconstruction in the PHENIX Experiment", Wednesday 16:40 (OC-6)

5 A Little History
In its design phase, PHENIX had settled on a "planning" number of 20 MB/s data rate ("40 Exabyte tapes, 10 postdocs")
In Run 2 (2002) we started to approach that 20 MB/s number and ran Level-2 triggers
In Run 3 (2003) we faced the decision to either go full steam ahead with Level-2 triggers… or first try something else
Several advances in various technologies conspired to make first 100 MB/s, then multi-100 MB/s data rates possible
We are currently logging at about 60 times the rate the original design called for

6 But is this a good thing?
We had a good amount of discussion about the merits of going to high data rates. Are we drowning in data? Will we be able to analyze the data quickly enough? Are we mostly recording "boring" events? Wouldn't it be better to trigger and reject?
In heavy-ion collisions, the rejection power of Level-2 triggers is limited (high multiplicity, etc.)
Triggers take a lot of time to study, and developers usually welcome a delay in the onset of the actual rejection mode
The would-be rejected events are by no means "boring"; there is high-statistics physics in them, too

7 …good thing? Cont'd
In the end we convinced ourselves that we could (and should) do it.
The increased data rate helped defer the onset of the Level-2 rejection mode in Runs 4 and 5 (in the end we didn't run rejection at all)
Saved a lot of headaches… we think
Get the data while the getting is good: the detector system is evolving and hence unique for each run, so better get as much data with it as you can
Get physics that you simply can't trigger on
Don't be afraid to let data sit on tape unanalyzed: computing power increases, time is on your side here

8 Ingredients to achieve the high rates
Several main ingredients made the high rates possible:
We compress the data before they hit disk for the first time (cram as much information as possible into each MB)
We run several local storage servers ("buffer boxes") in parallel, and dramatically increased the buffer disk space (40 TB currently)
We improved the overall network connectivity and topology
We automated most of the file handling, so the data rates in the DAQ become manageable

9 Event Builder Layout
Some technical background...
[Diagram: Sub-Event Buffers (SEBs) feed a Gigabit crossbar switch, which feeds the Assembly-and-Trigger Processors (ATPs); the data then go to buffer boxes and on to HPSS]
Sub-Event Buffers (SEBs) see parts of a given event, following the granularity of the detector electronics
The pieces are sent to a given Assembly-and-Trigger Processor (ATP) on a next-available basis; the fully assembled event is available here for the first time
The data are stored on "buffer boxes" that help ride out the ebb and flow of the data stream, ride out service interruptions of the HPSS tape system, and help send a steady stream to the tape robot
With ~50% combined PHENIX × RHIC uptime, we can log 2× the HPSS maximum data rate in the DAQ; it just needs enough buffer depth (40 TB)
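To make the "next-available" assembly step concrete, here is a minimal, hypothetical sketch (not the actual PHENIX Event Builder code, and all names are illustrative): each sub-event fragment carries its event number, and an ATP considers an event complete once every SEB has delivered its piece.

```cpp
// Sketch of sub-event assembly in an ATP. Assumes every SEB contributes
// exactly one fragment per event; real bookkeeping (timeouts, dropped
// fragments, event routing) is omitted.
#include <cstdint>
#include <map>
#include <vector>

struct SubEvent {
    uint64_t event_number;            // assigned by the trigger/clock system
    int      seb_id;                  // which Sub-Event Buffer sent this piece
    std::vector<uint8_t> payload;
};

class AtpAssembler {
public:
    explicit AtpAssembler(int n_sebs) : n_sebs_(n_sebs) {}

    // Returns true when this fragment completes an event; the assembled
    // event can then be trigger-processed, compressed and shipped out.
    bool add_fragment(const SubEvent& frag) {
        auto& pieces = pending_[frag.event_number];
        pieces.push_back(frag);
        if (static_cast<int>(pieces.size()) == n_sebs_) {
            // ... run Level-2 algorithms / compression on 'pieces' here ...
            pending_.erase(frag.event_number);
            return true;
        }
        return false;
    }

private:
    int n_sebs_;
    std::map<uint64_t, std::vector<SubEvent>> pending_;  // events in flight
};
```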

10 Data Compression
We found that the raw data are still gzip-compressible after zero-suppression and other data reduction techniques
We introduced a compressed raw data format that supports a late-stage compression
[Diagram: a normal file is a sequence of buffers; on writing, each buffer is compressed with the LZO algorithm and wrapped with a new buffer header, so the file becomes a sequence of new buffers with the compressed original as payload; on readback, an LZO unpack step restores the original uncompressed buffer]
All this is handled completely in the I/O layer; the higher-level routines just receive a buffer as before.
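The talk shows no code, but the compress-and-wrap step can be sketched with the standard liblzo2 API. This is a minimal sketch only; the wrapper header layout and the magic word are hypothetical, not the actual PHENIX raw data format.

```cpp
// Compress a raw buffer with LZO1X-1 and prepend a small header so the
// reader can recognize and restore it transparently (link with -llzo2).
#include <lzo/lzo1x.h>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

struct CompressedHdr {              // hypothetical wrapper header
    uint32_t magic;                 // marks a compressed buffer
    uint32_t uncompressed_size;     // needed to size the buffer on readback
    uint32_t compressed_size;
};

std::vector<uint8_t> compress_buffer(const std::vector<uint8_t>& in) {
    static const bool ok = (lzo_init() == LZO_E_OK);
    if (!ok) throw std::runtime_error("lzo_init failed");

    std::vector<uint8_t> wrkmem(LZO1X_1_MEM_COMPRESS);
    // LZO worst-case output size: in + in/16 + 64 + 3, plus our header
    std::vector<uint8_t> out(sizeof(CompressedHdr) + in.size() + in.size() / 16 + 67);

    lzo_uint out_len = 0;
    if (lzo1x_1_compress(in.data(), in.size(),
                         out.data() + sizeof(CompressedHdr), &out_len,
                         wrkmem.data()) != LZO_E_OK)
        throw std::runtime_error("lzo1x_1_compress failed");

    CompressedHdr hdr{0xC0FFEE01u,
                      static_cast<uint32_t>(in.size()),
                      static_cast<uint32_t>(out_len)};
    std::memcpy(out.data(), &hdr, sizeof hdr);
    out.resize(sizeof(CompressedHdr) + out_len);
    return out;                     // this is what goes to the buffer box / tape
}

std::vector<uint8_t> decompress_buffer(const std::vector<uint8_t>& in) {
    CompressedHdr hdr;
    std::memcpy(&hdr, in.data(), sizeof hdr);
    std::vector<uint8_t> out(hdr.uncompressed_size);
    lzo_uint out_len = out.size();
    if (lzo1x_decompress(in.data() + sizeof hdr, hdr.compressed_size,
                         out.data(), &out_len, nullptr) != LZO_E_OK)
        throw std::runtime_error("lzo1x_decompress failed");
    return out;                     // original buffer restored, transparent to higher layers
}
```

Because the compression is a pure per-buffer transformation, it can run wherever CPU is available, which is what the next slide exploits.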

11 Breakthrough: Distributed Compression
[Diagram: same Event Builder layout as before, with SEBs, Gigabit crossbar switch, ATPs, buffer boxes, and HPSS]
The compression is handled in the Assembly-and-Trigger Processors (ATPs) and can so be distributed over many CPUs -- that was the breakthrough
The Event Builder has to cope with the uncompressed data flow, e.g. 500 MB/s … 900 MB/s
The buffer boxes and storage system see the compressed data stream, 250 MB/s … 450 MB/s

12 Buffer Boxes
We use COTS dual-CPU machines, SuperMicro or Tyan motherboards designed for data throughput
SATA 400 GB disk arrays (16 disks), SATA-to-SCSI RAID controller in striping (RAID0) mode
… TB file systems across 4 disks each, on 6 boxes, called a, b, c, d on each box
All "a" (and b, c, d) file systems of the machines together form one "set", written to or read from at the same time
A file system set is either written to or read from, not both
Each buffer box adds a quantum of 2× 100 MB/s rate capability (in and out) on 2 interfaces; 6 buffer boxes = demonstrated 600 MB/s; expansion is cheap, all commodity hardware
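A minimal sketch of the "set" discipline described above, assuming a simple round-robin rotation: a set is either being filled by the DAQ or being drained to tape, never both. The names and the rotation policy are illustrative, not the actual PHENIX buffer-box scripts.

```cpp
// Rotate the write target among the four file-system sets (a, b, c, d).
// The real system also erases a drained set before reusing it; that
// bookkeeping is omitted here.
#include <array>
#include <cstddef>
#include <string>

enum class SetState { Idle, Filling, Draining };

struct FilesystemSet {
    std::string name;   // "a", "b", "c" or "d" on every buffer box
    SetState    state;
};

class SetRotation {
public:
    SetRotation() {
        const char* names[] = {"a", "b", "c", "d"};
        for (std::size_t i = 0; i < sets_.size(); ++i)
            sets_[i] = {names[i], SetState::Idle};
        sets_[0].state = SetState::Filling;
    }

    // Called when the currently filling set is full: hand it to the tape
    // logger (read-only from now on) and start filling the next one.
    void rotate() {
        std::size_t next = (filling_ + 1) % sets_.size();
        sets_[filling_].state = SetState::Draining;   // sent to HPSS
        sets_[next].state     = SetState::Filling;    // DAQ writes go here
        filling_ = next;
    }

    const std::string& write_target() const { return sets_[filling_].name; }

private:
    std::array<FilesystemSet, 4> sets_;
    std::size_t filling_ = 0;
};
```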

13 Buffer Boxes (2)
Short-term data rates are typically as high as ~450 MB/s
But the longer you average, the smaller the average data rate becomes (include machine-development days, outages, etc.)
Run 4: highest rate 450 MB/s, whole-run average about 35 MB/s
So: the longer you can buffer, the better you can take advantage of the breaks in the data flow for your tape logging
Also, a given file can linger on disk until the space is needed, 40 hours or so -- an opportunity to do calibrations, skim off some events for fast-track reconstruction, etc.
"Be ready to reconstruct by the time the file leaves the countinghouse"
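As a rough worked cross-check (my arithmetic, not a number from the talk): if compressed data arrive at a sustained ~280 MB/s, 40 TB of buffer space turns over in about 40×10^6 MB / 280 MB/s ≈ 1.4×10^5 s ≈ 40 hours, consistent with the quoted time a file can linger before its space is reclaimed.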

14 Run 5 Event Statistics
CuCu 200 GeV: 2.2 billion events, 157 TB on tape
CuCu 62.4 GeV: 630 million events, 42 TB on tape
CuCu 22 GeV: 48 million events, 3 TB on tape
p-p 200 GeV: 6.8 billion events, 286 TB on tape
488 TB on tape for Run 5 in total
Just shy of 0.5 PetaByte of raw data on tape in Run 5; we had about 350 TB in Run 4 and about 150 TB in the runs before

15 How to analyze these amounts of data?
Priority reconstruction and analysis of "filtered" events (Level-2 trigger algorithms run offline to filter out the most interesting events)
Shipped a whole raw data set (Run 5 p-p) to a regional PHENIX computing center in Japan (CCJ)
Radically changed the analysis paradigm to a "train-based" one

16 Near-line Filtering and Reconstruction
We didn't run Level-2 triggers in the ATPs, but ran the "triggers" on the data on disk and extracted the interesting events for priority reconstruction and analysis
The buffer boxes, where the data are stored for ~40 hours, come in handy here
The filtering was started and monitored as part of the shift operations, looked over by 2-3 experts
The whole CuCu data set was run through the filters
The reduced data set was sent to Oak Ridge, where a CPU farm was available
Results were shown at the Quark Matter conference in August; that's record speed, as the CuCu part of Run 5 ended in April
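The near-line filtering loop can be sketched generically: pull events from a file sitting on a buffer box, apply a Level-2-style software selection, and write only the accepted events to a much smaller filtered stream. The event type, the selection, and the I/O hooks below are stand-ins, not the actual PHENIX classes or raw data format.

```cpp
// Generic filter pass over one file; kept/seen gives the effective
// rejection factor of the near-line filter.
#include <functional>
#include <optional>

struct Event {
    int trigger_bits = 0;   // hypothetical Level-1 summary information
    // ... raw payload ...
};

struct FilterStats { long seen = 0, kept = 0; };

FilterStats filter_stream(std::function<std::optional<Event>()> next_event,
                          std::function<void(const Event&)> write_event,
                          std::function<bool(const Event&)> select) {
    FilterStats s;
    while (auto evt = next_event()) {          // nullopt at end of file
        ++s.seen;
        if (select(*evt)) {                    // e.g. a rare-probe signature
            write_event(*evt);
            ++s.kept;
        }
    }
    return s;
}
```

In practice `next_event` would read the raw file on the buffer box and `write_event` would append to the filtered output destined for priority reconstruction.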

17 Data Transfers for p-p
Going into the proton-proton half of the run, the BNL computing center (RCF) was busy with last year's data and the new CuCu data set
This was already the pre-QM era (the most important conference in the field, gotta have data, can't wait and make room for other projects), so there were no adequate resources for pp reconstruction at BNL
Historically, large-scale reconstruction could only be done at BNL where the raw data are stored, and only smaller DST data sets could be exported… but times change
Our Japan computing center CCJ had resources; we had plans to set up a "FedEx-tape network" to move the raw data there for speedy reconstruction
But then the GRID made things easier...
"This seems to be the first time that a data transfer of such magnitude was sustained over many weeks in actual production, and was handled as part of routine operation by non-experts."

18 Shift Crew Sociology
The overall goal is a continuous, standard, and sustainable operation; avoid heroic efforts
The shift crews need to do all routine operations; let the experts concentrate on expert stuff
Automate. Automate. And automate some more.
Prevent the shift crews from getting "innovative": we find that we usually later throw out data where crews deviated from the prescribed path
In Run 5 we seem to have settled into a nice routine operation
Encourage the shift crews to CALL THE EXPERT!
Given that, my own metric of how well it works is how many nights I and a half-dozen other folks get uninterrupted sleep

19 Shifts in the Analysis Paradigm
After a good amount of experimenting, we found that simple concepts work best. PHENIX reduced the level of complexity in the analysis model quite a bit. We found:
You can overwhelm any storage system by running it as a free-for-all; random access to files at those volumes brings anything down to its knees
People don't care too much which data they are developing with (OK, it has to be the right kind)
Every once in a while you want to go through a substantial dataset with your established analysis code

20 Shifts in the Analysis Paradigm
We keep some data of the most-wanted varieties on disk
That disk-resident dataset mostly remains the same; we add data to the collection as we get newer runs
The stable disk-resident dataset has the advantage that you can immediately compare the effect of a code change while you are developing your analysis module
Once you think it's mature enough to see more data, you register it with a "train"

21 Analysis Trains
After some hard lessons with the more free-for-all model, we established the concept of analysis trains:
Pulling a lot of data off the tapes is expensive (in terms of time and resources)
Once you go to the trouble, you want to get as much "return on investment" for that file as possible: do all the analysis you want while it's on disk
If you don't do that, the number of file retrievals explodes (I request the file today, the next person requests it tomorrow, a third person next week)
We also switched to tape (cartridge)-centric retrievals: once a tape is mounted, get all the files off while the getting is good
So now the data gets pulled from the tapes, and any number of registered analysis modules run over it -- very efficient
You can still opt out for certain files you don't want or need
It is hard to put a speed-up factor on this, but we went from impossible-to-analyze to an "ok" experience; on paper the speed-up is like 30 or so
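The tape-centric retrieval idea amounts to grouping outstanding file requests by cartridge so each tape is mounted once and everything wanted from it is pulled in one pass. A minimal, hypothetical sketch (the staging call and the catalog lookup are stand-ins, not the actual HPSS interface):

```cpp
// Group requested files by tape cartridge and stage them per mount.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct FileRequest {
    std::string path;     // e.g. a raw data segment on tape
    std::string tape_id;  // cartridge holding the file (from the file catalog)
};

void stage_from_tape(const std::string& path) {
    std::cout << "  staging " << path << "\n";   // placeholder for the real staging call
}

void stage_tape_centric(const std::vector<FileRequest>& requests) {
    // Bucket the requests per cartridge.
    std::map<std::string, std::vector<std::string>> by_tape;
    for (const auto& r : requests) by_tape[r.tape_id].push_back(r.path);

    // One pass per cartridge: a single mount serves every file on it.
    for (const auto& [tape, paths] : by_tape) {
        std::cout << "mount tape " << tape << "\n";
        for (const auto& p : paths) stage_from_tape(p);
    }
}
```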

22 Analysis Train Etiquette
Your analysis module has to obey certain rules:
Be of a certain module kind, so it is manageable by the train conductor
Be a "good citizen" -- mostly enforced by inheriting from the module parent class, starting from templates, and code review
Code mustn't crash, must have no endless loops and no memory leaks, and must pass prescribed Valgrind and Insure tests
This train concept has streamlined the PHENIX analysis in ways that are hard to overestimate.
After the train, the typical output is relatively small and fits on disk
Made a mistake? Or forgot to include something you need? Bad, but not too bad… fix it, test it, the next train is leaving soon. Be on board again.
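A minimal sketch of the "module parent class" idea: every analysis module implements the same small interface, so the train conductor can run an arbitrary set of registered modules over each event in a single pass through the data. Class and method names here are illustrative only, not the actual PHENIX framework interfaces.

```cpp
// One shared base class for all train passengers, plus a conductor that
// reads each event once and hands it to every registered module.
#include <memory>
#include <string>
#include <vector>

struct Event { /* assembled or reconstructed event data */ };

class AnalysisModule {
public:
    virtual ~AnalysisModule() = default;
    virtual int init() { return 0; }                   // book histograms, open output
    virtual int process_event(const Event& evt) = 0;   // one call per event
    virtual int end() { return 0; }                    // write and close output
    virtual std::string name() const = 0;
};

class TrainConductor {
public:
    void register_module(std::unique_ptr<AnalysisModule> m) {
        modules_.push_back(std::move(m));
    }
    void run(const std::vector<Event>& events) {
        for (auto& m : modules_) m->init();
        for (const auto& evt : events)                     // each file is read once...
            for (auto& m : modules_) m->process_event(evt); // ...and every passenger sees it
        for (auto& m : modules_) m->end();
    }
private:
    std::vector<std::unique_ptr<AnalysisModule>> modules_;
};
```

The shared interface is also what makes the etiquette enforceable: a module that compiles against the parent class and survives the prescribed Valgrind/Insure tests can ride any train.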

23 Other Advantages
When we started this, there was no production-grade GRID yet, so we had to have something that "just worked"
Less ambitious than current GRID projects
The train allows someone else to run your analysis on your behalf (and many others' at the same time, with essentially the same effort)
It can make use of other, non-GRID-aware installations
Only one debugging effort is needed to adapt to slightly different environments

24 Summary (and Lessons)
We exceeded our design data rate by a factor of 60 (2× by compression, 30× by logging rate increases)
We wrote 350 TB in Run 4, 500 TB in Run 5, and ~150 TB in the runs before -> ~1 PB
The increased data rate simplifies many tough trigger-related problems
We get physics that one can't trigger on
The additional hardware is relatively inexpensive and easily expandable
Offline and analysis cope through the "train" paradigm shift

25 Lessons
First and foremost, compress. Do NOT let "soft" data hit the disks. CPU power is comparably cheap. Distribute the compression load.
Buffer the data. Even if your final storage system can deal with the instantaneous rates, take advantage of having the files around for a decent amount of time for monitoring, calibration, etc.
Don't be afraid to take more data than you can conceivably analyze before the next run. There's a new generation of grad students and post-docs coming, and technology is on your side.
Do not allow random access to files on a large scale. Impose a concept that maximizes the "return on investment" per file retrieval.
The End
