ALICE Production and Analysis Software


ALICE Production and Analysis Software P.Hristov 06/06/2013

ALICE detectors
[Figure: layout of the ALICE detector: the L3 magnet (B = 0.5 T) enclosing the central barrel (ITS, TPC, TRD, TOF, PHOS, HMPID, PMD), and the forward MUON spectrometer with absorber and dipole magnet]

Offline framework
AliRoot, in development since 1998
- Directly based on ROOT
- Used since the detector TDRs for all ALICE studies
- Few packages to install: ROOT, AliRoot, Geant3, AliEn client
  - Optionally Geant4 or Fluka instead of Geant3 => additional VMC package
  - Optionally DATE + AMORE for DA and DQM development
  - Optionally FASTJET + Boost + CGAL for jet studies
- Ported to the most common architectures/OS: Linux & Mac OS
- Distributed development: over 150 developers and a single SVN repository
  - analysis: ~1M SLOC; simulation, reconstruction, calibration, alignment, visualization: ~1.4M SLOC
- Integration with DAQ (data recorder) and HLT (same code base)
- Abstract interfaces and a "restricted" subset of C++ used for maximum portability
- Used for simulation, reconstruction, calibration (detector algorithms), alignment, quality assurance, monitoring, and analysis

AliRoot Layout
[Diagram: AliRoot module layout. STEER (AliSimulation, AliReconstruction, ESD/AOD classes) coordinates the detector modules (ITS, TPC, TRD, TOF, PHOS, EMCAL, HMPID, MUON, ZDC, PMD, FMD, T0, VZERO, ACORDE, STRUCT, TRIG), the event generators (EVGEN with PYTHIA, HIJING, PHOJET, DPMJET, PDF), the Virtual MC (Geant3, Geant4, Fluka), RAW, HLT, EVE, OCDB and Analysis; everything sits on ROOT (CINT, GEOM, HIST, TREE, PROOF, IO, MATH, ...) and connects to the GRID through AliEn]

AliRoot: Simulation + Reconstruction
[Diagram: the processing chain. AliSimulation: initialization → event generation → particle transport → hits → summable digits → event merging (optional) → digits/raw digits. AliReconstruction: clusters → tracking → PID → ESD. The ESD is filtered to the AOD, which feeds the analysis]

Particle transport
[Diagram: the Virtual MC (VMC) insulates the user code, the generators, the reconstruction and the visualisation from the concrete transport engine, so Geant3, Geant4 or FLUKA transport can be plugged in behind the same interface, sharing one geometrical modeller]
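As an illustration, the transport engine is selected in the simulation configuration macro and everything else talks to the engine-neutral VMC pointer; a minimal sketch, assuming a standard Config.C and the TGeant3TGeo constructor as shipped with AliRoot/geant3 (the physics switches shown are illustrative):

    // Config.C fragment (sketch): instantiating the VMC transport engine.
    // Only one engine is created; the rest of the macro is engine-independent.
    void Config()
    {
      new TGeant3TGeo("C++ Interface to Geant3");  // Geant3 via VMC + TGeo
      // new TGeant4(...);                         // or Geant4, same VMC interface
      gMC->SetProcess("DCAY", 1);                  // physics switches go through gMC,
      gMC->SetCut("CUTGAM", 1.e-3);                // the engine-neutral VMC pointer
      // ... detector construction, generator setup, etc.
    }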

Simulation steps and event combination
- Event generation: kinematics
- Particle transport: hits (energy deposition at a given point, MC label)
- Detector response: summable digits (low ADC threshold, no noise, MC label), then digits (noise + normal threshold, MC label)
- Raw data formats: DDL files, DATE file, "rootified" raw
Event combination techniques:
- Event merging (MC + MC): summable digits of the signal + summable digits of the background; needed to reduce the simulation time (see the sketch below)
- Event embedding (MC + raw): summable digits of the signal + raw data converted to SDigits; used for studies of the reconstruction efficiency
- Event mixing: tracks or clusters from one event are combined with tracks or clusters from another (but "similar") event
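For illustration, merging is steered from the simulation macro; a minimal sketch, assuming the AliSimulation::MergeWith interface as in AliRoot (the detector list and the background path are hypothetical):

    // sim.C fragment (sketch): signal + background merging at the SDigit level
    AliSimulation sim("Config.C");
    sim.SetMakeSDigits("ITS TPC TRD TOF");       // produce summable digits for the signal
    sim.MergeWith("../backgr/galice.root", 10);  // 10 signal events per background event
    sim.Run(100);                                // simulate and merge 100 signal events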

Simulation: Data flow
[Diagram: the simulation, steered by C++ macros (Config.C, sim.C, generator setup, etc.), takes the calibration, alignment and GRP of the anchor run from the raw OCDB, plus an MC OCDB (ideal/residual/full), the RecoParam and the B field, and produces the kinematics tree, hits, SDigits and digits/raw data, with QA running alongside]
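A minimal sim.C sketch, assuming the standard AliSimulation steering class (the detector list, file names and event count are illustrative):

    // sim.C (sketch): steering a small simulation production
    void sim(Int_t nev = 10)
    {
      AliSimulation sim("Config.C");   // geometry, generator and VMC set up in Config.C
      sim.SetMakeSDigits("ITS TPC");   // detectors for which SDigits are produced
      sim.SetWriteRawData("ALL");      // also convert the digits to the raw data format
      sim.Run(nev);                    // generate, transport and digitize nev events
    }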

Reconstruction: barrel tracking
- Local detector reconstruction: clusterization, vertexing
- Seeding: 2 clusters in the TPC + the primary vertex
- Kalman filter in three passes:
  - forward propagation TPC → ITS
  - backward propagation ITS → TPC → TRD → TOF
  - refit inward TRD → TPC → ITS
[Diagram: the barrel tracking sequence: TPC seeding and forward tracking; ITS forward tracking with a combinatorial Kalman filter; TPC and ITS V0 and kink finders; back propagation with TRD seeding/tracking, TOF PID and propagation to PHOS, EMCAL and HMPID; inward refit with TRD, TPC and ITS tracking and PID, updating the V0s and kinks at each stage]
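The Kalman filter itself alternates prediction (propagation through material) and update (adding a cluster); a purely illustrative one-dimensional sketch of that cycle follows (the real tracker works on a 5-parameter track state with a full covariance matrix, this is not the AliRoot code):

    // Generic 1D Kalman filter step: each propagation pass above
    // repeats this predict/update cycle for every detector layer.
    struct KalmanState { double x; double P; };   // estimate and its variance

    void kalmanStep(KalmanState &s, double F, double Q, double z, double R)
    {
      // predict: propagate the state to the next layer
      s.x = F * s.x;
      s.P = F * s.P * F + Q;        // add process (material) noise
      // update: combine the prediction with the measured cluster z (variance R)
      double K = s.P / (s.P + R);   // Kalman gain
      s.x += K * (z - s.x);
      s.P *= (1.0 - K);
    }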

ALICE raw data
- proton-proton: many small events, pileup; a typical run of 10 months produces ~1 PB of raw data
- Pb-Pb: from very big central events to very small ultra-peripheral ones; a typical run of 1 month produces ~1 PB of raw data
- p-Pb (usually replaces Pb-Pb): the events look like "high-multiplicity pp"; a typical run of ~1.5 months produces ~0.3 PB
- Variety of running conditions for each period: different trigger mixtures, HLT compression, different behavior of the detectors (HV, efficiency, noise, etc.)

Raw data in 2012/2013
- 1.65 PB collected: 8 periods in 2012 (LHC12a to LHC12h), 6 periods in 2013 (LHC13a to LHC13f)
- 2 copies: one @CERN and a distributed copy @6 T1s
- 7.5 PB since the start of LHC

RAW data transfer
[Plot: RAW data transfer: 340 TB of RAW, 20% of the 2012+2013 total, in 6 sub-periods divided by triggering conditions; the Pb+Pb periods and the p+Pb period are marked]

Raw data processing
- Calibration (reconstruction with special settings + QA + analysis) => OCDB update
  - CPass0 (mainly for the barrel detectors, e.g. the TPC)
  - CPass1 (all detectors, e.g. the TOF)
  - Manual calibration
- Validation: VPass (reconstruction of a 10% raw sample with standard settings + QA trains)
- Production: PPass (reconstruction of all collected raw data + AOD filtering)

Reconstruction: Data flow
[Diagram: the reconstruction, steered by the C++ macro rec.C, reads the RAW data and the per-run calibration, alignment, GRP, RecoParam and B field from the OCDB, and produces the ESDs (with friends) and tags; QA runs against the QA reference]
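A minimal rec.C sketch, assuming the standard AliReconstruction steering class (the file names and the OCDB location are illustrative):

    // rec.C (sketch): reconstructing raw data into ESDs
    void rec()
    {
      AliReconstruction rec;
      rec.SetDefaultStorage("local://$ALICE_ROOT/OCDB");  // OCDB location
      rec.SetInput("raw.root");                           // "rootified" raw data
      rec.SetRunQA("ALL:ALL");                            // QA for all detectors and tasks
      rec.Run();                                          // writes AliESDs.root (+ friends)
    }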

The ESD
- AliESDEvent has a TList of objects and containers (TClonesArrays), "to be extendable"
- Non-persistent pointers to the "standard" content
- Heavy object, complex I/O (de-serialization)
- Long inheritance chains for the content of the containers; example: AliESDtrack → AliExternalTrackParam → AliVTrack → AliVParticle → TObject
Content (among others):
- AliESDVertex: estimated with the SPD, and estimated with ESD tracks
- AliCentrality: centrality for AA; AliEventPlane: event plane for AA
- AliESDTZERO: TZERO information; AliESDVZERO: VZERO information
- AliMultiplicity: SPD tracklets
- AliESDtrack: detailed information in the central barrel; AliESDMuonTrack: tracks in the Muon arm
- ESDV0: V0 vertices; ESDCascade: cascade vertices; ESDKink: kinks
- ESDPmdTrack: tracks in the PMD; ESDTrdTracks: triggered tracks in the TRD
- ESDCaloClusters: PHOS/EMCAL clusters; ESDFMD: FMD multiplicity
- and so on...
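For orientation, reading the ESD outside the analysis framework looks roughly like this; a sketch assuming the standard AliESDEvent class and the esdTree name (the framework normally hides this loop):

    // Sketch: direct loop over the ESD tracks of a local file
    TFile *f = TFile::Open("AliESDs.root");
    TTree *esdTree = (TTree*)f->Get("esdTree");
    AliESDEvent *esd = new AliESDEvent();
    esd->ReadFromTree(esdTree);                  // connects the containers to the branches
    for (Long64_t i = 0; i < esdTree->GetEntries(); ++i) {
      esdTree->GetEntry(i);                      // the de-serialization cost is paid here
      for (Int_t t = 0; t < esd->GetNumberOfTracks(); ++t) {
        AliESDtrack *track = esd->GetTrack(t);   // inherits AliExternalTrackParam
        // track->Pt(), track->GetTPCNcls(), ...
      }
    }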

From ESD to AOD
[Diagram: the ESD chain, steered by a C++ macro/plugin with the OCDB and the B field available, is passed through the Tender and the AOD filter/QA tasks (configured via the OADB) to produce the AOD and Δ-AODs, which are then merged; QA runs alongside]
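The AOD output side is an event handler attached to the analysis manager; a minimal sketch, assuming the standard AliAODHandler class (file name illustrative; the filter tasks then fill this output event):

    // Sketch: adding AOD output to an existing analysis manager
    AliAnalysisManager *mgr = AliAnalysisManager::GetAnalysisManager();
    AliAODHandler *aodH = new AliAODHandler();
    aodH->SetOutputFileName("AliAOD.root");   // the filtered AOD written by the train
    mgr->SetOutputEventHandler(aodH);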

Analysis Framework
- Basics: make use of the main ROOT event loop initiated when processing a TChain of files with a TSelector
- The framework uses the TSelector technology, which defines three analysis stages: initialization, event processing and termination; the current file change is notified
- The framework is steered by a manager class that gets called in the different selector stages...
- ...defining a common interface to access the input data files (MC, ESD, AOD) and to write AOD outputs...
- ...and the interface for the user analysis tasks that follow the selector analysis stages
- A train of such analysis tasks shares the main event loop AND the input data while still hot in memory (a minimal wiring sketch follows below)
[Diagram: a TChain of AliESDs.root files drives the AliAnalysisSelector (Begin/Notify/Process/Terminate); the AliAnalysisManager passes the current event (AliVEvent, plus MC kinematics and TrackRefs) through 0...n AliVEventHandlers to the AliAnalysisTask(s), which produce the output AOD]
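A minimal run-macro sketch, assuming the standard AliAnalysisManager and AliESDInputHandler classes; the user task and the AddMyTask() helper are hypothetical:

    // runAnalysis.C (sketch): wiring one task into the manager and running locally
    AliAnalysisManager *mgr = new AliAnalysisManager("MyAnalysis");
    mgr->SetInputEventHandler(new AliESDInputHandler());  // common input interface (ESD here)
    AddMyTask();                        // hypothetical: creates the task, connects containers
    if (mgr->InitAnalysis()) {
      TChain *chain = new TChain("esdTree");
      chain->Add("AliESDs.root");       // illustrative input
      mgr->StartAnalysis("local", chain);  // the manager runs the TSelector event loop
    }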

Analysis today and tomorrow
- Input: storing ~4 PB/month of data suitable for analysis
- Processing: ~20 PB/month (using 1/3 of the total GRID CPU resources)
- Growth: ~20% increase per year in computing capacity
- Resource migration towards analysis: slowly growing to ~50% by LS2
=> Assess the situation today by improving the monitoring tools & learn from today's mistakes...
Large improvements needed to analyze the higher rates after LS2:
- analysis job efficiency + time to solution
- cutting down the turnaround time in distributed analysis
- I/O improvements (data size, transaction time, throughput, ...)
- low-level parallelism (IPC, pipelining, vectorization)

Advantages and disadvantages
Advantages:
- Interfaces for uniform navigation
- Access to all resources
- Analysis framework: uniform analysis model and style
- Job control and traceability
- Sharing of the input data by many analyses in a train
- Increasing quota of organized analysis
- Flexible AOD format to accommodate any type of analysis
Disadvantages:
- Big event size
- High impact of user errors
- Insufficient control of bad usage patterns
- Uncoordinated analysis triggers inefficient use of resources

Cost of reading big data
Local reading of a 270 MB AOD file (Pb-Pb):
- spinning disk (50 MB/s, 13 ms access time): throughput 5.9 MB/s, CPU efficiency 86%
- SSD (266 MB/s, 0.2 ms access time): throughput 6.8 MB/s, CPU efficiency 94%
- the CPU time is governed by the ROOT de-serialization
Remote WAN reading (RTT 63 ms):
- load 5 (processes per disk server): 0.46 MB/s
- load 200: 0.08 MB/s
Latency and server load can kill the performance: caching and load balancing are needed

Summary of operations 2012/13
- 60% simulation
- 10% organized analysis
- 10% RAW data reconstruction
- 20% individual user analysis (465 users)

One week of site efficiency
[Plot: user jobs in the system, Tue through Mon; a 'weekday working hours' pattern is clearly visible; average efficiency = 84%]

One week of uncoordinated user analysis: efficiencies
[Plot: the 'carpet' of 180 users; average efficiency = 26%, contributing to the total at the 20% level]

Organized analysis: LEGO trains
- Allow running many analyses on the same data => I/O reduction
- The overall CPU efficiency behaves like that of the best component
- Better automatic testing of the components and control of memory or "bad practices"

Organized analysis
[Screenshot: the train web interface: handler configuration, wagon configuration, data configuration, testing and running status]

One week of alitrain efficiency
[Plot: alitrain (the LEGO trains) contributes to the overall efficiency at the 10% level]

How to address the I/O issue?
- Caching of the I/O queries and prefetching: expecting >90% CPU/wall time with prefetching enabled
- Reducing the analyzed event size... at the price of cutting down generality and multiplying the datasets
- Selectively disabling branches can alleviate the I/O cost by a factor of 5 (see the sketch below)
- Custom filtering and skimming procedures in organized analysis; two main use cases: rare signals requiring a small fraction of the input events, or analyses requesting a small fraction of the available event info
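Both knobs exist at the plain ROOT level; a minimal sketch (the tree and branch names are illustrative):

    // Sketch: TTreeCache prefetching and selective branch reading
    TFile *f = TFile::Open("AliAOD.root");
    TTree *t = (TTree*)f->Get("aodTree");
    t->SetCacheSize(30 * 1024 * 1024);  // 30 MB TTreeCache: reads become few large requests
    t->AddBranchToCache("*", kTRUE);    // let the cache hold all enabled branches
    t->SetBranchStatus("*", 0);         // disable everything...
    t->SetBranchStatus("tracks*", 1);   // ...then re-enable only what the analysis touches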

Improving the user analysis
- The efficiency and the I/O throughput of a train can now be monitored
- The LEGO framework is a big step forward
- Extensive tests are run before masterjob submission, with exclusion from the train on failure
- The common I/O improves the efficiency, especially when CPU-intensive wagons are present

Summary
- The simulation framework is stable; only minimal changes are expected
  - We foresee extensive use of Geant4 and Fluka in the near future
  - New generators are also expected
- The reconstruction framework is relatively stable
  - Main effort: reducing the memory consumption
  - Some changes in the PID part are still expected
- The analysis code is very volatile
  - Many new analysis tasks or updates => two analysis tags weekly
  - The quality of the analysis code is worse than that of the rest of AliRoot => constant monitoring