ALICE Operations short summary ALICE Offline week June 15, 2012.


2 Data taking in 2012 Stable operation, steady data taking. Accumulation of RAW since the beginning of the 2012 run. Total: 450 TB of physics data.

3 RAW data processing RAW data is subject to the CPass0/CPass1 schema (see the session on Thursday morning). Most of the RAW data this year has been reconstructed 'on demand'. Replication follows the standard schema, no issues. The largest RAW production was the LHC11h (PbPb) Pass2. Processing of 2012 data will start soon…

4 MC productions in 2012 So far: 62 production cycles, p+p and Pb+Pb, with various generators, signals and detectors. More realistic: use of the RAW OCDB and anchor runs for all productions. Presently running large-scale LHC11h productions with various signals for QM'2012; this will take another month. The MC productions are more complex, but still rather routine.

5 In general The central productions (RAW and MC) are stable and well-behaved, despite the (large) complexity. Fortunately, most of the above is automatic, or we would need an army of people to do it.

6 Grid power 2012: 25.8K running jobs on average. 61.6 million CPU hours ≈ 7,000 CPU years… in 6 months.
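A quick back-of-the-envelope check of that figure, taking one CPU year as 24 × 365 = 8,760 CPU hours:

\[
\frac{61.6\times10^{6}\ \text{CPU hours}}{24 \times 365\ \text{h per CPU year}} \approx 7.0\times10^{3}\ \text{CPU years}
\]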

7 Job distribution (plot: distribution of running jobs vs. number of users)

8 Non-production users On average, organized and chaotic analysis use 39% of the Grid.

9 And if we don't have production… The user jobs would fill the Grid; here production jobs (2,200) are only 8% of the total.

10 Chaotic and organized analysis July and August will be 'hot' months: QM'2012 is at the end of August. March: 10K jobs on average, 7.9 GB/s read from the SEs. Last month: 11K jobs (+10%), 9.8 GB/s from the SEs (+20%).

11 Jobs use not only CPU… Average read rate: 10 GB/s from 57 SEs. In one month that is ≈25 PB of data read (approximately all storage is read ~twice). ALICE total disk capacity = 15 PB. Remember the daily cyclic structure…
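The monthly volume follows directly from the sustained rate (taking a 30-day month), consistent with the ~25 PB quoted above:

\[
10\ \text{GB/s} \times 86\,400\ \text{s/day} \times 30\ \text{days} \approx 2.6\times10^{7}\ \text{GB} \approx 26\ \text{PB}
\]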

12 Efficiencies Efficiency definition: CPU/Wall. A simplistic, and as such very appealing, metric. By this measure we are not doing great: the 2012 average efficiency (all centres) is 60%.
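As an illustration only (the real numbers come from the Grid accounting, not from this snippet), the CPU/Wall metric and one sensible way to aggregate it per site could look like this, with hypothetical job records:

```python
def cpu_wall_efficiency(cpu_time_s: float, wall_time_s: float) -> float:
    """CPU/Wall efficiency as a fraction (can exceed 1.0 for multi-core jobs)."""
    if wall_time_s <= 0:
        raise ValueError("wall time must be positive")
    return cpu_time_s / wall_time_s

# Hypothetical job records: (CPU seconds, wall-clock seconds)
jobs = [(3600, 6000), (7200, 12000), (1800, 3000)]

# A site-level figure should weight jobs by wall time: total CPU over total wall,
# rather than the average of the per-job ratios.
site_eff = cpu_wall_efficiency(sum(c for c, _ in jobs), sum(w for _, w in jobs))
print(f"site efficiency: {site_eff:.0%}")   # -> site efficiency: 60%
```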

13 Efficiency (2) The CPU/Wall ratio depends on many factors: the I/O rate of the jobs, the swap rate, … And (IMHO) it is not necessarily the best metric to assess the productivity of the jobs or of the computing centres. What about the usage of the storage and the network? In the end, what counts is that the job gets done. That said, we must work on increasing the CPU/Wall ratio.

14 Factorization – production job efficiencies MC (aliprod), RAW (alidaq), QA and AOD filtering. Averages: aliprod 90%, alidaq 75%, overall 82%. (Plots: LHC11h Pass1, LHC11h Pass2.)

15 Enter the user analysis Note the daily cycle, remember the SE load structure… (Plot annotations: 24 hours without production, weekends, Ascension.)

16 Day/night effect Nighttime – production – 83% Daytime – production and analysis – 62%

17 Users and trains Clearly the chaotic user jobs require a lot of I/O and little CPU – mostly histogram filling. This simple fact has been known for a long time. A (partial) solution is to analyze a smaller set of input data (ESD►AOD) and to use organized analysis – the train. See Andrei's presentation from the analysis session, and the subsequent PWG talks – quite happy with the system's performance.

18 Users and trains (2) The chaotic analysis will not go away, but it will become less prevalent: tuning of cuts, tests of tasks before joining the trains. The smaller input set and the trains also help to use fewer resources: much more analysis for the same CPU and I/O (independent of efficiency).

19 What can we do Establish realistic expectations wrt I/O. Lego train tests: measure the processing rate. E.g. CF_PbPb (4 wagons, 1 CPU-intensive): Train #120 running on AOD095, local efficiency 99.52%, AOD event size 0.66 MB/ev, processing rate 2.69 ev/sec (≈372 ms/ev). The train can "burn" 2.69 × 0.66 = 1.78 MB/sec. This was a good example… The average is ~100 ms/ev, equivalent to 6.5 MB/sec. Best student found: DQ_PbPb at 1723 ms/ev can "live" with 380 kB/sec. This number is really relevant: it is NOT the number of wagons that matters, but the rate at which they consume data. This is the number we have to improve against and measure, both in local tests and on the Grid. We have to measure the instantaneous transfer rate per site, to correlate with other conditions. On ESD it is 3-4 times worse: same processing rate, but a bigger event size… A train processing at < 100 ms/ev will have < 50% efficiency on the Grid, depending on where it is running and in which conditions. Borrowed without permission from A. Gheata.
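The bandwidth figures above follow from the event size and the per-event processing time alone. A small sketch reproducing them (the 372 ms/ev value is derived from the quoted 2.69 ev/s; the helper function is purely illustrative, not part of the train machinery):

```python
def train_read_rate_mb_s(event_size_mb: float, ms_per_event: float) -> float:
    """Data rate (MB/s) a train 'burns' while processing events at full speed."""
    events_per_second = 1000.0 / ms_per_event
    return events_per_second * event_size_mb

# Numbers quoted on the slide, with an AOD event size of 0.66 MB/ev:
print(train_read_rate_mb_s(0.66, 372))    # CF_PbPb (2.69 ev/s): ~1.78 MB/s
print(train_read_rate_mb_s(0.66, 100))    # average train:       ~6.6 MB/s
print(train_read_rate_mb_s(0.66, 1723))   # DQ_PbPb:             ~0.38 MB/s (380 kB/s)
```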

20 WN to storage throughput Could be estimated from the 'standard' centre fabric: type of WNs (number of cores, NIC), switches (ports/throughput), SE types, … but the picture would be incomplete and too generic, thus we will not do it.

21 WN to storage throughput (2) Better to measure the real thing: a set of benchmarking jobs with a known input set, measuring the time to complete, run at all centres during normal load. This gives a 'HEP I/O' rating of the centre's WNs. We will do that very soon. Using the benchmark, every train can easily be rated for expected efficiency, and the centres could use this measurement to optimize their fabric, if practical.
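One possible shape for such a benchmark job is sketched below: read a known input set and report an effective 'HEP I/O' rate for the WN. The file names and the simple timing loop are illustrative assumptions, not the actual ALICE benchmark (which would stage a standard data set from the local SE):

```python
import os
import time

def io_rating_mb_s(input_files):
    """Read a fixed, known input set and return the effective read rate in MB/s."""
    total_bytes = 0
    start = time.monotonic()
    for path in input_files:
        with open(path, "rb") as f:
            while chunk := f.read(1 << 20):   # 1 MiB reads, as an analysis job might issue
                total_bytes += len(chunk)
    elapsed = time.monotonic() - start
    return (total_bytes / 1e6) / elapsed

# Illustrative file names; the real benchmark would use a standard AOD set on the local SE.
benchmark_set = ["/data/benchmark/AliAOD_001.root", "/data/benchmark/AliAOD_002.root"]
if all(os.path.exists(p) for p in benchmark_set):
    print(f"WN 'HEP I/O' rating: {io_rating_mb_s(benchmark_set):.1f} MB/s")
```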

22 More… SE monitoring and control… see Harsh's presentation: clear correlation between efficiency and server load. Code optimization: the memory footprint matters – use of swap is also an efficiency killer.
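A cheap way to keep an eye on the memory footprint from inside a job on a Linux WN (a sketch, not part of the AliEn job wrapper): read VmRSS and VmSwap from /proc — any non-zero swap usage is a warning sign for efficiency.

```python
def memory_footprint_kb():
    """Return (resident, swapped) memory of the current process in kB (Linux only)."""
    rss_kb, swap_kb = 0, 0
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                rss_kb = int(line.split()[1])
            elif line.startswith("VmSwap:"):
                swap_kb = int(line.split()[1])
    return rss_kb, swap_kb

rss, swap = memory_footprint_kb()
print(f"RSS = {rss} kB, swapped out = {swap} kB")
if swap > 0:
    print("Warning: the job is hitting swap -- expect the CPU/Wall efficiency to drop")
```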

23 And more… Execute trains in different environments and compare the results. GSI has kindly volunteered to help; the programme of tests is being discussed. The ultimate goal is to bring the efficiency of organized analysis to the level of production jobs. The PWGs are relentlessly pushing their members to migrate to organized analysis; by mid-2013 we should complete this task.

24 Conclusions 2012 is so far a standard year for data taking, production and analysis. Not mentioned in the talk (no need to discuss a working system): the stability of the Grid has been outstanding, thanks to the mature site support and the AliEn and LCG software, and thus it fulfills its function to deliver Offline computational resources to the collaboration. Our current programme is to deliver and support the next version of AliEn, improve the SE operation in collaboration with the xrootd development team, and improve the support for analysis and its efficiency.