
1 ALICE Operations short summary ALICE Offline week June 15, 2012

2 Data taking in 2012. Stable operation, steady data taking. Accumulation of RAW since the beginning of the 2012 run: 450 TB of physics data in total.

3 RAW data processing. RAW data is subject to the CPass0/CPass1 schema (see the session on Thursday morning). Most of the RAW data this year has been reconstructed 'on demand'. Replication follows the standard schema, no issues. The largest RAW production was the LHC11h (PbPb) Pass2. Processing of 2012 data will start soon…

4 MC productions in 2012. So far, 62 production cycles: p+p and Pb+Pb, with various generators, signals and detector configurations. More realistic conditions: use of the RAW OCDB and anchor runs for all productions. Presently running large-scale LHC11h productions with various signals for QM'2012; this will take another month. MC productions are more complex, but still rather routine.

5 In general. The central productions (RAW and MC) are stable and well-behaved, despite the (large) complexity. Fortunately, most of the above is automatic, or we would need an army of people to do it.

6 Grid power in 2012: 25.8K jobs on average, 61.6 million CPU hours (roughly 7,000 CPU years)…in 6 months.
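As a back-of-the-envelope check of that headline figure, converting the quoted CPU hours into CPU years is plain arithmetic (a minimal sketch in Python, using only the numbers from this slide):

```python
# Convert the aggregate CPU time quoted above into more tangible units.
cpu_hours = 61.6e6               # CPU hours delivered in the first ~6 months of 2012
hours_per_year = 24 * 365.25

cpu_years = cpu_hours / hours_per_year
print(f"~{cpu_years:,.0f} CPU years")   # roughly 7,000 CPU years
```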

7 Job distribution (chart; axis: No. of users).

8 Non-production users. On average, organized and chaotic analysis use 39% of the Grid.

9 And if we didn't have production, the user jobs would fill the Grid. Production jobs (2,200) are 8% of the total.

10 Chaotic and organized analysis. July and August will be 'hot' months: QM'2012 is at the end of August. March: 10K jobs on average, 7.9 GB/sec read from the SEs. Last month: 11K jobs (+10%), 9.8 GB/sec from the SEs (+20%).

11 Jobs use not only CPU… Average read rate: 10 GB/sec from 57 SEs. In one month that is ~25 PB of data read (approximately all storage read ~twice, as the ALICE total disk capacity is 15 PB). Remember the daily cyclic structure…
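The monthly read volume follows directly from the average read rate; a minimal check using the figures quoted on this slide:

```python
# Turn the sustained read rate into a monthly volume and compare with the disk capacity.
read_rate_gb_per_s = 10.0        # average read rate from the 57 SEs
seconds_per_month = 30 * 24 * 3600
total_disk_pb = 15.0             # ALICE total disk capacity

read_per_month_pb = read_rate_gb_per_s * seconds_per_month / 1e6   # GB -> PB
print(f"Data read per month: ~{read_per_month_pb:.0f} PB "
      f"(~{read_per_month_pb / total_disk_pb:.1f}x the total disk capacity)")
```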

12 Efficiencies. Efficiency definition: CPU/Wall. Simplistic, and as such a very appealing metric. By this measure, we are not doing great: the 2012 (all centres) average efficiency is 60%.
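The metric itself is trivial to derive from job accounting data; a minimal illustration (the per-job CPU and wall times below are invented, just to show the calculation):

```python
# CPU/Wall efficiency: total CPU time consumed divided by total wall-clock time occupied.
# The (cpu_hours, wall_hours) pairs are hypothetical example values.
jobs = [(5.4, 7.2), (11.0, 20.5), (2.1, 4.8)]

total_cpu = sum(cpu for cpu, _ in jobs)
total_wall = sum(wall for _, wall in jobs)
print(f"CPU/Wall efficiency: {100 * total_cpu / total_wall:.0f}%")
```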

13 Efficiency (2). The CPU/Wall ratio depends on many factors: I/O rate of the jobs, swap rate, … And (IMHO) it is not necessarily the best metric to assess the productivity of the jobs or of the computing centres. What about the usage of the storage and the network? In the end, what counts is that the job gets done. That said, we must work on increasing the CPU/Wall ratio.

14 Factorization: production job efficiencies. MC (aliprod), RAW (alidaq), QA and AOD filtering. Averages: aliprod 90%, alidaq 75%, overall 82%. (Plots: LHC11h Pass1, LHC11h Pass2.)

15 Enter the user analysis. Note the daily cycle, remember the SE load structure… (Chart annotations: 24 hours without production, weekends, Ascension.)

16 Day/night effect. Nighttime (production only): 83%. Daytime (production and analysis): 62%.

17 Users and trains. Clearly the chaotic user jobs require a lot of I/O and little CPU (mostly histogram filling). This simple fact has been known for a long time. A (partial) solution is to analyze a smaller set of input data (ESD → AOD) and to use organized analysis, the train. See Andrei's presentation from the analysis session, and the subsequent PWG talks: quite happy with the system's performance.

18 Users and trains (2). The chaotic analysis will not go away, but it will become less prominent: tuning of cuts, tests of tasks before joining the trains. The smaller input set and the trains also help to use fewer resources: much more analysis for the same CPU and I/O (independent of efficiency).

19 What can we do. Establish realistic expectations wrt I/O. Lego train tests: measure the processing rate. E.g. CF_PbPb (4 wagons, 1 CPU intensive), train #120 running on AOD095: local efficiency 99.52%, AOD event size 0.66 MB/ev, processing rate 370.95 ms/ev (2.69 ev/sec). The train can "burn" 2.69 × 0.66 = 1.78 MB/sec. This was a good example… the average is ~100 ms/ev, equivalent to 6.5 MB/sec. Best student found: DQ_PbPb at 1723 ms/ev, which can "live" with 380 kBytes/sec. This number is really relevant: it is NOT the number of wagons that matters, but the rate at which they consume data. This is the number we have to improve and measure, both in local tests and on the Grid. We have to measure the instantaneous transfer rate per site, to correlate with other conditions. On ESD it is 3-4 times worse: same processing rate, but bigger event size. A train processing < 100 ms/ev will have < 50% efficiency on the Grid, depending on where it is running and in which conditions. (Borrowed without permission from A. Gheata.)
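The "burn rate" argument above is just the event size divided by the processing time per event; a short re-derivation of the quoted numbers (event size and ms/ev taken from this slide):

```python
# Data consumption rate of a train: events per second times event size.
def burn_rate_mb_per_s(ms_per_event: float, event_size_mb: float) -> float:
    return (1000.0 / ms_per_event) * event_size_mb

aod_event_size = 0.66  # MB/event on AOD095, as quoted above

for name, ms_per_ev in [("CF_PbPb train #120", 370.95),
                        ("typical train (~100 ms/ev)", 100.0),
                        ("DQ_PbPb", 1723.0)]:
    print(f"{name}: {burn_rate_mb_per_s(ms_per_ev, aod_event_size):.2f} MB/s")
# -> ~1.78, ~6.6 and ~0.38 MB/s: the faster a train runs per event,
#    the more storage bandwidth it needs to stay CPU-bound.
```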

20 WN to storage throughput. Could be estimated from the 'standard' centre fabric: type of WNs (number of cores, NIC), switches (ports/throughput), SE types… but the picture would be incomplete and too generic, thus we will not do it.

21 WN to storage throughput (2). Better to measure the real thing: a set of benchmarking jobs with a known input set, measuring the time to complete, run at all centres during normal load. This gives a 'HEP I/O' rating of the centre WNs. We will do that very soon. Using the benchmark, every train can easily be rated for expected efficiency, and the centres could use this measurement to optimize the fabric, if practical.
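A minimal sketch of what such a benchmark job might look like, assuming a fixed, well-known input set staged at each site (the file names below are placeholders, not the actual benchmark definition, and a real job would read via the local SE/xrootd rather than plain local files):

```python
import time

# Hypothetical fixed input set; the real benchmark would use the same well-known
# files at every centre so that the ratings are directly comparable.
INPUT_FILES = ["benchmark/AliAOD_part1.root", "benchmark/AliAOD_part2.root"]

def hep_io_rating_mb_per_s(paths) -> float:
    """Read the known input set once and report the effective WN-to-storage throughput."""
    total_bytes = 0
    start = time.time()
    for path in paths:
        with open(path, "rb") as f:
            while chunk := f.read(1024 * 1024):   # read in 1 MB chunks
                total_bytes += len(chunk)
    elapsed = time.time() - start
    return total_bytes / 1e6 / elapsed

if __name__ == "__main__":
    print(f"'HEP I/O' rating of this WN: {hep_io_rating_mb_per_s(INPUT_FILES):.1f} MB/s")
```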

22 More… SE monitoring and control: see Harsh's presentation. There is a clear correlation between efficiency and server load. Code optimization: the memory footprint matters, as use of swap is also an efficiency killer.

23 And more… Execute trains in different environments and compare the results; GSI has kindly volunteered to help, and a programme of tests is being discussed. The ultimate goal is to bring the efficiency of organized analysis to the level of production jobs. The PWGs are relentlessly pushing their members to migrate to organized analysis; by mid-2013 we should complete this task.

24 Conclusions. 2012 is so far a standard year for data taking, production and analysis. Not mentioned in the talk (no need to discuss a working system): the stability of the Grid has been outstanding, thanks to the mature site support and the AliEn and LCG software, and thus it fulfills its function of delivering Offline computational resources to the collaboration. Our current programme is to deliver and support the next version of AliEn, improve the SE operation in collaboration with the xrootd development team, and improve the support for analysis and its efficiency.

