1 ALICE Computing : 2012 operation & future plans
Rencontre LCG-France, SUBATECH Nantes 18-20 September 2012

2 A quick glimpse of 2012
Standard data-taking year for ALICE
p-p: emphasis on rare triggers and high-pT (calorimeter) physics
Pilot p-A run (a few million events); long p-A run in February 2013 (still counts as '2012')
Bulk of analysis on the 2011 Pb-Pb sample – the largest single-period data set

3 2012 – RAW
Standard treatment – two copies of the RAW data
One copy at T0, one replica at the T1s, distributed proportionally to their fraction of the mass-storage capacity (sketched below)
~1 PB of RAW so far
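A minimal sketch of what "proportional to the mass-storage fraction" could mean in practice; the capacities and the weighted random choice are assumptions for illustration, not the actual AliEn placement logic:

```python
# Hypothetical sketch: pick the T1 for the second RAW copy with probability
# proportional to its share of the total mass-storage capacity.
# The capacities (in PB) are invented for illustration.
import random

t1_mss_capacity = {"CCIN2P3": 1.2, "CNAF": 1.0, "FZK": 1.5, "RAL": 0.8, "NDGF": 0.5}

def pick_replica_site(capacities):
    sites = list(capacities)
    weights = [capacities[s] for s in sites]     # proportional to the MSS fraction
    return random.choices(sites, weights=weights, k=1)[0]

# One copy of each RAW file stays at T0; the T1 replica follows the capacity shares.
print(pick_replica_site(t1_mss_capacity))
```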

4 2012 – job profile
Average of 28K jobs running in parallel
Increases as new capacities become available

5 2012 – site contribution
Wall time split roughly 50/50 between T0/T1s and the T2s

6 2012 – French sites
Wall time split roughly 25/75 between the T1 and the T2s

7 Other important parameters
Storage (always insufficient…): 2 PB of disk, 45% at the T1; the balance is not as even as for CPU
Network: extremely well provisioned; T2 connectivity will further improve with LHCONE

8 More details on workload
Organized activities (including trains) vs. chaotic user analysis

9 Resource use – tasks
Last year's goal: increase the fraction of organized analysis
Tool: analysis trains
A long-term goal – it takes a substantial amount of coordination and user education
The resource-use distribution:
10% RAW reconstruction (constant)
16% train analysis (5% at the beginning of the year)
23% chaotic analysis (36% at the beginning of 2012)
51% Monte Carlo productions (49% at the beginning of 2012)

10 The Analysis Trains
Pooling together many user analysis tasks (wagons) into a single set of Grid jobs (the train) – see the sketch below
Managed through a web interface by a Physics Working Group conductor (ALICE has 8 PWGs)
Provides a configuration and test platform (functionality, memory, efficiency) and a submission/monitoring interface
Speed: a few days to go through a complete period (PBs of data!)
[Diagram: MonALISA web interface, LPM, AliROOT Analysis Framework, AliEn Grid jobs]
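A conceptual sketch of the train idea, pooling several wagons into one pass over the data; the Wagon class, the wagon names and the event dictionaries are invented for illustration and are not the LPM/AliROOT implementation:

```python
# Conceptual sketch of a train: several wagons share one pass over the events.
class Wagon:
    def __init__(self, name, process):
        self.name = name
        self.process = process          # callable executed once per event
        self.output = []

def run_train(wagons, events):
    """The train reads each event once and hands it to every wagon in turn."""
    for event in events:                # data are read a single time
        for wagon in wagons:            # all analyses reuse the same in-memory event
            wagon.output.append(wagon.process(event))
    return {w.name: w.output for w in wagons}

wagons = [Wagon("pt_spectrum", lambda ev: ev["pt"]),
          Wagon("multiplicity", lambda ev: ev["ntracks"])]
events = [{"pt": 1.2, "ntracks": 40}, {"pt": 3.4, "ntracks": 75}]
print(run_train(wagons, events))
```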

11 Data access in analysis
Chaotic and, to some extent, organized analysis is I/O bound (efficient use of disk/network resources)
Average 8 GB/s, peaks of 20 GB/s
Total data read since 1 April: 120 PB
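The quoted average rate is consistent with the total volume; a rough cross-check, assuming roughly 170 days between 1 April and the time of this talk and 1 PB = 10^6 GB:

```python
# Rough cross-check of the quoted average read rate.
days = 170                          # ~1 April to mid-September 2012 (assumption)
seconds = days * 24 * 3600          # ~1.47e7 s
total_read_gb = 120e6               # 120 PB expressed in GB
print(total_read_gb / seconds)      # ~8 GB/s, matching the quoted average
```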

12 CPU efficiency
Stable (but 'low'); some improvement with time as the trains' share grows over chaotic analysis

13 CPU efficiency for organized tasks
MC: high; RAW: ~OK; trains: still need improvement

14 Analysis efficiency
Processing phases per event:
Reading the event data from disk – sequential
De-serializing the event object hierarchy – sequential
Processing the event – parallelizable
Cleaning the event structures – sequential
Writing the output – sequential but parallelizable
Merging the outputs – sequential but parallelizable
[Diagram: per-event timeline t_read, t_ds, t_proc, t_cl, t_write repeated over Events #0…#n in several jobs, followed by t_merge]
A.Gheata – improving analysis efficiency

15 Analysis efficiency (2)
The efficiency of the analysis job (numeric example below):
job_eff = (t_ds + t_proc + t_cl) / t_total
analysis_eff = t_proc / t_total
The time per event for the different phases depends on many factors:
t_read ~ IOPS * event_size / read_throughput – to be minimized: minimize the event size, keep the read throughput under control
t_ds + t_cl ~ event_size * n_branches – to be minimized: minimize event size and complexity
t_proc = Σ_wagons t_i – to be maximized: maximize the number of wagons and the useful processing
t_write = output_size / write_throughput – to be minimized
A.Gheata – improving analysis efficiency
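A tiny numeric illustration of the two efficiency definitions above; the per-phase times are invented:

```python
# Illustrative per-event phase times in seconds (invented numbers).
t_read, t_ds, t_proc, t_cl, t_write = 0.40, 0.10, 0.30, 0.05, 0.15
t_total = t_read + t_ds + t_proc + t_cl + t_write

job_eff = (t_ds + t_proc + t_cl) / t_total    # CPU-side fraction of the job
analysis_eff = t_proc / t_total               # fraction spent in the wagons' own code
print(f"job_eff={job_eff:.2f}, analysis_eff={analysis_eff:.2f}")   # 0.45 and 0.30
```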

16 Grid upgrades
New AliEn version (v2-20) – ready for deployment
Lighter catalogue structure: presently M LFNs, 2.5x that in PFNs (replicas), growing at 10 million new entries per week
Extreme job brokering:
Jobs are no longer pre-split with a pre-determined input data set
Potentially one job could process all input data (of the set) at a given site
The data-locality principle remains (for now)
The site/central services upgrade needs some downtime – after the end of data taking in February 2013

17 File brokering
[Diagram: Sites A, B and C hold Files 1-5.
Current schema: submit 4 jobs with pre-determined input files (Files 1-5).
Broker per file: submit 3 empty subjobs; when a job starts at a site, it analyzes as much as possible of the locally available, still-unprocessed files (e.g. Files 1, 2, 4, 5 vs. File 3); if nothing is left, it just exits.]
From P.Saiz – AliEn development
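A simplified sketch of the two submission schemes from the diagram; the file-to-site mapping follows the slide, while the function names and job counts are illustrative, not the AliEn v2-20 code:

```python
# File-to-site placement copied from the diagram above.
site_files = {"Site A": ["File 1", "File 2"],
              "Site B": ["File 4", "File 5"],
              "Site C": ["File 3"]}

def presplit_jobs(all_files, files_per_job=2):
    """Current schema: split the input data set up front; each job gets a fixed,
    pre-determined list of files before anything starts running."""
    return [all_files[i:i + files_per_job] for i in range(0, len(all_files), files_per_job)]

def broker_per_file(site_files, started_at):
    """New schema: submit empty subjobs; a subjob that starts at a site claims all
    still-unprocessed files available there, or simply exits if nothing is left."""
    remaining = {f for files in site_files.values() for f in files}
    assignments = []
    for site in started_at:                       # order in which the subjobs start
        local = [f for f in site_files[site] if f in remaining]
        if not local:
            assignments.append((site, "exit"))    # nothing left locally: just exit
            continue
        assignments.append((site, local))         # analyze as much as possible
        remaining -= set(local)
    return assignments

all_files = [f for files in site_files.values() for f in files]
print(presplit_jobs(all_files))                   # inputs fixed at submission time
print(broker_per_file(site_files, ["Site A", "Site B", "Site C"]))
```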

18 Short development roadmap
Data management: popularity service; SE layout (EOS-like); GUID-less catalogue
Job processing: job merging; error classification; multicore and multi-agent; remote access optimization; combining AF/classical Grid CE – interactive Grid

19 General remarks on the future – Long Shutdown 1
No revolution is (ever) planned, however…
All LHC experiments have submitted LoIs for the LS3 (HL-LHC) upgrades in 2022
For computing this means requirements massively larger than today's (data rates and volumes, CPU needs) – 10-30x of today; the factors are not yet finalized
Massive online DAQ and HLT event-filtering farms, 2x the size of what a T1 is today
No clear idea yet how this will be achieved – technologically and financially
Moore's and Kryder's laws will not 'cover' the needs

20 General remarks on the future
The present Grid profited from ~10 years of planning and development (on par with the detectors)
And it delivered from day 1, and continues to deliver to this day
The future planning and development of Grid/Cloud/<insert name here> should start now – the years of experience will help, but will not be enough
Parallel programming cannot be done by physicists… and there are other hurdles too

21 General remarks on the future
A big improvement is expected from the frameworks and the code
Undoubtedly a common effort and professional help will be necessary
Parallelism is a no-brainer, given the technological trends
Big parts of the code must be re-engineered and re-written
Every experiment has a panel charged with the design of the 'new' software
Crystal balls have been ordered

22 Summary – back to today
2012 is so far a standard data taking/processing/analysis year for ALICE – much excitement is expected in February with the p-A data
The operation is smooth and is helped a lot by the mature Grid around the world
The French T1/T2s are part of this structure, with remarkably stable performance and well-balanced components – CPU, storage, networks
… and of course solid expert support at all levels – a big **thank you** for this!

23 Summary – cont.
The (near-)future developments are focused on analysis tasks and tools
Emphasis on data containers and process synchronization
Whole node is a promising path and will naturally help the multicore development
Progressive introduction of new features and improvements
The Grid must run continuously, also during the LS1 shutdown
Resources (disk) are scarce: more efficient use – fewer replicas, WAN access

