Performance optimizations for distributed analysis in ALICE
Latchezar Betev, Andrei Gheata, Mihaela Gheata, Costin Grigoras, Peter Hristov
for the ALICE Off-line Collaboration
ACAT 2013, Beijing
Analysis today and tomorrow
Input: storing ~4 PBytes/month of data suitable for analysis
Processing: ~20 PBytes/month (using 1/3 of the total Grid CPU resources)
Growth: ~20% increase per year in computing capacity
Resource migration towards analysis: slowly growing to ~50% by LS2
Assess the situation today by improving monitoring tools and learn from today's mistakes...
Large improvements needed to analyze the higher rates after LS2:
- Analysis job efficiency and time to solution
- Cutting down turnaround time in distributed analysis
- I/O improvements (data size, transaction time, throughput, ...)
- Low-level parallelism (IPC, pipelining, vectorization)
Advantages
- Interfaces for uniform navigation
- Access to all resources
- Analysis framework: uniform analysis model and style
- Job control and traceability
- Sharing input data among many analyses in a train
- Increasing quota of organized analysis
- Flexible AOD format to accommodate any type of analysis
Disadvantages
- Big event size
- High impact of user errors
- Insufficient control of bad usage patterns
- Chaotic analysis triggers inefficient use of resources
Analysis phases per event
Per-event processing pipeline (Event #0, Event #1, ..., then joining the jobs), with the approximate cost of each phase:
- t_read: reading the event data from disk ~ IOPS * event_size / read_throughput
- t_ds: de-serializing the event object hierarchy (and cleaning up after use, t_cl) ~ event_size * n_branches
- t_proc: processing the event, i.e. the sum of the processing times of all connected analysis modules (task1, task2, task3, ...)
- t_write: writing the output ~ output_size / write_throughput
- t_merge: merging the outputs when the jobs are joined ~ output_size * n_merging_processes
(a per-event time budget is sketched below)
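Putting these phases together, a simple time budget (my own summary of the terms above, not a formula from the slides; t_merge is paid once per job rather than once per event):

```latex
t_{\mathrm{event}} \approx t_{\mathrm{read}} + t_{\mathrm{ds}} + t_{\mathrm{proc}} + t_{\mathrm{cl}} + t_{\mathrm{write}},
\qquad
t_{\mathrm{job}} \approx N_{\mathrm{events}}\, t_{\mathrm{event}} + t_{\mathrm{merge}},
\qquad
\mathrm{CPU\ efficiency} \approx \frac{t_{\mathrm{proc}}}{t_{\mathrm{event}}}
```

Everything except t_proc is overhead, which is why the following slides focus on the read, de-serialization and merge terms.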
Cost of reading big data
Local reading of a 270 MB AOD file (Pb-Pb):
- Spinning disk (50 MB/s, 13 ms access time): throughput 5.9 MB/s, CPU efficiency 86%
- SSD (266 MB/s, 0.2 ms access time): throughput 6.8 MB/s, CPU efficiency 94%
- CPU time is governed by ROOT deserialization
Remote WAN reading (RTT 63 ms):
- Load 5 (processes per disk server): 0.46 MB/s
- Load 200: 0.08 MB/s
Latency and server load can kill performance; caching and load balancing are needed (see the simple model below).
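A rough model (my own illustration, not from the slides) of why latency rather than bandwidth dominates remote reading: if every read request fetches a basket of size s over a link with round-trip time RTT and bandwidth B, the effective throughput is

```latex
T_{\mathrm{eff}} \approx \frac{s}{\mathrm{RTT} + s/B}
```

For small baskets and RTTs of tens of milliseconds (made worse by server load, which effectively stretches the response time), T_eff collapses to a fraction of a MB/s regardless of B, which is what motivates read-ahead caching, prefetching and load balancing.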
Summary of operations 2012/13
- 60% simulation
- 10% organized analysis
- 10% RAW data reconstruction
- 20% individual user analysis (465 users)
One week of site efficiency
Figure: user jobs in the system over one week (Tue through Mon); a 'weekday working hours' pattern is clearly visible. Average efficiency = 84%.
One week of user analysis
Figure: the 'carpet' of individual user analysis jobs, 180 users, average efficiency = 26%; this weighs down the overall efficiency level.
Organized analysis – lego trains
- Allows running many analyses on the same data → I/O reduction
- The overall CPU efficiency behaves like that of the best component
- Better automatic testing of components and control of memory use and "bad practices"
(a sketch of the shared-input idea follows)
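A minimal sketch (not the actual AliRoot train code; the names Wagon and run_train are illustrative) of the idea behind a train: the event is read and de-serialized once, then handed to every registered wagon, so the I/O cost is paid once for N analyses.

```cpp
#include <functional>
#include <vector>

// Placeholder for a de-serialized AOD event.
struct Event { /* event content */ };

using Wagon = std::function<void(const Event&)>;   // one analysis module ("wagon")

// Run all wagons on the same input: the read + de-serialization cost is paid
// once per event instead of once per analysis, as it would be for separate jobs.
void run_train(const std::vector<Wagon>& wagons, long n_events,
               const std::function<Event(long)>& read_event) {
  for (long i = 0; i < n_events; ++i) {
    Event ev = read_event(i);          // t_read + t_ds paid once
    for (const auto& wagon : wagons)   // t_proc is the sum over all wagons
      wagon(ev);
  }
}
```

In the real trains the wagons are analysis tasks registered with the analysis manager; the point here is only that t_read and t_ds are shared across all of them.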
Organized analysis – lego trains
Screenshot of the LEGO train web interface: handler configuration, wagon configuration, data configuration, global configuration, testing and running status.
One week alitrain efficiency
alitrain = the LEGO trains; their efficiency contributes to raising the overall level.
Distributed analysis & caching
Figure: job efficiency with caching on (66% average) versus without (28% average).
Efficiency gains
- Despite the high complexity, the LEGO train efficiency is already high
- Just moving half of the individual user jobs to LEGO would result in an immediate ~7% increase in overall efficiency! (a rough estimate follows)
- Transition from ESD to AOD
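As a back-of-the-envelope cross-check (my own, assuming the trains run at roughly 90% efficiency): individual user analysis takes about 20% of the resources at ~26% efficiency, so moving half of it to trains changes the overall efficiency by roughly

```latex
\Delta\varepsilon \approx 0.10 \times (0.90 - 0.26) \approx 6{-}7\%
```

which is consistent with the ~7% quoted above.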
How to address the I/O issue ?
- Caching I/O queries and prefetching: expecting a larger CPU/wall ratio with prefetching enabled
- Reducing the analyzed event size... at the price of cutting down generality and multiplying the datasets
- Selectively disabling branches can reduce the I/O cost by a factor of 5
- Custom filtering and skimming procedures in organized analysis, with two main use cases: rare signals requiring a small fraction of the input events, or analyses requesting a small fraction of the available event info
(a ROOT sketch of caching and branch selection follows)
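A minimal ROOT sketch of the two cheapest levers mentioned above: enabling the TTree read cache (with optional asynchronous prefetching) and reading only the branches an analysis actually needs. The file, tree and branch names are illustrative, and whether asynchronous prefetching pays off depends on the storage backend.

```cpp
#include "TEnv.h"
#include "TFile.h"
#include "TTree.h"

void setup_io(const char* fname = "AliAOD.root") {   // illustrative file name
  gEnv->SetValue("TFile.AsyncPrefetching", 1);       // ask ROOT to prefetch baskets ahead of the reads

  TFile* f = TFile::Open(fname);
  TTree* tree = static_cast<TTree*>(f->Get("aodTree"));  // tree name is an assumption

  tree->SetCacheSize(30 * 1024 * 1024);   // 30 MB TTreeCache: fewer, larger read requests
  tree->AddBranchToCache("*", kTRUE);     // cache all enabled branches (and sub-branches)

  tree->SetBranchStatus("*", 0);          // disable everything...
  tree->SetBranchStatus("tracks*", 1);    // ...then re-enable only what the analysis touches
  tree->SetBranchStatus("vertices*", 1);  // (branch names are illustrative)
}
```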
Instrumenting analysis performance
Low-level "sensors" in the analysis manager, per analysis session:
- Measuring throughput and efficiency per file
- Logging data sources and worker node (WN) identity
The information is read, processed and summarized by the monitoring system in order to:
- Detect pathologies in the data flow
- Study data-type effects
- Investigate flaws in trains
(a minimal sensor sketch follows)
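A minimal sketch (not the actual framework code; the struct and function names are illustrative) of such a per-file sensor using standard ROOT utilities: TStopwatch for CPU and real time, TFile::GetBytesRead() for the volume actually read.

```cpp
#include "TFile.h"
#include "TStopwatch.h"
#include "TString.h"
#include "TSystem.h"

struct FileSensor {             // illustrative per-file record
  TString  url;                 // data source
  TString  host;                // worker-node identity
  Double_t realTime = 0, cpuTime = 0;
  Long64_t bytesRead = 0;
};

// Wrap the processing of one input file and report throughput + efficiency.
FileSensor process_file_with_sensor(TFile* file, void (*process)(TFile*)) {
  FileSensor s;
  s.url  = file->GetName();
  s.host = gSystem->HostName();

  TStopwatch timer;
  timer.Start();
  process(file);                // run the analysis over this file
  timer.Stop();

  s.realTime  = timer.RealTime();
  s.cpuTime   = timer.CpuTime();
  s.bytesRead = file->GetBytesRead();
  // throughput = bytesRead / realTime, efficiency = cpuTime / realTime
  return s;
}
```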
Scan of GRID analysis: jobs with best and worst efficiencies
Figure: the worst-efficiency jobs read most of their files remotely, while the best-efficiency jobs read most of their files locally.
Improving the user analysis
- The efficiency and I/O throughput of a train can now be monitored
- The LEGO framework is a big step forward
- Extensive tests are run before masterjob submission, with exclusion from the train on failure
- Common I/O improves the efficiency, especially when CPU-intensive wagons are present
Parallelism – will we need this in future analysis?
- It makes sense in analysis if we can parallelize the I/O
- Low-level parallelism (increasing the average IPC) can gain throughput by:
  - simplifying the data structures
  - improving locality during processing
  - achieving good vectorization gains in compute-intensive loops
(a data-layout example follows)
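A small illustration (my own example, not ALICE code) of what "simplifying data structures and improving locality" buys: storing track properties as separate arrays (structure-of-arrays) instead of an array of objects lets a hot loop stream contiguous memory and auto-vectorize.

```cpp
#include <cmath>
#include <vector>

// Array-of-structs: each iteration drags in a whole Track, poor locality for a pt-only loop.
struct Track { float px, py, pz, charge; /* ...many more members... */ };

float sum_pt_aos(const std::vector<Track>& tracks) {
  float sum = 0.f;
  for (const auto& t : tracks) sum += std::sqrt(t.px * t.px + t.py * t.py);
  return sum;
}

// Struct-of-arrays: px and py are contiguous, so the loop is cache-friendly
// and easy for the compiler to vectorize.
struct Tracks {
  std::vector<float> px, py, pz, charge;
};

float sum_pt_soa(const Tracks& t) {
  float sum = 0.f;
  for (size_t i = 0; i < t.px.size(); ++i)
    sum += std::sqrt(t.px[i] * t.px[i] + t.py[i] * t.py[i]);
  return sum;
}
```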
Possible thread parallelism
Diagram: files (file1 ... file6) on physical disks (Device 1-3) are read and de-serialized by ROOT readers (Reader 1-3) into an event buffer acting as a work queue; analysis-module workers (AM) consume events from the queue and fill per-worker ROOT histogram buffers, which are finally merged into the output. (A minimal sketch of such a reader/worker pipeline follows.)
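A minimal producer/consumer sketch of that layout (my own illustration, with assumptions: std::thread workers, a mutex-protected queue standing in for the event buffer, a double standing in for a de-serialized event, and partial sums standing in for the per-worker histogram buffers).

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<double> event_buffer;      // work queue filled by the readers
std::mutex mtx;
std::condition_variable cv;
bool done = false;

void reader(int n_events) {           // stands in for "read + deserialize"
  for (int i = 0; i < n_events; ++i) {
    std::lock_guard<std::mutex> lock(mtx);
    event_buffer.push(0.1 * i);       // a fake "event"
    cv.notify_one();
  }
}

void worker(double& local_histo) {    // per-worker histogram buffer
  std::unique_lock<std::mutex> lock(mtx);
  while (true) {
    cv.wait(lock, [] { return !event_buffer.empty() || done; });
    if (event_buffer.empty()) return; // readers finished and queue drained
    double ev = event_buffer.front();
    event_buffer.pop();
    lock.unlock();
    local_histo += ev;                // "analysis" on the event, no shared state
    lock.lock();
  }
}

int main() {
  std::vector<double> histos(4, 0.0);
  std::vector<std::thread> workers;
  for (auto& h : histos) workers.emplace_back(worker, std::ref(h));

  std::thread r1(reader, 1000), r2(reader, 1000);   // two concurrent readers
  r1.join(); r2.join();
  { std::lock_guard<std::mutex> lock(mtx); done = true; }
  cv.notify_all();
  for (auto& w : workers) w.join();

  double merged = 0;                  // merge the per-worker buffers at the end
  for (double h : histos) merged += h;
  return merged < 0;                  // trivially 0; keeps 'merged' used
}
```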
Current merging procedure
Each worker job n writes Output/00n/ containing analysis.root, wn.xml and results.root. Stage-1 merging jobs, MyAnalysis_merge("…/Output", stage=1, chunk), combine chunks of these into Output/results_Stage01_000.root, Output/results_Stage01_001.root, ...; the next stage, MyAnalysis_merge("…/Output", stage=2, chunk), produces the final Output/results.root. Scales like t_job * log(N_stages). (A sketch of one merging stage follows.)
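A minimal sketch of what one merging stage does, using ROOT's TFileMerger. This illustrates the staged pattern only, it is not the actual AliEn merging macro; the chunking scheme and output file names are assumptions.

```cpp
#include <vector>
#include "TFileMerger.h"
#include "TString.h"

// Merge 'inputs' in chunks of 'chunk' files; returns the intermediate outputs
// that the next stage will merge again (staged merging).
std::vector<TString> merge_stage(const std::vector<TString>& inputs,
                                 int stage, int chunk) {
  std::vector<TString> outputs;
  for (size_t first = 0; first < inputs.size(); first += chunk) {
    TFileMerger merger;
    TString out = TString::Format("results_Stage%02d_%03d.root",
                                  stage, static_cast<int>(first / chunk));
    merger.OutputFile(out);
    for (size_t i = first; i < inputs.size() && i < first + chunk; ++i)
      merger.AddFile(inputs[i]);
    merger.Merge();                   // actual merging of this chunk
    outputs.push_back(out);
  }
  return outputs;                     // repeat with the next stage until one file is left
}
```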
Merging performance
Figure: time spent in job submission and in the successive merging stages; performance is worse for big outputs.
Possible improvements
Diagram: processes 1 ... N ship their outputs over the network to a merging service process on the cluster, which writes Merged.root (a single merging process, if this scales?).
- Not obvious for histograms unless a "buffered" mode is enabled
- Needs a fast network to the merging service
- Merge-ahead is possible at the initial stage
- A dedicated merge cluster could handle "difficult" cases
Summary and conclusions
- The future requirements on analysis throughput are very demanding
- We are trying to evolve from a working system to one that works efficiently
- The road to performance is long and requires both low-level optimizations and a rethinking of global scheduling
- Major changes have to be evaluated: changes in the data format, use of parallelism, re-writing the GRID scheduling system
- Active monitoring of today's analysis patterns is mandatory for eliminating bottlenecks; the recent efficiency gains from organized analysis and I/O improvements prove it