Improving the analysis performance
Andrei Gheata, ALICE offline week, 7 Nov 2013

Performance ingredients
1. Maximize efficiency: CPU/IO
   o Useful CPU cycles: deserialization is really I/O, but is accounted for as CPU
2. Minimize the time to complete a given analysis
3. Maximize throughput
   o N events/second/core for local data access

1. CPU/IO

More CPU cycles
Maximize CPU = organized analysis trains
   o At least one CPU-intensive wagon gives the others a free ride…
   o Often possible in organized mode (LEGO), but not the case for user-distributed analysis
Maximize participation in LEGO
[Chart: LEGO vs. USER analysis share]

Faster I/O
Fragmented I/O introduces overheads
   o ~N_I/O_requests × latency
   o A killer in the case of WAN file access (see old lessons)
Tree caching reduces fragmentation
   o Overheads are still present for WAN access! Sharing WAN bandwidth does not scale as well as LAN at all sites.
[Diagram: one latency penalty per read without tree caching vs. one per cache fill with it]
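For context, tree caching is enabled per tree in ROOT; a minimal sketch, where the file URL, tree name and cache size are placeholders:

// Minimal sketch: ROOT's TTreeCache coalesces many small, scattered basket
// reads into few large requests, paying the network latency once per cache
// fill instead of once per read. File URL and tree name are illustrative.
#include "TFile.h"
#include "TTree.h"

void enableTreeCache() {
   TFile *f = TFile::Open("root://some.storage.element//alice/data/AliAOD.root");
   if (!f) return;
   TTree *tree = static_cast<TTree *>(f->Get("aodTree"));
   if (!tree) return;
   tree->SetCacheSize(30 * 1024 * 1024);   // 30 MB read-ahead cache
   tree->AddBranchToCache("*", kTRUE);     // cache all branches (and sub-branches)
   for (Long64_t i = 0; i < tree->GetEntries(); ++i)
      tree->GetEntry(i);                   // reads now go through the cache
   f->Close();
}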

Maximizing local file access
By smart file distribution and "basketizing" (splitting) per job: done
   o Locality is a feature by design
   o The algorithm is now fixed after longstanding splitting issues
   o Some local file access may time out, or jobs are intentionally sent to free slots with non-local access
Data migration?
   o Based on popularity services? Not terribly useful without a working forecast service…
   o Integrated with the job scheduling system?
By using data prefetching
   o At low level (ROOT): "copy during run", to be tested at large scale
   o At workload management level (smart prefetching on proxies)?

TFilePrefetch
A separate thread reading ahead data blocks
The first testing round found a bug
   o The fix made it into ROOT v… (not yet used in production)
   o To be tested with LEGO trains soon
AliAnalysisManager::SetAsyncReading()
   o A job reading from CNAF at 70% efficiency goes to 82%!
   o Forward file opening is not implemented -> non-smooth transitions between files
[Diagram: a reading thread prefetches blocks 1-7 of File1/File2 ahead of the processing thread]
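A minimal configuration sketch, assuming ROOT's standard rootrc switch for asynchronous prefetching plus the AliRoot call named above (whose exact signature is an assumption here):

// Sketch: turn on asynchronous prefetching so a separate thread reads data
// blocks ahead of processing. "TFile.AsyncPrefetching" is ROOT's rootrc
// switch for TFilePrefetch; the AliAnalysisManager call is the AliRoot-level
// knob from the slide (Bool_t argument assumed).
#include "TEnv.h"
#include "AliAnalysisManager.h"

void configureAsyncReading(AliAnalysisManager *mgr) {
   gEnv->SetValue("TFile.AsyncPrefetching", 1);  // enable TFilePrefetch globally
   mgr->SetAsyncReading(kTRUE);                  // per-train switch
}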

Speeding up deserialization
"The event complexity contributes heavily to deserialization time…"
   o …for the same amount of bytes read
   o Deserialization itself + ReadFromTree()
   o Remains to be measured
Ongoing work: flattening AliAODEvent to 2 levels
   o Event + tracks, vertices, cascades, V0s, …
   o TFile::MakeProject / TTree::MakeClass as a base for the refactoring
   o Read the event from the AOD into the new structure + write it to a new file
   o Compare reading speeds in the two cases
For the future: a new "thinned" AOD format serving 80% of analyses
   o The train model calls for "general" data -> contradicts reducing the size
   o To investigate: SOA for tracks, vertices, V0s, …
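A hypothetical illustration of the two-level flattened layout described above; none of these types are actual AliRoot classes:

// Sketch of a "flattened" two-level event: the event header plus plain
// struct-of-arrays (SOA) containers for tracks, instead of a deep tree of
// objects that is expensive to deserialize. All names (FlatAODEvent,
// FlatTracks, ...) are illustrative.
#include <vector>

struct FlatTracks {            // SOA: one contiguous array per member
   std::vector<float> pt, eta, phi;
   std::vector<short> charge;
};

struct FlatAODEvent {          // level 1: event header
   int   runNumber;
   float vtxX, vtxY, vtxZ;
   FlatTracks tracks;          // level 2: flat containers, no deeper nesting
};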

Minimizing the time to complete an analysis

Improving the tails…
Assess the acceptable statistics loss from cutting the tail
Jobs in the queue not finding free slots where their data is
   o Currently the site requirements are being relaxed, so jobs land on some free slot and access (almost) all files remotely -> an efficiency price
   o One can also play with raising the priority of the tail jobs
Prefetching the jobs' file lists on data proxies near free slots
   o Triggered by a low watermark on remaining jobs or a high watermark on waiting time (see the sketch below)
   o Change the job requirements to match the data proxy
   o Better than local per-job file prefetching because it can be done while the jobs are waiting
Just ideas to open the discussion…
   o An implementation would require a data transfer service and xrootd-based proxy caching
   o It can bring the efficiency up, but also reduce the time to finish the jobs
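To make the watermark trigger concrete, a toy sketch; the thresholds and the queue interface are invented for illustration:

// Illustrative only: deciding when to start staging the tail jobs' file
// lists onto a proxy near free slots. JobQueue and both thresholds are
// hypothetical, not part of any existing scheduler API.
struct JobQueue {
   int    remainingJobs;      // jobs still waiting for a matching slot
   double maxWaitSeconds;     // longest current waiting time
};

bool shouldPrefetchTail(const JobQueue &q) {
   const int    kLowWatermarkJobs  = 50;     // few jobs left -> tail phase
   const double kHighWatermarkWait = 3600.;  // or jobs waiting > 1 hour
   return q.remainingJobs < kLowWatermarkJobs || q.maxWaitSeconds > kHighWatermarkWait;
}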

Possible approaches
[Diagram: the waiting jobs' file lists (F1,F2,F3 at SITE A; F4,F5,F6 at SITE B; F7,F8,F9 at SITE D or C) trigger a transfer service that copies the files to storage elements next to free job slots]
A big gun to kill a fly? (the tail is ~5%). Alternative: no copy, but relaxing the requirements based on geographic proximity. Is there such a thing as "free slots"?

Throughput/core

Prerequisites
Achieving micro-parallelism
   o Vectors, instruction pipelining, ILP
   o Some performed by hardware and compilers, others require explicit intervention
Working on "parallel" data (i.e. vector-like)
   o Redesign of data structures AND algorithms
   o Data and code locality enforced at framework level, algorithm optimizations at user level
Make use of coprocessors (GPGPU, Xeon Phi, …)
   o Requires parallelism beyond vectorization (including at the I/O level)
"Decent" CPU usage
   o CPU-bound tasks

Is it worth it?
A nice exercise by Magnus Mager, reshaping a three-prong vertexing analysis
   o Reformatting the input data to keep minimal info, feeding threads from a circular data buffer, some AVX vectorization, custom histogramming
   o 500 MB/s processed from SSD
[Diagram: reader thread -> circular buffer -> analysis threads -> merging -> result]
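The feeding pattern is a classic bounded producer/consumer; an illustrative sketch, not the actual code from the exercise:

// One reader thread fills a bounded buffer (ring-buffer stand-in) with event
// chunks; several analysis threads drain it. Chunk is a placeholder type.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

struct Chunk { std::vector<char> bytes; };

class CircularBuffer {
   std::queue<Chunk> q_;
   std::mutex m_;
   std::condition_variable notFull_, notEmpty_;
   const std::size_t cap_ = 64;   // bound: reader blocks when analysis lags
   bool done_ = false;
public:
   void push(Chunk c) {                       // called by the reader thread
      std::unique_lock<std::mutex> lk(m_);
      notFull_.wait(lk, [&] { return q_.size() < cap_; });
      q_.push(std::move(c));
      notEmpty_.notify_one();
   }
   bool pop(Chunk &out) {                     // false once drained and finished
      std::unique_lock<std::mutex> lk(m_);
      notEmpty_.wait(lk, [&] { return !q_.empty() || done_; });
      if (q_.empty()) return false;
      out = std::move(q_.front());
      q_.pop();
      notFull_.notify_one();
      return true;
   }
   void finish() {                            // reader signals end of input
      std::lock_guard<std::mutex> lk(m_);
      done_ = true;
      notEmpty_.notify_all();
   }
};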

Micro-parallelism path
Data structure re-engineering: shrink and flatten, use SOA and alignment techniques
Concurrency in data management: reading, dispatching to workers
Define the work unit: e.g. a vector of tracks; provide an API with vector signatures (sketched below)
Concurrent processing: thread-safe user code, use of vectorization, kernels
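A minimal sketch of what a "vector signature" could look like; the names are illustrative, not an existing framework API:

// The work unit is a whole vector of tracks rather than one track at a time,
// so the implementation loops over contiguous arrays and the compiler can
// auto-vectorize (e.g. with -O3 -mavx), unlike a per-track virtual call.
#include <cstddef>
#include <vector>

struct TrackVec {                      // SOA work unit
   std::vector<float> px, py, pz;
};

// Select all tracks above a pT cut in one call over the whole vector.
void selectHighPt(const TrackVec &in, float ptCut2, std::vector<int> &accepted) {
   accepted.clear();
   for (std::size_t i = 0; i < in.px.size(); ++i) {
      const float pt2 = in.px[i] * in.px[i] + in.py[i] * in.py[i];
      if (pt2 > ptCut2) accepted.push_back(static_cast<int>(i));
   }
}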

Conclusions
Analysis performance has multiple dimensions; we are addressing a few of them
The data management policy requires improvements to further reduce the time to analysis completion while staying efficient
File prefetching is expected to bring efficiency up by an extra ~10%
A long path ahead towards micro-parallelism in analysis