
Evolution of Parallel Programming in HEP
F. Rademakers, CERN
International Workshop on Large Scale Computing (IWLSC), VECC, Kolkata, 9 February 2006

Outline
- Why use parallel computing
- Parallel computing concepts
- Typical parallel problems
- Amdahl's law
- Parallelism in HEP
- Parallel data analysis in HEP
- PIAF
- PROOF
- Conclusions

Why Parallelism
- Two primary reasons:
  - Save time (wall-clock time)
  - Solve larger problems
- Other reasons:
  - Take advantage of non-local resources (the Grid)
  - Cost savings: use multiple "cheap" machines instead of paying for a supercomputer
  - Overcome memory constraints: a single computer has finite memory, while many machines together provide a very large memory
- Limits to serial computing:
  - Transmission speeds: the speed of a serial computer depends directly on how fast data can move through the hardware; hard limits are the speed of light (30 cm/ns) and the transmission speed in copper wire (9 cm/ns)
  - Limits to miniaturization
  - Economic limitations
- Ultimately, parallel computing is an attempt to maximize that seemingly scarce commodity called time

Parallel Computing Concepts
- Parallel hardware:
  - A single computer with multiple (possibly multi-core) processors
  - An arbitrary number of computers connected by a network (LAN/WAN)
  - A combination of both
- Parallelizable computational problems:
  - Can be broken into discrete pieces of work that can be solved simultaneously
  - Can execute multiple program instructions at any moment in time
  - Can be solved in less time with multiple compute resources than with a single one

Parallel Computing Concepts
- A common way to classify parallel computers is Flynn's taxonomy:
  - SISD: Single Instruction, Single Data
  - SIMD: Single Instruction, Multiple Data
  - MISD: Multiple Instruction, Single Data
  - MIMD: Multiple Instruction, Multiple Data

SISD
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is acted on by the CPU during any one clock cycle
- Single data: only one data stream is used as input during any one clock cycle
- Deterministic execution
- Examples: most classical PCs, single-CPU workstations and mainframes

SIMD
- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network and a very large array of very small-capacity CPUs
- Best suited for specialized problems with a high degree of regularity, e.g. image processing
- Synchronous and deterministic execution
- Two varieties: processor arrays and vector pipelines
- Examples (some extinct):
  - Processor arrays: Connection Machine, Maspar MP-1, MP-2
  - Vector pipelines: CDC 205, IBM 9000, Cray C90, Fujitsu, NEC SX-2

MISD
- Few actual examples of this class of parallel computer have ever existed
- Some conceivable examples might be:
  - Multiple frequency filters operating on a single signal stream
  - Multiple cryptography algorithms attempting to crack a single coded message

MIMD
- Currently the most common type of parallel computer
- Multiple instruction: every processor may be executing a different instruction stream
- Multiple data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, networked parallel computing "grids" and multi-processor SMP computers, including multi-CPU and multi-core PCs

Relevant Terminology
- Observed speedup: wall-clock time of serial execution / wall-clock time of parallel execution
- Granularity:
  - Coarse: relatively large amounts of computational work are done between communication events
  - Fine: relatively small amounts of computational work are done between communication events
- Parallel overhead: the time required to coordinate parallel tasks, as opposed to doing useful work; typically:
  - Task start-up time
  - Synchronizations
  - Data communications
  - Software overhead imposed by parallel compilers, libraries, tools, OS, etc.
  - Task termination time
- Scalability: a parallel system's ability to demonstrate a proportional increase in speedup as more processors are added
- Embarrassingly parallel: many similar but independent tasks executed simultaneously, with little or no coordination between them

Typical Parallel Problems
- Traditionally, parallel computing has been considered "the high end of computing":
  - Weather and climate
  - Chemical and nuclear reactions
  - Biology, the human genome
  - Geology, seismic activity
  - Electronic circuits
- Today commercial applications are the driving force:
  - Parallel databases, data mining
  - Oil exploration
  - Web search engines
  - Computer-aided diagnosis in medicine
  - Advanced graphics and virtual reality
- The future: the trends of the past 10 years (ever faster networks, distributed systems, and multi-processor and now multi-core computer architectures) suggest that parallelism is the future of computing

Amdahl's Law
- Amdahl's law states that the potential speedup is determined by the fraction of code (P) that can be parallelized:

    speedup = 1 / (1 - P)

- If none of the code can be parallelized, P = 0 and the speedup is 1 (no speedup); if all of the code is parallelized, P = 1 and the speedup is (in theory) infinite
- If 50% of the code can be parallelized, the maximum speedup is 2, meaning the code can run at most twice as fast
- Introducing the number of processors performing the parallel fraction of the work, the relationship can be written as:

    speedup = 1 / (P/N + S)

  where P = parallel fraction, N = number of processors and S = serial fraction (S = 1 - P)
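To make the law concrete, here is a small standalone C++ sketch (illustrative only, not from the original slides) that evaluates the Amdahl speedup for a few parallel fractions and processor counts, showing how quickly the serial fraction dominates:

#include <cstdio>

// Amdahl's law: speedup(P, N) = 1 / ((1 - P) + P / N),
// where P is the parallelizable fraction and N the number of processors.
double amdahlSpeedup(double p, int n)
{
   return 1.0 / ((1.0 - p) + p / n);
}

int main()
{
   const double fractions[] = { 0.50, 0.90, 0.99 };
   const int    workers[]   = { 2, 10, 100, 1000 };

   for (double p : fractions) {
      for (int n : workers)
         std::printf("P = %.2f  N = %4d  speedup = %6.2f\n",
                     p, n, amdahlSpeedup(p, n));
      // As N grows the speedup saturates at 1 / (1 - P):
      // 2x for P = 0.5, 10x for P = 0.9, 100x for P = 0.99.
   }
   return 0;
}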


Parallelism in HEP
- Main areas of processing in HEP:
  - DAQ
    - Typically highly parallel
    - Processes a large number of detector modules or sub-detectors in parallel
  - Simulation
    - No need for fine-grained track-level parallelism, since a single event is not the end product
    - Some attempts were made to introduce track-level parallelism in G3 (GEANT3)
    - Typically job-level parallelism, resulting in a large number of files
  - Reconstruction
    - Same as for simulation
  - Analysis
    - Run over many events in parallel to obtain the final analysis results quickly
    - Embarrassingly parallel: event-level parallelism
    - Preferably interactive, for better control of and feedback from the analysis
    - Main challenge: efficient data access

Parallel Data Analysis in HEP
- Most parallel data analysis systems, past and present, are based on job-splitting scripts and batch queues
  - When the queues are full there is no parallelism
  - Explicit parallelism
- Turnaround time is dictated by the batch system scheduler and resource availability
- Remarkably few attempts at truly interactive, implicitly parallel systems:
  - PIAF
  - PROOF

Classical Parallel Data Analysis
[Diagram: a batch farm with storage, queues, a manager and an output catalog; the user splits the data files, submits myAna.C jobs, and merges the outputs for the final analysis]
- "Static" use of resources
- Jobs frozen: 1 job / CPU
- "Manual" splitting and merging
- Limited monitoring (only at the end of each job)

Interactive Parallel Data Analysis
[Diagram: an interactive farm with storage, catalog and scheduler; the client sends a query (data file list plus myAna.C) to a MASTER, which returns merged feedback and merged final outputs]
- Farm perceived as an extension of the local PC
- More dynamic use of resources
- Automated splitting and merging
- Real-time feedback

PIAF
- The Parallel Interactive Analysis Facility
- First attempt at an interactive parallel analysis system
- Extension of, and based on, the PAW system
- Joint project between CERN/IT and Hewlett-Packard
- Development started in 1992
- Small production service opened for LEP users in 1993
  - Up to 30 concurrent users
- The CERN PIAF cluster consisted of 8 HP PA-RISC machines
  - FDDI interconnect
  - 512 MB RAM
  - A few hundred GB of disk
- First observation of hyper-speedup, using column-wise n-tuples

PIAF Architecture
- Two-tier push architecture
  - Client → Master → Workers
  - The master divides the total number of events by the number of workers and assigns each worker 1/n of the events to process (a sketch of this static split follows the next slide)
- Pros:
  - Transparent
- Cons:
  - The slowest node determined the time of completion
  - Not adaptable to varying node loads
  - No optimized data access strategies
  - Required a homogeneous cluster
  - Not scalable

PIAF Push Architecture
[Diagram: the master receives Process("ana.C") and pushes 1/n of the events to each of the slaves 1..N with SendEvents(); each slave initializes, processes and waits for the next command; results come back via SendObject(histo), and the master adds and displays the histograms]
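A minimal sketch (illustrative C++ only, not the actual PIAF code) of the push model described on the previous two slides: the master splits the event range once, up front, into n equal chunks and pushes one chunk to each slave, so the query finishes only when the slowest slave is done.

#include <cstdio>
#include <vector>

// Hypothetical description of one slave's static work assignment.
struct Assignment {
   int  slave;     // slave index
   long first;     // first event to process
   long nEvents;   // number of events assigned
};

// Push scheduling: divide the total number of events by the number of
// slaves once, before processing starts. No later re-balancing is possible.
std::vector<Assignment> pushSplit(long totalEvents, int nSlaves)
{
   std::vector<Assignment> plan;
   long chunk = totalEvents / nSlaves;
   long first = 0;
   for (int i = 0; i < nSlaves; ++i) {
      // The last slave takes any remainder.
      long n = (i == nSlaves - 1) ? totalEvents - first : chunk;
      plan.push_back({ i, first, n });
      first += n;
   }
   return plan;
}

int main()
{
   for (const Assignment &a : pushSplit(1000000, 8))
      std::printf("slave %d: events [%ld, %ld)\n", a.slave, a.first, a.first + a.nEvents);
   return 0;
}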

PROOF
- The Parallel ROOT Facility
- Second-generation interactive parallel analysis system
- Extension of, and based on, the ROOT system
- Joint project between ROOT, LCG, ALICE and MIT
- Proof of concept in 1997
- Development picked up in 2002
- PROOF in production in PHOBOS/BNL (with up to 150 CPUs) since 2003
- Second wave of developments started in 2005, following interest by the LHC experiments

PROOF Original Design Goals
- Interactive parallel analysis on a heterogeneous cluster
- Transparency
  - Same selectors, same chain Draw(), etc. on PROOF as in a local session (see the sketch below)
- Scalability
  - Good and well understood (1000 nodes in the most extreme case)
  - Extensive monitoring capabilities
  - MLM (multi-level master) improves scalability on wide-area clusters
- Adaptability
  - Partly achieved: the system handles varying load on cluster nodes
  - MLM allows much better latencies on wide-area clusters
  - No support yet for worker nodes coming and going
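The transparency goal can be illustrated with a sketch like the following ROOT macro (the file names, the "AOD" tree and the "fPt" branch are assumed for illustration; the gROOT->Proof() call is the one used on the "Running PROOF" slide later in this talk): the same TChain code runs either locally or in parallel, the only difference being whether a PROOF session has been opened.

// run_analysis.C -- illustrative macro, not from the original slides.
void run_analysis(bool useProof = false)
{
   TChain *chain = new TChain("AOD");
   chain->Add("ana_*.root");          // local or remotely accessible files (assumed names)

   if (useProof)
      gROOT->Proof("master");         // open a PROOF session; the chain is now processed in parallel

   // Exactly the same calls in both cases: selectors, Draw(), etc.
   chain->Draw("fPt");                // histogram of the (assumed) fPt branch
   chain->Process("myselector.C");    // run a TSelector over all events
}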

PROOF Multi-Tier Architecture
[Diagram: a multi-tier setup (client, master, sub-masters, workers) spanning physically separated domains; labels indicate where a good connection is very important and where it is less important]
- Optimize for data locality or for efficient data server access
- Adapts to clusters of clusters or wide-area virtual clusters

PROOF Pull Architecture
[Diagram: the master receives Process("ana.C") and runs a packet generator; each of the slaves 1..N initializes, processes and repeatedly calls GetNextPacket() to pull its next packet (first event, number of events), with packet sizes varying per slave; results come back via SendObject(histo), and the master adds and displays the histograms]
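A minimal sketch (illustrative C++, not the actual PROOF implementation) of the pull model shown above: instead of a fixed 1/n split, the master keeps a packet generator and each slave asks for its next packet as soon as it finishes the previous one, so faster slaves automatically process more packets and the load balances itself.

#include <algorithm>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical packet generator kept by the master: hands out small event
// ranges ("packets") on request until the dataset is exhausted.
class PacketGenerator {
public:
   PacketGenerator(long totalEvents, long packetSize)
      : fNext(0), fTotal(totalEvents), fSize(packetSize) {}

   // Returns false when there is no work left; otherwise fills [first, first+n).
   bool GetNextPacket(long &first, long &n)
   {
      std::lock_guard<std::mutex> lock(fMutex);
      if (fNext >= fTotal) return false;
      first = fNext;
      n     = std::min(fSize, fTotal - fNext);
      fNext += n;
      return true;
   }

private:
   std::mutex fMutex;
   long fNext, fTotal, fSize;
};

int main()
{
   PacketGenerator gen(100000, 2500);   // 100k events in 2500-event packets

   // Threads stand in for the slave servers; each keeps pulling packets
   // until the generator runs dry.
   auto slave = [&gen](int id) {
      long first, n;
      while (gen.GetNextPacket(first, n))
         std::printf("slave %d processes events [%ld, %ld)\n", id, first, first + n);
   };

   std::vector<std::thread> slaves;
   for (int i = 0; i < 4; ++i) slaves.emplace_back(slave, i);
   for (auto &t : slaves) t.join();
   return 0;
}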

PROOF New Features
- Support for "interactive batch" mode
  - Allow submission of long-running queries
  - Allow client/master disconnect and reconnect
- Powerful, friendly and complete GUI
- Work in Grid environments
  - Startup of agents via the Grid job scheduler
  - Agents calling out to the master (firewalls, NAT)
  - Dynamic master-worker setup

Interactive/Batch Queries
[Diagram: a spectrum of analysis modes, from GUI and command scripts (stateful or stateless) to batch (stateless)]
- Interactive analysis using local resources, e.g. end-analysis calculations and visualization
- Medium-term jobs, e.g. analysis design and development, also using non-local resources
- Analysis jobs with well-defined algorithms (e.g. production of personal trees)
- Goal: bring these to the same level of perception

Analysis Session Example
- Monday at 10h15, ROOT session on my desktop:
  - AQ1: a 1 s query produces a local histogram
  - AQ2: a 10 min query submitted to PROOF1
  - AQ3-AQ7: short queries
  - AQ8: a 10 h query submitted to PROOF2
- Monday at 16h25, ROOT session on my laptop:
  - BQ1: browse the results of AQ2
  - BQ2: browse the temporary results of AQ8
  - BQ3-BQ6: submit four 10 min queries to PROOF1
- Wednesday at 8h40, ROOT session on my laptop in Kolkata:
  - CQ1: browse the results of AQ8 and BQ3-BQ6

New PROOF GUI
[Four slides of screenshots of the new PROOF GUI]

TGrid – Abstract Grid Interface

class TGrid : public TObject {
public:
   virtual Int_t        AddFile(const char *lfn, const char *pfn) = 0;
   virtual Int_t        DeleteFile(const char *lfn) = 0;
   virtual TGridResult *GetPhysicalFileNames(const char *lfn) = 0;
   virtual Int_t        AddAttribute(const char *lfn, const char *attrname, const char *attrval) = 0;
   virtual Int_t        DeleteAttribute(const char *lfn, const char *attrname) = 0;
   virtual TGridResult *GetAttributes(const char *lfn) = 0;
   virtual void         Close(Option_t *option = "") = 0;
   virtual TGridResult *Query(const char *query) = 0;

   static TGrid *Connect(const char *grid, const char *uid = 0, const char *pw = 0);

   ClassDef(TGrid,0)   // ABC defining interface to GRID services
};

PROOF on the Grid
[Diagram: a PROOF user session connected to a PROOF master server, sub-master servers and slave servers, with Grid service interfaces: Grid/ROOT authentication, Grid access control service, TGrid UI/queue UI, proofd startup, Grid file/metadata catalogue]
- Guaranteed site access through PROOF sub-masters calling out to the master (agent technology)
- The client retrieves the list of logical files (LFN + MSN) from the Grid file/metadata catalogue
- Slave servers access data via xrootd from local disk pools

Running PROOF

TGrid *alien = TGrid::Connect("alien");
TGridResult *res;
res = alien->Query("lfn:///alice/simulation/ /V0.6*.root");
TChain *chain = new TChain("AOD");
chain->Add(res);
gROOT->Proof("master");
chain->Process("myselector.C");
// plot/save objects produced in myselector.C
...

Conclusions
- Amdahl's law shows that making really scalable parallel applications is very hard
- Parallelism in HEP off-line computing is still lagging
- To solve the LHC data analysis problem, parallelism is the only solution
- To make good use of the current and future generations of multi-core CPUs, parallel applications are required