
1 Evolution of Parallel Programming in HEP
 F. Rademakers – CERN
 International Workshop on Large Scale Computing (IWLSC), 9 Feb 2006, VECC, Kolkata

2 Outline
 - Why use parallel computing
 - Parallel computing concepts
 - Typical parallel problems
 - Amdahl's law
 - Parallelism in HEP
 - Parallel data analysis in HEP
 - PIAF
 - PROOF
 - Conclusions

3 Why Parallelism
 - Two primary reasons:
   - Save time (wall-clock time)
   - Solve larger problems
 - Other reasons:
   - Taking advantage of non-local resources – Grid
   - Cost saving – using multiple "cheap" machines instead of paying for a supercomputer
   - Overcoming memory constraints – single computers have finite memory resources; use many machines to create a very large memory
 - Limits to serial computing:
   - Transmission speeds – the speed of a serial computer is directly dependent on how much data can move through the hardware; absolute limits are the speed of light (30 cm/ns) and the transmission limit of copper wire (9 cm/ns)
   - Limits to miniaturization
   - Economic limitations
 - Ultimately, parallel computing is an attempt to maximize the infinite but seemingly scarce commodity called time

4 Parallel Computing Concepts
 - Parallel hardware:
   - A single computer with multiple processors (multi-CPU and/or multi-core)
   - An arbitrary number of computers connected by a network (LAN/WAN)
   - A combination of both
 - Parallelizable computational problems:
   - Can be broken apart into discrete pieces of work that can be solved simultaneously
   - Can execute multiple program instructions at any moment in time
   - Can be solved in less time with multiple compute resources than with a single compute resource

5 Parallel Computing Concepts
 - There are different ways to classify parallel computers (Flynn's taxonomy):
   - SISD: Single Instruction, Single Data
   - SIMD: Single Instruction, Multiple Data
   - MISD: Multiple Instruction, Single Data
   - MIMD: Multiple Instruction, Multiple Data

6 SISD
 - A serial (non-parallel) computer
 - Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
 - Single data: only one data stream is being used as input during any one clock cycle
 - Deterministic execution
 - Examples: most classical PCs, single-CPU workstations and mainframes

7 SIMD
 - A type of parallel computer
 - Single instruction: all processing units execute the same instruction at any given clock cycle
 - Multiple data: each processing unit can operate on a different data element
 - This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network and a very large array of very small-capacity CPUs
 - Best suited for specialized problems with a high degree of regularity, e.g. image processing
 - Synchronous and deterministic execution
 - Two varieties: processor arrays and vector pipelines
 - Examples (some extinct):
   - Processor arrays: Connection Machine, MasPar MP-1 and MP-2
   - Vector pipelines: CDC 205, IBM 9000, Cray C90, Fujitsu, NEC SX-2

8 MISD
 - Few actual examples of this class of parallel computer have ever existed
 - Some conceivable examples might be:
   - Multiple frequency filters operating on a single signal stream
   - Multiple cryptography algorithms attempting to crack a single coded message

9 MIMD
 - Currently the most common type of parallel computer
 - Multiple instruction: every processor may be executing a different instruction stream
 - Multiple data: every processor may be working with a different data stream
 - Execution can be synchronous or asynchronous, deterministic or non-deterministic
 - Examples: most current supercomputers, networked parallel computing "grids" and multi-processor SMP computers, including multi-CPU and multi-core PCs

10 Relevant Terminology
 - Observed speedup: wall-clock time of serial execution / wall-clock time of parallel execution
 - Granularity:
   - Coarse: relatively large amounts of computational work are done between communication events
   - Fine: relatively small amounts of computational work are done between communication events
 - Parallel overhead: the time required to coordinate parallel tasks, as opposed to doing useful work, typically:
   - Task start-up time
   - Synchronizations
   - Data communications
   - Software overhead imposed by parallel compilers, libraries, tools, OS, etc.
   - Task termination time
 - Scalability: a parallel system's ability to demonstrate a proportional increase in parallel speedup with the addition of more processors
 - Embarrassingly parallel: the work splits into many independent tasks that need little or no communication (e.g. processing events independently)
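In symbols (a small clarifying addition; the parallel-efficiency line is not on the slide):

   S(p) = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}(p)} \quad \text{(observed speedup on } p \text{ processors)}
   E(p) = \frac{S(p)}{p} \quad \text{(parallel efficiency, ideally close to 1)}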

11 Typical Parallel Problems
 - Traditionally, parallel computing has been considered to be "the high end of computing":
   - Weather and climate
   - Chemical and nuclear reactions
   - Biology, the human genome
   - Geology, seismic activity
   - Electronic circuits
 - Today commercial applications are the driving force:
   - Parallel databases, data mining
   - Oil exploration
   - Web search engines
   - Computer-aided diagnosis in medicine
   - Advanced graphics and virtual reality
 - The future: the trends of the past 10 years – ever faster networks, distributed systems, and multi-processor (and now multi-core) computer architectures – suggest that parallelism is the future

12 Amdahl's Law
 - Amdahl's law states that the potential speedup is determined by the fraction of code (P) that can be parallelized:

      speedup = 1 / (1 - P)

 - If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup); if all the code is parallelized, P = 1 and the speedup is infinite (in theory)
 - If 50% of the code can be parallelized, the maximum speedup = 2, meaning the code will run twice as fast
 - Introducing the number of processors performing the parallel fraction of the work, the relationship can be written as:

      speedup = 1 / (P/N + S)

   where P = parallel fraction, N = number of processors and S = serial fraction (S = 1 - P)
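A quick worked example (numbers chosen here for illustration, they are not on the slides): with a parallel fraction P = 0.9 (so S = 0.1) and N = 16 processors,

   \text{speedup} = \frac{1}{P/N + S} = \frac{1}{0.9/16 + 0.1} = \frac{1}{0.15625} = 6.4,
   \qquad \lim_{N\to\infty} \text{speedup} = \frac{1}{S} = 10

Even with arbitrarily many processors the 10% serial part caps the gain at a factor of 10, which is why the conclusions call truly scalable parallel applications hard to achieve.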

13 (Figure-only slide, no transcript text)

14 Parallelism in HEP
 - Main areas of processing in HEP:
   - DAQ
     - Typically highly parallel
     - Processes in parallel a large number of detector modules or sub-detectors
   - Simulation
     - No need for fine-grained track-level parallelism; a single event is not the end product
     - Some attempts were made to introduce track-level parallelism in GEANT3
     - Typically job-level parallelism, resulting in a large number of files
   - Reconstruction
     - Same as for simulation
   - Analysis
     - Run over many events in parallel to obtain the final analysis results quickly
     - Embarrassingly parallel: event-level parallelism
     - Preferably interactive, for better control of and feedback on the analysis
     - Main challenge: efficient data access

15 Parallel Data Analysis in HEP
 - Most parallel data analysis systems designed in the past and present are based on job-splitting scripts and batch queues
   - When the queues are full there is no parallelism
   - Explicit parallelism
 - Turnaround time is dictated by the batch system scheduler and resource availability
 - Remarkably few attempts at truly interactive, implicitly parallel systems:
   - PIAF
   - PROOF

16 Classical Parallel Data Analysis
 - "Static" use of resources
 - Jobs frozen: 1 job / CPU
 - "Manual" splitting and merging
 - Limited monitoring (end of single job)
 (Diagram: the user splits the data file list and submits myAna.C jobs to the batch farm queues via the queue manager; the jobs read files from storage, guided by the catalog; the outputs are merged by hand into the final analysis)
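As a rough sketch of the "manual splitting" step described above (illustrative only; the file names, job count and list format are made up, and a real setup would use the experiment's catalogue and batch submission scripts):

   #include <fstream>
   #include <string>
   #include <vector>

   int main() {
      // Hypothetical input catalogue; a real analysis would get this list
      // from the experiment's file catalog.
      std::vector<std::string> files;
      for (int i = 0; i < 100; ++i)
         files.push_back("data/run_" + std::to_string(i) + ".root");

      const int nJobs  = 10;  // fixed up front: 1 job / CPU
      const int perJob = (static_cast<int>(files.size()) + nJobs - 1) / nJobs;

      // Write one file list per batch job; each job runs myAna.C over its list.
      for (int j = 0; j < nJobs; ++j) {
         std::ofstream list("job_" + std::to_string(j) + ".txt");
         for (int k = j * perJob;
              k < (j + 1) * perJob && k < static_cast<int>(files.size()); ++k)
            list << files[k] << '\n';
      }
      // The per-job outputs then have to be merged by hand (e.g. by summing
      // the histograms from the output files) to obtain the final analysis.
      return 0;
   }

Both the splitting and the merging are explicit and static here, which is exactly what the interactive systems on the following slides automate.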

17 Interactive Parallel Data Analysis
 - Farm perceived as an extension of the local PC
 - More dynamic use of resources
 - Automated splitting and merging
 - Real-time feedback
 (Diagram: the client sends a query – data file list plus myAna.C – to the MASTER of the interactive farm; the scheduler assigns workers, which read the files from storage via the catalog; merged feedback objects are returned while the query runs and the merged final outputs at the end)

18 PIAF
 - The Parallel Interactive Analysis Facility
 - First attempt at an interactive parallel analysis system
 - Extension of, and based on, the PAW system
 - Joint project between CERN/IT and Hewlett-Packard
 - Development started in 1992
 - Small production service opened for LEP users in 1993
   - Up to 30 concurrent users
 - The CERN PIAF cluster consisted of 8 HP PA-RISC machines
   - FDDI interconnect
   - 512 MB RAM
   - A few hundred GB of disk
 - First observation of hyper-speedup using column-wise n-tuples

19 PIAF Architecture
 - Two-tier push architecture: Client → Master → Workers
 - The master divides the total number of events by the number of workers and assigns each worker 1/n of the events to process
 - Pros:
   - Transparent
 - Cons:
   - The slowest node determines the time of completion
   - Not adaptable to varying node loads
   - No optimized data access strategies
   - Required a homogeneous cluster
   - Not scalable
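A minimal sketch of this static 1/n split (not PIAF code; the function and variable names are invented for illustration):

   #include <cstdio>
   #include <utility>
   #include <vector>

   // Return the [first, last) event range assigned to each of nWorkers workers.
   std::vector<std::pair<long, long> > StaticSplit(long nEvents, int nWorkers) {
      std::vector<std::pair<long, long> > ranges;
      long chunk = nEvents / nWorkers;
      long rest  = nEvents % nWorkers;
      long first = 0;
      for (int w = 0; w < nWorkers; ++w) {
         long size = chunk + (w < rest ? 1 : 0);
         ranges.push_back(std::make_pair(first, first + size));
         first += size;
      }
      return ranges;  // once assigned, work cannot be redistributed
   }

   int main() {
      // Example: 1 million events over 8 workers.
      std::vector<std::pair<long, long> > r = StaticSplit(1000000, 8);
      for (size_t w = 0; w < r.size(); ++w)
         std::printf("worker %zu: events %ld to %ld\n", w, r[w].first, r[w].second - 1);
      return 0;
   }

Because the ranges are fixed up front, a worker on a loaded or slower node simply finishes last and the whole query waits for it – the main drawback listed above.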

20 PIAF Push Architecture
 (Diagram: the master sends Process("ana.C") to each of the N slaves together with its fixed 1/n share of the events (SendEvents()); after initialization each slave processes its share, returns its histograms with SendObject(histo) and waits for the next command; the master adds the histograms and displays them)

21 PROOF
 - Parallel ROOT Facility
 - Second-generation interactive parallel analysis system
 - Extension of, and based on, the ROOT system
 - Joint project between ROOT, LCG, ALICE and MIT
 - Proof of concept in 1997
 - Development picked up in 2002
 - PROOF in production in PHOBOS/BNL (with up to 150 CPUs) since 2003
 - Second wave of development started in 2005, following interest by the LHC experiments

22 PROOF Original Design Goals
 - Interactive parallel analysis on a heterogeneous cluster
 - Transparency
   - Same selectors, same chain Draw(), etc. on PROOF as in a local session
 - Scalability
   - Good and well understood (1000 nodes the most extreme case)
   - Extensive monitoring capabilities
   - MLM (Multi-Level Master) improves scalability on wide-area clusters
 - Adaptability
   - Partly achieved: the system handles varying load on cluster nodes
   - MLM allows much better latencies on wide-area clusters
   - No support yet for worker nodes coming and going

23 PROOF Multi-Tier Architecture
 (Diagram: client – master – sub-masters – workers spanning physically separated domains; the architecture adapts to clusters of clusters or wide-area virtual clusters and optimizes for data locality or efficient data server access; a good connection is very important between the lower tiers close to the data and less important towards the client)

24 PROOF Pull Architecture
 (Diagram: after initialization each of the N slaves runs Process("ana.C") and repeatedly asks the master's packet generator for work via GetNextPacket(); the master answers with variable-sized packets such as "0,100", "200,100", "300,40", "440,50" (first event, number of events); each slave returns its histograms with SendObject(histo) and waits for the next command; the master adds the histograms and displays them)
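A minimal sketch of such a pull-style packet generator (illustrative only, not the actual PROOF implementation; the class and member names are invented):

   #include <algorithm>
   #include <cstdio>
   #include <mutex>
   #include <utility>

   // A pull-style packet generator: workers ask for the next packet whenever
   // they are idle, so faster workers automatically process more packets.
   class PacketGenerator {
   public:
      PacketGenerator(long nEvents, long packetSize)
         : fNext(0), fTotal(nEvents), fSize(packetSize) {}

      // Called by a worker; returns false when all work has been handed out.
      // packet = (first event, number of events).
      bool GetNextPacket(std::pair<long, long> &packet) {
         std::lock_guard<std::mutex> lock(fMutex);  // many workers may ask at once
         if (fNext >= fTotal) return false;
         long n = std::min(fSize, fTotal - fNext);
         packet = std::make_pair(fNext, n);
         fNext += n;
         return true;
      }

   private:
      std::mutex fMutex;
      long fNext, fTotal, fSize;
   };

   int main() {
      PacketGenerator gen(1000, 100);  // 1000 events, packets of 100
      std::pair<long, long> p;
      while (gen.GetNextPacket(p))     // a single "worker" draining the generator
         std::printf("packet: first event %ld, %ld events\n", p.first, p.second);
      return 0;
   }

Because idle workers simply ask again, fast nodes end up processing more packets than slow ones, which is what lets the pull model adapt to varying node loads and heterogeneous clusters.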

25 PROOF New Features
 - Support for an "interactive batch" mode
   - Allow submission of long-running queries
   - Allow client/master disconnect and reconnect
 - Powerful, friendly and complete GUI
 - Work in Grid environments
   - Startup of agents via the Grid job scheduler
   - Agents calling out to the master (firewalls, NAT)
   - Dynamic master-worker setup

26 Interactive/Batch queries
 (Diagram: a spectrum of ways to run an analysis, from GUI via commands and scripts to batch)
 - Interactive analysis using local resources, e.g. end-analysis calculations and visualization
 - Medium-term jobs, e.g. analysis design and development, also using non-local resources (stateful or stateless)
 - Analysis jobs with well-defined algorithms, e.g. production of personal trees (stateless, batch)
 - Goal: bring these to the same level of perception

27 Analysis Session Example
 - Monday at 10h15, ROOT session on my desktop:
   - AQ1: 1 s query produces a local histogram
   - AQ2: a 10 min query submitted to PROOF1
   - AQ3–AQ7: short queries
   - AQ8: a 10 h query submitted to PROOF2
 - Monday at 16h25, ROOT session on my laptop:
   - BQ1: browse results of AQ2
   - BQ2: browse temporary results of AQ8
   - BQ3–BQ6: submit four 10 min queries to PROOF1
 - Wednesday at 8h40, ROOT session on my laptop in Kolkata:
   - CQ1: browse results of AQ8 and BQ3–BQ6

28–31 New PROOF GUI
 (Four screenshot slides of the new PROOF GUI; no transcript text)

32 TGrid – Abstract Grid Interface

   class TGrid : public TObject {
   public:
      virtual Int_t        AddFile(const char *lfn, const char *pfn) = 0;
      virtual Int_t        DeleteFile(const char *lfn) = 0;
      virtual TGridResult *GetPhysicalFileNames(const char *lfn) = 0;
      virtual Int_t        AddAttribute(const char *lfn, const char *attrname,
                                        const char *attrval) = 0;
      virtual Int_t        DeleteAttribute(const char *lfn, const char *attrname) = 0;
      virtual TGridResult *GetAttributes(const char *lfn) = 0;
      virtual void         Close(Option_t *option = "") = 0;
      virtual TGridResult *Query(const char *query) = 0;

      static TGrid *Connect(const char *grid, const char *uid = 0, const char *pw = 0);

      ClassDef(TGrid, 0)  // ABC defining interface to GRID services
   };

33 PROOF on the Grid
 (Diagram: a PROOF user session connects through the TGrid UI / queue UI; authentication uses Grid/ROOT authentication and a Grid access control service; proofd startup goes through the Grid service interfaces; the client retrieves the list of logical files (LFN + MSN) from the Grid file/metadata catalogue; the PROOF master server drives PROOF sub-master servers, which in turn drive the PROOF slave servers at each site; guaranteed site access is obtained through PROOF sub-masters calling out to the master (agent technology); the slave servers access the data via xrootd from local disk pools)

34 Running PROOF

   // Connect to the AliEn Grid and query the file catalogue
   TGrid *alien = TGrid::Connect("alien");
   TGridResult *res;
   res = alien->Query("lfn:///alice/simulation/2001-04/V0.6*.root");

   // Build a chain of AOD trees from the query result
   TChain *chain = new TChain("AOD");
   chain->Add(res);

   // Open a PROOF session and process the chain in parallel
   gROOT->Proof("master");
   chain->Process("myselector.C");

   // plot/save objects produced in myselector.C
   ...

35 Conclusions
 - Amdahl's law shows that making truly scalable parallel applications is very hard
 - Parallelism in HEP off-line computing is still lagging
 - To solve the LHC data analysis problem, parallelism is the only solution
 - To make good use of the current and future generations of multi-core CPUs, parallel applications are required

