Presentation transcript: "Parallel Programming Models in the Era of Multi-core Processors" — Laxmikant Kale, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

Slide 1: Parallel Programming Models in the Era of Multi-core Processors
Laxmikant Kale — http://charm.cs.uiuc.edu
Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

Slide 2: Requirements
- Composability
- Respect for locality
- Dealing with heterogeneity
- Dealing with the memory wall
- Dealing with dynamic resource variation
  - A machine running 2 parallel apps on 64 cores needs to run a third one
  - Shrink and expand the set of cores assigned to a job
- Dealing with static resource variation, i.e. forward scaling
  - A parallel app should run unchanged on the next-generation manycore with twice as many cores
- Above all: simplicity

Slide 3: Guidelines
- A guideline that appeals to me: bottom-up, application-driven development of abstractions
- Aim at a good division of labor between the programmer and the system
  - Automate what the system can do well
  - Allow programmers to do what they can do best

Slide 4: Foundation — An Adaptive Runtime System Based on Migratable Objects
- Programmer: [over]decomposition into virtual processors (VPs) — the user view
- Runtime: assigns VPs to processors — the system implementation; enables adaptive runtime strategies
- Implementations: Charm++, AMPI
- Benefits
  - Software engineering: the number of VPs matches the application logic (not the number of physical cores); separate VPs for different modules
  - Message-driven execution: predictability; asynchronous reductions
  - Dynamic mapping: heterogeneity (vacate, adjust to speed, share); change the set of processors used; dynamic load balancing
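As a concrete illustration of the migratable-objects model described above, here is a minimal Charm++ sketch (not from the talk; the file names, the 64-element decomposition, and the Main/Worker names are illustrative assumptions). A chare array is over-decomposed independently of the core count, and entry-method invocations are asynchronous messages handled by the runtime's scheduler.

  // hello.ci -- interface file (illustrative)
  mainmodule hello {
    readonly CProxy_Main mainProxy;
    mainchare Main {
      entry Main(CkArgMsg *m);
      entry void done();
    };
    array [1D] Worker {
      entry Worker();
      entry void work(int step);
    };
  };

  // hello.C
  #include "hello.decl.h"

  /*readonly*/ CProxy_Main mainProxy;
  const int NUM_VPS = 64;            // many more virtual processors than physical cores

  class Main : public CBase_Main {
    int count;
   public:
    Main(CkArgMsg *m) : count(0) {
      delete m;
      mainProxy = thisProxy;
      CProxy_Worker workers = CProxy_Worker::ckNew(NUM_VPS);  // over-decomposition
      workers.work(0);               // asynchronous broadcast of an entry-method invocation
    }
    void done() {                    // each worker reports back when it finishes
      if (++count == NUM_VPS) CkExit();
    }
  };

  class Worker : public CBase_Worker {
   public:
    Worker() {}
    Worker(CkMigrateMessage *m) {}   // needed so the runtime can migrate this object
    void work(int step) {
      // ... compute on this virtual processor's share of the data ...
      mainProxy.done();              // asynchronous message back to Main
    }
  };

  #include "hello.def.h"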

Slide 5: What is the cost of processor virtualization?

Slide 6: "Overhead" of Virtualization
- Fragmentation cost?
  - Cache performance improves
  - Adaptive overlap improves
  - Difficult to see any cost at all
- Fixable problems:
  - Memory overhead (larger ghost areas)
  - Fine-grained messaging, where V = overhead per message, Tp = completion time on p processors, G = grainsize (computation per message) — see the estimate below
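A rough, worked reading of these variables (my assumption; the slide itself gives only the definitions): if the total sequential work is $T_1$ and each message carries $G$ units of computation at an overhead of $V$, the number of messages is about $T_1/G$, so

  $$T_p \approx \frac{T_1}{p} + \frac{V\,T_1}{G\,p} = \frac{T_1}{p}\left(1 + \frac{V}{G}\right),$$

i.e. the relative overhead of fine-grained messaging is roughly $V/G$ and shrinks as the grains are made coarser.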

Slide 7: Modularity and Concurrent Composition

Slide 8: Message-Driven Execution
- Virtualization leads to message-driven execution (a scheduler picks work from a message queue)
- Which leads to automatic, adaptive overlap of computation and communication
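A minimal sketch of the scheduler loop implied by this slide (my illustration in plain C++, not the actual Charm++ scheduler): work arrives as messages addressed to objects, and the scheduler simply runs whichever entry method has a message waiting, so one object's computation overlaps another object's pending communication.

  #include <queue>
  #include <functional>

  // A queued "message": the entry method to run, bound to its target object and data.
  struct Message {
    std::function<void()> entryMethod;
  };

  // The scheduler picks the next available message and executes its entry method.
  void schedulerLoop(std::queue<Message>& q) {
    while (!q.empty()) {
      Message m = std::move(q.front());
      q.pop();
      m.entryMethod();   // while this runs, messages for other objects keep arriving,
                         // giving automatic, adaptive overlap of computation and communication
    }
  }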

Slide 9: Adaptive Overlap and Modules — SPMD vs. Message-Driven Modules
- From A. Gursoy, "Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance", Ph.D. thesis, April 1994
- See also "Modularity, Reuse, and Efficiency with Message-Driven Libraries", Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995

Slide 10: NAMD — A Production MD Program
- Fully featured program
- NIH-funded development
- Installed at NSF centers
- Large published simulations
- We were able to demonstrate the utility of adaptive overlap, and shared the Gordon Bell award in 2002
- Collaboration with K. Schulten, R. Skeel, and coworkers

Slide 11: Integration and PME/3D-FFT electrostatics (diagram)

Slide 12: Modularization
- Logical units are decoupled from the "number of processors"
  - e.g., oct-tree nodes for particle data
  - No artificial restriction on the number of processors (such as a cube of a power of 2)
- Modularity
  - Software engineering: cohesion and coupling
  - MPI's "are on the same processor" is a bad coupling principle
  - Objects liberate you from that: e.g., the solid and fluid modules in a rocket simulation

Slide 13: Rocket Simulation
- Large collaboration headed by Mike Heath (DOE-supported ASCI center)
- Challenge: a multi-component code, with modules from independent researchers; MPI was the common base
- AMPI: new wine in an old bottle
  - Easier to convert
  - Can still run the original codes on MPI, unchanged

Slide 14: AMPI — 7 MPI processes (diagram)

Slide 15: AMPI on real processors — the 7 MPI "processes" are implemented as virtual processors (user-level migratable threads) (diagram)
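A minimal sketch of what "MPI processes as virtual processors" means in practice (my illustration, not from the talk): an ordinary MPI program is compiled with the AMPI wrappers and launched with more virtual processors than physical ones. The compiler wrapper and runtime flags (ampicxx, charmrun +p, +vp) are as I recall them from AMPI documentation of this era and should be treated as illustrative.

  // hello.C -- unchanged MPI code; under AMPI each "rank" is a migratable user-level thread
  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::printf("virtual processor %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
  }

Usage (illustrative):

  ampicxx -o hello hello.C
  ./charmrun +p2 ./hello +vp7     # 7 virtual processors multiplexed onto 2 physical ones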

Slide 16: Rocket simulation components in AMPI (diagram: Rocflo, Rocface, and Rocsolid instances distributed across processors)

Slide 17: AMPI and Roc* communications (diagram: communication among Rocflo, Rocface, and Rocsolid components)

Slide 18: The research ecosystem around the migratable-objects model (diagram)
- Automatic adaptive runtime optimizations
- New parallel languages and enhancements (MSA, Charisma, ...)
- Applications — especially dynamic, irregular, and difficult-to-parallelize ones
- How to build better parallel machines: communication support (SW/HW), OS support, memory management
- BigSim
- Resource management on computational grids

Slide 19: Charm++/AMPI Are Mature Systems
- Available on all parallel machines we know of: clusters; vendor supported: IBM, SGI, HP (Q), BlueGene/L, ...
- Tools: performance analysis/visualization, debuggers, live visualization, libraries and frameworks
- Used by many applications: 17,000+ installations; NAMD, rocket simulation, quantum chemistry, space-time meshes, animation graphics, astronomy, ...
- It is C++, with message (event) driven execution — so, a familiar model for desktop programmers

Slide 20: Parallel Objects, Adaptive Runtime System, Libraries and Tools
- The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE
- We develop abstractions in the context of full-scale applications
- Examples: crack propagation, space-time meshes, computational cosmology, rocket simulation, protein folding, dendritic growth, quantum chemistry (LeanCP), NAMD (molecular dynamics), STM virus simulation

Slide 21: CSE to Manycore
- The Charm++ model has succeeded in CSE/HPC
  - Because of resource management, ...
  - In spite of being based on C++ (not Fortran) and its message-driven model, ...
- But it is an even better fit for desktop programmers: C++, event-driven execution, predictability of data/code accesses
- 15% of cycles at NCSA and 20% at PSC were used by Charm++ apps over a one-year period

Slide 22: Why Is It Suitable for Multi-cores?
- Objects connote and promote locality
- Message-driven execution is a strong principle of prediction for data and code use — much stronger than the principle of locality
- This can be used to scale the memory wall: prefetching of needed data, into scratch-pad memories for example (the scheduler peeks at the message queue)

Slide 23: Why Charm++ & Cell?
- Data encapsulation / locality
  - Each message is associated with code (an entry method) and data (the message itself and the chare's data)
  - Entry methods tend to access data local to the chare and the message
- Virtualization (many chares per processor)
  - Provides the opportunity to overlap SPE computation with DMA transactions
  - Helps ensure there is always useful work to do
- Message-queue peek-ahead / predictability (see the sketch below)
  - Peek ahead in the message queue to determine future work
  - Fetch code and data before the entry method executes
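A sketch of the peek-ahead idea (my illustration; the names and the fixed lookahead of 2 are assumptions, not the Charm++/Cell implementation): before running the current entry method, look at the next few queued messages and start asynchronous fetches of the chare data they will need, e.g. DMA into an SPE's local store.

  #include <deque>

  struct QueuedMsg {
    int targetChare;                  // which object the message is addressed to
    void (*entryMethod)(int chare);   // the code it will run
  };

  // Hypothetical asynchronous fetch: start moving the chare's data into local store.
  void prefetchChareData(int chare) {
    // issue a DMA / touch the pages for this chare's data (placeholder)
  }

  void scheduleWithPeekAhead(std::deque<QueuedMsg>& q, int lookahead = 2) {
    while (!q.empty()) {
      // Peek ahead: issue fetches for the data the next messages will need.
      for (int i = 1; i <= lookahead && i < (int)q.size(); ++i)
        prefetchChareData(q[i].targetChare);
      QueuedMsg m = q.front();
      q.pop_front();
      m.entryMethod(m.targetChare);   // by now this chare's data is (hopefully) already local
    }
  }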

Slide 24: System View on Cell — work by David Kunzman (with Gengbin Zheng and Eric Bohm)

Slide 25: Charm++ on Cell — Roadmap (diagram)

Slide 26: SMP Implementation of Charm++
- Has existed since the beginning
  - An original aim (circa 1988) was portability between shared-memory and distributed-memory machines: Multimax, Sequent, hypercubes
- SMP-nodes version: exploits shared memory within a node
- Performance issues remain...

Slide 27: So, I expect Charm++ to be a strong contender for manycore models
- BUT: what about the quest for simplicity?
- Charm++ is powerful, but not much simpler than, say, MPI

Slide 28: How to Get to Simple Parallel Programming Models?
- Parallel programming is much too complex
  - In part because of resource management issues, which are handled by adaptive runtime systems
  - In larger part because of unintended non-determinacy: race conditions
- Clearly, we need simple models
  - But what are we willing to give up? (No free lunch)
  - Give up "completeness"!?
  - Maybe one can design a language that is simple to use, but not expressive enough to capture all needs

Slide 29: Simplicity?
- A collection of "incomplete" languages, backed by a (few) complete ones, will do the trick — as long as they are interoperable
- Where does simplicity come from?
  - Outlaw non-determinacy!
  - Deterministic, simple, parallel programming models (with Marc Snir, Vikram Adve, ...)
- Are there examples of such paradigms?
  - Multiphase Shared Arrays [LCPC '04]
  - Charisma++ [LCR '04]

Slide 30: Shared Memory or Not
- Smart people on both sides: thesis, antithesis — clearly, a "synthesis" is needed
- "Shared memory is easy to program" holds only a grain of truth — but that grain of truth does exist
- We, as a community, need to have this debate
  - Put some armor on, drink the friendship potion, and debate the issue threadbare
  - What do we mean by the SAS model, and what do we like and dislike about it?

Slide 31: Multiphase Shared Arrays (MSA)
- Observations:
  - The general shared-address-space abstraction is complex
  - Certain special cases are simple, and cover most uses
- Each array is in one mode at a time, but its mode may change from phase to phase
- Modes: write-once, read-only, accumulate, owner-computes
- All workers sync at the end of each phase

Slide 32: MSA — The Simple Model
- A program consists of:
  - A collection of Charm threads, and
  - Multiple collections of data arrays, partitioned into pages (user-specified)
- Execution begins in a "main"; then all threads are fired in parallel
- A more complex model allows multiple collections of threads, ...
(Diagram: thread and array collections)

Slide 33: MSA Example — Matrix Multiplication

  // Note: the angle-bracketed template arguments were garbled in the transcript;
  // a double element type is assumed here and the elided entry-class argument is shown as "...".
  typedef MSA2D<double, ..., 4096, MSA_ROW_MAJOR> MSA2DRowMjr;
  typedef MSA2D<double, ..., 4096, MSA_COL_MAJOR> MSA2DColMjr;

  // One thread creates the MSAs and broadcasts their IDs
  MSA2DRowMjr A(ROWS1, COLS1, NUMWORKERS, cacheSize1);   // row major
  MSA2DColMjr B(ROWS2, COLS2, NUMWORKERS, cacheSize2);   // column major
  MSA2DRowMjr C(ROWS1, COLS2, NUMWORKERS, cacheSize3);   // product matrix

  // Each thread executes the following code
  A.enroll(); B.enroll(); C.enroll();
  ...
  for (unsigned int c = 0; c < COLS2; c++) {
    // Each thread computes a subset of the rows of the product matrix
    for (unsigned int r = rowStart; r <= rowEnd; r++) {
      double result = 0.0;
      for (unsigned int k = 0; k < COLS1; k++)
        result += A[r][k] * B[k][c];
      C[r][c] = result;
    }
  }
  C.sync();
  // use the product matrix afterwards...

Slide 34: MSA — Plimpton's Algorithm for Molecular Dynamics

  // Declarations of the 3 arrays. As on slide 33, the angle-bracketed template
  // arguments were lost in the transcript; the element types shown are assumed.
  class XYZ;       // { double x; double y; double z; ... }
  typedef MSA1D<XYZ, ..., DEFAULT_PAGE_SIZE> XyzMSA;
  class AtomInfo;
  typedef MSA1D<AtomInfo, ..., PAGE_SIZE> AtomInfoMSA;
  typedef MSA2D<bool, ..., PAGE_SIZE, MSA_ROW_MAJOR> NeighborMSA;

  XyzMSA coords;
  XyzMSA forces;
  AtomInfoMSA atominfo;
  NeighborMSA nbrList;
  // broadcast the above array handles to the worker threads

  // Each thread executes the following code
  coords.enroll(numberOfWorkerThreads);
  forces.enroll(numberOfWorkerThreads);
  atominfo.enroll(numberOfWorkerThreads);
  nbrList.enroll(numberOfWorkerThreads);

Slide 35: MSA — Plimpton MD, Timestep Loop

  for timestep = 0 to Tmax {
    // Phase I: force computation, over a section of the interaction matrix
    for i = i_start to i_end
      for j = j_start to j_end
        if (nbrList[i][j]) {                  // nbrList is in read-only mode
          force = calculateForce(coords[i], atominfo[i], coords[j], atominfo[j]);
          forces[i] += force;                  // accumulate mode
          forces[j] += -force;
        }
    nbrList.sync(); forces.sync(); coords.sync(); atominfo.sync();

    // Phase II: integration
    for k = myAtomsBegin to myAtomsEnd
      coords[k] = integrate(atominfo[k], forces[k]);   // write-only mode
    coords.sync(); atominfo.sync(); forces.sync();

    // Phase III: update the neighbor list every 8 steps
    if (timestep % 8 == 0) {
      for i = i_start to i_end
        for j = j_start to j_end
          nbrList[i][j] = distance(coords[i], coords[j]) < CUTOFF;
      nbrList.sync(); coords.sync();
    }
  }

Slide 36: Extensions
- Need a check on each access: is the page here?
  - Pre-fetching, and known-local accesses
- A twist on accumulate (see the sketch below)
  - Each array element can be a set
  - Set union is a valid accumulate operation
  - Example: appending a list of (x, y) points
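A small sketch of the set-valued accumulate described above (my illustration; PointSet and its operators are assumed names, not the MSA entry-class interface): because set union is associative and commutative, contributions can be merged in any order, which is exactly what the accumulate mode requires.

  #include <set>
  #include <utility>

  // Each array element is a set of (x, y) points; += performs set union.
  struct PointSet {
    std::set<std::pair<double, double>> pts;

    PointSet& operator+=(const std::pair<double, double>& p) {
      pts.insert(p);                                   // append one point
      return *this;
    }
    PointSet& operator+=(const PointSet& other) {
      pts.insert(other.pts.begin(), other.pts.end());  // union of two contributions
      return *this;
    }
  };

Any number of workers can += into the same element during an accumulate phase; the result visible after sync() is independent of the order in which their contributions arrive.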

Slide 37: MSA — Graph Partitioning

  // Phase I: EtoN read-only, NtoE accumulate
  for i = 1 to EtoN.length()
    for j = 1 to EtoN[i].length() {
      n = EtoN[i][j];
      NtoE[n] += i;                    // accumulate
    }
  EtoN.sync(); NtoE.sync();

  // Phase II: NtoE read-only, EtoE accumulate
  for j = my section of j
    // for each pair (e1, e2) in NtoE[j]
    for i1 = 1 to NtoE[j].length()
      for i2 = i1 + 1 to NtoE[j].length() {
        e1 = NtoE[j][i1];
        e2 = NtoE[j][i2];
        EtoE[e1] += e2;                // accumulate
        EtoE[e2] += e1;
      }
  NtoE.sync(); EtoE.sync();

Slide 38: Charisma
- Static data flow
  - Suffices for a number of applications: molecular dynamics, FEM, PDEs, etc.
- Global data and control flow are explicit (unlike Charm++)

Slide 39: The Charisma Programming Model
- Arrays of objects (worker objects)
- A global parameter space (PS) of buffers; objects read from and write into the PS
- Clean division between parallel (orchestration) code and sequential methods

Slide 40: Example — Stencil Computation

  foreach x,y in workers
    (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
  end-foreach
  foreach x,y in workers
    (+err) <- workers[x,y].compute(lb[x+1,y], rb[x-1,y], ub[x,y+1], db[x,y-1]);
  end-foreach

  while (err > epsilon)
    foreach x,y in workers
      (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
      (+err) <- workers[x,y].compute(lb[x+1,y], rb[x-1,y], ub[x,y-1], db[x,y+1]);
    end-foreach
  end-while

  // (+err) denotes a reduction on the variable err

Slide 41: Language Features
- Communication patterns: P2P, multicast, scatter, gather
- Determinism: methods are invoked on objects in program order
- Support for libraries: use external libraries, or create your own

  // Point-to-point (producer/consumer):
  foreach w in workers
    (p[w]) <- workers[w].produce();
    workers[w].consume(p[w-1]);
  end-foreach

  // Multicast (each p[r] is consumed by a whole row of consumers):
  foreach r in workers
    (p[r]) <- workers[r].produce();
  end-foreach
  foreach r,c in consumers
    consumers[r,c].consume(p[r]);
  end-foreach

  // Scatter (workers[r] produces p[r,*]; each consumer takes its own piece):
  foreach r in workers
    (p[r,*]) <- workers[r].produce();
  end-foreach
  foreach r,c in consumers
    consumers[r,c].consume(p[r,c]);
  end-foreach

Slide 42: Charisma++ Example — (Simple) Jacobi 1D

  // The produced-value tuples on the left of := were garbled in the transcript;
  // they are reconstructed here following the pattern of slide 40.
  begin
    forall i in J
      <lb[i], rb[i]> := J[i].init();
    end-forall
    while (e > threshold)
      forall i in J
        <+e, lb[i], rb[i]> := J[i].compute(rb[i-1], lb[i+1]);
      end-forall
    end-while
  end

Slide 43: Charisma — Motivation
- Rocket simulation example under traditional MPI vs. the Charm++/AMPI framework
  - Benefit: load balance, communication optimizations, modularity
  - Problem: the flow of control is buried in asynchronous method invocations
(Diagram: solid/fluid modules paired on processors 1..P vs. independent collections Solid 1..n and Fluid 1..m)

Slide 44: Motivation — Car-Parrinello Ab Initio Molecular Dynamics (CPMD), Charisma presentation (diagram)

Slide 45: Molecular Dynamics with Spatial Decomposition (Charisma orchestration)

  // As on slide 42, the produced-value tuples were garbled in the transcript and are
  // reconstructed here; the names atoms[...] and forces[...] follow their later uses.
  foreach i,j,k in cells
    <atoms[i,j,k]> := cells[i,j,k].produceAtoms();
  end-foreach
  for iter := 0 to MAX_ITER
    foreach i1,j1,k1,i2,j2,k2 in cellpairs
      <+forces[i1,j1,k1], +forces[i2,j2,k2]> :=
          cellpairs[i1,j1,k1,i2,j2,k2].computeCoulombForces(atoms[i1,j1,k1], atoms[i2,j2,k2]);
    end-foreach
    // foreach ... for bonded forces: uses atoms and adds to forces
    foreach i,j,k in cells
      <atoms[i,j,k]> := cells[i,j,k].integrate(forces[i,j,k]);
    end-foreach
  end-for

Slide 46: A set of "incomplete" but elegant/simple languages, backed by a low-level complete one
(Diagram: a multi-module application built from Charisma, MSA, and AMPI over TCharm and Charm++; other abstractions: GA, CAF, UPC, PPL1)

Slide 47: Interoperable Multi-paradigm Programming (diagram)

Slide 48: Let's Play Together
- Multiple programming models need to be investigated
  - "Survival of the fittest" doesn't lead to a single species; it leads to an ecosystem
  - Different models may be good for different algorithms/domains/...
- Allow them to interoperate in a multi-paradigm environment

Slide 49: Summary
- It is necessary to raise the level of abstraction
  - Foundation: an adaptive runtime system, based on migratable objects
    - Automate resource management
    - Composability
    - Interoperability
  - Design new models that avoid data races and promote locality
  - Incorporate the good aspects of the shared-memory model
- More info on my group's work: http://charm.cs.uiuc.edu

