Simplifying Parallel Programming with Incomplete Parallel Languages Laxmikant Kale Computer Science

Presentation transcript:

1 Simplifying Parallel Programming with Incomplete Parallel Languages
Laxmikant Kale, Computer Science
http://charm.cs.uiuc.edu

2 Requirements
– Composability and interoperability
– Respect for locality
– Dealing with heterogeneity
– Dealing with the memory wall
– Dealing with dynamic resource variation
  – e.g., a machine running 2 parallel apps on 64 cores needs to run a third one
  – Shrink and expand the set of cores assigned to a job
– Dealing with static resource variation
  – i.e., a parallel app should run unchanged on the next-generation manycore with twice as many cores
– Above all: simplicity

3 How to Simplify Parallel Programming
The question to ask is: what to automate?
– NOT what can be done relatively easily by the programmer, e.g., deciding what to do in parallel
– Strive for a good balance between "system" and programmer; the balance evolves toward more automation
This talk:
– A sequence of ideas, building on each other
– Developed over a long period of time

4 Object-Based Over-Decomposition
– Programmer: over-decomposes the computation into entities (objects, VPs, i.e., virtual processors), independent of the number of cores
– Runtime: assigns VPs to processors; this empowers the adaptive runtime system
– Embodied in the Charm++ system
[Figure: the programmer's view, a graph of communicating objects, vs. the system implementation mapping those objects onto processors]
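A minimal sketch of what over-decomposition looks like in Charm++ source (the module, names, and decomposition factor here are illustrative assumptions, not from the slide). The programmer sizes the chare array to the application logic; the runtime places, and may migrate, the chares:

  // block.ci -- Charm++ interface file (hypothetical)
  // module block {
  //   array [1D] Block {
  //     entry Block();
  //     entry void ghost(int n, double data[n]);  // asynchronous method invocation
  //   };
  // };

  // block.C
  #include <vector>
  #include "block.decl.h"

  class Block : public CBase_Block {        // one VP; many of these per core
    std::vector<double> mydata;
   public:
    Block() : mydata(1024, 0.0) {}
    void ghost(int n, double* data) {       // runs when its message is scheduled
      // ...consume boundary data, compute, then send to a neighbor
      thisProxy[thisIndex + 1].ghost(n, data);  // async send (bounds handling omitted)
    }
  };
  #include "block.def.h"

  // Creation, e.g. in the main chare: over-decompose relative to the core count
  //   CProxy_Block blocks = CProxy_Block::ckNew(8 * CkNumPes());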

5 Some Benefits of the Charm++ Model
Software engineering
– Number of VPs chosen to match application logic, not physical cores; i.e., the programming model is independent of the number of processors
– Separate VPs for different modules
Dynamic mapping
– Heterogeneity: vacate a node, adjust to its speed, share it
– Change the set of processors used
– Dynamic load balancing
Message-driven execution
– Compositionality
– Predictability

6 Message-Driven Execution
Object-based virtualization leads to message-driven execution.
[Figure: each processor runs a scheduler that picks the next message from its queue and invokes the targeted object's entry method]
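The figure can be read as the following loop (an illustrative sketch with made-up types, not the actual Charm++ scheduler): execution is driven entirely by the arrival of messages, never by an object blocking for one.

  #include <vector>

  struct Message { int objId; int entryId; /* marshalled arguments... */ };

  // Hypothetical queue interface: pop() waits until a message is available.
  struct MessageQueue {
    Message pop();
    Message* peek();   // next message, or nullptr if the queue is empty
  };

  struct Object { virtual void invoke(int entryId, Message& m) = 0; };

  void schedulerLoop(MessageQueue& q, std::vector<Object*>& objects) {
    for (;;) {
      Message m = q.pop();                     // wait for the next message
      objects[m.objId]->invoke(m.entryId, m);  // run that entry method to completion
    }                                          // no object ever blocks or busy-waits
  }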

7 Adaptive Overlap and Modules
Message-driven execution is critical for compositionality: when one module waits on remote data, the scheduler adaptively overlaps that idle time with work from other modules.
Gursoy and Kale, Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995

8 Charm++ and CSE Applications
The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE, in a synergy between the fields: a well-known biophysics molecular simulation application (Gordon Bell Award, 2002), computational astronomy, nano-materials, …

9 CSE to Manycore
The Charm++ model has succeeded in CSE/HPC
– Because of: resource management, …
– In spite of: being based on C++ (not Fortran) and a message-driven model, …
– Over a one-year period, 15% of cycles at NCSA and 20% at PSC were used by Charm++ apps
But it is an even better fit for desktop programmers
– C++, event-driven execution
– Predictability of data/code accesses

10 Why Is This Model Suitable for Manycores?
– Objects connote and promote locality
– Message-driven execution provides a strong principle of prediction for data and code use, much stronger than the principle of locality
– This can be used to scale the memory wall: prefetching of needed data, into scratchpad memories for example

11 Charm++ & Cell?
Data encapsulation / locality
– Each message is associated with code (an entry method) and data (the message and chare data)
– Entry methods tend to access data local to the chare and message
Virtualization (many chares per processor)
– Provides the opportunity to overlap SPE computation with DMA transactions
– Helps ensure there is always useful work to do
Message-queue peek-ahead / predictability
– Peek ahead in the message queue to determine future work
– Fetch code and data before execution of the entry method
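A sketch of the peek-ahead idea, reusing the hypothetical types from the earlier scheduler sketch. startFetch is an assumed helper standing in for the DMA requests that would stage an entry method's code and its object's data into local store (e.g., an SPE scratchpad):

  void startFetch(int objId, int entryId);   // assumed: issues async DMA requests

  void schedulerLoopWithPrefetch(MessageQueue& q, std::vector<Object*>& objects) {
    Message cur = q.pop();
    for (;;) {
      if (Message* next = q.peek())               // inspect future work first
        startFetch(next->objId, next->entryId);   // overlap the DMA with compute
      objects[cur.objId]->invoke(cur.entryId, cur);
      cur = q.pop();                              // its data is (likely) already local
    }
  }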

12 So, I expect Charm++ to be a strong contender among manycore programming models
BUT: what about the quest for simplicity?
Charm++ is powerful, but not much simpler than, say, MPI

13 How to Attain Simplicity?
Parallel programming is much too complex
– In part because of resource management issues: handled by adaptive runtime systems
– In part because of lack of support for common patterns
– In larger part because of unintended non-determinacy (race conditions)
Clearly, we need simple models
– But what are we willing to give up? (No free lunch)
– Give up "completeness"!?
– Maybe one can design a language that is simple to use, but not expressive enough to capture all needs

14 Incomplete Models?
A collection of "incomplete" languages, backed by a few complete ones, will do the trick
– As long as they are interoperable and support parallel composition
Recall:
– Message-driven objects promote exactly such interoperability
– Different modules, written in different languages/paradigms, can overlap in time and on processors, without the programmer having to worry about this explicitly

15 Simplicity
Where does simplicity come from?
– Support common patterns of parallel behavior
– Outlaw non-determinacy!
– Deterministic, simple, parallel programming models (with Marc Snir, Vikram Adve, …)
Are there examples of such paradigms?
– Multiphase Shared Arrays [LCPC '04]
– Charisma++ [LCR '04, HPDC '07]

16 Multiphase Shared Arrays (MSA)
Observations:
– The general shared-address-space abstraction is complex
– Yet certain special cases are simple, and cover most uses
In MSA (see the MD example on the next slide):
– Each array is in one mode at a time, but its mode may change from phase to phase
– Modes: write-once, read-only, accumulate, owner-computes
– Arrays are organized into pages (pagination)
– All workers sync at the end of each phase

17 MSA Example: Plimpton-Style Molecular Dynamics

  for timestep = 0 to Tmax {
    // Phase I: force computation, for a section of the interaction matrix
    for i = i_start to i_end
      for j = j_start to j_end
        if (nbrList[i][j]) {                 // nbrList is in read-only mode
          force = calculateForce(coords[i], atominfo[i], coords[j], atominfo[j]);
          forces[i] += force;                // forces is in accumulate mode
          forces[j] += -force;
        }
    nbrList.sync(); forces.sync(); coords.sync();

    // Phase II: integration
    for k = myAtomsBegin to myAtomsEnd
      coords[k] = integrate(atominfo[k], forces[k]);   // coords is in write-only mode
    coords.sync(); atominfo.sync(); forces.sync();

    // Phase III: update the neighbor list every 8 steps
    if (timestep % 8 == 0) {
      for i = i_start to i_end
        for j = j_start to j_end
          nbrList[i][j] = distance(coords[i], coords[j]) < CUTOFF;
      nbrList.sync();
    }
  }
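Note the discipline the example follows: each sync() is a phase boundary at which an array's mode may change, and within a phase an array is only read, only written, or only accumulated into. This is what rules out read/write races and keeps the program deterministic, per the "outlaw non-determinacy" goal above.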

18 The Charisma Programming Model
Static data flow
– Suffices for a number of applications: molecular dynamics, FEM, PDEs, etc.
Explicit specification of
– Global data flow and global control flow
– Arrays of objects
– Global parameter space: objects read from and write into it
Clean division between
– Parallel (orchestration) code
– Sequential methods
[Figure: worker objects communicating through buffers in the parameter space (PS)]

19 Charisma++ Example (Simple)

  while (e > threshold)
    forall i in J
      <lb[i], rb[i], +e> := J[i].compute(rb[i-1], lb[i+1]);

In each iteration, worker J[i] consumes its neighbors' boundary values from the parameter space and publishes its own left and right boundaries, plus its contribution to the error e (accumulated across workers); the communication is expressed entirely by this data flow.

20 Summary
A sequence of key ideas:
– Respect for locality
– Object-based decomposition
  – Automated resource management
  – Interoperability
  – Multiple interaction paradigms: async method invocations, messaging, …
– Further simplification needs a new idea: incomplete but simple languages
  – Specifically, a complete collection that includes incomplete languages
– Two new languages in this mold: Multiphase Shared Arrays (MSA) and Charisma
More info: http://charm.cs.uiuc.edu

