Simplifying Parallel Programming with Incomplete Parallel Languages Laxmikant Kale Computer Science

Presentation transcript:


Requirements
Composability and interoperability
Respect for locality
Dealing with heterogeneity
Dealing with the memory wall
Dealing with dynamic resource variation
– A machine running 2 parallel apps on 64 cores needs to run a third one
– Shrink and expand the sets of cores assigned to a job
Dealing with static resource variation
– i.e., a parallel app should run unchanged on the next-generation manycore with twice as many cores
Above all: Simplicity
August 2008

How to Simplify Parallel Programming
The question to ask is: what to automate?
– NOT what can be done relatively easily by the programmer, e.g., deciding what to do in parallel
– Strive for a good balance between the "system" and the programmer; the balance evolves towards more automation
This talk:
– A sequence of ideas, building on each other
– Developed over a long period of time

Object-based Over-decomposition
Programmer: over-decomposition into computational entities (objects, VPs, i.e., virtual processors)
Runtime: assigns VPs to processors
This empowers the adaptive runtime system, embodied in the Charm++ system
[Figure: programmer's view vs. system implementation]

Some benefits of the Charm++ model
Software engineering
– Number of VPs chosen to match application logic (not physical cores), i.e., "the programming model is independent of the number of processors"
– Separate VPs for different modules
Dynamic mapping
– Heterogeneity: vacate, adjust to speed, share
– Change the set of processors used
– Dynamic load balancing
Message-driven execution
– Compositionality
– Predictability

Message Driven Execution
Object-based virtualization leads to message-driven execution.
[Figure: per-processor scheduler serving a message queue]

Adaptive overlap and modules
Message-driven execution is critical for compositionality.
Gursoy, Kale. Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995.

Charm++ and CSE Applications
The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE: a well-known biophysics molecular simulation app (Gordon Bell Award, 2002), computational astronomy, nano-materials, ...

CSE to ManyCore
The Charm++ model has succeeded in CSE/HPC
– Because of: resource management, …
– In spite of: being based on C++ (not Fortran) and a message-driven model
– …% of cycles at NCSA, and 20% at PSC, were used on Charm++ apps in a one-year period
But it is an even better fit for desktop programmers
– C++, event-driven execution
– Predictability of data/code accesses

Why is this model suitable for many-cores?
Objects connote and promote locality
Message-driven execution
– A strong principle of prediction for data and code use
– Much stronger than the principle of locality
– Can be used to scale the memory wall: prefetching of needed data, into scratch-pad memories for example

Charm++ & Cell?
Data encapsulation / locality
– Each message is associated with code (an entry method) and data (the message and chare data)
– Entry methods tend to access data local to the chare and message
Virtualization (many chares per processor)
– Provides the opportunity to overlap SPE computation with DMA transactions
– Helps ensure there is always useful work to do
Message-queue peek-ahead / predictability
– Peek ahead in the message queue to determine future work
– Fetch code and data before execution of an entry method

So, I expect Charm++ to be a strong contender for manycore models
BUT: what about the quest for simplicity?
Charm++ is powerful, but not much simpler than, say, MPI

How to Attain Simplicity?
Parallel programming is much too complex
– In part, because of resource management issues: handled by adaptive runtime systems
– In part, because of lack of support for common patterns
– In larger part, because of unintended non-determinacy: race conditions
Clearly, we need simple models
– But what are we willing to give up? (No free lunch)
– Give up "completeness"!?
– Maybe one can design a language that is simple to use, but not expressive enough to capture all needs
www.upcrc.illinois.edu

Incomplete Models?
A collection of "incomplete" languages, backed by a (few) complete ones, will do the trick
– As long as they are interoperable, and
– support parallel composition
Recall:
– Message-driven objects promote exactly such interoperability
– Different modules, written in different languages/paradigms, can overlap in time and on processors, without the programmer having to worry about this explicitly

Simplicity
Where does simplicity come from?
– Support common patterns of parallel behavior
– Outlaw non-determinacy!
– Deterministic, simple, parallel programming models (with Marc Snir, Vikram Adve, ...)
Are there examples of such paradigms?
– Multiphase Shared Arrays: [LCPC '04]
– Charisma++: [LCR '04, HPDC '07]

Multiphase Shared Arrays
Observations:
– The general shared address space abstraction is complex
– Yet, certain special cases are simple, and cover most uses
In MSA:
– Each array is in one mode at a time, but its mode may change from phase to phase
– Modes: write-once, read-only, accumulate, owner-computes
– Pagination
– All workers sync at the end of each phase

MSA: Plimpton MD

    for timestep = 0 to Tmax {
      // Phase I: Force computation, for a section of the interaction matrix
      for i = i_start to i_end
        for j = j_start to j_end
          if (nbrlist[i][j]) {   // nbrlist enters ReadOnly mode
            force = calculateForce(coords[i], atominfo[i], coords[j], atominfo[j]);
            forces[i] += force;  // Accumulate mode
            forces[j] += -force;
          }
      nbrlist.sync(); forces.sync(); coords.sync();

      // Phase II: Integration
      for k = myAtomsBegin to myAtomsEnd
        coords[k] = integrate(atominfo[k], forces[k]);  // WriteOnly mode
      coords.sync(); atominfo.sync(); forces.sync();

      // Phase III: update neighbor list every 8 steps
      if (timestep % 8 == 0) {
        for i = i_start to i_end
          for j = j_start to j_end
            nbrList[i][j] = distance(coords[i], coords[j]) < CUTOFF;
        nbrList.sync();
      }
    }

The Charisma Programming Model
Static data flow
– Suffices for a number of applications: molecular dynamics, FEM, PDEs, etc.
Explicit specification of
– Global data flow
– Global control flow
Arrays of objects
Global parameter space
– Objects read from and write into it
Clean division between
– Parallel (orchestration) code
– Sequential methods
[Figure: worker objects connected through buffers (the parameter space)]

Charisma++ example (Simple)

    while (e > threshold)
      forall i in J
        (lb[i], rb[i]) := J[i].compute(rb[i-1], lb[i+1]);

Summary
A sequence of key ideas:
– Respect for locality
– Object-based decomposition: automated resource management, interoperability, and multiple interaction paradigms (async method invocations, messaging, ...)
– Further simplification needs a new idea: incomplete but simple languages; specifically, a complete collection that includes incomplete languages
– Two new languages in this mold: Multiphase Shared Arrays (MSA) and Charisma
More info: