CSE 230 Parallelism Slides due to: Kathleen Fisher, Simon Peyton Jones, Satnam Singh, Don Stewart

The Grand Challenge How to properly use multi-cores? Need new programming models!

Parallelism vs Concurrency A parallel program exploits real parallel computing resources to run faster while computing the same answer. – Expectation of genuinely simultaneous execution – Deterministic A concurrent program models independent agents that can communicate and synchronize. – Meaningful on a machine with one processor – Non-deterministic

Candidate models in Haskell Explicit threads  Non-deterministic by design  Monadic: forkIO and STM Semi-implicit parallelism  Deterministic  Pure: par and pseq Data parallelism  Deterministic  Pure: parallel arrays  Memory: Shared -> Distributed -> GPUs? main :: IO () = do { ch <- newChan ; forkIO (ioManager ch) ; forkIO(worker 1 ch)... etc... }
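To make the explicit-threads model concrete, here is a minimal, self-contained sketch using forkIO and a Chan. The trivial worker below, which just reports back over the channel, is made up for illustration (the slide's ioManager is omitted); only the forkIO/Chan API itself comes from the standard libraries.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Monad (forM_, replicateM_)

-- A stand-in worker: report completion back over the shared channel.
worker :: Int -> Chan String -> IO ()
worker n ch = writeChan ch ("worker " ++ show n ++ " done")

main :: IO ()
main = do
  ch <- newChan
  forM_ [1 .. 4] $ \n -> forkIO (worker n ch)  -- spawn four threads
  replicateM_ 4 (readChan ch >>= putStrLn)     -- messages arrive in non-deterministic order
```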

Haskell Execution Model fib 0 = 0 fib 1 = 1 fib n = fib(n-1) + fib(n-2) A “Thunk” for fib 10 holds a pointer to the implementation, a storage slot for the result, and the values of free variables.

FP to the Rescue? No side effects make parallelism easy? – Safe to speculate on pure code, so – Execute each sub-expression in its own thread. Alas, the 80s dream died: too many parallel tasks! – Most are not worth the forking overhead – Impossible for the compiler to guess which are worth forking. Idea: Give the user control over which expressions might run in parallel.

The `par` Combinator par :: a -> b -> b The value (thunk) bound to x in x `par` y is sparked for speculative evaluation. The runtime may run the spark on a thread executing in parallel with the parent thread. Operationally, x `par` y = y. Typically, x is used inside y: blurRows `par` (mix blurCols blurRows)
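For concreteness, here is a tiny, self-contained sketch of par in isolation (it assumes the Control.Parallel module from the standard parallel package; the two expensive* computations are made up for illustration). The spark for expensiveA may or may not be picked up by another core, but the result is the same either way:

```haskell
import Control.Parallel (par)

-- Two arbitrary, independent computations (illustrative only).
expensiveA, expensiveB :: Integer
expensiveA = sum [1 .. 5000000]
expensiveB = sum [1 .. 3000000]

main :: IO ()
main =
  -- Spark expensiveA, then compute the sum; operationally
  -- x `par` y = y, so this always prints the same answer.
  print (expensiveA `par` (expensiveB + expensiveA))
```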

Concurrency Hierarchy

The GHC Runtime Multiple virtual CPUs – Each virtual CPU has a pool of OS threads. – CPU-local spark pools for additional work. – Work-stealing queue to run sparks. Lightweight Haskell threads map many-to-one onto OS threads. Automatic thread migration and load balancing. Parallel, generational garbage collection.

The meaning of par par does not guarantee a new Haskell thread. It is a hint that it may be good to evaluate the first argument in parallel.

par is very cheap The runtime converts a spark into parallel work depending on load. Programmers can safely use it anywhere. It over-approximates program parallelism.

Example: One processor, x `par` (y + x): x is sparked, the single processor evaluates y, then the same thread evaluates x itself, so the spark for x fizzles.

Example: Two processors, x `par` (y + x): x is sparked on P1, P1 evaluates y, and meanwhile P2 takes up the spark and evaluates x in parallel.

Model: Two Processors

Model: One Processor No extra resources, so spark for f fizzles

No parallelism? Main thread demands f, so spark fizzles

Lucky Parallelism

That’s lame! How to force computation?

A second combinator: pseq pseq :: a -> b -> b x `pseq` y evaluates x in the current thread, then returns y. Semantics: x `pseq` y diverges if x diverges, and equals y otherwise.

A second combinator: pseq x `pseq` y controls evaluation order: pseq :: a -> b -> b f `par` (e `pseq` (f + e))

A second combinator: pseq

ThreadScope: Generated Event Logs Visualize and track spark behavior. With f `par` (f + e), Thread 2 sits idle (the spark fizzles); with f `par` (e + f), Thread 2 is busy evaluating f in parallel.

Sample Program fib and sumEuler are unchanged fib :: Int -> Int fib 0 = 0 fib 1 = 1 fib n = fib (n-1) + fib(n-2) sumEuler :: Int -> Int sumEuler n = … in ConcTutorial.hs … parSumFibEulerGood :: Int -> Int -> Int parSumFibEulerGood a b = f `par` (e `pseq` (f + e)) where f = fib a e = sumEuler b
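A complete, runnable version of this program might look as follows. This is a sketch, not the original ConcTutorial.hs: the body of sumEuler is elided on the slide, so the totient-summing definition below is a plausible stand-in, and the argument values in main are arbitrary.

```haskell
import Control.Parallel (par, pseq)

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

-- Stand-in definition: sum of Euler's totient over 1..n.
sumEuler :: Int -> Int
sumEuler n = sum [ length [ k | k <- [1 .. m], gcd m k == 1 ] | m <- [1 .. n] ]

parSumFibEulerGood :: Int -> Int -> Int
parSumFibEulerGood a b = f `par` (e `pseq` (f + e))
  where
    f = fib a
    e = sumEuler b

main :: IO ()
main = print (parSumFibEulerGood 38 5300)

-- Compile with:  ghc -O2 -threaded ParSum.hs
-- Run with:      ./ParSum +RTS -N2 -s
```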

Strategies Performance Numbers

Summary: Semi-implicit parallelism Deterministic: parallel result = sequential result. No races or errors. Good for reasoning: erase par and pseq to get the original program. Cheap: sprinkle par, measure, and refine. Where? Takes practice. Often get easy speed-ups.

Summary: Semi-implicit parallelism Deterministic, Correct, Cheap Purity is the Key!

Candidate models in Haskell Explicit threads  Non-deterministic by design  Monadic: forkIO and STM Semi-implicit parallelism  Deterministic  Pure: par and pseq Data parallelism  Deterministic  Pure: parallel arrays  Memory: Shared -> Distributed -> GPUs? main :: IO () = do { ch <- newChan ; forkIO (ioManager ch) ; forkIO(worker 1 ch)... etc... } f :: Int -> Int f x = a `par` b `pseq` a+b where a = f1 (x-1) b = f2 (x-2)

Data Parallelism
Flat Data Parallel — apply a sequential op over bulk data: the brand leader (Fortran, MPI, Map-Reduce); limited applicability (dense matrix, Map-Reduce); well developed; limited new opportunities.
Nested Data Parallel — apply a parallel op over bulk data: recently developed (1990s); wider applicability (sparse matrix, graph algorithms, games); practically undeveloped; a huge opportunity.

Flat Data Parallel foreach i in 1..N { ...do something to A[i]... } 1,000,000’s of small work items, spread across the processors. Widely used, well understood, well supported. Easy to implement (“chunking”), good cost model. But the “do something” is sequential: a single point of concurrency.

Nested Data Parallel Main idea: allow the “something” to be parallel. foreach i in 1..N { ...do something to A[i]... } Still 1,000,000’s of (small) work items, but the parallelism is recursive & unbalanced. Still a good cost model, but hard to implement!

Nested DP Great for Programmers Modularity opens up range of applications Divide and conquer (sort)

Nested DP Great for Programmers Modularity opens up range of applications Graph Algorithms (Shortest Paths, Spanning Trees)

Nested DP Great for Programmers Modularity opens up range of applications Sparse Arrays, Variable-Grid Adaptive Methods (Barnes-Hut)

Nested DP Great for Programmers Modularity opens up range of applications Physics Engines, Graphics (Delaunay Triangulation)

Nested DP Great for Programmers Modularity opens up range of applications Machine Learning, Optimization, Constraint Solving

But, Nested DP is hard for compilers! The NDP “tree” is irregular and fine-grained. But it can be done! [NESL, Blelloch 1995] Key idea: the “Flattening Transformation”, in which the compiler turns Nested Data Parallel code (what we want to write) into Flat Data Parallel code (what we want to run).

Data Parallel Haskell Substantial improvement in expressiveness and performance: shared memory now, distributed memory later, GPUs someday? Not a special-purpose data-parallel compiler! Most support is either useful for other things, or is in the form of library code. Compare NESL (Blelloch), a mega-breakthrough but: specialized, prototype; first order; few data types; no fusion; interpreted. Haskell: broad-spectrum, widely used; higher order; very rich data types; aggressive fusion; compiled.

Array Comprehensions vecMul :: [:Float:] -> [:Float:] -> Float vecMul v1 v2 = sumP [: f1*f2 | f1 <- v1 | f2 <- v2 :] [:Float:] is the type of parallel arrays of Float. An array comprehension: “the array of all f1*f2 where f1 is drawn from v1 and f2 from v2 in lockstep.” sumP :: [:Float:] -> Float Operations over parallel arrays are computed in parallel; that is the only way the programmer says “do parallel stuff.” NOTE: no locks!
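For intuition only, here is the same computation written over ordinary lists (a sequential sketch, not DPH code). The structure is identical; in the DPH version, sumP and the lockstep comprehension are evaluated in parallel over parallel arrays:

```haskell
-- Sequential list analogue of vecMul, for comparison with the
-- parallel-array version above.
vecMulSeq :: [Float] -> [Float] -> Float
vecMulSeq v1 v2 = sum [ f1 * f2 | (f1, f2) <- zip v1 v2 ]

-- vecMulSeq [1,2,3] [4,5,6]  ==>  32.0
```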

Sparse Vector Multiplication sDotP :: [:(Int,Float):] -> [:Float:] -> Float sDotP sv v = sumP [: f * (v!i) | (i,f) <- sv :] A sparse vector is represented as a vector of (index, value) pairs: [:(0,3),(2,10):] instead of [:3,0,10,0:]. v!i gets the i-th element of v. Parallelism is proportional to the length of the sparse vector. sDotP [:(0,3),(2,10):] [:2,1,1,4:] ==> sumP [: 3 * 2, 10 * 1 :] ==> 16

Sparse Matrix Multiplication smMul :: [:[:(Int,Float):]:] -> [:Float:] -> Float smMul sm v = sumP [: sDotP sv v | sv <- sm :] A sparse matrix is a vector of sparse vectors: [:[:(1,3),(4,10):], [:(0,2),(1,12),(4,6):]:] Nested data parallelism here! We are calling a parallel operation, sDotP, on every element of a parallel array, sm.
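Again for intuition only, here are sequential list analogues of the two sparse operations (illustrative sketches, not DPH code). The outer comprehension in smMulSeq is exactly where the nested parallelism arises in DPH: a parallel operation (the dot product) is applied to every element of a parallel array:

```haskell
-- Sequential list analogues of sDotP and smMul, for comparison.
sDotPSeq :: [(Int, Float)] -> [Float] -> Float
sDotPSeq sv v = sum [ f * (v !! i) | (i, f) <- sv ]

smMulSeq :: [[(Int, Float)]] -> [Float] -> Float
smMulSeq sm v = sum [ sDotPSeq sv v | sv <- sm ]

-- sDotPSeq [(0,3),(2,10)] [2,1,1,4]  ==>  3*2 + 10*1  ==>  16.0
```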

Example: Data-Parallel Quicksort sort :: [:Float:] -> [:Float:] sort a = if (lengthP a <= 1) then a else sa!0 +:+ eq +:+ sa!1 where p = a!0 lt = [: f | f <- a, f < p :] eq = [: f | f <- a, f == p :] gr = [: f | f <- a, f > p :] sa = [: sort a | a <- [:lt, gr:] :] 2-way nested data parallelism here. Parallel filters
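The same algorithm over ordinary lists, as a sequential reference point (a sketch for comparison, not DPH code). In the parallel-array version above, the three filters and the two recursive sorts inside sa are all data-parallel operations:

```haskell
-- Sequential list version of the quicksort above.
sortSeq :: [Float] -> [Float]
sortSeq a
  | length a <= 1 = a
  | otherwise     = sortSeq lt ++ eq ++ sortSeq gr
  where
    p  = head a
    lt = [ f | f <- a, f < p ]
    eq = [ f | f <- a, f == p ]
    gr = [ f | f <- a, f > p ]
```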

Parallel Sub-Sorts At the Same Level Step 1, Step 2, Step 3, ...etc... Segment vectors map chunks to sub-problems. Instant insanity when done by hand.

Example: Parallel Search type Doc = [: String :] -- Sequence of words type Corpus = [: Doc :] search :: Corpus -> String -> [: (Doc,[:Int:]):] Find all Docs that mention the string, along with the places where it is mentioned (e.g. word 45 and 99)

Example: Parallel Search type Doc = [: String :] -- Sequence of words type Corpus = [: Doc :] search :: Corpus -> String -> [: (Doc,[:Int:]):] wordOccs :: Doc -> String -> [: Int :] Find all the places where a string is mentioned in a doc (e.g. word 45, 99).

Example: Parallel Search type Doc = [: String :] -- Sequence of words type Corpus = [: Doc :] search :: Corpus -> String -> [: (Doc,[:Int:]):] search ds s = [: (d,is) | d <- ds, let is = wordOccs d s, not (nullP is) :] wordOccs :: Doc -> String -> [: Int :] nullP :: [:a:] -> Bool

Example: Parallel Search type Doc = [: String :] -- Sequence of words type Corpus = [: Doc :] search :: Corpus -> String -> [: (Doc,[:Int:]):] search ds s = [: (d,is) | d <- ds, let is = wordOccs d s, not (nullP is) :] wordOccs :: Doc -> String -> [: Int :] wordOccs d s = [: i | (i,si) <- zipP [: 1..lengthP d :] d, s == si :] zipP :: [:a:] -> [:b:] -> [:(a,b):] lengthP :: [:a:] -> Int
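A sequential list sketch of the same search, for intuition (names carry a Seq suffix to distinguish them from the parallel-array versions above; zipP, lengthP and nullP become plain zip, length and null):

```haskell
type DocSeq    = [String]   -- sequence of words
type CorpusSeq = [DocSeq]

wordOccsSeq :: DocSeq -> String -> [Int]
wordOccsSeq d s = [ i | (i, si) <- zip [1 .. length d] d, s == si ]

searchSeq :: CorpusSeq -> String -> [(DocSeq, [Int])]
searchSeq ds s =
  [ (d, is) | d <- ds, let is = wordOccsSeq d s, not (null is) ]

-- searchSeq [["a","b","a"], ["c"]] "a"  ==>  [(["a","b","a"], [1,3])]
```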

Hard to Implement Well! Even chunking the corpus of documents at top level may be ill-balanced. The top level alone may not be very parallel!

The Flattening Transformation Concatenate sub-arrays into one big flat array; operate in parallel on the big array.

The Flattening Transformation Concatenate sub-arrays into one big flat array; a segment vector tracks the extent of each sub-array.

The Flattening Transformation A lot of tricky book-keeping: possible, but hard to do by hand.

The Flattening Transformation A lot of tricky book-keeping: Blelloch showed how to do it systematically.

Fusion Flattening enables load balancing, but it is not enough to ensure good performance. Bad idea: Generate Intermediate Array - [: f1*f2 | f1 <- v1 | f2 <-v2 :] - Add elements of this big intermediate vector vecMul :: [:Float:] -> [:Float:] -> Float vecMul v1 v2 = sumP [: f1*f2 | f1 <- v1 | f2 <- v2 :]

Fusion Flattening enables load balancing, but it is not enough to ensure good performance. Good Idea: Multiply and Add in Same Loop! -That is, fuse the multiply loop with add loop -Very general, aggressive fusion is required vecMul :: [:Float:] -> [:Float:] -> Float vecMul v1 v2 = sumP [: f1*f2 | f1 <- v1 | f2 <- v2 :]

4-Step Implementation Technique
1. Vectorization — specific to parallel arrays
2. Non-parametric data representations — a generically useful new feature in GHC
(steps 1–2 together form the “Flatten” stage)
3. Distribution — divide up the work evenly between processors
4. Aggressive fusion — uses “rewrite rules,” an old feature of GHC

Step 0: Desugaring Rewrite Haskell source into simpler core e.g., remove array comprehensions sDotP sv v = sumP [: x * (v!i) | (i,x) <- sv :] sDotP sv v = sumP (mapP (\(i,x) -> x * (v!i)) sv) sumP :: Num a => [:a:] -> a mapP :: (a -> b) -> [:a:] -> [:b:]

Step 1: Vectorization Replace each scalar function f by a vector function f^. sDotP sv v = sumP (mapP (\(i,x) -> x * (v!i)) sv) becomes sDotP sv v = sumP (snd^ sv *^ bpermuteP v (fst^ sv)) sumP :: Num a => [:a:] -> a *^ :: Num a => [:a:] -> [:a:] -> [:a:] fst^ :: [:(a,b):] -> [:a:] snd^ :: [:(a,b):] -> [:b:] bpermuteP :: [:a:] -> [:Int:] -> [:a:]

Vectorization (Basic idea) For every function f, generate a lifted version f^, so that mapP f v becomes f^ v. The resulting program operates over flat arrays with a fixed set of prim-ops: *^, sumP, fst^, … f :: a -> b f^ :: [:a:] -> [:b:] -- f^ = mapP f

Vectorization (Basic idea) mapP f v becomes f^ v. But, lots of intermediate arrays! f :: a -> b f^ :: [:a:] -> [:b:] -- f^ = mapP f

Vectorization (Basic idea) f :: Int -> Int f x = x + 1 f^ :: [:Int:] -> [:Int:] f^ xs = xs +^ (replicateP (lengthP xs) 1) replicateP :: Int -> a -> [:a:] lengthP :: [:a:] -> Int In the transformation: a variable (xs) maps to itself, a function g maps to g^, and a constant k maps to replicateP (lengthP xs) k.
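A plain-list sketch of this basic lifting step (illustration only; +^ becomes an elementwise zipWith (+), and the constant 1 is replicated to the array length):

```haskell
-- Scalar function and its hand-lifted version, over ordinary lists.
fScalar :: Int -> Int
fScalar x = x + 1

fLifted :: [Int] -> [Int]
fLifted xs = zipWith (+) xs (replicate (length xs) 1)
-- equivalently: fLifted = map fScalar

-- fLifted [10,20,30]  ==>  [11,21,31]
```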

Vectorization: Problem How to lift functions that have already been lifted? f :: [:Int:] -> [:Int:] f xs = mapP g xs = g^ xs f^ :: [:[:Int:]:] -> [:[:Int:]:] f^ xss = g^^ xss --??? Yet another version of g?? ?

Vectorization: Key insight f :: [:Int:] -> [:Int:] f xs = mapP g xs = g^ xs f^ :: [:[:Int:]:] -> [:[:Int:]:] f^ xss = segmentP xss (g^ (concatP xss)) concatP :: [:[:a:]:] -> [:a:] segmentP :: [:[:a:]:] -> [:b:] -> [:[:b:]:] Shape Flat data Nested data Payoff: f, f^ are enough ( avoid f^^ ) First concat, then map, then re-split
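A plain-list sketch of this key insight (an illustration, not the DPH implementation): to apply a lifted function to nested data, flatten the input, apply the singly-lifted function once over the flat data, and re-split the result against the original shape, so no doubly-lifted g^^ is ever needed. Here gLifted stands in for g^ with g = (+1), and liftNested stands in for f^:

```haskell
gLifted :: [Int] -> [Int]          -- plays the role of g^
gLifted = map (+ 1)

liftNested :: [[Int]] -> [[Int]]   -- plays the role of f^
liftNested xss = segment xss (gLifted (concat xss))
  where
    -- Re-split a flat list according to the lengths of the sub-lists.
    segment shape ys = go (map length shape) ys
    go []       _  = []
    go (n : ns) zs = let (h, t) = splitAt n zs in h : go ns t

-- liftNested [[1,2],[3],[4,5,6]]  ==>  [[2,3],[4],[5,6,7]]
```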

Step 2: Representing Arrays [:Double:] Too Slow: Array of pointers to boxed nums [:(a,b):] Too Slow: Arrays of pointers to pairs Idea: Pick representation using element type…

Step 2: Representing Arrays Extend Haskell with a construct to specify families of structures with different implementations. data family [:a:] data instance [:Double:] = AD Int ByteArray data instance [:(a, b):] = AP [:a:] [:b:] [POPL05], [ICFP05], [TLDI07]

Step 2: Representing Arrays Now *^ can be a fast loop, and array elements are not boxed. data family [:a:] data instance [:Double:] = AD Int ByteArray data instance [:(a, b):] = AP [:a:] [:b:]

Step 2: Representing Arrays And fst^ is constant time! data family [:a:] data instance [:Double:] = AD Int ByteArray data instance [:(a, b):] = AP [:a:] [:b:] fst^ :: [:(a,b):] -> [:a:] fst^ (AP as bs) = as

Step 2: Nested arrays Represent with (shape descriptor, flat data array): data instance [:[:a:]:] = AN [:Int:] [:a:]

Step 2: Nested arrays The representation supports efficient op-lifting. Surprise: concatP and segmentP are constant time! data instance [:[:a:]:] = AN [:Int:] [:a:] concatP :: [:[:a:]:] -> [:a:] concatP (AN shape data) = data segmentP :: [:[:a:]:] -> [:b:] -> [:[:b:]:] segmentP (AN shape _) data = AN shape data
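A plain-Haskell sketch of the same (shape, flat data) representation, without DPH syntax, showing why concatenation and re-segmentation cost O(1) — they only drop or reattach the shape descriptor (NestedArr, concatN and segmentN are illustrative names):

```haskell
-- Shape descriptor (segment lengths) plus one flat element array.
data NestedArr a = AN [Int] [a]
  deriving Show

-- [[1,2],[3],[4,5,6]] is stored as:
example :: NestedArr Int
example = AN [2, 1, 3] [1, 2, 3, 4, 5, 6]

concatN :: NestedArr a -> [a]
concatN (AN _ xs) = xs                   -- O(1): forget the shape

segmentN :: NestedArr a -> [b] -> NestedArr b
segmentN (AN shape _) ys = AN shape ys   -- O(1): reuse the shape
```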

Higher order complications f1^ is good for [: f a b | a <- as | b <- bs :] But the type transformation is not uniform. And sooner or later we want higher-order functions anyway. f2^ forces us to find a representation for [:(T2->T3):]. Closure conversion [PAPP06] f :: T1 -> T2 -> T3 f1^ :: [:T1:] -> [:T2:] -> [:T3:] -- f1^ = zipWithP f f2^ :: [:T1:] -> [:(T2 -> T3):] -- f2^ = mapP f

Step 3: Distribution After steps 0-2, Program = flat arrays + vec-ops ( *^, sumP …) NESL directly executes this version Hand-coded assembly for primitive ops All the time is spent here anyway sDotP :: [:(Int,Float):] -> [:Float:] -> Float sDotP (AP is fs) v = sumP (fs *^ bpermuteP v is)

Step 3: Distribution After steps 0-2, Program = flat arrays + vec-ops ( *^, sumP …) But, many intermediate arrays! Increase memory traffic, synchronization Idea: Distribution and Fusion sDotP :: [:(Int,Float):] -> [:Float:] -> Float sDotP (AP is fs) v = sumP (fs *^ bpermuteP v is)

Step 3: Distribution 1.Distribute is, fs into chunks, one per processor 2. Fuse sumP (fs *^ bpermuteP v is) into a tight, sequential loop on each processor 3. Combine results of each chunk into final result. Step 2 alone is not good on a parallel machine! sDotP :: [:(Int,Float):] -> [:Float:] -> Float sDotP (AP is fs) v = sumP (fs *^ bpermuteP v is)

Expressing Distribution New type Dist a describes a collection of distributed `a` values. (Selected) operations:
splitD — distribute data among processors: splitD :: [:a:] -> Dist [:a:]
joinD — collect result data: joinD :: Dist [:a:] -> [:a:]
mapD — run a sequential function on each processor: mapD :: (a->b) -> Dist a -> Dist b
sumD — sum numbers from each processor: sumD :: Dist Float -> Float

Distributing sumP sumP is a composition of primitive functions: sumP :: [:Float:] -> Float sumP xs = sumD (mapD sumS (splitD xs)) Example: xs = [: 2,1,4,9,5 :]; on Processor 1, xs1 = [: 2,1,4 :] and t1 = sumS xs1 = 7; on Processor 2, xs2 = [: 9,5 :] and t2 = sumS xs2 = 14; result = 21.
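A sequential, list-based model of this API can make the data flow concrete (an illustration of the semantics only, not the real distributed implementation; here Dist a is just one value per virtual processor and we pretend there are two of them):

```haskell
type Dist a = [a]                 -- one value per (virtual) processor

splitD :: [Float] -> Dist [Float]
splitD xs = [take n xs, drop n xs]
  where n = (length xs + 1) `div` 2

mapD :: (a -> b) -> Dist a -> Dist b
mapD = map        -- in the real runtime, each chunk runs on its own CPU

sumD :: Dist Float -> Float
sumD = sum

sumS :: [Float] -> Float          -- the tight sequential per-chunk loop
sumS = sum

sumP :: [Float] -> Float
sumP xs = sumD (mapD sumS (splitD xs))

-- sumP [2,1,4,9,5]: splitD gives [[2,1,4],[9,5]], the per-chunk sums
-- are 7 and 14, and sumD combines them into 21.
```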

Distributing sumP (same example, with the primitive types spelled out) splitD :: [:a:] -> Dist [:a:] mapD :: (a->b) -> Dist a -> Dist b sumD :: Dist Float -> Float sumS :: [:Float:] -> Float

mapD: The Source of Parallelism mapD :: (a->b) -> Dist a -> Dist b (same example as above)

sumS: A Tight Sequential Loop sumS :: [:Float:] -> Float (same example as above)

sumD: Collecting the Result sumD :: Dist Float -> Float (same example as above)

Distributing Lifted Multiply *^ :: [:Float:] -> [:Float:] -> [:Float:] *^ xs ys = joinD (mapD mulS (zipD (splitD xs) (splitD ys))) Example: xs = [: 2,1,4,9,5 :], ys = [: 3,2,2,1,1 :]; on Processor 1, xs1 = [: 2,1,4 :], ys1 = [: 3,2,2 :], t1 = mulS zs1 = [: 6,2,8 :]; on Processor 2, xs2 = [: 9,5 :], ys2 = [: 1,1 :], t2 = mulS zs2 = [: 9,5 :]; result = [: 6,2,8,9,5 :].

Step 4: Fusion sDotP :: [:(Int,Float):] -> [:Float:] -> Float sDotP (AP is fs) v = sumP (fs *^ bpermuteP v is) ==> sumD . mapD sumS . splitD . joinD . mapD mulS $ zipD (splitD fs) (splitD (bpermuteP v is)) Idea: rewrite rules remove synchronizations. {-# RULE #-} splitD (joinD x) = x splitD followed by joinD is equivalent to doing nothing.

Step 4: Fusion sDotP :: [:(Int,Float):] -> [:Float:] -> Float sDotP (AP is fs) v = sumP (fs *^ bpermuteP v is) ==> sumD . mapD sumS . mapD mulS $ zipD (splitD fs) (splitD (bpermuteP v is)) Idea: rewrite rules remove synchronizations. {-# RULE #-} splitD (joinD x) = x splitD followed by joinD is equivalent to doing nothing.

Step 4: Fusion Idea: rewrite rules remove synchronizations. {-# RULE #-} mapD f (mapD g x) = mapD (f.g) x Fuse successive uses of mapD; removes synchronization points. sDotP (AP is fs) v ==> sumP (fs *^ bpermuteP v is) ==> sumD . mapD sumS . mapD mulS $ zipD (splitD fs) (splitD (bpermuteP v is))

Step 4: Fusion Idea: rewrite rules remove synchronizations. {-# RULE #-} mapD f (mapD g x) = mapD (f.g) x Fuse successive uses of mapD; removes synchronization points. sDotP (AP is fs) v ==> sumP (fs *^ bpermuteP v is) ==> sumD . mapD (sumS . mulS) $ zipD (splitD fs) (splitD (bpermuteP v is))
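In actual GHC source such rules are written with the RULES pragma. The sketch below shows that syntax with stub list-based definitions so it compiles on its own; the rule names and stubs are illustrative and only make the pragma type-check — the real rules and primitives, and the argument for their validity, live in the DPH libraries:

```haskell
module DistRules where

type Dist a = [a]          -- one value per (virtual) processor

splitD :: [a] -> Dist [a]
splitD xs = [xs]           -- stub: pretend there is a single processor

joinD :: Dist [a] -> [a]
joinD = concat

mapD :: (a -> b) -> Dist a -> Dist b
mapD = map

{-# RULES
"splitD/joinD"  forall x.     splitD (joinD x)  = x
"mapD/mapD"     forall f g x. mapD f (mapD g x) = mapD (f . g) x
  #-}
```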

Step 4: Sequential fusion Now we have a sequential fusion problem: lots and lots of functions over arrays, and we can’t have fusion rules for every pair… … new idea: Stream Fusion [ICFP 07] sDotP :: [:(Int,Float):] -> [:Float:] -> Float sDotP (AP is fs) v ==> sumP (fs *^ bpermuteP v is) ==> sumD . mapD (sumS . mulS) $ zipD (splitD fs) (splitD (bpermuteP v is))

4-Step Implementation Technique
1. Vectorization — specific to parallel arrays
2. Non-parametric data representations — a generically useful new feature in GHC
(steps 1–2 together form the “Flatten” stage)
3. Distribution — divide up the work evenly between processors
4. Aggressive fusion — uses “rewrite rules,” an old feature of GHC
Main advance: an optimizing data-parallel compiler implemented by modest enhancements to a full-scale functional language implementation.

Purity Pays Off! Key optimizations: Flattening & Fusion Rely on purely-functional semantics No Assignments, No Shared Memory Every operation is pure Prediction: The data-parallel languages of the future will be functional languages

And it goes fast too... (pinch of salt) The 1-processor version is 30% slower than C; the 2-processor version is a performance win.

Nested Data Parallel Summary NDP is a promising way to harness 100’s of cores Great for programmers: far more flexible than flat DP NDP is tough to implement But we (think we) know how to do it. Functional programming is a big win in this space Work in progress, available in GHC 6.10 and

Distribution & Fusion Used Elsewhere Google: FlumeJava. Microsoft: Dryad/LINQ.

Candidate models in Haskell Explicit threads  Non-deterministic by design  Monadic: forkIO and STM Semi-implicit parallelism  Deterministic  Pure: par and pseq Data parallelism  Deterministic  Pure: parallel arrays  Memory: Shared -> Distributed -> GPUs? main :: IO () = do { ch <- newChan ; forkIO (ioManager ch) ; forkIO (worker 1 ch)... etc... } f :: Int -> Int f x = a `par` b `pseq` a+b where a = f1 (x-1) b = f2 (x-2) sDotP :: [:(Int,Float):] -> [:Float:] -> Float sDotP sv v = sumP [: f*(v!i) | (i,f) <- sv :]