Comparing and Optimising Parallel Haskell Implementations on Multicore
Jost Berthold, Simon Marlow, Abyd Al Zain, Kevin Hammond

The Parallel Haskell Landscape

Research into parallelism using Haskell has been ongoing since the late 1980s:
– semi-implicit, deterministic programming model: par :: a -> b -> b
– strategies package up larger parallel computation patterns, separating the algorithm from the parallelism (see the sketch below)
– the GUM implementation ran on clusters or multiprocessors, using PVM
– successful: linear speedups on large clusters

Another Parallel Haskell variant: Eden
– more explicit than par: the programming model says where the evaluation happens
– also able to express parallel computation skeletons, e.g. parMap
– implementation based on GHC; runs on clusters and multiprocessors using PVM-based communication
– multiple heaps, not virtually shared as in GUM (a simpler implementation)

Several other Parallel/Distributed Haskell dialects exist, mostly research prototypes, all based on distributed heaps (some virtually shared).
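
As a concrete illustration of the Strategies style (a minimal sketch of our own, not code from the talk), a parallel map can be written by composing an ordinary map with a parallel evaluation strategy; using, parList and rseq come from Control.Parallel.Strategies:

import Control.Parallel.Strategies (using, parList, rseq)

-- The algorithm (map) stays separate from the parallelism
-- (parList rseq), which sparks each list element for evaluation.
parSquares :: [Int] -> [Int]
parSquares xs = map (\x -> x * x) xs `using` parList rseq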

The Parallel Haskell Landscape

Recently (2005), shared-memory parallelism was added to GHC:
– single shared heap
– programming models supported:
  – pure: par and Strategies; soon, Data Parallel Haskell
  – impure, non-deterministic: Concurrent Haskell, STM (see the sketch below)
– widely available, high-quality implementation
– very lightweight concurrency: we win concurrency benchmarks
– parallel GC added recently

This work:
– compare the distributed and shared-heap models
– analyse the performance of the shared-heap implementation
– implement execution profiling
– make improvements to the runtime
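
On the impure, non-deterministic side, a minimal Concurrent Haskell / STM sketch (ours, not from the talk): two lightweight threads increment a shared transactional counter.

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM (atomically, newTVarIO, readTVarIO, modifyTVar')
import Control.Monad (replicateM_)

main :: IO ()
main = do
  counter <- newTVarIO (0 :: Int)
  -- two lightweight threads, each performing 1000 atomic increments
  _ <- forkIO (replicateM_ 1000 (atomically (modifyTVar' counter (+1))))
  _ <- forkIO (replicateM_ 1000 (atomically (modifyTVar' counter (+1))))
  -- crude wait for illustration only; real code would synchronise properly
  threadDelay 100000
  readTVarIO counter >>= print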

Shared vs. Distributed Heaps

Why a shared heap?
– no communication overhead, hence easier to program
– good for fine-grained tasks with plenty of communication and sharing

Why a distributed heap?
– parallel GC is much easier
– no cache-coherency overhead
– no mutexes

The GpH programming model

par :: a -> b -> b
– stores a pointer to a in the spark pool
– an idle CPU takes a spark from the spark pool and turns it into a thread

seq :: a -> b -> b
– used for sequential ordering

parMap :: (a -> b) -> [a] -> [b]
parMap f []     = []
parMap f (x:xs) = let y  = f x
                      ys = parMap f xs
                  in  y `par` (ys `seq` y:ys)
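
To actually run such a program in parallel it has to be built against GHC's threaded runtime and told how many OS threads (capabilities) to use. A typical invocation looks roughly like the following (the file name is ours; -threaded, +RTS -N and -s are standard GHC flags, -N8 selects 8 capabilities and -s prints runtime and GC statistics):

ghc -O2 -threaded --make SumEuler.hs
./SumEuler +RTS -N8 -s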

sumEuler benchmark

Sequential version:

sumEuler :: Int -> Int
sumEuler n = sum (map phi [1..n])

phi :: Int -> Int
phi n = length (filter (relprime n) [1..(n-1)])

Parallel version, one spark per element (phi unchanged):

sumEuler :: Int -> Int
sumEuler n = sum (parMap phi [1..n])

Chunked parallel version (phi unchanged):

sumEuler :: Int -> Int
sumEuler n = parChunkFoldMap (+) phi [1..n]

parChunkFoldMap :: (b -> b -> b) -> (a -> b) -> [a] -> b
parChunkFoldMap f g xs =
  foldl1 f (map (foldl1 f . map g) (splitAtN c xs)
              `using` parList rnf)
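
The chunk size c and the helper splitAtN are not defined on the slide; a plausible definition of the helper (our assumption) simply breaks the list into pieces of length n:

-- Split a list into chunks of at most n elements (hypothetical helper).
splitAtN :: Int -> [a] -> [[a]]
splitAtN _ [] = []
splitAtN n xs = let (chunk, rest) = splitAt n xs
                in  chunk : splitAtN n rest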

sumEuler execution profile
1. Standard GHC, 8 CPUs (2 x quad-core)
2. Eden using PVM, 8 CPUs (2 x quad-core)

Analysis (1)

The shared-heap implementation was spending a lot of time at the GC barrier. It turned out that the GC barrier had a bug: it was stopping one CPU at a time. We fixed that. Reducing the number of barriers by increasing the size of the young generation also helps a little.
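
In GHC the young-generation (allocation area) size is controlled by the -A RTS flag, so the 5MB setting used in the next profile corresponds roughly to a run like the one below (our reading of the slides; the program name is hypothetical):

./SumEuler +RTS -N8 -A5m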

sumEuler execution profile (2)
1. Standard GHC, including the GC barrier fix and a 5MB young generation
2. Eden using PVM, 8 CPUs (2 x quad-core)

Analysis (2)

Some of the gaps are due to poor load balancing. The existing load-balancing strategy was based on pushing spare work to idle CPUs, so there could be a long delay between a CPU becoming idle and receiving work from another CPU. We implemented lock-free work-stealing queues for load balancing of sparks (the access pattern is sketched below).
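
The real work-stealing deques live inside the GHC runtime system and are lock-free data structures written in C. The Haskell sketch below (ours, deliberately simplified and not lock-free, using a single atomic IORef) only illustrates the access discipline: the owning capability pushes and pops sparks at one end, while idle capabilities steal from the opposite end, which keeps contention low.

import Data.IORef (IORef, newIORef, atomicModifyIORef')
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>), ViewL(..), ViewR(..), viewl, viewr)

newtype SparkPool a = SparkPool (IORef (Seq a))

newPool :: IO (SparkPool a)
newPool = SparkPool <$> newIORef Seq.empty

-- Owner end: push and pop at the back of the queue.
push :: SparkPool a -> a -> IO ()
push (SparkPool r) x = atomicModifyIORef' r (\q -> (q |> x, ()))

pop :: SparkPool a -> IO (Maybe a)
pop (SparkPool r) = atomicModifyIORef' r $ \q ->
  case viewr q of
    EmptyR  -> (q, Nothing)
    q' :> x -> (q', Just x)

-- Thief end: idle capabilities steal from the front.
steal :: SparkPool a -> IO (Maybe a)
steal (SparkPool r) = atomicModifyIORef' r $ \q ->
  case viewl q of
    EmptyL  -> (q, Nothing)
    x :< q' -> (q', Just x)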

sumEuler execution profile (3)
1. Standard GHC + GC barrier fixes + work-stealing
2. Eden using PVM, 8 CPUs (2 x quad-core)

Analysis (3)

High priority: implement per-CPU GC
– each CPU has a local heap that can be collected independently of the other CPUs
– a single shared global heap is collected much less frequently, using stop-the-world collection
– e.g. Concurrent Caml, Manticore

Lower the overhead of spark activation by having a dedicated thread to run sparks
– this makes the implementation less sensitive to granularity: less need to group work into "chunks", so it is easier for programmers to get a speedup

Matrix multiplication

Using strategies, we can parallelise matrix multiplication elementwise, by grouping rows or columns, or blockwise (a row-wise sketch follows below). In Eden, the matrix data is communicated between the processing elements, but no PE keeps a complete copy of the matrix.
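
For the shared-heap side, a row-wise strategies version might look like the sketch below (our own minimal version, representing matrices as lists of rows; it is not the code measured in the paper). parListChunk groups the result rows into chunks to control granularity.

import Control.Parallel.Strategies (using, parListChunk, rdeepseq)
import Data.List (transpose)

type Matrix = [[Int]]

-- Multiply a and b, evaluating the rows of the result in parallel,
-- chunkSize rows per spark.
matMul :: Int -> Matrix -> Matrix -> Matrix
matMul chunkSize a b =
    [ [ sum (zipWith (*) row col) | col <- cols ] | row <- a ]
      `using` parListChunk chunkSize rdeepseq
  where
    cols = transpose b   -- columns of b, shared by every result row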

Matrix multiplication
1. Standard GHC, 8 CPUs (2 x quad-core)
2. Standard GHC + GC barrier fix + work-stealing
3. Eden

Analysis (4)

The distributed-memory implementation suffers from communication overhead. The distributed-memory algorithm is also more complex, because it tries to avoid copying the input data. We still have a way to go, though: GHC achieves a speedup of only 5.6 on 8 CPUs.

Further Challenges

Work duplication
– GHC doesn't prevent multiple threads from duplicating a computation; instead it tries to discover duplicated work in progress and halt one of the threads
– preventing duplication up-front is expensive: extra memory operations (black holes), or even atomic instructions
– we found that in some cases work duplication really does affect scaling
– so we want to prevent duplication up-front for some computations

Further Challenges

Space leak in par
– "par e1 e2" stores a pointer to e1 in the spark pool before evaluating e2
– typically e1 and e2 share some computation
– if we don't have enough processors, we might never evaluate e1 in parallel; how do we know when we can discard that entry from the spark pool? If we never discard entries, the spark pool causes a space leak
– "when e2 has completed" doesn't work, e.g. for parMap
– "when e1 is evaluated" also doesn't work: e1 itself isn't shared, but it refers to shared computations
– "when e1 is disjoint from the program's live data" is too hard to determine
– workaround: use only "par x e2" where x is shared with e2 (see the sketch below)
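
A tiny illustration of that workaround (our example, not from the slides):

import Control.Parallel (par)

expensive :: Int -> Int
expensive n = sum [1 .. n]        -- stand-in for real work

-- Problematic pattern: the sparked expression is rebuilt in the main
-- result rather than shared, so the spark-pool entry keeps a thunk
-- alive that the rest of the program never looks at.
leaky :: Int -> Int
leaky n = expensive n `par` (expensive n + 1)

-- Recommended pattern: spark a named thunk that the rest of the
-- program also uses; once y is evaluated (or becomes garbage) the
-- spark can safely be dropped.
sharedSpark :: Int -> Int
sharedSpark n = let y = expensive n in y `par` (y + 1)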

Conclusions

The tradeoff between distributed and shared heaps is a complex one
– a distributed heap can give better performance
– but it is harder to program against: the programmer must think about communication
– we believe a shared heap is the better model in the short term, but as we scale to larger numbers of cores or to NUMA architectures, a distributed or hybrid model will become necessary

We have made significant improvements to the performance of parallel programs in GHC
– and identified several further areas for improvement
– GHC (released next week) contains some of these improvements; download it and try it out!