Parallel Programming
Chapter 1: Introduction to Parallel Architectures
Johnnie Baker, Spring 2011

Acknowledgements for material used in creating these slides
– Mary Hall, CS4961 Parallel Programming, University of Utah
– Lawrence Snyder, CSE524 Parallel Programming, University of Washington
– Chapter 1 of the course text: Lin & Snyder, Principles of Parallel Programming

Course Basic Details
– Time & Location: MWF 2:15-3:05
– Course Website:
– Instructor: Johnnie Baker
– Office Hours: 12:15-1:30 MWF in my office, MSB 160 (may have to change)
– Textbook: "Principles of Parallel Programming"
– Also, readings and/or notes provided for languages and some topics

Course Basic Details (cont.)
Prerequisites: Data Structures
– Algorithms and Operating Systems useful but not required
Topics Covered in Course
– Will cover most topics in the textbook
– May add information on the programming languages used, probably MPI, OpenMP, and CUDA
– May add some information on parallel algorithms
Course Requirements (may be adjusted depending on the total amount of homework and programming assignments)
– Midterm Exam: 25%
– Homework: 25%
– Programming Projects: 25%
– Final Exam: 25%

Course Logistics
– Class webpage will be headquarters for all slides, reading supplements, and assignments
– Take lecture notes, as slides will be online sometime after the lecture
– Informal class: ask questions immediately

Why Study Parallelism?
– Currently, sequential processing is plenty fast for most of our daily computing uses
– Some advantages of parallelism include:
  – The extra power from parallel computers is enabling in science, engineering, business, etc.
  – Multicore chips present new opportunities
  – Deep intellectual challenges for CS: models, programming languages, algorithms, etc.

Why is this Course Important?
– The multi-core and many-core era is here to stay
  – Why? Technology trends
– Many programmers will be developing parallel software
  – But still not everyone is trained in parallel programming
  – Learn how to put all these vast machine resources to the best use!
– Useful for joining the work force and for graduate school
– Our focus:
  – Teach core concepts
  – Use common programming models
  – Discuss the broader spectrum of parallel computing

Clock speed flattening sharply

Technology Trends: Power Density Limits Serial Performance

The Multi-Core Paradigm Shift: What to do with all these transistors?
Key ideas:
– Movement away from increasingly complex processor design and faster clocks
– Replicated functionality (i.e., parallel) is simpler to design
– Resources are more efficiently utilized
– Huge power management advantages
All computers are parallel computers.

Scientific Simulation: The Third Pillar of Science
Traditional scientific and engineering paradigm:
1) Do theory or paper design.
2) Perform experiments or build the system.
Limitations:
– Too difficult: build large wind tunnels.
– Too expensive: build a throw-away passenger jet.
– Too slow: wait for climate or galactic evolution.
– Too dangerous: weapons, drug design, climate experimentation.
Computational science paradigm:
3) Use high performance computer systems to simulate the phenomenon, based on known physical laws and efficient numerical methods.

The Quest for Increasingly More Powerful Machines
Scientific simulation will continue to push on system requirements:
– To increase the precision of the result
– To get to an answer sooner (e.g., climate modeling, disaster modeling)
The U.S. will continue to acquire systems of increasing scale:
– For the above reasons
– And to maintain competitiveness

A Similar Phenomenon in Commodity Systems
– More capabilities in software
– Integration across software
– Faster response
– More realistic graphics
– …

The fastest computer in the world today
– What is its name? Jaguar (Cray XT5)
– Where is it located? Oak Ridge National Laboratory
– How many processors does it have? ~37,000 processor chips (224,162 cores)
– What kind of processors? AMD 6-core Opterons
– How fast is it? Petaflop/second scale: one quadrillion (1 x 10^15) operations per second

The SECOND fastest computer in the world today
– What is its name? RoadRunner
– Where is it located? Los Alamos National Laboratory
– How many processors does it have? ~19,000 processor chips (~129,600 "processors")
– What kind of processors? AMD Opterons and IBM Cell/BE (used in PlayStations)
– How fast is it? Petaflop/second scale: one quadrillion (1 x 10^15) operations per second

Example: Global Climate Modeling Problem Problem is to compute: f(latitude, longitude, elevation, time)  temperature, pressure, humidity, wind velocity Approach: – Discretize the domain, e.g., a measurement point every 10 km – Devise an algorithm to predict weather at time t+  t given t Uses: -Predict major events, e.g., El Nino -Use in setting air emissions standards Source: 16

High Resolution Climate Modeling on NERSC-3 (figure) – P. Duffy, et al., LLNL

Some Characteristics of Scientific Simulation
– Discretize physical or conceptual space into a grid
  – Simpler if regular; may be more representative if adaptive
– Perform local computations on the grid
  – Given yesterday's temperature and weather pattern, what is today's expected temperature?
– Communicate partial results between grids
  – Contribute the local weather result to understand the global weather pattern
– Repeat for a set of time steps
– Possibly perform other calculations with the results
  – Given the weather model, what area should evacuate for a hurricane?

Example of Discretizing a Domain (figure): one processor computes one block of the grid while another processor computes an adjacent block in parallel; processors in adjacent blocks of the grid communicate their results.
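
To make the picture concrete, here is a minimal sequential sketch (not from the slides) of one "local computation on a grid" step for a 1-D discretized domain; the grid size, initial values, and averaging rule are illustrative assumptions. In a parallel version, each thread would own one contiguous block of the array, and only the points next to a block boundary would need a neighboring block's edge value to be communicated.

    /* Illustrative 1-D grid update (assumptions: 16 points, simple
       neighbor averaging). In a parallel code each thread would own a
       contiguous block; only points adjacent to a block boundary would
       need the neighboring block's edge value. */
    #include <stdio.h>

    #define N 16                      /* assumed number of grid points */

    int main(void) {
        double t_old[N], t_new[N];
        for (int i = 0; i < N; i++)   /* arbitrary initial condition */
            t_old[i] = (i < N / 2) ? 10.0 : 20.0;

        /* One time step: each interior point averages its two neighbors. */
        for (int i = 1; i < N - 1; i++)
            t_new[i] = 0.5 * (t_old[i - 1] + t_old[i + 1]);
        t_new[0] = t_old[0];
        t_new[N - 1] = t_old[N - 1];

        for (int i = 0; i < N; i++)
            printf("%.1f ", t_new[i]);
        printf("\n");
        return 0;
    }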

Parallel Programming Complexity: An Analogy to Preparing Thanksgiving Dinner
– Enough parallelism? (Amdahl's Law)
  – Suppose you want to just serve turkey
– Granularity
  – How frequently must each assistant report to the chef? After each stroke of a knife? Each step of a recipe? Each dish completed?
– Locality
  – Grab the spices one at a time? Or collect the ones that are needed prior to starting a dish?
– Load balance
  – Each assistant gets a dish? Preparing stuffing vs. cooking green beans?
– Coordination and Synchronization
  – The person chopping onions for stuffing can also supply green beans
  – Start the pie after the turkey is out of the oven
All of these things make parallel programming even harder than sequential programming.

Parallel and Distributed Computing
– Parallel computing (processing): the use of two or more processors (computers), usually within a single system, working simultaneously to solve a single problem.
– Distributed computing (processing): any computing that involves multiple computers remote from each other that each have a role in a computation problem or information processing.
– Parallel programming: the human process of developing programs that express what computations should be executed in parallel.

Is it really harder to "think" in parallel?
Some would argue it is more natural to think in parallel, and many examples exist in daily life:
– House construction: parallel tasks, wiring and plumbing performed at once (independence), but framing must precede wiring (dependence)
  – Similarly, developing large software systems
– Assembly line manufacture: pipelining, many instances in process at once
– Call center: independent calls executed simultaneously (data parallelism)
– "Multi-tasking": all sorts of variations

Finding Enough Parallelism
Suppose only part of an application seems parallel.
Amdahl's law:
– Let s be the fraction of work done sequentially, so (1-s) is the fraction that is parallelizable
– Let P = number of processors
– Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s
Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
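
A small numerical illustration of the bound above (the 10% sequential fraction is an assumption chosen for the example, not a figure from the slides): with s = 0.10 the bound is about 1.82 at P = 2 and only about 9.9 at P = 1024, and it can never exceed 1/s = 10.

    /* Evaluate the Amdahl bound 1/(s + (1-s)/P) for an assumed
       sequential fraction s = 0.10 and several processor counts. */
    #include <stdio.h>

    int main(void) {
        double s = 0.10;                       /* assumed sequential fraction */
        int procs[] = {2, 4, 16, 64, 1024};
        for (int k = 0; k < 5; k++) {
            int P = procs[k];
            double bound = 1.0 / (s + (1.0 - s) / P);
            printf("P = %4d   speedup <= %.2f\n", P, bound);
        }
        /* No matter how large P gets, the speedup stays below 1/s = 10. */
        return 0;
    }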

Overhead of Parallelism
Given enough parallel work, this is the biggest barrier to getting the desired speedup.
Parallelism overheads include:
– cost of starting a thread or process
– cost of communicating shared data
– cost of synchronizing
– extra (redundant) computation
Each of these can be in the range of milliseconds (= millions of flops) on some systems.
Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work.

Load Imbalance
Load imbalance is the time that some processors in the system are idle due to:
– insufficient parallelism (during that phase)
– unequal size tasks
Examples of the latter:
– adapting to "interesting parts of a domain"
– tree-structured computations
– fundamentally unstructured problems
The algorithm needs to balance the load.

Summary of Preceding Slides
Solving the "parallel programming problem" is a key technical challenge facing today's computing industry, government agencies, and scientists.
Scientific simulation discretizes some space into a grid:
– Perform local computations on the grid
– Communicate partial results between grids
– Repeat for a set of time steps
– Possibly perform other calculations with the results
Commodity parallel programming can draw from this history and move forward in a new direction.
Writing fast parallel programs is difficult:
– Amdahl's Law → must parallelize most of the computation
– Data locality
– Communication and synchronization
– Load imbalance

Reasoning about a Parallel Algorithm
– Ignore architectural details for now
– Assume we are starting with a sequential algorithm and trying to modify it to execute in parallel
  – Not always the best strategy, as sometimes the best parallel algorithms are NOTHING like their sequential counterparts
  – But useful since you are accustomed to sequential algorithms

Reasoning about a Parallel Algorithm, cont.
– Computation decomposition
  – How to divide the sequential computation among parallel threads/processors/computations?
  – Aside: also data partitioning (ignore today)
– Preserving dependences
  – Keeping the data values consistent with respect to the sequential execution
– Overhead
  – We'll talk about some different kinds of overhead

Race Condition or Data Dependence
– A race condition exists when the result of an execution depends on the timing of two or more events.
– A data dependence is an ordering on a pair of memory operations that must be preserved to maintain correctness.

A Simple Example
Count the 3s in array[], which holds `length` values.
Definitional solution: the sequential program

    count = 0;
    for (i = 0; i < length; i++) {
        if (array[i] == 3)
            count += 1;
    }

Can we rewrite this as a parallel code?

Computation Partitioning
Block decomposition: partition the original loop into separate "blocks" of loop iterations.
– Each "block" is assigned to an independent "thread" t0, t1, t2, t3 for t = 4 threads
– length = 16 in this example (figure: the 16 elements are split into four contiguous blocks, one per thread t0-t3)

    int block_length_per_thread = length / t;
    int start = id * block_length_per_thread;
    for (i = start; i < start + block_length_per_thread; i++) {
        if (array[i] == 3)
            count += 1;
    }

Correct? Does it preserve dependences?

Data Race on the Count Variable
Two threads may interfere on memory writes. Suppose count = 0 and Thread 1 and Thread 3 each find a 3 at about the same time; each executes load count, increment count, store count. If both load the old value before either stores, both store count = 1, so the final value is 1 instead of the correct 2: one increment is lost.

What Happened?
– Dependence on count across iterations/threads
  – But reordering is OK since operations on count are associative
– Load/increment/store must be done atomically to preserve the sequential meaning
– Definitions:
  – Atomicity: a set of operations is atomic if either they all execute or none executes; thus, there is no way to see the results of a partial execution.
  – Mutual exclusion: at most one thread can execute the code at any time.
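
One way to get the required atomicity, sketched here with C11 atomics rather than the mutex used on the next slide; the thread count, the array contents, and the use of C11 <threads.h> are assumptions for illustration (the course's languages, e.g. OpenMP, provide their own constructs for this).

    /* Minimal C11 sketch: an atomic add makes load/increment/store one
       indivisible operation, so concurrent updates cannot be lost. */
    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    #define LEN 16
    static int array[LEN] = {3, 1, 3, 3, 0, 3, 2, 3, 3, 1, 3, 0, 3, 3, 2, 3};
    static atomic_int count = 0;

    static int worker(void *arg) {
        int id = *(int *)arg;                 /* thread id: 0..3 */
        int block = LEN / 4;                  /* block decomposition, t = 4 */
        for (int i = id * block; i < (id + 1) * block; i++)
            if (array[i] == 3)
                atomic_fetch_add(&count, 1);  /* atomic load/increment/store */
        return 0;
    }

    int main(void) {
        thrd_t t[4];
        int ids[4];
        for (int i = 0; i < 4; i++) {
            ids[i] = i;
            thrd_create(&t[i], worker, &ids[i]);
        }
        for (int i = 0; i < 4; i++)
            thrd_join(t[i], NULL);
        printf("count = %d\n", atomic_load(&count));   /* 10 for this array */
        return 0;
    }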

Try 2: Adding Locks
Insert mutual exclusion (a mutex) so that only one thread at a time is loading/incrementing/storing count atomically.

    int block_length_per_thread = length / t;
    mutex m;
    int start = id * block_length_per_thread;
    for (i = start; i < start + block_length_per_thread; i++) {
        if (array[i] == 3) {
            mutex_lock(m);
            count += 1;
            mutex_unlock(m);
        }
    }

Correct now. Done?

Performance Problems
– Serialization at the mutex
– Insufficient parallelism granularity
– Impact of the memory system

Lock Contention and Poor Granularity
– To acquire the lock, a thread must go through at least a few levels of cache (locality)
– A local copy in a register is not going to be correct
– There is not a lot of parallel work outside of acquiring/releasing the lock

Try 3: Increase "Granularity"
Each thread operates on a private copy of the count; lock only to update the global count from the private copy.

    mutex m;
    int block_length_per_thread = length / t;
    int start = id * block_length_per_thread;
    for (i = start; i < start + block_length_per_thread; i++) {
        if (array[i] == 3)
            private_count[id] += 1;
    }
    mutex_lock(m);
    count += private_count[id];
    mutex_unlock(m);

Much Better, But Not Better than Sequential
– Subtle cache effects are limiting performance
– A private variable ≠ a private cache line

Try 4: Force Private Variables into Different Cache Lines
– Simple way to do this? See the textbook for the authors' solution
– Parallel speedup: time(1)/time(2) = 0.91/0.51 = 1.78 (close to the number of processors!)
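
The authors' solution is in the textbook; one common way to get the effect (an illustrative sketch with an assumed 64-byte cache line, not the book's code) is to pad each thread's private counter so that two threads' counters never share a line. C11's _Alignas(CACHE_LINE) on the counter field is another way to get the same separation.

    /* Sketch: pad each private counter to an assumed 64-byte cache line
       so updates by different threads cannot falsely share a line. */
    #include <stdio.h>

    #define NUM_THREADS 4
    #define CACHE_LINE  64                   /* assumed line size in bytes */

    struct padded_count {
        int  value;
        char pad[CACHE_LINE - sizeof(int)];  /* fill out the rest of the line */
    };

    static struct padded_count private_count[NUM_THREADS];

    /* Inside thread id's loop, the "Try 3" update becomes:
           if (array[i] == 3) private_count[id].value += 1;          */

    int main(void) {
        printf("each counter occupies %zu bytes\n", sizeof(struct padded_count));
        return 0;
    }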

Discussion: Overheads
What were the overheads we saw with this example?
– Extra code to determine each thread's portion of the computation
– Locking overhead: inherent cost plus contention
– Cache effects: false sharing

Generalizing from this Example
Interestingly, this code represents a common pattern in parallel algorithms: a reduction computation.
– From a large amount of input data, compute a smaller result that represents a reduction in the dimensionality of the input
– In this case, a reduction from an array input to a scalar result (the count)
Reduction computations exhibit dependences that must be preserved:
– Looks like "result = result op …"
– The operation op must be associative so that it is safe to reorder the operations
– Aside: floating point arithmetic is not truly associative, but it is usually OK to reorder
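
Since the course expects to use OpenMP, here is a sketch of how this reduction pattern is typically written there (the array contents are illustrative; compile with an OpenMP-enabled compiler, e.g. with -fopenmp). The reduction(+:count) clause gives each thread a private partial count and combines the partials at the end, which is exactly the private-copy strategy of "Try 3" and is safe because + is associative.

    /* Count-the-3s as an OpenMP reduction (illustrative sketch).
       If OpenMP is not enabled, the pragma is ignored and the loop
       simply runs sequentially with the same result. */
    #include <stdio.h>

    #define LEN 16

    int main(void) {
        int array[LEN] = {3, 1, 3, 3, 0, 3, 2, 3, 3, 1, 3, 0, 3, 3, 2, 3};
        int count = 0;

        #pragma omp parallel for reduction(+:count)
        for (int i = 0; i < LEN; i++)
            if (array[i] == 3)
                count += 1;

        printf("count = %d\n", count);   /* 10 for this array */
        return 0;
    }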