CS4961 Parallel Programming, Lecture 14: Reasoning about Performance. Mary Hall, October 7, 2010.


Administrative: What's Coming
-Programming assignment 2 due Friday, 11:59 PM
-Homework assignment out on Tuesday, Oct. 19 and due Monday, October 25
-Midterm quiz on October 26
-Start CUDA after break
-Start thinking about independent/group projects

Today's Lecture
-Estimating locality benefits
-Finish OpenMP example
-Ch. 3, Reasoning About Performance

Programming Assignment 2: Due 11:59 PM, Friday October 8
Combining Locality, Thread and SIMD Parallelism: The following code excerpt is representative of a common signal processing technique called convolution. Convolution combines two signals to form a third signal. In this example, we slide a small (32x32) signal around in a larger (4128x4128) signal to look for regions where the two signals have the most overlap.

for (l=0; l<N; l++) {
  for (k=0; k<N; k++) {
    C[k][l] = 0.0;
    for (j=0; j<W; j++) {
      for (i=0; i<W; i++) {
        C[k][l] += A[k+i][l+j]*B[i][j];
      }
    }
  }
}

How to use Tiling
-What data are you trying to keep resident in cache?
-Does your tiling actually improve how much of that data is reused from cache?
-Do you need to tile different loops? Multiple loops? A different loop order after tiling?
-How big are W and N?
The right way to approach this is to reason about what data is accessed in the innermost 2 or 3 loops after tiling. You can permute to get the order you want.
-Let's review Matrix Multiply from Lecture 12
-Calculate the "footprint" of the data in the innermost loops not controlled by N, which is presumably very large
-Advanced idea: take spatial locality into account

Locality + SIMD (SSE-3) Example: Matrix Multiply

for (J=0; J<N; J++)
  for (K=0; K<N; K++)
    for (I=0; I<N; I++)
      C[J][I] = C[J][I] + A[K][I] * B[J][K];

(Figure: access patterns in C, A, and B, indexed by I, K, and J)

Locality + SIMD (SSE-3) Example: Tiling inner loops I and K (+ permutation)

for (K = 0; K < N; K += TK)
  for (I = 0; I < N; I += TI)
    for (J = 0; J < N; J++)
      for (KK = K; KK < min(K+TK, N); KK++)
        for (II = I; II < min(I+TI, N); II++)
          C[J][II] = C[J][II] + A[KK][II] * B[J][KK];

(Figure: tiles of size TI x TK highlighted in C, A, and B)
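For concreteness, here is one way the tiled loop nest above could be wrapped into a self-contained routine. This is a minimal sketch: the MIN macro, the concrete tile sizes TI and TK, the problem size N, and the static array declarations are illustrative assumptions, not part of the original slide.

#define N   512                          /* problem size (assumed for illustration) */
#define TI  32                           /* tile size for the I loop (assumed) */
#define TK  32                           /* tile size for the K loop (assumed) */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

static double A[N][N], B[N][N], C[N][N]; /* static arrays start zeroed, so C accumulates correctly */

void tiled_matmul(void)
{
    /* Same loop structure as the slide: tile the K and I loops, keep J as the
       outermost intra-tile loop.  The TK x TI tile of A is reused across all N
       iterations of J, and the innermost II loop is stride-1 in both C and A,
       which is what the SSE-3 (SIMD) version wants. */
    for (int K = 0; K < N; K += TK)
        for (int I = 0; I < N; I += TI)
            for (int J = 0; J < N; J++)
                for (int KK = K; KK < MIN(K + TK, N); KK++)
                    for (int II = I; II < MIN(I + TI, N); II++)
                        C[J][II] = C[J][II] + A[KK][II] * B[J][KK];
}

Following the slide's "footprint" advice, the data touched by the three innermost loops (for a fixed J) is roughly one TK x TI tile of A, a TI-element segment of C, and a TK-element segment of B -- about TK*TI + TI + TK elements -- and the tile sizes should be chosen so that this footprint fits in cache.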

Motivating Example: Linked List Traversal

while (my_pointer) {
  (void) do_independent_work(my_pointer);
  my_pointer = my_pointer->next;
} // End of while loop

How to express this with a parallel for?
-Must have a fixed number of iterations
-Loop-invariant loop condition and no early exits
Convert to a parallel for
-A priori, count the number of iterations (if possible)
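The "count the iterations a priori" conversion might look like the following sketch. It assumes a node type list_t with a next field and the do_independent_work routine named on the slide; the malloc-based pointer array and the process_list wrapper are illustrative details, not something the original slide specifies.

#include <stdlib.h>

typedef struct list { struct list *next; /* ... payload ... */ } list_t;
void do_independent_work(list_t *p);   /* assumed to exist, as on the slide */

void process_list(list_t *listhead)
{
    /* Pass 1: count the nodes so the loop bound is known before the loop runs. */
    int count = 0;
    for (list_t *p = listhead; p != NULL; p = p->next)
        count++;

    /* Pass 2: record the node pointers so iteration i is independent of iteration i-1. */
    list_t **nodes = malloc(count * sizeof(list_t *));
    int idx = 0;
    for (list_t *p = listhead; p != NULL; p = p->next)
        nodes[idx++] = p;

    /* The traversal now fits the parallel-for model. */
    #pragma omp parallel for
    for (int i = 0; i < count; i++)
        do_independent_work(nodes[i]);

    free(nodes);
}

The two extra passes and the pointer array are exactly the kind of "extra computation" and "extra storage" overhead discussed later in this lecture.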

OpenMP 3.0: Tasks!

my_pointer = listhead;
#pragma omp parallel
{
  #pragma omp single nowait
  {
    while (my_pointer) {
      #pragma omp task firstprivate(my_pointer)
      {
        (void) do_independent_work(my_pointer);
      }
      my_pointer = my_pointer->next;
    }
  } // End of single - no implied barrier (nowait)
} // End of parallel region - implied barrier here

firstprivate = private, and copy the initial value in from the global variable
lastprivate = private, and copy the final value back to the global variable

Chapter 3: Reasoning about Performance
Recall the introductory lecture: it is easy to write a parallel program that is slower than the sequential one!
Naïvely, many people think that applying P processors to a computation that takes time T will result in T/P time performance. Generally wrong:
-For a few problems (e.g., Monte Carlo) it is possible to apply more processors directly to the solution
-For most problems, using P processors requires a paradigm shift, additional code, "communication," and therefore overhead
-Also, differences in hardware
-Assume "P processors => T/P time" to be the best case possible
-In some cases, can actually do better (why?)

Sources of Performance Loss
-Overhead not present in the sequential computation
-Non-parallelizable computation
-Idle processors, typically due to load imbalance
-Contention for shared resources

Sources of Parallel Overhead
Thread/process management (next few slides)
Extra computation
-Which part of the computation do I perform?
-Selecting which part of the data to operate upon
-Local computation that is later accumulated with a reduction (see the sketch below)
-…
Extra storage
-Auxiliary data structures
-"Ghost cells"
"Communication"
-Explicit message passing of data
-Access to remote shared global data (in shared memory)
-Cache flushes and coherence protocols (in shared memory)
-Synchronization (the book separates synchronization from communication)
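As a small illustration of the "local computation that is later accumulated with a reduction" item above, a hedged OpenMP sketch (the parallel_sum name and its arguments are hypothetical):

double parallel_sum(const double *a, int n)
{
    /* Each thread accumulates into its own private copy of sum; OpenMP then
       combines the partial sums -- extra computation (and a synchronization
       point) that the sequential loop does not have. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}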

Processes and Threads (& Filaments…)
Let's formalize some things we have discussed before.
Threads …
-consist of program code, a program counter, a call stack, and a small amount of thread-specific data
-share access to memory (and the file system) with other threads
-communicate through the shared memory (see the pthreads sketch below)
Processes …
-execute in their own private address space
-do not communicate through shared memory, but need another mechanism such as message passing; a shared address space is another possibility
-logically subsume threads
-Key issue: how is the problem divided among the processes, including both data and work?
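A minimal pthreads sketch of the distinction above -- threads in one address space communicating through shared memory -- with the counter and worker names chosen purely for illustration:

#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;                           /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int my_value = *(int *)arg;        /* thread-specific data on this thread's stack */
    pthread_mutex_lock(&lock);
    shared_counter += my_value;        /* communicate through shared memory */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int args[2] = { 1, 2 };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &args[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("shared_counter = %d\n", shared_counter);     /* prints 3 */
    return 0;
}

A process-based version would need message passing (or an explicitly shared segment) to accomplish the same thing, since each process has its own private address space.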

Comparison
Both have code, a PC, a call stack, and local data
-Threads: one address space
-Processes: separate address spaces
-Filaments and similar are extremely fine-grain threads
Weight and agility
-Threads: lighter weight; faster to set up and tear down; more dynamic
-Processes: heavier weight; setup and teardown are more time consuming; communication is slower

Latency vs. Throughput
Parallelism can be used either to reduce latency or to increase throughput
-Latency refers to the amount of time it takes to complete a given unit of work (speedup)
-Throughput refers to the amount of work that can be completed per unit time (pipelining computation)
There is an upper limit on reducing latency
-Speed of light, especially for bit transmissions
-In networks, switching time (node latency)
-(Clock rate) x (issue width), for instructions
-Diminishing returns (overhead) for problem instances
-Limitations on the number of processors or the size of memory
-Power/energy constraints

Throughput Improvements
Throughput improvements are often easier to achieve by adding hardware
-More wires improve bits/second
-Use processors to run separate jobs
-Pipelining is a powerful technique to execute more (serial) operations in unit time
A common way to improve throughput
-Multithreading (e.g., Nvidia GPUs and the Cray El Dorado)

Latency Hiding from Multithreading
Reduce wait times by switching to work on a different operation
-Old idea, dating back to Multics
-In parallel computing it's called latency hiding
The idea is most often used to lower λ (latency) costs
-Have many threads ready to go …
-Execute a thread until it makes a nonlocal reference
-Switch to the next thread
-When the nonlocal reference is filled, add the thread back to the ready list

Interesting phenomenon: Superlinear speedup
(Figure 3.5 from the text: a typical speedup graph showing performance for two programs; the dashed line represents linear speedup.)
Why might Program 1 be exhibiting superlinear speedup? A different amount of work? Cache effects?

Performance Loss: Contention
Contention -- one processor's actions interfering with another processor's actions -- is an elusive quantity
-Lock contention: one processor holds a lock, stopping other processors that need it; they must wait
-Bus contention: bus wires are in use by one processor's memory reference
-Network contention: wires are in use by one packet, blocking other packets
-Bank contention: multiple processors try to access different locations on one memory chip simultaneously

Performance Loss: Load Imbalance
Load imbalance -- work not evenly assigned to the processors -- underutilizes parallelism
-The assignment of work, not data, is key
-Static assignments, being rigid, are more prone to imbalance
-Because dynamic assignment carries overhead, the quantum of work must be large enough to amortize the overhead (see the sketch below)
-With flexible allocations, load balance can be solved late in the design/programming cycle
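A hedged OpenMP illustration of the static-vs-dynamic point above; the work() routine, the results array, and the chunk size of 16 are hypothetical:

double work(int i);                    /* hypothetical routine whose cost varies widely with i */

void run_unbalanced_loop(double *results, int n)
{
    /* A static split would leave some threads idle once their cheap iterations
       finish; schedule(dynamic, 16) lets idle threads grab more iterations,
       while the chunk of 16 amortizes the scheduling overhead over a
       reasonable quantum of work. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++)
        results[i] = work(i);
}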

Scalability Consideration: Efficiency
Efficiency example from the textbook, page 82
Parallel Efficiency = Speedup / NumberOfProcessors
-Tells you how much gain is likely from adding more processors
Assume for this example that overhead is fixed at 20% of T_S (the sequential time).
What are the speedup and efficiency with 2 processors? 10 processors? 100 processors? (A worked version follows below.)
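A worked version of the exercise -- a sketch, assuming the parallel time is the ideal T_S/P plus the fixed 20% overhead, i.e., T_P = T_S/P + 0.2 T_S:

\[
\mathrm{Speedup}(P) = \frac{T_S}{T_S/P + 0.2\,T_S} = \frac{P}{1 + 0.2P},
\qquad
\mathrm{Efficiency}(P) = \frac{\mathrm{Speedup}(P)}{P} = \frac{1}{1 + 0.2P}
\]

With P = 2 this gives speedup ≈ 1.43 and efficiency ≈ 0.71; with P = 10, speedup ≈ 3.33 and efficiency ≈ 0.33; with P = 100, speedup ≈ 4.76 and efficiency ≈ 0.05. Once the fixed overhead dominates, adding processors buys very little.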

Summary: Issues in reasoning about performance
-Finish your assignment (try tiling!)
-Have a nice fall break!