ECE 1747 Parallel Programming


ECE 1747 Parallel Programming Machine-independent Performance Optimization Techniques

Returning to Sequential vs. Parallel Sequential execution time: t seconds. Startup overhead of parallel execution: t_st seconds (depends on architecture). (Ideal) parallel execution time on p processors: t/p + t_st. If t/p + t_st > t, there is no gain.
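A worked example with illustrative numbers (not from the slides): with t = 10 s, p = 8, and t_st = 0.5 s, the ideal parallel time is 10/8 + 0.5 = 1.75 s, a speedup of about 5.7 rather than 8. With t = 1 s the same overhead gives 0.625 s, a speedup of only 1.6, and for t below roughly 0.57 s (= t_st / (1 - 1/p)) the parallel version is slower than the sequential one.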

General Idea Parallelism limited by dependences. Restructure code to eliminate or reduce dependences. Sometimes possible by compiler, but good to know how to do it by hand.

Example 1 for( i=0; i<n; i++ ) { tmp = a[i]; a[i] = b[i]; b[i] = tmp; } Dependence on tmp.

Scalar Privatization #pragma omp parallel for private(tmp) for( i=0; i<n; i++ ) { tmp = a[i]; a[i] = b[i]; b[i] = tmp; } Dependence on tmp is removed.

Example 2 for( i=0, sum=0; i<n; i++ ) sum += a[i]; Dependence on sum.

Reduction sum = 0; #pragma omp parallel for reduction(+:sum) for( i=0; i<n; i++ ) sum += a[i]; Dependence on sum is removed.
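The same reduction as a minimal, self-contained sketch (array size and contents are hypothetical, not from the slides); compile with -fopenmp:

    #include <stdio.h>

    #define N 1000000                      /* hypothetical problem size */
    static double a[N];

    int main(void) {
        double sum = 0.0;

        for (int i = 0; i < N; i++)        /* hypothetical initialization */
            a[i] = 1.0;

        /* each thread accumulates into a private copy of sum;
           OpenMP combines the private copies when the loop ends */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);         /* expect 1000000.0 */
        return 0;
    }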

Example 3 for( i=0, index=0; i<n; i++ ) { index += i; a[i] = b[index]; } Dependence on index. Induction variable: can be computed from the loop variable.

Induction Variable Elimination #pragma omp parallel for for( i=0; i<n; i++ ) { a[i] = b[i*(i+1)/2]; } Dependence removed by computing the induction variable in closed form: at iteration i, index = 0+1+...+i = i*(i+1)/2.

Example 4 for( i=0, index=0; i<n; i++ ) { index += f(i); b[i] = g(a[index]); } Dependence on induction variable index, but no closed formula for its value.

Loop Splitting index[0] = f(0); for( i=1; i<n; i++ ) { index[i] = index[i-1] + f(i); } #pragma omp parallel for for( i=0; i<n; i++ ) b[i] = g(a[index[i]]); The first (sequential) loop computes the index values as a prefix sum; loop splitting has removed the dependence from the second loop.

Example 5 for( k=0; k<n; k++ ) for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) a[i][j] += b[i][k] + c[k][j]; Dependence on a[i][j] prevents k-loop parallelization. No dependences carried by the i- and j-loops.

Example 5 Parallelization for( k=0; k<n; k++ ) #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) a[i][j] += b[i][k] + c[k][j]; We can do better by reordering the loops.

Loop Reordering #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) for( k=0; k<n; k++ ) a[i][j] += b[i][k] + c[k][j]; Larger parallel pieces of work.

Example 6 #pragma omp parallel for for( i=0; i<n; i++ ) a[i] = b[i]; #pragma omp parallel for for( i=0; i<n; i++ ) c[i] = b[i]*b[i]; Make two parallel loops into one.

Loop Fusion #pragma omp parallel for for( i=0; i<n; i++ ) { a[i] = b[i]; c[i] = b[i]*b[i]; } Reduces loop startup overhead.

Example 7: While Loops while( *a ) { process(a); a++; } The number of loop iterations is unknown.

Special Case of Loop Splitting for( count=0, p=a; *p; count++, p++ ); #pragma omp parallel for for( i=0; i<count; i++ ) process( &a[i] ); Count the number of loop iterations, then parallelize the loop.

Example 8: Linked Lists for( p=head; p!=NULL; p=p->next ) /* assume process does not modify the linked list */ process(p); Dependence on p.

Another Special Case of Loop Splitting for( p=head, count=0; p!=NULL; p=p->next, count++ ); chunk[i] = f(count, omp_get_num_threads()); /* for each i: number of nodes given to thread i */ start[0] = head; for( i=0; i<omp_get_num_threads(); i++ ) { for( p=start[i], cnt=0; cnt<chunk[i]-1; p=p->next, cnt++ ); start[i+1] = p->next; }

Another Special Case of Loop Splitting #pragma omp parallel sections for( i=0; i<omp_get_num_threads(); i++ ) #pragma omp section for( p=start[i]; p!=start[i+1]; p=p->next ) process(p); (Schematic: each section processes one chunk of the list; see the compilable sketch below.)
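The two slides above are schematic: OpenMP sections cannot be generated from a runtime loop, and the chunk boundaries need care. Below is a hedged, compilable sketch of the same idea under stated assumptions (a node type with a next field, a per-node process() that is independent across nodes, and an even split of the node count); it is one possible realization, not the slides' exact code.

    #include <stdlib.h>
    #include <omp.h>

    typedef struct node { struct node *next; int data; } node;

    /* placeholder work, assumed independent across nodes */
    static void process(node *p) { p->data *= 2; }

    void process_list(node *head)
    {
        int nthreads = omp_get_max_threads();
        node *p;
        int count = 0;

        for (p = head; p != NULL; p = p->next)        /* count the nodes */
            count++;
        if (count == 0)
            return;

        node **start = malloc((nthreads + 1) * sizeof *start);
        int   *chunk = malloc(nthreads * sizeof *chunk);

        for (int i = 0; i < nthreads; i++)            /* split count as evenly as possible */
            chunk[i] = count / nthreads + (i < count % nthreads);

        start[0] = head;
        for (int i = 0; i < nthreads; i++) {          /* find the chunk boundaries */
            int cnt;
            for (p = start[i], cnt = 1; cnt < chunk[i]; p = p->next, cnt++)
                ;
            start[i + 1] = (chunk[i] > 0) ? p->next : start[i];
        }

        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            for (node *q = start[id]; q != start[id + 1]; q = q->next)
                process(q);                           /* each thread walks only its own sublist */
        }

        free(start);
        free(chunk);
    }

Because each thread walks only its own sublist, no synchronization is needed during processing.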

Example 9 for( i=0, wrap=n; i<n; i++ ) { b[i] = a[i] + a[wrap]; wrap = i; } Dependence on wrap. Only the first iteration causes a dependence.

Loop Peeling b[0] = a[0] + a[n]; #pragma omp parallel for for( i=1; i<n; i++ ) { b[i] = a[i] + a[i-1]; }

Example 10 for( i=0; i<n; i++ ) a[i+m] = a[i] + b[i]; Dependence if m<n.

Another Case of Loop Peeling if( m >= n ) { #pragma omp parallel for for( i=0; i<n; i++ ) a[i+m] = a[i] + b[i]; } else { /* dependence: cannot be parallelized, run serially */ for( i=0; i<n; i++ ) a[i+m] = a[i] + b[i]; }

Summary Reorganize code so that dependences are removed or reduced, large pieces of parallel work emerge, loop bounds become known, etc. The code can become messy; there is a point of diminishing returns.

Factors that Determine Speedup Characteristics of parallel code: granularity, load balance, locality, communication and synchronization.

Granularity Granularity = size of the program unit that is executed by a single processor. May be a single loop iteration, a set of loop iterations, etc. Fine granularity leads to: (positive) ability to use lots of processors; (positive) finer-grain load balancing; (negative) increased overhead.

Granularity and Critical Sections Small granularity => more processors => more critical section accesses => more contention.
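A hedged illustration of this effect (hypothetical code, not from the slides): counting positive elements with one critical-section entry per iteration versus one per thread. The coarser-grained version enters the critical section far less often and therefore suffers far less contention.

    /* Fine grain: one critical-section entry per matching iteration (high contention). */
    long count_hits_fine(const int *data, int n)
    {
        long hits = 0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            if (data[i] > 0) {
                #pragma omp critical
                hits++;
            }
        return hits;
    }

    /* Coarse grain: each thread accumulates privately and enters the
       critical section only once (low contention). */
    long count_hits_coarse(const int *data, int n)
    {
        long hits = 0;
        #pragma omp parallel
        {
            long my_hits = 0;
            #pragma omp for
            for (int i = 0; i < n; i++)
                if (data[i] > 0)
                    my_hits++;
            #pragma omp critical
            hits += my_hits;
        }
        return hits;
    }

In practice a reduction(+:hits) clause achieves the same thing; the point here is only that critical-section traffic grows with the number of fine-grained units.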

Issues in Performance of Parallel Parts Granularity. Load balance. Locality. Synchronization and communication.

Load Balance Load imbalance = difference in execution time between processors between barriers. Execution time may not be predictable. Regular data parallel: yes (predictable). Irregular data parallel or pipeline: perhaps. Task queue: no.

Static vs. Dynamic Static: done once, by the programmer; block, cyclic, etc.; fine for regular data parallel. Dynamic: done at runtime; task queue; fine for unpredictable execution times; usually high overhead. Semi-static: done once, at run-time.
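In OpenMP these choices map onto the schedule clause of a worksharing loop. A minimal sketch under assumptions (work() and n are hypothetical) showing a static block assignment, a cyclic assignment, and a dynamic, task-queue-like assignment:

    void schedule_examples(int n, void (*work)(int))
    {
        /* static block partition: contiguous chunks of roughly n/nthreads iterations */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            work(i);

        /* cyclic partition: iterations dealt out round-robin, one at a time */
        #pragma omp parallel for schedule(static, 1)
        for (int i = 0; i < n; i++)
            work(i);

        /* dynamic: an idle thread grabs the next chunk of 16 iterations,
           much like a centralized task queue (more overhead, better balance) */
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < n; i++)
            work(i);
    }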

Choice is not inherent MM (matrix multiplication) or SOR (successive over-relaxation) could be done using task queues: put all iterations in a queue; useful in a heterogeneous or multitasked environment. TSP (traveling salesman problem) could be done using static partitioning: give a length-1 path to each processor; reasonable if we did exhaustive search.

Static Load Balancing Block: best locality, possibly poor load balance. Cyclic: better load balance, worse locality. Block-cyclic: load balancing advantages of cyclic, (mostly) better locality (see later).
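A hedged sketch of the three mappings as iteration-to-thread ownership functions (n, the thread count, and the block size B are assumptions); each function returns the thread that owns iteration i:

    /* block: thread t owns one contiguous range of roughly n/nthreads iterations */
    int owner_block(int i, int n, int nthreads)
    {
        int block = (n + nthreads - 1) / nthreads;   /* ceiling division */
        return i / block;
    }

    /* cyclic: iterations are dealt out round-robin */
    int owner_cyclic(int i, int nthreads)
    {
        return i % nthreads;
    }

    /* block-cyclic: blocks of B consecutive iterations are dealt out round-robin */
    int owner_block_cyclic(int i, int B, int nthreads)
    {
        return (i / B) % nthreads;
    }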

Dynamic Load Balancing (1 of 2) Centralized: single task queue; easy to program; excellent load balance. Distributed: task queue per processor; less communication/synchronization.

Dynamic Load Balancing (2 of 2) Task stealing: Processes normally remove and insert tasks from their own queue. When queue is empty, remove task(s) from other queues. Extra overhead and programming difficulty. Better load balancing.

Semi-static Load Balancing Measure the cost of program parts. Use measurement to partition computation. Done once, done every iteration, done every n iterations.

Molecular Dynamics (MD) Simulation of a set of bodies under the influence of physical laws. Atoms, molecules, celestial bodies, ... Have same basic structure.

Molecular Dynamics (Skeleton) for some number of timesteps { for all molecules i for all other molecules j force[i] += f( loc[i], loc[j] ); loc[i] = g( loc[i], force[i] ); }

Molecular Dynamics To reduce amount of computation, account for interaction only with nearby molecules.

Molecular Dynamics (continued) for some number of timesteps { for all molecules i for all nearby molecules j force[i] += f( loc[i], loc[j] ); loc[i] = g( loc[i], force[i] ); }

Molecular Dynamics (continued) For each molecule i: count[i] = number of nearby molecules; index[j] = indices of the nearby molecules (0 <= j < count[i]).

Molecular Dynamics (continued) for some number of timesteps { for( i=0; i<num_mol; i++ ) { for( j=0; j<count[i]; j++ ) force[i] += f(loc[i],loc[index[j]]); loc[i] = g( loc[i], force[i] ); } }

Molecular Dynamics (simple) for some number of timesteps { #pragma omp parallel for for( i=0; i<num_mol; i++ ) { for( j=0; j<count[i]; j++ ) force[i] += f(loc[i],loc[index[j]]); loc[i] = g( loc[i], force[i] ); } }

Molecular Dynamics (simple) Simple to program. Possibly poor load balance: a block distribution of i iterations (molecules) could lead to an uneven neighbor distribution; cyclic does not help.

Better Load Balance Assign iterations such that each processor has ~ the same number of neighbors. Array of “assign records”: size = number of processors; two elements per record: beginning i value (molecule) and ending i value (molecule). Recompute the partition periodically.
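A hedged sketch of how such assign records might be computed from the neighbor counts; the b/e fields match the assign[pr]->b / assign[pr]->e accesses on the next slide, but the partitioning code itself is an assumption (the slides access the records through pointers, here a plain array of structs is used):

    struct assign_rec { int b, e; };   /* beginning and (exclusive) ending molecule */

    /* Partition molecules 0..num_mol-1 into nproc ranges with roughly
       equal total neighbor counts. count[i] = number of neighbors of molecule i. */
    void balance(struct assign_rec *assign, const int *count, int num_mol, int nproc)
    {
        long total = 0;
        for (int i = 0; i < num_mol; i++)
            total += count[i];

        long share = (total + nproc - 1) / nproc;    /* target work per processor */
        int i = 0;
        for (int p = 0; p < nproc; p++) {
            assign[p].b = i;
            long acc = 0;
            while (i < num_mol && (acc < share || p == nproc - 1))
                acc += count[i++];                   /* take molecules until the share is reached */
            assign[p].e = i;
        }
    }

Such a routine would be called once during initialization or every n timesteps, as discussed under “Frequency of Balancing” below.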

Molecular Dynamics (continued) for some number of timesteps { #pragma omp parallel private(pr,i,j) { pr = omp_get_thread_num(); for( i=assign[pr]->b; i<assign[pr]->e; i++ ) for( j=0; j<count[i]; j++ ) force[i] += f(loc[i],loc[index[j]]); } #pragma omp parallel for for( i=0; i<num_mol; i++ ) loc[i] = g( loc[i], force[i] ); }

Frequency of Balancing Every time the neighbor list is recomputed: once during initialization, every iteration, or every n iterations. Trade-off: extra overhead vs. better approximation and better load balance.

Summary Parallel code optimization: critical section accesses, granularity, load balance.

Factors that Determine Speedup granularity, load balancing, locality (uniprocessor and multiprocessor), synchronization and communication.

Uniprocessor Memory Hierarchy (figure) CPU; L1 cache: 32-128 KB, 2-cycle access; L2 cache: 256-512 KB, 20-cycle access; memory: 128 MB or more, 100-cycle access.

Uniprocessor Memory Hierarchy Transparent to user/compiler/CPU. Except for performance, of course.

Typical Cache Organization Caches are organized in “cache lines”. Typical line sizes: L1: 32 bytes; L2: 128 bytes.

Typical Cache Organization (figure) The address is split into tag, index, and offset fields; the index selects a cache line in the tag array and data array, and the stored tag is compared (=?) with the address tag to detect a hit.

Typical Cache Organization Previous example is a direct mapped cache. Most modern caches are N-way associative: N (tag, data) arrays N typically small, and not necessarily a power of 2 (3 is a nice value)
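A hedged sketch of the address decomposition implied by the cache-organization figure; the geometry (32-byte lines, 256 sets, i.e., a 32 KB 4-way cache) is an illustrative assumption, not the slide's:

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 32u     /* assumed L1 line size in bytes */
    #define NUM_SETS  256u    /* assumed number of sets (32 KB / 32 B / 4 ways) */

    int main(void)
    {
        uint32_t addr   = 0x12345678u;                     /* hypothetical address */
        uint32_t offset = addr % LINE_SIZE;                /* byte within the cache line */
        uint32_t index  = (addr / LINE_SIZE) % NUM_SETS;   /* which set to look in */
        uint32_t tag    = addr / (LINE_SIZE * NUM_SETS);   /* compared against stored tags */

        printf("tag=0x%x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }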

Cache Replacement If you hit in the cache, done. If you miss in the cache: fetch the line from the next level in the hierarchy, and replace the current line at that index (if associative, choose a line within that set). Various replacement policies: e.g., least-recently-used.

Bottom Line To get good performance, you have to have a high hit rate: you have to keep accessing data “close” to the data that you accessed recently.

Locality Locality (or re-use) = the extent to which a processor continues to use the same data or “close” data. Temporal locality: re-accessing a particular word before it gets replaced Spatial locality: accessing other words in a cache line before the line gets replaced

Example 1 for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) grid[i][j] = temp[i][j]; Good spatial locality in grid and temp (arrays in C laid out in row-major order). No temporal locality.

Example 2 for( j=0; j<n; j++ ) for( i=0; i<n; i++ ) grid[i][j] = temp[i][j]; No spatial locality in grid and temp. No temporal locality.

Example 3 for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) temp[i][j] = 0.25 * (grid[i-1][j]+grid[i+1][j]+ grid[i][j-1]+grid[i][j+1]); Spatial locality in temp. Spatial locality in grid. Temporal locality in grid?

Access to grid[i][j] First time grid[i][j] is used: computing temp[i-1][j]. Second time grid[i][j] is used: computing temp[i][j-1]. Between those times, 3 rows go through the cache. If 3 rows > cache size, cache miss on the second access.
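A hedged, illustrative calculation (array and cache sizes assumed, not from the slides): with n = 2048 and 8-byte doubles, one row of grid is 16 KB, so 3 rows are 48 KB, which does not fit in a typical 32 KB L1 data cache; grid[i][j] has therefore been evicted by the time temp[i][j-1] is computed.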

Fix Traverse the array in blocks, rather than row-wise sweep. Make sure grid[i][j] still in cache on second access.

Example 3 (before)

Example 3 (afterwards)

Achieving Better Locality Technique is known as blocking / tiling. Compiler algorithms known. Few commercial compilers do it. Learn to do it yourself.
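A hedged sketch of blocking applied to the stencil of Example 3 (the block size B is an assumed tuning parameter, and boundary handling is omitted): the j dimension is traversed in tiles of B columns, so the roughly 3 rows x B elements of grid touched while sweeping i through one tile stay in cache and grid[i][j] is still resident on its second use.

    /* Blocked (tiled) version of the Example 3 stencil, interior points only.
       n, B and the double grid[n][n] / temp[n][n] layout are assumptions. */
    void stencil_blocked(int n, int B, double grid[n][n], double temp[n][n])
    {
        for (int jj = 1; jj < n - 1; jj += B) {            /* one tile of B columns */
            int jmax = (jj + B < n - 1) ? (jj + B) : (n - 1);
            for (int i = 1; i < n - 1; i++)
                for (int j = jj; j < jmax; j++)
                    temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                         grid[i][j-1] + grid[i][j+1]);
        }
    }

Choosing B so that roughly 3*B doubles fit in L1 keeps the reused rows resident across the i sweep.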

Locality in Parallel Programming View a parallel machine not only as a multi-CPU system, but also as a multi-memory system.

Message Passing (figure) Per node: CPU, L1 cache (2-cycle access), L2 cache, local memory (100 cycles); remote memory is reached via messages and costs 1000s of cycles.

Small Shared Memory (figure) Several CPUs, each with its own L2 cache, share a single memory; shared-memory access time: 100+ cycles.

Large Shared Memory (figure) Many CPUs, each with its own L2 cache, connected to multiple memory modules; memory access time: 100s of cycles.

Note on Parallel Memory Hierarchies In shared memory (small or large), memory hierarchy is transparent. In message passing, memory hierarchy is not transparent -- managed by user. Principle remains (largely) the same. Transparent memory hierarchy can lead to unexpected locality loss in shared memory.

Locality in Parallel Programming Is even more important than in sequential programming, because the memory latencies are longer.

Returning to Sequential vs. Parallel A piece of code may be better executed sequentially if considered by itself. But locality may make it profitable to execute it in parallel. Typically happens with initializations.

Example: Parallelization Ignoring Locality for( i=0; i<n; i++ ) a[i] = i; #pragma omp parallel for for( i=1; i<n; i++ ) /* assume f is a very expensive function */ b[i] = f( a[i-1], a[i] );

Example: Taking Locality into Account #pragma omp parallel for for( i=0; i<n; i++ ) a[i] = i; #pragma omp parallel for for( i=1; i<n; i++ ) /* assume f is a very expensive function */ b[i] = f( a[i-1], a[i] ); Parallelizing the initialization as well means each processor initializes, and therefore caches, the part of a it will use in the second loop.

How to Get Started? First thing: figure out what takes the time in your sequential program => profile it (gprof)! Typically, a few parts (a few loops) take the bulk of the time. Parallelize those parts first, worrying about granularity and load balance. Advantage of shared memory: you can do that incrementally. Then worry about locality.
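For example (a typical gprof workflow; program and file names are assumptions): compile and link with gcc -pg so the program writes gmon.out when it runs, then run gprof on the executable and gmon.out to get the flat profile and call graph.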

Performance and Architecture Understanding the performance of a parallel program often requires an understanding of the underlying architecture. There are two principal architectures: distributed memory machines and shared memory machines. Microarchitecture plays a role too!