ECE 1747H: Parallel Programming Lecture 1-2: Overview

ECE 1747H Meeting time: Mon 4-6 PM Meeting place: RS 310 Instructor: Cristiana Amza, office Pratt 484E

Material Course notes Web material (e.g., published papers) No required textbook, some recommended

Prerequisites Programming in C or C++ Data structures Basics of machine architecture Basics of network programming Please send email (name, student ID, class, instructor) to get an eecg account, and also to get an MPI account (this is on our research cluster, for the purpose of the MPI homework).

Other than that No written homework, no exams 10% for each small programming assignment (expect 1-2) 10% class participation The rest comes from the major course project

Programming Project Parallelizing a sequential program, or improving the performance or the functionality of a parallel program Project proposal and final report In-class project proposal and final report presentation “Sample” project presentation can be posted

Parallelism (1 of 2) Ability to execute different parts of a single program concurrently on different machines Goal: shorter running time Grain of parallelism: how big are the parts? Can be an instruction, statement, procedure, … Will mainly focus on relatively coarse grain

Parallelism (2 of 2) Coarse-grain parallelism mainly applicable to long-running, scientific programs Examples: weather prediction, prime number factorization, simulations, …

Lecture material (1 of 4) Parallelism –What is parallelism? –What can be parallelized? –Inhibitors of parallelism: dependences

Lecture material (2 of 4) Standard models of parallelism –shared memory (Pthreads) –message passing (MPI) –shared memory + data parallelism (OpenMP) Classes of applications –scientific –servers

Lecture material (3 of 4) Transaction processing –classic programming model for databases –now being proposed for scientific programs

Lecture material (4 of 4) Performance of parallel & distributed programs –architecture-independent optimization –architecture-dependent optimization

Course Organization First 2-3 weeks of semester: –lectures on parallelism, patterns, models –small programming assignments (1-2), done individually Rest of the semester: –major programming project, done individually or in a small group –research paper discussions

Parallel vs. Distributed Programming Parallel programming has matured: A few standard programming models A few common machine architectures Portability between models and architectures

Bottom Line Programmer can now focus on program and use suitable programming model Reasonable hope of portability Problem: much performance optimization is still platform-dependent –Performance portability is a problem

ECE 1747H: Parallel Programming Lecture 1-2: Parallelism, Dependences

Parallelism Ability to execute different parts of a program concurrently on different machines Goal: shorten execution time

Measures of Performance To computer scientists: speedup, execution time. To applications people: size of problem, accuracy of solution, etc.

Speedup of Algorithm Speedup of algorithm = sequential execution time / execution time on p processors (with the same data set). (Plot: speedup as a function of the number of processors p.)
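For instance (an illustrative calculation with made-up numbers, not from the slides; T_seq and T_8 are my notation for the two measured times): if the sequential program runs in 100 s and the same algorithm runs in 25 s on 8 processors with the same data set, then

\[ \text{speedup}(8) = \frac{T_{\text{seq}}}{T_{8}} = \frac{100\ \text{s}}{25\ \text{s}} = 4. \]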

Speedup on Problem Speedup on problem = sequential execution time of best known sequential algorithm / execution time on p processors. A more honest measure of performance. Avoids picking an easily parallelizable algorithm with poor sequential execution time.

What Speedups Can You Get? Linear speedup –Confusing term: implicitly means a 1-to-1 speedup per processor. –(Almost always) as good as you can do. Sub-linear speedup: more common, due to the overhead of startup, synchronization, communication, etc.

Speedup (Plot: speedup vs. number of processors p, comparing the linear ideal with the actual, typically lower, curve.)

Scalability No really precise definition. Roughly speaking, a program is said to scale to a certain number of processors p if going from p-1 to p processors results in some acceptable improvement in speedup (for instance, an increase of 0.5).

Super-linear Speedup? Due to cache/memory effects: –Subparts fit into cache/memory of each node. –Whole problem does not fit in cache/memory of a single node. Nondeterminism in search problems. –One thread finds near-optimal solution very quickly => leads to drastic pruning of search space.

Cardinal Performance Rule Don’t leave (too) much of your code sequential!

Amdahl’s Law If 1/s of the program is sequential, then you can never get a speedup better than s. –(Normalized) sequential execution time = 1/s + (1 - 1/s) = 1 –Best parallel execution time on p processors = 1/s + (1 - 1/s)/p –When p goes to infinity, parallel execution time goes to 1/s –Speedup = s.
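To make the bound concrete, here is a minimal sketch (mine, not from the slides; amdahl_speedup is a hypothetical helper name) that evaluates 1 / (1/s + (1 - 1/s)/p) for a program whose sequential fraction is 1/s = 0.1, i.e., s = 10:

#include <stdio.h>

/* Amdahl bound for sequential fraction f = 1/s on p processors. */
double amdahl_speedup(double f, double p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    double f = 0.1;                        /* 10% of the program is sequential, so s = 10 */
    int procs[] = { 2, 4, 16, 64, 1024 };
    for (int i = 0; i < 5; i++)
        printf("p = %4d  speedup = %5.2f\n", procs[i], amdahl_speedup(f, procs[i]));
    /* As p grows, the speedup approaches 1/f = s = 10 but never exceeds it. */
    return 0;
}

The printed speedups climb toward, but never reach, 10, which is exactly the slide's claim that the speedup can never be better than s.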

Why keep something sequential? Some parts of the program are not parallelizable (because of dependences) Some parts may be parallelizable, but the overhead would dwarf the speedup gained.

When can two statements execute in parallel? On one processor: statement1; statement2; On two processors: processor 1 executes statement1; processor 2 executes statement2.

Fundamental Assumption Processors execute independently: no control over order of execution between processors

When can 2 statements execute in parallel? Possibility 1: processor 1 executes statement1 before processor 2 executes statement2. Possibility 2: processor 2 executes statement2 before processor 1 executes statement1.

When can 2 statements execute in parallel? Their order of execution must not matter! In other words, statement1; statement2; must be equivalent to statement2; statement1;

Example 1 a = 1; b = a; Statements cannot be executed in parallel Program modifications may make it possible.
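One possible modification (my sketch, not the slide's): forward-substituting the constant removes the true dependence, after which the two assignments no longer share any value.

#include <stdio.h>

int main(void)
{
    int a, b;

    /* Original form: b = a reads the value written by a = 1 (true
       dependence), so the statements cannot run in parallel.       */
    a = 1;
    b = a;

    /* After forward substitution of the constant, the two statements
       write different variables and read nothing from each other,
       so they could execute on different processors.                */
    a = 1;
    b = 1;

    printf("%d %d\n", a, b);
    return 0;
}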

Example 2 a = f(x); b = a; Could be rewritten as a = f(x); b = f(x); to remove the dependence, but it may not be wise to change the program this way (f(x) is computed twice, so sequential execution would take longer).

Example 3 a = 1; a = 2; Statements cannot be executed in parallel.

True dependence Statements S1, S2 S2 has a true dependence on S1 iff S2 reads a value written by S1

Anti-dependence Statements S1, S2. S2 has an anti-dependence on S1 iff S2 writes a value read by S1.

Output Dependence Statements S1, S2. S2 has an output dependence on S1 iff S2 writes a variable written by S1.

When can 2 statements execute in parallel? S1 and S2 can execute in parallel iff there are no dependences between S1 and S2 –true dependences –anti-dependences –output dependences Some dependences can be removed.
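For example (a sketch of my own, not from the slides), an anti-dependence can be removed by renaming: the second statement writes a fresh variable instead of overwriting the one the first statement reads.

#include <stdio.h>

int main(void)
{
    int x = 5;
    int a, b;
    int x2;

    /* Anti-dependence: S2 writes x, which S1 reads, so S1 must run first. */
    a = x + 1;   /* S1: reads x  */
    x = 3;       /* S2: writes x */

    /* Renaming removes it: S2 now writes x2, and later uses of the new
       value would refer to x2.  S1' and S2' touch different locations,
       so they can execute in either order or in parallel.               */
    b = x + 1;   /* S1': reads x (currently 3, from the code above) */
    x2 = 3;      /* S2': writes x2 */

    printf("%d %d %d %d\n", a, b, x, x2);
    return 0;
}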

Example 4 Most parallelism occurs in loops. for(i=0; i<100; i++) a[i] = i; No dependences. Iterations can be executed in parallel.
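As a concrete illustration (my sketch; OpenMP appears later in the course and is not part of this slide), the independent iterations of Example 4 can be spread across processors with a single annotation:

#include <stdio.h>

#define N 100

int main(void)
{
    int a[N];

    /* Every iteration writes its own a[i] and reads nothing produced by
       another iteration, so the iterations may run in any order or in
       parallel (the pragma is simply ignored if OpenMP is not enabled). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i;

    printf("a[0] = %d, a[%d] = %d\n", a[0], N - 1, a[N - 1]);
    return 0;
}

Compiled with, e.g., gcc -fopenmp, the 100 iterations are divided among the available threads.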

Example 5 for(i=0; i<100; i++) { a[i] = i; b[i] = 2*i; } Iterations and statements can be executed in parallel.

Example 6 for(i=0;i<100;i++) a[i] = i; for(i=0;i<100;i++) b[i] = 2*i; Iterations and loops can be executed in parallel.

Example 7 for(i=0; i<100; i++) a[i] = a[i] + 100; There is a dependence … on itself! Loop is still parallelizable.

Example 8 for( i=1; i<100; i++ ) a[i] = f(a[i-1]); Dependence between a[i] and a[i-1]. Loop iterations are not parallelizable.

Loop-carried dependence A loop-carried dependence is a dependence that is present only if the statements are part of the execution of a loop. Otherwise, we call it a loop-independent dependence. Loop-carried dependences prevent parallelization of the loop iterations.

Example 9 for( i=0; i<100; i++ ) for( j=1; j<100; j++ ) a[i][j] = f(a[i][j-1]); The dependence is carried by the j loop, not by the i loop. Outer loop can be parallelized, inner loop cannot.

Example 10 for( j=1; j<100; j++ ) for( i=0; i<100; i++ ) a[i][j] = f(a[i][j-1]); Inner loop can be parallelized, outer loop cannot. Less desirable situation. Loop interchange is sometimes possible.
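A sketch of the interchanged version (my illustration, with a stand-in f(); the interchange is legal here because the dependence is carried only by the j loop): swapping the loop headers makes the independent i loop outermost, so each processor can be handed whole rows of the array instead of forking and joining on every j.

#include <stdio.h>

#define N 100

static double f(double x) { return x + 1.0; }   /* stand-in for the slide's f() */

double a[N][N];

int main(void)
{
    /* Interchanged form of Example 10: a[i][j] depends only on a[i][j-1]
       (same i), so after swapping the loops the dependence-free i loop is
       outermost and can be parallelized at coarse grain.                  */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 1; j < N; j++)
            a[i][j] = f(a[i][j-1]);

    printf("a[%d][%d] = %f\n", N - 1, N - 1, a[N-1][N-1]);
    return 0;
}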

Level of loop-carried dependence Is the nesting depth of the loop that carries the dependence. Indicates which loops can be parallelized.

Be careful … Example 11 printf(“a”); printf(“b”); Statements have a hidden output dependence due to the output stream.

Be careful … Example 12 a = f(x); b = g(x); Statements could have a hidden dependence if f and g update the same variable. Also depends on what f and g can do to x.

Be careful … Example 13 for(i=0; i<100; i++) a[i+10] = f(a[i]); Dependence between a[10], a[20], … Dependence between a[11], a[21], … … Some parallel execution is possible.
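One way to exploit the remaining parallelism (my sketch, not the slide's solution; f() is a stand-in): the dependence distance is 10, so each block of 10 consecutive iterations is internally independent and can run in parallel, while the blocks themselves must stay in order.

#include <stdio.h>

#define N    100
#define DIST 10    /* dependence distance: iteration i writes a[i+10] */

static double f(double x) { return 2.0 * x; }   /* stand-in for the slide's f() */

double a[N + DIST];   /* zero-initialized; covers both the read and the written range */

int main(void)
{
    /* A block of 10 consecutive iterations reads a[b..b+9] and writes
       a[b+10..b+19]; these ranges do not overlap, so the iterations in
       one block are independent.  Blocks must stay in order because the
       values written by one block are read by the next.                 */
    for (int b = 0; b < N; b += DIST) {
        #pragma omp parallel for
        for (int i = b; i < b + DIST; i++)
            a[i + DIST] = f(a[i]);
    }

    printf("a[%d] = %f\n", N + DIST - 1, a[N + DIST - 1]);
    return 0;
}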

Be careful … Example 14 for( i=1; i<100;i++ ) { a[i] = …;... = a[i-1]; } Dependence between a[i] and a[i-1] Complete parallel execution impossible Pipelined parallel execution possible

Be careful … Example 15 for( i=0; i<100; i++ ) a[i] = f(a[indexa[i]]); Cannot tell for sure. Parallelization depends on user knowledge of values in indexa[]. User can tell, compiler cannot.
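A sketch of what the user, but not the compiler, can justify (my illustration; the particular contents of indexa[] are an assumption): if the programmer knows that every indexa[i] points into a region of a[] that the loop never writes, reads and writes cannot conflict and the loop can be parallelized by hand.

#include <stdio.h>

#define N 100

static double f(double x) { return x + 0.5; }   /* stand-in for the slide's f() */

double a[2 * N];     /* the loop writes a[0..N-1] and, by assumption, only
                        reads a[N..2N-1] through indexa[]                  */
int indexa[N];

int main(void)
{
    for (int i = 0; i < N; i++) {
        indexa[i] = N + (i * 7) % N;   /* assumed: every index lands in [N, 2N) */
        a[N + i] = (double)i;
    }

    /* Safe only because of the user's knowledge about indexa[]: every read
       a[indexa[i]] targets the upper half, every write a[i] the lower half. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = f(a[indexa[i]]);

    printf("a[0] = %f\n", a[0]);
    return 0;
}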

Optimizations: Example 16 for (i = 0; i < …; i++) a[i] = a[i] + 1; Cannot be parallelized as is. May be parallelized by applying certain code transformations.

An aside Parallelizing compilers analyze program dependences to decide parallelization. In parallelization by hand, the user does the same analysis. The compiler is more convenient and less error-prone; the user is more powerful and can analyze more patterns.

To remember Statement order must not matter. Statements must not have dependences. Some dependences can be removed. Some dependences may not be obvious.