L21: Final Preparation and Course Retrospective (also very brief introduction to Map Reduce) December 1, 2011.

Administrative
- Next homework: CUDA, MPI (Ch. 3) and Apps (Ch. 6)
  - Goal is to prepare you for the final
  - We'll discuss it in class on Thursday
  - Solutions due on Monday, Dec. 5 (should be straightforward if you are in class today)
- Poster dry run on Dec. 6, final presentations on Dec. 8
- Optional final report (4-6 pages) due on Dec. 15; can be used to improve your project grade if you need that

Outline
- Next homework
  - Discuss solutions
- Two problems from the midterm
- Poster information
- SC follow-up: Student Cluster Competition

Midterm Review, III.b

How would you rewrite the following code to perform as many operations as possible in SIMD mode for SSE and achieve good SIMD performance? Assume the data type for all the variables is 32-bit integers, and the superword width is 128 bits. Briefly justify your solution.

    k = ?;   // k is an unknown value in the range 0 to n, and b is declared to be of size 2n
    for (i = 0; i < n; i++) {
      x1 = i1 + j1;
      x2 = i2 + j2;
      x3 = i3 + j3;
      x4 = i4 + j4;
      x5 = x1 + x2 + x3 + x4;
      a[i] = b[n+i]*x5;
    }

Solution:
- x5 is loop invariant, so it can be computed once outside the loop.
- You must align b, and a and b likely have different alignments.
- If you did not realize that x5 is loop invariant, or if you wanted to execute it in SSE SIMD mode anyway, you need to pack the "i" and "j" operands to compute the "x" values.
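A minimal C sketch of the first point above (hoist the loop-invariant x5, leaving a unit-stride loop that can run in SSE SIMD mode). It reuses the exam's variable names and only indicates the alignment handling in comments; it is an illustration, not the official solution.

    /* Compute the loop-invariant value once, outside the loop. */
    x1 = i1 + j1;
    x2 = i2 + j2;
    x3 = i3 + j3;
    x4 = i4 + j4;
    x5 = x1 + x2 + x3 + x4;

    /* Remaining loop: one multiply per iteration, unit stride in both a and b,
       so four 32-bit iterations fit in each 128-bit SSE operation. Because a[i]
       and b[n+i] may have different alignments, one side needs unaligned
       loads/stores (or the loop must be peeled until a[i] is 16-byte aligned). */
    for (i = 0; i < n; i++) {
        a[i] = b[n+i] * x5;
    }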

Midterm Review, III.c

c. Sketch out how to rewrite the following code to improve its cache locality and locality in registers.

    for (i=1; i<n; i++)
      for (j=1; j<n; j++)
        a[j][i] = a[j-1][i-1] + c[j];

Briefly explain how your modifications have improved the memory access behavior of the computation.

Solution:
- Interchange the i and j loops: spatial locality in a, temporal locality in c (assign to a register)
- Tile the i loop: temporal locality of a in the j dimension
- Unroll the i and j loops: expose reuse of a in the loop body, assign to registers
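A minimal C sketch of the interchange step, assuming the same declarations as the exam code (and, as in III.b, 32-bit integer data); the tiling and unrolling steps are only indicated in a trailing comment.

    /* After interchange: j is the outer loop, i is the inner loop.
       The dependence distance is (1,1), so interchange is legal. */
    for (j = 1; j < n; j++) {
        int cj = c[j];             /* c[j] is invariant in the inner loop: keep it in a register */
        for (i = 1; i < n; i++) {
            /* The inner loop now walks a[j][*] and a[j-1][*] with unit stride: spatial locality. */
            a[j][i] = a[j-1][i-1] + cj;
        }
    }
    /* Further steps from the slide (not shown): tile the i loop and unroll both
       loops so reuse of a between rows j-1 and j is captured in registers. */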

Homework 4, due Monday, December 5 at 11:59PM

Instructions: We'll go over these in class on December 1. Handin on CADE machines: "handin cs4961 hw4 "

1. Given the following sequential code, sketch out two CUDA implementations for just the kernel computation (not the host code). Determine the memory access patterns and whether you will need synchronization. Your two versions should (a) use only global memory; and (b) use both global and shared memory. Keep in mind capacity limits for shared memory and limits on the number of threads per block.

    …
    int a[1024][1024], b[1024];
    for (i=0; i<1020; i++) {
      for (j=0; j<1024-i; j++) {
        b[i+j] += a[j][i] + a[j][i+1] + a[j][i+2] + a[j][i+3];
      }
    }

Homework 4, cont.

2. Programming Assignment 3.8, p. : Parallel merge sort starts with n/comm_sz keys assigned to each process. It ends with all the keys stored on process 0 in sorted order. … when a process receives another process' keys, it merges the new keys into its already sorted list of keys. … parallel mergesort … Then the processes should use tree-structured communication to merge the global list onto process 0, which prints the result.

3. Exercise 6.27, p. : If there are many processes and many redistributions of work in the dynamic MPI implementation of the TSP solver, process 0 could become a bottleneck for energy returns. Explain how one could use a spanning tree of processes in which a child sends energy to its parent rather than to process 0.

4. Exercise 6.30, p. 350: Determine which of the APIs is preferable for the n-body solvers and for solving TSP.
   a. How much memory is required … will the data fit into memory …
   b. How much communication is required by each of the parallel algorithms (consider remote memory accesses and coherence as communication)?
   c. Can the serial programs be easily parallelized by the use of OpenMP directives? Do they need synchronization constructs such as condition variables or read-write locks?

MPI is similar
- Static and dynamic partitioning schemes
- Maintaining "best_tour" requires global synchronization, which could be costly
  - May be relaxed a little to improve efficiency
  - Alternatively, different communication constructs can be used to make this more asynchronous and less costly:
    - MPI_Iprobe checks for an available message rather than actually receiving it (see the sketch below)
    - MPI_Bsend and other forms of send allow aggregating results of communication asynchronously
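As a hedged illustration of the MPI_Iprobe point in the list above, here is a minimal C sketch of a process polling for incoming best-tour cost updates without blocking. The tag NEW_COST_TAG and the surrounding function are assumptions for the example, not code from the textbook's solver.

    #include <mpi.h>

    #define NEW_COST_TAG 2   /* assumed tag used for best-tour-cost messages */

    /* Poll (without blocking) for pending best-tour updates and fold them into
       the local bound; called periodically from the branch-and-bound loop. */
    void check_for_best_tour_updates(int *best_cost, MPI_Comm comm) {
        int flag;
        MPI_Status status;

        for (;;) {
            /* MPI_Iprobe only tests whether a matching message is available. */
            MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &flag, &status);
            if (!flag) break;                /* nothing pending: resume the search */

            int received_cost;
            MPI_Recv(&received_cost, 1, MPI_INT, status.MPI_SOURCE, NEW_COST_TAG,
                     comm, MPI_STATUS_IGNORE);
            if (received_cost < *best_cost)
                *best_cost = received_cost;  /* tighten the local bound */
        }
    }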

Terminated Function for a Dynamically Partitioned TSP solver with MPI (1) Copyright © 2010, Elsevier Inc. All rights Reserved

Terminated Function for a Dynamically Partitioned TSP solver with MPI (2) Copyright © 2010, Elsevier Inc. All rights Reserved

Packing data into a buffer of contiguous memory Copyright © 2010, Elsevier Inc. All rights Reserved

Unpacking data from a buffer of contiguous memory Copyright © 2010, Elsevier Inc. All rights Reserved
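The two slides above refer to code figures from the textbook that are not reproduced in this transcript. As a hedged illustration of the same idea, here is a minimal C sketch of packing an int count plus an array of ints into one contiguous buffer with MPI_Pack and unpacking it on the receiving side with MPI_Unpack; the buffer size and tag are assumptions for the example.

    #include <mpi.h>

    #define PACK_BUF_SIZE 4096   /* assumed: large enough for one message */
    #define PACKED_TAG    3      /* assumed message tag */

    /* Sender: pack a count followed by 'count' ints, then send one message. */
    void send_packed(int *data, int count, int dest, MPI_Comm comm) {
        char buffer[PACK_BUF_SIZE];
        int position = 0;

        MPI_Pack(&count, 1, MPI_INT, buffer, PACK_BUF_SIZE, &position, comm);
        MPI_Pack(data, count, MPI_INT, buffer, PACK_BUF_SIZE, &position, comm);
        MPI_Send(buffer, position, MPI_PACKED, dest, PACKED_TAG, comm);
    }

    /* Receiver: unpack in exactly the order the sender packed. */
    int recv_packed(int *data, int source, MPI_Comm comm) {
        char buffer[PACK_BUF_SIZE];
        int position = 0, count;

        MPI_Recv(buffer, PACK_BUF_SIZE, MPI_PACKED, source, PACKED_TAG, comm,
                 MPI_STATUS_IGNORE);
        MPI_Unpack(buffer, PACK_BUF_SIZE, &position, &count, 1, MPI_INT, comm);
        MPI_Unpack(buffer, PACK_BUF_SIZE, &position, data, count, MPI_INT, comm);
        return count;
    }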

Course Retrospective
- Most people in the research community agree that there are at least two kinds of parallel programmers that will be important to the future of computing:
  - Programmers that understand how to write software, but are naïve about parallelization and mapping to architecture ("Joe" programmers)
  - Programmers that are knowledgeable about parallelization and mapping to architecture, and so can achieve high performance ("Stephanie" programmers)
- Intel/Microsoft say there are three kinds (Mort, Elvis and Einstein)
- This course is about teaching you how to become Stephanie/Einstein programmers

Course Retrospective
- Why OpenMP, Pthreads, MPI and CUDA?
  - These are the languages that Einstein/Stephanie programmers use.
  - They can achieve high performance.
  - They are widely available and widely used.
  - It is no coincidence that both textbooks I've used for this course teach all of these except CUDA.

The Future of Parallel Computing
- It seems clear that for the next decade, architectures will continue to get more complex and achieving high performance will get harder.
- Programming abstractions will get a whole lot better.
  - They seem to be bifurcating along the Joe/Stephanie or Mort/Elvis/Einstein boundaries.
  - They will be very different.
- Whatever the language or architecture, some of the fundamental ideas from this class will still be foundational to the area:
  - Locality
  - Deadlock, load balance, race conditions, granularity, …