Presentation transcript:

Joe Hummel, PhD
U. of Illinois, Chicago / Loyola University Chicago

 Class: “Introduction to CS for Engineers”
 Lang: C/C++
 Focus: programming basics, vectors, matrices
 Timing: present this after introducing 2D arrays…

 Yes, it’s boring, but…
◦ everyone understands the problem
◦ good example of triply-nested loops
◦ non-trivial computation

    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
          C[i][j] += (A[i][k] * B[k][j]);

1500x1500 matrix: 2.25M elements  32 seconds…
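For readers who want to reproduce the baseline, a minimal self-contained sketch follows; the flat std::vector storage and the std::chrono timing harness are illustrative assumptions (the slides use C-style 2D arrays and show no harness):

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
      const int N = 1500;  // the slides’ 1500x1500 case
      // Contiguous row-major storage: A[i][k] becomes A[i*N + k].
      std::vector<double> A(N * N, 1.0), B(N * N, 1.0), C(N * N, 0.0);

      auto t0 = std::chrono::steady_clock::now();
      for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
          for (int k = 0; k < N; k++)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
      auto t1 = std::chrono::steady_clock::now();

      std::printf("naive ijk: %.2f s\n",
                  std::chrono::duration<double>(t1 - t0).count());
      return 0;
    }

Actual wall-clock numbers will vary with compiler flags and hardware; the slides’ 32 seconds is one data point.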

 Matrix multiply is a great candidate for multicore
◦ embarrassingly parallel
◦ easy to parallelize via the outermost loop

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
          C[i][j] += (A[i][k] * B[k][j]);

1500x1500 matrix: quad-core CPU  8 seconds…
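As a complete program, the parallel version might look like the sketch below; omp_get_wtime() and the flat storage are again assumptions for illustration. Compile with OpenMP enabled, e.g. g++ -O2 -fopenmp mm.cpp.

    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
      const int N = 1500;
      std::vector<double> A(N * N, 1.0), B(N * N, 1.0), C(N * N, 0.0);

      double t0 = omp_get_wtime();
      // Rows of C are divided among the threads; no two threads ever
      // write the same element of C, so no locking is needed.
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
          for (int k = 0; k < N; k++)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
      double t1 = omp_get_wtime();

      std::printf("parallel ijk, %d threads: %.2f s\n",
                  omp_get_max_threads(), t1 - t0);
      return 0;
    }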

 Parallelism alone is not enough…

HPC == Parallelism + Memory Hierarchy − Contention

◦ Expose parallelism
◦ Maximize data locality: network  disk  RAM  cache  core
◦ Minimize interaction: false sharing, locking, synchronization
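To make “false sharing” concrete, here is a hypothetical micro-benchmark sketch; the 64-byte cache-line size and the volatile counters are assumptions, the latter chosen so the compiler cannot collapse the loops:

    #include <omp.h>
    #include <cstdio>

    // Per-thread counter padded out to its own 64-byte cache line.
    struct Padded { volatile long v; char pad[64 - sizeof(long)]; };

    int main() {
      const long ITERS = 100000000;
      volatile long packed[64] = {0};  // adjacent slots share cache lines
      Padded padded[64] = {};          // one slot per line (assumes <= 64 threads)

      double t0 = omp_get_wtime();
      #pragma omp parallel
      {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) packed[id]++;    // lines ping-pong between cores
      }
      double t1 = omp_get_wtime();

      #pragma omp parallel
      {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++) padded[id].v++;  // each line stays local
      }
      double t2 = omp_get_wtime();

      std::printf("packed: %.2f s   padded: %.2f s\n", t1 - t0, t2 - t1);
      return 0;
    }

Both loops do the same logical work; only the memory layout differs, which is exactly the kind of contention the slide warns about.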

 What’s the other half of the chip? Cache!
 Implications?
◦ No one implements MM this way
◦ Rewrite to use loop interchange, and access B row-wise…

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
          C[i][j] += (A[i][k] * B[k][j]);

1500x1500 matrix: quad-core + cache  2 seconds…
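In the original ijk order the inner loop walks down a column of B, jumping N doubles per iteration; after the interchange every inner-loop access to B and C is unit-stride, which is where the 8-second to 2-second improvement comes from. Below is a sketch of the interchanged version as a complete program; hoisting A[i][k] into a local is a small addition not shown on the slide, and the flat row-major storage is again an illustrative assumption:

    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
      const int N = 1500;
      std::vector<double> A(N * N, 1.0), B(N * N, 1.0), C(N * N, 0.0);

      double t0 = omp_get_wtime();
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
          double a = A[i * N + k];             // invariant across the j loop
          for (int j = 0; j < N; j++)          // B and C walk row-wise now:
            C[i * N + j] += a * B[k * N + j];  // unit stride, cache-friendly
        }
      double t1 = omp_get_wtime();

      std::printf("parallel ikj: %.2f s\n", t1 - t0);
      return 0;
    }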