CS4961 Parallel Programming Lecture 11: Data Locality, cont


CS4961 Parallel Programming Lecture 11: Data Locality, cont. Mary Hall, September 28, 2010

Programming Assignment 2: Due 11:59 PM, Friday October 8
Combining Locality, Thread and SIMD Parallelism: The following code excerpt is representative of a common signal processing technique called convolution. Convolution combines two signals to form a third signal. In this example, we slide a small (32x32) signal around in a larger (4128x4128) signal to look for regions where the two signals have the most overlap.

for (l=0; l<N; l++) {
  for (k=0; k<N; k++) {
    C[k][l] = 0.0;
    for (j=0; j<W; j++) {
      for (i=0; i<W; i++) {
        C[k][l] += A[k+i][l+j]*B[i][j];
      }
    }
  }
}

Programming Assignment 2, cont. Your goal is to take the code as written and, by either changing the code or changing the way you invoke the compiler, improve its performance. You will need to use an Intel platform that has the “icc” compiler installed. You can use the CADE Windows lab, or another Intel platform with the appropriate software. The Intel compiler is available for free on Linux platforms for non-commercial use. The version of the compiler, the specific architecture, and the flags you give to the compiler will drastically impact performance. You can discuss strategies with other classmates, but do not copy code. Also, do not copy solutions from the web. This must be your own work.

Programming Assignment 2, cont. How to compile OpenMP (all versions): icc -openmp conv-assign.c
Measure and report performance for five versions of the code, and turn in all variants:
- Baseline: compile and run the code as provided (5 points)
- Thread parallelism only (using OpenMP): icc -openmp conv-omp.c (5 points)
- SSE-3 only: icc -openmp -msse3 -vec-report=3 conv-sse.c (10 points)
- Locality only: icc -openmp conv-loc.c (10 points)
- Combined: icc -openmp -msse3 -vec-report=3 conv-all.c (15 points)
- Explain results and observations (15 points)
- Extra credit (10 points): improve results by optimizing for register reuse
A simple timing sketch for the baseline is shown after this list.
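The sketch below is not part of the assignment handout; it only illustrates one way to time a variant. The file name, the array sizes (N=4096 and W=32, so the larger signal is 4128x4128), and the use of omp_get_wtime() are assumptions made so the example is self-contained.

/* Hypothetical timing harness -- compile with, e.g., icc -openmp conv-baseline.c */
#include <stdio.h>
#include <omp.h>

#define N 4096            /* assumed output size; A is then (N+W) x (N+W) = 4128 x 4128 */
#define W 32

static float A[N+W][N+W], B[W][W], C[N][N];

int main(void) {
  double start = omp_get_wtime();              /* wall-clock time before the kernel */

  for (int l = 0; l < N; l++)
    for (int k = 0; k < N; k++) {
      C[k][l] = 0.0f;
      for (int j = 0; j < W; j++)
        for (int i = 0; i < W; i++)
          C[k][l] += A[k+i][l+j] * B[i][j];
    }

  double elapsed = omp_get_wtime() - start;    /* report this number for each variant */
  printf("time = %f s, C[0][0] = %f\n", elapsed, C[0][0]);  /* print a result so the loop is not optimized away */
  return 0;
}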

Targets of Memory Hierarchy Optimizations
- Reduce memory latency: the latency of a memory access is the time (usually in cycles) between a memory request and its completion
- Maximize memory bandwidth: bandwidth is the amount of useful data that can be retrieved over a time interval
- Manage overhead: the cost of performing an optimization (e.g., copying) should be less than the anticipated gain

Reuse and Locality
Consider how data is accessed:
- Data reuse: same or nearby data used multiple times; intrinsic in the computation
- Data locality: data is reused and is present in “fast memory”; same data or same data transfer
If a computation has reuse, what can we do to get locality?
- Appropriate data placement and layout
- Code reordering transformations, as in the sketch below
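As a concrete, hedged illustration (not from the slides): both versions below have reuse of A, but only the second turns that reuse into locality, because a code reordering (loop fusion, one such transformation) places the two uses of A[i] in the same iteration, while the data is still in cache or a register. The array size is an assumption.

#include <stddef.h>

#define LEN (1 << 24)                    /* assumed size: much larger than any cache */

/* Reuse without locality: every A[i] is used twice, but the second use
   comes a full array-sweep later, long after the line has been evicted. */
void two_passes(const float *A, float *sum, float *sumsq) {
  float s = 0.0f, sq = 0.0f;
  for (size_t i = 0; i < LEN; i++) s  += A[i];
  for (size_t i = 0; i < LEN; i++) sq += A[i] * A[i];
  *sum = s;
  *sumsq = sq;
}

/* Reuse with locality: reordering the code (fusing the loops) places both
   uses of A[i] in the same iteration, so the second use hits in cache. */
void fused(const float *A, float *sum, float *sumsq) {
  float s = 0.0f, sq = 0.0f;
  for (size_t i = 0; i < LEN; i++) {
    s  += A[i];
    sq += A[i] * A[i];
  }
  *sum = s;
  *sumsq = sq;
}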

Temporal Reuse in Sequential Code
Same data used in distinct iterations I and I':

for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    A[j] = A[j+1] + A[j-1];

A[j] has self-temporal reuse in loop i

Spatial Reuse in Sequential Code
Same data transfer (usually a cache line) used in distinct iterations I and I':

for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    A[j] = A[j+1] + A[j-1];

A[j] has self-spatial reuse in loop j
Multi-dimensional array note: C arrays are stored in row-major order; Fortran arrays are stored in column-major order (see the traversal sketch below)
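Because C arrays are row-major, making the rightmost subscript the innermost loop gives spatial locality. The sketch below contrasts the two traversal orders; the array sizes are assumptions for illustration.

#define ROWS 2048
#define COLS 2048
static double X[ROWS][COLS];             /* assumed sizes for illustration */

/* Good spatial locality in C: the inner loop walks along a row, so
   successive accesses fall in the same cache line. */
double sum_by_rows(void) {
  double s = 0.0;
  for (int i = 0; i < ROWS; i++)
    for (int j = 0; j < COLS; j++)
      s += X[i][j];
  return s;
}

/* Poor spatial locality in C (this order suits Fortran's column-major layout):
   the inner loop strides down a column, touching a new cache line each access. */
double sum_by_columns(void) {
  double s = 0.0;
  for (int j = 0; j < COLS; j++)
    for (int i = 0; i < ROWS; i++)
      s += X[i][j];
  return s;
}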

Can Use Reordering Transformations!
Analyze reuse in the computation
Apply loop reordering transformations to improve locality based on reuse
With any loop reordering transformation, always ask:
- Safety? (doesn't reverse dependences)
- Profitability? (improves locality)

Loop Permutation: An Example of a Reordering Transformation
Permute the order of the loops to modify the traversal order:

for (i=0; i<3; i++)
  for (j=0; j<6; j++)
    A[i][j+1] = A[i][j] + B[j];

for (j=0; j<6; j++)
  for (i=0; i<3; i++)
    A[i][j+1] = A[i][j] + B[j];

(The slide's iteration-space diagrams over i and j show the new traversal order of the second version.)
Which one is better for row-major storage? A timing sketch follows below.
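One way to answer the question empirically is to time both orders on a larger instance. The sketch below is only illustrative: the array sizes and the use of omp_get_wtime() are assumptions, and the slide's 3x6 bounds have been scaled up so the difference is measurable.

#include <stdio.h>
#include <omp.h>

#define NI 4096
#define NJ 4096
static double A2[NI][NJ+1];              /* illustrative arrays, not the slide's 3x6 case */
static double B2[NJ];

int main(void) {
  double t0 = omp_get_wtime();
  for (int i = 0; i < NI; i++)           /* original order: i outer, j inner */
    for (int j = 0; j < NJ; j++)
      A2[i][j+1] = A2[i][j] + B2[j];
  double t1 = omp_get_wtime();

  for (int j = 0; j < NJ; j++)           /* permuted order: j outer, i inner */
    for (int i = 0; i < NI; i++)
      A2[i][j+1] = A2[i][j] + B2[j];
  double t2 = omp_get_wtime();

  printf("i-outer: %f s   j-outer: %f s   (check: %f)\n",
         t1 - t0, t2 - t1, A2[NI-1][NJ]);
  return 0;
}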

Permutation has many goals:
- Locality optimization, particularly for spatial locality (like in your SIMD assignment)
- Rearrange the loop nest to move parallelism to the appropriate level of granularity:
  - Inward to exploit fine-grain parallelism (like in your SIMD assignment)
  - Outward to exploit coarse-grain parallelism
- Also, to enable other optimizations

Tiling (Blocking): Another Loop Reordering Transformation
Blocking reorders loop iterations to bring iterations that reuse data closer in time
The goal is to retain data in cache between reuses
(The slide's iteration-space diagrams over I and J contrast the untiled and tiled traversal orders.)

Tiling is Fundamental!
Tiling is very commonly used to manage limited storage:
- Registers
- Caches
- Software-managed buffers
- Small main memory
Can be applied hierarchically

Tiling Example
Original loop nest:
for (j=1; j<M; j++)
  for (i=1; i<N; i++)
    D[i] = D[i] + B[j][i];

Strip mine:
for (j=1; j<M; j++)
  for (i=1; i<N; i+=s)
    for (ii=i; ii<min(i+s-1,N); ii++)
      D[ii] = D[ii] + B[j][ii];

Permute:
for (i=1; i<N; i+=s)
  for (j=1; j<M; j++)
    for (ii=i; ii<min(i+s-1,N); ii++)
      D[ii] = D[ii] + B[j][ii];

A runnable sketch of the final tiled nest follows below.
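A self-contained, hedged version of the tiled nest is sketched below. The bounds, the tile size S = 256, and the min_int() helper are assumptions, and the strip's upper bound is written as i+S (exclusive) so that no elements are skipped.

#include <stdio.h>

#define N 2048
#define M 2048
#define S 256                                   /* assumed strip (tile) size */

static double D[N];
static double B[M][N];

static int min_int(int a, int b) { return a < b ? a : b; }

int main(void) {
  /* Tiled nest: a strip of S elements of D stays in cache while the j loop
     sweeps over all rows of B, so each D[ii] is reused from fast memory. */
  for (int i = 1; i < N; i += S)
    for (int j = 1; j < M; j++)
      for (int ii = i; ii < min_int(i + S, N); ii++)
        D[ii] = D[ii] + B[j][ii];

  printf("D[1] = %f\n", D[1]);                  /* use the result so the nest is not eliminated */
  return 0;
}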

How to Determine Safety and Profitability?
A step back to Lecture 2: the notion of reordering transformations is based on data dependences
Intuition: cannot permute two loops i and j in a loop nest if doing so reverses the direction of any dependence
The tiling safety test is similar
Profitability:
- Reuse analysis (and other cost models)
- Also based on data dependences, but simpler

Simple Examples: 2-d Loop Nests

for (i=0; i<3; i++)
  for (j=0; j<6; j++)
    A[i][j+1] = A[i][j] + B[j];

for (i=0; i<3; i++)
  for (j=0; j<6; j++)
    A[i+1][j-1] = A[i][j] + B[j];

Ok to permute?

Legality of Tiling
Tiling = strip-mine and permutation
- Strip-mine does not reorder iterations
- Permutation must be legal, OR
- strip size less than dependence distance