Modifying Program Code to Take Advantage of Compiler Optimizations
Responsible Pointer Usage
- Compiler alias analysis limits optimizations
- The developer knows the application - tell the compiler!
- Avoid pointing to the same memory address with two different pointers
- Use array notation when possible
- Avoid pointer arithmetic if possible
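As a minimal illustration of why aliasing limits optimization, the following C sketch (the function and array names are invented for this example, not taken from the labs) contrasts a pointer-based loop the compiler must treat conservatively with an array-notation version it can optimize freely:

    /* Pointer version: the compiler must assume dst and src may overlap,
     * so it cannot freely reorder, unroll, or software-pipeline the loop. */
    void scale_ptr(float *dst, float *src, int n, float f)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * f;      /* dst and src might alias */
    }

    /* Array notation on distinct file-scope arrays: no aliasing is
     * possible, so the compiler can schedule the loop aggressively. */
    #define N 1024
    static float a[N], b[N];

    void scale_array(float f)
    {
        for (int i = 0; i < N; i++)
            a[i] = b[i] * f;          /* a and b are provably distinct */
    }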
Pointer Disambiguation
- -Oa file.c (Windows) / -fno-alias file.c (Linux): all pointers in file.c are assumed not to alias
- -Ow file.c (Windows), not (yet) available on Linux: assume no aliasing within functions (i.e., pointer arguments are unique)
- -Qrestrict file.c (Windows) / -restrict (Linux): restrict qualifier, enables pointer disambiguation
- -Za file.c (Windows) / -ansi (Linux): enforce strict ANSI compliance (requires that pointers to different data types are not aliased)
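A minimal sketch of the restrict qualifier named above, assuming the keyword is enabled with -Qrestrict (Windows) or -restrict (Linux); the function below is an invented example:

    /* restrict promises the compiler that, within this function, memory
     * reached through out is never reached through in1 or in2.  With that
     * guarantee the loads of in1[i] and in2[i] can be hoisted, pipelined,
     * or vectorized without run-time overlap checks. */
    void vadd(float *restrict out,
              const float *restrict in1,
              const float *restrict in2,
              int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = in1[i] + in2[i];
    }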
High Level Optimizations Available at -O3
- Prefetch
- Loop interchange
- Unrolling
- Cache blocking
- Unroll-and-jam
- Scalar replacement
- Redundant zero-trip elimination
- Data dependence analysis
- Reuse analysis
- Loop recovery
- Canonical expressions
- Loop fusion
- Loop distribution
- Loop reversal
- Loop skewing
- Loop peeling
- Scalar expansion
- Register blocking
Data Prefetching

Original loop:

    for i = 1, M
      for j = 1, N
        A[j, i] = B[0, j] + B[0, j+1]
      end_for
    end_for

With selective prefetching inserted:

    for i = 1, M
      for j = 1, N
        A[j, i] = B[0, j] + B[0, j+1]
        if (mod(j, 8) == 0) lfetch.nta(A[j+d, i])
        if (i == 1)         lfetch.nt1(B[0, j+d])
      end_for
    end_for

- Adds prefetching instructions using selective prefetching
- Works for arrays, pointers, C structures, and C/C++ parameters
- Goal: issue one prefetch instruction per cache line
  - Itanium cache lines: L1 32B, L2 64B, L3 64B
  - Itanium 2 cache lines: L1 64B, L2 128B, L3 128B
- -O3 does this for you: "Let the compiler do the work!"
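For cases where prefetches are inserted by hand rather than by the compiler, the same pattern can be sketched in C. This is a minimal example using the __builtin_prefetch hint available in GCC-compatible compilers (not the Itanium lfetch instruction shown above); the prefetch distance AHEAD and the every-16th-iteration stride are illustrative assumptions:

    #include <stddef.h>

    #define AHEAD 64   /* prefetch distance in elements: an assumption */

    void copy_add(double *a, const double *b, size_t n)
    {
        for (size_t i = 0; i + 1 < n; i++) {
            /* Aim for roughly one prefetch per cache line: with 8-byte
             * doubles and a 128-byte line (Itanium 2 L2/L3), every 16th
             * iteration starts a new line. */
            if ((i & 15) == 0 && i + AHEAD < n)
                __builtin_prefetch(&b[i + AHEAD], /* rw= */ 0, /* locality= */ 0);
            a[i] = b[i] + b[i + 1];
        }
    }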
Loop Interchange

    for (i = 0; i < NUM; i++) {
        for (j = 0; j < NUM; j++) {
            for (k = 0; k < NUM; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }

- Note: the c[i][j] term is constant in the inner loop
- The fast inner loop index k is not the consecutive memory index of b[k][j], so the inner loop does not access memory with unit stride
- Interchange the loops to allow unit-stride memory access

Demo / Lab: Matrix with Loop Interchange, -O2
Unit-Stride Memory Access (C/C++ example; Fortran is the opposite)

[Figure: element layout of arrays a and b for the original i, j, k loop order. For a[i][k], incrementing k walks consecutive memory elements (unit-stride access); for b[k][j], incrementing k jumps between rows, so it does not touch consecutive memory elements (non-unit-stride access).]
Loop After Interchange

    for (i = 0; i < NUM; i++) {
        for (k = 0; k < NUM; k++) {
            for (j = 0; j < NUM; j++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }

- Note: the a[i][k] term is constant in the inner loop
- Two loads, one store, one FMA per iteration: F/M = 0.33, unit stride

Demo / Lab: Matrix with Loop Interchange, -O3
Unit-Stride Memory Access After Interchange (C/C++)

[Figure: element layout of arrays a and b for the interchanged i, k, j loop order. The fastest incremented index j gives consecutive memory access for b[k][j] (and for c[i][j]), and the next fastest index k walks a[i][k] consecutively, so all data access is unit-stride.]
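A self-contained sketch of the kind of kernel the interchange labs above exercise (the matrix size NUM, the array names, and the main harness are illustrative assumptions, not the actual lab source):

    #include <stdio.h>

    #define NUM 512                     /* illustrative size, not the lab's */

    static double a[NUM][NUM], b[NUM][NUM], c[NUM][NUM];

    /* Original i-j-k order: b[k][j] is accessed with a large stride. */
    void matmul_ijk(void)
    {
        for (int i = 0; i < NUM; i++)
            for (int j = 0; j < NUM; j++)
                for (int k = 0; k < NUM; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    /* Interchanged i-k-j order: the inner index j walks c and b
     * contiguously, giving unit-stride access throughout. */
    void matmul_ikj(void)
    {
        for (int i = 0; i < NUM; i++)
            for (int k = 0; k < NUM; k++)
                for (int j = 0; j < NUM; j++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    int main(void)
    {
        for (int i = 0; i < NUM; i++)
            for (int j = 0; j < NUM; j++) {
                a[i][j] = 1.0; b[i][j] = 2.0; c[i][j] = 0.0;
            }
        matmul_ikj();                   /* swap in matmul_ijk() to compare */
        printf("c[0][0] = %f\n", c[0][0]);
        return 0;
    }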
Loop Unrolling

Original loop (N = 1025, M = 5):

    DO I = 1, N
      DO J = 1, M
        A(J,I) = B(J,I) + C(J,I) * D
      ENDDO
    ENDDO

Outer loop unrolled by 4, with a preconditioning loop for the leftover iterations:

    II = MOD(N, 4)
    DO I = 1, II
      DO J = 1, M
        A(J,I) = B(J,I) + C(J,I) * D
      ENDDO
    ENDDO
    DO I = II+1, N, 4
      DO J = 1, M
        A(J,I)   = B(J,I)   + C(J,I)   * D
        A(J,I+1) = B(J,I+1) + C(J,I+1) * D
        A(J,I+2) = B(J,I+2) + C(J,I+2) * D
        A(J,I+3) = B(J,I+3) + C(J,I+3) * D
      ENDDO
    ENDDO

- Unroll the largest loops
- If the loop size is known, the preconditioning loop can be eliminated by choosing the number of times to unroll

Demo / Lab: Matrix with Loop Unrolling by 2
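The same pattern in C, as a minimal sketch (the unroll factor of 4 follows the Fortran example above; the function and array names are invented, and the remainder loop plays the role of the preconditioning loop):

    /* Unroll-by-4 with a remainder ("preconditioning") loop.
     * d is a loop-invariant scalar. */
    void add_scaled(float *a, const float *b, const float *c,
                    float d, int n)
    {
        int i;
        int rem = n % 4;

        /* Handle the leftover iterations first... */
        for (i = 0; i < rem; i++)
            a[i] = b[i] + c[i] * d;

        /* ...then run the unrolled main loop on an exact multiple of 4. */
        for (; i < n; i += 4) {
            a[i]     = b[i]     + c[i]     * d;
            a[i + 1] = b[i + 1] + c[i + 1] * d;
            a[i + 2] = b[i + 2] + c[i + 2] * d;
            a[i + 3] = b[i + 3] + c[i + 3] * d;
        }
    }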
Loop Unrolling - Candidates
- If the trip count is low and known at compile time, it may make sense to fully unroll
- Poor candidates (similar issues apply to software pipelining and the vectorizer):
  - Low trip count loops, e.g. for (j = 0; j < N; j++) with N = 4 at runtime
  - Fat loops: the loop body already has lots of computation taking place
  - Loops containing procedure calls
  - Loops with branches
Loop Unrolling - Benefits and Costs
- Benefits:
  - Performs more computations per loop iteration
  - Reduces the effect of loop overhead
  - Can increase the floating-point to memory access ratio (F/M)
- Costs:
  - Register pressure
  - Code bloat
Loop Unrolling - Example

    for (i = 0; i < NUM; i = i + 2) {
        for (k = 0; k < NUM; k = k + 2) {
            for (j = 0; j < NUM; j++) {
                c[i][j]   = c[i][j]   + a[i][k]     * b[k][j];
                c[i+1][j] = c[i+1][j] + a[i+1][k]   * b[k][j];
                c[i][j]   = c[i][j]   + a[i][k+1]   * b[k+1][j];
                c[i+1][j] = c[i+1][j] + a[i+1][k+1] * b[k+1][j];
            }
        }
    }

- The a[...] terms are loop-invariant in the inner loop
- With all loops unrolled by 4, each iteration performs 32 loads, 16 stores, and 64 FMAs: F/M = 1.33

Demo / Lab: Matrix with Loop Unrolling by 4
Cache Blocking

Original loop:

    for i = 1, 1000
      for j = 1, 1000
        for k = 1, 1000
          A[i, j, k] = A[i, j, k] + B[i, k, j]
        end_for
      end_for
    end_for

Blocked loop:

    for v = 1, 1000, 20
      for u = 1, 1000, 20
        for k = v, v+19
          for j = u, u+19
            for i = 1, 1000
              A[i, j, k] = A[i, j, k] + B[i, k, j]
            end_for
          end_for
        end_for
      end_for
    end_for

- Use when all arrays in the loop do not fit in cache
- Effective for huge out-of-core memory applications
- Effective for large out-of-cache applications
- Work on "neighborhoods" of data and keep these neighborhoods in cache
- Helps reduce TLB and cache misses
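A minimal C sketch of the same blocking idea on a 2-D problem (the size N, the block size BLK, and the transpose-add kernel are illustrative assumptions standing in for the 3-D example above):

    #define N   1024    /* illustrative problem size */
    #define BLK 64      /* illustrative block ("neighborhood") size; N % BLK == 0 */

    static double A[N][N], B[N][N];

    /* Blocked version of A[i][j] += B[j][i]: each (ii, jj) tile of A and
     * the corresponding tile of B are small enough to stay resident in
     * cache while they are being used, reducing cache and TLB misses. */
    void transpose_add_blocked(void)
    {
        for (int ii = 0; ii < N; ii += BLK)
            for (int jj = 0; jj < N; jj += BLK)
                for (int i = ii; i < ii + BLK; i++)
                    for (int j = jj; j < jj + BLK; j++)
                        A[i][j] += B[j][i];
    }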