Performance Optimization: Getting Your Programs to Run Faster (CS 691)

Why optimize?
 • Better turn-around on jobs
 • Run more programs/scenarios
 • Release resources to other applications
 • You want the job to finish before you retire

Ways to get more performance
 • Run on bigger, faster hardware (clock speed, more memory, …)
 • Tweak your algorithm
 • Optimize your code

Loop Unrolling
 • Converting passes of a loop into in-line streams of code
 • Useful when loops do calculations on data in arrays
 • Unrolling can take advantage of the pipelined execution units in processors
 • The compiler may preload operands into CPU registers

Loop Unrolling – disadvantages
 • The unroll factor may be limited by the number of floating-point registers:
    Pentium III: 8
    Pentium 4: 8
    Itanium: 128

Loop Unrolling – simple example

Loop:

    do i=1,n
       a(i) = b(i) + x*c(i)
    enddo

Unrolled loop (factor of 4; this assumes n is a multiple of 4, otherwise a short cleanup loop must handle the leftover iterations):

    do i=1,n,4
       a(i)   = b(i)   + x*c(i)
       a(i+1) = b(i+1) + x*c(i+1)
       a(i+2) = b(i+2) + x*c(i+2)
       a(i+3) = b(i+3) + x*c(i+3)
    enddo

Loop Unrolling – simple example: performance*

                 Rolled        Unrolled
    P3 550 MHz   13 Mflops     30 Mflops
    Itanium      30 Mflops     107 Mflops

*from: LCI and NCSA

Loop Unrolling

    int a[100];
    for (i=0; i<100; i++) {
        a[i] = a[i] * 2;
    }

unrolled by a factor of 5:

    int a[100];
    for (i=0; i<100; i+=5) {
        a[i]   = a[i]   * 2;
        a[i+1] = a[i+1] * 2;
        a[i+2] = a[i+2] * 2;
        a[i+3] = a[i+3] * 2;
        a[i+4] = a[i+4] * 2;
    }
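Both versions work because 100 is an exact multiple of the unroll factor. A minimal C sketch of the general case, assuming an arbitrary length n (the names a and n are placeholders, not from the slides):

    /* Unroll by 4, then finish the leftover 0-3 iterations in a cleanup loop. */
    int i;
    int limit = n - (n % 4);          /* largest multiple of 4 <= n */
    for (i = 0; i < limit; i += 4) {
        a[i]   = a[i]   * 2;
        a[i+1] = a[i+1] * 2;
        a[i+2] = a[i+2] * 2;
        a[i+3] = a[i+3] * 2;
    }
    for (; i < n; i++)                /* cleanup: the remaining n % 4 elements */
        a[i] = a[i] * 2;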

Loop unrolling – fully unrolling an inner loop

    int a[10][10];
    for (i=0; i<10; i++) {
        for (j=0; j<10; j++) {
            a[i][j] = a[i][j] * 2;
        }
    }

becomes:

    int a[10][10];
    for (i=0; i<10; i++) {
        a[i][0] = a[i][0] * 2;
        a[i][1] = a[i][1] * 2;
        a[i][2] = a[i][2] * 2;
        a[i][3] = a[i][3] * 2;
        a[i][4] = a[i][4] * 2;
        a[i][5] = a[i][5] * 2;
        a[i][6] = a[i][6] * 2;
        a[i][7] = a[i][7] * 2;
        a[i][8] = a[i][8] * 2;
        a[i][9] = a[i][9] * 2;
    }

Loop unrolling – Dot Product

    float a[100];
    float b[100];
    float z = 0;
    for (i=0; i<100; i++) {
        z = z + a[i] * b[i];
    }

unrolled by 2:

    float a[100];
    float b[100];
    float z = 0;
    for (i=0; i<100; i+=2) {
        z = z + a[i]   * b[i];
        z = z + a[i+1] * b[i+1];
    }
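As written, both versions feed every product into the single accumulator z, so each addition still waits on the previous one. A sketch, not from the slides, that splits the sum across two accumulators to expose instruction-level parallelism (this reorders the floating-point additions, so the rounding can differ slightly):

    float z0 = 0, z1 = 0;
    for (i = 0; i < 100; i += 2) {
        z0 = z0 + a[i]   * b[i];     /* the two dependence chains are */
        z1 = z1 + a[i+1] * b[i+1];   /* independent and can overlap   */
    }
    z = z0 + z1;                     /* combine the partial sums */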

Unrolling Loops
You can have the compiler do it automatically.

Unrolling Loops – compiler options

GNU Compilers
 • -funroll-loops
 • -funroll-all-loops (not recommended)

PGI Compilers
 • -Munroll
 • -Munroll=c:N
 • -Munroll=n:M

Unrolling Loops – Compiler Options

Intel Compilers
 • -unrollM (unroll up to M times)
 • -unroll

Taking Memory in Order
 • Optimizing the use of cache: row-major order vs. column-major order
 • row major → a(1,1), a(1,2), a(1,3), a(2,1), a(2,2), …
 • column major → a(1,1), a(2,1), a(3,1), a(1,2), a(2,2), …

Taking Memory in Order
Remember: C and Fortran store arrays in opposite orders
 • C – row major
 • Fortran – column major

Taking Memory in Order
[Figure from the original slides: memory-layout diagrams contrasting C (row major) with Fortran (column major)]

Taking Memory in Order

Inner loop strides across rows (out of order for Fortran):

    do i=1,m
       do j=1,n
          a(i,j) = b(i,j) + c(i)
       end do
    end do

loop runs at 4.48 Mflops

Inner loop walks down columns (in order for Fortran):

    do j=1,n
       do i=1,m
          a(i,j) = b(i,j) + c(i)
       end do
    end do

loop time: 2.80, loop runs at … Mflops
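The same rule applies in C, but mirrored: C is row major, so the last subscript should vary fastest. A minimal C sketch, not from the slides (M, N, and the array names are placeholders):

    /* In order for C: the inner loop varies the last subscript,
       so successive accesses touch adjacent memory. */
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            a[i][j] = b[i][j] + c[i];

    /* Out of order for C: each inner-loop iteration jumps a whole
       row (N elements) ahead, defeating the cache. */
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            a[i][j] = b[i][j] + c[i];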

Floating Point Division
 • FP division is very expensive in terms of processor time: it takes many clock cycles to compute
 • Usually not pipelined
 • Exact FP division is required by the IEEE “rules”

Floating point division – use the reciprocal

    float a[100];
    for (i=0; i<100; i++) {
        a[i] = a[i] / 2;
    }

becomes:

    float a[100];
    float denom;
    denom = 1.0f / 2;   /* 0.5f; the integer expression 1/2 would give 0 */
    for (i=0; i<100; i++) {
        a[i] = a[i] * denom;
    }
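The same trick works when the divisor is a run-time value, as long as it is loop-invariant. A minimal sketch, assuming a variable divisor that is not in the slides:

    /* One division outside the loop replaces n divisions inside it. */
    float recip = 1.0f / divisor;
    for (i = 0; i < n; i++)
        a[i] = a[i] * recip;

The product a[i] * recip can differ from a[i] / divisor in the last bit, which is exactly why the IEEE-compatibility compiler options below exist.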

Compiler options for IEEE compatibility
 • PGI Compilers – -Knoieee
 • Intel Compilers – -mp
 • GNU Compilers – no equivalent option

Floating Point Division
 • Compilers can’t perform this optimization if the divisor is not a scalar
 • The optimization breaks the IEEE “rules”
 • May impact portability (results can differ from platform to platform)

Function Inlining
 • Build functions/subroutines in as inline parts of the program’s code, rather than as separately called functions/subroutines
 • Minimizes function calls (and the call-management overhead that comes with them)
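A minimal C sketch of the idea, not from the slides: a tiny function inside a hot loop is a prime inlining candidate, because the call overhead can rival the work itself.

    /* Small and frequently called: a good candidate for inlining. */
    static inline float twice(float x) { return x * 2.0f; }

    void scale_all(float *a, int n) {
        for (int i = 0; i < n; i++)
            a[i] = twice(a[i]);   /* after inlining this is just
                                     a[i] = a[i] * 2.0f; with no call */
    }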

Function Inlining
Compile with:
 • -Minline → the compiler tries to inline whatever meets its criteria
 • -Minline=except:func → excludes func from inlining
 • -Minline=func → inlines only func

Function Inlining
… and also compile with:
 • -Minline=myfile.lib → inlines functions from an inline library file
 • -Minline=levels:n → inlines functions up to n levels of calls (the default is usually 1)

MPI Tuning
 • Minimize messages
 • Pointers/counts
 • MPI derived datatypes
 • MPI_Pack/MPI_Unpack
 • Use shared memory for message passing:

    #PBS -l nodes=6:ppn=1

   … but …

    #PBS -l nodes=3:ppn=2

   … is better: with two ranks per node, messages between ranks on the same node can go through shared memory instead of the network.
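A hedged C sketch of one of these techniques: a derived datatype that sends a strided column of a row-major matrix as a single message instead of N separate sends. The matrix size, destination rank, and tag below are assumptions, not from the slides.

    #include <mpi.h>

    #define N 100   /* placeholder matrix dimension */

    /* Send column `col` of a row-major N x N matrix in one message. */
    void send_column(double mat[N][N], int col, int dest) {
        MPI_Datatype column;
        /* N blocks of 1 double, consecutive blocks N doubles apart */
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);
        MPI_Send(&mat[0][col], 1, column, dest, 0, MPI_COMM_WORLD);
        MPI_Type_free(&column);
    }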

Compiler optimizations
 • -O0 – no optimization
 • -O1 – local optimization, register allocation
 • -O2 – local/limited global optimization
 • -O3 – aggressive global optimization
 • -Munroll – loop unrolling
 • -Mvect – vectorization
 • -Minline – function inlining

gcc Compiler Optimizations
See: