CS 380C: Advanced Topics in Compilers

Administration
Instructor: Keshav Pingali
–Professor (CS, ICES)
–ACES 4.126A
TA: Muhammed Zubair Malik
–Graduate student (ECE)

Meeting times
Lecture: MW 12:30-2:00PM, ENS 126
Office hours:
–Keshav Pingali: Monday 3:00-4:00PM
–Muhammed Malik: TBA
Meeting at other times:
–send email to set up an appointment

Prerequisites
Compilers and architecture
–Some background in compilers (front-end material)
–Basic computer architecture
Software and math maturity
–Able to implement large programs in C
–Comfortable with abstractions like graph theory and integer linear programming
Ability to read research papers and understand their content

Course material
Website for course
–All lecture notes, announcements, papers, assignments, etc. will be posted there
No assigned book for the course
–but we will put papers and other material on the website as appropriate

Coursework
Roughly 3-4 assignments that combine
–problem sets: written answers to questions
–programming assignments: implementing optimizations in a compiler test-bed
Term project
–substantial programming project that may involve working with other test-beds
–would be publishable, ideally
–based on our ideas or yours
Paper presentation
–towards the end of the semester

What do compilers do?
Conventional view of compilers
–Program that analyzes and translates a high-level language program automatically into low-level machine code that can be executed by the hardware
–May do simple (scalar) optimizations to reduce the number of operations
–Ignores data structures for the most part
Modern view of compilers
–Program for translation, transformation and verification of high-level language programs
–Reordering (restructuring) the computations is as important, if not more important, than reducing the amount of computation
–Optimization of data structure computations is critical
–Program analysis techniques can be useful for other applications such as debugging, verifying the correctness of a program against a specification, detecting malware, ….

(Diagram: high-level language programs are translated, via intermediate language programs, into semantically equivalent machine language programs)

Why do we need compilers?
Bridge the “semantic gap”
–Programmers prefer to write programs at a high level of abstraction
–Modern architectures are very complex, so to get good performance, we have to worry about a lot of low-level details
–Compilers let programmers write high-level programs and still get good performance on complex machine architectures
Application portability
–When a new ISA or architecture comes out, you only need to reimplement the compiler on that machine
–Application programs should run without (substantial) modification
–Saves a huge amount of programming effort

Complexity of modern architectures: AMD Barcelona Quad-core Processor

Discussion
To get good performance on modern processors, a program must exploit
–coarse-grain (multicore) parallelism
–memory hierarchy (L1, L2, L3, ..)
–instruction-level parallelism (ILP)
–registers
–….
Key questions:
–How important is it to exploit these hardware features?
 If you have n cores and you run on only one, you get at most 1/n of peak performance, so this is obvious
 How about other hardware features?
–If it is important, how hard is it to do this by hand?
 Let us look at memory hierarchies to get a feel for this
–Typical latencies
 L1 cache: ~ 1 cycle
 L2 cache: ~ 10 cycles
 Memory: ~ 100s of cycles

Software problem
Caches are useful only if programs have locality of reference
–temporal locality: program references to a given memory address are clustered together in time
–spatial locality: program references clustered in address space are clustered in time
Problem:
–Programs obtained by expressing most algorithms in the straightforward way do not have much locality of reference
–Worrying about locality when coding algorithms complicates the software process enormously
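
To make the two kinds of locality concrete, here is a small C sketch (my illustration, not from the slides; the array size N and the function names are arbitrary). Both functions compute the same sum, but the first walks the row-major array in address order while the second strides through it.

#include <stddef.h>

#define N 1024

/* Spatial locality: the row-major traversal visits consecutive addresses,
   so every cache line brought in is used completely.
   Temporal locality: the accumulator 'sum' is reused on every iteration. */
double sum_row_major(const double A[N][N]) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += A[i][j];
    return sum;
}

/* Poor spatial locality: consecutive accesses are N*sizeof(double) bytes
   apart, so for large N almost every access touches a new cache line. */
double sum_col_major(const double A[N][N]) {
    double sum = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += A[i][j];
    return sum;
}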

Example: matrix multiplication
Great algorithmic data reuse: each array element is touched O(N) times!
All six loop permutations are computationally equivalent (even modulo round-off error).
However, execution times of the six versions can be very different if the machine has a cache.
DO I = 1, N  //assume arrays stored in row-major order
 DO J = 1, N
  DO K = 1, N
   C(I,J) = C(I,J) + A(I,K)*B(K,J)
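
For reference, here is a plain C rendering of two of the six permutations (a sketch; the slides use Fortran-style pseudocode, and the global arrays and value of N below are illustrative). C stores arrays in row-major order, matching the comment in the pseudocode.

#define N 1000

double A[N][N], B[N][N], C[N][N];

/* IJK order: in the inner loop, A is read along a row (good spatial
   locality) while B is read down a column (poor spatial locality). */
void mmm_ijk(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* JKI order: in the inner loop, both C and A are read down columns
   (stride N), so spatial locality is poor for large N. */
void mmm_jki(void) {
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                C[i][j] += A[i][k] * B[k][j];
}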

IJK version (large cache)
DO I = 1, N
 DO J = 1, N
  DO K = 1, N
   C(I,J) = C(I,J) + A(I,K)*B(K,J)
Large cache scenario:
–Matrices are small enough to fit into cache
–Only cold misses, no capacity misses
–Miss ratio:
 Data size = 3N^2
 Each miss brings in b floating-point numbers
 Miss ratio = 3N^2 / (b*4N^3) = 0.75/(bN) ≈ 0.019 (b = 4, N = 10)
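
Spelling out the arithmetic (my derivation of the step the slide compresses: each innermost iteration makes 4 array references, and there are N^3 iterations), in LaTeX form:

\[
\text{miss ratio} = \frac{3N^2/b}{4N^3} = \frac{0.75}{bN} \approx 0.019 \qquad (b = 4,\ N = 10)
\]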

IJK version (small cache)
DO I = 1, N
 DO J = 1, N
  DO K = 1, N
   C(I,J) = C(I,J) + A(I,K)*B(K,J)
Small cache scenario:
–Matrices are large compared to cache; row-major storage
–Cold and capacity misses
–Miss ratio:
 C: N^2/b misses (good temporal locality)
 A: N^3/b misses (good spatial locality)
 B: N^3 misses (poor temporal and spatial locality)
 Miss ratio ≈ 0.25 (b+1)/b = 0.3125 (for b = 4)
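
Again dividing by the 4N^3 total references and dropping the lower-order N^2 term (my arithmetic):

\[
\text{miss ratio} \approx \frac{N^2/b + N^3/b + N^3}{4N^3} \approx \frac{1/b + 1}{4} = \frac{0.25\,(b+1)}{b} = 0.3125 \qquad (b = 4)
\]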

MMM Experiments
Simulated L1 cache miss ratio for Intel Pentium III
–MMM with N = 1…1300
–16KB cache, 32B/block, 4-way set-associative, 8-byte elements

Quantifying performance differences
DO I = 1, N  //assume arrays stored in row-major order
 DO J = 1, N
  DO K = 1, N
   C(I,J) = C(I,J) + A(I,K)*B(K,J)
Typical cache parameters:
–L2 cache hit: 10 cycles, cache miss: 70 cycles
Time to execute IKJ version:
 2N^3 + 70*0.13*4N^3 + 10*0.87*4N^3 = 73.2 N^3
Time to execute JKI version:
 2N^3 + 70*0.5*4N^3 + 10*0.5*4N^3 = 162 N^3
Speed-up = 2.2
Key transformation: loop permutation
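
These numbers come from a simple cost model (my reconstruction of the formula behind the slide): 2N^3 cycles of arithmetic plus 4N^3 memory references, each costing 70 cycles on a miss and 10 cycles on a hit, with miss ratio m taken from the simulation above (0.13 for IKJ, 0.5 for JKI):

\[
T \approx 2N^3 + 4N^3\bigl(70\,m + 10\,(1-m)\bigr), \qquad
\frac{T_{\mathrm{JKI}}}{T_{\mathrm{IKJ}}} = \frac{162\,N^3}{73.2\,N^3} \approx 2.2
\]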

Even better…..
Break MMM into a bunch of smaller MMMs so that the large cache model is true for each small MMM
 ⇒ large cache model is valid for the entire computation
 ⇒ miss ratio will be 0.75/(bt) for the entire computation, where t is the tile size

Loop tiling
Break the big MMM into a sequence of smaller MMMs, where each smaller MMM multiplies sub-matrices of size t x t.
Parameter t (tile size) must be chosen carefully
–as large as possible
–working set of the small matrix multiplication must fit in cache
DO It = 1,N,t
 DO Jt = 1,N,t
  DO Kt = 1,N,t
   DO I = It,It+t-1
    DO J = Jt,Jt+t-1
     DO K = Kt,Kt+t-1
      C(I,J) = C(I,J) + A(I,K)*B(K,J)
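
A C rendering of the tiled loop nest (a sketch under the assumption that N is a multiple of the tile size T; handling ragged edges would need min() bounds on the inner loops):

#define N 1000
#define T 50   /* tile size; N is assumed to be a multiple of T here */

double A[N][N], B[N][N], C[N][N];

/* Tiled MMM: each (it, jt, kt) iteration multiplies a TxT block of A by a
   TxT block of B and accumulates into a TxT block of C, so the working set
   of the inner three loops fits in cache if T is chosen well. */
void mmm_tiled(void) {
    for (int it = 0; it < N; it += T)
        for (int jt = 0; jt < N; jt += T)
            for (int kt = 0; kt < N; kt += T)
                for (int i = it; i < it + T; i++)
                    for (int j = jt; j < jt + T; j++)
                        for (int k = kt; k < kt + T; k++)
                            C[i][j] += A[i][k] * B[k][j];
}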

Speed-up from tiling
Miss ratio for the blocked computation = miss ratio for the large cache model
 = 0.75/(bt) ≈ 0.001 (b = 4, t = 200)
Time to execute tiled version:
 2N^3 + 70*0.001*4N^3 + 10*0.999*4N^3 = 42.3 N^3
Speed-up over JKI version ≈ 4

Observations
Locality-optimized code is more complex than the high-level algorithm.
Locality optimization changed the order in which operations were done, not the number of operations.
A “fine-grain” view of data structures (arrays) is critical.
Loop orders and tile size must be chosen carefully
–cache size is the key parameter
–associativity matters
Actual code is even more complex: must optimize for processor resources
–registers: register tiling
–pipeline: loop unrolling (see the sketch after this slide)
–Optimized MMM code can be ~1000s of lines of C code
Wouldn’t it be nice to have all this done automatically by a compiler?
–Actually, it is done automatically nowadays…
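
To give a flavor of what register tiling and loop unrolling look like, here is a hedged illustration (mine, not the slides' code; real ATLAS-style kernels are far more elaborate): the i and j loops are unrolled by 2 so that a 2x2 block of C stays in scalar variables, i.e. registers, across the whole k loop. Larger register tiles expose more ILP but run out of registers, which is exactly the trade-off empirical search explores.

#define N 1000
#define T 50   /* cache tile size; N assumed to be a multiple of T, T even */

double A[N][N], B[N][N], C[N][N];

/* Cache tiling plus a 2x2 register tile: the four accumulators c00..c11
   live in registers for the whole k loop, and the i/j loops are unrolled
   by 2 to expose instruction-level parallelism. */
void mmm_tiled_regblocked(void) {
    for (int it = 0; it < N; it += T)
        for (int jt = 0; jt < N; jt += T)
            for (int kt = 0; kt < N; kt += T)
                for (int i = it; i < it + T; i += 2)
                    for (int j = jt; j < jt + T; j += 2) {
                        double c00 = C[i][j],   c01 = C[i][j+1];
                        double c10 = C[i+1][j], c11 = C[i+1][j+1];
                        for (int k = kt; k < kt + T; k++) {
                            double a0 = A[i][k], a1 = A[i+1][k];
                            double b0 = B[k][j], b1 = B[k][j+1];
                            c00 += a0 * b0;  c01 += a0 * b1;
                            c10 += a1 * b0;  c11 += a1 * b1;
                        }
                        C[i][j]     = c00;  C[i][j+1]   = c01;
                        C[i+1][j]   = c10;  C[i+1][j+1] = c11;
                    }
}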

Performance of MMM code produced by Intel’s Itanium compiler (-O3): 92% of peak performance.
Goto BLAS obtains close to 99% of peak, so the compiler is pretty good!

Discussion
Exploiting parallelism, memory hierarchies, etc. is very important
–If a program uses only one core out of the n cores in a processor, it gets at most 1/n of peak performance
–Memory hierarchy optimizations are very important: they can improve performance by a factor of 10 or more
Key points:
–need to focus on data structure manipulation
–reorganization of computations and data structure layout are key
–there are usually few opportunities to reduce the number of computations

Course content (scalar stuff)
Introduction
–compiler structure, architecture and compilation, sources of improvement
Control flow analysis
–basic blocks & loops, dominators, postdominators, control dependence
Data flow analysis
–lattice theory, iterative frameworks, reaching definitions, liveness
Static-single assignment
–static-single assignment, constant propagation
Global optimizations
–loop invariant code motion, common subexpression elimination, strength reduction
Register allocation
–coloring, allocation, live range splitting
Instruction scheduling
–pipelined and VLIW architectures, list scheduling

Course content (data structure stuff)
Array dependence analysis
–integer linear programming, dependence abstractions
Loop transformations
–linear loop transformations, loop fusion/fission, enhancing parallelism and locality
Self-optimizing programs
–empirical search, ATLAS, FFTW
Optimistic evaluation of irregular programs
–data parallelism in irregular programs, optimistic parallelization
Optimizing irregular program execution
–points-to analysis, shape analysis
Program verification
–Floyd-Hoare style proofs, model checking, theorem provers

Lecture schedule
–see the course website