Memory-Aware Compilation
Philip Sweany, 10/20/2011

Architectural Diversity
“Simple” load/store
Instruction-level parallel
Heterogeneous multi-core parallelism
“Traditional” parallel architectures
– Vector
– MIMD
Many-core
Next???

Load/Store Architecture
All arithmetic must take place in registers
Cache hits typically 3-5 cycles
Cache misses more like 100 cycles
Compiler tries to keep scalars in registers
Graph-coloring register assignment
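
The slide above names graph-coloring register assignment; below is a minimal, self-contained sketch of the idea — a greedy coloring over a hand-made interference graph, not the full Chaitin-style simplify/spill algorithm and not code from the talk. Virtual registers that interfere must receive different physical registers; a node that cannot be colored with the K available registers is spilled to memory.

/* Minimal sketch of graph-coloring register assignment.  The interference
 * graph, the number of virtual registers, and K are hypothetical values
 * chosen only to illustrate the idea (v0..v3 form a 4-clique, so with
 * K = 3 one of them must spill). */
#include <stdio.h>

#define NVREG 6   /* virtual registers v0..v5 */
#define K     3   /* physical registers available */

/* interference[i][j] = 1 if vi and vj are live at the same time */
static const int interference[NVREG][NVREG] = {
    {0,1,1,1,0,0},
    {1,0,1,1,0,0},
    {1,1,0,1,0,0},
    {1,1,1,0,1,0},
    {0,0,0,1,0,1},
    {0,0,0,0,1,0},
};

int main(void) {
    int color[NVREG];
    for (int v = 0; v < NVREG; v++) {
        int used[K] = {0};
        /* mark colors already taken by interfering, already-colored neighbors */
        for (int u = 0; u < v; u++)
            if (interference[v][u] && color[u] >= 0)
                used[color[u]] = 1;
        color[v] = -1;                    /* -1 means "spill to memory" */
        for (int c = 0; c < K; c++)
            if (!used[c]) { color[v] = c; break; }
        if (color[v] >= 0)
            printf("v%d -> r%d\n", v, color[v]);
        else
            printf("v%d -> spilled\n", v);
    }
    return 0;
}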

Instruction-Level Parallelism (ILP)
ILP architectures include:
– Multiple pipelined functional units
– Static or dynamic scheduling
Compiler schedules instructions to reduce execution time:
– Local scheduling
– Global scheduling
– Software pipelining
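
As one concrete illustration of local (basic-block) scheduling, the sketch below greedily list-schedules a four-instruction dependence DAG, loosely modeled on the matrix-multiply body shown later, onto a 2-wide machine. The DAG, the latencies, and the issue width are illustrative assumptions, not the talk's machine model.

/* Minimal list-scheduling sketch for local (basic-block) scheduling.
 * The four-instruction DAG, the latencies, and the 2-wide issue model
 * are hypothetical values chosen for illustration only. */
#include <stdio.h>

#define N     4   /* instructions */
#define WIDTH 2   /* instructions issued per cycle */

static const char *name[N]   = {"load t1", "load t2", "mul t3,t1,t2", "add t0,t0,t3"};
static const int  latency[N] = {5, 5, 2, 1};
/* dep[i][j] = 1 if instruction j must wait for the result of i */
static const int dep[N][N] = {
    {0,0,1,0},
    {0,0,1,0},
    {0,0,0,1},
    {0,0,0,0},
};

int main(void) {
    int issue[N];                 /* cycle each instruction is issued */
    int done = 0, cycle = 0;
    for (int i = 0; i < N; i++) issue[i] = -1;
    while (done < N) {
        int slots = WIDTH;
        for (int i = 0; i < N && slots > 0; i++) {
            if (issue[i] >= 0) continue;
            int ready = 1;
            /* ready only if every predecessor has issued and finished */
            for (int p = 0; p < N; p++)
                if (dep[p][i] && (issue[p] < 0 || issue[p] + latency[p] > cycle))
                    ready = 0;
            if (ready) {
                issue[i] = cycle;
                printf("cycle %2d: %s\n", cycle, name[i]);
                done++;
                slots--;
            }
        }
        cycle++;
    }
    return 0;
}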

“Typical” ILP Architecture
8 “generic” pipelined functional units
Timing:
– Register operations require 1 cycle
– Memory operations (loads) require 5 cycles (hit) or 50 cycles (miss), pipelined of course
– Stores are buffered, so they don’t require time directly

Matrix Multiply
matrix_multiply(a, b, c : int[4][4]):
  for i from 0 to 3
    for j from 0 to 3
      c[i][j] = 0
      for k from 0 to 3
        c[i][j] += a[i][k] * b[k][j]
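
For reference, here is the same kernel as a self-contained, runnable C program; the 4x4 size and the int element type follow the pseudocode above, and the test data in main is made up.

/* Runnable C version of the 4x4 integer matrix multiply from the slide. */
#include <stdio.h>

#define N 4

void matrix_multiply(const int a[N][N], const int b[N][N], int c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

int main(void) {
    int a[N][N], b[N][N], c[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;          /* arbitrary test data */
            b[i][j] = (i == j);       /* identity matrix */
        }
    matrix_multiply(a, b, c);
    printf("c[2][3] = %d\n", c[2][3]); /* expect a[2][3] = 5 */
    return 0;
}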

Single Loop Schedule (ILP)
1. t1 = a[i][k]   #   t2 = b[k][j]
2. nop
3. nop
4. nop
5. t3 = t1 * t2
6. t0 += t3
(t0 = c[i][j] before the loop and c[i][j] = t0 after the loop)

Software Pipelining
Can “cover” any latency, removing the nops from the single-loop schedule, if and only if conditions are “right.”
They are right for matrix multiply, so …

Software Pipelined Matrix Mult
All of the operations can be packed into a single-cycle kernel, speeding up the loop by a factor of 7:
t1 = a[i][k],  t2 = b[k][j],  t3 = t1[-5] * t2[-5],  t0 += t3
(here t1[-5] and t2[-5] denote the values loaded five iterations earlier)
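
Below is a rough, source-level C rendering of that idea, an illustrative sketch only: real software pipelining is performed by the instruction scheduler on machine code, not in C source. The loads issued for iteration k are consumed STAGES iterations later, where STAGES = 5 stands in for the 5-cycle load latency assumed above.

/* Conceptual sketch of the software-pipelined inner loop.  The STAGES
 * constant and the per-iteration buffers t1[], t2[] are assumptions used
 * to mimic loads that complete five iterations after they are issued. */
#include <stdio.h>

#define N      4
#define STAGES 5   /* assumed load latency, in iterations */

int main(void) {
    int a[N][N], b[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i + j; b[i][j] = (i == j); }

    int i = 2, j = 3;                 /* compute c[2][3] as an example */
    int t1[N], t2[N];                 /* values "in flight", one per iteration */
    int t0 = 0;

    /* In a real schedule, the loads for iteration k and the multiply-add
     * for iteration k - STAGES share one cycle.  Because N < STAGES here,
     * the loop degenerates into a prologue of loads followed by an
     * epilogue of multiply-adds. */
    for (int k = 0; k < N + STAGES; k++) {
        if (k < N) {                              /* issue this iteration's loads */
            t1[k] = a[i][k];
            t2[k] = b[k][j];
        }
        if (k >= STAGES && k - STAGES < N)        /* consume loads issued earlier */
            t0 += t1[k - STAGES] * t2[k - STAGES];
    }
    printf("c[%d][%d] = %d\n", i, j, t0);         /* expect a[2][3] = 5 */
    return 0;
}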

Improved Software Pipelining?
Unroll-and-jam on nested loops can significantly shorten execution time
A cache-reuse model can give better schedules than assuming all cache accesses are hits, and can reduce register requirements compared with assuming all accesses are misses
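
As a concrete, made-up illustration of unroll-and-jam on the matrix-multiply nest: the j loop is unrolled by two and the two copies are jammed into one inner k loop, so each load of a[i][k] feeds two multiply-adds and the two accumulators expose more work per iteration to the software pipeliner.

/* Unroll-and-jam sketch: the j loop is unrolled by 2 and the copies are
 * jammed into a single k loop, so each load of a[i][k] is reused by two
 * multiply-adds.  The unroll factor of 2 is an arbitrary choice, and N
 * is assumed even so no cleanup loop is needed. */
#include <stdio.h>

#define N 4

void matmul_unroll_and_jam(const int a[N][N], const int b[N][N], int c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j += 2) {
            int t0 = 0, t1 = 0;             /* accumulators for columns j and j+1 */
            for (int k = 0; k < N; k++) {
                int av = a[i][k];           /* loaded once, used twice */
                t0 += av * b[k][j];
                t1 += av * b[k][j + 1];
            }
            c[i][j]     = t0;
            c[i][j + 1] = t1;
        }
}

int main(void) {
    int a[N][N], b[N][N], c[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i + j; b[i][j] = (i == j); }
    matmul_unroll_and_jam(a, b, c);
    printf("c[2][3] = %d\n", c[2][3]);      /* expect a[2][3] = 5 */
    return 0;
}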

Results of Software Pipelining Improvements
Applying unroll-and-jam to 26 FORTRAN nested loops before modulo scheduling:
– Decreased loop execution time by up to 94.2%; on average, execution time fell by 56.9%
– Greatly increased register requirements, often by a factor of 5

Results of Software Pipelining Improvements
Using a simple cache-reuse model, our modulo scheduler:
– Improved execution time by roughly 11% over an all-hit assumption, with little change in register usage
– Used 17.9% fewer registers than an all-miss assumption, while generating 8% slower code

“OMAP” Resources (diagram): Chiron, Tesla, Ducati, multi-CPU shared memory, FPGA

Optimizing Compilers for Modern Architectures: Syllabus (Allen and Kennedy, Preface)

Dependence-Based Compilation
Vectorization and parallelization require deeper analysis than optimization for scalar machines
– Must be able to determine whether two accesses to the same array might be to the same location
Dependence is the theory that makes this possible
– There is a dependence between two statements if they might access the same location, there is a path from one to the other, and at least one access is a write
Dependence has other applications
– Memory hierarchy management: restructuring programs to make better use of cache and registers (includes input dependences)
– Scheduling of instructions
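
A small made-up example of the kind of question dependence analysis answers: in the first loop below, each iteration reads a value written by the previous iteration (a loop-carried true dependence), so its iterations cannot safely run in parallel, while the second loop carries no dependence and can be vectorized or parallelized.

/* Two hypothetical loops illustrating loop-carried dependence. */
#include <stdio.h>

#define N 8

int main(void) {
    int a[N] = {1, 1, 1, 1, 1, 1, 1, 1};
    int b[N] = {0};

    /* Loop 1: a[i] depends on a[i-1] written by the previous iteration,
     * a loop-carried true (flow) dependence -> not parallelizable as is. */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + 1;

    /* Loop 2: every iteration writes a distinct location and reads only
     * values defined before the loop -> iterations are independent and
     * could be vectorized or run in parallel. */
    for (int i = 0; i < N; i++)
        b[i] = 2 * a[i];

    printf("a[%d] = %d, b[%d] = %d\n", N - 1, a[N - 1], N - 1, b[N - 1]);
    return 0;
}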

Syllabus I
Introduction
– Parallel and vector architectures. The problem of parallel programming. Bernstein's conditions and the role of dependence. Compilation for parallel machines and automatic detection of parallelism.
Dependence Theory and Practice
– Fundamentals, types of dependences. Testing for dependence: separable, GCD, and Banerjee tests. Exact dependence testing. Construction of direction and distance vectors.
Preliminary Transformations
– Loop normalization, scalar data-flow analysis, induction-variable substitution, scalar renaming.

Syllabus II
Fine-Grain Parallel Code Generation
– Loop distribution and its safety. The Kuck vectorization principle. The layered vector code-generation algorithm and its complexity. Loop interchange.
Coarse-Grain Parallel Code Generation
– Loop interchange. Loop skewing. Scalar and array expansion. Forward substitution. Alignment. Code replication. Array renaming. Node splitting. Pattern recognition. Threshold analysis. Symbolic dependence tests. Parallel code generation and its problems.
Control Dependence
– Types of branches. If conversion. Control dependence. Program dependence graph.

Syllabus III
Memory Hierarchy Management
– The use of dependence in scalar register allocation and management of the cache memory hierarchy.
Scheduling for Superscalar and Parallel Machines
– Role of dependence. List scheduling. Software pipelining. Work scheduling for parallel systems. Guided self-scheduling.
Interprocedural Analysis and Optimization
– Side-effect analysis, constant propagation, and alias analysis. Flow-insensitive and flow-sensitive problems. Side effects to arrays. Inline substitution, linkage tailoring, and procedure cloning. Management of interprocedural analysis and optimization.
Compilation of Other Languages
– C, Verilog, Fortran 90, HPF.

What is High Performance Computing?
What architectural models are there?
What system software is required? Standard?
How should we evaluate high performance?
– Run time?
– Run time x machine cost?
– Speedup?
– Efficient use of CPU resources?
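
For reference, the usual textbook definitions behind the last two bullets (not spelled out on the slide), written here in LaTeX: speedup compares the best serial run time T(1) with the run time T(p) on p processors, and efficiency normalizes speedup by the processor count.

S(p) = \frac{T(1)}{T(p)}, \qquad
E(p) = \frac{S(p)}{p} = \frac{T(1)}{p \, T(p)}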