A Seminar on Optimizations for Modern Architectures Computer Science Seminar 2, 236802 Winter 2006 Lecturer: Erez Petrank

Topics today Administration Introduction Course topics Administration again

Course format Lectures 2-14 will be given by you. Each lecture will follow a lecture from a course by Ken Kennedy at Rice University. Draft slides are available online. The original lectures took 80 minutes. There are 16 such lectures, one for each registered student.

Grading Presentation – at least 85%; participation in class – at most 15%. A reduction of 5 points for each (unjustified) absence. The presentation is judged on: understanding the material, slide effectiveness, and communicating the ideas.

Seminar Requirement Easier than “standard”: –Material is taken from a book –Draft slides are available on the internet Harder than “standard”: –A presentation is 80 minutes instead of 50. –The presentation should be much better than the draft slides. –Presentations may depend on previous material, so one actually has to listen… –The lecturer has to know what he is talking about. Goal: make people understand!

Scheduling Since lectures are not synchronized with our 2×50-minute schedule, one must be prepared to start right after the previous talk ends. Lecturers should send me their talk slides by 12:30 on the day of the talk. After the talk, please modify the slides to create a final version and send it to me by the end of Thursday.

The book Optimizing Compilers for Modern Architectures by Ken Kennedy & Randy Allen. (Available at the library: one reserved copy and two more for a week's loan.)

The nature of this topic Trying to automatically detect potential parallelism seems to be harder than expected. It requires some math, some common sense, some compiler knowledge, some NP-hardness, etc. It is at the frontline of programming-languages research today, and interesting from both a research and a practical point of view.

An Example Can we run the following loop iterations in parallel?
for (i=0; i<1000; i++)
  for (j=0; j<1000; j++)
    for (k=0; k<1000; k++)
      for (l=0; l<1000; l++)
        A(i+3, 2j+2, k+5, l) = A(2j-i+k, 2k, 7, 2i)
The relevant question is whether one iteration writes to a location in A that another iteration reads from… This translates to whether there exist two integer vectors (i1,j1,k1,l1) and (i2,j2,k2,l2) such that i1+3 = 2j2-i2+k2 and 2j1+2 = 2k2 and k1+5 = 7 and l1 = 2i2. (Solutions should also be within the loop bounds.)

An Example Do there exist two integer vectors (i1,j1,k1,l1) and (i2,j2,k2,l2) such that i1+3 = 2j2-i2+k2 and 2j1+2 = 2k2 and k1+5 = 7 and l1 = 2i2? This is a set of linear equations. Let’s solve them! Bad news: finding integer solutions of linear equations within the loop bounds (integer linear programming) is NP-hard. So what do we do? Restrict attention to special cases that are common and for which we have the math tools to solve them…
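As a concrete illustration of such a special case, here is a minimal sketch (mine, not from the book) of the classic GCD test, applied to the first equation above in isolation and ignoring the loop bounds: a single equation a1*x1 + ... + an*xn = c has an integer solution iff gcd(a1,...,an) divides c.
! Sketch of the GCD dependence test (my example, not the book's code).
! The first equation above, rearranged: -i1 + 2*j2 - i2 + k2 = 3.
! Loop bounds are ignored, so "possible" only means a dependence
! cannot be ruled out by this test.
program gcd_test_sketch
  implicit none
  integer :: coeffs(4) = (/ -1, 2, -1, 1 /)   ! coefficients of i1, j2, i2, k2
  integer :: rhs = 3
  integer :: g, i
  g = abs(coeffs(1))
  do i = 2, size(coeffs)
     g = gcd(g, abs(coeffs(i)))
  end do
  if (g /= 0 .and. mod(rhs, g) == 0) then
     print *, 'GCD test: integer solution possible -> assume dependence'
  else
     print *, 'GCD test: no integer solution -> iterations independent'
  end if
contains
  integer function gcd(a, b)
    integer, intent(in) :: a, b
    integer :: x, y, t
    x = a; y = b
    do while (y /= 0)     ! Euclid's algorithm
       t = mod(x, y)
       x = y
       y = t
    end do
    gcd = x
  end function gcd
end program gcd_test_sketch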

Introduction to Course (Chapter 1) What kind of parallel platforms are we targeting? How “dependence” relates to all of them.

Moore’s Law

Features of Machine Architectures Pipelining. Superscalar: multiple execution units (pipelined). –When controlled by the software it is called VLIW. Vector operations Parallel processing –Shared memory, distributed memory, message-passing Registers Cache hierarchy Combinations of the above –Parallel-vector machines

Compiler’s Job Many advanced developers write platform-conscious code (parallelism, cache behavior, etc.). This is time-consuming and bug-prone, and the bugs are extremely difficult to find. Holy grail: the compiler gets “standard” code and optimizes it to make best use of the hardware.

Main Tool: Reordering Changing the execution order of original code: –Do not wait for cache misses –Run instructions in parallel –Improve “branch prediction”, etc. Main question: Which orders preserve program semantics? –Dependence. –Safety

Dependence Dependence is the main theme in this book. How do we determine dependence? How do we use this knowledge to improve programs? Next, we go through the relevant properties of the various platforms: pipelining, vector instructions, superscalar execution, processor parallelism, and the memory hierarchy.

Earliest: Instruction Pipelining The DLX instruction pipeline (figure omitted here) shows perfect pipelining, executing one instruction per cycle through five stages: instruction fetch, decode, execute, memory access, and write-back to a register. (RISC.)

Replicated Execution Logic Pipelined execution units and multiple execution units. Typical programs do not exploit such pipelining or parallelism well, but a compiler can help by reordering instructions. This is fine-grained parallelism.

Parallelism Hazards Structural hazard: a single bus brings both instructions and data, so a memory access cannot overlap with an instruction fetch. Data hazard: the result produced by one instruction is required by the following instruction. Control hazard: branches. Hazards create stalls; the goal is to minimize stalls.
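As a rough source-level illustration (my sketch, not from the slides; real hazards arise at the instruction level), a statement that immediately consumes the result of the previous one forces a stall, and a compiler can move independent work in between:
program hazard_sketch
  implicit none
  real :: a, b, c, t, x, y, z
  a = 2.0; b = 3.0; y = 1.0; z = 4.0
  t = a * b      ! produces t
  x = y + z      ! independent work scheduled here to hide the latency of a*b
  c = t + 1.0    ! consumes t; placed right after the multiply this would stall on a data hazard
  print *, c, x
end program hazard_sketch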

Vector Instructions By the 70’s, the task of keeping the pipeline full had become cumbersome. Vector operations: an attempt to simplify instruction processing. Idea: a long “vector” register that is loaded, operated on, or stored in a single operation. Example:
VLOAD  V1, A
VLOAD  V2, B
VADD   V3, V1, V2
VSTORE V3, C

Vector Instructions The machine runs vector operations on the same pipelined hardware. The simplification: a single instruction describes a whole sequence of identical operations, so the hardware knows exactly what is coming. Platform/system costs: the processor needs more registers and more instructions, there are cache complications, etc.

Vector Operations How does a programmer specify, or how does a compiler identify, vector operations?
DO I = 1, 64
  C(I) = A(I) + B(I)
  D(I+1) = D(I) + C(I)
ENDDO
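One plausible answer, as a sketch of my own (not the book's code): after distributing the loop, the first statement has no loop-carried dependence and becomes a single vector (array) assignment, while the recurrence on D must remain a sequential loop.
program vector_ops_sketch
  implicit none
  real :: A(64), B(64), C(64), D(65)
  integer :: I
  A = 1.0; B = 2.0; D = 0.0
  C(1:64) = A(1:64) + B(1:64)   ! vectorizable: iterations are independent
  do I = 1, 64
     D(I+1) = D(I) + C(I)       ! loop-carried dependence on D: stays serial
  end do
  print *, D(65)
end program vector_ops_sketch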

VLIW (Very Long Instruction Word) Multiple instructions issued in the same cycle –A wide instruction word. Challenges for the programmer/compiler: –Find enough parallel instructions –Reschedule operations to minimize the number of cycles and avoid hazards –Dependence!

SMP Parallelism Multiple processors with uniform shared memory. Coarse-grained (task) parallelism: independent tasks that do not rely on one another for obtaining data or for making control decisions.

Memory Hierarchy Memory speed lags behind processor speed, and the gap is still growing. Solution: use a fast cache between memory and the CPU. Implication: a program's cache behavior has a significant impact on its efficiency. Challenge: how can we enhance reuse? –Correct! We reorder instructions.

Memory Hierarchy Example:
DO I = 1, N
  DO J = 1, M
    C(I) = C(I) + A(J)
  ENDDO
ENDDO
If M is small enough that A(1:M) fits in the cache, there are (almost) no cache misses on A. Otherwise, there is a cache miss on A per loop iteration. Solution for a cache of size L:
DO jj = 1, M, L
  DO I = 1, N
    DO J = jj, jj+L-1
      C(I) = C(I) + A(J)
    ENDDO
  ENDDO
ENDDO
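A runnable version of that sketch (mine; the sizes N, M, and L are made up), using MIN so the final strip is handled correctly when M is not a multiple of L:
program strip_mine_sketch
  implicit none
  integer, parameter :: N = 100, M = 1000, L = 64   ! made-up sizes; L plays the cache-strip role
  real :: A(M), C(N)
  integer :: I, J, jj
  A = 1.0; C = 0.0
  do jj = 1, M, L                      ! iterate over strips of A
     do I = 1, N
        do J = jj, min(jj + L - 1, M)  ! min() covers the last, possibly partial, strip
           C(I) = C(I) + A(J)
        end do
     end do
  end do
  print *, C(1)
end program strip_mine_sketch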

Distributed Memory Memory is packaged with the processors –Message passing –Distributed shared memory SMP clusters –Shared memory on a node, message passing off node The problem becomes more complex: both computation and data must be split. –Minimizing communication: data placement –Create copies? Maintain coherence, etc. We will probably not have time to discuss data partitioning in this seminar.

Some Words on Dependence Goal: aggressive transformations to improve performance. Problem: when is a transformation legal? –Simple answer: when the program's meaning is not changed. Meaning? –The same sequence of memory states? Too strong! –The same answers? Good, but in general undecidable. We need a sufficient condition; in this course we use dependence: –it ensures that instructions accessing the same location (with at least one store) are not reordered.

Bernstein Conditions When is it safe to run two instructions I1 and I2 in parallel? If none of the following holds: (1) I1 writes into a memory location that I2 reads; (2) I2 writes into a memory location that I1 reads; (3) both I1 and I2 write to the same memory location. Loop parallelism: think of loop iterations as tasks.

Bernstein Conditions True Dependence: I1 writes into a memory location that I2 reads.
DO i = 1, N
  A(i+1) = A(i) + B(i)
ENDDO
Antidependence: I2 writes into a memory location that I1 reads.
DO i = 1, N
  A(i-1) = A(i) + B(i)
ENDDO

Bernstein Conditions Output Dependence: both I1 and I2 write to the same memory location.
DO i = 1, N
  S = A(i) + B(i)
ENDDO
A simpler example:
X = 10
X = 20

Compiler Technologies Program transformations: –Most architectural issues are dealt with by reordering: vectorization, parallelization, cache-reuse enhancement. –Challenges: determining when transformations are legal, and selecting transformations based on profitability. All transformations require some understanding of the ways that instructions and statements depend on one another (share data).

A Common Problem: Matrix Multiplication
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
This code performs perfectly well on a scalar machine: the compiler only needs to identify the common variable C(J,I) in the inner loop and keep it in a register.

Pipelined Machine
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
But suppose it runs on a pipelined machine with 4 stages. Problem: the iterations of the inner (K) loop are not independent, so they cannot overlap in the pipeline.

Problem for Pipelines Instructions in the inner (K) loop are dependent. Solution: work on several iterations of the J loop simultaneously.

MatMult for a Pipelined Machine
DO I = 1, N
  DO J = 1, N, 4
    C(J,I) = 0.0
    C(J+1,I) = 0.0
    C(J+2,I) = 0.0
    C(J+3,I) = 0.0
    DO K = 1, N
      C(J,I)   = C(J,I)   + A(J,K)   * B(K,I)
      C(J+1,I) = C(J+1,I) + A(J+1,K) * B(K,I)
      C(J+2,I) = C(J+2,I) + A(J+2,K) * B(K,I)
      C(J+3,I) = C(J+3,I) + A(J+3,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
(The J loop is unrolled by 4 so that four independent accumulations are in flight; this assumes N is a multiple of 4.)

Matrix Multiply on Vector Machines
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
Reminder: this was the original code.

Problems for Vectors The inner loop must be a vector loop –and it should be stride 1. Vector registers have finite length (Cray: 64 elements). Solution: –Strip-mine the loop over the stride-one dimension to the register length (say, 64). –Move the loop that iterates within a strip to the innermost position.

Vectorizing Matrix Multiply
DO I = 1, N
  DO J = 1, N, 64
    DO JJ = 0, 63
      C(J+JJ,I) = 0.0
    ENDDO
    DO K = 1, N
      DO JJ = 0, 63
        C(J+JJ,I) = C(J+JJ,I) + A(J+JJ,K) * B(K,I)
      ENDDO
    ENDDO
  ENDDO
ENDDO

MatMult for a Vector Machine
DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO

Matrix Multiply on Parallel SMPs
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
On an SMP, we prefer to parallelize at a coarse level because of the cost involved in synchronizing the parallel runs.

Matrix Multiply on Parallel SMPs
DO I = 1, N      ! independent for all I
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
We can let each processor execute some of the outer loop iterations. (They should all have access to the full matrix.)

MatMult on a Shared-Memory MP
PARALLEL DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
END PARALLEL DO
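For concreteness, a sketch of my own (not from the slides) of how the same PARALLEL DO is typically written today, using an OpenMP directive; J and K are declared private so each thread gets its own copies, while the outer loop index I is private automatically:
program matmul_omp_sketch
  implicit none
  integer, parameter :: N = 256            ! made-up problem size
  real :: A(N,N), B(N,N), C(N,N)
  integer :: I, J, K
  call random_number(A)
  call random_number(B)
  !$omp parallel do private(J, K)
  do I = 1, N                  ! outer-loop iterations are independent
     do J = 1, N
        C(J,I) = 0.0
        do K = 1, N
           C(J,I) = C(J,I) + A(J,K) * B(K,I)
        end do
     end do
  end do
  !$omp end parallel do
  print *, C(1,1)
end program matmul_omp_sketch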

MatMult on a Vector SMP
PARALLEL DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
    ENDDO
  ENDDO
END PARALLEL DO

Matrix Multiply for Cache Reuse
DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO

Problems on Cache There is reuse of C but no reuse of A and B. Solution: –Block the loops so you get reuse of both A and B –Multiply a block of A by a block of B and add it to a block of C.

MatMult on a Uniprocessor with Cache
DO I = 1, N, S
  DO J = 1, N, S
    DO p = I, I+S-1
      DO q = J, J+S-1
        C(q,p) = 0.0
      ENDDO
    ENDDO
    DO K = 1, N, S
      DO p = I, I+S-1
        DO q = J, J+S-1
          DO r = K, K+S-1
            C(q,p) = C(q,p) + A(q,r) * B(r,p)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
(Block-size annotations from the original slide: S·T elements and S² elements.)
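A runnable sketch of the blocked version (mine; the sizes N and S are made up), with MIN guards so N need not be a multiple of S, and with C zeroed up front instead of block by block:
program blocked_matmul_sketch
  implicit none
  integer, parameter :: N = 200, S = 48    ! made-up sizes; S is the cache block size
  real :: A(N,N), B(N,N), C(N,N)
  integer :: I, J, K, p, q, r
  call random_number(A)
  call random_number(B)
  C = 0.0                                  ! zero C once, equivalent to the per-block zeroing above
  do I = 1, N, S
     do J = 1, N, S
        do K = 1, N, S
           do p = I, min(I + S - 1, N)     ! min() handles N not divisible by S
              do q = J, min(J + S - 1, N)
                 do r = K, min(K + S - 1, N)
                    C(q,p) = C(q,p) + A(q,r) * B(r,p)
                 end do
              end do
           end do
        end do
     end do
  end do
  print *, C(1,1)
end program blocked_matmul_sketch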

MatMult on a Distributed-Memory MP
PARALLEL DO I = 1, N
  PARALLEL DO J = 1, N
    C(J,I) = 0.0
  ENDDO
ENDDO
PARALLEL DO I = 1, N, S
  PARALLEL DO J = 1, N, S
    DO K = 1, N, T
      DO p = I, I+S-1
        DO q = J, J+S-1
          DO r = K, K+T-1
            C(q,p) = C(q,p) + A(q,r) * B(r,p)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO

Conclusion Tailoring the program to a given platform is difficult. Due to the speed of hardware change, modifications may be required repeatedly. Thus, it is useful to let the compiler do much of the work. How much can it do? This is the topic of this course.

Summary Modern computer architectures present many performance challenges. Most of the problems can be overcome by changing the order of instructions, in particular by transforming loop nests. –Transformations are not obviously correct; dependence tells us when they are feasible. –Most of the course concentrates on the detection and use of dependence (or independence) for these various goals.

Administration Again

Registration Procedure 1. I’ll wait to get additional initial registration e-mails for the next 2 hours. 2. I will send e-mails asking students if they want to pick a lecture. 3. I will wait a limited time for responses and then go on to the next candidates. If you want to ensure your place in this course, check your e-mail every 4 hours tomorrow and respond. 4. A schedule will be built and put on the internet with the names of registered students. 5. (If you don’t get registered, you are still welcome as a listener…)