CSE P501 – Compiler Construction: Instruction Scheduling. Issues, latencies, list scheduling. Jim Hogg, UW CSE, Spring 2014.


Instruction Scheduling is...

Executing instructions in their original order (a b c d e f g h) gets the correct answer. The scheduler issues them in a new order, because, for example, a memory fetch is slow or a divide is slow, so that execution is faster overall yet still gets the correct answer!

Originally devised for super-computers. Now used everywhere:
- in-order processors: older ARM
- out-of-order processors: newer x86
The compiler does the 'heavy lifting', which reduces chip power.

Chip Complexity, 1

The following factors make scheduling complicated:
- Different kinds of instruction take different times (in clock cycles) to complete
- Modern chips have multiple functional units, so they can issue several operations per cycle: "super-scalar"
- Loads are non-blocking: ~50 in-flight loads and ~50 in-flight stores

Typical Instruction Timings

Instruction      Time in Clock Cycles
int + int        1
int * int        3
float + float    3
float * float    5
float / float    15
int / int        30

Load Latencies

Caches: L1 = 64 KB per core; L2 = 256 KB per core; L3 = 2-8 MB, shared. Instructions issue at ~5 per cycle.

Access        Latency
Register      1 cycle
L1 Cache      ~4 cycles
L2 Cache      ~10 cycles
L3 Cache      ~40 cycles
DRAM          ~100 ns

Super-Scalar [figure]

Chip Complexity, 2

- Branch costs vary (branch predictor)
- Branches on some processors have delay slots (eg: Sparc)
- Modern processors have branch-predictor logic in hardware: heuristics predict whether branches are taken or not, keeping the pipelines full

GOAL: the scheduler should reorder instructions to:
- hide latencies
- take advantage of multiple function units (and delay slots)
- help the processor effectively pipeline execution

However, many chips schedule on-the-fly too; eg: Haswell's out-of-order window = 192 μops.

Data Dependence Graph

a: loadAI  rarp, @a => r1
b: add     r1, r1   => r1
c: loadAI  rarp, @b => r2
d: mult    r1, r2   => r1
e: loadAI  rarp, @c => r2
f: mult    r1, r2   => r1
g: loadAI  rarp, @d => r2
h: mult    r1, r2   => r1
i: storeAI r1       => rarp, @a

Dependence kinds:
- read-after-write = RAW = true dependence = flow dependence
- write-after-read = WAR = anti-dependence
- write-after-write = WAW = output-dependence

The scheduler has freedom to re-order instructions, so long as it complies with inter-instruction dependencies. In the graph, a is a leaf and i is the root.
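These three dependence kinds can be computed mechanically from each instruction's reads and writes. A minimal sketch in Python; the tuple encoding and the register sets below are my own illustration, not from the slides:

```python
def dependences(instrs):
    """Classify dependences between every ordered pair of instructions.
    Each instruction is (name, registers_read, registers_written)."""
    edges = []
    for i, (ni, ri, wi) in enumerate(instrs):
        for nj, rj, wj in instrs[i + 1:]:
            if wi & rj:
                edges.append((ni, nj, "RAW"))   # true / flow dependence
            if ri & wj:
                edges.append((ni, nj, "WAR"))   # anti-dependence
            if wi & wj:
                edges.append((ni, nj, "WAW"))   # output dependence
    return edges

# First four instructions of the example above, encoded by hand
prog = [
    ("a", {"rarp"}, {"r1"}),       # loadAI rarp, @a => r1
    ("b", {"r1"}, {"r1"}),         # add r1, r1 => r1
    ("c", {"rarp"}, {"r2"}),       # loadAI rarp, @b => r2
    ("d", {"r1", "r2"}, {"r1"}),   # mult r1, r2 => r1
]
```

The sketch tracks registers only; memory dependences through loads and stores need the same treatment applied to addresses.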

Scheduling Really Works...

Machine model: 1 functional unit; load or store takes 3 cycles, multiply 2 cycles, anything else 1 cycle. Source: a = 2*a*b*c*d.

Original:
Start Cycle  Instruction
1   loadAI  rarp, @a => r1
4   add     r1, r1 => r1
5   loadAI  rarp, @b => r2
8   mult    r1, r2 => r1
10  loadAI  rarp, @c => r2
13  mult    r1, r2 => r1
15  loadAI  rarp, @d => r2
18  mult    r1, r2 => r1
20  storeAI r1 => rarp, @a

Scheduled:
Start Cycle  Instruction
1   loadAI  rarp, @a => r1
2   loadAI  rarp, @b => r2
3   loadAI  rarp, @c => r3
4   add     r1, r1 => r1
5   mult    r1, r2 => r1
6   loadAI  rarp, @d => r2
7   mult    r1, r3 => r1
9   mult    r1, r2 => r1
11  storeAI r1 => rarp, @a

The new schedule uses an extra register, r3, and preserves the (WAW) output-dependency.

Scheduler: Job Description

The Job: given code for some machine, and latencies for each instruction, reorder to minimize execution time.

Constraints:
- Produce correct code
- Minimize wasted cycles
- Avoid spilling registers
- Don't take forever to reach an answer

Job Description - Part 2

foreach instruction in the dependence graph:
  denote the current instruction as ins
  denote the number of cycles it takes to execute as ins.delay
  denote the cycle number in which ins should start as ins.start
  foreach instruction dep that is dependent on ins:
    ensure ins.start + ins.delay <= dep.start

What if the scheduler makes a mistake? On-chip hardware stalls the pipeline until operands become available: slower, but still correct!
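The constraint above can be checked mechanically, for example in a test harness for a scheduler. A small sketch in Python, with hypothetical start/delay values of my own:

```python
def schedule_ok(start, delay, deps):
    """True iff every producer finishes before its dependent consumer starts,
    i.e. ins.start + ins.delay <= dep.start for every dependence edge."""
    return all(start[p] + delay[p] <= start[c] for p, c in deps)

# Hypothetical schedule: a and c are 3-cycle loads, d a 2-cycle multiply
start = {"a": 1, "b": 4, "c": 5, "d": 8}
delay = {"a": 3, "b": 1, "c": 3, "d": 2}
deps = [("a", "b"), ("b", "d"), ("c", "d")]
```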

Dependence Graph + Timings

Functional unit: load or store takes 3 cycles; multiply 2 cycles; otherwise 1 cycle.

Superscripts on the graph nodes show each instruction's latency-weighted path length to the end of the computation:
a: 13, b: 10, c: 12, d: 9, e: 10, f: 7, g: 8, h: 5, i: 3

a-b-d-f-h-i is the critical path. Leaves can be scheduled at any time: no constraints. Since a has the longest delay, schedule it first; then c; then...
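These priorities are just the longest latency-weighted path from each node to the end of the dependence graph, computable by a memoized traversal. A sketch in Python; the successor-map encoding of the graph is my assumption:

```python
from functools import lru_cache

# Dependence graph as a successor map, plus per-instruction latencies,
# transcribed from the running example (load/store 3, mult 2, else 1)
succ = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
        "f": ["h"], "g": ["h"], "h": ["i"], "i": []}
delay = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3,
         "f": 2, "g": 3, "h": 2, "i": 3}

@lru_cache(maxsize=None)
def priority(n):
    """Longest latency-weighted path from n to the end of the computation."""
    return delay[n] + max((priority(s) for s in succ[n]), default=0)
```

On this graph the sketch reproduces the slide's superscripts, e.g. a's priority is 3 (its load latency) + 10 (b's priority) = 13.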

List Scheduling

- Build a precedence graph D
- Compute a priority function over the nodes in D (typical: longest latency-weighted path)
- Rename registers to remove WAW conflicts
- Create the schedule, one cycle at a time:
  - use a queue of operations that are Ready
  - at each cycle, choose a Ready operation, schedule it, and update the Ready queue
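The renaming step can be sketched as giving every definition a fresh virtual register, so that only true (RAW) dependences survive. A minimal Python illustration; the (reads, writes) instruction encoding is my assumption:

```python
from itertools import count

def rename(instrs):
    """Rewrite (reads, writes) instruction tuples so each definition gets a
    fresh virtual register, removing WAR and WAW conflicts."""
    fresh = count(1)
    cur = {}                               # architectural reg -> current virtual name

    def name(r):                           # virtual name of the reaching definition
        if r not in cur:
            cur[r] = "v%d" % next(fresh)
        return cur[r]

    out = []
    for reads, writes in instrs:
        rs = [name(r) for r in reads]
        ws = []
        for w in writes:
            cur[w] = "v%d" % next(fresh)   # new definition: new virtual register
            ws.append(cur[w])
        out.append((rs, ws))
    return out
```

This is the transformation that lets the improved schedule use r3 for the third load: repeated writes to one architectural register become writes to distinct virtual registers.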

List Scheduling Algorithm

cycle = 1                     // clock cycle number
Ready = leaves of D           // ready to be scheduled
Active = {}                   // being executed

while Ready ∪ Active ≠ {} do
  foreach ins ∈ Active do
    if ins.start + ins.delay < cycle then
      remove ins from Active
      foreach successor suc of ins in D do
        if suc ∉ Ready then
          Ready = Ready ∪ {suc}
  if Ready ≠ {} then
    remove an instruction, ins, from Ready
    ins.start = cycle
    Active = Active ∪ {ins}
  cycle++
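The pseudocode translates almost line-for-line into a runnable single-issue scheduler. A sketch in Python, using the dependence graph, latencies, and priorities from the running example; the data encoding and the readiness test (all operands available before a successor joins Ready) are my assumptions:

```python
def list_schedule(succ, delay, priority):
    """Single-issue list scheduling: at most one instruction starts per cycle."""
    preds = {n: set() for n in succ}
    for n, ss in succ.items():
        for s in ss:
            preds[s].add(n)
    start = {}
    ready = {n for n in succ if not preds[n]}        # leaves of D
    active = set()
    cycle = 1
    while ready or active:
        for ins in list(active):                     # retire finished instructions
            if start[ins] + delay[ins] <= cycle:
                active.discard(ins)
                for suc in succ[ins]:
                    # suc becomes Ready once all its operands are available
                    if all(p in start and start[p] + delay[p] <= cycle
                           for p in preds[suc]):
                        ready.add(suc)
        if ready:
            ins = max(ready, key=lambda n: priority[n])  # highest priority first
            ready.discard(ins)
            start[ins] = cycle
            active.add(ins)
        cycle += 1
    return start

# Graph, latencies, and latency-weighted priorities from the earlier slides
succ = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
        "f": ["h"], "g": ["h"], "h": ["i"], "i": []}
delay = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3,
         "f": 2, "g": 3, "h": 2, "i": 3}
prio = {"a": 13, "b": 10, "c": 12, "d": 9, "e": 10,
        "f": 7, "g": 8, "h": 5, "i": 3}
```

On this example the sketch issues the store (i) in cycle 11, matching the slides' improved schedule.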

Beyond Basic Blocks

List scheduling dominates, but moving beyond basic blocks can improve the quality of the code. Possibilities:
- Schedule extended basic blocks (EBBs); watch for exit points, which limit reordering or require compensating code
- Trace scheduling: use profiling information to select regions for scheduling, using traces (paths) through the code