University of Massachusetts, Amherst
Department of Computer Science
Emery Berger
Advanced Compilers, CMPSCI 710, Spring 2003
Instruction Scheduling

Modern Architectures
Lots of features to increase performance and hide memory latency:
- Superscalar: multiple logic units; multiple issue (2 or more instructions issued per cycle)
- Speculative execution: branch predictors, speculative loads
- Deep pipelines

Instruction Scheduling
Challenges to achieving instruction-level parallelism:
- Structural hazards: insufficient resources to exploit parallelism
- Data hazards: an instruction depends on the result of a previous instruction still in the pipeline
- Control hazards: branches & jumps modify the PC, affecting which instructions should be in the pipeline

Scheduling for Pipelined Architectures
- Compiler reorders ("schedules") instructions to maximize ILP = minimize stalls ("bubbles") in the pipeline
- Performed after code generation & register allocation
- First approach: [Hennessy & Gross 1983], O(n^4), n = instructions in basic block
- Today: [Gibbons & Muchnick 1986], O(n^2)

Gibbons & Muchnick, I
Assumptions:
- Hardware hazard detection: the algorithm is not required to introduce nops
- Each memory location is referenced via an offset of a single base register
- A pointer may reference all of memory
- A load followed immediately by an add creates an interlock (stall)
- Hazards take only a single cycle

Gibbons & Muchnick, II
For each basic block:
- Construct a directed acyclic graph (DAG) using dependences between statements
  - Node = statement / instruction
  - Edge (a,b) = statement a must execute before b
- Schedule instructions using the DAG

Dependence DAG
Cannot reorder two dependent instructions. Data dependences:
- True dependence (RAW = read-after-write): an instruction can't execute until all required operands are available
- Anti-dependence (WAR = write-after-read): the write must not occur before the read
- Output dependence (WAW = write-after-write): an earlier write cannot overwrite a later one
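The three dependence kinds can be tested directly against each instruction's def/use sets. Below is a minimal Python sketch; Instr, build_dag, and the operand sets are illustrative assumptions, not the paper's data structures, and per the assumptions two slides back a real scheduler would also add edges between memory operations that may alias.

```python
# Minimal dependence-DAG construction over def/use sets (illustrative).
from collections import namedtuple

Instr = namedtuple("Instr", ["text", "defs", "uses"])

def build_dag(block):
    """Return edges (i, j, kind) meaning instruction i must precede j."""
    edges = []
    for j, b in enumerate(block):
        for i in range(j):
            a = block[i]
            if a.defs & b.uses:
                edges.append((i, j, "RAW"))   # true dependence
            elif b.defs & a.uses:
                edges.append((i, j, "WAR"))   # anti-dependence
            elif a.defs & b.defs:
                edges.append((i, j, "WAW"))   # output dependence
    return edges

block = [
    Instr("r8 = [r12+8]", defs={"r8"}, uses={"r12"}),
    Instr("r1 = r8 + 1",  defs={"r1"}, uses={"r8"}),
    Instr("r9 = r1 + 1",  defs={"r9"}, uses={"r1"}),
]
print(build_dag(block))  # [(0, 1, 'RAW'), (1, 2, 'RAW')]
```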

Scheduling Example
1. r8 = [r12+8](4)
2. r1 = r8 + 1
3. r2 = 2
4. call r14,r31
5. nop
6. r9 = r1 + 1

Scheduling Example
Instruction 2 uses r8 immediately after the load (an interlock), and the nop wastes the call's delay slot. We can reschedule to remove the nop in the delay slot: (1, 3, 4, 2, 6).
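Assuming the single-cycle load interlock and one branch delay slot from the assumptions slide (and counting only issue slots within this block), the two orders compare as follows:

Original order 1, 2, 3, 4, 5, 6 takes seven slots, because instruction 2 stalls one cycle waiting for the load and the nop occupies the delay slot:
  1, (stall), 2, 3, 4, nop, 6

Rescheduled order 1, 3, 4, 2, 6 takes five slots, because instruction 3 separates the load from its use and instruction 2 does useful work in the call's delay slot:
  1, 3, 4, 2, 6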

Scheduling Algorithm
- Construct the dependence DAG on the basic block
- Put the roots in the candidate set
- Use scheduling heuristics (in order) to select an instruction
  - Take into account the terminating instructions of predecessor basic blocks
- While the candidate set is not empty:
  - Evaluate all candidates and select the best one
  - Delete the scheduled instruction from the candidate set
  - Add newly-exposed candidates
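As a rough illustration (not the paper's pseudocode), the loop above can be sketched in Python; the DAG is assumed to be given as preds/succs maps from node to set of nodes, and select_best stands in for the heuristics on the next slide.

```python
# Minimal list-scheduling skeleton (illustrative).
def list_schedule(nodes, preds, succs, select_best):
    remaining = {n: len(preds[n]) for n in nodes}     # unscheduled predecessors
    cands = {n for n in nodes if remaining[n] == 0}   # DAG roots
    schedule = []
    while cands:
        best = select_best(cands, schedule)           # heuristic choice
        cands.remove(best)
        schedule.append(best)
        for s in succs[best]:                         # expose new candidates
            remaining[s] -= 1
            if remaining[s] == 0:
                cands.add(s)
    return schedule
```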

Instruction Scheduling Heuristics
Optimal scheduling is NP-complete ⇒ we need heuristics. Bias the scheduler to prefer instructions that:
- Interlock with their DAG successors: scheduling them early lets other operations proceed
- Have many successors: more flexibility in later scheduling
- Progress along the critical path
- Free registers: reduces register pressure
- etc. (see ACDI p. 542)
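One possible (assumed, not the paper's) encoding of this priority order is a lexicographic tie-breaking key plugged into the skeleton above; interlocks, succs, and delay are assumed to be precomputed maps.

```python
# Illustrative heuristic chooser for list_schedule above.
def make_select_best(interlocks, succs, delay):
    def select_best(cands, schedule):
        return max(cands, key=lambda n: (interlocks[n],   # interlocks with successors
                                         len(succs[n]),   # many successors
                                         delay[n]))       # critical-path progress
    return select_best
```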

Scheduling Algorithm, Example
- ExecTime(n): cycles to execute statement n
- Let ExecTime(6) = 2 and ExecTime = 1 for all other nodes; assume instruction latency = 1
- Compute Delay(n):
  Delay(n) = ExecTime(n), if n is a leaf
  Delay(n) = max over m ∈ Succ(n) of (Delay(m) + 1), otherwise
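Delay(n) as defined above falls out of a memoized recursion over the DAG. A sketch, assuming exec_time and succs are given as maps (the slide's DAG figure is not reproduced in this transcript, so no concrete graph is shown):

```python
# Memoized Delay(n) computation (illustrative).
from functools import lru_cache

def make_delay(exec_time, succs):
    @lru_cache(maxsize=None)
    def delay(n):
        if not succs[n]:                              # leaf node
            return exec_time[n]
        return max(delay(m) + 1 for m in succs[n])    # latency 1 per edge
    return delay
```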

Scheduling Algorithm, Example
- Start at CurTime = 0
- ETime(n): earliest time node n should be scheduled to avoid a stall (initially 0)
- Cands = {1, 3}
- MCands: set of candidates with maximum delay time to the end of the block
- ECands: set of candidates whose earliest start time is at most the current time
- MCands = ECands = {3}

Scheduling Algorithm, Example
Scheduled node 3. Cands = {1}, CurTime = 1, ETime(4) = 1.

Scheduling Algorithm, Example
Scheduled node 1. Cands = {2}, CurTime = 2, ETime(2) = 1, ETime(4) = 4.

Scheduling Algorithm, Example
Scheduled node 2. Cands = {4}, CurTime = 3, ETime(4) = 4.

Scheduling Algorithm, Example
Scheduled node 4. Cands = {5, 6}, CurTime = 4, ETime(5) = 6, ETime(6) = 4. MaxDelay = 2, so MCands = {6}: we want to progress along the critical path.

Scheduling Algorithm, Example
Scheduled node 6. Cands = {5}, CurTime = 5, ETime(5) = 6. Only one candidate left…

Scheduling Algorithm, Example
Scheduled node 5. Resulting schedule: [3, 1, 2, 4, 6, 5], which requires 6 cycles: optimal! A version of this algorithm is (p+1)-competitive, where p = number of pipelines; the average case is much better.

Scheduling Algorithm Complexity
Time complexity: O(n^2), where n = maximum number of instructions in a basic block.
- Building the dependence DAG: worst-case O(n^2), since each instruction must be compared to every other instruction
- Scheduling then requires inspecting each instruction at each step: O(n^2)
- Average case: small constant (e.g., 3)

Empirical Results
- Scheduling is always a win (1-13% on PA-RISC)
- Results match Hennessy & Gross for most benchmarks
- However: it removes only 5/16 stalls in sieve, and at most 10/16 with better alias information

Next Time
We've assumed no cache misses! Next time: balanced scheduling. Read Kerns & Eggers.