Instruction Scheduling for Instruction-Level Parallelism
CSS 548 Daniel R. Lewis November 28, 2012
Agenda
Where does instruction scheduling fit into the compilation process?
What is instruction-level parallelism?
What are data dependencies, and how do they limit instruction-level parallelism?
How should the compiler order instructions to maximize instruction-level parallelism?
What is the effect on register allocation?
What else must be considered in instruction scheduling?
Big Picture
Instruction scheduling is an optimization that is implemented in the back end of the compiler
Operates on machine code (not IR)
Tied to the characteristics of the CPU
Assumes generic optimization is complete
Idea: reorder instructions to increase instruction-level parallelism
Instruction-Level Parallelism
Parallelism on a single core, not multicore
Pipelined processors execute several instructions at once, each at a different stage
Superscalar and VLIW processors can issue multiple instructions per cycle
Pipelined Parallelism
(Jouppi and Wall, 1989)
Ubiquitous in modern processors
Superpipelining: longer pipelines with shorter stages (the Pentium 4 had a 20-stage pipeline)
Superscalar Parallelism
(Jouppi and Wall, 1989)
Works with CPUs that have multiple functional units (e.g., ALU, multiplier, bit shifter)
Since the mid-1990s, all general-purpose processors have been superscalar (the original Pentium was the first superscalar x86)
VLIW Parallelism
(Jouppi and Wall, 1989)
Most commonly seen in embedded DSPs
Non-embedded example: Intel Itanium
Data Dependencies
inc ebx       ;; ebx++
mov eax, ebx  ;; eax := ebx
Ordering of some instructions must be preserved
Three flavors of data dependence:
True dependence (read after write)
Antidependence (write after read)
Output dependence (write after write)
Data dependencies substantially reduce available parallelism
Dynamically scheduled processors detect dependencies at run time and stall instructions until their operands are ready (most processors)
Statically scheduled processors leave dependency detection to the compiler, which must insert no-ops (simple, low-power embedded processors)
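The three flavors of dependence can be checked mechanically. The sketch below, in Python, classifies the dependencies between two instructions represented as (destination, sources) pairs; the instruction encoding and the `classify` helper are illustrative, not part of the slides.

```python
# Hypothetical sketch: classify data dependencies between two
# instructions, where `earlier` is issued before `later`.
# Each instruction is a (dest, srcs) pair of register names.

def classify(earlier, later):
    """Return the dependence types running from `earlier` to `later`."""
    dest_e, srcs_e = earlier
    dest_l, srcs_l = later
    deps = []
    if dest_e in srcs_l:
        deps.append("true (RAW)")    # later reads what earlier wrote
    if dest_l in srcs_e:
        deps.append("anti (WAR)")    # later overwrites what earlier read
    if dest_l == dest_e:
        deps.append("output (WAW)")  # both write the same location
    return deps

# The slide's example:
#   inc ebx        -> ebx := ebx + 1  (writes ebx, reads ebx)
#   mov eax, ebx   -> eax := ebx      (writes eax, reads ebx)
i1 = ("ebx", ["ebx"])
i2 = ("eax", ["ebx"])
print(classify(i1, i2))  # true dependence: the mov must follow the inc
```

A scheduler may freely reorder two instructions only when `classify` returns an empty list for both orderings.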
Instruction Scheduling
Goal: re-order instructions, accounting for data dependencies and other factors, to minimize the number of stalls/no-ops (Engineering a Compiler, p. 644)
Dependence Graphs and List Scheduling
Key data structure is the dependence graph

(* List scheduling algorithm *)
Cycle := 1
Ready := [leaves of Graph]
Active := []
while (Ready + Active).size > 0
    for each op in Active
        if op.startCycle + op.length <= Cycle
            remove op from Active
            for each successor s of op in Graph
                if s is ready
                    add s to Ready
    if Ready.size > 0
        remove an op from Ready
        op.startCycle := Cycle
        add op to Active
    Cycle++

(Engineering a Compiler, p. 645–652)
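The pseudocode above can be turned into a small runnable sketch. The Python below follows the same ready/active loop; the dependence graph, operation names, and latencies are made-up examples, and the single-issue assumption (one op per cycle) is an assumption, not something the slides specify.

```python
# Minimal sketch of list scheduling over a dependence graph.
# succs/preds map each op to its successors/predecessors;
# latency gives each op's length in cycles. Issues at most
# one op per cycle (assumed single-issue machine).

def list_schedule(succs, preds, latency):
    cycle = 1
    start = {}               # op -> cycle it was issued
    ready = [op for op in succs if not preds[op]]  # leaves of the graph
    active = []
    done = set()
    while ready or active:
        for op in active[:]:
            if start[op] + latency[op] <= cycle:   # op has finished
                active.remove(op)
                done.add(op)
                for s in succs[op]:
                    # s becomes ready once all its predecessors finish
                    if all(p in done for p in preds[s]) and s not in ready:
                        ready.append(s)
        if ready:
            op = ready.pop(0)                      # issue one ready op
            start[op] = cycle
            active.append(op)
        cycle += 1
    return start

# Example: two 3-cycle loads feeding a 1-cycle add.
succs = {"load_a": ["add"], "load_b": ["add"], "add": []}
preds = {"load_a": [], "load_b": [], "add": ["load_a", "load_b"]}
lat = {"load_a": 3, "load_b": 3, "add": 1}
print(list_schedule(succs, preds, lat))
# -> {'load_a': 1, 'load_b': 2, 'add': 5}
```

Issuing the second load in the delay slot of the first is exactly the stall-hiding behavior the goal statement describes: the add waits only until cycle 5 instead of serializing behind both loads.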
Register Allocation Trade-Offs
More storage locations = fewer data dependencies = more parallelism
Many register allocation schemes seek to minimize the number of registers used, undermining parallelism
Processors developed hardware register renaming as a workaround
However, using too many registers may require spill code, which negates the benefit of parallelism
Register allocation can be done either before or after instruction scheduling
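Register renaming works because anti- and output dependencies are artifacts of register reuse, not of the computation itself. The sketch below, a hypothetical SSA-style renamer, gives every destination a fresh name so that only true dependencies remain; the instruction format and register names are illustrative.

```python
# Hedged sketch: renaming each destination to a fresh name removes
# anti (WAR) and output (WAW) dependencies, leaving only true (RAW)
# dependencies. Instructions are (dest, srcs) pairs.

def rename(instrs):
    fresh = 0
    current = {}              # architectural register -> latest fresh name
    out = []
    for dest, srcs in instrs:
        # reads refer to the most recent version of each source
        new_srcs = [current.get(s, s) for s in srcs]
        fresh += 1
        new_dest = f"v{fresh}"                     # fresh destination
        current[dest] = new_dest
        out.append((new_dest, new_srcs))
    return out

# r1 := r2 + r3 ; r2 := r1 * 2 ; r1 := r4 - r5
# Before renaming: WAR on r2 (2nd vs 1st) and WAW on r1 (3rd vs 1st).
prog = [("r1", ["r2", "r3"]), ("r2", ["r1"]), ("r1", ["r4", "r5"])]
print(rename(prog))
# -> [('v1', ['r2', 'r3']), ('v2', ['v1']), ('v3', ['r4', 'r5'])]
```

After renaming, the third instruction shares no register with the first two, so a scheduler (or out-of-order hardware) may run it in parallel with them.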
Advanced Topics
The list-scheduling algorithm operates on basic blocks
Global code scheduling, code motion
Software pipelining: schedule an entire loop at once
Branch prediction
Alias analysis: determine whether a pointer causes a data dependency
Scheduling variable-length operations (a LOAD can take hundreds or thousands of cycles on a cache miss)
Speculative execution
Questions?