
CSE P501 – Compiler Construction
Instruction Scheduling: Issues, Latencies, List scheduling
Jim Hogg - UW - CSE - P501, Spring 2014




1 CSE P501 – Compiler Construction: Instruction Scheduling

- Issues
- Latencies
- List scheduling

2 Instruction Scheduling is...

Schedule:
- Original order: a b c d e f g h (execute in-order to get correct answer)
- Issue in new order: b a d f c g h f
  - eg: memory fetch is slow
  - eg: divide is slow
- Overall faster
- Still get correct answer!

Originally devised for super-computers. Now used everywhere:
- in-order procs - older ARM
- out-of-order procs - newer x86

Compiler does 'heavy lifting' - reduce chip power

3 Chip Complexity, 1

The following factors make scheduling complicated:
- Different kinds of instruction take different times (in clock cycles) to complete
- Modern chips have multiple functional units, so they can issue several operations per cycle ("super-scalar")
- Loads are non-blocking: ~50 in-flight loads and ~50 in-flight stores

4 Typical Instruction Timings

Instruction      Time in Clock Cycles
int +/- int      1
int * int        3
float +/- float  3
float * float    5
float / float    15
int / int        30

5 Load Latencies

Memory hierarchy (from the core outward):
- L1 = 64 KB per core
- L2 = 256 KB per core
- L3 = 2-8 MB shared
- DRAM

Access       Cost
Instruction  ~5 per cycle
Register     1 cycle
L1 Cache     ~4 cycles
L2 Cache     ~10 cycles
L3 Cache     ~40 cycles
DRAM         ~100 ns

6 Super-Scalar

7 Chip Complexity, 2

- Branch costs vary (branch predictor)
- Branches on some processors have delay slots (eg: Sparc)
- Modern processors have branch-predictor logic in hardware
  - heuristics predict whether branches are taken or not
  - keeps pipelines full

GOAL: Scheduler should reorder instructions to
- hide latencies
- take advantage of multiple function units (and delay slots)
- help the processor effectively pipeline execution

However, many chips schedule on-the-fly too
- eg: Haswell out-of-order window = 192 µops

8 Data Dependence Graph

[Figure: dependence graph over nodes a-i, with "leaf" and "root" labels]

Label  Instruction
a      loadAI  rarp, @a => r1
b      add     r1, r1 => r1
c      loadAI  rarp, @b => r2
d      mult    r1, r2 => r1
e      loadAI  rarp, @c => r2
f      mult    r1, r2 => r1
g      loadAI  rarp, @d => r2
h      mult    r1, r2 => r1
i      storeAI r1 => rarp, @a

Kinds of dependence:
- read-after-write = RAW = true dependence = flow dependence
- write-after-read = WAR = anti-dependence
- write-after-write = WAW = output-dependence

The scheduler has freedom to re-order instructions, so long as it complies with inter-instruction dependencies
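The three kinds of dependence can be found mechanically by tracking, for each name, its most recent writer and the readers since that write. The following is a minimal sketch, not the course's implementation: the `Ins` tuple, the `defs`/`uses` sets, and the choice to treat memory slots such as `@a` as ordinary names (assuming distinct slots never alias) are illustrative assumptions.

```python
from collections import namedtuple

# Hypothetical representation: each instruction lists the names it writes (defs)
# and reads (uses). Memory slots like "@a" are treated like registers here.
Ins = namedtuple("Ins", "label op defs uses")

CODE = [
    Ins("a", "loadAI",  {"r1"}, {"rarp", "@a"}),
    Ins("b", "add",     {"r1"}, {"r1"}),
    Ins("c", "loadAI",  {"r2"}, {"rarp", "@b"}),
    Ins("d", "mult",    {"r1"}, {"r1", "r2"}),
    Ins("e", "loadAI",  {"r2"}, {"rarp", "@c"}),
    Ins("f", "mult",    {"r1"}, {"r1", "r2"}),
    Ins("g", "loadAI",  {"r2"}, {"rarp", "@d"}),
    Ins("h", "mult",    {"r1"}, {"r1", "r2"}),
    Ins("i", "storeAI", {"@a"}, {"r1", "rarp"}),
]

def dependences(code):
    """Return a set of (kind, earlier, later) triples for one basic block."""
    edges = set()
    last_def = {}   # name -> label of the most recent writer
    readers = {}    # name -> labels that have read it since that write
    for ins in code:
        for name in ins.uses:                      # RAW: read after earlier write
            if name in last_def:
                edges.add(("RAW", last_def[name], ins.label))
        for name in ins.defs:
            if name in last_def:                   # WAW: write after earlier write
                edges.add(("WAW", last_def[name], ins.label))
            for r in readers.get(name, ()):        # WAR: write after earlier read
                edges.add(("WAR", r, ins.label))
        for name in ins.uses:
            readers.setdefault(name, set()).add(ins.label)
        for name in ins.defs:                      # a new value: forget old readers
            last_def[name] = ins.label
            readers[name] = set()
    return edges

DEPS = dependences(CODE)
FLOW = sorted((u, v) for kind, u, v in DEPS if kind == "RAW")
print(FLOW)
```

Under these assumptions the flow (RAW) edges are a-b, b-d, c-d, d-f, e-f, f-h, g-h and h-i. The reuse of r2 also produces anti-dependences such as d-e, which is the kind of constraint that register renaming can remove.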

9 Scheduling Really Works...

a = 2*a*b*c*d

Machine model: 1 Functional Unit. Load or Store: 3 cycles. Multiply: 2 cycles. Otherwise: 1 cycle.

Original

Start Cycle  Instruction
1            loadAI  rarp, @a => r1
4            add     r1, r1 => r1
5            loadAI  rarp, @b => r2
8            mult    r1, r2 => r1
10           loadAI  rarp, @c => r2
13           mult    r1, r2 => r1
15           loadAI  rarp, @d => r2
18           mult    r1, r2 => r1
20           storeAI r1 => rarp, @a

Scheduled

Start Cycle  Instruction
1            loadAI  rarp, @a => r1
2            loadAI  rarp, @b => r2
3            loadAI  rarp, @c => r3
4            add     r1, r1 => r1
5            mult    r1, r2 => r1
6            loadAI  rarp, @d => r2
7            mult    r1, r3 => r1
9            mult    r1, r2 => r1
11           storeAI r1 => rarp, @a

New schedule uses extra register: r3
Preserves (WAW) output-dependency

10 Scheduler: Job Description

The Job
- Given code for some machine, and latencies for each instruction, reorder to minimize execution time

Constraints
- Produce correct code
- Minimize wasted cycles
- Avoid spilling registers
- Don't take forever to reach an answer

11 Job Description - Part 2

foreach instruction in dependence graph
    Denote current instruction as ins
    Denote number of cycles to execute as ins.delay
    Denote cycle number in which ins should start as ins.start
    foreach instruction dep that is dependent on ins
        Ensure ins.start + ins.delay <= dep.start

What if the scheduler makes a mistake? On-chip hardware stalls the pipeline until operands become available: so slower, but still correct!
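The constraint above can be checked directly against a proposed schedule. Here is a small sketch that does so for the running example (a = 2*a*b*c*d). The labels a-i, the edge list (flow dependences only), and the helper name `violations` are assumptions made for illustration; the delays follow the machine model of load/store 3 cycles, multiply 2, otherwise 1.

```python
# Latency of each instruction a-i under the slides' machine model.
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# Flow (RAW) dependences as (ins, dep) pairs: dep consumes what ins produces.
EDGES = [("a", "b"), ("b", "d"), ("c", "d"), ("d", "f"),
         ("e", "f"), ("f", "h"), ("g", "h"), ("h", "i")]

def violations(start, edges=EDGES, delay=DELAY):
    """Return the (ins, dep) pairs that break ins.start + ins.delay <= dep.start."""
    return [(ins, dep) for ins, dep in edges
            if start[ins] + delay[ins] > start[dep]]

# Start cycles of the two schedules from the "Scheduling Really Works" slide.
ORIGINAL = {"a": 1, "b": 4, "c": 5, "d": 8, "e": 10,
            "f": 13, "g": 15, "h": 18, "i": 20}
SCHEDULED = {"a": 1, "c": 2, "e": 3, "b": 4, "d": 5,
             "g": 6, "f": 7, "h": 9, "i": 11}

print(violations(ORIGINAL))    # []
print(violations(SCHEDULED))   # []

# Starting the add one cycle before its load has finished breaks the constraint.
TOO_EARLY = dict(SCHEDULED, b=3)
print(violations(TOO_EARLY))   # [('a', 'b')]
```

On real hardware the last case would not produce a wrong answer; the pipeline would stall for the missing cycle, as the slide notes.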

12 Dependence Graph + Timings

Machine model: 1 Functional Unit. Load or Store: 3 cycles. Multiply: 2 cycles. Otherwise: 1 cycle.

Label  Path length  Instruction
a      13           loadAI  rarp, @a => r1
b      10           add     r1, r1 => r1
c      12           loadAI  rarp, @b => r2
d      9            mult    r1, r2 => r1
e      10           loadAI  rarp, @c => r2
f      7            mult    r1, r2 => r1
g      8            loadAI  rarp, @d => r2
h      5            mult    r1, r2 => r1
i      3            storeAI r1 => rarp, @a

- Superscripts on the graph nodes (the "Path length" column here) show path length to end of computation
- a-b-d-f-h-i is the critical path
- Can schedule leaves any time - no constraints
- Since a has the longest path to the end of the computation (13), schedule it first; then c (12); then...
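The path lengths can be computed by a walk from each node to the end of the graph: a node's value is its own delay plus the largest value among its successors. A minimal sketch follows, assuming the flow-dependence edges of the running example; the names `SUCCS` and `path_length` are illustrative.

```python
from functools import lru_cache

# Latency of each instruction: load/store 3 cycles, multiply 2, otherwise 1.
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# Successors in the dependence graph (flow dependences only, an assumption).
SUCCS = {"a": ["b"], "b": ["d"], "c": ["d"], "d": ["f"], "e": ["f"],
         "f": ["h"], "g": ["h"], "h": ["i"], "i": []}

@lru_cache(maxsize=None)
def path_length(node):
    """Latency-weighted length of the longest path from node to the end."""
    return DELAY[node] + max((path_length(s) for s in SUCCS[node]), default=0)

PRIORITY = {n: path_length(n) for n in SUCCS}
print(PRIORITY)
# {'a': 13, 'b': 10, 'c': 12, 'd': 9, 'e': 10, 'f': 7, 'g': 8, 'h': 5, 'i': 3}
```

The values agree with the numbers attached to the nodes on the slide, and the largest one (13, at a) is the length of the critical path a-b-d-f-h-i.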

13 List Scheduling

- Build a precedence graph D
- Compute a priority function over the nodes in D
  - typical: longest latency-weighted path
- Rename registers to remove WAW conflicts
- Create schedule, one cycle at a time
  - Use queue of operations that are Ready
  - At each cycle
    - Choose a Ready operation and schedule it
    - Update Ready queue

14 List Scheduling Algorithm

cycle = 1            // clock cycle number
Ready = leaves of D  // ready to be scheduled
Active = { }         // being executed

while Ready ∪ Active ≠ {} do
    foreach ins ∈ Active do
        if ins.start + ins.delay <= cycle then
            remove ins from Active
            foreach successor suc of ins in D do
                if suc is ready then      // all of suc's predecessors have completed
                    Ready = Ready ∪ {suc}
                endif
            enddo
        endif
    endforeach
    if Ready ≠ {} then
        remove an instruction, ins, from Ready
        ins.start = cycle
        Active = Active ∪ {ins}
    endif
    cycle++
endwhile
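A runnable sketch of this loop, applied to the running example, is given below. Several choices are assumptions rather than anything the slides fix: an instruction is retired when start + delay <= cycle (the reading that matches the constraint ins.start + ins.delay <= dep.start), a successor counts as ready once every predecessor has completed, the highest-priority Ready instruction is picked each cycle, and only flow dependences are used (as if registers had been renamed). The names `PREDS`, `PRIORITY` and `list_schedule` are illustrative.

```python
# Latency of each instruction: load/store 3 cycles, multiply 2, otherwise 1.
DELAY = {"a": 3, "b": 1, "c": 3, "d": 2, "e": 3, "f": 2, "g": 3, "h": 2, "i": 3}

# Predecessors in the dependence graph D (flow dependences only).
PREDS = {"a": [], "b": ["a"], "c": [], "d": ["b", "c"], "e": [],
         "f": ["d", "e"], "g": [], "h": ["f", "g"], "i": ["h"]}

# Latency-weighted path length to the end of the computation.
PRIORITY = {"a": 13, "b": 10, "c": 12, "d": 9, "e": 10,
            "f": 7, "g": 8, "h": 5, "i": 3}

def list_schedule(preds, delay, priority):
    """Single-issue list scheduling; returns {label: start cycle}."""
    succs = {n: [] for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)

    start = {}
    cycle = 1
    ready = {n for n, ps in preds.items() if not ps}   # leaves of D
    active = set()                                     # issued, not yet finished

    while ready or active:
        # Retire finished instructions and release successors that are now ready.
        for ins in list(active):
            if start[ins] + delay[ins] <= cycle:
                active.remove(ins)
                for suc in succs[ins]:
                    done = all(p in start and start[p] + delay[p] <= cycle
                               for p in preds[suc])
                    if done and suc not in start:
                        ready.add(suc)
        # Issue at most one instruction per cycle: highest priority first,
        # with the label as a deterministic tie-break.
        if ready:
            ins = max(ready, key=lambda n: (priority[n], n))
            ready.remove(ins)
            start[ins] = cycle
            active.add(ins)
        cycle += 1
    return start

SCHEDULE = list_schedule(PREDS, DELAY, PRIORITY)
for label, s in sorted(SCHEDULE.items(), key=lambda kv: kv[1]):
    print(s, label)
```

Under these assumptions the result is a=1, c=2, e=3, b=4, d=5, g=6, f=7, h=9, i=11, the same start cycles as the "Scheduled" column earlier in the deck. Other priority functions or tie-breaks may give a different, equally valid schedule.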

15 Beyond Basic Blocks

List scheduling dominates, but moving beyond basic blocks can improve quality of the code. Possibilities:
- Schedule extended basic blocks (EBBs)
  - Watch for exit points - limits reordering or requires compensating
- Trace scheduling
  - Use profiling information to select regions for scheduling using traces (paths) through code

