Instruction Scheduling: Beyond Basic Blocks


1 Instruction Scheduling: Beyond Basic Blocks
Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use.

2 Local Scheduling
As long as we stay within a single block
  List scheduling does well
  Problem is hard, so tie-breaking matters
    More descendants in dependence graph
    Prefer operation with a last use over one with none
    Breadth first makes progress on all paths
      Tends toward more ILP & fewer interlocks
    Depth first tries to complete uses of a value
      Tends to use fewer registers
Classic work on this is Gibbons & Muchnick
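
The tie-breaking point can be made concrete with a minimal sketch of forward list scheduling. It assumes a machine that issues `width` operations per cycle (one by default) and takes the priority as a caller-supplied function. The name `list_schedule`, its parameters, and the four-operation graph are hypothetical, chosen only for illustration.

```python
def list_schedule(ops, succs, latency, priority, width=1):
    """Forward list scheduling for one block; returns op -> issue cycle."""
    preds_left = {op: 0 for op in ops}
    for op in ops:
        for s in succs.get(op, []):
            preds_left[s] += 1
    earliest = {op: 1 for op in ops}            # first cycle operands are ready
    ready = [op for op in ops if preds_left[op] == 0]
    start, cycle = {}, 1
    while len(start) < len(ops):
        # Ready operations whose operands are available, best priority first.
        avail = sorted((op for op in ready if earliest[op] <= cycle),
                       key=priority, reverse=True)
        for op in avail[:width]:
            ready.remove(op)
            start[op] = cycle
            for s in succs.get(op, []):
                earliest[s] = max(earliest[s], cycle + latency[op])
                preds_left[s] -= 1
                if preds_left[s] == 0:
                    ready.append(s)
        cycle += 1
    return start

# Hypothetical block: a and b feed c, and c feeds d.
ops = ["a", "b", "c", "d"]
succs = {"a": ["c"], "b": ["c"], "c": ["d"]}
latency = {"a": 2, "b": 1, "c": 1, "d": 1}
long_first = {"a": 4, "b": 3, "c": 2, "d": 1}    # latency-weighted path length
short_first = {"a": 3, "b": 4, "c": 2, "d": 1}   # same, with a and b swapped
good = list_schedule(ops, succs, latency, long_first.get)
bad = list_schedule(ops, succs, latency, short_first.get)
```

With these inputs, issuing the long-latency operation `a` first lets `d` issue in cycle 4; issuing `b` first delays `d` to cycle 5.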

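3 Local Scheduling
Forward and backward can produce different results
[Figure: dependence graph for a block from SPEC benchmark “go”, rooted at the cbr. Nodes: cbr, cmp, store1 to store5, add1 to add4, addI, lshift, loadI1 to loadI4. Each node is labeled with its latency to the cbr: 8 for loadI1 to loadI4 and lshift, 7 for add1 to add4 and addI, 5 for store1 to store5, 2 for cmp, 1 for cbr. Annotations: "Latency to the cbr", "Subscript to identify". Latency table for the operations load, loadI, add, addI, store, cmp; latency values shown: 1, 2, 4.]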

4 Local Scheduling
Using latency to root as the priority
Forward Schedule
  Cycle  Int     Int     Mem
  1      loadI1  lshift  -
  2      loadI2  loadI3  -
  3      loadI4  add1    -
  4      add2    add3    -
  5      add4    addI    store1
  6      cmp     -       store2
  7      -       -       store3
  8      -       -       store4
  9      -       -       store5
  10     -       -       -
  11     -       -       -
  12     -       -       -
  13     cbr     -       -
Backward Schedule
  Cycle  Int     Int     Mem
  1      loadI4  -       -
  2      addI    lshift  -
  3      add4    loadI3  -
  4      add3    loadI2  store5
  5      add2    loadI1  store4
  6      add1    -       store3
  7      -       -       store2
  8      -       -       store1
  9      -       -       -
  10     -       -       -
  11     cmp     -       -
  12     cbr     -       -
  13     -       -       -
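
The priority named on this slide, latency to the root, is the latency-weighted longest path from an operation to the end of the block. The sketch below computes it by memoized recursion. The function name `latency_to_root`, the one-chain graph, and the latency values are assumptions made for illustration, not the full block from the slide.

```python
def latency_to_root(succs, latency):
    """Return op -> latency-weighted longest path from op to the last op."""
    memo = {}

    def dist(op):
        if op not in memo:
            # An op's priority is its own latency plus the longest path
            # through any operation that depends on it.
            tail = max((dist(s) for s in succs.get(op, [])), default=0)
            memo[op] = latency[op] + tail
        return memo[op]

    return {op: dist(op) for op in latency}

# Hypothetical chain with assumed latencies: loadI -> add -> store -> cbr.
succs = {"loadI": ["add"], "add": ["store"], "store": ["cbr"], "cbr": []}
latency = {"loadI": 1, "add": 2, "store": 4, "cbr": 1}
prio = latency_to_root(succs, latency)
```

Under these assumed latencies the chain gets priorities 8, 7, 5, and 1, so a forward scheduler that sorts on `prio` issues the loadI first and the cbr last.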

5 Local Scheduling
Schielke’s RBF algorithm (Randomized Backward & Forward)
  Run 5 passes of forward list scheduling and 5 passes of backward list scheduling
  Break each tie randomly
  Keep the best schedule
    Shortest time to completion
    Other metrics are possible (shortest time + fewest registers)
In practice, this does very well
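
A minimal sketch of the RBF idea, under several assumptions: the machine issues `width` operations per cycle with no distinction between functional units, the priority is the latency-weighted path length, and a backward pass is done by scheduling the reversed dependence graph and mirroring the cycles. The names `rbf` and `_list_schedule` and the example graph are hypothetical; this is not Schielke's implementation.

```python
import random

def _list_schedule(ops, edges, width, rng):
    """Forward list scheduling over (src, dst, delay) edges, random tie-breaks."""
    out = {op: [] for op in ops}
    npreds = {op: 0 for op in ops}
    for src, dst, delay in edges:
        out[src].append((dst, delay))
        npreds[dst] += 1
    prio = {}

    def rank(op):  # longest delay-weighted path from op to a sink
        if op not in prio:
            prio[op] = max((d + rank(s) for s, d in out[op]), default=0)
        return prio[op]

    earliest = {op: 1 for op in ops}
    ready = [op for op in ops if npreds[op] == 0]
    start, cycle = {}, 1
    while len(start) < len(ops):
        avail = [op for op in ready if earliest[op] <= cycle]
        avail.sort(key=lambda op: (rank(op), rng.random()), reverse=True)
        for op in avail[:width]:
            ready.remove(op)
            start[op] = cycle
            for s, d in out[op]:
                earliest[s] = max(earliest[s], cycle + d)
                npreds[s] -= 1
                if npreds[s] == 0:
                    ready.append(s)
        cycle += 1
    return start

def rbf(ops, succs, latency, width=1, passes=5, seed=0):
    """Run `passes` forward and `passes` backward passes; keep the shortest."""
    rng = random.Random(seed)
    fwd = [(p, s, latency[p]) for p in ops for s in succs.get(p, [])]
    bwd = [(s, p, delay) for p, s, delay in fwd]   # reversed dependence graph
    best, best_len = None, None
    for edges, backward in ((fwd, False), (bwd, True)):
        for _ in range(passes):
            start = _list_schedule(ops, edges, width, rng)
            if backward:  # mirror time so that cycle 1 is the first issue
                last = max(start.values())
                start = {op: last - c + 1 for op, c in start.items()}
            length = max(start[op] + latency[op] - 1 for op in ops)
            if best_len is None or length < best_len:
                best, best_len = start, length
    return best, best_len

# Hypothetical block: a and b feed c, and c feeds d.
ops = ["a", "b", "c", "d"]
succs = {"a": ["c"], "b": ["c"], "c": ["d"]}
latency = {"a": 2, "b": 1, "c": 1, "d": 1}
best, best_len = rbf(ops, succs, latency)
```

Completion time is the only metric here; a variant that also counts registers would replace the `length` comparison with a tuple.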

6 Scheduling Larger Regions
Superlocal Scheduling
Work EBB at a time
Example has four EBBs
[Figure: example CFG with blocks B1 (a, b, c, d), B2 (e, f), B3 (g), B4 (h, i), B5 (j, k), B6 (l)]

7 Scheduling Larger Regions
Superlocal Scheduling
Work EBB at a time
Example has four EBBs
  Only two have nontrivial paths
    {B1,B2,B4} & {B1,B3}
  Having B1 in both causes conflicts
    Moving an op out of B1 causes problems
[Figure: example CFG with blocks B1 (a, b, c, d), B2 (e, f), B3 (g), B4 (h, i), B5 (j, k), B6 (l)]
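
The four paths counted on this slide can be enumerated mechanically. The sketch below treats the entry block and every join point as the head of an EBB, then lists each root-to-leaf path inside it. The function name `ebb_paths` is hypothetical, and the edge set is an assumption reconstructed from the figure (B1 to B2 and B3, B2 to B4 and B5, B3 to B5, B4 and B5 to B6).

```python
def ebb_paths(succs, entry):
    """List the root-to-leaf paths of every extended basic block."""
    preds = {b: [] for b in succs}
    for b, targets in succs.items():
        for t in targets:
            preds[t].append(b)
    # A block heads an EBB if it is the entry or does not have exactly
    # one predecessor (for example, a join point).
    heads = [b for b in succs if b == entry or len(preds[b]) != 1]
    paths = []

    def walk(path):
        # Only single-predecessor successors stay inside the same EBB.
        kids = [s for s in succs[path[-1]]
                if s != entry and len(preds[s]) == 1]
        if not kids:
            paths.append(path)
        for k in kids:
            walk(path + [k])

    for h in heads:
        walk([h])
    return paths

# Assumed edges for the example CFG.
cfg = {"B1": ["B2", "B3"], "B2": ["B4", "B5"], "B3": ["B5"],
       "B4": ["B6"], "B5": ["B6"], "B6": []}
paths = ebb_paths(cfg, "B1")
```

For the assumed CFG this yields {B1,B2,B4}, {B1,B3}, {B5}, and {B6}; B1 appears in both nontrivial paths, which is the source of the conflict.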

8 Scheduling Larger Regions
Superlocal Scheduling
Work EBB at a time
Example has four EBBs
  Only two have nontrivial paths
    {B1,B2,B4} & {B1,B3}
  Having B1 in both causes conflicts
    Moving an op out of B1 causes problems
[Figure: example CFG with c moved out of B1 into B2, which now holds c, e, f. Annotation: "no c here!"]

9 Scheduling Larger Regions
Superlocal Scheduling
Work EBB at a time
Example has four EBBs
  Only two have nontrivial paths
    {B1,B2,B4} & {B1,B3}
  Having B1 in both causes conflicts
    Moving an op out of B1 causes problems
    Must insert “compensation” code in B3
      Increases code space
[Figure: example CFG with c moved out of B1 into B2, which now holds c, e, f. Annotation: "This one wasn’t done for speed!"]
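
The bookkeeping behind compensation code can be sketched in a few lines. The function below moves one operation from a block into a chosen successor and copies it into every other successor, so that off-path code still sees the result. The name `move_into_successor` and the block contents are illustrative assumptions; the sketch performs no dependence checking.

```python
def move_into_successor(blocks, succs, op, src, dst):
    """Move op from src down into dst; copy it into src's other successors."""
    moved = {name: list(body) for name, body in blocks.items()}
    moved[src].remove(op)
    moved[dst].insert(0, op)
    for other in succs[src]:
        if other != dst:
            moved[other].insert(0, op)   # compensation copy on the off path
    return moved

# Assumed contents of the first three blocks of the example CFG.
blocks = {"B1": ["a", "b", "c", "d"], "B2": ["e", "f"], "B3": ["g"]}
succs = {"B1": ["B2", "B3"]}
after = move_into_successor(blocks, succs, "c", "B1", "B2")
size_before = sum(len(body) for body in blocks.values())
size_after = sum(len(body) for body in after.values())
```

The path {B1,B2} keeps the same number of operations, while the total grows from 7 to 8: the copy in B3 costs code space without making that path any faster.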

10 Scheduling Larger Regions
Superlocal Scheduling
Work EBB at a time
Example has four EBBs
  Only two have nontrivial paths
    {B1,B2,B4} & {B1,B3}
  Having B1 in both causes conflicts
    Moving an op into B1 causes problems
[Figure: example CFG with blocks B1 (a, b, c, d), B2 (e, f), B3 (g), B4 (h, i), B5 (j, k), B6 (l)]

11 Scheduling Larger Regions
Superlocal Scheduling
Work EBB at a time
Example has four EBBs
  Only two have nontrivial paths
    {B1,B2,B4} & {B1,B3}
  Having B1 in both causes conflicts
    Moving an op into B1 causes problems
      Lengthens {B1,B3}
      Adds computation to {B1,B3}
      May need compensation code, too
      Renaming may avoid “undo f”
[Figure: example CFG with f moved up from B2 into B1, which now holds a, b, c, d, f, and "undo f" added in B3. Annotation: "This makes the path even longer!"]

12 Scheduling Larger Regions
Superlocal Scheduling
How much can we get?
  Schielke saw 11 to 12% speedups
  Constrained away compensation code
Why was this harder than DVNT?
  DVNT moved information; scheduling moves ops
  DVNT moves forward; scheduling moves both ways
  Value tables partition nicely; dependence graph does not
Value numbering is the best case for superlocal scope
[Figure: example CFG with blocks B1 (a, b, c, d), B2 (e, f), B3 (g), B4 (h, i), B5 (j, k), B6 (l)]

13 Scheduling Larger Regions
More Aggressive Superlocal Scheduling
Clone blocks to create more context
Join points create blocks that must work in multiple contexts
[Figure: example CFG with blocks B1 (a, b, c, d), B2 (e, f), B3 (g), B4 (h, i), B5 (j, k), B6 (l); B5 is marked "2 paths" and B6 "3 paths"]

14 Scheduling Larger Regions
More Aggressive Superlocal Scheduling
Clone blocks to create more context
Some blocks can combine
  Single successor, single predecessor
[Figure: CFG after cloning: B1 (a, b, c, d), B2 (e, f), B3 (g), B4 (h, i), B5a and B5b (j, k each), B6a, B6b, and B6c (l each)]

15 Scheduling Larger Regions
More Aggressive Superlocal Scheduling
Clone blocks to create more context
Some blocks can combine
  Single successor, single predecessor
[Figure: CFG after cloning: B1 (a, b, c, d), B2 (e, f), B3 (g), B4 (h, i), B5a and B5b (j, k each), B6a, B6b, and B6c (l each)]
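
The cloning step can be sketched as unfolding the CFG into a tree: every block reached by more than one path gets one copy per path, named with a letter suffix as in the figure. The sketch assumes an acyclic CFG and fewer than 27 copies of any block. The function name `clone_joins` is hypothetical, and the edge set is reconstructed from the figure.

```python
from string import ascii_lowercase

def clone_joins(succs, entry):
    """Unfold an acyclic CFG so that each block has a single context."""
    preds = {b: [] for b in succs}
    for b, targets in succs.items():
        for t in targets:
            preds[t].append(b)
    npaths = {}

    def count(b):  # number of distinct paths from the entry to b
        if b not in npaths:
            npaths[b] = 1 if b == entry else sum(count(p) for p in preds[b])
        return npaths[b]

    made = {b: 0 for b in succs}   # copies created so far, per original block
    tree = {}

    def unfold(b):
        # Blocks on a single path keep their name; others get a, b, c, ...
        name = b if count(b) == 1 else b + ascii_lowercase[made[b]]
        made[b] += 1
        tree[name] = [unfold(s) for s in succs[b]]
        return name

    unfold(entry)
    return tree

# Assumed edges for the example CFG.
cfg = {"B1": ["B2", "B3"], "B2": ["B4", "B5"], "B3": ["B5"],
       "B4": ["B6"], "B5": ["B6"], "B6": []}
tree = clone_joins(cfg, "B1")
```

In the result, B4, B5a, and B5b each have a single successor whose only predecessor is that block, which is the condition the slide gives for combining blocks.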

16 Scheduling Larger Regions
More Aggressive Superlocal Scheduling
Clone blocks to create more context
Some blocks can combine
  Single successor, single predecessor
Now schedule EBBs
  {B1,B2,B4}, {B1,B2,B5a}, {B1,B3,B5b}
  Pay heed to compensation code
Works well for forward motion
  Backward motion still has off-path problems
Speeding up one path can slow down others (undo)
[Figure: CFG after cloning and combining: B1 (a, b, c, d), B2 (e, f), B3 (g), B4 (h, i, l), B5a and B5b (j, k, l each)]

17 Scheduling Larger Regions
Trace Scheduling
Start with execution counts for edges
  Obtained by profiling
[Figure: example CFG with blocks B1 (a, b, c, d), B2 (e, f), B3 (g), B4 (h, i), B5 (j, k), B6 (l)]

18 Scheduling Larger Regions
Trace Scheduling
Start with execution counts for edges
  Obtained by profiling
Pick the “hot” path
Block counts could mislead us; see B5
[Figure: example CFG with edge counts: entry -> B1 10, B1 -> B2 7, B1 -> B3 3, B2 -> B4 5, B2 -> B5 2, B3 -> B5 3, B4 -> B6 5, B5 -> B6 5]
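
The warning about block counts can be checked with a little arithmetic. The sketch below derives block counts by summing incoming edge counts. The function name `block_counts` is hypothetical, and the mapping of the counts in the figure to specific edges is an assumption reconstructed from the layout.

```python
from collections import Counter

def block_counts(edge_counts, entry, entry_count):
    """Derive block execution counts by summing incoming edge counts."""
    counts = Counter({entry: entry_count})
    for (_, dst), n in edge_counts.items():
        counts[dst] += n
    return dict(counts)

# Assumed assignment of the profile counts to edges.
edges = {("B1", "B2"): 7, ("B1", "B3"): 3, ("B2", "B4"): 5, ("B2", "B5"): 2,
         ("B3", "B5"): 3, ("B4", "B6"): 5, ("B5", "B6"): 5}
counts = block_counts(edges, "B1", 10)
```

B4 and B5 both execute 5 times, so block counts alone cannot choose between them after B2. The edge counts can: B2 -> B4 runs 5 times and B2 -> B5 only 2.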

19 Scheduling Larger Regions
Trace Scheduling
Start with execution counts for edges
  Obtained by profiling
Pick the “hot” path
  B1, B2, B4, B6
  Schedule it
  Compensation code in B3, B5 if needed
Get the hot path right!
  If we picked the right path, the other blocks do not matter as much
  Places a premium on quality profiles
[Figure: example CFG with edge counts: entry -> B1 10, B1 -> B2 7, B1 -> B3 3, B2 -> B4 5, B2 -> B5 2, B3 -> B5 3, B4 -> B6 5, B5 -> B6 5]

20 Scheduling Larger Regions
Trace Scheduling
Entire CFG
  Pick & schedule hot path
  Insert compensation code
  Remove hot path from CFG
  Repeat the process until CFG is empty
Idea
  Hot paths matter
  Farther off hot path, less it matters
[Figure: example CFG with edge counts: entry -> B1 10, B1 -> B2 7, B1 -> B3 3, B2 -> B4 5, B2 -> B5 2, B3 -> B5 3, B4 -> B6 5, B5 -> B6 5]
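
The pick, remove, repeat loop can be sketched as trace formation over the edge profile. This version seeds each trace with the hottest remaining edge and grows it forward and backward along the hottest edges between unplaced blocks; that seeding rule is an assumption, and scheduling and compensation code are left out. The function name `form_traces` and the edge-count mapping are hypothetical.

```python
def form_traces(edge_counts, blocks):
    """Partition the CFG into traces, hottest first."""
    remaining = set(blocks)
    traces = []
    while remaining:
        # Only edges between blocks not yet placed in a trace are usable.
        live = {e: n for e, n in edge_counts.items()
                if e[0] in remaining and e[1] in remaining}
        if not live:
            traces.extend([b] for b in sorted(remaining))
            break
        src, dst = max(live, key=live.get)      # seed: hottest remaining edge
        trace = [src, dst]
        while True:                             # grow forward
            out = {e: n for e, n in live.items()
                   if e[0] == trace[-1] and e[1] not in trace}
            if not out:
                break
            trace.append(max(out, key=out.get)[1])
        while True:                             # grow backward
            inc = {e: n for e, n in live.items()
                   if e[1] == trace[0] and e[0] not in trace}
            if not inc:
                break
            trace.insert(0, max(inc, key=inc.get)[0])
        traces.append(trace)
        remaining -= set(trace)                 # remove the trace from the CFG
    return traces

# Assumed assignment of the profile counts to edges.
edges = {("B1", "B2"): 7, ("B1", "B3"): 3, ("B2", "B4"): 5, ("B2", "B5"): 2,
         ("B3", "B5"): 3, ("B4", "B6"): 5, ("B5", "B6"): 5}
traces = form_traces(edges, ["B1", "B2", "B3", "B4", "B5", "B6"])
```

On the assumed profile the first trace is B1, B2, B4, B6, and the leftover blocks B3 and B5 form the second, colder trace.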

