Adapting Convergent Scheduling Using Machine Learning
Diego Puppin*, Mark Stephenson†, Una-May O'Reilly†, Martin Martin†, and Saman Amarasinghe†
* Institute for Information Science and Technologies, Italy
† Massachusetts Institute of Technology, USA
Outline
This talk shows how machine learning can be used to find good phase orderings for an instruction scheduler.
- First, I'll introduce the scheduler we are interested in improving
- Then, I'll discuss genetic programming
- Finally, I'll present experimental results
Clustered Architectures
- Memory and registers are separated into clusters (e.g., RAW, clustered VLIWs)
- Each cluster resembles an R4000-like processor core, connected to the others by an operand network
- When scheduling, we try to co-locate data with computation
Convergent Scheduling
- Convergent scheduling passes are symmetric: each pass takes a preference map as input and outputs a preference map
- Passes are modular and can be applied in any order
Convergent Scheduling: Preference Maps
[Figure: a grid of weights indexed by instruction, cluster, and time slot]
- Each entry is a weight
- The weights correspond to the "confidence" of a space-time assignment for a given instruction
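As a rough illustration of the data structure the talk describes, here is a minimal sketch of a preference map: for every instruction, a weight for each (cluster, time-slot) pair. The function names and the normalization convention are assumptions for illustration, not the paper's actual implementation.

```python
def make_preference_map(n_instr, n_clusters, n_slots):
    """Start every instruction with uniform (no-preference) weights."""
    w = 1.0 / (n_clusters * n_slots)
    return [[[w for _ in range(n_slots)] for _ in range(n_clusters)]
            for _ in range(n_instr)]

def normalize(prefs):
    """Rescale each instruction's weights so they sum to 1."""
    for instr in prefs:
        total = sum(sum(row) for row in instr)
        for row in instr:
            for t in range(len(row)):
                row[t] /= total
    return prefs

def best_assignment(prefs, i):
    """Highest-confidence (cluster, time-slot) pair for instruction i."""
    return max(((c, t) for c in range(len(prefs[i]))
                for t in range(len(prefs[i][c]))),
               key=lambda ct: prefs[i][ct[0]][ct[1]])
```

A pass that raises one entry's weight thereby raises the scheduler's confidence in that space-time assignment, which is exactly the "soft decision" the later slides rely on.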
Example Dependence Graph
[Figure: a dependence graph spread over four clusters; shading marks high- vs. low-confidence assignments]
Placement Propagation
Critical Path Strengthening
Path Propagation
Parallelism Distribution
Path Propagation
Communication Reduction
Path Propagation
Final Schedule
Convergent Scheduling
- "Classical" scheduling passes make absolute decisions that can't be undone
- Convergent scheduling passes make soft decisions in the form of preferences
- Mistakes made early on can be undone
- Passes don't impose an order!
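Because every pass shares the same interface, map in, map out, passes compose in any order. The following sketch shows that interface with two toy passes; the pass names, the simplified {instruction: {cluster: weight}} map shape, and the pass bodies are all illustrative assumptions, not the paper's code.

```python
from functools import reduce

def critical_path_pass(prefs):
    """Toy pass: nudge instruction 'a' toward cluster 0."""
    out = {i: dict(cw) for i, cw in prefs.items()}
    out["a"][0] *= 2.0
    return out

def load_balance_pass(prefs):
    """Toy pass: pull each instruction's weights slightly toward uniform."""
    out = {}
    for i, cw in prefs.items():
        mean = sum(cw.values()) / len(cw)
        out[i] = {c: 0.5 * w + 0.5 * mean for c, w in cw.items()}
    return out

def run_sequence(passes, prefs):
    """Apply passes left to right; the order is a free choice."""
    return reduce(lambda p, f: f(p), passes, prefs)
```

Either ordering of the two passes type-checks and runs; which ordering produces the better schedule is exactly the question the rest of the talk addresses.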
Double-Edged Sword
- The good news: convergent scheduling does not constrain phase order. A clean interface makes writing and integrating passes easy
- The bad news: convergent scheduling does not constrain phase order. There is a limitless number of phase orders to consider, some of which are much better than others
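To give a feel for "limitless": the talk later lists about 15 heuristics, and the evolved sequence it reports contains 12 passes (with repeats allowed). Even restricting to fixed-length sequences of that size, before counting conditionals in the expression grammar, the space is far too large to enumerate:

```python
# ~15 candidate passes, sequences of length 12 with repetition allowed
n_passes, length = 15, 12
orderings = n_passes ** length
print(orderings)  # 129,746,337,890,625 candidate orderings
```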
Our Proposal
- Use genetic programming to automatically search for a phase ordering catered to a given architecture and compiler
- Our inspiration comes from Cooper's work [Cooper et al., LCTES 1999]
Genetic Programming
- A search algorithm analogous to Darwinian evolution
- Maintains a population of expressions, e.g.:
  (sequence INITTIME (sequence PLACE (if imbalanced LOAD COMM)))
Genetic Programming
- A search algorithm analogous to Darwinian evolution; maintains a population of expressions
- Selection: the fittest expressions in the population are more likely to reproduce
- Reproduction: crossing over subexpressions of two expressions
- Mutation: randomly altering a subexpression
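Crossover and mutation over such expressions can be sketched as operations on nested lists mirroring the (sequence ...) form above. This is a generic GP sketch under assumed representations, not the paper's implementation; the pass vocabulary and helper names are made up for illustration.

```python
import random

PASSES = ["INITTIME", "PLACE", "LOAD", "COMM", "DEP"]

def subtrees(expr, path=()):
    """Yield (path, node) for every node in an expression tree."""
    yield path, expr
    if isinstance(expr, list):
        for k, child in enumerate(expr[1:], start=1):
            yield from subtrees(child, path + (k,))

def replace_at(expr, path, new):
    """Return a copy of expr with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    out = list(expr)
    out[path[0]] = replace_at(out[path[0]], path[1:], new)
    return out

def crossover(a, b, rng):
    """Swap a random subexpression of a with a random subexpression of b."""
    pa, _ = rng.choice(list(subtrees(a)))
    _, sb = rng.choice(list(subtrees(b)))
    return replace_at(a, pa, sb)

def mutate(expr, rng):
    """Replace a random subexpression with a randomly chosen pass."""
    p, _ = rng.choice(list(subtrees(expr)))
    return replace_at(expr, p, rng.choice(PASSES))
```

Repeatedly applying these two operators to selected parents is what "create variants" means in the flow on the next slides.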
General Flow
Create initial population (randomly generated initial solutions) → Evaluation → Selection → Create Variants → repeat until done
General Flow: Evaluation
- The compiler is modified to use the given expression as the phase ordering
- Each expression is evaluated by compiling and running the benchmark(s)
- Fitness is the relative speedup over our original phase ordering on the benchmark(s)
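The evaluation step can be sketched as follows. The real fitness comes from compiling with the candidate ordering and running the simulator; here `run_benchmark` is a stand-in stub, and its behavior (shorter orderings run slightly faster) is an arbitrary assumption for illustration only.

```python
BASELINE_CYCLES = 1_000_000  # cycles under the original hand-tuned ordering

def run_benchmark(ordering):
    """Stub for 'compile with `ordering`, run on the simulator,
    report cycle count'; the formula here is made up."""
    return 900_000 + 10_000 * len(ordering)

def fitness(ordering):
    """Relative speedup over the original phase ordering (>1.0 is better)."""
    return BASELINE_CYCLES / run_benchmark(ordering)
```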
General Flow: Selection
- Just as with natural selection, the fittest individuals are more likely to survive
General Flow: Create Variants
- Use crossover and mutation to generate new expressions
- And thus generate new, and hopefully improved, phase orderings
Experimental Setup
- We use an in-house VLIW compiler (SUIF, MachSUIF) and simulator
- Compiler and simulator are parameterized so we can easily change VLIW configurations
- Experiments presented here are for clustered architectures; details of the architectures are in the paper
Convergent Scheduling Heuristics
- Noise Introduction
- Initial Time Assignment
- Preplacement
- Critical Path Strengthening
- Communication Minimization
- Parallelism Distribution
- Load Balance
- Dependence Enforcement
- Assignment Strengthening
- Functional Unit Distribution
- Push to First Cluster
- Critical Path Distance
- Cluster Creation
- Register Pressure Reduction in Time
- Register Pressure Reduction in Space
Hand-Tuned Results
4-cluster VLIW, Rich Interconnect
Results
4-cluster VLIW, Limited Interconnect
Training an Improved Sequence
- Goal: find a sequence that works well for all the benchmarks in the last graph (vmul, rbsorf, yuv, etc.)
- Train a sequence using these benchmarks: for each expression in the population, compile and run all the benchmarks and take the average speedup as its fitness
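The multi-benchmark fitness described above can be sketched in a few lines; the cycle counts in the usage below are made up for illustration.

```python
def average_speedup(candidate_cycles, baseline_cycles):
    """Mean per-benchmark speedup of a candidate ordering over the
    original ordering, given one cycle count per benchmark."""
    ratios = [b / c for b, c in zip(baseline_cycles, candidate_cycles)]
    return sum(ratios) / len(ratios)
```

For example, a candidate that is 10% faster on one benchmark and 5% slower on another scores (1.10 + 0.95) / 2 = 1.025.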
The Schedule
- The evolved sequence is much more conservative in communication:
  inittime func dep func load func dep func comm dep func comm place
- func: reduces the weights of instructions on overloaded clusters
- dep: increases the probability that a dependent instruction is scheduled "nearby"
- comm: tries to keep neighboring instructions in the same cluster
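To make the pipeline reading of that sequence concrete, here is a hypothetical sketch that runs the evolved ordering left to right; the pass bodies are stubs that only record execution order, standing in for real map-to-map passes.

```python
# The evolved ordering quoted on the slide, applied left to right.
EVOLVED = "inittime func dep func load func dep func comm dep func comm place"

def make_stub(name):
    def stub(prefs):
        # A real pass would reweight the preference map; the stub
        # just records that the pass ran.
        return prefs + [name]
    return stub

def run(sequence, prefs):
    for name in sequence.split():
        prefs = make_stub(name)(prefs)
    return prefs
```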
Results
4-cluster VLIW, Limited Interconnect
Results
Leave-One-Out Cross Validation
Summary of Results
- When we changed the architecture, the hand-tuned sequence failed: UAS and PCC outperform convergent scheduling there
- Our GP system found a sequence that usually outperforms UAS and PCC
- Cross validation suggests that it is possible to find a "general-purpose" sequence
Running Time
- Using about 20 machines in a small cluster of workstations, it takes about 2 days to evolve a sequence
- This is a one-time process, performed by the compiler vendor
Disappointing Result
- Unfortunately, sequences with conditionals are weeded out of the GP selection process
- Our system rewards parsimony
- Convergent scheduling passes make soft decisions, so running an extra pass may not be detrimental
- We'd like to get to the bottom of this unexpected result
Conclusions
- Using GP, we're able to find architecture-specific, application-independent sequences
- We can quickly retune the compiler when the architecture changes or the compiler itself changes
Implemented Tests