Adapting Convergent Scheduling Using Machine Learning Diego Puppin*, Mark Stephenson †, Una-May O’Reilly †, Martin Martin †, and Saman Amarasinghe † * Institute for Information Science and Technologies, Italy † Massachusetts Institute of Technology, USA
Outline
This talk shows how one can apply machine learning techniques to find good phase orderings for an instruction scheduler.
First, I'll introduce the scheduler that we are interested in improving.
Then, I'll discuss genetic programming.
Finally, I'll present experimental results.
Clustered Architectures
Memory and registers are separated into clusters (as in RAW and clustered VLIWs). Each cluster contains an R4000-like processor core, and clusters are connected by an operand network. When scheduling, we try to co-locate data with computation.
Convergent Scheduling
Convergent scheduling passes are symmetric: each pass takes a preference map as input and outputs a preference map. Passes are modular and can be applied in any order.
Convergent Scheduling: Preference Maps
A preference map is indexed by instructions, clusters, and time. Each entry is a weight; the weights correspond to the “confidence” of a space-time assignment for a given instruction.
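As a concrete illustration of the data structure, a preference map can be sketched as a 3-D weight array, with each pass mapping one map to another. This is a hypothetical sketch, not the talk's implementation: the sizes, the normalization step, and the toy `boost_cluster` pass are all assumptions.

```python
# Hypothetical sketch of a preference map: pmap[i][c][t] is the
# scheduler's confidence that instruction i belongs on cluster c at
# time slot t. Sizes and normalization are illustrative assumptions.
N_INSTRS, N_CLUSTERS, N_SLOTS = 6, 4, 8

def uniform_map():
    """Start with no preference: equal weight everywhere."""
    w = 1.0 / (N_CLUSTERS * N_SLOTS)
    return [[[w] * N_SLOTS for _ in range(N_CLUSTERS)]
            for _ in range(N_INSTRS)]

def normalize(pmap):
    """Rescale each instruction's weights so they sum to 1."""
    for instr in pmap:
        total = sum(sum(row) for row in instr)
        for row in instr:
            for t in range(N_SLOTS):
                row[t] /= total
    return pmap

def boost_cluster(pmap, instr, cluster, factor=4.0):
    """A toy 'pass': raise confidence that instr goes on cluster.
    Like the real passes, it maps a preference map to a preference map,
    so such passes compose in any order."""
    for t in range(N_SLOTS):
        pmap[instr][cluster][t] *= factor
    return normalize(pmap)

pmap = boost_cluster(uniform_map(), instr=0, cluster=2)
```

Because a pass only scales weights rather than making an absolute assignment, a later pass can still move instruction 0 elsewhere by outweighing this boost.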
Example: Dependence Graph (figure: four clusters; shading indicates high vs. low confidence)
The animation steps through the passes, each refining the preference map: Placement Propagation, Critical Path Strengthening, Path Propagation, Parallelism Distribution, Path Propagation, Communication Reduction, Path Propagation, and finally the Final Schedule.
Convergent Scheduling
“Classical” scheduling passes make absolute decisions that can't be undone. Convergent scheduling passes make soft decisions in the form of preferences, so mistakes made early on can be undone. Passes don't impose an order!
Double-Edged Sword
The good news: convergent scheduling does not constrain phase order, and a nice interface makes writing and integrating passes easy.
The bad news: convergent scheduling does not constrain phase order, so there is a limitless number of phase orders to consider, some of which are much better than others.
Our Proposal
Use genetic programming to automatically search for a phase ordering that's tailored to a given architecture and compiler. Our inspiration comes from Cooper's work [Cooper et al., LCTES 1999].
Genetic Programming
A search algorithm analogous to Darwinian evolution. Maintain a population of expressions, e.g.:
(sequence INITTIME (sequence PLACE (if imbalanced LOAD COMM)))
Genetic Programming
A search algorithm analogous to Darwinian evolution; maintain a population of expressions.
Selection: the fittest expressions in the population are more likely to reproduce.
Reproduction: crossing over subexpressions of two expressions.
Mutation: random changes to an expression.
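Crossover on these expressions can be sketched as swapping subtrees between two parents. This is a minimal illustration assuming expressions are nested tuples like the example above; the pass names and the helper functions are mine, not from the talk's system.

```python
import random

def subtrees(expr, path=()):
    """Enumerate every (path, subtree) pair in an expression tree."""
    yield path, expr
    if isinstance(expr, tuple):
        for i, child in enumerate(expr[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(expr, path, new):
    """Return a copy of expr with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return expr[:i] + (replace(expr[i], path[1:], new),) + expr[i + 1:]

def crossover(parent_a, parent_b, rng=random):
    """Graft a random subtree of parent_b onto a random point in parent_a."""
    path_a, _ = rng.choice(list(subtrees(parent_a)))
    _, sub_b = rng.choice(list(subtrees(parent_b)))
    return replace(parent_a, path_a, sub_b)

a = ("sequence", "INITTIME",
     ("sequence", "PLACE", ("if", "imbalanced", "LOAD", "COMM")))
b = ("sequence", "DEP", "FUNC")
child = crossover(a, b, random.Random(0))
```

Because crossover points can land inside an `if`, this operator can both create and destroy conditional subexpressions, which matters for the "Disappointing Result" discussed later.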
General Flow
Create initial population (randomly generated initial solutions) → Evaluation → Selection → Create variants → repeat until done.
Evaluation: the compiler is modified to use the given expression as the phase ordering, and each expression is evaluated by compiling and running the benchmark(s). Fitness is the relative speedup over our original phase ordering on the benchmark(s).
Selection: just as with natural selection, the fittest individuals are more likely to survive.
Create variants: use crossover and mutation to generate new expressions, and thus new and hopefully improved phase orderings.
Experimental Setup
We use an in-house VLIW compiler (SUIF, MachSUIF) and simulator. Compiler and simulator are parameterized so we can easily change VLIW configurations. Experiments presented here are for clustered architectures; details of the architectures are in the paper.
Convergent Scheduling Heuristics
Noise Introduction, Initial Time Assignment, Preplacement, Critical Path Strengthening, Communication Minimization, Parallelism Distribution, Load Balance, Dependence Enforcement, Assignment Strengthening, Functional Unit Distribution, Push to First Cluster, Critical Path Distance, Cluster Creation, Register Pressure Reduction in Time, Register Pressure Reduction in Space
Hand-Tuned Results 4-cluster VLIW, Rich Interconnect
Results 4-cluster VLIW, Limited Interconnect
Training an Improved Sequence
Goal: find a sequence that works well for all the benchmarks in the last graph (vmul, rbsorf, yuv, etc.). We train a sequence using these benchmarks: for each expression in the population, compile and run all the benchmarks, and take the average speedup as the fitness.
The Schedule
The evolved sequence is much more conservative in communication: inittime func dep func load func dep func comm dep func comm place
func reduces the weights of instructions on overloaded clusters; dep increases the probability that dependent instructions are scheduled “nearby”; comm tries to keep neighboring instructions in the same cluster.
Results 4-cluster VLIW, Limited Interconnect
Results Leave-One-Out Cross Validation
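The leave-one-out protocol behind that graph can be sketched as follows: for each benchmark, evolve a sequence using only the other benchmarks, then measure it on the held-out one. The `train` and `evaluate` callables below are placeholders for the full evolve-compile-run machinery, and the toy demonstration only checks the held-out benchmark never leaks into training.

```python
# Sketch of leave-one-out cross validation over the benchmark suite.
def leave_one_out(benchmarks, train, evaluate):
    results = {}
    for held_out in benchmarks:
        training_set = [b for b in benchmarks if b != held_out]
        sequence = train(training_set)      # e.g., evolve a phase ordering
        results[held_out] = evaluate(sequence, held_out)
    return results

# Toy demonstration: "training" just records the training set, and
# "evaluation" checks the held-out benchmark was never trained on.
names = ["vmul", "rbsorf", "yuv"]
report = leave_one_out(names, train=list,
                       evaluate=lambda seq, b: b not in seq)
```

A high held-out score for every benchmark is what supports the claim that the evolved sequence generalizes rather than overfitting its training set.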
Summary of Results
When we changed the architecture, the hand-tuned sequence failed: UAS and PCC outperformed convergent scheduling. Our GP system found a sequence that usually outperforms UAS and PCC. Cross validation suggests that it is possible to find a “general-purpose” sequence.
Running Time
Using about 20 machines in a small cluster of workstations, it takes about 2 days to evolve a sequence. This is a one-time process, performed by the compiler vendor!
Disappointing Result
Unfortunately, sequences with conditionals are weeded out of the GP selection process. Our system rewards parsimony, yet convergent scheduling passes make soft decisions, so running an extra pass may not be detrimental. We'd like to get to the bottom of this unexpected result.
Conclusions
Using GP, we're able to find architecture-specific, application-independent sequences. We can quickly retune the compiler when the architecture changes or when the compiler itself changes.