UNIVERSITAT POLITÈCNICA DE CATALUNYA
Departament d'Arquitectura de Computadors

Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning

Alex Aletà, Josep M. Codina, Jesús Sánchez, Antonio González, David Kaeli
{aaleta, jmcodina, fran,
PACT 2002, Charlottesville, Virginia – September 2002
Clustered Architectures
- Current/future challenges in processor design: delay in the transmission of signals, power consumption, architecture complexity
- Clustering: divide the system into semi-independent units; each unit is a cluster
  - Fast intra-cluster interconnects
  - Slow inter-cluster interconnects
- Common trend in commercial VLIW processors: TI's C6x, Analog's TigerSHARC, HP's LX, Equator's MAP1000
Architecture Overview
[Figure: n clusters, each with a local register file and FU/MEM units, sharing the L1 cache and connected through register buses]
Instruction Scheduling
- For non-clustered architectures: resources, dependences
- For clustered architectures, in addition: cluster assignment
  - Minimize inter-cluster communication delays
  - Exploit communication locality
- This work focuses on modulo scheduling, a technique to schedule loops, for clustered VLIW architectures
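As a reminder of the modulo-scheduling baseline the talk builds on, the initiation interval (II) is bounded from below by resource and recurrence constraints. The sketch below is a generic, textbook-style illustration of that bound, not the paper's exact formulation; the function name and the example numbers are hypothetical.

```python
from math import ceil

def min_initiation_interval(op_counts, unit_counts, recurrences):
    """Generic lower bound on the II used by modulo scheduling.

    op_counts:   {resource_class: number of loop operations needing it}
    unit_counts: {resource_class: number of units of that class}
    recurrences: list of (total_latency, distance) pairs, one per dependence cycle
    """
    # Resource-constrained bound: every class must fit its operations into II slots.
    res_mii = max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)
    # Recurrence-constrained bound: each cycle's latency must fit into II * distance.
    rec_mii = max((ceil(lat / dist) for lat, dist in recurrences), default=1)
    return max(res_mii, rec_mii)

# Hypothetical example: 6 memory ops on 2 memory units, one recurrence of
# latency 5 and distance 2 -> II lower bound of 3.
print(min_initiation_interval({"mem": 6}, {"mem": 2}, [(5, 2)]))
```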
Talk Outline
- Previous work
- Proposed algorithm: overview, graph partitioning, pseudo-scheduling
- Performance evaluation
- Conclusions
MS for Clustered Architectures
In previous work, two different approaches were proposed:
- Two steps:
  1. Data Dependence Graph partitioning: each instruction is assigned to a cluster
  2. Scheduling: instructions are scheduled in a suitable slot, but only in the preassigned cluster
- One step (cluster assignment + scheduling combined):
  - There is no initial cluster assignment; the scheduler is free to choose any cluster
[Diagrams: both approaches increase II and retry when scheduling fails]
Goal of the Work
Both approaches have benefits:
- Two steps: global vision of the Data Dependence Graph; workload is better split among the clusters; the number of communications is reduced
- One step: local vision of the partial schedule; cluster assignment is performed with information from the partial schedule
Goal: obtain an algorithm that combines the benefits of both approaches
Baseline
- Baseline scheme: GP [Aletà et al., MICRO-34]
  - Cluster assignment performed with a graph partitioning algorithm
  - Feedback between the partitioner and the scheduler
  - Results outperformed previous approaches
  - Still, little information is available for cluster assignment
- New algorithm: a better partition
  - Pseudo-schedules are used to guide the partition
  - Global vision of the Data Dependence Graph
  - More information to perform cluster assignment
Algorithm Overview
[Flowchart]
1. II := MII; compute initial partition
2. Start scheduling: select the next operation (j++)
3. Schedule Op j based on the current partition
4. If Op j cannot be scheduled there, try to move Op j to another cluster
5. If it still cannot be scheduled: II++, refine the partition, and restart the scheduling
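A rough Python rendering of the control flow in this flowchart may help fix the structure. Every helper passed in (initial_partition, refine_partition, make_schedule) is a hypothetical placeholder rather than the authors' actual API, and details such as the scheduling order are simplified.

```python
def modulo_schedule(ops, machine_clusters, mii, initial_partition, refine_partition,
                    make_schedule):
    """Control flow of the overview flowchart (all callbacks are hypothetical).

    initial_partition(ii) / refine_partition(partition, ii) -> {op: cluster}
    make_schedule(ii) -> object with place(op, cluster) -> bool
    """
    ii = mii
    partition = initial_partition(ii)
    while True:
        schedule, ok = make_schedule(ii), True
        for op in ops:                          # 'select next operation (j++)'
            # First try the cluster the current partition assigns to the op.
            if schedule.place(op, partition[op]):
                continue
            # Otherwise try to move the op to any other cluster.
            if not any(schedule.place(op, c) for c in machine_clusters
                       if c != partition[op]):
                ok = False
                break
        if ok:
            return schedule                     # every op found a slot at this II
        ii += 1                                 # could not schedule: II++
        partition = refine_partition(partition, ii)
```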
Graph Partitioning Background
- Problem statement: split the nodes into a pre-determined number of sets while optimizing some objective function
- Multilevel strategy: coarsen the graph by iteratively fusing pairs of nodes into new macro-nodes
- Enhancing heuristics:
  - Avoid excess load in any one set
  - Reduce the execution time of the loops
Graph Coarsening
- Previous definitions: matching, slack
- Iterate until the number of (macro-)nodes equals the number of clusters:
  - Edges are weighted according to the impact on execution time of adding a bus delay to the edge, and the slack of the edge
  - A maximum-weight matching is selected
  - Nodes linked by edges in the matching are fused into a single macro-node
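A minimal sketch of one coarsening pass, assuming NetworkX for the maximum-weight matching. The edge weights are taken as given, standing in for the slide's criterion (slack plus the execution-time impact of adding a bus delay), and all function names are illustrative, not the paper's.

```python
import networkx as nx

def coarsen_once(g):
    """One coarsening pass: fuse node pairs joined by a maximum-weight matching."""
    matching = nx.max_weight_matching(g, weight="weight")
    fused = {}
    for a, b in matching:                    # each matched pair becomes a macro-node
        fused[a] = fused[b] = (a, b)
    node_of = lambda n: fused.get(n, n)
    coarse = nx.Graph()
    coarse.add_nodes_from(node_of(n) for n in g.nodes)
    for u, v, data in g.edges(data=True):
        cu, cv = node_of(u), node_of(v)
        if cu == cv:
            continue                         # edge absorbed inside a macro-node
        w = data.get("weight", 1)
        if coarse.has_edge(cu, cv):
            coarse[cu][cv]["weight"] += w    # merge parallel edges
        else:
            coarse.add_edge(cu, cv, weight=w)
    return coarse

def coarsen(g, n_clusters):
    """Repeat until the graph has as many (macro-)nodes as clusters."""
    while g.number_of_nodes() > n_clusters:
        smaller = coarsen_once(g)
        if smaller.number_of_nodes() == g.number_of_nodes():
            break                            # no matchable edges left
        g = smaller
    return g
```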
Coarsening Example
[Figure: initial graph, matching found, final coarsened graph]
Coarsening Example (II)
Partition induced in the original graph
[Figure: initial graph, induced partition, final graph]
Reducing Execution Time
- An estimation of execution time is needed: pseudo-schedules
- Information obtained from a pseudo-schedule: II, SC, lifetimes, spills
Building Pseudo-schedules
- Dependences: respected if possible; otherwise a penalty on register pressure and/or execution time is assessed
- Cluster assignment: the partition is strictly followed
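The sketch below builds such a pseudo-schedule in a deliberately simplified setting: acyclic dependences only, a single generic latency, one FU type per cluster, and a fixed bus latency of 1. All names and the small example graph are hypothetical, intended only to show how II, schedule length and penalties could be estimated from a partition.

```python
from collections import defaultdict

BUS_LATENCY = 1   # assumed inter-cluster copy latency

def pseudo_schedule(nodes, preds, partition, ii, fus_per_cluster, latency):
    """Rough pseudo-schedule that follows the partition strictly.

    nodes: ops in topological order (recurrences, handled by the real algorithm,
    are ignored here). preds: op -> list of predecessors. Returns start cycles,
    an estimate of the schedule length, and a penalty counting constraints that
    could not be honoured.
    """
    used = defaultdict(int)      # modulo reservation table: (cluster, cycle % II)
    start, penalty = {}, 0
    for v in nodes:
        # Earliest cycle allowed by dependences, plus a bus delay for remote producers.
        t = max((start[u] + latency +
                 (BUS_LATENCY if partition[u] != partition[v] else 0)
                 for u in preds.get(v, ())), default=0)
        # Look for a free FU in the assigned cluster within one full II window.
        for d in range(ii):
            if used[(partition[v], (t + d) % ii)] < fus_per_cluster:
                t += d
                break
        else:
            penalty += 1         # no free slot: keep t and charge a penalty instead
        used[(partition[v], t % ii)] += 1
        start[v] = t
    return start, max(start.values()) + latency, penalty

# Illustrative only (not the slides' exact graph): A -> B -> C and A -> D,
# with A, B on cluster 1 and C, D on cluster 2; II = 2, 1 FU/cluster, latency 3.
print(pseudo_schedule(["A", "B", "D", "C"],
                      {"B": ["A"], "C": ["B"], "D": ["A"]},
                      {"A": 1, "B": 1, "D": 2, "C": 2},
                      ii=2, fus_per_cluster=1, latency=3))
```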
Pseudo-schedule: Example
Machine: 2 clusters, 1 FU/cluster, 1 bus of latency 1, II = 2, instruction latency = 3
Induced partition: Cluster 1 = {A, B}, Cluster 2 = {D}; C still to be placed
[Schedule table]
  cycle 0: A
  cycle 3: B
  cycle 4: D
  cycle 6: C? NO
Pseudo-schedule: Example (cont.)
Partition with C assigned: Cluster 1 = {A, C, B}, Cluster 2 = {D}
[Schedule table]
  cycle 0: A
  cycle 3: B
  cycle 4: D, C (a constraint that cannot be honoured is charged as a penalty, as described above)
Heuristic Description
While there is improvement, iterate:
- Obtain different partitions by moving nodes among clusters
- Discard partitions that overload the resources of any cluster
- Choose the partition that minimizes the estimated execution time
- In case of a tie, select the one that minimizes register pressure
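One way to read this heuristic is as a hill-climbing loop over single-node moves, scored lexicographically by (estimated execution time, register pressure) so that ties on time fall back to pressure. The evaluate and overloaded callbacks are hypothetical stand-ins for scores derived from pseudo-schedules; this is a sketch of the idea, not the authors' implementation.

```python
def improve_partition(partition, nodes, clusters, evaluate, overloaded):
    """Hill climbing over single-node moves, kept while it improves the score.

    evaluate(partition)   -> (estimated_exec_time, register_pressure); compared
                             lexicographically, so equal times fall back to
                             register pressure (hypothetical callback)
    overloaded(partition) -> True if some cluster exceeds its resources
    """
    best, best_score = dict(partition), evaluate(partition)
    improved = True
    while improved:
        improved = False
        for node in nodes:
            for c in clusters:
                if c == best[node]:
                    continue
                candidate = dict(best)
                candidate[node] = c               # move one node to cluster c
                if overloaded(candidate):
                    continue                      # overloads a cluster: discard
                score = evaluate(candidate)
                if score < best_score:
                    best, best_score = candidate, score
                    improved = True
    return best
```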
Algorithm Overview (revisited)
[Flowchart repeated, as above]
The Scheduling Step
- To schedule the partition we use URACAM [Codina et al., PACT'01]
  - Uses dynamic transformations, guided by a figure of merit, to improve the partial schedule: register communications vs. bus/memory usage, spill code inserted on the fly vs. register pressure/memory
- If an instruction cannot be scheduled in the cluster assigned by the partition:
  - Try all other clusters
  - Select the best one according to the figure of merit
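A hedged sketch of the fallback: the op is first offered to its preassigned cluster, and only if that fails are the remaining clusters evaluated and the best tentative placement kept. The try_place method and the figure_of_merit callback are assumed interfaces, loosely mimicking how URACAM scores alternatives; they are not the real URACAM API.

```python
def schedule_op(op, preferred_cluster, schedule, clusters, figure_of_merit):
    """Try the cluster chosen by the partition first, then fall back to the rest.

    schedule.try_place(op, cluster) -> a tentative schedule or None (assumed API);
    figure_of_merit scores a tentative schedule, higher being better.
    """
    tentative = schedule.try_place(op, preferred_cluster)
    if tentative is not None:
        return tentative
    # The preassigned cluster failed: evaluate every other cluster.
    best = None
    for c in clusters:
        if c == preferred_cluster:
            continue
        tentative = schedule.try_place(op, c)
        if tentative is None:
            continue
        if best is None or figure_of_merit(tentative) > figure_of_merit(best):
            best = tentative
    return best          # None means the op cannot be scheduled at this II
```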
Algorithm Overview (revisited)
[Flowchart repeated, as above]
Partition Refinement
When II has increased, a better partition may be found for the new II:
- New slots have been generated in each cluster
- More lifetimes are available
- A larger number of bus communications is allowed
The coarsening process is repeated:
- Only edges between nodes in the same set can appear in the matching
- After coarsening, the induced partition is the last partition that could not be scheduled
- The reducing-execution-time heuristic is reapplied
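Continuing the earlier NetworkX coarsening sketch, re-coarsening under this restriction could look like the following: the DDG is filtered so that only edges whose endpoints shared a cluster in the last (unschedulable) partition can enter the matching, and the coarsen() helper from the graph-coarsening sketch is reused. As before, the names are illustrative only.

```python
import networkx as nx

def recoarsen_within_sets(ddg, previous_partition, n_clusters):
    """Repeat coarsening, but only fuse nodes that the previous partition
    already placed in the same cluster, so its sets are respected."""
    restricted = nx.Graph()
    restricted.add_nodes_from(ddg.nodes)
    restricted.add_edges_from(
        (u, v, d) for u, v, d in ddg.edges(data=True)
        if previous_partition[u] == previous_partition[v])
    # 'coarsen' is the helper defined in the graph-coarsening sketch above.
    return coarsen(restricted, n_clusters)
```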
Benchmarks and Configurations
- Benchmarks: all of SPECfp95, using the ref input set
- Two schedulers evaluated: GP (previous work) and PSP (pseudo-schedule based)
[Table: resources per configuration: INT, FP and MEM units per cluster, plus a unified-cluster configuration]
[Table: operation latencies (INT / FP / MEM) for the ARITH, MUL/ABS and DIV/SQR/TRG operation classes]
GP vs PSP
[Speedup charts: 32 registers split into 2 clusters, 1 bus (L=1); 32 registers split into 4 clusters, 1 bus (L=1)]
GP vs PSP
[Speedup charts: 64 registers split into 4 clusters, 1 bus (L=2); 32 registers split into 4 clusters, 1 bus (L=2)]
Conclusions
- A new algorithm to perform MS for clustered VLIW architectures
  - Cluster assignment based on multilevel graph partitioning
- The partitioning algorithm is improved
  - Based on pseudo-schedules: reliable information is available to guide the partition
- Outperforms previous work: 38.5% speedup for some configurations
UNIVERSITAT POLITÈCNICA DE CATALUNYA Departament d’Arquitectura de Computadors Any questions?
GP vs PSP
[Speedup charts: 64 registers split into 2 clusters, 1 bus (L=1); 64 registers split into 4 clusters, 1 bus (L=1)]
Different Alternatives
- Two steps (cluster assignment, then scheduling; II++ on failure):
  - Global vision when assigning clusters
  - The schedule follows the assignment exactly
  - Re-scheduling does not take into account the additional resources made available
- One step (cluster assignment + scheduling; II++ on failure):
  - Local vision when assigning and scheduling
  - Assignment is based on current resource usage
  - No global view of the graph
- Proposed combination:
  - Global and local views of the graph
  - If an instruction cannot be scheduled, depending on the reason: re-schedule, or re-compute the cluster assignment
Clustered Architectures
- Current/future challenges in processor design: delay in the transmission of signals, power consumption, architecture complexity
- Solutions: VLIW architectures; clustering: divide the system into semi-independent units
  - Fast intra-cluster interconnects
  - Slow inter-cluster interconnects
- Common trend in commercial VLIW processors: TI's C6x, Analog's TigerSHARC, HP's LX, Equator's MAP1000
Example (I)
1st step: coarsening the graph
[Figure: initial graph with edge weights; find matching; new graph; find matching; final graph]
Coarsening Example (I)
Partition induced in the original graph
[Figure: initial graph, induced partition, coarsened graph]
Reducing Execution Time
Heuristic description:
- Obtain different partitions by moving nodes among clusters
- Discard partitions that overload the resources of any cluster
- Choose the partition that minimizes the estimated execution time
- In case of a tie, select the one that minimizes register pressure
An estimation of execution time is needed: pseudo-schedules
Building Pseudo-schedules
- Dependences: respected if possible; otherwise a penalty on register pressure and/or execution time is assessed
- Cluster assignment: the partition is strictly followed
- Valuable information can be estimated from the pseudo-schedule: II, length of the pseudo-schedule, register pressure, execution time