Region-based Hierarchical Operation Partitioning for Multicluster Processors
Michael Chu, Kevin Fan, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan
November 14, 2018
Clustered Architectures
- Conventional architecture: increasing width from 4 to 8 increases total delay by 29% [Palacharla ’98]
- Clustered approach:
  - Decentralized architecture
  - Communication through an interconnection network
  - Used in the Alpha 21264, TI C6x, Analog Devices TigerSHARC, and others
[Figure: a conventional architecture with a single register file and its FUs, next to a clustered architecture with Cluster 1 and Cluster 2, each with its own register file and FUs]
Basics of Multicluster Compilation
- Objectives:
  - Balance the workload per cluster
  - Minimize critical intercluster communication
[Figure: Cluster 1 and Cluster 2, each with a register file and I and MEM units, joined by an interconnection network; a small dataflow graph (+, >>, &, *, LW, +) is split across the clusters with one intercluster move]
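
The two objectives above can be measured for any candidate assignment. The sketch below is illustrative only: the graph, the operation names, and the `evaluate_assignment` helper are hypothetical, and it counts every edge that crosses clusters rather than only the critical ones.

```python
from collections import Counter

def evaluate_assignment(edges, cluster_of):
    """Return (operations per cluster, number of edges that cross clusters)."""
    load = Counter(cluster_of.values())
    crossing = sum(1 for src, dst in edges if cluster_of[src] != cluster_of[dst])
    return load, crossing

# Hypothetical six-operation dataflow graph (not the one drawn on the slide).
edges = [("lw", "add1"), ("add1", "shr"), ("shr", "and"),
         ("lw", "mul"), ("mul", "add2")]
cluster_of = {"lw": 1, "add1": 1, "shr": 1, "and": 1, "mul": 2, "add2": 2}

load, crossing = evaluate_assignment(edges, cluster_of)
print(dict(load), crossing)
```

A lower crossing count means fewer intercluster moves, while a more even `load` means better balance; the two goals usually pull against each other.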
Problem #1: Local vs Global Scope
[Figure: the same 12-operation code assigned to two clusters under local scope clustering and under global scope clustering, with the resulting cycle-by-cycle schedules and intercluster moves marked]
Problem #2: Scheduler-centric
- Cluster assignment during scheduling adds complexity
- Detailed resource models and reservation tables are slow
- Forces local decisions
[Figure: a 12-operation graph next to per-cycle reservation tables for Cluster 1 and Cluster 2]
Our Approach
- Opposite approach to conventional clustering
- Global view
  - Graph partitioning strategy [Aletà ’01, ’02]
  - Identify tightly coupled operations and treat them uniformly
- Non-scheduler-centric mindset
  - Prescheduling technique
  - Doesn’t complicate the scheduler
  - Enables a global view of the code
  - Estimate-based approach [Lapinskii ’01]
Region-based Hierarchical Operation Partitioning (RHOP)
- Code is considered one region at a time
- Weight calculation creates guides for good partitions
- Partitioning assigns operations to clusters based on the given weights
[Figure: program (int main { int x; printf(…); . }) → region → weight calculation → graph partitioning]
Node Weights
- Create a metric to determine resource usage
- Dedicated resources: accounts for FUs
- Shared resources: accounts for buses and ports
[Figure: a 14-operation graph and two clusters, each with I, F, M, and B units and a register file]
Edge Weights
- Slack distribution allocates slack to certain edges
- Edge slack: $slack_{edge} = lstart_{dest} - latency_{edge} - estart_{src}$
- A first-come, first-served method is used
[Figure: a 14-operation graph with each node annotated (estart, lstart), from (0,0) at operations 1 and 2 to (4,4) at operation 14, and with weights on some edges]
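
The edge slack formula can be evaluated once the earliest and latest start times are known. The following is a minimal sketch under stated assumptions: the five-node graph and the `edge_slacks` helper are hypothetical, sinks are given `lstart` equal to the critical path length, and the first-come, first-served distribution step itself is not modeled.

```python
from collections import defaultdict

def edge_slacks(nodes, edges):
    """nodes: operations in topological order; edges: {(src, dst): latency}."""
    preds, succs = defaultdict(list), defaultdict(list)
    for (src, dst), lat in edges.items():
        preds[dst].append((src, lat))
        succs[src].append((dst, lat))

    # Forward pass: earliest start of each operation.
    estart = {}
    for n in nodes:
        estart[n] = max((estart[p] + lat for p, lat in preds[n]), default=0)

    # Backward pass: latest start that does not lengthen the critical path.
    length = max(estart.values())
    lstart = {}
    for n in reversed(nodes):
        lstart[n] = min((lstart[s] - lat for s, lat in succs[n]), default=length)

    # Edge slack = lstart(dest) - latency(edge) - estart(src).
    slack = {(s, d): lstart[d] - lat - estart[s] for (s, d), lat in edges.items()}
    return slack, estart, lstart

# Hypothetical graph: a long chain a -> b -> c -> d and a short path a -> e -> d.
nodes = ["a", "b", "c", "e", "d"]
edges = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1, ("a", "e"): 1, ("e", "d"): 1}
slack, estart, lstart = edge_slacks(nodes, edges)
for edge, value in slack.items():
    print(edge, value)
```

Edges on the critical chain come out with zero slack, while the two edges on the shorter path each show one cycle of slack that a distribution step could hand to one of them.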
RHOP - Partitioning Phase
- Modified multilevel-KL algorithm [Kernighan ’69]
- Multilevel graph partitioning consists of two stages:
  - Coarsening stage
  - Refinement stage
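
To make the coarsening stage concrete, here is a sketch of one coarsening pass using greedy heavy-edge matching, a common choice in multilevel partitioners. Whether RHOP uses exactly this matching rule is an assumption; the `coarsen_once` helper and the example weights are hypothetical.

```python
def coarsen_once(nodes, edge_weight):
    """One coarsening pass: greedily pair nodes joined by the heaviest edges.

    edge_weight: {(u, v): weight}. Returns a list of groups (tuples of nodes).
    """
    matched = set()
    groups = []
    # Visit edges from heaviest to lightest; ties keep their input order.
    for (u, v), _w in sorted(edge_weight.items(), key=lambda kv: -kv[1]):
        if u not in matched and v not in matched:
            matched.update((u, v))
            groups.append((u, v))
    # Nodes left unmatched become singleton groups.
    groups.extend((n,) for n in nodes if n not in matched)
    return groups

# Hypothetical weights: a high weight marks an edge that should not be cut.
nodes = [1, 2, 3, 4, 5]
edge_weight = {(1, 2): 10, (2, 3): 8, (3, 4): 1, (4, 5): 1}
groups = coarsen_once(nodes, edge_weight)
print(groups)
```

Repeating such passes on the merged graph yields progressively coarser graphs; refinement then moves groups between clusters while uncoarsening.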
Cluster Refinement
Three questions to answer:
- Which cluster should operations move from?
- How good is the current partition?
- How profitable is it to move X from cluster A to B?
Where Should Operations Move From?
- Cluster_wgt1 = 5.0
- Cluster_wgt2 = 0.67
[Figure: schedules for Cluster 1 (ops 1, 2, 3, 4, 5, 6, 8, 9, 12, 14) and Cluster 2 (ops 7, 10, 11, 13), with weights 2.5, 2.0, 0.5, 0.0 shown for Cluster 1 and 0.33 for Cluster 2]
How Good is This Partition?
Cluster 1   Cluster 2   Max
2.5         0.0         2.5
2.0         0.33        2.0
0.5         0.33        0.5
0.0         0.0         0.0
0.0         0.0         0.0
Cluster_wgt1 = 5.0, Cluster_wgt2 = 0.67, SL = 5.0
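
The numbers on this slide can be reproduced with a few lines. This is a sketch under an assumption: it treats each cluster weight as the sum of that cluster's column and the schedule length estimate SL as the sum of the row-wise maxima, which matches the slide's values but is not confirmed as the exact RHOP formula. The function names are hypothetical.

```python
def cluster_weights(table):
    """Sum each cluster's column of weights."""
    return [sum(column) for column in zip(*table)]

def schedule_length_estimate(table):
    """Assumed estimate: sum over rows of the largest per-cluster weight."""
    return sum(max(row) for row in table)

# Rows are (Cluster 1 weight, Cluster 2 weight), copied from the slide.
table = [(2.5, 0.0), (2.0, 0.33), (0.5, 0.33), (0.0, 0.0), (0.0, 0.0)]

wgt1, wgt2 = cluster_weights(table)
sl = schedule_length_estimate(table)
print(wgt1, round(wgt2, 2), sl)
```

The rounded slide entries sum to 0.66 for Cluster 2 rather than the 0.67 shown, which is consistent with the slide rounding 1/3 to 0.33 in each row.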
How Good is This Proposed Move?
- SL(before) = 5.0
- SL(after) = 4.5
- Lgain = 0.5
- Egain = -1.0
- Mgain = 4.0
[Figure: schedules after the proposed move, with Cluster 1 holding ops 1, 2, 3, 8, 12, 14 (weights 1.0, 0.0, 0.0, 0.0, 0.0) and Cluster 2 holding ops 4, 5, 6, 7, 9, 10, 11, 13 (weights 1.33, 2.33, 0.83, 0.0, 0.0)]
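
The Lgain value is consistent with SL(before) minus SL(after). The sketch below assumes that reading, assumes SL is the sum of row-wise maxima, and assumes the weights after the move pair up row by row as listed; Egain and Mgain are not modeled because the slide does not give their formulas. The `sl_estimate` name is hypothetical.

```python
def sl_estimate(table):
    """Assumed schedule length estimate: sum of row-wise maxima."""
    return sum(max(row) for row in table)

# Rows are (Cluster 1 weight, Cluster 2 weight).
before = [(2.5, 0.0), (2.0, 0.33), (0.5, 0.33), (0.0, 0.0), (0.0, 0.0)]
after = [(1.0, 1.33), (0.0, 2.33), (0.0, 0.83), (0.0, 0.0), (0.0, 0.0)]

lgain = sl_estimate(before) - sl_estimate(after)
print(round(lgain, 2))
```

With the rounded weights printed on the slide, the result lands within rounding error of the stated Lgain of 0.5.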
Experimental Evaluation
- Trimaran toolset: a retargetable VLIW compiler
- Evaluated DSP kernels and SPECint2000
Name     Configuration
2-1111   2 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
2-2111   2 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
4-1111   4 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
4-2111   4 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
4-H      4 heterogeneous clusters; IM, IF, IB, and IMF clusters
- 64 registers per cluster
- Latencies similar to Itanium
- Perfect caches
- For more detailed results, see the paper
2 Cluster Results vs 1 Cluster
4 Cluster Results vs 1 Cluster
Conclusions
- A new, region-scoped method for clustering operations
  - Prescheduling technique
  - Estimates of schedule length are used instead of a scheduler
  - Combines slack distribution with multilevel-KL partitioning
- Performs better as the number of resources increases
Average improvement, RHOP vs BUG:
Machine   Improvement
2-1111    -1.8%
2-2111    3.7%
4-1111    14.3%
4-2111    15.3%
4-H       8.0%
Questions? http://cccp.eecs.umich.edu
Backup Slides
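Previous Work
Algorithms compared: UAS, CARS, Convergent, Leupers, Capitanio, GP(B), B-ITER, BUG, RHOP
- When (relative to scheduling): During, Iterative, or Before
- Scope: Local or Region
- Desirability metric: Sched, Pseudo, Est, or Count
- Grouping: Hierarchical or Flat
[Table: one X per algorithm in each category; the individual marks are not reproduced here]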
Bottom-Up Greedy (BUG)
- A typical clustering strategy that runs into trouble because of its limited view
- Places operations without knowing the rest of the graph
- Uses the scheduler to determine where best to place each operation
- First used in the Multiflow Trace [Ellis ’85]
Graph Partitioning Algorithms
Local improvement methods:
- Kernighan-Lin: swaps pairs of operations between partitions
- Fiduccia and Mattheyses: KL-inspired, efficient O(|E|) algorithm
- Simulated annealing
- Genetic algorithms
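
For the Kernighan-Lin entry, the classic swap gain is g(a, b) = D_a + D_b - 2c(a, b), where D is a node's external minus internal edge cost and c(a, b) is the weight of the edge between the two swapped nodes. The sketch below illustrates that textbook formula on a hypothetical four-node graph; the helper names are assumptions and it is not the modified variant used in RHOP.

```python
def d_value(node, side, weights):
    """External minus internal connection cost of `node`."""
    external = internal = 0
    for (u, v), w in weights.items():
        if node in (u, v):
            other = v if node == u else u
            if side[other] == side[node]:
                internal += w
            else:
                external += w
    return external - internal

def swap_gain(a, b, side, weights):
    """Reduction in cut size if a and b (on opposite sides) are swapped."""
    c_ab = weights.get((a, b), 0) + weights.get((b, a), 0)
    return d_value(a, side, weights) + d_value(b, side, weights) - 2 * c_ab

def cut_size(side, weights):
    """Total weight of edges whose endpoints lie in different partitions."""
    return sum(w for (u, v), w in weights.items() if side[u] != side[v])

# Hypothetical graph: nodes 1, 2 start in partition A; nodes 3, 4 in B.
weights = {(1, 3): 3, (1, 2): 1, (2, 4): 1, (3, 4): 1}
side = {1: "A", 2: "A", 3: "B", 4: "B"}

gain = swap_gain(1, 4, side, weights)
cut_before = cut_size(side, weights)
swapped = dict(side)
swapped[1], swapped[4] = "B", "A"
cut_after = cut_size(swapped, weights)
print(gain, cut_before, cut_after)
```

The predicted gain equals the observed drop in cut size, which is the property KL relies on when it ranks candidate swaps.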
Graph Partitioning Algorithms
Global methods:
- Geometric methods: coordinate based, not suitable for clustering
- Coordinate-free methods:
  - Recursive spectral bisection (RSB)
  - Multilevel-RSB
  - Multilevel-KL
RHOP - Example
Improvement at Increasing CPLs
Resource Manager Overhead