1
Region-based Hierarchical Operation Partitioning for Multicluster Processors
Michael Chu, Kevin Fan, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan
November 14, 2018
2
Clustered Architectures
Conventional architecture: increasing width from 4 to 8 increases total delay by 29% [Palacharla ‘98]
Clustered approach:
  Decentralized architecture
  Communication through an interconnection network
  Used in the Alpha 21264, TI C6x, Analog Devices TigerSHARC, and others
[Figure: conventional architecture (one register file, FUs) vs. clustered architecture (Cluster 1 and Cluster 2, each with its own register file and FUs)]
3
Basics of Multicluster Compilation
Objectives:
  Balance workload per cluster
  Minimize critical intercluster communication
[Figure: Cluster 1 and Cluster 2, each with a register file and functional units, joined by an interconnection network; an intercluster move carries a value between the clusters]
4
Problem #1: Local vs Global Scope
[Figure: local scope clustering vs. global scope clustering of a dataflow graph (operations 1-12), showing the intercluster moves and schedule cycles that result from each]
5
Problem #2: Scheduler-centric
Cluster assignment during scheduling adds complexity
Detailed resource models and reservation tables are slow
Forces local decisions
[Figure: per-cycle reservation tables for Cluster 1 and Cluster 2, filled in as operations 1-12 are placed]
6
Our Approach
Opposite approach to conventional clustering
Global view:
  Graph partitioning strategy [Aletà ‘01, ‘02]
  Identify tightly coupled operations and treat them uniformly
Non-scheduler-centric mindset:
  Prescheduling technique
  Doesn’t complicate the scheduler
  Enables a global view of the code
  Estimate-based approach [Lapinskii ‘01]
7
Region-based Hierarchical Operation Partitioning (RHOP)
Code is considered one region at a time
Weight calculation creates guides for good partitions
Partitioning clusters operations based on the given weights
[Figure: a program is split into regions; each region passes through weight calculation and then graph partitioning]
8
Node Weights
Create a metric to determine resource usage
Dedicated resources: accounts for FUs
Shared resources: accounts for buses and ports
[Figure: example dataflow graph (operations 1-14) next to two clusters, each with I, F, M, and B units and a register file]
9
Edge Weights
Slack distribution allocates slack to certain edges
Edge slack = lstart_dest - latency_edge - estart_src
First-come, first-served method used
[Figure: example dataflow graph (operations 1-14) annotated with (estart, lstart) pairs, from (0,0) at operations 1 and 2 to (4,4) at operation 14]
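The edge slack formula can be illustrated with a minimal sketch. It assumes the region is a DAG given as (src, dst, latency) triples with operations numbered in topological order; the function name `edge_slack` and this input layout are illustrative, not taken from the slides. The first-come, first-served distribution of slack among edges is not modeled here.

```python
from collections import defaultdict

def edge_slack(num_ops, edges):
    """edges: list of (src, dst, latency); ops are numbered 0..num_ops-1
    and assumed to be in topological order (src < dst). No self-loops."""
    preds, succs = defaultdict(list), defaultdict(list)
    for s, d, lat in edges:
        preds[d].append((s, lat))
        succs[s].append((d, lat))
    # Earliest start: longest latency-weighted path from any source.
    estart = [0] * num_ops
    for op in range(num_ops):
        for s, lat in preds[op]:
            estart[op] = max(estart[op], estart[s] + lat)
    # Latest start: work backwards from the critical path length.
    length = max(estart)
    lstart = [length] * num_ops
    for op in reversed(range(num_ops)):
        for d, lat in succs[op]:
            lstart[op] = min(lstart[op], lstart[d] - lat)
    # Edge slack = lstart_dest - latency_edge - estart_src
    return {(s, d): lstart[d] - lat - estart[s] for s, d, lat in edges}

slack = edge_slack(4, [(0, 1, 2), (1, 3, 1), (0, 2, 1), (2, 3, 1)])
```

In this small graph the path 0 -> 1 -> 3 is critical, so its edges get zero slack, while both edges on 0 -> 2 -> 3 report the same single cycle of slack. That shared cycle is the reason a distribution policy is needed to decide which edge receives it.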
10
RHOP - Partitioning Phase
Modified multilevel-KL algorithm [Kernighan ‘69]
Multilevel graph partitioning consists of two stages:
  Coarsening stage
  Refinement stage
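The slide names the coarsening stage without saying how operations are grouped. As a sketch only, the code below assumes the common heavy-edge matching heuristic: at each level, the endpoints of the heaviest remaining edges are merged first, and each node is merged at most once. The function `coarsen_once` and its data layout are hypothetical.

```python
def coarsen_once(nodes, edge_weights):
    """One coarsening level (assumed heavy-edge matching).
    edge_weights maps (u, v) -> weight.
    Returns a dict mapping each node to its group id."""
    group = {}
    next_id = 0
    # Visit the heaviest edges first so tightly coupled operations merge.
    for (u, v), _w in sorted(edge_weights.items(), key=lambda kv: -kv[1]):
        if u not in group and v not in group:
            group[u] = group[v] = next_id
            next_id += 1
    # Nodes left unmatched become singleton groups.
    for n in nodes:
        if n not in group:
            group[n] = next_id
            next_id += 1
    return group

groups = coarsen_once([1, 2, 3, 4], {(1, 2): 10, (2, 3): 1, (3, 4): 8})
```

Repeating this step on the merged graph yields the hierarchy of progressively coarser graphs that the refinement stage then walks back down.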
11
Cluster Refinement
3 questions to answer:
  Which cluster should operations move from?
  How good is the current partition?
  How profitable is it to move X from cluster A to B?
12
Where Should Operations Move From?
[Figure: the example graph's operations scheduled by cycle on Cluster 1 and Cluster 2, with resource weights listed beside each cluster]
Cluster_wgt1 = 5.0
Cluster_wgt2 = 0.67
13
How Good is this Partition?
Cluster 1   Cluster 2   Max
2.5         0.0         2.5
2.0         0.33        2.0
0.5         0.33        0.5
0.0         0.0         0.0
0.0         0.0         0.0
Cluster_wgt1 = 5.0
Cluster_wgt2 = 0.67
SL = 5.0
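One reading consistent with the slide's numbers is that the schedule length estimate SL is the sum, row by row, of the larger of the two cluster weights. The sketch below assumes that reading; the function name `estimate_schedule_length` and the list layout are illustrative, and what each row represents is left open because the slide does not say.

```python
def estimate_schedule_length(cluster_weights):
    """cluster_weights: one list of per-row weights per cluster.
    Takes the maximum across clusters in each row and sums the maxima."""
    return sum(max(row) for row in zip(*cluster_weights))

# Weights as they appear on the slide.
cluster1 = [2.5, 2.0, 0.5, 0.0, 0.0]
cluster2 = [0.0, 0.33, 0.33, 0.0, 0.0]
sl = estimate_schedule_length([cluster1, cluster2])
```

Under this reading Cluster 1 dominates every row, so the estimate equals Cluster 1's total weight of 5.0, which matches the SL value on the slide.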
14
How Good is This Proposed Move?
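[Figure: the proposed move shown on the two-cluster schedule, with updated weights beside each cluster]
Cluster 1 weights after the move: 1.0, 0.0, 0.0, 0.0, 0.0
Cluster 2 weights after the move: 1.33, 2.33, 0.83, 0.0, 0.0
SL(before) = 5.0
SL(after) = 4.5
Lgain = 0.5
Egain = -1.0
Mgain = 4.0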
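The Lgain figure can be checked with a short sketch, assuming SL is the row-wise maximum of the cluster weights summed over rows and that Lgain = SL(before) - SL(after). Both assumptions are inferred from the numbers on the slides rather than stated there, and the helper names are hypothetical. Egain and Mgain are not modeled because the slides do not give their formulas.

```python
def schedule_length(cluster_weights):
    """Sum over rows of the largest weight among the clusters."""
    return sum(max(row) for row in zip(*cluster_weights))

def lgain(before, after):
    """Schedule-length gain of a move: SL(before) - SL(after)."""
    return schedule_length(before) - schedule_length(after)

# Per-cluster weights read from the slides, before and after the move.
before = [[2.5, 2.0, 0.5, 0.0, 0.0], [0.0, 0.33, 0.33, 0.0, 0.0]]
after = [[1.0, 0.0, 0.0, 0.0, 0.0], [1.33, 2.33, 0.83, 0.0, 0.0]]
gain = lgain(before, after)
```

With the weights as printed, the gain comes out at about 0.5, in line with the slide; the printed weights are rounded, so an exact match is not expected.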
15
Experimental Evaluation
Trimaran toolset: a retargetable VLIW compiler
Evaluated DSP kernels and SPECint2000
Name     Configuration
2-1111   2 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
2-2111   2 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
4-1111   4 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
4-2111   4 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
4-H      4 heterogeneous clusters; IM, IF, IB, and IMF clusters
64 registers per cluster
Latencies similar to Itanium
Perfect caches
For more detailed results, see the paper
16
2 Cluster Results vs 1 Cluster
17
4 Cluster Results vs 1 Cluster
18
Conclusions
A new, region-scoped method for clustering operations:
  Prescheduling technique
  Estimates of schedule length used instead of the scheduler
  Combines slack distribution with multilevel-KL partitioning
Performs better as the number of resources increases
Average improvement, RHOP vs BUG:
Machine   Improvement
2-1111    -1.8%
2-2111    3.7%
4-1111    14.3%
4-2111    15.3%
4-H       8.0%
19
Questions?
20
Backup Slides
21
Previous Work
[Table: clustering algorithms UAS, CARS, Convergent, Leupers, Capitanio, GP(B), B-ITER, BUG, and RHOP compared on four attributes, with an X marking each algorithm's value]
When (rel. to sched): During, Iterative, Before
Scope: Local, Region
Desirability Metric: Sched, Pseudo, Est, Count
Grouping: Hier., Flat
22
Bottom-Up Greedy (BUG)
Typical clustering strategy; runs into trouble because of its limited view
Places operations without knowing the rest of the graph
Uses the scheduler to determine where best to place each operation
First used in the Multiflow Trace [Ellis ‘85]
23
Graph Partitioning Algorithms
Local improvement methods:
  Kernighan-Lin: swaps pairs of operations between partitions
  Fiduccia and Mattheyses: KL-inspired, efficient O(|E|) algorithm
  Simulated annealing
  Genetic algorithms
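The pair-swap idea behind Kernighan-Lin can be sketched with its standard gain formula: swapping a and b reduces the cut by D(a) + D(b) - 2 * c(a, b), where D(v) is v's external connection cost minus its internal cost. This is the classic two-way formulation, not the modified version RHOP uses; the function name and the edge-dictionary layout are illustrative.

```python
def kl_swap_gain(a, b, part, weights):
    """Cut-size reduction from swapping a and b, which must lie in
    different partitions. weights maps frozenset({u, v}) -> edge weight;
    self-loops are assumed absent."""
    def d(v):
        # External minus internal connection cost of v.
        total = 0
        for edge, w in weights.items():
            if v in edge:
                (other,) = edge - {v}
                total += w if part[other] != part[v] else -w
        return total
    c_ab = weights.get(frozenset((a, b)), 0)
    return d(a) + d(b) - 2 * c_ab

part = {1: 0, 2: 0, 3: 1, 4: 1}
weights = {
    frozenset((1, 3)): 5, frozenset((2, 4)): 5,  # edges crossing the cut
    frozenset((1, 2)): 1, frozenset((3, 4)): 1,  # edges inside a partition
}
good = kl_swap_gain(1, 4, part, weights)  # pulls both heavy edges inside
bad = kl_swap_gain(1, 3, part, weights)   # heavy edge 1-3 stays cut
```

A positive gain means the swap shrinks the cut. A full KL pass would repeatedly pick the best unlocked pair, lock it, and finally keep the prefix of swaps with the highest cumulative gain.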
24
Graph Partitioning Algorithms
Global methods:
  Geometric methods: coordinate-based, not suitable for clustering
  Coordinate-free methods:
    Recursive spectral bisection (RSB)
    Multilevel-RSB
    Multilevel-KL
25
RHOP - Example
26
Improvement at Increasing CPLs
27
Resource Manager Overhead