Michael Chu, Kevin Fan, Scott Mahlke

Region-based Hierarchical Operation Partitioning for Multicluster Processors
Michael Chu, Kevin Fan, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan
November 14, 2018

Clustered Architectures
Conventional Architecture
  Increasing width from 4 to 8 increases total delay by 29% [Palacharla ‘98]
Clustered Approach:
  Decentralized architecture
  Communication through interconnection network
  Used in Alpha 21264, TI C6x, Analog Devices TigerSHARC and others
[Figure: a conventional architecture (one register file shared by all FUs) beside a clustered architecture (Cluster 1 and Cluster 2, each with its own register file and FUs)]

Basics of Multicluster Compilation
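Objectives:
  Balance workload per cluster
  Minimize critical intercluster communication
[Figure: Cluster 1 and Cluster 2, each with its own register file and instruction memory, joined by an interconnection network; operations (+, >>, &, *, LW) are split across the clusters and an intercluster move carries a value between them]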

Problem #1: Local vs Global Scope
[Figure: the same dataflow graph (operations 1-12) scheduled on two clusters under local scope clustering and under global scope clustering; each schedule has a cycle axis and marks the intercluster moves it needs]

Problem #2: Scheduler-centric
Cluster assignment during scheduling adds complexity
Detailed resource models and reservation tables are slow
Forces local decisions
[Figure: per-cycle reservation tables for Cluster 1 and Cluster 2, filled in as operations 1-12 are placed]

Our Approach
Opposite approach to conventional clustering
Global view
  Graph partitioning strategy [Aletà ‘01, ‘02]
  Identify tightly coupled operations - treat uniformly
Non scheduler-centric mindset
  Prescheduling technique
  Doesn’t complicate scheduler
  Enable global view of code
  Estimate-based approach [Lapinskii ‘01]

Region-based Hierarchical Operation Partitioning (RHOP)
[Figure: a program (int main { int x; printf(…); . }) is split into regions; each region passes through Weight Calculation and then Graph Partitioning]
Code is considered one region at a time
Weight calculation creates guides for good partitions
Partitioning clusters operations based on the given weights

Node Weights
Create a metric to determine resource usage
  Dedicated Resources: accounts for FUs
  Shared Resources: accounts for buses, ports
[Figure: operations 1-14 assigned to two clusters, each with I, F, M, B function units and a register file]

Edge Weights
Slack distribution allocates slack to certain edges
$\text{slack}_{edge} = lstart_{dest} - latency_{edge} - estart_{src}$
First come, first served method used
[Figure: dataflow graph of operations 1-14 with weighted edges; each node is annotated with (estart, lstart): 1 (0,0), 2 (0,0), 3 (1,1), 4 (0,1), 5 (0,1), 6 (0,1), 7 (0,1), 8 (2,2), 9 (1,2), 10 (0,2), 11 (1,2), 12 (3,3), 13 (2,3), 14 (4,4)]
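The edge slack formula can be tried out with a short Python sketch. It computes estart and lstart over a small dependence graph and then applies the formula to every edge. The function name `edge_slacks`, the input format, and the choice to pin every sink's lstart to the critical path length are assumptions made for illustration. The first come, first served distribution step is not modeled here.

```python
from collections import defaultdict

def edge_slacks(edges):
    """edges: list of (src, dest, latency) tuples describing a DAG.
    Returns {(src, dest): slack} using lstart_dest - latency - estart_src."""
    succs, preds = defaultdict(list), defaultdict(list)
    nodes = set()
    for s, d, lat in edges:
        succs[s].append((d, lat))
        preds[d].append((s, lat))
        nodes.update((s, d))

    # Topological order (Kahn's algorithm).
    indeg = {n: len(preds[n]) for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for d, _ in succs[n]:
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)

    # Earliest start: longest path from any source.
    estart = {}
    for n in order:
        estart[n] = max((estart[s] + lat for s, lat in preds[n]), default=0)

    # Latest start: assumption - sinks may start as late as the critical path length.
    length = max(estart.values())
    lstart = {}
    for n in reversed(order):
        lstart[n] = min((lstart[d] - lat for d, lat in succs[n]), default=length)

    return {(s, d): lstart[d] - lat - estart[s] for s, d, lat in edges}

# Tiny example: 1 -> 3 -> 4 is the critical path, 2 -> 4 has one cycle to spare.
slacks = edge_slacks([(1, 3, 1), (3, 4, 1), (2, 4, 1)])
```

In this example the two critical edges get slack 0 and the edge from 2 to 4 gets slack 1, so a partitioner could cut that edge with less risk of lengthening the schedule.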

RHOP - Partitioning Phase
Modified Multilevel-KL algorithm [Kernighan ‘69]
Multilevel graph partitioning consists of two stages
  Coarsening stage
  Refinement stage
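To make the coarsening stage concrete, here is a minimal Python sketch of one coarsening pass. It uses heavy-edge matching, a common choice in multilevel partitioners: groups joined by the heaviest edges are merged first. The transcript does not say how RHOP itself coarsens, so the matching rule, the function name `coarsen_once`, and the data layout are all assumptions.

```python
def coarsen_once(groups, edge_weights):
    """groups: list of frozensets of operations.
    edge_weights: {(u, v): weight} between individual operations.
    Returns a new list of groups after merging pairs along the heaviest edges."""
    owner = {op: i for i, g in enumerate(groups) for op in g}

    # Sum the edge weight between every pair of distinct groups.
    between = {}
    for (u, v), w in edge_weights.items():
        gu, gv = owner[u], owner[v]
        if gu != gv:
            key = (min(gu, gv), max(gu, gv))
            between[key] = between.get(key, 0) + w

    # Greedily match groups, heaviest connection first; each group merges at most once.
    matched = set()
    merged = []
    for (gu, gv), _ in sorted(between.items(), key=lambda kv: (-kv[1], kv[0])):
        if gu not in matched and gv not in matched:
            matched.update((gu, gv))
            merged.append(groups[gu] | groups[gv])

    # Unmatched groups carry over unchanged.
    merged.extend(g for i, g in enumerate(groups) if i not in matched)
    return merged

# Four single-operation groups; the weight-10 and weight-8 edges are merged first.
start = [frozenset({1}), frozenset({2}), frozenset({3}), frozenset({4})]
coarse = coarsen_once(start, {(1, 2): 10, (2, 3): 1, (3, 4): 8})
```

Repeating the pass until the number of groups equals the number of clusters gives a starting partition that the refinement stage can then improve by moving operations between clusters.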

Cluster Refinement
3 questions to answer:
  Which cluster should operations move from?
  How good is the current partition?
  How profitable is it to move X from cluster A to B?

Where Should Operations Move From?
[Figure: Cluster 1 schedules operations 1-6, 8, 9, 12 and 14; Cluster 2 schedules operations 7, 10, 11 and 13; each schedule has a cycle axis]
Cluster_wgt1 = 5.0
Cluster_wgt2 = 0.67

How Good is this Partition?
Cluster 1   Cluster 2   Max
2.5         0.0         2.5
2.0         0.33        2.0
0.5         0.33        0.5
0.0         0.0         0.0
0.0         0.0         0.0

Cluster_wgt1 = 5.0
Cluster_wgt2 = 0.67
SL = 5.0
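The numbers on this slide are consistent with a simple reading: each cluster weight is the sum of that cluster's column, and the schedule length estimate SL is the sum of the row-wise maxima. The Python sketch below reproduces the slide's values under that assumption. The function name `partition_estimate` is hypothetical, and the transcript does not say what each row represents.

```python
def partition_estimate(rows):
    """rows: list of (cluster1_weight, cluster2_weight) pairs, one per table row.
    Returns (per-cluster weight sums, schedule length estimate)."""
    cluster_wgts = [sum(col) for col in zip(*rows)]
    # Assumption: the estimate is the sum over rows of the largest cluster weight.
    sl = sum(max(row) for row in rows)
    return cluster_wgts, sl

# Values copied from the slide.
rows = [(2.5, 0.0), (2.0, 0.33), (0.5, 0.33), (0.0, 0.0), (0.0, 0.0)]
cluster_wgts, sl = partition_estimate(rows)
```

With these inputs the first cluster weight and SL both come to 5.0. The second cluster weight sums to 0.66 because the slide's 0.33 entries are rounded; the slide reports 0.67.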

How Good is This Proposed Move?
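[Figure: after the proposed move, Cluster 1 schedules operations 1, 2, 3, 8, 12 and 14 (weights 1.0, 0.0, 0.0, 0.0, 0.0) and Cluster 2 schedules operations 4-7, 9, 10, 11 and 13 (weights 1.33, 2.33, 0.83, 0.0, 0.0); each schedule has a cycle axis]
SL(before) = 5.0
SL(after) = 4.5
Lgain = 0.5
Egain = -1.0
Mgain = 4.0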
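The gains on this slide can be checked with a few lines of Python. Lgain follows directly from the two schedule length estimates. How Lgain and Egain combine into Mgain is not stated in the transcript. The sketch assumes a weighted sum, and the default `critical_weight` of 10 is chosen only because it reproduces the slide's Mgain of 4.0. Both the formula and the parameter name are assumptions.

```python
def move_gain(sl_before, sl_after, egain, critical_weight=10.0):
    """Returns (lgain, mgain) for a proposed move.
    lgain: reduction in the schedule length estimate.
    mgain: assumed combination critical_weight * lgain + egain."""
    lgain = sl_before - sl_after
    mgain = critical_weight * lgain + egain
    return lgain, mgain

# Values copied from the slide: SL(before) = 5.0, SL(after) = 4.5, Egain = -1.0.
lgain, mgain = move_gain(5.0, 4.5, -1.0)
```

Under this reading, a move that shortens the estimated schedule can still be worthwhile even when it makes the edge cut slightly worse, as it does here.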

Experimental Evaluation
Trimaran toolset: a retargetable VLIW compiler
Evaluated DSP kernels and SPECint2000

Name     Configuration
2-1111   2 homogeneous clusters, 1 I, 1 F, 1 M, 1 B per cluster
2-2111   2 homogeneous clusters, 2 I, 1 F, 1 M, 1 B per cluster
4-1111   4 homogeneous clusters, 1 I, 1 F, 1 M, 1 B per cluster
4-2111   4 homogeneous clusters, 2 I, 1 F, 1 M, 1 B per cluster
4-H      4 heterogeneous clusters: IM, IF, IB and IMF clusters

64 registers per cluster
Latencies similar to Itanium
Perfect caches
For more detailed results, see paper

2 Cluster Results vs 1 Cluster

4 Cluster Results vs 1 Cluster

Conclusions
A new, region-scoped method for clustering operations
  Prescheduling technique
  Estimates on schedule length used instead of scheduler
  Combines slack distribution with multilevel-KL partitioning
Performs better as number of resources increases

Average Improvement
Machine   RHOP vs BUG
2-1111    -1.8%
2-2111    3.7%
4-1111    14.3%
4-2111    15.3%
4-H       8.0%

Questions?

Backup Slides

Previous Work
[Table: algorithms UAS, CARS, Convergent, Leupers, Capitanio, GP(B), B-ITER, BUG and RHOP compared along four dimensions:
  When (relative to scheduling): During, Iterative, Before
  Scope: Local, Region
  Desirability Metric: Sched, Pseudo, Est, Count
  Grouping: Hier., Flat]

Bottom-Up Greedy (BUG)
Typical clustering strategy that runs into trouble because of its limited view
Places operations without knowing the rest of the graph
Uses the scheduler to determine where best to place each operation
First used in Multiflow Trace [Ellis ‘85]

Graph Partitioning Algorithms
Local improvement methods:
  Kernighan-Lin
    Swaps pairs of operations between partitions
  Fiduccia-Mattheyses
    KL-inspired, efficient O(|E|) algorithm
  Simulated annealing
  Genetic algorithms
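As a reminder of what a Kernighan-Lin swap evaluates, the Python sketch below computes the standard KL gain for exchanging one operation from each partition: gain = D(a) + D(b) - 2 w(a, b), where D is a node's external edge weight minus its internal edge weight. The function name and the dictionary-based graph layout are illustrative assumptions, and each undirected edge is assumed to be stored once.

```python
def kl_swap_gain(weights, part_a, part_b, a, b):
    """Reduction in cut weight from swapping a (in part_a) with b (in part_b).
    weights: {(u, v): weight}, each undirected edge stored once."""
    def w(u, v):
        return weights.get((u, v), 0) + weights.get((v, u), 0)

    def d(node, own, other):
        external = sum(w(node, x) for x in other)
        internal = sum(w(node, x) for x in own if x != node)
        return external - internal

    return d(a, part_a, part_b) + d(b, part_b, part_a) - 2 * w(a, b)

# Cut is 10 before the swap (edges 1-3 and 2-4) and 2 after it (edges 1-2 and 3-4).
weights = {(1, 2): 1, (1, 3): 5, (2, 4): 5, (3, 4): 1}
gain = kl_swap_gain(weights, {1, 2}, {3, 4}, 2, 3)
```

A positive gain means the swap reduces the total weight of edges crossing the partition. KL repeatedly picks the best such swap and keeps the best prefix of the swap sequence.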

Graph Partitioning Algorithms
Global methods:
  Geometric methods
    Coordinate based, not suitable for clustering
  Coordinate-free methods
    Recursive spectral bisection (RSB)
    Multilevel-RSB
    Multilevel-KL

RHOP - Example

Improvement at Increasing CPLs

Resource Manager Overhead
