Michael Chu, Kevin Fan, Scott Mahlke



Presentation transcript:

Region-based Hierarchical Operation Partitioning for Multicluster Processors
Michael Chu, Kevin Fan, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan
November 14, 2018

Clustered Architectures
- Conventional architecture: increasing width from 4 to 8 increases total delay by 29% [Palacharla '98]
- Clustered approach:
  - Decentralized architecture
  - Communication through an interconnection network
  - Used in the Alpha 21264, TI C6x, Analog Devices TigerSHARC and others
[Figure: a conventional architecture (one register file shared by the FUs) beside a clustered architecture (Cluster 1 and Cluster 2, each with its own register file and FUs).]

Basics of Multicluster Compilation
Objectives:
- Balance workload per cluster
- Minimize critical intercluster communication
[Figure: operations (+, >>, &, *, LW) split across Cluster 1 and Cluster 2, each with its own register file and I MEM, joined by an interconnection network; an intercluster move carries a value between the clusters.]
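The two objectives can be measured for any candidate cluster assignment. The following is a minimal sketch, not the metric used in the talk: `evaluate_assignment`, the operation names and the edge list are all hypothetical, and it counts every intercluster edge rather than only the critical ones.

```python
from collections import Counter

def evaluate_assignment(cluster_of, edges):
    """Return (operations per cluster, number of intercluster moves).

    cluster_of: dict mapping operation name -> cluster id (hypothetical input).
    edges: list of (producer, consumer) dependence pairs.
    """
    load = Counter(cluster_of.values())
    # Every dependence whose endpoints sit in different clusters needs a move.
    moves = sum(1 for src, dst in edges if cluster_of[src] != cluster_of[dst])
    return dict(load), moves

# Illustrative data only; these operations are not taken from the slides.
cluster_of = {"lw": 1, "add": 1, "mul": 2, "shr": 2}
edges = [("lw", "add"), ("lw", "mul"), ("mul", "shr")]
load, moves = evaluate_assignment(cluster_of, edges)
```

Here both clusters hold two operations and a single value (`lw` to `mul`) crosses the interconnect.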

Problem #1: Local vs Global Scope
[Figure: the same operations (1 to 12) clustered and scheduled by cycle under local scope clustering and under global scope clustering, with the inserted intercluster moves marked.]

Problem #2: Scheduler-centric
- Cluster assignment during scheduling adds complexity
- Detailed resource model (reservation tables) is slow
- Forces local decisions
[Figure: per-cycle reservation tables for Cluster 1 and Cluster 2.]

Our Approach
Opposite approach to conventional clustering
- Global view
  - Graph partitioning strategy [Aletà '01, '02]
  - Identify tightly coupled operations - treat uniformly
- Non scheduler-centric mindset
  - Prescheduling technique
  - Doesn't complicate scheduler
  - Enable global view of code
  - Estimate-based approach [Lapinskii '01]

Region-based Hierarchical Operation Partitioning (RHOP)
- Code is considered one region at a time
- Weight calculation creates guides for good partitions
- Partitioning clusters operations based on the given weights
[Figure: Program -> Region -> Weight Calculation -> Graph Partitioning.]

Node Weights
- Create a metric to determine resource usage
- Dedicated resources: accounts for FUs
- Shared resources: accounts for buses, ports
[Figure: the operations of the example graph mapped onto two clusters, each with I, F, M and B function units and a register file.]

Edge Weights
- Slack distribution allocates slack to certain edges
- $\text{slack}_{edge} = \text{lstart}_{dest} - \text{latency}_{edge} - \text{estart}_{src}$
- First-come, first-served method used
[Figure: dependence graph of operations 1 to 14, each node labeled (estart, lstart), ranging from (0,0) to (4,4), with weights on the edges.]
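The edge slack formula can be evaluated once estart and lstart are known for every operation. The sketch below computes them with a forward and a backward pass over a dependence DAG; it covers only the raw slack per edge, not the slack distribution step, and the function name and the tiny graph are assumptions made for illustration.

```python
from collections import defaultdict, deque

def edge_slacks(nodes, edges):
    """edges: list of (src, dst, latency). Returns {(src, dst): slack}."""
    succs = defaultdict(list)
    indeg = {n: 0 for n in nodes}
    for src, dst, lat in edges:
        succs[src].append((dst, lat))
        indeg[dst] += 1

    # Topological order (Kahn's algorithm).
    order, ready = [], deque(n for n in nodes if indeg[n] == 0)
    while ready:
        n = ready.popleft()
        order.append(n)
        for dst, _ in succs[n]:
            indeg[dst] -= 1
            if indeg[dst] == 0:
                ready.append(dst)

    # Forward pass: earliest start time of each operation.
    estart = {n: 0 for n in nodes}
    for n in order:
        for dst, lat in succs[n]:
            estart[dst] = max(estart[dst], estart[n] + lat)

    # Backward pass: latest start that keeps the critical path length.
    cpl = max(estart.values())
    lstart = {n: cpl for n in nodes}
    for n in reversed(order):
        for dst, lat in succs[n]:
            lstart[n] = min(lstart[n], lstart[dst] - lat)

    # slack = lstart(dest) - latency(edge) - estart(src)
    return {(s, d): lstart[d] - lat - estart[s] for s, d, lat in edges}

# Hypothetical graph: a -> b -> d is the critical path, c -> d has slack.
slacks = edge_slacks(["a", "b", "c", "d"],
                     [("a", "b", 1), ("b", "d", 1), ("c", "d", 1)])
```

Edges on the critical path (`a -> b`, `b -> d`) get zero slack, while `c -> d` can slip by one cycle.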

RHOP - Partitioning Phase
- Modified Multilevel-KL algorithm [Kernighan '69]
- Multilevel graph partitioning consists of two stages:
  - Coarsening stage
  - Refinement stage
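A coarsening stage repeatedly merges tightly coupled nodes so that refinement can work on a smaller graph. The sketch below shows one round of heavy-edge matching, a common coarsening step in multilevel partitioners; it is an assumption that RHOP's coarsening resembles this, and `coarsen_once` and its inputs are hypothetical names.

```python
def coarsen_once(nodes, edge_weights):
    """One round of heavy-edge matching.

    edge_weights: dict {(u, v): weight}. Returns dict node -> group id,
    pairing the endpoints of the heaviest edges first.
    """
    group = {}
    next_id = 0
    # Visit edges from heaviest to lightest; ties keep insertion order.
    for (u, v), _ in sorted(edge_weights.items(), key=lambda kv: -kv[1]):
        if u != v and u not in group and v not in group:
            group[u] = group[v] = next_id
            next_id += 1
    # Unmatched nodes become singleton groups.
    for n in nodes:
        if n not in group:
            group[n] = next_id
            next_id += 1
    return group

# Hypothetical weights: a high weight marks an edge that should not be cut.
groups = coarsen_once([1, 2, 3, 4, 5], {(1, 2): 10, (2, 3): 1, (3, 4): 8})
```

Operations 1 and 2 are merged first, then 3 and 4; operation 5 has no edges and stays alone. Repeating the round on the merged graph yields the hierarchy that the refinement stage later walks back down.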

Cluster Refinement
3 questions to answer:
- Which cluster should operations move from?
- How good is the current partition?
- How profitable is it to move X from cluster A to B?

Where Should Operations Move From?
[Figure: estimated per-cycle schedules for Cluster 1 and Cluster 2 with their resource weights. Cluster_wgt1 = 5.0, Cluster_wgt2 = 0.67.]

How Good is this Partition?

Cluster 1   Cluster 2   Max
2.5         0.0         2.5
2.0         0.33        2.0
0.5         0.33        0.5
0.0         0.0         0.0
0.0         0.0         0.0

Cluster_wgt1 = 5.0, Cluster_wgt2 = 0.67, SL = 5.0
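The numbers on this slide are consistent with a simple reading: each cluster weight is the sum of that cluster's column, and SL is the sum of the row-wise maxima. The sketch below reproduces that arithmetic under this assumption; the slides do not say what each row stands for, the function name is hypothetical, and the 0.33 entries are taken to be 1/3.

```python
def system_load(columns):
    """columns: one list of weights per cluster, all the same length.

    Returns (per-cluster weight, SL), where SL sums the row-wise maximum.
    """
    per_cluster = [sum(col) for col in columns]
    sl = sum(max(row) for row in zip(*columns))
    return per_cluster, sl

cluster1 = [2.5, 2.0, 0.5, 0.0, 0.0]
cluster2 = [0.0, 1 / 3, 1 / 3, 0.0, 0.0]  # shown as 0.33 on the slide
weights, sl = system_load([cluster1, cluster2])
```

With these inputs the cluster weights come out as 5.0 and about 0.67, and SL is 5.0, matching the slide.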

How Good is This Proposed Move?
- SL(before) = 5.0
- SL(after) = 4.5
- Lgain = 0.5
- Egain = -1.0
- Mgain = 4.0
[Figure: estimated schedules and resource weights for Cluster 1 and Cluster 2 after the proposed move.]

Experimental Evaluation
- Trimaran toolset: a retargetable VLIW compiler
- Evaluated DSP kernels and SPECint2000
- 64 registers per cluster
- Latencies similar to Itanium
- Perfect caches
- For more detailed results, see paper

Name     Configuration
2-1111   2 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
2-2111   2 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
4-1111   4 homogeneous clusters; 1 I, 1 F, 1 M, 1 B per cluster
4-2111   4 homogeneous clusters; 2 I, 1 F, 1 M, 1 B per cluster
4-H      4 heterogeneous clusters; IM, IF, IB and IMF clusters

2 Cluster Results vs 1 Cluster

4 Cluster Results vs 1 Cluster

Conclusions
- A new, region-scoped method for clustering operations
- Prescheduling technique
- Estimates of schedule length used instead of scheduler
- Combines slack distribution with multilevel-KL partitioning
- Performs better as number of resources increases

Average improvement, RHOP vs BUG:

Machine   Improvement
2-1111    -1.8%
2-2111    3.7%
4-1111    14.3%
4-2111    15.3%
4-H       8.0%

Questions? http://cccp.eecs.umich.edu

Backup Slides

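Previous Work
[Table: comparison of clustering algorithms; the per-algorithm marks did not survive extraction.]
- Algorithms: UAS, CARS, Convergent, Leupers, Capitanio, GP(B), B-ITER, BUG, RHOP
- When (relative to scheduling): During, Iterative, Before
- Scope: Local, Region
- Desirability metric: Sched, Pseudo, Est, Count
- Grouping: Hier., Flat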

Bottom-Up Greedy (BUG)
- Typical clustering strategy; runs into trouble because of its limited view
- Places operations without knowing the rest of the graph
- Uses the scheduler to determine where best to place each operation
- First used in Multiflow trace [Ellis '85]

Graph Partitioning Algorithms
Local improvement methods:
- Kernighan-Lin: swaps pairs of operations between partitions
- Fiduccia-Mattheyses: KL-inspired, efficient O(|E|) algorithm
- Simulated annealing
- Genetic algorithms
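The pair swap at the heart of Kernighan-Lin can be illustrated with the classic gain formula g(a, b) = D(a) + D(b) - 2w(a, b), where D(n) is the external minus the internal edge weight of node n. This is a sketch of the textbook step only, not the modified multilevel variant used by RHOP; the function name and the example graph are hypothetical.

```python
def kl_best_swap(part_a, part_b, weight):
    """Find the pair (a, b) whose swap reduces the cut weight the most.

    weight: dict {frozenset({u, v}): w} for an undirected graph.
    Returns (gain, a, b).
    """
    def w(u, v):
        return weight.get(frozenset((u, v)), 0)

    def d(n, own, other):
        # External cost minus internal cost of node n.
        external = sum(w(n, m) for m in other)
        internal = sum(w(n, m) for m in own if m != n)
        return external - internal

    best = None
    for a in part_a:
        for b in part_b:
            gain = d(a, part_a, part_b) + d(b, part_b, part_a) - 2 * w(a, b)
            if best is None or gain > best[0]:
                best = (gain, a, b)
    return best

# Hypothetical graph: the heavy edges 1-3 and 2-4 are cut by the start partition.
weight = {frozenset((1, 2)): 1, frozenset((1, 3)): 3,
          frozenset((3, 4)): 1, frozenset((2, 4)): 3}
gain, a, b = kl_best_swap([1, 2], [3, 4], weight)
```

The starting cut weight is 6 (edges 1-3 and 2-4). The best swap has gain 4, leaving a cut of 2 made up of the two light edges.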

Graph Partitioning Algorithms
Global methods:
- Geometric methods: coordinate based, not suitable for clustering
- Coordinate-free methods:
  - Recursive spectral bisection (RSB)
  - Multilevel-RSB
  - Multilevel-KL

RHOP - Example

Improvement at Increasing CPLs

Resource Manager Overhead