Region-based Hierarchical Operation Partitioning for Multicluster Processors Michael Chu, Kevin Fan, Scott Mahlke University of Michigan Presented by Cristian.

Presentation transcript:

Region-based Hierarchical Operation Partitioning for Multicluster Processors Michael Chu, Kevin Fan, Scott Mahlke University of Michigan Presented by Cristian Petrescu-Prahova

Clustered Register Files
Why? Register file cost and access time grow with the square of the number of register ports. Bypass logic grows quadratically with the number of operations issued per cycle. The distance separating FUs from the register file increases with a large number of FUs. Hence clustered register files: a decentralized architecture with several small register files, each supplying operands to a subset of the FUs. Examples: Multiflow Trace, Alpha 21264, TI C6x, Analog Devices TigerSHARC (two clusters); reconfigurable meshes?

Goal
Partition operations across the resources available on each cluster to maximize ILP while minimizing inter-cluster communication. Rule of thumb: a processor with 2 identical clusters loses ~20% performance, and one with 4 identical clusters loses ~30%. Non-identical clusters lead to even more performance loss.

Well-Known Technique: Bottom-Up Greedy (BUG)
Recurse along the DFG, critical path first. Assign each operation to a cluster based on estimates (from the scheduler) of when the operation and its predecessors can complete earliest. Problem 1: it makes local decisions (see figure). Problem 2: it is slow, because it needs to query accurate cluster status information for each operation considered.
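The local nature of such a greedy assignment can be illustrated with a minimal sketch. It is a simplification, not the BUG algorithm itself: it visits operations in a given topological order rather than recursing from the exit nodes, and it assumes each cluster issues one operation per cycle and that an inter-cluster move costs `move_latency` cycles. The function and parameter names are hypothetical.

```python
def greedy_assign(order, preds, latency, num_clusters, move_latency=1):
    """Assign ops (given in topological order) to clusters one at a time.

    Assumption: each cluster issues at most one op per cycle.
    Returns (cluster_of, finish) dictionaries keyed by op.
    """
    cluster_of, finish = {}, {}
    next_free = [0] * num_clusters  # first free issue cycle per cluster
    for op in order:
        best = None
        for c in range(num_clusters):
            # Operands produced on another cluster pay the move latency.
            ready = max(
                (finish[p] + (move_latency if cluster_of[p] != c else 0)
                 for p in preds.get(op, [])),
                default=0,
            )
            done = max(ready, next_free[c]) + latency[op]
            # Local decision: keep the cluster with the earliest completion.
            if best is None or done < best[0]:
                best = (done, c)
        done, c = best
        cluster_of[op], finish[op] = c, done
        next_free[c] = done - latency[op] + 1
    return cluster_of, finish
```

Each operation is placed using only the state left by earlier placements, so a choice that looks best for one operation can force moves later on. Every candidate cluster is also queried for every operation, which is the source of the compile-time cost mentioned on the slide.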

Region-Based Hierarchical Operation Partitioning (RHOP)
Works on acyclic DFGs extracted from the complete program based on region decomposition. (Presenter's note: I assume a region is roughly a loop?) Two phases: weight calculation (node and edge) and partitioning (coarsening and refining).

Node Weight Calculation
Reflects the quantity of resources used per operation. Ignores dependencies. Two components: individual weight (FUs) and shared weight (ports, buses).
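A minimal sketch of an individual node weight follows. The slide does not give the formula, so this assumes that an operation's weight on a cluster is the reciprocal of the number of FUs in that cluster able to execute it. The cluster description `CLUSTER_FUS` and the mapping `OP_KIND` are hypothetical, and shared weight (ports, buses) is not modeled.

```python
from fractions import Fraction

# Hypothetical cluster: number of functional units of each kind
# (integer, float, memory, branch).
CLUSTER_FUS = {"I": 2, "F": 1, "M": 1, "B": 1}

# Hypothetical mapping from opcode to the FU kind that executes it.
OP_KIND = {"add": "I", "mul": "I", "fadd": "F", "load": "M", "br": "B"}


def individual_weight(opcode, cluster_fus=CLUSTER_FUS):
    """Assumed weight: share of the matching FUs one op occupies per cycle."""
    units = cluster_fus.get(OP_KIND[opcode], 0)
    if units == 0:
        return None  # the cluster cannot execute this op at all
    return Fraction(1, units)


def total_weight(opcodes, cluster_fus=CLUSTER_FUS):
    """Sum individual weights per FU kind; dependencies are ignored."""
    totals = {}
    for op in opcodes:
        w = individual_weight(op, cluster_fus)
        if w is None:
            raise ValueError(f"cluster cannot execute {op}")
        kind = OP_KIND[op]
        totals[kind] = totals.get(kind, 0) + w
    return totals
```

Under this assumption, a total above 1 for some FU kind means the operations cannot all issue in a single cycle on that cluster.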

Edge Weight Calculation
A measure of criticality, based on the notion of slack. Slack is distributed first come, first served.
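The notion of slack can be sketched as follows. This computes only the raw per-edge slack from earliest and latest start times on an acyclic DFG; the first-come, first-served distribution of slack and the mapping from slack to an edge weight are not given on the slide and are not reproduced here. The function name and data layout are assumptions.

```python
def edge_slack(latency, edges):
    """Return {(src, dst): slack} for an acyclic dataflow graph.

    latency: {op: cycles}; edges: list of (src, dst) dependences.
    Slack is how many cycles an edge can stretch without
    lengthening the critical path.
    """
    succs = {op: [] for op in latency}
    preds = {op: [] for op in latency}
    for src, dst in edges:
        succs[src].append(dst)
        preds[dst].append(src)

    # Topological order (Kahn's algorithm).
    indeg = {op: len(preds[op]) for op in latency}
    ready = [op for op in latency if indeg[op] == 0]
    order = []
    while ready:
        op = ready.pop()
        order.append(op)
        for nxt in succs[op]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)

    # Earliest start: as soon as every predecessor has finished.
    estart = {}
    for op in order:
        estart[op] = max((estart[p] + latency[p] for p in preds[op]), default=0)

    # Latest start that still meets the critical path length.
    length = max(estart[op] + latency[op] for op in order)
    lstart = {}
    for op in reversed(order):
        lstart[op] = min((lstart[s] for s in succs[op]), default=length) - latency[op]

    return {(s, d): lstart[d] - estart[s] - latency[s] for s, d in edges}
```

Edges with zero slack lie on a critical path, so placing their endpoints on different clusters would lengthen the schedule; edges with more slack can absorb an inter-cluster move.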

Partitioning: Coarsening
A multilevel graph partitioning algorithm (as in Chaco, Metis). Works by coarsening highly related nodes into partitions, taking into account only edge weights. Takes a snapshot at each step for use in the refining step.
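A minimal sketch of coarsening with per-level snapshots, assuming a greedy heaviest-edge matching of the kind used in multilevel partitioners. The slide does not specify the matching rule, so that choice and the names `coarsen_once` and `coarsen` are assumptions.

```python
def coarsen_once(groups, edge_weights, max_merges):
    """One coarsening step: merge pairs of groups joined by the heaviest edges.

    groups: list of frozensets of ops; edge_weights: {(op_a, op_b): weight}.
    At most max_merges pairs are merged. Returns the coarser list of groups.
    """
    owner = {op: i for i, g in enumerate(groups) for op in g}

    # Accumulate the edge weight between each pair of distinct groups.
    between = {}
    for (a, b), w in edge_weights.items():
        ga, gb = owner[a], owner[b]
        if ga != gb:
            key = (min(ga, gb), max(ga, gb))
            between[key] = between.get(key, 0) + w

    # Greedy matching: heaviest pairs first, each group merged at most once.
    matched, merged = set(), []
    for (ga, gb), _ in sorted(between.items(), key=lambda kv: (-kv[1], kv[0])):
        if len(merged) == max_merges:
            break
        if ga not in matched and gb not in matched:
            matched.update((ga, gb))
            merged.append(groups[ga] | groups[gb])
    merged.extend(g for i, g in enumerate(groups) if i not in matched)
    return merged


def coarsen(ops, edge_weights, num_clusters):
    """Coarsen until num_clusters groups remain, keeping a snapshot per level."""
    levels = [[frozenset([op]) for op in ops]]
    while len(levels[-1]) > num_clusters:
        current = levels[-1]
        nxt = coarsen_once(current, edge_weights, len(current) - num_clusters)
        if len(nxt) == len(current):
            break  # no edges left to contract
        levels.append(nxt)
    return levels
```

The returned `levels` list is the sequence of snapshots: the last entry is the initial partition, and earlier entries are the finer groupings that the refining step walks back through.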

Partitioning: Refinement
Traverse back through the coarsening stages, making improvements to the initial partition. At each stage, the coarsened nodes available at that point are considered for movement to another cluster. Highly related operations stay grouped together at each stage because the coarsening process is followed backwards.
Metrics:
Cluster weight: an estimate of the load per cluster; the cluster with the highest weight is denoted 'the imbalanced cluster'.
System load: an estimate of the load across all clusters.
Gain: the gain of moving operations into other clusters.

Cluster Weight
Individual resource constraint per cluster, per cycle (op groups); total node weight per cluster, per cycle (shared constraints); cycle weight per cluster; cluster weight.

System Load
Inter-cluster move overhead; total load, based on a cycle-by-cycle estimation.

Gain
Load gain, edge gain, move gain.
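The slide names the three gain terms without their formulas. As a rough illustration of the edge term only, the sketch below assumes that edge gain is the decrease in total weight of edges crossing clusters when a group of operations moves. Load gain and move gain are not modeled, and the function names are hypothetical.

```python
def cut_weight(assignment, edge_weights):
    """Total weight of edges whose endpoints sit in different clusters."""
    return sum(
        w for (a, b), w in edge_weights.items()
        if assignment[a] != assignment[b]
    )


def edge_gain(group, target, assignment, edge_weights):
    """Assumed edge gain: decrease in cut weight if `group` moves to `target`."""
    moved = dict(assignment)  # leave the caller's assignment untouched
    for op in group:
        moved[op] = target
    return cut_weight(assignment, edge_weights) - cut_weight(moved, edge_weights)
```

A positive value means the move pulls heavy (critical) edges back inside one cluster; a negative value means it would cut them.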

Example

Evaluation
Implemented using the Trimaran tool set. Compared with the BUG algorithm. Benchmarks: 5 DSP benchmarks (high ILP) and SPECint2000 (low ILP). 5 machine configurations; functional units: integer (I), float (F), memory (M), branch (B).

Improvement in dynamic total cycles of RHOP over BUG

Comparison of BUG and RHOP clustering performance versus a 1-cluster machine

Histogram of RHOP versus BUG: achieved schedule length versus critical path length. Numbers on top are dynamic execution percentages.

Compilation performance: number of calls to the resource table