University of Michigan Electrical Engineering and Computer Science FLASH: Foresighted Latency-Aware Scheduling Heuristic for Processors with Customized.

Presentation transcript:

University of Michigan Electrical Engineering and Computer Science FLASH: Foresighted Latency-Aware Scheduling Heuristic for Processors with Customized Datapaths Manjunath Kudlur, Kevin Fan, Michael Chu, Rajiv Ravindran, Nathan Clark, Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan

Introduction
Bypass network: an important component of the datapath
Allows data forwarding to reduce pipeline stalls
Full bypass: any FU can bypass from any other FU and from any pipeline stage
Cost of full bypass increases quadratically with the number of FUs
# paths = (# FU)^2 × bypassable stages × input ports per FU × output ports per FU
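As a rough illustration of the path-count formula on this slide, the sketch below evaluates it for a couple of machine sizes. The function name and the example parameter values are assumptions chosen for illustration only; they do not come from the talk.

```python
def full_bypass_paths(num_fus, bypass_stages, in_ports, out_ports):
    """Number of paths in a full bypass network, per the slide's formula.

    All argument names are illustrative; the formula is
    (# FU)^2 x bypassable stages x input ports per FU x output ports per FU.
    """
    return num_fus ** 2 * bypass_stages * in_ports * out_ports


# Hypothetical machines: 2 bypassable stages, 2 input ports, 1 output port per FU.
small = full_bypass_paths(num_fus=4, bypass_stages=2, in_ports=2, out_ports=1)
large = full_bypass_paths(num_fus=8, bypass_stages=2, in_ports=2, out_ports=1)
print(small, large)  # doubling the FU count quadruples the path count
```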

Case for Bypass Customization
Only a few bypasses are heavily utilized
The heavily utilized bypasses vary widely from application to application
Customize the bypass network in an application-specific processor by removing under-utilized paths

Implications of Bypass Customization
[Figure: datapath with Execute stage, pipeline latch, Memory stage, pipeline latch, and register file; DFG with operations A and B. Successive builds label the value reaching B after 1 cycle, 2 cycles, and 3 cycles.]
Latency depends on
–Which FU the operation is scheduled on
–Which FU the operation’s consumer is scheduled on
Latency of an operation is no longer constant
–Varies per consumer
Bypass customization introduces non-uniform operation latencies
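One way to picture non-uniform latencies is as a lookup keyed by the producer FU and the consumer FU. The sketch below is only an illustration: the FU names, the table contents, and the fallback of 3 cycles through the register file are assumptions, not the machine description used in the talk.

```python
REGFILE_LATENCY = 3  # assumed cost of passing a value through the register file


def effective_latency(producer_fu, consumer_fu, bypass_latency):
    """Latency seen by a consumer, given where producer and consumer run.

    bypass_latency maps (producer_fu, consumer_fu) to the latency of an
    existing bypass path; a missing entry means no bypass path exists.
    """
    return bypass_latency.get((producer_fu, consumer_fu), REGFILE_LATENCY)


# Hypothetical customized bypass network with two surviving paths.
bypass_latency = {("FU0", "FU0"): 1, ("FU0", "FU1"): 2}

for consumer in ("FU0", "FU1", "FU2"):
    print(consumer, effective_latency("FU0", consumer, bypass_latency))
```

The same producer therefore shows a different latency to each consumer, depending on the FU pair.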

Effects on List Scheduler (LS)
Used widely in many compilation systems
Assigns each operation to a free FU at the earliest time (greedy!)
When more than one free FU is available, picks one arbitrarily
WHILE (Readylist is non-empty) DO
  op ← Next unscheduled operation in priority order ;
  stime ← Earliest time when op can be scheduled ;
  WHILE (no free resource available to execute op at stime) DO
    stime ← stime + 1 ;
  END
  res ← Free resource capable of executing op ;
  schedule (op, res, stime) ;
END
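The list-scheduling loop on this slide can be sketched in runnable form as follows. This is a simplified illustration under stated assumptions, not the scheduler from the talk: every FU is assumed able to execute every operation, all operations have the same latency, and the priority order passed in is assumed to already be a valid topological order, which stands in for the ready list.

```python
def list_schedule(ops, preds, num_fus, latency=1):
    """Greedy list scheduling.

    ops:   operations in priority order (assumed topologically sorted).
    preds: dict mapping an op to the list of ops it depends on.
    Returns a dict mapping each op to (fu, start_time).
    """
    schedule = {}
    busy = set()  # (fu, cycle) issue slots already taken
    for op in ops:
        # Earliest time at which all operands of op are available.
        stime = max((schedule[p][1] + latency for p in preds.get(op, [])),
                    default=0)
        # Move later until at least one FU is free at stime.
        while all((fu, stime) in busy for fu in range(num_fus)):
            stime += 1
        # Take the first free FU: an arbitrary choice, as on the slide.
        fu = next(f for f in range(num_fus) if (f, stime) not in busy)
        busy.add((fu, stime))
        schedule[op] = (fu, stime)
    return schedule


def schedule_length(schedule, latency=1):
    """Cycles from time 0 until the last operation completes."""
    return max(start for _, start in schedule.values()) + latency


# Hypothetical three-operation dependence chain on a 2-FU machine.
chain = list_schedule(["x", "y", "z"], {"y": ["x"], "z": ["y"]}, num_fus=2)
print(chain, schedule_length(chain))
```

Note that the FU is picked without looking at where the consumers will run, which is the property the following slides examine.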

LS on Full Bypass Machine
Operations have 1-cycle latency. Machine with full bypass network.
[Figure: DFG and a schedule table with columns Cycle, A, B, C, filled in over successive builds.]
Schedule length = 3 cycles
Choice of FU does not affect schedule length in a machine with full bypass.

LS on Partial Bypass Machine
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
[Figure: schedule tables with columns Cycle, A, B, C for different FU choices.]
Schedule length = 3 cycles or 5 cycles, depending on the FU choice
Choice of FU affects schedule length drastically in a machine with partial bypass. Arbitrary choice is no good!
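The 3-cycle versus 5-cycle outcome can be reproduced with a small sketch. It assumes a linear chain of three dependent operations and a hypothetical set of bypass paths; the actual DFG and bypass configuration on the slides are not recoverable from the transcript, so the FU numbers and paths below are illustrative only.

```python
def chain_schedule_length(assignment, bypasses, op_latency=1, regfile_latency=3):
    """Schedule length of a linear dependence chain for a given FU assignment.

    assignment: FU chosen for each op of the chain, in dependence order.
    bypasses:   set of (producer_fu, consumer_fu) pairs that have a bypass path.
    A value crossing a pair without a bypass goes through the register file.
    """
    start = 0
    for producer_fu, consumer_fu in zip(assignment, assignment[1:]):
        if (producer_fu, consumer_fu) in bypasses:
            start += op_latency
        else:
            start += regfile_latency
    return start + op_latency


# Hypothetical partial bypass network: FU 0 forwards to itself and to FU 1 only.
bypasses = {(0, 0), (0, 1)}
print(chain_schedule_length((0, 0, 0), bypasses))  # every hop uses a bypass
print(chain_schedule_length((0, 1, 0), bypasses))  # second hop uses the register file
```

With these assumed values the first assignment takes 3 cycles and the second takes 5, the same gap the slides report.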

Greediness of LS
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
[Figure: partial DFG with ops 1 to 4 and schedule tables with columns Cycle, A, B, C covering cycles i to i+4.]
Consider scheduling op 1 at the earliest time
–Op 1 at i+1, op 2 at i+2, ops 3 and 4 at i+4: greedily scheduling op 1 at cycle i+1 delays ops 3 and 4
–Op 1 at i+1, ops 2 and 3 at i+2, op 4 at i+4: greedily scheduling op 1 at cycle i+1 delays op 4
–Op 1 delayed 1 cycle to i+2, ops 2, 3, and 4 at i+3
Delaying ops could improve the schedule. Being greedy is no good!

FLASH: Goals
Keep the list scheduling framework; it is fast and widely used
Effectively deal with non-uniform latencies
–Intelligently select from among multiple co-equal choices
–Avoid greedy choices by delaying schedule slots

Observation I
Consider FU choices for operation A:
[Figure: DFG with operations A and B, later extended with C. Successive builds mark FU choices "No Good!" (3-cycle delay), "Good!" (no delay), "Good ???" (no delay, then a 3-cycle delay), and "Better!" (no delay).]
An FU with a low-latency path to a consumer FU is good
–Thus, the consumer operation won’t be delayed
The same observation extends to the consumer’s consumer, and so on
An FU which does not delay the consumers is a good choice

Observation II
Consider FU choices for operation A:
[Figure: DFG with operations A, B, C, D, with labels Slack 1 and Slack 0. An FU choice with no delay on one side and a 3-cycle delay on the other is marked "Good ???", then "Better!".]
Not all consumers are equal
It’s better to delay a non-critical consumer
Criticality ∝ 1 / Slack
An FU which does not delay a critical chain of consumers is a good choice

The FLASH Technique
Compute the merit (FLASH_RANK) of each FU choice for an operation
–FLASH_RANK: weighted estimate of the schedule lengths of the dependence chains of an operation
Schedule the operation on the FU with the best FLASH_RANK
Avoid greediness by delaying the schedule slot, if necessary
FLASH_RANK(op, FU) = MAX over c of [ 1 / (Slack(c) + 1) × Estimated schedule length of c ], where c is a dependence chain of op

FLASH_RANK Example
[Figure: DFG with operations A, B, C, D, with labels Slack 1 and Slack 0.]
FLASH_RANK(A, Green FU) = MAX( 1/(1+1) × 1, 1/(0+1) × 4 ) = MAX(0.5, 4) = 4
FLASH_RANK(A, Yellow FU) = MAX( 1/(1+1) × 1, 1/(0+1) × 2 ) = MAX(0.5, 2) = 2
Choose Yellow FU for op A
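The arithmetic of this example can be checked with a short sketch. The function below only applies the FLASH_RANK formula to (slack, estimated schedule length) pairs that are supplied by hand; how FLASH actually estimates chain schedule lengths is not shown in the transcript, so those inputs are taken directly from the slide values. The function and variable names are illustrative.

```python
def flash_rank(chains):
    """FLASH_RANK for one (op, FU) choice.

    chains: iterable of (slack, estimated_schedule_length) pairs, one per
    dependence chain of the op, as estimated for this FU choice.
    Each length is weighted by 1 / (slack + 1); the maximum is returned.
    """
    return max(est_len / (slack + 1) for slack, est_len in chains)


# Values from the example slides: the slack-1 chain has estimated length 1 on
# both FUs; the slack-0 chain has length 4 on the Green FU and 2 on the Yellow FU.
ranks = {
    "Green": flash_rank([(1, 1), (0, 4)]),   # max(0.5, 4)
    "Yellow": flash_rank([(1, 1), (0, 2)]),  # max(0.5, 2)
}

# A lower rank means a shorter weighted estimate, so pick the minimum.
best_fu = min(ranks, key=ranks.get)
print(ranks, best_fu)
```

Weighting by 1 / (Slack + 1) is what lets the critical (slack 0) chain dominate the choice.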

Some Practical Considerations
Impractical to estimate the schedule length of an entire dependence chain (a few tens of operations)
–Truncate dependence chains to manageable depths, say 2 or 3 (look-ahead depth)
Impractical to calculate the schedule lengths of all dependence chains together
–Many dependence chains originate from an operation
–Consider dependence chains independently
–Ignore resource constraints between dependence chains
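Truncating dependence chains at a look-ahead depth, and treating each chain on its own, might look like the sketch below. The successor map and operation names are hypothetical, and this enumeration is only one straightforward reading of the slide, not the implementation used in the talk.

```python
def truncated_chains(op, succs, depth):
    """All dependence chains starting at op, cut off after `depth` consumers.

    succs maps an op to the list of ops that consume its result.
    Each chain is returned as a list of ops beginning with op.
    """
    consumers = succs.get(op, [])
    if depth == 0 or not consumers:
        return [[op]]
    chains = []
    for consumer in consumers:
        for tail in truncated_chains(consumer, succs, depth - 1):
            chains.append([op] + tail)
    return chains


# Hypothetical DFG: A feeds B and C, C feeds D, D feeds E.
succs = {"A": ["B", "C"], "C": ["D"], "D": ["E"]}
print(truncated_chains("A", succs, depth=2))
```

With a look-ahead depth of 2 the chain through C stops at D, so E is never examined, which keeps the per-operation cost bounded.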

Experiments
Implemented in the TRIMARAN compiler framework
Evaluated on MediaBench and SPECint2000
Machine is a 9-wide VLIW (4I, 2F, 2M, 1B)
Application-specific bypass network [Fan ’03]
–30% of the cost of a full bypass network

Comparisons
Baseline is the performance achieved by the traditional list scheduler
Global Resource Preference (GRP) algorithm [Fan ’03]
–Global pre-scheduling phase assigns FU preferences to operations based on Bottom-Up Greedy (BUG) schedule estimates
–List scheduler uses these preferences as hints while scheduling

FLASH vs. GRP

Bypass Utilization

Conclusion
Developed an effective scheduling heuristic for machines with customized bypass interconnect
–Intelligent FU choice
–Avoids greediness
Average performance improvement of 25% over baseline
–Bypass paths utilized better
Could be applied to other cases of non-uniform latencies
