Tony Nowatzki∗ Newsha Ardalani† Karthikeyan Sankaralingam‡ Jian Weng∗


Hybrid Optimization/Heuristic Instruction Scheduling for Programmable Accelerator Codesign
Tony Nowatzki∗ Newsha Ardalani† Karthikeyan Sankaralingam‡ Jian Weng∗
PACT 2018, Nov. 3rd

[Figure: a design spectrum from the CPU to Coarse-Grained Reconfigurable Architectures to the Google TPU]

Problem: How can we map computation to this effectively?
[Figure: systolic CGRA (dedicated-PE CGRA) — a grid of processing elements (ALUs, etc.) connected by a circuit-switched network; diamonds are circuit-switched routers]
CGRAs are interesting because they can greatly improve the efficiency of data-parallel applications over CPUs, by eliminating many of the per-instruction overheads of general-purpose pipelines. CGRAs can be found even in commercial designs now — for example the TPU, which has a systolic array at its heart to do very efficient matrix multiplication. A systolic array is the simplest and most efficient version of a CGRA, because each PE does only neighbor-to-neighbor communication, with no flow control (backpressure) within the array. Systolic arrays are incredibly high performance because they can be perfectly pipelined, producing one computation on every cycle. The issue is generality: if you need to do non-local communication, a pure systolic array makes this impossible. And what if you don't exactly want to do dense matrix multiplies with large matrices? In this work, we slightly generalize the systolic CGRA by adding a circuit-switched network. This is not made up — a variety of programmable accelerators already use this kind of CGRA: Charm, Camel, Stream-Dataflow, FPCA, DySER (almost), LSSD, Morphosys, etc.

Computation Graph → Hardware Graph
[Figure, animated over several frames: a computation graph (multiplies, subtracts, adds, a comparison, a select, and a reduction ∑ producing o[1] from inputs A[8], B[8], c[4]) is placed onto the hardware graph of PEs in three phases: Mapping (instructions to PEs), Routing (dependences to network paths), and Timing (delay assignment).]

Technique Overview
- Joint Optimization: Mapping + Routing + Timing together (days for a solution)
- Incremental Optimization: Mapping → Routing → Timing in separate phases
- Joint Heuristic: Mapping + Routing + Timing together (high area increase or integer-factor performance loss)
Our Approach:
1. Phase Overlapping: Mapping + Routing, then Routing + Timing
2. Hybrid Scheduling
(full throughput, seconds to minutes, low area overhead)

Outline
- Challenges for pipelining “Systolic” CGRAs
  - Throughput sensitivity and fractional initiation intervals
  - Techniques for delay matching
- Scheduling Approach 1: Phase Overlapping
  - Tractable optimization through overlapping
- Scheduling Approach 2: Hybrid Scheduling
  - Heuristic to reduce search space
  - Optimization to deal with tricky cases
- Evaluation

Requirements for Full-Pipelining
- Sufficient data bandwidth: data has to arrive at a rate sufficient for one set of inputs per cycle.
- No compute contention: each instruction gets a dedicated compute unit.
- No routing contention: each dependence gets a dedicated routing path.
- No storage contention: each operand is guaranteed a free storage location at each step.
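These requirements amount to a legality check on a schedule. A minimal sketch, with hypothetical data structures (a mapping dict of instruction → PE and a routes dict of dependence → list of network links — not the paper's actual code):

```python
# Hypothetical full-pipelining check: every instruction needs its own PE,
# and every dependence needs its own set of network links.

def fully_pipelined(mapping, routes):
    # No compute contention: no two instructions share a PE.
    if len(set(mapping.values())) != len(mapping):
        return False
    # No routing contention: no two dependences share a link.
    used = set()
    for path in routes.values():
        for link in path:
            if link in used:
                return False
            used.add(link)
    return True

# Two instructions on distinct PEs with a disjoint route: fully pipelined.
m = {"mul": "PE0", "add": "PE1"}
r = {("mul", "add"): [("PE0", "PE1")]}
print(fully_pipelined(m, r))  # → True
```

Bandwidth and storage contention would need analogous checks against port and buffer capacities, which this sketch omits.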

Fully-pipelined Systolic CGRA
[Figure: example datapath with × (1 cycle), √ (3 cycles), / and + units, animated over cycles 1–6. One operand path reaches the + at cycle 5, the other at cycle 2 — a mismatch of 3. The early operand must move forward; it can't be consumed yet! A delay FIFO (2+3) on the short path reduces the mismatch to 0.]

Fully-pipelined Systolic CGRA
[Figure: the same example with a shorter delay FIFO (2+2): the operands now arrive at cycles 4 and 5, leaving a residual mismatch of 1.]
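The mismatch in these examples is just the difference between the longest and shortest operand-path latencies into a consumer. A toy calculation, with node names and latencies chosen illustratively to reproduce the mismatch of 3 from the first example (the 2-cycle short path is an assumption):

```python
# Compute per-operand arrival times through a small dataflow graph,
# then the mismatch at the consuming node. Latencies are illustrative.

def arrival_time(latency, preds, node):
    """Cycle at which `node`'s result is ready, inputs starting at cycle 0."""
    if not preds.get(node):
        return latency[node]
    return max(arrival_time(latency, preds, p) for p in preds[node]) + latency[node]

latency = {"in1": 0, "route2": 2, "mul": 1, "sqrt": 3, "div": 1}
preds = {"mul": ["in1"], "sqrt": ["mul"], "div": ["sqrt"],
         "add": ["div", "route2"]}  # the + consumes both paths

paths = [arrival_time(latency, preds, p) for p in preds["add"]]
mismatch = max(paths) - min(paths)
print(mismatch)  # → 3 (a delay FIFO of at least 3 on the short path fixes it)
```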

Throughput Formula
Throughput = FIFO Length / (Max. Mismatch + FIFO Length)
(Throughput is the inverse of the initiation interval in compiler terminology.)
Delay FIFOs:
- Help reduce latency mismatch
- Provide buffer space, which increases throughput
- But at the cost of area…
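A quick sanity check of the formula, reading the mismatch as the residual mismatch the FIFOs cannot absorb (consistent with the preceding slides, where a FIFO of length 4 and residual mismatch of 1 fall short of full pipelining):

```python
def throughput(residual_mismatch, fifo_len):
    # Throughput = FIFO length / (max mismatch + FIFO length).
    # 1.0 means fully pipelined (initiation interval of 1).
    return fifo_len / (residual_mismatch + fifo_len)

print(throughput(0, 4))  # → 1.0 (FIFO covers the whole mismatch)
print(throughput(1, 4))  # → 0.8 (residual mismatch of 1, as on the previous slide)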

Alternatives to Delay-FIFOs (1)
[Figure: the short operand path takes a longer route through the network instead of using a FIFO.]
Use a longer route: costs routing resources.

Alternatives to Delay-FIFOs (2)
[Figure: the short operand path passes through an intermediate PE performing a NOP.]
Pass through a PE: costs mapping resources. (This is also just good for improving routing bandwidth…)

“Systolic” CGRA Challenge Summary
- Systolic CGRA throughput is very sensitive to timing constraints.
- Several ways to balance delays:
  - Pass-through (affects mapping)
  - Long route (affects routing)
  - Delay FIFOs (affects timing)
- Mapping / routing / timing responsibilities are highly interdependent.
- Codesign: either more delay FIFOs or a more advanced scheduler.

Outline
- Challenges for pipelining “Systolic” CGRAs
  - Throughput sensitivity and fractional initiation intervals
  - Techniques for delay matching
- Scheduling Approach 1: Phase Overlapping
  - Tractable optimization through overlapping
- Scheduling Approach 2: Hybrid Scheduling
  - Heuristic to reduce search space
  - Optimization to deal with tricky cases
- Evaluation

Optimization Background
Prior work demonstrates optimization-based spatial scheduling: Integer Linear Programming [PLDI 2013], SMT [TOPLAS 2014].
Basic Integer Linear Programming approach:
- Decision variables:
  - Assigning instructions to PEs (binary)
  - Assigning dependences to a set of routes (binary)
  - Assigning delays to each PE (integer)
- Linear constraints: limit the schedule to be legal
We added three forms of delay matching.
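To make the formulation concrete, here is a toy brute-force stand-in for the ILP's instruction-to-PE decision variables and legality constraint (each instruction on a distinct, free PE), minimizing a routing-cost objective. The 2×2 grid, the Manhattan-distance cost, and all names are illustrative — the real scheduler of course hands this to an ILP solver rather than enumerating:

```python
from itertools import permutations

# Toy spatial-scheduling objective: place a 3-instruction chain onto a
# 2x2 PE grid so total routing distance (a stand-in for routing-resource
# usage) is minimized. Enumerating assignments plays the role of the
# ILP's binary instruction->PE variables plus legality constraints.

insts = ["mul", "add", "out"]
deps = [("mul", "add"), ("add", "out")]
pes = {"PE0": (0, 0), "PE1": (0, 1), "PE2": (1, 0), "PE3": (1, 1)}

def route_cost(assign):
    return sum(abs(pes[assign[a]][0] - pes[assign[b]][0]) +
               abs(pes[assign[a]][1] - pes[assign[b]][1])
               for a, b in deps)

# permutations enforces "each instruction gets a distinct PE".
best = min((dict(zip(insts, combo)) for combo in permutations(pes, 3)),
           key=route_cost)
print(route_cost(best))  # → 2 (each dependence routed to an adjacent PE)
```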

Optimization Phases
- Joint optimization (Mapping + Routing + Timing): 50% failure — timeouts; the systolic CGRA's very restricted hardware makes the joint problem hard.
- Separate phases (Mapping → Routing → Timing): 85% failure — fast, but doesn't capture the inter-dependence!
- In-between, partially joint: 75% failure, poor throughput.

Our Approach: Phase Overlapping
Insight: phase interdependence matters, but the relationship between adjacent phases is most relevant…
Phase overlapping:
1. Perform Mapping and Routing together
2. Discard Routing (keep Mapping)
3. Perform Routing and Timing together
Good: only 15% fail, and results are mostly fully-pipelined.
Remaining problems:
- The mapping/routing phase takes most of the time (90% or more)
- Sometimes we get an unlucky mapping/routing (it looks high-potential, but cannot be fixed during timing)

Overview
- Challenges for pipelining “Systolic” CGRAs
  - Throughput sensitivity and fractional initiation intervals
  - Techniques for delay matching
- Scheduling Approach 1: Phase Overlapping
  - Tractable optimization through overlapping
- Scheduling Approach 2: Hybrid Scheduling
  - Heuristic to reduce search space
  - Optimization to deal with tricky cases
- Evaluation

Hybrid Scheduling: Integrate a Heuristic
Insight: mitigate the problems of overlapped scheduling by retrying when we get an unlucky mapping.
But optimization won't work for the mapping/routing phase:
- Still too slow (and do we really need it?)
- ILP solvers don't tend to generate unique solutions
Hybrid approach: use a heuristic for the mapping/routing phase to quickly get a plausible mapping, then optimize the routing and timing together.
Heuristic: Mapping + Routing → Optimization: Routing + Timing

A Heuristic for Hybrid Scheduling
We need a heuristic that quickly generates unique schedules!
Basic approach: stochastic scheduling
- Iteratively build the schedule in topological order (fast)
- Normally be rational: choose the location for each instruction that minimizes routing resources and timing mismatch
- Occasionally be irrational: choose a purposely bad mapping, hoping that it might help later on (unique)
- Repeat a few times to find a good schedule (good)
Objective function: minimize total mismatch (to make routing/timing scheduling easier)
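A minimal sketch of this stochastic heuristic; the cost function, the 10% "irrationality" rate, and the retry count are assumptions for illustration, not the paper's exact choices:

```python
import random

# Stochastic list scheduling: place instructions in topological order,
# usually greedily (minimum cost), occasionally at random; repeat with
# different seeds and keep the cheapest schedule found.

def stochastic_place(insts_topo, candidates, cost, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    placement = {}
    for inst in insts_topo:
        free = [pe for pe in candidates if pe not in placement.values()]
        if rng.random() < epsilon:      # occasionally be irrational
            placement[inst] = rng.choice(free)
        else:                           # normally be rational
            placement[inst] = min(free, key=lambda pe: cost(inst, pe, placement))
    return placement

def schedule(insts_topo, candidates, cost, tries=20):
    def total(p):
        return sum(cost(i, p[i], p) for i in insts_topo)
    return min((stochastic_place(insts_topo, candidates, cost, seed=s)
                for s in range(tries)), key=total)

# Toy cost: prefer low-index PEs (a real cost would measure routing
# resources and timing mismatch, per the objective function above).
pes = ["PE0", "PE1", "PE2", "PE3"]
cost = lambda inst, pe, placed: pes.index(pe)
p = schedule(["a", "b", "c"], pes, cost)
print(sorted(p.values()))  # → ['PE0', 'PE1', 'PE2']
```

Seeding each try makes the search deterministic and repeatable while still exploring distinct schedules across tries.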

Intuition for Solution Quality

Outline
- Challenges for pipelining “Systolic” CGRAs
  - Throughput sensitivity and fractional initiation intervals
  - Techniques for delay matching
- Scheduling Approach 1: Phase Overlapping
  - Tractable optimization through overlapping
- Scheduling Approach 2: Hybrid Scheduling
  - Heuristic to reduce search space
  - Optimization to deal with tricky cases
- Evaluation

Methodology – Accelerator Modeling
Softbrain accelerator [ISCA 2017]:
- 5×5 processing elements (PEs), 64-bit datapath (subword), heterogeneous grid
- All FUs are fully pipelined
[Figure: Softbrain block diagram — control core, stream engines, streaming cache, scratchpad (SPAD), PE grid, I/O interface]
Simulation: all blocks simulated at cycle level in gem5, single “core”.
Area analysis: Chisel-based design synthesized in a 55nm tech library; assumes a control core (single-issue in-order, 16KB I$/D$) + 4KB SPAD.
Workloads (accelerator-centric): MachSuite, CNN/DNN, dense linear algebra (QR/Cholesky/FFT).

Does delay FIFO area matter?

FIFO Length | Scheduling Difficulty | Area (mm²) | Overhead
2           | Very Hard             | 0.528      | 9.67%
3           | Hard                  | 0.543      | 12.90%
7           | Medium                | 0.606      | 25.85%
15          | Easy                  | 0.736      | 52.86%

Methodology – Experimental Setup
Scheduler designs:
- Joint optimization
- Overlapped optimization
- Heuristic only
- Hybrid (overlapped optimization + heuristic)
Scheduling timeout: 20 minutes.
Problem: how do we compare schedulers across a set of workloads when some schedules may fail? We don't! Instead, we seed all schedulers with an initial solution from the heuristic. Joint and Overlapped are thus also hybrid schedulers in the evaluation, just using a more traditional optimization approach.

Performance vs Time vs Complexity

Conclusions
- Systolic CGRAs gain efficiency by simplifying the execution model.
- The consequence is a tradeoff between easy scheduling, low hardware overhead, and high performance: minimizing delay-FIFO size is a key factor in reducing area, but it increases the scheduler's difficulty in finding the minimum II.
- Two novel techniques:
  - Phase overlapping (captures the phase interdependence)
  - Hybrid scheduling (uses a heuristic to generate candidate solutions)
- Broadly, codesign is key to the success of future reconfigurable architectures.

Thank you

Scheduling-time Sensitivity

Modeled vs Measured Throughput
[Figure: modeled vs. measured throughput for the Heuristic, Hybrid, Overlapped, and Joint schedulers.]