CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Presentation transcript:

CGRA Express: Accelerating Execution using Dynamic Operation Fusion
Yongjun Park, Hyunchul Park, Scott Mahlke
CCCP Research Group, University of Michigan

Coarse-Grained Reconfigurable Architecture (CGRA)
- Array of PEs connected in a mesh-like interconnect
- High throughput from a large number of resources
- Distributed hardware offers low cost and power consumption
- High flexibility through dynamic reconfiguration

CGRA: Attractive Alternative to ASICs
- Suitable for running multimedia applications on future embedded systems
- High throughput, low power consumption, high flexibility (e.g., Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW)
- Example designs:
  - MorphoSys: 8x8 array with a RISC processor
  - SiliconHive: hierarchical systolic array
  - ADRES: 4x4 array tightly coupled with a VLIW

Performance Bottleneck: Acyclic Code
- In the original application, the loop region dominates execution time; loops (Blocks 1, 2, 3, 5) are software pipelined, while acyclic code (e.g., Block 0) gets a normal schedule
- After software pipelining shrinks the loops, the acyclic region dominates application execution time
- The acyclic region is substantial: it is time to optimize acyclic code

Key Idea: Chaining Instructions
1. The clock period is set by the longest operation plus a register file access
2. A CGRA is not a VLIW: register file accesses are infrequent
3. This creates an opportunity to chain instructions; non-critical paths are fast, the critical path is slow
4. Register access time is comparable to an arithmetic operation delay (3.5 ns clock period at IBM 90 nm)

Group           Opcodes         Delay (ns)
Multi-cycle op  MUL, LD, ST     1.65
Arith           ADD, SUB        1.74
Shift           LSL, LSR, ASR   1.36
Comp            EQ, NE, LT      0.93
Logic           AND, OR, XOR    0.73
RF access       -               1.61
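The chaining opportunity above can be sketched as a simple feasibility check. This is a hedged illustration, not the paper's algorithm: the function name and the choice to charge one register-file access per cycle are assumptions layered on the slide's delay table.

```python
# Hypothetical sketch: decide whether a chain of dependent operations
# fits in one 3.5 ns clock period, using the per-group delays from the
# slide and charging one register-file access for the cycle (assumed).

OP_DELAY_NS = {
    "MUL": 1.65, "LD": 1.65, "ST": 1.65,      # multi-cycle ops
    "ADD": 1.74, "SUB": 1.74,                 # arithmetic
    "LSL": 1.36, "LSR": 1.36, "ASR": 1.36,    # shifts
    "EQ": 0.93, "NE": 0.93, "LT": 0.93,       # compares
    "AND": 0.73, "OR": 0.73, "XOR": 0.73,     # logic
}
RF_ACCESS_NS = 1.61
CLOCK_PERIOD_NS = 3.5

def chain_fits(ops, rf_accesses=1):
    """True if the dependent ops can be chained into a single cycle."""
    total = rf_accesses * RF_ACCESS_NS + sum(OP_DELAY_NS[op] for op in ops)
    return total <= CLOCK_PERIOD_NS

print(chain_fits(["ADD"]))         # RF + ADD = 3.35 ns -> True
print(chain_fits(["AND", "XOR"]))  # RF + two logic ops = 3.07 ns -> True
print(chain_fits(["ADD", "ADD"]))  # 5.09 ns -> False
```

Note how the numbers line up with the slide's claim: an RF access plus the longest single-cycle operation (ADD, 1.61 + 1.74 = 3.35 ns) just fits the 3.5 ns period, while short logic and compare operations leave enough slack to chain.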

Dynamic Operation Fusion
- Execute multiple dependent operations in one cycle
- Key benefits:
  1. Minimal hardware overhead
  2. Multiple subgraphs can be executed simultaneously
  3. Dynamic merging of FUs
- Example: a dependent chain (ADD 512, ADD 10, LSR) takes 3 cycles on the current 4x4 CGRA but 1 cycle with operation fusion
- Assumption: instruction time = RF read time = RF write time

Hardware Support
- Simple bypass network
- Small overhead: 3.8% control bits (SRAM), 2.3% area (MUX)

                baseline   modified   overhead (%)
control bits    845        877        3.8
area (mm^2)     1.447      1.48       2.3

Compiler Support
- Tick-based scheduling
  - Tick: small time unit based on hardware delay information
  - Clock cycle = a fixed number of ticks
- Clock boundary constraint checking
  - Resource conflict
  - Time conflict
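The tick abstraction can be sketched as follows. The tick granularity below is an assumption for illustration; the slides only say that delays are quantized into ticks and that a fused chain must respect clock-cycle boundaries.

```python
import math

# Illustrative tick-based scheduling helpers (TICKS_PER_CYCLE is an
# assumed quantization, not a value from the slides).

CLOCK_NS = 3.5
TICKS_PER_CYCLE = 8

def to_ticks(delay_ns):
    """Quantize a hardware delay into ticks, rounding up (conservative)."""
    return math.ceil(delay_ns / CLOCK_NS * TICKS_PER_CYCLE)

def crosses_clock_boundary(start_tick, delay_ticks):
    """Clock boundary constraint: a fused op must finish in the cycle it starts in."""
    start_cycle = start_tick // TICKS_PER_CYCLE
    end_cycle = (start_tick + delay_ticks - 1) // TICKS_PER_CYCLE
    return start_cycle != end_cycle

add_ticks = to_ticks(1.74)                   # an ADD occupies 4 of 8 ticks
print(crosses_clock_boundary(4, add_ticks))  # ticks 4-7: same cycle -> False
print(crosses_clock_boundary(6, add_ticks))  # ticks 6-9: crosses -> True
```

With this representation the scheduler can place an operation at any tick, then reject placements that either conflict on a resource or straddle a clock boundary.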

Dynamic Operation Fusion Example (1)
1. Conventional scheduling: 5 cycles
[Figure: dataflow graph of SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5) with constant inputs and RF[0]-RF[2]; the schedule table places OP 0-OP 5 across FU0-FU5 over 5 cycles, alongside the CGRA mapping onto the array and register file]

Dynamic Operation Fusion Example (2)
2. Dynamic operation fusion: 3 cycles
[Figure: the same dataflow graph scheduled with fusion; the schedule table packs OP 0-OP 5 into 3 cycles by chaining dependent operations within a cycle]
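The cycle-count reduction in the example can be reproduced with a greedy sketch: walk a dependent chain and fuse consecutive operations into the current cycle while their combined delay fits in the clock period. The linear chain below is illustrative and is not claimed to be the exact dependence structure of the slide's dataflow graph.

```python
# Hedged sketch of greedy fusion along a dependent chain, using the
# per-group delays from the earlier slide. FU-to-FU bypass is assumed
# to avoid a register-file access between fused operations.

OP_DELAY_NS = {"SUB": 1.74, "ADD": 1.74, "LSR": 1.36, "LSL": 1.36}
CLOCK_NS = 3.5

def schedule_chain(chain):
    """Number of cycles when consecutive ops are fused greedily."""
    cycles, used = 0, CLOCK_NS + 1.0   # force a new cycle on the first op
    for op in chain:
        d = OP_DELAY_NS[op]
        if used + d > CLOCK_NS:        # chain would overflow: new cycle
            cycles += 1
            used = 0.0
        used += d
    return cycles

chain = ["SUB", "ADD", "ADD", "LSR", "LSL", "ADD"]
print(schedule_chain(chain))   # fused: 3 cycles
```

Unfused, each of the six operations would occupy its own cycle; fusion packs pairs of short operations (e.g., ADD + LSR at 3.10 ns) into single cycles, matching the 3-cycle fused schedule on the slide.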

Experimental Setup
- Benchmarks: multimedia applications for embedded systems
  - Audio decoding (AAC)
  - Video decoding (H.264)
  - 3D graphics (3D)
- Two designs:
  - baseline: 4x4 heterogeneous CGRA
  - express: 4x4 heterogeneous CGRA with bypass network

Performance Enhancement
- Express achieves a 7-17% reduction in execution time
- Most of the reduction comes from the acyclic code region
- Express also improves the performance of resource-constrained loops: the bypass network gives the compiler more freedom

Detailed Results for 3D Graphics
- Target application: 3D graphics
- Power consumption: 3% higher than the baseline
- Performance: 17% faster than the baseline
- Energy consumption: 15% more efficient

                        baseline   express   ratio
power (mW)              298.26     306.78    102.86%
# of cycles (million)   156.81     130.22    83.04%
energy (mJ)             233.85     199.74    85.42%
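A quick arithmetic check shows the table is internally consistent: since energy = power x time, the energy ratio should equal the power ratio times the cycle ratio (assuming both designs run at the same clock frequency).

```python
# Consistency check on the slide's 3D-graphics numbers.
baseline = {"power_mW": 298.26, "cycles_M": 156.81, "energy_mJ": 233.85}
express  = {"power_mW": 306.78, "cycles_M": 130.22, "energy_mJ": 199.74}

power_ratio  = express["power_mW"] / baseline["power_mW"]    # ~1.0286
cycle_ratio  = express["cycles_M"] / baseline["cycles_M"]    # ~0.8304
energy_ratio = express["energy_mJ"] / baseline["energy_mJ"]  # ~0.8541

# The product of the first two reproduces the third to within rounding:
print(abs(power_ratio * cycle_ratio - energy_ratio) < 1e-3)
```

In other words, the 3% power increase is more than paid for by the 17% cycle reduction, yielding the reported 15% energy saving.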

Conclusion
- Software pipelining reduces loop run-time by large factors, so the acyclic region becomes the performance bottleneck
- Dynamic operation fusion enables back-to-back dependent operations to execute in a single cycle
  - Bypass network
  - Tick-based scheduler
- Up to 17% faster and 15% more energy efficient, with 3% hardware overhead

Questions?