Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures
Hyunchul Park†, Kevin Fan†, Scott Mahlke†, Taewook Oh‡, Heeseok Kim‡, Hong-seok Kim‡
† University of Michigan   ‡ Samsung Advanced Institute of Technology
October 28, 2008

Coarse-Grained Reconfigurable Architecture (CGRA)
- Array of PEs connected by a mesh-like interconnect
- High throughput from a large number of resources
- Distributed hardware offers low cost and power consumption
- High flexibility through dynamic reconfiguration

CGRA: Attractive Alternative to ASICs
- Suitable for running multimedia applications in future embedded systems
- High throughput, low power consumption, high flexibility
- Example designs:
  - MorphoSys: 8x8 array with a RISC processor
  - SiliconHive: hierarchical systolic array
  - ADRES: 4x4 array with a tightly coupled VLIW
- Reported capabilities: Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW

Scheduling in CGRA
- Sparse interconnect and distributed register files
- No dedicated routing resources: FUs are also used for routing
- Operands must be explicitly routed by the compiler
[Figure: conventional VLIW with a central RF feeding all FUs vs. a CGRA with distributed FUs and local RFs]

Scheduling Difficulties
- VLIW: routing is guaranteed by the central RF
- CGRA: multiple possible routes; the compiler is responsible for finding them
- Routing can easily be blocked by other operations
[Figure: time-space scheduling grids contrasting VLIW and CGRA]

Objective of This Work
- A modulo scheduling technique for CGRAs
  - Exploits loop-level parallelism by overlapping the execution of iterations
- A customized approach based on the characteristics of CGRAs
  - Goal: fast compile time and good performance
  - With a huge scheduling space and distributed resources, a naïve approach can yield either a poor solution or a long compile time
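Modulo scheduling overlaps successive loop iterations at a fixed initiation interval (II). As a minimal illustration of the resource side of this trade-off (the standard resource-constrained bound, not code from this work), the smallest II a loop can achieve is limited by how many operations must share the array's FUs:

```python
from math import ceil

def res_mii(num_ops: int, num_fus: int) -> int:
    """Resource-constrained lower bound on the initiation interval:
    each FU can issue at most one operation per cycle of the II window."""
    return ceil(num_ops / num_fus)

# A loop with 20 operations on a 4x4 CGRA (16 FUs) cannot go below II = 2.
print(res_mii(20, 16))  # -> 2
```

The scheduler then searches for a legal schedule at or near this bound; the sparse interconnect is what makes that search hard on a CGRA.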

Traditional Approach: Node-centric
- Operations are placed first, then routing is performed
- All candidate slots are visited to find a solution
[Figure: time-space grid with producers P1, P2, consumer C, and numbered candidate slots across FU0-FU4]
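The placement-first strategy can be sketched as follows. Everything here is a toy model: the 1-D reachability test stands in for the real interconnect search, and all names are illustrative, not the paper's implementation.

```python
# Toy node-centric scheduler: pick a candidate slot for the consumer
# first, then call the router; one routing call per candidate visited.
# route() is a stand-in for the real interconnect search: on this 1-D
# row of FUs an operand moves at most one FU per cycle.

def route(src, dst):
    (t0, f0), (t1, f1) = src, dst
    return t1 > t0 and abs(f1 - f0) <= (t1 - t0)

def node_centric_place(candidate_slots, producer):
    calls = 0
    for slot in candidate_slots:      # visit every candidate slot in order
        calls += 1
        if route(producer, slot):     # a separate routing attempt each time
            return slot, calls
    return None, calls

# Producer sits at (time 0, FU 4); candidates are scanned from FU 0 upward,
# so three unreachable slots are tried before routing finally succeeds.
slots = [(t, f) for t in range(1, 4) for f in range(5)]
print(node_centric_place(slots, producer=(0, 4)))  # -> ((1, 3), 4)
```

The wasted calls on unreachable slots are exactly the inefficiencies the next two slides illustrate.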

Node-centric Inefficiency 1
- Routing is attempted to slots that the edge P1 → C cannot even reach
[Figure: unreachable candidate slots highlighted on the time-space grid]

Node-centric Inefficiency 2
- The same routing work is repeated for successive candidate slots
[Figure: identical partial routes recomputed on the time-space grid]

Our Approach: Edge-centric
- Start routing without placing the operation
- Placement occurs during routing
[Figure: node-centric vs. edge-centric routing on the time-space grid]
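A hedged sketch of the edge-centric idea: expand a single route outward from the producer and place the consumer at the first slot the route reaches, so routing and placement happen together. The slot model (a row of FUs, one hop per cycle) and the breadth-first expansion are assumptions of this example, not the paper's router.

```python
from collections import deque

# Breadth-first route expansion from the producer; the consumer is
# placed at the first reachable slot that accepts it, so a single
# routing call both routes the edge and places the operation.

def edge_centric_route(producer, num_fus, max_time, can_place):
    seen, frontier = {producer}, deque([producer])
    while frontier:
        t, f = frontier.popleft()
        for nf in (f - 1, f, f + 1):          # FUs reachable in the next cycle
            nxt = (t + 1, nf)
            if 0 <= nf < num_fus and t + 1 <= max_time and nxt not in seen:
                if can_place(nxt):            # placement happens during routing
                    return nxt
                seen.add(nxt)                 # otherwise keep routing through it
                frontier.append(nxt)
    return None

# Same scenario as the node-centric example: one expansion from (0, 4)
# immediately lands the consumer on a legal slot.
print(edge_centric_route((0, 4), num_fus=5, max_time=3,
                         can_place=lambda s: s[1] <= 3))  # -> (1, 3)
```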

Benefit 1: Fewer Routing Calls
- Node-centric: 11 routing calls for P1 → C
- Edge-centric: 1 routing call for P1 → C
- Fewer routing calls reduce compile time
[Figure: side-by-side routing-call counts on the time-space grid]

Benefit 2: Global View
- Assume slot 0 is a precious resource (better to save it for later use)
- Node-centric greedily takes the precious slot
- Edge-centric can avoid slot 0 by simply assigning it a high cost
[Figure: node-centric vs. edge-centric slot choice]

Edge-centric Modulo Scheduling (EMS)
- It's all about edges
  - The schedule is constructed by routing edges
  - Placement is integrated into the routing process
- A global perspective for EMS:
  - Scheduling order of edges: prioritize edges to determine the scheduling order
  - Routing optimization: develop a contention model for routing resources

1: Edge Prioritization
- Beyond height-based priority, focus on the number of consumers: simple edges vs. high-fanout edges
- One option: give high priority to high-fanout edges
  - Edges scheduled later are likely to use extra resources
  - Extra resources on simple edges are simply wasted
  - Extra resources on high-fanout edges can be helpful: other consumers can make use of them

Fanout Clustering
- Our approach is the opposite: give priority to simple edges
- Operations connected by simple edges form a cluster
- Schedule simple edges within a cluster; schedule high-fanout edges when their consumers are visited
- 17 of 81 loops in H.264 show better throughput; only 1 shows worse throughput
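The clustering step can be sketched with a small union-find over the DFG. Encoding operations as integers and edges as (src, dst) pairs is an assumption of this example, not the paper's data structures:

```python
# Toy fanout clustering: merge operations connected by single-consumer
# ("simple") edges, leaving high-fanout edges to be handled later when
# their consumers are visited during scheduling.

def cluster_by_simple_edges(num_ops, edges):
    parent = list(range(num_ops))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x
    fanout = {}
    for src, dst in edges:
        fanout.setdefault(src, []).append(dst)
    for src, dsts in fanout.items():
        if len(dsts) == 1:                    # simple edge: fuse the two ops
            parent[find(src)] = find(dsts[0])
    clusters = {}
    for op in range(num_ops):
        clusters.setdefault(find(op), []).append(op)
    return sorted(clusters.values())

# Op 0 feeds 1 and 2 (high fanout, not clustered); 1->3 and 2->4 are simple.
print(cluster_by_simple_edges(5, [(0, 1), (0, 2), (1, 3), (2, 4)]))
# -> [[0], [1, 3], [2, 4]]
```

Each cluster is then scheduled as a unit, and the high-fanout edge from op 0 is routed only when its consumers come up.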

2: Routing Optimization
- Routing is guided by a cost associated with each routing slot; intelligent routing cost metrics are important
- Static cost: a fixed positive cost per resource, minimizing the number of routing resources used by the current edge
- Affinity cost: uses common-consumer information to minimize the routing resources other edges need to reach producers/consumers
- Probabilistic cost: predicts future resource usage to avoid routing failures for other edges
- routing cost = F(static cost, affinity cost, probabilistic cost)
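The slide only states routing cost = F(static, affinity, probabilistic) without giving F; the linear combination and the weights below are assumptions made purely to make the shape of the metric concrete:

```python
# Hypothetical combination of the three cost terms; lower is better.

def routing_cost(static_cost, affinity_cost, prob_usage,
                 w_static=1.0, w_affinity=1.0, w_prob=4.0):
    """static_cost charges each routing resource used, affinity_cost grows
    when moving away from ops sharing a consumer, and prob_usage (0..1)
    predicts how likely other edges will need this slot."""
    return w_static * static_cost + w_affinity * affinity_cost + w_prob * prob_usage

# A slot another edge will probably need (0.9) is penalized relative to
# an otherwise identical slot that is probably free (0.1).
print(routing_cost(1, 0, 0.9) > routing_cost(1, 0, 0.1))  # -> True
```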

Affinity Cost Heuristic
- Affinity cost: utilize common-consumer information
- Affinity value: how close the common consumer is in the DFG
- Place operations with high affinity close to each other
[Figure: two placements of ops A and B with common consumer C; routing cost 2 vs. routing cost 0]
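One plausible way to turn "how close the common consumer is" into a number; the 1/distance form is an assumption of this sketch, since the slide does not give a formula:

```python
# Affinity decays with the DFG distance to the nearest consumer that
# the two operations share; no shared consumer means zero affinity.

def affinity(consumers_a, consumers_b):
    """consumers_x maps a consumer op to its DFG distance from x."""
    common = set(consumers_a) & set(consumers_b)
    if not common:
        return 0.0                      # no shared consumer, no affinity
    nearest = min(consumers_a[c] + consumers_b[c] for c in common)
    return 1.0 / nearest

# A and B both feed C one hop away: high affinity, so place them close.
print(affinity({"C": 1}, {"C": 1, "D": 2}))  # -> 0.5
```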

Probabilistic Cost Heuristic
- Three possible routes for P1 → C1, all using the same number of routing slots
[Figure: time-space grid with producers P1, P2, consumers C1, C2, and a store ST]

Probabilistic Cost Heuristic (cont.)
- Other unplaced edges and operations must be considered:
  - Slots that might be used for routing P2 → C2
  - Slots that might be used for placing ST
[Figure: contended slots highlighted on the time-space grid]

Probabilistic Cost Heuristic (cont.)
- Probabilities of future slot usage are calculated and guide the routing of P1 → C1
- The route in the middle is selected
[Figure: candidate slots annotated with usage probabilities 0.33, 1.0, and 0.5]
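A minimal sketch of how such usage probabilities could be computed, assuming each unplaced operation chooses uniformly among its candidate slots; the candidate sets are invented for illustration, not derived from a real schedule:

```python
from collections import defaultdict

# Every not-yet-placed operation (or pending route) that could occupy a
# slot contributes 1/(number of its candidate slots) to that slot's
# predicted usage, assuming a uniform choice among its candidates.

def slot_usage_probability(candidates_per_op):
    prob = defaultdict(float)
    for slots in candidates_per_op.values():
        for s in slots:
            prob[s] += 1.0 / len(slots)     # uniform over that op's candidates
    return dict(prob)

# ST could land in any of 3 slots; the pending P2 -> C2 route has only
# one usable slot, so that slot is certain (1.0) to be taken.
probs = slot_usage_probability({
    "ST": [(6, 0), (6, 1), (6, 2)],
    "P2->C2": [(5, 2)],
})
print(probs[(5, 2)], round(probs[(6, 0)], 2))  # -> 1.0 0.33
```

The router then steers the current edge away from high-probability slots, which is how the middle route wins in the slide's example.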

EMS System Flow
- Inputs: DFG and CGRA description
- Preprocessing: fanout clustering and edge prioritization
- Scheduling loop: select target edge → cost calculation → perform routing → place operations → route to other consumers
- Output: final schedule

Experimental Setup
- 214 loops from highly optimized media applications: H.264, 3D graphics, AAC, MP3
- Target architecture:
  - 4x4 heterogeneous CGRA (6 memory units, 4 multipliers)
  - Local RF for each PE
  - Mesh-plus interconnect: mesh + 2-hop connections
- Compared to three other solutions:
  - IMS: iterative modulo scheduling, no routing optimization
  - NMS: the same heuristics as EMS, but applied in a node-centric way
  - DRESC: IMEC's simulated-annealing scheduler

Results
- Performance: normalized throughput of loops (maximum throughput is determined by the number of ops in a loop and the number of resources)
- Compile time: measured over all 214 loops
[Chart residue: performance deltas of +10%, +24%, and -2%, with compile-time ratios of 0.5x, 2x, and 18x against the compared schedulers]

Conclusion
- EMS is a good match for scheduling on CGRAs: routing is more important than placement
- The edge-centric approach allows fast compile time: an 18x speedup over simulated annealing
- Intelligent routing cost metrics allow good performance: a 24% improvement over IMS and 98% of the performance of the existing solution

Questions?