Exploiting Loop-Level Parallelism for Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, Rudy Lauwereins


Exploiting Loop-Level Parallelism for Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, Rudy Lauwereins Presented By: Nikhil Bansal

Outline
Introduction
- coarse-grained reconfigurable architectures
- core problem: exploiting parallelism
- modulo scheduling problem
Compiler Framework
Modulo Scheduling Algorithm
Conclusions and Future Work

Example of Coarse-Grained Architectures: MorphoSys
[Figure: topology of MorphoSys and architecture of a Reconfigurable Cell, Ming-Hau Lee et al., University of California, Irvine]
Other examples: REMARC, PACT, Chameleon, KressArray, QuickSilver...

Core Problem: Exploiting Parallelism
Which kind of parallelism makes the difference?
- Instruction-level parallelism: limited parallelism (constrained by dependences); VLIW already does a good job
- Task (thread)-level parallelism: hard to automate; lacks support in coarse-grained architectures
- Loop-level parallelism (pipelining): fits coarse-grained architectures; offers higher parallelism than ILP

Pipelining Using Modulo Scheduling
Modulo scheduling (in general) is a way of pipelining:
- iterations are overlapped
- each iteration is initiated at a fixed interval (II, the initiation interval)
For coarse-grained architectures, the scheduler must decide:
- Where to place an operation? (placement)
- When to schedule an operation? (scheduling)
- How to connect operations? (routing)
- all subject to the modulo constraints

Modulo Scheduling Problem (cont.)
[Figure: a) an example — a dataflow graph with operations n1, n2, n3, n4 mapped onto a 2x2 matrix of FUs fu1-fu4; b) space-time representation over t = 0..4, with iterations overlapped into a prologue, a steady state (kernel), and an epilogue. II = 1, 3 pipeline stages, 4 operations/cycle in the kernel.]
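The slide's 2x2 example can be sketched as a toy model (this is illustrative code, not the DRESC implementation): each operation is assigned an FU and a time step, and with II = 1 the steady-state kernel issues all four operations every cycle.

```python
# Toy model of the slide's example: 4 operations on a 2x2 FU matrix, II = 1.
# schedule maps operation -> (FU, time step); dependences: n1 -> {n2, n3} -> n4.
II = 1
schedule = {"n1": ("fu1", 0), "n2": ("fu2", 1), "n3": ("fu3", 1), "n4": ("fu4", 2)}

# Pipeline stages = schedule length divided by II.
times = [t for _, t in schedule.values()]
stages = (max(times) - min(times)) // II + 1

# The kernel groups operations by (time mod II): every iteration reuses
# the same slots, so with II = 1 all 4 operations issue each cycle.
kernel = {}
for op, (fu, t) in schedule.items():
    kernel.setdefault(t % II, []).append(op)

assert stages == 3              # 3 pipeline stages, as on the slide
assert len(kernel[0]) == 4      # 4 operations/cycle in the steady state
```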

Outline
Introduction
Compiler Framework
- structure of the compiler
- architecture description and abstraction
Modulo Scheduling Algorithm
Conclusions and Future Work

The Structure of the DRESC Compiler
C program -> IMPACT Frontend (external tool) -> Lcode IR -> Dataflow Analysis & Transformation -> Modulo Scheduling Algorithm
Architecture Description -> Architecture Parser -> Architecture Abstraction (feeding the scheduler)
Simulator: under development
DRESC stands for Dynamically Reconfigurable Embedded Systems Compiler.

The Target Architecture Template
[Figure: example of an FU and register file — an FU with muxa/muxb/muxc input multiplexers, src1/src2/pred inputs, dst1/pred_dst1/pred_dst2 outputs, an output register, and Configuration RAM; an RF with in, out1, out2 ports; examples of topology]
- Generalizes common features of other architectures
- Uses an XML-based language to specify topology, resource allocation, operations and timing

Architecture Description and Abstraction
Flow: XML-based architecture description -> Architecture Parser -> Architecture Abstraction -> MRRG representation
The Modulo Routing Resource Graph (MRRG) abstracts the architecture for modulo scheduling. It combines features of:
- the modulo reservation table (MRT) from VLIW compilation
- the routing resource graph from FPGA P&R
It specifies resource allocation, operation binding, topology and timing.

Definitions of MRRG
The MRRG is defined as a 3-tuple G = {V, E, II}:
- v = (r, t), where r refers to a resource and t to a time stamp
- E = {(vm, vn) | t(vm) <= t(vn)}
- II = initiation interval
Important properties:
- modulo: if node (r, tj) is used, all nodes {(r, tk) | tj mod II = tk mod II} are used too
- asymmetric: there is no route from vi to vj if t(vi) > t(vj)
The modulo scheduling problem is thus transformed into a placement and routing (P&R) problem on the MRRG.
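The two MRRG properties above can be encoded directly; the following is a minimal sketch (class and method names are our own, not from the paper):

```python
class MRRG:
    """Minimal Modulo Routing Resource Graph: nodes are (resource, time) pairs."""

    def __init__(self, resources, horizon, II):
        self.II = II
        self.nodes = {(r, t) for r in resources for t in range(horizon)}
        self.used = set()

    def reserve(self, r, t):
        # Modulo property: using (r, t) occupies every (r, t') with
        # t' mod II == t mod II, because the kernel repeats every II cycles.
        for (rr, tt) in self.nodes:
            if rr == r and tt % self.II == t % self.II:
                self.used.add((rr, tt))

    def edge_ok(self, v_from, v_to):
        # Asymmetric property: routes only go forward in time,
        # i.e. t(v_from) <= t(v_to).
        return v_from[1] <= v_to[1]

g = MRRG(["fu1", "fu2"], horizon=4, II=2)
g.reserve("fu1", 0)
assert ("fu1", 2) in g.used and ("fu1", 1) not in g.used   # modulo property
assert g.edge_ok(("fu1", 0), ("fu2", 1))                   # forward in time: ok
assert not g.edge_ok(("fu2", 1), ("fu1", 0))               # backward: rejected
```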

Transform Components to MRRG
Register allocation becomes part of the P&R problem and is solved implicitly by the modulo scheduling algorithm; register modeling is based on Roos2001.
[Figure: MRRG models of an FU (pred/src1/src2 inputs, pred_dst1/pred_dst2/dst outputs, with internal source and sink nodes) and a register file (in, out1, out2 and a cap node, unrolled across cycle 1 and cycle 2)]

Outline
Introduction
Compiler Framework
Modulo Scheduling Algorithm
- combined placement and routing
- congestion negotiation
- simulated annealing
- results and related work
Conclusions and Future Work

Combined Placement and Routing
In a space-time routing resource graph, routability cannot be guaranteed during placement, so placement and routing must be combined.
[Figure: in normal FPGA P&R, placement (LUT1, LUT2) and routing (switch blocks) are separate phases; here, after placing n1 and n2, whether they can be routed is still unknown. Flowchart: Init Placement & Routing -> Rip-Up op -> Re-placement -> Routing -> Success?]

Proposed Algorithm
[Flowchart: InitTemperature -> Init P&R, Penalty -> Rip-Up op -> Re-P&R op -> Evaluate New P&R -> Accept? (if not, Restore op) -> Success? -> UpdatePenalty, UpdateTemperature]
1. Sort the operations.
2. For each II, first generate an initial schedule that respects only the dependence constraints.
3. The algorithm then iteratively reduces resource overuse and tries to arrive at a legal schedule:
- at every iteration, an operation is ripped up from the existing schedule and placed randomly;
- the connected nets are rerouted accordingly;
- a cost function (next slide) is computed to evaluate the new placement and routing;
- a simulated annealing strategy decides whether the new placement is accepted.
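The accept/restore decision is the standard simulated-annealing rule; a sketch is below (the paper's exact cooling schedule and move set are not shown on the slide, so treat this as a generic illustration):

```python
import math
import random

def accept(old_cost, new_cost, temperature, rng=random):
    """Simulated-annealing acceptance: improvements are always kept;
    worse moves are kept with probability exp(-delta / T), which shrinks
    as the temperature is lowered, letting the schedule settle."""
    if new_cost <= old_cost:
        return True
    delta = new_cost - old_cost
    return rng.random() < math.exp(-delta / temperature)

# An improving move is always accepted.
assert accept(10.0, 5.0, temperature=1.0) is True
# At a (near-)zero temperature, a worse move is rejected.
assert accept(5.0, 10.0, temperature=1e-9) is False
```

Each iteration rips up one operation, re-places it randomly, reroutes its nets, evaluates the cost function, and calls this rule; UpdateTemperature then lowers the temperature.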

Cost Function
Resources are allowed to be overused during P&R; overuse is penalized. The cost of using one node is computed from:
- base: base cost of the node in the MRRG
- occ: occupancy
- cap: capacity of the node
- p: penalty factor
The penalty is increased over time (UpdatePenalty after each iteration, starting from InitPenalty).
[The exact cost and penalty-update formulas appear as figures on the slide.]
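The exact formulas on the slide are images, so the following is a plausible Pathfinder-style reconstruction using only the quantities the slide names (base, occ, cap, p), not the paper's verbatim expression:

```python
def node_cost(base, occ, cap, p):
    """Cost of using one MRRG node. Overuse (occ > cap) is allowed during
    P&R but charged a penalty, so congested nodes are gradually negotiated
    away. NOTE: this formula is an assumption, reconstructed in the spirit
    of negotiated-congestion routing; the slide's own formula is an image."""
    overuse = max(0, occ - cap)
    return base * (1 + p * overuse)

def update_penalty(p, growth=1.5):
    # The penalty factor increases over the iterations, as stated on the
    # slide; the growth constant here is arbitrary.
    return p * growth

assert node_cost(1.0, 1, 1, 2.0) == 1.0   # no overuse: just the base cost
assert node_cost(1.0, 3, 1, 2.0) == 5.0   # 2 units of overuse, penalized
assert update_penalty(2.0) == 3.0         # penalty grows each iteration
```

Raising p over time is what turns "soft" overuse early in the search into an effectively hard constraint later, forcing the final schedule to be legal.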

Parameters to Tune the Algorithm
- Ordering of operations: techniques from Llosa2001
- Relaxing factor of the schedule length: trades difficulty of moving operations against more pipeline stages
- Parameters of the SA algorithm
- Costs associated with different resources: register files get a lower base cost
- Penalty factor associated with overused resources: a compromise between scheduling quality and speed
- ...

Scheduling Results
Scheduling results were obtained on an 8x8 matrix resembling the MorphoSys topology.
Algorithm limitations:
- scheduling speed is relatively slow
- scheduling quality still has room for improvement
- cannot handle pipelined FUs
- can only handle the inner loop of a loop nest

Related Work
- Modulo scheduling on clustered VLIWs: the problem is simpler in nature (no routing)
- RaPiD, Garp: row-based architectures and scheduling techniques; no multiplexing
- PipeRench: the ring-like architecture is very specific, so the scheduling techniques are not general
- Z. Huang and S. Malik, DAC 2002: either uses a full crossbar or generates a dedicated datapath for several loops for pipelining

Outline
Introduction
Compiler Framework
Exploiting Parallelism
Conclusions and Future Work

Conclusions:
- Coarse-grained architectures have distinct features; compilers for them are both possible and needed.
- Loop-level parallelism is the right kind of parallelism for coarse-grained reconfigurable architectures.
- A novel modulo scheduling algorithm and an abstract architecture representation have been developed.
Future Work:
- Improve the quality and speed of the scheduling algorithm.
- Enlarge the scope of pipelineable loops.
- Develop techniques to reduce the bottlenecks of pipelineable loops, e.g., by taking distributed memory into account.