1/30 Course-Grained Reconfigurable Architectures Patrick Cooke and Elizabeth Graham.

Slides:



Advertisements
Similar presentations
Computer Architecture
Advertisements

Computer Science and Engineering Laboratory, Transport-triggered processors Jani Boutellier Computer Science and Engineering Laboratory This.
Presentation of Designing Efficient Irregular Networks for Heterogeneous Systems-on-Chip by Christian Neeb and Norbert Wehn and Workload Driven Synthesis.
Graduate Computer Architecture I Lecture 16: FPGA Design.
 Understanding the Sources of Inefficiency in General-Purpose Chips.
FPGA Latency Optimization Using System-level Transformations and DFG Restructuring Daniel Gomez-Prado, Maciej Ciesielski, and Russell Tessier Department.
Caltech CS184a Fall DeHon1 CS184a: Computer Architecture (Structures and Organization) Day17: November 20, 2000 Time Multiplexing.
Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.
11 1 Hierarchical Coarse-grained Stream Compilation for Software Defined Radio Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer.
Penn ESE Spring DeHon 1 ESE (ESE534): Computer Organization Day 21: April 2, 2007 Time Multiplexing.
1 Performed By: Khaskin Luba Einhorn Raziel Einhorn Raziel Instructor: Rivkin Ina Spring 2004 Spring 2004 Virtex II-Pro Dynamical Test Application Part.
Lecture 26: Reconfigurable Computing May 11, 2004 ECE 669 Parallel Computer Architecture Reconfigurable Computing.
 Based on the resource constraints a lower bound on the iteration interval is estimated  Synthesis targeting reconfigurable logic (e.g. FPGA) faces the.
Data Partitioning for Reconfigurable Architectures with Distributed Block RAM Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.
Programmable logic and FPGA
1 EECS Components and Design Techniques for Digital Systems Lec 21 – RTL Design Optimization 11/16/2004 David Culler Electrical Engineering and Computer.
Recap – Our First Computer WR System Bus 8 ALU Carry output A B S C OUT F 8 8 To registers’ input/output and clock inputs Sequence of control signal combinations.
CS294-6 Reconfigurable Computing Day 19 October 27, 1998 Multicontext.
OpenCL High-Level Synthesis for Mainstream FPGA Acceleration
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.
BRASS Analysis of QuasiStatic Scheduling Techniques in a Virtualized Reconfigurable Machine Yury Markovskiy, Eylon Caspi, Randy Huang, Joseph Yeh, Michael.
Dynamic Hardware Software Partitioning A First Approach Komal Kasat Nalini Kumar Gaurav Chitroda.
Performance and Power Efficient On-Chip Communication Using Adaptive Virtual Point-to-Point Connections M. Modarressi, H. Sarbazi-Azad, and A. Tavakkol.
Dr. Konstantinos Tatas ACOE201 – Computer Architecture I – Laboratory Exercises Background and Introduction.
Juanjo Noguera Xilinx Research Labs Dublin, Ireland Ahmed Al-Wattar Irwin O. Irwin O. Kennedy Alcatel-Lucent Dublin, Ireland.
Register-Transfer (RT) Synthesis Greg Stitt ECE Department University of Florida.
EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)
Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.
LOPASS: A Low Power Architectural Synthesis for FPGAs with Interconnect Estimation and Optimization Harikrishnan K.C. University of Massachusetts Amherst.
CHAPTER 3 TOP LEVEL VIEW OF COMPUTER FUNCTION AND INTERCONNECTION
High-Level Interconnect Architectures for FPGAs Nick Barrow-Williams.
ASIP Architecture for Future Wireless Systems: Flexibility and Customization Joseph Cavallaro and Predrag Radosavljevic Rice University Center for Multimedia.
J. Christiansen, CERN - EP/MIC
1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR FPGA Fabric n Elements of an FPGA fabric –Logic element –Placement –Wiring –I/O.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
designKilla: The 32-bit pipelined processor Brought to you by: Victoria Farthing Dat Huynh Jerry Felker Tony Chen Supervisor: Young Cho.
L11: Lower Power High Level Synthesis(2) 성균관대학교 조 준 동 교수
Introduction to FPGA Created & Presented By Ali Masoudi For Advanced Digital Communication Lab (ADC-Lab) At Isfahan University Of technology (IUT) Department.
Intermediate Fabrics: Virtual FPGA Architectures for Circuit Portability and Fast Placement and Routing on FPGAs James Coole PhD student, University of.
Design Space Exploration for Application Specific FPGAs in System-on-a-Chip Designs Mark Hammerquist, Roman Lysecky Department of Electrical and Computer.
A Configurable High-Throughput Linear Sorter System Jorge Ortiz Information and Telecommunication Technology Center 2335 Irving Hill Road Lawrence, KS.
Rinoy Pazhekattu. Introduction  Most IPs today are designed using component-based design  Each component is its own IP that can be switched out for.
EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)
Lecture 12: Reconfigurable Systems II October 20, 2004 ECE 697F Reconfigurable Computing Lecture 12 Reconfigurable Systems II: Exploring Programmable Systems.
Computer Organization CDA 3103 Dr. Hassan Foroosh Dept. of Computer Science UCF © Copyright Hassan Foroosh 2002.
© 2010 Altera Corporation - Public Lutiac – Small Soft Processors for Small Programs David Galloway and David Lewis November 18, 2010.
A Programmable Single Chip Digital Signal Processing Engine MAPLD 2005 Paul Chiang, MathStar Inc. Pius Ng, Apache Design Solutions.
FPGA-Based System Design: Chapter 1 Copyright  2004 Prentice Hall PTR Moore’s Law n Gordon Moore: co-founder of Intel. n Predicted that number of transistors.
Exploiting Parallelism
Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke
WARP PROCESSORS ROMAN LYSECKY GREG STITT FRANK VAHID Presented by: Xin Guan Mar. 17, 2010.
Sunpyo Hong, Hyesoon Kim
Hyunchul Park†, Kevin Fan†, Scott Mahlke†,
CML Path Selection based Branching for CGRAs ShriHari RajendranRadhika Thesis Committee : Prof. Aviral Shrivastava (Chair) Prof. Jennifer Blain Christen.
Lecture 17: Dynamic Reconfiguration I November 10, 2004 ECE 697F Reconfigurable Computing Lecture 17 Dynamic Reconfiguration I Acknowledgement: Andre DeHon.
Application-Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras J. Gregory Steffan Jonathan Rose University of Toronto Electrical.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
Resource Sharing in LegUp. Resource Sharing in High Level Synthesis Resource Sharing is a well-known technique in HLS to reduce circuit area by sharing.
CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #22 – Multi-Context.
1 Architecture of Datapath- oriented Coarse-grain Logic and Routing for FPGAs Andy Ye, Jonathan Rose, David Lewis Department of Electrical and Computer.
James Coole PhD student, University of Florida Aaron Landy Greg Stitt
Intermediate Fabrics: Virtual FPGA Architectures for Circuit Portability and Fast Placement and Routing on FPGAs James Coole PhD student, University of.
Architecture & Organization 1
Architecture & Organization 1
Hyunchul Park, Kevin Fan, Manjunath Kudlur,Scott Mahlke
The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.
Application-Specific Customization of Soft Processor Microarchitecture
Programmable logic and FPGA
Presentation transcript:

1/30 Course-Grained Reconfigurable Architectures Patrick Cooke and Elizabeth Graham

2/30 Introduction FPGA Benefits ▫Better performance than software ▫Rapid prototyping ▫Lower NRE costs ▫Field-upgradable FPGA Disadvantages ▫Learning curve ▫Lengthy compilation times ▫Lack of portability

3/30 Solution: CGRA Learning curve ▫High-level synthesis ▫Simpler basic building blocks Lengthy compilation times ▫Separate virtual hardware and application compilation ▫Shorter application compilation time Lack of portability ▫Hardware abstraction ▫New FPGA, same application

4/30 Intermediate Fabrics: Virtual Architectures for Circuit Portability and Fast Placement and Routing James Coole, Dr. Greg Stitt University of Florida Department of Electrical & Computer Engineering Published in CODES + ISSS 2010

5/30 Intermediate Fabrics (IFs) Specialized virtual reconfigurable architectures ▫Configure FPGA with a specialized, higher-level FPGA

6/30 IF Architecture Data plane ▫Functional units ▫Tracks and switches ▫Connections Control plane ▫State register ▫State machine LUT Stream plane ▫Inputs and outputs

7/30 Data Plane Performs application calculations Island-Style Topology ▫Grid of CUs  E.g., ALUs, multipliers, adders ▫Routing resources in between CUs  Tracks connect CUs  Switch boxes connect tracks  Connection boxes connect I/O from CUs to tracks

8/30 Control Plane Provides primitives for state machines and control logic ▫State register ▫Next state logic ▫State-dependent output logic ▫State-independent output logic Limitation: Scalability ▫Not scalable to many inputs or large state machines ▫Data-parallel circuits require < 1% resources for control

9/30 Stream Plane Transfers data to and from external memories ▫Saves data plane resources for computations Components ▫Counter ▫Basic control ▫Memory controller ▫Optional specialized buffers  E.g., smart buffers  Improve memory bandwidth

10/30 IF Overhead High usage of MUXs for routing Reduction techniques ▫Decrease track density ▫Long tracks ▫Jump tracks ▫Wide channels ▫Connection box flexibility

11/30 Experiments Metrics ▫Routability – % of random netlists routed successfully ▫PAR time – Time to complete PAR on the IF ▫Clock overhead – % clock frequency lowered to accommodate additional circuit complexity Sample case studies (12 cases; 21 variations) ▫Matrix Multiply – Inner product of two vectors ▫Accum – Monitors an input stream, increments when value below threshold ▫Max Filter – Image filter, selects max of 3x3 window

12/30 Select Results PAR Time Speedup IF Area Overhead IF Area Overhead Savings* IF Clock Overhead Matrix Multiply FXD112×16%63%16% Matrix Multiply FLT602×31%58%-11% Accum FXD280×4%50%41% Accum FLT323×14%29%25% Max Filter444×9%56%23% Average FXD275×16%48%18% Average FLT1112×23%39%19% Average554×18%45%18% * Savings of IF area overhead versus using IF area overhead reduction techniques

13/30 Routability vs Overhead RoutabilityOverhead 2 Tracks per Channel89%15% 3 Tracks per Channel99%23% 4 Tracks per Channel100%28% 5 Tracks per Channel100%37% Values averaged over different fabric sizes 3×3, 4×4, 5×5, 6×6, 7×7, 8×8, 9×9, 12×8 CUs are DSP48

14/30 Conclusions Average 554× PAR speedup IF area overhead can be substantial, but routability remains relatively high Overhead reduction techniques on average reduce overhead by 45% IF clock overhead negligible to other system bottlenecks

15/30 Future Work Directly map IF routing resources to reduce overhead Evaluate performance of multiple smaller IFs with respect to one large IF Create library of IFs Develop algorithms for automatically selecting most appropriate IF IF synthesis (done manually in this paper)

16/30 Shortcomings IFs do not scale well IF synthesis done by hand, so examples were overly simple Besides random netlist generator, no tools developed for experiment or paper

17/30 An FPGA-based Heterogeneous Coarse-Grained Dynamically Reconfigurable Architecture Ricardo Ferreira, Julio Goldner Vendramini, Lucas Mucida Departamento de Informatica Universidade Federal de Vicosa Monica Magalhaes Pereira, Luigi Carro Instituto de Informatica-PPGC Universidade Federal do Rio Grande do Sul Published in CASES 2011.

18/30 FPGA-based Coarse-Grained Reconfigurable Architecture (CGRA) Virtual device implemented on any commercial off-the-shelf FPGA Simple configuration algorithm enables fast prototyping ▫Algorithm maps dataflow graphs (DFGs) onto word level reconfigurable architecture Proposed CGRA is x faster compared to previous CGRA work

19/30 CGRA Architecture Three components ▫Registers  Normal and bypass ▫Functional units (FUs)  Heterogeneous or Homogeneous FUs  Heterogeneous reduces cost, power, and complexity  Homogeneous simplify scheduling, placement and routing ▫Global interconnection network  Single cycle latency between FUs  Structured & Unstructured Communication Patterns

20/30 Dynamic Interconnection Network Multistage Interconnection Network (MIN) ▫Given n inputs, n outputs and switch radix r, log r n stages with n/r switches each Two parallel Omega networks ▫Blocking networks ▫Switch radix 4  Works well on 6 input LUTs  Half the cost of radix 2 network ▫Each extra stage doubles number of paths connecting each input/output pair

21/30 MIN Routing Upper network routes first operand of each FU, lower network routes second operand Commutative operators allow network to avoid conflicts by switching order of operands Switches support multicast connections

22/30 Scheduling, Placement and Routing (SPR) SPR all performed at same time Modulo scheduling ▫Repeat schedule of configurations in loop ▫Greedy heuristic ▫Polynomial complexity Placement and Routing ▫Greedy heuristic

23/30 SPR Algorithm As Soon As Possible (ASAP) & As Late As Posssible (ALAP) scheduling to find slack Initiation Interval (II) ▫Number of network configurations ▫Initialized based on DFG and architecture configuration Starting from output, attempt place and route for each node from current level in current configuration ▫If success, proceed to next level and next configuration until end of DFG ▫If fail, increment II and restart

24/30 Placement Algorithm Request FU for node placement If no available FU, request bypass register ▫If no available register, placement fails ▫Otherwise, reschedule node one level up Placed nodes are immediately routed

25/30 Routing Algorithm Attempt to route placed node’s FU to destination FU If routing fails, request bypass register ▫If no available register, routing fails ▫Otherwise, reschedule node one level up and attempt to route to register Algorithm returns success or fail of routing attempt

26/30 SPR Walkthrough 5 node DFG 2 FUs 1 bypass register Initiation Interval starts at ceiling(5/2) = 3 Algorithm begins at node E Assume node A is chosen for rescheduling

27/30 Experiments Setup 12 DFGs of digital signal processing benchmarks 6 architecture configurations ▫3 medium configurations (64 I/O MINs) ▫3 large configurations (256 I/O MINs) ▫Each configuration had unique combination of heterogeneous FUs Results Medium configurations ▫Instructions per cycle (IPC) range = ▫20% overhead on minimum Initiation Interval ▫Average CPU time = 40 ms Large configurations ▫Instructions per cycle (IPC) range = ▫40% overhead on minimum Initiation Interval ▫Average CPU time = 130 ms

28/30 Resource Utilization Xilinx Virtex6 configured using ISE 12.4 Medium architectures (64 I/O MINs) ▫1% of FPGA register resources ▫15% of LUT resources ▫4% of DSP resources Large architectures (256 I/O MINs) ▫6% of FPGA register resources ▫82% of LUT resources ▫16-25% of DSP resources

29/30 Conclusions/Future Work Dynamic CGRA and SPR algorithm achieve on average 50% resource utilization per cycle and CPU time between ms Add local register file to FUs to reduce number of configurations in SPR algorithm Integrate SPR tool into compiler tools for softcore FPGA processors ▫Significantly increase performance of data intensive applications

30/30 Shortcomings No in-depth comparison of results with previous work No comparison of CGRA circuits with equivalent FPGA circuits to evaluate quality of circuits mapped to CGRA