Coarse-Grained Reconfigurable Architectures Patrick Cooke and Elizabeth Graham.


1 1/30 Coarse-Grained Reconfigurable Architectures Patrick Cooke and Elizabeth Graham

2 2/30 Introduction FPGA Benefits ▫Better performance than software ▫Rapid prototyping ▫Lower NRE costs ▫Field-upgradable FPGA Disadvantages ▫Learning curve ▫Lengthy compilation times ▫Lack of portability

3 3/30 Solution: CGRA Learning curve ▫High-level synthesis ▫Simpler basic building blocks Lengthy compilation times ▫Separate virtual hardware and application compilation ▫Shorter application compilation time Lack of portability ▫Hardware abstraction ▫New FPGA, same application

4 4/30 Intermediate Fabrics: Virtual Architectures for Circuit Portability and Fast Placement and Routing James Coole, Dr. Greg Stitt University of Florida Department of Electrical & Computer Engineering Published in CODES + ISSS 2010

5 5/30 Intermediate Fabrics (IFs) Specialized virtual reconfigurable architectures ▫Configure FPGA with a specialized, higher-level FPGA

6 6/30 IF Architecture Data plane ▫Functional units ▫Tracks and switches ▫Connections Control plane ▫State register ▫State machine LUT Stream plane ▫Inputs and outputs

7 7/30 Data Plane Performs application calculations Island-Style Topology ▫Grid of CUs  E.g., ALUs, multipliers, adders ▫Routing resources in between CUs  Tracks connect CUs  Switch boxes connect tracks  Connection boxes connect I/O from CUs to tracks

8 8/30 Control Plane Provides primitives for state machines and control logic ▫State register ▫Next state logic ▫State-dependent output logic ▫State-independent output logic Limitation: Scalability ▫Not scalable to many inputs or large state machines ▫Data-parallel circuits require < 1% resources for control

9 9/30 Stream Plane Transfers data to and from external memories ▫Saves data plane resources for computations Components ▫Counter ▫Basic control ▫Memory controller ▫Optional specialized buffers  E.g., smart buffers  Improve memory bandwidth

10 10/30 IF Overhead High usage of MUXs for routing Reduction techniques ▫Decrease track density ▫Long tracks ▫Jump tracks ▫Wide channels ▫Connection box flexibility

11 11/30 Experiments Metrics ▫Routability – % of random netlists routed successfully ▫PAR time – Time to complete PAR on the IF ▫Clock overhead – % clock frequency lowered to accommodate additional circuit complexity Sample case studies (12 cases; 21 variations) ▫Matrix Multiply – Inner product of two vectors ▫Accum – Monitors an input stream, increments when value below threshold ▫Max Filter – Image filter, selects max of 3x3 window
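
As a rough software reference for the three sample case studies listed on this slide, the kernels could be sketched as follows. The function names, the list-based data layout, and the choice to skip border pixels in the max filter are assumptions made for illustration; the slides do not describe how the circuits were actually specified.

```python
def inner_product(a, b):
    """Inner product of two equal-length vectors (the Matrix Multiply kernel)."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    return sum(x * y for x, y in zip(a, b))


def accum(stream, threshold):
    """Count how many values in the input stream fall below the threshold."""
    count = 0
    for value in stream:
        if value < threshold:
            count += 1
    return count


def max_filter_3x3(image):
    """Maximum of every full 3x3 window of a 2-D list.

    Assumption: pixels without a full 3x3 window are skipped, so the
    output is two rows and two columns smaller than the input.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r + dr][c + dc] for dr in range(3) for dc in range(3))
            for c in range(cols - 2)
        ]
        for r in range(rows - 2)
    ]
```

These are data-parallel loops with very little control logic, which fits the island-style data plane described earlier.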

12 12/30 Select Results
Case study            PAR Time Speedup   IF Area Overhead   IF Area Overhead Savings*   IF Clock Overhead
Matrix Multiply FXD   112×               16%                63%                         16%
Matrix Multiply FLT   602×               31%                58%                         -11%
Accum FXD             280×               4%                 50%                         41%
Accum FLT             323×               14%                29%                         25%
Max Filter            444×               9%                 56%                         23%
Average FXD           275×               16%                48%                         18%
Average FLT           1112×              23%                39%                         19%
Average               554×               18%                45%                         18%
* Savings of IF area overhead versus using IF area overhead reduction techniques

13 13/30 Routability vs Overhead
Tracks per Channel   Routability   Overhead
2                    89%           15%
3                    99%           23%
4                    100%          28%
5                    100%          37%
Values averaged over different fabric sizes: 3×3, 4×4, 5×5, 6×6, 7×7, 8×8, 9×9, 12×8. CUs are DSP48.

14 14/30 Conclusions Average 554× PAR speedup IF area overhead can be substantial, but routability remains relatively high Overhead reduction techniques on average reduce overhead by 45% IF clock overhead negligible compared to other system bottlenecks

15 15/30 Future Work Directly map IF routing resources to reduce overhead Evaluate performance of multiple smaller IFs with respect to one large IF Create library of IFs Develop algorithms for automatically selecting most appropriate IF IF synthesis (done manually in this paper)

16 16/30 Shortcomings IFs do not scale well IF synthesis done by hand, so examples were overly simple Besides random netlist generator, no tools developed for experiment or paper

17 17/30 An FPGA-based Heterogeneous Coarse-Grained Dynamically Reconfigurable Architecture Ricardo Ferreira, Julio Goldner Vendramini, Lucas Mucida Departamento de Informatica Universidade Federal de Vicosa Monica Magalhaes Pereira, Luigi Carro Instituto de Informatica-PPGC Universidade Federal do Rio Grande do Sul Published in CASES 2011.

18 18/30 FPGA-based Coarse-Grained Reconfigurable Architecture (CGRA) Virtual device implemented on any commercial off-the-shelf FPGA Simple configuration algorithm enables fast prototyping ▫Algorithm maps dataflow graphs (DFGs) onto word level reconfigurable architecture Proposed CGRA is 10-100x faster compared to previous CGRA work

19 19/30 CGRA Architecture Three components ▫Registers  Normal and bypass ▫Functional units (FUs)  Heterogeneous or Homogeneous FUs  Heterogeneous reduces cost, power, and complexity  Homogeneous simplifies scheduling, placement and routing ▫Global interconnection network  Single cycle latency between FUs  Structured & Unstructured Communication Patterns

20 20/30 Dynamic Interconnection Network Multistage Interconnection Network (MIN) ▫Given n inputs, n outputs and switch radix r, log_r(n) stages with n/r switches each Two parallel Omega networks ▫Blocking networks ▫Switch radix 4  Works well on 6 input LUTs  Half the cost of radix 2 network ▫Each extra stage doubles number of paths connecting each input/output pair
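
The sizing rule on this slide (log_r(n) stages of n/r switches) and the blocking behaviour of an Omega network can be illustrated with a small sketch. It assumes n is an exact power of r and uses textbook destination-tag routing (a perfect shuffle followed by a switch that selects one base-r digit of the destination per stage). The slides do not state which routing scheme the authors use, and the extra stages mentioned above are not modelled; all function names are hypothetical.

```python
def min_dimensions(n, r):
    """Return (stages, switches_per_stage) for an n-by-n MIN of radix-r switches."""
    if r < 2 or n < r:
        raise ValueError("need r >= 2 and n >= r")
    stages, size = 0, 1
    while size < n:          # integer loop avoids floating-point log errors
        size *= r
        stages += 1
    if size != n:
        raise ValueError("n must be an exact power of r")
    return stages, n // r


def omega_path(src, dst, n, r):
    """Link index occupied after each stage under destination-tag routing."""
    stages, _ = min_dimensions(n, r)
    if not (0 <= src < n and 0 <= dst < n):
        raise ValueError("src and dst must be in range(n)")
    pos, path = src, []
    for stage in range(stages):
        # Destination digit used at this stage, most significant first.
        digit = (dst // r ** (stages - 1 - stage)) % r
        # Perfect shuffle drops the top digit; the switch sets the lowest one.
        pos = (pos * r) % n + digit
        path.append(pos)
    return path


def first_conflict(path_a, path_b):
    """Index of the first stage where two paths need the same link, else None."""
    for stage, (a, b) in enumerate(zip(path_a, path_b)):
        if a == b:
            return stage
    return None
```

Under these assumptions, `min_dimensions(64, 4)` gives 3 stages of 16 switches and `min_dimensions(256, 4)` gives 4 stages of 64 switches, matching the 64 and 256 I/O MIN sizes used later in the experiments. In an 8-input radix-2 network, the routes 0 to 0 and 4 to 1 collide on the same link in the first stage, which is the kind of blocking the slide refers to.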

21 21/30 MIN Routing Upper network routes first operand of each FU, lower network routes second operand Commutative operators allow network to avoid conflicts by switching order of operands Switches support multicast connections

22 22/30 Scheduling, Placement and Routing (SPR) SPR all performed at same time Modulo scheduling ▫Repeat schedule of configurations in loop ▫Greedy heuristic ▫Polynomial complexity Placement and Routing ▫Greedy heuristic

23 23/30 SPR Algorithm As Soon As Possible (ASAP) & As Late As Possible (ALAP) scheduling to find slack Initiation Interval (II) ▫Number of network configurations ▫Initialized based on DFG and architecture configuration Starting from output, attempt place and route for each node from current level in current configuration ▫If success, proceed to next level and next configuration until end of DFG ▫If fail, increment II and restart
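
The ASAP/ALAP slack computation mentioned on this slide can be sketched as below. This is a minimal illustration, assuming every operation takes one level (unit latency) and that the DFG is given as a dictionary mapping each node to its successors; neither assumption comes from the slides, and the function name is hypothetical.

```python
from collections import deque


def asap_alap_slack(successors):
    """Return (asap, alap, slack) level dictionaries for an acyclic DFG."""
    nodes = set(successors)
    for outs in successors.values():
        nodes.update(outs)
    preds = {n: [] for n in nodes}
    for n, outs in successors.items():
        for m in outs:
            preds[m].append(n)

    # Topological order (Kahn's algorithm).
    indeg = {n: len(preds[n]) for n in nodes}
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in successors.get(n, []):
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if len(order) != len(nodes):
        raise ValueError("DFG contains a cycle")

    # ASAP: one level after the latest predecessor.
    asap = {}
    for n in order:
        asap[n] = max((asap[p] + 1 for p in preds[n]), default=0)
    last = max(asap.values(), default=0)

    # ALAP: one level before the earliest successor; sinks sit on the last level.
    alap = {}
    for n in reversed(order):
        alap[n] = min((alap[s] - 1 for s in successors.get(n, [])), default=last)

    slack = {n: alap[n] - asap[n] for n in nodes}
    return asap, alap, slack
```

For a made-up five-node graph with edges A→C, B→C, C→E and D→E, only D has slack (one level). Nodes with slack are the ones a scheduler is free to move between levels without lengthening the critical path.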

24 24/30 Placement Algorithm Request FU for node placement If no available FU, request bypass register ▫If no available register, placement fails ▫Otherwise, reschedule node one level up Placed nodes are immediately routed

25 25/30 Routing Algorithm Attempt to route placed node’s FU to destination FU If routing fails, request bypass register ▫If no available register, routing fails ▫Otherwise, reschedule node one level up and attempt to route to register Algorithm returns success or fail of routing attempt

26 26/30 SPR Walkthrough 5 node DFG 2 FUs 1 bypass register Initiation Interval starts at ceiling(5/2) = 3 Algorithm begins at node E Assume node A is chosen for rescheduling
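
The starting Initiation Interval in this walkthrough, ceiling(5/2) = 3, and the "increment II and restart" rule from the SPR Algorithm slide can be sketched as follows. The `try_spr` callback stands in for a full schedule, place and route attempt and is purely hypothetical, as is the `max_ii` cut-off.

```python
def initial_ii(num_nodes, num_fus):
    """Lower bound on II: ceiling of DFG nodes divided by functional units."""
    if num_nodes < 0 or num_fus <= 0:
        raise ValueError("need num_nodes >= 0 and num_fus > 0")
    return -(-num_nodes // num_fus)  # integer ceiling division


def find_ii(num_nodes, num_fus, try_spr, max_ii=64):
    """Raise II from its lower bound until try_spr(ii) succeeds.

    try_spr is a hypothetical callback returning True when scheduling,
    placement and routing all succeed with `ii` network configurations.
    Returns the first successful II, or None if max_ii is exceeded.
    """
    ii = initial_ii(num_nodes, num_fus)
    while ii <= max_ii:
        if try_spr(ii):
            return ii
        ii += 1  # on failure, increment II and restart
    return None
```

With 5 nodes and 2 FUs, `initial_ii(5, 2)` returns 3, matching the slide. If the first attempt fails and the second succeeds, `find_ii` returns 4.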

27 27/30 Experiments Setup 12 DFGs of digital signal processing benchmarks 6 architecture configurations ▫3 medium configurations (64 I/O MINs) ▫3 large configurations (256 I/O MINs) ▫Each configuration had unique combination of heterogeneous FUs Results Medium configurations ▫Instructions per cycle (IPC) range = 19-26 ▫20% overhead on minimum Initiation Interval ▫Average CPU time = 40 ms Large configurations ▫Instructions per cycle (IPC) range = 37-104 ▫40% overhead on minimum Initiation Interval ▫Average CPU time = 130 ms

28 28/30 Resource Utilization Xilinx Virtex6 configured using ISE 12.4 Medium architectures (64 I/O MINs) ▫1% of FPGA register resources ▫15% of LUT resources ▫4% of DSP resources Large architectures (256 I/O MINs) ▫6% of FPGA register resources ▫82% of LUT resources ▫16-25% of DSP resources

29 29/30 Conclusions/Future Work Dynamic CGRA and SPR algorithm achieve on average 50% resource utilization per cycle and CPU time between 10-300 ms Add local register file to FUs to reduce number of configurations in SPR algorithm Integrate SPR tool into compiler tools for softcore FPGA processors ▫Significantly increase performance of data intensive applications

30 30/30 Shortcomings No in-depth comparison of results with previous work No comparison of CGRA circuits with equivalent FPGA circuits to evaluate quality of circuits mapped to CGRA

