Lecture 15: Multi-FPGA System Software I November 1, 2004 ECE 697F Reconfigurable Computing Lecture 15 Mid-term Review.


1 ECE 697F Reconfigurable Computing, Lecture 15: Mid-term Review

2 SRAM-based FPGA. SRAM bits can be programmed many times. Each programming bit takes up five transistors. The larger device area reduces speed versus EPROM and antifuse. [Figure: a programming bit (labels: Read or Write, Data, Q) and a 2-input LUT with inputs I1 and I2, programming bits P1 to P4, and output Out.]
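The 2-input LUT on this slide can be modelled in a few lines of Python. This is a minimal sketch: the function name `lut2` and the mapping of input pairs to programming bits (inputs 0,0 select P1 and inputs 1,1 select P4) are assumptions made for illustration, not something the slide specifies.

```python
def lut2(program_bits, i1, i2):
    """Evaluate a 2-input LUT.

    program_bits is (P1, P2, P3, P4). The ordering is an assumption:
    (i1, i2) = (0, 0) selects P1 and (1, 1) selects P4.
    """
    if len(program_bits) != 4:
        raise ValueError("a 2-input LUT needs exactly four programming bits")
    # The two inputs act as a 2-bit address into the stored bits.
    return program_bits[(i1 << 1) | i2]


AND_BITS = (0, 0, 0, 1)  # programming for a 2-input AND
OR_BITS = (0, 1, 1, 1)   # programming for a 2-input OR
```

Reprogramming the four SRAM bits changes the function without changing the hardware, which is the point of the SRAM-based approach.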

3 Field Programmable Gate Array

4 Connection Box Flexibility. Fc: how many tracks does an input pin connect to? Here W is the number of tracks in the channel. If the logic cluster is small, Fc is large (Fc = W). If the logic cluster is large, Fc can be less: approximately 0.2W for Xilinx XC4000EX, Virtex. [Figure: logic cluster with I/O pin and output connecting to tracks T0, T1, T2; Fc = 3.]

5 Switchbox Flexibility. The switch box provides an optimized interconnection area. Switchbox flexibility (Fs) was found to be less important than Fc. Six transistors are needed for Fs = 3. [Figure: switch box with tracks 0 and 1 on each side.]

6 Switchbox Issues

7 Fine-grained Approach. For 4-input LUTs, 16 bits of information are available. LUTs can be chained together through the programmable network. The decoder and multiplexer are an issue. Flexibility is a key aspect. [Figure: address (Addr) driving two 16x1 LUTs, LUT1 and LUT2, each with ports labeled A and D.]

8 Growth Rate of Memory. Approximately 2400 transistors per CLB (1200 per LUT) for an XC4000-like implementation (32x1 SRAM). Six transistors per cell for Altera SRAM (2K per EAB).

            Altera 10K        Xilinx 4000E
Size      EABs    trans     CLBs    trans
32x1         1    12288        1     2400
32x8         1    12288        8    19200
128x8        1    12288       32    76800
512x8        2    24576      128   307200

For 512x8, the fine-grained implementation requires about 10X more size.
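The table entries follow from the per-block figures quoted on the slide. The sketch below reproduces them, assuming (as the numbers suggest) that each EAB holds 2048 bits at six transistors per bit and that each CLB holds 32 bits at roughly 2400 transistors. The function name `memory_cost` is illustrative.

```python
import math

EAB_BITS = 2048                    # 2K bits per Altera EAB
TRANS_PER_EAB = 6 * EAB_BITS       # six transistors per SRAM cell
CLB_BITS = 32                      # one XC4000-like CLB as a 32x1 SRAM
TRANS_PER_CLB = 2400               # approximate figure from the slide


def memory_cost(words, width):
    """Return (EABs, EAB transistors, CLBs, CLB transistors) for a words x width memory."""
    bits = words * width
    eabs = math.ceil(bits / EAB_BITS)
    clbs = math.ceil(bits / CLB_BITS)
    return eabs, eabs * TRANS_PER_EAB, clbs, clbs * TRANS_PER_CLB


for words, width in [(32, 1), (32, 8), (128, 8), (512, 8)]:
    print(f"{words}x{width}", memory_cost(words, width))
```

For 512x8 the ratio of CLB transistors to EAB transistors is 307200 / 24576 = 12.5, which is the "about 10X" gap the slide points out.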

9 Toward Computational Comparison. DeHon metrics: computational density of a device, measured in 4-input gate-evaluations per $\lambda^2 \cdot s$. Processor: $\frac{2 \times N_{ALU} \times W_{ALU}}{A_{proc} \times t_{cycle}}$. FPGA: $\frac{N_{4lut}}{A_{array} \times t_{cycle}}$.

10 Degradation. An FPGA can’t really be clocked at 1/(7 ns) due to interconnect. Consider the Bubblesort block from the first class:
if (A > B) { H = A; L = B; } else { H = B; L = A; }
Truth table from the slide:
Ci  0 0 0 0 1 1 1 1
A   0 0 1 1 0 0 1 1
B   0 1 0 1 0 1 0 1
S   0 1 1 0 1 0 0 1
Co  0 0 0 1 0 1 1 1
[Figure: A and B feed a compare block that produces H.] Requires 33 LUT delays.
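The truth table on this slide is that of a one-bit full adder, and the Bubblesort block is a compare-exchange. A small Python sketch of both follows. The function names are illustrative, and the link between the adder cell and the 33 LUT delays (a carry rippling through a chain of such cells) is an interpretation rather than something the transcript states.

```python
def full_adder(a, b, ci):
    """One-bit full adder: returns (S, Co) for inputs A, B and carry-in Ci."""
    s = a ^ b ^ ci
    co = (a & b) | (ci & (a ^ b))
    return s, co


def compare_exchange(a, b):
    """Bubblesort block: returns (H, L), the larger and the smaller value."""
    if a > b:
        return a, b
    return b, a


# Rebuild the slide's table, iterating Ci, A, B in the same column order.
rows = [(ci, a, b) for ci in (0, 1) for a in (0, 1) for b in (0, 1)]
sums = [full_adder(a, b, ci)[0] for ci, a, b in rows]
carries = [full_adder(a, b, ci)[1] for ci, a, b in rows]
```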

11 Single-Instruction Multiple Data. The same instruction is distributed to fine-grained cells. Typically organized as a 2-D array. Ideal for image processing. Typically fixed hardware is located in each cell. [Figure labels: op, multi-bit.]

12 Computation Unit for SIMD. Performs a different operation on every cycle. Instructions are easy to distribute on the device (use global lines). Some local storage for data in each tile. [Figure: inputs from local state or other array elements; outputs to local state or other array elements; global instruction common to all elements.]

13 Computation Unit for FPGA. Performs the same operation on every cycle. No global distribution of instructions at all (stored locally). Also has local storage for data. [Figure: inputs from local state or other array elements; outputs to local state or other array elements; static instruction distinct for each array element.]

14 Hybrid Architecture. Configuration selects the operation of the computation unit. The context identifier changes over time to allow a change in functionality. DPGA: Dynamically Programmable Gate Array. Programming may differ for each element. [Figure: computation unit (LUT) with in and out; address inputs to the instruction store, selected by the context identifier.]
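One way to picture the hybrid (DPGA) unit is a LUT whose truth table is picked from a small local instruction store by the context identifier. The sketch below is only a behavioural model under that reading. The class name `DPGACell` and the first-input-is-most-significant bit ordering are assumptions for illustration.

```python
class DPGACell:
    """Behavioural model of a multi-context LUT cell (illustrative only)."""

    def __init__(self, contexts):
        # contexts: list of truth tables, one per context identifier.
        self.contexts = [tuple(table) for table in contexts]

    def evaluate(self, context_id, inputs):
        """Look up the output for `inputs` under the selected context."""
        table = self.contexts[context_id]
        index = 0
        for bit in inputs:  # first input is the most significant address bit
            index = (index << 1) | bit
        return table[index]


# Two contexts for a 2-input cell: AND in context 0, XOR in context 1.
cell = DPGACell([(0, 0, 0, 1), (0, 1, 1, 0)])
```

A SIMD unit would share one `context_id` and one table across all cells, and an FPGA cell would hold a single fixed table. Here each cell keeps its own tables while the context identifier changes over time.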

15 In-Place Partitioning. Recursively bipartition the netlist into regions of the device. [Figure labels: ab, cd, abcd.]

16 Enhanced Mincut. Terminal propagation takes previous cuts into account during partitioning. It effectively creates node “anchors” and helps minimize wire length. [Figure labels: ab, cd.]

17 Formulating Force Equations. Use Hooke’s Law for modules 1, 2, …, N, where $m_i$ is the mass of module i, $x_i$ is the x position of module i, $K_{ij}$ is the attractive constant between modules i and j, and $F_i$ is the net force on module i from the rest of the modules.
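The slide's equation itself did not survive in the transcript. A common spring-model formulation that uses the same symbols is given here as an assumption, not as the lecture's exact form. The net force is the sum of the spring pulls, and setting it to zero gives the equilibrium position as a weighted average of the connected modules:

```latex
F_i = \sum_{j \neq i} K_{ij}\,(x_j - x_i),
\qquad
F_i = 0 \;\Rightarrow\;
x_i = \frac{\sum_{j \neq i} K_{ij}\, x_j}{\sum_{j \neq i} K_{ij}}
```

In this static form the mass $m_i$ does not appear. It would only matter if the motion were simulated over time, for example through $m_i \ddot{x}_i = F_i$.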

18 Hill Climbing Algorithms. To avoid getting trapped in local minima, consider a “hill-climbing” approach. The algorithm needs to accept worse solutions, or make “bad” moves, to reach the global minimum. Acceptance is probabilistic: only accept cost-increasing moves some of the time. [Figure: cost versus solution space.]
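A common way to make acceptance probabilistic is the simulated-annealing rule, sketched below. The slide does not name a specific rule, so the exponential acceptance probability and the `temperature` parameter are assumptions, and `accept_move` is an illustrative name.

```python
import math
import random


def accept_move(delta_cost, temperature, rng):
    """Decide whether to accept a move that changes the cost by delta_cost.

    Improving (or neutral) moves are always accepted. Cost-increasing moves
    are accepted with probability exp(-delta_cost / temperature), so they
    become rarer as the temperature is lowered. temperature must be > 0.
    """
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)


rng = random.Random(0)  # seeded so the sketch is repeatable
trials = 10000
accepted = sum(accept_move(1.0, 1.0, rng) for _ in range(trials))
rate = accepted / trials  # should be close to exp(-1), about 0.37
```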

19 Maze Routing. Evaluate shortest feasible paths based on a cost function. As with row-based devices, the global route allocates channel bandwidth, not specific solutions. Formulate the cost function as needed to address the desired goal. [Figure labels: L, L, C, S.]
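A maze router can be sketched as a shortest-path search over a grid of routing cells. The version below uses Dijkstra's algorithm with a per-cell entry cost, which is one way to read "shortest feasible paths based on a cost function". The grid representation (`None` marks a blocked cell) and the name `maze_route` are assumptions made for the example.

```python
import heapq


def maze_route(cost, source, sink):
    """Return (total_cost, path) for the cheapest route, or None if unroutable.

    cost[r][c] is the cost of entering cell (r, c); None marks a blocked cell.
    """
    rows, cols = len(cost), len(cost[0])
    best = {source: 0}
    prev = {source: None}
    heap = [(0, source)]
    while heap:
        dist, cell = heapq.heappop(heap)
        if cell == sink:
            path = []
            while cell is not None:  # walk back to the source
                path.append(cell)
                cell = prev[cell]
            return dist, path[::-1]
        if dist > best[cell]:
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if cost[nr][nc] is None:
                continue
            new_dist = dist + cost[nr][nc]
            if new_dist < best.get((nr, nc), float("inf")):
                best[(nr, nc)] = new_dist
                prev[(nr, nc)] = cell
                heapq.heappush(heap, (new_dist, (nr, nc)))
    return None


grid = [
    [1, 1, 1],
    [1, None, 1],   # centre cell is blocked
    [1, 5, 1],      # the bottom-middle cell is expensive (congested)
]
result = maze_route(grid, (0, 0), (2, 2))
```

Raising the cost of a cell steers routes away from it without forbidding it, which is how the cost function can be shaped toward a desired goal.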

20 Routing Tradeoffs. Bias the router to find the first, best route. Vary the number of node expansions using: $pcost_i = (1 - a) \times pcost_{i-1} + ncost_i + a \times dist_i$
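The path-cost expression on this slide can be written directly as a function. The sketch follows the formula as it appears in the transcript. Reading `dist` as the estimated remaining distance to the sink, and `a` as a bias between 0 and 1, is an assumption.

```python
def path_cost(prev_pcost, ncost, dist, a):
    """pcost_i = (1 - a) * pcost_{i-1} + ncost_i + a * dist_i."""
    return (1 - a) * prev_pcost + ncost + a * dist


# a = 0: only accumulated cost matters, so the search expands broadly.
broad = path_cost(prev_pcost=10, ncost=2, dist=5, a=0)
# a = 1: history is ignored and distance to the target dominates.
directed = path_cost(prev_pcost=10, ncost=2, dist=5, a=1)
# Intermediate values trade route quality against the number of expansions.
mixed = path_cost(prev_pcost=10, ncost=2, dist=5, a=0.5)
```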

21 Architectural Limitation. The routing architecture necessitates domain selection. The effect is bigger for multi-fanout nets.

22 Pathfinder. Use a non-decreasing history value to represent congestion. Similarities to multi-commodity flow. Can be implemented efficiently but does require substantial run time. Only update after an iteration. $c_i = (1 + h_n \cdot h_{fac}) \cdot (1 + p_n \cdot p_{fac}) + b_{n,n-1}$
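The Pathfinder node cost can be sketched as follows. Reading `h_n` as the history term, `p_n` as present congestion, and `b` as a bend cost between node n and its predecessor is an assumption, as are the function names and the rule used to grow the history (add the overuse once per iteration).

```python
def node_cost(h_n, p_n, h_fac, p_fac, bend_cost=0.0):
    """c = (1 + h_n * h_fac) * (1 + p_n * p_fac) + b."""
    return (1 + h_n * h_fac) * (1 + p_n * p_fac) + bend_cost


def update_history(history, occupancy, capacity=1):
    """Return new history values after one routing iteration.

    History never decreases: each node gains its overuse (occupancy beyond
    capacity), or nothing if it is not overused.
    """
    nodes = set(history) | set(occupancy)
    return {
        n: history.get(n, 0) + max(0, occupancy.get(n, 0) - capacity)
        for n in nodes
    }


new_history = update_history({"n1": 1}, {"n1": 3, "n2": 1})
```

Because the history only grows, a node that is repeatedly fought over becomes steadily more expensive, which pushes less critical nets onto other resources in later iterations.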

23 DP-FPGA (Cherepacha/Lewis, U. Toronto). Break the FPGA into datapath and control sections. Save storage for LUTs and connection transistors. The key issue is grain size.

24 Rapid Reconfigurable Pipeline Datapath (Ebeling, University of Washington). Uses hard-coded functional units (ALU, memory, multiply). Good for signal processing. Linear array of processing elements. [Figure label: Cell.]

25 Basic Functional Unit. Two inputs from adjacent blocks. Local memory for instructions and data.

26 Chess Basic Block. Switchbox memory can be used as storage. ALU core for computation.

27 FPICs. High internal connectivity. Not always cost effective.

28 Reconfigurable Processing. There are many places to put reconfigurable computing components. Most implementations involve multiple discrete devices. How should these devices be connected together? [Figures from Hauck: Role of FPGAs.]

29 Emulation Software Steps. Many of these are dependent on device interconnect topology. The flow starts from a netlist and ends with FPGA bitstreams:
Translation: technology mapping
Partitioner: divide netlist into fixed-sized chunks
Global Placer: locate an FPGA for a chunk
Global Router: make connections between devices
FPGA-specific P+R: Xilinx P+R

30 Network Routing. FPGAs are popular in network hardware. New protocols are implemented directly in silicon. Easy to upgrade in the field. Washington University Gigabit Switch (WUGS): the switch provides up to 160 Gbps of bandwidth.

31 Programmable Active Memory. Developed by DEC Paris Research Group (1988-1993). Attached to a DEC workstation via a Turbochannel bus interface for burst transfers. A total of 12 were manufactured and distributed worldwide. Flexible software environment.

32 Hybrid Architecture. Buses connect groups of FPGAs to SRAM. Extra devices are used for the RAM controller and to map to the external interface.

33 Logic Emulation. Emulation takes a sizable amount of resources. Compilation time can be large due to FPGA compiles. Emulation is one application, with direct ties to other FPGA computing applications.

34 Are Meshes Realistic? The number of wires leaving a partition grows with Rent’s Rule: $P = K G^B$. The perimeter grows as $G^{0.5}$, but unfortunately most circuits grow as $G^B$ where $B > 0.5$. Devices are effectively highly pin limited. What does this mean for meshes?
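A short worked comparison shows why this matters. Quadrupling the number of gates G in a partition, under Rent's Rule and under the perimeter growth stated on the slide, gives:

```latex
\frac{P(4G)}{P(G)} = \frac{K\,(4G)^{B}}{K\,G^{B}} = 4^{B},
\qquad
\frac{\mathrm{perimeter}(4G)}{\mathrm{perimeter}(G)} = \frac{(4G)^{0.5}}{G^{0.5}} = 2,
\qquad
4^{B} > 2 \iff B > 0.5
```

So for any circuit with $B > 0.5$, the demand for wires crossing the partition boundary grows faster than the boundary itself, and the gap widens as partitions get larger.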

35 Virtual Wires. Overcome pin limitations by multiplexing pins and signals. Schedule when communication will take place.
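The multiplexing idea can be illustrated with a toy schedule that packs logical signals onto a smaller number of physical pins, one time slot at a time. This is a deliberately simplified sketch: it ignores signal dependencies, which a real scheduler would have to respect, and the function name and round-robin policy are assumptions.

```python
def schedule_virtual_wires(signals, num_pins):
    """Map each logical signal to a (time_slot, pin) pair, round-robin."""
    if num_pins <= 0:
        raise ValueError("need at least one physical pin")
    return {
        signal: (index // num_pins, index % num_pins)
        for index, signal in enumerate(signals)
    }


schedule = schedule_virtual_wires(["s0", "s1", "s2", "s3", "s4"], num_pins=2)
slots_needed = 1 + max(slot for slot, _pin in schedule.values())
```

Five signals over two pins need three time slots, so the pin limit is traded for extra cycles per emulated clock.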

36 A Simple Example. [Figure: FPGA 1, FPGA 2, FPGA 3, FPGA 4.]

37 KLFM Partitioning. Identify nodes to swap to reduce the overall cut size. Lock moved nodes. The algorithm continues until no unlocked node can be moved without violating size constraints. [Figure: Bin 1, Bin 2.]
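A single pass of this kind of move-and-lock partitioning can be sketched in Python. This is a simplified illustration rather than a full KLFM implementation: it works on two-pin edges instead of hypergraph nets, recomputes gains from scratch instead of using gain buckets, and all names are hypothetical.

```python
def cut_size(edges, side):
    """Number of edges whose endpoints sit in different bins."""
    return sum(1 for u, v in edges if side[u] != side[v])


def gain(node, edges, side):
    """Reduction in cut size if `node` moved to the other bin."""
    total = 0
    for u, v in edges:
        if node in (u, v):
            other = v if u == node else u
            total += 1 if side[other] != side[node] else -1
    return total


def fm_pass(edges, side, max_bin_size):
    """Move and lock nodes one at a time; return the best partition seen."""
    side = dict(side)
    best_side, best_cut = dict(side), cut_size(edges, side)
    locked = set()
    while True:
        sizes = [sum(1 for s in side.values() if s == b) for b in (0, 1)]
        # Only unlocked nodes whose move keeps the target bin within its limit.
        candidates = [
            n for n in side
            if n not in locked and sizes[1 - side[n]] + 1 <= max_bin_size
        ]
        if not candidates:
            break
        node = max(candidates, key=lambda n: gain(n, edges, side))
        side[node] = 1 - side[node]   # move even if the gain is negative
        locked.add(node)
        current = cut_size(edges, side)
        if current < best_cut:
            best_cut, best_side = current, dict(side)
    return best_side, best_cut


# Two triangles joined by one edge, starting from a poor partition.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"), ("c", "d")]
start = {"a": 0, "b": 0, "c": 1, "d": 0, "e": 1, "f": 1}
best_side, best_cut = fm_pass(edges, start, max_bin_size=4)
```

The pass keeps moving nodes after the cut stops improving and then returns the best prefix of moves, which is what lets the method step past shallow local minima.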

38 Clustering. Technology mapping before partitioning is typically ineffective, since area is frequently secondary to interconnect. Flow: cluster, KLFM, uncluster, KLFM. Frequently bipartitioning continues after unclustering as well, which allows for additional fine-grain moves.

39 Higher-level Gains. Effectively look ahead to try to anticipate the next move. A look-ahead of 3 is considered the best tradeoff.

40 Are Meshes Really Realistic? The number of wires leaving a partition grows with Rent’s Rule: $P = K G^B$. The perimeter grows as $G^{0.5}$, but unfortunately most circuits grow as $G^B$ where $B > 0.5$. Devices are effectively highly pin limited. What does this mean for meshes?

