Instructor: Dr. Phillip Jones


1 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing. Lecture 3: Wed 8/31/2011 (Reconfigurable Computing Hardware). Instructor: Dr. Phillip Jones, Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA

2 Questions From Last Lecture?

4 Announcements/Reminders
HW1 is due Friday of next week. Try to have it completed by this Friday, since MP1 will be released on Friday. Start thinking about topics you may want to do your mini-literature survey on (HW 2). Guest lecturer this Friday (I will be out of town, but should have access).

5 Overview
Logic
Interconnect/Routing
Optimized resources: Adders, Multipliers; Memory; System-on-chip building blocks
Example Commercial FPGA structure

6 What you should learn
Basic understanding of the major components that make up an FPGA device.

7 Basic FPGA Architectural Components
FPGA: Field Programmable Gate Array. Sea of general-purpose logic gates. CLB: Configurable Logic Block.

8 Computational Fabric - LUT
LUT = Look-up Table (4-LUT: inputs A, B, C, D; output Z). [Figure: example truth tables (ABCD -> Z) for a 4-LUT programmed as a 2:1 mux, as an AND of A, B, C, D, and as an OR of A, B, C, D.]

9 Computational Fabric - LUT
LUT = Look-up Table (4-LUT: inputs A, B, C, D; output Z). How many 4-LUTs are needed to OR 32 bits? Draw the circuit (32 inputs, 1 output).

10 Computational Fabric - LUT
LUT = Look-up Table (4-LUT: inputs A, B, C, D; output Z). How many 4-LUTs are needed to OR 32 bits? Draw the circuit. [Figure: eleven 4-LUTs reducing 32 inputs to 1 output.]
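The count in the 32-bit OR exercise can be checked with a small sketch. It assumes a plain reduction tree in which every 4-LUT ORs up to four signals and feeds one output to the next level; the function name `luts_for_reduction` is made up for illustration.

```python
import math


def luts_for_reduction(n_inputs: int, k: int = 4) -> list[int]:
    """Return the number of k-input LUTs used at each level of a reduction tree."""
    levels = []
    signals = n_inputs
    while signals > 1:
        luts = math.ceil(signals / k)  # each LUT absorbs up to k signals
        levels.append(luts)
        signals = luts                 # one output per LUT feeds the next level
    return levels


levels = luts_for_reduction(32)
print(levels, sum(levels))  # [8, 2, 1] 11
```

Under this assumption the last LUT uses only two of its four inputs, which is what leaves room for the follow-up exercise of ANDing in two more bits.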

11 Computational Fabric - LUT
LUT = Look-up Table (4-LUT: inputs A, B, C, D; output Z). How many 4-LUTs are needed to AND 2 bits with the 32-bit OR? Draw the circuit. [Figure: eleven 4-LUTs reducing 32 inputs to 1 output.]

13 Computational Fabric - LUT
LUT = Look-up Table (4-LUT: inputs A, B, C, D; output Z). How many 4-LUTs are needed to AND 2 bits with the 32-bit OR? Draw the circuit and write out the truth table (ABCD -> Z, rows 0000 through 1111). [Figure: eleven 4-LUTs reducing 32 inputs to 1 output, next to an empty 16-row truth table.]

15 Computational Fabric - LUT
LUT = Look-up Table (4-LUT: inputs A, B, C, D; output Z). How many 4-LUTs are needed to AND 2 bits with the 32-bit OR? Draw the circuit and write out the truth table (ABCD -> Z, rows 0000 through 1111). [Figure: eleven 4-LUTs reducing 32 inputs to 1 output, next to the 16-row truth table, which starts to be filled in with a Z entry of 1.]
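One way to work the truth-table exercise is sketched below. It assumes the final 4-LUT of an 11-LUT OR tree has two free inputs, so the two extra bits can be absorbed there without adding a LUT. The pin assignment is an assumption made for illustration: A and B carry the two partial ORs, C and D carry the two bits to AND in.

```python
def final_lut_truth_table() -> dict[str, int]:
    """Truth table of the last LUT: Z = (A or B) and C and D.

    Assumed pin assignment (not from the slides): A, B are partial OR
    results from the previous level; C, D are the two extra bits.
    """
    table = {}
    for index in range(16):
        a = (index >> 3) & 1
        b = (index >> 2) & 1
        c = (index >> 1) & 1
        d = index & 1
        table[f"{index:04b}"] = (a | b) & c & d
    return table


table = final_lut_truth_table()
for row, z in table.items():
    print(row, z)

rows_with_one = [row for row, z in table.items() if z == 1]
print(rows_with_one)  # ['0111', '1011', '1111']
```

With this assignment only three of the sixteen rows produce a 1, and the total stays at eleven 4-LUTs.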

16 Computational Fabric - LUT
LUT = Look-up Table (4-LUT: inputs A, B, C, D; output Z). How could one build a 4-LUT? [Figure: a 1x16 memory feeding a 16:1 mux; the 4 inputs ABCD go to the mux, and the 1-bit mux output is Z.]
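The memory-plus-mux construction can be modeled in a few lines. This is a behavioral sketch only, assuming ABCD are used as the mux select lines with A as the most significant bit; the class name `Lut4` and its methods are hypothetical.

```python
class Lut4:
    """Behavioral model of a 4-LUT: a 1x16 memory read through a 16:1 mux."""

    def __init__(self, memory_bits):
        if len(memory_bits) != 16:
            raise ValueError("a 4-LUT stores exactly 16 configuration bits")
        self.memory = [bit & 1 for bit in memory_bits]

    @classmethod
    def from_function(cls, fn):
        """Program the memory by evaluating fn(a, b, c, d) for every input row."""
        bits = []
        for address in range(16):
            a, b = (address >> 3) & 1, (address >> 2) & 1
            c, d = (address >> 1) & 1, address & 1
            bits.append(fn(a, b, c, d))
        return cls(bits)

    def read(self, a, b, c, d):
        """The inputs act as the mux select: they pick one stored bit."""
        address = (a << 3) | (b << 2) | (c << 1) | d
        return self.memory[address]


and4 = Lut4.from_function(lambda a, b, c, d: a & b & c & d)
or4 = Lut4.from_function(lambda a, b, c, d: a | b | c | d)
print(and4.memory)  # fifteen 0s followed by a single 1
print(or4.read(0, 0, 1, 0))  # 1
```

The same hardware implements AND or OR; only the 16 stored bits differ.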

17 Computational Fabric - LUT
LUT = Look-up Table (4-LUT: inputs A, B, C, D; output Z). How many different 4-input functions can a 4-LUT implement? [Figure: 1x16 memory feeding a 16:1 mux, 4 inputs ABCD, output Z.] $2^{16} = 65536$

18 Computational Fabric - LUT
LUT = Look-up Table (4-LUT: inputs A, B, C, D; output Z). How many different N-input functions can an N-LUT implement? [Figure: 1x16 memory feeding a 16:1 mux, 4 inputs ABCD, output Z.]

19 Computational Fabric - LUT
LUT = Look-up Table (4-LUT: inputs A, B, C, D; output Z). How many different N-input functions can an N-LUT implement? [Figure: same memory and mux, now with N inputs.]

20 Computational Fabric - LUT
LUT = Look-up Table (4-LUT: inputs A, B, C, D; output Z). How many different N-input functions can an N-LUT implement? [Figure: 1x$2^N$ memory feeding a mux, N inputs, output Z.] Answer: $2^{2^N}$. For N = 4: $2^{2^4} = 2^{16} = 65536$.
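The counting argument can be checked by brute force for small N: an N-LUT stores one bit per input row, so every distinct setting of the stored bits is a distinct function. The sketch below is illustrative; both function names are made up.

```python
from itertools import product


def count_lut_functions(n: int) -> int:
    """Closed form: 2**n memory bits, each independently 0 or 1."""
    return 2 ** (2 ** n)


def enumerate_lut_functions(n: int) -> set[tuple[int, ...]]:
    """Every possible memory content of an n-LUT, as a tuple of 2**n bits."""
    return set(product((0, 1), repeat=2 ** n))


for n in (1, 2, 3):
    assert len(enumerate_lut_functions(n)) == count_lut_functions(n)

print(count_lut_functions(4))  # 65536
```

The count grows doubly exponentially, so enumeration is only practical for very small N.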

21 Granularity of Computation
Trade-offs associated with LUT size. Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits). [Figure: a 1024-bit area filled with 2-LUTs beside a single 1024-bit 10-LUT. First example: microprocessor.]

22 Granularity of Computation
Trade-offs associated with LUT size, microprocessor example (slides 22 to 25 build up the same figure step by step). [Figure: blocks labeled "op" with inputs A and B, with signal widths 4 and 3 marked, drawn for both the 2-LUT area and the 10-LUT area.]

26 Granularity of Computation
Trade-offs associated with LUT size. Example: 2-LUT (4 = 2x2 bits) vs. 10-LUT (1024 = 32x32 bits). Second example: bit logic and constants.

27 Granularity of Computation
Bit logic and constants example: (A and “1100”) or (B or “1000”)

28 Granularity of Computation
Bit logic and constants example: (A and “1100”) or (B or “1000”), with inputs A and B drawn on the 2-LUT side.

29 Granularity of Computation
Bit logic and constants example: (A and “1100”) or (B or “1000”). [Figure: AND and OR blocks on 4-bit A and B; the area that was required using 2-LUTs is marked against the 1024-bit 10-LUT area.] With 10-LUTs it is much worse: each 10-LUT has only one output.
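A rough storage comparison for the bit-logic example can be sketched as follows. Two assumptions are made here that are not on the slides: A and B are 4 bits wide, and the constants are folded into the LUT contents so that each output bit needs a single 2-LUT (a drawing with separate AND and OR stages would use three 2-LUTs per bit, or 48 bits in total). All names are hypothetical.

```python
MASK_A = 0b1100   # the constant ANDed with A
CONST_B = 0b1000  # the constant ORed with B
WIDTH = 4


def reference(a: int, b: int) -> int:
    """(A and "1100") or (B or "1000") on 4-bit values."""
    return (a & MASK_A) | (b | CONST_B)


def build_2lut_tables() -> list[list[int]]:
    """One 4-entry table per output bit, indexed by (a_i, b_i)."""
    tables = []
    for bit in range(WIDTH):
        mask_bit = (MASK_A >> bit) & 1
        const_bit = (CONST_B >> bit) & 1
        tables.append(
            [(ai & mask_bit) | (bi | const_bit) for ai in (0, 1) for bi in (0, 1)]
        )
    return tables


def evaluate_with_2luts(tables: list[list[int]], a: int, b: int) -> int:
    result = 0
    for bit, table in enumerate(tables):
        ai, bi = (a >> bit) & 1, (b >> bit) & 1
        result |= table[(ai << 1) | bi] << bit
    return result


tables = build_2lut_tables()
matches = all(
    evaluate_with_2luts(tables, a, b) == reference(a, b)
    for a in range(16)
    for b in range(16)
)
bits_with_2luts = len(tables) * 2 ** 2   # four 2-LUTs, 4 bits each
bits_with_10luts = WIDTH * 2 ** 10       # one single-output 10-LUT per result bit
print(matches, bits_with_2luts, bits_with_10luts)  # True 16 4096
```

Because a 10-LUT has a single output, the 4-bit result costs four full 10-LUTs even though each output bit depends on only two input bits.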

30 Computational Fabric - DFF
[Figure: 4-LUT with inputs A, B, C, D and output Z.] LUTs are fine for implementing any arbitrary combinational logic function (output is ONLY a function of its inputs). But what about sequential logic (output is a function of the inputs AND previous state information)? Need memory!

31 Computational Fabric - DFF
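DFF = D Flip-Flop. [Figure: 4-LUT (inputs A, B, C, D) paired with a DFF; signals labeled Z(t) and Z(t+1).] Example: detect the pattern “1101”. [Figure: state diagram with states Start, 1, 11, 110, 1101; edges labeled input/output, e.g. 1/0, 0/0, 1/1.]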
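The “1101” detector can be sketched as a small Mealy machine, where the state plays the role of the DFF contents and the transition table plays the role of the LUT. The state encoding and the choice to allow overlapping matches are assumptions; the slide's diagram may differ in both.

```python
# Transition table: (state, input_bit) -> (next_state, output_bit).
# State names record the longest prefix of "1101" seen so far.
TRANSITIONS = {
    ("start", 0): ("start", 0),
    ("start", 1): ("1", 0),
    ("1", 0): ("start", 0),
    ("1", 1): ("11", 0),
    ("11", 0): ("110", 0),
    ("11", 1): ("11", 0),     # still ends in "11"
    ("110", 0): ("start", 0),
    ("110", 1): ("1", 1),     # pattern complete; final 1 may start a new match
}


def detect_1101(bits: str) -> list[int]:
    """Return one output bit per input bit; 1 marks the end of a "1101"."""
    state = "start"           # models the value held in the flip-flops
    outputs = []
    for ch in bits:
        state, out = TRANSITIONS[(state, int(ch))]
        outputs.append(out)
    return outputs


print(detect_1101("1101101"))  # [0, 0, 0, 1, 0, 0, 1]
```

In hardware, four states fit in two flip-flops, and the next-state and output logic are each a function of three bits (two state bits plus the input), which fits comfortably in 4-LUTs.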

32 Computational Fabric - DFF
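DFF = D Flip-Flop. [Figure: 4-LUT (inputs A, B, C, D) paired with a DFF; signals labeled Z(t) and Z(t+1).] DFFs can increase circuit performance (pipelining). [Figure: a chain of four 4-LUTs with no registers between them: 4 LUT delays per output. The same chain with a DFF after each 4-LUT: 1 DFF delay per output.]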

33 Communication: Interconnect & Routing
Need a mechanism to move results of computation around. [Figure: array of CLBs.]

34 Communication: Interconnect & Routing
Need a mechanism to move results of computation around. Nearest neighbor: [Figure: array of CLBs.]

35 Communication: Interconnect & Routing
Need a mechanism to move results of computation around. Nearest neighbor: [Figure: array of CLBs.] Segmented: [Figure: array of CLBs.]

36 Communication: Interconnect & Routing
Need a mechanism to move results of computation around. Nearest neighbor: [Figure: array of CLBs.] Segmented: [Figure: array of CLBs.] Hierarchical: [Figure: array of CLBs.]

37 Optimized Resources: Dedicated Logic
LUTs + DFFs can implement any arbitrary digital logic, but not optimally (ASICs make much better use of silicon area for power, speed, and routing resources). Optimized resources: Arithmetic (add, multiply); On-chip memory; System-on-chip building blocks (processor, PCI Express, Gigabit Ethernet, ADC, etc.)

38 Optimized Resources: Dedicated Logic
Fast addition: generate/propagate logic with carry in and carry out. [Figure: 6-LUTs taking A1..A3 and B1..B3 and producing Sum 1..Sum 3; a two-output LUT; carry look-ahead with signals G1, P1, G2, P2, Carry1, Carry 2, c4; dedicated routing resources for the carry within the CLB.]
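The generate/propagate idea behind fast addition can be sketched as below. This is a generic textbook carry look-ahead formulation, not the specific Xilinx carry-chain circuit; the 3-bit default width follows the A1..A3 labels in the figure, and the function names are hypothetical.

```python
def lookahead_carry(g: list[int], p: list[int], carry_in: int, i: int) -> int:
    """Carry into bit i as a flat sum of products (no rippling).

    c_i = OR over j < i of (g_j AND p_{j+1} AND ... AND p_{i-1})
          OR (carry_in AND p_0 AND ... AND p_{i-1})
    """
    carry = 0
    for j in range(i):
        term = g[j]                 # carry generated at bit j ...
        for k in range(j + 1, i):
            term &= p[k]            # ... and propagated through bits j+1..i-1
        carry |= term
    term = carry_in
    for k in range(i):
        term &= p[k]                # carry_in propagated through every lower bit
    return carry | term


def cla_add(a: int, b: int, width: int = 3, carry_in: int = 0) -> tuple[int, int]:
    """Add two width-bit numbers; return (sum, carry_out)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]  # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]  # propagate
    total = 0
    for i in range(width):
        total |= (p[i] ^ lookahead_carry(g, p, carry_in, i)) << i
    return total, lookahead_carry(g, p, carry_in, width)


print(cla_add(5, 3))  # (0, 1): 5 + 3 = 8, which is 0 with a carry out in 3 bits
```

Each carry depends only on the g and p signals and the carry in, so in hardware all carries can be computed in parallel instead of waiting for a ripple through every bit.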

39 Optimized Resources: Dedicated Logic
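Embedded memory: 96 bits, 300 MHz. [Figure: memory array with dimensions 8 and 12 marked (8 x 12 = 96 bits).]

40 Optimized Resources: Dedicated Logic
Embedded memory: 18 Kbits, 550 MHz, using a dedicated memory block. [Figure: the same 8 and 12 markings, now with a dedicated memory block.]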

41 Optimized Resources: Dedicated Logic
Multiplication: 18x18 multiply on Virtex-5 (6-LUTs)
Type                         # LUTs   Latency   Speed
LUT                          ~400     5 clks    380 MHz
Dedicated 18x18 Multiplier   -        3 clks    450 MHz
Very rough estimate of silicon area comparison (assuming the SX95 and LX110 have about the same die size): [Figure: a group of 6-LUTs next to an 18x18 multiplier.] In other words, you can replace one LUT-based 18x18 multiplier with 100 dedicated 18x18 multipliers!

42 Optimized Resources: Dedicated Logic
Processor. PowerPC hard core: 500 MHz, superscalar, high-speed 2x5 switch fabric. MicroBlaze soft core: 250 MHz, simple scalar.

43 Optimized Resources: Dedicated Logic
System on chip. [Figure: block diagram with labels Dedicated Logic, Reconfigurable Logic, RAM, ADC, Sensor, Matrix Multiplier Coprocessor, Sensor, Motor, Data Buffer, PID Controller, Ethernet MAC.] Also see Actel Fusion:

44 Xilinx CLB Architecture
Virtex 5 FPGA User Guide

45 Questions/Comments/Concerns

46 Computational Fabric - LUT
N-LUT: 3, 4, … 6, … 8-LUT. AND, XOR, NOT.
Exercises:
How many 4-LUTs to OR 32 bits? (draw)
How many 4-LUTs to AND 2 bits with the OR of these 32 bits? (draw)
Draw the truth table for the 4-LUT that gives the final output.
How could one implement a LUT? (Memory + MUX)
How many ways can a 4-LUT be programmed?
How many ways can an N-LUT be programmed?
Granularity trade-off: functionality vs. propagation delay (2-LUT -> CPU), bit-level vs. datapath.

47 Computational Fabric - DFF
Enables building circuits that can store information (sequential circuits, state machines). Enables pipelining to increase operating frequency/throughput.

48 Communication: Interconnect & Routing
Need a mechanism to move the results of a LUT to other LUTs.
Island style (Array of CB)
Nearest neighbor (paper on a reconfigurable architecture that uses this): not scalable (large delays, and uses logic elements for routing?)
Segmented (different lengths for latency trade-off): multi-hop scales < O(N)? Avoids using logic.
Hierarchical (good for apps with lots of local communication and little remote communication)
Typically, an FPGA's silicon area will be 10% logic and 90% interconnect!

49 Optimized Resources: Hard Cores
LUTs + DFFs can implement any arbitrary digital logic, but not optimally (ASICs make much better use of silicon area for power, speed, and routing resources). Hard cores: Arithmetic (add, mult); On-chip memory; System-on-chip building blocks (processor, PCI Express, Gigabit Ethernet, A/D)

