Presentation is loading. Please wait.

Presentation is loading. Please wait.

CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #7 – Applications.

Similar presentations


Presentation on theme: "CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #7 – Applications."— Presentation transcript:

1 CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #7 – Applications I

2 Lect-07.2CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Recap – Fixed-Point Arithmetic Addition, subtraction the same (Q4.4 example): Multiplication requires realignment: 3.6250 0011.1010 + 2.8125 0010.1101 6.4375 0110.0111 3.6250 0011.1010 x 2.8125 0010.1101 00111010 10.1953125 1010.00110010

3 Lect-07.3CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Recap – Capacity Trends Year 1985 Xilinx Device Complexity XC2000 50 MHz 1K gates XC4000 100 MHz 250K gates Virtex 200 MHz 1M gates Virtex-II 450 MHz 8M gates Spartan 80 MHz 40K gates Spartan-II 200 MHz 200K gates Spartan-3 326 MHz 5M gates 19911987 XC3000 85 MHz 7.5K gates Virtex-E 240 MHz 4M gates XC5200 50 MHz 23K gates 199519981999200020022003 Virtex-II Pro 450 MHz 8M gates* 20042006 Virtex-4 500 MHz 16M gates* Virtex-5 550 MHz 24M gates*

4 Lect-07.4CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Recap – FPGA Memory Resources Distributed LUT-based RAM Block Select+RAM

5 Lect-07.5CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Recap – Block Multipliers Block multipliers (18b x 18b) arranged in columns near RAM

6 Lect-07.6CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Recap – Block Multipliers (cont.) Synthesis tools can take larger multipliers and break them down into 18x18 multipliers

7 Lect-07.7CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Recap – FPGA CPU Resources Built-in PowerPC 405 processor blocks

8 Lect-07.8CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Recap – Xilinx Virtex-4 FPGAs Comes in three varieties: Virtex-4 LX: most amount of LUTs Virtex-4 FX: has PowerPCs like V2P Virtex-4 SX: contains most amount of XtremeDSP slices CLB structure similar to Virtex-II Largest LX device – 89,088 slices = 178,176 4-LUTs! FX devices limited to 2 PPC 405s like Virtex-II Pro XTremeDSP Slices: Same 18x18 block multiplier, now with optional pipelining Includes built-in 48-bit accumulator for MAC operations

9 Lect-07.9CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Recap – Xilinx Virtex-5 FPGAs CLB slices uses 6-input LUTs Block RAMs now 36Kbits per block DSP slices now support 25x18 MAC Diagonal routing LX, LXT, SXT, FXT varieties

10 Lect-07.10CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Outline Recap Splash / Splash 2 System Architecture Splash 2 Programming Models Splash 2 Applications Text searching Genetic pattern matching Image processing

11 Lect-07.11CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Overview An early well-known reconfigurable computer was Splash / Splash 2 Implemented as linear, systolic array Developed at Supercomputing Research Center (1990-1994) Memory tightly coupled with each FPGA Multiple Splash boards can be combined to form larger system

12 Lect-07.12CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Splash 1 Born to solve DNA string matching Operational in 1989 Features 32 Xilinx XC3090 FPGAs (420 Mλ 2 ) 32 128KB SRAMs (600 Mλ 2 ) VMEbus interface (FIFO 1 MHz clock) 33 Gλ 2 total (not counting interconnect) Linear interconnect only

13 Lect-07.13CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Splash 1 Architecture F3F3 M3M3 VME Bus Interface VSB Bus Control Interface FIFO IN FIFO OUT M4M4 F4F4 F 11 M 11 M 12 F 12 F0F0 M0M0 M7M7 F7F7 F8F8 M8M8 M 15 F 15 F1F1 M1M1 M6M6 F6F6 F9F9 M9M9 M 14 F 14 F2F2 M2M2 M5M5 F5F5 F 10 M 10 M 13 F 13 F 31 M 31 M 24 F 24 F 23 M 23 M 16 F 16 F 28 M 28 M 27 F 27 F 20 M 20 M 19 F 19 F 29 M 29 M 26 F 26 F 21 M 21 M 18 F 18 F 30 M 30 M 25 F 25 F 22 M 22 M 17 F 17

14 Lect-07.14CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Splash 2 Operational in 1992 Attached processor system Features 17 Xilinx XC4010 FPGAs (500 Mλ 2 ) 17 512KB SRAMs (2 Gλ 2 ) 9 TI SN74ACT8841s (16 port, 4b crossbar) Sbus interface (< 100MB/s) 43 Gλ 2 total (not counting interconnect) Supported 2 programming models: Systolic SIMD

15 Lect-07.15CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Splash 2 Architecture M1M1 M4M4 M3M3 M2M2 M5M5 M8M8 M7M7 F1F1 F4F4 F3F3 F2F2 F5F5 F8F8 F7F7 F6F6 M6M6 F0F0 M0M0 M 16 F 16 M 13 F 13 M 14 F 14 M 15 F 15 M 12 F 12 M9M9 F9F9 M 10 F 10 M 11 F 11 Crossbar Switch From Prev Board To Next Board

16 Lect-07.16CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Splash 2 Architecture (cont.) Sparc-based Host Interface Board S-Bus External Input External Output Array Boards R-Bus SIMDS-Bus

17 Lect-07.17CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Comparing Splash 1 and 2 Year Splash 1Splash 2 FPGA 4-LUTs / Board Max Input Rate 19891992 XC3090XC4010 10,24013,600 1 MB/s100 MB/s InterconnectLinear Array Memory / FPGA128 KB Broadcast Area33 Gλ 2 Crossbar 512 KB 43 Gλ 2

18 Lect-07.18CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Splash 2 Programming Models Linear (systolic) array All near-neighbor communication, pipelined Very fast (at the time) speed of 20-30MHz achieved All FPGAs run the same program SIMD array Instructions fanned out to all processing elements Data across all elements collected at the end. F1F1 F 16 F3F3 F2F2 … F1F1 F3F3 F2F2 … F0F0 F3F3 F3F3 outin outin

19 Lect-07.19CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Programming Splash 2 Languages VHDL dbC – a C-like language capable of SIMD instructions How many programs? Host Interface board (XL and XR) Splash array board X0 X1 – X16 TI crossbar chips Five programs! Requires intimate knowledge of the architecture

20 Lect-07.20CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Application #1 – Dictionary Search Search through dictionary of words for data hit Applicable to internet search engines / databases Opportunities for search parallelism Splash implementation uses systolic communication

21 Lect-07.21CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Dictionary Search (cont.) Each FPGA used to look into local memory Longer data words “hashed” into 18 bit address Valid bit in memory indicates if data value is currently stored Could be stored in several locations FiFi MiMi F i-1 M i-1 F i+1 M i+1

22 Lect-07.22CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Example Hash Function XOR two character value with temp result and hash function Rotate result Different hash function for each FPGA Shift amount: 7 bits Hash function: 1100 1000 1010 0011 00 0000 0000 0000 0000 0000 Clear hash register 01 1010 0001 1101 00 Input the letters “th” --------------------------------- 10 1000 0011 0101 1100 0000 Temporary Result 10 0000 0101 0000 0110 1011 Result for “th” 00 0000 0001 1001 01 Input for letters “e_” ----------------------------------------- 01 0010 0110 0001 1110 1011 Temporary result 10 0101 1010 0100 1100 0011 Result for “the_”

23 Lect-07.23CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Dictionary Search (cont.) Distribute dictionary in parallel to all memories Collect word values in FIFOs Distribute words two characters at a time across all devices Perform local hashing and lookup in parallel Collect “hit” result at end Splash 2 implementation results 25 MHz Three phases needed Fetch 2 bit-sliced characters Perform hash Table look-up Takes advantage of both systolic and SIMD modes

24 Lect-07.24CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Genetic Pattern Matching Comparing strings by edit distance Motivation: The Human Genome Project Do two genetic strings match? How are they related? When biologists characterize a new sequence, they want to compare it to the (growing) database of known sequences Abstraction: What is the cost of transforming s into t Given – costs for insertion, deletion, substitution

25 Lect-07.25CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Alphabet and Costs Alphabet Letters in the string. For DNA, there are four: A (Adenine) C (Cytosine) T (Thymine) G (Guanine) Transformation Costs Insert:1, Delete:1, Substitute:2, match:0 Type of comparison One target to many sources One target to one source

26 Lect-07.26CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Substitution Example baboon CostMoveWord bab|on bobon boubon bourbon Delete ‘o’ Substitute ‘o’ Insert ‘u’ Insert ‘r’ Match? Total cost: 1 2 1 1 0 5bourbon

27 Lect-07.27CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Dynamic Programming Solution Source sequence: s 1, s 2, … s m Target sequence: t 1, t 2, … t n d i,j = distance between subsequence s 1, s 2, …s i and subsequence t 1, t 2, …t j, where d 0,0 = 0 d i,0 = d i-1,0 + Delete(s i ) d 0,j = d 0,j-1 + Insert(t j ) d i-1,j + Delete(s i ) d i,j = min d i,j-1 + Insert(t j ) d i-1,j-1 + Substitute(s i, t j ) Distance(Source, Target) = d m,n

28 Lect-07.28CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Dynamic Programming Example b a b o o n 0 1 2 3 4 5 6 b 1 0 1 2 2 3 4 o 2 1 2 2 2 2 3 u 3 2 3 3 3 3 4 r 4 3 4 4 3 4 5 b 5 3 4 4 4 5 5 o 6 4 5 5 4 5 6 n 7 5 6 6 5 6 5

29 Lect-07.29CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Parallelism on the Anti-Diagonal b a b o o n 0 1 2 3 4 5 6 b 1 0 1 2 2 3 4 o 2 1 2 2 2 2 3 u 3 2 3 3 3 3 4 r 4 3 4 4 3 4 5 b 5 3 4 4 4 5 5 o 6 4 5 5 4 5 6 n 7 5 6 6 5 6 5 di,jdi,j d i-1, j d i, j-1 d i-1, j-1

30 Lect-07.30CprE 583 – Reconfigurable ComputingSeptember 11, 2007 SCin SDin TCout TDout If (SCin != 0) and (TCin != 0) PEDist + Substitute(SCin, TCin) PEDist  min TDin + Delete(SCin) SDin + Insert(TCin) else-if (SCin != 0) PEDist  SDin else-if (TCin != 0) PEDist  TDin endif SCout  SCin TCout  TCin SDout  PEDist TDout  PEDist Bidirectional PE Example Bidir PE PEDist Bidir PE PEDist Bidir PE PEDist SCout SDout TCin TDin

31 Lect-07.31CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Bidirectional Summary 16 CLBs/PE 384 PEs/Board 2,100 Million Cells/sec Requires 2*(m+n) PEs Uses only half the processors at any one time Must stream both source and target for each comparison Makes comparison against large DB impractical

32 Lect-07.32CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Genetic Search Performance Nearly linear scaling in cell updates per second (CUPS) Need to reuse array for large patterns HardwareCUPS Splash 2 x1643,000M Splash 23,000M Splash 1370M P-NAC (34)500M λ 0.60μ 2.0μ CM-2 (64K)150M CM-5 (32)33M SPARC 101.2M SPARC 10.87M ? ? 0.40μ 0.75μ Area 500Mλ 2 x17x16 500Mλ 2 x16 420Mλ 2 x32 7.8Mλ 2 x34 1.6GMλ 2 273Mλ 2 CUP/λ 2 s 0.32 0.38 0.028 1.9 0.00075 0.0032

33 Lect-07.33CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Application #3 – Image Processing Reconfigurable computers well suited to image processing due to high parallelism and specialization (filtering) Algorithms change sufficiently fast such that ASIC implementations become outdated Examine two issues with Splash Image compression Image error estimation Parallelize across array in SIMD and systolic fashion

34 Lect-07.34CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Gaussian Pyramid Operations Gaussian Pyramid Down sample image to compress image size for communication Average over a set of points to create new point Laplacian Pyramid Determine error found from Gaussian Pyramid Expand contracted picture and compare with original

35 Lect-07.35CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Gaussian Pyramids Systolic array in which each device performs a separate function Limited by clock rate of slowest device

36 Lect-07.36CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Pyramid Flow Generates both Gaussian and Laplacian pyramid for 512 x 480 image in 22.7 ms at 15.7 MHz

37 Lect-07.37CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Other Image Processing Target recognition Break image into “chips” Each chip passed through linear array in attempt to match with stored image Images can be rotated, mirrored Zoom in if suspicious object found

38 Lect-07.38CprE 583 – Reconfigurable ComputingSeptember 11, 2007 Summary Splash 2 effective due to scalability and programming model Parameterizable applications benefit that are regular and distributed High bandwidth effective for searching/signal processing Challenges remain in software development


Download ppt "CprE / ComS 583 Reconfigurable Computing Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University Lecture #7 – Applications."

Similar presentations


Ads by Google