Download presentation
Presentation is loading. Please wait.
Published byTracy Bradford Modified over 9 years ago
1
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. A Program Transformation Approach to High Performance Embedded Computing using the SRC MAP Compiler Wim Bohm, Colorado State University and Jeff Hammes, SRC Computers, Inc.
2
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. SRC–6 MAP ® System SRC-6 MAP –FPGA based High Performance architecture –Fortran / C compiler for the whole system One Node: –Microprocessor –MAP reconfigurable hardware board –SNAP μproc and MAP interconnected via DIM slot –GPIO ports allow connection to other MAPs –PCI-X can connect to other μprocs Multiple configurations / implementations –this talk: MAPstation - one node MAP C Compiler –Compiler generates both μproc and MAP code –user partitions μproc, MAP tasks PCI-X MAPstation MAP PPPP Memory SNAP™ GPIOPorts
3
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. MAP ® board architecture Six Banks Dual-ported On-Board Memory (24 MB) 4800 MB/s (6 x 64b) 4800 MB/s 192b 2400 MB/s each GPIO 4800 MB/s (6 x 64b) ControlFPGA User FPGA 0 User FPGA 1 108b 4800 MB/s (6 x 64b) 108b 1400 MB/s sustained payload MAP Direct Execution Logic (DEL) made up of one or more User FPGAs Control FPGA performs off board memory access Multiple banks of On-Board Memory maximize local memory bandwidth GPIO ports allow direct MAP to MAP chain connections or direct data input Multiple parallel data transports: –Distributed SRAM in FPGA 264 KB @ 844 GB/s264 KB @ 844 GB/s –Block SRAM in FPGA 648 KB @ 260 GB/s648 KB @ 260 GB/s –On-Board SRAM Memories 28 MB @ 9.6 GB/s28 MB @ 9.6 GB/s –Microprocessor Memory 8 GB @ 1400 MB/s8 GB @ 1400 MB/s 1400 MB/s sustained payload Dual-portedMemory (4 MB)
4
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. MAP Programmers View 4800 MB/s (6 x 64b) 4800 MB/s 192b 2400 MB/s each GPIO 4800 MB/s (6 x 64b) ControlFPGA User FPGA 0 User FPGA 1 108b 4800 MB/s (6 x 64b) 108b 1400 MB/s sustained payload MAP 1400 MB/s sustained payload Dual-portedMemory (4 MB) OBMAOBMBOBMCOBMDOBMEOBMF
5
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. MAP C compiler Pure C runs on the MAP !! MAP C Compiler –Intermediate form: dataflow graph of basic blocks –Generated code: circuits Basic blocks in outer loops become special purpose hardware “function units”Basic blocks in outer loops become special purpose hardware “function units” Basic blocks in inner loop bodies are merged and become pipelined circuitsBasic blocks in inner loop bodies are merged and become pipelined circuits Sequential semantics obeyed –One basic block executed at the time –Pipelined inner loops are slowed down to disambiguate read/write conflicts if necessary –MAP C compiler identifies (cause of) loop slowdown
6
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Execution Modes DEBUG Mode –code runs on workstation –allows debugging ( printf ) –allows most performance tuning (avoiding loop slow downs) –user spends most time here Two SIMULATION Modes –Dataflow level and Hardware level –mostly used by compiler / hardware function unit developers –very fine grain information HARDWARE Mode –final stage of code development –allows performance tuning using timer calls
7
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Transformational Approach Start with pure C code Partition Code and Data –distribute data over OBMs and Block RAMs –distribute code over two FPGAs only one chip at the time can access a particular OBMonly one chip at the time can access a particular OBM MPI type communication over the bridgeMPI type communication over the bridge Performance tune (removing inefficiencies) –avoid re-reading of data from OBMs using Delay Queues –avoid read / write conflicts in same iteration –avoid multiple accesses to a memory in one iteration –avoid OBM traffic by fusing loops Today’s transformation is tomorrow’s compiler optimization
8
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. How to performance tune: Macros C code can be extended using macros allowing for program transformations that cannot be expressed straightforwardly in C Macros have semantics unlike C functions –have a period (#clocks between inputs) –have a pipeline delay (#clocks between in and output) –MAP C compiler takes care of period and delay –can have state (kept between macro calls) –two types of macros system providedsystem provided –compiler knows their period and delay user provided (written in e.g. Verilog )user provided (written in e.g. Verilog ) –user needs to provide period and delay
9
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Two Case Studies Wavelet Versatility Benchmark –Image processing application (wavelet compression) –Part of DARPA/ITO ACS (Adaptive Computing Systems) benchmark suite –Versatile: Four phases of different computational nature 1: wavelet transform: window access, multiple outputs 1: wavelet transform: window access, multiple outputs 2: quantization: sum, min, max reductions 2: quantization: sum, min, max reductions 3: run length encoding: while loop, irregular output 3: run length encoding: while loop, irregular output 4: Huffman encoding: table lookups 4: Huffman encoding: table lookups Gauss Seidel Linear Equation Solver –Numerical (Floating Point) kernel –Iterative nature: non-perfect loop structure –Many applications
10
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Wavelet Versatility Benchmark Wavelet transform –Applied three times Second and third passes use upper left quadrant of previous passSecond and third passes use upper left quadrant of previous pass –L: Low pass filter (average) –H: High pass filter (derivative) Wavelet does not compress but enables compression in further stages (many 0s in H) –Quantization –Run-Length Encoding –Huffman Encoding HL LHHH HL LHHH HL LHHH LL
11
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. First wavelet step
12
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Second wavelet step
13
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Final wavelet step
14
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. MAP C Algorithm One 5x5 window stepping by 2 in both directions One 5x5 window stepping by 2 in both directions –Computes LL, LH, HL, and HH simultaneously –Inefficiency: naive first implementation re-accesses overlapping image elements LLHL LHHH
15
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Efficient Window Access Keep data on chip using Delay Queues –E.g. 16 deep (using efficient hardware SLR16 shifters) Simplified example: –3x3 window stepping 1 by 1stepping 1 by 1 in column major orderin column major order –image 16 deep general case dividesgeneral case divides the Image in 16 deep strips the Image in 16 deep strips 9 points in stencil 8 have been seen before Per window the leading point should be the only data access. input array traversal
16
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Delay Queues 1 two 16–word Delay Queues shift and store previous columns
17
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Delay Queues 2f1 f7 f8 f1 f4 f5 f6 f1 f0 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f0 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 Compute f(x 1,x 2,..x 9 ) Data access input(i)
18
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Delay Queues 3f1 f7 f8 f1 f4 f5 f6 f0 f1 f2 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f2 f1 f0 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 Compute f(x 1,x 2,..x 9 ) Data access input(i)
19
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Delay Queues 4f1 f7 f8 f1 f4 f5 f6 f1 f2 f3 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f3 f2 f1 f0 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 Compute f(x 1,x 2,..x 9 ) Data access input(i)
20
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Delay Queues 5f1 f7 f8 f1 f4 f5 f6 f2 f3 f4 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f4 f3 f2 f1 f0 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 Compute f(x 1,x 2,..x 9 ) Data access input(i)
21
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. … Delay Queues 16f0 f7 f8 f1 f4 f5 f0 f13 f14 f15 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f15 f14 f13 f12 f11 f10 f9 f8 f7 f6 f5 f4 f3 f2 f1 f0 Compute f(x 1,x 2,..x 9 ) Data access input(i)
22
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Delay Queues 17f0 f7 f8 f1 f4 f0 f0 f14 f15 f16 f0 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 Compute f(x 1,x 2,..x 9 ) f16 f15 f14 f13 f12 f11 f10 f9 f8 f7 f6 f5 f4 f3 f2 f1 Data access input(i)
23
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Delay Queues 18f1 f7 f8 f1 f0 f0 f1 f15 f16 f17 f0 f0 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 f1 Compute f(x 1,x 2,..x 9 ) f17 f16 f15 f14 f13 f12 f11 f10 f9 f8 f7 f6 f5 f4 f3 f2 Data access input(i)
24
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. …Delay Queues 32f15 f7 f8 f1 f13 f14 f15 f29 f30 f31 f14 f13 f12 f11 f10 f9 f8 f7 f6 f5 f4 f3 f2 f1 f0 Compute f(x 1,x 2,..x 9 ) f31 f30 f29 f28 f27 f26 f25 f24 f23 f22 f21 f20 f19 f18 f17 f16 Data access input(i)
25
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. …Delay Queues 35f18 f0 f1 f2 f16 f17 f18 f32 f33 f34 f17 f16 f15 f14 f13 f12 f11 f10 f9 f8 f7 f6 f5 f4 f3 Compute f(x 1,x 2,..x 9 ) f34 f33 f32 f31 f30 f29 f28 f27 f26 f25 f24 f23 f22 f21 f20 f19 Data access input(i)
26
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Delay Queues 36f19 f1 f2 f3 f17 f18 f19 f33 f34 f35 f18 f17 f16 f15 f14 f13 f12 f11 f10 f9 f8 f7 f6 f5 f4 Compute f(x 1,x 2,..x 9 ) f35 f34 f33 f32 f31 f30 f29 f28 f27 f26 f25 f24 f23 f22 f21 f20 Data access input(i)
27
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Delay Queues 37f20 f2 f3 f4 f18 f19 f20 f34 f35 f36 f19 f18 f17 f16 f15 f14 f13 f12 f11 f10 f9 f8 f7 f6 f5 Compute f(x 1,x 2,..x 9 ) f36 f35 f34 f33 f32 f31 f30 f29 f28 f27 f26 f25 f24 f23 f22 f21 Data access input(i)
28
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Delay Queues: Performance 3x3 window access code 512 x 512 pixel image Routine StyleNumberComment of clocks Straight Window2,376,617close to 9 clocks per iteration 2,340,900: the difference 2,340,900: the difference is pipeline prime effect is pipeline prime effect Delay Queue279,999close to1 clock per iteration 262144: theoretical limit 262144: theoretical limit FPGA timing behavior is very predictable FPGA timing behavior is very predictable
29
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Wavelet Benchmark cont’ Rest of the code: –Quantize each block in 16 bins per block –Run Length Encode zeroes Occur frequently in derivative blocksOccur frequently in derivative blocks –Huffman Encode Three transformations –Fuse the three loops avoiding OBM traffic –Use accumulator macros to avoid R / W conflicts (see Gauss Seidel case study)(see Gauss Seidel case study) –Task parallelize the complete code over two FPGAs
30
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Versatility Benchmark: Performance 512x512 image Bit true results as compared to reference code Full implementation: All phases run on FPGAs Reference code compiled using Intel C compiler executed on 2.8 GHz Pentium IV: 76.0 milli-sec executed on 2.8 GHz Pentium IV: 76.0 milli-sec MAP execution time: 2.0 milli-sec MAP Speedup vs. Pentium 38
31
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Gauss Seidel Iterative Solver Scientific Floating Point Kernel (single precision for now) Works for diagonally dominant matrices Some math manipulation to create an iterative solver: Ax = b (L+D+U)x = b x = D -1 b-D -1 (L+U)x x n+1 = (Ab)x n Ax = b (L+D+U)x = b x = D -1 b-D -1 (L+U)x x n+1 = (Ab)x n while(maxerror > tolerance) { // do a next iteration while(maxerror > tolerance) { // do a next iteration maxerror = 0.0; maxerror = 0.0; for(i=0;i<n;i++) { // compute new x[ i ] for(i=0;i<n;i++) { // compute new x[ i ] sxi = x[ i ]; sxi = x[ i ]; xi = 0.0; xi = 0.0; for(j=0;j<n+1;j++) for(j=0;j<n+1;j++) xi += Ab[ i*COL+j ] * x[ j ]; // in product xi += Ab[ i*COL+j ] * x[ j ]; // in product error = abs(xi – sxi); error = abs(xi – sxi); } maxerror = max(maxerror, error); maxerror = max(maxerror, error); }
32
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Pure C Loop Slowdown caused by Read/Writeconflict
33
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Accumulator MacroHardwareAccumulatormacroresolves read / write conflict
34
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Packing the data 64 bit OBM word can contain two floats This requires unrolling j loop which now accesses Ab i j Ab i j+1 X j X j+1 To avoid multiple Block RAM reads, X is stripe partitioned over two Block RAM arrays
35
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Pack Abstracted
36
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Data Partitioned into 3 blocks Ab is now row-blockdistributed (3 blocks in 3 OBMs) 3 OBMs) j loop now computes 3 new Xs
37
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Two FPGAs Ab is row block distributed (6 blocks in 6 OBMs) The j-loops perform 24 Floating Ops in each clock FPGA0 and FPGA1 exchange 3 Xs, 1 error
38
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Gauss Seidel Performance n=500n=1000n=2000 n=500n=1000n=2000 No. Iterations2767 Pentium IV 0.19 secs0.48 secs1.90 secs (2.8 GHz )65 MFlops26 MFlops28 MFlops MAP 0.013 secs0.008 secs0.031 secs 830 MFlops1.23 GFlops1.65 GFlops 830 MFlops1.23 GFlops1.65 GFlops MAP speedup vs. Pentium 14 57 60
39
HPEC 2004 Copyright © 2004 SRC Computers, Inc.ALL RIGHTS RESERVED. Conclusions High Level Algorithmic Language runs on FPGA based HPEC system –DEBUG Mode allows most development on workstation We can apply standard software design methodologies –stepwise refinement currently using macroscurrently using macros later using (user directed?) compiler optimizationslater using (user directed?) compiler optimizations Bandwidth is key to FPGA performance –Often, more operations are available in the FPGA fabric than can be supplied by the available off-chip I/O –FPGA capability is improving rapidly Currently speedups ~50 vs. Pentium IV Future: Multiple MAPs –More complex, streaming applications
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.