Published by Andrea Armstrong. Modified over 9 years ago.
1
Development and Maintenance of User Libraries for SRC Reconfigurable Computers

Kris Gaj (1), Tarek El-Ghazawi (2), Paul Gage (3), Dan Poznanovic (3), Chang Shu (1), Deapesh Misra (1), Miaoqing Huang (2), Esam El-Araby (2), Mohamed Taher (2)

(1) George Mason University
(2) The George Washington University
(3) SRC Computers, Inc.

Gaj, MAPLD 2005/1016
2
Reconfigurable Computers
3
What is a reconfigurable computer?
[Figure: a microprocessor system (μP, μP memory, I/O, interface) shown next to an FPGA system (FPGA, FPGA memory, I/O, interface).]
4
Examples of High-End Reconfigurable Computers
- SRC-6E and SRC Hi-Bar Based Systems from SRC Computers, Inc.
- Cray XD1 (formerly OctigaBay 12K) from Cray Inc.
- SGI Altix 3000 from Silicon Graphics
- Star Bridge Hypercomputer from Star Bridge Systems
5
SRC MAP™ Reconfigurable Processor
Source: [SRC, MAPLD04]
6
SRC-6E Hardware Architecture
7
SRC Hi-Bar Based Systems
- Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier
- Up to 256 input and 256 output ports
- Common Memory (CM) has controller with DMA capability
- Up to 8 GB DDR SDRAM supported per CM node
[Figure: SRC-6 system diagram. Labels: SRC Hi-Bar Switch, MAP®, μP, Memory, SNAP™, Common Memory, Chaining GPIO, PCI-X, Gig Ethernet etc., Customers' Existing Networks (Storage Area Network, Local Area Network, Wide Area Network), Disk.]
Source: [SRC, MAPLD04]
8
SRC Programming
[Figure labels: μP system, FPGA system; HLL (C), HDL (VHDL); Application Programmer, Library Developer.]
9
SRC Program Partitioning
- C function for μP
- C function for FPGAs
- VHDL macro for FPGAs
[Figure labels: μP system, FPGA system; HLL, HDL.]
10
Run Time Reconfiguration in SRC
Program in C or Fortran:
- Main program: Function_1(a, d, e), Function_2(d, e, f)
- Function_1: Macro_1(a, b, c), Macro_2(b, d), Macro_2(c, e)
- Function_2: Macro_3(s, t), Macro_1(n, b), Macro_4(t, k)
[Figure: FPGA contents after the Function_1 call: Macro_1 and Macro_2 blocks connected by signals a, b, c, d, e.]
11
SRC Development Environment
[Figure: compilation flow, shown without and with user macro sources. Labels: Application sources (.c or .f files), μP Compiler, MAP Compiler, Object files (.o files), HDL sources / User Macro Sources (.vhd or .v files), Logic synthesis (.edf files), Place & Route (.bin files), Configuration bitstreams, Linker, Application executable.]
12
Advantages of reconfigurable computers
- can be programmed by mathematicians themselves using traditional programming languages or GUI environments
- encourage innovation and experimentation
- general-purpose: cost distributed among multiple users with different needs
- behave like hardware:
  - parallel processing
  - distributed memory
  - specialized functional units, etc.
13
Conditions necessary for the success of reconfigurable computers
- ease of use of library macros and functions
- existence of comprehensive libraries of user macros and functions capable of running on FPGAs
- significant speed-ups (100x) of basic functions running on FPGAs compared to state-of-the-art microprocessors
14
Development and Maintenance of SRC Libraries
15
Structure of the macro repository
[Figure: directory tree. Top-level directories: common, rev_d, rev_e, rev_f. Macro directories: macro1, macro2, macro3. Files per macro: hdlfile, InfoFile, BlkBoxFile, DebugCodeFile, DataSheet.]
16
Macro Types
- common: macros that have no connections to external pins and do not depend on any FPGA-type-specific feature. This type of macro can be used on any MAP.
- rev_d: macros that have a specific dependency on the dual MAP.
- rev_e: macros that have a specific dependency on the single MAP.
- rev_f: macros that have a specific dependency on the compact MAP.
17
Files describing the macro
Platform independent:
- HDL file: macro.v or macro.vh. Verilog or VHDL code defining the macro.
- Debug Code File: macro.c. Provides the equivalent C functionality for the macro.
Platform dependent:
- Blk Box File: blackbox.v. Interface (black box) definition for the macro in Verilog.
- Data sheet file: datasheet. Contains the documentation for the macro.
- Info File: info. Info file entry for the given macro, containing macro type, latency, names of input/output/control signals, etc.
18
CVS repository
To properly manage a distribution of macros, a CVS repository must be set up. This allows source code changes to be controlled and permits multiple developers to work on the code.
19
The Installed Macro Library Structure
[Figure: directory tree. Top-level directories: map 3 (built for the Xilinx Virtex2) and map 4 (built for the Xilinx Virtex2Pro), each containing common, rev_d, rev_e. Contents: ngo (macro1, macro2, macro3, ...), blkbox.v (single blackbox file), macros.info (single info file).]
Obtained by running a special script developed by SRC.
20
Library Script
Usage: build_libs [OPTION]
    [-b, --branch br]              Specify CVS branch
    [-c, --checkout]               Checkout only
    [-d, --CVSROOT cvsroot]        Specify CVSROOT
    [-M, --MAP maptype]            Build for MAP maptype
    [-m, --module mod]             Build mod only
    [-r, --restart mmddyy-hhmm]    Restart previous build
    [-s, --step target]            Run build step target
    [-v, --version N.n]            Package as version N.n
    [-V, --vendor vend]            Specify distribution vendor
    [-w, --workspace path]         Create workspace in path
21
Building libraries
- build_libs checks out the library and performs a build in /var/tmp/builds, in a folder named with a time stamp (e.g. 080405-1705).
- If there is an error, check the file called 'output' in /var/tmp/builds, fix the error, and restart the build with:
    build_libs --restart 080405-1705
- You can also do a partial build, for example building only the library and not the CD:
    build_libs --step lib
- To build only a particular subset of a library, use a command such as:
    build_libs --module crypto
22
Structure for the repository of MAP C functions
[Figure: directory tree. Top-level directories: common, rev_d, rev_e, rev_f. Routine directories: routine1, routine2, routine3.]
23
Files describing the MAP C routine
- Source file: the .mc or .mf file defining the MAP routine.
- proto.h: provides a prototype of the MAP routine.
- Makefile: a standard Carte Makefile, with the exception that no BIN environment variable is provided.
- Docfile: provides man-page-format documentation of the MAP routine.
24
The Installed MAP Routine Library Structure
[Figure: directory tree. Top-level directories: map 3 and map 4, each containing common, rev_d, rev_e. Library files: lib1.a, lib1.so, lib2.a, lib2.so, ...]
25
Known problems: No support for variable size of operands
26
Problem statement
We would like to be able to create and maintain a library of generic components that work for various operand sizes.
Example: Basic arithmetic operations (addition, subtraction, multiplication, division) of multiprecision (n-bit) integers.
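To make the operand-size problem concrete, here is a minimal plain C sketch of multiprecision addition, where an n-bit integer is held as an array of 64-bit words. The function name `mp_add`, the word order (least-significant word first), and the use of unsigned words are assumptions made for illustration; they are not taken from the SRC or GMU/GWU libraries.

```c
#include <stddef.h>
#include <stdint.h>

/* Add two multiprecision integers of n 64-bit words each.
 * Words are stored least-significant first (an assumption of this sketch).
 * Writes n words to sum and returns the carry out of the top word (0 or 1). */
uint64_t mp_add(const uint64_t *a, const uint64_t *b, uint64_t *sum, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] + carry;   /* wraps only if a[i] is all ones and carry is 1 */
        uint64_t c1 = (t < carry);   /* carry out of the first addition */
        uint64_t s = t + b[i];
        uint64_t c2 = (s < t);       /* carry out of the second addition */
        sum[i] = s;
        carry = c1 | c2;             /* at most one of c1, c2 can be set */
    }
    return carry;
}
```

The word count n is a run-time parameter here, which is exactly what a fixed-width hardware macro cannot offer directly.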
27
Possible solutions
1. Fixed-size interface to a macro
   - using streams
   - without using streams
2. Variable-size interface to a macro cell
28
[Figure: a Process block with a 64-bit input and a 64-bit output.]
29
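Passing variable-size operands without streams

    for (i = 0; i < 3*N + 1; i++) {
        if (i < N) {
            /* feed one word of each operand per iteration */
            A_in = c[i];
            B_in = d[i];
        } else {
            /* operands exhausted: pad with zeros */
            A_in = 0;
            B_in = 0;
        }
        mul(i, A_in, B_in, &C_out);
        if (i > N)
            e[i-N] = C_out;   /* collect result words */
    }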
30
Passing variable-size operands using streams

    #pragma src section
    {
        for (i = 0; i < N; i++) {
            put_stream(&S0, A[i], 1);   // put A[i] to S0
            put_stream(&S1, B[i], 1);   // put B[i] to S1
        }
    }
    #pragma src section
    {
        mul(&S0, &S1, &S2);             // read from S0 and S1, write to S2
    }
    #pragma src section
    {
        for (i = 0; i < 2*N; i++)
            get_stream(&S2, &C[i]);     // take from S2 and write to C[i]
    }
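The stream version sends N words of each operand into the macro and reads 2N words of product back. A plain C reference with the same data shape can be useful for checking results on the microprocessor side. The sketch below uses the schoolbook algorithm, which is an assumption chosen for clarity and says nothing about how the hardware `mul` macro is built; `mul64` and `mp_mul` are hypothetical names, and words are assumed to be unsigned and stored least-significant first.

```c
#include <stddef.h>
#include <stdint.h>

/* 64 x 64 -> 128-bit product built from 32-bit halves (portable C99). */
static void mul64(uint64_t x, uint64_t y, uint64_t *hi, uint64_t *lo)
{
    uint64_t x0 = x & 0xFFFFFFFFu, x1 = x >> 32;
    uint64_t y0 = y & 0xFFFFFFFFu, y1 = y >> 32;
    uint64_t p00 = x0 * y0, p01 = x0 * y1, p10 = x1 * y0, p11 = x1 * y1;
    /* middle column: cannot overflow, each term is below 2^32 */
    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);
    *lo = (mid << 32) | (p00 & 0xFFFFFFFFu);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* c[0..2n-1] = a[0..n-1] * b[0..n-1]; c must not overlap a or b. */
void mp_mul(const uint64_t *a, const uint64_t *b, uint64_t *c, size_t n)
{
    for (size_t i = 0; i < 2 * n; i++)
        c[i] = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < n; j++) {
            uint64_t hi, lo;
            mul64(a[i], b[j], &hi, &lo);
            lo += c[i + j];
            hi += (lo < c[i + j]);   /* carry from adding the partial sum */
            lo += carry;
            hi += (lo < carry);      /* carry from adding the running carry */
            c[i + j] = lo;
            carry = hi;
        }
        c[i + n] = carry;            /* top word of this row */
    }
}
```

As in the stream example, the caller provides N words per operand and receives 2N result words.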
31
[Figure: Process block.]
32
Multiprecision Integer Library Generator
[Figure: the size of operands, N, is the input to the Multiprecision Integer Library Generator (C engine), which produces a C/VHDL wrapper, a black box, an info file, and an in-line MAP C function.]
33
Inline MAP C function for N=2

    int mul(int64_t *A, int64_t *B, int64_t *C, int N)
    {
        int64_t A0, A1;
        int64_t B0, B1;
        int64_t C0, C1, C2, C3;

        /* unpack the operand words into scalars */
        A0 = A[0]; A1 = A[1];
        B0 = B[0]; B1 = B[1];

        /* fixed-width 128-bit multiplier macro */
        Mul_128(A0, A1, B0, B1, &C0, &C1, &C2, &C3);

        /* pack the four result words */
        C[0] = C0; C[1] = C1; C[2] = C2; C[3] = C3;
        return 0;
    }
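The wrapper for N=2 follows a regular pattern, so a generator can produce it for any N. The following is a hedged sketch of what such a C engine might do for the in-line MAP C function only; it is not the authors' generator. The names `gen_mul_wrapper`, `emit`, and `emit_list` are hypothetical, the macro naming rule `Mul_<64*N>` is extrapolated from `Mul_128`, and the emitted signature is simplified to a `void` function without the N argument.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Output buffer with overflow tracking. */
typedef struct { char *buf; size_t cap; size_t len; int overflow; } out_t;

/* Append formatted text; sets the overflow flag if it does not fit. */
static void emit(out_t *o, const char *fmt, ...)
{
    va_list ap;
    int k;
    if (o->overflow) return;
    va_start(ap, fmt);
    k = vsnprintf(o->buf + o->len, o->cap - o->len, fmt, ap);
    va_end(ap);
    if (k < 0 || (size_t)k >= o->cap - o->len) o->overflow = 1;
    else o->len += (size_t)k;
}

/* Append "<prefix>0, <prefix>1, ..." with count entries. */
static void emit_list(out_t *o, const char *prefix, int count)
{
    for (int i = 0; i < count; i++)
        emit(o, "%s%s%d", i ? ", " : "", prefix, i);
}

/* Write a wrapper for n 64-bit words per operand into buf.
 * Returns the number of characters written, or -1 on error. */
int gen_mul_wrapper(char *buf, size_t cap, int n)
{
    out_t o = { buf, cap, 0, 0 };
    if (n < 1 || cap == 0) return -1;

    emit(&o, "void mul(int64_t *A, int64_t *B, int64_t *C)\n{\n");
    emit(&o, "    int64_t "); emit_list(&o, "A", n);     emit(&o, ";\n");
    emit(&o, "    int64_t "); emit_list(&o, "B", n);     emit(&o, ";\n");
    emit(&o, "    int64_t "); emit_list(&o, "C", 2 * n); emit(&o, ";\n");
    for (int i = 0; i < n; i++)
        emit(&o, "    A%d = A[%d]; B%d = B[%d];\n", i, i, i, i);
    emit(&o, "    Mul_%d(", 64 * n);   /* macro name pattern is an assumption */
    emit_list(&o, "A", n);      emit(&o, ", ");
    emit_list(&o, "B", n);      emit(&o, ", ");
    emit_list(&o, "&C", 2 * n); emit(&o, ");\n");
    for (int i = 0; i < 2 * n; i++)
        emit(&o, "    C[%d] = C%d;\n", i, i);
    emit(&o, "}\n");
    return o.overflow ? -1 : (int)o.len;
}
```

Calling `gen_mul_wrapper(buf, sizeof buf, 2)` fills `buf` with a function whose body matches the N=2 slide; a real generator would also have to emit the matching black box, info file entry, and HDL wrapper.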
34
Pros and cons of both methods
1. Fixed-size interface to a macro
   - Pros: interface independent of the operand size
   - Cons: input/output overhead
2. Variable-size interface to a macro cell
   - Pros: minimum overhead
   - Cons: need to generate automatically several macro files, need for changes in the compiler
35
GMU/GWU Libraries
36
Cryptographic Libraries
Secret Key Ciphers
- Secret key ciphers encryption and breaking: SecCiph
Public Key Ciphers
- Elliptic Curve Cryptosystems arithmetic: ECC
- Binary Galois Field GF(2^m) arithmetic in Polynomial Basis: GF2n_PB
- Binary Galois Field GF(2^m) arithmetic in Normal Basis: GF2n_NB
- Multiprecision integer arithmetic (in collaboration with University of South Carolina): Long_Int
- Operations supporting factorization of large integers using Number Field Sieve: NFS
37
Digital Image Processing Libraries
Image Enhancement / Restoration
- Single-Resolution
  - Noise Reduction (Convolution Filtering)
    - Smoothing (Lowpass)
    - Gaussian (Lowpass)
    - Blurring (Lowpass)
    - Sharpening (Highpass)
  - Edge Detection (Derivative Filters)
    - Prewitt
    - Sobel
- Multi-Resolution
  - Discrete Wavelet Transform (DWT)
  - Inverse Discrete Wavelet Transform (IDWT)
Similarity Measures
- Correlation
38
Miscellaneous Libraries
- Sorting
- Stream-searching
- BMM: Bit Matrix Multiply
- DARPA benchmarks
39
Performance of selected applications based on GMU/GWU libraries
40
Classes of applications
1. Input/output intensive applications
   - bulk data encryption (DES, IDEA, and RC5 encryption)
   - image processing (Sobel Edge Detection, Median Filter, Wavelet Hyperspectral Dimension Reduction)
2. Computationally intensive applications
   - secret-key cipher breaking based on the exhaustive key search (DES, IDEA, RC5 breakers)
   - public-key cipher breaking based on factoring
3. Latency-critical applications
   - cipher key agreement and signature (ECC schemes, RSA)
41
Platform used in experiments
- SRC-6E from SRC Computers, Inc.
Reference Platform
- PC based on Pentium IV, 2.4 GHz clock, 512 MB of RAM, 512 KB of cache
- Treated as a basic building block of a cluster of microprocessor boards.
42
Timing Measurements
[Figure: timeline of a MAP function called from the .c file and defined in the .mc file. Labels: MAP Alloc., FPGA Configure, DMA Data In, FPGA Computation, DMA Data Out, MAP Free. Measured intervals: MAP Allocation time, Configuration time, End-to-End time (HW), End-to-End time (SW), MAP Release time.]
MAP: SRC Reconfigurable Processor based on two User FPGAs
43
Input/Output Intensive Applications: P3 version of SRC-6E
All throughputs in Mbits/s.

| Application | Computational Throughput | Data Transfer In Throughput | Data Transfer Out Throughput | End-to-End Throughput, SRC 6E | End-to-End Throughput, Pentium IV | Speed-up |
|---|---|---|---|---|---|---|
| DES Encryption | 6,398 | 2,488 | 1,705 | 863 | 58 | 14.9 |
| IDEA Encryption | 12,788 | 2,487 | 1,799 | 938 | 165 | 5.7 |
| RC5 Encryption | 6,398 | 2,505 | 1,590 | 836 | 366 | 2.3 |
| Sobel Edge Detection | 5,680 | 2,493 | 1,701 | 849 | 76 | 11.0 |
| Median Filter | 5,681 | 2,484 | 1,710 | 850 | 5 | 170 |
| Wavelet Hyperspectral Dimension Reduction | 6,395 | 2,573 | 1,477 | 818 | 67 – 159 (5 levels – 1 level) | 5 – 12 (1 level – 5 levels) |
44
Wavelet Hyperspectral Dimension Reduction: time contributions
P3 version of SRC-6E vs. Pentium IV PC
45
Input/Output Intensive Applications: P4 version of SRC-6E
All throughputs in Mbits/s.

| Application | Computational Throughput | Data Transfer In Throughput | Data Transfer Out Throughput | End-to-End Throughput, SRC 6E | End-to-End Throughput, Pentium IV | Speed-up |
|---|---|---|---|---|---|---|
| IDEA Encryption | 12,790 | 10,627 | 10,583 | 3,479 | 165 | 21 |
| RC5 Encryption | 6,398 | 6,371 | 6,373 | 2,098 | 366 | 5.7 |
| Sobel Edge Detection | 5,683 | 6,384 | 6,380 | 2,044 | 76 | 27 |
| Median Filter | 5,684 | 6,384 | 6,383 | 2,044 | 5 | 409 |

Wavelet Hyperspectral Dimension Reduction: the SRC 6E cells of this row run together as 6,394 6,394 6,349 3,185 1,626; Pentium IV end-to-end throughput 67 – 159 (5 levels – 1 level); speed-up 10 – 24 (1 level – 5 levels).
46
Wavelet Hyperspectral Dimension Reduction: time contributions
P4 version of SRC-6E vs. Pentium IV PC
47
Input/Output Intensive Applications: P4 version of SRC-6E without and with overlapping computations and data transfers
All throughputs in Mbits/s.

| Application | Computational Throughput | Data Transfer In Throughput | Data Transfer Out Throughput | End-to-End Throughput, SRC 6E | End-to-End Throughput, Pentium IV | Speed-up |
|---|---|---|---|---|---|---|
| IDEA Encryption (no overlapping) | 12,790 | 10,627 | 10,583 | 3,479 | 165 | 21 |
| IDEA Encryption (with overlapping) | 10,857 | 9,792 | 10,564 | 4,887 | 165 | 30 |
| RC5 Encryption (no overlapping) | 6,398 | 6,371 | 6,373 | 2,098 | 366 | 5.7 |
| RC5 Encryption (with overlapping) | 6,398 | 6,372 | 6,349 | 3,110 | 366 | 8.5 |
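The speed-up column is the ratio of the two end-to-end throughputs, and the non-overlapped end-to-end figure can be compared against a simple serial model. The sketch below is an illustration under stated assumptions, not the authors' measurement method: `speedup` and `serial_throughput` are hypothetical helper names, and the serial model assumes that the same amount of data passes through transfer-in, computation, and transfer-out one after another with no other overhead, so it gives an upper bound rather than a prediction.

```c
/* Speed-up as the ratio of two end-to-end throughputs in the same unit. */
double speedup(double src_mbps, double ref_mbps)
{
    return src_mbps / ref_mbps;
}

/* Idealized throughput when the three stages run strictly one after
 * another on the same data: the per-bit stage times simply add up.
 * Allocation, configuration, and release overheads are ignored. */
double serial_throughput(double in_mbps, double comp_mbps, double out_mbps)
{
    return 1.0 / (1.0 / in_mbps + 1.0 / comp_mbps + 1.0 / out_mbps);
}
```

For IDEA encryption without overlapping, `speedup(3479.0, 165.0)` is about 21.1, matching the table's 21, and `serial_throughput(10627.0, 12790.0, 10583.0)` is roughly 3,750 Mbits/s, somewhat above the measured 3,479 Mbits/s, as expected for a bound that ignores overheads. With overlapping, `speedup(4887.0, 165.0)` is about 29.6, which the table rounds to 30.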
48
Input/Output Intensive Applications: SRC Hi-Bar Based System
All throughputs in Mbits/s.

| Application | Computational Throughput | Data Transfer In Throughput | Data Transfer Out Throughput | End-to-End Throughput, SRC 6 | End-to-End Throughput, Pentium IV | Speed-up |
|---|---|---|---|---|---|---|
| DES Encryption (no overlapping) | 19,200 | 11,350 | 10,760 | 4,240 | 58 | 73 |
| IDEA Encryption (no overlapping) | 19,200 | 11,350 | 10,760 | 4,240 | 165 | 26 |
| RC5 Encryption (no overlapping) | 19,200 | 11,350 | 10,760 | 4,240 | 366 | 12 |
49
Computationally Intensive Applications: P3 version of SRC-6E
All throughputs in mln keys/s.

| Application | Computational Throughput | Data Transfer In / Out Throughput | End-to-End Throughput, SRC 6E | End-to-End Throughput, Pentium IV | Speed-up |
|---|---|---|---|---|---|
| DES Breaker | 800 | N/A | 800 | 0.469 | 1706 |
| IDEA Breaker | 1000 | N/A | 500 | 1.701 | 294 |
| RC5 Breaker | 100 | N/A | 100 | 0.516 | 194 |
50
Latency-Critical Applications
All latencies in μs.

| Application | Computational Latency | Data Transfer In Latency | Data Transfer Out Latency | End-to-End Latency, SRC 6E | End-to-End Latency, Pentium IV | Speed-up |
|---|---|---|---|---|---|---|
| ECC DH Key Agreement over GF(2^233), Optimal Normal Basis | 201 | 39 | 17 | 592 | 364,000 | 615 |
| ECC DH Key Agreement over GF(2^233), Polynomial Basis | 560 | 66 | 79 | 943 | 31,050 | 33 |
51
RSA: SRC vs. OpenSSL Software Comparison

| Data Size | SW Function Time (ms) | SW Speedup vs. MAP |
|---|---|---|
| 1024 | 47.248 | 4.821x |
| 1536 | 138.466 | 3.642x |
| 2048 | 269.948 | 3.321x |
| 3072 | 853.050 | 3.468x |
| 4096 | 1755.266 | 3.624x |
52
Sparse matrix by vector multiplication
Reference optimized SW implementation: PC, Pentium IV, 2.768 GHz, 1 GB RAM
53
Summary & Conclusions
54
Summary

| Type of application | End-to-end speed-up of SRC vs. P4 |
|---|---|
| Computationally intensive (cipher breaking) | 200-1700 |
| Latency critical: RSA | 0.2-0.3 |
| Latency critical: ECC polynomial bases, general fields | 33 |
| Latency critical: ECC polynomial bases, special fields | 12-27 |
| Latency critical: ECC optimal normal bases | 600 |
| Input/output intensive (secret key encryption/decryption) | 3-30 |
55
Summary & conclusions (1)
- General methodology for the design and maintenance of SRC user libraries developed and tested
- Existing libraries evaluated in terms of
  - performance
  - ease of use
  - flexibility
  for three wide classes of applications
- Initial results very encouraging
56
Summary & conclusions (2)
- Selected files from the SRC libraries can be used for development of comparable libraries for other reconfigurable computers
- Full compatibility with other reconfigurable computers difficult to achieve because of the technical differences and intellectual property constraints