Performance and Overhead in a Hybrid Reconfigurable Computer O. D. Fidanci 1, D. Poznanovic 2, K. Gaj 3, T. El-Ghazawi 1, N. Alexandridis 1 1 George Washington.

1 Performance and Overhead in a Hybrid Reconfigurable Computer O. D. Fidanci 1, D. Poznanovic 2, K. Gaj 3, T. El-Ghazawi 1, N. Alexandridis 1 1 George Washington University, 2 SRC Computers Inc., 3 George Mason University http://cpe02.gmu.edu/rcm/

2 Features of General-Purpose Reconfigurable Computers
- composed of traditional microprocessors and Field Programmable Gate Arrays (FPGAs) closely integrated with each other
- programming does not require knowledge of hardware design
- permit run-time reconfiguration of FPGAs

3 Hardware Architecture and Programming Model of SRC-6E

4 SRC Hardware Architecture
[Diagram labels: 2 Intel® microprocessors, SNAP, and MAP processor (each shown twice); chain ports; 800 MB/s; MAP module]

5 SRC Hardware Architecture – cont.

6 SRC Programming Model
[Diagram labels: Program in C or Fortran. Main program: Function_1(a, d, e), Function_2(d, e, f). Function_1: Macro_1(a, b, c), Macro_2(b, d), Macro_2(c, e). Function_2: Macro_3(s, t), Macro_1(n, b), Macro_4(t, k), ... FPGA contents after the Function_1 call: Macro_1 and Macro_2 blocks with signals a, b, c, d, e]

7 Compilation Process of SRC-6E
[Diagram labels: application sources (.c or .f files); macro sources / HDL sources (.vhd or .v files); μP compiler (Intel); MAP compiler (.v files); logic synthesis (Synplicity); netlists (.ngo files); place & route (Xilinx); configuration bitstreams (.bin files); object files (.o files); linker; application executable]

8 Application Case Study 1: High-throughput Triple DES encryption

9 High-throughput encryption
[Diagram: message blocks M_i, M_i+1, M_i+2, ... enter the 3DES unit with key K_0; ciphertext blocks C_i, C_i+1, C_i+2, ... leave it]

10 Fully pipelined architecture of Triple DES
[Diagram: three DES macros in series, covering pipeline stages 1-17, 18-34, and 35-51]
- 51 pipeline stages
- new input and new output every clock cycle
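The pipeline figures on this slide can be turned into a rough cycle count. The sketch below assumes an ideal pipeline with no stalls, so the first result appears after 51 cycles and one more result follows on every later cycle. The 64-bit block size is that of DES; the 100 MHz clock in the usage line is a hypothetical value chosen for illustration and is not stated in the slides.

```python
def pipeline_cycles(num_blocks: int, stages: int = 51) -> int:
    """Clock cycles needed to push num_blocks through an ideal pipeline.

    The first block needs `stages` cycles to emerge; each further block
    emerges one cycle later.
    """
    if num_blocks <= 0:
        return 0
    return stages + num_blocks - 1


def peak_throughput_bits_per_s(clock_hz: float, block_bits: int = 64) -> float:
    """Steady-state throughput when one block completes every cycle."""
    return clock_hz * block_bits


if __name__ == "__main__":
    print(pipeline_cycles(1024))
    # 100 MHz is an assumed clock rate, used only as an example.
    print(peak_throughput_bits_per_s(100e6))
```

For large inputs the 50-cycle fill latency becomes negligible, which is why the computation share stays small in the measurements that follow.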

11 Overhead of the data transfer
[Diagram labels: μP boards with Xeon μPs, L2 caches, MIOC, private memory, PCI slot, and SNAP; MAP board with two control chips, two on-board memories (24 MB each), and four user chips; links annotated (6x)]

12 Timing Measurements
A three-level timing measurement scheme has been employed:
1. End-to-end execution time (wall clock time, HLL level): includes the configuration, data transfer, and data processing times
2. Time w/o configuration (wall clock time, HLL level): excludes the configuration time but includes the data transfer and data processing times
3. MAP time (clock counter, hardware level): includes only the data processing time
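Because the three measurements are nested, the individual overheads can be recovered by subtraction. The sketch below is one way to do that; the function and field names are hypothetical, and the numbers in the usage line are made-up placeholders, not measurements from the slides.

```python
from dataclasses import dataclass


@dataclass
class TimingBreakdown:
    configuration_ms: float
    data_transfer_ms: float
    computation_ms: float


def breakdown(end_to_end_ms: float,
              without_config_ms: float,
              map_ms: float) -> TimingBreakdown:
    """Split the three nested timings into their components.

    end_to_end_ms     = configuration + data transfer + computation
    without_config_ms = data transfer + computation
    map_ms            = computation only
    """
    if not (end_to_end_ms >= without_config_ms >= map_ms >= 0):
        raise ValueError("timings must be nested and non-negative")
    return TimingBreakdown(
        configuration_ms=end_to_end_ms - without_config_ms,
        data_transfer_ms=without_config_ms - map_ms,
        computation_ms=map_ms,
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(breakdown(100.0, 30.0, 10.0))
```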

13 Triple DES Encryption
[Chart: execution time in ms (axis 0 to 160) versus number of encrypted blocks (1024; 10,000; 25,000; 50,000; 100,000; 250,000; 500,000), split into configuration, data transfer, and computation]

14 Problems
Execution time dominated by
- configuration of the MAP FPGA and
- data transfer between the System Common Memory and On-Board Memory
Configuration time hiding techniques
- preloading the configuration before execution
- flip-flopping FPGAs during reconfiguration

15 Data transfer hiding techniques
Data transfer can be hidden by overlapping DMA time with the data processing time.
[Diagram: Input DMA, Encryption, and Output DMA phases of successive data batches, overlapped in time]
Possible speed-up up to 33%
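The slides do not say how the 33% figure is derived. One simple model that reproduces it, offered here only as an assumption, is that input and output DMA share a single channel and so cannot overlap each other, while encryption can run concurrently with either transfer. In steady state each batch then costs the larger of the total DMA time and the compute time. The function names below are hypothetical.

```python
def sequential_time(t_in: float, t_comp: float, t_out: float) -> float:
    """Per-batch time when the three phases run back to back."""
    return t_in + t_comp + t_out


def overlapped_time(t_in: float, t_comp: float, t_out: float) -> float:
    """Steady-state per-batch time under the assumed model:
    one shared DMA channel, computation overlapped with DMA."""
    return max(t_in + t_out, t_comp)


def time_reduction(t_in: float, t_comp: float, t_out: float) -> float:
    """Fraction of the sequential time saved by overlapping."""
    seq = sequential_time(t_in, t_comp, t_out)
    return (seq - overlapped_time(t_in, t_comp, t_out)) / seq


if __name__ == "__main__":
    # Equal-length phases: 3 time units shrink to 2, a saving of one third.
    print(time_reduction(1.0, 1.0, 1.0))
```

Under this model the saving peaks at one third when the three phases are equally long; when DMA dominates, as in the measured Triple DES runs, the achievable saving is smaller.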

16 Reference software implementations
Platform: Pentium 4, 1.8 GHz, 512 kB cache, 1 GB RAM
Software:
- Non-optimized: public domain code; C only; Intel C++, -O3 optimization
- Optimized for encryption (but not for cipher breaking): Phil Karn's DES code; C and assembly language with look-up table precomputations; GNU gcc v. 2.96, -O4 optimization

17 Total execution time of Triple DES for Pentium 4 using optimized and non-optimized code
[Chart: two curves, optimized P4 code and non-optimized P4 code; the gap between them is labeled with a factor of 4]

18 Throughput results for SRC-6E and Pentium 4

19 SRC-6E vs. Pentium 4 speed-up

20 Application Case Study 2: DES cipher breaking

21 Secret-key breaking
[Diagram: DES unit with message block M_0, ciphertext block C_0, and candidate keys K_1, K_2, K_3, ..., K_N generated by the DES breaker]
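The search this slide depicts is an exhaustive one: every candidate key encrypts the known block M_0, and the result is compared with C_0. The sketch below shows only that loop structure. It does not implement DES; `toy_encrypt` is a hypothetical 16-bit stand-in cipher, chosen so the whole key space can be searched in a moment, and all names are illustrative.

```python
from typing import Iterable, Optional

MASK16 = 0xFFFF


def toy_encrypt(block: int, key: int) -> int:
    """Toy 16-bit cipher (NOT DES): XOR with the key, rotate left by 3."""
    x = (block ^ key) & MASK16
    return ((x << 3) | (x >> 13)) & MASK16


def break_key(plaintext: int, ciphertext: int,
              key_space: Iterable[int]) -> Optional[int]:
    """Try each candidate key until one maps plaintext to ciphertext."""
    for key in key_space:
        if toy_encrypt(plaintext, key) == ciphertext:
            return key
    return None


if __name__ == "__main__":
    secret = 0xBEEF
    m0 = 0x1234
    c0 = toy_encrypt(m0, secret)
    print(hex(break_key(m0, c0, range(1 << 16))))
```

Only one plaintext/ciphertext pair goes in and one key comes out, which matches the "minimal input/output" character the conclusions attribute to this application.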

22 Keys generated in the User FPGA
[Diagram labels: μP boards with Xeon μPs, L2 caches, MIOC, private memory, PCI slot, and SNAP; MAP board with two control chips, two on-board memories (24 MB each), and four user chips; links annotated (6x)]

23 DES breaking machine
[Chart: execution time in ms (axis 0 to 1,200) versus number of tested keys (128,000; 1,000,000; 100,000,000), split into configuration, data transfer, and computation]

24 SRC-6E vs. Pentium 4 speed-up

25 Conclusions
Two different classes of applications developed and tested for SRC-6E and Pentium 4 PC:
- Triple DES encryption: real-time data streaming
- DES breaking: minimal input/output

26 Conclusions – cont.
Wall-clock speed-ups
- 3DES encryption: 3.4 vs. P4 C code; 12.5 vs. P4 assembly code
- DES breaking: 894 vs. P4 C code (larger for real-time input sizes)
Speed-ups without reconfiguration
- 3DES encryption: 11 vs. P4 C code; 41 vs. P4 assembly code
- DES breaking: 1583 vs. P4 C code

27 Informal speed/cost comparison
- Cost of the SRC machine / Cost of PC ≈ 100
- Speed of the SRC machine / Speed of PC ≈ 1600 *
- 16x improved speed/cost ratio
* with only one out of four FPGAs used in computations
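Reading the two ratios on this slide as roughly 100 (cost) and roughly 1600 (speed), the 16x figure follows from dividing one by the other:

```latex
\[
\frac{(\text{speed}/\text{cost})_{\text{SRC}}}{(\text{speed}/\text{cost})_{\text{PC}}}
  = \frac{\text{speed}_{\text{SRC}} / \text{speed}_{\text{PC}}}
         {\text{cost}_{\text{SRC}} / \text{cost}_{\text{PC}}}
  \approx \frac{1600}{100} = 16
\]
```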

28 Conclusions: Overheads
Reconfiguration time
- Most affected applications: short execution time, large resource requirements, frequent reconfiguration
- Minimization techniques: preloading configuration; flip-flopping among multiple FPGAs
Data transfer time
- Most affected applications: high-speed real-time input/output
- Minimization techniques: overlapping data transfer with computations
