
1 MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson Pengyuan Yu Sumit Ahuja Sandeep Shukla Patrick Schaumont Electrical and Computer Engineering Department Virginia Tech

2 Table of Contents Section 1 Performance Evaluation and Analysis Section 2 Matrix Multiplication Algorithm Optimization Section 3 HW/SW System Implementation Section 4 Co-design Flow and Methodology Section 5 Conclusion

3 Section 1 Performance Evaluation and Analysis

4 Performance Results

Matrix Size               64       128      256      512      1024
Run Time (sec):
  Our Design (Average)    0.0052   0.0322   0.2170   1.5176   11.882
  Reference               0.0346   0.6697   5.3133   42.302   338.72
SpeedUp                   6.65     20.8     24.5     26.9     28.5

Device Utilization:
  BRAM  80 (64 Coprocessor + 16 On-Chip Memory)
  Mult  128

5 Performance Calculation

F_CPU-Speed = 1 (we used the 300 MHz PPC)
F_FPGA-Capacity = 1 (we used XUP's XC2VP30)
F_FPGA-Speed = 1 (we used a 100 MHz clock for bus and coprocessor)

Time_Effective = (T_meas,N=1024 + 64 * T_meas,N=256) * F_CPU-Speed * F_FPGA-Capacity * F_FPGA-Speed
               = (11.882 + 64 * 0.217) * 1 * 1 * 1
               = 25.77 seconds

6 Performance Results

7 Section 2 Matrix Multiplication Algorithm Optimization

8 Algorithm Optimization

The algorithm is optimized for the target platform (Virtex-II Pro VP30).

Optimization goals:
  Make the best use of the slow DDR memory interface
    Optimal transfers are 128 bits/cycle => 4 complex numbers
    Linear accesses result in better throughput
  Utilize as many of the fast discrete FPGA resources as possible
    136 18x18 hardware multipliers
    136 18-kbit Block RAMs
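The figures above imply the element layout: 128 bits per transfer divided by 4 complex numbers gives 32 bits per element, i.e. 16-bit real and imaginary halves. A minimal packing sketch (the low/high ordering of the halves is an assumption, not stated in the slides):

```c
#include <stdint.h>

/* Hypothetical 32-bit complex element: 16-bit real + 16-bit imaginary,
   four of which fill one 128-bit DDR beat. */
typedef struct { int16_t re, im; } cplx16;

/* Pack one element into a 32-bit word (real part in the low half). */
static uint32_t cplx16_pack(cplx16 c)
{
    return (uint16_t)c.re | ((uint32_t)(uint16_t)c.im << 16);
}

/* Recover the element from a packed word. */
static cplx16 cplx16_unpack(uint32_t w)
{
    cplx16 c = { (int16_t)(w & 0xFFFFu), (int16_t)(w >> 16) };
    return c;
}
```

A round trip through pack/unpack preserves signed values, which is the property the DDR interface relies on.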

9 Optimized Algorithm

Color key for the A, B, C diagrams on the following slides:
  [A] currently in coprocessor
  [A] currently used for calculation
  [B] currently used for calculation
  [C] stored and accumulated in BRAM
  [C] being multiplied and accumulated

10 Optimized Algorithm
Bring in 4 complex numbers from "A".

15 Optimized Algorithm
Bring in four numbers from "B" and perform the following calculations:
  C[0][0] = C[0][0] + A[0][0]*B[0][0]
  C[0][1] = C[0][1] + A[0][0]*B[0][1]
  C[0][2] = C[0][2] + A[0][0]*B[0][2]
  C[0][3] = C[0][3] + A[0][0]*B[0][3]
  …
  C[7][0] = C[7][0] + A[7][0]*B[0][0]
  C[7][1] = C[7][1] + A[7][0]*B[0][1]
  C[7][2] = C[7][2] + A[7][0]*B[0][2]
  C[7][3] = C[7][3] + A[7][0]*B[0][3]
where "A*B" is a complex multiplication.
32 complex multiplications in parallel = 128 multiplies, 64 additions/subtractions, and 64 accumulates per cycle.
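A sequential C model of this per-cycle step may make the operation counts concrete: one A element per C row (8 rows) times each of the 4 incoming B elements, accumulated into an 8x4 tile of C. Doubles stand in for the fixed-point hardware values; the hardware performs all 32 complex MACs in a single cycle rather than in a loop.

```c
/* Software model of one parallel MAC step of the coprocessor. */
typedef struct { double re, im; } cplx;

/* Complex multiply: 4 real multiplies, 1 subtraction, 1 addition. */
static cplx cmul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im,
               a.re * b.im + a.im * b.re };
    return r;
}

/* c[i][j] += a[i] * b[j] for an 8x4 tile: 32 complex MACs in total,
   i.e. 128 multiplies, 64 add/subs, and 64 accumulates. */
static void mac_step(cplx c[8][4], const cplx a[8], const cplx b[4])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 4; j++) {
            cplx p = cmul(a[i], b[j]);
            c[i][j].re += p.re;
            c[i][j].im += p.im;
        }
}
```

Counting the operations inside the loop nest reproduces the totals on the slide: 32 calls to cmul per step.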

36 Optimized Algorithm
At this point we have completed calculating the first 8xN rows of C in our coprocessor, and we write the results back to RAM.

38 Optimized Algorithm
Next, we repeat the previous algorithm to calculate the next 8xN C-slice.

39 Optimized Algorithm
  Performs 128 MACs per cycle (utilizing 128 of the 136 hard multipliers)
  Linear scan through the B matrix (optimizing the interface to DDR storage)
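The whole blocking scheme described on the preceding slides can be sketched as a real-valued reference model: compute C one 8xN row slice at a time, scan B linearly once per slice, and keep the slice in a local accumulator (the BRAM stand-in) until write-back. This is a behavioral sketch, not the fixed-point complex kernel; n is assumed to be a multiple of the slice height.

```c
#include <stdlib.h>
#include <string.h>

#define SLICE 8  /* rows of C accumulated on chip per pass */

/* Real-valued model of the rectangular-block multiply: C = A*B,
   produced slice by slice with one linear pass over B per slice.
   Matrices are row-major, n x n, with n a multiple of SLICE. */
static void block_mm(int n, const double *a, const double *b, double *c)
{
    double *acc = calloc((size_t)SLICE * n, sizeof *acc); /* "BRAM" */
    for (int i0 = 0; i0 < n; i0 += SLICE) {
        memset(acc, 0, (size_t)SLICE * n * sizeof *acc);
        for (int k = 0; k < n; k++)            /* linear scan of B rows */
            for (int i = 0; i < SLICE; i++)
                for (int j = 0; j < n; j++)
                    acc[i * n + j] += a[(i0 + i) * n + k] * b[k * n + j];
        /* slice finished: write back to "RAM" */
        memcpy(c + (size_t)i0 * n, acc, (size_t)SLICE * n * sizeof *acc);
    }
    free(acc);
}
```

Note the loop order: for a fixed k, the inner j loop walks one row of B contiguously, which is exactly the linear access pattern the DDR interface rewards.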

40 Section 3 HW/SW System Implementation

41 System Architecture (block diagram around the Processor Local Bus)

42 Coprocessor Architecture vs. Optimized Algorithm
Minor deviations from the proposed algorithm:
  I/O size for the coprocessor: B elements are loaded 2 at a time instead of 4
    PLB DMA failed to function, forcing the much slower {DDR -> PPC -> Coprocessor FIFO} datapath
    The 64-bit FIFO width => 2-number sends from the PPC to the coprocessor FIFO
  To maintain the SAME calculation capacity:
    A-block dimension doubled from 8x4 to 16x4
    C-slice doubled from 8xN to 16xN
    Still utilizes 128 hardware multipliers
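With DMA out of the picture, each PPC store to the 64-bit FIFO carries two 32-bit complex elements. A hypothetical packing helper (which element lands in the low half is an assumption; the slides do not say):

```c
#include <stdint.h>

/* Combine two 32-bit complex elements into one 64-bit FIFO beat.
   Lane order (elem0 in the low half) is an assumption for illustration. */
static uint64_t fifo_pack2(uint32_t elem0, uint32_t elem1)
{
    return (uint64_t)elem0 | ((uint64_t)elem1 << 32);
}
```

This halved per-send width is why the A-block and C-slice dimensions were doubled: the same 128 multipliers stay busy even though each FIFO transaction delivers half as many B elements.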

43 Coprocessor Architecture
The coprocessor is scalable: reduce the depth of the A-matrix subblock to reduce the number of MACs needed.

44 Coprocessor Architecture

45 MAC Unit Architecture

46 MAC Unit Architecture
Diagram labels: Complex Multiply Accumulate; BlockRAM storage for the current "C" value; input "B" value; "A" values.

47 Section 4 Co-design Flow and Methodology

48 Design Flow
Reference C Algorithm -> (Rectangular-Block Transformation) -> Optimized C Algorithm -> (Manual Partitioning) -> Driver C Algorithm + GEZEL Coprocessor -> (Cosimulation, Synthesis) -> PPC Binary + VHDL -> XUP Board -> Performance Analysis

49 Simulation
Each flow stage has a simulation target: the C algorithms run on the workstation; the driver C algorithm plus the GEZEL coprocessor run on the cycle-based instruction-set cosimulator; the PPC binary and VHDL run on the XUP board FPGA.

50 Simulation
Simulation-based verification on three levels:
  workstation (behavioral)
  cycle-based ISS (functional model of the coprocessor)
  FPGA board (skipping VHDL simulation, since synthesis is swift and easy)
Drawback: simulations capture only the behavior, not the architecture.
  Example: hard to estimate post-synthesis timing
  Example: hard to reflect memory-bus behavior (DMA, DDR, ...) in a C simulation model

51 Cycle-based Instruction-set Simulation
Uses the GEZEL cosimulation tool: http://rijndael.ece.vt.edu/gezel2
Diagram: the application SW (C code) executes on the instruction-set simulator (uP, DDR, "N" register), connected through cosimulation interfaces (FIFO IN, FIFO OUT) to the coprocessor hardware.

52 Cycle-based Instruction-set Simulation
Cycle-based cosimulation of software and hardware is needed before synthesis.
Coprocessor mapped in FSMD semantics:
  Modular, bottom-up hardware description
Cosimulation interfaces captured with GEZEL simulation primitives:
  Memory-mapped register
  FIFO-based (with request/acknowledge handshake)

53 HW-SW Interface Example

GEZEL code:

ipblock fsl1(out data   : ns(32);
             out exists : ns(1);
             in  read   : ns(1)) {
  iptype "armfslslave";
  ipparm "core=ppc";
  ipparm "write=0x80000000";
  ipparm "status=0x80000004";
}

Hardware: fsl1's data, exists, and read signals go to the coprocessor; the other side is connected to the ISS.
  PPC SW can write to address 0x80000000
    This will drive the data output and perform the handshake
  PPC SW can check status with a read from 0x80000004
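The software half of this handshake is just a memory-mapped store plus a status poll. A sketch of a driver routine: on the board the two pointers would be the fixed addresses 0x80000000 (data) and 0x80000004 (status); here they are parameters so the sketch can run against a plain in-memory model, and the "nonzero status = busy" convention is an assumption.

```c
#include <stdint.h>

/* Send one word to the coprocessor FIFO through the memory-mapped
   interface. data_reg / status_reg stand in for the fixed addresses
   0x80000000 / 0x80000004 used on the real PPC. */
static void fsl_send(volatile uint32_t *data_reg,
                     volatile uint32_t *status_reg, uint32_t word)
{
    while (*status_reg != 0)   /* assumed convention: nonzero = not ready */
        ;
    *data_reg = word;          /* drives data and performs the handshake */
}
```

Because the registers are parameters, the same routine can be exercised against the GEZEL cosimulator or a unit-test stub before touching the board.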

54 Synthesis
Automatic conversion to hierarchical RTL VHDL, with black boxes for the cosimulation interfaces; synthesized with Xilinx EDK + ISE.
Diagram: Application SW (C code) on the instruction-set simulator (uP, DDR, "N" register), cosimulation interfaces (FIFO IN, FIFO OUT), and the coprocessor hardware.

55 Conclusions
Matrix multiplication can be sped up 25x over the standard reference C implementation through:
  Rectangular blocking
  Dedicated, highly scalable coprocessor hardware
  An integrated design flow

56 Conclusions
Remaining challenges:
  Memory bottleneck (the hardware/software codesign spends ~7% of its run time on computation and ~93% on memory access)
  Further optimization is possible using DMA and data-caching schemes

57 Conclusions
Challenge to the MEMOCODE community: accurate system-level modeling of platform artifacts to support the designer.

