MEMOCODE 2007 HW/SW Co-design Contest
Documentation of the submission by Eric Simpson, Pengyuan Yu, Sumit Ahuja, Sandeep Shukla, and Patrick Schaumont
Electrical and Computer Engineering Department, Virginia Tech
Table of Contents
- Section 1: Performance Evaluation and Analysis
- Section 2: Matrix Multiplication Algorithm Optimization
- Section 3: HW/SW System Implementation
- Section 4: Co-design Flow and Methodology
- Section 5: Conclusion
Section 1: Performance Evaluation and Analysis
Performance Results

Matrix Size              64       128      256      512      1024
Our Design (avg, sec)    0.0052   0.0322   0.2170   1.5176   11.882
Reference (sec)          0.0346   0.6697   5.3133   42.302   338.72
Speedup                  6.65     20.8     24.5     26.9     28.5

Device Utilization:
- BRAM: 80 (64 coprocessor + 16 on-chip memory)
- Mult: 128
Performance Calculation

- F_CPU-speed = 1 (we used the 300 MHz PPC)
- F_FPGA-capacity = 1 (we used the XUP's XC2VP30)
- F_FPGA-speed = 1 (we used a 100 MHz clock for the bus and coprocessor)

T_effective = (T_meas,N=1024 + 64 * T_meas,N=256) * F_CPU-speed * F_FPGA-capacity * F_FPGA-speed
            = (11.882 + 64 * 0.217) * 1 * 1 * 1
            = 25.77 seconds
Section 2: Matrix Multiplication Algorithm Optimization
Algorithm Optimization

The algorithm is optimized for the target platform (Virtex-II Pro VP30).

Optimization goal: make the best use of the slow DDR memory interface
- Optimal transfers are 128 bits/cycle => 4 complex numbers
- Linear accesses yield better throughput

Utilize as many of the fast discrete FPGA resources as possible
- 136 18x18 hardware multipliers
- 136 18-kbit Block RAMs
Optimized Algorithm

Legend for the animation frames that follow:
- [A] currently in coprocessor
- [A] currently used for calculation
- [B] currently used for calculation
- [C] stored and accumulated in BRAM
- [C] being multiplied and accumulated
Optimized Algorithm: bring in 4 complex numbers from "A".
Optimized Algorithm: bring in four numbers from "B" and perform the following calculations:

C[0][0] = C[0][0] + A[0][0]*B[0][0]
C[0][1] = C[0][1] + A[0][0]*B[0][1]
C[0][2] = C[0][2] + A[0][0]*B[0][2]
C[0][3] = C[0][3] + A[0][0]*B[0][3]
...
C[7][0] = C[7][0] + A[7][0]*B[0][0]
C[7][1] = C[7][1] + A[7][0]*B[0][1]
C[7][2] = C[7][2] + A[7][0]*B[0][2]
C[7][3] = C[7][3] + A[7][0]*B[0][3]

where "A*B" is a complex multiplication. This is 32 complex multiplications in parallel: 128 multiplies, 64 additions/subtractions, and 64 accumulates per cycle.
Optimized Algorithm: At this point we have completed calculating the first 8xN rows of C in the coprocessor, and we write the results back to RAM.
Optimized Algorithm: Next, we repeat the previous steps to calculate the next 8xN C-slice.
Optimized Algorithm

- Performs 128 MACs per cycle (utilizing 128 of the 136 hard multipliers)
- Linear scan through the B matrix (optimizing the interface to DDR storage)
Section 3: HW/SW System Implementation
System Architecture: Processor Local Bus (diagram)
Coprocessor Architecture vs. Optimized Algorithm

Minor deviations from the proposed algorithm:
- Coprocessor I/O size: B elements are loaded 2 at a time instead of 4.
  - The PLB DMA failed to function, forcing a much slower DDR -> PPC -> coprocessor-FIFO datapath.
  - The 64-bit FIFO width => 2-number sends from the PPC to the coprocessor FIFO.
- To maintain the SAME calculation capacity:
  - The A-block dimension doubled from 8x4 to 16x4.
  - The C-slice doubled from 8xN to 16xN.
  - The design still utilizes 128 hardware multipliers.
Coprocessor Architecture

The coprocessor is scalable: reduce the depth of the A-matrix sub-block to reduce the number of MAC units needed.
MAC Unit Architecture (diagram): complex multiply-accumulate; BlockRAM storage for the current "C" value; input "B" value; "A" values.
Section 4: Co-design Flow and Methodology
Design Flow

Reference C Algorithm
  -> (rectangular-block transformation) Optimized C Algorithm
  -> (manual partitioning) Driver C Algorithm + GEZEL Coprocessor
  -> (cosimulation, synthesis) PPC Binary + VHDL
  -> XUP Board (performance analysis)
Simulation

- Reference C Algorithm, Optimized C Algorithm: workstation
- Driver C Algorithm + GEZEL Coprocessor: cycle-based instruction-set cosimulator
- PPC Binary + VHDL: FPGA (XUP board)
Simulation-based verification on three levels:
- workstation (behavioral)
- cycle-based ISS (functional model of the coprocessor)
- FPGA board (skipping VHDL simulation, since synthesis is swift and easy)

Drawback: simulations capture only the behavior, not the architecture.
- Example: hard to estimate post-synthesis timing.
- Example: hard to reflect memory-bus behavior (DMA, DDR, ...) in a C simulation model.
Cycle-based Instruction-set Simulation

Uses the GEZEL cosimulation tool: http://rijndael.ece.vt.edu/gezel2

(Diagram: the application SW (C code) runs on an instruction-set simulator; cosimulation interfaces (an "N" register, FIFO IN, FIFO OUT) connect the uP and DDR to the coprocessor hardware.)
We need cycle-based cosimulation of software and hardware before synthesis.
- The coprocessor is mapped in FSMD semantics (modular, bottom-up hardware description).
- Cosimulation interfaces are captured with GEZEL simulation primitives:
  - memory-mapped register
  - FIFO-based (with request/acknowledge handshake)
HW-SW Interface Example

GEZEL code:

ipblock fsl1(out data   : ns(32);
             out exists : ns(1);
             in  read   : ns(1)) {
  iptype "armfslslave";
  ipparm "core=ppc";
  ipparm "write=0x80000000";
  ipparm "status=0x80000004";
}

The data, exists, and read ports connect to the coprocessor; fsl1 itself is connected to the ISS. The PPC software can write to address 0x80000000, which drives the data output and performs the handshake, and can check the status with a read from 0x80000004.
Synthesis

Automatic conversion of the coprocessor to hierarchical RTL VHDL, with black boxes for the cosimulation interfaces; the system is then built with Xilinx EDK + ISE.
Conclusions

Matrix multiplication can be sped up by 25x over the standard reference C implementation:
- rectangular blocking
- dedicated, highly scalable coprocessor hardware
- an integrated design flow
Conclusions: Remaining Challenges

- Memory bottleneck: the hardware/software codesign spends ~7% of the run time on computation and ~93% on memory access.
- Further optimization is possible using DMA and data-caching schemes.
Conclusions: Challenge to the MEMOCODE Community

Accurate system-level modeling of platform artifacts to support the designer.