MEMOCODE 2007 HW/SW Co-design Contest
Documentation of the submission by Eric Simpson, Pengyuan Yu, Sumit Ahuja, Sandeep Shukla, Patrick Schaumont
Electrical and Computer Engineering Department, Virginia Tech
Table of Contents
Section 1: Performance Evaluation and Analysis
Section 2: Matrix Multiplication Algorithm Optimization
Section 3: HW/SW System Implementation
Section 4: Co-design Flow and Methodology
Section 5: Conclusion
Section 1 Performance Evaluation and Analysis
Performance Results
Run time (sec): Matrix Size | Our Design (Average) | Reference | Speedup
Device Utilization: BRAM: 80 (64 coprocessor + 16 on-chip memory); Mult: 128
Performance Calculation
F_CPU-Speed = 1 (we used the 300 MHz PPC)
F_FPGA-Capacity = 1 (we used the XUP's XC2VP30)
F_FPGA-Speed = 1 (we used a 100 MHz clock for bus and coprocessor)
Time_Effective = (T_meas,N=256 * 64) * F_CPU-Speed * F_FPGA-Capacity * F_FPGA-Speed = (0.217 * 64) * 1 * 1 * 1 ≈ 13.9 seconds
Section 2 Matrix Multiplication Algorithm Optimization
Algorithm Optimization
The algorithm is optimized for the target platform (Virtex-II Pro VP30).
Optimization goal: best utilize the slow DDR memory interface
Optimally 128-bit/cycle transfers => 4 complex numbers
Linear accesses result in better throughput
Utilize as many of the fast discrete FPGA resources as possible
18x18 hardware multipliers
18-kbit Block RAMs
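As a concrete view of the 128-bit transfer format, a hedged C sketch of packing four complex numbers into one DDR beat; the 16-bit real/imaginary layout and the little-endian packing order are assumptions, not taken from the slides.

```c
#include <stdint.h>

/* Assumed layout: one complex number = 16-bit real + 16-bit imaginary
 * (32 bits total), so one 128-bit DDR beat carries four of them. */
typedef struct { int16_t re, im; } cplx16_t;

/* Pack four complex numbers into a 128-bit beat (two 64-bit words). */
static void pack_beat(const cplx16_t c[4], uint64_t beat[2]) {
    beat[0] = beat[1] = 0;
    for (int i = 0; i < 4; i++) {
        /* real part in the high half-word, imaginary in the low */
        uint32_t w = ((uint32_t)(uint16_t)c[i].re << 16) |
                     (uint16_t)c[i].im;
        beat[i / 2] |= (uint64_t)w << (32 * (i % 2));
    }
}
```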
Optimized Algorithm (animated walk-through over A, B, C)
Legend: [A] currently in coprocessor; [A] currently used for calculation; [B] currently used for calculation; [C] stored and accumulated in BRAM; [C] being multiplied and accumulated
Bring in 4 complex numbers from "A"
Bring in four numbers from "B" and perform the following calculations, where "A*B" is a complex multiplication:
C[0][0] = C[0][0] + A[0][0]*B[0][0]
C[0][1] = C[0][1] + A[0][0]*B[0][1]
C[0][2] = C[0][2] + A[0][0]*B[0][2]
C[0][3] = C[0][3] + A[0][0]*B[0][3]
…
C[7][0] = C[7][0] + A[7][0]*B[0][0]
C[7][1] = C[7][1] + A[7][0]*B[0][1]
C[7][2] = C[7][2] + A[7][0]*B[0][2]
C[7][3] = C[7][3] + A[7][0]*B[0][3]
32 complex multiplications in parallel = 128 multiplies, 64 additions/subtractions, and 64 accumulates per cycle
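The per-cycle update described above can be sketched in plain C: hold an 8-element column of A, bring in a 4-element row of B, and do all 32 complex multiply-accumulates into the on-chip C block. The 32-bit integer type is an assumption; the hardware uses fixed-point operands.

```c
#include <stdint.h>

typedef struct { int32_t re, im; } cplx;

/* One hardware cycle: C[i][j] += A[i] * B[j] (complex) for an 8-row A
 * column against a 4-wide B row -- 32 complex MACs in parallel. */
static void mac_step(cplx C[8][4], const cplx A[8], const cplx B[4]) {
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 4; j++) {
            /* complex multiply: 4 multiplies, 1 subtract, 1 add */
            C[i][j].re += A[i].re * B[j].re - A[i].im * B[j].im;
            C[i][j].im += A[i].re * B[j].im + A[i].im * B[j].re;
        }
}
```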
At this point we have completed calculating the first 8xN rows of C in our coprocessor, and we write the results back to RAM.
Next, we repeat the previous algorithm to calculate the next 8xN C-slice.
Optimized Algorithm
Performs 128 MACs per cycle (utilizing 128 of the 136 hard multipliers)
Linear scan through the B matrix (optimizing the interface to DDR storage)
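The whole schedule can be written as a C reference sketch: for each 8-row slice of A, make one linear pass over all of B, accumulating an 8xN slice of C on chip before writing it back. N is a placeholder (64 here, for brevity), and the hardware performs the two inner loops in parallel; C is assumed zero-initialized.

```c
#define N 64
typedef struct { int re, im; } cplx;

static void blocked_mm(cplx C[N][N], const cplx A[N][N], const cplx B[N][N]) {
    for (int i0 = 0; i0 < N; i0 += 8) {          /* next 8xN C-slice   */
        for (int k = 0; k < N; k++)              /* linear scan over B */
            for (int i = i0; i < i0 + 8; i++)    /* 8 held A values    */
                for (int j = 0; j < N; j++) {
                    C[i][j].re += A[i][k].re * B[k][j].re
                                - A[i][k].im * B[k][j].im;
                    C[i][j].im += A[i][k].re * B[k][j].im
                                + A[i][k].im * B[k][j].re;
                }
        /* slice C[i0..i0+7][*] is now complete: write back to DDR */
    }
}
```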
Section 3 HW/SW System Implementation
System Architecture (Processor Local Bus)
Coprocessor Architecture vs. Optimized Algorithm
Minor deviation from the proposed algorithm in the coprocessor's I/O size: B elements are loaded two at a time instead of four.
The PLB DMA failed to function, forcing a much slower {DDR -> PPC -> coprocessor FIFO} datapath; the 64-bit FIFO width means the PPC sends two numbers per transfer.
To maintain the SAME calculation capacity: the A-block dimension is doubled from 8x4 to 16x4, and the C-slice from 8xN to 16xN. The design still utilizes 128 hardware multipliers.
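The 64-bit FIFO path can be pictured with a small sketch: the PPC packs two 32-bit complex words into each FIFO write. The packing order (first number in the low half) is an assumption.

```c
#include <stdint.h>

/* With the PLB DMA unavailable, B travels PPC -> coprocessor FIFO in
 * 64-bit words, two 32-bit complex numbers at a time. */
static inline uint64_t pack2(uint32_t first, uint32_t second) {
    return ((uint64_t)second << 32) | first;   /* first in low half */
}
```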
Coprocessor Architecture
The coprocessor is scalable: reduce the depth of the A-matrix sub-block to reduce the number of MACs needed.
MAC Unit Architecture
Complex multiply-accumulate: BlockRAM storage holds the current "C" value; inputs are the incoming "B" value and the held "A" values.
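One MAC unit can be sketched in C: multiply the incoming "B" value by a held "A" value and accumulate into the "C" word kept in BlockRAM. The 16-bit operands and 32-bit accumulator widths are assumptions.

```c
#include <stdint.h>

typedef struct { int32_t re, im; } acc_t;  /* "C" word from BlockRAM */
typedef struct { int16_t re, im; } op_t;   /* "A" and "B" operands   */

/* One complex multiply-accumulate: 4 multiplies, 1 add, 1 subtract,
 * and 2 accumulates. */
static acc_t cmac(acc_t c, op_t a, op_t b) {
    c.re += (int32_t)a.re * b.re - (int32_t)a.im * b.im;
    c.im += (int32_t)a.re * b.im + (int32_t)a.im * b.re;
    return c;
}
```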
Section 4 Co-design Flow and Methodology
Design Flow
Reference C Algorithm --(Rectangular-Block Transformation)--> Optimized C Algorithm --(Manual Partitioning)--> Driver C Algorithm + GEZEL Coprocessor --(Cosimulation, then Synthesis)--> VHDL + PPC Binary --> XUP Board --> Performance Analysis
Simulation
The design representations map onto three simulation platforms: the reference and optimized C algorithms run on a workstation; the driver C algorithm plus GEZEL coprocessor run in a cycle-based instruction-set cosimulator; the VHDL plus PPC binary run on the FPGA (XUP board).
Simulation-based verification on three levels:
workstation (behavioral)
cycle-based ISS (functional model of coprocessor)
FPGA board (skipping VHDL simulation, since synthesis is swift and easy)
Drawback: simulations capture only behavior, not the architecture.
Example: it is hard to estimate post-synthesis timing.
Example: it is hard to reflect memory-bus behavior (DMA, DDR, ...) in a C simulation model.
Cycle-based Instruction-set Simulation
Uses the GEZEL cosimulation tool: the application SW (C code) runs as an executable on an instruction-set simulator, the coprocessor hardware is described in GEZEL, and cosimulation interfaces (DDR, "N" register, FIFO IN, FIFO OUT) connect the two.
We need cycle-based cosimulation of software and hardware before synthesis.
The coprocessor is mapped in FSMD semantics, with a modular, bottom-up hardware description.
Cosimulation interfaces are captured with GEZEL simulation primitives: memory-mapped registers and FIFOs (with request/acknowledge handshake).
HW-SW Interface Example (GEZEL code)
ipblock fsl1(out data : ns(32); out exists : ns(1); in read : ns(1)) {
  iptype "armfslslave";
  ipparm "core=ppc";
  ipparm "write=0x…";
  ipparm "status=0x…";
}
The data, exists, and read ports connect to the coprocessor; the fsl1 block itself is connected to the ISS. PPC software can write to the "write" address, which drives the data output and performs the handshake, and can check status with a read from the "status" address.
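The software side of this handshake can be sketched as a small driver helper: poll the status word, then write the data word. The real driver uses the two memory-mapped addresses elided on the slide; here they are passed as pointers so the logic is self-contained, and fsl1_send is a hypothetical helper name.

```c
#include <stdint.h>

/* Poll the status register until the coprocessor FIFO can accept a
 * word, then write the data register (which drives the data output
 * and performs the handshake on the hardware side). */
static void fsl1_send(volatile uint32_t *status, volatile uint32_t *data,
                      uint32_t word) {
    while (*status == 0)
        ;               /* wait until the coprocessor can accept */
    *data = word;       /* drives data output + handshake        */
}
```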
Synthesis
The GEZEL coprocessor description is automatically converted to hierarchical RTL VHDL, with black boxes standing in for the cosimulation interfaces (DDR, "N" register, FIFO IN, FIFO OUT). The resulting system is built with Xilinx EDK + ISE.
Conclusions
Matrix multiplication can be sped up 25 times over the standard reference C implementation through:
Rectangular blocking
Dedicated, highly scalable coprocessor hardware
An integrated design flow
Conclusions: Remaining Challenges
The memory bottleneck: hardware/software codesign yields ~7% computation time and ~93% memory-access time.
Further optimization is possible using DMA and data-caching schemes.
Conclusions: Challenge to the MEMOCODE Community
Accurate system-level modeling of platform artifacts to support the designer.