
MEMOCODE 2007 HW/SW Co-design Contest
Documentation of the submission by Eric Simpson, Pengyuan Yu, Sumit Ahuja, Sandeep Shukla, and Patrick Schaumont
Electrical and Computer Engineering Department, Virginia Tech

Table of Contents
Section 1 Performance Evaluation and Analysis
Section 2 Matrix Multiplication Algorithm Optimization
Section 3 HW/SW System Implementation
Section 4 Co-design Flow and Methodology
Section 5 Conclusion

Section 1 Performance Evaluation and Analysis

Performance Results
Table columns: Matrix Size | Run Time (sec), Our Design (Average) | Run Time (sec), Reference | Speedup
Device Utilization: BRAM 80 (64 Coprocessor + 16 On-Chip Memory), Mult 128

Performance Calculation
F_CPU-Speed = 1 (we used the 300 MHz PPC)
F_FPGA-Capacity = 1 (we used the XUP's XC2VP30)
F_FPGA-Speed = 1 (we used a 100 MHz clock for the bus and coprocessor)
Time_Effective = (T_meas,N=256 * 64) * F_CPU-Speed * F_FPGA-Capacity * F_FPGA-Speed = (0.217 * 64) * 1 * 1 * 1 = 13.888 seconds
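The scoring arithmetic can be checked with a short sketch (Python; the function names and the cubic extrapolation exponent are our assumptions, while 0.217 s at N=256 is the figure from the slide):

```python
# Sketch of the contest's performance normalization. The cubic scaling
# (run time grows as N^3 for dense matrix multiplication) is assumed;
# the 0.217 s measurement at N=256 comes from the slide above.

def extrapolate_cubic(t_small, n_small, n_large):
    """Estimate run time at a larger matrix size from a smaller measurement."""
    return t_small * (n_large / n_small) ** 3

def effective_time(t_meas, f_cpu_speed=1.0, f_fpga_capacity=1.0, f_fpga_speed=1.0):
    """Scale a measured time by the contest's normalization factors."""
    return t_meas * f_cpu_speed * f_fpga_capacity * f_fpga_speed

t_large = extrapolate_cubic(0.217, 256, 1024)  # 0.217 s scaled by (1024/256)^3 = 64
result = effective_time(t_large)               # all three factors are 1 here
```

Since all three normalization factors are 1 for this platform, the effective time equals the scaled measurement.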

Section 2 Matrix Multiplication Algorithm Optimization

Algorithm Optimization
The algorithm is optimized for the target platform (Virtex-II Pro VP30). Optimization goals:
- Make the best use of the slow DDR memory interface: transfers are optimally 128 bits/cycle (4 complex numbers), and linear accesses result in better throughput
- Utilize as many of the fast discrete FPGA resources as possible: 18x18-bit hardware multipliers and 18-kbit Block RAMs

Optimized Algorithm
Legend for the following animation frames:
[A] currently in coprocessor
[A] currently used for calculation
[B] currently used for calculation
[C] stored and accumulated in BRAM
[C] being multiplied and accumulated

Optimized Algorithm
Bring in 4 complex numbers from “A”

Optimized Algorithm
Bring in four numbers from “B” and perform the following calculations:
C[0][0] = C[0][0] + A[0][0]*B[0][0]
C[0][1] = C[0][1] + A[0][0]*B[0][1]
C[0][2] = C[0][2] + A[0][0]*B[0][2]
C[0][3] = C[0][3] + A[0][0]*B[0][3]
…
C[7][0] = C[7][0] + A[7][0]*B[0][0]
C[7][1] = C[7][1] + A[7][0]*B[0][1]
C[7][2] = C[7][2] + A[7][0]*B[0][2]
C[7][3] = C[7][3] + A[7][0]*B[0][3]
where “A*B” is a complex multiplication.
32 complex multiplications in parallel = 128 multiplies, 64 additions/subtractions, and 64 accumulates per cycle
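As a behavioral sketch (plain Python, not the contest code), one such update step, in which one column of the 8x4 A-subblock multiplies four numbers from a B row, can be modeled as:

```python
# Behavioral model of one coprocessor step: 8 numbers from a column of
# the A-subblock times 4 numbers from a B row, accumulated into C.
# In hardware all 32 complex MACs below happen in a single cycle;
# the example operand values are made up for illustration.

def mac_step(C, A_col, B_row):
    """C[i][j] += A_col[i] * B_row[j] for an 8x4 complex update."""
    for i in range(8):
        for j in range(4):
            C[i][j] += A_col[i] * B_row[j]
    return C

C = [[0j] * 4 for _ in range(8)]                 # C slice held in BRAM
A_col = [complex(i, 1) for i in range(8)]        # 8 numbers from "A"
B_row = [complex(j, -1) for j in range(4)]       # 4 numbers from "B"
mac_step(C, A_col, B_row)
```

Each of the 32 iterations is one complex multiply-accumulate, matching the 128 real multiplies per cycle counted on the slide (4 real multiplies per complex product).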

Optimized Algorithm
At this point we have completed calculating the first 8xN rows of C in our coprocessor, and we write the results back to RAM.

Optimized Algorithm
Next, we repeat the previous algorithm to calculate the next 8xN “C-Slice”.

Optimized Algorithm
- Performs 128 MACs per cycle (utilizing 128 out of the 136 hard multipliers)
- Linear scan through the B matrix (optimizing the interface to DDR storage)
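The whole scheme can be sketched as a behavioral Python model (our own illustrative loop nest, not the submitted code). The loop order mirrors the slides: hold an 8x4 A-subblock, scan B linearly, and accumulate one 8-row C-slice at a time:

```python
# Behavioral model of the blocked complex matrix multiply.
# C is produced in 8-row slices; for each slice, an 8x4 subblock of A
# stays "in the coprocessor" while the rows of B are scanned linearly.

def blocked_matmul(A, B, rows_per_slice=8, depth=4):
    n = len(A)
    C = [[0j] * n for _ in range(n)]
    for r0 in range(0, n, rows_per_slice):       # one C-slice at a time
        for k0 in range(0, n, depth):            # one A-subblock at a time
            # A[r0:r0+8][k0:k0+4] would be loaded into the coprocessor here
            for k in range(k0, k0 + depth):      # linear scan of B rows
                for i in range(r0, r0 + rows_per_slice):
                    for j in range(n):
                        C[i][j] += A[i][k] * B[k][j]
        # slice C[r0:r0+8] is now complete and written back to RAM
    return C

n = 8
A = [[complex(i, j) for j in range(n)] for i in range(n)]
B = [[complex(i - j, i + j) for j in range(n)] for i in range(n)]
C = blocked_matmul(A, B)
```

In hardware the three innermost loops collapse into the parallel MAC array; only the slice and subblock loops remain sequential.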

Section 3 HW/SW System Implementation

System Architecture
(Block diagram: the coprocessor attaches to the Processor Local Bus)

Coprocessor Architecture vs. Optimized Algorithm
Minor deviation from the proposed algorithm:
- I/O size for the coprocessor: B elements are loaded 2 at a time instead of 4. PLB DMA failed to function, resulting in a much slower {DDR -> PPC -> Coprocessor FIFO} datapath; the 64-bit FIFO width means 2-number sends from the PPC to the coprocessor FIFO.
- To maintain the SAME calculation capacity: the A-block dimension is doubled from 8x4 to 16x4, and the C-slice is doubled from 8xN to 16xN. This still utilizes 128 hardware multipliers.
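As an illustration of the 2-number sends, two complex values can be packed into one 64-bit FIFO word. The 16-bit real/imaginary element format here is an assumption (it makes each complex number 32 bits wide, consistent with 4 numbers per 128-bit transfer), as are the function names:

```python
import struct

# Hypothetical packing of two fixed-point complex numbers (assumed
# 16-bit real + 16-bit imaginary, i.e. 32 bits per number) into one
# 64-bit word, modeling the 2-number sends from the PPC to the FIFO.

def pack2(a, b):
    """Pack two (re, im) int16 pairs into a single 64-bit word."""
    return struct.unpack("<Q", struct.pack("<4h", a[0], a[1], b[0], b[1]))[0]

def unpack2(word):
    """Recover the two (re, im) int16 pairs from a 64-bit word."""
    a_re, a_im, b_re, b_im = struct.unpack("<4h", struct.pack("<Q", word))
    return (a_re, a_im), (b_re, b_im)
```

With 128-bit DDR transfers, two such words (four complex numbers) arrive per burst beat, which is why the original plan loaded B four numbers at a time.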

Coprocessor Architecture
The coprocessor is scalable: reduce the depth of the A-matrix subblock to reduce the number of MAC units needed.

Coprocessor Architecture Section 3 HW/SW System Implementation

MAC Unit Architecture Section 3 HW/SW System Implementation

MAC Unit Architecture
(Block diagram: a complex multiply-accumulate unit with BlockRAM storage for the current “C” value; inputs are the “B” value and the “A” values)

Section 4 Co-design Flow and Methodology

Design Flow
Reference C Algorithm -> [Rectangular-Block Transformation] -> Optimized C Algorithm -> [Manual Partitioning] -> Driver C Algorithm + GEZEL Coprocessor -> [Cosimulation] -> VHDL + PPC Binary -> [Synthesis] -> XUP Board -> [Performance Analysis]

Simulation
The Reference C Algorithm and Optimized C Algorithm execute on the workstation; the Driver C Algorithm and GEZEL Coprocessor execute on the cycle-based instruction-set cosimulator; the VHDL and PPC binary execute on the XUP board FPGA.

Simulation
Simulation-based verification on three levels:
- workstation (behavioral)
- cycle-based ISS (functional model of the coprocessor)
- FPGA board (skipping VHDL simulation, since synthesis is swift and easy)
Drawback: simulations capture only behavior, not the architecture.
- Example: hard to estimate post-synthesis timing
- Example: hard to reflect memory-bus behavior (DMA, DDR, ...) in a C simulation model

Cycle-based Instruction-set Simulation
Uses the GEZEL Cosimulation Tool.
(Diagram: Application SW (C code) runs on an instruction-set simulator with uP and DDR; cosimulation interfaces (an “N” register, FIFO IN, and FIFO OUT) connect it to the coprocessor hardware; together they form a single executable.)

Cycle-based Instruction-set Simulation
We need cycle-based cosimulation of software and hardware before synthesis.
The coprocessor is mapped in FSMD semantics (modular, bottom-up hardware description).
Cosimulation interfaces are captured with GEZEL simulation primitives:
- memory-mapped register
- FIFO-based (with request/acknowledge handshake)

HW-SW Interface Example
GEZEL code:
  ipblock fsl1(out data   : ns(32);
               out exists : ns(1);
               in  read   : ns(1)) {
    iptype "armfslslave";
    ipparm "core=ppc";
    ipparm "write=0x";
    ipparm "status=0x";
  }
Hardware: the data, exists, and read ports connect to the coprocessor; the fsl1 block itself is connected to the ISS.
The PPC SW can write to the address given by the "write" parameter (this will drive the data output and perform the handshake), and can check status with a read from the "status" address.
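A toy software-side model of this interface (plain Python; the addresses, FIFO depth, and class name are all made up for illustration, since the real addresses are set by the ipparm entries) behaves as follows:

```python
from collections import deque

# Toy model of the memory-mapped FIFO as seen by the PPC software:
# writing the "write" address pushes a 32-bit word toward the coprocessor
# and performs the handshake; reading the "status" address reports whether
# the FIFO can accept more data. All constants here are hypothetical.

class FslSlaveModel:
    WRITE_ADDR, STATUS_ADDR = 0x80000000, 0x80000004  # hypothetical addresses

    def __init__(self, depth=16):
        self.fifo = deque()
        self.depth = depth

    def sw_write(self, addr, value):
        """PPC-side store: accepted only when the FIFO has room."""
        if addr == self.WRITE_ADDR and len(self.fifo) < self.depth:
            self.fifo.append(value & 0xFFFFFFFF)

    def sw_read(self, addr):
        """PPC-side load of the status word: 1 = room available, 0 = full."""
        if addr == self.STATUS_ADDR:
            return 1 if len(self.fifo) < self.depth else 0

    def hw_read(self):
        """Coprocessor side: consume the next word (exists/read handshake)."""
        return self.fifo.popleft() if self.fifo else None
```

Polling the status address before each write mirrors how the driver software avoids overrunning the coprocessor FIFO.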

Synthesis
(Same setup as the cosimulation diagram: Application SW (C code) on the instruction-set simulator, cosimulation interfaces, coprocessor hardware.)
Automatic conversion to hierarchical RTL-VHDL, with black boxes for the cosimulation interfaces; synthesized with Xilinx EDK + ISE.

Conclusions
Matrix multiplication can be sped up by 25 times over the standard reference C implementation, through:
- Rectangular blocking
- Dedicated coprocessor hardware, highly scalable
- An integrated design flow

Conclusions
Remaining challenges:
- Memory bottleneck (hardware/software codesign yields ~7% computation time and 93% memory-access time)
- Further optimization is possible using DMA and data-caching schemes

Conclusions
Challenge to the MEMOCODE community: accurate system-level modeling of platform artifacts to support the designer.