HSDSL, Technion Spring 2014 Preliminary Design Review Matrix Multiplication on FPGA Project No.: 1998 Project B 044169 By: Zaid Abassi Supervisor: Rolf Hilgendorf April 2, 2014

Background and Motivation: 1. Matrix multiplication carried out naively is prohibitively expensive, so there is a need for research into an efficient, parallel matrix multiplication algorithm.

2. In application-specific designs (here, matrix multiplication), as opposed to broader general-purpose architectures, the order and number of operations are known at design time, which creates an opportunity to save overhead that would otherwise be incurred.

3. Matrix multiplication is an elementary building block of more advanced linear algebra core operations on matrices, such as matrix inversion and linear transformations, so the need for efficient matrix multiplication is all the greater.

4. Over the years, the complexity of matrix multiplication in software has improved through specialized data structures, and we aim to investigate approaches inspired by them in an FPGA implementation.

Our Goal: To develop a matrix multiplication algorithm specifically for the FPGA that maximizes efficiency through parallel design while reducing power consumption as much as possible.

The System Top Level View

Processing Entity (PE)

PE unit

PE unit: The controller for each PE is an FSM that regulates the PE's operations: storage, computation, and communication (broadcasting). The controller must be smart enough to autonomously manage synchronized PE operation, with handshaking and global communication depending on implicit synchronization between all the PEs.
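To make the phase structure concrete, here is a minimal behavioral sketch in C (not the project's HDL) of a single PE controller stepping through load, broadcast, and compute states; the state names, the pe_t fields, and the pe_tick interface are illustrative assumptions, not the actual design.

```c
#include <stdio.h>

/* Illustrative per-PE controller states; the real design is an HDL FSM. */
typedef enum { PE_LOAD, PE_BROADCAST, PE_COMPUTE, PE_DONE } pe_state_t;

typedef struct {
    pe_state_t state;
    double a, b;   /* operands received on the row/column broadcast */
    double acc;    /* running partial sum of one C entry            */
    int    step;   /* how many multiply-accumulates are done        */
    int    n;      /* length of the dot product                     */
} pe_t;

/* One controller tick.  All PEs advance in lock-step, so the schedule
 * itself provides the implicit synchronization between them. */
void pe_tick(pe_t *pe, double row_bcast, double col_bcast) {
    switch (pe->state) {
    case PE_LOAD:        /* storage phase: reset local state */
        pe->acc  = 0.0;
        pe->step = 0;
        pe->state = PE_BROADCAST;
        break;
    case PE_BROADCAST:   /* communication phase: latch broadcast values */
        pe->a = row_bcast;
        pe->b = col_bcast;
        pe->state = PE_COMPUTE;
        break;
    case PE_COMPUTE:     /* computation phase: one multiply-accumulate */
        pe->acc += pe->a * pe->b;
        pe->step++;
        pe->state = (pe->step == pe->n) ? PE_DONE : PE_BROADCAST;
        break;
    case PE_DONE:
        break;
    }
}

int main(void) {
    pe_t pe = { PE_LOAD, 0.0, 0.0, 0.0, 0, 2 };
    double row[2] = { 1.0, 3.0 }, col[2] = { 2.0, 4.0 };

    pe_tick(&pe, 0.0, 0.0);               /* load              */
    for (int k = 0; k < 2; k++) {
        pe_tick(&pe, row[k], col[k]);     /* broadcast step k  */
        pe_tick(&pe, 0.0, 0.0);           /* compute step k    */
    }
    printf("acc = %g (expected 1*2 + 3*4 = 14)\n", pe.acc);
    return 0;
}
```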

PE unit: Each PE is equipped with its own local memory for storing entries of the multiplied matrices at the start of the computation and for broadcasting them along its row and column.
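The role of the local memories and the row/column broadcasts can be illustrated with a plain software model of a broadcast-based grid multiply; the 3x3 size, the pe_cell_t structure, and the phase ordering are assumptions made for this sketch, not the project's actual dataflow.

```c
#include <stdio.h>

#define N 3   /* grid (and matrix) dimension for this toy example */

/* PE(i,j) keeps a(i,j) and b(i,j) in its local memory and accumulates
 * c(i,j); this is a software model, not the project's HDL. */
typedef struct { double a, b, c; } pe_cell_t;

int main(void) {
    pe_cell_t grid[N][N];
    double A[N][N] = { {1,2,3}, {4,5,6}, {7,8,9} };
    double B[N][N] = { {9,8,7}, {6,5,4}, {3,2,1} };

    /* Load phase: each PE stores its own entries of A and B locally. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            grid[i][j] = (pe_cell_t){ A[i][j], B[i][j], 0.0 };

    /* N broadcast/compute steps: at step k, PE(i,k) broadcasts its A
     * entry along row i, PE(k,j) broadcasts its B entry along column j,
     * and every PE performs one multiply-accumulate. */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                grid[i][j].c += grid[i][k].a * grid[k][j].b;

    /* Read back the result held in each PE's local memory. */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%6.1f", grid[i][j].c);
        printf("\n");
    }
    return 0;
}
```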

Handling Larger Matrices: To handle larger matrices, we break the computation into a sequence of smaller updates using a hierarchical blocking of the input matrices. Each update in the hierarchy is called a "loop". Since there is no loop-carried dependency between updates, we aim to pipeline the outer loop so that the current cycle's computation overlaps with the previous cycle's write-back and the next cycle's prefetching of matrix blocks.
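A minimal C sketch of the blocking idea, using toy sizes and an identity-matrix check in place of real data; the tile size BS stands in for the PE-grid dimension, and tile_update is a software stand-in for the work handed to the grid in one "loop". The comments mark where the outer-loop pipelining opportunity lies.

```c
#include <stdio.h>

#define N  6   /* full matrix dimension (toy size)        */
#define BS 3   /* block size = dimension of the PE grid   */

/* One "loop": update a BS x BS tile of C with the product of one tile
 * of A and one tile of B.  On the FPGA this is the work handed to the
 * PE grid; here it is a plain software stand-in. */
static void tile_update(double C[N][N], const double A[N][N],
                        const double B[N][N], int bi, int bj, int bk) {
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[bi + i][bj + j] += A[bi + i][bk + k] * B[bk + k][bj + j];
}

int main(void) {
    static double A[N][N], B[N][N], C[N][N];

    /* Fill A with simple test data and set B to the identity,
     * so the result C should equal A. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = (i == j);
        }

    /* Hierarchical blocking: the large multiplication becomes a sequence
     * of tile updates ("loops").  Updates that target different C tiles
     * carry no dependency between them, so the outer loop is the one that
     * can be pipelined on the FPGA: while the grid computes update t,
     * the result of update t-1 is written back and the tiles for update
     * t+1 are prefetched. */
    for (int bi = 0; bi < N; bi += BS)
        for (int bj = 0; bj < N; bj += BS)
            for (int bk = 0; bk < N; bk += BS)
                tile_update(C, A, B, bi, bj, bk);

    printf("C[5][4] = %g (expected %g)\n", C[5][4], A[5][4]);
    return 0;
}
```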

A Problem With Larger Matrices: Moving data into and out of the computational grid independently for each block in the hierarchy can be expensive, so the cost must be amortized.