HSDSL, Technion Spring 2014 Preliminary Design Review Matrix Multiplication on FPGA Project No.: 1998 Project B 044169 By: Zaid Abassi Supervisor: Rolf Hilgendorf April 2, 2014

Background and Motivation: 1. Matrix multiplication carried out naively is prohibitively expensive, so there is a need for research into an efficient, parallel matrix multiplication algorithm.

2. In application-specific designs (here, matrix multiplication), as opposed to broader general-purpose architectures, the order and number of operations are known at design time, which creates an opportunity to save overhead that would otherwise be incurred.

3. Matrix multiplication is an elementary building block of more advanced linear algebra core operations on matrices, such as matrix inversion and linear transformations, so the need for efficient matrix multiplication is all the greater.

4. Over the years, the complexity of matrix multiplication in software has improved through specialized data structures, and we aim to investigate approaches inspired by them in an FPGA implementation.

Our Goal: To develop a matrix multiplication algorithm specifically for the FPGA that maximizes efficiency through parallel design while reducing power consumption as much as possible.

The System Top Level View

Processing Entity (PE)

PE unit

PE unit: The controller for each PE is an FSM that regulates the PE's operations: storage, computation, and communication (broadcasting). The controller must be smart enough to autonomously manage synchronized PE operation, with handshaking and global communication depending on implicit synchronization between all the PEs.
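To make the phase structure concrete, here is a minimal behavioral sketch in C (not the project's HDL) of a single PE controller stepping through load, broadcast, and compute states; the state names, the pe_t fields, and the pe_tick interface are illustrative assumptions, not the actual design.

```c
#include <stdio.h>

/* Illustrative per-PE controller states; the real design is an HDL FSM. */
typedef enum { PE_LOAD, PE_BROADCAST, PE_COMPUTE, PE_DONE } pe_state_t;

typedef struct {
    pe_state_t state;
    double a, b;   /* operands received on the row/column broadcast */
    double acc;    /* running partial sum of one C entry            */
    int    step;   /* how many multiply-accumulates are done        */
    int    n;      /* length of the dot product                     */
} pe_t;

/* One controller tick.  All PEs advance in lock-step, so the schedule
 * itself provides the implicit synchronization between them. */
void pe_tick(pe_t *pe, double row_bcast, double col_bcast) {
    switch (pe->state) {
    case PE_LOAD:        /* storage phase: reset local state */
        pe->acc  = 0.0;
        pe->step = 0;
        pe->state = PE_BROADCAST;
        break;
    case PE_BROADCAST:   /* communication phase: latch broadcast values */
        pe->a = row_bcast;
        pe->b = col_bcast;
        pe->state = PE_COMPUTE;
        break;
    case PE_COMPUTE:     /* computation phase: one multiply-accumulate */
        pe->acc += pe->a * pe->b;
        pe->step++;
        pe->state = (pe->step == pe->n) ? PE_DONE : PE_BROADCAST;
        break;
    case PE_DONE:
        break;
    }
}

int main(void) {
    pe_t pe = { PE_LOAD, 0.0, 0.0, 0.0, 0, 2 };
    double row[2] = { 1.0, 3.0 }, col[2] = { 2.0, 4.0 };

    pe_tick(&pe, 0.0, 0.0);               /* load              */
    for (int k = 0; k < 2; k++) {
        pe_tick(&pe, row[k], col[k]);     /* broadcast step k  */
        pe_tick(&pe, 0.0, 0.0);           /* compute step k    */
    }
    printf("acc = %g (expected 1*2 + 3*4 = 14)\n", pe.acc);
    return 0;
}
```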

PE unit: Each PE is equipped with its own local memory for storing entries of the multiplied matrices at the start of the computation and for broadcasting them along its row and column.
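The role of the local memories and the row/column broadcasts can be illustrated with a plain software model of a broadcast-based grid multiply; the 3x3 size, the pe_cell_t structure, and the phase ordering are assumptions made for this sketch, not the project's actual dataflow.

```c
#include <stdio.h>

#define N 3   /* grid (and matrix) dimension for this toy example */

/* PE(i,j) keeps a(i,j) and b(i,j) in its local memory and accumulates
 * c(i,j); this is a software model, not the project's HDL. */
typedef struct { double a, b, c; } pe_cell_t;

int main(void) {
    pe_cell_t grid[N][N];
    double A[N][N] = { {1,2,3}, {4,5,6}, {7,8,9} };
    double B[N][N] = { {9,8,7}, {6,5,4}, {3,2,1} };

    /* Load phase: each PE stores its own entries of A and B locally. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            grid[i][j] = (pe_cell_t){ A[i][j], B[i][j], 0.0 };

    /* N broadcast/compute steps: at step k, PE(i,k) broadcasts its A
     * entry along row i, PE(k,j) broadcasts its B entry along column j,
     * and every PE performs one multiply-accumulate. */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                grid[i][j].c += grid[i][k].a * grid[k][j].b;

    /* Read back the result held in each PE's local memory. */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%6.1f", grid[i][j].c);
        printf("\n");
    }
    return 0;
}
```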

Handling Larger Matrices: To handle larger matrices, we break the computation into a sequence of smaller updates using a hierarchical blocking of the input matrices. Each update in the hierarchy is called a "loop". Since there is no loop-carried dependency between updates, we aim to pipeline the outer loop so that the current cycle's computation overlaps with the previous cycle's write-back and the next cycle's prefetching of matrix blocks.
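A minimal C sketch of the blocking idea, using toy sizes and an identity-matrix check in place of real data; the tile size BS stands in for the PE-grid dimension, and tile_update is a software stand-in for the work handed to the grid in one "loop". The comments mark where the outer-loop pipelining opportunity lies.

```c
#include <stdio.h>

#define N  6   /* full matrix dimension (toy size)        */
#define BS 3   /* block size = dimension of the PE grid   */

/* One "loop": update a BS x BS tile of C with the product of one tile
 * of A and one tile of B.  On the FPGA this is the work handed to the
 * PE grid; here it is a plain software stand-in. */
static void tile_update(double C[N][N], const double A[N][N],
                        const double B[N][N], int bi, int bj, int bk) {
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[bi + i][bj + j] += A[bi + i][bk + k] * B[bk + k][bj + j];
}

int main(void) {
    static double A[N][N], B[N][N], C[N][N];

    /* Fill A with simple test data and set B to the identity,
     * so the result C should equal A. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = (i == j);
        }

    /* Hierarchical blocking: the large multiplication becomes a sequence
     * of tile updates ("loops").  Updates that target different C tiles
     * carry no dependency between them, so the outer loop is the one that
     * can be pipelined on the FPGA: while the grid computes update t,
     * the result of update t-1 is written back and the tiles for update
     * t+1 are prefetched. */
    for (int bi = 0; bi < N; bi += BS)
        for (int bj = 0; bj < N; bj += BS)
            for (int bk = 0; bk < N; bk += BS)
                tile_update(C, A, B, bi, bj, bk);

    printf("C[5][4] = %g (expected %g)\n", C[5][4], A[5][4]);
    return 0;
}
```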

A Problem With Larger Matrices: Moving data into and out of the computational grid independently for each block in the hierarchy can be expensive, so the cost must be amortized.