Implementation of Polymorphic Matrix Inversion using Viva
Arvind Sudarsanam and Dasu Aravind, Utah State University
MAPLD 2005/171

Overview
- Problem definition
- Matrix inverse algorithm
- Types of polymorphism
- Design set-up
- Hardware design flow (for LU decomposition)
- Results
- Conclusions

Problem Definition
Given a 2-D matrix A[N][N],

    A = | A[1,1] A[1,2] A[1,3] ... A[1,N] |
        | A[2,1] A[2,2] A[2,3] ... A[2,N] |
        | A[3,1] A[3,2] A[3,3] ... A[3,N] |
        |   .      .      .          .    |
        | A[N,1] A[N,2] A[N,3] ... A[N,N] |

determine the inverse matrix A⁻¹, defined by A × A⁻¹ = I.

Algorithm Flow
Step 1: LU decomposition. Matrix A is split into two triangular matrices, L and U:

    For i = 1:N
        For j = i+1:N
            A(j,i) = A(j,i) / A(i,i);
            A(j,(i+1):N) = A(j,(i+1):N) - A(j,i) * A(i,(i+1):N);
        End For j
    End For i
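For reference, a minimal NumPy sketch of the same in-place elimination (Doolittle-style LU without pivoting; nonzero pivots assumed). The function name and 0-based indexing are illustrative, not from the slides:

    import numpy as np

    def lu_inplace(A):
        # Overwrite A: strict lower part holds L (unit diagonal implied),
        # upper part holds U.
        N = A.shape[0]
        for i in range(N):
            for j in range(i + 1, N):
                A[j, i] = A[j, i] / A[i, i]             # multiplier L(j,i)
                A[j, i + 1:] -= A[j, i] * A[i, i + 1:]  # eliminate rest of row j
        return A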

Algorithm Flow
Step 2: Inverse computation for the triangular matrices. L⁻¹ and U⁻¹ are computed using a variation of Gaussian elimination:

    For i = 1:N
        For j = i+1:N
            Linv(j,i+1:N) = Linv(j,i+1:N) - L(j,i) * Linv(i,i+1:N);
        End For j
    End For i
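A sketch of the same elimination idea in NumPy, assuming L has a unit diagonal (as the LU step above produces): forward elimination on [L | I] turns the right-hand block into L⁻¹. Here the row update spans the whole row rather than the slide's sub-range:

    def lower_tri_inverse(L):
        # Invert a unit lower-triangular matrix by forward elimination on [L | I];
        # the right-hand block ends up holding L^-1.
        N = L.shape[0]
        Linv = np.eye(N)
        for i in range(N):
            for j in range(i + 1, N):
                Linv[j, :] -= L[j, i] * Linv[i, :]  # same update as the slide's inner step
        return Linv

U⁻¹ is obtained symmetrically by the mirrored (backward) elimination.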

Algorithm Flow
Step 3: Matrix multiplication. U⁻¹ and L⁻¹ are multiplied together to generate A⁻¹:

    For i = 1:N
        For j = 1:N
            For k = 1:N
                Ainv(i,j) = Ainv(i,j) + Uinv(i,k) * Linv(k,j);
            End For k
        End For j
    End For i
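The same triple loop in NumPy, with a quick end-to-end check tying the three steps together. This reuses the sketches above; the diagonally dominant test matrix guarantees nonzero pivots, and np.linalg.inv stands in for the U⁻¹ step:

    def tri_multiply(Uinv, Linv):
        # Ainv[i,j] = sum_k Uinv[i,k] * Linv[k,j]
        N = Uinv.shape[0]
        Ainv = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                for k in range(N):
                    Ainv[i, j] += Uinv[i, k] * Linv[k, j]
        return Ainv

    # Usage: A = L*U, so A^-1 = U^-1 * L^-1 and A @ Ainv should be I.
    A = np.random.rand(8, 8) + 8 * np.eye(8)
    LU = lu_inplace(A.copy())
    L = np.tril(LU, -1) + np.eye(8)
    U = np.triu(LU)
    Ainv = tri_multiply(np.linalg.inv(U), lower_tri_inverse(L))
    assert np.allclose(A @ Ainv, np.eye(8))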

Types of Polymorphism
The following parameters can be varied for the input matrix:
- Data type: variable precision, signed/unsigned, and float
- Information rate: the rate at which input arrives into, and leaves, the system (pipelining/parallelism)
- Order tensor: matrix size (16x16, 32x32, etc.)

Polymorphism and Viva
Viva supports polymorphic hardware implementation, much as polymorphic software programming languages do. A large library of polymorphic arithmetic, control, and memory modules is available.

Data Type Polymorphism
[Slide figure: a Viva design with a unit labeled 'Polymorphic']

Information Rate Polymorphism
The clock speed can be changed based on the input data rate. The 'Mul' unit shown is a truly polymorphic object: based on the input list size, the Viva compiler generates the required number of parallel multiplier units. The number of parallel units is denoted 'K'.

Order Tensor Polymorphism
The value of 'N' is set at run time.

Design Flow – Top-Level Block Diagram
- Central Control Unit (CCGU)
- Memory units for A, L, U, L⁻¹, U⁻¹, and A⁻¹ (A is loaded from files)
- LU Decompose loop unit
- Inverse-of-L loop unit
- Inverse-of-U loop unit
- U⁻¹ × L⁻¹ loop unit

Design Flow

    Main Step | Operation    | Sub Step | Sub Module
    1         | Initialize   | 0        | Generate address
              |              | 1        | Write A onto BRAM
    2         | LU Decompose | 0        | Generate 'i', 'j', 'k'
              |              | 1        | Read A[j,i], A[j,()], ...
              |              | 2        | Compute new A[j,()]
              |              | 3        | Write A[j,()], A[j,i]
    3         | A2LU Convert | 0        | Generate 'j', 'k'
              |              | 1        | Read A[j,()]
              |              | 2        | Compute L[j,()], U[j,()]
              |              | 3        | Write L[j,()], U[j,()]

Design Flow (continued)

    Main Step | Operation | Sub Step | Sub Module
    4         | L inverse | 0        | Generate 'i', 'j', 'k'
              |           | 1        | Read L[j,()], L⁻¹[j,()], ...
              |           | 2        | Compute new L⁻¹[j,()]
              |           | 3        | Write L⁻¹[j,()]
    5         | U inverse | 0        | Generate 'i', 'j', 'k'
              |           | 1        | Read U[j,()], U⁻¹[j,()], ...
              |           | 2        | Compute U⁻¹[j,()]
              |           | 3        | Write U⁻¹[j,()]
    6         | A inverse | 0        | Generate 'i', 'j', 'k'
              |           | 1        | Read L[i,()], U[j,()]
              |           | 2        | Compute Ainv[i,j,()]
              |           | 3        | Update Ainv[i,j]
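Read as software, the main steps chain together as below. This is only an assumed high-level analogue of what the control unit sequences in hardware, reusing the sketches from the algorithm slides:

    def invert_matrix(A):
        # Steps 1-2: load A and LU-decompose it in place.
        N = A.shape[0]
        LU = lu_inplace(A.copy())
        # Step 3 (A2LU convert): split the packed result into L and U.
        L = np.tril(LU, -1) + np.eye(N)
        U = np.triu(LU)
        # Steps 4-5: invert the triangular factors.
        Linv = lower_tri_inverse(L)
        Uinv = np.linalg.inv(U)   # library call standing in for the U-inverse unit
        # Step 6: Ainv = Uinv x Linv.
        return tri_multiply(Uinv, Linv)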

Hardware Design Set-up
Hardware: PE6 (Xilinx 2V6000 FPGA, 66 MHz, 33,768 slices) of the Starbridge Hypercomputer, connected to an Intel x86 processor.
Software: Viva 2.3, developed at Starbridge Systems.

Implementation – LU Decomposition
[Slide block diagram] The Loop Unit sends indices (i, j, k) to the Address Generation Unit and to the Computation Unit. The Memory Unit supplies A[j,()], A[i,()], A[j,i], and A[i,i] to the Computation Unit, which writes back A[j,()] and A[j,i].

Loop Unit – Functionality
Given the order of the matrix 'N' and the parallelism to be supported 'K', the following loop structure needs to be generated:

    For i = 1 to N
        For k = ((i-1)/K)*K to N+1-K in steps of K
            For j = i to N
                Generate(i, k, j);
            End j
        End k
    End i
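A sketch of the same index stream in Python, assuming the slide's loop bounds are inclusive, i and j are 1-based, and the division is an integer divide:

    def loop_indices(N, K):
        # Yields (i, k, j) in the order the Loop Unit must generate them.
        for i in range(1, N + 1):
            for k in range(((i - 1) // K) * K, N + 2 - K, K):  # inclusive bound N+1-K
                for j in range(i, N + 1):
                    yield i, k, j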

Loop Unit – Architecture
A simple register-based implementation is shown; the overall latency is 2 clock cycles.

Memory Unit – Distribution

    A[1,1:8]  A[1,9:16]  A[1,17:24]  A[1,25:32]  ...
    A[2,1:8]  A[2,9:16]  A[2,17:24]  ...
    A[3,1:8]  A[3,9:16]  ...
    A[4,1:8]  ...

Each entry (e.g. A[1,1:8]) is one block of K = 8 consecutive elements.

Memory Unit – Architecture
- BRAM memories are used to store data internally. (The matrix is expected to fit into the BRAMs; the maximum value of N is 128.)
- There are K individual BRAMs, each [(N×N)/K] entries deep and one (variable-size) data word wide.
- The K values in each block of the matrix are distributed over the K BRAMs (a possible mapping is sketched after this list). This gives a single-clock access time for internal memory.
- A[j,()] and A[j,i] are fetched one after the other on every iteration.
- The overall latency was found to be 3 clock cycles.
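One plausible reading of this banking scheme as code; the mapping is an assumption consistent with the described distribution, not taken from the slides. Element A[r,c] (0-based) lives in BRAM (c mod K), and each K-element block of a row occupies one address, so a whole block is read in a single cycle:

    def bram_location(r, c, N, K):
        # Assumed mapping: lane = column mod K, address = row-major block index.
        # All K elements of a block share one address across the K lanes.
        lane = c % K
        addr = (r * N + c) // K
        return lane, addr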

Address Generation – Functionality
Inputs: i, j, k from the Loop Unit.
Outputs:
- Address in the BRAM for the A[j,()] and A[i,()] blocks of data
- Address in the BRAM of A[j,i] and A[i,i]

The computations have been organized in such a way that A[i,()] needs to be fetched only once for processing a complete column of blocks. Thus, only one port is required to access both A[i,()] and A[j,()].

Address Generation – Architecture
'Shift' operations are used instead of multipliers; N and K are assumed to be powers of 2. (Latency = 1 clock cycle.)
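For illustration, the block base address from the mapping sketched earlier reduces to shifts when N and K are powers of two; log2_N and log2_K are assumed precomputed constants:

    def block_base_address(j, k, log2_N, log2_K):
        # (j*N + k) / K with no multiplier or divider: shift, add, shift.
        return ((j << log2_N) + k) >> log2_K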

Computation Units – Functionality
Inputs:
- A[j,()] and A[i,()] blocks from the BRAM unit
- A[j,i] and A[i,i] from the BRAM unit
- Indices i, j, k from the Loop Unit
Output: the modified A[j,()] block and the A[j,i] value.

Three steps are performed:
1. Modify A[i,()] based on the loop indices
2. Perform the computations: divide, multiply, subtract (see the sketch after this list)
3. Insert A[j,i] into A[j,()] if required
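A sketch of the per-block arithmetic; the names are assumed, and in hardware all K lanes update in parallel (here a NumPy array of length K stands in for the K lanes):

    def compute_block(Aj_blk, Ai_blk, Aji, Aii):
        # Divide, multiply, subtract: one elimination step on a K-element block.
        m = Aji / Aii                  # multiplier A(j,i)/A(i,i)
        new_blk = Aj_blk - m * Ai_blk  # vectorized over the K lanes
        return new_blk, m

Applied across all blocks of row j, this reproduces the inner body of the Step-1 pseudocode.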

Computation Units – Architecture (K = 8)

Results for LUD – Slice Counts (N = 16)

    List Type | Fix16     | Fix32       | Float
    Size=4    | 1862 (8)  | 7305 (32)   | 5012 (12)
    Size=8    | 3731 (16) | 14472 (64)  | 9802 (24)
    Size=16   | … (32)    | 29018 (128) | 19024 (48)

The number of ROM multipliers used is shown in brackets.

Results for LUD – Time Taken (in cycles)

    List Type | Fix16 | Fix32 | Float
    Size=4    |   …   |   …   |   …
    Size=8    |   …   |   …   |   …
    Size=16   |   …   |   …   |   …

Time Taken vs. Size of Matrix (Fix16, K = 8)

    Size of the matrix | Time taken (in cycles)
    16x16              | …
    32x32              | …
    64x64              | …
    128x128            | … (… ns)

A 'C' implementation (N = 128, Fix16) will take O(M·N³) time, roughly …·M ns, where 'M' is the number of cycles per iteration (~30), on an Intel Centrino at 1.5 GHz: roughly an M/6 speed-up.

Conclusions
A polymorphic design for matrix inverse was implemented:
- Data type: Float / Fix16 / Fix32
- Information rate (K): 4 / 8 / 16
- Order tensor (N): 16 / 32 / 64 / 128
Viva's effectiveness for polymorphic implementation was evaluated, and the hardware design flow and results were shown for LU decomposition.

Lessons Learned
Pseudo-polymorphism: some of the polymorphic objects in the Viva library are pseudo-polymorphic, e.g. the floating-point and fixed-point implementations of the adder unit.
Need for a timing analysis tool: it was difficult to compute the delays associated with each block in the Viva library.
Fix32 vs. Float: the division unit in the Viva library is optimized for floating point, not for fixed point (as shown in the results).