Enhanced matrix multiplication algorithm for FPGA Tamás Herendi, S. Roland Major UDT2012.


Introduction The presented work is based on the algorithm by T. Herendi for constructing uniformly distributed linear recurring sequences to be used for pseudo-random number generation. The most time-consuming part of the algorithm is the exponentiation of large matrices to an extremely high power. An extremely fast FPGA design is detailed that achieves a speedup factor of ~1000.

Mathematical background The algorithm constructs uniformly distributed linear recurring sequences modulo powers of 2. The sequences can have arbitrarily large period lengths. New elements are easy to compute. Unpredictability does not hold.

Mathematical background The sequences are linear recurring sequences: each new element is a fixed linear combination of the previous elements. The coefficients are chosen so that the defining relation holds for some irreducible polynomial P(x). It is practical to choose P(x) to have maximal order, since the order of P(x) is closely related to the period length of the corresponding sequence.
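As an illustration, such a sequence can be generated with a simple sliding window over the last d elements. The coefficients, initial values, and modulus below are hypothetical placeholders, not the ones produced by the actual construction:

```python
# Sketch: generating a linear recurring sequence modulo a power of 2.
# u_{n+d} = a_{d-1}*u_{n+d-1} + ... + a_0*u_n (mod modulus).
# The coefficients and initial values used here are hypothetical; the
# real algorithm chooses them so the sequence is uniformly distributed.

def recurring_sequence(coeffs, init, modulus, count):
    """Return the first `count` elements of the recurring sequence."""
    u = list(init)
    d = len(coeffs)
    while len(u) < count:
        # linear combination of the last d elements, reduced mod 2^w
        nxt = sum(a * x for a, x in zip(coeffs, u[-d:])) % modulus
        u.append(nxt)
    return u
```

For example, with coefficients [1, 1] and initial values [0, 1] this reduces to the Fibonacci sequence modulo the chosen power of 2.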

Mathematical background The sequence obtained this way does not necessarily have uniform distribution, but exactly one of the candidate sequences does. Two of the candidates can be easily eliminated.

Mathematical background Let M be the companion matrix of the sequence u. We need to compute a high power of M. If the result is the identity matrix, then the period length of u is the shorter candidate period. If it is not, then u has a uniform distribution.
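For illustration, the companion matrix of a degree-d recurrence can be sketched as below; the coefficient values in the example are hypothetical, not those of the actual construction:

```python
# Sketch: companion matrix of a degree-d linear recurrence with
# coefficients a_0 .. a_{d-1}. Multiplying a state vector
# (u_n, ..., u_{n+d-1}) by this matrix advances the sequence one step.

def companion_matrix(coeffs):
    d = len(coeffs)
    m = [[0] * d for _ in range(d)]
    for i in range(d - 1):
        m[i][i + 1] = 1          # shift part: moves the window forward
    m[d - 1] = list(coeffs)      # last row holds the recurrence coefficients
    return m
```

Raising this matrix to a high power (the expensive step targeted by the FPGA design) then advances the sequence by that many steps at once.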

Mathematical background Computing the matrix power is done using 1-bit elements: multiplication modulo 2.
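A minimal software sketch of this step, assuming square-and-multiply exponentiation over GF(2) with rows packed into integers (one bit per matrix element), might look like:

```python
# Sketch: matrix exponentiation with 1-bit elements (mod-2 arithmetic).
# Each row of an n x n matrix is packed into one integer bitmask, so a
# row of the product is the XOR of the rows of b selected by that row's
# set bits.

def mat_mult_gf2(a, b, n):
    """Multiply two n x n GF(2) matrices given as lists of row bitmasks."""
    result = []
    for row in a:
        acc = 0
        for j in range(n):
            if (row >> j) & 1:   # bit j set -> add (XOR in) row j of b
                acc ^= b[j]
        result.append(acc)
    return result

def mat_pow_gf2(m, e, n):
    """Square-and-multiply exponentiation of an n x n GF(2) matrix."""
    result = [1 << i for i in range(n)]  # identity matrix
    base = m
    while e:
        if e & 1:
            result = mat_mult_gf2(result, base, n)
        base = mat_mult_gf2(base, base, n)
        e >>= 1
    return result
```

With the 2x2 Fibonacci companion matrix modulo 2 (rows 0b10 and 0b11), the cube is the identity, matching the period 3 of the Fibonacci sequence mod 2.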

Implementation Matrix exponentiation for interesting problem sizes can quickly become very time-consuming. For matrix size 1000×1000 (Intel E8400 3 GHz Dual Core CPU): Matlab implementation: ~6 minutes. Highly optimized C++ program: ~105 seconds. Previous FPGA implementation: ~0.6 seconds. New FPGA implementation (in development): ~5-10 times faster than the previous version.

FPGA Field-programmable gate array. Creates an application-specific hardware solution (like an ASIC). A grid of computing elements and connecting elements. Reprogrammable! Look-up tables, registers, block RAMs, special multipliers, etc.

Look-up table 6-LUT: a look-up table with a 6-bit input: 64 bits of memory, addressed bit by bit. By manipulating this 64-bit value, it can be configured to compute any Boolean function with a 6-bit input. LUTs are arranged in a grid on the chip, organized into slices usually containing 2 or 4 LUTs. Some have added functionality, such as use as shift registers. Additional features increase efficiency (registers, carry chains, etc.).
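The 6-LUT behaviour described above can be modelled in a few lines: the 64-bit configuration word is a truth table, and the 6 inputs form an address into it. This is a conceptual sketch, not vendor-specific behaviour:

```python
# Sketch: a 6-input LUT as 64 bits of memory addressed bit by bit.

def make_lut6(func):
    """Build the 64-bit configuration word for any 6-input Boolean function."""
    config = 0
    for addr in range(64):
        bits = [(addr >> i) & 1 for i in range(6)]
        if func(*bits):
            config |= 1 << addr   # store the function's value at this address
    return config

def lut6(config, *inputs):
    """Evaluate the LUT: the 6 input bits address one bit of the config word."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return (config >> addr) & 1
```

For a 6-input AND, for instance, only address 63 (all inputs high) is set in the configuration word.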

FPGA Solutions are extremely efficient. Supports massive parallelism. Best at algorithms performing many operations on relatively small amounts of data at a time. A departure from the traditional von Neumann architecture.

FPGA Physically, configurations are automata networks. Creating a module takes multiple iterations: Synthesize, Translate, Map, Place & Route, Generate programming file.

FPGA However: large power consumption, and large modules take a very long time to compile (so simulation is important).

Hardware used XUPV505-LX110T development platform. Virtex-5 XC5VLX110T FPGA. 6-LUT: 64-bit look-up table. Slices containing 4 LUTs each. LUTs usable as 32-bit deep shift registers. 36 kb block RAMs. 256 MB DDR2 SODIMM.

Modules Basic LUTs: Multiplier: 3 pairs of 1-bit elements. Adder: 6 1-bit elements. Old version: cascaded multiply-accumulate LUTs lose efficiency at higher clock rates. New version: adder tree structure. 32 multiplier LUTs compute the dot product of two 96-bit-long vectors. Matrix size: 1920×1920 (a multiple of 96).
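The dot-product structure described above can be sketched in software, assuming mod-2 arithmetic as on the earlier slides: a "multiplier LUT" consumes 3 pairs of 1-bit elements, an "adder LUT" sums up to 6 one-bit values, and 32 multiplier LUTs plus an adder tree cover two 96-bit vectors:

```python
# Sketch: 96-bit GF(2) dot product built from 6-input LUT primitives.

def multiplier_lut(a3, b3):
    """3 pairs of 1-bit elements -> 1-bit partial dot product (mod 2)."""
    return (a3[0] & b3[0]) ^ (a3[1] & b3[1]) ^ (a3[2] & b3[2])

def adder_lut(bits):
    """Sum of up to 6 one-bit elements, modulo 2."""
    acc = 0
    for b in bits:
        acc ^= b
    return acc

def dot_product_96(a, b):
    """Dot product mod 2 of two 96-element 0/1 vectors via 32 multiplier LUTs."""
    partials = [multiplier_lut(a[3*i:3*i+3], b[3*i:3*i+3]) for i in range(32)]
    # adder tree: 32 partials -> groups of up to 6 -> one final adder LUT
    level1 = [adder_lut(partials[6*i:6*i+6]) for i in range(6)]
    return adder_lut(level1)
```

In hardware each of these functions fits a single 6-LUT, which is what makes the tree both shallow and cheap.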

Modules 1024 such multiplier modules work in parallel, multiplying a 32×96 and a 96×32 piece of the input into a 32×32 piece of the solution in a single clock cycle (~40,000 LUTs). The multiplier is very fast compared to the main storage (DDR2: high capacity, low bandwidth). Old version: careful control of the input flow. New version: intermediate storage (block RAM: high bandwidth, low capacity).

Modules Input matrices are divided into 96×1920 strips. To maximise matrix size, the block RAM holds as little of the input as possible. Using a 1920×96 and a 96×1920 strip from the input, the module computes a 1920×1920 intermediate result. Strips are read iteratively from the input, and their results are accumulated together.
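The strip-wise accumulation scheme can be sketched as follows, with mod-2 arithmetic as before and a small strip width w in place of the design's 96; each strip pair contributes a rank-w update that is XORed into the full-size intermediate result:

```python
# Sketch: C = A*B over GF(2), computed strip by strip. A is consumed in
# n x w column strips and B in w x n row strips; their product is
# accumulated into the full intermediate result, mirroring the iterative
# reads from main storage described above.

def strip_matmul_gf2(a, b, w):
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for s in range(0, n, w):                # one strip pair per iteration
        for i in range(n):
            for j in range(n):
                acc = 0
                for k in range(s, min(s + w, n)):
                    acc ^= a[i][k] & b[k][j]
                c[i][j] ^= acc              # accumulate into the result
    return c
```

Only one strip of each input needs to be resident at a time, which is why small, fast block RAM suffices as the intermediate storage.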

Thank you for your attention.