An Integrated Reduction Technique for a Double Precision Accumulator
Krishna Nagar, Yan Zhang, Jason Bakos
Dept. of Computer Science and Engineering, University of South Carolina

Double Precision Accumulation Many kernels targeted for acceleration include the accumulation of sets of floating-point values. For large datasets, the values are delivered serially to an accumulator, which must produce one sum per set.

[Figure: values A, B, C (set 1), D, E, F (set 2), and G, H, I (set 3) stream serially into an accumulator Σ, which emits A+B+C for set 1, D+E+F for set 2, and G+H+I for set 3.]
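A minimal software model of this target behavior (Python; names are hypothetical, and the hardware consumes one value per clock cycle):

```python
# Software model of set-wise accumulation: values arrive serially,
# each tagged with a set ID, and one sum is emitted per set.

def accumulate_sets(stream):
    """stream: iterable of (value, set_id) pairs, grouped by set."""
    sums = []
    cur_id, cur_sum = None, 0.0
    for value, set_id in stream:
        if set_id != cur_id:              # set boundary: emit previous sum
            if cur_id is not None:
                sums.append((cur_id, cur_sum))
            cur_id, cur_sum = set_id, 0.0
        cur_sum += value
    if cur_id is not None:
        sums.append((cur_id, cur_sum))
    return sums

# Example mirroring the slide's figure: sets {A,B,C}, {D,E,F}, {G,H,I}
stream = [(1.0, 1), (2.0, 1), (3.0, 1),   # A, B, C -> set 1
          (4.0, 2), (5.0, 2), (6.0, 2),   # D, E, F -> set 2
          (7.0, 3), (8.0, 3), (9.0, 3)]   # G, H, I -> set 3
print(accumulate_sets(stream))            # [(1, 6.0), (2, 15.0), (3, 24.0)]
```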

The Reduction Problem

[Figure: a pipelined double precision adder with its output fed back toward its input, plus a memory and control block; partial sums accumulate inside the pipeline and must be buffered and reduced.]
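A minimal Python illustration of why this problem arises (an abstract model, not the paper's datapath):

```python
# Feeding a pipelined adder's output straight back to its input splits
# one accumulation into as many interleaved partial sums as the adder
# has pipeline stages.

ALPHA = 3                       # adder pipeline depth
values = [1.0] * 10             # one input set

pipeline = [0.0] * ALPHA        # results in flight inside the adder
for v in values:
    feedback = pipeline.pop(0)  # oldest result retires and feeds back
    pipeline.append(feedback + v)

print(pipeline)       # [3.0, 3.0, 4.0] -- ALPHA partial sums, not 10.0
print(sum(pipeline))  # 10.0 is reached only after an extra reduction step
```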

Reduction-Based Accumulator: Previous Work

| Paper | D.p. adder IPs (~1000 slices each) | Reduction logic (slices) | Reduction BRAMs | DSP48s | D.p. adder speed | Accumulator speed | Out-of-order outputs |
|---|---|---|---|---|---|---|---|
| Prasanna DSA '07 (Virtex-2 Pro) | | | 3 | n/a | 170 MHz | 142 MHz | Yes |
| Prasanna SSA '07 (Virtex-2 Pro) | | | 6 | n/a | 170 MHz | 165 MHz | Yes |
| Gerards '08 (Virtex-4) | | | 9 | 3 (from d.p. adder) | 324 MHz | 200 MHz | No |
| This work (Virtex-5) | 0 | < 1000 | | | | 300+ MHz | No |

Approach Reduction complexity scales with the latency of the core operation. Can the latency of the double precision add be reduced?

IEEE 754 adder pipeline (example assumes a 4-bit significand):
– Compare exponents
– De-normalize the smaller value
– Add the 53-bit mantissas
– Round
– Re-normalize
– Round

[Worked example omitted: two 4-bit-significand values scaled by 2^24 are aligned, added, rounded, and re-normalized.]
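A behavioral sketch of these stages in Python, using a toy 4-bit stored fraction and truncation in place of IEEE round-to-nearest (the function name and precision are illustrative):

```python
# Toy model of the floating-point add pipeline stages listed above.
# Significands are integers with an explicit leading 1 (frac_bits + 1
# bits total); normalization after cancellation is omitted for brevity.

def fp_add(sig_a, exp_a, sig_b, exp_b, frac_bits=4):
    # 1. compare exponents (swap so operand A is the larger)
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    # 2. de-normalize the smaller operand (variable right shift)
    sig_b >>= exp_a - exp_b
    # 3. add significands (53 bits wide in double precision)
    sig, exp = sig_a + sig_b, exp_a
    # 4./5. re-normalize and round (truncating here)
    while sig >= (1 << (frac_bits + 1)):
        sig >>= 1
        exp += 1
    return sig, exp

# 1.0110b * 2**24  +  1.0100b * 2**22
print(fp_add(0b10110, 24, 0b10100, 22))   # -> (0b11011, 24), i.e. 27, 24
```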

Adder Pipeline

Mantissa addition:
– Cascaded, pipelined DSP48 adders
– Scales well, operates fast

De-normalize:
– Exponent comparison and a variable shift of one significand
– Xilinx IP uses a DSP48 for the 11-bit comparison (a waste)

Base Conversion Previous work on single-precision MAC designs used base conversion.
– Idea: shift both inputs to the left by the amount specified in the low-order bits of their exponents
– Reduces the size of the exponent, but requires a wider adder

Example, base-8 conversion (a code sketch follows this slide):
– x × 2^22, exp = 10110 (≈ 5.7 million)
– Shift the significand to the left by 6 bits…
– Result: exp = 10, i.e. 87.25 × 2^(8·2) ≈ 5.7 million
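A sketch of the conversion on integer significands (treating the value as sig × 2^exp; the function name and the integer restatement of the slide's example are mine):

```python
def to_base(sig, exp, b=8):
    """Re-express sig * 2**exp as new_sig * (2**b)**new_exp."""
    low = exp % b                  # low-order bits of the exponent
    return sig << low, exp // b    # wider significand, shorter exponent

# Integer restatement of the slide's ~5.7 million example:
# 1396 * 2**12 == 87.25 * 2**16 == 5,718,016
sig, exp = 1396, 12
new_sig, new_exp = to_base(sig, exp, b=8)
assert new_sig * (2 ** 8) ** new_exp == sig * 2 ** exp
print(new_sig, new_exp)            # 22336 1  (shifted left by 12 mod 8 = 4)
```

The exponent compare now only looks at the high-order bits, at the cost of a significand adder up to b − 1 bits wider.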

Exponent Compare vs. Adder Width

[Table relating base, exponent width, de-normalize speed (MHz), adder width, and DSP48 count: larger bases shrink the exponent and speed up the de-normalize stage, but widen the significand adder and consume more DSP48s. The numeric entries are omitted. Accompanying figure: de-normalize, DSP48 adder, and re-normalize stages.]
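The rough bookkeeping behind this trade-off, under the assumption that base 2^b removes log2(b) exponent bits and adds up to b − 1 bits of pre-shift to the 53-bit significand adder (a model, not the paper's exact figures):

```python
# Assumed width model for base-converted double precision addition.
import math

for b in (8, 16, 32, 64):
    exp_bits = 11 - int(math.log2(b))   # exponent bits left to compare
    adder_bits = 53 + (b - 1)           # significand + worst-case pre-shift
    print(f"base 2^{b}: exponent {exp_bits} bits, adder ~{adder_bits} bits")
```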

Accumulator Design

[Figure: top-level block diagram of the accumulator.]

Three-Stage Reduction Architecture

[Animation spanning several slides: a three-stage "adder" pipeline sits between an input buffer and an output buffer. Inputs B1–B8 of set B arrive one per cycle; B1–B3 enter the pipeline paired with zero. As each result retires it is fed back and combined with a new input or another partial, so the partials B2+B3+B6, B1+B4+B7, and B5+B8 rotate through the pipeline and buffers. When C1, the first value of the next set, arrives paired with zero, set B is fully represented by three partial sums, which are coalesced while set C streams in.]
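The sketch below models this integrated idea in software: the adder pipeline and a small buffer together hold each set's partial sums, and a priority rule chooses the adder's operands every cycle. It is a behavioral approximation, not the paper's deterministic controller:

```python
# Behavioral model of integrated reduction with an ALPHA-stage adder.
# Each cycle the adder either (a) combines the new input with a
# buffered partial of the same set, (b) coalesces two buffered
# partials of one set, or (c) starts a new partial as input + 0.
from collections import deque, defaultdict

ALPHA = 3  # adder pipeline depth

def reduce_stream(pairs):
    """pairs: (value, set_id) tuples, grouped by set; returns {set_id: sum}."""
    remaining = defaultdict(int)            # unmerged contributions per set
    for _, s in pairs:
        remaining[s] += 1
    inputs = deque(pairs)
    pipe = deque([None] * ALPHA)            # in-flight (set_id, partial)
    buf = defaultdict(list)                 # buffered partials, per set
    results = {}
    while inputs or any(pipe) or any(buf.values()):
        retired = pipe.popleft()            # result leaving the pipeline
        if retired:
            s, p = retired
            if remaining[s] == 1:           # last contribution: set complete
                results[s] = p
            else:
                buf[s].append(p)
        # choose the adder's operands for this cycle, by priority:
        if inputs and buf[inputs[0][1]]:    # (a) input + same-set partial
            v, s = inputs.popleft()
            pipe.append((s, buf[s].pop() + v))
            remaining[s] -= 1
        else:
            full = next((s for s, ps in buf.items() if len(ps) >= 2), None)
            if full is not None:            # (b) two buffered partials
                pipe.append((full, buf[full].pop() + buf[full].pop()))
                remaining[full] -= 1
            elif inputs:                    # (c) input + 0
                v, s = inputs.popleft()
                pipe.append((s, v))
            else:
                pipe.append(None)           # pipeline bubble
    return results

stream = [(1.0, 'B'), (2.0, 'B'), (3.0, 'B'), (4.0, 'B'),
          (5.0, 'C'), (6.0, 'C')]
print(reduce_stream(stream))                # {'B': 10.0, 'C': 11.0}
```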

Minimum Set Size Four datapath "configurations" (labeled A–D in the figure). A deterministic control sequence is triggered by each set change:
– D, A, C, B, A, B, B, C, B/D

The minimum set size is 8.

Use Case: Sparse Matrix-Vector Multiply

[Figure: an example sparse matrix with nonzero values A–K stored in CSR format (val, col, ptr arrays). The val/col pairs are regrouped by row and each group is zero-terminated: (A,0) (B,4) (0,0) (C,3) (D,4) (0,0)…]

Group the val/col pairs by row and zero-terminate each group (sketched below).
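A sketch of the zero-terminated stream encoding (Python; the (0.0, 0) sentinel pair follows the slide's "(0,0)" markers, other details are assumptions):

```python
# Encode a dense matrix as a serial stream of (val, col) pairs,
# one zero-terminated group per row.
def encode_rows(dense):
    stream = []
    for row in dense:
        for col, val in enumerate(row):
            if val != 0.0:
                stream.append((val, col))
        stream.append((0.0, 0))          # zero-terminate the row group
    return stream

A = [[1.0, 0, 0, 0, 2.0, 0],             # nonzeros shown as numbers here,
     [0, 0, 0, 3.0, 0, 4.0],             # standing in for A, B, C, ...
     [5.0, 0, 0, 0, 6.0, 7.0]]
print(encode_rows(A))
# [(1.0, 0), (2.0, 4), (0.0, 0), (3.0, 3), (4.0, 5), (0.0, 0), ...]
```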

SpMV Architecture Enough memory bandwidth to read:
– 5 val/col pairs (5 × 80 bits) per cycle
– ~15–20 GB/s (see the check below)

Requires a minimum number of entries per row:
– 5 × 8 = 40
– Many sparse matrices don't have this many values per row
– Zero padding would degrade performance for many matrices
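A quick check of the bandwidth figure, assuming each 80-bit pair splits into a 64-bit double and a 16-bit column index:

```python
# Back-of-envelope bandwidth: 5 pairs/cycle * 80 bits at 300-400 MHz.
bits_per_pair = 64 + 16            # assumed split of the 80 bits
pairs_per_cycle = 5
for mhz in (300, 400):
    gbytes = bits_per_pair * pairs_per_cycle * mhz * 1e6 / 8 / 1e9
    print(f"{mhz} MHz -> {gbytes:.0f} GB/s")   # 15 and 20 GB/s
```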

New SpMV Architecture Delete the adder tree, replicate the accumulator, and schedule the matrix data:

[Figure: replicated accumulators fed by a 400-bit-wide input stream.]

Performance Results

[Performance results chart omitted.]

Conclusions Developed a serially delivered accumulator using the base-conversion technique.
– Limited to shallow pipelines: deeper pipelines require a large minimum set size (pipeline depth 4 -> set size 11, 5 -> 19, 6 -> 23)
– Goal: a new reduction circuit that supports deeper pipelines with no minimum set size

Acknowledgements:
– NSF awards CCF , CCF