Accuracy, Cost and Performance Tradeoffs for Floating Point Accumulation
Krishna K. Nagar & Jason D. Bakos, University of South Carolina, Columbia, SC


Objective
Achieve high throughput for streaming set-wise floating point summation without sacrificing accuracy.

Issues
- Data scheduling around a deeply pipelined floating point adder
- Accuracy of floating point summation

Streaming Set-wise Summation: Reduction Circuit
Resolves data hazards by dynamically scheduling the inputs to the FP adder.
[Figure: reduction circuit scheduling under Rules 1 to 6, with example inputs from sets c, d, e and g]

Compensated Summation
- Incorporate the error in the subsequent addition
- Accumulate the error and incorporate it in the final result
- Error extraction: custom floating point adder to reduce latency

Accumulated Error Compensation (AEC)
- VRC accumulates input values and supplies the error generated by the custom adder to ERC
- ERC accumulates the errors
- 1 custom adder, 2 standard adders
- 4743 slices (+153%), 176 MHz (-6%)

Adaptive Error Compensation In Subsequent Addition (AECSA)
- VRC accumulates input values
- Error may be compensated in VRC if available
- ERC accumulates the errors
- ERC can supply errors to VRC
- 1 custom adder, 3 standard adders, increased pipeline depth in VRC
- 7938 slices (+323%), 135 MHz (-28%)

Extended Precision Reduction Circuit (EPRC)
- All intermediate additions in extended precision
- Wider, deeper adder: 19-cycle 80-bit adder, 26-cycle 128-bit adder
- Wider buffers to store partial results
- EPRC80: 2656 slices (+42%), 182 MHz (-3%)
- EPRC128: 4600 slices (+145%), 182 MHz (-3%)

Results
Average Erroneous Bits = lg(2*Relative Error)
[Results tables: numeric values not recovered. Each table lists Exp. Range or condition number against the columns Red. Ckt., AEC, AECSA, EPRC80, EPRC. Panels:
- Condition number = 1.0, varying exponent range, set size = 100
- Condition number = 1.0, varying exponent range, set size = 10,000
- Condition number = [value missing], varying exponent range, set size = 100
- Exponent range = 0, varying condition number, set size = 100
- Condition number = [value missing], varying exponent range, set size = 100
- Exponent range = 0, varying condition number, set size = 10,000]

Conclusion
Accuracy-improving measures reduce errors significantly.
- Exponent range affects the relative error: the Reduction Circuit is affected most; AEC and AECSA are not affected much.
- Condition number matters a lot, and relative error increases as the condition number increases. This shows the effect of error due to cancellation; the Reduction Circuit is affected most.
Accuracy and throughput for set-wise summation go hand in hand!
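
The two compensated-summation strategies listed above (accumulate the error and add it to the final result, or fold it into the subsequent addition) can be illustrated in software. The sketch below is not the poster's hardware design: it substitutes Knuth's TwoSum for the custom error-extracting adder and Kahan summation for compensation in the subsequent addition, and all function names are hypothetical.

```python
import math


def two_sum(a, b):
    """Knuth's TwoSum: return (s, e) where s is the rounded sum of a and b
    and e is the rounding error, so that a + b == s + e exactly.
    Software stand-in (assumption) for an error-extracting adder."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e


def naive_sum(values):
    """Plain left-to-right accumulation with no compensation."""
    total = 0.0
    for x in values:
        total += x
    return total


def sum_error_at_end(values):
    """Accumulate values and extracted errors separately, then add the
    accumulated error once to the final result."""
    total = 0.0
    err = 0.0
    for x in values:
        total, e = two_sum(total, x)
        err += e  # the errors themselves are summed with ordinary additions
    return total + err


def sum_error_in_next_add(values):
    """Kahan summation: the error of each addition is folded into the
    next addend before it is accumulated."""
    total = 0.0
    comp = 0.0  # running compensation (negated error of the last addition)
    for x in values:
        y = x - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total


if __name__ == "__main__":
    # One large value followed by many values below its rounding threshold;
    # 10,000 values in total, chosen only for illustration.
    values = [1.0] + [1e-16] * 9999
    print("reference (math.fsum):", math.fsum(values))
    print("naive                :", naive_sum(values))
    print("error added at end   :", sum_error_at_end(values))
    print("error in next add    :", sum_error_in_next_add(values))
```

In this example the naive sum never moves away from 1.0 because every small addend is rounded off, while both compensated variants recover the lost contributions. Note that Kahan's form assumes the running total is at least as large in magnitude as the next addend, so it can still lose error on inputs with heavy cancellation, whereas TwoSum extracts the error exactly regardless of operand order.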