Joseph Schneider, February 23, 2010



Slide 2
• Fused Multiply-Add (FMA) is a unit that computes (A x B) + C as a single instruction.
• It is faster and more precise than two consecutive instructions on a standard multiplier and adder, because the result is rounded once instead of twice.
• It can also perform standard addition and multiplication with appropriate constants: (A x 1) + C for addition and (A x B) + 0 for multiplication.
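The precision claim can be checked in software. The sketch below is not the hardware unit: it emulates a fused multiply-add with exact rational arithmetic (the helper name `fma` and the operand values are chosen here for illustration) and compares it with the ordinary two-step computation.

```python
from fractions import Fraction

def fma(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded to double once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # no rounding so far
    return float(exact)                              # the single rounding step

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Two instructions: the product 1 - 2**-60 is rounded to 1.0 before the add.
two_step = a * b + c
# Fused: the exact product survives until the final rounding.
fused = fma(a, b, c)

print(two_step == 0.0)        # True
print(fused == -(2.0**-60))   # True

# Standard addition and multiplication through the same unit.
x, y = 0.1, 0.7
print(fma(x, 1.0, y) == x + y)   # True: (x * 1) + y
print(fma(x, y, 0.0) == x * y)   # True: (x * y) + 0
```

The two-step result loses the 2**-60 term entirely, while the fused result keeps it.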

Slide 3
• On an FMA unit, standard addition and multiplication have greater latency than on a standard adder or multiplier.
• When an FMA unit is used in their place, an addition and a multiplication cannot be performed in parallel.

Slide 4
• Goal: design a bridge architecture between the FADD and FMUL units.
• Reuse components to minimize area and power consumption.
• Allow both the standard operations and the FMA functionality.

Slide 5
• All floating-point units assume the IEEE 754 double-precision (64-bit) format.
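For reference, a double-precision value has 1 sign bit, 11 exponent bits (bias 1023), and 52 fraction bits. The short sketch below splits a Python float into those fields; the function name `double_fields` is just an illustrative choice.

```python
import struct

def double_fields(x):
    """Split a double into its IEEE 754 (sign, biased exponent, fraction) fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, bias 1023
    fraction = bits & ((1 << 52) - 1)      # 52 bits, implicit leading 1 for normals
    return sign, exponent, fraction

print(double_fields(1.0))    # (0, 1023, 0)
print(double_fields(-2.5))   # (1, 1024, 1125899906842624), i.e. fraction = 2**50
```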

Slide 6
• Compare the standalone adder, the standalone multiplier, the standalone FMA, and the FMA bridge.
• The comparison is on the basis of latency, area, and power.

Slide 7
• (A x B) + C
• A and B are multiplied while C is aligned based on the exponent difference.
• A carry-save adder is used to add the aligned C to the product.
• The result is rounded only once, as opposed to the two roundings needed when the equation is performed as two operations.
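The steps on this slide can be sketched with integer significands. This is a rough software model under stated assumptions, not the design from the slides: it handles finite inputs only, uses unbounded Python integers where hardware has a fixed-width shifter and sticky bits, and splits the multiplier into just two partial-product rows so that a single 3:2 carry-save adder can absorb C. All function names are hypothetical.

```python
import math
from fractions import Fraction

def decompose(x):
    """Return integers (m, e) with x == m * 2**e (finite x only)."""
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= |m| < 1
    return int(m * 2**53), e - 53     # scale the significand to an integer

def csa(x, y, z):
    """3:2 carry-save adder: x + y + z == s + c, without carry propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def fma_datapath(a, b, c):
    ma, ea = decompose(a)
    mb, eb = decompose(b)
    mc, ec = decompose(c)

    # "Multiplier array", reduced here to two partial-product rows.
    lo = mb & ((1 << 26) - 1)
    hi = mb - lo
    ep = ea + eb                      # exponent of the product

    # Align C and the product rows to a common exponent.
    e = min(ep, ec)
    row1 = (ma * lo) << (ep - e)
    row2 = (ma * hi) << (ep - e)
    addend = mc << (ec - e)

    s, cy = csa(row1, row2, addend)   # carry-save addition
    total = s + cy                    # final carry-propagate add, still exact

    return float(Fraction(total) * Fraction(2) ** e)   # round once

print(fma_datapath(3.0, 4.0, 5.0))                                       # 17.0
print(fma_datapath(1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0) == -(2.0**-60))  # True
```

Everything before the last line of `fma_datapath` is exact, so the only rounding is the final conversion back to a double.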

Slide 8

Slide 9
• Follows the same architecture as the FMA, but reuses parts from FADD and FMUL where appropriate.
• From FMUL, it uses the multiplier array.
• From FADD, it uses the rounding unit.
• With this method, FADD and FMUL can be used individually or in parallel, while the FMA is used only when needed.
• Clock-gating ensures that the bridge is powered only when needed.

Slide 10

Slide 11
• Same as a standard multiplier unit, only with additional outputs from the multiplier array leading to the FMA.
• The round element is shut down via clock-gating when performing an FMA operation.
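The sharing and clock-gating rules stated so far can be summarized as a small truth table. The sketch below is only a reading of the slides: the signal names are hypothetical, and the rule that an FMA cannot issue together with an FADD or FMUL is an assumption made here because the FMA borrows parts from both units.

```python
def clock_enables(ops):
    """Map the set of operations in flight to block enables (hypothetical names)."""
    ops = set(ops)
    if not ops <= {"FADD", "FMUL", "FMA"}:
        raise ValueError("unknown operation")
    if "FMA" in ops and len(ops) > 1:
        # Assumption: the FMA occupies parts of both units, so it issues alone.
        raise ValueError("FMA cannot run alongside FADD or FMUL")
    fma = "FMA" in ops
    return {
        "fmul_multiplier_array": fma or "FMUL" in ops,  # reused by the FMA
        "fmul_round": "FMUL" in ops,                    # gated off during an FMA
        "bridge": fma,                                  # powered only when needed
        "fadd_round": fma or "FADD" in ops,             # reused by the FMA
    }

print(clock_enables({"FMA"})["fmul_round"])        # False
print(clock_enables({"FADD", "FMUL"})["bridge"])   # False
```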

Slide 12

Slide 13
• Uses the Farmwald dual-path FADD design: two paths are available, chosen based on the exponent difference of the inputs.
• The multiplexer that selects between the paths for the rounding unit now includes an option for the FMA input.
• In this manner, the FMA uses FADD's rounding unit.
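A minimal sketch of the selection logic follows. The slide only says that the path depends on the exponent difference; the specific rule used here (a "close" path for effective subtraction with exponents at most 1 apart, a "far" path otherwise) is the commonly described dual-path scheme and is an assumption, as are the function names.

```python
import math

def select_path(a, b):
    """Pick the adder path for a + b (assumed close/far rule, nonzero finite inputs)."""
    ea = math.frexp(a)[1]
    eb = math.frexp(b)[1]
    effective_sub = math.copysign(1.0, a) != math.copysign(1.0, b)
    # Close path: heavy cancellation is possible, but little alignment is needed.
    if effective_sub and abs(ea - eb) <= 1:
        return "close"
    # Far path: a large alignment shift, but at most a small normalization.
    return "far"

def round_input(op, a=None, b=None):
    """Multiplexer in front of FADD's rounding unit, with the extra FMA option."""
    if op == "FMA":
        return "fma"              # bridge result is rounded by FADD's rounder
    return select_path(a, b)      # ordinary FADD: one of the two paths

print(select_path(1.0, -1.5))       # close
print(select_path(1024.0, -1.0))    # far
print(round_input("FMA"))           # fma
```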

Slide 14

Slide 15

Slide 16
• End result: the Bridge FMA hardware is essentially the original FMA hardware, only without the multiplier array and rounding unit.

Slide 17

Slide 18
• FMUL, FADD, FMA, and Bridge FMA are all implemented in Verilog.
• Uses the AMD 65-nm silicon-on-insulator design set.

Slide 19
• The bridge architecture is 30%-70% faster than the FMA architecture when performing FADD or FMUL instructions, with significant savings in power consumption.
• It also allows an FADD and an FMUL instruction to run in parallel, further improving speed.
• 12% performance gain when executing an FMA instruction, compared with consecutive operations on the individual FADD and FMUL units.

Slide 20
• Including the Bridge FMA with the FADD and FMUL units takes 40% more area.
• In worst-case conditions, an FMA instruction uses 60% more power than consecutive FADD and FMUL instructions.
• Latency and power are higher than for a standalone FMA unit.