Lecture 10 CUDA Instructions


Kyu Ho Park, May 2, 2017
Ref: [PCCP] Professional CUDA C Programming

Issues
Applications fall into I/O-bound and compute-bound categories; compute-bound kernels benefit most from low-level instruction tuning.
Low-level instruction tuning example:
    double value = a * b + c;  // MAD (multiply-add)
This pattern is so common that modern architectures support a MAD instruction that fuses a multiply and an add into a single operation.

MAD instruction
Fusing the multiply and the add roughly halves the number of cycles compared with issuing the two instructions separately. However, the result of a single MAD instruction is often less numerically accurate than that of separate multiply and add instructions, because the intermediate product is truncated rather than rounded.

CUDA Instructions
Three topics significantly affect the instructions generated for a CUDA kernel:
- Floating-point operations: affect both the accuracy and the performance of CUDA programs.
- Intrinsic and standard functions: implement overlapping sets of mathematical operations but offer different accuracy and performance.
- Atomic instructions: guarantee the correctness of concurrent operations on a variable shared by multiple threads.

Floating-Point Instructions
Issues:
- Accuracy of floating-point arithmetic
- Precision of floating-point number representation
- Considerations in parallel computation

Floating-Point Format
IEEE 754 floating-point standard: a numerical value is represented by three groups of bits, S (sign), E (exponent), and M (mantissa):
    value = (-1)^S × 1.M × 2^(E - bias)
where S = 0 means a positive number and S = 1 a negative number. The bit layout is: sign | exponent | fraction.

32-bit and 64-bit Format
            sign   exponent   mantissa
    float     1       8          23
    double    1      11          52

Representation of M
    value = (-1)^S × 1.M × 2^(E - bias)
Example: the decimal number 0.5, written 0.5D. Since 0.5D = 1.0B × 2^-1, we have M = 0. Numbers whose mantissa has this 1.M form are referred to as normalized numbers. The mantissa of 0.5D in a 2-bit mantissa representation is 00, obtained by omitting the implicit leading "1." from 1.00.

Floating-Point Instructions
    float a = 3.1415927;
    float b = 3.1415928;
    if (a == b) {
        printf("a is equal to b\n");
    } else {
        printf("a is not equal to b\n");
    }

On architectures compatible with IEEE 754, the output is "a is equal to b": each floating-point constant is rounded to the nearest representable single-precision value, and both constants round to the same one.

    double a = 3.1415927;
    double b = 3.1415928;
    if (a == b) {
        printf("a is equal to b\n");
    } else {
        printf("a is not equal to b\n");
    }
With double precision, the 52-bit mantissa provides enough significant digits to distinguish the two constants, so the output is "a is not equal to b".

Single and Double Precision

Algorithmic Considerations
Consider a format with 1 sign bit S, 2 mantissa bits M, and 2 exponent bits E.
    1.00B × 2^0 + 1.00B × 2^0 + 1.00B × 2^-2 + 1.00B × 2^-2 = ?
Sequential (left-to-right) addition:
    (((1.00B × 2^0 + 1.00B × 2^0) + 1.00B × 2^-2) + 1.00B × 2^-2)
    = (1.00B × 2^1 + 1.00B × 2^-2) + 1.00B × 2^-2
    = 1.00B × 2^1 + 1.00B × 2^-2
    = 1.00B × 2^1

Algorithmic Considerations
Pairwise addition of the same terms:
    1.00B × 2^0 + 1.00B × 2^0 + 1.00B × 2^-2 + 1.00B × 2^-2
    = (1.00B × 2^0 + 1.00B × 2^0) + (1.00B × 2^-2 + 1.00B × 2^-2)
    = 1.00B × 2^1 + 1.00B × 2^-1
    = 1.01B × 2^1
The small terms survive because they are combined with each other before meeting the large sum.

Algorithmic Considerations
A technique for maximizing floating-point accuracy is to sort the data before a reduction computation. In a parallel algorithm, divide the numbers into groups and let each thread sequentially reduce the values within its group. Having the numbers sorted in ascending order lets the sequential addition accumulate the small values before they meet the large ones, yielding higher accuracy. [W. Kahan, "Further remarks on reducing truncation errors," Communications of the ACM, 8(1):40, 1965.]

Intrinsic and Standard Functions
CUDA arithmetic functions:
- Intrinsic functions: accessible only from device code; many of them (e.g., trigonometric functions) are implemented directly in hardware on the GPU, trading accuracy for speed.
- Standard functions: include the C standard math library as well as single-instruction operations such as multiplication and addition.

Atomic Instructions
An atomic instruction performs a mathematical operation as a single, uninterruptible operation, with no interference from other threads. CUDA provides atomic functions that perform read-modify-write atomic operations on 32-bit or 64-bit values in global or shared memory.

Atomic Instructions
Each atomic function implements a basic mathematical operation, such as addition, subtraction, or compare-and-swap. Atomic instructions have a defined behavior when operating on a memory location shared by two competing threads.

Atomic Instructions
A kernel:
    __global__ void incr(int *ptr) {
        int temp = *ptr;
        temp = temp + 1;
        *ptr = temp;
    }
If a single block of 32 threads is launched running this kernel, what will the output be? Because the read-modify-write sequence is not atomic, the threads race: their reads and writes can interleave arbitrarily, so the final value of *ptr is undefined (anywhere from an increment of 1 to 32).

Atomic Instructions
    int atomicAdd(int *M, int V);
    // Atomically adds V to the value already stored at location M,
    // stores the result back to the same memory location, and
    // returns the old value.
    __global__ void incr(int *ptr) {
        int temp = atomicAdd(ptr, 1);
    }

Atomic Operations