Introduction to CUDA Programming

Slides:

Advertisements

Similar presentations

1/1/ / faculty of Electrical Engineering eindhoven university of technology Introduction Part 2: Data types and addressing modes dr.ir. A.C. Verschueren.

Advertisements

L9: Floating Point Issues CS6963. Outline Finish control flow and predicated execution discussion Floating point – Mostly single precision until recent.

A Complete GPU Compute Architecture by NVIDIA Tamal Saha, Abhishek Rawat, Minh Le {ts4rq, ar8eb,

Fixed Point Numbers The binary integer arithmetic you are used to is known by the more general term of Fixed Point arithmetic. Fixed Point means that we.

Fabián E. Bustamante, Spring 2007 Floating point Today IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties Next time.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lecture 11: Floating-Point Considerations.

Topics covered: Floating point arithmetic CSE243: Introduction to Computer Architecture and Hardware/Software Interface.

Princess Sumaya Univ. Computer Engineering Dept. Chapter 3: IT Students.

L10: Floating Point Issues and Project CS6963. Administrative Issues Project proposals – Due 5PM, Friday, March 13 (hard deadline) Homework (Lab 2) –

COMPUTER ARCHITECTURE & OPERATIONS I Instructor: Hao Ji.

Information Representation (Level ISA3) Floating point numbers.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 1 CSCE 4643 GPU Programming Lecture 8 - Floating Point Considerations.

1 Lecture 5 Floating Point Numbers ITEC 1000 “Introduction to Information Technology”

CEN 316 Computer Organization and Design Computer Arithmetic Floating Point Dr. Mansour AL Zuair.

CUDA Performance Considerations Patrick Cozzi University of Pennsylvania CIS Spring 2011.

L9: Project Discussion and Floating Point Issues CS6235.

Introduction to CUDA Programming Profiler, Assembly, and Floating-Point Andreas Moshovos Winter 2009 Some material from: Wen-Mei Hwu and David Kirk NVIDIA.

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 1 ECE408 Applied Parallel Programming Lecture 15 - Floating.

Floating Point Arithmetic

Floating Point Representation for non-integral numbers – Including very small and very large numbers Like scientific notation – –2.34 × –

© David Kirk/NVIDIA and Wen-mei W. Hwu, University of Illinois, Urbana-Champaign 1 ECE408 Applied Parallel Programming Lecture 16 - Floating.

Lecture notes Reading: Section 3.4, 3.5, 3.6 Multiplication

Chapter 3 Arithmetic for Computers. Chapter 3 — Arithmetic for Computers — 2 Arithmetic for Computers Operations on integers Addition and subtraction.

Floating Point Numbers Representation, Operations, and Accuracy CS223 Digital Design.

© David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 – July 2, Taiwan 2008 CUDA Course Programming Massively Parallel Processors: the CUDA experience.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Floating-Point Considerations.

© David Kirk/NVIDIA and Wen-mei W. Hwu Urbana, Illinois, August 10-14, VSCSE Summer School 2009 Many-core Processors for Science and Engineering.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Floating-Point Considerations.

CHAPTER 5: Representing Numerical Data

Floating Point Representations

Lecture 10 CUDA Instructions

© David Kirk/NVIDIA and Wen-mei W

Bitwise Operations C includes operators that permit working with the bit-level representation of a value. You can: - shift the bits of a value to the left.

Floating Point Borrowed from the Carnegie Mellon (Thanks, guys!)

Floating Point Numbers

CS 105 “Tour of the Black Holes of Computing!”

Morgan Kaufmann Publishers Arithmetic for Computers

2.4. Floating Point Numbers

Computer Architecture & Operations I

Integer Division.

Topics IEEE Floating Point Standard Rounding Floating Point Operations

Floating Point Numbers: x 10-18

ECE 498AL Lectures 8: Bank Conflicts and Sample PTX code

Floating Point Number system corresponding to the decimal notation

CS/COE0447 Computer Organization & Assembly Language

CS 367 Floating Point Topics (Ch 2.4) IEEE Floating Point Standard

Chapter 6 Floating Point

Arithmetic for Computers

Lecture 10: Floating Point, Digital Design

Topic 3d Representation of Real Numbers

CSCE 350 Computer Architecture

CS201 - Lecture 5 Floating Point

ECE 498AL Spring 2010 Lecture 11: Floating-Point Considerations

CS 105 “Tour of the Black Holes of Computing!”

CSCE 4643 GPU Programming Lecture 9 - Floating Point Considerations

CS 105 “Tour of the Black Holes of Computing!”

Floating Point Arithmetic August 31, 2009

CS 105 “Tour of the Black Holes of Computing!”

UNIVERSITY OF MASSACHUSETTS Dept

© David Kirk/NVIDIA and Wen-mei W. Hwu,

Morgan Kaufmann Publishers Arithmetic for Computers

CS213 Floating Point Topics IEEE Floating Point Standard Rounding

Review In last lecture, done with unsigned and signed number representation. Introduced how to represent real numbers in float format.

Topic 3d Representation of Real Numbers

UNIVERSITY OF MASSACHUSETTS Dept

CS 105 “Tour of the Black Holes of Computing!”

Computer Organization and Assembly Language

Computer Architecture and System Programming Laboratory

CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

Presentation transcript:

Introduction to CUDA Programming Profiler, Assembly, and Floating-Point Andreas Moshovos Winter 2009 Some material from: Wen-Mei Hwu and David Kirk NVIDIA Robert Strzodka, Dominik Göddeke, NVISION08 presentation http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/NVISION08-long.pdf Introduction to CUDA Programming

Both GUI and command-line Non-GUI control: The CUDA Profiler Both GUI and command-line Non-GUI control: CUDA_PROFILE set to 1 or 0 to enable or disable the profiler CUDA_PROFILE_LOG set to the name of the log file (will default to ./cuda_profile.log) CUDA_PROFILE_CSV set to 1 or 0 to enable or disable a comma separated version of the log CUDA_PROFILE_CONFIG specify a config file with up to 4 signals

Profiler Signals

Grid size X, Y Block size X, Y, Z Dyn smem per block: Profiler Counters Grid size X, Y Block size X, Y, Z Dyn smem per block: Dynamic shared memory Sta smem per block: static shared memory Reg per thread Mem transfer dir Direction: 0  host to device, 1  device to host Mem transfer size bytes

Interpreting Profiler Counters Values represent events within a thread warp Only targets one multiprocessor Values will not correspond to the total number of warps launched for a particular kernel. Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work. Values are best used to identify relative performance differences between non-optimized and optimized code e.g., make the number of non-coalesced loads go from some non-zero value to zero

Helps measure and find potential performance problem CUDA Visual Profiler Helps measure and find potential performance problem GPU and CPU timing for all kernel invocations and memcpys Time stamps Access to hardware performance counters

Assembly

PTX: Assembly for NVIDIA GPUs Parallel Thread eXecution Virtual Assembly Translated to actual machine code at runtime Allows for different hardware implementations Might enable additional optimizations E.g., %clock register to time blocks of code

Parallel Thread eXecution (PTX) Code Generation Flow Parallel Thread eXecution (PTX) Virtual Machine and ISA Programming model Execution resources and state ISA – Instruction Set Architecture Variable declarations Instructions and operands Translator is an optimizing compiler Translates PTX to Target code Program install time Driver implements VM runtime Coupled with Translator C/C++ Application C/C++ Compiler ASM-level Library Programmer PTX Code PTX Code PTX to Target Translator C G80 … GPU Target code

nvcc --opencc-options -LIST:source=on How to See the PTX code nvcc –keep Produces .ptx and .cubin nvcc --opencc-options -LIST:source=on

PTX PTX Example CUDA float4 me = gx[gtid]; me.x += me.y * me.z; ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0]; # 174 me.x += me.y * me.z; mad.f32 $f1, $f5, $f3, $f1; PTX Registers are virtual – The actual hardware registers are hidden from PTX

PTX Syntax Example

Another Example: CUDA Function PTX __device__ void interaction( float4 b0, float4 b1, float3 *accel) { r.x = b1.x - b0.x; r.y = b1.y - b0.y; r.z = b1.z - b0.z; float distSqr = r.x * r.x + r.y * r.y + r.z * r.z; float s = 1.0f/sqrt(distSqr); accel->x += r.x * s; accel->y += r.y * s; accel->z += r.z * s; } sub.f32 $f18, $f1, $f15; sub.f32 $f19, $f3, $f16; sub.f32 $f20, $f5, $f17; mul.f32 $f21, $f18, $f18; mul.f32 $f22, $f19, $f19; mul.f32 $f23, $f20, $f20; add.f32 $f24, $f21, $f22; add.f32 $f25, $f23, $f24; rsqrt.f32 $f26, $f25; mad.f32 $f13, $f18, $f26, $f13; mov.f32 $f14, $f13; mad.f32 $f11, $f19, $f26, $f11; mov.f32 $f12, $f11; mad.f32 $f9, $f20, $f26, $f9; mov.f32 $f10, $f9;

PTX Data types

After: p = Evaluate cond Branch not true After Then Code After Code Predicated Execution p = Evaluate cond Branch not true After Then Code After: After Code If (cond) Then Code After Code p = Evaluate cond (p) Then Code After Code

PTX Predicated Execution

Variable Declaration

Parameterized Variable Names How to create 100 register “variables” .reg .b32 %r<100> Declares %r0 - %r99

Addresses as Operands The value of x The value of tbl[12] The base address of tlb

Compiling a loop that calls a function - again CUDA sx is shared mx, accel are local PTX mov.s32 $r12, 0; $Lt_0_26: setp.eq.u32 $p1, $r12, $r5; // i != threadIdx.x @$p1 bra $Lt_0_27; mul.lo.u32 $r13, $r12, 16; add.u32 $r14, $r13, $r1; ld.shared.f32 $f15, [$r14+0]; ld.shared.f32 $f16, [$r14+4]; ld.shared.f32 $f17, [$r14+8]; [func body] $Lt_0_27: add.s32 $r12, $r12, 1; mov.s32 $r15, 128; setp.ne.s32 $p2, $r12, $r15; @$p2 bra $Lt_0_26; for (i = 0; i < K; i++) { if (i != threadIdx.x) { interaction( sx[i], mx, &accel ); }

Yet Another Example: SAXPY code cvt.u32.u16 $blockid, %ctaid.x; // Calculate i from thread/block IDs cvt.u32.u16 $blocksize, %ntid.x; cvt.u32.u16 $tid, %tid.x; mad24.lo.u32 $i, $blockid, $blocksize, $tid; ld.param.u32 $n, [N]; // Nothing to do if n ≤ i setp.le.u32 $p1, $n, $i; @$p1 bra $L_finish; mul.lo.u32 $offset, $i, 4; // Load y[i] ld.param.u32 $yaddr, [Y]; add.u32 $yaddr, $yaddr, $offset; ld.global.f32 $y_i, [$yaddr+0]; ld.param.u32 $xaddr, [X]; // Load x[i] add.u32 $xaddr, $xaddr, $offset; ld.global.f32 $x_i, [$xaddr+0]; ld.param.f32 $alpha, [ALPHA]; // Compute and store alpha*x[i] + y[i] mad.f32 $y_i, $alpha, $x_i, $y_i; st.global.f32 [$yaddr+0], $y_i; $L_finish: exit;

Real time clock cycle counter How to read: The %clock register Real time clock cycle counter How to read: mov.u32 $r1, %clock; Can be used to time code It measures real time not just time spent executing this thread If a thread is blocks time still elapses

Please Read the PTX ISA specification PTX Reference Please Read the PTX ISA specification Posted under the handouts section

Occupancy Calculator http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls GPU Occupancy Active warps / max warps Threads/block Registers/thread Shared memory/block Nvcc –cubin code { name = my_kernel lmem = 0 smem = 24 reg = 5 bar = 0 bincode { � } const { � } }

Occupancy Calculator Example

Floating Point Considerations

Comparison of FP Capabilities G80 SSE IBM Altivec Cell SPE Precision IEEE 754 Rounding modes for FADD and FMUL Round to nearest and round to zero All 4 IEEE, round to nearest, zero, inf, -inf Round to nearest only Round to zero/truncate only Denormal handling Flush to zero Supported, 1000’s of cycles NaN support Yes No Overflow and Infinity support Yes, only clamps to max norm No, infinity Flags Some Square root Software only Hardware Division Reciprocal estimate accuracy 24 bit 12 bit Reciprocal sqrt estimate accuracy 23 bit log2(x) and 2^x estimates accuracy

IEEE Floating Point Representation A floating point binary number consists of three parts: sign (S), exponent (E), and mantissa (M). Each (S, E, M) pattern uniquely identifies a floating point number. For each bit pattern, its IEEE floating-point value is derived as: value = (-1)S * M * {2E}, where 1.0 ≤ M < 10.0B The interpretation of S is simple: S=0 results in a positive number and S=1 a negative number.

Normalized Representation Specifying that 1.0B ≤ M < 10.0B makes the mantissa value for each floating point number unique. For example, the only one mantissa value allowed for 0.5D is M =1.0 0.5D = 1.0B * 2-1 Neither 10.0B * 2 -2 nor 0.1B * 2 0 qualifies Because all mantissa values are of the form 1.XX…, one can omit the “1.” part in the representation. The mantissa value of 0.5D in a 2-bit mantissa is 00, which is derived by omitting “1.” from 1.00.

Exponent Representation In an n-bits exponent representation, 2n-1-1 is added to its 2's complement representation to form its excess representation. See Table for a 3-bit exponent representation A simple unsigned integer comparator can be used to compare the magnitude of two FP numbers Symmetric range for +/- exponents (111 reserved) 2’s complement Actual decimal Excess-3 000 011 001 1 100 010 2 101 3 110 (reserved pattern) 111 -3 -2 -1 E = represented E - BIAS

A Hypothetical 5-bit Floating Point Representation Assume 1-bit S, 2-bit E, and 2-bit M 0.5D = 1.00B * 2-1 0.5D = 0 00 00, where S = 0, E = 00 M = (1.)00 2’s complement Actual decimal Excess-1 00 01 1 10 (reserved pattern) 11 -1

Representable Numbers The representable numbers of a given format is the set of all numbers that can be exactly represented in the format. See Table for representable numbers of an unsigned 3-bit integer format 000 001 1 010 2 011 3 100 4 101 5 110 6 111 7 -1 1 2 3 4 5 6 7 8 9

Hypothetical 5-bit FP: Representable Numbers No-zero Abrupt underflow Gradual underflow E M S=0 S=1 00 2-1 -(2-1) 01 2-1+1*2-3 -(2-1+1*2-3) 1*2-2 -1*2-2 10 2-1+2*2-3 -(2-1+2*2-3) 2*2-2 -2*2-2 11 2-1+3*2-3 -(2-1+3*2-3) 3*2-2 -3*2-2 20 -(20) 20+1*2-2 -(20+1*2-2) 20+2*2-2 -(20+2*2-2) 20+3*2-2 -(20+3*2-2) 21 -(21) 21+1*2-1 -(21+1*2-1) 21+2*2-1 -(21+2*2-1) 21+3*2-1 -(21+3*2-1) Reserved pattern

Treat all bit patterns with E=0 as 0.0 Flush To Zero Treat all bit patterns with E=0 as 0.0 This takes away several representable numbers near zero and lump them all into 0.0 For a representation with large M, a large number of representable numbers numbers will be removed. 1 2 3 4

Hypothetical 5-bit FP: Representable Numbers No-zero Abrupt underflow Gradual underflow E M S=0 S=1 00 2-1 -(2-1) 01 2-1+1*2-3 -(2-1+1*2-3) 1*2-2 -1*2-2 10 2-1+2*2-3 -(2-1+2*2-3) 2*2-2 -2*2-2 11 2-1+3*2-3 -(2-1+3*2-3) 3*2-2 -3*2-2 20 -(20) 20+1*2-2 -(20+1*2-2) 20+2*2-2 -(20+2*2-2) 20+3*2-2 -(20+3*2-2) 21 -(21) 21+1*2-1 -(21+1*2-1) 21+2*2-1 -(21+2*2-1) 21+3*2-1 -(21+3*2-1) Reserved pattern

Denormalized Numbers The actual method adopted by the IEEE standard is called denormalized numbers or gradual underflow. The method relaxes the normalization requirement for numbers very close to 0. whenever E=0, the mantissa is no longer assumed to be of the form 1.XX. Rather, it is assumed to be 0.XX. In general, if the n-bit exponent is 0, the value is 0.M * 2 - 2 ^(n-1) + 2 1 2 3

Hypothetical 5-bit FP: Representable Numbers No-zero Abrupt underflow Gradual underflow E M S=0 S=1 00 2-1 -(2-1) 01 2-1+1*2-3 -(2-1+1*2-3) 1*2-2 -1*2-2 10 2-1+2*2-3 -(2-1+2*2-3) 2*2-2 -2*2-2 11 2-1+3*2-3 -(2-1+3*2-3) 3*2-2 -3*2-2 20 -(20) 20+1*2-2 -(20+1*2-2) 20+2*2-2 -(20+2*2-2) 20+3*2-2 -(20+3*2-2) 21 -(21) 21+1*2-1 -(21+1*2-1) 21+2*2-1 -(21+2*2-1) 21+3*2-1 -(21+3*2-1) Reserved pattern

Floating Point Numbers As the exponent gets larger The distance between two representable numbers increases

Arithmetic Instruction Throughput int and float add, shift, min, max and float mul, mad: 4 cycles per warp int multiply (*) is by default 32-bit requires multiple cycles / warp Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply For G80, for G200 should be OK Integer divide and modulo are expensive Compiler will convert literal power-of-2 divides to shifts Be explicit in cases where compiler can’t tell that divisor is a power of 2 Useful trick: foo % n == foo & (n-1) if n is a power of 2

Arithmetic Instruction Throughput Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp These are the versions prefixed with “__” Examples:__rcp(), __sin(), __exp() Other functions are combinations of the above y / x == rcp(x) * y == 20 cycles per warp sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp

There are two types of runtime math operations Runtime Math Library There are two types of runtime math operations __func(): direct mapping to hardware ISA Fast but low accuracy (see prog. guide for details) Examples: __sin(x), __exp(x), __pow(x,y) func() : compile to multiple instructions Slower but higher accuracy (5 ulp, units in the least place, or less) Examples: sin(x), exp(x), pow(x,y) The -use_fast_math compiler option forces every func() to compile to __func() The error is measured in ulp (units in the last place), that is the values of the results are multiplied by a power of the base such that an error in the last significant place is an error of 1.

Make your program float-safe G20 has double precision support G80 is single-precision only Double precision has additional performance cost Only one unit per multiprocessor Careless use of double or undeclared types may run more slowly on G80+ Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed Add ‘f’ specifier on float literals: foo = bar * 0.123; // double assumed foo = bar * 0.123f; // float explicit Use float version of standard library functions foo = sin(bar); // double assumed foo = sinf(bar); // single precision explicit

Addition and Multiplication are IEEE 754 compliant Deviations from IEEE-754 Addition and Multiplication are IEEE 754 compliant Maximum 0.5 ulp (units in the least place) error However, often combined into multiply-add (FMAD) Intermediate result is truncated Division is non-compliant (2 ulp) Not all rounding modes are supported Denormalized numbers are not supported No mechanism to detect floating-point exceptions The error is measured in ulp (units in the last place), that is the values of the results are multiplied by a power of the base such that an error in the last significant place is an error of 1.

Units in the Last Place Error If the result of a FP computation is: -0.0312 But the answer when computed to infinite precision is: -0.0312159 Then ulp is: 0.0314 – 0.0312 = 0.159 For binary representations the maximum ulp is 0.5 Round to nearest number

Mixed Precision Methods From slides by Robert Strzodka Dominik Göddeke http://www.mathematik.uni-dortmund.de/~goeddeke/pubs/NVISION08-long.pdf

What is a Mixed Precision Method?

Mixed Precision Performance Gains

Single vs Double Precision FP Float s23e8 Double s53e11

Round off and Cancellation

Double Precision != Better Accuracy

The Dominant Data Error

Understanding Floating Point Operations

Commutative Summation

Commutative Summation Example Say we want to calculate 1 + 0.0000004 – 0.00000003

High Precision Emulation

Example: Addition c = a + b

Please read the following: