Martin Kruliš 12. 11. 2015 by Martin Kruliš (v1.0)1.

Martin Kruliš 12. 11. 2015 by Martin Kruliš (v1.0)1

 GPU Design ◦ Oriented for rendering and geometry calculations ◦ Rich set of mathematical functions  Many of those are implemented as instructions  Separate functions for doubles and floats  e.g., sqrtf(float) and sqrt(double) ◦ Instruction behavior depends on compiler options  -use_fast_math – fast but lower precision  -ftz=bool – flush denormals to zero  -prec-div=bool – precise float divisions  -prec-sqrt=bool – precise float sqrts  -fmad=bool – use mul-add instructions (e.g., FFMA) 12. 11. 2015 by Martin Kruliš (v1.0)2 Single precision floats only

12. 11. 2015 by Martin Kruliš (v1.0)3

 Math instruction costs ◦ Math intrinsics vs. functions ( __sinf() vs. sinf() )  Intrinsics are faster but imprecise  Compiler uses intrinsics if -use_fast_math is set ◦ Division and modulo on integers  Should be avoided or replaced with shifts and bitwise ands whenever possible (e.g., divisor is a power of 2)  Compiler optimize only if the divisor is a literal ◦ Prefer more specialized functions  rsqrtf over 1/sqrtf, sinpif(x) over sinf(pi*x), …  x*x or expf() over powf()  … 12. 11. 2015 by Martin Kruliš (v1.0)5

 Atomic Instructions ◦ Perform read-modify-write operation of one 32bit or 64bit word in global or shared memory ◦ Require CC 1.1 or higher (1.2 for 64bit global atomics and 2.0 for 64bit shared mem. atomics) ◦ Operate on integers, except for atomicExch() and atomicAdd() which also work on 32bit floats ◦ Atomic operations on mapped memory are atomic only from the perspective of the device  Since they are usually performed on L2 cache 12. 11. 2015 by Martin Kruliš (v1.0)6

 Atomic Instructions Overview ◦ atomicAdd(&p,v), atomicSub(&p,v) – atomically adds or subtracts v to/from p and return the old value ◦ atomicInc(&p,v), atomicDec(&p,v) – atomic increment/decrement computed modulo v ◦ atomicMin(&p,v), atomicMax(&p,v) ◦ atomicExch(&p,v) – atomically swaps a value ◦ atomicCAS(&p,v) – classical compare-and-set ◦ atomicAnd(&p,v), atomicOr(&p,v), atomicXor(&p,v) – atomic bitwise operations 12. 11. 2015 by Martin Kruliš (v1.0)7

 Implementing Other Functions __device__ int atomicOp(int *addr, int val) { int old = *addr, assumed; do { assumed = old; old = atomicCAS(addr, assumed, op(assumed, val)); } while (assumed != old); return old; }  Synchronization ◦ Using atomics for global locks or barriers is not recommended (possibility of a deadlock) 12. 11. 2015 by Martin Kruliš (v1.0)8

 Voting Instructions ◦ Intrinsics that allows the whole warp to perform reduction and broadcast in one step  Only active threads are participating ◦ __all(predicate)  All active threads evaluate predicate  Returns non-zero if ALL predicates returned non-zero ◦ __any(predicate)  Like __all, but the results are combined by logical OR ◦ __ballot(predicate)  Return bitmask, where each bit represents the predicate result of the corresponding thread 12. 11. 2015 by Martin Kruliš (v1.0)9 CC >= 1.2 CC >= 2.0

 Warp Shuffle Instructions ◦ Fast variable exchange within the warp ◦ Available for architectures with CC 3.0 or newer ◦ Intrinsics  __shfl() – direct copy from indexed lane  __shfl_up() – copy with lower ID relative to caller  __shfl_down() – copy with higher ID relative to caller  __shfl_xor() – copy from a lane, which ID is computed as XOR of caller ID  All functions have optional width parameter, that allows to divide warp into smaller segments 12. 11. 2015 by Martin Kruliš (v1.0)10

 Warp Shuffle Instructions ◦ Broadcast example int laneId = threadIdx.x % warpSize; int value; if (laneId == 0) value = valueToBeBroadcasted; value = __shfl(value, 0); 12. 11. 2015 by Martin Kruliš (v1.0)11 Index within the warp Only the first thread sets the value (others’ value is unused) All threads send value, and get value from lane 0

 Warp Shuffle Instructions ◦ Reduction example int value = per_thread_reduction(); for (int i = warpSize/2; i >= 1; i /= 2) value += __shfl_xor(value, i); 12. 11. 2015 by Martin Kruliš (v1.0)12 Return the value sent by laneID xor i Note that i is 10000b, 01000b, 00100b, … All threads accumulate the total sum into the value Quiz: What if we iterate upwards (from 1 to warpSize/2 )?

 Levenshtein Edit Distance Example ◦ One warp computes one diagonal of a sub-problem ◦ Shuffle functions can be used to exchange data ◦ Up to 1.3x speedup over shared memory solution 12. 11. 2015 by Martin Kruliš (v1.0)13

 Loop Unrolling ◦ Useful for performance (less instructions, jumps, …) ◦ Automatically done by compiler  Small loops with a known trip count ◦ Can be affected by code pragma #pragma unroll 5 for (int i = 0; i < n; ++i) ◦ If the unroll number is omitted, the loop is either  completely unrolled if the trip count is constant  or not unrolled at all ◦ #pragma unroll 1 prevents the loop unrolling 12. 11. 2015 by Martin Kruliš (v1.0)14 Unrolled 5x and some code inserted to ensure correctness

 SIMD Video Instructions ◦ Special instructions added in CC 3.0 ◦ Work along with SIMT model (each thread can perform them) ◦ Operate on two 16bit integers or four 8bit integers ◦ They must be encoded directly in asm() statement ◦ Only a few operations frequent in video processing:  vadd2, vadd4, vsub2, vsub4  vavrg2, vavrg4, vabsdiff2, vabsdiff4  vmin2, vmin4, vmax2, vmax4  vset2, vset4 12. 11. 2015 by Martin Kruliš (v1.0)15

Martin Kruliš 12. 11. 2015 by Martin Kruliš (v1.0)1.

Similar presentations

Presentation on theme: "Martin Kruliš 12. 11. 2015 by Martin Kruliš (v1.0)1."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Martin Kruliš 12. 11. 2015 by Martin Kruliš (v1.0)1.

Similar presentations

Presentation on theme: "Martin Kruliš 12. 11. 2015 by Martin Kruliš (v1.0)1."— Presentation transcript:

Similar presentations

About project

Feedback