Presentation is loading. Please wait.

Presentation is loading. Please wait.

Martin Kruliš 12. 11. 2015 by Martin Kruliš (v1.0)1.

Similar presentations


Presentation on theme: "Martin Kruliš 12. 11. 2015 by Martin Kruliš (v1.0)1."— Presentation transcript:

1 Martin Kruliš 12. 11. 2015 by Martin Kruliš (v1.0)1

2  GPU Design ◦ Oriented for rendering and geometry calculations ◦ Rich set of mathematical functions  Many of those are implemented as instructions  Separate functions for doubles and floats  e.g., sqrtf(float) and sqrt(double) ◦ Instruction behavior depends on compiler options  -use_fast_math – fast but lower precision  -ftz=bool – flush denormals to zero  -prec-div=bool – precise float divisions  -prec-sqrt=bool – precise float sqrts  -fmad=bool – use mul-add instructions (e.g., FFMA) 12. 11. 2015 by Martin Kruliš (v1.0)2 Single precision floats only

3 12. 11. 2015 by Martin Kruliš (v1.0)3

4 12. 11. 2015 by Martin Kruliš (v1.0)4

5  Math instruction costs ◦ Math intrinsics vs. functions ( __sinf() vs. sinf() )  Intrinsics are faster but imprecise  Compiler uses intrinsics if -use_fast_math is set ◦ Division and modulo on integers  Should be avoided or replaced with shifts and bitwise ands whenever possible (e.g., divisor is a power of 2)  Compiler optimize only if the divisor is a literal ◦ Prefer more specialized functions  rsqrtf over 1/sqrtf, sinpif(x) over sinf(pi*x), …  x*x or expf() over powf()  … 12. 11. 2015 by Martin Kruliš (v1.0)5

6  Atomic Instructions ◦ Perform read-modify-write operation of one 32bit or 64bit word in global or shared memory ◦ Require CC 1.1 or higher (1.2 for 64bit global atomics and 2.0 for 64bit shared mem. atomics) ◦ Operate on integers, except for atomicExch() and atomicAdd() which also work on 32bit floats ◦ Atomic operations on mapped memory are atomic only from the perspective of the device  Since they are usually performed on L2 cache 12. 11. 2015 by Martin Kruliš (v1.0)6

7  Atomic Instructions Overview ◦ atomicAdd(&p,v), atomicSub(&p,v) – atomically adds or subtracts v to/from p and return the old value ◦ atomicInc(&p,v), atomicDec(&p,v) – atomic increment/decrement computed modulo v ◦ atomicMin(&p,v), atomicMax(&p,v) ◦ atomicExch(&p,v) – atomically swaps a value ◦ atomicCAS(&p,v) – classical compare-and-set ◦ atomicAnd(&p,v), atomicOr(&p,v), atomicXor(&p,v) – atomic bitwise operations 12. 11. 2015 by Martin Kruliš (v1.0)7

8  Implementing Other Functions __device__ int atomicOp(int *addr, int val) { int old = *addr, assumed; do { assumed = old; old = atomicCAS(addr, assumed, op(assumed, val)); } while (assumed != old); return old; }  Synchronization ◦ Using atomics for global locks or barriers is not recommended (possibility of a deadlock) 12. 11. 2015 by Martin Kruliš (v1.0)8

9  Voting Instructions ◦ Intrinsics that allows the whole warp to perform reduction and broadcast in one step  Only active threads are participating ◦ __all(predicate)  All active threads evaluate predicate  Returns non-zero if ALL predicates returned non-zero ◦ __any(predicate)  Like __all, but the results are combined by logical OR ◦ __ballot(predicate)  Return bitmask, where each bit represents the predicate result of the corresponding thread 12. 11. 2015 by Martin Kruliš (v1.0)9 CC >= 1.2 CC >= 2.0

10  Warp Shuffle Instructions ◦ Fast variable exchange within the warp ◦ Available for architectures with CC 3.0 or newer ◦ Intrinsics  __shfl() – direct copy from indexed lane  __shfl_up() – copy with lower ID relative to caller  __shfl_down() – copy with higher ID relative to caller  __shfl_xor() – copy from a lane, which ID is computed as XOR of caller ID  All functions have optional width parameter, that allows to divide warp into smaller segments 12. 11. 2015 by Martin Kruliš (v1.0)10

11  Warp Shuffle Instructions ◦ Broadcast example int laneId = threadIdx.x % warpSize; int value; if (laneId == 0) value = valueToBeBroadcasted; value = __shfl(value, 0); 12. 11. 2015 by Martin Kruliš (v1.0)11 Index within the warp Only the first thread sets the value (others’ value is unused) All threads send value, and get value from lane 0

12  Warp Shuffle Instructions ◦ Reduction example int value = per_thread_reduction(); for (int i = warpSize/2; i >= 1; i /= 2) value += __shfl_xor(value, i); 12. 11. 2015 by Martin Kruliš (v1.0)12 Return the value sent by laneID xor i Note that i is 10000b, 01000b, 00100b, … All threads accumulate the total sum into the value Quiz: What if we iterate upwards (from 1 to warpSize/2 )?

13  Levenshtein Edit Distance Example ◦ One warp computes one diagonal of a sub-problem ◦ Shuffle functions can be used to exchange data ◦ Up to 1.3x speedup over shared memory solution 12. 11. 2015 by Martin Kruliš (v1.0)13

14  Loop Unrolling ◦ Useful for performance (less instructions, jumps, …) ◦ Automatically done by compiler  Small loops with a known trip count ◦ Can be affected by code pragma #pragma unroll 5 for (int i = 0; i < n; ++i) ◦ If the unroll number is omitted, the loop is either  completely unrolled if the trip count is constant  or not unrolled at all ◦ #pragma unroll 1 prevents the loop unrolling 12. 11. 2015 by Martin Kruliš (v1.0)14 Unrolled 5x and some code inserted to ensure correctness

15  SIMD Video Instructions ◦ Special instructions added in CC 3.0 ◦ Work along with SIMT model (each thread can perform them) ◦ Operate on two 16bit integers or four 8bit integers ◦ They must be encoded directly in asm() statement ◦ Only a few operations frequent in video processing:  vadd2, vadd4, vsub2, vsub4  vavrg2, vavrg4, vabsdiff2, vabsdiff4  vmin2, vmin4, vmax2, vmax4  vset2, vset4 12. 11. 2015 by Martin Kruliš (v1.0)15

16 12. 11. 2015 by Martin Kruliš (v1.0)16


Download ppt "Martin Kruliš 12. 11. 2015 by Martin Kruliš (v1.0)1."

Similar presentations


Ads by Google