{ Optimizing C63 for x86 Group 9.  Bird’s-eye view: gprof of reference encoder  Optimizing SAD  Results Outline.

{ Optimizing C63 for x86 Group 9

 Bird’s-eye view: gprof of reference encoder  Optimizing SAD  Results Outline

  Each sample counts as 0.01 seconds.   % cumulative self self total   time seconds seconds calls us/call us/call name   93.05 37.36 37.36 500214528 0.07 0.07 sad_block_8x8   3.04 38.58 1.22 705672 1.73 54.67 me_block_8x8   1.77 39.29 0.71 712800 1.00 1.00 dequant_idct_block_8x8   1.25 39.79 0.50 712800 0.70 0.70 dct_quant_block_8x8   0.35 39.93 0.14 713100 0.20 0.29 flush_bits   0.17 40.00 0.07 16398942 0.00 0.00 put_bits   granularity: each sample hit covers 2 byte(s) for 0.02% of 40.16 seconds gprof: reference, foreman

Optimizing SAD

SAD SSE2 PSAD   void sad_block_8x8(uint8_t *block1, uint8_t *block2, int stride, int *result)   {   int v;   __m128i r = _mm_setzero_si128();   for (v = 0; v < 8; v += 2)   {   const __m128i b1 = _mm_set_epi64(*(__m64 *) &block1[(v+0)*stride],   *(__m64 *) &block1[(v+1)*stride]);   const __m128i b2 = _mm_set_epi64(*(__m64 *) &block2[(v+0)*stride],   *(__m64 *) &block2[(v+1)*stride]);   r = _mm_add_epi16(r, _mm_sad_epu8(b2, b1));   }   *result =   _mm_extract_epi16(r, 0) +   _mm_extract_epi16(r, 4);   }

 Gprof: Improved SSE2 SAD uses 3.69 s vs. 38.50s*  Cachegrind: Lots of branch prediction misses in me_block_8x8 *) Foreman sequence on gpu-7 How well does this perform?

SAD SSE4.1 MPSADBW+PHMINSUM   void sad_block_8x8x8(uint8_t *block1, uint8_t *block2, int stride, int *best,   int *result)   {   int v;   __m128i r = _mm_setzero_si128();   union {   __m128i v;   struct {   uint16_t sad;   unsigned int index : 3;   } minpos;   } mp;   for (v = 0; v < 8; v += 2) {   const __m128i b1 = _mm_set_epi64(*(__m64 *) &block1[(v+1)*stride],   *(__m64 *) &block1[(v+0)*stride]);   const __m128i b2 = _mm_loadu_si128((__m128i *) &block2[(v+0)*stride]);   const __m128i b3 = _mm_loadu_si128((__m128i *) &block2[(v+1)*stride]);   r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b2, b1, 0b000));   r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b2, b1, 0b101));   r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b3, b1, 0b010));   r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b3, b1, 0b111));   }   mp.v = _mm_minpos_epu16(r);   *result = mp.minpos.sad;   *best = mp.minpos.index;   }

 Gprof: Improved SSE4.1 SAD uses 0.90s vs 3.69s  Cachegrind: Branch prediction misses reduced by a factor of 8  Intel’s IACA tool: CPU pipeline appears to be filled! Appears both source and reference block loads compete with (V)MPSADBW for the CPUs execution ports  Assembly: Better utilize AVX’s non-destructive instructions (less register copies) Better utilize loaded data for SAD computations with same source block How well does this perform?

SAD 8x8x8x8: Less src loads, branches   Load source block once:.macro.load_src   vmovq (%rdi), %xmm0# src[0]   vpinsrq$1,(%rdi,%rdx),%xmm0,%xmm0# src[1]   vmovq (%rdi,%rdx,2),%xmm1# src[2]   vmovhps(%rdi,%r8),%xmm1,%xmm1# src[3]   vmovq (%rdi,%rdx,4),%xmm2# src[4]   vmovhps(%rdi,%r9),%xmm2,%xmm2# src[5]   vmovq (%rdi,%r8,2),%xmm3# src[6]   vmovhps(%rdi,%rax),%xmm3,%xmm3# src[7]   Do sad for 8x8 - 8x8 blocks (relative y = 0…8, x = 0…8):.macro.8x8x8x8   vmovdqu (%rsi), %xmm12# ref[0]  .8x8x1 0, 0, 12, 4 0   vmovdqu(%rsi,%rdx), %xmm13# ref[1]  .8x8x1 0, 1, 13, 4 0  .8x8x1 1, 0, 13, 5 0

 Gprof: Improved SAD 8x8x8x8 uses 0.53s vs 0.90s  Valgrind: Even less branching How well does this perform?

SAD 4x8x8x8x8: Branchless UV-plane ME   sad_4x8x8x8x8:.load_src # Load source block from %rdi   mov %rsi, %rdi# Reference block x,y  .8x8x8x8   lea 8(%rdi), %rsi# Reference block x+8, y  .8x8x8x8   lea (%rdi,%rdx,8), %rsi  .8x8x8x8   lea 8(%rdi,%rdx,8), %rsi #  .8x8x8x8   …   vphminposuw   …   ret

 Gprof: Improved 4x8x8x8x8 SAD uses 0.47s vs 0.53s  Valgrind: Even less branching!  Total runtime of Foreman sequence reduced from ~40.1s to ~1.6s (factor of 25) How well does this perform?

InstructionFunction variantCycle mean, adjusted for single 8x8 block MPSADBWSAD 8x 8x87.5 VMPSADBWSAD 2x8x 8x83.6 SAD 4x8x 8x84.3 SAD 8x8x 8x8 v12.9 SAD 8x8x 8x8 v22.8 SAD 4x 8x8x 8x82.7 Future researchSAD 16x8x8x8x82.6? Even less branches? SAD assembly code: Iterative improvement

Image quality: comparable PSNR Mean PSNR 95% SSIM Mean SSIM 95% Tractor reference33.1 304534.46 5490.9614 40.97672 Tractor33.1 233934.46 5870.9614 60.97672 Foreman reference33.4 621535.36 2800.951 530.964 76 Foreman33.4 480635.36 2170.951 390.964 36

  NOT ON NDLAB!   Each sample counts as 0.01 seconds.   % cumulative self self total   time seconds seconds calls s/call s/call name   92.90 124.66 124.66 1809584896 0.00 0.00 sad_block_8x8   3.06 128.76 4.10 49 0.08 2.63 c63_motion_estimate   1.58 130.89 2.13 2448000 0.00 0.00 dct_quant_block_8x8   1.58 133.01 2.13 2448000 0.00 0.00 dequant_idct_block_8x8   0.26 133.36 0.35 2448000 0.00 0.00 write_block   0.18 133.60 0.24 150 0.00 0.02 dequantize_idct   0.16 133.82 0.22 49 0.00 0.00 c63_motion_compensate   0.13 133.99 0.17 150 0.00 0.02 dct_quantize   0.11 134.14 0.15 55939437 0.00 0.00 put_bits   granularity: each sample hit covers 2 byte(s) for 0.01% of 134.21 seconds gprof: reference, tractor 50 frames (-O3)

  NOT ON NDLAB!   Each sample counts as 0.01 seconds.   % cumulative self self total   time seconds seconds calls ms/call ms/call name   31.15 1.83 1.83 6869016 0.00 0.00 sad_4x8x8x8x8   24.75 3.28 1.45 2448000 0.00 0.00 dct_quant_block_8x8   23.81 4.67 1.40 2448000 0.00 0.00 dequant_idct_block_8x8   7.17 5.09 0.42 50 8.40 12.20 write_frame   4.44 5.35 0.26 150 1.73 11.34 dequantize_idct   3.24 5.54 0.19 55937692 0.00 0.00 put_bits   1.71 5.64 0.10 150 0.67 10.64 dct_quantize   1.54 5.73 0.09 9792000 0.00 0.00 transpose_block_avx   0.85 5.78 0.05 798700 0.00 0.00 sad_8x8x8x8   0.68 5.82 0.04 49 0.82 39.09 c63_motion_estimate   0.51 5.85 0.03 49 0.61 0.61 c63_motion_compensate   granularity: each sample hit covers 2 byte(s) for 0.17% of 5.86 seconds   134.2/5.86 = ~24 gprof: improved, tractor 50 frames (-O3)

{ Optimizing C63 for x86 Group 9.  Bird’s-eye view: gprof of reference encoder  Optimizing SAD  Results Outline.

Similar presentations

Presentation on theme: "{ Optimizing C63 for x86 Group 9.  Bird’s-eye view: gprof of reference encoder  Optimizing SAD  Results Outline."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

{ Optimizing C63 for x86 Group 9.  Bird’s-eye view: gprof of reference encoder  Optimizing SAD  Results Outline.

Similar presentations

Presentation on theme: "{ Optimizing C63 for x86 Group 9.  Bird’s-eye view: gprof of reference encoder  Optimizing SAD  Results Outline."— Presentation transcript:

Similar presentations

About project

Feedback