SSE for H.264 Encoder
Chuck Tsen, Sean Pieper

SSE: what can’t it do?
Mixed scalar and vector operations
Unaligned memory accesses
Predicated execution
More than 2 source arguments (no combined shuffle-and-add)
Many options that exist for FP don’t exist for integer
Can’t apply it everywhere!

Profile of code
GCOV for per-line execution counts
SATD most promising candidate
  – SSE SAD instruction does not help!
Walsh-Hadamard Transform
  – Used for motion estimation of 8x8 blocks: 16 applications per block, 8-element vectors
  – Used for sub-pixel motion estimation: one vector at a time, 16-element vector

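The point about the SAD instruction can be illustrated with a small scalar sketch. SAD sums absolute differences directly, while SATD first transforms the differences and only then sums absolute values, so a hardware SAD cannot stand in for it. The function names `sad` and `satd4` and the 4-element size are assumptions chosen for brevity; the encoder itself works on 8- and 16-element vectors.

```c
#include <assert.h>
#include <stdlib.h>

/* SAD: sum of absolute differences, no transform involved. */
int sad(const int *a, const int *b, int n) {
    assert(a != NULL && b != NULL);
    int total = 0;
    for (int i = 0; i < n; i++)
        total += abs(a[i] - b[i]);
    return total;
}

/* SATD on 4 elements: apply a 4-point Walsh-Hadamard transform to the
 * differences first, then sum the absolute values of the outputs. */
int satd4(const int *a, const int *b) {
    assert(a != NULL && b != NULL);
    int d0 = a[0] - b[0], d1 = a[1] - b[1];
    int d2 = a[2] - b[2], d3 = a[3] - b[3];
    int s01 = d0 + d1, m01 = d0 - d1;   /* first butterfly stage */
    int s23 = d2 + d3, m23 = d2 - d3;
    return abs(s01 + s23) + abs(s01 - s23)   /* second stage + abs sum */
         + abs(m01 + m23) + abs(m01 - m23);
}
```

For the same inputs the two metrics generally differ, which is why the transform stage itself has to be vectorized.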
Hadamard Matrix
All values are in {1, -1}
Orthogonal and symmetric
First row/column contains only 1’s
Equal numbers of 1’s and -1’s in every row after the first
Transform
  – Multiply input by matrix
  – Sum absolute values of output vector
Calculations are not very regular

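The properties listed above can be checked with a brute-force sketch. It assumes the Sylvester ordering of the Hadamard matrix, where entry (i, j) is +1 when `i & j` has an even number of set bits and -1 otherwise; the slides do not say which row ordering the encoder uses, and the names `hadamard_entry` and `wht_abs_sum` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Entry (i, j) of the Sylvester-ordered Hadamard matrix: +1 or -1,
 * depending on the parity of the number of set bits in (i & j). */
int hadamard_entry(unsigned i, unsigned j) {
    unsigned v = i & j;
    int sign = 1;
    while (v) {
        sign = -sign;
        v &= v - 1;          /* clear the lowest set bit */
    }
    return sign;
}

/* Multiply the n-element input by the matrix, then sum the absolute
 * values of the output vector. n must be a power of two. */
int wht_abs_sum(const int *x, unsigned n) {
    assert(x != NULL && n != 0 && (n & (n - 1)) == 0);
    int total = 0;
    for (unsigned i = 0; i < n; i++) {
        int acc = 0;
        for (unsigned j = 0; j < n; j++)
            acc += hadamard_entry(i, j) * x[j];
        total += abs(acc);
    }
    return total;
}
```

This O(n²) form is only a reference for checking results; it is not how a fast implementation would organize the arithmetic.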
Our optimization
Bin by positive terms in each group
  – {0-3, 4-7, 8-11, 12-15}
  – Allows aligning data in columns
Four terms cannot align by column
  – But can align within row
Hard parts align by column
  – Use vector ops across columns
  – Much shuffling, but SSE is still a win
Simpler optimization for 8x8 columns

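As a generic illustration of "vector ops across columns" (not the 16-point code from the talk), the sketch below runs a 4-point transform down all four columns of a 4x4 block at once. Each `__m128i` holds one row, so a single add or subtract advances four columns together and no shuffles are needed. The function name `wht4_columns_sse2`, the row-major layout, and the 32-bit element type are assumptions.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

/* Column-wise 4-point Walsh-Hadamard transform of a row-major 4x4 block.
 * Lane k of every register belongs to column k. */
void wht4_columns_sse2(const int32_t in[16], int32_t out[16]) {
    assert(in != 0 && out != 0);
    /* One 128-bit load brings in four 32-bit values. */
    __m128i r0 = _mm_loadu_si128((const __m128i *)(in + 0));
    __m128i r1 = _mm_loadu_si128((const __m128i *)(in + 4));
    __m128i r2 = _mm_loadu_si128((const __m128i *)(in + 8));
    __m128i r3 = _mm_loadu_si128((const __m128i *)(in + 12));

    /* First butterfly stage. */
    __m128i s01 = _mm_add_epi32(r0, r1);
    __m128i d01 = _mm_sub_epi32(r0, r1);
    __m128i s23 = _mm_add_epi32(r2, r3);
    __m128i d23 = _mm_sub_epi32(r2, r3);

    /* Second stage; output rows have sign patterns ++++, +-+-, ++--, +--+. */
    _mm_storeu_si128((__m128i *)(out + 0),  _mm_add_epi32(s01, s23));
    _mm_storeu_si128((__m128i *)(out + 4),  _mm_add_epi32(d01, d23));
    _mm_storeu_si128((__m128i *)(out + 8),  _mm_sub_epi32(s01, s23));
    _mm_storeu_si128((__m128i *)(out + 12), _mm_sub_epi32(d01, d23));
}
```

The row direction is the harder case, because the terms to combine sit in different lanes of the same register; that is where the shuffling mentioned above comes in.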
Before and After
Original 1D has 64 lines, modified ~70
  – Used intermediate calculations heavily
But IA32 has only ~4 GP registers
Original has mem traffic == SLOW
SSE keeps data minty fresh in 8 “special” registers
Oh, and we load the data 4x faster

Questions?

BACKUPS!

line0 = x[0] + x[3] + (x[4] + x[8]) + x[7] + x[b] + x[c] + x[f] - x[1] - x[2] - x[5] - x[6] - x[9] - x[a] - x[d] - x[e]
line1 = x[0] + x[1] + (x[5] + x[9]) + x[4] + x[8] + x[c] + x[d] - x[2] - x[3] - x[6] - x[7] - x[a] - x[b] - x[e] - x[f]
line2 = x[0] + x[2] + (x[6] + x[a]) + x[4] + x[8] + x[c] + x[e] - x[3] - x[1] - x[7] - x[5] - x[b] - x[9] - x[f] - x[d]
line12 = x[0] + x[1] + (x[7] + x[b]) + x[6] + x[a] + x[c] + x[d] - x[2] - x[3] - x[5] - x[4] - x[8] - x[9] - x[f] - x[e]

alpha.sse = _mm_shuffle_epi32(x_zero_three, (int) 0x67);      // alpha = x[1,2,1,3]
alpha.sse = _mm_add_epi32(alpha.sse, x_zeros.sse);            // alpha += x[0,0,0,0]
alpha.sse = _mm_add_epi32(alpha.sse, special_47p8b.sse);      // alpha += x[7,6,5,4] + x[b,a,9,8]
beta.sse  = _mm_shuffle_epi32(x_four_seven, (int) 0x83);      // beta = x[6,4,4,7]
alpha.sse = _mm_add_epi32(alpha.sse, beta.sse);               // alpha += beta
beta.sse  = _mm_shuffle_epi32(x_eight_eleven, (int) 0x83);    // beta = x[a,8,8,b]
alpha.sse = _mm_add_epi32(alpha.sse, beta.sse);               // alpha += beta
alpha.sse = _mm_add_epi32(alpha.sse, x_cs.sse);               // alpha += x[c,c,c,c]
beta.sse  = _mm_shuffle_epi32(x_twelve_fifteeen, (int) 0x67); // beta = x[d,e,d,f]
alpha.sse = _mm_add_epi32(alpha.sse, beta.sse);               // alpha += beta
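
The backup snippet only accumulates the positive terms of the four lines. A self-contained sketch of how it could be completed and checked against the scalar formulas follows. The names `lines_scalar` and `lines_sse2` are hypothetical, the hex indices x[a]..x[f] are written as 10..15, and the fourth output (line12 on the slide) is stored at index 3. The final step, recovering each line as twice its positive sum minus the sum of all 16 inputs, is one possible completion based on the identity P - N = 2P - (P + N); the slides do not show how the negative terms were actually handled.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

/* Scalar reference: the four rows as written on the slide. */
void lines_scalar(const int32_t x[16], int32_t line[4]) {
    line[0] = x[0] + x[3] + (x[4] + x[8]) + x[7] + x[11] + x[12] + x[15]
            - x[1] - x[2] - x[5] - x[6] - x[9] - x[10] - x[13] - x[14];
    line[1] = x[0] + x[1] + (x[5] + x[9]) + x[4] + x[8] + x[12] + x[13]
            - x[2] - x[3] - x[6] - x[7] - x[10] - x[11] - x[14] - x[15];
    line[2] = x[0] + x[2] + (x[6] + x[10]) + x[4] + x[8] + x[12] + x[14]
            - x[3] - x[1] - x[7] - x[5] - x[11] - x[9] - x[15] - x[13];
    line[3] = x[0] + x[1] + (x[7] + x[11]) + x[6] + x[10] + x[12] + x[13]
            - x[2] - x[3] - x[5] - x[4] - x[8] - x[9] - x[15] - x[14];
}

/* SSE2 version: lane k of every register accumulates output k. */
void lines_sse2(const int32_t x[16], int32_t line[4]) {
    assert(x != 0 && line != 0);
    __m128i x03 = _mm_loadu_si128((const __m128i *)(x + 0));
    __m128i x47 = _mm_loadu_si128((const __m128i *)(x + 4));
    __m128i x8b = _mm_loadu_si128((const __m128i *)(x + 8));
    __m128i xcf = _mm_loadu_si128((const __m128i *)(x + 12));

    /* Positive terms, same shuffle constants as the slide. Lane lists
     * below are high-to-low, matching the slide's comment style. */
    __m128i pos = _mm_shuffle_epi32(x03, 0x67);                  /* x[1,2,1,3] */
    pos = _mm_add_epi32(pos, _mm_shuffle_epi32(x03, 0x00));      /* x[0,0,0,0] */
    pos = _mm_add_epi32(pos, _mm_add_epi32(x47, x8b));           /* x[7..4] + x[b..8] */
    pos = _mm_add_epi32(pos, _mm_shuffle_epi32(x47, 0x83));      /* x[6,4,4,7] */
    pos = _mm_add_epi32(pos, _mm_shuffle_epi32(x8b, 0x83));      /* x[a,8,8,b] */
    pos = _mm_add_epi32(pos, _mm_shuffle_epi32(xcf, 0x00));      /* x[c,c,c,c] */
    pos = _mm_add_epi32(pos, _mm_shuffle_epi32(xcf, 0x67));      /* x[d,e,d,f] */

    /* Sum of all 16 inputs, broadcast to every lane. */
    __m128i sum = _mm_add_epi32(_mm_add_epi32(x03, x47),
                                _mm_add_epi32(x8b, xcf));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0x4E));      /* swap 64-bit halves */
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0xB1));      /* swap adjacent lanes */

    /* line = positives - negatives = 2 * positives - total. */
    _mm_storeu_si128((__m128i *)line,
                     _mm_sub_epi32(_mm_add_epi32(pos, pos), sum));
}
```

Each shuffle-then-add pair takes two instructions because SSE arithmetic has only two source operands, which is the limitation noted earlier in the deck.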