Published by Jonah Park. Modified over 9 years ago.
1
Implementation of MPEG2 Codec with MMX/SSE/SSE2 Technology
Speakers: Rong Jiang, Xu Jin
Instructor: Yu-Hen Hu
2
Outline
    Introduction
    MMX/SSE/SSE2
    MPEG 2 Video Compression
    What we have done?
    Conclusion
3
MMX/SSE/SSE2
MMX
    57 new instructions;
    8 64-bit wide MMX registers;
    4 new data types (3 packed data types and 1 64-bit entity).
SSE
    8 new 128-bit SIMD floating-point registers;
    50 new instructions that work on packed floating-point data;
    8 new instructions to control data cacheability;
    12 new instructions that extend the MMX instruction set.
SSE2
    Adds support for 64-bit (double-precision) floating-point values.
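To make the packed data types concrete, here is a minimal sketch using compiler intrinsics rather than hand-written assembly. The helper names `add4_floats` and `add2_doubles` are hypothetical and are not part of the presented codec; the sketch assumes an x86 compiler that provides `<emmintrin.h>`.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the SSE ones */

/* SSE: add four packed single-precision floats with one instruction.
   Hypothetical helper, for illustration only. */
void add4_floats(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);            /* load 4 floats, unaligned */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* 4 additions in parallel */
}

/* SSE2: add two packed double-precision (64-bit) floats.
   Hypothetical helper, for illustration only. */
void add2_doubles(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);           /* load 2 doubles, unaligned */
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb)); /* 2 additions in parallel */
}
```

Each call processes a full 128-bit register, so the same register holds either four floats or two doubles.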
4
MPEG 2 video compression
5
Project outline
    1. Dig out a MPEG2 Enc/Dec C code
    2. Generate profiling information
    3. Identify the kernels
    4. Rewrite kernels using SSE
    5. Performance results
6
Profiling results of the original code
[Figure: profiles of mpeg2decode and mpeg2encode; labeled functions: idct(), dist1(), fdct()]
7
Example 1 – optimizing dist1()

Original C code:

    if ((v = p1[0] - p2[0]) < 0) v = -v; s += v;
    if ((v = p1[1] - p2[1]) < 0) v = -v; s += v;
    if ((v = p1[2] - p2[2]) < 0) v = -v; s += v;
    if ((v = p1[3] - p2[3]) < 0) v = -v; s += v;
    if ((v = p1[4] - p2[4]) < 0) v = -v; s += v;
    if ((v = p1[5] - p2[5]) < 0) v = -v; s += v;
    if ((v = p1[6] - p2[6]) < 0) v = -v; s += v;
    if ((v = p1[7] - p2[7]) < 0) v = -v; s += v;
    if ((v = p1[8] - p2[8]) < 0) v = -v; s += v;
    if ((v = p1[9] - p2[9]) < 0) v = -v; s += v;
    if ((v = p1[10] - p2[10]) < 0) v = -v; s += v;
    if ((v = p1[11] - p2[11]) < 0) v = -v; s += v;
    if ((v = p1[12] - p2[12]) < 0) v = -v; s += v;
    if ((v = p1[13] - p2[13]) < 0) v = -v; s += v;
    if ((v = p1[14] - p2[14]) < 0) v = -v; s += v;
    if ((v = p1[15] - p2[15]) < 0) v = -v; s += v;

SSE2 version (GCC inline assembly, AT&T syntax):

    asm volatile (
        "movdqu  (%1), %%xmm0\n\t"     /* load 16 bytes from p1 */
        "movdqu  (%2), %%xmm1\n\t"     /* load 16 bytes from p2 */
        "psadbw  %%xmm0, %%xmm1\n\t"   /* two partial sums: low and high quadword */
        "movdq2q %%xmm1, %%mm0\n\t"    /* low partial sum */
        "psrldq  $8, %%xmm1\n\t"       /* shift right to bring the high sum down */
        "movdq2q %%xmm1, %%mm1\n\t"    /* high partial sum */
        "paddd   %%mm1, %%mm0\n\t"     /* add the two partial sums */
        "movd    %%mm0, %0\n\t"
        "emms"                         /* leave MMX state before any x87 code */
        : "=r"(s)
        : "r"(p1), "r"(p2)
        : "xmm0", "xmm1", "mm0", "mm1", "memory");

This code segment is used to calculate residual matrices in the prediction stage of the encoder.
4-5X speed-up, but it can be faster!
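The same 16-pixel sum of absolute differences can also be sketched with SSE2 intrinsics, which avoids inline assembly and the MMX registers altogether. This is an illustrative alternative, not the code used in the presentation; `sad16_scalar` and `sad16_sse2` are hypothetical names.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdlib.h>     /* abs */

/* Scalar reference: sum of |p1[i] - p2[i]| over 16 pixels. */
int sad16_scalar(const unsigned char *p1, const unsigned char *p2)
{
    int s = 0;
    for (int i = 0; i < 16; i++)
        s += abs(p1[i] - p2[i]);
    return s;
}

/* SSE2 sketch: psadbw produces one partial sum per 64-bit half,
   so the high half is shifted down and added to the low half. */
int sad16_sse2(const unsigned char *p1, const unsigned char *p2)
{
    __m128i a   = _mm_loadu_si128((const __m128i *)p1);
    __m128i b   = _mm_loadu_si128((const __m128i *)p2);
    __m128i sad = _mm_sad_epu8(a, b);       /* two partial sums */
    __m128i hi  = _mm_srli_si128(sad, 8);   /* byte shift right by 8 */
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, hi));
}
```

Comparing the two functions on the same input is a cheap way to validate a hand-optimized kernel before measuring its speed.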
8
Four ways to write super-fast code
    Rearrange data fetching to maximize cache hits;
    Unroll loops to eliminate unnecessary branches;
    Utilize SSE instructions to take full advantage of parallelism;
    Apply code scheduling to exploit the multiple-issue capability of the Pentium 4's superscalar micro-architecture.
9
Example 2 – optimize idct()

Three nested loops form the kernel of the DCT:

    for (i=0; i<8; i++)
        for (j=0; j<8; j++) {
            partial_product = 0.0;
            for (k=0; k<8; k++)
                partial_product += c[k][j]*block[i][k];
            tmp[i][j] = partial_product;
        }
10
A verbatim translation from C to assembly doesn’t do much better. It misses the whole point of manually writing an assembly procedure.
11
We need parallelism!
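One way to expose that parallelism in the triple loop of Example 2 is to compute four output columns at a time: broadcast `block[i][k]` and multiply it by four consecutive entries of row `c[k]`. The sketch below assumes single-precision `float` arrays, which the slides do not state, and uses hypothetical function names; it is not the presenters' implementation.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar reference, same loop structure as Example 2. */
void matmul8_scalar(float block[8][8], float c[8][8], float tmp[8][8])
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++) {
            float partial_product = 0.0f;
            for (int k = 0; k < 8; k++)
                partial_product += c[k][j] * block[i][k];
            tmp[i][j] = partial_product;
        }
}

/* SSE sketch: the j loop is replaced by two 4-wide vector accumulators. */
void matmul8_sse(float block[8][8], float c[8][8], float tmp[8][8])
{
    for (int i = 0; i < 8; i++) {
        __m128 lo = _mm_setzero_ps();   /* accumulates tmp[i][0..3] */
        __m128 hi = _mm_setzero_ps();   /* accumulates tmp[i][4..7] */
        for (int k = 0; k < 8; k++) {
            __m128 b = _mm_set1_ps(block[i][k]);  /* broadcast one scalar */
            lo = _mm_add_ps(lo, _mm_mul_ps(b, _mm_loadu_ps(&c[k][0])));
            hi = _mm_add_ps(hi, _mm_mul_ps(b, _mm_loadu_ps(&c[k][4])));
        }
        _mm_storeu_ps(&tmp[i][0], lo);
        _mm_storeu_ps(&tmp[i][4], hi);
    }
}
```

Walking `c` row by row also keeps the memory accesses contiguous, which fits the cache advice given earlier in the deck.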
12
Results
[Chart; values as extracted: 50.1 s, 16.34 s, 2.45 s, 3.83 s; 68.72%, 13.04%, 34.39%, 9.99%]
Experimental results are averaged over 3 runs.
Speed-up: 25X in idct(), 4X in dist1()
13
Platform Compatibility (1)
Algorithm for Checking Availability of MMX

    bool isMMXSupported()
    {
        int fSupported;
        asm {
            mov eax,1            // CPUID level 1
            cpuid                // EDX = feature flag
            and edx,0x800000     // test bit 23 of feature flag
            mov fSupported,edx   // != 0 if MMX is supported
        }
        if (fSupported != 0)
            return true;
        else
            return false;
    }
14
Platform Compatibility (2)
Algorithm for Checking Availability of SSE

    bool isISSESupported()
    {
        int processor;
        int features;
        int extfeatures = 0;
        asm {
            pusha
            mov eax,1
            cpuid
            mov processor,eax      // Store processor family/model/step
            mov features,edx       // Store features bits
            mov eax,080000000h
            cpuid                  // Check which extended functions can be called
            cmp eax,080000001h     // Extended Feature Bits
            jb nofeatures          // Jump if not supported
            mov eax,080000001h     // Select function 0x80000001
            cpuid
            mov extfeatures,edx    // Store extended features bits
        nofeatures:
            popa
        }
        if (((features >> 25) & 1) != 0)
            return true;
        else if (((extfeatures >> 22) & 1) != 0)
            return true;
        else
            return false;
    }

Dispatch flow: if SSE is supported, run the SSE routine; otherwise, if MMX is supported, run the MMX routine; otherwise run the normal routine.
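With GCC or Clang, the same feature check and routine selection can be sketched without inline assembly by using `__get_cpuid` from `<cpuid.h>`. The enum and function names below are hypothetical; the bit positions (23 for MMX, 25 for SSE) are the ones tested in the slides' code.

```c
#include <cpuid.h>  /* GCC/Clang helper for the CPUID instruction */

/* Hypothetical names for the three code paths in the dispatch flow. */
typedef enum { ROUTINE_NORMAL, ROUTINE_MMX, ROUTINE_SSE } routine_t;

/* Pick a routine from the EDX feature flags of CPUID leaf 1.
   SSE is preferred over MMX, matching the flow chart. */
routine_t select_routine_from_flags(unsigned int edx)
{
    if (edx & (1u << 25))   /* bit 25: SSE */
        return ROUTINE_SSE;
    if (edx & (1u << 23))   /* bit 23: MMX */
        return ROUTINE_MMX;
    return ROUTINE_NORMAL;
}

/* Query the running CPU; fall back to the normal routine
   if CPUID leaf 1 is not available. */
routine_t select_routine(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return ROUTINE_NORMAL;
    return select_routine_from_flags(edx);
}
```

Separating the flag test from the CPUID query makes the selection logic testable on any machine, since arbitrary flag values can be passed in.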
15
Thank you!