Telecommunications and Signal Processing Seminar 24 - 1 Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at.

Telecommunications and Signal Processing Seminar 24 - 1 Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at Austin Department of Electrical and Computer Engineering * Laboratory of Computer Architecture Evaluating MMX Technology Using DSP and Multimedia Applications November 22, 1999

Telecommunications and Signal Processing Seminar 24 - 2 This talk is a condensed version of a presentation given at: The 31st International Symposium on Microarchitecture (MICRO-31) Dallas, Texas November 30, 1998 http://www.ece.utexas.edu/~ravib/mmxdsp/ Evaluating MMX Technology Using DSP and Multimedia Applications

Telecommunications and Signal Processing Seminar 24 - 3 p 57 New assembly instructions p 64-bit registers p Aliased to FP registers p EMMS Instruction p No compiler support

Telecommunications and Signal Processing Seminar 24 - 4 p 8, 16, 32, 64-bit fixed-point data p Packing, unpacking of data p Packed moves p 16-bit multiply-accumulate p Saturation arithmetic

Telecommunications and Signal Processing Seminar 24 - 5 p Independent evaluation of MMX p How much speedup is possible? p What tradeoffs are involved? p Time, complexity, performance, precision p Characterization of MMX workloads p Instruction mix, memory accesses, etc.

Telecommunications and Signal Processing Seminar 24 - 6 p Finite Impulse Response Filter p Speech, general filtering p Fast Fourier Transform p MPEG, spectral analysis p Matrix &Vector Multiplication p Image processing p Infinite Impulse Response Filter p Audio, LPC

Telecommunications and Signal Processing Seminar 24 - 7 p JPEG Image Compression p Bitmap Image to JPEG Image p 2D DCT p G.722 Speech Encoding p Compression, Encoding of Speech p ADPCM p Image Processing p Uniform Color Manipulation p Vector Arithmetic p Doppler Radar Processing p Vector Arithmetic, FFT

Telecommunications and Signal Processing Seminar 24 - 8 p Adjust non-MMX benchmark p DSP environment p Create MMX version p Setup like non-MMX p Use Intel Assembly Libraries p Microsoft Visual C++ 5.0 p Simulate with VTune 2.5.1

Telecommunications and Signal Processing Seminar 24 - 9 p Not just function swapping p Different input data types p Fixed-point versus floating-point p 16-bit versus 32-bit p Reordering of data p Ex: Arrangement of filter coefficients p Row-order versus column-order

Telecommunications and Signal Processing Seminar 24 - 10 p Intel performance profiling tool p Designed for “hot spots” p Simulate sections of code p Pentium with MMX p CPU penalties p Instruction mix p Library calls p Hardware performance counters

Telecommunications and Signal Processing Seminar 24 - 11 Ratio of non-MMX to MMX Programs

Telecommunications and Signal Processing Seminar 24 - 12 p JPEG and G722 show slowdowns p Superlinear speedup in MatVec p 16-bit data, 6.6X speedup p Free unrolling p MMX related overhead p FIR, Radar, JPEG, G722 p MMX multiplication p Fewer cycles p Requires unpacking

Telecommunications and Signal Processing Seminar 24 - 13 % MMX Instructions and MMX Instruction Mix. Speedup increasing from left to right

Telecommunications and Signal Processing Seminar 24 - 14 p Input set size p Small: FIR, Radar, G722, JPEG p Large: IIR, Image, MatVec, FFT p Affects MMX %, speedup p “Automatic” Packing p Less than 50% MMX arithmetic p FFT p Converts to FP p Old version: 40% MMX, less speedup

Telecommunications and Signal Processing Seminar 24 - 15 Ratio of Non-MMX Assembly to MMX

Telecommunications and Signal Processing Seminar 24 - 16 p Non-MMX version 1.98X faster p But... inserted MMX code 1.6X faster p Function call overhead p 8.8X more in MMX version p MMX Maintenance Instructions p Accounting for precision p Non-sequential data accesses

Telecommunications and Signal Processing Seminar 24 - 17 p Slowdown possible p JPEG and G722 p Parallel, contiguous data p Hard to find p Precision p Obtainable at a price p Library function call overhead p Hand-coded assembly, inlining

Telecommunications and Signal Processing Seminar 24 - 18 p Speedup available with libraries p Kernels: 1.25 to 6.6 p Applications: 1.21 to 5.5 p Versus optimized FP: 1.25 to 1.71 p General Characteristics of MMX p More static instructions used p Fewer dynamic instructions p Fewer memory references p Less than 50% of MMX is arithmetic

Telecommunications and Signal Processing Seminar 24 - 19 This concludes this portion of the talk. The following slides provide further information on: methodology, benchmarks, results, and additional work.

Telecommunications and Signal Processing Seminar 24 - 20 Unreal 1.0 p Doom-like game p Command-line MMX switch p Hardware Performance Counters p 48% MMX Instructions p Real-time. What is speedup? p 1.34X more frame/second p Same trends as benchmarks

Telecommunications and Signal Processing Seminar 24 - 21 p Focus on “Important” Code p Buffer Inputs and Outputs p No OS Effects Measured p Real-time Atmosphere

Telecommunications and Signal Processing Seminar 24 - 22 p Some functions use MMX p 8-bit and 16-bit data p Scale factors p Vector inputs p Library-specific structures p Signal Processing Library 4.0 p Recognition Primitives Library 3.1 p Image Processing Library 2.0

Telecommunications and Signal Processing Seminar 24 - 23 Precision p JPEG p Non-MMX SNR: 31.05 dB p MMX SNR: 31.04 dB p Image: No Change p G722 p Non-MMX SNR: 5.46 dB p MMX SNR: 5.18 dB p Doppler Radar p Less than 1%

Telecommunications and Signal Processing Seminar 24 - 24 p Profiled Program p 2D DCT p Quantization p Color Conversion p 74% of execution time p Small Block Size p 8x8 blocks of pixels

Telecommunications and Signal Processing Seminar 24 - 25 p 2D DCT p Library only has 1D DCT p Data in different order p Quantization p Not enough data parallelism p Color conversion p Create and fill vectors

Telecommunications and Signal Processing Seminar 24 - 26 FIR Filter p Finite Impulse Response Filter p Moving averages filter p Process one input at a time p Non-MMX: 32-bit FP p MMX: 16-bit fixed-point p Filter length is 35

Telecommunications and Signal Processing Seminar 24 - 27 FFT p Fast Fourier Transform p Computes discrete Fourier Transform p 4096-point p In-place p Whole FFT to MMX function p Non-MMX: 32-bit FP p MMX: 16-bit fixed-point

Telecommunications and Signal Processing Seminar 24 - 28 MatVec p Matrix & Vector Multiplication p 512x512 matrix times 512-entry vector p Dot product of two 512-entry vectors p Both versions: 16-bit data

Telecommunications and Signal Processing Seminar 24 - 29 IIR p Infinite Impulse Response Filter p Butterworth coefficients p Direct form, Bandpass p Filter length of 8, 17 coefficients p Requires high precision p Feedback p Our versions unstable

Telecommunications and Signal Processing Seminar 24 - 30 Doppler Radar Processing p Subtract complex echo signals p Removing stationary targets p Estimates power spectrum p Dominant frequency from peak of FFT p 16-point, in-place FFT

Telecommunications and Signal Processing Seminar 24 - 31 G.722 Speech Encoding p Input signal: 16-bit, 16 kHz p Output signal: 8-bit, 8 kHz p 6 kb speech file

Telecommunications and Signal Processing Seminar 24 - 1 Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at.

Similar presentations

Presentation on theme: "Telecommunications and Signal Processing Seminar 24 - 1 Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Telecommunications and Signal Processing Seminar 24 - 1 Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at.

Similar presentations

Presentation on theme: "Telecommunications and Signal Processing Seminar 24 - 1 Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at."— Presentation transcript:

Similar presentations

About project

Feedback