Telecommunications and Signal Processing Seminar 24 - 1 Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at.

Slides:



Advertisements
Similar presentations
Chapter 19 Fast Fourier Transform
Advertisements

DSPs Vs General Purpose Microprocessors
T.Sharon-A.Frank 1 Multimedia Compression Basics.
Intel Pentium 4 ENCM Jonathan Bienert Tyson Marchuk.
Fourier Transforms and Their Use in Data Compression
Computer Abstractions and Technology
EE313 Linear Systems and Signals Fall 2010 Initial conversion of content to PowerPoint by Dr. Wade C. Schwartzkopf Prof. Brian L. Evans Dept. of Electrical.
Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.
Intel’s MMX Dr. Richard Enbody CSE 820. Michigan State University Computer Science and Engineering Why MMX? Make the Common Case Fast Multimedia and Communication.
CEN352, Dr. Ghulam Muhammad King Saud University
A Matlab Playground for JPEG Andy Pekarske Nikolay Kolev.
School of Computing Science Simon Fraser University
Department of Computer Engineering University of California at Santa Cruz Data Compression (3) Hai Tao.
Fall 2006Lecture 16 Lecture 16: Accelerator Design in the XUP Board ECE 412: Microcomputer Laboratory.
Sampling, Reconstruction, and Elementary Digital Filters R.C. Maher ECEN4002/5002 DSP Laboratory Spring 2002.
EE2F1 Speech & Audio Technology Sept. 26, 2002 SLIDE 1 THE UNIVERSITY OF BIRMINGHAM ELECTRONIC, ELECTRICAL & COMPUTER ENGINEERING Digital Systems & Vision.
Source Code Optimization and Profiling of Energy Consumption in Embedded System Simunic, T.; Benini, L.; De Micheli, G.; Hans, M.; Proceedings on The 13th.
Losslessy Compression of Multimedia Data Hao Jiang Computer Science Department Sept. 25, 2007.
Optimization Of Power Consumption For An ARM7- BASED Multimedia Handheld Device Hoseok Chang; Wonchul Lee; Wonyong Sung Circuits and Systems, ISCAS.
T.Sharon-A.Frank 1 Multimedia Image Compression 2 T.Sharon-A.Frank Coding Techniques – Hybrid.
Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.
Digital Communication Techniques
LPC Speech Coder on the TI C6x DSP Mark Anderson, Jeff Burke EE213A / EE298-2 Prof. Ingrid Verbauwhede.
Using Programmable Logic to Accelerate DSP Functions 1 Using Programmable Logic to Accelerate DSP Functions “An Overview“ Greg Goslin Digital Signal Processing.
Image Compression - JPEG. Video Compression MPEG –Audio compression Lossy / perceptually lossless / lossless 3 layers Models based on speech generation.
1/1/ / faculty of Electrical Engineering eindhoven university of technology Input/Output devices Part 3: Programmable I/O and DSP's dr.ir. A.C. Verschueren.
DSP. What is DSP? DSP: Digital Signal Processing---Using a digital process (e.g., a program running on a microprocessor) to modify a digital representation.
GCT731 Fall 2014 Topics in Music Technology - Music Information Retrieval Overview of MIR Systems Audio and Music Representations (Part 1) 1.
JPEG C OMPRESSION A LGORITHM I N CUDA Group Members: Pranit Patel Manisha Tatikonda Jeff Wong Jarek Marczewski Date: April 14, 2009.
Streaming SIMD Extensions CSE 820 Dr. Richard Enbody.
Processor Architecture Needed to handle FFT algoarithm M. Smith.
Floating Point vs. Fixed Point for FPGA 1. Applications Digital Signal Processing -Encoders/Decoders -Compression -Encryption Control -Automotive/Aerospace.
EE302 Lesson 19: Digital Communications Techniques 3.
Multimedia Macros for Portable Optimized Programs Juan Carlos Rojas Miriam Leeser Northeastern University Boston, MA.
Implementing a Speech Recognition System on a GPU using CUDA
Multimedia Elements: Sound, Animation, and Video.
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 1 Computer Abstractions and Technology Sections 1.5 – 1.11.
Classifying GPR Machines TypeNumber of Operands Memory Operands Examples Register- Register 30 SPARC, MIPS, etc. Register- Memory 21 Intel 80x86, Motorola.
CDA 3101 Fall 2013 Introduction to Computer Organization Computer Performance 28 August 2013.
ACMSE’04, ALDepartment of Electrical and Computer Engineering - UAH Execution Characteristics of SPEC CPU2000 Benchmarks: Intel C++ vs. Microsoft VC++
 Embedded Digital Signal Processing (DSP) systems  Specification with floating-point data types  Implementation in fixed-point architectures  Precision.
Radix-2 2 Based Low Power Reconfigurable FFT Processor Presented by Cheng-Chien Wu, Master Student of CSIE,CCU 1 Author: Gin-Der Wu and Yi-Ming Liu Department.
Ch.5 Fixed-Point vs. Floating Point. 5.1 Q-format Number Representation on Fixed-Point DSPs 2’s Complement Number –B = b N-1 …b 1 b 0 –Decimal Value D.
Image Processing Architecture, © 2001, 2002, 2003 Oleh TretiakPage 1 ECE-C490 Image Processing Architecture MP-3 Compression Course Review Oleh Tretiak.
Copyright © 2003 Texas Instruments. All rights reserved. DSP C5000 Chapter 18 Image Compression and Hardware Extensions.
CISC Machine Learning for Solving Systems Problems John Cavazos Dept of Computer & Information Sciences University of Delaware
Introduction to MMX, XMM, SSE and SSE2 Technology
CS/EE 5810 CS/EE 6810 F00: 1 Multimedia. CS/EE 5810 CS/EE 6810 F00: 2 New Architecture Direction “… media processing will become the dominant force in.
November 22, 1999The University of Texas at Austin Native Signal Processing Ravi Bhargava Laboratory of Computer Architecture Electrical and Computer.
Automatic Evaluation of the Accuracy of Fixed-point Algorithms Daniel MENARD 1, Olivier SENTIEYS 1,2 1 LASTI, University of Rennes 1 Lannion, FRANCE 2.
Image processing Honza Černocký, ÚPGM. 1D signal 2.
SR: 599 report Channel Estimation for W-CDMA on DSPs Sridhar Rajagopal ECE Dept., Rice University Elec 599.
July 23, BSA, a Fast and Accurate Spike Train Encoding Scheme Benjamin Schrauwen.
Signals and Systems Lecture Filter Structure and Quantization Effects.
Architectural Effects on DSP Algorithms and Optimizations Sajal Dogra Ritesh Rathore.
Implementing JPEG Encoder for FPGA ECE 734 PROJECT Deepak Agarwal.
Image Processing Architecture, © Oleh TretiakPage 1Lecture 5 ECEC 453 Image Processing Architecture Lecture 5, 1/22/2004 Rate-Distortion Theory,
Qin Zhao1, Joon Edward Sim2, WengFai Wong1,2 1SingaporeMIT Alliance 2Department of Computer Science National University of Singapore
Finite Impulse Response Filters
EEE4176 Applications of Digital Signal Processing
Algorithms in the Real World
Embedded Systems Design
By: Mohammadreza Meidnai Urmia university, Urmia, Iran Fall 2014
Vector Processing => Multimedia
Digital Signal Processors
Lect5 A framework for digital filter design
VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder
Image Coding and Compression
Chapter 19 Fast Fourier Transform
Govt. Polytechnic Dhangar(Fatehabad)
Presentation transcript:

Telecommunications and Signal Processing Seminar Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at Austin Department of Electrical and Computer Engineering * Laboratory of Computer Architecture Evaluating MMX Technology Using DSP and Multimedia Applications November 22, 1999

Telecommunications and Signal Processing Seminar This talk is a condensed version of a presentation given at: The 31st International Symposium on Microarchitecture (MICRO-31) Dallas, Texas November 30, Evaluating MMX Technology Using DSP and Multimedia Applications

Telecommunications and Signal Processing Seminar p 57 New assembly instructions p 64-bit registers p Aliased to FP registers p EMMS Instruction p No compiler support

Telecommunications and Signal Processing Seminar p 8, 16, 32, 64-bit fixed-point data p Packing, unpacking of data p Packed moves p 16-bit multiply-accumulate p Saturation arithmetic

Telecommunications and Signal Processing Seminar p Independent evaluation of MMX p How much speedup is possible? p What tradeoffs are involved? p Time, complexity, performance, precision p Characterization of MMX workloads p Instruction mix, memory accesses, etc.

Telecommunications and Signal Processing Seminar p Finite Impulse Response Filter p Speech, general filtering p Fast Fourier Transform p MPEG, spectral analysis p Matrix &Vector Multiplication p Image processing p Infinite Impulse Response Filter p Audio, LPC

Telecommunications and Signal Processing Seminar p JPEG Image Compression p Bitmap Image to JPEG Image p 2D DCT p G.722 Speech Encoding p Compression, Encoding of Speech p ADPCM p Image Processing p Uniform Color Manipulation p Vector Arithmetic p Doppler Radar Processing p Vector Arithmetic, FFT

Telecommunications and Signal Processing Seminar p Adjust non-MMX benchmark p DSP environment p Create MMX version p Setup like non-MMX p Use Intel Assembly Libraries p Microsoft Visual C p Simulate with VTune 2.5.1

Telecommunications and Signal Processing Seminar p Not just function swapping p Different input data types p Fixed-point versus floating-point p 16-bit versus 32-bit p Reordering of data p Ex: Arrangement of filter coefficients p Row-order versus column-order

Telecommunications and Signal Processing Seminar p Intel performance profiling tool p Designed for “hot spots” p Simulate sections of code p Pentium with MMX p CPU penalties p Instruction mix p Library calls p Hardware performance counters

Telecommunications and Signal Processing Seminar Ratio of non-MMX to MMX Programs

Telecommunications and Signal Processing Seminar p JPEG and G722 show slowdowns p Superlinear speedup in MatVec p 16-bit data, 6.6X speedup p Free unrolling p MMX related overhead p FIR, Radar, JPEG, G722 p MMX multiplication p Fewer cycles p Requires unpacking

Telecommunications and Signal Processing Seminar % MMX Instructions and MMX Instruction Mix. Speedup increasing from left to right

Telecommunications and Signal Processing Seminar p Input set size p Small: FIR, Radar, G722, JPEG p Large: IIR, Image, MatVec, FFT p Affects MMX %, speedup p “Automatic” Packing p Less than 50% MMX arithmetic p FFT p Converts to FP p Old version: 40% MMX, less speedup

Telecommunications and Signal Processing Seminar Ratio of Non-MMX Assembly to MMX

Telecommunications and Signal Processing Seminar p Non-MMX version 1.98X faster p But... inserted MMX code 1.6X faster p Function call overhead p 8.8X more in MMX version p MMX Maintenance Instructions p Accounting for precision p Non-sequential data accesses

Telecommunications and Signal Processing Seminar p Slowdown possible p JPEG and G722 p Parallel, contiguous data p Hard to find p Precision p Obtainable at a price p Library function call overhead p Hand-coded assembly, inlining

Telecommunications and Signal Processing Seminar p Speedup available with libraries p Kernels: 1.25 to 6.6 p Applications: 1.21 to 5.5 p Versus optimized FP: 1.25 to 1.71 p General Characteristics of MMX p More static instructions used p Fewer dynamic instructions p Fewer memory references p Less than 50% of MMX is arithmetic

Telecommunications and Signal Processing Seminar This concludes this portion of the talk. The following slides provide further information on: methodology, benchmarks, results, and additional work.

Telecommunications and Signal Processing Seminar Unreal 1.0 p Doom-like game p Command-line MMX switch p Hardware Performance Counters p 48% MMX Instructions p Real-time. What is speedup? p 1.34X more frame/second p Same trends as benchmarks

Telecommunications and Signal Processing Seminar p Focus on “Important” Code p Buffer Inputs and Outputs p No OS Effects Measured p Real-time Atmosphere

Telecommunications and Signal Processing Seminar p Some functions use MMX p 8-bit and 16-bit data p Scale factors p Vector inputs p Library-specific structures p Signal Processing Library 4.0 p Recognition Primitives Library 3.1 p Image Processing Library 2.0

Telecommunications and Signal Processing Seminar Precision p JPEG p Non-MMX SNR: dB p MMX SNR: dB p Image: No Change p G722 p Non-MMX SNR: 5.46 dB p MMX SNR: 5.18 dB p Doppler Radar p Less than 1%

Telecommunications and Signal Processing Seminar p Profiled Program p 2D DCT p Quantization p Color Conversion p 74% of execution time p Small Block Size p 8x8 blocks of pixels

Telecommunications and Signal Processing Seminar p 2D DCT p Library only has 1D DCT p Data in different order p Quantization p Not enough data parallelism p Color conversion p Create and fill vectors

Telecommunications and Signal Processing Seminar FIR Filter p Finite Impulse Response Filter p Moving averages filter p Process one input at a time p Non-MMX: 32-bit FP p MMX: 16-bit fixed-point p Filter length is 35

Telecommunications and Signal Processing Seminar FFT p Fast Fourier Transform p Computes discrete Fourier Transform p 4096-point p In-place p Whole FFT to MMX function p Non-MMX: 32-bit FP p MMX: 16-bit fixed-point

Telecommunications and Signal Processing Seminar MatVec p Matrix & Vector Multiplication p 512x512 matrix times 512-entry vector p Dot product of two 512-entry vectors p Both versions: 16-bit data

Telecommunications and Signal Processing Seminar IIR p Infinite Impulse Response Filter p Butterworth coefficients p Direct form, Bandpass p Filter length of 8, 17 coefficients p Requires high precision p Feedback p Our versions unstable

Telecommunications and Signal Processing Seminar Doppler Radar Processing p Subtract complex echo signals p Removing stationary targets p Estimates power spectrum p Dominant frequency from peak of FFT p 16-point, in-place FFT

Telecommunications and Signal Processing Seminar G.722 Speech Encoding p Input signal: 16-bit, 16 kHz p Output signal: 8-bit, 8 kHz p 6 kb speech file