November 22, 1999The University of Texas at Austin23 - 1 Native Signal Processing Ravi Bhargava Laboratory of Computer Architecture Electrical and Computer.

Slides:



Advertisements
Similar presentations
Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
Advertisements

Systems and Technology Group © 2006 IBM Corporation Cell Programming Tutorial - JHD24 May 2006 Cell Programming Tutorial Jeff Derby, Senior Technical Staff.
Intel Pentium 4 ENCM Jonathan Bienert Tyson Marchuk.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
The University of Adelaide, School of Computer Science
Fixed Point Numbers The binary integer arithmetic you are used to is known by the more general term of Fixed Point arithmetic. Fixed Point means that we.
ECE291 Computer Engineering II Lecture 24 Josh Potts University of Illinois at Urbana- Champaign.
Computer Organization CS224 Fall 2012 Lesson 19. Floating-Point Example  What number is represented by the single-precision float …00 
Real time DSP Professors: Eng. Julian S. Bruno Eng. Jerónimo F. Atencio Sr. Lucio Martinez Garbino.
Overview of Popular DSP Architectures: TI, ADI, Motorola R.C. Maher ECEN4002/5002 DSP Laboratory Spring 2003.
Intel’s MMX Dr. Richard Enbody CSE 820. Michigan State University Computer Science and Engineering Why MMX? Make the Common Case Fast Multimedia and Communication.
Computer Organization and Architecture
 Understanding the Sources of Inefficiency in General-Purpose Chips.
What is an instruction set?
1 RISC Machines l RISC system »instruction –standard, fixed instruction format –single-cycle execution of most instructions –memory access is available.
CS854 Pentium III group1 Instruction Set General Purpose Instruction X87 FPU Instruction SIMD Instruction MMX Instruction SSE Instruction System Instruction.
Intel Pentium 4 Processor Presented by Presented by Steve Kelley Steve Kelley Zhijian Lu Zhijian Lu.
Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.
CMPE 511 Computer Architecture Caner AKSOY CmpE Boğaziçi University December 2006 Intel ® Core 2 Duo Desktop Processor Architecture.
NATIONAL POLYTECHNIC INSTITUTE COMPUTING RESEARCH CENTER IPN-CICMICROSE Lab Design and implementation of a Multimedia Extension for a RISC Processor Eduardo.
Streaming SIMD Extensions CSE 820 Dr. Richard Enbody.
Enhancing GPU for Scientific Computing Some thoughts.
Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
Lecture#14. Last Lecture Summary Memory Address, size What memory stores OS, Application programs, Data, Instructions Types of Memory Non Volatile and.
XP Practical PC, 3e Chapter 16 1 Looking “Under the Hood”
Machine Instruction Characteristics
Telecommunications and Signal Processing Seminar Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * The University of Texas at.
Dr Mohamed Menacer College of Computer Science and Engineering Taibah University CS-334: Computer.
The Arrival of the 64bit CPUs - Itanium1 นายชนินท์วงษ์ใหญ่รหัส นายสุนัยสุขเอนกรหัส
Multimedia Macros for Portable Optimized Programs Juan Carlos Rojas Miriam Leeser Northeastern University Boston, MA.
Classifying GPR Machines TypeNumber of Operands Memory Operands Examples Register- Register 30 SPARC, MIPS, etc. Register- Memory 21 Intel 80x86, Motorola.
COMPUTER ARCHITECTURE (P175B125)
Computer Architecture and Organization
Computer Architecture EKT 422
December 2, 2015Single-Instruction Multiple Data (SIMD)1 Performance Optimization, cont. How do we fix performance problems?
Crosscutting Issues: The Rôle of Compilers Architects must be aware of current compiler technology Compiler Architecture.
Introduction to MMX, XMM, SSE and SSE2 Technology
CS/EE 5810 CS/EE 6810 F00: 1 Multimedia. CS/EE 5810 CS/EE 6810 F00: 2 New Architecture Direction “… media processing will become the dominant force in.
Digital Signal Processors (DSPs). DSP Advanced signal processor circuits MAC (Multiply and Accumulate) unit (s) - provides fast multiplication of two.
Introdution to SSE or How to put your algorithms on steroids! Christian Kerl
The Alpha Thomas Daniels Other Dude Matt Ziegler.
Intel Multimedia Extensions and Hyper-Threading Michele Co CS451.
Design of A Custom Vector Operation API Exploiting SIMD Intrinsics within Java Presented by John-Marc Desmarais Authors: Jonathan Parri, John-Marc Desmarais,
Group # 3 Jorge Chavez Henry Diaz Janty Ghazi German Montenegro.
® GDC’99 Streaming SIMD Extensions Overview Haim Barad Project Leader/Staff Engineer Media Team Haifa, Israel Intel Corporation March.
Copyright © Curt Hill SIMD Single Instruction Multiple Data.
SSE and SSE2 Jeremy Johnson Timothy A. Chagnon All images from Intel® 64 and IA-32 Architectures Software Developer's Manuals.
Computer Architecture Lecture 24 Parallel Processing Ralph Grishman November 2015 NYU.
Instruction Sets. Instruction set It is a list of all instructions that a processor can execute. It is a list of all instructions that a processor can.
Xinsong1 Multimedia Extension Technology survey Xinsong Yang Electrical and Computer Engineering 734 Final Project 5/10/2002.
Introduction to Intel IA-32 and IA-64 Instruction Set Architectures.
C.E. Goutis V.I.Kelefouras University of Patras Department of Electrical and Computer Engineering VLSI lab Date: 20/11/2015 Compilers for Embedded Systems.
SIMD Programming CS 240A, Winter Flynn* Taxonomy, 1966 In 2013, SIMD and MIMD most common parallelism in architectures – usually both in same.
1 ECE 734 Final Project Presentation Fall 2000 By Manoj Geo Varghese MMX Technology: An Optimization Outlook.
Single Instruction Multiple Data
Vector Processing => Multimedia
Advanced Computer Architecture 5MD00 / 5Z032 Instruction Set Design
MMX Multi Media eXtensions
Special Instructions for Graphics and Multi-Media
Multivector and SIMD Computers
Chapter 9 Instruction Sets: Characteristics and Functions
Morgan Kaufmann Publishers Arithmetic for Computers
Chapter 10 Instruction Sets: Characteristics and Functions
Presentation transcript:

November 22, 1999The University of Texas at Austin Native Signal Processing Ravi Bhargava Laboratory of Computer Architecture Electrical and Computer Engineering Department The University of Texas at Austin November 22, 1999

The University of Texas at Austin Motivation Need new ways to support emerging applications –Multimedia –Signal processing –3D graphics Many applications have similar properties –Large streams of data –Data parallelism –Small data types –Tight loops –High percentage of computations –Common computations (e.g. MAC)

November 22, 1999The University of Texas at Austin Target Applications IP telephony gateways Multi-channel modems Speech processing systems Echo cancelers Image and video processing systems Scientific array processing systems Internet routers Virtual private network servers

November 22, 1999The University of Texas at Austin Basic Idea Extend capabilities of general-purpose processor –Additions to the instruction set architecture –Additions to the CPU state Exploit Data Parallelism –Single Instruction Multiple Data (SIMD) –Multiple operations using one functional unit

November 22, 1999The University of Texas at Austin SIMD Use one register to store many pieces of data

November 22, 1999The University of Texas at Austin Packed Arithmetic Unsigned Packed Add with Saturation

November 22, 1999The University of Texas at Austin Packed Arithmetic Multiply Accumulate

November 22, 1999The University of Texas at Austin MMX Example 16-bit Vector Dot-Product

November 22, 1999The University of Texas at Austin Dot-Product (cont.)

November 22, 1999The University of Texas at Austin Alternative Dot-Product Useful in certain cases

November 22, 1999The University of Texas at Austin Alternative Dot-Product (cont.) Interleaving the Data

November 22, 1999The University of Texas at Austin Support for NSP Libraries –Most common type of support –Tradeoff speed for robustness Compiler –Can’t do too much, but getting better –Can use macros, annotations, and “intrinsics” Operating System –Needs to know about any new registers and states –Needs to know about any new exceptions

November 22, 1999The University of Texas at Austin Common Concerns Organizing SIMD data –Packing and unpacking SIMD registers –Shuffling data within the register –“swizzling”, “scatter and gather” –Permute units Aligning data –Special load and store instructions –New data types in compilers –Padding data types

November 22, 1999The University of Texas at Austin New Instructions Packed Comparisons Packed maximum, minimum Packed Average Packed Sum of absolute differences Multiply-accumulate Conversion: signed, data length, data type Approximations: reciprocal, reciprocal square root Prefetch Store fence

November 22, 1999The University of Texas at Austin Evolution of NSP Visual Instruction Set (VIS) –First NSP extension found on Sun’s UltraSPARC MMX –First x86 NSP extensions, created for Intel’s Pentium 3DNow! –Addition of FP SIMD instructions on AMD’s K6-2 Altivec –Adds new CPU State for Motorola’s G4 PowerPC Streaming SIMD Extensions –New x86 FP SIMD for Intel’s Pentium III

November 22, 1999The University of Texas at Austin VIS Uses 64-bit floating point registers Data formats “tailored for graphics” No saturation arithmetic Only 8-bit x 16-bit multiplication Implicit assumptions about rounding & number of significant bits 3% increase in die area C Macros and “MediaLib” for development

November 22, 1999The University of Texas at Austin VIS Tailored for Graphics Assumes pixels will be most common data

November 22, 1999The University of Texas at Austin UltraSPARC with VIS

November 22, 1999The University of Texas at Austin DNow! on K6-II 21 new SIMD instructions –Including floating point SIMD instructions –PREFETCH instruction –Approximation and intra-element instructions Two 32-bit FP instructions in one register –Two 3DNow! instructions possible per cycle –Latency of two cycles, throughput of one cycle Aliased to the FP registers (like MMX)

November 22, 1999The University of Texas at Austin DNow! on the K6-II

November 22, 1999The University of Texas at Austin Streaming SIMD Extension 70 new instructions 8 new 128-bit SSE registers (OS support) Execute two SSE instructions simultaneously –Double-cycle through existing 64-bit hardware –2 micro-ops per SSE instruction –Latency often two cycles or greater Explicit scalar SSE instructions Prefetch to various levels of cache Non-allocating stores 10% increase in die area

November 22, 1999The University of Texas at Austin Pentium III

November 22, 1999The University of Texas at Austin DNow! on the Athlon Zero switching overhead back to FP! Most instructions have one cycle latency 24 more instructions –12 more integer SIMD instructions –7 more data manipulation instructions –5 “unique” DSP instructions Packed FP to Integer word conversion with sign extend Packed FP negative accumulate Packed FP mixed negative-positive accumulate Packed integer word to FP conversion Packed swap double-word

November 22, 1999The University of Texas at Austin Athlon

November 22, 1999The University of Texas at Austin Altivec 162 new instructions –Intra-element: rotate, shift, max, min, sum across –Approximations: log2, 2 raised to an exponent –No explicit SAD instruction Separate 32-entry, 128-bit vector register file Floating-point and fixed-point data types Instructions have three source operands

November 22, 1999The University of Texas at Austin Altivec (cont.) Permute Unit –Shuffle data from any of the 32 byte locations in two operands to any of 16 byte locations in the destination –Permute instructions - one cycle Vector Unit –Three sub-units –Simple fixed-point - one cycle –Complex fixed-point - three cycles –Floating-point - four cycles

November 22, 1999The University of Texas at Austin Applications using 3DNow! 3D games and engines –Descent 4, Diablo 2, Quake III, Unreal 2 3D video card drivers –Matrox, Diamond, ATI, nVidia, 3DLabs, 3dfx Graphics APIs –OpenGL, DirectX, Glide Other apps –Naturally Speaking, LiveArt98, WinAMP, PicVideo, PC-Doctor, QuickTech, XingMP3

November 22, 1999The University of Texas at Austin Additional Information MMX talk and additional references – NSP talk outline and links –