Vector Processing => Multimedia



Vector Processors
- Initially developed for supercomputing applications; important today for multimedia.
- Vector processors have high-level operations that work on linear arrays of numbers ("vectors"):

  SCALAR (1 operation):   add r3, r1, r2        r3 = r1 + r2
  VECTOR (N operations):  vadd.vv v3, v1, v2    v3[i] = v1[i] + v2[i], for i = 0 .. VL-1

Properties of Vector Processors
- A single vector instruction implies lots of work (an entire loop)
  - Fewer instruction fetches
- Each result is independent of previous results
  - Multiple operations can be executed in parallel
  - Simpler design, high clock rate
  - The compiler (or programmer) ensures there are no dependencies
- Fewer branches, so fewer branch problems in pipelines
- Vector instructions access memory with a known pattern
  - Effective prefetching
  - Memory latency is amortized over a large number of elements
  - Can exploit a high-bandwidth memory system
  - No (data) caches required!

Example: MPEG Decoding (input-stream load breakdown)
- Parsing: 10%
- Dequantization: 20%
- IDCT: 25%
- Block reconstruction: 30%
- RGB->YUV: 15%
- Output to screen

Characteristics of Multimedia Apps (1)
- Requirement for real-time response
  - An "incorrect" result is often preferable to a slow one
  - Unpredictability can be bad (e.g., dynamic execution)
- Narrow data types
  - Typical width of data in memory: 8 to 16 bits
  - Typical width of data during computation: 16 to 32 bits
  - 64-bit data types are rarely needed
  - Fixed-point arithmetic often replaces floating point
- Fine-grain (data) parallelism
  - Identical operations applied to streams of input data
  - Branches have high predictability
  - High instruction locality in small loops or kernels

Characteristics of Multimedia Apps (2)
- Coarse-grain parallelism
  - Most apps are organized as a pipeline of functions
  - Multiple threads of execution can be used
- Memory requirements
  - High bandwidth requirements, but can tolerate high latency
  - High spatial locality (predictable access pattern) but low temporal locality
  - Cache bypassing and prefetching can be crucial

Examples of Media Functions
- Matrix transpose/multiply (3D graphics)
- Discrete cosine transform / FFT (video, audio, communications)
- Motion estimation (video)
- Gamma correction (video)
- Haar transform (media mining)
- Median filter (image processing)
- Separable convolution (image processing)
- Viterbi decode (communications, speech)
- Bit packing (communications, cryptography)
- Galois-field arithmetic (communications, cryptography)

Overview of SIMD Extensions

  Vendor    Extension    Year   # Instr        Registers
  HP        MAX-1 and 2  94,95  9,8 (int)      Int 32x64b
  Sun       VIS          95     121 (int)      FP 32x64b
  Intel     MMX          97     57 (int)       FP 8x64b
  AMD       3DNow!       98     21 (fp)        FP 8x64b
  Motorola  Altivec      98     162 (int,fp)   32x128b (new)
  Intel     SSE          99     70 (fp)        8x128b (new)
  MIPS      MIPS-3D      ?      23 (fp)        FP 32x64b
  Intel     AVX          12     144 (int,fp)   16x256b (new)

Programming with SIMD Extensions
- Optimized shared libraries
  - Written in assembly, distributed by the vendor
  - Need a well-defined API for data formats and use
- Intrinsics
  - C/C++ wrappers for short-vector variables and function calls
  - Allow instruction-scheduling and register-allocation optimizations for specific processors
  - Lack portability; non-standard
- Compilers for SIMD extensions
  - Problems: identifying parallelism in loop iterations, identifying alignment of arrays
- Assembly coding

Context Switch Overhead?
- The vector register file holds a huge amount of architectural state
  - Too expensive to save and restore it all on each context switch
- Cray: exchange packet
- Extra dirty bit per processor
  - If the vector registers were not written, they need not be saved on a context switch
- Extra valid bit per vector register, cleared on process start
  - No need to restore on a context switch until the register is needed
- Extra tip: save/restore vector state only if the new context needs to issue vector instructions

Why Vectors for Multimedia?
- Natural match to the parallelism in multimedia
  - Vector operations with VL equal to the image or frame width
  - Easy to support vectors of narrow data types efficiently
- High performance at low cost
  - Multiple ops/cycle while issuing 1 instruction/cycle
  - Multiple ops/cycle at low power consumption
  - Structured access patterns for registers and memory
- Scalable
  - Higher performance by adding lanes, without architecture modifications
- Compact code size
  - Describes N operations with 1 short instruction (vs. VLIW)
- Predictable performance
  - No need for caches or dynamic execution
- Mature, well-developed compiler technology

Vector Summary
- An alternative model for explicitly expressing data parallelism
- If the code is vectorizable, the hardware is simpler, more power-efficient, and has a better real-time model than out-of-order machines with SIMD support
- Design issues: number of lanes, number of functional units, number and length of vector registers, exception handling, conditional operations
- Vector architectures give big speedups on certain classes of applications with relatively little additional hardware complexity or power consumption
- The remaining problem is programming: the code must be vectorized