A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Gang Ren Peng WuDavid Padua University of Illinois IBM T.J.

Slides:

Advertisements

Similar presentations

Philips Research ICS 252 class, February 3, The Trimedia CPU64 VLIW Media Processor Kees Vissers Philips Research Visiting Industrial Fellow

Advertisements

Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,

Copyright© 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners. Intel ® Software Development.

Streaming SIMD Extension (SSE)

© 2009 IBM Corporation July, 2009 | PADTAD Chicago, Illinois A Proposal of Operation History Management System for Source-to-Source Optimization.

Intel Pentium 4 ENCM Jonathan Bienert Tyson Marchuk.

C++  PPL  AMP When NO branches between a micro-op and retiring to the visible architectural state – its no longer speculative.

Computer Architecture Lecture 7 Compiler Considerations and Optimizations.

CS252 Graduate Computer Architecture Spring 2014 Lecture 9: VLIW Architectures Krste Asanovic

Advanced microprocessor optimization Kampala August, 2007 Agner Fog

Increasing and Detecting Memory Address Congruence Sam Larsen Emmett Witchel Saman Amarasinghe Laboratory for Computer Science Massachusetts Institute.

Fall 2011SYSC 5704: Elements of Computer Systems 1 SYSC 5704 Elements of Computer Systems Optimization to take advantage of hardware.

Convey Computer Status Steve Wallach swallach”at”conveycomputer.com.

Compilation Techniques for Multimedia Processors Andreas Krall and Sylvain Lelait Technische Universitat Wien.

Memory Systems Performance Workshop 2004© David Ryan Koes MSP 2004 Programmer Specified Pointer Independence David Koes Mihai Budiu Girish Venkataramani.

University of Michigan Electrical Engineering and Computer Science MacroSS: Macro-SIMDization of Streaming Applications Amir Hormati*, Yoonseo Choi ‡,

Compilation, Architectural Support, and Evaluation of SIMD Graphics Pipeline Programs on a General-Purpose CPU Mauricio Breternitz Jr, Herbert Hum, Sanjeev.

Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.

Bitwidth Analysis with Application to Silicon Compilation Mark Stephenson Jonathan Babb Saman Amarasinghe MIT Laboratory for Computer Science.

NATIONAL POLYTECHNIC INSTITUTE COMPUTING RESEARCH CENTER IPN-CICMICROSE Lab Design and implementation of a Multimedia Extension for a RISC Processor Eduardo.

Assembly Language for Intel-Based Computers, 4 th Edition Chapter 2: IA-32 Processor Architecture (c) Pearson Education, All rights reserved. You.

Software Data Prefetching Mohammad Al-Shurman & Amit Seth Instructor: Dr. Aleksandar Milenkovic Advanced Computer Architecture CPE631.

09/27/2011CS4961 CS4961 Parallel Programming Lecture 10: Introduction to SIMD Mary Hall September 27, 2011.

Linked Lists in MIPS Let’s see how singly linked lists are implemented in MIPS on MP2, we have a special type of doubly linked list Each node consists.

Multimedia Macros for Portable Optimized Programs Juan Carlos Rojas Miriam Leeser Northeastern University Boston, MA.

1 Advance Computer Architecture CSE 8383 Ranya Alawadhi.

Fall 2012 Chapter 2: x86 Processor Architecture. Irvine, Kip R. Assembly Language for x86 Processors 6/e, Chapter Overview General Concepts IA-32.

Performance Enhancement of Video Compression Algorithms using SIMD Valia, Shamik Jamkar, Saket.

© 2007 SET Associates Corporation SAR Processing Performance on Cell Processor and Xeon Mark Backues, SET Corporation Uttam Majumder, AFRL/RYAS.

® 1 March 1999 Optimizing 3D Pipeline using Streaming SIMD Extensions in C++ Ronen Zohar

Assembly תרגול 5 תכנות באסמבלי. Assembly vs. Higher level languages There are NO variables’ type definitions.  All kinds of data are stored in the same.

A few issues on the design of future multicores André Seznec IRISA/INRIA.

ULTRASPARC 2005 INTRODUCTION AND ISA BY JAMES MURITHI.

December 2, 2015Single-Instruction Multiple Data (SIMD)1 Performance Optimization, cont. How do we fix performance problems?

1. 2 Pipelining vs. Parallel processing  In both cases, multiple “things” processed by multiple “functional units” Pipelining: each thing is broken into.

Crosscutting Issues: The Rôle of Compilers Architects must be aware of current compiler technology Compiler Architecture.

Introduction to MMX, XMM, SSE and SSE2 Technology

CS/EE 5810 CS/EE 6810 F00: 1 Multimedia. CS/EE 5810 CS/EE 6810 F00: 2 New Architecture Direction “… media processing will become the dominant force in.

Introdution to SSE or How to put your algorithms on steroids! Christian Kerl

Implementation of MPEG2 Codec with MMX/SSE/SSE2 Technology Speaker: Rong Jiang, Xu Jin Instructor: Yu-Hen Hu.

A Language for the Compact Representation of Multiple Program Versions Sébastien Donadio 1,2, James Brodman 3, Thomas Roeder 4, Kamen Yotov 4, Denis Barthou.

11/13/2012CS4230 CS4230 Parallel Programming Lecture 19: SIMD and Multimedia Extensions Mary Hall November 13, 2012.

Single Node Optimization Computational Astrophysics.

Design of A Custom Vector Operation API Exploiting SIMD Intrinsics within Java Presented by John-Marc Desmarais Authors: Jonathan Parri, John-Marc Desmarais,

A Retargetable Preprocessor for Multimedia Instructions* (work in progress) INRIA F. Bodin, G. Pokam, J. Simonnet *partially supported by ST Microelectronics.

SSE and SSE2 Jeremy Johnson Timothy A. Chagnon All images from Intel® 64 and IA-32 Architectures Software Developer's Manuals.

EECS 583 – Class 22 Research Topic 4: Automatic SIMDization - Superword Level Parallelism University of Michigan December 10, 2012.

Introduction to Intel IA-32 and IA-64 Instruction Set Architectures.

Architectural Effects on DSP Algorithms and Optimizations Sajal Dogra Ritesh Rathore.

A Survey of the Current State of the Art in SIMD: Or, How much wood could a woodchuck chuck if a woodchuck could chuck n pieces of wood in parallel? Wojtek.

SIMD Programming CS 240A, Winter Flynn* Taxonomy, 1966 In 2013, SIMD and MIMD most common parallelism in architectures – usually both in same.

Chapter Overview General Concepts IA-32 Processor Architecture

Topics to be covered Instruction Execution Characteristics

SIMD Multimedia Extensions

Roadmap C: Java: Assembly language: OS: Machine code: Computer system:

ISA's, Compilers, and Assembly

Exploiting Parallelism

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Topic 13 SIMD Multimedia Extensions Prof. Zhang Gang School.

FPGAs in AWS and First Use Cases, Kees Vissers

Basics Of X86 Architecture

Compilers for Embedded Systems

Vector Processing => Multimedia

SIMD Programming CS 240A, 2017.

Software Cache Coherent Control by Parallelizing Compiler

Compiler Back End Panel

STUDY AND IMPLEMENTATION

Compiler Back End Panel

Samuel Larsen and Saman Amarasinghe, MIT CSAIL

Samuel Larsen Saman Amarasinghe Laboratory for Computer Science

Other Processors Having learnt MIPS, we can learn other major processors. Not going to be able to cover everything; will pick on the interesting aspects.

Presentation transcript:

A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Gang Ren Peng WuDavid Padua University of Illinois IBM T.J. Watson Research University of Illinois Presented by Gang Ren

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Multimedia Extensions (MME)  Additions to accelerate multimedia applications For general-purpose processors:  MAX(HP), VIS(Sun), AltiVec(Motorola/IBM/Apple), SSE(Intel) For special-purpose processors:  PS2(SONY), Graphics Processing Unit(NVIDIA)  Use SIMD architecture Intel SSE2 128bits Vector Unit Register File (xmm0~xmm7) 16 chars 8 shorts 4 integers 2 doubles 4 singles

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Programming Multimedia Extensions foo.c foo.s foo.intrinsic.c Native Compiler MME Vectorizer Assembler & Linker a.out Multimedia Extensions int a[16],b[16],c[16]; for(i=0; i<16; i++) c[i] = a[i] + b[i]; vector int a[4],b[4],c[4]; for(i=0; i<4; i++) c[i] = vec_add(a[i], b[i]); movaps xmm0, XMMWORD PTR [eax] addps xmm0, XMMWORD PTR [edx] movaps XMMWORD PTR [ecx], xmm0

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Motivation Scientific Applications Traditional Vectorization Multimedia Applications MME Vectorization GAP Diff Vector Processors Multimedia Extensions

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Gaps From Architecture Diff Scientific Applications Traditional Vectorization Multimedia Applications MME Vectorization Vector Processors Multimedia Extensions Diff GAP

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions  Differences in memory unit MME: No scatter/gather memory operations MME: Only support aligned memory access  Differences in ISA MME: Special instructions for media processing  Example: Saturated Operations MME: Non-uniform support for different element types  SSE2: Max/min operations on 16-bit short integers MME vs. Vector Processor for(i=0; i<8; i++) for(j=0; j<8; j++) s += a[i*8+j] * b[j*8+i];

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Gaps From Applications Diff Scientific Applications Traditional Vectorization Multimedia Applications MME Vectorization GAP Diff Vector Processors Multimedia Extensions

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Berkeley Multimedia Workload  Evolves from MediaBench  12 applications written in C/C++ Audio compression: ADPCM, GSM, LAME, mpg123 Image/video compression: DVJU, JPEG, MPEG2 Graphics: POVray, Mesa, Doom Others: Rsynth, Timidity  Where are the example codes from? Important loops in core procedures ( >10% total ex. time)

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Where Are Gaps From?  Different programming styles Pointer access Manually unrolled loops  Mismatches between application and language Integer promotion Saturated operation  Different code patterns Bit-wise operations Lookup tables

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions C Language Issues: Integer Promotion  Integer promotion Forced by ANSI C semantics (ISO/IEC 9899:1999)  All char or short types are automatically promoted to integer type before conducting any arithmetic operations. Fit traditional scalar architecture well  MME supports sub-word level parallelism Integer promotion will waste computation bandwidth  How to eliminate unnecessary integer promotion? Some analyses needed to ensure the same result for(i=0; i<1024; i++) for(j=0; j<1024; j++) dst[i,j]=src1[i,j]+src2[i,j];

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions C Language Issues: Saturated Operations /* From BMW/GSM */ ltmp=a+b; if(( (unsigned)ltmp – MIN_WORD ) > ( MAX_WORD - MIN_WORD )) if(ltmp>0) ltmp = MAX_WORD; else ltmp = MIN_WORD; for(i=0; i 255) dst[i,j] = 255; if(dst[i,j] < 0 ) dst[i,j] = 0; }

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Code Pattern: Lookup Tables  To implement saturated operations  To replace expensive math function calls /* From BMW/Lame */ if (init==0) for (i=0;i<LUTABSIZE;i++) lutab[i]=pow(...);... for (i=0;i<l_end;i++) { temp=...; if (temp<1000.0) { ix[i]=lutab[(temp*10)]; } }

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Some Related Work  Compilation based on traditional vectorization Cheong and Lam’s optimizer for VIS (Sun) Krall and Lelait’s traditional vectorizer for VIS Sreraman and Govindarajan’s vectorizer for MMX(Intel) Aart’s intra-register vectorization for the Intel architecture  Other compilation techniques Krall and Lelait’s “Vectorization by loop unrolling” Larsen and Amarasinghe’s “Superword level parallelism” Fisher and Dietz’s “SIMD-within-a-register”  Product compilers VAST/AltiVec, CodePlay/VectorC, Intel compiler,…

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Conclusions  Gaps exist between traditional vectorization and compilation for multimedia extensions From differences between two architectures From different programming styles, mismatch with language semantics, different code patterns  Additional compiler techniques need to be developed or extended to bridge these gaps

LCPC ‘03 A Preliminary Study On the Vectorization of Multimedia Applications for Multimedia Extensions Future Work  Our first step to unleash the power of MMEs Manual vectorization to see how far we can go Implement our vectorizer on SUIF  Propose new techniques to bridge the gaps  Extend application domain Traditional applications: SPECfp, SPECint Applications for embedded systems

Thank You