ACCELERATING MULTIMEDIA APPLICATIONS USING THE INTEL SSE AND AVX ISA MIN LI 05/08/2013.

INTEL SSE AND AVX ISA
• Intel ISA
  • SSE1, SSE2, SSE3, SSE4 (SSE4.1, SSE4.2)
    • SSE4.2 is specialized for string and text applications (suitable for applications like template matching and genome sequence comparison)
  • AVX (mainly for floating-point operations)
    • AVX1: 256 bits
    • AVX2: 256 bits (with some instruction extensions)
• XMM and YMM registers
  • XMM: 128 bits
  • YMM: 256 bits

INTEL OPENCV LIBRARY
• OpenCV library
  • Covers a variety of multimedia applications
  • Object detection, face recognition, image processing…
• Good candidate for speedup with the Intel SSE or AVX ISA
  • Computation-intensive
• I made a video on YouTube showing some tricks for using the OpenCV library

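GUIDELINES FOR ENABLING THE ISA
• Intel SSE and AVX
  • Run cat /proc/cpuinfo and make sure SSE and AVX are enabled. Otherwise enable them.
• On my machine:
  • All SSE extensions are available
  • However, only AVX1 is available, which means 256-bit YMM registers can be used for floating-point operations, while integer SIMD operations are limited to the 128-bit XMM registers
  • Note: AVX2 is introduced with the Haswell microarchitecture, so processors from earlier generations do not support it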
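The same check can be made from inside a program. The sketch below assumes GCC or Clang on an x86 target and uses the compiler builtin `__builtin_cpu_supports`; the helper names (`cpu_has_sse2` and so on) are hypothetical and not part of the original slides.

```c
#include <stdio.h>

/* __builtin_cpu_supports requires a string literal, so each feature
   gets its own small helper (hypothetical names). Each returns 1 if
   the running CPU reports the feature and 0 otherwise. */
int cpu_has_sse2(void)  { return __builtin_cpu_supports("sse2")   != 0; }
int cpu_has_sse42(void) { return __builtin_cpu_supports("sse4.2") != 0; }
int cpu_has_avx(void)   { return __builtin_cpu_supports("avx")    != 0; }
int cpu_has_avx2(void)  { return __builtin_cpu_supports("avx2")   != 0; }

/* Prints one line per feature, similar to scanning the flags line
   of /proc/cpuinfo by hand. */
void print_simd_support(void) {
    printf("sse2:   %d\n", cpu_has_sse2());
    printf("sse4.2: %d\n", cpu_has_sse42());
    printf("avx:    %d\n", cpu_has_avx());
    printf("avx2:   %d\n", cpu_has_avx2());
}
```

Calling `print_simd_support()` at startup makes it possible to choose between a 128-bit and a 256-bit code path at run time instead of assuming one at build time.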

ACCELERATION CASE I

Original:

    for (int i = 0; i < length; i += 4) {
        double t0 = d1[i]   - d2[i];
        double t1 = d1[i+1] - d2[i+1];
        double t2 = d1[i+2] - d2[i+2];
        double t3 = d1[i+3] - d2[i+3];
        total_cost += t0*t0 + t1*t1 + t2*t2 + t3*t3;
    }

After modification:

    // Note: the SIMD version works in single precision, so d1 and d2 must be
    // float arrays aligned to 16 bytes. _mm_hadd_ps requires SSE3.
    int chunk = length / 4;
    for (int i = 0; i < chunk; i++) {
        __m128 m0, m1, m2;
        m0 = _mm_load_ps(&d1[4 * i]);
        m1 = _mm_load_ps(&d2[4 * i]);
        m1 = _mm_sub_ps(m0, m1);      // differences
        m1 = _mm_mul_ps(m1, m1);      // squared differences s0..s3
        m1 = _mm_hadd_ps(m1, m1);     // (s0+s1, s2+s3, s0+s1, s2+s3)
        m2 = _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(2,3,0,1));  // swap neighbouring lanes
        m1 = _mm_add_ps(m1, m2);      // every lane now holds s0+s1+s2+s3
        total_cost += ((float*)&m1)[0];
        if (total_cost > best)
            break;
    }
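A variant of the same kernel is sketched below. It is not the author's code: it assumes plain `float` inputs with no alignment requirement, keeps four partial sums in a vector accumulator, does the horizontal reduction once after the loop, and handles lengths that are not a multiple of 4. The function name `ssd_sse` is hypothetical, and the early exit against `best` is left out to keep the reduction outside the loop.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Sum of squared differences between two float arrays of equal length. */
float ssd_sse(const float *d1, const float *d2, size_t length) {
    __m128 acc = _mm_setzero_ps();  /* four partial sums */
    size_t i = 0;

    for (; i + 4 <= length; i += 4) {
        __m128 a = _mm_loadu_ps(d1 + i);   /* unaligned loads */
        __m128 b = _mm_loadu_ps(d2 + i);
        __m128 diff = _mm_sub_ps(a, b);
        acc = _mm_add_ps(acc, _mm_mul_ps(diff, diff));
    }

    /* Horizontal sum of the four lanes, done once. */
    __m128 hi   = _mm_movehl_ps(acc, acc);          /* (a2, a3, a2, a3) */
    __m128 sum2 = _mm_add_ps(acc, hi);              /* (a0+a2, a1+a3, ..) */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    float total = _mm_cvtss_f32(_mm_add_ss(sum2, lane1));

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < length; ++i) {
        float t = d1[i] - d2[i];
        total += t * t;
    }
    return total;
}
```

Reducing once per call rather than once per chunk removes the `hadd`/`shuffle`/`add` sequence from the inner loop. If the early exit matters, the reduction could be run every few chunks instead.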

ACCELERATION CASE II

Original:

    float minval = FLT_MAX, maxval = -FLT_MAX;
    for (i = 0; i < N; i++, ++it) {
        float v = *(const float*)it.ptr;
        if (v < minval) { minval = v; minidx = it.node()->idx; }
        if (v > maxval) { maxval = v; maxidx = it.node()->idx; }
    }
    if (_minval) *_minval = minval;
    if (_maxval) *_maxval = maxval;

After modification:

    // _mm_cmp_ps with _CMP_LT_OS / _CMP_GT_OS is an AVX intrinsic.
    // Assumes the values are contiguous in memory and 16-byte aligned.
    __m128 m0, m1, m2, m3, m4, minArray, maxArray;
    int chunk = N / 4;
    // Seed the running minima/maxima with the first four values (chunk 0).
    minArray = maxArray = _mm_load_ps((const float*)it.ptr);
    it += 4;
    for (i = 1; i < chunk; i++) {
        m0 = _mm_load_ps((const float*)it.ptr);
        it += 4;
        m1 = _mm_min_ps(m0, minArray);              // new per-lane minima
        m2 = _mm_max_ps(m0, maxArray);              // new per-lane maxima
        m3 = _mm_cmp_ps(m0, minArray, _CMP_LT_OS);  // lanes with a new minimum
        m4 = _mm_cmp_ps(m0, maxArray, _CMP_GT_OS);  // lanes with a new maximum
        int* mask1 = (int*) &m3;
        int* mask2 = (int*) &m4;
        for (int j = 0; j < 4; j++) {
            if (mask1[j] == -1) minPos[j] = 4 * i + j;
            if (mask2[j] == -1) maxPos[j] = 4 * i + j;
        }
        minArray = m1;  // keep the values, not the comparison masks
        maxArray = m2;
    }
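The per-lane position update can also stay inside the SIMD registers instead of going through a scalar inner loop. The sketch below is one way to do that with SSE2 only, assuming the values sit in a plain contiguous `float` array (not an OpenCV sparse iterator). The names `minmax_result` and `minmax_idx_sse` are hypothetical. Ties resolve to the first occurrence, as in the scalar loop.

```c
#include <float.h>
#include <emmintrin.h>  /* SSE2: integer ops for the index vectors */

typedef struct {
    float minval, maxval;
    int   minidx, maxidx;
} minmax_result;

minmax_result minmax_idx_sse(const float *v, int n) {
    minmax_result r = { FLT_MAX, -FLT_MAX, -1, -1 };
    int i = 0;

    if (n >= 4) {
        __m128  vmin = _mm_loadu_ps(v);             /* seed with chunk 0 */
        __m128  vmax = vmin;
        __m128i idx  = _mm_setr_epi32(0, 1, 2, 3);  /* index of each lane */
        __m128i imin = idx, imax = idx;
        const __m128i four = _mm_set1_epi32(4);

        for (i = 4; i + 4 <= n; i += 4) {
            __m128 x = _mm_loadu_ps(v + i);
            idx = _mm_add_epi32(idx, four);
            __m128i lt = _mm_castps_si128(_mm_cmplt_ps(x, vmin));
            __m128i gt = _mm_castps_si128(_mm_cmpgt_ps(x, vmax));
            /* Select idx where the mask is set, keep the old index elsewhere. */
            imin = _mm_or_si128(_mm_and_si128(lt, idx), _mm_andnot_si128(lt, imin));
            imax = _mm_or_si128(_mm_and_si128(gt, idx), _mm_andnot_si128(gt, imax));
            vmin = _mm_min_ps(x, vmin);
            vmax = _mm_max_ps(x, vmax);
        }

        /* Reduce the four lanes; ties go to the lower index. */
        float mins[4], maxs[4];
        int   mini[4], maxi[4];
        _mm_storeu_ps(mins, vmin);
        _mm_storeu_ps(maxs, vmax);
        _mm_storeu_si128((__m128i *)mini, imin);
        _mm_storeu_si128((__m128i *)maxi, imax);
        r.minval = mins[0]; r.minidx = mini[0];
        r.maxval = maxs[0]; r.maxidx = maxi[0];
        for (int j = 1; j < 4; ++j) {
            if (mins[j] < r.minval || (mins[j] == r.minval && mini[j] < r.minidx)) {
                r.minval = mins[j]; r.minidx = mini[j];
            }
            if (maxs[j] > r.maxval || (maxs[j] == r.maxval && maxi[j] < r.maxidx)) {
                r.maxval = maxs[j]; r.maxidx = maxi[j];
            }
        }
    }

    /* Scalar tail (also covers n < 4). */
    for (; i < n; ++i) {
        if (r.minidx < 0 || v[i] < r.minval) { r.minval = v[i]; r.minidx = i; }
        if (r.maxidx < 0 || v[i] > r.maxval) { r.maxval = v[i]; r.maxidx = i; }
    }
    return r;
}
```

The and/andnot/or sequence is a manual blend. On SSE4.1 hardware a single blend instruction could replace it, but the version above runs on any x86-64 CPU.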

LOAD OF STRUCTURES
• Structures like this:

    typedef struct point_ {
        int x;
        int y;
    } point;

• The _mm_load_ intrinsics only read consecutive memory!
• What does the data look like inside the XMM register?
• How to achieve the following using the SSE and AVX ISA?

    point* points;

    In memory:  points[0].x  points[0].y  points[1].x  points[1].y  ...
    As loaded:  X0 Y0 X1 Y1 X2 Y2 X3 Y3
    Wanted:     X0 X1 X2 X3 Y0 Y1 Y2 Y3

Not easy!!!
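For comparison, here is a minimal 128-bit sketch of the same rearrangement. It is a swapped-in technique, not the method from the slides: it loads the four points as two XMM registers and separates the fields with two `_mm_shuffle_ps` calls, needing SSE2 only. The function name `deinterleave4_sse` is hypothetical, and the values are converted to `float` to match the conversion the slides apply.

```c
#include <emmintrin.h>  /* SSE2 */

typedef struct point_ {
    int x;
    int y;
} point;

/* Splits four consecutive points into xs[0..3] and ys[0..3] (as floats). */
void deinterleave4_sse(const point *p, float *xs, float *ys) {
    /* a = x0 y0 x1 y1, b = x2 y2 x3 y3 */
    __m128 a = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&p[0]));
    __m128 b = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&p[2]));

    /* Even elements of a and b: x0 x1 x2 x3 */
    _mm_storeu_ps(xs, _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)));
    /* Odd elements of a and b:  y0 y1 y2 y3 */
    _mm_storeu_ps(ys, _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)));
}
```

Because `_mm_shuffle_ps` takes its two low result lanes from the first operand and its two high lanes from the second, one shuffle per field is enough once the data is split across two registers.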

PERMUTE AND BLEND

    __m256i temp = _mm256_load_si256((__m256i*) &points[4 * i]);  // load four points
    __m256 temp2 = _mm256_cvtepi32_ps(temp);                      // int -> float
    v4si mask1 = {9,8,8,9};
    __m256 temp3 = _mm256_permutevar_ps(temp2, mask1);            // permute within each 128-bit lane
    __m256 temp4 = _mm256_permute2f128_ps(temp3, temp3, 0x01);    // swap the two 128-bit lanes
    temp3 = _mm256_blend_ps(temp3, temp4, 0b /* mask digits missing in the source */);
    v4si mask2 = {0xd,4,4,0xd};
    temp3 = _mm256_permutevar_ps(temp3, mask2);                   // permute the blended result
    __m128 m1 = _mm256_extractf128_ps(temp3, 1);                  // upper half: Y values
    __m128 m2 = _mm256_extractf128_ps(temp3, 0);                  // lower half: X values

Register layouts shown on the slide, in order:

    Loaded:                X0 Y0 X1 Y1 | X2 Y2 X3 Y3
    After first permute:   X0 X1 Y0 Y1 | Y2 Y3 X2 X3
    After lane swap:       Y2 Y3 X2 X3 | X0 X1 Y0 Y1
    After blend:           X0 X1 X2 X3 | Y2 Y3 Y0 Y1
    After second permute:  X0 X1 X2 X3 | Y0 Y1 Y2 Y3
    m2:                    X0 X1 X2 X3
    m1:                    Y0 Y1 Y2 Y3
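A shorter AVX1 route to the same final layout is sketched below, under stated assumptions: GCC or Clang (the `target("avx")` attribute lets the function compile without `-mavx`), an AVX-capable CPU at run time, and unaligned loads. It is not a reconstruction of the slide's masks. The control vector built with `_mm256_setr_epi32` and the function name `deinterleave4_avx` are illustrative choices.

```c
#include <immintrin.h>  /* AVX */

typedef struct point_ {
    int x;
    int y;
} point;

/* Splits four consecutive points into xs[0..3] and ys[0..3] (as floats).
   The caller is responsible for checking that the CPU supports AVX. */
__attribute__((target("avx")))
void deinterleave4_avx(const point *p, float *xs, float *ys) {
    /* x0 y0 x1 y1 | x2 y2 x3 y3 */
    __m256i raw = _mm256_loadu_si256((const __m256i *)p);
    __m256  f   = _mm256_cvtepi32_ps(raw);

    /* In-lane permute: pick elements 0,2,1,3 of each 128-bit lane,
       giving x0 x1 y0 y1 | x2 x3 y2 y3 */
    const __m256i ctl = _mm256_setr_epi32(0, 2, 1, 3, 0, 2, 1, 3);
    __m256 g = _mm256_permutevar_ps(f, ctl);

    __m128 lo = _mm256_castps256_ps128(g);    /* x0 x1 y0 y1 */
    __m128 hi = _mm256_extractf128_ps(g, 1);  /* x2 x3 y2 y3 */

    _mm_storeu_ps(xs, _mm_movelh_ps(lo, hi)); /* x0 x1 x2 x3 */
    _mm_storeu_ps(ys, _mm_movehl_ps(hi, lo)); /* y0 y1 y2 y3 */
}
```

Since the X and Y vectors are wanted as separate 128-bit values anyway, combining the halves with `_mm_movelh_ps` and `_mm_movehl_ps` avoids the lane swap, blend and second permute.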

SIMULATION RESULTS
• The min/max kernel finds not only the min/max values but also their positions
• Loading structures incurs too much overhead

CONCLUSION AND FUTURE WORK
• OpenCV is suitable for SSE or AVX acceleration
• A single task has a better chance of getting a speedup
• Loading and rearranging a structure is a really cumbersome task
• Hints for smart automated compilation (such as loading structures)
• Suggestions for expanding the ISA (introducing new instructions)