Optimizing C63 for x86 (Group 9)

Outline
- Bird’s-eye view: gprof of reference encoder
- Optimizing SAD
- Results

gprof: reference, foreman

Each sample counts as 0.01 seconds. Functions in order of self time (the %, cumulative seconds, self seconds, calls and us/call columns were not preserved in this transcript):
  sad_block_8x8
  me_block_8x8
  dequant_idct_block_8x8
  dct_quant_block_8x8
  flush_bits
  put_bits
granularity: each sample hit covers 2 byte(s) for 0.02% of … seconds
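For orientation, the reference encoder computes SAD with a plain scalar double loop; a minimal sketch of that baseline (written here from the profile, not copied from the C63 source) looks like this:

#include <stdint.h>
#include <stdlib.h>

/* Scalar 8x8 sum-of-absolute-differences: the shape of the hotspot gprof
 * points at.  block1/block2 are 8x8 blocks inside frames with the given
 * row stride; the sum of |block2 - block1| is written to *result. */
void sad_block_8x8_scalar(uint8_t *block1, uint8_t *block2, int stride,
                          int *result)
{
  int u, v;

  *result = 0;

  for (v = 0; v < 8; ++v)
    for (u = 0; u < 8; ++u)
      *result += abs((int) block2[v*stride + u] - (int) block1[v*stride + u]);
}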

Optimizing SAD

SAD SSE2: PSADBW

void sad_block_8x8(uint8_t *block1, uint8_t *block2, int stride, int *result)
{
  int v;
  __m128i r = _mm_setzero_si128();

  for (v = 0; v < 8; v += 2)
  {
    const __m128i b1 = _mm_set_epi64(*(__m64 *) &block1[(v+0)*stride],
                                     *(__m64 *) &block1[(v+1)*stride]);
    const __m128i b2 = _mm_set_epi64(*(__m64 *) &block2[(v+0)*stride],
                                     *(__m64 *) &block2[(v+1)*stride]);

    r = _mm_add_epi16(r, _mm_sad_epu8(b2, b1));
  }

  *result =
    _mm_extract_epi16(r, 0) +
    _mm_extract_epi16(r, 4);
}
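A quick way to validate the SSE2 version is to compare it against the scalar baseline on random blocks. The sketch below assumes both functions are compiled and linked in; sad_block_8x8_scalar is the hypothetical helper from the earlier sketch, not part of the slides:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

void sad_block_8x8(uint8_t *block1, uint8_t *block2, int stride, int *result);
void sad_block_8x8_scalar(uint8_t *block1, uint8_t *block2, int stride,
                          int *result);

int main(void)
{
  enum { STRIDE = 64 };
  static uint8_t a[8 * STRIDE], b[8 * STRIDE];
  int i, simd, scalar;

  /* Fill two 8-row regions with random pixel data. */
  for (i = 0; i < 8 * STRIDE; ++i) {
    a[i] = rand() & 0xff;
    b[i] = rand() & 0xff;
  }

  sad_block_8x8(a, b, STRIDE, &simd);
  sad_block_8x8_scalar(a, b, STRIDE, &scalar);

  printf("simd=%d scalar=%d %s\n", simd, scalar,
         simd == scalar ? "OK" : "MISMATCH");
  return simd != scalar;
}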

How well does this perform?
- Gprof: the improved SSE2 SAD uses 3.69 s vs … s for the reference*
- Cachegrind: lots of branch prediction misses in me_block_8x8
*) Foreman sequence on gpu-7
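The mispredictions come from the data-dependent best-match comparison inside the motion-estimation search. The loop below is a sketch of that search structure (an assumed shape, not the actual C63 me_block_8x8), with the offending branch marked:

#include <limits.h>
#include <stdint.h>

void sad_block_8x8(uint8_t *block1, uint8_t *block2, int stride, int *result);

/* Try every candidate (x, y) in the search window and keep the best SAD.
 * The "sad < best_sad" branch depends on the pixel data, so it predicts
 * poorly -- which is what cachegrind flags and what the later branchless
 * minpos-based variants remove. */
static void me_search_sketch(uint8_t *orig, uint8_t *ref, int stride,
                             int left, int right, int top, int bottom,
                             int *best_x, int *best_y)
{
  int best_sad = INT_MAX;
  int x, y, sad;

  for (y = top; y < bottom; ++y) {
    for (x = left; x < right; ++x) {
      sad_block_8x8(orig, ref + y*stride + x, stride, &sad);
      if (sad < best_sad) {     /* data-dependent, frequently mispredicted */
        best_sad = sad;
        *best_x = x;
        *best_y = y;
      }
    }
  }
}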

SAD SSE4.1: MPSADBW + PHMINPOSUW

void sad_block_8x8x8(uint8_t *block1, uint8_t *block2, int stride, int *best,
                     int *result)
{
  int v;
  __m128i r = _mm_setzero_si128();

  union {
    __m128i v;
    struct {
      uint16_t sad;
      unsigned int index : 3;
    } minpos;
  } mp;

  for (v = 0; v < 8; v += 2) {
    const __m128i b1 = _mm_set_epi64(*(__m64 *) &block1[(v+1)*stride],
                                     *(__m64 *) &block1[(v+0)*stride]);
    const __m128i b2 = _mm_loadu_si128((__m128i *) &block2[(v+0)*stride]);
    const __m128i b3 = _mm_loadu_si128((__m128i *) &block2[(v+1)*stride]);

    r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b2, b1, 0b000));
    r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b2, b1, 0b101));
    r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b3, b1, 0b010));
    r = _mm_add_epi16(r, _mm_mpsadbw_epu8(b3, b1, 0b111));
  }

  mp.v = _mm_minpos_epu16(r);

  *result = mp.minpos.sad;
  *best   = mp.minpos.index;
}
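Semantically, the function returns the minimum 8x8 SAD over eight consecutive horizontal offsets in the reference, plus the offset that achieved it (which is what PHMINPOSUW packs into its result). A scalar model of that contract, for illustration only:

#include <stdint.h>
#include <stdlib.h>

/* *result is the minimum 8x8 SAD over horizontal offsets dx = 0..7 of the
 * reference block2, and *best is the dx that achieved it. */
static void sad_block_8x8x8_model(uint8_t *block1, uint8_t *block2,
                                  int stride, int *best, int *result)
{
  int dx, u, v;

  *result = 1 << 30;   /* larger than any 8x8 SAD (max 8*8*255) */
  *best = 0;

  for (dx = 0; dx < 8; ++dx) {
    int sad = 0;

    for (v = 0; v < 8; ++v)
      for (u = 0; u < 8; ++u)
        sad += abs((int) block2[v*stride + u + dx] - (int) block1[v*stride + u]);

    if (sad < *result) {
      *result = sad;
      *best = dx;
    }
  }
}

On ties, PHMINPOSUW reports the lowest index, which the strict comparison above reproduces.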

How well does this perform?
- Gprof: the improved SSE4.1 SAD uses 0.90 s vs 3.69 s
- Cachegrind: branch prediction misses reduced by a factor of 8
- Intel’s IACA tool: the CPU pipeline appears to be filled! Both source and reference block loads appear to compete with (V)MPSADBW for the CPU’s execution ports
- Assembly: better utilize AVX’s non-destructive instructions (fewer register copies); better reuse loaded data for SAD computations that share the same source block

SAD 8x8x8x8: fewer src loads and branches

Load the source block once:

.macro .load_src
  vmovq   (%rdi), %xmm0                  # src[0]
  vpinsrq $1, (%rdi,%rdx), %xmm0, %xmm0  # src[1]
  vmovq   (%rdi,%rdx,2), %xmm1           # src[2]
  vmovhps (%rdi,%r8), %xmm1, %xmm1       # src[3]
  vmovq   (%rdi,%rdx,4), %xmm2           # src[4]
  vmovhps (%rdi,%r9), %xmm2, %xmm2       # src[5]
  vmovq   (%rdi,%r8,2), %xmm3            # src[6]
  vmovhps (%rdi,%rax), %xmm3, %xmm3      # src[7]

Do SAD for 8x8 - 8x8 blocks (relative y = 0…8, x = 0…8):

.macro .8x8x8x8
  vmovdqu (%rsi), %xmm12                 # ref[0]
  .8x8x1 0, 0, 12, 4 0
  vmovdqu (%rsi,%rdx), %xmm13            # ref[1]
  .8x8x1 0, 1, 13, 4 0
  .8x8x1 1, 0, 13, 5 0
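The intent is that the eight source rows stay resident in registers (%xmm0 to %xmm3) while all 64 candidate positions are evaluated. A scalar model of what the macros compute, under that reading of the slide (the helper name and exact semantics are assumptions), is:

#include <stdint.h>
#include <stdlib.h>

/* The 8x8 source block is read once into a local array (the registers in
 * the assembly), then its SAD is computed against the 64 reference
 * positions (dx, dy) in 0..7, reusing the same source data every time. */
static void sad_8x8x8x8_model(const uint8_t *src, const uint8_t *ref,
                              int stride, uint16_t sad[8][8])
{
  uint8_t s[8][8];
  int u, v, dx, dy;

  for (v = 0; v < 8; ++v)        /* "load source block once" */
    for (u = 0; u < 8; ++u)
      s[v][u] = src[v*stride + u];

  for (dy = 0; dy < 8; ++dy)
    for (dx = 0; dx < 8; ++dx) {
      uint16_t acc = 0;

      for (v = 0; v < 8; ++v)
        for (u = 0; u < 8; ++u)
          acc += (uint16_t) abs((int) s[v][u] -
                                (int) ref[(v + dy)*stride + u + dx]);

      sad[dy][dx] = acc;
    }
}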

How well does this perform?
- Gprof: the improved SAD 8x8x8x8 uses 0.53 s vs 0.90 s
- Valgrind: even less branching

SAD 4x8x8x8x8: branchless UV-plane ME

sad_4x8x8x8x8:
  .load_src                     # Load source block from %rdi
  mov  %rsi, %rdi               # Reference block x, y
  .8x8x8x8
  lea  8(%rdi), %rsi            # Reference block x+8, y
  .8x8x8x8
  lea  (%rdi,%rdx,8), %rsi      # Reference block x, y+8
  .8x8x8x8
  lea  8(%rdi,%rdx,8), %rsi     # Reference block x+8, y+8
  .8x8x8x8
  …
  vphminposuw …
  …
  ret

How well does this perform?
- Gprof: the improved 4x8x8x8x8 SAD uses 0.47 s vs 0.53 s
- Valgrind: even less branching!
- Total runtime for the Foreman sequence reduced from ~40.1 s to ~1.6 s (a factor of 25)

SAD assembly code: iterative improvement

Instruction       Function variant     Cycle mean, adjusted for a single 8x8 block
MPSADBW           SAD 8x8x8            7.5
VMPSADBW          SAD 2x8x8x8          3.6
                  SAD 4x8x8x8          4.3
                  SAD 8x8x8x8 v1       2.9
                  SAD 8x8x8x8 v2       2.8
                  SAD 4x8x8x8x8        2.7
Future research   SAD 16x8x8x8x8       2.6? Even fewer branches?

Image quality: comparable

Table: PSNR mean, PSNR 95%, SSIM mean and SSIM 95% for Tractor (reference vs optimized) and Foreman (reference vs optimized); the numeric values were not preserved in this transcript.

gprof: reference, tractor 50 frames (-O3)

NOT ON NDLAB!
Each sample counts as 0.01 seconds. Functions in order of self time (the %, cumulative seconds, self seconds, calls and s/call columns were not preserved in this transcript):
  sad_block_8x8
  c63_motion_estimate
  dct_quant_block_8x8
  dequant_idct_block_8x8
  write_block
  dequantize_idct
  c63_motion_compensate
  dct_quantize
  put_bits
granularity: each sample hit covers 2 byte(s) for 0.01% of … seconds

gprof: improved, tractor 50 frames (-O3)

NOT ON NDLAB!
Each sample counts as 0.01 seconds. Functions in order of self time (the %, cumulative seconds, self seconds, calls and ms/call columns were not preserved in this transcript):
  sad_4x8x8x8x8
  dct_quant_block_8x8
  dequant_idct_block_8x8
  write_frame
  dequantize_idct
  put_bits
  dct_quantize
  transpose_block_avx
  sad_8x8x8x8
  c63_motion_estimate
  c63_motion_compensate
granularity: each sample hit covers 2 byte(s) for 0.17% of 5.86 seconds
Speedup: 134.2 s / 5.86 s ≈ 23x