SSE for H.264 Encoder. Chuck Tsen, Sean Pieper.

SSE for H.264 Encoder
Chuck Tsen, Sean Pieper

SSE – what can’t it do?
Mixed scalar and vector
Unaligned memory accesses
Predicated execution
>2 source arguments (no shuffle and add)
Many options for FP don’t exist for integer
Can’t apply everywhere!

Profile of code
GCOV for per-line execution counts
SATD most promising candidate
  – SSE SAD instruction does not help!!!
Walsh Hadamard Transform
  – Used for motion estimation of 8x8: 16 applications per block, 8 element vectors
  – Used for sub-pixel motion estimation: one vector at a shot, 16 element vector

Hadamard Matrix
All values are in {1,-1}
Orthogonal and symmetric
First row/column only 1’s
Equal 1’s and -1’s after first row
Transform
  – Multiply input by matrix
  – Sum abs of output vector
Calculations not super regular
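The transform on this slide (multiply the input by the matrix, then sum the absolute values of the output) can be sketched in scalar C. This is a minimal reference under stated assumptions, not the encoder's code: the names `hadamard_entry` and `satd16_reference` are hypothetical, and the matrix uses the Sylvester (natural) row ordering, which may differ from the ordering used in the encoder. Any scaling the encoder applies to the result is omitted.

```c
#include <stdlib.h>

enum { SATD_N = 16 };

/* Entry (i, j) of a 16x16 Hadamard matrix in Sylvester ordering
 * (an assumed ordering): +1 if i & j has an even number of set bits,
 * -1 otherwise. Row 0 and column 0 are all +1. */
static int hadamard_entry(unsigned i, unsigned j)
{
    unsigned bits = i & j;
    int sign = 1;
    while (bits) {
        sign = -sign;       /* flip once per set bit */
        bits &= bits - 1;   /* clear the lowest set bit */
    }
    return sign;
}

/* Multiply x by the matrix, then sum the absolute values of the outputs. */
int satd16_reference(const int x[SATD_N])
{
    int total = 0;
    for (unsigned row = 0; row < SATD_N; ++row) {
        int acc = 0;
        for (unsigned col = 0; col < SATD_N; ++col)
            acc += hadamard_entry(row, col) * x[col];
        total += abs(acc);
    }
    return total;
}
```

A constant input only excites the all-ones row, so sixteen 1's give a result of 16. A scalar version like this is mainly useful as a check against a vectorized one.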

Our optimization
Bin by positive terms in each group
  – {0-3,4-7,8-11,12-15}
  – Allows aligning data in columns
Four terms cannot align by column
  – But, can align within row
Hard parts align by column
  – Use vector ops across columns
  – Much shuffling, but SSE still a win
Simpler optimization for 8x8 columns
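One plausible reading of "bin by positive terms" (an assumption, the slides do not spell it out): every matrix entry is +1 or -1, so an output equals (sum of positive terms) minus (sum of negative terms), which is the same as 2 * (sum of positive terms) minus the sum of all inputs. Only the positive terms then need to be gathered per output. The sketch below checks this on the `line0` formula from the backup slide; the function names and the `LINE0_POS` table are hypothetical.

```c
enum { BIN_N = 16 };

/* Indices that appear with a + sign in line0 on the backup slide. */
static const int LINE0_POS[8] = { 0, 3, 4, 7, 8, 11, 12, 15 };

/* line0 exactly as written on the backup slide. */
int line0_direct(const int x[BIN_N])
{
    return x[0] + x[3] + (x[4] + x[8]) + x[7] + x[0xb] + x[0xc] + x[0xf]
         - x[1] - x[2] - x[5] - x[6] - x[9] - x[0xa] - x[0xd] - x[0xe];
}

/* Same value from the positive terms only: 2 * positive - total. */
int line0_binned(const int x[BIN_N])
{
    int total = 0;
    int positive = 0;
    for (int i = 0; i < BIN_N; ++i)
        total += x[i];          /* shared by every output line */
    for (int k = 0; k < 8; ++k)
        positive += x[LINE0_POS[k]];
    return 2 * positive - total;
}
```

The total is shared by all sixteen outputs, so it is computed once. The per-output work is then a sum of eight terms, which is the part the shuffles and vector adds target.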

Before and After
Original 1D has 64 lines, modified ~70
  – Used intermediate calculations heavily
But IA32 has only ~4 GP registers
Original has mem traffic == SLOW
SSE keeps data minty fresh in 8 “special” (XMM) registers
Oh, and we load the data 4x faster

Questions?

BACKUPS!

line0 = x[0] + x[3] + (x[4] + x[8]) + x[7] + x[b] + x[c] + x[f] - x[1] - x[2] - x[5] - x[6] - x[9] - x[a] - x[d] - x[e]
line1 = x[0] + x[1] + (x[5] + x[9]) + x[4] + x[8] + x[c] + x[d] - x[2] - x[3] - x[6] - x[7] - x[a] - x[b] - x[e] - x[f]
line2 = x[0] + x[2] + (x[6] + x[a]) + x[4] + x[8] + x[c] + x[e] - x[3] - x[1] - x[7] - x[5] - x[b] - x[9] - x[f] - x[d]
line12 = x[0] + x[1] + (x[7] + x[b]) + x[6] + x[a] + x[c] + x[d] - x[2] - x[3] - x[5] - x[4] - x[8] - x[9] - x[f] - x[e]

alpha.sse = _mm_shuffle_epi32(x_zero_three, (int) 0x67);        // alpha = x[1,2,1,3]
alpha.sse = _mm_add_epi32(alpha.sse, x_zeros.sse);              // alpha += x[0,0,0,0]
alpha.sse = _mm_add_epi32(alpha.sse, special_47p8b.sse);        // alpha += x[7,6,5,4] + x[b,a,9,8]
beta.sse = _mm_shuffle_epi32(x_four_seven, (int) 0x83);         // beta = x[6,4,4,7]
alpha.sse = _mm_add_epi32(alpha.sse, beta.sse);                 // alpha += beta
beta.sse = _mm_shuffle_epi32(x_eight_eleven, (int) 0x83);       // beta = x[a,8,8,b]
alpha.sse = _mm_add_epi32(alpha.sse, beta.sse);                 // alpha += beta
alpha.sse = _mm_add_epi32(alpha.sse, x_cs.sse);                 // alpha += x[c,c,c,c]
beta.sse = _mm_shuffle_epi32(x_twelve_fifteeen, (int) 0x67);    // beta = x[d,e,d,f]
alpha.sse = _mm_add_epi32(alpha.sse, beta.sse);                 // alpha += beta
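The intrinsic sequence above can be wrapped into a self-contained SSE2 function for experimentation. This is a sketch under several assumptions: the names `positive_sums_sse2` and `positive_sum_lane` are hypothetical, plain `__m128i` variables replace the slide's `.sse` union members, the inputs are loaded with unaligned loads, and `x_zeros`, `x_cs` and `special_47p8b` are built the way the slide's comments suggest (the slide does not show how they are set up). Lane 0 (the lowest element) accumulates the positive terms of line0, lane 1 those of line1, lane 2 those of line2, and lane 3 those of line12.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Computes the positive-term sums of line0, line1, line2 and line12
 * into out[0..3]. */
void positive_sums_sse2(const int x[16], int out[4])
{
    /* Each register holds four consecutive inputs, lowest index in lane 0. */
    __m128i x_zero_three     = _mm_loadu_si128((const __m128i *)(x + 0));
    __m128i x_four_seven     = _mm_loadu_si128((const __m128i *)(x + 4));
    __m128i x_eight_eleven   = _mm_loadu_si128((const __m128i *)(x + 8));
    __m128i x_twelve_fifteen = _mm_loadu_si128((const __m128i *)(x + 12));

    /* Assumed setup, inferred from the slide's comments. */
    __m128i x_zeros = _mm_shuffle_epi32(x_zero_three, 0x00);      /* x[0] in all lanes */
    __m128i x_cs    = _mm_shuffle_epi32(x_twelve_fifteen, 0x00);  /* x[c] in all lanes */
    __m128i special_47p8b = _mm_add_epi32(x_four_seven, x_eight_eleven);

    /* Same shuffle masks and add order as the slide. */
    __m128i alpha = _mm_shuffle_epi32(x_zero_three, 0x67);
    alpha = _mm_add_epi32(alpha, x_zeros);
    alpha = _mm_add_epi32(alpha, special_47p8b);
    alpha = _mm_add_epi32(alpha, _mm_shuffle_epi32(x_four_seven, 0x83));
    alpha = _mm_add_epi32(alpha, _mm_shuffle_epi32(x_eight_eleven, 0x83));
    alpha = _mm_add_epi32(alpha, x_cs);
    alpha = _mm_add_epi32(alpha, _mm_shuffle_epi32(x_twelve_fifteen, 0x67));

    _mm_storeu_si128((__m128i *)out, alpha);
}

/* Convenience wrapper: returns one lane (0..3) of the result. */
int positive_sum_lane(const int x[16], int lane)
{
    int out[4];
    positive_sums_sse2(x, out);
    return out[lane];
}
```

Feeding the input x[i] = 1 << i makes each lane a bitmask of the indices it summed, which is a quick way to check the shuffle masks against the `line` formulas. It requires an x86 target with SSE2 enabled.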