Notes on Homework 1 (CS267 Lecture 2)


Notes on Homework 1
DBS = 328 lines; BLK+UNJ = 68 lines; BLK = 48 lines; UNJ = 35 lines; NAÏVE = 20
Techniques such as unroll-and-jam significantly help performance.
Must write SIMD code to get past 50% of peak!

Summary of SSE intrinsics
Vector data type: __m128d
Load and store operations: _mm_load_pd, _mm_store_pd, _mm_loadu_pd, _mm_storeu_pd
Load and broadcast across vector: _mm_load1_pd
Arithmetic: _mm_add_pd, _mm_mul_pd
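As a minimal sketch of these intrinsics in use (the helper name vec_add2 is illustrative, not from the slides): load two pairs of doubles, add them element-wise, and store the result.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_loadu_pd, _mm_add_pd, ... */

/* z[0..1] = x[0..1] + y[0..1], two doubles at a time */
void vec_add2(const double *x, const double *y, double *z)
{
    __m128d vx = _mm_loadu_pd(x);     /* load two doubles (no alignment required) */
    __m128d vy = _mm_loadu_pd(y);
    __m128d vz = _mm_add_pd(vx, vy);  /* element-wise add of both lanes */
    _mm_storeu_pd(z, vz);             /* store two doubles back to memory */
}
```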

Example: multiplying 2x2 matrices

    __m128d c1 = _mm_loadu_pd( C+0*lda );  // load unaligned block in C
    __m128d c2 = _mm_loadu_pd( C+1*lda );
    for( int i = 0; i < 2; i++ )
    {
        __m128d a  = _mm_load_pd( A+i*lda );      // load aligned i-th column of A
        __m128d b1 = _mm_load1_pd( B+i+0*lda );   // load i-th row of B
        __m128d b2 = _mm_load1_pd( B+i+1*lda );
        c1 = _mm_add_pd( c1, _mm_mul_pd( a, b1 ) );  // rank-1 update
        c2 = _mm_add_pd( c2, _mm_mul_pd( a, b2 ) );
    }
    _mm_storeu_pd( C+0*lda, c1 );  // store unaligned block in C
    _mm_storeu_pd( C+1*lda, c2 );
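The slide's loop can be packaged as a complete routine. The function name dgemm_2x2 is a stand-in; the column-major layout with leading dimension lda, and the alignment of A (implied by the slide's use of _mm_load_pd), are assumptions consistent with the slide's indexing.

```c
#include <emmintrin.h>

/* C += A*B for 2x2 blocks stored column-major with leading dimension lda.
   Columns of A are assumed 16-byte aligned; C may be unaligned. */
void dgemm_2x2(int lda, const double *A, const double *B, double *C)
{
    __m128d c1 = _mm_loadu_pd(C + 0*lda);          /* column 0 of C */
    __m128d c2 = _mm_loadu_pd(C + 1*lda);          /* column 1 of C */
    for (int i = 0; i < 2; i++) {
        __m128d a  = _mm_load_pd(A + i*lda);       /* i-th column of A (aligned) */
        __m128d b1 = _mm_load1_pd(B + i + 0*lda);  /* broadcast B(i,0) */
        __m128d b2 = _mm_load1_pd(B + i + 1*lda);  /* broadcast B(i,1) */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));    /* rank-1 update of C */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    }
    _mm_storeu_pd(C + 0*lda, c1);
    _mm_storeu_pd(C + 1*lda, c2);
}
```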

Other Issues
Checking the efficiency of the compiler helps:
- Use the -S option to see the generated assembly code.
- The inner loop should consist mostly of ADDPD and MULPD ops; ADDSD and MULSD imply scalar computations.
Consider using another compiler; options are PGI, PathScale, and GNU.
Look through Goto and van de Geijn's paper.