Presentation transcript:

02/09/2010 CS267 Lecture 7, slide 1: Notes on Homework 1
Must write SIMD code to get past 50% of peak! (Each SSE instruction operates on two doubles at once, so purely scalar code tops out at roughly half of the machine's peak floating-point rate.)

02/09/2010 CS267 Lecture 7, slide 2: Summary of SSE intrinsics
Vector data type: __m128d
Load and store operations: _mm_load_pd, _mm_store_pd, _mm_loadu_pd, _mm_storeu_pd
Load and broadcast across vector: _mm_load1_pd
Arithmetic: _mm_add_pd, _mm_mul_pd
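To make the list above concrete, here is a minimal, self-contained sketch (not from the slides; the array names and values are made up for illustration) that uses each of these intrinsics once. It assumes a GCC-style compiler for the aligned-array attribute.

#include <emmintrin.h>   /* SSE2 intrinsics: __m128d and the _mm_*_pd operations */
#include <stdio.h>

int main( void )
{
    /* 16-byte aligned storage so the aligned _mm_load_pd/_mm_store_pd are legal
       (GCC-style attribute; other compilers have equivalent directives) */
    double x[2] __attribute__((aligned(16))) = { 1.0, 2.0 };
    double y[2] __attribute__((aligned(16))) = { 3.0, 4.0 };
    double z[2] __attribute__((aligned(16)));

    __m128d vx = _mm_load_pd( x );       /* load two doubles from x */
    __m128d vy = _mm_load_pd( y );       /* load two doubles from y */
    __m128d vs = _mm_load1_pd( &y[0] );  /* broadcast y[0] into both lanes */

    __m128d vz = _mm_add_pd( _mm_mul_pd( vx, vs ), vy );  /* z = x*y[0] + y, elementwise */
    _mm_store_pd( z, vz );               /* store two doubles to z */

    printf( "%g %g\n", z[0], z[1] );     /* prints 6 10 */
    return 0;
}

_mm_loadu_pd and _mm_storeu_pd are the unaligned counterparts, used when 16-byte alignment cannot be guaranteed (as for the block of C in the next slide).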

02/09/2010 CS267 Lecture 7, slide 3: Example: multiplying 2x2 matrices

__m128d c1 = _mm_loadu_pd( C+0*lda );   // load unaligned block of C (column 0)
__m128d c2 = _mm_loadu_pd( C+1*lda );   // column 1 of C
for( int i = 0; i < 2; i++ )
{
    __m128d a  = _mm_load_pd( A+i*lda );     // load aligned i-th column of A
    __m128d b1 = _mm_load1_pd( B+i+0*lda );  // broadcast B(i,0): i-th row of B
    __m128d b2 = _mm_load1_pd( B+i+1*lda );  // broadcast B(i,1)
    c1 = _mm_add_pd( c1, _mm_mul_pd( a, b1 ) );  // rank-1 update
    c2 = _mm_add_pd( c2, _mm_mul_pd( a, b2 ) );
}
_mm_storeu_pd( C+0*lda, c1 );   // store unaligned block of C
_mm_storeu_pd( C+1*lda, c2 );
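As a usage sketch (not from the slides), the kernel above can be wrapped in a function and checked against a hand-computed result. The function name mm2x2_sse, the column-major test matrices, and lda = 2 are assumptions made for illustration.

#include <emmintrin.h>
#include <stdio.h>

/* C += A*B for 2x2 column-major blocks with leading dimension lda.
   Assumes A and A+lda are 16-byte aligned, as _mm_load_pd requires. */
static void mm2x2_sse( int lda, const double *A, const double *B, double *C )
{
    __m128d c1 = _mm_loadu_pd( C+0*lda );
    __m128d c2 = _mm_loadu_pd( C+1*lda );
    for( int i = 0; i < 2; i++ )
    {
        __m128d a  = _mm_load_pd( A+i*lda );
        __m128d b1 = _mm_load1_pd( B+i+0*lda );
        __m128d b2 = _mm_load1_pd( B+i+1*lda );
        c1 = _mm_add_pd( c1, _mm_mul_pd( a, b1 ) );
        c2 = _mm_add_pd( c2, _mm_mul_pd( a, b2 ) );
    }
    _mm_storeu_pd( C+0*lda, c1 );
    _mm_storeu_pd( C+1*lda, c2 );
}

int main( void )
{
    /* column-major 2x2 matrices with lda = 2; aligned so _mm_load_pd on A is safe */
    double A[4] __attribute__((aligned(16))) = { 1, 2,  3, 4 };  /* columns (1,2) and (3,4) */
    double B[4] __attribute__((aligned(16))) = { 5, 6,  7, 8 };
    double C[4] = { 0, 0, 0, 0 };

    mm2x2_sse( 2, A, B, C );

    /* expect C = A*B = [23 31; 34 46], printed in column-major order: 23 34 31 46 */
    printf( "%g %g %g %g\n", C[0], C[1], C[2], C[3] );
    return 0;
}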

02/09/2010 CS267 Lecture 7, slide 4: Other Issues
Checking the efficiency of the compiler helps:
  Use the -S option to see the generated assembly code.
  The inner loop should consist mostly of ADDPD and MULPD ops; ADDSD and MULSD imply scalar computations.
Consider using another compiler: options are PGI, PathScale, and GNU.
Look through Goto and van de Geijn's paper.
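For instance, with the GNU compiler the check above might look like the following (a hedged sketch; dgemm-blocked.c is a placeholder file name):

gcc -O3 -S dgemm-blocked.c              # writes assembly to dgemm-blocked.s
grep -c 'addpd\|mulpd' dgemm-blocked.s  # count packed (vector) double-precision ops
grep -c 'addsd\|mulsd' dgemm-blocked.s  # count scalar double-precision ops

A large count of addsd/mulsd relative to addpd/mulpd in the inner loop is a sign that the compiler did not vectorize the code.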