Notes on Homework 1

2x2 Matrix Multiply

C00 += A00*B00 + A01*B10
C10 += A10*B00 + A11*B10
C01 += A00*B01 + A01*B11
C11 += A10*B01 + A11*B11

Rewrite as SIMD algebra:

C00_C10 += A00_A10 * B00_B00
C01_C11 += A00_A10 * B01_B01
C00_C10 += A01_A11 * B10_B10
C01_C11 += A01_A11 * B11_B11
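In this notation each line updates a two-element vector holding one column of C: the two rows of that column travel together as SIMD lanes, and the B entry is broadcast to both lanes. A plain-C reading of the four lines above (a hypothetical scalar version for illustration only, column-major with leading dimension 2):

    /* Scalar equivalent of the SIMD algebra: each i is one rank-1
       update; r plays the role of the two SIMD lanes. */
    for( int i = 0; i < 2; i++ )
        for( int r = 0; r < 2; r++ )
        {
            C[r + 0*2] += A[r + i*2] * B[i + 0*2];   /* the C00_C10 lines */
            C[r + 1*2] += A[r + i*2] * B[i + 1*2];   /* the C01_C11 lines */
        }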

Summary of SSE intrinsics

#include <emmintrin.h>

- Vector data type: __m128d
- Load and store operations: _mm_load_pd, _mm_store_pd, _mm_loadu_pd, _mm_storeu_pd
- Load and broadcast across vector: _mm_load1_pd
- Arithmetic: _mm_add_pd, _mm_mul_pd
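As a quick illustration of these intrinsics, here is a minimal self-contained sketch (not from the slides; the names and values are made up, and the aligned(16) attribute assumes GCC or Clang; compile with something like gcc -O2 -msse2):

    #include <emmintrin.h>
    #include <stdio.h>

    int main( void )
    {
        double x[2] __attribute__((aligned(16))) = { 1.0, 2.0 };  /* 16-byte aligned */
        double y[2] = { 10.0, 20.0 };                             /* alignment not guaranteed */
        double out[2];

        __m128d a = _mm_load_pd( x );        /* aligned load:   {1, 2}   */
        __m128d b = _mm_loadu_pd( y );       /* unaligned load: {10, 20} */
        __m128d s = _mm_load1_pd( &x[1] );   /* broadcast:      {2, 2}   */

        __m128d r = _mm_add_pd( _mm_mul_pd( a, s ), b );   /* {1*2+10, 2*2+20} */
        _mm_storeu_pd( out, r );             /* unaligned store */

        printf( "%g %g\n", out[0], out[1] ); /* prints: 12 24 */
        return 0;
    }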

Example: multiplying 2x2 matrices

    #include <emmintrin.h>

    c1 = _mm_loadu_pd( C+0*lda );                    // load unaligned block in C
    c2 = _mm_loadu_pd( C+1*lda );
    for( int i = 0; i < 2; i++ )
    {
        a  = _mm_load_pd( A+i*lda );                 // load aligned i-th column of A
        b1 = _mm_load1_pd( B+i+0*lda );              // load i-th row of B
        b2 = _mm_load1_pd( B+i+1*lda );
        c1 = _mm_add_pd( c1, _mm_mul_pd( a, b1 ) );  // rank-1 update
        c2 = _mm_add_pd( c2, _mm_mul_pd( a, b2 ) );
    }
    _mm_storeu_pd( C+0*lda, c1 );                    // store unaligned block in C
    _mm_storeu_pd( C+1*lda, c2 );
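The fragment above assumes the __m128d variables and matrix pointers are declared elsewhere. One way to wrap it into a complete, compilable program (a sketch; the name dgemm_2x2, the test matrices, and lda = 2 with column-major storage are assumptions, not part of the assignment code):

    #include <emmintrin.h>
    #include <stdio.h>

    /* C += A*B on 2x2 column-major blocks with leading dimension lda.
       A's columns must be 16-byte aligned for _mm_load_pd. */
    static void dgemm_2x2( int lda, const double *A, const double *B, double *C )
    {
        __m128d c1 = _mm_loadu_pd( C+0*lda );
        __m128d c2 = _mm_loadu_pd( C+1*lda );
        for( int i = 0; i < 2; i++ )
        {
            __m128d a  = _mm_load_pd( A+i*lda );
            __m128d b1 = _mm_load1_pd( B+i+0*lda );
            __m128d b2 = _mm_load1_pd( B+i+1*lda );
            c1 = _mm_add_pd( c1, _mm_mul_pd( a, b1 ) );
            c2 = _mm_add_pd( c2, _mm_mul_pd( a, b2 ) );
        }
        _mm_storeu_pd( C+0*lda, c1 );
        _mm_storeu_pd( C+1*lda, c2 );
    }

    int main( void )
    {
        /* column-major: A = [1 3; 2 4], B = [5 7; 6 8] */
        double A[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };
        double B[4] = { 5, 6, 7, 8 };
        double C[4] = { 0, 0, 0, 0 };
        dgemm_2x2( 2, A, B, C );
        printf( "C = [%g %g; %g %g]\n", C[0], C[2], C[1], C[3] );  /* [23 31; 34 46] */
        return 0;
    }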

General suggestions for optimizations

- Changing the order of the loops (i, j, k): this changes which matrix stays in registers/cache and the type of product (e.g. dot product vs. rank-1 update).
- Blocking for multiple levels (registers or SIMD, L1, L2). L3 is not important: most matrices will fit, and optimizations there are not likely to bring more benefit. (A blocking sketch follows this list.)
- Unrolling the loops.
- Copy optimization: a large benefit if done for the sub-matrix that is kept in memory; be careful not to overdo it (the overhead is high at every level).
- Aligning memory and padding: can help SIMD performance by eliminating special cases and slow paths.
- Adding SIMD intrinsics: you cannot get over 50% of peak without explicit intrinsics or auto-vectorization.
- Tuning parameters: try various block sizes and register-block sizes (perhaps automate the search).
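To make the blocking suggestion concrete, here is a rough one-level blocking sketch built around a 2x2 register/SIMD kernel such as dgemm_2x2 from the example above (everything here is illustrative: the name dgemm_blocked, the tile size BLOCK, column-major storage with n a multiple of BLOCK, BLOCK a multiple of 2, and A 16-byte aligned with n even so the aligned loads stay legal; real code needs fringe handling for other sizes):

    /* One level of cache blocking around the 2x2 SSE kernel.
       Calls dgemm_2x2 (defined in the earlier example) with lda = n. */
    #define BLOCK 32   /* tile edge, chosen so three tiles fit in L1 */

    static void dgemm_blocked( int n, const double *A, const double *B, double *C )
    {
        for( int j = 0; j < n; j += BLOCK )
            for( int k = 0; k < n; k += BLOCK )
                for( int i = 0; i < n; i += BLOCK )
                    /* within one BLOCK x BLOCK tile: 2x2 register blocks */
                    for( int jj = j; jj < j+BLOCK; jj += 2 )
                        for( int kk = k; kk < k+BLOCK; kk += 2 )
                            for( int ii = i; ii < i+BLOCK; ii += 2 )
                                dgemm_2x2( n, A + ii + kk*n,
                                              B + kk + jj*n,
                                              C + ii + jj*n );
    }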

Other issues

- Check the compiler options available (PGI, PathScale, GNU, Cray). The Cray compiler seems to have an issue with explicit intrinsics.
- Compiler flags: remember that you are not allowed to use matrix-multiply-specific flags such as -matmul.
- Check the assembly code for SSE instructions: use the -S flag to emit assembly (e.g. cc -O2 -S dgemm.c). The output should contain mainly ADDPD and MULPD ops; ADDSD and MULSD are scalar computations.
- Remember the write-up: we want to know which optimizations you tried and what worked and what failed. Try to have an incremental design and show the performance of multiple iterations.
- DUE DATE changed: now due Monday, Feb 17th at 11:59pm.