Notes on Homework 1 (CS267 Lecture 7, 02/09/2010)

Must write SIMD code to get past 50% of peak!


Summary of SSE intrinsics

- Vector data type: __m128d
- Load and store operations: _mm_load_pd, _mm_store_pd, _mm_loadu_pd, _mm_storeu_pd
- Load and broadcast across vector: _mm_load1_pd
- Arithmetic: _mm_add_pd, _mm_mul_pd

Example: multiplying 2x2 matrices

    __m128d c1 = _mm_loadu_pd( C+0*lda );      // load unaligned block of C
    __m128d c2 = _mm_loadu_pd( C+1*lda );
    for( int i = 0; i < 2; i++ )
    {
        __m128d a  = _mm_load_pd( A+i*lda );       // load aligned i-th column of A
        __m128d b1 = _mm_load1_pd( B+i+0*lda );    // load i-th row of B (broadcast)
        __m128d b2 = _mm_load1_pd( B+i+1*lda );
        c1 = _mm_add_pd( c1, _mm_mul_pd( a, b1 ) ); // rank-1 update
        c2 = _mm_add_pd( c2, _mm_mul_pd( a, b2 ) );
    }
    _mm_storeu_pd( C+0*lda, c1 );              // store unaligned block of C
    _mm_storeu_pd( C+1*lda, c2 );

Example: multiplying 2x2 matrices (diagram)

[Figure: the i = 0 iteration of the loop. B1 and B2 are broadcast to 128-bit registers, and each iteration of the loop computes an outer product: C1 += A * B1 and C2 += A * B2.]

Other Issues

- Checking the efficiency of the compiler helps: use the -S option to see the generated assembly code.
- The inner loop should consist mostly of ADDPD and MULPD ops; ADDSD and MULSD imply scalar computations.
- Consider using another compiler; options are PGI, PathScale, and GNU.
- Look through Goto and van de Geijn's paper.