
18.337 Parallel Computing’s Challenges

Old Homework (emphasized for effect)
– Download a parallel program from somewhere. Make it work.
– Download another parallel program.
– Now, …, make them work together!

SIMD
– SIMD (Single Instruction, Multiple Data) refers to parallel hardware that can execute the same instruction on multiple data. (Think of the addition of two vectors: one add instruction applies to every element of the vector.)
– The term was coined with one element per processor in mind, but with today's deep memories and hefty processors, large chunks of the vectors would be added on one processor.
– The term was coined with the broadcasting of an instruction in mind, hence the "single instruction," but today's machines are usually more flexible.
– The term was coined with A+B and elementwise A×B in mind, so nobody really knows for sure whether matmul or fft is SIMD or not, but these operations can certainly be built from SIMD operations.
– Today, it is not unusual to refer to a SIMD operation (sometimes, but not always, historically synonymous with data-parallel operations, though this feels wrong to me) when the software appears to run "lock-step," with every processor executing the same instruction.
– Usage: "I hear that machine is particularly fast when the program primarily consists of SIMD operations."
– Graphics processors such as NVIDIA's seem to run fastest on SIMD-type operations, but current research (and old research too) pushes the limits of SIMD.
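To make "the same instruction on multiple data" concrete, here is a minimal MATLAB sketch (mine, not from the slides): the vectorized expressions state one operation over every element at once, while the loop spells out the scalar equivalent.

    % Data-parallel (SIMD-style) elementwise operations on whole vectors.
    n = 1e6;
    a = rand(n,1);
    b = rand(n,1);

    c = a + b;       % one add expressed over every element: A+B
    d = a .* b;      % elementwise multiply: the "elementwise AxB" above

    % The scalar-loop equivalent of c = a + b, one element at a time:
    c2 = zeros(n,1);
    for k = 1:n
        c2(k) = a(k) + b(k);
    end
    % norm(c - c2) is exactly 0: the two forms compute the same thing.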

Natural Question may not be the most important
– "How do I parallelize x?" is the first question many students ask.
– The answer is often one of: fairly obvious, or very difficult.
– It can miss the true issues of high performance:
  – These days people are often good at exploiting locality for performance.
  – People are not very good at hiding communication and anticipating data movement to avoid bottlenecks.
  – People are not very good at interweaving multiple functions to make the best use of resources.
– It usually misses the issue of interoperability:
  – Will my program play nicely with your program?
  – Will my program really run on your machine?

Class Notation
– Vectors: small roman letters: x, y, …
– Vectors have length n if possible.
– Matrices: large roman (sometimes Greek) letters: A, B, X, Λ, Σ.
– Matrices are n × n, or maybe m × n, but almost never n × m. Could be p × q.
– Scalars may be small Greek letters or small roman letters; we may not be as consistent here.

Algorithm Example: FFTs
– For now, think of an FFT as a "black box": y = FFT(x) takes as input a vector x of length n and produces an output vector y of the same length, defined (but not computed) as a matrix times a vector: y = F_n x, where (F_n)_{jk} = e^{2πijk/n} for j,k = 0, …, n−1.
– Important use cases:
  – Column fft: fft(X), fft(X,[],1) (MATLAB)
  – Row fft: fft(X,[],2) (MATLAB)
  – 2d fft (do a row and a column): fft2(X)
  – fft2(X) = row_fft(col_fft(X)) = col_fft(row_fft(X))
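The use cases above can be checked directly in MATLAB. A small sketch (the test matrix X is mine, not from the slides) confirming that the 2d fft is a row fft of a column fft, in either order:

    X = rand(4,6);                 % any test matrix

    colF = fft(X,[],1);            % column fft (same as fft(X) for a matrix)
    rowF = fft(X,[],2);            % row fft

    e1 = norm(fft2(X) - fft(colF,[],2));   % columns, then rows
    e2 = norm(fft2(X) - fft(rowF,[],1));   % rows, then columns
    % e1 and e2 are at the level of floating-point roundoff (~1e-15)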

How to implement a column FFT?
– Put block columns on each processor.
– Do local column FFTs.
– Local column FFTs may be "column at a time" or "pipelined."
– In the case of the FFT there is probably a fast local package available, but that may not be true for other operations. Also, as MIT students have been known to do, you might try to beat the packages.
(Figure: a matrix partitioned into block columns across processors P0, P1, P2.)
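A minimal sketch of this layout, simulating the processors with a cell array (the partitioning and the gather step are my reading of the picture, not code from the slides; it assumes p divides the number of columns evenly):

    X = rand(8,12);
    p = 3;                                  % three processors: P0, P1, P2
    blocks = reshape(1:size(X,2), [], p);   % contiguous column blocks

    local = cell(1,p);
    for proc = 1:p                          % each iteration = one processor's work
        local{proc} = fft(X(:, blocks(:,proc)), [], 1);  % local column FFTs
    end

    Y = [local{:}];                         % gather the blocks back together
    % norm(Y - fft(X,[],1)) is ~0: the layout does not change the answer.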

A closer look at column fft
– Put block columns on each processor.
– Where were the columns? Where are they going?
– The cost of the above can be very expensive in performance. Can we hide it somewhere?
(Figure: block columns distributed across processors P0, P1, P2.)

What about row fft?
– Suppose block columns on each processor.
– Many transpose, then apply column FFTs, then transpose back.
– This thinking is simple and doable.
– Not only simple: it encourages the paradigm of 1) do whatever, 2) get good parallelism, 3) do whatever.
– It is harder to decide whether to do the rows in parallel in place, or to interweave the transposing of pieces with the start of the computation.
– There may be more performance available, but nobody to my knowledge has done a good job of this yet. You could be the first.
(Figure: block columns distributed across processors P0, P1, P2.)
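One way to spell out "transpose, column FFT, transpose back" (a sketch on a full in-memory matrix; in a distributed setting the two transposes are exactly the expensive all-to-all data movement discussed above):

    X  = rand(8,12);
    Y1 = fft(X,[],2);          % the row fft we want
    Y2 = fft(X.',[],1).';      % transpose, column fft, transpose back
    % norm(Y1 - Y2) is ~0; note .' (plain transpose), not ' (conjugate).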

Not load balanced column fft?
– Suppose block columns on each processor.
– To load balance or to not load balance, that is the question.
– Traditional wisdom says this is badly load balanced and parallelism is lost, but there is a cost to moving the data, which may or may not be worth the gain in load balancing.
(Figure: an uneven block-column layout across processors P0, P1, P2.)

2d fft
– Suppose block columns on each processor.
– Can do columns, transpose, rows, transpose.
– Can do transpose, rows, transpose, columns.
– Can we be fancier?
(Figure: block columns distributed across processors P0, P1, P2.)
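The two orderings on this slide, written out for a full in-memory matrix (a sketch; distributed versions would interleave the transposes with communication):

    X = rand(8,12);
    A = fft(fft(X,[],1).',[],1).';   % columns, transpose, rows, transpose
    B = fft(fft(X.',[],1).',[],1);   % transpose, rows, transpose, columns
    % norm(A - fft2(X)) and norm(B - fft2(X)) are both ~0.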

So much has to do with access to memory and data movement
– The conventional wisdom is that it's all about locality. This remains partially true, and partially not quite as true as it used to be.

http://www.cs.berkeley.edu/~samw/research/talks/sc07.pdf

A peek inside an FFT (more later in the semester)
(Figure caption: "Time wasted on the telephone.")

Tracing back the data dependency

New term for the day: MIMD
– MIMD (Multiple Instruction stream, Multiple Data stream) refers to most current parallel hardware, where each processor can independently execute its own instructions.
– The importance of MIMD over SIMD emerged in the early 1990s, as commodity processors became the basis of much parallel computing.
– One may also refer to a MIMD operation in an implementation, if one wishes to emphasize non-homogeneous execution. (Often contrasted with SIMD.)
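If the Parallel Computing Toolbox is available (an assumption; the slides do not rely on it), MATLAB's spmd blocks give a feel for MIMD execution: every worker runs the same program text, but branching on the worker index lets each one do entirely different work.

    % Hedged sketch: requires the Parallel Computing Toolbox.
    spmd
        if labindex == 1
            r = sum(rand(1e3,1));      % worker 1 does one kind of work
        else
            r = fft(rand(1e3,1));      % the other workers do another
        end
    end
    % Under SIMD-style thinking every worker would execute the same
    % instruction; here the instruction streams genuinely differ.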

Importance of Abstractions
– Ease of use requires that the very notion of a processor really should be buried underneath the user.
– Some think that the very requirements of performance require the opposite.
– I am fairly sure the above bullet is more false than true; you can be the ones to figure this all out!