18.337 Parallel Computing’s Challenges

Presentation transcript:

Parallel Computing’s Challenges

Old Homework (emphasized for effect)
Download a parallel program from somewhere.
– Make it work.
Download another parallel program.
– Now, …, make them work together!

SIMD
SIMD (Single Instruction, Multiple Data) refers to parallel hardware that can execute the same instruction on multiple data. (Think of the addition of two vectors: one add instruction applies to every element of the vector.)
– The term was coined with one element per processor in mind, but with today’s deep memories and hefty processors, large chunks of the vectors would be added on one processor.
– The term was coined with the broadcasting of an instruction in mind, hence the “single instruction,” but today’s machines are usually more flexible.
– The term was coined with A+B and elementwise A×B in mind, so nobody really knows for sure whether matmul or fft is SIMD or not, but these operations can certainly be built from SIMD operations. Today, it is not unusual to refer to a SIMD operation (sometimes, but not always, historically synonymous with data-parallel operations, though this feels wrong to me) when the software appears to run “lock-step” with every processor executing the same instruction.
– Usage: “I hear that machine is particularly fast when the program primarily consists of SIMD operations.”
– Graphics processors such as NVIDIA’s seem to run fastest on SIMD-type operations, but current research (and old research too) pushes the limits of SIMD.
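A minimal sketch of the elementwise operations the term was coined around, written in MATLAB/Octave notation (the notation used later in these notes); the particular vectors are just made-up values.

    % Elementwise operations of the A+B / A.*B flavor: each output element
    % depends only on the matching input elements, so one conceptual
    % instruction covers all n data items.
    a = [1 2 3 4];
    b = [10 20 30 40];
    s = a + b;     % elementwise add      -> [11 22 33 44]
    p = a .* b;    % elementwise multiply -> [10 40 90 160]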

Natural Question may not be the most important
How do I parallelize x?
– The first question many students ask.
– The answer is often either fairly obvious or very difficult.
– It can miss the true issues of high performance:
These days people are often good at exploiting locality for performance.
People are not very good at hiding communication and anticipating data movement to avoid bottlenecks.
People are not very good at interweaving multiple functions to make the best use of resources.
– It usually misses the issue of interoperability:
Will my program play nicely with your program?
Will my program really run on your machine?

Class Notation
Vectors: small roman letters: x, y, …
Vectors have length n if possible.
Matrices: large roman (sometimes Greek) letters: A, B, X, Λ, Σ.
Matrices are n x n, or maybe m x n, but almost never n x m. Could be p x q.
Scalars may be small Greek letters or small roman letters – may not be as consistent.

Algorithm Example: FFTs
For now think of an FFT as a “black box”: y = FFT(x) takes as input a vector x of length n and returns a vector y of length n, defined (but not computed) as a matrix times a vector: y = F_n x, where (F_n)_jk = e^(-2πijk/n) for j, k = 0, …, n-1.
Important Use Cases
– Column fft: fft(X), fft(X,[],1) (MATLAB)
– Row fft: fft(X,[],2) (MATLAB)
– 2d fft (do a row and a column): fft2(X)
fft2(X) = row_fft(col_fft(X)) = col_fft(row_fft(X))
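As a sketch of the definition (assuming MATLAB or GNU Octave), one can form F_n explicitly and compare F_n x against fft(x); real FFT code never builds this matrix, so this is only to pin down the convention.

    % Build the n-by-n DFT matrix Fn with (Fn)_jk = exp(-2*pi*i*j*k/n)
    % and check that Fn*x matches the library fft(x).
    n = 8;
    [j, k] = ndgrid(0:n-1, 0:n-1);
    Fn = exp(-2*pi*1i*j.*k/n);
    x = randn(n, 1);
    assert(norm(Fn*x - fft(x)) < 1e-8)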

How to implement a column FFT?
Put block columns on each processor.
Do local column FFTs.
[Figure: matrix split into block columns across processors P0, P1, P2]
Local column FFTs may be “column at a time” or “pipelined.”
In the case of the FFT a fast local package is probably available, but that may not be true for other operations. Also, as MIT students have been known to do, you might try to beat the packages.
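A serial sketch of the local step only, where Xlocal is an assumed stand-in for the block of columns owned by one processor: “whole block at once” and “column at a time” give the same answer; how the blocks get to the processors is the part this sketch leaves out.

    % One processor's share of a column FFT: transform each local column.
    Xlocal = randn(8, 3);            % stand-in for this processor's block columns
    Yblock = fft(Xlocal, [], 1);     % all local columns at once
    Ycol = zeros(size(Xlocal));
    for c = 1:size(Xlocal, 2)        % or one column at a time (pipelineable)
        Ycol(:, c) = fft(Xlocal(:, c));
    end
    assert(norm(Yblock - Ycol) < 1e-8)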

A closer look at column fft
Put block columns on each processor.
Where were the columns? Where are they going?
The data movement above can be very expensive in performance. Can we hide it somewhere?

What about row fft?
Suppose block columns on each processor.
Many would transpose, then apply column FFTs, and transpose back.
This thinking is simple and doable.
Not only is it simple, but it encourages the paradigm of 1) do whatever, 2) get good parallelism, and 3) do whatever.
It is harder to decide whether to do rows in parallel or to interweave transposing of pieces and start computation.
– There may be more performance available, but nobody to my knowledge has done a good job of this yet. You could be the first.
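A serial sketch of the transpose route described above; in a distributed setting the .' operations stand in for the expensive all-to-all data movement.

    % Row FFT two ways: directly along rows, or transpose / column FFT / transpose back.
    X = randn(5, 7);
    Yrow = fft(X, [], 2);            % FFT along each row
    Yvia = fft(X.', [], 1).';        % transpose, column FFT, transpose back
    assert(norm(Yrow - Yvia) < 1e-8)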

Not load balanced column fft?
Suppose block columns on each processor.
To load balance or not to load balance, that is the question.
Traditional wisdom says this is badly load balanced and parallelism is lost, but there is a cost of moving the data which may or may not be worth the gain in load balancing.

2d fft
Suppose block columns on each processor.
– Can do columns, transpose, rows, transpose.
– Can do transpose, rows, transpose, columns.
– Can we be fancier?
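The two orderings above, written serially as a sketch; again the transposes mark where a distributed implementation would pay for data movement.

    % 2-D FFT via the two orderings listed above.
    X = randn(6, 4);
    A = fft(fft(X, [], 1).', [], 1).';   % columns, transpose, rows, transpose
    B = fft(fft(X.', [], 1).', [], 1);   % transpose, rows, transpose (back), columns
    assert(norm(fft2(X) - A) < 1e-8)
    assert(norm(fft2(X) - B) < 1e-8)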

So much has to do with access to memory and data movement
The conventional wisdom is that it’s all about locality. This remains partially true and partially not quite as true as it used to be.

A peek inside an FFT (more later in the semester)
[Figure: time wasted on the telephone]

Tracing Back the data dependency

New term for the day: MIMD
MIMD (Multiple Instruction stream, Multiple Data stream) refers to most current parallel hardware, where each processor can independently execute its own instructions. The importance of MIMD over SIMD emerged in the early 1990s, as commodity processors became the basis of much parallel computing. One may also refer to a MIMD operation in an implementation, if one wishes to emphasize non-homogeneous execution. (Often contrasted with SIMD.)

Importance of Abstractions
Ease of use requires that the very notion of a processor really should be buried underneath the user.
Some think that the very requirements of performance require the opposite.
I am fairly sure the above bullet is more false than true – you can be the ones to figure this all out!