
The Dilbert Approach

EE 183: Parallel Computing, Fall 2017, Tufts University
Instructor: Joel Grodstein (joel.grodstein@tufts.edu)
Lecture 1: Introduction

Why are they so different?

                      Haswell Server          Nvidia Pascal P100
  # cores             18                      3840
  Die area            660 mm2 (22 nm)         610 mm2 (16 nm)
  Frequency           2.3 GHz normal          1.3 GHz
  Max DRAM BW         100 GB/s                720 GB/s
  LLC size            2.5 MB/core (L3)        4 MB L2/chip
  LLC-1 size          256 KB/core (L2)        64 KB/SM (64 cores)
  Registers/core      180 (2 threads/core)    500
  Power               165 watts               300 watts
  Company market cap  $160B                   $90B

Why are they so different? EE 183 Joel Grodstein

Primary goals
Learn how hardware and software interact to result in performance (or the lack thereof), because life is multi-disciplinary.
Learn about parallel architectures: SIMD, multi-core, GPU. Why do we care? How did we get here? What might be coming?
Learn about concurrent-programming issues. Why is writing parallel programs so hard?

Secondary Goal
Learn some concurrent programming languages (C++ threads, CUDA). Pthreads has largely been superseded by C++ threads.
We're engineers, and we like to build stuff; writing a piece of software lets us actually make stuff that's useful. And fast. EE 183 Joel Grodstein

EE 193: topics, in order
Intro to parallel processing
C++ threads & CUDA – take 1
Quick review of EE126/Comp46 (caches, OOO, branch prediction, speculative execution, multicore)
SIMD instructions
More multicore: ring caches and MESI
Performance: false sharing and matrix multiply
The trickiness of concurrent programming
GPUs
Project or final

Instructor
Instructor: Joel Grodstein
Office hours: Monday/Wednesday 2-3 pm (i.e., before class), or email for an appointment; Halligan extension
Foils and homeworks are available on the course web page.
My background: 30 years in the semiconductor industry; my first semester as official ECE faculty.

A short history of computers
1970-now: computers got really fast, and the number of transistors doubled every 1-2 years. Result: superscalar, OOO, speculative execution, SMT.
2002: we ran out of fun things to do with the transistors, but the number of transistors still kept doubling. Logical result: multi-core (1-50 cores), plus SIMD (1 core) and GPUs (1000s of cores).
But there's a problem: when you combine a lot of little systems, you get a really big system, and human beings are not good at programming these. EE 183 Joel Grodstein

A Trend in Computing, Last Few Decades
Old conventional wisdom: trade performance for improved programmer productivity.
  Higher-level languages, interfaces, abstraction layers, frameworks; graphical user interfaces (GUIs).
  How many hardware instructions does it take to put “hello world” in a window? See “Spending Moore’s Dividend”, Jim Larus, CACM 2009.
New conventional wisdom: obtain performance by parallel programming.
  Programmers are given an additional burden: writing parallel software.
More conventional wisdom: parallel programming is intractably difficult.
  Conclusion: obtain performance by reducing programming productivity. Or, equivalently: life kind of sucks.

What is Old is New Again
Parallelism isn't new. It has long been commonplace for computational science and engineering, to tackle problems too large to solve on any one computer, and old-school "supercomputers" were also highly parallel.
Mainstream parallel computing has been the "next big thing" for decades; many companies bet on parallelism and failed. Why? One reason: non-parallel computers got faster so quickly.
OK, so why is parallelism so talked about now? The entire industry has bet on parallelism! It was driven to parallelism by technology and architectural realities (next); sequential (non-parallel) performance is lagging. Thus the need for parallel programmers & related research. Based on a slide by Katherine Yelick

Buying milk
You live with 3 roommates. Saturday morning you wake up, get the cereal, and notice a problem: no milk. Go to the store and buy milk. Get back home, open the fridge. It has 3 brand-new containers of milk in it! What went wrong?
Parallel systems are tricky! So tricky that we need an entire course about it. (OK, the course is more about performance than correctness.) Stolen with pride from Mark Sheldon

Any ideas how to fix this?
Sprint to the store & back really, really fast. Good luck with that strategy.
Padlock the fridge when you go to the store. People might get a bit mad. We call this a "critical section."
Take the shopping list with you when you go to the store. If there is no list on the fridge, is it because somebody else is at the store, or because nobody made a list yet? EE 183 Joel Grodstein

Some code
Does this work?

Person #1:
  Load m = "there is milk"
  if (m==false) { get milk from store; put it in fridge }

Person #2:
  Load m = "there is milk"
  if (m==false) { get milk from store; put it in fridge }

Does this work? No, as we've just shown: person #2 may do the load & check before person #1 returns. EE 183 Joel Grodstein
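A rough C++ threads sketch of the same check-then-act race (names, the number of roommates, and the sleep are made up for illustration; this is a sketch of the problem, not course code):

    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    bool there_is_milk = false;   // shared state: the fridge
    int  cartons       = 0;       // how many cartons got bought

    void roommate() {
        if (!there_is_milk) {     // Load m; check m==false...
            // ...then go shopping. The trip takes a while.
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            there_is_milk = true; // put it in the fridge (too late!)
            ++cartons;
        }
    }

    int main() {
        std::vector<std::thread> people;
        for (int i = 0; i < 4; ++i) people.emplace_back(roommate);
        for (auto& p : people) p.join();
        // With an unlucky interleaving everyone sees "no milk", and we end up
        // with several cartons. (Strictly, the unsynchronized accesses are a
        // data race, i.e., undefined behavior -- which is rather the point.)
        std::cout << "cartons bought: " << cartons << "\n";
    }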

Some code
Now we've fixed it!

Person #1:
  Load m = "there is milk"
  Load s = "there's a sign"
  if (!m && !s) { put a sign on the fridge; get milk from store; put it in fridge }

Person #2:
  Load m = "there is milk"
  Load s = "there's a sign"
  if (!m && !s) { put a sign on the fridge; get milk from store; put it in fridge }

EE 183 Joel Grodstein

Some code
Each core may execute at a different rate.

Person #1 (earlier):
  Load m = "there is milk"
  Load s = "there's a sign"
  if (!m && !s) { put a sign on the fridge; get milk from store; put it in fridge }

Person #2 (later):
  Load m = "there is milk"
  Load s = "there's a sign"     <- s=true, since the sign is already on the fridge
  if (!m && !s) { put a sign on the fridge; get milk from store; put it in fridge }

This way works fine. Or does it? EE 183 Joel Grodstein

Some code
Different timing breaks the algorithm.

Person #1:
  Load m = "there is milk"
  Load s = "there's a sign"
  if (!m && !s) { put a sign on the fridge; get milk from store; put it in fridge }

Person #2 (running at almost the same time):
  Load m = "there is milk"
  Load s = "there's a sign"     <- s=false, since the sign is not on the fridge yet
  if (!m && !s) { put a sign on the fridge; get milk from store; put it in fridge }

OK, the analogy isn’t perfect. But the message is that parallel programming is filled with corner cases. EE 183 Joel Grodstein
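For comparison, the "padlock the fridge" idea from two slides ago does work when the padlock is a real lock. A minimal sketch using std::mutex (illustrative names only; mutexes and critical sections are covered properly later in the course):

    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex fridge_lock;        // the padlock on the fridge
    bool there_is_milk = false;

    void roommate() {
        std::lock_guard<std::mutex> padlock(fridge_lock);  // critical section begins
        if (!there_is_milk)        // check and act are now one atomic step
            there_is_milk = true;  // buy milk, put it in the fridge
    }                              // padlock released when it goes out of scope

    int main() {
        std::vector<std::thread> people;
        for (int i = 0; i < 4; ++i) people.emplace_back(roommate);
        for (auto& p : people) p.join();
        // Exactly one roommate buys milk -- but everyone else is stuck waiting
        // at the fridge for the whole shopping trip, which is why the slide
        // says people might get a bit mad.
    }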

Example: add lots of numbers

  sum=0;
  for i=1…n { sum += f(i) }

If n is big, this can take a while. Idea: divvy up the work among multiple cores! Let each processor add n/p of the numbers. EE 183 Joel Grodstein

Problems? Ideas?

Core #1:
  extern sum=0;
  for i=1…n/2 { sum += f(i) }

Core #2:
  extern sum;
  for i=1+n/2…n { sum += f(i) }

Problems? sum += f(i) requires a load, add, store. What if, between the load and the store, somebody else went to the store and got milk too? Ideas? EE 183 Joel Grodstein
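Here is roughly what that lost update looks like in C++ threads (a hedged sketch; f(i) is just i here, and the exact wrong answer varies from run to run):

    #include <iostream>
    #include <thread>

    long long sum = 0;                  // shared by both cores, unprotected

    void add_half(int lo, int hi) {
        for (int i = lo; i <= hi; ++i)
            sum += i;                   // really a load, an add, and a store
    }

    int main() {
        const int n = 1000000;
        std::thread core1(add_half, 1, n / 2);
        std::thread core2(add_half, n / 2 + 1, n);
        core1.join();
        core2.join();
        // Updates from the two threads can interleave between the load and the
        // store, so some additions are lost (and the data race is undefined
        // behavior besides). Expected: n*(n+1)/2 = 500000500000.
        std::cout << "sum = " << sum << "\n";
    }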

How can we use p processors to speed this up?
Let each processor add n/p numbers itself, then add the p partial sums.

Core #0:
  sum[0]=0;
  for i=1…n/p { sum[0] += f(i) }

Core #1:
  sum[1]=0;
  for i=n/p+1…2n/p { sum[1] += f(i) }
  …

Then, on one core:
  int sum=0;
  for i=0..p-1 { sum += sum[i] }

But this last part only uses one core to do p additions. Can we do better? EE 183 Joel Grodstein
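A minimal C++ threads version of the partial-sums idea (illustrative only; f(i) is again just i, p is hard-coded, and n is assumed divisible by p):

    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        const int n = 1000000;
        const int p = 4;                       // number of cores/threads
        std::vector<long long> partial(p, 0);  // one private slot per thread
        std::vector<std::thread> workers;

        for (int t = 0; t < p; ++t) {
            workers.emplace_back([&partial, t, n, p] {
                // Each thread sums its own n/p chunk into its own slot,
                // so there is no shared variable to race on.
                // (Neighboring slots may still share a cache line --
                // "false sharing", a performance topic later in the course.)
                for (int i = t * (n / p) + 1; i <= (t + 1) * (n / p); ++i)
                    partial[t] += i;
            });
        }
        for (auto& w : workers) w.join();

        long long sum = 0;                     // the final combine: p additions
        for (int t = 0; t < p; ++t)            // done serially on one core --
            sum += partial[t];                 // the part the next slides improve
        std::cout << "sum = " << sum << "\n";
    }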

Better parallel algorithm Don’t make the master core do all the work. Share it among the other cores. Pair the cores so that core 0 adds its result with core 1’s result. Core 2 adds its result with core 3’s result, etc. Work with odd and even numbered pairs of cores. Copyright © 2010, Elsevier Inc. All rights Reserved

Better parallel algorithm (cont.) Repeat the process now with only the even cores. Core 0 (which has c0+c1) adds the result from core 2 (which has c2+c3). Core 4 (which has c4+c5) adds the result from core 6 (which has c6+c7), etc. Now repeat over and over, making a kind of binary tree, until core 0 has the final result. Copyright © 2010, Elsevier Inc. All rights Reserved
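A hedged sketch of that binary-tree combine, written as a loop over strides (sequential code standing in for the parallel passes; the values and names are purely illustrative):

    #include <iostream>
    #include <vector>

    int main() {
        // Assume each of 8 cores has already computed its partial sum.
        std::vector<long long> partial = {1, 2, 3, 4, 5, 6, 7, 8};
        const int p = static_cast<int>(partial.size());

        // Pass 1 (stride 1): core 0 += core 1, core 2 += core 3, ...
        // Pass 2 (stride 2): core 0 += core 2, core 4 += core 6, ...
        // Pass 3 (stride 4): core 0 += core 4.
        for (int stride = 1; stride < p; stride *= 2) {
            // In a real parallel version every pair in this inner loop runs at
            // the same time, so each pass costs one "time unit": log2(p) total.
            for (int t = 0; t + stride < p; t += 2 * stride)
                partial[t] += partial[t + stride];
        }
        std::cout << "total = " << partial[0] << "\n";   // 36 for the values above
    }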

Multiple cores form a global sum Adding 8 numbers only takes 3 time units this way  Copyright © 2010, Elsevier Inc. All rights Reserved

Analysis (cont.) The difference is more dramatic with a larger number of cores. If we have 1000 cores: The first algorithm would require the master to perform 999 receives and 999 additions. The second algorithm would only require 10 receives and 10 additions. That’s an improvement of almost a factor of 100! Copyright © 2010, Elsevier Inc. All rights Reserved

Multiple cores form a global sum But is it really that simple? Any issues? How does a consumer know when its data is ready? We’ve assumed that data can move from any core to any other core infinitely fast. Copyright © 2010, Elsevier Inc. All rights Reserved

A crossbar needs too many wires
How many wires for p processors? p*(p+1)/2, i.e., O(p^2). That does not scale well to hundreds of cores!
(figure: crossbar connecting cores C0-C4) EE 183 Joel Grodstein

Multiple cores form a global sum If only one core can talk to another at a time… the graph looks very different Copyright © 2010, Elsevier Inc. All rights Reserved

Motto Transistors are cheap Wires are expensive Software is really expensive Parallel software that actually works is priceless EE 183 Joel Grodstein

The rest of the course
So now we know we have some problems: parallel architecture is here today, but it's really hard to use it correctly, and to use it efficiently.
We'll learn techniques to avoid everyone buying milk: atomic operations, critical sections, mutexes, …
We'll learn more architecture: multicore, GPUs, SIMD.
We'll learn about performance: when is multicore appropriate? GPUs? SIMD? This will require learning (a lot) more about caches & coherence.
And of course… we can't really apply that knowledge unless we learn enough software to write some programs. EE 183 Joel Grodstein

That's all there is
That's all there is to the course… except, well, all of the actual details. Questions? EE 183 Joel Grodstein

Why we need ever-increasing performance Computational power is increasing, but so are our computation problems and needs. Problems we never dreamed of have been solved because of past increases, such as decoding the human genome. More complex problems are still waiting to be solved. Copyright © 2010, Elsevier Inc. All rights Reserved

Climate modeling Both weather prediction and climate change Copyright © 2010, Elsevier Inc. All rights Reserved

Protein folding Copyright © 2010, Elsevier Inc. All rights Reserved

Drug discovery Copyright © 2010, Elsevier Inc. All rights Reserved

Energy research Copyright © 2010, Elsevier Inc. All rights Reserved

Data analysis Copyright © 2010, Elsevier Inc. All rights Reserved

Resources "An Introduction to Parallel Programming" by Peter S. Pacheco (1st Ed 2011) online at Tisch good for concurrency problems not so great for hardware, architecture we're using C++ threads, which appeared after this book was written

“Computer Architecture: A Quantitative Approach,” Fifth Edition, John L. Hennessy and David A. Patterson, online at Tisch
  Great reference for architecture. I found the chapter on GPU architecture to be confusing (the terminology is inconsistent). EE 183 Joel Grodstein

“Matrix Computations,” Golub and Van Loan, 3rd edition
  The bible of matrix math (including how to make good use of memory). Not always easy to read for a beginner. One copy on reserve in Tisch.
Various GPU books are available online (listed in the syllabus). EE 183 Joel Grodstein

Prerequisites
ECE 126 (Computer Architecture) or similar:
  Logic design (computer arithmetic)
  Basic ISA (what is a RISC instruction)
  Pipelining (control/data hazards, forwarding)
  Basic caches and memory systems
  We'll spend 1-2 weeks reviewing it.
Reasonable C/C++ programming skills
UNIX/Linux experience

Grading
Grade formula:
  Programming assignments – 50% (C++ threads & CUDA)
  Quizzes – 30%
  Final or project – 20% (see the syllabus for project suggestions)
There will be about five quizzes, with the lowest quiz grade dropped.

Late Assignments + Academic Integrity
10% grade reduction per day.
Copying even small portions of assignments from other students or open-source projects and submitting them as your own is a violation of academic integrity. Sharing code with other students is also considered an offense. The best way to ensure that your code is your own is to have only high-level discussions with other students and never share a line of code.