EE 155 / Comp 122 Parallel Computing

EE 155 / Comp 122 Parallel Computing
Spring 2019, Tufts University
Instructor: Joel Grodstein (joel.grodstein@tufts.edu)

What have we learned about architecture?

Caches
- Why do we have caches?
  - While main-memory density has greatly increased over the years, its speed has not kept up with CPU speed
  - Caches let you put a small amount of high-speed memory near the CPU, so the CPU doesn't spend lots of time waiting for memory
- When do caches work well, and not so well?
  - They work well when your program has temporal and spatial locality. Otherwise, not.
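The locality point can be sketched in code. This example is not from the slides (the function names are mine): both loops compute the same sum, but only one of them walks memory in cache-line order.

```cpp
#include <cstddef>
#include <vector>

// Sum an n x n matrix stored row-major. The inner loop reads
// consecutive addresses (stride 1), so every cache line fetched is
// fully used before eviction: good spatial locality.
double sum_row_major(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];          // stride-1 access: cache-friendly
    return s;
}

// Same arithmetic, but the inner loop jumps n elements at a time, so
// for large n it touches a new cache line on almost every access:
// poor spatial locality.
double sum_col_major(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];          // stride-n access: cache-hostile
    return s;
}
```

Both return identical results; for matrices much larger than the cache, the column-major version can be several times slower.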

Branch predictors
- What is branch prediction?
  - Instead of stalling until you know whether a branch is taken, just make your best guess and execute your choice
  - Be prepared to undo stuff if you were wrong
  - Making your best guess usually relies on the past history of each branch
- What are the pros and cons?
  - If your guess turns out correct, then good: you run fast
  - If not, then undoing everything costs energy
  - The predictor takes lots of area and energy
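Here is a minimal sketch (mine, not from the slides) of a branch whose predictability depends entirely on the data, the classic sorted-vs-unsorted example:

```cpp
#include <vector>

// Count elements >= 128. The branch outcome depends purely on the
// data: on random input the predictor mispredicts roughly half the
// time. If the vector is sorted first, the outcomes become one long
// run of "not taken" followed by "taken", which history-based
// predictors handle almost perfectly, so the loop runs much faster.
long count_big(const std::vector<int>& v) {
    long n = 0;
    for (int x : v)
        if (x >= 128)        // data-dependent branch
            ++n;
    return n;
}
```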

Out-of-order execution
- What is OOO?
  - Forget about executing instructions in program order
  - Look ahead at the next 10-100 instructions to find any whose operands are ready
  - Execute in dataflow order (i.e., whoever is ready)
  - If something goes wrong (like an earlier instruction taking an exception), then undo all instructions after the exception (using a reorder buffer)
- Pros and cons?
  - As usual: it helps you run fast, but it costs area and power
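A small illustration (my example, not from the slides) of what "dataflow order" buys: the first loop is one long dependency chain that OOO hardware cannot overlap; the second exposes two independent chains whose operands are ready every cycle.

```cpp
// One accumulator = one long dependency chain: each add must wait for
// the previous result, so even a wide OOO core finishes one add per
// add-latency.
double sum_one_chain(const double* a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Two independent accumulators give the OOO scheduler two chains it
// can execute in parallel, in dataflow order rather than program order.
double sum_two_chains(const double* a, int n) {
    double s0 = 0.0, s1 = 0.0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];          // chain 0
        s1 += a[i + 1];      // chain 1: independent of chain 0
    }
    if (i < n) s0 += a[i];   // leftover element when n is odd
    return s0 + s1;
}
```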

How far should we go?
- How many transistors should we spend on big caches?
  - Caches cost power and area, and don't do any computing
  - If you organize your code very cleverly, you can often get it to run fast without needing a lot of cache
  - But organizing your code cleverly takes time, time is money, and most customers prefer software to be cheap or free
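"Organizing your code cleverly" usually means tiling. As a sketch (mine, under the assumption of a row-major n x n matrix), a blocked transpose keeps both the tile being read and the tile being written resident in cache:

```cpp
// Tiled (blocked) transpose: process the matrix in B x B tiles so the
// source rows and destination columns of a tile stay in cache,
// instead of streaming the destination with a stride-n pattern that
// evicts lines before they are reused.
void transpose_blocked(const double* a, double* t, int n, int B) {
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int i = ii; i < ii + B && i < n; ++i)
                for (int j = jj; j < jj + B && j < n; ++j)
                    t[j * n + i] = a[i * n + j];
}
```

Picking B so that two B x B tiles of doubles fit in L1 is exactly the kind of per-machine tuning effort the slide says customers rarely want to pay for.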

How far should we go?
- How many transistors should we spend on branch prediction?
  - The BP itself costs power and area, but does no computation
  - A BP is necessary for good single-stream performance
  - The difficult-to-predict branches are often data-dependent; a better/bigger algorithm won't help much
  - It would be really nice if we just didn't care about single-stream performance
  - But we do, usually.
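When a data-dependent branch is truly unpredictable, one software escape hatch is to remove the branch entirely. A sketch (my example; it assumes `a - b` does not overflow and an arithmetic right shift on negative ints, which mainstream compilers and targets provide):

```cpp
// Compare-and-pick: may compile to a branch that mispredicts often
// on unpredictable data (though compilers frequently emit cmov here).
int max_branchy(int a, int b) { return (a > b) ? a : b; }

// Branch-free version: pure arithmetic, so there is nothing for the
// predictor to get wrong.
int max_branchless(int a, int b) {
    int diff = a - b;             // assumes no signed overflow
    int mask = diff >> 31;        // all ones if a < b, else zero
    return a - (diff & mask);     // a if a >= b, else a - (a-b) = b
}
```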

How far should we go?
- How many transistors should we spend on OOO infrastructure?
  - A big ROB costs area and power, and doesn't do any computing
  - Instructions per cycle is hitting a wall; there's just not that much parallelism in most code (no matter how hard your OOO transistors try)
  - But OOO makes a really big difference in single-stream performance.

So what do we do?
- Keep adding more transistors?
  - Bigger caches, bigger branch predictors, more OOO
  - That will cost more and more power for very little execution speed
  - Everybody stopped doing this 10 years ago
  - And, in fact, single-stream performance is no longer improving very quickly
- Instead: more cores
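"More cores" in its smallest form (a sketch of mine, not course code; compile with `-pthread` on gcc/clang): split the data and sum the two halves on two threads.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum a vector on two cores: one thread handles the first half while
// the calling thread handles the second, then the partial sums are
// combined. The halves are independent, so no locking is needed.
double parallel_sum(const std::vector<double>& v) {
    const std::size_t mid = v.size() / 2;
    double lo = 0.0;
    std::thread t([&] {
        lo = std::accumulate(v.begin(), v.begin() + mid, 0.0);
    });
    double hi = std::accumulate(v.begin() + mid, v.end(), 0.0);
    t.join();                 // wait for the helper before reading lo
    return lo + hi;
}
```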

CPU vs. GPU, once more

                    Haswell Server            Nvidia Pascal P100
  # cores           18                        3840
  Die area          660 mm2 (22 nm)           610 mm2 (16 nm)
  Frequency         2.3 GHz normal            1.3 GHz
  Max DRAM BW       100 GB/s                  720 GB/s
  LLC size          2.5 MB/core (L3)          4 MB L2/chip
  LLC-1 size        256K/core (L2), 64B/c     64 KB/SM (64 cores)
  Registers/core    180 per 2 threads         1000
  Power             165 watts                 300 watts
  Market cap        $160B                     $90B

- How can a GPU fit so many cores in the same area?
  - Its cores do not have OOO, speculation, BP, large caches, ...
- But won't a GPU have lousy single-thread performance?
  - Yes. That's not their market.

CPU vs. GPU, once more (same table as above)
- How can a GPU get away with so little cache?
  - The programmer is responsible for highly optimizing the algorithms
- But won't the software be hard to write?
  - Yes.

Approaches to the serial problem
- Rewrite serial programs so that they're parallel
  - This is not always easy
- Write translation programs that automatically convert serial programs into parallel programs
  - This is very difficult to do
  - Success has been limited
  - No magic bullet has been found yet
Copyright © 2010, Elsevier Inc. All rights Reserved
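A tiny before/after of the "rewrite by hand" approach (my sketch, not from the Pacheco slides; compile with `-pthread`). The point is that the programmer, not a translator, proves the two halves are independent:

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Serial original: count the even numbers.
long count_even_serial(const std::vector<int>& v) {
    return std::count_if(v.begin(), v.end(),
                         [](int x) { return x % 2 == 0; });
}

// Hand-parallelized rewrite: split the range and count the first half
// on another thread via std::async while this thread counts the rest.
long count_even_parallel(const std::vector<int>& v) {
    auto is_even = [](int x) { return x % 2 == 0; };
    auto mid = v.begin() + v.size() / 2;
    auto fut = std::async(std::launch::async, [&v, mid, is_even] {
        return std::count_if(v.begin(), mid, is_even);
    });
    long hi = std::count_if(mid, v.end(), is_even);
    return fut.get() + hi;      // join and combine the partial counts
}
```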

What kind of multicore?
- We will build multicore CPUs
  - Everyone has conceded this; even the iPhone X is a six-core chip
- But what kind of cores?
- "Big" cores, with lots of cache, OOO, BP
  - Less power efficient, but gives you good single-thread performance
  - Intel Xeon has followed this route
- Small cores: much less cache, OOO, BP
  - More power efficient, and more cores can fit, so multithread performance is better
  - But single-thread performance is unacceptable, and programming is challenging
  - GPUs have taken this route, to the extreme
- The iPhone X mixes the two: 2 fast cores and 4 slower, power-efficient ones (plus the neural engine)

What to remember
- Caches:
  - Write your code to have temporal and spatial locality
  - Easier said than done sometimes! But extremely important
  - And try to make your working set fit in cache (or break the problem into sub-problems that do)
- Branch prediction:
  - Big, power-hungry, and usually still necessary for single-stream performance
  - Data-dependent branches are slow
- OOO:
  - Also costs area and power, but makes a really big difference in single-stream performance