EE 193: Parallel Computing


EE 193: Parallel Computing
Fall 2017, Tufts University
Instructor: Joel Grodstein (joel.grodstein@tufts.edu)
Simultaneous Multithreading (SMT)

Threads

We've talked a lot about threads. In a 16-core system, why would you want more than one thread? If you have fewer than 16 threads, some of the cores will just sit around idle. Why would you ever want more than 16 threads? Perhaps you have 100 users on a 16-core machine, and the O/S rotates them around for fairness. Today we'll talk about another reason.

EE 193 Joel Grodstein
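The "100 users on 16 cores" idea is just oversubscription: more runnable tasks than cores, multiplexed onto a fixed pool of threads. A minimal Python sketch (the serve function is a hypothetical stand-in for one user's request; the counts are illustrative):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def serve(user_id):
    # Hypothetical stand-in for one user's request; real work would do I/O or compute.
    return user_id * 2

n_cores = os.cpu_count() or 16   # e.g. 16 on the machine in the slide
# 100 "users" share the cores: the pool multiplexes the tasks onto
# n_cores worker threads, much as the O/S rotates users for fairness.
with ThreadPoolExecutor(max_workers=n_cores) as pool:
    results = list(pool.map(serve, range(100)))

print(len(results))   # all 100 users get served by only n_cores threads
```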

More pipelining and stalls

Consider a short assembly program. The architecture says that instructions are executed in order, and the 2nd instruction uses the r2 written by the first instruction:

load r2=mem[r1]
add  r5=r3+r2
add  r8=r6+r7

EE194/Comp140 Mark Hempstead

Pipelined instructions

| Instruction       | Cycle 1 | Cycle 2 | Cycle 3           | Cycle 4      | Cycle 5           | Cycle 6           | Cycle 7  | Cycle 8         |
|-------------------|---------|---------|-------------------|--------------|-------------------|-------------------|----------|-----------------|
| load r2=mem[r1]   | fetch   | read r1 | execute (nothing) | access cache | write r2          |                   |          |                 |
| add r5=r3+r2      |         | fetch   | read r3,r2        | add r3,r2    | load/st (nothing) | write r5          |          |                 |
| add r8=r6+r7      |         |         | fetch             | read r6,r7   | add r6,r7         | load/st (nothing) | write r8 |                 |
| store mem[r9]=r10 |         |         |                   | fetch        | read r9,r10       | execute (nothing) | store    | write (nothing) |

Pipelining cheats! It launches instructions before the previous ones finish. Hazards occur when one computation uses the results of another; the hazard exposes our sleight of hand.
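The stall arithmetic can be sketched with a toy in-order pipeline model. This is an illustrative sketch, not the slide's exact stage timing: it assumes registers are read one cycle after issue, results are written back four cycles after issue, and there is no bypassing.

```python
# Toy in-order pipeline: an instruction stalls until every source register
# has been written back by earlier instructions.
# Stage timing is illustrative, not a specific CPU's.

def schedule(instrs):
    """instrs: list of (dest, [srcs]). Returns each instruction's issue cycle."""
    ready = {}            # register -> cycle its value is written back
    issue_cycles = []
    cycle = 0
    for dest, srcs in instrs:
        # Register read happens at issue+1; stall until all sources are written.
        while any(ready.get(r, 0) > cycle + 1 for r in srcs):
            cycle += 1    # insert a bubble
        issue_cycles.append(cycle)
        ready[dest] = cycle + 4   # writeback is 4 cycles after issue
        cycle += 1
    return issue_cycles

prog = [("r2", ["r1"]),        # load r2=mem[r1]
        ("r5", ["r3", "r2"]),  # add r5=r3+r2  -- RAW hazard on r2
        ("r8", ["r6", "r7"])]  # add r8=r6+r7  -- no hazard
print(schedule(prog))          # -> [0, 3, 4]: the dependent add stalls two cycles
```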

Dependence

Pipelining works best when instructions don't use results from the just-previous instruction. But instructions in a thread tend to be working together on a common mission, so back-to-back dependences are common; this isn't good. Threads, on the other hand, are independent: each thread has its own register file, by definition, and threads only interact via shared memory or messages. One thread cannot read a register that another thread wrote. Idea: what if we have one core execute many threads? Does that sound at all clever or useful?

Your pipeline, on SMT

| Instruction     | Thread | Cycle 1 | Cycle 2 | Cycle 3           | Cycle 4      | Cycle 5           | Cycle 6           | Cycle 7           | Cycle 8  |
|-----------------|--------|---------|---------|-------------------|--------------|-------------------|-------------------|-------------------|----------|
| load r2=mem[r1] | 0      | fetch   | read r1 | execute (nothing) | access cache | write r2          |                   |                   |          |
| add r4=r2+r3    | 1      |         | fetch   | read r2,r3        | add r2,r3    | load/st (nothing) | write r4          |                   |          |
| add r6=r2+r7    | 2      |         |         | fetch             | read r2,r7   | add r2,r7         | load/st (nothing) | write r6          |          |
| add r5=r3+r2    | 0      |         |         |                   | fetch        | read r3,r2        | add r3,r2         | load/st (nothing) | write r5 |

We still have our hazard (from the load of r2 to the final add). The intervening instructions are not hazards: even though they (coincidentally) use r2, each thread uses its own r2. By the time thread 0 reads r2, r2 is in fact ready. No stalls needed!
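A toy round-robin SMT issue model shows why the cross-thread r2s don't conflict: key every register by (thread, name). The latencies are illustrative (operands read one cycle after issue, results written back four cycles later, no bypassing), not a real CPU's.

```python
# Toy round-robin SMT: registers are private per thread, so a register
# written by one thread never stalls another thread.

def schedule_smt(instrs):
    """instrs: list of (thread, dest, [srcs]), issued round-robin in order.
    Returns each instruction's issue cycle."""
    ready = {}            # (thread, reg) -> cycle its value is written back
    issue = []
    cycle = 0
    for tid, dest, srcs in instrs:
        while any(ready.get((tid, r), 0) > cycle + 1 for r in srcs):
            cycle += 1    # stall (never triggers across threads)
        issue.append(cycle)
        ready[(tid, dest)] = cycle + 4
        cycle += 1
    return issue

prog = [(0, "r2", ["r1"]),        # thread 0: load r2=mem[r1]
        (1, "r4", ["r2", "r3"]),  # thread 1's r2 is a different register
        (2, "r6", ["r2", "r7"]),  # thread 2's r2 too
        (0, "r5", ["r3", "r2"])]  # thread 0 reads its own r2 -- ready by now
print(schedule_smt(prog))         # -> [0, 1, 2, 3]: no stalls
```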

Problems with SMT

SMT is great! Given enough threads, we rarely need to stall, and everyone does it nowadays (Intel calls it Hyper-Threading). How many threads should we use?

Your pipeline, on SMT

| Instruction     | Thread | Cycle 1 | Cycle 2 | Cycle 3           | Cycle 4 | Cycle 5           | Cycle 6           | Cycle 7           | Cycle 8  |
|-----------------|--------|---------|---------|-------------------|---------|-------------------|-------------------|-------------------|----------|
| load r2=mem[r1] | 0      | fetch   | read r1 | execute (nothing) | L1 miss | write r2          |                   |                   |          |
| add r4=r2+r3    | 1      |         | fetch   | read r2,r3        | add r2,r3 | load/st (nothing) | write r4        |                   |          |
| add r6=r2+r7    | 2      |         |         | fetch             | read r2,r7 | add r2,r7        | load/st (nothing) | write r6          |          |
| add r5=r3+r2    | 0      |         |         |                   | fetch   | read r3,r2        | add r3,r2         | load/st (nothing) | write r5 |

What if mem[r1] is not in the L1 cache, but in the L2? It will take more cycles to get the data, and then the final add will have to wait. Could we just stick an extra few instructions from other threads between the load and the add, so we don't need the stall? (Note we've drawn an out-of-order pipe, where the "add r4" can finish before the load.)
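The effect of a longer load latency can be sketched with a toy round-robin SMT model (illustrative numbers: ALU results are ready four cycles after issue; the load_latency parameter models an L1 hit vs. a miss and is an assumption, not real cache timing):

```python
# Toy round-robin SMT with a parameterized load latency: when the load
# misses the L1, two intervening threads are no longer enough to hide it.

def schedule_with_latency(instrs, load_latency=4):
    """instrs: list of (thread, dest, [srcs], is_load), issued round-robin.
    Returns each instruction's issue cycle."""
    ready = {}            # (thread, reg) -> cycle its value is available
    issue = []
    cycle = 0
    for tid, dest, srcs, is_load in instrs:
        while any(ready.get((tid, r), 0) > cycle + 1 for r in srcs):
            cycle += 1    # stall waiting for this thread's own registers
        issue.append(cycle)
        ready[(tid, dest)] = cycle + (load_latency if is_load else 4)
        cycle += 1
    return issue

prog = [(0, "r2", ["r1"], True),         # thread 0's load
        (1, "r4", ["r2", "r3"], False),  # independent threads in between
        (2, "r6", ["r2", "r7"], False),
        (0, "r5", ["r3", "r2"], False)]  # thread 0's dependent add
print(schedule_with_latency(prog, load_latency=4))  # L1 hit:  [0, 1, 2, 3]
print(schedule_with_latency(prog, load_latency=8))  # L1 miss: [0, 1, 2, 7]
```

With a miss, thread 0's add still issues four cycles late; more threads (or issuing from any non-stalled thread) would fill those bubbles.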

Problems with SMT

Intel, AMD, ARM, etc., only use 2-way SMT. There must be a reason… SMT is great because each thread uses its own register file; with 10-way SMT, each core would need ten register files. So why don't we do 10-way SMT? The mantra of memory: there's no way to fit lots of memory on prime real estate. Too much SMT makes the register files big and slow, so we're stuck with 2-way SMT. SMT is perhaps a GPU's biggest trick, and there it's much more than 2-way; more on that later. In practice, we don't just alternate threads: we issue instructions from whichever thread isn't stalled.
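The "issue from whichever thread isn't stalled" policy can be sketched as follows. This is an assumed model, not any specific CPU's scheduler; each instruction carries an illustrative result latency.

```python
# Each cycle, issue the first instruction whose thread has all of its
# source registers ready; a stalled thread is simply skipped.

def issue_order(threads):
    """threads: per-thread lists of (dest, srcs, result_latency).
    Returns the order (thread_id, dest) in which instructions issue."""
    pcs = [0] * len(threads)
    ready = {}                        # (tid, reg) -> cycle value is available
    order, cycle = [], 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        for tid, t in enumerate(threads):
            if pcs[tid] >= len(t):
                continue              # this thread has finished
            dest, srcs, lat = t[pcs[tid]]
            if all(ready.get((tid, r), 0) <= cycle for r in srcs):
                order.append((tid, dest))
                ready[(tid, dest)] = cycle + lat
                pcs[tid] += 1
                break                 # at most one issue per cycle
        cycle += 1
    return order

t0 = [("r2", ["r1"], 8),   # a load that misses the L1: result 8 cycles out
      ("r5", ["r2"], 1)]   # dependent add: would stall a single thread
t1 = [("r4", ["r3"], 1),   # an independent thread fills the stall cycles
      ("r6", ["r4"], 1)]
print(issue_order([t0, t1]))
```

Thread 1's two adds issue while thread 0 waits for its load, and thread 0's dependent add goes last.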