EE 193: Parallel Computing


EE 193: Parallel Computing
Fall 2017, Tufts University
Instructor: Joel Grodstein (joel.grodstein@tufts.edu)
Simultaneous Multithreading (SMT)

Threads

We've talked a lot about threads. In a 16-core system, why would you want more than one thread? If you have fewer than 16 threads, some of the cores will just sit around idle. Why would you ever want more than 16 threads? Perhaps you have 100 users on a 16-core machine, and the O/S rotates them around for fairness. Today we'll talk about another reason.

EE 193 Joel Grodstein
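The "100 users on 16 cores" idea is just oversubscription: more runnable tasks than cores, multiplexed onto a fixed pool of threads. A minimal Python sketch (the serve function is a hypothetical stand-in for one user's request; the counts are illustrative):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def serve(user_id):
    # Hypothetical stand-in for one user's request; real work would do I/O or compute.
    return user_id * 2

n_cores = os.cpu_count() or 16   # e.g. 16 on the machine in the slide
# 100 "users" share the cores: the pool multiplexes the tasks onto
# n_cores worker threads, much as the O/S rotates users for fairness.
with ThreadPoolExecutor(max_workers=n_cores) as pool:
    results = list(pool.map(serve, range(100)))

print(len(results))   # all 100 users get served by only n_cores threads
```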

More pipelining and stalls

Consider a short assembly program. The architecture says that instructions are executed in order, and the 2nd instruction uses the r2 written by the first instruction:

load r2=mem[r1]
add  r5=r3+r2
add  r8=r6+r7

EE194/Comp140 Mark Hempstead

Pipelined instructions

| Instruction       | Cycle 1 | Cycle 2 | Cycle 3           | Cycle 4      | Cycle 5           | Cycle 6           | Cycle 7  | Cycle 8         |
|-------------------|---------|---------|-------------------|--------------|-------------------|-------------------|----------|-----------------|
| load r2=mem[r1]   | fetch   | read r1 | execute (nothing) | access cache | write r2          |                   |          |                 |
| add r5=r3+r2      |         | fetch   | read r3,r2        | add r3,r2    | load/st (nothing) | write r5          |          |                 |
| add r8=r6+r7      |         |         | fetch             | read r6,r7   | add r6,r7         | load/st (nothing) | write r8 |                 |
| store mem[r9]=r10 |         |         |                   | fetch        | read r9,r10       | execute (nothing) | store    | write (nothing) |

Pipelining cheats! It launches instructions before the previous ones finish. Hazards occur when one computation uses the results of another; the hazard exposes our sleight of hand.
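The stall arithmetic can be sketched with a toy in-order pipeline model. This is an illustrative sketch, not the slide's exact stage timing: it assumes registers are read one cycle after issue, results are written back four cycles after issue, and there is no bypassing.

```python
# Toy in-order pipeline: an instruction stalls until every source register
# has been written back by earlier instructions.
# Stage timing is illustrative, not a specific CPU's.

def schedule(instrs):
    """instrs: list of (dest, [srcs]). Returns each instruction's issue cycle."""
    ready = {}            # register -> cycle its value is written back
    issue_cycles = []
    cycle = 0
    for dest, srcs in instrs:
        # Register read happens at issue+1; stall until all sources are written.
        while any(ready.get(r, 0) > cycle + 1 for r in srcs):
            cycle += 1    # insert a bubble
        issue_cycles.append(cycle)
        ready[dest] = cycle + 4   # writeback is 4 cycles after issue
        cycle += 1
    return issue_cycles

prog = [("r2", ["r1"]),        # load r2=mem[r1]
        ("r5", ["r3", "r2"]),  # add r5=r3+r2  -- RAW hazard on r2
        ("r8", ["r6", "r7"])]  # add r8=r6+r7  -- no hazard
print(schedule(prog))          # -> [0, 3, 4]: the dependent add stalls two cycles
```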

Dependence

Pipelining works best when instructions don't use results from the just-previous instruction. But instructions in a thread tend to be working together on a common mission, so back-to-back dependences are common; this isn't good. Threads, on the other hand, are independent: each thread has its own register file, by definition, and threads only interact via shared memory or messages. One thread cannot read a register that another thread wrote. Idea: what if we have one core execute many threads? Does that sound at all clever or useful?

Your pipeline, on SMT

| Instruction     | Thread | Cycle 1 | Cycle 2 | Cycle 3           | Cycle 4      | Cycle 5           | Cycle 6           | Cycle 7           | Cycle 8  |
|-----------------|--------|---------|---------|-------------------|--------------|-------------------|-------------------|-------------------|----------|
| load r2=mem[r1] | 0      | fetch   | read r1 | execute (nothing) | access cache | write r2          |                   |                   |          |
| add r4=r2+r3    | 1      |         | fetch   | read r2,r3        | add r2,r3    | load/st (nothing) | write r4          |                   |          |
| add r6=r2+r7    | 2      |         |         | fetch             | read r2,r7   | add r2,r7         | load/st (nothing) | write r6          |          |
| add r5=r3+r2    | 0      |         |         |                   | fetch        | read r3,r2        | add r3,r2         | load/st (nothing) | write r5 |

We still have our hazard (from the load of r2 to the final add). The intervening instructions are not hazards: even though they (coincidentally) use r2, each thread uses its own r2. By the time thread 0 reads r2, r2 is in fact ready. No stalls needed!
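A toy round-robin SMT issue model shows why the cross-thread r2s don't conflict: key every register by (thread, name). The latencies are illustrative (operands read one cycle after issue, results written back four cycles later, no bypassing), not a real CPU's.

```python
# Toy round-robin SMT: registers are private per thread, so a register
# written by one thread never stalls another thread.

def schedule_smt(instrs):
    """instrs: list of (thread, dest, [srcs]), issued round-robin in order.
    Returns each instruction's issue cycle."""
    ready = {}            # (thread, reg) -> cycle its value is written back
    issue = []
    cycle = 0
    for tid, dest, srcs in instrs:
        while any(ready.get((tid, r), 0) > cycle + 1 for r in srcs):
            cycle += 1    # stall (never triggers across threads)
        issue.append(cycle)
        ready[(tid, dest)] = cycle + 4
        cycle += 1
    return issue

prog = [(0, "r2", ["r1"]),        # thread 0: load r2=mem[r1]
        (1, "r4", ["r2", "r3"]),  # thread 1's r2 is a different register
        (2, "r6", ["r2", "r7"]),  # thread 2's r2 too
        (0, "r5", ["r3", "r2"])]  # thread 0 reads its own r2 -- ready by now
print(schedule_smt(prog))         # -> [0, 1, 2, 3]: no stalls
```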

Problems with SMT

SMT is great! Given enough threads, we rarely need to stall, and everyone does it nowadays (Intel calls it Hyper-Threading). How many threads should we use?

Your pipeline, on SMT

| Instruction     | Thread | Cycle 1 | Cycle 2 | Cycle 3           | Cycle 4 | Cycle 5           | Cycle 6           | Cycle 7           | Cycle 8  |
|-----------------|--------|---------|---------|-------------------|---------|-------------------|-------------------|-------------------|----------|
| load r2=mem[r1] | 0      | fetch   | read r1 | execute (nothing) | L1 miss | write r2          |                   |                   |          |
| add r4=r2+r3    | 1      |         | fetch   | read r2,r3        | add r2,r3 | load/st (nothing) | write r4        |                   |          |
| add r6=r2+r7    | 2      |         |         | fetch             | read r2,r7 | add r2,r7        | load/st (nothing) | write r6          |          |
| add r5=r3+r2    | 0      |         |         |                   | fetch   | read r3,r2        | add r3,r2         | load/st (nothing) | write r5 |

What if mem[r1] is not in the L1 cache, but in the L2? It will take more cycles to get the data, and then the final add will have to wait. Could we just stick an extra few instructions from other threads between the load and the add, so we don't need the stall? (Note we've drawn an out-of-order pipe, where the "add r4" can finish before the load.)
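The effect of a longer load latency can be sketched with a toy round-robin SMT model (illustrative numbers: ALU results are ready four cycles after issue; the load_latency parameter models an L1 hit vs. a miss and is an assumption, not real cache timing):

```python
# Toy round-robin SMT with a parameterized load latency: when the load
# misses the L1, two intervening threads are no longer enough to hide it.

def schedule_with_latency(instrs, load_latency=4):
    """instrs: list of (thread, dest, [srcs], is_load), issued round-robin.
    Returns each instruction's issue cycle."""
    ready = {}            # (thread, reg) -> cycle its value is available
    issue = []
    cycle = 0
    for tid, dest, srcs, is_load in instrs:
        while any(ready.get((tid, r), 0) > cycle + 1 for r in srcs):
            cycle += 1    # stall waiting for this thread's own registers
        issue.append(cycle)
        ready[(tid, dest)] = cycle + (load_latency if is_load else 4)
        cycle += 1
    return issue

prog = [(0, "r2", ["r1"], True),         # thread 0's load
        (1, "r4", ["r2", "r3"], False),  # independent threads in between
        (2, "r6", ["r2", "r7"], False),
        (0, "r5", ["r3", "r2"], False)]  # thread 0's dependent add
print(schedule_with_latency(prog, load_latency=4))  # L1 hit:  [0, 1, 2, 3]
print(schedule_with_latency(prog, load_latency=8))  # L1 miss: [0, 1, 2, 7]
```

With a miss, thread 0's add still issues four cycles late; more threads (or issuing from any non-stalled thread) would fill those bubbles.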

Problems with SMT

Intel, AMD, ARM, etc., only use 2-way SMT. There must be a reason… SMT is great because each thread uses its own register file; with 10-way SMT, each core would need ten register files. So why don't we do 10-way SMT? The mantra of memory: there's no way to fit lots of memory on prime real estate. Too much SMT makes the register files big and slow, so we're stuck with 2-way SMT. SMT is perhaps a GPU's biggest trick, and there it's much more than 2-way; more on that later. In practice, we don't just alternate threads: we issue instructions from whichever thread isn't stalled.
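The "issue from whichever thread isn't stalled" policy can be sketched as follows. This is an assumed model, not any specific CPU's scheduler; each instruction carries an illustrative result latency.

```python
# Each cycle, issue the first instruction whose thread has all of its
# source registers ready; a stalled thread is simply skipped.

def issue_order(threads):
    """threads: per-thread lists of (dest, srcs, result_latency).
    Returns the order (thread_id, dest) in which instructions issue."""
    pcs = [0] * len(threads)
    ready = {}                        # (tid, reg) -> cycle value is available
    order, cycle = [], 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        for tid, t in enumerate(threads):
            if pcs[tid] >= len(t):
                continue              # this thread has finished
            dest, srcs, lat = t[pcs[tid]]
            if all(ready.get((tid, r), 0) <= cycle for r in srcs):
                order.append((tid, dest))
                ready[(tid, dest)] = cycle + lat
                pcs[tid] += 1
                break                 # at most one issue per cycle
        cycle += 1
    return order

t0 = [("r2", ["r1"], 8),   # a load that misses the L1: result 8 cycles out
      ("r5", ["r2"], 1)]   # dependent add: would stall a single thread
t1 = [("r4", ["r3"], 1),   # an independent thread fills the stall cycles
      ("r6", ["r4"], 1)]
print(issue_order([t0, t1]))
```

Thread 1's two adds issue while thread 0 waits for its load, and thread 0's dependent add goes last.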