CSCE 432/832 High Performance Processor Architectures
Thread Level Parallelism & Multiprocessors
Adapted from lecture notes based in part on slides created by Mikko H. Lipasti, John Shen, Mark Hill, David Wood, Guri Sohi, and Jim Smith
CS252 S05
Executing Multiple Threads
- Thread-level parallelism
- Synchronization
- Multiprocessors
- Explicit multithreading
- Implicit multithreading
- Redundant multithreading
- Summary
11/17/2018 CSCE 432/832, Thread Level Parallelism CS252 S05
Thread-level Parallelism
- Instruction-level parallelism reaps performance by finding independent work within a single thread
- Thread-level parallelism reaps performance by finding independent work across multiple threads
- Historically, it required explicitly parallel workloads, originating in mainframe time-sharing
  - Even then, CPU speed >> I/O speed; I/O latency had to be overlapped with "something else" for the CPU to do
  - Hence, the operating system scheduled other tasks/processes/threads that time-shared the CPU
Thread-level Parallelism
- Reduces effectiveness of temporal and spatial locality
Thread-level Parallelism
- Initially motivated by time-sharing of a single CPU
  - OS and applications written to be multithreaded
- Quickly led to adoption of multiple CPUs in a single system
  - Enabled a scalable product line from entry-level single-CPU systems to high-end multiple-CPU systems
  - Same applications and OS run seamlessly; adding CPUs increases throughput (performance)
- More recently: multiple threads per processor core
  - Coarse-grained multithreading (aka "switch-on-event")
  - Fine-grained multithreading
  - Simultaneous multithreading
- Multiple processor cores per die: chip multiprocessors (CMP)
Thread-level Parallelism
- Parallelism is limited by sharing: Amdahl's law
  - Access to shared state must be serialized, and the serial portion limits parallel speedup
- Many important applications share (lots of) state
  - Relational databases (transaction processing): GBs of shared state
  - Even completely independent processes "share" virtualized hardware, hence must synchronize access
- Access to shared state/shared variables must occur in a predictable, repeatable manner; otherwise, chaos results
- The architecture must provide primitives for serializing access to shared state
Synchronization
Some Synchronization Primitives
Primitive                     | Semantic                                    | Comments
Fetch-and-add                 | Atomic load -> add -> store                 | Permits atomic increment; can be used to synthesize locks for mutual exclusion
Compare-and-swap              | Atomic load -> compare -> conditional store | Stores only if load returns an expected value
Load-linked/store-conditional | Atomic load -> conditional store            | Stores only if the load/store pair is atomic, i.e. there is no intervening store
Only one is necessary; Itanium provides them all!
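In C11 the first two primitives map directly onto <stdatomic.h> operations; a sketch (the wrapper function names are mine, not from the slides):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Fetch-and-add: atomically returns the old value and adds 'inc'. */
int fetch_and_add(atomic_int *var, int inc) {
    return atomic_fetch_add(var, inc);
}

/* Compare-and-swap: store 'desired' only if *var still holds
 * 'expected'; returns true iff the store happened. */
bool compare_and_swap(atomic_int *var, int expected, int desired) {
    return atomic_compare_exchange_strong(var, &expected, desired);
}
```

Load-linked/store-conditional has no direct C-level spelling; on ISAs such as PowerPC, ARM, or RISC-V, compilers typically lower `atomic_compare_exchange` to an LL/SC loop.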
Synchronization Examples
- All three primitives provide the same guarantee:
  - Initial value of A: 0; final value of A: 4
- (b) uses an additional lock variable AL to protect the critical section with a spin lock
  - This is the most common synchronization method in modern multithreaded applications
Multiprocessor Systems
- Focus here is only on shared-memory symmetric multiprocessors
  - Many other types of parallel processor systems have been proposed and built
- Key attributes:
  - Shared memory: all physical memory is accessible to all CPUs
  - Symmetric processors: all CPUs are alike
- Other parallel processors may share some memory, share disks, or share nothing, and may have asymmetric processing units
- Shared-memory idealisms:
  - Fully shared memory
  - Unit latency
  - Lack of contention
  - Instantaneous propagation of writes
UMA vs. NUMA
Update vs. Invalidation Protocols
- Coherent shared memory: all processors see the effects of others' writes
- How/when writes are propagated is determined by the coherence protocol
Sample Invalidate Protocol (MESI)
Current state s, with local coherence controller responses and actions (s' refers to the next state):
State         | Local Read (LR)                                       | Local Write (LW)          | Local Eviction (EV)     | Bus Read (BR)                          | Bus Write (BW)          | Bus Upgrade (BU)
Invalid (I)   | Issue bus read; if no sharers then s' = E else s' = S | Issue bus write; s' = M   | s' = I                  | Do nothing                             | Do nothing              | Do nothing
Shared (S)    | Do nothing                                            | Issue bus upgrade; s' = M | s' = I                  | Respond shared                         | s' = I                  | s' = I
Exclusive (E) | Do nothing                                            | s' = M                    | s' = I                  | Respond shared; s' = S                 | s' = I                  | Error
Modified (M)  | Do nothing                                            | Do nothing                | Write data back; s' = I | Write data back; respond dirty; s' = S | Write data back; s' = I | Error
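The local-event column of the table can be encoded as a next-state function; a simplified sketch covering only local reads, writes, and evictions (bus-side responses omitted, names my own):

```c
typedef enum { I, S, E, M } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, EVICT } event_t;

/* Next cache-line state for local events, following the MESI table.
 * 'sharers' tells a read miss whether other caches hold the line. */
mesi_t mesi_next(mesi_t s, event_t e, int sharers) {
    switch (e) {
    case LOCAL_READ:
        /* Read miss from I: E if no sharers, else S; hits keep state. */
        return (s == I) ? (sharers ? S : E) : s;
    case LOCAL_WRITE:
        /* Any local write leaves the line Modified (bus write from I,
         * bus upgrade from S, silent upgrade from E). */
        return M;
    case EVICT:
        /* Eviction invalidates; M must write its data back first. */
        return I;
    }
    return s;
}
```

A full controller would also take the bus events (BR/BW/BU) and emit the bus actions and data responses listed in the table, not just the state.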
Sample Invalidate Protocol (MESI) (state-transition diagram)
Implementing Cache Coherence
- Snooping implementation
  - Origins in shared-bus systems: all CPUs could observe all other CPUs' requests on the bus, hence "snooping" (Bus Read, Bus Write, Bus Upgrade)
  - React appropriately to snooped commands: invalidate shared copies; provide up-to-date copies of dirty lines, either by flush (writeback) to memory or by direct intervention (modified intervention or dirty miss)
- Snooping suffers from:
  - Scalability: shared buses are not practical
  - Ordering of requests without a shared bus
- Lots of recent and on-going work on scaling snoop-based systems
Implementing Cache Coherence
- Directory implementation
  - Extra bits stored in memory (the directory) record the state of each line; the memory controller maintains coherence based on the current state
  - Other CPUs' commands are not snooped; instead, the directory forwards relevant commands
  - Powerful filtering effect: a CPU observes only the commands it needs to observe
  - Meanwhile, bandwidth at the directory scales by adding memory controllers as the system grows, leading to very scalable designs (100s to 1000s of CPUs)
- Directory shortcomings
  - Indirection through the directory adds a latency penalty
  - If a shared line is dirty in another CPU's cache, the directory must forward the request, adding latency
  - This can severely impact performance of applications with heavy sharing (e.g. relational databases)
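A directory entry is typically a state plus a sharer bit-vector; a toy sketch of read/write request handling (the struct fields and function names are illustrative, not from any real design):

```c
#include <stdint.h>

typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE_ST } dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t    sharers;   /* bit i set => CPU i holds a copy */
} dir_entry_t;

/* Read request from 'cpu': add it to the sharer set.
 * If the line were EXCLUSIVE elsewhere, a real protocol would first
 * forward the request to the owner -- the latency penalty noted above. */
void dir_read(dir_entry_t *e, int cpu) {
    e->state = SHARED_ST;
    e->sharers |= 1ULL << cpu;
}

/* Write request from 'cpu': invalidate all other sharers (messages
 * sent to each set bit), then record 'cpu' as exclusive owner. */
void dir_write(dir_entry_t *e, int cpu) {
    e->sharers = 1ULL << cpu;
    e->state = EXCLUSIVE_ST;
}
```

The filtering effect falls out of this structure: invalidations go only to CPUs whose bit is set, not to every CPU in the system.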
Memory Consistency
- How are memory references from different processors interleaved?
  - If this is not well specified, synchronization becomes difficult or even impossible
  - The ISA must specify a consistency model
- Common example: Dekker's algorithm for synchronization
  - If the load is reordered ahead of the store (as we assume for a baseline OOO CPU), both Proc0 and Proc1 enter the critical section, since each observes that the other's lock variable (A/B) is not set
  - If the consistency model allows loads to execute ahead of stores, Dekker's algorithm no longer works
  - Common ISAs allow this: IA-32, PowerPC, SPARC, Alpha
Sequential Consistency [Lamport 1979]
- Processors are treated as if they were interleaved processes on a single time-shared CPU
- All references must fit into a total global order, or interleaving, that does not violate any CPU's program order; otherwise sequential consistency is not maintained
- Now Dekker's algorithm will work
- Appears to preclude any OOO memory references, hence precludes any real benefit from OOO CPUs
High-Performance Sequential Consistency
- Coherent caches isolate CPUs if no sharing is occurring
  - Absence of coherence activity means a CPU is free to reorder references
- Still have to order references with respect to misses and other coherence activity (snoops)
- Key: use speculation
  - Reorder references speculatively
  - Track which addresses were touched speculatively
  - Force replay (in-order execution) of references that collide with coherence activity (snoops)
High-Performance Sequential Consistency
- The load queue records all speculative loads
- Bus writes/upgrades are checked against the LQ; any matching load gets marked for replay
- At commit, loads are checked and replayed if necessary
  - Results in a machine flush, since load-dependent ops must also replay
- In practice, conflicts are rare, so the expensive flush is OK
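The load-queue check amounts to an address match of incoming snoops against completed-but-uncommitted loads; a toy sketch (structure and names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define LQ_SIZE 32

typedef struct {
    uint64_t addr;    /* cache-line address the load touched        */
    bool     valid;   /* load executed speculatively, not committed */
    bool     replay;  /* marked by a snoop hit; flush at commit     */
} lq_entry_t;

lq_entry_t lq[LQ_SIZE];

/* A bus write/upgrade (snoop) arrives: mark every speculative load
 * to the same line for replay. */
void lq_snoop(uint64_t line_addr) {
    for (int i = 0; i < LQ_SIZE; i++)
        if (lq[i].valid && lq[i].addr == line_addr)
            lq[i].replay = true;
}

/* At commit: a marked load forces a machine flush and in-order
 * re-execution. Returns true if this load must replay. */
bool lq_commit(int i) {
    bool flush = lq[i].replay;
    lq[i].valid = lq[i].replay = false;
    return flush;
}
```

Because the check fires only on snoops that actually match a speculative address, unshared data never pays the replay cost, which is why the flush can afford to be expensive.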
Relaxed Consistency Models
- Key insight: only synchronization references need to be ordered
- Hence, relax memory ordering for all other references, enabling a high-performance OOO implementation
- Require the programmer to label synchronization references
  - Hardware must carefully order these labeled references; all other references can be performed out of order
- Labeling schemes:
  - Explicit synchronization ops (acquire/release)
  - Memory fence or memory barrier ops: all preceding ops must finish before following ones begin
  - Often, fence ops cause a pipeline drain in a modern OOO machine
- More: read the tutorial [Adve/Gharachorloo]
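Labeled synchronization references map directly onto C11's acquire/release orderings: only the flag accesses carry ordering, while the data access is an ordinary, freely reorderable reference. A minimal producer/consumer sketch:

```c
#include <stdatomic.h>

int data = 0;            /* ordinary, unlabeled reference           */
atomic_int flag = 0;     /* labeled synchronization reference       */

void producer(void) {
    data = 42;                                  /* plain store      */
    /* Release: all prior writes become visible before flag = 1.   */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

int consumer(void) {
    /* Acquire: spin on the labeled flag; once it reads 1, the
     * matching release guarantees data == 42 is visible. */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    return data;
}
```

This is the relaxed-model bargain in miniature: the hardware only has to order the two flag operations, and every other reference in `producer`/`consumer` may execute out of order.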
Coherent Memory Interface
Split Transaction Bus
- "Packet switched" vs. "circuit switched"
- Release the bus after the request is issued
- Allows multiple concurrent requests to overlap memory latency
- More complicated control and arbitration for the bus, but much better throughput
Explicitly Multithreaded Processors
Many approaches exist for executing multiple threads on a single die; mix-and-match is possible (e.g. IBM Power5 = CMP + SMT).
MT Approach    | Resources shared between threads | Context switch mechanism
None           | Everything                       | Explicit operating system context switch
Fine-grained   | Everything but register file and control logic/state | Switch every cycle
Coarse-grained | Everything but I-fetch buffers, register file and control logic/state | Switch on pipeline stall
SMT            | Everything but instruction fetch buffers, return address stack, architected register file, control logic/state, reorder buffer, store queue, etc. | All contexts concurrently active; no switching
CMP            | Secondary cache, system interconnect | All contexts concurrently active; no switching
IBM Power4: Example CMP
Coarse-grained Multithreading
- Low-overhead approach for improving processor throughput
- Also known as "switch-on-event"
- Long history: Denelcor HEP
- Commercialized in IBM Northstar, Pulsar; rumored in Sun Rock, Niagara
SMT Resource Sharing
Implicitly Multithreaded Processors
- Goal: speed up execution of a single thread
- Implicitly break the program up into multiple smaller threads and execute them in parallel
  - E.g. parallelize loop iterations across multiple processing units
  - Usually exploit control independence in some fashion
- Many challenges:
  - Maintain data dependences (RAW, WAR, WAW) for registers
  - Maintain precise state for exception handling
  - Maintain memory dependences (RAW/WAR/WAW)
  - Maintain the memory consistency model (not really addressed in any of the literature)
- Active area of research; only a subset is covered here, in a superficial manner
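Most of these proposals detect memory dependence violations by tracking per-thread speculative read/write sets; a toy software sketch of the check (purely illustrative, not any specific proposal's mechanism):

```c
#include <stdbool.h>
#include <stdint.h>

#define SET_MAX 64

typedef struct {
    uint64_t reads[SET_MAX];  int nreads;   /* addresses this thread read  */
    uint64_t writes[SET_MAX]; int nwrites;  /* addresses this thread wrote */
} spec_thread_t;

void spec_read (spec_thread_t *t, uint64_t a) { t->reads [t->nreads++]  = a; }
void spec_write(spec_thread_t *t, uint64_t a) { t->writes[t->nwrites++] = a; }

/* A RAW violation exists if a logically later thread read a location
 * that a logically earlier thread wrote: check the later thread's
 * read set against the earlier thread's write set. */
bool tls_conflict(const spec_thread_t *earlier, const spec_thread_t *later) {
    for (int i = 0; i < earlier->nwrites; i++)
        for (int j = 0; j < later->nreads; j++)
            if (earlier->writes[i] == later->reads[j])
                return true;   /* squash and re-execute 'later' */
    return false;
}
```

Hardware versions piggyback this check on the cache coherence protocol (as TLS does, with a simple MESI extension) rather than walking explicit sets.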
Sources of Control Independence
Implicit Multithreading Proposals
Attribute | Multiscalar | Disjoint Eager Execution (DEE) | Dynamic Multithreading (DMT) | Thread-level Speculation (TLS)
Control-flow attribute exploited | Control independence | Cumulative branch misprediction | Control independence | Control independence
Source of implicit threads | Loop bodies, control-flow joins | Cumulative branch mispredictions | Loop exits, subroutine returns | Loop bodies
Thread creation mechanism | Software/compiler | Implicit hardware | Implicit hardware | Software/compiler
Thread creation and sequencing | Program order | Program order | Out of program order | Program order
Thread execution | Distributed processing elements | Shared processing elements | Shared multithreaded processing elements | Separate CPUs
Register data dependences | Software with hardware speculation support | Hardware; no speculation | Hardware; data dependence prediction and speculation | Disallowed; compiler must avoid
Memory data dependences | Hardware-supported speculation | Hardware | Hardware; prediction and speculation | Dependence speculation; checked with a simple extension to MESI coherence
Executing The Same Thread
- Why execute the same thread twice?
  - Detect faults
  - Better performance: prefetch, resolve branches
Fault Detection
- AR/SMT [Rotenberg 1999]
  - Use a second SMT thread to execute the program twice; compare results to check for hard and soft errors (faults)
- DIVA [Austin 1999]
  - Use a simple checker processor at commit to re-execute all ops in order
  - Possibly relax the main processor's correctness constraints and safety margins (lower voltage, higher frequency, etc.) to improve performance
- Lots of other variations proposed in more recent work
Speculative Pre-execution
- Idea: create a runahead or future thread that helps the main trailing thread
  - Advantage: the speculative future thread has no correctness requirement
- Slipstream processors [Rotenberg 2000]
  - Construct a speculative, stripped-down version ("future thread"); let it run ahead and prefetch
- Speculative precomputation [Roth 2001, Zilles 2002, Collins et al. 2001]
  - Construct the backward dataflow slice for problematic instructions (mispredicted branches, cache misses)
  - Pre-execute this slice of the program to resolve branches and prefetch data
  - Implemented in the Intel production compiler, reflected in Intel Pentium 4 SPEC results
Summary
- Thread-level parallelism: abundant, especially in server applications
- Multiprocessors, cache coherence
- Explicit multithreading: CMP, coarse-grained MT, SMT
- Implicit multithreading: Multiscalar, DEE, DMT, TLS
- Redundant multithreading