Why we have Counterintuitive Memory Models

Why we have Counterintuitive Memory Models David Devecsery

Administration! First reading review due next week! Don't forget about the syllabus quiz. Did you read the papers for today? (There weren't reviews due, don't panic!)

Goals this class (and possibly next class)
Understand hardware memory models (SC, TSO, PC) better
Understand what hardware restrictions cause us to adopt weak memory models
Be able to reason about how changing architectural memory systems changes the observed memory model

Introduction to Memory Models
Can this code have a null pointer exception?
Initial: ready = false, a = null
Thread 1:
  a = new int[10];
  ready = true;
Thread 2:
  if (ready)
    a[5] = 5;
Answer: Maybe! (It depends on your memory model)
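The publication pattern above can be written as a runnable C++ sketch (the names and the choice of release/acquire atomics are mine, not from the slides). With an acquire load pairing a release store on `ready`, the null dereference is impossible; with plain variables or relaxed atomics the answer reverts to the slide's "maybe":

```cpp
#include <atomic>
#include <thread>

// Shared state from the slide: 'ready' guards publication of 'a'.
std::atomic<bool> ready{false};
int* a = nullptr;

void thread1() {
    a = new int[10]();                              // plain store of the array
    ready.store(true, std::memory_order_release);   // publish: orders the store to 'a' before 'ready'
}

void thread2() {
    if (ready.load(std::memory_order_acquire))      // acquire pairs with the release above
        a[5] = 5;                                   // safe: the store to 'a' is guaranteed visible
}
```

The release/acquire pair is precisely the ordering constraint a relaxed memory model stops giving you for free.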

Introduction to Memory Models
Can this code have a null pointer exception under sequential consistency?
Initial: ready = false, a = null
Thread 1:
  a = new int[10];
  ready = true;
Thread 2:
  if (ready)
    a[5] = 5;

Introduction to Memory Models
Can this code have a null pointer exception under TSO?
Initial: ready = false, a = null
Thread 1:
  a = new int[10];
  ready = true;
Thread 2:
  if (ready)
    a[5] = 5;

Introduction to Memory Models
Can this code print 0, 0 under TSO?
Initial: a = 0, b = 0
Thread 1:
  a = 1
  print b
Thread 2:
  b = 1
  print a
Answer: Yes

Introduction to Memory Models
But Prof. Devecsery, this still doesn't make any sense! The program says the store happens before the load!
Reasoning: There is a total order of stores; however, the reads are not bounded by those stores (they can read before the stores complete!)
Initial: a = 0, b = 0
Thread 1:
  a = 1
  print b
Thread 2:
  b = 1
  print a
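The "print 0, 0" outcome can be probed with a small litmus harness (my own sketch, not from the slides). C++ atomics default to `memory_order_seq_cst`, which forces the hardware to preserve sequential consistency, so the (0, 0) result can never be observed here:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> a{0}, b{0};
int r1, r2;  // what each thread "prints"

void t1() { a.store(1); r1 = b.load(); }  // seq_cst by default
void t2() { b.store(1); r2 = a.load(); }

// Runs one trial; returns true if it observed the SC-forbidden result (0, 0).
bool saw_zero_zero() {
    a = 0; b = 0;
    std::thread x(t1), y(t2);
    x.join(); y.join();
    return r1 == 0 && r2 == 0;
}
```

Changing the operations to `memory_order_relaxed` re-admits the (0, 0) outcome in the C++ model, and on TSO hardware it may actually be observed: that is the store-buffer reordering this slide is describing.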

What is out-of-order processing?
You should be familiar with the pipelined processor: Fetch -> Decode -> Execute -> Mem -> Write-back


What is out-of-order processing?
What happens if memory is slow? Fetch -> Decode -> Execute -> Mem -> Write-back
Memory can take hundreds of cycles to read!

Imagine this program
Should load #2 stall line #3?
  1. load [a] -> r1
  2. load [b] -> r2
  3. 20 / r1 -> r1
  4. r2 + 20 -> r2
  5. r1 + r2 -> r2
  6. store r2 -> [c]
On a pipelined processor, the program-order dependence stops #3 from running, but that is an artificial dependence.

Imagine this program
What are the dependencies between these instructions?
  1. load [a] -> r1
  2. load [b] -> r2
  3. 20 / r1 -> r1
  4. r2 + 20 -> r2
  5. r1 + r2 -> r2
  6. store r2 -> [c]
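One way to make the dependence question concrete is to compute read-after-write (RAW) dependences mechanically. This sketch is my own encoding of the slide's six instructions (0-indexed here): each instruction depends on the latest earlier writer of every register or address it reads:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

struct Instr {
    std::vector<std::string> reads, writes;
};

// RAW dependences: instruction i depends on the latest earlier instruction
// that wrote a name (register or memory location) that i reads.
std::vector<std::set<int>> raw_deps(const std::vector<Instr>& prog) {
    std::map<std::string, int> last_writer;
    std::vector<std::set<int>> deps(prog.size());
    for (int i = 0; i < (int)prog.size(); ++i) {
        for (const auto& r : prog[i].reads)
            if (last_writer.count(r)) deps[i].insert(last_writer[r]);
        for (const auto& w : prog[i].writes)
            last_writer[w] = i;
    }
    return deps;
}

// The slide's program, 0-indexed:
// 0: load [a] -> r1    1: load [b] -> r2    2: 20 / r1 -> r1
// 3: r2 + 20 -> r2     4: r1 + r2 -> r2     5: store r2 -> [c]
std::vector<Instr> slide_prog = {
    {{"a"}, {"r1"}},        {{"b"}, {"r2"}},        {{"r1"}, {"r1"}},
    {{"r2"}, {"r2"}},       {{"r1", "r2"}, {"r2"}}, {{"r2"}, {}},
};
```

Running `raw_deps(slide_prog)` shows that the add on line 4 of the slide depends only on load #2, so nothing but program order holds it behind the slow divide.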


What is out-of-order processing? We’re going to execute instructions as their dependencies become available 1. load [a] -> r1 2. load [b] -> r2 3. 20 / r1 -> r1 4. r2 + 20 -> r2 5. r1 + r2 -> r2 6. store r2 -> [c] Out-of-order Fetch Decode Execute Mem Write-back Memory

What is out-of-order processing? We’re going to execute instructions as their dependencies become available 1. load [a] -> r1 2. load [b] -> r2 3. 20 / r1 -> r1 4. r2 + 20 -> r2 5. r1 + r2 -> r2 6. store r2 -> [c] Execute Waiting Done Fetch Decode Mem (FIFO) Memory

What is out-of-order processing?
Out-of-order core guarantee: All instructions will "commit" in order!
We're going to execute instructions as their dependencies become available.
  1. load [a] -> r1
  2. load [b] -> r2
  3. 20 / r1 -> r1
  4. r2 + 20 -> r2
  5. r1 + r2 -> r2
  6. store r2 -> [c]
What if r1 is 0 here? What value should r2 be?
[Diagram: out-of-order core — Fetch, Decode, Execute, Mem, Write-back — attached to Memory]

Imagine this program
Does out-of-order processing handle our memory stalls?
  1. store r1 -> [a]
  2. load [b] -> r2
  3. r2 + 1 -> r2
  4. store r2 -> [b]
  5. load [a] -> r3
  6. r3 + 1 -> r3
  7. store r3 -> [a]
What are the dependencies (ignoring program order) of this code?



Out-of-order processing and memory
  1. store r1 -> [a]
  2. load [b] -> r2
  3. r2 + 1 -> r2
  4. store r2 -> [b]
  5. load [a] -> r3
  6. r3 + 1 -> r3
  7. store r3 -> [a]
[Diagram: out-of-order core — Fetch, Decode, Execute, Mem, Write-back — attached to Memory]


Out-of-order processing and memory
Some of these memory operations are unrelated! Do we really need to stall 2 on 1?
  1. store r1 -> [a]
  2. load [b] -> r2
  3. r2 + 1 -> r2
  4. store r2 -> [b]
  5. load [a] -> r3
  6. r3 + 1 -> r3
  7. store r3 -> [a]
[Diagram: memory operations 1, 2, and 4 in flight in the out-of-order core]

Memory Dependencies: Loads
For now, let's try this: loads can run after any stores they depend on.
What are the dependencies surrounding loads?
  1. load [a] -> r1
  2. load [b] -> r2
  3. store r2 -> [a]
  4. load [a] -> r3
  5. load [c] -> r1
  6. store r1 -> [b]
Can load 2 be run before load 1?
Can load 4 be run before store 3?
Can load 5 be run before store 3?
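The rule this slide proposes — a load must wait only for earlier stores to the same address — can be stated as a small predicate. This is my own sketch with symbolic addresses, answering the slide's three questions mechanically:

```cpp
#include <string>

struct MemOp {
    bool is_store;
    std::string addr;
};

// Under the slide's rule, a load may be hoisted above an earlier operation
// unless that operation is a store to the same address.
bool load_may_pass(const MemOp& earlier, const MemOp& load) {
    return !(earlier.is_store && earlier.addr == load.addr);
}
```

Applied to the listing above: load 2 ([b]) may pass load 1 ([a]); load 4 ([a]) must wait for store 3 ([a]); load 5 ([c]) need not wait for store 3.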

Memory Dependencies: Loads
Stores should still be run in program order (FIFO). Loads run in parallel, once any store they depend on has finished.
  1. load [a] -> r1
  2. load [b] -> r2
  3. store r2 -> [a]
  4. load [a] -> r3
  5. load [c] -> r1
  6. store r1 -> [b]
[Diagram: out-of-order memory subsystem — a Store Queue (FIFO) and a Load Buffer in front of Memory]

Memory Dependencies
This is great! We just broke nearly all of our memory dependencies!
  1. load [a] -> r1
  2. load [b] -> r2
  3. store r2 -> [a]
  4. load [a] -> r3
  5. load [c] -> r1
  6. store r1 -> [b]
But what does this do to our parallel programs?
[Diagram: out-of-order memory subsystem — Store Queue and Load Buffer in front of Memory]

Can this dereference null?
Initial: ready = false, a = null
Processor 1:
  a = new int[10];
  ready = true;
Processor 2:
  if (ready)
    a[5] = 5;
[Diagram: two out-of-order processors, each with a Store Queue and Load Buffer, sharing Memory]


Can this dereference null?
Initial: ready = false, a = null
Processor 1:
  a = new int[10];
  ready = true;
Processor 2:
  if (ready)
    a[5] = 5;
This can load null if the loads are reordered. What if the memory system returns shared-memory loads in order?
[Diagram: two out-of-order processors, each with a Store Queue and Load Buffer, sharing Memory]


The reality. What can this print?
Initial: a = 0, b = 0
Thread 1:
  a = 1
  print b
Thread 2:
  b = 1
  print a
What do we need to make this a TSO processor?
[Diagram: two out-of-order processors, each with a Store Queue and Load Buffer, sharing Memory]

What have we done?
We've removed memory orderings under the assumption that nothing else will change the memory.
FIFO Ordering:
  store r1 -> [a]
  load [b] -> r2
  store r3 -> [b]
  load [c] -> r4
  store r5 -> [a]
TSO Ordering:
  store r1 -> [a]
  load [b] -> r2
  store r3 -> [b]
  load [c] -> r4
  store r5 -> [a]
(The slide's ordering arrows are lost in this transcript; under TSO, each load may complete before earlier stores to other addresses, while the FIFO column keeps full program order.)
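The transformation on this slide is exactly a per-core store buffer: stores queue up locally and drain to memory later, while loads check the local buffer first and otherwise read shared memory. This toy two-core model (my own sketch, not from the slides) deterministically reproduces the (0, 0) outcome from earlier:

```cpp
#include <deque>
#include <map>
#include <string>
#include <utility>

struct Core {
    std::deque<std::pair<std::string, int>> store_buffer;  // FIFO of pending stores
    std::map<std::string, int>* memory;                    // shared memory

    void store(const std::string& addr, int v) {
        store_buffer.push_back({addr, v});                 // buffered, not yet globally visible
    }
    int load(const std::string& addr) {
        // TSO forwarding: read our own newest buffered store to this address...
        for (auto it = store_buffer.rbegin(); it != store_buffer.rend(); ++it)
            if (it->first == addr) return it->second;
        return (*memory)[addr];                            // ...otherwise read shared memory
    }
    void drain() {                                         // stores reach memory in FIFO order
        while (!store_buffer.empty()) {
            (*memory)[store_buffer.front().first] = store_buffer.front().second;
            store_buffer.pop_front();
        }
    }
};
```

Each core sees its own stores immediately (forwarding) but other cores' stores only after a drain, which is precisely the store-to-load reordering TSO permits.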

Introduction to Memory Models
Processor Consistency (PC) [x86]
Stores to the same location by a single processor core are ordered: every store in a thread to a variable happens before the next store in that thread to that variable.

Introduction to Memory Models
Can this code have a null pointer exception under PC?
Initial: a = null, ready1 = false
Thread 1:
  a = new int[10];
Thread 2:
  while (a == null) ;
  ready1 = true;
Thread 3:
  while (ready1 == false) ;
  print a[0];
Answer: Yes

Introduction to Memory Models
Remember, ordering is based on each location.
Can this code have a null pointer exception under PC?
Initial: a = null, ready1 = false
Thread 1:
  a = new int[10];
Thread 2:
  while (a == null) ;
  ready1 = true;
Thread 3:
  while (ready1 == false) ;
  print a[0];
