Constructive Computer Architecture Tutorial 8
Final Project Part 2: Coherence
Sizhuo Zhang, 6.175 TA
Nov 25, 2015
http://csg.csail.mit.edu/6.175


Debugging Techniques
- Limitation of $display: output from all modules shows up together
- Fix: write a distinct log file for each module
- Also see src/unit_test/sc-test/Tb.bsv

    Ehr#(2, File) file <- mkEhr(InvalidFile);
    Reg#(Bool) opened <- mkReg(False);

    rule doOpenFile(!opened);
        let f <- $fopen("a.txt", "w");
        if(f == InvalidFile) $finish;
        file[0] <= f;
        opened <= True;
    endrule

    rule doPrint;
        $fwrite(file[1], "Hello world\n");
    endrule

- Writing to InvalidFile will cause a segfault
- Use an EHR if the logic will call $fwrite in the first cycle

Debugging Techniques
- Limitation of a cycle counter
  - The rule that prints the cycle may be scheduled before or after the rule we are interested in
  - We don't want to create a counter in each module
- Use simulation time instead

    $display("%t: evict cache line", $time);

- $time returns a Bit#(64) representing the time
- In SceMi simulation, $time outputs: 10, 30,...

Debugging Techniques
- Add sanity checks
- Example 1
  - Parent is handling an upgrade request
  - No other child has an incompatible state
  - Parent decides to send an upgrade response
  - Check: parent is not waiting for any child (waitc)
- Example 2
  - D cache receives an upgrade response from memory
  - Check: must be in WaitFillResp state
  - Process the upgrade response
  - Check: if in I state, the data in the response must be valid; otherwise the data must be invalid (the data field is a Maybe type in the lab)

Coherence Protocol: Differences From Lecture
- In lecture: the address type is a byte address
  - Implementation: only uses the cache line address
  - addr >> 6 for a 64B cache line
- In lecture: the parent reads data in 0 cycles
  - Implementation: read from memory, long latency
- In lecture: voluntary downgrade rule
  - Not needed in the implementation
- In lecture: the parent directory tracks states for all addresses
  - A 32-bit address space implies a huge directory
  - Implementation: the parent is usually the L2 cache, so it only tracks addresses in the L2 cache
  - We don't have an L2 cache
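The `addr >> 6` step can be illustrated with a small C sketch. This is only a software model of the address arithmetic, not code from the lab; the function names `line_addr` and `line_offset` are hypothetical, and a 32-bit byte address with 64-byte lines is assumed, as on the slide.

```c
#include <stdint.h>

/* Assumption from the slide: 64-byte cache lines, so the low 6 bits
 * of a byte address select a byte inside the line. */
enum { LINE_OFFSET_BITS = 6 };

/* Cache line address: the byte address with the in-line offset dropped.
 * (Hypothetical helper name, for illustration only.) */
uint32_t line_addr(uint32_t byte_addr) {
    return byte_addr >> LINE_OFFSET_BITS;
}

/* Byte offset within the 64-byte line. */
uint32_t line_offset(uint32_t byte_addr) {
    return byte_addr & ((1u << LINE_OFFSET_BITS) - 1u);
}
```

Every byte address inside the same 64-byte block maps to the same line address, which is why the protocol only needs to exchange line addresses.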

Coherence Protocol: Differences From Lecture
- Workaround for the large directory
  - For each child, only track the addresses in its L1 D cache

    Vector#(CoreNum, Vector#(CacheRows, Reg#(CacheTag))) tags <- replicateM(replicateM(mkRegU));
    Vector#(CoreNum, Vector#(CacheRows, Reg#(MSI))) states <- replicateM(replicateM(mkReg(I)));

- To get the MSI state for address a in core i:

    MSI s = tags[i][getIndex(a)] == getTag(a) ? states[i][getIndex(a)] : I;
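As a rough software model of the lookup above, the following C sketch mirrors the `tags`/`states` arrays and the tag-match rule. The sizes (`CORE_NUM`, `CACHE_ROWS`, `INDEX_BITS`) and the helpers `dir_init` and `dir_set` are assumptions made for illustration; the lab's actual parameters and update logic may differ.

```c
#include <stdint.h>

/* Hypothetical sizes: 2 cores, 8 cache rows (3 index bits). */
enum { CORE_NUM = 2, CACHE_ROWS = 8, INDEX_BITS = 3 };

typedef enum { MSI_I = 0, MSI_S = 1, MSI_M = 2 } Msi;

/* Per-child directory: one tag and one MSI state per L1 D cache row. */
typedef struct {
    uint32_t tags[CORE_NUM][CACHE_ROWS];
    Msi      states[CORE_NUM][CACHE_ROWS];
} Directory;

/* Both helpers take a cache line address (byte address >> 6). */
uint32_t get_index(uint32_t a) { return a & (CACHE_ROWS - 1u); }
uint32_t get_tag(uint32_t a)   { return a >> INDEX_BITS; }

/* Start with every row in state I (tags are don't-care, set to 0 here). */
void dir_init(Directory *d) {
    for (int c = 0; c < CORE_NUM; c++)
        for (int r = 0; r < CACHE_ROWS; r++) {
            d->tags[c][r] = 0;
            d->states[c][r] = MSI_I;
        }
}

/* Record that core `core` holds line `a` in state `s`. */
void dir_set(Directory *d, int core, uint32_t a, Msi s) {
    d->tags[core][get_index(a)] = get_tag(a);
    d->states[core][get_index(a)] = s;
}

/* MSI state of line `a` in core `core`: the stored state if the tag
 * matches, otherwise I (the child cannot be holding that line). */
Msi dir_state(const Directory *d, int core, uint32_t a) {
    uint32_t idx = get_index(a);
    return d->tags[core][idx] == get_tag(a) ? d->states[core][idx] : MSI_I;
}
```

A tag mismatch reads as I because a direct-mapped row can hold only one line at a time, so any other line with the same index is not present in that child.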

Load-Reserve (lr.w) and Store-Conditional (sc.w)
- New state in D cache: the cache line address reserved by lr.w

    Reg#(Maybe#(CacheLineAddr)) la <- mkReg(Invalid);

- Load reserve: lr.w rd, (rs1)
  - rd <= mem[rs1]
  - Make reservation: la <= Valid (getLineAddr(rs1));
- Store conditional: sc.w rd, rs2, (rs1)
  - Check la: if la is invalid or the addresses don't match, rd <= 1
  - Otherwise: get exclusive permission (upgrade to M)
    - Check la again
    - If the addresses match: mem[rs1] <= rs2; rd <= 0
    - Otherwise: rd <= 1
    - On a cache hit, no need to check again (the addresses already match)
  - Always clear the reservation: la <= Invalid

Load-Reserve (lr.w) and Store-Conditional (sc.w)
- Cache line eviction
  - Due to replacement, invalidation request...
  - May lose track of the reserved cache line
    - Then clear the reservation
  - Compare the evicted cache line with la
    - If they match: la <= Invalid
- This is how the lr.w/sc.w pair ensures atomicity
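The reservation rules from the two lr.w/sc.w slides can be sketched as a small C model. It only tracks the `la` register and the value written to `rd`. The memory access, the upgrade to M, and the second check after the upgrade are deliberately left out. All names (`Reservation`, `lr_reserve`, `sc_result`, `on_evict`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software stand-in for Reg#(Maybe#(CacheLineAddr)) la. */
typedef struct {
    bool     valid;      /* false models Invalid */
    uint32_t line_addr;  /* reserved cache line address (byte addr >> 6) */
} Reservation;

/* lr.w: record the reserved line (the load itself is not modeled). */
void lr_reserve(Reservation *la, uint32_t byte_addr) {
    la->valid = true;
    la->line_addr = byte_addr >> 6;
}

/* sc.w: returns the value written to rd (0 = success, 1 = failure).
 * The reservation is always cleared, whether or not the store succeeds. */
uint32_t sc_result(Reservation *la, uint32_t byte_addr) {
    bool ok = la->valid && la->line_addr == (byte_addr >> 6);
    la->valid = false;
    return ok ? 0u : 1u;
}

/* Eviction (replacement or invalidation): drop the reservation if the
 * evicted line is the reserved one. */
void on_evict(Reservation *la, uint32_t evicted_line_addr) {
    if (la->valid && la->line_addr == evicted_line_addr)
        la->valid = false;
}
```

In this model, an invalidation caused by another core's store to the reserved line goes through `on_evict`, so the following sc.w fails and software has to retry the lr.w/sc.w sequence.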

Reference Memory Model
- The debug interface returned by the reference model is passed into every D cache
- The D cache calls the debug interface refDMem
- The reference model checks for coherence violations based on these calls
- Reference model: src/ref

    interface RefDMem;
        method Action issue(MemReq req);
        method Action commit(MemReq req, Maybe#(CacheLine) line, Maybe#(MemResp) resp);
    endinterface

    module mkDCache#(CoreID id)(
        MessageGet fromMem,
        MessagePut toMem,
        RefDMem refDMem,
        DCache ifc);

Reference Memory Model
- issue(MemReq req)
  - Called when req is issued to the D cache, in the req method of the D cache
  - Gives the program order to the reference model
- commit(MemReq req, Maybe#(CacheLine) line, Maybe#(MemResp) resp);
  - Called when req finishes processing (commits)
  - line: cache line accessed by req; set to Invalid if unknown
  - resp: response to the core; set to Invalid if there is no response
- When commit is called, the reference model checks whether
  - req can be committed
  - the line value is correct (not checked if Invalid)
  - resp is correct

Adding Store Queue
- New behavior for memory requests
  - Ld: can start processing when the store queue is not empty
  - St: enqueue into the store queue
  - Lr, Sc: wait for the store queue to be empty
  - Fence: wait for all previous requests to commit (e.g. the store queue must be empty)
    - Orders memory accesses
- Issuing stores from the store queue for processing
  - Only stall when there is a Ld request
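The per-request rules above can be summarized as a decision function. The C sketch below is a simplified model under stated assumptions: the names (`ReqOp`, `Action`, `admit`) are hypothetical, and stalling a store when the queue is full is an assumption that the slide does not spell out.

```c
/* Request kinds named on the slide. */
typedef enum { REQ_LD, REQ_ST, REQ_LR, REQ_SC, REQ_FENCE } ReqOp;

/* What the D cache does with an incoming request in this model. */
typedef enum { ACT_PROCESS, ACT_ENQUEUE, ACT_STALL } Action;

/* Decide how to handle `op`, given the number of stores currently in the
 * store queue. `stq_capacity` is a hypothetical parameter. */
Action admit(ReqOp op, int stq_count, int stq_capacity) {
    switch (op) {
    case REQ_LD:
        /* Loads may start even if older stores are still queued. */
        return ACT_PROCESS;
    case REQ_ST:
        /* Stores go into the store queue (assumed to stall when full). */
        return stq_count < stq_capacity ? ACT_ENQUEUE : ACT_STALL;
    case REQ_LR:
    case REQ_SC:
    case REQ_FENCE:
        /* These wait until every queued store has drained. */
        return stq_count == 0 ? ACT_PROCESS : ACT_STALL;
    }
    return ACT_STALL; /* unreachable for valid ops */
}
```

Letting loads proceed past queued stores is what relaxes the memory ordering, which is why the fence rule is needed for programs such as mc_dekker.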

Multicore Programs
- Run programs on a 2-core system
- Single-thread programs
  - programs/assembly, programs/benchmarks
  - core 1 starts looping forever at the very beginning
- Multithread programs
  - programs/mc_bench
  - startup code (crt.S): allocates a 128KB local stack for each core
  - main function: forks based on core id

    int main() {
        int coreid = getCoreId();
        if(coreid == 0) {
            return core0();
        } else {
            return core1();
        }
    }

Multicore Programs: mc_print
- Easiest one
- The two cores print “0” and “1” respectively
- Sample output (no cycle/inst count printed):

    /../programs/build/mc_bench/vmh/mc_print.riscv.vmh
    PASSED

Multicore Programs: mc_hello
- Core 0 passes each character of a string to core 1
- Core 1 prints each character it receives
- Sample output (no cycle/inst count printed):

    /../programs/build/mc_bench/vmh/mc_hello.riscv.vmh ----
    Hello World! This message has been written to a software FIFO by core 0 and read and printed by core 1.
    PASSED

Multicore Programs: mc_produce_consume
- Larger version of mc_hello
- Core 1 passes each element of an array to core 0
- Core 0 checks the data
- Sample output:

    /../programs/build/mc_bench/vmh/mc_produce_consume.riscv.vmh ----
    Benchmark mc_produce_consume
    Cycles (core 0) = xxx
    Insts (core 0) = xxx
    Cycles (core 1) = xxx
    Insts (core 1) = xxx
    Cycles (total) = xxx
    Insts (total) = xxx
    Return 0
    PASSED

- Instruction counts may vary with the time spent busy waiting, so IPC is not a good performance metric. Execution time is a better metric.

Multicore Programs: mc_median/vvadd/multiply
- Data parallel: fork-join style
  - Core 0 calculates the first half of the results
  - Core 1 calculates the second half of the results
- Sample output:

    /../programs/build/mc_bench/vmh/mc_median.riscv.vmh ----
    Benchmark mc_median
    Cycles (core 0) = xxx
    Insts (core 0) = xxx
    Cycles (core 1) = xxx
    Insts (core 1) = xxx
    Cycles (total) = xxx
    Insts (total) = xxx
    Return 0
    PASSED

Multicore Programs: mc_dekker
- Two cores contend for a mutex (Dekker's algorithm)
- After getting into the critical section: increment/decrement the shared counter, print the core ID
- Sample output:

    /../programs/build/mc_bench/vmh/mc_dekker.riscv.vmh ----
    Benchm1ark mc_1dekker
    Core 0 decrements counter by 600
    Core 1 increments counter by 900
    Final counter value = 300
    Cycles (core 0) = xxx
    Insts (core 0) = xxx
    Cycles (core 1) = xxx
    Insts (core 1) = xxx
    Cycles (total) = xxx
    Insts (total) = xxx
    Return 0
    PASSED

- For the implementation with a store queue, a fence is inserted in mc_dekker.

Multicore Programs: mc_spin_lock
- Similar to mc_dekker, but uses a spin lock implemented with lr.w/sc.w
- Sample output:

    /../programs/build/mc_bench/vmh/mc_spin_lock.riscv.vmh ----
    Bench1mark mc1_spin_l1ock
    Core 0 increments counter by 300
    Core 1 increments counter by 600
    Final counter value = 900
    Cycles (core 0) = xxx
    Insts (core 0) = xxx
    Cycles (core 1) = xxx
    Insts (core 1) = xxx
    Cycles (total) = xxx
    Insts (total) = xxx
    Return 0
    PASSED
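For readers who have not seen a spin lock built on an atomic read-modify-write, here is a minimal portable sketch using C11 atomics. It is not the mc_spin_lock source. It uses `atomic_compare_exchange_strong` in place of a hand-written lr.w/sc.w loop, and the names `SpinLock`, `spin_try_lock`, `spin_lock`, and `spin_unlock` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* 0 = free, 1 = held. */
typedef struct { atomic_int locked; } SpinLock;

/* One acquisition attempt: atomically change 0 -> 1.
 * This plays the role of a single lr.w/sc.w attempt. */
bool spin_try_lock(SpinLock *l) {
    int expected = 0;
    return atomic_compare_exchange_strong(&l->locked, &expected, 1);
}

/* Busy-wait until the lock is acquired. */
void spin_lock(SpinLock *l) {
    while (!spin_try_lock(l)) {
        /* spin */
    }
}

/* Release the lock with a sequentially consistent store. */
void spin_unlock(SpinLock *l) {
    atomic_store(&l->locked, 0);
}
```

A critical section is then `spin_lock(&l); counter++; spin_unlock(&l);` on each core. The default sequentially consistent ordering of these operations keeps the counter update inside the lock.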

Multicore Programs: mc_incrementers
- Similar to mc_dekker, but uses an atomic fetch-and-add implemented with lr.w/sc.w
- Core ID is not printed
- Sample output:

    /../programs/build/mc_bench/vmh/mc_incrementers.riscv.vmh ----
    Benchmark mc_incrementers
    core0 had 1000 successes out of xxx tries
    core1 had 1000 successes out of xxx tries
    shared_count = 2000
    Cycles (core 0) = xxx
    Insts (core 0) = xxx
    Cycles (core 1) = xxx
    Insts (core 1) = xxx
    Cycles (total) = xxx
    Insts (total) = xxx
    Return 0
    PASSED
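The "successes out of tries" lines suggest a retry loop around a conditional store. The sketch below shows that shape with C11 atomics and is not the mc_incrementers source. `atomic_compare_exchange_weak` stands in for the lr.w/sc.w pair, and the function name `fetch_and_add` and its `tries` counter are hypothetical.

```c
#include <stdatomic.h>

/* Atomically add `delta` to *p and return the previous value.
 * *tries is incremented once per attempt, successful or not. */
int fetch_and_add(atomic_int *p, int delta, unsigned *tries) {
    int old = atomic_load(p);   /* like lr.w: read the current value */
    for (;;) {
        ++*tries;
        /* Like sc.w: store old + delta only if *p still equals old.
         * On failure, `old` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(p, &old, old + delta))
            return old;
    }
}
```

The weak compare-exchange may fail even without contention, just as sc.w may, so the number of tries can exceed the number of successes. This is consistent with the xxx try counts in the sample output.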

Some Reminders
- Use the CF regfile and scoreboard
  - The compiler creates a conflict in my implementation with the bypass regfile and pipelined scoreboard
- Sign up for a project meeting
  - Half-page progress report
- Project deadline: 3:00pm Dec 9
- Final presentation (10min)