Calvin: Deterministic or Not? Free Will to Choose Derek R. Hower, Polina Dudnik, Mark D. Hill, David A. Wood.

Presentation transcript:

Calvin: Deterministic or Not? Free Will to Choose Derek R. Hower, Polina Dudnik, Mark D. Hill, David A. Wood

Executive Summary
Determinism Valuable:
– Same inputs → same multithreaded execution
– Debugging, Fault Tolerance, Security
Performance Required:
– Slow & deterministic is not enough
Propose: Calvin
– Leverages Total Store Order (TSO) in hardware to deterministically order memory operations
Multiple modes without speculation:
– Deterministic: 20% slowdown (vs. 1-11X for software)
– Conventional: 8% slowdown
Good Performance

Outline: Motivation & Goals; Model; Implementation; Evaluation; Conclusion; Related Work (optional)

Want Deterministic Execution
Bug: unprotected account update
Two threads each run: if (account >= sum) account -= sum;
Starting from account = 100, the result depends on the interleaving: account = 0 or account = -100
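
The race on this slide can be reproduced without real threads by enumerating every legal interleaving of the two check and update steps. This is a minimal sketch, not code from the talk: the withdrawal amount of 100 is an assumption (the slide only shows the outcomes 0 and -100), and the name `amount` stands in for the slide's `sum`.

```python
from itertools import permutations


def run(schedule, account=100, amount=100):
    """Replay one interleaving of the two threads' steps; return the balance."""
    passed = {}
    for thread, step in schedule:
        if step == "check":
            passed[thread] = account >= amount   # if (account >= sum)
        elif passed[thread]:
            account -= amount                    # account -= sum
    return account


steps = [(0, "check"), (0, "update"), (1, "check"), (1, "update")]
# Keep only schedules in which each thread checks before it updates.
schedules = [
    s for s in permutations(steps)
    if s.index((0, "check")) < s.index((0, "update"))
    and s.index((1, "check")) < s.index((1, "update"))
]
outcomes = sorted({run(s) for s in schedules})
print(outcomes)  # [-100, 0]
```

Both outcomes are reachable from the same input, which is the nondeterminism the talk wants to remove.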

Specific Goals
Strong Determinism:
– Make no assumptions about program behavior
– Help debug racy programs
Performance:
– Small enough overhead to be on all the time
Compatibility:
– Complex speculative cores
– Non-speculative cores

Calvin: The Big Picture
[Diagram labels: Proc 0, Proc 1; Memory Order; operations shown: Load A, Load C, Store B, Store D, Load D, Store B, Store A, Load A]

Recall Total Store Order (TSO)…
TSO is a relaxed memory model
Key point: write completion can be delayed (local buffering)
Example on processor 0: ST A <- 1; R1 <- LD B; R2 <- LD A
[Animation labels: PC ->; Memory Order]
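
As a rough illustration of the key point, the sketch below models each processor with a FIFO store buffer in front of shared memory. The class and method names are hypothetical, and the model ignores everything about TSO except delayed write completion and a processor reading its own buffered stores.

```python
class TSOProcessor:
    """Toy processor with a private FIFO store buffer (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = []              # (address, value) in program order

    def store(self, addr, value):
        # The store completes locally; memory is not updated yet.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # A processor sees its own newest buffered store to this address.
        for buffered_addr, value in reversed(self.store_buffer):
            if buffered_addr == addr:
                return value
        return self.memory[addr]

    def drain(self):
        # Buffered stores reach memory later, in program order.
        for addr, value in self.store_buffer:
            self.memory[addr] = value
        self.store_buffer.clear()


memory = {"A": 0, "B": 0}
p0 = TSOProcessor(memory)
p1 = TSOProcessor(memory)                   # hypothetical second processor

p0.store("A", 1)                            # ST A <- 1
r1 = p0.load("B")                           # R1 <- LD B
r2 = p0.load("A")                           # R2 <- LD A (own buffered store)
before_drain = p1.load("A")                 # other processors still see 0
p0.drain()
after_drain = p1.load("A")
print(r1, r2, before_drain, after_drain)    # 0 1 0 1
```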

Calvin Model: One Interleaving
[Diagram labels: Proc 0, Proc 1; Buffer; Memory Order; Execute; Publish; operations shown: Load A, Load C, Store B, Store D, Load D, Store B, Store A, Load A]

Calvin Model: Reduce Scope
Temporally divide multithreaded execution into global strata
[Timeline diagram: Processor 0 and Processor 1 run loads and stores through Stratum S and Stratum S + 1. Each stratum has a Begin Stratum point, an Execute phase, a Publish phase, and an End Stratum and Synchronize point.]
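
A minimal sketch of one stratum, under stated assumptions: stores are buffered privately during the Execute phase, loads read memory as it stood at the start of the stratum, and the Publish phase applies buffered stores in processor order (the "all loads before all stores, stores ordered by processor" rule from the backup slides). The function name, the op encoding, and the choice to let a load read its own processor's buffered store are assumptions made for illustration, not details taken from the talk.

```python
def run_stratum(memory, programs, execute_order=None):
    """Run one stratum and return each processor's registers.

    programs[pid] is a list of ("st", addr, value) or ("ld", reg, addr) ops.
    """
    pids = list(range(len(programs)))
    write_caches = {pid: {} for pid in pids}
    registers = {pid: {} for pid in pids}

    # Execute phase: the real-time order of processors does not matter,
    # because stores stay private and memory is not modified here.
    for pid in (execute_order if execute_order is not None else pids):
        for op in programs[pid]:
            if op[0] == "st":
                _, addr, value = op
                write_caches[pid][addr] = value       # coalesces per address
            else:
                _, reg, addr = op
                # Assumption: a load sees its own processor's buffered store.
                registers[pid][reg] = write_caches[pid].get(addr, memory[addr])

    # Publish phase: apply buffered stores in fixed processor order.
    for pid in pids:
        memory.update(write_caches[pid])
    return registers


programs = [
    [("st", "A", 1), ("ld", "R1", "B")],                    # processor 0
    [("st", "B", 3), ("st", "A", 2), ("ld", "R0", "A")],    # processor 1
]
mem_a = {"A": 0, "B": 0}
mem_b = {"A": 0, "B": 0}
regs_a = run_stratum(mem_a, programs, execute_order=[0, 1])
regs_b = run_stratum(mem_b, programs, execute_order=[1, 0])
print(regs_a == regs_b, mem_a == mem_b, mem_a)  # True True {'A': 2, 'B': 3}
```

Running the Execute phase in either processor order gives the same registers and the same final memory, which is the property the strata are meant to provide.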

Stratum Termination Function (3 Modes)
1. Unbounded deterministic:
– determinism → architectural events only, e.g. instructions
– (#instructions == threshold) OR synchronization
2. Conventional:
– performance → reduce load imbalance, e.g. cycle count
– (#cycles == threshold) OR synchronization
3. Bounded deterministic:
– determinism → architectural events only, e.g. instructions
– (#instructions == threshold) OR (synchronization) OR (resource exhaustion)
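
The three termination conditions can be written as a single predicate. This is a sketch with hypothetical names and parameters; it uses `>=` rather than the slide's `==` only so that the check stays true if a counter steps past the threshold.

```python
from enum import Enum


class Mode(Enum):
    UNBOUNDED_DETERMINISTIC = 1
    CONVENTIONAL = 2
    BOUNDED_DETERMINISTIC = 3


def stratum_ends(mode, instructions, cycles, threshold,
                 at_sync=False, resources_exhausted=False):
    """Return True if the current stratum should terminate."""
    if at_sync:
        return True                              # all modes end on synchronization
    if mode is Mode.CONVENTIONAL:
        return cycles >= threshold               # timing event, not repeatable
    if mode is Mode.BOUNDED_DETERMINISTIC and resources_exhausted:
        return True                              # e.g. write cache overflow
    return instructions >= threshold             # architectural event only


print(stratum_ends(Mode.UNBOUNDED_DETERMINISTIC, 1000, 5000, 1000))  # True
print(stratum_ends(Mode.CONVENTIONAL, 1000, 500, 1000))              # False
```

Only the conventional mode consults the cycle count, which is why it gives up repeatability in exchange for better load balance.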

Outline: Motivation & Goals; Model; Implementation (Write Cache, MIST Protocol, Stratum Size Predictor); Evaluation; Conclusion; Related Work (optional)

Implementation: Overview
Implementation Challenges:
– Stratification → Load imbalance due to barriers
– Buffering → Conventional store buffers do not scale
– Ordering → Serial flush is sloooooooow
Calvin-MIST Implementation:
– Store buffers → Unordered write cache
– Load imbalance → Stratum Size Predictor (in paper)
– Fast flush → MIST Coherence Protocol

Unordered Write Cache
Behavior:
– drops program store ordering
– coalesces stores
– prohibits loads in publish phase
Replacements/overflow:
1. End stratum
– Bounded Deterministic Mode
– Repeatable only on same HW
2. Log (TM-like)
– Unbounded Deterministic Mode
– Repeatable on any HW
[Diagram labels: Proc 0, Proc 1; Execute, Publish; Load A, Load C, Load B, Load A; Store B, Store D, Store A; Atomic Flush; Store D]
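
The behavior and the two overflow policies might be sketched as follows. This is a simplification under stated assumptions: the cache is modeled as fully associative with a hypothetical `capacity`, the unbounded-mode log is a plain dictionary, and all names are invented for illustration.

```python
class WriteCache:
    """Toy unordered, coalescing write cache (illustrative only)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = {}        # address -> latest value; store order is dropped
        self.log = {}            # overflow log, used only in unbounded mode

    def store(self, addr, value, bounded):
        """Buffer a store. Return True if the stratum must end on overflow."""
        if addr in self.entries:
            self.entries[addr] = value           # coalesce with earlier store
        elif addr in self.log:
            self.log[addr] = value
        elif len(self.entries) < self.capacity:
            self.entries[addr] = value
        elif bounded:
            return True                          # policy 1: end the stratum
        else:
            self.log[addr] = value               # policy 2: spill to the log
        return False

    def flush(self, memory):
        """Publish phase: write every buffered value to memory, then clear."""
        memory.update(self.entries)
        memory.update(self.log)
        self.entries.clear()
        self.log.clear()


bounded = WriteCache(capacity=2)
bounded.store("A", 1, bounded=True)
bounded.store("B", 2, bounded=True)
bounded.store("A", 5, bounded=True)              # coalesces, no new entry
must_end = bounded.store("C", 9, bounded=True)   # cache is full

unbounded = WriteCache(capacity=2)
for addr, value in [("A", 1), ("B", 2), ("C", 9)]:
    unbounded.store(addr, value, bounded=False)
memory = {}
unbounded.flush(memory)
print(must_end, bounded.entries, memory)
```

Because overflow in bounded mode depends on the cache size, a run is repeatable only on hardware with the same capacity, which matches the slide's "Repeatable only on same HW".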

MIST Protocol
Goal: speed up publish phase
– delayed “timebomb” invalidate (in paper)
– write caches flush in parallel
[Diagram labels: Proc 0, Proc 1; Execute, Publish; Load A, Load C, Load B, Load A; Store B, Store D, Store D, Store A, Store D, Store D]

Evaluation Methodology
Infrastructure:
– Bochs
– GEMS
Workloads:
– Parsec
– Mantevo
Simulated configurations:
Parameter | Base | Calvin-MIST
Cores | 8, 2.0 GHz in-order pipelined | same as Base
Write Cache | N/A | 64 entry, 8 way
L1 Cache | Private, Split L1 I&D, 32K 8-way, 1 cycle | same as Base
Coherence Protocol | Conventional MOESI | Multiple Writer MIST
Barrier | N/A | 16 cycle latency
L2 Cache | Shared, 8MB, 16-way, 8 banks, 12 cycles | same as Base
Directory | Distributed at the L2 banks | same as Base

Unbounded Deterministic Mode
[Chart: normalized execution time, with a publish component. Annotations: ~20% slowdown; fine-grained locking; frequent overflow]

Bounded Deterministic Mode
[Chart: normalized execution time, with a publish component. Annotations: ~20%; simpler HW; better stratum size]

Conventional Mode
[Chart: normalized execution time, with a publish component. Annotations: ~8% slowdown; bad stratum size]

Conclusion
Determinism Valuable:
– Same inputs → same multithreaded execution
– Debugging, Fault Tolerance, Security
Performance Required:
– Uninteresting to be slow & deterministic
Propose: Calvin
– Leverages TSO in hardware to deterministically order memory operations
Multiple modes without speculation:
– Deterministic: 20% slowdown
– Conventional: 8% slowdown
Good Performance

Related Work
DMP [Devietti, J. et al., ASPLOS ‘09]
– First hardware solution for strong determinism
– Good performance through TM-like speculation
– Calvin seeks good performance with less speculation (power?)
Kendo [Olszewski, M. et al., ASPLOS ‘09]
– First software solution for weak determinism
– Good performance, but not as general (e.g., debugging data races)
– Calvin seeks good performance for strong determinism
CoreDet [Bergan, T. et al., ASPLOS ‘10]
– First software solution for strong determinism
– Exploits relaxed model, e.g., TSO with software store buffer
– Performance left room for improvement
– Calvin implements similar ideas in hardware to be fast

Questions?

Backup Slides Follow

Calvin Model
Deterministically order memory operations within stratum:
– All loads before all stores
– All stores are ordered by processor
Example (Stratum S):
– processor 0: ST A <- 1; R2 <- LD A; R1 <- LD B; ST A <- 2
– processor 1: ST B <- 3; R0 <- LD A
[Diagram labels: Memory Order; Buffer (A = 1, A = 2, B = 3); Execute; Publish; R0 = 2, R1 = 1, R2 = 0]

Coherence Protocol
– Write-back protocol
– Allows parallel write cache flush
– Allows fast reader invalidate

L1 Cache States
State | Meaning | Global Invariant
I | Not Present/Invalid | 0 or more readers, 0 or more writers
S | Read permission, no other writers in the system | 1 or more readers, 0 writers
M | Write permission, didn’t write in current stratum | 0 readers, 1 writer
Ts | Read permission until the end of the stratum | 1 or more readers, 1 or more writers
Mw | Write permission, wrote in current stratum | 0 readers, 1 writer
MMw | Write permission until the end of the stratum | 2 or more writers, 0 or more readers

Directory States
State | Meaning | Global Invariant | Valid
I | Not Present/Invalid | 0 readers, 0 writers | Memory
S | One or more readers | 1 or more readers, 0 writers | L2 Cache
M | Only one writer | 0 or more readers, 1 writer | Processor
MM | No readers/writers | 0 readers, 0 writers | L2 Cache
MS | Multiple writers | 0 or more readers, 1 or more writers | L2 Cache

Stratum Size Predictor
Stratum Size Predictor:
– optimizes stratum size
– adapts to load imbalance
Large stratum:
– reduces instruction mix variability
Small stratum:
– adapts to synchronization

L1 Cache Reader Self-Invalidation
[Timeline diagram labels: Time; Execute, Publish; L1 Cache (Processor 0, Processor 1); L2 Cache; LD; ST Intent; block B shown as Shared and as Modified at different points]

Predictor
[Flowchart labels: Stratum Ends; MemBar?; C&BD: Overflow?; Decrement Predictor; Increment Predictor; Saturated? with branches No, Yes/Low, Yes/High; Size*2; Size/2]
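
One possible reading of this flowchart is a saturating counter that halves or doubles the stratum size once it saturates. The sketch below is an interpretation, not the authors' design: which events decrement the counter, the counter width, and every default value are assumptions.

```python
class StratumSizePredictor:
    """Saturating-counter stratum size predictor (one guess at the flowchart)."""

    def __init__(self, size=1024, min_size=64, max_size=65536, counter_max=3):
        self.size = size
        self.min_size = min_size
        self.max_size = max_size
        self.counter_max = counter_max
        self.counter = counter_max // 2

    def stratum_ended(self, ended_early):
        """Update after a stratum ends and return the next stratum size.

        ended_early: True if a memory barrier or overflow ended the stratum
        before the instruction threshold was reached (assumed meaning).
        """
        if ended_early:
            if self.counter == 0:                        # saturated low
                self.size = max(self.min_size, self.size // 2)
            else:
                self.counter -= 1
        else:
            if self.counter == self.counter_max:         # saturated high
                self.size = min(self.max_size, self.size * 2)
            else:
                self.counter += 1
        return self.size


predictor = StratumSizePredictor()
sizes = [predictor.stratum_ended(early)
         for early in [True, True, True, False, False, False, False]]
print(sizes)  # [1024, 512, 256, 256, 256, 256, 512]
```

The counter adds hysteresis, so a single barrier or a single full-length stratum does not change the size on its own.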

Predictor Helps Improve Performance
[Chart: speedup]

Write Cache Size Affects Performance
[Chart: normalized execution time]

Bottom Line
[Chart: normalized execution time, with a publish component; Mantevo]

Calvin-MIST Operation

Example Protocol Operation

Atomic Operations
– Ensure that only one atomic operation executes per stratum
– Logically place the atomic operation at the end of the stratum
– Terminate stratum on atomic operation
– Execute both R and W parts of RMW as processor’s last store
– Allows processors to communicate within a stratum

Multi-Writer Example
[Diagram labels: Core 1, Core 2; L1 Cache; Write Cache; L2 Cache; Execution Phase, Publish Phase; messages: FWD, ACK, NACK, ACK]

Atomic Operations
TSO atomic ordering rules:
1) All previous loads and stores
2) Atomic (both load and store portion)
3) All subsequent loads and stores
Calvin satisfies rules by:
1) Ending strata on atomics
2) Executing atomic op entirely in publish phase
3) Executing next instruction in next stratum

Atomic Example
[Diagram labels: Proc 0, Proc 1; Memory Order; operations shown: Load A, Store A, Store L, Load C, Store C, Store B, Load B, RMW L, Load A, Store C; Stall]

Deterministic Input
Program’s repeatability depends on deterministic input
Input:
– Use mechanisms from uniprocessor deterministic replay, e.g.: ReVirt, VMware Replay, FDR
Interrupts:
– Delivered only on strata boundaries
– Makes for easy logging (e.g., )

Conventional Mode Slowdown
Sources:
– Barrier latency (16 cycle): results indicate 4 cycle barrier largely eliminates overhead
– Load imbalance: especially in presence of fine-grained communication
– Slow inter-thread communication: threads cannot communicate within a stratum

With Average Stratum Size