BulkCommit: Scalable and Fast Commit of Atomic Blocks


BulkCommit: Scalable and Fast Commit of Atomic Blocks in a Lazy Multiprocessor Environment
Authors: Xuehai Qian, Benjamin Sahelices, Depei Qian
Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, 2014
Presented by: Pravin Ramteke (CS13F08), Sawan Belekar (13IS04F)
Course Instructor: Dr. Basavaraj Talawar

Motivation
• Architectures that continuously execute Atomic Blocks or Chunks (e.g., TCC, BulkSC)
  – Chunk: a group of dynamically contiguous instructions executed atomically
  – Provides performance and programmability advantages [Hammond 04][Ahn 09]
• Chunk commit is an important operation: making the state of a chunk visible atomically
• We focus on designs with lazy detection of conflicts
  – Provides higher concurrency in codes with many conflicts
• Parallelizing the commit is challenging
  – Requires a consistent conflict-resolution decision across all the distributed directory modules
  – Therefore, most current schemes have some sequential steps in the commit
• In addition, current lazy conflict-resolution schemes are sub-optimal
  – They incur a squash even when there is only a Write-After-Write (WAW) conflict

Lifetime of a Chunk
• Phases: Execution, then Commit (Grouping + Propagation)
• Execution:
  – Reads and writes bring lines into the cache
  – No written line is made visible to other processors
  – Execution ends when the last instruction of the chunk completes
• Commit: make the chunk state visible atomically
  – Grouping: set the relative order of any two conflicting chunks
    • Grabbing the directories: locking their memory lines and detecting conflicts
    • After a commit grabs all the relevant directories, it is guaranteed to commit successfully
  – Propagation: making the stores in a chunk visible to the rest of the system
    • Involves sending invalidations and updating directory states
    • Atomicity is ensured since the relevant cache lines are logically locked by signatures during the process
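To make the phase ordering above concrete, here is a minimal Python sketch of a chunk moving from execution through grouping and propagation; the Chunk and Phase names are illustrative assumptions, not taken from the BulkCommit paper.

```python
# Minimal sketch of the chunk lifecycle; names are illustrative assumptions.
from enum import Enum, auto

class Phase(Enum):
    EXECUTION = auto()    # reads/writes stay local, invisible to other processors
    GROUPING = auto()     # commit, step 1: grab all relevant directories
    PROPAGATION = auto()  # commit, step 2: send invalidations, update directories
    COMMITTED = auto()

class Chunk:
    def __init__(self, chunk_id):
        self.chunk_id = chunk_id
        self.phase = Phase.EXECUTION
        self.read_set, self.write_set = set(), set()

    def access(self, addr, is_write):
        assert self.phase is Phase.EXECUTION
        (self.write_set if is_write else self.read_set).add(addr)

    def start_grouping(self):
        assert self.phase is Phase.EXECUTION
        self.phase = Phase.GROUPING     # grab directories, resolve conflicts

    def propagate_and_commit(self):
        assert self.phase is Phase.GROUPING
        self.phase = Phase.PROPAGATION  # invalidations sent, directories updated
        self.phase = Phase.COMMITTED
```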

Inefficiency: Squash on WAW-Only
• (Figure) Timeline comparing two systems for writes to the same line x by two processors:
  – Conventional system: the store buffers simply serialize the WAW writes to x; no re-execution is needed
  – Chunk-based system: chunks C0 and C1 both write x; even though the conflict is WAW-only, C1 is squashed and re-executed after C0 commits

Contribution: BulkCommit
• BulkCommit: a commit protocol with parallel grouping and squash-free serialization of WAW-only conflicts
• IntelliSquash: no squash on WAW-only conflicts
  – Insight: use the L1 cache as the "store buffer" for the chunk
• IntelliCommit: parallel grouping without broadcast
  – Insight: use a preemption mechanism to ensure a consistent order between two conflicting chunks
• Together, BulkCommit aims at an optimal commit protocol design

Outline • Motivation • IntelliSquash • IntelliCommit • Evaluation

IntelliSquash: Insight
• Challenge: the speculative data produced by a chunk cannot be lost when the chunk is ready to commit
• Solution: use the L1 cache as the "store buffer" for a chunk
  – Similar to the store buffer in a conventional system
  – On receiving an invalidation, the speculative dirty words of a line are preserved
• Absent (A) bit: set when
  – The line is not present, and
  – The line contains some speculative words
• Per-word dirty bits (not shown) track which words hold speculative data
• (Figure) Example bit layout per line: Absent (A), speculative (sp), Valid (V), per-word dirty (d) bits; the directory state for line m goes from D in P1 to S in P0&P1 after commit
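A small, illustrative model of the line state described above; the field names follow the slide's A / sp / V / d bits, but the encoding and the L1Line class are assumptions, not the paper's hardware design.

```python
# Illustrative model of an IntelliSquash L1 line; not the paper's RTL.
WORDS_PER_LINE = 4

class L1Line:
    def __init__(self):
        self.valid = False
        self.absent = False                         # A bit: line not fully present
        self.spec = False                           # sp bit: line holds speculative data
        self.word_dirty = [False] * WORDS_PER_LINE  # per-word speculative dirty bits
        self.words = [None] * WORDS_PER_LINE

    def spec_store(self, idx, value):
        # A chunk's store writes the word locally; nothing is made visible yet.
        self.valid = self.spec = True
        self.word_dirty[idx] = True
        self.words[idx] = value

    def on_invalidation(self):
        # Key IntelliSquash behavior: an external invalidation does NOT discard
        # the speculative dirty words; only the non-speculative words are lost.
        if self.spec:
            for i in range(WORDS_PER_LINE):
                if not self.word_dirty[i]:
                    self.words[i] = None
            self.absent = True   # rest of the line must be re-fetched before use
        else:
            self.valid = False   # normal invalidation of a non-speculative line
```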

IntelliSquash: Merge Operation
• Merge the remote non-speculative cache line with the locally buffered speculative words
• Performed when a whole line whose Absent bit is set is brought into the cache:
  – On a miss to a word that is not present
  – On commit: the line may not be accessed again, so it must be brought into the cache as if there were a miss
• The speculative dirty words are kept; the remaining words are filled from the non-speculative line
• The Absent (A) bit is then unset
• (Figure) Example: after the merge and commit, the directory state for line m changes from D in P1 to S in P0&P1
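A sketch of the merge itself, continuing the hypothetical per-word model from the previous sketch: locally dirty (speculative) words win, everything else comes from the freshly fetched non-speculative copy of the line.

```python
def merge_incoming_line(spec_words, word_dirty, incoming_words):
    """Return the merged line: keep speculative dirty words, take the rest
    from the incoming non-speculative line."""
    return [local if dirty else remote
            for local, dirty, remote in zip(spec_words, word_dirty, incoming_words)]

# Example: word 1 was speculatively written locally; words 0, 2, 3 come from memory.
merged = merge_incoming_line([None, 42, None, None],
                             [False, True, False, False],
                             [10, 11, 12, 13])
assert merged == [10, 42, 12, 13]
# After the merge, the Absent (A) bit would be cleared.
```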

Outline • Motivation • IntelliSquash • IntelliCommit • Evaluation

IntelliCommit Protocol
• On chunk commit:
  – The processor sends a commit_request to all the relevant directory modules
• A directory module that receives a commit_request:
  – Locks the memory lines
  – Responds with a commit_ack
• The processor counts the number of commit_acks received
• The processor sends a commit_confirm when it has received the expected number of commit_acks (the directory group is formed)
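A sketch of this basic handshake, ignoring conflicts and preemption (those are handled below). The message names (commit_request, commit_ack, commit_confirm) follow the slide; the Directory and Committer classes are illustrative assumptions.

```python
class Directory:
    """One directory module: locks lines and acks commit requests."""
    def __init__(self, name):
        self.name = name
        self.locked_lines = set()

    def on_commit_request(self, committer, lines):
        self.locked_lines |= set(lines)    # lock the lines this chunk wrote here
        committer.on_commit_ack(self)      # reply with commit_ack

class Committer:
    """Processor-side commit logic: count acks, then confirm."""
    def __init__(self, lines_per_dir):
        self.lines_per_dir = lines_per_dir           # {Directory: lines written there}
        self.acks_pending = len(lines_per_dir)       # Commit Ack Counter (CAC)

    def start_commit(self):
        for d, lines in self.lines_per_dir.items():  # sent in parallel in hardware
            d.on_commit_request(self, lines)

    def on_commit_ack(self, directory):
        self.acks_pending -= 1
        if self.acks_pending == 0:                   # directory group formed
            self.send_commit_confirm()

    def send_commit_confirm(self):
        print("commit_confirm: directory group formed, start propagation")

# Example: a chunk that wrote lines in two directory modules.
d0, d1 = Directory("D0"), Directory("D1")
Committer({d0: ["x"], d1: ["y", "z"]}).start_commit()
```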

Conflicting Chunks Trying to Commit
• Two conflicting chunks may try to commit at the same time
• The overlapped directory modules can receive the two commit_requests in opposite orders
• The protocol must avoid deadlock in this situation

IntelliCommit: Deadlock Resolution
• Basic idea: enforce a consistent order between two conflicting chunks
• Piggyback a hardware-generated random number with the commit_request; every directory uses it to rank the two chunks the same way
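A tiny sketch of the ranking rule: each commit_request carries a hardware-generated random tag, and every directory compares the same pair of tags, so all modules rank two conflicting chunks identically. Whether the smaller or the larger tag wins is an arbitrary convention; "smaller wins" and the processor-ID tie-break here are assumptions for illustration only.

```python
def winner(req_a, req_b):
    """req = (random_tag, processor_id); return the higher-priority request."""
    return min(req_a, req_b)

assert winner((7, 0), (3, 2)) == (3, 2)   # same outcome at every directory
```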

IntelliCommit: Deadlock Resolution (cont.)

Why Does IntelliCommit Work?
1. Once the directory group of a chunk is formed, the chunk can no longer be preempted by another chunk
2. All the directory modules involved in a conflict locally reach the same decision on which chunk has the higher priority

IntelliCommit Implementation
• Extra messages (P = processor, D = directory):
  – preempt_request (D→P)
  – preempt_ack (P→D)
  – preempt_nack (P→D)
  – preempt_finish (D→P)
• Commit Ack Counter (CAC): number of commit_acks not yet received
• Preemption Vector (PV), with N = #processors = #directories:
  – Each processor holds N counters of size log(N) bits
  – PV[i] = k at processor Pj means Pj's chunk is currently preempted by Pi's chunk in k directories
  – Increment PV[i]: when about to send a preempt_ack for Pi's chunk
  – Decrement PV[i]: when a preempt_finish for Pi's chunk is received
• When to send commit_confirm? When (CAC == 0) && (PV[i] == 0 for every i)
  – i.e., all commit_acks have been received and the chunk is not preempted by any other chunk in any directory
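A sketch of this per-processor bookkeeping. The handler names follow the slide's message names; the CommitState class and the example at the end are illustrative assumptions, not the paper's hardware.

```python
class CommitState:
    """Commit Ack Counter (CAC) plus Preemption Vector (PV) for one processor."""
    def __init__(self, n_procs, n_relevant_dirs):
        self.cac = n_relevant_dirs       # commit_acks not yet received
        self.pv = [0] * n_procs          # PV[i]: #directories where Pi preempted us

    def on_commit_ack(self):
        self.cac -= 1
        self.maybe_confirm()

    def on_preempt_request(self, i):
        # About to send preempt_ack, letting Pi's chunk go first in one directory.
        self.pv[i] += 1

    def on_preempt_finish(self, i):
        # Pi's chunk finished committing in that directory.
        self.pv[i] -= 1
        self.maybe_confirm()

    def maybe_confirm(self):
        # Slide's condition: (CAC == 0) && (PV[i] == 0 for every i).
        if self.cac == 0 and all(v == 0 for v in self.pv):
            print("send commit_confirm")

# Example: two relevant directories, one preemption by P3 in between.
s = CommitState(n_procs=4, n_relevant_dirs=2)
s.on_commit_ack(); s.on_preempt_request(3); s.on_commit_ack()  # still held back
s.on_preempt_finish(3)                                         # now confirms
```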

Outline • Motivation • IntelliSquash • IntelliCommit • Evaluation

Evaluation
• Cycle-accurate NoC simulation with processor and cache models
• Number of cores: 16 and 64
• 11 SPLASH-2 and 7 PARSEC applications
• One or two outstanding chunks per processor
• Implemented the main distributed commit protocols:
  – Scalable TCC (ST)
  – Scalable Bulk (SB)
  – BulkCommit without IntelliSquash (BC-SQ)
  – BulkCommit (BC)

SPLASH-2 Performance
• BulkCommit reduces both squash time and commit time

PARSEC Performance

One and Two Outstanding Chunks
• Using two outstanding chunks is not always useful due to the set restriction:
  – Two chunks from the same processor cannot write to the same cache set

Conclusion
• Proposed BulkCommit: a commit protocol with parallel grouping and squash-free serialization of WAW-only conflicts
• Key properties:
  – Serializes WAW conflicts between chunks without squashing
    • Exploits the similarity between a chunk commit and an individual store
  – Parallel grouping
    • Uses preemption mechanisms to order two conflicting chunks consistently
• Results:
  – Eliminates the commit bottleneck even with a single outstanding chunk
  – Reduces squash time for some applications