Signatures in Transactional Memory Systems Dissertation Defense Luke Yen 1/29/2009.


2 Key Contributions Trend: Transactional memory (TM) emerging parallel programming paradigm. Programmer-annotated transactions that execute atomically (all or nothing). Challenge #1: Hardware TM (HTM) systems may restrict transactions or incur overheads on common events (e.g., cache evictions). Contribution: LogTM-SE HTM: Simple hardware and interacts with operating system to virtualize transactions. No overhead on cache evictions.

3 Key Contributions Cont. Challenge #2: (1) H3 signatures have high area & power overheads & (2) Thread-private references cause false conflicts. Contribution: Notary: (1) Page-Block-XOR performs similarly to H3 but with lower overheads (2) Stack & heap-based privatization. Challenge #3: Difficult to understand HTM system performance. Contribution: TMProf: Lightweight hardware performance counters help HTM designers & TM programmers. Challenge #4: Signatures suffer from false conflicts. Contribution: Six hardware/software signature extensions to mitigate false conflicts.

4 Outline
Introduction and Background
  Transactional Memory background
  LogTM-SE [HPCA 2007] (Contribution #1)
Notary [MICRO 2008] (Contribution #2)
TMProf (Submitted for publication) (Contribution #3)
Conclusion
* Skip “Extensions to Signatures” (Contribution #4)
[Slide callout: Focus of presentation]

5 Transactional Memory (TM) Locks do not compose Can lead to deadlocks TM programmer says “I want this atomic” TM system “Makes it so” Focus on Hardware TM (HTM) Implementations Fast Leverage cache coherence & speculation But hardware finite & should be policy-free
Example:
void move(T s, T d, Obj key){
  atomic {
    tmp = s.remove(key);
    d.insert(key, tmp);
  }
}

LogTM Signature Edition (LogTM-SE) at 50,000 feet HTMs Fast Version management – for transaction commits & aborts HW handles old/new versions (e.g., write buffer) Conflict detection – commit only non-conflicting transactions HW handles conflict detection (R/W bits & coherence) But Closely Coupled to L1 cache On critical paths & hard for SW to save/restore Our Approach: Decoupled, Simple HW, SW control LogTM-SE HW: LogTM’s Log + Signatures (from Illinois Bulk) SW: Unbounded nesting, thread switching, & paging 6 Details

Signature Background Signatures used to summarize and detect conflicts with a transaction’s read- and write-sets Inspired by Bulk system [Ceze,ISCA’06] Imprecise, can be implemented with Bloom filters Can have false positives, but never false negatives Also proposed for non-TM purposes (e.g., SC violation detection, atomicity violation detection, race recording) Ex: Use k Bloom filters of size m/k, with independent hash functions 7
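The "k Bloom filters of size m/k" organization on this slide can be sketched in a few lines. This is only an illustration, not the hardware design: the class name `Signature`, the method names, and the use of a salted BLAKE2 digest as a stand-in for the per-bank hardware hash are all assumptions made for the example.

```python
import hashlib


class Signature:
    """Parallel Bloom filter: k banks of m/k bits, one hash function per bank."""

    def __init__(self, m_bits=2048, k=2):
        assert m_bits % k == 0
        self.k = k
        self.bank_bits = m_bits // k
        self.banks = [0] * k  # each bank is a Python int used as a bit vector

    def _index(self, bank, addr):
        # Stand-in hash (assumption): a salted BLAKE2 digest, not a hardware hash.
        digest = hashlib.blake2b(
            addr.to_bytes(8, "little"), digest_size=8, salt=bytes([bank])
        ).digest()
        return int.from_bytes(digest, "little") % self.bank_bits

    def insert(self, addr):
        # Set one bit in every bank.
        for b in range(self.k):
            self.banks[b] |= 1 << self._index(b, addr)

    def may_contain(self, addr):
        # True may be a false positive; False is always correct.
        return all(
            (self.banks[b] >> self._index(b, addr)) & 1 for b in range(self.k)
        )

    def clear(self):
        # Reset the signature, for example at commit or abort.
        self.banks = [0] * self.k
```

An inserted address always tests positive, which is the "never false negatives" property; an address that was never inserted can still test positive when other insertions happen to set all k of its bits.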

8 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

Notary Executive Summary Tackle 2 problems with hardware signatures: Problem 1: Best signature hashing (i.e., H3) has high area & power overheads Solution 1: Use entropy analysis to guide lower-cost hashing (Page-Block-XOR, PBX) that performs similarly to H3 Ex: 8x fewer gates: 160 gates for H3 vs 20 gates for PBX Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs Solution 2: Avoid inserting private stack addrs, propose privatization interface for higher performance 9

10 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

Signature hash functions Which hash function is best? [Sanchez, YEN, MICRO’07] Bit-selection? Hash simply decodes some number of input bits H3? Each bit of a hash value is an XOR of (on avg.) half of the input address bits 11 Result (LogTM-SE w/ 2kb signatures): H3 better with >=2 hash functions However, H3 uses many multi-level XOR trees Can we improve this?
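The two hash families compared on this slide can be sketched as follows. The function names `make_h3` and `bit_select`, the seeded random masks, and the 26-bit input width used in the usage note are assumptions for illustration; a hardware H3 would fix its masks at design time.

```python
import random


def make_h3(out_bits, in_bits, seed=0):
    """Return an H3 hash: one random in_bits-wide mask per output bit."""
    rng = random.Random(seed)
    # Each mask selects, on average, half of the input address bits.
    masks = [rng.getrandbits(in_bits) for _ in range(out_bits)]

    def h3(addr):
        value = 0
        for i, mask in enumerate(masks):
            parity = bin(addr & mask).count("1") & 1  # XOR of the selected bits
            value |= parity << i
        return value

    return h3


def bit_select(addr, out_bits, skip=0):
    """Bit-selection: use out_bits consecutive address bits as the hash value."""
    return (addr >> skip) & ((1 << out_bits) - 1)
```

For example, `make_h3(10, 26)` yields a 10-bit index suitable for a 1024-bit bank. Each output bit is a parity over several address bits, which is where the multi-level XOR trees come from, whereas bit-selection needs no gates at all.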

H3 implementation Num XOR Ex: 2kb signatures, k=2, c=10, 32-bit addr = 160 XOR gates per signature Can we reduce the total gate count? 12

13 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

Entropy defined Insight: Use most random bits for hashing Use entropy to measure bit randomness
$$\text{Entropy} = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$$
p(x_i) = the probability of the occurrence of value x_i; N = number of sample values random variable x can take on
Entropy = amount of information required on average to describe outcome of variable x (in bits) Ex: What is the best possible lossless compression? 14
Entropy value of n-bit field:
Constant value with probability 1: 0 bits (min)
All bit patterns in n-bit field equally probable: n bits (max)
Other cases: between 0 and n bits
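As a rough illustration of the definition, the empirical entropy of a bit-field can be computed from a list of sampled addresses. The helper names `entropy_bits` and `field` are hypothetical, and the probabilities are simply observed frequencies.

```python
import math
from collections import Counter


def entropy_bits(samples):
    """Empirical Shannon entropy, in bits, of a sequence of observed values."""
    counts = Counter(samples)
    total = len(samples)
    # p * log2(1/p) is the same as -p * log2(p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def field(addr, lo, width):
    """Extract `width` bits of addr starting at bit position `lo`."""
    return (addr >> lo) & ((1 << width) - 1)
```

A constant field gives `entropy_bits([7] * 100) == 0.0`, and a 2-bit field whose four patterns are equally likely gives `entropy_bits([0, 1, 2, 3]) == 2.0`, matching the min and max cases above.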

Our measures of entropy For our workloads, we care about: Q1: What is the best achievable entropy? Global entropy – upper bound on entropy of address Q2: How does entropy change within an address? Local entropy – entropy of bit-field within the address 15 [Figure: global entropy is measured over address bits 31..6; local entropy is measured over a bit-window within bits 31..6, offset by NSkip bits.]

Entropy results Workloads to be described later Global entropy is at most 16 bits Bit-window for local entropy is 16 bits wide (NSkip from 0-10) Smaller windows (<16b) may not reach global entropy value Larger windows (>16b) hide some fine-grain info 16 Commercial Workloads

Page-Block-XOR (PBX) Motivated by 3 findings: (1) Lower-order bits have most entropy Follows from our entropy results (2) XORing two bit-fields produces random hash values From prior work on XOR hashing (e.g., data placement in caches, DRAM) (3) Bit-field overlaps can lead to higher false positives Correlation between the two bit-fields can reduce the range of hash values produced (worse for larger signatures) 17 Overlap Details

PBX implementation For 2kb signatures with 2 hash functions: 20 XOR gates for PBX vs 160 XOR gates for H3! 18 PPN and Cache-index fields not tied to system params: Use entropy to find two non-overlapping bit-fields with high randomness
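A minimal sketch of one PBX hash function follows, assuming a 10-bit output and two illustrative bit-field positions. The defaults `index_lo=6` and `ppn_lo=16` are placeholders chosen only so the fields do not overlap; the slide states that the actual fields are picked from the entropy analysis.

```python
def pbx_hash(addr, out_bits=10, index_lo=6, ppn_lo=16):
    """XOR a low-order (cache-index) field with a non-overlapping higher (PPN) field."""
    assert ppn_lo >= index_lo + out_bits, "bit-fields must not overlap"
    mask = (1 << out_bits) - 1
    index_field = (addr >> index_lo) & mask
    ppn_field = (addr >> ppn_lo) & mask
    return index_field ^ ppn_field  # out_bits two-input XOR gates in hardware


def xor_gate_count(k, out_bits):
    """Two-input XOR gates for k PBX hash functions of out_bits each."""
    return k * out_bits
```

With 2kb signatures and 2 hash functions, each function indexes a 1024-bit bank, so `xor_gate_count(2, 10)` gives the 20 gates quoted on the slide.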

Summary thus far Problem 1: H3 has high area & power overheads Solution 1: Use entropy analysis to guide lower-cost PBX Ex: 160 gates for H3 vs 20 gates for PBX Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs Solution 2: To be described 19

20 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

Privatization Problem: False conflicts caused by thread-private addrs Avoid conflicts if addrs not inserted in thread’s signatures Two privatization solutions: (1) Remove private stack references from sigs. Very little work for programmer/compiler Benefits depend on fraction of stack addresses versus all transactional references (2) Language-level interface (e.g., private_malloc(), shared_malloc() ) Even higher performance boost WARNING: Incorrectly marking shared objects as private can lead to program errors! 21

Page-based implementation Each page is assigned a status, private or shared Invariant: Page is shared if any object is shared If stack is private, library marks stack pages as private If using privatization heap functions, mark heap pages accordingly 22

OS support OS allocates different physical page frames for shared and private pages Sets a per-frame bit in translation entry if shared Reduce number of page frames used by packing objects with same status together Signatures insert memory addresses of transactional references to shared pages Query page sharing bit in HW TLB & current transactional status 23
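The filtering rule on this slide (insert only transactional references to shared pages) might look like the sketch below. The toy `Tlb` class, the `record_access` helper, the 4 KB page size, and the use of a Python set in place of a hardware signature are all assumptions for illustration.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages


class Tlb:
    """Toy TLB mapping virtual page number -> (physical frame number, shared bit)."""

    def __init__(self):
        self.entries = {}

    def map(self, vpn, pfn, shared):
        self.entries[vpn] = (pfn, shared)

    def translate(self, vaddr):
        pfn, shared = self.entries[vaddr >> PAGE_SHIFT]
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        return (pfn << PAGE_SHIFT) | offset, shared


def record_access(signature, tlb, vaddr, in_transaction):
    """Insert the physical address only for transactional accesses to shared pages."""
    paddr, shared = tlb.translate(vaddr)
    if in_transaction and shared:
        signature.add(paddr)
    return paddr
```

References to private pages never set signature bits, so they cannot cause false conflicts; the cost is that a shared object wrongly placed on a private page would go undetected, which is the hazard the privatization slide warns about.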

24 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

Methodology Full-system simulation (GEMS) Transistor-level design for area & power of XOR gates CACTI for Bloom filter bit array area & power Linear scaling to 65nm or 90nm for area, original 400nm for power Single-chip CMP 16 single-threaded, in-order cores 32kB, 4-way private L1 I & D 8MB, 8-way shared L2 cache MESI directory protocol Signatures from 64b-64kb (8B-8kB) & “perfect” 25

Workloads Micro-benchmarks SPLASH-2 apps Barnes & Raytrace – exert most signature pressure Stanford STAMP apps Vacation, Genome, Delaunay, Bayes, Labyrinth, Yada, Intruder DNS server BIND 26

PBX vs H3 area & power 27 Area & power overheads (2kb, k=4):
Type of overhead | Bloom filter bit array | H3 hash | PBX hash | H3 sig. | PBX sig. | % savings for PBX sig.
Area (mm^2) | 4.67e-3 | 1.35e-3 | 7.83e-5 | 6.02e-3 | 4.75e-3 | 21
Power (mW) | 1.80e2 | 1.04e… | … | …e2 | 1.81e2 | 4.7

PBX vs H3 execution time 28 PBX performs similarly to H3

Privatization results summary Removing private stack references from signatures did not help Most addr references not to stack Most likely because running with SPARC ISA. Other ISAs (e.g., x86) likely have more benefits Privatization interface helps five workloads Remainder either does not have private heap structures or does not have high transactional duty cycle 29 Stack Results

Privatization interface results 30 Can improve execution time

31 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

Conclusions Tackle 2 problems with signature designs: (1) Area and power overheads of H3 hashing E.g., 160 XOR gates for H3, 20 for PBX (2) False conflicts due to signature bits set by private memory references Our solutions: (1) Use entropy analysis to guide hashing function (PBX), a low-cost alternative that performs similarly to H3 (2) Prevent private stack references from entering signatures, and propose a privatization interface for heap allocations Notary can be applied to non-TM uses: PBX hashing can directly transfer Privatization may transfer if addr filtering applies 32 Related Work

33 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

TMProf Executive Summary TM exposes more parallelism than lock-based programs Complex thread interactions How can HTM designer understand HTM performance? How can TM programmer understand TM program performance? TMProf: Per-processor hardware performance counters to count cumulative event frequencies & overheads in HTM system 34

35 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

Critical-section Parallelism TM enables critical-section parallelism – more thread interleavings 36 [Figure: With Locks, Thread 0 and Thread 1 both acquire Lock A; With TM, Thread 0 and Thread 1 both execute xact_begin.]

Hard to Predict Program Performance TM programmers may not have mastered intricacies of HTM system Programs run faster on specific HTM Example: 37

Profiling with TMProf Allows HTM designers & TM programmers to understand HTM performance With TMProf: 38

39 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

Background on Conflicts Three types: RW, WR, and WW Analogous to WAR, RAW, and WAW dependencies in uniprocessors 40
RW: Thread 0: xact_begin … LD A … ; Thread 1: xact_begin … ST A …
WR: Thread 0: xact_begin … ST B … ; Thread 1: xact_begin … LD B …
WW: Thread 0: xact_begin … ST C … ; Thread 1: xact_begin … ST C …

Conflict Detection & Resolution Conflicts detected eagerly or lazily Eagerly – when requests occur Lazily – at transaction commit Conflict resolution Stall or abort on conflict Choose set of procs to take action 41

42 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

43 TMProf Per-processor HW counters measuring cumulative event frequencies and cumulative event overheads Two implementations: Base & Extended Base (BaseTMProf): Breaks down HTM execution cycles into common components Extended (ExtTMProf): Builds on BaseTMProf & adds HTM-specific transaction-level profiling

BaseTMProf & ExtTMProf BaseTMProf: Total cycles = stalls + aborts + wasted_trans + useful_trans + committing + nontrans + implementation specific Assume in-order procs, but can extend for out-of-order procs ExtTMProf: BaseTMProf profiling plus Size of aborted transactions Amount of transactional work after write-set prediction HTMs may add more detailed profiling in future 44 Details
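The BaseTMProf cycle breakdown can be modeled as a set of per-processor counters whose sum is the total cycle count. The class below is a hypothetical software model, not the hardware interface; the component names follow the slide, with `implementation_specific` standing in for any HTM-specific categories.

```python
COMPONENTS = (
    "stalls", "aborts", "wasted_trans", "useful_trans",
    "committing", "nontrans", "implementation_specific",
)


class BaseTMProfCounters:
    """Per-processor cumulative cycle counters, one per component."""

    def __init__(self):
        self.cycles = dict.fromkeys(COMPONENTS, 0)

    def add(self, component, cycles):
        # Charge a number of cycles to exactly one component.
        self.cycles[component] += cycles

    def total_cycles(self):
        return sum(self.cycles.values())

    def breakdown(self):
        # Fraction of total cycles spent in each component.
        total = self.total_cycles()
        return {name: count / total for name, count in self.cycles.items()}
```

Because every cycle is charged to exactly one component, the fractions returned by `breakdown()` sum to 1, which is what makes stacked execution-time charts of the kind used in the case studies possible.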

45 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

46 Two Case Studies TMProf profiling two HTMs: LogTM-SE (eager conflict detection & version management, EE) Approximation of Stanford’s TCC (lazy conflict detection & version management, LL) Examine key parameters of eager & lazy conflict detection Idealize version management Same system parameters as Notary 16-processor CMP w/ in-order, single-issue processor cores Perfect signatures Same workloads

EE: Different Conflict Resolutions Three different conflict resolutions: Base, Timestamp, Hybrid All use timestamps Base: Requestor stalls until possible deadlock Timestamp: Older requestors always abort younger transactions. Younger requestors stalled by older transactions. Hybrid: Base, except RW from older writer aborts younger reader 47
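The three policies can be read as a decision function over the conflict type and the two transactions' timestamps. The sketch below is one interpretation of the slide, not the simulator's logic: the function name, the action strings, and the choice that the requestor is the one to abort when Base detects a possible deadlock are assumptions.

```python
def resolve(policy, conflict, requestor_ts, responder_ts, possible_deadlock=False):
    """Return 'stall_requestor', 'abort_requestor', or 'abort_responder'.

    A smaller timestamp means an older transaction.
    `conflict` is 'RW', 'WR', or 'WW'.
    """
    requestor_older = requestor_ts < responder_ts
    if policy == "timestamp":
        # Older requestors abort younger transactions; younger requestors stall.
        return "abort_responder" if requestor_older else "stall_requestor"
    if policy == "hybrid" and conflict == "RW" and requestor_older:
        # An older writer requests a block read by a younger transaction.
        return "abort_responder"
    if policy in ("base", "hybrid"):
        # Assumption: the requestor aborts once a possible deadlock is flagged.
        return "abort_requestor" if possible_deadlock else "stall_requestor"
    raise ValueError(f"unknown policy: {policy}")
```

For example, `resolve("hybrid", "RW", 1, 2)` returns `"abort_responder"`, while the same conflict under `"base"` returns `"stall_requestor"`.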

EE: Write-set Prediction Avoid aborts from load then store pattern from thread Predict & serialize on these conflicts 48 [Figure: two timelines for transactions T0, T1, and T2. First: requests GetS … GetX … GetS … GetS, ending in ABORT. Second: requests GetX … GetX … GetS … GetS, with STALL.]

Results from Conflict Resolutions 49 Trends: 1) Timestamp & Hybrid better than Base

Timestamp & Hybrid Better than Base 50 Fewer total stalls & eliminates all RW Requestor older stalls

EE Summary with BaseTMProf BaseTMProf helps HTM designer understand performance of conflict resolution schemes Lightweight, fast, dynamic profiling Can be implemented in prototype HTM systems 51

Write-set Prediction Results 52 Focus on workloads that degrade from prediction Prediction increases Stall cycles

ExtTMProf’s Transaction-level Profiling 53 Prediction helps short transactions Prediction hurts large transactions – reduces concurrency Predictions Help Predictions Hurt

EE Summary with ExtTMProf Helps HTM designers understand why write-set prediction degrades (or improves) performance Offline analysis (e.g., traces) unable to determine performance implications of dynamic conflicts How can TMProf help analyze LL systems? 54

LL: Parallel Versus Serial Commit Serial = Only one committer at a time Parallel = Multiple concurrent committers Faster than Serial We idealize its implementation 55

LL: More Prefetching than EE Eager conflict detection: Progress bounded by location of conflicts Early conflicts -> abort transactions early (little prefetching) Late conflicts -> abort transactions late (lots of prefetching) Lazy conflict detection: Committers finish transaction before detecting conflicts High probability for lots of prefetching 56

Parallel Commit Results 57 Parallel commit removes commit token bottleneck

Conflicts with Parallel Commit 58 All conflicts either RW or WR – no WW conflicts

LL Summary with BaseTMProf BaseTMProf clearly shows why parallel commit helps Stall breakdown shows mostly WR conflicts BaseTMProf helps HTM designers decide whether to implement parallel commit Parallel commit more complex than serial commit 59

Prefetching Results 60 Useful Trans should be similar for EE & LL, but LL incurs fewer cycles Why?

ExtTMProf’s Transaction-level Profiling 61 LL’s aborted transactions prefetch farther than EE

LL Summary with ExtTMProf Explains why workloads execute faster on LL than on EE May influence HTM design decision to implement LL rather than EE Helps TM programmer understand why programs run faster on some HTMs 62

63 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

64 Software Rollback Better than Hardware Rollback Software rollback reduces Stalls & Wasted Trans May reduce contention in HTM?

Hardware for Critical-path Profiling Counter-based profiling is not sufficient Multi-threaded programs exhibit variability: Different dynamic code paths Inter-thread dependencies Memory latencies Factors change critical-path – longest control flow that determines execution time Hardware critical-path profiling can aid in understanding performance Faster than offline, software analyses 65

66 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

Conclusions TMProf – lightweight per-processor hardware counters for understanding HTM performance Cumulative event frequencies & overheads Two implementations: Base & Extended Two case studies: LogTM-SE & Approximation of TCC Future TMProf might add hardware support for critical-path profiling 67 Related Work

68 Outline Introduction and Background Notary TMProf Conclusion

Conclusions Challenge #1: Hardware TM (HTM) systems may restrict transactions or incur overheads on common events. Contribution: LogTM-SE HTM Challenge #2: (1) H3 signatures have high area & power overheads & (2) Thread-private references cause false conflicts. Contribution: Notary 69

Conclusions Cont. Challenge #3: Difficult to understand HTM system performance. Contribution: TMProf Challenge #4: Signatures suffer from false conflicts. Contribution: Six hardware/software extensions to signatures 70

Other Research & Contributions OS Support for Virtualizing Transactional Memory [Swift et al. TRANSACT ‘08] Implementing Signatures for Transactional Memory [Sanchez et al. MICRO ‘07] Performance Pathologies in Hardware Transactional Memory [Bobba et al., ISCA ’07 & Top Picks ‘08] Supporting Nested Transactional Memory in LogTM [Moravan et al., ASPLOS ‘06] GEMS 2.X development & support SMT in Opal, LogTM-SE in Ruby 71

Thank You! Questions? 72

73 Backup Slides

74 LogTM-SE Processor Hardware Segmented log, like LogTM Track R / W sets with R / W signatures Over-approximate R / W sets Tracks physical addresses Summary signature used for virtualization Conflict detection by coherence protocol Check signatures on every memory access for SMT [Figure: per-SMT-thread context with Registers, Register Checkpoint, LogFrame, LogPtr, TMcount, Read and Write signatures, and SummaryRead and SummaryWrite signatures; data caches (Tag, Data) hold NO TM STATE.]

75 Thread Switching Support Why? Support long-running transactions What? Conflict Detection for descheduled transactions How? Summary Read / Write signatures: If thread t of process P is scheduled to use an active signature, the corresponding summary signature holds the union of the saved signatures from all descheduled threads from process P. Updated using TLB-shootdown-like mechanism
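Because a signature is a bit vector, the union described on this slide is a bitwise OR. The sketch below models each saved signature as a Python integer; the function name and this representation are assumptions for illustration.

```python
def summary_signature(saved_signatures):
    """Bitwise OR of the saved signatures of a process's descheduled threads."""
    summary = 0
    for sig in saved_signatures:
        summary |= sig
    return summary
```

For instance, `summary_signature([0b0011, 0b0100])` is `0b0111`. A memory access that hits in the summary may conflict with some descheduled transaction, and with no descheduled transactions the summary is empty and never hits.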

76-79 Handling Thread Switching [Animation: processors P1-P4, each with W and R signatures and summary W and R signatures; threads T1, T2, and T3; the OS deschedules T1.]

80 Thread Switching Support Summary Summary Read / Write signatures Summarizes descheduled threads with active transactions One OS structure per process Check summary signature on every memory access Updated on transaction deschedule Similar to TLB shootdown Coherence

81 Paging Support Summary Problem: Changing page frames Need to maintain isolation on transactional blocks Solution: On Page-Out: Save Virtual -> Physical mapping On Page-In: If different page frame, update signatures with physical address of transactional blocks in new page frame.
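One way to read the page-in step is: for every block of the old frame that hits in a signature, insert the corresponding block address of the new frame. The sketch below uses a Python set as a stand-in for a signature; the function name, 64-byte blocks, and 4 KB pages are assumptions for illustration.

```python
BLOCK_BYTES = 64    # assumption: 64-byte cache blocks
PAGE_BYTES = 4096   # assumption: 4 KB pages


def remap_signature(signature, old_frame, new_frame):
    """After page-in to a different frame, add new-frame addresses of blocks that hit."""
    if old_frame == new_frame:
        return  # same frame: the signature is already correct
    for offset in range(0, PAGE_BYTES, BLOCK_BYTES):
        old_addr = old_frame * PAGE_BYTES + offset
        if old_addr in signature:  # may be a false positive with a real Bloom filter
            signature.add(new_frame * PAGE_BYTES + offset)
```

The old addresses stay in the signature because Bloom-filter signatures do not support deletion, so the signature then covers blocks from both frames.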

82 Paging Support Animation VP1 PP1 A B C D PP2 A’ B’ C’ D’ Read sig. Write sig. A? Y A’ B? Y B’ C? C’ D? D’ Page-out Page-in Read & Write signatures isolate memory blocks from PP1 & PP2 Return

BaseTMProf for LogTM-SE (1 of 3) Differentiate between read dependent & write dependent aborts Meta-data (e.g., 3 bits for conflict types + 1 bit indicating if responder older) on NACK messages Per-processor tables to track conflicts with other procs RW conflict only = read-dependent Stall cycles = cycle conflict detected – cycle request sent to memory subsystem Abort cycles = cycle abort completes – cycle abort initiates 83

BaseTMProf for LogTM-SE (2 of 3) Wasted_trans cycles = cycle abort initiates – cycle transaction begins Store transaction begin cycle in separate register Commit cycles = cycle commit completes – cycle commit initiates No commit actions = no commit cycles Track cycle of start of commit action in separate register 84

BaseTMProf for LogTM-SE (3 of 3) Nontrans cycles = cycle of transaction begin – cycle after last transaction commit Track cycle of last transaction commit in separate register Backoff cycles = cycle retry transaction – cycle abort completes Barrier cycles = cycle exit barrier – cycle enter barrier 85

ExtTMProf for LogTM-SE Work remaining after write-set prediction: Store transaction size (read+write-set sizes) at each prediction - lazily copy to software or use many registers At commit, subtract saved transaction size from final transaction size at commit Differences processed by software to produce histograms Size of aborted transactions: Store read- and write-set sizes of aborted transaction in separate registers 86

BaseTMProf for TCC Stall cycles recorded at transaction commit When write-sets are broadcast or commit request is sent to directory No breakdown of read-dependent & write-dependent abort cycles Since aborts do not stall winner (abortee) Committing cycles = cycle commit phase completes – cycle commit phase begins Between cycle all stores flushed from write buffer & broadcasting write-set 87

ExtTMProf for TCC Size of aborted transactions: Track read- and write-set sizes of aborted transactions Just like for LogTM-SE 88 Return

Extensions to Signatures Overview Six extensions to reduce false conflicts Static Transaction Identifier (XID) Independence Object Identifiers (IDs) Spatial locality with static signatures Spatial locality with dynamic signatures Coarse-fine hashing Dynamic re-hashing Evaluate using ideal hardware & software 89 Best performance

XID Independence, Object IDs XID Independence: Programmer declares set of static XIDs that conflict with each other Information passed to hardware for conflict detection Signature check only for XIDs that possibly conflict Object IDs: High-level objects accessed by each transaction E.g., Trees, hash buckets, nodes Programmer declares set of objects accessed in transaction Designed to handle dynamic, fine-grain conflicts 90

Optimizing for Spatial locality Spatial locality exists in many programs High probability of accessing memory addresses neighboring current address in future Spatially local addresses may form a set that sets only a single signature bit Static signatures: Signature hashes operate on fixed, larger granularity (i.e., greater than cache-block) Granularity may not be suitable for all workloads Dynamic signatures: A set of signatures that hash on different granularities & set of hit counters Dynamically select which signature is “best” to use 91

Coarse-fine hashing, Dynamic re-hashing Coarse-fine hashing: Split addresses into two regions: Coarse & Fine Coarse – High-order address bits (e.g., page number) Fine – Low-order address bits (e.g., multiple cache blocks) Assign signature hashes to operate on Coarse & Fine bits Dynamic re-hashing: False conflicts can be caused by bad luck Dynamically alter hash functions – rotate input address bits before hashing Transform persistent false conflicts into transient false conflicts 92
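The dynamic re-hashing idea (rotate the input address bits before hashing) can be sketched as a wrapper around any hash function. The helper names and the 26-bit default width are assumptions for illustration.

```python
def rotate_left(value, amount, width):
    """Rotate the low `width` bits of value left by `amount`."""
    amount %= width
    mask = (1 << width) - 1
    value &= mask
    return ((value << amount) | (value >> (width - amount))) & mask


def rehash(hash_fn, rotation, width=26):
    """Wrap hash_fn so the address bits are rotated before hashing."""
    return lambda addr: hash_fn(rotate_left(addr, rotation, width))
```

With a 4-bit bit-selection hash `lambda a: a & 0xF`, the addresses `0x10` and `0x20` collide at index 0; after `rehash(..., rotation=4, width=8)` they map to 1 and 2, so a persistent false conflict between them becomes transient.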

Privatization interface 93
Privatization function | Usage
shared_malloc(size), private_malloc(size) | Dynamic allocation of shared and private memory objects
shared_free(ptr), private_free(ptr) | Frees up memory allocated by shared or private allocators
privatize_barrier(num_threads, ptr, size), publicize_barrier(num_threads, ptr, size) | Program threads come to a common point to privatize or publicize an object. Must be used outside of transactions

Dynamic privatization Dynamically switch from private to shared, and vice versa If transitioning from private -> shared, safe to mark page as shared (at cost of performance) If transitioning from shared -> private, default policy is to disallow if there exists other shared objects on same page Otherwise, trap to user software and let programmer call shared_free(), followed by private_malloc() on object 94

Bit-field overlaps hurt PBX 95 Return

Removing stack refs doesn’t help 96 Return

Entropy of commercial workloads 97 Return

Type of Hash Functions In real programs, addresses neither independent nor uniformly distributed (key assumptions to derive $P_{FP}(n)$) But can generate hash values that are almost uniformly distributed and uncorrelated with good (universal/almost universal) hash functions Hash functions considered: 98 Bit-selection (inexpensive, low quality) H3 [Carter, CSS79] (moderate, higher quality) Return

Notary Related Work Hash functions for memory hierarchy designs Used to reduce cache, bank, or row-buffer contention XOR hashes [Gonzales ‘97, Seznec ‘93, Zhang ‘00] Polynomial hashes [Rau ‘91] Alternatives to XOR hashing [Kharbutli ‘04,’05] Prime modulo & odd-multiplier displacement hashing Reduce probability of bad hash values Can require modifying existing hardware (e.g., additional TLB bits or adders) Detailed analysis of XOR hashes [Vandierendonck ‘05] Linear-algebra based analysis Replacing & swapping columns can minimize the fan-in and maximum fan-out of XOR gates Previous uses of entropy Overheads of addressing memory in ISA [Hammerstrom ‘77] Base Register Cache to reduce size of transferred address [Park ‘90] Mechanisms which compact & expand address & data values [Citron ‘95] Low-power TLB design [Ballapuram ’06] 99

Notary Related Work Cont. Software-only privatization Four pointer types for STMs [Scott ‘07] exclude & only keywords for transactional OpenMP [Milovanovic ‘07] private & shared keywords in OpenTM [Baek ‘07] protect() and unprotect() for transactional C# [Abadi ‘08] Hardware support for privatization Virtual Memory Filter [Matveev ‘07] More general than Notary’s privatization Programmer declares memory regions to be transactional 100 Return

TMProf Related Work Profiling transaction characteristics & implementation-specific features [Hammond ‘04] Ex: Read- and write-set sizes, nesting depth, commit bandwidth Disadvantage: Does not profile common, high-level HTM overheads Transactional Application Profiling Environment (TAPE) [Chafi ‘05] Profiles TCC HTM & summarizes problem areas back to source code lines Disadvantage: Tied to TCC-specific overheads 101

TMProf Related Work Cont. Performance Pathologies [Bobba ‘07] Identified several pathologies affecting performance of eager & lazy HTM systems Disadvantage: Pathologies identified offline using detailed traces Additional profiling [Perfumo ’08, Porter ‘08] Metrics like read-to-write ratio, abort rate Statically predicting TM performance using Syncchar Can be added to TMProf implementations 102 Return

Results from Conflict Resolutions 103 Trends: 1) Timestamp & Hybrid better than Base 2) Hybrid sometimes better than Timestamp

Hybrid Better than Timestamp 104 Fewer Stalls & Wasted Trans Fewer Wasted Trans

Results from Conflict Resolutions 105 Trends: 3) Timestamp can be worse than Base 4) Hybrid can be worse than Base

Timestamp Worse than Base 106 More Wasted Trans Fewer WW Req. Older Stalls (i.e., more younger thread aborts)

Hybrid Worse than Base 107 More RW Req. younger stalls Leads to load imbalance (more Barrier cycles)

Stall Breakdown 108 Prediction serializes read requests from older transactions

Stall Breakdown 109 Prediction serializes write requests (perhaps unnecessarily)

110 Locks are Hard
// WITH LOCKS
void move(T s, T d, Obj key){
  LOCK(s);
  LOCK(d);
  tmp = s.remove(key);
  d.insert(key, tmp);
  UNLOCK(d);
  UNLOCK(s);
}
DEADLOCK! Thread 0: move(a, b, key1); Thread 1: move(b, a, key2);
Moreover Coarse-grain locking limits concurrency Fine-grain locking difficult

Motivation 111

Background on Aborts Read-dependent & Write-dependent Read-dependent – conflict is RW only Write-dependent – conflicts include WR or WW HTM system may optimize for read-dependent aborts E.g., Eager conflict detection can release read-isolation early on aborts (no nesting) Does not stall requestor 112

Notary Future Work Dynamic entropy calculation: How to adapt PBX hashing to entropy changes over time? Dynamic privatization characteristics: How common is it for objects to change sharing status? 113 Related Work

Sun’s Rock HTM [Dice et al., ASPLOS’09] Best-effort HTM – 1st general-purpose processor with HTM support Profiling targets why transactions fail TMProf profiles higher-level categories, including successes Aborts update Checkpoint Status Register (CPS) Version R2 includes more detailed breakdowns of CPS than R1 Different reasons for failure given same CPS status in R1 Profiling in common with ExtTMProf: read- and write-set sizes of aborted transactions 114