Signatures in Transactional Memory Systems Dissertation Defense Luke Yen 1/29/2009.


1 Signatures in Transactional Memory Systems Dissertation Defense Luke Yen 1/29/2009

2 Key Contributions. Trend: Transactional memory (TM) is an emerging parallel programming paradigm: programmer-annotated transactions that execute atomically (all or nothing). Challenge #1: Hardware TM (HTM) systems may restrict transactions or incur overheads on common events (e.g., cache evictions). Contribution: LogTM-SE HTM: simple hardware that interacts with the operating system to virtualize transactions. No overhead on cache evictions.

3 Key Contributions Cont. Challenge #2: (1) H3 signatures have high area & power overheads & (2) thread-private references cause false conflicts. Contribution: Notary: (1) Page-Block-XOR, which performs similarly to H3 but with lower overheads; (2) stack & heap-based privatization. Challenge #3: It is difficult to understand HTM system performance. Contribution: TMProf: lightweight hardware performance counters that help HTM designers & TM programmers. Challenge #4: Signatures suffer from false conflicts. Contribution: Six hardware/software signature extensions to mitigate false conflicts.

4 Outline: Introduction and Background; Transactional Memory background; LogTM-SE [HPCA 2007]; Notary [MICRO 2008]; TMProf (Submitted for publication); Conclusion. Slide annotations: Focus of presentation; * Skip “Extensions to Signatures”; Contribution #1, Contribution #2, Contribution #3, Contribution #4.

5 Transactional Memory (TM). Locks do not compose and can lead to deadlocks. With TM, the programmer says “I want this atomic” and the TM system “makes it so”. Focus on Hardware TM (HTM) implementations: fast; leverage cache coherence & speculation; but hardware is finite & should be policy-free. Example: void move(T s, T d, Obj key){ atomic { tmp = s.remove(key); d.insert(key, tmp); } }

6 LogTM Signature Edition (LogTM-SE) at 50,000 feet. HTMs are fast. Version management (for transaction commits & aborts): HW handles old/new versions (e.g., write buffer). Conflict detection (commit only non-conflicting transactions): HW handles conflict detection (R/W bits & coherence). But: closely coupled to L1 cache, on critical paths & hard for SW to save/restore. Our approach: decoupled, simple HW, SW control. LogTM-SE HW: LogTM’s log + signatures (from Illinois Bulk). SW: unbounded nesting, thread switching, & paging.

7 Signature Background. Signatures are used to summarize and detect conflicts with a transaction’s read- and write-sets. Inspired by the Bulk system [Ceze, ISCA’06]. Imprecise; can be implemented with Bloom filters. Can have false positives, but never false negatives. Also proposed for non-TM purposes (e.g., SC violation detection, atomicity violation detection, race recording). Ex: Use k Bloom filters of size m/k, with independent hash functions.
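The "k Bloom filters of size m/k" organization can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `Signature`, its methods, and the salted BLAKE2b hash are stand-ins chosen for the sketch, not the hardware hashes discussed in the talk.

```python
import hashlib


class Signature:
    """Parallel Bloom filter: k bit arrays of m/k bits, one hash per array."""

    def __init__(self, m_bits=2048, k=2):
        assert m_bits % k == 0, "m must divide evenly into k banks"
        self.k = k
        self.bank_bits = m_bits // k
        self.banks = [0] * k  # each bank is a Python int used as a bit array

    def _index(self, bank, block_addr):
        # Assumption: a salted software hash stands in for a hardware hash.
        digest = hashlib.blake2b(
            block_addr.to_bytes(8, "little"),
            digest_size=8,
            salt=bytes([bank]),
        ).digest()
        return int.from_bytes(digest, "little") % self.bank_bits

    def insert(self, block_addr):
        for bank in range(self.k):
            self.banks[bank] |= 1 << self._index(bank, block_addr)

    def may_contain(self, block_addr):
        # True means "possibly present"; False means "definitely absent".
        return all(
            (self.banks[bank] >> self._index(bank, block_addr)) & 1
            for bank in range(self.k)
        )
```

An address that was inserted always tests positive, which is the "never false negatives" property; an address that was not inserted may still test positive if its bits happen to be set by others.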

8 8 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

9 Notary Executive Summary. Tackle 2 problems with hardware signatures. Problem 1: The best signature hashing (i.e., H3) has high area & power overheads. Solution 1: Use entropy analysis to guide lower-cost hashing (Page-Block-XOR, PBX) that performs similarly to H3. Ex: 8x fewer gates: 160 gates for H3 vs 20 gates for PBX. Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs. Solution 2: Avoid inserting private stack addrs; propose a privatization interface for higher performance.

10 10 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

11 Signature hash functions. Which hash function is best? [Sanchez, Yen, MICRO’07] Bit-selection: hash simply decodes some number of input bits. H3: each bit of a hash value is an XOR of (on avg.) half of the input address bits. Result (LogTM-SE w/ 2kb signatures): H3 is better with >=2 hash functions. However, H3 uses many multi-level XOR trees. Can we improve this?

12 H3 implementation. Num XOR: [formula not preserved in the transcript]. Ex: 2kb signatures, k=2, c=10, 32-bit addr = 160 XOR gates per signature. Can we reduce the total gate count?

13 13 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

14 Entropy defined. Insight: Use the most random bits for hashing. Use entropy to measure bit randomness. Entropy $= -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$, where $p(x_i)$ = the probability of the occurrence of value $x_i$ and $N$ = number of sample values random variable $x$ can take on. Entropy = amount of information required on average to describe the outcome of variable $x$ (in bits). Ex: What is the best possible lossless compression? Entropy value of an n-bit field: min = 0 bits (constant value with probability 1); max = n bits (all bit patterns in the n-bit field equally probable); other cases fall in between.

15 Our measures of entropy. For our workloads, we care about: Q1: What is the best achievable entropy? Global entropy: upper bound on entropy of the address. Q2: How does entropy change within an address? Local entropy: entropy of a bit-field within the address. [Figure: address bits 31 to 6; global entropy covers the whole field, local entropy covers a window offset by NSkip.]
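As a rough illustration of the two measures, the sketch below computes Shannon entropy over a trace of addresses. The function names, the 6-bit block offset, and the 16-bit default window are assumptions made for this example; the talk does not show the analysis code.

```python
from collections import Counter
from math import log2


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def bit_field(addr, low, width):
    """Extract `width` bits of addr starting at bit position `low`."""
    return (addr >> low) & ((1 << width) - 1)


def global_entropy(addrs, block_offset_bits=6):
    # Entropy of the whole block address (all bits above the block offset).
    return entropy_bits([a >> block_offset_bits for a in addrs])


def local_entropy(addrs, nskip, width=16, block_offset_bits=6):
    # Entropy of a `width`-bit window that skips `nskip` bits above the offset.
    low = block_offset_bits + nskip
    return entropy_bits([bit_field(a, low, width) for a in addrs])
```

For a trace that touches 256 consecutive cache blocks exactly once, `local_entropy(trace, nskip=0)` is 8 bits, while a window that skips those 8 varying bits sees a constant field and reports 0 bits.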

16 Entropy results (workloads to be described later). Global entropy is at most 16 bits. The bit-window for local entropy is 16 bits wide (NSkip from 0-10). Smaller windows (<16b) may not reach the global entropy value. Larger windows (>16b) hide some fine-grain info.

17 Page-Block-XOR (PBX). Motivated by 3 findings: (1) Lower-order bits have the most entropy (follows from our entropy results). (2) XORing two bit-fields produces random hash values (from prior work on XOR hashing, e.g., data placement in caches, DRAM). (3) Bit-field overlaps can lead to higher false positives: correlation between the two bit-fields can reduce the range of hash values produced (worse for larger signatures).

18 PBX implementation. For 2kb signatures with 2 hash functions: 20 XOR gates for PBX vs 160 XOR gates for H3! The PPN and cache-index fields are not tied to system parameters: use entropy to find two non-overlapping bit-fields with high randomness.
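The idea of XORing two non-overlapping bit-fields can be written down directly. The sketch below is illustrative only: `pbx_hash`, its parameter names, and the default field positions (bits 6-15 and 16-25) are assumptions for the example; the slides say the actual fields are chosen from the entropy results.

```python
def pbx_hash(addr, low_a=6, low_b=16, width=10):
    """XOR two non-overlapping `width`-bit fields of addr into one index."""
    if abs(low_a - low_b) < width:
        raise ValueError("bit-fields overlap")
    mask = (1 << width) - 1
    field_a = (addr >> low_a) & mask  # e.g. low-order (cache-index-like) bits
    field_b = (addr >> low_b) & mask  # e.g. higher (page-number-like) bits
    # Each output bit is one two-input XOR, so a width-bit index needs
    # `width` XOR gates in this sketch.
    return field_a ^ field_b
```

With `width=10`, two such hash functions would need 20 two-input XORs, which lines up with the gate count quoted on the slide if each hash produces a 10-bit index (an inference from k=2 and c=10, not something the slide states).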

19 Summary thus far. Problem 1: H3 has high area & power overheads. Solution 1: Use entropy analysis to guide lower-cost PBX. Ex: 160 gates for H3 vs 20 gates for PBX. Problem 2: Spurious signature conflicts caused by signature bits set by private memory addrs. Solution 2: To be described.

20 20 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

21 Privatization Problem: False conflicts caused by thread-private addrs Avoid conflicts if addrs not inserted in thread’s signatures Two privatization solutions: (1) Remove private stack references from sigs. Very little work for programmer/compiler Benefits depend on fraction of stack addresses versus all transactional references (2) Language-level interface (e.g., private_malloc(), shared_malloc() ) Even higher performance boost WARNING: Incorrectly marking shared objects as private can lead to program errors! 21

22 Page-based implementation Each page is assigned a status, private or shared Invariant: Page is shared if any object is shared If stack is private, library marks stack pages as private If using privatization heap functions, mark heap pages accordingly 22

23 OS support OS allocates different physical page frames for shared and private pages Sets a per-frame bit in translation entry if shared Reduce number of page frames used by packing objects with same status together Signatures insert memory addresses of transactional references to shared pages Query page sharing bit in HW TLB & current transactional status 23
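The filtering rule on this slide (insert only transactional references to shared pages) can be sketched as follows. Everything named here is hypothetical: `SharingBits`, `record_reference`, the 4 KiB page size, and the 64-byte block size are assumptions for illustration, and a Python set stands in for the hardware signature.

```python
PAGE_BITS = 12   # assumption: 4 KiB pages
BLOCK_BITS = 6   # assumption: 64-byte cache blocks


class SharingBits:
    """Per-page sharing status, standing in for the bit kept in the TLB."""

    def __init__(self):
        self._private_pages = set()

    def mark_private(self, addr):
        self._private_pages.add(addr >> PAGE_BITS)

    def is_shared(self, addr):
        # Assumption: pages default to shared unless explicitly marked private.
        return (addr >> PAGE_BITS) not in self._private_pages


def record_reference(signature, sharing, addr, in_transaction):
    """Insert addr's block into the signature only if it can conflict."""
    if in_transaction and sharing.is_shared(addr):
        signature.add(addr >> BLOCK_BITS)
        return True
    return False
```

Defaulting to shared is the conservative choice in this sketch: a page wrongly treated as shared only costs performance, whereas the slides warn that wrongly marking shared data private can cause program errors.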

24 24 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

25 Methodology. Full-system simulation (GEMS). Transistor-level design for area & power of XOR gates. CACTI for Bloom filter bit array area & power. Linear scaling to 65nm or 90nm for area, original 400nm for power. Single-chip CMP: 16 single-threaded, in-order cores; 32kB, 4-way private L1 I & D; 8MB, 8-way shared L2 cache; MESI directory protocol. Signatures from 64b-64kb (8B-8kB) & “perfect”.

26 Workloads Micro-benchmarks SPLASH-2 apps Barnes & Raytrace – exert most signature pressure Stanford STAMP apps Vacation, Genome, Delaunay, Bayes, Labyrinth, Yada, Intruder DNS server BIND 26

27 PBX vs H3 area & power. Area & power overheads (2kb, k=4):
Type of overhead | Bloom filter bit array | H3 hash | PBX hash | H3 sig. | PBX sig. | % savings for PBX sig.
Area (mm^2) | 4.67e-3 | 1.35e-3 | 7.83e-5 | 6.02e-3 | 4.75e-3 | 21
Power (mW) | 1.80e2 | 1.04e1 | 1.02 | 1.90e2 | 1.81e2 | 4.7

28 PBX vs H3 execution time. PBX performs similarly to H3.

29 Privatization results summary Removing private stack references from signatures did not help Most addr references not to stack Most likely because running with SPARC ISA. Other ISAs (e.g., x86) likely have more benefits Privatization interface helps five workloads Remainder either does not have private heap structures or does not have high transactional duty cycle 29 Stack Results

30 Privatization interface results 30 Can improve execution time

31 31 Outline Introduction and Background Notary Signature Background Entropy & Page-Block-XOR Privatization Methodology & Results Conclusions TMProf Conclusion

32 Conclusions. Tackle 2 problems with signature designs: (1) Area and power overheads of H3 hashing (e.g., 160 XOR gates for H3, 20 for PBX). (2) False conflicts due to signature bits set by private memory references. Our solutions: (1) Use entropy analysis to guide the hashing function (PBX), a low-cost alternative that performs similarly to H3. (2) Prevent private stack references from entering signatures, and propose a privatization interface for heap allocations. Notary can be applied to non-TM uses: PBX hashing can directly transfer; privatization may transfer if addr filtering applies.

33 33 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

34 TMProf Executive Summary TM more parallelism than lock-based programs Complex thread interactions How can HTM designer understand HTM performance? How can TM programmer understand TM program performance? TMProf: Per-processor hardware performance counters to count cumulative event frequencies & overheads in HTM system 34

35 35 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

36 Critical-section Parallelism TM enables critical-section parallelism – more thread interleavings 36 Thread 0 Lock A Thread 1 Lock A Thread 0 xact_begin Thread 1 xact_begin With Locks With TM

37 Hard to Predict Program Performance TM programmers may not have mastered intricacies of HTM system Programs run faster on specific HTM Example: 37

38 Profiling with TMProf Allows HTM designers & TM programmers to understand HTM performance With TMProf: 38

39 39 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

40 Background on Conflicts Three types: RW, WR, and WW Analogous to WAR, RAW, and WAW dependencies in uniprocessors 40 Thread 0 Thread 1 xact_begin … LD A … xact_begin … ST A … xact_begin … ST B … xact_begin … LD B … xact_begin … ST C … xact_begin … ST C … RW WR WW
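The three conflict types can be expressed as a small classification function. This is a sketch under one assumption about naming: the first letter is the access already made by the transaction holding the block, and the second letter is the incoming request, matching the WAR/RAW/WAW analogy. The function name and argument layout are invented for the example.

```python
def conflict_types(read_set, write_set, addr, is_store):
    """Conflicts an incoming request raises against one transaction's sets."""
    found = set()
    if addr in write_set:
        # Earlier write, later store or load.
        found.add("WW" if is_store else "WR")
    if is_store and addr in read_set:
        # Earlier read, later store.
        found.add("RW")
    return found
```

A load that hits only another transaction's read-set yields the empty set, since two reads never conflict.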

41 Conflict Detection & Resolution Conflicts detected eagerly or lazily Eagerly – when requests occur Lazily – at transaction commit Conflict resolution Stall or abort on conflict Choose set of procs to take action 41

42 42 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

43 43 TMProf Per-processor HW counters measuring cumulative event frequencies and cumulative event overheads Two implementations: Base & Extended Base (BaseTMProf): Breaks down HTM execution cycles into common components Extended (ExtTMProf): Builds on BaseTMProf & adds HTM-specific transaction-level profiling

44 BaseTMProf & ExtTMProf BaseTMProf: Total cycles = stalls + aborts + wasted_trans + useful_trans + committing + nontrans + implementation specific Assume in-order procs, but can extend for out-of-order procs ExtTMProf: BaseTMProf profiling plus Size of aborted transactions Amount of transactional work after write-set prediction HTMs may add more detailed profiling in future 44 Details
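The BaseTMProf identity (total cycles as a sum of components) maps naturally onto a record of counters. The sketch below is a software model for illustration; the class and method names are assumptions, and the real counters are per-processor hardware registers.

```python
from dataclasses import dataclass, asdict


@dataclass
class BaseTMProfCounters:
    """One processor's cumulative cycle counters, one field per component."""
    stalls: int = 0
    aborts: int = 0
    wasted_trans: int = 0
    useful_trans: int = 0
    committing: int = 0
    nontrans: int = 0
    implementation_specific: int = 0

    def total_cycles(self):
        # Total cycles is the sum of every component counter.
        return sum(asdict(self).values())

    def breakdown(self):
        """Fraction of total cycles spent in each component."""
        total = self.total_cycles()
        if total == 0:
            return {}
        return {name: cycles / total for name, cycles in asdict(self).items()}
```

A breakdown like this is what the later result slides plot per workload, for example comparing the `stalls` and `wasted_trans` fractions across conflict resolution schemes.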

45 45 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

46 46 Two Case Studies TMProf profiling two HTMs: LogTM-SE (eager conflict detection & version management, EE) Approximation of Stanford’s TCC (lazy conflict detection & version management, LL) Examine key parameters of eager & lazy conflict detection Idealize version management Same system parameters as Notary 16-processor CMP w/ in-order, single-issue processor cores Perfect signatures Same workloads

47 EE: Different Conflict Resolutions Three different conflict resolutions: Base, Timestamp, Hybrid All use timestamps Base: Requestor stalls until possible deadlock Timestamp: Older requestors always abort younger transactions. Younger requestors stalled by older transactions. Hybrid: Base, except RW from older writer aborts younger reader 47
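The three policies can be compared side by side as a decision function. This is a hedged sketch: the function name, the action strings, and the `possible_deadlock` flag are invented for the example, and the choice that the requester aborts when Base detects a possible deadlock is an assumption, since the slide only says the requester stalls "until possible deadlock".

```python
def resolve(policy, requester_ts, responder_ts, conflict, possible_deadlock=False):
    """Pick an action for one conflict; a smaller timestamp means older."""
    requester_older = requester_ts < responder_ts
    if policy == "timestamp":
        # Older requesters abort younger transactions; younger ones stall.
        return "abort_responder" if requester_older else "stall_requester"
    if policy == "hybrid" and conflict == "RW" and requester_older:
        # RW from an older writer aborts the younger reader.
        return "abort_responder"
    if policy in ("base", "hybrid"):
        # Assumption: on a possible deadlock the requester is the one to abort.
        return "abort_requester" if possible_deadlock else "stall_requester"
    raise ValueError(f"unknown policy: {policy}")
```

Hybrid differs from Base only in the single RW case, which is why the two share the final branch.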

48 EE: Write-set Prediction. Avoid aborts from a load-then-store pattern within a thread. Predict & serialize on these conflicts. [Figure: two timelines for transactions T0, T1, T2. The first (GetS … GetX … GetS … GetS …) ends in ABORT; the second (GetX … GetX … GetS … GetS …) ends in STALL.]

49 Results from Conflict Resolutions 49 Trends: 1) Timestamp & Hybrid better than Base

50 Timestamp & Hybrid Better than Base 50 Fewer total stalls & eliminates all RW Requestor older stalls

51 EE Summary with BaseTMProf BaseTMProf helps HTM designer understand performance of conflict resolution schemes Lightweight, fast, dynamic profiling Can be implemented in prototype HTM systems 51

52 Write-set Prediction Results 52 Focus on workloads that degrade from prediction Prediction increases Stall cycles

53 ExtTMProf’s Transaction-level Profiling 53 Prediction helps short transactions Prediction hurts large transactions – reduces concurrency Predictions Help Predictions Hurt

54 EE Summary with ExtTMProf Helps HTM designers understand why write-set prediction degrades (or improves) performance Offline analysis (e.g., traces) unable to determine performance implications of dynamic conflicts How can TMProf help analyze LL systems? 54

55 LL: Parallel Versus Serial Commit Serial = Only one committer at a time Parallel = Multiple concurrent committers Faster than Serial We idealize its implementation 55

56 LL: More Prefetching than EE Eager conflict detection: Progress bounded by location of conflicts Early conflicts  abort transactions early (little prefetching) Late conflicts  abort transactions late (lots of prefetching) Lazy conflict detection: Committers finish transaction before detecting conflicts High probability for lots of prefetching 56

57 Parallel Commit Results 57 Parallel commit removes commit token bottleneck

58 Conflicts with Parallel Commit 58 All conflicts either RW or WR – no WW conflicts

59 LL Summary with BaseTMProf BaseTMProf clearly shows why parallel commit helps Stall breakdown shows mostly WR conflicts BaseTMProf helps HTM designers decide whether to implement parallel commit Parallel commit more complex than serial commit 59

60 Prefetching Results 60 Useful Trans should be similar for EE & LL, but LL incurs fewer cycles Why?

61 ExtTMProf’s Transaction-level Profiling 61 LL’s aborted transactions prefetch farther than EE

62 LL Summary with ExtTMProf Explains why workloads execute faster on LL than on EE May influence HTM design decision to implement LL rather than EE Helps TM programmer understand why programs run faster on some HTMs 62

63 63 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

64 64 Software Rollback Better than Hardware Rollback Software rollback reduces Stalls & Wasted Trans May reduce contention in HTM?

65 Hardware for Critical-path Profiling Counter-based profiling is not sufficient Multi-threaded programs exhibit variability: Different dynamic code paths Inter-thread dependencies Memory latencies Factors change critical-path – longest control flow that determines execution time Hardware critical-path profiling can aid in understanding performance Faster than offline, software analyses 65

66 66 Outline Introduction and Background Notary TMProf Motivation Background TMProf Two Case Studies Future Directions for TMProf Conclusions Conclusion

67 Conclusions TMProf – lightweight per-processor hardware counters for understanding HTM performance Cumulative event frequencies & overheads Two implementations: Base & Extended Two case studies: LogTM-SE & Approximation of TCC Future TMProf might add hardware support for critical-path profiling 67 Related Work

68 68 Outline Introduction and Background Notary TMProf Conclusion

69 Conclusions. Challenge #1: Hardware TM (HTM) systems may restrict transactions or incur overheads on common events. Contribution: LogTM-SE HTM. Challenge #2: (1) H3 signatures have high area & power overheads & (2) thread-private references cause false conflicts. Contribution: Notary.

70 Conclusions Cont. Challenge #3: Difficult to understand HTM system performance. Contribution: TMProf Challenge #4: Signatures suffer from false conflicts. Contribution: Six hardware/software extensions to signatures 70

71 Other Research & Contributions OS Support for Virtualizing Transactional Memory [Swift et al. TRANSACT ‘08] Implementing Signatures for Transactional Memory [Sanchez et al. MICRO ‘07] Performance Pathologies in Hardware Transactional Memory [Bobba et al., ISCA ’07 & Top Picks ‘08] Supporting Nested Transactional Memory in LogTM [Moravan et al., ASPLOS ‘06] GEMS 2.X development & support SMT in Opal, LogTM-SE in Ruby 71

72 Thank You! Questions? 72

73 73 Backup Slides

74 LogTM-SE Processor Hardware. Segmented log, like LogTM. Track R/W sets with R/W signatures: over-approximate R/W sets; tracks physical addresses. Summary signature used for virtualization. Conflict detection by coherence protocol. Check signatures on every memory access for SMT. [Figure: SMT thread context with Registers, Register Checkpoint, LogFrame, LogPtr, TMcount, Read and Write signatures, and SummaryRead and SummaryWrite signatures; the data caches (Tag, Data) hold NO TM STATE.]

75 75 Thread Switching Support Why? Support long-running transactions What? Conflict Detection for descheduled transactions How? Summary Read / Write signatures: If thread t of process P is scheduled to use an active signature, the corresponding summary signature holds the union of the saved signatures from all descheduled threads from process P. Updated using TLB-shootdown-like mechanism

76 Handling Thread Switching. [Animation frame: processors P1-P4 each hold read (R) and write (W) signatures plus a summary signature; threads T1, T2, T3 are running; the OS holds a summary for the process. All summary signatures are 00000000.]

77 Handling Thread Switching. [Animation frame: T1 is descheduled; its signatures (W 01001000, R 01010010) are saved by the OS. Summary signatures are still 00000000.]

78 Handling Thread Switching. [Animation frame: after the deschedule, summary signatures are updated to W 01001000, R 01010010.]

79 Handling Thread Switching. [Animation frame: T1 is no longer running on its original processor, whose signatures are cleared to 00000000; summary signatures holding W 01001000, R 01010010 continue to cover T1’s transaction.]

80 80 Thread Switching Support Summary Summary Read / Write signatures Summarizes descheduled threads with active transactions One OS structure per process Check summary signature on every memory access Updated on transaction deschedule Similar to TLB shootdown Coherence
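Because a summary signature is the union of the saved signatures of descheduled threads, it reduces to a bitwise OR when signatures are bit vectors. The sketch below models signatures as Python integers; the function names are assumptions for illustration.

```python
def summary_signature(saved_signatures):
    """Union (bitwise OR) of the saved signatures of descheduled threads."""
    summary = 0
    for sig in saved_signatures:
        summary |= sig
    return summary


def hits_summary(summary, bit_index):
    """True if the bit an address hashes to is set in the summary."""
    return (summary >> bit_index) & 1 == 1
```

Using the bit patterns from the animation, a single descheduled thread with write signature 01001000 gives a summary of 01001000; descheduling a second thread can only add bits, never clear them.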

81 81 Paging Support Summary Problem: Changing page frames Need to maintain isolation on transactional blocks Solution: On Page-Out: Save Virtual -> Physical mapping On Page-In: If different page frame, update signatures with physical address of transactional blocks in new page frame.

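82 Paging Support Animation. [Animation: virtual page VP1 maps to physical page PP1 (blocks A, B, C, D) before page-out and to PP2 (blocks A’, B’, C’, D’) after page-in. Each block of PP1 is tested against the read and write signatures (A? Y, B? Y, C?, D?); for each hit, the matching block of PP2 (A’, B’) is inserted.] Read & Write signatures isolate memory blocks from PP1 & PP2.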
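The page-in step amounts to testing each block of the old frame and inserting the matching block of the new frame on a hit. In the sketch below a Python set of block numbers stands in for a signature, and the function name and the 64-blocks-per-page default are assumptions for illustration.

```python
def update_on_page_in(signature, old_frame, new_frame, blocks_per_page=64):
    """Re-insert transactional blocks under their new physical addresses."""
    if old_frame == new_frame:
        return  # same frame: existing signature bits are still valid
    for offset in range(blocks_per_page):
        old_block = old_frame * blocks_per_page + offset
        if old_block in signature:
            signature.add(new_frame * blocks_per_page + offset)
```

The old entries are left in place, so after the update the signature covers blocks in both frames, as the slide notes.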

83 BaseTMProf for LogTM-SE (1 of 3) Differentiate between read dependent & write dependent aborts Meta-data (e.g., 3 bits for conflict types + 1 bit indicating if responder older) on NACK messages Per-processor tables to track conflicts with other procs RW conflict only = read-dependent Stall cycles = cycle conflict detected – cycle request sent to memory subsystem Abort cycles = cycle abort completes – cycle abort initiates 83

84 BaseTMProf for LogTM-SE (2 of 3) Wasted_trans cycles = cycle abort initiates – cycle transaction begins Store transaction begin cycle in separate register Commit cycles = cycle commit completes – cycle commit initiates No commit actions = no commit cycles Track cycle of start of commit action in separate register 84

85 BaseTMProf for LogTM-SE (3 of 3) Nontrans cycles = cycle of transaction begin – cycle after last transaction commit Track cycle of last transaction commit in separate register Backoff cycles = cycle retry transaction – cycle abort completes Barrier cycles = cycle exit barrier – cycle enter barrier 85

86 ExtTMProf for LogTM-SE Work remaining after write-set prediction: Store transaction size (read+write-set sizes) at each prediction - lazily copy to software or use many registers At commit, subtract saved transaction size from final transaction size at commit Differences processed by software to produce histograms Size of aborted transactions: Store read- and write-set sizes of aborted transaction in separate registers 86

87 BaseTMProf for TCC. Stall cycles recorded at transaction commit: when write-sets are broadcast or a commit request is sent to the directory. No breakdown of read-dependent & write-dependent abort cycles, since aborts do not stall the winner (abortee). Committing cycles = cycle commit phase completes – cycle commit phase begins: between the cycle all stores are flushed from the write buffer & broadcasting the write-set.

88 ExtTMProf for TCC Size of aborted transactions: Track read- and write-set sizes of aborted transactions Just like for LogTM-SE 88 Return

89 Extensions to Signatures Overview Six extensions to reduce false conflicts Static Transaction Identifier (XID) Independence Object Identifiers (IDs) Spatial locality with static signatures Spatial locality with dynamic signatures Coarse-fine hashing Dynamic re-hashing Evaluate using ideal hardware & software 89 Best performance

90 XID Independence, Object IDs XID Independence: Programmer declares set of static XIDs that conflict with each other Information passed to hardware for conflict detection Signature check only for XIDs that possibly conflict Object IDs: High-level objects accessed by each transaction E.g., Trees, hash buckets, nodes Programmer declares set of objects accessed in transaction Designed to handle dynamic, fine-grain conflicts 90

91 Optimizing for Spatial locality Spatial locality exists in many programs High probability of accessing memory addresses neighboring current address in future Spatially local addresses may form a set that sets only a single signature bit Static signatures: Signature hashes operate on fixed, larger granularity (i.e., greater than cache-block) Granularity may not be suitable for all workloads Dynamic signatures: A set of signatures that hash on different granularities & set of hit counters Dynamically select which signature is “best” to use 91

92 Coarse-fine hashing, Dynamic re-hashing Coarse-fine hashing: Split addresses into two regions: Coarse & Fine Coarse – High-order address bits (e.g., page number) Fine – Low-order address bits (e.g., multiple cache blocks) Assign signature hashes to operate on Coarse & Fine bits Dynamic re-hashing: False conflicts can be caused by bad luck Dynamically alter hash functions – rotate input address bits before hashing Transform persistent false conflicts into transient false conflicts 92
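Dynamic re-hashing by rotating the input bits can be illustrated with a toy bit-selection hash. The names, the 8-bit address width, and the 4-bit index are assumptions chosen to keep the example small; they are not parameters from the talk.

```python
def rotate_left(value, amount, width):
    """Rotate the low `width` bits of value left by `amount` positions."""
    mask = (1 << width) - 1
    value &= mask
    amount %= width
    return ((value << amount) | (value >> (width - amount))) & mask


def bit_select_hash(value, index_bits):
    """Toy hash: keep only the low `index_bits` bits."""
    return value & ((1 << index_bits) - 1)


def rehashed_index(addr, rotation, width=8, index_bits=4):
    # Rotating the input before hashing changes which addresses collide.
    return bit_select_hash(rotate_left(addr, rotation, width), index_bits)
```

For example, `0b00010011` and `0b00100011` map to the same index with no rotation but to different indices after a rotation of 4, which is how a persistent false conflict can become a transient one.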

93 Privatization interface
Privatization function | Usage
shared_malloc(size), private_malloc(size) | Dynamic allocation of shared and private memory objects
shared_free(ptr), private_free(ptr) | Frees up memory allocated by shared or private allocators
privatize_barrier(num_threads, ptr, size), publicize_barrier(num_threads, ptr, size) | Program threads come to a common point to privatize or publicize an object. Must be used outside of transactions

94 Dynamic privatization Dynamically switch from private to shared, and vice versa If transitioning from private -> shared, safe to mark page as shared (at cost of performance) If transitioning from shared -> private, default policy is to disallow if there exists other shared objects on same page Otherwise, trap to user software and let programmer call shared_free(), followed by private_malloc() on object 94

95 Bit-field overlaps hurt PBX 95 Return

96 Removing stack refs doesn’t help 96 Return

97 Entropy of commercial workloads 97 Return

98 Type of Hash Functions. In real programs, addresses are neither independent nor uniformly distributed (key assumptions to derive $P_{FP}(n)$). But good (universal/almost universal) hash functions can generate hash values that are almost uniformly distributed and uncorrelated. Hash functions considered: bit-selection (inexpensive, low quality); H3 [Carter, CSS79] (moderate, higher quality).

99 Notary Related Work Hash functions for memory hierarchy designs Used to reduce cache, bank, or row-buffer contention XOR hashes [Gonzales ‘97, Seznec ‘93, Zhang ‘00] Polynomial hashes [Rau ‘91] Alternatives to XOR hashing [Kharbutli ‘04,’05] Prime modulo & odd-multiplier displacement hashing Reduce probability of bad hash values Can require modifying existing hardware (e.g., additional TLB bits or adders) Detailed analysis of XOR hashes [Vandierendonck ‘05] Linear-algebra based analysis Replacing & swapping columns can minimize the fan-in and maximum fan-out of XOR gates Previous uses of entropy Overheads of addressing memory in ISA [Hammerstrom ‘77] Base Register Cache to reduce size of transferred address [Park ‘90] Mechanisms which compact & expand address & data values [Citron ‘95] Low-power TLB design [Ballapuram ’06] 99

100 Notary Related Work Cont. Software-only privatization Four pointer types for STMs [Scott ‘07] exclude & only keywords for transactional OpenMP [Milovanovic ‘07] private & shared keywords in OpenTM [Baek ‘07] protect() and unprotect() for transactional C# [Abadi ‘08] Hardware support for privatization Virtual Memory Filter [Matveev ‘07] More general than Notary’s privatization Programmer declares memory regions to be transactional 100 Return

101 TMProf Related Work Profiling transaction characteristics & implementation-specific features [Hammond ‘04] Ex: Read- and write-set sizes, nesting depth, commit bandwidth Disadvantage: Does not profile common, high-level HTM overheads Transactional Application Profiling Environment (TAPE) [Chafi ‘05] Profiles TCC HTM & summarizes problem areas back to source code lines Disadvantage: Tied to TCC-specific overheads 101

102 TMProf Related Work Cont. Performance Pathologies [Bobba ‘07] Identified several pathologies affecting performance of eager & lazy HTM systems Disadvantage: Pathologies identified offline using detailed traces Additional profiling [Perfumo ’08, Porter ‘08] Metrics like read-to-write ratio, abort rate Statically predicting TM performance using Syncchar Can be added to TMProf implementations 102 Return

103 Results from Conflict Resolutions 103 Trends: 1) Timestamp & Hybrid better than Base 2) Hybrid sometimes better than Timestamp

104 Hybrid Better than Timestamp 104 Fewer Stalls & Wasted Trans Fewer Wasted Trans

105 Results from Conflict Resolutions 105 Trends: 3) Timestamp can be worse than Base 4) Hybrid can be worse than Base

106 Timestamp Worse than Base 106 More Wasted Trans Fewer WW Req. Older Stalls (i.e., more younger thread aborts)

107 Hybrid Worse than Base 107 More RW Req. younger stalls Leads to load imbalance (more Barrier cycles)

108 Stall Breakdown 108 Prediction serializes read requests from older transactions

109 Stall Breakdown 109 Prediction serializes write requests (perhaps unnecessarily)

110 Locks are Hard
// WITH LOCKS
void move(T s, T d, Obj key){
  LOCK(s);
  LOCK(d);
  tmp = s.remove(key);
  d.insert(key, tmp);
  UNLOCK(d);
  UNLOCK(s);
}
Thread 0: move(a, b, key1);
Thread 1: move(b, a, key2);
DEADLOCK!
Moreover: coarse-grain locking limits concurrency; fine-grain locking is difficult.

111 Motivation 111

112 Background on Aborts Read-dependent & Write-dependent Read-dependent – conflict is RW only Write-dependent – conflicts include WR or WW HTM system may optimize for read-dependent aborts E.g., Eager conflict detection can release read-isolation early on aborts (no nesting) Does not stall requestor 112

113 Notary Future Work Dynamic entropy calculation: How to adapt PBX hashing to entropy changes over time? Dynamic privatization characteristics: How common is it for objects to change sharing status? 113 Related Work

114 Sun’s Rock HTM [Dice et al., ASPLOS’09]. Best-effort HTM: 1st general-purpose processor with HTM support. Profiling targets why transactions fail; TMProf profiles higher-level categories, including successes. Aborts update the Checkpoint Status Register (CPS). Version R2 includes more detailed breakdowns of CPS than R1: different reasons for failure were given the same CPS status in R1. Profiling in common with ExtTMProf: read- and write-set sizes of aborted transactions.

